Optimize MySQL OR Conditions: Stop Using NOT EXISTS? Unveiling the Optimization Secrets of LEFT JOIN Anti-Join | SQLFlash

I. Preface

In complex business query scenarios, the need for anti-joins—specifically, “finding records that lack associated entries”—is very common. The traditional LEFT JOIN + HAVING COUNT = 0 approach, while logically straightforward, often generates substantial intermediate results and incurs grouping and aggregation overhead in large-scale data scenarios, leading to performance bottlenecks. This solution is based on a real-world business case. It utilizes a constructed test dataset of approximately 200,000 records to compare the execution plans and performance metrics of two anti-join implementations: LEFT JOIN versus NOT EXISTS. The objective is to validate the significant effectiveness of anti-join optimization techniques in a VIP customer screening scenario and to clarify the applicable boundaries and key implementation points for this optimization.

II. Test Environment Setup

1. Table Structure Design

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
-- Customer master table (100k records)
CREATE TABLE customers (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL,
    vip_level VARCHAR(20) NOT NULL,
    total_spent DECIMAL(10,2) NOT NULL DEFAULT 0.0,
    INDEX idx_vip_spent (vip_level, total_spent)
) ENGINE=InnoDB;

-- Order detail table (100k records)
CREATE TABLE order_items (
    order_item_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    product_category VARCHAR(50) NOT NULL,
    INDEX idx_customer_category (customer_id, product_category)
) ENGINE=InnoDB;

2. Test Data Generation

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- Generate 100,000 customer records
SET cte_max_recursion_depth = 100000;
INSERT INTO customers (customer_name, vip_level, total_spent)
WITH RECURSIVE seq AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
    CONCAT('Customer_', n) AS customer_name,
    CASE
        WHEN n % 3 = 0 THEN 'gold'
        WHEN n % 3 = 1 THEN 'platinum'
        ELSE 'silver'
    END AS vip_level,
    ROUND(3000 + RAND(n) * 7000, 2) AS total_spent
FROM seq;

-- Generate 100,000 order detail records
INSERT INTO order_items (customer_id, product_category)
WITH RECURSIVE seq AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
    FLOOR(1 + RAND(n) * 100000) AS customer_id,
    CASE
        WHEN n % 4 = 0 THEN 'Electronics'
        WHEN n % 4 = 1 THEN 'Clothing'
        WHEN n % 4 = 2 THEN 'Books'
        ELSE 'Food'
    END AS product_category
FROM seq;

III. SQL Optimization

1. Original SQL

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SELECT 
    c.customer_id,
    c.customer_name,
    c.vip_level,
    c.total_spent
FROM customers c
LEFT JOIN order_items oi 
    ON c.customer_id = oi.customer_id 
    AND oi.product_category = 'Electronics'
WHERE c.vip_level IN ('gold', 'platinum')
  AND c.total_spent > 5000
GROUP BY c.customer_id, c.customer_name, c.vip_level, c.total_spent
HAVING COUNT(oi.customer_id) = 0;

2. Optimized SQL

We attempted to rewrite the SQL using SQLFlash:

SQLFlash Demo

The rewritten SQL obtained is as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SELECT 
    c.customer_id,
    c.customer_name,
    c.vip_level,
    c.total_spent
FROM customers c
WHERE c.vip_level IN ('gold', 'platinum')
  AND c.total_spent > 5000
  AND NOT EXISTS (
      SELECT 1
      FROM order_items oi
      WHERE oi.customer_id = c.customer_id
        AND oi.product_category = 'Electronics'
  );

View detailed report: https://sqlflash.ai/app/sqlResult?shareid=0cd99623-0c25-464d-bb9b-1f5c976057e5

IV. Performance Analysis

SQLFlash Analysis

According to the analysis provided by SQLFlash, this rewrite fundamentally changes the query execution strategy by converting the anti-join pattern of LEFT JOIN + HAVING COUNT = 0 into a NOT EXISTS subquery. The execution flow before the rewrite required first performing a LEFT JOIN, generating 244,602 rows of intermediate results, then performing GROUP BY aggregation via a temporary table, and finally filtering with HAVING. After the rewrite, the MySQL optimizer materializes the subquery into a temporary table (executed only once, cost 10,084.25). Subsequently, for each eligible customer, it quickly determines whether they exist in the materialized table via hash lookups, avoiding extensive intermediate data processing and grouping overhead. The execution cost was reduced from 50,261.35 to 35,969.60, a 28.43% reduction, with actual execution time improving by 52.14%.

Original Query Plan

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c LEFT JOIN order_items oi ON c.customer_id = oi.customer_id 
AND oi.product_category = 'Electronics' WHERE c.vip_level IN ('gold', 'platinum')
AND c.total_spent > 5000 GROUP BY c.customer_id, c.customer_name, c.vip_level, 
c.total_spent HAVING COUNT(oi.customer_id) = 0;

+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys          | key                   | key_len | ref                    | rows  | filtered | Extra                        |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
|  1 | SIMPLE      | c     | NULL       | ALL  | idx_vip_spent          | NULL                  | NULL    | NULL                   | 99949 |    58.24 | Using where; Using temporary |
|  1 | SIMPLE      | oi    | NULL       | ref  | idx_customer_category  | idx_customer_category | 206     | c.customer_id,const    |     4 |   100.00 | Using index                  |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+

Query Cost: 50,261.35
Execution Time: Average 0.3367 seconds (Minimum 0.2646 seconds, Maximum 0.3731 seconds)
Result Rows: 10,000 rows

Optimized Query Plan

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c WHERE c.vip_level IN ('gold', 'platinum') AND c.total_spent > 5000
AND NOT EXISTS (SELECT 1 FROM order_items oi WHERE oi.customer_id = c.customer_id 
AND oi.product_category = 'Electronics');

+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
| id | select_type    | table       | partitions | type   | possible_keys          | key                   | key_len | ref               | rows  | filtered | Extra                        |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
|  1 | SIMPLE         | c           | NULL       | ALL    | idx_vip_spent          | NULL                  | NULL    | NULL              | 99949 |    58.24 | Using where                  |
|  1 | SIMPLE         | <subquery2> | NULL       | eq_ref | <auto_distinct_key>    | <auto_distinct_key>   | 5       | c.customer_id     |     1 |   100.00 | Using where; Not exists      |
|  2 | MATERIALIZED   | oi          | NULL       | index  | idx_customer_category  | idx_customer_category | 206     | NULL              | 99800 |   100.00 | Using where; Using index     |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+

Performance Metrics Comparison

MetricOriginal QueryOptimized Query
SQL PatternLEFT JOIN + GROUP BY + HAVINGNOT EXISTS Anti-Join
Execution Time0.3367s0.1612s
Performance Improvement—52.14%
Query Cost50,261.3535,969.60
Cost Reduction—28.43%
Query LevelSingle-level (includes temp table)Materialized Subquery
Rows Scanned99,949 (main table) + 244,602 (JOIN result)99,949 (main table) + 99,800 (materialized table)
Intermediate Results244,602 rowsNo JOIN intermediate results needed
Extra IdentifierUsing temporaryMaterialized + Not exists
Optimizer StrategyNested Loop + Group AggregationSubquery Materialization + Hash Lookup
GROUP BY OverheadRequires temporary table for groupingNo grouping required

Optimization Principle Analysis

The essence of NOT EXISTS optimization lies in reducing intermediate data inflation and avoiding group aggregation:

1. Eliminate Redundant Intermediate Results LEFT JOIN generates all matching intermediate records (240,000 rows in this example), while NOT EXISTS only needs to check for existence, directly filtering without the need to retain intermediate data.

2. Replace Computation-Intensive Operations Converts the aggregate computation of GROUP BY + HAVING (which requires temporary table sorting/grouping) into simple hash lookups (the materialized table only needs to be built once).

3. Optimize Data Access Patterns After materialization, the subquery forms an in-memory hash table. The main query only requires constant-time lookups, significantly reducing CPU and I/O costs.

V. Conclusion

In anti-join scenarios for “finding records without associated entries,” NOT EXISTS achieves over 50% performance improvement compared to LEFT JOIN + HAVING COUNT = 0. The key points are:

  • Eliminate intermediate result inflation (reducing data volume by 60%)
  • Avoid group aggregation overhead (eliminating temporary table operations)
  • Leverage the optimization strategy of subquery materialization + hash lookup

Development Guideline: Prioritize the NOT EXISTS pattern for anti-join queries, ensure associated fields are indexed, and verify that the execution plan includes the “Materialized” identifier.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

How to use SQLFlash in a database?

Ready to elevate your SQL performance?

Join us and experience the power of SQLFlash today!.