217x Faster MySQL: Non-Correlated Subquery Optimization with Index-Driven Execution

Preface: Performance Challenges of Non-Correlated Subqueries

When query statements involve non-correlated subqueries, they typically perform well under small data volume scenarios. However, as the data scale grows, the full table scan behavior inherent to subqueries may lead to a sharp decline in performance. Through controlled experiments with million-level datasets, this report demonstrates the significant improvement in execution efficiency achievable by query rewriting, and validates the practical effects of index optimization and execution path refinement.

1. Experimental Environment & Data Preparation

1.1 Hardware Configuration

Component	Specification
CPU	4-core
Memory	8GB
Storage	500GB SSD
MySQL Version	8.0.18

1.2 Data Generation Method

Step 1: Create Basic Table Structure

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
CREATE TABLE students (
    student_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50),
    gender CHAR(1),
    birth_date DATE
) ENGINE=InnoDB;

CREATE TABLE enrollments (
    enrollment_id INT AUTO_INCREMENT PRIMARY KEY,
    student_id INT,
    course_id INT,
    enroll_date DATE
) ENGINE=InnoDB;

Step 2: Bulk Insert Test Data

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- Insert 1 million student records 
INSERT INTO students (name, gender, birth_date)
WITH RECURSIVE seq AS (
    SELECT 1 AS n 
    UNION ALL 
    SELECT n+1 FROM seq WHERE n < 1000000
)
SELECT 
    CONCAT('S_', n),
    IF(RAND() < 0.5, 'M', 'F'),
    DATE('2000-01-01') + INTERVAL FLOOR(RAND() * 365 * 20) DAY
FROM seq;

-- Insert 5 million course enrollment records
INSERT INTO enrollments (student_id, course_id, enroll_date)
WITH RECURSIVE seq AS (
    SELECT 1 AS n 
    UNION ALL 
    SELECT n+1 FROM seq WHERE n < 5000000
)
SELECT 
    FLOOR(RAND() * 1000000) + 1,
    FLOOR(RAND() * 100) + 1,
    DATE('2020-01-01') + INTERVAL FLOOR(RAND() * 365 * 3) DAY
FROM seq;

-- Add necessary indexes
ALTER TABLE enrollments ADD INDEX idx_student (student_id);

2. Performance Comparison Experiment

2.1 Test Cases

Control Group (Original Query):

1
2
3
4
5
6
7
SELECT * 
FROM students 
WHERE student_id IN (
    SELECT student_id 
    FROM enrollments
)
LIMIT 100;

Experimental Group (Optimized Query):

We attempted to rewrite the queries using SQLFlash

The rewritten SQL is as follows:

1
2
3
4
5
6
7
8
SELECT * 
FROM students 
WHERE student_id IN (
    SELECT student_id 
    FROM enrollments 
    WHERE students.student_id = enrollments.student_id  
)
LIMIT 100;

View detailed report

2.2 Performance Metrics

Metric	Control Group	Experimental Group
Average Execution Time	2.17s	0.01s
Number of Scanned Rows	6,000,000	1,000,000

2.3 SQLFlash Analysis

According to SQLFlash’s analysis, compared with the original SQL statement, the rewritten statement adds a correlation condition between the subquery and the outer table. This allows the scan to terminate earlier after matching is completed, reducing redundant fetching of irrelevant student_id values. After retrieving the required 100 rows, the operation can be halted, effectively minimizing unnecessary scans and data processing overhead. The correlation condition between the subquery and the outer query enables targeted data filtering during execution, reducing redundant row matching and result merging operations, thereby accelerating query response time and improving overall resource efficiency.

3. Execution Plan Analysis

3.1 Diagnosis of Issues in the Control Group

1
2
3
4
5
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref                      | rows      | filtered | Extra       |
|----|--------------|-------------|------------|--------|---------------------|---------------------|---------|--------------------------|-----------|----------|-------------|
| 1  | SIMPLE       | students    | NULL       | ALL    | PRIMARY             | NULL                | NULL    | NULL                     |    998540 | 100.00   | Using where |
| 1  | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 5       | test.students.student_id |         1 | 100.00   | NULL        |
| 2  | MATERIALIZED | enrollments | NULL       | index  | idx_student         | idx_student         | 5       | NULL                     |   4989160 | 100.00   | Using index |

Problem Summary

Full Table Scans: The main table students underwent a full scan of 1,000,000 rows, while the subquery table enrollments executed a full index scan of 5,000,000 rows.
Invalid Computations: The actual requirement was only to verify the existence of the first 100 students.

3.2 Experimental Group Optimization Results

1
2
3
mysql> explain SELECT * FROM students WHERE student_id IN (  
    SELECT student_id FROM enrollments WHERE students.student_id = enrollments.student_id  
) LIMIT 100;  

1
2
3
4
5
6
+----+-------------+-------------+------------+------+---------------+-------------+---------+--------------------------+---------+----------+-----------------------------------+
| id | select_type | table       | partitions | type | possible_keys | key         | key_len | ref                      | rows    | filtered | Extra                             |
+----+-------------+-------------+------------+------+---------------+-------------+---------+--------------------------+---------+----------+-----------------------------------+
| 1  | SIMPLE      | students    | NULL       | ALL  | PRIMARY       | NULL        | NULL    | NULL                     | 998540  | 100.00   | NULL                              |
| 1  | SIMPLE      | enrollments | NULL       | ref  | idx_student   | idx_student | 5       | test.students.student_id | 5       | 100.00   | Using index; FirstMatch(students) |
+----+-------------+-------------+------------+------+---------------+-------------+---------+--------------------------+---------+----------+-----------------------------------+

Optimization Principles

Index-driven Query: Trigger index lookup through WHERE students.student_id = enrollments.student_id.
Row-by-row Validation: For each student_id read from the outer query, immediately confirm existence via the index.

4. Optimization Practice Recommendations

Recommended Scenarios

Main table data volume > 100,000 rows
Subquery table has indexes on associated fields
Online queries requiring fast response

Precautions

Avoid full-scale statistics: Evaluate execution plans for operations like COUNT.
Small dataset scenarios: Optimization benefits are limited when subquery results < 1,000 rows.
Index necessity: Indexes must be created on associated fields; otherwise, performance may degrade.

5. Conclusion

The comparative experiments on million-row datasets confirm that optimization strategies for non-correlated subqueries can significantly enhance query performance:

Execution Efficiency Leap
- Optimized query duration reduced from 2.17s to 0.01s (217x improvement)
- Scanned rows decreased by 83% (6M → 1M), reducing I/O overhead
Execution Path Optimization
- Rewriting as correlated subquery (WHERE students.student_id=enrollments.student_id)
- Triggering Index-Driven Execution with row-by-row validation
- Implementing FirstMatch strategy to terminate scanning after 100 matches
Critical Optimization Factors
❗️ Index Dependency: Mandatory index on enrollments.student_id
✅ LIMIT Synergy: Early termination works with index lookups
❌ Full-Table Scan Elimination: Prevents unconstrained scans on students
Scenario Validation
- Noticeable gains for datasets >100K rows
- Requires >95% index coverage on subquery tables
- Online queries achieve <100ms response
Production Recommendations
- Priority Rewriting: Convert IN (SELECT...) to correlated subqueries
- Dual Indexing: Index both students.student_id and enrollments.student_id
- Plan Monitoring: Use EXPLAIN ANALYZE to verify Using index and FirstMatch triggers

Final Conclusion

In large-scale data scenarios, rewriting non-correlated subqueries as correlated subqueries combined with index optimization can fundamentally alter the data access paths of the execution engine. This optimization mechanism—replacing full table scans with index-based location and bulk loading with row-by-row validation—reduces query complexity from O(N²) to O(N). Experimental results demonstrate that this strategy effectively overcomes performance bottlenecks in non-correlated subqueries, making it a preferred solution for massive data filtering in OLTP scenarios.