Preface
In modern database applications, achieving efficient query performance is a core challenge for system performance. Developers often use COUNT scalar subqueries for existence checks (e.g., (SELECT COUNT(*) FROM ...) > 0). However, this type of query can trigger MySQL’s DEPENDENT SUBQUERY execution plan, leading to significant performance issues: each row from the outer table may trigger a full table scan in the subquery. When data volumes are large, performance degrades sharply as a result of these repeated scans and aggregate calculations for each outer row .
By rewriting the COUNT scalar subquery into an IN subquery, MySQL’s SEMI JOIN optimization mechanism can be activated. This changes the execution plan from a Nested Loop to a more efficient Hash Join or index lookup, thereby avoiding redundant full table scans and aggregate calculations . This article will demonstrate the optimization effect through comprehensive test data generation, performance comparison experiments, and execution plan analysis, providing a standardized rewriting procedure .
I. Test Data Generation (Simulating Student Course Selection Scenario)
1. Creating Test Tables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| -- Create students table (Student Information)
CREATE TABLE students (
student_id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(20) NOT NULL,
class_id INT NOT NULL
) ENGINE=InnoDB;
-- Create enrollments table (Course Enrollment Records)
CREATE TABLE enrollments (
enrollment_id INT AUTO_INCREMENT PRIMARY KEY,
student_id INT NOT NULL,
course_id INT NOT NULL,
enroll_time DATETIME NOT NULL
) ENGINE=InnoDB;
|
2. Batch Insertion of 5 Million Test Records
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| -- Insert 500,000 student records (random class 1-100)
SET SESSION cte_max_recursion_depth = 500000;
INSERT INTO students (name, class_id)
WITH RECURSIVE seq AS (
SELECT 0 AS n
UNION ALL
SELECT n+1 FROM seq WHERE n < 499999
)
SELECT
CONCAT('Student_', n) AS name,
FLOOR(RAND(n) * 100) + 1 AS class_id
FROM seq;
-- Insert 5 million course enrollment records (random students and courses)
INSERT INTO enrollments (student_id, course_id, enroll_time)
WITH RECURSIVE seq AS (
SELECT 0 AS n
UNION ALL
SELECT n+1 FROM seq WHERE n < 4999999
)
SELECT
FLOOR(RAND(n) * 500000) + 1 AS student_id,
FLOOR(RAND(n) * 50) + 1 AS course_id,
NOW() - INTERVAL FLOOR(RAND(n) * 365) DAY AS enroll_time
FROM seq;
|
3. Creating Indexes (Accelerating Join Queries)
1
| ALTER TABLE enrollments ADD INDEX idx_student (student_id);
|
1. Test Environment
- Database Version: MySQL 8.0.18
- Hardware Configuration: 4-core CPU/8GB RAM/SSD Storage
- Data Volume:
- students table: 500,000 records
- enrollments table: 5,000,000 records
2. Original SQL
1
2
3
4
5
| SELECT * FROM students WHERE (
SELECT COUNT(*) FROM enrollments
WHERE students.student_id = enrollments.student_id
AND enrollment_id > 4990000
) > 0;
|
View detailed report:https://sqlflash.ai/app/sqlResult?shareid=202bcc49-35e3-4bb5-a5de-62a4eb6d6ea3
3. Optimized SQL
We attempted to rewrite using SQLFlash.

The rewritten SQL obtained is as follows:
1
2
3
4
5
6
7
| SELECT * FROM students
WHERE student_id IN (
SELECT student_id
FROM enrollments
WHERE students.student_id = enrollments.student_id
AND enrollment_id > 4990000
);
|
4. Test Results
| Metric | Original Approach (COUNT Subquery) | Optimized Approach (IN Subquery) | Performance Improvement |
|---|
| Execution Time | 4.41 seconds | 0.04 seconds | 99% |
| Rows Scanned | ~10 million rows | ~39,028 rows | 99.6% |
III. Optimization Principle Analysis
1. SQLFlash Analysis
Compared to the original SQL, the rewritten statement uses the form (order_id, product_id) IN (...), which avoids row-by-row calculation of COUNT(*). This allows the database to more easily recognize it as a filtering subquery, improves readability, and reduces redundant calculation overhead in some cases. At the logical level, compared to the original correlated subquery which required executing an aggregate function for each row of the main query, this rewritten form allows the optimizer to more directly determine the existence of (order_id, product_id), potentially reducing invalid access and iterative processes on the detail table, thereby achieving significant performance improvements in large dataset environments.
2. Execution Plan Comparison
Original SQL Execution Plan
1
2
3
4
5
6
7
8
| mysql> explain SELECT * FROM students WHERE student_id IN ( SELECT student_id FROM enrollments WHERE students.student_id = enrollments.student_id AND enrollment_id > 4990000 );
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where |
| 1 | SIMPLE | students | NULL | eq_ref | PRIMARY | PRIMARY | 4 | <subquery2>.student_id | 1 | 100.00 | NULL |
| 2 | MATERIALIZED | enrollments | NULL | range | PRIMARY,idx_student | PRIMARY | 4 | NULL | 19514 | 100.00 | Using where |
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+
|
Optimized SQL Execution Plan
1
2
3
4
5
6
7
| mysql> explain SELECT * FROM students WHERE ( SELECT COUNT(*) FROM enrollments WHERE students.student_id = enrollments.student_id AND enrollment_id > 4990000 ) > 0;
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+
| 1 | PRIMARY | students | NULL | ALL | NULL | NULL | NULL | NULL | 498986 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | enrollments | NULL | ref | PRIMARY,idx_student | idx_student | 4 | ct.students.student_id | 21 | 0.39 | Using where; Using index |
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+
|
IV. Summary
Through this optimization rule, the following benefits can be achieved:
99% Performance Improvement: Execution time is reduced from 4.41 seconds to 0.04 seconds, and the number of scanned rows drops from 10 million to 40,000, completely eliminating the performance bottleneck of nested loops caused by DEPENDENT SUBQUERY.
Efficient Index Utilization: Avoids back-table queries by directly locating data through indexes, reducing I/O and CPU overhead.
Code Maintainability: Transforms implicit COUNT aggregation logic into explicit IN existence checks, resulting in clearer semantics and reduced maintenance complexity.
Recommended reading
What is SQLFlash?
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
How to use SQLFlash in a database?