When SQL queries apply GROUP BY
to unique columns (like primary keys) without aggregate functions, forced grouping triggers unnecessary sorting and temporary table generation. This report validates performance gains from eliminating redundant GROUP BY
through million-row dataset experiments.
1. Experimental Environment & Data Preparation
1.1 Hardware Configuration
Component | Specification |
---|
CPU | 4-core |
RAM | 8GB |
Storage | 1TB SSD |
MySQL Version | 8.0.18 |
1.2 Table Schema Design
1
2
3
4
5
6
7
8
| -- Students table (Primary key: student_id)
CREATE TABLE students (
student_id INT AUTO_INCREMENT PRIMARY KEY,
student_name VARCHAR(100) NOT NULL,
date_of_birth DATE,
gender ENUM('Male','Female','Other'),
major VARCHAR(100)
) ENGINE=InnoDB;
|
2. Data Generation Method
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| SET SESSION cte_max_recursion_depth = 1000000;
INSERT INTO students(student_name, date_of_birth, gender, major)
WITH RECURSIVE seq AS (
SELECT 1 AS n
UNION ALL
SELECT n+1 FROM seq WHERE n<1000000
)
SELECT
CONCAT('Student_',LPAD(n,7,'0')),
DATE('2000-01-01')+INTERVAL FLOOR(RAND()*365*20) DAY,
ELT(FLOOR(RAND()*3)+1,'Male','Female','Other'),
CONCAT('Major_',FLOOR(RAND()*10)+1)
FROM seq;
|
3.1 Original SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| SELECT
*
FROM
(
SELECT
*
FROM
students
WHERE
student_id > 100000
GROUP BY
student_id
) dt
LIMIT 10;
|
3.2 Optimized SQL
We attempted to rewrite the SQL using SQLFlash.
.png)
The rewritten SQL is as follows:
1
2
3
4
| SELECT *
FROM students
WHERE student_id > 100000
LIMIT 10;
|
View detailed report: https://sqlflash.ai/app/sqlResult?shareid=dfe1a9dc-b583-4f1c-aa71-9d2ca41cd40b
To control variables, we conducted a comparative experiment using only the GROUP BY clause elimination rule.
Metric | Control Group (with GROUP BY) | Experimental Group (without GROUP BY) |
---|
Execution Time | 0.58s | 0.001s |
Rows Scanned | 1,000,000 | 500,000 |
3.4 SQLFlash Analysis
The most significant improvement in the rewritten SQL is the removal of the original GROUP BY student_id operation, which avoids potential temporary table creation and additional sorting. The query now retains only basic primary key range filtering and LIMIT clause, significantly reducing execution steps. Compared to the original approach that required grouping first and then taking the top 10 results, the simplified SQL can directly read data within the qualified index range segment and immediately stop scanning after obtaining sufficient results, eliminating unnecessary grouping processing and potential filesort overhead. In scenarios with large data volumes, this can effectively reduce server-side CPU and I/O burden and accelerate data return speed.
4. In-Depth Execution Plan Analysis
4.1 Control Group Execution Plan
1
2
3
4
5
6
7
8
| mysql> explain select * from(select * from students where student_id > 100000 group by student_id) dt LIMIT 10;
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 498385 | 100.00 | NULL |
| 2 | DERIVED | students | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 498385 | 100.00 | Using where |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
|
Problem Diagnosis: The GROUP BY clause prevents MySQL from performing subquery pushdown optimization, resulting in a secondary full table scan of the students table.
4.2 Experimental Group Execution Plan (without GROUP BY)
1
2
3
4
5
6
7
| mysql> explain select * from(select * from students where student_id > 100000) dt LIMIT 10;
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | students | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 498412 | 100.00 | Using where |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
|
Optimization Principle: The outer query SELECT * FROM (subquery) dt LIMIT 10
is simplified by the optimizer into a single-layer query that directly operates on the original students table, avoiding secondary scanning.
5. Conclusion
For grouping queries based on primary keys or unique indexes, eliminating redundant GROUP BY clauses is a critical optimization strategy in OLTP scenarios. When combined with index optimization, this approach can resolve over 80% of unnecessary grouping performance issues, ensuring sub-second response times even in high-concurrency environments.
Recommended reading
What is SQLFlash?
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
How to use SQLFlash in a database?