GROUP BY Clause Elimination

Preface: Performance Pitfalls of Redundant GROUP BY

When SQL queries apply GROUP BY to unique columns (like primary keys) without aggregate functions, forced grouping triggers unnecessary sorting and temporary table generation. This report validates performance gains from eliminating redundant GROUP BY through million-row dataset experiments.

1. Experimental Environment & Data Preparation

1.1 Hardware Configuration

Component	Specification
CPU	4-core
RAM	8GB
Storage	1TB SSD
MySQL Version	8.0.18

1.2 Table Schema Design

1
2
3
4
5
6
7
8
-- Students table (Primary key: student_id)
CREATE TABLE students (
  student_id INT AUTO_INCREMENT PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  date_of_birth DATE,
  gender ENUM('Male','Female','Other'),
  major VARCHAR(100)
) ENGINE=InnoDB;

2. Data Generation Method

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SET SESSION cte_max_recursion_depth = 1000000;

INSERT INTO students(student_name, date_of_birth, gender, major)
WITH RECURSIVE seq AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n+1 FROM seq WHERE n<1000000
)
SELECT
  CONCAT('Student_',LPAD(n,7,'0')),
  DATE('2000-01-01')+INTERVAL FLOOR(RAND()*365*20) DAY,
  ELT(FLOOR(RAND()*3)+1,'Male','Female','Other'),
  CONCAT('Major_',FLOOR(RAND()*10)+1)
FROM seq;

3. Performance Comparison Experiment

3.1 Original SQL

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SELECT
*
FROM
(
    SELECT
    *
    FROM
    students
    WHERE
    student_id > 100000
    GROUP BY
    student_id
) dt
LIMIT 10;

3.2 Optimized SQL

We attempted to rewrite the SQL using SQLFlash.

The rewritten SQL is as follows:

1
2
3
4
SELECT *
FROM students
WHERE student_id > 100000
LIMIT 10;

View detailed report: https://sqlflash.ai/app/sqlResult?shareid=dfe1a9dc-b583-4f1c-aa71-9d2ca41cd40b

3.3 Performance Metrics Comparison

To control variables, we conducted a comparative experiment using only the GROUP BY clause elimination rule.

Metric	Control Group (with GROUP BY)	Experimental Group (without GROUP BY)
Execution Time	0.58s	0.001s
Rows Scanned	1,000,000	500,000

3.4 SQLFlash Analysis

The most significant improvement in the rewritten SQL is the removal of the original GROUP BY student_id operation, which avoids potential temporary table creation and additional sorting. The query now retains only basic primary key range filtering and LIMIT clause, significantly reducing execution steps. Compared to the original approach that required grouping first and then taking the top 10 results, the simplified SQL can directly read data within the qualified index range segment and immediately stop scanning after obtaining sufficient results, eliminating unnecessary grouping processing and potential filesort overhead. In scenarios with large data volumes, this can effectively reduce server-side CPU and I/O burden and accelerate data return speed.

4. In-Depth Execution Plan Analysis

4.1 Control Group Execution Plan

1
2
3
4
5
6
7
8
mysql> explain select * from(select * from students where student_id > 100000 group by student_id) dt LIMIT 10;

+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1  | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL | 498385 | 100.00   | NULL        |
| 2  | DERIVED     | students   | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 498385 | 100.00   | Using where |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+

Problem Diagnosis: The GROUP BY clause prevents MySQL from performing subquery pushdown optimization, resulting in a secondary full table scan of the students table.

4.2 Experimental Group Execution Plan (without GROUP BY)

1
2
3
4
5
6
7
mysql> explain select * from(select * from students where student_id > 100000) dt LIMIT 10;

+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| 1  | SIMPLE      | students | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 498412  | 100.00   | Using where |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+

Optimization Principle: The outer query SELECT * FROM (subquery) dt LIMIT 10 is simplified by the optimizer into a single-layer query that directly operates on the original students table, avoiding secondary scanning.

5. Conclusion

For grouping queries based on primary keys or unique indexes, eliminating redundant GROUP BY clauses is a critical optimization strategy in OLTP scenarios. When combined with index optimization, this approach can resolve over 80% of unnecessary grouping performance issues, ensuring sub-second response times even in high-concurrency environments.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.