Lesson 37 of the SQL Optimization Course: Hash Partitioning

Introduction

For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.

This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.

🧑‍💻 Target audience:

DBA
Database developers
Students

We will use MySQL as the demonstration database.

When discussing partitioned tables, range partitioning is most common, while hash partitioning has more specific use cases and lacks standardization. This article demonstrates MySQL hash partitioning scenarios and optimization techniques through examples.

For hash partitioned tables, the most common approach is to hash a single column. Consider table hash_t1 (containing 5 million rows), partitioned by HASH on the auto-increment id column into 1024 partitions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
mysql:ytt_new> SHOW CREATE TABLE hash_t1\G
*************************** 1. row ***************************
       Table: hash_t1
Create Table: CREATE TABLE `hash_t1` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `r1` int DEFAULT NULL,
  `log_date` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY HASH (`id`)
PARTITIONS 1024 */
1 row in set (0.00 sec)

mysql:ytt_new> SELECT COUNT(*) FROM hash_t1;
+----------+
| COUNT(*) |
+----------+
|  5000000 |
+----------+
1 row in set (2.43 sec)

The partitioning logic is straightforward: data is distributed by hashing the id (effectively id MOD 1024), resulting in very even distribution:

1
2
3
4
5
6
7
mysql:ytt_new> SELECT MAX(table_rows), MIN(table_rows) FROM information_schema.partitions WHERE table_name = 'hash_t1';
+-----------------+-----------------+
| MAX(table_rows) | MIN(table_rows) |
+-----------------+-----------------+
|            4883 |            4882 |
+-----------------+-----------------+
1 row in set (0.04 sec)

Now, consider these queries:

1
2
3
4
5
6
7
8
-- SQL 1: Equality
SELECT COUNT(*) FROM hash_t1 WHERE id = 1;
-- SQL 2: IN list (Equality on multiple values)
SELECT COUNT(*) FROM hash_t1 WHERE id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
-- SQL 3: Range (<=1)
SELECT COUNT(*) FROM hash_t1 WHERE id <= 1;
-- SQL 4: Range (<=15)
SELECT COUNT(*) FROM hash_t1 WHERE id <= 15;

SQL 1 and SQL 2 are ideal for hash partitioned tables. SQL 3 and SQL 4 are not.

SQL 1 Execution Plan: Represents the optimal case for hash partitioning. It targets a single partition (p8) using a constant equality condition (id = 8).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
mysql:ytt_new> EXPLAIN SELECT COUNT(*) FROM hash_t1 WHERE id = 8\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: hash_t1
       partitions: p8  -- Single Partition
             type: const
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 8
              ref: const
             rows: 1
         filtered: 100.00
            Extra: Using index
    1 row in set, 1 warning (0.00 sec)
    

SQL 2 Execution Plan: Uses an IN list (multiple equality conditions). It targets 15 partitions (p1 to p15). While much better than scanning all 1024 partitions, there’s room for optimization by changing the partitioning strategy.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
mysql:ytt_new> EXPLAIN SELECT COUNT(*) FROM hash_t1 WHERE id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: hash_t1
       partitions: p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15  -- 15 Partitions
             type: range
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 8
              ref: NULL
             rows: 15
         filtered: 100.00
            Extra: Using where; Using index
    1 row in set, 1 warning (0.00 sec)
    

SQL 3 & SQL 4: Although achieving similar results (counting small ranges) as SQL 1 and SQL 2, these range queries (<=) force a scan of ALL partitions because hash partitioning is designed for equality lookups, similar to hash indexes on regular tables.

Optimizing SQL 2 (Reducing Partition Scans)

Can we reduce the number of partitions scanned for SQL 2? Yes, by redefining the partitioning function. Instead of HASH(id), use HASH(id DIV 1024):

1
2
3
4
5
6
7
8
9
mysql:ytt_new> CREATE TABLE hash_t2 (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    r1 INT,
    log_date DATE
) PARTITION BY HASH(id DIV 1024) PARTITIONS 1024;
Query OK, 0 rows affected (10.54 sec)

mysql:ytt_new> LOAD DATA INFILE '/var/lib/mysql-files/hash_sample.csv' INTO TABLE hash_t2;
Query OK, 5000000 rows affected (3 min 20.11 sec)

Now, SQL 2 targets only a single partition (p0):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
mysql:ytt_new> EXPLAIN SELECT COUNT(*) FROM hash_t2 WHERE id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\G
*************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: hash_t2
       partitions: p0  -- Single Partition!
             type: range
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 8
              ref: NULL
             rows: 15
         filtered: 100.00
            Extra: Using where; Using index
    1 row in set, 1 warning (0.00 sec)

Hash Partitioning for Date Equality Queries

Hash partitioning can also be effective for specific date-based equality queries, offering simpler definition than range partitioning for this case. Consider SQL 5:

1
2
-- SQL 5: Date Equality
SELECT COUNT(*) FROM hash_t1 WHERE log_date = '2020-08-05';

Create a table hash_t3 partitioned by HASH(YEAR(log_date)) (using 11 partitions):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
mysql:ytt_new> CREATE TABLE hash_t3 (
    id BIGINT UNSIGNED,
    r1 INT,
    log_date DATE,
    KEY idx_log_date(log_date)
);
Query OK, 0 rows affected (0.04 sec)

mysql:ytt_new> ALTER TABLE hash_t3 PARTITION BY HASH(YEAR(log_date)) PARTITIONS 11;
Query OK, 0 rows affected (0.32 sec)

mysql:ytt_new> LOAD DATA INFILE '/var/lib/mysql-files/hash_sample.csv' INTO TABLE hash_t3;
Query OK, 5000000 rows affected (2 min 4.59 sec)

SQL 5 now targets a single partition (p7):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
mysql:ytt_new> EXPLAIN SELECT COUNT(*) FROM hash_t3 WHERE log_date = '2020-08-05'\G
*************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: hash_t3
       partitions: p7  -- Single Partition
             type: ref
    possible_keys: idx_log_date
              key: idx_log_date
          key_len: 4
              ref: const
             rows: 1405
         filtered: 100.00
            Extra: Using index
    1 row in set, 1 warning (0.00 sec)

Note: Using YEAR might become less optimal as data grows. Consider partitioning by MONTH for finer granularity.

KEY Partitioning: A Special Hash Type

MySQL offers KEY partitioning, a special hash type using the PASSWORD hashing function internally. It simplifies definition and works with non-integer primary keys:

1
PARTITION BY KEY (non_integer_primary_key_column) PARTITIONS N;

LINEAR HASH Partitioning for Scaling

While the previous examples focused on query performance, scaling partitions (adding/removing) frequently can be expensive with standard HASH. LINEAR HASH implements a consistent hashing algorithm in MySQL, significantly improving scaling performance at the cost of data distribution uniformity and potential hotspots.

Comparison: HASH vs LINEAR HASH Scaling

Reducing hash_t1 Partitions (Standard HASH): From 1024 to 10 partitions took ~2 min 46 sec.
1 2 3
mysql:ytt_new> ALTER TABLE hash_t1 COALESCE PARTITION 1014; -- Remove 1014 partitions Query OK, 0 rows affected (2 min 46.01 sec)

Reducing hash_linear_t1 Partitions (LINEAR HASH): Scaling down was faster (~1 min 28 sec), almost twice as fast.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
mysql:ytt_new> CREATE TABLE hash_linear_t1 (
        id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        r1 INT,
        log_date DATE
    ) PARTITION BY LINEAR HASH(id) PARTITIONS 1024;
    Query OK, 0 rows affected (34.13 sec)

    mysql:ytt_new> LOAD DATA INFILE '/var/lib/mysql-files/hash_sample.csv' INTO TABLE hash_linear_t1;
    Query OK, 5000000 rows affected (2 min 7.78 sec)

    mysql:ytt_new> ALTER TABLE hash_linear_t1 COALESCE PARTITION 1014;
    Query OK, 0 rows affected (1 min 28.29 sec)
    

Data Distribution: After reduction, hash_t1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

                          

(Standard HASH) shows relatively even distribution. hash_linear_t1 (LINEAR HASH) shows significant imbalance and data hotspots. 1 class=language-sql data-lang=sql>-- Standard HASH (hash_t1) Distribution (10 partitions): class=c1>    mysql:ytt_new> SELECT table_rows FROM information_schema.partitions WHERE table_name = 'hash_t1'; +------------+ class=c1>    | TABLE_ROWS | +------------+ class=c1>    |     485723 | |     537704 | |     523017 | |     470724 | |     478982 | |     512272 | |     483190 | |     455829 | |     520512 | |     461572 | +------------+ class=c1>    10 rows in set (0.00 sec) -- LINEAR HASH (hash_linear_t1) Distribution (10 partitions): class=c1>    mysql:ytt_new> SELECT table_rows FROM information_schema.partitions WHERE table_name = 'hash_linear_t1'; +------------+ class=c1>    | TABLE_ROWS | +------------+ class=c1>    |     269443 | |     340989 | |     611739 | |     584321 | |     566181 | |     624040 | |     637801 | |     688467 | |     331397 | |     317695 | +------------+ class=c1>    10 rows in set (0.01 sec)

Summary

Hash partitioning is exclusively suited for equality queries (=, IN). Range queries (<, <=, >, >=, BETWEEN) are inefficient as they scan all partitions.
Optimize multi-value IN list queries by using HASH(id DIV N) instead of HASH(id) to potentially target fewer partitions.
Hash partitioning can be effective for date equality lookups (e.g., HASH(YEAR(log_date))).
Use KEY partitioning for simpler definition with non-integer keys.
Use LINEAR HASH only if frequent scaling (adding/removing partitions) is a critical requirement, accepting the trade-offs of uneven data distribution and potential hotspots.

What MySQL topics would you like to see covered next? Share your suggestions in the comments! (Email for community collaboration: osc@actionsky.com)

👋 See you in the next lesson.

`Recommended reading`

`What is SQLFlash?`

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

Lesson 37 of the SQL Optimization Course: Hash Partitioning

Introduction

`Recommended reading`

`What is SQLFlash?`

`How to use SQLFlash in a database?`

`Ready to elevate your SQL performance?`