Lesson 18 of the SQL Optimization Course: Understanding Prefix Indexes

Introduction

For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.

This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.

🧑‍💻 Target audience:

DBA
Database developers
Students

We will use MySQL as the demonstration database.

This article explores the concept of MySQL prefix indexes and how they can optimize query performance and storage efficiency.

1. Understanding Prefix Indexes

A prefix index is created by indexing a specified number of characters or bytes from the beginning of a field. MySQL supports prefix indexes for character and binary data types like CHAR/VARCHAR, TEXT/BLOB, and BINARY/VARBINARY.

For character types, specify the prefix length in characters (e.g., f1(10) indexes the first 10 characters)
For binary types, specify the prefix length in bytes (e.g., f1(10) indexes the first 10 bytes)
TEXT/BLOB types only support prefix indexes, not full-field indexing

Consider table t1 with two indexes on field r1: a regular secondary index and a prefix index:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
<localhost|mysql>show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `r1` varchar(300) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_r1` (`r1`),
  KEY `idx_r1_p` (`r1`(6))
) ENGINE=InnoDB AUTO_INCREMENT=32755 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

2. Storage Efficiency of Prefix Indexes

The difference in storage requirements between full-field indexes and prefix indexes is significant:

1
2
3
4
5
6
7
// idx_r1 (full-field index)
root@debian-ytt1:/var/lib/mysql/3306/ytt# du -sh
26M     .

// idx_r1_p (prefix index)
root@debian-ytt1:/var/lib/myzsql/3306/ytt# du -sh
20M     .

3. Query Performance Optimization

When querying records starting with ‘sample’, MySQL chooses the smaller prefix index idx_r1_p:

1
2
3
4
5
6
7
<localhost|mysql>select count(*) from t1 where r1 like 'sample%';
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.00 sec)

The execution plan confirms the prefix index is used:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
<localhost|mysql>explain select count(*) from t1 where r1 like 'sample%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: idx_r1,idx_r1_p
          key: idx_r1_p
      key_len: 27
          ref: NULL
         rows: 4
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

4. Determining Optimal Prefix Length

To find the best prefix length, calculate the selectivity of each prefix compared to the entire field. This involves:

Calculating the total selectivity of the field
Determining the selectivity for different prefix lengths
Choosing the prefix length that most closely matches the total selectivity

First, calculate the total selectivity of field r1:

1
2
3
4
5
6
7
<localhost|mysql>SELECT TRUNCATE(COUNT(DISTINCT r1) / COUNT(r1),4) 'total_pct' FROM t1;
+------------+
| total_pct  |
+------------+
|     0.0971 |
+------------+
1 row in set (0.13 sec)

Then, use a custom function to find the optimal prefix length:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
DELIMITER $$

USE `ytt`$$

DROP FUNCTION IF EXISTS `func_calc_prefix_length`$$

CREATE DEFINER=`ytt`@`%` FUNCTION `func_calc_prefix_length`() RETURNS JSON
BEGIN
      DECLARE v_total_pct DECIMAL(20,4);
      DECLARE v_prefix_pct DECIMAL(20,4);
      DECLARE v_result JSON DEFAULT '[]';
      DECLARE i TINYINT DEFAULT 1;
      
  SELECT TRUNCATE(COUNT(DISTINCT r1) / COUNT(r1),4) INTO v_total_pct FROM t1;
  label1:LOOP
    SELECT TRUNCATE(COUNT(DISTINCT LEFT(r1,i)) / COUNT(r1),4) INTO v_prefix_pct FROM t1; 
    SET v_result = JSON_ARRAY_APPEND(v_result,'$',JSON_OBJECT(i,v_prefix_pct));       
    IF v_prefix_pct >= v_total_pct THEN
      LEAVE label1;        
    END IF;        
    SET i = i + 1;
  END LOOP;
  RETURN v_result;
END$$
DELIMITER ;

Calling this function reveals the optimal prefix length of 6:

1
2
3
4
<localhost|mysql>SELECT func_calc_prefix_length() AS prefix_length\G
*************************** 1. row ***************************
prefix_length: [{"1": 0.0003}, {"2": 0.0005}, {"3": 0.0008}, {"4": 0.0013}, {"5": 0.0093}, {"6": 0.0971}]
1 row in set (0.32 sec)

5. Handling Suffix-Based Queries

For queries with wildcards at the beginning (e.g., LIKE '%sample'), two optimization strategies exist:

5.1 Strategy 1: Suffix Field Indexing

Create a new table with a suffix field and index it:

1
2
3
4
5
6
7
8
9
CREATE TABLE `t3` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `r1` varchar(300) DEFAULT NULL,
  `suffix_r1` varchar(6) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_suffix_r1` (`suffix_r1`)
) ENGINE=InnoDB

<localhost|mysql>insert into t3 select id,r1,right(r1,6) from t2;

5.2 Strategy 2: Reverse Data Indexing

Create a reversed copy of the data and query using the reversed keyword:

1
<localhost|mysql>insert into t4 select id,reverse(r1) from t2;

Query with reversed keyword:

1
2
3
4
5
6
7
<localhost|mysql>select count(*) from t4 where r1 like 'elpmas%';
+----------+
| count(*) |
+----------+
|        4 |
+----------+
1 row in set (0.00 sec)

The execution plan confirms the prefix index is used:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
<localhost|mysql>explain select count(*) from t4 where r1 like 'elpmas%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t4
   partitions: NULL
         type: range
possible_keys: idx_r1_p
          key: idx_r1_p
      key_len: 27
          ref: NULL
         rows: 4
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

By mastering these prefix index techniques, you can significantly enhance your database performance and query efficiency.

👋 See you in the next lesson.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.