Lesson 6 of the SQL Optimization Course: Combing Through the Concepts of MySQL Character Sets

Introduction

For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.

This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.

🧑‍💻 Target audience:

DBA
Database developers
Students

We will use MySQL as the demonstration database.

Concepts

The content of a character set includes two main components: character set and collation rule.

Each character set can correspond to one or multiple collation rules, while each collation rule corresponds to a specific character set.

A character set is a collection of mappings between characters and their encodings, like this:

Character	Encoding
A	0
B	1

Collation rules are a set of rules within a character set used to compare each character, essentially defining the sorting order of characters.

For example, to compare the size of characters A and B, the simplest and most intuitive method is to compare their corresponding encodings. If the encoding for A is 0 and for B is 1, then under this rule, A < B. Such a collection of rules is what we call a collation rule. This applies not only to single-byte character encodings but also to multi-byte encodings, following the same principle.

Next, I will provide a detailed introduction to character sets and their usage scenarios.

1. Character Encoding Categories

1.1 ASCII

Purpose: Used to map simple single - byte characters, such as uppercase and lowercase English letters, Arabic numerals, common punctuation marks, operators, and control characters.

Encoding Range: U+0000 – U+007F

Note: Sufficient for scenarios involving these characters, but cannot represent characters like Chinese or Japanese.

1.2 UNICODE

Purpose: Used to map all characters, including those in ASCII.

Encoding Range: U+0000 – U+10FFFF

Note: ASCII is a subset of UNICODE, and ASCII - encoded characters can be losslessly converted to UNICODE - encoded characters.

2. Common Character Sets in MySQL

2.1 Latin1

Latin1 is an alias for cp1252 or ISO-8859-1. The ISO-8859-1 encoding is a single - byte encoding that is backward compatible with ASCII.

Encoding Range: U+0000 – U+00FF

Range	Character Content	Description
U+0000 - U+007F	Single-byte characters	Identical to ASCII
U+0080 - U+009F	Control characters
U+0080 - U+00FF	Printable characters

In addition to the characters included in ASCII, ISO-8859-1 also covers characters for Western European languages, Greek, Thai, Arabic, and Hebrew.

Since the entire single - byte space is occupied by ISO-8859-1 encoding, it can be used to store and transmit byte streams of any other encoding.

For example, storing a Utf8mb4 or GBK encoded byte stream in Latin1 will not cause any issues. This is because Latin1 preserves the original byte stream, which is why MySQL has long used Latin1 as the default character set.

However, since Latin1 stores all characters as byte streams, it leads to a waste of character count.

For example:

1
2
CHAR(10) CHARACTER SET LATIN1;
CHAR(10) CHARACTER SET UTF8;

The number of characters stored in the UTF8 field is three times that of Latin1!!!

2.2 GB18030

GB18030 is the official Chinese standard character set, compatible with GBK and GB2312, and is a superset of these two. It uses 1, 2, or 4 bytes to represent a symbol. For example, common Chinese characters are stored using two bytes by default. Windows systems use GB18030 by default.

If you only need to store Chinese characters, GB18030 is the best choice for two reasons:

It occupies less space, for example, compared to UTF8.
Chinese characters are sorted based on pinyin, making retrieval faster.

2.3 UTF8

UTF8 is an encoding implementation of Unicode and can store any character corresponding to the UNICODE encoding. It is the most widely used encoding. Its main feature is the variable - length encoding method, which uses 1 to 4 bytes to represent a symbol based on the length of the encoded bytes for different symbols.

For example, letters or numbers use 1 byte, Chinese characters use 3 bytes, and emoji symbols use 4 bytes. The UTF8 character set is currently the most widely used.

Note: In MySQL, the commonly referred-to UTF8 is actually an alias for UTF8MB3, which is a subset of UTF8MB4. UTF8MB4 is the true 4 - byte UTF8 character set!

UTF8MB3 supports a maximum of 3 bytes for character storage, while UTF8MB4 supports up to 4 bytes. Based on actual needs and future considerations, MySQL 8.0 now defaults to using UTF8MB4 as the base character set.

3. Viewing Character Sets

MySQL now supports a wide range of character sets. There are two methods to view the list of character sets supported by MySQL:

3.1 SQL Statement

1
2
3
4
5
6
7
8
9
-- Filter for specific character sets
mysql> show character set where description like '%unicode%' and charset like 'utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_0900_ai_ci |      4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.01 sec)

3.2 Viewing Metadata Dictionary Table

1
2
3
4
5
6
7
8
9
-- Filter for specific character sets
mysql> select * from information_schema.character_sets where description like '%Unicode%' and character_set_name like 'utf8%';
+--------------------+----------------------+---------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION   | MAXLEN |
+--------------------+----------------------+---------------+--------+
| utf8               | utf8_general_ci      | UTF-8 Unicode |      3 |
| utf8mb4            | utf8mb4_0900_ai_ci   | UTF-8 Unicode |      4 |
+--------------------+----------------------+---------------+--------+
2 rows in set (0.00 sec)

Query Results:

The first column represents the name of the character set.
The second column indicates the collation rule of the character set.
The third column provides a description of the character set.
The fourth column shows the maximum byte length of the character set encoding.

4. Viewing Collation Rules

4.1 SQL Statement

1
2
3
4
5
6
7
8
-- Retrieve the default collation for the utf8mb4 character set
mysql> show collation where charset = 'utf8mb4' and `default` = 'yes';
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
+--------------------+---------+-----+---------+----------+---------+---------------+
1 row in set (0.00 sec)

4.2 Viewing Metadata Dictionary Table

1
2
3
4
5
6
7
8
9
-- Retrieve collations that include utf8mb4%_bin
mysql> select * from information_schema.collations where collation_name like 'utf8mb4%_bin';
+------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME   | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb4_bin      | utf8mb4            |  46 |            | Yes         |       1 | PAD SPACE     |
| utf8mb4_0900_bin | utf8mb4            | 309 |            | Yes         |       1 | NO PAD        |
+------------------+--------------------+-----+------------+-------------+---------+---------------+
2 rows in set (0.01 sec)

Query Results:

The first column represents the name of the collation rule.
The second column indicates the corresponding character set name.
The fourth column indicates whether it is the default collation rule.
The last column indicates whether to compare spaces following characters during sorting.

4.3 NO PAD vs PAD SPACE

NO PAD (Processed)

If there are spaces after a character, treat the spaces as a character. That is, do not ignore the existence of spaces when comparing.

PAD SPACE (Ignored)

Indicates that if there are spaces after a character, spaces can be ignored for comparison. That is, spaces are optional.

5. Migrating Between UTF8MB3 and UTF8MB4

5.1 Upgrading from UTF8 to UTF8MB4

This migration can be done without loss, as the former is a subset of the latter. For instance, upgrading the character set from MySQL 5.7 to MySQL 8.0 presents no issues in this scenario.

5.2 UTF8MB4 to UTF8 Compatibility

This process is akin to downgrading. If the characters included in UTF8MB4 do not exceed the range supported by UTF8, the migration can proceed smoothly; otherwise, it will fail.

6. Character Set System Parameters

MySQL character sets involve the following parameters:

6.1 MySQL Service Layer

The following two settings define the default character set and collation rules for the MySQL service layer, which represent the default character set and collation rules after the MySQL service starts.

character_set_server: The default character set encoding for the service layer.
collation_server: The default collation rule for the service layer.

6.2 Client Layer

character_set_client: Sets the character set for the client. It applies to any client that can connect to the MySQL server.

6.3 Database Layer

character_set_database: Sets the default character set for new databases created.
collation_database: Sets the default collation rule name for new databases created.

6.4 Metadata Layer

For database names, table names, column names, user names, etc.

character_set_system: The default character set for MySQL metadata, which is currently not settable and is fixed as UTF8.

6.5 Result Set Layer

character_set_results: Sets the character set for data sent from the server to the client, including query results and error message outputs.

6.6 Connection Layer

character_set_connection: Sets the character encoding for data sent from the client to the server before the server has accepted it, such as plain strings or SQL statements that have been written but not yet executed.
collation_connection: The collation rule for the connection layer.

6.7 File System Layer

character_set_filesystem: Sets the character set for file names involved in statements, such as load data into table t1'/tmp/t1.txt';. This represents the encoding in which the file name /tmp/t1.txt is parsed by MySQL.

The client layer, connection layer, and result set layer are usually set together. For example, set names utf8; sets the parameters for these three layers simultaneously.

The service layer must choose the correct encoding to avoid incorrect default character sets for subsequent tables, fields, stored procedures, etc., which could lead to character set upgrades. If compatibility is maintained, there will be no issues; otherwise, it may result in garbled text or other errors.

Summary

This concludes the introduction to the concepts of MySQL character sets. To summarize, this article mainly introduces the basic concepts related to character sets and the general situation of MySQL character set parameters. It also provides examples to illustrate the precautions for converting between UTF8MB3 and UTF8MB4, which we hope will be helpful to everyone.

👋 See you in the next lesson.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

Lesson 6 of the SQL Optimization Course: Combing Through the Concepts of MySQL Character Sets

Introduction

Concepts

1. Character Encoding Categories

1.1 ASCII

1.2 UNICODE

2. Common Character Sets in MySQL

2.1 Latin1

2.2 GB18030

2.3 UTF8

3. Viewing Character Sets

3.1 SQL Statement

3.2 Viewing Metadata Dictionary Table

Query Results:

4. Viewing Collation Rules

4.1 SQL Statement

4.2 Viewing Metadata Dictionary Table

Query Results:

4.3 NO PAD vs PAD SPACE

5. Migrating Between UTF8MB3 and UTF8MB4

5.1 Upgrading from UTF8 to UTF8MB4

5.2 UTF8MB4 to UTF8 Compatibility

6. Character Set System Parameters

6.1 MySQL Service Layer

6.2 Client Layer

6.3 Database Layer

6.4 Metadata Layer

6.5 Result Set Layer

6.6 Connection Layer

6.7 File System Layer

Summary

Recommended reading

What is SQLFlash?

How to use SQLFlash in a database?

Ready to elevate your SQL performance?