Lesson 6 of the SQL Optimization Course: Combing Through the Concepts of MySQL Character Sets
data:image/s3,"s3://crabby-images/5099e/5099e5bfd7098c692bc0f623c0f52637241b9a65" alt="Lesson 6 of the SQL Optimization Course: Combing Through the Concepts of MySQL Character Sets"
Table of Contents
Introduction
For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.
This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.
🧑💻 Target audience:
- DBA
- Database developers
- Students
We will use MySQL as the demonstration database.
Lesson 6: Combing Through the Concepts of MySQL Character Sets
Concepts
The content of a character set includes two main components: character set and collation rule.
Each character set can correspond to one or multiple collation rules, while each collation rule corresponds to a specific character set.
A character set is a collection of mappings between characters and their encodings, like this:
Character | Encoding |
---|---|
A | 0 |
B | 1 |
Collation rules are a set of rules within a character set used to compare each character, essentially defining the sorting order of characters.
For example, to compare the size of characters A and B, the simplest and most intuitive method is to compare their corresponding encodings. If the encoding for A is 0 and for B is 1, then under this rule, A < B. Such a collection of rules is what we call a collation rule. This applies not only to single-byte character encodings but also to multi-byte encodings, following the same principle.
Next, I will provide a detailed introduction to character sets and their usage scenarios.
1. Character Encoding Categories
1.1 ASCII
Purpose: Used to map simple single - byte characters, such as uppercase and lowercase English letters, Arabic numerals, common punctuation marks, operators, and control characters.
Encoding Range: U+0000 – U+007F
Note: Sufficient for scenarios involving these characters, but cannot represent characters like Chinese or Japanese.
1.2 UNICODE
Purpose: Used to map all characters, including those in ASCII.
Encoding Range: U+0000 – U+10FFFF
Note: ASCII is a subset of UNICODE, and ASCII - encoded characters can be losslessly converted to UNICODE - encoded characters.
2. Common Character Sets in MySQL
2.1 Latin1
Latin1 is an alias for cp1252 or ISO-8859-1. The ISO-8859-1 encoding is a single - byte encoding that is backward compatible with ASCII.
Encoding Range: U+0000 – U+00FF
Range | Character Content | Description |
---|---|---|
U+0000 - U+007F | Single-byte characters | Identical to ASCII |
U+0080 - U+009F | Control characters | |
U+0080 - U+00FF | Printable characters |
In addition to the characters included in ASCII, ISO-8859-1 also covers characters for Western European languages, Greek, Thai, Arabic, and Hebrew.
Since the entire single - byte space is occupied by ISO-8859-1 encoding, it can be used to store and transmit byte streams of any other encoding.
For example, storing a Utf8mb4 or GBK encoded byte stream in Latin1 will not cause any issues. This is because Latin1 preserves the original byte stream, which is why MySQL has long used Latin1 as the default character set.
However, since Latin1 stores all characters as byte streams, it leads to a waste of character count.
For example:
|
|
The number of characters stored in the UTF8 field is three times that of Latin1!!!
2.2 GB18030
GB18030 is the official Chinese standard character set, compatible with GBK and GB2312, and is a superset of these two. It uses 1, 2, or 4 bytes to represent a symbol. For example, common Chinese characters are stored using two bytes by default. Windows systems use GB18030 by default.
If you only need to store Chinese characters, GB18030 is the best choice for two reasons:
It occupies less space, for example, compared to UTF8.
Chinese characters are sorted based on pinyin, making retrieval faster.
2.3 UTF8
UTF8 is an encoding implementation of Unicode and can store any character corresponding to the UNICODE encoding. It is the most widely used encoding. Its main feature is the variable - length encoding method, which uses 1 to 4 bytes to represent a symbol based on the length of the encoded bytes for different symbols.
For example, letters or numbers use 1 byte, Chinese characters use 3 bytes, and emoji symbols use 4 bytes. The UTF8 character set is currently the most widely used.
Note: In MySQL, the commonly referred-to UTF8 is actually an alias for UTF8MB3, which is a subset of UTF8MB4. UTF8MB4 is the true 4 - byte UTF8 character set!
UTF8MB3 supports a maximum of 3 bytes for character storage, while UTF8MB4 supports up to 4 bytes. Based on actual needs and future considerations, MySQL 8.0 now defaults to using UTF8MB4 as the base character set.
3. Viewing Character Sets
MySQL now supports a wide range of character sets. There are two methods to view the list of character sets supported by MySQL:
3.1 SQL Statement
|
|
3.2 Viewing Metadata Dictionary Table
|
|
Query Results:
- The first column represents the name of the character set.
- The second column indicates the collation rule of the character set.
- The third column provides a description of the character set.
- The fourth column shows the maximum byte length of the character set encoding.
4. Viewing Collation Rules
4.1 SQL Statement
|
|
4.2 Viewing Metadata Dictionary Table
|
|
Query Results:
- The first column represents the name of the collation rule.
- The second column indicates the corresponding character set name.
- The fourth column indicates whether it is the default collation rule.
- The last column indicates whether to compare spaces following characters during sorting.
4.3 NO PAD vs PAD SPACE
NO PAD (Processed)
If there are spaces after a character, treat the spaces as a character. That is, do not ignore the existence of spaces when comparing.
PAD SPACE (Ignored)
Indicates that if there are spaces after a character, spaces can be ignored for comparison. That is, spaces are optional.
5. Migrating Between UTF8MB3 and UTF8MB4
5.1 Upgrading from UTF8 to UTF8MB4
This migration can be done without loss, as the former is a subset of the latter. For instance, upgrading the character set from MySQL 5.7 to MySQL 8.0 presents no issues in this scenario.
5.2 UTF8MB4 to UTF8 Compatibility
This process is akin to downgrading. If the characters included in UTF8MB4 do not exceed the range supported by UTF8, the migration can proceed smoothly; otherwise, it will fail.
6. Character Set System Parameters
MySQL character sets involve the following parameters:
6.1 MySQL Service Layer
The following two settings define the default character set and collation rules for the MySQL service layer, which represent the default character set and collation rules after the MySQL service starts.
character_set_server
: The default character set encoding for the service layer.collation_server
: The default collation rule for the service layer.
6.2 Client Layer
character_set_client
: Sets the character set for the client. It applies to any client that can connect to the MySQL server.
6.3 Database Layer
character_set_database
: Sets the default character set for new databases created.collation_database
: Sets the default collation rule name for new databases created.
6.4 Metadata Layer
For database names, table names, column names, user names, etc.
character_set_system
: The default character set for MySQL metadata, which is currently not settable and is fixed as UTF8.
6.5 Result Set Layer
character_set_results
: Sets the character set for data sent from the server to the client, including query results and error message outputs.
6.6 Connection Layer
character_set_connection
: Sets the character encoding for data sent from the client to the server before the server has accepted it, such as plain strings or SQL statements that have been written but not yet executed.collation_connection
: The collation rule for the connection layer.
6.7 File System Layer
character_set_filesystem
: Sets the character set for file names involved in statements, such asload data into table t1'/tmp/t1.txt';
. This represents the encoding in which the file name/tmp/t1.txt
is parsed by MySQL.
The client layer, connection layer, and result set layer are usually set together. For example, set names utf8;
sets the parameters for these three layers simultaneously.
The service layer must choose the correct encoding to avoid incorrect default character sets for subsequent tables, fields, stored procedures, etc., which could lead to character set upgrades. If compatibility is maintained, there will be no issues; otherwise, it may result in garbled text or other errors.
Summary
This concludes the introduction to the concepts of MySQL character sets. To summarize, this article mainly introduces the basic concepts related to character sets and the general situation of MySQL character set parameters. It also provides examples to illustrate the precautions for converting between UTF8MB3 and UTF8MB4, which we hope will be helpful to everyone.
👋 See you in the next lesson.
Recommended reading
What is SQLFlash?
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.