Lesson 6 of the SQL Optimization Course: Combing Through the Concepts of MySQL Character Sets

For relational databases, the design of tables and SQL is written are particularly crucial. It wouldn’t be an exaggeration to say that they account for 90% of performance. So this time, specifically targeting these two major knowledge areas, we’ll conduct a detailed analysis for you, peeling back the layers.
This Series uses plain and understandable language and selects a large number of examples to elaborate on the subtleties for you.
🧑💻 Target audience:
We will use MySQL as the demonstration database.
The content of a character set includes two main components: character set and collation rule.
Each character set can correspond to one or multiple collation rules, while each collation rule corresponds to a specific character set.
A character set is a collection of mappings between characters and their encodings, like this:
Character | Encoding |
---|---|
A | 0 |
B | 1 |
Collation rules are a set of rules within a character set used to compare each character, essentially defining the sorting order of characters.
For example, to compare the size of characters A and B, the simplest and most intuitive method is to compare their corresponding encodings. If the encoding for A is 0 and for B is 1, then under this rule, A < B. Such a collection of rules is what we call a collation rule. This applies not only to single-byte character encodings but also to multi-byte encodings, following the same principle.
Next, I will provide a detailed introduction to character sets and their usage scenarios.
Purpose: Used to map simple single - byte characters, such as uppercase and lowercase English letters, Arabic numerals, common punctuation marks, operators, and control characters.
Encoding Range: U+0000 – U+007F
Note: Sufficient for scenarios involving these characters, but cannot represent characters like Chinese or Japanese.
Purpose: Used to map all characters, including those in ASCII.
Encoding Range: U+0000 – U+10FFFF
Note: ASCII is a subset of UNICODE, and ASCII - encoded characters can be losslessly converted to UNICODE - encoded characters.
Latin1 is an alias for cp1252 or ISO-8859-1. The ISO-8859-1 encoding is a single - byte encoding that is backward compatible with ASCII.
Encoding Range: U+0000 – U+00FF
Range | Character Content | Description |
---|---|---|
U+0000 - U+007F | Single-byte characters | Identical to ASCII |
U+0080 - U+009F | Control characters | |
U+0080 - U+00FF | Printable characters |
In addition to the characters included in ASCII, ISO-8859-1 also covers characters for Western European languages, Greek, Thai, Arabic, and Hebrew.
Since the entire single - byte space is occupied by ISO-8859-1 encoding, it can be used to store and transmit byte streams of any other encoding.
For example, storing a Utf8mb4 or GBK encoded byte stream in Latin1 will not cause any issues. This is because Latin1 preserves the original byte stream, which is why MySQL has long used Latin1 as the default character set.
However, since Latin1 stores all characters as byte streams, it leads to a waste of character count.
For example:
|
|
The number of characters stored in the UTF8 field is three times that of Latin1!!!
GB18030 is the official Chinese standard character set, compatible with GBK and GB2312, and is a superset of these two. It uses 1, 2, or 4 bytes to represent a symbol. For example, common Chinese characters are stored using two bytes by default. Windows systems use GB18030 by default.
If you only need to store Chinese characters, GB18030 is the best choice for two reasons:
It occupies less space, for example, compared to UTF8.
Chinese characters are sorted based on pinyin, making retrieval faster.
UTF8 is an encoding implementation of Unicode and can store any character corresponding to the UNICODE encoding. It is the most widely used encoding. Its main feature is the variable - length encoding method, which uses 1 to 4 bytes to represent a symbol based on the length of the encoded bytes for different symbols.
For example, letters or numbers use 1 byte, Chinese characters use 3 bytes, and emoji symbols use 4 bytes. The UTF8 character set is currently the most widely used.
Note: In MySQL, the commonly referred-to UTF8 is actually an alias for UTF8MB3, which is a subset of UTF8MB4. UTF8MB4 is the true 4 - byte UTF8 character set!
UTF8MB3 supports a maximum of 3 bytes for character storage, while UTF8MB4 supports up to 4 bytes. Based on actual needs and future considerations, MySQL 8.0 now defaults to using UTF8MB4 as the base character set.
MySQL now supports a wide range of character sets. There are two methods to view the list of character sets supported by MySQL:
|
|
|
|
|
|
|
|
NO PAD (Processed)
If there are spaces after a character, treat the spaces as a character. That is, do not ignore the existence of spaces when comparing.
PAD SPACE (Ignored)
Indicates that if there are spaces after a character, spaces can be ignored for comparison. That is, spaces are optional.
This migration can be done without loss, as the former is a subset of the latter. For instance, upgrading the character set from MySQL 5.7 to MySQL 8.0 presents no issues in this scenario.
This process is akin to downgrading. If the characters included in UTF8MB4 do not exceed the range supported by UTF8, the migration can proceed smoothly; otherwise, it will fail.
MySQL character sets involve the following parameters:
The following two settings define the default character set and collation rules for the MySQL service layer, which represent the default character set and collation rules after the MySQL service starts.
character_set_server
: The default character set encoding for the service layer.collation_server
: The default collation rule for the service layer.character_set_client
: Sets the character set for the client. It applies to any client that can connect to the MySQL server.character_set_database
: Sets the default character set for new databases created.collation_database
: Sets the default collation rule name for new databases created.For database names, table names, column names, user names, etc.
character_set_system
: The default character set for MySQL metadata, which is currently not settable and is fixed as UTF8.character_set_results
: Sets the character set for data sent from the server to the client, including query results and error message outputs.character_set_connection
: Sets the character encoding for data sent from the client to the server before the server has accepted it, such as plain strings or SQL statements that have been written but not yet executed.collation_connection
: The collation rule for the connection layer.character_set_filesystem
: Sets the character set for file names involved in statements, such as load data into table t1'/tmp/t1.txt';
. This represents the encoding in which the file name /tmp/t1.txt
is parsed by MySQL.The client layer, connection layer, and result set layer are usually set together. For example, set names utf8;
sets the parameters for these three layers simultaneously.
The service layer must choose the correct encoding to avoid incorrect default character sets for subsequent tables, fields, stored procedures, etc., which could lead to character set upgrades. If compatibility is maintained, there will be no issues; otherwise, it may result in garbled text or other errors.
This concludes the introduction to the concepts of MySQL character sets. To summarize, this article mainly introduces the basic concepts related to character sets and the general situation of MySQL character set parameters. It also provides examples to illustrate the precautions for converting between UTF8MB3 and UTF8MB4, which we hope will be helpful to everyone.
👋 See you in the next lesson.
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
Join us and experience the power of SQLFlash today!.