When working with MySQL, one common question is which character set to use. On top of that, the specification of the sorting sequence (COLLATE) often appears to be an even larger mystery. In many cases the answer is to simply configure everything as UTF8. There is nothing wrong with that, but keep in mind that MySQL itself is really flexible when it comes to character sets and sorting sequences: you can mix them at the server, database, table, and even column level. If the system already provides so many possibilities, isn't it worth looking into them a bit more carefully and trying to use these settings efficiently?
What exactly is a character set anyway?
The MySQL documentation describes a character set like this:
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
Each character set defines its own set of symbols or elements. These can be visible symbols such as letters, numbers, special characters, or punctuation marks, but also invisible control characters such as line breaks. In Latin-1, for example, the letter "Ü" is encoded as 220. In another character set, the same value can be assigned to a completely different symbol: in the EBCDIC character set, the value 220 encodes the bracket }.
Also, the number of characters that can be represented varies greatly from one character set to another. ASCII, for example, has 95 printable characters, while Unicode (version 5.1) has over 100,000 characters. And with the number of characters grows the complexity and quantity of rules.
The simplest way to compare two characters is binary sorting: it tests whether the ordinal number of one character is higher than that of another. A comes before B because A has the ordinal number 65 and B has 66; so far, so simple. But what if we compare an uppercase B (66) with a lowercase a (97)? Either the lowercase a is ordered after the B, or the character set needs an additional rule that ignores case. You can recognize such case-insensitive sorting sequences in MySQL by the suffix _ci in their name.
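This difference is easy to try out. A minimal sketch, assuming a MySQL session whose connection character set is utf8 (the literal values here are just examples):

```sql
-- Binary collation compares code points, so case matters:
SELECT 'a' = 'A' COLLATE utf8_bin;        -- 0 (not equal)

-- The case-insensitive collation treats them as equal:
SELECT 'a' = 'A' COLLATE utf8_general_ci; -- 1 (equal)

-- Under binary rules, 'B' (66) sorts before 'a' (97):
SELECT 'B' < 'a' COLLATE utf8_bin;        -- 1 (true)
```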
Special characters complicate things further: in the German sorting order, Ü is treated as UE, but what about é or è; which rules apply in those cases? In short, COLLATE defines how strings are compared, and in the end this can be a complex set of rules.
So what is the “right” character set then?
Not a trivial question, and there is no simple answer. Consider the following example: we have a table USER which stores the following information: username, name, password, e-mail, and country. All fields are text fields, but they come with very different constraints:
- The password field should not be stored in plain text. MD5 is often used for this (strictly speaking a hash function, not encryption). MD5 always produces a string of 32 characters, and these 32 characters are always ASCII (hexadecimal digits).
- The e-mail address, however, may contain special characters outside the basic ASCII range since the introduction of internationalized domain names. For example, German special characters are possible in the domain part of an e-mail address. In the local part (before the @ sign), all characters above ASCII code 127 are generally forbidden, but the specification is case-sensitive: the addresses email@example.com and Hans.Mueller@email.de could be two different addresses (even if this is hardly ever the case in practice). And what about the length: how many characters can an e-mail address have? According to RFC 5321, the local part can consist of a maximum of 64 characters and the domain part of 255 characters. Adding the @ sign gives a total length of 320 characters. That is a theoretical value, however: RFC-compliant SMTP servers can only handle a maximum length of 255 characters, since the definition of the path element in RFC 5321 allows only a correspondingly reduced length.
- The "real name" of the user, depending on how international the audience is, can contain just about any letter from any alphabet. Whether the user is Hebrew, Russian, or Japanese, the column should be able to handle the appropriate characters.
- The rules for a "user name" are generally stricter than those for the real name. Characters that are not present in Latin-1 (ISO-8859-1) or US-ASCII are rarely allowed.
- In many projects, the country is represented by a two- or three-letter code; these characters are always part of the ASCII character set.
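The claim about MD5 above is easy to verify. A quick sketch (the input value 'secret' is just an example):

```sql
-- MD5 always yields a 32-character hexadecimal string:
SELECT MD5('secret');

SELECT CHAR_LENGTH(MD5('secret')); -- 32 characters, regardless of the input length
SELECT LENGTH(MD5('secret'));      -- also 32 bytes: hex digits are plain ASCII
```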
It is therefore quite possible that the data stored within a single table could be represented in different character sets. In this example we can distinguish ASCII, Latin-1, and UTF8. But what would be the advantage of using different character sets at all, or could they even cause disadvantages?
As mentioned above, UTF8 is becoming more and more popular; since 2008 it has been the most widely used character set on the World Wide Web. With UTF8, almost every alphabet can be represented, so it is no wonder that everyone is switching to this character set.
MySQL's default character set (Latin-1), for example, has only 256 representable characters (not even including the Euro symbol). The UTF8 encoding scheme, which works with a variable length of 1-4 bytes per character, represents an incomparably larger range of characters.
The UTF8 encoding has its price
UTF8-encoded characters from the ASCII range require 1 byte of storage; special characters in languages like Hebrew, Turkish, or German occupy 2 bytes; and Japanese or Chinese characters claim 3 bytes (MySQL's utf8 character set currently supports only sequences of up to 3 bytes). Latin-1 or ASCII always requires just 1 byte per character. For example, if you define the password field as follows:
PASSWORD CHAR(32) CHARSET utf8
We get the following: a CHAR column is a fixed-length column; it always has the same length, and if fewer characters are inserted, MySQL pads the missing characters with spaces. For this column, MySQL must ensure that up to 32 characters fit, at a maximum of 3 bytes per character, so the column always occupies 32 * 3 = 96 bytes, while the same column in ASCII would occupy only 32 bytes. No problem, change the type from CHAR to VARCHAR? A VARCHAR needs only as many bytes as are actually necessary in the cell (plus 1 to 2 bytes for the length). Since we know all the characters are ASCII, we get 32 bytes plus 2 bytes for the length: 34 instead of 96 bytes for each entry.
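You can observe the byte cost of UTF8 directly: MySQL's LENGTH() counts bytes while CHAR_LENGTH() counts characters. A minimal sketch (the string values are examples):

```sql
-- ASCII-range characters need 1 byte each in utf8:
SELECT LENGTH(CONVERT('abc' USING utf8));      -- 3 bytes

-- German umlauts need 2 bytes each in utf8:
SELECT LENGTH(CONVERT('äöü' USING utf8));      -- 6 bytes
SELECT CHAR_LENGTH(CONVERT('äöü' USING utf8)); -- still 3 characters

-- The same string in latin1 is 1 byte per character:
SELECT LENGTH(CONVERT('äöü' USING latin1));    -- 3 bytes
```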
Since MySQL 5, there is no longer an implicit type switch between CHAR and VARCHAR. In earlier versions, VARCHAR columns defined with fewer than 4 characters were automatically converted to CHAR, and CHAR columns defined with more than 3 characters (assuming the table contained at least one dynamic column) were converted to VARCHAR, so there wasn't really a big difference. Since version 5, the type stays as it was defined:
CREATE TABLE testVARCHAR_CHAR (
  col1 CHAR(40),
  col2 VARCHAR(2),
  col3 CHAR(3)
);

DESC testVARCHAR_CHAR;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| col1  | char(40)   | YES  |     | NULL    |       |
| col2  | varchar(2) | YES  |     | NULL    |       |
| col3  | char(3)    | YES  |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
If dynamic columns handle space more efficiently than fixed-length ones, why do we use fixed-length columns at all?
This PHPPerformance article offers some impressive numbers. Updates or deletions of data can cause fragmentation in dynamic tables, and this has a measurable effect on performance: in this specific case, the query time difference between the dynamic and the static variant was 0.6 versus 0.1 seconds.
Back to my user table example: the password column always has the same length (32 characters), so the CHAR type is the optimal solution. The characters are always ASCII, so the ideal character set for this column is ASCII rather than UTF8; you gain performance and also save bytes on each entry. A similar argument applies to the column with the country code.
UTF8 should be used for the e-mail, (real) name, and username fields. Even if the username consists exclusively of characters from the Latin-1 range, a different encoding could cause problems (for example when printing the username on the site). Since all of these columns can have very variable value lengths, they should all be defined as VARCHAR:
CREATE TABLE user (
  id       INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  password CHAR(32)     CHARSET ascii,
  name     VARCHAR(50)  CHARSET utf8,
  username VARCHAR(20)  CHARSET utf8,
  email    VARCHAR(255) CHARSET utf8,
  country  CHAR(2)      CHARSET ascii,
  UNIQUE KEY (username)
);
The UNIQUE KEY on the “username” field requires a unique user name, but what does unique mean in this case? Does it make a difference whether I am the user “axel”, “Axel” or “AXEL”?
Let's assume the username should be case-sensitive, and "Axel" could be a different user than "axel" (for a username this could cause a lot of login errors, but that is a question of UI/UX). To achieve this, you have to specify a COLLATE:
[...] username VARCHAR(20) CHARSET utf8 COLLATE utf8_bin, [...]
COLLATE definitions whose names end in _bin or _cs are case-sensitive. With utf8_bin, the usernames "axel", "Axel", and "AXEL" can all be inserted into the table without triggering the error

#1062 - Duplicate entry 'Axel' for key 'username'
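A minimal sketch of the difference, assuming the user table defined above (all inserted values are made-up examples):

```sql
-- With username defined as CHARSET utf8 COLLATE utf8_bin,
-- all three of these inserts succeed:
INSERT INTO user (password, name, username, email, country)
  VALUES (MD5('pw1'), 'Axel One',   'axel', 'axel1@example.com', 'DE');
INSERT INTO user (password, name, username, email, country)
  VALUES (MD5('pw2'), 'Axel Two',   'Axel', 'axel2@example.com', 'DE');
INSERT INTO user (password, name, username, email, country)
  VALUES (MD5('pw3'), 'Axel Three', 'AXEL', 'axel3@example.com', 'DE');

-- With the default utf8_general_ci instead, the second insert
-- would already fail with error #1062 (duplicate key 'username').
```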
The column "name" had also the character set UTF8, keeps the default sort order (utf8_general_ci), which is correct for many comparisons - but not for all. An example: All German users should be displayed in alphabetical order:
SELECT name FROM user WHERE country='DE' ORDER BY name;
+-------------+
| name        |
+-------------+
| Aachen      |
| Aechten     |
| Affe        |
| Ärger       |
| Äußern      |
| Axel Michel |
+-------------+
As you can see, the sorting does not correspond to the German sorting rules: the letter Ä should be treated as AE, not as A. Here, too, the sorting sequence can help:
SELECT name FROM user WHERE country='DE'
ORDER BY CAST(name AS CHAR CHARACTER SET latin1) COLLATE latin1_german2_ci;
+-------------+
| name        |
+-------------+
| Aachen      |
| Aechten     |
| Ärger       |
| Äußern      |
| Affe        |
| Axel Michel |
+-------------+
Now the result is sorted correctly (in the sense of the German alphabet): the character set for the sorting was changed to latin1, and then the corresponding sorting sequence was selected. To sum up: character sets and sorting sequences are not rigid, "invariable" properties of a database. They depend heavily on the content and type of the respective table columns and, as the last example shows, even on what you want to do with the data.