Collation and encoding in databases

2024-08-29 · 14 min read

Ler em português

Introduction

Every information in a computer is stored and processed in its binary number representation.

But what about texts? A text is a sequence of characters, a character being a letter or a symbol. Each character is represented by a binary number and the character-number mapping is called encoding.

The table below is for Windows-1252 encoding, which addresses 255 possible characters, each represented by one byte. You can see this table on Windows Character Map (Win+R, charmap).

Windows-1252 encoding table

The collation is the ordering and comparison rule for texts. It also determines which encoding is used for storage.

On Microsoft SQL Server, in the collation's name, CI / CS means case (in)sensitive; AI / AS, accent (in)sensitive. With a case-insensitive (CI) collation, for example, searching for either silva or SILVA yields the same results; with an accent-insensitive (AI) collation, Acucar and Açúcar are considered equal for comparisons.

Accent-insensitive is insensitive actually for all the diacritics of a language, like cedillas (ç, ş) and accents (é, ã, ô, etc). Some older collations do not cover all diacritics on comparisons (link), so newer collations should be preferred.

SELECT
  [Name],
  COLLATIONPROPERTY( [Name], 'LCID' ) AS [LCID],
  COLLATIONPROPERTY( [Name], 'CodePage' ) AS [CodePage]
FROM sys.fn_helpcollations()
ORDER BY [Name];
NameEncodingCompares casingCompares accents
Latin1_General_
CI_AS
Windows-1252NoYes
Latin1_General_
100_CI_AS_
KS_SC_UTF8
UTF-8 (65001)NoYes
Latin1_General_
100_CS_AS_
KS_SC_UTF8
UTF-8 (65001)YesYes

Unicode, UTF-8 and UTF-16

Unicode is a table that defines a number for each character, covering symbols, digits and letters from many languages. The number attributed to a character is called code point.

UTF-8 and UTF-16 are encodings that follow Unicode. Basically, they are ways of storing those numbers into bytes.

UTF-16 uses 2 bytes for most chars and 4 for those above the standard range. This encoding is used in the strings of most programming languages.

UTF-8 uses a variable number of bytes, starting at 1 and up to 4 for a char. It's the main encoding on the Internet.

Unicode rangeGroupsBytes per char, UTF-8Bytes per char, UTF-16
0x0000 - 0x007FBasic latin alphabet, arabic digits (0-9), basic keyboard symbols12
0x0080 - 0x07FFExtended latin alphabet, greek, cyrillic, arabic, hebrew22
0x0800 - 0xFFFFJapanese and chinese ideograms; varied symbols; math operators32
0x010000 - 0x10FFFFAncient writing pictograms (e.g. egyptian hieroglyphs); emojis; musical symbols44

The choice of encoding directly affects the size of text storage. If most characters lie in the basic latin range, UTF-8 is better, because it uses fewer bytes than UTF-16; however, if it's an asian text, UTF-16 is the best, because each character occupies 2 bytes, instead of 3 on UTF-8.

The table below shows how an Unicode number is converted to UTF-8 or UTF-16, for each range above.

Example characterCode point, in binaryIn UTF-8In UTF-16
P (0x0050)001100100011001000000000 00110010
Ω (0x03A9)00000011 1010100111001110 1010100100000011 10101001
(0x20AC)00100000 1010110011100010 10000010 1010110000100000 10101100
🐎 (0x1F40E)00000001 11110100 0000111011110000 10011111 10010000 1000111011011000 00111101 11011100 00001110

The logic for UTF-16 code points above 0x010000 is:

U = code point
W1 = 2 upper bytes
W2 = 2 lower bytes

W = U - 0x10000 
W = yyyyyyyyyyxxxxxxxxxx (20 binary digits)
W1 = 110110yy yyyyyyyy
W2 = 110111xx xxxxxxxx

-> there is no risk of W1 and W2 being mistaken for 
other characters because the possible interval for them 
is protected on the Unicode table.

Texts in SQL databases

CHAR and NCHAR store fixed-size texts; VARCHAR and NVARCHAR store variable-sized texts.

NCHAR and NVARCHAR are types present on SQL Server and the 'N' indicates that they store text in UTF-16 encoding. CHAR and VARCHAR, on the other hand, store according to the encoding of the database's collation.

The storage size is specified on the column type declaration, such as NVARCHAR(n). Many people think that n is the number of characters, but that is not true. For CHAR and VARCHAR, n defines the size in bytes; for NCHAR and NVARCHAR, n is the size in byte-pairs (x2).

Practical example

Let's have two databases, one with the Latin1_General_CI_AS collation (Windows-1252 encoding) and another with Latin1_General_100_CI_AS_KS_SC_UTF8 collation (UTF-8 encoding). For each of them, we will compare the storage sizes between VARCHAR and NVARCHAR, for texts in basic and extended latin alphabet, greek, japanese and emojis. Below, the script to run:

CREATE TABLE [dbo].[Person] (
  [Name] VARCHAR(24) NOT NULL,
  [NameUtf16] NVARCHAR(24) NOT NULL
);

INSERT INTO [dbo].[Person] VALUES
('Pericles','Pericles'), -- latin without accent
('Péricles','Péricles'), -- latin with accent
(N'Περικλῆς',N'Περικλῆς'), -- greek
(N'美しいキモノ',N'美しいキモノ'), -- japanese katakana
(N'Santa Claus 🎅',N'Santa Claus 🎅'); -- with emoji
-- the N prefix is necessary for unicode strings

SELECT
  [Name], DATALENGTH([Name]) AS [SizeInBytes],
  [NameUtf16], DATALENGTH([NameUtf16]) AS [SizeInBytes]
FROM [dbo].[Person];

DROP TABLE [dbo].[Person];

Latin1 General CI AS

VARCHAR nameSize in bytesNVARCHAR nameSize in bytes
Pericles8Pericles16
Péricles8Péricles16
?e??????8Περικλῆς16
??????6美しいキモノ12
Santa Claus ??14Santa Claus 🎅28

Note that Windows-1252 encoding does not support greek and japanese characters, nor emojis, that are replaced by '?'. Despite that, it handles very well latin words, with only 1 byte per letter, even on those with accents or cedillas.

Latin1 General 100 CI AS KS SC UTF8

VARCHAR nameSize in bytesNVARCHAR nameSize in bytes
Pericles8Pericles16
Péricles9Péricles16
Περικλῆς17Περικλῆς16
美しいキモノ18美しいキモノ12
Santa Claus 🎅16Santa Claus 🎅28

With an UTF-8 collation, the VARCHAR field successfully supported all characters and had a higher efficiency for most cases. The third name, Περικλῆς, needed 17 bytes because the letter ῆ (unicode 0x1FC6) is from ancient greek and requires 3 bytes in UTF-8 encoding. For japanese katakana, UTF-16 proved more efficient.

Sources and interesting reads

A

AlexandreHTRB

Campinas/SP,
Brasil