Collation and encoding in databases

Introduction

All information in a computer is stored and processed in binary form.

But what about text? A text is a sequence of characters, a character being a letter or a symbol. Each character is represented by a binary number, and the mapping between characters and numbers is called an encoding.

Windows-1252, for example, is an encoding that addresses up to 256 possible characters, each represented by one byte. You can see its table in the Windows Character Map (Win+R, charmap).
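As a small illustration of a single-byte encoding, the sketch below uses Python (not part of the original setup, chosen only because it ships with a `cp1252` codec for Windows-1252). It prints the number assigned to each character of a sample word.

```python
# Windows-1252 is available in Python as the "cp1252" codec.
word = "Açúcar"
encoded = word.encode("cp1252")

# Each character maps to exactly one byte (one number from 0 to 255).
for char, byte in zip(word, encoded):
    print(f"{char} -> {byte} (0x{byte:02X}, {byte:08b})")

print(len(word), "characters,", len(encoded), "bytes")
```

Both counts are the same here because every character of the word exists in Windows-1252.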

A collation is the set of rules for ordering and comparing text. In SQL Server, it also determines which encoding (code page) is used to store CHAR and VARCHAR data.

On Microsoft SQL Server, in the collation's name, CI / CS means case (in)sensitive; AI / AS, accent (in)sensitive. With a case-insensitive (CI) collation, for example, searching for either silva or SILVA yields the same results; with an accent-insensitive (AI) collation, Acucar and Açúcar are considered equal for comparisons.

Accent-insensitive actually means insensitive to all the diacritics of a language, such as cedillas (ç, ş) and accents (é, ã, ô, etc.). Some older collations do not cover all diacritics in comparisons, so newer collations should be preferred.
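The idea behind case-insensitive and accent-insensitive comparison can be sketched in Python. This is only an approximation built on Unicode normalization, not the algorithm SQL Server collations use, and the helper names (`strip_diacritics`, `equal_ci`, `equal_ci_ai`) are made up for the illustration.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits "ç" into "c" plus a combining cedilla; drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def equal_ci(a: str, b: str) -> bool:
    """Case-insensitive, accent-sensitive comparison (roughly CI_AS)."""
    return a.casefold() == b.casefold()

def equal_ci_ai(a: str, b: str) -> bool:
    """Case-insensitive, accent-insensitive comparison (roughly CI_AI)."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()

print(equal_ci("silva", "SILVA"))
print(equal_ci("Acucar", "Açúcar"))
print(equal_ci_ai("Acucar", "Açúcar"))
```

The first and third comparisons treat the strings as equal. The second does not, because the accents are still compared.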

SELECT
  [Name],
  COLLATIONPROPERTY( [Name], 'LCID' ) AS [LCID],
  COLLATIONPROPERTY( [Name], 'CodePage' ) AS [CodePage]
FROM sys.fn_helpcollations()
ORDER BY [Name];

Name	Encoding	Compares casing	Compares accents
Latin1_General_CI_AS	Windows-1252	No	Yes
Latin1_General_100_CI_AS_KS_SC_UTF8	UTF-8 (65001)	No	Yes
Latin1_General_100_CS_AS_KS_SC_UTF8	UTF-8 (65001)	Yes	Yes

Unicode, UTF-8 and UTF-16

Unicode is a standard that defines a number for each character, covering symbols, digits and letters from many languages. The number assigned to a character is called a code point.

UTF-8 and UTF-16 are encodings that follow Unicode. Basically, they are ways of storing those numbers as bytes.

UTF-16 uses 2 bytes for most characters and 4 bytes for those above 0xFFFF. It is the in-memory string encoding of several languages and platforms, such as Java, JavaScript and .NET.

UTF-8 uses a variable number of bytes per character, from 1 to 4. It is the dominant encoding on the web.

Unicode range	Groups	Bytes per char, UTF-8	Bytes per char, UTF-16
0x0000 - 0x007F	Basic Latin alphabet, Arabic digits (0-9), basic keyboard symbols	1	2
0x0080 - 0x07FF	Extended Latin alphabet, Greek, Cyrillic, Arabic, Hebrew	2	2
0x0800 - 0xFFFF	Japanese and Chinese ideograms; varied symbols; math operators	3	2
0x010000 - 0x10FFFF	Ancient writing pictograms (e.g. Egyptian hieroglyphs); emojis; musical symbols	4	4
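The ranges in the table above can be expressed as a small lookup. The sketch below uses Python, and `bytes_per_char` is a hypothetical helper written for this illustration. It also cross-checks the lookup against Python's built-in codecs for one character from each range.

```python
def bytes_per_char(code_point: int) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for a code point, per the ranges above."""
    if code_point <= 0x007F:
        return 1, 2
    if code_point <= 0x07FF:
        return 2, 2
    if code_point <= 0xFFFF:
        return 3, 2
    return 4, 4

# Cross-check against the real encoders, one sample character per range.
for ch in ["P", "\u03A9", "\u20AC", "\U0001F40E"]:
    measured = (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    print(f"U+{ord(ch):04X}", bytes_per_char(ord(ch)), measured)
```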

The choice of encoding directly affects the size of text storage. If most characters lie in the basic Latin range, UTF-8 is better, because it uses 1 byte per character instead of the 2 bytes of UTF-16; however, for East Asian text (such as Japanese or Chinese), UTF-16 is more compact, because each character occupies 2 bytes instead of 3 in UTF-8.

The table below shows how a Unicode code point is converted to UTF-8 or UTF-16, with one example for each range above.

Example character	Code point, in binary	In UTF-8	In UTF-16
P (0x0050)	01010000	01010000	00000000 01010000
Ω (0x03A9)	00000011 10101001	11001110 10101001	00000011 10101001
€ (0x20AC)	00100000 10101100	11100010 10000010 10101100	00100000 10101100
🐎 (0x1F40E)	00000001 11110100 00001110	11110000 10011111 10010000 10001110	11011000 00111101 11011100 00001110
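These bit patterns can be checked with a few lines of Python. The sketch below is illustrative only, and `bits` is a helper name chosen here. It prints the UTF-8 and UTF-16 bytes of the same four example characters as binary digits.

```python
def bits(text: str, encoding: str) -> str:
    """Encode text and show each resulting byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))

# "utf-16-be" writes the most significant byte first and adds no byte-order mark.
for ch in ["P", "\u03A9", "\u20AC", "\U0001F40E"]:
    print(f"{ch} (0x{ord(ch):04X})", bits(ch, "utf-8"), "|", bits(ch, "utf-16-be"))
```

The big-endian variant is used so that the bytes appear in the same order as in the code point itself.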

The logic for encoding code points from 0x010000 upward in UTF-16 is:

U = code point
W1 = 2 upper bytes
W2 = 2 lower bytes

W = U - 0x10000 
W = yyyyyyyyyyxxxxxxxxxx (20 binary digits)
W1 = 110110yy yyyyyyyy
W2 = 110111xx xxxxxxxx

-> there is no risk of W1 and W2 being mistaken for 
other characters because their possible range 
(0xD800 to 0xDFFF) is reserved in the Unicode table.
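The steps above translate almost line by line into code. The following Python sketch (the function name `surrogate_pair` is chosen for this illustration) computes W1 and W2 and compares the result with Python's own UTF-16 encoder for the horse emoji, 0x1F40E.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point in 0x10000..0x10FFFF into the UTF-16 units (W1, W2)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from 0x10000 to 0x10FFFF need a pair")
    w = code_point - 0x10000      # 20-bit value
    w1 = 0xD800 | (w >> 10)       # 110110 followed by the upper 10 bits
    w2 = 0xDC00 | (w & 0x3FF)     # 110111 followed by the lower 10 bits
    return w1, w2

w1, w2 = surrogate_pair(0x1F40E)
print(f"{w1:016b} {w2:016b}")

# Compare with the built-in encoder (big-endian, no byte-order mark).
manual = bytes([w1 >> 8, w1 & 0xFF, w2 >> 8, w2 & 0xFF])
print(manual == "\U0001F40E".encode("utf-16-be"))
```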

Texts in SQL databases

CHAR and NCHAR store fixed-size texts; VARCHAR and NVARCHAR store variable-sized texts.

In SQL Server, NCHAR and NVARCHAR (the 'N' stands for "national") store text in UTF-16 encoding. CHAR and VARCHAR, on the other hand, store text in the encoding of the column's collation, which defaults to the database's collation.

The storage size is specified in the column type declaration, such as NVARCHAR(n). Many people think that n is the number of characters, but that is not true. For CHAR and VARCHAR, n defines the size in bytes; for NCHAR and NVARCHAR, n is the size in byte-pairs (x2). A character that needs more than one byte, or more than one byte-pair, therefore consumes more than one unit of n.
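To make the difference between characters and units of n concrete, here is a Python sketch. The helpers `varchar_units` and `nvarchar_units` are hypothetical names, and the VARCHAR case assumes a UTF-8 collation. It counts how many units of n a text would consume under the rule described above.

```python
def varchar_units(text: str, encoding: str = "utf-8") -> int:
    """Units of n consumed in CHAR/VARCHAR: one per byte in the column's encoding."""
    return len(text.encode(encoding))

def nvarchar_units(text: str) -> int:
    """Units of n consumed in NCHAR/NVARCHAR: one per UTF-16 byte-pair."""
    return len(text.encode("utf-16-le")) // 2

name = "Περικλ\u1FC6ς"    # Περικλῆς, with ῆ written explicitly as U+1FC6
emoji = "\U0001F385"      # Santa Claus emoji

# Columns: characters, VARCHAR units (UTF-8), NVARCHAR units
print(len(name), varchar_units(name), nvarchar_units(name))
print(len(emoji), varchar_units(emoji), nvarchar_units(emoji))
```

Under this rule, an 8-character Greek name needs more than VARCHAR(8) with a UTF-8 collation, and a single emoji needs two units of an NVARCHAR column.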

Practical example

Consider two databases, one with the Latin1_General_CI_AS collation (Windows-1252 encoding) and another with the Latin1_General_100_CI_AS_KS_SC_UTF8 collation (UTF-8 encoding). For each of them, we will compare the storage sizes of VARCHAR and NVARCHAR for text in the basic and extended Latin alphabet, Greek, Japanese, and text containing an emoji. The script to run is below:

CREATE TABLE [dbo].[Person] (
  [Name] VARCHAR(24) NOT NULL,
  [NameUtf16] NVARCHAR(24) NOT NULL
);

INSERT INTO [dbo].[Person] VALUES
('Pericles','Pericles'), -- Latin without accent
('Péricles','Péricles'), -- Latin with accent
(N'Περικλῆς',N'Περικλῆς'), -- Greek
(N'美しいキモノ',N'美しいキモノ'), -- Japanese (kanji, hiragana and katakana)
(N'Santa Claus 🎅',N'Santa Claus 🎅'); -- with emoji
-- the N prefix is necessary for Unicode string literals

SELECT
  [Name], DATALENGTH([Name]) AS [VarcharSizeInBytes],
  [NameUtf16], DATALENGTH([NameUtf16]) AS [NvarcharSizeInBytes]
FROM [dbo].[Person];

DROP TABLE [dbo].[Person];

Latin1_General_CI_AS

VARCHAR name	Size in bytes	NVARCHAR name	Size in bytes
Pericles	8	Pericles	16
Péricles	8	Péricles	16
?e??????	8	Περικλῆς	16
??????	6	美しいキモノ	12
Santa Claus ??	14	Santa Claus 🎅	28

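Note that Windows-1252 does not support Greek or Japanese characters, nor emojis, so they are replaced by '?' (the exception in the table is the Greek ε, which was stored as a Latin 'e'). The emoji becomes two '?' because it occupies two byte-pairs in UTF-16. On the other hand, Windows-1252 handles Latin words very well, with only 1 byte per letter, even for letters with accents or cedillas.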
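The replacement behaviour can be approximated outside the database. The Python sketch below is an assumption-laden imitation, not SQL Server's actual conversion routine. It writes one '?' per UTF-16 code unit for every character missing from Windows-1252, and it does not reproduce the ε to 'e' substitution seen in the table. The function name `to_cp1252_with_placeholders` is invented for this example.

```python
def to_cp1252_with_placeholders(text: str) -> bytes:
    """Encode to Windows-1252, writing '?' for each unsupported UTF-16 code unit."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Characters above 0xFFFF occupy two UTF-16 code units, hence "??".
            out += b"?" * (len(ch.encode("utf-16-le")) // 2)
    return bytes(out)

samples = [
    "P\u00E9ricles",            # Péricles
    "Περικλ\u1FC6ς",            # Περικλῆς
    "美しいキモノ",
    "Santa Claus \U0001F385",   # Santa Claus with emoji
]
for name in samples:
    converted = to_cp1252_with_placeholders(name)
    print(converted.decode("cp1252"), len(converted))
```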

Latin1_General_100_CI_AS_KS_SC_UTF8

VARCHAR name	Size in bytes	NVARCHAR name	Size in bytes
Pericles	8	Pericles	16
Péricles	9	Péricles	16
Περικλῆς	17	Περικλῆς	16
美しいキモノ	18	美しいキモノ	12
Santa Claus 🎅	16	Santa Claus 🎅	28

With a UTF-8 collation, the VARCHAR field successfully stored all characters and used fewer bytes than NVARCHAR in three of the five cases. The third name, Περικλῆς, needed 17 bytes because the letter ῆ (Unicode 0x1FC6), used in ancient Greek, lies above 0x07FF and requires 3 bytes in UTF-8. For the Japanese text, UTF-16 proved more efficient: 12 bytes instead of 18.
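The byte counts in this last table depend only on the encodings, so they can be reproduced without a database. The Python sketch below (illustrative only; `sizes` is a helper name chosen here) computes the UTF-8 and UTF-16 sizes of the five sample names.

```python
names = [
    "Pericles",
    "P\u00E9ricles",            # Péricles
    "Περικλ\u1FC6ς",            # Περικλῆς, with ῆ written explicitly as U+1FC6
    "美しいキモノ",
    "Santa Claus \U0001F385",   # Santa Claus with emoji
]

def sizes(text: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name in names:
    utf8, utf16 = sizes(name)
    print(f"{name}\t{utf8}\t{utf16}")
```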