Character encoding, Unicode, UTF-8 (and a bit of chauvinism) explained for the masses

2011-04-06

Unicode is a code table that maps each character to something else (in this case, a number), very similar to, for example, Morse code (characters to tones) and Braille (characters to dot patterns).

The Unicode character set contains more than 100,000 characters covering almost 100 different scripts: Arabic, Latin, Hebrew, Cyrillic, Devanagari, Tibetan, Runic, Cherokee, Braille, musical symbols, etc. (Klingon has been on the roadmap for many years). The code table ranges from 0x0 to 0x10FFFF (1,114,112 code points).
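As a quick illustration (a minimal Python sketch; the characters are just arbitrary examples from a few of these scripts), you can look up the number, the so-called code point, assigned to any character:

    # Every character is assigned a number (its "code point") in the Unicode table.
    for ch in ["A", "Я", "א", "अ", "ᚠ"]:   # Latin, Cyrillic, Hebrew, Devanagari, Runic
        print(ch, hex(ord(ch)))
    # A 0x41
    # Я 0x42f
    # א 0x5d0
    # अ 0x905
    # ᚠ 0x16a0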

ASCII is another character table (based on the ordering of the English alphabet). It includes only 128 characters. For example, the letter "A" in ASCII corresponds to 0x41. In an ASCII-encoded document each character can easily be stored using 7 bits, which fits in an 8-bit byte. (Many variations of ASCII, such as ISO-8859-1, sprang from here; they use the remaining bit to allow 256 characters.) Note that Unicode uses the same numeric values for the first 128/256 characters as ASCII and ISO-8859-1 do.
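For example (a minimal Python sketch), the code points of "A" and of a Latin-1 character such as "é" are the same numbers that ASCII and ISO-8859-1 assign to them:

    print(hex(ord("A")))             # 0x41, same value as in ASCII
    print(hex(ord("é")))             # 0xe9, same value as in ISO-8859-1
    print("é".encode("iso-8859-1"))  # b'\xe9', a single byte with that value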

So with ASCII (and ISO-8859-1) we can use one byte per character, but what do we do when our code table does not contain 128 or 256 code points but 1,114,112? We could be naive and use 32 bits per character. This kind of mapping between code points and bits is known as UTF-32. But wait a minute, that would really suck! Just sending a document saying "hello world" would cost you 44 bytes (11 characters * 4 bytes). The diagram below shows what that would look like.

[Diagram: "hello world" encoded in UTF-32]

Hell, as you can see we have a lot of wasted space!
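You can check this for yourself with a small Python sketch (Python's plain "utf-32" codec prepends a 4-byte byte-order mark, so the explicit big-endian variant is used here to get the bare 44 bytes):

    text = "hello world"
    encoded = text.encode("utf-32-be")   # big-endian UTF-32, no byte-order mark

    print(len(text))     # 11 characters
    print(len(encoded))  # 44 bytes, 4 per character
    print(encoded[:8])   # b'\x00\x00\x00h\x00\x00\x00e' -> mostly zero padding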

That's where variable-width encodings come to our rescue (well, for some of us, as you will see later!). Encodings such as UTF-8 and UTF-16 are not fixed-length the way UTF-32 is. The number of bytes used depends on the Unicode range the character is in. UTF-8 uses a sequence of between one and four 8-bit bytes per character.

[Diagram: UTF-8's varying encoded size]
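To make that concrete, here is a small Python sketch (the characters are just examples of each length class):

    # UTF-8 uses 1 to 4 bytes per character, depending on the code point.
    for ch in ["A", "é", "ก", "😀"]:   # U+0041, U+00E9, U+0E01, U+1F600
        print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
    # A 0x41 1 byte(s)
    # é 0xe9 2 byte(s)
    # ก 0xe01 3 byte(s)
    # 😀 0x1f600 4 byte(s)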

But wait a minute! How does a decoder know whether two bytes represent one character or two? Depending on the number of bytes needed, each UTF-8 byte in the sequence has a special format. The higher bits have fixed values, and the lower bits are used to store the Unicode value. This way, during UTF-8 decoding, you know how many bytes you need to read for a single Unicode character.
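Here is a minimal sketch of that rule, looking only at the first (lead) byte of a sequence and assuming well-formed UTF-8 input (this is not a full decoder):

    def sequence_length(lead_byte: int) -> int:
        """Number of bytes in a UTF-8 sequence, derived from its first byte."""
        if lead_byte >> 7 == 0b0:      # 0xxxxxxx -> one byte (ASCII range)
            return 1
        if lead_byte >> 5 == 0b110:    # 110xxxxx -> two bytes
            return 2
        if lead_byte >> 4 == 0b1110:   # 1110xxxx -> three bytes
            return 3
        if lead_byte >> 3 == 0b11110:  # 11110xxx -> four bytes
            return 4
        raise ValueError("continuation byte or invalid lead byte")

    print(sequence_length(0x4A))  # 1, e.g. 'J'
    print(sequence_length(0xD7))  # 2, e.g. the first byte of aleph (see below)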

Below are two encoding examples.

[Diagram: UTF-8 encoding examples]

First, the 'J', which has Unicode value \u004a and binary 1001010. This means it falls into the first category and is encoded using a single byte. The encoded value is obtained by right-justifying the Unicode value, which produces 01001010, so the encoded byte 0x4A is left unchanged. The second example, the letter 'aleph', has hex value \u05d0 (binary 10111010000). This means it falls into the second category and will be encoded using two bytes. The bits of the Unicode value are distributed over the "open" bits in the two UTF-8 bytes, resulting in 11010111 10010000, i.e. the byte sequence 0xD7 0x90.
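Both examples are easy to verify with a small Python sketch:

    for ch in ["J", "\u05d0"]:   # 'J' and the Hebrew letter aleph
        encoded = ch.encode("utf-8")
        bits = " ".join(f"{byte:08b}" for byte in encoded)
        print(ch, encoded.hex(), bits)
    # J 4a 01001010
    # א d790 11010111 10010000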

Also notice that for the first 128 characters UTF-8 encodes to exactly the same bytes as legacy ASCII (one byte, with the high bit set to zero, which corresponds to the 7-bit encoding of ASCII). This means you can read an ASCII file using the UTF-8 encoding.
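For example (a minimal Python sketch):

    data = "hello world".encode("ascii")

    # The raw ASCII bytes decode to exactly the same text as UTF-8.
    print(data.decode("utf-8") == data.decode("ascii"))  # True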

English bias note: This all sounds wonderful for many of "us", but that's because you are reading this article and you probably use a lot of Latin characters (which nicely fit in a single byte, or perhaps two). But if you have a lot of documents in Devanagari or Thai, you will need three bytes per character, wasting a whole byte on control bits (the static bits in UTF-8). Furthermore, single-byte encodings exist for almost all of these scripts. That means a Thai document using the TIS-620 or ISO/IEC 8859-11 encoding is one-third the size of the same data in UTF-8.
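A quick size comparison (a Python sketch using Python's built-in tis-620 codec; the sample text is simply the Thai word for "hello"):

    text = "สวัสดี"   # Thai "hello", 6 code points

    print(len(text.encode("tis-620")))  # 6 bytes, one per character
    print(len(text.encode("utf-8")))    # 18 bytes, three per character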

This article does not necessarily reflect the technical opinion of EDC4IT, but purely that of the writer. If you want to discuss this content, please use the contact us section of the site.