Stijn de Witt

Max. bytes in a UTF-8 char?

4.

A single UTF-8 encoded Unicode character takes at most 4 bytes.


And this is how the encoding scheme works in a nutshell.


Source: Wikipedia
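The same scheme can be sketched in a few lines of Python. This is a minimal illustration rather than a production encoder: the function name `utf8_encode` is made up for this example, and the result is checked against Python's built-in `str.encode`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample per sequence length, compared with Python's own encoder.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(utf8_encode(ord(ch)))} byte(s)")
```

The first byte tells a decoder how long the sequence is, and every following byte starts with the bits `10`, so no sequence can be mistaken for ASCII.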


Wait, I heard there could be 6?

No. You heard wrong.


There used to be a lot of confusing information about this subject, which is why I wrote a post about it on my personal blog back in the day. Today, some of the confusion has been cleared up. The Wikipedia page about UTF-8 used to show a table that went up to Byte 6, suggesting that there could be 6 bytes in a single UTF-8 char when in fact that was not possible. That table has since been changed to show what I am also showing here.


Some confusion lingers, for example in an article by the highly esteemed Joel Spolsky of Joel on Software, which states that UTF-8 characters can contain up to 6 bytes. Such articles can remain as they are, because they accurately reflect what we thought, and were confused about, back then. We can write new articles such as this one to correct ourselves later.


Why this confusion?

This confusion happened because of the history of Unicode itself.


When it started out, Unicode was supposed to remain within 16 bits. The fixed-length UCS-2 encoding could be used to store all 65,536 possible code points and life would be good…


Except it wasn’t, as people realized that 16 bits could not fit all characters and symbols, and Unicode thus wouldn’t be very Universal anymore… So they decided to increase the width of their fixed-width encoding to 32 bits, any other number between 16 and 32 bits not being very practical.


2 billion is a crowd

With 32-bit numbers, and one bit reserved as the sign bit of the integers most programming languages use, you’d get 2^31, or roughly 2 billion, possible code points. Trying to cram all that into a variable-length encoding where all of ASCII fits in a single byte, you would need… *drumroll* … 6 bytes at most. This led to early specs for UTF-8 talking about a maximum of 6 bytes per character.
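The arithmetic behind that 6 can be checked directly. In the original scheme, a sequence of n bytes (n ≥ 2) spends n + 1 bits of the first byte on the length marker and 2 bits of every following byte on the continuation marker. The helper name `payload_bits` is only for illustration.

```python
def payload_bits(n_bytes: int) -> int:
    """Bits left for the code point in an n-byte sequence
    of the original (up to 6 byte) UTF-8 scheme."""
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    lead = 8 - (n_bytes + 1)          # e.g. 110xxxxx leaves 5 bits
    return lead + 6 * (n_bytes - 1)   # each 10xxxxxx byte carries 6 bits


for n in range(1, 7):
    print(n, "byte(s):", payload_bits(n), "bits")
```

Only the 6-byte form reaches 31 bits, which is exactly what a signed 32-bit integer can hold.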


However, people quickly realized that even though 64K characters might be too few for a universal character set, 2 billion was, well, overkill. So they settled on a compromise, one that solved the problem they had created when their 16-bit fixed-length encoding could no longer encode all characters.


Settle for less

They limited Unicode to a possible 1,112,064 valid code points. In 2014, when I wrote my post about this on my personal blog, we were at Unicode 7.0.0 and there were only 112,218 characters actually defined. Today in 2022, we are at Unicode 15.0.0 and a total of 149,186 characters have been defined. All other positions are still unused. They even reserved the enormous number of 137,468 code points for private use characters.
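Those totals follow from the code point ranges themselves. A quick check, assuming the standard ranges: code points run from 0 to 10FFFF (hex), the surrogate range D800 to DFFF is subtracted, and private use consists of the block E000 to F8FF plus planes 15 and 16 without their last two positions.

```python
MAX_CODE_POINT = 0x10FFFF
total = MAX_CODE_POINT + 1                 # 17 planes of 65,536 positions
surrogates = 0xDFFF - 0xD800 + 1           # 2,048 positions set aside for UTF-16
valid = total - surrogates

# Private use: one block in plane 0, plus planes 15 and 16
# (each plane minus its two trailing noncharacters, xFFFE and xFFFF).
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFD + 1)

print(f"{total:,} positions, {valid:,} valid, {private_use:,} private use")
```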


The highest possible valid code point is U+10FFFF. They reserved 66 positions for ‘non-characters’ and 2,048 positions for ‘surrogate’ code points. Using the latter they could create ‘surrogate pairs’, a trick to create a new variable-length encoding named UTF-16 that was backwards compatible with the old 16-bit fixed-length UCS-2, but still able to encode all characters in the now 1.1 million+ code point space of Unicode 2 and up.
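The surrogate trick is simple enough to sketch. Below is a minimal Python illustration (the function name `to_surrogate_pair` is made up for this example), cross-checked against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)     # U+1F600, an emoji
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00

# Cross-check with the built-in encoder (big-endian, no byte order mark).
expected = "\U0001F600".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Two halves of 1,024 values each give 1,024 × 1,024 = 1,048,576 extra code points on top of the original 65,536, which is where the upper limit of U+10FFFF comes from.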


All’s well that ends well

The moral of the story is that although it may at first sound like Unicode is 32 bits, if we look closer we see that in fact we need far less. When I originally wrote this post, we could number all Unicode characters defined at that time in just 17 bits. Today, we need 18. But even if, by the time we are at Unicode 99.0.0 or something, all 1.1+ million code points were actually assigned, we could still fit them in just 21 bits.
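These bit counts can be verified with the figures quoted in this post. The helper `bits_needed` is a name chosen for this sketch; it returns the width needed to number that many items starting from 0.

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits that can number `count` distinct items."""
    return (count - 1).bit_length()


figures = {
    "Unicode 7.0.0 characters": 112_218,
    "Unicode 15.0.0 characters": 149_186,
    "all valid code points": 1_112_064,
}
for label, count in figures.items():
    print(f"{label}: {bits_needed(count)} bits")
```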


And lo and behold… If we figure out the maximum number of bytes needed per character when we only need to encode 1.1 million possible different characters instead of 2 billion, we don’t need 6 bytes anymore. We can settle for 4. And so we did. It was made final in RFC 3629.
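A modern implementation shows both sides of this. The short Python check below encodes the highest code point, then feeds the decoder a 6-byte sequence of the kind the old scheme allowed (here the old-style encoding of the value 4000000 hex); the variable names are just for this example.

```python
highest = chr(0x10FFFF)
encoded = highest.encode("utf-8")
print(encoded.hex(" "), "->", len(encoded), "bytes")   # f4 8f bf bf -> 4 bytes

# 1111110x 10xxxxxx ... : a 6-byte sequence from the original scheme.
old_style = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])
try:
    old_style.decode("utf-8")
    accepted = True
except UnicodeDecodeError as exc:
    accepted = False
    print("rejected:", exc.reason)
```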
