
What every software developer absolutely must know, at the very least, about Unicode and character sets (no excuses!)

Joel on Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
German translation

The original:
Joel Spolsky, October 8, 2003
© 2000-2009 Joel Spolsky
This translation:
Hans-Werner.Heinzen @ Bitloeffel.de, August 2009
Translation Notes:
As always, I have tried to use German terms, even where English words have become common in programmers' jargon. This has the advantage that you don't have to "switch" between two languages within a sentence ;-)
A "tag" is rendered as a mark, a "control" as an operating element, "email" can also be translated as electronic mail, and in one place "sites" even became network places ...

Have you ever wondered what this mysterious "Content-Type" tag is all about? You know, the one you are supposed to put in your HTML, without ever quite knowing what it should say?

Have you ever received an email from a friend in Bulgaria with the subject line "???? ?????? ??? ????"?

I was horrified to discover how many developers are not up to speed on the mysteries of character sets, encodings, Unicode, and so on. A few years ago a beta tester asked whether FogBUGZ could handle Japanese email. Japanese? There is email in Japanese? I had no idea. When we took a closer look at the commercial ActiveX control I had bought to parse MIME messages, we discovered that it was doing exactly the wrong thing with character sets. So we had to write code to undo the wrong conversion and then convert correctly. I then looked at another commercial library, and it, too, handled character sets completely wrongly. I corresponded with the developer of that package; he felt that "there is probably nothing that can be done about it". Like many other programmers, he hoped it would somehow take care of itself.

But it won't! When I found out that PHP, the popular language for web applications, is completely oblivious to encoding issues (it blithely uses 8 bits per character, which makes it next to impossible to develop an international application), I thought to myself: enough is enough!

So I have an announcement to make: if you are working as a programmer in 2003 and don't understand the basics of characters, character sets, encodings and Unicode, and I catch you, I'll put you on a submarine and make you peel onions for 6 months. Count on it!

And something else:


In this essay, I'm going to teach you exactly what every working programmer should know. All this talk of "plain text = ASCII = characters are 8 bits" is not just wrong, it is hopelessly wrong. If you still program that way, you are like a doctor who doesn't believe in germs. Please do not write another line of code until you have finished reading this essay.

Before I begin, I should warn you: if you are one of those rare people who know about internationalization, you will find my treatment too simplistic. I am only trying to set a minimum bar here, so that everyone can understand what is going on and write code that has a hope of working with text in languages other than the subset of English that gets by without accented letters. I should also point out that character handling is only a small part of what makes software usable internationally. But I can only cover one topic at a time, and today's topic is character sets.

A historical review

The easiest way to understand all this is to go through it chronologically.

You probably think I am going to start with ancient character sets like EBCDIC. I am not. EBCDIC is not essential; we don't have to go that far back.

Translator's note: Spolsky was wrong here. IBM mainframes still use EBCDIC today.

Back in the good old days, when Unix was just being invented and K&R were writing "The C Programming Language", everything was still very simple. The only letters that mattered were good old unaccented English letters, and our encoding was ASCII, which could represent every character with a number between 32 and 127. Space was 32, the letter "A" was 65, and so on. That fit comfortably into 7 bits. Most computers of the time used 8-bit bytes, so not only could you store every possible ASCII character, you even had a bit to spare, which, if you were wicked enough, you could misuse for your own purposes: the dim bulbs at WordStar actually used the high bit to mark the last letter of each word, thereby limiting WordStar to English text. Codes below 32 were called unprintable and were used for cursing. Just kidding! They were used as control characters: 7 made your computer beep, and 12 ejected the current sheet of paper from the printer and fed in a new one.
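A small illustration of the numbers mentioned above, as a sketch in Python (the essay itself does not use Python; the variable names are only for illustration). It checks that ordinary English text stays below 128 and that codes 7 and 12 are control characters.

```python
# ASCII assigns every character a number that fits in 7 bits.
text = "Hello"
codes = [ord(ch) for ch in text]          # the number behind each letter
fits_in_7_bits = all(code < 128 for code in codes)

space_code = ord(" ")                     # 32
letter_a_code = ord("A")                  # 65

# Codes below 32 are control characters, not printable ones.
BELL = chr(7)        # made the terminal beep
FORM_FEED = chr(12)  # told the printer to start a new page

print(codes, fits_in_7_bits)
print(BELL.isprintable(), FORM_FEED.isprintable())
```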

Everything was fine as long as your language was English.

Because a byte has room for 8 bits, a lot of people thought, "Great, we can use the numbers 128-255 for our own purposes." Unfortunately, a lot of people had this idea at the same time, and each had a different idea of what should go there. The IBM PC, for example, came with what became known as the OEM character set, which provided a few accented letters for European languages as well as a whole stock of line-drawing symbols: horizontal bars, vertical bars, horizontal bars with an appendage on the right, and so on. You could use them to draw nifty boxes and lines on the screen, which you can still admire on the 8088 computer in the laundry next door. As people started buying personal computers outside the United States, all kinds of OEM character sets were quickly invented, each using the top 128 numbers for its own purposes. For example, on some PCs character code 130 was displayed as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so that when Americans sent their résumés to Israel they arrived as rגsumגs. In other cases, such as Russian, there were many different ideas about how to use the top 128 numbers, so it was not even possible to exchange Russian documents reliably.

Eventually this OEM free-for-all was codified in the ANSI standard. Everyone agreed on the range below 128, which was nothing more than ASCII, but the range above it was handled differently depending on where you lived. These different systems were called code pages. For example, DOS in Israel used code page 862, while Greek users had code page 737. They were the same below 128 but different above, where the funny letters live. There were dozens of national versions of MS-DOS, from English to Icelandic, and there were even a few "multilingual" code pages that could handle Esperanto and Galician on the same computer. Wow! But it was impossible to use, say, Hebrew and Greek on the same computer (unless you wrote your own program that displayed everything with bitmap graphics), because Hebrew and Greek interpreted the high codes differently.
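The effect of code pages can be tried out with Python's built-in codecs. This is only a sketch: it assumes that "the OEM character set" of the original IBM PC corresponds to Python's cp437 codec, and it uses cp862 and cp737 for the Hebrew and Greek DOS code pages.

```python
# One and the same byte, interpreted under three different code pages.
raw = bytes([130])                 # a single byte with the value 130 (0x82)

as_us_oem = raw.decode("cp437")    # assumed: original IBM PC OEM code page
as_hebrew = raw.decode("cp862")    # DOS Hebrew
as_greek = raw.decode("cp737")     # DOS Greek
print(as_us_oem, as_hebrew, as_greek)

# Below 128 the three code pages agree with plain ASCII.
low_range = bytes(range(32, 127))
same_below_128 = (
    low_range.decode("cp437")
    == low_range.decode("cp862")
    == low_range.decode("cp737")
    == low_range.decode("ascii")
)

# A document written on a US machine and opened on a Hebrew one.
garbled = "résumés".encode("cp437").decode("cp862")
print(garbled)
```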

In the meantime, even crazier things were happening in Asia, where people had to cope with the fact that Asian writing systems have thousands of characters that would never fit into 8 bits. The usual solution was a messy system called DBCS, the double-byte character set, in which some characters were stored in one byte and others in two. Processing a string forwards was easy; processing it backwards was almost impossible. Programmers were asked not to step through strings byte by byte but to call functions such as AnsiNext and AnsiPrev in Windows, which knew how to deal with the mess.
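To get a feel for the one-byte-or-two problem, here is a sketch that uses Shift-JIS as one example of a double-byte character set (the essay does not name a specific DBCS encoding, so that choice is an assumption). It shows that character count and byte count differ, and that cutting at an arbitrary byte position can split a character.

```python
# A Latin letter followed by a Japanese hiragana character.
text = "Aあ"
encoded = text.encode("shift_jis")

# 2 characters, but 3 bytes: 'A' takes one byte, 'あ' takes two.
print(len(text), len(encoded))

# Cutting after the second byte slices the two-byte character in half.
try:
    encoded[:2].decode("shift_jis")
    cut_was_safe = True
except UnicodeDecodeError:
    cut_was_safe = False

print(cut_was_safe)
```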

Still, most people simply pretended that a byte was a character and a character was 8 bits, and that worked as long as you never had to move a string from one computer to another or speak more than one language. Of course, once the internet arrived and moving strings between computers became commonplace, the whole mess became apparent. Fortunately, Unicode had already been invented.


Unicode was a brave attempt to create a single character set containing every serious writing system on the planet, plus a few make-believe ones such as Klingon. Some people are under the misconception that Unicode is simply a 16-bit code, where each character takes 16 bits and therefore there are 65,536 possible characters. That is not right! It is the most widespread myth about Unicode, so don't feel bad if you believed it too.

In reality, Unicode takes a very different approach, and you have to understand it or nothing else will make sense.

So far we have assumed that a letter corresponds to some bits that are stored on disk or in main memory.

A -> 0100 0001

In Unicode, however, a letter corresponds to something called a code point (or code number), which is still just a theoretical concept. How that code point is represented in memory or on disk is a completely different story.

In Unicode, the letter A is a Platonic ideal. It hovers above the clouds:

A

This Platonic A is different from B and also from a, but it is the same as A and A and A, whatever typeface each one is set in.

The idea that the A in Times New Roman is the same letter as the A in Helvetica is largely undisputed. In other languages, however, what counts as a letter can be controversial. Is the German ß a letter in its own right or just a fancy way of writing ss? If a letter changes shape at the end of a word, is that a separate letter? Hebrew says yes, Arabic says no. No matter. The smart people at the Unicode Consortium have been sorting this out over the past ten years, in a host of heated debates. You don't have to worry about it any more; it has all been worked out.

Every ideal letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the number is hexadecimal. U+0639 is the Arabic letter Ain. The letter A is U+0041. You can find them all with the charmap utility on Windows 2000/XP, or on the Unicode website.
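The mapping between letters and code points can be explored with Python's built-in ord and chr and the standard unicodedata module. A minimal sketch; the helper name code_point_label is made up for this example.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


ain = chr(0x0639)                      # from code point to character
print(code_point_label("A"))           # from character to code point
print(code_point_label(ain), unicodedata.name(ain))
```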

The number of letters Unicode can define is not limited to what fits in 16 bits; in fact it goes well beyond 65,536 (the code space runs up to U+10FFFF), so you cannot squeeze every Unicode letter into two bytes. But that was just a myth anyway.
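A quick way to see that two bytes are not enough, sketched in Python. The example character, the musical G clef at U+1D11E, is my own choice and does not appear in the essay.

```python
import sys
import unicodedata

clef = "\U0001D11E"                    # one character above U+FFFF
print(unicodedata.name(clef), hex(ord(clef)))

beyond_16_bits = ord(clef) > 0xFFFF
utf16_length = len(clef.encode("utf-16-be"))   # bytes needed in UTF-16

# The highest code point Python (and Unicode) allows.
highest_code_point = sys.maxunicode
print(beyond_16_bits, utf16_length, hex(highest_code_point))
```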

OK, so we have a string:

Hello

which corresponds to these five code points in Unicode:

U+0048 U+0065 U+006C U+006C U+006F.

Nothing but a handful of code points; just numbers, really. So far we haven't said anything about how to store them in memory or represent them in an email.


This is where encodings come in.

The earliest idea for encoding Unicode, the one that led to the two-byte myth, was (surprise!) to store these code points in two bytes each. So Hello becomes:

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it just as well be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I believe it could. In fact, the early implementors wanted to be able to store their Unicode code points in big-endian or little-endian mode, whichever their particular CPU handled fastest ... and it was evening and it was morning, and there were already two ways to store Unicode. So people were forced into the bizarre convention of starting every Unicode string with FE FF. This is called the byte order mark; if you swap the high and low bytes it reads FF FE, and whoever reads the string knows that every pair of bytes has to be swapped. Phew! In the wild, however, not every Unicode string has a byte order mark.
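The two byte orders and the byte order mark can be reproduced with Python's standard codecs. This is a sketch with a deliberately short example string of my own; the codec names utf-16-be, utf-16-le and utf-16 are Python's.

```python
import codecs

text = "Hi"                               # U+0048 U+0069

big = text.encode("utf-16-be")            # high byte first
little = text.encode("utf-16-le")         # low byte first
print(big.hex(" "), "|", little.hex(" "))

# The byte order mark is U+FEFF written in the respective byte order.
print(codecs.BOM_UTF16_BE.hex(" "), "|", codecs.BOM_UTF16_LE.hex(" "))

# With a mark in front, the generic "utf-16" decoder works out the order itself.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_big, from_little)
```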

The Big-Endians come from Gulliver's voyage to Lilliput. A popular legend (found, for example, in The Free On-line Dictionary of Computing) says that in the land of Lilliput, where even the political problems are small, the Big-Endians quarreled with the Little-Endians over whether a soft-boiled egg should be opened at the big end or the little end. The reality in Lilliput was sadder still: the dispute led to six rebellions, one ruler lost his life and another his crown, and a long war between two empires followed. (Jonathan Swift, Gulliver's Travels, A Voyage to Lilliput, Chapter Four)

For a while this solution seemed good enough, but the programmers complained. "Look at all those zeros!" they said, because they were Americans looking at English text, which of course hardly ever contains code points above U+00FF. Besides, they were alternative hippies from California who wanted to conserve (sneer). Texans would have guzzled twice as many bytes without a second thought. But those Californian wimps couldn't bear the thought of using twice as many bytes for text. And anyway, there were so many documents lying around in all those ASCII and DBCS character sets; who was going to convert them all? Not me! So most people chose to simply ignore Unicode, this went on for years, and in the meantime the mess only got worse.

Then someone came up with the brilliant concept of UTF-8. It was another system for storing a string of Unicode code points, those mysterious U+ numbers, using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or more bytes (up to 6 in the original design; the current standard, RFC 3629, allows at most 4).

This was so clever that English text looks exactly the same in UTF-8 as it did in ASCII, so the Americans didn't notice a thing. Only the rest of the world had to jump through hoops. The example Hello, i.e. U+0048 U+0065 U+006C U+006C U+006F, is stored as 48 65 6C 6C 6F, which (look at that!) is the same as in ASCII, in ANSI, and in every OEM character set on the planet.

But if you are bold enough to use Greek or Klingon letters, you have to use several bytes to store a single code point, and the Americans never notice. (UTF-8 also has the nice property that ignorant, old-fashioned string-processing code that treats a single zero byte as the end-of-string marker can carry on doing so.)
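The properties of UTF-8 described above can be checked directly. A sketch in Python; the sample characters other than the English text are my own choices.

```python
english = "Hello"

# English text is byte-for-byte identical in UTF-8 and ASCII.
same_as_ascii = english.encode("utf-8") == english.encode("ascii")

# Characters further up the code space need more bytes.
samples = ["A", "é", "α", "€", "\U0001D11E"]
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(dict(zip(samples, byte_counts)))

# UTF-8 puts no zero bytes into ordinary text, unlike the two-byte encoding.
zero_in_utf8 = 0 in english.encode("utf-8")
zero_in_utf16 = 0 in english.encode("utf-16-le")
print(same_as_ascii, zero_in_utf8, zero_in_utf16)
```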

So far we have talked about three ways of encoding Unicode: the traditional two-byte methods, called UCS-2 (because it uses 2 bytes) or UTF-16 (because it uses 16 bits), in their big-endian and little-endian variants, and the universally popular newer UTF-8 standard, which also behaves respectably when English text meets brain-dead programs that know nothing but ASCII. (Strictly speaking, UTF-16 extends UCS-2 with surrogate pairs for code points above U+FFFF.)

In truth, there are a few more ways. First, there is UTF-7, which is a lot like UTF-8 but guarantees that the high bit is always zero, so you can smuggle Unicode text unscathed through draconian police-state email systems that think 7 bits are quite enough, thank you. Then there is UCS-4, which stores each code point in 4 bytes and has the nice property that every code point takes the same number of bytes, but good heavens, even Texans wouldn't waste that much memory.
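Both encodings are available in Python, which makes their properties easy to check. A sketch only: it treats Python's utf-32-be codec as a stand-in for UCS-4, and the sample string is made up.

```python
text = "Héllo"

# UTF-7: every byte has the high bit clear, so 7-bit mail systems pass it through.
utf7 = text.encode("utf-7")
all_seven_bit = all(byte < 128 for byte in utf7)
round_trip_ok = utf7.decode("utf-7") == text
print(utf7, all_seven_bit, round_trip_ok)

# UTF-32 (used here in place of UCS-4): a fixed four bytes per code point.
utf32 = text.encode("utf-32-be")
print(len(text), len(utf32))
```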

And indeed, while we are already talking about ideal letters in the Platonic sense, which are represented by Unicode code numbers: these code numbers can also be encoded according to the old coding rules! For example, the Unicode string for Hello (U + 0048 U + 0061 U + 006C U + 006C U + 006F) could be encoded in ASCII, or the ancient Greek, or Hebrew ANSI character set, or any of the hundreds previously invented. The thing would only have one hook: some of the letters could not be seen! As long as there is no equivalent of the Unicode code number in the respective character set, you usually only see a small question mark:? or if you really Lucky you got a box. What do you see? ->

Translator's note: My modern browser even shows a question mark in a diamond.

There are several hundred traditional encodings that can store only some code points correctly and turn all the others into question marks. Popular encodings for English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, also known as Latin-1 (also good for any Western European language). But try to encode Russian or Hebrew letters with them and you get nothing but question marks. UTF-7, 8, 16 and 32, on the other hand, can store every code point correctly!
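The question marks are easy to reproduce. In this Python sketch the Russian sample word is my own, and errors="replace" is Python's way of asking for a substitute character instead of an exception.

```python
russian = "Привет"

# Latin-1 and Windows-1252 have no Cyrillic letters, so each one becomes "?".
as_latin1 = russian.encode("latin-1", errors="replace")
as_cp1252 = russian.encode("cp1252", errors="replace")
print(as_latin1, as_cp1252)

# The UTF encodings can represent every code point, so nothing is lost.
survives = {
    name: russian.encode(name).decode(name) == russian
    for name in ("utf-7", "utf-8", "utf-16", "utf-32")
}
print(survives)
```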

The single most important fact about encodings

And if you forget everything else I have explained so far, please remember at least this: a string is meaningless if you don't know what encoding it uses. You can no longer bury your head in the sand and pretend that "plain text" is ASCII.

There is no such thing as "plain text".

For every string, whether it is in memory, in a file, or in an email, you must know its encoding; otherwise you cannot interpret it or display it to the user correctly.

Almost every stupid "My website looks like gibberish" or "She can't read my emails when I use accents" problem can be traced back to a naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded as UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), I simply cannot display it correctly, or even work out where a letter ends. There are several hundred encodings, and above code point 127 all bets are off.
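Here is what "gibberish" looks like in practice, as a Python sketch with a made-up sample word: the same bytes are read once with the right encoding and twice with a wrong one.

```python
# The sender encodes the text as UTF-8 ...
data = "café".encode("utf-8")
print(data)

# ... and the receiver has to guess. Only one guess gives the original back.
right = data.decode("utf-8")
wrong_cp1252 = data.decode("cp1252")
wrong_latin1 = data.decode("latin-1")
print(right, wrong_cp1252, wrong_latin1)

# Pretending that everything is ASCII does not work at all above 127.
try:
    data.decode("ascii")
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False
print(ascii_worked)
```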

How do you preserve the information about which encoding a string uses? Well, there are standard ways to do that. For an email message, you are expected to put a line like this in the header:

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would send a similar Content-Type HTTP header along with the page: not in the HTML itself, but in the response headers that are sent before the HTML.
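Reading the encoding out of such a header can be sketched with Python's standard email package. The helper name charset_from_content_type and the sample header values are made up for this example; Message.get_content_charset is the standard-library call doing the work.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()   # lower-cased name, or None if absent


declared = charset_from_content_type('text/plain; charset="UTF-8"')
missing = charset_from_content_type("text/html")
print(declared, missing)

# With the charset known, the body bytes can finally be interpreted.
body = b"caf\xc3\xa9"
print(body.decode(declared))
```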

That leads to problems. Suppose a server hosts many websites, with hundreds of pages contributed by many people in many different languages, each in whatever encoding their version of Microsoft FrontPage saw fit to generate. The web server itself could not really know which encoding each file was written in, so it could not send a Content-Type header in advance.

So it would be convenient if you could put the Content-Type of the HTML file right in the HTML itself, using some kind of special tag. Of course that drove the purists up the wall ... how can you read the HTML file before you know how it is encoded?! Fortunately, almost every encoding in common use does the same thing with the characters between 32 and 127, so you can always get this far in the HTML page without using funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But this meta tag really has to be the very first thing in the <head> section, because as soon as the browser sees the tag, it stops parsing the page and starts all over again, this time using the encoding you specified.
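The restart-after-the-meta-tag behaviour can be imitated in a few lines. This is a deliberately naive Python sketch, not what any real browser does: the regular expression, the function names, the 1024-byte limit and the cp1252 fallback are all assumptions made for the illustration.

```python
import re
from typing import Optional

# Look for charset=... inside a <meta ...> tag, working on raw bytes,
# which is safe because the tag itself uses only characters below 128.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(raw_html: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset declared near the top of the page, if any."""
    match = META_CHARSET.search(raw_html[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw_html: bytes, fallback: str = "cp1252") -> str:
    """Decode the page again from the start, now with the declared encoding."""
    return raw_html.decode(sniff_meta_charset(raw_html) or fallback)


page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8">'
    b"<title>caf\xc3\xa9</title></head></html>"
)
print(sniff_meta_charset(page))
print(decode_html(page))
```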

What do browsers do if they cannot find a Content-Type at all, neither in the HTTP headers nor in a meta tag? Internet Explorer does something really interesting: it tries to guess the language and encoding based on the frequency with which different bytes appear in typical text in typical encodings of various languages.

Because the different 8-bit character sets park their national letters between 128 and 255 in different areas, and because each language has a characteristic frequency distribution of its letters, this works pretty well.

It is really strange, but it seems to happen often enough: a naive web designer looks at the page he has just created in the browser and it looks fine. Then one day he writes something that doesn't match the letter-frequency distribution of his language, and Internet Explorer decides it is Korean and displays it accordingly, which I believe proves that Postel's Law ("be conservative in what you send and liberal in what you accept") is not a good engineering principle. No matter. What does the poor reader of this page do, which was written in Bulgarian but now looks like Korean (and not even real Korean)? She uses the View > Encoding menu and tries a whole series of encodings (there are at least a dozen for Eastern European languages) until the picture becomes clearer ... if she knows that this is possible, which most people don't.

For the latest version of CityDesk, the website management software my company develops and sells, we decided to do everything internally in UCS-2 (two-byte) Unicode, just as Visual Basic, COM and Windows NT/2000/XP do natively. In C++, we simply declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (e.g. wcscat and wcslen instead of strcat and strlen). A string literal in C becomes UCS-2 if you put an L in front of it, like this: L"Hello".

When CityDesk publishes a website, it converts everything to UTF-8, which browsers have supported well for a number of years. That is how all 29 language versions of Joel on Software are encoded, and so far I haven't heard of anyone having trouble viewing them.

This essay has become quite long, and I cannot cover everything there is to know about encodings and Unicode here. I hope, however, that if you have read this far you now know enough to go back to programming, using antibiotics instead of mercury or magic spells. Get to work!