Why does “é” become “Ã©”?

As I said before, encoding issues are quite common, and yet they can be very tricky to debug: any link in the long chain between the data storage (SQL or not) and the client can be the culprit, so each one has to be investigated. I have recently experienced this first hand, and it was tricky enough to deserve a future post of its own.

In short, the problem was that a PDF document produced by PDFLaTeX in iso-8859-1 was incorrectly forced into UTF-8, which corrupted the binary file. The sure sign of this was that single characters were “converted” into 2 or more characters, for example: “é” was displayed as “Ã©”. Anybody who’s worked on non-ASCII projects (probably 98% of the non English-speaking world) has had a similar problem, I’m sure.

But why does “é” become “Ã©”, why that particular sequence:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©
?

The reason lies in the UTF-8 representation. Characters with a code point below or equal to 127 (0x7F) are represented with 1 byte only, whose value is the same as the ASCII value. Characters with a code point from 128 to 2047 (0x7FF) are written on two bytes of the form 110yyyyy 10xxxxxx, where the scalar representation of the character is: 0000000000yyyyyxxxxxx (see here for more details).

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. Its value, 233, is between 128 and 2047, so it will be coded on 2 bytes, with yyyyy = 00011 and xxxxxx = 101001. Therefore its UTF-8 representation is 11000011 10101001 (C3 A9).
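The bit packing can be checked by hand. Here is a minimal Python sketch that builds the 110yyyyy 10xxxxxx form and compares it with Python's own UTF-8 encoder. The helper name encode_two_bytes is made up for this illustration, and it only covers the two-byte range.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Pack a code point in U+0080..U+07FF into the 110yyyyy 10xxxxxx form."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points from U+0080 to U+07FF take two bytes")
    lead = 0b11000000 | (code_point >> 6)           # 110yyyyy: the 5 high bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: the 6 low bits
    return bytes([lead, trail])


E_ACUTE = "\u00e9"  # é, written as an escape so the source file stays ASCII

manual = encode_two_bytes(ord(E_ACUTE))  # ord() gives 0xE9, i.e. 233
builtin = E_ACUTE.encode("utf-8")

print(" ".join(f"{byte:08b}" for byte in manual))  # 11000011 10101001
print(manual.hex(), builtin.hex())                 # c3a9 c3a9
```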

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will be read as 2 1-byte-long latin-1 characters. The first byte is 11000011, i.e. C3, which, when checking the latin-1 table, corresponds to “Ô (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).
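The same misreading can be reproduced in a few lines of Python. This is only a sketch using the standard codecs, with variable names of my own choosing: the UTF-8 bytes of “é” are decoded as if they were latin-1, and each byte comes out as one character.

```python
import unicodedata

utf8_bytes = "\u00e9".encode("utf-8")   # é as UTF-8: the bytes C3 A9
misread = utf8_bytes.decode("latin-1")  # each byte is read as one latin-1 character

print(len(misread))  # 2
for ch in misread:
    # U+00C3 LATIN CAPITAL LETTER A WITH TILDE, then U+00A9 COPYRIGHT SIGN
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```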

What happens if you convert “Ã©” to UTF-8… again? You get something like “Ã?Â©” (the second character can vary). Why? Exactly the same reason: “Ô (U+00C3) is represented on 2 bytes, so it becomes 11000011 10000011 (C3 83), and “©” (U+00A9) becomes 11000010 10101001 (C2 A9). Read as latin-1 once more, C3 is, as we saw, Ã (U+00C3); 83 is U+0083, NBH (“No Break Here”, which does not represent a graphic character); C2 is Â (U+00C2); and A9 is, as we saw, © (U+00A9).
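To check the byte values of this second conversion, here is a small Python sketch using only the standard library. The variable names are mine; the placeholder text for the unnamed control character is also just a choice made for this example.

```python
import unicodedata

once = "\u00e9".encode("utf-8")  # é as UTF-8: C3 A9

# The UTF-8 bytes are wrongly taken for latin-1, then converted to UTF-8 again.
twice = once.decode("latin-1").encode("utf-8")
print(" ".join(f"{byte:02x}" for byte in twice))  # c3 83 c2 a9

# Read as latin-1 one more time, the four bytes give four characters.
for ch in twice.decode("latin-1"):
    # Control characters have no name in unicodedata, hence the default value.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<control character>"))
```

The loop lists U+00C3, U+0083, U+00C2 and U+00A9. The second one is a C1 control character, which is why what is displayed in that position depends on the software showing it.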

Update:

Just a few points to clarify the above, as the use of iconv above may be slightly confusing.

  • The problem is caused when UTF-8 “é” is literally interpreted as latin-1, that is, 11000011 10101001 is read as the two 1-byte latin-1 characters Ã©, rather than the 2-byte UTF-8 character é.
  • This only happens when UTF-8 is mistakenly taken as latin-1.
  • iconv converts from one character encoding to another. This means that a UTF-8 “é” becomes an iso-8859-1 “é” when converting from UTF-8 to iso-8859-1. The sequence is therefore converted from 0xC3 0xA9 to 0xE9. Let’s see this:
sebastien@greystones:~$ echo é > /tmp/test.txt
sebastien@greystones:~$ xxd /tmp/test.txt
0000000: c3a9 0a                                  ...
sebastien@greystones:~$ iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
sebastien@greystones:~$ xxd /tmp/test_1.txt 
0000000: e90a                                     ..
sebastien@greystones:~$ 

In the example in the post:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©

I know that the character entered on the console is UTF-8, but I ask iconv to consider it as latin-1, and then to convert it to UTF-8 to illustrate the problem.
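For those who prefer to experiment without iconv, both invocations can be imitated in Python. This is a rough sketch, not a description of how iconv works internally: the convert helper is a name I made up, and it simply decodes with the source encoding and encodes with the target one.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another, like iconv -f/-t."""
    return data.decode(source).encode(target)


utf8_file = b"\xc3\xa9\n"  # the bytes xxd shows for /tmp/test.txt

# Correct source encoding: the 2-byte é becomes the single byte E9.
print(convert(utf8_file, "utf-8", "iso-8859-1").hex())  # e90a

# Wrong source encoding, as in "iconv -f latin1 -t utf8": each byte is
# treated as a latin-1 character and encoded on two bytes.
print(convert(utf8_file, "latin-1", "utf-8").hex())     # c383c2a90a
```

A UTF-8 console then decodes C3 83 C2 A9 as the two characters Ã©, which is what the iconv example displays.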

I hope this clarifies things a bit.

Update: second part of the article here.

 
---

Comment

  1. Thanks for sharing this excellent post on character encoding. Even a year later, this is useful. Better late than never :)

    Claude · 2011-09-20 15:14 · #

  2. What happens if ASCII é is accidentally read as UTF-8?

    Richard G · 2012-08-01 10:45 · #

  3. Hi, I’m from Brazil…
    Thank you so much, this topic solved my problem.

    Rafael Lima · 2013-03-21 14:58 · #

  4. Best description I’ve seen on character encoding. Thank you very much!

    Alan Aversa · 2013-03-22 19:25 · #

  5. Suppose I convert to ISO. Then why does running “file -bi” on the file I create say it’s utf-8? How do I make its MIME type ISO? thanks

    Alan Aversa · 2013-03-22 20:22 · #

  6. Alan,

    What does your file contain? It should work:

    /tmp$ echo éclair > test.txt
    /tmp$ iconv -f utf-8 -t iso-8859-1 test.txt > test-iso.txt
    /tmp$ file -bi test-iso.txt
    text/plain; charset=iso-8859-1

    sébastien · 2013-03-22 20:48 · #

    I’ve found this is a (very wasteful) workaround that works widely.
    .encode( 'ISO-8859-1' ).encode( 'ISO-8859-15', 'UTF-8' ).encode( 'utf-8' )

    Chad Petersen · 2013-06-11 16:02 · #

  8. test message

    text not display well whats this galu

    test · 2013-07-30 07:51 · #

  9. Thanks! Solved my problem here!

    Fred lavoie · 2014-12-04 18:58 · #


---