In PHP 5 and earlier, characters are one byte long. When working with UTF-8, this becomes an incredible royal PITA and an endless source of frustration, even for people used to working with characters present in Latin-1. Even more annoying, some functions such as htmlentities, htmlspecialchars, etc. just assume Latin-1 by default, and you have to remember to set the encoding explicitly, e.g.:
htmlentities($string, ENT_COMPAT, 'UTF-8');
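To see why that third argument matters, here is a small sketch that runs the same bytes through htmlentities under both encodings. The escape sequence spells out the UTF-8 bytes of “é” so the result does not depend on how the file is saved; the variable names are only for illustration.

```php
<?php
// "é" (U+00E9) as its two UTF-8 bytes, C3 A9.
$string = "\xC3\xA9";

// Told the input is UTF-8, htmlentities sees one character.
$asUtf8 = htmlentities($string, ENT_COMPAT, 'UTF-8');

// Told the input is Latin-1, it sees two characters: "Ã" (C3) and "©" (A9).
$asLatin1 = htmlentities($string, ENT_COMPAT, 'ISO-8859-1');

echo $asUtf8, "\n";   // &eacute;
echo $asLatin1, "\n"; // &Atilde;&copy;
```

The second result is the classic “Ã©” mojibake, just spelled with entities.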
But it also has some extremely annoying consequences for simple string functions such as substr or strlen. Typically:
$ echo '<?php echo strlen("é"); ?>' | php
2
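The same thing in a script, as a minimal sketch: it shows the two bytes behind “é” and one way of counting characters rather than bytes that needs no extra extension (a regular expression in UTF-8 mode). The variable names are mine.

```php
<?php
$e = "\xC3\xA9"; // "é" in UTF-8

echo strlen($e), "\n";   // 2 (bytes)
echo bin2hex($e), "\n";  // c3a9

// With the /u modifier, "." matches one whole UTF-8 character,
// so the number of matches is the number of characters.
$chars = preg_match_all('/./us', $e, $matches);
echo $chars, "\n";       // 1
```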
Let’s look at an example seen this morning on a popular French literary blog running on the equally popular WordPress platform, where a truncated excerpt ended in a replacement character.
And here is more than likely what happened: characters in PHP are one byte long, but as we have seen in the past, characters in UTF-8 strings may be longer than one byte (up to 4). “â” belongs to the Latin-1 Supplement block and is encoded on 2 bytes: C3 A2. As substr only deals with 1-byte characters, it simply cut “â” in the middle, leaving C3 in and getting rid of A2. C3 on its own is obviously invalid UTF-8, so the browser displays the replacement character (U+FFFD) in its place. Here is a file simulating this:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>
<?php
$text = "Critiquer la Bible sans écraser l’infâme";
echo substr($text, 0, 41);
?>
</p>
</body>
</html>
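To check this from the command line rather than in a browser, here is a rough sketch that inspects the last byte left by substr and tests whether the result is still valid UTF-8. The helper name is_valid_utf8 is hypothetical; it relies on preg_match refusing invalid UTF-8 when the /u modifier is set.

```php
<?php
// Hypothetical helper: an empty pattern in UTF-8 mode matches any valid
// UTF-8 string, while preg_match returns false on malformed input.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// Same sentence as above, with the non-ASCII characters written as bytes.
$text = "Critiquer la Bible sans \xC3\xA9craser l\xE2\x80\x99inf\xC3\xA2me";
$cut  = substr($text, 0, 41);

echo bin2hex(substr($cut, -1)), "\n"; // c3: the first half of "â"
var_dump(is_valid_utf8($text));       // bool(true)
var_dump(is_valid_utf8($cut));        // bool(false)
```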
The solution is to use the multibyte string functions, but they have to be included in the PHP installation explicitly, as mbstring is a non-default extension. Here is an example with mb_substr:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<p>
<?php
mb_internal_encoding("UTF-8");
$text = "Critiquer la Bible sans écraser l’infâme";
echo mb_substr($text, 0, 37);
echo "<br />";
echo mb_substr($text, 0, 38);
?>
</p>
</body>
</html>
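If the limit you care about is a number of bytes rather than a number of characters (a fixed-size database column, say), mbstring also has mb_strcut, which cuts on byte offsets but backs up to the nearest character boundary. A minimal sketch, with a hypothetical wrapper name and the call guarded in case mbstring is not installed:

```php
<?php
// Hypothetical helper: keep at most $maxBytes bytes, never splitting a character.
// Requires the mbstring extension.
function truncate_bytes($text, $maxBytes)
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

$text = "Critiquer la Bible sans \xC3\xA9craser l\xE2\x80\x99inf\xC3\xA2me";

if (function_exists('mb_strcut')) {
    $cut = truncate_bytes($text, 41);
    echo $cut, "\n";         // stops after "inf": the half-included "â" is dropped
    echo strlen($cut), "\n"; // 40
}
```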
(You will probably notice that I changed the index at which the string is truncated. That’s because substr, like strlen, works on 1-byte characters, so when it counts the characters in a string that contains UTF-8 characters encoded with more than 1 byte, it “sees” more characters than there really are. As the mbstring functions can deal with multi-byte characters, we have to cut the string earlier to see whether “â” survives the chop: 37 characters stops just before it, 38 includes it.)
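Since mbstring may not be there, one possible approach is a small wrapper that uses mb_substr when it exists and falls back to a UTF-8-aware regular expression otherwise. This is only a sketch: the function name utf8_truncate is made up, returning an empty string for invalid input is an arbitrary choice, and the fallback is limited by PCRE to counts of at most 65535.

```php
<?php
// Hypothetical helper: keep at most $max characters of a UTF-8 string.
function utf8_truncate($text, $max)
{
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $max, 'UTF-8');
    }
    // Fallback: with /u, "." matches one whole UTF-8 character;
    // /s lets it match newlines too. Only valid for $max <= 65535.
    if (preg_match('/^.{0,' . (int) $max . '}/us', $text, $m) === 1) {
        return $m[0];
    }
    return ''; // arbitrary choice: the input was not valid UTF-8
}

$text = "Critiquer la Bible sans \xC3\xA9craser l\xE2\x80\x99inf\xC3\xA2me";
echo utf8_truncate($text, 37), "\n"; // stops just before "â"
echo utf8_truncate($text, 38), "\n"; // includes "â", intact
```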
AFAIK, PHP 6 will have Unicode support, so it will be the end of all this craziness, but it’s something to take into account when dealing with PHP 5 apps…