## Do typefaces really matter?

They must do because the topic keeps popping up on the Beeb website…

“These people remind me of wine snobs – they can detect all these subtle notes and flavours but the average person probably won’t notice all these tiny flourishes on a font. When you’re reading an article you’re not thinking about the font. You have to be looking at fonts all day before you start getting emotional about them.”


## Unmappable character for encoding UTF8

This classically happens in the following scenario: developers happily code in their Windows environment in Eclipse or whatever IDE they love, check in their stuff, and suddenly, CruiseControl spits out a whole lot of warnings, or even errors depending on how the build is configured. Looking at the code, everything compiles nicely on the developer’s machine:

public class EncodingExample {
    private final static String TEXT = "Éáíó";
    public static void main(String[] args) {
        System.out.println(EncodingExample.TEXT);
    }
}


Here is the Ant file used by the build in CC:

<?xml version="1.0" encoding="utf-8" ?>
<project name="test" default="compile">
    <target name="compile">
        <javac srcdir="src" destdir="classes" debug="true" />
    </target>
</project>


And yet, the CruiseControl logs show the following:

[javac] Compiling 1 source file to /home/sebastien/workspace/sandbox/classes
[javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
[javac] 	private final static String TEXT = "����";
[javac] 	                                    ^
[javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
[javac] 	private final static String TEXT = "����";
[javac] 	                                     ^
[javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
[javac] 	private final static String TEXT = "����";
[javac] 	                                      ^
[javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
[javac] 	private final static String TEXT = "����";
[javac] 	                                       ^
[javac] 4 warnings


Here is what happens: when working on Windows, the IDE is more than likely configured to edit files in Cp1252, which is a Microsoft adaptation of latin-1 [1]. The developer checks in, and the Continuous Integration server (usually running on Linux, which nowadays is all UTF-8) picks up the file and tries to compile it as a UTF-8 file, hence the warning.
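The mismatch can be reproduced outside of javac. The following is a minimal Python sketch (not part of the original build) that encodes the literal from EncodingExample as Cp1252 and then reads the bytes back as UTF-8, which is roughly what the compiler on the CI server does:

```python
# Simulate the mismatch: bytes written as Cp1252, then read as UTF-8.
TEXT = "\u00c9\u00e1\u00ed\u00f3"  # "Éáíó", the literal from EncodingExample

# What the Windows IDE writes to disk: one byte per character.
cp1252_bytes = TEXT.encode("cp1252")
print(" ".join(f"{b:02x}" for b in cp1252_bytes))

# What a UTF-8 reader makes of those bytes: none of them forms a valid
# sequence, so each one becomes U+FFFD (the replacement character).
misread = cp1252_bytes.decode("utf-8", errors="replace")
print(ascii(misread))

# Reading with the encoding the file was really saved in restores the text.
restored = cp1252_bytes.decode("cp1252")
print(restored == TEXT)
```

The four replacement characters correspond to the four warnings in the log above.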

The way to solve this is:

• Either save the file as UTF-8 (you can configure Eclipse, for example, to use UTF-8; make sure that you check in the Eclipse preference files as well, so that everybody uses the same settings), but everybody has to make sure they use that encoding,
• Or modify the Ant script to compile the file as Cp1252:

<?xml version="1.0" encoding="utf-8" ?>
<project name="test" default="compile">
    <target name="compile">
        <javac srcdir="src" destdir="classes"
               encoding="cp1252" debug="true" />
    </target>
</project>


You can also try encoding="iso-8859-1". There is nothing wrong in itself with not using utf-8 (cp1252 is not a “bad” encoding); you just have to make sure you keep the same encoding everywhere… And when working with Windows and Linux at the same time, this can sometimes prove tricky.
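If the team settles on UTF-8, one way to keep the encoding the same everywhere is to check the sources before the build. The sketch below is only an illustration: the function name `find_non_utf8_files` and the `*.java` pattern are assumptions of mine, not something the build above provides.

```python
from pathlib import Path


def find_non_utf8_files(root, pattern="*.java"):
    """Return the files under root whose bytes are not valid UTF-8."""
    offenders = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            # Typically a file saved as Cp1252 or latin-1 with accented letters.
            offenders.append(path)
    return offenders
```

Calling `find_non_utf8_files("src")` would list the files to re-save. A pure-ASCII file passes the check whatever encoding the editor claims to use, since ASCII bytes are identical in Cp1252, latin-1 and UTF-8.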

[1] Cp1252 contains, in particular, French characters missing from latin-1 such as œ, Œ, and Ÿ, as well as our beloved European €.
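The difference between the two encodings is easy to check. This small Python sketch (the helper name `encodable` is mine) tries to encode the four characters mentioned in the footnote in both Cp1252 and latin-1:

```python
def encodable(char, encoding):
    """Return True if char can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# œ, Œ, Ÿ and €, written as escapes so the script itself stays pure ASCII.
for char in "\u0153\u0152\u0178\u20ac":
    print(
        f"U+{ord(char):04X}",
        "cp1252:", encodable(char, "cp1252"),
        "latin-1:", encodable(char, "latin-1"),
    )
```

This is why encoding="iso-8859-1" is only a safe substitute for cp1252 as long as the sources do not use these characters.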


## MySQL and UTF-8

When working with UTF-8 on MySQL, it is not enough to define the CHARACTER SET and the COLLATE parameters to utf-8 when creating the database. You also have to tell MySQL that the queries you’ll be calling are utf-8. Indeed, by default the character set used by the connection and the result sets is latin-1:

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)


When doing your queries yourself with mysql_query, this can be a source of confusion, as your data is stored properly in UTF-8, but still comes back funny. That’s something that recently bit me as I was fiddling with an old version of ezSQL which didn’t allow the user to change the encoding [1].
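To give an idea of what “funny” can look like, here is a pure-Python simulation. It does not talk to MySQL at all, and the exact symptom depends on the setup; it only shows the two classic mix-ups between latin1 and UTF-8 bytes:

```python
stored = "\u00e9"  # "é", stored correctly as UTF-8 on the server

# With character_set_results = latin1, the server converts results to latin1,
# so the client receives one byte where a UTF-8 page expects two.
latin1_bytes = stored.encode("latin-1")
seen_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(seen_as_utf8))  # the replacement character

# The opposite mix-up: UTF-8 bytes interpreted as latin1 yield two characters.
utf8_bytes = stored.encode("utf-8")
seen_as_latin1 = utf8_bytes.decode("latin-1")
print(ascii(seen_as_latin1))
```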

You can force utf-8 by executing the following:

SET NAMES 'utf8';


Which is equivalent to:

SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;


In recent PHP (>= 5.2), you can also execute:

mysql_set_charset('utf8', $conn);


Libraries like Propel usually handle this quite well through a configuration option, relieving the developer of these worries. Typically, the runtime configuration settings for Propel would be:

<config>
    <propel>
        <datasources>
            <datasource>
                <connection>
                    <!-- ... -->
                    <settings>
                        <setting id="charset">utf8</setting>
                    </settings>


For Rails, it is very similar. When defining your database instance in config/database.yml, you can also give the encoding parameter:

development:
  adapter: mysql
  encoding: utf8
  reconnect: false
  database: pouet_dev
  pool: 5
  username: root
  password: pouet
  host: localhost
  socket: /var/run/mysqld/mysqld.sock


For Hibernate, arbitrary connection properties can be passed by using the property name, with hibernate.connection prepended to the name.

<property name="hibernate.connection.characterEncoding">UTF-8</property>


This is the MySQL Connector/J parameter used by the driver to indicate the encoding (note that the documentation indicates that SET NAMES 'utf8' would not work with Connector/J).

Examples will probably follow…

[1] Not sure recent versions do either?

## Jersey Typography

I am currently watching Uruguay v. Ghana, and I was thinking that the font on both teams’ jerseys was really cool. And despite both being equipped by Puma, the fonts are radically different, which is also pretty cool…

Well, it turns out that I’m not the only one having these thoughts whilst watching a football match, so here we go: World Cup Typography: Paul Barnes on fontfeed.com.

Good to see that lettering on jerseys is taken that seriously!

## Should vibrato be banned when singing “O Canada”?

That’s definitely a pertinent question when hearing this rendition of Canada’s National Anthem by Céline Dion:

Happy Canada Day!

## How to run some commands for XeLaTeX only?

To call some commands only when running xelatex on a file, I use an old trick that was quite useful when I wanted to run commands for pdflatex and not for plain latex: I check whether a given primitive is present, and if it is, do XeLaTeX stuff, else just do normal things:

\newif\ifxelatex
\ifx\XeTeXglyph\undefined
    \xelatexfalse
\else
    \xelatextrue
\fi

% You can now use \ifxelatex to execute XeLaTeX-specific stuff
\ifxelatex
    \usepackage[french]{polyglossia}
    \usepackage{xltxtra}
    \setmainfont[Mapping=tex-text]{Times New Roman}
\else
    \usepackage{babel}
    \usepackage[utf8]{inputenc}
    \usepackage{times}
    \usepackage[T1]{fontenc}
\fi


The trick here is to check whether the \XeTeXglyph primitive is present; if it is, the file is being XeLaTeX’ed, otherwise it’s probably PDFLaTeX’ed, or even LaTeX’ed, or whatever. The same can be achieved by importing the ifxetex package, which provides an \ifxetex command.

Strangely enough, when defining french as a documentclass option, it doesn’t automatically get passed to polyglossia, as I’d expect and as it does with PDFLaTeX – it almost made me believe for a while that polyglossia was broken for the French language, when it was just not getting the option.

## World Cup Knockout Stage Simulation

Via the Ruby Ireland mailing list: a cool Mathematica article on simulating the knockout stage of the World Cup (though it would have been easier to understand with proper mathematical formulæ rather than Mathematica code…)

## HTML and Regexp

I was browsing through a few “Daily WTF” entries, and came across this one, which straight away made me think of this hilarious SO response about the evil of parsing HTML with Regexp. Here is a short excerpt that doesn’t even do justice to the whole thing:

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.

Incidentally, probably the most creative use of utf-8 I have seen so far…

## Find in Which SVN Revision a File Has Been Deleted

It looks like the only way is to get the full Subversion log:

sebastien@greystones$ svn log --verbose > /tmp/svnlog.txt


and then look for the first reference of the deleted file in /tmp/svnlog.txt.

(You could possibly grep, but the file could be part of a large changeset, so you don’t really know how many lines to give --before-context to reach the revision number.)


## XeTeX

I was looking into how to typeset a Greek document in LaTeX using utf-8, and I must admit it proved to be a more complicated task than expected.

### The LaTeX Way

I started by using my usual approach, i.e. using inputenc with the utf8 definition:

\documentclass[a4paper,10pt]{article}

\usepackage[british,greek]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\begin{document}
\selectlanguage{greek}
Κατάγομαι από την Ιρλανδία.
\end{document}


But I got an error message:

! Package inputenc Error: Unicode char \u8:Κ not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H   for immediate help.
...

l.11 Κ
ατάγομαι από την Ιρλανδία.
?


Browsing the inputenc documentation, it appears that Greek characters are not set up: they have to be defined by the user.

Unfortunately the number of Unicode characters that in theory could be contained in a document is enormous. Thus even with today’s amount of computer memory it would be unrealistic to predefine all of them.

Characters can be defined using \DeclareUnicodeCharacter, which takes two arguments: the first is the Unicode code point (in hexadecimal), and the second is the LaTeX code it maps to.
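To get a feel for how many declarations that would mean, here is a short Python sketch (the helper name `non_ascii_code_points` is mine) that lists the distinct non-ASCII characters of the sample sentence with the hexadecimal code point that would be the first argument. What each one should map to is left out, as it depends on the font encoding in use.

```python
import unicodedata

SAMPLE = "Κατάγομαι από την Ιρλανδία."


def non_ascii_code_points(text):
    """Map each distinct non-ASCII character to (hex code point, Unicode name)."""
    seen = {}
    for char in text:
        if ord(char) > 127 and char not in seen:
            seen[char] = (f"{ord(char):04X}", unicodedata.name(char))
    return seen


for code_point, name in non_ascii_code_points(SAMPLE).values():
    print(code_point, name)
```

Even this one-line sentence needs well over a dozen declarations, which makes the manual route rather tedious for a whole document.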

I was then redirected to another (older?) package called ucs, which brings in a new option for inputenc called utf8x:

\documentclass[a4paper,10pt]{article}

\usepackage[british,greek]{babel}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}

\begin{document}
\selectlanguage{greek}
Κατάγομαι από την Ιρλανδία.
\end{document}


And this proved to be a hit.

However, given the author’s message (“Due to time restrictions, I am not able to maintain this package anymore”), I was not quite happy that I had found the “right” solution.

So I decided to look back at Ω, an extension of TeX using Unicode (and as I’m currently reading Yannis Haralambous’ book Fonts and Encodings, everything was converging back to it!).

### Enter XeTeX

Reading about Ω, references to XeTeX were popping up here and there, mentioning that it was a “recent Unicode capable TeX extension”, so it was definitely worth a look. And I wasn’t disappointed: XeTeX seems to be the next logical step after LaTeX. In particular, it supports:

• Unicode,
• Font technologies such as AAT and OpenType (this makes it so much easier to select the font you want),
• PDF: it produces PDF out of the box. To produce xdv, you can use the -no-pdf option.

These characteristics really make Xe(La)TeX a “modern” LaTeX. Let’s have a look at what our example now looks like:

\documentclass[a4paper,10pt]{article}

\usepackage{xltxtra}

\setmainfont[Mapping=tex-text]{DejaVu Sans}

\begin{document}
Κατάγομαι από την Ιρλανδία.
\end{document}


You then compile the document with:

xelatex greek-sample.tex


You may have noticed that I have had to remove the babel stuff. I was indeed getting the following error:

LaTeX Font Warning: Font shape `LGR/DejaVuSans(0)/m/n' undefined
(Font)              using `LGR/cmr/m/n' instead on input line 2.


and at the second pass:

! Corrupted NFSS tables.
wrong@fontshape ...message {Corrupted NFSS tables}
error@fontshape else let f...
l.6 \select@language{greek}


According to the fontspec documentation:

The babel package is not really supported! Especially Vietnamese, Greek, and Hebrew at least might not work correctly, as far as I can tell.

No need to panic: there is actually a replacement package called polyglossia, which “aims to remain as compatible as possible with the fundamental features of Babel while being cleaner, light-weight, and modern.”

Our document now becomes:

\documentclass[a4paper,10pt]{article}

\usepackage{polyglossia}
\usepackage{xltxtra}
\setdefaultlanguage{greek}
\setmainfont[Mapping=tex-text]{DejaVu Sans}

\begin{document}
Κατάγομαι από την Ιρλανδία.
\end{document}


And here is the result:

Having struggled with NFSS in the past, this really makes a user’s life so much easier.


## Limoges in Pro A!

Limoges celebrating a Pau-Orthez victory, how crazy is that! Pau was already guaranteed promotion, and a win against Aix Maurienne meant that Limoges would go up as well. The video of the countdown gives you goosebumps!!


## PHP and UTF-8

In PHP (5 and prior), characters are one byte long. When working with UTF-8 [1], this becomes an incredible royal PITA and an endless source of frustration, even for people used to working with characters present in latin-1. Even more annoying, some functions such as htmlentities, htmlspecialchars, etc. just assume latin-1 by default, and you have to remember to explicitly set the encoding, e.g.:

htmlentities($string, ENT_COMPAT, 'UTF-8');


But it also has some extremely annoying consequences for simple string functions such as substr or strlen. Typically:

$ echo '<?php echo strlen("é"); ?>' | php
2


Let’s look at an example seen this morning on a popular French literary blog running on the equally popular WordPress platform:

And here is more than likely what happened: characters in PHP are one byte long, but as we have seen in the past, characters in UTF-8 strings may be longer than one byte (up to 4). â belongs to the Latin-1 Supplement block, and is encoded on 2 bytes: C3 A2. As substr only deals with 1-byte characters, it simply cut “â” in the middle, leaving C3 in and getting rid of A2. C3 on its own is obviously invalid UTF-8, so it is replaced by the replacement character. Here is a file simulating this:

<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<body>
<p>
<?php
$text = "Critiquer la Bible sans écraser l’infâme";
echo substr($text, 0, 41);
?>
</p>
</body>
</html>


The solution is to use the multi-byte strings functions—but they have to be included in the PHP installation explicitly, as mbstring is a non-default extension. Here is an example with mb_substr:

<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<body>
<p>
<?php
mb_internal_encoding("UTF-8");
$text = "Critiquer la Bible sans écraser l’infâme";
echo mb_substr($text, 0, 37);
echo "<br />";
echo mb_substr($text, 0, 38);
?>
</p>
</body>
</html>


(You will probably notice that I changed the offset at which the string is truncated. That’s because substr and strlen count bytes, so a string containing UTF-8 characters encoded on more than one byte “looks” longer to them. As the mbstring functions count characters rather than bytes, we have to cut the string earlier to check whether “â” survives the chop.)
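The arithmetic behind those offsets can be checked with a few lines of Python. This is only a simulation of what PHP does: slicing the encoded bytes stands in for substr, and slicing the string stands in for mb_substr.

```python
# The sentence from the example; é, the typographic apostrophe and â are
# written as escapes so the script itself stays pure ASCII.
TEXT = "Critiquer la Bible sans \u00e9craser l\u2019inf\u00e2me"
data = TEXT.encode("utf-8")

# 40 characters, but more bytes: é and â take 2 bytes each, the apostrophe 3.
print(len(TEXT), len(data))

# Byte-based cut, like substr($text, 0, 41): splits the two bytes of "â".
byte_cut = data[:41].decode("utf-8", errors="replace")
print(ascii(byte_cut[-4:]))

# Character-based cuts, like mb_substr($text, 0, 37) and mb_substr($text, 0, 38).
before_a = TEXT[:37]
with_a = TEXT[:38]
print(ascii(before_a[-3:]), ascii(with_a[-4:]))
```

The first 37 characters already occupy 40 bytes, so byte 41 is the C3 half of “â”, which is exactly where substr cut.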

AFAIK, PHP 6 will have Unicode support, so it will be the end of all this craze, but it’s something to take into account when dealing with PHP 5 apps…

[1] UTF-8 is a popular encoding on the Web mainly because it is a variable-width encoding in which ASCII characters are encoded on one byte, most European, Cyrillic, Arabic and Hebrew ones on 2 bytes, and the rest of the world’s on 3 or 4 bytes (which made English-speaking users happy, as (1) text written in ASCII is “automatically” UTF-8, since the two match, and (2) it doesn’t increase the size of their files).
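The variable width is easy to observe. In this Python sketch, the sample characters and their labels are my own picks, chosen to cover each length:

```python
# One sample character per script or range (escapes keep the script ASCII).
SAMPLES = {
    "A": "ASCII",
    "\u00e9": "Latin-1 supplement",
    "\u0416": "Cyrillic",
    "\u05d0": "Hebrew",
    "\u20ac": "euro sign",
    "\u4e2d": "CJK",
    "\U0001f600": "emoji, outside the Basic Multilingual Plane",
}


def utf8_length(char):
    """Number of bytes used to encode char in UTF-8."""
    return len(char.encode("utf-8"))


for char, label in SAMPLES.items():
    print(f"U+{ord(char):04X} {label}: {utf8_length(char)} byte(s)")
```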


## RTÉ Big Big Bazaar

With the joys of having a kid come the joys of getting up early at weekends and watching kids’ programmes on TV. I usually tune to TV5 to give Sophie a bit more French than during the week (if she grows up with a strong and lively Québécois accent, don’t look any further!), but occasionally, I switch back to RTÉ. And at the weekend, I came across this programme called the Big Big Bazaar. Great idea and all: you get 2 teams of kids (somewhere between 8 and 11 years old) to collect stuff from local households to raise money for a local cause (a GAA club, a school band, etc.). It is a brilliant idea, and it’s great to see the kids visiting grandmothers to get a scone recipe, or sorting through piles of junk to find items to sell. Then, for 2 hours, the 2 teams try to sell as many things as possible.

Great idea, until it came to the end. The boys won: they raised something like 1,200+, and unfortunately, the girls only raised a bit more than 1,100€. So the girls lost. They are all very, very disappointed: they worked so hard, and fell about 100€ short… Then, the presenter swiftly says: “According to the rules of the Big Big Bazaar, the girls therefore have to give half of their money to the boys. Too bad…”

Whaaaaat? How mean is that?? So instead of raising (say) 1,100€, they now raise 550€, and give the rest to the boys. I find this just wrong. Ok, we have to teach our kids they can’t always win, but what sort of lesson are we trying to teach them here: if you lose, you’ll end up giving half your earnings to the winner?? Maybe that’s just me, but that doesn’t feel right to take money away from kids who’ve worked hard to get that money.


## Letter Playground

Link straight from the Liste Typo: http://www.letterplayground.com/


## Characters on Wikipedia Globe

Here are the characters on the revised Wikipedia globe after several errors were corrected. These characters represent the first letter of “Wikipedia” in different languages.

It is also interesting to note that the “W” in the favicon is actually not a “W”, but two overlapping “V”s.
