Weblogism: Articles

26.08.10

In case you’d forgotten, the 2010 FIBA World Championship starts tomorrow in Turkey. France’s chances are not looking great, especially after the disastrous results against the US and Brazil.

A good opportunity for me to mention the new FFBB font designed by Christophe Badani, whose art has been on my desktop for years:

(That’s back in 2004, and this wallpaper has followed me around…)

14.08.10

— Sébastien Le Callonnec

To Kill a CodingBird, JRuby

JRuby: Reading Java Annotations

I have shown examples of how JRuby could invoke Java classes, in particular SWT components. Here is now another example, this time using JRuby to read annotations in Java classes.

Here is the situation: when using FitNesse, developers sometimes have to write what are called fixtures, that is, Java classes that can be used (usually by a business analyst) to write tests in the FitNesse wiki. Fixtures perform the actual testing and return the result so that it can be displayed on the wiki. They can be very straightforward, or they can interact with web pages (via Selenium, for instance) or even call web services. As they are to be used by non-developers, they have to be properly documented. One way to do this is simply to publish the JavaDoc – but that would kind of ruin my JRuby example¹!

So let’s use annotations to document our fixtures. Here is an example of an annotation that could be used:

package com.weblogism.jruby;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface FixtureHint {
    String usage();
    String description();
}

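RetentionPolicy.RUNTIME indicates that this annotation is kept in the compiled class and will be available at runtime through reflection; @Target(ElementType.METHOD) restricts it to methods.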

Fixtures can then be documented as follows²:

package com.weblogism.jruby;

public class TheFixture {
    @FixtureHint(usage="| click the | _button_id_ | button |", description="clicks _button_id_")
    public boolean clickTheButton() {
        System.out.println("Click");
        return true;
    }
}

These classes are nicely packaged in a jar called test.jar. So how to use JRuby to find the annotations? Extremely simple:

require 'java'
require 'lib/test.jar'
include_class 'com.weblogism.jruby.TheFixture'
include_class 'com.weblogism.jruby.FixtureHint'

You first import all the Java stuff, such as your classes. Don’t forget that include_class can be called dynamically, so you could potentially search for all the relevant classes in the jar and then import them all. Here, the jar is explicitly required (it is located in the lib directory of the current working directory), but another way to make it “visible” to the script is to add it to the CLASSPATH environment variable.

annotations = {}
TheFixture.java_class.declared_instance_methods.each do |m|
  if m.annotation_present?(FixtureHint.java_class)
    annotations[m.name] = m.annotation(FixtureHint.java_class)
  end
end

annotations.values.each do |a|
  puts "#{a.usage}\t#{a.description}"
end

And it’s as simple as that: the Java methods isAnnotationPresent and getAnnotation become annotation_present? and annotation (à la Ruby), and once the annotations have been found, they can be manipulated like Ruby objects.
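For comparison, here is roughly what the same lookup looks like with plain Java reflection. This is only a sketch: the annotation and the fixture are re-declared as nested types so that the file stands alone, and the AnnotationReader class, its readHints method and the extra undocumented method are names made up for this illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper class; the nested types mirror the ones shown earlier.
class AnnotationReader {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface FixtureHint {
        String usage();
        String description();
    }

    static class TheFixture {
        @FixtureHint(usage = "| click the | _button_id_ | button |", description = "clicks _button_id_")
        public boolean clickTheButton() {
            return true;
        }

        // No annotation here, so this method is skipped by readHints.
        public void undocumented() {
        }
    }

    // Maps the name of each annotated method to its FixtureHint.
    static Map<String, FixtureHint> readHints(Class<?> fixture) {
        Map<String, FixtureHint> hints = new LinkedHashMap<>();
        for (Method m : fixture.getDeclaredMethods()) {
            if (m.isAnnotationPresent(FixtureHint.class)) {
                hints.put(m.getName(), m.getAnnotation(FixtureHint.class));
            }
        }
        return hints;
    }

    public static void main(String[] args) {
        for (FixtureHint hint : readHints(TheFixture.class).values()) {
            System.out.println(hint.usage() + "\t" + hint.description());
        }
    }
}
```

The JRuby version is shorter mostly because the map-building loop collapses into a block, but the reflection calls are the same underneath.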

JRuby version:

sebastien@greystones:~/workspace/sandbox$ jruby -v 
jruby 1.6.0.dev (ruby 1.8.7 patchlevel 249) (2010-08-10 f740f78)
(Java HotSpot(TM) 64-Bit Server VM 1.6.0_20) [amd64-java]

¹ The example might be a bit convoluted, but it illustrates the use of annotations through a real-life requirement.

² Fixtures would usually extend a fixture class, e.g. DoFixture, ColumnFixture, etc., but this one doesn’t, to keep things simple.

3.08.10

— Sébastien Le Callonnec

To Kill a CodingBird,

Eclipse Companion Shared Library gone AWOL

Coming back from hols, I decided to upgrade to Eclipse Helios, knowing that the morning wouldn’t be too hectic.

I had a minor glitch, though, with the following error:

The Eclipse executable launcher was unable to locate its companion shared library.

Not sure how I ended up in this situation (maybe unzipping with Cygwin was the cause?), but the fix was straightforward enough. Look for a dll in the plugins directory (I found it there: ./plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.0.v20100503/eclipse_1307.dll); in Windows Explorer, right-click the file, click on Properties, and in the Security tab, make sure Read & Execute permission is set (either for everyone, or for the user you’re logged on as). Click OK, and that does the trick.

20.07.10

— Sébastien Le Callonnec

The Typesetting of Life,

Do typefaces really matter?

They must do because the topic keeps popping up on the Beeb website…

“These people remind me of wine snobs – they can detect all these subtle notes and flavours but the average person probably won’t notice all these tiny flourishes on a font. When you’re reading an article you’re not thinking about the font. You have to be looking at fonts all day before you start getting emotional about them.”

11.07.10

— Sébastien Le Callonnec

To Kill a CodingBird, Java

Unmappable character for encoding UTF8

This typically happens in the following scenario: developers happily code in Eclipse (or whatever IDE they love) on Windows, check in their stuff, and suddenly CruiseControl spits out a whole lot of warnings, or even errors, depending on how the build is configured. Yet on the developer’s machine, the code compiles without a hitch:

public class EncodingExample {
	private final static String TEXT = "Éáíó";
	public static void main(String[] args) {
		System.out.println(EncodingExample.TEXT);
	}
}

Here is the Ant file used by the build in CC:

<?xml version="1.0" encoding="utf-8" ?>
<project name="test" default="compile">
	<target name="compile">
		<javac srcdir="src" destdir="classes" debug="true" />
	</target>
</project>

And yet, the CruiseControl logs show the following:

    [javac] Compiling 1 source file to /home/sebastien/workspace/sandbox/classes
    [javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
    [javac] 	private final static String TEXT = "����";
    [javac] 	                                    ^
    [javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
    [javac] 	private final static String TEXT = "����";
    [javac] 	                                     ^
    [javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
    [javac] 	private final static String TEXT = "����";
    [javac] 	                                      ^
    [javac] /home/sebastien/workspace/sandbox/src/EncodingExample.java:2: warning: unmappable character for encoding UTF8
    [javac] 	private final static String TEXT = "����";
    [javac] 	                                       ^
    [javac] 4 warnings

Here is what happens: when working on Windows, the IDE is more than likely configured to edit files in Cp1252, which is a Microsoft adaptation of latin-1¹. The developer checks in, and the Continuous Integration server (usually running on Linux, which nowadays is all UTF-8) picks up the file and tries to compile it as a UTF-8 file, hence the warning.
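To see the mismatch without involving javac at all, here is a small sketch that encodes the same four characters as Cp1252 and then checks whether those bytes are well-formed UTF-8. The class and method names are made up for the illustration, and the string is written with Unicode escapes so that the file itself is pure ASCII and compiles the same everywhere.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the build shown above.
class EncodingMismatch {

    // Returns true if the bytes are well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Same text as EncodingExample.TEXT, spelled with escapes.
        String text = "\u00c9\u00e1\u00ed\u00f3";

        // What the file contains when saved on Windows vs. saved as UTF-8.
        byte[] savedAsCp1252 = text.getBytes(Charset.forName("windows-1252"));
        byte[] savedAsUtf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Cp1252 bytes valid as UTF-8? " + isValidUtf8(savedAsCp1252));
        System.out.println("UTF-8 bytes valid as UTF-8?  " + isValidUtf8(savedAsUtf8));

        // A lenient decode substitutes U+FFFD for whatever it cannot read.
        String misread = new String(savedAsCp1252, StandardCharsets.UTF_8);
        System.out.println("Contains U+FFFD? " + (misread.indexOf('\uFFFD') >= 0));
    }
}
```

In Cp1252 each of these characters is a single byte above 0x7F, and none of those bytes is followed by the continuation byte UTF-8 expects, which is why the compiler complains about every one of them.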

There are two ways to solve this:
– Either save the files as UTF-8 (you can configure Eclipse, for example, to use UTF-8; check in the Eclipse preference files as well so that everybody uses the same settings), in which case everybody has to stick to that encoding,
– Or modify the Ant script to compile the files as Cp1252:

<?xml version="1.0" encoding="utf-8" ?>
<project name="test" default="compile">
	<target name="compile">
		<javac srcdir="src" destdir="classes" 
                           encoding="cp1252" debug="true" />
	</target>
</project>

You can also try encoding="iso-8859-1". There is nothing wrong in itself with not using UTF-8 (cp1252 is not a bad encoding); you just have to make sure you keep the same encoding everywhere… And when working with Windows and Linux at the same time, that can sometimes prove tricky.

¹ It contains, in particular, French characters missing from latin-1 such as œ, Œ, and Ÿ. As well as our beloved European €.

11.07.10

— Sébastien Le Callonnec

To Kill a CodingBird,

MySQL and UTF-8

When working with UTF-8 on MySQL, it is not enough to set the CHARACTER SET and COLLATE parameters to utf-8 when creating the database. You also have to tell MySQL that the queries you send are in utf-8. Indeed, by default, the character set used by the connection and the result sets is latin-1:

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

When doing your queries yourself with mysql_query, this can be a source of confusion, as your data is stored properly in UTF-8, but still comes back funny. That’s something that recently bit me as I was fiddling with an old version of ezSQL which didn’t allow the user to change the encoding¹.

You can force utf-8 by executing the following:

SET NAMES 'utf8';

Which is equivalent to:

SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;

In recent PHP (>= 5.2.3), you can also execute:

mysql_set_charset('utf8',$conn);

Libraries like Propel usually handle that quite well by specifying a configuration option, and relieving the developer from these worries. Typically, the runtime configuration settings for Propel would be:

<config>
 <propel>
  <datasources>
   <datasource>
    <connection>
     <!-- ... -->
     <settings>
      <setting id="charset">utf8</setting>
     </settings>

Rails is very similar: when defining your database in config/database.yml, you can give the encoding parameter:

development:
  adapter: mysql
  encoding: utf8
  reconnect: false
  database: pouet_dev
  pool: 5
  username: root
  password: pouet
  host: localhost
  socket: /var/run/mysqld/mysqld.sock

For Hibernate, arbitrary connection properties can be passed by prepending hibernate.connection to the property name:

<property name="hibernate.connection.characterEncoding">UTF-8</property>

This is the MySQL Connector/J parameter used by the driver to indicate the encoding (note that the documentation indicates that SET NAMES 'utf8' would not work with Connector/J). Examples will probably follow…
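With plain JDBC rather than Hibernate, the same characterEncoding property can be handed to the driver through the connection properties. A minimal sketch, assuming the database name and credentials from the Rails example above; the class and method names are made up, and nothing is actually connected here, since that would need Connector/J on the classpath and a running server.

```java
import java.util.Properties;

// Hypothetical helper: only prepares the settings, it does not open a connection.
class MysqlUtf8Settings {

    // Builds the properties passed to the driver, including the encoding.
    static Properties connectionProperties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Same Connector/J parameter as in the Hibernate property above.
        props.setProperty("characterEncoding", "UTF-8");
        return props;
    }

    public static void main(String[] args) {
        // Host and database name are assumptions taken from the Rails config.
        String url = "jdbc:mysql://localhost/pouet_dev";
        Properties props = connectionProperties("root", "pouet");

        // With the driver available and the server running, one would then call:
        // java.sql.Connection conn = java.sql.DriverManager.getConnection(url, props);
        System.out.println(url + " with characterEncoding=" + props.getProperty("characterEncoding"));
    }
}
```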

¹ Not sure recent versions do either?

2.07.10

— Sébastien Le Callonnec

The Typesetting of Life,

Jersey Typography

I am currently watching Uruguay v. Ghana, and I was thinking that the font on both teams’ jerseys was really cool. And despite both being equipped by Puma, the fonts are radically different, which is also pretty cool…

Well, it turns out that I’m not the only one having these thoughts whilst watching a football match, so here we go: World Cup Typography: Paul Barnes on fontfeed.com.

Good to see that lettering on jerseys is taken that seriously!

1.07.10

— Sébastien Le Callonnec

“Curiouser and curiouser!”, Weblogism

Should vibrato be banned when singing “O Canada”?

That’s definitely a pertinent question when hearing this rendition of Canada’s National Anthem by Céline Dion:

Happy Canada Day!

1.07.10

— Sébastien Le Callonnec

To Kill a CodingBird, TeX/LaTeX

How to run some commands for XeLaTeX only?

To call some commands only when running xelatex on a file, I use an old trick that was quite useful when I wanted to run commands for pdflatex and not for plain latex: I check whether a given primitive is defined, and if it is, do the XeLaTeX stuff, else just do the normal things:

\newif\ifxelatex
  \ifx\XeTeXglyph\undefined
    \xelatexfalse
  \else
    \xelatextrue
  \fi

% You can now use \ifxelatex to execute XeLaTeX-specific stuff
\ifxelatex
\usepackage[french]{polyglossia}
\usepackage{xltxtra}
\setmainfont[Mapping=tex-text]{Times New Roman}
\else
\usepackage{babel}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\fi

The trick here is to check whether the \XeTeXglyph primitive is defined; if it is, the file is being XeLaTeX’ed, otherwise it’s probably PDFLaTeX’ed, or even LaTeX’ed, or whatever. The same can be achieved by loading the ifxetex package, which provides an \ifxetex command.

Strangely enough, when french is defined as a documentclass option, it doesn’t automatically get passed to polyglossia as I’d expect, the way it is passed along under PDFLaTeX. For a while this almost made me believe that polyglossia was broken for French, when it was simply not getting the option.

28.06.10

— Sébastien Le Callonnec

“Curiouser and curiouser!”,

World Cup Knockout Stage Simulation

Via the Ruby Ireland mailing list: a cool Mathematica article on simulating the knockout stage of the World Cup (though it would have been easier to understand with proper mathematical formulæ rather than Mathematica code…)

8.06.10

— Sébastien Le Callonnec

To Kill a CodingBird,

HTML and Regexp

I was browsing through a few “Daily WTF”, and came across this one, which straight away made me think of this hilarious SO response about the evil of parsing HTML with Regexp. Here is a short excerpt that doesn’t even do justice to the whole thing:

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.

Incidentally, probably the most creative use of utf-8 I have seen so far…

6.06.10

— Sébastien Le Callonnec

To Kill a CodingBird,

Find in Which SVN Revision a File Has Been Deleted

It looks like the only way is to get the full Subversion log:

sebastien@greystones$ svn log --verbose > /tmp/svnlog.txt

and then look for the first reference to the deleted file in /tmp/svnlog.txt.

(You could possibly grep, but this file could be part of a large changeset, so you don’t really know how many lines in --before-context to use to get the revision number)
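Since grep can’t know how far back the revision header is, a few lines of code that remember the last header seen do the job instead. This is just a sketch, assuming the usual layout of svn log --verbose (a header line starting with rNNN |, followed by a Changed paths: list where deletions are flagged with D); the class name, method name and sample log are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that scans the text produced by "svn log --verbose".
class SvnDeletionFinder {

    // Header line of a log entry, e.g. "r42 | sebastien | 2010-06-05 ... | 1 line"
    private static final Pattern REVISION = Pattern.compile("^r(\\d+) \\|.*");

    // Returns the first revision in the log that deletes the given path, or -1.
    // svn lists the newest revisions first, so this is the latest deletion.
    static long findDeletingRevision(String log, String path) {
        long current = -1;
        for (String line : log.split("\\r?\\n")) {
            Matcher m = REVISION.matcher(line);
            if (m.matches()) {
                // Remember which revision the following changed paths belong to.
                current = Long.parseLong(m.group(1));
            } else if (line.trim().equals("D " + path)) {
                return current;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up log excerpt in the assumed layout.
        String log =
              "------------------------------------------------------------------------\n"
            + "r42 | sebastien | 2010-06-05 10:00:00 +0100 (Sat, 05 Jun 2010) | 1 line\n"
            + "Changed paths:\n"
            + "   M /trunk/a.txt\n"
            + "   D /trunk/old.txt\n"
            + "\n"
            + "cleanup\n"
            + "------------------------------------------------------------------------\n";
        System.out.println(findDeletingRevision(log, "/trunk/old.txt"));
    }
}
```

A commit message that happens to look like a header or a deletion line would fool it, so treat the result as a hint and double-check the revision.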

5.06.10

— Sébastien Le Callonnec

Weblogism,

Limoges in Pro A!

Limoges celebrating a Pau-Orthez victory, how crazy is that! Pau was already guaranteed promotion, and a win against Aix Maurienne meant that Limoges would go up as well. The video of the countdown gives you goosebumps!!

3.06.10

— Sébastien Le Callonnec

To Kill a CodingBird,

PHP and UTF-8

In PHP (5 and prior), a character is one byte long. When working with UTF-8¹, this becomes a royal PITA and an endless source of frustration, even for people used to working with characters present in latin-1. Even more annoying, some functions such as htmlentities, htmlspecialchars, etc. just assume latin-1 by default, and you have to remember to explicitly set the encoding, e.g.:

htmlentities($string, ENT_COMPAT, 'UTF-8');

But it also has some extremely annoying consequences for simple string functions such as substr or strlen. Typically:

$ echo '<?php echo strlen("é"); ?>' | php
2

Let’s look at an example seen this morning on a popular French literary blog running on the equally popular WordPress platform:

And here is more than likely what happened: characters in PHP are one byte long, but as we have seen in the past, characters in UTF-8 strings may be longer than one byte (up to 4). â belongs to the Latin-1 Supplement block, and is encoded on 2 bytes: C3 A2. As substr only deals with 1-byte characters, it simply cut “â” in the middle, leaving C3 in and getting rid of A2. C3 on its own is obviously invalid UTF-8, so it is displayed as the replacement character. Here is a file simulating this:

<html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
  <p>
<?php
$text = "Critiquer la Bible sans écraser l’infâme";
echo substr($text, 0, 41);
?>
</p>
</body>
</html>

The solution is to use the multi-byte strings functions—but they have to be included in the PHP installation explicitly, as mbstring is a non-default extension. Here is an example with mb_substr:

<html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
  <p>
<?php
mb_internal_encoding("UTF-8");
$text = "Critiquer la Bible sans écraser l’infâme";
echo mb_substr($text, 0, 37);
echo "<br />";
echo mb_substr($text, 0, 38);
?>
</p>
</body>
</html>

(You will probably notice that I changed the index at which the string is truncated. That’s because substr, like strlen, counts 1-byte characters, so in a string containing UTF-8 characters encoded on more than 1 byte, it “sees” more characters than there really are. Here, “é” takes 2 bytes and “’” takes 3, so byte 41 falls in the middle of the 38th character. Since the mbstring functions count real characters, we have to cut the string earlier to see whether “â” avoids the chop.)
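The byte-versus-character arithmetic can be checked outside PHP too. Here is a small Java sketch of it, with made-up class and method names: cutBytes imitates a byte-oriented substr by truncating the UTF-8 bytes, and cutChars imitates mb_substr by truncating characters (fine here because every character of the sample is in the Basic Multilingual Plane). The text is spelled with Unicode escapes so that the source file stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of cutting a UTF-8 string by bytes vs. by characters.
class ByteVersusCharTruncation {

    // Keeps the first byteCount bytes of the UTF-8 form, then decodes what is left.
    static String cutBytes(String text, int byteCount) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, byteCount, StandardCharsets.UTF_8);
    }

    // Keeps the first charCount characters (assumes no characters outside the BMP).
    static String cutChars(String text, int charCount) {
        return text.substring(0, charCount);
    }

    public static void main(String[] args) {
        // Same sentence as in the PHP files above.
        String text = "Critiquer la Bible sans \u00e9craser l\u2019inf\u00e2me";

        System.out.println("characters: " + text.length());
        System.out.println("bytes:      " + text.getBytes(StandardCharsets.UTF_8).length);

        // Cutting at byte 41 splits the two bytes of the a-circumflex.
        String byBytes = cutBytes(text, 41);
        System.out.println("ends with U+FFFD? " + byBytes.endsWith("\uFFFD"));

        // Cutting at character 38 keeps it whole.
        String byChars = cutChars(text, 38);
        System.out.println("ends with a-circumflex? " + byChars.endsWith("\u00e2"));
    }
}
```

The sentence is 40 characters but 44 bytes long, which is exactly the gap between the 41 passed to substr and the 38 passed to mb_substr.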

AFAIK, PHP 6 will have Unicode support, so it will be the end of all this craze, but it’s something to take into account when dealing with PHP 5 apps…

¹ UTF-8 is a popular encoding on the Web mainly because it is a variable-width encoding where ASCII characters are encoded on one byte, most European, Cyrillic, Arabic and Hebrew ones on 2 bytes, and the rest of the world uses 3 or 4 byte-long characters (so it made English-speaking users happy, as (1) text written in ASCII is “automatically” in UTF-8, since the two match, and (2) it doesn’t increase the size of their files).

1.06.10

— Sébastien Le Callonnec

Symphony of the Good Auld World,

RTÉ Big Big Bazaar

With the joys of having a kid come the joys of getting up early at weekends and watching kids’ programmes on TV. I usually tune to TV5 to give Sophie a bit more French than during the week (if she grows up with a strong and lively Québécois accent, don’t look any further!), but occasionally, I switch back to RTÉ. And at the weekend, I came across this programme called the Big Big Bazaar. Great idea and all: you get 2 teams of kids (somewhere between 8 and 11 years old) to collect stuff from local households to raise money for a local cause (a GAA club, a school band, etc.). It is a brilliant idea, and it’s great to see the kids visiting grandmothers to get a scone recipe, or sorting through piles of junk to find items to sell. Then, for 2 hours, the 2 teams try to sell as many things as possible.

Great idea, until it came to the end. The boys won, having raised something like 1,200€+, and unfortunately, the girls only raised a bit more than 1,100€. So the girls lost. They were all very, very disappointed: they had worked so hard, and fell about 100€ short… Then, the presenter swiftly said: “According to the rules of the Big Big Bazaar, the girls therefore have to give half of their money to the boys. Too bad…”

Whaaaaat? How mean is that?? So instead of raising (say) 1,100€, they now raise 550€, and give the rest to the boys. I find this just wrong. Ok, we have to teach our kids they can’t always win, but what sort of lesson are we trying to teach them here: if you lose, you’ll end up giving half your earnings to the winner?? Maybe that’s just me, but it doesn’t feel right to take money away from kids who’ve worked hard to get it.
