Friday, October 23, 2009

i18n support for German Characters in Java

Problem: Non-ASCII characters are decoded with the wrong code page on the Java platform (Windows).

Symptoms: Accented characters such as the German umlauts ä, ö, ü (and similar characters in Spanish and Portuguese) are garbled in the output. The Java application receives data over a socket using an InputStreamReader, which reports "Cp1252" from its getEncoding() method:

/* java.net. */ Socket Sock = ...;
InputStreamReader is = new InputStreamReader(Sock.getInputStream());
System.out.println("Character encoding = " + is.getEncoding());
// Prints "Character encoding = Cp1252"

That doesn't necessarily match what the system reports as its code page. For example:

C:\>chcp
Active code page: 850

The application may receive byte 0x81, which in code page 850 represents the character ü. The program interprets that byte with code page 1252, which doesn't define any character at that value, so a question mark is printed instead.
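
The mismatch can be reproduced without a socket. The following is a minimal sketch, assuming the JRE ships the IBM850 charset (Java's canonical name for code page 850; a stripped-down runtime may lack it). The class and method names are made up for illustration and are not part of the original application.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
class CodePageMismatch {
    // Decode raw bytes using the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] received = { (byte) 0x81 }; // the byte from the example above

        String as850 = decode(received, "IBM850");        // code page 850
        String as1252 = decode(received, "windows-1252"); // what Java calls Cp1252

        // Print code points rather than the characters themselves,
        // so the result does not depend on the console's own code page.
        System.out.printf("IBM850:       U+%04X%n", (int) as850.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) as1252.charAt(0));
    }
}
```

Only the IBM850 line should show U+00FC (ü). The windows-1252 line shows whatever substitute the decoder picks for an undefined byte.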

Solution:

You can work around this problem by telling the JVM to use code page 850, i.e., by adding a command-line option:

java.exe -Dfile.encoding=Cp850 ...

Or, in a batch file, with the code page held in a variable:

set ENC=...
java.exe -Dfile.encoding=%ENC% ...
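
Setting file.encoding changes the default for the whole JVM. An alternative, sketched below, is to name the charset when constructing the reader, so the socket data is decoded the same way whatever the default happens to be. This assumes the sender really uses code page 850. The ByteArrayInputStream stands in for Sock.getInputStream(), and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
class ExplicitCharsetReader {
    // Read the whole stream, decoding with the given charset
    // instead of the platform default.
    static String readAll(InputStream in, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes as a cp850 sender would transmit "für".
        byte[] wire = { 'f', (byte) 0x81, 'r' };
        Charset cp850 = Charset.forName("IBM850"); // assumed to be available

        String text = readAll(new ByteArrayInputStream(wire), cp850);

        // Show the decoded code points, independent of console encoding.
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

With a real socket, the same constructor applies: new InputStreamReader(Sock.getInputStream(), cp850).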


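The same mismatch shows up at the command line. Here a file saved in code page 1251 (1251.txt) is displayed while the console is set to code page 850, so its non-ASCII characters come out garbled: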

> chcp 850
Active code page: 850

> type 1251.txt
abcde xyz
ÓßÔÒõ ²■


Some pointers relevant to this:

> http://illegalargumentexception.blogspot.com/2009/04/i18n-unicode-at-windows-command-prompt.html
> http://stackoverflow.com/questions/1336930/how-do-you-specify-a-java-file-encoding-value-consistent-with-the-underlying-wind
