
Removing non-English characters

Posted in General on December 14th, 2008 by mark

Like many workplaces we have lots of info stored in spreadsheets. In my case we have a spreadsheet that we need to import into a database, which in the simple case is pretty straightforward. But this spreadsheet contains non-English characters. You know the ones: e acute (é) for café, and rockdots for the hardcore Motörhead fans. For my purposes I need to convert these into their English equivalents, as I’m trying to represent user input. It should be no surprise that English speakers do not enter é when they are looking for cafes. The process of making these changes is called transliteration, and I don’t want to do it manually.

Enter iconv

iconv is an awesome piece of software that converts text from one character encoding to another. It also has transliteration built in, which will let me convert those fancy foreign “cafés” into bog-standard “cafes”.

$ echo 'café numero uno' | LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT
cafe numero uno

In this example I set the locale to French using the LC_ALL shell environment variable, then let iconv do its magic.
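
The transliteration only works if the locale you ask for is actually installed, so it’s worth checking first. Something like this should do it (as far as I can tell any installed UTF-8 locale will transliterate just as happily, and the exact name in the output may vary between systems):

$ locale -a | grep -i fr_FR
fr_FR.utf8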

Microsoft Excel

My source dataset is in Excel 2003 format and I need to load this spreadsheet into PostgreSQL as ASCII (even though my db is in UTF-8 – remember that I’m trying to emulate user input). If you export your spreadsheet in CSV format you lose all that nice non-English encoding and iconv will have nothing to work with. Instead I export as Unicode Text (_File, Save As…, Save as type: Unicode Text (*.txt)_) and scp it up to my Linux development box. Once there I can check its type by running:

$ file test_cases.txt
test_cases.txt: Little-endian UTF-16 Unicode English character data, with CRLF, CR line terminators

The file is now a tab-delimited text file and I want a CSV (comma-separated) file.
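
If you want to eyeball the export before converting it, iconv can feed a readable copy to cat. On GNU coreutils, cat -A marks the tabs as ^I and the DOS line endings as ^M$, which lines up with the tab-delimited, CRLF-terminated format described above:

$ iconv -f UTF-16 -t UTF-8 test_cases.txt | cat -A | head -n 3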

Putting it all together

The final steps are to transliterate the file to ASCII and then convert the text file to a CSV.  I’ll use sed to translate tabs (\t) into commas. This line does the trick:

$ LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-16 test_cases.txt | sed -e 's/\t/,/g' > test_cases.csv
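
From there the CSV can go into PostgreSQL with \copy from psql. The sketch below is just one way to do it: the database name and table are made-up placeholders, and I strip the DOS carriage returns first since file reported CRLF line endings (COPY should cope with them anyway, but it keeps the data tidy):

# mydb and the test_cases table below are placeholders – change them to suit your schema
$ sed -i 's/\r$//' test_cases.csv
$ psql -d mydb -c "\copy test_cases FROM 'test_cases.csv' WITH CSV"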

There are more ways to get this stuff wrong than you can point a stick at, so I’ll put a disclaimer right here to say that this works for me just fine! :)