Removing non-english characters
Like many workplaces we have lots of info stored in spreadsheets. In my case we have a spreadsheet that we need to import into a database which, in the simple case, is pretty straight forward. But in this spreadsheet there are non-English characters. You know the ones, e acute (é) for café and for the hardcore Motörhead fans. For my purposes I need to convert these into their English equivalent as I’m trying to represent user input. It should be no surprise that English speakers do not enter é when they are looking for cafes. The process of making these changes is called and I don’t want to do it manually.
Enter iconv
is an awesome piece of software which converts character strings from one character encoding to another. iconv also has . This will allow me to convert those fancy foreign “cafés” into bog standard “cafes”.
$ echo 'café numero uno' | LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT
cafe numero uno
In this example I set the locale to be French using a shell environment variable of LC_ALL then let iconv do it’s magic.
Microsoft Excel
My source dataset is in Excel 2003 format and I need to load this spreadsheet into in ASCII format (even though my db is in UTF-8 – remember that I’m trying to emulate user input). If you export you spreadsheet in CSV format you lose all that nice non-English encoding and iconv will have nothing to work with. Instead I export as Unicode Text (_File, Save As.., Save as type: Unicode Text (*.txt)_) and scp it up to my linux development box. Once there I can check it’s type by issuing a:
$ file test_cases.txt
test_cases.txt: Little-endian UTF-16 Unicode English character data, with CRLF, CR line terminators
The file is now a tab delimited text file and I want a CSV (comma separated) file.
Putting it all together
The final steps are to transliterate the file to ASCII and then convert the text file to a CSV. I’ll use sed to translate tabs (\t) into commas. This line does the trick:
LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-16 test_cases.txt | sed -e 's/\t/,/g' > test_cases.csv
There are more ways to get this stuff wrong than you can point a stick at, so I’ll put a disclaimer right here to say that this works for me just fine!