Removing non-English characters
Like many workplaces, we have lots of info stored in spreadsheets. In my case we have a spreadsheet that we need to import into a database which, in the simple case, is pretty straightforward. But this spreadsheet contains non-English characters. You know the ones: e acute (é) for café, and so on. iconv can transliterate them down to plain ASCII:
$ echo 'café numero uno' | LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT
cafe numero uno
In this example I set the locale to French using the LC_ALL shell environment variable and let iconv do its magic.
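The same trick covers other accents and ligatures too. As a rough illustration (the sample string here is made up, and the exact output depends on your iconv implementation; this is what glibc gives me):

$ echo 'naïve façade œuvre' | LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT
naive facade oeuvre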
My source dataset is in Excel 2003 format and I need to load this spreadsheet into the database. The first step is to get it out of Excel, so I saved the spreadsheet as Unicode text, which gives me test_cases.txt. Running file over it shows what we're dealing with:
$ file test_cases.txt
test_cases.txt: Little-endian UTF-16 Unicode English character data, with CRLF, CR line terminators
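If you want to eyeball the contents before going any further, you can convert it to UTF-8 on the fly and look at the first few lines (filename as above; the head count is arbitrary):

$ iconv -f UTF-16 -t UTF-8 test_cases.txt | head -3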
The file is now a tab-delimited text file, but what I want is a CSV (comma-separated) file.
Putting it all together
The final steps are to transliterate the file to ASCII and then convert the text file to a CSV. I’ll use sed to translate tabs (\t) into commas. This line does the trick:
$ LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-16 test_cases.txt | sed -e 's/\t/,/g' > test_cases.csv
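One thing to watch: file reported CRLF line endings above, so test_cases.csv will still carry a carriage return at the end of each line. If your database import chokes on that, a variation like this (just a sketch with an extra sed expression, not part of the original one-liner) strips them as well:

$ LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-16 test_cases.txt | sed -e 's/\r$//' -e 's/\t/,/g' > test_cases.csv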
There are more ways to get this stuff wrong than you can poke a stick at, so I’ll put a disclaimer right here to say that this works just fine for me!