<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>State Of Flux &#187; iconv</title>
	<atom:link href="http://stateofflux.com/tag/iconv/feed/" rel="self" type="application/rss+xml" />
	<link>http://stateofflux.com</link>
	<description>always changing</description>
	<lastBuildDate>Fri, 29 Jan 2010 03:29:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Removing non-English characters</title>
		<link>http://stateofflux.com/2008/12/14/removing-non-english-characters/</link>
		<comments>http://stateofflux.com/2008/12/14/removing-non-english-characters/#comments</comments>
		<pubDate>Mon, 15 Dec 2008 01:55:50 +0000</pubDate>
		<dc:creator>mark</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[accent]]></category>
		<category><![CDATA[accute]]></category>
		<category><![CDATA[anglofy]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[e accute]]></category>
		<category><![CDATA[english]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[non-english]]></category>
		<category><![CDATA[transform]]></category>
		<category><![CDATA[transliteration]]></category>

		<guid isPermaLink="false">http://stateofflux.com/?p=38</guid>
		<description><![CDATA[Like many workplaces we have lots of info stored in spreadsheets.  In my case we have a spreadsheet that we need to import into a database which, in the simple case, is pretty straightforward.  But in this spreadsheet there are non-English characters.  You know the ones, e acute (é) for café and rockdots for [...]]]></description>
			<content:encoded><![CDATA[<p>Like many workplaces we have lots of info stored in spreadsheets.  In my case we have a spreadsheet that we need to import into a database which, in the simple case, is pretty straightforward.  But this spreadsheet contains non-English characters.  You know the ones, e acute (é) for café and <a href="http://en.wikipedia.org/wiki/Heavy_metal_umlaut">rockdots</a> for the hardcore Motörhead fans.  For my purposes I need to convert these into their English equivalents, as I&#8217;m trying to represent user input.  It should be no surprise that English speakers do not enter é when they are looking for cafes.  The process of making these changes is called <a href="http://en.wikipedia.org/wiki/Transliteration">transliteration</a> and I don&#8217;t want to do it manually.</p>
<h3>Enter iconv</h3>
<p><a href="http://en.wikipedia.org/wiki/Iconv">iconv</a> is an awesome piece of software which converts character strings from one character encoding to another.  iconv also has <a href="http://taschenorakel.de/mathias/2007/11/06/iconv-transliterations/">transliteration built in</a>.  This will allow me to convert those fancy foreign &#8220;cafés&#8221; into bog standard &#8220;cafes&#8221;.</p>
<pre lang="bash">$ echo 'café numero uno' | LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT
cafe numero uno</pre>
<p>In this example I set the locale to French using the LC_ALL shell environment variable, then let iconv do its magic.</p>
<h3>Microsoft Excel</h3>
<p>My source dataset is in Excel 2003 format and I need to load this spreadsheet into <a href="http://postgresql.org/">PostgreSQL</a> in ASCII format (even though my db is in UTF-8 &#8211; remember that I&#8217;m trying to emulate user input).  If you export your spreadsheet in CSV format you lose all that nice non-English encoding and iconv will have nothing to work with.  Instead I export as Unicode Text (<em>File, Save As..., Save as type: Unicode Text (*.txt)</em>) and scp it up to my Linux development box.  Once there I can check its type by issuing a:</p>
<pre lang="bash">$ file test_cases.txt
test_cases.txt: Little-endian UTF-16 Unicode English character data, with CRLF, CR line terminators</pre>
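<p>The iconv call used later in this post names only UTF-16 as the source encoding, with no explicit byte order, so it can help to confirm that the file really starts with a byte order mark.  A minimal sketch, assuming the export begins with the little-endian mark (bytes FF FE); bom_hex is a hypothetical helper name, not part of any tool:</p>
<pre lang="bash"># Print the first two bytes of stdin as lowercase hex, e.g. "fffe".
# bom_hex is a made-up name for illustration.
bom_hex() {
  head -c 2 | od -An -tx1 | tr -d ' \n'
}

# Example: cat test_cases.txt | bom_hex</pre>
<p>An output of fffe would agree with the little-endian UTF-16 that file reports.</p>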
<p>The file is now a tab-delimited text file and I want a CSV (comma-separated) file.</p>
<h3>Putting it all together</h3>
<p>The final steps are to transliterate the file to ASCII and then convert the text file to a CSV.  I&#8217;ll use sed to translate tabs (\t) into commas.  This line does the trick:</p>
<pre lang="bash">LC_ALL=fr_FR.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-16 test_cases.txt | sed -e 's/\t/,/g' &gt; test_cases.csv</pre>
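<p>The \t escape in that sed expression is not guaranteed to work outside GNU sed, and the carriage returns that file reported are still in the output.  A more portable sketch using tr, assuming that no field contains a comma or a quote and that every carriage return can simply be dropped; tsv_to_csv is a hypothetical helper name:</p>
<pre lang="bash"># Read tab-delimited text on stdin, write comma-separated text on stdout.
# Assumes no field contains a comma or a quote (no CSV quoting is done).
tsv_to_csv() {
  tr -d '\r' | tr '\t' ','
}

# Example: LC_ALL=fr_FR.UTF-8 iconv -f UTF-16 -t ASCII//TRANSLIT test_cases.txt | tsv_to_csv</pre>
<p>If the data can contain commas, a real CSV writer is the safer choice, since this sketch does no quoting at all.</p>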
<p>There are more ways to get this stuff wrong than you can shake a stick at, so I&#8217;ll put a <strong>disclaimer</strong> right here to say that this works for me just fine! <img src='http://stateofflux.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
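<p>One rough way to check the result: the iconv man page for glibc says that characters which cannot be transliterated are replaced with a question mark, so counting question marks in the output gives a hint of how much was lost.  This is only a sketch; it assumes the source data has few genuine question marks of its own, and count_placeholders is a hypothetical helper name:</p>
<pre lang="bash"># Count the question marks on stdin; with glibc iconv //TRANSLIT these
# may be placeholders for characters that could not be transliterated.
count_placeholders() {
  tr -cd '?' | wc -c
}

# Example: cat test_cases.csv | count_placeholders</pre>
<p>A count of 0 suggests that nothing was replaced; anything higher is worth eyeballing by hand.</p>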
]]></content:encoded>
			<wfw:commentRss>http://stateofflux.com/2008/12/14/removing-non-english-characters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
