Monday, July 14th, 2008...8:33 am

Ruby Iconv to the Rescue


While working on a client project today, I ran into a situation that was producing some “funky” output.

I’ve been using Bob Aman’s standard FeedTools library to grab and parse some RSS feeds for output in a WordPress site.

[Image: utf8_issues.jpg]

Now, it was obvious to me this had something to do with character encoding, so I dug in to find the problem. It turns out FeedTools converts the feeds it parses to UTF-8, which is great, but depending on the original feed encoding this can leave some strange characters in the string.

Viewing the native feeds in a feed reader, I could see they were simply quotes, apostrophes and ellipsis characters. I needed a way to either convert these characters into printable ASCII characters or simply discard them altogether.

I didn’t want to set up a translation table for every character code that might show up in a feed title, so I simply chose to discard the “unprintable” characters completely. I came across this post by Obie Fernandez and decided his simple method would work for me.

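require 'iconv'

class String
  # Returns an ASCII-only copy of a UTF-8 string.
  def to_ascii_iconv
    # Convert UTF-8 to ASCII, transliterating where possible and ignoring the rest.
    converter = Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8')
    # Drop any remaining code points of 127 or higher.
    converter.iconv(self).unpack('U*').select{ |cp| cp < 127 }.pack('U*')
  end
end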

Ruby’s built-in Iconv library can be used for charset conversions. I extended the String class with a new to_ascii_iconv method. This method creates an Iconv converter that converts from UTF-8 encoding to ASCII encoding. The IGNORE and TRANSLIT flags tell the converter to ignore errors and to transliterate accented characters to an appropriate character in the ASCII charset. Next, I use the newly created converter to do the conversion and then strip out any characters with code points of 127 or higher (DEL and anything outside the ASCII range).
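To make the last step a little more concrete, here is a minimal sketch of just the filtering part, on its own and without Iconv. The helper name strip_non_ascii and the sample string are made up for illustration; they are not part of Obie's method or of FeedTools.

```ruby
# Hypothetical helper (name is an assumption, not from the original method).
# It shows only the unpack/select/pack filtering step, with no transliteration.
def strip_non_ascii(text)
  text.unpack('U*')              # decode UTF-8 into an array of integer code points
      .select { |cp| cp < 127 }  # keep only code points below 127
      .pack('U*')                # re-encode the surviving code points as a string
end

# Made-up sample containing a curly apostrophe, curly quotes, an ellipsis and an accented e.
sample = "It\u2019s \u201Cfunky\u201D\u2026 caf\u00E9"
puts strip_non_ascii(sample)     # prints "Its funky caf"
```

Note that without the TRANSLIT step the accented "é" is simply dropped instead of being turned into a plain "e", which is why the converter runs first in the method above.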

The results are exactly what I needed. No “funky” characters in my output.


© 2014 Craig P Jolicoeur.