
Hpricot Exposed: My experience with it, and why I think it’s amazing

Published: December 23rd, 2007

Hpricot is pure HTML parsing elegance. I had a situation that was turning me bald, and then, in shining armor, Hpricot came to the rescue.


First, an introduction to the scenario in which Hpricot [1] became useful:

The Scenario

Imagine that you have an article, and when you create that article you specify headings with ‘h3’ tags. My goal was to parse a body of text for all h3 tags, grab them, and put them in a table of contents. That part was easy with a simple regexp. Next, I wanted to link each header in the table of contents to its respective heading with an anchor link.
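
To make the first half of that concrete, here is a minimal sketch of the table-of-contents step using nothing but a regexp. The method names heading_titles and table_of_contents are made up for illustration and are not from the original app, and the heading text is used as the anchor as-is, which assumes the headings are simple enough to work as anchors.

```ruby
# A sketch only: heading_titles and table_of_contents are hypothetical
# names, not methods from the original application.

# Pull the text out of every h3, tolerating attributes on the tag.
def heading_titles(html)
  html.scan(/<h3.*?>(.+?)<\/h3>/m).flatten.map(&:strip)
end

# Build a plain HTML list whose links point at "#<heading text>".
def table_of_contents(html)
  items = heading_titles(html).map do |title|
    anchor = "#" + title
    %(<li><a href="#{anchor}">#{title}</a></li>)
  end
  "<ul>\n#{items.join("\n")}\n</ul>"
end

sample = %(<h3>Intro</h3><p>Hello</p><h3 class="big"> Setup </h3><p>More</p>)
puts table_of_contents(sample)
```

The links only become useful once each heading carries a matching anchor, which is the half of the job that caused the trouble.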

The Problem

This is where the trouble started. I got to the point where the name attribute was being set on every h3 tag, but only if the tag had no other attributes. That was down to the fragility of the regexp I used. Getting nowhere with this cryptic demon, I scoured the internets for a solution; along came a mythical angel on IRC who suggested using Hpricot to solve my problem.

The thought of not having to use regexp, and replacing it with a pretty parser? Oh, this is the bread and butter of my programming enjoyment! Luckily, the day was in fact saved, thanks to Hpricot.

Hpricotified Solution

Here’s the victorious code that brought me to my solution:

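  # Returns the text of every h3 heading (scan yields one array per match).
  def chapter_titles
    self.body.scan(/<h3.*?>(.+?)<\/h3>/)
  end

  # Gives each h3 a name attribute equal to its own text, so the
  # table of contents can link to it.
  def set_headers
    doc = Hpricot(self.body)
    (doc/"h3").each do |heading|
      heading.set_attribute("name", heading.inner_html)
    end
    self.body = doc.to_s
  end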

Now, I could probably create a simple version of chapter_titles with Hpricot. As a challenge, maybe you should do it!

Regexpified Lame Version

Here’s the old regexp version of set_headers:

  def set_chapter_names
    # scan returns one array per match, so flatten to get plain strings
    self.body.scan(/<h3>(.+?)<\/h3>/).flatten.each do |title|
      name = title.strip
      self.body.gsub!("<h3>#{title}</h3>", "<h3 name=\"#{name}\">#{title}</h3>")
    end
  end

As you can see, that’s hideous compared to the Hpricot version.
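
To see the fragility for yourself, here is a small standalone sketch that runs the two regexps from this post against the same snippet. The method names strict_headings and loose_headings are made up for the demo; only the patterns come from the code above.

```ruby
# Hypothetical helper names; only the two patterns come from the post.

# Pattern from set_chapter_names: matches a bare <h3> and nothing else.
def strict_headings(html)
  html.scan(/<h3>(.+?)<\/h3>/).flatten
end

# Pattern from chapter_titles: tolerates attributes on the opening tag.
def loose_headings(html)
  html.scan(/<h3.*?>(.+?)<\/h3>/).flatten
end

html = %(<h3>Plain</h3><h3 class="fancy">Styled</h3>)

p strict_headings(html)  # the heading with a class attribute is skipped
p loose_headings(html)   # both headings are found
```

Loosening the pattern fixes the matching, but the gsub that rewrites the tag would still need to know about every attribute it might meet, which is exactly the bookkeeping a parser does for you.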

If anyone is concerned about speed with Hpricot, I don’t think it’s an issue. First of all, it’s coded by Why The Lucky Stiff, which almost warrants a “’nuff said”, and its scanner is also “…coded in C” [2]. Doubt the speed now?

If you’d like to take Hpricot for a spin, just check out its home page.

What are your experiences with Hpricot if you’ve used it before?

  1. http://code.whytheluckystiff.net/hpricot/
  2. From the first paragraph of http://code.whytheluckystiff.net/hpricot/

Thanks for writing this up. I’ve yet to look into hpricot, but now I feel as if I could use it if I wanted. On a side note - your site is really cool. It reminds me of dig dug. :)

  • arymcdo
  • Dec 23rd

I spent a bunch of time a while back using hpricot to do a bunch of screen scraping and parsing quirky XML (RSS and ATOM to be precise - apparently standards are of little consequence to most publishers!).

The fact that hpricot gives you the power of a full XML parser (XPath, DOM-esque access), but still handles tag-soup, makes it an awesomely useful library to become familiar with.

Hpricot is pretty slick, but I’ve run tests against RegEx (Oniguruma) and RegEx was ~3x faster. Still, the ease and code beauty with Hpricot can’t be denied.

  • Damon
  • Oct 17th