Perl: Climbing Trees
June 6th, 2007
HTML tree, that is… Ahh, HTML. The tree with often not-so-perfect branches. Recently, for one of my projects, I had to grab certain bits of information from a series of HTML pages. The HTML was proper in some and not so good in others (huh? Font tags? Poorly nested tags?). I thought about parsing it as I would an XML document, but given the “crooked branches” I figured that wouldn’t work. I could write a mess of regex, but gee, isn’t there a better way? Through a series of searches and poking around, I discovered the HTML::Tree module, which seemed to suit my purpose well! I will not bore you with what I actually used HTML::Tree for, so I made a fun sample:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use HTML::TreeBuilder;
The typical top of a Perl script. I usually “use diagnostics” when developing and take it out when I am done. It just gives you more verbose error messages. I need all the help I can get!
Now the fun stuff:
I made a function to load the DevChix homepage that I had saved to a local file. Ideally, this data would be pulled live from the site. For now though, I start simple.
sub load_tree {
    my $page = HTML::TreeBuilder->new();
    $page->parse_file('DevChix.htm');
    return $page;
}
This returns the TreeBuilder object with my data loaded. Using the most awesome tool Firebug (also, see Jen’s post about it a while back), I see that the sidebar list is a div tag with an id of “sidebarposts”. Let’s look down our HTML tree and find that element:
my $page = load_tree();
my $sidebar = $page->look_down( '_tag', 'div',
    sub { $_[0]->id eq 'sidebarposts' } );
Not too complex. Look down the tree, look for a tag that’s a div with the id of “sidebarposts”... gee, that’s nearly English (and people say that Perl is gibberish! Bah!).
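To make the look_down call concrete, here is a minimal, self-contained sketch that runs against an inline HTML snippet instead of the saved DevChix page. The snippet and its link are made up for illustration; only the “sidebarposts” id comes from the post. It also adds a defined() guard, on the assumption that some divs on a real page carry no id at all: id returns undef for those, and the bare eq comparison would then warn under use warnings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the saved page (not the real DevChix markup).
my $html = '<div>No id on this one</div>'
         . '<div id="sidebarposts"><ul>'
         . '<li><a href="/a">First post</a></li>'
         . '</ul></div>';

# new_from_content parses a string, so no local file is needed here.
my $page = HTML::TreeBuilder->new_from_content($html);

# Same criteria as in the post: tag must be div, and the code ref must
# return true. The defined() check skips divs that have no id attribute.
my $sidebar = $page->look_down(
    '_tag', 'div',
    sub { defined $_[0]->id && $_[0]->id eq 'sidebarposts' },
);

# In scalar context look_down returns the first match (or undef).
my $text = $sidebar->as_text;
print "$text\n";

$page->delete;    # release the tree once we are done with it
```

On a real page it would be worth checking that $sidebar is defined before calling methods on it, since look_down returns undef when nothing matches.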
Now, let’s grab the li elements in that div:
my @ul_list = $sidebar->look_down('_tag','li');
foreach my $li (@ul_list) {
    print $li->as_text, "\n";
}
I know I’ll be getting back more than one element, so I assign the result to an array instead of a scalar. Then, in the foreach loop, I iterate through the list and print each element as text, which gives me the name of the link.
Output is something like this:
...
Regular Expressions: The Wart on the Bum of Every Language in Existence
RUBY: DRY up your Enumerations
*waving, not drowning*
Beautiful Python: The programming language that taught me how to love again
RailsConf
Test More for Java?!
Book Review: Beginning Ruby On Rails E-Commerce
...
Using a code ref to find elements came in extremely handy when I had some bunked-up HTML. Sometimes I had to look_down a table and find a tr that contained a table with a certain class, or find all links whose href had a certain domain in the URL and then grab the enclosing tr so I could grab the following tr, and so on.
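Here is a rough sketch of what those two kinds of lookups might look like. The markup, the “marker” class, and the example.com domain are all invented for illustration, so treat this as one possible approach under those assumptions rather than the code from the original project.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup: a layout table whose first row wraps a nested table with
# class "marker", followed by the rows we are actually interested in.
my $html = join '',
    '<table>',
    '<tr><td><table class="marker"><tr><td>header</td></tr></table></td></tr>',
    '<tr><td><a href="http://www.example.com/one">wanted row</a></td></tr>',
    '<tr><td><a href="http://other.example.org/two">other row</a></td></tr>',
    '</table>';

my $page = HTML::TreeBuilder->new_from_content($html);

# A tr that contains a table with a certain class: the code ref runs a
# nested look_down and uses its result as a true/false answer.
my $marker_row = $page->look_down(
    _tag => 'tr',
    sub { defined $_[0]->look_down( _tag => 'table', class => 'marker' ) },
);

# right() in list context returns every following sibling. Text siblings
# are plain strings rather than objects, so keep only tr elements.
my ($next_row) = grep { ref $_ && $_->tag eq 'tr' } $marker_row->right;
my $next_text = $next_row->as_text;

# All links whose href points at a certain domain.
my @links = $page->look_down(
    _tag => 'a',
    sub {
        my $href = $_[0]->attr('href');
        defined $href && $href =~ m{^https?://www\.example\.com/};
    },
);
my @link_texts = map { $_->as_text } @links;

print "Row after the marker: $next_text\n";
print "Matching link: $_\n" for @link_texts;

$page->delete;    # release the tree once we are done with it
```

For the simple domain check, look_down also accepts a regex as an attribute value (href => qr{...}), which avoids the code ref entirely; the code ref earns its keep once the test involves other elements, as in the nested-table case.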
The HTML::TreeBuilder interface has a ton more methods than I’ve talked about, and you should take a look at it next time you need to grab bits from HTML pages!
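As a small taste of those other methods, the sketch below uses find_by_tag_name, attr, look_up, and as_HTML on a made-up sidebar snippet. The hrefs and link texts are hypothetical; only the “sidebarposts” id is taken from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sidebar markup for illustration.
my $page = HTML::TreeBuilder->new_from_content(
      '<div id="sidebarposts"><ul>'
    . '<li><a href="/perl">Perl post</a></li>'
    . '<li><a href="/ruby">Ruby post</a></li>'
    . '</ul></div>'
);

# find_by_tag_name: a shortcut when the tag name is the only criterion.
my @anchors = $page->find_by_tag_name('a');

# attr: read an attribute value from an element.
my @hrefs = map { $_->attr('href') } @anchors;

# look_up: like look_down, but walks from an element toward the root.
my $div    = $anchors[0]->look_up( _tag => 'div' );
my $div_id = $div->id;

print "$_\n" for @hrefs;
print "Enclosing div id: $div_id\n";

# as_HTML: serialize an element (and everything under it) back to markup.
print $div->as_HTML, "\n";

$page->delete;    # release the tree once we are done with it
```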

June 7th, 2007 at 4:06 am
Perl *is* gibberish. ;-P
I’d use BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
June 7th, 2007 at 5:36 am
This is no place for language bashing.
How about an example of the above code in YOUR preferred language?
June 7th, 2007 at 7:06 am
I don’t endorse the previous commenter’s rudeness, but here’s equivalent code for Beautiful Soup:
import urllib2
from BeautifulSoup import BeautifulSoup
#data = urllib2.urlopen("http://www.devchix.com/")
data = open("DevChix.html")
soup = BeautifulSoup(data)
sidebar = soup.find('div', {'id': 'sidebarposts'})
for li in sidebar.findAll('li'):
    print li.a.string
June 7th, 2007 at 8:02 am
This blog is meant to inform — saying “x thing sucks, and y works better” is not really THAT helpful. Yes, I could search for Y and find out about it.. but if you have a better solution, at least say a few sentences on why you like that and if possible some examples.
June 7th, 2007 at 2:09 pm
Here is how you would do it in Ruby using the hpricot gem. :-)
require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = Hpricot(open("http://www.devchix.com/"))
(doc/"div#sidebarposts//li/a").each do |link|
  puts link.inner_html
end
June 7th, 2007 at 2:19 pm
I tried to find a way to do it in Java and found several capable libraries. However, it would have been much more complex. :-)
June 7th, 2007 at 3:01 pm
Thanks Angel … that’s cool :)
June 11th, 2007 at 6:57 am
I tried to parse the pages from this site, but “id” appears twice. Why do people put out pages that are “tag soup”? XHTML was supposed to fix all this. {sigh}
June 11th, 2007 at 7:21 am
Could you be more clear? id appears twice? in the same element? where?
June 14th, 2007 at 4:10 pm
On this page, there are two items with id=”recent” and id=”comment”. That’s not permitted by XHTML, and nonsensical even with HTML DOMs. id must be unique.
June 14th, 2007 at 8:28 pm
@randal, tag soup is quite different from having accidentally used an ID twice. ;p I may have overlooked that. We will fix this issue. Thanks for bringing it to our attention.
June 14th, 2007 at 9:10 pm
@randal, just following up. The devChix site is now Valid XHTML 1.0 Transitional. Happy parsing!
June 15th, 2007 at 1:06 pm
Note that passing sub { $_[0]->id eq 'sidebarposts' } is the same as passing id => 'sidebarposts':
my $sidebar = $page->look_down(
    _tag => 'div',
    id   => 'sidebarposts',
);
Also, that loop is just crying to be a map instead:
print
    map { $_->as_text() . "\n" }
    $sidebar->look_down( _tag => 'li' );
June 15th, 2007 at 3:49 pm
Thanks Aristole :) Very cool
June 28th, 2007 at 4:42 pm
I found this post very helpful for an HTML parsing project I’m doing in Perl. Thanks for posting it.