HTML tree, that is… Ahh, HTML. The tree with often not-so-perfect branches. Recently, for one of my projects, I had to grab certain bits of information from a series of HTML pages. The HTML was proper in some and not so good in others (huh? Font tags? Poorly nested tags?). I thought about parsing it as I would an XML document, but given the “crooked branches” I figured that wouldn’t work. I could write a mess of regex, but gee, isn’t there a better way? Through a series of searches and poking around, I discovered the HTML::Tree module, which seemed to suit my purpose well! I won’t bore you with what I actually needed HTML::Tree for, so I made a fun sample instead:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use HTML::TreeBuilder;
The typical top of a Perl script. I usually "use diagnostics" when developing and take it out when I am done. It just gives you more verbose error messages. I need all the help I can get!
Now the fun stuff:
I made a function to load the DevChix homepage that I had saved to a local file. Ideally, this data would be pulled live from the site. For now though, I start simple.
sub load_tree {
    my $page = HTML::TreeBuilder->new();
    # parse_file returns undef if the file cannot be opened
    $page->parse_file('DevChix.htm')
        or die "Could not parse DevChix.htm: $!";
    return $page;
}
This returns the TreeBuilder object with my data loaded. Using the most awesome tool Firebug (also, see Jen's post about it a while back), I see that the sidebar list is a div tag with an id of “sidebarposts”. Let's look down our HTML tree and find that element:
my $page = load_tree();
# Only divs reach the code ref; the || '' avoids an undef warning for divs with no id
my $sidebar = $page->look_down( '_tag', 'div',
    sub { ( $_[0]->id || '' ) eq 'sidebarposts' } );
Not too complex. Look down the tree, look for a tag that's a div with the id of “sidebarposts”. Gee, that's nearly English (and people say that Perl is gibberish! Bah!).
Now, let's grab the li elements in that div:
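Here is a small self-contained sketch of that same lookup, done two ways. It assumes a made-up scrap of HTML (not the real DevChix page) parsed from a string with new_from_content, and it compares a plain attribute/value pair against the code ref style. The `|| ''` in the code ref is only there so divs without an id don't trigger an "uninitialized value" warning.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in markup; the real DevChix page is not reproduced here.
my $html = <<'HTML';
<html><body>
<div>no id on this one</div>
<div id="sidebarposts"><ul><li>First post</li><li>Second post</li></ul></div>
</body></html>
HTML

my $page = HTML::TreeBuilder->new_from_content($html);

# Style 1: attribute/value pairs, where every pair has to match exactly.
my $by_pair = $page->look_down( _tag => 'div', id => 'sidebarposts' );

# Style 2: a code ref that gets each candidate element as $_[0].
my $by_sub = $page->look_down(
    _tag => 'div',
    sub { ( $_[0]->attr('id') || '' ) eq 'sidebarposts' },
);

# Capture the text before freeing the tree.
my $pair_text = defined $by_pair ? $by_pair->as_text : '';
my $sub_text  = defined $by_sub  ? $by_sub->as_text  : '';

print "Both styles found the same text\n" if $pair_text eq $sub_text;

$page->delete;    # release the tree when done
```

For a simple exact match like this one, the attribute pair is probably enough; the code ref earns its keep once the test gets more involved.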
my @ul_list = $sidebar->look_down( '_tag', 'li' );
foreach my $li (@ul_list) {
    print $li->as_text, "\n";
}
I know I'll be getting back more than one element, so I assign the result to an array instead of a scalar. Then, in the foreach loop, I iterate through the list and print each element as text, which gives me the name of the link.
Output is something like this:
...
Regular Expressions: The Wart on the Bum of Every Language in Existence
RUBY: DRY up your Enumerations
*waving, not drowning*
Beautiful Python: The programming language that taught me how to love again
RailsConf
Test More for Java?!
Book Review: Beginning Ruby On Rails E-Commerce
...
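To make the array-versus-scalar point concrete, here is a rough sketch that calls look_down both ways and also pulls the href out of each link alongside its text. The markup, titles, and URLs are all invented for illustration; only the method calls come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sidebar markup; titles and URLs are made up.
my $html = <<'HTML';
<div id="sidebarposts"><ul>
<li><a href="/posts/one">Post One</a></li>
<li><a href="/posts/two">Post Two</a></li>
<li>No link here</li>
</ul></div>
HTML

my $page    = HTML::TreeBuilder->new_from_content($html);
my $sidebar = $page->look_down( _tag => 'div', id => 'sidebarposts' );

# Scalar context: only the first matching element comes back.
my $first_li = $sidebar->look_down( _tag => 'li' );

# List context: every matching element comes back.
my @all_li = $sidebar->look_down( _tag => 'li' );

# Collect the link text and href for each li that actually holds a link.
my @links;
for my $li (@all_li) {
    my $anchor = $li->look_down( _tag => 'a' ) or next;    # skip link-less items
    push @links, { title => $anchor->as_text, href => $anchor->attr('href') };
}

printf "%s -> %s\n", $_->{title}, $_->{href} for @links;
```

Assigning to a scalar by accident quietly gives you just the first li, so it is worth checking which context you are in when a list comes back shorter than expected.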
Using a code ref to find elements came in extremely handy when I had some bunked-up HTML. Sometimes I had to look_down a table to find a tr containing a table with a certain class, or find all the a tags whose href had a certain domain in the URL, then grab the enclosing tr so I could get at the tr that followed, and so on.
The HTML::TreeBuilder interface has a ton more methods than I’ve talked about, and you should take a look at it next time you need to grab bits from HTML pages!
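That last trick might look something like the sketch below. It is not my original project code: the table and the example.com / example.org domains are placeholders. It uses a code ref to match links by domain, look_up to climb to the enclosing tr, and right to reach the row that follows.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical table: each link row is followed by a detail row.
# example.com and example.org are placeholder domains.
my $html = <<'HTML';
<table>
<tr><td><a href="http://example.com/a">Wanted</a></td></tr>
<tr><td>Details for wanted link</td></tr>
<tr><td><a href="http://example.org/b">Other</a></td></tr>
<tr><td>Details for other link</td></tr>
</table>
HTML

my $page = HTML::TreeBuilder->new_from_content($html);

# Code ref criterion: keep only links whose href points at the wanted domain.
my @wanted = $page->look_down(
    _tag => 'a',
    sub { ( $_[0]->attr('href') || '' ) =~ m{^https?://example\.com/} },
);

my @details;
for my $link (@wanted) {
    # Climb to the row that contains this link.
    my $row = $link->look_up( _tag => 'tr' ) or next;

    # right() in list context returns all later siblings; keep the first
    # one that is an element (not a text node) and is itself a tr.
    my ($next_row) = grep { ref $_ && $_->tag eq 'tr' } $row->right;

    push @details, $next_row->as_text if defined $next_row;
}

print "$_\n" for @details;
```

Filtering the siblings rather than blindly taking the first one is a small safety net in case a stray text node or other element sits between the rows.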
DevChix