Perl: Climbing Trees

June 6th, 2007 by Nola

HTML tree, that is… Ahh, HTML: the tree with often not-so-perfect branches. Recently, for one of my projects, I had to grab certain bits of information from a series of HTML pages. The HTML was proper in some and not so good in others (huh? Font tags? Poorly nested tags?). I thought about parsing it as I would an XML document, but given the “crooked branches” I figured that wouldn’t work. I could write a mess of regex, but gee, isn’t there a better way? Through a series of searches and poking around, I discovered the HTML::Tree distribution, whose HTML::TreeBuilder module seemed to suit my purpose well! I won’t bore you with what I actually used it for, so I made a fun sample instead:

#!/usr/bin/perl

use warnings;
use strict;
use diagnostics;

use HTML::TreeBuilder;

The typical top of a Perl script. I usually “use diagnostics” when developing and take it out when I am done. It just gives you more verbose error messages. I need all the help I can get!

Now the fun stuff:

I made a function to load the DevChix homepage that I had saved to a local file. Ideally, this data would be pulled live from the site. For now though, I start simple.

sub load_tree {
   my $page = HTML::TreeBuilder->new();
   # parse_file returns undef if the file cannot be opened
   $page->parse_file('DevChix.htm')
      or die "Cannot parse DevChix.htm: $!";
   return $page;
}
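
For the “pulled live” version, here is a minimal sketch, assuming LWP::UserAgent is installed. The helper names tree_from_html and fetch_tree are mine, and the URL in the comment is only a placeholder, so treat this as one possible approach rather than what the original script did:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;
use LWP::UserAgent;

# Build a tree from HTML that is already in memory.
sub tree_from_html {
    my ($html) = @_;
    return HTML::TreeBuilder->new_from_content($html);
}

# Fetch a page and hand its decoded body to the parser.
# Dies with the HTTP status line if the request fails.
sub fetch_tree {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get($url);
    die "Could not fetch $url: ", $res->status_line, "\n"
        unless $res->is_success;
    return tree_from_html( $res->decoded_content );
}

# Hypothetical usage; the URL is a placeholder:
# my $page = fetch_tree('http://example.com/');

# Offline check with an inline snippet, so the sketch runs without a network.
my $demo  = tree_from_html('<html><head><title>Hello</title></head><body><p>Hi</p></body></html>');
my $title = $demo->look_down( _tag => 'title' )->as_text;
print "$title\n";
$demo->delete;    # free the tree when finished
```

Keeping the parsing in its own little function means the rest of the script does not care whether the HTML came from a file, a string, or the network.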

This returns the TreeBuilder object with my data loaded. Using the most awesome tool Firebug (also, see Jen’s post about it a while back), I see that the sidebar list is a div tag with an id of “sidebarposts”. Let’s look down our HTML tree and find that element:

my $page = load_tree();

# divs without an id return undef, so fall back to '' to avoid warnings
my $sidebar = $page->look_down( '_tag', 'div',
   sub { ( $_[0]->attr('id') || '' ) eq 'sidebarposts' } );

Not too complex. Look down the tree, look for a tag that’s a div with the id of “sidebarposts”... gee, that’s nearly English (and people say that Perl is gibberish! Bah!).
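
For a plain attribute match like this one, look_down also accepts attribute/value pairs directly, so the code ref is optional. Here is a small sketch against an inline snippet; the markup is made up for illustration and is not the real DevChix page:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup standing in for the saved page.
my $html = <<'HTML';
<html><body>
<div id="header">DevChix</div>
<div id="sidebarposts"><ul><li>First post</li><li>Second post</li></ul></div>
<div>no id on this one</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Attribute/value pairs: tag must be div AND id must equal 'sidebarposts'.
my $by_pair = $tree->look_down( _tag => 'div', id => 'sidebarposts' );

# Same search with a code ref; attr() returns undef when a div has no id,
# so check for that before comparing.
my $by_sub = $tree->look_down(
    _tag => 'div',
    sub {
        my $id = $_[0]->attr('id');
        defined $id && $id eq 'sidebarposts';
    },
);

# Copy out what we need before the tree is freed.
my $pair_id = $by_pair->attr('id');
my $sub_id  = $by_sub->attr('id');
print "pair form found: $pair_id\n";
print "code ref found:  $sub_id\n";

$tree->delete;
```

The pair form is shorter for exact matches; the code ref earns its keep once the test is anything fancier than “equals this string”.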

Now, let’s grab the li elements in that div:

my @ul_list = $sidebar->look_down('_tag','li');

foreach my $li (@ul_list) {
  print $li->as_text, "\n";
}

I know I’ll be getting back more than one element, so I assign the result to an array instead of a scalar (in list context look_down returns every match; in scalar context, only the first). Then, in the foreach loop, I iterate through the list and print each element as text, which gives me the name of the link.

Output is something like this:

...
Regular Expressions: The Wart on the Bum of Every Language in Existence
RUBY: DRY up your Enumerations
*waving, not drowning*
Beautiful Python: The programming language that taught me how to love again
RailsConf
Test More for Java?!
Book Review: Beginning Ruby On Rails E-Commerce
...
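
If you want the link targets as well as the titles, each li can be searched in turn for its a tag. A hedged sketch, using made-up markup and placeholder URLs, and assuming each list item wraps at most one link:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sidebar markup; the URLs are placeholders.
my $html = <<'HTML';
<div id="sidebarposts"><ul>
<li><a href="http://example.com/railsconf">RailsConf</a></li>
<li><a href="http://example.com/test-more">Test More for Java?!</a></li>
<li>No link here</li>
</ul></div>
HTML

my $tree    = HTML::TreeBuilder->new_from_content($html);
my $sidebar = $tree->look_down( _tag => 'div', id => 'sidebarposts' );

my @posts;
foreach my $li ( $sidebar->look_down( _tag => 'li' ) ) {
    # Scalar context gives the first <a> inside this item, or undef.
    my $link = $li->look_down( _tag => 'a' ) or next;
    push @posts, { title => $link->as_text, url => $link->attr('href') };
}

printf "%s -> %s\n", $_->{title}, $_->{url} for @posts;
$tree->delete;
```

Collecting plain hashes first, and printing afterwards, keeps the results usable even after the tree itself has been deleted.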

Using a code ref to find elements came in extremely handy when I had some messed-up HTML. Sometimes I had to look_down a table and find a tr that contained a table with a certain class, or find all a tags that had a certain domain in the href URL and then grab the enclosing tr so I could get at the following tr, and so on.
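
To make that last case concrete, here is a rough sketch of the “find the link, climb to its tr, grab the next tr” dance. The table and the domains are invented for the example; look_up and right are regular HTML::Element methods:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented table: each link row is followed by a details row.
my $html = <<'HTML';
<table>
<tr><td><a href="http://example.com/a">Example A</a></td></tr>
<tr><td>details for A</td></tr>
<tr><td><a href="http://other.test/b">Other B</a></td></tr>
<tr><td>details for B</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Code ref criterion: keep only links whose href points at example.com.
my @links = $tree->look_down(
    _tag => 'a',
    sub { ( $_[0]->attr('href') || '' ) =~ m{^https?://example\.com/} },
);

my @details;
foreach my $link (@links) {
    # Climb to the enclosing row, then take the first element sibling after it.
    my $row = $link->look_up( _tag => 'tr' ) or next;
    my ($next_row) = grep { ref $_ } $row->right;    # skip any bare text nodes
    push @details, $next_row->as_text if $next_row;
}

print "$_\n" for @details;
$tree->delete;
```

The grep on ref is a small safety net: siblings can be plain text strings rather than elements, and only elements have an as_text method.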

The HTML::TreeBuilder interface has a ton more methods than I’ve talked about, and you should take a look at it next time you need to grab bits from HTML pages!

