HTMLParser, not for the faint of heart

In recent efforts to create a general-purpose HTML scraper for my boss, I’ve been getting my hands dirty in some Python. After much research and experimentation, I’ve decided to go with the built-in HTMLParser instead of the XML expat parser or the SGMLParser. Also, I should clarify this is not the HTMLParser from htmllib; this is the HTMLParser module’s HTMLParser. For all its wonderment, Python really fails on consistent naming schemes. Oh well.
One of the things I like most about HTMLParser is that it is not a finished parser per se, but a base class you subclass to build your own. There is no default HTMLParser which you can feed HTML to and get useful output from; out of the box the handler methods do nothing, so you only get the scaffolding for parsing.
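from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    # Called for every opening tag, e.g. <a href="...">
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            print "Found link:", attrs

    # Called for XHTML-style self-closing tags, e.g. <img src="..." />
    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            print "Found image:", attrs

parser = MyParser()
parser.feed(rawhtml)  # rawhtml: a string holding the page source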
Lovely, no? There are a few more methods which you override in order to achieve the desired functionality. The nice thing about parsing HTML like this is that it is a one-pass operation. Unlike a series of regexps to find desired content, this allows us to find multiple targets in a streaming fashion.
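To make the one-pass idea concrete, here is a minimal sketch that collects links and images in a single feed() call. The class name TargetCollector and the sample HTML are made up for illustration, and the import fallback is only there so the snippet also runs on Python 3, where the module lives at html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location, as in this post


class TargetCollector(HTMLParser):
    """Hypothetical example class: gathers links and images in one pass."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    # handle_startendtag is not overridden here: the default version
    # forwards self-closing tags such as <img ... /> to handle_starttag.


sample = ('<p><a href="/one">1</a><img src="a.png" />'
          '<a href="/two">2</a><img src="b.png"></p>')
collector = TargetCollector()
collector.feed(sample)
collector.close()
```

After the single feed(), collector.links and collector.images hold both kinds of target, with no second scan over the document.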
There was one really annoying thing about this module, however. The built-in getpos() returns a tuple of line number and column position. I can’t think of an instance when this would be useful for anything really (unless you’re making an HTML editor in Python or something), so naturally I modified it to my liking. My first solution was to just remove all the newlines and then work based on the column offset alone. Unfortunately, HTMLParser chokes on some really long lines. My next idea (the one I’m currently using) was to strip out tabs and trailing whitespace and precalculate the length of each line before I feed the parser.
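# Assume the incoming HTML is already clean (tabs and trailing whitespace stripped)
linepos = []   # linepos[n] = offset at which line n+1 starts
charpos = 0
for line in rawhtml.split("\n"):
    linepos.append(charpos)
    charpos += len(line)   # the newline characters themselves are not counted
# MyParser needs an __init__ that accepts and stores linepos
parser = MyParser(linepos=linepos)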
This produces an array like [0,10,20,30,...] (if each line were 10 characters long). The next modification is to create a new method for MyParser.
    def getcharpos(self):
        return self.linepos[self.lineno-1] + self.offset
The two properties lineno and offset are inherited from HTMLParser (actually inherited from markupbase), and they represent exactly what you’d think.
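Putting the pieces together, the following is a rough sketch of how the offsets table and getcharpos() might fit into one subclass. The names line_offsets and PositionParser, the __init__ signature, and the sample HTML are assumptions for illustration, not the original code. The import fallback lets it run on Python 3 as well.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location, as in this post


def line_offsets(html):
    """Start offset of each line, counting characters but not newlines."""
    offsets = []
    charpos = 0
    for line in html.split("\n"):
        offsets.append(charpos)
        charpos += len(line)
    return offsets


class PositionParser(HTMLParser):
    """Hypothetical subclass that records an absolute offset per <img>."""

    def __init__(self, linepos):
        HTMLParser.__init__(self)
        self.linepos = linepos
        self.image_positions = []

    def getcharpos(self):
        # lineno is 1-based; offset is the 0-based column within that line
        return self.linepos[self.lineno - 1] + self.offset

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_positions.append(self.getcharpos())


rawhtml = '<p>\n<img src="a.png">\ntext <img src="b.png">\n</p>'
parser = PositionParser(linepos=line_offsets(rawhtml))
parser.feed(rawhtml)
parser.close()

# Because newlines are not counted, the offsets index into the
# newline-free version of the document.
flat = rawhtml.replace("\n", "")
```

For this sample, parser.image_positions comes out as [3, 25], and flat[3] and flat[25] are both the "<" of an img tag. If offsets into the original string are wanted instead, adding 1 per line for the newline in line_offsets would do it.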
Now that I have the absolute position of tags in the HTML, I can do all kinds of fun things, like use k-means clustering to find clusters of images. Or maybe I want to see the average distance between occurrences of the word “the” in an article. It’s 276.21 for this one, btw.
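For the curious, here is one way such an average could be computed from absolute positions. This is only a sketch under my own assumptions (whole-word, case-insensitive matches on plain text, distance measured between match starts); the function name average_gap is invented, and it is not necessarily how the figure above was obtained.

```python
import re


def average_gap(text, word):
    """Mean distance in characters between successive whole-word matches."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    starts = [m.start() for m in pattern.finditer(text)]
    if len(starts) < 2:
        return None  # fewer than two occurrences: no gap to measure
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / float(len(gaps))


sample = "the cat saw the dog near the tree"
result = average_gap(sample, "the")
```

In the sample the matches start at 0, 12 and 25, so the gaps are 12 and 13 and result is 12.5.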
-David