I love python: Zip code prefix web scrape and DB injection in 70 lines
April 17th, 2008 byHere’s a module I wrote (in an hour. Damn, Python is wonderful) which scrapes the US Postal service web site for three-digit zip code extensions (http://pe.usps.gov/text/dmm300/L002.htm). It creates a db table and injects zip code prefix, region and state info for each record found. It uses BeautifulSoup to parse the HTML, and SqlAlchemy to do the DB operations.
If you only need to check what region and state a particular zip code belongs, this is for you. If anyone can point me to a free longitude/latitude/full zip code site, please post that info to a reply, and I’ll rewrite this module.
import urllib2
import codecs
import re
import pdb
import os
import BeautifulSoup
import sqlalchemy
class GetZips:
def __init__(self):
self.zip_info = []
def getZipPrefixes(self):
zip_page = urllib2.urlopen("http://pe.usps.gov/text/dmm300/L002.htm").read()
self.soup = BeautifulSoup.BeautifulSoup(zip_page)
# match zip columns
zips = self.soup.findAll(attrs={'class':re.compile('^trBodyRow*')})
for i in zips:
y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignLeft*')})
'''
X is the symbol for an unused 3 digit zip prefix.
'''
if y.span and y.span.string == 'X':
continue
# last 3 digits
zip_prefix_3 = y.a.next.next
zip_prefix_3 = re.sub('[\n\r]+','',zip_prefix_3)
# finding the first column will suffice.
y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignRight*')})
region_state = y.a.next.next.split()
region = region_state[-3]
state_abbrev = region_state[-2]
if region_state[-1] != zip_prefix_3:
print "There is a problem here: %s" % i
self.zip_info.append((region,state_abbrev,zip_prefix_3))
print "Found %s %s %s" % (region,state_abbrev,zip_prefix_3)
def injectIntoDB(self):
engine = sqlalchemy.create_engine('postgres://%s:%s@%s/%s' % ('postgresql','something','127.0.0.1:5432','zip_db'),strategy='threadlocal')
'''
The sqlalchemy explicit scope is done for clarity. Of course
you can "from sqlalchemy import *" instead, and change the scope
of these calls.
'''
metadata = sqlalchemy.MetaData()
metadata.bind = engine
zip_table = sqlalchemy.Table('zip_abbrevs', metadata,
sqlalchemy.Column('zip_abbrevs_id', sqlalchemy.Integer, primary_key=True),
sqlalchemy.Column('three_digit_abbrev', sqlalchemy.String(4)),
sqlalchemy.Column('region', sqlalchemy.VARCHAR(50)),
sqlalchemy.Column('state_abbrev', sqlalchemy.String(3)))
metadata.create_all(engine)
for (region, state, zip) in self.zip_info:
print "Injecting %s %s %s\n" % (region, state,zip)
zip_table.insert(values={'region':region,'state_abbrev':state,'three_digit_abbrev':zip}).execute()
if __name__ == "__main__":
x=GetZips()
x.getZipPrefixes()
x.injectIntoDB()
# vim:ts=4: noet:
I am writing this code for the nonprofit called The Freelancer’s Union in NYC, which currently has a nationwide member drive: http://www.freelancersunion.org/advocacy/index.html.
I will shamelessly plug them in exchange for sharing this code with the world.
The more members they get in each US state, the better nationwide insurance plans they can offer. They offer E&O insurance for IT freelancers as well, so even if you freelance part-time, this could be for you. This organization rocks. I’ve been a member for three years, and now I proudly write code for them.
Gloria

May 8th, 2008 at 7:33 am
Perhaps you know by now, but just in case, some editing step in your publishing process is being just a little “too helpful”. Standard ASCII characters (” & ‘) are being replaced by cp1252 “typography” characters. Also, pairs of single quotes (”) are being replaced by a single double qoute (”), and triple quotes end up even worse. You may be able to resolve this by wrapping the actual code sample in
or tags.Also, thanks for posting some neat tools!
May 8th, 2008 at 7:35 am
Wow, you got me! Even the text in my entry were mangled by the “helpful” publishing package. Anyhow, the tags I mentioned, but which got wiped are ‘code’ and ‘pre’.
May 15th, 2008 at 1:38 pm
Hi Noah,
I’ll attach the actual code here as well. This tool is far from perfect, and my regex is mangled here.
July 27th, 2008 at 8:57 pm
Thanks so much, Gloria - this is exactly what I needed! And a nice demonstration of the module, too.