I love python: Zip code prefix web scrape and DB injection in 70 lines

April 17th, 2008 by comment

Here’s a module I wrote (in an hour. Damn, Python is wonderful) which scrapes the US Postal service web site for three-digit zip code extensions (http://pe.usps.gov/text/dmm300/L002.htm). It creates a db table and injects zip code prefix, region and state info for each record found. It uses BeautifulSoup to parse the HTML, and SqlAlchemy to do the DB operations.

If you only need to check what region and state a particular zip code belongs, this is for you. If anyone can point me to a free longitude/latitude/full zip code site, please post that info to a reply, and I’ll rewrite this module.

import urllib2
import codecs
import re
import pdb
import os
import BeautifulSoup
import sqlalchemy

class GetZips:
	def __init__(self):
		self.zip_info = []

	def getZipPrefixes(self):
		zip_page = urllib2.urlopen("http://pe.usps.gov/text/dmm300/L002.htm").read()
		self.soup = BeautifulSoup.BeautifulSoup(zip_page)
		# match zip columns
		zips = self.soup.findAll(attrs={'class':re.compile('^trBodyRow*')})
		for i in zips:
			y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignLeft*')})
			'''
			X is the symbol for an unused 3 digit zip prefix.
			'''
			if y.span and y.span.string == 'X':
				continue

			# last 3 digits
			zip_prefix_3 = y.a.next.next
			zip_prefix_3 = re.sub('[\n\r]+','',zip_prefix_3)

			# finding the first column will suffice.
			y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignRight*')})

			region_state = y.a.next.next.split()

			region = region_state[-3]
			state_abbrev = region_state[-2]

			if region_state[-1] != zip_prefix_3:
				print "There is a problem here: %s" % i

			self.zip_info.append((region,state_abbrev,zip_prefix_3))
			print "Found %s %s %s" % (region,state_abbrev,zip_prefix_3)

	def injectIntoDB(self):
		engine = sqlalchemy.create_engine('postgres://%s:%s@%s/%s' % ('postgresql','something','127.0.0.1:5432','zip_db'),strategy='threadlocal')
		'''
		The sqlalchemy explicit scope is done for clarity. Of course
		you can "from sqlalchemy import *" instead, and change the scope
		of these calls.
		'''
		metadata = sqlalchemy.MetaData()
		metadata.bind = engine
		zip_table = sqlalchemy.Table('zip_abbrevs', metadata,
			sqlalchemy.Column('zip_abbrevs_id', sqlalchemy.Integer, primary_key=True),
			sqlalchemy.Column('three_digit_abbrev', sqlalchemy.String(4)),
			sqlalchemy.Column('region', sqlalchemy.VARCHAR(50)),
			sqlalchemy.Column('state_abbrev', sqlalchemy.String(3)))

		metadata.create_all(engine)

		for (region, state, zip) in self.zip_info:
			print "Injecting %s %s %s\n" % (region, state,zip)
			zip_table.insert(values={'region':region,'state_abbrev':state,'three_digit_abbrev':zip}).execute()

if __name__ == "__main__":
	x=GetZips()
	x.getZipPrefixes()
	x.injectIntoDB()

# vim:ts=4: noet:

I am writing this code for the nonprofit called The Freelancer’s Union in NYC, which currently has a nationwide member drive: http://www.freelancersunion.org/advocacy/index.html.
I will shamelessly plug them in exchange for sharing this code with the world.

The more members they get in each US state, the better nationwide insurance plans they can offer. They offer E&O insurance for IT freelancers as well, so even if you freelance part-time, this could be for you. This organization rocks. I’ve been a member for three years, and now I proudly write code for them.

Gloria

The unmodified code

Comments

4 Responses to “I love python: Zip code prefix web scrape and DB injection in 70 lines”

  1. Noah F. San Tsorbutz says:


    Perhaps you know by now, but just in case, some editing step in your publishing process is being just a little “too helpful”. Standard ASCII characters (” & ‘) are being replaced by cp1252 “typography” characters. Also, pairs of single quotes (”) are being replaced by a single double qoute (“), and triple quotes end up even worse. You may be able to resolve this by wrapping the actual code sample in or tags.

    Also, thanks for posting some neat tools!

  2. Noah F. San Tsorbutz says:


    Wow, you got me! Even the text in my entry were mangled by the “helpful” publishing package. Anyhow, the tags I mentioned, but which got wiped are ‘code’ and ‘pre’.

  3. gloriajw says:


    Hi Noah,
    I’ll attach the actual code here as well. This tool is far from perfect, and my regex is mangled here.

  4. Catherine Devlin says:


    Thanks so much, Gloria – this is exactly what I needed! And a nice demonstration of the module, too.

Got something to say?


cheap research papers