I Love Python: BBC Language web scrape and encode to disk in 54 lines.

April 17th, 2008 by gloriajw

This module scrapes the BBC language web site (http://www.bbc.co.uk/worldservice/languages/)
for sample text from all 35 languages offered. It encodes each text snippet, writes it to its own file, then test-reads one sample file.

The encoding requirements took some digging through obscure docs, but the rest wasn’t so bad. If you want to know how to write Unicode text in many languages to files in Python, this is for you.

import urllib2
import codecs
import BeautifulSoup
import re
import pdb
import os

class GetBBC:
	def __init__(self):
		print "In constructor"
		self.language_links = []
		self.dir = 'BBC_Language_pages'
		try:
			os.makedirs(self.dir)
		except OSError:
			pass

	def getLanguageChoices(self):
		lang_page = urllib2.urlopen("http://www.bbc.co.uk/worldservice/languages/").read()
		self.soup = BeautifulSoup.BeautifulSoup(lang_page)
		# match langtexttop too
		links = self.soup.findAll(attrs={'class':re.compile('^langtext')})
		for x in links:
			self.language_links.append(x)
			print "Appending %s with link %s " % (x.a.string,x.a['href'])

		print "There are %d language choices for the BBC news page!" % len(self.language_links)

	def archiveLanguagePages(self):
		os.chdir(self.dir)
		for x in self.language_links:
			lang_page = urllib2.urlopen('http://www.bbc.co.uk' + x.a['href']).read()
			clean_page = BeautifulSoup.BeautifulSoup(lang_page).prettify()
			rawfile = codecs.open(x.a.string,'wb+','ISO8859-1')
			rawfile.write(unicode(clean_page,'ISO8859-1'))
			rawfile.close()
			print "Saved the %s page." % x.a.string
		os.chdir('..')

	def readLanguagePage(self,language):
		os.chdir(self.dir)
		rawfile = codecs.open(language,'rb','ISO8859-1')
		contents = rawfile.read()
		rawfile.close()
		os.chdir('..')
		return contents  # the decoded text, not the closed file object

if __name__ == "__main__":
	x=GetBBC()
	x.getLanguageChoices()
	x.archiveLanguagePages()
	y = x.readLanguagePage('Portuguese')

ISO8859-1 only covers Western European characters, so there are languages for which this encoding may not work. You may need to experiment with other codecs, especially for languages not supported by the BBC.
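One way to experiment is to try a list of codecs in order and keep the first one that can represent the text. The sketch below is written for Python 3 (the listing above is Python 2), and the helper names `save_text` and `load_text`, the sample strings, and the choice of UTF-8 as the fallback are my assumptions, not part of the original module.

```python
import os
import tempfile


def save_text(path, text, encodings=("iso8859-1", "utf-8")):
    """Write text to path with the first codec that can represent it.

    Returns the name of the codec that was used, so the caller can
    pass the same name when reading the file back.
    """
    for encoding in encodings:
        try:
            data = text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the text; try the next one
        with open(path, "wb") as handle:
            handle.write(data)
        return encoding
    raise ValueError("none of %r can encode this text" % (encodings,))


def load_text(path, encoding):
    """Read the file back and decode it with the codec used to write it."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


if __name__ == "__main__":
    # Illustrative sample strings, not scraped from the BBC site.
    samples = {
        "Portuguese": "Notícias em português",
        "Russian": "Новости на русском",
    }
    out_dir = tempfile.mkdtemp()
    for language, text in samples.items():
        path = os.path.join(out_dir, language)
        used = save_text(path, text)
        assert load_text(path, used) == text
        print("%s saved as %s" % (language, used))
```

Keeping track of which codec was used per file matters, because reading a file back with the wrong codec either fails or quietly produces garbage characters.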

I wrote this in May 2007, as a language support test for GrrlCamp, which is an online Open Source development group for women. We will be recruiting again in late June. If you are female, interested in volunteering development effort in exchange for learning, and have at least 6 hours free each week to do cutting edge fun Python design and development in a supportive and great online community, please post your email address and we will get back to you.

Gloria

The unmodified code


I Love Python: Zip code prefix web scrape and DB injection in 70 lines

April 17th, 2008 by gloriajw

Here’s a module I wrote (in an hour. Damn, Python is wonderful) which scrapes the US Postal Service web site for three-digit zip code prefixes (http://pe.usps.gov/text/dmm300/L002.htm). It creates a db table and injects zip code prefix, region and state info for each record found. It uses BeautifulSoup to parse the HTML, and SQLAlchemy to do the DB operations.

If you only need to check what region and state a particular zip code belongs to, this is for you. If anyone can point me to a free longitude/latitude/full zip code site, please post that info in a reply, and I’ll rewrite this module.

import urllib2
import codecs
import re
import pdb
import os
import BeautifulSoup
import sqlalchemy

class GetZips:
	def __init__(self):
		self.zip_info = []

	def getZipPrefixes(self):
		zip_page = urllib2.urlopen("http://pe.usps.gov/text/dmm300/L002.htm").read()
		self.soup = BeautifulSoup.BeautifulSoup(zip_page)
		# match zip columns
		zips = self.soup.findAll(attrs={'class':re.compile('^trBodyRow')})
		for i in zips:
			y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignLeft')})
			'''
			X is the symbol for an unused 3 digit zip prefix.
			'''
			if y.span and y.span.string == 'X':
				continue

			# the 3 digit zip prefix
			zip_prefix_3 = y.a.next.next
			zip_prefix_3 = re.sub('[\n\r]+','',zip_prefix_3)

			# finding the first column will suffice.
			y = i.find(attrs={'class':re.compile('^pTblBodyLL pAlignRight')})

			region_state = y.a.next.next.split()

			region = region_state[-3]
			state_abbrev = region_state[-2]

			if region_state[-1] != zip_prefix_3:
				print "There is a problem here: %s" % i

			self.zip_info.append((region,state_abbrev,zip_prefix_3))
			print "Found %s %s %s" % (region,state_abbrev,zip_prefix_3)

	def injectIntoDB(self):
		engine = sqlalchemy.create_engine('postgres://%s:%s@%s/%s' % ('postgresql','something','127.0.0.1:5432','zip_db'),strategy='threadlocal')
		'''
		The sqlalchemy explicit scope is done for clarity. Of course
		you can "from sqlalchemy import *" instead, and change the scope
		of these calls.
		'''
		metadata = sqlalchemy.MetaData()
		metadata.bind = engine
		zip_table = sqlalchemy.Table('zip_abbrevs', metadata,
			sqlalchemy.Column('zip_abbrevs_id', sqlalchemy.Integer, primary_key=True),
			sqlalchemy.Column('three_digit_abbrev', sqlalchemy.String(4)),
			sqlalchemy.Column('region', sqlalchemy.VARCHAR(50)),
			sqlalchemy.Column('state_abbrev', sqlalchemy.String(3)))

		metadata.create_all(engine)

		for (region, state, zip) in self.zip_info:
			print "Injecting %s %s %s\n" % (region, state,zip)
			zip_table.insert(values={'region':region,'state_abbrev':state,'three_digit_abbrev':zip}).execute()

if __name__ == "__main__":
	x=GetZips()
	x.getZipPrefixes()
	x.injectIntoDB()

# vim:ts=4: noet:
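
Once the table is filled, checking the region and state for a zip code comes down to looking up its first three digits. Here is a minimal sketch of that lookup, with several assumptions: it is Python 3, it swaps in the standard library's `sqlite3` with an in-memory database so it runs without a PostgreSQL server, the function names `build_prefix_table` and `lookup_zip` are hypothetical, and the rows are made up rather than real scrape output. Only the table and column names come from the module above.

```python
import sqlite3


def build_prefix_table(conn, zip_info):
    """Create the zip_abbrevs table and insert (region, state, prefix) tuples."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zip_abbrevs ("
        "zip_abbrevs_id INTEGER PRIMARY KEY, "
        "three_digit_abbrev TEXT, region TEXT, state_abbrev TEXT)"
    )
    conn.executemany(
        "INSERT INTO zip_abbrevs (region, state_abbrev, three_digit_abbrev) "
        "VALUES (?, ?, ?)",
        zip_info,
    )
    conn.commit()


def lookup_zip(conn, zip_code):
    """Return (region, state_abbrev) for a zip code, or None if the prefix is unknown."""
    prefix = zip_code.strip()[:3]  # the first three digits are the prefix
    cursor = conn.execute(
        "SELECT region, state_abbrev FROM zip_abbrevs "
        "WHERE three_digit_abbrev = ?",
        (prefix,),
    )
    return cursor.fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up rows for illustration only, in the same tuple order as zip_info.
    build_prefix_table(conn, [("EXAMPLETOWN", "XX", "000"),
                              ("SAMPLEVILLE", "YY", "999")])
    print(lookup_zip(conn, "99950"))
    print(lookup_zip(conn, "12345"))
```

The query uses a bound parameter rather than string formatting, which is worth keeping if the zip code ever comes from user input.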

I am writing this code for the Freelancers Union, a nonprofit in NYC which currently has a nationwide member drive: http://www.freelancersunion.org/advocacy/index.html.
I will shamelessly plug them in exchange for sharing this code with the world.

The more members they get in each US state, the better nationwide insurance plans they can offer. They offer E&O insurance for IT freelancers as well, so even if you freelance part-time, this could be for you. This organization rocks. I’ve been a member for three years, and now I proudly write code for them.

Gloria

The unmodified code
