I Love Python: BBC Language web scrape and encode to disk in 54 lines.

April 17th, 2008 by gloriajw

This module scrapes the BBC language web site (http://www.bbc.co.uk/worldservice/languages/) for sample text from all 35 languages offered. It encodes the text snippets, writes each one to its own file, then test-reads one sample file.

The encoding requirements took some digging through obscure docs, but the rest wasn’t so bad. If you want to know how to write Unicode text in many languages to files in Python, this is for you.

import urllib2
import codecs
import BeautifulSoup
import re
import pdb
import os

class GetBBC:
	def __init__(self):
		print "In constructor"
		self.language_links = []
		self.dir = 'BBC_Language_pages'
		try:
			os.makedirs(self.dir)
		except OSError:
			pass

	def getLanguageChoices(self):
		lang_page = urllib2.urlopen("http://www.bbc.co.uk/worldservice/languages/").read()
		self.soup = BeautifulSoup.BeautifulSoup(lang_page)
		# '^langtext' also matches the 'langtexttop' class
		links = self.soup.findAll(attrs={'class':re.compile('^langtext')})
		for x in links:
			self.language_links.append(x)
			print "Appending %s with link %s " % (x.a.string,x.a['href'])

		print "There are %d language choices for the BBC news page!" % len(self.language_links)

	def archiveLanguagePages(self):
		os.chdir(self.dir)
		for x in self.language_links:
			lang_page = urllib2.urlopen('http://www.bbc.co.uk' + x.a['href']).read()
			clean_page = BeautifulSoup.BeautifulSoup(lang_page).prettify()
			rawfile = codecs.open(x.a.string,'wb+','ISO8859-1')
			rawfile.write(unicode(clean_page,'ISO8859-1'))
			rawfile.close()
			print "Saved the %s page." % x.a.string
		os.chdir('..')

	def readLanguagePage(self,language):
		os.chdir(self.dir)
		rawfile = codecs.open(language,'rb','ISO8859-1')
		text = rawfile.read()
		rawfile.close()
		os.chdir('..')
		# return the decoded text, not the closed file object
		return text

if __name__ == "__main__":
	x=GetBBC()
	x.getLanguageChoices()
	x.archiveLanguagePages()
	y = x.readLanguagePage('Portuguese')
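
For anyone on Python 3, where urllib2 and BeautifulSoup 3 are not available, the link-collecting step can be sketched with the standard library alone. This is only a rough equivalent of getLanguageChoices, not the code I ran: the class name LanguageLinkParser, the helper extract_language_links, and the sample markup are all made up for illustration, and the real BBC page layout may differ.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup for illustration; the real BBC page layout may differ.
SAMPLE_HTML = """
<div class="langtexttop"><a href="/arabic/">Arabic</a></div>
<div class="langtext"><a href="/portuguese/">Portuguese</a></div>
<div class="footer"><a href="/help/">Help</a></div>
"""


class LanguageLinkParser(HTMLParser):
    """Collect (name, href) pairs for the first link that follows each
    element whose class attribute starts with 'langtext'."""

    CLASS_PATTERN = re.compile(r'^langtext')

    def __init__(self):
        super().__init__()
        self.links = []
        self._armed = False   # True after a matching element, until its link is seen
        self._href = None     # href of the link currently being read
        self._text = []       # text pieces of the link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.CLASS_PATTERN.match(attrs.get('class') or ''):
            self._armed = True
        if tag == 'a' and self._armed and attrs.get('href'):
            self._href = attrs['href']
            self._text = []
            self._armed = False

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def extract_language_links(html):
    """Return a list of (language name, href) pairs found in html."""
    parser = LanguageLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_language_links(SAMPLE_HTML) picks up the Arabic and Portuguese links and skips the footer link, because only class names beginning with langtext arm the parser.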

There are languages for which ISO8859-1 encoding may not work, so you may need to experiment with encoding codecs for languages not supported by the BBC.
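
One way to do that experimenting is to try candidate codecs on a sample string and keep the first one that can encode it. The sketch below is Python 3 and uses the built-in open with an encoding argument in place of codecs.open; the helper names first_working_codec and round_trip, the candidate list, and the sample strings are my own illustrations, not text taken from the BBC pages.

```python
import os
import tempfile

# Sample strings chosen for illustration; they are not taken from the BBC pages.
SAMPLES = {
    'Portuguese': 'Notícias em português',
    'Russian': 'Новости',
    'Arabic': 'أخبار',
}


def first_working_codec(text, candidates=('iso8859-1', 'utf-8')):
    """Return the first codec name in candidates that can encode text, or None."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    return None


def round_trip(text, encoding, directory):
    """Write text to a file in the given encoding, read it back, and return it."""
    path = os.path.join(directory, 'sample.txt')
    with open(path, 'w', encoding=encoding) as handle:
        handle.write(text)
    with open(path, 'r', encoding=encoding) as handle:
        return handle.read()


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as workdir:
        for language, text in SAMPLES.items():
            codec = first_working_codec(text)
            restored = round_trip(text, codec, workdir)
            print(language, codec, restored == text)
```

With these samples, the Portuguese string fits in ISO8859-1, while the Russian and Arabic strings fall through to UTF-8, since ISO8859-1 has no Cyrillic or Arabic characters.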

I wrote this in May 2007, as a language support test for GrrlCamp, which is an online Open Source development group for women. We will be recruiting again in late June. If you are female, interested in volunteering development effort in exchange for learning, and have at least 6 hours free each week to do cutting edge fun Python design and development in a supportive and great online community, please post your email address and we will get back to you.

Gloria

The unmodified code


Programming from the (under)ground up

January 5th, 2008 by lisa