I Love Python: BBC Language web scrape and encode to disk in 54 lines.
April 17th, 2008
This module scrapes the BBC language web site (http://www.bbc.co.uk/worldservice/languages/)
for sample text from all 35 languages offered. It encodes each text snippet, writes it to its own file, then reads one sample file back as a test.
The encoding requirements took some digging through obscure docs, but the rest wasn’t so bad. If you want to know how to write Unicode text in many languages to files in Python, this is for you.
import urllib2
import codecs
import BeautifulSoup
import re
import os

class GetBBC:
    def __init__(self):
        print "In constructor"
        self.language_links = []
        self.dir = 'BBC_Language_pages'
        # Create the output directory; ignore the error if it already exists.
        try:
            os.makedirs(self.dir)
        except OSError:
            pass

    def getLanguageChoices(self):
        lang_page = urllib2.urlopen("http://www.bbc.co.uk/worldservice/languages/").read()
        self.soup = BeautifulSoup.BeautifulSoup(lang_page)
        # The prefix match picks up langtexttop as well as langtext.
        links = self.soup.findAll(attrs={'class':re.compile('^langtext')})
        for x in links:
            self.language_links.append(x)
            print "Appending %s with link %s " % (x.a.string,x.a['href'])
        print "There are %d language choices for the BBC news page!" % len(self.language_links)

    def archiveLanguagePages(self):
        os.chdir(self.dir)
        for x in self.language_links:
            lang_page = urllib2.urlopen('http://www.bbc.co.uk' + x.a['href']).read()
            clean_page = BeautifulSoup.BeautifulSoup(lang_page).prettify()
            # One file per language, named after the link text.
            rawfile = codecs.open(x.a.string,'wb+','ISO8859-1')
            rawfile.write(unicode(clean_page,'ISO8859-1'))
            rawfile.close()
            print "Saved the %s page." % x.a.string
        os.chdir('..')

    def readLanguagePage(self,language):
        os.chdir(self.dir)
        rawfile = codecs.open(language,'rb','ISO8859-1')
        text = rawfile.read()
        rawfile.close()
        os.chdir('..')
        # Return the decoded text, not the closed file object.
        return text

if __name__ == "__main__":
    x=GetBBC()
    x.getLanguageChoices()
    x.archiveLanguagePages()
    y = x.readLanguagePage('Portuguese')
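The link filter in getLanguageChoices depends on a regular expression that matches class names beginning with langtext. Here is a small offline sketch of how that pattern behaves, using only the re module. It is written for Python 3 rather than the Python 2 of the module itself, and every class name other than langtext and langtexttop is made up for illustration. It also shows why a trailing `*` is unnecessary for a prefix match: the star would apply only to the final "t".

```python
import re

# Anchored prefix pattern: matches any class name starting with "langtext".
prefix = re.compile(r'^langtext')
# With a trailing '*', the star quantifies only the final "t",
# so the shorter name "langtex" matches too.
starred = re.compile(r'^langtext*')

# "langtext" and "langtexttop" come from the post; the rest are hypothetical.
class_names = ['langtext', 'langtexttop', 'langtex', 'toplangtext', 'footer']

prefix_hits = [name for name in class_names if prefix.search(name)]
starred_hits = [name for name in class_names if starred.search(name)]

print(prefix_hits)
print(starred_hits)
```

Both patterns pick up the two class names the scraper cares about; the anchored prefix form just avoids the accidental extra match.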
ISO8859-1 covers only Western European characters, so it may not work for every language. You may need to experiment with other encoding codecs, particularly for languages not supported by the BBC.
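One way to experiment with codecs is to try ISO8859-1 first and fall back to UTF-8 when the text cannot be represented. The sketch below is one possible approach, not part of the original module: the helper names save_text and load_text, the sample strings, and the temporary directory are all assumptions for illustration, and it is written for Python 3.

```python
import io
import os
import tempfile

def save_text(path, text, encodings=('iso8859-1', 'utf-8')):
    """Write text using the first encoding that can represent it.

    Returns the name of the encoding used, so the caller can
    pass the same name when reading the file back.
    """
    for encoding in encodings:
        try:
            data = text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the text; try the next
        with open(path, 'wb') as handle:
            handle.write(data)
        return encoding
    raise ValueError('no candidate encoding can represent this text')

def load_text(path, encoding):
    """Read a file back as text, decoding with the given encoding."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

# Hypothetical sample snippets, not scraped from the BBC site.
samples = {
    'Portuguese': u'Not\u00edcias em portugu\u00eas',
    'Russian': u'\u041d\u043e\u0432\u043e\u0441\u0442\u0438',
}

workdir = tempfile.mkdtemp()
results = {}
for language, text in samples.items():
    path = os.path.join(workdir, language)
    results[language] = save_text(path, text)
    # Round trip: what was written must read back unchanged.
    assert load_text(path, results[language]) == text
    print(language, results[language])
```

The Portuguese sample fits in ISO8859-1, while the Cyrillic sample does not and falls through to UTF-8. Whatever encoding is chosen has to be remembered somewhere, because the file has to be decoded with the same one.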
I wrote this in May 2007, as a language support test for GrrlCamp, which is an online Open Source development group for women. We will be recruiting again in late June. If you are female, interested in volunteering development effort in exchange for learning, and have at least 6 hours free each week to do cutting-edge, fun Python design and development in a supportive and great online community, please post your email address and we will get back to you.
Gloria
