
Python Scraping

In this post, I will walk through a small scraping exercise. Scraping is a software technique for automatically collecting information from a webpage. Note: this example is provided for illustrative purposes only; keep in mind that scraping websites is not always allowed.

What will we be doing?

In this post, I will build a very small program that scrapes the top 250 movies listed on the IMDB website. Luckily, IMDB already provides a URL that lists these top 250 movies: http://www.imdb.com/chart/top. From that list, we are interested in the title and the rating of each movie.

What tools will we use?

Python seems to be the perfect candidate for this, although Ruby could be used just as well. Since we are using Python, we'll also use a little tool called BeautifulSoup (http://www.crummy.com/software/BeautifulSoup). This tool provides methods for navigating, searching and modifying a parse tree. In other words, you give it the page you want information from, and it allows you to find the particular piece of information you are looking for.
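
To get a feel for how that works, here is a minimal sketch that parses a small hard-coded HTML snippet and pulls out one cell. It assumes BeautifulSoup 3 (the same version imported later in this post), and the snippet itself is made up purely for illustration:

from BeautifulSoup import BeautifulSoup

# a tiny, hard-coded snippet just to demonstrate find()
html = '<table><tr><td class="titleColumn">The Godfather</td></tr></table>'
soup = BeautifulSoup(html)

# find the first <td> with class "titleColumn" and print its text
cell = soup.find('td', attrs={'class': 'titleColumn'})
print cell.text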

The code

So let's start. Create a file called scrape.py (or whatever you feel like). Import the BeautifulSoup tool as well as the urllib2 library. As mentioned above, BeautifulSoup will provide us with all the methods needed for scraping the website, while urllib2 is a library to open and handle URLs.

from BeautifulSoup import BeautifulSoup
import urllib2

Obviously, we need to provide the URL we would like to scrape; in our case this is the Top 250 IMDB list. Eventually, the complete HTML page will be loaded into the variable 'soup'. We can then apply some methods to find the piece of information we are interested in.

url="http://www.imdb.com/chart/top"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

If you look carefully at the HTML code of the page, you will see that all the data is in fact part of a table.

<table class="chart"  data-caller-name="other-chart">
      <colgroup>
        <col class="chartTableColumnPoster"/>
        <col class="chartTableColumnTitle"/>
        <col class="chartTableColumnIMDbRating"/>
        <col class="chartTableColumnYourRating"/>
        <col class="chartTableColumnWatchlistRibbon"/>
      </colgroup>
      <thead>
      <tr>
        <th></th>
        <th>Rank & Title</th>
        <th>IMDb Rating</th>
        <th>Your Rating</th>
        <th></th>
      </tr>
      </thead>
      <tbody class="lister-list">
<tr class="odd">
  <td class="posterColumn">...</td>
  <td class="titleColumn">...</td>
  <td class="ratingColumn">...</td>
  <td class="ratingColumn">...</td>
  <td class="watchlistColumn">...</td>
</tr> 

So with BeautifulSoup, we can find all the relevant data easily. First, we find the table and its table body. Then we search within the table body to find all the rows. In the code below, the rows are stored in the variable rows. Within each row, we search for all columns, collect their text in the info list, and append that list to data.

data = []
table = soup.find('table', attrs={'class':'chart'})
table_body = table.find('tbody')

rows = table_body.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    info = []
    for col in cols:
        # keep only the plain text of each cell
        info.append(col.text.strip().encode('utf-8'))
    data.append(info)

In data, you will find the following:

[
  ['', '1.The Shawshank Redemption(1994)', '9.2', 'RATE  123456789109.3/10X ', ''], 
  ['', '2.The Godfather(1972)', '9.2', 'RATE  123456789109.2/10X ', ''], 
  ['', '3.The Godfather: Part II(1974)', '9.0', 'RATE  123456789109.1/10X ', ''], 
  ['', '4.The Dark Knight(2008)', '8.9', 'RATE  123456789109/10X ', ''], 
  ['', '5.Pulp Fiction(1994)', '8.9', 'RATE  123456789109/10X ', ''], 
  ['', '6.The Good, the Bad and the Ugly(1966)', '8.9', 'RATE  123456789109/10X ', ''], 
  ....
]

The reason I put everything in a nested list (a list of lists) is that we can now print out all the information we need in a flexible way. If you wish to store the data in a CSV file or create a PDF instead, that is entirely up to you (a small CSV sketch is included at the end of this post). For simplicity, I just print it out as follows:

for item in data:
    print item[1] + " => " + item[2]

We get the following output:

1.The Shawshank Redemption(1994) => 9.2
2.The Godfather(1972) => 9.2
3.The Godfather: Part II(1974) => 9.0
4.The Dark Knight(2008) => 8.9
5.Pulp Fiction(1994) => 8.9
...
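
As promised, here is a minimal sketch for storing the results in a CSV file instead of printing them. It reuses the data list built above; the file name movies.csv is just an example:

import csv

# write the rank/title and rating columns to a CSV file ('movies.csv' is an arbitrary name)
with open('movies.csv', 'wb') as f:        # binary mode, since this is Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])
    for item in data:
        writer.writerow([item[1], item[2]])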