Data Scraping - Part II

Date: August 14, 2013

This is Part II in our Data Scraping blog post series. Part I by Jewel Loree shows how to scrape websites using IFTTT, and Part III by Isaac Obezo provides a walkthrough for using Python.

Today, as part of our Tableau Public Data Month, we're focusing on yet another excellent and easy-to-use data scraping tool. Unlike IFTTT, which Jewel reviewed earlier this week, this tool allows you to scrape any website, not just ones that have special connectors, though it does require a few more steps to set up.

To showcase its power, we will do a step-by-step walkthrough of how to collect data from a movie review site similar to IMDb. These walkthrough steps are universal and will apply to any site you scrape in the future.

Once we scrape the website, we'll have a CSV file with all of the collected data that we can use in Tableau Public to build vizzes and share with the world. Here's a sample viz showing the number of reviews and movies by country, based on the data collected using the scraper we'll build below.

Walkthrough

To get started, you'll first need to install the desktop client. Once you have it running, you're met with a summary of the steps required for collecting data.

  1. Enter the address of the website you want to scrape into the URL bar. Let's pick a movie page and see what different fields are available. Next, let's open the “Extract Data from a web page” tab.
  2. Now, turn on the “Data Table” and switch the “Web Mode” to “Crawler”. This will allow us to start defining the location of our desired fields on the webpage and, eventually, crawl the rest of the site.
  3. Since we'll be collecting a single movie review per page, we'll choose “Single row of data on the web page”. Then we can define the fields we want to collect by using “Insert column” to add columns to our data table.
  4. Each column we add can have a different data type. For example, title will be a text field, rating would be a number, the year the movie premiered would be a date, and country would be an image, since we are using a picture of the flag.
  5. For our sample movie data set, we'll want to create the following columns:
    • title
    • year
    • running_time
    • country
    • rating
    • votes
    • genre1
    • genre2
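  Before moving on, it may help to see the shape of the output we're aiming for. Here's a minimal Python sketch of writing rows with these eight columns using the standard csv module (the sample row is invented purely for illustration):

```python
import csv
import io

# The eight columns defined in the walkthrough above.
FIELDS = ["title", "year", "running_time", "country",
          "rating", "votes", "genre1", "genre2"]

def write_rows(rows):
    """Write scraped rows to a CSV string matching the walkthrough's schema."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

# Hypothetical sample row, for illustration only.
sample = {"title": "Example Movie", "year": "1999", "running_time": "120",
          "country": "IS", "rating": "7.5", "votes": "1234",
          "genre1": "Drama", "genre2": "Comedy"}
print(write_rows([sample]).splitlines()[0])
```

  A CSV with this header row is exactly what Tableau Public can connect to once the crawl finishes.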
  6. Now that we know the fields we want to collect, we need to map each column to a piece of content on the webpage. To do that, we pick a column, hover over the webpage until the content we want is highlighted in blue, click on it, and then click the blue “Capture” icon on the column tab. We continue highlighting content on the page and adding it to each column until all of the columns are mapped to parts of the webpage.
    When done, you should have the very first row of your data table filled in. Confirm that all of the results in the columns match the fields you want to collect.

    Lastly, click the “Add to data basket” icon to add that page to your collected data set.
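    The capture step above amounts to mapping pieces of page content to named columns. As a rough Python illustration of the same idea (the markup and class names below are hypothetical, not the real site's HTML):

```python
import re

# Hypothetical page fragment; the real site's markup will differ,
# so these patterns are illustrative only.
SAMPLE_PAGE = """
<h1 class="title">Example Movie</h1>
<span class="year">1999</span>
<span class="rating">7.5</span>
"""

def extract_row(html):
    """Capture page content into named columns, like step 6's data table."""
    patterns = {
        "title": r'<h1 class="title">(.*?)</h1>',
        "year": r'<span class="year">(.*?)</span>',
        "rating": r'<span class="rating">(.*?)</span>',
    }
    row = {}
    for column, pattern in patterns.items():
        match = re.search(pattern, html)
        row[column] = match.group(1) if match else None
    return row

print(extract_row(SAMPLE_PAGE))
```

    The point-and-click capture does this selector work for you, which is what makes the tool approachable for non-programmers.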

  7. We now have our very first row of movie review data, and we will treat this mapping as our template for the rest of the website. Next, we want to build a crawler that will go through all of the movie review pages on the entire website and capture the requested fields. Click on “Crawl a website for data” to start building the crawler. In addition to our first page of results, we'll need to include at least one more sample page. To do that, let's browse to another movie review page, turn the “Data table” view back on, and check whether the data table updated automatically with that particular movie's results. Lastly, if all of the fields are mapped properly, let's “Add to data basket” these results as well.
  8. Now that we have our two sample pages of results, we can run our crawler. Click “Run Crawler”, select where on your computer you want the output file placed, and then adjust a few of the “Advanced options” to fine-tune the pages we want to crawl. For this movie review data set, we want the crawler to start on a page that includes links to all of the movies we want to crawl. The All the Movies (A-Z) link has all of the movies spread out over nearly 300 separate result pages. By choosing this start page, the crawler will click through the page links and iterate over each result page, collecting the fields for each movie.
  9. We will leave most of the remaining settings alone, only changing the concurrent requests to “2” and the pause between requests to “5 seconds”. This should give us fairly good performance without straining the host website. We can also modify the “Where to extract data from?” option and change "{alpha}{num}.html$" to "{num}.html$", since all of the movie review pages have that URL format.
  10. The final step is to click “Go” and watch the results stream in! Depending upon the size of the site you are crawling and the settings you chose, it could take several hours to collect all of the data.
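For readers curious what the crawler is doing under the hood, here is a minimal Python sketch of the behavior configured in steps 9 and 10: a URL filter equivalent to the "{num}.html$" pattern and a pause between requests so the host site isn't strained. The fetch function is a stub; a real crawler would download and parse each page.

```python
import re
import time

# Matches review pages whose URLs end in digits plus ".html",
# mirroring the "{num}.html$" pattern from step 9.
MOVIE_URL = re.compile(r"/(\d+)\.html$")

def crawl(urls, fetch, pause=5.0):
    """Visit only movie-review pages, pausing between requests
    (the 5-second setting from step 9) to be polite to the host."""
    rows = []
    for url in urls:
        if not MOVIE_URL.search(url):
            continue  # skip pages that don't look like movie reviews
        rows.append(fetch(url))
        time.sleep(pause)
    return rows

# Hypothetical URL list and stubbed fetch, for illustration only.
urls = ["/movies/123.html", "/movies/a12.html", "/about.html", "/movies/456.html"]
print(crawl(urls, fetch=lambda u: {"url": u}, pause=0))
```

Note how "/movies/a12.html" is skipped: tightening the pattern from "{alpha}{num}.html$" to "{num}.html$" is what keeps the crawl focused on review pages.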
Next Steps

Now that you've set up your first web crawler, you can start building scrapers to mine data from entire websites as well as individual webpages. Instead of copying and pasting from websites into Excel and then correcting the formatting, you could scrape data directly off of tables embedded in webpages, such as real estate listings and bus timetables.
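As a rough sketch of pulling an embedded table without copy-and-paste, here's a small Python parser built on the standard library's html.parser (the timetable fragment is invented for illustration):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the cell text of every <td>/<th> in a page, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh row
        elif tag in ("td", "th"):
            self._in_cell = True    # begin capturing cell text

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Hypothetical bus timetable fragment, standing in for a real page.
page = ("<table><tr><th>Route</th><th>Departs</th></tr>"
        "<tr><td>12</td><td>08:15</td></tr></table>")
scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

From there, the rows drop straight into a CSV for Tableau Public, with no manual formatting cleanup.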

If you're interested in learning more, make sure to check out their website.
Also, stay tuned: a new version of the Data Browser will be launching shortly with a revised user interface. Here's a teaser screenshot of what it will look like.

