Collecting Flight Data from the Bureau of Transportation Statistics with Python
One particularly interesting data set, and one that is not very straightforward to harvest, is the flight data provided by the Research and Innovative Technology Administration of the Bureau of Transportation Statistics, or RITA.
So what is so compelling about this data set? Well, besides it being flight data, it has geographical data, it has datetime data, it is very wide, very flat, and very, very deep (it dates back decades). It is one of our favorite little datasets here within Tableau Software. It is one of those “go to” demo sets because it can be unwieldy and yet the data is understandable. But how do we collect it?
RITA data, fortunately, comes to us as a ZIP file containing a CSV, which is easy to work with in Tableau … well, easy enough to work with in Tableau; it is still a rather large and wide dataset (as mentioned above). But no transformations are needed, and if we can crack the code that delivers the CSV file to us, then all we have to do is grab that ZIP file, unzip it, and either perform any ETL on it we want or load it directly into Tableau … did anyone say incremental refresh?
Let’s get cracking. We know that the data we want is somewhere to be found on the RITA webpage, or that we will be directed to it from there.
There was a little bit of discovery on my part from the RITA homepage, but after searching the RITA site I made a wonderful discovery, the TranStats page at;
As much as I love data, sometimes I need to stay away from pages like this because I generally have something pressing that needs to be completed. And here on this RITA data page, we see all of these potentially powerful data sets; oh look, bicycle data, and rail data … must focus, back to the task at hand.
What we are really interested in is the flight data, specifically the on-time performance data. A quick search brings us to the following page which provides us with definitions of the data we are looking for:
But wait, here is a real treat for us: there is something on the left-hand side under Data Tools.
We can download the data!
Clicking the download data button, we are presented with a variety of fields to select. Being a little overly data greedy, I want it all. After all, I can always be more selective downstream with what I am presenting.
Here is a snapshot of what we are presented with;
There is that lovely download button on the side, along with filters on year, on month, and on geography. This is a data bonanza!
So instead of manually clicking on each of the columns, we quickly discover that we can just select the “Prezipped File” option to collect the data;
Selecting download I get prompted to save a file (on my Windows machine);
What just happened? I do not want to save just a single month at a time; I want to be able to save months of data, or, better yet, years’ worth of data. I want to do real analytics on this data set. I want to blend it with other data sets to see what is happening. I do not want to manually save each of these datasets. Ughh.
I am in luck: I happen to be able to figure this out. With a little bit of Python, and with the help of some very useful tools like Fiddler2, I can snoop and sniff, then write code to pull down datasets directly from the webpage.
Step 1. Sniff the Web Transactions
This is a topic unto itself: how to sniff around, what to look for, etc. I, because I am quite gracious, will shortcut this. We use Fiddler2 to discover that something is being sent to a database, as seen here;
Clearly we can see that after we select the download we are loading something from a server tsdata1.bts.gov, but if we connect directly to this server we get the following;
Which is pretty standard. Oh well, we need to investigate further … we already know how this story ends: the data is somewhere here, and it turns out it is being POSTed to the tsdata1.bts.gov server, as seen in the inspect tab from the source;
This is a scary looking POST.
But that’s ok, we can manage.
Step 2. Let’s Make a String Together
Python is my friend, it is my go-to language, and I know that I can manage this string and all of the variable data within a Python script. I copy this POST to a post.py file as a string much like the following snippet;
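The original snippet was shown as a screenshot; as a stand-in, post.py might look something like this. The field names below are illustrative placeholders, not the actual TranStats form fields; your own Fiddler2 capture supplies the real body:

```python
# post.py -- the captured POST body stored as one long string.
# NOTE: the field names here are illustrative placeholders; the real body
# is whatever Fiddler2 captured when you clicked the download button.
POST_DATA = (
    "UserTableName=On_Time_Performance"
    "&DBShortName=On_Time"
    "&XYEAR=2013"
    "&XMONTH=1"
    "&varlist=..."  # the (very long) captured field list continues here
)
```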
I discovered that there are quite a few variables in this POST, but I am really only interested in the time variables. I want to be able to pull data down based on time and not worry about geography.
Searching the previous POST I see the following information;
which I change to become;
Quick note: I am using Python 2.7, so I use the string.format(name=val, …) convention, where the variable name is wrapped in curly braces. For further information about string formatting in Python, please see http://docs.python.org/2/library/string.html#format-examples
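Concretely, the idea is to swap the hard-coded values from the capture for named placeholders and fill them in with format (the field names here are again illustrative stand-ins for the captured ones):

```python
# The captured body hard-codes the month we clicked in the WebUI; swapping
# those literals for {year} and {month} placeholders lets us request any month.
# (Field names are illustrative; use the ones from your own capture.)
POST_TEMPLATE = "UserTableName=On_Time_Performance&XYEAR={year}&XMONTH={month}"

post_data = POST_TEMPLATE.format(year=2013, month=1)
print(post_data)  # -> UserTableName=On_Time_Performance&XYEAR=2013&XMONTH=1
```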
Now that I have a POST that I can send to the tsdata1.bts.gov server directly, I can request individual months, one at a time, without bothering with the WebUI.
But how do I send this POST?
Step 3. Let’s Send our POST
Now that I have my POST, it lives in a file named (because I am clever) post.py, within the same directory as the new file I just created, get_transtat_data.py.
Here is the start of my script;
Notice the shebang, the #!/usr/bin/env python line. Yes, I am working on a non-Windows machine, and yes, I need this. For more information about the shebang, see http://en.wikipedia.org/wiki/Shebang_(Unix).
I like to write the doc string out. Why? I am forgetful and need constant reminders of what I am doing and what I have done. A nice little doc string is your best friend.
But the question still remains, how do I send this POST to the RITA server and get the CSV file? Since I am familiar with this process, I already know which modules I will need to use in order to collect the RITA data. I have, without much ado, added those modules to the Python script. We now have the following;
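The original code appeared as a screenshot; here is a sketch of what the top of the script might look like, with the modules I would expect it to need (my assumption). I include a Python 3 import fallback for readers following along today:

```python
#!/usr/bin/env python
"""get_transtat_data.py

POST the form body captured with Fiddler2 to the tsdata1.bts.gov server,
then download and unzip the CSV it produces.
"""
import os       # renaming the extracted CSV
import zipfile  # unpacking the downloaded archive

try:
    # Python 2.7, as used in this post
    from urllib2 import Request, urlopen
except ImportError:
    # Python 3 equivalent, for readers following along today
    from urllib.request import Request, urlopen
```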
With our modules defined, we just build up the script to create a user-agent string, connect to the RITA server, and request the data. With a little work we have the following;
And now we add the request and the response, like;
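As a hedged sketch of what that request might look like (the URL path, user-agent string, and POST body here are my stand-ins; the line that actually sends the request is commented out so you decide when to hit the live server):

```python
try:
    from urllib2 import Request, urlopen          # Python 2.7
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

# Stand-in URL; use the exact address from your Fiddler2 capture.
URL = "http://tsdata1.bts.gov/DownLoad_Table.asp"

# A browser-like user-agent; some servers refuse Python's default one.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; transtats-fetcher)",
    "Content-Type": "application/x-www-form-urlencoded",
}

post_data = "XYEAR=2013&XMONTH=1"  # stand-in for the full captured body

# Supplying a data payload makes urllib issue a POST instead of a GET.
request = Request(URL, data=post_data.encode("ascii"), headers=headers)
print(request.get_method())  # -> POST

# response = urlopen(request)  # uncomment to actually send it
```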
We can now download a file from the server.
Now we have sent the POST for January 2013 and have received a response from the BTS server. Now what do we do with it?
Step 4. Let’s Find, Unzip, and Save
The response from the previous step is the response object. We want to make sure that we are getting a 302 from the server before we attempt to do anything else; this is done easily enough with the following;
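A small helper sketch for that check. Note that urllib normally follows redirects on its own, so in practice you inspect the raw response status yourself; the helper below just captures the logic against a status code and a header mapping, with a fabricated example value:

```python
def zip_location(status, headers):
    """Return the Location header of a 302 response, or None otherwise.

    `headers` is any dict-like mapping of response header names to values.
    """
    if status != 302:
        return None  # not the redirect we expect, so do not try to download
    return headers.get("Location")

# Example with a fabricated response; the real Location value is whatever
# the tsdata server sends back.
print(zip_location(302, {"Location": "/ftproot/TranStats/sample.zip"}))
print(zip_location(404, {}))  # -> None
```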
Now, I already know that the location of the ZIP file being created by the server is sent back in the headers. I can download that ZIP file from the location provided with the following;
This is the location, in the headers, with the address of the ZIP file, as provided by the tsdata2.bts.gov server.
Now that we have the location, all we need to do is download it and unzip it.
This is easy enough: Python provides modules to handle normal file system operations like unzipping. Using the zipfile module, we can manage the ZIP file directly with the following code;
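Here is a runnable sketch of that step. Since we cannot hit the server on this page, the “downloaded” archive is fabricated locally, and the file names are illustrative:

```python
import os
import zipfile

# Fabricate a "downloaded" archive for illustration; in the real script this
# is the ZIP fetched from the Location header.
with zipfile.ZipFile("download.zip", "w") as zf:
    zf.writestr("123456_T_ONTIME.csv", "YEAR,MONTH,ORIGIN\n2013,1,SEA\n")

# Unzip it, then rename the extracted CSV to something meaningful to us.
with zipfile.ZipFile("download.zip") as zf:
    member = zf.namelist()[0]  # the archive holds a single CSV
    zf.extract(member)
os.rename(member, "ontime_2013_01.csv")
```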
Now we have unzipped our downloaded file, renamed it to something meaningful to us, and are ready to do some analytics with this file.
But we really want to automate this; we want to pull down more than just January 2013’s RITA data.
Step 5. Now What?
We have successfully collected the CSV file from RITA for January, 2013. But now what?
We have a couple of options;
1. we can add command line parameters to collect a range of dates, concatenate the results, and create the CSV files locally, or
2. we can plug this tool into ScraperWiki and let it collect results for us.
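Option 1 can be sketched with argparse; the flag names below are my own choice, not necessarily what the github script uses:

```python
import argparse

def month_range(start, end):
    """Expand 'YYYY-MM' start/end strings into inclusive (year, month) pairs."""
    sy, sm = [int(p) for p in start.split("-")]
    ey, em = [int(p) for p in end.split("-")]
    months = []
    y, m = sy, sm
    while (y, m) <= (ey, em):  # tuple comparison handles year rollover
        months.append((y, m))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return months

parser = argparse.ArgumentParser(description="Fetch RITA on-time data.")
parser.add_argument("--start", required=True, help="first month, e.g. 2013-01")
parser.add_argument("--end", required=True, help="last month, e.g. 2013-03")

args = parser.parse_args(["--start", "2013-01", "--end", "2013-03"])
print(month_range(args.start, args.end))
# -> [(2013, 1), (2013, 2), (2013, 3)]
```

Each (year, month) pair would then be formatted into the POST template and fetched in turn.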
I have added the command line parameters to get_transtat_data.py on my github account at;
The script on my github account is the finished product of the script we just created.
Also, since we are talking about ScraperWiki: this is a fantastic tool for sharing and creating datasets! It is amazing to me that such things exist. I am working on setting up this same RITA script to work on ScraperWiki; it just takes me a moment or two to learn new things.
But the dataset will be at;
While POSTing to a server to get data files is probably not the most straightforward method for collecting web data, it does illustrate that we can collect data from web sources with a small set of tools, a little bit of programming, and some determination. There is so much to see and understand.
And with great data comes great opportunity to create visualizations using Tableau.
In my talk "Domesticating Web Data for use with Tableau" at the upcoming Tableau Customer Conference in DC, I will also show how to scrape other types of web data, such as tables, and how to use APIs to build data sets.