Meet the developers behind Tableau’s Data Interpreter and pivot features, Sasha Dadiomov and Stefan Batres.
Q: Can you tell us a little bit about yourself? What do you do when you don’t work on awesome software features?
Stefan: I’ve been in development for a long time and have worked on all kinds of things, from desktop applications to middle-tier servers, web services, and compilers. I’ve been with Tableau for 3+ years, and I currently manage one of the engineering teams that works on data prep. I am a 20-year Seattle transplant; I came here from Mexico and live here with my wife and kids.
Sasha: I’ve worked in software for a long time, and in general I try to choose projects in which I can be an actual user. So far, I’ve gotten to work on a myriad of things, from consumer electronics to distributed petabyte-scale data processing to medical diagnostics, to name a few. I’ve always been curious about history and science, and especially about statistically backed analyses. Since a lot of historical and scientific data is locked up in research papers, I’ve felt a personal need to ingest data from various crazy sources like poorly formatted Excel and PDF files.
Q: I had just come to accept that I would always have to tidy up my Excel spreadsheets before using them in Tableau. Between pivot and the Data Interpreter features, you’ve saved me a lot of wrangling in Excel. How did Tableau come up with the ideas for these features?
Sasha: I came up with the Data Interpreter when Tableau was switching from the Microsoft JET Excel driver to the LibXL library. I noticed that many of our test Excel workbooks had a lot of junk headers and footers, missing rows, and sub-tables. While my new code could parse the files “correctly,” the user would still have to do a lot of manual cleaning to get their Excel data ready for analysis in Tableau. Since most of the cleaning operations were simple, and users had to do them over and over, I thought this would be doable with code.
Stefan: Pivot is a fairly common transformation that other tools have supported for some time, but it is particularly useful in Tableau. Because of the way Tableau likes its data, things like crosstabs don’t work well. And given how popular those kinds of data representations are, we had seen many customers over the years have to massage their data before analysis. So, building it into the product was an obvious win!
Q: What was your favorite part of developing these data-prep features?
Sasha: All of it, since it was one of my original ideas. Tableau, like many other companies, holds hackathons, which let developers explore their interests and build proofs of concept. The Data Interpreter wasn’t a feature that was supposed to be released; it was my hackathon pet project. It was really exciting to find the right analytical algorithms, and really cool when the code could actually “see” the data in your messy spreadsheet.
Stefan: It’s funny Sasha mentioned the hackathon, because pivot was a hack, too. The data-prep team is pretty fast-paced, and building things quickly is part of the fun. That said, we ran into our fair share of fun problems with my hack. For example, what happens if you try to pivot columns from different tables that are joined? What happens if you change the data source and one of the pivoted columns disappears?
Q: Sasha, for most Tableau users the Data Interpreter looks like a magic tool. Can you say a few words about what goes on under the hood?
Sasha: The Data Interpreter tries to follow the same logic as the human brain. First, it looks for regularity: The data probably sits in some regular grid, so we run a few algorithms to detect rectangles of maximal regularity (of data types, formatting, or even just cell non-emptiness). It’s very similar to an image-processing problem.
These “maximal regularity rectangles” give approximate sub-table locations, and another algorithm classifies the rows into header/data/junk by analyzing row-patterns in data types.
Having the initial table location and the header, the Data Interpreter uses heuristics to fine-tune the boundaries, adjusting them by a few rows or columns in each direction. The heuristics are based on the observed results for ~20,000 real Excel files (selected as “typical” out of the ~half million we collected from the Internet). After running the algorithm and sifting through the results, it became clear what the shortcomings were and how a human user would interpret those cases correctly. In other words, the heuristic fine-tuning rules are just code that follows those human observations.
The combination of analytical and heuristic algorithms proved to be more practical than pure machine learning (which we tried too). And appearing like magic is a good goal for awesome software!
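To make the two stages above concrete, here is a toy sketch in Python. This is not Tableau’s actual implementation; the grid layout, the bounding-rectangle step, and the type-pattern rule for spotting a header row are all simplified assumptions for illustration.

```python
# Toy illustration of the ideas described above -- NOT Tableau's actual code.
# Step 1: find the bounding rectangle of non-empty cells ("regularity").
# Step 2: classify each row as header or data from its data-type pattern.

def find_data_rectangle(grid):
    """Return (top, left, bottom, right) of the non-empty region."""
    cells = [(r, c) for r, row in enumerate(grid)
             for c, v in enumerate(row) if v is not None]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)

def classify_rows(grid, top, left, bottom, right):
    """Label each row in the rectangle 'header' or 'data'."""
    labels = []
    for r in range(top, bottom + 1):
        row = grid[r][left:right + 1]
        all_text = all(isinstance(v, str) for v in row if v is not None)
        has_number = any(isinstance(v, (int, float)) for v in row)
        # A row of pure text sitting in a mostly numeric table looks like a header.
        labels.append('header' if all_text and not has_number else 'data')
    return labels

# A messy sheet: junk padding around a small sub-table.
sheet = [
    [None, None, None],
    [None, 'Region', 'Sales'],
    [None, 'East', 100],
    [None, 'West', 250],
]
top, left, bottom, right = find_data_rectangle(sheet)
print(top, left, bottom, right)  # 1 1 3 2
print(classify_rows(sheet, top, left, bottom, right))
# ['header', 'data', 'data']
```

The real feature layers many more signals (formatting, sub-table detection, boundary fine-tuning) on top of this basic idea.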
Q: Stefan, I struggled to wrap my head around Pivot. Any pointers?
Stefan: It is a tricky concept; here is an example we used a lot to explain it to ourselves during development. Imagine you had a simple Sales table like this (and if you’ve ever needed to use Pivot in Tableau, this should be familiar!):
But what you really want is two columns, Year and Sales. The problem is that the year values are in the headers instead of the rows. This is where the “pivot” comes from—converting header data to row data. In the table below, we’ve put the year data in the rows and added the Sales column.
Now, we fill out the Sales amount column with the data for each year:
And finally, we delete the old rows:
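The transformation described above can be sketched in a few lines of plain Python. The product names, years, and sales figures here are made up purely for illustration; the point is how the year columns in the headers become values in a new Year column.

```python
# A minimal sketch of the pivot described above. The table contents
# are hypothetical -- each row has the years as column headers.
rows = [
    {'Product': 'Widget', '2014': 100, '2015': 120},
    {'Product': 'Gadget', '2014': 80,  '2015': 95},
]
year_columns = ['2014', '2015']

# Pivot: one output row per (input row, year column) pair. The header
# names move into the Year column; their cell values become Sales.
pivoted = [
    {'Product': r['Product'], 'Year': y, 'Sales': r[y]}
    for r in rows
    for y in year_columns
]

for row in pivoted:
    print(row)
# {'Product': 'Widget', 'Year': '2014', 'Sales': 100}
# {'Product': 'Widget', 'Year': '2015', 'Sales': 120}
# {'Product': 'Gadget', 'Year': '2014', 'Sales': 80}
# {'Product': 'Gadget', 'Year': '2015', 'Sales': 95}
```

In other tools this same operation goes by names like “melt” or “unpivot”; Tableau’s pivot does it directly in the data-prep flow.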
Q: We’re all getting excited about the new features in the upcoming Tableau 10. Tell me about what new features will help you with data preparation.
Stefan: Pattern-based union, or “wildcard union,” which is the ability to treat a set of files that match a given name pattern as a single table. Like pivot, it’s something that Tableau users have wanted for a long time. No more copying and pasting Excel sheets together! Another one of my favorites is the Google Sheets connector; I can now connect directly to Google Sheets from Tableau, with all the goodness of the Tableau Data Interpreter and, of course, the ability to connect live or use an extract.
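Conceptually, a wildcard union is just “read every file matching a pattern and stack the rows into one table.” Here is a rough sketch of that idea in Python; the file pattern, CSV format, and the extra source-file column are all assumptions for illustration, not Tableau’s implementation.

```python
# Conceptual sketch of a "wildcard union": stack every CSV whose name
# matches a glob pattern into one list of rows. Not Tableau's code.
import csv
import glob

def wildcard_union(pattern):
    combined = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline='') as f:
            for record in csv.DictReader(f):
                # Record which file each row came from, so the union
                # stays traceable back to its source.
                record['_source_file'] = path
                combined.append(record)
    return combined

# Hypothetical usage, assuming monthly exports named sales_*.csv:
# rows = wildcard_union('sales_*.csv')
```

The files must share a compatible column layout for the stacked rows to make sense, which is the same constraint a manual copy-and-paste union has.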
Sasha: Cross-database joins will let me combine my cleaned spreadsheets with other data, which is pretty exciting!