Chris DeMartini, current Tableau Zen Master, is sharing some script analysis techniques as a contribution to this Iron Viz Silver Screen. If you'd like to read more from Chris, check out the Datablick blog.
It was a happy coincidence when the latest Iron Viz topic came across my phone screen as I looked over the movie based visualization I was working on at that very moment. I had already completed a viz based on a movie script previously, so I was really just looking to see if I could re-use that existing layout and make a few tweaks while substituting data from a different movie.
Ultimately, both vizzes (originally inspired by this pudding post) were built to try and peak my very young daughters interest (ok, maybe bribe is a better word) in the power of data visualization. Thus if you haven’t guessed it yet, the first viz was built based on Disney’s Frozen script and the second based on Disney’s Moana script. Here is the before script and after viz images for both movies.
So how does one go about manipulating these sometimes free text looking scripts into an insightful Tableau visualization? Well, for starters you have to find the script(s) of choice. One potential reference you can leverage for this is IMSDb (referenced source of the aforementioned pudding article, check out their Github as well), however, there are others out there.
Once you have the script(s), you will need to get your hands dirty in your data manipulation tool of choice. I will be showing a few key steps along the way using Alteryx, but these steps can be done using other tools like Excel, SQL, Python, etc. Just anticipate it taking a lot longer than it does with Alteryx :) From here forward I will be using the recent work I just did on the Moana script as an example. Just in case you do not have access to Alteryx, you can download SQL Server Express, then leverage this example script which walks through steps 1, 2 and 3 outlined below.
Step 1: Find the patterns, and hope they are consistent
I noticed on the Moana script that each time a character is referenced there is a colon (“:”) following their name. However, if the character spoke for a long time, their data continued across multiple lines which did not have a character reference. Lines were normally separated by a carriage return or standard punctuation mark, unless it was a song then it was separated by a forward slash (“/”). I had found the pattern and was off and running… until I made it half way through the script and noticed that the pattern completely fell apart. The fix? For this single script effort, manually validate and clean the script data… painful. Note that this was not as much of a problem with the IMSDb script, it was consistent all the way through. Lastly, there may be more than one pattern you need to leverage across the script(s), and, I am sure regular expressions could come in quite handy during this step.
Step 2: Leverage the identified patterns to parse the script into lines and their characters
For our example this is really a question of parsing the data into additional rows based on punctuations and copying down the character names. I previously did this in Excel for the Frozen script, which led me to a lot of convoluted multi-row, “if-then” parsing formulas. In Alteryx, it was two tools (Multi-row Formula and Text to Columns/Rows). In SQL there are some procedures readily available online that can provide the same functionality.
We set the Multi-row formula to copy down the character name from the previous row in cases where that data is not included on the current line. Then we use split to rows in the text to columns tool using common punctuation characters (.?!) and forward slash (/). Our data that once looked liked this…
Now looks like this…
Progress! We can continue to clean this up, remove blank lines, analyze the data to see if we have any oddities caused by multiple punctuations in a row, etc. The result of this is ultimately our lines dataset which contains an ID, character and line text.
Step 3: Go one step further and parse the lines into words
We are going to do another, very similar manipulation step before we bring this into Tableau. We parse the lines of text into each word, using space as the delimiter. In Alteryx, this is again pretty simple, using the same Text to Columns tool again, with “text to rows” selected and space as the delimiter.
The majority of the parsing/prep work you need is now complete and you are ready to bring this data into Tableau to analyze it further…
Step 4: Make a character master file
I found it really useful to make a character master file with some key information (e.g., Gender). You can try to automate this by getting the information from sources like IMDB, however if you are going small scale, you can often do this manually (like I did in this example).
Step 5: Perform additional text analysis to augment the data further
Now that our data is nicely structured and parsed we can extend it further. For example, we can leverage this handy post by Zen Master Brit Cava to include sentiment analysis, account for common stop words, or look into even further text analytics like those provided in Text Mining with R.
I didn’t end up including this in the published version on Public, but here is another view of the Moana script unit chart with corresponding sentiment (via Brit’s python script) below. Tamatoa is a pretty negative crustacean…
Step 6: Build an awesome viz and submit it to Iron Viz!
Pull these pieces together in Tableau and build your viz.
When you are done, post it on Tableau Public and submit it to Iron Viz Silver Screen!
Here is the one I ended up with for the Moana script, trying to answer questions around gender breakdown, when key characters talked throughout the film, who spoke the most and what were some of the recurring items those characters talked about.
And now... Happy Vizzing!