About the Data
To facilitate communication with the client about the data Radish Lab utilized Google Drive’s Sheets App, which works in a similar fashion to Microsoft’s Excel software. This way we could see any updates made to data by the client in real time and give them feedback without having to send files back and forth via email or an FTP. The Google Sheets document contained a total of 8 worksheets: two worksheets for each country representing per-capita and total data, a sheet listing the data sources, and another containing interesting facts relating to the data called “Wow-Factors.”
Here is a sample of the United State’s total country data in tabular format:
|Unit of Measure||2007||2008||2009||2010||2011||2012||2013|
|Renewables (Solar Wind)||Mtce||4.3||6.9||9.2||11.8||15||17.8||21.7|
|Water Use||Billions of gallons per day||409||404.6||404.6||355||355||355|
|Losses - Power Sector||Mtce||878||860||804||839||810||782||784|
|*Million Metric Ton|
And another for the United State’s per-capita data:
|US per capita||Unit of Measure||2007||2008||2009||2010||2011||2012||2013|
|Renewables (Solar Wind)||tce||0.014||0.023||0.030||0.038||0.048||0.057||0.069|
|CO2 Emissions||tons per capita||19.425||18.636||17.261||17.539||17.025||16.483|
|Water Use||gallons per day||1357.761||1330.510||1318.897||1147.655||1139.345|
|Losses - Power Sector||tce||2.915||2.828||2.621||2.712||2.600||2.491||2.480|
|*Million Metric Ton|
As you can see, the total and per-capita data are structured in such a way that the values are grouped in sub-categories which are then grouped by category. For example for the category Exports there are sub-cateogries for coal, gas, crude oil, oil products, and electricity. This made it difficult to read the raw data into D3JS directly using the d3.csv method. D3 also has a method for parsing CSV data, but it didn’t make sense to do a ton of work parsing 8 CSV files in the browser each time the interactive loads or changes due to a button click.
The output JSON data format I decided on would look something like the following schema:
As you can see the raw data will be re-structured into arrays of objects, where each object contains the year and data-value for that year. These arrays are contained in an object representing a sub-category, which is contained in an object representing a category, which are then contained in objects for each country. Having the data structured this way would make the logic for toggling between category and subcategory in the interactive’s user interface fairly straight-forward to code and integrate with D3.
Parsing in Preparation for D3JS
To parse the data from it’s original structure to the JSON structure I ended up writing a Node JS script that would take CSV files I downloaded for each worksheet, convert them to multi-dimensional arrays, and then output values from those arrays into objects containing the year and data-value. The key parts of the script are the following functions:
iterateMultiArray loops over values in a multi-dimensional array that is a result of using Node’s file system module and a third party fast-csv module to read and parse each CSV file. You can think of mulit-dimensional arrays as lying on an x, y coordinate system that increases from top to bottom and left to right. So the y coordinate will be starting point for the outer array and the x coordinate will be the starting place for the inner array. A nested for-loop then retrieves the value from the inner array. That value is then converted from a string to a number data type and pushed to a temporary array. Finally, when the for-loops finish the temporary array is mapped to an array of years. However, if the data value returned from
filterFloat() is null (
NaN) then no object is created for that particular year. Finally an array of objects is returned.
So for example, if you’d like the value for “coal production” from either the “per-capita” or “total” worksheets you would do
arr is the multi-demnsional array representing the corresponding worksheet. Or if you wanted the value for “residential consumption” you would do
If you’d like to see the entire script, have a look at this Gist (code snippet).
Following the completion of this project I definitely learned a thing or two about improving the workflow for parsing raw data for interactive data visualizations! The main take away I learned is that whatever tool you will be using to visualize your data should determine how the data is structured. Originally I had the data-values in plain old arrays, not objects inside arrays. This proved problematic when there were null values for certain years and we wanted the graph to start drawing at a year later than 2007 or stop drawing at a year earlier than 2013.
The second take away I learned was that there is an API for the Google Sheets App that allows for worksheets to be accessed programatically. I ended up using the GSpread Python module and writing a Python script to automate the process of downloading the worksheets as CSVs each time the client made updates to the data. This was a heck of a lot easier than manually downloading each of 8 sheets and renaming them!
Scripting the data-parsing proved to be a huge time saver, given that the client made frequent changes to the data during our development process. I can’t imagine having had to do all that work on in Excel! If you have any questions or ideas for improvement about this process, feel free to email me at chris [dot] henrick [at] radishlab [dot] com.
data node-js d3js