Data Processing Methodology for "Tracking the Energy Titans"

Parsing irregularly formatted CSV files for a visualizaiton with D3JS.

Background

For my first project at Radish Lab I assisted with creating an interactive data-visualization for the Woodrow Wilson Center called Tracking the Energy Titans that compares energy consumption and exports between the U.S., Canada, and China. One of my roles in creating this interactive was to take raw data provided by WWC and convert it to a structure and format that plays nicely with D3JS, a data-visualization javascript library that is being used to draw the charts in the interactive. Rather than manually reformatting the data in a program like Excel, I chose to write a Node JS script which would parse the data and output it to a JSON data format. The following post describes the process of how I accomplished this.

About the Data

To facilitate communication with the client about the data Radish Lab utilized Google Drive’s Sheets App, which works in a similar fashion to Microsoft’s Excel software. This way we could see any updates made to data by the client in real time and give them feedback without having to send files back and forth via email or an FTP. The Google Sheets document contained a total of 8 worksheets: two worksheets for each country representing per-capita and total data, a sheet listing the data sources, and another containing interesting facts relating to the data called “Wow-Factors.”

Here is a sample of the United State’s total country data in tabular format:

    Unit of Measure 2007 2008 2009 2010 2011 2012 2013
Production                  
  Coal Mtce 846 859 778 793 800 744 720
  Gas Mtce 712 745 761 785 843 887 896
  Oil Mtce 474 469 501 517 538 612 692
  Nuclear Mtce 305 303 301 304 298 290 298
  Hydro Mtce 30 31 33 32 39 34 33
  Renewables (Solar Wind) Mtce 4.3 6.9 9.2 11.8 15 17.8 21.7
                   
Imports                  
  Coal Mtce 35 34 21 19 13 9 8
  Gas Mtce 170 147 138 138 128 116 106
  Crude Oil Mtce 789 772 709 725 705 693 613
  Oil Products Mtce 247 225 194 188 180 149 146
  Electricity Mtce 6 7 6 6 6 7 8
                   
Exports                  
  Coal Mtce 56 76 56 77 100 112 105
  Gas Mtce 30 35 39 41 55 59 57
  Crude Oil Mtce 2 2 3 3 4 5 9
  Oil Products Mtce 106 135 149 171 213 229 250
  Electricity Mtce 2 3 2 2 2 1 1
                   
Consumption                  
  Power Sector Mtce 1369 1348 1272 1326 1295 1260 1263
  Transport Mtce 1046 1000 974 990 979 962 971
  Industry Mtce 895 862 789 850 857 866 893
  Residential Mtce 409 418 408 415 409 377 411
  Commercial Mtce 305 312 306 308 309 296 309
                   
Impact                  
  CO2 Emissions Megaton* 5851.51 5667.24 5295.18 5425.19 5304.77 5173.52 N/A
  Water Use Billions of gallons per day 409 404.6 404.6 355 355 355  
  Losses - Power Sector Mtce 878 860 804 839 810 782 784
                   
    *Million Metric Ton              

And another for the United State’s per-capita data:

US per capita   Unit of Measure 2007 2008 2009 2010 2011 2012 2013
Production                  
  Coal tce 2.800 2.825 2.536 2.564 2.568 2.370 2.278
  Gas tce 2.364 2.450 2.481 2.538 2.706 2.826 2.834
  Oil tce 1.574 1.542 1.633 1.671 1.727 1.950 2.189
  Nuclear tce 1.101 0.996 0.981 0.983 0.956 0.924 0.943
  Hydro tce 0.100 0.102 0.108 0.103 0.125 0.108 0.104
  Renewables (Solar Wind) tce 0.014 0.023 0.030 0.038 0.048 0.057 0.069
                   
Imports                  
  Coal tce 0.116 0.112 0.068 0.061 0.042 0.029 0.025
  Gas tce 0.564 0.483 0.450 0.446 0.411 0.370 0.335
  Crude Oil tce 2.619 2.539 2.311 2.344 2.263 2.208 1.939
  Oil Products tce 0.820 0.740 0.632 0.608 0.578 0.475 0.462
  Electricity tce 0.020 0.023 0.020 0.019 0.019 0.022 0.025
                   
Exports                  
  Coal tce 0.186 0.250 0.183 0.249 0.321 0.357 0.332
  Gas tce 0.100 0.115 0.127 0.133 0.177 0.188 0.180
  Crude Oil tce 0.007 0.007 0.010 0.010 0.013 0.016 0.028
  Oil Products tce 0.352 0.444 0.486 0.553 0.684 0.730 0.791
  Electricity tce 0.007 0.010 0.007 0.006 0.006 0.003 0.003
                   
Consumption                  
  Power Sector tce 4.545 4.433 4.146 4.287 4.156 4.014 3.995
  Transport tce 3.472 3.288 3.175 3.201 3.142 3.065 3.072
  Industry tce 2.971 2.835 2.572 2.748 2.750 2.759 2.825
  Residential tce 1.358 1.375 1.330 1.342 1.313 1.201 1.300
  Commercial tce 1.013 1.026 0.997 0.996 0.992 0.943 0.977
                   
Impact                  
  CO2 Emissions tons per capita 19.425 18.636 17.261 17.539 17.025 16.483  
  Water Use gallons per day 1357.761 1330.510 1318.897 1147.655 1139.345    
  Losses - Power Sector tce 2.915 2.828 2.621 2.712 2.600 2.491 2.480
                   
    *Million Metric Ton              

As you can see, the total and per-capita data are structured in such a way that the values are grouped in sub-categories which are then grouped by category. For example for the category Exports there are sub-cateogries for coal, gas, crude oil, oil products, and electricity. This made it difficult to read the raw data into D3JS directly using the d3.csv method. D3 also has a method for parsing CSV data, but it didn’t make sense to do a ton of work parsing 8 CSV files in the browser each time the interactive loads or changes due to a button click.

The output JSON data format I decided on would look something like the following schema:

{
    "country-one" : { // eg: United States
      "category-one" : { // eg: production
          "sub-category-one" : { // eg: coal
              "total" : [  // sample array for total of category-one > sub-category-two
                {
                  "year" : 2007,
                  "val" : 846
                },
                {
                  "year" : 2008,
                  "val" : 859
                },
                {
                  "year" : 2009,
                  "val" : 778
                },
                {
                  "year" : 2010,
                  "val" : 793
                },
                {
                  "year" : 2011,
                  "val" : 800
                },
                {
                  "year" : 2012,
                  "val" : 744
                },
                {
                  "year" : 2013,
                  "val" : 720
                }
              ], 
              "per_capita" : [ // sample array for per-capita of category-one > sub-category-two
                {
                  "year" : 2007,
                  "val" : 2.800
                },
                {
                  "year" : 2008,
                  "val" : 2.825
                },
                {
                  "year" : 2009,
                  "val" : 2.536
                },
                {
                  "year" : 2010,
                  "val" : 2.564
                },
                {
                  "year" : 2011,
                  "val" : 2.568
                },
                {
                  "year" : 2012,
                  "val" : 2.370
                },
                {
                  "year" : 2013,
                  "val" : 2.278
                }
              ], 
              "source" : ["Energy Information Administration"] // the data's source, this differs country to country
            },
          "sub-category-two" : { // eg: gas
              "total" : [...],
              "per_capita" : [...],
              "source" : [...]
            },              
          // additional sub-categories follow...
          },
        "category-two" : { // eg: consumption
          "sub-category-one" : {  // eg: power-sector
              "total" : [...],
              "per_capita" : [...],
              "source" : [...]
            },
          // additional sub-categories ...
        }
        // additional categories ...
      },          
  "country-two" : {...}, // eg: Canada        
  "country-three" : {...} // eg: China
}

As you can see the raw data will be re-structured into arrays of objects, where each object contains the year and data-value for that year. These arrays are contained in an object representing a sub-category, which is contained in an object representing a category, which are then contained in objects for each country. Having the data structured this way would make the logic for toggling between category and subcategory in the interactive’s user interface fairly straight-forward to code and integrate with D3.

Parsing in Preparation for D3JS

To parse the data from it’s original structure to the JSON structure I ended up writing a Node JS script that would take CSV files I downloaded for each worksheet, convert them to multi-dimensional arrays, and then output values from those arrays into objects containing the year and data-value. The key parts of the script are the following functions:

function iterateMultiArray(arr,x,y) {
    // iterates over a multi-dimensional array
    // returns an array of numeric values

    var i = 0, l = arr.length, arrToReturn = [];

    for (i; i<l; i++) {
        if (i===y) {
            for (var j=0; j<arr[i].length; j++) {
                if (j>x) {
                    var tmp = arr[i][j]
                    arrToReturn.push(filterFloat(tmp));
                }
            }               
        }
    }       

    return mapData2Years(arrToReturn);
}

function mapData2Years(arr) {
    // maps each value of an array to its corresponding year
    // preferred data format for d3
    var years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014];
    var toReturn = [];
    for (var i = 0; i < arr.length; i++) {
        var obj = {};
        obj.val = arr[i];
        obj.year = years[i];
    
        // only add the data if it isn't null
        if (obj.val >= 0) {
            toReturn.push(obj);
        }       
    }
    
    return toReturn;
} 
 
// cast data type from string to a number type
// if no data the value will be NaN and output value in JSON will be null
function filterFloat(value) {
    value = value.replace(/ /g,'');
    if(/^(\-|\+)?([0-9]+(\.[0-9]+)?|Infinity)$/
        .test(value))
        return Number(value);
  return NaN;
}

The function iterateMultiArray loops over values in a multi-dimensional array that is a result of using Node’s file system module and a third party fast-csv module to read and parse each CSV file. You can think of mulit-dimensional arrays as lying on an x, y coordinate system that increases from top to bottom and left to right. So the y coordinate will be starting point for the outer array and the x coordinate will be the starting place for the inner array. A nested for-loop then retrieves the value from the inner array. That value is then converted from a string to a number data type and pushed to a temporary array. Finally, when the for-loops finish the temporary array is mapped to an array of years. However, if the data value returned from filterFloat() is null (NaN) then no object is created for that particular year. Finally an array of objects is returned.

So for example, if you’d like the value for “coal production” from either the “per-capita” or “total” worksheets you would do iterateMultiArray(arr,2,2) where arr is the multi-demnsional array representing the corresponding worksheet. Or if you wanted the value for “residential consumption” you would do iterateMultiArray(arr,2,27)

If you’d like to see the entire script, have a look at this Gist (code snippet).

After Thoughts

Following the completion of this project I definitely learned a thing or two about improving the workflow for parsing raw data for interactive data visualizations! The main take away I learned is that whatever tool you will be using to visualize your data should determine how the data is structured. Originally I had the data-values in plain old arrays, not objects inside arrays. This proved problematic when there were null values for certain years and we wanted the graph to start drawing at a year later than 2007 or stop drawing at a year earlier than 2013.

The second take away I learned was that there is an API for the Google Sheets App that allows for worksheets to be accessed programatically. I ended up using the GSpread Python module and writing a Python script to automate the process of downloading the worksheets as CSVs each time the client made updates to the data. This was a heck of a lot easier than manually downloading each of 8 sheets and renaming them!

Scripting the data-parsing proved to be a huge time saver, given that the client made frequent changes to the data during our development process. I can’t imagine having had to do all that work on in Excel! If you have any questions or ideas for improvement about this process, feel free to email me at chris [dot] henrick [at] radishlab [dot] com.

cheers,

- Chris

Dialogue & Discussion