Summary: This blog details R data.table programming to handle multi-gigabyte data. It shows how the data can be efficiently loaded, “normalized”, and counted. Readers can readily copy and enhance the code below for their own analytic needs. An intermediate level of R coding sophistication is assumed.
In my travels over the holidays, I came across an article in the New York Times on voter purging in Ohio. Much of the research behind the article was driven by data on the voting history of Ohio residents that is readily available to the public.
When I returned home, I downloaded the four large csv files and began to investigate. The data consisted of over 7.7M voter records with in excess of 100 attributes. The “denormalized” structure included roughly 50 person-location variables such as address and ward, plus a “repeating group” of close to 50 variables indicating voter participation in specific election events, each characterized by a concatenated type-date attribute with an accompanying voted-or-not indicator.
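To make that layout concrete, here is a tiny mock-up of the wide structure. This is a sketch only; the column names and values are my illustrative assumptions, not the exact Ohio field names.

```r
library(data.table)

# Hypothetical miniature of the denormalized layout: a few person/location
# attributes followed by a repeating group of election columns whose names
# concatenate election type and date, with "X"/NA marking voted or not.
voters <- data.table(
  sos_voterid          = c("OH0000001", "OH0000002"),
  last_name            = c("SMITH", "JONES"),
  residential_city     = c("CLEVELAND", "COLUMBUS"),
  ward                 = c("03", "12"),
  `GENERAL-11/06/2018` = c("X", NA),
  `PRIMARY-05/08/2018` = c(NA, "X"),
  `GENERAL-11/08/2016` = c("X", "X")
)

str(voters)
```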
My self-directed task for this blog was to load the denormalized data as is, then create auxiliary “melted” data.tables that could readily be queried, i.e. to transform the data from wide to long. The queries of interest revolve around counts/frequencies over the dimensions election type, date, and participation. The text will hopefully elucidate both the power and ease of programming with R’s data.table and tidyverse packages.
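Here is a minimal sketch of that wide-to-long melt and the kind of count query that follows, using the toy voters table above (again, the column names are assumed for illustration rather than taken from the actual files).

```r
library(data.table)

# Columns identifying the person/location; everything else is the
# repeating group of election columns.
idvars    <- c("sos_voterid", "last_name", "residential_city", "ward")
electcols <- setdiff(names(voters), idvars)

# Melt wide to long: one row per (voter, election) pair.
votelong <- melt(
  voters,
  id.vars         = idvars,
  measure.vars    = electcols,
  variable.name   = "election",
  value.name      = "voted",
  variable.factor = FALSE
)

# Split the concatenated type-date attribute into its two dimensions.
votelong[, c("electiontype", "electiondate") := tstrsplit(election, "-", fixed = TRUE)]
votelong[, electiondate := as.Date(electiondate, format = "%m/%d/%Y")]

# Counts/frequencies by election type, date, and participation.
votelong[, .N, by = .(electiontype, electiondate, participated = !is.na(voted))][
  order(electiondate, electiontype)]
```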
The technology used is Wintel 10 along with JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, magrittr, fst, feather, and knitr packages are featured.
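For reference, the notebook setup is little more than attaching the featured packages; the snippet below is a sketch of that setup, and the thread setting is my own assumption rather than anything prescribed by the post.

```r
# Attach the featured packages, suppressing startup messages in JupyterLab.
suppressPackageStartupMessages({
  library(data.table)
  library(tidyverse)
  library(magrittr)
  library(fst)
  library(feather)
  library(knitr)
})

# Let data.table use all available cores for fread/melt/grouping.
setDTthreads(0)
```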
See the entire blog here.