At a conference I attended a few years ago, a data scientist on a round-table panel, asked what she considered the most important mathematical function in her work, replied: “the division operator”. That clever response provided grist for my later answer to a similar question about my favorite statistical procedure: “frequencies and crosstabs”. The commonality is, of course, the simplicity and ubiquity of the functions.
I spend much of my current analytics time on what used to be called exploratory data analysis (EDA) and is now often just data analysis. DA sits between business intelligence and statistical modeling, using comprehensible computations and visualizations to tell data stories. Among the leading “statistics” are simple counts or frequencies, along with their multivariate analogs, crosstabs or contingency tables. For me, though, they’re all just frequencies, whether uni- or multi-attribute.
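To make that concrete, here’s a toy illustration with base R’s table and prop.table; the vectors are made-up stand-ins, not the crime data analyzed below.

```r
# Toy categorical data: a one-way frequency and a two-way crosstab.
region  <- c("North", "South", "North", "West", "South", "North")
channel <- c("web", "web", "store", "web", "store", "store")

table(region)                                   # univariate frequencies
table(region, channel)                          # bivariate crosstab (contingency table)
prop.table(table(region, channel), margin = 1)  # row percentages
```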
Counts and frequencies play a foundational role in statistical analysis. In my early career, I used Poisson regression extensively: “Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables.” In later years, my emphasis has shifted to time series analysis, where count data such as visits, hits, and defections are central.
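For the unfamiliar, fitting a Poisson regression on counts is a one-liner with base R’s glm; the data below are simulated purely to illustrate the mechanics.

```r
# Simulated daily counts driven by a promotion flag and day of week.
set.seed(123)
n     <- 500
promo <- rbinom(n, 1, 0.3)
dow   <- factor(sample(1:7, n, replace = TRUE))
y     <- rpois(n, lambda = exp(0.5 + 0.4 * promo))

# Poisson GLM with the canonical log link.
fit <- glm(y ~ promo + dow, family = poisson(link = "log"))
summary(fit)$coefficients
```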
The analysis that follows is all about frequencies and was done in R using its splendid data.table and tidyverse capabilities. I’ve also done similar computations in Python/pandas and am confident the work could be done just as well with standard SQL. Indeed, I believe most good BI/OLAP tools can handle the demands. I know several Tableau geeks who’d say it’s a piece of cake!
Why another frequencies function in R? After all, there are the table and xtabs functions from base R, count from plyr, and countless others from lesser-known packages. The answer is simple: frequenciesdyn is built on data.table, a very powerful and flexible data management add-on package that performs group computations (e.g. frequencies) faster than the alternatives. It also fits nicely in tidyverse pipelines.
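I won’t reproduce frequenciesdyn itself here, but the kernel of a data.table-based frequencies computation looks roughly like the sketch below, run against simulated stand-ins for the crime records.

```r
library(data.table)

# Simulated stand-in for the crime records.
set.seed(123)
dt <- data.table(
  year = sample(2001:2018, 1e6, replace = TRUE),
  type = sample(c("THEFT", "BATTERY", "NARCOTICS", "ASSAULT"), 1e6, replace = TRUE)
)

# Grouped counts with .N, a percent-of-total column, and a descending sort --
# the essence of a frequencies/crosstab function built on data.table.
freqs <- dt[, .(count = .N), by = .(year, type)][
  , percent := 100 * count / sum(count)][
  order(-count)]

head(freqs)
```

Because a data.table is also a data.frame, results like freqs drop straight into tidyverse pipelines for further munging or ggplot2 visualization.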
A data set on crime in Chicago is used for the analyses. The data, representing all reported crime in Chicago since 2001, are updated daily and posted a week in arrears. Attributes revolve around the what, where, and when of crime events. The file at this point consists of over 6.5M records.
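Pulling the extract into R is a one-liner with data.table’s fread; the file name below is a placeholder for wherever you save the download from the city’s data portal.

```r
library(data.table)

# Placeholder path -- point this at your local copy of the Chicago crime extract.
crime <- fread("chicagocrime.csv", showProgress = FALSE)

dim(crime)    # 6.5M+ rows at the time of writing
names(crime)  # the what, where, and when attributes of each incident
```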
The technologies deployed below are JupyterLab running an R 3.4 kernel. The scripts are driven primarily by the R data.table and tidyverse packages. Hopefully, readers will see just how powerful these tools are in collaboration. Notable is that neither data.table nor tidyverse is part of “core” R; each is an add-on maintained by the energetic R ecosystem.
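For readers who wish to replicate the environment, setup is straightforward: both packages install from CRAN, and the IRkernel package registers R as a Jupyter kernel. Something like the following one-time steps suffices.

```r
# One-time setup: none of these packages ships with base R.
install.packages(c("data.table", "tidyverse", "IRkernel"))

# Register the R kernel so JupyterLab can find it.
IRkernel::installspec()
```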
The remainder of the blog can be found here.