Both R and Python-Pandas are array-oriented platforms that support fast filtering through vectors of record-id’s. In Python-Pandas, such vectors are implemented via Pandas’s powerful index construct; in R-data.table, they’re accessible through the “which” and “row.name” functions. In both instances, joins to record-id vectors generate fast subsetted access.
How is the record-id vector approach helpful? For starters, the analyst can encapsulate common subsetting conditions once and use many times. And second, working with such filtering vectors is simpler than maintaining sets of subsetted dataframes/data.tables.
This notebook illustrates the record-id approach with both Python-Pandas and R-data.table. The data exercised is the 7M+ row 2001-present Chicago crime file. Record-id indexes are derived from an attribute that details type of crime category, including homicide, violent crime, property crime, and index crime. For each of these, a vector of pertinent record id’s is assembled, first in Python-Pandas, then in R-data.table.
As an example of how the record-id approach works in R with my “chicagocrime” data.table, I compute indexes of record-ids for homicide, violent crime, property crime, and index crime using values of the “fbicode” attribute, which depicts crime type. Homicide is denoted by fbicode “01A”, so the statements hcde <- c(“01A”) and hidx <- chicagocrime[fbicode %in% hcde,which=TRUE] generate the homicide record-id index, hidx. The “join” chicagocrime[hidx] then quickly produces all homicide records through those record-ids. Further, chicagocrime[hidx] is just another data.table for subsequent access. Simple and clean.
The code below details the record-id approach first for Python-Pandas and then for R-data.table. I run the Python code with the Python kernel, then switch to R.
The technology stack is Windows 10 with JupyterLab 0.35.4, Python 3.7.3, Pandas 0.24.2 and R 3.6.0, along with data.table 1.12.2.
Read the entire blog here.