In a blog from three months ago, I wrote about “kind of” big data in R. The “kind of” was meant as a caveat that data size in R is limited by RAM. I also mentioned the potential of working with relational data stores for even larger data, and made a vague proposal to demonstrate R/database interoperability in a future article. Well, the future is now, and here’s an attempt to deliver on that promise.
I have two RDBMSs installed on my notebook. The first is the ubiquitous PostgreSQL, my preferred open source, row-based, relational database. The second is MonetDB, an under-the-radar columnar database.
I love both databases for different reasons. PostgreSQL is my go-to for evaluating interoperability with other analytics products such as R, SAS, Python, Tableau, and Power BI. And I’ve had good success working with MonetDB for some of my more intense analytics needs.
For my PostgreSQL servers, I primarily use the DBI and RPostgreSQL R libraries for access. Indeed, outside the graphical pgAdmin interface, most of my PostgreSQL work is via either RPostgreSQL or the psycopg2 package for Python.
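As a point of reference, a minimal DBI/RPostgreSQL connection looks roughly like the sketch below; the host, database name, and credentials are placeholders for illustration, not values from this project.

```r
library(DBI)
library(RPostgreSQL)

# placeholder connection details -- substitute your own server, database, and credentials
pgconn <- dbConnect(
  PostgreSQL(),
  host     = "localhost",
  dbname   = "analytics",
  user     = "postgres",
  password = Sys.getenv("PGPASSWORD")
)

dbListTables(pgconn)   # quick sanity check that the connection works
```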
MonetDB capabilities for R come in two flavors. The first is the MonetDB.R package, which connects to an installed MonetDB server. For an established database, MonetDB.R allows the R developer to execute SQL commands, push and pull data between MonetDB and R, load CSV data into a database, etc. With these capabilities, a programmer can work around R’s memory limitations, subsetting larger data into R structures as needed. I’ve exercised MonetDB.R quite a bit and am a fan.
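For the server-based flavor, a MonetDB.R session has roughly the shape below; it assumes a MonetDB server is already running locally, and the database name and credentials are placeholders.

```r
library(DBI)
library(MonetDB.R)

# connect to a running MonetDB server -- dbname, user, and password are placeholders
mcon <- dbConnect(
  MonetDB.R(),
  host     = "localhost",
  dbname   = "hmda",
  user     = "monetdb",
  password = "monetdb"
)

# push an R data.frame to MonetDB, then pull a summary back with SQL
dbWriteTable(mcon, "mtcars", mtcars, overwrite = TRUE)
dbGetQuery(mcon, "select cyl, count(*) as n from mtcars group by cyl")
```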
More lightweight and easier to use, though, is MonetDBLite for R, “a fully embedded version of MonetDB that installs like any other R package. The database runs within the R process itself (like its namesake SQLite), greatly improving efficiency of data transfers. The package is available for Linux, Mac OS X and Windows (64 bit).” Thus, with MonetDBLite, a MonetDB server needn’t be present; only the DBI and MonetDBLite R packages have to be installed. An additional benefit of RPostgreSQL, MonetDB.R, and MonetDBLite is support for Hadley Wickham’s powerful dplyr data management package, whose commands can be used in tandem with SQL.
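With MonetDBLite, the setup collapses to a couple of lines; in the sketch below the database directory is an arbitrary example path where the embedded database files will live.

```r
library(DBI)
library(MonetDBLite)

# embedded database -- no server needed; the directory path is an example
mlcon <- dbConnect(MonetDBLite::MonetDBLite(), "/data/monetdblite/hmda")

dbWriteTable(mlcon, "mtcars", mtcars, overwrite = TRUE)
dbGetQuery(mlcon, "select count(*) from mtcars")

# shut down the embedded engine cleanly when done
dbDisconnect(mlcon, shutdown = TRUE)
```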
For testing the capabilities of these databases in analytics, it’s important to see the database serve data that at least starts out too large for memory. There’s no reason to test database analytics capabilities at MB scale — I know full well that such data will work fine in an R data.table or Pandas dataframe, without the support of a DBMS. I want my notebook to be uncomfortable — to have to deliver on data larger than RAM.
For this series of blogs showing database access in R, I’m using the RPostgreSQL package on the PostgreSQL side, and MonetDBLite for MonetDB. dplyr and dbplyr from the tidyverse extend the PostgreSQL/MonetDBLite capabilities.
A 128M+ record, 70-attribute Home Mortgage Disclosure Act (hmda) data set, conveniently available from the MonetDB web site, drives the tests. After downloading and unzipping the six annual “f” files, I was ready to go to work.
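The download/unzip step can be scripted along the lines below; the base URL, file-naming pattern, and year range are all hypothetical placeholders, so check the MonetDB site for the actual hmda file locations before running anything like this.

```r
# sketch of the download/unzip step -- base URL, file names, and years are placeholders
years <- 2007:2012
base  <- "https://example.org/hmda"            # hypothetical base URL

for (yr in years) {
  zipfile <- sprintf("hmda_%s_f.zip", yr)      # hypothetical naming convention
  download.file(file.path(base, zipfile), zipfile, mode = "wb")
  unzip(zipfile)
}
```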
This first blog of the two-part series focuses on loading the raw hmda data into PostgreSQL and MonetDBLite respectively. Each of the six raw data files is first read, munged, and written to an R fst file for fast subsequent access. Once this process is complete, the dbWriteTable functions in RPostgreSQL and MonetDBLite push the data to the respective databases. While the databases’ own bulk loaders would deliver the fastest load times, first converting the csv data to fst format, then using the same load function for both databases, simplifies the work — and is fast enough.
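The load pipeline per file reduces to a few steps, sketched below with data.table and fst; the file and table names are illustrative, the munging step is left as a stub, and the connection objects are the placeholder ones from the earlier sketches.

```r
library(data.table)
library(fst)
library(DBI)

# illustrative file names -- substitute the actual unzipped hmda csv names
csvfiles <- list.files(pattern = "\\.csv$")

for (f in csvfiles) {
  dt <- fread(f)                             # read the raw delimited file
  # ... column munging (names, types) would go here ...
  write_fst(dt, sub("\\.csv$", ".fst", f))   # save an fst copy for fast re-reads
}

# read each fst file back and append it to the database table
for (f in list.files(pattern = "\\.fst$")) {
  dt <- read_fst(f, as.data.table = TRUE)
  dbWriteTable(mlcon, "hmda", dt, append = TRUE)   # the same call works against pgconn
}
```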
The remainder of this blog revolves around connecting to the respective data stores, then loading the data from delimited files, first into fst format, then to both PostgreSQL and MonetDB data stores in turn. A follow-up article will take a deep dive into the data, demonstrating SQL and dplyr/dbplyr “tbl” queries to build subsetted data.frames/data.tables and tabulate attribute frequencies. Important to note is that data must be loaded into the databases just once, where it persists. It can then be accessed repeatedly for subsequent analyses.
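To preview the flavor of those queries, a dplyr/dbplyr “tbl” pipeline against the loaded table looks something like the sketch below; the attribute name used here is a placeholder, since the hmda column layout isn’t shown in this post.

```r
library(dplyr)
library(dbplyr)

hmda <- tbl(mlcon, "hmda")     # lazy reference -- nothing is pulled into R yet

# illustrative frequency tabulation; 'action_taken' is a placeholder attribute name
freqs <- hmda %>%
  count(action_taken) %>%
  arrange(desc(n)) %>%
  collect()                    # only now does the result come back as an R tibble

freqs
```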
Read the entire blog here.