In PostgreSQL, MonetDB, and Too-Big-for-Memory Data in R — Part I, I began to discuss how data that is too big for RAM can be handled in R, a memory-constrained statistical platform. I attempted to demonstrate the potential of relational data stores for larger data, showing the design/code for loading a large data set into PostgreSQL and MonetDB through R interfaces. I ended Part I with the data accessible, promising to look at R/database queries in this follow-up.
From my vantage point, there are two different query challenges for this R/database collaboration. The first is analytic queries, in which counts, sums, means, etc. are calculated for grouped data. The simplest illustration is basic frequencies, where counts are computed for combinations of one or more grouping attributes. Once the basic analytics have been calculated, they are moved to R data.tables for fast/flexible post-processing.
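As a minimal sketch of that first challenge, assuming an already-established DBI connection con to one of the databases discussed below and a hypothetical hmda grouping attribute state_code, a frequency query might look like this, with the aggregation done in the database and only the small result set landing in R:

```r
library(DBI)
library(data.table)

# count records by a (hypothetical) grouping attribute inside the database;
# only the aggregated rows, not the 128M+ detail records, cross into R
freqs <- dbGetQuery(con, "
    select state_code, count(*) as cnt
    from hmda
    group by state_code
    order by cnt desc
")
setDT(freqs)   # convert the returned data.frame to a data.table by reference

# fast/flexible post-processing in R, e.g. percent of total
freqs[, pct := 100 * cnt / sum(cnt)]
head(freqs, 10)
```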
The second challenge is the database “serving” basic data.frame/data.table output for broader analyses/studies in R. Once the data are projected, subsetted, and grouped by the database, they are pushed to R data.tables for subsequent work.
My strategy for the interoperation of databases with R is to use database queries to do the heavy lifting of “where” and “group by”, at which point the (hopefully much smaller) subsetted data is pushed into R for fast in-memory processing.
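A sketch of that division of labor, again assuming the connection con and hypothetical hmda attribute names and filter values: the database resolves the “where” and the column projection, and R receives only the qualifying slice.

```r
library(DBI)
library(data.table)

# the database does the "where" and projection; R gets only the slice
# (attribute names and filter values here are illustrative placeholders)
slice <- dbGetQuery(con, "
    select as_of_year, state_code, loan_amount_000s, action_taken
    from hmda
    where as_of_year = 2016 and state_code = 24
")
setDT(slice)

# subsequent in-memory work on the much smaller data.table
slice[, .(loans = .N,
          median_amount_000s = median(loan_amount_000s, na.rm = TRUE)),
      by = action_taken]
```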
There are multiple ways of executing SQL within an R session. The first uses basic DBI functions such as dbGetQuery to run SQL whose results are immediately returned as an R data.frame. The second is to invoke the dplyr/dbplyr tbl functions to connect R with database objects/SQL. A strength of tbl’s is that execution is “lazy”, i.e. deferred until results are actually needed. Thus tbl’s can function much like database views, actually assembling R data only when executed and collected.
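To make the contrast concrete, here is a hedged sketch of both routes against the hmda table, assuming the DBI connection con and a hypothetical attribute loan_purpose; the tbl version generates SQL lazily and materializes nothing until collect() is called.

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(data.table)

# route 1: raw SQL via DBI, results returned immediately as a data.frame
purpose_sql <- dbGetQuery(con,
    "select loan_purpose, count(*) as cnt from hmda group by loan_purpose")

# route 2: a lazy tbl; no data is pulled yet, much like a database view
hmda_tbl <- tbl(con, "hmda")

purpose_lazy <- hmda_tbl %>%
    group_by(loan_purpose) %>%
    summarise(cnt = n())

show_query(purpose_lazy)                     # inspect the SQL dbplyr generates
purpose_dt <- setDT(collect(purpose_lazy))   # only now are R data assembled
```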
As noted in Part I, I have two RDBMS’s installed on my notebook (actually three with MariaDB). The first is the ubiquitous PostgreSQL, my preferred open source, row-based, relational database management system. The second is MonetDB, an under-the-radar columnar db that so far has served my analytics needs quite well. For PostgreSQL, the DBI and RPostgreSQL libraries provide connectivity from R. On the MonetDB side, I’ve deployed MonetDBLite for R, “a fully embedded version of MonetDB that installs like any other R package. The database runs within the R process itself (like its namesake SQLite), greatly improving efficiency of data transfers. The package is available for Linux, Mac OS X and Windows (64 bit).”
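For reference, connection setup for the two engines might look like the following sketch; the database name, credentials, and MonetDBLite directory are placeholders, and the hmda tables are assumed to have been loaded as in Part I.

```r
library(DBI)

# PostgreSQL: a conventional client/server connection (placeholder credentials)
library(RPostgreSQL)
pgcon <- dbConnect(PostgreSQL(), dbname = "hmda", host = "localhost",
                   user = "ruser", password = "rpassword")

# MonetDBLite: embedded in the R process; the "server" is just a directory
library(MonetDBLite)
mcon <- dbConnect(MonetDBLite::MonetDBLite(), "/data/monetdblite/hmda")

dbListTables(pgcon)   # both should list the persisted hmda table
dbListTables(mcon)
```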
For testing the capabilities of these databases with analytics, it’s important to see the DBMS serve data that at least starts out too large for memory. I know full well that multi-MB-sized data will work fine in an R data.table or Pandas dataframe, without the support of a DBMS. I want my notebook to be uncomfortable — to have to deliver on data larger than RAM.
A 128M+ record, 70-attribute Home Mortgage Disclosure Act (hmda) data set, conveniently available from the MonetDB web site, meets this need. After downloading and unzipping the six annual “f” files, I was ready to go to work. My development platform is JupyterLab with Microsoft R Open 3.4.4.
The MonetDB “test” queries outperform their PostgreSQL counterparts handily, often by an order of magnitude. MonetDB’s columnar design is optimized for the query load of this notebook, while the PostgreSQL deployment, with no indexes, clusters, multi-servers, etc., is a disadvantaged straw man. All queries ran pretty much identically with both databases; I’ll report on all MonetDB queries, but only some of the PostgreSQL ones.
The points of departure for what follows are the established databases in MonetDB and PostgreSQL, each containing a large table named hmda. The notebook demonstrates SQL and dplyr/dbplyr “tbl” queries to build subsetted data.frames/data.tables and also to tabulate attribute frequencies. Importantly, the data has already been loaded into the databases, where it persists.
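As a short sketch of that starting point, assuming the MonetDBLite connection mcon from above and hypothetical attribute names: confirm that the persisted hmda table is available, then build a subsetted data.table through a lazy tbl.

```r
library(DBI)
library(dplyr)
library(data.table)

dbListTables(mcon)                                       # the persisted hmda table
dbGetQuery(mcon, "select count(*) as records from hmda") # sanity-check its size

# a tbl-built subset: filter/select run in the database; collect() pulls
# only the qualifying rows (hypothetical attribute names) into R
recent <- tbl(mcon, "hmda") %>%
    filter(as_of_year >= 2015) %>%
    select(as_of_year, state_code, loan_amount_000s) %>%
    collect() %>%
    setDT()

recent[, .N, by = .(as_of_year, state_code)]
```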
Read the remainder of the blog here.