I recently came across an interesting account by a practicing data scientist on how to munge 25 TB of data. What caught my eye at first was the article's title: "Using AWK and R to parse 25tb". I'm a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author's recounting of his painful but steady education in working with that volume of data: "I didn't fail a thousand times, I just discovered a thousand ways not to parse lots of data into an easily query-able format." Been there, done that.
After reading the article, I was intrigued with AWK again after all these years. A Unix-based munging predecessor of perl and python, AWK is particularly adept at working with delimited text files, automatically splitting each record into fields identified as $1, $2, etc. My use of AWK generally revolved around selecting columns (projecting) and rows (filtering) from text files, in turn piping the results to other scripts for additional processing. I found that AWK did these simple tasks very well but didn't scale for more demanding data programming; I remember well that trouble lurked whenever I attempted to contort AWK into doing something it wasn't intended to do. And indeed, I pretty much abandoned AWK when the more comprehensive perl emerged in the late 80s. In retrospect, I'm not sure that was the best course. Optimal might have been to continue using AWK for the simpler projection and filtering work, saving perl (and then python) for more complex tasks.
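To make the projecting/filtering idea concrete, here's a toy sketch. It is not part of the census exercise below; the file people.csv and its name,age,city layout are invented for the example:
# select name ($1) and city ($3) for records where age ($2) is at least 18, skipping the header
gawk -F, 'BEGIN{OFS=","} NR>1 && $2>=18 {print $1,$3}' people.csv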
So I just had to reacquaint myself with AWK, and downloaded the GNU version, gawk. I then divined several quick tasks on a pretty large data source to test the language. The data for analysis consist of 4 large files of census information totaling over 14 GB which, in sum, comprise 15.8M records and 286 attributes. I use AWK to project/filter the input data, then pipe the results to python or R for analytic processing. AWK does some pretty heavy albeit simple processing. In my tests, both R and python/pandas could have handled AWK's tasks as well, but it's not hard to imagine a pipeline that requires this kind of upfront projecting/filtering.
Unlike other blogs I’ve written using Jupyter Notebook, this one does not execute in a python or R kernel; rather the notebook simply displays the AWK, python, and R scripts and their outputs.
dict.awk
gawk -F, 'BEGIN{OFS=","} NR==1 {split($0,dict,",");for (d in dict) print d, dict[d]}' psam_pusa.csv
./dict.awk 2>null | head
1,RT
2,SERIALNO
3,DIVISION
4,SPORDER
5,PUMA
6,REGION
7,ST
8,ADJINC
9,PWGTP
10,AGEP
pums.awk
gawk -F, 'BEGIN{OFS=","} (NR==1 || FNR!=1) && ($10>=18 && $104>1000) \
{print $2,$3,$5,$6,$7,$8,$10,$39,$67,$69,$89,$104,$112}' psam_pusa.csv psam_pusb.csv psam_pusc.csv psam_pusd.csv
pums.py
import sys, pandas as pd, numpy as np, feather
import warnings
warnings.filterwarnings('ignore')
functdir = "c:/steve/jupyter/notebooks/functions"
sys.path.append(functdir)
from newmyfuncs import *
fthname = "pumspy.feather"
# read the AWK-projected/filtered records from stdin
pums = pd.read_csv(sys.stdin,header=0)
pums.columns = [c.lower() for c in pums.columns]
print(pums.shape,"\n")
print(pums.columns,"\n")
print(pums.dtypes,"\n")
# persist the dataframe to feather for downstream analyses
feather.write_dataframe(pums,fthname)
prarr(pums)
time ./pums.awk 2>null | python pums.py
(10925570, 13)
Index(['serialno', 'division', 'puma', 'region', 'st', 'adjinc', 'agep', 'mar',
       'schl', 'sex', 'hisp', 'pincp', 'rac1p'],
      dtype='object')
serialno int64
division int64
puma int64
region int64
st int64
adjinc int64
agep int64
mar int64
schl int64
sex int64
hisp int64
pincp int64
rac1p int64
dtype: object
serialno division puma region st ... schl sex hisp pincp rac1p
0 2013000000154 6 2500 3 1 ... 20 2 1 52000 2
1 2013000000154 6 2500 3 1 ... 16 1 1 99000 2
2 2013000000156 6 1700 3 1 ... 21 2 1 39930 2
3 2013000000160 6 2200 3 1 ... 14 2 1 10300 3
4 2013000000160 6 2200 3 1 ... 16 1 1 1100 3
[5 rows x 13 columns]
serialno division puma region ... sex hisp pincp rac1p
10925565 2017001533765 8 300 4 ... 1 1 29800 1
10925566 2017001534830 8 200 4 ... 2 1 9000 1
10925567 2017001534957 8 300 4 ... 1 24 4000 1
10925568 2017001535265 8 200 4 ... 2 1 5000 9
10925569 2017001536156 8 400 4 ... 1 1 2400 1
[5 rows x 13 columns]
real 1m9.569s
user 1m1.576s
sys 0m2.359s
pums.r
suppressMessages(library(readr))
suppressMessages(library(data.table))
suppressMessages(library(fst))
suppressMessages(library(feather))
wd <- "c:/bigdata/steve/awk"
setwd(wd)
fstnme <- "pumsfst.fst"
fthnme <- "pumsfth.feather"
# read the AWK-projected/filtered records from stdin
f <- file("stdin")
open(f)
pums <- data.table(read_csv(readLines(f,n=-1)))
setnames(pums,tolower(names(pums)))
print(dim(pums))
print(str(pums))
head(pums)
tail(pums)
# persist to fst and feather for downstream analyses
write_fst(pums,fstnme)
write_feather(pums, fthnme)
time ./pums.awk 2>null | rscript pums.r
[1] 10925570       13
Classes 'data.table' and 'data.frame':  10925570 obs. of  13 variables:
 $ serialno: num  2.01e+12 2.01e+12 2.01e+12 2.01e+12 2.01e+12 ...
 $ division: num  6 6 6 6 6 6 6 6 6 6 ...
 $ puma    : chr  "02500" "02500" "01700" "02200" ...
 $ region  : num  3 3 3 3 3 3 3 3 3 3 ...
 $ st      : chr  "01" "01" "01" "01" ...
 $ adjinc  : num  1061971 1061971 1061971 1061971 1061971 ...
 $ agep    : num  55 56 63 61 20 78 81 59 56 70 ...
 $ mar     : num  1 1 3 4 5 2 2 1 1 1 ...
 $ schl    : chr  "20" "16" "21" "14" ...
 $ sex     : num  2 1 2 2 1 2 2 1 2 1 ...
 $ hisp    : chr  "01" "01" "01" "01" ...
 $ pincp   : chr  "000052000" "000099000" "000039930" "000010300" ...
 $ rac1p   : num  2 2 2 3 3 2 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
       serialno division  puma region st  adjinc agep mar schl sex hisp     pincp rac1p
1:    2.013e+12        6 02500      3 01 1061971   55   1   20   2   01 000052000     2
2:    2.013e+12        6 02500      3 01 1061971   56   1   16   1   01 000099000     2
3:    2.013e+12        6 01700      3 01 1061971   63   3   21   2   01 000039930     2
4:    2.013e+12        6 02200      3 01 1061971   61   4   14   2   01 000010300     3
5:    2.013e+12        6 02200      3 01 1061971   20   5   16   1   01 000001100     3
6:    2.013e+12        6 02400      3 01 1061971   78   2   01   2   01 000003900     2
       serialno division  puma region st  adjinc agep mar schl sex hisp     pincp rac1p
1: 2.017002e+12        8 00500      4 56 1011189   51   3   20   1   01 000035300     1
2: 2.017002e+12        8 00300      4 56 1011189   19   5   18   1   01 000029800     1
3: 2.017002e+12        8 00200      4 56 1011189   20   5   19   2   01 000009000     1
4: 2.017002e+12        8 00300      4 56 1011189   18   5   16   1   24 000004000     1
5: 2.017002e+12        8 00200      4 56 1011189   31   5   18   2   01 000005000     9
6: 2.017002e+12        8 00400      4 56 1011189   60   3   19   1   01 000002400     1
real 1m53.440s
user 1m0.655s
sys 0m2.779s
Finally, a more general version of pums.awk builds a dictionary of attribute names from the header record, so fields are referenced by name rather than by hard-coded column position.
pums.awk
gawk -F, 'BEGIN{OFS=","} NR==1 {split($0,dict,",");for (d in dict) newdict[dict[d]] = d} \
(NR==1 || FNR!=1) && ($newdict["AGEP"]>=18 && $newdict["PINCP"]>1000) \
{print $newdict["SERIALNO"],$newdict["DIVISION"],$newdict["REGION"],$newdict["ST"],$newdict["PUMA"],$newdict["AGEP"],$newdict["HICOV"],\
$newdict["MAR"],$newdict["RAC1P"],$newdict["HISP"],$newdict["SEX"],$newdict["SCHL"],$newdict["PINCP"],$newdict["ADJINC"]}' psam_pusa.csv psam_pusb.csv psam_pusc.csv psam_pusd.csv
Fun. My take is that dusty AWK can indeed serve an important role in 2019 data pipelines, provided it's used simply and to its projecting/filtering strengths. That's it for now. Off to Oktoberfest. Happy wrangling!
The technology used for this exercise is Windows 10, JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, R 3.6.0, Cygwin 3.0.7, and GNU Awk (gawk) 5.0.1. All gawk, python, and R scripts are simply components of pipelines run from bash shell command lines in Cygwin terminal windows.
The original post can be viewed here.