After posting my last blog, I decided to follow up with a two-part series on XGBoost, a versatile, highly performant, interoperable machine learning platform. I’m a big fan of XGBoost and other multi-language ML packages such as H2O, but I generally test them in R environments. So this time I’ve chosen to work in Python.
My plan was to have the first blog revolve around data munging and exploration, with the second focusing on modeling. I picked the newly-released Census 2017 American Community Survey (ACS) Public Use Microdata Samples … data to analyze. I particularly like that PUMS is meaty, consisting of over 3M records and 250+ attributes, while at the same time clean and meaningful. I can’t wait to put the 5-year 2013-2017 data to the test in January when it becomes available.
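To give a flavor of the setup, here is a minimal loading sketch in Pandas. The file names and the handful of attribute columns shown are illustrative, not definitive; substitute the paths and variables from your own PUMS download.

```
import pandas as pd

# Illustrative: the 1-year person-level PUMS ships as two large CSVs;
# the column names below follow the PUMS data dictionary (adjust as needed).
cols = ["PUMA", "ST", "AGEP", "SEX", "SCHL", "MAR", "PINCP"]
pums = pd.concat(
    (pd.read_csv(f, usecols=cols) for f in ("psam_pusa.csv", "psam_pusb.csv")),
    ignore_index=True,
)
print(pums.shape)
```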
I got started with wrangling and exploring the data, using Python/Pandas for the data management heavy lifting. The task I gave myself was to look at income, the target variable, as a function of potential explanatory attributes such as age, sex, race, education, and marital status. Working in a Python Jupyter notebook with R magic enabled, I’m able to interoperate between Python and R, taking advantage of R’s splendid ggplot graphics package. For numeric targets, I make extensive use of R dot plots to visualize relationships with the predictors.
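As a sketch of that workflow — assuming the rpy2 extension is installed and a small summary frame, here called educ_income with columns level and medinc, has already been built in Pandas — two notebook cells might look like this:

```
# Cell 1: enable R magic (requires the rpy2 package)
%load_ext rpy2.ipython

# Cell 2: pass the hypothetical educ_income frame into R and draw a ggplot dot plot
%%R -i educ_income -w 900 -h 500
library(ggplot2)
ggplot(educ_income, aes(x = medinc, y = reorder(level, medinc))) +
  geom_point(size = 3) +
  labs(x = "median income ($)", y = NULL, title = "Median income by education level")
```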
After building my data sets and setting aside a test set for later use, I was left with about 1.5M qualified records for training and exploration. Once I’d reviewed frequencies for each attribute, I looked at breakdowns of income by levels of predictors such as age and education. One particularly interesting factor I examined was the Public Use Microdata Area (PUMA), a geography dimension consisting of 2,378 statistical areas covering the U.S. In contrast to state, PUMA would seem to offer a much more granular geographic grouping.
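As an illustration of the kind of breakdown that flagged PUMA, here is a hedged sketch; the column names PINCP, PUMA, and ST follow the PUMS data dictionary, and the frame name pums is assumed from the loading step above.

```
# Median personal income by PUMA; PUMA codes repeat across states, so group with ST.
puma_income = (
    pums.groupby(["ST", "PUMA"])["PINCP"]
        .median()
        .sort_values()
)
print(puma_income.head(10))   # lowest-income PUMAs
print(puma_income.tail(10))   # highest-income PUMAs
```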
What an understatement! The between-PUMA differences in income are nothing short of stunning. I was so taken aback by what I was seeing that, after triple-checking the data and calculations, I decided to post a Part 0 write-up detailing some of the findings. The remainder of this blog outlines several of the analysis steps, starting from the finalized training data, which will be detailed in Part 1 after the holidays. Part 2 will focus on modeling in XGBoost.
The technology is a Jupyter notebook with a Python kernel and R magic enabled.
Find the remainder of the blog here.