I had an interesting discussion with one of my son’s friends at a neighborhood gathering over the holidays. He’s just reached the halfway point of a Chicago-area Masters in Analytics program and wanted to pick my brain on the state of the discipline.
Of the four major program foci (business, data, computation, and algorithms), he acknowledged he liked computation best, with Python leading R and SAS for his attention. I was impressed with his understanding of Python, especially given that he’d had no programming experience outside Excel before starting the curriculum.
After a while, we got to chatting about NumPy and Pandas, the workhorses of Python data programming. My son’s friend was using Pandas a bit now, but hadn’t been exposed to NumPy per se. And while he noted the productivity benefits of working with such libraries, I don’t think he quite appreciated how much relief they provide for day-to-day data programming challenges. He seemed smitten with the power he’d discovered with core Python. Actually, he sounded a lot like me when I was first exposed to Pandas almost 10 years ago, and like me when I first saw R as a SAS programmer back in 2000. As our conversation progressed, I just smiled, fully confident his admiration would grow over time.
Our discussion did whet my appetite to revisit the vanilla Python data programming I’d done in the past. So I just had to dig up some code I’d written “BP”, before Pandas. Following a pretty exhaustive search, I found scripts from 2010. The topic was stock market performance. The work stream entailed wrangling CSV files from the investment benchmark company FTSE Russell’s website pertaining to the performance of its many market indexes. Just about all the work was completed using core Python libraries and simple data structures: lists and dictionaries.
As I modernized the old code a bit, my appreciation for Pandas/NumPy did nothing but grow. Vanilla Python demands much more explicit looping code. And, alas, lists aren’t dataframes. On the other hand, with Pandas: Array orientation/functions? Check. Routinized missing data handling? Check. Tables/dataframes as core data structures? Check. Handling complex input files? Check. Powerful query capability? Check. Easy updating/variable creation? Check. Joins and group by? Check. Et al.? Check.
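To give a flavor of that array orientation, here’s a minimal sketch of the daily percent-change calculation done the Pandas way. The dates and level values are made up for illustration; the real Russell data is introduced below.

```python
import pandas as pd

# Hypothetical index levels -- illustrative values only
df = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-03", "2005-01-04", "2005-01-05"]),
    "level": [680.00, 673.20, 676.60],
})

# One vectorized expression replaces an explicit loop over adjacent rows;
# the first row's change is automatically missing (NaN)
df["pct_change"] = df["level"].pct_change() * 100
```

In vanilla Python, the same computation means looping over index positions and special-casing the first row by hand.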
For the analysis that follows, I focus on the performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for “measuring the market”. I first download two files, a year-to-date and a history, that together provide final daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.
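A load step in the vanilla style might look like the sketch below. The column names and sample values are assumptions for illustration; the real Russell files have their own layout, and the point is how much bookkeeping (type conversion, missing-value checks) falls on the programmer without Pandas.

```python
import csv
import io

def load_levels(text):
    """Parse index-level CSV text into a list of
    [date, index_name, level_no_div, level_with_div] rows.
    Column names here are assumptions, not the real Russell layout."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        # Missing-data handling is entirely by hand in vanilla Python
        if not rec["value_without_dividends"] or not rec["value_with_dividends"]:
            continue
        rows.append([rec["date"], rec["index_name"],
                     float(rec["value_without_dividends"]),
                     float(rec["value_with_dividends"])])
    return rows

# Made-up sample standing in for a downloaded file
sample = """date,index_name,value_without_dividends,value_with_dividends
2005-01-03,Russell 3000,680.00,2100.00
2005-01-04,Russell 3000,673.20,2079.50
"""
levels = load_levels(sample)
```

With Pandas, this entire function collapses to roughly one `read_csv` call with `dtype` and `na_values` arguments.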
Once the file data are in memory, they’re munged and combined into a single Python list of lists. The combined data are then sorted by date, at which point duplicates are deleted. After that, I compute daily percent change variables from the index levels, ultimately producing index performance statistics. At the end I write the list to a CSV file.
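The steps above can be sketched in vanilla Python as follows. The two small input lists stand in for the downloaded history and year-to-date files (the values are invented, and each row is simplified to date plus the two levels); everything else shows the hand-rolled combine, sort, dedup, percent-change, and write logic that Pandas would handle with `concat`, `sort_values`, `drop_duplicates`, `pct_change`, and `to_csv`.

```python
import csv

# Hypothetical downloads with one overlapping date -- values are illustrative;
# each row is [date, level_no_div, level_with_div]
history = [["2005-01-03", 680.00, 2100.00], ["2005-01-04", 673.20, 2079.50]]
ytd     = [["2005-01-04", 673.20, 2079.50], ["2005-01-05", 676.60, 2090.10]]

# Combine and sort by date, then drop duplicate dates by hand
combined = sorted(history + ytd, key=lambda r: r[0])
deduped = []
for row in combined:
    if not deduped or row[0] != deduped[-1][0]:
        deduped.append(row)

# Daily percent change requires an explicit loop over adjacent rows,
# with the first row special-cased as missing
for i, row in enumerate(deduped):
    row.append(None if i == 0 else 100 * (row[1] / deduped[i - 1][1] - 1))

# Write the final list to CSV
with open("r3000_sketch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "level_no_div", "level_with_div", "pct_change"])
    writer.writerows(deduped)
```

Each of those loops is a one-liner in Pandas, which is precisely the point of the comparison in the next two posts.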
My take? Even though the code here is no more than intermediate-level, data programming in Python without Pandas seems antediluvian now. The array orientations of both Pandas and NumPy make this work so much simpler than the looping idioms of vanilla Python. Indeed, even though I programmed with Fortran, PL/I, and C in the past, I’ve become quite lazy in recent years.
This is the first of three blogs pretty much doing the same tasks with the Russell 3000 data. The second uses NumPy, and the final, Pandas.
The technology used for the three articles revolves around JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.
Read the remainder of the blog here.