This article was written by Will Koehrsen.
Reading through a data science book or taking a course, it can feel like you have the individual pieces, but don’t quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but preserving and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.
We’ll follow the general machine learning workflow step-by-step:
- Data cleaning and formatting
- Exploratory data analysis
- Feature engineering and selection
- Compare several machine learning models on a performance metric
- Perform hyperparameter tuning on the best model
- Evaluate the best model on the testing set
- Interpret the model results
- Draw conclusions and document work
Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python. The complete project is available on GitHub, with the first notebook here. This first article will cover steps 1–3 with the rest addressed in subsequent posts.
(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees. I guess that’s how things go on the start-up scene!)
Problem Definition
The first step before we get coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City.
The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.
The data includes the Energy Star Score, which makes this a supervised regression machine learning task:
- Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
- Regression: The Energy Star score is a continuous variable
We want to develop a model that is both accurate — it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.
Data Cleaning
Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets). Real-world data is messy which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an un-glamorous, but necessary part of most actual data science problems.
To read the rest of the article, click here.