Kaggle is an Airbnb for data scientists: it is where they spend their nights and weekends. It is a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science and predictive analytics problems through machine learning. It has over 536,000 active members from 194 countries and receives close to 150,000 submissions per month. Founded in Melbourne, Australia, Kaggle moved to Silicon Valley in 2011, raised some $11 million from the likes of Hal Varian (Chief Economist at Google), Max Levchin (PayPal), Index Ventures and Khosla Ventures, and was ultimately acquired by Google in March 2017. Kaggle is the number one stop for data science enthusiasts all around the world, who compete for prizes and boost their Kaggle rankings. There are only 94 Kaggle Grandmasters in the world to date.
Zillow (think zillions of pillows) maintains the largest digital inventory and valuation of American homes in the world. The American housing stock is worth more than $27.5 trillion. Zillow covers 110 million homes throughout the United States, with 103 variables per house. With 73 million unique visitors per month, 20 TB of data and 1.2 million statistical and machine learning models that run every night to produce the next Zestimates, it is undoubtedly the best machine learning case study for real estate under the sun. Zillow has reduced its median margin of error when predicting the sale value of a house (the Zestimate) from 14% to about 5%, but there is still a long way to go given unpredictable outliers and unforeseen financial recessions. Zillow has been criticized lately for this margin of error and the trouble it can cause a prospective seller or buyer; it can amount to a price error of up to US$14,000 on a typical house.
Zillow launched the Zillow Prize competition on Kaggle on May 24, 2017. It has two phases and will run for eight months. While a million dollars seems like a big prize, it is roughly the cost of employing 10 data science engineers in Silicon Valley for eight months at $100,000 apiece. To date, some 2,900 teams from all around the world are competing for the prize; with a typical size of three members per team, that is roughly 8,700 individuals, or about $114 per engineer, which works out to around $14 per month, or $1.7 per hour, per data scientist. This is the beauty and power of crowdsourcing and Kaggle.
To get started with the competition, please go through Wendy Kan's welcome message and Andrew Martin's FAQ and Ask-Me-Anything threads. Andrew is the head of research at Zillow, and his blog post (Train, Score and Repeat) contains a few hints on how to rank up in the competition. There is also a fireside-chat-style video about Zillow's innovation and the competition featuring the company's founder and chief economist.
Create your account on Kaggle, join the competition and accept the rules. To submit your first kernel, you can fork my public kernel (how to compete for Zillow prize – first kernel) and run it. Once you get the results, submit the file to Zillow. Welcome to the Zillow Prize challenge. I don't know who will win, but I am sure it will be through a combination of feature selection, ensembling and external datasets.
To make the first submission super easy for you, here are all the steps and the complete source code.
Step 1: Create your account on Kaggle and join the competition
Step 2: Go to Kernels tab and click on New Kernel
Step 3: Select Notebook
Step 4: Copy-paste the following code (XGBoost) into the code window
### Importing Libraries or Packages that are needed throughout the Program ###
import numpy as np
import pandas as pd
import xgboost as xgb
import random
import datetime as dt
import gc
import seaborn as sns #python visualization library
color = sns.color_palette()
#%matplotlib inline
np.random.seed(1)
###Load the Datasets ###
# We need to load the datasets that will be needed to train our machine learning algorithms, handle our data and make predictions. Note that these datasets are the ones that are already provided once you enter the competition by accepting terms and conditions #
train = pd.read_csv('../input/train_2016_v2.csv', parse_dates=["transactiondate"])
properties = pd.read_csv('../input/properties_2016.csv')
test = pd.read_csv('../input/sample_submission.csv')
test = test.rename(columns={'ParcelId': 'parcelid'}) #To make it easier for merging datasets on same column_id later
### Analyse the Dimensions of our Datasets.
print("Training Size:" + str(train.shape))
print("Property Size:" + str(properties.shape))
print("Sample Size:" + str(test.shape))
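### Optional: a quick look at the target variable (my own addition, a minimal sketch) ###
# The target we will predict is logerror = log(Zestimate) - log(SalePrice). Plotting its
# distribution is a cheap sanity check and puts the seaborn import above to use. Assumes
# matplotlib is available (it is preinstalled on Kaggle kernels); the tails are clipped
# purely for readability of the plot.
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
sns.distplot(train['logerror'].clip(-0.5, 0.5), bins=50, color=color[0])
plt.xlabel('logerror (clipped to [-0.5, 0.5])')
plt.show()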
### Type Converting the DataSet ###
# Some algorithms run faster and use less memory if the data is represented as int32/float32 instead of int64/float64. To make sure all of our columns are downcast accordingly, we run the following lines of code #
for c, dtype in zip(properties.columns, properties.dtypes):
    if dtype == np.float64:
        properties[c] = properties[c].astype(np.float32)
    if dtype == np.int64:
        properties[c] = properties[c].astype(np.int32)

for column in test.columns:
    if test[column].dtype == int:
        test[column] = test[column].astype(np.int32)
    if test[column].dtype == float:
        test[column] = test[column].astype(np.float32)
### Let’s do some feature engineering
#living area proportions
properties['living_area_prop'] = properties['calculatedfinishedsquarefeet'] / properties['lotsizesquarefeet']
#tax value ratio
properties['value_ratio'] = properties['taxvaluedollarcnt'] / properties['taxamount']
#tax value proportions
properties['value_prop'] = properties['structuretaxvaluedollarcnt'] / properties['landtaxvaluedollarcnt']
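# Note (my own addition, not in the original kernel): the three ratio features above can
# produce +/-inf when a denominator is zero (e.g. lotsizesquarefeet == 0). Convert those to
# NaN so that the fillna(0) step further below treats them like any other missing value.
properties.replace([np.inf, -np.inf], np.nan, inplace=True)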
###Merging the Datasets ###
# We are merging the properties dataset with training and testing dataset for model building and testing prediction #
df_train = train.merge(properties, how='left', on='parcelid')
df_test = test.merge(properties, how='left', on='parcelid')
### Remove previous variables to free up some memory
del properties, train
gc.collect();
print('Memory usage reduction...')
df_train[['latitude', 'longitude']] /= 1e6
df_test[['latitude', 'longitude']] /= 1e6
df_train['censustractandblock'] /= 1e12
df_test['censustractandblock'] /= 1e12
### Let's do some pre-exploratory analysis to identify how many missing values we have in our datasets.
### Thanks to Nikunj-Carefully dealing with missing values. Ref. https://www.kaggle.com/nikunjm88/carefully-dealing-with-missing-values
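# A minimal sketch of that missing-value check (my own addition, in the spirit of the kernel
# referenced above): compute the fraction of nulls per column in the merged training set and
# list the worst offenders before we fill them with zeros below.
missing_ratio = df_train.isnull().sum().sort_values(ascending=False) / len(df_train)
print(missing_ratio.head(20))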
### Label Encoding For Machine Learning & Filling Missing Values ###
# We are now label encoding our datasets. All of the machine learning algorithms employed in scikit learn assume that the data being fed to them is in numerical form. LabelEncoding ensures that all of our categorical variables are in numerical representation. Also note that we are filling the missing values in our dataset with a zero before label encoding them. This is to ensure that label encoder function does not experience any problems while carrying out its operation #
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
for c in df_train.columns:
    df_train[c] = df_train[c].fillna(0)
    if df_train[c].dtype == 'object':
        lbl.fit(list(df_train[c].values))
        df_train[c] = lbl.transform(list(df_train[c].values))

for c in df_test.columns:
    df_test[c] = df_test[c].fillna(0)
    if df_test[c].dtype == 'object':
        lbl.fit(list(df_test[c].values))
        df_test[c] = lbl.transform(list(df_test[c].values))
### Rearranging the DataSets ###
# We will now drop the features that serve no useful purpose and separate the predictors from the target feature, so it is clear which features are used to predict the outcome. Make sure the test set contains the same features as the training set #
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc',
                         'propertycountylandusecode'], axis=1)
x_test = df_test.drop(['parcelid', 'propertyzoningdesc',
                       'propertycountylandusecode', '201610', '201611',
                       '201612', '201710', '201711', '201712'], axis=1)
x_train = x_train.values
y_train = df_train['logerror'].values
### Cross Validation ###
# We are dividing our dataset into training and validation sets so that we can monitor and test the progress of our machine learning algorithm. This lets us know when our model might be over- or under-fitting on the dataset we have employed. #
from sklearn.model_selection import train_test_split
X = x_train
y = y_train
Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=0.2, random_state=42)
###Implement the Xgboost###
# We can now select the parameters for Xgboost and monitor the progress of results on our validation set. The explanation of the xgboost parameters and what they do can be found on the following link http://xgboost.readthedocs.io/en/latest/parameter.html #
dtrain = xgb.DMatrix(Xtrain, label=ytrain)
dvalid = xgb.DMatrix(Xvalid, label=yvalid)
dtest = xgb.DMatrix(x_test.values)
# Try different parameters!
xgb_params = {'min_child_weight': 5, 'eta': 0.035, 'colsample_bytree': 0.5, 'max_depth': 4,
              'subsample': 0.85, 'lambda': 0.8, 'nthread': -1, 'booster': 'gbtree', 'silent': 1, 'gamma': 0,
              'eval_metric': 'mae', 'objective': 'reg:linear'}
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
model_xgb = xgb.train(xgb_params, dtrain, 1000, watchlist, early_stopping_rounds=100,
maximize=False, verbose_eval=10)
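# Optional (my own addition): with early_stopping_rounds set, xgboost records the best round
# on the validation set as best_iteration/best_score. Printing them shows whether training
# stopped early and what validation MAE the kept model achieved.
print('Best iteration: {}, validation MAE: {}'.format(model_xgb.best_iteration, model_xgb.best_score))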
###Predicting the results###
# Let us now predict the target variable for our test dataset. All we have to do is apply the already trained model to the test set that we built by merging the sample submission file with the properties dataset #
Predicted_test_xgb = model_xgb.predict(dtest)
### Submitting the Results ###
# Once again load the file and start submitting the results in each column #
sample_file = pd.read_csv('../input/sample_submission.csv')
for c in sample_file.columns[sample_file.columns != 'ParcelId']:
    sample_file[c] = Predicted_test_xgb
print('Preparing the csv file...')
sample_file.to_csv('xgb_predicted_results.csv', index=False, float_format='%.4f')
print("Finished writing the file")
Step 5: Publish the kernel. Kaggle's servers will take some time to render it. Once it is done, the submission file will be available in the Output tab of the kernel. Download it and submit it to the competition.
Once you are done with this, congratulate yourself on submitting your first kernel and getting started on becoming a proud Kaggler. Now it is time for some heavy lifting and considerations to improve your models. Here is a wonderful SlideShare presentation on data science at Zillow that discusses the Zestimate model in detail. This Appraisal Journal article should be your next stop for an external view of the problem. Here is Zillow's own working of Zestimates, and this is an awesome introductory course on predicting house values by Lynda. Follow the links to learn about Zillow's Home Value Index and home value forecast methodology.
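As a first step toward the ensembling mentioned earlier, a simple and popular Kaggle technique is to blend two or more submission files with a weighted average. The snippet below is a minimal sketch under the assumption that you have produced a second submission file from another model (lgb_predicted_results.csv is a hypothetical name, e.g. from a LightGBM kernel); the 60/40 weights are arbitrary and should be tuned against a validation set or the public leaderboard.
import pandas as pd
# Hypothetical file names: xgb_predicted_results.csv comes from the kernel above,
# lgb_predicted_results.csv would come from a second model you train yourself.
xgb_sub = pd.read_csv('xgb_predicted_results.csv')
lgb_sub = pd.read_csv('lgb_predicted_results.csv')
blend = xgb_sub.copy()
for c in blend.columns[blend.columns != 'ParcelId']:
    # weighted average of the two predictions; tune the weights yourself
    blend[c] = 0.6 * xgb_sub[c] + 0.4 * lgb_sub[c]
blend.to_csv('blended_results.csv', index=False, float_format='%.4f')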
Gearing up for the last month of the competition and going into phase two, I would highly recommend reading the book Zillow Talk – The New Rules of Real Estate, written by Zillow's founder and chief economist.
American housing has changed widely: in 1950, the average residential square footage per capita was 300; by 2000 it had increased to 900. Interestingly, office square footage per capita fell from 600 in 1970 to 100 in 2000. American homes are not only bigger; they also increasingly house an office (telecommuting), a gym and a game room (among the top choices for buyers when looking for a house).
We need to remember that buying a house is not just statistics; it is a symbol of progress in one's life, so it is as much about sentiment as anything else, and we need to find a way to mine that sentiment to predict the right value for a house.
Zillow Talk gives more clues than you can handle in your models. For example: houses in the neighborhood adjacent to the city center appreciate more than those in any other neighborhood in the city (I am thinking of Kim Rossmo's equation here); houses within a quarter mile of a Starbucks sell for a premium of as much as US$37,000 on average compared with houses far from the coffee shop; GreatSchools rankings contribute to and influence house prices; remodeling a bathroom is a positive indicator for sale price, whereas remodeling a basement returns only about half of the money invested as an addition to the sale price; and names matter, too: houses on a named street sell for 2% more on average than those on a numbered street, and "Lake" and "Sunset" are more valuable street names than "Main Street" or "Jefferson". Below are a few charts from the book to reinforce the value of external datasets.
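To give a concrete flavour of how such external signals could enter a model, here is a minimal sketch of a distance-to-the-nearest-coffee-shop feature. It assumes you have found an external CSV of shop locations (starbucks_locations.csv is a hypothetical file with latitude and longitude columns in degrees); the raw Zillow latitude/longitude fields are degrees multiplied by 1e6, so they are rescaled first. The brute-force nearest-neighbour search is fine for an illustration, but you would want something like a KD-tree for all 2.9 million parcels.
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles; works with scalars or numpy arrays
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * np.arcsin(np.sqrt(a))

properties = pd.read_csv('../input/properties_2016.csv')
properties[['latitude', 'longitude']] /= 1e6  # convert to degrees

shops = pd.read_csv('starbucks_locations.csv')  # hypothetical external dataset

def dist_to_nearest_shop(lat, lon):
    return haversine_miles(lat, lon, shops['latitude'].values, shops['longitude'].values).min()

properties['dist_to_coffee_shop'] = [dist_to_nearest_shop(lat, lon)
                                     for lat, lon in zip(properties['latitude'], properties['longitude'])]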
I hope you have enjoyed the blog. Please stay tuned for more updates and better code, and do upvote if it was helpful.