This is Part 1 of a three-part study on predicting hotel cancellations with machine learning. Originally posted by Michael Grogan.
Logistic Regression and SVM
Hotel cancellations can cause issues for many businesses in the industry. Beyond the revenue lost when a customer cancels, cancellations also make it harder to coordinate bookings and adjust revenue management practices.
Data analytics can help address this issue by identifying the customers who are most likely to cancel, allowing a hotel chain to adjust its marketing strategy accordingly.
To investigate how machine learning can aid in this task, the ExtraTreesClassifier, logistic regression, and support vector machine models were employed in Python to determine whether cancellations can be accurately predicted. Both hotels in this example are based in Portugal. The Algarve Hotel dataset, available from ScienceDirect, was used to train and validate the models, and the logistic regression was then used to generate predictions on a second dataset for a hotel in Lisbon.
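As a rough illustration of how these three models can fit together, here is a minimal sketch using scikit-learn. It runs on synthetic data rather than the hotel files, and it assumes the ExtraTreesClassifier is used to rank features before the two classifiers are fitted, which is not stated above. The feature count, the cutoff of four features, and the linear SVM kernel are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the booking features and the cancellation label
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Rank features by importance and keep the top few (cutoff of 4 is arbitrary)
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
top = np.argsort(forest.feature_importances_)[::-1][:4]

# Fit both classifiers on the selected features and score them on held-out data
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
}
scores = {}
for name, model in models.items():
    model.fit(X_train[:, top], y_train)
    scores[name] = model.score(X_val[:, top], y_val)
    print(f"{name}: validation accuracy {scores[name]:.3f}")
```

With the real data, the validation split would come from the Algarve dataset and the fitted logistic regression would then be applied to the Lisbon observations.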
Data Processing
At the outset, there is the risk of biasing the model toward the majority class when building it with the data.
For example, in the original H1 file, there were 11,122 cancellations while 28,938 bookings did not cancel. Left as is, non-cancellations would be overrepresented in the training data. For this reason, the H1 dataset was filtered to include 10,000 cancellations and 10,000 non-cancellations.
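One way to produce such a balanced subset is to draw the same number of rows from each class with pandas. The sketch below is an assumption about how the filtering might be done, since the text does not say whether the 10,000 rows per class were chosen at random. The column names IsCanceled and LeadTime and the helper balance_classes are illustrative, and the frame is synthetic, built only to match the class counts quoted above.

```python
import numpy as np
import pandas as pd

def balance_classes(df, label_col, n_per_class, seed=0):
    """Return n_per_class randomly chosen rows for each value of label_col."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the two classes are not stored in contiguous blocks
    combined = pd.concat(parts).sample(frac=1, random_state=seed)
    return combined.reset_index(drop=True)

# Synthetic stand-in for H1.csv: 11,122 cancellations, 28,938 non-cancellations
rng = np.random.default_rng(0)
h1 = pd.DataFrame({
    "IsCanceled": np.r_[np.ones(11122, dtype=int), np.zeros(28938, dtype=int)],
    "LeadTime": rng.integers(0, 400, size=11122 + 28938),
})

balanced = balance_classes(h1, "IsCanceled", n_per_class=10_000)
print(balanced["IsCanceled"].value_counts())
```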
For the test dataset (H2.csv), 12,000 observations were selected at random, irrespective of whether the booking was cancelled or not.
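Drawing the test sample can be sketched in the same way. The frame below is a synthetic stand-in for H2.csv with an arbitrary row count, and the column names are again assumptions. The point is only that the rows are sampled without looking at the label.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for H2.csv; the row count here is arbitrary
h2 = pd.DataFrame({
    "IsCanceled": rng.integers(0, 2, size=50_000),
    "LeadTime": rng.integers(0, 400, size=50_000),
})

# Draw 12,000 rows without replacement, ignoring the cancellation label
h2_test = h2.sample(n=12_000, random_state=1).reset_index(drop=True)
print(h2_test["IsCanceled"].mean())  # share of cancellations in the sample
```

Because the label is ignored, the class proportions in the sample follow those of the full file rather than the 50/50 split used for training.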
The relevant libraries were then imported, and each variable was assigned its appropriate data type.
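As a minimal sketch of this typing step, the snippet below converts a numeric column stored as text and marks two columns as categorical. The tiny frame and the column names (LeadTime, Country, DepositType) are assumptions for illustration. The integer encoding at the end is just one simple option, not necessarily the one used in the study.

```python
import pandas as pd

# Tiny illustrative frame; a real run would load the CSV file instead
df = pd.DataFrame({
    "LeadTime": ["12", "85", "3"],
    "Country": ["PRT", "GBR", "PRT"],
    "DepositType": ["No Deposit", "Non Refund", "No Deposit"],
    "IsCanceled": [0, 1, 0],
})

numeric_cols = ["LeadTime"]
categorical_cols = ["Country", "DepositType"]

# Numeric variables: parse text into numbers
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

# Categorical variables: use the pandas category dtype
for col in categorical_cols:
    df[col] = df[col].astype("category")

# Replace each category with its integer code so estimators can consume it
encoded = df.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].cat.codes

print(encoded.dtypes)
```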
Read the full article, and access source code, here.