ML Project: Predicting Hotel Booking Cancellation

Aybüke Meydan
4 min read · Sep 23, 2020
Photo by Anmol Seth on Unsplash

A reservation is a way to arrange your place before you get there. Guests indicate the room number, the room type and their arrival time in advance, and hotels adjust their preparations and the services to be provided for the guests accordingly. Knowing about reservations in advance is also critical for hotels in terms of expenses.

Problem Statement

Is the guest going to cancel the hotel reservation?


1.) Dataset

Hotel Booking Demand Datasets

This data set is taken from this article. You can view the data set from this link. In short, the data set contains hotel reservation records for two different hotel types in Portugal.

2.) Classification

The target variable is categorical, and the problem we're dealing with is a binary classification problem: will a guest actually come or not?

3.) Imbalanced Data

The distribution of the data is critically important, especially for classification problems. If we have imbalanced data, our options are oversampling and undersampling. If you have a very large number of records (for example, modeling or validation takes too long), undersampling will do the trick. You can try both on a base model, see which one works better, and proceed accordingly.
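To make the two options concrete, here is a minimal sketch of random undersampling and random oversampling with pandas. The helper name balance, the toy DataFrame and its values are made up for illustration; only the is_canceled and lead_time column names follow the public data set.

```python
import pandas as pd

def balance(df, target, strategy="under", seed=42):
    """Resample every class to the same size (hypothetical helper).

    "under" shrinks each class to the size of the smallest class,
    "over" grows each class to the size of the largest class.
    """
    counts = df[target].value_counts()
    n = int(counts.min() if strategy == "under" else counts.max())
    parts = [
        # Sample with replacement only when a class has fewer than n rows.
        group.sample(n=n, replace=len(group) < n, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 7 kept bookings (0) and 3 canceled bookings (1).
toy = pd.DataFrame({"lead_time": range(10), "is_canceled": [0] * 7 + [1] * 3})

under = balance(toy, "is_canceled", strategy="under")
over = balance(toy, "is_canceled", strategy="over")

print(under["is_canceled"].value_counts().to_dict())
print(over["is_canceled"].value_counts().to_dict())
```

Undersampling throws information away but makes training faster, while oversampling keeps every record at the cost of duplicated minority rows.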

Step By Step

Step 1: Data Preprocessing & EDA & Feature Engineering

Columns with null records: Country, Agent, Children, Company. Country is a categorical column, so I filled its null records with “Unknown”; I filled the null records in the other, numeric columns with 0.
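The filling step can be sketched with pandas as follows. The small DataFrame is made-up sample data, and the lowercase column names are an assumption based on the public data set.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the four columns with missing values.
bookings = pd.DataFrame({
    "country": ["PRT", None, "GBR"],
    "agent": [9.0, np.nan, 240.0],
    "children": [0.0, 2.0, np.nan],
    "company": [np.nan, np.nan, 40.0],
})

# Categorical column: mark missing values with an explicit label.
bookings["country"] = bookings["country"].fillna("Unknown")

# Numeric columns: treat a missing value as 0.
numeric_cols = ["agent", "children", "company"]
bookings[numeric_cols] = bookings[numeric_cols].fillna(0)

print(bookings.isna().sum())
```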

In addition, there were records with only babies in the room and records with no guests at all; I removed these records. I combined children and babies into a single column. I also created a new column that flags whether the assigned room matches the reserved room, thinking that a mismatch could lead to cancellation. I dropped the booking_changes column, which is a possible source of leakage, and also dropped the reservation_status and arrival_date_year columns.
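A rough sketch of these cleaning and feature engineering steps is shown below. The rows are made up, total_kids is a hypothetical name for the combined column, and encoding is_true_room as 1 when the rooms match is an assumption; the other column names follow the public data set.

```python
import pandas as pd

# Made-up rows covering the cases described above.
bookings = pd.DataFrame({
    "adults": [2, 0, 0, 1],
    "children": [0, 0, 0, 1],
    "babies": [0, 1, 0, 1],
    "reserved_room_type": ["A", "A", "B", "D"],
    "assigned_room_type": ["A", "A", "B", "A"],
    "booking_changes": [0, 1, 0, 2],
    "reservation_status": ["Check-Out", "Canceled", "Canceled", "Check-Out"],
    "arrival_date_year": [2016, 2016, 2017, 2017],
    "is_canceled": [0, 1, 1, 0],
})

# Rows with no adults and no children are either "babies only" or "no guests".
has_older_guest = (bookings["adults"] + bookings["children"]) > 0
bookings = bookings[has_older_guest].copy()

# Combine children and babies into one column (hypothetical name).
bookings["total_kids"] = bookings["children"] + bookings["babies"]

# 1 if the guest got the room that was reserved, 0 otherwise (assumed encoding).
bookings["is_true_room"] = (
    bookings["reserved_room_type"] == bookings["assigned_room_type"]
).astype(int)

# Drop the leakage-prone column and the other two columns named in the text.
bookings = bookings.drop(
    columns=["booking_changes", "reservation_status", "arrival_date_year"]
)

print(bookings)
```

reservation_status is worth dropping in any case, because it records the final outcome of the booking and would give the answer away to the model.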

While exploring the data, we see that City Hotel has more bookings than Resort Hotel, and the same holds for the number of cancellations. Most visitors come from Portugal (the hotels are in Portugal). From the monthly chart, we see that August is the month with the highest number of visitors, and it is also the month with the most cancellations. We can observe that a larger number of guests in the room or a longer lead time generally goes along with a cancellation of the reservation. If we take a look at the correlation matrix:

There is a strong correlation between our target variable and the newly added variable (is_true_room), lead_time and previous_cancellations.
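A correlation check of this kind can be sketched with pandas. The values below are made up, so the printed numbers say nothing about the real data set; the snippet only shows how to rank features by their correlation with the target.

```python
import pandas as pd

# Made-up numeric sample; the column names follow the text above.
toy = pd.DataFrame({
    "lead_time": [5, 200, 30, 310, 12, 150],
    "is_true_room": [1, 1, 0, 1, 0, 1],
    "previous_cancellations": [0, 1, 0, 2, 0, 0],
    "is_canceled": [0, 1, 0, 1, 0, 1],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength so negative correlations are not hidden.
corr_with_target = (
    toy.corr()["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=abs, ascending=False)
)

print(corr_with_target)
```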

Before moving on to the model, let's look at the target distribution:

Cancellations, the positive class of the target variable, are in the minority. ML models tend to be biased towards the majority class, so I applied SMOTE, one of the over_sampling methods. After this, both classes had an equal number of entries.

Step 2: Model Analysis

I trained 11 models. These are the results:

As you can see, the most successful model is Random Forest.

But this could simply be an overfitting problem, and a common way to reduce overfitting is to tune the hyperparameters. We can tune the parameters using RandomizedSearchCV.
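A hedged sketch of such a search with scikit-learn is shown below. The data is synthetic, and the parameter ranges, n_iter and cv values are illustrative assumptions, not the settings used in this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the prepared booking features.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative search space; shallower trees and larger leaves limit overfitting.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    cv=3,              # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
train_acc = search.best_estimator_.score(X_train, y_train)
test_acc = search.score(X_test, y_test)

print(search.best_params_)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```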

ROC Curve


  • I got the best results from tree-based algorithms. The accuracy I got after model evaluation: 0.9012
  • As can be seen in the feature importance tables, lead time and total number of special requests are the most valuable features.
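For readers who want to reproduce this kind of evaluation, here is a minimal sketch that computes the ROC curve, the AUC, the accuracy and the feature importances of a Random Forest with scikit-learn. The data and the feature names are synthetic placeholders, so the printed numbers are unrelated to the results reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Placeholder feature names and synthetic data.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# The ROC curve needs the predicted probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(f"accuracy: {accuracy:.4f}, AUC: {auc:.4f}")
print(importances)
```

The fpr and tpr arrays can be passed straight to a plotting library to draw the curve.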

Thanks for reading! You can find me on linkedin and github.