Prediction of Airbnb Listing Prices with Classical Machine Learning: Project to Learn AI vol.3
Introduction
This article is a continuation of the previous one in this series.
Following the Springboard curriculum, I started price-prediction modelling with classical machine learning. The goal was to learn several machine-learning techniques and to understand how to find the best model. There are two main approaches, classification and regression; I used regression because this task is predicting the price of Airbnb listings. The machine spec was as follows.
PC: MacBook Air (Retina, 13-inch, 2018)
Processor: 1.6GHz Intel Core i5
Memory: 16GB
Learned knowledge
■statsmodels
statsmodels.org
■Scikit-learn
scikit-learn.org
Github: SciPy 2016 Scikit-learn Tutorial
■Regularization
Blog: A Complete Tutorial on Ridge and Lasso Regression in Python
Blog: Ridge and Lasso Regression: A Complete Guide with Python Scikit-Learn
Blog: Lasso, Ridge and Elastic Net Regularization
Ridge, Lasso, and Elastic Net regression are simple techniques to reduce model complexity and prevent the over-fitting that may result from plain linear regression.
- Ridge Regression
Performs L2 regularization, i.e., adds a penalty equivalent to the square of the magnitude of the coefficients.
- Lasso Regression
Performs L1 regularization, i.e., adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
- Elastic Net
A combination of the above two: adds regularization terms that combine both the L1 and L2 penalties.
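As a minimal sketch (on synthetic data, not the article's Airbnb dataset), the three penalties can be compared side by side in scikit-learn. A characteristic difference: the L1 penalty can drive uninformative coefficients exactly to zero, while L2 only shrinks them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: 100 samples, 10 features, only the first 3 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Ridge (L2): shrinks all coefficients toward zero, rarely to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso (L1): sets the coefficients of pure-noise features exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
# Elastic Net: mixes both penalties via l1_ratio (1.0 = pure Lasso, 0.0 = pure Ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
```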
■k-fold cross-validation
Wikipedia: Cross-validation_(statistics)
Blog: k-fold cross-validation (Japanese)
The goal of cross-validation is to test the model’s ability in order to flag problems like overfitting or selection bias and to give an insight into how the model will generalize to an independent dataset.
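The idea above can be sketched in a few lines with scikit-learn's cross_val_score (synthetic data, assumed parameters): the data is split into k folds, each fold is held out once as a validation set, and the model is trained on the remaining folds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the listing data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 3-fold cross-validation: returns one r2 score per held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=3, scoring="r2")
print("per-fold r2:", scores)
print("mean r2:", scores.mean())
```

Averaging the per-fold scores gives a less optimistic estimate of generalization than a single train/test split.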
■Data contamination (a.k.a. data leakage)
Blog: data-leakage
Leakage causes a model to look accurate until you start making decisions with it, at which point it becomes very inaccurate. There are two main types of leakage: leaky predictors and leaky validation strategies.
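A common example of a leaky validation strategy is fitting a preprocessing step (such as a scaler) on the full dataset before cross-validation, so the validation folds influence the transformation. The sketch below (synthetic data; the difference is negligible here but can matter on real data) contrasts that with the safe pattern, a Pipeline that refits the scaler inside each training fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=1)

# LEAKY: the scaler is fit on ALL rows, so the held-out folds have
# already influenced the transformation before validation starts.
scaler = StandardScaler().fit(X)
leaky_scores = cross_val_score(Ridge(), scaler.transform(X), y, cv=3)

# SAFE: the pipeline refits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), Ridge())
safe_scores = cross_val_score(pipe, X, y, cv=3)
```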
■Performance Evaluation
Document: 3.3. Model evaluation: quantifying the quality of predictions
Blog: Summary of metrics used in machine learning (supervised learning edition) (Japanese)
Evaluating the quality of a model’s predictions
E.g.)
- Accuracy
- R-squared / Coefficient of Determination
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- AIC
- BIC
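The regression metrics in the list above are available in sklearn.metrics. A tiny worked example with made-up prices (RMSE is just the square root of MSE):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and model predictions
y_true = np.array([100.0, 150.0, 80.0, 120.0])
y_pred = np.array([110.0, 140.0, 90.0, 115.0])

mse = mean_squared_error(y_true, y_pred)   # → 81.25
rmse = np.sqrt(mse)                        # square root of MSE
mae = mean_absolute_error(y_true, y_pred)  # → 8.75
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
```

MSE penalizes large errors more heavily than MAE, while r2 measures the fraction of variance explained relative to simply predicting the mean.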
■Interpretability
Blog: Human-interpretable Machine Learning — The road to Explainable AI
E.g.)
- LIME
- ELI5
- Skater
- SHAP
Techniques
Take the logarithm of price
At first I tried some models, but the prediction accuracy was not very good, so I consulted my Springboard mentor. Based on his advice, I compared the features and found that the correlation coefficient between price and accommodates was as low as 0.135307.
Comparing the values output by the describe() function also showed that they differed by orders of magnitude.
So I took the logarithm of price. As a result, the correlation coefficient improved to 0.581324, so I decided the transformation was reasonable.
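The effect can be reproduced on a toy example (hypothetical numbers, not the article's data): listing prices are typically right-skewed, and taking the log makes the relationship with accommodates closer to linear, which raises the Pearson correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: price grows roughly multiplicatively with capacity
df = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 4, 6, 8, 10],
    "price": [40.0, 55.0, 70.0, 120.0, 300.0, 450.0, 900.0, 2000.0],
})

raw_corr = df["price"].corr(df["accommodates"])
log_corr = np.log(df["price"]).corr(df["accommodates"])

print(f"raw:  {raw_corr:.3f}")
print(f"log:  {log_corr:.3f}")  # higher: log price is nearly linear in accommodates
```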
Modelling
I used scikit-learn as the library and r2 as the performance metric. First, the r2 score of a DummyRegressor was output as a baseline for the other models. After that, I tried three models: Linear Regression, Elastic Net, and Random Forest. For Elastic Net and Random Forest, I varied several parameters to find the best configuration.
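A minimal sketch of this workflow (synthetic data and assumed parameters, not the notebook's exact setup): score a DummyRegressor as the baseline, then score the candidate models the same way and compare.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

models = {
    "dummy": DummyRegressor(strategy="mean"),  # baseline: always predicts the mean
    "linear": LinearRegression(),
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean cross-validated r2 per model; anything at or below the dummy's
# score means the model learned nothing useful.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean r2 = {results[name]:.4f}")
```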
Result 1: Execution time
It quickly became obvious that cross-validation needs massive resources.
※↓Wall time
- ElasticNet
Normal: 20.9 s
cross_validate(cv=3): 2min 51s
- RandomForestRegressor
Normal: 1h 7min 29s
cross_validate(cv=3): 1h 43min 34s
Cross-validation has the advantages of preventing over-fitting and estimating performance without selection bias, but it requires far more resources for trial and error. I concluded it is unsuitable for quickly trying out various models and parameters on a low-spec machine.
Result 2: Evaluation
RandomForestRegressor had the best score.
- DummyRegressor
r2: [-0.022997, -0.00028029, -0.03440069]
- LinearRegression
r2: [-4.34995354e+18, -9.17101405e+18, -7.01205721e+16]
- ElasticNet
r2: 0.3757551978407917
- RandomForestRegressor
r2: 0.6498992662637662
I changed the parameters as follows.
- ElasticNet
alpha: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1000]
l1_ratio: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
- RandomForest
n_estimators: [2, 5, 10, 20, 30, 50]
max_depth: [2, 3, 5, 10, 20, 30]
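One standard way to automate this kind of parameter sweep (I varied the values manually, but scikit-learn provides GridSearchCV) is to define the grid and let the search try every combination with cross-validation. A sketch on synthetic data, with a smaller grid than the one above for speed:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Reduced grid; GridSearchCV fits every combination with cv-fold
# cross-validation and keeps the best mean score.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean r2:", search.best_score_)
```

Note that the number of fits grows multiplicatively with the grid size, which is exactly why the full grids above were so expensive to run with cross-validation on a low-spec machine.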
Conclusion
I came to understand the overall workflow of ML. I also learned how to compare several models and variables to find the best one. The best score this time was r2 = 0.6499. But honestly, I don't know whether that is good enough, and I don't feel finished. In real business, I can imagine it takes great difficulty and perseverance to explore a wide range of methods using as many evaluation metrics as possible.
In the future, I would like to learn the effect of each parameter and pick up practical tips through experience. I also still need to try the interpretability techniques, since I did not use them this time.
Finally, here is a link to my Jupyter Notebook.
https://gist.github.com/furuta/1a2b3bd0610e53488d7819fdc9383eea