Airbnb Listing Data Wrangling for Machine Learning: Project to Learn AI vol.2

tom_furu
3 min read · Aug 20, 2019
Photo by Fredy Jacob on Unsplash

Introduction

This article is a continuation of this one.
The data are calendar.csv and listing.csv for Tokyo, downloaded from this site. From this data I decided to use 10 columns that seem to affect the price of listings. Some of these columns contain categorical variables, such as the date and the cancellation policy. Since such data cannot be fed to a model directly, the categorical variables were one-hot encoded. The location information was also converted from latitude and longitude into neighborhood information. (The Jupyter Notebook containing all the code is pasted at the end of this blog.)

What I learned

■Python
Video: How Code Slows as Data Grows
Video: Python Performance Investigation by Example
Video: Seven Strategies for Optimizing Your Numerical Code

■Luigi
luigi.readthedocs.io
Video: Scalable Pipelines with Luigi

■Pandas
pandas.pydata.org
Video: Data Wrangling and Intro to pandas — Part 1
Video: Data Wrangling and Intro to pandas — Part 2
Blog: Working with Time Series

■Dask
dask.org
Github: dask-tutorial-pycon-2018
Blog: Ultimate guide to handle Big Datasets for Machine Learning using Dask
Qiita: Dask使ってみた (Trying out Dask)
Blog: Reference of a judgement to use Dask

■Spark
Spark Python API
Video: Data Wrangling with PySpark for Data Scientists Who Know Pandas

Techniques

Manipulating Date Information

First, the date data was converted into a datetime object, and then the month, day of the month, and day of the week were extracted from it. For holidays, I used the jpholiday library.

import datetime
import jpholiday

# Parse the date string, then derive month / day / weekday / holiday features
df_calendar['datetime'] = df_calendar['date'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d'))
df_calendar['month'] = df_calendar['datetime'].map(lambda x: x.month)
df_calendar['day'] = df_calendar['datetime'].map(lambda x: x.day)
df_calendar['day_of_week'] = df_calendar['datetime'].map(lambda x: x.weekday())
df_calendar['holiday'] = df_calendar['datetime'].map(lambda x: jpholiday.is_holiday(x.date()))

The date and datetime columns were no longer necessary, so they were deleted.

del df_calendar['date']
del df_calendar['datetime']
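As a side note, the same features can also be derived with pandas' vectorized datetime accessors instead of row-by-row map() calls. This is a minimal sketch of that alternative, not the code used in the notebook; df_calendar_raw here stands for the calendar data before any columns were dropped.

import pandas as pd
import jpholiday

# Vectorized alternative: parse once with to_datetime, then use the .dt accessor
dates = pd.to_datetime(df_calendar_raw['date'])
df_calendar_raw['month'] = dates.dt.month
df_calendar_raw['day'] = dates.dt.day
df_calendar_raw['day_of_week'] = dates.dt.dayofweek
# jpholiday expects a plain date object, so apply it element-wise
df_calendar_raw['holiday'] = dates.dt.date.map(jpholiday.is_holiday)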

One-Hot (aka one-of-k) Encoding

Pandas provides a convenient function, get_dummies. Using this function, the variables that need to be treated as categorical were converted.

df_calendar = pd.get_dummies(df_calendar, columns=['month', 'day_of_week', 'day'])
df_listing = pd.get_dummies(df_listing, columns=['property_type', 'room_type', 'cancellation_policy'])
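To see what the encoding does, here is a tiny toy example (made-up data, not the actual listing columns): a single categorical column is expanded into one indicator column per category.

import pandas as pd

toy = pd.DataFrame({'cancellation_policy': ['flexible', 'strict', 'moderate']})
print(pd.get_dummies(toy, columns=['cancellation_policy']))
# Produces one indicator column per category, e.g. cancellation_policy_flexible,
# cancellation_policy_moderate and cancellation_policy_strict,
# filled with 0/1 (or boolean, depending on the pandas version) values.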

Neighborhood from Google API

Neighborhood information can be obtained with the Places API of the Google Maps Platform. This time, I searched within a radius of 300 m. Because the API incurs charges once the number of requests exceeds a certain limit, each response was saved to a file together with a timestamp.

import time
import requests
import pandas as pd

# api_key, google_places_api_url, radius, language, latitude_round,
# longitude_round and df_neighborhood are defined earlier in the notebook
response = requests.get(google_places_api_url +
                        'key=' + api_key +
                        '&location=' + str(latitude_round) + ',' + str(longitude_round) +
                        '&radius=' + radius +
                        '&language=' + language)
data = response.json()
# Keep the coordinates, the raw results and a timestamp for caching
neighborhood = pd.DataFrame([latitude_round, longitude_round, data['results'], time.time()],
                            index=df_neighborhood.columns).T
df_neighborhood = df_neighborhood.append(neighborhood)
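As a rough sketch of the caching mentioned above (the file name and column names here are my assumptions, not necessarily the ones in the notebook), the accumulated responses can be written to and reloaded from a CSV file so that repeated runs do not hit the API again:

import os
import pandas as pd

CACHE_FILE = 'neighborhood_cache.csv'  # hypothetical file name

# Reload previously fetched responses, or start with an empty cache
if os.path.exists(CACHE_FILE):
    df_neighborhood = pd.read_csv(CACHE_FILE)
else:
    df_neighborhood = pd.DataFrame(columns=['latitude', 'longitude', 'results', 'timestamp'])

# ... fetch and append new rows as shown above ...

df_neighborhood.to_csv(CACHE_FILE, index=False)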

Using Dask

Basically, since Dask exposes almost the same API as Pandas, you usually do not need to change the source code. However, some APIs need additional parameters. For instance, Dask's map() needs a meta parameter that defines the type of the return values, as shown in the sketch below. Finally, the link to my Jupyter Notebook is pasted below; please check it out to see how Dask is used.
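As a minimal sketch (assuming the same calendar.csv as above, and not taken verbatim from the notebook), the month extraction could look like this in Dask; note the meta argument on map():

import datetime
import dask.dataframe as dd

ddf_calendar = dd.read_csv('calendar.csv')
# Without meta, Dask would have to guess the return type of the lambda
ddf_calendar['datetime'] = ddf_calendar['date'].map(
    lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d'),
    meta=('datetime', 'datetime64[ns]'))
ddf_calendar['month'] = ddf_calendar['datetime'].map(lambda x: x.month, meta=('month', 'int64'))
print(ddf_calendar['month'].head())  # the computation is lazy and is triggered here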

https://gist.github.com/furuta/ac64fbafc240a6d1efd48f90b4bf97f0

Conclusion

I usually don’t use Python, so this project helped me understand what can be done with Python and libraries such as Pandas and Dask. In addition, the Airbnb listing data contains multiple data formats, such as numerical values, dates, categories, and location information, and I learned a lot about how to make them usable for machine learning. This time there were only about 10,000 instances, which is not that big, so I managed to process everything on my own notebook PC, but I still need to learn separately how to deal with bigger data (Spark, etc.).
