Aug 20, 2019


The data are calendar.csv and listing.csv for Tokyo, taken from this site. From this data, I decided to use 10 columns because they seem to affect the price of listings. Some of these columns hold categorical variables, such as the date and the cancellation policy. Since categorical data cannot be used for learning directly, these variables were one-hot encoded. Also, the location information was converted from latitude and longitude into neighborhood information. (The Jupyter Notebook containing all of the code is linked at the end of this post.)

Manipulating Date Information

First, the date data was converted into a datetime object, and then the month, day of the month, and day of the week were extracted from that object. For holidays, I used a library named jpholiday.

import datetime
import jpholiday

# Parse the date strings, then derive the calendar features
df_calendar['datetime'] = df_calendar['date'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d'))
df_calendar['month'] = df_calendar['datetime'].map(lambda x: x.month)
df_calendar['day'] = df_calendar['datetime'].map(lambda x: x.day)
df_calendar['day_of_week'] = df_calendar['datetime'].map(lambda x: x.weekday())
# The argument here is assumed to be the date part of each row
df_calendar['holiday'] = df_calendar['datetime'].map(lambda x: jpholiday.is_holiday(x.date()))

The date and datetime columns were not necessary, so these were deleted.

del df_calendar['date']
del df_calendar['datetime']
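
As an alternative to the per-row lambdas above, the same calendar features can probably be produced with the vectorized `.dt` accessor in Pandas. This is only a minimal sketch: the sample dates and the name `df_sample` are made up for illustration, and the holiday column is left out so that the snippet runs without jpholiday.

```python
import pandas as pd

# Hypothetical sample rows standing in for calendar.csv
df_sample = pd.DataFrame({'date': ['2019-08-11', '2019-08-12', '2019-12-31']})

# Parse the strings once, then read each part through the .dt accessor
dt = pd.to_datetime(df_sample['date'], format='%Y-%m-%d')
df_sample['month'] = dt.dt.month
df_sample['day'] = dt.dt.day
df_sample['day_of_week'] = dt.dt.weekday  # Monday=0 ... Sunday=6

# The raw date column is no longer needed
df_sample = df_sample.drop(columns=['date'])
print(df_sample)
```

On a large calendar file, the vectorized form should avoid calling a Python function once per row.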

One-Hot (aka one-of-k) Encoding

Pandas has a useful function, get_dummies. Using this function, the variables that need to be treated as categorical were one-hot encoded.

df_calendar = pd.get_dummies(df_calendar, columns=['month', 'day_of_week', 'day'])
df_listing = pd.get_dummies(df_listing, columns=['property_type', 'room_type', 'cancellation_policy'])
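
To show what get_dummies does to a single column, here is a small sketch on an invented frame. The prices and the name `df_sample` are assumptions for illustration, not values from the Airbnb data.

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration
df_sample = pd.DataFrame({
    'price': [8000, 12000, 5000],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt'],
})

# Each distinct category becomes its own indicator column,
# named <original column>_<category value>
encoded = pd.get_dummies(df_sample, columns=['room_type'])
print(encoded.columns.tolist())
```

The original room_type column disappears, and the numeric price column is passed through unchanged.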

Neighborhood from Google API

The Places API in Google Maps Platform can return neighborhood information. This time, the search radius was 300 m. Because the API incurs a cost once the number of requests exceeds a certain limit, the results were saved to a file together with a timestamp.

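# Query the Places API around the rounded coordinates
response = requests.get(google_places_api_url +
                        'key=' + api_key +
                        '&location=' + str(latitude_round) + ',' + str(longitude_round) +
                        '&radius=' + radius +
                        '&language=' + language)
data = response.json()
# Build a one-row frame: rounded coordinates, API results, and a timestamp
neighborhood = pd.DataFrame([latitude_round, longitude_round, data['results'], time.time()], index=df_neighborhood.columns).T
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_neighborhood = pd.concat([df_neighborhood, neighborhood])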
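
The idea of saving results with a timestamp so that the API is not called twice for the same place can be sketched as follows. This is not the code from my notebook: the file name, the `digits` rounding precision, the helper names, and `fake_fetch` are all hypothetical, and the real API request is replaced by a caller-supplied function so that the snippet runs without a key.

```python
import json
import os
import time

CACHE_FILE = 'neighborhood_cache.json'  # hypothetical file name


def load_cache(path=CACHE_FILE):
    """Return the saved lookups, or an empty dict if no cache file exists."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    return {}


def save_cache(cache, path=CACHE_FILE):
    """Write the cache to disk so that later runs can reuse it."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cache, f)


def lookup_neighborhood(lat, lng, fetch, cache, digits=3):
    """Return results for a rounded coordinate, calling fetch() only on a miss."""
    lat_r, lng_r = round(lat, digits), round(lng, digits)
    key = f'{lat_r},{lng_r}'
    if key not in cache:
        # fetch stands in for the real Places API request
        cache[key] = {'results': fetch(lat_r, lng_r), 'timestamp': time.time()}
    return cache[key]['results']


calls = []


def fake_fetch(lat, lng):
    """Placeholder for the API call; records how often it is used."""
    calls.append((lat, lng))
    return ['Shibuya']  # placeholder result, not real API output


cache = {}
lookup_neighborhood(35.6581, 139.7017, fake_fetch, cache)
lookup_neighborhood(35.6583, 139.7019, fake_fetch, cache)  # rounds to the same key
print(len(calls))
```

Because nearby listings round to the same key, they share one request, which is what keeps the number of paid calls down.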

Using Dask

Because Dask offers largely the same API as Pandas, most of the source code does not need to change. However, some methods need additional parameters. For instance, map() in Dask takes a meta parameter that defines the type of the return values. The Jupyter Notebook linked at the end of this post shows how Dask was used in detail.


I do not usually use Python, so this project helped me understand what can be done with Python and libraries such as Pandas and Dask. In addition, the Airbnb listing data contains multiple data formats, such as numerical values, dates, categories, and location information, and I think I learned a lot about how to make them usable for machine learning. This time the data set was only about 10,000 instances, which is not very big, so my own notebook PC could handle it, but I still need to learn separately how to deal with bigger data (Spark, etc.).




