Sentiment Analysis of Airbnb Reviews with BERT

tom_furu
6 min read · Nov 10, 2019


Photo by Kelsey Dody on Unsplash

Introduction

This article is a continuation of my previous one.
At first, I had no plan to work on NLP in the learning project I was doing at Springboard, but I was interested in studying it and wondered what I could do. Looking at the Airbnb data again, I realized it included review data and decided to use that. Since I had seen Twitter posts classified as positive or negative before, I tried a similar sentiment analysis. I chose transfer learning with BERT, which appeared in the Springboard curriculum, since it was one of the relatively advanced methods. One big problem, however, was that Airbnb's data has no ground-truth labels, but I decided to work with it anyway.

While looking into BERT, I found out that Google Colaboratory can be used for free. I referred to the following sites:

* https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b

* https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

All notebooks are here.

Techniques

First, authenticate so you can use the TPU runtime. The Colab runtime type must be set to TPU in advance; you can change it from the menu bar under [Runtime > Change runtime type].

import os
import json

import tensorflow as tf
from google.colab import auth

TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
        auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

For training data I used SST-2 (The Stanford Sentiment Treebank) from the GLUE tasks. The reviews in this dataset are about movies, but they are labeled as positive or negative, which is the closest match to this project's purpose. Following the reference code, I also specified a GCS bucket as the storage location for the trained model, so a bucket with the same name needs to be created beforehand.

TASK = 'SST-2'
DATA_NAME = 'SST'
# Download glue data.
! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
!python download_glue_repo/download_glue_data.py --data_dir='glue_data' --tasks=$DATA_NAME
TASK_DATA_DIR = 'glue_data/' + TASK
# Model output directory
BUCKET = 'springboard-project-nlp'
OUTPUT_DIR = 'gs://{}/bert-capstone-project/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
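
If the bucket does not exist yet, it can be created beforehand with gsutil, for example. This is just a sketch: you may need to pass your GCP project ID with -p, and the bucket name must be globally unique.

# Create the GCS bucket used as OUTPUT_DIR (sketch; requires GCP authentication).
!gsutil mb gs://springboard-project-nlp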

Next, get the Airbnb review data. I used Google Drive because it can be mounted easily; save the review data in your Google Drive in advance.

import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
df_reviews = pd.read_csv('drive/My Drive/Springboard/projects/data/reviews.csv')

Adjust the data and convert each row into an InputExample object. BERT training uses four fields: guid, text_a, text_b, and label.

# Training and evaluation data
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
label_list = [0, 1]

df_train = pd.read_csv(TASK_DATA_DIR + '/train.tsv', delimiter='\t')
train_InputExamples = df_train.apply(
    lambda x: run_classifier.InputExample(
        guid=None,
        text_a=x[DATA_COLUMN],
        text_b=None,
        label=x[LABEL_COLUMN]
    ), axis=1
)

df_dev = pd.read_csv(TASK_DATA_DIR + '/dev.tsv', delimiter='\t')
dev_InputExamples = df_dev.apply(
    lambda x: run_classifier.InputExample(
        guid=None,
        text_a=x[DATA_COLUMN],
        text_b=None,
        label=x[LABEL_COLUMN]
    ), axis=1
)

# Prediction data
df_test = pd.DataFrame({
    'id': df_reviews.id,
    'comments': df_reviews.comments
})
df_test = df_test.dropna()
# Strip tab and newline characters from the comments.
table = str.maketrans({'\t': '', '\n': '', '\r': ''})
df_test['comments'] = df_test['comments'].map(lambda x: x.translate(table))
test_InputExamples = df_test.apply(
    lambda x: run_classifier.InputExample(
        guid=x.id,
        text_a=x['comments'],
        text_b=None,
        label=0  # dummy label; the prediction data has no ground truth
    ), axis=1
)

I used the model called uncased_L-12_H-768_A-12 because it is lightweight. Use it to create a tokenizer, then use the tokenizer to convert the examples into features.

BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'
tokenizer = run_classifier_with_tfhub.create_tokenizer_from_hub_module(
    BERT_MODEL_HUB)

MAX_SEQ_LENGTH = 128
train_features = run_classifier.convert_examples_to_features(
    train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
dev_features = run_classifier.convert_examples_to_features(
    dev_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = run_classifier.convert_examples_to_features(
    test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

Build the model function and create an estimator. A function called get_run_config, which builds the run configuration, is defined separately (a sketch of it follows the code below). Please refer to the full code for details on the various constants.

model_fn = run_classifier_with_tfhub.model_fn_builder(
    num_labels=len(label_list),
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    bert_hub_module_handle=BERT_MODEL_HUB
)
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=PREDICT_BATCH_SIZE,
)
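
For reference, get_run_config follows the pattern of the BERT-on-TF-Hub reference notebook. The sketch below is roughly how it looks; the constants SAVE_CHECKPOINTS_STEPS, ITERATIONS_PER_LOOP, and NUM_TPU_CORES are defined elsewhere in the notebook.

# Sketch of get_run_config, following the reference notebook.
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

def get_run_config(output_dir):
    # TPU run configuration: where to write checkpoints and how to shard work.
    return tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster_resolver,
        model_dir=output_dir,
        save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=ITERATIONS_PER_LOOP,
            num_shards=NUM_TPU_CORES,
            per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))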

Create an input function for training and then train.

train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
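
Evaluation on the SST-2 dev set uses the same input_fn_builder pattern. A minimal sketch; the number of eval steps is derived from the dev set size and EVAL_BATCH_SIZE:

# Evaluate the fine-tuned model on the dev set (sketch).
eval_input_fn = run_classifier.input_fn_builder(
    features=dev_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=True)
eval_steps = int(len(dev_features) / EVAL_BATCH_SIZE)
eval_result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
print(eval_result)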

The results of the evaluation:

eval_accuracy = 0.9162844
eval_loss = 0.49866587
global_step = 6313
loss = 0.60559416

Run prediction on the Airbnb review data using the trained model, then tag each review with the predicted label.

predict_input_fn = run_classifier.input_fn_builder(
    features=test_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=True)
results = estimator.predict(predict_input_fn)

labels = ["Negative", "Positive"]
predictions = []
for test, result in zip(df_test.values, results):
    label = 0 if result['probabilities'][0] > result['probabilities'][1] else 1
    predictions.append(
        [test[0], labels[label], result['probabilities'], test[1]])
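
To get the totals shown in the next section, the predictions list can be loaded into a DataFrame and counted. A small sketch; df_predictions and its column names are just illustrative, not from the original notebook:

df_predictions = pd.DataFrame(
    predictions, columns=['id', 'label', 'probabilities', 'comments'])
# Count how many reviews were tagged with each label.
print(df_predictions['label'].value_counts())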

Results

Positive: 264,454
Negative: 33,823

About 89% of the reviews are classified as positive, so positive opinions dominate. One reason is that there really are many good accommodation experiences; another is that on Airbnb hosts and guests review each other, so it is hard to write negatively. Honest reviews would help the guests who come after you, but writing negative content carries the risk of receiving a negative review in return.

I spot-checked 20 each of the reviews classified as positive and negative. The ones classified as positive appear to be correct. However, reviews in languages other than English (especially Japanese and Chinese) seem to be classified as negative regardless of their content.
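
The spot check can be done by sampling from the prediction DataFrame; a sketch, again using the illustrative df_predictions from above:

# Pull 20 random reviews from each predicted class for manual inspection.
for sentiment in ['Positive', 'Negative']:
    sample = df_predictions[df_predictions['label'] == sentiment].sample(20)
    print(sample[['comments', 'probabilities']])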

As it stands, it would be difficult to use this analysis in a real service. It needs a way to assess correctness and a way to handle multiple languages.

Multilingual

I did not keep these in the Colab notebook, but I tried the following two approaches.

Google Translate

The first approach is to translate the reviews with the API before prediction. When I checked the pricing, it was $20 per million characters. There are about 300,000 rows in the data, and if one review averages 200 characters, the total is 60,000,000 characters, which would cost about $1,200 to translate everything. Since that was expensive, I looked for ways to reduce the cost. Many reviews are originally written in English, so I wanted to filter which ones actually need translation, but of course the rows are not labeled with the language they were written in. So I decided to distinguish them by the types of characters they contain. The code for identifying languages other than English is as follows:

import unicodedata

def is_english(string):
    # Heuristic: treat a review as non-English if it contains CJK characters,
    # kana, or common accented Latin characters.
    for char in str(string):
        try:
            unicode = unicodedata.name(char)
        except ValueError:
            return False
        if "CJK UNIFIED" in unicode \
                or "WITH ACUTE" in unicode \
                or "WITH DOT ABOVE" in unicode \
                or "WITH DIAERESIS" in unicode \
                or "WITH RING ABOVE" in unicode \
                or "HIRAGANA" in unicode \
                or "KATAKANA" in unicode:
            return False
    return True
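
Applying it to the review DataFrame gives the number of rows that would actually need translation. A small sketch; the is_english column name is just for illustration:

# Flag rows that look non-English and count how many would need translation.
df_test['is_english'] = df_test['comments'].map(is_english)
print('Reviews needing translation:', (~df_test['is_english']).sum())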

As a result, the number of target reviews dropped to 88,078 rows, and the cost was estimated at $352.31. That is still expensive for an individual learning exercise, so I thought about another way. My Springboard mentor told me about a module called googletrans. It uses the Google Translate API under the hood, but it is free. When I tried it, the error `JSONDecodeError: ('Expecting value: line 1 column 1 (char 0)', 'occurred at index 0')` occurred after roughly every 300 requests. Looking into it, it seemed that an IP restriction was applied because of the large number of requests. I did not know how long I would have to wait, and even if I changed the TPU server's IP each time, I would have had to do that about 300 times, so I gave up.
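
For reference, basic googletrans usage looks roughly like this; treat it as a sketch, since the API differs between googletrans versions:

# pip install googletrans
from googletrans import Translator

translator = Translator()
# Translate a single review to English ("It was a very comfortable stay.").
# Doing this at volume is what eventually triggered the rate limiting above.
result = translator.translate('とても快適に過ごせました。', dest='en')
print(result.text)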

Multilingual model

Meanwhile, looking at the available BERT models, a multilingual version had recently been released. It removes the need for translation, so I tried it. I specified Multilingual_L-12_H-768_A-12 as the model, changed the code slightly, and ran it. However, every time I executed the model-training cell, my MacBook Air crashed due to insufficient memory. I had assumed the program ran entirely in the cloud on the TPU, but it seemed to put a load on the local PC as well. The full multilingual model could not be run in my current environment, so I gave up; I would need either a lightweight version or an upgraded environment.

Conclusion

I was able to experience some of the problems that can occur in the field. The review data has no ground-truth labels; one possible solution is to attach a five-star rating to each review. In addition, multiple languages are mixed together, and both translating the review data and switching to a multilingual model carry costs, so it comes down to a business judgment. Text data such as reviews is qualitative and difficult to handle. On the other hand, it was very useful to be able to easily use an excellent model like BERT.
Being able to use a TPU on Colab for free is convenient, but it required TPU-specific code, so it was a little annoying that the code cannot run unchanged in other environments such as a GPU. Still, an environment that runs without starting a local Jupyter Notebook is very useful, so I will keep using it.
