Sentiment analysis is one of the most common NLP (Natural Language Processing) applications. With machine learning you can train models on text datasets that identify or predict the sentiment of a piece of text, for example "negative" or "positive". In this blog post we describe how to train such a model using a practical example and then create a REST interface which you can use to predict the sentiment of a text. The source code for this exercise can be found on GitHub.

5 Star Sentiment Analysis

The typical sentiment analysis models and examples in the "blogosphere" use two categories. We looked for a variation on these "negative" / "positive" models and searched for a dataset with 5 star ratings, with the goal of training an ML model that categorises sentiment on a five star scale and thus gives a more nuanced picture of the sentiment. We found a publicly available dataset by Yelp, a review website.

Yelp Data

Yelp provides a JSON-based review dataset with multiple fields, two of which we used for training our sentiment analysis model:

text: Normal English text, e.g. "As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go! ..."
stars: Numbers 1 to 5 corresponding to the stars

The full documentation of the dataset we used is on this page:

As you can see, we used only a portion of the dataset. It would also be possible to create a model that predicts funny, cool or useful content. We will have a look at this in an upcoming blog post.

Model and Libraries

These days sentiment analysis models are typically deep learning models. ML practitioners also often use existing pre-trained models as the starting point for their training, a technique referred to as transfer learning.

For our 5 star sentiment analysis exercise we have chosen the BERT model. BERT is a neural network architecture which was created and published in 2018 by Google researchers and delivers state-of-the-art performance in many NLP tasks.

Our language of choice for ML is Python, which offers three popular libraries that we used in this exercise:

PyTorch: an awesome deep learning library by Facebook for training neural networks and running inference with them
Hugging Face Transformers: a library that provides a very large number of pre-trained neural networks as well as tools for training and using NLP models
Flask: a Python micro web framework, great for quickly creating REST interfaces

The Hugging Face Transformers library provides a pre-trained version of the BERT model based on cased text, which we used in our training.

Steps for creating the ML Model and REST

This exercise has 4 steps:

Data Transformation

The original Yelp reviews come in JSON format and also with some unnecessary fields. We converted the original format to a Pandas dataframe with only two fields, text and stars:
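A conversion of this kind can be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual code: it assumes the review file stores one JSON object per line, and the function name `load_reviews` and its parameters are hypothetical.

```python
import pandas as pd


def load_reviews(path, max_records=None, chunksize=100_000):
    """Read a JSON-lines review file and keep only the text and stars columns."""
    frames = []
    loaded = 0
    # lines=True parses one JSON object per line; chunksize keeps memory
    # usage bounded by reading the file piece by piece.
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            frames.append(chunk[["text", "stars"]])
            loaded += len(chunk)
            if max_records is not None and loaded >= max_records:
                break
    df = pd.concat(frames, ignore_index=True)
    if max_records is not None:
        df = df.head(max_records)
    # Star ratings may arrive as floats (5.0); store them as integers.
    return df.astype({"stars": "int64"})
```

Calling `load_reviews(path, max_records=2_000_000)` would then return a two-column dataframe for a sample of the reviews.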

The dataset is also unbalanced, i.e. some categories are over-represented and others under-represented. Here is a graph of the distribution in a sample of 2 million records:

As you can see, people tend to give more positive than negative reviews, although one star reviews are more common than 2 and 3 star reviews.

An unbalanced dataset might lead to a biased classifier which, at inference time, favours the most frequent class, so we decided to create a dataset with the same number of records per class. After removing records from the over-represented classes we got this distribution:

The final training set contained 792,786 records, drawn from a sample of 2 million records out of the original 8 million records. We serialized the Pandas dataframe with all records to a Python pickle file.
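Balancing by removing records can be sketched as downsampling every class to the size of the smallest one. The snippet below is an illustration under that assumption, not the notebook's code; the function name and the random seed are hypothetical.

```python
import pandas as pd


def balance_by_downsampling(df, label_col="stars", seed=42):
    """Return a shuffled dataframe with the same number of rows per class."""
    # The smallest class determines how many rows every class keeps.
    n_per_class = df[label_col].value_counts().min()
    balanced = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle so that the classes are no longer stored in blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

The result can then be written with `balanced.to_pickle(path)` and read back later with `pd.read_pickle(path)`.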

The Jupyter notebook with the initial data transformation is available on GitHub.


In the training notebook (also available on GitHub) we read the previously serialized (pickled) Pandas dataframe and first checked the distribution of the lengths of the Yelp reviews. Most reviews seem to have around 80 words, but a good number of reviews have more than 512 words. Overall, the distribution of review lengths in words appears to be skewed to the right:
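One simple way to check such a distribution is to count whitespace-separated words per review. The helper below is a hypothetical sketch; note that it counts words, whereas the BERT tokenizer counts word pieces, so token counts will typically be somewhat higher.

```python
import pandas as pd


def review_length_stats(texts, limit=512):
    """Summarise review lengths measured in whitespace-separated words."""
    lengths = pd.Series(list(texts)).str.split().str.len()
    return {
        "median": float(lengths.median()),
        "mean": float(lengths.mean()),
        # Share of reviews that are longer than the limit.
        "share_over_limit": float((lengths > limit).mean()),
    }
```

A mean well above the median is a quick indication of a right-skewed distribution.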

Based on this distribution we decided to set the maximum length of the BERT tokenizer to 512. This number also seems to be the default for most tokenizers provided by the Transformers library, and 512 tokens is the longest input the pre-trained BERT models accept:

After our mini text length distribution analysis we proceeded to create the Yelp data set class that uses the Transformers library's tokenizer:
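A dataset class of this kind might look like the sketch below. It is an assumption about the general shape rather than the notebook's code: the class name `YelpReviewDataset` is hypothetical, the tokenizer is expected to behave like a Transformers tokenizer when called, and the stars 1 to 5 are mapped to class ids 0 to 4.

```python
import torch
from torch.utils.data import Dataset


class YelpReviewDataset(Dataset):
    """Tokenizes reviews on access and returns tensors for BERT."""

    def __init__(self, texts, stars, tokenizer, max_length=512):
        self.texts = list(texts)
        # Class ids must start at 0, so 1..5 stars become 0..4.
        self.labels = [int(s) - 1 for s in stars]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding="max_length",   # pad short reviews to a fixed length
            truncation=True,        # cut reviews longer than max_length
            return_tensors="pt",
        )
        return {
            # The tokenizer returns a batch of one; drop that dimension.
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```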

We also split the data into training, validation and test sets, using stratification to avoid unbalanced sets.
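A stratified split draws the same fraction from every star class for each subset. The sketch below does this with Pandas only; the function name and the 80/10/10 proportions are assumptions for illustration, since the exact split sizes are not stated here.

```python
import pandas as pd


def stratified_split(df, label_col="stars", val_frac=0.1, test_frac=0.1, seed=42):
    """Split df into train, validation and test sets, class by class."""
    df = df.reset_index(drop=True)  # unique index so rows can be dropped safely
    # Take the test share from every class first.
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    rest = df.drop(test.index)
    # Rescale the validation fraction because `rest` is already smaller.
    val = rest.groupby(label_col).sample(
        frac=val_frac / (1 - test_frac), random_state=seed
    )
    train = rest.drop(val.index)
    return train, val, test
```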

So again we kept the datasets balanced:

Next we created the dataloaders for our training, validation and test sets with a batch size of 16:
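Creating the dataloaders could look roughly like this. The helper name is hypothetical, and shuffling only the training set is a common convention that we assume here rather than something taken from the notebook.

```python
import torch
from torch.utils.data import DataLoader


def make_loaders(train_ds, val_ds, test_ds, batch_size=16):
    """Wrap the three datasets in dataloaders with a shared batch size."""
    # Shuffle only the training data; evaluation order does not matter.
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)
    test_loader = DataLoader(test_ds, batch_size=batch_size)
    return train_loader, val_loader, test_loader
```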

For our model we used a class provided by the Transformers library: BertForSequenceClassification. This is how we instantiate the model using a factory method:

We used the AdamW optimizer of the Transformers library with a learning rate of 2e-5 (0.00002):

We also used a learning rate scheduler (transformers.get_linear_schedule_with_warmup, provided by the Transformers library):

This scheduler changes the learning rate in this way:
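In words: a linear schedule with warmup raises the learning rate linearly from 0 to the initial value during the warmup steps and then lowers it linearly back to 0 by the last training step. The plain Python function below reproduces that shape for illustration; the function name and the step counts are assumptions, as the number of warmup steps we used is not stated here.

```python
def linear_warmup_lr(step, warmup_steps, total_steps, base_lr=2e-5):
    """Learning rate at a given step for a linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: grow from 0 to base_lr.
        factor = step / max(1, warmup_steps)
    else:
        # Decay phase: shrink from base_lr to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Print a few points of the schedule for a hypothetical run of 1,100 steps.
for s in (0, 50, 100, 600, 1100):
    print(s, linear_warmup_lr(s, warmup_steps=100, total_steps=1100))
```

With zero warmup steps the schedule starts directly at the initial learning rate and only decays.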

Finally, we created a very standard training loop and evaluation loop. See the source code on GitHub for more information.

As our only metric we used accuracy:
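Accuracy is simply the share of reviews for which the class with the highest score matches the true star rating. A minimal sketch with PyTorch tensors, using a hypothetical function name:

```python
import torch


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()
```

Here `logits` has one row per review and one column per class, and `labels` holds the class ids 0 to 4.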

Training results

We trained on a GPU machine with a Tesla V100 GPU with 16 GB of memory:

The best accuracy we got was 0.71, which means that the star prediction on the test set was correct in 71% of cases. This accuracy was achieved in the second epoch of the training, which took almost 20 hours.

Prediction / Inference

After training we simply "pickled" the best trained model and created a simple Jupyter notebook that we used to check our predictions. This notebook is available on GitHub.

In this notebook we created a predict method which encodes the input sequence using the same tokenizer we used during training and then uses the encoding as input for the trained model. The trained model outputs a tensor with five elements, one per class, which represents the probabilities for each class once a softmax is applied.
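A predict function along these lines could be sketched as follows. This is an illustration, not the notebook's method: the name `predict_stars` is hypothetical, the model is assumed to return an object with a `logits` attribute (as BertForSequenceClassification does), and class ids 0 to 4 are assumed to correspond to 1 to 5 stars.

```python
import torch


@torch.no_grad()  # no gradients are needed for inference
def predict_stars(text, model, tokenizer, max_length=512, device="cpu"):
    """Return the predicted star rating (1..5) and the class probabilities."""
    model.eval()
    encoding = tokenizer(
        text, max_length=max_length, truncation=True, return_tensors="pt"
    )
    output = model(
        input_ids=encoding["input_ids"].to(device),
        attention_mask=encoding["attention_mask"].to(device),
    )
    # Softmax turns the five raw scores into probabilities that sum to 1.
    probabilities = torch.softmax(output.logits, dim=1).squeeze(0)
    stars = int(probabilities.argmax().item()) + 1
    return stars, probabilities.tolist()
```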

This is the predict method:

Here are some examples of how you can use this model with simple sentences in a notebook:


As the final step of our exercise we created a REST API with Flask. The implementation of this API can be found here. The script simply re-uses the predict method from the prediction notebook and wraps it in some Flask-based code:
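Such a wrapper could look like the sketch below. The route `/predict`, the JSON field names and the `create_app` factory are assumptions made for illustration and not necessarily those of the original script; `predict_fn` stands for any function that takes a text and returns a star rating plus class probabilities.

```python
from flask import Flask, jsonify, request


def create_app(predict_fn):
    """Build a Flask app that exposes predict_fn as a JSON endpoint."""
    app = Flask("sentiment_api")

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        text = payload.get("text") if isinstance(payload, dict) else None
        if not isinstance(text, str) or not text.strip():
            # Reject requests without usable text.
            return jsonify(error="field 'text' is required"), 400
        stars, probabilities = predict_fn(text)
        return jsonify(stars=stars, probabilities=probabilities)

    return app
```

A client would then POST a JSON body such as `{"text": "..."}` to the endpoint and receive the predicted stars in the response.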

If you try, for example, the sentence from CNN below, you will get a one star rating from our model:

Boris Johnson's dream of a 'Global Britain' is turning into a nightmare


Hugging Face Transformers has all the material you need to create customized sentiment analysis models. Whilst very powerful, the BERT model is heavy to train: it needs a lot of GPU memory and training is slow, but the results are satisfying. After trying out many sentences, the star ratings make some sense and, when integrated into software, can be really useful. The achieved accuracy of 71% on five stars is also a good start.

The Yelp dataset also offers the possibility of creating other very interesting models that predict categories such as "funny", "cool" or even "useful". We have not explored these possibilities in this blog post, but might do so in a future one.

We will also try to improve our experiment by including other metrics and using parallelization with more data, and describe our results in an upcoming blog post.