Natural Language Processing (NLP) is an area of Machine Learning that has evolved at a remarkable pace in recent years. 

The recent advancements are mostly driven by Deep Learning and by new neural network architectures, in particular the Transformer and models derived from it, such as BERT, which handle sequential input, like sentences, more efficiently.

What is NLP good for?

Our readers might not be familiar with the topic of NLP and what it is good for. We will give an overview of what you can do these days with the available NLP models. These are the typical tasks NLP models are used for:

  • Sequence classification — You want to classify a sequence according to some pre-defined criteria. For example, you have a set of movie reviews from Rotten Tomatoes and you want to figure out whether the movie review is positive, neutral or negative.
  • Extractive question answering — Given some text, you ask questions and the model extracts the answers from that text. This task is especially important for conversational bots. For example, if you have this text: "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyse large amounts of natural language data." And you ask this question: "What is NLP?" You will get this answer: "is a subfield of linguistics, computer science, and artificial intelligence."
  • Masked language modelling — This task is about asking NLP models to fill gaps in sentences. For example: "I will now sit on the ________." The model would then replace ________ with "chair." You might ask whether this task can be used directly in a business context; the answer is "No". But it is incredibly useful for another reason: you can generate an almost unlimited amount of training data for it. Just take any text, mask some words at random using a script, and you are good to use it in your training. You can then adapt the trained model to another task. So, this task is interesting for business indirectly: it is easy to train with loads of data and produces models with a good understanding of language, which can be reused for other tasks later on.
  • Causal language modelling — Based on some text, the NLP model generates the next word.
  • Text generation — Very similar to causal language modelling, but instead of a single word the model generates multiple words from some existing text. You feed the model one sentence and it writes some text based on that. To get to know this task, you can play around with a hosted model, as I did below:

  • Named entity recognition (NER) — According to one definition, "this is the task of classifying tokens according to a class, for example, identifying a token as a person, an organisation or a location." NER is interesting when you want to convert text into structured data. You will use this task if you want to extract the names of individuals and the locations associated with them from text:

  • Summarisation — NLP models compress text the way human beings do. Below is an example of this task based on a text extracted from the CNN website, with the summary in the green area:

  • Translation — A model converts text from one language to another, similar to what Google Translate does.
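The masking trick described under masked language modelling above is easy to script. Here is a minimal sketch in Python; the `[MASK]` token and the 15% masking probability follow BERT's convention, but note that real pipelines mask subword tokens rather than whole words:

```python
import random

def mask_words(text, mask_token="[MASK]", prob=0.15, seed=42):
    """Randomly replace words with a mask token, producing a
    self-supervised training example for masked language modelling."""
    rng = random.Random(seed)
    masked, labels = [], []
    for word in text.split():
        if rng.random() < prob:
            masked.append(mask_token)
            labels.append(word)  # the model learns to predict these
        else:
            masked.append(word)
    return " ".join(masked), labels

masked_text, labels = mask_words("I will now sit on the chair and read a book")
```

Running this over any corpus yields masked inputs plus the original words as labels, which is exactly why training data for this task is effectively unlimited.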

The NLP boom and a model zoo

Since the introduction of BERT by Google in 2018, there has been a boom in NLP. A whole "zoo" of modern NLP models perform tasks better than previous models. 

An NLP start-up has created a repository full of NLP models which machine learning practitioners can use in their projects: Hugging Face. This is a favourite resource for the Onepoint team when we do innovation projects for clients and experimentation within Onepoint Labs.

How we used the NLP zoo in a Kaggle competition

To dive into the modern NLP world and partake in the NLP model zoo, I joined Tarun Lutra, a friend of mine, in the CommonLit Readability Prize competition on Kaggle.

I have previously written about Kaggle in other posts, like this one. Kaggle is a great resource for ML practitioners as well as for businesses and institutions who wish to get state-of-the-art solutions for their ML problems. Since Kaggle hosts ML competitions, it has access to a worldwide pool of talent ready to solve the most difficult challenges.

The competition was about identifying the reading ease of excerpts from literature and from Wikipedia, rating each with a score. The score was a continuous number, so the actual task of the competition was regression: predicting a score, not assigning a category.

The challenges we faced 

The competition's training data provided each text excerpt with some extra metadata, like its source, copyright information, and the score. But the training data had only 2,834 records. That is not much, compared to competitions where you have millions of training records. In general, the more training data we have, the better NLP models will perform.

The first challenge was the minimal data, which led to unstable training results.

By 'unstable' I mean that the evaluation results on the validation data set would vary heavily. The model would also not improve much after each epoch (that's lingo for a full training pass through the whole dataset).

It took a while to figure out how to stabilise the evaluation results with PyTorch and the excellent Hugging Face Transformers library. I was able to stabilise it with the help of some excellent public notebooks, like this one.

The technique I used to stabilise my training is called Layerwise Learning Rate Decay. It applies higher learning rates to the top layers and lower learning rates to the bottom layers of the neural network. This requires a tweaked optimiser that assigns a different learning rate to each layer.
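The per-layer rates can be sketched in a few lines. The top learning rate of 2e-5 and the decay factor of 0.9 below are illustrative values, not the ones from the competition; in PyTorch, each rate would then be attached to its own optimiser parameter group:

```python
def layerwise_learning_rates(num_layers, top_lr=2e-5, decay=0.9):
    """Return one learning rate per layer: the top layer gets the full
    rate, and each layer below it is scaled down by `decay`.
    Index 0 is the bottom (embedding-side) layer."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Example: 12 transformer layers, as in a BERT-base-sized model
lrs = layerwise_learning_rates(12)

# With an optimiser, roughly (sketch, not a full training setup):
# optimizer = AdamW([{"params": layer.parameters(), "lr": lr}
#                    for layer, lr in zip(model.layers, lrs)])
```

The intuition is that the bottom layers hold general language knowledge worth preserving, while the top layers need to adapt most to the new task.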

Finding the right learning rates with Optuna

Since we had multiple learning rates, and they could vary between the different folds, we used the brilliant Optuna framework to find them.

Here is an example of a notebook in which we were trying to find the best learning rates.

A ride through the Model Zoo and Ensembling

I started this competition by using pure BERT models in their base or distilled versions. These models were downloaded from here:



Overall, these models have fewer parameters than their larger siblings and are quicker to train, hence a good starting point.

Soon I also started to use models of the RoBERTa family. RoBERTa ("A Robustly Optimized BERT Pretraining Approach") is a BERT-style model trained for longer, on more data, and with an improved pre-training procedure. Most of the public notebooks with better scores were using models of the RoBERTa family.

The RoBERTa models I started to use were:



The results for a single model were better with RoBERTa than with BERT.

Then we also tried other models like:








The best-scoring model was microsoft/deberta-large, followed by roberta-large.

After trying out all of these models, I combined them in an inference ensemble notebook, which is now publicly available on Kaggle. This notebook landed my team in 31st place, within the top 1% of the competition.

Ensembling means combining the inference outputs of multiple models. It almost always helps against overfitting, i.e. it improves generalisation.
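For a regression task like this one, an ensemble can be as simple as a weighted average of each model's predicted scores. A minimal sketch, with made-up prediction values for illustration:

```python
def ensemble_predictions(model_preds, weights=None):
    """Combine per-model prediction lists into one averaged prediction
    per example. `model_preds` is a list of lists, one inner list per model."""
    n_models = len(model_preds)
    if weights is None:
        # Default to a plain average over all models
        weights = [1.0 / n_models] * n_models
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    n_examples = len(model_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, model_preds))
        for i in range(n_examples)
    ]

# Three models scoring the same two excerpts:
preds = ensemble_predictions([[-0.5, 1.2], [-0.3, 1.0], [-0.4, 1.4]])
```

In practice the weights are often tuned on the validation folds, giving better-performing models a larger say.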

What we missed

After looking at the blog of the competition winner, who had used pseudo-labelled data sets extracted via web scraping, we realised that we had missed exactly that. Whenever you do not have much training data, you need to enhance or augment it. That is where the focus needs to be.

I just remembered this quote:

"It's not who has the best algorithm that wins, It's who has the most data" - Andrew Ng


NLP has evolved a lot in recent years. You can get amazing results with an increasing amount of freely available machine learning models which you can train on your specific problem. 

If you want to train state-of-the-art models in ML, you will need to use advanced training techniques, like k-fold cross-validation, hyperparameter tuning, ensemble learning, and Layerwise Learning Rate Decay. 
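Of those techniques, k-fold cross-validation is the easiest to sketch: split the data into k folds and train k models, each validated on a different held-out fold. In practice a library routine such as scikit-learn's KFold does this; a plain-Python version of the index splitting:

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs: each of the k folds
    is used exactly once as the validation set."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

# 2,834 training records, as in the competition, split into 5 folds
folds = list(kfold_indices(2834, k=5))
```

Each fold yields its own trained model and validation score; averaging the scores gives a far more stable estimate than a single train/validation split, which matters with so little data.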

But never forget the importance of data and use techniques to augment your training data.