In this post, we are going to dive into the modern NLP (Natural Language Processing) world and have a look at one of its most popular libraries: the Hugging Face Transformers Python library.

The state of NLP

NLP has gone through a profound transformation in recent years (roughly since 2010) and, in the space of a decade, has become dramatically more powerful. With the popularisation of Deep Learning, NLP became one of the main topics for machine learning researchers and practitioners. From 2010 onwards, NLP problems such as sentiment analysis, named entity recognition, summarisation and translation were typically tackled with recurrent neural networks (RNNs), but then something changed.

From RNN to the Transformer

RNNs are powerful and performed well on a wide range of tasks, but they struggle to model long-range dependencies and are hard to parallelise. In an RNN, you process a sequence word by word, with a shared state that you keep updating along the way. This sequential processing is a poor fit for the hardware we use for deep learning (CPUs and GPUs): an RNN is essentially a for loop over the sequence, and that loop tends to be quite inefficient.
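
To make the sequential bottleneck concrete, here is a minimal sketch of the recurrence an RNN computes (plain NumPy, with toy dimensions chosen purely for illustration); each step needs the previous hidden state, so the loop cannot be spread across the sequence:

import numpy as np

# Toy sizes, for illustration only
hidden_size, embedding_size, seq_len = 4, 3, 5

W_hh = 0.1 * np.random.randn(hidden_size, hidden_size)     # hidden-to-hidden weights
W_xh = 0.1 * np.random.randn(hidden_size, embedding_size)  # input-to-hidden weights
inputs = np.random.randn(seq_len, embedding_size)          # one embedding per word

h = np.zeros(hidden_size)               # the shared state updated along the way
for x_t in inputs:                      # word t has to wait for word t-1
    h = np.tanh(W_hh @ h + W_xh @ x_t)

print(h)                                # final state summarising the whole sequence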

The Transformer architecture was introduced by Google researchers in the "Attention Is All You Need" paper (see the original paper about it) and later popularised by models such as BERT; it was a huge revolution in the NLP community. This neural network architecture processes all the words of a sequence at the same time. The Transformer thus parallelises the processing of words using encoders and decoders, and it also introduces the "self-attention" mechanism, which allows each word in an input sequence to be "understood" in relation to the other words. A very good explanation of this architecture can be found in this excellent blog post.
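
As a rough illustration of self-attention, here is a NumPy sketch of scaled dot-product attention (not code from the library; in a real Transformer the queries, keys and values come from learned projections of the input, which are omitted here to keep the sketch short). Every word ends up as a weighted mix of all the words in the sequence, and all positions are computed at once:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 4, 8                 # 4 words, 8-dimensional vectors
X = np.random.randn(seq_len, d_model)   # toy word representations

Q, K, V = X, X, X                       # learned projections omitted for brevity

scores = Q @ K.T / np.sqrt(d_model)     # how strongly each word attends to every other word
weights = softmax(scores)               # one attention distribution per word
attended = weights @ V                  # every word mixed with all the others, in parallel

print(attended.shape)                   # (4, 8)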

Hugging Face Transformers

Hugging Face Transformers is an NLP library that lets you use pre-existing NLP models (based on the Transformer architecture mentioned above) for inference, as well as train your own models via transfer learning. It also provides a repository of pre-trained models for multiple languages, together with scripts for training your own models on top of existing ones. Hugging Face Transformers also ships its own tokeniser implementations.

In essence, Hugging Face Transformers provides you with:

  • a library for using pre-trained NLP models and for training your own via transfer learning
  • a model repository at https://huggingface.co/models
  • its own implementations of very performant NLP tokenisers
  • the "pipeline" high-level API for inference

One really noticeable thing is that the model repository you get with Transformers contains not only the so-called official models but also community contributions.

Another very noticeable thing about this framework is that it supports both PyTorch and TensorFlow models: the Transformers library sits on top of both frameworks and offers a consistent API for inference and training.
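
As a small illustration of both points (assuming you have PyTorch and TensorFlow installed; the checkpoint name is just one example of a model id from the hub), the same pre-trained weights can be loaded through the same from_pretrained call in either framework:

from transformers import AutoTokenizer, AutoModel, TFAutoModel

checkpoint = "distilbert-base-uncased"   # any model id from https://huggingface.co/models

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pt_model = AutoModel.from_pretrained(checkpoint)     # PyTorch version
tf_model = TFAutoModel.from_pretrained(checkpoint)   # TensorFlow version of the same weights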

NLP Tasks in Hugging Face Transformers

This library covers a range of popular NLP tasks, which we will demo later on in a Python notebook:

  • feature-extraction: extraction of vector representations (embeddings) of the text, which can then be used as features for other models.
  • sentiment-analysis: classifying the sentiment of the text. Typically binary, but it can also use multiple classes.
  • named-entity-recognition (ner): recognition of entities in a sentence. For example, you want to recognise a person's given name, surname or a location in a sentence.
  • question-answering: given a sentence with some information, you ask a question in natural language and get back a reply extracted from that sentence.
  • fill-mask: completing parts of sentences, e.g.: 'The weather is good, so let us go for a [MASK]'.
  • summarisation: summarising long texts.
  • translation: translation from one language to another. At present the Transformers pipelines only support three translation tasks: 'translation_en_to_fr', 'translation_en_to_de' and 'translation_en_to_ro'.
  • text-generation: generation of (somewhat random) text that continues a given prompt.
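
To make one of these tasks concrete, here is a small sketch of the fill-mask pipeline reusing the example sentence from the list above (the model the pipeline downloads by default, and therefore the exact predictions, may vary between library versions, which is why the mask token is taken from the pipeline's own tokeniser):

from transformers import pipeline

fill_mask = pipeline('fill-mask')

# Use the tokeniser's own mask token so the example works whatever the underlying model is
sentence = f"The weather is good, so let us go for a {fill_mask.tokenizer.mask_token}."

for prediction in fill_mask(sentence):
    print(prediction['sequence'], round(prediction['score'], 3))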

Also, note that the Transformers library has two types of API. The first is a low-level API that lets you build your own tokenisers for different types of models (in practice this is provided by the companion Hugging Face tokenizers library). This low-level API gives you access to the following components (a short code sketch follows the list below):

  • Normaliser: executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, maybe strip it, or even apply one of the common unicode normalisation processes, you add a Normaliser.
  • PreTokenizer: in charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example is to simply split on spaces.
  • Tokeniser Model: handles all the sub-token discovery and generation. This part is trainable and really dependent on your input data.
  • Post-Processor: provides advanced construction features to be compatible with some of the Transformer-based SoTA models. For instance, for BERT it wraps the tokenised sentence with [CLS] and [SEP] tokens.
  • Decoder: in charge of mapping a tokenised input back to the original string. The decoder is usually chosen according to the PreTokenizer used previously.
  • Trainer: provides training capabilities to each model.
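
As a rough sketch of how these components fit together, using recent versions of the companion Hugging Face tokenizers library (the file corpus.txt below is a placeholder for your own training data):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Sequence, NFKC, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Tokeniser model: BPE handles the sub-token discovery and generation
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normaliser: unicode normalisation plus lowercasing of the raw input string
tokenizer.normalizer = Sequence([NFKC(), Lowercase()])

# PreTokenizer: decides where to pre-segment the string (here, simply on whitespace)
tokenizer.pre_tokenizer = Whitespace()

# Trainer: learns the sub-tokens from your own data ("corpus.txt" is a placeholder)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Post-Processor: wrap every encoded sentence with BERT-style [CLS] and [SEP] tokens
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)

# Decoder direction: map token ids back towards the original string
encoded = tokenizer.encode("The weather is good today.")
print(encoded.tokens)
print(tokenizer.decode(encoded.ids))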

The second is the high-level pipeline API, which bundles a tokeniser and a model together. This API lets you perform the NLP tasks described above with very little code. For example, if you want to do some sentiment analysis, you can simply write something like:

from transformers import pipeline

nlp = pipeline('sentiment-analysis')

sequence = "This angry, disjointed documentary wobbles between high-minded outrage and crude tabloid sensationalism. Sorry but no."

result = nlp(sequence)

result

[{'label': 'NEGATIVE', 'score': 0.9994897842407227}]

This API does a lot of work for you: it selects a default tokeniser and model, performs the entire tokenisation step, runs the inference and decodes the output.
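
To give a feel for what the pipeline hides, here is a hedged sketch of the roughly equivalent manual steps in PyTorch (assuming a recent version of the library; the checkpoint below is the model the sentiment-analysis pipeline has typically used by default, but that default may change over time):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "This angry, disjointed documentary wobbles between high-minded outrage and crude tabloid sensationalism. Sorry but no."

inputs = tokenizer(sequence, return_tensors="pt")   # tokenisation
with torch.no_grad():
    logits = model(**inputs).logits                 # inference
probabilities = torch.softmax(logits, dim=-1)[0]    # decoding the output
label_id = int(probabilities.argmax())
print(model.config.id2label[label_id], float(probabilities[label_id]))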

We have created a simple notebook, which you can access on Google Colab, that demos how to use this API with all the NLP tasks supported by the Transformers library:

https://colab.research.google.com/drive/1faBybyFPCoy2mBzKTLG0GD0a2rsBg8xv?usp=sharing

Conclusion

Modern NLP models are now easily available, whether for direct inference or for transfer learning. There is a wide range of business applications that can be built on top of these now-ready technologies (today's NLP models vastly outperform those of 10 years ago). One example that comes to mind is using Named Entity Recognition and Custom Entity Linking to extract structured data from unstructured text. The bridge between the unstructured data world and the structured data world is being built fast.