At Onepoint, we have been working on a project that processes both structured data and unstructured French text. Structured data is easy to handle: it is organised and labelled, so you can store it in a database or data warehouse and query it. But what about the French text, which is typically free-form prose describing car offers? How can you extract value from it?
In order to process unstructured text, you need to recognise patterns in it. One way to achieve this is POS tagging (part-of-speech tagging, or grammatical tagging). POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, i.e. mapping each word in a sentence to a tag. On this page, you can find a list of English POS tags. Other languages use different tag sets. Here is a visual POS tagging example for a French sentence:
The example above shows how the French words in a sentence are tagged. You can see, for instance, that the word "Turquie" (English: Turkey, the country) is tagged "NPP" ('nom propre', proper noun in English), as is "Recep Tayyip Erdogan" (the Turkish head of state). Another example is "ordonne" (English: orders), tagged "V" (verb).
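In code, the result of POS tagging is simply a token-to-tag mapping. A minimal sketch in Python, using only the tags from the example above:

```python
# POS tagging assigns each token a grammatical tag.
# These pairs reproduce the tags discussed for the example sentence.
tags = {
    "Recep": "NPP",     # NPP: 'nom propre' (proper noun)
    "Tayyip": "NPP",
    "Erdogan": "NPP",
    "Turquie": "NPP",
    "ordonne": "V",     # V: verb
}

for word, tag in tags.items():
    print(f"{word} -> {tag}")
```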
POS tagging can be implemented with neural networks, but those networks first need to be trained for the task. There are also ready-made POS taggers you can use, such as the Stanford POS Tagger: https://nlp.stanford.edu/software/tagger.shtml .
Yet we wanted to try a POS tagging solution based on the state-of-the-art transformer architecture, and for this reason we chose the Hugging Face Transformers library. It lets you use transfer learning to train a POS tagging model on top of an existing pre-trained one.
Training the Neural Network
As mentioned, we decided to build our POS tagging model on top of a pre-trained neural network that uses the transformer architecture. But one thing was still missing: a dataset. After browsing the internet, we found a suitable one:
In this repository, we found a file with properly POS tagged text: https://github.com/nicolashernandez/free-french-treebank/tree/master/130612/frwikinews/txt-tok-pos
The Hugging Face Transformers library makes the training process very easy, since it ships with example training scripts for different NLP tasks. We based our training on the scripts available in this directory:
The Training Script
We used a Jupyter notebook for training. We have created a copy of our training notebook on Google Colab:
This Jupyter notebook executes the following steps:
- Download and install the libraries: Hugging Face Transformers and seqeval (used to compute the evaluation metrics)
- Download the dataset (from https://github.com/nicolashernandez/free-french-treebank)
- Create the dataset files for training, i.e. split the data into training, validation and test sets
- Transform the original POS files, i.e. convert them into a format that the Hugging Face Transformers scripts can read
- Remove tokens that would confuse the model and make sure no sequence is longer than 128 tokens
- Finally, run the training script "run_ner.py"
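The conversion and cleanup steps above can be sketched as follows. Note that the 'word/TAG' input layout and the two-column "token TAG" output (one token per line, blank line between sentences) are assumptions for this sketch, not the exact format of the original files:

```python
MAX_LEN = 128  # cap sequence length, as done in the notebook

def convert(lines, max_len=MAX_LEN):
    """Turn 'word/TAG word/TAG ...' sentences into two-column token/tag text."""
    out = []
    for sentence in lines:
        tokens = sentence.split()[:max_len]  # enforce the sequence-length limit
        for tok in tokens:
            word, _, tag = tok.rpartition("/")
            if word:  # skip malformed tokens that would confuse the model
                out.append(f"{word} {tag}")
        out.append("")  # blank line marks the end of a sentence
    return "\n".join(out)

print(convert(["Le/DET chat/NC dort/V ./PONCT"]))
```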
If you run the previously mentioned Jupyter notebook, you will get a PyTorch-based model saved in the folder french-postag-model.
After training for three epochs, we got an evaluation precision and recall of around 98.5% each.
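For intuition, these metrics compare predicted tags against the reference tags. As a simplified illustration only (seqeval, which the training script actually uses, computes entity-level precision and recall rather than this token-level accuracy):

```python
def tag_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the reference tag."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)

gold = [["NPP", "V", "DET", "NC"]]   # reference tags for one sentence
pred = [["NPP", "V", "DET", "ADJ"]]  # model predictions, one mistake
print(tag_accuracy(gold, pred))  # 0.75
```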
The training loss also went down smoothly:
Testing Inference on the model
After training the model, we can start using it. Again, we created a small notebook with the code below:
from transformers import AutoTokenizer, pipeline

# The model was fine-tuned on top of bert-base-multilingual-cased, so we reuse its tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# POS tagging runs through the token-classification ('ner') pipeline;
# grouped_entities=True merges word pieces that share the same tag
nlp_token_class = pipeline('ner', model='french-postag-model', tokenizer=tokenizer, grouped_entities=True)

# Double quotes avoid a syntax error from the apostrophe in "d'euros"
nlp_token_class("Les trois économistes y dressent un premier bilan post-confinement et proposent des mesures de relance à court et moyen termes, pour un total de près de 50 milliards d'euros.")
And this prints out:
So yes, the POS tagger works quite well on most word tokens, but it is not perfect yet: it seems to have some trouble with apostrophes, and it actually misclassified the word "euros". Still, for a first try, it is pretty good.
Sharing the Model
One of the things that the author of this blog likes most about the Hugging Face Transformers library is that it provides a well-defined mechanism for sharing models with the community. Another bonus is that it comes with tools to convert PyTorch models to TensorFlow models, and the other way around.
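A minimal sketch of the PyTorch-to-TensorFlow direction, assuming the fine-tuned model was saved to the french-postag-model folder from the training step (TensorFlow must be installed for this to work):

```python
from transformers import TFAutoModelForTokenClassification

# Load the PyTorch checkpoint saved during training and convert it to a
# TensorFlow model in one step; from_pt=True triggers the weight conversion.
tf_model = TFAutoModelForTokenClassification.from_pretrained(
    "french-postag-model", from_pt=True
)

# Save the converted weights so they can be shared alongside the PyTorch ones.
tf_model.save_pretrained("french-postag-model")
```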
Basically, we just converted and uploaded the model using the instructions on the following page:
And that is it; our model is now accessible to the community, and there is even a REST interface that allows developers to connect their apps to it. You can find the model we trained here:
And of course, you can also use this model yourself if you have the Hugging Face Transformers library installed in your environment:
NLP models are especially important if you want to bridge the gap between the worlds of unstructured and structured data. With access to powerful NLP models, data engineers will be able to convert vast amounts of unstructured textual data into a structured representation. Application developers will also benefit: they will be able to access powerful models that extend the capabilities of the software they write.