All communication between people, whether written or verbal, generates a lot of information. This information can be interpreted through different signals, such as tone, word choice, and the intention behind an expression, which help us understand and evaluate human behaviour or a person's personality.

With the social media revolution, this task has become all the more important. At the same time, it becomes hard to manage when thousands or millions of people exchange messages across different channels and formats. People speak different languages and express themselves with a wide variety of words and sentences. These factors add to the complexity of interpreting human communication and extracting meaningful information and patterns from it.

This is where NLP comes in. So, what is NLP?

Natural Language Processing, or NLP, is a field at the intersection of computer science and linguistics concerned with the interaction between computers and human languages. Its main objective is to enable computer algorithms to handle natural language.

Use cases in NLP

NLP is used in a growing number of tasks and applications to automate processes. Some examples:

  • Recognising and predicting diseases ahead of time by analysing health records, so they can be treated at a very early stage.
  • Sentiment analysis, the most popular use case, in which we analyse customer feedback or reviews to gauge opinions about a product.
  • Fake news identification, by finding the source of a story and validating its authenticity.
  • Automating litigation tasks, helping legal teams conduct case research and saving a lot of time.
  • Voice-driven interfaces such as Siri and Alexa, which respond to voice commands, help with searching for songs, weather forecasts, and driving directions, and automate activities like turning lights on or off.
  • Recognising spam mail, as done by Google, Yahoo, and others.
  • Cognitive assistants that provide personalised search assistance.
  • Financial trading, by keeping track of information in news articles and reports.
  • Talent recruitment, by identifying candidates with the required skills.

Major Steps in NLP

How does NLP happen? Through algorithms. An NLP pipeline generally consists of steps that break a sentence into words (parsing), analyse and clean them syntactically, and then infer the semantics. Some of the basic concepts and algorithms used in such a pipeline are listed below; the exact steps vary with the objective or application of NLP. Depending on the words (tokens), we can choose either stemming or lemmatisation. In Python and Java, prebuilt libraries (NLTK, spaCy, CoreNLP) automate these processes.

  • BOW (Bag of words)
  • Tokenisation
  • Stop words removal
  • Stemming
  • Lemmatisation
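Before looking at each step in detail, the overall flow can be sketched in plain Python. This is a minimal, illustrative sketch: the stop-word list and suffix rules below are tiny hand-made stand-ins for what libraries like NLTK or spaCy provide.

```python
import re

# Tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "you", "my", "with"}

def tokenise(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes -- a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def pipeline(text):
    """Tokenise, remove stop words, then stem what remains."""
    return [naive_stem(t) for t in remove_stop_words(tokenise(text))]

print(pipeline("You freeze my body with ice and snow"))
# ['freeze', 'body', 'ice', 'snow']
```

Each of these stages is explained individually in the sections below.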

1) BOW (Bag of words)

In this model, we count how often each unique word occurs in a text, sentence, or document. It creates an occurrence matrix from the frequency of each word in the sentences or documents used for training, disregarding grammar and word order.

An example:

Winter, winter, you make a glow

You freeze my body with ice and snow  

Now let's count the words:


Word   | 'Winter, winter, you make a glow' | 'You freeze my body with ice and snow'
------ | --------------------------------- | --------------------------------------
winter | 2                                 | 0
you    | 1                                 | 1
make   | 1                                 | 0
a      | 1                                 | 0
glow   | 1                                 | 0
freeze | 0                                 | 1
my     | 0                                 | 1
body   | 0                                 | 1
with   | 0                                 | 1
ice    | 0                                 | 1
and    | 0                                 | 1
snow   | 0                                 | 1

This model has some shortcomings: it loses semantic meaning and context, and stop words ('a', 'and', 'you', 'with', etc.) are counted even though they add no value to the analysis. In our example, 'freeze' even ends up with a lower total count than 'you'.
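The occurrence matrix for the two example lines can be reproduced in a few lines of plain Python. This is a minimal sketch of the bag-of-words idea; libraries such as scikit-learn provide a full implementation (CountVectorizer).

```python
from collections import Counter

sentences = [
    "winter winter you make a glow",
    "you freeze my body with ice and snow",
]

# Build the vocabulary: every unique word, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row of counts per sentence, one column per vocabulary word.
matrix = []
for sentence in sentences:
    counts = Counter(sentence.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
# [[2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
#  [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]
```

Note how grammar and word order are discarded: only the counts survive.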

2) Tokenisation

Tokenisation is simply the method of breaking sentences into small pieces, such as words, each of which is called a token. Punctuation, if any, is usually removed at this stage.

If we tokenise our example above:

'Winter, winter, you make a glow' gives the tokens: winter, winter, you, make, a, glow

'You freeze my body with ice and snow' gives the tokens: you, freeze, my, body, with, ice, and, snow

In tokenisation we remove punctuation, but in some cases we need to keep it: for example, 'Dr.' needs the '.' along with 'Dr', and in the same way we sometimes need a hyphen kept with a word. Such cases pose challenges, and there we should not strip the punctuation.
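A minimal tokeniser can be sketched with Python's re module. The second variant below uses a small hand-made exception list (an assumption for illustration) to keep abbreviations such as 'Dr.' intact.

```python
import re

def tokenise(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenise("Winter, winter, you make a glow"))
# ['winter', 'winter', 'you', 'make', 'a', 'glow']

def tokenise_keep_abbrev(text, abbreviations=("Dr.", "Mr.", "Mrs.")):
    """Tokenise, but keep the period attached to known abbreviations."""
    tokens = []
    for raw in text.split():
        if raw in abbreviations:
            tokens.append(raw)                    # keep 'Dr.' as one token
        else:
            tokens.append(raw.strip(",.!?").lower())
    return tokens

print(tokenise_keep_abbrev("Dr. Smith, you make a glow"))
# ['Dr.', 'smith', 'you', 'make', 'a', 'glow']
```

Real tokenisers (NLTK's word_tokenize, spaCy's tokenizer) handle many more such exceptions out of the box.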

3) Stop words removal

Stop words are common articles, pronouns, and prepositions such as 'a', 'and', 'the', and 'with'. These frequent words provide no informative detail for NLP analysis, so we exclude them from our token list. Removing stop words, usually against a predefined list, frees space in the database and improves processing time.

But in some cases, such as sentiment analysis, removing stop words loses the context of the sentence. For instance, removing 'not' completely flips the sentiment and gives wrong results. We need to choose carefully which words to treat as stop words and which not to.
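A minimal sketch of stop-word removal, using a tiny hand-made stop-word list (real projects typically use a predefined list such as NLTK's stopwords corpus). The keep parameter is an illustrative assumption showing how sentiment-bearing words like 'not' can be protected.

```python
# Tiny illustrative stop-word list; note that it includes 'not'.
STOP_WORDS = {"a", "an", "and", "the", "you", "my", "with", "not"}

def remove_stop_words(tokens, keep=()):
    """Drop stop words, but let the caller protect words such as 'not'."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = ["you", "freeze", "my", "body", "with", "ice", "and", "snow"]
print(remove_stop_words(tokens))
# ['freeze', 'body', 'ice', 'snow']

# For sentiment analysis, protect 'not' so the polarity is preserved:
print(remove_stop_words(["this", "movie", "is", "not", "good"], keep=("not",)))
# ['this', 'movie', 'is', 'not', 'good']
```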

4) Stemming

In stemming, we remove affixes from the beginning or end of a word to obtain its base form, for example the suffix 'ful' from 'beautiful' or the prefix 'astro' from 'astrobiology'.

An issue arises when an affix forms a word in its own right rather than a new form of the same word, for example the prefix 'eco' in 'ecosystem' or the suffix 'ist' in 'guitarist'; stripping the affix here would change the meaning.

For example, reducing 'connected' to 'connect' is right stemming, while chopping 'beautiful' down to 'beauti' is wrong stemming, since 'beauti' is not a valid base word.

5) Lemmatisation

The main intention of lemmatisation is to reduce a word to its base form so that different forms of the same word can be grouped together; for example, 'went' and 'gone' are both changed to 'go'. Irregular forms are also resolved, so 'best' is changed to 'good'. In other words, we standardise words that share the same root meaning. Lemmatisation has almost the same goal as stemming but uses a completely different approach.

Lemmatisation resolves a word to its dictionary form (known as the lemma), which requires detailed dictionaries in which the algorithm can look up words and link them to their corresponding lemmas.

For example, the words 'going', 'go' and 'went' are all forms of the word 'go', so 'go' is the lemma of all of them.

In lemmatisation we face the problem of disambiguation, where the same word can have different meanings depending on context. For example, 'bat' could refer to the animal or to a cricket bat, and 'bank' to a financial institution or to the land alongside a body of water. We can resolve this by supplying the part of speech, for example verifying whether the word is a noun or a verb.
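A dictionary-lookup sketch illustrates both ideas. The lemma table here is a tiny hand-made stand-in for a real dictionary such as WordNet (used by NLTK's WordNetLemmatizer), and the part-of-speech argument is what resolves ambiguity.

```python
# Tiny illustrative lemma dictionary, keyed by (word, part of speech).
LEMMAS = {
    ("went", "verb"): "go",
    ("gone", "verb"): "go",
    ("going", "verb"): "go",
    ("better", "adj"): "good",
    ("best", "adj"): "good",
}

def lemmatise(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(lemmatise("went", "verb"))  # go
print(lemmatise("best", "adj"))   # good
print(lemmatise("bat", "noun"))   # bat -- unknown words pass through unchanged
```

Because the key includes the part of speech, the same surface word can map to different lemmas in different grammatical roles, which is how real lemmatisers handle disambiguation.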