These days there is a lot of hype around deep learning, computer vision, NLP (natural language processing), sound processing and speech recognition. We tend to forget that the type of data which lives in most databases of the enterprise world is actually tabular data: structured data.

So, in this blog, we are going to focus on tabular data, the less hyped area of machine learning that most people tend to forget, but in my opinion an extremely important one for businesses like ours at Onepoint.

Structured vs Unstructured

We have two worlds in machine learning: the world of structured data and the world of unstructured data. Structured data is data with a data model, typically represented in tables; hence we often speak of tabular data. Unstructured data is data without a data model, which is not organised in a pre-defined manner.

Here are some examples of unstructured data:

  • Images
  • Sound files (e.g., mp3, wav)
  • Videos
  • Text in an email, book
  • 3D object files

Here are some examples of structured data:

  • Table in an Excel sheet
  • Table in a relational database
  • A CSV file (comma-separated values file)

Each of these data worlds has its own typical problems. Let us enumerate some typical machine learning problems for each type:

Structured:

  • Sales forecasting
  • Collaborative filters (recommendations)
  • Predicting organ function (medical, e.g. lung function decline)
  • Human resource candidate selection
  • House price prediction
  • Games player accuracy prediction
  • Election forecasting
  • Weather forecasting (e.g. precipitation forecasting)
  • Price prediction (e.g. car price prediction)
  • Volcanic eruption prediction
  • Customer transaction prediction
  • Ad demand prediction

Unstructured:

  • Image classification
  • Image segmentation
  • Video classification
  • Image generation (GAN)
  • Sentiment analysis (NLP)
  • POS tagging (Part of Speech, NLP)
  • Question answering (NLP)
  • Text generation (NLP)
  • Summarization (NLP)
  • Sequence to sequence (language translation, NLP)
  • Named entity recognition (NER, NLP)
  • 3D object detection
  • Prediction of 3D object movement

The list of machine learning problems is really long, but all problems can be categorised based on the data used. Here is a list based on Kaggle competitions that gives you an idea about the variety of machine learning problems:

https://ndres.me/kaggle-past-solutions/

ML Technologies in Structured / Unstructured Data

The following diagram illustrates the ML technologies in these different areas:

The unstructured world is dominated these days by deep learning and neural networks. They are also being used in the world of structured data, which, however, has not yet been fully conquered by them.

On Kaggle competitions, you still see many ML engineers and data scientists using technologies other than deep learning to solve problems related to structured data. One technology that stands out is gradient boosting with decision trees. In the latest competitions with structured data, we very often see the winners using Microsoft's LightGBM library.
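To make the technique itself concrete, here is a minimal from-scratch sketch of gradient boosting with decision trees. This is an illustration of the idea, not how LightGBM is implemented internally: each round fits a shallow regression tree to the residuals of the current ensemble, and the final prediction is the sum of all trees scaled by a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
trees = []
prediction = np.full_like(y, y.mean())  # start from the mean baseline

for _ in range(100):
    residuals = y - prediction            # pseudo-residuals for squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new):
    # ensemble prediction: baseline plus the scaled sum of all trees
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)

print(np.mean((y - prediction) ** 2))  # training MSE, far below the baseline variance
```

Libraries such as LightGBM and XGBoost refine this basic loop with regularisation, clever split finding and histogram-based speed-ups, but the core idea is the same.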

LightGBM is open source, was first released in 2016 - although it seems the first stable release was only in 2017 - and is very similar to another popular gradient boosting library: XGBoost.

Some examples of competitions which were won using this library are:

  • M5 Forecasting Accuracy - Estimate the unit sales of Walmart retail goods. 1st place solution: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163684
  • 2019 Data Science Bowl - Uncover the factors to help measure how young children learn. 1st place solution: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127469
  • IEEE-CIS Fraud Detection - Can you detect fraud from customer transactions? 1st place solution: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163684
  • University of Liverpool - Ion Switching - Identify the number of channels open at each time point. 1st place solution: https://www.kaggle.com/c/liverpool-ion-switching/discussion/153940

Please note that I am not saying these competitions were won using this library exclusively - some of the winning solutions used ensembles of different ML libraries, as in, e.g., the "2019 Data Science Bowl" competition.

Another very popular gradient boosting library is the previously mentioned XGBoost. According to Wikipedia, XGBoost was created in 2014, but it became really popular from 2016 onwards. It is a mature library, which can also be used from languages other than Python and C++. It seems to have been losing some ground to LightGBM, which tends to be faster and to use less memory. The way the two libraries build trees (split nodes) also differs: XGBoost traditionally grows trees level by level, whereas LightGBM grows them leaf-wise, expanding the leaf with the largest loss reduction first. Nonetheless, XGBoost is still quite popular on Kaggle.
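The difference between level-wise and leaf-wise growth can be sketched without either library, using scikit-learn trees as a stand-in: scikit-learn switches to best-first (leaf-wise) growth whenever `max_leaf_nodes` is set, while a plain `max_depth` cap gives the balanced, depth-limited shape. A leaf-wise tree with the same leaf budget is free to spend its leaves unevenly, going deeper where the loss reduction is largest.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(1000, 2))
# the signal is concentrated in one corner of the feature space
y = np.where((X[:, 0] > 0.9) & (X[:, 1] > 0.9), X[:, 0] * 10, 0.0)

# depth-limited growth: a balanced tree of at most 2**3 = 8 leaves
level_wise = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# best-first growth: same leaf budget, but leaves go where the loss drops most
leaf_wise = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

print("level-wise:", level_wise.get_n_leaves(), "leaves, depth", level_wise.get_depth())
print("leaf-wise: ", leaf_wise.get_n_leaves(), "leaves, depth", leaf_wise.get_depth())
```

This is only an analogy for the growth strategies; the actual split-finding machinery in XGBoost and LightGBM is considerably more sophisticated.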

Some examples of competitions which were won using this library are:

  • IEEE-CIS Fraud Detection - Can you detect fraud from customer transactions? 1st place solution: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163684
  • Rossmann Store Sales - Forecast sales using store, promotion, and competitor data. 1st place solution: https://storage.googleapis.com/kaggle-forum-message-attachments/102102/3454/Rossmann_nr1_doc.pdf
  • Otto Group Product Classification Challenge - Classify products into the correct category. 1st place solution: https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/14335

And there is a newer gradient boosting library which has been showing up more and more on Kaggle: CatBoost.

CatBoost was created by Yandex researchers and engineers and open-sourced in April 2017. It offers direct support for categorical features, as well as GPU support out of the box. It also seems to provide good results with default parameters in most cases.
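To give an idea of what categorical support means, here is a simplified sketch of target (mean) encoding, the kind of transformation CatBoost automates for you. CatBoost's actual scheme ("ordered target statistics") is more elaborate, in order to avoid target leakage; this only shows the basic idea: replace each category with a smoothed mean of the target.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Berlin", "Paris", "London"],
    "bought": [1, 0, 1, 0, 1, 0],
})

prior = df["bought"].mean()   # global mean, used for smoothing
smoothing = 2.0               # pseudo-count weight given to the prior

# per-category target mean and count
stats = df.groupby("city")["bought"].agg(["mean", "count"])

# smoothed encoding: rare categories are pulled towards the global mean
encoding = (stats["count"] * stats["mean"] + smoothing * prior) / (stats["count"] + smoothing)

df["city_encoded"] = df["city"].map(encoding)
print(df)
```

With plain one-hot encoding, high-cardinality columns explode into thousands of features; target statistics like these keep a single numeric column per categorical feature, which is part of why CatBoost handles such data well.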

Here are a couple of competitions, in which the winners used this library:

  • Home Credit Default Risk - Can you predict how capable each applicant is of repaying a loan? 1st place solution: https://www.kaggle.com/c/home-credit-default-risk/discussion/64821
  • 2019 Data Science Bowl - Uncover the factors to help measure how young children learn. 1st place solution: https://www.kaggle.com/c/data-science-bowl-2019/discussion/127469

Conclusion

Even though on Kaggle you see more and more ML practitioners using deep learning for tabular data problems, and slowly more competitors are getting good results with neural networks, they are still not beating the tree-based gradient boosting libraries. As far as I know, right now there is only one competition that was won with an RNN (recurrent neural network) type of model:

  • M4 Forecasting Competition - See https://en.wikipedia.org/wiki/Makridakis_Competitions. 1st place solution: https://www.sciencedirect.com/science/article/pii/S0169207019301153

But there is a lot of research going on aimed at dethroning the gradient boosting libraries. Google released a paper this year on TabNet, an attention-based neural network architecture for tabular data. So we might soon see new deep learning libraries competing at the top in tabular data competitions.