A couple of days ago the Riiid AIEd Challenge 2020 - a Kaggle machine learning competition - finished. I participated in this competition and really enjoyed it, due to the interaction with other competitors and a very helpful and open Kaggle community.
But what is the Riiid AIEd Challenge 2020? The official description of the competition is:
In this competition, your challenge is to create algorithms for "Knowledge Tracing," modelling student knowledge over time. The goal is to predict how students will perform in future interactions accurately. You will pair your machine learning skills using Riiid’s EdNet data.
There is also a competition website with more information:
In this competition, you need to predict whether a user answers a quiz question correctly or not - so we have a binary prediction target. The data is strictly tabular, and we are dealing here with a time series - so we have to take the history of user answers into account to make accurate predictions.
The training data set contains 100 million records with past user interactions, information about the quiz questions and lectures. The set of lectures and quizzes is pre-determined, but the total number of users is unknown during the inference (prediction) phase.
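To make the "history of user answers" idea a bit more concrete, here is a minimal sketch of how running per-user statistics could be maintained. It is not taken from any competition solution: the class name `UserHistory`, the feature names and the neutral prior of 0.5 are my own assumptions for illustration. The two points it tries to show are that features are read before the state is updated (so the current answer never leaks into its own features), and that a user who was never seen before simply starts from the prior.

```python
class UserHistory:
    """Running per-user statistics (hypothetical helper, for illustration only)."""

    def __init__(self, prior=0.5):
        self.prior = prior    # assumed accuracy for users without any history
        self.attempts = {}    # user_id -> number of answered questions so far
        self.correct = {}     # user_id -> number of correct answers so far

    def features(self, user_id):
        """Return features built only from interactions seen before this one."""
        n = self.attempts.get(user_id, 0)
        c = self.correct.get(user_id, 0)
        accuracy = c / n if n else self.prior
        return {"attempts": n, "accuracy": accuracy}

    def update(self, user_id, answered_correctly):
        """Record the outcome once the true answer is known."""
        self.attempts[user_id] = self.attempts.get(user_id, 0) + 1
        self.correct[user_id] = self.correct.get(user_id, 0) + int(answered_correctly)


# Toy interaction log: (user_id, answered_correctly), in chronological order.
log = [(1, 1), (1, 0), (2, 1), (1, 1)]

history = UserHistory()
rows = []
for user_id, answered_correctly in log:
    rows.append(history.features(user_id))       # read features first: no leakage
    history.update(user_id, answered_correctly)  # then update the state
```

The resulting `rows` could then be fed to a tabular model as additional columns. A real solution would track many more statistics per user and per question, but the read-then-update pattern stays the same.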
If you ask me what my predominant impression of this competition was: openness. And openness from the start, when I noticed that the organisers presented their paper on SAINT+ ("Integrating Temporal Features for EdNet Correctness Prediction").
At the start of the competition, the participants were invited to use this neural network architecture, which is based on the Transformer model ("Attention Is All You Need") published in 2017 by Google researchers.
This spirit of openness kept going during the competition, with lots of really good notebooks being published.
At the end of the competition, I noticed that the notebooks of the second-place solution were immediately published:
With so many great solutions for this problem domain being published, you can have a great learning experience.
Gradient Boosting vs Transformers
The most popular gradient boosting library seems to be LightGBM.
But neural networks seem to be gradually taking over in the domain of tabular data. The second-place solution, for example, uses a Transformer-based architecture implemented in PyTorch (note: this solution is better described by "LSTM-Encoded SAKT-like TransformerEncoder"). The author of this solution used an ensemble of 6 different neural networks for inference.
My own solution, which landed me in the top 7%, is a simple ensemble of a LightGBM model and a neural network that implements the ideas in the SAKT paper.
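As a rough illustration of what a "simple ensemble" of two models can look like, here is a sketch of a weighted average of two sets of predicted probabilities. This is not my actual competition code: the function names, the toy numbers and the use of AUC as the validation metric are assumptions made for this example, and the predictions are invented to show the effect.

```python
def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two lists of predicted probabilities."""
    if len(pred_a) != len(pred_b):
        raise ValueError("both models must predict the same rows")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Quadratic in the number of rows, so only suitable for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented validation data: true labels and the outputs of two models.
labels = [1, 0, 1, 0]
gbm_pred = [0.9, 0.6, 0.4, 0.2]  # stands in for the gradient boosting model
nn_pred = [0.3, 0.2, 0.8, 0.5]   # stands in for the neural network

blended = blend(gbm_pred, nn_pred, weight_a=0.5)
```

In this toy example each model ranks one positive/negative pair the wrong way round, but they disagree on which pair, so the average ranks all pairs correctly. That is the intuition behind blending two different model families: their errors are often not the same. In practice the weight would be chosen on a held-out validation set.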
A couple of years ago, gradient boosting libraries would probably have dominated "knowledge tracing" machine learning competitions, but right now newer neural network architectures seem to be highly competitive.
Participating in machine learning competitions is always a great learning experience and a way to stay up to date and experience trends in the machine learning world. I am also really grateful to Kaggle for what they offer for free (I honestly find it just amazing).
The Riiid AIEd Challenge 2020 was a particularly friendly competition: the organizers provided really excellent input on potential solutions from the start, and the competitors were engaged and open.
We can also spot a clear trend: neural network-based architectures are really taking control of the scene, even though gradient boosting libraries still perform quite well in knowledge tracing problems.