At the moment, I am playing around in a Kaggle competition that provides a very large dataset. The Riiid Answer Correctness Prediction competition is about predicting whether a student will answer questions correctly over time. This means that the predictive model has to process incoming data at inference time and be adjusted incrementally.

Having a large dataset (100+ million rows) and having to adjust the inference model over time requires a lot of processing power and time. From the start of the competition, this has been one of the biggest challenges for most participants. The problem in the Riiid Answer Correctness Prediction competition is not so much the time spent on training, but the duration of feature engineering before training and during inference.

How to speed up feature engineering?

So the main question is: how can you speed up feature engineering on large datasets? 

Pandas and Numpy are the "go-to" libraries of most machine learning practitioners who, like me, work with Python. They are awesome and powerful, and both are written in the C programming language with a Python wrapper on top, but they operate on the CPU, and in some situations you wish they were faster.

So are there any libraries similar to Pandas and Numpy that are much faster?

RAPIDS

Yes. Last week I got to know RAPIDS, a suite of GPU-accelerated libraries developed by NVIDIA.

"The RAPIDS suite of software libraries, built on CUDA-X AI, gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs" - https://developer.nvidia.com/rapids

All of the RAPIDS libraries run on NVIDIA graphics cards. They are built on top of the CUDA (Compute Unified Device Architecture) API as a C++ layer and wrapped with a user-friendly Python layer.

These are the libraries that are part of RAPIDS:

  • cuDF: a Pandas-like dataframe manipulation library. It can be used as a replacement for the Pandas library and offers a large subset of its functionality. The name indicates the use of CUDA and the DataFrame API.

  • cuML: a machine learning library, which implements popular algorithms like K-nearest neighbours, Random Forests, Logistic Regression, etc. (see the short sketch after this list).

  • cuGraph: "The RAPIDS cuGraph library is a collection of GPU accelerated graph algorithms that process data found in GPU DataFrames" (quote from https://github.com/rapidsai/cugraph). It implements graph theory algorithms, e.g. Breadth First Search (BFS), Single Source Shortest Path (SSSP) and Pagerank, on single or multiple GPUs.

  • CLX: "CLX ("clicks") provides a collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases." (quote from https://github.com/rapidsai/clx). It is more a collection of examples targeting a specific developer audience.

  • cuxfilter: "cuxfilter ( ku-cross-filter ) is a RAPIDS framework to connect web visualizations to GPU accelerated crossfiltering. Inspired by the javascript version of the original, it enables interactive and super fast multi-dimensional filtering of 100 million+ row tabular datasets via cuDF" (quote from https://github.com/rapidsai/cuxfilter). It is intended to be used by visualization libraries to access data in GPU memory.

  • cuSpatial: "cuSpatial is a GPU accelerated C++/Python library for accelerating GIS workflows including point-in-polygon, spatial join, coordinate systems, shape primitives, distances, and trajectory analysis" (quote from https://github.com/rapidsai/cuspatial). It contains GPU implementations of spatial distances, speeds, etc.

  • cuSignal: a signal processing library, which is a port of SciPy Signal to GPUs.

  • Java + cuDF: Java bindings for the cuDF library.
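
As a quick illustration of how cuML mirrors the scikit-learn style, here is a minimal sketch of training a random forest on a cuDF dataframe. The column names and data are made up for illustration, and the exact module paths and parameters may differ slightly between RAPIDS versions:

```python
import numpy as np
import cudf
from cuml.ensemble import RandomForestClassifier

# A small synthetic dataset, built directly in GPU memory
# (the feature/label names are purely illustrative).
n = 10_000
train = cudf.DataFrame({
    "prior_questions": np.random.randint(0, 100, size=n),
    "prior_accuracy": np.random.rand(n),
    "answered_correctly": np.random.randint(0, 2, size=n),
})

X = train[["prior_questions", "prior_accuracy"]].astype("float32")
y = train["answered_correctly"].astype("int32")

# The same fit/predict pattern as scikit-learn, executed on the GPU.
model = RandomForestClassifier(n_estimators=50, max_depth=8)
model.fit(X, y)
predictions = model.predict(X)
```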

Using cuDF

I have started using cuDF to speed up processing in the Riiid Answer Correctness Prediction competition, and in every operation I compared, cuDF was faster than Pandas.

Below is a simple performance comparison between regular Pandas-based code and the equivalent cuDF code, using a normal Kaggle notebook with 16 GB of RAM and 4 CPU cores, plus a 16 GB GPU (a rough timing sketch follows the list).

  • Dropping columns and filtering with reset index on 100 million records
    Pandas: around 10 seconds
    cuDF: around 5 seconds

  • Sorting 100 million records
    Pandas: around 23 seconds
    cuDF: less than 1 second!

  • Merging a 100-million-record table with another table of about 13,000 records
    Pandas: 15 seconds
    cuDF: 200 milliseconds!

  • Filling missing (NA) values in a 100-million-record table
    Pandas:
    cuDF: 29 milliseconds!

  • Merging with a 90-million-record table followed by a subtraction operation
    Pandas: 15 seconds
    cuDF: 1 second!

  • Aggregating 90 million records
    Pandas: 22 seconds
    cuDF: 1 second
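
For context, here is a rough sketch of the kind of timing comparison behind these numbers, using the sort as the example. The dataframe size and column names are illustrative, not the actual competition data:

```python
import time

import numpy as np
import pandas as pd
import cudf

N = 100_000_000  # number of rows; shrink this if you run out of memory
pdf = pd.DataFrame({
    "user_id": np.random.randint(0, 400_000, size=N),
    "timestamp": np.random.randint(0, 10**9, size=N),
})

# Time the sort on the CPU with Pandas.
start = time.perf_counter()
pdf_sorted = pdf.sort_values(["user_id", "timestamp"])
print(f"Pandas sort: {time.perf_counter() - start:.2f} s")

# Copy the dataframe to GPU memory and time the same sort with cuDF.
gdf = cudf.from_pandas(pdf)
start = time.perf_counter()
gdf_sorted = gdf.sort_values(["user_id", "timestamp"])
print(f"cuDF sort:   {time.perf_counter() - start:.2f} s")
```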

Missing features in cuDF (0.15.0)

cuDF has an API that is very similar to the Pandas API. This means that in many cases you can copy your Pandas code into your cuDF notebook and you are done.
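
For example, a typical group-by feature looks the same in both libraries; only the import changes (and whether the data lives in CPU or GPU memory). The column names below are just an illustration:

```python
import pandas as pd
import cudf

data = {"user_id": [1, 1, 2, 2], "answered_correctly": [1, 0, 1, 1]}

# Pandas: runs on the CPU.
pdf = pd.DataFrame(data)
cpu_means = pdf.groupby("user_id")["answered_correctly"].mean()

# cuDF: the exact same calls, but the data is stored and processed on the GPU.
gdf = cudf.DataFrame(data)
gpu_means = gdf.groupby("user_id")["answered_correctly"].mean()
```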

But the Pandas library, whilst slower, still has many features that cuDF does not offer right now.

Here are a couple of features that are available in Pandas and missing in cuDF:

  • .cumsum: cumulative sum in aggregations.

  • .cumcount: cumulative count in aggregations.

  • .last: extracting the last n elements in aggregations.

  • .loc assigning multiple rows: in Pandas (but not in cuDF) you can use .loc to locate rows via the index and assign values to multiple rows in one go (see the Pandas sketch after this list).

  • .loc adding a new row: in Pandas you can add new rows with a specific index value. This is possible in cuDF too, but the index value cannot be set.

  • .apply: applying a user-defined function to a Series, as in Pandas, is not available.
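
To make this concrete, here is a small Pandas sketch of the kind of feature-engineering code that, as of cuDF 0.15, still needs a workaround on the GPU. The column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 1],
})

# Cumulative count and sum per group: straightforward in Pandas,
# missing from cuDF 0.15 aggregations.
df["attempt_number"] = df.groupby("user_id").cumcount()
df["correct_so_far"] = df.groupby("user_id")["answered_correctly"].cumsum()

# Assigning values to multiple rows in one go via .loc:
# works in Pandas, not (yet) in cuDF 0.15.
df.loc[df["user_id"] == 1, "answered_correctly"] = 0
```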

More functions are certainly missing, but we can expect the gap between the two libraries to narrow in the near future.

Conclusion

It is great that new open-source GPU-based libraries, covering different fields of data science, are being developed. One of them is cuDF. This library is definitely useful for data scientists today, as it offers impressive speed gains for many operations, such as sorting, merging and aggregation.

But cuDF is a young library and still lacks many features that are available in Pandas. These gaps should close over the coming years, because most of the libraries mentioned in this blog are being developed very actively.