At the moment, I am playing around with a Kaggle competition that provides a very large dataset. The Riiid Answer Correctness Prediction competition is about predicting whether students will answer questions correctly over time. This means that, at inference time, predictive models need to process incoming data and be adjusted incrementally.
Having a large dataset (over 100 million records) and having to adjust the inference model as time goes on requires a lot of processing power and time. From the start of the competition, this has been one of the biggest challenges for most participants. The problem in the Riiid Answer Correctness Prediction competition is not so much the time you spend on training, but the duration of feature engineering before training and during inference.
How to speed up feature engineering?
So the main question is: how can you speed up feature engineering on large datasets?
Pandas and NumPy are the "go-to" libraries of most machine learning practitioners who work with Python, like me. They are awesome, powerful and fast, but they operate on the CPU. Both are written in the C programming language with a Python wrapper on top, yet in some situations you wish they were faster.
These are the libraries that are part of RAPIDS:

| Library | Description |
| --- | --- |
| cuDF | A Pandas-like dataframe manipulation library. It can be used as a replacement for the Pandas library and offers a large subset of its functionality. The name indicates the usage of CUDA and the DataFrame API. |
| cuML | A machine learning library that implements popular algorithms like k-nearest neighbours, random forests, logistic regression, etc. |
| cuGraph | "The RAPIDS cuGraph library is a collection of GPU accelerated graph algorithms that process data found in GPU DataFrames" (quote from https://github.com/rapidsai/cugraph). It implements graph theory algorithms such as Breadth First Search (BFS), Single Source Shortest Path (SSSP) and PageRank on single or multiple GPUs. |
| CLX | "CLX ('clicks') provides a collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases." (quote from https://github.com/rapidsai/clx). This is more a collection of examples targeting a specific developer audience. |
| cuxfilter | A library intended to be used by visualization libraries to access data in GPU memory. |
| cuSpatial | "cuSpatial is a GPU accelerated C++/Python library for accelerating GIS workflows including point-in-polygon, spatial join, coordinate systems, shape primitives, distances, and trajectory analysis" (quote from https://github.com/rapidsai/cuspatial). It contains GPU implementations of spatial distances, speed, etc. |
| cuSignal | A signal processing library, which is a port of SciPy Signal using GPUs. |
| Java + cuDF | Java bindings for the cuDF library. |
Below you can see a simple performance comparison between regular Pandas-based code and the equivalent cuDF code, measured in a normal Kaggle notebook with 16 GB of RAM and 4 CPU cores, plus a 16 GB GPU.
- Dropping columns and filtering with reset index on 100 million records: around 10 seconds in Pandas, around 5 seconds in cuDF
- Sorting 100 million records: around 23 seconds in Pandas, less than 1 second in cuDF!
- Merging a 100-million-record table with a 13,000-record table
- Filling missing values in a 100-million-record table
- Merging with a 90-million-record table plus a subtraction operation
- Aggregation over 90 million records
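The operations benchmarked above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame (all column and table names are made up, not taken from the competition data); it uses plain Pandas so it runs on any machine, and with RAPIDS installed the same calls run on the GPU via cuDF.

```python
import pandas as pd

# Tiny synthetic stand-in for the 100-million-row competition table.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "content_id": [10, 11, 10, 12, 11],
    "answered_correctly": [1.0, 0.0, 1.0, float("nan"), 1.0],
    "timestamp": [5, 3, 9, 1, 7],
})
questions = pd.DataFrame({"content_id": [10, 11, 12], "part": [1, 2, 2]})

# Dropping columns and filtering with reset index
filtered = df.drop(columns=["timestamp"]).query("user_id != 3").reset_index(drop=True)

# Sorting
ordered = df.sort_values(["user_id", "timestamp"])

# Merging the big table with a small lookup table
merged = df.merge(questions, on="content_id", how="left")

# Filling missing values
df["answered_correctly"] = df["answered_correctly"].fillna(0)

# Aggregation
stats = df.groupby("user_id")["answered_correctly"].agg(["mean", "count"])
```

On a dataframe this small, the GPU brings no benefit; the speed-ups quoted above only appear once the tables reach tens of millions of rows.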
Missing features in cuDF (0.15.0)
cuDF has an API that is very similar to the Pandas API. This means that in many cases you can copy your Pandas code into your cuDF notebook and you are done.
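A minimal sketch of that drop-in property (shown here with plain Pandas so it runs on any machine; the data is made up for illustration):

```python
import pandas as pd  # with RAPIDS installed: import cudf as pd

# The code below stays exactly the same between the two libraries;
# only the import decides whether it runs on the CPU or the GPU.
df = pd.DataFrame({"user_id": [1, 2, 1], "score": [0.2, 0.9, 0.4]})
mean_per_user = df.groupby("user_id")["score"].mean()
```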
But the Pandas library, whilst slower, still offers many features that cuDF does not have right now.
Here are a couple of features that are available in Pandas and missing in cuDF:
| Feature | Description |
| --- | --- |
| `.cumsum` | Cumulative sum in aggregations. |
| `.cumcount` | Cumulative count in aggregations. |
| `.last` | Extract the last n elements in aggregations. |
| `.loc` assigning multiple rows | In Pandas (but not in cuDF) you can use the `.loc` function to locate rows via the index and then assign values to multiple fields in one go. |
| `.loc` adding a new row | In Pandas you can add new rows with a chosen index value. This is possible in cuDF, but the index value cannot be set. |
| `.apply` | Apply a user-defined function to a series in Pandas. This function is not available in cuDF. |
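To make these gaps concrete, here is what each of the listed calls looks like in Pandas (the dataframe and column names are made up for illustration); at the time of writing, these are the calls you would have to work around when porting the code to cuDF 0.15:

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2, 2], "correct": [1, 0, 1, 1]})

# Cumulative sum and cumulative count within each group
df["cum_correct"] = df.groupby("user")["correct"].cumsum()
df["attempt_no"] = df.groupby("user")["correct"].cumcount()

# Last value per group in an aggregation
last_per_user = df.groupby("user")["correct"].last()

# .loc assignment to multiple rows in one go
df.loc[df["user"] == 1, ["correct"]] = 0

# .loc adding a new row with a chosen index value
df.loc[99] = [3, 1, 0, 0]

# .apply with a user-defined function on a series
df["flag"] = df["correct"].apply(lambda c: "ok" if c == 1 else "retry")
```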
For sure, more functions are missing, but we can expect the gap between the two libraries to narrow in the near future.
It is great that new open-source GPU-based libraries are being developed that cover different fields of data science. One of them is cuDF. This library is already useful for data scientists today, as it offers impressive speed gains for many operations, such as sorting, merging and aggregation.
But cuDF is a young library and still lacks many features that are available in Pandas. These gaps should close over the coming years, because most of the libraries mentioned in this blog are being developed very actively.