NVIDIA’s NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
The library boasts up to a 95x speedup over traditional CPU-based ETL tools such as NumPy, and was built specifically to accelerate the data pipeline of NVIDIA Merlin for building recommender systems. It achieves this by scaling ETL across multiple GPUs and nodes.
How to install NVIDIA’s NVTabular
The following prerequisites must be met in order to install NVTabular:
- CUDA version 10.1+
- Python version 3.7+
- NVIDIA Pascal GPU or later
NVTabular can be installed with Anaconda from the nvidia channel by running the following command:
conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=10.2
This should successfully install NVTabular on your system.
NOTE: At the moment, NVTabular only runs on Linux; other operating systems are not currently supported.
Getting Started with NVTabular Example Notebooks
To get started with NVTabular, there are multiple NVTabular Example Notebooks hosted on NVIDIA’s GitHub repository.
We suggest checking out the Jupyter notebooks in 'examples/getting-started-movielens' as an entry point for understanding the library. MovieLens 25M is a popular dataset for recommender systems and is widely used in academic publications.
The example notebooks are structured as follows and, per the repository, should be reviewed in this order:
- 01-Download-Convert.ipynb: Demonstrates how to download the dataset and convert it into the correct format so that it can be consumed.
- 02-ETL-with-NVTabular.ipynb: Demonstrates how to execute the preprocessing and feature engineering pipeline (ETL) with NVTabular on the GPU.
- 03-Training-with-PyTorch.ipynb: Demonstrates how to train a model with PyTorch based on the ETL output.
- 03-Training-with-TF.ipynb: Demonstrates how to train a model with TensorFlow based on the ETL output.
- 04-Triton-Inference-with-TF.ipynb: Demonstrates how to run inference with the NVIDIA Triton Inference Server.
The goal of the example notebooks is to show how NVIDIA Merlin uses NVTabular to perform ETL, train TensorFlow, PyTorch, or HugeCTR models on the ETL output, and then serve inference with Triton.