NVIDIA’s NVTabular – Superfast ETL using Python


NVIDIA’s NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

NVIDIA NVTabular for training recommender model

The library boasts a 95× speedup over traditional CPU-based ETL tools such as NumPy and was built specifically to accelerate NVIDIA Merlin's data pipeline for building recommender systems. It achieves this by scaling ETL across multiple GPUs and nodes.


How to install NVIDIA’s NVTabular

The following prerequisites must be met in order to install NVTabular:

  • CUDA version 10.1+
  • Python version 3.7+
  • NVIDIA Pascal GPU or later

NVTabular can be installed with Anaconda from the nvidia channel by running the following command:

conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=10.2

This should successfully install NVTabular in your system.
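Once the conda command finishes, a quick sanity check is to confirm the package is importable. The helper below is a hypothetical convenience (not part of NVTabular); it only probes the import system, so it works even before you touch a GPU:

```python
# Hypothetical post-install sanity check: verify that the nvtabular
# package can be resolved by Python's import machinery.
import importlib.util

def is_installed(pkg: str) -> bool:
    """Return True if `pkg` can be found by the import system."""
    return importlib.util.find_spec(pkg) is not None

print(is_installed("nvtabular"))  # True once installation has succeeded
```

If this prints `False`, double-check that you activated the conda environment the install command created.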

NOTE: NVTabular currently runs only on Linux; other operating systems are not supported.

Getting Started with NVTabular Example Notebooks

To get started with NVTabular, there are multiple NVTabular Example Notebooks hosted on NVIDIA’s GitHub repository.

We suggest checking out the Jupyter Notebooks in ‘examples/getting-started-movielens’ as an entry point for understanding the library. MovieLens 25M is a popular dataset for recommender systems and is widely used in academic publications.

The example notebooks are structured as follows and should be reviewed in this order as per the repository:

  • 01-Download-Convert.ipynb: Demonstrates how to download the dataset and convert it into the correct format so that it can be consumed.
  • 02-ETL-with-NVTabular.ipynb: Demonstrates how to execute the preprocessing and feature engineering pipeline (ETL) with NVTabular on the GPU.
  • 03-Training-with-PyTorch.ipynb: Demonstrates how to train a model with PyTorch based on the ETL output.
  • 03-Training-with-TF.ipynb: Demonstrates how to train a model with TensorFlow based on the ETL output.
  • 04-Triton-Inference-with-TF.ipynb: Demonstrates how to serve inference requests using NVIDIA Triton Inference Server.

The goal of the example notebooks is to show how NVIDIA Merlin uses NVTabular to perform ETL, train TensorFlow, PyTorch, or HugeCTR models on the resulting output, and then serve inferences using Triton.
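To give a flavor of what the ETL step in these notebooks does, consider categorification: mapping raw categorical values (such as movie IDs) to contiguous integer indices that an embedding layer can consume. The sketch below is a conceptual, pure-Python illustration only; NVTabular performs this on the GPU through its own Workflow and ops API, and the `categorify` function here is a hypothetical stand-in:

```python
# Conceptual sketch of categorification, the kind of preprocessing the
# ETL notebook performs: map each distinct categorical value to a
# contiguous integer id, reserving 0 for unknown/out-of-vocabulary
# values. This is NOT the NVTabular API, just an illustration.

def categorify(values):
    """Encode values as contiguous integer ids starting at 1;
    unseen values map to 0."""
    vocab = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    encoded = [vocab.get(v, 0) for v in values]
    return encoded, vocab

movie_ids = ["m3", "m1", "m3", "m7"]
encoded, vocab = categorify(movie_ids)
print(encoded)  # → [2, 1, 2, 3]
```

The resulting dense integer ids are what make large embedding tables practical: the model indexes an embedding row directly instead of hashing arbitrary raw values.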

