NVIDIA's NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate the terabyte-scale datasets used to train deep-learning-based recommender systems.
The library boasts up to a 95x speedup over CPU-based ETL tools such as NumPy. It was built specifically to accelerate the data pipeline of NVIDIA Merlin for building recommender systems, which it does by scaling ETL across multiple GPUs and nodes.
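To give a sense of the API, here is a minimal sketch of an NVTabular preprocessing workflow. The column names and file paths are placeholders, and the operator-graph syntax shown applies to more recent releases, so treat it as illustrative rather than version-exact:

import nvtabular as nvt
from nvtabular import ops

# Declare an operator graph: categorical columns are label-encoded on the GPU,
# continuous columns are standardized. Column names here are hypothetical.
cat_features = ["userId", "movieId"] >> ops.Categorify()
cont_features = ["timestamp"] >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

# nvt.Dataset streams the data in GPU-sized chunks, so the full dataset never
# has to fit in memory at once.
dataset = nvt.Dataset("train.parquet")

workflow.fit(dataset)  # compute statistics such as category mappings, means, stds
workflow.transform(dataset).to_parquet("processed/")  # apply the ops and write out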
Before installing NVTabular, make sure your system meets the prerequisites: a Linux machine with an NVIDIA GPU and a CUDA toolkit version matching the package you install (the command below targets CUDA 10.2).
NVTabular can be installed with Anaconda from the nvidia channel by running the following command:
conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=10.2
This should install NVTabular on your system.
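Once the command completes, a quick way to confirm the installation is to import the library and print its version:

python -c "import nvtabular; print(nvtabular.__version__)"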
NOTE: At the moment, NVTabular only runs on Linux; other operating systems are not supported.
To get started with NVTabular, there are multiple example notebooks hosted in NVIDIA's GitHub repository.
We suggest checking out the Jupyter notebooks in 'examples/getting-started-movielens' as an entry point for understanding the library. MovieLens 25M is a popular dataset for recommender systems and is widely used in academic publications.
As laid out in the repository, the example notebooks are structured to be reviewed in order: data download and conversion, ETL, model training, and inference. Their goal is to show how NVIDIA Merlin uses NVTabular to perform ETL, subsequently train TensorFlow, PyTorch, or HugeCTR models on the processed data, and then serve inferences using Triton.
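As an illustration of how the preprocessed output feeds into training, the sketch below loads NVTabular-processed Parquet files into TensorFlow with NVTabular's Keras data loader. The paths, column names, and batch size are placeholders, and the loader's module path has moved between releases, so check the version you installed:

import glob
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Hypothetical location of Parquet files written by an NVTabular workflow.
train_paths = glob.glob("processed/train/*.parquet")

# The loader reads Parquet chunks on the GPU and yields batches directly to
# Keras, avoiding a CPU-side input bottleneck during training.
train_loader = KerasSequenceLoader(
    train_paths,
    batch_size=65536,
    label_names=["rating"],
    cat_names=["userId", "movieId"],
    cont_names=[],
    engine="parquet",
    shuffle=True,
)

# model.fit(train_loader, epochs=1)  # any Keras model with matching inputs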