In this article, you will learn what cuDF is and how to use it in a pandas-like way to accelerate common data-wrangling tasks on the GPU via a short, hands-on example.
Topics we will cover include:
- The aim and distinctive features of cuDF.
- How to load, view, and perform simple data operations with cuDF in a dataframe-like fashion.
- How to compare cuDF performance on specific operations against plain pandas dataframes.
Let’s get into it.

A Hands-On Introduction to cuDF for GPU-Accelerated Data Workflows
Image by Editor | ChatGPT
Introduction
This article introduces, through a hands-on Python example, cuDF: one of the latest Python libraries from RAPIDS AI (NVIDIA's open-source data science suite) for GPU-accelerated data science and machine learning projects. Alongside its machine-learning-oriented sibling, cuML, cuDF is a valuable asset that is gaining popularity among practitioners seeking scalable solutions.
About cuDF: an Accelerated Pandas
RAPIDS cuDF is an open-source, dataframe-based library designed to mimic pandas' data-wrangling capabilities while speeding them up significantly. It has recently been integrated into popular data science environments like Google Colab, accelerating large-dataset processes typically carried out by pandas by up to 50x.
Among its most salient features:
- If you are familiar with pandas, you will find that its syntax and functions closely mirror the mainstream data-science library, minimizing the learning curve and easing migration for Python users.
- cuDF leverages NVIDIA GPUs through CUDA, thereby handling large-scale structured data operations much faster than CPU-oriented pandas.
- It fits well alongside other libraries in NVIDIA’s RAPIDS framework—most notably cuML for machine learning processes—offering methods and functions similar to those in scikit-learn for efficient processing of complex datasets.
Hands-On Introductory Example
To illustrate the basics of cuDF, we will consider a fairly large—yet publicly accessible—dataset in Jason Brownlee’s repository: the adult income dataset. This is a large, slightly class-unbalanced dataset intended for binary classification tasks, namely predicting whether an adult’s income level is high or low, based on demographic and socioeconomic features.
However, the scope of this tutorial is limited to managing and wrangling datasets in a pandas-like fashion while leveraging cuDF’s GPU capabilities.
IMPORTANT: To run the code below on Google Colab or a similar notebook environment, make sure you change the runtime type to GPU; otherwise, a warning will be raised indicating cuDF cannot find the specific CUDA driver library it utilizes.
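Before loading any data, it can help to verify that cuDF is actually importable in your session. The helper below is a minimal sketch (the function name `cudf_available` is our own, not part of any cuDF API): it simply attempts the import and reports the result, so you can fail fast with a clear message instead of hitting the CUDA driver warning mid-notebook.

```python
def cudf_available() -> bool:
    """Return True if cuDF imports successfully (which requires a
    CUDA-capable GPU and a matching driver), False otherwise."""
    try:
        import cudf  # noqa: F401
        return True
    except Exception:
        # ImportError or a CUDA initialization error both mean
        # we cannot use the GPU-accelerated path in this session.
        return False

print("cuDF available:", cudf_available())
```

On a Colab GPU runtime this should print `True`; on a CPU-only runtime it prints `False`, telling you to switch the runtime type before continuing.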
We start by importing some libraries:
```python
import cudf
import pandas as pd
import time
```
For a first quick performance comparison — and to showcase the minimal differences in usage — we will load the dataset twice: once in a regular pandas dataframe, and once more in a cuDF dataframe.
```python
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/adult-all.csv"

# Column names (they are not included in the dataset's CSV file we will read)
cols = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"
]

print("Loading with pandas...")
t0 = time.time()
df_pd = pd.read_csv(url, header=None, names=cols)
t1 = time.time()
print(f"Pandas loaded in {t1 - t0:.3f} sec")

print("Loading with cuDF...")
t0 = time.time()
df_cudf = cudf.read_csv(url, header=None, names=cols)
t1 = time.time()
print(f"cuDF loaded in {t1 - t0:.3f} sec")
```
The `time` module is used to measure execution times over instruction blocks. If you run the above code cell repeatedly, you will see that load times vary, but the general trend is that reading the dataset with cuDF is several times faster (this may not hold on the very first execution due to typical initial GPU setup overhead).
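As a side note, the repeated `t0`/`t1` bookkeeping can be wrapped in a small context manager. The sketch below uses only the standard library (`timer` is a hypothetical helper name, not part of cuDF or pandas), with `time.perf_counter()` preferred over `time.time()` for interval measurement:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Record a high-resolution start time, run the enclosed block,
    # then print the elapsed wall-clock time with a label.
    start = time.perf_counter()
    yield
    print(f"{label} took {time.perf_counter() - start:.3f} sec")

# Example: time any block of code, such as a CSV load.
with timer("toy workload"):
    sum(range(1_000_000))
```

With this helper, each measured step becomes a single `with timer("..."):` block instead of four lines of timing boilerplate.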
Next, an overview of both datasets. At this point, if you only want to stick to using cuDF without performing every step twice, simply remove the pandas-related (i.e. `df_pd`-related) instructions:
```python
print("Pandas shape:", df_pd.shape)
print("cuDF shape:", df_cudf.shape)

print("\nPandas head():")
display(df_pd.head())

print("\ncuDF head():")
display(df_cudf.head())
```
Once again, we see how simple it is to perform quick data exploration with cuDF if you are familiar with pandas.
Before finalizing, we will illustrate how to perform some simple data operations with cuDF. Specifically, we will take the `education` feature, which is categorical, and for all records under a given education category, compute the average value of `hours_per_week`. The process involves a somewhat computationally costly data operation: a grouping with the `groupby()` function, on which we will focus for comparing performance:
```python
t0 = time.time()
pd_result = df_pd.groupby("education")["hours_per_week"].mean()
t1 = time.time()
print(f"Pandas groupby took {t1 - t0:.3f} sec")

t0 = time.time()
cudf_result = df_cudf.groupby("education")["hours_per_week"].mean()
t1 = time.time()
print(f"cuDF groupby took {t1 - t0:.3f} sec")

print("\ncuDF result:")
print(cudf_result)
```
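To make explicit what a grouped mean computes, here is a plain-Python equivalent on a toy sample (an illustration of the semantics only, not how cuDF implements grouping on the GPU; `groupby_mean` and the sample rows are our own):

```python
from collections import defaultdict

def groupby_mean(rows, key, value):
    """Mean of `value` per category of `key`: a plain-Python
    illustration of groupby(key)[value].mean()."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

rows = [
    {"education": "Bachelors", "hours_per_week": 40},
    {"education": "Bachelors", "hours_per_week": 50},
    {"education": "HS-grad", "hours_per_week": 38},
]
print(groupby_mean(rows, "education", "hours_per_week"))
# -> {'Bachelors': 45.0, 'HS-grad': 38.0}
```

cuDF performs the same aggregation, but partitions and reduces the column in parallel on the GPU, which is why the speedup grows with dataset size.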
Aside from possibly the very first execution, you should see cuDF running much faster than standalone pandas for this operation.
Wrapping Up
This article provided a gentle, hands-on introduction to the cuDF library for GPU-accelerated, pandas-style handling of datasets. For further reading and learning, we recommend you check this related article that takes the example dataset we just analyzed and builds a machine learning model with one of cuDF's dedicated "sibling" libraries: cuML.