By Jeff Nelson (Google) and Will Hill (NVIDIA)
TL;DR: Speed up your pandas and scikit-learn machine learning workflows by up to an order of magnitude on Google Cloud’s Colab Enterprise using NVIDIA GPUs, with zero code rewrites: just a single extension-loading command.
To get hands-on, check out the learning path Accelerated Machine Learning with GPUs.
You wrote a Python script and tested it on a sample CSV. It worked. But when you ran it on the full 10GB dataset, it stalled: the progress bar crawled, or the kernel crashed with the dreaded “out of memory” error.
Analyzing data and training models on large datasets can be time-consuming on CPUs. You can speed up existing workflows by 50x or more without learning new APIs or rewriting code.
Google Cloud’s Colab Enterprise and NVIDIA CUDA-X™ open-source libraries make this possible. This post shows how to use GPU acceleration for pandas and scikit-learn workflows, often with zero code changes.
Tech Stack: Colab Enterprise + NVIDIA RAPIDS
To go fast, you need powerful infrastructure and efficient software.
Colab Enterprise is Google Cloud’s managed notebook environment. It combines Colab with enterprise-grade security and compliance. It integrates with BigQuery and Vertex AI. Most importantly, it gives you access to NVIDIA GPUs (like the L4 and A100) through Runtime Templates. These templates let you define a consistent environment for a team.
NVIDIA CUDA-X Data Science is a collection of open-source libraries that accelerate popular data science libraries and platforms on NVIDIA GPUs.
- NVIDIA cuDF accelerates popular data frame libraries like pandas, Polars, and Apache Spark.
- NVIDIA cuML accelerates scikit-learn, UMAP, and HDBSCAN.
1. Instant data processing speedups with cudf.pandas
Data preparation is a common bottleneck in ML pipelines. Loading, filtering, and joining millions of rows on a CPU does not parallelize efficiently.
With cudf.pandas, you can run existing and new pandas code on the GPU. It works by intercepting pandas calls: if an operation supports GPU acceleration, it runs on the GPU; if not, it gracefully falls back to the CPU, with the data frame automatically and efficiently shared between host and GPU memory.
Load the extension at the top of your notebook, before you import pandas. No other changes are required.
%load_ext cudf.pandas
import pandas as pd
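To make the interception-and-fallback behavior concrete, here is a toy sketch of the dispatch pattern in plain Python. This is an invented illustration, not cuDF’s actual implementation; every class and method name below is made up for clarity:

```python
class AcceleratedFrame:
    """Toy proxy illustrating cudf.pandas-style dispatch with CPU fallback.

    Invented sketch: the real cudf.pandas intercepts pandas calls
    transparently; none of these names come from the actual library.
    """

    GPU_SUPPORTED = {"groupby_mean", "merge"}  # operations with a GPU kernel

    def __init__(self, data):
        self.data = data
        self.log = []  # records where each operation ran

    def run(self, op_name, cpu_impl):
        if op_name in self.GPU_SUPPORTED:
            self.log.append((op_name, "gpu"))
            return cpu_impl(self.data)  # a real library would launch a CUDA kernel here
        # Graceful fallback: unsupported operations run on the CPU instead of failing
        self.log.append((op_name, "cpu"))
        return cpu_impl(self.data)


def groupby_mean(rows):
    """Group (key, value) pairs by key and average the values."""
    totals = {}
    for key, value in rows:
        count, total = totals.get(key, (0, 0))
        totals[key] = (count + 1, total + value)
    return {key: total / count for key, (count, total) in totals.items()}


frame = AcceleratedFrame([("a", 1), ("a", 3), ("b", 5)])
result = frame.run("groupby_mean", groupby_mean)   # dispatched to the "GPU" path
frame.run("custom_apply", lambda rows: len(rows))  # falls back to the "CPU" path
```

The same pattern explains why a non-vectorized `.apply` with arbitrary Python code may fall back to the CPU while standard aggregations stay on the GPU.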
Before vs. After
Standard CPU pandas:
import pandas as pd
# Takes minutes on large data
df = pd.read_parquet("large_dataset.parquet")
df = df.groupby("category").agg({"amount": "mean"})
GPU-accelerated pandas:
%load_ext cudf.pandas
import pandas as pd
# Takes seconds
df = pd.read_parquet("large_dataset.parquet")
df = df.groupby("category").agg({"amount": "mean"})
Benchmarks show this can deliver speedups of 150x or more for standard data operations compared to CPU execution.
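Speedups depend on your data and operations, so it is worth measuring on your own workload. A minimal stdlib timing helper (the function name is illustrative, not from any library) can be run on the same cell with and without the extension loaded:

```python
import time


def time_op(fn, *args, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Example: time an aggregation over synthetic data; substitute your own
# pandas pipeline to compare CPU-only and cudf.pandas-accelerated runs.
data = list(range(100_000))
baseline = time_op(lambda xs: sum(xs) / len(xs), data)
print(f"best of 3: {baseline:.6f}s")
```

Taking the best of several repeats reduces noise from warm-up effects such as JIT compilation and GPU memory allocation on the first call.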
2. Faster training with cuml.accel and scikit-learn
Once data is prepared, you need to train a model. Algorithms like Random Forest, Linear Regression, and t-SNE are staples of data science.
NVIDIA’s cuML library accelerates these algorithms by parallelizing training and inference execution on NVIDIA GPUs. Similar to cudf.pandas, use cuml.accel to accelerate scikit-learn functions on the NVIDIA GPU. Load the cuml extension prior to importing scikit-learn APIs:
%load_ext cuml.accel
from sklearn.ensemble import RandomForestRegressor
# This runs on the GPU automatically
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
This acceleration lets you run experiments faster. You check results in seconds or minutes instead of hours. This rapid iteration loop allows you to run more experiments and improve your final model.
3. GPU-accelerated XGBoost
XGBoost is another high-performance machine learning library. XGBoost has native support for NVIDIA GPUs. Setting the device parameter to cuda enables GPU acceleration.
# Train on GPU
import xgboost as xgb

model = xgb.XGBRegressor(
    tree_method='hist',
    device='cuda',
    n_estimators=100
)
model.fit(X_train, y_train)
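If your notebook sometimes runs on a CPU-only runtime, you can build the estimator’s keyword arguments conditionally so the same code works in both environments. The helper function below is a hypothetical convenience, not part of XGBoost’s API; only `tree_method`, `device`, and `n_estimators` are real XGBoost parameters:

```python
def xgb_params(use_gpu: bool, n_estimators: int = 100) -> dict:
    """Build XGBoost keyword arguments for GPU or CPU training.

    Hypothetical helper: the function itself is not from the XGBoost
    library; tree_method/device/n_estimators are its real parameters.
    """
    return {
        "tree_method": "hist",                 # histogram-based tree building
        "device": "cuda" if use_gpu else "cpu",  # select the execution device
        "n_estimators": n_estimators,
    }


params = xgb_params(use_gpu=True)
# model = xgb.XGBRegressor(**params)  # then fit and predict as usual
```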
Find your bottlenecks
NVIDIA cuDF and cuML also provide profilers to help debug and isolate performance bottlenecks.
Use %%cudf.pandas.profile or %%cuml.accel.profile in a notebook cell to get a report on which operations ran on the GPU, which fell back to the CPU, and the execution time of each function.
%%cudf.pandas.profile
# Your existing data processing code here...
df.groupby("id").apply(complex_function)
The output shows if a specific line (like a complex, non-vectorized .apply function) forces a CPU fallback. This helps you identify where to optimize your code.
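The idea behind such a report can be illustrated with a toy profiler that runs each operation, records where it ran, and times it. This is an invented sketch of the concept; the real %%cudf.pandas.profile magic produces its own report format:

```python
import time


def profile(operations):
    """Toy per-operation profile: run each (name, backend, fn) and time it.

    Invented illustration of a GPU/CPU placement report; not the output
    of the actual cudf.pandas or cuml.accel profilers.
    """
    report = []
    for name, backend, fn in operations:
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        report.append((name, backend, elapsed))
    return report


# Hypothetical placements mirroring the fallback behavior described above
ops = [
    ("read_parquet", "gpu", lambda: sum(range(1000))),
    ("groupby.agg", "gpu", lambda: sorted(range(1000))),
    ("apply(complex_function)", "cpu", lambda: [x * x for x in range(1000)]),
]
for name, backend, elapsed in profile(ops):
    print(f"{name:<26} {backend:<4} {elapsed:.6f}s")
```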
Try it yourself
Speed is about staying in the flow. When code runs instantly, you can ask more questions of data and build better products.
We built a hands-on notebook that takes you through the process using the NYC Taxi dataset. You will set up a Colab Enterprise runtime, accelerate pandas data preparation, and train models with scikit-learn and XGBoost on Google Cloud with NVIDIA GPUs.
Get started today with the learning path, Accelerated Machine Learning with GPUs!