By: Robbie Jensen (NVIDIA), Will Hill (NVIDIA), and Jeff Nelson (Google Cloud)
TL;DR: Running into memory errors when loading big CSV or Parquet files in pandas can be a headache. If your local environment crashes during this process, here are three practical fixes you can try:
- Swap space: Let your system use the hard drive as extra memory. It’s simple to enable, but can slow things down.
- Sampling: Load a random portion of your data instead of the whole thing. This is quick and easy on memory, but comes at the cost of lower data fidelity.
- Unified Virtual Memory (UVM): When you are using a GPU-accelerated setup, NVIDIA cuDF lets you handle datasets bigger than your GPU’s memory by “spilling” extra data to your CPU’s RAM. This maintains data fidelity and allows processing beyond GPU limits with minimal performance overhead.
To get hands-on, check out the learning path Speed Up Data Analytics on GPUs.
Introduction
If you’ve ever faced an “Out of Memory” (OOM) error working with big data files, you know how disruptive it can be. It’s a common pain point when working with big data; by default, pandas tries to pull the whole file into RAM, but there just isn’t enough space.
This guide introduces three options to solve this issue and get you back to analyzing your data.
Why do OOM errors happen?
OOM errors can happen even when it seems like enough memory is available. “My 10GB file should fit in 16GB of RAM, right?” Unfortunately, it isn’t that straightforward. The issue lies in how pandas loads and represents your data types.
-
Numeric data: Values in CSVs are stored as text, but pandas turns them into 64-bit floats (
float64), which always take up 8 bytes each—even if the text version was shorter. -
Strings/objects: Pandas uses a generic ‘object’ data type for strings, adding lots of memory overhead compared to raw text.
-
Missing values: If an integer column has just one missing value (
NaN), pandas often converts it tofloat64, further swelling memory.
Because of this, a 10GB CSV can balloon to 30GB, 50GB, or even 100GB+ once loaded into RAM.
While you can manually optimize these data types (e.g., converting strings to ‘category’ or float64 to float32) to reduce this overhead, it requires tedious coding and prior knowledge of the dataset. The following solutions let you solve the problem without micro-managing your schema.
1. Use swap space (OS-Level Fix)
Using swap space is the traditional brute-force solution. By setting up a swap file (or partition), your operating system can use part of your disk as “extra” RAM. If you run out of physical RAM, the OS stashes idle memory pages on the disk, freeing up real RAM for your pandas job.
The example below shows how you would configure this on Linux.
# Create a 100GB swap file
sudo fallocate -l 100G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
This is fairly easy to implement, and doesn’t require any changes to your pandas code. However, disks are much slower than RAM, so your system may crawl or “thrash,” swapping data endlessly and delaying actual work.
2. Sample your data (data-level fix)
If you don’t need your entire dataset yet, you can sample it for fast exploratory analysis, visualization, or debugging.
import pandas as pd
import numpy as np
# Method 1: First N rows (schema check)
df_head = pd.read_csv("large_file.csv", nrows=1000)
# Method 2: Random 10% sample with chunking
chunks = []
for chunk in pd.read_csv("large_file.csv", chunksize=10000):
# Keep 10% of each chunk
chunks.append(chunk.sample(frac=0.1))
df_sample = pd.concat(chunks)
This is fast and saves memory, since only a fraction of the data gets loaded. Ultimately your results are based on incomplete data. Not ideal for production applications or imbalanced datasets, since samples could miss important information.
3. Use Unified Virtual Memory (UVM) with NVIDIA cuDF
This is the most powerful, modern solution if you have an NVIDIA GPU. cuDF is a GPU-accelerated DataFrame library that accelerates pandas. By enabling Unified Virtual Memory (or “UVM”, a feature of CUDA enabled in cuDF) you can fill up your GPU’s VRAM first, and then “spill” extra data into your much larger CPU RAM. Data in use will automatically be moved into the GPU, while unused data is moved back to the CPU.
# On Colab: install cuDF
!pip install cudf-cu12
import cudf
# Enable UVM spilling
cudf.set_option('spill', True)
# Use it like pandas!
df = cudf.read_csv("large_file.csv")
# Your analysis still runs on the GPU
print(df.head())
Using UVM you can load your whole dataset and enjoy GPU acceleration without being bottlenecked by memory. No code changes are required beyond enabling cuDF and UVM spilling.
Summary and key takeaways
The optimal choice is a trade-off between three key variables that vary based on the needs of your project: speed, hardware availability, and data fidelity.
- Swap Space works if you must load everything, but expect slowdowns.
- Sampling is ideal for quick explorations or debugging.
- Unified Virtual Memory with NVIDIA cuDF is best for demanding workflows on GPUs.
Looking ahead
Even UVM isn’t limitless - it’s capped by the sum of your CPU and GPU memory. For truly massive data (terabytes!), distributed computing is the next step. Tools like Dask let you spread pandas operations over many CPU cores or machines, while Dask-cuDF scales cuDF across clusters of GPUs. You can learn more about Dask-cuDF here.
Get hands-on with a new learning path, Speed Up Data Analytics on GPUs!