I recently wrote an article where I walked through some of the newer DataFrame tools in Python, such as Polars and DuckDB.
I explored how they can enhance the data science workflow and perform more effectively when handling large datasets.
Here’s a link to the article.
The whole idea was to give data professionals a feel of what “modern dataframes” look like and how these tools could reshape the way we work with data.
But something interesting happened: from the feedback I got, I realized that a lot of data scientists still rely heavily on Pandas for most of their day-to-day work.
And I totally understand why.
Even with all the new options out there, Pandas remains the backbone of Python data science.
And this isn’t even just based on a few comments.
A recent State of Data Science survey reports that 77% of practitioners use Pandas for data exploration and processing.
I like to think of Pandas as that reliable old friend you keep calling: maybe not the flashiest, but you know it always gets the job done.
So, while the newer tools absolutely have their strengths, it’s clear that Pandas isn’t going anywhere anytime soon.
And for many of us, the real challenge isn’t replacing Pandas; it’s making it more efficient and a bit less painful when we’re working with larger datasets.
In this article, I’ll walk you through seven practical ways to speed up your Pandas workflows. These are simple to implement yet capable of making your code noticeably faster.
Setup and Prerequisites
Before we jump in, here’s what you’ll need. I’m using Python 3.10+ and Pandas 2.x in this tutorial. If you’re on an older version, you can just upgrade it quickly:
pip install --upgrade pandas
That’s really all you need. A standard environment, such as Jupyter Notebook, VS Code, or Google Colab, works fine.
If you already have NumPy installed, as most people do, everything else in this tutorial should run without any extra setup.
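If you’re not sure what you’re running, a quick sanity check like this (nothing here is specific to the tutorial) will tell you whether you’re on Python 3.10+ and Pandas 2.x:

```python
import sys
import pandas as pd

# Print the interpreter and Pandas versions so you can confirm
# you're on Python 3.10+ and Pandas 2.x before following along.
print(sys.version_info[:2])
print(pd.__version__)
```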
1. Speed Up read_csv With Smarter Defaults
I remember the first time I worked with a 2GB CSV file.
My laptop fans were screaming, the notebook kept freezing, and I was staring at the progress bar, wondering if it would ever finish.
I later realized that the slowdown wasn’t because of Pandas itself, but rather because I was letting it auto-detect everything and loading all 30 columns when I only needed 6.
Once I started specifying data types and selecting only what I needed, things became noticeably faster.
Tasks that normally had me staring at a frozen progress bar now ran smoothly, and I finally felt like my laptop was on my side.
Let me show you exactly how I do it.
Specify dtypes upfront
When you force Pandas to guess data types, it has to scan the entire file. If you already know what your columns should be, just tell it directly:
df = pd.read_csv(
"sales_data.csv",
dtype={
"store_id": "int32",
"product_id": "int32",
"category": "category"
}
)
Load only the columns you need
Sometimes your CSV has dozens of columns, but you only care about a few. Loading the rest just wastes memory and slows down the process.
cols_to_use = ["order_id", "customer_id", "price", "quantity"]
df = pd.read_csv("orders.csv", usecols=cols_to_use)
Use chunksize for huge files
For very large files that don’t fit in memory, reading in chunks allows you to process the data safely without crashing your notebook.
chunks = pd.read_csv("logs.csv", chunksize=50_000)
for chunk in chunks:
# process each chunk as needed
    pass
Simple, practical, and it actually works.
Once you’ve got your data loaded efficiently, the next thing that’ll slow you down is how Pandas stores it in memory.
Even if you’ve loaded only the columns you need, using inefficient data types can silently slow down your workflows and eat up memory.
That’s why the next trick is all about choosing the right data types to make your Pandas operations faster and lighter.
2. Use the Right Data Types to Cut Memory and Speed Up Operations
One of the easiest ways to make your Pandas workflows faster is to store data in the right type.
A lot of people stick with the default object or float64 types. These are flexible, but trust me, they’re heavy.
Switching to smaller or more suitable types can reduce memory usage and noticeably improve performance.
Convert integers and floats to smaller types
If a column doesn’t need 64-bit precision, downcasting can save memory:
# Example dataframe
df = pd.DataFrame({
"user_id": [1, 2, 3, 4],
"score": [99.5, 85.0, 72.0, 100.0]
})
# Downcast integer and float columns
df["user_id"] = df["user_id"].astype("int32")
df["score"] = df["score"].astype("float32")
Use category for repeated strings
String columns with lots of repeated values, like country names or product categories, benefit massively from being converted to category type:
df["country"] = df["country"].astype("category")
df["product_type"] = df["product_type"].astype("category")
This saves memory and makes operations like filtering and grouping noticeably faster.
Check memory usage before and after
You can see the effect immediately:
df.info(memory_usage="deep")
I’ve seen memory usage drop by 50% or more on large datasets. And when you’re using less memory, operations like filtering and joins run faster because there’s less data for Pandas to shuffle around.
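If you’d rather not pick dtypes by hand, pd.to_numeric with its downcast option will choose the smallest safe numeric type for you. Here’s a small sketch using a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "score": [99.5, 85.0, 72.0, 100.0],
})

# Let Pandas pick the smallest numeric type that can hold each column.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
df["score"] = pd.to_numeric(df["score"], downcast="float")

print(df.dtypes)
```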
3. Stop Looping. Start Vectorizing
One of the biggest performance mistakes I see is using Python loops or .apply() for operations that can be vectorized.
Loops are easy to write, but Pandas is built around vectorized operations that run in compiled C code under the hood, and those are much faster.
Slow approach using .apply() (or a loop):
# Example: adding 10% tax to prices
df["price_with_tax"] = df["price"].apply(lambda x: x * 1.1)
This works fine on small datasets, but once you hit hundreds of thousands of rows, it starts crawling.
Fast vectorized approach:
# Vectorized operation
df["price_with_tax"] = df["price"] * 1.1
That’s it. Same result, orders of magnitude faster.
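The same idea applies to conditional logic. Instead of an .apply() with an if/else, np.where keeps the whole thing vectorized. A small sketch with made-up prices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [40.0, 120.0, 250.0, 15.0]})

# Slow: row-by-row Python function calls via .apply()
df["tier_slow"] = df["price"].apply(
    lambda p: "premium" if p > 100 else "standard"
)

# Fast: one vectorized pass over the whole column
df["tier_fast"] = np.where(df["price"] > 100, "premium", "standard")

print(df)
```

Both columns end up identical; the np.where version just gets there without calling a Python function once per row.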
4. Use loc and iloc the Right Way
I once tried filtering a large dataset with something like df[df["price"] > 100]["category"]. Not only did Pandas throw warnings at me, but the code was slower than it should’ve been.
I learned pretty quickly that chained indexing is messy and inefficient; it can also lead to subtle bugs and performance issues.
Using loc and iloc properly makes your code faster and easier to read.
Use loc for label-based indexing
When you want to filter rows and select columns by name, loc is your best bet:
# Select rows where price > 100 and only the 'category' column
filtered = df.loc[df["price"] > 100, "category"]
This is safer and faster than chaining, and it avoids the infamous SettingWithCopyWarning.
Use iloc for position-based indexing
If you prefer working with row and column positions:
# Select first 5 rows and the first 2 columns
subset = df.iloc[:5, :2]
Using these methods keeps your code clean and efficient, especially when you’re doing assignments or complex filtering.
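Where loc really pays off is assignment. Writing through a chained selection may silently modify a copy of the data, while a single .loc call updates the original in place. A small sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [50.0, 150.0, 220.0],
    "category": ["a", "b", "c"],
})

# Risky: df[df["price"] > 100]["category"] = ... may write to a copy.
# Safe: one .loc call with a row mask and a column label.
df.loc[df["price"] > 100, "category"] = "premium"

print(df)
```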
5. Use query() for Faster, Cleaner Filtering
When your filtering logic starts getting messy, query() can make things feel a lot more manageable.
Instead of stacking multiple boolean conditions inside brackets, query() lets you write filters in a cleaner, almost SQL-like syntax.
And in many cases, it runs faster because Pandas can optimize the expression internally.
# More readable filtering using query()
high_value = df.query("price > 100 and quantity < 50")
This comes in handy especially when your conditions start to stack up or when you want your code to look clean enough that you can revisit it a week later without wondering what you were thinking.
It’s a simple upgrade that makes your code feel more intentional and easier to maintain.
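query() can also reference local Python variables with the @ prefix, which keeps your thresholds out of the filter string. A quick sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [80, 150, 200],
    "quantity": [10, 60, 20],
})

min_price = 100
max_qty = 50

# @ pulls in variables from the surrounding Python scope.
high_value = df.query("price > @min_price and quantity < @max_qty")
print(high_value)
```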
6. Convert Repetitive Strings to Categoricals
If you have a column filled with repeated text values, such as product categories or location names, converting it to categorical type can give you an immediate performance boost.
I’ve experienced this firsthand.
Pandas stores categorical data in a much more compact way by replacing each unique value with an internal numeric code.
This helps reduce memory usage and makes operations on that column faster.
# Converting a string column to a categorical type
df["category"] = df["category"].astype("category")
Categoricals will not do much for messy, free-form text, but for structured labels that repeat across many rows, they’re one of the simplest and most effective optimizations you can make.
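You can measure the savings directly with memory_usage(deep=True). Here’s a small sketch using a made-up column of repeated labels:

```python
import pandas as pd

# 100,000 rows drawn from just three repeated labels.
s = pd.Series(["electronics", "clothing", "groceries"] * 100_000)

before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)

print(f"object:   {before:,} bytes")
print(f"category: {after:,} bytes")
```

With only three unique values, the categorical version stores each row as a small integer code, so the difference is dramatic.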
7. Load Large Files in Chunks Instead of All at Once
One of the fastest ways to overwhelm your system is to try to load a massive CSV file all at once.
Pandas will try pulling everything into memory, and that can slow things to a crawl or crash your session entirely.
The solution is to load the file in manageable pieces and process each one as it comes in. This approach keeps your memory usage stable and still lets you work through the entire dataset.
# Process a large CSV file in chunks
chunks = []
for chunk in pd.read_csv("large_data.csv", chunksize=100_000):
chunk["total"] = chunk["price"] * chunk["quantity"]
chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
Chunking is especially helpful when you are dealing with logs, transaction records, or raw exports that are far larger than what a normal laptop can comfortably handle.
I learned this the hard way when I once tried to load a multi-gigabyte CSV in one shot, and my entire system responded like it needed a moment to think about its life choices.
After that experience, chunking became my go-to approach.
Instead of trying to load everything at once, you take a manageable piece, process it, save the result, and then move on to the next piece.
The final concat step gives you a clean, fully processed dataset without putting unnecessary pressure on your machine.
It feels almost too simple, but once you see how smooth the workflow becomes, you’ll wonder why you didn’t start using it much earlier.
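And when all you need is an aggregate, you don’t even have to keep the chunks around. You can fold each one into a running total as you go. A sketch of the idea, using a small in-memory CSV so the example runs on its own (in practice you’d pass a file path):

```python
import io
import pandas as pd

# Stand-in for a huge file on disk; in practice you'd pass a path.
csv_data = io.StringIO(
    "price,quantity\n"
    "10.0,2\n"
    "5.0,4\n"
    "20.0,1\n"
)

total_revenue = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Fold each chunk into a running sum, then let it be garbage-collected.
    total_revenue += (chunk["price"] * chunk["quantity"]).sum()

print(total_revenue)  # 10*2 + 5*4 + 20*1 = 60.0
```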
Final Thoughts
Working with Pandas gets a lot easier once you start using the features designed to make your workflow faster and more efficient.
The techniques in this article aren’t complicated, but they make a noticeable difference when you apply them consistently.
These improvements might seem small individually, but together they can transform how quickly you move from raw data to meaningful insight.
If you build good habits around how you write and structure your Pandas code, performance becomes much less of a problem.
Small optimizations add up, and over time, they make your entire workflow feel smoother and more deliberate.


