It was 2 AM on a Tuesday (well, technically Wednesday, I suppose) when my phone buzzed with that familiar, dreaded PagerDuty notification.
I didn’t even need to open my laptop to know that the daily_ingest.py script had failed. Again.
It keeps failing because our data provider changes their file format without warning: overnight, commas become pipes, or the date format silently shifts.
Usually, the actual fix takes me just about thirty seconds: I simply open the script, swap sep=',' for sep='|', and hit run.
The fix is quick, but the real cost isn't the coding time; it's the interrupted sleep and the struggle to get your brain working at 2 AM.
This routine got me thinking: if the solution is so obvious that I can figure it out just by glancing at the raw text, why couldn’t a model do it?
We often hear hype about “Agentic AI” replacing software engineers, which, to me, honestly feels somewhat overblown.
But then, the idea of using a small, cost-effective LLM to act as an on-call junior developer handling boring pandas exceptions?
Now that sounded like a project worth trying.
So, I built a “Self-Healing” pipeline. Although it isn’t magic, it has successfully shielded me from at least three late-night wake-up calls this month.
And personally, anything (no matter how little) that can improve my sleep health is definitely a big win for me.
Here is the breakdown of how I did it so you can build it yourself.
The Architecture: A “Try-Heal-Retry” Loop
The core concept is simple. Most data pipelines are fragile because they assume the world is perfect, and when the input data changes even slightly, they fail.
Instead of accepting that crash, I designed my script to catch the exception, capture the "crime scene evidence" (the traceback plus the first few lines of the file), and pass it to an LLM.
Pretty neat, right?
The LLM now acts as a diagnostic tool, analyzing the evidence to return the correct parameters, which the script then uses to automatically retry the operation.
To make this system robust, I relied on three specific tools:
- Pandas: For the actual data loading (obviously).
- Pydantic: To ensure the LLM returns structured JSON rather than conversational filler.
- Tenacity: A Python library that makes writing complex retry logic incredibly clean.
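Before reaching for any of those libraries, the whole loop can be sketched in a few lines of plain Python. This is a minimal sketch: `load` and `heal` are hypothetical callbacks, with `heal` standing in for the LLM call covered later.

```python
def try_heal_retry(load, heal, max_attempts=3):
    """Generic Try-Heal-Retry loop: call `load`, and on failure
    ask `heal` for new keyword arguments, then try again."""
    params = {}
    for attempt in range(max_attempts):
        try:
            return load(**params)
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the real error
            params = heal(exc)  # "crime scene" in, suggested params out
```

tenacity replaces this hand-rolled loop with decorators, but the control flow is exactly this.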
Step 1: Defining the “Fix”
The primary challenge with using Large Language Models for code generation is their tendency to hallucinate. From my experience, if you ask for a simple parameter, you often receive a paragraph of conversational text in return.
To stop that, I leveraged structured outputs via Pydantic and OpenAI’s API.
This forces the model to complete a strict form, acting as a filter between the messy AI reasoning and our clean Python code.

Here is the schema I settled on, focusing strictly on the arguments that most commonly cause read_csv to fail:
```python
from pydantic import BaseModel, Field
from typing import Optional, Literal

# We need a strict schema so the LLM doesn't just yap at us.
# I'm only including the params that actually cause crashes.
class CsvParams(BaseModel):
    sep: str = Field(description="The delimiter, e.g. ',' or '|' or ';'")
    encoding: str = Field(default="utf-8", description="File encoding")
    header: Optional[int | str] = Field(default="infer", description="Row for col names")
    # Sometimes the C engine chokes on regex separators, so we let the AI switch engines
    engine: Literal["python", "c"] = "python"
```

By defining this BaseModel, we are effectively telling the LLM: "I don't want a conversation or an explanation. I want these four variables filled out, and nothing else."
Step 2: The Healer Function
This function is the heart of the system, designed to run only when things have already gone wrong.
Getting the prompt right took some trial and error, because initially I only provided the error message, which forced the model to guess blindly at the problem.
I quickly realized that to correctly identify issues like delimiter mismatches, the model needed to actually “see” a sample of the raw data.
There is one big catch: you cannot read the whole file.
Passing a 2GB CSV into the prompt would blow up your context window, and your wallet along with it.
Fortunately, I found that pulling the first few lines gives the model enough information to fix the problem 99% of the time.
```python
import openai
import json

client = openai.OpenAI()

def ask_the_doctor(fp, error_trace):
    """
    The 'On-Call Agent'. It looks at the file snippet and error,
    and suggests new parameters.
    """
    print(f"🔥 Crash detected on {fp}. Calling LLM...")

    # Hack: Just grab the first 4 lines. No need to read 1GB.
    # We use errors='replace' so we don't crash while trying to fix a crash.
    try:
        with open(fp, "r", errors="replace") as f:
            head = "".join([f.readline() for _ in range(4)])
    except Exception:
        head = "<unreadable>"

    # Keep the prompt simple. No need for complex "persona" injection.
    prompt = f"""
    I'm trying to read a CSV with pandas and it failed.

    Error Trace: {error_trace}

    Data Snippet (First 4 lines):
    ---
    {head}
    ---

    Return the correct JSON params (sep, encoding, header, engine) to fix this.
    """

    # We force the model to use our Pydantic schema
    completion = client.chat.completions.create(
        model="gpt-4o",  # gpt-4o-mini is also fine here and cheaper
        messages=[{"role": "user", "content": prompt}],
        functions=[{
            "name": "propose_fix",
            "description": "Extracts valid pandas parameters",
            "parameters": CsvParams.model_json_schema()
        }],
        function_call={"name": "propose_fix"}
    )

    # Parse the result back to a dict
    args = json.loads(completion.choices[0].message.function_call.arguments)
    print(f"💊 Prescribed fix: {args}")
    return args
```

I'm sort of glossing over the API setup here, but you get the idea. It takes the "symptoms" and prescribes a "pill" (the arguments).
Step 3: The Retry Loop (Where the Magic Happens)
Now we need to wire this diagnostic tool into our actual data loader.
In the past, I wrote ugly while True loops with nested try/except blocks that were a nightmare to read.
Then I found tenacity, which allows you to decorate a function with clean retry logic.
And the best part is that tenacity also allows you to define a custom “callback” that runs between attempts.
This is exactly where we inject our Healer function.
```python
import pandas as pd
from tenacity import retry, stop_after_attempt, retry_if_exception_type

# A dirty global dict to store the "fix" between retries.
# In a real class, this would be self.state, but for a script, this works.
fix_state = {}

def apply_fix(retry_state):
    # This runs right after the crash, before the next attempt
    e = retry_state.outcome.exception()
    fp = retry_state.args[0]
    # Ask the LLM for new params
    suggestion = ask_the_doctor(fp, str(e))
    # Update the state so the next run uses the suggestion
    fix_state[fp] = suggestion

@retry(
    stop=stop_after_attempt(3),  # Give it 3 strikes
    retry=retry_if_exception_type(Exception),  # Catch everything (risky, but fun)
    before_sleep=apply_fix  # <--- This is the hook
)
def tough_loader(fp):
    # Check if we have a suggested fix for this file, otherwise default to comma
    params = fix_state.get(fp, {"sep": ","})
    print(f"🔄 Trying to load with: {params}")
    df = pd.read_csv(fp, **params)
    return df
```

Does it actually work?
To test this, I created a purposefully broken file called messy_data.csv. I made it pipe-delimited (|) but didn’t tell the script.
When I ran tough_loader('messy_data.csv'), the script crashed, paused for a moment while it “thought,” and then fixed itself automatically.
It feels surprisingly satisfying to watch the code fail, diagnose itself, and recover without any human intervention.
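You can reproduce the failure mode without an API key to see exactly what the healer is up against: pandas' default comma separator collapses a pipe-delimited file into a single column, while the "prescribed" separator recovers the real table. (The file contents here are a toy example of my own, not the original test data.)

```python
import pandas as pd

# Build the deliberately broken file: pipe-delimited, no warning given.
with open("messy_data.csv", "w") as f:
    f.write("id|name|score\n1|alice|0.9\n2|bob|0.7\n")

# The naive default squashes everything into one lonely column...
naive = pd.read_csv("messy_data.csv")  # implicit sep=","
print(naive.shape)  # (2, 1)

# ...while the prescribed parameters recover the real table.
fixed = pd.read_csv("messy_data.csv", sep="|")
print(fixed.shape)  # (2, 3)
```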
The “Gotchas” (Because Nothing is Perfect)
I don’t want to oversell this solution, as there are definitely risks involved.
The Cost
First, remember that every time your pipeline breaks, you are making an API call.
That might be fine for a handful of errors, but if a massive job processes, say, 100,000 files and a bad deployment breaks all of them at once, you could wake up to a very nasty surprise on your OpenAI bill.
If you’re running this at scale, I highly recommend implementing a circuit breaker or switching to a local model like Llama-3 via Ollama to keep your costs down.
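As a sketch of what such a circuit breaker might look like (the class name and thresholds are illustrative, not part of the pipeline above): it simply refuses to call the LLM once too many failures pile up inside a time window.

```python
import time

class CircuitBreaker:
    """Stop calling the LLM after too many failures in a short window,
    so a bad deployment pages a human instead of draining your API budget."""

    def __init__(self, max_calls=10, window_s=300):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = []

    def allow(self):
        now = time.monotonic()
        # Forget failures that have aged out of the window
        self.calls = [t for t in self.calls if now - t < self.window_s]
        if len(self.calls) >= self.max_calls:
            return False  # circuit open: fail fast
        self.calls.append(now)
        return True
```

Wrapping the healer is then a one-liner: `if not breaker.allow(): raise` before calling `ask_the_doctor`.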
Data Safety
While I am only sending the first four lines of the file to the LLM, you need to be very careful about what is in those lines. If your data contains Personally Identifiable Information (PII), you are effectively sending that sensitive data to an external API.
If you work in a regulated industry like healthcare or finance, please use a local model.
Seriously.
Do not send patient data to GPT-4 just to fix a comma error.
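If a local model genuinely isn't an option, one cheap mitigation is to scrub the snippet before it leaves your machine. This is a crude illustrative sketch of my own; the regexes cover only emails and US-style SSNs and will miss plenty of other PII.

```python
import re

def redact(snippet: str) -> str:
    """Crude PII masking for the data snippet before it goes to the API.
    These patterns are examples only: extend them for your own data."""
    snippet = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", snippet)
    snippet = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "<SSN>", snippet)  # US SSN shape
    return snippet
```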
The “Boy Who Cried Wolf”
Finally, there are times when data should fail.
If a file is empty or corrupt, you don’t want the AI to hallucinate a way to load it anyway, potentially filling your DataFrame with garbage.
Pydantic constrains the parameters, but it isn't magic. You have to be careful not to hide real errors that you actually need to fix yourself.
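A cheap guard is to sanity-check the file before invoking the healer at all, so genuinely broken inputs fail loudly instead of getting "fixed". A minimal sketch (the `worth_healing` helper is my own illustrative name, not part of the pipeline above):

```python
import os

def worth_healing(fp: str) -> bool:
    """Cheap sanity check before letting the LLM near a crash.
    An empty or missing file should fail loudly, not be healed."""
    try:
        size = os.path.getsize(fp)
    except OSError:
        return False  # missing or unreadable: nothing to heal
    return size > 0
```

In `apply_fix`, a `if not worth_healing(fp): raise` before calling `ask_the_doctor` keeps real failures real.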
Conclusion and Takeaway
You could argue that using an AI to fix CSVs is overkill, and technically, you might be right.
But in a field as fast-moving as data science, the best engineers aren’t the ones clinging to the methods they learned five years ago; they are the ones constantly experimenting with new tools to solve old problems.
Honestly, this project was just a reminder to stay flexible.
We can’t just keep guarding our old pipelines; we have to keep finding ways to improve them. In this industry, the most valuable skill isn’t writing code faster; rather, it’s having the curiosity to try a whole new way of working.


