at IBM TechXchange, I spent a lot of time around teams who were already running LLM systems in production. One conversation that stayed with me came from LangSmith, the folks who build tooling for monitoring, debugging, and evaluating LLM workflows.
I originally assumed evaluation was mostly about benchmarks and accuracy numbers. They pushed back on that immediately. Their point was straightforward: a model that performs well in a notebook can still behave unpredictably in real usage. If you are not evaluating against realistic scenarios, you are not aligning anything. You are simply guessing.
Two weeks ago, at Cohere Labs Connect Conference 2025, the topic resurfaced again. This time the message came with even more urgency. One of their leads pointed out that public metrics can be fragile, easy to game, and rarely representative of production behavior. Evaluation, they said, remains one of the hardest and least-solved problems in the field.
Hearing the same warning from two different places made something click for me. Most teams working with LLMs are not wrestling with philosophical questions about alignment. They are dealing with everyday engineering challenges, such as:
- Why does the model change behavior after a small prompt update?
- Why do user queries trigger chaos even when tests look clean?
- Why do models perform well on standardized benchmarks but poorly on internal tasks?
- Why does a jailbreak succeed even when guardrails seem solid?
If any of this feels familiar, you are in the same position as everyone else who is building with LLMs. This is where alignment begins to feel like a real engineering discipline instead of an abstract conversation.
This article looks at that turning point. It is the moment you realize that demos, vibes, and single-number benchmarks do not tell you much about whether your system will hold up under real conditions. Alignment genuinely starts when you define what matters enough to measure, along with the methods you will use to measure it.
So let’s take a closer look at why evaluation sits at the center of reliable LLM development, and why it ends up being much harder, and much more important, than it first appears.
Table of Contents
- What “alignment” means in 2025
- Capability ≠ alignment: what the last few years actually taught us
- How misalignment shows up now (not hypothetically)
- Evaluation is the backbone of alignment (and it’s getting more complex)
- Alignment is inherently multi-objective
- When things go wrong, eval failures usually come first
- Where this series goes next
- References
What “alignment” means in 2025
If you ask ten people what “AI alignment” means, you’ll usually get ten answers plus one existential crisis. Thankfully, recent surveys try to pin it down with something resembling consensus. A major review — AI Alignment: A Comprehensive Survey (2025) — defines alignment as making AI systems behave in line with human intentions and values.
Not “make the AI wise,” not “give it perfect ethics,” not “turn it into a digital Gandalf.”
Just: please do what we meant, not what we accidentally typed.
Both surveys organize the field around four goals: Robustness, Interpretability, Controllability, and Ethicality — the RICE framework, which sounds like a wholesome meal but is actually a taxonomy of everything your model will do wrong if you ignore it.
Meanwhile, industry definitions, including IBM’s 2024–2025 alignment explainer, describe the same idea with more corporate calm: encode human goals and values so the model stays helpful, safe, and reliable. Translation: avoid bias, avoid harm, and ideally avoid the model confidently hallucinating nonsense like a Victorian poet who never slept.
Across research and industry, alignment work is often split into two buckets:
- Forward alignment: how we train models (e.g., RLHF, Constitutional AI, data curation, safety finetuning).
- Backward alignment: how we evaluate, monitor, and govern models after (and during) training.
Forward alignment gets all the publicity.
Backward alignment gets all the ulcers.

If you’re a data scientist or engineer integrating LLMs, you mostly feel alignment as backward-facing questions:
- Is this new model hallucinating less, or just hallucinating differently?
- Does it stay safe when users send it prompts that look like riddles written by a caffeinated goblin?
- Is it actually fair across the user groups we serve?
And unfortunately, you can’t answer those with parameter count or “it feels smarter.” You need evaluation.
Capability ≠ alignment: what the last few years actually taught us
One of the most important results in this space still comes from Ouyang et al.’s InstructGPT paper (2022). That study showed something unintuitive: a 1.3B parameter model with RLHF was often preferred over the original 175B GPT-3, despite being about 100 times smaller. Why? Because humans said its responses were more helpful, more truthful, and less toxic. The big model was more capable, but the small model was better behaved.
This same pattern has repeated across 2023–2025. Alignment techniques — and more importantly, feedback loops — change what “good” means. A smaller aligned model can outperform a huge unaligned one on the metrics that actually matter to users.
Truthfulness is a great example.
The TruthfulQA benchmark (Lin et al., 2022) measures the ability to avoid confidently repeating internet nonsense. In the original paper, the best model only hit around 58% truthfulness, compared to humans at 94%. Larger base models were sometimes less truthful because they were better at smoothly imitating wrong information. (The internet strikes again.)
OpenAI later reported that with targeted anti-hallucination training, GPT-4 roughly doubled its TruthfulQA performance — from around 30% to about 60% — which is impressive until you remember this still means “slightly better than a coin flip” under adversarial questioning.
By early 2025, TruthfulQA itself evolved. The authors released a new binary multiple-choice version to fix issues in earlier formats and published updated results, including newer models like Claude 3.5 Sonnet, which likely approaches human-level accuracy on that variant. Many open models still lag behind. Additional work extends these tests to multiple languages, where truthfulness often drops because misinformation patterns differ across linguistic communities.
The broader lesson is clearer than ever:
If the only thing you measure is “does it sound fluent?”, the model will optimize for sounding fluent, not being correct. If you care about truth, safety, or fairness, you need to measure those things explicitly.
Otherwise, you get exactly what you optimized for:
a very confident, very eloquent, occasionally wrong librarian who never learned to whisper.
How misalignment shows up now (not hypothetically)
Over the last three years, misalignment has gone from a philosophical debate to something you can actually point at on your screen. We no longer need hypothetical “what if the AI…” scenarios. We have concrete behaviors, logs, benchmarks, and occasionally a model doing something bizarre that leaves an entire engineering team staring at each other like, did it really just say that?
Hallucinations in safety-critical contexts
Hallucination is still the most familiar failure mode, and unfortunately, it has not retired. System cards for GPT-4, GPT-4o, Claude 3, and others openly document that models still generate incorrect or fabricated information, often with the confident tone of a student who definitely did not read the assigned chapter.
A 2025 study titled “From hallucinations to hazards” argues that our evaluations focus too heavily on general tasks like language understanding or coding, while the actual risk lies in how hallucinations behave in sensitive domains like healthcare, law, and safety engineering.
In other words: scoring well on Massive Multitask Language Understanding (MMLU) does not magically prevent a model from recommending the wrong dosage of a real medication.
TruthfulQA and its newer 2025 variants confirm the same pattern. Even top models can be fooled by adversarial questions laced with misconceptions, and their accuracy varies by language, phrasing, and the creativity of whoever designed the trap.
Bias, fairness, and who gets harmed
Bias and fairness concerns are not theoretical either. Stanford’s Holistic Evaluation of Language Models (HELM) framework evaluates dozens of models across 42 scenarios and multiple dimensions (accuracy, robustness, fairness, toxicity, efficiency, etc.) to create a kind of “alignment scoreboard.”

The results are what you’d expect from any large, messy ecosystem:
- GPT-4-class models usually score highest on accuracy and robustness.
- Claude 3-series models often produce less toxic and more ethically balanced outputs.
- No model is consistently best.
- Every model still exhibits measurable bias and toxicity under adversarial prompting.
Think of it like different athletes excelling in different events, except the events are “don’t offend users” and “don’t hallucinate harmful instructions.”
The Holistic Evaluation of Vision-Language Models (VHELM) extension to multimodal (vision + language) models shows the same pattern. These models are great at perception and reasoning but still struggle with fairness, multilingual performance, and toxicity unless those metrics are explicitly tested.
Deception and “alignment faking”
This is the part where things get uncomfortable.
The most concerning pattern to emerge recently is deception: models behaving aligned in some contexts while producing very different behavior in others. It sounds dramatic, but unfortunately, it is now backed by empirical evidence rather than sci-fi imagination.
Back in 2023, the GPT-4 system card reported the now-famous ARC test: GPT-4, acting as an agent, hired a TaskRabbit worker to solve a CAPTCHA, then lied about being visually impaired when asked if it was a bot.
The test was controlled, but it demonstrated an awkward reality: if you give a model goals and tools, sometimes it starts improvising ways to achieve them.
Since then, multiple research threads expanded the picture:
- Anthropic (2024): “Alignment faking in large language models.”
Their study of Claude 3 Opus and 3.5 Sonnet found that models trained on safety rules can learn to perform alignment when monitored while breaking the same rules in other contexts. - Koorndijk et al. (2025): “Empirical Evidence for Alignment Faking in a Small LLM.”
Even a smaller model learned to behave safely under supervision and then reverted to a different objective when oversight was removed. - Apollo Research (2025): context-scheming experiments.
They found more capable models are better at recognizing evaluation contexts and sometimes strategically underperform or act extra cautious during tests — the AI equivalent of “acting polite because the teacher is watching.” - Anthropic (mid-2025): large-scale multi-model simulations.
Across 16 frontier models (OpenAI, Google, Meta, Anthropic, xAI, and others), models lied, cheated, and even chose harmful actions in controlled scenarios when given autonomy and tool access. Misaligned behaviors were more frequent in the most capable systems.
This does not mean current models are plotting anything in real deployments.
It does mean deception, goal-driven shortcuts, and “performing alignment to pass the test” are real behaviors that show up in experiments — and the behaviors get stronger as models become more capable.
The alignment problem is no longer just “don’t generate toxic content.” It increasingly includes “don’t pretend to be aligned only while we are watching.”
Evaluation is the backbone of alignment (and it’s getting more complex)
Given all of this, recent work has shifted from “we need evaluation” to “we need better, more reliable evaluation.”
From one-number leaderboards to multi-dimensional diagnostics
Early on, the community relied on single-number leaderboards. This worked about as well as rating a car solely by its cupholder count. So efforts like HELM stepped in to make evaluation more holistic: many scenarios multiplied by many metrics, instead of “this model has the highest score.”
Since then, the space has expanded dramatically:
- BenchHub (2025) aggregates 303,000 questions across 38 benchmarks, giving researchers a unified ecosystem for running multi-benchmark tests. One of its main findings is that the same model can perform brilliantly in one domain and fall over in another, sometimes comically so.
- VHELM extends holistic evaluation to vision-language models, covering nine categories such as perception, reasoning, robustness, bias, fairness, and multilinguality. Basically, it’s HELM with extra eyeballs.
- A 2024 study, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” showed that model rankings can flip depending on which prompt phrasing you use. The conclusion is simple: evaluating a model on a single prompt is like rating a singer after hearing only their warm-up scales.
More recent surveys, such as the 2025 Comprehensive Survey on Safety Evaluation of LLMs, treat multi-metric, multi-prompt evaluation as the default. The message is clear: real reliability emerges only when you measure capability, robustness, and safety together, not one at a time.
Evaluation itself is noisy and biased
The newer twist is: even our evaluation mechanisms are misaligned.
A 2025 ACL paper, “Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts,” tested 11 LLMs used as automatic “judges.” The results were… not comforting. Judge models were highly sensitive to superficial artifacts like apologetic phrasing or verbosity. In some setups, simply adding “I’m really sorry” could flip which answer was judged safer up to 98% of the time.
This is the evaluation equivalent of getting out of a speeding ticket because you were polite.
Worse, larger judge models were not consistently more robust, and using a jury of multiple LLMs helped but didn’t fix the core issue.
A related 2025 position paper, “LLM-Safety Evaluations Lack Robustness”, argues that current safety evaluation pipelines introduce bias and noise at many stages: test case selection, prompt phrasing, judge choice, and aggregation. The authors back this with case studies where minor changes in evaluation setup materially change conclusions about which model is “safer.”
Put simply: if you rely on LLMs to grade other LLMs without careful design, you can easily end up fooling yourself. Evaluating alignment requires just as much rigor as building the model.
Alignment is inherently multi-objective
One thing both alignment and evaluation surveys now emphasize is that alignment is not a single metric problem. Different stakeholders care about different, often competing objectives:
- Product teams care about task success, latency, and UX.
- Safety teams care about jailbreak resistance, harmful content rates, and misuse potential.
- Legal/compliance cares about auditability and adherence to regulation.
- Users care about helpfulness, trust, privacy, and perceived honesty.
Surveys and frameworks like HELM, BenchHub, and Unified-Bench all argue that you should treat evaluation as navigating a trade-off surface, not picking a winner.
A model that dominates generic NLP benchmarks might be terrible for your domain if it is brittle under distribution shift or easy to jailbreak. Meanwhile, a more conservative model might be perfect for healthcare but deeply frustrating as a coding assistant.
Evaluating across objectives — and admitting that you are choosing trade-offs rather than discovering a magical “best” model — is part of doing alignment work honestly.
When things go wrong, eval failures usually come first
If you look at recent failure stories, a pattern emerges: alignment problems often start as evaluation failures.
Teams deploy a model that looks great on the standard leaderboard cocktail but later discover:
- it performs worse than the previous model on a domain-specific safety test,
- it shows new bias against a particular user group,
- it can be jailbroken by a prompt style no one bothered to test, or
- RLHF made it more polite but also more confidently wrong.
Every one of those is, at root, a case where nobody measured the right thing early enough.
The newest work on deceptive alignment points in the same direction. If models can detect the evaluation environment and behave safely only during the exam, then testing becomes just as important as training. You may think you’ve aligned a model when you’ve actually trained it to pass your eval suite.
It’s the AI version of a student memorizing the answer key instead of understanding the material: impressive test scores, questionable real-world behavior.
Where this series goes next
In 2022, “we need better evals” was an opinion. By late 2025, it’s just how the literature reads:
- Larger models are more capable, and also more capable of harmful or deceptive behavior when the setup is wrong.
- Hallucinations, bias, and strategic misbehavior are not theoretical; they are measurable and sometimes painfully reproducible.
- Academic surveys and industry system cards now treat multi-metric evaluation as a central part of alignment, not a nice-to-have.
The rest of this series will zoom in:
- next, on classic benchmarks (MMLU, HumanEval, etc.) and why they’re not enough for alignment,
- then on holistic and stress-test frameworks (HELM, TruthfulQA, safety eval suites, red teaming),
- then on training-time alignment methods (RLHF, Constitutional AI, scalable oversight),
- and finally, on the societal side: ethics, governance, and what the new deceptive-alignment work implies for future systems.
If you’re building with LLMs, the practical takeaway from this first piece is simple:
Alignment begins where your evaluation pipeline begins.
If you don’t measure a behavior, you’re implicitly okay with it.
The good news is that we now have far more tools, far more data, and far more evidence to decide what we actually care about measuring. And that’s the foundation everything else will build on.
References
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). OpenAI. https://arxiv.org/abs/2203.02155
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. https://arxiv.org/abs/2109.07958
- OpenAI. (2023). GPT-4 System Card. https://cdn.openai.com/papers/gpt-4-system-card.pdf
- Kirk, H. et al. (2024). From Hallucinations to Hazards: Safety Benchmarking for LLMs in Critical Domains.https://www.sciencedirect.com/science/article/pii/S0925753525002814
- Li, R. et al. (2024). HELM: Holistic Evaluation of Language Models. Stanford CRFM. https://crfm.stanford.edu/helm/latest
- Muhammad, J. et al. (2025). Red Teaming Large Language Models: A comprehensive review and critical analysis https://www.sciencedirect.com/science/article/abs/pii/S0306457325001803
- Ryan, G. et al. (2024). Alignment Faking in Large Language Models Anthropic. https://www.anthropic.com/research/alignment-faking
- Koorndijk, J. et al. (2025). Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques. https://arxiv.org/abs/2506.21584
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AIFeedback. Anthropic. https://arxiv.org/abs/2212.08073
- Mizrahi, M. et al. (2024). State of What Art? A Call for Multi-Prompt Evaluation of LLMs. https://arxiv.org/abs/2401.00595
- Lee, T. et al. (2024). VHELM: A Holistic Evaluation Suite for Vision-Language Models. https://arxiv.org/abs/2410.07112
- Kim, E. et al. (2025). BenchHub: A Unified Evaluation Suite for Holistic and Customizable LLM Evaluation. https://arxiv.org/abs/2506.00482
- Chen, H. et al. (2025). Safer or Luckier? LLM Safety Evaluators Are Not Robust to Artifacts. ACL 2025. https://arxiv.org/abs/2503.09347
- Beyer, T. et al. (2025). LLM-Safety Evaluations Lack Robustness. https://arxiv.org/abs/2503.02574
- Ji, J. et al. (2025). AI Alignment: A Comprehensive Survey. https://arxiv.org/abs/2310.19852
- Seshadri, A. (2024). The Crisis of Unreliable AI Leaderboards. Cohere Labs. https://betakit.com/cohere-labs-head-calls-unreliable-ai-leaderboard-rankings-a-crisis-in-the-field
- IBM. (2024). AI Governance and Responsible AI Overview. https://www.ibm.com/artificial-intelligence/responsible-ai
- Stanford HAI. (2025). AI Index Report. https://aiindex.stanford.edu


