The $(N, K)$ Trade-off in Reproducible ML Evaluation

[Submitted on 5 Aug 2025 (v1), last revised 10 Dec 2025 (this version, v2)]

Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation, by Deepak Pandita and 3 other authors

Abstract: Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically done. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining multiple responses from different raters for each item greatly increases the per-item annotation cost. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, together with simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that human disagreement can be accounted for with a budget $N \times K$ of no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the trade-off between $K$ and $N$, and whether one exists at all, depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can help ML practitioners obtain more effective test data by identifying the metrics, the number of items, and the number of annotations per item that yield the most reliability for their budget.
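To make the fixed-budget constraint concrete, the following is a minimal sketch that lists the $(N, K)$ configurations available for a given total budget $N \times K$. It is illustrative only and does not reproduce the paper's analysis: the function name `budget_configurations` and the `min_k` parameter are hypothetical, and the sketch assumes the budget is spent exactly, so only exact factorizations are listed.

```python
def budget_configurations(budget, min_k=1):
    """Return every (N, K) pair with N * K == budget and K >= min_k.

    N is the number of items and K the number of responses per item.
    Assumption: the whole budget is spent, so K must divide it exactly.
    """
    configs = []
    for k in range(min_k, budget + 1):
        if budget % k == 0:
            configs.append((budget // k, k))
    return configs


if __name__ == "__main__":
    # Configurations for a budget of 1000 annotations with K > 10.
    for n, k in budget_configurations(1000, min_k=11):
        print(f"N={n} items, K={k} responses per item")
```

Each listed pair is a candidate design. Choosing among them would still require a reliability measure for the evaluation metric of interest, which this sketch does not attempt to model.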

Submission history

From: Deepak Pandita
[v1] Tue, 5 Aug 2025 17:18:34 UTC (844 KB)
[v2] Wed, 10 Dec 2025 21:20:12 UTC (7,185 KB)
