Token-Selective Hierarchical Data Selection for Instruction Tuning

[Submitted on 2 Jun 2025 (v1), last revised 1 Dec 2025 (this version, v2)]

View a PDF of the paper titled T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning, by Yanjun Fu and 2 other authors

View PDF
HTML (experimental)

Abstract:Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promotes robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at this https URL.

Submission history

From: Yanjun Fu [view email]
[v1]
Mon, 2 Jun 2025 04:59:17 UTC (2,273 KB)
[v2]
Mon, 1 Dec 2025 01:30:30 UTC (2,330 KB)

What's Hot

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Token-Selective Hierarchical Data Selection for Instruction Tuning

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Follow the AI Footpaths | Towards Data Science

Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Hallucinations in LLMs Are Not a Bug in the Data

Visual Generalization in Reinforcement Learning via Dynamic Object Tokens

How to Build a Production-Ready Claude Code Skill

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Trust Is The New Ranking Factor

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

What They Mean and How to Use Them in Social Media Campaigns

Follow the AI Footpaths | Towards Data Science

Top 7 Traackr Alternatives 2026

Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Most Popular

13 Trending Songs on TikTok in Nov 2025 (+ How to Use Them)

How to watch the 2026 GRAMMY Awards online from anywhere

Corporate Reputation Management Strategies | Sprout Social

Our Picks

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Subscribe to Updates

What's Hot

Token-Selective Hierarchical Data Selection for Instruction Tuning

Submission history

Related Posts

Subscribe to Updates