CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

[Submitted on 31 May 2025 (v1), last revised 3 Feb 2026 (this version, v3)]

Authors: Monoshi Kumar Roy and 6 other authors

Abstract: Understanding and reasoning about code semantics is essential for enhancing code LLMs’ abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, which limits their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that provides a spectrum of fine-grained code reasoning tasks grounded in the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground-truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations of state-of-the-art LLMs. Our results show a clear performance gap when the models handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits the models’ code reasoning capabilities. Beyond the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are located at this https URL.
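To make the idea of collecting ground truth from execution traces concrete, the following is a minimal Python sketch. It is not the authors' tracing framework, whose design is not described on this page. It only assumes that a fine-grained task could ask for the value of a variable at a given statement, and it uses the standard-library hook `sys.settrace` to record local variables before each line runs. The names `trace_locals` and `running_sum` are hypothetical and chosen for illustration.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical helper, not part of CodeSense: records a snapshot of the
# local variables each time a line of `func` is about to execute.
def trace_locals(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Any, List[Tuple[int, Dict[str, Any]]]]:
    records: List[Tuple[int, Dict[str, Any]]] = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Line offset relative to the `def` line, plus a shallow copy
            # of the locals as they are before this line executes.
            offset = frame.f_lineno - code.co_firstlineno
            records.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always restore whatever tracer was installed before.
        sys.settrace(previous)
    return result, records


# Toy function standing in for code exercised by a repository test.
def running_sum(values):
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    result, records = trace_locals(running_sum, [1, 2])
    for offset, snapshot in records:
        print(offset, snapshot)
```

Each recorded pair could be turned into a question and answer of the form "what is the value of `total` before the statement at this offset executes?", which is one plausible shape for a statement-level reasoning task. The snapshots are shallow copies, so a real collector would need to serialize mutable values at capture time.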

Submission history

From: Monoshi Roy
[v1] Sat, 31 May 2025 23:32:01 UTC (19,851 KB)
[v2] Thu, 2 Oct 2025 16:10:36 UTC (5,781 KB)
[v3] Tue, 3 Feb 2026 23:34:25 UTC (19,902 KB)
