What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

arXiv:2603.19017v1 Announce Type: cross
Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, a crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
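To make the idea of date fragmentation concrete, here is a minimal Python sketch. It is not the paper's mDFR, whose exact definition and human-severity calibration are not given in the abstract. It assumes a toy greedy longest-match tokeniser and a hypothetical score, `toy_fragmentation`, defined here as the fraction of Year/Month/Day components that are split into more than one token. The vocabulary and the example dates are invented for illustration only.

```python
from typing import List, Set


def greedy_tokenise(text: str, vocab: Set[str]) -> List[str]:
    """Toy longest-match tokeniser (an assumption, not a real LLM tokeniser).

    Pieces not found in the vocabulary fall back to single characters.
    """
    tokens: List[str] = []
    max_len = max((len(v) for v in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def toy_fragmentation(components: List[str], vocab: Set[str]) -> float:
    """Hypothetical score: share of date components split into >1 token."""
    split = sum(1 for c in components if len(greedy_tokenise(c, vocab)) > 1)
    return split / len(components)


# Invented vocabulary that happens to cover one Gregorian-style date well.
vocab = {"2024", "20", "24", "03", "15"}

# Components that the vocabulary covers stay whole.
well_covered = toy_fragmentation(["2024", "03", "15"], vocab)

# Components with no matching vocabulary entries are broken into digits.
poorly_covered = toy_fragmentation(["1445", "09", "05"], vocab)

print(well_covered, poorly_covered)
```

In this toy setting the first date keeps each component as a single token, while the second is split digit by digit, which is the kind of loss of Year/Month/Day separation the abstract describes for rarer formats.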
