    Unified Vision-Language Codes or Agent-Induced Novelty?

By Awais · November 24, 2025

    [Submitted on 1 Aug 2025 (v1), last revised 21 Nov 2025 (this version, v2)]

Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
By Satiyabooshan Murugaboopathy, Connor T. Jerzak, and Adel Daoud


Abstract: We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals.

Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
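To make the convergence measurement in the abstract concrete, here is a minimal sketch of one common way to compare embeddings from two modalities: fit a least-squares linear map aligning one embedding space to the other, then take the median per-sample cosine similarity. The data below are synthetic stand-ins (a shared latent signal plus modality-specific noise), not the paper's actual pipeline or embeddings, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for fused per-neighborhood embeddings: both
# modalities observe the same 8-dim latent signal through different
# random projections, plus independent noise.
latent = rng.normal(size=(500, 8))
vision_emb = latent @ rng.normal(size=(8, 32)) + 0.5 * rng.normal(size=(500, 32))
text_emb = latent @ rng.normal(size=(8, 32)) + 0.5 * rng.normal(size=(500, 32))

# Align vision space to text space with a least-squares linear map,
# one simple choice of "alignment" before comparing representations.
W, *_ = np.linalg.lstsq(vision_emb, text_emb, rcond=None)
aligned = vision_emb @ W

# Per-sample cosine similarity between aligned vision and text
# embeddings; the paper reports the median of this distribution.
cos = np.sum(aligned * text_emb, axis=1) / (
    np.linalg.norm(aligned, axis=1) * np.linalg.norm(text_emb, axis=1)
)
print(f"median cosine similarity after alignment: {np.median(cos):.2f}")
```

Because the two synthetic modalities share a dominant latent signal, the median similarity comes out well above zero; with unrelated embeddings it would hover near zero, which is what makes the 0.60 figure in the abstract evidence for a shared latent code.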

    Submission history

From: Connor Jerzak
[v1] Fri, 1 Aug 2025 23:07:16 UTC (3,639 KB)
[v2] Fri, 21 Nov 2025 14:32:46 UTC (10,229 KB)
