Learning Symbolic World Models via Pretrained Vision-Language Models

[Submitted on 31 Dec 2024 (v1), last revised 9 Mar 2026 (this version, v4)]

From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models, by Ashay Athalye and 6 other authors

Abstract: Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
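To make the test-time step concrete, here is a minimal toy sketch of search-based planning over a symbolic state. It is not the paper's implementation: the `Operator` class, the `plan` function, the predicate names (`HandEmpty`, `OnTable(cup)`, and so on), and the two skills are all hypothetical, and the initial state is written by hand where the described method would obtain it by querying a VLM on camera images. The operators are assumed to be STRIPS-style (preconditions, add effects, delete effects), and breadth-first search stands in for whichever planner is actually used.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Operator:
    """Hypothetical STRIPS-style abstract model of one low-level skill."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

    def applicable(self, state: FrozenSet[str]) -> bool:
        # The skill can run only if every precondition atom holds.
        return self.preconditions <= state

    def apply(self, state: FrozenSet[str]) -> FrozenSet[str]:
        # Remove the delete effects, then add the add effects.
        return (state - self.delete_effects) | self.add_effects


def plan(init: FrozenSet[str], goal: FrozenSet[str],
         operators: List[Operator]) -> Optional[List[str]]:
    """Breadth-first search over abstract states.

    Returns a list of skill names reaching the goal, or None if the
    goal is unreachable with the given operators.
    """
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, skills = frontier.popleft()
        if goal <= state:
            return skills
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, skills + [op.name]))
    return None


# Toy domain (assumed for illustration only).
OPERATORS = [
    Operator(
        name="Pick(cup)",
        preconditions=frozenset({"HandEmpty", "OnTable(cup)"}),
        add_effects=frozenset({"Holding(cup)"}),
        delete_effects=frozenset({"HandEmpty", "OnTable(cup)"}),
    ),
    Operator(
        name="PlaceInBin(cup)",
        preconditions=frozenset({"Holding(cup)"}),
        add_effects=frozenset({"InBin(cup)", "HandEmpty"}),
        delete_effects=frozenset({"Holding(cup)"}),
    ),
]

# Hand-written stand-in for the symbolic state a VLM would report.
init = frozenset({"HandEmpty", "OnTable(cup)"})
goal = frozenset({"InBin(cup)"})

print(plan(init, goal, OPERATORS))  # -> ['Pick(cup)', 'PlaceInBin(cup)']
```

Because the planner sees only predicates, the same operators would apply unchanged to a scene with a different background or object arrangement, provided the state abstraction labels the atoms correctly. That is the intuition behind the generalization claim, shown here only on a two-step toy problem.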

Submission history

From: Nishanth Kumar
[v1] Tue, 31 Dec 2024 06:14:16 UTC (6,848 KB)
[v2] Mon, 9 Jun 2025 01:52:27 UTC (8,778 KB)
[v3] Tue, 10 Jun 2025 03:08:29 UTC (8,778 KB)
[v4] Mon, 9 Mar 2026 17:35:57 UTC (8,754 KB)
