CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
by Jingyang Lin and 7 other authors
Abstract: 3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets of CT-report pairs. However, existing methods primarily rely on a global VL alignment adapted directly from 2D scenarios: the entire 3D image is collapsed into a single global embedding, losing the sparse but critical semantics needed to align accurately with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enable grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging this grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that CT-GLIP outperforms counterparts based on global VL alignment. Compared to vanilla CLIP, it achieves average improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC on the zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
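To make the grounded alignment concrete, the following is a minimal, illustrative sketch of an organ-level contrastive objective in PyTorch. The function name, tensor shapes, and temperature value are assumptions for illustration only; this is not the authors' released implementation, just a standard symmetric InfoNCE loss applied at the organ level rather than the whole-volume level.

    # Hypothetical sketch of grounded cross-modal contrastive alignment
    # over organ-level CT-report pairs (NOT the authors' code).
    import torch
    import torch.nn.functional as F

    def grounded_contrastive_loss(organ_feats, text_feats, temperature=0.07):
        """InfoNCE-style loss over N matched organ-level (visual, text) pairs.

        organ_feats: (N, D) pooled features of grounded organ regions
        text_feats:  (N, D) embeddings of the matching report descriptions
        """
        v = F.normalize(organ_feats, dim=-1)   # unit-norm visual features
        t = F.normalize(text_feats, dim=-1)    # unit-norm text features
        logits = v @ t.T / temperature         # (N, N) cosine-similarity matrix
        targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
        # Symmetric cross-entropy: image-to-text and text-to-image directions
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Example: 8 matched organ-region / report-description pairs, 512-dim features
    loss = grounded_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

Under this setup, zero-shot organ recognition can be approximated at inference by encoding natural-language prompts (e.g., "a CT region of the liver") and ranking them by cosine similarity against a grounded organ feature.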
Submission history
From: Jingyang Lin
[v1] Tue, 23 Apr 2024 17:59:01 UTC (905 KB)
[v2] Fri, 26 Apr 2024 16:50:20 UTC (905 KB)
[v3] Mon, 29 Apr 2024 03:25:14 UTC (905 KB)
[v4] Tue, 2 Dec 2025 08:27:30 UTC (3,344 KB)


