    SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

    By Awais · December 12, 2025

    [Submitted on 5 Nov 2025 (v1), last revised 10 Dec 2025 (this version, v5)]
    Authors: Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Häggström, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Håkan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

    Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet these techniques are not commonly used in industrial deployments built on frameworks like vLLM or SGLang. The reason is twofold: on the one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm; on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obscuring whether implementing them is worthwhile. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables 4× improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
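    The abstract does not include reference code, but the two compression techniques it builds on are easy to sketch. Below is a minimal, hypothetical PyTorch-style illustration of the eviction policies named above: StreamingLLM keeps a few initial "attention sink" tokens plus a recent window, while SnapKV additionally retains the historical positions that have received the most attention. The function name and the parameters n_sink, n_recent, and n_topk are illustrative assumptions, not the authors' SnapStream implementation.

        # Hypothetical sketch of the two KV-cache compression policies named in
        # the abstract; not the authors' SnapStream implementation.
        import torch

        def compress_kv(keys, values, attn_scores, n_sink=4, n_recent=1024, n_topk=256):
            """Compress the KV cache of one attention head.

            keys, values: [seq_len, head_dim] cached projections
            attn_scores:  [seq_len] aggregate attention weight each cached
                          position has received (e.g. summed over recent queries)
            """
            seq_len = keys.shape[0]
            budget = n_sink + n_topk + n_recent
            if seq_len <= budget:          # nothing to evict yet
                return keys, values

            keep = torch.zeros(seq_len, dtype=torch.bool)
            keep[:n_sink] = True           # StreamingLLM: attention sinks
            keep[-n_recent:] = True        # StreamingLLM: recent window

            # SnapKV-style selection: among the middle positions, keep those
            # with the highest observed attention mass.
            middle = attn_scores.clone()
            middle[:n_sink] = float("-inf")
            middle[-n_recent:] = float("-inf")
            keep[torch.topk(middle, n_topk).indices] = True

            return keys[keep], values[keep]

        # Example: 8192 cached tokens compressed to a fixed budget of
        # 4 + 256 + 1024 = 1284 positions.
        k, v = torch.randn(8192, 128), torch.randn(8192, 128)
        k2, v2 = compress_kv(k, v, torch.rand(8192))

    Note that in a static-graph, continuous-batching deployment of the kind the abstract describes, a dynamic boolean mask like this would have to be replaced by fixed-shape gather operations so that tensor shapes never change between steps; that constraint is precisely the deployment difficulty the abstract attributes to vLLM- and SGLang-style frameworks.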

    Submission history

    From: Jonathan Li
    [v1] Wed, 5 Nov 2025 00:38:31 UTC (338 KB)
    [v2] Thu, 6 Nov 2025 18:27:11 UTC (339 KB)
    [v3] Fri, 7 Nov 2025 19:27:58 UTC (339 KB)
    [v4] Fri, 14 Nov 2025 19:14:59 UTC (339 KB)
    [v5] Wed, 10 Dec 2025 00:29:21 UTC (339 KB)
