    Prompt Caching with the OpenAI API: A Full Hands-On Python tutorial

    By Awais · March 22, 2026

    In my previous post, I covered Prompt Caching: what it is, how it works, and how it can save you a lot of money and time when running high-traffic AI-powered apps. In today's post, I walk you through implementing Prompt Caching specifically with OpenAI's API, and we discuss some common pitfalls.


    A brief reminder on Prompt Caching

    Before getting our hands dirty, let's briefly revisit what Prompt Caching actually is. Prompt Caching is a feature of frontier-model API services like the OpenAI API or Anthropic's Claude API that allows caching and reusing parts of the LLM's input that are repeated frequently. Such repeated parts are typically system prompts or instructions that are passed to the model on every request of an AI app, alongside variable content like the user's query or information retrieved from a knowledge base. To get a cache hit, the repeated part of the prompt must sit at its very beginning, forming a prompt prefix. In addition, for prompt caching to be activated, this prefix must exceed a certain size threshold (for OpenAI, the prefix must be more than 1,024 tokens, while Claude has different minimum cacheable lengths for different models). As long as those two conditions are satisfied (repeated tokens forming a prefix that exceeds the size threshold defined by the API service and model), caching can be activated to achieve economies of scale when running AI apps.

    Unlike caching in other components in a RAG or other AI app, prompt caching operates at the token level, in the internal procedures of the LLM. In particular, LLM inference takes place in two steps:

    • Pre-fill, that is, the LLM takes into account the user prompt to generate the first token, and
    • Decoding, that is, the LLM recursively generates the tokens of the output one by one

    In short, prompt caching stores the computations that take place in the pre-fill stage, so the model doesn't need to redo them when the same prefix reappears. Computations that take place in the decoding phase, even if repeated, are never cached.

    For the rest of the post, I will be focusing solely on the use of prompt caching in the OpenAI API.


    What about the OpenAI API?

    In OpenAI's API, prompt caching was first introduced on October 1, 2024. Originally, it offered a 50% discount on cached tokens; nowadays, that discount goes up to 90%. On top of this, hitting the prompt cache can also cut latency by up to 80%.

    When prompt caching is activated, the API service attempts to hit the cache for a submitted request by routing the prompt to a machine where the respective cache is expected to exist. This is called cache routing, and to do it, the service typically uses a hash of the first 256 tokens of the prompt.
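    To build some intuition for cache routing, here is a toy, purely illustrative sketch (this is not OpenAI's actual implementation): hash a fixed-size prefix of the prompt and use it to pick a machine, so requests that share a prefix land where the cached pre-fill state lives. For simplicity, I hash a character prefix rather than the first 256 tokens.

```python
import hashlib

def route_request(prompt: str, n_machines: int, prefix_chars: int = 1024) -> int:
    # Hash a fixed-length prefix of the prompt and map it to a machine
    # index, so requests that share a prefix land on the same machine
    # and can reuse its cached pre-fill state.
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_machines

# Two prompts sharing a long prefix route to the same machine.
p1 = "SYSTEM INSTRUCTIONS " * 100 + "What is overfitting?"
p2 = "SYSTEM INSTRUCTIONS " * 100 + "What is regularization?"
print(route_request(p1, 8) == route_request(p2, 8))  # True
```

    Because both prompts agree on their first 1,024 characters, they hash to the same bucket, which is exactly why a variable prefix (as we will see later) destroys routing.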

    Beyond this, the API also allows explicitly setting a prompt_cache_key parameter in the request. That is a single key identifying which cache we are referring to, further increasing the chances of our prompt being routed to the correct machine and hitting the cache.

    In addition, the OpenAI API provides two distinct types of caching with regard to duration, controlled through the prompt_cache_retention parameter. Those are:

    • In-memory prompt cache retention: This is the default type, available for every model that supports prompt caching. With the in-memory cache, cached data remains active for a period of 5-10 minutes between requests.
    • Extended prompt cache retention: This is available for specific models only. The extended cache keeps data cached for longer, up to a maximum of 24 hours.

    Now, as far as cost is concerned, OpenAI charges the same per non-cached input token whether prompt caching is activated or not. If we manage to hit the cache successfully, we are billed for the cached tokens at a greatly discounted price, with a discount of up to 90%. Moreover, the price per input token is the same for in-memory and extended cache retention.
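    As a back-of-envelope sketch of this billing model (the per-token rate below is a hypothetical placeholder, not a real OpenAI price; check the official pricing page for per-model rates):

```python
def input_cost_usd(total_input_tokens: int, cached_tokens: int,
                   price_per_1m: float, cache_discount: float = 0.90) -> float:
    # Non-cached tokens pay full price; cached tokens pay the
    # discounted price (up to 90% off).
    uncached = total_input_tokens - cached_tokens
    discounted = price_per_1m * (1 - cache_discount)
    return (uncached * price_per_1m + cached_tokens * discounted) / 1_000_000

# Hypothetical rate of $0.40 per 1M input tokens:
cold = input_cost_usd(5_000, 0, 0.40)      # no cache hit
warm = input_cost_usd(5_000, 4_000, 0.40)  # 4,000 tokens served from cache
print(f"cold: ${cold:.6f}, warm: ${warm:.6f}")  # cold: $0.002000, warm: $0.000560
```

    With 80% of the input cached at a 90% discount, the input bill drops to roughly 28% of the uncached cost, which is where the economies of scale come from.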


    Prompt Caching in Practice

    So, let's see how prompt caching actually works, with a simple Python example against OpenAI's API. More specifically, we are going to walk through a realistic scenario where a long system prompt (prefix) is reused across multiple requests. If you are here, I assume you already have your OpenAI API key in place and the required libraries installed. The first thing to do is import the OpenAI library, as well as time for capturing latency, and initialize an instance of the OpenAI client:

    from openai import OpenAI
    import time
    
    client = OpenAI(api_key="your_api_key_here")

    then we can define our prefix (the tokens that are going to be repeated and we are aiming to cache):

    long_prefix = """
    You are a highly knowledgeable assistant specialized in machine learning.
    Answer questions with detailed, structured explanations, including examples when relevant.
    
    """ * 200  

    Notice how we artificially inflate the length (multiplying by 200) to make sure the 1,024-token caching threshold is met. We also set up a timer to measure our latency savings, and then we are ready to make our call:

    start = time.time()
    
    response1 = client.responses.create(
        model="gpt-4.1-mini",
        input=long_prefix + "What is overfitting in machine learning?"
    )
    
    end = time.time()
    
    print("First response time:", round(end - start, 2), "seconds")
    print(response1.output[0].content[0].text)

    So, what do we expect to happen here? For gpt-4o and newer models, prompt caching is activated by default, and since our 4,616 input tokens are well above the 1,024-token prefix threshold, we are good to go. This request first checks whether the input is a cache hit (it is not, since this is the first request with this prefix), then processes the entire input and caches it. The next time we send an input whose initial tokens match the cached input, we get a cache hit. Let's check this in practice by making a second request with the same prefix:

    start = time.time()
    
    response2 = client.responses.create(
        model="gpt-4.1-mini",
        input=long_prefix + "What is regularization?"
    )
    
    end = time.time()
    
    print("Second response time:", round(end - start, 2), "seconds")
    print(response2.output[0].content[0].text)

    Indeed! The second request runs significantly faster (15.37 vs 23.31 seconds). This is because the model has already done the calculations for the cached prefix and only needs to process the new part, “What is regularization?”, from scratch. As a result, prompt caching gives us significantly lower latency and reduced cost, since cached tokens are discounted.
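    Rather than inferring a hit from latency alone, the response's usage object reports how many input tokens were served from the cache; in the Responses API this lives under usage.input_tokens_details.cached_tokens (attribute path as of the time of writing; verify against your SDK version). A small helper, demonstrated with a stubbed usage object so it runs without an API key:

```python
from types import SimpleNamespace

def cached_fraction(usage) -> float:
    # Share of input tokens that were served from the prompt cache.
    # The attribute path follows the Responses API usage object;
    # check your SDK version.
    if usage.input_tokens == 0:
        return 0.0
    return usage.input_tokens_details.cached_tokens / usage.input_tokens

# In the tutorial you would pass response2.usage; here we stub it:
fake_usage = SimpleNamespace(
    input_tokens=5_000,
    input_tokens_details=SimpleNamespace(cached_tokens=4_000),
)
print(cached_fraction(fake_usage))  # 0.8
```

    A value of 0 on a request you expected to hit the cache is the clearest signal that one of the pitfalls discussed below is in play.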


    Another thing from the OpenAI documentation worth revisiting is the prompt_cache_key parameter. According to the documentation, we can explicitly set a prompt cache key when making a request, and in this way define which requests should share the same cache. Nonetheless, when I tried to include it in my example by adjusting the request parameters accordingly, I didn't have much luck:

    response1 = client.responses.create(
        prompt_cache_key = 'prompt_cache_test1',
        model="gpt-5.1",
        input=long_prefix + "What is overfitting in machine learning?"
    )

    🤔

    It seems that while prompt_cache_key exists in the API itself, it was not exposed in the version of the Python SDK I was using. In other words, we cannot explicitly control cache reuse this way; routing remains automatic and best-effort.


    So, what can go wrong?

    Activating prompt caching and actually hitting the cache seems fairly straightforward from what we've covered so far. So, what could go wrong and make us miss the cache? Unfortunately, quite a lot. Straightforward as it is, prompt caching requires several prerequisites to hold, and missing even one of them results in a cache miss. Let's take a closer look!

    One obvious miss is a prefix shorter than the activation threshold, namely fewer than 1,024 tokens. Nonetheless, this is easily solvable: we can always pad the prefix token count by multiplying it by an appropriate factor, as shown in the example above.

    Another pitfall is silently breaking the prefix. Even when we use persistent instructions and system prompts of appropriate size across all requests, we must be exceptionally careful not to break the prefix by adding variable content at the beginning of the model's input, before it. That is a guaranteed way to break the cache, no matter how long and repeated the prefix that follows is. The usual suspects here are dynamic data, for instance, user IDs or timestamps prepended to the prompt. Thus, a best practice across all AI app development is that dynamic content should always go at the end of the prompt, never at the beginning.
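    The pitfall is easy to demonstrate with plain string construction (long_prefix below is a short stand-in for the real system prompt):

```python
import time

long_prefix = "You are a helpful ML assistant. " * 50  # stand-in system prompt
question = "What is overfitting?"
user_id, ts = "user_42", int(time.time())

# Cache-breaking: dynamic metadata before the static prefix means the
# first tokens differ between requests, so the cached prefix never matches.
bad_input = f"[user={user_id} ts={ts}]\n" + long_prefix + question

# Cache-friendly: static prefix first, dynamic content appended at the end.
good_input = long_prefix + question + f"\n[user={user_id} ts={ts}]"

print(good_input.startswith(long_prefix))  # True
print(bad_input.startswith(long_prefix))   # False
```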

    Finally, it is worth highlighting that prompt caching only covers the pre-fill phase; decoding is never cached. This means that even if we instruct the model to generate responses following a specific template that begins with certain fixed tokens, those tokens aren't going to be cached, and we are billed for generating them as usual.

    Conversely, for some use cases it doesn't really make sense to use prompt caching at all: highly dynamic prompts with little repetition (like free-form chatbots), one-off requests, or real-time personalized systems.

    . . .

    On my mind

    Prompt caching can significantly improve the performance of AI applications, both in terms of cost and time. In particular, when looking to scale AI apps, prompt caching comes in extremely handy for keeping cost and latency at acceptable levels.

    For OpenAI's API, prompt caching is activated by default, and costs for non-cached input tokens are the same whether we aim for the cache or not. Thus, one can only win by trying to hit the cache on every request, even when it doesn't succeed.

    Claude also provides extensive functionality on prompt caching through their API, which we are going to be exploring in detail in a future post.

    Thanks for reading! 🙂

    . . .

    Loved this post? Let’s be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!

    All images by the author, except mentioned otherwise.
