Google is rolling out a new feature in its Gemini API that the company claims will make its latest AI models cheaper for third-party developers. Google calls the feature “implicit caching” and says it can deliver 75% savings on “repetitive context” passed to models via the Gemini API. It supports Google’s Gemini 2.5 Pro and 2.5 Flash models.
Caching, one of the most common practices in the AI sector, reduces computing demands and costs by reusing frequently accessed or pre-processed model data. For example, a cache can store responses to commonly asked questions, sparing the model from generating the same answers repeatedly.
Previously, Google offered only explicit prompt caching, which required developers to manually define their highest-frequency prompts. While it promised cost savings, it also demanded significant manual effort.
Some developers were unhappy with how explicit caching worked for Gemini 2.5 Pro, saying it led to unexpectedly high API bills. The criticism intensified over the past week, prompting the Gemini team to apologize and commit to making improvements.
Unlike explicit caching, implicit caching happens automatically. It is enabled by default for Gemini 2.5 models and passes on cost savings when a Gemini API request to a model hits a cache.
“[W]hen you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of the previous requests, then it’s eligible for a cache hit,” wrote Google in a post. “We will dynamically pass cost savings back to you.”
It is also worth noting that, per Google’s developer documentation, the minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro. That is not an enormous amount, meaning it should not take much to trigger these automatic savings. Tokens are the raw bits of data models work with; a thousand tokens is equivalent to roughly 750 words.
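To illustrate how a developer might take advantage of this, here is a minimal sketch using the `google-genai` Python SDK. Since implicit caching keys on a shared request prefix, the idea is to put large, repetitive context at the start of every prompt and keep the varying part at the end. The file name, system context, and questions below are illustrative, and the actual cache-hit decision is made server-side by Google.

```python
from google import genai

# Hypothetical API key; replace with your own.
client = genai.Client(api_key="YOUR_API_KEY")

# Large, repetitive context goes at the START of the request. Implicit
# caching is keyed on a shared prefix, so requests that begin with the
# same tokens are eligible for a cache hit. (Illustrative file.)
SHARED_CONTEXT = open("product_manual.txt").read()

def ask(question: str) -> str:
    # The shared prefix comes first; only the question varies at the end.
    # Per Google's documentation, prompts need at least 1,024 tokens on
    # 2.5 Flash (2,048 on 2.5 Pro) to be eligible for implicit caching.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=SHARED_CONTEXT + "\n\nQuestion: " + question,
    )
    return response.text

# Repeated calls share the same long prefix, so later ones may hit the
# cache and have the discounted rate passed back automatically.
print(ask("How do I reset the device?"))
print(ask("What does error code 42 mean?"))
```

Keeping the variable portion of each prompt at the end maximizes the length of the shared prefix, which is what makes subsequent requests eligible for a hit.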