Prompt Caching
To save on inference costs, you can enable prompt caching on supported providers and models. Prompt caching does not carry over when you switch between providers: to cache a prompt, an LLM engine must store a memory snapshot of the processed prompt, and that snapshot is not shared with other providers.
The prompt caching request syntax is provider-specific at this time. As more providers support prompt caching, we will explore normalizing the syntax into a consistent request format.
Anthropic Claude
Caching price changes:
- Cache writes: charged at 1.25x the original input price
- Cache reads: charged at 0.1x the original input price
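To make the multipliers above concrete, here is a minimal arithmetic sketch. It assumes a single cached prefix that is written once and then read by every later request before the cache expires. The function names and the per-token price used in the test are hypothetical and are not real model pricing.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # cache write price relative to normal input
CACHE_READ_MULTIPLIER = 0.1    # cache read price relative to normal input


def cached_prefix_cost(prefix_tokens: int, price_per_token: float, requests: int) -> float:
    """Cost of the prefix over `requests` calls: one cache write, then cache reads.

    Assumes every call after the first hits the cache (no expiry in between).
    """
    if requests < 1:
        return 0.0
    base = prefix_tokens * price_per_token
    write_cost = base * CACHE_WRITE_MULTIPLIER
    read_cost = base * CACHE_READ_MULTIPLIER * (requests - 1)
    return write_cost + read_cost


def uncached_prefix_cost(prefix_tokens: int, price_per_token: float, requests: int) -> float:
    """Cost of sending the same prefix at the normal input price on every call."""
    return prefix_tokens * price_per_token * requests
```

Under these assumptions, a single request costs 1.25x the normal prefix price, so caching only pays off from the second request onward, where the total is 1.35x instead of 2x.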
Supported models:
- anthropic/claude-3.5-sonnet
- anthropic/claude-3-haiku
- anthropic/claude-3-opus
Prompt caching with Anthropic requires the use of cache_control breakpoints. There is a limit of four breakpoints, and the cache will expire within five minutes. Therefore, it is recommended to reserve the cache breakpoints for large bodies of text, such as character cards, CSV data, RAG data, book chapters, etc.
See Anthropic's documentation to read more about Anthropic prompt caching and its limitations.
The cache_control breakpoint can only be inserted into the text part of a multipart message.
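The two constraints above (at most four breakpoints, and only on text parts) can be checked on the client before a request is sent. The following is a sketch only: `check_cache_breakpoints` is a hypothetical helper, not part of any SDK, and it assumes messages use the multipart shape shown in the examples on this page.

```python
MAX_BREAKPOINTS = 4  # limit stated in the text above


def check_cache_breakpoints(messages: list) -> int:
    """Count cache_control breakpoints and raise ValueError if they are misused."""
    count = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-string content has no parts, so it cannot hold a breakpoint
        for part in content or []:
            if "cache_control" not in part:
                continue
            if part.get("type") != "text":
                raise ValueError("cache_control may only be set on text parts")
            count += 1
    if count > MAX_BREAKPOINTS:
        raise ValueError(f"{count} breakpoints found; the limit is {MAX_BREAKPOINTS}")
    return count
```

Running a check like this locally surfaces a misplaced breakpoint before the request reaches the provider.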
System message caching example:
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. You know the following book very well:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}
User message caching example:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Given the book below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
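Payloads like the ones above can also be assembled programmatically. The sketch below builds the system message variant as a plain dictionary. The helper names are hypothetical, and the top-level "model" field is an assumption about the surrounding request body, since the examples above show only "messages". No network call is made.

```python
import json


def text_part(text: str, cache: bool = False) -> dict:
    """Build one text part; when cache=True, mark it as a cache breakpoint."""
    part = {"type": "text", "text": text}
    if cache:
        part["cache_control"] = {"type": "ephemeral"}
    return part


def build_system_cached_request(model: str, instructions: str, large_text: str, question: str) -> dict:
    """Assemble a request body that caches the large text in the system message."""
    return {
        "model": model,  # assumption: the request body carries a top-level model field
        "messages": [
            {
                "role": "system",
                "content": [text_part(instructions), text_part(large_text, cache=True)],
            },
            {"role": "user", "content": [text_part(question)]},
        ],
    }


payload = build_system_cached_request(
    model="anthropic/claude-3.5-sonnet",
    instructions="You are a historian studying the fall of the Roman Empire. "
    "You know the following book very well:",
    large_text="HUGE TEXT BODY",
    question="What triggered the collapse?",
)
print(json.dumps(payload, indent=2))
```

Keeping the short instructions and the large body in separate parts means only the large body carries the breakpoint, which matches the recommendation to reserve breakpoints for large bodies of text.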
DeepSeek
Caching price changes:
- Cache writes: charged at the same price as the original input
- Cache reads: charged at 0.1x the original input price
Prompt caching with DeepSeek is automated and does not require any additional configuration.