“This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing.”
This means a major speed increase for people like me who rely on (slow) CPU inference or big models. Consider a chatbot scenario: a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096-token) context size. Previously, the whole context had to be re-computed starting from the first changed/now-missing token. This feature detects that case, deletes the affected tokens from the KV cache, and shifts the subsequent tokens within the cache so it can be re-used, avoiding a computationally expensive re-calculation.
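The idea can be sketched roughly like this (a toy illustration, not llama.cpp's actual code; the function name and cache layout are made up for clarity):

```python
# Toy sketch of KV-cache shifting. Each cache entry is a
# (position, cached_key_value) pair. Names are illustrative only.

def shift_kv_cache(cache, n_keep, n_discard):
    """Evict n_discard tokens after the first n_keep tokens,
    then shift the survivors down so the cache stays contiguous."""
    kept_prefix = cache[:n_keep]            # e.g. the system prompt
    survivors = cache[n_keep + n_discard:]  # later dialogue lines
    # Each surviving entry keeps its already-computed K/V values;
    # only its position index is shifted (in a real RoPE-based cache
    # this is done by rotating the cached keys, not recomputing them).
    shifted = [(pos - n_discard, kv) for pos, kv in survivors]
    return kept_prefix + shifted

# A toy cache holding 8 tokens:
cache = [(i, f"kv{i}") for i in range(8)]
# Keep 2 prompt tokens, evict the next 3 oldest chat tokens:
new_cache = shift_kv_cache(cache, n_keep=2, n_discard=3)
# new_cache → [(0, 'kv0'), (1, 'kv1'), (2, 'kv5'), (3, 'kv6'), (4, 'kv7')]
```

The point is that the expensive part (the cached keys/values) is kept; only the bookkeeping of positions changes, which is why no reprocessing is needed.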
This is probably also more or less related to recent advancements like StreamingLLM.
This won’t help once text gets inserted “in the middle” or the prompt gets changed in some other way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I’m able to have unlimited-length conversations without waiting excessively once the chat history hits the token limit and the frontend starts dropping old text.
It’s just a clever way to re-use the KV cache in one specific case. But I’ve wished for this for quite some time.
Cool stuff! Smarter than smart contexts.