[v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type #16101
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# type: ignore
What is this for?
@@ -22,7 +22,7 @@
                          KVCacheSpec)
from vllm.v1.outputs import ModelRunnerOutput
from vllm.v1.utils import bind_kv_cache
from vllm.v1.worker.tpu_model_runner import TPUModelRunner
from vllm.v1.worker.tpu_model_runner import TPUModelRunner  # type: ignore
What is this for?
This pull request has merge conflicts that must be resolved before it can be merged.
from vllm.v1.metrics.stats import PrefixCacheStats
from vllm.v1.request import Request, RequestStatus

logger = init_logger(__name__)


@dataclass
class KVCacheBlocks:
    blocks: list[list[GroupedKVCacheBlock]]
3 dimensions: blocks[manager_id][ith_block][group_id_in_manager]
Also CC @comaniac
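As a rough illustration of that indexing (the classes below are simplified stand-ins with assumed field names, not the PR's actual definitions):

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    # Simplified stand-in for one physical KV cache block.
    block_id: int


@dataclass
class GroupedKVCacheBlock:
    # Assumed layout: one block per KV cache group handled by the same manager.
    blocks: list[Block] = field(default_factory=list)


# Hypothetical hybrid model with two specialized managers:
#   manager 0: full attention (1 group), manager 1: SWA (5 groups).
allocation = [
    [GroupedKVCacheBlock([Block(0)]), GroupedKVCacheBlock([Block(1)])],  # manager 0
    [GroupedKVCacheBlock([Block(i) for i in range(2, 7)])],              # manager 1
]

# blocks[manager_id][ith_block][group_id_in_manager]:
swa_block_for_group_3 = allocation[1][0].blocks[3]
```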
Could you please provide a design doc for the community to review? How does it work with prefix caching? How will it maximize the KV cache hit rate for the global attention layers? How will it work with speculative decoding? What is the potential timeline for this support? It has been 6 months since the RFC.
You can find the design in the PR description of #13296 and this PR.
Hi, I tried running inference on the model with this PR, but the responses look broken. For example, when I host the model with the vLLM server like this:
And then when I give it a prompt from the AIME24 dataset, I get a response like this:
For comparison, when I do the same with the main vllm branch (v0.8.3), the response to the same prompt is this:
Is there anything that I need to set up separately in order for it to work?
# Use copy to avoid modifying the original block_hashes
block_hashes = [
    block_hashes_dict[g.kv_cache_spec.block_size].copy()
    for g in self.kv_cache_config.kv_cache_groups
]
Use the number of specialized managers instead of kv_cache_groups?
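To illustrate the design point behind the excerpt above (hashes are computed once per distinct block_size and each group then receives its own copy), here is a rough standalone sketch; the helper and dict layout are assumptions, not the PR's actual code:

```python
# Hypothetical sketch of per-block_size hashing.
def compute_block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    # Placeholder chained hash: one value per *full* block of tokens.
    hashes, prev = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        prev = hash((prev, tuple(token_ids[start:start + block_size])))
        hashes.append(prev)
    return hashes


token_ids = list(range(100))
block_sizes = {16, 64}  # assume the groups use two distinct block sizes

# Compute once per block_size ...
block_hashes_dict = {bs: compute_block_hashes(token_ids, bs) for bs in block_sizes}

# ... then give each group its own copy, so per-group mutations stay independent.
groups = [{"block_size": 16}, {"block_size": 16}, {"block_size": 64}]
block_hashes = [block_hashes_dict[g["block_size"]].copy() for g in groups]
```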
@dataclass
class KVCacheNewTensor(KVCacheTensorBase):
Some of the tests might need to be updated to this naming (KVCacheNewTensor), e.g. test_kv_cache_utils.py.
Thanks for pointing it out. This PR is just a POC, so I didn't fix the tests.
@heheda12345, I tried with the fixes in the PR you shared, but I still don't get any valid responses from the model. Do you have any test scripts that you used with your PR that I can try out?
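Not from this PR, but a minimal offline sanity check along these lines might help isolate the issue (the model name is a placeholder; the standard vLLM offline LLM API is assumed):

```python
from vllm import LLM, SamplingParams

# Placeholder: substitute the hybrid (SWA + full attention) model being tested.
llm = LLM(model="<hybrid-model-id>", max_model_len=4096)

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=32)

for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```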
The reference implementation of the hybrid allocator. I'm splitting it into smaller PRs and will do further cleanup in those smaller PRs.
Key differences from #13296 and #16178:
1. Only create one specialized manager for each type of attention. For instance, Gemma 3 uses 2 managers (a full attention manager and a SWA manager) instead of 6 (one full attention manager and five SWA managers). This is the same as #16178 but with a different implementation; see the sketch at the end of this description.
2. Hash: compute the hash for each block_size, instead of for each KV cache group as in #13296 or only for the full attention layers as in #16178.
3. A general hybrid allocator, instead of the specialized one in #16178 or the two allocators (one for hybrid models and another for non-hybrid models) in #13296. Add a fast path for non-hybrid models when necessary.
4. Introduce GroupedKVCacheBlock (vllm/v1/core/kv_cache_utils.py, lines 873 to 877 in e5cb02e).
5. In block_pool, perform caching and eviction at the granularity of GroupedKVCacheBlock (vllm/v1/core/block_pool.py, lines 53 to 56 in e5cb02e), so that we do not need to iterate over all groups to check whether every group has a cached block for a specific hash (vllm/v1/core/specialized_manager.py, line 142 in 0fa9747).
6. Change the allocation result of KVCacheManager to KVCacheBlocks (vllm/v1/core/kv_cache_manager.py, lines 22 to 23 in e5cb02e).
7. (Not included in this PR, but planned) Introduce a memory coordinator that serves as a middle layer between KVCacheManager and the SpecializedManagers, to simplify the logic of KVCacheManager.
Hybrid allocator RFC #11382
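As a rough illustration of point 1 above (names and specs here are simplified assumptions, not the PR's actual classes), layers can be grouped by their attention spec so that each distinct attention type gets exactly one specialized manager:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    """Simplified stand-in for a per-layer KV cache spec."""
    attention_type: str          # e.g. "full" or "sliding_window"
    block_size: int
    sliding_window: int | None = None


def group_layers_by_attention_type(
        layer_specs: dict[str, LayerSpec]) -> dict[LayerSpec, list[str]]:
    """One specialized manager per distinct attention spec."""
    groups: dict[LayerSpec, list[str]] = defaultdict(list)
    for layer_name, spec in layer_specs.items():
        groups[spec].append(layer_name)
    return dict(groups)


# Gemma-3-like pattern: one full-attention layer per five sliding-window layers.
layer_specs = {
    f"layers.{i}": (LayerSpec("full", 16) if i % 6 == 5 else
                    LayerSpec("sliding_window", 16, sliding_window=512))
    for i in range(12)
}
managers = group_layers_by_attention_type(layer_specs)
assert len(managers) == 2  # one full-attention manager, one SWA manager
```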