
[v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type #16101


Draft
wants to merge 39 commits into main
Conversation

heheda12345 (Collaborator) commented Apr 5, 2025

This is the reference implementation of the hybrid allocator. I'm splitting it into smaller PRs and will do further cleanup in those smaller PRs.
Key differences from #13296 and #16178:
1. Only create one specialized manager for each type of attention. For instance, Gemma 3 uses 2 managers (one full-attention manager and one SWA manager) instead of 6 (one full-attention manager and five SWA managers). (Same as #16178, but with a different implementation.)
2. Hashing: compute the hash per block_size, instead of per kv cache group as in #13296 or only for the full attention layers as in #16178.
3. A general hybrid allocator, instead of the specialized one in #16178 or the two allocators (one for hybrid models and one for non-hybrid models) in #13296. A fast path for non-hybrid models can be added when necessary.
4. Introduce GroupedKVCacheBlock

# KVCacheBlocks for the same block of all kv cache groups that share the same
# kv cache spec (and belong to the same manager)
@dataclass
class GroupedKVCacheBlock:
    blocks: tuple[KVCacheBlock, ...]
to save the same block of all kv cache groups with the same kv cache spec (and belonging to the same manager). For example, a GroupedKVCacheBlock for Gemma 3 may contain 5 blocks for tokens [0-16] of the 5 SWA kv cache groups.
5. In block_pool, perform caching and eviction at the granularity of GroupedKVCacheBlock:
self.cached_block_hash_to_block: list[dict[BlockHashType, dict[
    int, GroupedKVCacheBlock]]] = [
        defaultdict(dict) for _ in range(num_specialized_managers)
    ]

so that we do not need to iterate over all groups to check whether every group has a cached block for a specific hash, as in this check from #16178:
if (cached_blocks and all(group_id in cached_blocks ...
6. Change the allocation result of KVCacheManager to
class KVCacheBlocks:
    blocks: list[list[GroupedKVCacheBlock]]
where blocks[i][j] is the GroupedKVCacheBlock of manager i for tokens [j * block_size, (j+1) * block_size]. With this data structure, each manager can work almost independently and does not need to iterate over all the groups it manages to update the allocation result (a short indexing sketch follows this list).
7. (Not included in this PR, but planned) Introduce a memory coordinator that serves as a middle layer between KVCacheManager and the SpecializedManagers, to simplify the logic of KVCacheManager.
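
To make the data structures above concrete, here is a minimal, self-contained sketch. It is not the PR's actual code: KVCacheBlock is reduced to a bare block_id and block_ids_for_manager is an illustrative helper. It shows how one specialized manager can read its own slice of the allocation result without touching the groups owned by other managers.

from dataclasses import dataclass


@dataclass
class KVCacheBlock:
    # Toy stand-in for vLLM's KVCacheBlock; only the block id is kept here.
    block_id: int


@dataclass
class GroupedKVCacheBlock:
    # The same logical block for every kv cache group handled by one manager.
    blocks: tuple[KVCacheBlock, ...]


@dataclass
class KVCacheBlocks:
    # blocks[manager_id][ith_block] -> GroupedKVCacheBlock
    blocks: list[list[GroupedKVCacheBlock]]


def block_ids_for_manager(result: KVCacheBlocks,
                          manager_id: int) -> list[tuple[int, ...]]:
    # Collect the physical block ids owned by a single manager, one tuple per
    # block index, without iterating over other managers' groups.
    return [
        tuple(block.block_id for block in grouped.blocks)
        for grouped in result.blocks[manager_id]
    ]


# Toy Gemma-3-like layout: manager 0 holds the single full-attention group,
# manager 1 holds the five SWA groups; one block allocated for each.
result = KVCacheBlocks(blocks=[
    [GroupedKVCacheBlock(blocks=(KVCacheBlock(0),))],
    [GroupedKVCacheBlock(blocks=tuple(KVCacheBlock(i) for i in range(1, 6)))],
])
print(block_ids_for_manager(result, manager_id=1))  # [(1, 2, 3, 4, 5)]

Because the outer index of blocks is the manager id, each specialized manager only ever reads and writes result.blocks[manager_id], which is what lets the managers work independently of one another.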

Hybrid allocator RFC #11382

github-actions bot commented Apr 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the v1 and tpu (Related to Google TPUs) labels on Apr 5, 2025
WoosukKwon (Collaborator) left a comment:


Thanks for the PR. Great work!

A few high-level suggestions:

  1. Can we first focus on the cases where every layer has the same embedding size? I think we can support Mamba or other cases in a future PR.
  2. Can we have an architecture like this?
[screenshot: proposed architecture diagram]

@@ -1,4 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# type: ignore
Collaborator commented:


What is this for?

@@ -22,7 +22,7 @@
KVCacheSpec)
from vllm.v1.outputs import ModelRunnerOutput
from vllm.v1.utils import bind_kv_cache
from vllm.v1.worker.tpu_model_runner import TPUModelRunner
from vllm.v1.worker.tpu_model_runner import TPUModelRunner # type: ignore
Collaborator commented:


What is this for?

mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Apr 9, 2025
mergify bot removed the needs-rebase label on Apr 23, 2025
mergify bot added the documentation (Improvements or additions to documentation) label on Apr 24, 2025
mergify bot commented Apr 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Apr 26, 2025
mergify bot removed the needs-rebase label on Apr 26, 2025
mergify bot commented Apr 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Apr 29, 2025
from vllm.v1.metrics.stats import PrefixCacheStats
from vllm.v1.request import Request, RequestStatus

logger = init_logger(__name__)


@dataclass
class KVCacheBlocks:
    blocks: list[list[GroupedKVCacheBlock]]
heheda12345 (Collaborator, Author) commented Apr 29, 2025


Three dimensions: blocks[manager_id][ith_block][group_id_in_manager], where the last level is the blocks tuple inside GroupedKVCacheBlock.
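
As an illustrative reading of those three levels, reusing the toy classes and the result instance from the sketch in the PR description above (the index values are arbitrary):

# manager_id=1, ith_block=0 -> a GroupedKVCacheBlock
grouped = result.blocks[1][0]
# group_id_in_manager=3 -> the KVCacheBlock of the 4th group in that manager
kv_block = grouped.blocks[3]
print(kv_block.block_id)  # 4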

heheda12345 (Collaborator, Author) commented:

Also CC @comaniac

renjie0 commented May 6, 2025

Could you please provide a design doc for the community to review? How does it work with prefix caching? How will it maximize the KV cache hit rate for the global attention layers? How will it work with speculative decoding? What is the potential timeline for this support? It has been 6 months since the RFC.

heheda12345 (Collaborator, Author) commented:

You can find the design in the PR description of #13296 and this PR.

shan18 commented May 6, 2025

Hi, I tried running inference on gemma-3-12b-it using your branch, but I keep getting invalid responses from the model.

For example, when I host the model with the vllm server like this:

python3 -m vllm.entrypoints.openai.api_server \
    --model google/gemma-3-12b-it \
    --trust-remote-code \
    --seed 1 \
    --host "0.0.0.0" \
    --port 5000 \
    --served-model-name "test-model" \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --enforce-eager

And then, when I give it a prompt from the AIME24 dataset, I get a response like this:

\u9154 cudd\u0cbf\u0c82\u0ca6\u179a\u17bc\u1794breviinction Coy blossom\u5414\u0b9f\u0b95 \u0939\u093e\u092eClo obstructions\u054f anf Zin \u0aa5\u05de\u05d9\u05ea Meat\u8c5agente\u0644\u0627\u0646Twelves .....

For comparison, when I do the same with the main vllm branch (v0.8.3), the response to the same prompt is this:

The uncertainty principle states that the uncertainty in energy (ΔE) and the lifetime (τ) of a quantum state are related by ΔE ≈ ħ/τ, where ħ is the reduced Planc ...

Is there anything that I need to set up separately in order for it to work?

heheda12345 (Collaborator, Author) commented:

@shan18 #17574 You need to cherry-pick this PR.

# Use copy to avoid modifying the original block_hashes
block_hashes = [
    block_hashes_dict[g.kv_cache_spec.block_size].copy()
    for g in self.kv_cache_config.kv_cache_groups
]
Contributor commented:


Use the number of specialized managers instead of kv_cache_groups?
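
For reference, a hedged sketch of what that suggestion might look like, assuming a hypothetical self.specialized_managers list whose entries expose a block_size attribute (these names are illustrative and not from this PR):

# Hypothetical variant: one copied hash list per specialized manager rather
# than per kv cache group (attribute names are illustrative).
block_hashes = [
    block_hashes_dict[manager.block_size].copy()
    for manager in self.specialized_managers
]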



@dataclass
class KVCacheNewTensor(KVCacheTensorBase):
Contributor commented:


Some of the tests might need to be updated for this naming (KVCacheNewTensor), e.g. test_kv_cache_utils.py.

heheda12345 (Collaborator, Author) replied:


Thanks for pointing it out. This PR is just a POC, so I didn't fix the tests.

shan18 commented May 12, 2025

@shan18 #17574 You need to cherry-pick this PR.

@heheda12345, I tried with the fixes in the PR you shared, but I still don't get any valid responses from the model. Do you have any test scripts that you used with your PR that I could try out?

heheda12345 (Collaborator, Author) commented:

@shan18 If I remember correctly, this PR should pass tests/v1/e2e/test_correctness_sliding_window.py after cherry-picking #17574.
But this PR is just a prototype and I don't plan to maintain it. I think the final implementation will be finished very soon.
