
[BugFix] fix speculative decoding memory leak when speculation is disabled #15506


Merged: 10 commits merged into vllm-project:main on May 1, 2025

Conversation

noyoshi
Contributor

@noyoshi noyoshi commented Mar 25, 2025

There is a memory leak associated with self.previous_hidden_states on the spec_decode_worker when using --speculative-disable-by-batch-size.

It comes from the fact that we only clean up the tensors inside the speculative decoding codepath; when speculation is disabled by batch size, we go through the _run_no_spec function, which never frees the hidden states.

You can verify this by checking the output of

import torch
import vllm

llm = vllm.LLM(
    model="model",                    # target model (placeholder name)
    speculative_model="spec-model",   # draft model (placeholder name)
    quantization="fp8",
    enable_prefix_caching=False,
    enable_chunked_prefill=True,
    num_speculative_tokens=3,
)
sampling_params = llm.get_default_sampling_params()
sampling_params.max_tokens = 100
sampling_params.temperature = 0.0

for i in range(1000):  # Simulate multiple inference runs
    prompt = f"Test prompt {i}"
    outputs = llm.generate([prompt] * 2, sampling_params)

    if i % 10 == 0:
        print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2} MB")
        print(f"Cached: {torch.cuda.memory_reserved() / 1024**2} MB")

You can hard-code no_spec=True in execute_model in the speculative decoding worker to simulate a deployment whose batch size exceeds the max speculative batch size and stays there for a long period of time.

I think this was missed because it was probably tested with the following scenario:

1. Start the model with speculation enabled and disable-by-batch-size = X.
2. Batch size hits X, speculation is disabled --> CUDA memory starts growing without stopping.
3. Batch size goes below X, speculation is re-enabled --> CUDA memory is reset / fixed via self.previous_hidden_states = None.

It won't show up in this example, but in the real world, oftentimes once the model goes above X (the max batch size to run speculation), it will stay under heavy load for a long period of time - say for the entire work day (8 hours) - at which point you will likely run into a CUDA OOM and increasingly bad performance.
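
For reference, a minimal sketch of the kind of fix, assuming the worker layout described above (the method signature and the scorer_worker attribute are illustrative assumptions, not the exact merged diff): drop the cached hidden states on the non-speculative path as well, so they cannot accumulate while speculation stays disabled.

# Illustrative sketch only -- not the exact diff in this PR.
def _run_no_spec(self, execute_model_req, skip_proposer):
    """Run the target model for one step without speculation."""
    sampler_output = self.scorer_worker.execute_model(execute_model_req)

    # The cached hidden states were previously released only on the
    # speculative codepath. Dropping the reference here as well lets
    # PyTorch free the GPU memory even when the batch size stays above
    # --speculative-disable-by-batch-size for a long time.
    self.previous_hidden_states = None

    return sampler_output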


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Noah Yoshida <[email protected]>
Contributor

@NickLucche NickLucche left a comment

Hey thanks a lot for the contribution!
This looks correct, but I think to reproduce it more easily you need to set speculative_disable_by_batch_size and also use the async interface to stack up reqs.
Otherwise memory should be constant.

Would you mind turning the example into a unit test that we can run to verify everything is working as intended?

noyoshi added 4 commits March 29, 2025 10:51
Signed-off-by: Noah Yoshida <[email protected]>
Signed-off-by: Noah Yoshida <[email protected]>
Signed-off-by: Noah Yoshida <[email protected]>
Signed-off-by: Noah Yoshida <[email protected]>
Contributor

@NickLucche NickLucche left a comment

Nice work with the test!

What do you think about using the online API in the test with

    with RemoteOpenAIServer(model_name, server_args) as remote_server:
        client: openai.AsyncOpenAI = remote_server.get_async_client()

I think it serves our purpose a bit better. Can you give it a try?
We should still be able to track the CUDA memory usage from the test process.
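
For illustration, a rough sketch of what such a test could look like. RemoteOpenAIServer is the vLLM test helper from the snippet above; the import path, model names, server args, concurrency, and the 5% tolerance are placeholder assumptions, not the final test. Device-wide memory is read via torch.cuda.mem_get_info() because the server runs in its own process, so torch.cuda.memory_allocated() in the test process would not see its allocations.

# Rough sketch only -- args, names, and thresholds are placeholders.
import asyncio

import openai
import torch

from tests.utils import RemoteOpenAIServer  # vLLM test helper (path may differ)

MODEL = "target-model"  # placeholder
SERVER_ARGS = [          # placeholders
    "--speculative-model", "spec-model",
    "--num-speculative-tokens", "3",
    "--speculative-disable-by-batch-size", "2",
]

def device_used_mb() -> float:
    # Device-wide usage, which includes the server process's memory.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024**2

async def burst(client: openai.AsyncOpenAI, n: int) -> None:
    # n concurrent requests keep the batch size above the disable threshold.
    await asyncio.gather(*[
        client.completions.create(model=MODEL, prompt=f"Test prompt {i}",
                                  max_tokens=100, temperature=0.0)
        for i in range(n)
    ])

async def run_bursts(remote_server) -> tuple[float, float]:
    client = remote_server.get_async_client()
    await burst(client, 32)   # warm up, fill caches
    baseline = device_used_mb()
    await burst(client, 32)   # identical load again
    return baseline, device_used_mb()

def test_no_leak_when_spec_disabled():
    with RemoteOpenAIServer(MODEL, SERVER_ARGS) as remote_server:
        baseline, after = asyncio.run(run_bursts(remote_server))
        # Memory should not keep growing between identical load bursts.
        assert after <= baseline * 1.05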

Other than that we're good to go here.

@noyoshi
Contributor Author

noyoshi commented Apr 1, 2025

@NickLucche I think it's actually a lot harder to test that way - because you need to simulate a situation where you never have smaller batch sizes (which would re-enable speculation and clear the variable).

So something like:

1. Low QPS initially (optional)
2. High constant QPS
   - Check CUDA memory usage
3. Maintain the same high QPS (so memory usage should hold constant)
   - Check CUDA memory usage

Not quite sure how to instrument that with the standard OpenAI client.

@NickLucche
Contributor

We have utils in benchmark_serving.py to send requests at a given rate.
Otherwise it should still show a memory increase if you just keep sending async requests in a loop, as you're bound to cross the batch size limit that disables spec decoding.
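
As a rough illustration of that idea (the rate, prompt, and model name are placeholders), something like this keeps async requests in flight at a fixed rate so the batch size stays above the threshold that disables speculation:

# Sketch of pacing async requests without waiting for responses, so many
# requests stay in flight at once (placeholder values throughout).
import asyncio
import openai

async def constant_load(client: openai.AsyncOpenAI, model: str,
                        qps: float, duration_s: float) -> None:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + duration_s
    tasks = []
    i = 0
    while loop.time() < deadline:
        tasks.append(asyncio.create_task(
            client.completions.create(model=model,
                                      prompt=f"Test prompt {i}",
                                      max_tokens=100)))
        i += 1
        await asyncio.sleep(1.0 / qps)  # fire at ~qps, don't await replies
    await asyncio.gather(*tasks)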

Anyway, my suggestion is non-blocking; we can still merge this.

@njhill njhill added the v0 and bug (Something isn't working) labels on Apr 2, 2025
Member

@njhill njhill left a comment

@njhill
Member

njhill commented Apr 2, 2025

@noyoshi could you merge in the latest main branch? Not sure what the issue is with the tests.

@noyoshi
Contributor Author

noyoshi commented Apr 7, 2025

@njhill How's it looking? Can someone run the CI? :)

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 11, 2025 03:35
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Apr 11, 2025
@DarkLight1337
Member

PTAL at the failing test

@noyoshi
Contributor Author

noyoshi commented Apr 18, 2025

Test failures look completely unrelated to my single-line change; is there a way to re-run them? Should I merge off master again?

@DarkLight1337
Member

Yeah, can you try merging from main? Sorry for the delay

@vllm-bot vllm-bot merged commit 13cf6b6 into vllm-project:main May 1, 2025
41 of 43 checks passed
tlrmchlsmth added a commit to tlrmchlsmth/vllm that referenced this pull request May 1, 2025
* Revert "[Misc] Add S3 environment variables for better support of MinIO." (vllm-project#17021)

* [misc] tune some env vars for GB200 (vllm-project#16992)

Signed-off-by: youkaichao <[email protected]>

* [INTEL-HPU][v0] Port delayed sampling to upstream (vllm-project#16949)

Signed-off-by: Michal Adamczyk <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: Michal Adamczyk <[email protected]>

* [doc] add download path tips (vllm-project#17013)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Triton FA function takes no keyword arguments (vllm-project#16902)

Signed-off-by: vllmellm <[email protected]>

* [V1] Avoid socket errors during shutdown when requests are in in-flight (vllm-project#16807)

Signed-off-by: Nick Hill <[email protected]>

* [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) (vllm-project#16998)

Signed-off-by: Lucas Wilkinson <[email protected]>

* [Misc] Improve readability of get_open_port function. (vllm-project#17024)

Signed-off-by: gitover22 <[email protected]>

* [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers (vllm-project#16964)

Signed-off-by: chaunceyjiang <[email protected]>

* [CI] Run v1/test_serial_utils.py in CI (vllm-project#16996)

Signed-off-by: Russell Bryant <[email protected]>

* Mistral-format support for compressed-tensors (vllm-project#16803)

Signed-off-by: mgoin <[email protected]>

* Categorize `tests/kernels/` based on kernel type (vllm-project#16799)

Signed-off-by: mgoin <[email protected]>

* [Doc] Add top anchor and a note to quantization/bitblas.md (vllm-project#17042)

Signed-off-by: windsonsea <[email protected]>

* Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` (vllm-project#17051)

Signed-off-by: Harry Mellor <[email protected]>

* [CI] Update structured-output label automation (vllm-project#17055)

Signed-off-by: Russell Bryant <[email protected]>

* Improve Transformers backend model loading QoL (vllm-project#17039)

Signed-off-by: Harry Mellor <[email protected]>

* `CacheConfig.block_size` should always be `int` when used (vllm-project#17052)

Signed-off-by: Harry Mellor <[email protected]>

* Use `@property` and private field for `data_parallel_rank_local` (vllm-project#17053)

Signed-off-by: Harry Mellor <[email protected]>

* [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar (vllm-project#15949)

Signed-off-by: Travis Johnson <[email protected]>

* [BugFix][V1] Fix int32 token index overflow when preparing input ids (vllm-project#16806)

* [V1][Spec Decode] Always use argmax for sampling draft tokens  (vllm-project#16899)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] workaround for CI build failure (vllm-project#17070)

Signed-off-by: csy1204 <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [Quantization]add prefix for commandA quantized model (vllm-project#17017)

* [Minor] Use larger batch sizes for A100/B100/B200/MI300x (vllm-project#17073)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix] Enable V1 usage stats (vllm-project#16986)

Signed-off-by: mgoin <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* More informative error when using Transformers backend (vllm-project#16988)

Signed-off-by: Harry Mellor <[email protected]>

* Addendum Fix to support FIPS enabled machines with MD5 hashing (vllm-project#17043)

Signed-off-by: sydarb <[email protected]>

* [Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… (vllm-project#16472)

Signed-off-by: 开哲 <[email protected]>
Co-authored-by: 开哲 <[email protected]>

* [V1] Update structured output (vllm-project#16812)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [doc] update to hyperlink (vllm-project#17096)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Add docs for runai_streamer_sharded (vllm-project#17093)

Signed-off-by: Omer Dayan (SW-GPU) <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Chore] Remove Sampler from Model Code (vllm-project#17084)

Signed-off-by: Woosuk Kwon <[email protected]>

* Disable enforce_eager for V1 TPU sampler and structured output tests (vllm-project#17016)

Signed-off-by: mgoin <[email protected]>

* Simplify `TokenizerGroup` (vllm-project#16790)

Signed-off-by: Harry Mellor <[email protected]>

* Fix OOT registration test (vllm-project#17099)

Signed-off-by: Harry Mellor <[email protected]>

* [V1][PP] Optimization: continue scheduling prefill chunks (vllm-project#17080)

Signed-off-by: Rui Qiao <[email protected]>

* [Misc] Remove OLMo2 config copy (vllm-project#17066)

Signed-off-by: Isotr0py <[email protected]>

* Improve static type checking in `LoRAModelRunnerMixin` (vllm-project#17104)

Signed-off-by: Harry Mellor <[email protected]>

* [V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning (vllm-project#16954)

Signed-off-by: shen-shanshan <[email protected]>

* [Frontend] Using matryoshka_dimensions control the allowed output dimensions. (vllm-project#16970)

* Add missing rocm_skinny_gemms kernel test to CI (vllm-project#17060)

Signed-off-by: mgoin <[email protected]>

* [Misc] refactor example series - structured outputs (vllm-project#17040)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics (vllm-project#16665)

Signed-off-by: Mark McLoughlin <[email protected]>

* [CI] Add automation for the `tool-calling` github label (vllm-project#17118)

Signed-off-by: Russell Bryant <[email protected]>

* Updating builkite job for IBM Power  (vllm-project#17111)

Signed-off-by: Aaruni Aggarwal <[email protected]>

* existing torch installation pip command fix for docs (vllm-project#17059)

* Molmo Requirements (vllm-project#17026)

Signed-off-by: Eyshika Agarwal <[email protected]>
Signed-off-by: eyshika <[email protected]>

* Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings render properly (vllm-project#17124)

Signed-off-by: Harry Mellor <[email protected]>

* Improve configs - `LoRAConfig` + `PromptAdapterConfig` (vllm-project#16980)

Signed-off-by: Harry Mellor <[email protected]>

* [Docs] Generate correct github links for decorated functions (vllm-project#17125)

Signed-off-by: Russell Bryant <[email protected]>

* Add collective_rpc to llm engine (vllm-project#16999)

Signed-off-by: Yinghai Lu <[email protected]>

* Add chat template for Llama 4 models (vllm-project#16428)

Signed-off-by: Max de Bayser <[email protected]>

* [Misc] Add example to run DeepSeek with Ray Serve LLM (vllm-project#17134)

Signed-off-by: Rui Qiao <[email protected]>

* Better error message for missing mistral params.json (vllm-project#17132)

Signed-off-by: mgoin <[email protected]>

* Use custom address for listening socket (vllm-project#15988)

Signed-off-by: Jens Glaser <[email protected]>

* [FEAT] [ROCm]: AITER Fused MOE V1 Support (vllm-project#16752)

Signed-off-by: vllmellm <[email protected]>
Co-authored-by: tjtanaa <[email protected]>

* [Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 (vllm-project#16864)

Signed-off-by: Lucas Wilkinson <[email protected]>

* fix float16 support for kimi-vl (vllm-project#17156)

Co-authored-by: zhouzaida <[email protected]>

* [Doc] V1 : Update LoRA status (vllm-project#17133)

Signed-off-by: varun sundar rabindranath <[email protected]>
Co-authored-by: varun sundar rabindranath <[email protected]>

* [Docs] Fix True->true in supported_models.md (vllm-project#17141)

* Move missed `SchedulerConfig` args into scheduler config group in `EngineArgs` (vllm-project#17131)

Signed-off-by: Harry Mellor <[email protected]>

* [Misc] Clean up redundant code in uniproc_executor.py (vllm-project#16762)

Signed-off-by: Lifu Huang <[email protected]>

* [Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton (vllm-project#15099)

Signed-off-by: Mengqing Cao <[email protected]>

* [Misc] Benchmark Serving Script Support Appending Results (vllm-project#17028)

Signed-off-by: Lucas Wilkinson <[email protected]>

* [Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance (vllm-project#16457)

Signed-off-by: cynthieye <[email protected]>
Co-authored-by: MagnetoWang <[email protected]>

* [Bugfix] remove fallback in guided_json (int range, patterns) (vllm-project#16725)

Signed-off-by: csy1204 <[email protected]>
Co-authored-by: 조상연[플레이스 AI] <[email protected]>

* [Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization (vllm-project#15734)

Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>

* [Doc] Add headings to improve gptqmodel.md (vllm-project#17164)

Signed-off-by: windsonsea <[email protected]>

* Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 (vllm-project#17158)

* [Doc] Add two links to disagg_prefill.md (vllm-project#17168)

Signed-off-by: windsonsea <[email protected]>

* [Doc] Move todo out of beam search docstring (vllm-project#17183)

Signed-off-by: Alex-Brooks <[email protected]>

* [Bugfix] Fix mistral model tests (vllm-project#17181)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix Mistral ChatCompletionRequest Body Exception (vllm-project#16769)

Signed-off-by: Jasmond Loh <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* Bump Transformers to 4.51.3 (vllm-project#17116)

Signed-off-by: Harry Mellor <[email protected]>

* Use Transformers helper `get_text_config()` instead of checking for `text_config` (vllm-project#17105)

Signed-off-by: Harry Mellor <[email protected]>

* [doc] update wrong hf model links (vllm-project#17184)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Inline Molmo requirements (vllm-project#17190)

Signed-off-by: DarkLight1337 <[email protected]>

* [Security] Use safe serialization and fix zmq setup for mooncake pipe (vllm-project#17192)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>

* [V1] Move usage stats to worker and start logging TPU hardware (vllm-project#16211)

* [Bugfix] Fix hybrid model tests (vllm-project#17182)

Signed-off-by: DarkLight1337 <[email protected]>

* Fix Python packaging edge cases (vllm-project#17159)

Signed-off-by: Christian Heimes <[email protected]>

* [BugFix][Frontend] Fix `LLM.chat()` tokenization (vllm-project#16081)

Signed-off-by: Nick Hill <[email protected]>

* [V1][Spec Decode] EAGLE-3 Support (vllm-project#16937)

Signed-off-by: Bryan Lu <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Co-authored-by: Bryan Lu <[email protected]>

* [Misc] Refine ray_serve_deepseek example (vllm-project#17204)

Signed-off-by: Rui Qiao <[email protected]>

* [Bugfix] gemma[2,3] interleaved attention when sliding window is disabled (vllm-project#17180)

Signed-off-by: Chen Zhang <[email protected]>

* [AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary (vllm-project#17215)

Signed-off-by: Randall Smith <[email protected]>

* [v1] [P/D] Adding LMCache KV connector for v1 (vllm-project#16625)

* [Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env (vllm-project#17142)

Signed-off-by: James Wu <[email protected]>

* [MISC][AMD] Add unused annotation to rocm kernel file (vllm-project#17097)

Signed-off-by: Lu Fang <[email protected]>

* [doc] add Anything LLM integration (vllm-project#17216)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Minor][Spec Decode] Add use_eagle to SpeculativeConfig (vllm-project#17213)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Doc] Minor fix for the vLLM TPU setup page (vllm-project#17206)

Signed-off-by: Yarong Mu <[email protected]>

* [Minor][Models] Fix Return Types of Llama & Eagle (vllm-project#17220)

Signed-off-by: Woosuk Kwon <[email protected]>

* Allocate kv_cache with stride order (vllm-project#16605)

Signed-off-by: shuw <[email protected]>

* [ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. (vllm-project#17011)

Signed-off-by: charlifu <[email protected]>

* [V1][Metrics] Allow V1 AsyncLLM to use custom logger (vllm-project#14661)

Signed-off-by: Zijing Liu <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [BugFix] Avoid race conditions in zero-copy tensor transmission (vllm-project#17203)

Signed-off-by: Nick Hill <[email protected]>

* [CI/test] Fix Eagle Correctness Test (vllm-project#17209)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Core] Remove prompt string from engine core data structures (vllm-project#17214)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix] Fix missing int type for `-n` in multi-image example (vllm-project#17223)

* [Bugfix] Fix standard models tests (vllm-project#17217)

Signed-off-by: DarkLight1337 <[email protected]>

* [Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device (vllm-project#17186)

Signed-off-by: Agata Dobrzyniewicz <[email protected]>

* [V1] Add `structural_tag` support using xgrammar (vllm-project#17085)

* [BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set (vllm-project#17088)

Signed-off-by: Andy Xie <[email protected]>

* [Chore] added stubs for `vllm_flash_attn` during development mode (vllm-project#17228)

Signed-off-by: Aaron Pham <[email protected]>

* [Docs] Update structured output doc for V1 (vllm-project#17135)

Signed-off-by: Russell Bryant <[email protected]>

* [Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps` (vllm-project#9276)

Signed-off-by: changjun.lee <[email protected]>

* Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 (vllm-project#16573)

Signed-off-by: Lu Fang <[email protected]>

* [MISC] rename interval to max_recent_requests (vllm-project#14285)

* [Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation (vllm-project#16878)

Signed-off-by: imkero <[email protected]>

* [Minor] Fix lint error in main branch (vllm-project#17233)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh (vllm-project#16271)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Update test_flash_attn.py (vllm-project#17102)

Signed-off-by: ShuaibinLi <[email protected]>

* [Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel (vllm-project#12591)

Signed-off-by: Randall Smith <[email protected]>

* [Misc] Make cached tokenizer pickle-compatible (vllm-project#17048)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix QWen2 VL multimodal mapping (vllm-project#17240)

Signed-off-by: Jee Jee Li <[email protected]>

* [Bugfix] Get a specific type of layer from forward context (vllm-project#17222)

Signed-off-by: Chen Zhang <[email protected]>

* [MISC] Use string annotation types for class definitions (vllm-project#17244)

Signed-off-by: Jade Zheng <[email protected]>

* [Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens (vllm-project#17033)

Signed-off-by: sfc-gh-zhwang <[email protected]>

* [Bugfix] Fix Lora Name Parsing (vllm-project#17196)

Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [NVIDIA] Support Cutlass MLA for Blackwell GPUs (vllm-project#16032)

Signed-off-by: kaixih <[email protected]>

* [Feature] support sequence parallelism using compilation pass (vllm-project#16155)

Signed-off-by: cascade812 <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [doc] Add feature status legend (vllm-project#17257)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Metrics] Fix minor inconsistencies in bucket progression (vllm-project#17262)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1][Spec Decode] Make eagle compatible with prefix caching. (vllm-project#17137)

Signed-off-by: LiuXiaoxuanPKU <[email protected]>

* [BugFix] Fix vllm_flash_attn install issues (vllm-project#17267)

Signed-off-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Bugfix] Fix missing ARG in Dockerfile for arm64 platforms (vllm-project#17261)

Signed-off-by: lkm-schulz <[email protected]>

* [Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… (vllm-project#16751)

Signed-off-by: Ther-LF <[email protected]>

* [Bugfix] Fix Mistral3 spatial merge error (vllm-project#17270)

Signed-off-by: mgoin <[email protected]>

* [Doc] Fix wrong github link in LMCache examples (vllm-project#17274)

Signed-off-by: KuntaiDu <[email protected]>

* [Doc] small fix (vllm-project#17277)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Validate `stop_token_ids` contents (vllm-project#17268)

Signed-off-by: Nick Hill <[email protected]>

* [Minor][Models] Pass partial_rotary_factor parameter to rope (vllm-project#17266)

Signed-off-by: evian <[email protected]>
Co-authored-by: evian <[email protected]>

* [Core] Remove legacy input mapper/processor from V0 (vllm-project#15686)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Add Granite Speech Support (vllm-project#16246)

Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>

* Update tpu_worker.py 's typo (vllm-project#17288)

* Add missing class docstring for `PromptAdapterConfig` (vllm-project#17302)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] Add missing `get_language_model` to new MLLMs (vllm-project#17300)

Signed-off-by: DarkLight1337 <[email protected]>

* [doc] update wrong model id (vllm-project#17287)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Minor typo/grammar in `platforms/interface.py` (vllm-project#17307)

Signed-off-by: NickLucche <[email protected]>

* [Misc] Clean up Qwen2.5-Omni code (vllm-project#17301)

Signed-off-by: DarkLight1337 <[email protected]>

* [Docs] Add a security guide (vllm-project#17230)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* Improve conversion from dataclass configs to argparse arguments (vllm-project#17303)

Signed-off-by: Harry Mellor <[email protected]>

* Make name of `compressed-tensors` quant method consistent across vLLM (vllm-project#17255)

Signed-off-by: Harry Mellor <[email protected]>

* Explicitly explain quant method override ordering and ensure all overrides are ordered (vllm-project#17256)

Signed-off-by: Harry Mellor <[email protected]>

* [Security] Don't bind tcp zmq socket to all interfaces (vllm-project#17197)

Signed-off-by: Russell Bryant <[email protected]>

* [Chore] cleanup license indicators in light of SPDX (vllm-project#17259)

Signed-off-by: Aaron Pham <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>

* [BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) (vllm-project#17283)

Signed-off-by: Lucas Wilkinson <[email protected]>

* [Bugfix] Fix moe weight losing all extra attrs after `process_weights_after_loading`. (vllm-project#16854)

Signed-off-by: charlifu <[email protected]>

* [Model] Qwen3 Dense FP8 Compat Fixes (vllm-project#17318)

Signed-off-by: simon-mo <[email protected]>

* Support loading transformers models with named parameters (vllm-project#16868)

Signed-off-by: Alex <[email protected]>

* [Model] Add tuned triton fused_moe configs for Qwen3Moe (vllm-project#17328)

Signed-off-by: mgoin <[email protected]>

* [Benchmark] Add single turn MTBench to Serving Bench (vllm-project#17202)

* [Optim] Compute multimodal hash only once per item (vllm-project#17314)

Signed-off-by: DarkLight1337 <[email protected]>

* implement Structural Tag with Guidance backend (vllm-project#17333)

Signed-off-by: Michal Moskal <[email protected]>

* [V1][Spec Decode] Make Eagle model arch config driven (vllm-project#17323)

* [model] make llama4 compatible with pure dense layers (vllm-project#17315)

Signed-off-by: Lucia Fang <[email protected]>

* [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu (vllm-project#17316)

* Ignore `'<string>'` filepath (vllm-project#17330)

Signed-off-by: rzou <[email protected]>

* [Bugfix] Add contiguous call inside rope kernel wrapper (vllm-project#17091)

Signed-off-by: 苏政渊 <[email protected]>
Co-authored-by: 苏政渊 <[email protected]>

* [Misc] Add a Jinja template to support Mistral3 function calling (vllm-project#17195)

Signed-off-by: chaunceyjiang <[email protected]>

* [Model] support MiniMax-VL-01 model (vllm-project#16328)

Signed-off-by: qingjun <[email protected]>

* [Misc] Move config fields to MultiModalConfig (vllm-project#17343)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc]Use a platform independent interface to obtain the device attributes (vllm-project#17100)

* [Fix] Documentation spacing in compilation config help text (vllm-project#17342)

Signed-off-by: Zerohertz <[email protected]>

* [Build][Bugfix] Restrict setuptools version to <80 (vllm-project#17320)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Model] Ignore rotary embed load for Cohere model (vllm-project#17319)

* Update docs requirements (vllm-project#17379)

Signed-off-by: Harry Mellor <[email protected]>

* [Doc] Fix QWen3MOE info (vllm-project#17381)

Signed-off-by: Jee Jee Li <[email protected]>

* [Bugfix] Clean up MiniMax-VL and fix processing (vllm-project#17354)

Signed-off-by: DarkLight1337 <[email protected]>

* `pre-commit autoupdate` (vllm-project#17380)

Signed-off-by: Harry Mellor <[email protected]>

* [Frontend] Support `chat_template_kwargs` in `LLM.chat` (vllm-project#17356)

Signed-off-by: DarkLight1337 <[email protected]>

* Transformers backend tweaks (vllm-project#17365)

Signed-off-by: Harry Mellor <[email protected]>

* Fix: Spelling of inference (vllm-project#17387)

* Improve literal dataclass field conversion to argparse argument (vllm-project#17391)

Signed-off-by: Harry Mellor <[email protected]>

* [V1] Remove num_input_tokens from attn_metadata (vllm-project#17193)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix] add qwen3 reasoning-parser fix content is None when disable … (vllm-project#17369)

Signed-off-by: mofanke <[email protected]>

* fix gemma3 results all zero (vllm-project#17364)

Signed-off-by: mayuyuace <[email protected]>

* [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build (vllm-project#17289)

Signed-off-by: Tianyuan Wu <[email protected]>

* Enabling multi-group kernel tests. (vllm-project#17115)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Docs] Propose a deprecation policy for the project (vllm-project#17063)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Doc][Typo] Fixing label in new model requests link in overview.md (vllm-project#17400)

* [TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip install --e` on TPU (vllm-project#17374)

Signed-off-by: NickLucche <[email protected]>

* [CI] Uses Python 3.11 for TPU (vllm-project#17359)

Signed-off-by: Aaron Pham <[email protected]>

* [CI/Build] Add retry mechanism for add-apt-repository (vllm-project#17107)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Fix Minicpm-O-int4 GPTQ model inference (vllm-project#17397)

Signed-off-by: Isotr0py <[email protected]>

* Simplify (and fix) passing of guided decoding backend options (vllm-project#17008)

Signed-off-by: Harry Mellor <[email protected]>

* Remove Falcon3 2x7B from CI (vllm-project#17404)

Signed-off-by: Harry Mellor <[email protected]>

* Fix: Python package installation for opentelmetry (vllm-project#17049)

Signed-off-by: Dilip Gowda Bhagavan <[email protected]>

* [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE (vllm-project#17211)

Signed-off-by: Bryan Lu <[email protected]>

* Remove Bamba 9B from CI (vllm-project#17407)

Signed-off-by: Harry Mellor <[email protected]>

* [V1][Feature] Enable Speculative Decoding with Structured Outputs (vllm-project#14702)

Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>

* [release] Always git fetch all to get latest tag on TPU release (vllm-project#17322)

* Truncation control for embedding models (vllm-project#14776)

Signed-off-by: Gabriel Marinho <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Max de Bayser <[email protected]>

* Update PyTorch to 2.7.0 (vllm-project#16859)

* Improve configs - `ModelConfig` (vllm-project#17130)

Signed-off-by: Harry Mellor <[email protected]>

* Fix call to `logger.info_once` (vllm-project#17416)

Signed-off-by: Harry Mellor <[email protected]>

* Fix some speculative decode tests with tl.dot (vllm-project#17371)

Signed-off-by: Huy Do <[email protected]>

* Support LoRA for Mistral3 (vllm-project#17428)

Signed-off-by: mgoin <[email protected]>

* [Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue (vllm-project#17298)

Signed-off-by: Kunshang Ji <[email protected]>

* [Hardware][Intel GPU] Upgrade to torch 2.7 (vllm-project#17444)

Signed-off-by: Kunshang Ji <[email protected]>
Co-authored-by: Qiming Zhang <[email protected]>

* [Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' (vllm-project#17434)

Signed-off-by: chaunceyjiang <[email protected]>

* [MODEL ADDITION] Ovis2 Model Addition (vllm-project#15826)

Signed-off-by: Marco <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* Make the _apply_rotary_emb compatible with dynamo (vllm-project#17435)

* [Misc] Remove deprecated files (vllm-project#17447)

Signed-off-by: chaunceyjiang <[email protected]>

* [V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None (vllm-project#15755)

Signed-off-by: rongfu.leng <[email protected]>

* [TPU][V1][CI] Update regression test baseline for v6 CI (vllm-project#17064)

Signed-off-by: NickLucche <[email protected]>

* [Core] Prevent side-channel attacks via cache salting (vllm-project#17045)

Signed-off-by: Marko Rosenmueller <[email protected]>

* [V1][Metrics] add support for kv event publishing (vllm-project#16750)

Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>

* [Feature] The Qwen3 reasoning parser supports  guided decoding (vllm-project#17466)

Signed-off-by: chaunceyjiang <[email protected]>

* [Docs] Add command for running mypy tests from CI (vllm-project#17475)

Signed-off-by: Russell Bryant <[email protected]>

* [Fix] Support passing args to logger (vllm-project#17425)

Signed-off-by: Aaron Pham <[email protected]>

* [Bugfix] Fixed mistral tokenizer path when pointing to file (vllm-project#17457)

Signed-off-by: Pete Savage <[email protected]>

* [V1] Allow turning off pickle fallback in vllm.v1.serial_utils (vllm-project#17427)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Docs] Update optimization.md doc (vllm-project#17482)

Signed-off-by: mgoin <[email protected]>

* [BugFix] Fix authorization of openai_transcription_client.py (vllm-project#17321)

Signed-off-by: zh Wang <[email protected]>

* [Bugfix][ROCm] Restrict ray version due to a breaking release (vllm-project#17480)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [doc] add install tips (vllm-project#17373)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* doc: fix bug report Github template formatting (vllm-project#17486)

Signed-off-by: David Xia <[email protected]>

* [v1][Spec Decode] Make sliding window compatible with eagle prefix caching (vllm-project#17398)

Signed-off-by: Chen Zhang <[email protected]>

* Bump Compressed Tensors version to 0.9.4 (vllm-project#17478)

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: mgoin <[email protected]>

* [Misc] Rename Audios -> Audio in Qwen2audio Processing (vllm-project#17507)

Signed-off-by: Alex-Brooks <[email protected]>

* [CI][TPU] Skip Multimodal test (vllm-project#17488)

Signed-off-by: Siyuan Liu <[email protected]>

* [Bugfix][ROCm] Fix import error on ROCm (vllm-project#17495)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Bugfix] Temporarily disable gptq_bitblas on ROCm (vllm-project#17411)

Signed-off-by: Yan Cangang <[email protected]>

* [CI][TPU] Skip structured outputs+spec decode tests on TPU (vllm-project#17510)

Signed-off-by: mgoin <[email protected]>

* [CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg (vllm-project#17500)

Signed-off-by: mgoin <[email protected]>

* [CI/Build] Reorganize models tests (vllm-project#17459)

Signed-off-by: DarkLight1337 <[email protected]>

* FIxing the AMD test failures caused by PR#16457 (vllm-project#17511)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Build] Require setuptools >= 77.0.3 for PEP 639 (vllm-project#17389)

Signed-off-by: Russell Bryant <[email protected]>

* [ROCm] Effort to reduce the number of environment variables in command line (vllm-project#17229)

Signed-off-by: Hongxia Yang <[email protected]>

* [BugFix] fix speculative decoding memory leak when speculation is disabled (vllm-project#15506)

Signed-off-by: Noah Yoshida <[email protected]>

* [BugFix] Fix mla cpu - missing 3 required positional arguments (vllm-project#17494)

Signed-off-by: Lucas Wilkinson <[email protected]>

* Avoid overwriting vllm_compile_cache.py (vllm-project#17418)

Signed-off-by: Keyun Tong <[email protected]>

* [Core] Enable IPv6 with vllm.utils.make_zmq_socket() (vllm-project#16506)

Signed-off-by: Russell Bryant <[email protected]>

* [Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content (vllm-project#17515)

Signed-off-by: chaunceyjiang <[email protected]>

* Improve configs - `ObservabilityConfig` (vllm-project#17453)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model (vllm-project#17285)

Signed-off-by: Teruaki Ishizaki <[email protected]>

* [Frontend] Show progress bar for adding requests (vllm-project#17525)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Clean up test docstrings and names (vllm-project#17521)

Signed-off-by: DarkLight1337 <[email protected]>

* [FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X (vllm-project#17530)

Signed-off-by: tjtanaa <[email protected]>

* Fix more broken speculative decode tests (vllm-project#17450)

Signed-off-by: Huy Do <[email protected]>

* [doc] add streamlit integration (vllm-project#17522)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config (vllm-project#17535)

Signed-off-by: tjtanaa <[email protected]>

* [Feature][Frontend]: Deprecate --enable-reasoning (vllm-project#17452)

Signed-off-by: chaunceyjiang <[email protected]>

* [ROCm] remove unsupported archs from rocm triton flash-attention supported list (vllm-project#17536)

Signed-off-by: Hongxia Yang <[email protected]>

* [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations (vllm-project#10867)

Signed-off-by: Sage Moore <[email protected]>

* [Misc] refactor example - cpu_offload_lmcache (vllm-project#17460)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

---------

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Michal Adamczyk <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: gitover22 <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: csy1204 <[email protected]>
Signed-off-by: sydarb <[email protected]>
Signed-off-by: 开哲 <[email protected]>
Signed-off-by: Omer Dayan (SW-GPU) <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: Eyshika Agarwal <[email protected]>
Signed-off-by: eyshika <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Jens Glaser <[email protected]>
Signed-off-by: varun sundar rabindranath <[email protected]>
Signed-off-by: Lifu Huang <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: cynthieye <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Jasmond Loh <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Christian Heimes <[email protected]>
Signed-off-by: Bryan Lu <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: James Wu <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Yarong Mu <[email protected]>
Signed-off-by: shuw <[email protected]>
Signed-off-by: charlifu <[email protected]>
Signed-off-by: Zijing Liu <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: changjun.lee <[email protected]>
Signed-off-by: imkero <[email protected]>
Signed-off-by: ShuaibinLi <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Jade Zheng <[email protected]>
Signed-off-by: sfc-gh-zhwang <[email protected]>
Signed-off-by: kaixih <[email protected]>
Signed-off-by: cascade812 <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: LiuXiaoxuanPKU <[email protected]>
Signed-off-by: lkm-schulz <[email protected]>
Signed-off-by: Ther-LF <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: evian <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: Alex <[email protected]>
Signed-off-by: Michal Moskal <[email protected]>
Signed-off-by: Lucia Fang <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: 苏政渊 <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: mofanke <[email protected]>
Signed-off-by: mayuyuace <[email protected]>
Signed-off-by: Tianyuan Wu <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: Dilip Gowda Bhagavan <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Gabriel Marinho <[email protected]>
Signed-off-by: Huy Do <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Marco <[email protected]>
Signed-off-by: isotr0py <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: Marko Rosenmueller <[email protected]>
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: Pete Savage <[email protected]>
Signed-off-by: zh Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Yan Cangang <[email protected]>
Signed-off-by: Hongxia Yang <[email protected]>
Signed-off-by: Noah Yoshida <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Teruaki Ishizaki <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: Michal Adamczyk <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: huafeng <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Sangyeon Cho <[email protected]>
Co-authored-by: Chen Xia <[email protected]>
Co-authored-by: Areeb Syed <[email protected]>
Co-authored-by: 张宇 <[email protected]>
Co-authored-by: 开哲 <[email protected]>
Co-authored-by: omer-dayan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Atilla <[email protected]>
Co-authored-by: Eyshika Agarwal <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: jglaser <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: zhouzaida <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: varun sundar rabindranath <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: yexin(叶鑫) <[email protected]>
Co-authored-by: MagnetoWang <[email protected]>
Co-authored-by: 조상연[플레이스 AI] <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Jasmond L <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Daniel Li <[email protected]>
Co-authored-by: Christian Heimes <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Bryan Lu <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>
Co-authored-by: James Wu <[email protected]>
Co-authored-by: yarongmu-google <[email protected]>
Co-authored-by: Shu Wang <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Zijing Liu <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: changjun.lee <[email protected]>
Co-authored-by: Kero Liang <[email protected]>
Co-authored-by: Happy <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Jade Zheng <[email protected]>
Co-authored-by: Flex Wang <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: cascade <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Lennart K. M. Schulz <[email protected]>
Co-authored-by: TherLF <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Wanrui Dai <[email protected]>
Co-authored-by: evian <[email protected]>
Co-authored-by: idouba <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
Co-authored-by: Richard Barnes <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Zhengyuan Su (苏政渊) <[email protected]>
Co-authored-by: 苏政渊 <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: ponix-j <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: a2q1p <[email protected]>
Co-authored-by: mofanke <[email protected]>
Co-authored-by: Qiming Zhang <[email protected]>
Co-authored-by: TY-AMD <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: casinca <[email protected]>
Co-authored-by: Dilip Gowda Bhagavan <[email protected]>
Co-authored-by: Bryan Lu <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Gabriel Marinho <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Marco <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Co-authored-by: Marko Rosenmueller <[email protected]>
Co-authored-by: Alec <[email protected]>
Co-authored-by: Pete Savage <[email protected]>
Co-authored-by: zh Wang <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: NaLan ZeYu <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Noah Yoshida <[email protected]>
Co-authored-by: Keyun Tong <[email protected]>
Co-authored-by: Teruaki Ishizaki <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
radeksm pushed a commit to radeksm/vllm that referenced this pull request May 2, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request May 4, 2025
@ProExpertProg
Contributor

ProExpertProg commented May 7, 2025

Btw - this broke spec_decoding tests on main because the EngineArgs interface changed. Pushing a fix.
Fixed in #17754

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Labels
bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v0