[Bugfix] add qwen3 reasoning-parser fix content is None when disable … #17369
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; only a small and essential subset of CI tests runs to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Force-pushed from e8e6b42 to a28caaf (compare)
Force-pushed from a28caaf to a4063c0 (compare)
Thanks for adding this. Can you add some tests to verify the fix?
Force-pushed from a4063c0 to 852ca12 (compare)
Thanks for the feedback! I have added tests to verify the fix. Please let me know if you need any additional tests or if there is anything else I should improve.
Thanks a lot! Looking forward to the merge.
[Bugfix] add qwen3 reasoning-parser fix content is None when disable thinking (vllm-project#17357) Signed-off-by: mofanke <[email protected]>
Force-pushed from 852ca12 to 7d4031b (compare)
Thanks, LGTM
[Bugfix] add qwen3 reasoning-parser fix content is None when disable thinking (vllm-project#17369) Signed-off-by: mofanke <[email protected]>
I think there might be an issue with this PR implementation. I used the following test cases.
Server (deepseek_r1 parser): vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 --guided-decoding-backend xgrammar
Server (qwen3 parser): vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3 --guided-decoding-backend xgrammar
Client:
from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"
class Step(BaseModel):
ground_truth_key_ideas: str
system_response_key_ideas: str
discussion: str
recall: float
precision: float
if __name__ == '__main__':
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# client.chat.completions.create
json_schema = Step.model_json_schema()
chat_response = client.beta.chat.completions.parse(
model="",
messages=[
{'role': 'system',
'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n "system_response_key_ideas": "{system_response_key_ideas}",\n "discussion": "{discussion}",\n "recall": "{recall} # note: the value you produce must be a single float value",\n "precision": "{precision} # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
{'role': 'user',
'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
],
temperature=0.0,
extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
)
print("Chat response:", chat_response)
s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
print("-----", s.system_response_key_ideas)
result: deepseek_r1: Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-c8ac33157c6a46aa91adede0f1f36b06', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content='{\n "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n "recall": 0.6,\n "precision": 0.75\n}'), stop_reason=None)], created=1746001853, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
----- 1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling. qwen3: Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-7b079ebfa7ef4c9e87779bcb6cfffccd', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content='{\n "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n "recall": 0.6,\n "precision": 0.75\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content=None), stop_reason=None)], created=1746002026, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1187, in parse_raw
obj = parse.load_str_bytes(
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
return json_loads(b) # type: ignore
^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/vllm/test14.py", line 35, in <module>
s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1214, in parse_raw
raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for Step
__root__
the JSON object must be str, bytes or bytearray, not NoneType [type=type_error, input_value=None, input_type=NoneType]
The root cause is that the parser incorrectly assumes the current mode is not reasoning mode, even though I did enable reasoning mode. Because the model's output was constrained into JSON by xgrammar, the think tags never appear, so the qwen3 reasoning parser mistakenly concludes that the request is not in reasoning mode. See vllm/vllm/reasoning/qwen3_reasoning_parser.py, lines 114 to 117 in a39203f.
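The check in question behaves roughly like the sketch below (paraphrased, not the exact vLLM source; the <think>/</think> literals and the (reasoning_content, content) return order are assumptions): if either tag is missing, the whole output is treated as final content.

THINK_START, THINK_END = "<think>", "</think>"

def extract_reasoning_content(model_output: str):
    # When a guided backend such as xgrammar forces pure JSON output, neither
    # tag appears, so this branch concludes "not reasoning mode" even though
    # enable_thinking=True was requested: everything lands in `content` and
    # reasoning_content stays None.
    if THINK_START not in model_output or THINK_END not in model_output:
        return None, model_output  # (reasoning_content, content)

    # Otherwise, text between <think> and </think> becomes reasoning_content
    # and whatever follows </think> becomes content.
    reasoning, _, content = model_output.partition(THINK_END)
    reasoning = reasoning.replace(THINK_START, "", 1)
    return reasoning, content or None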
@DarkLight1337 @mofanke @YorkSu WDYT?
vllm/vllm/model_executor/guided_decoding/xgrammar_decoding.py Lines 345 to 353 in ece5a8b
vllm/vllm/reasoning/deepseek_r1_reasoning_parser.py Lines 46 to 47 in ece5a8b
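By contrast, the DeepSeek-R1 parser referenced above only looks for the end tag. A minimal sketch, under the same assumptions as before, of why the deepseek_r1 run above ended up with the JSON in reasoning_content and content=None:

THINK_END = "</think>"

def extract_reasoning_content_r1(model_output: str):
    # The <think> tag is assumed to be emitted by the chat template, so only
    # the end tag is checked. With guided JSON output there is no </think>,
    # the model is considered to still be "thinking", and the whole output
    # becomes reasoning_content while content stays None.
    if THINK_END not in model_output:
        return model_output, None  # (reasoning_content, content)
    reasoning, _, content = model_output.partition(THINK_END)
    return reasoning, content or None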
However, in the openai entrypoints, the ReasoningParser only checks whether the model output contains … (see vllm/vllm/entrypoints/openai/serving_chat.py, lines 607 to 608 in 1534d38)
vllm/vllm/entrypoints/openai/serving_chat.py Lines 623 to 624 in 1534d38
vllm/vllm/entrypoints/openai/serving_chat.py Lines 684 to 685 in 1534d38
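Roughly, those serving_chat.py call sites hand the generated text to the configured parser and copy its two return values into the response message. The following is a hedged sketch of that assumed flow; ChatMessage and chat_request here are illustrative stand-ins, not the actual serving_chat.py code:

from dataclasses import dataclass

# Illustrative stand-in for the OpenAI-compatible response message fields.
@dataclass
class ChatMessage:
    role: str
    content: str | None
    reasoning_content: str | None

def build_message(reasoning_parser, output_text: str, chat_request) -> ChatMessage:
    # The entrypoint does not consult enable_thinking itself; it simply
    # forwards whatever the reasoning parser returns into the two fields.
    reasoning_content, content = reasoning_parser.extract_reasoning_content(
        output_text, request=chat_request)
    return ChatMessage(role="assistant",
                       content=content,
                       reasoning_content=reasoning_content)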
Try to run some examples with guided_json and set …
Thanks for the PR. The commit copied from my fork looks a little outdated; for example, it still uses regex in the … @chaunceyjiang You might be interested.
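For illustration only (the fork's code is not shown in this thread), a regex-based extraction of the kind being called outdated might look like the sketch below, in contrast to the tag-splitting sketches earlier in the thread; the pattern and function name are assumptions:

import re

# Hypothetical regex-based extraction (the style the comment calls outdated).
REASONING_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def extract_with_regex(text: str):
    match = REASONING_RE.search(text)
    if match is None:
        # No think block found: treat everything as final content.
        return None, text
    reasoning = match.group(1)
    content = text[match.end():].lstrip("\n")
    return reasoning, content or None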
FIX #17357
Add a new reasoning parser: qwen3
Code attribution: gaocegege/vllm and this project's deepseek_r1_reasoning_parser.py; test for request …
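A quick way to exercise the new parser end to end is sketched below; the server command, model name, and enable_thinking chat-template kwarg come from the discussion above, while the API key and port are illustrative:

# Assumes a server started with:
#   vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
msg = resp.choices[0].message
# With thinking disabled, the fix is meant to keep the answer in `content`
# rather than leaving it None.
print("content:", msg.content)
print("reasoning_content:", getattr(msg, "reasoning_content", None))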