
Commit 76186b9

Browse files
robertgshaw2-redhats3wozbohnstinglDarkLight1337tlrmchlsmth
authored
Upstream Sync (#80)

* [Model] Add GraniteMoeHybrid 4.0 model (vllm-project#17497) Signed-off-by: Thomas Ortner <[email protected]> Signed-off-by: Stanislaw Wozniak <[email protected]> Co-authored-by: Thomas Ortner <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [easy] Fix logspam on PiecewiseBackend errors (vllm-project#17138) Signed-off-by: rzou <[email protected]>
* [Bugfix] Fixed prompt length for random dataset (vllm-project#17408) Signed-off-by: Mikhail Podvitskii <[email protected]>
* [Doc] Update notes for H2O-VL and Gemma3 (vllm-project#17219) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc] Fix ScalarType float4 naming (vllm-project#17690) Signed-off-by: Lucas Wilkinson <[email protected]>
* Fix `dockerfilegraph` pre-commit hook (vllm-project#17698) Signed-off-by: Harry Mellor <[email protected]>
* [Bugfix] Fix triton import with local TritonPlaceholder (vllm-project#17446) Signed-off-by: Mengqing Cao <[email protected]>
* [V1] Enable TPU V1 backend by default (vllm-project#17673) Signed-off-by: mgoin <[email protected]>
* [V1][PP] Support PP for MultiprocExecutor (vllm-project#14219) Signed-off-by: jiang1.li <[email protected]> Signed-off-by: jiang.li <[email protected]>
* [v1] AttentionMetadata for each layer (vllm-project#17394) Signed-off-by: Chen Zhang <[email protected]>
* [Feat] Add deprecated=True to CLI args (vllm-project#17426) Signed-off-by: Aaron Pham <[email protected]>
* [Docs] Use gh-file to add links to tool_calling.md (vllm-project#17709) Signed-off-by: windsonsea <[email protected]>
* [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager (vllm-project#17479) Signed-off-by: Chen Zhang <[email protected]>
* [doc] Add RAG Integration example (vllm-project#17692) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix] Fix modality limits in vision language example (vllm-project#17721) Signed-off-by: DarkLight1337 <[email protected]>
* Make right sidebar more readable in "Supported Models" (vllm-project#17723) Signed-off-by: Harry Mellor <[email protected]>
* [TPU] Increase block size and reset block shapes (vllm-project#16458)
* [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` (vllm-project#16839) Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal> Signed-off-by: dtransposed <> Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
* [Bugfix] Fix for the condition to accept empty encoder inputs for mllama (vllm-project#17732) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode (vllm-project#16828) Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]>

---------

Signed-off-by: Thomas Ortner <[email protected]>
Signed-off-by: Stanislaw Wozniak <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Mikhail Podvitskii <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Co-authored-by: Stan Wozniak <[email protected]>
Co-authored-by: Thomas Ortner <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Mikhail Podvitskii <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Jevin Jiang <[email protected]>
Co-authored-by: d.transposed <[email protected]>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
1 parent 3783696 · commit 76186b9

File tree

77 files changed: +2678, -384 lines


.pre-commit-config.yaml

Lines changed: 0 additions & 2 deletions

@@ -125,8 +125,6 @@ repos:
     name: Update Dockerfile dependency graph
     entry: tools/update-dockerfile-graph.sh
     language: script
-    files: ^docker/Dockerfile$
-    pass_filenames: false
   # Keep `suggestion` last
   - id: suggestion
     name: Suggestion

benchmarks/benchmark_dataset.py

Lines changed: 103 additions & 2 deletions

@@ -315,13 +315,15 @@ def sample(
         )

         vocab_size = tokenizer.vocab_size
+        num_special_tokens = tokenizer.num_special_tokens_to_add()
+        real_input_len = input_len - num_special_tokens

         prefix_token_ids = (np.random.randint(
             0, vocab_size, size=prefix_len).tolist() if prefix_len > 0 else [])

         # New sampling logic: [X * (1 - b), X * (1 + b)]
-        input_low = int(input_len * (1 - range_ratio))
-        input_high = int(input_len * (1 + range_ratio))
+        input_low = int(real_input_len * (1 - range_ratio))
+        input_high = int(real_input_len * (1 + range_ratio))
         output_low = int(output_len * (1 - range_ratio))
         output_high = int(output_len * (1 + range_ratio))

@@ -344,6 +346,17 @@ def sample(
                          vocab_size).tolist()
             token_sequence = prefix_token_ids + inner_seq
             prompt = tokenizer.decode(token_sequence)
+            # After decoding the prompt we have to encode and decode it again.
+            # This is done because in some cases N consecutive tokens
+            # give a string tokenized into != N number of tokens.
+            # For example for GPT2Tokenizer:
+            # [6880, 6881] -> ['Ġcalls', 'here'] ->
+            # [1650, 939, 486] -> ['Ġcall', 'sh', 'ere']
+            # To avoid uncontrolled change of the prompt length,
+            # the encoded sequence is truncated before being decode again.
+            re_encoded_sequence = tokenizer.encode(
+                prompt, add_special_tokens=False)[:input_lens[i]]
+            prompt = tokenizer.decode(re_encoded_sequence)
             total_input_len = prefix_len + int(input_lens[i])
             requests.append(
                 SampleRequest(

@@ -874,6 +887,94 @@ def sample(self,
         return sampled_requests


+# -----------------------------------------------------------------------------
+# Next Edit Prediction Dataset Implementation
+# -----------------------------------------------------------------------------
+
+
+zeta_prompt = """### Instruction:
+You are a code completion assistant and your task is to analyze user edits and then rewrite an excerpt that the user provides, suggesting the appropriate edits within the excerpt, taking into account the cursor location.
+
+### User Edits:
+
+{}
+
+### User Excerpt:
+
+{}
+
+### Response:
+
+""" # noqa: E501
+
+
+def _format_zeta_prompt(
+        sample: dict,
+        original_start_marker: str = "<|editable_region_start|>") -> dict:
+    """Format the zeta prompt for the Next Edit Prediction (NEP) dataset.
+
+    This function formats examples from the NEP dataset
+    into prompts and expected outputs. It could be
+    further extended to support more NEP datasets.
+
+    Args:
+        sample: The dataset sample containing events,
+            inputs, and outputs.
+        original_start_marker: The marker indicating the
+            start of the editable region. Defaults to
+            "<|editable_region_start|>".
+
+    Returns:
+        A dictionary with the formatted prompts and expected outputs.
+    """
+    events = sample["events"]
+    input = sample["input"]
+    output = sample["output"]
+    prompt = zeta_prompt.format(events, input)
+
+    # following the original implementation, extract the focused region
+    # from the raw output
+    output_start_index = output.find(original_start_marker)
+    output_focused_region = output[output_start_index:]
+    expected_output = output_focused_region
+
+    return {"prompt": prompt, "expected_output": expected_output}
+
+
+class NextEditPredictionDataset(HuggingFaceDataset):
+    """
+    Dataset class for processing a Next Edit Prediction dataset.
+    """
+
+    SUPPORTED_DATASET_PATHS = {
+        "zed-industries/zeta",
+    }
+    MAPPING_PROMPT_FUNCS = {
+        "zed-industries/zeta": _format_zeta_prompt,
+    }
+
+    def sample(self, tokenizer: PreTrainedTokenizerBase, num_requests: int,
+               **kwargs):
+        formatting_prompt_func = self.MAPPING_PROMPT_FUNCS.get(
+            self.dataset_path)
+        if formatting_prompt_func is None:
+            raise ValueError(f"Unsupported dataset path: {self.dataset_path}")
+        samples = []
+        for sample in self.data:
+            sample = formatting_prompt_func(sample)
+            samples.append(
+                SampleRequest(
+                    prompt=sample["prompt"],
+                    prompt_len=len(tokenizer(sample["prompt"]).input_ids),
+                    expected_output_len=len(
+                        tokenizer(sample["expected_output"]).input_ids),
+                ))
+            if len(samples) >= num_requests:
+                break
+        self.maybe_oversample_requests(samples, num_requests)
+        return samples
+
+
 # -----------------------------------------------------------------------------
 # ASR Dataset Implementation
 # -----------------------------------------------------------------------------

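The comment in the random-dataset hunk above is easiest to see with a concrete tokenizer. Below is a minimal sketch, not part of the commit, that reproduces the length drift and the truncation fix; it assumes the `transformers` package and the GPT-2 tokenizer are available locally, and it uses the token ids cited in the diff comment.

# Minimal sketch (not part of this commit): why the random dataset re-encodes
# and truncates the decoded prompt before using it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_sequence = [6880, 6881]               # N = 2 randomly sampled token ids
prompt = tokenizer.decode(token_sequence)   # decodes to a "callshere"-style string

# Re-encoding the decoded string can produce != N tokens, so the effective
# prompt length would drift away from the requested input length.
re_encoded = tokenizer.encode(prompt, add_special_tokens=False)

# Truncating to the originally requested length keeps the prompt length stable,
# which is what the new code in sample() does.
stable = re_encoded[:len(token_sequence)]
prompt = tokenizer.decode(stable)

print(len(token_sequence), len(re_encoded), len(stable))  # e.g. 2 3 2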
benchmarks/benchmark_serving.py

Lines changed: 6 additions & 2 deletions

@@ -53,8 +53,9 @@
 from benchmark_dataset import (AIMODataset, ASRDataset, BurstGPTDataset,
                                ConversationDataset, HuggingFaceDataset,
                                InstructCoderDataset, MTBenchDataset,
-                               RandomDataset, SampleRequest, ShareGPTDataset,
-                               SonnetDataset, VisionArenaDataset)
+                               NextEditPredictionDataset, RandomDataset,
+                               SampleRequest, ShareGPTDataset, SonnetDataset,
+                               VisionArenaDataset)
 from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json

 MILLISECONDS_TO_SECONDS_CONVERSION = 1000

@@ -603,6 +604,9 @@ def main(args: argparse.Namespace):
     elif args.dataset_path in AIMODataset.SUPPORTED_DATASET_PATHS:
         dataset_class = AIMODataset
         args.hf_split = "train"
+    elif args.dataset_path in NextEditPredictionDataset.SUPPORTED_DATASET_PATHS: # noqa: E501
+        dataset_class = NextEditPredictionDataset
+        args.hf_split = "train"
     elif args.dataset_path in ASRDataset.SUPPORTED_DATASET_PATHS:
         dataset_class = ASRDataset
         args.hf_split = "train"

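To see what the new routing produces end to end, here is a small illustrative sketch, not the benchmark's own code, that loads the `zed-industries/zeta` train split the dispatcher selects and applies the same zeta-style formatting shown in the `benchmark_dataset.py` diff above. The template text is abbreviated and the GPT-2 tokenizer is only a stand-in; it assumes the `datasets` and `transformers` packages are installed.

# Illustrative sketch only: mirrors what NextEditPredictionDataset.sample()
# does for one zed-industries/zeta example, with an abbreviated template.
from datasets import load_dataset
from transformers import AutoTokenizer

ZETA_PROMPT = ("### Instruction:\n"
               "You are a code completion assistant ...\n\n"  # abbreviated here
               "### User Edits:\n\n{}\n\n### User Excerpt:\n\n{}\n\n"
               "### Response:\n\n")
START_MARKER = "<|editable_region_start|>"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer
data = load_dataset("zed-industries/zeta", split="train")

sample = data[0]
prompt = ZETA_PROMPT.format(sample["events"], sample["input"])
expected_output = sample["output"][sample["output"].find(START_MARKER):]

# These two lengths are what the benchmark records as prompt_len and
# expected_output_len for each request.
print(len(tokenizer(prompt).input_ids),
      len(tokenizer(expected_output).input_ids))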
benchmarks/kernels/benchmark_moe.py

Lines changed: 1 addition & 1 deletion

@@ -10,12 +10,12 @@

 import ray
 import torch
-import triton
 from ray.experimental.tqdm_ray import tqdm
 from transformers import AutoConfig

 from vllm.model_executor.layers.fused_moe.fused_moe import *
 from vllm.platforms import current_platform
+from vllm.triton_utils import triton
 from vllm.utils import FlexibleArgumentParser

 FP8_DTYPE = current_platform.fp8_dtype()

benchmarks/kernels/benchmark_rmsnorm.py

Lines changed: 1 addition & 1 deletion

@@ -4,11 +4,11 @@
 from typing import Optional, Union

 import torch
-import triton
 from flashinfer.norm import fused_add_rmsnorm, rmsnorm
 from torch import nn

 from vllm import _custom_ops as vllm_ops
+from vllm.triton_utils import triton


 class HuggingFaceRMSNorm(nn.Module):

benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py

Lines changed: 1 addition & 1 deletion

@@ -6,13 +6,13 @@
 # Import DeepGEMM functions
 import deep_gemm
 import torch
-import triton
 from deep_gemm import calc_diff, ceil_div, get_col_major_tma_aligned_tensor

 # Import vLLM functions
 from vllm import _custom_ops as ops
 from vllm.model_executor.layers.quantization.utils.fp8_utils import (
     per_token_group_quant_fp8, w8a8_block_fp8_matmul)
+from vllm.triton_utils import triton


 # Copied from
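These three benchmark edits all route the `triton` import through `vllm.triton_utils`, which ties in with the "Fix triton import with local TritonPlaceholder" item in the commit message. The sketch below shows the general placeholder-import pattern the benchmarks rely on; it is not vLLM's actual `triton_utils` implementation, and the class and message names are illustrative.

# General placeholder-import pattern (illustrative, not vLLM's triton_utils):
# expose the real `triton` when it is installed, otherwise a stub object so
# that importing benchmark modules does not fail on machines without Triton.
try:
    import triton  # the real package, when available
except ImportError:

    class _TritonPlaceholder:
        """Stub that defers the failure until a Triton attribute is used."""

        def __getattr__(self, name: str):
            raise RuntimeError(
                f"Triton is not installed, but '{name}' was requested. "
                "Install triton to run this code path.")

    triton = _TritonPlaceholder()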

docs/source/deployment/frameworks/index.md

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ helm
 lws
 modal
 open-webui
+retrieval_augmented_generation
 skypilot
 streamlit
 triton
docs/source/deployment/frameworks/retrieval_augmented_generation.md (new file)

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
+(deployment-retrieval-augmented-generation)=
+
+# Retrieval-Augmented Generation
+
+[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
+
+Here are the integrations:
+- vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus)
+- vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus)
+
+## vLLM + langchain
+
+### Prerequisites
+
+- Setup vLLM and langchain environment
+
+```console
+pip install -U vllm \
+            langchain_milvus langchain_openai \
+            langchain_community beautifulsoup4 \
+            langchain-text-splitters
+```
+
+### Deploy
+
+- Start the vLLM server with the supported embedding model, e.g.
+
+```console
+# Start embedding service (port 8000)
+vllm serve ssmits/Qwen2-7B-Instruct-embed-base
+```
+
+- Start the vLLM server with the supported chat completion model, e.g.
+
+```console
+# Start chat service (port 8001)
+vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
+```
+
+- Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py>
+
+- Run the script
+
+```python
+python retrieval_augmented_generation_with_langchain.py
+```
+
+## vLLM + llamaindex
+
+### Prerequisites
+
+- Setup vLLM and llamaindex environment
+
+```console
+pip install vllm \
+            llama-index llama-index-readers-web \
+            llama-index-llms-openai-like \
+            llama-index-embeddings-openai-like \
+            llama-index-vector-stores-milvus \
+```
+
+### Deploy
+
+- Start the vLLM server with the supported embedding model, e.g.
+
+```console
+# Start embedding service (port 8000)
+vllm serve ssmits/Qwen2-7B-Instruct-embed-base
+```
+
+- Start the vLLM server with the supported chat completion model, e.g.
+
+```console
+# Start chat service (port 8001)
+vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
+```
+
+- Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py>
+
+- Run the script
+
+```python
+python retrieval_augmented_generation_with_llamaindex.py
+```

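The new page wires two OpenAI-compatible vLLM endpoints together (embeddings on port 8000, chat on port 8001) and leaves the RAG orchestration to the linked langchain/llamaindex scripts. As a quick sanity check of that layout, here is a minimal sketch using the `openai` Python client directly; the model names and ports are the ones from the doc, the `api_key` value is arbitrary since no key is configured in these examples, and the sketch assumes `pip install openai` and both servers running.

# Minimal sanity-check sketch (not one of the linked example scripts):
# talk to the two vLLM services started in the doc via their
# OpenAI-compatible APIs.
from openai import OpenAI

embed_client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

# Embed a document chunk with the embedding service (port 8000).
embedding = embed_client.embeddings.create(
    model="ssmits/Qwen2-7B-Instruct-embed-base",
    input="vLLM is a fast and easy-to-use library for LLM inference.",
).data[0].embedding

# Ask the chat service (port 8001) a question; a real RAG flow would first
# retrieve relevant chunks from Milvus and prepend them to the prompt.
reply = chat_client.chat.completions.create(
    model="qwen/Qwen1.5-0.5B-Chat",
    messages=[{"role": "user",
               "content": "What is retrieval-augmented generation?"}],
)
print(len(embedding), reply.choices[0].message.content)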
docs/source/features/tool_calling.md

Lines changed: 14 additions & 14 deletions

@@ -141,9 +141,9 @@ Known issues:
   much shorter than what vLLM generates. Since an exception is thrown when this condition
   is not met, the following additional chat templates are provided:

-* `examples/tool_chat_template_mistral.jinja` - this is the "official" Mistral chat template, but tweaked so that
+* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
   it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
-* `examples/tool_chat_template_mistral_parallel.jinja` - this is a "better" version that adds a tool-use system prompt
+* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
   when tools are provided, that results in much better reliability when working with parallel tool calling.

 Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`

@@ -170,15 +170,15 @@ Known issues:

 VLLM provides two JSON based chat templates for Llama 3.1 and 3.2:

-* `examples/tool_chat_template_llama3.1_json.jinja` - this is the "official" chat template for the Llama 3.1
+* <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1
   models, but tweaked so that it works better with vLLM.
-* `examples/tool_chat_template_llama3.2_json.jinja` - this extends upon the Llama 3.1 chat template by adding support for
+* <gh-file:examples/tool_chat_template_llama3.2_json.jinja> - this extends upon the Llama 3.1 chat template by adding support for
   images.

 Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`

 VLLM also provides a JSON based chat template for Llama 4:
-* `examples/tool_chat_template_llama4_json.jinja` - this is based on the "official" chat template for the Llama 4
+* <gh-file:examples/tool_chat_template_llama4_json.jinja> - this is based on the "official" chat template for the Llama 4
   models, but tweaked so that it works better with vLLM.

 For Llama 4 use `--tool-call-parser llama4_json examples/tool_chat_template_llama4_json.jinja`.

@@ -191,7 +191,7 @@ Supported models:

 Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`

-`examples/tool_chat_template_granite.jinja`: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
+<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.

 * `ibm-granite/granite-3.1-8b-instruct`

@@ -203,7 +203,7 @@ The chat template from Huggingface can be used directly. Parallel function calls

 Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`

-`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
+<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.

 ### InternLM Models (`internlm`)

@@ -253,12 +253,12 @@ Limitations:

 Example supported models:

-* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
-* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
-* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
-* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
-* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
-* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
+* `meta-llama/Llama-3.2-1B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
+* `meta-llama/Llama-3.2-3B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
+* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
+* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
+* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
+* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)

 Flags: `--tool-call-parser pythonic --chat-template {see_above}`

@@ -270,7 +270,7 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.

 ## How to write a tool parser plugin

-A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
+A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.

 Here is a summary of a plugin file: