mtmd : merge llava, gemma3 and minicpmv CLI into single llama-mtmd-cli
#13012
Conversation
Merge `llava-cli` and `gemma3-cli` into single `llama-mtmd-cli`
}

if (clip_is_llava(ctx->ctx_clip) || clip_is_minicpmv(ctx->ctx_clip) || clip_is_glm(ctx->ctx_clip)) {
    // TODO @ngxson : llava does not support batched encoding; this should be fixed inside clip_image_batch_encode()
On second thought, it may not be a good idea to support real batching in `clip_image_encode`, because the memory usage can blow up very quickly (some models use ~4k patches per image).

What we can do in the short term is to allow `clip_image_encode` to run the decode multiple times, i.e. simply copy this loop into `clip_image_encode`. This can be done in another PR.
I expect that in practice batching image encodes will have little to no utility, since a single image is typically enough to saturate all the available compute. And as you mentioned, the memory for the compute buffers will grow very fast.

> What we can do in the short term is to allow `clip_image_encode` to run the decode multiple times, i.e. simply copy this loop into `clip_image_encode`. This can be done in another PR.

Sounds good.
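For illustration, here is a minimal sketch of that short-term approach: the "batched" entry point is just a loop over a per-image encode, so peak compute-buffer memory stays bounded to a single image. All names below (`encode_one_image`, `encode_images_seq`, the placeholder `clip_image_f32` struct) are hypothetical stand-ins, not the actual clip.h API.

```cpp
#include <cstddef>
#include <vector>

struct clip_image_f32 { int nx, ny; std::vector<float> data; }; // placeholder image type

// hypothetical per-image encode: fills `out` with one value per patch
static bool encode_one_image(const clip_image_f32 & img, std::vector<float> & out) {
    out.assign(static_cast<size_t>(img.nx) * img.ny, 0.0f); // stand-in for the real GGML graph evaluation
    return true;
}

// the "batch" entry point simply runs the per-image encode sequentially;
// each iteration reuses the same scratch buffers, so peak memory is that of one image
static bool encode_images_seq(const std::vector<clip_image_f32> & imgs,
                              std::vector<std::vector<float>> & outs) {
    outs.resize(imgs.size());
    for (size_t i = 0; i < imgs.size(); ++i) {
        if (!encode_one_image(imgs[i], outs[i])) {
            return false;
        }
    }
    return true;
}
```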
image_tokens->nx = clip_n_patches(ctx->ctx_clip);
image_tokens->ny = 1;
The `ny` (number of tokens in the y direction) is only used by qwen2vl for now (which we don't support yet in mtmd).

I should refactor `mtmd_image_tokens` very soon (in another PR). Otherwise, having both `nx` and `ny` for models not using M-RoPE looks very weird.
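For context, a simplified sketch of the bookkeeping being discussed; this illustrates the `nx`/`ny` relationship only and is not the real `mtmd_image_tokens` definition.

```cpp
#include <cstdint>

// Simplified sketch, not the real mtmd_image_tokens definition.
struct image_tokens_sketch {
    uint32_t nx; // tokens in the x direction (clip_n_patches(...) in the diff above)
    uint32_t ny; // tokens in the y direction; only meaningful for M-RoPE models
                 // such as Qwen2VL, so it is simply 1 for everything else
    uint32_t n_tokens() const { return nx * ny; } // total tokens the image contributes
};
```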
@ggerganov @slaren Tagging you here for visibility, regarding the discussion of tracking `n_past` internally in libllama. For qwen models, the first dimension of the 4D pos is the traditional `n_past`, but the rest are relative positions (in the case of an image, they are the 2D X/Y coordinates).

So I think when refactoring `llama_batch_ext`, we should find a way to allow the user to specify these "additional" dimensions.

(Note: for qwen, the dimensions of each token in llama.cpp are currently `[n_past, x, y, unused]`; the unused dim is reserved for future usage, e.g. 3D spatial understanding.)
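To make the `[n_past, x, y, unused]` layout concrete, here is a hedged sketch that fills per-token positions for one image's patch grid. The dim-major storage (all values of dimension 0, then dimension 1, and so on) and the function itself are assumptions for illustration, not a proposal for the final `llama_batch_ext` API.

```cpp
#include <cstdint>
#include <vector>

using llama_pos = int32_t;

// Sketch: 4D positions for one image of nx * ny patches, stored dim-major.
// Dim 0 is the traditional n_past; dims 1 and 2 are the relative X/Y patch
// coordinates; dim 3 is unused/reserved (e.g. future 3D spatial understanding).
static std::vector<llama_pos> build_image_positions(llama_pos n_past, int nx, int ny) {
    const int n_tokens = nx * ny;
    std::vector<llama_pos> pos(4 * static_cast<size_t>(n_tokens));
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const int i = y * nx + x;
            pos[0 * n_tokens + i] = n_past; // same base position for the whole image
            pos[1 * n_tokens + i] = x;      // relative x coordinate
            pos[2 * n_tokens + i] = y;      // relative y coordinate
            pos[3 * n_tokens + i] = 0;      // reserved
        }
    }
    return pos;
}
```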
…li` (ggml-org#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add not supported msg for qwen2vl
* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR unifies all vision models supported by llama.cpp into `libmtmd`.

Qwen2VL is not merged for now, due to some complications with M-RoPE. This can be resolved after the `llama_batch_ext` refactoring: #11875

These models are supported:

llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF # and other sizes: 12b, 27b (+ QAT version)
llama-mtmd-cli -hf guinmoon/MobileVLM-3B-GGUF --chat-template deepseek
llama-mtmd-cli -hf THUDM/glm-edge-v-5b-gguf
llama-mtmd-cli -hf second-state/Llava-v1.5-7B-GGUF --chat-template vicuna
llama-mtmd-cli -hf cjpais/llava-1.6-mistral-7b-gguf --chat-template vicuna
llama-mtmd-cli -hf ibm-research/granite-vision-3.2-2b-GGUF
llama-mtmd-cli -hf second-state/MiniCPM-Llama3-V-2_5-GGUF
llama-mtmd-cli -hf openbmb/MiniCPM-V-2_6-gguf
llama-mtmd-cli -hf openbmb/MiniCPM-o-2_6-gguf
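For example, after downloading one of the models above, a typical run looks roughly like this (the `--image` and `-p` flags are assumed to follow the existing llava/gemma3 CLI conventions; check `llama-mtmd-cli --help` for the exact options):

```sh
# assumed invocation style; verify the flags with `llama-mtmd-cli --help`
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
    --image ./cat.jpg \
    -p "Describe this image in one sentence."
```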
NOTE: `Yi-VL-6B` is removed from the test because: …

Follow-up PRs:
* `clip.cpp` (used by glm-edge): they should be processed as text tokens ==> will be a breaking change
* … `README` files