
mtmd : merge llava, gemma3 and minicpmv CLI into single llama-mtmd-cli #13012


Merged

7 commits merged into ggml-org:master on Apr 21, 2025

Conversation

@ngxson (Collaborator) commented Apr 18, 2025

This PR unifies all vision models supported by llama.cpp into libmtmd.

Qwen2VL is not merged for now, due to some complications with M-RoPE. This can be resolved after the llama_batch_ext refactoring: #11875

These models are supported:

llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF  # and other sizes: 12b, 27b (+ QAT version)
llama-mtmd-cli -hf guinmoon/MobileVLM-3B-GGUF --chat-template deepseek
llama-mtmd-cli -hf THUDM/glm-edge-v-5b-gguf
llama-mtmd-cli -hf second-state/Llava-v1.5-7B-GGUF --chat-template vicuna
llama-mtmd-cli -hf cjpais/llava-1.6-mistral-7b-gguf --chat-template vicuna
llama-mtmd-cli -hf ibm-research/granite-vision-3.2-2b-GGUF
llama-mtmd-cli -hf second-state/MiniCPM-Llama3-V-2_5-GGUF
llama-mtmd-cli -hf openbmb/MiniCPM-V-2_6-gguf
llama-mtmd-cli -hf openbmb/MiniCPM-o-2_6-gguf

NOTE: Yi-VL-6B is removed from the test because:

  • There is no reliable information about which chat template it uses; the built-in one is chatml, which the model does not support
  • The model has very little usage and is not worth maintaining

Follow-up PRs:

  • Remove BOI / EOI token embeddings from clip.cpp (used by glm-edge); they should be processed as text tokens ==> will be a breaking change
  • Refactor documentation (find a way to reduce the number of README files)
  • Unify the conversion scripts
  • Add support for SmolVLM; should be relatively simple

@ngxson changed the title from "mtmd : merge llava-cli and gemma3-cli into single mtmd-cli" to "mtmd : merge llava, gemm3 and minicpmv CLI into single llama-mtmd-cli" on Apr 19, 2025
}

if (clip_is_llava(ctx->ctx_clip) || clip_is_minicpmv(ctx->ctx_clip) || clip_is_glm(ctx->ctx_clip)) {
// TODO @ngxson : llava does not support batched encoding ; this should be fixed inside clip_image_batch_encode()
@ngxson (Collaborator, Author) commented:
On second thought, it may not be a good idea to support real batching in clip_image_encode because the memory usage can blow up very quickly (some models use ~4k patches per image)

What we can do in the short term is to allow clip_image_encode to run the decode multiple times, i.e. simply copy this loop into clip_image_encode - this can be done in another PR

Member commented:

I expect that in practice batching image encodes will have little to no utility, since a single image is typically enough to saturate all the available compute. And as you mentioned, the memory for the compute buffers will grow very fast.

> What we can do in the short term is to allow clip_image_encode to run the decode multiple times, i.e. simply copy this loop into clip_image_encode - this can be done in another PR

Sounds good.
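
(For illustration only: a minimal sketch of the per-image loop being discussed, using a generic callable in place of whatever single-image encode call the real clip.cpp code would make; this is not the actual clip_image_encode interface.)

#include <functional>
#include <vector>

// Illustrative sketch: encode a set of images one at a time instead of as one
// large batch, so the compute-buffer memory needed at any moment is only that
// of a single image (some models use ~4k patches per image).
// `encode_one` stands in for a hypothetical single-image encode call.
template <typename Image>
static bool encode_images_sequentially(const std::vector<Image> & images,
                                       const std::function<bool(const Image &)> & encode_one) {
    for (const Image & img : images) {
        if (!encode_one(img)) {
            return false; // propagate failure from the single-image encode
        }
    }
    return true;
}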

Comment on lines +215 to +216
image_tokens->nx = clip_n_patches(ctx->ctx_clip);
image_tokens->ny = 1;
@ngxson (Collaborator, Author) commented Apr 19, 2025:

The ny (number of tokens in the y direction) is only used by qwen2vl for now (which we don't support yet in mtmd).

I should refactor mtmd_image_tokens very soon (in another PR). Otherwise, having both nx and ny for models that don't use M-RoPE looks very weird.
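
(A rough illustration of the nx/ny idea, with hypothetical names; this is not the actual mtmd_image_tokens definition.)

#include <cstdint>

// Illustrative only: for models without M-RoPE the patches form a flat
// sequence, so ny stays 1 and nx carries the full patch count; only M-RoPE
// models (e.g. qwen2vl) would give ny a real 2D meaning.
struct image_token_grid {
    uint32_t nx = 0; // tokens along x (or the total token count when flat)
    uint32_t ny = 1; // tokens along y (meaningful only for M-RoPE models)

    uint32_t n_tokens() const { return nx * ny; }
};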

@ngxson (Collaborator, Author) commented Apr 19, 2025:

@ggerganov @slaren Tagging you here for visibility, regarding the discussion of tracking n_past internally in libllama. For qwen models, the first dimension of the 4D pos is the traditional n_past, but the rest are relative positions (in the case of an image, they are 2D X/Y coordinates).

So I think when refactoring llama_batch_ext, we should find a way to allow the user to specify these "additional" dimensions:

(Note: for qwen, the position of each token in llama.cpp currently has the dimensions [n_past, x, y, unused]; the unused dim is reserved for future use, e.g. 3D spatial understanding.)

[image]
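
(For illustration only: a hypothetical helper showing how the [n_past, x, y, unused] layout described above could be filled for an image patch grid; this is not the llama_batch_ext or llama.cpp API.)

#include <cstdint>
#include <vector>

// Illustrative sketch: 4 position components per image token, following the
// [n_past, x, y, unused] convention mentioned above. The temporal component
// stays at n_past for every patch of the image, while x/y index the patch grid.
static std::vector<int32_t> make_image_positions(int32_t n_past, int32_t nx, int32_t ny) {
    std::vector<int32_t> pos;
    pos.reserve((size_t) nx * ny * 4);
    for (int32_t y = 0; y < ny; ++y) {
        for (int32_t x = 0; x < nx; ++x) {
            pos.push_back(n_past); // traditional sequence position
            pos.push_back(x);      // relative x coordinate of the patch
            pos.push_back(y);      // relative y coordinate of the patch
            pos.push_back(0);      // unused, reserved (e.g. 3D spatial understanding)
        }
    }
    return pos;
}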

@ngxson changed the title from "mtmd : merge llava, gemm3 and minicpmv CLI into single llama-mtmd-cli" to "mtmd : merge llava, gemma3 and minicpmv CLI into single llama-mtmd-cli" on Apr 19, 2025
@ngxson marked this pull request as ready for review April 19, 2025 08:09
@ngxson requested a review from ggerganov April 19, 2025 08:09

@ngxson merged commit 84a9bf2 into ggml-org:master Apr 21, 2025
47 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…li` (ggml-org#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025