mtmd : merge llava, gemma3 and minicpmv CLI into single llama-mtmd-cli
#13012
Conversation
Merge `llava-cli` and `gemma3-cli` into single `llama-mtmd-cli`
}

if (clip_is_llava(ctx->ctx_clip) || clip_is_minicpmv(ctx->ctx_clip) || clip_is_glm(ctx->ctx_clip)) {
    // TODO @ngxson : llava does not support batched encoding; this should be fixed inside clip_image_batch_encode()
On second thought, it may not be a good idea to support real batching in `clip_image_encode`, because the memory usage can blow up very quickly (some models use ~4k patches per image).

What we can do in the short term is to allow `clip_image_encode` to run the decode multiple times, i.e. simply copy this loop into `clip_image_encode`. This can be done in another PR.
I expect that in practice batching image encodes will have little to no utility, since a single image is typically enough to saturate all the available compute. And as you mentioned, the memory for the compute buffers will grow very fast.

> What we can do in the short term is to allow `clip_image_encode` to run the decode multiple times, i.e. simply copy this loop into `clip_image_encode`. This can be done in another PR.

Sounds good.
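For illustration, here is a minimal sketch of that short-term approach: the "batched" entry point is just a loop over a per-image encode, so peak compute-buffer memory stays bounded to a single image. All names below (`encode_one_image`, `encode_images_seq`, the placeholder `clip_image_f32` struct) are hypothetical stand-ins, not the actual clip.h API.

```cpp
#include <cstddef>
#include <vector>

struct clip_image_f32 { int nx, ny; std::vector<float> data; }; // placeholder image type

// hypothetical per-image encode: fills `out` with one value per patch
static bool encode_one_image(const clip_image_f32 & img, std::vector<float> & out) {
    out.assign(static_cast<size_t>(img.nx) * img.ny, 0.0f); // stand-in for the real GGML graph evaluation
    return true;
}

// the "batch" entry point simply runs the per-image encode sequentially;
// each iteration reuses the same scratch buffers, so peak memory is that of one image
static bool encode_images_seq(const std::vector<clip_image_f32> & imgs,
                              std::vector<std::vector<float>> & outs) {
    outs.resize(imgs.size());
    for (size_t i = 0; i < imgs.size(); ++i) {
        if (!encode_one_image(imgs[i], outs[i])) {
            return false;
        }
    }
    return true;
}
```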
image_tokens->nx = clip_n_patches(ctx->ctx_clip);
image_tokens->ny = 1;
The `ny` (number of tokens in the y direction) is only used by qwen2vl for now (which we don't support yet in mtmd).

I should refactor `mtmd_image_tokens` very soon (in another PR). Otherwise, having both `nx` and `ny` for models not using M-RoPE looks very weird.
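For context, a simplified sketch of the bookkeeping being discussed; this illustrates the `nx`/`ny` relationship only and is not the real `mtmd_image_tokens` definition.

```cpp
#include <cstdint>

// Simplified sketch, not the real mtmd_image_tokens definition.
struct image_tokens_sketch {
    uint32_t nx; // tokens in the x direction (clip_n_patches(...) in the diff above)
    uint32_t ny; // tokens in the y direction; only meaningful for M-RoPE models
                 // such as Qwen2VL, so it is simply 1 for everything else
    uint32_t n_tokens() const { return nx * ny; } // total tokens the image contributes
};
```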
@ggerganov @slaren Tagging you here for visibility, regarding the discussion of tracking `n_past` internally in libllama. For qwen models, the first dimension of the 4D pos is the traditional `n_past`, but the rest are relative positions (in the case of an image, they are the 2D X/Y coordinates).

So I think when refactoring `llama_batch_ext`, we should find a way to allow the user to specify these "additional" dimensions.

(Note: for qwen, the dimensions of each token in llama.cpp are currently `[n_past, x, y, unused]`; the unused dim is reserved for future usage, e.g. 3D spatial understanding.)
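To make the `[n_past, x, y, unused]` layout concrete, here is a hedged sketch that fills per-token positions for one image's patch grid. The dim-major storage (all values of dimension 0, then dimension 1, and so on) and the function itself are assumptions for illustration, not a proposal for the final `llama_batch_ext` API.

```cpp
#include <cstdint>
#include <vector>

using llama_pos = int32_t;

// Sketch: 4D positions for one image of nx * ny patches, stored dim-major.
// Dim 0 is the traditional n_past; dims 1 and 2 are the relative X/Y patch
// coordinates; dim 3 is unused/reserved (e.g. future 3D spatial understanding).
static std::vector<llama_pos> build_image_positions(llama_pos n_past, int nx, int ny) {
    const int n_tokens = nx * ny;
    std::vector<llama_pos> pos(4 * static_cast<size_t>(n_tokens));
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const int i = y * nx + x;
            pos[0 * n_tokens + i] = n_past; // same base position for the whole image
            pos[1 * n_tokens + i] = x;      // relative x coordinate
            pos[2 * n_tokens + i] = y;      // relative y coordinate
            pos[3 * n_tokens + i] = 0;      // reserved
        }
    }
    return pos;
}
```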
…li` (ggml-org#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add not supported msg for qwen2vl
* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR unifies all vision models supported by llama.cpp into `libmtmd`.

Qwen2VL is not merged for now, due to some complications with M-RoPE. This can be resolved after the `llama_batch_ext` refactoring: #11875

These models are supported:

llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF # and other sizes: 12b, 27b (+ QAT version)
llama-mtmd-cli -hf guinmoon/MobileVLM-3B-GGUF --chat-template deepseek
llama-mtmd-cli -hf THUDM/glm-edge-v-5b-gguf
llama-mtmd-cli -hf second-state/Llava-v1.5-7B-GGUF --chat-template vicuna
llama-mtmd-cli -hf cjpais/llava-1.6-mistral-7b-gguf --chat-template vicuna
llama-mtmd-cli -hf ibm-research/granite-vision-3.2-2b-GGUF
llama-mtmd-cli -hf second-state/MiniCPM-Llama3-V-2_5-GGUF
llama-mtmd-cli -hf openbmb/MiniCPM-V-2_6-gguf
llama-mtmd-cli -hf openbmb/MiniCPM-o-2_6-gguf
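For example, after downloading one of the models above, a typical run looks roughly like this (the `--image` and `-p` flags are assumed to follow the existing llava/gemma3 CLI conventions; check `llama-mtmd-cli --help` for the exact options):

```sh
# assumed invocation style; verify the flags with `llama-mtmd-cli --help`
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
    --image ./cat.jpg \
    -p "Describe this image in one sentence."
```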
NOTE: `Yi-VL-6B` is removed from the test because: …

Follow-up PRs:
* `clip.cpp` (used by glm-edge): they should be processed as text tokens ==> will be a breaking change
* … `README` files