csm : implement Sesame-based conversation example #12392

Closed
ggerganov opened this issue Mar 14, 2025 · 23 comments
Labels
model (Model specific) · research 🔬 · stale · tts (Text-to-speech)

Comments

@ggerganov
Member

With the first Sesame CSM model openly available, we should implement a local example similar to their online research demo. It seems that the released CSM model uses Kyutai's Mimi audio codec, which we would have to implement in a similar way to what we did with WavTokenizer. Next we can modify the talk-llama example to support audio generation with the CSM. This way we will be able to plug in any LLM for the text response generation and use Sesame for speech input/output.
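
One possible shape for such an example, sketched very loosely below. Every function is a hypothetical placeholder (stubbed so the sketch compiles), not an existing llama.cpp or whisper.cpp API; the speech-to-text side is assumed to stay whatever talk-llama already does, and Sesame/Mimi handle the audio output side. The point is only the data flow: speech in, text reply from any LLM, CSM audio codes, Mimi waveform out.

```cpp
// Loose sketch of the conversation loop; all functions are hypothetical stubs.
#include <cstdint>
#include <string>
#include <vector>

std::vector<float>   record_audio()                                  { return {}; } // capture user speech
std::string          speech_to_text(const std::vector<float> & pcm)  { return {}; } // assumed: what talk-llama already does
std::string          llm_reply(const std::string & user_text)        { return {}; } // any LLM via llama.cpp
std::vector<int32_t> csm_generate_codes(const std::string & reply)   { return {}; } // CSM backbone + decoder
std::vector<float>   mimi_decode(const std::vector<int32_t> & codes) { return {}; } // Mimi codec: codes -> waveform
void                 play_audio(const std::vector<float> & pcm)      {}

int main() {
    for (int turn = 0; turn < 1; ++turn) {                   // single turn, for illustration
        const auto user_text = speech_to_text(record_audio());
        const auto reply     = llm_reply(user_text);         // text response generation
        play_audio(mimi_decode(csm_generate_codes(reply)));  // speech output via CSM + Mimi
    }
    return 0;
}
```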

@ggerganov ggerganov added model Model specific research 🔬 tts Text-to-speech labels Mar 14, 2025
@randxie
Contributor

randxie commented Mar 15, 2025

This is a very cool issue! I can take a look in the next few days if no one volunteers

@ngxson
Collaborator

ngxson commented Mar 19, 2025

It would be interesting if someone could put together a Mimi implementation in llama.cpp / ggml. AFAIK it has a small transformer inside with a sliding context window of 750 tokens, which may make it a bit complicated to implement. The reference Python code is here: https://github.com/kyutai-labs/moshi/blob/77f9215629f1ff7914f0a3bb82508824a6436413/moshi/moshi/modules/transformer.py#L211
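
To make the sliding-window constraint concrete, here is a minimal sketch of the attention mask it implies: each position attends to at most the previous `window` positions, itself included. Plain C++ for illustration only, not the ggml masking API; the window value of 750 is the one mentioned above.

```cpp
// Minimal illustration of a causal mask with a sliding window: position i may
// attend to position j only if j <= i and i - j < window.
#include <cstdio>
#include <vector>

std::vector<std::vector<bool>> sliding_window_mask(int n_tokens, int window) {
    std::vector<std::vector<bool>> mask(n_tokens, std::vector<bool>(n_tokens, false));
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j <= i; ++j) {
            mask[i][j] = (i - j < window); // causal + limited look-back
        }
    }
    return mask;
}

int main() {
    const auto mask = sliding_window_mask(/*n_tokens=*/8, /*window=*/3); // tiny toy sizes; Mimi uses 750
    for (const auto & row : mask) {
        for (bool allowed : row) {
            std::putchar(allowed ? '1' : '.');
        }
        std::putchar('\n');
    }
    return 0;
}
```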

@ngxson
Collaborator

ngxson commented Mar 24, 2025

I'm working on porting mimi to ggml, but it turns out to be more complicated than I initially thought.

My code is still WIP, but I want to share it here; please let me know if anyone has ideas to improve it / make it work: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp

@pminev
Contributor

pminev commented Mar 24, 2025

@ngxson I have started a rough implementation here #12549. I hope it's okay to continue.

@ngxson
Collaborator

ngxson commented Mar 24, 2025

@pminev Maybe I missed something, but I don't see the cgraph in your PR

@ngxson
Collaborator

ngxson commented Mar 24, 2025

Ok, so your PR only converts the weights to GGUF for now. Indeed, what I'm interested in is the cgraph, especially the cgraph that allows running the Mimi model. Are you planning to work on that?

@randxie
Contributor

randxie commented Mar 24, 2025

Sorry that I have not spent much time on this recently. @ngxson To enable Sesame TTS, we only need the decoder: llama with an RQ transformer will predict the RVQ tokens, and then we can decode those tokens with the Mimi decoder.

@pminev
Contributor

pminev commented Mar 24, 2025

@ngxson yeah, I'm planning to work on the cgraph too. That's my next step.

@CrossPr0duct

CrossPr0duct commented Mar 25, 2025

Hey, thanks everyone for porting this!

I run the Open Sesame Discord. We just got fine-tuning working and are moving on to adding a transcription head to the model.
I wanted to provide some information and maybe port the LLaMA 3.2 parts of this model if no one is willing to take it on. The audio encoder and decoder are out of my depth.

To enable Sesame TTS, we only need the decoder, which is correct.

However, there is a caveat to this. You would miss out on the conversational modeling, which makes this model magical.

We need the encoder to encode the user's utterance, or to let the developer or user build a conversation context, so that the model understands the context and can adjust the prosody of the conversation.

Without this, the model just sounds like Orpheus (with better voice cloning abilities) and lacks the conversational vibe. This is a requirement to get that Maya/Miles-like feel.

The second requirement is just streaming and latency.

Let me know if you guys want help. Some devs in the community would be stoked to work on this with a bit of guidance.

@ngxson @pminev @randxie

@ngxson
Collaborator

ngxson commented Mar 25, 2025

I'm working on the encoder/decoder slowly. I'm not particularly interested in Sesame, but I know Laurent in person (the one who invented Mimi / Moshi) and am interested in what he built.

The good news is that I have correctly implemented the cgraph for the SEANet encoder in my ggml implementation: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp (the intermediate activations match 1:1 with the transformers version).

What's left to do:

  • Add transformer and vector quantizer
  • Support causal attention with a ring KV cache (see the sketch at the end of this comment)
  • Repeat the same thing with decoder

For reference, Mimi contains a SEANet, a transformer, and a vector quantizer.
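
On the ring KV point: since attention only ever looks back over a fixed window, the K/V cache can be a fixed-size ring buffer that overwrites the oldest position. A minimal sketch of the idea in plain C++, not the actual llama.cpp KV-cache code; the window of 750 is taken from the earlier comment and the embedding size below is a toy value.

```cpp
// Conceptual ring-buffer KV cache for a fixed attention window: once full,
// the newest position overwrites the oldest one.
#include <algorithm>
#include <cstdio>
#include <vector>

struct RingKV {
    int window;                 // e.g. 750 for Mimi's transformer
    int n_embd;                 // per-position K/V size (toy value below)
    std::vector<float> k, v;    // window * n_embd floats each
    int head  = 0;              // slot the next position will be written to
    int count = 0;              // number of valid positions currently stored

    RingKV(int window, int n_embd)
        : window(window), n_embd(n_embd),
          k((size_t) window * n_embd, 0.0f), v((size_t) window * n_embd, 0.0f) {}

    // Store K/V for the newest position, overwriting the oldest once full.
    void push(const std::vector<float> & k_cur, const std::vector<float> & v_cur) {
        std::copy(k_cur.begin(), k_cur.end(), k.begin() + (size_t) head * n_embd);
        std::copy(v_cur.begin(), v_cur.end(), v.begin() + (size_t) head * n_embd);
        head  = (head + 1) % window;
        count = std::min(count + 1, window);
    }
};

int main() {
    RingKV cache(/*window=*/750, /*n_embd=*/8);
    cache.push(std::vector<float>(8, 0.1f), std::vector<float>(8, 0.2f));
    std::printf("stored positions: %d\n", cache.count);
    return 0;
}
```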

@randxie
Contributor

randxie commented Mar 25, 2025

Similar to @ngxson, I am working on this slowly and can serve as a code reviewer if anyone has a working version. To support Sesame, I can see the work being nicely divided into:

  1. Support Mimi Encoder
  2. Support Mimi Decoder
  3. Extend llama to support RQ transformer

These components can be independently built and merged.

Once we have the necessary components, we can build a demo on top of them to support streaming.

@pminev
Contributor

pminev commented Mar 28, 2025

@randxie At the beginning I was thinking of implementing all of those + the example. Now I will start with extending llama; if no one starts on the Mimi parts, I will continue with them.

@ngxson
Collaborator

ngxson commented Mar 28, 2025

I've been able to get the decoder working. My implementation should be good enough to copy over to llama.cpp: https://github.com/ngxson/ggml-easy/blob/12752c5815826b29f1e6c8636f94c3524b3d3b1b/demo/kyutai-mimi.cpp#L654-L675

Probably for now it (the decoder) should be implemented as an example and not part of llama.cpp, because:

  1. It runs painfully slowly on GPU: in my case the back-and-forth between CPU and GPU (Metal) costs 10s, while CPU-only takes less than 1 sec to finish
  2. I have to hack ggml_conv_transpose_1d (for the depthwise conv) and ggml_pad (for asymmetric padding), which looks quite ugly
  3. While the RVQ uses ggml_get_rows, it contains multiple codebooks, so unfortunately we can't reuse the build_inp from llama-graph (see the sketch after this list)
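
For context on point 3, the multi-codebook dequantization roughly amounts to looking up one embedding row per codebook and summing them. A plain C++ illustration of that idea; in the ggml graph this would be one ggml_get_rows per codebook followed by ggml_add, and the names and toy sizes below are made up.

```cpp
// Rough illustration of RVQ dequantization with multiple codebooks: the final
// embedding is the sum of one embedding row per codebook.
#include <cstdio>
#include <vector>

std::vector<float> rvq_dequantize(
        const std::vector<std::vector<std::vector<float>>> & codebooks, // [n_codebooks][n_codes][n_embd]
        const std::vector<int>                             & codes) {   // one code index per codebook
    const size_t n_embd = codebooks[0][0].size();
    std::vector<float> out(n_embd, 0.0f);
    for (size_t cb = 0; cb < codes.size(); ++cb) {
        const auto & row = codebooks[cb][codes[cb]]; // "get_rows" on codebook cb
        for (size_t i = 0; i < n_embd; ++i) {
            out[i] += row[i];                        // accumulate across codebooks
        }
    }
    return out;
}

int main() {
    // 2 codebooks, 4 entries each, embedding dim 3 - toy numbers only.
    const std::vector<std::vector<std::vector<float>>> codebooks(
        2, std::vector<std::vector<float>>(4, std::vector<float>(3, 0.5f)));
    const auto embd = rvq_dequantize(codebooks, {1, 3});
    std::printf("embd[0] = %.1f\n", embd[0]); // 1.0
    return 0;
}
```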

@ggerganov so, do you think it's ok to have the decoder code entirely in an example for now?

The encoder still introduces a bit of noise, so audio-to-token conversion does not work for now. But since we're currently interested in TTS and not speech-to-speech, that should be fine for now.

@pminev it would be nice if you could already make an example (probably a sub-example of llama-tts?) that accepts text input and returns an array of codes.

@pminev
Contributor

pminev commented Mar 28, 2025

@ngxson I didn't have time this week, so I just started to work on CSM model support in llama (backbone & decoder - this part here). Looking at it now, I'm actually not sure whether I need to change anything in llama, or whether I just need to implement an example which produces the codes and then passes them to your implementation of the decoder, as you said.

@ngxson
Collaborator

ngxson commented Mar 28, 2025

The backbone and decoder are just 2 transformers using the llama arch, just with different hyperparams.

What you need to do is convert them into 2 separate GGUFs, then load them into 2 different llama_contexts.

The full pipeline is:

  1. format and tokenize text: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L60
  2. pass the tokens to generate_frame: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L142
  3. decode input using backbone transformer: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L158
  4. sampling + lookup embeddings from codebook: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L165-L167
  5. pass the embeddings to decoder transformer: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L173
  6. the output will be audio codes, ready to be converted to waveform via mimi: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L154

Edit: the codebook in step 4 is indeed a token embedding tensor; it can be placed into the decoder's GGUF.

Edit (2): the decoder in this context is not the Mimi decoder; let's call it the "CSM decoder" so there is no confusion. The CSM decoder decodes the "latent" tokens into sound tokens, which can then be converted to a waveform using the Mimi decoder.
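
For anyone following along, here is a very loose sketch of how those steps might fit together for a single frame. Every function below is a hypothetical stub (so the sketch compiles), not a llama.cpp API, and the lm-head/sampling details are simplified. As far as I can tell from the linked code, the backbone predicts the first codebook of a frame and the CSM decoder then predicts the remaining ones; the sketch follows that reading, and the linked SesameAILabs sources remain the authoritative reference.

```cpp
// Very loose sketch of generating one frame of audio codes, mapped to the
// numbered steps above. Every function is a hypothetical stub, not a llama.cpp API.
#include <cstdint>
#include <string>
#include <vector>

using Embd = std::vector<float>;

std::vector<int32_t> tokenize_segment(const std::string & text)   { return {}; }         // steps 1-2: format + tokenize
Embd    backbone_decode (const std::vector<int32_t> & toks)       { return Embd(2048); } // step 3: backbone hidden state
int32_t sample_code     (const Embd & logits_or_hidden)           { return 0; }          // step 4: sampling
Embd    lookup_codebook (int32_t code, int codebook)              { return Embd(1024); } // step 4: embedding lookup
Embd    csm_decoder_step(const Embd & embd)                       { return Embd(1024); } // step 5: one CSM-decoder step

// One frame = one code per Mimi codebook; the frame then goes through the
// Mimi decoder to become a waveform (step 6).
std::vector<int32_t> generate_frame(const std::string & text, int n_codebooks) {
    const Embd h = backbone_decode(tokenize_segment(text)); // steps 1-3
    std::vector<int32_t> frame;

    int32_t code = sample_code(h);                           // step 4: first codebook, from the backbone
    frame.push_back(code);
    Embd cur = lookup_codebook(code, 0);

    for (int cb = 1; cb < n_codebooks; ++cb) {               // step 5: CSM decoder fills the remaining codebooks
        cur  = csm_decoder_step(cur);
        code = sample_code(cur);
        frame.push_back(code);
        cur  = lookup_codebook(code, cb);
    }
    return frame;                                            // step 6: frame -> Mimi decoder -> waveform
}

int main() {
    const auto frame = generate_frame("Hello from CSM", /*n_codebooks=*/8); // codebook count is illustrative
    return frame.size() == 8 ? 0 : 1;
}
```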

@ggerganov
Member Author

> @ggerganov so, do you think it's ok to have the decoder code entirely in an example for now?

If there are significant hacks in the implementation, then it's better to keep it in the examples, until we improve it.

@randxie
Contributor

randxie commented Mar 29, 2025

Putting it in the examples will make it harder to abstract out later. Components like the RQ transformer and RVQ are not specific to Sesame. Should we create an extension folder for audio models?

cc: @ggerganov @ngxson

@ngxson
Collaborator

ngxson commented Mar 29, 2025

@randxie The RVQ and transformer for Mimi are both implemented in #12636

You just need to give it the RVQ audio tokens (semantic + acoustic tokens) and it will generate the waveform.

I tested with a waveform generated by CSM today, and it works. Now we just need to implement the 2 transformers of CSM in the code (which turns out to be not very straightforward, but they can still be added to the llama.cpp core library).

@ngxson
Collaborator

ngxson commented Mar 29, 2025

I'm working on these 2 transformers inside CSM. It's not entirely straightforward because:

  • The backbone has an input vocab size of 128k tokens, but the output vocab is only 2051 tokens. Not a big deal; we can use the first half of the output logits tensor to store that
  • The decoder has an input projector to convert the n_embd dim of the backbone (2048) to the n_embd of the decoder (1024)
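
Regarding the second point, the input projector is just a learned linear map between the two embedding sizes; in ggml terms this would presumably be a single ggml_mul_mat with a [1024 x 2048] weight. A toy-sized plain C++ illustration, with made-up dimensions standing in for 2048 -> 1024:

```cpp
// The projector between the backbone (n_embd = 2048) and the CSM decoder
// (n_embd = 1024) is a learned linear map y = W x. Toy dimensions below.
#include <cstdio>
#include <vector>

std::vector<float> project(const std::vector<std::vector<float>> & W,    // [n_out][n_in]
                           const std::vector<float>               & x) { // [n_in]
    std::vector<float> y(W.size(), 0.0f);
    for (size_t o = 0; o < W.size(); ++o) {
        for (size_t i = 0; i < x.size(); ++i) {
            y[o] += W[o][i] * x[i];
        }
    }
    return y;
}

int main() {
    const std::vector<std::vector<float>> W(4, std::vector<float>(8, 0.1f)); // toy 8 -> 4 projection
    const std::vector<float>              x(8, 1.0f);
    std::printf("y[0] = %.1f\n", project(W, x)[0]); // 0.8
    return 0;
}
```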

@randxie
Contributor

randxie commented Mar 29, 2025

@ngxson sorry for the confusion. I am not talking about a specific implementation or PR.

Putting a lot of things into examples will make it hard to extract common components out later. I am suggesting that we could have a folder like llama.cpp/src/audio/ to allow people to add custom audio models, which would make common modules sharable.

@ngxson
Collaborator

ngxson commented Mar 29, 2025

You can give it a try if you want. IMO that's not a good idea for now. Explained in this comment: #12392 (comment)

@ngxson
Collaborator

ngxson commented Mar 30, 2025

Should be ready to test & review: #12648

github-actions bot added the stale label Apr 30, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
