csm : implement Sesame-based conversation example #12392

Closed
ggerganov opened this issue Mar 14, 2025 · 23 comments
Labels
model (Model specific) · research 🔬 · stale · tts (Text-to-speech)

Comments

@ggerganov
Member

With the first Sesame CSM model openly available, we should implement a local example similar to their online research demo. It seems that the released CSM model uses Kyutai's Mimi audio codec, which we would have to implement in a similar way to what we did with WavTokenizer. Next we can modify the talk-llama example to support audio generation with the CSM. This way we will be able to plug in any LLM for the text response generation and use Sesame for speech input/output.
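
One possible shape for such an example, sketched very loosely below. Every function is a hypothetical placeholder (stubbed so the sketch compiles), not an existing llama.cpp or whisper.cpp API; the speech-to-text side is assumed to stay whatever talk-llama already does, and Sesame/Mimi handle the audio output side. The point is only the data flow: speech in, text reply from any LLM, CSM audio codes, Mimi waveform out.

```cpp
// Loose sketch of the conversation loop; all functions are hypothetical stubs.
#include <cstdint>
#include <string>
#include <vector>

std::vector<float>   record_audio()                                  { return {}; } // capture user speech
std::string          speech_to_text(const std::vector<float> & pcm)  { return {}; } // assumed: what talk-llama already does
std::string          llm_reply(const std::string & user_text)        { return {}; } // any LLM via llama.cpp
std::vector<int32_t> csm_generate_codes(const std::string & reply)   { return {}; } // CSM backbone + decoder
std::vector<float>   mimi_decode(const std::vector<int32_t> & codes) { return {}; } // Mimi codec: codes -> waveform
void                 play_audio(const std::vector<float> & pcm)      {}

int main() {
    for (int turn = 0; turn < 1; ++turn) {                   // single turn, for illustration
        const auto user_text = speech_to_text(record_audio());
        const auto reply     = llm_reply(user_text);         // text response generation
        play_audio(mimi_decode(csm_generate_codes(reply)));  // speech output via CSM + Mimi
    }
    return 0;
}
```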

@ggerganov ggerganov added model Model specific research 🔬 tts Text-to-speech labels Mar 14, 2025
@randxie
Contributor

randxie commented Mar 15, 2025

This is a very cool issue! I can take a look in the next few days if no one volunteers

@ngxson
Collaborator

ngxson commented Mar 19, 2025

It would be interesting if someone could put together a Mimi implementation in llama.cpp / ggml. AFAIK it has a small transformer inside with a sliding context window of 750 tokens, which may make it a bit complicated to implement. The reference Python code is here: https://github.com/kyutai-labs/moshi/blob/77f9215629f1ff7914f0a3bb82508824a6436413/moshi/moshi/modules/transformer.py#L211
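
To make the sliding-window constraint concrete, here is a minimal sketch of the attention mask it implies: each position attends to at most the previous `window` positions, itself included. Plain C++ for illustration only, not the ggml masking API; the window value of 750 is the one mentioned above.

```cpp
// Minimal illustration of a causal mask with a sliding window: position i may
// attend to position j only if j <= i and i - j < window.
#include <cstdio>
#include <vector>

std::vector<std::vector<bool>> sliding_window_mask(int n_tokens, int window) {
    std::vector<std::vector<bool>> mask(n_tokens, std::vector<bool>(n_tokens, false));
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j <= i; ++j) {
            mask[i][j] = (i - j < window); // causal + limited look-back
        }
    }
    return mask;
}

int main() {
    const auto mask = sliding_window_mask(/*n_tokens=*/8, /*window=*/3); // tiny toy sizes; Mimi uses 750
    for (const auto & row : mask) {
        for (bool allowed : row) {
            std::putchar(allowed ? '1' : '.');
        }
        std::putchar('\n');
    }
    return 0;
}
```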

@ngxson
Collaborator

ngxson commented Mar 24, 2025

I'm working on porting mimi to ggml, but it turns out to be more complicated than I initially thought.

My code is still WIP, but I want to share it here; please let me know if anyone has ideas to improve it / make it work: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp

@pminev
Contributor

pminev commented Mar 24, 2025

@ngxson I have started a rough implementation here #12549. I hope it's okay to continue.

@ngxson
Collaborator

ngxson commented Mar 24, 2025

@pminev Maybe I missed something, but I don't see the cgraph in your PR

@ngxson
Collaborator

ngxson commented Mar 24, 2025

Ok, so your PR only converts the weights to GGUF for now. Indeed, what I'm interested in is the cgraph, especially the cgraph that allows running the Mimi model. Are you planning to work on that?

@randxie
Contributor

randxie commented Mar 24, 2025

Sorry that I have not spent much time on this recently. @ngxson To enable Sesame TTS, we only need the decoder: llama with an RQ transformer will predict the RVQ tokens, and then we can decode those tokens with the Mimi decoder.

@pminev
Contributor

pminev commented Mar 24, 2025

@ngxson yeah, I'm planning to work on the cgraph too. That's my next step.

@CrossPr0duct

CrossPr0duct commented Mar 25, 2025

Hey, thanks everyone for porting this!

I run the Open Sesame Discord. We just got fine-tuning working and are moving on to adding a transcription head to the model.
I wanted to provide some information and maybe port the LLaMA 3.2 parts of this model if no one is willing to take it on. The audio encoder and decoder are out of my depth.

To enable Sesame TTS, we only need the decoder, which is correct.

However, there is a caveat to this. You would miss out on the conversational modeling, which makes this model magical.

We need the encoder to encode the user's utterance, or to let the developer or user build a conversation context, so that the model understands the context and can adjust the prosody of the conversation.

Without this, the model just sounds like Orpheus (with better voice cloning abilities) and lacks the conversational vibe. This is a requirement to get that Maya/Miles-like feel.

The second requirement is just streaming and latency.

Let me know if you guys want help. Some devs in the community would be stoked to work on this with a bit of guidance.

@ngxson @pminev @randxie

@ngxson
Collaborator

ngxson commented Mar 25, 2025

I'm working on the encoder/decoder slowly. I'm not particularly interested in Sesame, but I know Laurent in person (the one who invented Mimi / Moshi) and am interested in what he built.

The good news is that I have correctly implemented the cgraph for the SEANet encoder in my ggml implementation: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp (the intermediate activations match 1:1 with the transformers version).

What's left to do:

  • Add transformer and vector quantizer
  • Support causal attention with a ring KV cache (see the sketch at the end of this comment)
  • Repeat the same thing with decoder

For reference, Mimi contains a SEANet, a transformer, and a vector quantizer.
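
On the ring KV point: since attention only ever looks back over a fixed window, the K/V cache can be a fixed-size ring buffer that overwrites the oldest position. A minimal sketch of the idea in plain C++, not the actual llama.cpp KV-cache code; the window of 750 is taken from the earlier comment and the embedding size below is a toy value.

```cpp
// Conceptual ring-buffer KV cache for a fixed attention window: once full,
// the newest position overwrites the oldest one.
#include <algorithm>
#include <cstdio>
#include <vector>

struct RingKV {
    int window;                 // e.g. 750 for Mimi's transformer
    int n_embd;                 // per-position K/V size (toy value below)
    std::vector<float> k, v;    // window * n_embd floats each
    int head  = 0;              // slot the next position will be written to
    int count = 0;              // number of valid positions currently stored

    RingKV(int window, int n_embd)
        : window(window), n_embd(n_embd),
          k((size_t) window * n_embd, 0.0f), v((size_t) window * n_embd, 0.0f) {}

    // Store K/V for the newest position, overwriting the oldest once full.
    void push(const std::vector<float> & k_cur, const std::vector<float> & v_cur) {
        std::copy(k_cur.begin(), k_cur.end(), k.begin() + (size_t) head * n_embd);
        std::copy(v_cur.begin(), v_cur.end(), v.begin() + (size_t) head * n_embd);
        head  = (head + 1) % window;
        count = std::min(count + 1, window);
    }
};

int main() {
    RingKV cache(/*window=*/750, /*n_embd=*/8);
    cache.push(std::vector<float>(8, 0.1f), std::vector<float>(8, 0.2f));
    std::printf("stored positions: %d\n", cache.count);
    return 0;
}
```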

@randxie
Contributor

randxie commented Mar 25, 2025

Similar to @ngxson, I am working on this slowly and can serve as a code reviewer if anyone has a working version. To support Sesame, I can see the work being nicely divided into:

  1. Support Mimi Encoder
  2. Support Mimi Decoder
  3. Extend llama to support RQ transformer

These components can be independently built and merged.

Once we have the necessary components, we can build a demo on top of them to support streaming.

@pminev
Contributor

pminev commented Mar 28, 2025

@randxie At the beginning I was thinking of implementing all of those + the example. Now I will start with extending llama; if no one starts on the Mimi parts, I will continue with them.

@ngxson
Collaborator

ngxson commented Mar 28, 2025

I've been able to get the decoder working. My implementation should be good enough to copy over to llama.cpp: https://github.com/ngxson/ggml-easy/blob/12752c5815826b29f1e6c8636f94c3524b3d3b1b/demo/kyutai-mimi.cpp#L654-L675

Probably for now it (the decoder) should be implemented as an example and not part of llama.cpp, because:

  1. It runs painfully slowly on GPU: in my case the back-and-forth between CPU and GPU (Metal) costs 10s, while CPU-only takes less than 1 sec to finish
  2. I have to hack ggml_conv_transpose_1d (for the depthwise conv) and ggml_pad (for asymmetric padding), which looks quite ugly
  3. While the RVQ uses ggml_get_rows, it contains multiple codebooks, so unfortunately we can't reuse the build_inp from llama-graph (see the sketch after this list)
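
For context on point 3, the multi-codebook dequantization roughly amounts to looking up one embedding row per codebook and summing them. A plain C++ illustration of that idea; in the ggml graph this would be one ggml_get_rows per codebook followed by ggml_add, and the names and toy sizes below are made up.

```cpp
// Rough illustration of RVQ dequantization with multiple codebooks: the final
// embedding is the sum of one embedding row per codebook.
#include <cstdio>
#include <vector>

std::vector<float> rvq_dequantize(
        const std::vector<std::vector<std::vector<float>>> & codebooks, // [n_codebooks][n_codes][n_embd]
        const std::vector<int>                             & codes) {   // one code index per codebook
    const size_t n_embd = codebooks[0][0].size();
    std::vector<float> out(n_embd, 0.0f);
    for (size_t cb = 0; cb < codes.size(); ++cb) {
        const auto & row = codebooks[cb][codes[cb]]; // "get_rows" on codebook cb
        for (size_t i = 0; i < n_embd; ++i) {
            out[i] += row[i];                        // accumulate across codebooks
        }
    }
    return out;
}

int main() {
    // 2 codebooks, 4 entries each, embedding dim 3 - toy numbers only.
    const std::vector<std::vector<std::vector<float>>> codebooks(
        2, std::vector<std::vector<float>>(4, std::vector<float>(3, 0.5f)));
    const auto embd = rvq_dequantize(codebooks, {1, 3});
    std::printf("embd[0] = %.1f\n", embd[0]); // 1.0
    return 0;
}
```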

@ggerganov so, do you think it's ok to have the decoder code entirely in an example for now?

The encoder still introduces a bit of noise, so audio-to-token conversion does not work for now. But since we're currently interested in TTS and not speech-to-speech, that should be fine for now.

@pminev it would be nice if you could already make an example (probably a sub-example of llama-tts?) that accepts text input and returns an array of codes.

@pminev
Contributor

pminev commented Mar 28, 2025

@ngxson I didn't have time this week, so I just started to work on CSM model support in llama (backbone & decoder - this part here). Looking at it now, I'm actually not sure whether I need to change anything in llama, or whether I just need to implement an example which produces the codes and then passes them to your implementation of the decoder, as you said.

@ngxson
Collaborator

ngxson commented Mar 28, 2025

The backbone and decoder are just 2 transformers using the llama arch, just with different hyperparams.

What you need to do is convert them into 2 separate GGUFs, then load them into 2 different llama_contexts.

The full pipeline is:

  1. format and tokenize text: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L60
  2. pass the tokens to generate_frame: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L142
  3. decode input using backbone transformer: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L158
  4. sampling + lookup embeddings from codebook: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L165-L167
  5. pass the embeddings to decoder transformer: https://github.com/SesameAILabs/csm/blob/2d720827843b653c4d67bb4445b1c0a4f59e646f/models.py#L173
  6. the output will be audio codes, ready to be converted to waveform via mimi: https://github.com/SesameAILabs/csm/blob/ed90181a15de0ed4a32a4f83363bfa2a54093c10/generator.py#L154

Edit: the codebook in step 4 is indeed a token embedding tensor; it can be placed into the decoder's GGUF.

Edit (2): the decoder in this context is not the Mimi decoder; let's call it the "CSM decoder" so there is no confusion. The CSM decoder decodes the "latent" tokens into sound tokens, which can then be converted to a waveform using the Mimi decoder.
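
For anyone following along, here is a very loose sketch of how those steps might fit together for a single frame. Every function below is a hypothetical stub (so the sketch compiles), not a llama.cpp API, and the lm-head/sampling details are simplified. As far as I can tell from the linked code, the backbone predicts the first codebook of a frame and the CSM decoder then predicts the remaining ones; the sketch follows that reading, and the linked SesameAILabs sources remain the authoritative reference.

```cpp
// Very loose sketch of generating one frame of audio codes, mapped to the
// numbered steps above. Every function is a hypothetical stub, not a llama.cpp API.
#include <cstdint>
#include <string>
#include <vector>

using Embd = std::vector<float>;

std::vector<int32_t> tokenize_segment(const std::string & text)   { return {}; }         // steps 1-2: format + tokenize
Embd    backbone_decode (const std::vector<int32_t> & toks)       { return Embd(2048); } // step 3: backbone hidden state
int32_t sample_code     (const Embd & logits_or_hidden)           { return 0; }          // step 4: sampling
Embd    lookup_codebook (int32_t code, int codebook)              { return Embd(1024); } // step 4: embedding lookup
Embd    csm_decoder_step(const Embd & embd)                       { return Embd(1024); } // step 5: one CSM-decoder step

// One frame = one code per Mimi codebook; the frame then goes through the
// Mimi decoder to become a waveform (step 6).
std::vector<int32_t> generate_frame(const std::string & text, int n_codebooks) {
    const Embd h = backbone_decode(tokenize_segment(text)); // steps 1-3
    std::vector<int32_t> frame;

    int32_t code = sample_code(h);                           // step 4: first codebook, from the backbone
    frame.push_back(code);
    Embd cur = lookup_codebook(code, 0);

    for (int cb = 1; cb < n_codebooks; ++cb) {               // step 5: CSM decoder fills the remaining codebooks
        cur  = csm_decoder_step(cur);
        code = sample_code(cur);
        frame.push_back(code);
        cur  = lookup_codebook(code, cb);
    }
    return frame;                                            // step 6: frame -> Mimi decoder -> waveform
}

int main() {
    const auto frame = generate_frame("Hello from CSM", /*n_codebooks=*/8); // codebook count is illustrative
    return frame.size() == 8 ? 0 : 1;
}
```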

@ggerganov
Member Author

> @ggerganov so, do you think it's ok to have the decoder code entirely in an example for now?

If there are significant hacks in the implementation, then it's better to keep it in the examples, until we improve it.

@randxie
Contributor

randxie commented Mar 29, 2025

Putting it in the examples will make it harder to abstract out later. Components like the RQ transformer and RVQ are not specific to Sesame. Should we create an extension folder for audio models?

cc: @ggerganov @ngxson

@ngxson
Collaborator

ngxson commented Mar 29, 2025

@randxie The RVQ and transformer for Mimi are both implemented in #12636

You just need to give it the RVQ audio tokens (semantic + acoustic tokens) and it will generate the waveform.

I tested with a waveform generated by CSM today, and it works. Now we just need to implement the 2 transformers of CSM in the code (which turns out to be not very straightforward, but they can still be added to the llama.cpp core library).

@ngxson
Collaborator

ngxson commented Mar 29, 2025

I'm working on these 2 transformers inside CSM. It's not entirely straightforward because:

  • The backbone has an input vocab size of 128k tokens, but the output vocab is only 2051 tokens. Not a big deal; we can use the first half of the output logits tensor to store that
  • The decoder has an input projector to convert the n_embd dim of the backbone (2048) to the n_embd of the decoder (1024)
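
Regarding the second point, the input projector is just a learned linear map between the two embedding sizes; in ggml terms this would presumably be a single ggml_mul_mat with a [1024 x 2048] weight. A toy-sized plain C++ illustration, with made-up dimensions standing in for 2048 -> 1024:

```cpp
// The projector between the backbone (n_embd = 2048) and the CSM decoder
// (n_embd = 1024) is a learned linear map y = W x. Toy dimensions below.
#include <cstdio>
#include <vector>

std::vector<float> project(const std::vector<std::vector<float>> & W,    // [n_out][n_in]
                           const std::vector<float>               & x) { // [n_in]
    std::vector<float> y(W.size(), 0.0f);
    for (size_t o = 0; o < W.size(); ++o) {
        for (size_t i = 0; i < x.size(); ++i) {
            y[o] += W[o][i] * x[i];
        }
    }
    return y;
}

int main() {
    const std::vector<std::vector<float>> W(4, std::vector<float>(8, 0.1f)); // toy 8 -> 4 projection
    const std::vector<float>              x(8, 1.0f);
    std::printf("y[0] = %.1f\n", project(W, x)[0]); // 0.8
    return 0;
}
```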

@randxie
Contributor

randxie commented Mar 29, 2025

@ngxson sorry for the confusion. I am not talking about a specific implementation or PR.

Putting a lot of things into examples will make it hard to extract common components out later. I am suggesting that we could have a folder like llama.cpp/src/audio/ to allow people to add custom audio models, which would make common modules sharable.

@ngxson
Collaborator

ngxson commented Mar 29, 2025

You can give it a try if you want. IMO that's not a good idea for now. Explained in this comment: #12392 (comment)

@ngxson
Collaborator

ngxson commented Mar 30, 2025

Should be ready to test & review: #12648

github-actions bot added the stale label Apr 30, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
