csm : implement Sesame-based conversation example #12392
Comments
This is a very cool issue! I can take a look in the next few days if no one volunteers.
It would be interesting if someone could do a Mimi implementation in llama.cpp / ggml. AFAIK it has a small transformer inside with a sliding context window of 750 tokens, which may make it a bit complicated to implement. The ref python code is here: https://github.com/kyutai-labs/moshi/blob/77f9215629f1ff7914f0a3bb82508824a6436413/moshi/moshi/modules/transformer.py#L211
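For context, a sliding window of 750 tokens just means that each position only attends to the previous 750 positions. A minimal standalone C++ sketch (not tied to ggml; the function name and window handling are illustrative, not taken from the ref code) of how such a causal sliding-window mask could be built:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Causal sliding-window attention mask: position i may attend to position j
// only if j <= i and i - j < window (e.g. window = 750 for Mimi's internal
// transformer). 0.0f = visible, -INFINITY = masked out.
static std::vector<float> build_sliding_window_mask(int64_t n_tokens, int64_t window) {
    std::vector<float> mask(n_tokens * n_tokens, -INFINITY);
    for (int64_t i = 0; i < n_tokens; ++i) {
        for (int64_t j = std::max<int64_t>(0, i - window + 1); j <= i; ++j) {
            mask[i * n_tokens + j] = 0.0f;
        }
    }
    return mask;
}
```

In a ggml graph this kind of mask would typically be added to the KQ scores before the soft-max, similar to how the causal mask is handled in llama.cpp.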
I'm working on porting mimi to ggml, but it turns out to be more complicated than I initially thought. My code is still WIP, but I want to share it here; please let me know if someone has an idea to improve it / make it work: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp
@pminev Maybe I missed something, but I don't see the cgraph in your PR
Ok, so your PR only converts the weights to GGUF for now. Indeed, what I'm interested in is the cgraph, especially the cgraph that allows running the mimi model. Are you planning to work on that?
Sorry that I did not spend much time on this recently. @ngxson To enable Sesame TTS, we only need the decoder. llama with the RQ transformer will predict the RVQ tokens, then we can decode the tokens with the mimi decoder.
@ngxson yeah, I'm planning to work on the cgraph too. That's my next step.
Hey, thanks everyone for porting this! I run the Open Sesame Discord. We just got fine-tuning working and are moving on to adding a transcription head to the model. It is correct that to enable Sesame TTS we only need the decoder. However, there is a caveat: you would miss out on the conversational modeling, which is what makes this model magical. We need the encoder to encode the user's utterance, or to let the developer or user build a conversation context, so that the model understands the context and can adjust the prosody of the conversation. Without this, the model just sounds like Orpheus (with better voice cloning abilities) and lacks the conversational vibe. This is a requirement to get that Maya/Miles-like feel. The second requirement is just streaming and latency. Let me know if you guys want help. Some devs in the community would be stoked to work on this with a bit of guidance.
I'm working on the encoder/decoder slowly; I'm not particularly interested in sesame, but I know Laurent in person (the one who invented Mimi / Moshi) and was interested in what he built. Good news is, I correctly implemented the cgraph for the SEANet encoder in my ggml implementation: https://github.com/ngxson/ggml-easy/blob/master/demo/kyutai-mimi.cpp , the intermediate activations match 1-1 with the transformers version. What's left to do:
For ref, Mimi contains a SEANet, a transformer and a vector quantizer
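To make that breakdown concrete, here is a rough interface sketch of how the three components could compose; all type and function names below are hypothetical, not the API of the WIP port:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical component handles -- names are illustrative only.
struct mimi_seanet;      // SEANet convolutional encoder/decoder
struct mimi_transformer; // small transformer with a 750-token sliding window
struct mimi_vq;          // residual vector quantizer (codes <-> latents)

// Encode path: waveform -> SEANet encoder -> transformer -> RVQ codes
std::vector<int32_t> mimi_encode(const mimi_seanet & enc,
                                 const mimi_transformer & tf,
                                 const mimi_vq & vq,
                                 const std::vector<float> & pcm);

// Decode path: RVQ codes -> latents -> transformer -> SEANet decoder -> waveform
std::vector<float> mimi_decode(const mimi_vq & vq,
                               const mimi_transformer & tf,
                               const mimi_seanet & dec,
                               const std::vector<int32_t> & codes);
```

For Sesame TTS only the decode path is strictly required; the encode path is what would allow feeding the user's audio back in as conversation context.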
Similar to @ngxson, I am working on this slowly, and can serve as a code reviewer if anyone has a working version. To support Sesame, I can see the work being nicely divided into:
These components can be built and merged independently. After we have the necessary components, we can build a demo on top of them to support streaming.
@randxie At the beginning I was thinking of implementing all of those plus an example. Now I will start with extending llama; if no one starts the mimi parts, I will continue with them.
I've been able to get the decoder working. My implementation should be good enough to copy over to llama.cpp: https://github.com/ngxson/ggml-easy/blob/12752c5815826b29f1e6c8636f94c3524b3d3b1b/demo/kyutai-mimi.cpp#L654-L675 Probably for now it (the decoder) should be implemented as an example and not part of llama.cpp, because:
@ggerganov so, do you think it's ok to have the decoder code entirely in an example for now? The encoder still introduces a bit of noise, so text-to-token does not work for now. But as we're currently interested in TTS and not speech-to-speech, it should be fine for now. @pminev it would be nice if you can already make an example (probably a sub-example of
@ngxson I didn't have time this week, so I just started to work on CSM model support in llama (backbone & decoder - this part here). Now, looking at it, I'm actually not sure whether I have to do anything in llama, or whether I just need to implement an example that produces the codes and then passes them to your implementation of the decoder, as you said.
The backbone and decoder are just 2 transformers using the llama arch, just with different hyperparams. What you need to do is convert them into 2 separate GGUFs, then load them into 2 different llama_context(s). The full pipeline is:
Edit: the codebook in step 4 is indeed a token embedding tensor; it can be placed into the decoder's GGUF
Edit (2): the decoder in this context is not the mimi decoder, so let's call it "CSM decoder" to avoid confusion. The CSM decoder decodes the "latent" tokens into sound tokens, which can then be converted to a waveform using the Mimi decoder
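A minimal sketch, assuming the current llama.cpp C API (the exact function names may differ between versions), of the "two GGUFs, two llama_context(s)" setup; the GGUF file names and context sizes are placeholders:

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();

    // backbone and CSM decoder converted into two separate GGUF files (paths are placeholders)
    llama_model * model_backbone = llama_model_load_from_file("csm-backbone.gguf", mparams);
    llama_model * model_decoder  = llama_model_load_from_file("csm-decoder.gguf",  mparams);

    llama_context_params cparams = llama_context_default_params();

    cparams.n_ctx = 2048; // the backbone holds the conversation context
    llama_context * ctx_backbone = llama_init_from_model(model_backbone, cparams);

    cparams.n_ctx = 64;   // if I understand CSM correctly, the decoder only works within one frame of codebooks
    llama_context * ctx_decoder  = llama_init_from_model(model_decoder, cparams);

    // ... run the backbone to get the per-frame latent, run the CSM decoder to
    //     produce the RVQ codes, then hand the codes to the Mimi decoder ...

    llama_free(ctx_decoder);
    llama_free(ctx_backbone);
    llama_model_free(model_decoder);
    llama_model_free(model_backbone);
    llama_backend_free();
    return 0;
}
```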
If there are significant hacks in the implementation, then it's better to keep it in the examples until we improve it.
Putting it in the example will make it harder to abstract out later. Components like the RQ transformer and RVQ are not specific to sesame. Should we create an extension folder for audio models? cc: @ggerganov @ngxson
@randxie The RVQ and the transformer for mimi are all implemented in #12636 You just need to give it the RVQ audio tokens (semantic + acoustic tokens) and it will generate the waveform. I tested it with a waveform generated by CSM today, and it works. Now we just need to implement the 2 transformers of CSM in the code (which turns out to be not very straightforward, but they can still be added to the llama.cpp core library)
I'm working on these 2 transformers inside CSM. It's not as straightforward as it might seem, because:
@ngxson sorry for the confusion. I am not talking about a specific implementation or PR. Putting a lot of things into
You can give it a try if you want. IMO that's not a good idea for now. Explained in this comment: #12392 (comment)
Should be ready to test & review: #12648
This issue was closed because it has been inactive for 14 days since being marked as stale.
With the first Sesame CSM model openly available, we should implement a local example similar to their online research demo. It seems that the released CSM model uses Kyutai's Mimi audio codec, which we have to implement in a similar way as we did with the WavTokenizer. Next, we can modify the talk-llama example to support audio generation with the CSM. This way we will be able to plug in any LLM for the text response generation and use Sesame for speech input/output.
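For the conversational part described above, here is a hedged sketch of what a single turn of such an example could look like; all llm_*, csm_* and mimi_* helpers below are hypothetical placeholders for the components discussed in this issue, not existing APIs:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers -- stand-ins for the components discussed above.
std::string          llm_generate_reply(const std::string & user_text);            // any text LLM via llama.cpp
std::vector<int32_t> csm_generate_codes(const std::string & reply_text,            // CSM backbone + decoder
                                        const std::vector<int32_t> & conversation_codes);
std::vector<int32_t> mimi_encode_pcm(const std::vector<float> & pcm);               // Mimi encoder (for context)
std::vector<float>   mimi_decode_to_pcm(const std::vector<int32_t> & codes);        // Mimi decoder

void conversation_turn(const std::string & user_text,
                       const std::vector<float> & user_pcm,
                       std::vector<int32_t> & conversation_codes) {
    // 1. text response from any LLM
    const std::string reply = llm_generate_reply(user_text);

    // 2. keep the user's audio in the conversation context so CSM can adapt its prosody
    const std::vector<int32_t> user_codes = mimi_encode_pcm(user_pcm);
    conversation_codes.insert(conversation_codes.end(), user_codes.begin(), user_codes.end());

    // 3. CSM predicts RVQ codes for the reply, conditioned on the conversation so far
    const std::vector<int32_t> reply_codes = csm_generate_codes(reply, conversation_codes);
    conversation_codes.insert(conversation_codes.end(), reply_codes.begin(), reply_codes.end());

    // 4. Mimi turns the codes into audio samples for playback
    const std::vector<float> pcm = mimi_decode_to_pcm(reply_codes);
    (void) pcm; // hand off to audio output
}
```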