tts : add support for Orpheus #12476

Open
ggerganov opened this issue Mar 20, 2025 · 4 comments
Labels
good first issue (Good for newcomers), tts (Text-to-speech)

Comments

@ggerganov
Member

ggerganov commented Mar 20, 2025

HF: https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2

These TTS models look like good candidates for support. To add them, we need to implement the SNAC audio codec: https://github.com/hubertsiuzdak/snac/

Sample implementation using Python-based inference of SNAC: https://github.com/isaiahbjork/orpheus-tts-local

Similar model support (OuteTTS): #10784
It can be used as a reference for how to implement this.
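
For orientation, here is a minimal sketch of the kind of bookkeeping a SNAC-based decoder needs before vocoding. SNAC is a multi-scale codec, so coarse codes are emitted less often than fine ones. The sketch assumes each audio frame arrives from the LLM as 7 codes that fan out into three layers holding 1, 2, and 4 codes. That count and the exact ordering inside a frame are assumptions modelled on the Python sample linked above, not something confirmed in this thread, so check them against that repository. The name `split_into_layers` is hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: one frame = 1 coarse + 2 mid + 4 fine codes, 7 in total.
CODES_PER_FRAME = 7


def split_into_layers(codes: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave a flat code sequence into three codec layers.

    Assumed order inside one frame (hypothetical, verify before use):
    [coarse, mid, fine, fine, mid, fine, fine]
    A trailing partial frame is dropped.
    """
    n_frames = len(codes) // CODES_PER_FRAME
    layer0: List[int] = []
    layer1: List[int] = []
    layer2: List[int] = []
    for f in range(n_frames):
        start = f * CODES_PER_FRAME
        c = codes[start:start + CODES_PER_FRAME]
        layer0.append(c[0])
        layer1.extend([c[1], c[4]])
        layer2.extend([c[2], c[3], c[5], c[6]])
    return layer0, layer1, layer2
```

The three lists would then be handed to the codec decoder as separate per-layer inputs. The decoder call itself is left out here because it depends on how SNAC ends up being implemented.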

@ggerganov added the good first issue and tts labels on Mar 20, 2025
@leoflowers

Howdy! I'd like to give this issue a try.

@LostRuins
Collaborator

It would be awesome if we could get xcodec; we already have plenty of tts but no ttmusic yet. Everything is a llama model and snac/wavtokenizer/xcodec the vocoding is all that's missing (#11467).

@ggerganov
Member Author

> Everything is a llama model and snac/wavtokenizer/xcodec the vocoding is all that's missing

Technically yes, though we have to improve the way these codecs are implemented and supported in general. We should probably be able to have a single GGUF file containing both the LLM and the codec, instead of separate files, and be able to create either separate decoder / vocoder contexts from a single model or, alternatively, a combined decoder+vocoder context. At least that's my general idea for supporting multi-modal cases, though I'm still figuring it out.

> we already have plenty of tts

llama.cpp currently supports only OuteTTS, via WavTokenizer. It would be nice to have at least one more TTS model, so we can find common patterns that would help implement the above.

> the vocoding is all that's missing

Mostly yes, but we also have to figure out how to do audio streaming. I am not sure yet how it works, but with Orpheus we should be able to understand it because it supports real-time streaming.
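
One plausible shape for streaming, sketched under assumptions rather than taken from Orpheus itself: buffer the audio codes as the LLM produces them, and every time a full frame has arrived, hand the most recent few frames to the vocoder so audio can be emitted before generation finishes. The frame size of 7 and the window of 4 frames are assumed defaults, and `sliding_frame_windows` is a hypothetical helper name. The vocoder call is deliberately omitted.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def sliding_frame_windows(
    tokens: Iterable[int],
    frame_size: int = 7,     # assumption: codes per audio frame
    window_frames: int = 4,  # assumption: frames decoded per step
) -> Iterator[List[int]]:
    """Yield the latest `window_frames` frames each time a frame completes.

    Nothing is yielded until a full window has been received, so the
    first chunk appears after frame_size * window_frames codes.
    """
    window = frame_size * window_frames
    recent: Deque[int] = deque(maxlen=window)  # keeps only the newest codes
    for count, tok in enumerate(tokens, start=1):
        recent.append(tok)
        if count % frame_size == 0 and count >= window:
            yield list(recent)
```

Each yielded window overlaps the previous one by `window_frames - 1` frames, so a consumer would keep only the newly produced slice of decoded audio. How much to keep depends on the codec and is not specified here.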

@scalar27

This works pretty well on my Mac M1 running two instances of llama-server and fastrtc.
https://github.com/PkmX/orpheus-chat-webui
