Join us in the #audio-generation channel on the LAION Discord to chat, ask questions, or contribute!
WhisperSpeech is an open-source, text-to-speech (TTS) system created by “inverting” OpenAI Whisper.
Our goal is to be for speech what Stable Diffusion is for images—powerful, hackable, and commercially safe.
- All code is Apache-2.0 / MIT.
- Models are trained only on properly licensed data.
- Current release: English (trained on LibriLight). Multilingual release coming next.
Sample output →
whisperspeech-sample.mp4
[2024-01-29] – Tiny S2A multilingual voice-cloning
We trained a tiny S2A model on an en + pl + fr dataset; it successfully clones French voices even though its semantic tokens come from a tokeniser trained only on English and Polish. This is evidence that a single tokeniser could cover all languages.
https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9
[2024-01-18] – 12× real-time on a 4090 + voice-cloning demo
- Added `torch.compile`, KV-caching, and layer tweaks → 12× faster-than-real-time synthesis on a consumer RTX 4090 (a toy sketch of these inference tricks follows the sample below).
- Seamless code-switching within a single sentence:
"To jest pierwszy test wielojęzycznego Whisper Speech modelu …" (Polish/English mix: "This is the first test of the multilingual Whisper Speech model …")
pl-en-mix.mp4
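The tricks behind the speed-up are standard for autoregressive transformer inference. A toy sketch, not the actual WhisperSpeech internals: cache each layer's keys and values so every new token only attends over the stored prefix, and wrap the module in `torch.compile` to cut the remaining Python overhead:

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    """One self-attention layer; the caller threads the KV prefix through."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_new, kv_prefix):
        # Attend from the new token over the cached prefix plus itself,
        # instead of re-running attention over the whole sequence each step.
        kv = torch.cat([kv_prefix, x_new], dim=1)
        out, _ = self.attn(x_new, kv, kv, need_weights=False)
        return out, kv

layer = torch.compile(ToyDecoderLayer())  # PyTorch >= 2.0

x = torch.randn(1, 1, 256)      # one token per step
cache = torch.zeros(1, 0, 256)  # empty KV prefix
for _ in range(5):
    x, cache = layer(x, cache)  # cache grows by one step per token
```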
- One-click voice-cloning—example based on Winston Churchill’s “Be Ye Men of Valour” (radio static preserved by design):
en-cloning.mp4
Test it on Colab (≤ 30 s install). Hugging Face Space coming soon.
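For reference, the cloning demo boils down to passing a short reference recording when generating. A minimal sketch assuming the `whisperspeech` package's `Pipeline.generate` API with its `speaker` argument, as used in the project notebooks; the file name here is a placeholder:

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # downloads the default pre-trained models on first use

# `speaker` takes a reference recording (local path or URL); the generated
# speech mimics that voice, artefacts such as radio static included.
audio = pipe.generate(
    "And be ye men of valour.",
    speaker="churchill-reference.wav",  # placeholder reference file
)
```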
[2024-01-10] – Faster SD S2A + first cloning example
A new SD‑size S2A model brings major speed‑ups without sacrificing quality; cloning example added.
Try it on Colab.
[2023-12-10] – Multilingual model trio (EN/PL)
- English (female voice transferred from a Polish dataset):
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
- Polish (male voice):
https://github.com/collabora/WhisperSpeech/assets/107984/4da14b03-33f9-4e2d-be42-f0fcf1d4a6ec
Unofficial speed & memory‑usage results from the community can be found here.
- Quick start: open the Colab above or run the notebook locally; a minimal script version is sketched below.
- Manual downloads:
- Pre‑trained models – https://huggingface.co/collabora/whisperspeech
- Converted datasets – https://huggingface.co/datasets/collabora/whisperspeech
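For local runs, here is a minimal quick-start sketch based on the examples published alongside the models above. The model reference follows the `repo:filename` convention used on the Hugging Face page; exact checkpoint names may change between releases:

```python
# pip install whisperspeech
from whisperspeech.pipeline import Pipeline

# Pick an S2A checkpoint from https://huggingface.co/collabora/whisperspeech
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Renders an audio player inline when run in a Jupyter/Colab notebook.
pipe.generate_to_notebook("This is a test of WhisperSpeech.")
```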
- Gather large emotive‑speech dataset
- Condition generation on emotion & prosody
- Community drive for freely licensed multilingual speech
- Train final multilingual models
WhisperSpeech follows the two‑stage, token‑based pipeline popularised by AudioLM, Google's SPEAR TTS, and Meta's MusicGen:
| Stage | Model | Purpose |
|---|---|---|
| Semantic | Whisper | Transcription ➜ semantic tokens |
| Acoustic | EnCodec | Tokenise waveform (1.5 kbps) |
| Vocoder | Vocos | High‑fidelity audio |
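To make the lower two rows concrete: EnCodec compresses a waveform into discrete acoustic tokens at 1.5 kbps, and Vocos decodes those same tokens back to audio with far better fidelity than EnCodec's own decoder. The WhisperSpeech models predict these acoustic tokens from the Whisper-derived semantic tokens. A round-trip sketch using the public `encodec` and `vocos` packages (the input file name is a placeholder):

```python
import torch
import torchaudio
from encodec import EncodecModel
from vocos import Vocos

# Acoustic stage: waveform -> discrete EnCodec tokens at 1.5 kbps.
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(1.5)

wav, sr = torchaudio.load("speech.wav")                      # mono input
wav = torchaudio.functional.resample(wav, sr, 24_000)[None]  # (1, 1, T)
with torch.no_grad():
    frames = encodec.encode(wav)
codes = torch.cat([c for c, _ in frames], dim=-1)[0]         # (n_q, T')

# Vocoder stage: the same tokens -> high-fidelity audio via Vocos.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([0]))  # 0 = 1.5 kbps
```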
Conference talks (deep dives)
Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of Speech – Jakub Cłapa, Collabora
Open‑Source TTS Projects: WhisperSpeech – In‑Depth Discussion
Made possible by:
- Collabora – code & training
- LAION – community & datasets
- Jülich Supercomputing Centre – JUWELS Booster
Additional compute funded by the Gauss Centre for Supercomputing via the John von Neumann Institute for Computing (NIC).
Special thanks to individual contributors:
- @inevitable-2031 (`qwerty_qwer` on Discord) for dataset curation
Need help with open‑source or proprietary AI projects?
Contact us via Collabora or DM us on Discord.
@article{SpearTTS,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
url = {https://arxiv.org/abs/2302.03540},
author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
publisher = {arXiv},
year = {2023},
}
@article{MusicGen,
title = {Simple and Controllable Music Generation},
url = {https://arxiv.org/abs/2306.05284},
author = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
publisher = {arXiv},
year = {2023},
}
@article{Whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
publisher = {arXiv},
year = {2022},
}
@article{EnCodec,
title = {High Fidelity Neural Audio Compression},
url = {https://arxiv.org/abs/2210.13438},
author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
publisher = {arXiv},
year = {2022},
}
@article{Vocos,
title = {Vocos: Closing the Gap Between Time-Domain and Fourier-Based Neural Vocoders for High-Quality Audio Synthesis},
url = {https://arxiv.org/abs/2306.00814},
author = {Hubert Siuzdak},
publisher = {arXiv},
year = {2023},
}