
WhisperSpeech

Test it out yourself in Colab
Join us in the #audio-generation channel on the LAION Discord to chat, ask questions, or contribute!

WhisperSpeech is an open-source, text-to-speech (TTS) system created by “inverting” OpenAI Whisper.
Our goal is to be for speech what Stable Diffusion is for images—powerful, hackable, and commercially safe.

  • All code is Apache-2.0 / MIT.
  • Models are trained only on properly licensed data.
  • Current release: English, trained on the LibriLight dataset. A multilingual release is coming next.
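
For local use, a minimal quick-start looks roughly like the sketch below. The class and method names follow the project's Colab notebooks, and the `s2a_ref` checkpoint string is one of the published English + Polish models; both may change between releases.

```python
# Minimal quick-start sketch (API names follow the project's Colab notebooks).
from whisperspeech.pipeline import Pipeline

# Load a published S2A checkpoint; model downloads happen on first use.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesise a short sentence straight to a WAV file.
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech!")
```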

Sample output →

whisperspeech-sample.mp4

🚀 Progress Updates

[2024-01-29] – Tiny S2A multilingual voice-cloning

We trained a tiny S2A model on an en + pl + fr dataset; it successfully clones French voices even though the semantic tokeniser was frozen after training on English + Polish only, which is evidence that a single tokeniser could cover all languages.

https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9
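
For reference, cross-lingual voice cloning with the released checkpoints looks roughly like the sketch below. The `lang` and `speaker` arguments mirror the voice-cloning Colab; treat the exact parameter names as assumptions rather than a stable API, and `reference.wav` is a placeholder for your own recording.

```python
# Hedged sketch: clone a voice from a short reference clip and speak French.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # default checkpoints are downloaded on first use
pipe.generate_to_file(
    "cloned_fr.wav",
    "Bonjour, ceci est un test de clonage de voix.",
    lang="fr",               # assumed parameter: target language
    speaker="reference.wav"  # assumed parameter: path/URL of the voice to clone
)
```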

[2024-01-18] – 12× real-time on a 4090 + voice-cloning demo
  • Added torch.compile, KV-caching, and layer tweaks → 12× faster than real time on a consumer RTX 4090 (see the sketch at the end of this update).
  • The model can seamlessly code-switch within a single sentence:

To jest pierwszy test wielojęzycznego Whisper Speech modelu … (English: This is the first test of the multilingual WhisperSpeech model …)

pl-en-mix.mp4
en-cloning.mp4

Test it on Colab (≤ 30 s install). Hugging Face Space coming soon.
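
The speed-up above comes mainly from compiling the per-token decoding step and caching keys/values between tokens instead of recomputing them for the whole prefix. The toy sketch below illustrates both ideas on a single attention layer; it is not WhisperSpeech code, just the general technique in plain PyTorch.

```python
# Toy illustration of KV-caching + torch.compile for autoregressive decoding.
import torch
import torch.nn.functional as F

D, MAX_T = 256, 512                       # model width, maximum sequence length
wq, wk, wv = (torch.randn(D, D) for _ in range(3))
k_cache = torch.zeros(MAX_T, D)           # keys of already generated tokens
v_cache = torch.zeros(MAX_T, D)           # values of already generated tokens

def decode_step(x, pos):
    """Attend the new token embedding `x` (shape [D]) over positions <= pos."""
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache[pos] = k                      # append to the cache instead of
    v_cache[pos] = v                      # recomputing K/V for the whole prefix
    att = F.softmax(q @ k_cache[: pos + 1].T / D ** 0.5, dim=-1)
    return att @ v_cache[: pos + 1]

decode_step = torch.compile(decode_step)  # let PyTorch optimise the hot loop

out = None
for t in range(16):                       # toy autoregressive loop
    out = decode_step(torch.randn(D), t)
print(out.shape)                          # torch.Size([256])
```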

[2024-01-10] – Faster SD S2A + first cloning example

A new SD-size S2A model brings major speed-ups without sacrificing quality, and a first voice-cloning example has been added.
Try it on Colab.

[2023-12-10] – Multilingual model trio (EN/PL)

Archive of older updates

📊 Community Benchmarks

Unofficial speed & memory‑usage results from the community can be found here.

📦 Downloads

🗺️ Roadmap

⚙️ Architecture

WhisperSpeech follows the two‑stage, token‑based pipeline popularised by
AudioLM, Google’s SPEAR TTS, and Meta’s MusicGen:

| Stage    | Model   | Purpose                         |
| -------- | ------- | ------------------------------- |
| Semantic | Whisper | Transcription ➜ semantic tokens |
| Acoustic | EnCodec | Tokenise waveform (1.5 kbps)    |
| Vocoder  | Vocos   | High-fidelity audio             |

EnCodec block diagram
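
To make the acoustic stage concrete, the stand-alone sketch below uses Meta's reference `encodec` package (`pip install encodec`) to turn a waveform into the 1.5 kbps token stream that the S2A model learns to predict. It is an illustration of the tokenisation step, not code from this repository, and `speech.wav` is a placeholder input file.

```python
# Illustration of acoustic tokenisation at 1.5 kbps with the reference EnCodec model.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)            # 1.5 kbps -> 2 codebooks per frame

wav, sr = torchaudio.load("speech.wav")    # placeholder input recording
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)             # list of (codes, scale) chunks
codes = torch.cat([c for c, _ in frames], dim=-1)
print(codes.shape)                         # (batch, n_codebooks, n_frames)

# EnCodec's own decoder (or, in WhisperSpeech, the Vocos vocoder) maps the
# tokens back to audio:
with torch.no_grad():
    audio = model.decode(frames)
```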

Conference talks (deep dives)

  • Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of Speech – Jakub Cłapa, Collabora
  • Open-Source TTS Projects: WhisperSpeech – In-Depth Discussion

🙏 Appreciation

Made possible by the generous support of Collabora and LAION.

Additional compute funded by the Gauss Centre for Supercomputing via the John von Neumann Institute for Computing (NIC).

Special thanks to individual contributors:

💼 Consulting

Need help with open‑source or proprietary AI projects?
Contact us via Collabora or DM on Discord:
 


📚 Citations

@article{SpearTTS,
  title     = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url       = {https://arxiv.org/abs/2302.03540},
  author    = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year      = {2023},
}
@article{MusicGen,
  title     = {Simple and Controllable Music Generation},
  url       = {https://arxiv.org/abs/2306.05284},
  author    = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
  publisher = {arXiv},
  year      = {2023},
}
@article{Whisper,
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year      = {2022},
}
@article{EnCodec,
  title     = {High Fidelity Neural Audio Compression},
  url       = {https://arxiv.org/abs/2210.13438},
  author    = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year      = {2022},
}
@article{Vocos,
  title     = {Vocos: Closing the Gap Between Time‑Domain and Fourier‑Based Neural Vocoders for High‑Quality Audio Synthesis},
  url       = {https://arxiv.org/abs/2306.00814},
  author    = {Hubert Siuzdak},
  publisher = {arXiv},
  year      = {2023},
}
