
WhisperSpeech

Test it out yourself in Colab
Join us in the #audio-generation channel on the LAION Discord to chat, ask questions, or contribute!

WhisperSpeech is an open-source, text-to-speech (TTS) system created by “inverting” OpenAI Whisper.
Our goal is to be for speech what Stable Diffusion is for images—powerful, hackable, and commercially safe.

  • All code is Apache-2.0 / MIT.
  • Models are trained only on properly licensed data.
  • Current release: English, trained on the LibriLight dataset. A multilingual release is coming next.
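
For local use, a minimal quick-start looks roughly like the sketch below. The class and method names follow the project's Colab notebooks, and the `s2a_ref` checkpoint string is one of the published English + Polish models; both may change between releases.

```python
# Minimal quick-start sketch (API names follow the project's Colab notebooks).
from whisperspeech.pipeline import Pipeline

# Load a published S2A checkpoint; model downloads happen on first use.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Synthesise a short sentence straight to a WAV file.
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech!")
```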

Sample output →

whisperspeech-sample.mp4

🚀 Progress Updates

[2024-01-29] – Tiny S2A multilingual voice-cloning

We trained a tiny S2A model on an en + pl + fr dataset; it successfully clones French voices even though the semantic tokeniser was frozen after training on English + Polish only, which is evidence that a single tokeniser could cover all languages.

https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9
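
For reference, cross-lingual voice cloning with the released checkpoints looks roughly like the sketch below. The `lang` and `speaker` arguments mirror the voice-cloning Colab; treat the exact parameter names as assumptions rather than a stable API, and `reference.wav` is a placeholder for your own recording.

```python
# Hedged sketch: clone a voice from a short reference clip and speak French.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # default checkpoints are downloaded on first use
pipe.generate_to_file(
    "cloned_fr.wav",
    "Bonjour, ceci est un test de clonage de voix.",
    lang="fr",               # assumed parameter: target language
    speaker="reference.wav"  # assumed parameter: path/URL of the voice to clone
)
```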

[2024-01-18] – 12× real-time on a 4090 + voice-cloning demo
  • Added torch.compile, KV-caching, and layer tweaks → 12× faster than real time on a consumer RTX 4090 (see the sketch at the end of this update).
  • The model can seamlessly code-switch within a single sentence:

To jest pierwszy test wielojęzycznego Whisper Speech modelu … (English: This is the first test of the multilingual WhisperSpeech model …)

pl-en-mix.mp4
en-cloning.mp4

Test it on Colab (≤ 30 s install). Hugging Face Space coming soon.
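
The speed-up above comes mainly from compiling the per-token decoding step and caching keys/values between tokens instead of recomputing them for the whole prefix. The toy sketch below illustrates both ideas on a single attention layer; it is not WhisperSpeech code, just the general technique in plain PyTorch.

```python
# Toy illustration of KV-caching + torch.compile for autoregressive decoding.
import torch
import torch.nn.functional as F

D, MAX_T = 256, 512                       # model width, maximum sequence length
wq, wk, wv = (torch.randn(D, D) for _ in range(3))
k_cache = torch.zeros(MAX_T, D)           # keys of already generated tokens
v_cache = torch.zeros(MAX_T, D)           # values of already generated tokens

def decode_step(x, pos):
    """Attend the new token embedding `x` (shape [D]) over positions <= pos."""
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache[pos] = k                      # append to the cache instead of
    v_cache[pos] = v                      # recomputing K/V for the whole prefix
    att = F.softmax(q @ k_cache[: pos + 1].T / D ** 0.5, dim=-1)
    return att @ v_cache[: pos + 1]

decode_step = torch.compile(decode_step)  # let PyTorch optimise the hot loop

out = None
for t in range(16):                       # toy autoregressive loop
    out = decode_step(torch.randn(D), t)
print(out.shape)                          # torch.Size([256])
```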

[2024-01-10] – Faster SD S2A + first cloning example

A new SD-size S2A model brings major speed-ups without sacrificing quality, and a first voice-cloning example has been added.
Try it on Colab.

[2023-12-10] – Multilingual model trio (EN/PL)

Archive of older updates

📊 Community Benchmarks

Unofficial speed & memory‑usage results from the community can be found here.

📦 Downloads

🗺️ Roadmap

⚙️ Architecture

WhisperSpeech follows the two‑stage, token‑based pipeline popularised by
AudioLM, Google’s SPEAR TTS, and Meta’s MusicGen:

| Stage    | Model   | Purpose                         |
| -------- | ------- | ------------------------------- |
| Semantic | Whisper | Transcription ➜ semantic tokens |
| Acoustic | EnCodec | Tokenise waveform (1.5 kbps)    |
| Vocoder  | Vocos   | High-fidelity audio             |

EnCodec block diagram
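
To make the acoustic stage concrete, the stand-alone sketch below uses Meta's reference `encodec` package (`pip install encodec`) to turn a waveform into the 1.5 kbps token stream that the S2A model learns to predict. It is an illustration of the tokenisation step, not code from this repository, and `speech.wav` is a placeholder input file.

```python
# Illustration of acoustic tokenisation at 1.5 kbps with the reference EnCodec model.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)            # 1.5 kbps -> 2 codebooks per frame

wav, sr = torchaudio.load("speech.wav")    # placeholder input recording
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)             # list of (codes, scale) chunks
codes = torch.cat([c for c, _ in frames], dim=-1)
print(codes.shape)                         # (batch, n_codebooks, n_frames)

# EnCodec's own decoder (or, in WhisperSpeech, the Vocos vocoder) maps the
# tokens back to audio:
with torch.no_grad():
    audio = model.decode(frames)
```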

Conference talks (deep dives)

  • Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of Speech – Jakub Cłapa, Collabora
  • Open-Source TTS Projects: WhisperSpeech – In-Depth Discussion

🙏 Appreciation

Made possible by the generous support of Collabora and LAION.

Additional compute funded by the Gauss Centre for Supercomputing via the John von Neumann Institute for Computing (NIC).

Special thanks to individual contributors:

💼 Consulting

Need help with open‑source or proprietary AI projects?
Contact us via Collabora or DM on Discord:
 


📚 Citations

@article{SpearTTS,
  title     = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url       = {https://arxiv.org/abs/2302.03540},
  author    = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year      = {2023},
}
@article{MusicGen,
  title     = {Simple and Controllable Music Generation},
  url       = {https://arxiv.org/abs/2306.05284},
  author    = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
  publisher = {arXiv},
  year      = {2023},
}
@article{Whisper,
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year      = {2022},
}
@article{EnCodec,
  title     = {High Fidelity Neural Audio Compression},
  url       = {https://arxiv.org/abs/2210.13438},
  author    = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year      = {2022},
}
@article{Vocos,
  title     = {Vocos: Closing the Gap Between Time‑Domain and Fourier‑Based Neural Vocoders for High‑Quality Audio Synthesis},
  url       = {https://arxiv.org/abs/2306.00814},
  author    = {Hubert Siuzdak},
  publisher = {arXiv},
  year      = {2023},
}
