
🔥🔥🔥 Kokoro Rust

ASMR

Demo video: video-1737110239209.webm (typo in video, ignore it)

Digital Human

Demo video: output2_added_subtitle.mp4

Give a star ⭐ if you like it!

Kokoro (https://huggingface.co/hexgrad/Kokoro-82M) is a trending top-2 TTS model on Hugging Face. This repo provides insanely fast Kokoro inference in Rust: you can build your own TTS engine powered by Kokoro and run fast inference with a single koko command.

kokoros is a Rust crate that provides easy-to-use TTS capabilities. You can call koko directly in the terminal to synthesize audio.

kokoros uses a relatively small model (82M parameters) yet produces voices of extremely good quality.

Language support:

  • English;
  • Chinese (partial);
  • Japanese (partial);
  • German (partial);

🔥🔥🔥🔥🔥🔥🔥🔥🔥 The Kokoros Rust version is getting a lot of attention. If you are also interested in insanely fast inference, embedded builds, WASM support, etc., please star this repo! We keep updating it.

New Discord community: https://discord.gg/E566zfDWqD. Please join us if you are interested in Rust Kokoro.

Updates

  • 2025.07.12: 🔥🔥🔥 HTTP API streaming and parallel processing infrastructure. The OpenAI-compatible server supports streaming audio generation with "stream": true, achieving 1-2s time-to-first-audio; work-in-progress parallel TTS processing via the --instances flag; an improved logging system with Unix timestamps; and natural-sounding voice generation through advanced chunking;
  • 2025.01.22: 🔥🔥🔥 CLI streaming mode supported. You can now use --stream to have fun with streaming mode, kudos to mroigo;
  • 2025.01.17: 🔥🔥🔥 Style mixing supported! Now, listen to the ASMR effect in the output by simply specifying a mixed style: af_sky.4+af_nicole.5;
  • 2025.01.15: OpenAI-compatible server supported; the OpenAI format is still being polished!
  • 2025.01.15: Phonemizer supported! Now Kokoros can run inference end-to-end without any other dependencies! Kudos to @tstm;
  • 2025.01.13: espeak-ng tokenizer and phonemizer supported! Kudos to @mindreframer;
  • 2025.01.12: Released Kokoros;

Installation

  1. Install required Python packages:
pip install -r scripts/requirements.txt
  2. Initialize voice data:
python scripts/fetch_voices.py

This step fetches the required voices.json data file, which is necessary for voice synthesis.

  3. Build the project:
cargo build --release

Usage

View available options

./target/release/koko -h

Generate speech for some text

./target/release/koko text "Hello, this is a TTS test"

The generated audio will be saved to tmp/output.wav by default. You can customize the save location with the --output or -o option:

./target/release/koko text "I hope you're having a great day today!" --output greeting.wav

Generate speech for each line in a file

./target/release/koko file poem.txt

For a file with 3 lines of text, by default the speech audio files tmp/output_0.wav, tmp/output_1.wav, and tmp/output_2.wav will be generated. You can customize the save location with the --output or -o option, using {line} as the line number:

./target/release/koko file lyrics.txt -o "song/lyric_{line}.wav"

Parallel Processing Configuration

Configure parallel TTS instances for the OpenAI-compatible server based on your performance preference:

# Lowest latency: 0.5-2 seconds time-to-first-audio
./target/release/koko openai --instances 1

# Balanced performance (default: 2 instances, usually the best throughput for CPU processing)
./target/release/koko openai

# Best total processing time (diminishing returns observed on CPU on a Mac M2)
./target/release/koko openai --instances 4

How do you determine the optimal number of instances for your system? Choose a configuration based on your use case:

  • Single instance for real-time applications requiring immediate audio response irrespective of system configuration.
  • Multiple instances for batch processing where total completion time matters more than initial latency.
    • This was benchmarked on a Mac M2 with 8 cores and 24GB RAM.
    • Tested with the message:

      Welcome to our comprehensive technology demonstration session. Today we will explore advanced parallel processing systems thoroughly. These systems utilize multiple computational instances simultaneously for efficiency. Each instance processes different segments concurrently without interference. The coordination between instances ensures seamless output delivery consistently. Modern algorithms optimize resource utilization effectively across all components. Performance improvements are measurable and significant in real scenarios. Quality assurance validates each processing stage thoroughly before deployment. Integration testing confirms system reliability consistently under various conditions. User experience remains smooth throughout operation regardless of complexity. Advanced monitoring tracks system performance metrics continuously during execution.

    • Benchmark results (avg of 5):
      Instances   TTFA    Total time
      1           1.44s   19.0s
      2           2.44s   16.1s
      4           4.98s   16.6s
    • On CPU, memory bandwidth is the usual bottleneck. You will have to experiment to find the sweet spot where the number of instances gives you optimal throughput on your system configuration.
    • On an NVIDIA GPU, you can try increasing the number of instances; throughput is expected to improve further.
    • Attempts to make this work on CoreML would likely start with converting the ONNX model to CoreML or ORT format.

Note: The --instances flag is currently supported in API server mode. CLI text commands will support parallel processing in future releases.
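To find that sweet spot empirically, you can time time-to-first-audio (TTFA) and total completion against a running server. Below is a minimal sketch, assuming Python with the requests package installed; the script name, payload text, and chunk size are illustrative:

# ttfa_bench.py - hypothetical benchmark helper, not part of this repo
import time

import requests

URL = "http://localhost:3000/v1/audio/speech"  # default server address
PAYLOAD = {
    "model": "tts-1",
    "input": "Welcome to our comprehensive technology demonstration session.",
    "voice": "af_sky",
    "stream": True,  # stream raw PCM so the first chunk can be timed
}

start = time.monotonic()
with requests.post(URL, json=PAYLOAD, stream=True) as resp:
    resp.raise_for_status()
    ttfa = None
    total_bytes = 0
    for chunk in resp.iter_content(chunk_size=4096):
        if ttfa is None:
            ttfa = time.monotonic() - start  # time-to-first-audio
        total_bytes += len(chunk)
total = time.monotonic() - start
print(f"TTFA: {ttfa:.2f}s, total: {total:.2f}s, bytes: {total_bytes}")

Run it once per --instances setting and compare the numbers against the benchmark table above.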

OpenAI-Compatible Server

  1. Start the server:
./target/release/koko openai
  2. Make API requests using either curl or Python:

Using curl:

# Standard audio generation
curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the Kokoro TTS system!",
    "voice": "af_sky"
  }' \
  --output sky-says-hello.wav

# Streaming audio generation (PCM format only)
curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "This is a streaming test with real-time audio generation.",
    "voice": "af_sky",
    "stream": true
  }' \
  --output streaming-audio.pcm

# Live streaming playback (requires ffplay)
curl -s -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello streaming world!",
    "voice": "af_sky",
    "stream": true
  }' | \
  ffplay -f s16le -ar 24000 -nodisp -autoexit -loglevel quiet -

Using Python:

python scripts/run_openai.py
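The bundled script wraps the same HTTP API shown in the curl examples. If you'd rather call it directly, here is a minimal sketch using the requests package; the file names are illustrative, and mono output is an assumption for the streamed PCM:

import wave

import requests

# Standard request: the server returns a complete WAV file.
resp = requests.post(
    "http://localhost:3000/v1/audio/speech",
    json={"model": "tts-1", "input": "Hello from Python!", "voice": "af_sky"},
)
resp.raise_for_status()
with open("hello.wav", "wb") as f:
    f.write(resp.content)

# Streaming request: raw PCM arrives in chunks (s16le at 24000 Hz, matching
# the ffplay flags above); wrap it in a WAV container with the standard-library
# wave module. Mono is assumed here.
with requests.post(
    "http://localhost:3000/v1/audio/speech",
    json={"model": "tts-1", "input": "Hello, streaming!", "voice": "af_sky", "stream": True},
    stream=True,
) as resp:
    resp.raise_for_status()
    pcm = b"".join(resp.iter_content(chunk_size=4096))

with wave.open("hello-stream.wav", "wb") as w:
    w.setnchannels(1)   # assumed mono
    w.setsampwidth(2)   # 16-bit samples (s16le)
    w.setframerate(24000)
    w.writeframes(pcm)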

Streaming

The stream option starts the program reading lines of input from stdin and writing WAV audio to stdout.

Use it in conjunction with piping.

Typing manually

./target/release/koko stream > live-audio.wav
# Start typing some text to generate speech for and hit enter to submit
# Speech will append to `live-audio.wav` as it is generated
# Hit Ctrl-D to exit

Input from another source

echo "Suppose some other program was outputting lines of text" | ./target/release/koko stream > programmatic-audio.wav

With docker

  1. Build the image
docker build -t kokoros .
  2. Run the image, passing options as described above
# Basic text to speech
docker run -v ./tmp:/app/tmp kokoros text "Hello from docker!" -o tmp/hello.wav

# An OpenAI server (with appropriately bound port)
docker run -p 3000:3000 kokoros openai

Roadmap

Since Kokoro's capabilities are not yet finalized, this repo will keep tracking the status of Kokoro, and hopefully we can have language support including English, Mandarin, Japanese, German, French, etc.

Copyright

Copyright reserved by Lucas Jin under the Apache License.
