Demo videos:

- ASMR: `video-1737110239209.webm` (typo in video, ignore it)
- Digital Human: `output2_added_subtitle.mp4`
Give a star ⭐ if you like it!
Kokoro is a trending, top-2 TTS model on Hugging Face. This repo provides insanely fast Kokoro inference in Rust: you can build your own TTS engine powered by Kokoro and run fast inference with a single `koko` command.
`kokoros` is a Rust crate that provides easy-to-use TTS. You can call `koko` directly in the terminal to synthesize audio.

`kokoros` uses a relatively small model (87M parameters) while producing extremely high-quality voices.
Language support:

- English
- Chinese (partial)
- Japanese (partial)
- German (partial)
🔥🔥🔥🔥🔥🔥🔥🔥🔥 The Rust version of Kokoros is getting a lot of attention. If you are also interested in insanely fast inference, embedded builds, WASM support, etc., please star this repo! We keep updating it.

New Discord community: https://discord.gg/E566zfDWqD. Please join us if you are interested in Rust Kokoro.
- **2025.07.12**: 🔥🔥🔥 HTTP API streaming and parallel-processing infrastructure. The OpenAI-compatible server supports streaming audio generation with `"stream": true`, achieving 1-2 s time-to-first-audio; work-in-progress parallel TTS processing behind the `--instances` flag; an improved logging system with Unix timestamps; and more natural-sounding voice generation through advanced chunking.
- **2025.01.22**: 🔥🔥🔥 CLI streaming mode supported. You can now use `--stream` to have fun with stream mode. Kudos to @mroigo.
- **2025.01.17**: 🔥🔥🔥 Style mixing supported! Listen to the output ASMR effect by simply specifying a style like `af_sky.4+af_nicole.5` (see the example right after this list).
- **2025.01.15**: OpenAI-compatible server supported; the OpenAI format is still being polished.
- **2025.01.15**: Phonemizer supported! `Kokoros` can now run inference end to end without any other dependencies. Kudos to @tstm.
- **2025.01.13**: espeak-ng tokenizer and phonemizer supported. Kudos to @mindreframer.
- **2025.01.12**: Released `Kokoros`.
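A minimal sketch of style mixing from the CLI. The `--style` flag name is an assumption here (it is not shown elsewhere in this README), so check `koko -h` for the real option name:

```bash
# Mix 0.4 of af_sky with 0.5 of af_nicole (--style flag name assumed; verify with koko -h)
./target/release/koko text "A soft, whispering ASMR voice" --style "af_sky.4+af_nicole.5" -o tmp/asmr.wav
```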
- Install required Python packages:

  ```bash
  pip install -r scripts/requirements.txt
  ```

- Initialize voice data:

  ```bash
  python scripts/fetch_voices.py
  ```

  This step fetches the required `voices.json` data file, which is necessary for voice synthesis.

- Build the project:

  ```bash
  cargo build --release
  ```

View help and run a quick test:

```bash
./target/release/koko -h
./target/release/koko text "Hello, this is a TTS test"
```
The generated audio will be saved to `tmp/output.wav` by default. You can customize the save location with the `--output` or `-o` option:

```bash
./target/release/koko text "I hope you're having a great day today!" --output greeting.wav
```
```bash
./target/release/koko file poem.txt
```

For a file with 3 lines of text, the speech audio files `tmp/output_0.wav`, `tmp/output_1.wav`, and `tmp/output_2.wav` are written by default. You can customize the save location with the `--output` or `-o` option, using `{line}` as a placeholder for the line number:

```bash
./target/release/koko file lyrics.txt -o "song/lyric_{line}.wav"
```
Configure parallel TTS instances for the OpenAI-compatible server based on your performance preference:

```bash
# Lowest latency: 0.5-2 s time-to-first-audio
./target/release/koko openai --instances 1

# Balanced performance (default: 2 instances, usually the best throughput for CPU processing)
./target/release/koko openai

# Best total processing time (diminishing returns on CPU observed on a Mac M2)
./target/release/koko openai --instances 4
```
How do you determine the optimal number of instances for your system? Choose based on your use case:

- A single instance for real-time applications that need immediate audio response, regardless of system configuration.
- Multiple instances for batch processing, where total completion time matters more than initial latency.

The numbers below were measured on a Mac M2 (8 cores, 24 GB RAM), averaged over 5 runs, with this test message:

> Welcome to our comprehensive technology demonstration session. Today we will explore advanced parallel processing systems thoroughly. These systems utilize multiple computational instances simultaneously for efficiency. Each instance processes different segments concurrently without interference. The coordination between instances ensures seamless output delivery consistently. Modern algorithms optimize resource utilization effectively across all components. Performance improvements are measurable and significant in real scenarios. Quality assurance validates each processing stage thoroughly before deployment. Integration testing confirms system reliability consistently under various conditions. User experience remains smooth throughout operation regardless of complexity. Advanced monitoring tracks system performance metrics continuously during execution.

| Instances | TTFA | Total time |
| --- | --- | --- |
| 1 | 1.44 s | 19.0 s |
| 2 | 2.44 s | 16.1 s |
| 4 | 4.98 s | 16.6 s |

- On a CPU, memory bandwidth is usually the bottleneck; experiment to find the number of instances that gives optimal throughput on your system (see the sketch after this list).
- On an NVIDIA GPU, you can try increasing the number of instances and should see further throughput gains.
- Getting this to work on CoreML would likely start with converting the ONNX model to CoreML or ORT format.
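One way to find that sweet spot empirically is to measure time-to-first-audio against a running server. A minimal sketch using curl's built-in timing variables, with the same endpoint and payload as the API examples below (time-to-first-byte is used here as a proxy for TTFA):

```bash
# Requires a running server, e.g. ./target/release/koko openai --instances 2
curl -s -o /dev/null \
  -w "time to first byte: %{time_starttransfer}s, total: %{time_total}s\n" \
  -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Benchmark sentence for timing.", "voice": "af_sky", "stream": true}'
```

Rerun this with the server started at different `--instances` values and compare the timings.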
Note: The `--instances` flag is currently supported only in API server mode. CLI text commands will support parallel processing in a future release.
- Start the server:

  ```bash
  ./target/release/koko openai
  ```

- Make API requests using either curl or Python.

Using curl:
```bash
# Standard audio generation
curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the Kokoro TTS system!",
    "voice": "af_sky"
  }' \
  --output sky-says-hello.wav

# Streaming audio generation (PCM format only)
curl -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "This is a streaming test with real-time audio generation.",
    "voice": "af_sky",
    "stream": true
  }' \
  --output streaming-audio.pcm

# Live streaming playback (requires ffplay)
curl -s -X POST http://localhost:3000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello streaming world!",
    "voice": "af_sky",
    "stream": true
  }' | \
  ffplay -f s16le -ar 24000 -nodisp -autoexit -loglevel quiet -
```
Using Python:

```bash
python scripts/run_openai.py
```
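If you prefer not to use the bundled script, a minimal sketch with the third-party `requests` package works against the same endpoint (this is not necessarily what `scripts/run_openai.py` does; the payload mirrors the curl example above, and the non-streaming response is assumed to be WAV bytes as in that example):

```python
import requests

# Non-streaming request; the server listens on http://localhost:3000 by default
response = requests.post(
    "http://localhost:3000/v1/audio/speech",
    json={
        "model": "tts-1",
        "input": "Hello, this is a test of the Kokoro TTS system!",
        "voice": "af_sky",
    },
    timeout=60,
)
response.raise_for_status()

# Save the returned audio bytes to a file
with open("python-hello.wav", "wb") as f:
    f.write(response.content)
```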
The `stream` option starts the program, reading lines of input from stdin and writing WAV audio to stdout. Use it in conjunction with piping:

```bash
# Start typing text to generate speech for, hitting Enter to submit each line.
# Speech is appended to `live-audio.wav` as it is generated. Hit Ctrl-D to exit.
./target/release/koko stream > live-audio.wav

# Or feed it lines from another program
echo "Suppose some other program was outputting lines of text" | ./target/release/koko stream > programmatic-audio.wav
```
- Build the image:

  ```bash
  docker build -t kokoros .
  ```

- Run the image, passing options as described above:

  ```bash
  # Basic text to speech
  docker run -v ./tmp:/app/tmp kokoros text "Hello from docker!" -o tmp/hello.wav

  # An OpenAI-compatible server (with the port bound appropriately)
  docker run -p 3000:3000 kokoros openai
  ```
Since Kokoro itself has not yet finalized its capabilities, this repo will keep tracking the status of Kokoro, and hopefully we can offer language support including English, Mandarin, Japanese, German, French, etc.
Copyright reserved by Lucas Jin under the Apache License.