A sophisticated real-time voice assistant that seamlessly integrates speech recognition, AI reasoning, and neural text-to-speech synthesis. Designed for natural conversational interactions with advanced tool-calling capabilities.

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                              VOCAL AGENT WORKFLOW                              │
└────────────────────────────────────────────────────────────────────────────────┘

                                 🎤 USER SPEAKS
                                        │
                                        ▼
  ┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
  │    AUDIO CAPTURE    │    │    VOICE ACTIVITY    │    │   SPEECH-TO-TEXT    │
  │                     │───▶│      DETECTION       │───▶│                     │
  │ • Microphone Input  │    │ • Silero VAD         │    │ • Whisper large-v1  │
  │ • 16kHz Sampling    │    │ • Real-time Monitor  │    │ • Language: English │
  │ • Continuous Stream │    │ • Start/Stop Detect  │    │ • CUDA Acceleration │
  └─────────────────────┘    └──────────────────────┘    └─────────────────────┘
                                        │
                                        ▼
                        📝 "What's the weather in Tokyo?"
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                              AI REASONING ENGINE                               │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │  LLAMA 3.1 8B   │    │    AGNO FRAMEWORK    │    │   TOOL SELECTION    │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • Via Ollama    │    │ • Agent Orchestration│    │ • Google Search     │  │
│   │ • Local LLM     │    │ • Context Management │    │ • Wikipedia         │  │
│   │ • 8B Parameters │    │ • Response Generation│    │ • ArXiv Papers      │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                          🔍 TOOL EXECUTION (if needed)
                                        │
           ┌────────────────────────────┼────────────────────────────┐
           │                            │                            │
           ▼                            ▼                            ▼
  ┌─────────────────┐          ┌─────────────────┐          ┌─────────────────┐
  │  GOOGLE SEARCH  │          │    WIKIPEDIA    │          │      ARXIV      │
  │                 │          │                 │          │                 │
  │ • Web Results   │          │ • Encyclopedia  │          │ • Research      │
  │ • Real-time     │          │ • Facts & Info  │          │ • Papers        │
  │ • Current Data  │          │ • Historical    │          │ • Academic      │
  └─────────────────┘          └─────────────────┘          └─────────────────┘
           │                            │                            │
           └────────────────────────────┼────────────────────────────┘
                                        │
                                        ▼
                           📊 AGGREGATED INFORMATION
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                              RESPONSE GENERATION                               │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │  TEXT RESPONSE  │    │   TEXT PROCESSING    │    │     PHONEME GEN     │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • Natural Lang  │    │ • G2P Conversion     │    │ • Misaki Engine     │  │
│   │ • Conversational│    │ • eSpeak Fallback    │    │ • English Phonemes  │  │
│   │ • 1-2 Sentences │    │ • British=False      │    │ • Max Length: 500   │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                             NEURAL VOICE SYNTHESIS                             │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │   KOKORO-82M    │    │    VOICE PROFILES    │    │    AUDIO OUTPUT     │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • ONNX Model    │    │ • af_heart (warm)    │    │ • 16kHz Audio       │  │
│   │ • 82M Params    │    │ • af_sky (clear)     │    │ • Natural Speech    │  │
│   │ • High Quality  │    │ • af_bella (dynamic) │    │ • Speed: 1.2x       │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                                🔊 SPEAKER OUTPUT
                                        │
                                        ▼
                             👂 USER HEARS RESPONSE

┌────────────────────────────────────────────────────────────────────────────────┐
│                              PERFORMANCE METRICS                               │
│                                                                                │
│  Speech Recognition: ~200-500ms │ LLM Processing: ~1-3s │ TTS: ~100-300ms     │
│  Total Latency: ~1.3-3.8s │ Memory Usage: ~4-6GB │ Concurrent: 2x             │
└────────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────────┐
│                                  KEY FEATURES                                  │
│                                                                                │
│  🎙️ Continuous Listening │ 🧠 Smart Tool Selection │ 🗣️ Natural Voice Output   │
│  ⚡ Real-time Processing │ 🌐 Web-Connected Intel  │ 🔧 Extensible Architecture │
│  🎯 Voice Activity Detect │ 📚 Multi-source Search │ ⚙️ Configurable Settings  │
└────────────────────────────────────────────────────────────────────────────────┘

- 🎙️ Real-time Speech Processing: Advanced speech recognition using Whisper large-v1 with Silero VAD for accurate voice activity detection (a minimal sketch of this stage follows the list)
- 🧠 Intelligent Reasoning: Powered by Llama 3.1 8B through the Agno agent framework for sophisticated AI responses
- 🌐 Web-Connected Intelligence: Integrated web search capabilities (Google Search, Wikipedia, ArXiv) for up-to-date information
- 🗣️ Natural Voice Synthesis: High-quality speech generation using Kokoro-82M ONNX for human-like voice output
- ⚡ Low-Latency Pipeline: Optimized audio processing for real-time conversational experience
- 🔧 Extensible Architecture: Modular tool system allowing easy integration of new capabilities
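
For a concrete picture of the listening stage, here is a minimal sketch built on RealtimeSTT (credited under Acknowledgments below), which wraps Whisper and Silero VAD in a single loop. The constructor arguments shown mirror the pipeline above; the exact configuration used by `main.py` may differ.

```python
# Minimal listen-and-transcribe loop; a sketch, assuming RealtimeSTT defaults.
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(model="large-v1", language="en")

while True:
    # Blocks until Silero VAD detects the end of an utterance,
    # then returns the Whisper transcription.
    utterance = recorder.text()
    print(f"📝 Transcribed: {utterance}")
```
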
| Component | Technology | Purpose |
|---|---|---|
| Speech Recognition | Whisper (large-v1) + Silero VAD | Convert speech to text with voice activity detection |
| Language Model | Llama 3.1 8B via Ollama | Natural language understanding and generation |
| Text-to-Speech | Kokoro-82M ONNX | Convert text responses to natural speech |
| Agent Framework | Agno LLM Agent | Tool orchestration and reasoning capabilities |
| Web Integration | Custom API connectors | Real-time information retrieval |
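
To make the "via Ollama" hop in the table concrete, the sketch below queries the local Llama 3.1 8B model directly over Ollama's standard REST API (default port 11434). In Vocal-Agent this call is mediated by the Agno agent rather than made by hand; this is illustrative only.

```python
# Direct query to the local Ollama server.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Answer in 1-2 sentences: what is voice activity detection?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```
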
- Python: Version 3.9 or higher
- Ollama: Local LLM server (Installation Guide)
- System Audio: Microphone and speakers/headphones
- Operating System: macOS, Linux, or Windows
macOS:

```bash
# Download from https://ollama.com/download/mac
# Or install via Homebrew
brew install ollama
```

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

```bash
# Download the installer from https://ollama.com/download/windows
```
```bash
git clone https://github.com/danieladdisonorg/Vocal-Agent.git
cd Vocal-Agent

# Install Python dependencies
pip3 install -r requirements.txt
pip3 install --no-deps kokoro-onnx==0.4.7
```
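
Because `kokoro-onnx` is installed with `--no-deps`, a quick import check (sketch below) confirms that its runtime dependencies were satisfied by `requirements.txt`:

```python
# Post-install sanity check: fails with ImportError if a dependency is missing.
from kokoro_onnx import Kokoro

print("kokoro-onnx import OK")
```
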
Linux:

```bash
sudo apt-get install espeak-ng
```

macOS:

```bash
brew install espeak-ng
```

Windows:

- Download eSpeak NG from the releases page
- Install the `.msi` package (e.g., `espeak-ng-20191129-b702b03-x64.msi`)
Language Model:

```bash
ollama pull llama3.1:8b
```

Voice Models: Download the following files and place them in the project root directory (a scripted download sketch follows the list):

- `kokoro-v1.0.onnx` (the Kokoro-82M TTS model)
- `voices-v1.0.bin` (the voice profile database)
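
If you prefer to script the download, the sketch below fetches both files. The URLs are an assumption based on the kokoro-onnx release page and should be verified before use.

```python
# Hedged download helper; verify the release URLs before relying on them.
import urllib.request

MODEL_FILES = {
    "kokoro-v1.0.onnx": "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx",
    "voices-v1.0.bin": "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin",
}

for filename, url in MODEL_FILES.items():
    print(f"Downloading {filename} ...")
    urllib.request.urlretrieve(url, filename)
```
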
- Start the Ollama service:

  ```bash
  ollama serve
  ```

- Initialize the model (in a separate terminal):

  ```bash
  ollama run llama3.1:8b
  ```

- Launch Vocal Agent:

  ```bash
  python3 main.py
  ```
```
🎤 Listening... Press Ctrl+C to exit
🔴 Speak now - Recording started
⏹️ Recording stopped
📝 Transcribed: "What's the weather like in Tokyo today?"
🔧 LLM Tool calls...
🤖 Response: "Let me check the current weather in Tokyo for you..."
🔊 [Audio response plays]
```
Customize the application behavior by modifying settings in `main.py`:

```python
# Audio processing configuration
SAMPLE_RATE = 16000         # Audio sample rate (Hz)
MAX_PHONEME_LENGTH = 500    # Maximum phoneme sequence length

# Voice synthesis settings
SPEED = 1.2                 # Speech rate multiplier
VOICE_PROFILE = "af_heart"  # Voice character selection

# Performance settings
MAX_THREADS = 2             # Parallel processing threads
```
- `af_heart` - Warm, friendly tone
- `af_sky` - Clear, professional tone
- `af_bella` - Expressive, dynamic tone
- Additional profiles available in `voices-v1.0.bin` (see the preview sketch below)
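
To audition a profile before changing `VOICE_PROFILE`, a short preview script can synthesize a test sentence. This is a sketch assuming the kokoro-onnx 0.4.x API, where `Kokoro.create` returns audio samples plus a sample rate.

```python
# Voice profile preview using the kokoro-onnx package.
import sounddevice as sd
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello! This is a voice profile preview.",
    voice="af_heart",  # swap in af_sky or af_bella to compare
    speed=1.2,
)
sd.play(samples, sample_rate)
sd.wait()
```
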
```
Vocal-Agent/
├── main.py              # Core application entry point
├── agent_client.py      # LLM agent integration layer
├── kokoro-v1.0.onnx     # Neural TTS model
├── voices-v1.0.bin      # Voice profile database
├── requirements.txt     # Python dependencies
├── vocal_agent_mac.sh   # macOS setup automation script
├── demo.png             # Application demonstration
├── LICENSE              # MIT license
└── README.md            # Project documentation
```
Add new tools to the agent by integrating Agno Toolkits:

```python
from agno import Agent
from agno.tools import WebSearchTool, WikipediaSearchTool

# Add custom tools
agent = Agent(
    tools=[WebSearchTool(), WikipediaSearchTool(), YourCustomTool()],
    model="llama3.1:8b",
)
```
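
As a sketch of what `YourCustomTool` could look like: Agno toolkits are classes whose registered methods become callable tools. The `Toolkit` base class and `register()` pattern below follow Agno's documented API, but names and import paths may vary between Agno versions, so treat this as a hypothetical example rather than verified code.

```python
# Hypothetical custom toolkit; the weather data here is a stub.
from agno.tools import Toolkit

class WeatherToolkit(Toolkit):
    """Exposes a single weather lookup to the agent."""

    def __init__(self):
        super().__init__(name="weather_toolkit")
        self.register(self.get_weather)

    def get_weather(self, city: str) -> str:
        """Return a forecast for `city` (replace with a real API call)."""
        return f"Weather in {city}: sunny, 22°C (stub data)"
```
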
- GPU Acceleration: Enable CUDA for faster model inference (a quick GPU check follows this list)
- Model Selection: Choose smaller models for faster response times
- Audio Buffer Tuning: Adjust buffer sizes for your hardware
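
Before enabling CUDA (first tip above), it is worth confirming that a GPU is actually visible. The check below assumes the Whisper backend runs on PyTorch.

```python
# GPU visibility check for CUDA acceleration.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```
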
Ollama Connection Error:

```bash
# Ensure Ollama is running
ollama serve

# Verify model is available
ollama list
```
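
For a scriptable version of the same check, Ollama's `/api/tags` endpoint lists the models the local server can currently serve:

```python
# Health check against the local Ollama server (default port 11434).
import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=2)
    models = [m["name"] for m in r.json().get("models", [])]
    print("Ollama is up. Models:", models)
except requests.exceptions.ConnectionError:
    print("Ollama is not running - start it with `ollama serve`")
```
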
Audio Device Issues:

- Check microphone permissions
- Verify audio device selection in system settings
- Test with `python3 -c "import sounddevice; print(sounddevice.query_devices())"` (a device-selection sketch follows this list)
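
If the wrong device is being picked up, `sounddevice` lets you pin an explicit input by index. This is a sketch: the index `1` is only an example taken from `query_devices()` output, and `None` in the pair keeps the default for that slot.

```python
# Pin a specific microphone; indices come from sounddevice.query_devices().
import sounddevice as sd

print(sd.query_devices())      # list all devices with their indices
sd.default.device = (1, None)  # (input_index, output_index); None = keep default
```
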
Model Download Failures:
- Ensure stable internet connection
- Verify sufficient disk space (models require ~8GB)
- Check Ollama service status
- Speech Recognition Latency: ~200-500ms
- LLM Response Time: ~1-3 seconds (depending on query complexity)
- Text-to-Speech Generation: ~100-300ms
- Memory Usage: ~4-6GB (with Llama 3.1 8B)
We welcome contributions! Please see our contribution guidelines for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
- RealtimeSTT - Speech-to-text with VAD integration
- Kokoro-ONNX - Efficient neural text-to-speech
- Agno - Powerful agent framework
- Ollama - Local LLM serving platform
- Weebo - Project inspiration
- Documentation: Project Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions