A sophisticated real-time voice assistant that seamlessly integrates speech recognition, AI reasoning, and neural text-to-speech synthesis. Designed for natural conversational interactions with advanced tool-calling capabilities.

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                              VOCAL AGENT WORKFLOW                              │
└────────────────────────────────────────────────────────────────────────────────┘

                                 🎤 USER SPEAKS
                                        │
                                        ▼
  ┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
  │    AUDIO CAPTURE    │    │    VOICE ACTIVITY    │    │   SPEECH-TO-TEXT    │
  │                     │───▶│      DETECTION       │───▶│                     │
  │ • Microphone Input  │    │ • Silero VAD         │    │ • Whisper large-v1  │
  │ • 16kHz Sampling    │    │ • Real-time Monitor  │    │ • Language: English │
  │ • Continuous Stream │    │ • Start/Stop Detect  │    │ • CUDA Acceleration │
  └─────────────────────┘    └──────────────────────┘    └─────────────────────┘
                                        │
                                        ▼
                        📝 "What's the weather in Tokyo?"
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                              AI REASONING ENGINE                               │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │  LLAMA 3.1 8B   │    │    AGNO FRAMEWORK    │    │   TOOL SELECTION    │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • Via Ollama    │    │ • Agent Orchestration│    │ • Google Search     │  │
│   │ • Local LLM     │    │ • Context Management │    │ • Wikipedia         │  │
│   │ • 8B Parameters │    │ • Response Generation│    │ • ArXiv Papers      │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                          🔍 TOOL EXECUTION (if needed)
                                        │
           ┌────────────────────────────┼────────────────────────────┐
           │                            │                            │
           ▼                            ▼                            ▼
  ┌─────────────────┐          ┌─────────────────┐          ┌─────────────────┐
  │  GOOGLE SEARCH  │          │    WIKIPEDIA    │          │      ARXIV      │
  │                 │          │                 │          │                 │
  │ • Web Results   │          │ • Encyclopedia  │          │ • Research      │
  │ • Real-time     │          │ • Facts & Info  │          │ • Papers        │
  │ • Current Data  │          │ • Historical    │          │ • Academic      │
  └─────────────────┘          └─────────────────┘          └─────────────────┘
           │                            │                            │
           └────────────────────────────┼────────────────────────────┘
                                        │
                                        ▼
                           📊 AGGREGATED INFORMATION
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                              RESPONSE GENERATION                               │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │  TEXT RESPONSE  │    │   TEXT PROCESSING    │    │     PHONEME GEN     │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • Natural Lang  │    │ • G2P Conversion     │    │ • Misaki Engine     │  │
│   │ • Conversational│    │ • eSpeak Fallback    │    │ • English Phonemes  │  │
│   │ • 1-2 Sentences │    │ • British=False      │    │ • Max Length: 500   │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                             NEURAL VOICE SYNTHESIS                             │
│                                                                                │
│   ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐  │
│   │   KOKORO-82M    │    │    VOICE PROFILES    │    │    AUDIO OUTPUT     │  │
│   │                 │───▶│                      │───▶│                     │  │
│   │ • ONNX Model    │    │ • af_heart (warm)    │    │ • 16kHz Audio       │  │
│   │ • 82M Params    │    │ • af_sky (clear)     │    │ • Natural Speech    │  │
│   │ • High Quality  │    │ • af_bella (dynamic) │    │ • Speed: 1.2x       │  │
│   └─────────────────┘    └──────────────────────┘    └─────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                                🔊 SPEAKER OUTPUT
                                        │
                                        ▼
                             👂 USER HEARS RESPONSE

┌────────────────────────────────────────────────────────────────────────────────┐
│                              PERFORMANCE METRICS                               │
│                                                                                │
│  Speech Recognition: ~200-500ms │ LLM Processing: ~1-3s │ TTS: ~100-300ms     │
│  Total Latency: ~1.3-3.8s │ Memory Usage: ~4-6GB │ Concurrent: 2x             │
└────────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────────┐
│                                  KEY FEATURES                                  │
│                                                                                │
│  🎙️ Continuous Listening │ 🧠 Smart Tool Selection │ 🗣️ Natural Voice Output   │
│  ⚡ Real-time Processing │ 🌐 Web-Connected Intel  │ 🔧 Extensible Architecture │
│  🎯 Voice Activity Detect │ 📚 Multi-source Search │ ⚙️ Configurable Settings  │
└────────────────────────────────────────────────────────────────────────────────┘

- 🎙️ Real-time Speech Processing: Advanced speech recognition using Whisper large-v1 with Silero VAD for accurate voice activity detection (a minimal sketch of this stage follows the list)
- 🧠 Intelligent Reasoning: Powered by Llama 3.1 8B through the Agno agent framework for sophisticated AI responses
- 🌐 Web-Connected Intelligence: Integrated web search capabilities (Google Search, Wikipedia, ArXiv) for up-to-date information
- 🗣️ Natural Voice Synthesis: High-quality speech generation using Kokoro-82M ONNX for human-like voice output
- ⚡ Low-Latency Pipeline: Optimized audio processing for real-time conversational experience
- 🔧 Extensible Architecture: Modular tool system allowing easy integration of new capabilities
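
For a concrete picture of the listening stage, here is a minimal sketch built on RealtimeSTT (credited under Acknowledgments below), which wraps Whisper and Silero VAD in a single loop. The constructor arguments shown mirror the pipeline above; the exact configuration used by `main.py` may differ.

```python
# Minimal listen-and-transcribe loop; a sketch, assuming RealtimeSTT defaults.
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(model="large-v1", language="en")

while True:
    # Blocks until Silero VAD detects the end of an utterance,
    # then returns the Whisper transcription.
    utterance = recorder.text()
    print(f"📝 Transcribed: {utterance}")
```
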
| Component | Technology | Purpose |
|---|---|---|
| Speech Recognition | Whisper (large-v1) + Silero VAD | Convert speech to text with voice activity detection |
| Language Model | Llama 3.1 8B via Ollama | Natural language understanding and generation |
| Text-to-Speech | Kokoro-82M ONNX | Convert text responses to natural speech |
| Agent Framework | Agno LLM Agent | Tool orchestration and reasoning capabilities |
| Web Integration | Custom API connectors | Real-time information retrieval |
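
To make the "via Ollama" hop in the table concrete, the sketch below queries the local Llama 3.1 8B model directly over Ollama's standard REST API (default port 11434). In Vocal-Agent this call is mediated by the Agno agent rather than made by hand; this is illustrative only.

```python
# Direct query to the local Ollama server.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Answer in 1-2 sentences: what is voice activity detection?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```
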
- Python: Version 3.9 or higher
- Ollama: Local LLM server (Installation Guide)
- System Audio: Microphone and speakers/headphones
- Operating System: macOS, Linux, or Windows
macOS:

```bash
# Download from https://ollama.com/download/mac
# Or install via Homebrew
brew install ollama
```

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

```bash
# Download the installer from https://ollama.com/download/windows
```
```bash
git clone https://github.com/danieladdisonorg/Vocal-Agent.git
cd Vocal-Agent

# Install Python dependencies
pip3 install -r requirements.txt
pip3 install --no-deps kokoro-onnx==0.4.7
```
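
Because `kokoro-onnx` is installed with `--no-deps`, a quick import check (sketch below) confirms that its runtime dependencies were satisfied by `requirements.txt`:

```python
# Post-install sanity check: fails with ImportError if a dependency is missing.
from kokoro_onnx import Kokoro

print("kokoro-onnx import OK")
```
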
Linux:

```bash
sudo apt-get install espeak-ng
```

macOS:

```bash
brew install espeak-ng
```

Windows:

- Download eSpeak NG from the releases page
- Install the `.msi` package (e.g., `espeak-ng-20191129-b702b03-x64.msi`)
Language Model:

```bash
ollama pull llama3.1:8b
```

Voice Models: Download the following files and place them in the project root directory (a scripted download sketch follows the list):

- `kokoro-v1.0.onnx` (the Kokoro-82M TTS model)
- `voices-v1.0.bin` (the voice profile database)
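
If you prefer to script the download, the sketch below fetches both files. The URLs are an assumption based on the kokoro-onnx release page and should be verified before use.

```python
# Hedged download helper; verify the release URLs before relying on them.
import urllib.request

MODEL_FILES = {
    "kokoro-v1.0.onnx": "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx",
    "voices-v1.0.bin": "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin",
}

for filename, url in MODEL_FILES.items():
    print(f"Downloading {filename} ...")
    urllib.request.urlretrieve(url, filename)
```
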
- Start the Ollama service:

  ```bash
  ollama serve
  ```

- Initialize the model (in a separate terminal):

  ```bash
  ollama run llama3.1:8b
  ```

- Launch Vocal Agent:

  ```bash
  python3 main.py
  ```
```
🎤 Listening... Press Ctrl+C to exit
🔴 Speak now - Recording started
⏹️ Recording stopped
📝 Transcribed: "What's the weather like in Tokyo today?"
🔧 LLM Tool calls...
🤖 Response: "Let me check the current weather in Tokyo for you..."
🔊 [Audio response plays]
```
Customize the application behavior by modifying settings in `main.py`:

```python
# Audio processing configuration
SAMPLE_RATE = 16000         # Audio sample rate (Hz)
MAX_PHONEME_LENGTH = 500    # Maximum phoneme sequence length

# Voice synthesis settings
SPEED = 1.2                 # Speech rate multiplier
VOICE_PROFILE = "af_heart"  # Voice character selection

# Performance settings
MAX_THREADS = 2             # Parallel processing threads
```
- `af_heart` - Warm, friendly tone
- `af_sky` - Clear, professional tone
- `af_bella` - Expressive, dynamic tone
- Additional profiles available in `voices-v1.0.bin` (see the preview sketch below)
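
To audition a profile before changing `VOICE_PROFILE`, a short preview script can synthesize a test sentence. This is a sketch assuming the kokoro-onnx 0.4.x API, where `Kokoro.create` returns audio samples plus a sample rate.

```python
# Voice profile preview using the kokoro-onnx package.
import sounddevice as sd
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello! This is a voice profile preview.",
    voice="af_heart",  # swap in af_sky or af_bella to compare
    speed=1.2,
)
sd.play(samples, sample_rate)
sd.wait()
```
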
```
Vocal-Agent/
├── main.py              # Core application entry point
├── agent_client.py      # LLM agent integration layer
├── kokoro-v1.0.onnx     # Neural TTS model
├── voices-v1.0.bin      # Voice profile database
├── requirements.txt     # Python dependencies
├── vocal_agent_mac.sh   # macOS setup automation script
├── demo.png             # Application demonstration
├── LICENSE              # MIT license
└── README.md            # Project documentation
```
Add new tools to the agent by integrating Agno Toolkits:

```python
from agno import Agent
from agno.tools import WebSearchTool, WikipediaSearchTool

# Add custom tools
agent = Agent(
    tools=[WebSearchTool(), WikipediaSearchTool(), YourCustomTool()],
    model="llama3.1:8b",
)
```
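
As a sketch of what `YourCustomTool` could look like: Agno toolkits are classes whose registered methods become callable tools. The `Toolkit` base class and `register()` pattern below follow Agno's documented API, but names and import paths may vary between Agno versions, so treat this as a hypothetical example rather than verified code.

```python
# Hypothetical custom toolkit; the weather data here is a stub.
from agno.tools import Toolkit

class WeatherToolkit(Toolkit):
    """Exposes a single weather lookup to the agent."""

    def __init__(self):
        super().__init__(name="weather_toolkit")
        self.register(self.get_weather)

    def get_weather(self, city: str) -> str:
        """Return a forecast for `city` (replace with a real API call)."""
        return f"Weather in {city}: sunny, 22°C (stub data)"
```
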
- GPU Acceleration: Enable CUDA for faster model inference (a quick GPU check follows this list)
- Model Selection: Choose smaller models for faster response times
- Audio Buffer Tuning: Adjust buffer sizes for your hardware
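
Before enabling CUDA (first tip above), it is worth confirming that a GPU is actually visible. The check below assumes the Whisper backend runs on PyTorch.

```python
# GPU visibility check for CUDA acceleration.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```
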
Ollama Connection Error:

```bash
# Ensure Ollama is running
ollama serve

# Verify model is available
ollama list
```
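
For a scriptable version of the same check, Ollama's `/api/tags` endpoint lists the models the local server can currently serve:

```python
# Health check against the local Ollama server (default port 11434).
import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=2)
    models = [m["name"] for m in r.json().get("models", [])]
    print("Ollama is up. Models:", models)
except requests.exceptions.ConnectionError:
    print("Ollama is not running - start it with `ollama serve`")
```
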
Audio Device Issues:

- Check microphone permissions
- Verify audio device selection in system settings
- Test with `python3 -c "import sounddevice; print(sounddevice.query_devices())"` (a device-selection sketch follows this list)
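
If the wrong device is being picked up, `sounddevice` lets you pin an explicit input by index. This is a sketch: the index `1` is only an example taken from `query_devices()` output, and `None` in the pair keeps the default for that slot.

```python
# Pin a specific microphone; indices come from sounddevice.query_devices().
import sounddevice as sd

print(sd.query_devices())      # list all devices with their indices
sd.default.device = (1, None)  # (input_index, output_index); None = keep default
```
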
Model Download Failures:
- Ensure stable internet connection
- Verify sufficient disk space (models require ~8GB)
- Check Ollama service status
- Speech Recognition Latency: ~200-500ms
- LLM Response Time: ~1-3 seconds (depending on query complexity)
- Text-to-Speech Generation: ~100-300ms
- Memory Usage: ~4-6GB (with Llama 3.1 8B)
We welcome contributions! Please see our contribution guidelines for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
- RealtimeSTT - Speech-to-text with VAD integration
- Kokoro-ONNX - Efficient neural text-to-speech
- Agno - Powerful agent framework
- Ollama - Local LLM serving platform
- Weebo - Project inspiration
- Documentation: Project Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions