Vocal Agent - Real-Time Speech-to-Speech AI Assistant 🤖

License: MIT Python 3.9+ Ollama

A sophisticated real-time voice assistant that seamlessly integrates speech recognition, AI reasoning, and neural text-to-speech synthesis. Designed for natural conversational interactions with advanced tool-calling capabilities.

🔄 How Vocal Agent Works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                           VOCAL AGENT WORKFLOW                                  │
└─────────────────────────────────────────────────────────────────────────────────┘

    🎤 USER SPEAKS
         │
         ▼
┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
│   AUDIO CAPTURE     │    │   VOICE ACTIVITY     │    │  SPEECH-TO-TEXT     │
│                     │───▶│     DETECTION        │───▶│                     │
│ • Microphone Input  │    │ • Silero VAD         │    │ • Whisper large-v1  │
│ • 16kHz Sampling    │    │ • Real-time Monitor  │    │ • Language: English │
│ • Continuous Stream │    │ • Start/Stop Detect  │    │ • CUDA Acceleration │
└─────────────────────┘    └──────────────────────┘    └─────────────────────┘
                                                                 │
                                                                 ▼
                                                    📝 "What's the weather in Tokyo?"
                                                                 │
                                                                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              AI REASONING ENGINE                                │
│                                                                                 │
│  ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐   │
│  │   LLAMA 3.1 8B  │    │    AGNO FRAMEWORK    │    │   TOOL SELECTION    │   │
│  │                 │───▶│                      │───▶│                     │   │
│  │ • Via Ollama    │    │ • Agent Orchestration│    │ • Google Search     │   │
│  │ • Local LLM     │    │ • Context Management │    │ • Wikipedia         │   │
│  │ • 8B Parameters │    │ • Response Generation│    │ • ArXiv Papers      │   │
│  └─────────────────┘    └──────────────────────┘    └─────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────────┘
                                                                 │
                                                                 ▼
                                              🔍 TOOL EXECUTION (if needed)
                                                                 │
                                    ┌────────────────────────────┼────────────────────────────┐
                                    │                            │                            │
                                    ▼                            ▼                            ▼
                          ┌─────────────────┐        ┌─────────────────┐        ┌─────────────────┐
                          │ GOOGLE SEARCH   │        │   WIKIPEDIA     │        │      ARXIV      │
                          │                 │        │                 │        │                 │
                          │ • Web Results   │        │ • Encyclopedia  │        │ • Research      │
                          │ • Real-time     │        │ • Facts & Info  │        │ • Papers        │
                          │ • Current Data  │        │ • Historical    │        │ • Academic      │
                          └─────────────────┘        └─────────────────┘        └─────────────────┘
                                    │                            │                            │
                                    └────────────────────────────┼────────────────────────────┘
                                                                 │
                                                                 ▼
                                                    📊 AGGREGATED INFORMATION
                                                                 │
                                                                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                           RESPONSE GENERATION                                   │
│                                                                                 │
│  ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐   │
│  │  TEXT RESPONSE  │    │   TEXT PROCESSING    │    │    PHONEME GEN      │   │
│  │                 │───▶│                      │───▶│                     │   │
│  │ • Natural Lang  │    │ • G2P Conversion     │    │ • Misaki Engine     │   │
│  │ • Conversational│    │ • eSpeak Fallback    │    │ • English Phonemes  │   │
│  │ • 1-2 Sentences │    │ • British=False      │    │ • Max Length: 500   │   │
│  └─────────────────┘    └──────────────────────┘    └─────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────────┘
                                                                 │
                                                                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         NEURAL VOICE SYNTHESIS                                 │
│                                                                                 │
│  ┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐   │
│  │  KOKORO-82M     │    │   VOICE PROFILES     │    │   AUDIO OUTPUT      │   │
│  │                 │───▶│                      │───▶│                     │   │
│  │ • ONNX Model    │    │ • af_heart (warm)    │    │ • 16kHz Audio       │   │
│  │ • 82M Params    │    │ • af_sky (clear)     │    │ • Natural Speech    │   │
│  │ • High Quality  │    │ • af_bella (dynamic) │    │ • Speed: 1.2x       │   │
│  └─────────────────┘    └──────────────────────┘    └─────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────────┘
                                                                 │
                                                                 ▼
                                                    🔊 SPEAKER OUTPUT
                                                                 │
                                                                 ▼
                                                      👂 USER HEARS RESPONSE

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              PERFORMANCE METRICS                               │
│                                                                                 │
│  Speech Recognition: ~200-500ms  │  LLM Processing: ~1-3s  │  TTS: ~100-300ms  │
│  Total Latency: ~1.3-3.8s        │  Memory Usage: ~4-6GB   │  Concurrent: 2x   │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                KEY FEATURES                                    │
│                                                                                 │
│ 🎙️ Continuous Listening  │ 🧠 Smart Tool Selection │ 🗣️ Natural Voice Output   │
│ ⚡ Real-time Processing  │ 🌐 Web-Connected Intel  │ 🔧 Extensible Architecture │
│ 🎯 Voice Activity Detect │ 📚 Multi-source Search  │ ⚙️ Configurable Settings   │
└─────────────────────────────────────────────────────────────────────────────────┘
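
Reduced to a minimal Python sketch, the loop looks roughly like this (assuming the RealtimeSTT, kokoro-onnx, and Agno APIs this project builds on; the real main.py adds threading, phoneme-length limits, and error recovery):

import sounddevice as sd
from RealtimeSTT import AudioToTextRecorder
from kokoro_onnx import Kokoro
from agno.agent import Agent
from agno.models.ollama import Ollama

# Whisper large-v1 with built-in Silero VAD for start/stop detection
recorder = AudioToTextRecorder(model="large-v1", language="en")
# Kokoro-82M ONNX model plus the bundled voice profiles
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
# Llama 3.1 8B served locally by Ollama, orchestrated through Agno
agent = Agent(model=Ollama(id="llama3.1:8b"))

while True:
    text = recorder.text()            # blocks until VAD detects a complete utterance
    reply = agent.run(text).content   # reasoning plus optional tool calls
    samples, sample_rate = kokoro.create(reply, voice="af_heart", speed=1.2, lang="en-us")
    sd.play(samples, sample_rate)     # speaker output
    sd.wait()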

🌟 Key Features

  • 🎙️ Real-time Speech Processing: Advanced speech recognition using Whisper large-v1 with Silero VAD for accurate voice activity detection
  • 🧠 Intelligent Reasoning: Powered by Llama 3.1 8B through the Agno agent framework for sophisticated AI responses
  • 🌐 Web-Connected Intelligence: Integrated web search capabilities (Google Search, Wikipedia, ArXiv) for up-to-date information
  • 🗣️ Natural Voice Synthesis: High-quality speech generation using Kokoro-82M ONNX for human-like voice output
  • ⚡ Low-Latency Pipeline: Optimized audio processing for real-time conversational experience
  • 🔧 Extensible Architecture: Modular tool system allowing easy integration of new capabilities

🏗️ Architecture Overview

Component          | Technology                      | Purpose
Speech Recognition | Whisper (large-v1) + Silero VAD | Convert speech to text with voice activity detection
Language Model     | Llama 3.1 8B via Ollama         | Natural language understanding and generation
Text-to-Speech     | Kokoro-82M ONNX                 | Convert text responses to natural speech
Agent Framework    | Agno LLM Agent                  | Tool orchestration and reasoning capabilities
Web Integration    | Custom API connectors           | Real-time information retrieval

📋 Prerequisites

  • Python: Version 3.9 or higher
  • Ollama: Local LLM server (Installation Guide)
  • System Audio: Microphone and speakers/headphones
  • Operating System: macOS, Linux, or Windows

🚀 Quick Start

1. Install Ollama

macOS:

# Download from https://ollama.com/download/mac
# Or install via Homebrew
brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

# Download installer from https://ollama.com/download/windows

2. Clone and Setup

git clone https://github.com/danieladdisonorg/Vocal-Agent.git
cd Vocal-Agent

3. Install Dependencies

# Install Python dependencies
pip3 install -r requirements.txt
pip3 install --no-deps kokoro-onnx==0.4.7

4. Install System Dependencies

Linux:

sudo apt-get install espeak-ng

macOS:

brew install espeak-ng

Windows:

  1. Download eSpeak NG from its GitHub releases page
  2. Install the .msi package (e.g., espeak-ng-20191129-b702b03-x64.msi)

5. Download AI Models

Language Model:

ollama pull llama3.1:8b

Voice Models: Download the following files from the kokoro-onnx releases page and place them in the project root directory:

  • kokoro-v1.0.onnx (Kokoro-82M neural TTS model)
  • voices-v1.0.bin (voice profile database)

🎯 Usage

Starting the Application

  1. Start Ollama service:
ollama serve
  2. Initialize the model (in a separate terminal):
ollama run llama3.1:8b
  3. Launch Vocal Agent:
python3 main.py

Interaction Flow

🎤 Listening... Press Ctrl+C to exit
🔴 Speak now - Recording started
⏹️ Recording stopped

📝 Transcribed: "What's the weather like in Tokyo today?"
🔧 LLM Tool calls...
🤖 Response: "Let me check the current weather in Tokyo for you..."
🔊 [Audio response plays]

⚙️ Configuration

Customize the application behavior by modifying settings in main.py:

# Audio Processing Configuration
SAMPLE_RATE = 16000          # Audio sample rate (Hz)
MAX_PHONEME_LENGTH = 500     # Maximum phoneme sequence length

# Voice Synthesis Settings
SPEED = 1.2                  # Speech rate multiplier
VOICE_PROFILE = "af_heart"   # Voice character selection

# Performance Settings
MAX_THREADS = 2              # Parallel processing threads

Available Voice Profiles

  • af_heart - Warm, friendly tone
  • af_sky - Clear, professional tone
  • af_bella - Expressive, dynamic tone
  • Additional profiles available in voices-v1.0.bin; listing and previewing them is sketched below
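
To list every bundled profile and audition one, a minimal sketch assuming the kokoro-onnx 0.4.7 API pinned in requirements.txt:

import sounddevice as sd
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
print(kokoro.get_voices())  # names of every profile stored in voices-v1.0.bin

# Audition a profile at the project's default 1.2x speed
samples, sample_rate = kokoro.create("Hello from Vocal Agent.", voice="af_sky", speed=1.2, lang="en-us")
sd.play(samples, sample_rate)
sd.wait()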

📁 Project Structure

Vocal-Agent/
├── main.py                 # Core application entry point
├── agent_client.py         # LLM agent integration layer
├── kokoro-v1.0.onnx       # Neural TTS model
├── voices-v1.0.bin        # Voice profile database
├── requirements.txt       # Python dependencies
├── vocal_agent_mac.sh     # macOS setup automation script
├── demo.png              # Application demonstration
├── LICENSE               # MIT license
└── README.md            # Project documentation

🛠️ Development

Extending Functionality

Add new tools to the agent by integrating Agno Toolkits. The import paths below follow recent Agno releases; check them against the version you have installed:

from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.tools.googlesearch import GoogleSearchTools
from agno.tools.wikipedia import WikipediaTools

# Combine built-in toolkits with your own tools
agent = Agent(
    model=Ollama(id="llama3.1:8b"),
    tools=[GoogleSearchTools(), WikipediaTools(), YourCustomTool()],
)
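
For instance, here is a minimal sketch of a custom tool, assuming Agno accepts plain typed, docstringed Python functions as tools (the get_local_time helper is hypothetical):

from datetime import datetime
from zoneinfo import ZoneInfo

from agno.agent import Agent
from agno.models.ollama import Ollama

def get_local_time(timezone: str) -> str:
    """Return the current local time in an IANA timezone, e.g. 'Asia/Tokyo'."""
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M on %A")

# The function's signature and docstring tell the model when and how to call it
agent = Agent(model=Ollama(id="llama3.1:8b"), tools=[get_local_time])
print(agent.run("What time is it in Tokyo?").content)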

Performance Optimization

  • GPU Acceleration: Enable CUDA for faster model inference
  • Model Selection: Choose smaller models for faster response times (see the sketch below)
  • Audio Buffer Tuning: Adjust buffer sizes for your hardware
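
As one example, a sketch of trading accuracy for speed on the speech-recognition side; the parameter names follow RealtimeSTT's AudioToTextRecorder and should be verified against your installed version:

from RealtimeSTT import AudioToTextRecorder

# base.en transcribes much faster than large-v1, at some cost in accuracy
recorder = AudioToTextRecorder(
    model="base.en",
    language="en",
    device="cuda",  # assumption: set to "cpu" on machines without CUDA
)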

🔧 Troubleshooting

Common Issues

Ollama Connection Error:

# Ensure Ollama is running
ollama serve
# Verify model is available
ollama list

Audio Device Issues:

  • Check microphone permissions
  • Verify audio device selection in system settings
  • Test with python3 -c "import sounddevice; print(sounddevice.query_devices())"

Model Download Failures:

  • Ensure stable internet connection
  • Verify sufficient disk space (models require ~8GB)
  • Check Ollama service status

📊 Performance Metrics

  • Speech Recognition Latency: ~200-500ms
  • LLM Response Time: ~1-3 seconds (depending on query complexity)
  • Text-to-Speech Generation: ~100-300ms
  • Memory Usage: ~4-6GB (with Llama 3.1 8B)

🤝 Contributing

We welcome contributions! Please see our contribution guidelines for details on:

  • Code style and standards
  • Testing requirements
  • Pull request process
  • Issue reporting

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • RealtimeSTT - Speech-to-text with VAD integration
  • Kokoro-ONNX - Efficient neural text-to-speech
  • Agno - Powerful agent framework
  • Ollama - Local LLM serving platform
  • Weebo - Project inspiration
