Introduction

EdgeVox is a sub-second local voice AI designed for robots, edge devices, and anyone who wants private voice interaction without cloud dependencies.

What is EdgeVox?

EdgeVox is a streaming voice pipeline that chains together:

Microphone → VAD → STT → LLM → TTS → Speaker

Each component runs locally on your machine. The streaming architecture means the bot starts speaking before it finishes thinking — delivering first audio in ~0.8 seconds.
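The streaming hand-off between stages can be sketched as a chain of producer-consumer threads, where each stage forwards partial results downstream as soon as they exist. This is an illustrative sketch, not EdgeVox's actual code: the stage functions and queue wiring here are hypothetical stand-ins for the real VAD/STT/LLM/TTS components.

```python
import queue
import threading

def run_stage(fn, inbox, outbox):
    """Generic pipeline stage: consume items, emit transformed items."""
    while True:
        item = inbox.get()
        if item is None:            # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        for out in fn(item):
            outbox.put(out)

# Hypothetical stand-ins for the real STT/LLM/TTS components.
def stt(audio_chunk):
    yield f"text({audio_chunk})"

def llm(text):
    # Stream tokens as they are generated rather than one final reply,
    # so TTS can start before the LLM finishes "thinking".
    for token in text.split():
        yield token

def tts(token):
    yield f"audio({token})"

def run_pipeline(audio_chunks):
    q_audio, q_text, q_tokens, q_out = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=run_stage, args=(stt, q_audio, q_text)),
        threading.Thread(target=run_stage, args=(llm, q_text, q_tokens)),
        threading.Thread(target=run_stage, args=(tts, q_tokens, q_out)),
    ]
    for t in stages:
        t.start()
    for chunk in audio_chunks:
        q_audio.put(chunk)
    q_audio.put(None)
    played = []
    while (item := q_out.get()) is not None:
        played.append(item)
    for t in stages:
        t.join()
    return played
```

Because every stage runs concurrently and forwards items one at a time, the first audio chunk leaves the pipeline while later chunks are still being transcribed.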

Key Design Principles

  • Portability first — runs on an i9+RTX3080 desktop or an M1 MacBook Air
  • Language-aware — automatically selects the best STT/TTS models per language
  • Interruptible — speak over the bot at any time to cut it off
  • Developer-friendly — TUI with slash commands, Web UI, and simple CLI modes
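Barge-in interruption, as in the third principle above, is commonly implemented with a shared stop flag that playback checks between audio chunks. A minimal sketch with hypothetical names (the `Speaker` class and its methods are not from EdgeVox's API):

```python
import threading

class Speaker:
    """Plays TTS audio chunk-by-chunk; a barge-in sets the stop flag."""

    def __init__(self):
        self.interrupted = threading.Event()

    def speak(self, audio_chunks, play_chunk):
        """Play chunks until done, or until the user speaks over the bot."""
        for chunk in audio_chunks:
            if self.interrupted.is_set():   # user spoke over the bot
                self.interrupted.clear()
                return False                # playback was cut short
            play_chunk(chunk)
        return True

    def barge_in(self):
        # Called from the VAD thread when speech is detected during playback.
        self.interrupted.set()
```

Checking the flag per chunk rather than per utterance is what makes interruption feel instant: the bot stops within one chunk of audio.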

Pipeline Components

| Component | Default Model | Purpose |
|---|---|---|
| VAD | Silero VAD v6 | Voice activity detection (32 ms chunks) |
| STT | Faster-Whisper | Speech-to-text (auto-sizes by VRAM) |
| LLM | Gemma 4 E2B IT Q4_K_M | Chat via llama-cpp-python |
| TTS | Kokoro-82M | Text-to-speech (24 kHz, 9 native languages) |
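"Auto-sizes by VRAM" means the STT stage picks the largest Whisper variant that fits the available GPU memory. A sketch of that selection logic; the thresholds below are illustrative guesses, not EdgeVox's documented values:

```python
# Largest-first list of (minimum VRAM in GB, Faster-Whisper model name).
# These thresholds are assumptions for illustration only.
WHISPER_SIZES = [
    (10.0, "large-v3"),
    (5.0, "medium"),
    (2.0, "small"),
    (0.0, "base"),
]

def pick_whisper_size(vram_gb: float) -> str:
    """Return the biggest model whose VRAM floor fits the available memory."""
    for min_vram, name in WHISPER_SIZES:
        if vram_gb >= min_vram:
            return name
    return "base"   # CPU-only or unknown VRAM: fall back to the smallest
```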

Multi-Language TTS/STT Backends

| Language | STT | TTS Backend |
|---|---|---|
| English, French, Spanish, etc. | Faster-Whisper | Kokoro-82M |
| Vietnamese | Sherpa-ONNX (Zipformer 30M) | Piper ONNX |
| German, Russian, Arabic, Indonesian | Faster-Whisper | Piper ONNX |
| Korean | Faster-Whisper | Supertonic |
| Thai | Faster-Whisper | PyThaiTTS |
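The table above is effectively a routing map from language to backend pair. A minimal lookup sketch, with the data taken from the table but the function itself hypothetical:

```python
# Default pair for languages Kokoro covers natively (English, French, etc.).
DEFAULT_BACKENDS = ("Faster-Whisper", "Kokoro-82M")

# Per-language overrides, reconstructed from the backend table.
BACKENDS = {
    "vi": ("Sherpa-ONNX (Zipformer 30M)", "Piper ONNX"),
    "de": ("Faster-Whisper", "Piper ONNX"),
    "ru": ("Faster-Whisper", "Piper ONNX"),
    "ar": ("Faster-Whisper", "Piper ONNX"),
    "id": ("Faster-Whisper", "Piper ONNX"),
    "ko": ("Faster-Whisper", "Supertonic"),
    "th": ("Faster-Whisper", "PyThaiTTS"),
}

def backends_for(lang: str):
    """Return (stt, tts) backend names for a BCP-47-style language tag."""
    return BACKENDS.get(lang.split("-")[0].lower(), DEFAULT_BACKENDS)
```

Normalizing the tag (`"vi-VN"` → `"vi"`) keeps the table small while still accepting regioned language codes.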

Models are hosted on nrl-ai/edgevox-models (HuggingFace) with automatic fallback to upstream repos.
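The fallback behavior described above amounts to trying repositories in priority order and returning the first successful download. A sketch of that pattern; `download` is a hypothetical callable standing in for a real downloader such as `huggingface_hub.hf_hub_download`:

```python
def fetch_model(model_name, repos, download):
    """Try each repo in order; return the first successful download.

    `download(repo, model_name)` is assumed to raise on failure.
    """
    errors = []
    for repo in repos:
        try:
            return download(repo, model_name)
        except Exception as exc:            # record and fall through to next repo
            errors.append(f"{repo}: {exc}")
    raise RuntimeError(f"all repos failed for {model_name}: {errors}")
```

With `repos=["nrl-ai/edgevox-models", "<upstream repo>"]`, a missing file in the mirror transparently falls back to the upstream source.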

Next Steps
