
Yorùbá Text-to-Speech

Orinode's Yorùbá TTS is in active development, with delivery targeted for Q1–Q2 2026. The model aims for tone-preserving synthesis with natural prosody for the roughly 45 million Yorùbá speakers across Nigeria, Benin, and the global Yorùbá diaspora.

Why Yorùbá TTS is hard

Yorùbá is a three-tone language (high, mid, low) in which tone is grammatically and lexically significant: òwò means "trade", owó means "money", and ọwọ́ means "hand". A TTS system that ignores tone produces output that ranges from unintelligible to comically wrong. Other challenges:
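To make the tone problem concrete, here is a minimal Python sketch of how tone marks can be read from Yorùbá orthography. It assumes the standard convention that an acute accent marks high tone, a grave accent marks low tone, and an unmarked vowel is mid. The helper names `tone_pattern` and `strip_tones` are illustrative only and are not part of Orinode's pipeline or CosyVoice. Tone-bearing syllabic nasals are ignored for brevity.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent: high tone
GRAVE = "\u0300"  # combining grave accent: low tone
VOWELS = set("aeiou")  # ẹ and ọ decompose to e/o plus a dot below


def tone_pattern(word: str) -> str:
    """Return one letter per vowel: H (high), L (low), or M (mid/unmarked)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    tones = []
    for i, ch in enumerate(decomposed):
        if ch not in VOWELS:
            continue
        # Gather the combining marks attached to this vowel.
        marks = ""
        j = i + 1
        while j < len(decomposed) and unicodedata.combining(decomposed[j]):
            marks += decomposed[j]
            j += 1
        if ACUTE in marks:
            tones.append("H")
        elif GRAVE in marks:
            tones.append("L")
        else:
            tones.append("M")
    return "".join(tones)


def strip_tones(word: str) -> str:
    """Remove tone marks only, keeping the dot below on ẹ, ọ, and ṣ."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(c for c in decomposed if c not in (ACUTE, GRAVE))
    return unicodedata.normalize("NFC", kept)


print(tone_pattern("òwò"))   # LL  (trade)
print(tone_pattern("owó"))   # MH  (money)
print(tone_pattern("ọwọ́"))  # MH  (hand)
print(strip_tones("òwò") == strip_tones("owó"))  # True: both collapse to "owo"
```

Once the tone marks are stripped, "trade" and "money" become the same string, which is the ambiguity a tone-unaware front end hands to the synthesizer. Normalizing to NFD first matters because the same word can arrive either precomposed or as base letters plus combining marks.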

Architecture: CosyVoice 2 + Nigerian studio data

The Aria Voices stack is built on CosyVoice 2 (Apache 2.0, released December 2024), a state-of-the-art speech-LLM TTS system that combines a Whisper-style encoder, an autoregressive token language model, flow matching, and a HiFi-GAN vocoder.

Inference modes available today

Mode | Use case
inference_zero_shot | Synthesize with a reference voice plus its matching transcript
inference_cross_lingual | Clone a voice when the reference is in a different language from the synthesis target
inference_instruct2 | Control style with a natural-language instruction ("speak with a warm, professional tone")
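As a rough illustration of how a caller might pick among these modes, the sketch below maps a request to one of the three mode names in the table. The function `choose_mode`, its parameters, the language codes, and the priority order (style instruction first, then language mismatch, then transcript) are all assumptions made for this example. They do not describe Orinode's routing logic or the CosyVoice API.

```python
from typing import Optional


def choose_mode(
    reference_lang: str,
    target_lang: str,
    reference_transcript: Optional[str] = None,
    style_instruction: Optional[str] = None,
) -> str:
    """Return the name of the inference mode that fits the request.

    Hypothetical helper: the priority order below is an assumption.
    """
    if style_instruction:
        # A natural-language style prompt was supplied.
        return "inference_instruct2"
    if reference_lang != target_lang:
        # The reference clip and the target text are in different languages.
        return "inference_cross_lingual"
    if reference_transcript:
        # Same language, and the reference clip has a matching transcript.
        return "inference_zero_shot"
    raise ValueError("same-language cloning needs a transcript of the reference clip")


# Example: English reference voice, Yorùbá target text.
print(choose_mode("en", "yo"))  # inference_cross_lingual
```

The returned string matches a mode name from the table, so a caller could use it to look up the corresponding method on a loaded model. The arguments each mode expects differ, so check the CosyVoice documentation before wiring this to real inference calls.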

Roadmap milestones

Try it

The CosyVoice 2 base voice is live in the Orinode internal demo today. For partner access to Yorùbá-specific fine-tunes when they ship, email [email protected].