Yorùbá Text-to-Speech
Orinode's Yorùbá TTS is in active development for delivery in Q1–Q2 2026. The model targets tone-preserving, naturally prosodic synthesis for the ~45 million Yorùbá speakers across Nigeria, Benin, and the global Yorùbá diaspora.
Why Yorùbá TTS is hard
Yorùbá is a three-tone language (high, mid, low) in which tone is both grammatically and lexically significant: òwò means "trade" and owó means "money", while ọwọ́ ("hand") differs from owó only in vowel quality. A TTS system that ignores tone produces output that ranges from unintelligible to comically wrong. Other challenges:
- Nasalized vowels — ọn, ẹn, in require precise vowel formant control most TTS systems can't deliver.
- Vowel harmony — Yorùbá enforces a +ATR / −ATR feature alignment across word-internal vowels that synthesized output often breaks.
- Punctuation-driven prosody — Yorùbá uses pitch terraces and downstep that map only loosely to English-style commas and periods.
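Both tone and the ±ATR class of mid vowels can be read off standard Yorùbá orthography, so a text front end can inspect them before synthesis. The following is a minimal sketch of that idea, not Orinode's pipeline. The function names are hypothetical, unmarked vowels are assumed to carry mid tone, syllabic nasals are ignored, and the harmony check is a deliberate simplification that only flags words mixing e/o with ẹ/ọ (compounds and loanwords can legitimately do so).

```python
import unicodedata

ACUTE, GRAVE, DOT_BELOW = "\u0301", "\u0300", "\u0323"
VOWELS = set("aeiou")


def split_letters(word):
    """Split a word into (base letter, combining marks) pairs using NFD."""
    letters = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and letters:
            base, marks = letters[-1]
            letters[-1] = (base, marks + ch)
        else:
            letters.append((ch, ""))
    return letters


def tone_pattern(word):
    """Return one tone label per vowel letter: H (acute), L (grave), M (unmarked)."""
    tones = []
    for base, marks in split_letters(word):
        if base not in VOWELS:
            continue  # consonants, including syllabic nasals, are skipped in this sketch
        if ACUTE in marks:
            tones.append("H")
        elif GRAVE in marks:
            tones.append("L")
        else:
            tones.append("M")
    return tones


def breaks_mid_vowel_harmony(word):
    """Simplified check: True if the word mixes e/o (+ATR) with ẹ/ọ (-ATR)."""
    classes = set()
    for base, marks in split_letters(word):
        if base in ("e", "o"):
            classes.add("-ATR" if DOT_BELOW in marks else "+ATR")
    return len(classes) > 1


for word in ("òwò", "owó", "ọwọ́"):
    print(word, tone_pattern(word), breaks_mid_vowel_harmony(word))
```

Normalizing to NFD first means the functions behave the same whether the input uses precomposed characters or separate combining marks, which matters for text collected from mixed sources.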
Architecture: CosyVoice 2 + Nigerian studio data
The Aria Voices stack is built on CosyVoice 2 (Apache 2.0, December 2024), a state-of-the-art speech-LLM TTS system that combines a Whisper-style encoder, an autoregressive token language model, flow matching, and a HiFi-GAN vocoder:
- Base model: CosyVoice 2 0.5B parameters, supporting English/Mandarin/Japanese/Korean/Cantonese out-of-the-box.
- Nigerian fine-tuning: ~120 hours of professionally recorded Yorùbá in three regional accents (Ìbàdàn / Òṣogbo / Èkó-Lagos), captured in an acoustically treated Lagos studio in Q1–Q2 2026.
- Voice-cloning reference: zero-shot voice cloning from a 3–10 second reference clip, which enables custom brand voices.
- Style control: CosyVoice 2's `instruct2` mode supports natural-language style prompts ("speak warmly, slowly, like a customer-support agent").
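Because voice cloning depends on a 3–10 second reference clip, it can help to check a clip's length before submitting it. The sketch below shows one way to do that with Python's standard `wave` module. The function names and the decision to reject out-of-range clips are assumptions for illustration, and the code only handles uncompressed WAV input.

```python
import wave

# Bounds taken from the 3-10 second reference-clip range described above.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def clip_duration_seconds(source):
    """Return the duration of a WAV clip given a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(source):
    """True if the clip length falls inside the supported reference range."""
    return MIN_SECONDS <= clip_duration_seconds(source) <= MAX_SECONDS
```

A caller would typically run `is_valid_reference("speaker.wav")` (a hypothetical file name) and ask for a new recording when it returns `False`.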
Inference modes available today
| Mode | Use case |
|---|---|
| `inference_zero_shot` | Synthesize with a reference voice + matching transcript |
| `inference_cross_lingual` | Voice cloning when the reference is in a different language than the synthesis target |
| `inference_instruct2` | Natural-language style control ("speak with a warm, professional tone") |
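The table's use cases can be read as a small decision rule. The sketch below encodes it without calling CosyVoice itself. `SynthesisRequest` and `choose_mode` are hypothetical names, and the priority order (style prompt first, then language mismatch, then transcript) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynthesisRequest:
    """Hypothetical request shape; not part of CosyVoice or any Orinode API."""
    text: str
    reference_language: str
    target_language: str
    reference_transcript: Optional[str] = None
    style_prompt: Optional[str] = None


def choose_mode(req: SynthesisRequest) -> str:
    """Return the name of the inference mode from the table that fits the request."""
    if req.style_prompt:
        # A natural-language style instruction was supplied.
        return "inference_instruct2"
    if req.reference_language != req.target_language:
        # Reference clip and synthesis target are in different languages.
        return "inference_cross_lingual"
    if req.reference_transcript:
        # Same language, and a transcript matching the reference clip is available.
        return "inference_zero_shot"
    raise ValueError("same-language cloning needs a transcript of the reference clip")


request = SynthesisRequest(text="Ẹ káàárọ̀", reference_language="en", target_language="yo")
print(choose_mode(request))  # inference_cross_lingual
```

Returning the mode name as a string keeps the sketch independent of any installed model. A real integration would map the name to the corresponding method and pass the reference audio alongside the text.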
Roadmap milestones
- Q1 2026 — studio recording sessions in Lagos (three accents, four voice actors).
- Q1 2026 — base CosyVoice 2 fine-tuning, MOS evaluation on Yorùbá-native listeners.
- Q2 2026 — Aria Voices public API for partner pilots.
- Q3 2026 — code-switch TTS (Yorùbá ↔ English within a single utterance).
Try it
The CosyVoice 2 base voice is live in the Orinode internal demo today. For partner access to Yorùbá-specific fine-tunes when they ship, email [email protected].