Yorùbá Text-to-Speech

Orinode's Yorùbá TTS is in active development, with delivery planned for Q1–Q2 2026. The model targets tone-preserving synthesis with natural prosody for the ~45 million Yorùbá speakers across Nigeria, Benin, and the global Yorùbá diaspora.

Looking for the deployable product? This Yorùbá voice powers Maraba, an AI call agent that answers Yorùbá business calls with tone marks and underdots (è, é, ẹ, ọ, ṣ) preserved end-to-end and Yorùbá↔English code-switching handled per-token.

Why Yorùbá TTS is hard

Yorùbá is a three-tone language (high, mid, low) in which tone is grammatically and lexically significant: òwò means "trade" and owó means "money", while ọwọ́ ("hand") differs from owó only in its underdotted vowels. A TTS system that ignores tone produces output that ranges from unintelligible to comically wrong. Other challenges:

Nasalized vowels: the nasal vowels written ọn, ẹn, and in require precise vowel formant control that most TTS systems can't deliver.
Vowel harmony: Yorùbá enforces +ATR / −ATR agreement among the vowels within a word, and synthesized output often breaks it.
Punctuation-driven prosody: Yorùbá uses pitch terraces and downstep that map only loosely onto English-style commas and periods.
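
To make the tone and diacritic distinctions above concrete, here is a minimal Python sketch that reads a tone pattern off Yorùbá orthography using Unicode decomposition. It is illustrative only and is not Orinode's text front end: the function names `tone_pattern` and `strip_tones` are hypothetical, the example words use the standard spellings òwò, owó, and ọwọ́, and the sketch deliberately ignores tone carried by syllabic nasals.

```python
import unicodedata

# Combining tone marks used in Yorùbá orthography (assumed set for this sketch).
TONE_MARKS = {"\u0300": "L", "\u0301": "H", "\u0304": "M"}  # grave, acute, macron
VOWELS = set("aeiou")


def tone_pattern(word: str) -> list[str]:
    """Return one tone label (H, M, or L) per vowel letter; unmarked vowels are mid."""
    tones: list[str] = []
    on_vowel = False  # True while the current base letter is a vowel
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in VOWELS:
            tones.append("M")
            on_vowel = True
        elif ch in TONE_MARKS:
            if on_vowel:
                tones[-1] = TONE_MARKS[ch]
        elif not unicodedata.combining(ch):
            on_vowel = False  # a consonant or other base character
    return tones


def strip_tones(word: str) -> str:
    """Remove tone marks but keep underdots (U+0323), which mark vowel quality."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    for w in ("òwò", "owó", "ọwọ́"):
        print(w, tone_pattern(w), strip_tones(w))
```

In this sketch, stripping tone collapses òwò and owó to the same string, which is the kind of ambiguity a tone-blind pipeline introduces; ọwọ́ stays distinct only because the underdots are kept.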

Architecture: CosyVoice 2 + Nigerian studio data

The Maraba Voices stack is built on CosyVoice 2 (Apache 2.0, December 2024) — a state-of-the-art speech-LLM TTS combining a Whisper-style encoder, autoregressive token language model, flow matching, and HiFi-GAN vocoder:

Base model: CosyVoice 2 0.5B parameters, supporting English/Mandarin/Japanese/Korean/Cantonese out-of-the-box.
Nigerian fine-tuning: ~120 hours of professionally recorded Yorùbá in three regional accents (Ìbàdàn / Òṣogbo / Èkó-Lagos), captured in an acoustically treated Lagos studio Q1–Q2 2026.
Voice-cloning reference: zero-shot voice cloning from a 3–10 second reference clip — enables custom brand voices.
Style control: CosyVoice 2's instruct2 mode supports natural-language style prompts ("speak warmly, slowly, like a customer-support agent").

Inference modes available today

Mode	Use case
`inference_zero_shot`	Synthesize with a reference voice + matching transcript
`inference_cross_lingual`	Voice cloning when the reference is in a different language than the synthesis target
`inference_instruct2`	Natural-language style control ("speak with a warm, professional tone")
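
The three modes in the table differ mainly in which inputs they take. The sketch below maps each mode name to its positional arguments, following the argument order shown in the open-source CosyVoice README at the time of writing; treat that order as an assumption and re-check it against the version you install. The helper names `build_call` and `synthesize` are hypothetical, `model` is assumed to be an already loaded CosyVoice 2 model object, and none of this is Orinode's production code.

```python
from typing import Any


def build_call(mode: str, tts_text: str, prompt_speech: Any,
               prompt_text: str = "", instruct_text: str = "") -> tuple:
    """Return the positional arguments for one CosyVoice 2 inference mode."""
    if mode == "inference_zero_shot":
        # Reference audio plus the transcript of what is said in it.
        if not prompt_text:
            raise ValueError("zero-shot mode needs the reference clip's transcript")
        return (tts_text, prompt_text, prompt_speech)
    if mode == "inference_cross_lingual":
        # Reference audio only; its language may differ from tts_text.
        return (tts_text, prompt_speech)
    if mode == "inference_instruct2":
        # Natural-language style prompt plus reference audio.
        if not instruct_text:
            raise ValueError("instruct2 mode needs a style instruction")
        return (tts_text, instruct_text, prompt_speech)
    raise ValueError(f"unknown mode: {mode}")


def synthesize(model: Any, mode: str, tts_text: str, prompt_speech: Any,
               **kwargs: str) -> list:
    """Call the chosen mode on `model` and collect the chunks it yields."""
    args = build_call(mode, tts_text, prompt_speech, **kwargs)
    return list(getattr(model, mode)(*args, stream=False))
```

A call might look like `synthesize(model, "inference_instruct2", text, ref_audio, instruct_text="speak warmly, slowly")`, where `ref_audio` is the 3–10 second reference clip loaded in whatever format the installed library expects.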

Roadmap milestones

Q1 2026 — studio recording sessions in Lagos (three accents, four voice actors).
Q1 2026 — base CosyVoice 2 fine-tuning, MOS evaluation on Yorùbá-native listeners.
Q2 2026 — Maraba Voices public API for partner pilots.
Q3 2026 — code-switch TTS (Yorùbá ↔ English within a single utterance).
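
The MOS (mean opinion score) evaluation listed for Q1 2026 is conventionally the mean of listener ratings on a 1 to 5 scale, usually reported with a confidence interval. Below is a minimal sketch assuming a normal-approximation 95% interval; the ratings are made-up placeholder values rather than Orinode results, and `mos_with_ci` is a hypothetical helper.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of the approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Standard error of the mean, scaled by the normal quantile z.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Placeholder ratings for illustration only.
example_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With small listener panels a t-distribution quantile would be more appropriate than the fixed z value used here.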

Try it

The CosyVoice 2 base voice is live in the Orinode internal demo today. For partner access to Yorùbá-specific fine-tunes when they ship, email [email protected].

For the deployable voice agent built on this model, see Maraba — Yorùbá AI at maraba.ai.