Maraba v1 Stage 3 complete · 23.2% avg WER · Code-switch corpus released.
Today marks two milestones on Orinode's roadmap.
1. Stage 3 LoRA training is complete. Maraba v1's multilingual speech-LLM (Whisper-large-v3 encoder + MLPReshapeAdapter + Gemma-2-9B-it decoder with LoRA r=16) has finished its 20,000-step training run. We evaluated all 16 checkpoints on 200 samples for each of six language conditions (Nigerian English, Hausa, Yoruba, Igbo, Nigerian Pidgin, and Yoruba–English code-switch). The best checkpoint (step 20,000) achieves an average word error rate (WER) of 23.2%:
| Language | WER |
|---|---|
| Nigerian English | 10.9% |
| Nigerian Pidgin | 9.8% |
| Yoruba–English code-switch | 18.0% |
| Yoruba | 29.5% |
| Hausa | 31.1% |
| Igbo | 39.8% |
| Average | 23.2% |
Igbo remains the bottleneck language, consistent with the scarcity of public Igbo speech data. We are actively expanding our Igbo training data to close this gap before the Maraba pilot.
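The Average row in the table above is consistent with an unweighted mean of the six per-language figures, which weights each language equally because every condition uses the same 200-sample test size. The sketch below is only a quick arithmetic check of that reading; the dictionary and function names are illustrative and are not part of the released eval protocol.

```python
# Per-language WER (%) copied from the results table above.
wer_by_language = {
    "Nigerian English": 10.9,
    "Nigerian Pidgin": 9.8,
    "Yoruba-English code-switch": 18.0,
    "Yoruba": 29.5,
    "Hausa": 31.1,
    "Igbo": 39.8,
}


def macro_average(scores):
    """Unweighted mean: each language counts equally, whatever its WER."""
    return sum(scores.values()) / len(scores)


average_wer = macro_average(wer_by_language)
print(f"{average_wer:.1f}%")  # 23.2%
```

If the per-language test sets had different sizes, a sample-weighted average could differ from this figure, so the equal-size assumption matters.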
2. The Naija Customer-Call Code-Switch Corpus is public. It contains 15,000 hand-written customer-service sentences across Hausa–English, Igbo–English, and Yoruba–English, covering 30 business sectors, and is now available under CC-BY 4.0 on Hugging Face:
huggingface.co/datasets/Orinode/naija-customer-call-code-switch
We are publishing the full corpus, with candid quality notes for each language. The first 3,000 lines of each file are gold-quality; later sections cover broader vocabulary but show more pattern repetition. We are releasing it as-is so the community can filter for their own use cases rather than wait for a sanitised subset.
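For readers who only want the gold-quality portion, one simple filter is to keep the first 3,000 lines of each file. The sketch below assumes each file is plain text with one sentence per line and no header row, which may not match the actual layout on Hugging Face; `gold_subset` is a hypothetical helper, and the in-memory sample stands in for a real file handle.

```python
from itertools import islice

# From the release notes: the first 3,000 lines of each file are gold-quality.
GOLD_LINE_COUNT = 3000


def gold_subset(lines, limit=GOLD_LINE_COUNT):
    """Return the first `limit` lines, with trailing newlines removed.

    `lines` can be any iterable of strings, such as an open text file.
    """
    return [line.rstrip("\n") for line in islice(lines, limit)]


# Stand-in for one corpus file (assumed format: one sentence per line).
sample = [f"sentence {i}\n" for i in range(5000)]
gold = gold_subset(sample)
print(len(gold))  # 3000
```

With a downloaded file, the same helper could be called on the handle returned by `open(path, encoding="utf-8")`, adjusting for any header or column structure the files turn out to have.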
What's next. Stage 3 weights remain commercial; the evaluation protocol and per-checkpoint results are open. Next up: TTS layer build-out on CosyVoice 2 (Apache 2.0) with studio-quality Nigerian voice recordings beginning Q1 2026, and the Maraba pilot launch in Q3 2026. A methodology paper is targeted at Interspeech 2026.