Project Overview
Yoruba is spoken by over 45 million people across Nigeria, Benin, and the diaspora — yet it remains severely underserved by mainstream translation tools like Google Translate, which struggle with Yoruba's tonal diacritical marks and idiomatic expressions. This project delivers a purpose-built, production-ready alternative.
The model fine-tunes AfriTeVa V2 — a T5-architecture transformer pre-trained on African language corpora — on 101,906 curated Yoruba-English parallel pairs. The result is a bidirectional translation system that handles diacritical marks natively, detects language automatically, and generates four translation variants with quality scoring so users can choose the best rendering.
Why this matters: Mainstream translation tools are trained primarily on high-resource language pairs. Yoruba's diacritics, both the underdotted letters (ẹ, ọ, ṣ) and the tone marks (á, à, é, è), are frequently mangled by general-purpose models. This project contributes a specialized model to the growing ecosystem of African language NLP — freely available on Hugging Face for research and commercial use.
Key Features
Bidirectional Translation
Seamless English → Yoruba and Yoruba → English. A single model handles both directions using task prefixes.
Auto Language Detection
Detects input language using diacritical mark recognition and Yoruba lexical patterns — no manual direction setting required.
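The project's actual detector isn't listed here, but a heuristic in this spirit can be sketched in a few lines of Python. The marker characters and the function-word list below are illustrative, not the project's real tables:

```python
import unicodedata
from collections import Counter

# Characters strongly indicative of Yoruba orthography (underdotted letters).
YORUBA_MARKERS = set("ẹọṣẸỌṢ")
# A handful of common Yoruba function words (illustrative, not exhaustive).
YORUBA_WORDS = {"ni", "ti", "si", "mo", "wa", "ko", "fun", "pe"}

def detect_language(text: str) -> str:
    """Return 'yo' or 'en' using diacritic and lexical heuristics."""
    norm = unicodedata.normalize("NFC", text)
    if any(ch in YORUBA_MARKERS for ch in norm):
        return "yo"
    # Combining grave/acute accents hint at Yoruba tone marks.
    # (Crude: accented loanwords in English text would be misclassified.)
    decomposed = unicodedata.normalize("NFD", norm)
    if "\u0300" in decomposed or "\u0301" in decomposed:
        return "yo"
    tokens = [t.strip(".,!?;:").lower() for t in norm.split()]
    if tokens and sum(t in YORUBA_WORDS for t in tokens) / len(tokens) > 0.3:
        return "yo"
    return "en"
```

Diacritics make this cheap and reliable for well-marked Yoruba input; the lexical fallback is what carries unmarked text.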
4 Translation Variants
Generates 4 candidates using different decoding strategies (beam search, sampling) with automatic quality scoring to rank them.
Hallucination Detection
Flags potentially unreliable translations based on length anomalies and vocabulary repetition before surfacing results.
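A minimal sketch of such heuristics, with illustrative thresholds (the project's actual cutoffs are not documented here):

```python
from collections import Counter

def flag_hallucination(source: str, translation: str,
                       max_len_ratio: float = 3.0,
                       max_repeat_ratio: float = 0.4) -> bool:
    """Flag a translation as potentially unreliable.

    Two cheap signals: a length anomaly (output far longer or shorter
    than the input) and vocabulary repetition (one token dominating
    the output). Thresholds here are illustrative.
    """
    src_tokens = source.split()
    out_tokens = translation.split()
    if not out_tokens:
        return True
    ratio = len(out_tokens) / max(len(src_tokens), 1)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return True
    # Fraction of the output taken up by its most frequent token.
    top = Counter(t.lower() for t in out_tokens).most_common(1)[0][1]
    if len(out_tokens) >= 5 and top / len(out_tokens) > max_repeat_ratio:
        return True
    return False
```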
Smart Text Chunking
Handles long texts via sentence-boundary-aware segmentation — preserves context within sentences when inputs exceed 128 tokens.
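A sentence-boundary-aware chunker along these lines can be sketched as follows; it approximates token counts with whitespace words, whereas the real system would count subword tokens from the model's tokenizer:

```python
import re

MAX_TOKENS = 128  # the model's maximum sequence length

def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split long text into chunks at sentence boundaries.

    Sentences are never split mid-way, so context within a sentence
    is preserved; chunks are greedily filled up to max_tokens.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is translated independently and the outputs are concatenated, which is why context is preserved within, but not across, sentences.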
Diacritical Mark Support
Full support for Yoruba-specific characters (ẹ, ọ, ṣ, á, à, é, è) with educational tooltips in the web UI.
Model Performance
Evaluated on a held-out test set of 10,190 samples — 10% of the full corpus not seen during training:
| Metric | Score | What It Measures |
|---|---|---|
| BLEU | 16.30 | N-gram overlap with reference translation (0–100) |
| BLEU-1 | 48.57 | Unigram precision — word-level match |
| BLEU-4 | 6.64 | 4-gram precision — phrase-level match |
| METEOR | 46.26 | Considers synonyms and paraphrasing — stronger semantic measure |
| ROUGE-1 | 58.83 | Unigram recall — content word coverage |
| ROUGE-2 | 34.34 | Bigram recall — phrase coverage |
| ROUGE-L | 51.73 | Longest common subsequence — structural similarity |
| chrF | 39.28 | Character n-gram F-score — handles morphology well |
| TER | 68.52 | Translation edit rate (lower is better) |
Context on BLEU scores: Low BLEU scores are expected and normal for morphologically rich low-resource languages like Yoruba — even state-of-the-art systems for similar language pairs score in the 10–20 range. The METEOR (46.26) and ROUGE-1 (58.83) scores are the more meaningful indicators here, showing strong semantic preservation and content coverage despite exact phrase-matching challenges.
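To make the BLEU-1 row concrete, here is clipped unigram precision on a toy example. This is a simplification of the full BLEU computation, which also applies a brevity penalty and aggregates counts over the whole corpus:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the core of BLEU-1.

    Each candidate word is credited at most as many times as it
    appears in the reference (the 'clipping' in BLEU), so degenerate
    outputs that repeat one word cannot score highly.
    """
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)
```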
Sample Translations
Architecture
Base Model: AfriTeVa V2
AfriTeVa V2 (by Masakhane NLP) is a sequence-to-sequence transformer built on the T5 architecture and pre-trained on a multilingual African language corpus. Choosing AfriTeVa over standard mT5 or mBART significantly improves out-of-the-box performance on African languages because the pretraining data distribution is already aligned with the target domain — rather than being dominated by high-resource European languages.
- Architecture: T5-style encoder-decoder transformer (~300M parameters, large variant)
- Tokenizer: SentencePiece with 250K vocabulary — handles Yoruba diacritics natively
- Task prefix: "translate English to Yoruba: {text}" or "translate Yoruba to English: {text}"
- Max sequence length: 128 tokens (longer texts handled by smart chunking)
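The task-prefix convention above can be wrapped in a small helper before tokenization; `build_model_input` is a hypothetical name used here for illustration:

```python
# Task prefixes expected by the fine-tuned T5-style model.
PREFIXES = {
    "en2yo": "translate English to Yoruba: ",
    "yo2en": "translate Yoruba to English: ",
}

def build_model_input(text: str, direction: str) -> str:
    """Prepend the task prefix that tells the model which direction to translate."""
    if direction not in PREFIXES:
        raise ValueError(f"unknown direction: {direction!r}")
    return PREFIXES[direction] + text
```

The prefixed string is then tokenized with truncation at 128 tokens; longer inputs go through the chunker first.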
Training Dataset
101,906 Yoruba-English parallel pairs curated from four high-quality sources and split temporally to prevent leakage:
Data Sources
- JW300 — Jehovah's Witnesses religious translations (large, clean, formal register)
- FFR — Foundation for Endangered Languages corpus
- Menyo-20k — Community-contributed general-domain pairs
- Custom corpus — Manually curated sentence pairs for colloquial coverage
Data Quality Measures
- Removed duplicates and near-duplicates
- Filtered sentences with extreme length ratios (>3:1 source/target)
- Validated Yoruba diacritical mark encoding and normalization
- Removed machine-translated contamination from training set
- Checked proper Unicode encoding throughout
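The length-ratio filter, for example, can be sketched as a simple word-count check (the ratio is computed on whitespace tokens here; the actual pipeline may count characters or subwords):

```python
def passes_length_ratio(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair only if its source/target length ratio is <= 3:1.

    Extreme ratios usually indicate misaligned pairs, truncated
    segments, or untranslated passthrough text.
    """
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio
```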
Data bias note: The training corpus skews toward formal and religious text (JW300 is the largest source). The model performs best on formal/standard Yoruba and may underperform on colloquial speech, modern slang, or code-switching sentences mixing Yoruba and English.
Training Details
Fine-tuning ran for 18 epochs with early stopping (patience=3). Total training time: 5.36 hours on an NVIDIA GPU with CUDA.
{
  "max_sequence_length": 128,
  "batch_size": 32,
  "gradient_accumulation_steps": 2,
  "effective_batch_size": 64,
  "learning_rate": 3e-5,
  "warmup_steps": 300,
  "num_epochs": 18,
  "early_stopping_patience": 3,
  "optimizer": "AdamW",
  "scheduler": "Linear with warmup",
  "checkpoint_interval": 6000
}

(checkpoint_interval is measured in optimizer steps)
- Gradient accumulation: 2 steps with batch_size=32 yields an effective batch of 64 — improving training stability without requiring more GPU memory
- Warmup: 300 steps prevent early large gradient updates on a pre-trained model
- Early stopping: Triggered after 3 epochs without validation BLEU improvement — best checkpoint saved at the optimal epoch
- Hyperparameter search: Final config selected after testing LRs [1e-5, 3e-5, 5e-5], batch sizes [16, 32, 64], and max lengths [64, 128, 256]
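The batch arithmetic in these notes can be checked directly. The training-set size below assumes everything outside the 10% test split was used for training; since the validation split size isn't stated, the step count is illustrative:

```python
import math

corpus_size = 101_906
test_size = 10_190                    # 10% held-out test set
train_size = corpus_size - test_size  # assumption: remainder used for training

batch_size = 32
grad_accum = 2
effective_batch = batch_size * grad_accum  # 32 * 2 = 64

# Approximate optimizer steps per epoch at the effective batch size.
steps_per_epoch = math.ceil(train_size / effective_batch)
```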
4 Decoding Strategies
Each query generates four translation candidates using different decoding configurations — then quality scoring selects the best. Users can also view all variants:
Beam Search (5 beams)
Temperature 0.6, no_repeat_ngram_size=3. Safe, predictable, grammatically conservative output.
Wide Beam (8 beams)
Temperature 0.7, length_penalty=1.2. Natural phrasing with balanced creativity — the default best selection.
Low Temperature (6 beams)
Temperature 0.5, repetition_penalty=1.1. Literal and focused — prioritizes accuracy over fluency.
Sampling (top-k=50)
Temperature 0.8, top_p=0.92. Diverse and natural — captures conversational register better than beam search.
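Under the Hugging Face transformers `generate()` API, the four strategies can be expressed as keyword sets like the following sketch. Parameter values mirror the list above; `max_new_tokens` and the helper `generate_variants` are illustrative additions:

```python
# The four decoding configurations described above, as keyword-argument
# sets for model.generate(). Note: transformers ignores `temperature`
# for pure beam search unless do_sample=True, so the temperatures on the
# beam strategies are listed here as documented but may be inert.
DECODING_STRATEGIES = {
    "beam_search": dict(num_beams=5, temperature=0.6, no_repeat_ngram_size=3),
    "wide_beam": dict(num_beams=8, temperature=0.7, length_penalty=1.2),
    "low_temperature": dict(num_beams=6, temperature=0.5, repetition_penalty=1.1),
    "sampling": dict(do_sample=True, top_k=50, top_p=0.92, temperature=0.8),
}

def generate_variants(model, tokenizer, text, max_new_tokens=128):
    """Run all four strategies and return {strategy_name: translation}."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    variants = {}
    for name, kwargs in DECODING_STRATEGIES.items():
        ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **kwargs)
        variants[name] = tokenizer.decode(ids[0], skip_special_tokens=True)
    return variants
```

Quality scoring then ranks the four candidates, with the wide-beam output serving as the default selection.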
Live Demo
The model is deployed as an interactive Hugging Face Space. Enter any English or Yoruba text to translate — or use auto-detect mode:
API Reference
The Flask API supports language detection, translation with variants, and cache management:
import requests

# English to Yoruba
response = requests.post("http://localhost:7860/translate", json={
    "text": "I love learning new languages",
    "direction": "en2yo",  # or "yo2en" or "auto"
    "num_outputs": 4,
})
result = response.json()
print(result["output"])            # "Mo nífẹ̀ẹ́ kíkọ́ èdè tuntun"
print(result["all_translations"])  # All 4 variants with quality scores

# Auto-detect Yoruba input
response = requests.post("http://localhost:7860/translate", json={
    "text": "Ẹ káàsán",
    "direction": "auto",
})
print(response.json()["output"])  # "Good afternoon"
Additional endpoints: GET /health (model status), POST /detect-language (language identification only), POST /clear-cache (cache management).
Tech Stack
Known Limitations
Idiomatic expressions: May translate literally rather than capturing cultural nuance — Yoruba has rich proverbial expressions that don't map cleanly to English equivalents.
Code-switching: Limited support for mixed Yoruba-English sentences — a common pattern in real-world Nigerian communication that the training data underrepresents.
Training data bias: The corpus skews toward formal and religious text. Colloquial speech, modern slang, and informal registers may produce lower-quality outputs.
Diacritical input dependency: For best Yoruba → English results, input text should include proper Yoruba diacritical marks. Plain text without tone marks will produce lower accuracy.
Future work includes expanding to other Nigerian languages (Igbo, Hausa), adding speech-to-text and text-to-speech integration, and fine-tuning on domain-specific corpora (medical, legal, technical).