adeyemi@adediranadeyemi.com +234 816 273 5399
NLP · Low-Resource MT · Transformers

Bidirectional Yoruba-English Neural Translation

A production-ready neural machine translation model for Yoruba ⇄ English — fine-tuned on 101,906 pairs with automatic language detection, cultural context handling, and 4 translation variants with quality scoring. Deployed live on Hugging Face.

Stack
AfriTeVa V2 · PyTorch · Transformers · Flask
Language Pair
Yoruba ⇄ English (bidirectional)
Type
Neural MT · Low-Resource NLP · African Languages
Yoruba-English bidirectional neural translation model interface by Adediran Adeyemi
101K+ — Yoruba-English translation pairs in training data
46.26 — METEOR score, showing strong semantic preservation
58.83 — ROUGE-1 score, showing high unigram recall
4 — translation variants per query, each with a quality score

Project Overview

Yoruba is spoken by over 45 million people across Nigeria, Benin, and the diaspora — yet it remains severely underserved by mainstream translation tools like Google Translate, which struggle with Yoruba's tonal diacritical marks and idiomatic expressions. This project builds a purpose-built, production-ready solution.

The model fine-tunes AfriTeVa V2 — a T5-architecture transformer pre-trained on African language corpora — on 101,906 curated Yoruba-English parallel pairs. The result is a bidirectional translation system that handles diacritical marks natively, detects language automatically, and generates four translation variants with quality scoring so users can choose the best rendering.

Why this matters: Mainstream translation tools are trained primarily on high-resource language pairs. Yoruba's tonal markers (ẹ, ọ, ṣ, á, à, é, è) are frequently mangled by general-purpose models. This project contributes a specialized model to the growing ecosystem of African language NLP — freely available on Hugging Face for research and commercial use.

Key Features

Bidirectional Translation

Seamless English → Yoruba and Yoruba → English. A single model handles both directions using task prefixes.

Auto Language Detection

Detects input language using diacritical mark recognition and Yoruba lexical patterns — no manual direction setting required.
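The detection heuristic described above can be sketched in a few lines. This is an illustrative reconstruction, not the deployed code: the character set and the small Yoruba keyword list are assumptions chosen to demonstrate the approach.

```python
import re

# Hypothetical sketch of diacritic + lexical detection; the character set and
# keyword list below are illustrative assumptions, not the deployed values.
YORUBA_DIACRITICS = set("ẹọṣẸỌṢáàéèíìóòúù")
YORUBA_KEYWORDS = {"ni", "ti", "kí", "ṣe", "àti", "báwo"}

def detect_language(text: str) -> str:
    """Return 'yo' if the text looks like Yoruba, else 'en'."""
    # Any Yoruba-specific diacritic is a strong signal on its own.
    if any(ch in YORUBA_DIACRITICS for ch in text):
        return "yo"
    # Fall back to common Yoruba function words for unmarked text.
    words = set(re.findall(r"\w+", text.lower()))
    return "yo" if words & YORUBA_KEYWORDS else "en"

print(detect_language("Ẹ káàrọ̀"))       # yo
print(detect_language("Good morning"))  # en
```

A real system would combine these signals into a confidence score rather than a hard rule, but the two-stage check (diacritics first, lexicon second) is the core idea.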

4 Translation Variants

Generates 4 candidates using different decoding strategies (beam search, sampling) with automatic quality scoring to rank them.

Hallucination Detection

Flags potentially unreliable translations based on length anomalies and vocabulary repetition before surfacing results.
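The two checks named above (length anomalies and vocabulary repetition) might look like the following sketch; the thresholds are assumptions for illustration, not the deployed values.

```python
# Illustrative hallucination check; the 3.0 length-ratio and 0.5 distinct-word
# thresholds are assumed values, not the model's actual configuration.
def is_suspect(source: str, translation: str,
               max_len_ratio: float = 3.0,
               min_distinct_frac: float = 0.5) -> bool:
    src_words = source.split()
    out_words = translation.split()
    if not out_words:
        return True
    # Length anomaly: output wildly longer or shorter than the input.
    ratio = len(out_words) / max(len(src_words), 1)
    if ratio > max_len_ratio or ratio < 1 / max_len_ratio:
        return True
    # Vocabulary repetition: too few distinct words for the output length.
    if len(set(out_words)) / len(out_words) < min_distinct_frac:
        return True
    return False

print(is_suspect("Good morning", "Ẹ káàrọ̀"))         # False
print(is_suspect("Good morning", "o o o o o o o o"))  # True
```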

Smart Text Chunking

Handles long texts via sentence-boundary-aware segmentation — preserves context within sentences when inputs exceed 128 tokens.
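A minimal sketch of sentence-boundary-aware chunking under the 128-token budget; here "tokens" are approximated by whitespace-split words rather than the real SentencePiece tokenizer, so this illustrates the segmentation logic only.

```python
import re

# Sketch of sentence-aware chunking; word counts stand in for SentencePiece
# token counts, which the real system would use instead.
def chunk_text(text: str, max_tokens: int = 128) -> list[str]:
    """Split text into chunks that respect sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

long_text = " ".join("This is sentence number %d." % i for i in range(60))
print([len(c.split()) for c in chunk_text(long_text)])  # [125, 125, 50]
```

Sentences are never split mid-way, so context within each sentence is preserved even when the overall input exceeds the model's sequence limit.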

Diacritical Mark Support

Full support for Yoruba-specific characters (ẹ, ọ, ṣ, á, à, é, è) with educational tooltips in the web UI.

Model Performance

Evaluated on a held-out test set of 10,190 samples — 10% of the full corpus not seen during training:

Metric    Score   What It Measures
BLEU      16.30   N-gram overlap with reference translation (0–100)
BLEU-1    48.57   Unigram precision — word-level match
BLEU-4     6.64   4-gram precision — phrase-level match
METEOR    46.26   Considers synonyms and paraphrasing — stronger semantic measure
ROUGE-1   58.83   Unigram recall — content word coverage
ROUGE-2   34.34   Bigram recall — phrase coverage
ROUGE-L   51.73   Longest common subsequence — structural similarity
chrF      39.28   Character n-gram F-score — handles morphology well
TER       68.52   Translation edit rate (lower is better)

Context on BLEU scores: Low BLEU scores are expected and normal for morphologically rich low-resource languages like Yoruba — even state-of-the-art systems for similar language pairs score in the 10–20 range. The METEOR (46.26) and ROUGE-1 (58.83) scores are the more meaningful indicators here, showing strong semantic preservation and content coverage despite exact phrase-matching challenges.
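To make the precision/recall distinction in the table concrete, here is a hand-rolled unigram comparison. Real scoring uses sacreBLEU and ROUGE with smoothing and corpus-level aggregation; this toy version only shows what "unigram precision" (BLEU-1-style) and "unigram recall" (ROUGE-1-style) each count.

```python
from collections import Counter

# Toy unigram precision/recall to illustrate the table's word-level metrics;
# production evaluation uses sacreBLEU / ROUGE, not this simplification.
def unigram_scores(hypothesis: str, reference: str) -> tuple[float, float]:
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())  # clipped word matches
    precision = overlap / max(sum(hyp.values()), 1)  # of words we produced
    recall = overlap / max(sum(ref.values()), 1)     # of words we should have
    return precision, recall

# Hypothesis drops one word ("kíkọ́") from the reference sentence.
p, r = unigram_scores("Mo nífẹ̀ẹ́ èdè tuntun", "Mo nífẹ̀ẹ́ kíkọ́ èdè tuntun")
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.80
```

Every word in the shortened hypothesis is correct (precision 1.0) but one reference word is missing (recall 0.8) — the same asymmetry that lets ROUGE-1 sit well above BLEU-4 in the table.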

Sample Translations

English: Good morning → Yoruba: Ẹ káàrọ̀
English: How are you? → Yoruba: Báwo ni o ṣe wà?
English: Thank you very much → Yoruba: O ṣeun púpọ̀
English: I love learning new languages → Yoruba: Mo nífẹ̀ẹ́ kíkọ́ èdè tuntun

Architecture

Base Model: AfriTeVa V2

AfriTeVa V2 (by Masakhane NLP) is a sequence-to-sequence transformer built on the T5 architecture and pre-trained on a multilingual African language corpus. Choosing AfriTeVa over standard mT5 or mBART significantly improves out-of-the-box performance on African languages because the pretraining data distribution is already aligned with the target domain — rather than being dominated by high-resource European languages.

  • Architecture: T5-style encoder-decoder transformer (~300M parameters, large variant)
  • Tokenizer: SentencePiece with 250K vocabulary — handles Yoruba diacritics natively
  • Task prefix: "translate English to Yoruba: {text}" or "translate Yoruba to English: {text}"
  • Max sequence length: 128 tokens (longer texts handled by smart chunking)
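The task-prefix mechanics above can be sketched with the Transformers library. The checkpoint id below is a placeholder, not the model's actual Hugging Face repo name, and `translate` is a hypothetical helper shown for illustration.

```python
# Sketch of T5-style prefix-based inference; the checkpoint id is a
# placeholder ("your-username/..."), not the model's real repo name.
def build_input(text: str, direction: str = "en2yo") -> str:
    """Prepend the task prefix the model was fine-tuned with."""
    prefix = {"en2yo": "translate English to Yoruba: ",
              "yo2en": "translate Yoruba to English: "}[direction]
    return prefix + text

def translate(text: str, direction: str = "en2yo") -> str:
    """Load the (hypothetical) checkpoint and translate one input."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    model_id = "your-username/afriteva-v2-yoruba-english"  # placeholder id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(build_input(text, direction), return_tensors="pt",
                       truncation=True, max_length=128)
    out = model.generate(**inputs, num_beams=5, max_length=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(build_input("Good morning"))  # translate English to Yoruba: Good morning
```

Because one model serves both directions, the prefix is the only thing that changes between English → Yoruba and Yoruba → English requests.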

Training Dataset

101,906 Yoruba-English parallel pairs curated from four high-quality sources and split temporally to prevent leakage:

Train: 81,526 pairs (80%)
Validation: 10,190 pairs (10%)
Test: 10,190 pairs (10%)

Data Sources

  • JW300 — Jehovah's Witnesses religious translations (large, clean, formal register)
  • FFR — Foundation for Endangered Languages corpus
  • Menyo-20k — Community-contributed general-domain pairs
  • Custom corpus — Manually curated sentence pairs for colloquial coverage

Data Quality Measures

  • Removed duplicates and near-duplicates
  • Filtered sentences with extreme length ratios (>3:1 source/target)
  • Validated Yoruba diacritical mark encoding and normalization
  • Removed machine-translated contamination from training set
  • Checked proper Unicode encoding throughout
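The deduplication, length-ratio, and normalization steps above might be composed like this sketch; the >3:1 ratio mirrors the stated rule, while the rest of the pipeline details are assumptions.

```python
import unicodedata

# Sketch of the cleaning pipeline: NFC normalization, exact-duplicate removal,
# and the stated >3:1 length-ratio filter. Other details are assumptions.
def clean_pairs(pairs: list[tuple[str, str]],
                max_ratio: float = 3.0) -> list[tuple[str, str]]:
    seen, kept = set(), []
    for src, tgt in pairs:
        # Normalize so visually identical diacritics share one encoding.
        src = unicodedata.normalize("NFC", src.strip())
        tgt = unicodedata.normalize("NFC", tgt.strip())
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # exact duplicate
        s, t = len(src.split()), len(tgt.split())
        if not s or not t or max(s, t) / min(s, t) > max_ratio:
            continue  # empty side or extreme length ratio
        seen.add(key)
        kept.append((src, tgt))
    return kept

pairs = [("Good morning", "Ẹ káàrọ̀"),
         ("Good morning", "Ẹ káàrọ̀"),                     # duplicate
         ("Hi", "ọ̀rọ̀ gígùn tí kò bá orísun mu rárá o")]  # ratio > 3:1
print(len(clean_pairs(pairs)))  # 1
```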

Data bias note: The training corpus skews toward formal and religious text (JW300 is the largest source). The model performs best on formal/standard Yoruba and may underperform on colloquial speech, modern slang, or code-switching sentences mixing Yoruba and English.

Training Details

Fine-tuning ran for 18 epochs with early stopping (patience=3). Total training time: 5.36 hours on an NVIDIA GPU with CUDA.

Fine-Tuning Configuration
{
  "max_sequence_length":        128,
  "batch_size":                  32,
  "gradient_accumulation_steps":  2,
  "effective_batch_size":        64,
  "learning_rate":             3e-5,
  "warmup_steps":               300,
  "num_epochs":                  18,
  "early_stopping_patience":      3,
  "optimizer":              "AdamW",
  "scheduler":   "Linear with warmup",
  "checkpoint_interval":       6000   // steps
}
  • Gradient accumulation: 2 steps with batch_size=32 yields an effective batch of 64 — improving training stability without requiring more GPU memory
  • Warmup: 300 steps prevent early large gradient updates on a pre-trained model
  • Early stopping: Triggered after 3 epochs without validation BLEU improvement — best checkpoint saved at the optimal epoch
  • Hyperparameter search: Final config selected after testing LRs [1e-5, 3e-5, 5e-5], batch sizes [16, 32, 64], and max lengths [64, 128, 256]
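The warmup-then-linear-decay schedule from the config can be written out explicitly. The `total_steps` value is an estimate derived from the stated numbers (81,526 pairs / effective batch 64 ≈ 1,274 steps per epoch × 18 epochs), not a logged figure.

```python
# Linear warmup + linear decay, as configured above. total_steps is an
# estimate from the stated corpus size and batch settings, not a logged value.
def lr_at_step(step: int, base_lr: float = 3e-5,
               warmup_steps: int = 300, total_steps: int = 22932) -> float:
    if step < warmup_steps:
        # Ramp up linearly to avoid large early updates on pre-trained weights.
        return base_lr * step / warmup_steps
    # Then decay linearly to zero over the remaining steps.
    remaining = total_steps - step
    return base_lr * max(remaining, 0) / (total_steps - warmup_steps)

print(lr_at_step(150))    # halfway through warmup: 1.5e-05
print(lr_at_step(300))    # peak: 3e-05
print(lr_at_step(22932))  # end of training: 0.0
```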

4 Decoding Strategies

Each query generates four translation candidates using different decoding configurations — then quality scoring selects the best. Users can also view all variants:

Conservative

Beam Search (5 beams)

Best for: Formal text, technical content

Temperature 0.6, no_repeat_ngram_size=3. Safe, predictable, grammatically conservative output.

Balanced

Wide Beam (8 beams)

Best for: General purpose translation

Temperature 0.7, length_penalty=1.2. Natural phrasing with balanced creativity — the default best selection.

Precise

Low Temperature (6 beams)

Best for: Short sentences, idioms

Temperature 0.5, repetition_penalty=1.1. Literal and focused — prioritizes accuracy over fluency.

Creative

Sampling (top-k=50)

Best for: Casual conversation, creative text

Temperature 0.8, top_p=0.92. Diverse and natural — captures conversational register better than beam search.
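The four strategies above map naturally onto Hugging Face `generate()` keyword sets. The values mirror the text; note that in the transformers library, `temperature` only takes effect when `do_sample=True`, so the three beam variants are deterministic regardless of the listed temperatures.

```python
# The four decoding configurations, expressed as generate() kwargs. Values
# mirror the text above; exact kwarg combinations are illustrative.
DECODING_STRATEGIES = {
    "conservative": dict(num_beams=5, no_repeat_ngram_size=3,
                         temperature=0.6),
    "balanced":     dict(num_beams=8, length_penalty=1.2,
                         temperature=0.7),
    "precise":      dict(num_beams=6, repetition_penalty=1.1,
                         temperature=0.5),
    "creative":     dict(do_sample=True, top_k=50, top_p=0.92,
                         temperature=0.8),
}

def generate_variants(model, inputs, max_length: int = 128) -> dict:
    """Run all four strategies on one input; returns {name: token ids}."""
    return {name: model.generate(**inputs, max_length=max_length, **kwargs)
            for name, kwargs in DECODING_STRATEGIES.items()}

print(sorted(DECODING_STRATEGIES))
```

Quality scoring then ranks the four candidates, with the "balanced" wide-beam output serving as the default selection.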

Live Demo

The model is deployed as an interactive Hugging Face Space. Enter any English or Yoruba text to translate — or use auto-detect mode:

Live translation demo — supports English → Yoruba, Yoruba → English, and auto language detection with diacritical marks.

API Reference

The Flask API supports language detection, translation with variants, and cache management:

POST /translate — Python Example
import requests

# English to Yoruba
response = requests.post("http://localhost:7860/translate", json={
    "text":        "I love learning new languages",
    "direction":   "en2yo",  # or "yo2en" or "auto"
    "num_outputs": 4
})

result = response.json()
print(result["output"])          # "Mo nífẹ̀ẹ́ kíkọ́ èdè tuntun"
print(result["all_translations"]) # All 4 variants with quality scores

# Auto-detect Yoruba input
response = requests.post("http://localhost:7860/translate", json={
    "text":      "Ẹ káàsán",
    "direction": "auto"
})
print(response.json()["output"])  # "Good afternoon"

Additional endpoints: GET /health (model status), POST /detect-language (language identification only), POST /clear-cache (cache management).

Tech Stack

AfriTeVa V2 — Base Model
PyTorch — Training
Transformers — HF Library
Flask — REST API
Docker — Deployment
Hugging Face — Hosting

Tags: Python · AfriTeVa V2 · PyTorch · Transformers · SentencePiece · Flask · Docker · HuggingFace Spaces · BLEU / METEOR · Low-Resource NLP · African Languages

Known Limitations

Idiomatic expressions: May translate literally rather than capturing cultural nuance — Yoruba has rich proverbial expressions that don't map cleanly to English equivalents.

Code-switching: Limited support for mixed Yoruba-English sentences — a common pattern in real-world Nigerian communication that the training data underrepresents.

Training data bias: The corpus skews toward formal and religious text. Colloquial speech, modern slang, and informal registers may produce lower-quality outputs.

Diacritical input dependency: For best Yoruba → English results, input text should include proper Yoruba diacritical marks. Plain text without tone marks will produce lower accuracy.

Future work includes expanding to other Nigerian languages (Igbo, Hausa), adding speech-to-text and text-to-speech integration, and fine-tuning on domain-specific corpora (medical, legal, technical).

Work with Adediran Adeyemi

Building NLP for underrepresented languages?

I build production NLP systems that work in the real world — including low-resource and African language contexts. First call is free.