Project Overview
Yoruba is spoken by over 45 million people across Nigeria, Benin, and the diaspora — yet it remains severely underserved by mainstream translation tools like Google Translate, which struggle with Yoruba's tonal diacritical marks and idiomatic expressions. This project delivers a purpose-built, production-ready alternative.
The model fine-tunes AfriTeVa V2 — a T5-architecture transformer pre-trained on African language corpora — on 101,906 curated Yoruba-English parallel pairs. The result is a bidirectional translation system that handles diacritical marks natively, detects language automatically, and generates four translation variants with quality scoring so users can choose the best rendering.
Why this matters: Mainstream translation tools are trained primarily on high-resource language pairs. Yoruba's diacritics, both the underdotted letters (ẹ, ọ, ṣ) and the tone marks (á, à, é, è), are frequently mangled by general-purpose models. This project contributes a specialized model to the growing ecosystem of African language NLP — freely available on Hugging Face for research and commercial use.
Key Features
Bidirectional Translation
Seamless English → Yoruba and Yoruba → English. A single model handles both directions using task prefixes.
Auto Language Detection
Detects input language using diacritical mark recognition and Yoruba lexical patterns — no manual direction setting required.
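The project's actual detector isn't listed here, but a heuristic in this spirit can be sketched in a few lines of Python. The marker characters and the function-word list below are illustrative, not the project's real tables:

```python
import unicodedata
from collections import Counter

# Characters strongly indicative of Yoruba orthography (underdotted letters).
YORUBA_MARKERS = set("ẹọṣẸỌṢ")
# A handful of common Yoruba function words (illustrative, not exhaustive).
YORUBA_WORDS = {"ni", "ti", "si", "mo", "wa", "ko", "fun", "pe"}

def detect_language(text: str) -> str:
    """Return 'yo' or 'en' using diacritic and lexical heuristics."""
    norm = unicodedata.normalize("NFC", text)
    if any(ch in YORUBA_MARKERS for ch in norm):
        return "yo"
    # Combining grave/acute accents hint at Yoruba tone marks.
    # (Crude: accented loanwords in English text would be misclassified.)
    decomposed = unicodedata.normalize("NFD", norm)
    if "\u0300" in decomposed or "\u0301" in decomposed:
        return "yo"
    tokens = [t.strip(".,!?;:").lower() for t in norm.split()]
    if tokens and sum(t in YORUBA_WORDS for t in tokens) / len(tokens) > 0.3:
        return "yo"
    return "en"
```

Diacritics make this cheap and reliable for well-marked Yoruba input; the lexical fallback is what carries unmarked text.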
4 Translation Variants
Generates 4 candidates using different decoding strategies (beam search, sampling) with automatic quality scoring to rank them.
Hallucination Detection
Flags potentially unreliable translations based on length anomalies and vocabulary repetition before surfacing results.
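A minimal sketch of such heuristics, with illustrative thresholds (the project's actual cutoffs are not documented here):

```python
from collections import Counter

def flag_hallucination(source: str, translation: str,
                       max_len_ratio: float = 3.0,
                       max_repeat_ratio: float = 0.4) -> bool:
    """Flag a translation as potentially unreliable.

    Two cheap signals: a length anomaly (output far longer or shorter
    than the input) and vocabulary repetition (one token dominating
    the output). Thresholds here are illustrative.
    """
    src_tokens = source.split()
    out_tokens = translation.split()
    if not out_tokens:
        return True
    ratio = len(out_tokens) / max(len(src_tokens), 1)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return True
    # Fraction of the output taken up by its most frequent token.
    top = Counter(t.lower() for t in out_tokens).most_common(1)[0][1]
    if len(out_tokens) >= 5 and top / len(out_tokens) > max_repeat_ratio:
        return True
    return False
```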
Smart Text Chunking
Handles long texts via sentence-boundary-aware segmentation — preserves context within sentences when inputs exceed 128 tokens.
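A sentence-boundary-aware chunker along these lines can be sketched as follows; it approximates token counts with whitespace words, whereas the real system would count subword tokens from the model's tokenizer:

```python
import re

MAX_TOKENS = 128  # the model's maximum sequence length

def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split long text into chunks at sentence boundaries.

    Sentences are never split mid-way, so context within a sentence
    is preserved; chunks are greedily filled up to max_tokens.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is translated independently and the outputs are concatenated, which is why context is preserved within, but not across, sentences.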
Diacritical Mark Support
Full support for Yoruba-specific characters (ẹ, ọ, ṣ, á, à, é, è) with educational tooltips in the web UI.
Model Performance
Evaluated on a held-out test set of 10,190 samples — 10% of the full corpus not seen during training:
| Metric | Score | What It Measures |
|---|---|---|
| BLEU | 16.30 | N-gram overlap with reference translation (0–100) |
| BLEU-1 | 48.57 | Unigram precision — word-level match |
| BLEU-4 | 6.64 | 4-gram precision — phrase-level match |
| METEOR | 46.26 | Considers synonyms and paraphrasing — stronger semantic measure |
| ROUGE-1 | 58.83 | Unigram recall — content word coverage |
| ROUGE-2 | 34.34 | Bigram recall — phrase coverage |
| ROUGE-L | 51.73 | Longest common subsequence — structural similarity |
| chrF | 39.28 | Character n-gram F-score — handles morphology well |
| TER | 68.52 | Translation edit rate (lower is better) |
Context on BLEU scores: Low BLEU scores are expected and normal for morphologically rich low-resource languages like Yoruba — even state-of-the-art systems for similar language pairs score in the 10–20 range. The METEOR (46.26) and ROUGE-1 (58.83) scores are the more meaningful indicators here, showing strong semantic preservation and content coverage despite exact phrase-matching challenges.
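To make the BLEU-1 row concrete, here is clipped unigram precision on a toy example. This is a simplification of the full BLEU computation, which also applies a brevity penalty and aggregates counts over the whole corpus:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the core of BLEU-1.

    Each candidate word is credited at most as many times as it
    appears in the reference (the 'clipping' in BLEU), so degenerate
    outputs that repeat one word cannot score highly.
    """
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)
```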
Sample Translations
Architecture
Base Model: AfriTeVa V2
AfriTeVa V2 (by Masakhane NLP) is a sequence-to-sequence transformer built on the T5 architecture and pre-trained on a multilingual African language corpus. Choosing AfriTeVa over standard mT5 or mBART significantly improves out-of-the-box performance on African languages because the pretraining data distribution is already aligned with the target domain — rather than being dominated by high-resource European languages.
- Architecture: T5-style encoder-decoder transformer (~300M parameters, large variant)
- Tokenizer: SentencePiece with 250K vocabulary — handles Yoruba diacritics natively
- Task prefix: "translate English to Yoruba: {text}" or "translate Yoruba to English: {text}"
- Max sequence length: 128 tokens (longer texts handled by smart chunking)
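The task-prefix convention above can be wrapped in a small helper before tokenization; `build_model_input` is a hypothetical name used here for illustration:

```python
# Task prefixes expected by the fine-tuned T5-style model.
PREFIXES = {
    "en2yo": "translate English to Yoruba: ",
    "yo2en": "translate Yoruba to English: ",
}

def build_model_input(text: str, direction: str) -> str:
    """Prepend the task prefix that tells the model which direction to translate."""
    if direction not in PREFIXES:
        raise ValueError(f"unknown direction: {direction!r}")
    return PREFIXES[direction] + text
```

The prefixed string is then tokenized with truncation at 128 tokens; longer inputs go through the chunker first.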
Training Dataset
101,906 Yoruba-English parallel pairs curated from four high-quality sources and split temporally to prevent leakage:
Data Sources
- JW300 — Jehovah's Witnesses religious translations (large, clean, formal register)
- FFR — Foundation for Endangered Languages corpus
- Menyo-20k — Community-contributed general-domain pairs
- Custom corpus — Manually curated sentence pairs for colloquial coverage
Data Quality Measures
- Removed duplicates and near-duplicates
- Filtered sentences with extreme length ratios (>3:1 source/target)
- Validated Yoruba diacritical mark encoding and normalization
- Removed machine-translated contamination from training set
- Checked proper Unicode encoding throughout
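The length-ratio filter, for example, can be sketched as a simple word-count check (the ratio is computed on whitespace tokens here; the actual pipeline may count characters or subwords):

```python
def passes_length_ratio(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair only if its source/target length ratio is <= 3:1.

    Extreme ratios usually indicate misaligned pairs, truncated
    segments, or untranslated passthrough text.
    """
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio
```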
Data bias note: The training corpus skews toward formal and religious text (JW300 is the largest source). The model performs best on formal/standard Yoruba and may underperform on colloquial speech, modern slang, or code-switching sentences mixing Yoruba and English.
Training Details
Fine-tuning ran for 18 epochs with early stopping (patience=3). Total training time: 5.36 hours on an NVIDIA GPU with CUDA.
{
  "max_sequence_length": 128,
  "batch_size": 32,
  "gradient_accumulation_steps": 2,
  "effective_batch_size": 64,
  "learning_rate": 3e-5,
  "warmup_steps": 300,
  "num_epochs": 18,
  "early_stopping_patience": 3,
  "optimizer": "AdamW",
  "scheduler": "Linear with warmup",
  "checkpoint_interval": 6000
}

(checkpoint_interval is measured in optimizer steps)
- Gradient accumulation: 2 steps with batch_size=32 yields an effective batch of 64 — improving training stability without requiring more GPU memory
- Warmup: 300 steps prevent early large gradient updates on a pre-trained model
- Early stopping: Triggered after 3 epochs without validation BLEU improvement — best checkpoint saved at the optimal epoch
- Hyperparameter search: Final config selected after testing LRs [1e-5, 3e-5, 5e-5], batch sizes [16, 32, 64], and max lengths [64, 128, 256]
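The batch arithmetic in these notes can be checked directly. The training-set size below assumes everything outside the 10% test split was used for training; since the validation split size isn't stated, the step count is illustrative:

```python
import math

corpus_size = 101_906
test_size = 10_190                    # 10% held-out test set
train_size = corpus_size - test_size  # assumption: remainder used for training

batch_size = 32
grad_accum = 2
effective_batch = batch_size * grad_accum  # 32 * 2 = 64

# Approximate optimizer steps per epoch at the effective batch size.
steps_per_epoch = math.ceil(train_size / effective_batch)
```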
4 Decoding Strategies
Each query generates four translation candidates using different decoding configurations — then quality scoring selects the best. Users can also view all variants:
Beam Search (5 beams)
Temperature 0.6, no_repeat_ngram_size=3. Safe, predictable, grammatically conservative output.
Wide Beam (8 beams)
Temperature 0.7, length_penalty=1.2. Natural phrasing with balanced creativity — the default best selection.
Low Temperature (6 beams)
Temperature 0.5, repetition_penalty=1.1. Literal and focused — prioritizes accuracy over fluency.
Sampling (top-k=50)
Temperature 0.8, top_p=0.92. Diverse and natural — captures conversational register better than beam search.
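Under the Hugging Face transformers `generate()` API, the four strategies can be expressed as keyword sets like the following sketch. Parameter values mirror the list above; `max_new_tokens` and the helper `generate_variants` are illustrative additions:

```python
# The four decoding configurations described above, as keyword-argument
# sets for model.generate(). Note: transformers ignores `temperature`
# for pure beam search unless do_sample=True, so the temperatures on the
# beam strategies are listed here as documented but may be inert.
DECODING_STRATEGIES = {
    "beam_search": dict(num_beams=5, temperature=0.6, no_repeat_ngram_size=3),
    "wide_beam": dict(num_beams=8, temperature=0.7, length_penalty=1.2),
    "low_temperature": dict(num_beams=6, temperature=0.5, repetition_penalty=1.1),
    "sampling": dict(do_sample=True, top_k=50, top_p=0.92, temperature=0.8),
}

def generate_variants(model, tokenizer, text, max_new_tokens=128):
    """Run all four strategies and return {strategy_name: translation}."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    variants = {}
    for name, kwargs in DECODING_STRATEGIES.items():
        ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **kwargs)
        variants[name] = tokenizer.decode(ids[0], skip_special_tokens=True)
    return variants
```

Quality scoring then ranks the four candidates, with the wide-beam output serving as the default selection.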
Live Demo
The model is deployed as an interactive Hugging Face Space. Enter any English or Yoruba text to translate — or use auto-detect mode:
API Reference
The Flask API supports language detection, translation with variants, and cache management:
import requests

# English to Yoruba
response = requests.post("http://localhost:7860/translate", json={
    "text": "I love learning new languages",
    "direction": "en2yo",  # or "yo2en" or "auto"
    "num_outputs": 4,
})
result = response.json()
print(result["output"])            # "Mo nífẹ̀ẹ́ kíkọ́ èdè tuntun"
print(result["all_translations"])  # All 4 variants with quality scores

# Auto-detect Yoruba input
response = requests.post("http://localhost:7860/translate", json={
    "text": "Ẹ káàsán",
    "direction": "auto",
})
print(response.json()["output"])  # "Good afternoon"
Additional endpoints: GET /health (model status), POST /detect-language (language identification only), POST /clear-cache (cache management).
Tech Stack
Known Limitations
Idiomatic expressions: May translate literally rather than capturing cultural nuance — Yoruba has rich proverbial expressions that don't map cleanly to English equivalents.
Code-switching: Limited support for mixed Yoruba-English sentences — a common pattern in real-world Nigerian communication that the training data underrepresents.
Training data bias: The corpus skews toward formal and religious text. Colloquial speech, modern slang, and informal registers may produce lower-quality outputs.
Diacritical input dependency: For best Yoruba → English results, input text should include proper Yoruba diacritical marks. Plain text without tone marks will produce lower accuracy.
Future work includes expanding to other Nigerian languages (Igbo, Hausa), adding speech-to-text and text-to-speech integration, and fine-tuning on domain-specific corpora (medical, legal, technical).