Project Overview
Customer reviews are one of the richest sources of unstructured business intelligence available to any product team — but at scale, reading them manually is impossible. This project builds an end-to-end ML pipeline that automatically classifies Moniepoint banking app reviews from the Google Play Store into 16 actionable issue categories (several at once when a review raises several issues), enabling real-time monitoring of what customers love, what frustrates them, and where competitors have gaps.
The key technical insight driving this project is multi-label classification — a single review can belong to multiple categories at once. A review saying "The app is slow and charges are too high" belongs to both "App Crashes or Slow" and "Transaction Charges". Traditional sentiment analysis misses this nuance entirely. This model captures it with a 93.7% F1-micro score.
Custom RoBERTa Model
Fine-tuned on 29,000+ fintech reviews achieving 93.7% F1-micro and 99.69% ROC AUC
Production Deployed
Publicly accessible via Hugging Face Hub and interactive web demo — try it live
Power BI Dashboard
Interactive analytics with sentiment trends, category breakdowns, and KPI monitoring
Actionable Intelligence
Identified 417 critical issue mentions and 857+ competitive strength signals from raw text
Key business finding: Moniepoint's response rate of 81.83% is well above the industry average of ~60% — a genuine customer service advantage. Their core competitive moat is speed: 857+ positive mentions across fast transactions, fast transfers, and reliability since August 2023.
16 Review Categories
The model classifies every review into one or more of 16 categories that cover the full spectrum of fintech app user feedback. A single review can trigger multiple categories simultaneously — this multi-label approach is what makes the system substantially more useful than standard sentiment analysis.
Multi-label example: A review reading "The app is slow and charges are too high" is simultaneously classified as App Crashes or Slow + Transaction Charges. Standard single-label classification would force a choice and lose half the signal. This model captures both.
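Under the hood, each review maps to a binary vector with one slot per category. A minimal sketch using scikit-learn's MultiLabelBinarizer (the same kind of label encoder the published pipeline ships as mlb.joblib), with three illustrative category names standing in for the full 16:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three of the 16 categories, enough to illustrate the encoding
categories = ["App Crashes or Slow", "Transaction Charges", "Login Issues"]
mlb = MultiLabelBinarizer(classes=categories)
mlb.fit([categories])

# "The app is slow and charges are too high" carries TWO labels at once
labels = mlb.transform([["App Crashes or Slow", "Transaction Charges"]])
print(labels)  # [[1 1 0]]
```

A single-label classifier would have to pick one slot; the multi-label target keeps both.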
Model Architecture
The model is built on RoBERTa-base — a robustly optimized BERT pretraining approach from Facebook AI Research. RoBERTa was chosen over standard BERT for its superior handling of diverse language patterns, stronger performance on classification tasks, and better generalization from limited training labels in specialized domains like fintech.
Architecture Specifications
- Base Model: roberta-base (125M parameters)
- Task: Multi-label sequence classification
- Max Sequence Length: 256 tokens
- Output Layer: Sigmoid activation (not softmax) — enabling simultaneous multi-label predictions
- Loss Function: Binary Cross Entropy with Logits Loss
- Architecture: 12 transformer layers, 12 attention heads, 768 hidden dimensions
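A minimal sketch of how these specifications map onto the Hugging Face transformers API. Setting problem_type="multi_label_classification" is what selects BCEWithLogitsLoss during training; the weights below are randomly initialized, since the fine-tuned ones live on the Hub:

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# roberta-base defaults: 12 layers, 12 heads, 768 hidden dims (~125M params)
config = RobertaConfig(
    num_labels=16,
    problem_type="multi_label_classification",  # selects BCEWithLogitsLoss
)
model = RobertaForSequenceClassification(config)  # random init; the Hub model is fine-tuned

# Inference applies sigmoid, giving one independent probability per category
dummy_ids = torch.randint(0, config.vocab_size, (1, 16))
probs = torch.sigmoid(model(input_ids=dummy_ids).logits)
print(probs.shape)  # torch.Size([1, 16])
```

With softmax the 16 probabilities would compete and sum to 1; with sigmoid each category is an independent yes/no decision, which is what makes simultaneous labels possible.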
Data Pipeline
- Source: Google Play Store reviews for Moniepoint Personal Banking App
- Volume: 29,000+ reviews spanning multiple years through September 2025
- Preprocessing: Text cleaning, language filtering (English), duplicate removal, label standardization via DeepSeek
- Temporal split: Training on pre-September 2025 data, test set on September 2025+ reviews — prevents data leakage
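The temporal split can be sketched in a few lines of pandas; the frame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical review frame; column names are assumptions for the sketch
reviews = pd.DataFrame({
    "text": ["Great app", "Crashes constantly", "Fees too high"],
    "date": pd.to_datetime(["2024-03-01", "2025-09-10", "2025-10-02"]),
})

# Everything before September 2025 trains the model; September 2025 onward
# is held out, so no future review can leak into training
cutoff = pd.Timestamp("2025-09-01")
train_df = reviews[reviews["date"] < cutoff]
test_df = reviews[reviews["date"] >= cutoff]
print(len(train_df), len(test_df))  # 1 2
```

A random split would let near-duplicate reviews from the same week land on both sides; splitting on time keeps the test set genuinely unseen.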
Model Performance
The model achieved exceptional results on the held-out temporal test set — data the model had never seen during training:
| Metric | Score | Interpretation |
|---|---|---|
| F1 Micro | 93.73% | Label-level F1 pooled over every (review, category) decision |
| F1 Macro | 62.74% | Unweighted average — reflects rare category challenge |
| Precision Micro | 95.36% | When model predicts a label, it's correct 95% of the time |
| Recall Micro | 92.15% | Model catches 92% of all true label occurrences |
| ROC AUC Micro | 99.69% | Near-perfect category discrimination ability |
Industry context: The 93.7% F1-micro score exceeds typical industry benchmarks for multi-label NLP classification tasks (85–90%). The 99.69% ROC AUC indicates the model has near-perfect ability to distinguish between categories — a strong signal of generalizability to unseen reviews.
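How the micro and macro averages behave can be seen on a toy example; the numbers below are illustrative, not the project's data. Micro averaging pools every (review, category) decision into one count, while macro averages per-category scores, which is why rare categories drag F1-macro down:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# 3 reviews x 4 categories, with one false positive in column 2
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0]])
probs  = np.array([[0.9, 0.6, 0.8, 0.1],
                   [0.3, 0.7, 0.2, 0.6],
                   [0.8, 0.9, 0.4, 0.2]])
y_pred = (probs >= 0.5).astype(int)

print("F1 micro:", f1_score(y_true, y_pred, average="micro"))   # pooled decisions
print("F1 macro:", f1_score(y_true, y_pred, average="macro"))   # mean of per-category F1
print("ROC AUC micro:", roc_auc_score(y_true, probs, average="micro"))
```

ROC AUC is threshold-free: it scores how well the raw probabilities rank true labels above false ones, which is why it can sit near 1.0 even when F1-macro is pulled down by sparse categories.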
Training Details
Training was configured for up to 6 epochs with early stopping (patience=2) and halted after Epoch 4. The best model was selected at Epoch 2 based on F1-micro score — a classic example of early stopping preventing overfitting while preserving generalization:
Training Configuration
- Optimizer: AdamW with learning rate 2e-5
- Batch sizes: 8 (training), 16 (evaluation)
- Warmup steps: 500
- Weight decay: 0.01 for regularization
- Hardware: CUDA-enabled GPU
- Early stopping patience: 2 epochs — triggered after Epoch 4 showed rising validation loss
Live Demo
The model is deployed as an interactive Hugging Face Space. Paste any app review text and receive instant multi-label category predictions with confidence scores.
Power BI Analytics Dashboard
The Power BI dashboard provides visual business intelligence on top of the model's classification output — translating 29,000 categorized reviews into executive-ready insights:
- Review volume trends over time with sentiment trajectory
- Category breakdown and frequency — which issues are growing vs. declining
- Customer service response metrics (81.83% response rate, 29.3-hour average response time)
- Sentiment score correlation with star ratings
- Interactive filters for deep-dive by date range, category, and sentiment
API Usage
The model is publicly available on Hugging Face Hub and can be integrated into any Python application for batch review processing:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
import joblib
from huggingface_hub import hf_hub_download

REPO_ID = "adeyemi001/Multi-Labelled-Review-Categorization-Model"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model, tokenizer, and the label binarizer used during training
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID).to(DEVICE)
model.eval()
mlb = joblib.load(hf_hub_download(REPO_ID, "model/mlb.joblib"))

def predict(texts, threshold=0.5):
    if isinstance(texts, str):
        texts = [texts]
    enc = tokenizer(texts, truncation=True, padding=True,
                    max_length=256, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        logits = model(**enc).logits.cpu().numpy()
    probs = 1 / (1 + np.exp(-logits))  # sigmoid: independent per-label probabilities
    bins = (probs >= threshold).astype(int)
    return list(mlb.inverse_transform(bins)), probs

# Example
reviews = [
    "App crashes every time I try to transfer money.",
    "Please add dark mode, and why are charges so high?",
    "Fast transfers and excellent customer service!",
]
preds, probs = predict(reviews)
for r, p in zip(reviews, preds):
    print(f"Review: {r}\nCategories: {p}\n")
Threshold tuning: The default threshold of 0.5 balances precision and recall. Use 0.3 for higher recall (catch more issues at cost of some false positives) or 0.7 for high-precision deployment where false positives are costly.
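A quick sketch of that trade-off, with a hypothetical sigmoid output over four categories:

```python
import numpy as np

probs = np.array([0.82, 0.41, 0.66, 0.12])  # hypothetical sigmoid outputs

for threshold in (0.3, 0.5, 0.7):
    flagged = int((probs >= threshold).sum())
    print(f"threshold={threshold}: {flagged} categories flagged")
# 0.3 flags 3 categories, 0.5 flags 2, 0.7 flags 1
```

Lowering the threshold surfaces borderline labels like the 0.41 category (higher recall); raising it keeps only the confident 0.82 call (higher precision).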
Critical Issues Identified
The model surfaced 417 total critical issue mentions across Moniepoint's reviews — ranked by frequency and sentiment impact to prioritize engineering and product attention.
Competitive Strengths Identified
Beyond issues, the model surfaces what customers love — the competitive advantages that should be amplified in marketing and product strategy. Speed is Moniepoint's dominant positive signal with 857 combined mentions.
Strategic positioning: 857 speed-related positive mentions represent a durable competitive moat. "Fast" should be Moniepoint's core brand pillar in marketing — it's not a claimed differentiator, it's a customer-validated one. The data also confirms an 81.83% review response rate — well above the ~60% industry average.
Strategic Recommendations
Deploy Real-Time Review Monitoring
Implement continuous ingestion from Play Store and App Store with automated alerting when critical categories (App Not Opening, Failed Transactions, Login Issues) exceed baseline thresholds. The 217 "App Not Opening" mentions represent an immediate churn risk that real-time monitoring would catch within hours, not weeks.
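One way such an alerting rule could look, as a sketch; the baselines, multiplier, and the check_alerts helper are all illustrative, not part of the deployed system:

```python
# Weekly mention baselines per category (illustrative numbers)
BASELINES = {"App Not Opening": 12, "Failed Transactions": 8, "Login Issues": 7}
ALERT_MULTIPLIER = 2.0  # alert when volume doubles over its baseline

def check_alerts(weekly_counts):
    """Return categories whose weekly mention count breaches 2x baseline."""
    return [cat for cat, count in weekly_counts.items()
            if count > BASELINES.get(cat, float("inf")) * ALERT_MULTIPLIER]

alerts = check_alerts({"App Not Opening": 30, "Login Issues": 9})
print(alerts)  # ['App Not Opening']
```

Fed by the classifier's daily output, a rule like this turns the review stream into an operational pager signal rather than a monthly report.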
Prioritize Engineering on Tier 1 Issues
The 417 combined critical issue mentions should drive sprint planning directly. App Not Opening (217), Login (101), and Failed Transactions (99) represent complete access failures — users experiencing these are highly likely to uninstall. Set SLA targets for each category and instrument post-deployment monitoring against these baselines.
Address Fee Perception with Transparency
223 fee-related complaints signal a perception problem that may not require a pricing change — it may require better value communication. Test in-app fee calculators, comparison tools, and clearer transaction breakdowns. Measure impact on subsequent review sentiment in the fee-related categories.
Leverage Speed as the Core Marketing Message
857 customer-validated speed mentions make "fast" Moniepoint's most credible differentiator. Build marketing campaigns directly from positive review language — these are authentic customer voices that resonate with prospects experiencing slow competitors.
Extend to Competitive Intelligence
Deploy the same model on OPay, PalmPay, and Kuda reviews to create a continuous competitive intelligence system. Monthly reports comparing issue prevalence and strength mentions would show exactly where Moniepoint is outperforming and where it has market gaps to exploit.
Future Work
Model Improvements
- Multilingual support: Extend to Pidgin English and major Nigerian languages — a meaningful portion of app store reviews use non-standard English that the current model may misclassify
- Continuous learning pipeline: Implement active learning where customer success agents validate predictions, continuously improving accuracy on new issue patterns
- App version correlation: Link review categories to specific app release versions to create a "quality gate" metric for release management
Business Applications
- Churn risk scoring: Combine review categories with behavioral data to build a customer-level churn probability score triggered by specific negative review patterns
- Automated ticket routing: Integrate with customer support to auto-route incoming support tickets to the correct team based on classified issue type
- Predictive analytics: Time-series forecasting of issue volume spikes based on historical patterns and release cycles, enabling proactive engineering response