adeyemi@adediranadeyemi.com +234 816 273 5399
NLP · Transformers · Multi-Label Classification

Turning 29,000 App Reviews into Actionable Intelligence

A RoBERTa-based multi-label NLP classifier that automatically categorizes Moniepoint banking app reviews into 16 issue types, achieves a 93.7% F1-micro score, and is deployed live on Hugging Face alongside a Power BI analytics dashboard.

Tools
Transformers · RoBERTa · Power BI · HuggingFace
Industry
Fintech · Banking · Nigeria
Type
NLP · Multi-Label Classification · BI Dashboard
93.7% F1-micro accuracy on held-out test set
29K+ Reviews analyzed spanning multiple years
16 Issue categories classified simultaneously
99.7% ROC AUC — near-perfect category separation

Project Overview

Customer reviews are one of the richest sources of unstructured business intelligence available to any product team — but at scale, reading them manually is impossible. This project builds an end-to-end ML pipeline that automatically classifies Moniepoint banking app reviews from the Google Play Store into 16 actionable issue categories simultaneously, enabling real-time monitoring of what customers love, what frustrates them, and where competitors have gaps.

The key technical insight driving this project is multi-label classification: a single review can belong to multiple categories at once. A review saying "The app is slow and charges are too high" belongs to both "App Crashes or Slow" and "Transaction Charges". Traditional sentiment analysis misses this nuance entirely; this model captures it with a 93.7% F1-micro score.

Custom RoBERTa Model

Fine-tuned on 29,000+ fintech reviews achieving 93.7% F1-micro and 99.69% ROC AUC

Production Deployed

Publicly accessible via Hugging Face Hub and interactive web demo — try it live

Power BI Dashboard

Interactive analytics with sentiment trends, category breakdowns, and KPI monitoring

Actionable Intelligence

Identified 417 critical issue mentions and 857+ competitive strength signals from raw text

Key business finding: Moniepoint's response rate of 81.83% is well above the industry average of ~60% — a genuine customer service advantage. Their core competitive moat is speed: 857+ positive mentions across fast transactions, fast transfers, and reliability since August 2023.

16 Review Categories

The model classifies every review into one or more of 16 categories that cover the full spectrum of fintech app user feedback. A single review can trigger multiple categories simultaneously — this multi-label approach is what makes the system substantially more useful than standard sentiment analysis.

Account Registration 01
App Installation Issues 02
App Crashes or Slow 03
App Not Opening 04
Customer Inquiry 05
Customer Support 06
Failed Transaction 07
Feature Requests 08
General Feedback 09
Login & Account Access 10
Network Failure 11
Other 12
Password Issues 13
Transaction Charges 14
UI / UX 15
USSD Issues 16

Multi-label example: A review reading "The app is slow and charges are too high" is simultaneously classified as App Crashes or Slow + Transaction Charges. Standard single-label classification would force a choice and lose half the signal. This model captures both.
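This is why the output layer matters. Using illustrative (invented) logits for the example review, a per-label sigmoid lets several categories clear the 0.5 threshold independently, while a softmax would force all categories to compete for a single probability mass:

```python
import numpy as np

# Hypothetical raw logits for the example review over four of the 16
# categories (illustrative numbers, not actual model output).
labels = ["App Crashes or Slow", "Transaction Charges", "UI / UX", "Other"]
logits = np.array([2.1, 1.4, -1.8, -3.0])

# Sigmoid scores each label independently, so several can exceed 0.5.
sigmoid = 1 / (1 + np.exp(-logits))

# Softmax normalizes the scores to sum to 1, so labels compete and
# only the single highest-scoring category would win.
softmax = np.exp(logits) / np.exp(logits).sum()

print([l for l, p in zip(labels, sigmoid) if p >= 0.5])
# Both "App Crashes or Slow" and "Transaction Charges" clear the threshold.
```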

Model Architecture

The model is built on RoBERTa-base — a robustly optimized BERT pretraining approach from Facebook AI Research. RoBERTa was chosen over standard BERT for its superior handling of diverse language patterns, stronger performance on classification tasks, and better generalization from limited training labels in specialized domains like fintech.

Architecture Specifications

  • Base Model: roberta-base (125M parameters)
  • Task: Multi-label sequence classification
  • Max Sequence Length: 256 tokens
  • Output Layer: Sigmoid activation (not softmax) — enabling simultaneous multi-label predictions
  • Loss Function: Binary Cross Entropy with Logits Loss
  • Architecture: 12 transformer layers, 12 attention heads, 768 hidden dimensions
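The specs above can be sketched with a toy classification head. The key point is that Binary Cross Entropy with Logits scores each of the 16 labels through an independent sigmoid, so labels never compete (a minimal sketch in plain PyTorch; the real model fine-tunes the full 125M-parameter RoBERTa encoder behind this head):

```python
import torch
import torch.nn as nn

NUM_LABELS = 16

# Stand-in for the classification head on top of RoBERTa's 768-dim
# pooled output (the encoder itself is omitted in this sketch).
head = nn.Linear(768, NUM_LABELS)

pooled = torch.randn(4, 768)          # a batch of 4 encoded reviews
logits = head(pooled)                 # one logit per label, per review

# Multi-hot targets: each review may switch on several labels at once,
# e.g. indices 2 and 13 = "App Crashes or Slow" + "Transaction Charges".
targets = torch.zeros(4, NUM_LABELS)
targets[0, [2, 13]] = 1.0

# BCEWithLogitsLoss applies an independent sigmoid per label, unlike the
# softmax cross-entropy used for mutually exclusive single-label tasks.
loss = nn.BCEWithLogitsLoss()(logits, targets)
probs = torch.sigmoid(logits)         # per-label probabilities in [0, 1]
print(probs.shape)                    # torch.Size([4, 16])
```

In the transformers library, passing `problem_type="multi_label_classification"` to `AutoModelForSequenceClassification` wires up this same loss automatically.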

Data Pipeline

  • Source: Google Play Store reviews for Moniepoint Personal Banking App
  • Volume: 29,000+ reviews spanning multiple years through September 2025
  • Preprocessing: Text cleaning, language filtering (English), duplicate removal, label standardization via DeepSeek
  • Temporal split: Training on pre-September 2025 data, test set on September 2025+ reviews — prevents data leakage
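The temporal split can be sketched in pandas as follows (the `review` and `date` column names are assumptions for illustration; the pipeline's actual schema isn't shown here):

```python
import pandas as pd

# Hypothetical review dataframe with a timestamp per review.
df = pd.DataFrame({
    "review": ["Fast transfers!", "App keeps crashing", "Why the high charges?"],
    "date": pd.to_datetime(["2024-03-10", "2025-06-02", "2025-09-15"]),
})

# Temporal split: everything before September 2025 trains the model;
# everything from September 2025 onward becomes the held-out test set.
cutoff = pd.Timestamp("2025-09-01")
train_df = df[df["date"] < cutoff]
test_df = df[df["date"] >= cutoff]

# Unlike a random split, no review from the test period can leak into
# training, so metrics reflect genuinely unseen future data.
assert train_df["date"].max() < test_df["date"].min()
```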

Model Performance

The model achieved exceptional results on the held-out temporal test set — data the model had never seen during training:

Metric | Score | Interpretation
F1 Micro | 93.73% | Overall classification performance across all labels
F1 Macro | 62.74% | Unweighted per-category average; reflects the rare-category challenge
Precision Micro | 95.36% | When the model predicts a label, it is correct 95% of the time
Recall Micro | 92.15% | The model catches 92% of all true label occurrences
ROC AUC Micro | 99.69% | Near-perfect category discrimination ability

Industry context: The 93.7% F1-micro score exceeds typical industry benchmarks for multi-label NLP classification tasks (85–90%). The 99.69% ROC AUC indicates the model has near-perfect ability to distinguish between categories — a strong signal of generalizability to unseen reviews.
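The micro/macro gap in the table is worth illustrating. Micro-F1 pools every (review, label) decision, so frequent categories dominate; macro-F1 averages per-label F1 equally, so rare, harder categories drag it down. A toy example with invented predictions over three labels:

```python
import numpy as np
from sklearn.metrics import f1_score

# Multi-hot ground truth and predictions for 4 reviews x 3 labels
# (illustrative numbers, not actual model output).
y_true = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 1, 1],   # one false positive on the rare middle label
                   [1, 0, 0]])

# Labels 0 and 2 are predicted perfectly; the rare label 1 scores F1 = 0.
print(f1_score(y_true, y_pred, average="micro"))  # ≈ 0.91
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.67
```

The same mechanism explains the spread between the 93.73% micro and 62.74% macro scores above: a handful of rare categories pull the macro average down while barely denting the pooled micro score.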

Training Details

Training was configured for up to 6 epochs with early stopping (patience=2) and halted after Epoch 4. The best checkpoint was selected at Epoch 2 by F1-micro score, a classic case of early stopping preventing overfitting while preserving generalization:

Epoch | F1-micro | Val loss
1 | 91.3% | 0.0300
2 | 93.7% | 0.0259 ✓ Best Model
3 | 93.7% | 0.0262
4 | 93.4% | 0.0293 ↑

Training Configuration

  • Optimizer: AdamW with learning rate 2e-5
  • Batch sizes: 8 (training), 16 (evaluation)
  • Warmup steps: 500
  • Weight decay: 0.01 for regularization
  • Hardware: CUDA-enabled GPU
  • Early stopping patience: 2 epochs — triggered after Epoch 4 showed rising validation loss
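Under the transformers Trainer API, that configuration might look like the following (a sketch: `output_dir` and the `f1_micro` metric name are assumptions, and `eval_strategy` is spelled `evaluation_strategy` in older transformers releases):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Training configuration mirroring the settings listed above.
args = TrainingArguments(
    output_dir="roberta-review-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    warmup_steps=500,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # restores the Epoch-2 checkpoint
    metric_for_best_model="f1_micro",
)

# Stops training once f1_micro fails to improve for 2 consecutive epochs;
# here, that fired after Epoch 4's rising validation loss.
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```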

Live Demo

The model is deployed as an interactive Hugging Face Space. Paste any app review text and receive instant multi-label category predictions with confidence scores:

Interactive classification demo — paste any customer review to get real-time multi-label category predictions.

Power BI Analytics Dashboard

The Power BI dashboard provides visual business intelligence on top of the model's classification output — translating 29,000 categorized reviews into executive-ready insights:

  • Review volume trends over time with sentiment trajectory
  • Category breakdown and frequency — which issues are growing vs. declining
  • Customer service response metrics (81.83% response rate, 29.3-hour average response time)
  • Sentiment score correlation with star ratings
  • Interactive filters for deep-dive by date range, category, and sentiment
Interactive Power BI dashboard — use filters to explore review categories, sentiment trends, and KPIs across the full dataset.

API Usage

The model is publicly available on Hugging Face Hub and can be integrated into any Python application for batch review processing:

Quick Start — Predict Review Categories
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import joblib

REPO_ID = "adeyemi001/Multi-Labelled-Review-Categorization-Model"
DEVICE  = "cuda" if torch.cuda.is_available() else "cpu"

# Load model, tokenizer, and label binarizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model     = AutoModelForSequenceClassification.from_pretrained(REPO_ID).to(DEVICE)
mlb       = joblib.load(hf_hub_download(REPO_ID, "model/mlb.joblib"))

def predict(texts, threshold=0.5):
    if isinstance(texts, str): texts = [texts]
    enc = tokenizer(texts, truncation=True, padding=True,
                    max_length=256, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        logits = model(**enc).logits.cpu().numpy()
    probs   = 1 / (1 + np.exp(-logits))
    bins    = (probs >= threshold).astype(int)
    return [mlb.inverse_transform([b])[0] for b in bins], probs

# Example
reviews = [
    "App crashes every time I try to transfer money.",
    "Please add dark mode, and why are charges so high?",
    "Fast transfers and excellent customer service!"
]
preds, probs = predict(reviews)
for r, p in zip(reviews, preds):
    print(f"Review: {r}\nCategories: {p}\n")

Threshold tuning: The default threshold of 0.5 balances precision and recall. Use 0.3 for higher recall (catch more issues at cost of some false positives) or 0.7 for high-precision deployment where false positives are costly.
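A quick sketch of how the threshold moves that trade-off, using invented per-label probabilities for a single review:

```python
import numpy as np

# Illustrative per-label probabilities (not actual model output).
labels = np.array(["Failed Transaction", "Transaction Charges",
                   "Customer Support", "UI / UX"])
probs = np.array([0.91, 0.42, 0.34, 0.08])

# Lower thresholds admit more labels (recall), higher ones fewer (precision).
for threshold in (0.3, 0.5, 0.7):
    predicted = labels[probs >= threshold]
    print(threshold, list(predicted))
# 0.3 keeps three candidate labels; 0.5 and 0.7 keep only the strongest.
```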

Critical Issues Identified

The model surfaced 417 total critical issue mentions across Moniepoint's reviews — ranked by frequency and sentiment impact to prioritize engineering and product attention:

Tier 1 — Critical System Failures
  • App Not Opening (217 mentions): complete user access failure; highest churn risk
  • Login Issues (101 mentions): authentication barriers; user abandonment risk
  • Failed Transactions (99 mentions): core functionality failure; trust erosion

Tier 2 — Financial Concerns
  • Transaction Charges (117 mentions): pricing competitiveness; competitive disadvantage
  • High Charges (106 mentions): value perception; price-sensitivity signals

Tier 3 — Performance Issues
  • Slow App Performance (96 mentions)
  • Account Access Issues (76 mentions)
  • Account Restrictions (74 mentions)

Competitive Strengths Identified

Beyond issues, the model surfaces what customers love — the competitive advantages that should be amplified in marketing and product strategy. Speed is Moniepoint's dominant positive signal with 857 combined mentions:

  • Fast Transactions: 297 mentions
  • Fast Transfers: 282 mentions
  • Speed (general): 278 mentions
  • Ease of Use: 187 mentions
  • Reliability: 183 mentions

Strategic positioning: 857 speed-related positive mentions represent a durable competitive moat. "Fast" should be Moniepoint's core brand pillar in marketing — it's not a claimed differentiator, it's a customer-validated one. The data also confirms an 81.83% review response rate — well above the ~60% industry average.

Strategic Recommendations

1. Deploy Real-Time Review Monitoring

Implement continuous ingestion from Play Store and App Store with automated alerting when critical categories (App Not Opening, Failed Transactions, Login Issues) exceed baseline thresholds. The 217 "App Not Opening" mentions represent an immediate churn risk that real-time monitoring would catch within hours, not weeks.
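One way such alerting could be sketched: compare each day's classified counts against a per-category baseline (the baseline numbers below are invented for illustration; category names come from the model's label set):

```python
from collections import Counter

# Hypothetical daily baselines for the critical categories.
BASELINES = {
    "App Not Opening": 5,
    "Failed Transaction": 4,
    "Login & Account Access": 3,
}

def check_alerts(todays_categories):
    """Return critical categories whose daily count exceeds baseline."""
    counts = Counter(todays_categories)
    return [cat for cat, baseline in BASELINES.items()
            if counts[cat] > baseline]

# Simulated day: a spike of "App Not Opening" classifications.
today = ["App Not Opening"] * 9 + ["Failed Transaction"] * 2
print(check_alerts(today))  # ['App Not Opening']
```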

2. Prioritize Engineering on Tier 1 Issues

The 417 combined critical issue mentions should drive sprint planning directly. App Not Opening (217), Login (101), and Failed Transactions (99) represent complete access failures — users experiencing these are highly likely to uninstall. Set SLA targets for each category and instrument post-deployment monitoring against these baselines.

3. Address Fee Perception with Transparency

223 fee-related complaints signal a perception problem that may not require a pricing change — it may require better value communication. Test in-app fee calculators, comparison tools, and clearer transaction breakdowns. Measure impact on subsequent review sentiment in the fee-related categories.

4. Leverage Speed as the Core Marketing Message

857 customer-validated speed mentions make "fast" Moniepoint's most credible differentiator. Build marketing campaigns directly from positive review language — these are authentic customer voices that resonate with prospects experiencing slow competitors.

5. Extend to Competitive Intelligence

Deploy the same model on OPay, PalmPay, and Kuda reviews to create a continuous competitive intelligence system. Monthly reports comparing issue prevalence and strength mentions would show exactly where Moniepoint is outperforming and where it has market gaps to exploit.

RoBERTa Transformers Multi-Label Classification HuggingFace NLP Power BI Web Scraping Sentiment Analysis Fintech Analytics Python PyTorch

Future Work

Model Improvements

  • Multilingual support: Extend to Pidgin English and major Nigerian languages — a meaningful portion of app store reviews use non-standard English that the current model may misclassify
  • Continuous learning pipeline: Implement active learning where customer success agents validate predictions, continuously improving accuracy on new issue patterns
  • App version correlation: Link review categories to specific app release versions to create a "quality gate" metric for release management

Business Applications

  • Churn risk scoring: Combine review categories with behavioral data to build a customer-level churn probability score triggered by specific negative review patterns
  • Automated ticket routing: Integrate with customer support to auto-route incoming support tickets to the correct team based on classified issue type
  • Predictive analytics: Time-series forecasting of issue volume spikes based on historical patterns and release cycles, enabling proactive engineering response

Work with Adediran Adeyemi

Thousands of customer reviews — and no idea what they're saying?

I build NLP systems that turn unstructured customer feedback into prioritized, actionable intelligence your product team can actually use. First call is free.