adeyemi@adediranadeyemi.com +234 816 273 5399
E-Commerce Analytics · Python

How a 15% Return Rate Was Eroding 28% of Revenue

A deep-dive Python analysis of customer orders and returns across 42,000+ items, identifying the demographic segments, product categories, and behavioral signals driving significant revenue loss.

Tools: Python · Pandas · Statsmodels · SHAP
Industry: E-Commerce & Retail
Type: EDA · Logistic Regression · ML Explainability
[Image: E-commerce product returns analysis showing return rate patterns and revenue loss, by Adediran Adeyemi]
15% Overall product return rate
28% Revenue loss attributed to returns
42K+ Items analyzed across 3,274 orders
2,740 Unique customers in the dataset

Project Overview

Understanding how customers behave is critical for enhancing satisfaction, optimizing operations, and protecting revenue. This project examines customer orders and returns across an e-commerce business to identify the trends, demographic patterns, and product-level signals that drive return behavior.

The analysis combines exploratory data analysis, statistical segmentation, logistic regression modeling, and SHAP value interpretation to deliver both descriptive insights and predictive understanding of what makes a return more or less likely.

Why this matters: A 15% return rate sounds manageable — until it translates to 28% of revenue lost. This analysis makes the invisible visible: which customers, which products, and which behavioral patterns are costing the most.

Data Dictionary

The dataset spans customer demographics, order transactions, product details, logistics, and derived time features:

Column                Description
user_id               Unique identifier for each customer
age                   Age of the customer
gender                Gender of the customer (Male / Female)
city                  City where the customer resides
traffic_source        Source through which the customer arrived (e.g., Ads, Organic Search)
order_id              Unique identifier for each order
status                Order status (e.g., Delivered, Returned)
product_id            Unique identifier for each product
product_category      Category of the product (e.g., Accessories, Activewear)
product_retail_price  Retail price of the product
cost                  Cost incurred for the product
sale_price            Price at which the product was sold
returned_at           Timestamp when the product was returned
dc_name               Distribution center handling the order
dc2c_distance         Distance between distribution center and customer
prep_time             Time taken to prepare the order for shipment
delivery_time         Time taken to deliver the order
total_time            Total time from order creation to delivery
num_of_item           Number of items in the order

Executive Summary

The 2,740 unique customers in the dataset placed 3,274 orders, which produced 6,789 item returns. The return rate of approximately 15% carries a large financial cost: an estimated 28% of revenue is lost to returned items. A repeat purchase rate of around 16% further signals that customer retention is an untapped opportunity.

2,740 Unique Customers
3,274 Total Orders
6,789 Total Returns
42,106 Items Sold
15% Return Rate
28% Revenue Lost to Returns
16% Repeat Purchase Rate
1.2% Repeat Returners

Revenue alert: Returns are counted per item, so 6,789 returns against 3,274 orders works out to roughly two returned items for every order placed. Multi-item returns are therefore common, and that pattern compounds the financial impact.

Overall Metrics

The following key rates were calculated from the full dataset:

  • Return Rate: 15.04%, high enough that over a quarter of total sales revenue is lost
  • Repeat Purchase Rate: 15.84%, so although some customers come back, the large majority do not make a second purchase
  • Revenue Loss from Returns: 28.13%, meaning more than a quarter of revenue is absorbed by returned items
  • Percentage of Sales Reversed: 14.58%, the proportion of completed sales that are subsequently reversed
  • Percentage of Repeat Returners: 1.17%, a small but notable group of customers who return products habitually

The combination of a high return rate and a low repeat purchase rate suggests a systemic satisfaction gap — customers are not finding what they expected from their purchases, and most are not coming back to try again.
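The source does not show how these rates were computed, so the following is only a minimal sketch of one plausible set of definitions. The rows are invented for illustration, the column names follow the data dictionary, and the exact formulas (for example, counting a repeat purchaser as a customer with more than one distinct order) are assumptions.

```python
import pandas as pd

# Hypothetical item-level rows; column names follow the data dictionary.
items = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3, 3, 4, 5],
    "order_id":   [10, 10, 11, 12, 13, 13, 14, 15],
    "status":     ["Delivered", "Returned", "Delivered", "Delivered",
                   "Returned", "Delivered", "Delivered", "Delivered"],
    "sale_price": [20.0, 60.0, 15.0, 30.0, 45.0, 10.0, 25.0, 35.0],
})

returned = items["status"].eq("Returned")

# Assumed definition: share of items that were returned.
return_rate = returned.mean()

# Assumed definition: share of sale value attached to returned items.
revenue_loss = items.loc[returned, "sale_price"].sum() / items["sale_price"].sum()

# Assumed definition: share of customers with more than one distinct order.
orders_per_user = items.groupby("user_id")["order_id"].nunique()
repeat_purchase_rate = (orders_per_user > 1).mean()

print(f"Return rate:          {return_rate:.2%}")
print(f"Revenue loss:         {revenue_loss:.2%}")
print(f"Repeat purchase rate: {repeat_purchase_rate:.2%}")
```

In this toy data the revenue share lost is larger than the item share returned because the returned items are priced above average, which is one way a 15% return rate can sit alongside a 28% revenue loss.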

Returns by Age Group & Gender

Returns were segmented by age group and gender to identify which customer cohorts drive the highest return volumes:

Age Group   Female Returns   Male Returns   Total
<18                    312            270     582
18–24                  414            409     823
25–34                  528            448     976
35–44                  601            456   1,057
45–54                  481            365     846
55–64                  543            590   1,133
65+                    289            322     611
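A segmentation like the one in the table above can be sketched with pandas as follows. The bin edges are inferred from the age-group labels, and the sample rows are invented for illustration; neither is taken from the original notebook.

```python
import pandas as pd

# Hypothetical returned-item rows (one row per returned item).
returns = pd.DataFrame({
    "age":    [16, 22, 30, 38, 41, 50, 58, 61, 70, 17],
    "gender": ["Female", "Male", "Female", "Female", "Male",
               "Female", "Male", "Male", "Female", "Male"],
})

# Bin edges assumed from the labels; pd.cut uses right-closed intervals,
# so age 17 falls in "<18" and age 18 falls in "18-24".
bins = [0, 17, 24, 34, 44, 54, 64, 200]
labels = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
returns["age_group"] = pd.cut(returns["age"], bins=bins, labels=labels)

# Count returns per age group and gender, then add a row total.
table = (
    returns.groupby(["age_group", "gender"], observed=True)
    .size()
    .unstack(fill_value=0)
)
table["Total"] = table.sum(axis=1)
print(table)

# Largest single age-gender segment in the toy data.
top_segment = table.drop(columns="Total").stack().idxmax()
print(top_segment)
```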

35–44 Female: Highest Single Segment

Females aged 35–44 generate the most returns of any demographic segment — 601 returns — suggesting strong sizing or expectation mismatches in products targeting this group.

55–64 Males: Outlier Pattern

Male customers aged 55–64 have the highest return count (590) among all male cohorts — an unexpected result that warrants investigation into the specific product categories they purchase.

Young Customers Return Frequently Too

The under-18 and 18–24 groups show substantial return activity, pointing to possible product suitability or expectation-setting issues for younger shoppers.

Middle-Age Peak Across Both Genders

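The 25–44 range shows consistently elevated returns for both genders (976 returns in the 25–34 band and 1,057 in the 35–44 band), making it the broadest opportunity for targeted intervention across sizing, product description accuracy, and post-purchase support. The highest single band total, 1,133, belongs to the 55–64 group and is driven by its male returns.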

Returns by Product Category & Gender

Return volumes were broken down by product category and gender to identify which items drive the highest return rates for each group:

Product Category                Female Returns   Male Returns
Intimates                                  436              0
Fashion Hoodies & Sweatshirts              224            267
Dresses                                    253              0
Accessories                                137            215
Outwear and Coats                          108            251
Sweaters                                   177            204
Swim                                       184            204
Jeans                                      214            192
Sleep and Loungewear                       147            216
Tops and Tees                              153            183
Pants                                        0            241
Underwear                                    0            208
Socks                                        0            199
Plus                                       188              0
Active                                     129            152
Shorts                                     106            185
Suits and Sport Coats                        0            143
Blazers and Jackets                        112              0
Socks and Hosiery                          127              0
Pants and Capris                           121              0
Maternity                                  117              0
Leggings                                    97              0
Skirts                                      73              0
Jumpsuits and Rompers                       27              0
Suits                                       29              0
Clothing Sets                                9              0

Key Category Insights

  • Intimates drive the highest female returns (436) — likely due to sizing inconsistencies or inadequate fit guidance at the point of purchase.
  • Fashion Hoodies & Sweatshirts and Accessories are high-return categories for both genders, suggesting shared sizing or quality concerns.
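  • Pants (241), Underwear (208), and Socks (199) are the largest male-only return categories, where fit and style expectations likely play a significant role. Across all categories, male returns are highest for Fashion Hoodies & Sweatshirts (267) and Outwear and Coats (251).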
  • Several categories — Blazers, Dresses, Clothing Sets — show zero male returns, confirming gender-specific purchasing and return patterns that should inform merchandising strategy.

Returns by Product Category & Age Group

Combining product categories with age groups reveals more granular patterns in where interventions would be most impactful:

Highest-Impact Findings

  • Intimates (35–44): The largest single product-age return cluster in the dataset — a clear priority for sizing improvements and virtual fit tools.
  • Fashion Hoodies & Sweatshirts (25–34, 35–44, 55–64): Returns spread across three age bands, suggesting a product-level quality or consistency issue rather than a demographic-specific one.
  • Jeans (18–24 and 35–44): Two distinct peaks suggest different fit preferences by generation — potentially addressable with better size guidance per age cohort.
  • Outwear and Coats (25–34): The 25–34 group drives the highest return volumes in this category — possible fit or seasonal expectation issues.
  • Accessories (18–24 and 55–64): Return peaks at opposite ends of the age spectrum indicate this category has inconsistent expectations across the customer base.
  • Swim (25–34 and 55–64): Returns concentrated in these two groups may reflect sizing inconsistencies between product lines targeting different demographics.

Pattern: Middle-aged groups (25–44) show the broadest elevated return patterns across the most product categories. This is the demographic segment where targeted interventions — improved size guides, virtual try-on, or pre-purchase consultation — would generate the greatest reduction in return volume.
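One way to separate demographic-specific problems from product-level ones is to look at how concentrated each category's returns are in its peak age group. The sketch below uses invented rows and a hypothetical age_group column; it illustrates the idea and is not the analysis that produced the findings above.

```python
import pandas as pd

# Hypothetical returned-item rows with a precomputed age_group column.
returns = pd.DataFrame({
    "product_category": ["Jeans", "Jeans", "Jeans", "Swim", "Swim", "Swim",
                         "Intimates", "Intimates", "Intimates", "Intimates"],
    "age_group":        ["18-24", "18-24", "35-44", "25-34", "55-64", "55-64",
                         "35-44", "35-44", "35-44", "25-34"],
})

# Category x age-group matrix of return counts.
counts = (
    returns.groupby(["product_category", "age_group"])
    .size()
    .unstack(fill_value=0)
)

summary = pd.DataFrame({
    "peak_age_group": counts.idxmax(axis=1),   # age group with the most returns
    "peak_returns": counts.max(axis=1),
    # Share of the category's returns that fall in the peak group.
    # A high share points to a demographic-specific issue;
    # a low share points to a product-level issue spread across ages.
    "peak_share": counts.max(axis=1) / counts.sum(axis=1),
})
print(summary.sort_values("peak_returns", ascending=False))
```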

Logistic Regression Analysis

A logistic regression model was built to identify which variables have a statistically significant relationship with the likelihood of a product being returned. The model was run on 21,947 observations using maximum likelihood estimation.

Logit Regression Results
Dep. Variable:     status_binary     No. Observations:    21,947
Model:                     Logit     Df Residuals:        21,937
Method:                      MLE     Pseudo R-squ.:      0.001676
Converged:                  True     LLR p-value:       1.940e-06

====================================================================================
                        coef      std err     z       P>|z|
------------------------------------------------------------------------------------
const               -0.6339       0.070    -8.996     0.000  ***
delivery_time    -3.876e-06    7.27e-06    -0.533     0.594
age                 -0.0028       0.001    -3.169     0.002  **
gender              -0.0305       0.031    -0.984     0.325
city             -3.774e-06     4.3e-05    -0.088     0.930
product_category    -0.0048       0.002    -2.412     0.016  *
product_retail_price 0.0013       0.001     0.891     0.373
num_of_item         -0.0422       0.015    -2.890     0.004  **
revenue             -0.0017       0.003    -0.641     0.522
dc2c_distance    -4.569e-05    1.33e-05    -3.445     0.001  **
====================================================================================
*** p < 0.001   ** p < 0.01   * p < 0.05

Statistically Significant Predictors

  • Age (coef = −0.0028, p < 0.01): Older customers are slightly less likely to return products — a small but reliable negative association with return probability.
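  • Product Category (coef = −0.0048, p < 0.05): Return likelihood varies significantly with product category, confirming that return risk is not uniformly distributed across the catalog. Because the category enters the model as a single numeric code, the sign of the coefficient depends on how the categories were encoded and does not by itself identify which categories are lower-risk.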
  • Number of Items (coef = −0.0422, p < 0.01): Larger orders are slightly less likely to result in a return — potentially because multi-item shoppers have stronger purchase intent or more reliable sizing knowledge.
  • Distribution Center Distance (coef = −0.00004569, p < 0.01): Greater distance from the distribution center is associated with fewer returns — possibly because customers who wait longer for delivery are less likely to return items when they arrive.
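Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The short calculation below uses the coefficients from the regression table; the rescaling to 10 years of age and 1,000 distance units is an illustrative choice, and the unit of dc2c_distance is not stated in the source.

```python
import math

# Coefficients copied from the regression table (log-odds per one-unit change).
coefs = {
    "age": -0.0028,
    "product_category": -0.0048,
    "num_of_item": -0.0422,
    "dc2c_distance": -4.569e-05,
}

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:18s} odds ratio per unit: {ratio:.5f}")

# Rescale the smallest coefficients to more meaningful increments.
age_per_decade = math.exp(10 * coefs["age"])
distance_per_1000 = math.exp(1000 * coefs["dc2c_distance"])
print(f"age, per 10 years:           {age_per_decade:.4f}")
print(f"dc2c_distance, per 1,000:    {distance_per_1000:.4f}")
```

Read this way, each additional item in an order lowers the odds of a return by about 4%, and ten additional years of customer age lower them by just under 3%, which puts the "slightly less likely" wording above into concrete terms.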

Non-Significant Variables

Delivery time, gender, city, product retail price, and revenue did not show a statistically significant impact on return likelihood in this model. This is a notable finding: apart from age, return behavior appears to be associated more with product-level and order-level factors than with price point, gender, or location.

Model note: The pseudo R-squared of 0.0017 indicates that the logistic regression barely improves on an intercept-only baseline. The model identifies statistically significant associations, but it has very little predictive power; more flexible models (Random Forest, Gradient Boosting) and richer features would need to be evaluated before any predictive deployment.

SHAP Values Interpretation

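SHAP (SHapley Additive exPlanations) values were computed to understand the contribution of each feature to the model's return predictions. Unlike regression coefficients, which describe the model as a whole, SHAP values attribute each individual prediction to the input features and can then be aggregated across all observations into a feature ranking. The feature set here differs from the regression above: it includes raw timestamp and identifier columns such as returned_at, created_at, delivered_at, and product_id.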

Feature               SHAP value
returned_at           0.3690
revenue               0.0057
age                   0.0032
product_id            0.0027
product_retail_price  0.0027
delivered_at          0.0023
city                  0.0023
created_at            0.0023
dc2c_distance         0.0020
delivery_time         0.0016
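The source does not say which model or SHAP explainer produced these values. As a minimal illustration of what a SHAP value is, the sketch below uses the closed form for a linear model with features treated as independent: the contribution of feature j to one prediction is its coefficient times the feature's deviation from its mean, measured in log-odds. The coefficients come from the regression table; the feature matrix is synthetic.

```python
import numpy as np

# Synthetic feature matrix; values are invented for illustration.
rng = np.random.default_rng(0)
feature_names = ["age", "num_of_item", "dc2c_distance"]
X = np.column_stack([
    rng.integers(16, 80, 1000),
    rng.integers(1, 5, 1000),
    rng.uniform(0, 3000, 1000),
]).astype(float)

# Coefficients taken from the regression table (log-odds scale).
coefs = np.array([-0.0028, -0.0422, -4.569e-05])

# Linear SHAP under feature independence: coef_j * (x_ij - mean_j).
shap_values = coefs * (X - X.mean(axis=0))

# Global importance: mean absolute SHAP value per feature.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(feature_names, mean_abs_shap), key=lambda pair: -pair[1])
for name, value in ranking:
    print(f"{name:15s} {value:.4f}")

# Additivity check: each row's SHAP values sum to the difference between
# that row's predicted log-odds and the average predicted log-odds.
log_odds = X @ coefs
assert np.allclose(shap_values.sum(axis=1), log_odds - log_odds.mean())
```

A ranking built this way depends on both the coefficient and the spread of the feature, which is why a tiny coefficient on a wide-ranging variable such as distance can still rank alongside a larger coefficient on a narrow one.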

Key SHAP Insights

  • returned_at dominates (SHAP = 0.369): The return timestamp is by far the most influential feature, roughly 65 times the next-largest value. Because returned_at is recorded only when an item is actually returned, this dominance most likely reflects leakage of the outcome into the features rather than a genuine timing signal, and the column should be excluded from any predictive model.
  • Revenue, Age, Product ID (next largest): These features have the next-highest values (0.0057, 0.0032, 0.0027), though all are small in absolute terms. The age result is consistent with the logistic regression; revenue was not significant there, and product_id is an identifier rather than the product category.
  • All other features show low absolute SHAP values: Their contribution is marginal, so the current feature set carries little signal beyond the return timestamp itself.

Modeling implication: Timing may still matter, but it has to be captured with features that are known before a return happens, such as order month, promotional or holiday-season flags, and delivery duration. Whether such features improve a production return prediction model should be tested once returned_at is removed.
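Per the data dictionary, returned_at is the timestamp at which a product was returned, so before treating it as a predictor it is worth checking whether it simply encodes the outcome. The check below is a sketch on invented rows; the list of columns to drop is an assumption about what would not be known when an order is placed.

```python
import pandas as pd

# Hypothetical rows; returned_at is empty unless the item was returned.
items = pd.DataFrame({
    "status": ["Delivered", "Returned", "Delivered", "Returned", "Delivered"],
    "returned_at": pd.to_datetime(
        [None, "2023-01-10", None, "2023-02-03", None]
    ),
    "age": [34, 41, 58, 23, 67],
    "num_of_item": [1, 3, 2, 1, 4],
})

target = items["status"].eq("Returned")
has_return_timestamp = items["returned_at"].notna()

# If this is 1.0, the column reveals the outcome and must not be a feature.
leak_agreement = (has_return_timestamp == target).mean()
print(f"Agreement between returned_at presence and status: {leak_agreement:.0%}")

# Keep only columns that would be known when the order is placed (assumed list).
leaky_columns = ["status", "returned_at"]
features = items.drop(columns=leaky_columns)
print(list(features.columns))
```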

Recommendations

Enhance Product Quality and Fit Information

  • Focus quality improvements on Intimates, Dresses, and Fashion Hoodies & Sweatshirts, the three categories with the highest female return volumes; Fashion Hoodies & Sweatshirts also has the highest combined total across both genders (491)
  • Implement detailed size guides with customer measurements, not just S/M/L labels
  • Add user-generated fit photos and verified size reviews for high-return SKUs

Targeted Customer Support by Segment

  • Deploy personalized pre-purchase assistance for the 25–44 age group — the segment with the broadest elevated return pattern
  • Offer styling advice or virtual fitting tools specifically for the 35–44 female segment, which drives the highest single-segment return volume
  • Investigate the anomalous 55–64 male return spike — conduct qualitative research to understand what is driving returns in this group

Optimize Distribution and Logistics

  • The significant negative association between dc2c_distance and returns merits further investigation — understand whether longer-distance customers receive different service levels
  • Consider localized distribution strategies for high-return geographies, but pilot them first: delivery time was not a significant predictor of returns in the regression, so shorter delivery times alone may not reduce return volume

Improve Product Descriptions and Imagery

  • Audit product descriptions for accuracy against actual sizing and material — particularly for the Intimates and Jeans categories
  • Require multi-angle product photography and on-body model diversity to reduce expectation gaps at purchase

Leverage Predictive Analytics

  • Build a return-probability score at the order level using the identified significant variables plus engineered time features that are available when the order is placed
  • Use this score to trigger proactive interventions — pre-return outreach, exchange offers, or personalized support — before customers initiate a return

Limitations & Further Research

Data Limitations

  • The dataset lacks customer satisfaction scores or post-return feedback — which would significantly improve the ability to diagnose why products are returned, not just who returns them
  • No information on whether returned items were resold, discounted, or written off — which affects the true revenue impact calculation

Model Constraints

  • The logistic regression pseudo R-squared of 0.0017 indicates limited explanatory power in the current formulation
  • Random Forests, Gradient Boosting, or neural approaches with engineered temporal features may yield better predictive performance, which would need to be confirmed on held-out data

Suggested Further Research

  • Conduct qualitative interviews with high-return customer segments to understand root causes in their own words
  • Investigate seasonality and promotional activity as moderating factors — return rates may spike predictably around sale events or holiday periods
  • Explore the role of customer reviews and product ratings in predicting returns — negative review sentiment often precedes return spikes
  • Model the financial impact of specific interventions (e.g., adding a size guide) using A/B test data to prioritize investments
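For the A/B testing suggestion, a two-proportion z-test is one simple way to judge whether an intervention changed the return rate. The counts below are hypothetical and exist only to show the mechanics; no such experiment is reported in this analysis.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical experiment: items shipped and returned in each arm.
returned = [150, 120]   # control, variant with a size guide
shipped = [1000, 1000]

# Two-sided test of equal return rates in the two arms.
z_stat, p_value = proportions_ztest(count=returned, nobs=shipped)

control_rate = returned[0] / shipped[0]
variant_rate = returned[1] / shipped[1]
print(f"Control return rate: {control_rate:.1%}")
print(f"Variant return rate: {variant_rate:.1%}")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

Multiplying the difference in return rates by the average value of a returned item would then give a rough estimate of revenue recovered per intervention, which is the figure needed to rank investments.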

Work with Adediran Adeyemi

Are product returns quietly draining your e-commerce revenue?

I help e-commerce businesses identify what's driving returns and build data-backed strategies to reduce them. First call is free.