Project Overview
Understanding how customers behave is critical for enhancing satisfaction, optimizing operations, and protecting revenue. This project examines customer orders and returns across an e-commerce business to identify the trends, demographic patterns, and product-level signals that drive return behavior.
The analysis combines exploratory data analysis, statistical segmentation, logistic regression modeling, and SHAP value interpretation to deliver both descriptive insights and predictive understanding of what makes a return more or less likely.
Why this matters: A 15% return rate sounds manageable — until it translates to 28% of revenue lost. This analysis makes the invisible visible: which customers, which products, and which behavioral patterns are costing the most.
Data Dictionary
The dataset spans customer demographics, order transactions, product details, logistics, and derived time features:
| Column | Description |
|---|---|
| user_id | Unique identifier for each customer |
| age | Age of the customer |
| gender | Gender of the customer (Male / Female) |
| city | City where the customer resides |
| traffic_source | Source through which the customer arrived (e.g., Ads, Organic Search) |
| order_id | Unique identifier for each order |
| status | Order status (e.g., Delivered, Returned) |
| product_id | Unique identifier for each product |
| product_category | Category of the product (e.g., Accessories, Activewear) |
| product_retail_price | Retail price of the product |
| cost | Cost incurred for the product |
| sale_price | Price at which the product was sold |
| returned_at | Timestamp when the product was returned |
| dc_name | Distribution center handling the order |
| dc2c_distance | Distance between distribution center and customer |
| prep_time | Time taken to prepare the order for shipment |
| delivery_time | Time taken to deliver the order |
| total_time | Total time from order creation to delivery |
| num_of_item | Number of items in the order |
Executive Summary
Out of 2,740 unique customers, there were 3,274 total orders resulting in 6,789 item returns. The return rate of approximately 15% corresponds to an estimated 28% of revenue lost to returned items. A repeat purchase rate of around 16% further signals that customer retention is an untapped opportunity.
Revenue alert: Returned items (6,789) outnumber orders (3,274), which indicates that returns are counted per item and that orders often contain more than one returned item. This compounds the financial impact of each affected order.
Overall Metrics
The following key rates were calculated from the full dataset:
- Return Rate: 15.04%. A significant rate; returned items account for over a quarter of total sales revenue.
- Repeat Purchase Rate: 15.84%. Most customers (about 84%) do not make a second purchase.
- Revenue Loss from Returns: 28.13%. More than a quarter of revenue is absorbed by returned items.
- Percentage of Sales Reversed: 14.58%. The proportion of completed sales that are subsequently reversed.
- Percentage of Repeat Returners: 1.17%. A small but notable group of customers who return products habitually.
The combination of a high return rate and a low repeat purchase rate suggests a systemic satisfaction gap — customers are not finding what they expected from their purchases, and most are not coming back to try again.
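The report does not state the formulas behind these rates, so the following is a minimal sketch under assumed definitions: return rate as returned item rows over all item rows, repeat purchase rate as customers with more than one distinct order over all customers, and revenue loss as the sale price of returned rows over total sale price. Column names follow the data dictionary; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict

def overall_metrics(rows):
    """Compute headline rates from item-level rows (one dict per ordered item).

    Assumed definitions (not confirmed by the report):
      return_rate          = returned items / all items
      repeat_purchase_rate = customers with 2+ distinct orders / all customers
      revenue_loss         = sale_price of returned items / total sale_price
    """
    orders_by_user = defaultdict(set)
    returned_items = 0
    total_revenue = 0.0
    returned_revenue = 0.0

    for row in rows:
        orders_by_user[row["user_id"]].add(row["order_id"])
        total_revenue += row["sale_price"]
        if row["status"] == "Returned":
            returned_items += 1
            returned_revenue += row["sale_price"]

    repeat_customers = sum(1 for orders in orders_by_user.values() if len(orders) > 1)
    return {
        "return_rate": returned_items / len(rows),
        "repeat_purchase_rate": repeat_customers / len(orders_by_user),
        "revenue_loss": returned_revenue / total_revenue,
    }

# Invented sample rows, for illustration only.
sample_rows = [
    {"user_id": 1, "order_id": 10, "status": "Delivered", "sale_price": 20.0},
    {"user_id": 1, "order_id": 11, "status": "Returned", "sale_price": 60.0},
    {"user_id": 2, "order_id": 12, "status": "Delivered", "sale_price": 40.0},
    {"user_id": 3, "order_id": 13, "status": "Delivered", "sale_price": 80.0},
]
metrics = overall_metrics(sample_rows)
```

On the sample rows the item-level return rate is 25% while the revenue share of returns is 30%, because the returned item is priced above average. The same mechanism could explain how a 15% return rate coincides with a 28% revenue loss, although the report does not confirm these definitions.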
Returns by Age Group & Gender
Returns were segmented by age group and gender to identify which customer cohorts drive the highest return volumes:
| Age Group | Female Returns | Male Returns | Total |
|---|---|---|---|
| <18 | 312 | 270 | 582 |
| 18–24 | 414 | 409 | 823 |
| 25–34 | 528 | 448 | 976 |
| 35–44 | 601 | 456 | 1,057 |
| 45–54 | 481 | 365 | 846 |
| 55–64 | 543 | 590 | 1,133 |
| 65+ | 289 | 322 | 611 |
35–44 Female: Highest Single Segment
Females aged 35–44 generate the most returns of any demographic segment — 601 returns — suggesting strong sizing or expectation mismatches in products targeting this group.
55–64 Males: Outlier Pattern
Male customers aged 55–64 have the highest return count (590) among all male cohorts — an unexpected result that warrants investigation into the specific product categories they purchase.
Young Customers Return Frequently Too
The under-18 and 18–24 groups account for 582 and 823 returns respectively, pointing to possible product suitability or expectation-setting issues for younger shoppers.
Middle-Age Peak Across Both Genders
The 25–34 and 35–44 bands together account for 2,033 returns (976 and 1,057), with elevated counts for both genders. This range represents the broadest opportunity for targeted intervention across sizing, product description accuracy, and post-purchase support. The single highest band total, however, belongs to the 55–64 group (1,133).
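The segmentation above can be sketched as a binning step followed by a count per (age band, gender) pair. This is an illustrative, dependency-free version; the band boundaries are inferred from the table labels, the function names are hypothetical, and the rows are invented.

```python
from collections import Counter

# Inclusive upper bounds and labels inferred from the age bands in the table
# above (ASCII hyphens are used in the labels here).
AGE_BANDS = [(17, "<18"), (24, "18-24"), (34, "25-34"), (44, "35-44"),
             (54, "45-54"), (64, "55-64")]

def age_group(age):
    """Map an integer age to its band label; anything above 64 is '65+'."""
    for upper, label in AGE_BANDS:
        if age <= upper:
            return label
    return "65+"

def returns_by_age_and_gender(rows):
    """Count returned items per (age band, gender) pair."""
    counts = Counter()
    for row in rows:
        if row["status"] == "Returned":
            counts[(age_group(row["age"]), row["gender"])] += 1
    return counts

# Invented rows, for illustration only.
sample_rows = [
    {"age": 38, "gender": "Female", "status": "Returned"},
    {"age": 44, "gender": "Female", "status": "Returned"},
    {"age": 59, "gender": "Male", "status": "Returned"},
    {"age": 59, "gender": "Male", "status": "Delivered"},
]
counts = returns_by_age_and_gender(sample_rows)
```

Raw counts like these do not adjust for how many items each cohort ordered. Dividing each cell by the number of items ordered by that cohort would turn the counts into return rates, which are easier to compare across segments of different size.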
Returns by Product Category & Gender
Return volumes were broken down by product category and gender to identify which items drive the highest return rates for each group:
| Product Category | Female Returns | Male Returns |
|---|---|---|
| Intimates | 436 | 0 |
| Fashion Hoodies & Sweatshirts | 224 | 267 |
| Dresses | 253 | 0 |
| Accessories | 137 | 215 |
| Outwear and Coats | 108 | 251 |
| Sweaters | 177 | 204 |
| Swim | 184 | 204 |
| Jeans | 214 | 192 |
| Sleep and Loungewear | 147 | 216 |
| Tops and Tees | 153 | 183 |
| Pants | 0 | 241 |
| Underwear | 0 | 208 |
| Socks | 0 | 199 |
| Plus | 188 | 0 |
| Active | 129 | 152 |
| Shorts | 106 | 185 |
| Suits and Sport Coats | 0 | 143 |
| Blazers and Jackets | 112 | 0 |
| Socks and Hosiery | 127 | 0 |
| Pants and Capris | 121 | 0 |
| Maternity | 117 | 0 |
| Leggings | 97 | 0 |
| Skirts | 73 | 0 |
| Jumpsuits and Rompers | 27 | 0 |
| Suits | 29 | 0 |
| Clothing Sets | 9 | 0 |
Key Category Insights
- Intimates drive the highest female returns (436) — likely due to sizing inconsistencies or inadequate fit guidance at the point of purchase.
- Fashion Hoodies & Sweatshirts and Accessories are high-return categories for both genders, suggesting shared sizing or quality concerns.
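- Pants (241), Underwear (208), and Socks (199) are the largest male-only return categories, where fit and style expectations likely play a significant role. Across all categories, male returns are highest for Fashion Hoodies & Sweatshirts (267) and Outwear and Coats (251).
- Several categories (for example Blazers and Jackets, Dresses, and Clothing Sets) show zero male returns, and others (Pants, Underwear, Socks) show zero female returns. This is consistent with gender-specific purchasing and return patterns that should inform merchandising strategy.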
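Because the ranking changes depending on whether female, male, or combined counts are used, a short sketch of the three rankings may help. The counts are copied from the category table above (a subset of rows); the helper name `top_categories` is hypothetical.

```python
# (female_returns, male_returns) copied from the category table above;
# only a subset of rows is included to keep the example short.
RETURNS = {
    "Intimates": (436, 0),
    "Fashion Hoodies & Sweatshirts": (224, 267),
    "Dresses": (253, 0),
    "Accessories": (137, 215),
    "Outwear and Coats": (108, 251),
    "Jeans": (214, 192),
    "Pants": (0, 241),
    "Underwear": (0, 208),
    "Socks": (0, 199),
}

def top_categories(returns, key, n=3):
    """Return the n category names with the largest value of key(female, male)."""
    ranked = sorted(returns, key=lambda name: key(*returns[name]), reverse=True)
    return ranked[:n]

top_female = top_categories(RETURNS, key=lambda f, m: f)
top_male = top_categories(RETURNS, key=lambda f, m: m)
top_combined = top_categories(RETURNS, key=lambda f, m: f + m)

# Categories that only appear on the male side of the table.
male_only = [name for name, (f, m) in RETURNS.items() if f == 0 and m > 0]
```

The female ranking is led by Intimates, Dresses, and Fashion Hoodies & Sweatshirts; the male ranking by Fashion Hoodies & Sweatshirts, Outwear and Coats, and Pants; and the combined ranking by Fashion Hoodies & Sweatshirts (491), Intimates (436), and Jeans (406). Stating which ranking a claim refers to avoids ambiguity.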
Returns by Product Category & Age Group
Combining product categories with age groups reveals more granular patterns in where interventions would be most impactful:
Highest-Impact Findings
- Intimates (35–44): The largest single product-age return cluster in the dataset — a clear priority for sizing improvements and virtual fit tools.
- Fashion Hoodies & Sweatshirts (25–34, 35–44, 55–64): Returns spread across three age bands, suggesting a product-level quality or consistency issue rather than a demographic-specific one.
- Jeans (18–24 and 35–44): Two distinct peaks suggest different fit preferences by generation — potentially addressable with better size guidance per age cohort.
- Outwear and Coats (25–34): The 25–34 group drives the highest return volumes in this category — possible fit or seasonal expectation issues.
- Accessories (18–24 and 55–64): Return peaks at opposite ends of the age spectrum indicate this category has inconsistent expectations across the customer base.
- Swim (25–34 and 55–64): Returns concentrated in these two groups may reflect sizing inconsistencies between product lines targeting different demographics.
Pattern: The 25–34 and 35–44 groups show elevated returns across the most product categories. This is the segment where targeted interventions (improved size guides, virtual try-on, or pre-purchase consultation) are most likely to reduce return volume.
Logistic Regression Analysis
A logistic regression model was built to identify which variables have a statistically significant relationship with the likelihood of a product being returned. The model was run on 21,947 observations using maximum likelihood estimation.
| Field | Value |
|---|---|
| Dep. Variable | status_binary |
| Model | Logit |
| Method | MLE |
| Converged | True |
| No. Observations | 21,947 |
| Df Residuals | 21,937 |
| Pseudo R-squ. | 0.001676 |
| LLR p-value | 1.940e-06 |

| Variable | coef | std err | z | P>\|z\| | Sig. |
|---|---|---|---|---|---|
| const | -0.6339 | 0.070 | -8.996 | 0.000 | \*\*\* |
| delivery_time | -3.876e-06 | 7.27e-06 | -0.533 | 0.594 | |
| age | -0.0028 | 0.001 | -3.169 | 0.002 | \*\* |
| gender | -0.0305 | 0.031 | -0.984 | 0.325 | |
| city | -3.774e-06 | 4.3e-05 | -0.088 | 0.930 | |
| product_category | -0.0048 | 0.002 | -2.412 | 0.016 | \* |
| product_retail_price | 0.0013 | 0.001 | 0.891 | 0.373 | |
| num_of_item | -0.0422 | 0.015 | -2.890 | 0.004 | \*\* |
| revenue | -0.0017 | 0.003 | -0.641 | 0.522 | |
| dc2c_distance | -4.569e-05 | 1.33e-05 | -3.445 | 0.001 | \*\* |

Significance codes: \*\*\* p < 0.001, \*\* p < 0.01, \* p < 0.05
Statistically Significant Predictors
- Age (coef = −0.0028, p < 0.01): Older customers are slightly less likely to return products. Each additional year of age lowers the log-odds of a return by 0.0028, a small but statistically reliable negative association.
- Product Category (coef = −0.0048, p < 0.05): Return likelihood differs across product categories, so return risk is not uniformly distributed across the catalog. Because the variable enters the model as a single numeric term, it appears to be integer-encoded; if so, the sign and size of the coefficient depend on how category codes were assigned and should not be read as a direction of effect.
- Number of Items (coef = −0.0422, p < 0.01): Larger orders are slightly less likely to result in a return, potentially because multi-item shoppers have stronger purchase intent or more reliable sizing knowledge.
- Distribution Center Distance (coef = −0.00004569, p < 0.01): Greater distance from the distribution center is associated with fewer returns, possibly because customers who wait longer for delivery are less likely to return items when they arrive. Since delivery_time itself is not significant in this model, that explanation remains a hypothesis.
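Logit coefficients are on the log-odds scale, which makes their size hard to judge directly. As a worked illustration, exponentiating a coefficient gives an odds ratio, and the sketch below converts the significant numeric coefficients listed above into percentage changes in the odds of a return. The helper names are hypothetical, and the units of dc2c_distance are not stated in the source.

```python
import math

# Coefficients copied from the regression output above (log-odds scale).
COEFS = {
    "age": -0.0028,
    "num_of_item": -0.0422,
    "dc2c_distance": -4.569e-05,
}

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds of a return for a delta-unit increase."""
    return math.exp(coef * delta)

def percent_change_in_odds(coef, delta=1.0):
    """The same effect expressed as a percentage change in the odds."""
    return (odds_ratio(coef, delta) - 1.0) * 100.0

age_10y = percent_change_in_odds(COEFS["age"], delta=10)                 # ten years older
one_more_item = percent_change_in_odds(COEFS["num_of_item"])             # one extra item
dist_1000 = percent_change_in_odds(COEFS["dc2c_distance"], delta=1000)   # +1,000 distance units
```

Holding the other variables fixed, this works out to roughly 2.8% lower odds of a return per ten years of age, about 4.1% lower odds per additional item, and about 4.5% lower odds per 1,000 units of dc2c_distance. These are changes in odds, not in probability.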
Non-Significant Variables
Delivery time, gender, city, product retail price, and revenue did not show a statistically significant impact on return likelihood in this model. This is a notable finding: within this model, return behavior is more closely associated with age and with product-level and order-level factors than with price, gender, or location.
Model note: The pseudo R-squared of 0.0017 indicates that the logistic regression fits only marginally better than an intercept-only baseline. The model establishes the statistical significance of specific variables, but it has little predictive power; more flexible models (Random Forest, Gradient Boosting) would need to be evaluated before predictive deployment.
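To make the model note concrete, here is a minimal sketch assuming the reported value is McFadden's pseudo R-squared, which is what statsmodels reports for Logit models. The report does not list the log-likelihoods, so the numbers below are hypothetical and the function names are illustrative.

```python
import math

def null_log_likelihood(n_obs, n_positive):
    """Log-likelihood of an intercept-only model for a binary outcome."""
    p = n_positive / n_obs
    return n_positive * math.log(p) + (n_obs - n_positive) * math.log(1.0 - p)

def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null

def llr_statistic(ll_model, ll_null):
    """Likelihood-ratio statistic comparing the fitted model with the null model."""
    return 2.0 * (ll_model - ll_null)

# Hypothetical numbers for illustration; the report does not list log-likelihoods.
ll_null = null_log_likelihood(n_obs=1000, n_positive=150)
ll_model = ll_null * (1.0 - 0.002)   # a fit only 0.2% better than the null model
r2 = mcfadden_pseudo_r2(ll_model, ll_null)
llr = llr_statistic(ll_model, ll_null)
```

The likelihood-ratio statistic equals 2 × pseudo R-squared × |LL_null|, and |LL_null| grows with the number of observations. That is why a very small pseudo R-squared can still come with a very small LLR p-value on 21,947 rows: the predictors are detectably related to returns, but they explain little.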
SHAP Values Interpretation
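SHAP (SHapley Additive exPlanations) values were computed to understand the contribution of each feature to the model's return predictions. Unlike a regression coefficient, which is a single number per feature, a SHAP value is computed per observation and per feature and measures how far that feature moved the prediction away from the average prediction. Averaging the absolute SHAP values across observations gives a global importance ranking.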
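The report does not show how the SHAP values were computed or for which model. As a minimal illustration of the idea, for a linear model with features treated as independent, the SHAP value of a feature for one row is the coefficient times the feature's deviation from its mean, and a global importance is the mean absolute value across rows. The function names and data below are invented; the two coefficients are taken from the regression output only to keep the example familiar.

```python
def linear_shap_values(coefs, rows):
    """SHAP values for a linear model, assuming independent features.

    For row x and feature j: phi_j = coef_j * (x_j - mean_j), in the model's
    output units (log-odds for a logistic regression).
    """
    means = {name: sum(row[name] for row in rows) / len(rows) for name in coefs}
    return [
        {name: coefs[name] * (row[name] - means[name]) for name in coefs}
        for row in rows
    ]

def mean_abs_shap(shap_rows):
    """Global importance: average absolute SHAP value per feature."""
    names = list(shap_rows[0].keys())
    return {
        name: sum(abs(r[name]) for r in shap_rows) / len(shap_rows)
        for name in names
    }

# Invented rows, for illustration only.
coefs = {"age": -0.0028, "num_of_item": -0.0422}
rows = [
    {"age": 20, "num_of_item": 1},
    {"age": 40, "num_of_item": 2},
    {"age": 60, "num_of_item": 3},
]
shap_rows = linear_shap_values(coefs, rows)
importance = mean_abs_shap(shap_rows)
```

In this sketch the importance of a feature depends on both its coefficient and how widely it varies, so a feature with a small coefficient can still rank highly if its values are spread over a large range.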
Key SHAP Insights
- returned_at dominates (SHAP = 0.369): The return timestamp is by far the most influential feature. Because returned_at is, by definition, the timestamp of a return, its importance may reflect target leakage (the field encodes the outcome) rather than a genuine timing signal, and this should be verified before the value is interpreted. Timing variables that are known before a return occurs (seasonality, post-holiday periods, time since delivery) may still carry predictive signal worth engineering into future models.
- Revenue, Age, Product ID (moderate influence): These features contribute meaningfully to the predictions. The age result is consistent with the logistic regression; revenue was not significant there, and product ID was not among the regression inputs.
- All other features show low absolute SHAP values: While they contribute, their impact is marginal compared to the timing signal, which suggests that time-based engineered variables deserve attention in a return prediction model.
Modeling implication: The dominance of returned_at suggests that engineering temporal features (days since delivery, return-season flags, cohort return windows) could improve a production return prediction model, provided each feature is available at the time the prediction is made.
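Since the data dictionary defines returned_at as the timestamp when the product was returned, one sanity check before building on its importance is whether the field is populated only for returned rows. A feature that exists only after the outcome is known cannot be used at prediction time. The sketch below is illustrative: the function name is hypothetical and the rows are invented.

```python
def populated_vs_target(rows, feature="returned_at", target="status", positive="Returned"):
    """Cross-tabulate 'feature is populated' against the target label.

    Returns the four cell counts plus the share of rows where the populated
    flag alone matches the label. A share near 1.0 suggests leakage.
    """
    cells = {"populated_pos": 0, "populated_neg": 0, "missing_pos": 0, "missing_neg": 0}
    for row in rows:
        populated = row.get(feature) is not None
        is_positive = row[target] == positive
        key = ("populated" if populated else "missing") + ("_pos" if is_positive else "_neg")
        cells[key] += 1
    cells["agreement"] = (cells["populated_pos"] + cells["missing_neg"]) / len(rows)
    return cells

# Invented rows, for illustration only.
sample_rows = [
    {"status": "Returned", "returned_at": "2023-01-05 10:00:00"},
    {"status": "Delivered", "returned_at": None},
    {"status": "Delivered", "returned_at": None},
    {"status": "Returned", "returned_at": "2023-02-11 09:30:00"},
]
result = populated_vs_target(sample_rows)
```

If the agreement on the real data is close to 1.0, the raw timestamp should be dropped from the feature set, and only time features known at order or delivery time should be kept.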
Recommendations
Enhance Product Quality and Fit Information
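- Focus quality improvements on Intimates, Dresses, and Fashion Hoodies & Sweatshirts, the three categories with the highest female return volumes (436, 253, and 224). Fashion Hoodies & Sweatshirts also has the highest combined return volume (491).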
- Implement detailed size guides with customer measurements, not just S/M/L labels
- Add user-generated fit photos and verified size reviews for high-return SKUs
Targeted Customer Support by Segment
- Deploy personalized pre-purchase assistance for the 25–44 age group — the segment with the broadest elevated return pattern
- Offer styling advice or virtual fitting tools specifically for the 35–44 female segment, which drives the highest single-segment return volume
- Investigate the anomalous 55–64 male return spike — conduct qualitative research to understand what is driving returns in this group
Optimize Distribution and Logistics
- The significant negative association between dc2c_distance and returns merits further investigation — understand whether longer-distance customers receive different service levels
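- Consider localized distribution strategies for high-return geographies to reduce delivery time and improve product condition on arrival, but validate them first: delivery_time was not a significant predictor, and greater distance was associated with fewer returns, so shorter distances alone may not reduce returns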
Improve Product Descriptions and Imagery
- Audit product descriptions for accuracy against actual sizing and material — particularly for the Intimates and Jeans categories
- Require multi-angle product photography and on-body model diversity to reduce expectation gaps at purchase
Leverage Predictive Analytics
- Build a return-probability score at the order level using the identified significant variables plus engineered time features
- Use this score to trigger proactive interventions — pre-return outreach, exchange offers, or personalized support — before customers initiate a return
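As a sketch of the mechanics only, a return-probability score can be computed by applying the logistic function to the fitted linear predictor. The intercept and coefficients below are copied from the logistic regression output in this report; the feature encodings, the threshold, and the function names are assumptions, and with a pseudo R-squared of 0.0017 this particular model would separate high-risk and low-risk orders only weakly.

```python
import math

# Intercept and coefficients copied from the logistic regression output in this report.
INTERCEPT = -0.6339
COEFS = {
    "delivery_time": -3.876e-06,
    "age": -0.0028,
    "gender": -0.0305,
    "city": -3.774e-06,
    "product_category": -0.0048,
    "product_retail_price": 0.0013,
    "num_of_item": -0.0422,
    "revenue": -0.0017,
    "dc2c_distance": -4.569e-05,
}

def return_probability(features):
    """Logistic score: sigmoid(intercept + sum(coef * value))."""
    z = INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-z))

def needs_intervention(features, threshold=0.3):
    """Flag an order line for proactive outreach; the threshold is a placeholder."""
    return return_probability(features) >= threshold

# Hypothetical, already-encoded feature row. The report does not document how
# gender, city, or product_category were encoded, so these codes are invented.
example = {
    "delivery_time": 4000, "age": 35, "gender": 1, "city": 120,
    "product_category": 7, "product_retail_price": 59.0,
    "num_of_item": 2, "revenue": 59.0, "dc2c_distance": 800,
}
p = return_probability(example)
flag = needs_intervention(example)
```

In practice the threshold would be chosen from the cost of an intervention versus the cost of a return, and the score would come from whichever model performs best on held-out data.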
Limitations & Further Research
Data Limitations
- The dataset lacks customer satisfaction scores or post-return feedback — which would significantly improve the ability to diagnose why products are returned, not just who returns them
- No information on whether returned items were resold, discounted, or written off — which affects the true revenue impact calculation
Model Constraints
- The logistic regression pseudo R-squared of 0.0017 indicates limited explanatory power in the current formulation
- Random Forests, Gradient Boosting, or neural approaches with engineered temporal features may yield better predictive performance and should be evaluated on held-out data
Suggested Further Research
- Conduct qualitative interviews with high-return customer segments to understand root causes in their own words
- Investigate seasonality and promotional activity as moderating factors — return rates may spike predictably around sale events or holiday periods
- Explore the role of customer reviews and product ratings in predicting returns — negative review sentiment often precedes return spikes
- Model the financial impact of specific interventions (e.g., adding a size guide) using A/B test data to prioritize investments
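For the A/B testing suggestion, one common way to compare return rates between a control group and a group that saw an intervention is a two-proportion z-test. The sketch below is a generic implementation using only the standard library; the function name is hypothetical and the counts are invented, not taken from this dataset.

```python
import math

def two_proportion_z_test(returns_a, n_a, returns_b, n_b):
    """Two-sided z-test for a difference in return rates between two groups."""
    p_a = returns_a / n_a
    p_b = returns_b / n_b
    pooled = (returns_a + returns_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value

# Invented counts: control (no size guide) vs. treatment (size guide shown).
z, p_value = two_proportion_z_test(returns_a=150, n_a=1000, returns_b=120, n_b=1000)
```

A financial estimate could then multiply the observed difference in return rate by the number of items sold and the average value of a returned item, which would allow interventions to be ranked by expected savings.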