adeyemi@adediranadeyemi.com +234 816 273 5399
Machine Learning · Conversion Optimization

Predicting Purchase Likelihood from Browsing Behavior

A machine learning case study that identifies which website interactions — reviews, images, specs, warranties — drive purchase decisions, and builds a 99% accurate lead scoring model to help sales teams focus on high-intent prospects.

Tools
Python · Scikit-learn · SMOTE · Streamlit
Industry
E-Commerce · Digital Marketing
Type
Classification ML · EDA · Lead Scoring
99% Model accuracy (SMOTE & Over-Sampling)
58% Purchase rate when images + reviews combined
500 User sessions analyzed
4 Sampling techniques compared

Project Overview

In an increasingly competitive digital marketplace, understanding what drives users to make purchases is critical for optimizing websites and increasing conversion rates. This project investigates the browsing behavior of users and how various on-page actions — viewing images, reading reviews, checking product specifications, comparing similar items — influence their purchasing decisions.

The primary output is a deployed lead scoring model that classifies sessions by purchase likelihood, allowing sales and marketing teams to focus intervention efforts on the highest-intent prospects rather than treating every visitor identically.

Bottom line: Reviews are the single strongest driver of purchase behavior. Users who viewed reviews were significantly more likely to buy — even without images. Combining reviews with images produced the highest conversion rate of any interaction pair at 58.33%.

Data Dictionary

The dataset captures binary interaction signals from 500 user sessions on an e-commerce product page. Each variable records whether a user engaged with a specific page element:

Variable          Description
SESSION_ID        Unique identifier for each user session
IMAGES            Whether the user viewed product images (1 = Yes, 0 = No)
REVIEWS           Whether the user read product reviews (1 = Yes, 0 = No)
FAQ               Whether the user viewed the FAQ section (1 = Yes, 0 = No)
SPECS             Whether the user viewed product specifications (1 = Yes, 0 = No)
SHIPPING          Whether the user viewed shipping information (1 = Yes, 0 = No)
BRO_TOGETHER      Whether the user browsed similar items together (1 = Yes, 0 = No)
COMPARE_SIMILAR   Whether the user compared similar items (1 = Yes, 0 = No)
VIEW_SIMILAR      Whether the user viewed similar items (1 = Yes, 0 = No)
WARRANTY          Whether the user checked product warranty (1 = Yes, 0 = No)
SPONSORED_LINKS   Whether the user clicked on sponsored links (1 = Yes, 0 = No)
BUY               Whether the user made a purchase during the session (1 = Yes, 0 = No) — target variable
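The schema above is straightforward to represent in pandas. The sketch below is illustrative only: the flag values are simulated at random so the snippet runs standalone, and no real filename is assumed.

```python
import numpy as np
import pandas as pd

# Columns follow the data dictionary above; values are simulated
# 0/1 flags standing in for the real 500 recorded sessions.
rng = np.random.default_rng(42)
cols = ["IMAGES", "REVIEWS", "FAQ", "SPECS", "SHIPPING", "BRO_TOGETHER",
        "COMPARE_SIMILAR", "VIEW_SIMILAR", "WARRANTY", "SPONSORED_LINKS", "BUY"]
df = pd.DataFrame(rng.integers(0, 2, size=(500, len(cols))), columns=cols)
df.insert(0, "SESSION_ID", range(1, 501))

print(df.shape)  # (500, 12)
```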

Executive Summary

Out of 500 user sessions, 185 resulted in a purchase (37%) and 315 did not (63%). This class imbalance was a key challenge addressed in the modeling phase. The analysis identifies which browsing behaviors are most predictive of purchase intent — and which combinations create the strongest synergy.

Reviews Drive Purchases

Users who viewed reviews were significantly more likely to buy — even without viewing images. Reviews alone produced a 53.57% purchase rate.

Images + Reviews = Highest Conversion

The combination of images and reviews produced the highest purchase rate of any interaction pair: 58.33% — the strongest signal in the dataset.

Specs Support, Not Lead

Product specifications alone achieve only 25% purchase rate. But paired with reviews, this jumps to 46.43% — specs amplify other signals rather than driving purchases on their own.

Warranty + Specs Passes 50%

Combining warranty information with product specifications crosses the 50% purchase likelihood threshold at 51.85% — the strongest non-reviews combination.

Exploratory Data Analysis

Class Distribution

The target variable showed a meaningful class imbalance — a critical finding that shaped the modeling approach:

BUY Variable Distribution
BUY
0    315    (63.0%) — did not purchase
1    185    (37.0%) — made a purchase
Total: 500 sessions

With a 63/37 split, naive models trained on raw data would be biased toward predicting non-purchases. Four resampling techniques were evaluated to address this: Random Under-Sampling, Random Over-Sampling, SMOTE, and NearMiss.
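The split itself is a one-liner to verify. The snippet below reconstructs labels matching the reported 315/185 counts; with the real data this would simply be `df["BUY"].value_counts()`.

```python
import pandas as pd

# Simulated target matching the reported 315/185 split.
buy = pd.Series([0] * 315 + [1] * 185, name="BUY")

counts = buy.value_counts()
shares = (counts / len(buy)).round(2)
print(counts[0], counts[1])   # 315 185
print(shares.to_dict())       # {0: 0.63, 1: 0.37}
```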

Analytical Techniques Applied

  • Summary Statistics — mean, median, variance, and standard deviation for all binary interaction variables
  • Contingency Tables — crosstab analysis between each behavioral variable and the BUY outcome to measure individual impact
  • Correlation Heatmap — visualizing which features had the strongest linear relationship with purchase behavior
  • Interaction Heatmaps — pairwise analysis of feature combinations to identify synergistic effects on conversion
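The correlation step reduces to `df.corr()["BUY"]` on the session frame. This sketch runs on simulated sessions where the target is constructed to depend mostly on REVIEWS, so the ranking it prints is illustrative, not the study's actual coefficients; the feature subset is also an assumption.

```python
import numpy as np
import pandas as pd

# Simulated binary sessions; BUY is built to lean on REVIEWS so the
# correlation ranking is non-trivial. With real data: df.corr()["BUY"].
rng = np.random.default_rng(0)
feats = ["IMAGES", "REVIEWS", "SPECS", "WARRANTY"]
df = pd.DataFrame(rng.integers(0, 2, size=(500, len(feats))), columns=feats)
df["BUY"] = (df["REVIEWS"] | (rng.integers(0, 2, 500) & df["IMAGES"])).astype(int)

# For the visual, something like sns.heatmap(df.corr(), annot=True)
# would render the heatmap; here we print the ranking it encodes.
corr_with_buy = df.corr()["BUY"].drop("BUY").sort_values(ascending=False)
print(corr_with_buy.index[0])  # REVIEWS ranks highest in this simulation
```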

Interaction Analysis

The most valuable EDA finding was how combinations of page elements affect purchase likelihood — not just individual elements. Each interaction pair was analyzed as a 2×2 contingency table.
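Each 2×2 table comes from a row-normalized crosstab over a feature pair and the BUY outcome. The toy frame below is hand-built so the mechanics are visible; the real analysis runs the same `pd.crosstab` call on the full 500-session frame.

```python
import pandas as pd

# Toy sessions to demonstrate the 2x2 contingency mechanics.
df = pd.DataFrame({
    "IMAGES":  [0, 0, 1, 1, 0, 1, 1, 0],
    "REVIEWS": [0, 1, 0, 1, 1, 1, 0, 0],
    "BUY":     [0, 1, 0, 1, 1, 1, 0, 0],
})

# normalize="index" turns each (IMAGES, REVIEWS) row into purchase shares.
rates = pd.crosstab([df["IMAGES"], df["REVIEWS"]], df["BUY"], normalize="index")
print(rates)
```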

Images vs Reviews: the most impactful feature pair

                      Did not buy   Purchased
IMAGES=0, REVIEWS=0      90.5%         9.5%
IMAGES=0, REVIEWS=1      46.4%        53.6%
IMAGES=1, REVIEWS=0      77.8%        22.2%
IMAGES=1, REVIEWS=1      41.7%        58.3%   ★ Highest

Specs vs Reviews: reviews dominate even without specs

                      Did not buy   Purchased
SPECS=0, REVIEWS=0       89.3%        10.7%
SPECS=0, REVIEWS=1       33.3%        66.7%
SPECS=1, REVIEWS=0       75.0%        25.0%
SPECS=1, REVIEWS=1       53.6%        46.4%

Warranty vs Specs: the strongest non-reviews combination

                      Did not buy   Purchased
WARRANTY=0, SPECS=0      65.1%        34.9%
WARRANTY=0, SPECS=1      81.0%        19.1%
WARRANTY=1, SPECS=0      61.8%        38.2%
WARRANTY=1, SPECS=1      48.2%        51.9%   ★ Best combo

Compare Similar vs Sponsored Links: comparison beats paid clicks

                        Did not buy   Purchased
COMPARE=0, SPONSORED=0     71.4%        28.6%
COMPARE=0, SPONSORED=1     76.2%        23.8%
COMPARE=1, SPONSORED=0     66.7%        33.3%
COMPARE=1, SPONSORED=1     47.1%        52.9%   ★ Best combo

Key Takeaways from Interaction Analysis

1. Reviews are the #1 conversion driver. Reviews alone achieve a 53.57% purchase rate — higher than images alone (22.22%) or specs alone (25%). No other single feature comes close.

2. Images alone have a weaker influence than expected. Images without reviews convert at only 22.22% — less than half the rate of reviews without images. Visuals support the decision but don't drive it.

3. Specs need reviews to be effective. Specs without reviews convert at just 25%. Combined with reviews, this nearly doubles to 46.43%. Specifications answer objections — but reviews create the initial intent.

4. FAQ alone does not significantly drive purchases. FAQ without images converts at only 21.74%. Users who view FAQs tend to have objections — those objections aren't resolved unless visuals are also present.

5. Warranty information enhances but doesn't replace other signals. Warranty + specs crosses the 50% threshold (51.85%) — making it the strongest purely informational combination outside of reviews.

6. Comparing similar items outperforms sponsored links. Users who compare items convert at 33.33% — higher than those who only click sponsored links (23.81%). Active research behavior signals stronger purchase intent than ad clicks.

Machine Learning Models

Four resampling techniques were evaluated to address the 63/37 class imbalance before training a classification model. Each approach handles imbalance differently — the comparison reveals which produces the most reliable predictions:

  • Random Under-Sampling (97% accuracy): Reduces the majority class (non-buyers) by randomly removing examples until classes are balanced. Creates a smaller but balanced training set.
  • Random Over-Sampling (99% accuracy): Duplicates minority class (buyer) samples to match the majority class size. Preserves all original data while balancing the training distribution.
  • SMOTE (99% accuracy): Generates synthetic minority class samples by interpolating between existing buyer examples — more robust than simple duplication.
  • NearMiss (97% accuracy): Selects majority class examples based on proximity to minority class examples — a more intelligent form of under-sampling.

Model Performance Comparison

All four models were evaluated on precision, recall, F1-score, and accuracy. Random Over-Sampling and SMOTE both achieved 99% accuracy with near-perfect precision and recall across both classes:

Metric                 Under-Sampling   Over-Sampling   SMOTE   NearMiss
Precision (Class 0)    1.00             1.00            1.00    1.00
Precision (Class 1)    0.95             0.98            0.98    0.95
Recall (Class 0)       0.95             0.98            0.98    0.95
Recall (Class 1)       1.00             1.00            1.00    1.00
F1-Score (Class 0)     0.97             0.99            0.99    0.97
F1-Score (Class 1)     0.97             0.99            0.99    0.97
Overall Accuracy       0.97             0.99            0.99    0.97

Detailed Results: Best Models

Random Over-Sampling — Classification Report
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       100
           1       0.98      1.00      0.99        89

    accuracy                           0.99       189
   macro avg       0.99      0.99      0.99       189
weighted avg       0.99      0.99      0.99       189
SMOTE — Classification Report
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       100
           1       0.98      1.00      0.99        89

    accuracy                           0.99       189
   macro avg       0.99      0.99      0.99       189
weighted avg       0.99      0.99      0.99       189

Conclusion: Both Random Over-Sampling and SMOTE achieve 99% accuracy with 0.99 F1-scores across both classes — making either approach suitable for production deployment. SMOTE is preferred for production use as it generates synthetic samples rather than simply duplicating existing ones, making the model more generalizable to new session patterns.

Python · Scikit-learn · SMOTE · imbalanced-learn · Random Forest · Classification · Class Imbalance · Lead Scoring · Conversion Prediction · Streamlit

Live Demo

The lead scoring model is deployed as an interactive Streamlit application. Input a set of browsing behaviors and get an instant purchase likelihood score:

Interactive lead scoring demo — toggle browsing behaviors to see predicted purchase likelihood in real time.

Recommendations

Highlight Product Reviews and Images Together

  • Ensure reviews are prominently displayed alongside product images — not buried in a separate tab. The 58.33% conversion rate from this combination is the single most actionable finding in this study.
  • Use verified purchase reviews rather than unverified ratings — the signal is stronger when users trust the source of the review.
  • Display the review count prominently: social proof scales with volume. More reviews shown = higher perceived credibility.

Engage Users with Content That Reduces Uncertainty

  • Pair product specifications with prominent warranty information — this combination crosses the 50% purchase threshold (51.85%) and removes two categories of buyer objection simultaneously.
  • Consider adding customer testimonial videos or comparison charts — active engagement with product information is strongly predictive of purchase intent.
  • Ensure FAQ content is paired with visuals — FAQ alone converts at only 21.74%, but with images this rises to 42.86%.

Optimize for Comparison Behavior

  • Users who compare similar items convert at 33.33% — higher than sponsored link clickers (23.81%). Design the comparison UI to be smooth and fast, as this behavior signals strong purchase intent.
  • Consider triggering a targeted offer or chat prompt when a user initiates a product comparison — this is a high-intent behavioral signal.

Deploy the Lead Scoring Model

  • Use the trained model to score live sessions in real time, enabling sales teams to prioritize outreach to high-intent visitors.
  • Build a trigger system: sessions with predicted purchase probability above a threshold (e.g., 70%) receive a personalized intervention — live chat offer, limited-time discount, or follow-up email.
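A trigger system of this kind reduces to thresholding `predict_proba` over incoming sessions. The sketch below uses a toy model and simulated live traffic; the 0.70 cutoff matches the example threshold above, and everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the trained lead scorer.
rng = np.random.default_rng(3)
X_train = rng.integers(0, 2, (400, 10)).astype(float)
y_train = X_train[:, 1].astype(int) & rng.integers(0, 2, 400)
model = RandomForestClassifier(random_state=3).fit(X_train, y_train)

# Score a batch of simulated live sessions and flag high-intent ones.
live_sessions = rng.integers(0, 2, (20, 10)).astype(float)
scores = model.predict_proba(live_sessions)[:, 1]
high_intent = np.flatnonzero(scores >= 0.70)
print(f"{len(high_intent)} of 20 sessions queued for intervention")
```

Sessions crossing the threshold would then be routed to whichever intervention (chat, discount, follow-up email) the team has wired up.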

Limitations & Further Research

Data Limitations

  • The dataset captures only 500 sessions with a limited set of on-page interactions — broader behavioral signals (scroll depth, time-on-page, mouse movement) could significantly improve predictive power.
  • No pricing, promotional, or competitor context is captured. Purchase behavior is heavily influenced by price sensitivity and competitive availability, which are absent from this model.
  • No demographic or device information — mobile vs. desktop behavior, age, and location are known moderators of purchase intent that are unaccounted for.

Further Research

  • Temporal analysis — explore how conversion patterns change during sales events, holiday seasons, or promotional periods.
  • User segmentation — segment users by demographics or acquisition channel and build segment-specific models; a first-time visitor from paid ads may require different signals than a return visitor from organic search.
  • Sequential behavioral modeling — the order in which users engage with page elements (e.g., images first vs. reviews first) may contain predictive signal not captured in the current binary features.

Work with Adediran Adeyemi

Want to know which of your visitors are about to buy?

I build lead scoring models that help e-commerce businesses focus on high-intent prospects and stop treating every visitor the same. First call is free.