NYC Property Price Prediction

Project Overview

Property valuations in New York City can take days or weeks through traditional appraisal processes — creating friction for investors, buyers, and real estate agents who need quick estimates. This project builds an ML model that delivers instant, data-backed valuations for NYC properties across all five boroughs.

The core challenge isn't the modeling — it's the data. NYC real estate data scraped from Zillow contains extreme outliers, data entry errors, and missing values that require a sophisticated cleaning pipeline before any model can perform reliably. This project documents that pipeline in detail alongside the feature engineering decisions that drive model performance.

Best model (LightGBM): MAPE 30.72%, RMSE $989,980, R² 0.7992 — outperforms both CatBoost and ExtraTrees across all three metrics. The model reliably estimates prices for 70% of standard NYC properties within 30% error.

Dataset

18,177 NYC properties scraped from Zillow — cleaned down to 17,866 properties (removing 1.7% as outliers) with 20 raw features across all five boroughs:

🏙️ Manhattan
🌉 Brooklyn
🏘️ Queens
🌳 Bronx
⛴️ Staten Island

Raw features cover location (address, zipcode, lat/lng), property details (bedrooms, bathrooms, living area, home type), pricing, and Zillow metadata. Property types include houses, condos, apartments, and townhouses.

Data Cleaning Pipeline

NYC real estate data from any scraping source contains systematic quality issues. The cleaning pipeline addresses three categories of problems:

1. Street Name Extraction

Raw addresses from Zillow include unit numbers, apartment designations, floor markers, and building identifiers that need stripping before street-level features can be engineered. Removed: unit/apartment numbers (#, APT, UNIT, SUITE), floor designations (FL, FLOOR), and building numbers.
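The stripping rules above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual code: the patterns, the `STE` abbreviation, the choice to drop everything after the first comma, and the function name `extract_street_name` are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the original pipeline's exact rules are not shown.
# Unit markers: "#4C", "APT 12B", "UNIT 2", "SUITE 300".
UNIT_RE = re.compile(
    r"\s*(?:#\s*|\b(?:APT|UNIT|SUITE|STE)\b\.?\s*#?\s*)[\w-]+", re.IGNORECASE
)
# Floor markers: "FL 5", "FLOOR 5", "5TH FLOOR".
FLOOR_RE = re.compile(
    r"\s*\b(?:\d+(?:ST|ND|RD|TH)?\s+(?:FL|FLOOR)|(?:FL|FLOOR)\s*\d+)\b\.?",
    re.IGNORECASE,
)
# Leading building number: "350 ", "123A ", or Queens-style "37-12 ".
BUILDING_NUMBER_RE = re.compile(r"^\s*\d+[A-Z]?(?:-\d+)?\s+", re.IGNORECASE)


def extract_street_name(address: str) -> str:
    """Return an upper-cased street name with unit, floor, and building number removed."""
    street = address.split(",")[0]  # assumption: city/state/zip follow the first comma
    street = UNIT_RE.sub("", street)
    street = FLOOR_RE.sub("", street)
    street = BUILDING_NUMBER_RE.sub("", street)
    return " ".join(street.split()).upper()


print(extract_street_name("350 5th Ave APT 12B, New York, NY 10118"))
print(extract_street_name("123 W 45th St #4C"))
```

Ordinal street names such as "5th Ave" survive the building-number pattern because it requires whitespace directly after the digits (and optional letter).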

2. Extreme Value Correction — Hierarchical Fallback

Properties with impossible bedroom/bathroom combinations or size mismatches were corrected using a five-level fallback strategy rather than simply dropping them:

Level 1: Same street + home type median (min 5 properties). The most precise correction, using directly comparable properties.
Level 2: Same zipcode + home type median (min 10 properties). Neighborhood-level correction when the street has insufficient data.
Level 3: Same zipcode median (min 5 properties). Zipcode-wide correction across all property types.
Level 4: Living area estimation. Derive bedrooms from 800 sq ft/bed and bathrooms from 500 sq ft/bath when spatial data is reliable.
Level 5: Dataset-wide median. Last resort for properties with no comparable neighbors at any geographic level.
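The five-level fallback can be sketched as a lookup that tries each grouping in order and reports which level supplied the value. This is a minimal pure-Python sketch under stated assumptions: the record keys (`street`, `home_type`, `zipcode`, `living_area`), the helper names, and the rule that any positive living area counts as "reliable" are all hypothetical, since the original pipeline code is not shown.

```python
from collections import defaultdict
from statistics import median

# (grouping keys, minimum group size) for levels 1-3 of the fallback.
GROUP_LEVELS = [
    (("street", "home_type"), 5),
    (("zipcode", "home_type"), 10),
    (("zipcode",), 5),
]


def group_medians(rows, keys, field, min_count):
    """Median of `field` per group, keeping only groups with enough rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[field])
    return {g: median(vals) for g, vals in groups.items() if len(vals) >= min_count}


def make_corrector(clean_rows, field, sqft_per_unit):
    """Build a function that returns (corrected value, fallback level used)."""
    tables = [(keys, group_medians(clean_rows, keys, field, n)) for keys, n in GROUP_LEVELS]
    dataset_median = median(row[field] for row in clean_rows)

    def correct(row):
        # Levels 1-3: first geographic group with enough comparable properties.
        for level, (keys, table) in enumerate(tables, start=1):
            group = tuple(row[k] for k in keys)
            if group in table:
                return table[group], level
        # Level 4: estimate from living area (assumption: any positive area is usable).
        area = row.get("living_area")
        if area:
            return max(1, round(area / sqft_per_unit)), 4
        # Level 5: dataset-wide median.
        return dataset_median, 5

    return correct
```

Under these assumptions, `make_corrector(clean_rows, "bedrooms", 800)` and `make_corrector(clean_rows, "bathrooms", 500)` would give the two correctors described above. Returning the level alongside the value makes it easy to audit how often each fallback fires.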

3. Price Cleaning

Removed 311 properties (1.7%) that failed price validity checks: zero prices, price below $50/sq ft (unrealistic for NYC), absolute price below $50,000 or above $20,000,000.
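The validity checks above amount to a simple predicate per listing. The sketch below is illustrative only: the function name, the parameter names, and the decision to skip the per-square-foot check when living area is missing are assumptions, not details confirmed by the project.

```python
def is_valid_price(price, living_area,
                   min_price=50_000, max_price=20_000_000, min_price_per_sqft=50):
    """Return True if a listing passes the price validity checks described above."""
    if price is None or price <= 0:
        return False  # zero or missing price
    if price < min_price or price > max_price:
        return False  # outside the absolute price bounds
    # Assumption: the per-sq-ft check only applies when living area is known.
    if living_area and price / living_area < min_price_per_sqft:
        return False  # unrealistically cheap per square foot for NYC
    return True


listings = [
    {"price": 750_000, "living_area": 900},
    {"price": 0, "living_area": 900},
    {"price": 100_000, "living_area": 4_000},  # $25/sq ft
]
kept = [row for row in listings if is_valid_price(row["price"], row["living_area"])]
print(len(kept))
```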

Why hierarchical correction instead of dropping: Simply removing extreme-value properties introduces selection bias — unusual properties (very large apartments, converted spaces) may be genuine data, not errors. The fallback strategy preserves them with corrected values rather than removing potentially valid observations.

Feature Engineering

The model's performance is driven by 40+ engineered features derived from four domains. The top 10 features by importance:

#1 living_area_times_baths
#2 zipcode
#3 beds_plus_baths
#4 is_street_bathrooms
#5 is_boulevard_home_type
#6 bathrooms
#7 is_street_zipcode
#8 is_place
#9 is_west_street_borough
#10 zipcode_first_digit

Feature Categories

Street-Level: Prestigious streets (5th Ave, Park Ave, Central Park W/S), street types (Avenue, Boulevard, Place, Court), direction prefixes (East, West, North, South), extracted street numbers
Geographic: Manhattan grid zones (Lower <14th St, Midtown 34th–59th, Upper >59th), zipcode patterns (first digit, first two digits)
Interaction features: 40+ terms crossing street characteristics × property attributes × location (e.g., is_avenue_bedrooms, street_number_zipcode)
Property features: living_area_times_baths, living_area_times_beds, beds_plus_baths

Why street-level features matter: In NYC, the street type and name carry enormous price signal — a property on "5th Avenue" vs "5th Street" in the same zipcode can differ by millions. Standard models using only bedroom/bathroom counts miss this entirely. The street-level feature engineering captures the NYC premium geography that makes this market unique.
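To make the feature categories concrete, here is a small sketch that derives a handful of them from a cleaned street name and basic property attributes. The feature names follow those listed above, but the suffix lists, the `is_west`/`is_east` flags, and the choice to compute interaction terms as a simple product of a flag and a count are assumptions; the source does not state how the interactions are built.

```python
# Hypothetical suffix lists for street-type flags.
STREET_TYPES = {
    "avenue": ("AVE", "AVENUE"),
    "boulevard": ("BLVD", "BOULEVARD"),
    "place": ("PL", "PLACE"),
    "court": ("CT", "COURT"),
    "street": ("ST", "STREET"),
}


def engineer_features(street, zipcode, bedrooms, bathrooms, living_area):
    """Return a dict of illustrative street, geographic, and interaction features."""
    tokens = street.upper().split()
    first = tokens[0] if tokens else ""
    last = tokens[-1] if tokens else ""

    # Street-level flags from the final token and the direction prefix.
    features = {f"is_{name}": int(last in suffixes) for name, suffixes in STREET_TYPES.items()}
    features["is_west"] = int(first in ("W", "WEST"))
    features["is_east"] = int(first in ("E", "EAST"))

    # Geographic zipcode patterns.
    features["zipcode_first_digit"] = int(zipcode[0])
    features["zipcode_first_two"] = int(zipcode[:2])

    # Property features.
    features["living_area_times_baths"] = living_area * bathrooms
    features["living_area_times_beds"] = living_area * bedrooms
    features["beds_plus_baths"] = bedrooms + bathrooms

    # Interaction terms (assumed here to be flag x count products).
    features["is_street_bathrooms"] = features["is_street"] * bathrooms
    features["is_avenue_bedrooms"] = features["is_avenue"] * bedrooms
    return features


print(engineer_features("W 45TH ST", "10036", bedrooms=2, bathrooms=1.5, living_area=900))
```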

Model Comparison

Three gradient boosting / ensemble models were trained on an 80/20 train-test split (14,292 train, 3,574 test) and evaluated across MAPE, RMSE, and R²:

Model	MAPE (%)	RMSE ($)	R² Score
LightGBM ⭐ Deployed	30.72	989,980	0.7992
CatBoost	32.11	1,032,960	0.7814
ExtraTrees	30.94	1,183,564	0.7130
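For reference, the three metrics in the table can be computed as follows. This is a plain-Python sketch using the standard definitions; the function names are chosen for the example, and the project may well have used library implementations instead.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars here)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only, not project data.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 360]
print(mape(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

MAPE weights every property equally regardless of price, while RMSE is dominated by large dollar errors, which is why the two metrics can rank models differently on a market with a long luxury tail.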

Why LightGBM Was Chosen

Best overall performance: Lowest RMSE ($989,980) and highest R² (0.7992) on the held-out test set
RMSE improvement: ~16% lower RMSE than ExtraTrees ($989,980 vs. $1,183,564), a meaningful reduction for high-value NYC properties
Inference speed: Faster prediction than CatBoost at deployment, important for real-time valuation use
Balanced metrics: Strong performance across all three evaluation criteria — not a trade-off winner that sacrifices one metric for another

R² Score: 0.80 (about 80% of variance explained)
MAPE: 30.7% (mean absolute percentage error)
RMSE: $990K (root mean squared error)

Model Evaluation

Residuals Analysis

Residuals vs. predicted values show relatively random scatter around zero, indicating no systematic bias for typical NYC properties. Larger residuals appear at higher price points (>$5M), which is expected for luxury real estate.

Residuals Distribution

The distribution of residuals is approximately normal and centered at zero — confirming unbiased predictions with symmetric spread. Most predictions fall within reasonable error bounds for the price range.

Q-Q Plot Interpretation

The residuals track the normal reference line closely in the central quantiles. The heavy tails on both ends (common in real estate datasets) indicate that very cheap and very expensive properties are harder to predict accurately, a known limitation of single-model approaches to real estate valuation.

Model reliability boundaries: The model performs well for standard residential properties but shows higher error for ultra-luxury properties (>$5M), unique architectural properties, and rare property type combinations. A separate model trained specifically on luxury listings would address this gap.
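One way to quantify these reliability boundaries is to report error separately for standard and luxury properties. The sketch below splits MAPE at the $5M threshold mentioned above; the function name, band labels, and sample values are illustrative assumptions, not project outputs.

```python
def mape_by_band(y_true, y_pred, threshold=5_000_000):
    """Return MAPE (in percent) for properties at or below vs. above a price threshold."""
    bands = {"standard": [], "luxury": []}
    for actual, predicted in zip(y_true, y_pred):
        band = "luxury" if actual > threshold else "standard"
        bands[band].append(abs(actual - predicted) / actual)
    # Skip bands with no properties to avoid dividing by zero.
    return {name: 100 * sum(errs) / len(errs) for name, errs in bands.items() if errs}


# Toy values for illustration only.
actual = [500_000, 1_000_000, 8_000_000]
predicted = [550_000, 900_000, 5_600_000]
print(mape_by_band(actual, predicted))
```

A large gap between the two bands would support training a separate luxury model, as suggested above.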

Key Insights

Location Dominates

Zipcode and street-derived features take seven of the top ten importance slots, and zipcode alone ranks second, behind only the living area × bathrooms interaction. In NYC, the address carries at least as much price signal as size.

Size Interactions Matter

Living area × bathrooms captures a quality signal that neither variable alone can express — large apartments with few bathrooms vs. many bathrooms signal different price tiers.

Street Type Commands Premiums

Avenue properties consistently command premium prices over streets in the same neighborhood. The Manhattan grid — East/West designation — further differentiates pricing within zipcodes.

Luxury Remains Hard to Predict

Properties above $5M show higher prediction error — luxury real estate pricing depends on unique features and negotiation dynamics that structured data can't fully capture.

Live Demo

The LightGBM model is deployed on Hugging Face Spaces. Enter borough, bedrooms, bathrooms, living area, and street to get an instant price estimate.

Tech Stack

LightGBM: Primary Model
CatBoost: Comparison
ExtraTrees: Comparison
Pandas: Data Processing
Beautiful Soup: Web Scraping
HuggingFace: Deployment

Python, LightGBM, CatBoost, Scikit-learn, Beautiful Soup, Pandas, NumPy, Feature Engineering, Web Scraping, HuggingFace Spaces, Flask

Related Projects

If this case study is relevant to your business challenge, these projects may also interest you:

Nigerian Used Car Price Prediction

Lead Scoring & Conversion Prediction

Retail Location Strategy Analysis

E-Commerce Product Returns Analysis
