Project Overview
Property valuations in New York City can take days or weeks through traditional appraisal processes — creating friction for investors, buyers, and real estate agents who need quick estimates. This project builds an ML model that delivers instant, data-backed valuations for NYC properties across all five boroughs.
The core challenge isn't the modeling — it's the data. NYC real estate data scraped from Zillow contains extreme outliers, data entry errors, and missing values that require a sophisticated cleaning pipeline before any model can perform reliably. This project documents that pipeline in detail alongside the feature engineering decisions that drive model performance.
Best model (LightGBM): MAPE 30.72%, RMSE $989,980, R² 0.7992 — outperforms both CatBoost and ExtraTrees across all three metrics. The model reliably estimates prices for 70% of standard NYC properties within 30% error.
Dataset
18,177 NYC properties scraped from Zillow, cleaned down to 17,866 properties (removing 1.7% as outliers), with 20 raw features across all five boroughs.
Raw features cover location (address, zipcode, lat/lng), property details (bedrooms, bathrooms, living area, home type), pricing, and Zillow metadata. Property types include houses, condos, apartments, and townhouses.
Data Cleaning Pipeline
NYC real estate data from any scraping source contains systematic quality issues. The cleaning pipeline addresses three categories of problems:
1. Street Name Extraction
Raw addresses from Zillow include unit numbers, apartment designations, floor markers, and building identifiers that need stripping before street-level features can be engineered. Removed: unit/apartment numbers (#, APT, UNIT, SUITE), floor designations (FL, FLOOR), and building numbers.
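The stripping step can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual code: the patterns, the `extract_street` name, and the assumption that anything after the first comma is city/state/zip are all hypothetical.

```python
import re

# Patterns built from the designators listed above; the project's actual
# regexes are not shown, so treat these as illustrative assumptions.
UNIT_RE = re.compile(
    r"\s*(?:\b(?:APT|UNIT|SUITE)\b\.?\s*#?\s*|#\s*)[\w-]+", re.IGNORECASE
)
FLOOR_RE = re.compile(
    r"\s*\b(?:\d+(?:ST|ND|RD|TH)\s+)?(?:FLOOR|FL)\b\.?(?:\s*\d+)?", re.IGNORECASE
)
BUILDING_NO_RE = re.compile(r"^\s*\d+[A-Z]?(?:-\d+)?\s+", re.IGNORECASE)


def extract_street(address: str) -> str:
    """Return an upper-cased street name without unit, floor, or building number."""
    street = address.split(",")[0]           # drop city/state/zip (assumed format)
    street = UNIT_RE.sub("", street)         # "#12C", "APT 4B", "UNIT 3", "SUITE 1700"
    street = FLOOR_RE.sub("", street)        # "FL 2", "3rd Floor"
    street = BUILDING_NO_RE.sub("", street)  # leading "123", "12A", "45-12"
    return " ".join(street.split()).upper()  # collapse whitespace, normalize case
```

Normalizing to upper case keeps "5th Ave" and "5TH AVE" in the same group when street-level medians and features are computed later. An address whose street name is itself a bare number (for example "5 Ave" with no building number) would be mangled by the building-number pattern, so real data would likely need extra guards.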
2. Extreme Value Correction — Hierarchical Fallback
Properties with impossible bedroom/bathroom combinations or size mismatches were corrected using a five-level fallback strategy rather than simply dropping them:
- Level 1: same street + home type median (min 5 properties). The most precise correction, using directly comparable properties.
- Level 2: same zipcode + home type median (min 10 properties). Neighborhood-level correction when the street has insufficient data.
- Level 3: same zipcode median (min 5 properties). Zipcode-wide correction across all property types.
- Level 4: living area estimation. Derive bedrooms from 800 sq ft/bed and bathrooms from 500 sq ft/bath when spatial data is reliable.
- Level 5: dataset-wide median. Last resort for properties with no comparable neighbors at any geographic level.
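The five levels can be expressed as an ordered lookup that stops at the first level with enough comparable properties. The sketch below uses plain dictionaries and the thresholds and ratios stated above. The function names, the record layout (`street`, `zipcode`, `home_type`, `living_area`), and the rule that level 4 applies whenever a living area is present are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# (grouping keys, minimum group size) for levels 1-3, as listed above.
LEVELS = [
    (("street", "home_type"), 5),
    (("zipcode", "home_type"), 10),
    (("zipcode",), 5),
]
# Level 4 ratios from the text: square feet per bedroom / per bathroom.
SQFT_PER_UNIT = {"bedrooms": 800, "bathrooms": 500}


def build_group_medians(valid_rows, field):
    """Median of `field` per group, keeping only groups that meet the minimum size."""
    tables = []
    for keys, min_count in LEVELS:
        groups = defaultdict(list)
        for row in valid_rows:
            groups[tuple(row[k] for k in keys)].append(row[field])
        tables.append(
            {g: median(v) for g, v in groups.items() if len(v) >= min_count}
        )
    return tables


def corrected_value(row, field, tables, overall_median):
    """Return (value, level) from the first fallback level that applies."""
    for level, ((keys, _), table) in enumerate(zip(LEVELS, tables), start=1):
        group = tuple(row[k] for k in keys)
        if group in table:
            return table[group], level
    area = row.get("living_area")
    if area:  # level 4: assumed to apply whenever a living area is recorded
        return max(1, round(area / SQFT_PER_UNIT[field])), 4
    return overall_median, 5  # level 5: dataset-wide median
```

The group medians would be built only from rows that passed the validity checks, so that the implausible values being corrected do not feed back into their own replacements. Returning the level alongside the value makes it easy to audit how many corrections fell through to the coarser levels.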
3. Price Cleaning
Removed 311 properties (1.7%) that failed price validity checks: zero prices, price below $50/sq ft (unrealistic for NYC), absolute price below $50,000 or above $20,000,000.
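As a rough sketch, the price checks reduce to a single predicate per listing. The thresholds are the ones stated above. The function name and the choice to keep listings that lack a living area (so the per-square-foot check cannot be evaluated) are assumptions.

```python
MIN_PRICE = 50_000
MAX_PRICE = 20_000_000
MIN_PRICE_PER_SQFT = 50


def has_valid_price(price, living_area):
    """Apply the price validity checks described above to one listing."""
    if not price or price <= 0:
        return False  # zero or missing price
    if price < MIN_PRICE or price > MAX_PRICE:
        return False  # outside the absolute price bounds
    # The per-square-foot check needs a positive living area. Listings without
    # one are kept here, which is an assumption rather than the documented rule.
    if living_area and living_area > 0 and price / living_area < MIN_PRICE_PER_SQFT:
        return False
    return True
```

Filtering a list of listings is then `[row for row in rows if has_valid_price(row["price"], row.get("living_area"))]`, where the `price` and `living_area` keys are again hypothetical.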
Why hierarchical correction instead of dropping: Simply removing extreme-value properties introduces selection bias — unusual properties (very large apartments, converted spaces) may be genuine data, not errors. The fallback strategy preserves them with corrected values rather than removing potentially valid observations.
Feature Engineering
The model's performance is driven by 40+ engineered features derived from four domains. The top 10 features by importance:
Feature Categories
- Street-Level: Prestigious streets (5th Ave, Park Ave, Central Park W/S), street types (Avenue, Boulevard, Place, Court), direction prefixes (East, West, North, South), extracted street numbers
- Geographic: Manhattan grid zones (Lower <14th St, Midtown 34th–59th, Upper >59th), zipcode patterns (first digit, first two digits)
- Interaction features: 40+ terms crossing street characteristics × property attributes × location (e.g., `is_avenue_bedrooms`, `street_number_zipcode`)
- Property features: `living_area_times_baths`, `living_area_times_beds`, `beds_plus_baths`
Why street-level features matter: In NYC, the street type and name carry enormous price signal — a property on "5th Avenue" vs "5th Street" in the same zipcode can differ by millions. Standard models using only bedroom/bathroom counts miss this entirely. The street-level feature engineering captures the NYC premium geography that makes this market unique.
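To make the feature categories concrete, here is a small sketch that derives a handful of them from a cleaned street name. Only `is_avenue_bedrooms`, `living_area_times_baths`, `living_area_times_beds`, and `beds_plus_baths` are names taken from the list above. Every other name, the lookup tables, and the exact definitions (for example, multiplying an avenue flag by the bedroom count) are assumptions, and the grid zones are only meaningful for numbered Manhattan cross streets.

```python
import re

# Lookup tables assembled from the categories listed above (illustrative only).
PRESTIGIOUS = ("5TH AVE", "PARK AVE", "CENTRAL PARK W", "CENTRAL PARK S")
STREET_TYPES = {"AVE": "avenue", "AVENUE": "avenue", "BLVD": "boulevard",
                "BOULEVARD": "boulevard", "PL": "place", "PLACE": "place",
                "CT": "court", "COURT": "court"}
DIRECTIONS = {"E": "east", "EAST": "east", "W": "west", "WEST": "west",
              "N": "north", "NORTH": "north", "S": "south", "SOUTH": "south"}
ORDINAL_RE = re.compile(r"\b(\d+)(?:ST|ND|RD|TH)\b")


def grid_zone(street_number):
    """Map a numbered cross street to the Manhattan zones given in the text."""
    if street_number is None:
        return "unknown"
    if street_number < 14:
        return "lower"
    if 34 <= street_number <= 59:
        return "midtown"
    if street_number > 59:
        return "upper"
    return "other"  # 14th to 33rd: the text assigns no zone


def engineer_features(street, zipcode, bedrooms, bathrooms, living_area):
    """Build a few street, geographic, interaction, and property features."""
    name = " ".join(street.upper().split())
    tokens = name.split()
    street_type = STREET_TYPES.get(tokens[-1], "other") if tokens else "other"
    direction = DIRECTIONS.get(tokens[0], "none") if tokens else "none"
    match = ORDINAL_RE.search(name)
    street_number = int(match.group(1)) if match else None
    is_avenue = int(street_type == "avenue")
    return {
        "street_type": street_type,
        "direction": direction,
        "street_number": street_number,
        "is_prestigious": int(name.startswith(PRESTIGIOUS)),
        # Avenue numbers are not cross-street positions, so skip the zone for them.
        "grid_zone": grid_zone(None if is_avenue else street_number),
        "zip_first_digit": zipcode[:1],
        "zip_first_two": zipcode[:2],
        "is_avenue_bedrooms": is_avenue * bedrooms,
        "living_area_times_baths": living_area * bathrooms,
        "living_area_times_beds": living_area * bedrooms,
        "beds_plus_baths": bedrooms + bathrooms,
    }
```

Zipcodes are kept as strings so that the prefix features survive leading zeros and are treated as categories rather than magnitudes.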
Model Comparison
Three gradient boosting / ensemble models were trained on an 80/20 train-test split (14,292 train, 3,574 test) and evaluated across MAPE, RMSE, and R²:
| Model | MAPE (%) | RMSE ($) | R² Score |
|---|---|---|---|
| LightGBM ⭐ Deployed | 30.72 | 989,980 | 0.7992 |
| CatBoost | 32.11 | 1,032,960 | 0.7814 |
| ExtraTrees | 30.94 | 1,183,564 | 0.7130 |
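For reference, the three metrics in the table follow their standard definitions, sketched here in plain Python. How the project actually computed them (for instance via a library such as scikit-learn) is not stated, so this is only a reading aid.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

MAPE weights every property equally in relative terms, while RMSE is dominated by large dollar errors on expensive properties, which is why a model can rank differently on the two.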
Why LightGBM Was Chosen
- Best overall performance: Lowest RMSE ($989,980) and highest R² (0.7992) on the held-out test set
- RMSE improvement: ~16% lower RMSE than ExtraTrees ($989,980 vs. $1,183,564), a meaningful reduction for high-value NYC properties
- Inference speed: Faster prediction than CatBoost at deployment, important for real-time valuation use
- Balanced metrics: Strong performance across all three evaluation criteria — not a trade-off winner that sacrifices one metric for another
Model Evaluation
Residuals Analysis
Residuals vs. predicted values show relatively random scatter around zero, which suggests no strong systematic bias across most of the price range: the model appears unbiased for typical NYC properties. Some outliers are present at higher price points (>$5M), which is expected for luxury real estate.
Residuals Distribution
The distribution of residuals is approximately normal in its central mass and centered at zero, which is consistent with unbiased predictions and a symmetric spread. Most predictions fall within reasonable error bounds for the price range.
Q-Q Plot Interpretation
Strong adherence to normality in the central quantiles. The heavy tails on both ends (common in real estate price data) indicate that very cheap and very expensive properties are harder to predict accurately, a common limitation of fitting a single model across the whole price range.
Model reliability boundaries: The model performs well for standard residential properties but shows higher error for ultra-luxury properties (>$5M), unique architectural properties, and rare property type combinations. A separate model trained specifically on luxury listings is one possible way to narrow this gap.
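One simple way to quantify these boundaries is to summarize errors separately below and above the $5M mark. The sketch below is illustrative: the function name, the band labels, and the choice of mean residual and median absolute percentage error as summary statistics are assumptions, not the project's evaluation code.

```python
from statistics import mean, median


def error_by_price_band(y_true, y_pred, threshold=5_000_000):
    """Summarize residuals (actual - predicted) below and above a price threshold."""
    bands = {"standard": [], "luxury": []}
    for actual, predicted in zip(y_true, y_pred):
        band = "luxury" if actual > threshold else "standard"
        residual = actual - predicted
        bands[band].append((residual, abs(residual) / actual))
    summary = {}
    for band, rows in bands.items():
        if rows:  # skip empty bands
            summary[band] = {
                "n": len(rows),
                "mean_residual": mean(r for r, _ in rows),
                "median_ape_pct": 100 * median(a for _, a in rows),
            }
    return summary
```

A mean residual far from zero in the luxury band would indicate systematic under- or over-prediction there, while a large median percentage error with a near-zero mean would point to noise rather than bias.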
Key Insights
Location Dominates
Zipcode and street characteristics are the strongest predictors — outweighing bedroom count and living area. In NYC, address signals price more than size.
Size Interactions Matter
Living area × bathrooms captures a quality signal that neither variable alone can express — large apartments with few bathrooms vs. many bathrooms signal different price tiers.
Street Type Commands Premiums
Avenue properties consistently command premium prices over streets in the same neighborhood. The Manhattan grid — East/West designation — further differentiates pricing within zipcodes.
Luxury Remains Hard to Predict
Properties above $5M show higher prediction error — luxury real estate pricing depends on unique features and negotiation dynamics that structured data can't fully capture.
Live Demo
The LightGBM model is deployed on Hugging Face Spaces. Enter property details to get an instant price estimate: