adeyemi@adediranadeyemi.com +234 816 273 5399
Machine Learning · Real Estate · LightGBM

Predicting NYC Property Prices with an R² of 0.80

A LightGBM model trained on 18,177 Zillow listings across all five NYC boroughs — with extensive street-level feature engineering, a hierarchical data cleaning pipeline, and an instant valuation demo deployed on Hugging Face.

Stack: Python · LightGBM · CatBoost · Beautiful Soup
Data: 18,177 NYC properties scraped from Zillow
Type: Regression ML · Feature Engineering · Web Scraping
NYC property price prediction ML model by Adediran Adeyemi
0.80 R²: 80% of price variance explained
18,177 NYC properties scraped from Zillow
30.7% MAPE (Mean Absolute Percentage Error)
5 NYC boroughs covered in the model

Project Overview

Property valuations in New York City can take days or weeks through traditional appraisal processes — creating friction for investors, buyers, and real estate agents who need quick estimates. This project builds an ML model that delivers instant, data-backed valuations for NYC properties across all five boroughs.

The core challenge isn't the modeling — it's the data. NYC real estate data scraped from Zillow contains extreme outliers, data entry errors, and missing values that require a sophisticated cleaning pipeline before any model can perform reliably. This project documents that pipeline in detail alongside the feature engineering decisions that drive model performance.

Best model (LightGBM): MAPE 30.72%, RMSE $989,980, R² 0.7992 — outperforms both CatBoost and ExtraTrees across all three metrics. The model reliably estimates prices for 70% of standard NYC properties within 30% error.

Dataset

18,177 NYC properties scraped from Zillow — cleaned down to 17,866 properties (removing 1.7% as outliers) with 20 raw features across all five boroughs:

🏙️Manhattan
🌉Brooklyn
🏘️Queens
🌳Bronx
⛴️Staten Island

Raw features cover location (address, zipcode, lat/lng), property details (bedrooms, bathrooms, living area, home type), pricing, and Zillow metadata. Property types include houses, condos, apartments, and townhouses.

Data Cleaning Pipeline

NYC real estate data from any scraping source contains systematic quality issues. The cleaning pipeline addresses three categories of problems:

1. Street Name Extraction

Raw addresses from Zillow include unit numbers, apartment designations, floor markers, and building identifiers that need stripping before street-level features can be engineered. Removed: unit/apartment numbers (#, APT, UNIT, SUITE), floor designations (FL, FLOOR), and building numbers.
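The stripping step described above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual code: the patterns, the function name `extract_street_name`, and the assumption that the street part precedes the first comma are all hypothetical.

```python
import re

# Hypothetical patterns: the write-up lists the token types removed, not the exact regexes.
UNIT_RE = re.compile(r"\s*(?:#\s*\w+|\b(?:APT|UNIT|SUITE)\b\.?\s*#?\s*\w+)", re.IGNORECASE)
FLOOR_RE = re.compile(
    r"\s*\b(?:\d+(?:ST|ND|RD|TH)?\s+(?:FL|FLOOR)|(?:FL|FLOOR)\s*\d+)\b\.?", re.IGNORECASE
)
BUILDING_NUMBER_RE = re.compile(r"^\s*\d+[A-Z]?(?:-\d+)?\s+", re.IGNORECASE)


def extract_street_name(address: str) -> str:
    """Reduce a full listing address to an upper-cased street name."""
    street = address.split(",")[0]               # assume "street, city, state zip"
    street = UNIT_RE.sub("", street)             # "#12", "APT 4B", "UNIT 3", "SUITE 200"
    street = FLOOR_RE.sub("", street)            # "FL 5", "2ND FLOOR"
    street = BUILDING_NUMBER_RE.sub("", street)  # leading "123", "45-10", "12A"
    return re.sub(r"\s+", " ", street).strip().upper()
```

For example, `extract_street_name("123 W 45th St APT 4B, New York, NY 10036")` reduces the address to `W 45TH ST`, which can then feed the street-level features.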

2. Extreme Value Correction — Hierarchical Fallback

Properties with impossible bedroom/bathroom combinations or size mismatches were corrected using a five-level fallback strategy rather than simply dropping them:

  1. Same street + home type median (min 5 properties): most precise correction, using directly comparable properties
  2. Same zipcode + home type median (min 10 properties): neighborhood-level correction when the street has insufficient data
  3. Same zipcode median (min 5 properties): zipcode-wide correction across all property types
  4. Living area estimation: derive bedrooms from 800 sq ft/bed and bathrooms from 500 sq ft/bath when spatial data is reliable
  5. Dataset-wide median: last resort for properties with no comparable neighbors at any geographic level
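The five levels can be expressed as an ordered lookup that stops at the first level with enough comparable properties. The sketch below uses plain dictionaries and assumes hypothetical field names (`street`, `zipcode`, `home_type`, `living_area`). It also treats any present, non-zero living area as "reliable", which is a simplification of whatever check the project applies.

```python
from collections import defaultdict
from statistics import median

# Levels 1-3: (grouping keys, minimum number of comparable properties).
LEVELS = [
    (("street", "home_type"), 5),
    (("zipcode", "home_type"), 10),
    (("zipcode",), 5),
]
SQFT_PER_UNIT = {"bedrooms": 800, "bathrooms": 500}  # level 4 ratios from the write-up


def group_medians(valid_rows, keys, field, min_count):
    """Median of `field` per group, kept only for groups with enough rows."""
    groups = defaultdict(list)
    for row in valid_rows:
        groups[tuple(row[k] for k in keys)].append(row[field])
    return {g: median(v) for g, v in groups.items() if len(v) >= min_count}


def make_corrector(valid_rows, field):
    """Return a function that proposes a corrected `field` value for a flagged row."""
    tables = [(keys, group_medians(valid_rows, keys, field, n)) for keys, n in LEVELS]
    overall = median(row[field] for row in valid_rows)  # level 5

    def correct(row):
        for keys, medians in tables:  # levels 1-3, most specific group first
            group = tuple(row[k] for k in keys)
            if group in medians:
                return medians[group]
        area = row.get("living_area")
        if area:  # level 4: assumes a present, non-zero area is reliable
            return max(1, round(area / SQFT_PER_UNIT[field]))
        return overall

    return correct
```

The medians are built only from rows that passed validation, so flagged values never contaminate the statistics used to correct them.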

3. Price Cleaning

Removed 311 properties (1.7%) that failed price validity checks: zero prices, price below $50/sq ft (unrealistic for NYC), absolute price below $50,000 or above $20,000,000.
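As a rough sketch, the validity checks amount to a single predicate per listing. The function name and the handling of a missing living area (the per-square-foot check is skipped) are assumptions. The thresholds are the ones stated above.

```python
def is_valid_price(price, living_area,
                   min_price_per_sqft=50, min_price=50_000, max_price=20_000_000):
    """Return True when a listing passes the price validity checks."""
    if not price or price <= 0:                      # zero or missing price
        return False
    if price < min_price or price > max_price:       # absolute bounds
        return False
    if living_area and price / living_area < min_price_per_sqft:
        return False                                 # below $50/sq ft
    return True


# Bookkeeping from the write-up: 18,177 scraped listings, 311 removed.
remaining = 18_177 - 311                 # 17,866 properties kept
removed_share = 311 / 18_177 * 100       # about 1.7%
```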

Why hierarchical correction instead of dropping: Simply removing extreme-value properties introduces selection bias — unusual properties (very large apartments, converted spaces) may be genuine data, not errors. The fallback strategy preserves them with corrected values rather than removing potentially valid observations.

Feature Engineering

The model's performance is driven by 40+ engineered features derived from four domains. The top 10 features by importance:

#1  living_area_times_baths
#2  zipcode
#3  beds_plus_baths
#4  is_street_bathrooms
#5  is_boulevard_home_type
#6  bathrooms
#7  is_street_zipcode
#8  is_place
#9  is_west_street_borough
#10 zipcode_first_digit

Feature Categories

  • Street-Level: Prestigious streets (5th Ave, Park Ave, Central Park W/S), street types (Avenue, Boulevard, Place, Court), direction prefixes (East, West, North, South), extracted street numbers
  • Geographic: Manhattan grid zones (Lower <14th St, Midtown 34th–59th, Upper >59th), zipcode patterns (first digit, first two digits)
  • Interaction features: 40+ terms crossing street characteristics × property attributes × location (e.g., is_avenue_bedrooms, street_number_zipcode)
  • Property features: living_area_times_baths, living_area_times_beds, beds_plus_baths
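A minimal sketch of how a few of these features could be derived from a cleaned street name and the basic property attributes. The abbreviation tables are illustrative, and treating interaction terms as simple products (flag × numeric value) is an assumption, since the write-up names the features but not their formulas. The grid-zone helper is only meaningful for numbered Manhattan streets.

```python
import re

# Illustrative lookup tables, not the project's actual lists.
STREET_TYPES = {"AVE": "avenue", "AVENUE": "avenue", "ST": "street", "STREET": "street",
                "BLVD": "boulevard", "BOULEVARD": "boulevard",
                "PL": "place", "PLACE": "place", "CT": "court", "COURT": "court"}
DIRECTIONS = {"E": "east", "EAST": "east", "W": "west", "WEST": "west",
              "N": "north", "NORTH": "north", "S": "south", "SOUTH": "south"}


def manhattan_zone(street_number):
    """Grid zone for a numbered street, using the cut-offs listed above."""
    if street_number is None:
        return "unknown"
    if street_number < 14:
        return "lower"
    if 34 <= street_number <= 59:
        return "midtown"
    if street_number > 59:
        return "upper"
    return "other"  # 14th-33rd St: no zone is given in the write-up


def build_features(street, zipcode, bedrooms, bathrooms, living_area):
    tokens = street.upper().split()
    street_type = STREET_TYPES.get(tokens[-1]) if tokens else None
    direction = DIRECTIONS.get(tokens[0]) if tokens else None
    match = re.search(r"\b(\d+)(?:ST|ND|RD|TH)\b", street.upper())
    street_number = int(match.group(1)) if match else None
    is_avenue = int(street_type == "avenue")
    is_street = int(street_type == "street")
    return {
        "is_avenue": is_avenue,
        "is_street": is_street,
        "is_place": int(street_type == "place"),
        "direction": direction,
        "street_number": street_number,
        "manhattan_zone": manhattan_zone(street_number if is_street else None),
        "zipcode_first_digit": int(zipcode[0]),
        "zipcode_first_two": int(zipcode[:2]),
        "living_area_times_baths": living_area * bathrooms,
        "living_area_times_beds": living_area * bedrooms,
        "beds_plus_baths": bedrooms + bathrooms,
        "is_avenue_bedrooms": is_avenue * bedrooms,    # assumed product form
        "is_street_bathrooms": is_street * bathrooms,  # assumed product form
    }
```

Calling `build_features("W 45TH ST", "10036", 2, 1.5, 900)` places the property in the `midtown` zone with a `west` direction prefix.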

Why street-level features matter: In NYC, the street type and name carry enormous price signal — a property on "5th Avenue" vs "5th Street" in the same zipcode can differ by millions. Standard models using only bedroom/bathroom counts miss this entirely. The street-level feature engineering captures the NYC premium geography that makes this market unique.

Model Comparison

Three gradient boosting / ensemble models were trained on an 80/20 train-test split (14,292 train, 3,574 test) and evaluated across MAPE, RMSE, and R²:

Model | MAPE (%) | RMSE ($) | R² Score
LightGBM ⭐ Deployed | 30.72 | 989,980 | 0.7992
CatBoost | 32.11 | 1,032,960 | 0.7814
ExtraTrees | 30.94 | 1,183,564 | 0.7130
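For reference, the three metrics can be computed from true and predicted prices as follows. This is a dependency-free sketch of the standard definitions, not the project's evaluation code. The `share_within` helper is a hypothetical addition showing how a "within 30% error" share could be measured.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: share of target variance explained."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def share_within(y_true, y_pred, tolerance=0.30):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    hits = sum(abs(t - p) / abs(t) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Because RMSE squares each error, a handful of multi-million-dollar misses can dominate it, which is why it is reported alongside MAPE rather than instead of it.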

Why LightGBM Was Chosen

  • Best overall performance: Lowest RMSE ($989,980) and highest R² (0.7992) on the held-out test set
  • RMSE improvement: ~16% lower RMSE than ExtraTrees ($989,980 vs. $1,183,564), a meaningful reduction for high-value NYC properties
  • Inference speed: Faster prediction than CatBoost at deployment, important for real-time valuation use
  • Balanced metrics: Strong performance across all three evaluation criteria — not a trade-off winner that sacrifices one metric for another
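The size of the RMSE gap depends on which model is taken as the baseline. The short calculation below uses the RMSE values from the comparison table. The helper names are illustrative.

```python
RMSE = {"LightGBM": 989_980, "CatBoost": 1_032_960, "ExtraTrees": 1_183_564}


def reduction_vs(baseline, model="LightGBM"):
    """Percent by which `model` lowers RMSE relative to `baseline`."""
    return 100 * (RMSE[baseline] - RMSE[model]) / RMSE[baseline]


def excess_over(model, reference="LightGBM"):
    """Percent by which `model`'s RMSE exceeds the reference model's RMSE."""
    return 100 * (RMSE[model] - RMSE[reference]) / RMSE[reference]


lgbm_vs_extratrees = reduction_vs("ExtraTrees")   # about 16.4% lower than ExtraTrees
extratrees_vs_lgbm = excess_over("ExtraTrees")    # ExtraTrees is about 19.6% higher
lgbm_vs_catboost = reduction_vs("CatBoost")       # about 4.2% lower than CatBoost
```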
R² Score: 0.80 (80% variance explained)
MAPE: 30.7% (mean absolute % error)
RMSE: $990K (root mean squared error)

Model Evaluation

Residuals Analysis

Residuals vs. predicted values show relatively random scatter around zero — indicating no systematic bias across the price range. The model is unbiased for typical NYC properties. Some outliers are present at higher price points (>$5M), which is expected for luxury real estate.

Residuals Distribution

The residuals are approximately normally distributed and centered at zero, which is consistent with unbiased predictions and a symmetric spread. Most predictions fall within reasonable error bounds for the price range.

Q-Q Plot Interpretation

The central quantiles adhere closely to the normal reference line. The heavy tails on both ends, common in real estate price data, indicate that very cheap and very expensive properties are harder to predict accurately, a known limitation of fitting a single model across the whole price range.
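The points behind a Q-Q plot can be computed with the standard library alone: standardize the residuals, sort them, and pair each with the normal quantile at the same rank. This is a sketch of the general technique. The plotting position `(i + 0.5) / n` is one common convention and an assumption here.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Return (theoretical normal quantile, standardized residual) pairs."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    ordered = sorted((r - mu) / sigma for r in residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))
```

Points near the diagonal indicate normal-like residuals. Heavy tails show up as the first and last points bending away from it, with observed values more extreme than the theoretical quantiles.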

Model reliability boundaries: The model performs well for standard residential properties but shows higher error for ultra-luxury properties (>$5M), unique architectural properties, and rare property type combinations. A separate model trained specifically on luxury listings would address this gap.
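One way to quantify this boundary is to report error separately for each price band. The sketch below splits the test set at the $5M threshold mentioned above. The function name and band labels are hypothetical.

```python
def mape_by_band(y_true, y_pred, luxury_threshold=5_000_000):
    """MAPE (in percent) for standard and luxury properties, split by true price."""
    errors = {"standard": [], "luxury": []}
    for actual, predicted in zip(y_true, y_pred):
        band = "luxury" if actual > luxury_threshold else "standard"
        errors[band].append(abs(actual - predicted) / actual)
    # Bands with no properties are left out rather than reported as zero.
    return {band: 100 * sum(e) / len(e) for band, e in errors.items() if e}
```

A large gap between the two bands would support training a separate model on luxury listings.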

Key Insights

Location Dominates

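Zipcode and street characteristics dominate the importance ranking: seven of the top 10 features involve one or the other, while raw bedroom count and raw living area do not appear in the top 10 on their own. Size still matters through interaction terms such as living_area_times_baths (#1), but in NYC the address carries much of the price signal.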

Size Interactions Matter

Living area × bathrooms captures a quality signal that neither variable alone can express — large apartments with few bathrooms vs. many bathrooms signal different price tiers.

Street Type Commands Premiums

Avenue properties consistently command premium prices over streets in the same neighborhood. The Manhattan grid — East/West designation — further differentiates pricing within zipcodes.

Luxury Remains Hard to Predict

Properties above $5M show higher prediction error — luxury real estate pricing depends on unique features and negotiation dynamics that structured data can't fully capture.

Live Demo

The LightGBM model is deployed on Hugging Face Spaces. Enter property details to get an instant price estimate:

Live NYC property price predictor — enter borough, bedrooms, bathrooms, living area, and street to get an instant ML-powered valuation.

Tech Stack

LightGBM: Primary Model
CatBoost: Comparison
ExtraTrees: Comparison
Pandas: Data Processing
Beautiful Soup: Web Scraping
HuggingFace: Deployment
Python · LightGBM · CatBoost · Scikit-learn · Beautiful Soup · Pandas · NumPy · Feature Engineering · Web Scraping · HuggingFace Spaces · Flask
