Academic · Vanderbilt

Airbnb Rating Classification

Developer · 2023–2024

30+ U.S. cities · large-scale PySpark

Problem

Many Airbnb listings have no guest ratings yet, which makes their performance hard to predict. We wanted to know whether listing attributes alone (location, amenities, pricing, host characteristics) could indicate which performance tier a listing is likely to land in before a single review comes in.


What I Built

Airbnb listing data (30+ U.S. cities)
  → PySpark feature engineering (Databricks/Azure)
  → Features: room/bath counts, amenities, price, host attrs
  → Rating tier definition from review score distributions
  → XGBoost vs Random Forest (cross-validated)
  → Cross-city generalization evaluation
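The "rating tier definition from review score distributions" step can be illustrated with a small standard-library sketch. The page does not say how many tiers were used or where the cut points fell, so the three equal-sized tiers, the function names `tier_cutoffs` and `assign_tier`, and the sample scores below are all assumptions made for illustration. On the full dataset the same idea would more likely be expressed with PySpark's `DataFrame.approxQuantile` or `QuantileDiscretizer`.

```python
from bisect import bisect_right
from statistics import quantiles


def tier_cutoffs(scores, n_tiers=3):
    """Return the n_tiers - 1 cut points that split scores into equal-sized groups."""
    return quantiles(scores, n=n_tiers, method="inclusive")


def assign_tier(score, cutoffs):
    """Map a score to a tier index: 0 is the lowest tier, len(cutoffs) the highest."""
    # A score exactly equal to a cut point falls into the upper tier.
    return bisect_right(cutoffs, score)


# Hypothetical review scores (illustrative values, not project data).
scores = [4.2, 4.5, 4.7, 4.8, 4.85, 4.9, 4.92, 4.95, 5.0]

cutoffs = tier_cutoffs(scores)                      # two cut points for three tiers
tiers = [assign_tier(s, cutoffs) for s in scores]   # one tier label per listing
```

Deriving the cut points from the observed distribution, rather than fixing them by hand, keeps the tiers roughly balanced even though review scores tend to cluster near the top of the scale.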

Results

Listing attributes alone turned out to be meaningful predictors of performance across all 30+ city markets, without needing any prior reviews. XGBoost generalized better across cities than Random Forest, which made it the stronger choice for this kind of multi-market setting.
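One way to measure cross-city generalization is to hold out one city at a time, train on the rest, and score on the held-out city. The page does not describe the exact protocol, so the sketch below is an assumption, and all names in it (`leave_one_city_out`, `cross_city_scores`, the majority-class baseline, the toy data) are hypothetical. The baseline stands in for a real model; an XGBoost or Random Forest classifier would be plugged in through the `fit` and `predict` arguments. Scikit-learn's `LeaveOneGroupOut` provides the same kind of split.

```python
from collections import Counter, defaultdict


def leave_one_city_out(cities):
    """Yield (held_out_city, train_idx, test_idx), holding out one city at a time."""
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for held_out in sorted(by_city):
        test_idx = by_city[held_out]
        train_idx = sorted(
            i for city, idx in by_city.items() if city != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_city_scores(cities, X, y, fit, predict):
    """Train on every other city, score on the held-out one; returns {city: accuracy}."""
    scores = {}
    for city, train_idx, test_idx in leave_one_city_out(cities):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        scores[city] = accuracy([y[i] for i in test_idx], preds)
    return scores


# Stand-in model: always predicts the most common tier seen in training.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, X_test):
    return [model] * len(X_test)


# Toy data (illustrative only): three cities, three listings each, tiers 0-2.
cities = ["austin"] * 3 + ["boston"] * 3 + ["denver"] * 3
X = [[0.0]] * 9                      # features are unused by the baseline
y = [2, 2, 1, 2, 2, 0, 2, 1, 0]

per_city = cross_city_scores(cities, X, y, fit_majority, predict_majority)
```

Comparing two models on the per-city scores, rather than on a single pooled score, shows whether one of them degrades more on markets it has not seen.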


Stack

PySpark · XGBoost · Random Forest · Databricks · Azure · Python · Scikit-learn
GitHub
