Academic · Vanderbilt

Airbnb Rating Classification

Developer · 2023–2024

⚡ 30+ U.S. cities · large-scale PySpark

Problem

Many Airbnb listings have no guest ratings yet, so there is no direct signal for how they will perform. We wanted to know whether listing attributes alone (location, amenities, pricing, host characteristics) could indicate which performance tier a listing is likely to land in before a single review comes in.

What I Built

01 The dataset covered 30+ U.S. cities and was large enough that local processing wasn't practical. We used PySpark on Databricks and Azure from the start to handle the volume.
02 Feature engineering covered room and bathroom counts, amenity presence flags, pricing tiers, host response rate, and listing age. The goal was to capture both property-level signals and host behavior in the same model.
03 We defined rating tiers from guest review score distributions, then trained XGBoost and Random Forest classifiers with cross-validation to see which generalized better across different city markets.
04 We evaluated performance city by city rather than just on overall accuracy. A model that works in New York but fails in Nashville isn't actually generalizing, and that distinction mattered for the conclusions we were drawing.
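
To make steps 02 and 03 concrete, here is a minimal sketch in plain Python. It stands in for the PySpark pipeline and is not the project code. The field names, the tracked amenities, the snapshot date, and the three-tier tercile split are all assumptions chosen for illustration; the actual cut points used in the project are not stated above.

```python
from datetime import date
from statistics import quantiles

# Hypothetical listing records; every field name and value here is made up.
listings = [
    {"amenities": ["Wifi", "Kitchen"], "price": 95.0,
     "first_listed": date(2019, 5, 1), "review_score": 4.2},
    {"amenities": ["Wifi"], "price": 140.0,
     "first_listed": date(2021, 3, 15), "review_score": 4.5},
    {"amenities": ["Wifi", "Pool"], "price": 310.0,
     "first_listed": date(2018, 8, 20), "review_score": 4.7},
    {"amenities": ["Kitchen"], "price": 80.0,
     "first_listed": date(2022, 1, 10), "review_score": 4.8},
    {"amenities": ["Wifi", "Kitchen", "Pool"], "price": 220.0,
     "first_listed": date(2020, 11, 5), "review_score": 4.9},
    {"amenities": [], "price": 60.0,
     "first_listed": date(2023, 6, 1), "review_score": 5.0},
]

TRACKED_AMENITIES = ("Wifi", "Kitchen", "Pool")  # assumed subset of amenities
TIER_NAMES = ("low", "mid", "high")              # assumed three-tier scheme


def engineer_features(listing, as_of):
    """Flatten one raw listing into numeric model features."""
    features = {
        "price": listing["price"],
        # Listing age in days, measured against a fixed snapshot date.
        "listing_age_days": (as_of - listing["first_listed"]).days,
    }
    # One 0/1 presence flag per tracked amenity.
    for name in TRACKED_AMENITIES:
        features[f"has_{name.lower()}"] = int(name in listing["amenities"])
    return features


def tier_cut_points(scores):
    """Return the two tercile boundaries of the observed review scores."""
    return quantiles(scores, n=3)


def assign_tier(score, cuts):
    """Map a review score to a tier by counting the boundaries it exceeds."""
    return TIER_NAMES[sum(score > cut for cut in cuts)]


snapshot = date(2024, 1, 1)
cuts = tier_cut_points([row["review_score"] for row in listings])
rows = [
    {**engineer_features(row, snapshot), "tier": assign_tier(row["review_score"], cuts)}
    for row in listings
]
```

Deriving the boundaries from the observed score distribution, rather than hard-coding thresholds, keeps the tiers roughly balanced even when review scores cluster near the top of the scale. Unrated listings would only get features, since the tier is what the classifier is asked to predict for them.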

Architecture

Airbnb listing data (30+ U.S. cities)
  → PySpark feature engineering (Databricks/Azure)
  → Features: room/bath counts, amenities, price, host attrs
  → Rating tier definition from review score distributions
  → XGBoost vs Random Forest (cross-validated)
  → Cross-city generalization evaluation
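
The last stage of the pipeline, cross-city evaluation, can be sketched as a per-city accuracy breakdown. This is a plain-Python illustration under assumptions, not the project's evaluation code: the `accuracy_by_city` helper is hypothetical, and the records are invented purely to exercise it. They are not project results.

```python
from collections import defaultdict


def accuracy_by_city(records):
    """Compute accuracy per city from (city, true_tier, predicted_tier) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for city, truth, predicted in records:
        totals[city] += 1
        hits[city] += int(truth == predicted)
    return {city: hits[city] / totals[city] for city in totals}


# Invented predictions for illustration only.
records = [
    ("New York", "high", "high"),
    ("New York", "mid", "mid"),
    ("New York", "low", "low"),
    ("New York", "high", "high"),
    ("Nashville", "high", "mid"),
    ("Nashville", "low", "low"),
]

per_city = accuracy_by_city(records)
overall = sum(truth == predicted for _, truth, predicted in records) / len(records)
worst_city = min(per_city, key=per_city.get)
```

In this toy data the overall accuracy is 5/6 while Nashville sits at 0.5, which is the kind of gap a single pooled metric hides when larger markets dominate the sample.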

Results

Listing attributes alone turned out to be meaningful predictors of performance across all 30+ city markets, without needing any prior reviews. XGBoost generalized better across cities than Random Forest, which made it the stronger choice for this kind of multi-market setting.

Stack

PySpark · XGBoost · Random Forest · Databricks · Azure · Python · Scikit-learn

GitHub
