Airbnb Rating Classification
Developer · 2023–2024
⚡ 30+ U.S. cities · large-scale PySpark
Problem
A lot of Airbnb listings have no guest ratings yet, which makes it hard to predict how they'll perform. We wanted to know whether listing attributes alone — location, amenities, pricing, host characteristics — could tell you which performance tier a listing is likely to land in before a single review comes in.
What I Built
- 01 The dataset covered 30+ U.S. cities and was large enough that local processing wasn't practical. We used PySpark on Databricks and Azure from the start to handle the volume.
- 02 Feature engineering covered room and bathroom counts, amenity presence flags, pricing tiers, host response rate, and listing age. The goal was to capture both property-level signals and host behavior in the same model.
- 03 We defined rating tiers from guest review score distributions, then trained XGBoost and Random Forest classifiers with cross-validation to see which generalized better across different city markets.
- 04 We evaluated performance city by city rather than just on overall accuracy. A model that works in New York but fails in Nashville isn't actually generalizing, and that distinction mattered for the conclusions we were drawing.
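To make the feature and tier steps concrete, here is a minimal sketch in plain Python rather than the PySpark used in the project. The field names, the three-item amenity vocabulary, the sample scores, and the choice of three equal-sized tiers are all assumptions for illustration, not the project's actual schema or cutoffs.

```python
from datetime import date
from statistics import quantiles

# Hypothetical amenity vocabulary; the real project's list is not shown here.
AMENITIES = ("wifi", "kitchen", "parking")


def build_features(listing, today):
    """Turn one raw listing dict into a flat feature dict."""
    amenities = {a.lower() for a in listing["amenities"]}
    # One 0/1 presence flag per amenity in the vocabulary.
    features = {f"has_{a}": int(a in amenities) for a in AMENITIES}
    features["bedrooms"] = listing["bedrooms"]
    features["bathrooms"] = listing["bathrooms"]
    features["host_response_rate"] = listing["host_response_rate"]
    features["listing_age_days"] = (today - listing["first_listed"]).days
    return features


def tier_cutoffs(scores, n_tiers=3):
    """Return the n_tiers - 1 cut points that split scores into equal-sized tiers."""
    return quantiles(scores, n=n_tiers, method="inclusive")


def assign_tier(score, cutoffs):
    """Map a score to a tier index, where 0 is the lowest tier."""
    return sum(score > c for c in cutoffs)


# Made-up example inputs, not data from the project.
listing = {
    "amenities": ["Wifi", "Kitchen"],
    "bedrooms": 2,
    "bathrooms": 1.5,
    "host_response_rate": 0.95,
    "first_listed": date(2023, 1, 1),
}
features = build_features(listing, today=date(2024, 1, 1))

scores = [4.1, 4.5, 4.7, 4.8, 4.85, 4.9, 4.95, 5.0, 5.0]
cutoffs = tier_cutoffs(scores)
tiers = [assign_tier(s, cutoffs) for s in scores]
```

Deriving the cut points from the score distribution, rather than fixing them by hand, keeps the tiers roughly balanced even though review scores cluster near the top of the scale. The same quantile helper could bucket prices into pricing tiers.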
Airbnb listing data (30+ U.S. cities) → PySpark feature engineering (Databricks/Azure) → Features: room/bath counts, amenities, price, host attrs → Rating tier definition from review score distributions → XGBoost vs Random Forest (cross-validated) → Cross-city generalization evaluation
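The last stage of the pipeline, cross-city evaluation, can be sketched as a per-city accuracy breakdown. This is a plain-Python illustration with made-up labels and a hypothetical helper name (`accuracy_by_city`); it is not the project's evaluation code.

```python
from collections import defaultdict


def accuracy_by_city(cities, y_true, y_pred):
    """Return {city: accuracy} so weak markets are not hidden by the average."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for city, truth, pred in zip(cities, y_true, y_pred):
        total[city] += 1
        correct[city] += int(truth == pred)
    return {city: correct[city] / total[city] for city in total}


# Made-up tier labels and predictions for two markets.
cities = ["New York", "New York", "New York", "New York", "Nashville", "Nashville"]
y_true = [2, 1, 0, 2, 1, 0]
y_pred = [2, 1, 0, 2, 0, 1]

per_city = accuracy_by_city(cities, y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst_city = min(per_city, key=per_city.get)
```

In this toy example the overall accuracy looks acceptable because the larger market dominates the average, while the smaller market is predicted wrong every time. Reporting the per-city numbers, or at least the worst city, exposes that gap.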
Results
Listing attributes alone turned out to be meaningful predictors of performance across all 30+ city markets, without needing any prior reviews. XGBoost generalized better across cities than Random Forest, which made it the stronger choice for this kind of multi-market setting.
Stack
PySpark · XGBoost · Random Forest · Databricks · Azure · Python · Scikit-learn