Skip to content

Open In Colab

Data Science Foundations
X2: Airbnb

Instructor: Wesley Beckner

Contact: wesleybeckner@gmail.com



Today we are going to take our newfound knowledge from the course, and practice how we can leverage data to build predictive models. We'll start with a feature engineering problem on some dummy data. This will get us thinking creatively about problem solving. We will then pivot over to an Airbnb dataset. After performing some general, exploratory data analysis, we will solve the following business case: Airbnb is interested in using historical list prices from their airbnb hosts, to make pricing suggestions to new hosts. How can we use this existing datset to assist with this price listing suggestion?



Prepare Environment and Import Data

# basic packages
import pandas as pd
import numpy as np
import random
import copy

# visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns; sns.set()
import graphviz 

# stats packages
import scipy.stats as stats
from scipy.spatial.distance import cdist
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

# sklearn preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# sklearn modeling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.mixture import GaussianMixture

# sklearn evaluation
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score

Warm Up

Add aditional feature(s) to X to predict y with a linear classifier (e.g. logistic regression)

from sklearn.datasets import make_circles
X, y = make_circles(random_state=42, noise=.01)
relabel = dict(zip([0,1,2,3],[0,1,0,1]))
y = np.vectorize(relabel.get)(y)
plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis')
<matplotlib.collections.PathCollection at 0x7f20ec66d280>

png

and now predict

# consider using
# LogisticRegression()
# r2_score

Exploratory Data Analysis

which columns are numerical, string; which contain nans/nulls; what is the correlation between features

airbnb = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/datasets/main/datasets/airbnb/AB_NYC_2019.csv")
# recall these attributes
# shape dtypes

# and these methods
# head() tail() isnull() sum() nunique() copy() select_dtypes() describe() drop()

what visualizations would be useful to make?

plt.figure(figsize=(10,6))
sns.scatterplot(x=airbnb.longitude,y=airbnb.latitude,hue=airbnb.neighbourhood_group)
<AxesSubplot:xlabel='longitude', ylabel='latitude'>

png

Feature Engineering

Say we want to predict price, using an ML model. How would you build your features?

Based on the number of null values, what would you do with the last_review and reviews_per_month column?

X = airbnb.copy()

How will we deal with the categorical features?

# Recall
# OneHotEncoder()
X_cat = X.select_dtypes(include='object')
display(X_cat.head())
print(X_cat.nunique())
name host_name neighbourhood_group neighbourhood room_type last_review
0 Clean & quiet apt home by the park John Brooklyn Kensington Private room 2018-10-19
1 Skylit Midtown Castle Jennifer Manhattan Midtown Entire home/apt 2019-05-21
2 THE VILLAGE OF HARLEM....NEW YORK ! Elisabeth Manhattan Harlem Private room NaN
3 Cozy Entire Floor of Brownstone LisaRoxanne Brooklyn Clinton Hill Entire home/apt 2019-07-05
4 Entire Apt: Spacious Studio/Loft by central park Laura Manhattan East Harlem Entire home/apt 2018-11-19
name                   47905
host_name              11452
neighbourhood_group        5
neighbourhood            221
room_type                  3
last_review             1764
dtype: int64
X_num = X.select_dtypes(exclude='object')
X_num.head()
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
0 2539 2787 40.64749 -73.97237 149 1 9 0.21 6 365
1 2595 2845 40.75362 -73.98377 225 1 45 0.38 2 355
2 3647 4632 40.80902 -73.94190 150 3 0 NaN 1 365
3 3831 4869 40.68514 -73.95976 89 1 270 4.64 1 194
4 5022 7192 40.79851 -73.94399 80 10 9 0.10 1 0

Feature Transformation

What features do you think will cause the most problems if untransformed?

Scale and Center all but the target variable, price

# Recall
# StandardScaler() df.pop()

Model Baseline

# Recall
# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
# LinearRegression()
# r2_score()

try a model that captures non-linear relationships

# Recall
# RandomForestRegressor()

both of these results from the LinearRegression and RandomForest models should indicate something to you (think back to Model Selection and Validation)

Additional Strategies

After this first pass, what are some additional strategies to consider for improving the model?