Data Science Foundations
X2: Airbnb¶
Instructor: Wesley Beckner
Contact: wesleybeckner@gmail.com
Today we are going to take our newfound knowledge from the course and practice how we can leverage data to build predictive models. We'll start with a feature engineering problem on some dummy data, which will get us thinking creatively about problem solving. We will then pivot over to an Airbnb dataset. After performing some general exploratory data analysis, we will tackle the following business case: Airbnb is interested in using historical list prices from its hosts to make pricing suggestions to new hosts. How can we use this existing dataset to assist with price listing suggestions?
Prepare Environment and Import Data¶
# basic packages
import pandas as pd
import numpy as np
import random
import copy
# visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns; sns.set()
import graphviz
# stats packages
import scipy.stats as stats
from scipy.spatial.distance import cdist
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
# sklearn preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# sklearn modeling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.mixture import GaussianMixture
# sklearn evaluation
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score
Warm Up¶
Add additional feature(s) to X to predict y with a linear classifier (e.g. logistic regression)
from sklearn.datasets import make_circles
X, y = make_circles(random_state=42, noise=.01)
relabel = dict(zip([0,1,2,3],[0,1,0,1]))  # a no-op here: make_circles already returns labels 0 and 1
y = np.vectorize(relabel.get)(y)
plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis')
[scatter plot: two concentric rings of points, colored by class label]
And now predict:
# consider using
# LogisticRegression()
# accuracy_score()
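One sketch of a solution (the squared radius is a natural feature for concentric circles, but not the only choice): append x₁² + x₂² as a third column, after which the two classes become linearly separable. This reuses X, y, and the imports from the setup cell.
# engineer a squared-radius feature: distance from the origin separates the rings
X_aug = np.hstack([X, (X[:, 0]**2 + X[:, 1]**2).reshape(-1, 1)])
X_train, X_test, y_train, y_test = train_test_split(
    X_aug, y, train_size=0.8, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # should be near 1.0 on this easy dataset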
Exploratory Data Analysis¶
Which columns are numerical and which are strings? Which contain NaNs/nulls? What is the correlation between features?
airbnb = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/datasets/main/datasets/airbnb/AB_NYC_2019.csv")
# recall these attributes
# shape dtypes
# and these methods
# head() tail() isnull() sum() nunique() copy() select_dtypes() describe() drop()
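A minimal first look, using the attributes and methods listed above:
# quick structural overview of the dataset
print(airbnb.shape)           # rows, columns
print(airbnb.dtypes)          # numeric vs. string (object) columns
print(airbnb.isnull().sum())  # null counts per column
print(airbnb.nunique())       # cardinality of each column
# correlation between the numeric features
airbnb.select_dtypes(exclude='object').corr()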
What visualizations would be useful to make?
plt.figure(figsize=(10,6))
sns.scatterplot(x=airbnb.longitude,y=airbnb.latitude,hue=airbnb.neighbourhood_group)
[map-like scatter of listings by longitude/latitude, colored by neighbourhood_group]
Feature Engineering¶
Say we want to predict price using an ML model. How would you build your features?
Based on the number of null values, what would you do with the last_review and reviews_per_month columns?
X = airbnb.copy()
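One reasonable treatment, sketched on a working copy (the name X_clean is just illustrative): last_review is a date string with thousands of nulls, so we drop it; reviews_per_month is null exactly where a listing has zero reviews (see row 2 of the tables below), so filling with 0 is defensible.
# sketch: handle the two null-heavy review columns on a copy
X_clean = X.drop(columns=['last_review'])  # sparse date column
X_clean['reviews_per_month'] = X_clean['reviews_per_month'].fillna(0)  # no reviews -> 0 per month
X_clean = X_clean.dropna()  # drop the few remaining rows with nulls (e.g. missing name)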
How will we deal with the categorical features?
# Recall
# OneHotEncoder()
X_cat = X.select_dtypes(include='object')
display(X_cat.head())
print(X_cat.nunique())
|    | name | host_name | neighbourhood_group | neighbourhood | room_type | last_review |
|---|---|---|---|---|---|---|
| 0 | Clean & quiet apt home by the park | John | Brooklyn | Kensington | Private room | 2018-10-19 |
| 1 | Skylit Midtown Castle | Jennifer | Manhattan | Midtown | Entire home/apt | 2019-05-21 |
| 2 | THE VILLAGE OF HARLEM....NEW YORK ! | Elisabeth | Manhattan | Harlem | Private room | NaN |
| 3 | Cozy Entire Floor of Brownstone | LisaRoxanne | Brooklyn | Clinton Hill | Entire home/apt | 2019-07-05 |
| 4 | Entire Apt: Spacious Studio/Loft by central park | Laura | Manhattan | East Harlem | Entire home/apt | 2018-11-19 |
name 47905
host_name 11452
neighbourhood_group 5
neighbourhood 221
room_type 3
last_review 1764
dtype: int64
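One hedged path, continuing the X_clean sketch from above: one-hot encode only the low-cardinality columns. name and host_name have tens of thousands of levels each and are better dropped (or mined for keywords later).
# one-hot encode the three low-cardinality categorical columns
cat_cols = ['neighbourhood_group', 'neighbourhood', 'room_type']
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse=False on older scikit-learn
X_ohe = pd.DataFrame(ohe.fit_transform(X_clean[cat_cols]),
                     columns=ohe.get_feature_names_out(cat_cols),
                     index=X_clean.index)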
X_num = X.select_dtypes(exclude='object')
X_num.head()
|    | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | 2787 | 40.64749 | -73.97237 | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | 2595 | 2845 | 40.75362 | -73.98377 | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | 3647 | 4632 | 40.80902 | -73.94190 | 150 | 3 | 0 | NaN | 1 | 365 |
| 3 | 3831 | 4869 | 40.68514 | -73.95976 | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | 5022 | 7192 | 40.79851 | -73.94399 | 80 | 10 | 9 | 0.10 | 1 | 0 |
Feature Transformation¶
What features do you think will cause the most problems if untransformed?
Scale and center all features except the target variable, price.
# Recall
# StandardScaler() df.pop()
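A sketch continuing from the X_clean and X_ohe names assumed above:
# numeric features minus the identifier columns; pop() removes and returns the target
X_num_clean = X_clean.select_dtypes(exclude='object').drop(columns=['id', 'host_id'])
y = X_num_clean.pop('price')
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_num_clean),
                        columns=X_num_clean.columns, index=X_num_clean.index)
# recombine the scaled numerics with the one-hot columns
X_model = pd.concat([X_scaled, X_ohe], axis=1)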
Model Baseline¶
# Recall
# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
# LinearRegression()
# r2_score()
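A minimal baseline sketch with the pieces assembled above:
# hold out 20% of listings, fit a linear baseline, and score on the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_model, y, train_size=0.8, random_state=42)
lin = LinearRegression()
lin.fit(X_train, y_train)
print(r2_score(y_test, lin.predict(X_test)))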
Try a model that captures non-linear relationships.
# Recall
# RandomForestRegressor()
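For example (the hyperparameter values here are just starting points):
# a tree ensemble can capture non-linear structure a linear model misses
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(r2_score(y_test, rf.predict(X_test)))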
Both of these results, from the LinearRegression and RandomForestRegressor models, should indicate something to you (think back to Model Selection and Validation).
Additional Strategies¶
After this first pass, what are some additional strategies to consider for improving the model?
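A few levers worth trying, sketched below with assumed settings: log-transform the long-tailed price target (and/or drop extreme outliers), and tune hyperparameters with cross-validation via GridSearchCV.
# example: a small grid search on the random forest, fit against log-price
param_grid = {'n_estimators': [100, 300], 'max_depth': [10, 20, None]}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=3, scoring='r2', n_jobs=-1)
grid.fit(X_train, np.log1p(y_train))  # log1p tames the skewed target
print(grid.best_params_, grid.best_score_)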