Data Science Foundations
Lab 3: Practice with Feature Engineering and Pipelines
Instructor: Wesley Beckner
Contact: wesleybeckner@gmail.com
In this lab we will continue to practice building pipelines and engineering features, using the wine quality dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns; sns.set()
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
wine = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/"\
"ds_for_engineers/main/data/wine_quality/winequalityN.csv")
🍇 L3 Q1: Feature Derivation
- Fill in any missing data in your dataset using imputation, and use this imputed data for Q2 and Q3
- One-hot encode the categorical variables in the wine dataset
# Code Cell for L3 Q1
display(wine.head())
print(wine.shape)

# impute missing numeric values with column means, writing them back into wine
str_cols = ['type']
num_cols = wine.columns.drop(str_cols)
imp = SimpleImputer(strategy='mean')
wine[num_cols] = imp.fit_transform(wine[num_cols])

# one-hot encode the categorical 'type' column; .toarray() densifies the sparse output
enc = OneHotEncoder()
onehot = enc.fit_transform(wine[str_cols]).toarray()
|   | type | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | white | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | white | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | white | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
(6497, 13)
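One way to finish the encoding step is to append the one-hot columns to the imputed numeric data and drop the original type column. A minimal sketch, assuming the wine_enc name and modern sklearn's get_feature_names_out (older versions use get_feature_names instead):

# combine the imputed numeric columns with the one-hot encoded 'type'
wine_enc = pd.concat(
    [wine.drop(columns=str_cols),
     pd.DataFrame(onehot, columns=enc.get_feature_names_out(str_cols))],
    axis=1)
print(wine_enc.shape)  # expect (6497, 14): 12 numeric + 2 one-hot columns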
🍾 L3 Q2: Feature Transformation
Use StandardScaler on the input data and evaluate how this affects VIF, kurtosis, and skew. You should ignore the one-hot encoded column(s) for this section.
# numeric feature columns: drop the categorical 'type' column
# and 'density' (the Q3 regression target)
cols = list(wine.columns)
cols.remove('density')
cols.remove('type')
# Code Cell for L3 Q2
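# A sketch of one way to produce the two tables below; the helper name
# vif_summary is our own, not from the lab's official solution.
def vif_summary(df):
    out = pd.DataFrame(index=df.columns)
    out['VIF Factor'] = [variance_inflation_factor(df.values, i)
                         for i in range(df.shape[1])]
    out['kurtosis'] = df.kurtosis()  # pandas defaults; scipy's estimators may differ slightly
    out['skew'] = df.skew()
    return out

display(vif_summary(wine[cols]))   # raw features

scaled = pd.DataFrame(StandardScaler().fit_transform(wine[cols]), columns=cols)
display(vif_summary(scaled))       # standardized features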
Before scaling (raw features):

|   | VIF Factor | kurtosis | skew |
|---|---|---|---|
| fixed acidity | 41.790949 | 5.070143 | 1.724131 |
| volatile acidity | 9.482732 | 2.834263 | 1.496433 |
| citric acid | 9.344218 | 2.404077 | 0.473142 |
| residual sugar | 3.336944 | 4.360399 | 1.435221 |
| chlorides | 5.398369 | 50.911457 | 5.400680 |
| free sulfur dioxide | 8.529778 | 7.906238 | 1.220066 |
| total sulfur dioxide | 13.448130 | -0.371664 | -0.001177 |
| pH | 149.003349 | 0.374743 | 0.387234 |
| sulphates | 18.402953 | 8.667071 | 1.799021 |
| alcohol | 114.836088 | -0.531687 | 0.565718 |
| quality | 63.455488 | 0.232322 | 0.189623 |
After StandardScaler:

|   | VIF Factor | kurtosis | skew |
|---|---|---|---|
| fixed acidity | 1.781336 | 5.070143 | 1.724131 |
| volatile acidity | 1.808525 | 2.834263 | 1.496433 |
| citric acid | 1.606484 | 2.404077 | 0.473142 |
| residual sugar | 1.533403 | 4.360399 | 1.435221 |
| chlorides | 1.564413 | 50.911457 | 5.400680 |
| free sulfur dioxide | 2.156598 | 7.906238 | 1.220066 |
| total sulfur dioxide | 2.872586 | -0.371664 | -0.001177 |
| pH | 1.413100 | 0.374743 | 0.387234 |
| sulphates | 1.364157 | 8.667071 | 1.799021 |
| alcohol | 1.696986 | -0.531687 | 0.565718 |
| quality | 1.408210 | 0.232322 | 0.189623 |
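Standardizing leaves kurtosis and skew untouched: StandardScaler only shifts and rescales each column, a linear transformation that does not change distribution shape. The VIF values, however, drop from double or triple digits to below 3, because the raw-data figures are inflated by the uncentered columns' shared mean component rather than by genuine collinearity among the features.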
🍷 L3 Q3: Modeling
Create a Pipeline using one of the scaling methods in sklearn and linear or logistic regression
If you are using logistic regression:
- dependent variable: wine quality
If you are using linear regression:
- dependent variable: wine density
# Code Cell for L3 Q3
from sklearn.pipeline import make_pipeline
X = wine[cols]  # numeric features from Q2 (excludes 'type' and 'density')
y = wine['density']
model = make_pipeline(StandardScaler(), LinearRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
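# Fit the pipeline and visualize test-set predictions. This is a sketch of
# how the title below might have been produced; the plotting details are an
# assumption, not necessarily the lab's official solution.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.3)
ax.set_xlabel('actual density')
ax.set_ylabel('predicted density')
ax.set_title('Test, R2: {:.3f}'.format(r2_score(y_test, y_pred)))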
Text(0.5, 1.0, 'Test, R2: 0.963')
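Because the scaler sits inside the pipeline, it is fit on the training split only, so no test-set information leaks into the transform; the held-out R2 of about 0.96 indicates the remaining numeric features explain most of the variance in density.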