

Data Science Foundations
Lab 3: Practice with Feature Engineering and Pipelines

Instructor: Wesley Beckner

Contact: wesleybeckner@gmail.com



In this lab we will continue to practice building pipelines and engineering features, working with the wine quality dataset.



import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns; sns.set()

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor

wine = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/"\
      "ds_for_engineers/main/data/wine_quality/winequalityN.csv")

🍇 L3 Q1: Feature Derivation

  1. Fill in any missing data in the dataset using imputation, and use the imputed data for Q2-Q3
  2. One-hot encode the categorical variable(s) in the wine dataset
# Code Cell for L3 Q1
display(wine.head())
print(wine.shape)
str_cols = ['type']  # the lone categorical column
num_cols = [c for c in wine.columns if c not in str_cols]
# mean-impute the missing numeric values in place
imp = SimpleImputer()
wine[num_cols] = imp.fit_transform(wine[num_cols])
# one-hot encode 'type'; the encoder needs 2-D input, hence wine[str_cols] not wine['type']
print(wine['type'].shape)
enc = OneHotEncoder()
onehot = enc.fit_transform(wine[str_cols]).toarray()
print(wine.shape)  # the imputed frame keeps its original shape
type fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 white 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 white 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 white 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
(6497, 13)
(6497,)
(6497, 13)
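
As an optional variation, the imputer and encoder can be bundled into one reusable transformer, so the identical preprocessing can later be dropped into the Q3 pipeline. A minimal sketch with sklearn's ColumnTransformer, assuming the num_cols and str_cols lists defined in the cell above:

from sklearn.compose import ColumnTransformer

# mean-impute the numeric columns and one-hot encode 'type' in a single object
preprocess = ColumnTransformer([('num', SimpleImputer(), num_cols),
                                ('cat', OneHotEncoder(), str_cols)])
print(preprocess.fit_transform(wine).shape)  # 12 numeric + 2 one-hot columns -> (6497, 14)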

🍾 L3 Q2: Feature Transformation

Use StandardScaler on the input data and evaluate how this affects VIF, kurtosis, and skew

You should ignore the one-hot encoded column(s) for this section
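
As a reminder, StandardScaler maps each column to z = (x - mean) / std, using the population (ddof=0) standard deviation. A two-line sanity check on a single column:

z = (wine['pH'] - wine['pH'].mean()) / wine['pH'].std(ddof=0)  # what StandardScaler computes
print(round(z.mean(), 3), round(z.std(ddof=0), 3))             # ~0.0 and 1.0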

# Code Cell for L3 Q2
cols = list(wine.columns)  # the numeric, non-one-hot-encoded columns
cols.remove('density')     # density is held back as the Q3 regression target
cols.remove('type')
def summarize(df):  # VIF, kurtosis, and skew for every column of a numeric frame
    vif = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return pd.DataFrame({'VIF Factor': vif, 'kurtosis': df.kurtosis(), 'skew': df.skew()}, index=df.columns)
display(summarize(wine[cols]))  # before scaling
scaled = pd.DataFrame(StandardScaler().fit_transform(wine[cols]), columns=cols)
display(summarize(scaled))      # after StandardScaler
VIF Factor kurtosis skew
fixed acidity 41.790949 5.070143 1.724131
volatile acidity 9.482732 2.834263 1.496433
citric acid 9.344218 2.404077 0.473142
residual sugar 3.336944 4.360399 1.435221
chlorides 5.398369 50.911457 5.400680
free sulfur dioxide 8.529778 7.906238 1.220066
total sulfur dioxide 13.448130 -0.371664 -0.001177
pH 149.003349 0.374743 0.387234
sulphates 18.402953 8.667071 1.799021
alcohol 114.836088 -0.531687 0.565718
quality 63.455488 0.232322 0.189623

VIF Factor kurtosis skew
fixed acidity 1.781336 5.070143 1.724131
volatile acidity 1.808525 2.834263 1.496433
citric acid 1.606484 2.404077 0.473142
residual sugar 1.533403 4.360399 1.435221
chlorides 1.564413 50.911457 5.400680
free sulfur dioxide 2.156598 7.906238 1.220066
total sulfur dioxide 2.872586 -0.371664 -0.001177
pH 1.413100 0.374743 0.387234
sulphates 1.364157 8.667071 1.799021
alcohol 1.696986 -0.531687 0.565718
quality 1.408210 0.232322 0.189623
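
Two details in these tables deserve a comment. Kurtosis and skew are identical before and after scaling because StandardScaler is a linear transform; shifting and rescaling a column does not change the shape of its distribution. The dramatic VIF drop, on the other hand, is largely an artifact of centering: statsmodels' variance_inflation_factor assumes the design matrix contains a constant term, so on raw, uncentered data it reports inflated values. The definition is VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the remaining columns. A short sketch of that computation on the scaled frame from the cell above:

# VIF by hand: regress each scaled column on the others, then invert 1 - R^2
for col in cols:
    others = scaled.drop(columns=col)
    r2 = LinearRegression().fit(others, scaled[col]).score(others, scaled[col])
    print(f"{col}: VIF = {1 / (1 - r2):.3f}")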

🍷 L3 Q3: Modeling

Create a Pipeline using one of the scaling methods in sklearn and linear or logistic regression

If you are using logistic regression:

  • dependent variable: wine quality

If you are using linear regression:

  • dependent variable: wine density
# Code Cell for L3 Q3
from sklearn.pipeline import make_pipeline
X, y = wine[cols], wine['density']  # linear regression route: density as the target
model = make_pipeline(StandardScaler(), LinearRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
model.fit(X_train, y_train)

[png: test-set prediction plot for the fitted pipeline, title: 'Test, R2: 0.963']
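
If you took the logistic regression route instead, the same pipeline pattern applies with quality as the dependent variable. A minimal sketch that treats the integer quality scores as discrete classes and reuses the cols list from Q2 (max_iter is raised only to help convergence):

# classify quality from the scaled physicochemical features
feats = [c for c in cols if c != 'quality']
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    wine[feats], wine['quality'], train_size=0.8, random_state=42)
clf.fit(Xc_train, yc_train)
print(classification_report(yc_test, clf.predict(Xc_test)))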