Python Foundations, Lab 4: Practice with Pandas¶
Instructor: Wesley Beckner
Contact: wesleybeckner@gmail.com
In this lab we will continue to practice manipulating pandas DataFrames.
import pandas as pd
import numpy as np
from google.colab import data_table
data_table.disable_dataframe_formatter()
🐼 L4 Q1¶
Convert the two series into the columns of a DataFrame
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series([i for i in range(26)])
pd.DataFrame({'col1': ser1,'col2': ser2})
col1 | col2 | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | c | 2 |
3 | e | 3 |
4 | d | 4 |
5 | f | 5 |
6 | g | 6 |
7 | h | 7 |
8 | i | 8 |
9 | j | 9 |
10 | k | 10 |
11 | l | 11 |
12 | m | 12 |
13 | n | 13 |
14 | o | 14 |
15 | p | 15 |
16 | q | 16 |
17 | r | 17 |
18 | s | 18 |
19 | t | 19 |
20 | u | 20 |
21 | v | 21 |
22 | w | 22 |
23 | x | 23 |
24 | y | 24 |
25 | z | 25 |
🐼 L4 Q2¶
Convert the series into a DataFrame with 7 rows and 5 columns
ser = pd.Series(np.random.randint(1, 10, 35))
x = []
y = []
j = 0
for i in range(7):
for k in range(5):
x.append(ser[j])
j+=1
y.append(x)
x = []
pd.DataFrame(y)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 4 | 6 | 3 | 8 | 6 |
1 | 4 | 1 | 2 | 2 | 7 |
2 | 9 | 6 | 7 | 8 | 2 |
3 | 2 | 5 | 2 | 5 | 4 |
4 | 2 | 6 | 2 | 3 | 7 |
5 | 9 | 3 | 3 | 8 | 9 |
6 | 7 | 4 | 1 | 8 | 6 |
🐼 L4 Q3¶
Compute the difference of differences between consecutive numbers in a series using ser.diff()
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
ser.diff().diff()
0 NaN
1 NaN
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 2.0
dtype: float64
🐼 L4 Q4¶
Convert a series of dates to datetime
format using pd.to_datetime()
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
pd.to_datetime(ser)
0 2010-01-01 00:00:00
1 2011-02-02 00:00:00
2 2012-03-03 00:00:00
3 2013-04-04 00:00:00
4 2014-05-05 00:00:00
5 2015-06-06 12:20:00
dtype: datetime64[ns]
🐼 L4 Q5¶
Compute the mean of weights grouped by fruit
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
['banana', 'banana', 'carrot', 'carrot', 'carrot', 'banana', 'carrot', 'apple', 'carrot', 'carrot']
df = pd.DataFrame([fruit,weights]).T
df.columns = ['fruit', 'weight']
df['weight'] = df['weight'].astype(int)
df.groupby('fruit').mean()
weight | |
---|---|
fruit | |
apple | 8.000000 |
banana | 3.000000 |
carrot | 6.333333 |
🐼 L4 Q6¶
Compute the euclidian distance between vectors p and q
Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors
This is related to the L2 vector norm
import math
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
math.sqrt(((p-q)**2).sum())
18.16590212458495
🐼 L4 Q7¶
Fill in missing values for dates with the previous dates' value using ser.bfill()
or ser.ffill()
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
print(ser)
2000-01-01 1.0
2000-01-03 10.0
2000-01-06 3.0
2000-01-08 NaN
dtype: float64
ser.ffill()
2000-01-01 1.0
2000-01-03 10.0
2000-01-06 3.0
2000-01-08 3.0
dtype: float64
🐼 L4 Q8¶
Check if there are missing values in each column of a dataframe using .isnull()
, and .any()
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
df.isnull().any(axis=0)
Manufacturer True
Model True
Type True
Min.Price True
Price True
Max.Price True
MPG.city True
MPG.highway True
AirBags True
DriveTrain True
Cylinders True
EngineSize True
Horsepower True
RPM True
Rev.per.mile True
Man.trans.avail True
Fuel.tank.capacity True
Passengers True
Length True
Wheelbase True
Width True
Turn.circle True
Rear.seat.room True
Luggage.room True
Weight True
Origin True
Make True
dtype: bool
🐼 L4 Q9¶
Grab the first column and return it as a DataFrame rather than as a series
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[['a']]
a | |
---|---|
0 | 0 |
1 | 5 |
2 | 10 |
3 | 15 |
🐼 L4 Q10¶
In df
, interchange columns 'a' and 'c'.
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[['c','b','a','d','e']]
c | b | a | d | e | |
---|---|---|---|---|---|
0 | 2 | 1 | 0 | 3 | 4 |
1 | 7 | 6 | 5 | 8 | 9 |
2 | 12 | 11 | 10 | 13 | 14 |
3 | 17 | 16 | 15 | 18 | 19 |