Python Foundations, Lab 4: Practice with Pandas¶

Instructor: Wesley Beckner

In this lab we will continue to practice manipulating pandas DataFrames.

import pandas as pd
import numpy as np
from google.colab import data_table
data_table.disable_dataframe_formatter()

🐼 L4 Q1¶

Convert the two series into the columns of a DataFrame

ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series([i for i in range(26)])

pd.DataFrame({'col1': ser1,'col2': ser2})

	col1	col2
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4
5	f	5
6	g	6
7	h	7
8	i	8
9	j	9
10	k	10
11	l	11
12	m	12
13	n	13
14	o	14
15	p	15
16	q	16
17	r	17
18	s	18
19	t	19
20	u	20
21	v	21
22	w	22
23	x	23
24	y	24
25	z	25

🐼 L4 Q2¶

Convert the series into a DataFrame with 7 rows and 5 columns

ser = pd.Series(np.random.randint(1, 10, 35))

x = []
y = []
j = 0
for i in range(7):
  for k in range(5):
    x.append(ser[j])
    j+=1
  y.append(x)
  x = []
pd.DataFrame(y)

	0	1	2	3	4
0	4	6	3	8	6
1	4	1	2	2	7
2	9	6	7	8	2
3	2	5	2	5	4
4	2	6	2	3	7
5	9	3	3	8	9
6	7	4	1	8	6

🐼 L4 Q3¶

Compute the difference of differences between consecutive numbers in a series using ser.diff()

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

ser.diff().diff()

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64

🐼 L4 Q4¶

Convert a series of dates to datetime format using pd.to_datetime()

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

🐼 L4 Q5¶

Compute the mean of weights grouped by fruit

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
['banana', 'banana', 'carrot', 'carrot', 'carrot', 'banana', 'carrot', 'apple', 'carrot', 'carrot']

df = pd.DataFrame([fruit,weights]).T
df.columns = ['fruit', 'weight']
df['weight'] = df['weight'].astype(int)
df.groupby('fruit').mean()

	weight
fruit
apple	8.000000
banana	3.000000
carrot	6.333333

🐼 L4 Q6¶

Compute the euclidian distance between vectors p and q

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors

This is related to the L2 vector norm

import math
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

math.sqrt(((p-q)**2).sum())

18.16590212458495

🐼 L4 Q7¶

Fill in missing values for dates with the previous dates' value using ser.bfill() or ser.ffill()

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
print(ser)

2000-01-01     1.0
2000-01-03    10.0
2000-01-06     3.0
2000-01-08     NaN
dtype: float64

ser.ffill()

2000-01-01     1.0
2000-01-03    10.0
2000-01-06     3.0
2000-01-08     3.0
dtype: float64

🐼 L4 Q8¶

Check if there are missing values in each column of a dataframe using .isnull(), and .any()

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

df.isnull().any(axis=0)

Manufacturer          True
Model                 True
Type                  True
Min.Price             True
Price                 True
Max.Price             True
MPG.city              True
MPG.highway           True
AirBags               True
DriveTrain            True
Cylinders             True
EngineSize            True
Horsepower            True
RPM                   True
Rev.per.mile          True
Man.trans.avail       True
Fuel.tank.capacity    True
Passengers            True
Length                True
Wheelbase             True
Width                 True
Turn.circle           True
Rear.seat.room        True
Luggage.room          True
Weight                True
Origin                True
Make                  True
dtype: bool

🐼 L4 Q9¶

Grab the first column and return it as a DataFrame rather than as a series

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

df[['a']]

	a
0	0
1	5
2	10
3	15

🐼 L4 Q10¶

In df, interchange columns 'a' and 'c'.

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

df[['c','b','a','d','e']]

	c	b	a	d	e
0	2	1	0	3	4
1	7	6	5	8	9
2	12	11	10	13	14
3	17	16	15	18	19

	col1	col2
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4
5	f	5
6	g	6
7	h	7
8	i	8
9	j	9
10	k	10
11	l	11
12	m	12
13	n	13
14	o	14
15	p	15
16	q	16
17	r	17
18	s	18
19	t	19
20	u	20
21	v	21
22	w	22
23	x	23
24	y	24
25	z	25

	col1	col2
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4
5	f	5
6	g	6
7	h	7
8	i	8
9	j	9
10	k	10
11	l	11
12	m	12
13	n	13
14	o	14
15	p	15
16	q	16
17	r	17
18	s	18
19	t	19
20	u	20
21	v	21
22	w	22
23	x	23
24	y	24
25	z	25

	col1	col2
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4
5	f	5
6	g	6
7	h	7
8	i	8
9	j	9
10	k	10
11	l	11
12	m	12
13	n	13
14	o	14
15	p	15
16	q	16
17	r	17
18	s	18
19	t	19
20	u	20
21	v	21
22	w	22
23	x	23
24	y	24
25	z	25