学习python数据分析 五 pandas (2)

聚合

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import pandas as pd
import numpy as np

df = pd.read_excel('team.xlsx')
df = df.set_index('name')
df.groupby('team').agg(np.mean)

Q1 Q2 Q3 Q4
team
A 62.705882 37.588235 51.470588 46.058824
B 44.318182 55.363636 54.636364 51.636364
C 48.000000 54.272727 48.545455 51.227273
D 45.263158 62.684211 65.315789 63.105263
E 48.150000 50.650000 44.050000 51.650000

如果前面忘记配置name为index就执行聚合,name列无法聚合将会报出警告

1
/tmp/ipykernel_28675/1256569030.py:1: FutureWarning: The operation <function mean at 0x7ff19c51c4c0> failed on a column. If any error is raised, this will raise an exception in a future version of pandas. Drop these columns to avoid this warning.If any error is raised, this will raise an exception in a future version of

也可以聚合多个聚合函数。

1
2
3
4
5
6
7
8
9
df.groupby('team').agg({'Q1':['min','max']})
Q1
min max
team
A 9 96
B 2 97
C 1 98
D 5 80
E 4 97

分组百分位:

1
2
3
4
5
6
7
8
9
10
df.groupby('team').quantile(0.9)


Q1 Q2 Q3 Q4
team
A 88.6 75.8 85.2 85.6
B 88.6 94.4 88.0 89.7
C 90.9 92.8 86.0 78.9
D 79.0 93.6 89.4 89.6
E 89.3 91.3 66.1 93.0

多个条件聚合会产生二层索引

1
2
3
4
5
6
7
8
9
10
11
12
13
df.groupby(['team', df.mean()>=60]).count()
Q1 Q2 Q3 Q4
team
A False 14 14 14 14
True 3 3 3 3
B False 14 14 14 14
True 8 8 8 8
C False 17 17 17 17
True 5 5 5 5
D False 10 10 10 10
True 9 9 9 9
E False 15 15 15 15
True 5 5 5 5

过滤器

根据列名选择过滤

1
2
3
4
5
6
7
8
9
10
11
12
13
14
df.filter(items=['Q1', 'Q2']) # 选择两列 

Q1 Q2
0 89 21
1 36 37
2 57 60
3 93 96
4 65 49
... ... ...
95 48 59
96 21 31
97 98 93
98 11 74
99 21 43

基于正则,选择轴0为行,选择轴1为列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df.filter(regex='Q', axis=1)


Q1 Q2 Q3 Q4
0 89 21 24 64
1 36 37 37 57
2 57 60 18 84
3 93 96 71 78
4 65 49 61 86
... ... ... ... ...
95 48 59 87 74
96 21 31 30 43
97 98 93 1 20
98 11 74 58 91
99 21 43 41 74

修改数据

直接使用索引赋值

1
2
3
4
5
df.iloc[0,2]
89
df.iloc[0,2]=20
df.iloc[0,2]
20

批量修改,可以配合条件矩阵

1
2
3
4
5
6
7
8
9
10
11
12
13
14
df[df.Q1 < 60] = 60
df
0 60
1 60
2 60
3 93
4 65
..
95 60
96 60
97 98
98 60
99 60
Name: Q1, Length: 100, dtype: int64

添加列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
df['total'] = df.Q1 + df.Q2 + df.Q3 + df.Q4

name team Q1 Q2 Q3 Q4 total
0 60 60 60.0 60.0 60.0 60.0 240.0
1 60 60 60.0 60.0 60.0 60.0 240.0
2 60 60 60.0 60.0 60.0 60.0 240.0
3 Eorge C 93.0 96.0 71.0 78.0 338.0
4 Oah D 65.0 49.0 61.0 86.0 261.0
... ... ... ... ... ... ... ...
95 60 60 60.0 60.0 60.0 60.0 240.0
96 60 60 60.0 60.0 60.0 60.0 240.0
97 None None NaN NaN NaN NaN NaN
98 60 60 60.0 60.0 60.0 60.0 240.0
99 60 60 60.0 60.0 60.0 60.0 240.0

数据追加

append可以将两个DataFrame二合一。df.append()方法可以轻松实现数据的追加和拼接。如果列名相同,会追加一行到数据后面;如果列名不同,会将新的列添加到原数据。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
df.append(df)

team Q1 Q2 Q3 Q4
name
Liver E 89 21 24 64
Arry C 36 37 37 57
Ack A 57 60 18 84
Eorge C 93 96 71 78
Oah D 65 49 61 86
... ... ... ... ... ...
Gabriel C 48 59 87 74
Austin7 C 21 31 30 43
Lincoln C 98 93 1 20
Eli E 11 74 58 91
Ben E 21 43 41 74

也可以concat

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
pd.concat([df,df])

team Q1 Q2 Q3 Q4
name
Liver E 89 21 24 64
Arry C 36 37 37 57
Ack A 57 60 18 84
Eorge C 93 96 71 78
Oah D 65 49 61 86
... ... ... ... ... ...
Gabriel C 48 59 87 74
Austin7 C 21 31 30 43
Lincoln C 98 93 1 20
Eli E 11 74 58 91
Ben E 21 43 41 74

merge

类似sql的join

1
2
3
4
5
6
7
8
9
10
(base) fenix@fenixs:~/py/scrawpy$ cat d1.csv 
zhangsan,23,a
zhangsi,25,a
lisi,24,b
wangwu,14,ccat

(base) fenix@fenixs:~/py/scrawpy$ cat d2.csv
a,companyA
b,companyB
c,companyC
1
2
3
4
5
6
7
8
9
d1 = pd.read_csv('d1.csv',names=['name','age','cid'],index_col='name')
d2 = pd.read_csv('d2.csv',names=['cid','company'],index_col='cid')
pd.merge(d1,d2,on='cid')

age cid company
0 23 a companyA
1 25 a companyA
2 24 b companyB
3 14 c companyC

数据透视

数据透视可以帮我们找出列与列之间的关系:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df = pd.DataFrame({
'A':['a1', 'a1', 'a2', 'a2', 'a3', 'a3'],
'B':['b1', 'b2', 'b3', 'b1', 'b2', 'b3'],
'C':['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
'D':['d1', 'd2', 'd3', 'd4', 'd5', 'd6'],
})

df.pivot(index='A', columns='B', values='C')

df.pivot(index='A', columns='B', values='C')
df.pivot(index='A', columns='B', values='C')
B b1 b2 b3
A
a1 c1 c2 NaN
a2 c4 NaN c3
a3 NaN c5 c6

堆叠

数据的堆叠也是数据处理的一种常见方法。在多层索引的数据中,
通常为了方便查看、对比,会将所有数据呈现在一列中;
3

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
df = pd.read_excel('team.xlsx')
df = df.set_index('name')
df.groupby('team').sum().stack()


team
A Q1 1066
Q2 639
Q3 875
Q4 783
B Q1 975
Q2 1218
Q3 1202
Q4 1136
C Q1 1056
Q2 1194
Q3 1068
Q4 1127
D Q1 860
Q2 1191
Q3 1241
Q4 1199
E Q1 963
Q2 1013
Q3 881
Q4 1033
dtype: int64

数据堆叠可以简单理解成将多列数据转为一列数据

堆叠和解堆的本质如下。

  • 堆叠:“透视”某个级别的(可能是多层的)列标签,返回带有索
    引的DataFrame,该索引带有一个新的行标签,这个新标签在原
    有索引的最右边。
  • 解堆:将(可能是多层的)行索引的某个级别“透视”到列轴,从
    而生成具有新的最里面的列标签级别的重构的DataFrame。

时序数据

s.rolling()是移动窗口函数,此函数可以应用于一系列数据。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
s = pd.Series(range(1, 7))

0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
s.rolling(2).sum()

0 NaN
1 3.0
2 5.0
3 7.0
4 9.0
5 11.0
dtype: float64

我们创建随机数时序样本看看

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
df = pd.DataFrame(np.random.randn(30, 4),
index=pd.date_range('10/1/2020', periods=30),
columns=['A', 'B', 'C', 'D'])

A B C D
2020-10-01 0.116860 0.301451 0.520356 0.271023
2020-10-02 0.559587 -1.235789 -0.365462 1.167461
2020-10-03 1.370972 -0.372828 -0.614924 1.639720
2020-10-04 -1.389624 1.558892 -0.459278 0.073183
2020-10-05 -0.498922 -0.434318 -0.089316 1.062327
2020-10-06 0.714859 -0.155655 0.879164 -0.562244
2020-10-07 -2.084152 1.417185 -1.707112 -1.405558
2020-10-08 -0.178227 0.701537 -0.086532 0.653745
2020-10-09 1.142238 0.536237 -1.927169 1.378880
2020-10-10 0.967374 -0.354644 0.198410 -1.357527
2020-10-11 0.388374 -0.278833 0.973627 1.161361
2020-10-12 -0.711454 0.332449 -1.078495 -0.003996
2020-10-13 0.763811 -0.851593 -0.575605 -0.910922
2020-10-14 -0.434554 -0.527502 -1.344716 -2.319300
2020-10-15 -0.351671 2.227707 1.005529 0.605340
2020-10-16 0.007904 -0.857331 1.939693 -0.687654
2020-10-17 -0.122387 0.771806 0.441340 -0.630876
2020-10-18 0.588816 1.865920 1.557429 0.776753
2020-10-19 2.379811 1.589295 0.747322 -1.381986
2020-10-20 -1.656209 -0.670772 2.491850 0.019814
2020-10-21 1.325171 1.229221 0.002345 0.692146
2020-10-22 1.619935 0.894746 0.235650 -0.765366
2020-10-23 -0.237772 0.157579 2.346790 -0.081254
2020-10-24 -0.428513 0.102795 -0.562904 0.118976
2020-10-25 0.183652 -0.070621 0.206852 -0.983443
2020-10-26 -0.742221 -0.230941 0.064863 0.582942
2020-10-27 0.532819 -0.883184 -1.618293 -1.121184
2020-10-28 -1.970271 -0.207929 -0.363722 1.289313
2020-10-29 -0.011359 1.163952 0.791177 -0.441898
2020-10-30 0.132086 0.337254 1.368583 0.586358

我们使用时间偏移作为周期,2D代表两天

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
df.rolling('2D').mean()

A B C D
2020-10-01 0.116860 0.301451 0.520356 0.271023
2020-10-02 0.338223 -0.467169 0.077447 0.719242
2020-10-03 0.965279 -0.804309 -0.490193 1.403591
2020-10-04 -0.009326 0.593032 -0.537101 0.856452
2020-10-05 -0.944273 0.562287 -0.274297 0.567755
2020-10-06 0.107969 -0.294987 0.394924 0.250042
2020-10-07 -0.684646 0.630765 -0.413974 -0.983901
2020-10-08 -1.131190 1.059361 -0.896822 -0.375906
2020-10-09 0.482005 0.618887 -1.006851 1.016313
2020-10-10 1.054806 0.090797 -0.864379 0.010677
2020-10-11 0.677874 -0.316738 0.586018 -0.098083
2020-10-12 -0.161540 0.026808 -0.052434 0.578682
2020-10-13 0.026179 -0.259572 -0.827050 -0.457459
2020-10-14 0.164628 -0.689548 -0.960160 -1.615111
2020-10-15 -0.393113 0.850102 -0.169594 -0.856980
2020-10-16 -0.171884 0.685188 1.472611 -0.041157
2020-10-17 -0.057242 -0.042762 1.190516 -0.659265
2020-10-18 0.233214 1.318863 0.999384 0.072938
2020-10-19 1.484313 1.727608 1.152375 -0.302616
2020-10-20 0.361801 0.459261 1.619586 -0.681086
2020-10-21 -0.165519 0.279224 1.247097 0.355980
2020-10-22 1.472553 1.061984 0.118998 -0.036610
2020-10-23 0.691081 0.526163 1.291220 -0.423310
2020-10-24 -0.333143 0.130187 0.891943 0.018861
2020-10-25 -0.122431 0.016087 -0.178026 -0.432234
2020-10-26 -0.279285 -0.150781 0.135858 -0.200251
2020-10-27 -0.104701 -0.557062 -0.776715 -0.269121
2020-10-28 -0.718726 -0.545556 -0.991007 0.084064
2020-10-29 -0.990815 0.478012 0.213727 0.423708
2020-10-30 0.060363 0.750603 1.079880 0.072230

可视化

按照team分组后的柱状图

1
df.groupby('team').sum().plot.bar()

1

分为Q1-Q4的直方图

1
2

df.groupby('team').sum().hist()

2

df.plot() 可以传入kind参数

1
2
3
4
5
6
7
8
9
10
11
12

kind支持的参数如下。
line:折线图,默认
pie:饼图
bar:柱状图
barh:横向柱状图
hist:直方图
kde、density:核密度估计图
box:箱形图
area:面积图
scatter:散点图
hexbin:六边形分箱图

线条样式

style可指定图的线条等样式,并组合使用:
df[:5].plot(style=’:’) # 虚线
df[:5].plot(style=’-.’) # 虚实相间
df[:5].plot(style=’–’) # 长虚线
df[:5].plot(style=’-‘) # 实线(默认)
df[:5].plot(style=’.’) # 点
df[:5].plot(style=’*-‘) # 实线,数值为星星
df[:5].plot(style=’^-‘) # 实线,数值为三角形

背景辅助线

1
df.set_index('name').head().plot(grid=True)

5

  • 本文作者: fenix
  • 本文链接: https://fenix0.com/py-sci-05/
  • 版权声明: 本博客所有文章除特别声明外,均采用 BY-NC 许可协议。转载请注明出处!