
Using Grouping and Aggregation in pandas

2024-03-17 16:30:16 Author: 金灰

This article shows how to group and aggregate data in pandas, much as a database query groups records and then operates on each group.

1. Grouping operations

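Grouping and aggregation are common terms in relational databases.

When working with a database, we use queries to group the data by rows or columns and can then perform a different operation on each group.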

1.1 Grouping steps

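In data analysis you often need to split the data into groups according to the labels in one column (or several columns) and then analyze each group.

For example, a website might group its registered users by gender or age in order to build a profile of its user base.

In pandas, grouping is done with the groupby() method, which works much like the GROUP BY clause in SQL.

Statistical functions are then applied to the resulting groups, for example to aggregate, transform, or filter the grouped data. The process consists of three steps: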

Splitting: divide the data into groups;
Applying: apply a function, such as an aggregation, to each group;
Combining: merge the per-group results into a single result.
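
The three steps can be sketched by hand and compared with groupby(). This is only an illustration: the DataFrame df and its values are made up for this example, and groupby() does not work this way internally.

```python
import pandas as pd

# Hypothetical sample data, fixed so the result is reproducible.
df = pd.DataFrame({
    "company": ["A", "B", "A", "C", "B", "A"],
    "salary": [10, 20, 30, 60, 40, 50],
})

# 1. Split: build one sub-DataFrame per distinct company.
pieces = {name: df[df["company"] == name] for name in df["company"].unique()}

# 2. Apply: compute the mean salary of each piece.
means = {name: piece["salary"].mean() for name, piece in pieces.items()}

# 3. Combine: gather the per-group results into a single Series.
manual = pd.Series(means, name="salary").sort_index()

# groupby() expresses the same three steps in one line.
grouped = df.groupby("company")["salary"].mean()

print(manual)
print(grouped)
```

Both Series hold the same mean salary per company; the only difference is that the groupby() result carries the index name company.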

1.2 Basic usage

Demo code:

import pandas as pd
import numpy as np

company = ["A","B","C"]
df_data= pd.DataFrame({
    "company":[company[x] for x in np.random.randint(0,len(company),10)],
    "salary":np.random.randint(5,50,10),
    "age":np.random.randint(15,50,10)
})
print(df_data)
--------------------------
  company  salary  age
0       A      29   16
1       C      21   23
2       A      15   24
3       A      45   47
4       A      45   41
5       C      46   39
6       B      24   24
7       A      21   18
8       B      33   37
9       C      30   18

Grouping in pandas takes just one line of code. Here the dataset above is split by the company column:

group = df_data.groupby("company")
group

# Produces a DataFrameGroupBy object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000250E4329EA0>

----------
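# Converting it to a list makes the result easier to inspect
# (the output below comes from a different random run than the table above)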
list(group)
[('A',
    company  salary  age
  1       A      48   47
  3       A      30   29),
 ('B',
    company  salary  age
  0       B      30   21
  2       B      25   44
  4       B      49   18
  5       B      24   19
  7       B      11   37),
 ('C',
    company  salary  age
  6       C      35   42
  8       C       6   26
  9       C      24   45)]

groupby splits the original DataFrame into several sub-DataFrames, one for each distinct value of the grouping column (company here).
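
As a small sketch of how those sub-DataFrames can be reached without building a list, the example below iterates over the groupby object and uses get_group(). The data is hypothetical and fixed so the result is reproducible.

```python
import pandas as pd

# Hypothetical sample data with the same columns as the demo above.
df = pd.DataFrame({
    "company": ["A", "B", "A", "C", "B"],
    "salary": [10, 20, 30, 40, 50],
    "age": [25, 35, 45, 30, 40],
})
group = df.groupby("company")

# Iterating yields (group key, sub-DataFrame) pairs.
sizes = {}
for name, sub_df in group:
    sizes[name] = len(sub_df)

# get_group() returns the sub-DataFrame for a single key.
company_a = group.get_group("A")

print(sizes)
print(company_a)
```

Each sub-DataFrame keeps the row labels of the original DataFrame, so company_a still has the index values 0 and 2.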

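1.3 Group aggregation

Aggregation is one of the most common operations after groupby. It can compute the sum, mean, maximum, minimum, and so on.

Operations available after grouping: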

Function	Purpose
max	Maximum
min	Minimum
sum	Sum
mean	Mean
median	Median
std	Standard deviation
var	Variance
count	Count of non-null values

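# Group by company, then take the mean of each numeric column
# (sample output; the values vary because the data is random)
df_data.groupby("company").agg("mean")
---------
            salary        age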
company                      
A        42.000000  40.000000
B        35.333333  36.666667
C        16.666667  25.666667
...
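
When different columns need different functions, named aggregation is one option. The sketch below uses hypothetical data; the result column names avg_salary, max_age, and headcount are chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data, fixed so the result is reproducible.
df = pd.DataFrame({
    "company": ["A", "B", "A", "C", "B"],
    "salary": [10, 20, 30, 40, 50],
    "age": [25, 35, 45, 30, 40],
})

# agg("mean") and the mean() method give the same result.
by_agg = df.groupby("company").agg("mean")
by_method = df.groupby("company").mean()

# Named aggregation: result_column=(source_column, function).
summary = df.groupby("company").agg(
    avg_salary=("salary", "mean"),
    max_age=("age", "max"),
    headcount=("salary", "count"),
)

print(summary)
```

Named aggregation produces flat column names, which is often easier to work with than the two-level columns returned when a list of functions is passed to agg().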

1.4 Hands-on practice

import pandas as pd
import numpy as np

country = ["中国","美国","英国"]
data = {
    "country": [country[x] for x in np.random.randint(0,len(country),10)],
    "year": np.random.randint(1990, 2000, size=10),
    "GDP": np.random.randint(25000,30000,size=10)
}
df_data = pd.DataFrame(data)
print(df_data)
-------------------
  country  year    GDP
0      英国  1996  28767
1      中国  1991  25541
2      英国  1996  28251
3      美国  1992  29543
4      中国  1998  28031
5      英国  1993  28510
6      美国  1996  27576
7      美国  1993  27087
8      美国  1998  28345
9      英国  1999  27247

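1 - Create a groupby object

groupby() groups rows by default (grouping along columns with axis=1 is deprecated in recent pandas releases). The key you group by serves as the name of each group.

df_data.groupby('year')
# Returns a DataFrameGroupBy object; its repr shows the memory address.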

2 - Inspect the grouping result

Use the groups attribute to see which rows belong to each group.

print(df_data.groupby('year').groups)
#{1990: [2, 4], 1992: [0, 7], 1994: [8], 1995: [5], 1997: [3], 1998: [1, 6, 9]}
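
The groups attribute maps each group key to the index labels of the rows in that group. The sketch below uses a small hypothetical DataFrame with fixed values so the mapping is easy to check by eye.

```python
import pandas as pd

# Hypothetical sample data, fixed so the result is reproducible.
df = pd.DataFrame({
    "country": ["英国", "中国", "英国", "美国", "中国"],
    "year": [1996, 1991, 1996, 1992, 1998],
})
grouped = df.groupby("year")

# Mapping of group key -> index labels of the rows in that group.
groups = grouped.groups
rows_1996 = list(groups[1996])

# ngroups and size() summarise the same information more compactly.
n_groups = grouped.ngroups
sizes = grouped.size()

print(rows_1996)
print(sizes)
```

Rows 0 and 2 both have year 1996, so they end up in the same group.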

3 - Exercises

# Mean of GDP and year for each year
df_data.groupby("year")[["GDP", "year"]].mean()
----------------------
          GDP    year
year                 
1990  29592.0  1990.0
1992  27173.5  1992.0
1993  28316.0  1993.0
1994  29401.0  1994.0
1997  26791.0  1997.0
1998  29947.0  1998.0
1999  28290.0  1999.0


# Mean GDP and median year for each country
df_data.groupby("country").agg({"GDP":"mean","year":"median"})
------------------------
             GDP    year
country                 
中国       27256.0  1998.0
美国       27220.0  1993.0
英国       26834.0  1993.0



# Mean and standard deviation (std) of GDP and year for each country
df_data.groupby("country")[["GDP", "year"]].agg(["mean","std"])
-------------------------------------------------------------
                  GDP                      year          
                 mean          std         mean       std
country                                                  
中国       27922.285714   727.771419  1994.571429  3.309438
美国       27824.000000  2332.038164  1996.000000  0.000000
英国       26188.000000          NaN  1994.000000       NaN
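
The NaN values in the std columns above come from a group that contains a single row: the default sample standard deviation (ddof=1) is undefined for one observation. The following sketch reproduces this with hypothetical, fixed data.

```python
import pandas as pd

# Hypothetical sample data: 美国 appears only once.
df = pd.DataFrame({
    "country": ["中国", "中国", "美国"],
    "GDP": [100, 300, 200],
})

# Passing a list of functions yields one result column per function.
stats = df.groupby("country")["GDP"].agg(["mean", "std"])

# ddof=0 switches to the population standard deviation,
# which is defined (and zero) for a single observation.
pop_std = df.groupby("country")["GDP"].std(ddof=0)

print(stats)
print(pop_std)
```

Whether ddof=0 is appropriate depends on the analysis; it is shown here only to explain where the NaN comes from.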



# Mean of GDP and year for each (year, country) combination
df_data.groupby(["year", "country"])[["GDP", "year"]].mean()
-----------------------------------------------------
                  GDP    year
year country                 
1990 美国       27325.5  1990.0
     英国       28920.0  1990.0
1991 美国       28691.0  1991.0
     英国       26217.0  1991.0
1992 中国       26445.0  1992.0
1995 美国       28058.0  1995.0
1996 英国       25210.0  1996.0
1999 美国       26850.0  1999.0
     英国       27193.0  1999.0
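
Grouping by several keys returns a result indexed by a MultiIndex, as in the output above. A minimal sketch of two common ways to reshape such a result, using hypothetical fixed data and selecting only the GDP column:

```python
import pandas as pd

# Hypothetical sample data, fixed so the result is reproducible.
df = pd.DataFrame({
    "country": ["中国", "中国", "美国", "美国"],
    "year": [1991, 1991, 1991, 1992],
    "GDP": [100, 300, 200, 400],
})

# One mean per (year, country) pair; the index has two levels.
by_pair = df.groupby(["year", "country"])["GDP"].mean()

# reset_index() turns the index levels back into ordinary columns.
flat = by_pair.reset_index()

# unstack() pivots one index level into columns; missing pairs become NaN.
wide = by_pair.unstack("country")

print(flat)
print(wide)
```

Only combinations that actually occur in the data appear in by_pair, which is why wide contains NaN for 中国 in 1992.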

# Number of rows (non-null country values) recorded for each year
df_data.groupby("year")["country"].count()

# Number of distinct countries for each year
df_data.groupby("year")[["country"]].nunique()




# Number of distinct countries in the whole dataset
df_data["country"].nunique()

# The distinct countries that appear in the dataset
df_data["country"].unique()

2. Review of basic operations

Demo code:

df_data = pd.DataFrame(
    np.random.randint(60,95,size=(6,6)),
    index=["张三","李四","王五","赵六","坤哥","凡哥"],
    columns=["语文","数学","英语","政治","历史","地理"]
)
print(df_data)
-----------------
   语文  数学  英语  政治  历史  地理
张三  62  79  94  81  68  63
李四  66  63  88  87  69  83
王五  94  62  89  60  84  71
赵六  84  85  86  76  93  74
坤哥  92  82  81  62  62  69
凡哥  70  68  71  70  62  93

Show basic information about df_data:

df_data.info()
#--
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 张三 to 凡哥
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   语文      6 non-null      int32
 1   数学      6 non-null      int32
 2   英语      6 non-null      int32
 3   政治      6 non-null      int32
 4   历史      6 non-null      int32
 5   地理      6 non-null      int32
dtypes: int32(6)
memory usage: 192.0+ bytes
    
    
    
df_data.describe()
#--
              语文         数学         英语         政治         历史         地理
count   6.000000   6.000000   6.000000   6.000000   6.000000   6.000000
mean   69.833333  75.833333  77.166667  82.000000  79.833333  71.166667
std     7.704977   6.337718  12.221566   8.694826  10.888832   8.518607
min    61.000000  66.000000  62.000000  70.000000  66.000000  62.000000
25%    66.000000  72.000000  66.500000  78.250000  71.500000  66.250000
50%    67.000000  78.500000  80.500000  80.000000  81.000000  69.000000
75%    74.000000  79.750000  87.000000  87.750000  86.750000  74.000000
max    82.000000  82.000000  89.000000  94.000000  94.000000  86.000000

2.1 Indexing and slicing

.loc[] is handy: it can select by both rows and columns.

1 - Show the first 3 rows of df_data with .iloc[]

df_data.iloc[:3]

2 - Select specific columns of df_data

df_data.loc[:,["语文","英语"]]
df_data[["语文","英语"]]

3 - Select specific rows and columns with .loc[]

df_data.loc[df_data.index[[0,2,4]],["语文","数学","英语"]]
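
.iloc[] selects by integer position and .loc[] selects by label, and the two treat the end of a slice differently. A minimal sketch with a hypothetical three-row score table:

```python
import pandas as pd

# Hypothetical score table, fixed so the result is reproducible.
scores = pd.DataFrame(
    {"语文": [62, 66, 94], "数学": [79, 63, 62], "英语": [94, 88, 89]},
    index=["张三", "李四", "王五"],
)

# Position-based slice: rows 0 and 1, end position excluded.
first_two = scores.iloc[:2]

# Label-based slice: from 张三 to 李四, end label included.
label_slice = scores.loc["张三":"李四"]

# A single cell, addressed by label and by position.
cell = scores.loc["王五", "数学"]
same_cell = scores.iloc[2, 1]

print(first_two)
print(cell, same_cell)
```

Here both slices return the same two rows even though the .iloc[] slice stops before position 2 and the .loc[] slice includes its end label.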

4 - Select rows by condition (语文 greater than 70, optionally combined with 数学 below 70)

df_data[df_data["语文"] > 70]
df_data[(df_data["语文"] > 70) & (df_data["数学"]< 70)]

5 - Count how often each score occurs in the 语文 column

df_data["语文"].value_counts()
