Pandas 详细笔记

兰玉磊

5 年前

Pandas的概述

Pandas是python第三方库，提供高性能易用数据类型和分析工具

Pandas的基本操作

引入Pandas

import pandas as pd

读取cvs文件数据及相关操作

读取文件

values = pd.read_csv("file/test.csv")

获取多少行数据，默认是5行，可输入整数参数

values.head()

数据的基本信息，包括数据类型，有效数据数量等等

values.info()
    --------- 输出 ---------
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Age            714 non-null float64
    Cabin          204 non-null object
    dtypes: float64(1), int64(1), object(1)
    memory usage: 83.6+ KB

获取所有的列名称

values.keys()
    --------- 输出 ---------
    Index(['PassengerId', 'Age', 'Cabin'], dtype='object')

获取每一列都是什么类型，以及总的数据类型

values.dtype()
    --------- 输出 ---------
    PassengerId      int64
    Age            float64
    Cabin           object
    dtype: object

获取所有的数据，不包含列名称

values.values

获取所有数据的索引值

values.index
    --------- 输出 ---------
    RangeIndex(start=0, stop=891, step=1)

创建一个pandas中的DataFrame数据类型的数据及基本操作

创建数据

data = {
    "country": [
        "aaa", "bbb", "ccc"
    ],
    "population": [
        10, 20, 30
    ]
}
pd.DataFrame(data)

使用已有的列数据设置索引

data.set_index("country")

获取一整列数据

data["country"]

数据切片，和list的切片基本上一致

data["country"][:2]

数值简单计算

data["population"]["aaa"] + data["population"]["bbb"]  # 元素之间的互相计算
data["population"].mean()  # 计算列平均值
data["population"].mean(axis=1)  # 计算行平均值
data["population"].max()  获取列最大值
data["population"].max(axis=1)  # 获取行最大值
data["population"].min()  # 获取列最小值
data["population"].mean(axis=1)  # 获取行最小值

可以得到数据的基本统计特性

values.describe()

Pandas 索引操作

获取多个列数据

df = pd.read_csv("./file/titanic.csv")
df[["Age", "Fare"]].head()

获取数据两个方法 loc 和 iloc

loc 用lable定位数据
iloc 用position定位数据

loc获取一组数据

df.loc["Bob"]

iloc获取一组数据

# 获取1~5行的1~25列数据
df.iloc[0: 5, 1: 25]

数据判断

# 获取出年龄大于40的数据中的前5行
df[df["Age"] > 40].head()

判断值是否存在

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")
s.isin([1, 2, 3])
    --------- 输出 ---------
    3    1
    2    2
    1    3
    dtype: int64

双重索引

s2 = pd.Series(np.arange(6), index=pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]]))
    --------- 输出 ---------
    0  a    0
       b    1
       c    2
    1  a    3
       b    4
       c    5
    dtype: int64

通过多重索引获取对应的值

s2.iloc[s2.index.isin([(1, "a"), (0, "b")])]
    --------- 输出 ---------
    1  a    3
    dtype: int64

数据筛选

df.where(df > 0, -df) 
# 筛选数据，当前数据中大于0的获取出来，默认会将小于0的数据修改成NaN，-df 则表示取反值

列之间的数据判断

df.query("(a < b) & (b < c)")

groupby

创建练习数据

import pandas as pd
df = pd.DataFrame({
    "key": [
        "A", "B", "C", "A", "B", "C", "A", "B", "C"
    ],
    "data": [
        1, 2, 3, 4, 5, 6, 7, 8, 9
    ]
})

计算key这一列中，数据的总和

df.groupby("key").sum()

计算泰坦尼克号数据中的男女年龄的总和

df.groupby("Sex").sum()["Age"]
    --------- 输出 ---------
    Sex
    female     7286.00
    male      13919.17
    Name: Age, dtype: float64

二元统计

数据与数据之间的协方差

df.cov()

数据之间的相关系数，较为常用

df.corr()

统计每个不同属性分别有多少个, 默认是降序，ascending=True表示升序

df["Sex"].value_counts(ascending=True)
    --------- 输出 ---------
    female    314
    male      577
    Name: Sex, dtype: int64

bins=5 将数据分组，这里表示分成5组

df["Age"].value_counts(ascending=True, bins=5)
    --------- 输出 ---------
    (64.084, 80.0]       11
    (48.168, 64.084]     69
    (0.339, 16.336]     100
    (32.252, 48.168]    188
    (16.336, 32.252]    346
    Name: Age, dtype: int64

获取有效数据的总和

df["Age"].count()

Pandas数据对象中的操作

Pandas中主要的数据类型有两种，分别是Series和DataFrame

Series数据类型的操作

创建练习数据

import pandas as pd
data = [10, 20, 30]
index = ["a", "b", "c"]
s = pd.Series(data=data, index=index)
    --------- 输出 ---------
    a    10
    b    20
    c    30
    dtype: int64

获取数据

s[0]  # 通过位置索引直接获取数据

mask = [True, False, True]
s[mask]  # 通过bool获取数据

s.loc["b"]  # 通过label获取数据

s.iloc[1]   # 通过位置索引直接获取数据

根据现有的数据复制出一份一样的数据

s1 = s.copy()

替换数据中的值

s1.replace(to_replace=100, value=10, inplace=True)
# to_replace 被修改的值， value 修改成什么值， inplace 是否在原地修改

修改索引名字，直接会修改原始值

s1.index = ["a", "b", "d"]

修改单个索引名称

s1.rename(index={"a": "A"}, inplace=True)

合并两个数据

s1.append(s, ignore_index=False)  # 直接将s的数据加入到s1中，并且是原地修改数据
# ignore_index表示是否忽略索引，True从新生成索引，False保留原来的索引

删除指定key的值

del s["a"]

删除多个元素数据

s1.drop(["b", "d"], inplace=True)

DataFrame数据类型的操作

创建练习数据

data = [[1, 2, 3], [4, 5, 6]]
index = ["a", "b"]
columns = ["A", "B", "C"]
df = pd.DataFrame(data=data, index=index, columns=columns)

修改指定格内的值

df.loc["a"]["A"] = 100

修改索引名称

df.index = ["f", "g"]

添加一行数据

df.loc["c"] = [1, 2, 3]

合并两个DataFrame类型的数据

df3 = pd.concat([df, df2], axis=0)   # 合并两个数据集的所有行数据
df3 = pd.concat([df, df2], axis=1)   # 合并两个数据集的所有列数据

添加一列数据

df2["Lan"] = [10, 11]

删除数据

df2.drop(["j"], axis=0, inplace=True)  # 原地删除一行数据
df2.drop(["E"], axis=1, inplace=True)  # 原地删除一列数据
df2.drop(["j", "k"], axis=0, inplace=True)  # 原地删除多行数据
df2.drop(["E"，"F"], axis=1, inplace=True)  # 原地删除多列数据

merge

创建练习数据

import pandas as pd
left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"] 
})

right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

merge数据

pd.merge(left, right, on="key", how="outer", indicator=True)
# 合并两个 DataFrame， on="key" 表示根据key这一列合并数据, how="outer" 表示并集, indicator=True 明确出合并的方式
# how: left, right, outer...

Pandas 数据输出的显示设置

import pandas as pd

获取输出的最大行数

pd.get_option("display.max_rows")  # 默认是60行

设置输出的最大行数

pd.set_option("display.max_rows", 6)
# 这里表示最多输出6行，多出的数据折叠起来

获取输出最大列数

pd.get_option("display.max_columns") # 默认是20列

设置输出最大列数

pd.set_option("display.max_columns", 10)
# 这里表示最多输出10列，多出的数据折叠起来

获取网格内值的最大长度

pd.get_option("display.max_colwidth")   # 默认是50个字符

设置网格内值的最大长度

pd.set_option("display.max_colwidth", 10)
# 这里表示最多输出10个字符，多出的数据折叠起来

获取网格内值的精度

pd.get_option("display.precision")   # 默认为 6 位

设置网格内值的精度

pd.set_option("display.precision", 2)
# 这里表示保留小数点后面的2位

pivot 数据透视表

创建练习数据

import pandas as pd
example = pd.DataFrame({
    "Month": [
        "January", "January", "January", "January",
        "February", "February", "February", "February",
        "March", "March", "March", "March"
    ],
    "Caregory": [
        "Transportation", "Grocery", "Household", "Entertainment",
        "Transportation", "Grocery", "Household", "Entertainment",
        "Transportation", "Grocery", "Household", "Entertainment"
    ],
    "Amount": [
        74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.
    ]
})

将DataFrame数据转换成可视度高的表格展示

example_pivot = example.pivot(index="Caregory", columns="Month", values="Amount")

计算总和

# 计算列总和
example_pivot.sum(axis=0)
    --------- 输出 ---------
    Month
    February    705.0
    January     584.0
    March       670.0
    dtype: float64

# 计算行总和
example_pivot.sum(axis=1)
    --------- 输出 ---------
    Caregory
    Entertainment     345.0
    Grocery           735.0
    Household         600.0
    Transportation    279.0
    dtype: float64

统计泰坦尼克号数据中，男女分别在1,2,3舱的平均票价

# 默认计算的是平均值
df.pivot_table(index="Sex", columns="Pclass", values="Fare")

统计泰坦尼克号数据中，男女分别在1,2,3舱的人数

df.pivot_table(index="Sex", columns="Pclass", values="Fare", aggfunc="count")  # aggfunc="mean" 表示获取平均数
# 同上效果
pd.crosstab(index=df["Sex"], columns=df["Pclass"])   # 类似aggfunc="count"

时间操作

创建一个时间数据

import pandas as pd
ts = pd.Timestamp("2018-12-13")
    --------- 输出 ---------
    Timestamp('2018-12-13 00:00:00')

pd.to_datetime("2018-12-13")
    --------- 输出 ---------
    Timestamp('2018-12-13 00:00:00')

时间操作

# 加5天
ts + pd.Timedelta("5 days")
    --------- 输出 ---------
    Timestamp('2018-12-18 00:00:00')

# 减1天
ts - pd.Timedelta("1 days")
    --------- 输出 ---------
    Timestamp('2018-12-12 00:00:00')

Series的时间数据

sd = pd.Series(["2017-12-13 00:00:00", "2017-12-14 00:00:00", "2017-12-15 00:00:00"])
    --------- 输出 ---------
    0    2017-12-13 00:00:00
    1    2017-12-14 00:00:00
    2    2017-12-15 00:00:00
    dtype: object

# 转成 datetime 类型的数据
ts = pd.to_datetime(s)
    --------- 输出 ---------
    0   2017-12-13
    1   2017-12-14
    2   2017-12-15
    dtype: datetime64[ns]

获取小时

ts.dt.hour

获取年

ts.dt.year

生成多个时间序列数据

data = pd.Series(pd.date_range("2018-12-13", periods=3, freq="12H"))
# 从2018-12-13开始生成3个时间数据，间隔为12小时
    --------- 输出 ---------
    0   2018-12-13 00:00:00
    1   2018-12-13 12:00:00
    2   2018-12-14 00:00:00
    dtype: datetime64[ns]

使用切片的形式获取一组数据

data[pd.Timestamp("2018-12-13 00:00:00"): pd.Timestamp("2018-12-14 00:00:00")]

Pandas 常用操作

创建练习数据

import pandas as pd
data = pd.DataFrame({
    "group": [
        "A", "B", "C", "A", "B", "C", "A", "B", "C"
    ],
    "data": [
        1, 2, 3, 4, 5, 6, 7, 8, 9
    ]
})

排序

data.sort_values(by=["group", "data"], ascending=[False, True], inplace=True)
# by=["group", "data"] 表示使用什么去排序
# ascending=[False, True] 如何排序，True是升序，False是降序
# inplace=True 直接在原始数据上修改数据

去掉重复数据

# 默认按照行去重，出现两行一样的就去重
data.drop_duplicates()

# 按照列去重，一列中出现一样的，就去掉后面的一样的数据
data.drop_duplicates(subset="k1")

两组数据做计算

df = pd.DataFrame({"data1": np.random.randn(5),
            "data2": np.random.randn(5)
              })
df2 = df.assign(ration=df["data1"]/df["data2"])   # 做计算

删除一列数据

df2.drop("ration", axis="columns", inplace=True)

数据分类

ages = [14, 15, 14, 79, 24, 57, 24, 100]
bins = [10, 40, 80]
bins_res = pd.cut(ages, bins)   # 根据bins进行分类
    --------- 输出 ---------
    # 下面每一个元素都表示 上面的数值在那个区间
    [(10, 40], (10, 40], (10, 40], (40, 80], (10, 40], (40, 80], (10, 40], NaN]
    Categories (2, interval[int64]): [(10, 40] < (40, 80]]

# 统计分类后的数据
pd.value_counts(bins_res)    # 统计数量
    --------- 输出 ---------
    (10, 40]    5
    (40, 80]    2
    dtype: int64

# 给每个分组命名
group_names = ["Yanth", "Mille", "old"]
pd.value_counts(pd.cut(ages, [10, 20, 50, 80], labels=group_names))
    --------- 输出 ---------
    Yanth    3
    old      2
    Mille    2
    dtype: int64

判断在DataFrame中是否有缺失值，True表示是无效值，False表示有效值

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])
df.isnull()

# 查看每一列中是否有缺失值
df.isnull().any()
    --------- 输出 ---------  
    0    False
    1     True
    2     True
    dtype: bool

# 查看每一行中是否有缺失值
df.isnull().any(axis=1)
    --------- 输出 ---------  
    0    False
    1     True
    2     True
    3    False
    dtype: bool

填充缺失值

df.fillna(5)

字符串操作

创建练习数据

import pandas as pd
import numpy as np
s = pd.Series(["A", "B", "b", "gaer", "AGER", np.nan])

将Series数据里面的字符串转换成小写

s.str.lower()

将Series数据里面的字符串转换成大写

s.str.upper()

计算Series中每个成员的字符串长度

s.str.len()

去除成员字符串中前后的空格

index = pd.Index(["   l  an", "   yu", "   lei"])
index.str.strip()

替换字段名称里面的数据

df = pd.DataFrame(np.random.randn(3, 2), columns=["A a", "B b"], index=range(3))
df.columns = df.columns.str.replace(" ", "_")

切片数据

s = pd.Series(["a_b_C", "c_d_e", "f_g_h"])
s.str.split("_")    # 切分字符串
    --------- 输出 ---------  
    0    [a, b, C]
    1    [c, d, e]
    2    [f, g, h]
    dtype: object

切分字符串, 并且生成表格 n=6 表示切分几次

s.str.split("_", expand=True, n=6)

判断是否在s中是否包含 "A" ，包含则是True，不包含则是False

s = pd.Series(["Axzfc", "Aefa", "Ahstr", "Aga", "Aaf"])
s.str.contains("A")
    --------- 输出 ---------  
    0    True
    1    True
    2    True
    3    True
    4    True
    dtype: bool

Pandas 绘图

绘制最基本的曲线图

%matplotlib inline
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10), index=np.arange(0, 100, 10))
s.plot()

略微复杂的曲线图

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), index=np.arange(0, 100, 10), columns=list("ABCD"))
df.plot()

柱状图

data = pd.Series(np.random.rand(16), index=list("abcdefghigklmnop"))
# 柱状图
from matplotlib import pyplot as plt
fig, axes = plt.subplots(2, 1)
data.plot(ax=axes[0], kind="bar")   # 正着画图
data.plot(ax=axes[1], kind="barh")   # 横着画图

多组数据的柱状图

df = pd.DataFrame(np.random.rand(6, 4), index=["one", "two", "three", "four", "five", "six"], 
              columns=pd.Index(["A", "B", "C", "D"], name="Genus"))
df.plot(kind="bar")

直方图

df.A.plot(kind="hist", bins=50)

散点图

df.plot.scatter("A", "B")

多组数据的散点图

pd.scatter_matrix(df, color="k", alpha=0.3)