Loading Data

Importing a CSV

import pandas as pd

df = pd.read_csv(
    "book_sales.csv",
    index_col='Date',
    parse_dates=['Date'],
)

pd.read_csv returns a DataFrame.

index_col sets the Date column as the DataFrame's index (its row labels).
parse_dates parses the Date column as datetimes rather than leaving it as plain strings.
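To see the effect of the two arguments without the real file, here is a minimal sketch that reads a small in-memory CSV. The rows are made up for illustration and only stand in for book_sales.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for book_sales.csv; the values are invented.
csv_text = """Date,Paperback,Hardcover
2000-04-01,100,50
2000-04-02,110,55
2000-04-03,120,60
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Date",      # use the Date column as the row labels
    parse_dates=["Date"],  # parse Date as datetimes, not strings
)

print(type(df.index).__name__)  # DatetimeIndex
print(list(df.columns))         # ['Paperback', 'Hardcover']
```

Date no longer appears among the columns because it has become the index.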

period

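tunnel = tunnel.to_period()

With a date index, pandas uses a DatetimeIndex by default: each label is a timestamp, a specific instant.

.to_period() converts it to a PeriodIndex: each label is a time span, a reporting period. For example, "2020-01" stands for the whole of January.

In some time series analyses a PeriodIndex is more intuitive and convenient, for example when analyzing data by month or quarter.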
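A small sketch of the conversion, using an invented three-day series rather than the tunnel data:

```python
import pandas as pd

# Invented daily series with a DatetimeIndex.
idx = pd.date_range("2020-01-01", periods=3, freq="D")
s = pd.Series([10, 20, 30], index=idx)

daily = s.to_period()        # frequency taken from the index: daily periods
monthly = s.to_period("M")   # every label becomes the month it falls in

print(type(s.index).__name__)      # DatetimeIndex
print(type(daily.index).__name__)  # PeriodIndex
print(monthly.index[0])            # 2020-01
```

Note that to_period("M") only relabels the rows; it does not aggregate them, so the monthly series still has three rows.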

DataFrame

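Inspecting

df.index returns the index (the row labels) of the DataFrame.

Dropping a column

df.drop('Paperback', axis=1)

axis=1 drops a column; axis=0 drops a row. drop returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.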
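The difference between the two axes can be sketched on a tiny invented frame (the row labels "a" and "b" are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"Paperback": [100, 110], "Hardcover": [50, 55]},
    index=["a", "b"],
)

no_paperback = df.drop("Paperback", axis=1)  # drop a column
no_row_a = df.drop("a", axis=0)              # drop a row by its label

print(list(no_paperback.columns))  # ['Hardcover']
print(list(no_row_a.index))        # ['b']
print(list(df.columns))            # ['Paperback', 'Hardcover'] (df is unchanged)
```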

Previewing the data with head

df.head()

Shows the first 5 rows of the DataFrame, a quick way to check the data's contents and format.

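Adding a column

import numpy as np

df['Time'] = np.arange(len(df.index))

Adds a new column Time to df, holding the sequence 0, 1, 2, ... generated by np.arange.

Each row gets the next integer in order.

This gives the data a "time step" variable that can be used for modeling or plotting.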

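Shifting down with shift

Shifting is common in time series analysis: the shifted column holds the previous period's value (the lag), which lets a model use past information to predict future values.

df['Hardcover'].shift(1)

.shift(1) moves the whole column down by 1 row:

The first row becomes NaN, because there is no earlier value to fill it.
The value originally in row 1 moves to row 2, and so on.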
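A minimal sketch of building a lag feature, on invented sales numbers. The column name Lag_1 is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Hardcover": [50, 55, 60, 65]})

# Each row of Lag_1 holds the Hardcover value from the row above it.
df["Lag_1"] = df["Hardcover"].shift(1)

print(df)
#    Hardcover  Lag_1
# 0         50    NaN
# 1         55   50.0
# 2         60   55.0
# 3         65   60.0
```

The column becomes float because NaN cannot be stored in an integer column.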

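Selecting columns

X = df.loc[:, ['Time']]  # features

Passing a list of column names returns a two-dimensional DataFrame, shape (747, 1).

y = df.loc[:, 'NumVehicles']  # target

Passing a single column name returns a one-dimensional Series, shape (747,).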
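The shape difference is easy to check on a small invented frame (three rows instead of 747):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NumVehicles": [100, 110, 120]})
df["Time"] = np.arange(len(df.index))

X = df.loc[:, ["Time"]]       # list of names -> DataFrame
y = df.loc[:, "NumVehicles"]  # single name   -> Series

print(type(X).__name__, X.shape)  # DataFrame (3, 1)
print(type(y).__name__, y.shape)  # Series (3,)
```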

NumPy

Generating data

np.arange(n)

Generates an array of integers from 0 to n-1.

Plot

Importing matplotlib

import matplotlib.pyplot as plt
import seaborn as sns

Imports the plotting libraries: matplotlib for basic plotting, seaborn for more polished statistical plots.

Setting global figure properties

plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)

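autolayout=True: automatic layout, so titles and labels are not clipped.

figsize=(11, 4): figure size of 11 x 4 inches.

titlesize=18, titleweight='bold': figure title in 18-point bold.

Setting the axes style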

plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)

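Axis labels are bold and large.
Axes titles are bold, 16-point, with 10 points of padding between the title and the axes.

High-resolution rendering in Jupyter Notebook

%config InlineBackend.figure_format = 'retina'

Plotting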

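fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')

Creates a figure with a single subplot, ax.
Draws a line plot of sales (Hardcover) against time (Time).
color='0.75' gives a light gray line.

Setting the title

ax.set_title('Time Plot of Hardcover Sales');

Setting the aspect ratio

ax.set_aspect('equal')

Sets the aspect ratio to 1:1, so one data unit has the same length on both axes.

This is useful in a lag plot: if today's sales were exactly equal to yesterday's, the points would fall on the diagonal.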
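A minimal lag-plot sketch under a few assumptions: the sales numbers are invented, the Lag_1 axis label is hypothetical, and the Agg backend is selected only so the code runs without a display (in a notebook that line can be dropped).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

sales = pd.Series([50, 55, 53, 60, 58], name="Hardcover")
lag_1 = sales.shift(1)

fig, ax = plt.subplots()
# Skip the first row, whose lag is NaN.
ax.scatter(lag_1.iloc[1:], sales.iloc[1:], color="0.25")
ax.set_aspect("equal")  # one unit on x is as long as one unit on y
ax.set_xlabel("Lag_1")
ax.set_ylabel("Hardcover")
ax.set_title("Lag Plot of Hardcover Sales")
```

With equal scaling, the line y = x sits at exactly 45 degrees, so departures from "same as yesterday" are easy to judge by eye.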

Plot parameters

plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
ax = y.plot(**plot_params)

plot_params is a dict of plotting arguments, convenient for passing to plot() later.

Dot plot

ax = tunnel.plot(style=".", color="0.5")

No legend, line width

moving_average.plot(
    ax=ax, linewidth=3, title="Tunnel Traffic - 365-Day Moving Average", legend=False,
)

seaborn

Styles

import seaborn as sns
print(plt.style.available)
plt.style.use("seaborn-v0_8-whitegrid")

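Prints the available styles, then applies seaborn's white-grid style to make the charts easier to read.

Regression scatter plot

sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'), ax=ax)

Overlays a seaborn regression scatter plot (regplot) on the existing axes:

x-axis is Time, y-axis is Hardcover
ci=None turns off the confidence interval
scatter_kws=dict(color='0.25') makes the points dark gray
the regression line shows the overall trend of sales over time

Linear regression

Linear regression with sklearn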

from sklearn.linear_model import LinearRegression

# Training data
X = df.loc[:, ['Time']]  # features
y = df.loc[:, 'NumVehicles']  # target

# Train the model
model = LinearRegression()
model.fit(X, y)

# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)

ax = y.plot(**plot_params)
ax = y_pred.plot(ax=ax, linewidth=3)
ax.set_title('Time Plot of Tunnel Traffic')

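This is the feature-matrix format sklearn expects: \(X\) must be two-dimensional (a DataFrame), while \(y\) is a one-dimensional vector.

Prediction

model.predict(X)

Uses the fitted linear regression model to predict from the input features X.

It returns a one-dimensional NumPy array whose length equals the number of samples. The array carries no index, only the raw numbers, so wrapping the predictions in a pandas Series with X.index makes them easier to work with and plot.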
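The difference between the raw array and the wrapped Series can be sketched on invented data. The five values below are made up and lie exactly on a line, so the fit reproduces them:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented daily counts that follow an exact linear trend.
idx = pd.period_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame({"NumVehicles": [10.0, 12.0, 14.0, 16.0, 18.0]}, index=idx)
df["Time"] = np.arange(len(df.index))

X = df.loc[:, ["Time"]]
y = df.loc[:, "NumVehicles"]

model = LinearRegression()
model.fit(X, y)

raw = model.predict(X)                   # plain NumPy array, no dates attached
y_pred = pd.Series(raw, index=X.index)   # same numbers, aligned to the dates

print(type(raw).__name__, raw.shape)  # ndarray (5,)
print(y_pred.index[0])                # 2020-01-01
```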

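Trend

Rolling windows on a Series

series.rolling(window, min_periods=None, center=False).function()

Defines a sliding window over a Series (or over the columns of a DataFrame) and applies a statistic to the data in each window (mean, sum, max, standard deviation, and so on).

window is the window size: an integer number of observations, or a time offset
min_periods is the minimum number of valid values needed to compute a result; for an integer window it defaults to the window size
center sets whether the result is labeled at the center of the window (True) or at its right edge (False)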
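How the three parameters interact is easiest to see on a tiny invented series with a window of 3:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Result labeled at the right edge; needs 3 values, so the first two are NaN.
trailing = s.rolling(window=3).mean()

# Result labeled at the window center; both ends lack a full window.
centered = s.rolling(window=3, center=True).mean()

# Allow windows with only 2 values, so the ends get a value as well.
partial = s.rolling(window=3, center=True, min_periods=2).mean()

print(trailing.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(centered.tolist())  # [nan, 2.0, 3.0, 4.0, nan]
print(partial.tolist())   # [1.5, 2.0, 3.0, 4.0, 4.5]
```

Lowering min_periods trades missing values at the edges for averages computed from fewer points.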

Computing the trend

moving_average = tunnel.rolling(
    window=365,       # 365-day window
    center=True,      # puts the average at the center of the window
    min_periods=183,  # choose about half the window size
).mean()

DeterministicProcess

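from statsmodels.tsa.deterministic import DeterministicProcess

dp = DeterministicProcess(
    index=tunnel.index,  # dates from the training data
    constant=True,       # include a constant column (dummy feature for the intercept / bias)
    order=1,             # polynomial order of the time trend; 1 means linear
    drop=True,           # drop terms if necessary to avoid collinearity
)
X = dp.in_sample()  # creates features for the dates given in the `index` argument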

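DeterministicProcess is a statsmodels utility for generating deterministic features for a time series (trend, seasonality, and so on).

It handles automatically:

the intercept term (bias)
polynomial trends (linear, quadratic, cubic, and so on)
collinearity, by dropping redundant terms

This makes fitting a linear regression to a time series safer and more stable.

In short, it generates deterministic sequences to use as features.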
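To make the generated features concrete, here is a small sketch on an invented five-day index. It uses order=2 purely to show a polynomial term; that choice is an assumption for illustration, not a setting taken from these notes.

```python
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess

# Invented index standing in for the training dates.
index = pd.period_range("2020-01-01", periods=5, freq="D")

dp = DeterministicProcess(
    index=index,
    constant=True,  # adds the const column
    order=2,        # adds trend and trend_squared
    drop=True,
)
X = dp.in_sample()

print(X.columns.tolist())            # ['const', 'trend', 'trend_squared']
print(X["trend"].tolist())           # [1.0, 2.0, 3.0, 4.0, 5.0]
print(X["trend_squared"].tolist())   # [1.0, 4.0, 9.0, 16.0, 25.0]
```

The trend column counts time steps starting from 1, and the features share the index that was passed in.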

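No intercept

model = LinearRegression(fit_intercept=False)
model.fit(X, y)

When DeterministicProcess is created with constant=True, X already contains a constant column, so the model's own intercept is turned off to avoid duplicating it.

Forecasting the future

Generate the future \(X\), then predict from it

X_fore = dp.out_of_sample(steps=30)  # the next 30 days
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)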
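The whole fit-then-forecast sequence can be sketched end to end on invented data. The series below is constructed as exactly 3 + 2 * trend, so the forecast values can be checked by hand; the three-step horizon is likewise chosen only to keep the output short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess

# Invented series: y = 3 + 2 * trend, with trend = 1..10.
index = pd.period_range("2020-01-01", periods=10, freq="D")
y = pd.Series(3.0 + 2.0 * np.arange(1, 11), index=index)

dp = DeterministicProcess(index=index, constant=True, order=1, drop=True)
X = dp.in_sample()

# The const column plays the role of the intercept.
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

# Features for the 3 days after the end of the training index.
X_fore = dp.out_of_sample(steps=3)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

print(y_fore.index[0])                # 2020-01-11
print(np.round(y_fore.to_numpy(), 6)) # [25. 27. 29.]
```

The forecast index continues from the last training date because out_of_sample extends the index given to DeterministicProcess.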

Machine Learning in Practice