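Statistics Basics

Variance

Rules of calculation:

\[ \operatorname{Var}(aX+bY)=a^{2}\operatorname{Var}(X)+b^{2}\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y) \]

For a linear transformation \(Y=aX+b\):

\[ \operatorname{Var}(Y)=\operatorname{Var}(a X+b)=a^{2} \operatorname{Var}(X) \]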

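Covariance

For two random variables \(X_1, X_2\) with means \(\mu_1, \mu_2\):

\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=E\left(\left(X_{1}-\mu_{1}\right)\left(X_{2}-\mu_{2}\right)\right) \]

An equivalent formula:

\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=E\left(X_{1} X_{2}\right)-\mu_{1} \mu_{2} \]

If \(X_1, X_2\) are independent, then

\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=0 \]

Given a dataset of pairs \((x_i, y_i)\), \(i=1,\dots,n\), we want to know whether the two variables are related. The sample covariance is

\[ \operatorname{Cov}(x,y)= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]

If \(\operatorname{Cov}(x,y) > 0\), the two variables are positively correlated.

If \(\operatorname{Cov}(x,y) < 0\), they are negatively correlated.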

cov(dataset)  # sample covariance matrix of the columns of dataset

\[ \operatorname{cov}(\text { LAOZOne\_small })=\frac{1}{n-1}\left(\begin{array}{ll} s_{\text {ozone,ozone }}^{2} & s_{\text {ozone,temp }}^{2} \\ s_{\text {ozone,temp }}^{2} & s_{\text {temp,temp }}^{2} \end{array}\right)=\frac{1}{n-1}\left(\begin{array}{ll} s_{y y}^{2} & s_{x y}^{2} \\ s_{x y}^{2} & s_{x x}^{2} \end{array}\right) \]

Correlation

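\(r\) is the sample correlation coefficient, \(r=\frac{s_{x y}^{2}}{\sqrt{s_{x x}^{2} s_{y y}^{2}}}\).

\(r^2\) is the proportion of the variation in \(y\) that is explained by the linear relationship with \(x\):

\[ r^{2}=\frac{s_{x y}^{4}}{s_{x x}^{2} s_{y y}^{2}}=\frac{\frac{1}{n-1} s_{x y}^{2} \cdot \frac{1}{n-1} s_{x y}^{2}}{\frac{1}{n-1} s_{x x}^{2} \cdot \frac{1}{n-1} s_{y y}^{2}} \]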
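The covariance and correlation formulas can be checked numerically. The sketch below uses the built-in `cars` data purely as an illustration (the dataset and the variable names `sxy2`, `sxx2`, `syy2` are not from the original notes), with \(s_{xy}^2\), \(s_{xx}^2\), \(s_{yy}^2\) taken as the sums of cross products and squares.

```r
# Illustration only: cars is a built-in dataset, not the data used in these notes.
x = cars$speed
y = cars$dist
n = length(x)

sxy2 = sum((x - mean(x)) * (y - mean(y)))  # s_xy^2: sum of cross products
sxx2 = sum((x - mean(x))^2)                # s_xx^2: sum of squares of x
syy2 = sum((y - mean(y))^2)                # s_yy^2: sum of squares of y

r = sxy2 / sqrt(sxx2 * syy2)               # sample correlation coefficient

all.equal(sxy2 / (n - 1), cov(x, y))           # matches the sample covariance
all.equal(r, cor(x, y))                        # matches cor()
all.equal(r^2, summary(lm(y ~ x))$r.squared)   # r^2 equals R^2 of the simple regression
```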

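Hypothesis testing

\(H_0\) is the null hypothesis, a hypothesis stated independently of the data. For example: A and B are not different.

We can only reject the null hypothesis or fail to reject it.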

p-value

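A p-value is a number between 0 and 1 that helps us decide whether two things differ.

For example, when testing whether A and B differ, a small p-value such as 0.02 means that a difference at least as large as the observed one would be unlikely if A and B were really the same, so we have strong evidence that they differ.

A large p-value means the data do not give us enough evidence to conclude that they differ.

In hypothesis testing, the p-value is what lets us reject the null hypothesis: we reject \(H_0\) when the p-value is below the chosen significance level.

Computing the p-value

A p-value is the sum of three parts, each computed assuming \(H_0\) is true:

  • the probability of the observed event
  • the probability of events that are equally rare
  • the probability of events that are rarer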
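As a minimal sketch of this three-part sum, assume a coin is flipped 5 times and lands heads every time, with \(H_0\) being that the coin is fair. The coin, the counts, and all variable names are made up for illustration.

```r
# Hypothetical example: 5 flips, 5 heads observed; H0: the coin is fair.
n_flips  = 5
observed = 5
outcomes = 0:n_flips
probs = dbinom(outcomes, size = n_flips, prob = 0.5)  # P(k heads) under H0
p_obs = probs[observed + 1]                           # probability of what we saw

tol = 1e-12                                           # guard against floating-point noise
is_obs  = outcomes == observed
p_event = sum(probs[is_obs])                                # part 1: the observed event
p_equal = sum(probs[!is_obs & abs(probs - p_obs) <= tol])   # part 2: equally rare events (0 heads)
p_rarer = sum(probs[probs < p_obs - tol])                   # part 3: rarer events (none here)

p_value = p_event + p_equal + p_rarer
p_value
```

The result agrees with the two-sided exact test `binom.test(5, 5, 0.5)$p.value`.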

T-test

\[ H_{0}: \beta_{1}=0 \text{ vs } H_{A}: \beta_{1} \neq 0 \]

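Compute the test statistic, which follows a \(t_{n-2}\) distribution under \(H_0\):

\[ T=\frac{b_{1}}{s / s_{x x}} \sim t_{n-2} \]

Given the significance level \(\alpha=0.05\), look up the critical value \(t_{n-2,1-\alpha / 2}\) and reject \(H_0\) when

\[ |T|>t_{n-2,1-\alpha / 2} \]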

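Looking up t quantiles in R

# 0.995 quantile of t with n - 2 = 328 degrees of freedom (n = 330), i.e. alpha = 0.01; use 0.975 for alpha = 0.05
qt(0.995, 328)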
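The whole test can be carried out by hand and compared with what `lm` reports. This is a sketch under assumptions: it uses the built-in `cars` data (not the data behind these notes), \(\alpha = 0.05\), and illustrative variable names.

```r
# Illustration only: cars is a built-in dataset.
x = cars$speed
y = cars$dist
n = length(x)

s_xx = sqrt(sum((x - mean(x))^2))                    # s_xx, where s_xx^2 = sum (x_i - xbar)^2
b1   = sum((x - mean(x)) * (y - mean(y))) / s_xx^2   # least squares slope
b0   = mean(y) - b1 * mean(x)                        # least squares intercept
s    = sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))      # residual standard error

T_stat = b1 / (s / s_xx)                             # test statistic for H0: beta_1 = 0
t_crit = qt(1 - 0.05 / 2, df = n - 2)                # critical value for alpha = 0.05
reject = abs(T_stat) > t_crit

# Cross-check against the t value in the lm summary
t_lm = summary(lm(dist ~ speed, data = cars))$coefficients["speed", "t value"]
c(T_stat = T_stat, t_crit = t_crit, t_lm = t_lm)
```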

Regression analysis

Simple Regression

A straight line is defined by

\[ y=\beta_{0}+\beta_{1} x \]

Least squares

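For a dataset \((x_i,y_i)\), \(i=1,\dots,n\), least squares chooses \(b_0,b_1\) to minimize

\[ \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} \]

where \(\hat{y}_{i}=b_{0}+b_{1} x_{i}\).

Setting the partial derivatives with respect to \(b_0\) and \(b_1\) to zero and solving gives

\[ b_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \]

\[ b_{0}=\bar{y}-b_{1} \bar{x} \]

where \(\bar{x}=\frac{1}{n} \sum_{i=1}^{n} x_{i}\) is the sample mean of the \(x_i\), and \(\bar{y}\) is defined in the same way.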

Deriving \(b_1\)
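One way to carry out the derivation, sketched here with \(S\) as an illustrative name for the sum of squared residuals, is to write

\[ S(b_0,b_1)=\sum_{i=1}^{n}\left(y_i-b_0-b_1x_i\right)^2 \]

and set both partial derivatives to zero:

\[ \frac{\partial S}{\partial b_0}=-2\sum_{i=1}^{n}\left(y_i-b_0-b_1x_i\right)=0 \;\Rightarrow\; b_0=\bar{y}-b_1\bar{x} \]

\[ \frac{\partial S}{\partial b_1}=-2\sum_{i=1}^{n}x_i\left(y_i-b_0-b_1x_i\right)=0 \]

Substituting \(b_0=\bar{y}-b_1\bar{x}\) into the second equation gives

\[ \sum_{i=1}^{n}x_i\left(\left(y_i-\bar{y}\right)-b_1\left(x_i-\bar{x}\right)\right)=0 \]

Because \(\sum_{i}(y_i-\bar{y})=0\) and \(\sum_{i}(x_i-\bar{x})=0\), the factor \(x_i\) can be replaced by \(x_i-\bar{x}\) without changing the sum, so

\[ \sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)=b_1\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \]

Dividing by \(\sum_{i}(x_i-\bar{x})^2\), which is nonzero unless all \(x_i\) are equal, yields the least squares slope.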

Notation

\[ s_{x x}^{2}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \]

Linear models in R

model = lm(yval ~ xval, data = datasetvar)  # regress yval on xval
summary(model)                              # coefficients, standard errors, t-tests, R^2

Multiple Regression

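model = lm(yval ~ x1val + x2val, data = datasetvar)  # two predictors
summary(model)

model = lm(yval ~ . , data = datasetvar)  # use every other column as a predictor
summary(model)

# Box-Cox transformation with lambda = 0.5 (square root of the response);
# every column except state is a predictor, and row 51 is dropped
model = lm(sqrt(crime) ~ . - state, data = crime[-51, ])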

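When the confidence interval of a coefficient contains 0, the coefficient is not significantly different from zero at that level, and the predictor is a candidate for removal.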
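A quick way to apply this rule is to compute the intervals with `confint` and flag those that cover 0. The sketch below uses the built-in `mtcars` data and an arbitrary choice of predictors, both assumptions made only for illustration.

```r
# Illustration only: mtcars is a built-in dataset; the predictors are arbitrary.
model = lm(mpg ~ wt + hp + qsec, data = mtcars)

ci = confint(model, level = 0.95)           # one row per coefficient: lower and upper bound
contains_zero = ci[, 1] < 0 & ci[, 2] > 0   # TRUE where the 95% interval covers 0
contains_zero

# The 95% interval covers 0 exactly when the t-test p-value exceeds 0.05
pvals = summary(model)$coefficients[, "Pr(>|t|)"]
all(contains_zero == (pvals > 0.05))
```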

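Matrices

Transpose

\[ (AB)'=B'A' \]

\[ (A+B)'=A'+B' \]

Idempotent matrix

\[ HH=H \]

Trace: \(\operatorname{tr}(H)\) is the sum of the diagonal entries.

\[ \operatorname{tr}(A+B)=\operatorname{tr}(A)+\operatorname{tr}(B) \]

\[ \operatorname{tr}(AB)=\operatorname{tr}(BA) \]

Some results, where \(H\) is the hat matrix of a regression with \(n\) observations, an intercept, and \(k\) predictors, and \(X\) is the design matrix:

\[ \operatorname{tr}(I-H)=n-k-1 \]

\[ (I-H)X=0 \]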
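These identities can be checked numerically. The sketch assumes \(H = X(X'X)^{-1}X'\), the usual hat matrix of least squares (the notes do not write it out), and uses the built-in `cars` data with one predictor, so \(k = 1\).

```r
# Illustration only: cars is a built-in dataset; here k = 1 predictor.
X = cbind(1, cars$speed)                  # n x (k + 1) design matrix with an intercept column
n = nrow(X)
k = ncol(X) - 1
H = X %*% solve(t(X) %*% X) %*% t(X)      # assumed definition of the hat matrix
I_n = diag(n)                             # n x n identity matrix

all.equal(H %*% H, H)                     # idempotent: HH = H
all.equal(sum(diag(I_n - H)), n - k - 1)  # tr(I - H) = n - k - 1
max(abs((I_n - H) %*% X))                 # (I - H) X = 0, up to rounding error
```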

Specification

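Suppose there is a binary predictor, for example "with water" versus "without water".

We want to know the effect of this predictor.

Main Effect

The effect is only a constant shift: the binary predictor changes the intercept but not the slope.

Interaction

The effect depends on another predictor: the binary predictor also changes the slope of that predictor.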
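Written as models, with \(x\) a continuous predictor and \(z\) the binary indicator (say \(z=1\) with water and \(z=0\) without; the symbols are chosen here for illustration), the two cases are

\[ \text{main effect only:}\quad y_i=\beta_0+\beta_1x_i+\beta_2z_i+\epsilon_i \]

\[ \text{with interaction:}\quad y_i=\beta_0+\beta_1x_i+\beta_2z_i+\beta_3x_iz_i+\epsilon_i \]

In the first model the two groups follow parallel lines separated by \(\beta_2\). In the second, the slope is \(\beta_1\) when \(z=0\) and \(\beta_1+\beta_3\) when \(z=1\).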

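Polynomial regression

\[ y_i= \beta_0 + \beta_1x_i + \beta_2 x_i ^2 + \epsilon_i \]

lm(y ~ x + I(x^2))  # I() makes ^ mean arithmetic squaring instead of a formula operator

\[ y_i= \beta_0 + \beta_1x_i + \beta_2 x_i ^2 + \dots + \beta_kx_i^k + \epsilon_i \]

lm(y ~ poly(x, degree = k, raw = TRUE))  # raw = TRUE uses x, x^2, ..., x^k directly

For example

lm(Strength ~ poly(Conc, degree = 2, raw = TRUE), data = wood)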
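The two ways of writing a quadratic fit give the same coefficients. The sketch below checks this on simulated data; the data-generating values are made up for illustration.

```r
# Simulated data, made up for illustration.
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 0.5 * x - 0.3 * x^2 + rnorm(50)

fit_I    = lm(y ~ x + I(x^2))
fit_poly = lm(y ~ poly(x, degree = 2, raw = TRUE))

# Same coefficient values; only the coefficient names differ
all.equal(unname(coef(fit_I)), unname(coef(fit_poly)))
```

With `raw = FALSE` (the default of `poly`), orthogonal polynomials are used instead: the fitted values are the same, but the coefficients are not directly comparable.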

Interaction

lm(formula = y ~ x + z + x:z, data = dataset)  # main effects plus the x:z interaction term
lm(formula = y ~ x*z, data = dataset)          # equivalent: x*z expands to x + z + x:z

Orthogonal

The first column of the matrix is orthogonal to every other column:

\[ x_1'x_i=0, \quad \forall i > 1 \]

Model diagnostics

Residuals

\[ y_i-\hat{y}_i \]

residuals(model)

\[ SS_{residual}=\sum_{i=1}^n(y_i-\hat{y}_i)^2 \]

sum(residuals(model)^2)

\[ s^2=\frac{1}{n-k-1}SS_{residual} \]

s = sigma(model)  # residual standard error s
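The relation between \(SS_{residual}\), \(s^2\), and `sigma()` can be verified directly. This sketch uses the built-in `mtcars` data with two arbitrary predictors, chosen only for illustration.

```r
# Illustration only: mtcars is a built-in dataset; k = 2 predictors here.
model = lm(mpg ~ wt + hp, data = mtcars)

n = nobs(model)                        # number of observations
k = length(coef(model)) - 1            # number of predictors, excluding the intercept
ss_residual = sum(residuals(model)^2)  # residual sum of squares
s2 = ss_residual / (n - k - 1)         # s^2 from the formula above

all.equal(sqrt(s2), sigma(model))      # sigma() returns s
n - k - 1 == df.residual(model)        # n - k - 1 is the residual degrees of freedom
```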

Lag 1 residuals

library(ggplot2)
# dplyr::lag shifts the vector by one position; base stats::lag would leave the values unshifted
res = data.frame(resid = residuals(model), lag1_resid = dplyr::lag(residuals(model)))
ggplot(res, aes(lag1_resid, resid)) + geom_point()

The plot should show no pattern. An upward or downward trend means the residuals are positively or negatively correlated, suggesting some positive or negative autocorrelation in ozone levels.

The lack of independence violates one of the assumptions required for multiple linear regression.

Autocorrelation function

acf(res$resid)

The positive lag 1 autocorrelation exceeds the approximate 95% confidence bands (drawn in blue), which matches the positive association seen in the lag 1 residual plot.

DW-Test

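library(lmtest)

dwtest(ozone_model, alternative = "two.sided")

Under the null hypothesis of independence, the Durbin-Watson statistic is close to 2 (\(DW \approx 2\)). Values well below 2 point to positive autocorrelation, and values well above 2 to negative autocorrelation.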
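To see why the statistic sits near 2 under independence, it helps to compute it by hand: \(DW = \sum_{t=2}^{n}(e_t-e_{t-1})^2 / \sum_{t=1}^{n} e_t^2 \approx 2(1-r_1)\), where \(r_1\) is the lag 1 autocorrelation of the residuals. The sketch below uses simulated data with positively autocorrelated errors; all values and names are made up for illustration.

```r
# Simulated regression with AR(1) errors, made up for illustration.
set.seed(42)
n = 200
x = rnorm(n)
e = as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # positively autocorrelated errors
y = 1 + 2 * x + e

res = residuals(lm(y ~ x))
dw = sum(diff(res)^2) / sum(res^2)          # Durbin-Watson statistic
r1 = sum(res[-1] * res[-n]) / sum(res^2)    # lag 1 autocorrelation of the residuals

c(dw = dw, approx = 2 * (1 - r1))           # the two values are close; both fall below 2 here
```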

Cook's Distance

library(ggfortify)
autoplot(model, which = 4)

Model Selection

Forward selection with AIC

library(MASS)
model_AIC = stepAIC(lm(log(C) ~ 1, data = powerplant),
                    direction = "forward",
                    scope = list(upper = lm(log(C) ~ ., data = powerplant)))

Forward selection with BIC

model_BIC = stepAIC(lm(log(C) ~ 1, data = powerplant),
                    direction = "forward",
                    scope = list(upper = lm(log(C) ~ ., data = powerplant)),
                    k = log(nrow(powerplant)))
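The only difference between the AIC and BIC calls is the penalty `k`: the default `k = 2` gives AIC, and `k = log(n)` gives BIC. A small check of that relation, using the built-in `mtcars` data as a stand-in (an assumption for illustration, since `powerplant` is not available here):

```r
# Illustration only: mtcars stands in for the powerplant data.
fit = lm(mpg ~ wt + hp, data = mtcars)
n = nobs(fit)

all.equal(AIC(fit, k = log(n)), BIC(fit))  # BIC is AIC with penalty k = log(n)

extractAIC(fit, k = 2)       # criterion stepAIC compares with the default k = 2
extractAIC(fit, k = log(n))  # criterion it compares when k = log(n)
```

`nrow(powerplant)` equals `n` as long as no rows are dropped for missing values.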

Backward elimination with AIC

model_AIC_back = stepAIC(lm(log(C) ~ ., data = powerplant), direction = "backward")

Backward elimination with BIC

model_BIC_back = stepAIC(lm(log(C) ~ ., data = powerplant), direction = "backward",
                         k = log(nrow(powerplant)))

Stepwise selection

model_AIC_step = stepAIC(lm(log(C) ~ 1, data = powerplant), direction = "both",
                         scope = list(upper = lm(log(C) ~ ., data = powerplant)))

Survival regression

library(survival)  # provides Surv() and coxph()
coxph(Surv(duration, delta) ~ smoke, data = bfeed)

R

Ternary operator

if_else(a > b, ra, rb)  # dplyr::if_else; the base R counterpart is ifelse()

Convert to a factor

factor()

Add a column

data$newcol = val

Check the type

class(var)

Matrices

Building a matrix

nrow gives the number of rows, rep builds a constant column, and cbind binds columns into a matrix.

X = cbind(rep(1, nrow(weight)), weight$Before)

Matrix inverse

solve(X)  # X must be square and invertible

Matrix multiplication

X %*% Y

Transpose

t(X)

Diagonal

diag(X)
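Put together, these operations are enough to compute least squares coefficients by hand as \(b=(X'X)^{-1}X'y\). A minimal sketch, assuming the built-in `cars` data in place of the `weight` data used above:

```r
# Illustration only: cars is a built-in dataset standing in for weight.
X = cbind(rep(1, nrow(cars)), cars$speed)  # intercept column and one predictor
y = cars$dist

b = solve(t(X) %*% X) %*% t(X) %*% y       # least squares coefficients (X'X)^{-1} X'y

# Same values as the coefficients fitted by lm
all.equal(as.numeric(b), unname(coef(lm(dist ~ speed, data = cars))))
```

Here `solve` is applied to the square matrix `t(X) %*% X`, not to `X` itself, which is not square.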