Statistics Basics
Variance
Rules of calculation
\[ \operatorname{Var}(aX+bY)=a^{2}\operatorname{Var}(X)+b^{2}\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y) \]
\[ \operatorname{Var}(Y)=\operatorname{Var}(a X+b)=a^{2} \operatorname{Var}(X) \]
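Covariance
For two random variables \(X_1, X_2\) with means \(\mu_1, \mu_2\):
\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=E\left(\left(X_{1}-\mu_{1}\right)\left(X_{2}-\mu_{2}\right)\right) \]
An equivalent formula:
\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=E\left(X_{1} X_{2}\right)-\mu_{1} \mu_{2} \]
If the two random variables \(X_1, X_2\) are independent, then
\[ \operatorname{Cov}\left(X_{1}, X_{2}\right)=0 \]
Given a data set of paired observations \((x_i, y_i)\), \(i=1,\dots,n\), we want to know whether the two variables are related. The sample covariance is
\[ Cov(x,y)= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]
If \(Cov(x,y) > 0\), the two variables are positively correlated.
If \(Cov(x,y) < 0\), they are negatively correlated.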
cov(dataset)  # covariance matrix of the columns of dataset
\[ \operatorname{cov}(\text { LAOZOne\_small })=\frac{1}{n-1}\left(\begin{array}{ll} s_{\text {ozone,ozone }}^{2} & s_{\text {ozone,temp }}^{2} \\ s_{\text {ozone,temp }}^{2} & s_{\text {temp,temp }}^{2} \end{array}\right)=\frac{1}{n-1}\left(\begin{array}{ll} s_{y y}^{2} & s_{x y}^{2} \\ s_{x y}^{2} & s_{x x}^{2} \end{array}\right) \]
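A minimal sketch of the sample covariance formula, checked against R's built-in `cov()`. The vectors `x` and `y` are made-up numbers for illustration, not the ozone data.

```r
# Hypothetical paired observations (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Sample covariance from the definition, dividing by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Built-in equivalents: cov() on two vectors, or on a data frame for the full matrix
cov_builtin <- cov(x, y)
cov_matrix <- cov(data.frame(x = x, y = y))
```

The off-diagonal entries of `cov_matrix` hold the covariance of `x` and `y`; the diagonal entries hold the two variances.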
Correlation
\(r\) is the sample correlation coefficient.
\(r^2\) is the proportion of the variation in the data that the linear relationship explains
\[ r^{2}=\frac{s_{x y}^{4}}{s_{x x}^{2} s_{y y}^{2}}=\frac{\frac{1}{n-1} s_{x y}^{2} \cdot \frac{1}{n-1} s_{x y}^{2}}{\frac{1}{n-1} s_{x x}^{2} \cdot \frac{1}{n-1} s_{y y}^{2}} \]
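As a rough check of the \(r^2\) formula, the sketch below computes it from the sums of squares and compares it with `cor()` and with the R-squared reported by `lm()`. The data are made up, and the names `sxx2`, `syy2`, `sxy2` are illustrative stand-ins for \(s_{xx}^2\), \(s_{yy}^2\), \(s_{xy}^2\).

```r
# Hypothetical data (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sums of squares and cross-products, matching s_xx^2, s_yy^2, s_xy^2
sxx2 <- sum((x - mean(x))^2)
syy2 <- sum((y - mean(y))^2)
sxy2 <- sum((x - mean(x)) * (y - mean(y)))

r2_manual <- sxy2^2 / (sxx2 * syy2)     # r^2 = s_xy^4 / (s_xx^2 * s_yy^2)
r2_cor <- cor(x, y)^2                   # squared correlation coefficient
r2_lm <- summary(lm(y ~ x))$r.squared   # R-squared of the simple regression
```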
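Hypothesis testing
\(H_0\) is the null hypothesis, a hypothesis that does not depend on the data. For example: A and B do not differ.
We can only reject the null hypothesis or fail to reject it.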
p-value
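A p-value is a number between 0 and 1 that helps us decide whether two things differ.
For example, when testing whether A and B differ, a small p-value such as 0.02 is strong evidence that they do: if A and B were the same, a result at least this extreme would occur only about 2% of the time.
A large p-value means we are not so sure; the data are consistent with no difference.
In hypothesis testing, the p-value helps us decide whether to reject the null hypothesis.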
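Computing the p-value
A p-value is the sum of three parts:
- the probability of the observed event
- the probability of events that are equally rare
- the probability of events that are rarer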
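The three parts can be sketched with a coin-flip example. The experiment (4 heads in 5 flips, null hypothesis of a fair coin) is a made-up illustration and the variable names are hypothetical.

```r
# Hypothetical experiment: 4 heads in 5 flips; H0: the coin is fair (p = 0.5)
n_flips <- 5
observed <- 4
probs <- dbinom(0:n_flips, size = n_flips, prob = 0.5)  # P(k heads) for k = 0..5
p_obs <- probs[observed + 1]

tol <- 1e-9                                   # tolerance for floating-point comparison
equally_rare <- abs(probs - p_obs) < tol      # includes the observed outcome itself
rarer <- probs < p_obs - tol

part_observed <- p_obs                              # the observed event
part_equal <- sum(probs[equally_rare]) - p_obs      # other events that are equally rare
part_rarer <- sum(probs[rarer])                     # events that are rarer

p_value <- part_observed + part_equal + part_rarer  # 5/32 + 5/32 + 2/32 = 0.375
```

For this case the result agrees with the two-sided p-value from `binom.test(4, 5, p = 0.5)`.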
T-test
\[ H_{0}: \beta_{1}=0 \text{ vs } H_{A}: \beta_{1} \neq 0 \]
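Compute the test statistic
\[ T=\frac{b_{1}}{s / s_{x x}} \sim t_{n-2} \]
where \(s\) is the residual standard error and \(s_{xx}=\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\). Given the significance level \(\alpha=0.05\), find the critical value \(t_{n-2,1-\alpha / 2}\) and reject \(H_0\) when
\[ |T|>t_{n-2,1-\alpha / 2} \]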
Looking up t-distribution quantiles in R
qt(0.995, 328)  # n = 330, so df = n - 2 = 328; 0.995 = 1 - alpha/2 corresponds to alpha = 0.01
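A sketch of the slope t-test computed by hand, on simulated data (the data, the seed, and all variable names are assumptions for illustration). It uses \(\alpha = 0.05\), so the quantile is taken at \(1 - \alpha/2 = 0.975\).

```r
set.seed(1)                        # simulated data, for illustration only
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
b1 <- unname(coef(fit)["x"])       # estimated slope
s <- sigma(fit)                    # residual standard error
sxx <- sqrt(sum((x - mean(x))^2))  # square root of the sum of squares of x

T_stat <- b1 / (s / sxx)                  # test statistic for H0: beta_1 = 0
t_crit <- qt(1 - 0.05 / 2, df = n - 2)    # critical value for alpha = 0.05
reject_H0 <- abs(T_stat) > t_crit
```

`T_stat` should match the "t value" that `summary(fit)` reports for `x`.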
Regression analysis
Simple Regression
This is the equation of a straight line:
\[ y=\beta_{0}+\beta_{1} x \]
Least squares
For a data set \((x_i,y_i)\), least squares chooses \(b_0,b_1\) so that
\[ \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} \]
is minimized, where \(\hat{y}_{i}=b_{0}+b_{1} x_{i}\).
Setting the partial derivatives to zero and solving gives
\[ b_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \]
\[ b_{0}=\bar{y}-b_{1} \bar{x} \]
where \(\bar{x}=\frac{1}{n} \sum_{i=1}^{n} x_{i}\) is the sample mean.
Deriving \(b_1\)
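A sketch of the derivation, assuming the usual calculus argument. Write \(S(b_0,b_1)=\sum_{i=1}^{n}(y_i-b_0-b_1x_i)^2\) (the name \(S\) is introduced here for convenience) and set both partial derivatives to zero:

\[ \frac{\partial S}{\partial b_0}=-2\sum_{i=1}^{n}\left(y_i-b_0-b_1x_i\right)=0 \;\Rightarrow\; b_0=\bar{y}-b_1\bar{x} \]
\[ \frac{\partial S}{\partial b_1}=-2\sum_{i=1}^{n}x_i\left(y_i-b_0-b_1x_i\right)=0 \]

Substituting \(b_0=\bar{y}-b_1\bar{x}\) into the second equation gives

\[ \sum_{i=1}^{n}x_i\left(\left(y_i-\bar{y}\right)-b_1\left(x_i-\bar{x}\right)\right)=0 \]

Because \(\sum_{i=1}^{n}(y_i-\bar{y})=0\) and \(\sum_{i=1}^{n}(x_i-\bar{x})=0\), replacing the leading \(x_i\) by \(x_i-\bar{x}\) does not change the sum, so

\[ \sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)=b_1\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \;\Rightarrow\; b_1=\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2} \]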
Formula
\[ s_{x x}^{2}=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \]
Linear models in R
model = lm(yval ~ xval, data = datasetvar)
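To connect the least squares formulas with `lm()`, the sketch below computes \(b_0\) and \(b_1\) by hand on simulated data and fits the same model with `lm()`. The contents of `datasetvar` are simulated here as an assumption; only the column names follow the call above.

```r
set.seed(2)   # simulated data, for illustration only
datasetvar <- data.frame(xval = runif(25, 0, 10))
datasetvar$yval <- 3 + 2 * datasetvar$xval + rnorm(25)

x <- datasetvar$xval
y <- datasetvar$yval

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same fit through lm(); coef(model) returns (intercept, slope)
model <- lm(yval ~ xval, data = datasetvar)
```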
Multiple Regression
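model = lm(yval ~ x1val + x2val, data = datasetvar)
When the confidence interval for a coefficient contains 0, that predictor can be dropped, because the data are consistent with its coefficient being zero.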
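A sketch of checking which confidence intervals contain 0, using `confint()`. The data are simulated so that `x2val` has no real effect; the data and the helper name `contains_zero` are assumptions for illustration.

```r
set.seed(3)   # simulated data, for illustration only
n <- 100
datasetvar <- data.frame(x1val = rnorm(n), x2val = rnorm(n))
# yval depends on x1val only; x2val is pure noise by construction
datasetvar$yval <- 2 + 3 * datasetvar$x1val + rnorm(n)

model <- lm(yval ~ x1val + x2val, data = datasetvar)
ci <- confint(model, level = 0.95)             # one row per coefficient: lower, upper
contains_zero <- ci[, 1] <= 0 & ci[, 2] >= 0   # TRUE where the interval covers 0
```

Predictors whose entry in `contains_zero` is `TRUE` are candidates for removal.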
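Matrices
Transpose
\[ (AB)'=B'A' \]
\[ (A+B)'=A'+B' \]
An idempotent matrix satisfies
\[ HH=H \]
The trace \(tr(H)\) is the sum of the diagonal entries
\[ tr(A+B)=tr(A)+tr(B) \]
\[ tr(AB)=tr(BA) \]
Some results, where \(H=X(X'X)^{-1}X'\) is the hat matrix of an \(n \times (k+1)\) design matrix \(X\)
\[ tr(I-H)=n-k-1 \]
\[ (I-H)X=0 \]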
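These identities can be checked numerically. The sketch assumes \(H\) is the hat matrix \(X(X'X)^{-1}X'\) of a random \(n \times (k+1)\) design matrix; the helper `tr` and the simulated `X` are illustrative.

```r
set.seed(4)   # random design matrix, for illustration only
n <- 20
k <- 2
X <- cbind(1, matrix(rnorm(n * k), nrow = n))   # n x (k + 1): intercept plus k predictors

H <- X %*% solve(t(X) %*% X) %*% t(X)           # hat matrix
I_n <- diag(n)                                  # n x n identity
tr <- function(A) sum(diag(A))                  # trace = sum of the diagonal

idempotent_gap <- max(abs(H %*% H - H))         # close to 0 if HH = H
trace_I_minus_H <- tr(I_n - H)                  # close to n - k - 1
annihilator_gap <- max(abs((I_n - H) %*% X))    # close to 0 if (I - H)X = 0
```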
Specification
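Suppose there is a binary predictor, for example with water versus without water.
We want to know the effect of this predictor.
Main Effect
The predictor shifts the response by a constant only.
Interaction
The predictor's effect depends on the value of another predictor.
Polynomial regression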
\[ y_i= \beta_0 + \beta_1x_i + \beta_2 x_i ^2 + \epsilon_i \]
lm(y ~ x + I(x^2))
\[ y_i= \beta_0 + \beta_1x_i + \beta_2 x_i ^2 +...+ \beta_kx_i^k + \epsilon_i \]
lm(y ~ poly(x, degree = k, raw = TRUE))
For example:
lm(Strength ~ poly(Conc, degree = 2, raw = TRUE), data = wood)
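A sketch showing that the `I(x^2)` form and the `poly(..., raw = TRUE)` form fit the same quadratic. The `wood` data are not reproduced here, so the example uses simulated `x` and `y` as an assumption.

```r
set.seed(5)   # simulated data, for illustration only
x <- seq(1, 10, length.out = 40)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(40, sd = 0.5)

fit_I <- lm(y ~ x + I(x^2))                            # explicit squared term
fit_poly <- lm(y ~ poly(x, degree = 2, raw = TRUE))    # raw polynomial basis

# Same estimates, only the coefficient names differ
coef_I <- unname(coef(fit_I))
coef_poly <- unname(coef(fit_poly))
```

Without `raw = TRUE`, `poly()` uses orthogonal polynomials: the fitted values stay the same, but the coefficients are no longer the \(\beta_j\) of the formula above.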
Interaction
lm(formula = y ~ x * z, data = dataset)  # x * z expands to x + z + x:z
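A sketch contrasting a main-effect model with an interaction model for a binary predictor `z` (0 = without, 1 = with). The data are simulated, so every name and number here is an assumption for illustration.

```r
set.seed(6)   # simulated data, for illustration only
n <- 60
dataset <- data.frame(x = runif(n, 0, 10), z = rep(c(0, 1), each = n / 2))
dataset$y <- 1 + 0.5 * dataset$x + 2 * dataset$z +
  0.8 * dataset$x * dataset$z + rnorm(n)

main_only <- lm(y ~ x + z, data = dataset)        # z only shifts the intercept
with_inter <- lm(y ~ x * z, data = dataset)       # z also changes the slope of x
long_form <- lm(y ~ x + z + x*z, data = dataset)  # same model written out in full

slope_without <- unname(coef(with_inter)["x"])                          # slope when z = 0
slope_with <- unname(coef(with_inter)["x"] + coef(with_inter)["x:z"])   # slope when z = 1
```

In `with_inter`, the `z` coefficient is the intercept shift and the `x:z` coefficient is the change in slope.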
Orthogonal
The first column of the matrix is orthogonal to every other column
\[ x_1'x_i=0, \forall i > 1 \]
Model diagnostics
Residuals
\[ y_i-\hat{y}_i \]
residuals(model)
\[ SS_{residual}=\sum_{i=1}^n(y_i-\hat{y}_i)^2 \]
sum(residuals(model)^2)
\[ s^2=\frac{1}{n-k-1}SS_{residual} \]
s = sigma(model)  # residual standard error s, the square root of s^2
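A short check of these formulas, using the `mtcars` data that ships with R. The choice of data set and predictors is an assumption for illustration only.

```r
# mtcars ships with R; this model has k = 2 predictors
model <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2

ss_residual <- sum(residuals(model)^2)   # residual sum of squares
s2 <- ss_residual / (n - k - 1)          # estimate of the error variance
s <- sqrt(s2)                            # should match sigma(model)
```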
Lag 1 residuals
res = data.frame(resid = residuals(model), lag1_resid = dplyr::lag(residuals(model)))  # dplyr::lag shifts the values by one position; stats::lag does not
A plot of each residual against its lag 1 residual should show no pattern.
Otherwise the residuals are positively or negatively correlated, which suggests positive or negative autocorrelation in ozone levels.
This lack of independence violates one of the assumptions required for multiple linear regression.
Autocorrelation function
acf(res$resid)
The positive lag 1 autocorrelation exceeds the approximate 95% confidence bands (drawn in blue), which matches the positive association seen in the lag 1 residual plot.
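A sketch of both diagnostics on simulated data with autocorrelated errors. The ozone data are not reproduced here, so the AR(1) errors, the seed, and the variable names are assumptions. The lag is built in base R as a stand-in for `dplyr::lag()`.

```r
set.seed(7)   # simulated data with AR(1) errors, for illustration only
n <- 200
x <- runif(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # autocorrelated by construction
y <- 1 + 2 * x + e

model <- lm(y ~ x)
r <- unname(residuals(model))
res <- data.frame(resid = r, lag1_resid = c(NA, head(r, -1)))  # shift by one position

# Correlation between each residual and the previous one
lag1_cor <- cor(res$resid, res$lag1_resid, use = "complete.obs")

# Autocorrelation function; plot = TRUE would draw it with the confidence bands
acf_vals <- acf(res$resid, plot = FALSE)
band <- qnorm(0.975) / sqrt(n)               # approximate 95% band
exceeds <- abs(acf_vals$acf[2]) > band       # acf[1] is lag 0, acf[2] is lag 1
```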
DW-Test
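library(lmtest)
Under the null hypothesis of independent errors, the Durbin-Watson statistic is close to 2 (\(DW \approx 2\)). Values well below 2 point to positive autocorrelation; values well above 2 point to negative autocorrelation.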
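A sketch of the Durbin-Watson statistic computed by hand in base R, \(DW=\sum_{i=2}^{n}(e_i-e_{i-1})^2 / \sum_{i=1}^{n}e_i^2\), on two simulated data sets. The helper `dw_stat` and the data are assumptions for illustration. The `lmtest` package provides `dwtest()`, which also reports a p-value.

```r
set.seed(8)   # simulated data, for illustration only
n <- 200
x <- runif(n)
y_indep <- 1 + 2 * x + rnorm(n)                                             # independent errors
y_auto <- 1 + 2 * x + as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors

# Durbin-Watson statistic from the residuals of a fitted model
dw_stat <- function(model) {
  e <- residuals(model)
  sum(diff(e)^2) / sum(e^2)
}

dw_indep <- dw_stat(lm(y_indep ~ x))   # expected to be near 2
dw_auto <- dw_stat(lm(y_auto ~ x))     # expected to be well below 2
```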
Cook's Distance
library(ggfortify)
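Cook's distance is also available in base R through `cooks.distance()`. A minimal sketch, assuming a model fitted to the built-in `mtcars` data (the data set and predictors are illustrative):

```r
# mtcars ships with R; the model is only an example
model <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(model)               # one value per observation
most_influential <- names(which.max(cd))  # observation with the largest influence
top3 <- head(sort(cd, decreasing = TRUE), 3)

# plot(model, which = 4) would draw Cook's distance per observation in base R
```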
Model Selection
Forward selection with AIC
library(MASS)
Forward selection with BIC
model_BIC = stepAIC(lm(log(C)~1, data = powerplant),
Backward elimination with AIC
model_AIC_back = stepAIC(lm(log(C)~., data = powerplant), direction = "backward")
Backward elimination with BIC
model_BIC_back = stepAIC(lm(log(C)~., data = powerplant), direction = "backward",
Stepwise selection
model_AIC_step = stepAIC(lm(log(C)~1,data = powerplant), direction = "both",
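A sketch of the selection variants with `MASS::stepAIC()`. The `powerplant` data are not reproduced here, so the example uses the built-in `mtcars` data with an illustrative set of candidate predictors. In `stepAIC()`, `k = 2` (the default) gives AIC and `k = log(n)` gives BIC.

```r
library(MASS)

# mtcars stands in for the powerplant data; the candidate predictors are illustrative
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
scope <- list(lower = ~ 1, upper = ~ wt + hp + qsec + drat)
n <- nrow(mtcars)

# Forward selection: start from the intercept-only model
fwd_aic <- stepAIC(null_model, scope = scope, direction = "forward", trace = FALSE)
fwd_bic <- stepAIC(null_model, scope = scope, direction = "forward",
                   k = log(n), trace = FALSE)

# Backward elimination: start from the full model
back_aic <- stepAIC(full_model, direction = "backward", trace = FALSE)
back_bic <- stepAIC(full_model, direction = "backward", k = log(n), trace = FALSE)

# Stepwise selection: terms may be added or removed at each step
step_aic <- stepAIC(null_model, scope = scope, direction = "both", trace = FALSE)
```

Forward selection needs an explicit `scope`, because the starting model does not know which terms it may add.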
Survival regression
library(survival)  # provides coxph() and Surv()
coxph(Surv(duration, delta) ~ smoke, data = bfeed)  # Cox model: time = duration, event indicator = delta
The R Language
Ternary operator
if_else(a > b, ra, rb)  # from dplyr; the base R equivalent is ifelse(a > b, ra, rb)
Convert to a factor
factor()
Add a column
data$newcol = val
Check the type
class(var)
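Matrices
Building a matrix
nrow gives the number of rows, rep builds a column of repeated values, and cbind binds columns into a matrix.
X = cbind(rep(1, nrow(weight)), weight$Before)
Matrix inverse
solve(X)  # X must be square and non-singular, e.g. solve(t(X) %*% X)
Matrix multiplication
X %*% Y
Transpose
t(X)
Diagonal
diag(X)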
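Put together, these operations are enough to compute the least squares estimates in matrix form, \(b=(X'X)^{-1}X'y\). A sketch on a simulated `weight` data frame: the `After` column and all values are assumptions, and only `Before` appears in the code above.

```r
set.seed(9)   # simulated data, for illustration only
weight <- data.frame(Before = rnorm(15, mean = 70, sd = 8))
weight$After <- 5 + 0.9 * weight$Before + rnorm(15)

X <- cbind(rep(1, nrow(weight)), weight$Before)   # design matrix: intercept column + Before
y <- weight$After

b <- solve(t(X) %*% X) %*% t(X) %*% y             # least squares estimates (b0, b1)
H <- X %*% solve(t(X) %*% X) %*% t(X)             # hat matrix
leverage <- diag(H)                               # diagonal of H

fit <- lm(After ~ Before, data = weight)          # the same fit through lm()
```

`b` should agree with `coef(fit)`, and `leverage` with `hatvalues(fit)`.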