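Statistics

Overview

What is statistics

The science of methods for working with quantitative data.

Statistics is the study of methods for dealing with quantitative information (data).

Its aim is to obtain information and knowledge from data.

It is divided into

  • Descriptive Statistics
    • Producing tables and charts
    • Exploratory Data Analysis: analysing the data to draw conclusions
  • Inductive Statistics (inferential statistics)
    • Studying relationships between data
    • Predicting future trends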

Basic concepts of statistics

Statistical units

Objects on which data is observed.

Population

The set of all statistical units of interest.

  • Can be finite, infinite, or hypothetical

Subpopulations

Subset of the population.

Sample

Actual subset of the population surveyed.

Characteristic/variable/features

Quantity of interest observed on a statistical unit.

Example: for a rent index, living space and building age.

Characteristic value

A concrete value taken by a characteristic.

Feature Types and Scales

Classification of features by scale level

  • Nominal scale

    if its expressions are names or categories that cannot be put into a meaningful order
  • Ordinal scale

    if its expressions can be ordered, but the distances between them cannot be interpreted meaningfully

    Example: the feature "Do you need statistics in your current job?" (constantly, frequently, occasionally, never)

  • Cardinal scale (metric)

    if its expressions are numbers and their distances can be interpreted meaningfully

The scale level determines how a quantity is measured.

Classification into qualitative and quantitative features

  • Qualitative feature
    • indicates a quality or a membership of a class
    • has a finite number of expressions
    • is at most ordinally scaled
  • Quantitative feature
    • Reflects an intensity or extent
    • Can be measured by numbers

Other classifications

Discrete vs. continuous
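As a rough illustration of the three scale levels, they map onto R data types roughly as follows. This is only a sketch: the variable names and values are made up for illustration.

```r
# Nominal: categories without a meaningful order -> unordered factor
nominal <- factor(c("red", "blue", "red"))

# Ordinal: categories with an order but no meaningful distances -> ordered factor
answers <- c("never", "frequently", "occasionally", "constantly", "never")
ordinal <- factor(answers,
                  levels = c("never", "occasionally", "frequently", "constantly"),
                  ordered = TRUE)

# Order comparisons are defined for ordered factors only
print(ordinal[1] < ordinal[2])   # "never" < "frequently"

# Cardinal (metric): numbers whose distances can be interpreted
living_space <- c(54.5, 80, 120) # hypothetical living space in sqm
print(diff(living_space))        # differences are meaningful here
```

Comparing two values of the unordered factor with `<` would only produce NA with a warning, which mirrors the fact that nominal categories have no order.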

How data is collected

  • Experiment

    methodically designed investigation for the empirical acquisition of data

  • Survey

    collecting data that in principle already exists

Types of survey

  • Primary statistical survey

    Collection of data specifically for current issues

  • Secondary statistical survey

    Use of existing original data for new questions

  • Tertiary statistical survey

    Use of already existing, compressed data (e.g. mean values) for new questions

Basic concepts of statistics

Frequencies

Consider a variable \(X\) observed on \(n\) statistical units, with observed values \(x_1, \ldots, x_n\).

The values \(x_1, \ldots, x_n\) are the raw data (original list).

Absolute frequencies

Count how often each value occurs.

Let \(a_1, \ldots, a_k\) be the distinct values; the absolute frequency of \(a_j\) is denoted \(h_j = h(a_j)\).

Relative frequencies

The share of all observations that take a given value.

It is denoted \(f_j=f(a_j)=h_j/n\).

\(f_1,f_2,\ldots,f_k\) is called the relative frequency distribution, and \(\sum _{j=1} ^{k} f_j= 1\).

Classification

The frequencies can be arranged in a table for analysis.
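A minimal sketch of these definitions in R, using `table()` for the absolute frequencies. The shoe-size values are made-up illustration data.

```r
# Raw data x_1, ..., x_n (illustrative values)
x <- c(40, 44, 42, 41, 44, 41, 43, 42, 43, 43)
n <- length(x)

h <- table(x)   # absolute frequencies h_j of the distinct values a_j
f <- h / n      # relative frequencies f_j = h_j / n

print(h)
print(f)
print(sum(f))   # the relative frequencies add up to 1
```

`prop.table(h)` would give the same relative frequencies in a single call.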

Charts

Bar diagrams

If the number \(k\) of the different realizations is small, the absolute / relative frequency distribution can be well represented by means of a bar diagram.

Column charts

An often even clearer variant (also for small \(k\)) are column charts.

Bar chart

A column chart drawn horizontally.

Pie charts

Histograms

For metric characteristics with a large number \(k\) of different expressions.

Cumulative frequency distributions

Absolute cumulative frequency distribution

\[ H(x)= \sum_{j:a_j \le x} h_j \]

Relative cumulative frequency distribution

\[ F(x)=\sum_{j:a_j \le x} f_j \]

Properties of both

  • Monotonically non-decreasing (step functions)
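The two cumulative distributions can be sketched in R with `cumsum()`, and `ecdf()` returns \(F\) as a function. The data are made-up illustration values.

```r
x <- c(40, 44, 42, 41, 44, 41, 43, 42, 43, 43)  # illustrative raw data
h <- table(x)

H <- cumsum(h)                    # absolute cumulative frequencies H(a_j)
F_rel <- cumsum(h) / length(x)    # relative cumulative frequencies F(a_j)

F_fun <- ecdf(x)                  # F(x) as a step function
print(H)
print(F_rel)
print(F_fun(42))                  # share of observations <= 42
```

Since every \(h_j \ge 0\), the cumulative sums can never decrease, which is the monotonicity property stated above.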

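Measures of central tendency

Central tendency

A measure of central tendency (location parameter) describes the centre of a distribution.

Mode

A simple measure of central tendency that can be used at every scale level.

If a frequency distribution has a unique maximum, the mode is the value at which that maximum occurs; it is denoted \(x_{mod}\).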

Central tendency: means

Arithmetic mean

The arithmetic mean is

\[ \bar{x}=\frac{1}{n}\left(x_{1}+\cdots+x_{n}\right)=\frac{1}{n} \sum_{i=1}^{n} x_{i} \]

It can also be computed from the relative frequencies:

\[ \bar{x}=a_{1} f_{1}+\cdots+a_{k} f_{k}=\sum_{j=1}^{k} a_{j} f_{j} \]

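If only grouped data are available, the mean can still be approximated.

Let the groups be the intervals \(\left[c_{0}, c_{1}\right),\left[c_{1}, c_{2}\right), \ldots,\left[c_{l-1}, c_{l}\right)\),

with class midpoints \(m_j=\frac{c_{j-1}+c_j}{2}\) and relative group frequencies \(f_j\). Then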

\[ \bar{x}_{\text {group }}=\sum_{j=1}^{l} m_{j} f_{j} \]

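Truncated mean

Drop the smallest and the largest value, then take the mean of the remaining values.

It is denoted \(\bar{x}_{g}\).

Winsorised means

Replace the smallest value with the second smallest and the largest value with the second largest, then take the mean.

It is denoted \(\bar{x}_w\).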
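The means above can be compared on a small sample with one extreme value. This is a sketch with made-up data; the truncated and Winsorised means are computed by hand exactly as described (one value at each end), not with a packaged function.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # illustrative data with one extreme value
n <- length(x)
xs <- sort(x)

mean_x <- mean(x)                    # arithmetic mean

f <- prop.table(table(x))            # relative frequencies f_j
a <- as.numeric(names(f))            # distinct values a_j
mean_freq <- sum(a * f)              # same mean as the sum of a_j * f_j

mean_trunc <- mean(xs[2:(n - 1)])    # truncated: drop smallest and largest value

xw <- xs                             # Winsorised: pull the extremes inwards
xw[1] <- xs[2]
xw[n] <- xs[n - 1]
mean_wins <- mean(xw)

print(c(mean = mean_x, truncated = mean_trunc, winsorised = mean_wins))
```

Both robust variants are much less affected by the value 30 than the arithmetic mean.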

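Central tendency: median

For the ordered list \(x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}\) with odd \(n\), the median is

\[ x_{med}=x_{\left(\frac{n+1}{2}\right)} \]

If \(n\) is even, the median is not unique: every value between \(x_{(n/2)}\) and \(x_{(n/2+1)}\) qualifies, and by convention their average \(\frac{1}{2}\left(x_{(n/2)}+x_{(n/2+1)}\right)\) is usually used.

Central tendency: quantiles

For \(0<p<1\), the \(p\)-quantile splits the data so that a share of at least \(p\) of the values lies at or below it and a share of at least \(1-p\) lies at or above it.

It is denoted \(\tilde{x}_{p}\).

Important quantiles

  • Median: \(x_{med}= \tilde{x}_{0.5}\), the \(50\%\)-quantile
  • Lower quartile: \(\tilde{x}_{0.25}\)
  • Upper quartile: \(\tilde{x}_{0.75}\)
  • Deciles: the \(10\%,20\%,\ldots,90\%\)-quantiles

For metric variables,

\[ d_Q= \tilde{x}_{0.75}-\tilde{x}_{0.25} \]

is the interquartile range.

Five-number summary

Adding the minimum and the maximum to the quartiles and the median gives

\[ x_{\min }, \tilde{x}_{0.25}, x_{med}, \tilde{x}_{0.75}, x_{\max } \]

Visualisation: the five-number summary is drawn as a box plot.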
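A short sketch of these quantities in R on made-up data. Note that `quantile()` interpolates by default (type 7) and `fivenum()` uses Tukey's hinges, so on other data sets the results may differ slightly from the textbook definition used above.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)        # illustrative data, n = 9 (odd)

x_med <- median(x)                       # x_((n+1)/2) for odd n
q <- quantile(x, probs = c(0.25, 0.75))  # lower and upper quartile (default type 7)
d_Q <- IQR(x)                            # interquartile range
five <- fivenum(x)                       # min, lower hinge, median, upper hinge, max

print(x_med)
print(q)
print(d_Q)
print(five)
```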

Measures of dispersion

Measures or parameters of dispersion describe how far the values of a distribution spread around their centre.

Variance

\[ \tilde{s}^{2}=\frac{1}{n}\left[\left(x_{1}-\bar{x}\right)^{2}+\cdots+\left(x_{n}-\bar{x}\right)^{2}\right]=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \]

Computed from the relative frequencies:

\[ \tilde{s}^{2}=\left(a_{1}-\bar{x}\right)^{2} f_{1}+\cdots+\left(a_{k}-\bar{x}\right)^{2} f_{k}=\sum_{j=1}^{k}\left(a_{j}-\bar{x}\right)^{2} f_{j} \]

Standard deviation

\[ \tilde{s}=+\sqrt{\tilde{s}^{2}}=+\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \]

Sample variance

\[ s^{2}=\frac{1}{n-1}\left[\left(x_{1}-\bar{x}\right)^{2}+\cdots+\left(x_{n}-\bar{x}\right)^{2}\right]=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \]

Coefficient of Variation

For a data set with non-negative values and arithmetic mean \(\bar{x}>0\), the coefficient of variation is

\[ v=\frac{\tilde{s}}{\bar{x}} \]
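One practical point when reproducing these formulas in R: the built-in `var()` and `sd()` use the \(1/(n-1)\) version (the sample variance \(s^2\)), so the empirical variance \(\tilde{s}^2\) has to be computed by hand. A minimal sketch with made-up data:

```r
x <- c(2, 4, 4, 5, 7, 9, 11)        # illustrative data
n <- length(x)

var_emp <- mean((x - mean(x))^2)    # empirical variance, factor 1/n
sd_emp <- sqrt(var_emp)             # empirical standard deviation
var_sample <- var(x)                # sample variance, factor 1/(n-1)
cv <- sd_emp / mean(x)              # coefficient of variation

print(c(var_emp = var_emp, var_sample = var_sample, cv = cv))
```

The two variances differ only by the factor \((n-1)/n\), so the difference matters mainly for small samples.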

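Unimodal and multimodal distributions

Unimodal

A distribution with exactly one peak is unimodal.

Multimodal

The distribution has several distinct peaks.

This often arises when several subpopulations are pooled.

If there are exactly two peaks, the distribution is also called bimodal.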

Symmetry and skewness

Symmetry

A unimodal distribution is symmetric if it has an axis of symmetry.

In that case \(\bar{x} \approx x_{med} \approx x_{mod}\) holds.

Clearly asymmetric

Data concentrated on the left: left-steep or right-skewed

Data concentrated on the right: right-steep or left-skewed

Rules for determining skewness

[Figure: rules for determining symmetry and skewness]
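A commonly taught rule of thumb compares the three location measures: for a left-steep (right-skewed) distribution one typically finds \(\bar{x} > x_{med} > x_{mod}\), and the reverse order for a right-steep one. That this is the rule meant here is an assumption. The sketch below checks it on made-up left-steep data.

```r
x <- c(1, 1, 1, 2, 2, 3, 4, 8, 15)   # illustrative left-steep (right-skewed) data

x_bar <- mean(x)
x_med <- median(x)
# Mode: the value with the highest absolute frequency
x_mod <- as.numeric(names(which.max(table(x))))

print(c(mean = x_bar, median = x_med, mode = x_mod))
print(x_bar > x_med && x_med > x_mod)   # ordering expected for right-skewed data
```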

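Multivariate description

Several variables are observed on each statistical unit; only the case of two variables is considered here.

Joint frequencies

\[ h_{ij}=h(a_i, b_j) \]

They can be arranged in a table.

Joint relative frequencies are defined analogously:

\[ f_{ij}=f(a_i,b_j) \]

Conditional frequencies

\[ f_{Y}(b_j | a_i) \]

is the relative frequency of \(Y=b_j\) among the units with \(X=a_i\), i.e. \(f_{Y}(b_j | a_i)=h_{ij}/h_{i\cdot}\), where \(h_{i\cdot}\) is the row total for \(a_i\).

\(f_X(a_i | b_j)\) is defined analogously.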
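In R, joint and conditional frequencies can be sketched with `table()` and `prop.table()`. The two variables below are made-up illustration data.

```r
X <- c("a", "a", "a", "b", "b", "b", "b", "b")   # illustrative values of X
Y <- c("u", "u", "v", "u", "v", "v", "v", "v")   # illustrative values of Y

h <- table(X, Y)                           # joint absolute frequencies h_ij
f <- prop.table(h)                         # joint relative frequencies f_ij
f_Y_given_X <- prop.table(h, margin = 1)   # f_Y(b_j | a_i): each row sums to 1
f_X_given_Y <- prop.table(h, margin = 2)   # f_X(a_i | b_j): each column sums to 1

print(h)
print(f_Y_given_X)
```

The `margin` argument decides which variable is conditioned on: 1 for rows, 2 for columns.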

Two-dimensional charts

Two-dimensional bar chart

Scatter plots

Two-dimensional histograms

Correlation: measuring association

Monotonic correlation

  • same-sense relation: when \(X\) increases, \(Y\) increases as well
  • opposite-sense relation: when \(X\) increases, \(Y\) decreases

Applicable to metric and ordinal variables.

Functional correlation

  • linear relation
  • quadratic relationships

Applicable to metric variables.

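Contingency and the \(\chi^2\) coefficient

\(\chi^2\) coefficient

\[ \chi^{2}=\sum_{i=1}^{k} \sum_{j=1}^{m} \frac{\left(h_{i j}-\frac{h_{i \cdot} h_{\cdot j}}{n}\right)^{2}}{\frac{h_{i \cdot} h_{\cdot j}}{n}} \in[0, \infty) \]

Here \(h_{i\cdot}\) and \(h_{\cdot j}\) are the row and column totals. A \(\chi^2\) close to \(0\) indicates that the two variables are unrelated; the larger \(\chi^2\) is, the stronger the association.

Contingency coefficient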

\[ \begin{aligned} K &=\sqrt{\frac{\chi^{2}}{n+\chi^{2}}} \in\left[0, K_{\max }\right] \\ \text { where } K_{\max } &=\sqrt{\frac{M-1}{M}} \text { with } M=\min \{k, m\} \end{aligned} \]

Corrected contingency coefficient

\[ K^{*}=\frac{K}{K_{\max }} \in[0,1] \]

If \(K^*=0\), \(X\) and \(Y\) are unrelated; if \(K^*=1\), \(X\) and \(Y\) are related (maximal association).
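The three coefficients can be computed by hand from a contingency table. The sketch below uses a made-up 2 x 2 table and compares \(\chi^2\) against `chisq.test()` with the continuity correction switched off.

```r
# Illustrative contingency table of absolute frequencies h_ij
h <- matrix(c(20, 10, 10, 20), nrow = 2,
            dimnames = list(X = c("a1", "a2"), Y = c("b1", "b2")))
n <- sum(h)

# Frequencies expected under independence: h_i. * h_.j / n
expected <- outer(rowSums(h), colSums(h)) / n

chi2 <- sum((h - expected)^2 / expected)
K <- sqrt(chi2 / (n + chi2))      # contingency coefficient
M <- min(dim(h))                  # M = min{k, m}
K_max <- sqrt((M - 1) / M)
K_star <- K / K_max               # corrected contingency coefficient

# Cross-check against the built-in test statistic (no Yates correction)
chi2_check <- unname(chisq.test(h, correct = FALSE)$statistic)

print(c(chi2 = chi2, K = K, K_star = K_star))
```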

Empirical covariance

\[ \tilde{s}_{X Y}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) \cdot\left(y_{i}-\bar{y}\right) \]

Bravais-Pearson Correlation Coefficient

\[ r=r_{X Y}=\frac{\tilde{s}_{X Y}}{\tilde{s}_{X} \cdot \tilde{s}_{Y}}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) \cdot\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}} \]

A formula that is more convenient for computation:

\[ r=r_{X Y}=\frac{\sum_{i=1}^{n} x_{i} y_{i}-n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}\right)\left(\sum_{i=1}^{n} y_{i}^{2}-n \bar{y}^{2}\right)}} \]

This coefficient measures the strength of the linear relationship, with \(r \in [-1, 1]\).
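Both formulas can be checked against R's built-in `cor()`. The sketch uses the stork and birth figures that appear in the R examples later in these notes.

```r
x <- c(20, 30, 40, 50, 60, 70)   # storks per hectare
y <- c(13, 24, 43, 51, 57, 77)   # births per thousand inhabitants
n <- length(x)

# Empirical covariance with factor 1/n
cov_emp <- mean((x - mean(x)) * (y - mean(y)))

# Definition of r
r_def <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Computational formula
r_fast <- (sum(x * y) - n * mean(x) * mean(y)) /
  sqrt((sum(x^2) - n * mean(x)^2) * (sum(y^2) - n * mean(y)^2))

r_builtin <- cor(x, y, method = "pearson")

print(c(r_def = r_def, r_fast = r_fast, r_builtin = r_builtin))
```

The factor \(1/n\) (or \(1/(n-1)\)) cancels in the ratio, which is why `cor()` agrees with the formula even though `cov()` in R uses \(1/(n-1)\).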

Spearman's (Rank) Correlation Coefficient

Ranks

The rank \(\operatorname{rk}(x_i)\) is the position of \(x_i\) in the sorted list of values:

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text { Person } i & 1 & 2 & 3 & 4 & 5 \\ \hline \text { Element } x_{i} & 1.57 \mathrm{~m} & 1.70 \mathrm{~m} & 1.83 \mathrm{~m} & 1.65 \mathrm{~m} & 1.75 \mathrm{~m} \\ \hline \text { Rank } \operatorname{rk}\left(x_{i}\right) & 1 & 3 & 5 & 2 & 4 \\ \hline \end{array} \]

Spearman's coefficient is the Bravais-Pearson coefficient computed on the ranks:

\[ r_{S P}=\frac{\sum_{i=1}^{n}\left(\operatorname{rk}\left(x_{i}\right)-\overline{\mathrm{rk}}_{X}\right) \cdot\left(\operatorname{rk}\left(y_{i}\right)-\overline{\mathrm{rk}}_{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(\operatorname{rk}\left(x_{i}\right)-\overline{\mathrm{rk}}_{X}\right)^{2} \sum_{i=1}^{n}\left(\operatorname{rk}\left(y_{i}\right)-\overline{\mathrm{rk}}_{Y}\right)^{2}}} \]

A faster formula:

If there are no ties, i.e. \(x_i \ne x_j\) and \(y_i \ne y_j\) for all \(i \ne j\), let \(d_{i}=\operatorname{rk}\left(x_{i}\right)-\operatorname{rk}\left(y_{i}\right)\). Then

\[ r_{S P}=1-\frac{6 \sum_{i=1}^{n} d_{i}^{2}}{\left(n^{2}-1\right) n} \]

This coefficient measures the strength of the monotonic relationship.
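A small sketch of the rank computation and the shortcut formula in R. The heights are the ones from the rank table above; the second variable (weights) is hypothetical and was added only so that there is something to correlate with.

```r
x <- c(1.57, 1.70, 1.83, 1.65, 1.75)   # heights in m, from the rank table
y <- c(55, 68, 80, 70, 72)             # hypothetical weights in kg, no ties
n <- length(x)

rk_x <- rank(x)                        # 1 3 5 2 4, as in the table
rk_y <- rank(y)

d <- rk_x - rk_y
r_sp_fast <- 1 - 6 * sum(d^2) / ((n^2 - 1) * n)   # valid because there are no ties

r_sp_builtin <- cor(x, y, method = "spearman")
r_sp_ranks <- cor(rk_x, rk_y)          # Pearson coefficient on the ranks

print(c(fast = r_sp_fast, builtin = r_sp_builtin, on_ranks = r_sp_ranks))
```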

Linear regression

Describe the relationship between \(X\) and \(Y\) by a linear function

\[ Y=f(X)=\alpha + \beta X, \quad \alpha, \beta \in \mathbb{R} \]

Allowing for an error term \(\varepsilon_i\):

\[ y_{i}=f\left(x_{i}\right)+\varepsilon_{i}=\alpha+\beta x_{i}+\varepsilon_{i}, \quad i=1, \ldots, n \]

Method of Least Squares

With fitted values \(\hat{y}_{i}=\alpha+\beta x_{i}\) and residuals \(\varepsilon_{i}=y_{i}-\hat{y}_{i}\), minimise the squared deviations \(\varepsilon_{i}^{2}=\left(y_{i}-\hat{y}_{i}\right)^{2} \geq 0\):

\[ \min Q(\alpha, \beta)=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}=\frac{1}{n} \sum_{i=1}^{n}\left[y_{i}-\left(\alpha+\beta x_{i}\right)\right]^{2} \]

The minimum is found by setting the partial derivatives to \(0\):

\[ \begin{gathered} \frac{\partial Q(\alpha, \beta)}{\partial \alpha}=-\frac{2}{n} \sum_{i=1}^{n}\left[y_{i}-\left(\alpha+\beta x_{i}\right)\right]=0 \\ \frac{\partial Q(\alpha, \beta)}{\partial \beta}=-\frac{2}{n} \sum_{i=1}^{n}\left[y_{i}-\left(\alpha+\beta x_{i}\right)\right] \cdot x_{i}=0 \end{gathered} \]

which yields

\[ \begin{gathered} \hat{\alpha}=\bar{y}-\hat{\beta} \bar{x} \\ \hat{\beta}=\frac{\sum_{i=1}^{n} y_{i} x_{i}-n \bar{y} \bar{x}}{\sum_{i=1}^{n} x_{i}^{2}-n \bar{x}^{2}}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}=\frac{\tilde{s}_{X Y}}{\tilde{s}_{X}^{2}} \end{gathered} \]
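The closed-form estimates can be checked against R's `lm()`. The sketch again uses the stork and birth figures from the R examples in these notes.

```r
x <- c(20, 30, 40, 50, 60, 70)   # storks per hectare
y <- c(13, 24, 43, 51, 57, 77)   # births per thousand inhabitants

# Least-squares estimates from the formulas above
beta_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same fit with the built-in linear model function
fit <- lm(y ~ x)

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(fit))   # (Intercept) = alpha_hat, x = beta_hat
```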

Coefficient of Determination and Residual Analysis

Total dispersion (Sum of Squares Total)

\[ S Q T=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\left(=n \cdot \tilde{s}_{Y}^{2}\right) \]

Dispersion decomposition

\[ \begin{gathered} S Q T=S Q E+S Q R \\ \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}=\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}+\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} \end{gathered} \]

Coefficient of Determination

It measures the goodness of fit of the model:

\[ R^{2}=\frac{\text { explained dispersion }}{\text { total dispersion }}=\frac{S Q E}{S Q T}=\frac{\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \]

Equivalently:

\[ R^{2}=1-\frac{S Q R}{S Q T}=1-\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}} \]

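\(0 \le R^2 \le 1\): \(R^2=0\) means the regression explains none of the dispersion of \(Y\) (poor fit); \(R^2=1\) means all points lie exactly on the regression line (perfect fit). For simple linear regression, \(R^2=r_{XY}^2\).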
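The decomposition and both formulas for \(R^2\) can be verified numerically. A minimal sketch on the stork and birth figures used in the R examples of these notes:

```r
x <- c(20, 30, 40, 50, 60, 70)   # storks per hectare
y <- c(13, 24, 43, 51, 57, 77)   # births per thousand inhabitants

fit <- lm(y ~ x)
y_hat <- fitted(fit)             # fitted values on the regression line

SQT <- sum((y - mean(y))^2)      # total dispersion
SQE <- sum((y_hat - mean(y))^2)  # explained dispersion
SQR <- sum((y - y_hat)^2)        # residual dispersion

R2 <- SQE / SQT
R2_alt <- 1 - SQR / SQT

print(c(SQT = SQT, SQE = SQE, SQR = SQR, R2 = R2))
print(summary(fit)$r.squared)    # value reported by lm()
```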

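Probability Calculus

Statistical inference

Point estimation

The goal is to estimate an unknown parameter \(\theta\) of a distribution.

To this end, let \(X_1, \ldots, X_n\) be independent and identically distributed.

Estimator

An estimator (estimation function) for a parameter is

\[ T=g\left(X_{1}, \ldots, X_{n}\right) \]

The rest belongs to mathematical statistics and is omitted here.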

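The R language

Installation

Download and install R: https://cran.r-project.org/

Download RStudio: https://www.rstudio.com/

Installing additional packages

In the R console:

  • First choose a fast mirror

    chooseCRANmirror()
  • Then install the package

    install.packages("name", dependencies = TRUE)

Packages needed this semester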

  • ggplot2
  • lattice
  • robustHD
  • epade
  • vcd
  • gplots
  • TeachingDemos

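Basic syntax

Operators

+, -, *, /, ^ : addition, subtraction, multiplication, division (floating-point), exponentiation

Variable assignment

x <- 13
x = 13

Removing a variable

rm(x)

Functions

Built-in functions

  • sqrt(2): square root
  • sum(1,3,4): sum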

Data types

vector

Creating by type and length

Create a numeric vector of length 10

x <- vector(mode="numeric", length=10)

Creating with c()

List the values directly

c(1,0,2,4)

Indexing

x[10] # valid indices are 1 .. 10

Indexing past the end does not raise an error; it returns NA.

Naming the elements

x <- c(Monday=1,Tuesday=2)
names(x) # show the names
names(x) <- c("M", "T") # change the names

Matrices

Create a matrix with 3 rows and 2 columns, filled row by row with the values 10 to 15

a = matrix(nrow=3, ncol=2, data=10:15, byrow=TRUE)

Accessing elements

Element in row 3, column 2

a[3,2]

Row 3

a[3,]

Data set (data frame)

Age = c(21, 22, 23)
Gender = c("m", "w", "m")
x = data.frame(Age, Gender)

Row names can be added as well

names = c("Tom", "Kite", "Lisa")
x = data.frame(Age, Gender, row.names = names)

Extracting part of the data

subset

women = subset(x, x$Gender == "w")

split

a = split(x, x$Gender) # one data frame per value of Gender
a[1]
a[2]

Data sets

Changing the working directory

getwd() # show the current working directory
setwd("D:/Fyind") # set the working directory

Reading data

t = read.table(".../a.csv",sep=";",header = TRUE)
t = read.table(".../a.csv",sep=";",header = TRUE, row.names=1) # use column 1 as row names

Reading a table with read_csv

install.packages("tidyverse")
library(tidyverse)

LAozone = read_csv("LAozone.csv")

Writing data

write.table(x, "x.csv", sep=";")

Functions

Dimensions

dim(x)

Viewing the data table

View(x)

Selecting data

Age <- 0:120
Age <- seq(0, 120, 1) # sequence from 0 to 120 with step 1 (same as 0:120)
AgeGrouped <- cut(Age, breaks = c(0,13,19,65,120)) # intervals (0,13], (13,19], ...; the value 0 falls outside and becomes NA
AgeGrouped <- cut(Age, breaks = c(0,13,19,65,120), include.lowest = TRUE) # include the lowest value: [0,13]
AgeGrouped <- cut(Age, breaks = c(0,13,19,65,Inf), right = FALSE) # left-closed, right-open: [0,13), [13,19), ...
AgeGrouped <- cut(Age, breaks = c(0,13,19,65,Inf), right = FALSE, labels = c("Children","Teenagers","Adults","seniors")) # add labels (breaks must be unique)
  • Selecting certain columns
LAozone_small = LAozone[, c("ozone", "temp")] # selects the columns ozone and temp
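To see how the `cut()` calls above assign values at the class boundaries, here is a small sketch with a handful of made-up ages. The breaks and labels are the ones used above.

```r
ages <- c(5, 13, 18, 19, 40, 65, 90)   # illustrative ages, including boundary values

grouped <- cut(ages,
               breaks = c(0, 13, 19, 65, Inf),
               right = FALSE,           # intervals are [0,13), [13,19), [19,65), [65,Inf)
               labels = c("Children", "Teenagers", "Adults", "seniors"))

print(data.frame(ages, grouped))
print(table(grouped))                   # absolute frequencies per class
```

With `right = FALSE` a boundary value such as 13 or 65 belongs to the class that starts there, not to the one that ends there.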

Frequency table

table(x) # one variable
table(MunichGraduate$Finanzierungsquelle,MunichGraduate$Studiendauer) # two variables

Relative frequency table

prop.table(table(x))

Loading an extension package

library(lattice)

Computing statistics

Row with the maximum value

which.max(CoronaBavaria$Infektionen)
CoronaBavaria[59, ] # extract row 59

Mean

MeanCoronaCounty <- mean(CoronaCounty$Infektionen.pro.100.000.Einwohner)

Median

MedianCoronaBavaria <- median(CoronaBavaria$Infektionen.pro.100.000.Einwohner)

Five-number summary (summary() additionally reports the mean)

summary(CoronaCounty$Infektionen.pro.100.000.Einwohner)

Marginal sums

addmargins(table(MunichGraduate$Finanzierungsquelle,MunichGraduate$Studiendauer)) # add row and column totals

Adding a column: binning into 10 quantile groups

library(mltools)
CoronaCounty[,"InfectionsPer100000_grouped"] <- bin_data(CoronaCounty$Infektionen.pro.100.000.Einwohner,
                                                         bins=10, binType="quantile")

\(\chi^2\) statistic and contingency coefficient

library(vcd)
assocstats(table(CoronaCounty$Bundesland, CoronaCounty$InfectionsPer100000_grouped))

Bravais-Pearson correlation coefficient

storksPerHectare <- c(20, 30, 40, 50, 60, 70) # x axis
birthsPerThousand <- c(13, 24, 43, 51, 57, 77)  # y axis
cor(storksPerHectare, birthsPerThousand, method = "pearson")

Plotting

Histogram

library(lattice)
histogram(NewerFlats$wfl, breaks = seq(0,320,10),
          xlab = "Living space in sqm (class width = 10 sqm)",
          ylab = "Proportion in percentage", right = FALSE)

Box plot (five-number summary)

boxplot(CoronaCounty$Infektionen.pro.100.000.Einwohner, horizontal = TRUE,
        xlab = "Infections per 100 000 inhabitants",
        main = "Data set Corona County", range = 0)

Scatter plot

storksPerHectare <- c(20, 30, 40, 50, 60, 70) # x axis
birthsPerThousand <- c(13, 24, 43, 51, 57, 77)  # y axis
plot(storksPerHectare,birthsPerThousand,
     main = "Scatter plot Number of storks per hectare vs. Number of births per thousand inhabitants",
     xlab = "Number of storks per hectare",
     ylab = "Number of births per thousand inhabitants",
     pch = 19)

Plotting the normal distribution

curve(dnorm(x, 0, 1), from=-5, to=5, ylab = expression(varphi(x)))
par(mfrow=c(2,2)) # draw 4 plots in a 2 x 2 grid
curve(dnorm(x, 0, sqrt(0.25)), from=-5, to=5, ylab = expression(varphi(x)), main = expression(sigma^2==0.25))
curve(dnorm(x, 0, sqrt(1)), from=-5, to=5, ylab = expression(varphi(x)), main = expression(sigma^2==1.0))
curve(dnorm(x, 0, sqrt(2)), from=-5, to=5, ylab = expression(varphi(x)), main = expression(sigma^2==2.0))
curve(dnorm(x, 0, sqrt(5)), from=-5, to=5, ylab = expression(varphi(x)), main = expression(sigma^2==5.0))

Bar plot

ShoeSize <- c(40, 44, 42, 41, 44, 41, 43, 42, 43, 43, 43, 42, 42, 42, 41, 40, 41, 44, 39, 45)
barplot(table(ShoeSize), col = blues9, xlab = "ShoeSize", ylab = "Quantities (abs. Frequencies)")

Hypothesis tests

T-Test

t.test(ShoeSize, conf.level = 0.95)
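The object returned by `t.test()` can be inspected piece by piece. The sketch below reuses the ShoeSize data from the bar plot example; the null value `mu = 42` is a hypothetical choice made only for illustration (without it, `t.test()` tests against a mean of 0).

```r
ShoeSize <- c(40, 44, 42, 41, 44, 41, 43, 42, 43, 43,
              43, 42, 42, 42, 41, 40, 41, 44, 39, 45)

# One-sample t-test of H0: mean = 42 (hypothetical null value)
res <- t.test(ShoeSize, mu = 42, conf.level = 0.95)

print(res$statistic)   # t statistic
print(res$p.value)     # p-value of the two-sided test
print(res$conf.int)    # 95% confidence interval for the mean
print(res$estimate)    # sample mean
```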

Sigma-Test

library(TeachingDemos)
sigma.test(ShoeSize, conf.level = 0.95)