AI Series, Part 2: Mathematical Foundations - Probability and Statistics - Probability

taotao_2016 2019-09-01

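Probability

Probability, also called "likelihood", measures how likely a random event is to occur. A random event is one that may or may not occur under the same conditions. For example, when one item is drawn at random from a batch containing both good and defective items, "the item drawn is good" is a random event. Suppose a random phenomenon is tested and observed n times and event A occurs m times, so that its frequency is m/n. Over a large number of repeated trials, m/n usually gets closer and closer to some fixed constant. That constant is the probability of event A, usually written P(A).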
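The way the frequency m/n settles toward P(A) can be illustrated with a small simulation. This is only a sketch: the 90% share of good items (`P_GOOD`), the helper name `relative_frequency`, and the fixed seed are assumptions made for illustration, not values from the text.

```python
import random

def relative_frequency(p_good, n_trials, rng):
    """Draw n_trials items and return m/n, the fraction that were good."""
    m = sum(1 for _ in range(n_trials) if rng.random() < p_good)
    return m / n_trials

rng = random.Random(42)   # fixed seed so repeated runs give the same numbers
P_GOOD = 0.9              # hypothetical share of good items in the batch

# As n grows, m/n should drift toward P_GOOD
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(P_GOOD, n, rng))
```

With small n the frequency can be far from 0.9; with large n it typically stays close to it.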

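Basic definitions:

Experiment: an action whose result cannot be predicted in advance, such as tossing a coin.
Sample space: the set of all possible outcomes of an experiment. For a coin toss, the sample space has two outcomes.
Sample point: one possible outcome.
Event: a specific outcome (or set of outcomes) of an experiment, such as getting heads on a single coin toss.
Probability: a value between 0 and 1 that expresses how likely a specific event is. 0 means impossible and 1 means certain.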

Example: roll two dice, hoping for a total of 7.

The sample space consists of 36 sample points.

Six of those sample points, (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1), give a total of 7, so the probability of rolling a 7 is 6 / 36 = 0.167 (16.7%).

P(A) = 0.167
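The count of 6 favorable points out of 36 can be checked by enumerating the sample space directly. A minimal sketch, with variable names chosen here for illustration:

```python
from fractions import Fraction
from itertools import product

# Every ordered pair (first die, second die): 6 * 6 sample points
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the two faces add up to 7
event_a = [pair for pair in sample_space if sum(pair) == 7]

p_a = Fraction(len(event_a), len(sample_space))
print(len(sample_space), len(event_a), p_a, round(float(p_a), 3))
```

Using `Fraction` keeps the result exact (1/6) before it is rounded for display.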

Bias: sometimes the sample points in a sample space do not all have the same probability. We call this bias: one outcome is more likely than another. For example:

A is a sunny day, B is a cloudy day, and C is a rainy day.

P(A) = 0.6; P(B) = 0.3; P(C) = 0.1

P(A′) = 1 - P(A) = 1 - 0.6 = 0.4, where A′ (the complement of A) is any day that is not sunny.
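The complement rule can be compared against a biased simulation. This sketch assumes the three weather probabilities above; the dictionary name, sample size, and seed are illustrative choices.

```python
import random

weather = {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1}

# The probabilities of all sample points must add up to 1
assert abs(sum(weather.values()) - 1.0) < 1e-9

# Complement rule: P(A') = 1 - P(A)
p_not_sunny = 1 - weather["sunny"]

# Biased sampling: each outcome is drawn according to its weight
rng = random.Random(0)
days = rng.choices(list(weather), weights=list(weather.values()), k=100_000)
freq_not_sunny = sum(day != "sunny" for day in days) / len(days)

print(p_not_sunny, freq_not_sunny)
```

The simulated frequency of non-sunny days should land near 0.4, matching P(B) + P(C).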

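Conditional probability

Events fall into the following types:

Independent events
Dependent events
Mutually exclusive events

Independent events:

Take a coin toss. The sample space has two possible outcomes: heads and tails. The probability of heads and the probability of tails are both 1/2. Each toss is an independent event, meaning its result does not depend on the previous toss.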

%matplotlib inline
# (Jupyter magic; remove the line above when running as a plain script)
import random
from matplotlib import pyplot as plt

# Create a list with 2 elements (for heads and tails)
heads_tails = [0, 0]

# Loop through 10000 trials
trials = 10000
for _ in range(trials):
    # Get a random 0 or 1
    toss = random.randint(0, 1)
    # Increment the list element corresponding to the toss result
    heads_tails[toss] = heads_tails[toss] + 1

print(heads_tails)

# Show a pie chart of the results
plt.figure(figsize=(5, 5))
plt.pie(heads_tails, labels=['heads', 'tails'])
plt.legend()
plt.show()

[4994, 5006]

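What is the probability of getting heads three times in a row when a coin is tossed three times?

This is a combination of independent events, so the individual probabilities multiply.

P(A) = 1/2 * 1/2 * 1/2 = 0.125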
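The product rule can be cross-checked by listing every possible sequence of three tosses. A minimal sketch, with "H" and "T" as illustrative labels for heads and tails:

```python
from itertools import product

# All sequences of three tosses: 2 * 2 * 2 equally likely outcomes
outcomes = list(product("HT", repeat=3))

# Event A: heads on all three tosses
all_heads = [o for o in outcomes if o == ("H", "H", "H")]

p_enumerated = len(all_heads) / len(outcomes)
p_product = 0.5 ** 3   # multiply the three independent probabilities

print(len(outcomes), p_enumerated, p_product)
```

Both routes give the same value because each toss is independent of the others.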

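Dependent events:

Consider drawing cards from a deck of playing cards. Assuming a standard 52-card deck, the sample space has 26 black cards and 26 red cards.

Let A be drawing a black card and B be drawing a red card.

Draw one card:

P(A) = 26/52 = 0.5; P(B) = 26/52 = 0.5

Suppose the card drawn is black and is not put back. The probabilities of A and B for the next draw become:

P(A) = 25/51 ≈ 0.49; P(B) = 26/51 ≈ 0.51

Now consider this question: given that event A (drawing a black card) has occurred, what is the probability of drawing a red card? This is what is known as conditional probability.

P(B|A) = P(A ∩ B) / P(A) = (26/52 * 26/51) / (26/52) = 26/51 ≈ 0.51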
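The conditional probability of a red card after a black one can be worked out by counting ordered two-card draws. This sketch assumes a standard deck of 26 black and 26 red cards and drawing without replacement; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

deck = ["black"] * 26 + ["red"] * 26   # assumed standard 52-card deck

# All ordered (first, second) draws without replacement: 52 * 51 pairs
draws = list(permutations(range(len(deck)), 2))

# A: first card is black.  A and B: first is black, second is red.
n_a = sum(1 for i, j in draws if deck[i] == "black")
n_a_and_b = sum(1 for i, j in draws if deck[i] == "black" and deck[j] == "red")

p_a = Fraction(n_a, len(draws))
p_a_and_b = Fraction(n_a_and_b, len(draws))

# Definition of conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
print(p_a, p_a_and_b, p_b_given_a)
```

The result differs from the unconditional 26/52 because removing a black card changes what is left in the deck.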

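Mutually exclusive events:

Two events are mutually exclusive when they cannot both occur in the same experiment, so P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B). For example, a single coin toss cannot come up both heads and tails.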

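Binomial distribution:

The binomial distribution describes n repeated, independent Bernoulli trials. Each trial has only two possible outcomes, which are complementary; the trials are independent of one another; and the probability of the event stays the same in every trial. Such a series of trials is called an n-fold Bernoulli experiment. When the number of trials is 1, the binomial distribution reduces to the 0-1 (Bernoulli) distribution.

Binomial probability formula:

P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), where C(n, k) = n! / (k! * (n - k)!)

Here n is the number of trials, k is the number of successes, and p is the probability of success in a single trial.

For a binomial variable, we use this formula to compute the probability mass function.

Probability mass function (PMF)

The PMF gives the probability of a discrete random variable at each specific value. It differs from the probability density function in that the probability density function is defined for continuous random variables.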

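Example: airport security. We run 5 trials, and in each trial one passenger goes through security. Each passenger ends up in one of two states: searched or not searched. The probability that any one passenger is searched is 25%. What is the probability that exactly 3 of the 5 passengers are searched? Besides computing this with the formula, we can use scipy.stats.binom.pmf to look at the whole distribution.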
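The formula route mentioned above can be sketched with the standard library alone. The helper name `binomial_pmf` is an illustrative choice; it applies the usual binomial formula C(n, k) * p^k * (1 - p)^(n - k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.25

# Probability for each possible number of searched passengers
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4))

# The case asked about in the example: exactly 3 of 5 searched
p_three = binomial_pmf(3, n, p)
```

Here `p_three` works out to 10 * 0.25^3 * 0.75^2, a little under 9%.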

%matplotlib inline
from scipy.stats import binom
from matplotlib import pyplot as plt
import numpy as np

n = 5
p = 0.25
# Possible numbers of searched passengers: 0, 1, ..., n
x = np.arange(n + 1)

# PMF value for each k in x
prob = binom.pmf(x, n, p)

# Set up the graph
plt.xlabel('x')
plt.ylabel('Probability')
plt.bar(x, prob)
plt.show()

The chart shows that the distribution is right-skewed. The reason is simple: n is too small (and p = 0.25 is well below 0.5). Let's increase n to 100 and see.

%matplotlib inline
from scipy.stats import binom
from matplotlib import pyplot as plt
import numpy as np

n = 100
p = 0.25
# Possible numbers of searched passengers: 0, 1, ..., n
x = np.arange(n + 1)

# PMF value for each k in x
prob = binom.pmf(x, n, p)

# Set up the graph
plt.xlabel('x')
plt.ylabel('Probability')
plt.bar(x, prob)
plt.show()
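How much the skew shrinks between n = 5 and n = 100 can also be checked numerically. This sketch relies on the standard closed-form skewness of a binomial distribution, (1 - 2p) / sqrt(np(1 - p)), which is not part of the text above and is brought in here as an outside fact; the function name is illustrative.

```python
from math import sqrt

def binomial_skewness(n, p):
    """Closed-form skewness of a binomial distribution with parameters n, p."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

# A positive value means a longer right tail; closer to 0 means more symmetric
for n in (5, 100):
    print(n, round(binomial_skewness(n, 0.25), 3))
```

With p fixed at 0.25 the skewness stays positive but falls as n grows, which matches the more symmetric shape of the n = 100 chart.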

Expected value

μ = np

For the example above, the expected value is μ = 100 × 0.25 = 25, meaning that on average 25 out of 100 passengers are searched.

Variance and standard deviation

Variance:

σ² = np(1 - p)

Standard deviation:

σ = √(np(1 - p))

from scipy.stats import binom

n = 100
p = 0.25

# Mean, variance and standard deviation of the binomial distribution
print(binom.mean(n, p))
print(binom.var(n, p))
print(binom.std(n, p))

25.0

18.75

4.330127018922194
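The same three numbers can be recovered from the PMF itself, without the closed-form shortcuts, by summing over every possible k. A minimal sketch using only the standard library; the variable names are illustrative.

```python
from math import comb, sqrt

n, p = 100, 0.25

# PMF for every possible number of successes k = 0..n
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Mean: probability-weighted average of k
mean = sum(k * pk for k, pk in enumerate(pmf))
# Variance: probability-weighted average squared distance from the mean
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
std = sqrt(variance)

print(mean, variance, std)
```

Up to floating-point rounding, these agree with np and np(1 - p) for n = 100 and p = 0.25.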

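Hypothesis testing

Students have finished the academic year and are asked to rate one of their courses, Data Science, on a scale from -5 (terrible) to 5 (great). The course is taught online to many thousands of students, and here we take a random sample of 50 ratings.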

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(123)
# randint excludes the upper bound: 6 low, 38 middling and 6 high ratings
lo = np.random.randint(-5, -1, 6)
mid = np.random.randint(0, 3, 38)
hi = np.random.randint(4, 6, 6)
sample = np.append(lo, np.append(mid, hi))
print('Min:' + str(sample.min()))
print('Max:' + str(sample.max()))
print('Mean:' + str(sample.mean()))

plt.hist(sample)
plt.show()

Min:-5

Max:5

Mean:0.84

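The sample mean of 0.84 suggests that students feel positively about the course, that is, they like it. But this conclusion rests only on a random sample. What about the whole population? Let's define two hypotheses:

H0 (null hypothesis): the population mean of the ratings is less than or equal to 0, and our sample mean is above 0 only by chance in how the sample was selected.
H1 (alternative hypothesis): the population mean of the ratings is greater than 0, and our sample mean is above 0 because the sample correctly detected this trend.

H0 : μ ≤ 0

H1 : μ > 0

If H0 is true, the sampling distribution of the mean for samples of 50 would be approximately normal with a mean of 0.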

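import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Simulated distribution under H0: mean 0, standard deviation 1.15
pop = np.random.normal(0, 1.15, 100000)
plt.hist(pop, bins=100)
# Mark the mean with a dashed yellow line
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.show()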

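The t-test, also called Student's t-test, is mainly used when the sample size is small (for example n < 30) and the population is normally distributed with an unknown standard deviation σ. The t-test uses t-distribution theory to infer the probability that a difference arises, and so to judge whether the difference between two means is significant. The one-sample t-test checks whether the difference between a sample mean and a known population mean is significant. When the population is normally distributed, its standard deviation is unknown, and the sample size is less than 30, the statistic measuring the deviation of the sample mean from the population mean follows a t-distribution.

One-sample t-test formula:

t = (x̄ - μ) / (s / √n)

x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the sample size.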
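The one-sample t statistic, t = (sample mean - μ) / (s / sqrt(n)), can be computed by hand with the standard library. In this sketch the `ratings` list is made-up illustrative data, not the sample used in this article, and `one_sample_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """t statistic for testing a sample mean against a hypothesized mean mu."""
    n = len(sample)
    # stdev() is the sample standard deviation (divides by n - 1)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

ratings = [-3, -1, 0, 0, 1, 1, 1, 2, 2, 5]   # hypothetical ratings
t_value = one_sample_t(ratings, 0)
print(t_value)
```

A larger positive t means the sample mean sits further above the hypothesized mean, measured in units of the standard error s / sqrt(n).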

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# T-Test (uses `sample` and `pop` from the earlier cells)
t, p = stats.ttest_1samp(sample, 0)
# ttest_1samp is 2-tailed, so halve the resulting p-value to get a 1-tailed p-value
# (valid here because t is positive, i.e. in the direction of H1)
p1 = '%f' % (p/2)
print('t-statistic:' + str(t))
print('p-value:' + str(p1))

# Calculate a 90% confidence interval. 10% of the probability is outside this, 5% in each tail
ci = stats.norm.interval(0.90, 0, 1.15)
plt.hist(pop, bins=100)
# Show the hypothesized population mean
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
# Show the right-tail confidence interval threshold - 5% of probability is under the curve to the right of this
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
# Show the t-statistic - the p-value is the area under the curve to the right of this
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()

t-statistic:2.773584905660377

p-value:0.003911

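The yellow line marks the population mean under H0. The area under the curve to the right of the red line represents the significance level of 0.05 (5%). The area under the curve to the right of the magenta (purple) line represents the p-value.

p-value: the P value obtained from a statistical significance test. Conventionally, P < 0.05 is regarded as significant and P < 0.01 as highly significant. It is the probability of obtaining a result at least as extreme as the one observed if H0 were true, so a small value means the observed difference is unlikely to be due to sampling error alone.

So what is the conclusion?

The p-value is below the significance level of 0.05, so a sample mean this far above 0 would be very unlikely if H0 were true. We can therefore reasonably reject H0: the population mean is greater than 0, and ratings of the course are positive.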
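The meaning of the one-tailed p-value can be made concrete with a rough Monte Carlo check: draw many samples of 50 from a population where H0 holds and count how often their t statistic reaches the observed value. This is a sketch under stated assumptions: the null population is taken to be normal with mean 0 (the real ratings are discrete), the observed t is rounded to 2.7736, and the function names, simulation count, and seed are illustrative.

```python
import random
from math import sqrt

def t_statistic(values, mu):
    """One-sample t statistic, using the n - 1 sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - mu) / sqrt(var / n)

def simulated_one_tailed_p(t_observed, n, n_sims, rng):
    """Fraction of null samples whose t statistic is at least t_observed."""
    exceed = 0
    for _ in range(n_sims):
        # Sample of size n from a population where H0 (mean 0) is true
        null_sample = [rng.gauss(0, 1) for _ in range(n)]
        if t_statistic(null_sample, 0) >= t_observed:
            exceed += 1
    return exceed / n_sims

rng = random.Random(0)
p_sim = simulated_one_tailed_p(2.7736, n=50, n_sims=20_000, rng=rng)
print(p_sim)
```

The estimate is noisy, but it should fall in the same range as the one-tailed p-value reported above, well under 0.05.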

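Two-tailed test:

Let's redefine the earlier hypotheses.

H0 (null hypothesis): the population mean of the ratings is equal to 0, and our sample mean is above or below 0 only by chance in how the sample was selected.
H1 (alternative hypothesis): the population mean of the ratings is not equal to 0, and our sample mean, which is above 0, reflects this.

H0 : μ = 0

H1 : μ ≠ 0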

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# T-Test (uses `sample` and `pop` from the earlier cells)
t, p = stats.ttest_1samp(sample, 0)
print('t-statistic:' + str(t))
# ttest_1samp is 2-tailed, so the p-value is used as is
print('p-value:' + '%f' % p)

# Calculate a 95% confidence interval. 5% of the probability is outside this, 2.5% in each tail
ci = stats.norm.interval(0.95, 0, 1.15)
plt.hist(pop, bins=100)
# Show the hypothesized population mean
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
# Show the confidence interval thresholds - 5% of probability is under the curve outside these
plt.axvline(ci[0], color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
# Show the t-statistic thresholds - the p-value is the area under the curve outside these
plt.axvline(pop.mean() - t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()

t-statistic:2.773584905660377

p-value:0.007822

The two-tailed p-value is below 0.05, so we reject H0.

Two-sample tests:

These are not covered in detail here.
