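Advanced R, Part 3: Summarizing, Pivoting and Condensing Data
I. Row and column sums, means, and frequencies
rowSums, colSums, rowMeans and colMeans compute sums or means by row or by column; table treats numbers or strings as factors and counts how often each value occurs. All of them are straightforward: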
- > a <- array(rep(1:3, each=3), dim=c(3,3))
- > a
- [,1] [,2] [,3]
- [1,] 1 2 3
- [2,] 1 2 3
- [3,] 1 2 3
- > rowSums(a)
- [1] 6 6 6
- > colSums(a)
- [1] 3 6 9
- > table(a)
- a
- 1 2 3
- 3 3 3
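The listing above shows only the sum functions. As a minimal sketch, the mean functions behave the same way on an identical matrix (the name `m` is chosen here only for illustration):

```r
# Same 3x3 matrix as in the listing above: every row is 1 2 3
m <- array(rep(1:3, each = 3), dim = c(3, 3))

row_means <- rowMeans(m)   # mean of each row
col_means <- colMeans(m)   # mean of each column

print(row_means)  # 2 2 2
print(col_means)  # 1 2 3

# table() returns counts named by the distinct values it found
counts <- table(m)
print(names(counts))  # "1" "2" "3"
```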
For multidimensional arrays, rowSums, colSums, rowMeans and colMeans take a little more care. Their signatures are:
- colSums (x, na.rm = FALSE, dims = 1)
- rowSums (x, na.rm = FALSE, dims = 1)
- colMeans(x, na.rm = FALSE, dims = 1)
- rowMeans(x, na.rm = FALSE, dims = 1)
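Here dims is an integer that decides which dimensions count as rows and which as columns. For the row functions, dimensions 1:dims are kept as the rows and the sum or mean runs over dimensions dims+1 onward; for the col functions, the sum or mean runs over dimensions 1:dims and the remaining dimensions are kept: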
- > b <- array(rep(1:3, each=9), dim=c(3,3,3))
- > b
- , , 1
- [,1] [,2] [,3]
- [1,] 1 1 1
- [2,] 1 1 1
- [3,] 1 1 1
- , , 2
- [,1] [,2] [,3]
- [1,] 2 2 2
- [2,] 2 2 2
- [3,] 2 2 2
- , , 3
- [,1] [,2] [,3]
- [1,] 3 3 3
- [2,] 3 3 3
- [3,] 3 3 3
- > rowSums(b)
- [1] 18 18 18
- > rowSums(b,dims=1)
- [1] 18 18 18
- > rowSums(b,dims=2)
- [,1] [,2] [,3]
- [1,] 6 6 6
- [2,] 6 6 6
- [3,] 6 6 6
- > colSums(b)
- [,1] [,2] [,3]
- [1,] 3 6 9
- [2,] 3 6 9
- [3,] 3 6 9
- > colSums(b,dims=2)
- [1] 9 18 27
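One way to convince yourself of the dims rule is to compare single cells of the result with sums over explicit slices. The sketch below does that on an array built like b (called `arr` here, an illustrative name), and also shows the na.rm argument, which the listing above does not exercise:

```r
arr <- array(rep(1:3, each = 9), dim = c(3, 3, 3))

rs <- rowSums(arr, dims = 2)   # keeps dimensions 1:2, sums over dimension 3
cs <- colSums(arr, dims = 2)   # sums over dimensions 1:2, keeps dimension 3

print(rs[1, 1] == sum(arr[1, 1, ]))  # TRUE
print(cs[2] == sum(arr[, , 2]))      # TRUE

# na.rm = TRUE drops missing values before summing
arr[1, 1, 1] <- NA
print(rowSums(arr)[1])                # NA
print(rowSums(arr, na.rm = TRUE)[1])  # 17
```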
table counts how often each number occurs, and it works just as well on any other data that can be treated as a factor:
- > table(b)
- b
- 1 2 3
- 9 9 9
- > c <- sample(letters[1:5], 10, replace=TRUE)
- > c
- [1] "a" "c" "b" "d" "a" "e" "d" "e" "c" "a"
- > table(c)
- c
- a b c d e
- 3 1 2 2 2
With more than one argument, all arguments should have the same length, and the result is a frequency table over the combinations of factor levels:
- > g1 <- rep(letters[1:3], each=4)
- > g2 <- sample(LETTERS[1:3], 12, replace=TRUE)
- > table(g1, g2)
- g2
- g1 A B C
- a 0 3 1
- b 3 0 1
- c 1 1 2
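A two-way table ties back to the row and column functions from the start of this section: its row totals are simply the one-way counts of the first factor. A small sketch, with an arbitrary seed so that the random labels are repeatable (`grp` and `lab` are illustrative names):

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
grp <- rep(letters[1:3], each = 4)
lab <- sample(LETTERS[1:3], 12, replace = TRUE)

tab <- table(grp, lab)
print(tab)

# Row totals of the two-way table equal the one-way counts of grp
print(rowSums(tab))
print(sum(tab))  # 12, one count per observation
```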
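II. The apply family of functions
When we need more than sums, means and frequencies, the apply family can help. It includes apply, lapply, sapply, vapply, tapply and mapply. The easiest way to tell them apart is by the type of data they take and the type of value they return.
1. apply
Its usage is apply(X, MARGIN, FUN, ...). It works on arrays or matrices, and the type of its return value depends on the length of the result that FUN produces.
X is an array or matrix. MARGIN is the margin (dimension) along which the function is applied: MARGIN=1 is the first dimension (rows), 2 is the second (columns), and so on. FUN is the function to apply, and any named arguments for FUN can follow it. For example, the standard deviation by row or by column of the 3x3 matrix a from section I is: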
- > apply(a, MARGIN=1, FUN=sd)
- [1] 1 1 1
- > apply(a, MARGIN=2, FUN=sd)
- [1] 0 0 0
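The named arguments that follow FUN are handed on to FUN unchanged, and FUN may also be an anonymous function. A short sketch on a made-up matrix `m` containing one missing value:

```r
m <- matrix(c(1, 2, NA,
              4, 5, 6,
              7, 8, 9), nrow = 3, byrow = TRUE)

# na.rm = TRUE is not an argument of apply; it is passed on to mean
row_mean <- apply(m, MARGIN = 1, FUN = mean, na.rm = TRUE)
print(row_mean)  # 1.5 5.0 8.0

# An anonymous function as FUN: the spread (max minus min) of each column
col_width <- apply(m, MARGIN = 2, FUN = function(v) {
  max(v, na.rm = TRUE) - min(v, na.rm = TRUE)
})
print(col_width)  # 6 6 3
```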
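MARGIN may have length greater than 1 (several dimensions at once). If its length equals the number of dimensions of X, FUN only ever receives a single value, so the result is rarely meaningful and FUN may even return invalid values such as NA. Here b is the 3x3x3 array from section I: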
- > apply(b, MARGIN=3, FUN=sum)
- [1] 9 18 27
- > apply(b, MARGIN=1:2, FUN=sum)
- [,1] [,2] [,3]
- [1,] 6 6 6
- [2,] 6 6 6
- [3,] 6 6 6
- > apply(a, MARGIN=1:2, FUN=sd)
- [,1] [,2] [,3]
- [1,] NA NA NA
- [2,] NA NA NA
- [3,] NA NA NA
Functions such as sd, sum or mean return a vector of length 1 on each call, in which case the result of apply has as many dimensions as MARGIN has elements. If FUN instead returns a vector of length n on every call, the result of apply has dimensions c(n, dim(X)[MARGIN]):
- > a
- [,1] [,2] [,3]
- [1,] 1 2 3
- [2,] 1 2 3
- [3,] 1 2 3
- > apply(a, MARGIN=1, FUN=quantile, probs=seq(0,1, 0.25))
- [,1] [,2] [,3]
- 0% 1.0 1.0 1.0
- 25% 1.5 1.5 1.5
- 50% 2.0 2.0 2.0
- 75% 2.5 2.5 2.5
- 100% 3.0 3.0 3.0
- > apply(a, MARGIN=2, FUN=quantile, probs=seq(0,1, 0.25))
- [,1] [,2] [,3]
- 0% 1 2 3
- 25% 1 2 3
- 50% 1 2 3
- 75% 1 2 3
- 100% 1 2 3
If FUN returns values of different lengths, things get more complicated: apply returns a list.
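A minimal sketch of the list case, using unique as a FUN whose result length depends on the data (the matrix `m` is made up for the purpose):

```r
m <- matrix(c(1, 1, 1,
              1, 2, 2,
              1, 2, 3), nrow = 3, byrow = TRUE)

# Rows have 1, 2 and 3 distinct values, so the results cannot form a matrix
uniq <- apply(m, MARGIN = 1, FUN = unique)

print(is.list(uniq))  # TRUE
print(lengths(uniq))  # 1 2 3
```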
2. lapply, sapply and vapply
These three functions belong together. Their first two arguments are X and FUN; the remaining arguments are described in detail in R's help pages. They work on vectors or lists, applying FUN to every element, but differ in what they return. lapply is the basic prototype; sapply and vapply are both refinements of it.
2.1 lapply returns a list of the same length as X
- > scores <- list(YuWen=c(80,88,94,70), ShuXue=c(99,87,100,68,77))
- > lapply(scores, mean)
- $YuWen
- [1] 83
- $ShuXue
- [1] 86.2
- > lapply(scores, quantile, probs=c(0.5,0.7,0.9))
- $YuWen
- 50% 70% 90%
- 84.0 88.6 92.2
- $ShuXue
- 50% 70% 90%
- 87.0 96.6 99.6
2.2 sapply returns a "friendlier" result: when the results are regular, you get a vector, matrix or array
sapply is a simplified lapply: "simplify" means that the structure of the result is simplified to make later processing easier.
- > sapply(scores, mean)
- YuWen ShuXue
- 83.0 86.2
- > sapply(scores, quantile, probs=c(0.5,0.7,0.9))
- YuWen ShuXue
- 50% 84.0 87.0
- 70% 88.6 96.6
- 90% 92.2 99.6
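The simplification only happens when the individual results line up. The sketch below reuses the scores list and shows sapply quietly falling back to a list when the results have different lengths, which is the unpredictability to keep in mind:

```r
scores <- list(YuWen = c(80, 88, 94, 70), ShuXue = c(99, 87, 100, 68, 77))

# Equal-length results are simplified to a named numeric vector
best <- sapply(scores, max)
print(best)

# Unequal-length results: sapply returns a list, just like lapply
high <- sapply(scores, function(s) s[s > 85])
print(is.list(high))  # TRUE

# simplify = FALSE always keeps the lapply-style list
kept <- sapply(scores, max, simplify = FALSE)
print(is.list(kept))  # TRUE
```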
2.3 vapply: an sapply that type-checks the returned value
sapply's return value is more convenient than lapply's, but it is still hard to predict: in large-scale data processing, checking the type of the result afterwards is tedious and time-consuming. vapply adds a FUN.VALUE argument that checks the return value directly, which tends to be both faster and safer, because the shape of the result is known in advance. In the code below, rt.value sets the expected length and type of each result; if what FUN returns disagrees with rt.value in either length or type, vapply raises an error:
- > probs <- c(1:3/4)
- > rt.value <- c(0,0,0) # expect 3 numbers per element
- > vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
- YuWen ShuXue
- 25% 77.5 77
- 50% 84.0 87
- 75% 89.5 99
- > probs <- c(1:4/4)
- > vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
- Error in vapply(scores, quantile, FUN.VALUE = rt.value, probs = probs) :
- values must be length 3,
- but FUN(X[[1]]) result is length 4
- > rt.value <- c(0,0,0,0) # expect 4 numbers per element
- > vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
- YuWen ShuXue
- 25% 77.5 77
- 50% 84.0 87
- 75% 89.5 99
- 100% 94.0 100
- > rt.value <- c(0,0,0,'') # c() coerces this to a character vector of length 4
- > vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
- Error in vapply(scores, quantile, FUN.VALUE = rt.value, probs = probs) :
- values must be type 'character',
- but FUN(X[[1]]) result is type 'double'
FUN.VALUE is a required argument.
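In practice FUN.VALUE is often written as a template such as numeric(1) rather than a vector of zeros. The sketch below shows that form, the difference between sapply and vapply on empty input, and one possible way of catching a mismatch with tryCatch (the "mismatch" marker string is made up for this example):

```r
scores <- list(YuWen = c(80, 88, 94, 70), ShuXue = c(99, 87, 100, 68, 77))

# numeric(1) is a template meaning "one double per element"
means <- vapply(scores, mean, FUN.VALUE = numeric(1))
print(means)

# On empty input sapply gives an empty list, vapply an empty numeric vector
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))
print(class(empty_s))  # "list"
print(class(empty_v))  # "numeric"

# range returns 2 values, so the numeric(1) template is violated
res <- tryCatch(
  vapply(scores, range, FUN.VALUE = numeric(1)),
  error = function(e) "mismatch"
)
print(res)  # "mismatch"
```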
3. mapply
R's documentation describes mapply as a multivariate version of sapply, but its argument order differs from sapply's:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
mapply works on vectors or lists and calls FUN once for each set of corresponding elements. When the extra argument has length 1, the result is the same as sapply's; when it is longer, the arguments are taken element by element, in order, and shorter vectors are recycled:
- > sapply(X=1:4, FUN=rep, times=4)
- [,1] [,2] [,3] [,4]
- [1,] 1 2 3 4
- [2,] 1 2 3 4
- [3,] 1 2 3 4
- [4,] 1 2 3 4
- > mapply(rep, x = 1:4, times=4)
- [,1] [,2] [,3] [,4]
- [1,] 1 2 3 4
- [2,] 1 2 3 4
- [3,] 1 2 3 4
- [4,] 1 2 3 4
- > mapply(rep, x = 1:4, times=1:4)
- [[1]]
- [1] 1
- [[2]]
- [1] 2 2
- [[3]]
- [1] 3 3 3
- [[4]]
- [1] 4 4 4 4
- > mapply(rep, x = 1:4, times=1:2)
- [[1]]
- [1] 1
- [[2]]
- [1] 2 2
- [[3]]
- [1] 3
- [[4]]
- [1] 4 4
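The remaining arguments in the signature can be sketched in the same way: MoreArgs carries arguments that are passed unchanged to every call, and SIMPLIFY = FALSE keeps the result as a list. The anonymous function below is illustrative only:

```r
# Element-wise over two vectors: FUN(1, 4), FUN(2, 5), FUN(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
print(sums)  # 5 7 9

# MoreArgs: x = 0 is the same in every call, only times varies
padded <- mapply(rep, times = 1:3, MoreArgs = list(x = 0))
print(padded)

# SIMPLIFY = FALSE keeps a list even when all results have equal length
as_list <- mapply(rep, x = 1:2, times = 2, SIMPLIFY = FALSE)
print(is.list(as_list))  # TRUE
```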
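4. tapply and by
tapply can be seen as an extension of table: table counts frequencies for each combination of factor levels, whereas tapply applies an arbitrary function to each combination. Its usage is: tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X is the data to which the function is applied, usually a vector. INDEX is a factor (or a list of factors) and, as with table, it must have the same length as X.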
- > (x <- 1:10)
- [1] 1 2 3 4 5 6 7 8 9 10
- > (f <- gl(2,5, labels=c("CK", "T")))
- [1] CK CK CK CK CK T T T T T
- Levels: CK T
- > tapply(x, f, length) # with FUN = length the result resembles table
- CK T
- 5 5
- > table(f)
- f
- CK T
- 5 5
- > tapply(x, f, sum)
- CK T
- 15 40
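The listing uses a single factor. To illustrate factor combinations, the sketch below adds a second factor `g` (hypothetical, not part of the original example) and passes both as a list:

```r
x <- 1:10
f <- gl(2, 5, labels = c("CK", "T"))        # treatment factor from the listing
g <- factor(rep(c("r1", "r2"), times = 5))  # hypothetical second factor

# One cell per combination of levels: rows follow f, columns follow g
cell_sums <- tapply(x, list(f, g), sum)
print(cell_sums)
```

The result is a matrix rather than a vector, in the same layout that table(f, g) would produce for the counts.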
by is tapply applied to data frames, but its output is not very friendly. The following statement shows what it looks like:
- with(mtcars, by(mtcars, cyl, summary))
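If the printed by object is hard to work with, one common workaround (a sketch, not the only option) is to have FUN return a named vector and then bind the pieces together with do.call and rbind. The choice of the mpg and hp columns is arbitrary:

```r
# colMeans returns one named numeric vector per cylinder group
grp_means <- by(mtcars[, c("mpg", "hp")], mtcars$cyl, colMeans)

# Stack the per-group vectors into a matrix, one row per group
out <- do.call(rbind, grp_means)
print(out)
```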
III. The aggregate function
This function is quite powerful. It first splits the data into groups (by row), then applies a summary function to each group, and finally combines the results into a tidy table. It has three forms depending on the kind of object, for data frames (data.frame), formulas (formula) and time series (ts):
- aggregate(x, by, FUN, ..., simplify = TRUE)
- aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
- aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)
We will get to know it through the mtcars data set, a data frame of road-test results for different car models:
- > str(mtcars)
- 'data.frame': 32 obs. of 11 variables:
- $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
- $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
- $ disp: num 160 160 108 258 360 ...
- $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
- $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
- $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
- $ qsec: num 16.5 17 18.6 19.4 17 ...
- $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
- $ am : num 1 1 1 0 0 0 0 0 0 0 ...
- $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
- $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
First use attach to put the column names of mtcars on the search path, then use aggregate to compute means grouped by cyl (number of cylinders):
- > attach(mtcars)
- > aggregate(mtcars, by=list(cyl), FUN=mean)
- Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
- 1 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
- 2 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
- 3 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
The by argument can also contain several factors, in which case the result is a summary for every combination of factor levels:
- > aggregate(mtcars, by=list(cyl, gear), FUN=mean)
- Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
- 1 4 3 21.500 4 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 3 1.000000
- 2 6 3 19.750 6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 3 1.000000
- 3 8 3 15.050 8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3 3.083333
- 4 4 4 26.925 4 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 4 1.500000
- 5 6 4 19.750 6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4 4.000000
- 6 4 5 28.200 4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 5 2.000000
- 7 6 5 19.700 6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 5 6.000000
- 8 8 5 15.400 8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 5 6.000000
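The same grouping can be written without attach by referring to the columns through mtcars$, and naming the elements of the by list replaces the Group.1 and Group.2 headers with readable ones. A sketch restricted to two value columns, chosen arbitrarily:

```r
res <- aggregate(mtcars[, c("mpg", "hp")],
                 by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                 FUN = mean)

print(names(res))  # "cyl" "gear" "mpg" "hp"
print(res)
```

Selecting only the columns of interest also avoids averaging cyl and gear themselves, which the full-table call above does.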
A formula is a special kind of R object. With a formula argument, aggregate can summarize just some of the columns of a data frame:
- > aggregate(cbind(mpg,hp) ~ cyl+gear, data=mtcars, FUN=mean)
- cyl gear mpg hp
- 1 4 3 21.500 97.0000
- 2 6 3 19.750 107.5000
- 3 8 3 15.050 194.1667
- 4 4 4 26.925 76.0000
- 5 6 4 19.750 116.5000
- 6 4 5 28.200 102.0000
- 7 6 5 19.700 175.0000
- 8 8 5 15.400 299.5000
The formula cbind(mpg,hp) ~ cyl+gear above means: group by the combinations of cyl and gear, and operate on the data cbind(mpg,hp).
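With a single variable on each side, the formula form computes the same numbers as tapply and only differs in returning a data frame. A minimal sketch comparing the two:

```r
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
by_tapply  <- tapply(mtcars$mpg, mtcars$cyl, mean)

print(by_formula)
print(all.equal(by_formula$mpg, as.vector(by_tapply)))  # TRUE
```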
For aggregate on time series data, see R's help page for the function.
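For orientation only, here is a small sketch of the time-series form on made-up data: a monthly series with the values 1 to 24 is collapsed to quarterly totals by asking for a new frequency of 4.

```r
# Hypothetical monthly series: 24 months starting January 2020
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

# nfrequency = 4 groups every 3 months; FUN = sum adds them up
quarterly <- aggregate(monthly, nfrequency = 4, FUN = sum)
print(quarterly)
```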
Original link: http://helloxxxxxx.blog.163.com/blog/static/216015095201331610310847/?latestBlog