我最愛比速度了XDDD
新增dplyr跟plyr的比較
我來提供一個更快的,code如下:
N = 1e4
x <- data.frame(Freq=runif(N,0,1),Category=c("T","F"))
library(data.table)
library(plyr)
library(dplyr)
library(sqldf)
x_dt = data.table(x)
setkey(x_dt, Category)
Cate_group_dt = group_by(x_dt, Category)
Cate_group_df = group_by(x_dt, Category)
library(rbenchmark)
benchmark(data_table = x_dt[, sum(Freq),by = Category],
tapply = tapply(x$Freq, x$Category, FUN=sum),
sqldf = sqldf("SELECT Category, sum(Freq) FROM x
GROUP BY Category"),
plyr_dt = aggregate(Freq ~ Category, data = x_dt, FUN=sum),
plyr_dt2 = ddply(x_dt, .(Category), colwise(sum)),
plyr_df = aggregate(Freq ~ Category, data = x, FUN=sum),
plyr_df2 = ddply(x, .(Category), colwise(sum)),
dplyr_dt = summarise(Cate_group_dt, sum(Freq)),
dplyr_df = summarise(Cate_group_df, sum(Freq)),
replications = 100,
columns=c('test', 'replications', 'elapsed','relative',
'user.self'), order='relative')
Result for N = 1e4:
test replications elapsed relative user.self
2 tapply 100 0.07 1.000 0.08
1 data_table 100 0.10 1.429 0.09
8 dplyr_dt 100 0.14 2.000 0.14
9 dplyr_df 100 0.17 2.429 0.17
5 plyr_dt2 100 0.23 3.286 0.24
7 plyr_df2 100 0.23 3.286 0.23
3 sqldf 100 2.24 32.000 2.22
4 plyr_dt 100 2.61 37.286 2.60
6 plyr_df 100 3.88 55.429 3.88
# 這裡我多做幾次 tapply跟data.table其實有上有下,不分軒輊
Results for N = 1e6
test replications elapsed relative user.self
9 dplyr_df 20 0.51 1.000 0.46
1 data_table 20 0.52 1.020 0.48
8 dplyr_dt 20 0.52 1.020 0.43
2 tapply 20 0.99 1.941 0.87
5 plyr_dt2 20 1.54 3.020 1.32
7 plyr_df2 20 1.56 3.059 1.35
3 sqldf 20 39.51 77.471 37.68
4 plyr_dt 20 66.23 129.863 65.17
6 plyr_df 20 102.23 200.451 101.37
Results for N = 1e7
test replications elapsed relative user.self
5 dplyr_dt 20 4.83 1.000 4.03
6 dplyr_df 20 4.84 1.002 4.13
1 data_table 20 4.85 1.004 4.11
2 tapply 20 9.90 2.050 8.47
4 plyr_df2 20 15.90 3.292 12.98
3 plyr_dt2 20 16.01 3.315 13.01
資料不算太大時,data.table跟tapply算快,但是資料變大之後
dplyr跟data.table就差不多快。
接著測試如果有兩個group的話
N = 1e4
set.seed(100)
x <- data.frame(Freq=runif(N,0,1),Category=c("T","F"),
Category2 = sample(c("T","F"), N, replace = TRUE))
library(data.table)
library(plyr)
library(dplyr)
library(sqldf)
x_dt = data.table(x)
setkey(x_dt, Category, Category2)
Cate_group_dt = group_by(x_dt, Category, Category2)
Cate_group_df = group_by(x_dt, Category, Category2)
library(rbenchmark)
benchmark(data_table = x_dt[, sum(Freq),by = key(x_dt)],
tapply = tapply(x$Freq, list(x$Category, x$Category2), FUN=sum),
plyr_dt2 = ddply(x_dt, .(Category, Category2), colwise(sum)),
plyr_df2 = ddply(x, .(Category, Category2), colwise(sum)),
dplyr_dt = summarise(Cate_group_dt, sum(Freq)),
dplyr_df = summarise(Cate_group_df, sum(Freq)),
replications = 100,
columns=c('test', 'replications', 'elapsed','relative',
'user.self'),order='relative')
Result for N = 1e4:
test replications elapsed relative user.self
2 tapply 100 0.06 1.000 0.06
1 data_table 100 0.11 1.833 0.11
5 dplyr_dt 100 0.17 2.833 0.18
6 dplyr_df 100 0.17 2.833 0.17
4 plyr_df2 100 0.40 6.667 0.41
3 plyr_dt2 100 0.41 6.833 0.40
Result for N = 1e6:
test replications elapsed relative user.self
1 data_table 20 0.39 1.000 0.33
5 dplyr_dt 20 0.39 1.000 0.36
6 dplyr_df 20 0.40 1.026 0.36
2 tapply 20 1.13 2.897 1.03
3 plyr_dt2 20 3.52 9.026 2.97
4 plyr_df2 20 3.59 9.205 3.09
Result for N = 1e7:
test replications elapsed relative user.self
1 data_table 20 3.70 1.000 3.42
6 dplyr_df 20 3.73 1.008 3.33
5 dplyr_dt 20 3.75 1.014 3.22
2 tapply 20 11.89 3.214 10.37
3 plyr_dt2 20 35.99 9.727 29.56
4 plyr_df2 20 36.72 9.924 30.32
結果同一個類別的情況。
最後,測試結果得知data.table真的很快!!!!
而且dplyr也很快,還可以直接搭配data.frame,我覺得算是更方便
至於plyr的ddply,我個人覺得寫起來比較順手,可是速度就沒預期快了。
My environment: i7-3770K@4.3GHz 16G ram
※ 引述《kenshin528 (成立奧凶帝國!!)》之銘言:
: 測試結果:
: 當rows = 10,000時
: user system elapsed
: SQLDF 0.05 0.00 0.94
: TAPPLY 0.00 0.00 0.34
: