Re: [Question] bootstrap vs. random sample - yabt - PTT (批踢踢實業坊)

Author: yabt (痴心絕對) 2006-05-03 21:56:23

※ Quoting brooky (未夠班):
: Hi,
: This week we mentioned bootstrap as a sampling method.
: It was shown that in bootstrap "the test set will contain about 36.8% of
: the instances for a reasonably large dataset".
: Intuitively, I cannot see the difference between bootstrap and
: simply using a random sampling method.
: In other words, why don't we just choose 63.2% of the dataset?
: Any ideas?
: Thanks.
The following is just my guess; it may not be correct :)
From a statistical perspective, sampling without replacement is a form of
permutation sampling: drawing k samples without replacement is equivalent
to permuting the data and then taking the first k items.
Under such a scheme a problem arises, especially when the data set is
small: we cannot choose the same instance twice within one sampling
procedure. In other words, the samples we draw are not i.i.d. (they are no
longer independent of one another), even though the original data we draw
from are i.i.d.
Note: i.i.d. stands for independent and identically distributed.
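
The equivalence mentioned above (sampling k items without replacement vs.
permuting the data and keeping the first k) can be checked empirically. This
is only a rough sketch: the function names, the toy data set, the number of
trials, and the seeds are all my own assumptions for illustration.

```python
import random
from collections import Counter

def sample_by_permutation(data, k, rng):
    """Permute a copy of the data, then keep the first k items."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return tuple(shuffled[:k])

def sample_without_replacement(data, k, rng):
    """Draw k distinct items directly, without replacement."""
    return tuple(rng.sample(list(data), k))

def ordered_sample_frequencies(sampler, data, k, trials, seed):
    """Estimate how often each ordered k-tuple is produced by `sampler`."""
    rng = random.Random(seed)
    counts = Counter(sampler(data, k, rng) for _ in range(trials))
    return {sample: n / trials for sample, n in counts.items()}

# Hypothetical toy data set with 4 distinct instances.
data = ["a", "b", "c", "d"]
freq_perm = ordered_sample_frequencies(sample_by_permutation, data, 2, 60000, seed=1)
freq_direct = ordered_sample_frequencies(sample_without_replacement, data, 2, 60000, seed=2)

# There are 4 * 3 = 12 ordered pairs; both schemes should give each one a
# frequency close to 1/12, and neither can ever produce a repeated item.
for pair in sorted(freq_perm):
    print(pair, round(freq_perm[pair], 3), round(freq_direct.get(pair, 0.0), 3))
```

Both columns should hover around 0.083, and a pair such as ('a', 'a') never
shows up, which is exactly the restriction discussed above.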
For example, suppose we construct an i.i.d. data set and want to sample
from it to evaluate the data itself or the mining method we used.
Suppose also that the data set perfectly reflects the distribution of the
underlying concept we want to measure, say, flipping a coin with equal
probability of heads and tails.
To take an extreme example of a small data set, suppose we flip the coin 4
times and get 2 heads and 2 tails, which perfectly matches the underlying
probability.
If we wish to sample 2 instances from them, one might think that sampling
without replacement would suffice, but that is not the case. When we draw
the first one, the odds of heads to tails are 2:2 = 0.5:0.5, which follows
the distribution. However, when we then draw the second one, its
probability, given the first draw, is no longer 0.5:0.5.
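
The 4-flip example can be worked out exactly by enumerating every equally
likely ordered pair of draws. This is a small sketch under my own
assumptions (the helper names are hypothetical, and I compare against
sampling with replacement, which is what bootstrap uses).

```python
from fractions import Fraction
from itertools import permutations, product

coins = ["H", "H", "T", "T"]  # the 4 flips from the example: 2 heads, 2 tails

def second_is_head_given_first(draws, first):
    """P(second draw is H | first draw == first) over equally likely ordered draws."""
    matching = [d for d in draws if d[0] == first]
    heads = sum(1 for d in matching if d[1] == "H")
    return Fraction(heads, len(matching))

def second_is_head(draws):
    """Marginal P(second draw is H), ignoring what the first draw was."""
    return Fraction(sum(1 for d in draws if d[1] == "H"), len(draws))

# Enumerate index pairs so the two heads (and the two tails) count as
# distinct instances of the data set.
indices = range(len(coins))
without_repl = [(coins[i], coins[j]) for i, j in permutations(indices, 2)]
with_repl = [(coins[i], coins[j]) for i, j in product(indices, repeat=2)]

for first in ("H", "T"):
    print(first,
          second_is_head_given_first(without_repl, first),
          second_is_head_given_first(with_repl, first))
# prints:
# H 1/3 1/2
# T 2/3 1/2

print(second_is_head(without_repl), second_is_head(with_repl))
# prints: 1/2 1/2
```

Without replacement, the second draw is a head with probability 1/3 after a
head and 2/3 after a tail, so the two draws are dependent even though the
second draw on its own is still 1/2. With replacement, it stays 1/2
regardless of the first draw.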
