1.
I ran some experiments on whether the training data is sampled
randomly or not. The testing accuracy under different sampling
strategies for the training data is shown below:
* Choose the first <num> instances as training
  K  num=     10000    100000    520000
  1        0.738838  0.801072  0.971465
  3        0.708943  0.780355  0.976169
  5        0.691094  0.769242  0.976414
* Choose the first <num> instances as training
  K  num=     10000    100000    520000
  1        0.492297  0.627565  0.971497
  3        0.493723  0.628762  0.976169
  5        0.493100  0.628975  0.976414
* Choose training instances randomly
  K  num=     10000    100000    520000
  1        0.827837  0.946945  0.971497
  3        0.805481  0.936668  0.976169
  5        0.797040  0.926998  0.976414
I'd like to hear your opinions on this. You can also keep an
eye on your experiments in HW6 to see whether the same thing
happens there.
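For anyone who wants to try this, here is a rough sketch of the two kinds
of selection compared above: taking the first <num> instances in file
order versus drawing <num> instances at random. The function names, the
fixed seed, and the assumption that 520000 is the full training set are
mine, not part of the original experiments.

```python
import random

def first_n(total, num):
    """Indices of the first `num` instances, in file order."""
    return list(range(min(num, total)))

def random_n(total, num, seed=0):
    """`num` distinct indices drawn uniformly at random (no replacement)."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return sorted(rng.sample(range(total), min(num, total)))

# Assumption: 520000 is the size of the whole training set.
total = 520000
for num in (10000, 100000):
    picked = random_n(total, num)
    # A random sample spreads over the whole file, while first_n
    # never looks past index num - 1.
    print(num, first_n(total, num)[-1], picked[-1])
```

If the file is sorted or grouped in some way, the first <num> instances
may cover only part of the data, which is one possible reason for the
gap between the tables. When <num> equals the total, both functions
return the same set of indices.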
2.
ATLAS (Automatically Tuned Linear Algebra Software) is a package
that handles operations on vectors and matrices. It will tune
some parameters according to your machine when it is installed,
so that the execution speed is optimized.
http://math-atlas.sourceforge.net/
We have ATLAS installed on bsd7 (a workstation in CSIE), so you can
use this library directly. The URL below contains a sample that
multiplies two matrices.
http://www.csie.ntu.edu.tw/~r92010/courses/dm2005/hw5/
Remarkably, multiplying a 3000x10000 matrix by a 10000x3000 matrix
takes only 35 seconds on bsd7 (3.2GHz)!
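To put the timing quoted above in perspective, here is a quick
back-of-the-envelope calculation. It uses the conventional count of
2*m*k*n floating-point operations for an (m x k) by (k x n) product;
the variable names are only for illustration.

```python
# Dimensions and timing taken from the text above.
m, k, n = 3000, 10000, 3000
seconds = 35.0

# Conventional count: one multiply and one add per inner-product term.
flops = 2 * m * k * n
gflops = flops / seconds / 1e9

print(f"{flops:.2e} floating-point operations, about {gflops:.1f} GFLOP/s")
```

That is 1.8e11 operations, or roughly 5.1 GFLOP/s sustained.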
3.
In HW3 (or 4, 5), some of you mentioned that the first attribute is
the most important, and in HW5 most of you concluded that the
accuracy is lower after normalization. One conjecture is that the
1st attribute is divided by a larger number when normalized, so
it no longer dominates the others (?).
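The conjecture above can be illustrated with a small sketch for a
distance-based classifier. It assumes min-max normalization (each
attribute divided by its own range), which may not be the method used
in HW5, and the three rows of data are made up purely for illustration.

```python
def min_max_normalize(rows):
    """Rescale every attribute to [0, 1] by dividing by its range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by 0
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)]
            for row in rows]

def first_attr_share(a, b):
    """Fraction of the squared Euclidean distance due to attribute 0."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(sq)
    return sq[0] / total if total else 0.0

# Hypothetical data: the 1st attribute has a much larger scale.
rows = [[2000.0, 1.0, 3.0],
        [2600.0, 4.0, 1.0],
        [3500.0, 2.0, 5.0]]
scaled = min_max_normalize(rows)

print(first_attr_share(rows[0], rows[1]))      # before normalization
print(first_attr_share(scaled[0], scaled[1]))  # after normalization
```

On this toy data the 1st attribute accounts for almost all of the
distance before normalization and only a small part afterwards. If that
attribute really is the most informative one, this would be consistent
with the lower accuracy.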
Since you are using 1-R and Naive Bayes, I think it would also be
interesting to investigate the importance of attributes.
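One simple way to look at attribute importance with 1-R is to build
the single-attribute rule for each attribute and compare the resulting
accuracies. The sketch below is only an illustration: the function
names and the tiny data set are hypothetical, the accuracy is measured
on the training data, and numeric attributes would need to be
discretized first.

```python
from collections import Counter, defaultdict

def one_r_accuracy(rows, labels, attr):
    """Training accuracy of the 1-R rule built on one discrete attribute."""
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[attr]][label] += 1
    # Each attribute value predicts its majority class.
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

def rank_attributes(rows, labels):
    """(accuracy, attribute index) pairs, best attribute first."""
    scores = [(one_r_accuracy(rows, labels, a), a)
              for a in range(len(rows[0]))]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Hypothetical data: attribute 0 determines the class, attribute 1 does not.
rows = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]]
labels = [0, 0, 1, 1]
print(rank_attributes(rows, labels))
```

Here attribute 0 gets accuracy 1.0 and attribute 1 gets 0.5, so
attribute 0 is ranked first.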