Machine Learning
Homework 4/5
Due: March 30, 2004 (midnight)
For this assignment you will perform some analyses related to learning
theory and then use WEKA to experiment with some recent learning methods
discussed in class: support vector machines, instance-based learners, and
ensemble methods.
- Exercise 7.2 (page 227 of Mitchell's book). Show all work and justify
all answers.
- Exercise 7.5 (b) (page 228 of Mitchell's book). Show all work and justify
all answers.
- Using WEKA's Experimenter environment, perform the following
experiment. It is similar to the experiment performed in HW3, but note
the differences (marked with ***).
- For the Results Destination section, select ARFF file and provide a
file name in which to store the experimental results.
- For Experiment Type, choose 10-fold cross-validation and
classification.
- For Iteration Control, choose 1 iteration and
data sets first. (*** different from HW3)
- Select the following five datasets that come with WEKA:
contact-lenses, iris, labor, soybean and weather.
- Select the following classifiers with default parameter settings,
except as noted (*** different from HW3):
- bayes.NaiveBayes
- trees.j48.J48
- lazy.IBk
- lazy.IBk -K 3
- lazy.IBk -K 5
- functions.supportVector.SMO
- meta.AdaBoostM1 (with J48 as the base classifier)
- meta.Vote (with ConjunctiveRule, NaiveBayes, and J48 as the ensemble
classifiers)
- Run the experiment.
- Analyze the results: load the ARFF results file, select
"Percent_incorrect" as the comparison field, set the significance level to
0.05, select NaiveBayes as the test base, check the option to show standard
deviations, and perform the test. (*** different from HW3)
- Construct a table of classifiers vs. datasets, and in each entry,
enter the error and standard deviation of that classifier on that dataset
from the above experiment. Also, add an asterisk to the end of the entry
for each dataset/classifier pair in which the classifier significantly
outperforms NaiveBayes at the 0.05 level.
- For each dataset indicate which classifier has the best performance on
that dataset, where "best" is based on the error value alone, ignoring
standard deviation and significance.
- For the three versions of the IBk lazy learner, indicate which one had
the lowest error for each of the datasets.
- For the two ensemble learners (AdaBoostM1 and Vote), indicate which
one had the lowest error for each of the datasets.
- Email me (holder@cse.uta.edu)
your nicely formatted report (MS Word, PDF, or PostScript) containing the
information requested above.
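
The table and the "lowest error" comparisons requested above can be
assembled by hand from the Experimenter output. As an optional aid, the
following is a minimal Python sketch of that bookkeeping, assuming you have
already copied the error, standard deviation, and significance flag for
each dataset/classifier pair out of WEKA. The helper names (format_entry,
lowest_error) and the numbers in the example row are hypothetical
placeholders, not results from the experiment and not part of WEKA.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# One table cell: (percent incorrect, standard deviation, significantly better
# than NaiveBayes at the 0.05 level). Values are copied by hand from WEKA.
Entry = Tuple[float, float, bool]


def format_entry(error: float, std: float, beats_baseline: bool) -> str:
    """Render one cell as 'error (std)', with a trailing asterisk if flagged."""
    cell = f"{error:.2f} ({std:.2f})"
    return cell + "*" if beats_baseline else cell


def lowest_error(row: Dict[str, Entry],
                 subset: Optional[Iterable[str]] = None) -> List[str]:
    """Return the classifier(s) with the lowest error in one dataset's row.

    Only the error value is compared; standard deviation and significance
    are ignored. All tied classifiers are returned. If `subset` is given,
    only those classifiers are considered (e.g. the three IBk variants).
    """
    names = list(row) if subset is None else [n for n in subset if n in row]
    best = min(row[n][0] for n in names)
    return [n for n in names if row[n][0] == best]


# Hypothetical placeholder row for a single dataset (NOT real results).
example_row: Dict[str, Entry] = {
    "NaiveBayes": (10.0, 2.0, False),
    "IBk": (8.0, 3.0, True),
    "IBk -K 3": (8.0, 1.0, False),
    "IBk -K 5": (9.5, 1.5, False),
    "AdaBoostM1": (11.0, 2.0, False),
    "Vote": (12.0, 2.5, False),
}

if __name__ == "__main__":
    for name, (err, std, flag) in example_row.items():
        print(f"{name:12s} {format_entry(err, std, flag)}")
    print("Best overall:", lowest_error(example_row))
    print("Best IBk:", lowest_error(example_row, ["IBk", "IBk -K 3", "IBk -K 5"]))
    print("Best ensemble:", lowest_error(example_row, ["AdaBoostM1", "Vote"]))
```

Returning every tied classifier, rather than a single winner, is a design
choice in this sketch: on small datasets several classifiers may report the
same error, and the report should say so.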