Machine Learning

Homework 4/5

Due: March 30, 2004 (midnight)

For this assignment you will perform some analyses related to learning theory and then use WEKA to experiment with some recent learning methods discussed in class: support vector machines, instance-based learners, and ensemble methods.

  1. Exercise 7.2 (page 227 of Mitchell's book). Show all work and justify all answers.
  2. Exercise 7.5 (b) (page 228 of Mitchell's book). Show all work and justify all answers.
  3. Using WEKA's experimenter environment, perform the following experiment. This is similar to the experiment performed in HW3, but note the differences, which are marked (*** different from HW3).
    1. For the Results Destination section, select ARFF file and provide a file name in which to store the experimental results.
    2. For Experiment Type, choose 10-fold cross-validation and classification.
    3. For Iteration Control, choose 1 iteration and data sets first. (*** different from HW3)
    4. Select the following five datasets that come with WEKA: contact-lenses, iris, labor, soybean and weather.
    5. Select the following classifiers with default parameter settings, except as noted (*** different from HW3):
      • bayes.NaiveBayes
      • trees.j48.J48
      • lazy.IBk
      • lazy.IBk -K 3
      • lazy.IBk -K 5
      • functions.supportVector.SMO
      • meta.AdaBoostM1 (with J48 as the base classifier)
      • meta.Vote (with ConjunctiveRule, NaiveBayes, and J48 as the ensemble classifiers)
    6. Run the experiment.
    7. Analyze the results: load the ARFF results file, select "Percent_incorrect" as the comparison field, set the significance level to 0.05, select NaiveBayes as the test base, check the option to show standard deviations, and perform the test. (*** different from HW3)
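If you prefer to inspect the results file directly rather than through the Analyse panel, the sketch below shows one way to pull the per-fold "Percent_incorrect" values out of the Experimenter's ARFF output. The attribute names used here (Key_Dataset, Key_Scheme, Percent_incorrect) follow WEKA's usual result-file layout, but check your own file and adjust if they differ.

```python
# Minimal ARFF reader for the Experimenter's results file.
# Assumes the standard result attributes Key_Dataset, Key_Scheme,
# and Percent_incorrect; adjust the names if your file differs.

def read_arff(text):
    """Parse an ARFF string into (attribute_names, rows_of_strings)."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip blanks and ARFF comments
        low = line.lower()
        if low.startswith('@attribute'):
            names.append(line.split()[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([v.strip().strip("'") for v in line.split(',')])
    return names, rows

def percent_incorrect(text):
    """Map (dataset, scheme) -> list of Percent_incorrect values, one per fold."""
    names, rows = read_arff(text)
    d, s, p = (names.index(n) for n in
               ('Key_Dataset', 'Key_Scheme', 'Percent_incorrect'))
    out = {}
    for r in rows:
        out.setdefault((r[d], r[s]), []).append(float(r[p]))
    return out
```

With 10-fold cross-validation and one iteration, each (dataset, classifier) pair should yield ten values, whose mean and standard deviation are what the Analyse panel reports.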
  4. Construct a table of classifiers vs. datasets, and in each entry, enter the error and standard deviation of that classifier on that dataset from the above experiment. Also, add an asterisk to the end of the entry for each dataset/classifier pair for which the classifier outperforms NaiveBayes at the 0.05 level.
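WEKA's Analyse panel performs the significance test for you, but the idea behind the asterisks can be illustrated with a plain paired t-test over the ten fold errors. This is only a sketch of the underlying comparison (newer WEKA versions use a corrected resampled t-test, which differs in detail); the 2.262 constant is the standard two-tailed 0.05 critical value for 9 degrees of freedom.

```python
import math

def paired_t(errors_a, errors_b):
    """Paired t statistic for two classifiers' per-fold error rates."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        # All fold differences identical: infinite t unless they are zero.
        return math.copysign(math.inf, mean) if mean else 0.0
    return mean / math.sqrt(var / n)

# Two-tailed critical value for alpha = 0.05 with n - 1 = 9 degrees
# of freedom (10 folds), from a standard t table.
T_CRIT_9DF = 2.262

def significantly_better(errors_new, errors_base):
    """True if errors_new is lower than errors_base at the 0.05 level."""
    t = paired_t(errors_new, errors_base)
    return t < -T_CRIT_9DF  # negative t: the new classifier's error is lower
```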
  5. For each dataset indicate which classifier has the best performance on that dataset, where "best" is based on the error value alone, ignoring standard deviation and significance.
  6. For the three versions of the IBk lazy learner, indicate which one had the lowest error for each of the datasets.
  7. For the two ensemble learners (AdaBoostM1 and Vote), indicate which one had the lowest error for each of the datasets.
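The comparisons in items 5 through 7 all amount to picking the lowest mean error per dataset, optionally restricted to a subset of classifiers. A small sketch (the error values below are placeholders, not results; substitute the numbers from your own experiment):

```python
# Mean percent-incorrect per (dataset, classifier); the numbers here are
# placeholders only -- use the values from your own experiment.
errors = {
    'iris':  {'NaiveBayes': 4.7, 'J48': 5.3, 'IBk': 4.0},
    'labor': {'NaiveBayes': 8.3, 'J48': 21.0, 'IBk': 17.3},
}

def best_per_dataset(errors, classifiers=None):
    """Lowest-error classifier on each dataset, optionally restricted to
    a subset (e.g. the three IBk variants, or the two ensemble methods)."""
    out = {}
    for ds, by_clf in errors.items():
        pool = {c: e for c, e in by_clf.items()
                if classifiers is None or c in classifiers}
        out[ds] = min(pool, key=pool.get)
    return out
```

Passing a subset such as the three IBk variants answers item 6; passing the two ensemble learners answers item 7.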
  8. Email to me (holder@cse.uta.edu) your nicely-formatted report (MSWord, PDF or PostScript) containing the requested information referred to above.