Machine Learning

Homework 5

Due: October 26, 2007 (midnight)

No late homework will be accepted.

Total points: 55

For this assignment you will use Weka to experiment with some learning methods recently discussed in class: support vector machines, instance-based learning, and ensemble methods.

  1. (10 points) Using Weka's explorer environment, load the weather data in the Preprocess tab, choose the functions.SMO classifier (this is Weka's SVM classifier) in the Classify tab, choose "Use training set" as the Test option, and click Start to run.
    1. Describe the default kernel function used by the SMO classifier. Specifically, describe how to compute the value of the kernel function given two examples from the weather.arff file. Hint: Look at the output of the run to see how the attributes are transformed.
    2. Explain the meaning of the -E parameter to the default kernel function.
    3. Note that the run did not perfectly classify the training set. Increase the -E parameter of the default kernel in increments of 1.0 until the classifier does perfectly classify the training examples. What is the minimal integral value of the -E parameter for which the SMO classifier achieves zero error on the training set?
    4. Note that the original run (for -E 1) yielded a linear kernel, so the classifier was described as weights on the attributes; whereas, for an -E value greater than 1, the classifier is described as the support vectors and their weights. Describe how you would use this classifier (from part c) to classify a new instance.
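To check your hand computations for parts a and d, the polynomial kernel K(x, y) = (x · y)^E and the resulting support-vector decision function can be sketched as follows. This is a simplified Python illustration, not Weka's SMO code; the instance vectors, support-vector weights, and bias below are made up, not taken from weather.arff.

```python
# Sketch of a polynomial kernel and an SVM decision function.
# Illustration only -- not Weka's actual SMO implementation.

def poly_kernel(x, y, exponent=1.0):
    """K(x, y) = (x . y)^E.  With exponent (-E) = 1 this reduces to a
    plain dot product, i.e. a linear kernel."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot ** exponent

def svm_classify(x, support_vectors, weights, bias, exponent):
    """Classify x from support vectors s_i and their weights w_i
    (each w_i standing for alpha_i * y_i in the usual SVM notation):
        f(x) = sum_i w_i * K(s_i, x) + bias;  predict by the sign of f(x)."""
    score = sum(w * poly_kernel(s, x, exponent)
                for s, w in zip(support_vectors, weights))
    score += bias
    return 1 if score >= 0 else -1

# Hypothetical transformed (binarized/normalized) instances -- values made up:
s1, s2 = [1.0, 0.0, 0.5], [0.0, 1.0, 0.9]
print(poly_kernel(s1, s2, exponent=2.0))                        # 0.2025
print(svm_classify([1.0, 0.0, 0.7], [s1, s2], [0.8, -0.6], 0.1, 2.0))  # 1
```

The same `svm_classify` structure is what part d asks you to describe in words: evaluate the kernel between the new instance and each support vector, form the weighted sum plus the bias, and take the sign.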
  2. (10 points) Using Weka's experimenter environment, perform the following experiment. This is similar to the experiment performed in HW3, but note the differences.
    1. Choose a "New" experiment.
    2. For the Results Destination section, select ARFF file and provide a file name in which to store the experimental results.
    3. For Experiment Type, choose 10-fold cross-validation and classification.
    4. For Iteration Control, choose 1 iteration and data sets first. (*** different from HW3)
    5. Select the following four datasets that come with Weka: contact-lenses, iris, labor, and weather.
    6. Select the following classifiers with default parameter settings, except as noted (*** different from HW3):
      • bayes.NaiveBayes
      • trees.J48
      • lazy.IBk
      • lazy.IBk -K 3
      • lazy.IBk -K 5
      • functions.SMO (this is Weka's SVM classifier)
      • meta.AdaBoostM1 (with J48 as the base classifier)
      • meta.Vote (with ConjunctiveRule, NaiveBayes, and J48 as the ensemble classifiers; you will need to delete ZeroR from the list)
    7. Run the experiment.
    8. Analyze the results by loading the ARFF results file, selecting the following configuration, and performing the test.
      • Testing with: Paired T-Tester (corrected)
      • Comparison field: Percent_incorrect (NOTE: "incorrect", not "correct")
      • Significance: 0.05
      • Test base: bayes.NaiveBayes (*** different from HW3)
      • Show std. deviations: (checked)
  3. (20 points) Construct a table of classifiers vs. datasets, and in each entry, enter the error (percent incorrect) and standard deviation of that classifier on that dataset from the above experiment. Also, add an asterisk to the end of the entry for each dataset/classifier pair for which the classifier outperforms NaiveBayes at the 0.05 level.
  4. (15 points) Discuss your conclusions about the performance of these different classifiers on the different datasets, based both on the error values alone and on the standard deviations and significance results. Specifically, discuss which of the three IBk lazy learners performed best on each dataset, and which of the two ensemble learners (AdaBoostM1 and Vote) performed best on each dataset. Finally, discuss the relative performance of the IBk, ensemble, and SMO learners, and why you think they performed this way on these datasets.
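As a reminder of what the IBk learners are actually doing when you compare them, their core prediction step is k-nearest-neighbor classification. The following is a simplified sketch (Euclidean distance, unweighted majority vote, made-up toy data), not Weka's IBk implementation:

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label) pairs.  Predict the majority
    label among the k training instances nearest to query, analogous to
    IBk with -K = k and distance weighting left at its default (off)."""
    by_dist = sorted(train, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy two-attribute training data (values made up):
train = [([0.0, 0.0], "no"), ([0.1, 0.2], "no"), ([1.0, 1.0], "yes"),
         ([0.9, 1.1], "yes"), ([0.5, 0.6], "yes")]
print(knn_predict(train, [0.2, 0.1], k=1))   # no
print(knn_predict(train, [0.9, 1.0], k=3))   # yes
```

Varying k here mirrors the -K 1 / -K 3 / -K 5 comparison in the experiment: a larger k smooths out noisy neighbors but can blur small classes, which is worth keeping in mind when you interpret the per-dataset results.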
  5. Email me a zip file containing the following:
    1. Text file containing the raw output of the SMO run on the weather dataset using the default kernel in Problem 1.
    2. Text file containing the raw output of the SMO run on the weather dataset using the default kernel, but with the -E value determined in Problem 1c.
    3. Text file containing the raw output of the experiment performed in problem 2.
    4. Nicely-formatted report (MSWord, PDF or PostScript) containing:
      • Your answers from Problem 1.
      • Table from Problem 3.
      • Discussion from Problem 4.