Machine Learning

Homework 6

Due: November 12, 2010 (midnight)

For this assignment you will become familiar with kernel machines and ensembles.

  1. Using WEKA's explorer environment, load the loan.arff dataset from HW2/HW3 in the Preprocess tab. Choose the functions.SMO classifier (this is WEKA's SVM classifier) in the Classify tab and change the -E option on the PolyKernel to 2.0 (i.e., we will use (x.y)^2 as our kernel). Choose "Use training set" as the Test option, and click Start to run.
    1. Include the WEKA output in your report.
    2. The classifier consists of a set of arithmetic terms. Explain what each number/symbol in the first line of the classifier represents.
    3. Consider the test instance <Income=Medium, Debt=Medium, Education=MS>. Using the classifier learned above, compute the numeric result of the classifier and the predicted class for this instance. Show your work.
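As a rough illustration of the arithmetic in part 3, the sketch below evaluates a kernel classifier of the form sum_i c_i * (x_i . x)^2 + b in Python. The support vectors, coefficients, and bias are made-up placeholders, not values from the loan.arff output, and the function names are hypothetical. It assumes the test instance has already been encoded as a numeric vector in the same way as the support vectors printed by WEKA; which sign corresponds to which class should be read off the WEKA output.

```python
def poly_kernel(x, y, exponent=2.0):
    """Homogeneous polynomial kernel (x . y)^exponent."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** exponent


def decision_value(support_vectors, coefficients, bias, x, exponent=2.0):
    """Sum of coefficient * kernel(support vector, x) over all terms, plus bias.

    Each coefficient is assumed to carry its own sign, as does the bias.
    """
    total = sum(
        c * poly_kernel(sv, x, exponent)
        for sv, c in zip(support_vectors, coefficients)
    )
    return total + bias


# Placeholder values for illustration only (NOT taken from WEKA's output).
support_vectors = [[1, 0, 1], [0, 1, 1]]
coefficients = [0.5, -1.0]
bias = 0.25
x = [1, 1, 0]

value = decision_value(support_vectors, coefficients, bias, x)
print(value)  # -0.25: both kernel values are 1, so 0.5 - 1.0 + 0.25
```

The predicted class then follows from the sign of the result.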
  2. Using WEKA's experimenter environment, perform the following experiment.
    1. Choose a "New" experiment.
    2. Choose the default Experiment Type: 10-fold cross-validation and classification.
    3. Choose default Iteration Control: 10 repetitions and data sets first.
    4. Select the following datasets that come with WEKA: diabetes, labor, iris, and vote.
    5. Select the following classifiers with default parameter settings, except as noted:
      • bayes.NaiveBayes
      • trees.J48
      • lazy.IBk -K 3
      • functions.SMO
      • functions.SMO (PolyKernel with -E 2.0)
      • meta.AdaBoostM1 (with J48 as the base classifier)
    6. Run the experiment.
    7. Analyze the results using the following configuration, and perform the test.
      • Testing with: Paired T-Tester (corrected)
      • Comparison field: Percent_correct
      • Significance: 0.05
      • Test base: bayes.NaiveBayes
      • Show std. deviations: (checked)
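The corrected paired t-test selected in the analysis step is commonly described as the corrected resampled t-test, which inflates the variance term to account for the overlap between training sets across cross-validation runs. A minimal sketch follows, assuming that this is the statistic WEKA computes; the function name and the sample differences are hypothetical.

```python
import math
import statistics


def corrected_resampled_t(diffs, test_train_ratio):
    """t statistic for paired accuracy differences, one per run/fold.

    diffs: per-fold differences in Percent_correct between two classifiers.
    test_train_ratio: test-set size / training-set size (1/9 for 10-fold CV).
    """
    k = len(diffs)
    mean_d = statistics.mean(diffs)
    var_d = statistics.variance(diffs)  # sample variance (divides by k - 1)
    # An uncorrected paired t-test would use var_d / k here; the correction
    # adds test_train_ratio * var_d to that term.
    return mean_d / math.sqrt((1.0 / k + test_train_ratio) * var_d)


# Made-up differences for illustration only; 10 x 10-fold CV would give 100.
example_diffs = [1.0, 2.0, 3.0]
t_stat = corrected_resampled_t(example_diffs, 1.0 / 9.0)
print(round(t_stat, 6))  # 3.0
```

The statistic is compared against a t distribution with k - 1 degrees of freedom at the chosen significance level (0.05 here).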
  3. Construct a table of classifiers vs. datasets, and in each entry, enter the accuracy and standard deviation of that classifier on that dataset from the above experiment. Also, add an asterisk to the end of the entry for any dataset/classifier pair for which the classifier significantly outperforms NaiveBayes at the 0.05 level.
  4. Compare the performance of the different classifiers on the different datasets. Specifically, explain which classifier performs better on which datasets, and why. The "why" part should consider the characteristics of the data, the hypothesis space, and the learning algorithm.
  5. Email me a nicely formatted document containing your answers to the above questions.