Machine Learning

Homework 5

Due: October 30, 2009 (midnight)

No late homeworks will be accepted.

  1. Consider the hypothesis space H consisting of axis-aligned ellipses in two dimensions. The general form is (x-x0)^2/a^2 + (y-y0)^2/b^2 = 1, where the center of the ellipse is at (x0,y0). Instances inside the ellipse are classified as positive, and instances outside are negative.
    1. Assume x0=y0=0 and a and b are integers in the range [1,N]. Compute |H| in terms of N. Show your work.
    2. Derive an expression for the sample complexity necessary to PAC-learn C = H from the previous part. Compute this sample complexity when N = 100 so that we are 95% sure we will learn a hypothesis whose error is less than 5%. Show your work.
    3. Assume x0, y0, a and b are real numbers (a>0, b>0). Determine the VC dimension of H and prove your result.
    4. Derive an expression for the sample complexity necessary to PAC-learn C = H from the previous part. Compute this sample complexity so that we are 95% sure we will learn a hypothesis whose error is less than 5%. Show your work.
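
As a hedged aid for checking the arithmetic in the two sample-complexity parts, the sketch below evaluates two standard PAC bounds: m >= (1/epsilon)(ln|H| + ln(1/delta)) for a finite hypothesis space, and m >= (1/epsilon)(4 log2(2/delta) + 8 VC(H) log2(13/epsilon)) for a space of known VC dimension. These are the forms given in Mitchell's textbook; the course may expect a different form, so treat the choice of bound as an assumption. The function names are illustrative, and the inputs in the example calls are placeholders, not the answers for |H| or VC(H).

```python
import math


def finite_h_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    if h_size < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and 0 < epsilon, delta < 1")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


def vc_sample_bound(vc_dim, epsilon, delta):
    """Smallest integer m with
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon))."""
    if vc_dim < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need VC(H) >= 1 and 0 < epsilon, delta < 1")
    total = 4 * math.log2(2.0 / delta) + 8 * vc_dim * math.log2(13.0 / epsilon)
    return math.ceil(total / epsilon)


# Placeholder inputs only; substitute your own |H|, VC(H), epsilon, and delta.
print(finite_h_sample_bound(h_size=973, epsilon=0.1, delta=0.05))
print(vc_sample_bound(vc_dim=3, epsilon=0.1, delta=0.05))
```

Being "95% sure" corresponds to delta = 0.05, and "error less than 5%" corresponds to epsilon = 0.05.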
  2. Draw a three-dimensional plot showing the behavior of the mistake bound of the weighted majority algorithm for β = 1/2, where the number of prediction algorithms n ranges from 1 to 100, and the number of mistakes k made by the best algorithm ranges from 1 to 100. Briefly describe the observed effect of k and n on the mistake bound. Be sure to clearly label the axes of your plot.
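
One possible way to produce the plot is sketched below. It assumes the general weighted majority mistake bound (k log2(1/β) + log2 n) / log2(2/(1+β)), which for β = 1/2 is approximately 2.4(k + log2 n); confirm that this matches the form used in class. The function names are illustrative, and the plotting step assumes NumPy and matplotlib are installed.

```python
import math


def mistake_bound(k, n, beta=0.5):
    """Upper bound on weighted majority mistakes, given that the best of
    n prediction algorithms makes k mistakes and weights shrink by beta."""
    if k < 0 or n < 1 or not (0 < beta < 1):
        raise ValueError("need k >= 0, n >= 1, and 0 < beta < 1")
    numerator = k * math.log2(1.0 / beta) + math.log2(n)
    return numerator / math.log2(2.0 / (1.0 + beta))


def bound_grid(max_k=100, max_n=100, beta=0.5):
    """Rows are k = 1..max_k, columns are n = 1..max_n."""
    return [[mistake_bound(k, n, beta) for n in range(1, max_n + 1)]
            for k in range(1, max_k + 1)]


def plot_bound(max_k=100, max_n=100, beta=0.5):
    """Draw the bound as a labeled 3-D surface (needs NumPy and matplotlib)."""
    import numpy as np
    import matplotlib.pyplot as plt

    # meshgrid returns arrays of shape (max_k, max_n), matching bound_grid.
    n_mesh, k_mesh = np.meshgrid(np.arange(1, max_n + 1),
                                 np.arange(1, max_k + 1))
    z = np.array(bound_grid(max_k, max_n, beta))

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(n_mesh, k_mesh, z, cmap="viridis")
    ax.set_xlabel("n (number of prediction algorithms)")
    ax.set_ylabel("k (mistakes by best algorithm)")
    ax.set_zlabel("mistake bound")
    plt.show()


# plot_bound()  # uncomment to display the surface
```

Keeping the bound computation separate from the plotting makes it easy to print a few values of the grid and sanity-check them by hand before drawing the surface.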
  3. Using Weka's explorer environment, load the loan.arff dataset from HW2 in the Preprocess tab. Choose the functions.SMO classifier (this is Weka's SVM classifier) in the Classify tab and change the -E option on the PolyKernel to 2.0 (i.e., we will use (x.y)^2 as our kernel). Choose "Use training set" as the Test option, and click Start to run.
    1. Include the Weka output in your report.
    2. The classifier consists of a set of arithmetic terms. Explain what each number/symbol in the first line of the classifier represents.
    3. Consider the test instance <Income=Medium, Debt=Medium, Education=MS>. Using the classifier learned above, compute the numeric result of the classifier and the predicted class for this instance. Show your work.
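
The hand computation in the last part can be checked with a sketch like the following. It assumes the learned classifier has the usual kernel SVM form, a sum of coefficient * K(support vector, x) terms plus a constant, with K(u, v) = (u.v)^2, and that nominal attributes have been encoded as 0/1 indicator values. The coefficients, vectors, and constant below are made up for illustration and are not taken from loan.arff or from Weka's output; which sign corresponds to which class must also be read from your own output.

```python
def poly_kernel(u, v, exponent=2.0):
    """Polynomial kernel (u . v)^exponent, with no lower-order terms."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return dot ** exponent


def decision_value(terms, constant, x, exponent=2.0):
    """Sum of coefficient * K(support_vector, x) over all terms, plus constant.

    terms is a list of (coefficient, support_vector) pairs, where the
    coefficient already carries its sign.
    """
    total = sum(coef * poly_kernel(sv, x, exponent) for coef, sv in terms)
    return total + constant


# Hypothetical values for illustration only.
terms = [(1.0, [1, 0, 1]), (-0.5, [0, 1, 1])]
constant = -0.25
x = [1, 0, 1]  # hypothetical indicator encoding of a test instance

value = decision_value(terms, constant, x)
print(value)
```

With these made-up numbers the two kernel values are (2)^2 = 4 and (1)^2 = 1, so the result is 1.0*4 - 0.5*1 - 0.25 = 3.25.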
  4. Using Weka's experimenter environment, perform the following experiment.
    1. Choose a "New" experiment.
    2. For the Results Destination section, select ARFF file and provide a file name in which to store the experimental results.
    3. For Experiment Type, choose 10-fold cross-validation and classification.
    4. For Iteration Control, choose 1 iteration and data sets first.
    5. Select the following datasets that come with Weka: iris, segment-challenge, and soybean.
    6. Select the following classifiers with default parameter settings, except as noted:
      • bayes.NaiveBayes
      • trees.J48
      • lazy.IBk
      • lazy.IBk -K 3
      • lazy.IBk -K 5
      • functions.SMO
      • meta.AdaBoostM1 (with J48 as the base classifier)
    7. Run the experiment.
    8. Analyze the results by loading the ARFF results file, selecting the following configuration, and performing the test.
      • Testing with: Paired T-Tester (corrected)
      • Comparison field: Percent_incorrect (NOTE: "incorrect", not "correct")
      • Significance: 0.05
      • Test base: bayes.NaiveBayes
      • Show std. deviations: (checked)
  5. Construct a table of classifiers vs. datasets, and in each entry, enter the error and standard deviation of that classifier on that dataset from the above experiment. Also, add an asterisk to the end of the entry for any dataset/classifier pair for which the classifier significantly outperforms NaiveBayes at the 0.05 significance level.
  6. Discuss your conclusions about the performance of these different classifiers on the different datasets based on the error values, standard deviations, and significance results. Specifically, discuss the relative performance of the instance-based, ensemble-based and kernel-based learners and why you think they performed this way on these datasets.
  7. Email me (holder@eecs.wsu.edu) a nicely formatted document containing your answers to the above questions.