Machine Learning

Homework 5

Due: October 31, 2008 (midnight)

No late homework will be accepted.

  1. Consider the hypothesis space consisting of the disjunction of two intervals on an integer range from 0 to L. Specifically, H = {[a,b] ∨ [c,d] | 0 <= a,b,c,d <= L and a <= b and c <= d}. Instances are drawn from the set of integers from 0 to L, and an instance x is classified as positive if (a <= x <= b) or (c <= x <= d); otherwise, negative.
    1. Compute |H| based on L. Show your work.
    2. Determine the VC dimension d of H. This involves a proof showing that H can shatter some set of d instances, but no set of d+1 instances. Show your work.
    3. Assume L = 100. Compute the sample complexity (the number of training examples m needed to PAC-learn H) such that you are 90% certain that the learned hypothesis is at least 90% accurate (i.e., δ = 0.1 and ε = 0.1), using the sample complexity formula for a consistent learner based on |H|. Then compute the same sample complexity using the formula based on the VC dimension. Show your work.
    4. Suppose we generalize the hypothesis space to consist of a disjunction of at most k intervals on the real-number line. Compute |H| and VCdim(H) based on k. Show your work. Again, computing the VC dimension requires a proof.
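A brute-force check over small L can help validate your closed-form answers for Problem 1 before you scale up to L = 100. The sketch below is only an illustrative aid, not the solution: the helper names are made up, and the choice to count semantically distinct concepts (rather than syntactically distinct tuples (a,b,c,d)) is one possible reading of |H| -- be sure to state which reading you use in your answer.

```python
import math
from itertools import product

def count_distinct_concepts(L):
    """Brute-force count of semantically distinct concepts
    [a,b] v [c,d] on the instance space {0, ..., L}.
    (Counting syntactically distinct tuples (a,b,c,d) with
    a <= b and c <= d would give a different, larger number.)"""
    concepts = set()
    for a in range(L + 1):
        for b in range(a, L + 1):
            for c in range(L + 1):
                for d in range(c, L + 1):
                    concepts.add(frozenset(range(a, b + 1)) |
                                 frozenset(range(c, d + 1)))
    return len(concepts)

def shatters(points, L):
    """True iff every +/- labeling of `points` is realized by some
    hypothesis [a,b] v [c,d] -- a spot check for VC-dimension claims
    on small examples (exponential in len(points), so keep it tiny)."""
    intervals = [(a, b) for a in range(L + 1) for b in range(a, L + 1)]
    return all(
        any(all(((a <= p <= b) or (c <= p <= d)) == lab
                for p, lab in zip(points, labels))
            for a, b in intervals for c, d in intervals)
        for labels in product([False, True], repeat=len(points))
    )

def sample_complexity_consistent(size_H, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner."""
    return math.ceil((math.log(size_H) + math.log(1 / delta)) / eps)

print(count_distinct_concepts(4))
print(shatters([0, 1, 2, 3], 5))
print(sample_complexity_consistent(10000, 0.1, 0.1))
```

The `shatters` helper only verifies a particular point set; your proof must still argue that no set of d+1 instances can be shattered.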
  2. Using Weka's explorer environment, load the weather.nominal.arff data in the Preprocess tab. Choose the functions.SMO classifier (this is Weka's SVM classifier) in the Classify tab and change the -E option on the PolyKernel to 2.0 (i.e., we will use (x.y)^2 as our kernel). Choose "Use training set" as the Test option, and click Start to run.
    1. Include the Weka output in your report.
    2. The classifier consists of a set of arithmetic terms. Explain what each number/symbol in the first line of the classifier represents.
    3. Consider the test instance <outlook=sunny, temperature=cool, humidity=high, windy=FALSE>. Using the classifier learned above, determine the predicted class for this instance. Show your work.
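For part 3, the coefficients must come from your own Weka output (note that SMO binarizes nominal attributes and normalizes by default, so work with the transformed attribute values shown in the output), but the mechanics of the prediction are the same in any case: evaluate f(x) = Σ_i α_i y_i (x_i · x)^2 + b and predict by the sign. The sketch below uses entirely hypothetical support vectors and coefficients purely to illustrate the arithmetic:

```python
def svm_decision(support_vectors, alphas_y, b, x):
    """Evaluate f(x) = sum_i (alpha_i * y_i) * (x_i . x)^2 + b
    for a degree-2 polynomial kernel K(u, v) = (u . v)^2."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(ay * dot(sv, x) ** 2
               for sv, ay in zip(support_vectors, alphas_y)) + b

# Hypothetical numbers (NOT the weather.nominal model): two support
# vectors over three binarized attributes.
svs = [(1, 0, 1), (0, 1, 1)]
ays = [0.5, -0.5]   # alpha_i * y_i, hypothetical
b = -0.2            # bias term, hypothetical
x = (1, 1, 0)       # test instance after binarization
print(svm_decision(svs, ays, b, x))  # sign determines the predicted class
```

Your hand computation should mirror this: substitute the actual support vectors, coefficients, and bias from the Weka output, and show each kernel evaluation.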
  3. Using Weka's experimenter environment, perform the following experiment. This is similar to the experiment performed in HW3, but note differences.
    1. Choose a "New" experiment.
    2. For the Results Destination section, select ARFF file and provide a file name in which to store the experimental results.
    3. For Experiment Type, choose 10-fold cross-validation and classification.
    4. For Iteration Control, choose 1 iteration and data sets first. (*** different from HW3)
    5. Select the following four datasets that come with Weka: contact-lenses, iris, labor, and weather.
    6. Select the following classifiers with default parameter settings, except as noted (*** different from HW3):
      • rules.ConjunctiveRule
      • bayes.NaiveBayes
      • trees.J48
      • lazy.IBk
      • lazy.IBk -K 3
      • lazy.IBk -K 5
      • functions.SMO
      • meta.AdaBoostM1 (with J48 as the base classifier)
      • meta.Vote (with ConjunctiveRule, NaiveBayes, and J48 as the ensemble classifiers; you will need to delete ZeroR from the list)
    7. Run the experiment.
    8. Analyze the results by loading the ARFF results file, selecting the following configuration, and performing the test.
      • Testing with: Paired T-Tester (corrected)
      • Comparison field: Percent_incorrect (NOTE: "incorrect", not "correct")
      • Significance: 0.05
      • Test base: rules.ConjunctiveRule
      • Show std. deviations: (checked)
  4. Construct a table of classifiers vs. datasets, and in each entry, enter the error and standard deviation of that classifier on that dataset from the above experiment. Also, add an asterisk to the end of the entry for each dataset/classifier pair for which the classifier outperforms ConjunctiveRule at the 0.05 level.
  5. Discuss your conclusions about the performance of these different classifiers on the different datasets based on the error values, standard deviations, and significance results. Specifically discuss which of the three IBk lazy learners performed best for each dataset, and which of the two ensemble learners (AdaBoostM1 and Vote) performed best for each dataset. Finally, discuss the relative performance of the IBk, ensemble and SMO learners and why you think they performed this way on these datasets.
  6. Email me a nicely-formatted report containing the following:
    1. Solutions to Problem 1.
    2. Raw output of the SMO run on the weather.nominal dataset from Problem 2, an explanation of the classifier, and the computation of the prediction for the test instance.
    3. Table from Problem 4.
    4. Discussion from Problem 5.