CSE 6363 Fall 2001
Homework 2
Due: September 25, 2001 (midnight)

  1. A framework for comparing machine learning algorithms is available on gamma2 in the directory /public/cse/6363-501/code/ml2.0. See the README file for information on how to use the system. Currently, ml is set up to compare three different learning algorithms: a majority class learner, a version of the ID3 algorithm called DTL, and the C4.5 system.

  2. Included in the ml2.0 directory are the files dtl.h and dtl.c, which implement a version of the ID3 algorithm described in Table 3.1 of the textbook. DTL differs from this ID3 algorithm in that it accepts more than two target attribute values and handles continuous-valued attributes. Also, the attribute selection measure in DTL simply counts the number of correctly classified training examples instead of using the entropy-based measure. Your job is to implement a new version, called ID3, that uses the gain ratio measure described on page 74 of the textbook. You will need to modify the SelectBestAttribute procedure in dtl.c and add any necessary auxiliary functions.

  3. Generate a decision tree for the entire vote data set contained in the 6363-501/data directory using both C4.5 and your ID3 algorithm. Discuss the differences between the resulting trees and the reasons for them.

  4. For each of the six datasets in the 6363-501/data directory, use the ml program to run a 10-fold cross-validation on the Majority, DTL, C4.5, and your ID3 algorithms. A file-based interface to the C4.5 program has been provided in 6363-501/code/ml2.0/c45.c. Compile your results into six tables, each in the form shown below. For example, the ID3 - C4.5 column gives the mean and standard deviation of the per-fold differences between the two algorithms over the 10 runs. Also include the significance levels of the differences and the overall ANOVA significance for all four algorithms. We will discuss the meaning of these statistics later in the course.

    Domain     ID3             C4.5            ID3 - C4.5      Significance
    credit     0.33 +/- 0.05   0.22 +/- 0.07   0.11 +/- 0.06   0.05
    diabetes   ...

  5. Compare the different algorithms based on your tabulated results (i.e., which algorithm seems best and why).