Machine Learning

Homework 4

Due: October 17, 2008 (midnight)

No late homeworks will be accepted.

  1. Run the NaiveBayes classifier on the iris dataset. Use the training set as the test option. Include in your submission the printed results from WEKA.
  2. What type of distribution does WEKA's NaiveBayes classifier assume for continuous attributes?
  3. Redo questions 3 and 4 from HW3, but substitute NaiveBayes for ConjunctiveRule.
  4. Redo questions 6 and 7 from HW3, but substitute NaiveBayes for J48.
  5. The UCI ML Repository contains a spam database (Spambase), in which the emails have already been processed to extract word frequencies and other information.
    1. Download the data and convert it to WEKA's ARFF format.
    2. Describe the attributes used for the instances in this dataset.
    3. Run a 10-fold cross-validation test on the dataset using the NaiveBayes classifier and report the accuracy achieved.
    4. Compare the NaiveBayes classifier with another classifier we have used in WEKA (ConjunctiveRule, J48, or MultilayerPerceptron) using the evaluation techniques we have learned while using WEKA.
  6. Email me a zip file containing the following:
    1. Text file containing the raw output of the NaiveBayes run on the iris dataset.
    2. Text file containing the raw output of the experiment in question 3 above (the same kind of result as in HW3 question 3h).
    3. Raw threshold curve data for NaiveBayes and MultilayerPerceptron on the labor dataset (the two files saved as in step 6e of HW3).
    4. ARFF file for the Spambase dataset.
    5. Any supporting files for your comparison in 5d.
    6. Nicely-formatted report (MSWord, PDF or PostScript) containing:
      • Answer to question 2.
      • Table summarizing results of experiment in question 3.
      • Nicely-formatted plot of the two ROC curves.
      • Discussion of performance comparison based on the ROC curves.
      • Description of Spambase attributes (5b).
      • Results of experiment in 5c.
      • Comparison of NaiveBayes to the other learner on the Spambase dataset (5d).
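
For step 5a, the conversion to ARFF can be scripted. The following is only a minimal sketch, assuming the downloaded data is comma-separated with numeric attributes and the class label in the last column. The function name csv_to_arff, the attribute names, and the two sample rows are made up for illustration; the real attribute names should come from the dataset's documentation.

```python
import csv
import io


def csv_to_arff(rows, relation, attribute_names, class_values):
    """Build ARFF text from rows whose last column is the class label.

    Assumes every non-class attribute is numeric and that the names
    contain no spaces or special characters (those would need quoting).
    """
    lines = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} numeric")
    # Nominal class attribute, e.g. {0,1}
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for row in rows:
        if row:  # skip blank lines in the input
            lines.append(",".join(field.strip() for field in row))
    return "\n".join(lines) + "\n"


# Made-up two-row sample standing in for the downloaded data file.
sample = "0.0,0.64,1\n0.21,0.28,0\n"
rows = list(csv.reader(io.StringIO(sample)))
arff_text = csv_to_arff(rows, "spam_sample", ["freq_a", "freq_b"], ["0", "1"])
print(arff_text)
```

To convert the real file, read it with open(...) instead of io.StringIO and write arff_text to a file with the .arff extension. Declaring the class as a nominal attribute rather than numeric matters here, since a classifier such as NaiveBayes needs a nominal class.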