Machine Learning


Project due December 15, 2010

No late submissions will be accepted.

Intermediate deadlines:

Team Registration: November 5, 2010

Initial Entry: November 12, 2010

For the class project, you will form 1-2 person teams to participate in the 2010 UC San Diego Data Mining Contest, a machine learning challenge to predict new customers for an online retailer. Winners of the contest receive real money, but unfortunately the deadline to qualify for prize money was September 30. However, the data is still available, so we can use it to investigate different learning approaches. The main goal is for you to learn more about applying machine learning techniques to real problems. Below are the specific requirements for the class project.

  1. Read over the material on the contest website. We will be using the data from Task 1: E-commerce Customer Identification (Raw).
  2. You may choose to compete individually or as a two-person team. Some portion of the grading will be based on the difficulty of your approach and your team's ranking within the class, so I recommend you pair up. If you need help finding a teammate, let me know. Once you have your team finalized, you should follow the instructions on the website to register your team by November 5 and provide me with your team name and team members.
  3. Since the contest server has been closed, we will use our own test set to evaluate your solutions. I have taken the original training data and divided it up into new training and testing sets, where the new training set is 2/3 of the original training set, and the new testing set is 1/3 of the original training set. I have maintained the original 1:10 class distribution in both these sets. Your goal is to maximize accuracy (not AUC) on the test set. The assumption is that your classifier will predict 0 or 1 for each test example (not the probability that the example is a 1). These new training and testing files are available here.
  4. For your first entry, each team should submit to me the same entry; namely, predict 0 for each of the 42,955 test instances. This should result in an accuracy score of 0.9066. This first entry should be emailed to me by November 12.
  5. The majority of your effort on the project should involve designing, implementing and testing one or more machine learning approaches to achieve high accuracy on this challenge problem. NOTE: I understand that you have the correct answers for the test examples, but I ask that you pretend that you do not. I will rely on you not to include knowledge of the test examples in your learning approaches. In my evaluation of your project, I will check that your solution does not include such information, and I may test your solution on alternative training/testing sets.
  6. By December 15 you should email me the following:
    1. Report describing all your attempts, the methods used for each, enough detail on your best submission so that the results can be reproduced, and a general discussion of your experience (what worked, what didn't, why, and what would you try next).
    2. All code and instructions necessary for reproducing your best result from the training data. That is, we will need to be able to input the training data to your software and get out your best prediction file. You can assume we have WEKA, but there is no requirement that you use WEKA.
    3. Your best prediction file.
    4. For 2-person teams, each team member should send me a separate email describing each team member's contribution to the project. These emails will be considered confidential between me and you.
  7. Your project will be graded according to the following criteria.
    1. The difficulty, number and creativity of your solutions to the challenge.
    2. The relevance to machine learning of your approach(es) to the problem.
    3. Your team's ranking within the class based on the accuracy of your best solution.
    4. Your meeting the above intermediate deadlines.
    5. The relative contribution of each team member.
    6. The quality of your report based on presentation, coverage, detail and general discussion.
    7. The efficiency, understandability and correctness of your instructions and code for reproducing your best result.
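
As a rough illustration of requirements 3 and 4 above, the following Python sketch shows one way a class-preserving 2/3 to 1/3 split and the all-zeros baseline entry could be computed. It is only a sketch under stated assumptions: the function names `stratified_split` and `accuracy` are hypothetical, the labels are synthetic toy data with a 1:10 class ratio (not the contest data), and the procedure actually used to build the class training and testing files is not described here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Split indices into train/test so each class keeps its proportion.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def accuracy(y_true, y_pred):
    """Fraction of hard 0/1 predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic toy labels (an assumption, not contest data): 30 positives
# and 300 negatives, i.e. a 1:10 class ratio.
labels = [1] * 30 + [0] * 300

train_idx, test_idx = stratified_split(labels)
y_test = [labels[i] for i in test_idx]

# Baseline entry: predict 0 for every test instance.
baseline = [0] * len(y_test)
baseline_accuracy = accuracy(y_test, baseline)

print(len(train_idx), len(test_idx))
print(round(baseline_accuracy, 4))
```

On data this imbalanced, the all-zeros entry already scores close to the fraction of negative examples, which is why it serves as the reference point that any learned classifier should beat on accuracy.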