Machine Learning


Project due December 16, 2009

No late submissions will be accepted.

Intermediate deadlines:

Team Registration: November 6, 2009

Initial Entry: November 13, 2009

For the class project, you will form 1-2 person teams to participate in the 2009 UC San Diego Data Mining Contest, a machine learning challenge to predict anomalies in e-commerce transactions. The contest deadline has passed, so we will not be competing for the $4,000 prize money, but the contest server is still running. Besides, it's not about the money, right? The main goal is for you to learn more about applying machine learning techniques to real problems. Below are the specific requirements for the class project.

  1. Read over the material on the contest website. We will be attempting the "Hard" version.
  2. You may choose to compete individually or as a two-person team. Some portion of the grading will be based on the difficulty of your approach and your team's ranking within the class, so I recommend you pair up. If you need help finding a teammate, let me know. Once you have your team finalized, you should follow the instructions on the website to register your team by November 6 and provide me with your team name and team members.
  3. For your first entry, every team should submit the same predictions: 0.020000 (the probability of the positive class, based on the class distribution) for each of the 50,000 test instances. This should result in a lift score of 0.894. This first entry should be completed by November 13. Send me an email when your first entry appears on the leaderboard.
  4. The remainder of your effort on the project should involve designing, implementing and testing one or more machine learning approaches to achieve better performance on this challenge problem. We will be maintaining an up-to-date ranking of the teams based on their submissions to the challenge. So, if you make a submission that improves your current best score, let me know by email so that we can update the class ranking.
  5. By December 16, you should email me the following:
    1. Report describing all your attempts, the methods used for each, enough detail on your best submission so that the results can be reproduced, and a general discussion of your experience (what worked, what didn't, why, and what would you try next).
    2. All code and instructions necessary for reproducing your best result from the training data. That is, we will need to be able to input the training data to your software and get out your best prediction file. You can assume we have Weka, but there is no requirement that you use Weka.
    3. Your best prediction file successfully submitted to the challenge site.
    4. For 2-person teams, each team member should send me a separate email describing each team member's contribution to the project. These emails will be considered confidential between me and you.
  6. Your project will be graded according to the following criteria.
    1. The difficulty, number and creativity of your submissions to the challenge.
    2. The relevance to machine learning of your approach(es) to the problem.
    3. Your team's ranking within the class based on the score of your best successful submission.
    4. The relative contribution of each team member.
    5. Whether you met the intermediate deadlines above.
    6. The quality of your report based on presentation, coverage, detail and general discussion.
    7. The efficiency, understandability and correctness of your instructions and code for reproducing your best result.
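
The first entry described in requirement 3 can be generated with a few lines of code. The following is a minimal sketch in Python, not a required implementation. It assumes the submission is a plain-text file with one probability per line; check the contest instructions for the actual format. The function names and the label list are hypothetical and used only for illustration.

```python
from pathlib import Path

# Values taken from requirement 3 of this assignment.
BASELINE_PROBABILITY = 0.02
NUM_TEST_INSTANCES = 50_000


def class_prior(labels):
    """Return the fraction of positive (1) labels in a sequence of 0/1 labels."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for y in labels if y == 1) / len(labels)


def write_baseline_predictions(path, probability=BASELINE_PROBABILITY,
                               count=NUM_TEST_INSTANCES):
    """Write the same probability once per test instance.

    ASSUMPTION: one prediction per line, six decimal places. The real
    submission format is defined by the contest site, not by this sketch.
    """
    line = f"{probability:.6f}\n"
    Path(path).write_text(line * count)
    return count
```

For example, calling write_baseline_predictions("baseline.txt") (the file name is arbitrary) produces 50,000 lines that each read 0.020000. The class_prior helper shows where a value like 0.02 comes from: it is simply the share of positive examples among the training labels.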

Team Rankings

Rank  Team Name      Best Score
1     dragon1wsu     3.539
2     Albion         3.503
3     noSkynetHere   3.448
4     Seth           3.372
5     RAB            3.325
6     Dang_Nabbit    3.293
7     Learner        3.285
8     Halo           3.230
9     RobenGokcen    3.174
Last updated: December 17, 2009 at 12:20am