Project due December 15, 2010
No late submissions will be accepted.
Team Registration: November 5, 2010
Initial Entry: November 12, 2010
For the class project, you will form 1-2 person teams to participate in the
2010 UC San Diego Data Mining Contest, a
machine learning challenge to predict new customers for an online retailer.
Contest winners receive real prize money, but unfortunately the deadline
to qualify for prizes was September 30. The data is still
available, however, so we can use it to investigate different learning
approaches. The main goal is for you to learn more about applying machine
learning techniques to real problems. Below are the specific requirements for
the class project.
- Read over the material at mill.ucsd.edu.
We will be using the data from Task 1: E-commerce Customer Identification (Raw).
- You may choose to compete individually or as a two-person team.
Some portion of the grading will be based on the difficulty of your
approach and your team's ranking within the class, so I recommend
you pair up. If you need help finding a teammate, let me know. Once
you have your team finalized, you should follow the instructions on
the website to register your team by November 5 and provide
me with your team name and team members.
- Since the contest server has been closed, we will use our own test
set to evaluate your solutions. I have taken the original training data
and divided it into new training and testing sets: the new training set
is 2/3 of the original training data, and the new testing set is the
remaining 1/3. I have maintained the original 1:10 class distribution
in both these sets. Your goal is to maximize accuracy (not AUC) on the test
set. The assumption is that your classifier will predict 0 or 1 for each test
example (not the probability that the example is a 1). These new training and
testing files are available here.
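The stratified 2/3 train / 1/3 test split described above can be sketched with scikit-learn (my choice of tool here, not a course requirement; the actual contest files are not reproduced, so this example uses synthetic labels with the same approximate 1:10 class ratio):

```python
# Sketch of a stratified 2/3 train / 1/3 test split that preserves the
# class distribution, as described for the course's train/test files.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 33_000
X = rng.normal(size=(n, 5))        # placeholder features
y = rng.random(n) < 1 / 11         # roughly 1 positive per 10 negatives

# stratify=y keeps the positive/negative ratio (nearly) identical
# in both the training and testing portions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

print(len(X_tr), len(X_te))                          # 22000 11000
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # near-equal fractions
```

Note that your final predictions on the test portion must be hard 0/1 labels, since grading is by accuracy rather than AUC.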
- For your first entry, each team should submit the same entry to me;
namely, predict 0 for each of the 42,955 test instances.
This should result in an accuracy score of 0.9066. This first entry should
be emailed to me (email@example.com)
by November 12.
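As a concrete illustration, the all-zeros first entry can be generated with a few lines of Python (the output file name `entry1.txt` is my own placeholder, not a name from the assignment):

```python
# Generate the required first entry: predict 0 for each of the
# 42,955 test instances, one prediction per line.
N_TEST = 42_955
preds = ["0"] * N_TEST

# NOTE: "entry1.txt" is a placeholder file name chosen for this sketch.
with open("entry1.txt", "w") as f:
    f.write("\n".join(preds) + "\n")
```

Because this baseline predicts 0 everywhere, its accuracy is simply the fraction of test examples whose true label is 0, which is why it comes out to 0.9066.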
- The majority of your effort on the project should involve designing,
implementing and testing one or more machine learning approaches to
achieve high accuracy on this challenge problem. NOTE: I understand
that you have the correct answers for the test examples, but I ask that
you pretend that you do not. I will rely on
you not to include knowledge of the test examples in your learning
approaches. In my evaluation of your project, I will check that your
solution does not include such information, and I may test your solution
on alternative training/testing sets.
- By December 15 you should email to me
(firstname.lastname@example.org) the following:
- Report describing all your attempts, the methods used for each,
enough detail on your best submission so that the results can be reproduced,
and a general discussion of your experience (what worked, what didn't,
why, and what would you try next).
- All code and instructions necessary for reproducing your best result
from the training data. That is, we will need to be able to input the
training data to your software and get out your best prediction file. You
can assume we have WEKA, but there is no requirement that you use WEKA.
- Your best prediction file.
- For 2-person teams, each team member should send me a
separate email describing each team member's contribution
to the project. These emails will be considered confidential between me
and the sender.
- Your project will be graded according to the following criteria.
- The difficulty, number and creativity of your solutions to the challenge.
- The relevance to machine learning of your approach(es) to the problem.
- Your team's ranking within the class based on the accuracy of your
best prediction file.
- Your meeting the above intermediate deadlines.
- The relative contribution of each team member.
- The quality of your report based on presentation, coverage, detail
and general discussion.
- The efficiency, understandability and correctness of your instructions
and code for reproducing your best result.