Introduction to Data Science: CptS 483-04, Fall 2016


Schedule and Lecture Material

I will use the online portal OSBLE+ (https://plus.osble.org) for posting lecture materials, assignments, class-related announcements, etc., and for handling submissions. On this page, I will maintain an overview of the schedule as the course proceeds.

Legend:
DDS = Doing Data Science (O'Neil and Schutt)
ISLR = An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie and Tibshirani)
MLPP = Machine Learning: A Probabilistic Perspective (Murphy)
MMDS = Mining of Massive Datasets (Leskovec, Rajaraman, and Ullman)

Date Topic Details Comments
Mon, Aug 22 Course overview Motivation; Syllabus walk-through; Course work - Reading: syllabus
Wed, Aug 24 What is Data Science? Big Data and Data Science hype and getting past the hype; Why now; - Reading: Slides | Chap 1 of DDS.
Fri, Aug 26 What is Data Science? Part II Current landscape of perspectives; skill sets needed; data scientist in industry; data scientist in academia. - Reading: Slides | Chap 1 of DDS.
- Reading: Vasant Dhar's 2013 Communications of the ACM article.
- Pre-course survey out, due Aug 31.
Mon, Aug 29 Statistical Inference Processes and data; Populations and samples; Population/sample vis-a-vis Big Data - Reading: Slides | Chap 2 of DDS.
Wed, Aug 31 Lecture on R R Resources overview; Getting and installing R - Reading: R-resources document
Fri, Sep 2 Lecture on R R Basics - Reading: Slides | R scripts
- Note: There will also be Shira Broschat's tutorial 3--4:30 (also in Sloan 5).
Mon, Sep 5 No Class Labor Day
Wed, Sep 7 Lecture on R R Graphics - Reading: Slides | R scripts and datasets
Fri, Sep 9 Statistical modeling Probability distributions; fitting a model - Reading: Slides | Chap 2 of DDS.
Mon, Sep 12 Power-Law Distributions and Normal Distributions Properties of power laws; Generative models for power laws; Power-law distributions vs normal distributions. - Reading: Slides
- Further reading 1: Mitzenmacher's article on generative models (Internet Mathematics, 2004).
- Further reading 2: Lada Adamic's ranking tutorial on Zipf, Power-laws and Pareto.
Wed, Sep 14 Exploratory Data Analysis EDA: Approach, Tools, Philosophy. Contrast with Confirmatory Data Analysis. - Reading: Slides | Chap 2 of DDS.
Fri, Sep 16 The Data Science Process. Components of the Data Science Process and how they interrelate; Roles of the Data Scientist. - Reading: Slides | Chap 2 of DDS.
- Note: Assignment 2 is out. Due 9/23 by 6pm.
Mon, Sep 19 Machine Learning Overview Supervised and unsupervised learning; Examples; Real-world applications - Reading: Slides | Chap 1 of MLPP (posted)
Wed, Sep 21 Linear Regression Simple linear regression; least squares coefficient estimates; assessing the accuracy of the coefficient estimates; assessing the accuracy of the model - Reading: Slides | Chap 3 of ISLR.
Fri, Sep 23 Linear Regression Multiple linear regression; Qualitative (discrete-valued) predictors; interactions; nonlinear relationships - Reading: Slides | Chap 3 of ISLR.
Mon, Sep 26 Linear Regression Linear Regression Lab Session - Reading: R-script posted.
Wed, Sep 28 k-Nearest Neighbors General idea; KNN process; distance metrics; evaluation metrics. - Reading: Slides | DDS (pages 71--82)
Fri, Sep 30 k-Means Clustering (what, why, applications); k-means as a clustering method (how it works, properties and limitations) - Reading: Slides.
- Note: Assignment 3 went out. Due 10/09/2016 by 6pm.
Mon, Oct 3 Hierarchical clustering Hierarchical clustering idea, algorithms, examples. - Reading: Slides | Sec 10.3 of ISLR.
Wed, Oct 5 Principal Components Analysis What are principal components; computation of principal components; geometric interpretation; illustration; R-lab session. - Reading: Slides | Sec 10.2 of ISLR.
Fri, Oct 7 Status Review Review of topics; feedback on assignment 2; preview of upcoming topics
Mon, Oct 10 Data Wrangling I Data cleaning, data reshaping - Reading: Slides
- Note: Assignment 3 went out. Due 10/15/2016 by 6pm.
Wed, Oct 12 Data Wrangling II Data integration, data reduction - Reading: Slides
Fri, Oct 14 Data Wrangling Lab dplyr, tidyr - Reading: posted R codes.
Mon, Oct 17 Naive Bayes classifier Basic idea, the algorithm, examples of application - Reading: Slides | Chapter 4 of DDS.
Wed, Oct 19 Semester Project Set Up Description; Requirements - Reading: Project Description document.
- Note: project proposal went out. Due October 28.
Fri, Oct 21 Project Ideas discussion A set of 10 ideas presented; own proposal welcomed. - Reading: Project Ideas document.
Mon, Oct 24 Feature Generation Background (data science competitions, crowdsourcing); Feature generation general approaches (brainstorming, imagination, domain expertise). - Reading: Slides | Chap 7 of DDS
Wed, Oct 26 Feature Selection Filters; Wrappers (best subset selection, stepwise forward selection, stepwise backward selection) - Reading: Slides | Chapter 8 of ISLR
Fri, Oct 28 Decision Trees and Random Forests Decision trees; entropy; bagging; random forests; Decision trees in R. - Reading: Slides | Chapter 8 of ISLR
Mon, Oct 31 Recommendation Systems Motivation; collaborative filtering; ML algorithms - Reading: Slides | Chap 8 of DDS
Wed, Nov 2 Recommendation System II Dimensionality reduction: SVD and UV decomposition. - Reading: Slides | Chap 8 of DDS | (Optional: Chap 9 of MMDS)
Fri, Nov 4 Data visualization Telling a story with data; choosing tools to visualize data; visualizing patterns over time. - Reading: Slides
Mon, Nov 7 Course review for mid-term preparation - Reading: Study guide
Wed, Nov 9 Project consultation
Fri, Nov 11 No class Veterans Day
Mon, Nov 14 Mid-Term Exam
Wed, Nov 16 Data visualization II Visualizing proportions; visualizing relationships; visualizing text information - Reading: Slides
Fri, Nov 18 Social Network Analysis: Centrality Motivating examples for various centrality metrics - Reading: Slides
Nov 21 -- 25 Thanksgiving Break
Mon, Nov 28 Centrality II Formal metrics: degree centrality; eccentricity; closeness/transmission centrality; betweenness centrality; Katz index - Reading: Slides
Wed, Nov 30 Ethics and course wrap-up Look-back at topics; next-gen data scientists; a word on ethics. - Reading: Slides | Chap 16 of DDS.
Fri, Dec 2 Project Presentations 1. Abdu Sayed Chowdhury and Mukti Sharma
Predicting Emotions, Sentiments and Demographics from Tweets

2. Md Kamruzzaman and Siyang Li
Analysis of Crop Phenotypic Behavior

3. Yang Hu, Yang Zheng and Yuan Zhi
Accurate Recovery of Missing Values in PMU Measurements
Mon, Dec 5 Project Presentations 1. Zachary Allen, Carla De Lira and Siddhant Srivastava
Analysis and Visualization of Hashtags in Twitter Data

2. Ehdieh Khaledian and Anand Raghuraman
Using the Twitter API to capture Tweets

3. Gridhar Manoharan and Aditi Deepak Thuse
Analytics of Bank Marketing Data
Wed, Dec 7 Project Presentations 1. Keegan Caruso and Adam Skoog
Article Classification

2. Aidan Lancaster and Jared Meade
Article Classification

3. Insun Lee, Kim Nguyen and Chao Zeng
Food clustering
Fri, Dec 9 Project Presentations 1. Mohammad Hossein Namaki and Keyvan Sasani
Predicting Response Times of Top-k Graph Queries

2. Mario Migliacio and Nehemia Salo
Sport Analysis

3. Dustin Crossman and Kayl Coulston
Modeling and Analysis of Customer Review Data
Mon, Dec 12 Project Report Due by 2pm.