Syllabus of IE 451 - Applied Data Analysis

Department: Industrial Engineering

Credits: Bilkent 3, ECTS 5

Course Coordinator: Savaş Dayanık

Semester: 2019-2020 Fall

Contact Hours: 3 hours of lecture per week, 1 hour of Lab/Studio/Others per week

Textbook and Other Required Material:
• Required - Textbook: An Introduction to Statistical Learning, G. James, D. Witten, T. Hastie, R. Tibshirani, 2013, Springer

Catalog Description:
Introduction to exploratory data analysis, multivariate regression, semiparametric regression, scatterplot smoothing, linear mixed models, generalized linear models, recursive partitioning, and hidden Markov models through applications to real data sets using the statistical software R. Applications include consumer choice models, modeling the number of emergency room visits, building e-mail spam filters, detecting fraudulent transactions, and other examples from manufacturing and service systems that illustrate big data analytics.

Prerequisite(s): MATH 260

Assessment Methods:
No.  Type                         Count  Total Contribution (%)
1    In-class participation       1      5
2    Homework                     4      10
3    Quiz                         3      25
4    Midterm: Practical (skills)  1      25
5    Final: Practical (skills)    1      35

Minimum Requirements to Qualify for the Final Exam:
The weighted average of homework and quizzes should be at least 40%.
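
As a rough illustration of this requirement, the check can be sketched in a few lines of Python. The sketch assumes that "weighted average" means homework and quiz averages (each on a 0-100 scale) weighted by their contributions listed under Assessment Methods (10 and 25). The syllabus does not spell out the formula, so the function names and the weighting here are illustrative assumptions only.

```python
# Assumed weights, taken from the Assessment Methods table.
HOMEWORK_WEIGHT = 10
QUIZ_WEIGHT = 25
THRESHOLD = 40.0  # minimum weighted average, in percent


def weighted_average(homework_avg: float, quiz_avg: float) -> float:
    """Weighted average of homework and quiz scores (each 0-100).

    Assumption: each component is weighted by its total contribution.
    """
    total_weight = HOMEWORK_WEIGHT + QUIZ_WEIGHT
    return (HOMEWORK_WEIGHT * homework_avg + QUIZ_WEIGHT * quiz_avg) / total_weight


def qualifies_for_final(homework_avg: float, quiz_avg: float) -> bool:
    """True if the weighted average meets the 40% threshold."""
    return weighted_average(homework_avg, quiz_avg) >= THRESHOLD


if __name__ == "__main__":
    # Hypothetical student: 80% homework average, 20% quiz average.
    print(round(weighted_average(80, 20), 2), qualifies_for_final(80, 20))
```

Under this assumed weighting, quizzes count 2.5 times as much as homework, so a high homework average alone may not be enough to reach the threshold.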

Course Learning Outcomes:
• Recognize the properties of fundamental statistical models
  Assessment: Quiz; Midterm: Practical (skills); Final: Practical (skills)
• Explore features of a data set through graphics and summary statistics
  Assessment: Quiz; Midterm: Practical (skills); Final: Practical (skills)
• Fit a sensible statistical model to a given data set
  Assessment: Quiz; Midterm: Practical (skills); Final: Practical (skills)
• Take advantage of the bias-variance trade-off and avoid over-fitting
  Assessment: Quiz; Midterm: Practical (skills); Final: Practical (skills)
• Select the best model for a given data set
  Assessment: Quiz; Midterm: Practical (skills); Final: Practical (skills)

Weekly Syllabus:
1. Introduction to statistical learning and R, overview of regression and classification problems
2. Linear regression (Illustrations: effects of budgets allocated to TV, newspaper, and radio advertising on annual sales; prediction of credit card balance from income, limit, rating, age, number of cards, and education level)
3. Linear regression continued and k-nearest neighbour regression (Illustration: conjoint analysis from marketing science; how can you design a new product with a higher market penetration?)
4. Logistic regression (Illustration: loan default probability estimation from credit card balance, income, and occupation)
5. Multinomial and Poisson regressions (Illustration: would it have been possible to predict the Challenger disaster? https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster)
6. Linear discriminant analysis (Illustrations: revisit credit card default probability estimation and Challenger disaster)
7. Cross-validation, linear model selection, subset selection (Illustrations: which of the variables income, limit, rating, age, number of cards, and education level best explain the credit card balance or default probability? Is the logistic regression or the linear discriminant model better for predicting the loan default probability?)
8. Shrinkage methods, ridge regression and lasso (What if the number of predictors is large, comparable to the number of examples? Illustration: prediction of the salaries of baseball players from various measures of their performance in past games)
9. Polynomial regression, regression splines, smoothing splines (Illustrations: modeling the wage as a function of age, and the amount of pollutants in a residential area as a function of its distance from employment centers)
10. Local regression, generalized additive models for quantitative and categorical variables (Illustrations: revisit wage and pollutant examples)
11. Regression trees (Illustrations: predict the baseball player salaries, car-seat sales)
12. Classification trees (Illustrations: email spam filtering--when is an email message spam? Predict crime rate in a residential area)
13. Bagging, random forests, boosting (Illustrations: revisit baseball player salary, email spam, and crime-rate examples)
14. Principal component analysis, k-means and hierarchical clustering (Illustrations: handwritten digit recognition, clustering cancer cells according to microarray data, market-basket data)

Type of Course: Lecture

Teaching Methods: Lecture - Exercises - Assignment