Syllabus of IE 551 - Applied Statistics

Department: Industrial Engineering

Credits: Bilkent 3, ECTS 5

Course Coordinator: Savaş Dayanık

Semester: 2019-2020 Fall

Contact Hours: 3 hours of lecture per week, 1 hour of Lab/Studio/Others per week

Textbook and Other Required Material:
• Required - Textbook: Linear Models with R, Julian J. Faraway, Second Edition, CRC Press

Catalog Description:
Exploratory data analysis, kernel density estimation, multivariate regression, nonparametric and semiparametric regression, scatterplot smoothing, linear mixed models, logistic regression, recursive partitioning, ANOVA, ANCOVA, hidden Markov models, dynamic linear models, graphical models, principal component analysis. Applications on real datasets using statistical software.

Prerequisite(s): None

Assessment Methods:
No. | Type | Label | Count | Total Contribution (%)
1 | In-class participation | Participation | 1 | 5
2 | Homework | Homework | 7 | 15
3 | Midterm: Practical (skills) | Midterm | 1 | 40
4 | Final: Practical (skills) | Final | 1 | 40

Minimum Requirements to Qualify for the Final Exam:
None

Course Learning Outcomes:
Course Learning Outcome | Assessment
Fit a linear model, formulate and test various hypotheses about effects, and predict response variable values for given explanatory variable values. | Midterm, Final
Diagnose violations of linear model assumptions and take appropriate remedies. | Midterm, Final
Handle missing values appropriately. | Midterm, Final

Weekly Syllabus:
1. Data exploration using statistical and graphical summaries. Introduction to R, scatterplots, boxplots, pairs plots, conditional plots, smoothers, tidying data for the analyses to come.
2. What are linear models? Multiple regression is the best-known example of a linear model. However, linear models also serve as the building blocks for principal component regression, ridge regression, and smoothers like loess and spline regression. Principal component and ridge regression are useful when there are many predictors, which is quite common in this age of big data. Smoothers and spline regression are extremely useful for discovering nonlinear patterns in the data. We will apply these methods to data on wages, diabetes, cancer, factors driving the happiness of MBA students (statistics, of course; just kidding), teen gambling habits, and many others.
3. Estimation of linear model parameters. What is least squares estimation? What is maximum likelihood estimation? Geometric intuition behind the former. Valuable distributional information obtained by using the latter. Derivations of quantities fundamental to all of the analyses in the remainder of the course.
4. Inference for linear models. Does the size or elevation of an island, or its distance to the closest neighboring island, matter to the number of different species (Galapagos Islands)? What are the factors promoting prostate cancer? What drives teen gambling? What is important to MBA students, income or social life? Those are some examples of inferential questions.
5. Prediction from linear models. How many passengers are expected to fly over the next six months? How many people are expected to die from lung cancer next year? Those are some examples of prediction problems.
6. Diagnostics, problems with the predictors. Are there outliers? If yes, where are they? What can be done about them? Are there influential data points? Beware how significantly your conclusions (about cancer factors, for example) can change in the face of influential data points. Is there any meaningful pattern left over in the residuals?
7. Robust regression. If the normality assumption does not hold, extreme residuals are unavoidable. How can we reduce the influence of a few extreme cases on the selection of regression parameters? When does it make more sense to model quantiles of the response variable's distribution? How do we carry out quantile regression?
8. Problems with the error terms. Linear models assume that the response variable, given the predictors, has a symmetric distribution, ideally Normal. How can we check this (failing to add relevant variables may lead to wrong conclusions)? If we are certain about the lack of symmetry, we may use a Box-Cox transformation.
9. Transformation of variables. If the relationship between the response and an explanatory variable is nonlinear, we may use broken-stick regression, polynomial transformations, even splines, and finally additive models.
10. Model selection. Simple models can fail to pick up an important pattern, whereas complex models may treat every quirk in the data as a real pattern. That is known as the bias-variance trade-off. Often we will have many alternative models built for the same data, and we need to find the ones that strike the best balance. Sometimes we have a knob that we can turn continuously until this balance is reached (as in ridge and lasso).
11. Shrinkage methods: principal component regression, partial least squares. Does your model fit well, but none of the variables seem significant, or do the coefficients suggest counter-intuitive relationships? Perhaps your model suffers from multicollinearity. How do we check for it? How do we deal with it if it is there? If the explanatory variables are linearly related, then the power of significance tests decreases and confidence intervals for the effect sizes widen. These side effects can be curbed using shrinkage methods.
12. Shrinkage methods continued, with ridge and lasso regression.
13. Missing data, models with several categorical variables. A patient's data have everything except the blood pressure reading. Should we throw out the sample point altogether, or is it possible to fill in the missing measurement? How do we add categorical variables to a linear model? A more in-depth discussion follows this week.
14. A complete example: Insurance Redlining

Type of Course: Lecture

Teaching Methods: Lecture - Assignment