- Description: This is an introductory course in statistical learning, with an emphasis on regression and classification methods (supervised learning), and a pinch of exploratory methods (unsupervised learning).
- Instructor: Gaston Sanchez
- Lecture: 3 hours of lecture per week
- Lab: 2 hours of computer lab sessions
- Assignments: biweekly HW assignments
- Exams: one midterm exam and one final exam
- Primary textbook:
  - CSL: Concepts of Statistical Learning (by Sanchez and Marzban, 2019)
- Secondary texts:
  - ISL: An Introduction to Statistical Learning (by James et al, 2015)
  - ESL: The Elements of Statistical Learning (by Hastie et al, 2009)
- Prerequisites:
  - Statistical and probability theory (e.g. Stat 135)
  - Linear algebra (e.g. Math 54)
  - Multivariable calculus (e.g. Math 53)
  - Experience with some programming language (e.g. R, Python, MATLAB)
  - Recommended: discrete mathematics (e.g. Math 55)
- LMS: the specific learning resources of a given semester are shared in the Learning Management System (LMS) approved by Campus authorities (e.g. bCourses, Canvas)
- Policies:
ABOUT:
We begin with some preliminary concepts and an overview of Statistical Learning methods. Then we discuss the concept of data as a set of objects or individuals on which one or more variables have been measured. More specifically, we do this from a decidedly geometric perspective.
READING:
TOPICS:
- Preamble for PCA
- The duality of the data matrix: rows (individuals) and columns (variables)
- Common operations for individuals: the average individual, distance between individuals, multivariate dispersion, and inertia
- Common operations for variables: variables as vectors, length of a vector, vector and scalar projections, angle between vectors, mean, variance, covariance, and correlation (several of these operations appear in the sketch after this list)
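The following is a minimal sketch, in Python with NumPy (one of the languages listed in the prerequisites), of a few of the row-wise and column-wise operations above; the data are simulated and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))              # toy data matrix: 10 individuals, 3 variables

# operations on individuals (rows)
g = X.mean(axis=0)                        # the "average individual" (centroid)
d01 = np.linalg.norm(X[0] - X[1])         # Euclidean distance between individuals 0 and 1
inertia = np.mean(np.sum((X - g) ** 2, axis=1))   # average squared distance to the centroid,
                                                  # one common measure of multivariate dispersion

# operations on variables (columns seen as vectors)
x1, x2 = X[:, 0], X[:, 1]
cos_angle = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))   # cosine of the angle
corr = np.corrcoef(x1, x2)[0, 1]          # correlation: cosine of the angle between centered variables

print(g, d01, inertia, cos_angle, corr)
```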
ABOUT:
Principal Components Analysis is one of the unsupervised learning topics of this course. Simply put, PCA allows us to study the systematic structure of a data set (of quantitative variables). Although PCA can be approached from multiple angles, we focus on its geometric perspective.
READING:
TOPICS:
- Fundamentals of PCA
- PCA from three perspectives: projected inertia, maximized variance, data decomposition
- PCA solution with EVD of cross-products X'X and XX' (see the code sketch after this list)
- Application of PCA (anatomy of PCA solution)
- How many components to retain?
- How can a component be interpreted?
- What visualizations can be obtained, and how to read them?
- Some practical considerations
- Digression on Matrix Decompositions
- Matrix decompositions: Eigenvalue Decomposition (EVD)
- Singular Value Decomposition (SVD) and lower rank approximations
- Relationship between EVD and SVD
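As a complement to the last three items, here is a minimal sketch (Python/NumPy, toy data) showing that the PCA solution obtained from the EVD of the cross-product X'X of a centered matrix matches the one obtained from the SVD of X, with eigenvalues equal to squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy data: 100 individuals, 4 variables
Xc = X - X.mean(axis=0)                # mean-center each column

# PCA via the eigenvalue decomposition (EVD) of the cross-product X'X
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals)[::-1]        # eigh returns ascending order; sort descending
evals, evecs = evals[order], evecs[:, order]
scores_evd = Xc @ evecs                # principal component scores

# PCA via the singular value decomposition (SVD) of X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * s                     # equivalently Xc @ Vt.T

# EVD and SVD agree: eigenvalues of X'X are the squared singular values of X,
# and the scores coincide up to sign flips of the components
print(np.allclose(evals, s ** 2))
print(np.allclose(np.abs(scores_evd), np.abs(scores_svd)))
```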
ABOUT:
After PCA we shift gears to supervised learning methods that have to do with predicting a quantitative response. We begin with Linear Regression models, which are the stepping stone for all supervised learning methods. We will study the general regression framework by paying attention to the algebraic and geometric aspects, while postponing the discussion of the learning elements for later (to be covered in Concepts of Learning Theory).
READING:
TOPICS:
- Introduction to Linear Regression
- Motivating an intuitive feeling for regression problems
- The regression function: conditional expectation
- Classic framework of Ordinary Least Squares (OLS)
- Theoretical core of OLS and Optimization
- Geometries of Least Squares: individuals, variables, and parameters perspectives
- Gradient descent algorithm (contrasted with the closed-form OLS solution in the sketch after this list)
- Linear regression from a probabilistic approach: Maximum Likelihood
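A rough illustration of two of the items above, in Python/NumPy with simulated data: the OLS coefficients obtained from the normal equations, and essentially the same solution recovered by a plain gradient descent on the mean squared error. All names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # add an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# closed-form OLS: solve the normal equations X'X b = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on the mean squared error criterion (1/n)||y - Xb||^2
beta_gd = np.zeros(X.shape[1])
lr = 0.05
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ beta_gd - y)   # gradient of the criterion
    beta_gd -= lr * grad

print(np.round(beta_ols, 3))
print(np.round(beta_gd, 3))    # should essentially match the OLS solution
```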
ABOUT:
In this part of the course we review the notion of Learning (from a supervised point of view) and other related concepts: What are the conceptual pieces of a learning problem? What do we mean by learning? How do we measure the learning ability of a machine/model? We will also talk about several aspects concerning Learning theory: model performance, bias-variance tradeoff, overfitting, validation, and model selection.
READING:
TOPICS:
- A framework for Supervised Learning
- Supervised Learning Diagram: anatomy of supervised learning problems
- The meaning of Learning, and the need for Training and Test sets
- Types of errors and their measures: In-sample and Out-of-sample errors
- Noisy targets and conditional distributions
- Bias-Variance trade-off
- Derivation of the Bias-Variance decomposition formula
- Case study: data simulation to illustrate the bias-variance decomposition (a small simulation is sketched after this list)
- Interpretation of the bias-variance trade-off
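A small simulation in the spirit of the case study above (Python/NumPy; the target function, noise level, and polynomial degrees are arbitrary choices): models of increasing flexibility are fit to many noisy samples, and their squared bias and variance are estimated on a grid.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)          # "true" target function
x_grid = np.linspace(0, 1, 50)
n_sims, n_obs, noise = 500, 40, 0.3

for degree in (1, 3, 7):                     # increasing model flexibility
    preds = np.empty((n_sims, x_grid.size))
    for s in range(n_sims):
        x = rng.uniform(0, 1, n_obs)
        y = f(x) + rng.normal(scale=noise, size=n_obs)
        coefs = np.polyfit(x, y, degree)     # fit a polynomial of the given degree
        preds[s] = np.polyval(coefs, x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

As the degree grows, the squared bias typically shrinks while the variance grows, which is the trade-off interpreted above.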
ABOUT:
In this part of the course we review the notion of Learning (from a supervised point of view) and other related concepts: What are the conceptual pieces of a learning problem? What do we mean by learning? How do we measure the learning ability of a machine/model? We will also talk about several aspects concerning Learning theory: model performance, bias-variance tradeoff, overfitting, validation, and model selection.
READING:
TOPICS:
- Overfitting
- What is overfitting? Why should we care about it? What causes overfitting?
- Case study: data simulation to illustrate overfitting
- Model Validation and Model Selection
- Validation: What is it? Why do we need it?
- Holdout validation framework, repeated validation, and bootstrap validation
- Three-way holdout framework
- Cross-Validation: k-fold CV, leave-one-out CV (a k-fold sketch appears after this list)
- Some general considerations
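A bare-bones sketch of k-fold cross-validation for an OLS model (Python/NumPy; the function name and toy data are illustrative, not part of the course materials).

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate the out-of-sample MSE of OLS with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)
print(kfold_mse(X, y, k=5))                  # cross-validated estimate of the test MSE
```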
ABOUT:
Having introduced the basic concepts of Learning Theory, we'll expand our discussion of linear regression. The classic Least Squares solution for linear models is not always feasible or desirable. One main idea to get a better solution is by regularizing the regression coefficients. This can be done in a couple of ways: 1) by transforming the predictors and reducing the dimensionality of the input space; or 2) by penalizing the criterion to be minimized via restricting the size of the regression coefficients.
READING:
- CSL: Regularization Techniques
- CSL: Principal Components Regression
- CSL: Partial Least Squares Regression
TOPICS:
- Issues with Least Squares
- Issues with OLS and potential solutions (postponed from OLS regression)
- Multicollinearity issues (postponed from OLS regression)
- Regularization via Dimension Reduction methods
- Principal Components Regression (PCR) (see the sketch after this list)
- Partial Least Squares Regression (PLSR)
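A short sketch of the mechanics of Principal Components Regression (Python/NumPy, toy data): keep the first k principal components of the centered predictors, regress the centered response on the component scores, and map the coefficients back to the original variables. The number of components is an arbitrary choice here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 150, 6, 2                      # keep k = 2 principal components
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)                  # center predictors and response
yc = y - y.mean()

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                        # scores on the first k components
gamma = np.linalg.lstsq(Z, yc, rcond=None)[0]
beta_pcr = Vt[:k].T @ gamma              # coefficients expressed in terms of the original predictors
print(np.round(beta_pcr, 3))
```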
ABOUT:
Having introduced the basic concepts of Learning Theory, we'll expand our discussion of linear regression. The classic Least Squares solution for linear models is not always feasible or desirable. One main idea to get a better solution is by regularizing the regression coefficients. This can be done in a couple of ways: 1) by transforming the predictors and reducing the dimensionality of the input space; or 2) by penalizing the criterion to be minimized via restricting the size of the regression coefficients.
READING:
TOPICS:
- Regularization via Penalized methods
- Ridge Regression (RR) (see the sketch after this list)
- Other methods: lasso, elastic net, and cousins
- Geometries of penalized parameters
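A minimal Ridge Regression sketch (Python/NumPy, simulated nearly collinear predictors) using the closed-form solution b = (X'X + lambda I)^(-1) X'y for a few values of the penalty; the data and penalty values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + rng.normal(scale=0.01, size=n)   # make two predictors nearly collinear
y = X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# ridge regression minimizes ||y - Xb||^2 + lambda ||b||^2,
# with closed-form solution b = (X'X + lambda I)^(-1) X'y
for lam in (0.0, 1.0, 10.0):
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    print(f"lambda = {lam:>4}: {np.round(beta, 2)}")
```

The coefficients of the two nearly collinear predictors are the ones most affected: typically unstable at lambda = 0 and shrunk as the penalty grows.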
ABOUT:
Linear models can be quite useful but they have some limitations that force us to move beyond linearity. The main idea in this section is to relax the linearity assumption while still trying to obtain interpretable models. We'll learn about various approaches that allow us to augment/replace the input variables with an array of transformations such as polynomial regression, step functions, splines, local regression, RBF, etc.
READING:
- CSL: Beyond Linear Regression
- CSL: Basis Expansion
- CSL: Nonparametric Regression
- CSL: Nearest Neighbors Estimates
- CSL: Kernel Smoothers
TOPICS:
- Limitations of linear models, and the notion of linearity
- Some nonlinear approaches
- Polynomial regression
- Stepwise regression
- Basis functions (basis expansion), illustrated in the sketch after this list
- Splines
- Local regression
- Radial Basis Functions
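A quick basis-expansion sketch (Python/NumPy, toy data): a single predictor is expanded into a polynomial basis and an ordinary linear model is fit on the expanded inputs. The degree is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=120)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)   # a clearly nonlinear signal

# basis expansion: replace x with the polynomial basis (1, x, x^2, x^3)
degree = 3
B = np.vander(x, degree + 1, increasing=True)
beta = np.linalg.lstsq(B, y, rcond=None)[0]           # ordinary least squares on the expanded inputs

x_new = np.linspace(-2, 2, 5)
B_new = np.vander(x_new, degree + 1, increasing=True)
print(np.round(B_new @ beta, 2))                      # fitted values from the cubic fit
```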
ABOUT:
The other major type of supervised learning problems covered in this course has to do with classification methods: predicting a qualitative response. We begin with Logistic Regression which provides a nice bridge between linear regression and classification ideas.
READING:
TOPICS:
- Introduction to Classification
- The classification problem
- Limitations of the classic regression model
- Motivation for logistic transformation
- Logistic Regression (see the sketch after this list)
- Model
- Error measure
- Algorithm(s)
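A compact sketch of a logistic regression fit by gradient ascent on the log-likelihood (Python/NumPy, simulated data; the learning rate and number of iterations are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 2.0, -1.0])
prob = 1 / (1 + np.exp(-X @ beta_true))   # logistic (sigmoid) transformation
y = rng.binomial(1, prob)                 # binary response

# maximize the average log-likelihood (equivalently, minimize the cross-entropy error)
beta = np.zeros(3)
lr = 0.5
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p_hat) / n          # gradient of the average log-likelihood
    beta += lr * grad                     # ascent step

print(np.round(beta, 2))                  # should be roughly close to beta_true
```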
ABOUT:
An important family of classification methods belongs to a general framework known as Discriminant Analysis (DA). We start by reviewing certain notions and formulas to measure total variation (or dispersion) in terms of a) variation within classes and b) variation between classes. Then we discuss Ronald Fisher's geometric approach, commonly known as Canonical Discriminant Analysis. This method can be considered a classification method with an unsupervised touch.
READING:
TOPICS:
- Preamble for Discriminant Analysis
- Total variation decomposition (see the sketch after this list)
- Variation within classes
- Variation between classes
- Canonical Discriminant Analysis (CDA)
- CDA's semi-supervised approach
- CDA's supervised approach
- Limitations of CDA classifiers
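A small sketch of the total variation decomposition listed above (Python/NumPy, toy data with three classes): the total scatter matrix equals the sum of the within-class and between-class scatter matrices.

```python
import numpy as np

rng = np.random.default_rng(8)
# toy data: three classes in two dimensions
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
g = np.repeat([0, 1, 2], 30)                       # class labels

grand_mean = X.mean(axis=0)
T = (X - grand_mean).T @ (X - grand_mean)          # total scatter matrix

W = np.zeros((2, 2))                               # within-class scatter
B = np.zeros((2, 2))                               # between-class scatter
for k in range(3):
    Xk = X[g == k]
    mk = Xk.mean(axis=0)
    W += (Xk - mk).T @ (Xk - mk)
    B += len(Xk) * np.outer(mk - grand_mean, mk - grand_mean)

print(np.allclose(T, W + B))                       # the decomposition T = W + B holds
```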
ABOUT:
We continue the discussion of Discriminant Analysis (DA) with the so-called generative classification methods: Linear DA, Quadratic DA, and Naive Bayes. Then we move on to model performance with classification methods. Similar to what we did for regression models, we will discuss how to assess model performance in a classification setting. We will also talk about concepts like: confusion matrices, sensitivity and specificity, true positives and false positives, as well as Receiver Operating Characteristic (ROC) curves.
READING:
TOPICS:
- Discriminant Analysis
- Probabilistic Discriminant Analysis
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Naive Bayes
- Performance of Classifiers
- Measures of Classification Error
- Confusion Matrices (see the sketch after this list)
- Decision Rules
- ROC Curves
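A minimal sketch of a 2x2 confusion matrix along with sensitivity and specificity (Python/NumPy; the helper function confusion_metrics is illustrative, not from any particular library).

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """2x2 confusion matrix plus sensitivity and specificity (positive class coded as 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))     # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))     # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))     # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))     # false negatives
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    return np.array([[tn, fp], [fn, tp]]), sensitivity, specificity

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
cm, sens, spec = confusion_metrics(y_true, y_pred)
print(cm, sens, spec)
```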
ABOUT:
Simply put, Clustering has to do with finding groups in data. This is the second unsupervised topic of the course, covering partition methods as well as hierarchical agglomerative techniques.
READING:
TOPICS:
- Clustering
- About clustering
- Dispersion measures
- Complexity in clustering
- Direct Partitioning Methods
- Partitioning methods
- K-Means (see the sketch after this list)
- Hierarchical Clustering
- Distances and Dissimilarities
- Single linkage
- Complete linkage
- Average linkage
- Centroid linkage
- Dendrograms
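A bare-bones K-means sketch (Python/NumPy; the kmeans function below is illustrative and omits refinements such as convergence checks and the handling of empty clusters).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate between assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # random initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                             # assign each point to its nearest centroid
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in ([0, 0], [4, 4])])   # two toy clusters
labels, centers = kmeans(X, k=2)
print(np.round(centers, 2))     # centroids should land near (0, 0) and (4, 4)
```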
ABOUT:
We introduce decision trees, which are one of the most visually attractive and intuitive supervised learning methods. In particular, we will focus our discussion on one kind of tree: the CART-style binary decision trees from the methodology developed in the early 1980s by Leo Breiman, Jerome Friedman, Charles Stone, and Richard Olshen.
READING:
TOPICS:
- Introduction to Trees
- Terminology
- Tree diagrams
- Binary splits and impurity
- Binary partitions
- Measures of impurity: Entropy
- Measures of impurity: Gini impurity (entropy and Gini are both sketched after this list)
- Measures of impurity: Variance-based
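Short illustrative definitions of the entropy and Gini impurity measures (Python/NumPy), evaluated on a pure node and on a 50/50 node.

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a vector of class proportions p; 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # a pure node: both impurities are zero
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # a 50/50 node: entropy = 1 bit, Gini = 0.5
```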
ABOUT:
We introduce decision trees, which are one of the most visually attractive and intuitive supervised learning methods. In particular, we will focus our discussion on one kind of tree: the CART-style binary decision trees from the methodology developed in the early 1980s by Leo Breiman, Jerome Friedman, Charles Stone, and Richard Olshen.
READING:
TOPICS:
- Splitting Nodes
- Entropy-based splits
- Gini-impurity based splits
- Looking for the best split (see the sketch after this list)
- Building Binary Trees
- Node-splitting stopping criteria
- Pruning a tree
- Pros and cons of trees
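A toy sketch of searching for the best binary split on a single numeric predictor by maximizing the reduction in Gini impurity (Python/NumPy; all names are illustrative).

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Exhaustive search for the cut point on x that most reduces Gini impurity."""
    best_cut, best_gain = None, -np.inf
    for c in np.unique(x)[:-1]:                       # candidate cut points
        left, right = y[x <= c], y[x > c]
        w = len(left) / len(y)
        gain = gini(y) - (w * gini(left) + (1 - w) * gini(right))
        if gain > best_gain:
            best_cut, best_gain = c, gain
    return best_cut, best_gain

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_split(x, y))    # the best cut point is 4, with an impurity reduction of 0.5
```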
ABOUT:
We finish the course with so-called Ensemble methods (i.e. aggregating individual learners) such as bagging, boosting, and random forests.
READING:
TOPICS:
- Bagging: bootstrap aggregating (see the sketch after this list)
- Idea of bagging
- Advantages of bagging
- Random Forests
- Idea of random forest
- Advantages of random forest
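A final sketch of the idea behind bagging (Python/NumPy): fit the same base learner on many bootstrap samples and average their predictions. For simplicity, the base learner here is a flexible polynomial fit standing in for a regression tree; all settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_new = np.linspace(0, 1, 5)

# bagging (bootstrap aggregating): fit the base learner on B bootstrap samples
# and average the B predictions at each new point
B = 200
preds = np.zeros((B, x_new.size))
for b in range(B):
    idx = rng.integers(0, len(x), len(x))    # bootstrap sample (drawn with replacement)
    coefs = np.polyfit(x[idx], y[idx], deg=6)
    preds[b] = np.polyval(coefs, x_new)

print(np.round(preds.mean(axis=0), 2))       # bagged (averaged) predictions at x_new
```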