Alessandro (Ale) Rinaldo - Fall 2024
SDS 387 is an intermediate graduate course in theoretical statistics for PhD students, covering two separate but interrelated topics: (i) stochastic convergence and (ii) linear regression modeling. The material and style of the course will skew towards the mathematical and theoretical aspects of common models and methods, in order to provide a foundation for those who wish to pursue research in statistical methods and theory. This is not an applied regression analysis course.
Syllabus: Syllabus
Lectures: Tuesday and Thursday, 9:00am - 10:30am, PMA 5.112
TA: Khai Nguyen, khainb@utexas.edu - Office hours: Thursday, 1:30pm - 2:30pm, GDC 7.418 (Poisson Bowl)
Ale's Office hours: by appointment
Homework submission and solutions: use Canvas
Assignment | Due date |
Homework 1 | September 17 |
Homework 2 | October 3 |
Final project proposal | October 12 |
Homework 3 | October 17 |
Homework 4 | November 14 |
Lecture 1: Introduction and course logistics. Deterministic convergence and convergence with probability one.
Lecture 2: Lim sup and lim inf of events. Borel-Cantelli Lemmas. Convergence in probability and comparison with convergence with probability one. Law of large numbers. Glivenko-Cantelli Lemma.
References:
See Ferguson's book, chapters 1, 2 and 4.
For a proof of Glivenko-Cantelli's Lemma see Theorem 19.1 of van der Vaart's book.
A nice webpage summarizing the different modes of stochastic convergence and providing some good examples to illustrate their differences.
Lecture 3: Glivenko-Cantelli Theorem, First Borel-Cantelli Lemma, more on convergence in probability. For the Glivenko-Cantelli Theorem, see Theorem 19.1 in van der Vaart's book.
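A small simulation sketch (illustrative only, not part of the course materials; the Uniform(0,1) distribution and sample sizes are arbitrary choices) of what the Glivenko-Cantelli Theorem says numerically: the sup-distance between the empirical c.d.f. and the true c.d.f. shrinks to zero as the sample size grows.

```python
# Illustration of the Glivenko-Cantelli Theorem: for i.i.d. Uniform(0,1) data,
# sup_t |F_n(t) - t| shrinks to 0 as n grows (F(t) = t is the true c.d.f.).
import numpy as np

rng = np.random.default_rng(0)

def ks_deviation_uniform(sample):
    """Compute sup_t |F_n(t) - t| for a sample from Uniform(0, 1)."""
    x = np.sort(sample)
    n = x.size
    upper = np.arange(1, n + 1) / n   # value of F_n at the i-th order statistic
    lower = np.arange(0, n) / n       # value of F_n just below it
    return max(np.max(upper - x), np.max(x - lower))

for n in [100, 1_000, 10_000, 100_000]:
    print(n, ks_deviation_uniform(rng.uniform(size=n)))
# The printed deviations shrink roughly like 1/sqrt(n), consistent with the DKW inequality.
```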
Lecture 4: Lp convergence; Minkowski, Hölder, and Jensen inequalities. Relations between Lp convergence, convergence in probability, and convergence with probability one. C.d.f.'s in multivariate settings.
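For quick reference, the three inequalities in their standard forms (standard statements, not taken from the class notes):

```latex
\[
\text{(H\"older, } \tfrac{1}{p}+\tfrac{1}{q}=1,\ p,q>1\text{):}\quad
E|XY| \le \left(E|X|^p\right)^{1/p}\left(E|Y|^q\right)^{1/q},
\]
\[
\text{(Minkowski, } p\ge 1\text{):}\quad
\left(E|X+Y|^p\right)^{1/p} \le \left(E|X|^p\right)^{1/p} + \left(E|Y|^p\right)^{1/p},
\]
\[
\text{(Jensen, } \varphi \text{ convex, } X \text{ integrable):}\quad
\varphi\!\left(E[X]\right) \le E\!\left[\varphi(X)\right].
\]
```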
Lecture 5: Convergence in distribution. Relation with other forms of convergence. Marginal vs joint convergence in distribution. Portmanteau theorem. For the proof of the claim that convergence in probability implies convergence in distribution, see page 330 of Billingsley's book Probability and Measure.
Lecture 6: Portmanteau Theorem, Continuous Mapping Theorem, characteristic functions and the Continuity Theorem, Cramér-Wold device. I suggest reading Chapter 3 of Ferguson's book (in particular, Theorem 3(e) has a neat proof).
Lecture 7: Slutsky's theorem, more on convergence in distribution. Big-oh and little-oh notation.
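As a quick reminder, the stochastic (in-probability) versions of the big-oh and little-oh notation, in their standard form (standard definitions, not a statement from the lecture notes):

```latex
\[
X_n = o_P(1) \iff \forall\,\varepsilon>0,\ P(|X_n|>\varepsilon)\to 0;
\qquad
X_n = O_P(1) \iff \forall\,\varepsilon>0\ \exists\, M, N:\ P(|X_n|>M)<\varepsilon \ \ \forall n\ge N.
\]
\[
X_n = o_P(a_n) \iff X_n/a_n = o_P(1),
\qquad
X_n = O_P(a_n) \iff X_n/a_n = O_P(1).
\]
```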
Lecture 8: More on big-oh and little-oh notation. CLT for i.i.d. variables using characteristic functions. Triangular arrays, Lindeberg-Feller and Lyapunov conditions.
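For reference, the Lindeberg and Lyapunov conditions for a triangular array of row-wise independent variables $X_{n,1},\dots,X_{n,k_n}$ with $s_n^2 = \sum_i \operatorname{Var}(X_{n,i})$ (standard statements; the variables are assumed mean zero only to keep the display short):

```latex
\[
\text{(Lindeberg)}\quad
\frac{1}{s_n^2}\sum_{i=1}^{k_n} E\!\left[X_{n,i}^2\,\mathbf{1}\{|X_{n,i}|>\varepsilon s_n\}\right]\to 0
\ \ \forall\,\varepsilon>0
\quad\Longrightarrow\quad
\frac{1}{s_n}\sum_{i=1}^{k_n} X_{n,i} \rightsquigarrow N(0,1).
\]
\[
\text{(Lyapunov, which implies Lindeberg)}\quad
\frac{1}{s_n^{2+\delta}}\sum_{i=1}^{k_n} E|X_{n,i}|^{2+\delta}\to 0
\ \text{ for some } \delta>0.
\]
```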
Lecture 9: Lindeberg-Feller, examples and multivariate extension. Berry-Esseen bounds. A good reference for this lecture and the last is the book Sums of Independent Random Variables, by V.V. Petrov, Springer, 1975. Another classic and good reference is Approximation Theorems of Mathematical Statistics by Serfling, Wiley, 1980.
Lecture 10: Kolmogorov-Smirnov, total variation, and Wasserstein distances. Theorem 1.1 on Lindeberg approximations for three-times continuously differentiable functions.
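The three distances from this lecture in their standard formulations (the Wasserstein-1 distance is written in its Kantorovich-Rubinstein dual form; these are textbook definitions, not taken from the class notes):

```latex
\[
d_{\mathrm{KS}}(P,Q) = \sup_{t\in\mathbb{R}} |F_P(t) - F_Q(t)|,
\qquad
d_{\mathrm{TV}}(P,Q) = \sup_{A} |P(A) - Q(A)|,
\qquad
W_1(P,Q) = \sup_{f:\ \|f\|_{\mathrm{Lip}}\le 1} \left| \int f\,dP - \int f\,dQ \right|.
\]
```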
Lecture 11: Review of linear algebra. See references in the class notes.
Lecture 12: Spectral properties of matrices. Eigendecomposition and singular value decomposition.
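A small numerical sketch (NumPy, illustrative only; the matrix sizes are arbitrary) of the two decompositions and the identity connecting them:

```python
# Eigendecomposition of a symmetric matrix and SVD of a rectangular matrix,
# with numerical checks of the defining identities.
import numpy as np

rng = np.random.default_rng(1)

# Symmetric PSD matrix: A = Q diag(lam) Q^T with orthonormal eigenvectors Q.
B = rng.standard_normal((5, 5))
A = B @ B.T
lam, Q = np.linalg.eigh(A)                     # eigenvalues in ascending order
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))  # True

# Rectangular matrix: X = U diag(s) V^T with singular values s >= 0.
X = rng.standard_normal((8, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))     # True

# Link between the two: the squared singular values of X are the eigenvalues of X^T X.
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(X.T @ X))))  # True
```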
Lecture 13: Projections. Vector and matrix norms.
Lecture 14: Projection of a random variable onto a vector space of random variables. Introduction to linear regression modeling. For the next few lectures, I will be closely following the book Learning Theory from First Principles by Francis Bach.
Lecture 15: Inference and prediction in linear regression modeling. Projection parameter, prediction risk decomposition.
Lecture 16: Geometric interpretation of the OLS estimator. Gradient descent convergence guarantee for the OLS.
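A minimal sketch (illustrative; the toy data-generating process, the number of iterations, and the step size 1/L with L the largest eigenvalue of X^T X / n are my choices, following the standard smooth-convex analysis) showing gradient descent on the least-squares objective converging to the closed-form OLS solution:

```python
# Gradient descent on f(b) = ||y - X b||^2 / (2n), compared with closed-form OLS.
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form OLS solution

L = np.linalg.eigvalsh(X.T @ X / n).max()          # smoothness constant of f
b = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ b - y) / n                   # gradient of f at b
    b = b - grad / L                               # constant step size 1/L

print(np.max(np.abs(b - beta_ols)))                # essentially 0
```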
Lecture 17: Pseudoinverse. Risk decomposition for the estimator of the linear regression parameters for fixed design.
Lecture 18: Gauss-Markov Theorem. Ridge regression.
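For reference: the Gauss-Markov Theorem states that, for a fixed design with mean-zero errors of covariance σ²I, the OLS estimator has the smallest covariance (in the positive semidefinite order) among all linear unbiased estimators of the regression coefficients. The ridge estimator, in the normalization used below (conventions vary; some texts drop the 1/n factor), is

```latex
\[
\hat\beta_\lambda
\;=\; \arg\min_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2
\;=\; \bigl(X^\top X + n\lambda I_d\bigr)^{-1} X^\top y,
\qquad \lambda > 0 .
\]
```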
Lecture 19: Optimal tuning for ridge regression and minimax lower bound for OLS.
Lecture 20: Minimax lower bound for OLS. Consistency of the OLS.
Lecture 21: Asymptotic normality of the OLS estimator and statistical inference in the fixed-design, well-specified setting.
Lecture 22: Random design. Risk formula for the OLS. Projection parameters.
Lecture 23: Random design. Minimax optimality of the OLS. Exact analysis under Gaussian design. Recommended readings:
Mourtada, J. (2022). Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. Annals of Statistics, 50(4):2157–2178.
Leo Breiman and David Freedman. How many variables should be entered in a regression equation? J. Amer. Statist. Assoc., 78(381):131–136, 1983.
Lecture 24: The double descent phenomenon. Recommended readings:
Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2), 949-986.
Belkin, M., Hsu, D. and Xu, J. (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4).
Assumption-lean regression. Highly recommended reading:
Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., Zhao, L. (2019). Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34(4), 523-544.
Lecture 25: Consistency and asymptotic normality of the OLS estimator in the assumption-lean setting.
Lecture 26: Consistency of the plug-in estimator of the sandwich covariance for the OLS estimator.
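A small sketch (illustrative only; this is the standard HC0-style plug-in form of the sandwich estimator, not code from the lecture, and the misspecified data-generating process is an arbitrary choice) of the plug-in sandwich covariance estimate for the OLS estimator of the projection parameter:

```python
# Plug-in (HC0-style) sandwich covariance for OLS:
#   Sigma_hat = (X^T X / n)^{-1} [ (1/n) sum_i e_i^2 x_i x_i^T ] (X^T X / n)^{-1},
# so Var(beta_hat) is approximately Sigma_hat / n in the assumption-lean setting.
import numpy as np

def ols_sandwich(X, y):
    n, d = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)
    meat = (X * resid[:, None] ** 2).T @ X / n     # (1/n) sum_i e_i^2 x_i x_i^T
    sigma_hat = bread @ meat @ bread
    return beta_hat, sigma_hat / n                 # approximate covariance of beta_hat

# Example with a misspecified, heteroskedastic model.
rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
y = np.sin(2 * x) + (0.5 + np.abs(x)) * rng.standard_normal(n)
beta_hat, cov_hat = ols_sandwich(X, y)
print(beta_hat, np.sqrt(np.diag(cov_hat)))         # projection-parameter estimates and robust SEs
```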
High-dimensional generalizations of the results from the last few lectures can be found in:
Kuchibhotla, A., Rinaldo, A. and Wasserman, L. (2021). Berry-Esseen Bounds for Projection Parameters and Partial Correlations with Increasing Dimension, arXiv:2007.09751
Chang, W., Kuchibhotla, A. and Rinaldo, A. (2023). Inference for Projection Parameters in Linear Regression: beyond d = o(n^{1/2}), arXiv:2307.00795.
Conditions for consistency and asymptotic normality of the OLS estimator (for a well-specified linear model) were given by
Lai, T. L. and Wei, C. Z. (1982). Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems, Annals of Statistics, 10(1): 154-166.
Khamaru, K., Deshpande, Y., and Wainwright, M. (2021). Near-optimal inference in adaptive linear regression, arXiv:2107.02266
Here is a simple example of the negative impact of adaptive data collection protocols:
Shin, J., Ramdas, A. and Rinaldo, A. (2021). On the Bias, Risk, and Consistency of Sample Means in Multi-armed Bandits, SIAM Journal on Mathematics of Data Science, 3(4).