References
Textbooks
- Concentration Inequalities: A Nonasymptotic Theory of Independence, by S.
Boucheron, G. Lugosi and P. Massart, Oxford University Press, 2013.
- High-Dimensional Probability: An Introduction with Applications in Data
Science, by R. Vershynin, Cambridge University Press, 2018, available here.
Mon Aug 30
A classical reference on the concentration of well-behaved functions of independent random variables is
- A new look at independence, by M. Talagrand, Annals of Probability, 24(1): 1-34, 1996.
References for improved Chernoff bounds for Bernoulli random variables (the relative-entropy form of the bound is recalled after the list):
- Check out the Wikipedia page.
- A guided tour of Chernoff bounds, by T. Hagerup and C. Rüb, Information
Processing Letters, 33(6), 305-308, 1990.
- Chapter 4 of the book Probability and Computing: Randomized Algorithms and
Probabilistic Analysis, by M. Mitzenmacher and E. Upfal, Cambridge University
Press, 2005.
- The Probabilistic Method, 3rd Edition, by N. Alon and J. H. Spencer, Wiley,
2008, Appendix A.1.
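For orientation, one sharpened form that appears in these references is the relative-entropy (Chernoff-Hoeffding) bound: for i.i.d. $X_1,\dots,X_n \sim \mathrm{Bernoulli}(p)$ and $p < q < 1$,
\[
\mathbb{P}\Big( \tfrac{1}{n}\textstyle\sum_{i=1}^n X_i \ge q \Big) \le e^{-n\,\mathrm{kl}(q\|p)},
\qquad
\mathrm{kl}(q\|p) = q\log\tfrac{q}{p} + (1-q)\log\tfrac{1-q}{1-p},
\]
which recovers the usual $e^{-2n(q-p)^2}$ bound via Pinsker's inequality.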
Improvement of Hoeffding's inequality by Berend and Kontorovich:
- On the concentration of the missing mass, by D. Berend and A.
Kontorovich, Electron. Commun. Probab. 18 (3), 1–7, 2013.
- Section 2.2.4 of the monograph on concentration inequalities by M. Raginsky.
The original proof of Hoeffding's inequality is here. Compare it to the modern, slick proof in Lemma 2.2 of [BLM].
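For reference, the inequality states that if $X_1,\dots,X_n$ are independent with $a_i \le X_i \le b_i$ almost surely and $S_n = \sum_{i=1}^n X_i$, then for all $t > 0$
\[
\mathbb{P}\big( |S_n - \mathbb{E} S_n| \ge t \big) \le 2\exp\Big( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big).
\]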
Wed Sep 1
For an example of the improvement afforded by Bernstein over Hoeffding (the shape of the gain is recalled after the reference), see
Theorem 7.1 of
- Laszlo Gyorfi, Michael Kohler, Adam Krzyzak and Harro Walk (2002). A
Distribution-Free Theory of Nonparametric Regression, Springer, available here.
By the way, this is an excellent book.
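To recall the shape of the gain: for independent, zero-mean $X_i$ with $|X_i| \le M$ almost surely and $v = \sum_{i=1}^n \mathbb{E}[X_i^2]$, Bernstein's inequality gives, for all $t > 0$,
\[
\mathbb{P}\Big( \sum_{i=1}^n X_i \ge t \Big) \le \exp\Big( -\frac{t^2}{2(v + Mt/3)} \Big),
\]
whereas Hoeffding's inequality only sees the range and yields $\exp(-t^2/(2nM^2))$; when $v \ll nM^2$ the variance-sensitive bound is substantially sharper.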
Wed Sep 8
The book Concentration Inequalities for Sums and Martingales by B. Bercu, B. Delyon and E. Rio (2015) contains many sharp calculations and bounds.
References on the JL Lemma (a toy sketch of the binary-coin projection follows the list):
- Database-friendly random projections: Johnson-Lindenstrauss with binary coins, by D. Achlioptas,
Journal of Computer and System Sciences, 66 (2003), 671-687.
- An Elementary Proof of a Theorem of Johnson and Lindenstrauss, by S. Dasgupta and A. Gupta, 2002.
- Section 1.2 of the book The Random Projection Method by S. Vempala, AMS, 2004
- Simple Analysis of Sparse, Sign-Consistent JL, by M. Jagadeesan, 2017, arXiv:1708.02966
- Optimality of the Johnson-Lindenstrauss Lemma, by K.G. Larsen and J. Nelson (2016), arXiv:1609.02094
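As a toy illustration of the binary-coin construction studied in the Achlioptas paper, here is a minimal Python sketch (the data, the target dimension k and the function name are placeholders, not a tuned recipe):

    import numpy as np

    def binary_coin_projection(X, k, seed=None):
        # Project the rows of X (n x d) to k dimensions using a d x k matrix
        # of i.i.d. +/-1 entries, rescaled by 1/sqrt(k).
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        R = rng.choice([-1.0, 1.0], size=(d, k))
        return X @ R / np.sqrt(k)

    # Sanity check: pairwise distances should be roughly preserved.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2000))
    Y = binary_coin_projection(X, k=500, seed=1)
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))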
Mon Sep 13, 15 and 20
The book Metric Characterization of Random Variables and Random Processes (2000), by
V. V. Buldygin and Yu. V. Kozachenko, provides a great deal of information about sub-Gaussian and sub-exponential random variables (as well as random processes) and Orlicz norms.
To see the equivalent characterizations of sub-Gaussian and sub-exponential random variables, see Theorems 2.5.2 and 2.7.1 in Vershynin's book. In particular, Theorem 2.8.1 therein is yet another version of Bernstein's inequality, using Orlicz norms.
The Bernstein-Orlicz norm was introduced in the paper The Bernstein-Orlicz norm and deviation inequalities, by S. van de Geer and J. Lederer. See also the follow-up paper by J. Wellner, The Bennett-Orlicz norm.
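In the normalization used in Vershynin's book, the Orlicz norms in question are
\[
\|X\|_{\psi_\alpha} = \inf\big\{ t > 0 : \mathbb{E}\exp\big( |X|^\alpha / t^\alpha \big) \le 2 \big\},
\]
with $\alpha = 2$ corresponding to sub-Gaussian and $\alpha = 1$ to sub-exponential random variables.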
The properties of sub-Weibull random variables are described in the papers:
Mon Sep 29
The use of the Donsker-Varadhan variational formula for the KL divergence in adaptive data analysis was introduced in the paper below (the formula itself is recalled after the reference):
- Controlling Bias in Adaptive Data Analysis Using Information Theory, by
Daniel Russo and James Zou, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:1232-1240, 2016.
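The variational formula in question states that, for probability measures $P \ll Q$,
\[
\mathrm{KL}(P \,\|\, Q) = \sup_{f} \Big\{ \mathbb{E}_{P}[f(X)] - \log \mathbb{E}_{Q}\big[ e^{f(X)} \big] \Big\},
\]
where the supremum runs over, say, bounded measurable functions $f$.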
Follow-up contributions mentioned in class:
- Information-theoretic analysis of generalization capability of learning algorithms, by Aolin Xu, Maxim Raginsky,
NIPS 2017
- Dependence Measures Bounding the Exploration Bias for General Measurements, by Jiantao Jiao, Yanjun Han, Tsachy Weissman, 2016, arXiv:1612.05845
These arguments were further developed to yield tighter generalization bounds based on chaining mutual information here:
- Chaining Mutual Information and Tightening Generalization Bounds, by Amir R. Asadi, Emmanuel Abbe and Sergio Verdú, NeurIPS 2018
To see the effect of data adaptivity on something as simple as the sample mean, check out the papers below (a toy simulation follows the list):
- Are sample means in multi-armed bandits positively or negatively biased? by Jaehyeok Shin, Aaditya Ramdas, Alessandro Rinaldo, NeurIPS 2018
- On the bias, risk and consistency of sample means in multi-armed bandits, by Jaehyeok Shin, Aaditya Ramdas, Alessandro Rinaldo, 2019, arXiv:1902.00746
- On conditional versus marginal bias in multi-armed bandits, by Jaehyeok Shin, Aaditya Ramdas, Alessandro Rinaldo, ICML 2020
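As a quick numerical illustration of the phenomenon, here is a hypothetical toy setup (not the experiments from these papers): a greedy two-armed bandit in which both arms are Bernoulli(0.5), after which each arm's sample mean is compared to its true mean.

    import numpy as np

    # Toy simulation: under greedy (adaptive) sampling, the per-arm sample means
    # come out negatively biased even though both arms are identical.
    rng = np.random.default_rng(0)
    p = [0.5, 0.5]                      # true means (hypothetical values)
    n_rounds, n_reps = 30, 10000
    bias = np.zeros(2)
    for _ in range(n_reps):
        pulls = [[rng.binomial(1, p[a])] for a in range(2)]   # one forced pull per arm
        for _ in range(n_rounds - 2):
            a = int(np.mean(pulls[1]) > np.mean(pulls[0]))    # greedy arm choice
            pulls[a].append(rng.binomial(1, p[a]))
        bias += [np.mean(pulls[a]) - p[a] for a in range(2)]
    print(bias / n_reps)                # both entries are typically negative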