Last Friday, before a sumptuous Thanksgiving-themed LIDS lunch, Xu Huan of McGill University presented a seminar. Mr. Xu's talk was entitled "Miracle of Regularization," and covered joint work with his advisor Shie Mannor, formerly a LIDS post-doc, and Constantine Caramanis, formerly a LIDS student and author of some very nice expository articles. Part of the work was recently featured on An Ergodic Walk, the blog of Anand Sarwate.
The talk studied the supervised learning problem with regularization from the viewpoint of robust optimization. Xu first motivated why regularization is needed when learning from finite training data by appealing to Hadamard's conditions for well-posedness, in particular that a solution may fail to exist or to be unique, and may not depend continuously on the data. Other motivations for regularization in supervised learning come from the structural risk minimization principle.
I first learned about robust optimization in a lecture by Melvyn Sim, tele-delivered from Singapore for the class 6.255. When data is uncertain but known to belong to an uncertainty set, the basic idea of robust optimization is to optimize the objective function with respect to the worst-case point in the uncertainty set, i.e. to solve a min-max or max-min problem.
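To make the min-max idea concrete, here is a minimal numeric sketch (my own toy example, not one from the talk) of the classic robust least-squares identity: if the data matrix is perturbed within a spectral-norm ball of radius c, the worst-case residual equals the nominal residual plus c times the norm of the weight vector, and the maximizing perturbation is an explicit rank-one matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))   # nominal data matrix
y = rng.standard_normal(n)        # targets
w = rng.standard_normal(d)        # a fixed candidate weight vector
c = 0.3                           # radius of the uncertainty set

# Worst case over all perturbations with spectral norm ||Delta|| <= c:
#   max ||(X + Delta) w - y||_2  =  ||X w - y||_2 + c ||w||_2,
# achieved by the rank-one matrix Delta* = c * u v^T, where
# u = (Xw - y)/||Xw - y|| and v = w/||w||.
r = X @ w - y
u = r / np.linalg.norm(r)
v = w / np.linalg.norm(w)
Delta_star = c * np.outer(u, v)

worst_case = np.linalg.norm((X + Delta_star) @ w - y)
closed_form = np.linalg.norm(r) + c * np.linalg.norm(w)
print(abs(worst_case - closed_form))  # ~0 up to floating point
```

Minimizing the right-hand side over w is exactly a norm-regularized least-squares problem, which is the flavor of equivalence the talk developed for classification.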
In typical supervised classification formulations, the goal is to find a decision function that minimizes the empirical risk on the training data, usually measured with a margin-based loss function, plus a regularization term, often a norm in the space of decision functions, weighted by a non-negative scalar c. Xu, Caramanis, and Mannor show how this regularized formulation arises as the solution of a robust optimization problem with uncertainty sets around the training examples. They also characterize the uncertainty set corresponding to standard regularization terms, and what c means in terms of the size of that set.
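Here is a minimal sketch of this kind of equivalence for a single training point (my own made-up example and variable names, assuming an l2-ball uncertainty set of radius c around the feature vector): the worst-case hinge loss over the ball equals the nominal hinge loss with c times the norm of w added inside the margin, i.e. the margin loss plus a regularization-like term.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d)   # a training feature vector
y = 1.0                      # its label, in {-1, +1}
w = rng.standard_normal(d)   # a fixed candidate weight vector
c = 0.5                      # radius of the l2 ball of allowed perturbations

# Worst case over all perturbations ||delta||_2 <= c:
#   sup max(0, 1 - y w.(x + delta)) = max(0, 1 - y w.x + c ||w||_2),
# achieved by delta* = -c * y * w / ||w||_2 (the perturbation that
# pushes the point directly across the margin).
delta_star = -c * y * w / np.linalg.norm(w)
worst_case = max(0.0, 1 - y * w @ (x + delta_star))
closed_form = max(0.0, 1 - y * w @ x + c * np.linalg.norm(w))
print(abs(worst_case - closed_form))  # ~0 up to floating point
```

Summing such worst-case losses over the training set and minimizing over w recovers a hinge-loss objective with a norm penalty weighted by c, which is the sense in which the uncertainty radius plays the role of the regularization parameter.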
The training data are generally assumed to be independent samples from a joint distribution of features and class labels. (This joint distribution is unknown.) A classifier is said to be consistent if, in the limit as the size of the training set goes to infinity, its risk converges in probability to that of the optimal classifier, the one minimizing the probability of error if the joint distribution were known. Several recent papers, including those by Lin; Steinwart; and Bartlett, Jordan, and McAuliffe, discuss the properties of the margin-based loss function in the empirical risk, and of the regularization term, needed for a classifier to be consistent. Xu et al. show consistency of classifiers using the robust optimization perspective they develop.
Munther Dahleh raised several remarks and questions at the end of the talk, including one about connections to the robust estimation literature of twenty years ago. Interestingly, when Mike Jordan was visiting LIDS a few weeks ago, he also discussed classifier consistency in his SSG talk, which was based on joint work with XuanLong Nguyen and former LIDS student Martin Wainwright. That work links valid margin-based loss functions and Ali-Silvey distances.