We have n i.i.d. samples (x_1, y_1), …, (x_n, y_n) drawn from an unknown distribution P(x, y), and we want to infer this statistical dependency.
A generic model is
y_i = f(x_i) + e_i, where the e_i are i.i.d. noise with an unknown distribution.
Regression aims to recover f from finitely many samples.
f can be a generalized linear function:
f(x) = w^T ψ(x) = Σ_i w_i ψ(x, θ_i), where ψ is a basis function
f can be an affine function of kernel evaluations:
f(x) = Σ_i w_i k(x_i, x) + b, where k is a kernel function
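As a concrete sketch, the generalized linear form can be fit by ordinary least squares once a basis expansion is fixed; the Gaussian-bump basis, its centers θ_i, the length scale, and the toy data below are all illustrative assumptions, not part of the original formulation:

```python
import numpy as np

# Illustrative basis: psi(x, theta_i) = exp(-((x - theta_i) / 0.2)^2),
# with hypothetical centers theta_i; any basis family would do.
def design_matrix(x, centers, length_scale=0.2):
    return np.exp(-((x[:, None] - centers[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)  # y_i = f(x_i) + e_i

centers = np.linspace(0.0, 1.0, 10)          # theta_1, ..., theta_10
Psi = design_matrix(x, centers)              # n x m matrix of psi(x_i, theta_j)
w, *_ = np.linalg.lstsq(Psi, y, rcond=None)  # least-squares estimate of w
f_hat = Psi @ w                              # fitted values w^T psi(x_i)
```

With enough well-placed basis functions, the fitted values track the underlying sinusoid down to roughly the noise level.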
There is a one-to-one correspondence between probability distributions and code-length functions, such that small probabilities correspond to large code lengths and vice versa.
Goal: encode a sequence x_1, x_2, …, x_n with the minimum number of bits, in a way that is universally good for a whole family of distributions.
Bayes and NML become indistinguishable if Jeffreys' prior is chosen.
Jeffreys' prior is uniform not on the parameter space but on the space of distributions, under the "natural" metric (the Fisher information metric) that measures the distance between distinguishable distributions.
For large n, the Bayesian predictive distribution concentrates more and more around the ML distribution.
Model sets and mixture distributions can also be encoded.
The difficulty lies in the computational aspects rather than in the MDL criterion itself.
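One common concrete instance is the two-part code: choose the model whose total description length (data given model, plus model) is smallest. The sketch below uses the standard asymptotic approximation (n/2) log2(RSS/n) + (k/2) log2(n) bits for a k-parameter Gaussian regression model; the polynomial family and the synthetic cubic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(-1.0, 1.0, n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.standard_normal(n)  # cubic + noise

def code_length(deg):
    """Two-part code length (asymptotic approximation) for a degree-`deg` fit.

    Data given model: (n/2) log2(RSS/n) bits; model: (k/2) log2(n) bits,
    where k = deg + 1 is the number of fitted coefficients.
    """
    coef = np.polyfit(x, y, deg)
    rss = float(np.sum((np.polyval(coef, x) - y) ** 2))
    k = deg + 1
    return 0.5 * n * np.log2(rss / n) + 0.5 * k * np.log2(n)

# Pick the degree with the shortest total description length.
best = min(range(1, 9), key=code_length)
```

On data with a genuine cubic regularity, the code length at degree 3 beats both an underfit line (large data cost) and a high-degree overfit (large model cost).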
MDL vs. Bayesian
Bayesian: the prior represents degrees of belief in different states of nature; the true distribution has nonzero probability measure.
MDL: there is no such thing as a true distribution; inductive learning is based only on the regularities in the data, which will also be present in future data from the same phenomenon.
Major concern: consistency and rate of convergence; there is no result comparable to Vapnik's statistical learning theory.
Rissanen’s extreme position:
The assumption that there exists a probability distribution generating data is untenable in many applications.
Statistical inference that assumes a true distribution exists and seeks to obtain this distribution as fast as possible is methodologically flawed.
Model selection should be based on the properties of data alone (cf. The Computer Journal, 42(4), 1999).
Nevertheless, if a true distribution does exist and lies in the model set, the method had better find it given enough data.
R(f) = (1/n) Σ_i L(y_i, f(x_i)) + λ ||Af||_{L(x)}
f can be any regression function (no parameter θ is needed); A is an operator, and L(x) is the Hilbert space of square-integrable functions on x with a proper measure.
One only needs to work in a reproducing kernel Hilbert space (RKHS).
(nλI + K)c = Y,  f(x) = Σ_i c_i K(x, x_i)
where K(·, ·) is the kernel function and K = [K(x_i, x_j)] is the corresponding symmetric positive definite Gram matrix.
Best choice of the regularization parameter λ (Cucker and Smale): a unique λ* exists, for a compact hypothesis space, that minimizes the approximation error to the true f*.
This can be interpreted as the best tradeoff between sample complexity and hypothesis-space complexity.
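Since λ* is defined through the unknown true f*, it is not directly computable; a common practical stand-in is to pick λ by held-out validation error. Everything below (the Gaussian kernel, the λ grid, the train/validation split) is an illustrative assumption:

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length_scale**2))

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 80)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(80)

# Random train/validation split (60 / 20)
idx = rng.permutation(80)
tr, va = idx[:60], idx[60:]

def val_error(lam):
    """Fit kernel ridge on the training split, score on the validation split."""
    n = len(tr)
    K = gaussian_kernel(x[tr], x[tr])
    c = np.linalg.solve(n * lam * np.eye(n) + K, y[tr])
    pred = gaussian_kernel(x[va], x[tr]) @ c
    return float(np.mean((pred - y[va]) ** 2))

lams = [10.0**-k for k in range(1, 8)]   # illustrative grid 1e-1 ... 1e-7
lam_star = min(lams, key=val_error)
```

This mirrors the tradeoff above empirically: large λ underfits (high approximation error), tiny λ overfits the noise, and the validation curve bottoms out in between.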
In statistics: regularized nonparametric least-squares regression.
Bayesian interpretation: Use prior P( f )=exp( − λ ||Af|| )/ Z
Closely related to Gaussian process model (MacKay)
J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, River Edge, NJ, 1989.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
V. N. Vapnik. Statistical Learning Theory . Wiley, New York, 1998.
Z. Zhao, H. Chen, and X. R. Li. “Semiparametric Model Selection with Applications to Regression”, Proc. 2005 IEEE Workshop on Statistical Signal Processing , Bordeaux, France, July 2005.
H. Chen, Y. Bar-Shalom, K. R. Pattipati, and T. Kirubarajan. “MDL Approach for Multiple Low Observable Track Initiation”, IEEE Trans. Aerospace and Electronic Systems , AES-39(3):862-882, Jul. 2003.