The minimum description length (MDL) principle helps us to choose among alternate theories. Closely related to Occam’s razor, it states that the best explanation for a set of data is one that allows us to compress the data as much as possible; to encode it in the shortest way. The more a data set can be compressed, the more regularities appear, and the more can be learned from the data.
“…given some data, the simplest hypothesis explaining it is preferable.” ~Rissanen (1989)
With MDL, Kholmogorov complexity is used to find the “simplest” hypothesis. Kolmogorov complexity is the shortest code which would cause a Turing machine to print out a set of numbers. For example, the following scatter plot shows a series of coordinates:
These points could be described in several ways involving linear equations and lists of coordinate points on a Cartesian plane. However, the shortest way to describe the points is:
- The coordinate of the first point,
- The direction of the line,
- Distance between points,
- Total number of points.
The above “code” is therefore the Kholgomorov complexity for the above series of points.
In an ideal world, we could use the minimum description length principle to choose a model by using Kholmogorov Complexity. Comparing these shortest codes for different models, it would be easy to see which was the absolute shortest; in ideal MDL this model would be chosen.
This works theoretically because it doesn’t matter what (general purpose) computer language is used: the lengths of the shortest programs printed out will only differ by a constant, so the differences between programs for various models are stable.
There is just one problem with this method: Kolmogorov Complexity, while it would be incredibly useful if we could get a calculator to spit it out, happens to be uncomputable. There isn’t any general-use computer program, or, theoretically, any possibility of writing one, that would find the shortest program that prints A for any data set or model A.
Practical Minimum Description Length Theory
Since we can’t do what we’d like to do—summarize the results of every model in ‘shortest possible code’ and see which is shorter—we need to work with what we can. Rather than using general-purpose computer languages, we use more specific methods of summarizing our models; instead of general-purpose languages, we use specific languages that, while they couldn’t describe anything, at least describe our models. We can use probability distributions to describe our data, and linear regression models to determine what fits best.
Grunwald, Peter D. The Minimum Description Length Principle, Chapter 1. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.159.403&rep=rep1&type=pdf on April 26, 2018
Singh, Aarti. 10-704 Information Processing and Learning Class Notes: Lecture 13: Minimum Description Length. Retrieved from http://www.cs.cmu.edu/~aarti/Class/10704/lec13-MDL.pdf on April 26, 2018
Risannen, Jorma. Minimum Description Length (2008) Scholarpedia, 3(8):6727. Retrieved from http://www.scholarpedia.org/article/Minimum_description_length on April 26, 2018
Hansen, Mark H., Yu, Bin. Model Selection and the Principle of Minimum Description Length.
Retrieved from https://www.stat.berkeley.edu/~binyu/summer08/mdlrevew1.pdf on April 27, 2018
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific.------------------------------------------------------------------------------
Confused and have questions? Head over to Chegg and use code “CS5OFFBTS18” (exp. 11/30/2018) to get $5 off your first month of Chegg Study, so you can understand any concept by asking a subject expert and getting an in-depth explanation online 24/7.
Comments? Need to post a correction? Please post a comment on our Facebook page.