In most practical cases, the testable information is given by a set of conserved quantities (average values of some moment functions) associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. An equivalence between conserved quantities and the corresponding symmetry groups implies a similar equivalence between these two ways of specifying the testable information in the maximum entropy method.
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular. Strictly speaking, the trial distributions, which do not maximize the entropy, are actually not probability distributions.
The principle was first expounded by E.T. Jaynes in his seminal papers of 1957 where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes offered a new and very general rationale why the Gibbsian method of statistical mechanics works. He suggested that the entropy in statistical mechanics, and the information entropy in information theory, are principally the same thing. Consequently, statistical mechanics should be seen just as a particular application of a general tool of logical inference and information theory.
The maximum entropy principle makes explicit our freedom in using different forms of prior information. As a special case, a uniform prior probability density (Laplace's principle of indifference) may be adopted. Thus, the maximum entropy principle is not just an alternative to the methods of inference of classical statistics, but an important conceptual generalization and correction of them.
are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
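As a sketch of how the Lagrange multiplier calculation goes in the discrete case, suppose (as assumed notation, not fixed by the text above) that the outcomes are x_1, ..., x_n with probabilities p_i = p(x_i), and that the testable information consists of m expectation constraints of the form Σ_i p_i f_k(x_i) = F_k. Introducing a multiplier λ_0 for normalization and a multiplier λ_k for each constraint, one may form the Lagrangian and set its derivative with respect to each p_i to zero:

$$\mathcal{L} = -\sum_{i=1}^{n} p_i \log p_i + \lambda_0 \Big( \sum_{i=1}^{n} p_i - 1 \Big) + \sum_{k=1}^{m} \lambda_k \Big( \sum_{i=1}^{n} p_i f_k(x_i) - F_k \Big)$$

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda_0 + \sum_{k=1}^{m} \lambda_k f_k(x_i) = 0$$

$$p_i = \exp(\lambda_0 - 1) \, \exp\Big[ \sum_{k=1}^{m} \lambda_k f_k(x_i) \Big]$$

Each p_i is therefore proportional to the exponential of a linear combination of the constraint functions, with λ_0 fixed by normalization. The sign of the multipliers is a matter of convention; some authors write the exponent with a minus sign.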
Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy probability distribution over n possible outcomes is the uniform distribution,

$$p_i = \frac{1}{n} \quad \text{for all } i \in \{1, \ldots, n\}.$$
The principle of maximum entropy can thus be seen as a generalization of the classical principle of indifference, also known as the principle of insufficient reason.
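The claim that the uniform distribution has the largest entropy when only normalization is imposed can be checked numerically. The following is a minimal sketch; the function name `entropy` and the example distributions are illustrative choices, not part of the original text.

```python
import math

def entropy(p):
    """Shannon entropy -sum(p_i log p_i) in nats; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 4
uniform = [1.0 / n] * n

# A few hypothetical distributions over the same n outcomes, all summing to one.
candidates = {
    "uniform": uniform,
    "skewed": [0.4, 0.3, 0.2, 0.1],
    "peaked": [0.97, 0.01, 0.01, 0.01],
}

for name, p in candidates.items():
    print(f"{name:8s} H = {entropy(p):.4f}")
```

The uniform distribution attains the value log n, and every other normalized distribution over the same outcomes falls below it.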
Furthermore, the probabilities must sum to one, giving the constraint

$$\sum_{i=1}^{n} p(x_i) = 1.$$
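The probability distribution with maximum information entropy subject to these constraints is

$$p(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right].$$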
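It is sometimes called the Gibbs distribution. The normalization constant is determined by

$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right]$$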
and is conventionally called the partition function. (Interestingly, the Pitman-Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)
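The λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to

$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m),$$

where F_k is the required expectation of f_k.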
These m simultaneous equations do not generally possess a closed form solution, and are usually solved by numerical methods.
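To illustrate the numerical step, here is a minimal sketch for a single constraint: a six-sided die whose mean is required to be 4.5 (the classic example used by Jaynes). It assumes the exponential form p_i ∝ exp(λ x_i) with a plus sign in the exponent, and uses plain bisection, which works here because the mean is an increasing function of λ. The function names `gibbs`, `mean_under` and `solve_lambda` are hypothetical helpers introduced for this example.

```python
import math

def gibbs(lam, xs):
    """Maximum entropy distribution p_i proportional to exp(lam * x_i)."""
    weights = [math.exp(lam * x) for x in xs]
    z = sum(weights)  # partition function Z(lam)
    return [w / z for w in weights]

def mean_under(lam, xs):
    """Expected value of x under the Gibbs distribution with multiplier lam."""
    return sum(p * x for p, x in zip(gibbs(lam, xs), xs))

def solve_lambda(target, xs, lo=-50.0, hi=50.0, tol=1e-12):
    """Find lam with mean_under(lam) == target by bisection.

    The mean is monotonically increasing in lam (its derivative is the
    variance), so bisection on a bracketing interval converges.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_under(mid, xs) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(4.5, faces)
probs = gibbs(lam, faces)

print("lambda =", round(lam, 4))
print("probabilities =", [round(p, 4) for p in probs])
print("mean =", round(mean_under(lam, faces), 6))
```

With several constraints the same idea applies, but a multidimensional root finder or a convex optimizer would normally replace the one-dimensional bisection.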
In the continuous case, the entropy to be maximized is

$$H_c = -\int p(x) \log \frac{p(x)}{m(x)} \, dx,$$

where m(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that it is known; we will discuss it further after the solution equations are given.
Relative entropy is usually defined as the Kullback-Leibler divergence of m from p (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
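A discrete analogue may make the relationship between relative entropy and ordinary entropy concrete. The sketch below assumes a finite set of outcomes, with the list `m` playing the role of the invariant measure (normalized here for convenience); the function names are illustrative. When `m` is uniform over n outcomes, the negative Kullback-Leibler divergence differs from the Shannon entropy only by the constant log n, so maximizing one maximizes the other.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, m):
    """D(p || m) = sum_i p_i log(p_i / m_i); assumes m_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.5, 0.3, 0.2]
n = len(p)
uniform = [1.0 / n] * n

relative_entropy = -kl_divergence(p, uniform)

print("H(p)              =", round(entropy(p), 6))
print("-D(p || uniform)  =", round(relative_entropy, 6))
print("H(p) - log n      =", round(entropy(p) - math.log(n), 6))
```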
We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We express this information as m constraints on the expectations of the functions fk, i.e. we require our epistemic probability density function p(x) to satisfy

$$\int p(x) f_k(x) \, dx = F_k, \quad k = 1, \ldots, m,$$

where each F_k is the required expectation of the corresponding f_k.
And of course, the probability density must integrate to one, giving the constraint

$$\int p(x) \, dx = 1.$$
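The probability density function with maximum Hc subject to these constraints is

$$p(x) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \, m(x) \exp\left[ \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right],$$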
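with the partition function determined by

$$Z(\lambda_1, \ldots, \lambda_m) = \int m(x) \exp\left[ \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right] dx.$$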
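As in the discrete case, the values of the λk parameters are determined by the constraints according to

$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m),$$

where F_k is the required expectation of f_k.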
The invariant measure function m(x) can be best understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. Then the maximum entropy probability density function is

$$p(x) = A \, m(x), \quad a < x < b,$$
where A is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
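As an illustration of the continuous case, the following sketch works through one hypothetical example: x lies in the interval (0, 1), the invariant measure is taken to be m(x) = 1, and the only testable information is that the mean of x is 0.3. Under these assumptions, and with the convention that the density is proportional to exp(λx), the partition function is Z(λ) = (e^λ − 1)/λ, and λ is found by requiring that the derivative of log Z equal the target mean. The function names and the choice of a finite-difference derivative are illustrative only.

```python
import math

def log_z(lam):
    """log Z(lam) for Z(lam) = integral_0^1 exp(lam * x) dx = (e^lam - 1) / lam."""
    if abs(lam) < 1e-8:
        return lam / 2.0  # leading term of the series around lam = 0
    return math.log(math.expm1(lam) / lam)

def mean_from_log_z(lam, h=1e-5):
    """Expectation of x, computed as d(log Z)/d(lam) by a central difference."""
    return (log_z(lam + h) - log_z(lam - h)) / (2.0 * h)

def solve_lambda(target, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection on lam; the mean is increasing in lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_from_log_z(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def quadrature_mean(lam, steps=20000):
    """Independent check: midpoint-rule estimate of the mean of x under exp(lam * x)."""
    dx = 1.0 / steps
    xs = [(i + 0.5) * dx for i in range(steps)]
    weights = [math.exp(lam * x) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)

target_mean = 0.3
lam = solve_lambda(target_mean)
check = quadrature_mean(lam)

print("lambda =", round(lam, 4))
print("mean by quadrature =", round(check, 6))
```

Because the target mean lies below the midpoint of the interval, the multiplier comes out negative and the density decays across (0, 1). A different invariant measure would change Z and therefore the solution.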
See Cover and Thomas for an excellent explanation of these ideas.
By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we do possess. Thus the maximum entropy distribution is the only reasonable distribution.
Suppose an individual wishes to make an epistemic probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of epistemic probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be

$$p_i = \frac{n_i}{N}, \quad i = 1, \ldots, m,$$
where ni is the number of quanta that were assigned to the ith proposition.
Now, in order to reduce the 'graininess' of the epistemic probability assignment, it will be necessary to use quite a large number of quanta of epistemic probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, our protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,

$$\Pr(\mathbf{p}) = W \cdot m^{-N},$$

where

$$W = \frac{N!}{n_1! \, n_2! \cdots n_m!}$$
is sometimes known as the multiplicity of the outcome.
The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, our protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize

$$\frac{1}{N} \log W = \frac{1}{N} \log \frac{N!}{n_1! \, n_2! \cdots n_m!}.$$
At this point, in order to simplify the expression, our protagonist takes the limit as N → ∞, i.e. as the epistemic probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, she finds

$$\lim_{N \to \infty} \left( \frac{1}{N} \log W \right) = -\sum_{i=1}^{m} p_i \log p_i = H(\mathbf{p}).$$
All that remains for our protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all "fair" random epistemic distributions, in the limit as the probability levels go from discrete to continuous.
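The limiting step in this derivation can be checked numerically. The sketch below fixes a hypothetical assignment p = (0.5, 0.3, 0.2), chooses N so that every n_i = N p_i is an integer, and compares (1/N) log W with the entropy of p as N grows. Factorials are evaluated through `math.lgamma` to avoid overflow; the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_multiplicity(counts):
    """log W = log( N! / (n_1! n_2! ... n_m!) ), using lgamma(n + 1) = log(n!)."""
    total = sum(counts)
    return math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)

base_counts = [5, 3, 2]  # corresponds to p = (0.5, 0.3, 0.2)
p = [c / sum(base_counts) for c in base_counts]
h = entropy(p)

gaps = []
for scale in [1, 10, 100, 1000, 10000, 100000]:
    counts = [c * scale for c in base_counts]
    n_total = sum(counts)
    per_quantum = log_multiplicity(counts) / n_total
    gaps.append(h - per_quantum)
    print(f"N = {n_total:>7d}  (1/N) log W = {per_quantum:.6f}  H(p) = {h:.6f}")
```

The gap between the two quantities shrinks as N increases, in line with the correction terms of Stirling's approximation vanishing after division by N.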