In particular, the information gain about a random variable $X$ obtained from an observation that a random variable $A$ takes the value $A = a$ is the Kullback-Leibler divergence $D_{\mathrm{KL}}(p(x \mid a) \,\|\, p(x \mid I))$ of the prior distribution $p(x \mid I)$ for $x$ from the posterior distribution $p(x \mid a)$ for $x$ given $a$.
The expected value of the information gain is the mutual information $I(X; A)$ of $X$ and $A$, i.e. the reduction in the entropy of $X$ achieved by learning the state of the random variable $A$.
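The relationship between the per-observation gain and its expectation can be checked numerically. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables; the names `joint`, `marginal`, `posterior` and the probabilities are illustrative and not taken from any source. It computes the Kullback-Leibler divergence of the prior from the posterior for each observed value of A, averages those gains over A, and compares the result with the entropy reduction H(X) - H(X|A).

```python
import math

# Hypothetical joint distribution p(x, a); the numbers are made up for illustration.
joint = {
    ("x0", "a0"): 0.4,
    ("x1", "a0"): 0.1,
    ("x0", "a1"): 0.1,
    ("x1", "a1"): 0.4,
}

def marginal(joint, index):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def kl_divergence(post, prior):
    """D_KL(post || prior) in bits."""
    return sum(p * math.log2(p / prior[x]) for x, p in post.items() if p > 0)

p_x = marginal(joint, 0)  # prior over X
p_a = marginal(joint, 1)  # distribution of the observed variable A

def posterior(a):
    """p(x | a) obtained from the joint distribution."""
    return {x: joint[(x, a)] / p_a[a] for x in p_x}

# Information gain for each possible observation A = a.
gains = {a: kl_divergence(posterior(a), p_x) for a in p_a}

# Expected gain, averaged over the distribution of A.
expected_gain = sum(p_a[a] * gains[a] for a in p_a)

# Mutual information computed as the entropy reduction H(X) - H(X | A).
conditional_entropy = sum(p_a[a] * entropy(posterior(a)) for a in p_a)
mutual_information = entropy(p_x) - conditional_entropy

print(gains)
print(expected_gain, mutual_information)
```

For this particular distribution the two printed quantities agree up to floating-point rounding, at roughly 0.278 bits.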
In machine learning, this concept can be used to define a preferred sequence of attributes to investigate so as to narrow down the state of X as rapidly as possible. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually, an attribute with high information gain should be preferred to other attributes.
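The greedy choice of attributes can be illustrated with a simplified ID3-style recursion. This is only a sketch under stated assumptions, not a full decision-tree learner: the toy data set, the attribute names `outlook` and `windy`, the target key `play`, and the helper names are all hypothetical. At each stage the attribute with the highest information gain on the remaining examples is chosen, so the next attribute examined depends on the outcome of the previous test.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def build_tree(rows, attributes, target):
    """Greedy recursion: split on the attribute with the highest gain."""
    labels = [r[target] for r in rows]
    # Stop when the examples are pure or no attributes remain; return the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, rest, target)
    return {best: branches}

# Hypothetical toy data, made up for illustration.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

On this data `outlook` has the higher gain (about 0.459 bits against about 0.252 bits for `windy`), so it becomes the root; `windy` is then examined only on the `rainy` branch, where it still distinguishes the outcomes.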
General definition
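In general terms, the expected information gain is the change in information entropy $H$ from a prior state to a state that takes some information as given:

$$IG(T, a) = H(T) - H(T \mid a)$$

where $T$ denotes the prior state and $a$ the information taken as given.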
Formal definition
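Let $\mathrm{Attr}$ be the set of all attributes and $\mathrm{Ex}$ the set of all training examples. The function $\mathrm{value}(x, a)$, with $x \in \mathrm{Ex}$ and $a \in \mathrm{Attr}$, gives the value of a specific example $x$ for attribute $a$, and $H$ denotes the entropy.
The information gain for an attribute $a \in \mathrm{Attr}$ is defined as follows:

$$IG(\mathrm{Ex}, a) = H(\mathrm{Ex}) - \sum_{v \in \mathrm{values}(a)} \frac{|\{x \in \mathrm{Ex} \mid \mathrm{value}(x, a) = v\}|}{|\mathrm{Ex}|} \, H(\{x \in \mathrm{Ex} \mid \mathrm{value}(x, a) = v\})$$

where $\mathrm{values}(a)$ is the set of values that attribute $a$ takes on the examples in $\mathrm{Ex}$.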
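As a worked instance of this definition, the sketch below evaluates the gain of two attributes on a four-example training set. The identifiers `Ex`, `value`, `H` and `IG`, the attribute names `colour` and `size`, and the `label` key are illustrative choices made here, and the entropy is assumed to be taken over the class labels of the examples.

```python
import math
from collections import Counter

# Hypothetical training set: each example maps attribute names to values.
# The "label" key (an assumption of this sketch) holds the class of the example.
Ex = [
    {"colour": "red",  "size": "small", "label": "pos"},
    {"colour": "red",  "size": "large", "label": "pos"},
    {"colour": "blue", "size": "small", "label": "neg"},
    {"colour": "blue", "size": "large", "label": "neg"},
]

def value(x, a):
    """Value of example x for attribute a."""
    return x[a]

def H(examples):
    """Entropy (bits) of the class labels in a set of examples."""
    counts = Counter(value(x, "label") for x in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def IG(examples, a):
    """H(examples) minus the size-weighted entropy of each subset sharing a value of a."""
    weighted = 0.0
    for v in {value(x, a) for x in examples}:
        subset = [x for x in examples if value(x, a) == v]
        weighted += len(subset) / len(examples) * H(subset)
    return H(examples) - weighted

print(IG(Ex, "colour"))  # colour separates the classes completely: 1 bit
print(IG(Ex, "size"))    # size leaves the class distribution unchanged: 0 bits
```

Here the labels are evenly split, so the entropy of the full set is 1 bit. Splitting on `colour` yields two pure subsets and recovers the whole bit, whereas splitting on `size` yields two subsets that are each still evenly split and gains nothing.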