Feature selection algorithms typically fall into two categories; Feature Ranking and Subset Selection. Feature Ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset Selection searches the set of possible features for the optimal subset.
In statistics, the most popular form of feature selection is stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as Branch and Bound and Piecewise Linear Networks.
Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.
Search approaches include:
Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' -- they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category.
Other available filter metrics include:
There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are Mallows' Cp statistic and Akaike information criterion (AIC). These add variables if the t statistic is bigger than .
Other criteria are Bayesian information criterion (BIC) which uses , minimum description length (MDL) which asymptotically uses but some argue this asymptote is not computed correctly, Bonnferroni / RIC which use , and a variety of new criteria that are motivated by false discovery rate (FDR) which use something close to .
Selecting features that correlate strongest to the classification variable has been called the "maximum-relevance selection". Many heuristic algorithms can be used, such as the sequential forward, backward, or floating selections.
On the other hand, features can be selected to be different from each other, while they still have high correlation to the classification variable. This scheme, called "minimum-Redundancy-Maximum-Relevance" selection (mRMR), has been found to be more powerful than the maximum relevance selection.
As a special case, statistical dependence between variables can be used instead of correlation. Mutual information can be used to quantify the dependency. In this case, it is shown that mRMR is an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable.