The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision $p$ and the recall $r$ of the test to compute the score: $p$ is the number of correct results divided by the number of all returned results, and $r$ is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst value at 0.
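The two ratios defined above can be sketched in Python for a retrieval-style setting. This is a minimal illustration, not a reference implementation: the function name `precision_recall`, the document identifiers, and the choice to return 0.0 for an empty set are all assumptions made for the example.

```python
def precision_recall(returned, relevant):
    """Return (p, r) for the results a system returned and the results
    that should have been returned (hypothetical helper)."""
    returned, relevant = set(returned), set(relevant)
    correct = len(returned & relevant)  # results that are both returned and relevant
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    p = correct / len(returned) if returned else 0.0
    r = correct / len(relevant) if relevant else 0.0
    return p, r


# Made-up example: 4 results returned, 3 should have been returned, 2 overlap.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(returned, relevant)
print(p, r)  # p = 2/4, r = 2/3
```

Using sets means duplicate results are counted once, which is a design choice of this sketch and may not suit every evaluation protocol.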
The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{p \cdot r}{p + r}$$
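To see what the harmonic mean does in practice, the following sketch computes the F1 score for a deliberately unbalanced pair of values and compares it with the arithmetic mean. The function name `f1_score` and the sample values are assumptions for illustration; `statistics.harmonic_mean` from the Python standard library is used only as a cross-check.

```python
from statistics import harmonic_mean


def f1_score(p, r):
    """Harmonic mean of precision p and recall r (hypothetical helper)."""
    if p + r == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * p * r / (p + r)


p, r = 0.9, 0.1  # made-up values: high precision, low recall
f1 = f1_score(p, r)
arithmetic = (p + r) / 2
library = harmonic_mean([p, r])

print(f1, arithmetic, library)  # f1 is about 0.18, far below the arithmetic mean of 0.5
```

The harmonic mean is pulled toward the smaller of the two values, so a system cannot reach a high F1 score by doing well on only one of precision and recall.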
The general formula for non-negative real $\beta$ is:
$$F_\beta = (1 + \beta^2) \cdot \frac{p \cdot r}{\beta^2 \cdot p + r}$$
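The formula in terms of Type I and Type II errors, where $\mathrm{TP}$ is the number of true positives, $\mathrm{FP}$ the number of false positives (Type I errors), and $\mathrm{FN}$ the number of false negatives (Type II errors):
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{TP}}{(1 + \beta^2) \cdot \mathrm{TP} + \beta^2 \cdot \mathrm{FN} + \mathrm{FP}}$$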
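The count-based form can be checked against the precision/recall form numerically. The sketch below assumes the standard expressions $F_\beta = (1+\beta^2)\,\mathrm{TP} / ((1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP})$ and $F_\beta = (1+\beta^2)\,p\,r / (\beta^2 p + r)$; the function name `f_beta_from_counts` and the counts are made up for the example.

```python
def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives (Type I errors)
    and false negatives (Type II errors). Hypothetical helper."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumption: return 0.0 when there are no positives of any kind.
    return (1 + b2) * tp / denom if denom > 0 else 0.0


tp, fp, fn = 8, 2, 4          # made-up confusion counts
p = tp / (tp + fp)            # precision
r = tp / (tp + fn)            # recall

via_counts = f_beta_from_counts(tp, fp, fn, beta=2.0)
via_pr = (1 + 2.0 ** 2) * p * r / (2.0 ** 2 * p + r)

print(via_counts, via_pr)     # both routes give the same value, 40/58
```

Working from counts avoids two intermediate divisions, which can be convenient when precision or recall alone would be undefined.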
Two other commonly used F measures are the $F_2$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.
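The effect of the weighting can be illustrated with a small sketch that evaluates the same precision and recall at β = 0.5, 1, and 2. It assumes the standard form $F_\beta = (1+\beta^2)\,p\,r / (\beta^2 p + r)$; the function name `f_beta` and the input values are hypothetical.

```python
def f_beta(p, r, beta=1.0):
    """General F-measure for precision p, recall r and weight beta >= 0
    (hypothetical helper)."""
    b2 = beta ** 2
    denom = b2 * p + r
    # Assumption: return 0.0 when the denominator vanishes.
    return (1 + b2) * p * r / denom if denom > 0 else 0.0


p, r = 0.5, 0.8  # made-up values where recall is better than precision

f_half = f_beta(p, r, beta=0.5)  # emphasizes precision
f_one = f_beta(p, r, beta=1.0)   # balanced F1
f_two = f_beta(p, r, beta=2.0)   # emphasizes recall

print(f_half, f_one, f_two)      # increasing, because recall exceeds precision here
```

When recall is the stronger of the two values, the score rises with β; when precision is stronger, the ordering reverses.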
The F-measure was derived by van Rijsbergen (1979) so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - \left(\frac{\alpha}{p} + \frac{1 - \alpha}{r}\right)^{-1}$. Their relationship is $F_\beta = 1 - E$, where $\alpha = \frac{1}{1 + \beta^2}$.
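The relationship between the two measures can be verified numerically. The sketch below assumes the effectiveness measure has the form $E = 1 - (\alpha/p + (1-\alpha)/r)^{-1}$ with $\alpha = 1/(1+\beta^2)$, and that $F_\beta = (1+\beta^2)\,p\,r / (\beta^2 p + r)$; the function names and sample values are hypothetical, and both precision and recall are assumed to be strictly positive.

```python
import math


def effectiveness(p, r, alpha):
    """Effectiveness measure E for precision p > 0 and recall r > 0
    (form assumed as stated in the lead-in)."""
    return 1 - 1 / (alpha / p + (1 - alpha) / r)


def f_beta(p, r, beta):
    """General F-measure for precision p, recall r and weight beta."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


p, r = 0.6, 0.9  # made-up values
results = []
for beta in (0.5, 1.0, 2.0):
    alpha = 1 / (1 + beta ** 2)
    f = f_beta(p, r, beta)
    e = effectiveness(p, r, alpha)
    results.append(math.isclose(f, 1 - e))  # F_beta should equal 1 - E

print(results)  # [True, True, True]
```

Substituting $\alpha = 1/(1+\beta^2)$ into $E$ and simplifying gives the same identity algebraically, so the numerical check is only a sanity test of the formulas.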