Definitions

# Branch predictor

In computer architecture, a branch predictor is the part of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not. This is called branch prediction. Branch predictors are crucial in today's modern, superscalar processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved.

Almost all pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Many earlier microprogrammed CPUs did not do branch prediction because there was little or no performance penalty for altering the flow of the instruction stream.

Branch prediction is not the same as branch target prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch target prediction attempts to guess the target of the branch or unconditional jump before it is computed by parsing the instruction itself.

## History

Microprogrammed processors took multiple cycles per instruction, and generally did not require branch prediction. The VAX 9000 was both microprogrammed and pipelined, and performed branch prediction.

The Burroughs B4900, a cacheless, microprogrammed machine, released in ~1982 was pipelined and used branch prediction. The B4900 branch prediction history state was stored back into the in-memory instructions during program execution. The B4900 implemented 4-state branch prediction by using 4 semantically equivalent branch opcodes to represent each branch operator type. The opcode used indicated the history of that particular branch instruction. If the hardware determined that the branch prediction state of a particular branch needed to be updated, it would rewrite the opcode with the semantically equivalent opcode that hinted the proper history. This scheme obtained a 93% hit rate. US patent 4,435,756 and others were granted on this scheme.

The first commercial RISC processors, MIPS and SPARC, did only trivial "not-taken" branch prediction. Because they used branch delay slots, fetched just one instruction per cycle, and executed in-order, there was no performance loss. Later, the R4000 used the same trivial "not-taken" branch prediction, and lost two cycles to each taken branch because the branch resolution recurrence was four cycles long.

Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064, the MIPS R8000, and the IBM POWER series. These processors all relied on one-bit or simple bimodal predictors.

The DEC Alpha EV6 uses a next-line predictor overridden by a combined local predictor and global predictor, where the combining choice is made by a bimodal predictor.

The AMD K8 has a combined bimodal and global predictor, where the combining choice is another bimodal predictor. This processor caches the base and choice bimodal predictor counters in bits of the L2 cache otherwise used for ECC. As a result, it has effectively very large base and choice predictor tables, and parity rather than ECC on instructions in the L2 cache. Parity is just fine, since any instruction suffering a parity error can be invalidated and refetched from memory.

The Alpha EV8 (cancelled late in design) had a minimum branch misprediction penalty of 14 cycles. It was to use a complex but fast next line predictor overridden by a combined bimodal and majority-voting predictor. The majority vote was between the bimodal and two gskew predictors.

## Implementation

### Trivial prediction

The early implementations of SPARC and MIPS (two of the first commercial RISC architectures) did trivial branch prediction: they always predicted that a branch (or unconditional jump) would not be taken, so they always fetched the next sequential instruction. Only when the branch or jump was evaluated and found to be taken did the instruction fetch pointer get set to a non-sequential address.

Both CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Both architectures defined branch delay slots in order to utilize these fetched instructions.

### Static prediction

Processors that implement "Static prediction" predict that backward-pointing branches will be taken (assuming that the backwards branch is the bottom of a program loop), and forward-pointing branches will not be taken (assuming they are early exits from the loop or other processing code). This only mispredicts the very last branch of a loop.

Static prediction is used as a fall-back technique in most processors with dynamic branch prediction when there isn't any information for dynamic predictors to use. Both the Motorola MPC7450 (G4e) and the Intel Pentium 4 use this technique as a fall-back.

### Next line prediction

Some superscalar processors (MIPS R8000, DEC Alpha EV6 and EV8) fetched with each line of instructions a pointer to the next line. This next line predictor is not directly comparable to the other predictors listed here because it handles branch target prediction as well as branch direction prediction.

When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.

Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.

The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.

### Bimodal branch prediction

A bimodal branch predictor has a table of two-bit entries, indexed with the least significant bits of the instruction addresses. Unlike the instruction cache, bimodal predictor entries typically do not have tags, and so a particular counter may be mapped to different branch instructions (this is called branch interference or branch aliasing), in which case it is likely to be less accurate. Each counter has one of four states:

• Strongly not taken
• Weakly not taken
• Weakly taken
• Strongly taken

When a branch is evaluated, the corresponding entry is updated according to the underlying Moore machine. In the case of two-bit saturating counters ("2-bit Smith Counters with Saturation"), branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The primary benefit of this two-bit saturating counter scheme is that loop closing branches are always predicted taken. A one-bit scheme (like the R8000), mispredicts both the first and last branch of a loop. A two-bit scheme mispredicts just the last branch. Similarly, on heavily biased branches which almost always go one way, a one-bit scheme mispredicts twice for each odd branch, and a two-bit scheme mispredicts once.

On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.

Because the bimodal counter table is indexed with the instruction address bits, a superscalar processor can split the table into separate SRAMs for each instruction fetched, and fetch a prediction for every instruction in parallel with fetching the instruction, so that the branch prediction is available as soon as the branch is decoded.

### Local branch prediction

Bimodal branch prediction mispredicts the exit of every loop. For loops which tend to have the same loop count every time (and for many other branches with repetitive behavior), we can do much better.

Local branch predictors keep two tables. The first table is the local branch history table. It is indexed by the low-order bits of each branch instruction's address, and it records the taken/not-taken history of the $n$ most recent executions of the branch.

The other table is the pattern history table. Like the bimodal predictor, this table contains bimodal counters; however, its index is generated from the branch history in the first table. To predict a branch, the branch history is looked up, and that history is then used to look up a bimodal counter which makes a prediction.

On the SPEC'89 benchmarks, very large local predictors saturate at 97.1% correct.

Local prediction is slower than bimodal prediction because it requires two sequential table lookups for each prediction. A fast implementation would use a separate bimodal counter array for each instruction fetched, so that the second array access can proceed in parallel with instruction fetch. These arrays are not redundant, as each counter is intended to store the behavior of a single branch.

### Global branch prediction

Global branch predictors make use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches. We can keep a single shift register updated with the recent history of every branch executed, and use this value to index into a table of bimodal counters. This scheme, by itself, is only better than the bimodal scheme for large table sizes, and is never as good as local prediction.

If, instead, we index the table of bimodal counters with the recent history concatenated with a few bits of the branch instruction's address, we get the gselect predictor. Gselect does better than local prediction for small table sizes, and local prediction is only slightly better for table storage larger than 1KB.

We can do slightly better than gselect by XORing the branch instruction address with the global history, rather than concatenating. The result is gshare, which is a little better than gselect for tables larger than 256 bytes.

On the SPEC'89 benchmarks, very large gshare predictors saturate at 96.6% correct, which is just a little worse than large local predictors.

Gselect and gshare are easier to make fast than local prediction, because they require a single table lookup per branch. As with bimodal prediction, the table can be split so that parallel lookups can be made for each instruction fetched, so that the table lookup can proceed in parallel with instruction load.

### Combined branch prediction

Scott McFarling proposed combined branch prediction in his 1993 paper . Combined branch prediction is about as accurate as local prediction, and almost as fast as global prediction.

Combined branch prediction uses three predictors in parallel: bimodal, gshare, and a bimodal-like predictor to pick which of bimodal or gshare to use on a branch-by-branch basis. The choice predictor is yet another 2-bit up/down saturating counter, in this case the MSB choosing the prediction to use. In this case the counter is updated whenever the bimodal and gshare predictions disagree, to favor whichever predictor was actually right.

On the SPEC'89 benchmarks, such a predictor is about as good as the local predictor.

Another way of combining branch predictors is to have e.g. 3 different branch predictors, and merge their results by a majority vote.

Predictors like gshare use multiple table entries to track the behavior of any particular branch. This multiplication of entries makes it much more likely that two branches will map to the same table entry (a situation called aliasing), which in turn makes it much more likely that prediction accuracy will suffer for those branches. Once you have multiple predictors, it is beneficial to arrange that each predictor will have different aliasing patterns, so that it is more likely that at least one predictor will have no aliasing. Combined predictors with different indexing functions for the different predictors are called gskew predictors, and are analogous to skewed associative caches used for data and instruction caching.

### Agree prediction

Another technique to reduce destructive aliasing within the pattern history tables is an agree predictor. Some method is used to establish a relatively static prediction for the branch, perhaps a bimodal predictor or hint bits within the branch instruction. Another predictor (e.g. a gskew predictor) makes predictions, but rather than predicting taken/not-taken, the predictor predicts agree/disagree with the base prediction.

The intention is that if branches covered by the gskew predictor tend to be a bit biased in one direction, perhaps 70%/30%, then all those biases can be aligned so that the gskew pattern history table will tend to have more agree entries than disagree entries. This reduces the likelihood that two aliasing branches would best have opposite values in the PHT.

Agree predictors work well with combined predictors, because the combined predictor usually has a bimodal predictor which can be used as the base for the agree predictor. Agree predictors do less well with branches that are not biased in one direction, if that causes the base predictor to give changing predictions. So an agree predictor may work best as part of a three-predictor scheme, with one agree predictor and another non-agree type predictor.

### Overriding branch prediction

The EV6 and EV8 cores used a fast single-cycle next line predictor to handle the branch target recurrence and provide a simple and fast branch prediction. Because the next line predictor is so inaccurate, and the branch resolution recurrence takes so long, both cores have two-cycle secondary branch predictors which can override the prediction of the next line predictor at the cost of a single lost fetch cycle. Configurations like this are referred to as prophet/critic hybrid predictors.

Since a fetch of 4 instructions or more may contain more than one branch, an overriding predictor will sometimes have to predict a next PC which is neither the fall-through nor the earlier predicted next line. The overriding predictor usually extracts the target address from the instruction bytes themselves since they are available by the time the predictor must generate a predicted next fetch address.

### Neural branch predictors

The first dynamic neural branch predictors (LVQ-predictors and perceptrons) were proposed by Prof. Lucian Vintan (Lucian Blaga University of Sibiu, Romania), in his paper entitled "Towards a High Performance Neural Branch Predictor", Proceedings of The International Joint Conference on Neural Networks - IJCNN '99, Washington DC, USA, 1999. The neural branch predictor research was strongly developed further by Dr. Daniel Jimenez (Rutgers University, USA). In 2001, (HPCA Conference) it was presented the first perceptron predictor that was feasible to implement in hardware.

The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor (he also used a gshare/perceptron overriding hybrid predictors).

The main disadvantage of the perceptron predictor is its high latency. Even after taking advantage of high-speed arithmetic tricks, the computation latency is relatively high compared to the clock period of many modern microarchitectures. In order to reduce the prediction latency, Jimenez proposed in 2003 the fast-path neural predictor, where the perceptron predictor chooses its weights according to the current branch’s path, rather than according to the branch’s PC. Many other researchers developed this concept (A. Seznec, M. Monchiero, D. Tarjan & K. Skadron, V. Desmet, Akkary et al, K. Aasaraai, Michael Black, etc.)

The neural branch predictor concept is very promising. Most of the state of the art branch predictors are using a perceptron predictor (see Intel's "Championship Branch Prediction Competition" ). Intel already implements this idea in one of the IA-64's simulators (2003).