**Pipeline Hazards**

- Short lecture with 8th and final lab on 11/30
- 5th and final problem due on 12/1

**ARM 3-Stage Pipeline**

- **Fetch, Decode, and Execute Stages**
- Instructions are decoded in both the Fetch and Decode stages
- Register ports are “Read” in the Decode stage and “Written” at the end of the Execute stage
- PC+4, register reads for stores (StrData), and BX source (BXreg) are “delayed” for use in later stages

**Simple Instruction Flow**

Consider the following instruction sequence:

```
sub    r0, r1, r2
add    r3, r1, #2
and    r1, r1, #1
cmp    r0, r4
```

For each instruction:

- The instruction becomes available at the end of the Fetch stage
- Its operands are available at the end of the Decode stage
- Its destination register and the PSR are updated at the end of the Execute stage
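As a minimal sketch of how this flow lines up cycle by cycle, the following Python lays the four instructions out stage by stage. The `ideal_timeline` helper and its data layout are illustrative assumptions, not part of the lecture, and the model assumes no hazards at all.

```python
STAGES = ("Fetch", "Decode", "Execute")

def ideal_timeline(instructions):
    """Map each stage to the instruction it holds in every clock cycle.

    Assumes no hazards: instruction k enters Fetch in cycle k and moves
    one stage further down the pipeline on each clock.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    timeline = {stage: [None] * n_cycles for stage in STAGES}
    for k, instr in enumerate(instructions):
        for depth, stage in enumerate(STAGES):
            timeline[stage][k + depth] = instr
    return timeline

program = ["sub r0, r1, r2", "add r3, r1, #2", "and r1, r1, #1", "cmp r0, r4"]
timeline = ideal_timeline(program)

# Print one row per stage; "-" marks a cycle in which the stage is empty.
for stage in STAGES:
    cells = [(instr or "-").ljust(16) for instr in timeline[stage]]
    print(stage.ljust(8) + "".join(cells))
```

Each instruction appears one column later in each successive stage, so four instructions occupy six clock cycles (i through i+5) in a 3-stage pipeline.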

Time (in clock cycles)

<table>
<thead>
<tr>
<th>Pipeline</th>
<th>i</th>
<th>i+1</th>
<th>i+2</th>
<th>i+3</th>
<th>i+4</th>
<th>i+5</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch</td>
<td>sub r0, r1, r2</td>
<td>add r3, r1, #2</td>
<td>and r1, r1, #1</td>
<td>cmp r0, r4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Decode</td>
<td></td>
<td>sub r0, r1, r2</td>
<td>add r3, r1, #2</td>
<td>and r1, r1, #1</td>
<td>cmp r0, r4</td>
<td></td>
</tr>
<tr>
<td>Execute</td>
<td></td>
<td></td>
<td>sub r0, r1, r2</td>
<td>add r3, r1, #2</td>
<td>and r1, r1, #1</td>
<td>cmp r0, r4</td>
</tr>
</tbody>
</table>

**Pipeline Control Hazards**

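Pipeline hazards are situations where the next instruction cannot execute in the next clock cycle. This lecture covers two forms of hazard: CONTROL hazards, caused by branches, and DATA hazards, caused by an instruction needing a result that has not been written yet. (A third form, STRUCTURAL hazards, arises when two stages need the same hardware resource in the same cycle.)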

Consider the instruction sequence shown:

```
loop:  add r0, r0, r0
       cmp r0, #64
       ble loop
       and r1, r0, #7
       sub r1, r0, r1
```

Time (in clock cycles)

<table>
<thead>
<tr>
<th>Pipeline</th>
<th>i</th>
<th>i+1</th>
<th>i+2</th>
<th>i+3</th>
<th>i+4</th>
<th>i+5</th>
<th>i+6</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Fetch</strong></td>
<td>add r0, r0, r0</td>
<td>cmp r0, #64</td>
<td>ble loop</td>
<td>and r1, r0, #7</td>
<td>sub r1, r0, r1</td>
<td>add r0, r0, r0</td>
<td></td>
</tr>
<tr>
<td><strong>Decode</strong></td>
<td></td>
<td>add r0, r0, r0</td>
<td>cmp r0, #64</td>
<td>ble loop</td>
<td>and r1, r0, #7</td>
<td>???</td>
<td>???</td>
</tr>
<tr>
<td><strong>Execute</strong></td>
<td></td>
<td></td>
<td>add r0, r0, r0</td>
<td>cmp r0, #64</td>
<td>ble loop</td>
<td>???</td>
<td>???</td>
</tr>
</tbody>
</table>

When the branch instruction reaches the Execute stage, the next 2 instructions have already been fetched!

**Branch Fixes**

Problem: Two instructions following a branch are fetched before the branch decision is made (to take or not to take)

Solutions:

1. Program around it. Define the ISA such that the branch does not take effect until after the instructions in the "DELAY SLOTS" complete. This is how MIPS pipelines work. It leads to odd-looking code in tight (short) loops. Of course, you could always put NOPs in the delay slots.

2. Detect the branch decision as early as possible, and ANNUL instructions in the delay slots. This is what ARM does.

**Early Detect and Annul**

We can detect branch instructions (B, BL, or BX) in the Decode stage. The condition flags that a branch depends on are set no later than the instruction currently in the Execute stage, so we can make the branch decision while the branch is still in Decode. We then annul the following instruction by disabling its WERF and PSR updates, which turns it into a NOP.

<table>
<thead>
<tr>
<th>Pipeline</th>
<th>i</th>
<th>i+1</th>
<th>i+2</th>
<th>i+3</th>
<th>i+4</th>
<th>i+5</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fetch</td>
<td>add r0,r0,r0</td>
<td>cmp r0,#64</td>
<td>ble loop</td>
<td>and r1,r0,#7</td>
<td>add r0,r0,r0</td>
<td>cmp r0,#64</td>
</tr>
<tr>
<td>Decode</td>
<td></td>
<td>add r0,r0,r0</td>
<td>cmp r0,#64</td>
<td>ble loop</td>
<td>and r1,r0,#7</td>
<td>add r0,r0,r0</td>
</tr>
<tr>
<td>Execute</td>
<td></td>
<td></td>
<td>add r0,r0,r0</td>
<td>cmp r0,#64</td>
<td>ble loop</td>
<td>NOP</td>
</tr>
</tbody>
</table>
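A minimal Python sketch of early detect and annul for this loop follows. It is a trace-level model only: the branch outcomes are supplied by hand rather than computed from the PSR, and the `simulate` and `label` helpers and the tuple layout are assumptions made for illustration.

```python
NOP = "NOP"

# Each instruction is (text, branch_target); branch_target is None for
# non-branches and an index into the program for branches.
LOOP = [
    ("add r0,r0,r0", None),
    ("cmp r0,#64", None),
    ("ble loop", 0),
    ("and r1,r0,#7", None),
    ("sub r1,r0,r1", None),
]

def label(slot, executing=False):
    """Text shown for a pipeline slot; annulled work executes as a NOP."""
    if slot is None:
        return "-"
    text, _target, annulled = slot
    return NOP if (annulled and executing) else text

def simulate(program, branch_outcomes, n_cycles):
    """Return one (fetch, decode, execute) row of labels per clock cycle."""
    outcomes = iter(branch_outcomes)
    pc, decode, execute = 0, None, None
    rows = []
    for _ in range(n_cycles):
        fetch = program[pc] + (False,) if pc < len(program) else None
        rows.append((label(fetch), label(decode), label(execute, executing=True)))
        next_pc = pc + 1
        is_live_branch = decode is not None and decode[1] is not None and not decode[2]
        if is_live_branch and next(outcomes, False):
            next_pc = decode[1]                 # redirect the next fetch
            if fetch is not None:
                fetch = fetch[:2] + (True,)     # annul the instruction after the branch
        pc, decode, execute = next_pc, fetch, decode
    return rows

rows = simulate(LOOP, branch_outcomes=[True], n_cycles=6)
for cycle, row in enumerate(rows):
    print(f"i+{cycle}", *[cell.ljust(14) for cell in row])
```

With one taken branch, the `and` fetched behind `ble loop` still flows through Decode but reaches Execute as a NOP, and the fetch after the branch decision comes from `loop`.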

If we detect the branch in the Decode stage, the PSR state produced by the instruction in the Execute stage can be combined with the branch condition to select the next PC.

**The cost of taken branches**

A taken ARM branch effectively costs 2 cycles, versus 1 cycle when it is not taken. In a MIPS-like instruction set, one can often fill the delay slots with useful instructions, but they are executed whether or not the branch is taken.

The ARM approach is easier to understand, and since it does not “EXPOSE” the pipeline, it also allows a different number of pipeline stages to be implemented in future designs while preserving code compatibility.

Lastly, on ARM, many conditional branches can be eliminated using conditional execution, which pipelines beautifully!

**Data Pipeline Hazards**

There’s another problem with our code fragment!

An instruction's destination register is written at the end of the Execute stage. However, the following instruction might need this result as a source operand, and it reads its operands in Decode during that same cycle. Here, `cmp r0,#64` reads r0 before the preceding `add` has written it.

```
loop:  add r0,r0,r0
       cmp r0,#64
       ble loop
       and r1,r0,#7
       sub r1,r0,r1
```
**Data Hazards**

**Problem:** A source register is read in the Decode stage before an instruction in a later pipeline stage has written its new value.

**Solutions:**

1. Program around it. One could document the weird semantics: "You can’t reference the destination register of an instruction in the immediately following instruction." This would make assembly language even harder to understand. It would also expose the pipeline, once again making future improvements difficult to implement while maintaining code compatibility.

2. Hardware bypass multiplexers.

**Source Bypassing**

The idea here is to load the value that is about to be saved in the destination register into the pipeline registers that hold the ALU operands as well.
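A bypass MUX for a single ALU operand can be sketched in a few lines of Python. The function name `read_operand` and its parameters are hypothetical, chosen only for this illustration; real hardware makes the same selection with a comparator and a multiplexer.

```python
def read_operand(regfile, src, ex_dest, ex_result, ex_werf):
    """Bypass MUX for one ALU operand read in the Decode stage.

    If the instruction currently in Execute will write `src` (WERF asserted
    and matching destination), forward its result instead of the stale value
    still sitting in the register file.
    """
    if ex_werf and ex_dest == src:
        return ex_result
    return regfile[src]

# "add r0,r0,r0" is in Execute while "cmp r0,#64" is in Decode.
regfile = {"r0": 8, "r1": 3}
add_result = regfile["r0"] + regfile["r0"]   # computed, not yet written back

bypassed = read_operand(regfile, "r0", ex_dest="r0", ex_result=add_result, ex_werf=True)
stale = regfile["r0"]                        # what Decode would read without the bypass
unrelated = read_operand(regfile, "r1", ex_dest="r0", ex_result=add_result, ex_werf=True)
```

Here `bypassed` carries the doubled value that `cmp` should see, while `stale` is the old register file content. Gating on `ex_werf` matters because instructions that do not write a register (or that have been annulled) must not forward anything.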

We also need bypass MUXes on the StrReg and BXreg pipeline registers.

**Load/Store Stalls**

Load and store memory accesses are the actual bottleneck of the ARM pipeline. Also, recall that instructions and load/store data come from the same memory. Thus, we need to stall instruction fetching to allow for loads and stores.

```
loop:  ldr r0, [r1, #4]
       add r0, r0, #4
       str r0, [r1, #4]
       sub r0, r0, #4
       and r2, r2, #f
```

[Pipeline timing diagram: Fetch, Decode, and Execute rows versus time (in clock cycles), starting at cycle i]

**Load/Store stall implementation**

Disable loading of the pipeline registers for one clock when a load or store instruction reaches the Execute stage. This requires:

1. Enable lines on the PC and pipeline registers on the control path
2. A simple 2-state state machine that stalls the pipeline for 1 cycle to allow for the load/store memory cycle
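The 2-state stall machine can be sketched as follows. The state names `RUN` and `STALL`, the `step` function, and the cycle-counting helper are assumptions made for illustration; the sketch only counts how many clocks each instruction occupies the Execute stage.

```python
RUN, STALL = "RUN", "STALL"

def step(state, mem_op_in_execute):
    """Return (next_state, enable) for the PC and pipeline registers."""
    if state == RUN and mem_op_in_execute:
        return STALL, False   # hold everything for one clock (memory cycle)
    return RUN, True          # normal advance, or leaving the stall

def execute_cycles(instructions):
    """Count the clocks the given instructions spend in the Execute stage."""
    state, cycles, i = RUN, 0, 0
    while i < len(instructions):
        is_mem = instructions[i].split()[0] in ("ldr", "str")
        state, enable = step(state, is_mem)
        cycles += 1
        if enable:            # pipeline registers load: next instruction enters Execute
            i += 1
    return cycles

loop_body = [
    "ldr r0, [r1, #4]",
    "add r0, r0, #4",
    "str r0, [r1, #4]",
    "sub r0, r0, #4",
    "and r2, r2, #f",
]
total = execute_cycles(loop_body)
```

For the five-instruction loop body, `total` is the instruction count plus one extra clock for each load or store, since `ldr` and `str` each hold the Execute stage for two clocks.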

**Where does this leave us?**

Overall we can now nearly triple the clock rate. Instructions have a throughput of one-per-clock with the following caveats:

1. Taken branches take 2 cycles.
2. Loads and stores take 2 cycles.

You can pipeline an ARM CPU even more. There exist ARM implementations with 7, 8, and 9 pipeline stages, but the overhead of bypass paths and stall cases increases with pipeline depth.

---

3x speedup: a 100 MHz clock becomes 300 MHz

**Reality vs Specmanship**

Assume approximately 10% of executed instructions are branches, 80% of which are taken, and 15% of executed instructions are loads or stores. What sort of real speedup do we expect? Consider 100 instructions, with a 10 ns clock (100 MHz) before pipelining and a 3.333 ns clock (300 MHz) after:

\[
\text{Time}_{\text{before}} = (100 \times 1) \text{ clocks} \times 10 \times 10^{-9} \text{ sec/clock} = 1000 \times 10^{-9} \text{ secs}
\]

\[
\text{Clocks}_{\text{after}} = 10 \times (0.8 \times 2 + 0.2 \times 1) + 15 \times 2 + 75 \times 1 = 123 \text{ clocks}
\]

\[
\text{Time}_{\text{after}} = 123 \text{ clocks} \times 3.333 \times 10^{-9} \text{ sec/clock} = 410 \times 10^{-9} \text{ secs}
\]

\[
\text{Speedup} = \frac{\text{Time}_{\text{before}}}{\text{Time}_{\text{after}}} = \frac{1000}{410} \approx 2.439\times
\]
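The same arithmetic can be checked with a short Python sketch. The function name and parameter names are illustrative assumptions; the percentages and clock periods are the ones stated above.

```python
def clocks_per_100(branch_pct=10, taken_frac=0.8, mem_pct=15):
    """Clocks needed for 100 instructions on the pipelined design."""
    other_pct = 100 - branch_pct - mem_pct
    # Taken branches cost 2 clocks, untaken branches 1 clock.
    branch_clocks = branch_pct * (taken_frac * 2 + (1 - taken_frac) * 1)
    # Loads and stores cost 2 clocks, everything else 1 clock.
    return branch_clocks + mem_pct * 2 + other_pct * 1

before_ns = 100 * 1 * 10.0            # 100 instructions, 1 clock each, 10 ns clock
after_clocks = clocks_per_100()
after_ns = after_clocks * (10.0 / 3)  # 3x faster clock: about 3.333 ns
speedup = before_ns / after_ns

print(f"clocks after: {after_clocks:.1f}")
print(f"time after:   {after_ns:.1f} ns")
print(f"speedup:      {speedup:.3f}x")
```

Changing `taken_frac` or `mem_pct` shows how quickly the advertised 3x erodes as the instruction mix shifts toward taken branches and memory operations.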

**Next Time**

It appears memory access time is our real bottleneck. What tricks can be applied to improve CPU performance in this case?

- Interleaving
- Block-transfers
- Caching