# Pipelined CPUs



Where are the registers? Second Quiz on Friday. Covers lectures 8-14. Open book, open note, no computers or calculators.

Study Chapter 4 of Text

### Review of CPU Performance

$$MIPS = \frac{Freq}{CPI}$$

MIPS = Millions of Instructions/Second

Freq = Clock Frequency, MHz

CPI = Clocks per Instruction

#### To Increase MIPS:

- 1. DECREASE CPI.
  - RISC simplicity reduces CPI to 1.0.
  - CPI below 1.0? State-of-the-art multiple instruction issue
- 2. INCREASE Freq.
  - Freq limited by delay along longest combinational path; hence
  - PIPELINING is the key to improving performance.



The diagram on the left illustrates the Data Flow of miniMIPS

Wanted: longest path

#### Complications:

- •some apparent paths aren't "possible"
- •functional units have variable execution times (eg, ALU)
- •time axis is not to scale (eg, t<sub>PD.MEM</sub> is very big!)

### Where Are the Bottlenecks?



# Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:



Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to

Instruction Decode/Register File stage: Decode control lines and select source operands

ALU stage: Performs specified operation, passes result to

Memory stage: If it's a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to

Write-Back stage: writes result back into register file.

### miniMIPS Timing

Different instructions use various parts of the data path.



6 n5 Instruction Fetch
2 n5 Instruction Decode
2 n5 Register Prop Delay
5 n5 ALU Operation
4 n6 Branch Target
6 n6 Data Access
1 n6 Register Setup

This is an example of a "Asynchronous Globally-Timed" control strategy (see Lecture 16). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!



# Uniform miniMIPS Timing

With a fixed clock period, we have to allow for the worse case.



6 nS Instruction Fetch
2 nS Instruction Decode

2 nS Register Prop Delay

5 nS ALU Operation

4 nS 🔲 Branch Target

6 nS Data Access

1 nS 📘 Register Setup

By accounting for the "worse case" path (i.e. allowing time for each possible combination of operations) we can implement a "Synchronous Globally -Timed" control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!



# Step 1: A 2-Stage Pipeline



# 2-Stage Pipe Timing

Improves performance by increasing instruction throughput. Ideal speedup is number of pipeline stages in the pipeline.



6 nS Instruction Fetch 2 n5 Instruction Decode 2 nS Register Prop Delay 5 nS ALU Operation 4 nS Branch Target 6 nS Data Access

1 nS Register Setup

By partitioning each instruction cycle into a "fetch" stage and an "execute" stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores.



# 2-Stage w/2-Cycle Loads & Stores

Further improves performance, with slight increase in control complexity. Some 1st generation (pre-cache) RISC processors used this approach.



2 nS Register Prop Delay
5 nS ALU Operation

4 nS Branch Target

6 nS Data Access

1 nS 📘 Register Setup

The clock rate of this variant is nearly twice that of our original design. Does that mean it is twice as fast? Not likely. In practice, as many as 30% of instructions access memory. Thus, the effective speed up is:

speed up = 
$$\frac{\text{old clock period}}{\text{new clock period}(0.7+2*0.3)}$$
$$= \frac{20}{8(1.3)} = 1.923$$



# 2-Stage Pipelined Operation

Consider a sequence of instructions:

addi xor sltiu srl

\$t2,\$t1,1 \$t2,\$t1,\$t2 \$t3,\$t2,1 \$t2,\$t2,1

Executed on our 2-stage pipeline:



Kecall "Pipeline Diagrame" from Lecture 16.

| 1            |     | ———— TIME (cycles) ———→ |      |       |            |            |     |     |  |  |
|--------------|-----|-------------------------|------|-------|------------|------------|-----|-----|--|--|
|              | _   | į i                     | i+1  | i+2   | i+3        | i+4        | i+5 | i+6 |  |  |
| Pipeline     | IF  | addi                    | xor  | sltiu | <b>ørl</b> | •••        |     |     |  |  |
| – Pif        | EXE |                         | addi | xor   | sltiu      | <b>ørl</b> | ••• |     |  |  |
| $\downarrow$ |     | į                       | İ    | i     |            |            |     |     |  |  |

It can't be this easy!?

# Pipeline Control Hazards

BUT consider instead:

loop: add \$t1,\$t1,\$t0

srl \$t2,\$t2,1
bne \$t2,\$0,loop

andi \$t0,\$t2,1

• • •

| _   | ı i | I i+1 | i+2 | ı i+3 | i+4 | i+5 | i+6 |
|-----|-----|-------|-----|-------|-----|-----|-----|
| IF  | add | srl   | bne | andi  | ••• |     |     |
| EXE |     | add   | srl | bne   | ?   | ••• |     |

This is the cycle where the branch decision is made... but we've already fetched the following instruction which should be executed only if branch is not taken!

Pipelining HAZARDS are situations where the next instruction cannot execute in the next clock cycle. There are two forms of hazards, CONTROL and STRUCTURAL.

## Branch Delay Slots

PROBLEM: One (or more) instructions following a branch are fetched before the branch decision is made (to take, or not to take).

#### POSSIBLE SOLUTIONS:

- 1. Make hardware "annul" the instructions following taken branches, e.g., by disabling WERF and WR.
- Solution 2 implies breaking the sequential semantics of the ISA. Logically, the branch takes place after instructions in the DELAY SLOTS are
- 2. "Program around it". Either
  - a) Follow each BNE/BEQ with a NOP instruction; or
  - b) Make compiler clever enough to move USEFUL instructions into the branch delay slots
    - i. Always execute instructions in delay slots
    - ii. Conditionally execute instructions in delay slots



• Delay slots also apply to jump instructions: j, jal, and jr

### **Branch Solution 1**

Make the hardware annul instructions in the branch delay slots of a taken branch.

loop: add \$t1,\$t1,\$t0
 srl \$t2,\$t2,1
 bne \$t2,\$0,loop
 andi \$t0,\$t2,1

. . .

| _   | ı i | ı i+1 | i+2 | i+3  | ı i+4         | i+5        | i+6 |
|-----|-----|-------|-----|------|---------------|------------|-----|
| IF  | add | srl   | bne | andi | add           | <b>ørl</b> | bne |
| EXE |     | add   | srl | bne  | andi          | add        | srl |
|     |     |       |     | 1    | nop<br>Branch | taken      |     |

Pros: Programs run identically on both unpipelined and pipelined hardware Cons: in SPEC benchmarks 14% of instructions are taken branches  $\rightarrow$  14/114 = 12% of total cycles are annulled

### Branch Annulment Hardware



#### Branch Alternative 2a

Always fill branch delay slots with NOP instructions. Worse than H/W annulment. NOPs get executed whether branches are taken or not.

\$t1,\$t1,\$t0 loop: add \$t2,\$t2,1 srl \$t2,\$0,loop bne nop andi \$t0,\$t2,1

i+2 i+3 i+5 i+4 i+6 add srl add srl bne nop bne EXE add add cmp cmp bne nop Branch taken

Pros: Does not require H/W modifications, only compiler changes

Cons: NOPs make code longer; >12% of cycles spent executing NOPs

Maybe I could find something

useful to do in

that instruction

# Branch Alternative 2b(i)

Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not

srl \$t2,\$t2,1
loop: add \$t1,\$t1,\$t0
bne \$t2,\$0,loop
srl \$t2,\$t2,1

. . .

Effectively
a NOP if the
branch is not
taken.
(if (\$t2 == 0) then
\$t2 >> 1 == 0)

However, finding an instruction that behaves like a NOP when not taken can be tricky,



Pros: only one "extra" instruction is executed (on last iteration)

Cons: finding "useful" instructions that should always be executed is difficult; clever rewrite may be required. Program executes differently on unpipelined implementation.

This is the standard approach for pipelined MIPS implementations

# Branch Alternative 2b(ii)

Put USEFUL instructions \$t1,\$t1,\$t0 add The ".t" suffix. srl \$t2,\$t2,1 loop: in the branch delay slots; implies a new \$t2,\$0,loop bne.t annul them if branch instruction \$t1,\$t1,\$t0 add variant "Branch doesn't behave as \$t0,\$t2,1 andi if not equal, predicted while executing the delay slot if taken." Likewise, we i+7 i+1i+2 i+3 i+5 i+6 i+4 could add a ".n" IF add bne.t add variant for the add srl srl bne.t andi "execute if not EXE taken" case. add srl bne.t add srl bne.t H/W annuls nop the "opposite" Branch not taken case. Branch taken

Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions

Cons: Program executes differently on naïve unpipelined implementation; difficult to utilize with more than one delay slot.

# Architectural Issue: Branch Decision Timing

The number of branch delay slots is determined by "where" in the pipeline the "branch decision" is made relative to where the next instruction is fetched.

Consider the 5-stage miniMIPS pipeline shown on the right.

Where is the branch decision resolved?

The decision is based on the ALU's Z-flag, which is determined at the very end of the ALU stage, nearly 2 clocks after the instruction fetch. Therefore, a naïve 5-stage pipelined miniMIPS implementation has at least TWO branch delay slots.



Is there any way miniMIPS' could make its branch decision sooner? We only need to support BNE and BEQ, since we choose to trap and emulate the more complicated branch instructions.

# Early Branch Decision Hardware





Treats register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline? PC (return addresses) instruction fields (decoding)

What sort of improvement should expect in cycle time?

# 4-Stage miniMIPS Operation

Consider a sequence of instructions:

← Pipeline

```
addi $t0,$t0,1
sll $t1,$t1,2
andi $t2,$t2,15
sub $t3,$0,$t3
```

Executed on our 4-stage pipeline:

TIME (cycles) i+1 i+2 i+3 i+4 i+5 i+6 IF addi sll sub andi RF addi sll andi sub ••• ALU addi sll andi sub WB addi sll andi sub

# Pipeline "Structural Hazard"

BUT consider instead:

addi \$t0,\$t0,1

sll \$t1,\$t0,2

andi \$t2,\$t2,15

sub \$t3,\$0,\$t3

One of our source operands is the

. . .

|     | i    | i+1  | i+2  | i+3  | i+4  | i+5  | i+6 |
|-----|------|------|------|------|------|------|-----|
| IF  | addi | sll  | andi | sub  |      |      |     |
| RF  |      | addi | ၅။   | andi | sub  |      |     |
| ALU |      |      | addi | sll  | andi | sub  |     |
| WB  |      |      |      | addi | øll  | andi | sub |
| _   |      |      |      |      |      |      |     |

Oops! s11 is trying to read Reg[8] (\$t0) during cycle I+2 but addi doesn't write its result into Reg[8] until the end of cycle I+3!

Stuff like this never happened when we did pipelining before. Why now?

Before, we forbade feedback. Can't do that with a useful CPU.

How do we fix this one?



### Data Hazard Solution 1

"Program around it"

... document weirdo semantics, declare it a software problem.

- Breaks sequential semantics!
  (Order of instruction execution is not obvious)
- Costs code efficiency.

**EXAMPLE:** Rewrite

```
addi $t0,$t0,1 addi $t0,$t0,1 sll $t1,$t0,2 as andi $t2,$t2,15 andi $t2,$t2,15 sub $t3,$0,$t3 sll $t1,$t0,2
```

How often can we do this?

Not Very.

| _   | i    | i+1  | i+2  | i+3     | i+4  | i+5 | i+6 |
|-----|------|------|------|---------|------|-----|-----|
| IF  | addi | andi | sub  | sll     |      |     |     |
| RF  |      | addi | andi | sub     | sll  |     |     |
| ALU |      |      |      |         | sub  | ၅။  |     |
| WB  |      |      |      |         | andi |     | øll |
| _   |      |      |      | 2.3131. | 2    |     |     |

### Data Hazard Solution 2

Stall the pipeline (add bubbles/disable update to IR<sup>x</sup>s and PC<sup>x</sup>s): Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register

| _   | i    | i+1  | i+2  | i+3  | i+4   | i+5  | i+6  | L |
|-----|------|------|------|------|-------|------|------|---|
| IF  | addi | øll  | andi | andi | andi  | sub  |      |   |
| RF  |      | addi | øll  | øll  | 👍 sll | andi | sub  |   |
| ALU |      |      | addi | NOP  | NOP   | øll  | andi |   |
| MB  |      |      |      | addi | NOP   | NOP  | øll  |   |
|     |      |      |      |      |       |      |      | Γ |

Drawback: Added NOPs "waste" cycles. Lot's of wasted cycles. (A large percentage of instructions depend on results from the immediately preceding instruction)

#### Data Hazard Solution 3

#### Bypass (aka forwarding) Paths:

Add extra data paths & control logic to re-route data in problem cases.

|     | i    | i+1  | i+2   | i+3  | i+4  | i+5  | i+6 |
|-----|------|------|-------|------|------|------|-----|
| IF  | addi | ell  | andi  | sub  |      |      |     |
| RF  |      | addi | øll ♠ | andi | sub  |      |     |
| ALU |      |      | addi  | sll  | andi | sub  |     |
| MB  |      |      |       | addi | sll  | andi | sub |
|     |      |      |       | İ    |      |      |     |

Idea: The result from the addi, which will be written into the register file at the end of cycle I+3, is actually available at output of the ALU during cycle I+2 – just in time for it to be input into the ALU of the s11 in the RF stage! Thus, using it before it is actually written into the register!

# Bypass Paths (I)



# Bypass Paths (II)



#### SELECT this BYPASS path if

 $Op^{RF}$  uses  $Rs^{RF}$  as a source and  $Rs^{RF}$ != Oand not using ALU bypass and WERF = 1 and  $Rs^{RF}$  = WA

Why not get it from the register file? It's being written this cycle!

### Next Time

