### cādence

# Low Power Design of the DLX Processor Using the Encounter Platform

Session #: 4.13

- Madhuparna Datta, Cadence Design Systems
- Shalini Sharma, Freescale Semiconductor Inc
- Harjot Singh, Cadence Design Systems



### • Low Power Design of DLX Processor Core

- Overview of DLX Processor Core
- RTL Level Power Estimation Methodology
- Application of standard Low Power Design Techniques
- Redesign of Critical Portions of Design to reduce Power
- Front End and Back End flow
- Conclusions

## **Overview of DLX processor core**

- DLX is a generic 32-bit RISC microprocessor
- Used for academic purposes
- Thirty-two 32-bit registers
- 5 stage Pipeline
- Why was DLX chosen?
  - Based on observations about most frequently used primitives in programs
  - Good architectural model for study
    - Popularity
    - Easy to understand

# **Overview of DLX Processor Core**

## INSTRUCTION SET

- Fixed-length 32-bit instructions for improved code efficiency
- Register, Immediate and Jump type instructions
- Accesses to objects larger than a byte must be aligned
- Load / Store architecture
- Conditional execution
- Instruction execution: at most 1 instruction per cycle

# **Overview of DLX Processor Core**

### • MEMORY INTERFACE

- 32 bit data interface to Instruction cache for accessing fixed 32 bit instructions
- 32 bit address bus to Instruction cache
- 32 bit data interface to Data cache
- Signals to indicate nature of access byte/word/double word
- Signals to indicate type of access read/write to Data cache



### • INTERRUPTS

- External interrupt to halt program execution.
- Interrupt to stall the pipeline.
- Signal to clear the Interrupt.
### • STATUS SIGNALS

- Signal to indicate status of pipeline: empty or not empty

# 5 Stage Pipeline of DLX





Power Savings are maximum at RTL level



# **RTL Level Power Estimation**

- Accuracy of RTL level Power Estimation
  - Functional Vectors used For Power Estimation
  - Libraries/PVT conditions used for Power Estimation
  - Interconnect Model / Wire-load Model used for Power Estimation.
  - Level Of Hierarchy where Power is analyzed
  - Power Calculation
- Power Estimation Results of DLX Processor Core

Vector Generation Flow for DLX Processor Core



# Libraries/PVT/Interconnect Model used for Power estimation

| Parameter          | Setting                                                                    |
|--------------------|----------------------------------------------------------------------------|
| Libraries/PVT      | Best case: P<sub>best</sub> V<sub>max</sub> T<sub>min</sub>                |
| Interconnect Model | 1<sup>st</sup> set with enclosed wireload<br>2<sup>nd</sup> set with PLE   |
| Level Of Hierarchy | Each stage of the 5-stage DLX pipeline                                     |

**Power Calculation in RC-LP Engine** 

$$P_{total} = \sum P_{instance} + \sum P_{net}$$

$$P_{instance} = \sum P_{internal} + \sum P_{leakage}$$

$$P_{net} = \frac{1}{2} C \times V^2 \times f \times TR$$

$$P_{internal} = \sum_{\text{per arc}} TR_{arc_{ij}} \times \phi(SR_i, C_j) + \sum_{\text{per pin}} TR_i \times \phi(SR_i)$$

$$P_{leakage} = \sum_{state=1}^{k} P_{state\_leakage} \times probability_{state}$$

© 2007 Cadence Design Systems, Inc. All rights reserved worldwide.
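
The net and leakage terms of the power calculation above can be tried out with a few lines of Python. This is only a sketch with made-up numbers: it assumes `TR` is a toggle rate expressed in toggles per clock cycle, and the capacitance, voltage, internal power and per-state leakage values are hypothetical, not taken from the DLX libraries.

```python
def net_power(cap_farads, vdd_volts, freq_hz, toggle_rate):
    """P_net = 1/2 * C * V^2 * f * TR (TR assumed to be toggles per cycle)."""
    return 0.5 * cap_farads * vdd_volts ** 2 * freq_hz * toggle_rate


def leakage_power(states):
    """P_leakage = sum over states of P_state_leakage * probability_state.

    states: iterable of (state_leakage_watts, probability) pairs.
    """
    return sum(p_state * prob for p_state, prob in states)


def total_power(instance_powers, net_powers):
    """P_total = sum of instance powers + sum of net powers."""
    return sum(instance_powers) + sum(net_powers)


# Hypothetical values for one cell driving one net at 100 MHz.
p_net = net_power(cap_farads=10e-15, vdd_volts=1.0, freq_hz=100e6, toggle_rate=0.2)
p_leak = leakage_power([(2e-9, 0.25), (4e-9, 0.75)])
p_internal = 50e-9  # hypothetical internal power, normally from library tables
p_instance = p_internal + p_leak

print(f"P_net      = {p_net * 1e9:.1f} nW")
print(f"P_leakage  = {p_leak * 1e9:.2f} nW")
print(f"P_total    = {total_power([p_instance], [p_net]) * 1e9:.2f} nW")
```

The internal power term is left as a plain number here because the function φ is a lookup in the characterized library, which this sketch does not model.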


# **Power Dissipation Model**




## **Power Estimation of DLX Processor Core**

- The DLX processor was benchmarked for power using a set of 4 test cases written in C
  - MATRIX MULTIPLICATION
  - FACTORIAL
  - FIBONACCI
  - ATOI
- Frequency of operation chosen is 100 MHz





### **Power Estimation Results of DLX Processor Core**

Wireload model

#### Max Power: 4.42mW (matrix)



60% of power is consumed in IDinst. Only 2% of the total is leakage!!

### **Power Estimation Conclusion of DLX Processor Core**

Wireload model

- Power consumption of the DLX processor core is 4.4 mW using the worst-case functional vector, with switching power = 4.3 mW and leakage power = 86.29 uW.
- About 61% of total dynamic power is consumed in the ID unit of the DLX processor core
- To achieve any Reduction in power, ID unit should be targeted
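
As a quick sanity check on the figures in this conclusion, the sketch below recomputes the total and the leakage share from the quoted switching and leakage numbers. The variable names are illustrative only; the input values are the ones stated on the slide.

```python
switching_mw = 4.3      # switching power quoted above
leakage_uw = 86.29      # leakage power quoted above
id_dynamic_share = 0.61 # share of dynamic power in the ID unit

total_mw = switching_mw + leakage_uw / 1000.0
leakage_share_pct = (leakage_uw / 1000.0) / total_mw * 100.0
id_dynamic_mw = id_dynamic_share * switching_mw

print(f"total power    = {total_mw:.2f} mW")
print(f"leakage share  = {leakage_share_pct:.1f} %")
print(f"ID dynamic     = {id_dynamic_mw:.2f} mW")
```

The leakage share comes out at roughly 2% of the total, which is why the effort that follows concentrates on dynamic power in the ID unit.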





# Revisiting DLX PIPELINE

Scheduling for Read After Write to a Register in Pipeline

Scheduling for Write After Read to a Register in Pipeline


# **Design Optimization for Power**

- Register file read occurs in the ID stage and write in the WB stage
- A 4-cycle delay exists for a write-after-read operation to the same register in the register file
- A 1-cycle delay exists for a read-after-write from a register in the register file
- The DLX register file used positive-edge flip-flops
- The timing relationship for read after write and write after read to a particular register allows us to replace all registers with negative-level-sensitive latches with no change in functionality.
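
The equivalence argument can be illustrated with a toy behavioural model. This is a simplified sketch, not the authors' RTL: it assumes each cycle starts with the clock high, that write data and write enable are stable for the whole cycle, and that a dependent read samples the register at least one cycle after the write. The function and variable names are hypothetical.

```python
def phase_trace(writes, kind):
    """Return the stored value seen in each (cycle, phase).

    writes: list of (write_enable, data), one entry per clock cycle.
    kind:   "ff" for a positive-edge flip-flop, "latch" for a
            negative-level-sensitive latch (transparent while clk is low).
    """
    q = 0
    trace = []
    for cycle, (write_enable, data) in enumerate(writes):
        # clk high: both storage elements hold their value.
        trace.append((cycle, "high", q))
        # clk low: the latch is transparent and takes the new data.
        if kind == "latch" and write_enable:
            q = data
        trace.append((cycle, "low", q))
        # Rising edge that ends the cycle: the flip-flop captures the data.
        if kind == "ff" and write_enable:
            q = data
    return trace


writes = [(1, 5), (0, 9), (1, 7), (0, 0)]
ff = phase_trace(writes, "ff")
latch = phase_trace(writes, "latch")

high_ff = [v for (_, phase, v) in ff if phase == "high"]
high_latch = [v for (_, phase, v) in latch if phase == "high"]
low_ff = [v for (_, phase, v) in ff if phase == "low"]
low_latch = [v for (_, phase, v) in latch if phase == "low"]

print("start-of-cycle values, flip-flop:", high_ff)
print("start-of-cycle values, latch:    ", high_latch)
print("low-phase values, flip-flop:     ", low_ff)
print("low-phase values, latch:         ", low_latch)
```

In this model the latch shows the new value half a cycle earlier than the flip-flop, but the value seen at the start of every following cycle is identical, which is the property the one-cycle read-after-write spacing relies on.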

**Timing Diagram** 



## **Power Estimation of DLX Processor Core**

#### Wireload model



## Baseline of NewregFile reduces power by 45%

#### Max Power is 2.41 mW (matrix)

## Low Power Implementation of DLX Processor Core

- Synthesis Methodology
  - Clock Gating
  - Operand Isolation
  - Leakage Power Optimization
  - MSMV Methodology
  - Power Constraints



# Clock Gating

• Clock gating stops the clock to a register during idle periods, when the gating function shuts the register off.



- Power is saved in the gated-clock circuitry.
- To insert clock gating (CG) in RC:
  - set\_attribute lp\_insert\_clock\_gating true
- To merge clock-gating instances, we used the following command:
  - clock\_gating declone [-hierarchical]
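
To see why gating saves clock power without changing behaviour, the sketch below compares a register that recirculates its value through a feedback mux (clocked every cycle) with a gated register (clocked only when enabled). It is a toy model with hypothetical names and data, not output from RC.

```python
def clock_edges_seen(enables, gated):
    """Number of rising clock edges that reach the register's clock pin."""
    if gated:
        return sum(1 for en in enables if en)
    return len(enables)


def register_values(enables, data, gated):
    """Value held by the register after each cycle."""
    q = 0
    values = []
    for en, d in zip(enables, data):
        if gated:
            # The clock pulse only arrives when the enable is high.
            if en:
                q = d
        else:
            # Clocked every cycle; a feedback mux recirculates q when idle.
            q = d if en else q
        values.append(q)
    return values


enables = [1, 0, 0, 1, 0, 0, 0, 1]
data = [3, 9, 9, 4, 8, 8, 8, 6]

ungated = register_values(enables, data, gated=False)
gated = register_values(enables, data, gated=True)

print("same contents:", ungated == gated)
print("clock edges without gating:", clock_edges_seen(enables, gated=False))
print("clock edges with gating:   ", clock_edges_seen(enables, gated=True))
```

In this trace the register contents are identical in both styles, while the gated register sees 3 clock edges instead of 8.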



# Operand Isolation

- Operand Isolation is a dynamic power optimization technique that can reduce power dissipation in datapath blocks controlled by an enable signal
- It can be done in RC using
  - set\_attribute lp\_insert\_operand\_isolation true
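
The effect of isolating operands can be sketched by counting bit toggles at the inputs of a datapath block. The example assumes AND-style isolation (operands forced to 0 while the enable is low), which is only one possible isolation style, and uses a hypothetical operand stream.

```python
def toggles(values, width=8):
    """Count bit flips between consecutive values, starting from 0."""
    mask = (1 << width) - 1
    total = 0
    prev = 0
    for v in values:
        total += bin((prev ^ v) & mask).count("1")
        prev = v
    return total


def isolate(operands, enables):
    """AND-style isolation: pass the operand only while enable is high."""
    return [op if en else 0 for op, en in zip(operands, enables)]


# Hypothetical operand stream; the block's result is only used when enable = 1.
operands = [0x0F, 0xF0, 0x0F, 0xF0, 0x0F, 0xF0]
enables = [1, 0, 0, 0, 0, 1]

raw_toggles = toggles(operands)
isolated_toggles = toggles(isolate(operands, enables))

print("input toggles without isolation:", raw_toggles)
print("input toggles with isolation:   ", isolated_toggles)
```

Forcing the inputs to 0 costs a few toggles of its own when the enable changes, so the saving depends on how long the block stays idle and how active its operands are.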



# Leakage Power Optimization

- Leakage power can be reduced by using high-$V_T$ cells, which have lower leakage
- This can be done in RC using
  - set\_attribute lp\_multi\_vt\_optimization\_effort low /



# MSMV Methodology

Using multiple supply voltages in a design is one of the most effective approaches to reducing its dynamic power dissipation.

- In RC, library domains are used to
  - Indicate that some blocks in the design operate at different voltages
  - Associate dedicated libraries with some blocks in the design without using multiple supply voltages
- There are two types of MSV designs
  - Multiple Supply Single Voltage (MSSV): Core logic runs at a single voltage, but some portions of the logic are isolated on their own power supply. In this case isolation cells are required
  - Multiple Supply Multiple Voltage (MSMV): Supplies of different voltages are used for core logic. We have used MSMV in our design. In this case both isolation cells and level shifters may be required



- A library domain is a collection of libraries that should have been characterized for the same nominal operating conditions
  - create\_library\_domain {DOMAIN<sub>1</sub> DOMAIN<sub>2</sub> ... DOMAIN<sub>n</sub>}
  - create\_library\_domain {DLXD IDD} (OUR CASE)
- Associating libraries with a voltage domain
  - set\_attribute library {library1 library2} domain1
  - set\_attribute library {library3} domain2
- Different blocks of the design may operate at different voltages
- To set the target library domain for the top design
  - set\_attribute library\_domain DOMAIN1 INSTANCE1
  - set\_attribute library\_domain DOMAIN2 INSTANCE2





Partition of design into multiple voltage domains and communication between those domains

# Power Constraints

- Dynamic Power constraints can be specified in RC
  - set\_attribute max\_dynamic\_power 2460 DLX\_sync
  - 2460 comes from the initial power estimate we obtained using RTL estimation
- Leakage Power constraints
  - set\_attribute max\_leakage\_power 61 DLX\_sync

## Low Power Synthesis Flow for DLX Processor Core

- read libraries
- set lp\_auto\_create\_level\_shifter 1
- create\_library\_domain {IDD DLXD}
- set\_attribute library \$1v0\_lib\_list IDD
- set\_attribute library \$0v8\_lib\_list DLXD
- set\_attribute lp\_insert\_operand\_isolation true
- read\_rtl { ....}
- define\_clock -name CLK -period 10000 [find / -port clk]
- set\_attribute library\_domain DLXD DLX\_sync
- set\_attribute library\_domain IDD ID
- set\_attribute max\_leakage\_power 61 DLX\_sync
- set\_attribute lp\_multi\_vt\_optimization\_effort low
- read\_tcf design.tcf
- set\_attribute max\_dynamic\_power 2460 DLX\_sync
- set\_attribute lp\_auto\_insert\_level\_shifter 1
- synthesize -to\_generic
- synthesize -to\_mapped
- clock\_gating declone -hierarchical

Wireload model

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 4.28 mW            |
| RTL Optimized | 2.32 mW            |

Difference about ~46% to 53%!

Wireload model

Power Profile LP+CG

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 2.16 mW            |
| RTL Optimized | 2.09 mW            |

Difference just ~3% to 8%!



Wireload model

Power Profile LP+CG+OI

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 2.08 mW            |
| RTL Optimized | 1.89 mW            |

Difference just ~5% to 9%!

#### Wireload model

Power Profile LP+CG+OI+MSV

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 1.48 mW            |
| RTL Optimized | 1.35 mW            |

Difference of ~3% to 12%! Max switching = 1.35 mW and leakage = 18.6 uW. Reduction from original baseline ~70%!!
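
The percentages quoted for the wireload runs can be reproduced from the matrix-testcase maxima on the slides. The sketch below does that arithmetic; the helper name is hypothetical, and the 4.42 mW baseline is the RTL-level estimate for the original design reported earlier in the deck.

```python
def reduction_pct(before_mw, after_mw):
    """Percentage reduction going from before_mw to after_mw."""
    return (before_mw - after_mw) / before_mw * 100.0


# (original, RTL optimized) max power in mW for the matrix testcase.
wireload_results = {
    "LP+CG": (2.16, 2.09),
    "LP+CG+OI": (2.08, 1.89),
    "LP+CG+OI+MSV": (1.48, 1.35),
}

for profile, (original, optimized) in wireload_results.items():
    pct = reduction_pct(original, optimized)
    print(f"{profile:14s} original vs RTL optimized: {pct:.1f} %")

overall = reduction_pct(4.42, 1.35)
print(f"RTL baseline (4.42 mW) to final (1.35 mW): {overall:.1f} %")
```

Only the matrix testcase is computed here, so each result is a single point inside the per-slide ranges, which span all four testcases.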

PLE model

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 4.99 mW            |
| RTL Optimized | 2.76 mW            |

Difference about ~48%!

## Low Power Synthesis Results

PLE model

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 4.55 mW            |
| RTL Optimized | 2.57 mW            |

Difference about ~44% to 52%!





PLE model

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 2.37 mW            |
| RTL Optimized | 2.23 mW            |

Difference just ~3% to 7%!



**PLE model**

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 2.31 mW            |
| RTL Optimized | 2.2 mW             |

Difference just ~5% to 8%!

## Low Power Synthesis Results

**PLE model**

| Testcase      | Max Power (matrix) |
|---------------|--------------------|
| Original      | 1.83 mW            |
| RTL Optimized | 1.48 mW            |

Difference of ~5% to 19%! From original baseline around 70% power reduction!

## **Backend Flow**

- Floorplanning Elements
  - Core Aspect: 1.0
  - Core utilization : 0.6
  - Power nets: VDD\_DLXD (0.7V) and VDD\_IDD(1.08V)
  - Ground net: VSS
  - Core ring: VDD\_DLXD, VDD\_IDD & VSS
  - IDD PD ring : VDD\_DLXD, VDD\_IDD & VSS
  - Power strips connect Core ring to IDD ring
  - Top & Bottom: M7, H ; Left & Right: M8, V
  - LVLLH and LVLHL Shifter cells

# Power Planning

Design Browser view of the DLX\_sync hierarchy (screenshot):

- Hier Cell DLX\_sync: 14890 LeafCells, Terms (170), Nets (515), Modules (5)
- Output terms: CLI, DM\_addr (bus), DM\_read, DM\_write, DM\_write\_data (bus), PIPEEMPTY, byte, word
- Input terms: DM\_read\_data (bus), FREEZE, INT, IR (bus), clk, reset
- Modules
  - EXinst (EX), 8966 LeafCells
  - IDinst (ID), 5437 LeafCells
  - IFinst (IF), 455 LeafCells
  - MEMinst (MEM), 175 LeafCells
  - RC\_LS\_HIER\_INST\_1341 (RC\_LS\_MOD), 53 LeafCells
- PowerDomains: IDD, DLXD
- Floorplan annotation: IDD(TI =60.5%)

# Backend Flow

- Placement
  - Timing driven, Congestion Effort : High
  - In Place Optimization
- SRoute
  - DLXD
    - Std Cells VDD\_DLXD,VSS
    - Level Shifter Pins VDD\_IDD
  - IDD
    - Std Cells VDD\_IDD,VSS
    - Level Shifter Pins VDD\_DLXD
- PreCTS, PostCTS, PostRoute optimization



- PreCTS Optimization
  - Density: 65.9%
  - Hold: 0.013ns (WNS)
  - Setup: 1.1ns (WNS)
- PostCTS Optimization
  - Density: 66.8%
  - Hold: 0.962ns (WNS)
  - Setup: 1.108ns (WNS)
- PostRoute Optimization
  - Density: 71.1%
  - Hold: 1.016ns (WNS)
  - Setup: 0.697ns (WNS)

## Power Outcome

- Total Power after PnR (Comb: 60%, Seq: 40%)
  - Total Internal Power: 1.38 mW
  - Total Switching Power: 0.77 mW
  - Total Leakage Power: 65.084 uW
  - Total Power: 2.22 mW
  - DLXD: 36.5%, IDD: 63.5%
- Clock Power distribution (19.39% of total power)
  - Internal: 173.22 uW
  - Switching: 255.887 uW
  - Leakage: 2.05 uW
- Fibo baseline: 4.78 mW; Fibo Layout: 2.22 mW
  - Reduction: 53.5%
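
The post-layout figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative. Because the slide values are rounded, the recomputed clock share lands near, not exactly on, the quoted 19.39%.

```python
internal_mw = 1.38
switching_mw = 0.77
leakage_uw = 65.084

total_mw = internal_mw + switching_mw + leakage_uw / 1000.0

# Clock network power components in uW.
clock_uw = 173.22 + 255.887 + 2.05
clock_share_pct = clock_uw / (total_mw * 1000.0) * 100.0

# Fibonacci testcase: baseline versus post-layout power.
fibo_reduction_pct = (4.78 - 2.22) / 4.78 * 100.0

print(f"total power      = {total_mw:.2f} mW")
print(f"clock power      = {clock_uw:.1f} uW ({clock_share_pct:.1f} % of total)")
print(f"fibo reduction   = {fibo_reduction_pct:.1f} %")
```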



## Conclusions

- DLX Processor Core's pipelined structure
- RTL Level Power Estimation Methodology
- ID's register file consumed the most power, so we redesigned it
- RTL redesign helped reduce power by 70%!!
- Testcases face varied effects of LP,CG,OI,MSV
  - For seq dominated, CG helps a lot!
- Max. power reduction effect in Synthesis
- Completed Front End and Back End flow
- RTL estimated 36% less power mainly due to optDesign



## **THANK YOU!**








# CONNECT: IDEAS

cādence™

## **CDNLive! 2007 Silicon Valley**