



# Network Processor HW/SW Co-Verification

Sakthi Subramanian (sakthi@redback.com) Robert McAlister (robmc@redback.com) Session 1.12

**Sept 2007** 

## Agenda

Redback SmartEdge and PPA – introduction HW/SW Co-Verification goals Previous Co-Verification approaches Cadence Xtreme III Transaction-Based Acceleration (TBA) approach



### **Redback Smart Edge Router**

- Edge Router
- B-RAS
- Ethernet Aggregation
- Application and Flow aware
- Multiple Line cards





## **SmartEdge Line Card (1/2)**

#### Packet Processing ASICs (PPA)

- Multiple execution units (32 EUs) tightly coupled with instruction/data caches
- Execution units process packets independently
- Hardware assist offloads critical packet functions such as packet re-ordering, Traffic management
- Control/data path FPGAs interfaces with PPAs
- Proprietary system interfaces to control PPAs and FPGAs
- Packet mesh architecture ASIC to connect to back plane



## SmartEdge Line Card (2/2)

- Two operating systems control PPA and Control/Data path FPGAS
  - OS1: Responsible for PPA and FPGAs bring-up
  - OS2: Responsible for network admin layer
- PPAs, FPGAs have local memory and share memory with operating systems



### **HW/SW Co-verification goals**

#### Software goals

- SmartEdge driver development
  - PPA, Control/Data path FPGAs bring-up
- PPA software development can be done in parallel with nextgen chip design
- Performance tuning of software code

#### Hardware goals

- Running actual software code helps in complete design verification
- Running as a whole system will help identify production issues
- Performance tuning of hardware design
- Accelerate verification regressions



## **Previous approach (1/2)**





## Previous approach (2/2)

- Simulation environment consists of some RTL components, modeled components and the testbench
  - Execution units (EU) were modeled in C language
  - RTL components and the models linked through Programming language interface (PLIs)
  - Test-bench modules mainly present to initialize the PPA
- Driver by-passes some of the bring-up tasks to speed-up simulation
  - Example: Back-door writes to initialize memory
- Since full RTL is not used because of simulation performance, only certain portion of the bring-up driver code is developed.



### Modified approach (1/2)





# Modified approach (2/2)

- Modified approach replaces the modeled components with RTL
  - This simulates code running on real system
- PPA bring-up takes 6-7 hours
- Save feature of the software simulator is used to save initialized state of the RTL
  - Saved 6-7 hours of simulation time
  - Not all test bench state can be saved
    - Special hook-ups needed to save PLI states
- 60 million nano seconds worth of simulation time (without dumping) takes 2 ½ days for a test-case
  - Test case changes means the simulation needs to run from the restored state and wait for another 2 days!
  - Test bench changes means whole simulation needs to be started from time 0.
- Simulation becomes slower when dump is enabled



## Cadence Xtreme III (1/3)

- Comprised of two tightly coupled components that provide single simulation solution
  - Xcite SIMulator (XSIM): The software component
    - IEEE-complaint event driven HDL software simulator
  - ReConfigurable Computing (RCC) engine: The hardware component
    - Implemented as an array of FPGA that accelerates the RTL and gate-level evaluation
    - Tight coupling to XSIM for simulation and debugging
    - Automatic FPGA partitioning and routing of designs through a single compilation process
- RTL and gate-level netlist are mapped into RCC
- Behavioral-level elements and testbenches are supported by the host workstation



# Cadence Xtreme III (2/3)

 Extended memory ideal for designs containing or interfacing with large memory designs

#### Four simulation modes

- Traditional software simulation mode
- RCC acceleration mode
- Targetless emulation mode
- In-circuit verification mode
- Supports Standard Co-Emulation Modeling Interface (SCE-MI 1.1) and Cadence Transaction Based Acceleration (TBA) extensions
  - Abstract testbench components running on the host machine communicate with the DUT running on the Xtreme III using SCEMI transactors



## Cadence Xtreme III (3/3)

#### VCD on demand (VoD)

- Waveform can be generated either during simulation or afterwards
- Recording mechanism minimally impacts simulation performance and produce very small record files, so a simulation can always run with "record" on

#### • Hot-swapping:

 Dynamically change between software simulation and acceleration mode



#### **Transaction-based acceleration approach (1/2)**





# TBA approach (2/2)

- DUT is mapped to RCC and run on the Xtreme III
- Transactors are used
  - BFM is modeled using verilog constructs that can be synthesized and run on Xtreme III
    - BFM interacts with DUT interface based on the signal-level protocol definition of the bus interface
  - Software portion of the transactors (Proxy models) are run on the host machine
  - BFM and the software proxy models communicate among each other by exchanging messages through SCE-MI interface
- Cadence TBA SCEMI extensions are used to aid further simulation speed-up
  - Variable length multi-word messages are supported
  - Multiple message batching improves runtime performance



## **TBA** approach: Summary/Conclusion

- PPA and Control/Data path bring-up now takes 10 minutes
- PPA self-initialization takes an hour to complete

|                                          | Old approach                       | TBA approach                                                   |
|------------------------------------------|------------------------------------|----------------------------------------------------------------|
| PPA and<br>Control/Data path<br>bring-up | 6-7 hours                          | 10 minutes                                                     |
| PPA self<br>initialization               | Never ran<br>completely            | 1 hour                                                         |
| PPA software test<br>cases (60m ns)      | 2.5 days                           | 2 hours                                                        |
| Debug                                    | Slows down runtime<br>considerably | 100% visibility is<br>available without<br>performance penalty |



#### **TBA** approach: Future plans

- We are using this approach for next generation PPA software development
- We have plans to incorporate this approach for next generation chip development
- Plans to identify performance bottle-necks





cadence designer network



# CONNECT: IDEAS

## **CDNLive! 2007 Silicon Valley**