

Hassan Shojania

# Agenda

- History
- Challenges, features and applications
- Example application/routing scenario
- NP architecture
- Case study: IXP2400
- Software
- Scalability & future

# History

- First generation (1980s)
  - General purpose CPU, minicomputer → adaptable
  - Few connections/slow links
- Second generation (mid 1990s)
  - Increased speed and density
  - Specialized hardware functions
  - Offload of functions (e.g. classification) from CPU
- Third generation (late 1990s)
  - More and more specialized HW → ASIC
  - Decentralized: Multiple HW → complexity
  - Protocol consolidation: IP/Ethernet → less flexibility

Tradeoff: programmability for speed

# Today's story

- Convergence (voice/data/multimedia)
  - Faster pace of changes
  - New services/applications
- Shorter product lifecycle
- Fast time to market
- More complex
  - QoS, VPN, MPLS
  - Not store-and-forward anymore
  - Encryption, compression, classification
    - Two orders of magnitude
- → Programmability needed again (1<sup>st</sup> gen. hallmark)
- → But at high performance (3<sup>rd</sup> gen. hallmark)

## **Design questions**

- Which tasks are most important to optimize?
- Which HW-assist units to include?
- Which I/O interfaces are needed?
- What size of instruction/data store is needed?
- Which memory technologies and interconnects?
- What level of software support (languages, tools, ...)?

Many *possibilities* → many solutions:

- More than 30 NP vendors by Jan. 2003!
- Not many around these days ☺


# NP: The new approach

- Key attributes
  - Programmability
  - Simple prog. model
  - Maximum flexibility
  - High processing power
  - High functional integration
  - Open prog. interface
- Universal applicability
  - Interface/protocol range
  - Programmability at all levels







NPs bring both flexibility and performance.

System designers can focus on higher-level services rather than on constant change.

Benefits:

- Universal networking applications
- Faster time-to-market
- Longer time in market → focus on high-level services
- Scalable performance
- Lower system cost
- Higher availability
- Continuous innovation

All levels of the protocol stack: layers 2-7.

(Notes dated 3/4/2006.)

# Towards integrated system

- Coprocessor engines
  - Bottlenecks
  - Classification/queuing
- Lower-level functions
  - SONET framer
  - Higher port density
- → Lower cost
- → Higher performance
- → Lower interconnection penalty



Similarly, the software development cycle is reduced across product generations through "stable programming interfaces".

### What's an NP?

- Packet processing/forwarding
- Compared to a GPP
  - Simpler arithmetic/caching
  - Multiple execution threads → parallel packet processing
  - Special functional units
- Location in network:
  - Edge: Intelligent stateful processing
  - Core: Aggregated traffic flows
- Tuned towards:
  - Control-plane: Sys. mgmt., routing updates, protocol mgmt.
  - Data-plane: Packet processing (general term)



### **Applications**

- Load balancing → distributing load across servers
- Traffic differentiation → QoS
- Network security
- Terminal Mobility
  - Tunneling and bundling (edge)
- Active networking
  - No more passive forwarding
  - Code carried in packets
  - More data-plane processing
  - Exposing router state



Convergence of mobile and IP: 3GPP and SCTP (Stream Control Transmission Protocol).

#### Example app: Content-aware switch



- Web-server front-end
- One virtual IP
- Examines content above the TCP/IP layers (layer 5)
- Advantages:
  - Better load balance: distributed
  - Faster response: caching
  - Better resource management: database partitioning
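
A minimal sketch of the layer-5 dispatch decision, assuming the switch has already reassembled the first line of an HTTP request. The pool layout, addresses, and function names are hypothetical and only illustrate the idea of picking a server by content rather than by IP header alone.

```python
from urllib.parse import urlsplit

# Hypothetical backend pools behind the single virtual IP.
BACKENDS = {
    "/images/": ["10.0.0.11", "10.0.0.12"],  # caching servers for static content
    "/db/": ["10.0.0.21"],                   # server holding one database partition
}
DEFAULT_POOL = ["10.0.0.31", "10.0.0.32"]


def parse_request_path(payload: bytes) -> str:
    """Return the URL path from the first line of an HTTP request."""
    request_line = payload.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = request_line.split(" ")
    if len(parts) < 2:
        return "/"  # not a recognizable request line; fall back to the root
    return urlsplit(parts[1]).path or "/"


def choose_backend(payload: bytes, flow_id: int) -> str:
    """Pick a real server by URL prefix, then spread flows within the pool."""
    path = parse_request_path(payload)
    pool = DEFAULT_POOL
    for prefix, servers in BACKENDS.items():
        if path.startswith(prefix):
            pool = servers
            break
    return pool[flow_id % len(pool)]
```

Example: `choose_backend(b"GET /db/users?id=1 HTTP/1.1\r\n\r\n", 7)` selects the database pool because the path starts with `/db/`.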

### Packet processing

- Different from normal processing
  - I/O centric vs. processing centric
  - Real-time vs. best-effort
  - Many simple tasks vs. few complex tasks
  - State: per-flow vs. per program
  - Buffering: dependence in flow, e.g. CRC calculation in ATM
  - Atomic context processing (sequential packets)
- Router functions
  - Packet receive
  - Route-table lookup
  - Classification
  - Metering
  - Congestion avoidance
  - Packet scheduling
  - Packet transmit
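
The metering step can be illustrated with a token bucket. The slides do not say which metering scheme is meant, so this is a sketch under that assumption; the class name and parameters are hypothetical.

```python
class TokenBucket:
    """Token-bucket meter (assumed scheme, not taken from the slides)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s   # long-term allowed rate
        self.burst = burst_bytes       # maximum burst size
        self.tokens = burst_bytes      # bucket starts full
        self.last = 0.0                # time of the previous packet

    def conforms(self, now: float, size: int) -> bool:
        """Return True if a packet of `size` bytes is within profile."""
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # out of profile: caller may drop or mark the packet
```

The state is one small record per flow, which matches the per-flow state noted above.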



Buffer size might grow: internal vs. external memory, SDRAM/DRAM, memory speed.

### Routing example

- Serial stream in
- Packet (framing)
- Target extraction
- Packet buffering
- Packet processing
- Router lookup
- Scheduling
- Transmission
- Packet update
- Serial stream out
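
The router lookup stage is typically a longest-prefix match on the destination address. The sketch below uses a linear scan for clarity; it is not how an NP would implement the lookup in hardware or microcode, and the class and method names are hypothetical.

```python
import ipaddress


class RouteTable:
    """Longest-prefix-match table, kept sorted by prefix length."""

    def __init__(self):
        self.routes = []  # list of (network, next_hop)

    def add(self, prefix: str, next_hop: str) -> None:
        self.routes.append((ipaddress.ip_network(prefix), next_hop))
        # Longest prefixes first, so the first match is the longest one.
        self.routes.sort(key=lambda r: r[0].prefixlen, reverse=True)

    def lookup(self, dst: str):
        """Return the next hop for `dst`, or None if no route matches."""
        addr = ipaddress.ip_address(dst)
        for network, next_hop in self.routes:
            if addr in network:
                return next_hop
        return None
```

With routes `10.0.0.0/8` and `10.1.0.0/16` installed, a packet to `10.1.2.3` follows the `/16` route because it is the more specific match.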



Note: packet classification?

# Ex: Load-balancing dispatcher

- Packet → NP
- Preserving flow order
  - TCP fast retransmit
  - BW waste
- Flow state
  - Shared memory
  - SMT/CMP NP
- Flow classification problem
  - Assigning each flow to a fixed NP
    - Flow identity: src/dst IP and port, transport protocol ID
  - Other app: firewall, NAT, network monitoring
  - Eliminating inter-NP synchronization
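
One way to assign each flow to a fixed NP is to hash the flow identity. This sketch assumes IPv4 and uses CRC-32 as the hash; both are illustrative choices, and the function names are hypothetical.

```python
import ipaddress
import struct
import zlib


def flow_key(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: int) -> bytes:
    """Pack the flow identity (5-tuple) into 13 bytes, network byte order."""
    return struct.pack(
        "!4s4sHHB",
        ipaddress.IPv4Address(src_ip).packed,
        ipaddress.IPv4Address(dst_ip).packed,
        src_port,
        dst_port,
        proto,
    )


def assign_np(key: bytes, num_nps: int) -> int:
    """Map a flow to one NP; the same flow always maps to the same NP."""
    return zlib.crc32(key) % num_nps
```

Because every packet of a flow lands on the same NP, flow order is preserved and the flow state never has to be shared between NPs. Note that the two directions of a connection hash independently in this sketch.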



# Parallelism

- Single processor
- Run-to-completion
- Packet interdependence
- Parallel processing
- Pipelining

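[Figure: the functions f(); g(); h() run to completion on one processor, replicated on parallel processors behind a coordination mechanism, or split into pipeline stages f(), g(), h(). From [1]]
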
- Typical processor design issues
  - Superscalar/pipelined processors: e.g. P4
  - Higher clock with higher pipelining degree
  - If dependency exists?
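
A back-of-the-envelope model of the two options, assuming no coordination overhead and no packet interdependence. The function names and stage times are hypothetical.

```python
def pooled_throughput(stage_times, num_pes):
    """Run-to-completion on a pool of PEs: each PE executes every stage.

    Returns packets per time unit, ignoring coordination cost.
    """
    return num_pes / sum(stage_times)


def pipelined_throughput(stage_times):
    """One PE per stage: the slowest stage limits the whole pipeline."""
    return 1.0 / max(stage_times)


balanced = [2, 2, 2]    # f(), g(), h() take equal time
unbalanced = [1, 4, 1]  # g() dominates

print(pooled_throughput(balanced, 3), pipelined_throughput(balanced))
print(pooled_throughput(unbalanced, 3), pipelined_throughput(unbalanced))
```

With three PEs in both cases, the two designs match when stages are balanced. When one stage dominates, the pipeline slows to the rate of that stage while the pool is unaffected. The model leaves out what the pool pays for in practice: coordination and shared state.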

Note: this is a simplistic view.

# Design space

[Figure: design space; one axis is the number of PEs per stage]

- Homogeneous PEs
  - Fully pipelined
  - Fully pooled
- Heterogeneous PEs
  - EZchip NP-1/2
  - Task-optimized processors
    - Packet modification
    - Lookup and classification
    - Forwarding and QoS decision
    - Packet parse



[Figure label: link / switch fabric]

# Memory

- Issues
  - Managing 1000s of flows
  - Packet buffering/queuing
  - Complex packet processing: e.g. Encryption
  - Several accesses to each packet
- And all at wire-speed!
- Memory types:
  - SRAM for data structures (memory mgmt. pointers)
  - DRAM for packet buffers
  - Using on-chip cache rather than off-chip memory
  - Inter-PE communication; synchronization
- Specialized memory management/queuing blocks

### Market

- 2005: \$174 million (the estimate made in 2001 for 2003 was \$1B)
- 1<sup>st</sup>: AMCC, nP37X0: single-chip OC-48/traffic mgr.
- 2<sup>nd</sup>: Intel, IXP2800/2400: flexible, software, power
- 3<sup>rd</sup>: Agere, Hifn, Wintegra
- EZchip: startup, 5% market share, 10Gbps
- Mainstream: OC-48 and up; metro-Ethernet switches
- Niche market: Hifn: PowerNP → security, VPN



Line rates: 2.5/10/40 Gb/s correspond to OC-48/OC-192/OC-768.

Intel: 10 Gb/s, OC-48.

# Intel IXP product line

- IXP4xx series
  - For home, small-to-medium enterprise level
    - Wireless access point, router, DSL, VoIP, ...
    - Linksys, D-Link, Netgear routers
- IXP12xx series
  - 1<sup>st</sup> generation NP
  - OC-12 applications
- IXP2xxx series
  - 2<sup>nd</sup> generation NP
  - OC-192 applications
  - For high-performance, scalable networks
    - Edge and core applications from T1/E1 to OC-192

#### IXP2xxx series

- Integrated XScale RISC proc. (ARM-based)
  - For control-plane tasks
- IXP2400
  - OC-48 2.5Gbps
  - Single chip packet forwarding/traffic management
  - 8 micro-engines, 4K word inst. Memory @ 600MHz
  - 5.4 Giga op/s
- IXP28xx
  - OC-192 10Gbps
  - 16 MEs with 8K inst. Memory @ 1.5GHz
  - 24 Giga op/s
  - More SRAM/DRAM channels: 4 and 3 vs. 2 and 1, respectively
- IXP2855
  - Specialized cryptography engines (DES, AES, SHA)







From [3]

### IXP2400 SDK

- Software is critical!
  - NP: all about programmability
- Runtime environment
  - VxWorks, Embedded Linux, QNX Neutrino
- Tools
  - Microcode assembler, Microengine C Compiler and C Runtime Library, cycle-accurate simulator, Architecture Tool, ...
- Data-plane Libraries:
  - Microcode and Microengine C versions of: Hardware Abstraction Library, Protocol Library, Cryptography Library (IXP2850), Utility Library, and Microblock Infrastructure Library

# Scalability & future

- OC-768 router
  - Load balancing
  - Concurrent NPs
  - Without inter-proc. comm.
- Future trend
  - Internet is booming
    - OK! Not at the old pace
  - Nodes are more BW hungry
  - New services and applications
    - Not following OSI model
  - More complex, upper-layer tasks at edge/core
  - → More powerful NP; standard HW/SW interfaces
    - More like CPU trend



From [13]

#### Software examples

#### Sample IXP2xxx code for NAT

- http://www.npbook.cs.purdue.edu/intel/code/NAT\_pkt\_handler.c.txt
- http://www.npbook.cs.purdue.edu/intel/code/NAT\_microblock.uc.txt
- Intel SDK 4.2

#### References

- [1] Douglas Comer, "Network Processors: Programmable Technology for Building Network Systems", The Internet Protocol Journal, Vol. 7, No. 4.
- [2] David Husak, "Network Processors: A Definition and Comparison", Freescale Semiconductor.
- [3] Werner Bux, et al., "Technologies and Building Blocks for Fast Packet Forwarding", IEEE Communications Magazine, Jan. 2001.
- [4] Yan Luo's slides, Network Processor and Its Applications.
- [5] Mangione-Smith and Memik, "Network Processor Tutorial", MICRO-34.
- [6] Intel Corp., "Next Generation Network Processor Technologies", 2001.
- [7] Matthias Gries, "Exploring Trade-offs in Performance and Programmability of Processing Element Topologies for Network Processors", 9th International Symposium on High Performance Computer Architecture (HPCA9), Feb. 2003.
- [8] Intel Corp., "IXP2400 Network Processor Datasheet".
- [9] Intel Corp., "IXF6048 Multi-Speed SONET Packet Framer Product Brief".
- [10] Intel Corp., "IXP2855 Product Overview".
- [11] Andreas Kind, "The Role of Network Processors in Active Networks", IBM Zurich Research Lab, 2003.
- [12] The Linley Group, "A Guide to Network Processors for Metro Applications", 7<sup>th</sup> edition, Dec. 2005. (only its summary is publicly available)
- [13] Patricia Sagmeister, "Scaling Network Processor Performance to 40 Gbps", IBM Research Zurich.



#### Sample packet flow in IXP



From [7]

### Active networks

- Decouples network services from infrastructure
- Active packets
  - Carry code (reference or directly)
- Active nodes
  - Execution environment like a VM; byte-code (JIT)
  - Access to node resource (link, routing table)
- App-level filtering:
  - Dropping B-frames in multicast tree
- Network management:
  - node params; aggregating several managed nodes

Active packet sources:

- end-user
- active gateways
- network management app

In P2P, nodes could sense congestion and adapt. JIT: compile once and store in native binary format.

## IXP2400 ME hardware assists

- Multiplier:
  - To improve performance and code density for QoS algorithms
- A pseudo random number generator:
  - To accelerate congestion avoidance algorithms like WRED
- Cyclic Redundancy Check (CRC) generator:
  - To automate CRC generation for ATM AAL5, Ethernet, Frame Relay, ...
- 16-entry Content Addressable Memory (CAM):
  - To efficiently share data among ME threads
  - To reduce memory bandwidth consumption
- 64-bit local timer:
  - To enhance traffic scheduling and shaping
- Memory features:
  - To accelerate updates to shared memory locations
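
To show where the pseudo-random number generator fits, here is a sketch of the WRED drop decision. Python's `random.Random` stands in for the ME hardware generator, and the thresholds and class names are hypothetical.

```python
import random

# Hypothetical per-class profiles: (min_threshold, max_threshold, max_drop_prob).
PROFILES = {
    "gold": (40, 80, 0.05),
    "bronze": (20, 60, 0.2),
}


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability rises linearly between the two thresholds."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def should_drop(avg_queue, traffic_class, rng):
    """One random draw per packet decides the drop; this is the PRNG's job."""
    p = red_drop_probability(avg_queue, *PROFILES[traffic_class])
    return rng.random() < p


rng = random.Random(1)  # software stand-in for the hardware generator
print(should_drop(50, "bronze", rng))
```

The "weighted" part is the per-class profile: at the same average queue length, bronze traffic is dropped with higher probability than gold.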





### IXP425 software architecture



## IXC1100 and Res. Gateway

[Figure: IXC1100 block diagram. Recoverable labels: Intel XScale core (266/400/533 MHz, control-plane processor) with 32-KB data cache and instruction cache; NPE A and NPE B, each with an Ethernet media-independent interface; SDRAM controller; PCI controller; expansion bus controller; 133 MHz Advanced High-Performance Bus; 66 MHz Advanced Peripheral Bus; queue status bus; JTAG test logic unit.]

[Figure: Residential gateway system architecture]

### IXP425 micro-engine

