

#### **GPGPU** introduction and network applications

PacketShader, SSLShader



# Mellanox Connect. Accelerate. Outperform.™

#### Agenda

#### GPGPU Introduction

- Computer graphics background
- GPGPUs past, present and future
- PacketShader: A GPU-Accelerated Software Router
- SSLShader: A GPU-Accelerated SSL encryption/decryption proxy






## **GPGPU** Intro



### GPU = <u>Graphics Processing Unit</u>

- The heart of graphics cards
- Mainly used for real-time 3D game rendering
  - Massively-parallel processing capacity



(Ubisoft's Avatar, from http://ubi.com)




#### GPU Fundamentals: The Graphics Pipeline



#### A simplified graphics pipeline

- Note that pipe widths vary
- Many caches, FIFOs, and so on not shown



## **GPU Pipeline: Transform**

#### Vertex Processor (multiple operate in parallel)

- Transform from "world space" to "image space"
- Compute per-vertex lighting






## **GPU Pipeline: Rasterizer**

#### Rasterizer

- Convert geometric representation (vertex) to image representation (fragment)
  - Fragment = image fragment
    - Pixel + associated data: color, depth, stencil, etc.
- Interpolate per-vertex quantities across pixels








#### GPU Pipeline: Shade

#### Fragment Processors (multiple in parallel)

- Compute a color for each pixel
- Optionally read colors from textures (images)





#### GPU Fundamentals: The Modern Graphics Pipeline





#### nVidia G80 GPU Architecture Overview

- 16 multiprocessor (MP) blocks
- Each MP block has:
  - 8 streaming processors (IEEE 754 single-precision floating point compliant)
  - 16 KB shared memory
  - 64 KB constant cache
  - 8 KB texture cache
- Each processor can access all of the memory at 86 GB/s, but with different latencies:
  - Shared: 2-cycle latency
  - Device: 300-cycle latency





(Figure: multiprocessor block diagram showing the instruction unit, constant cache, and texture cache)

#### Queueing

#### FIFO buffering (first-in, first-out) is provided between task stages

- Accommodates variation in execution time
- Provides elasticity to allow unified load balancing to work

#### FIFOs can also be unified

- Share a single large memory with multiple head-tail pairs
- Allocate as required
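A unified FIFO of this kind can be sketched in software. The following is a minimal model, not the GPU's actual hardware design: the class name `UnifiedFifo`, its methods, and the linked-slot layout are assumptions made for illustration. All queues draw slots from one shared pool, and each queue keeps its own head-tail pair.

```python
class UnifiedFifo:
    """Several FIFOs sharing one slot pool (hypothetical software model)."""

    def __init__(self, capacity, num_queues):
        # capacity must be at least 1
        self.data = [None] * capacity
        # next_slot links free slots together, and later the slots of each queue
        self.next_slot = list(range(1, capacity)) + [-1]
        self.free = 0                      # head of the free list
        self.head = [-1] * num_queues      # one head-tail pair per queue
        self.tail = [-1] * num_queues

    def push(self, q, item):
        """Append item to queue q; returns False when the shared pool is full."""
        if self.free == -1:
            return False
        slot = self.free
        self.free = self.next_slot[slot]   # allocate a slot on demand
        self.data[slot] = item
        self.next_slot[slot] = -1
        if self.tail[q] == -1:
            self.head[q] = slot
        else:
            self.next_slot[self.tail[q]] = slot
        self.tail[q] = slot
        return True

    def pop(self, q):
        """Remove and return the oldest item of queue q, or None if empty."""
        slot = self.head[q]
        if slot == -1:
            return None
        item = self.data[slot]
        self.head[q] = self.next_slot[slot]
        if self.head[q] == -1:
            self.tail[q] = -1
        self.data[slot] = None
        self.next_slot[slot] = self.free   # return the slot to the shared pool
        self.free = slot
        return item
```

Because slots are allocated only when an item arrives, a busy stage can temporarily use most of the memory while an idle stage uses none, which is the elasticity the slide refers to.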







#### SIMT - Memory Access Latency Hiding

## GPU can effectively hide memory latency
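A rough way to reason about latency hiding is a toy model in which each warp alternates a burst of ALU work with one memory access, and the scheduler switches to another ready warp while the access is outstanding. The sketch below uses the 300-cycle device memory latency quoted for the G80; the 10 compute cycles per access and the function names are assumptions for illustration, not measured values.

```python
import math

def warps_needed(mem_latency, compute_cycles):
    """Smallest number of warps that keeps the ALUs busy in this toy model.

    One warp computes for compute_cycles, then waits mem_latency cycles.
    The other warps must supply enough work to cover that wait.
    """
    return 1 + math.ceil(mem_latency / compute_cycles)

def alu_utilization(num_warps, mem_latency, compute_cycles):
    """Fraction of cycles in which some warp has ALU work to do."""
    period = compute_cycles + mem_latency
    return min(1.0, num_warps * compute_cycles / period)

if __name__ == "__main__":
    latency = 300   # device memory latency in cycles (from the G80 overview)
    compute = 10    # assumed ALU cycles between memory accesses
    print("warps needed:", warps_needed(latency, compute))
    for n in (1, 8, 31):
        print(n, "warps ->", round(alu_utilization(n, latency, compute), 3))
```

In this model a single thread of execution leaves the ALUs idle most of the time, which is consistent with the later remark that single-threaded code runs far slower on a GPU than on a CPU.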





## Implementation vs. architecture model



#### **NVIDIA GeForce 8800**



© 2014 Mellanox Technologies

Source : NVIDIA



## **OpenGL** Pipeline



## **NVIDIA GeForce 8800**



**Fixed-function assembly** 



## **OpenGL** Pipeline

#### The nVidia G80 GPU

- 128 streaming floating-point processors @ 1.5 GHz
- 1.5 GB shared RAM with 86 GB/s bandwidth
- 500 GFLOPS on one chip (single precision)





#### Why are GPUs so fast?

- The entertainment industry has driven the economics of these chips
  - Males age 15-35 buy \$10B in video games / year
- Moore's Law ++
- Simplified design (stream processing)
  - Huge parallelism maps well to hardware
  - Latency hiding using the parallelism
- Single-chip designs.



#### "Silicon Budget" in CPU and GPU



## Xeon X5550: 4 cores **731M** transistors




## GTX480: 480 cores **3,200M** transistors



#### Floorplans comparison





#### GPU – nVidia Kepler

#### CPU - Core i7



# Very Efficient For

- Fast Parallel Floating Point Processing
- Single Instruction Multiple Data Operations
- High Computation per Memory Access

# Not As Efficient For

- Double precision (the situation is improving)
- Logical Operations on Integer Data
- Branching-Intensive Operations
- Random Access, Memory-Intensive Operations





## Programmable stream processor

- Huge number of ALUs
- Huge memory bandwidth
- Programming was painful
  - GLSL (OpenGL Shading Language)
  - Requires deep understanding of computer graphics
  - Huge application speedups when done correctly

# CUDA/OpenCL

- C-like code
- Massively multi-threaded
- Simple to port existing code (but not to get good performance)





CUDA – Single Instruction Multiple Threads

#### Example code: vector addition (C = A + B)

CPU code

GPU code
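The CPU and GPU listings from this slide are not reproduced here. As a stand-in, the following plain-Python sketch emulates the SIMT decomposition: the CPU version loops over all elements, while the "kernel" handles exactly one element chosen from its block and thread indices, as a CUDA thread would with `blockIdx.x * blockDim.x + threadIdx.x`. The names `vec_add_cpu`, `vec_add_kernel`, and `launch` are hypothetical, and the serial `launch` loop only stands in for hardware that would run the threads in parallel.

```python
def vec_add_cpu(a, b):
    """CPU style: one thread walks the whole array."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    """GPU style: every thread runs this body and handles one element."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(c):                           # guard threads past the end
        c[i] = a[i] + b[i]

def launch(kernel, n, block_dim, *args):
    """Emulated kernel launch: enough blocks to cover n elements."""
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

if __name__ == "__main__":
    a = [float(i) for i in range(10)]
    b = [2.0 * i for i in range(10)]
    c = [0.0] * 10
    launch(vec_add_kernel, len(c), 4, a, b, c)
    print(c == vec_add_cpu(a, b))
```

The bounds check matters because the grid is rounded up to whole blocks, so the last block usually contains threads with no element to process.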





#### Achieving Performance in CUDA

- Almost all C code will compile as CUDA code
  - But will run slower
  - Single-threaded operation is ~50x slower than CPU code
- Must expose parallelism
- Careful with memory accesses
  - Thread scheduling helps hide memory access latency
  - But even this runs out

#### Moving target

- Performance optimizations are strongly HW and SW platform dependent

#### Can make a huge difference

- 100x or even more





## PacketShader

A GPU-Accelerated Software Router



#### High Performance Software Router

## Work by Sangjin Han, Keon Jang, KyoungSoo Park and Sue Moon

- Advanced Networking Lab, CS, KAIST
- Networked and Distributed Computing Systems Lab, EE, KAIST
- 40 Gbps packet forwarding in a single box
  - IPv4, 64B packets
  - Bigger packet sizes bounded by PCI-e bandwidth
- 20 Gbps IPsec tunneling
  - For 1024B packets
  - 10 Gbps for 64B packets



# Despite its name, not limited to IP routing

- You can implement whatever you want on it.

# Driven by software

- Flexible
- Friendly development environments

# Based on commodity hardware

- Cheap
- Fast evolution



#### Now 10 Gigabit NIC is a commodity

## From \$200 – \$300 per port

Great opportunity for software routers





#### Achilles' Heel of Software Routers

## Low performance

Due to CPU bottleneck

| Year | Ref.                             | H/W                             | IPv4 Throughput |
|------|----------------------------------|---------------------------------|-----------------|
| 2008 | Egi et al.                       | Two quad-core CPUs              | 3.5 Gbps        |
| 2008 | "Enhanced SR"<br>Bolla et al.    | Two quad-core CPUs              | 4.2 Gbps        |
| 2009 | "RouteBricks"<br>Dobrescu et al. | Two quad-core CPUs<br>(2.8 GHz) | 8.7 Gbps        |

## Not capable of supporting even a single 10G port

#### Per-Packet CPU Cycles for 10G



(in x86, cycle numbers are from RouteBricks [Dobrescu09] and PacketShader)



(Figure: per-packet CPU cycle breakdown; the surviving totals are 2,800 and 6,600 cycles)
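To put per-packet totals such as 2,800 and 6,600 cycles in context, it helps to work out how many CPU cycles are available per packet at 10 Gbps. The sketch below assumes minimum-size 64-byte Ethernet frames, the standard 20 bytes of per-frame wire overhead (preamble plus inter-frame gap), and the eight 2.66 GHz cores of the evaluation machine described later in the deck. The function names are illustrative.

```python
ETH_OVERHEAD_BYTES = 20  # 8 B preamble/SFD + 12 B inter-frame gap per frame

def packets_per_second(link_bps, frame_bytes):
    """Maximum frame rate of an Ethernet link for a given frame size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)

def cycles_per_packet(cores, core_hz, pps):
    """CPU cycles available per packet if all cores share the load evenly."""
    return cores * core_hz / pps

if __name__ == "__main__":
    pps = packets_per_second(10e9, 64)           # one 10 GbE port, 64 B frames
    budget = cycles_per_packet(8, 2.66e9, pps)   # two quad-core 2.66 GHz CPUs
    print(f"{pps / 1e6:.2f} Mpps, {budget:.0f} cycles per packet")
```

Under these assumptions the budget is a little over 1,400 cycles per packet, so a workload that needs 2,800 or 6,600 cycles cannot keep up with even one 10G port on the CPU alone.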

#### PacketShader Approach 1: I/O Optimization



Allocating SKBs – 50% of CPU time




#### PacketShader Approach 2: GPU Offloading



## GPU Offloading for

- Memory-intensive or
- Compute-intensive operations
- Main topic of this talk









# **GPU FOR PACKET PROCESSING**






#### Advantages of GPU for Packet Processing

1. Raw computation power
2. Memory access latency
3. Memory bandwidth

- Comparison between
  - Intel X5550 CPU
  - NVIDIA GTX480 GPU



## (1/3) Raw Computation Power

## Compute-intensive operations in software routers

- Hashing, encryption, pattern matching, network coding, compression, etc.
- GPU can help!





#### (2/3) Memory Access Latency

## Software router → lots of cache misses

- GPU can effectively hide memory latency







#### (3/3) Memory Bandwidth

## CPU's memory bandwidth (theoretical): 32 GB/s

## CPU's memory bandwidth (<u>empirical</u>) < 25 GB/s

(Figure: packet data path through memory; the surviving label is "4. TX: RAM → NIC")

## Your budget for packet processing can be less than 10 GB/s

## **GPU's memory bandwidth: 174 GB/s**







# HOW TO USE GPU




#### **Basic Idea**



## Offload core operations to GPU (e.g., forwarding table lookup)



## For GPU, more parallelism, more throughput



#### GTX480: 480 cores



### The key insight

- Stateless packet processing = parallelizable
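The point about statelessness can be illustrated with a small forwarding-table lookup. In the sketch below each lookup only reads the table and depends on nothing but its own packet, so a batch can be processed in any order, or concurrently, with the same result. The table contents, the function names, and the use of a thread pool as a stand-in for GPU threads are all assumptions for illustration; this is not PacketShader's lookup code.

```python
import ipaddress
from concurrent.futures import ThreadPoolExecutor

def make_table(entries):
    """Build a hypothetical table: (prefix bits, prefix length) -> output port."""
    table = {}
    for cidr, port in entries:
        net = ipaddress.IPv4Network(cidr)
        bits = int(net.network_address) >> (32 - net.prefixlen)
        table[(bits, net.prefixlen)] = port
    return table

def lookup(table, dst):
    """Longest-prefix match for one packet; reads the table, writes nothing."""
    addr = int(ipaddress.IPv4Address(dst))
    for length in range(32, -1, -1):          # try the longest prefix first
        port = table.get((addr >> (32 - length), length))
        if port is not None:
            return port
    return None

TABLE = make_table([("10.0.0.0/8", 1), ("10.1.0.0/16", 2), ("0.0.0.0/0", 0)])
BATCH = ["10.1.2.3", "10.9.9.9", "192.0.2.1", "10.1.255.1"]

# One packet at a time, as a single CPU thread would do it.
sequential = [lookup(TABLE, dst) for dst in BATCH]

# The same batch with lookups running concurrently; results keep batch order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(lambda dst: lookup(TABLE, dst), BATCH))

print(sequential == parallel)
```

Stateful processing, where one packet's result depends on an earlier packet, would not decompose this cleanly.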





#### 2. Parallel Processing in GPU

Fast link = enough # of packets in a small time window

## 10 GbE link

- up to 1,000 packets in only 67 µs

## Much less time with 40 or 100 GbE
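The 67 µs figure can be checked with a short calculation, assuming minimum-size 64-byte frames and the usual 20 bytes of Ethernet preamble and inter-frame gap per frame. The function name and default arguments are illustrative.

```python
def batch_window_us(link_gbps, packets=1000, frame_bytes=64):
    """Microseconds a saturated link needs to deliver `packets` frames."""
    wire_bits = (frame_bytes + 20) * 8        # frame plus preamble and gap
    return packets * wire_bits / (link_gbps * 1e9) * 1e6

if __name__ == "__main__":
    for gbps in (10, 40, 100):
        print(f"{gbps} GbE: {batch_window_us(gbps):.2f} us per 1,000 packets")
```

A batch large enough to occupy hundreds of GPU threads therefore accumulates within tens of microseconds on a busy 10 GbE link, and proportionally faster on 40 or 100 GbE.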





# PACKETSHADER DESIGN






#### Three stages in a streamline





#### Packet's Journey (1/3)

#### IPv4 forwarding example





#### Packet's Journey (2/3)

#### IPv4 forwarding example







#### Packet's Journey (3/3)

#### IPv4 forwarding example





### Interfacing with NICs





### Scaling with a Multi-Core CPU





### Scaling with Multiple Multi-Core CPUs







# **EVALUATION**




#### Hardware Setup





#### Quad-core, 2.66 GHz



#### **Dual-port 10 GbE**





480 cores, 1.4 GHz



#### **Total 8 CPU cores**

#### **Total 80 Gbps**

#### **Total 960 cores**

#### **Experimental Setup**







8 × 10 GbE links

### **Packet generator** (Up to 80 Gbps)





#### **PacketShader**

#### Example 1: IPv6 forwarding

Longest prefix matching on 128-bit IPv6 addresses

- Algorithm: binary search on hash tables [Waldvogel97]
  - 7 hashings + 7 memory accesses

(Figure: throughput (Gbps) vs. packet size (bytes), CPU-only vs. CPU+GPU; annotated "Bounded by motherboard IO capacity")

(Routing table was randomly generated with 200K entries)
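The idea behind binary search on hash tables for the IPv6 lookup can be sketched as follows: keep one hash table per prefix length and binary-search over the lengths, so a 128-bit lookup needs roughly log2(128) = 7 hash probes instead of up to 128. This is a simplified illustration in the spirit of [Waldvogel97], not PacketShader's implementation: the function names are hypothetical, markers are built by brute force, and this plain variant can take one extra probe for the very longest prefixes.

```python
import ipaddress

W = 128  # IPv6 address width in bits

def route(cidr):
    """Key for a route: (top prefixlen bits of the address, prefixlen)."""
    net = ipaddress.IPv6Network(cidr)
    return (int(net.network_address) >> (W - net.prefixlen), net.prefixlen)

def probe_lengths(target_len):
    """Shorter lengths that binary search visits on its way to target_len."""
    lo, hi = 1, W
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == target_len:
            return
        if mid < target_len:
            yield mid            # a marker is needed here to steer the search
            lo = mid + 1
        else:
            hi = mid - 1

def best_real_prefix(routes, bits, length):
    """Next hop of the longest real route that is a prefix of this bit string."""
    for l in range(length, 0, -1):
        hop = routes.get((bits >> (length - l), l))
        if hop is not None:
            return hop
    return None

def build_tables(routes):
    """One hash table per length, holding real prefixes and markers."""
    tables = {}
    for bits, length in routes:
        entries = [(bits, length)]
        entries += [(bits >> (length - m), m) for m in probe_lengths(length)]
        for b, m in entries:
            tables.setdefault(m, {})[b] = best_real_prefix(routes, b, m)
    return tables

def lookup(tables, addr, default=None):
    """Return (next hop, number of hash probes) for a 128-bit integer address."""
    lo, hi, best, probes = 1, W, default, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        table = tables.get(mid, {})
        key = addr >> (W - mid)
        if key in table:
            if table[key] is not None:
                best = table[key]   # remember the best match so far
            lo = mid + 1            # hit: a longer match may exist
        else:
            hi = mid - 1            # miss: only shorter prefixes can match
    return best, probes

ROUTES = {route("2001:db8::/32"): "A",
          route("2001:db8:1::/48"): "B",
          route("2001:db8:1:2::/64"): "C"}
TABLES = build_tables(ROUTES)

if __name__ == "__main__":
    for text in ("2001:db8:1:2::1", "2001:db8:1:ffff::1", "2001:dead::1"):
        print(text, lookup(TABLES, int(ipaddress.IPv6Address(text))))
```

Each probe is one hash computation plus one memory access, and every packet's lookup is independent of the others, which is why this workload suits a processor that hides memory latency with many threads.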



#### Example 2: IPsec tunneling

ESP (Encapsulating Security Payload) tunnel mode with AES-CTR (encryption) and SHA1 (authentication)

(Figure: throughput (Gbps) vs. packet size (bytes); annotated "3.5x speedup")





| Year | Ref.                             | H/W                              | IPv4 Throughput |
|------|----------------------------------|----------------------------------|-----------------|
| 2008 | Egi *et al.*                     | Two quad-core CPUs               | 3.5 Gbps        |
| 2008 | "Enhanced SR"<br>Bolla *et al.*  | Two quad-core CPUs               | 4.2 Gbps        |
| 2009 | "RouteBricks"<br>Dobrescu *et al.* | Two quad-core CPUs<br>(2.8 GHz)  | 8.7 Gbps        |
| 2010 | PacketShader<br>(CPU-only)       | Two quad-core CPUs<br>(2.66 GHz) | 28.2 Gbps       |
| 2010 | PacketShader<br>(CPU+GPU)        | Two quad-core CPUs<br>+ two GPUs | 39.2 Gbps       |





#### Conclusions

## GPU

- A great opportunity for fast packet processing

## PacketShader

- Optimized packet I/O + GPU acceleration
- Scalable with the number of multi-core CPUs, GPUs, and high-speed NICs

## Current Prototype

- Supports IPv4, IPv6, OpenFlow, and IPsec
- 40 Gbps performance on a single PC



## Control plane integration

- Dynamic routing protocols with Quagga or Xorp
- Multi-functional, modular programming environment
  - Integration with Click? [Kohler99]

## Opportunistic offloading

- CPU at low load
- GPU at high load

## Stateful packet processing





## SSLShader

A GPU-Accelerated SSL encryption/decryption proxy











# Thank You



## Mellanox Connect. Accelerate. Outperform.™