# Intelligent Data Processing, From Cloud to Edge




# **MPPA<sup>®</sup> DPU Coolidge**<sup>™</sup>

A new class of processors specialized in intelligent data processing for data center infrastructure, compute, AI acceleration and edge applications.

## High-performance computing

• Up to 1.5 TFLOPs (SP)/192 GFLOPs (DP)

# **Power efficiency**

• As low as 30W

#### AI acceleration

• Up to 25 TFLOPs (16 bits)/50 TOPs (8 bits)

# High-speed I/O & interfaces

- Up to 18GB/s
- 12M IOPS
- Latency as low as 30 µs
- 2x100GbE

#### **Real-time data processing**

- Massive parallel processing
- 80 cores
- 6-issue VLIW
- Ultra-low latency

# Massive parallel multitask processing

Scalable 80-core DPU processor

# **Fully programmable**

- Open standards: C/C++, Linux, RTOS, POSIX
- Kalray SDK based on standard tools & APIs, for developing new applications and porting existing ones.

#### Security/safety

- Hardware root of trust
- Secure boot
- Accelerated cryptographic functions (optional)

The Coolidge™ 1 and 2 **Data Processing Units (DPU)** are third-generation processors based on Kalray's **Massively Parallel Processor Array (MPPA®)** architecture. Kalray DPUs natively manage multiple workloads in parallel with no bottlenecks, enabling smarter, more efficient, and more energy-conscious data-intensive applications.

The DPU is a compelling alternative to GPUs, ASICs, and FPGAs, offering distinct advantages and adaptability across data-intensive applications, from data centers to edge and embedded systems.

# Next-generation data center infrastructure & 5G

Data center infrastructure chip for seamless integration onto PCIe Gen4 cards, for use cases including I/O controllers, storage initiator and target controllers, and high-speed network processing offload.

- x86 offload or stand-alone "CPU-free" applications
- Compatible with containerized, virtualized, and bare metal infrastructures
- Fully programmable with dynamic distribution of resources across data and control & management planes

#### Fully programmable acceleration of high-performance protocols, services & QoS

- Enhanced support for NVMe-oF, RoCE/RDMA, TCP/IP, NVMe
- Intelligent load-balancing, priority-based flow control, and stateless L1-L4 parsing
- High-speed data protection services for clustered and fully distributed applications
- Line rate erasure coding (Reed-Solomon) per cluster
- Line-rate encryption/decryption/hash (IPSEC, TLS, XTS, MACsec)
- Acceleration for open RAN L1\*
- AI functionality for insightful analytics and adaptive configuration
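As a rough illustration of the load-balancing item above, the sketch below contrasts hash-based and round-robin dispatch of packets to worker cores. It is a generic host-side Python example, not Kalray's implementation or SDK API: the `Flow` tuple, the function names, and the choice of CRC-32 as the hash are all assumptions made for illustration. The worker count of 16 simply mirrors the number of application cores per cluster.

```python
import itertools
import zlib
from typing import Callable, NamedTuple


class Flow(NamedTuple):
    """Hypothetical 5-tuple identifying a network flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def hash_dispatch(flow: Flow, num_workers: int) -> int:
    """Pick a worker from a hash of the flow, so one flow always maps to one worker."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_workers


def round_robin_dispatcher(num_workers: int) -> Callable[[object], int]:
    """Return a dispatcher that hands out workers in strict rotation, ignoring the flow."""
    rotation = itertools.cycle(range(num_workers))
    return lambda _flow: next(rotation)


flows = [
    Flow("10.0.0.1", "10.0.0.9", 40000, 4420, "tcp"),
    Flow("10.0.0.2", "10.0.0.9", 40001, 4420, "tcp"),
]
rr = round_robin_dispatcher(num_workers=16)
for flow in flows * 2:
    print(flow.src_ip, "hash ->", hash_dispatch(flow, 16), "| round-robin ->", rr(flow))
```

Hash dispatch keeps every packet of a flow on the same worker, which preserves per-flow ordering; round-robin spreads load evenly but may send packets of one flow to different workers.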

## Acceleration of compute-intensive applications

#### Acceleration of complex workloads

- Innovative, patented core and co-processor enhancement for machine learning inference
- Advanced computation for computer vision
- Signal processing (e.g., FFT), cryptography, mathematics

#### Development of autonomous intelligent embedded systems

- Compatibility with multiple operating systems (Linux, RTOS)
- Support for 'freedom from interference' for mixed criticality

#### Enable next-gen edge computing systems

- Real-time analytics for automation, prediction, and control
- Seamless incorporation into existing systems

|                                     | Coolidge™ 1 | Coolidge™ 2 |
|-------------------------------------|-------------|-------------|
| Processor architecture              | 64-bit / 32-bit | |
| Cores                               | 80 | |
| Clock speed                         | Up to 1GHz | |
| Instruction level parallelism (ILP) | 6-issue VLIW | |
| L1 cache                            | 16KB instruction cache/<br>16KB data cache | 32KB instruction cache/<br>32KB data cache |
| Floating point unit (FPU) standard  | IEEE 754-2008<br>Reciprocal square root operations in single-precision floating point<br>64-bit integer multiplication (asymmetric cryptography)<br>4 execution rings | |
| Load/store bandwidth                | 256 bits per cycle | |
| Co-processors                       | 80; 1 per core | |
| Applications                        | Acceleration of INT8, INT16 or FP16 accuracy | Acceleration of INT8, FP16 accuracy |
|                                     | High-performance acceleration for 32-bit floating-point transcendental functions | |
|                                     | Optimized matrix operations for deep learning and artificial intelligence | |
| MAC operations                      | Up to 128/cycle | Up to 256/cycle |
| System-on-chip (SoC)                | 5 clusters (total of 80 application cores + 5 management cores) | |
| Compute performance                 | Up to 640 GFLOPs (SP)/160 GFLOPs (DP) | Up to 1.5 TFLOPs (SP)/192 GFLOPs (DP) |
|                                     | Up to 2.5 TFLOPs (16 bits)/20 TOPs (8 bits) for deep learning | Up to 25 TFLOPs (16 bits)/50 TOPs (8 bits) |
| Cluster                             | 16 application cores + 1 management/security core | |
| L2 cache/TCM                        | 4 MB, 512GB/s | 8 MB, 600GB/s |
| PCIe interface                      | 16-lane PCIe Gen4 endpoint (EP) or root complex (RC)<br>Bifurcation up to 8 downstream ports in RC mode<br>SR-IOV up to 8 physical functions/248 virtual functions<br>Hot-plug support<br>Up to 512 DMAs for multi queues/kernel bypass drivers<br>NVMe emulation-virtualization | |
| Memory interface                    | 64-bit DDR4/LPDDR4-3200 channels with sideband/inline ECC | 64-bit DDR4-3200 channels with sideband/inline ECC |
|                                     | Up to two ranks per DDR4 channel | |
|                                     | 2 DDR channels (up to 32GB) with channel interleaving | |

|                                     | Coolidge™ 1 | Coolidge™ 2 |
|-------------------------------------|-------------|-------------|
| Network interface                   | 2x100GbE Ethernet | |
|                                     | 2x1/2x10/2x25/2x40/2x50/2x100 GbE | 8x1/8x10/8x25/4x40/4x50/2x100 GbE |
|                                     | Jumbo frame support (9.6KB) | |
|                                     | | Support for PTP/IEEE 1588v2 |
|                                     | Priority flow control (PFC), IEEE 802.1Qbb<br>Checksum offload header & payload<br>Line rate packet classification/smart load balancing | |
|                                     | Hash & round-robin based dispatch policy | Accelerated receive flow steering |
|                                     | RDMA over converged Ethernet (RoCE) v1 and v2 | |
| Security                            | Hardware root of trust<br>Secure boot with authentication & encryption<br>True random number generators (TRNG)<br>RSA, Diffie-Hellman, DSA, ECC, EC-DSA and EC-DH acceleration | |
| Cryptography accelerators (optional) | AES-128/192/256 (ECB/CBC/ICM/CTR/GCM/GMAC/CCM)<br>AES-XTS for storage applications<br>MD5/SHA-1, SHA-2, SHA-3<br>Kasumi/Snow 3G, ZUC | |
| Secure execution                    | Mixed criticality support<br>Lockable critical configuration<br>Memory and cache partitioning for non-interference & time-predictable execution<br>Configurable L1 cache coherency | |
| Management/control interfaces       | - | 1GbE management interface/RGMII |
|                                     | GPIOs/UARTs/SPI/I2C/CAN/PWM<br>SSI controller for serial NOR Flash with optional boot<br>SDCARD UHS-I / eMMC 4.51 memory controller<br>2x USB 2.0 OTG ULPI<br>JTAG IEEE 1149.1<br>16-bit parallel trace interface | |
| Compression/decompression acceleration | | RFC 1950 (zlib), RFC 1951 (deflate) and RFC 1952 (gzip) @100Gbps |
| Low-density parity check (LDPC) encoder and decoder accelerator | | Compliant with 3GPP TS 38.212, 5G NR FEC UL, 5G NR FEC DL |
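The compression row of the specification table names three formats: zlib, deflate, and gzip. The sketch below may help show why they are usually grouped together: all three carry the same DEFLATE bitstream and differ only in framing. It runs on a host CPU with Python's standard `zlib` and `gzip` modules and says nothing about how the DPU accelerator is programmed; the payload and the helper name `deflate` are illustrative assumptions.

```python
import gzip
import zlib

payload = b"intelligent data processing " * 64


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data with DEFLATE; wbits selects the container format."""
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return compressor.compress(data) + compressor.flush()


raw_stream = deflate(payload, wbits=-15)   # RFC 1951: bare DEFLATE, no header or trailer
zlib_stream = deflate(payload, wbits=15)   # RFC 1950: 2-byte header + Adler-32 trailer
gzip_stream = deflate(payload, wbits=31)   # RFC 1952: 10-byte header + CRC-32 and length trailer

for name, stream in [("deflate", raw_stream), ("zlib", zlib_stream), ("gzip", gzip_stream)]:
    print(f"{name:8s} {len(stream)} bytes")

# Each format decompresses back to the original payload.
assert zlib.decompress(raw_stream, -15) == payload
assert zlib.decompress(zlib_stream) == payload
assert gzip.decompress(gzip_stream) == payload
```

Because the compressed body is shared, an engine that handles DEFLATE only needs a thin layer of header and checksum handling to cover all three RFCs.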

# MPPA® DPU Coolidge™ Processor Block Diagram

Coolidge™ is composed of 5 clusters, each with 16 application cores and 1 core for management and security. It is built on 16 nm FinFET technology.
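The core counts above, together with the MAC and clock figures in the specification table, allow a back-of-the-envelope check of the quoted 8-bit throughput. The sketch below is only an estimate under stated assumptions that the datasheet does not confirm: the "MAC operations" figure is taken as per core and as applying to 8-bit data, each MAC is counted as two operations, and all cores run at the 1 GHz maximum clock.

```python
CLUSTERS = 5
APP_CORES_PER_CLUSTER = 16
MGMT_CORES_PER_CLUSTER = 1
CLOCK_HZ = 1e9  # "Up to 1GHz" from the specification table

app_cores = CLUSTERS * APP_CORES_PER_CLUSTER    # application cores across the chip
mgmt_cores = CLUSTERS * MGMT_CORES_PER_CLUSTER  # management/security cores across the chip


def peak_tops(macs_per_cycle_per_core: int,
              cores: int = app_cores,
              clock_hz: float = CLOCK_HZ) -> float:
    """Estimated peak tera-operations per second.

    Assumption: one multiply-accumulate counts as 2 operations.
    """
    return macs_per_cycle_per_core * 2 * cores * clock_hz / 1e12


print("application cores:", app_cores)
print("management cores: ", mgmt_cores)
print("Coolidge 1 estimate (128 MACs/cycle):", peak_tops(128), "TOPs")
```

Under these assumptions the Coolidge™ 1 estimate comes out at roughly 20 TOPs, in line with the table. The same assumptions with 256 MACs/cycle give about 41 TOPs, below the 50 TOPs quoted for Coolidge™ 2, so that figure presumably reflects conditions this sketch does not model.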

#### Need more performance?

Connect several MPPA® DPU processors together to reach the level of performance you need.



