
February 26, 2021

High Performance Programming: Offloading on Manycore Architecture – Let’s dive into technical implementation! (Part 2/2)

[Image: Kalray MPPA® DPU processor – Kalray and OpenCL™ conformance]

Our last post explained the business and technical reasons why we decided to adopt Software Open Standards for programming the MPPA® DPU intelligent processor and how OpenCL™ actually fits our hardware architecture.

Let’s take a deeper look into this implementation.

This post provides software developers with the initial technical details of the OpenCL™ 1.2 Embedded Profile implementation on the MPPA® DPU Coolidge™ processor. It is assumed that readers have OpenCL™ knowledge and as such will easily jump into MPPA® DPU programming!

Platform and Memory Mapping Details

The current MPPA® OpenCL™ support relies on POCL (Portable Computing Language) 1.5 and uses LLVM 10.0 as an OpenCL™-C compiler to implement the OpenCL™ 1.2 Embedded Profile.

As mentioned previously, we can represent the OpenCL™ platform concept as an acceleration configuration based on PCIe communications, offloading computations onto a single MPPA® processor implementing 5 compute clusters. Both on-line and off-line compilation are supported.

The MPPA® architecture features hierarchical computing resources and a non-uniform memory hierarchy. The OpenCL™ computing resources are mapped onto the MPPA® DPU processor as follows:

  • The OpenCL™ device is mapped onto a single MPPA® DPU processor, or onto a part of one comprising several compute clusters. A command queue is used to send jobs to the MPPA® OpenCL™ device through shared memory, via memory-mapped accesses or DMA.
  • The OpenCL™ compute unit is mapped onto a compute cluster of the MPPA® DPU processor. The compute clusters run in parallel and feature direct memory access to the main memory of the device.
  • Work-groups execute within compute clusters, and the work-items inside a work-group run in parallel on the PEs, with a parallelism degree of at most 16 (these limits can be queried as shown below).
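
These resource limits are visible through the standard OpenCL™ device-query API. A minimal sketch, where the device handle is assumed to come from clGetDeviceIDs and the expected values are assumptions based on the configuration described here:

cl_uint compute_units;
size_t  max_wg_size;
/* number of compute units: one per compute cluster (expected: 5 on this configuration) */
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof compute_units, &compute_units, NULL);
/* maximum work-group size: the 16 PEs of a compute cluster */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof max_wg_size, &max_wg_size, NULL);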

When it comes to memory mapping, it is straightforward (a short kernel sketch follows the list):

  • __local: buffers are allocated in the shared local memory of the compute cluster.
  • __global: buffers are allocated in the main memory, namely the DDR memory of the MPPA® board.
  • __private: buffers are allocated on the PE stack, which is stored in the shared local memory.
  • __constant: this address space is located in the main memory, i.e. the DDR memory of the MPPA® board.
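
To make the mapping concrete, here is a minimal, hypothetical kernel that touches each address space; the comments state where each object lands according to the rules above:

__constant float coeff = 0.5f;                 /* __constant: DDR main memory */

__kernel void scale(__global const float *in,  /* __global: DDR main memory */
                    __global float *out,
                    __local float *tile)       /* __local: cluster shared local memory */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float tmp;                                 /* __private: PE stack, in shared local memory */

    tile[lid] = in[gid];                       /* stage the input into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);
    tmp = tile[lid] * coeff;
    out[gid] = tmp;
}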

Execution Modes Details

How are the kernels actually executed in this context?

Several execution modes are available for optimizing application performance, depending on the application type and the requirements of the software developers.

One-Dimensional Vector

Here, an example shows how an OpenCL-C kernel maps onto the presented execution modes. The use-case implements a one-dimensional vector computation. The one-dimensional kernel, of global work size { vect_gx } and local work size { vect_lx }, is listed as follows:
void __kernel vector_mul(__global float *a, __global float *b, __global float *c)
{
    int i = get_global_id(0); /* global id (0) is dimension x */

    c[i] = a[i] * b[i];
}

size_t global_work_sizes[] = { vect_gx }; /* Global Work Dimension */
size_t local_work_sizes[] = { vect_lx };  /* Local Work Dimension */

clEnqueueNDRangeKernel(.., 1 /* Work dim */, NULL /* offset */, global_work_sizes,
    local_work_sizes, ..);

The configuration of the one-dimensional NDRange is shown below. The total number of work-items is 25 and the local work size is 5; thus, the number of work-groups is 25 / 5 = 5, as defined by the OpenCL™ standard.
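
For this configuration, the host-side sizes are simply:

size_t global_work_sizes[] = { 25 }; /* total number of work-items */
size_t local_work_sizes[]  = { 5 };  /* work-items per work-group: 25 / 5 = 5 work-groups */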

OpenCL-C execution: SPMD

The execution of OpenCL-C code lets the work-items inside a work-group execute in parallel in each compute cluster of the MPPA® DPU device. This execution mode is called SPMD (Single Program Multiple Data). The maximum number of work-items inside a work-group is 16, which is the number of PEs inside a compute cluster. More formally, the degree of parallelism of the work-items running inside a work-group is the size of the local work size { Lx, Ly, Lz }, where Lx ∗ Ly ∗ Lz must be less than or equal to 16. The work-group is then mapped onto a compute cluster.
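
For illustration, a sketch of host-side sizes that fully occupy the PEs in SPMD mode, assuming the 5-cluster configuration described earlier (queue and vector_mul_kernel are hypothetical handles):

size_t global_work_sizes[] = { 80 }; /* 5 work-groups of 16: one per compute cluster */
size_t local_work_sizes[]  = { 16 }; /* Lx * Ly * Lz = 16, matching the 16 PEs */
clEnqueueNDRangeKernel(queue, vector_mul_kernel, 1, NULL,
                       global_work_sizes, local_work_sizes, 0, NULL, NULL);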

C/C++ Dispatch with OpenCL™

Last but not least: you can even use our OpenCL™ implementation to deploy C or C++ code directly onto the MPPA® DPU clusters. This means you can work in the standard C and C++ languages and use OpenCL™ only as a code deployment infrastructure.

C/C++ dispatch with OpenCL™ allows the developer to program the MPPA® architecture at a low level. Hence, this feature lets the programmer use OpenCL™ to deploy one thread on the first PE of each compute cluster, and then use another programming model to parallelize inside the compute cluster itself. The other supported programming models are OpenMP and POSIX multi-threading (Pthreads).

As a reminder, the SPMD mode maps the work-group execution onto the compute cluster, and the local memory is shared by all PEs inside the compute cluster. Once the OpenCL™ kernel is set in SPMD mode, the local work size must be set to (1,1,1) to enable this dispatch mode: the OpenCL™ runtime then creates a single thread on the first PE of each compute cluster executing the NDRange.
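
A sketch of the corresponding host-side setup, under the same 5-cluster assumption (queue and dispatch_kernel are hypothetical handles):

/* one work-item per work-group, hence one thread per compute cluster */
size_t global_work_sizes[] = { 5 }; /* number of compute clusters to use */
size_t local_work_sizes[]  = { 1 }; /* i.e. (1,1,1): a single thread per cluster */
clEnqueueNDRangeKernel(queue, dispatch_kernel, 1, NULL,
                       global_work_sizes, local_work_sizes, 0, NULL, NULL);
/* inside the kernel, that single thread can fan out with OpenMP or Pthreads */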

Debug and Profiling

Now you know how to set up, configure and execute OpenCL™ kernels in the different modes. As with any software development environment, AccessCore® provides the capability to debug and trace your execution, to help with tuning and optimizing performance.

The OpenCL™ runtime implements traces and profiling information to let the programmer analyze the running application and identify its hot spots. Several traces and profiling facilities are available (a generic event-timing sketch follows the list):

  • Host OpenCL™ Application Tracing: Based on the LTTng open-source tracing framework, these OpenCL™ runtime-level traces allow the programmer to understand the execution flow of the host application

  • Host OpenCL™ Runtime Profiling Metrics: At the end of the execution, this provides the list of executed OpenCL™ kernels, the number of launches and the execution time for each of them, and a global breakdown of the entire OpenCL™ application across its kernels

  • MPPA® OpenCL™ Device Runtime Tracing: These trace points provide the programmer with execution information on MPPA® side such as: code relocation, memory allocation, asynchronous copies, and computation

  • MPPA® OpenCL™ Device User Application Tracing: It is possible to trace the user code executing on MPPA® side using MPPA® standard debugging tools and extensions to OpenCL™ we are providing (see below).
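
Beyond the MPPA®-specific facilities above, per-kernel timing can also be collected with the standard OpenCL™ event-profiling API. A minimal sketch, where context, device and kernel are assumed to have been created already:

#include <stdio.h>
#include <CL/cl.h>

/* time one kernel launch using standard OpenCL 1.2 event profiling */
static void time_kernel(cl_context ctx, cl_device_id dev,
                        cl_kernel krn, size_t gws, size_t lws)
{
    cl_int err;
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    cl_event evt;

    clEnqueueNDRangeKernel(q, krn, 1, NULL, &gws, &lws, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start, end; /* device timestamps, in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);
    printf("kernel time: %.3f ms\n", (double)(end - start) * 1e-6);

    clReleaseEvent(evt);
    clReleaseCommandQueue(q);
}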

OpenCL™ MPPA® Extensions

By default, and without using any MPPA® DPU specific extensions, standard OpenCL™ 1.2 (embedded profile) applications will be portable to other architectures.

But as each and every implementation of OpenCL™ for acceleration hardware offers OpenCL™ extensions to benefit from hardware-specific capabilities, Kalray proposes such extensions to exploit low-level features and optimize the performance of the MPPA® DPU processor.

Asynchronous Work Item Copies

This extension allows the programmer to use asynchronous copies at work-item level, in order to better handle explicit global memory copies at the finest granularity. Asynchronous work-item copies can be used to copy from/to global memory (a global buffer) to/from the local memory (a local buffer) or private memory (a private buffer).
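
For reference, standard OpenCL-C already offers this pattern at work-group granularity; the MPPA® extension applies the same idea per work-item (its exact primitive names are not reproduced here):

__kernel void stage_in(__global const float *src, __local float *buf)
{
    /* standard work-group-level asynchronous copy from global to local memory */
    event_t e = async_work_group_copy(buf,
                                      src + get_group_id(0) * get_local_size(0),
                                      get_local_size(0), 0);
    wait_group_events(1, &e); /* wait for the DMA transfer to complete */
    /* ... compute on buf ... */
}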

Asynchronous Copies 2D and 3D

A set of OpenCL-C primitives to handle 2-dimensional (2D) and 3-dimensional (3D) asynchronous data copies is available to optimize data-intensive applications with regard to DMA-based transactions. 2D asynchronous copies are useful for the block tiling classically found in image processing or matrix computation applications. 3D asynchronous copies are important for 3D stencil computations, or for solving multi-dimensional problems.
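
To see why this matters, here is a sketch of a 2D tile load written with only the standard 1D primitive (the tile and image dimensions are hypothetical); the 2D extension collapses this per-row loop into a single DMA transaction:

__kernel void load_tile(__global const float *img, __local float *tile,
                        int img_w, int tile_x, int tile_y)
{
    const int TILE_W = 32, TILE_H = 8; /* hypothetical tile size */
    event_t e = 0;

    /* one 1D asynchronous copy per tile row, all chained on one event */
    for (int row = 0; row < TILE_H; row++)
        e = async_work_group_copy(tile + row * TILE_W,
                                  img + (tile_y + row) * img_w + tile_x,
                                  TILE_W, e);
    wait_group_events(1, &e);
    /* ... compute on the tile ... */
}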

Asynchronous Copy Fence

These are new primitives addressing scoped fences for DMA-based transactions, to optimize data-intensive applications. An asynchronous work-group or work-item copy fence guarantees to the programmer the completion of outstanding asynchronous work-group or work-item copies in program order.

Default and Local Memory Size Configuration

The default local memory size may be set by the programmer at runtime using an MPPA® specific extension at device/sub-device level.

Mapping OpenCL™ Global Buffers in the Shared Local Memory

The MPPA® OpenCL™ implementation provides the programmer with an extension to create global OpenCL™ buffers directly inside the shared local memory of a compute cluster.

Runtime Cache Coherence Control

We provide an MPPA® DPU specific extension to control the cache maintenance before and after each kernel execution.

Native Functions with External User Libraries

MPPA® DPU native functions are externally linked functions that the programmer can provide to the OpenCL™ program compiler. These functions can be compiled by any other compiler (e.g., GCC) that respects the same function calling conventions.

Native Functions with External Runtime System Libraries

Inside MPPA® DPU native functions, the programmer may use functions from the following (a sketch is given after the list):

  • C/C++ Runtime
  • OpenMP Multi-threading
  • Low-level MPPA® Specific API
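
As a sketch of what such a native function might look like (the file name and function are hypothetical; it would be compiled separately, e.g. with GCC, and linked as described above):

/* native_scale.c -- hypothetical native function using OpenMP inside a cluster */
void scale_buffer(float *buf, int n, float k)
{
    /* OpenMP fans the loop out across the PEs of the compute cluster */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= k;
}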

Key Takeaways

We have seen here the basics of OpenCL™ as an introduction to Open Standards, the way it is mapped onto the MPPA® DPU, and the details of the implementation.

We have also seen that you can run different types of execution: SPMD, Linearized (for benchmarking), and even the capability to deploy C and C++ code onto the MPPA® DPU, together with an overview of the profiling capabilities and a description of the MPPA® DPU extensions.

With this, you have taken your first step into a familiar world: the MPPA® DPU deployment programming model.


And by the way…

This is not all we have for you: we have developed a framework based on this implementation of OpenCL™, the Kalray Framework (KAF™). Simply put, it is a software tool that helps you use OpenCL™, handling initialization, configuration, kernels and data management for dedicated usages such as CNN, math libraries or computer vision. So, if you are an application developer who doesn't want to get involved with OpenCL™, no worries: KAF™ is here. But that is for another blog… Stay tuned.