Towards an Explicit RTM Stencil Computation Framework on Kalray TurboCard2

Reverse time migration (RTM) seismic imaging algorithms are very IO- and compute-intensive, and can benefit from accelerators such as GPGPUs or manycore processors.

One of the main concerns with accelerators is integrating legacy code with them while still leveraging the standard programming models the industry has relied upon: code written in FORTRAN and parallelized with MPI and OpenMP-3 is not always easy to port to accelerators.

Most of the time, the most computationally intensive part of algorithms such as RTM can be reduced to small parts of the code, mostly made up of loop nests, called kernels. The problem then becomes how to move (or offload) those kernels to the accelerators and how to integrate the offloaded kernels with the rest of the application.

To achieve this, some approaches, such as OpenACC and OpenMP-4, extend well-known programming models such as OpenMP-3 to support accelerator offloading. One shortcoming of this “#pragma-based” approach is that, being too high-level, it does not always make it possible to extract most of the accelerator's performance.
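As an illustration of the directive-based style, here is a minimal sketch of one sweep of a 2D 5-point stencil offloaded with OpenMP 4 `target` directives. The function name, the grid layout, and the coefficient `c` are assumptions made for this example and do not come from any RTM code; a real RTM kernel would typically be 3D and of higher order.

```c
#include <assert.h>

/* Hypothetical kernel: one sweep of a 2D 5-point stencil on an n x n grid
 * stored row-major. Boundary points of `out` are left untouched.
 * When compiled without OpenMP support the pragma is ignored and the
 * loop nest simply runs on the host. */
void stencil_step_offload(const float *in, float *out, int n, float c)
{
    assert(n >= 3);

    /* Copy `in` to the device; copy `out` both ways so that the boundary
     * values already stored in `out` survive the round trip. */
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: in[0:n*n]) map(tofrom: out[0:n*n])
    for (int i = 1; i < n - 1; ++i) {
        for (int j = 1; j < n - 1; ++j) {
            out[i * n + j] = in[i * n + j]
                + c * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                     + in[i * n + j - 1] + in[i * n + j + 1]
                     - 4.0f * in[i * n + j]);
        }
    }
}
```

The directive states what to offload and which arrays to move, but says nothing about how the grid is tiled or when data is reused on the device, which is where the approach can leave performance on the table.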

For example, explicit RTM schemes using stencil computation models are mostly IO-bound on current accelerator architectures and could benefit from strategies such as cache blocking or time skewing, but those optimization strategies need to be described explicitly for a specific architecture. These optimizations can be hard to implement, and they defeat the main goal of those approaches, which is to be accelerator-independent.
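To make the cache-blocking idea concrete, the following sketch tiles a 2D 5-point stencil sweep so that each tile's working set can stay in fast memory while it is processed. The tile size `TILE`, the function names, and the 2D stencil itself are hypothetical choices for illustration; the appropriate tile shape depends on the target architecture, which is exactly why such code is hard to keep accelerator-independent.

```c
#include <assert.h>

#define TILE 32  /* hypothetical tile edge; must be tuned per architecture */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Cache-blocked sweep of a 2D 5-point stencil on an n x n row-major grid.
 * The two outer loops walk over tiles; the two inner loops update the
 * interior points of one tile. Boundary points of `out` are not written. */
void stencil_step_blocked(const float *in, float *out, int n, float c)
{
    assert(n >= 3);

    for (int ii = 1; ii < n - 1; ii += TILE) {
        for (int jj = 1; jj < n - 1; jj += TILE) {
            int i_end = min_int(ii + TILE, n - 1);
            int j_end = min_int(jj + TILE, n - 1);

            for (int i = ii; i < i_end; ++i) {
                for (int j = jj; j < j_end; ++j) {
                    out[i * n + j] = in[i * n + j]
                        + c * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                             + in[i * n + j - 1] + in[i * n + j + 1]
                             - 4.0f * in[i * n + j]);
                }
            }
        }
    }
}
```

The blocked version computes the same values as the plain loop nest; only the traversal order changes. Time skewing goes one step further by tiling across time steps as well, and is not shown here.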

An alternative approach is to provide domain-specific languages or libraries that let scientists concentrate on the model itself while experts optimize for each platform. Several tools and libraries exist for stencil computation, such as ArrayLib, Pluto, or Pochoir, but they do not target accelerators and they can be intrusive.

To facilitate explicit stencil computations such as explicit RTM schemes on the new TurboCard2 accelerator, we decided to concentrate our efforts on a stencil library that abstracts and optimizes the domain decomposition and the data distribution on the accelerators directly from the host, while letting programmers implement their kernels using their usual tools and models.
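The kind of bookkeeping such a library would hide can be sketched as follows: splitting one grid dimension into subdomains and extending each with the halo that a stencil of a given radius needs to read. This is only an illustrative sketch; the type `tile_range` and the function `decompose_1d` are hypothetical names and are not the interface of the library described here.

```c
#include <assert.h>

/* Hypothetical description of one subdomain along a single dimension.
 * [own_begin, own_end) is the range the subdomain updates;
 * [halo_begin, halo_end) is the range it must receive to do so. */
typedef struct {
    int own_begin, own_end;
    int halo_begin, halo_end;
} tile_range;

/* Split n points into `parts` nearly equal contiguous ranges and return
 * range number p, widened by `radius` points on each side and clamped
 * to the grid. The first (n % parts) ranges get one extra point. */
tile_range decompose_1d(int n, int parts, int p, int radius)
{
    assert(n >= 0 && parts > 0 && p >= 0 && p < parts && radius >= 0);

    int base = n / parts;
    int rem  = n % parts;

    tile_range t;
    t.own_begin  = p * base + (p < rem ? p : rem);
    t.own_end    = t.own_begin + base + (p < rem ? 1 : 0);
    t.halo_begin = t.own_begin - radius < 0 ? 0 : t.own_begin - radius;
    t.halo_end   = t.own_end + radius > n ? n : t.own_end + radius;
    return t;
}
```

For instance, with `n = 10`, `parts = 4`, and `radius = 2`, subdomain 1 owns points 3 to 5 and needs points 1 to 7 from the host. Applying the same split along each axis gives a multidimensional decomposition, and the host can use the halo ranges to decide which data to send to each compute unit.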

Multicore platforms have become a common environment for Linux. The next stage is hundreds of cores on one chip, which brings its own specific challenges and opportunities. In this talk we present the state of the work in progress on porting Linux to a part of Kalray’s 256-core MPPA256 platform. We explain why only a part of the platform is targeted, cover the effort to bring Linux to the MPPA’s K1 VLIW core, and then describe the peripheral support, including the distributed device driver model, the multi-stage boot process, and performance-critical Network-on-Chip device support. This presentation is intended to give developers basic information on massive multicore systems and on the new subjects that must be addressed in Linux in the massive multicore world.