Computing In-Place FFTs with SIMD Lane Slicing

Research and Publications

We present an approach for implementing in-place FFTs on cores fitted with SIMD units and non-temporal load-store units. Loading the input samples with SIMD instructions decimates them in time across the SIMD lanes. A classic FFT implementation is extended to operate on SIMD data rather than scalar data, so that it computes the sub-transforms concurrently. This makes efficient use of the SIMD arithmetic and memory access instructions while requiring little SIMD lane shuffling. A final FFT stage then recombines the sub-transform results in place to produce the output. We illustrate this approach on a Cooley-Tukey radix-4 decimation-in-frequency FFT implementation. This implementation also integrates two further techniques: the collapsing of the two inner loops, as done in the TI C6x DSP_fft32x32 code, which enables software pipelining; and the Burrus technique for using bit reversal in high-radix FFT implementations. Performance is evaluated on the Kalray KV3 core, which implements a 64-bit vector-scalar VLIW architecture with level-1 cache bypass load instructions.
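The lane-slicing idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the implementation described above: it uses a recursive radix-2 decimation-in-time FFT instead of the radix-4 decimation-in-frequency one, it works out of place, and it models a SIMD register as a Python list of `lanes` complex numbers. The names `fft_rows`, `lane_sliced_fft`, and `dft` are hypothetical helpers introduced for this sketch.

```python
import cmath


def dft(x):
    """Reference O(n^2) DFT, used only to check the sketch."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_rows(rows):
    """Radix-2 FFT whose elements are lane vectors instead of scalars.

    Each row models one SIMD register. Every arithmetic step is applied
    to all lanes at once, so the per-lane sub-transforms are computed
    concurrently without any data moving between lanes.
    """
    m = len(rows)
    if m == 1:
        return [rows[0][:]]
    even = fft_rows(rows[0::2])
    odd = fft_rows(rows[1::2])
    out = [None] * m
    for k in range(m // 2):
        w = cmath.exp(-2j * cmath.pi * k / m)  # same twiddle for every lane
        t = [w * v for v in odd[k]]
        out[k] = [e + o for e, o in zip(even[k], t)]
        out[k + m // 2] = [e - o for e, o in zip(even[k], t)]
    return out


def lane_sliced_fft(x, lanes=4):
    """FFT of x via `lanes` concurrent sub-transforms plus one final stage.

    Assumes len(x) is a multiple of `lanes` and len(x) // lanes is a
    power of two.
    """
    n = len(x)
    m = n // lanes
    # A contiguous "SIMD load" of `lanes` samples puts x[r*lanes + l] in
    # lane l of row r, so lane l sees the time-decimated sequence x[l::lanes].
    rows = [x[r * lanes:(r + 1) * lanes] for r in range(m)]
    sub = fft_rows(rows)  # sub[k][l] is bin k of the FFT of x[l::lanes]
    out = [0j] * n
    for k1 in range(m):
        # Final stage: twiddle each lane, then a `lanes`-point DFT across lanes.
        tw = [sub[k1][l] * cmath.exp(-2j * cmath.pi * l * k1 / n)
              for l in range(lanes)]
        for k2 in range(lanes):
            out[k1 + m * k2] = sum(
                tw[l] * cmath.exp(-2j * cmath.pi * l * k2 / lanes)
                for l in range(lanes))
    return out
```

Only the final stage mixes values from different lanes, which is where lane shuffling would be needed on real hardware. Comparing `lane_sliced_fft(x, 4)` against `dft(x)` for a 16-sample input is a quick way to check the recombination indices.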
