All the flavours of FFT on MemPool (1-2S/B)


Overview

Status: Available

Introduction

MemPool [1] is an IIS-born many-core system with 256 Snitch cores and 1024 banks of shared, tightly coupled L1 data memory. Leveraging its hierarchical architecture, we can scale the system to TeraPool, a cluster of 1024 Snitch cores with 4096 banks of shared memory. The massive parallel computing power and the low latency of shared-memory accesses in TeraPool are well suited to accelerating embarrassingly parallel tasks such as matrix-matrix multiplication. Kernels with irregular memory accesses, such as the Fast Fourier Transform (FFT), are harder to map efficiently.

In the framework of a project where MemPool accelerates 5G processing workloads, we have already implemented a high-performance Cooley-Tukey FFT [2], and we are now investigating different algorithmic strategies to execute up to 128 FFT tasks in less than 0.5 ms.

Project

The goal of this project is to implement and optimize different FFT kernels:

  • You will extend our work on the Cooley-Tukey FFT to different radices.
  • You will implement and optimize other FFT kernels (e.g. the six-step FFT) on MemPool and TeraPool.
  • You will add hardware extensions that specialize MemPool for executing the FFT and other key algorithms in wireless communications. Another option is to integrate a PULP FFT accelerator [3] into the MemPool Tile.

The different FFT implementations will be rigorously benchmarked. One possible reference is the FFT code generated by the Spiral project [4].

Character

  • 10% Literature Review
  • 50% Software Design
  • 20% Hardware Design
  • 20% Evaluation & Documentation

Prerequisites

  • Strong interest in computer architecture and signal processing
  • Experience in C/C++ programming
  • Experience with digital design in SystemVerilog as taught in VLSI I is appreciated

References

[1] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, “MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect,” in 2021 Design, Automation, and Test in Europe Conference and Exhibition (DATE), 2021, pp. 701–706.

[2] M. Bertuletti, Y. Zhang, A. Vanelli-Coralli, and L. Benini, “Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor,” https://arxiv.org/abs/2210.09196

[3] L. Bertaccini, L. Benini, and F. Conti, “To Buffer, or Not to Buffer? A Case Study on FFT Accelerators for Ultra-Low-Power Multicore Clusters,” in 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2021, pp. 1–8, doi: 10.1109/ASAP52443.2021.00008