Runtime partitioning of L1 memory in Mempool (M)

From iis-projects


Status: Available


MemPool [1] is an IIS-born many-core system with 256 Snitch cores and 1024 banks of shared, tightly coupled L1 data memory. Leveraging its hierarchical architecture, the system scales to TeraPool, a cluster of 1024 Snitch cores with 4096 banks of shared memory. TeraPool's massive parallel computing power is well suited to accelerating embarrassingly parallel tasks, such as matrix-matrix multiplication. Other kernels benefit less from large-scale parallelization: the best speed-up with respect to single-core execution is obtained when the algorithm runs in parallel on only a sub-set of the cores. In both cases, the kernels are optimized to avoid conflicts in the access to the memory interconnect resources, which would stall the processors' LSUs.

When MemPool and TeraPool are employed to execute composite algorithmic chains, different kernels may be allocated to different sub-sets of the processors in the cluster. With the sequential addressing scheme of MemPool and TeraPool, conflicts become more likely when kernels running at the same time access their respective data structures, which are allocated at the same addresses of L1 memory. This is, for instance, the case when different stages of the 5G base-band signal-processing chain are executed on the platform.


The goal of this project is to implement runtime partitioning of MemPool's and TeraPool's L1 memory, separating the memory regions in which the data structures of kernels running concurrently on the platform are allocated. The project is divided into two parts:

  • In the first part, you will adopt a software approach to the problem, working on dynamic allocation of data structures that are folded onto a sub-set of the cluster's banks instead of unrolling across the entire continuous addressing space.
  • In the second part, you will work on the platform's addressing scheme, making it runtime-programmable.

You will test the two approaches and compare them against the baseline continuous addressing scheme by concurrently executing signal-processing kernels typically used in base-band communication.


Character:
  • 10% Literature Review
  • 50% Software Design
  • 20% Hardware Design
  • 20% Evaluation & Documentation


Prerequisites:
  • Strong interest in computer architecture and memory systems
  • Experience in C/C++ programming
  • Experience with digital design in SystemVerilog as taught in VLSI I is appreciated


[1] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, “MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect,” in Proceedings of the 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021, pp. 701–706.