Overview

Status: In progress

Type: Bachelor or Semester Thesis
Professor: Prof. Dr. L. Benini
Supervisors:
- Marco Bertuletti: mbertuletti@iis.ee.ethz.ch
- Yichao Zhang: yiczhang@iis.ee.ethz.ch

Introduction

MemPool [1] is a many-core system developed at IIS, with 256 Snitch cores and 1024 banks of shared, tightly coupled L1 data memory. Leveraging its hierarchical architecture, the system can be scaled up to TeraPool, a cluster of 1024 Snitch cores with 4096 banks of shared memory. The large parallel computing power of TeraPool is well suited to accelerating embarrassingly parallel tasks, such as matrix-matrix multiplication. Other kernels benefit less from large-scale parallelization: their best speed-up with respect to single-core execution is obtained when the algorithm runs in parallel on a subset of the cores. In both cases, the kernels are optimized to avoid conflicts when accessing the memory interconnect resources, since such conflicts would stall the cores' load-store units (LSUs).

When MemPool and TeraPool execute composite algorithmic chains, different kernels may be allocated to subsets of the cores in the cluster. The sequential addressing scheme of MemPool and TeraPool makes conflicts more likely when kernels running at the same time access their respective data structures, which are allocated at the same addresses of L1 memory. This is the case, for instance, when different stages of the 5G baseband signal-processing chain are executed on the platform.
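
To make the conflict scenario concrete, the following is a minimal sketch of how two concurrently used buffers can end up sharing banks. It assumes a plain word-interleaved mapping in which consecutive 32-bit words fall into consecutive banks. This mapping, the word size, and the helper names `bank_of` and `shared_banks` are assumptions made for illustration and are not taken from the MemPool sources; only the bank count of 1024 comes from the text above.

```c
#include <stdint.h>

#define NUM_BANKS  1024u /* MemPool bank count, as stated in the text */
#define WORD_BYTES 4u    /* assumption: 32-bit words */

/* Assumed mapping: consecutive words land in consecutive banks. */
static uint32_t bank_of(uint32_t byte_addr) {
  return (byte_addr / WORD_BYTES) % NUM_BANKS;
}

/* Hypothetical helper: count the banks touched by both buffers.
 * base_a/base_b are byte addresses, words_a/words_b are lengths in words. */
static uint32_t shared_banks(uint32_t base_a, uint32_t words_a,
                             uint32_t base_b, uint32_t words_b) {
  uint8_t used_a[NUM_BANKS] = {0};
  uint8_t used_b[NUM_BANKS] = {0};

  /* Mark every bank that holds at least one word of each buffer. */
  for (uint32_t w = 0; w < words_a; ++w) {
    used_a[bank_of(base_a + w * WORD_BYTES)] = 1;
  }
  for (uint32_t w = 0; w < words_b; ++w) {
    used_b[bank_of(base_b + w * WORD_BYTES)] = 1;
  }

  /* A bank is shared if both buffers have data in it. */
  uint32_t shared = 0;
  for (uint32_t b = 0; b < NUM_BANKS; ++b) {
    shared += (uint32_t)(used_a[b] & used_b[b]);
  }
  return shared;
}
```

Under these assumptions, two 2048-word buffers placed at byte addresses 0x0000 and 0x2000 each cover all 1024 banks, so `shared_banks(0x0000, 2048, 0x2000, 2048)` returns 1024: every bank can be a point of contention between the two kernels.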

Project

The goal of this project is to implement runtime partitioning of the L1 memory of MemPool and TeraPool, separating the memory regions in which the data structures of concurrently running kernels are allocated. The project is divided into two parts:

In the first part, you will adopt a software approach to the problem, working on the dynamic allocation of data structures that are folded onto a subset of the cluster banks instead of being unrolled across the entire contiguous address space.
In the second part, you will work on the platform addressing scheme, making it runtime-programmable.
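
As an illustration of the folding idea in the first part, the sketch below maps the logical index of an array element to a word index in L1 so that the whole array stays within a chosen group of banks. It assumes word-interleaved banks, where the bank is the word index modulo the number of banks. The helper names `folded_word_index` and `bank_of_word` and the parameters `first_bank`, `sub_banks`, and `base_row` are hypothetical; this is one possible way to express the folding, not the project's allocator.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 1024u /* MemPool bank count, as stated in the text */

/* Hypothetical helper: word index in L1 of logical element i of an array
 * folded onto banks [first_bank, first_bank + sub_banks).
 * base_row selects the first row (one word per bank) used by the array. */
static uint32_t folded_word_index(uint32_t i, uint32_t first_bank,
                                  uint32_t sub_banks, uint32_t base_row) {
  assert(sub_banks > 0 && first_bank + sub_banks <= NUM_BANKS);
  uint32_t row = base_row + i / sub_banks; /* wrap to the next row ...     */
  uint32_t col = first_bank + i % sub_banks; /* ... inside the bank group. */
  return row * NUM_BANKS + col;
}

/* Assumed mapping: consecutive words land in consecutive banks. */
static uint32_t bank_of_word(uint32_t word_index) {
  return word_index % NUM_BANKS;
}
```

For example, with `first_bank = 256` and `sub_banks = 128`, element 128 is placed at word index 1280, which is again bank 256: the array wraps to the next row instead of spilling into the banks of another kernel. The cost is that element addresses are no longer contiguous, so kernels have to index through such a mapping or through a strided layout.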

You will test the two approaches and compare them against the baseline contiguous addressing scheme by concurrently executing signal-processing kernels typically used in baseband communication.

Character

10% Literature Review
50% Software Design
20% Hardware Design
20% Evaluation & Documentation

Prerequisites

Strong interest in computer architecture and memory systems
Experience in C/C++ programming
Experience with digital design in SystemVerilog as taught in VLSI I is appreciated

References

[1] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, “MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 701–706.

Runtime partitioning of L1 memory in Mempool (M)