Vector-based Parallel Programming Optimization of Communication Algorithm (1-2S/B)

From iis-projects


Overview

Status: Available

Introduction

Flexible and scalable solutions will be needed for future communications processing systems. Vector processors provide an efficient means of exploiting data-level parallelism (DLP), which is abundant in communications kernels. Spatz [1], a small and energy-efficient vector unit based on the RISC-V vector extension specification, was introduced to improve efficiency and performance. Spatz's lean Processing Element (PE) acts as an accelerator to a scalar core, making it a good candidate for achieving high hardware utilization and enabling scalability. Based on these properties, we implement Spatz on the TeraPool architecture as our hardware platform, a scaled-up version of MemPool [2] with 1024 Snitch cores and 4096 banks of shared, tightly coupled L1 data memory. In this project, we will exploit the considerable DLP of baseband signal processing and implement its typical kernels [3].
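To make the notion of DLP concrete, the sketch below shows an element-wise complex multiplication, a building block of baseband kernels such as channel matched filtering. The Q15 fixed-point complex type and function name are illustrative, not taken from the project codebase; the point is that every loop iteration is independent, so a vector unit like Spatz can process many elements per instruction.

```c
#include <stdint.h>

/* Illustrative Q15 fixed-point complex sample (names are hypothetical). */
typedef struct { int16_t re; int16_t im; } cplx16_t;

/* Element-wise complex product c[i] = a[i] * b[i].
 * Each iteration is independent of all others, exposing full
 * data-level parallelism to a vector unit. */
void cmul_vec(const cplx16_t *a, const cplx16_t *b, cplx16_t *c, int n) {
    for (int i = 0; i < n; ++i) {
        int32_t re = (int32_t)a[i].re * b[i].re - (int32_t)a[i].im * b[i].im;
        int32_t im = (int32_t)a[i].re * b[i].im + (int32_t)a[i].im * b[i].re;
        c[i].re = (int16_t)(re >> 15);  /* renormalize Q15 * Q15 -> Q15 */
        c[i].im = (int16_t)(im >> 15);
    }
}
```

On Spatz, a loop of this shape maps naturally onto vector multiply and multiply-subtract instructions, which is why such kernels are good benchmarking targets.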

Project

This project aims to improve the performance and utilization of key baseband signal-processing kernels through vector-based SIMD parallel programming, and to find the most efficient configuration by exploring different vector hardware parameters. The project is divided into three parts:

  • In the first part, you will learn vector SIMD parallel programming using example kernels of the 5G PUSCH algorithm. Alongside the programming model, you will become familiar with the hardware architecture of the Spatz-based MemPool/TeraPool system.
  • In the second part, you will implement the key kernels of the 5G PUSCH algorithm in C, collect benchmark results, compare them against the scalar baseline, identify the causes of performance loss, and improve the kernels' performance.
  • In the last part, you will work on hardware architecture improvements based on the bottlenecks you identified, refining the addressing scheme and memory allocation to achieve the best performance and hardware utilization.

In this project, you will work on both the software and hardware of a vector-based manycore SIMD architecture, creating and concurrently executing signal-processing kernels typical of baseband communication.
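A central pattern you will encounter when vectorizing these kernels is strip-mining: processing an arbitrarily long array in chunks of at most the hardware vector length. The sketch below shows the control flow in plain C for an AXPY-style kernel; the `VLMAX` constant and the per-chunk inner loop stand in for what the RISC-V vector extension's `vsetvli` instruction and a single vector operation would do in real Spatz code.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8  /* illustrative maximum vector length in elements */

/* Strip-mined y[i] += alpha * x[i]: the array is consumed in chunks of
 * at most VLMAX elements, mirroring the RVV loop in which vsetvli
 * grants a vector length vl <= VLMAX per iteration. This is a plain-C
 * sketch of the pattern, not actual vector code. */
void axpy_stripmine(int32_t alpha, const int32_t *x, int32_t *y, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VLMAX) ? (n - i) : VLMAX;  /* granted length */
        for (size_t j = 0; j < vl; ++j)  /* one "vector" operation's worth */
            y[i + j] += alpha * x[i + j];
        i += vl;  /* advance by the granted length, handles the tail */
    }
}
```

Because the granted length is re-queried every iteration, the same loop handles any `n`, including the tail shorter than `VLMAX`, without a separate cleanup loop; this is the structure you will see throughout the vectorized PUSCH kernels.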

Character

  • 10% Literature Review
  • 60% Software Design
  • 20% Hardware Design
  • 10% Evaluation & Documentation

Weekly Reports

The student is required to write a weekly report at the end of each week and send it to their advisors by email. The weekly report should briefly summarize the work, progress, and any findings made during the week, plan the actions for the next week, and raise open questions and discussion points. For software benchmarks, we strongly recommend tracking the results in a Google Sheet and plotting them.

Report

Documentation is an essential and often overlooked aspect of engineering. A final report has to be completed within this project.

The common language of engineering is English. Therefore, the final report should preferably be written in English.

Any form of word processing software is allowed for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX, with Inkscape, Tgif, or any other vector drawing software for block diagrams. If you write the report in LaTeX, we offer an instructive, ready-to-use template, which can be downloaded here.


Prerequisites

  • Strong interest in manycore computer architecture and memory systems
  • Knowledge of vector architecture
  • Experience in C/C++ programming
  • Experience with digital design in SystemVerilog as taught in VLSI I is appreciated

References

[1] M. Cavalcante, D. Wüthrich, M. Perotti et al., "Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters," arXiv preprint arXiv:2207.07970, 2022, https://arxiv.org/abs/2207.07970

[2] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, "MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect," in 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021, pp. 701–706.

[3] M. Bertuletti, Y. Zhang, A. Vanelli-Coralli, and L. Benini, "Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor," arXiv preprint arXiv:2210.09196, 2022, https://arxiv.org/abs/2210.09196