
Vector-based Parallel Programming Optimization of Communication Algorithm (1-2S/B)

From iis-projects

Revision as of 12:18, 14 November 2022 by Yiczhang (talk | contribs)


Overview

Status: Available

Introduction

Flexible and scalable solutions will be needed for future communications processing systems. Vector processors provide an efficient means of exploiting data-level parallelism (DLP), which is heavily present in communications kernels. Spatz [1], a small and energy-efficient vector unit based on the RISC-V vector extension specification, was introduced to improve efficiency and performance. The lean Spatz Processing Element (PE) acts as an accelerator to a scalar core, making it a good candidate for achieving high hardware utilization and enabling scalability. Based on these metrics, we implement Spatz on the TeraPool architecture as our hardware platform: a scaled-up version of MemPool [2] with 1024 Snitch cores and 4096 banks of shared, tightly coupled L1 data memory. In this project, we will exploit the considerable DLP of typical baseband signal-processing kernels and implement them on this system [3].

Project

The goal of this project is to improve the performance and utilization of key baseband signal-processing kernels through vector-based SIMD parallel programming, and to identify the most efficient design point by exploring different vector hardware configurations. The project is divided into three parts:

  • In the first part, you will learn basic vector SIMD parallel programming using basic kernels of the 5G PUSCH algorithm. Based on our example kernel, you will not only learn vector programming but also understand the hardware architecture of the Spatz-based MemPool/TeraPool system.
  • In the second part, you will implement other key kernels in C, collect benchmark results, compare them against the scalar baseline, identify the reasons for any performance loss, and improve your kernels accordingly.
  • In the last part, you will improve the hardware architecture based on the bottlenecks you have identified, refining the addressing scheme and memory allocation to achieve the best performance and hardware utilization.

In this project, you will work on both the software and the hardware of a vector-based manycore SIMD architecture, creating and executing concurrent signal-processing kernels typically used in the field of baseband communication.

Character

  • 10% Literature Review
  • 60% Software Design
  • 20% Hardware Design
  • 10% Evaluation & Documentation

Weekly Reports

The student is required to write a weekly report at the end of each week and send it to their advisors by email. The weekly report should briefly summarize the work, progress, and any findings made during the week, plan the actions for the next week, and discuss open questions and points. For software benchmarks, we strongly recommend creating a Google Sheet and plotting the results so you can track them over time.

Report

Documentation is an important and often overlooked aspect of engineering. A final report has to be completed within this project.

The common language of engineering is English. Therefore, the final report should preferably be written in English.

Any form of word-processing software is allowed for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX together with Inkscape, Tgif, or another vector drawing tool for block diagrams. If you write the report in LaTeX, we offer an instructive, ready-to-use template, which can be downloaded here.


Prerequisites

  • Strong interest in manycore computer architecture and memory systems
  • Knowledge of vector architecture
  • Experience in C/C++ programming
  • Experience with digital design in SystemVerilog as taught in VLSI I is appreciated

References

[1] M. Cavalcante, D. Wüthrich, M. Perotti, et al., "Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters," arXiv preprint arXiv:2207.07970, 2022.

[2] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, "MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect," in 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021, pp. 701–706.


[3] M. Bertuletti, Y. Zhang, A. Vanelli-Coralli, and L. Benini, "Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor," arXiv preprint arXiv:2210.09196, 2022.