Transforming MemPool into a CGRA (M)

Figure 1: High-level block diagram of an example CGRA [7]

Overview

Status: Completed

Character

  • 40% Software
  • 40% RTL design
  • 20% Evaluation

Prerequisites

  • VLSI I
  • Experience with C

Introduction

Striving for high image quality, even on mobile devices, has led to an increase in pixel count in smartphone cameras over the last decade [1]. These image sensors, boasting tens of millions of pixels, produce a massive amount of data that must be processed as quickly as possible within a tight power envelope. While this processing is highly parallelizable, it requires specialized image signal processors (ISPs) that can exploit this high degree of parallelism to meet the timing and power constraints. One modern example of such an ISP is Google’s Pixel Visual Core, which contains eight image processing units, each consisting of 256 specialized processing elements, to achieve a combined performance of 3.28 TOPS [2].

At ETH, we are developing our own ISP called MemPool [3]. It boasts 256 lightweight 32-bit Snitch cores developed at ETH [4], which implement the RISC-V instruction set architecture (ISA), a modular and open ISA [5]. Despite its size, MemPool manages to give all 256 cores low-latency access to the shared L1 memory, with a maximum latency of only five cycles. Therefore, all cores can communicate efficiently, making MemPool suitable for various workloads and easy to program.

However, core kernels of modern workloads, such as computational photography and machine learning, map well to systolic array architectures. Google offers two prominent examples: the Pixel Visual Core used for image processing in their smartphones and the TPU used for deep learning [2], [6]. These architectures sacrifice flexibility to achieve the best performance in their specific domain. A possible implementation of a systolic array is the coarse-grained reconfigurable architecture (CGRA). As shown in Figure 1, these architectures implement a systolic array of configurable processing elements. They are often compared to FPGAs but trade some flexibility for higher clock frequencies, faster configuration times, and better energy efficiency [7].

Goal

The goal of this thesis is to extend MemPool so that it can also operate in a systolic mode. Specifically, you will build an overlay over the existing architecture that allows MemPool to function efficiently as a CGRA. Thanks to this duality of modes, the result is a flexible system that achieves very high throughput on systolic workloads while maintaining MemPool’s general-purpose programmability.

Project Description

There are two critical points in transforming MemPool into a CGRA: each cell’s configuration and the communication between cells. Since the Snitch cores in MemPool are individually programmable RISC-V cores, we can view them as the individual cells of the CGRA, and configuration can be done through programming. Since MemPool’s interconnect is more general than a CGRA’s, we can initially emulate a CGRA’s behavior on MemPool purely in software: using the cores’ L1 memory to hold queues, we can emulate the mesh interconnect with concurrent software queues. Therefore, this thesis will first focus on implementing a software layer emulating a CGRA’s behavior. In the next step, the hardware to make the communication efficient will be implemented, specifically an extension of the memory interface to implement efficient queues and a mesh network.
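
As a first intuition for this software emulation, consider the following minimal sketch of a single-producer/single-consumer ring buffer that could serve as one directed link of the emulated mesh. All names and sizes here are illustrative assumptions, not part of MemPool’s actual code base:

  #include <stdbool.h>
  #include <stdint.h>

  // One directed link of the emulated mesh: a single-producer/
  // single-consumer ring buffer placed in the shared L1 memory.
  // QUEUE_DEPTH is an illustrative choice, not a MemPool constant.
  #define QUEUE_DEPTH 8

  typedef struct {
    volatile uint32_t data[QUEUE_DEPTH];
    volatile uint32_t head; // written only by the consuming core
    volatile uint32_t tail; // written only by the producing core
  } spsc_queue_t;

  // Producer side: returns false if the queue is currently full.
  static inline bool queue_push(spsc_queue_t *q, uint32_t value) {
    uint32_t tail = q->tail;
    uint32_t next = (tail + 1) % QUEUE_DEPTH;
    if (next == q->head) return false; // full
    q->data[tail] = value;
    q->tail = next; // publish the element
    return true;
  }

  // Consumer side: returns false if the queue is currently empty.
  static inline bool queue_pop(spsc_queue_t *q, uint32_t *value) {
    uint32_t head = q->head;
    if (head == q->tail) return false; // empty
    *value = q->data[head];
    q->head = (head + 1) % QUEUE_DEPTH;
    return true;
  }

Because each index is written by exactly one core, such a queue needs no locks. On real hardware, memory-ordering fences or RISC-V atomic instructions may still be required for correctness, and this synchronization cost is precisely what the later hardware phases aim to eliminate.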

Implementing the software overlay

The first phase’s goal is to run systolic workloads on MemPool as if it were a 2D CGRA, but without changing the hardware. To this end, you will implement a software communication layer that maps each core onto a 2D grid and allows it to send data to its neighbors. The cores can be synchronized using atomic instructions. With this layer, you will evaluate how to assign each core an XY coordinate in the grid and which hardware features are required to implement efficient queues on the L1 memory. This step also includes an investigation of how to program the CGRA and writing a few benchmarks to be used throughout the thesis.
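
To make the mapping step concrete, the sketch below assigns row-major XY coordinates on a 16 x 16 grid, which covers MemPool’s 256 cores. The assignment and the helper names are assumptions for illustration; evaluating the best mapping is part of this phase:

  #include <stdint.h>

  // Illustrative row-major mapping of a linear core ID onto a
  // 16 x 16 grid of emulated CGRA cells.
  #define GRID_W 16
  #define GRID_H 16

  typedef struct {
    uint32_t x;
    uint32_t y;
  } coord_t;

  static inline coord_t core_coord(uint32_t core_id) {
    coord_t c = {core_id % GRID_W, core_id / GRID_W};
    return c;
  }

  // Linear ID of the eastern neighbor, or -1 at the grid boundary.
  static inline int32_t east_neighbor(uint32_t core_id) {
    if (core_id % GRID_W == GRID_W - 1) return -1;
    return (int32_t)(core_id + 1);
  }

A physically aware mapping could instead keep neighboring cells within the same MemPool tile to minimize the latency of the emulated links; comparing such choices is one of the evaluations of this phase.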

Hardware-aided queues

To reduce the costly synchronization overhead of the software queues, you will extend MemPool’s memory controller to handle the synchronization. The queues will still use the L1 memory to hold the data, but with the insights you gained from the first phase, your hardware modification should cut down the synchronization overhead and latency to a minimum. Of course, you will have to verify the hardware you write and evaluate its impact on timing, area, and performance.
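
From the software side, such hardware-managed queues could then look like the sketch below, where a push or pop becomes a single, possibly stalling, memory operation. The primitives and their blocking semantics are hypothetical and only illustrate the intended usage, here for a multiply-accumulate cell of a systolic matrix multiplication:

  #include <stdint.h>

  // Hypothetical software view of hardware-managed queues: the
  // extended memory controller tracks head and tail itself, so
  // pushing or popping is a single memory access that stalls
  // while the queue is full or empty, respectively. These
  // primitives are assumptions for illustration, not an
  // existing MemPool interface.
  extern void hw_qpush(uint32_t queue_id, uint32_t value);
  extern uint32_t hw_qpop(uint32_t queue_id);

  // With hardware-managed queues, a systolic multiply-accumulate
  // cell reduces to a tight compute loop without any explicit
  // synchronization code.
  void mac_cell(uint32_t q_act, uint32_t q_psum, uint32_t q_out,
                uint32_t weight) {
    for (;;) {
      uint32_t act = hw_qpop(q_act);   // activation from the west
      uint32_t psum = hw_qpop(q_psum); // partial sum from the north
      hw_qpush(q_out, psum + weight * act); // forward result south
    }
  }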

Mesh interconnect

Once an efficient way for the cores to communicate is established, you can reduce the power consumption and potential congestion in the global interconnect by adding a specialized mesh interconnect. This interconnect allows each core to communicate directly with its neighbors over a very short distance, drastically reducing the power consumption and improving the system’s efficiency. Again, the impact of this network has to be analyzed in MemPool’s physical implementation. Furthermore, you will have to evaluate the performance benefit on the benchmarks.

Milestones

The following are the milestones that we expect to achieve throughout the project:

  • Implement a software layer emulating systolic communication.
  • Create a set of benchmarks that will serve as a baseline and comparison throughout the project.
  • Implement hardware support for efficient neighbor communication.
  • Extend MemPool with a mesh network.

Stretch Goals

Should you reach the above milestones earlier than expected and be motivated to do further work, we propose the following stretch goals:

  • Analyze different mesh network topologies, such as a torus network or multi-hop connections.
  • Evaluate different programming frameworks for efficiently implementing systolic workloads on your CGRA overlay.

Project Realization

Meetings

Weekly meetings will be held between the student and the assistants. The exact time and location of these meetings will be determined within the first week of the project to fit the student’s and the assistants’ schedules. These meetings will be used to evaluate the status and progress of the project. Besides these regular meetings, additional meetings can be organized to address urgent issues.

Weekly Reports

Master Thesis: The student is required to write a weekly report at the end of each week and to send it to their advisors by email. The idea of the weekly report is to briefly summarize the work, progress, and any findings made during the week, to plan the actions for the next week, and to discuss open questions and points. The weekly report is also an important means for the student to develop a goal-oriented attitude to work.

Coding Guidelines

HDL Code Style

Adopting a consistent code style is one of the most important steps in making your code easy to understand. If signals, processes, and modules are always named consistently, any inconsistency can be detected more easily. Moreover, if a design group shares the same naming and formatting conventions, all members immediately feel at home with each other’s code. At IIS, we use lowRISC’s style guide for SystemVerilog HDL: https://github.com/lowRISC/style-guides/.

Software Code Style

We generally suggest that you use style guides or code formatters provided by the language’s developers or community. For example, we recommend LLVM’s or Google’s code styles with clang-format for C/C++, PEP-8 and pylint for Python, and the official style guide with rustfmt for Rust.

Version Control

Even in the context of a student project, keeping a precise history of changes is essential to a maintainable codebase. You may also need to collaborate with others, adopt their changes to existing code, or work on different versions of your code concurrently. For all of these purposes, we heavily use Git as a version control system at IIS. If you have no previous experience with Git, we strongly advise you to familiarize yourself with the basic Git workflow before you start your project.

Report

Documentation is an important and often overlooked aspect of engineering. A final report has to be completed within this project.

English is the de facto common language of engineering. Therefore, the final report should preferably be written in English.

Any form of word-processing software is allowed for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX together with Inkscape or another vector drawing tool for block diagrams.

If you write the report in LaTeX, we offer an instructive, ready-to-use template, which can be forked from the Git repository at https://iis-git.ee.ethz.ch/akurth/iisreport.

Final Report

The final report has to be presented at the end of the project, and a digital copy needs to be handed in, which remains property of the IIS. Note that this task description is part of your report and has to be attached to your final report.

Presentation

There will be a presentation (20 min presentation and 5 min Q&A) at the end of this project to present your results to a wider audience. The exact date will be determined towards the end of the work.

Deliverables

In order to complete the project successfully, the following deliverables have to be submitted at the end of the work:

  • Final report incl. presentation slides
  • Source code and documentation for all developed software and hardware
  • Testsuites (software) and testbenches (hardware)
  • Synthesis and implementation scripts, results, and reports

References

[1] S. Skafisk, This is how smartphone cameras have improved over time. 2017 (accessed August 18, 2020).

[2] J. L. Hennessy and D. A. Patterson, Computer architecture: A quantitative approach. Elsevier, 2011.

[3] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, “MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect.” 2020.

[4] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A 10 kGE pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads.” 2020.

[5] A. Waterman et al., “The RISC-V instruction set manual.” 2014.

[6] K. Sato, What makes TPUs fine-tuned for deep learning? 2018 (accessed January 22, 2021).

[7] A. Podobas, K. Sano, and S. Matsuoka, “A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective,” IEEE Access, vol. 8, pp. 146719–146743, Apr. 2020.