Implementation of a Small and Energy-Efficient RISC-V-based Vector Accelerator (1M)

Latest revision as of 08:36, 22 October 2021

Overview

Status: In progress

  • Type: Master's Thesis
  • Supervisors:
      • Matheus Cavalcante: matheusd@iis.ee.ethz.ch
      • Matteo Perotti: mperotti@iis.ee.ethz.ch
      • Samuel Riedel: sriedel@iis.ee.ethz.ch

Prerequisites

  • VLSI I
  • SoCDAML (recommended)
  • Experience with SystemVerilog

Introduction

Striving for high image quality, even on mobile devices, has led to an increase in pixel count in smartphone cameras over the last decade. These image sensors, boasting tens of millions of pixels, create a massive amount of data to be processed on a tight power envelope as quickly as possible. While this processing is highly parallelizable, it requires specialized Image Signal Processors (ISPs) which can exploit this high degree of parallelism to meet the timing and power constraints. One modern example of such an ISP is Google's Pixel Visual Core, which contains eight image processing units, each consisting of 256 specialized processing elements to achieve a combined performance of 3.28 TOPS.

At ETH, we are developing our own many-core system called MemPool. It boasts 256 lightweight 32-bit Snitch cores. They implement the RISC-V instruction set architecture (ISA), a modular and open ISA. Despite its size, MemPool manages to give all 256 cores low-latency access to the shared L1 memory, with a zero-load latency of at most five cycles. Therefore, all cores can efficiently communicate, making MemPool suitable for various workloads.

Programming MemPool presents some challenges. Even though the scratchpad memory can be accessed within at most five cycles, memory banks close to the cores can be accessed with lower latency. It is therefore beneficial to keep the cores' accesses local, reducing the latency and the load on the global interconnect. We addressed this through a hybrid memory addressing scheme, which allocates each core's stack on a memory bank close to it, accessible within one cycle of latency.
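As a rough illustration of what such a hybrid scheme could look like, the sketch below maps byte addresses to memory banks. The tile count, the number of banks per tile, and the size of the per-tile sequential region are assumptions chosen for illustration, and bank_of and tile_of are hypothetical helper names rather than part of the MemPool code base.

```python
# Toy model of a hybrid address map (all constants are illustrative assumptions).
WORD_BYTES = 4            # 32-bit words
NUM_TILES = 64            # assumed number of tiles
BANKS_PER_TILE = 16       # assumed number of banks in each tile
NUM_BANKS = NUM_TILES * BANKS_PER_TILE
SEQ_BYTES_PER_TILE = 1024 # hypothetical size of each tile's sequential region
SEQ_REGION_BYTES = SEQ_BYTES_PER_TILE * NUM_TILES


def bank_of(addr: int) -> int:
    """Return the global bank index that serves the byte address `addr`."""
    word = addr // WORD_BYTES
    if addr < SEQ_REGION_BYTES:
        # Sequential region: a contiguous chunk stays inside one tile and is
        # interleaved only across that tile's banks (e.g. for a core's stack).
        tile = addr // SEQ_BYTES_PER_TILE
        return tile * BANKS_PER_TILE + word % BANKS_PER_TILE
    # Interleaved region: consecutive words are spread over all banks.
    return word % NUM_BANKS


def tile_of(bank: int) -> int:
    """Return the tile that contains the given global bank index."""
    return bank // BANKS_PER_TILE
```

In this model, every address of the fourth tile's sequential chunk resolves to a bank of that tile, while addresses in the interleaved region move to the next tile after every 16 words.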

We also explored programming MemPool as a systolic array, transforming it into a Coarse-Grained Reconfigurable Architecture (CGRA). This approach instantiates queues between cores, which favors communication between neighboring cores. Through the addition of special push and pop instructions, similar to Stream Semantic Registers (SSRs), we can also elide some memory loads and stores, alleviating the von Neumann bottleneck.
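The queue-based communication can be pictured with a small software model. The sketch below is only an analogy under stated assumptions: it chains hypothetical processing stages through FIFO queues and evaluates a polynomial with Horner's rule, one coefficient per stage. The function name run_systolic_chain and the choice of kernel are illustrative, and the stages run one after another here, whereas real cores would operate concurrently.

```python
from collections import deque
from typing import List


def run_systolic_chain(coeffs: List[int], inputs: List[int]) -> List[int]:
    """Evaluate a polynomial (highest-degree coefficient first) for each input.

    Stage k pops (x, acc) from the queue of its upstream neighbor and pushes
    (x, acc * x + coeffs[k]) to its downstream neighbor, so intermediate
    values travel through queues instead of shared memory.
    """
    queues = [deque() for _ in range(len(coeffs) + 1)]
    for x in inputs:
        queues[0].append((x, 0))  # feed the first stage

    for k, c in enumerate(coeffs):
        while queues[k]:
            x, acc = queues[k].popleft()            # models a "pop"
            queues[k + 1].append((x, acc * x + c))  # models a "push"

    return [acc for _, acc in queues[-1]]
```

For example, run_systolic_chain([2, 3, 1], [0, 1, 2]) evaluates 2x² + 3x + 1 at x = 0, 1, and 2.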

A vector programming model can also be used to program MemPool. We can exploit the fact that each vector instruction can be translated into a long series of scalar micro-operations. By replicating such micro-operations, we can alleviate the pressure on the instruction issue of the scalar core, leaving it free to execute other instructions.
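A minimal sketch of this idea, assuming an element-wise vector addition: a single vector instruction is expanded into one micro-operation per element, so the scalar core issues one instruction while the vector unit generates the rest. The names expand and run_vector_add are hypothetical and do not correspond to any existing MemPool or RISC-V API.

```python
from typing import Iterator, List, Tuple

MicroOp = Tuple[str, int]  # (operation name, element index)


def expand(op: str, vl: int) -> Iterator[MicroOp]:
    """Expand one vector instruction of length `vl` into scalar micro-ops."""
    for i in range(vl):
        yield (op, i)


def run_vector_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Execute a vector add and return (result, number of micro-ops run)."""
    vl = len(a)
    if len(b) != vl:
        raise ValueError("operands must have the same vector length")
    result = [0] * vl
    micro_ops = 0
    for _op, i in expand("add", vl):
        result[i] = a[i] + b[i]  # each micro-op handles a single element
        micro_ops += 1
    return result, micro_ops
```

Here the scalar core would issue a single instruction, while the number of micro-operations grows with the vector length.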

Goal

The goal of this thesis is to develop a small and energy-efficient vector accelerator unit and integrate it with MemPool. This unit should achieve high performance on key computational photography kernels, while keeping the energy efficiency of the design under control. This manycore system with vector support is to be analyzed in terms of the performance improvements, power requirements, and area impacts of the hardware needed to implement the vector accelerator.

Project Description

The project has different aspects to be explored. First, we need to determine a small subset of RISC-V's Vector Extension to be implemented. If the Vector Extension has been ratified and proposes an instruction subset for small embedded subsystems, we can use it. Otherwise, we will base the subset on the instructions needed to execute the software kernels of interest.

Then, we will investigate how to implement this small and energy-efficient vector accelerator. We can take inspiration from Ara, a RISC-V-based vector processor developed by our group. Keep in mind that Ara targets much higher operating frequencies and is overall a complex vector machine: each lane of Ara is about as large as one tile of MemPool. We want to implement a simple, small, and energy-efficient vector unit instead.

Regarding the vector register file, we might either use a small vector register file per vector unit or stream the operands from the local L1 memory. During the thesis, the student will be asked to evaluate both approaches, and implement the chosen one. The other approach, and the comparison between them, is taken as a stretch goal.
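To make the trade-off between the two approaches more concrete, the toy model below counts L1 accesses for a chain of element-wise operations. It is a back-of-the-envelope sketch under strong assumptions: with a vector register file, only the kernel's inputs and outputs touch L1, while in the streaming approach every operation reads its operands from L1 and writes its result back, with no forwarding. All function names and the example vector register length are hypothetical.

```python
def l1_accesses_with_vrf(vl: int, num_inputs: int, num_outputs: int) -> int:
    """L1 accesses when intermediates stay in a vector register file."""
    return vl * (num_inputs + num_outputs)


def l1_accesses_streaming(vl: int, num_ops: int, operands_per_op: int = 2) -> int:
    """L1 accesses when every operation streams operands and results via L1."""
    return vl * num_ops * (operands_per_op + 1)


def vrf_bits(num_regs: int, vlen_bits: int) -> int:
    """Storage needed for a vector register file, in bits."""
    return num_regs * vlen_bits


# Example kernel: y = (a + b) * c on 64 elements (3 inputs, 1 output, 2 ops).
VL = 64
with_vrf = l1_accesses_with_vrf(VL, num_inputs=3, num_outputs=1)
streaming = l1_accesses_streaming(VL, num_ops=2)
# 32 vector registers with an assumed 128-bit register length.
storage = vrf_bits(num_regs=32, vlen_bits=128)
```

Under these assumptions, the register file saves L1 traffic at the cost of additional storage per vector unit; an actual evaluation would also need to account for area, energy per access, and bank conflicts.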

Milestones

The following are the milestones that we expect to achieve throughout the project:

  • Familiarize yourself with MemPool and with the RISC-V Vector Extension
  • Choose a subset of interest of the RISC-V Vector Extension
  • Implement a vector unit and integrate it with the Snitch cores in the MemPool tile
  • Benchmark the performance of the design with vector kernels
  • Analyze the impact of the vector support on area and power consumption
  • Compare your solution with MemPool as a systolic array

Project Realization

Meetings

Weekly meetings will be held between the student and the assistants. The exact time and location of these meetings will be determined within the first week of the project in order to fit the student’s and the assistants’ schedules. These meetings will be used to evaluate the status and progress of the project. Besides these regular meetings, additional meetings can be organized to address urgent issues.

Weekly Reports

Semester Thesis: The student is advised, but not required, to write a weekly report at the end of each week and to send it to their advisors. The idea of the weekly report is to briefly summarize the work, progress, and any findings made during the week, to plan the actions for the next week, and to bring up open questions and points. The weekly report is also an important means for the student to develop a goal-oriented attitude to work.

Coding Guidelines

HDL Code Style

Adopting a consistent code style is one of the most important steps in order to make your code easy to understand. If signals, processes, and modules are always named consistently, any inconsistency can be detected more easily. Moreover, if a design group shares the same naming and formatting conventions, all members immediately feel at home with each other’s code. At IIS, we use lowRISC’s style guide for SystemVerilog HDL: https://github.com/lowRISC/style-guides/.

Software Code Style

We generally suggest that you use style guides or code formatters provided by the language’s developers or community. For example, we recommend LLVM’s or Google’s code styles with clang-format for C/C++, PEP-8 and pylint for Python, and the official style guide with rustfmt for Rust.

Version Control

Even in the context of a student project, keeping a precise history of changes is essential to a maintainable codebase. You may also need to collaborate with others, adopt their changes to existing code, or work on different versions of your code concurrently. For all of these purposes, we heavily use Git as a version control system at IIS. If you have no previous experience with Git, we strongly advise you to familiarize yourself with the basic Git workflow before you start your project.

Report

Documentation is an important and often overlooked aspect of engineering. A final report has to be completed within this project.

The common language of engineering is de facto English. Therefore, the final report of the work is preferred to be written in English.

Any form of word processing software is allowed for writing the reports; nevertheless, the use of LaTeX with Inkscape or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.

If you write the report in LaTeX, we offer an instructive, ready-to-use template, which can be forked from the Git repository at https://iis-git.ee.ethz.ch/akurth/iisreport.

Final Report

The final report has to be presented at the end of the project, and a digital copy needs to be handed in and remains the property of the IIS. Note that this task description is part of your report and has to be attached to your final report.

Presentation

There will be a presentation (20 min presentation and 5 min Q&A) at the end of this project in order to present your results to a wider audience. The exact date will be determined towards the end of the work.

Deliverables

In order to complete the project successfully, the following deliverables have to be submitted at the end of the work:

  • Final report incl. presentation slides
  • Source code and documentation for all developed software and hardware
  • Testsuites (software) and testbenches (hardware)
  • Synthesis and implementation scripts, results, and reports

References

[1] O. Shacham and M. Reynders, "Pixel Visual Core: image processing and machine learning on Pixel 2," oct 2017. [Online]. Available: https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

[2] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, "MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect," dec 2020. [Online]. Available: http://arxiv.org/abs/2012.02973

[3] S. Riedel and M. Cavalcante, "MemPool GitHub," 2021. [Online]. Available: https://github.com/pulp-platform/mempool

[4] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, "Snitch: A 10 kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads," IEEE Transactions on Computers, pp. 1–1, feb 2020. [Online]. Available: http://arxiv.org/abs/2002.10143

[5] A. Waterman and K. Asanovic, "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA - Document Version 20191213," RISC-V Foundation, Tech. Rep., 2019. [Online]. Available: https://github.com/riscv/riscv-isa-manual/releases/download/draft-20201002-db3eeaf/riscv-spec.pdf

[6] M. Cavalcante, F. Schuiki, F. Zaruba, M. Schaffner, and L. Benini, "Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22nm FD-SOI," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 2, pp. 530–543, 2020.