Benchmarking a heterogeneous 217-core MPSoC on HPC applications (M/1-3S)


Overview

Status: Available

Introduction

Occamy is a massively-parallel multiprocessor system-on-chip (MPSoC) designed for energy-efficient high-performance computing (HPC) applications. It is a concrete implementation of the Manticore concept architecture presented at Hot Chips 2020 [1]. It couples a 64-bit application-class RISC-V CVA6 core [2,3], which can boot Linux, with a many-core accelerator comprising 216 energy-efficient 32-bit RISC-V Snitch cores [4,5]. The accelerator cores are tightly coupled to a set of software-managed L1 scratch-pad memories (SPMs). Unlike hardware-managed caches, data movement between the L2 memory and the L1 SPMs must be explicitly orchestrated in software, for which several DMA engines are provided. This design decision improves overall energy efficiency.
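
Because all transfers between L2 and the L1 SPMs are explicit, a typical kernel overlaps DMA transfers with computation (double buffering). The sketch below illustrates the pattern in plain C; dma_copy_in and dma_wait are hypothetical stand-ins (backed here by memcpy so the example compiles and runs anywhere) for whatever asynchronous DMA API the Snitch runtime provides.

    #include <stdio.h>
    #include <string.h>

    #define TILE 256          /* elements per L1 tile                    */
    #define N    (4 * TILE)   /* total problem size (a multiple of TILE) */

    /* Hypothetical stand-ins for an asynchronous DMA API. On Occamy these
     * would program a DMA engine and return immediately; here they are a
     * plain memcpy and a no-op so the sketch runs on any machine. */
    static void dma_copy_in(double *l1_dst, const double *l2_src, int n) {
        memcpy(l1_dst, l2_src, (size_t)n * sizeof(double));
    }
    static void dma_wait(void) { /* wait for outstanding transfers */ }

    /* Scale a vector resident in L2, staging it through a double-buffered
     * L1 scratch-pad: while one tile is processed, the next one is
     * (conceptually) already in flight. */
    static void scale(const double *l2_in, double *l2_out, double alpha) {
        static double l1_buf[2][TILE];

        dma_copy_in(l1_buf[0], l2_in, TILE);                 /* prefetch tile 0   */
        for (int t = 0; t < N / TILE; ++t) {
            dma_wait();                                      /* tile t is in L1   */
            if (t + 1 < N / TILE)                            /* prefetch tile t+1 */
                dma_copy_in(l1_buf[(t + 1) % 2], l2_in + (t + 1) * TILE, TILE);

            double *buf = l1_buf[t % 2];
            for (int i = 0; i < TILE; ++i)                   /* compute on the resident tile */
                buf[i] *= alpha;

            memcpy(l2_out + t * TILE, buf, TILE * sizeof(double)); /* write-back (DMA on hardware) */
        }
    }

    int main(void) {
        static double in[N], out[N];
        for (int i = 0; i < N; ++i) in[i] = (double)i;
        scale(in, out, 2.0);
        printf("out[5] = %g (expected 10)\n", out[5]);
        return 0;
    }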

This heterogeneous platform provides high single-thread performance by executing sequential code on the host (the CVA6 core), while parallel code regions can be offloaded to the accelerator to exploit its higher energy efficiency and peak performance. This heterogeneous programming model opens up many interesting questions, such as:

  • When is it convenient to offload a computation to the accelerator?
  • What is the optimal number of accelerator cores to select for offload?

Given the rising popularity of heterogeneous MPSoCs [6,7], being able to confidently answer these questions should be considered a valuable takeaway of this thesis.
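
A first-order way to approach the first question is a simple break-even model: offloading pays off only if the host-only runtime exceeds the accelerator runtime plus data-movement and launch overheads. The sketch below expresses this in C; all throughput and overhead numbers in the example are illustrative placeholders, not measured Occamy figures.

    #include <stdbool.h>
    #include <stdio.h>

    /* First-order break-even model: offloading pays off when the time the host
     * would need for the parallel region exceeds the time to launch the
     * accelerator, move the data, and compute there. All parameters are
     * illustrative placeholders, not measured Occamy numbers. */
    static bool offload_pays_off(double flops,           /* work in the region           */
                                 double bytes_moved,     /* data staged L2 <-> L1        */
                                 double host_flops_s,    /* host throughput              */
                                 double acc_flops_s,     /* aggregate accel. throughput  */
                                 double dma_bytes_s,     /* DMA bandwidth                */
                                 double launch_s) {      /* offload/sync overhead        */
        double t_host = flops / host_flops_s;
        double t_acc  = launch_s + bytes_moved / dma_bytes_s + flops / acc_flops_s;
        return t_acc < t_host;
    }

    int main(void) {
        /* Example: a 1-MFLOP region touching 100 kB of data. */
        bool worthwhile = offload_pays_off(1e6, 1e5, 1e9, 50e9, 10e9, 5e-6);
        printf("offload worthwhile: %s\n", worthwhile ? "yes" : "no");
        return 0;
    }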

Snitch's floating-point subsystem is also of particular interest. Several ISA extensions have been developed to improve its energy efficiency, namely stream semantic registers (SSRs) [8,9] and the floating-point repetition (FREP) instruction [5], which respectively enable load/store elision and pseudo dual-issue execution; further extensions are under development.
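
To make the benefit concrete, consider a plain scalar dot product: each iteration of the loop below spends two loads and several loop-control instructions around a single fused multiply-add. SSRs would turn those loads into implicit register streams and FREP would repeat the multiply-add in hardware, removing the loop overhead, so the FPU can stay busy nearly every cycle. The C code itself does not use the extensions; the comments only mark which instructions they would eliminate.

    #include <stdio.h>

    /* Plain scalar dot product. On Snitch, SSRs would replace the two explicit
     * loads per iteration with implicit streams feeding the FPU, and FREP would
     * repeat the fused multiply-add in hardware, eliminating the loop-control
     * (increment, compare, branch) instructions around it. */
    static double dot(const double *x, const double *y, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; ++i) {   /* loop overhead: removed by FREP        */
            double a = x[i];            /* load: streamed through an SSR         */
            double b = y[i];            /* load: streamed through a second SSR   */
            acc += a * b;               /* the only instruction that remains hot */
        }
        return acc;
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
        printf("dot = %g (expected 70)\n", dot(x, y, 4));
        return 0;
    }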

Project

In this project, you will port a series of HPC kernels from the PolyBench/C benchmark suite [10] to Occamy. You will optimize the kernels to take advantage of the heterogeneous architecture and the software-managed data movement. An additional goal is to explore the applicability of SSRs and FREP to these kernels, and potentially of other extensions under development.
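
For reference, the sketch below is a plain-C matrix-matrix multiplication written in the style of PolyBench/C's gemm kernel (C := alpha*A*B + beta*C). Porting such a kernel to Occamy would mean tiling the loops so that operand tiles are staged into the L1 SPMs via DMA and the inner loops are distributed over the Snitch cores; the sizes and data here are arbitrary.

    #include <stdio.h>

    #define NI 8
    #define NJ 8
    #define NK 8

    /* C := alpha*A*B + beta*C, written in the style of PolyBench/C's gemm.
     * An Occamy port would tile the loops, DMA tiles of A, B and C into the
     * L1 SPMs, and distribute the (i, j) iteration space over Snitch cores. */
    static void gemm(double alpha, double beta,
                     double C[NI][NJ], double A[NI][NK], double B[NK][NJ]) {
        for (int i = 0; i < NI; ++i) {
            for (int j = 0; j < NJ; ++j)
                C[i][j] *= beta;
            for (int k = 0; k < NK; ++k)
                for (int j = 0; j < NJ; ++j)
                    C[i][j] += alpha * A[i][k] * B[k][j];
        }
    }

    int main(void) {
        static double A[NI][NK], B[NK][NJ], C[NI][NJ];
        for (int i = 0; i < NI; ++i)
            for (int k = 0; k < NK; ++k) A[i][k] = 1.0;
        for (int k = 0; k < NK; ++k)
            for (int j = 0; j < NJ; ++j) B[k][j] = 1.0;
        gemm(1.0, 0.0, C, A, B);
        printf("C[0][0] = %g (expected %d)\n", C[0][0], NK);
        return 0;
    }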


Character

Prerequisites

References

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9296802
[2] https://github.com/openhwgroup/cva6
[3] https://arxiv.org/pdf/1904.05442.pdf
[4] https://github.com/pulp-platform/snitch
[5] https://arxiv.org/pdf/2002.10143.pdf
[6] https://en.wikichip.org/wiki/nvidia/tegra/xavier
[7] https://en.wikipedia.org/wiki/ARM_big.LITTLE
[8] https://arxiv.org/pdf/1911.08356.pdf
[9] https://arxiv.org/pdf/2011.08070.pdf
[10] http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/