Benchmarking a heterogeneous 217-core MPSoC on HPC applications (M/1-3S)

From iis-projects


Revision as of 01:14, 12 October 2022


Overview

Status: Available

Introduction

Occamy is a massively parallel multiprocessor system-on-chip (MPSoC) designed for energy-efficient high-performance computing (HPC) applications. It is a concrete implementation of the Manticore concept architecture presented at Hot Chips 2020 [1]. It couples a 64-bit RISC-V application-class out-of-order CVA6 core [2,3], which can boot Linux, with a many-core accelerator comprising 216 energy-efficient 32-bit RISC-V Snitch cores [4,5]. The accelerator cores are tightly coupled to a set of software-managed L1 scratch-pad memories (SPMs). In contrast to hardware-managed caches, data movement between the L2 and L1 memories must be explicitly orchestrated in software, for which several DMA engines are provided. This design decision improves overall energy efficiency.
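The staging of data through the SPMs can be illustrated with a minimal C sketch. Here `memcpy` stands in for an explicit DMA transfer, and the array names (`l1_spm`, `scale_via_spm`) are illustrative, not part of the actual Occamy runtime:

```c
#include <string.h>

#define TILE 128   /* elements that fit in the L1 SPM, illustrative size */

static double l1_spm[TILE];   /* models a core-local L1 scratch-pad */

/* Scale every element of an L2-resident array by 2.0, staging data
 * through the L1 SPM tile by tile. memcpy() stands in for the DMA
 * engine: on Occamy the copies would be issued explicitly to a DMA,
 * and the core would compute only on data already resident in L1. */
void scale_via_spm(double *l2, int n)
{
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? n - base : TILE;
        memcpy(l1_spm, l2 + base, len * sizeof(double));   /* "DMA in"  */
        for (int i = 0; i < len; i++)                      /* compute   */
            l1_spm[i] *= 2.0;
        memcpy(l2 + base, l1_spm, len * sizeof(double));   /* "DMA out" */
    }
}
```

The key point is that the transfers appear as explicit statements in the program, which is exactly the control the DMA engines expose to software.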

This heterogeneous platform provides high single-thread performance by executing sequential code on the host (CVA6 core), while parallel code regions can be offloaded to the accelerator to take advantage of its higher energy efficiency and peak performance. This heterogeneous programming model opens up many interesting questions, such as:

  • When is it convenient to offload a computation to the accelerator?
  • What is the optimal number of accelerator cores to select for offload?

Given the rising popularity of heterogeneous MPSoCs [6,7], being able to confidently answer these questions should be considered a valuable takeaway of this thesis.
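A first-order way to reason about these questions is a simple cost model: offloading pays off only when the fixed offload cost (kernel launch plus data movement) is amortized by the parallel speedup. The sketch below uses illustrative numbers and function names, not measured Occamy figures:

```c
/* Simple offload cost model: a parallel region that takes t_par seconds
 * on the host costs a fixed overhead t_ovh (launch + data movement) when
 * offloaded, and then runs at a fraction eff of linear speedup on n
 * accelerator cores. All parameters are illustrative assumptions. */
double offload_time(double t_par, double t_ovh, int n, double eff)
{
    return t_ovh + t_par / (n * eff);
}

/* Smallest core count for which offloading beats staying on the host,
 * or -1 if no count up to n_max pays off. */
int break_even_cores(double t_par, double t_ovh, int n_max, double eff)
{
    for (int n = 1; n <= n_max; n++)
        if (offload_time(t_par, t_ovh, n, eff) < t_par)
            return n;
    return -1;
}
```

For example, with a region taking 1 s on the host, 0.1 s of offload overhead, and 80% parallel efficiency, two accelerator cores already beat the host; if the overhead exceeds the region's host runtime, no core count helps. Part of this thesis is replacing such assumed numbers with measured ones.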

Snitch's floating-point subsystem is also of particular interest. Several ISA extensions have been developed to improve its energy efficiency, namely stream semantic registers (SSRs) [8,9] and the floating-point repetition (FREP) instruction [5], respectively enabling load/store elision and pseudo dual-issue execution, and other developments are ongoing.
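The canonical target for these extensions is a streaming loop such as a dot product. The plain C version below is annotated to show what SSRs and FREP would take over; the annotations describe the intent of the extensions, not generated code:

```c
/* Plain C dot product: the kind of loop SSRs and FREP target.
 * Each iteration issues two loads and one fused multiply-add.
 * With SSRs, the a[i] and b[i] accesses become implicit register
 * streams, eliding the explicit loads; FREP then repeats the
 * single-FMA loop body without re-fetching instructions, freeing
 * the integer pipeline (pseudo dual-issue). */
double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];   /* SSR: streamed operands; FREP: repeated FMA */
    return acc;
}
```

Part of the project is judging, kernel by kernel, which loops have this regular access pattern and can therefore benefit.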

Project

In this project, you will port a series of HPC kernels from the PolyBench/C benchmark [10] to Occamy. You will optimize the kernels to take advantage of the heterogeneous architecture and the software-defined data movement. An additional goal would be to explore the applicability of SSRs and FREP to the kernels, and potentially other extensions under development. The applications will be developed in C. A bare-metal runtime is provided, hiding the details of the hardware beneath a set of convenience functions.
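To give a flavor of the starting point, here is a simplified single-core version of one PolyBench/C kernel, gemm (C = alpha*A*B + beta*C). The real benchmark wraps this in its own problem-size macros and timing scaffolding, omitted here; the fixed 4x4 sizes are purely illustrative:

```c
#define NI 4
#define NJ 4
#define NK 4

/* Simplified gemm kernel in the style of PolyBench/C:
 * C = alpha * A * B + beta * C.
 * This is the kind of reference code a port starts from (step A),
 * before any offloading (B) or tiling for the L1 SPMs (C). */
void gemm(double alpha, double beta,
          double C[NI][NJ], double A[NI][NK], double B[NK][NJ])
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++) {
            C[i][j] *= beta;
            for (int k = 0; k < NK; k++)
                C[i][j] += alpha * A[i][k] * B[k][j];
        }
}
```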

To break it down in more detail, you will:

  • Gain a deep understanding of the PolyBench kernels, in particular of:
    • the underlying algorithms;
    • the data movement and communication patterns;
    • the parallelism they expose, i.e. distinguish sequential vs. parallel code regions, data vs. task parallelism, etc.;
  • Understand the Occamy architecture and familiarize yourself with the provided bare-metal runtime
  • Select a suitable subset of kernels to implement
  • Implement the kernels on Occamy
    • A) Port the original sources to run on the CVA6 host
    • B) Offload amenable code regions to the accelerator
    • C) Optimize data movement, overlapping communication and computation where possible
  • Compare the performance and energy efficiency of the implementations in A), B) and C)
    • Analyze the speedup in Amdahl's terms
    • Understand and locate where the major performance losses occur
    • Compare the attained FPU utilization and performance to the architecture's peak values
    • Suggest new hardware features or ISA extensions which could improve the kernels
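Step C, overlapping communication and computation, is typically achieved with double buffering: while the core computes on one L1 buffer, the DMA fills the other. The sketch below shows the control structure; `memcpy` again stands in for an asynchronous DMA, so this model is functionally equivalent but does not actually overlap (on Occamy, the "DMA in" of tile t+1 would be in flight while tile t is being computed). It assumes n is a multiple of the tile size:

```c
#include <string.h>

#define TILE 64   /* illustrative tile size; n must be a multiple of it */

/* Double-buffering skeleton: two L1 buffers alternate between being
 * filled by the DMA and being computed on by the core. */
void scale_double_buffered(double *l2, int n)
{
    static double buf[2][TILE];   /* models two L1 SPM buffers */
    int cur = 0;

    memcpy(buf[cur], l2, TILE * sizeof(double));   /* prologue: fetch tile 0 */
    for (int base = 0; base < n; base += TILE) {
        int nxt = 1 - cur;
        if (base + TILE < n)   /* prefetch next tile (asynchronous on HW) */
            memcpy(buf[nxt], l2 + base + TILE, TILE * sizeof(double));
        for (int i = 0; i < TILE; i++)   /* compute on current tile */
            buf[cur][i] *= 2.0;
        memcpy(l2 + base, buf[cur], TILE * sizeof(double));   /* write back */
        cur = nxt;
    }
}
```

Comparing this structure against the naive single-buffer version is exactly the A)-vs-C) measurement the evaluation calls for.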

Additional stretch goals may include:

  • Study which kernels could be optimized with the SSR or FREP ISA extensions and eventually optimize them
  • Categorize the kernels based on their use of collective communication (multicast, reductions) and synchronization primitives (barriers, locks)

The programs will be run in RTL simulation. To speed up the development, we might opt for a downscaled version of Occamy, with a reduced number of accelerator cores.

Character

  • 20% Literature/architecture review
  • 60% Bare-metal C programming
  • 20% Evaluation

Prerequisites

  • Strong interest in computer architecture and memory systems
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Knowledge or prior experience with RISC-V
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9296802
[2] https://github.com/openhwgroup/cva6
[3] https://arxiv.org/pdf/1904.05442.pdf
[4] https://github.com/pulp-platform/snitch
[5] https://arxiv.org/pdf/2002.10143.pdf
[6] https://en.wikichip.org/wiki/nvidia/tegra/xavier
[7] https://en.wikipedia.org/wiki/ARM_big.LITTLE
[8] https://arxiv.org/pdf/1911.08356.pdf
[9] https://arxiv.org/pdf/2011.08070.pdf
[10] http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/