Benchmarking a heterogeneous 217-core MPSoC on HPC applications (M/1-3S)

Revision as of 14:14, 22 November 2022

Manticore concept architecture

Overview

Status: Available

Introduction

Occamy is a massively parallel multiprocessor system-on-chip (MPSoC) designed for energy-efficient high-performance computing (HPC) applications. It is a concrete implementation of the Manticore concept architecture presented at Hot Chips 2020 [1]. It couples a 64-bit RISC-V application-class out-of-order CVA6 core [2,3], which can boot Linux, with a many-core accelerator comprising 216 energy-efficient 32-bit RISC-V Snitch cores [4,5]. The accelerator cores are tightly coupled to a set of software-managed L1 scratchpad memories (SPMs). Unlike with hardware-managed caches, data movement between the L2 and L1 memories must be explicitly orchestrated in software, for which several DMA engines are provided. This design decision improves overall energy efficiency.
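The explicit data-movement model can be sketched in plain C. Here memcpy stands in for a blocking DMA transfer, and the function names, buffer layout, and tile size are illustrative assumptions, not the actual Occamy runtime API:

```c
#include <string.h>

#define TILE_WORDS 64  /* hypothetical L1 SPM tile size, not Occamy's */

/* Stand-in for a blocking L2-to-L1 DMA transfer: on Occamy this would
 * be issued to a DMA engine; memcpy models only the visible effect. */
static void dma_copy_in(double *l1_dst, const double *l2_src, int n) {
    memcpy(l1_dst, l2_src, n * sizeof(double));
}

/* Stage one tile into the (modeled) L1 SPM, then compute out of it. */
double sum_tile(const double *l2_data, int n) {
    double l1_buf[TILE_WORDS];  /* plays the role of the L1 SPM */
    dma_copy_in(l1_buf, l2_data, n);
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += l1_buf[i];
    return s;
}
```

The point of the pattern is that the programmer decides when data enters L1, instead of a cache controller deciding on a miss.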

This heterogeneous platform guarantees high single-thread performance by executing sequential code on the host (the CVA6 core), while parallel code regions can be offloaded to the accelerator to exploit its higher energy efficiency and peak performance. The resulting programming model opens up many interesting questions, such as:

  • When is it convenient to offload a computation to the accelerator?
  • What is the optimal number of accelerator cores to select for offload?

Given the rising popularity of heterogeneous MPSoCs [6,7], the ability to answer these questions confidently is a valuable takeaway of this thesis.
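A first-order model of the offload decision can be written down in a few lines of C. All quantities (one operation per time unit on the host, ideal scaling on the accelerator, a fixed offload overhead) are illustrative assumptions, not Occamy measurements:

```c
/* First-order offload model; units are arbitrary "operations" and
 * "time per operation". All parameters are illustrative assumptions. */
double host_time(double ops) {
    return ops;  /* assume the host executes 1 op per time unit */
}

double offload_time(double ops, int cores, double overhead) {
    /* fixed offload cost (handshake + data movement) plus ideal scaling */
    return overhead + ops / cores;
}

/* Offloading pays off once the saved compute time exceeds the overhead. */
int offload_pays_off(double ops, int cores, double overhead) {
    return offload_time(ops, cores, overhead) < host_time(ops);
}
```

Even this toy model captures the qualitative answer: small kernels stay on the host because the fixed overhead dominates, while large kernels amortize it.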

Snitch's floating-point subsystem is also of particular interest. Several ISA extensions have been developed to improve its energy efficiency, namely stream semantic registers (SSRs) [8,9] and the floating-point repetition (FREP) instruction [5], which respectively enable load/store elision and pseudo dual-issue execution; further extensions are under development.
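To make the effect concrete, consider the inner loop of a dot product. On a single-issue core each element costs roughly two loads, one fused multiply-add, and loop bookkeeping; SSRs stream the operands into registers in hardware and FREP repeats the FPU instruction without re-fetching it, leaving close to one FPU instruction per element. A sketch, where the instruction-count figures are rough assumptions rather than measurements:

```c
/* Plain dot product: per element, a single-issue core issues roughly
 * two loads, one fmadd, and loop bookkeeping. With SSRs the operand
 * loads are elided (streamed by hardware) and FREP repeats the fmadd
 * without an explicit loop, leaving ~1 FPU instruction per element. */
double dot(const double *a, const double *b, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Rough per-element instruction estimates for the loop above; the
 * 5x figure (2 ld + 1 fmadd + 2 loop overhead) is an assumption. */
int insns_baseline(int n) { return 5 * n; }
int insns_ssr_frep(int n) { return n; }
```

Under these assumptions the extensions remove about 80% of the issued instructions, which is why they matter so much for FPU utilization on a single-issue core.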

Project description

In this project, you will port a series of HPC kernels from the PolyBench/C benchmark suite [10] to Occamy. You will optimize the kernels to take advantage of the heterogeneous architecture and software-defined data movement. An additional goal is to explore the applicability of SSRs and FREP to the kernels, and potentially of other extensions under development. The applications will be developed in C. A bare-metal runtime is provided, hiding the details of the hardware beneath a set of convenience functions. The programs will be run in RTL simulation. To speed up development, we may opt for a down-scaled version of Occamy with a reduced number of accelerator cores.
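As an example of the kind of kernel involved, the following is a simplified, square-matrix version of PolyBench's atax kernel (y = A^T (A x)); the two matrix-vector products expose row-level data parallelism that is a natural candidate for offload, while the setup code stays on the host:

```c
#define N 4  /* PolyBench uses much larger, configurable problem sizes */

/* Simplified atax-style kernel: y = A^T (A x). Each of the two
 * matrix-vector products exposes row-level data parallelism
 * (candidate for offload); everything else remains host code. */
void atax(double A[N][N], double x[N], double y[N]) {
    double tmp[N];
    for (int i = 0; i < N; ++i) {          /* tmp = A x */
        tmp[i] = 0.0;
        for (int j = 0; j < N; ++j)
            tmp[i] += A[i][j] * x[j];
    }
    for (int j = 0; j < N; ++j)
        y[j] = 0.0;
    for (int i = 0; i < N; ++i)            /* y = A^T tmp */
        for (int j = 0; j < N; ++j)
            y[j] += A[i][j] * tmp[i];
}
```

Note that the second product traverses A by rows while accumulating into y, so its data-movement pattern differs from the first; exactly this kind of difference drives the analysis steps below.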

Detailed task description

To break it down in more detail, you will:

  • Gain a deep understanding of the PolyBench kernels, in particular of:
    • the underlying algorithms;
    • the data movement and communication patterns;
    • the parallelism they expose, i.e. distinguish sequential vs. parallel code regions, data vs. task parallelism, etc.;
  • Understand the Occamy architecture and familiarize yourself with the software stack
  • Select a suitable subset of kernels to implement
  • Implement the kernels on Occamy
    • A) Port the original sources to run on the CVA6 host
    • B) Offload amenable code regions to the accelerator
    • C) Optimize data movement, overlapping communication and computation where possible
  • Compare the performance and energy efficiency of the implementations in A), B) and C)
    • Analyze the speedup in Amdahl's terms
    • Understand and locate where the major performance losses occur
    • Compare the attained FPU utilization and performance to the architecture's peak values
    • Suggest new hardware features or ISA extensions to further improve the attained performance
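Step C) above, overlapping communication and computation, is typically realized with double buffering. A minimal sketch follows, again using memcpy as a stand-in for the DMA engine, so the overlap here is only structural; on Occamy the prefetch of the next tile would actually run concurrently with the compute on the current one:

```c
#include <string.h>

#define TILE 4  /* illustrative tile size */

/* Double-buffered tiled reduction: while tile t is processed out of
 * one L1 buffer, tile t+1 is (conceptually) DMA'd into the other.
 * memcpy stands in for the DMA engine, so no real concurrency here. */
double sum_double_buffered(const double *src, int n_tiles) {
    double buf[2][TILE];  /* ping-pong buffers modeling L1 SPM */
    double s = 0.0;
    memcpy(buf[0], src, TILE * sizeof(double));  /* prefetch tile 0 */
    for (int t = 0; t < n_tiles; ++t) {
        int cur = t & 1;
        if (t + 1 < n_tiles)  /* kick off the next transfer */
            memcpy(buf[cur ^ 1], src + (t + 1) * TILE,
                   TILE * sizeof(double));
        for (int i = 0; i < TILE; ++i)  /* compute on the current tile */
            s += buf[cur][i];
    }
    return s;
}
```

The design choice is a classic trade-off: double buffering halves the usable SPM per tile in exchange for hiding transfer latency behind computation.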

Optional stretch goals

Additional stretch goals may include:

  • Study which kernels could be optimized with the SSR or FREP ISA extensions
  • Optimize suitable kernels with SSRs or FREP
  • Categorize the kernels based on their use of collective communication (multicast, reductions) and synchronization primitives (barriers, locks)
  • Compare your results to a GPU or server-class CPU implementation [11]

Character

  • 20% Literature/architecture review
  • 60% Bare-metal C programming
  • 20% Evaluation

Prerequisites

  • Strong interest in computer architecture
  • Experience in bare-metal or embedded C programming
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Knowledge or prior experience with RISC-V
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9296802)
[2] CVA6 core GitHub repository (https://github.com/openhwgroup/cva6)
[3] The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology (https://arxiv.org/pdf/1904.05442.pdf)
[4] Snitch core GitHub repository (https://github.com/pulp-platform/snitch)
[5] Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads (https://arxiv.org/pdf/2002.10143.pdf)
[6] Nvidia Tegra Xavier WikiChip article (https://en.wikichip.org/wiki/nvidia/tegra/xavier)
[7] Arm big.LITTLE Wikipedia article (https://en.wikipedia.org/wiki/ARM_big.LITTLE)
[8] Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores (https://arxiv.org/pdf/1911.08356.pdf)
[9] Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra (https://arxiv.org/pdf/2011.08070.pdf)
[10] PolyBench/C website (http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/)
[11] PolyBench port to the HERO architecture (https://iis-git.ee.ethz.ch/bjoernf/PolyBench-ACC)
[12] HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA (https://arxiv.org/pdf/1712.06497.pdf)
[13] PolyBench 4.2.1 kernel descriptions (https://github.com/MatthiasJReisinger/PolyBenchC-4.2.1/blob/master/polybench.pdf)