A RISC-V ISA Extension for Pseudo Dual-Issue Monte Carlo in Snitch (1M/2S)

From iis-projects

Latest revision as of 12:53, 7 March 2024


Overview

Status: In progress

Introduction

Figure 1: A block diagram of the Snitch cluster architecture

Monte Carlo simulations are used in many complex applications and take up a large fraction of high-performance computing (HPC) cycles today. They fall into the category of embarrassingly parallel computations and, as such, can benefit from the compute resources of massively parallel MPSoCs.

This project aims to achieve high performance for Monte Carlo applications on Snitch, a pseudo dual-issue, energy-efficient RISC-V processor for floating-point computations [1, 2]. Snitch features an integer core and a floating-point accelerator, which can operate in parallel to some extent. It implements two custom ISA extensions to achieve high FPU utilization, namely FREP and SSRs. FREP in particular allows the FPU to operate independently of the integer core, effectively enabling (pseudo) dual-issue execution.

Monte Carlo methods [3] rely on repeated random sampling from an application-specific probability distribution. Random samples from any probability distribution can be obtained by sampling a uniform probability distribution and converting the samples from the uniform to the desired distribution. The Box-Muller method [4], for instance, can be used to generate normally distributed numbers given a source of uniformly distributed numbers. In this project we will focus only on generating uniform samples, which is the goal of most available pseudo-random number generators (PRNGs). While Monte Carlo methods typically compute floating-point quantities, pseudo-random number (PRN) generation mostly involves integer operations. This makes the Snitch architecture particularly suited for such computations, as the integer core can potentially generate random samples while the FPU subsystem uses these samples in parallel for the Monte Carlo calculation.
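
As a concrete instance of the uniform-to-target conversion described above, the basic form of the Box-Muller transform [4] maps two independent uniform samples u_1 and u_2 on (0, 1) to two independent standard normal samples z_0 and z_1. The symbol names are chosen here for illustration only:

\[
z_0 = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2), \qquad
z_1 = \sqrt{-2 \ln u_1}\,\sin(2\pi u_2)
\]

Only the generation of u_1 and u_2 is integer work. The transform itself is floating-point work, which matches the split between the integer core and the FPU described above.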

In a previous project, we evaluated the performance of Monte Carlo applications on Snitch. Unfortunately, pseudo dual-issue execution in the Snitch processor is only possible when the integer and floating-point threads are independent and no communication occurs between the two. This is not the case for our use case, where the integer thread would supply the floating-point thread with PRNs to operate on.

Project description

With this project, we aim to enable sustained pseudo dual-issue execution for Monte Carlo applications on Snitch.

To achieve this, you will have to broaden the scope of the FREP ISA extension to cover the scenario where the floating-point thread consumes data produced by the integer thread. This will involve modifying the FREP sequencer in the Snitch core and extending the semantics of the FCVT and FLT instructions issued by the sequencer.
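
To illustrate where integer-to-FP conversions and FP comparisons show up in a Monte Carlo kernel, here is a minimal sketch in plain C that estimates pi by sampling points in the unit square. It is not the project's sample application: the function names (`lcg_next`, `u32_to_unit`, `estimate_pi`) are hypothetical, and the 64-bit LCG with Knuth's MMIX constants is only a stand-in PRNG. In the base RISC-V F/D extensions, FCVT instructions such as FCVT.D.WU read an integer register and write an FP register, while FLT.D reads FP registers and writes an integer register. A compiler would typically lower the two marked lines to instructions of this kind, which is exactly where the integer and floating-point threads exchange data.

```c
#include <stdint.h>

/* Hypothetical stand-in PRNG: 64-bit LCG with Knuth's MMIX constants.
 * The modulus 2^64 is implicit in unsigned wraparound. */
static uint64_t lcg_next(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state;
}

/* Map a 32-bit integer sample to a double in [0, 1). */
static double u32_to_unit(uint32_t r) {
    return (double)r * (1.0 / 4294967296.0); /* integer -> FP conversion */
}

/* Estimate pi as 4 * (fraction of points inside the unit quarter circle).
 * n_samples must be non-zero. */
double estimate_pi(uint64_t seed, uint32_t n_samples) {
    uint64_t state = seed;
    uint32_t hits = 0;
    for (uint32_t i = 0; i < n_samples; i++) {
        /* Integer work: generate two raw samples, keep the upper 32 bits. */
        uint32_t rx = (uint32_t)(lcg_next(&state) >> 32);
        uint32_t ry = (uint32_t)(lcg_next(&state) >> 32);
        /* Floating-point work: convert, square, accumulate, compare. */
        double x = u32_to_unit(rx);
        double y = u32_to_unit(ry);
        hits += (x * x + y * y < 1.0); /* FP compare -> integer result */
    }
    return 4.0 * (double)hits / (double)n_samples;
}
```

A call such as `estimate_pi(42, 100000)` should return a value close to 3.14. The exact value depends on the seed and the sample count.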

You will then showcase your extensions on a sample Monte Carlo application.

Detailed task description

To break it down in more detail, you will:

  • Review the literature on PRNGs
    • Understand the requirements for PRNGs in HPC applications (ultimately we need double-precision floating-point numbers) [13]
    • Understand the tradeoffs involved: generation efficiency (speed) vs. statistical properties (quality)
    • Understand how distributed PRN generation works [8, 9]
    • Review the most common PRNGs used in standard libraries and HPC applications [5] (LCGs [6], MRGs [7], Mersenne Twister, SPRNG [10], Philox [11], xoshiro [12], etc.)
  • Review Monte Carlo methods [3]
  • Review the previously developed Monte Carlo application on Snitch
    • Study the code from previous students who implemented a parallel Monte Carlo application on a Snitch cluster
    • Understand the single-thread performance limit and why we cannot use the FREP extension in Snitch
  • Extend the existing Monte Carlo application
    • Select a widely-used PRNG which balances speed and quality (suitable for HPC)
    • Implement the PRNG on the Snitch cluster
    • Improve the single-thread performance of the application, including PRN generation
  • Extend Snitch to enable dual-issue Monte Carlo applications
    • Extend Snitch to enable FLT and FCVT instruction sequencing in FREP loops
    • Extend FLT and FCVT semantics to operate on SSR registers during FREP operation
  • Evaluate your extensions
    • Analyze how your extensions impact Snitch PPA figures
    • Use your extensions in our sample Monte Carlo application and evaluate the performance impact
  • Scale the Monte Carlo application to multiple clusters in Occamy
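
As a starting point for the literature review on distributed PRN generation, the leapfrog method for LCGs [9] can be sketched as follows. The idea is that each of P cores takes every P-th element of one global sequence: composing the base step x -> A*x + C with itself P times yields another LCG step, which each core applies to its own starting element. All names below (`leapfrog_lcg_t`, `leapfrog_init`, `leapfrog_next`, `u64_to_unit_double`) are hypothetical, and the MMIX LCG constants are an assumption for illustration, not the PRNG selected for this project. The last helper shows one common way to turn a 64-bit sample into a double-precision number in [0, 1) using its upper 53 bits.

```c
#include <stdint.h>

/* Assumed base LCG: x_{n+1} = A * x_n + C mod 2^64 (Knuth's MMIX constants). */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

typedef struct {
    uint64_t state; /* next element this stream will output */
    uint64_t a;     /* stride multiplier: A^P mod 2^64 */
    uint64_t c;     /* stride increment: C * (A^(P-1) + ... + A + 1) mod 2^64 */
} leapfrog_lcg_t;

/* One step of the base (sequential) generator. */
static uint64_t lcg_step(uint64_t x) { return LCG_A * x + LCG_C; }

/* Set up stream k out of p streams (0 <= k < p) for a shared seed x_0. */
void leapfrog_init(leapfrog_lcg_t *g, uint64_t seed, uint32_t k, uint32_t p) {
    uint64_t a = 1, c = 0;
    /* Compose the base step p times: (a, c) -> (A*a, A*c + C). */
    for (uint32_t i = 0; i < p; i++) {
        c = LCG_A * c + LCG_C;
        a = LCG_A * a;
    }
    g->a = a;
    g->c = c;
    /* Stream k starts at x_k, i.e. the seed advanced by k base steps. */
    g->state = seed;
    for (uint32_t i = 0; i < k; i++) {
        g->state = lcg_step(g->state);
    }
}

/* Return the stream's current element, then jump p elements ahead. */
uint64_t leapfrog_next(leapfrog_lcg_t *g) {
    uint64_t out = g->state;
    g->state = g->a * g->state + g->c;
    return out;
}

/* Use the upper 53 bits to build a double in [0, 1). */
double u64_to_unit_double(uint64_t x) {
    return (double)(x >> 11) * 0x1.0p-53;
}
```

Interleaving the outputs of streams 0 to P-1 reproduces the sequential sequence x_0, x_1, x_2, ..., so the parallel run samples the same numbers as a single-core run with the same seed.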

Stretch goals

Additional optional stretch goals may include:

  • Select an alternative, real-world Monte Carlo application to implement
  • Extend the integer->FP communication mechanism to use FIFOs, and evaluate energy efficiency improvements

Character

  • 30% Literature/architecture review
  • 25% Low-level software development
  • 35% RTL design and verification
  • 10% Physical design evaluation

Prerequisites

  • Strong interest in computer architecture
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Experience in bare-metal or embedded C programming
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] Snitch paper
[2] Snitch Github repository
[3] Monte Carlo methods
[4] Box-Muller transform
[5] List of PRNGs
[6] List of LCGs
[7] LCGs and MRGs in Monte Carlo
[8] Parallel LCG implementation
[9] Leapfrog method for parallel LCG implementation
[10] SPRNG
[11] Philox
[12] xoshiro/xoroshiro
[13] Finding the Best 64-bit Simulation PRNG
[14] oneAPI PRNGs