A Unified Compute Kernel Library for Snitch (1-2S)

Revision as of 20:15, 19 November 2021


Overview

Status: Available

  • Type: Semester Thesis
  • Professor: Prof. Dr. L. Benini
  • Supervisors:
    • Tim Fischer: fischeti@iis.ee.ethz.ch
    • Paul Scheffler: paulsc@iis.ee.ethz.ch
    • Alessandro Ottaviano: aottaviano@iis.ee.ethz.ch

Introduction

The Snitch ecosystem [1] targets energy-efficient high-performance systems. It is built around the minimal RISC-V Snitch integer core, only about 15 thousand gates in size, which can optionally be coupled to accelerators such as an FPU or a DMA engine.

Currently, Snitch’s floating-point subsystem is of particular interest: it includes stream semantic registers (SSRs) [2] and the floating-point repetition (FREP) hardware loop, which together enable almost continuous FPU utilization in many data-oblivious problems.
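
To make "data-oblivious" concrete, here is a minimal plain-C sketch of a dot product whose memory accesses depend only on the length `n`, never on the data values. Loops of this shape are the kind of workload the SSR and FREP description above refers to. The function name `dot_f64` is hypothetical, and the code deliberately uses no Snitch-specific intrinsics or assembly; it only illustrates the access pattern.

```c
#include <stddef.h>

/* Reference dot product (hypothetical name; plain C, no Snitch extensions).
 * Both operands are read with a fixed stride-1 pattern, so the loop body
 * reduces to one multiply-accumulate per element with no data-dependent
 * branches or addresses. */
double dot_f64(const double *a, const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

A kernel like this is a natural baseline: an optimized version can be checked against it and its speedup measured relative to it.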

Over time, we have written many simple demonstrator programs for Snitch systems to measure their performance. Most of these involve computational kernels: small, self-contained functions such as linear algebra operations, convolutions, or FFTs, which are called frequently in larger compute-intensive applications like machine learning layers or mathematical problem solvers.

Since a lot of compute time is spent in these kernels, optimizing them for the target hardware is a highly effective way to accelerate computation. Thus, most existing Snitch kernels are hand-tuned, partially or completely written in assembly, and use Snitch's extensions for maximum performance and efficiency.

Unfortunately, we do not have many compute kernels for Snitch yet, and much of the existing code was written for old versions of Snitch and is no longer maintained; it uses various code conventions, targets outdated versions of our extensions, and/or no longer performs optimally on our hardware. It is also scattered across different projects and repositories.

Project

In this project, you will create a unified library of high-performance computational kernels tailored to Snitch and its extensions for use in compute-intensive applications. To this end, you will:

  • Review and get familiar with existing efforts on
    • Snitch compute kernels and runtime
    • Compute libraries targeting PULP (PULP-NN [3], PULP DSP [4])
  • Define the structure and requirements for a compute kernel library
  • Write new compute kernels, which may include any of:
    • Linear algebra (matrix/vector/scalar sums, products, transpositions, inversions, ...)
    • Machine learning (pooling, batch normalization, backpropagation, ...)
    • Filter functions (convolution, FFT, ...)
    • Complex numbers (addition, multiplication, magnitude and argument, ...)
  • Verify your new kernels using results generated by common compute libraries
  • Evaluate the performance of your kernels in RTL simulations of a Snitch system
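
The verification step listed above could be sketched as follows: run a kernel, then compare its output element by element against golden values exported from a reference compute library, allowing a small floating-point tolerance. This is only an illustrative sketch in plain C. The names `axpy_f32` and `count_mismatches_f32` and the combined absolute/relative tolerance scheme are assumptions, not part of any existing Snitch library.

```c
#include <stddef.h>

/* Example kernel (hypothetical name): y[i] = alpha * x[i] + y[i]. */
void axpy_f32(float alpha, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Absolute value without pulling in libm, which may be unavailable
 * in a bare-metal environment. */
static float abs_f32(float v) {
    return v < 0.0f ? -v : v;
}

/* Hypothetical checker: counts elements where `actual` deviates from
 * `golden` by more than abs_tol + rel_tol * |golden[i]|.
 * The negated comparison also counts NaN results as mismatches. */
size_t count_mismatches_f32(const float *actual, const float *golden,
                            size_t n, float rel_tol, float abs_tol) {
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        float diff = abs_f32(actual[i] - golden[i]);
        float bound = abs_tol + rel_tol * abs_f32(golden[i]);
        if (!(diff <= bound)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Returning a mismatch count rather than a boolean keeps the checker usable in a simulation, where the count can double as the program's exit code.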

Depending on your preferences and prior experience, you may choose which class(es) of kernels you want to tackle or focus on. The proposal can also be split into multiple individual projects if necessary.

Character

  • 20% Literature / architecture review
  • 40% RTL implementation
  • 20% Bare-metal C programming
  • 20% Evaluation

Prerequisites

  • Strong interest in computer architecture and memory systems
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Experience with ASIC implementation flow (synthesis) as taught in VLSI II
  • SoCs for Data Analytics and ML and/or Computer Architecture lectures or equivalent
  • Preferred: Knowledge of or prior experience with RISC-V or ISA extension design

References

[1] https://ieeexplore.ieee.org/document/9216552

[2] https://ieeexplore.ieee.org/document/9068465

[3] https://github.com/pulp-platform/pulp-nn

[4] https://github.com/pulp-platform/pulp-dsp