Universal Stream Semantic Registers for Snitch (1S)

From iis-projects

Revision as of 11:27, 10 August 2021


Overview

Status: Available

Introduction

Processors often access data as memory streams, sequences of memory requests following predefined address patterns. Recent architectural extensions [1-2] propose handling such streams in hardware. This frees processors from explicitly computing addresses and issuing requests, increasing compute throughput. It also decouples data movement from execution, hiding architectural latencies and maximizing bandwidth utilization.

In our group, we developed Stream Semantic Registers (SSRs) [1]. These map memory streams directly to general-purpose registers in a RISC-V core, such that simply accessing a register loads or stores data. The stream's addresses are computed by an address generator, which is programmed with the stream's address pattern (loop bounds, strides, ...) beforehand.
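The address pattern described above can be pictured with a small software model. The following is only an illustrative sketch, not the SSR hardware or its configuration interface: the function name `affine_stream_addresses` and its parameters are hypothetical, and it assumes the pattern is a nest of loops, each with a bound and a byte stride, with loop 0 as the innermost.

```python
from itertools import product


def affine_stream_addresses(base, bounds, strides):
    """Yield the byte addresses of a nested-loop (affine) stream.

    Hypothetical model: bounds[i] and strides[i] describe loop i,
    where loop 0 is the innermost loop. Strides are in bytes.
    """
    if len(bounds) != len(strides):
        raise ValueError("bounds and strides must have the same length")
    # itertools.product varies its last argument fastest, so the loops are
    # passed outermost first to make loop 0 the fastest-changing one.
    for idx in product(*(range(b) for b in reversed(bounds))):
        # idx is ordered outermost..innermost; reverse it to match strides.
        offset = sum(i * s for i, s in zip(reversed(idx), strides))
        yield base + offset


if __name__ == "__main__":
    # Two elements per row (8-byte stride), three rows (64-byte stride).
    for addr in affine_stream_addresses(0x1000, bounds=[2, 3], strides=[8, 64]):
        print(hex(addr))
```

In this model, the core would consume one address per register access; the loop nest itself never appears in the instruction stream.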

SSRs are used in the Snitch cluster [3] along with the floating-point repetition (FREP) hardware loop; this enables floating-point unit (FPU) utilizations near 100% on regular problems. In this context, we recently extended SSRs to also handle indirect streams [4] for sparse workloads, and are actively working on further extensions.
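To make the idea of an indirect stream concrete, here is a minimal sketch in which each address is derived from an index read from a second array, as in a sparse-dense dot product. The names `indirect_stream_addresses` and `sparse_dense_dot`, as well as the default 8-byte element size, are assumptions made for illustration; the actual mechanism is described in [4].

```python
def indirect_stream_addresses(base, indices, elem_bytes=8):
    """Return the byte address of each indexed element (hypothetical model)."""
    return [base + i * elem_bytes for i in indices]


def sparse_dense_dot(values, indices, dense):
    """Dot product of a sparse vector (values, indices) with a dense vector.

    The access dense[i] is the indirection that an indirect stream would
    resolve in hardware instead of in the instruction stream.
    """
    return sum(v * dense[i] for v, i in zip(values, indices))


if __name__ == "__main__":
    print([hex(a) for a in indirect_stream_addresses(0x2000, [0, 3, 1])])
    print(sparse_dense_dot([2.0, 3.0], [1, 3], [10.0, 20.0, 30.0, 40.0]))
```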

However, there is a fundamental limitation to SSRs as currently implemented in Snitch systems: they only support streaming double-precision (64-bit) floating-point data. Adding support for integer types and different element sizes (8, 16, 32, and 64 bits) would enable accelerating many more scenarios, such as graph processing.
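As a rough illustration of what variable element sizes mean for a stream, the sketch below reads consecutive integer elements of a chosen width from a byte array. It is a software model under stated assumptions: the function `read_stream` is hypothetical, elements are assumed to be densely packed, and little-endian byte order is used because RISC-V is little-endian.

```python
def read_stream(memory, base, count, elem_bytes, signed=True):
    """Read `count` packed integer elements of `elem_bytes` bytes each.

    Hypothetical model of a variable-size integer stream; `memory` is a
    bytes-like object and `base` a byte offset into it.
    """
    if elem_bytes not in (1, 2, 4, 8):
        raise ValueError("element size must be 1, 2, 4, or 8 bytes")
    elements = []
    for k in range(count):
        addr = base + k * elem_bytes
        raw = memory[addr:addr + elem_bytes]
        elements.append(int.from_bytes(raw, "little", signed=signed))
    return elements


if __name__ == "__main__":
    memory = bytes([0x01, 0x00, 0xFF, 0xFF, 0x10, 0x00, 0x00, 0x00])
    print(read_stream(memory, 0, 2, 2))                # two signed 16-bit elements
    print(read_stream(memory, 0, 4, 1, signed=False))  # four unsigned 8-bit elements
    print(read_stream(memory, 4, 1, 4))                # one signed 32-bit element
```

The same eight bytes yield a different number of elements depending on the configured width, which is why the element size would have to be part of the stream's configuration.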

Project

In this project, we want to:

  • Extend SSRs to support variably-sized types for stream elements.
  • Extend the work-in-progress Snitch integer processing unit (IPU) to support integer SSRs.
  • Write simple programs (e.g. linear algebra, graph algorithm kernels) demonstrating the use of integer and variable-size streams.
  • Evaluate the performance, area, energy, and timing impact of these extensions on the above applications.
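As one possible example of the kind of graph kernel mentioned in the list above, the sketch below performs a single label-propagation step (as used in connected-components algorithms) over a graph in compressed sparse row (CSR) form. The kernel choice and the name `propagate_min_labels` are assumptions for illustration, not part of the project definition; the point is that the data consists of integers and that `labels[col_idx[e]]` is an indirect integer access.

```python
def propagate_min_labels(row_ptr, col_idx, labels):
    """One label-propagation step: each vertex takes the minimum label
    among itself and its neighbors. The graph is given in CSR form.
    """
    new_labels = list(labels)
    for v in range(len(row_ptr) - 1):
        # Edges of vertex v occupy col_idx[row_ptr[v]:row_ptr[v + 1]].
        for e in range(row_ptr[v], row_ptr[v + 1]):
            neighbor_label = labels[col_idx[e]]  # indirect integer access
            if neighbor_label < new_labels[v]:
                new_labels[v] = neighbor_label
    return new_labels


if __name__ == "__main__":
    # Undirected path graph 0 - 1 - 2.
    row_ptr = [0, 1, 3, 4]
    col_idx = [1, 0, 2, 1]
    print(propagate_min_labels(row_ptr, col_idx, [0, 1, 2]))
```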

The project can be simplified, adapted, or extended to suit your needs and wishes.

Character

  • 20% Architecture specification
  • 40% RTL implementation
  • 40% Verification and Evaluation

Prerequisites

  • Strong interest in computer architecture and/or memory systems
  • Experience with HDLs (preferably SystemVerilog) as taught in VLSI I
  • Knowledge of the ASIC tool flow or parallel enrollment in VLSI II
  • Basic knowledge of embedded / bare-metal programming in C

References

[1] https://ieeexplore.ieee.org/document/9068465

[2] https://ieeexplore.ieee.org/document/8980305

[3] https://ieeexplore.ieee.org/document/9216552

[4] https://arxiv.org/abs/2011.08070