Creating an At-memory Low-overhead Bufferless Matrix Transposition Accelerator (1-3S/B)

From iis-projects


Overview

Status: Available

Introduction

At IIS we are developing a scalable and flexible family of DMA engines, called iDMA [1]. iDMA is the cluster-level DMA in both the Snitch and PULP clusters. When implemented as this cluster-level engine, iDMA has fine-granular access to the cluster-internal tightly-coupled data memory (TCDM).

Traditionally, an accelerator that reorganizes data, e.g., transposes a matrix, requires a large dedicated internal buffer: it reads the data in as a dense stream, reshuffles it in the buffer, and writes it out again as a dense stream. Our idea is to build such a reshuffling accelerator inside iDMA that uses the already present cluster TCDM as its buffer, eliminating the dedicated special-purpose buffer.
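The dense-in/reshuffle/dense-out pattern described above can be sketched in plain C. This is only an illustrative software model, not the iDMA implementation: the scratch array stands in for the accelerator's internal buffer, which the project would replace with the cluster TCDM; the matrix dimensions are arbitrary.

```c
#include <string.h>

#define ROWS 4
#define COLS 6

/* Transpose a dense row-major ROWS x COLS matrix by streaming it
 * through a scratch buffer:
 *   1. dense read of the source stream into the buffer,
 *   2. reshuffled dense write-out in transposed order.
 * In a traditional accelerator, `buf` is a dedicated internal buffer;
 * the project idea is to use the cluster TCDM in its place. */
static void transpose_through_buffer(const int *src, int *dst) {
    int buf[ROWS * COLS];            /* stand-in for the internal buffer */
    memcpy(buf, src, sizeof buf);    /* dense read                        */
    for (int r = 0; r < ROWS; r++)   /* reshuffled dense write-out        */
        for (int c = 0; c < COLS; c++)
            dst[c * ROWS + r] = buf[r * COLS + c];
}
```

The buffer is what makes both the input and output streams dense; without it, either the reads or the writes would have to be strided, which is exactly the access-granularity question the project explores on the TCDM.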

The resulting stream accelerators are extremely lightweight yet highly performant.

Project

You first investigate common data-reshuffling operations and characterize their access patterns. You then implement these reshuffling operations in our iDMA engine, adding the changes required to access the Snitch TCDM at word (or even sub-word) granularity and to use it as a buffer. Finally, you evaluate your approach against accelerators that use a dedicated internal buffer. Depending on progress, this work can directly lead to a publication.

Character

  • 20% Getting familiar with iDMA and Snitch, evaluating reshuffle operations
  • 30% Implementing the reshuffle operation in the iDMA
  • 30% Integrating your accelerator in Snitch
  • 20% Evaluation


Prerequisites

  • Interest in memory systems
  • Experience with digital design in SystemVerilog as taught in VLSI I

References

[1] “A High-performance, Energy-efficient Modular DMA Engine Architecture” https://arxiv.org/abs/2305.05240