3D Matrix Multiplication Unit for ITA (1S)


Overview

Status: Available

Introduction

Figure: Architecture of ITA.

Transformers [1], initially popularized in Natural Language Processing (NLP), have since found applications far beyond it and are now integral to a wide range of deep learning tasks. However, efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensity, large memory requirements, and complex dataflow dependencies. To address these challenges, we designed ITA, the Integer Transformer Accelerator [2], which targets efficient transformer inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values.

ITA exploits the parallelism of the attention mechanism and 8-bit integer quantization to improve performance and energy efficiency. To maximize ITA's energy efficiency, we focus on minimizing data movement throughout the execution of the attention mechanism. In contrast to throughput-oriented accelerator designs, which typically employ systolic arrays, ITA implements its processing elements as wide dot-product units, which maximizes the depth of the adder trees and thereby further increases efficiency.
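
For intuition, one such processing element can be sketched as a single wide dot product whose partial products are reduced in one combinational adder tree. The module below is an illustrative sketch, not ITA's actual RTL; the name, parameterization, and widths are our own:

 // Illustrative wide dot-product PE (not ITA's actual RTL): N int8
 // multiplies reduced through one combinational adder tree.
 module wide_dot_product #(
   parameter int unsigned N = 64
 ) (
   input  logic signed [7:0]            a_i [N],
   input  logic signed [7:0]            b_i [N],
   // 16-bit products plus $clog2(N) guard bits for the accumulation
   output logic signed [15+$clog2(N):0] acc_o
 );
   always_comb begin
     acc_o = '0;
     for (int i = 0; i < N; i++) begin
       // synthesis maps this reduction onto a balanced adder tree of
       // depth ~log2(N) instead of a chain of narrow MACs
       acc_o += a_i[i] * b_i[i];
     end
   end
 endmodule

The larger N is, the more additions are amortized per operand fetch, which is the efficiency argument for deep adder trees made above.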

To overcome the complex dataflow requirements of standard softmax, we present a novel approach that performs softmax directly on 8-bit quantized values in a streaming fashion. This energy- and area-efficient softmax implementation operates entirely in integer arithmetic, with an area footprint of only 3.3% of ITA's total area and a mean absolute error of 0.46% compared to a floating-point implementation. Our approach also enables a weight-stationary dataflow by decoupling the denominator summation from the division in softmax. The streaming softmax operation and the weight-stationary dataflow, in turn, minimize data movement and power consumption in the system.
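
The decoupling can be illustrated with a small sketch. The accumulator below tracks the running row maximum and, whenever the maximum changes, rescales the partial denominator with a right shift, so the accumulation never stalls on a division; it uses a base-2 exponential in fixed point as a stand-in for ITA's exact quantized scheme, and the module name and widths are our own:

 // Illustrative streaming accumulator for the softmax denominator.
 // Uses a base-2 exponential in fixed point as a stand-in for ITA's
 // exact quantized scheme; names and widths are illustrative.
 module softmax_denom_acc #(
   parameter int unsigned FRAC = 16  // fixed-point fractional bits
 ) (
   input  logic                   clk_i,
   input  logic                   rst_ni,
   input  logic                   valid_i,
   input  logic signed [7:0]      score_i,  // quantized attention score
   output logic signed [7:0]      max_o,    // running row maximum
   output logic        [FRAC+8:0] denom_o   // partial sum, rows up to 512
 );
   always_ff @(posedge clk_i or negedge rst_ni) begin
     if (!rst_ni) begin
       max_o   <= 8'sh80;  // -128, the smallest int8 score
       denom_o <= '0;
     end else if (valid_i) begin
       if (score_i > max_o) begin
         // new maximum: rescale the old partial sum by 2^(max_o - score_i)
         // with a right shift, then add 2^0 = 1.0 for the new element
         denom_o <= (denom_o >> (score_i - max_o)) + (1 << FRAC);
         max_o   <= score_i;
       end else begin
         // add 2^(score_i - max_o) <= 1.0 in fixed point via a right shift
         denom_o <= denom_o + ((1 << FRAC) >> (max_o - score_i));
       end
     end
   end
 endmodule

Because the denominator is finalized only after the whole row has streamed through, the division can be deferred and folded into the subsequent accumulation, which is what permits the weight-stationary dataflow described above.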

In its current implementation, ITA performs matrix multiplications with an array of dot-product units (a 2D organization). However, a recent study [3] shows that cube (3D) units can provide higher energy and area efficiency. In this project, you will therefore implement a 3D matrix multiplication (MatMul) unit for ITA and compare it against the 2D version.
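
A cube unit can be viewed as an M x N grid of K-wide dot-product lanes that share their operand fetches across rows and columns. The sketch below is only a behavioral starting point under that assumption; the dimensions, names, and flat combinational structure are illustrative, not the design you are expected to build:

 // Illustrative "3D" MatMul unit (not a proposed microarchitecture):
 // an M x N grid of K-wide dot-product lanes that reduces one
 // (M x K) * (K x N) int8 tile per evaluation.
 module matmul_cube #(
   parameter int unsigned M = 4,
   parameter int unsigned N = 4,
   parameter int unsigned K = 16
 ) (
   input  logic signed [7:0]            a_i [M][K],
   input  logic signed [7:0]            b_i [K][N],
   output logic signed [15+$clog2(K):0] c_o [M][N]
 );
   always_comb begin
     for (int m = 0; m < M; m++) begin
       for (int n = 0; n < N; n++) begin
         c_o[m][n] = '0;
         for (int k = 0; k < K; k++) begin
           // each row of a_i is shared by N lanes and each column of
           // b_i by M lanes -- the operand reuse a 2D unit lacks
           c_o[m][n] += a_i[m][k] * b_i[k][n];
         end
       end
     end
   end
 endmodule

The operand sharing across the M x N grid is the source of the energy and area advantage reported in [3], and quantifying it against the 2D baseline is the core of the evaluation.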

Project

  • Familiarize yourself with the current version of ITA and propose a modified architecture with a 3D MatMul unit.
  • Design a 3D MatMul unit and modify the RTL of ITA by replacing the 2D dot-product units.
  • Evaluate your design in terms of performance, energy, and area, and compare it with the 2D version of ITA.

Character

  • 20% Architecture review
  • 50% RTL implementation
  • 30% Evaluation

Prerequisites

  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Experience with ASIC implementation flow (synthesis) as taught in VLSI II

References

[1] Attention Is All You Need

[2] ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

[3] Ascend: A Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing: Industry Track Paper