Approximate Matrix Multiplication based Hardware Accelerator to achieve the next 10x in Energy Efficiency: Full System Integration

Overview

Status: Available

Introduction

Floorplan of the MADDness accelerator.

The continued growth in DNN model parameter counts, application domains, and general adoption has led to an explosion in the required compute power and energy. The energy demands in particular have become large enough to be economically unviable or extremely difficult to cool, which has driven a push for more energy-efficient solutions. Energy-efficient accelerators have a long tradition at IIS, with a multitude of proven designs published in the past. Standard accelerator architectures try to increase throughput via higher memory bandwidth, an improved memory hierarchy, or reduced precision (FP16, INT8, INT4). The accelerator in this project takes a different approach: it uses an approximate matrix multiplication (AMM) algorithm called MADDness, which replaces the matrix multiplication with lookups into a look-up table (LUT) and additions. This can significantly reduce the overall compute and energy needs.

Project Details

The MADDness algorithm is split into two parts: an encoder, which translates the input matrix A into addresses of the LUT, and a decoder, which adds the corresponding LUT entries together to compute the approximate output of the matrix multiplication. MADDness is then integrated into deep neural networks: convolutional and linear layers, the most common layers in DNNs, can both be replaced by it.

The current RTL implementation includes both the encoder and the decoder unit and is fully tested in simulation; additionally, a post-layout, simulation-based energy estimation has been done. The accelerator is not yet integrated into a full system. Energy estimates for the current implementation in GF 22nm FDX technology suggest an energy efficiency of up to 32 TMACs/W, compared to around 0.7 TMACs/W (FP16) for a state-of-the-art datacenter NVIDIA A100 (TSMC 7nm FinFET), roughly a 45x difference, albeit across different technology nodes.

In this project, we would like to integrate the accelerator into a full system. The aim is a tape-out-ready full system that includes the MADDness accelerator together with a suitable memory hierarchy to support its bandwidth needs. We envision an integration into one of the existing PULP systems (for example, the PULP cluster or Ara). Evaluating which system suits the accelerator best and defining the final architecture is part of the thesis. More information can be found here:
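
To make the dataflow concrete, here is a minimal C sketch of the two stages. It illustrates the MADDness principle only, not the actual RTL or the trained hash functions: the codebook count, the number of prototypes, the placeholder enc() tree, and the flat LUT layout are all assumptions chosen for this example.

    #include <stdint.h>

    /* Illustrative parameters: NCB codebooks with K = 16 prototypes
       each (4-bit codes). Real values depend on the layer config. */
    #define NCB 16   /* number of codebooks */
    #define K   16   /* prototypes per codebook */

    /* Encoder stand-in: MADDness uses a trained depth-4 decision
       tree per codebook; the split dimensions and thresholds here
       are placeholders, not learned values. */
    static uint8_t enc(const float *subvec)
    {
        uint8_t code = 0;
        for (int level = 0; level < 4; level++)
            code = (uint8_t)((code << 1) | (subvec[level] > 0.0f));
        return code; /* 4-bit prototype index, 0..15 */
    }

    /* Encode: map each row of A (n_rows x d) to one LUT address per
       codebook by hashing its d/NCB-wide subvectors. */
    void maddness_encode(const float *A, int n_rows, int d,
                         uint8_t *codes /* n_rows x NCB */)
    {
        int sub = d / NCB; /* subspace width */
        for (int n = 0; n < n_rows; n++)
            for (int c = 0; c < NCB; c++)
                codes[n * NCB + c] = enc(&A[n * d + c * sub]);
    }

    /* Decode: each output element is NCB table lookups plus
       NCB - 1 additions -- no multiplications. lut holds the
       precomputed dot products of every prototype with the columns
       of B, laid out as [codebook][prototype][column]. */
    void maddness_decode(const uint8_t *codes, const int8_t *lut,
                         int n_rows, int m_cols, int32_t *out)
    {
        for (int n = 0; n < n_rows; n++)
            for (int m = 0; m < m_cols; m++) {
                int32_t acc = 0;
                for (int c = 0; c < NCB; c++)
                    acc += lut[(c * K + codes[n * NCB + c]) * m_cols + m];
                out[n * m_cols + m] = acc;
            }
    }

Because the inner loop only indexes a small table and accumulates, the hardware can replace the multiplier array of a conventional MAC datapath with LUT reads and adders, which is where the energy saving comes from.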

Project Plan

Acquire background knowledge & familiarize yourself with the project

  • Read up on the MADDness algorithm and product quantization methods
  • Familiarize yourself with the current state of the project
  • Familiarize yourself with the IIS compute environment

Set up the project & rerun RTL simulations

  • Set up the project with all its dependencies
  • Try to rerun the current RTL simulations

Evaluate suitable systems for integration and refine an architecture

  • Define bandwidth needs and brainstorm suitable memory hierarchies (see the back-of-envelope sketch after this list)
  • Spreadsheet-based evaluation of the different target systems (e.g., PULP clusters, Ara), including exploring different configurations and estimating the resulting chip size
  • Decide on a final architecture to pursue for the remainder of the project
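
As a starting point for that evaluation, a back-of-envelope calculation like the following C snippet can bound the encoder's input bandwidth at a given operating point. All numbers here (clock frequency, row width, element size, codebook count) are illustrative placeholders, not figures of the actual accelerator.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed operating point -- placeholders, not design data */
        double f_clk       = 500e6; /* clock frequency [Hz]        */
        double rows_per_cy = 1.0;   /* rows of A encoded per cycle */
        int    d           = 512;   /* elements per input row      */
        int    bytes_el    = 1;     /* INT8 input elements         */
        int    n_codebooks = 16;    /* one 4-bit code per codebook */

        /* Encoder input: one full row of A per cycle */
        double bw_in = f_clk * rows_per_cy * d * bytes_el;
        /* Encoder-to-decoder codes: n_codebooks * 4 bit per row */
        double bw_codes = f_clk * rows_per_cy * n_codebooks * 0.5;

        printf("input bandwidth: %6.1f GB/s\n", bw_in / 1e9);    /* 256.0 */
        printf("code  bandwidth: %6.1f GB/s\n", bw_codes / 1e9); /*   4.0 */
        return 0;
    }

Sweeping such numbers over candidate configurations in a spreadsheet makes it easy to check whether a given target system's memory hierarchy can actually feed the accelerator.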

Integrate the accelerator into the defined architecture

  • Implement the integration in SystemVerilog & add testbenches
  • Replace the currently used standard cell memories with compiled memories

Set up the design flow

  • Set up and integrate the design flow (most likely tsmc65) into the project

Synthesize, place-and-route & make the design tape-out ready

Project finalization

  • Prepare final report
  • Prepare project presentation
  • Clean up code


Character

  • 15% Literature / architecture review
  • 15% Design Evaluation
  • 30% RTL implementation (SystemVerilog)
  • 10% low-level software implementation (C)
  • 30% ASIC tape-out preparation

Prerequisites

  • Strong interest in computer architecture
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Experience with ASIC implementation flow (synthesis) as taught in VLSI II
  • Light experience with C or a comparable language for low-level SW glue code

