Revision as of 11:11, 13 September 2022

Overview

Status: Available

Type: Semester Thesis (2 students), Master Thesis (1 student)
Professor: Prof. Dr. L. Benini
Supervisors:
- Jannis Schönleber, janniss@ethz.ch
- Lukas Cavigelli (Huawei), lukas.cavigelli@huawei.com
- Renzo Andri (Huawei), renzo.andri@huawei.com

Introduction

Floorplan of the MADDness accelerator.

The continued growth in DNN parameter counts, application domains, and general adoption has led to an explosion in the required computing power and energy. The energy needs in particular have become large enough to make deployments economically unviable or extremely difficult to cool. This has led to a push for more energy-efficient solutions. Energy-efficient accelerators have a long tradition at IIS, with a multitude of proven accelerators published in the past. Standard accelerator architectures try to increase throughput via higher memory bandwidth, an improved memory hierarchy, or reduced precision (FP16, INT8, INT4). The accelerator used in this project takes a different approach. It uses an approximate matrix multiplication (AMM) algorithm called MADDness, which replaces the matrix multiplication with lookups into a look-up table (LUT) and additions. This can significantly reduce the overall computing and energy needs.
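To make the lookup-and-add idea concrete, here is a minimal NumPy sketch of LUT-based approximate matrix multiplication. It is an illustration under stated assumptions, not the project's implementation: the function names (`build_luts`, `encode`, `decode`) and all sizes are hypothetical, and the encoder uses a plain nearest-prototype search as a stand-in, whereas the referenced MADDness paper describes a learned hash-based encoder that avoids multiplications.

```python
import numpy as np

def build_luts(prototypes, B):
    """Precompute prototype-times-B products, done once offline.

    prototypes: (C, K, d_sub), K prototypes for each of C input subspaces.
    B:          (D, M) with D = C * d_sub (the fixed weight matrix).
    Returns LUTs of shape (C, K, M).
    """
    C, K, d_sub = prototypes.shape
    B_sub = B.reshape(C, d_sub, -1)                       # (C, d_sub, M)
    return np.einsum("ckd,cdm->ckm", prototypes, B_sub)

def encode(A, prototypes):
    """Map each row of A to one LUT address per subspace.

    Assumption: nearest-prototype search, used here only for brevity.
    """
    C, K, d_sub = prototypes.shape
    A_sub = A.reshape(A.shape[0], C, d_sub)               # (N, C, d_sub)
    dists = ((A_sub[:, :, None, :] - prototypes[None]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)                          # (N, C)

def decode(codes, luts):
    """Approximate A @ B using only table lookups and additions."""
    C = luts.shape[0]
    return luts[np.arange(C), codes].sum(axis=1)          # (N, M)

# Hypothetical sizes, chosen only for this demonstration.
rng = np.random.default_rng(0)
C, K, d_sub, M, N = 4, 16, 8, 5, 10
prototypes = rng.normal(size=(C, K, d_sub))
B = rng.normal(size=(C * d_sub, M))
luts = build_luts(prototypes, B)

# Build inputs whose subvectors are exactly prototypes; in this special
# case the lookup-and-add result matches the exact product.
true_codes = rng.integers(0, K, size=(N, C))
A = prototypes[np.arange(C), true_codes].reshape(N, C * d_sub)

codes = encode(A, prototypes)
approx = decode(codes, luts)
print(np.allclose(approx, A @ B))
```

For general inputs the subvectors do not coincide with prototypes, so the result is only an approximation whose quality depends on how well the prototypes cover the input distribution.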

Project Details

The MADDness algorithm is split into two parts. The encoding part translates the input matrix A into LUT addresses. The decoding part then adds the corresponding LUT values together to calculate the approximate output of the matrix multiplication. MADDness is then integrated into deep neural networks: the most commonly seen layers in DNNs are convolutional and linear layers, and both can be replaced by MADDness.

The current RTL implementation is fully post-simulation tested and includes both the encoder and the decoder unit. Additionally, a post-layout simulation-based energy estimation has been done. The accelerator is not yet integrated into a full system. Energy estimates with the current implementation in GF 22nm FDX technology suggest an energy efficiency of up to 32 TMACs/W, compared to around 0.7 TMACs/W (FP16) for a state-of-the-art datacenter NVIDIA A100 (TSMC 7nm FinFET).

In this project, we would like to integrate the accelerator into a full system. The aim is to end up with a tape-out-ready full system that includes the MADDness accelerator. A full system includes a suitable memory hierarchy to support the bandwidth needs. We envision an integration into one of the existing PULP systems (for example, PULP clusters or ARA). Evaluating which system suits the accelerator best and defining the final architecture are part of the thesis. More information can be found here:

Code: https://github.com/joennlae/halutmatmul
Reference Paper: https://arxiv.org/abs/2106.10860
HN discussion: https://news.ycombinator.com/item?id=28375096
Please do not hesitate to reach out to me: janniss@ethz.ch
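As background on why convolutional layers can be served by a matrix-multiplication accelerator at all, the sketch below lowers a 2D convolution to a single matrix product using the common im2col transformation. This is only one possible lowering and is not taken from the project repository; the function names are hypothetical, and stride 1 without padding is assumed. The patch matrix `A` plays the role of the input that would be encoded, and `B` is the fixed weight matrix.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold x of shape (C_in, H, W) into patches, one per row.

    Returns (H_out * W_out, C_in * kh * kw); stride 1, no padding.
    """
    c_in, h, w = x.shape
    h_out, w_out = h - kh + 1, w - kw + 1
    cols = np.empty((h_out * w_out, c_in * kh * kw), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            cols[i * w_out + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_matmul(x, weights):
    """Convolution (cross-correlation, as in DNN layers) via one matmul.

    weights: (C_out, C_in, kh, kw). Returns (C_out, H_out, W_out).
    """
    c_out, _, kh, kw = weights.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    A = im2col(x, kh, kw)                  # (H_out*W_out, C_in*kh*kw)
    B = weights.reshape(c_out, -1).T       # (C_in*kh*kw, C_out)
    return (A @ B).T.reshape(c_out, h_out, w_out)

def conv2d_direct(x, weights):
    """Reference implementation with explicit loops, for comparison."""
    c_out, _, kh, kw = weights.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * weights[o])
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 6, 6))             # hypothetical input feature map
weights = rng.normal(size=(4, 3, 3, 3))    # hypothetical 3x3 filters
print(np.allclose(conv2d_as_matmul(x, weights), conv2d_direct(x, weights)))
```

Once a layer is in this form, the product `A @ B` is the operation that an approximate scheme such as MADDness would replace.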

Project Plan

Acquire background knowledge & familiarize yourself with the project

Read up on the MADDness algorithm and product quantization methods
Familiarize yourself with the current state of the project
Familiarize yourself with the IIS compute environment

Set up the project & rerun RTL simulations

Set up the project with all its dependencies
Try to rerun the current RTL simulations

Evaluate suitable systems to integrate and refine an architecture

Define bandwidth needs and brainstorm suitable memory hierarchies
Spreadsheet-based evaluation of the different target systems (PULP clusters, ARA, etc.), including an exploration of different configurations and the estimated chip size
Decide on a final architecture that we will pursue for the remainder of the project
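A first-pass bandwidth check of the kind described above can be done with a few lines of arithmetic before moving to a spreadsheet. The sketch below is purely illustrative: the configuration names, bank counts, widths, clock frequency, and the accelerator's input rate are hypothetical placeholders and not measurements of MADDness, PULP clusters, or ARA.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryConfig:
    """One candidate memory configuration (all values are placeholders)."""
    name: str
    banks: int
    bank_width_bits: int
    clock_hz: int

    def peak_bytes_per_s(self) -> int:
        # Peak bandwidth if every bank delivers one word per cycle.
        return self.banks * self.bank_width_bits * self.clock_hz // 8

def required_bytes_per_s(elements_per_cycle: int,
                         bits_per_element: int,
                         clock_hz: int) -> int:
    """Input bandwidth needed to keep the accelerator busy every cycle."""
    return elements_per_cycle * bits_per_element * clock_hz // 8

def feasible(configs, required):
    """Names of the configurations whose peak bandwidth covers the need."""
    return [c.name for c in configs if c.peak_bytes_per_s() >= required]

CLOCK_HZ = 500_000_000  # hypothetical 500 MHz clock

# Hypothetical requirement: 32 elements of 16 bits consumed per cycle.
required = required_bytes_per_s(32, 16, CLOCK_HZ)

candidates = [
    MemoryConfig("config_a", banks=4, bank_width_bits=32, clock_hz=CLOCK_HZ),
    MemoryConfig("config_b", banks=8, bank_width_bits=64, clock_hz=CLOCK_HZ),
    MemoryConfig("config_c", banks=16, bank_width_bits=64, clock_hz=CLOCK_HZ),
]

print(required)
print(feasible(candidates, required))
```

Peak bandwidth is an optimistic bound; bank conflicts and interconnect contention would lower the achievable figure, so a real evaluation would add a utilization factor per target system.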

Integrate the accelerator into the defined architecture

Implement the integration in SystemVerilog & add testbenches
Replace the currently used standard cell memories with compiled memories

Set up the design flow

Set up the design flow (most likely tsmc65) and integrate it into the project

Synthesize + Place-and-Route & make design tape-out ready

Synthesize + Place-and-Route the design
Get a working post-layout simulation
Place macros & power routing, IR drop checks
The goal is to have everything ready for a design review: http://eda.ee.ethz.ch/index.php?title=Design_review (ETH domain)

Project finalization

Prepare final report
Prepare project presentation
Clean up code

Character

15% Literature / architecture review
15% Design Evaluation
30% RTL implementation (SystemVerilog)
10% Low-level software implementation (C)
30% ASIC tape-out preparation

Prerequisites

Strong interest in computer architecture
Experience with digital design in SystemVerilog as taught in VLSI I
Experience with ASIC implementation flow (synthesis) as taught in VLSI II
Light experience with C or a comparable language for low-level SW glue code

Difference between revisions of "Approximate Matrix Multiplication based Hardware Accelerator to achieve the next 10x in Energy Efficiency: Full System Intregration" - iis-projects
