NVDLA meets PULP

From iis-projects


Latest revision as of 12:42, 10 November 2020

Introduction

[Figure: NVDLA Memory System and High-Level Architecture]

Machine learning is the hot topic of the industry at the moment. Dedicated hardware accelerators for various deep learning tasks are being developed by many different parties for an equally diverse set of applications. IIS and the PULP project are no exception: we have developed multiple accelerators targeting different forms and stages of deep learning, as well as processors computationally capable of performing such tasks. NVIDIA, of GPU fame, has recently released its take on such an accelerator, the NVIDIA Deep Learning Accelerator (NVDLA). In this Master Thesis we would like to find out whether and how NVDLA can be a companion, competitor, accelerator, or encompassing framework for the PULP project and the accelerators/processors we have developed.

Project description

The purpose of this project is to get the NVIDIA Deep Learning Accelerator up and running, implement it in a modern ASIC technology node, and compare it against other accelerators in the PULP project. More specifically, as a result of this project we would like to see how NVDLA

  • compares against NTX [Schuiki2018], a streaming floating-point accelerator
  • compares against Ara, a vector processor based on the RISC-V Vector extension
  • compares against RI5CY cluster/Ariane [Gautschi2017,Zaruba2018], two scalar RISC-V processors

in terms of performance, area, and power consumption. This includes getting familiar with NVDLA, understanding its programming model, and being able to launch computation on it. Since NVDLA is a large unit, we are also interested to see whether and how it can be combined with PULP and at what scale such a combination would be beneficial. The difference in scale will make it necessary to consider multiple NTX/Ara/RI5CY clusters or Ariane cores working in tandem in order to attain meaningful comparisons (such as same-compute, same-area, same-power, or same-bandwidth settings).
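As a rough illustration of what such an iso-metric comparison could look like, the sketch below normalizes throughput under a fixed area budget by replicating each design point until the budget is filled. All names and numbers are illustrative placeholders, not measured results from any of the accelerators above:

```python
# Hedged sketch of an iso-area comparison: fill a fixed silicon budget
# with replicas of each accelerator and compare aggregate throughput.
# All figures below are placeholders, NOT measured results.

def replicas_for_area(unit_area_mm2: float, budget_mm2: float) -> int:
    """How many copies of a unit fit into a fixed area budget."""
    return int(budget_mm2 // unit_area_mm2)

def iso_area_throughput(unit_gops: float, unit_area_mm2: float,
                        budget_mm2: float) -> float:
    """Aggregate throughput when the budget is filled with replicas."""
    return replicas_for_area(unit_area_mm2, budget_mm2) * unit_gops

# Hypothetical design points: (name, GOPS per instance, area in mm^2)
design_points = [
    ("large accelerator",  1000.0, 10.0),
    ("small cluster",        50.0,  0.5),
    ("vector processor",    100.0,  2.0),
]

BUDGET_MM2 = 10.0  # fixed area budget for the iso-area comparison

for name, gops, area in design_points:
    n = replicas_for_area(area, BUDGET_MM2)
    total = iso_area_throughput(gops, area, BUDGET_MM2)
    print(f"{name}: {n} instance(s), {total:.0f} GOPS in {BUDGET_MM2} mm^2")
```

The same normalization can be repeated with power or bandwidth as the shared budget; the interesting question for the thesis is which budget makes the comparison fair for units of such different sizes.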

NVDLA is released as Verilog source code, and all PULP-related sources are in SystemVerilog and VHDL. It is essential that you know or are willing to learn your way around an HDL and ASIC implementation tools (see next section). As a first step we are interested in synthesis results only, but depending on the project's progress we can also consider doing place-and-route to get a feeling for how NVDLA behaves in the backend.

Required Skills

To work on this project, you will need:

  • to have worked in the past with at least one RTL language (SystemVerilog or Verilog or VHDL) -- having followed the VLSI1 / VLSI2 courses is recommended
  • to have prior knowledge of hardware design and computer architecture -- having followed the "Advanced System-on-Chip Design" or "Energy-Efficient Parallel Computing Systems for Data Analytics" course is recommended
  • to have prior knowledge of basic machine learning, mainly DNNs/CNNs which will be used as sample workloads

Other skills that you might find useful include:

  • familiarity with git, the UNIX shell, C programming
  • to be strongly motivated for a difficult but super-cool project

Status: Completed

Semester Project of Davide Menini
Supervision: Fabian Schuiki, Florian Zaruba

Professor

Luca Benini

Practical Details

Meetings & Presentations

The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.

At the end of the project, you have to present/defend your work during a 15 min. presentation and 5 min. of discussion as part of the IIS colloquium.

Literature

  • [Schuiki2018] F. Schuiki et al., "A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets." https://arxiv.org/abs/1803.04783
  • [Gautschi2017] M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices." https://ieeexplore.ieee.org/iel7/92/8049574/07864441.pdf
  • [Zaruba2018] F. Zaruba, "Ariane: An open-source 64-bit RISC-V Application-Class Processor and latest Improvements." https://content.riscv.org/wp-content/uploads/2018/05/14.15-14.40-FlorianZaruba_riscv_workshop-1.pdf

Links

  • The EDA wiki with lots of information on the ETHZ ASIC design flow (internal only)
  • The IIS/DZ coding guidelines