Efficient Execution of Transformers in RISC-V Vector Machines with Custom HW acceleration (M)
1 Project Description and Timeline
Transformers have set a new standard in natural language processing and other machine learning tasks (e.g., computer vision, molecular dynamics).
In contrast to recurrent neural networks, the entire input is processed at once, and a learned attention mechanism provides context for each position in the input sequence.
Transformers have proven to train significantly faster than LSTMs and similar models, and have therefore enabled much larger and more complex models.
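As an illustration of the attention mechanism described above, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not taken from the project or from the Ara code base; the function names `softmax` and `attention` and the toy inputs are assumptions made only for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)                       # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V are lists of row vectors (hypothetical toy layout).
    Every query position attends to all key positions at once.
    """
    d = len(K[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # Similarity of this query to every key in the sequence
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy usage with made-up numbers
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
for row in attention(Q, K, V):
    print(row)
```

Each output row depends on the whole input sequence, which is what removes the step-by-step dependency of a recurrent network.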
Machine learning accelerators have been good at exploiting the inherent redundancy of convolutional neural network layers. Unfortunately, the building blocks of transformers have significantly lower operational intensity (i.e., compute-to-data-load ratio) and rely on more general-purpose compute (e.g., softmax).
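To make the notion of operational intensity concrete, the sketch below estimates FLOPs per byte for three kernels under a deliberately idealized model in which every operand is moved to or from memory exactly once. The layer shapes, the 4-byte element size, and the count of 5 operations per softmax element are assumptions chosen for illustration, not measurements from Ara or from any specific network.

```python
def gemm_intensity(m, k, n, bytes_per_elem=4):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * k * n                              # one multiply and one add per MAC
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return flops / traffic

def softmax_intensity(rows, cols, ops_per_elem=5, bytes_per_elem=4):
    """FLOPs per byte for a row-wise softmax.

    ops_per_elem=5 is an assumed count (max, subtract, exp, sum, divide).
    """
    flops = ops_per_elem * rows * cols
    traffic = 2 * rows * cols * bytes_per_elem         # read input, write output
    return flops / traffic

def conv_intensity(h, w, c_in, c_out, kh, kw, bytes_per_elem=4):
    """FLOPs per byte for a stride-1 convolution with same-size output."""
    flops = 2 * h * w * c_in * c_out * kh * kw
    traffic = (h * w * c_in + kh * kw * c_in * c_out + h * w * c_out) * bytes_per_elem
    return flops / traffic

# Hypothetical shapes: one attention head (sequence length 128, head size 64)
# versus a 3x3 convolution on a 56x56x64 feature map.
print("attention scores Q*K^T:", gemm_intensity(128, 64, 128))
print("softmax over scores:   ", softmax_intensity(128, 128))
print("3x3 convolution:       ", conv_intensity(56, 56, 64, 64, 3, 3))
```

Under these assumptions the convolution reuses each loaded value far more often than the attention kernels do, and softmax performs well under one operation per byte moved.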
In this master thesis, we want to identify the bottlenecks of transformers on an energy-efficient general-purpose vector processor, and extend the vector processor to cope with the new challenges of running transformer models efficiently.
Note: The student already has experience in programming and developing on the Ara vector processor from his previous semester thesis.
WP1: Familiarization with Algorithm and Tools (4 weeks, August) Study and summarize state-of-the-art (SoA) transformer models. Study novel accelerator designs for transformers. Analyze and get to know the compute and memory patterns/requirements. Familiarize with the IT infrastructure and EDA tool flow (if needed).
WP2: Software Development (8 weeks, September-October) Starting from the Ara project, develop a baseline SW implementation of a representative set of kernels/networks. Benchmark and find bottlenecks. Propose HW extensions.
WP3: System Development and HW Extension (8 weeks, November-December) Implement the extensions proposed in WP2 in the Ara vector processor. (Potentially) define the HW configuration of the vector processor.
WP4: Finalization and Reporting (4 weeks, January 2023) Finalize previous work packages. Write a comprehensive report. Prepare and hold a 20-minute presentation at the IIS.
Solid digital VLSI design knowledge, front-end and preferably also back-end (e.g., as taught at universities, like VLSI I-III at ETH Zurich)
Minimal experience with industry-standard EDA tools like Design Compiler, Innovus, Modelsim or similar
Basic knowledge of computer arithmetic
Basic knowledge of machine learning is an asset
Strong coding and scripting skills (SystemVerilog/VHDL, Python, TCL, Bash, etc.)
Excellent communication and writing skills in English
Reading list (Transformers/AI):
Reading list (VLSI):
Vivienne Sze et al., Efficient Processing of Deep Neural Networks (VNL library)
System and method for an optimized Winograd convolution accelerator
Method and apparatus for keeping statistical inference accuracy with 8-bit Winograd convolution: https://patents.google.com/patent/WO2020024093A1/en?oq=WO2020024093A1+ https://old.hotchips.org/hc31/HC31_1.11_Huawei.Davinci.HengLiao_v4.0.pdf
Liqiang Lu, SpWA: An Efficient Sparse Winograd CNN Accelerator on FPGA. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8465842
Verilog Coding Convention, https://github.com/lowRISC/style-guides/blob/master/VerilogCodingStyle.md
Reading list (general):
Xiaoyao Liang, Ascend AI Processor Architecture and Programming (from VNL library)