http://iis-projects.ee.ethz.ch/index.php?title=Special:NewPages&feed=atom&hideredirs=1&limit=50&offset=&namespace=0&username=&tagfilter=iis-projects - New pages [en]2024-03-29T07:50:30ZFrom iis-projectsMediaWiki 1.28.0http://iis-projects.ee.ethz.ch/index.php?title=Benchmarking_a_RISC-V-based_Server_on_LLMs/Foundation_Models_(SA_or_MA)Benchmarking a RISC-V-based Server on LLMs/Foundation Models (SA or MA)2024-03-10T18:22:10Z<p>Xiaywang: </p>
<hr />
<div><!-- Benchmarking a RISC-V-based Server on LLMs/Foundation Models (SA or MA) --><br />
<br />
[[Category:Digital]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:2023]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Xiaywang]]<br />
[[Category:Cykoenig]]<br />
[[Category:Available]]<br />
<br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Semester or Master Thesis (multiple students possible)<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Xiaywang | Xiaying Wang]]: [mailto:xiaywang@iis.ee.ethz.ch xiaywang@iis.ee.ethz.ch]<br />
** [[:User:Cykoenig | Cyril Koenig]]: [mailto:cykoenig@iis.ee.ethz.ch cykoenig@iis.ee.ethz.ch]<br />
** [[:User: Vivianep | Viviane Potocnik]]: [mailto:vivianep@iis.ee.ethz.ch vivianep@iis.ee.ethz.ch]<br />
<br />
= Introduction =<br />
<br />
Milk-V is a company committed to delivering high-quality RISC-V products to developers, enterprises, and consumers. It focuses on the development of both hardware and software ecosystems around the RISC-V architecture. Milk-V strongly supports open-source initiatives and aims to enrich the RISC-V product landscape, hoping that through its efforts and those of the community, the future of RISC-V products will be as vast and luminous as the Milky Way.<br />
<br />
The Milk-V Pioneer is a developer motherboard utilizing the SOPHON SG2042 [1], designed in the standard microATX (mATX) form factor. It offers PC-like interfaces and compatibility with PC industrial standards, aiming to provide a native RISC-V development environment and desktop experience. The Pioneer is targeted at RISC-V developers and hardware pioneers, offering a platform to engage with cutting-edge RISC-V technology. This motherboard serves as an excellent choice for those interested in exploring and developing within the RISC-V architecture.<br />
<br />
[[File:Pioneer.jpg|400px|]] [2]<br />
<br />
= Project description =<br />
<br />
In this project, you will port LLMs and Foundation Models, e.g., Whisper, to a Milk-V server and benchmark their performance.<br />
<br />
You will first select a framework to execute LLMs in C/C++, for instance llama.cpp [3]. You will then evaluate one or several models using this framework on the SG2042 CPU. Finally, you will identify potential limitations or improvements of the code related to the microarchitecture.<br />
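The evaluation boils down to measuring end-to-end token throughput. The snippet below is a minimal, hypothetical sketch of such a harness (the `dummy_generate` stub stands in for invoking llama.cpp's inference loop on the SG2042; it is not part of any framework's API):<br />

```python
import time

def benchmark_tokens_per_second(generate, prompt, n_tokens):
    """Time a token-generation callable and report its throughput."""
    start = time.perf_counter()
    tokens = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return {
        "tokens": len(tokens),
        "seconds": elapsed,
        "tokens_per_second": len(tokens) / elapsed if elapsed > 0 else float("inf"),
    }

# Stub standing in for the real inference loop, used only to exercise the harness.
def dummy_generate(prompt, n_tokens):
    return ["tok"] * n_tokens

result = benchmark_tokens_per_second(dummy_generate, "Hello", 128)
```

In practice the same measurement would wrap the framework's own generation call, and would be repeated across thread counts and quantization formats to expose microarchitectural bottlenecks.<br />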
<br />
== Character ==<br />
<br />
* 20% Literature/architecture review<br />
* 60% Programming<br />
* 20% Evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Strong interest in computer architecture<br />
* Experience in C programming<br />
* Preferred: Knowledge or prior experience with RISC-V<br />
<br />
= References =<br />
<br />
[1] SOPHON SG2042 Technical Reference Manual: [https://github.com/milkv-pioneer/pioneer-files/blob/main/hardware/SG2042-TRM.pdf SG2042-TRM.pdf]<br />
<br />
[2] Milk-V Pioneer documentation: [https://milkv.io/docs/pioneer/ milkv.io/docs/pioneer]<br />
<br />
[3] llama.cpp: [https://github.com/ggerganov/llama.cpp github.com/ggerganov/llama.cpp]</div>Xiaywanghttp://iis-projects.ee.ethz.ch/index.php?title=On-Device_Learnable_Embeddings_for_Acoustic_EnvironmentsOn-Device Learnable Embeddings for Acoustic Environments2024-02-28T23:10:12Z<p>Cioflanc: Created page with "<!-- On-Device Federated Continual Learning on Nano-Drone Swarms --> = Overview = == Status: Available == * Type: Semester Thesis * Professor: Prof. Dr. L. Benini * Supervi..."</p>
<hr />
<div><!-- On-Device Federated Continual Learning on Nano-Drone Swarms --><br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cioflanc| Cristian Cioflan]] (IIS): [mailto:cioflanc@iis.ee.ethz.ch cioflanc@iis.ee.ethz.ch]<br />
** Dr. Lukas Cavigelli (Huawei Technologies): [mailto:lukas.cavigelli@huawei.com lukas.cavigelli@huawei.com]<br />
<br />
<br />
<!-- TODO: ADD APPROPRIATE CATEGORIES HERE --><br />
<br />
<br />
[[Category:2024]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Deep Learning Projects]]<br />
[[Category:Available]]<br />
[[Category:Digital]]<br />
[[Category:Cioflanc]]<br />
<br />
= Introduction =<br />
<br />
The objective of keyword spotting (KWS)[[#ref-Zhang2018|&#91;7&#93;]] is to detect a set of predefined keywords[[#ref-Warden2018|&#91;5&#93;]] within a stream of user utterances. Such a task is usually deployed on low-memory, low-power devices, so a KWS module should achieve high accuracy while also respecting the memory footprint and latency of the system -- the TinyML constraints[[#ref-reddi2020|&#91;4&#93;]]. In devices aimed at personal use, adapting the model to the user's characteristics can considerably improve the performance of the system[[#ref-Cornell2023|&#91;3&#93;]][[#ref-Cioflan2024|&#91;2&#93;]]. Similarly, in noisy settings, adapting to the on-site noise can recover the accuracy drop with respect to a silent room. We want to explore how environment-specific features can be exploited to improve the accuracy of a small-footprint KWS model.<br />
<br />
Feature embeddings[[#ref-Yang2015|&#91;6&#93;]][[#ref-Cioflan2024|&#91;2&#93;]] (see, e.g., [https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html torch.nn.Embedding]) represent a mapping of a discrete variable to a vector of continuous numbers. Their advantage over other encoding methods (e.g., one-hot encoding) lies in their low dimensionality and in the fact that they can be learned in a supervised fashion. Additionally, by projecting the input features into the neural network's embedding space, one can perform meaningful comparisons between features through distance measures. Historically, this concept has been successfully applied in recommender systems, in which a user is provided with relevant suggestions (e.g., films, books) by comparing their interests against clusters formed by other users' interests. In other words, the engine adapts itself based on user-specific features.<br />
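As a minimal, framework-free sketch of this idea (the environment names and vector values below are invented for illustration; in practice such tables are learned, e.g. via torch.nn.Embedding), an embedding table maps discrete categories to dense vectors that support distance-based comparison:<br />

```python
import math

# Toy embedding table: each discrete "environment" maps to a dense vector.
embeddings = {
    "quiet_office": [0.9, 0.1, 0.0],
    "noisy_street": [0.1, 0.8, 0.3],
    "car_interior": [0.2, 0.7, 0.4],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Distances in embedding space are meaningful: acoustically similar
# environments lie closer together than dissimilar ones.
sim_similar = cosine_similarity(embeddings["noisy_street"], embeddings["car_interior"])
sim_dissimilar = cosine_similarity(embeddings["noisy_street"], embeddings["quiet_office"])
```

Note how a 3-dimensional vector replaces what one-hot encoding would represent with one dimension per environment, while still allowing the two noisy environments to be recognised as similar.<br />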
<br />
<br />
Apart from forming a standalone network, embeddings can also be used as inputs to a DNN, as seen in Figure 1. By doing so, a well-pretrained model can be adapted to the environment characteristics, thus exceeding the performance of a generic model, as demonstrated by [[#ref-Cioflan2024|&#91;2&#93;]]. Additionally, learning embeddings is more efficient from a hardware perspective than fully updating a system's backbone through backpropagation. Furthermore, a certain environment (e.g., a target user) can be identified based on its embeddings, enabling the system to listen only to pre-registered users. Our first goal is to devise a KWS system integrating environmental features (e.g., speech characteristics, noisy environments, reverb profiles), trained online on the incoming environment-specific characteristics. Our second goal is to deploy the proposed model on an always-on, low-power device, such as the PULP platform GAP9[[#ref-GAP92022|&#91;1&#93;]]. The deployment should account for the latency cost, as well as the memory requirements and the number of parameters of the model.<br />
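The efficiency argument can be sketched with a toy model (all weights and values below are invented) in which a frozen linear "backbone" is conditioned on a learnable embedding, and gradient descent updates only the embedding:<br />

```python
# Toy illustration: the pretrained "backbone" weights stay frozen and
# gradient descent updates only the small environment embedding, which
# is far cheaper than full backpropagation through the whole network.

def forward(w_backbone, w_embed, x, e):
    # Backbone response plus a learned projection of the embedding.
    return (sum(wi * xi for wi, xi in zip(w_backbone, x))
            + sum(wi * ei for wi, ei in zip(w_embed, e)))

w_backbone = [0.5, -0.3]     # frozen, pretrained
w_embed = [1.0, 0.5]         # frozen projection of the embedding
x, y_target = [1.0, 2.0], 1.5
e = [0.0, 0.0]               # learnable environment embedding

lr = 0.1
losses = []
for _ in range(50):
    y = forward(w_backbone, w_embed, x, e)
    err = y - y_target
    losses.append(err * err)
    # Gradient of the squared error with respect to the embedding only.
    e = [ei - lr * 2 * err * wi for ei, wi in zip(e, w_embed)]
```

Only two parameters are updated per step here, yet the model output converges to the environment-specific target; the same principle underlies updating embeddings instead of the full backbone on-device.<br />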
<br />
<br />
== Character ==<br />
<br />
* 10% literature research<br />
* 30% deep learning<br />
* 50% on-device implementation<br />
* 10% evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Must be familiar with Python, in the context of deep learning and quantization<br />
* Must be familiar with C, for layer implementation and model deployment<br />
<br />
<br />
= Project Goals =<br />
<br />
The main tasks of this project are:<br />
<br />
<ul><br />
<li><p>'''Task 1: Familiarize yourself with the project specifics (1-2 Weeks)'''</p><br />
<p> Learn about DNN training with PyTorch and how to visualize results with TensorBoard. </p><br />
<p> Read up on embeddings, as well as on-device learning and fine-tuning. </p><br />
<p>Read up on DNN models aimed at time series (e.g., DS-CNNs, TCNs, Transformer and Conformer networks) and the recent advances in keyword spotting. </p></li><br />
<br />
<li><p>'''Task 2 - Develop an environment-aware KWS system (3-4 weeks)'''</p><br />
<p> Develop and train a noise-specific embedding in a KWS system. </p><br />
<p> Integrate the proposed system to expand the capabilities of a person-aware KWS system. </p><br />
<br />
<br />
<li><p>'''Task 3 - Deploy environment-aware KWS system on ultra-low-power devices (4-5 weeks)'''</p><br />
<p> Deploy backbone and feature embeddings to perform on-device inference. </p><br />
<p> Integrate on-device training using PULP-TrainLib[[#ref-Nadalini2022|&#91;8&#93;]] to enable adaptation to unseen environments (e.g., noises, people, etc.) </p><br />
<p> Evaluate the learning process considering hardware-associated metrics (e.g., latency, memory, storage, energy, etc.) </p><br />
<br />
<li><p>'''(Optional - if conducted as Master's Thesis) Task 4 - Develop an Environment-Aware Automated Speech Recognition (ASR) system (8-12 weeks)'''</p><br />
<p>Propose, train, and evaluate an ASR backbone.</p><br />
<p>Deploy the proposed backbone on ultra-low-power platforms and evaluate them for inference.</p><br />
<p>Expand the on-device learning algorithm to the ASR task (e.g., replacing CTC loss with CE loss) and evaluate it considering hardware-associated metrics.</p><br />
<br />
<li><p>'''Task 5 - Gather and Present Final Results (2-3 Weeks)'''</p><br />
<p>Gather final results.</p><br />
<p>Prepare presentation (15/20 min. + 5 min. discussion).</p><br />
<p>Write a final report. Include all major decisions taken during the design process and argue your choice. Include everything that deviates from the very standard case - show off everything that took time to figure out and all your ideas that have influenced the project.</p></li></ul><br />
<br />
= Project Organization =<br />
<br />
== Weekly Meetings ==<br />
<br />
The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, clear up any questions from either side, and ensure the student's progress.<br />
<br />
== Report ==<br />
<br />
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word-processing software may be used for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector-drawing software for block diagrams.<br />
<br />
==== Final Report ====<br />
<br />
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.<br />
<br />
== Presentation ==<br />
<br />
At the end of the project, the outcome of the thesis will be presented in a 15- to 20-minute talk, followed by 5 minutes of discussion, in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.<br />
<br />
= References =<br />
<br />
<div id="refs" class="references csl-bib-body"><br />
<br />
<div id="ref-GAP92022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;1&#93; </span><span class="csl-right-inline"> GAP9 product brief <span><span class="nocase"> https://greenwaves-technologies.com/wp-content/uploads/2023/02/GAP9-Product-Brief-V1_14_non_NDA.pdf</span> </span> 2022.</span><br />
</div><br />
<br />
<div id="ref-Cioflan2024" class="csl-entry"><br />
<span class="csl-left-margin">&#91;2&#93; </span><span class="csl-right-inline">Cristian Cioflan and Lukas Cavigelli and Luca Benini <span><span class="nocase"> Boosting keyword spotting through on-device learnable user speech characteristics</span>. </span> 2024.</span><br />
</div><br />
<br />
<div id="ref-Cornell2023" class="csl-entry"><br />
<span class="csl-left-margin">&#91;3&#93; </span><span class="csl-right-inline">Samuele Cornell and Jee-weon Jung and Shinji Watanabe and Stefano Squartini <span>One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition </span>2023. </span><br />
</div><br />
<br />
<div id="ref-reddi2020" class="csl-entry"><br />
<span class="csl-left-margin">&#91;4&#93; </span><span class="csl-right-inline">Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunathm Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. <span>“<span class="nocase">MLperf inference benchmark. </span> 2020</span></span><br />
</div><br />
<br />
<div id="ref-Warden2018" class="csl-entry"><br />
<span class="csl-left-margin">&#91;5&#93; </span><span class="csl-right-inline">Pete Warden <span>Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition </span>2018.</span><br />
</div><br />
<br />
<div id="ref-Yang2015" class="csl-entry"><br />
<span class="csl-left-margin">&#91;6&#93; </span><span class="csl-right-inline">Yi Yang and Jacob Eisenstein <span><span class="nocase">Unsupervised Multi-Domain Adaptation with Feature Embeddings </span></span>2015</span><br />
</div><br />
<br />
<div id="ref-Zhang2018" class="csl-entry"><br />
<span class="csl-left-margin">&#91;7&#93; </span><span class="csl-right-inline">Yundong Zhang and Naveen Suda and Liangzhen Lai and Vikas Chandra <span><span class="nocase">Hello Edge: Keyword Spotting on Microcontrollers </span></span>2018</span><br />
</div><br />
<br />
<div id="ref-Nadalini2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;8&#93; </span><span class="csl-right-inline">Nadalini, Davide<br />
and Rusci, Manuele<br />
and Tagliavini, Giuseppe<br />
and Ravaglia, Leonardo<br />
and Benini, Luca<br />
and Conti, Francesco <span><span class="nocase">PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning </span></span>2022</span><br />
</div><br />
<br />
</div></div>Cioflanchttp://iis-projects.ee.ethz.ch/index.php?title=On-Device_Federated_Continual_Learning_on_Nano-Drone_SwarmsOn-Device Federated Continual Learning on Nano-Drone Swarms2024-02-28T22:49:30Z<p>Cioflanc: Created page with "<!-- On-Device Federated Continual Learning on Nano-Drone Swarms --> = Overview = == Status: Available == * Type: Semester Thesis * Professor: Prof. Dr. L. Benini * Supervi..."</p>
<hr />
<div><!-- On-Device Federated Continual Learning on Nano-Drone Swarms --><br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cioflanc| Cristian Cioflan]] (IIS): [mailto:cioflanc@iis.ee.ethz.ch cioflanc@iis.ee.ethz.ch]<br />
** [[:User:Vladn| Vlad Niculescu]] (IIS): [mailto:vladn@iis.ee.ethz.ch vladn@iis.ee.ethz.ch]<br />
<br />
<br />
<!-- TODO: ADD APPROPRIATE CATEGORIES HERE --><br />
<br />
<br />
[[Category:2024]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Deep Learning Projects]]<br />
[[Category:Available]]<br />
[[Category:Digital]]<br />
[[Category:Cioflanc]]<br />
<br />
= Introduction =<br />
<br />
The increasing demand for machine learning at the edge, near the sensor, led to machine learning inference being performed directly on edge devices with limited memory and computational resources -- an approach often referred to as TinyML[[#ref-reddi2020|&#91;5&#93;]]. <br />
Together with inference, the neural networks' training paradigm is also challenged.<br />
Rather than following a train-once-deploy-everywhere approach, continual learning[[#ref-Lange2022|&#91;3&#93;]][[#ref-Wang2024|&#91;7&#93;]] enables models deployed on devices to learn incrementally, adapting to new data or tasks over time.<br />
This approach is crucial for applications in dynamic environments where data distributions change or new tasks emerge. <br />
One of the challenges that such models must overcome is catastrophic forgetting: acquiring new knowledge without corrupting what was previously learned.<br />
Federated learning[[#ref-McMahan2017|&#91;4&#93;]] further enhances edge-adaptable models by allowing a collective, distributed learning process across multiple devices, exchanging knowledge without exchanging raw data. <br />
This is crucial for privacy-sensitive applications and in scenarios where data cannot be centralized due to bandwidth or regulatory constraints. <br />
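The aggregation step at the heart of this idea can be sketched in a few lines, in the style of FedAvg [[#ref-McMahan2017|&#91;4&#93;]] (the client names, weights, and dataset sizes below are invented for the example):<br />

```python
def federated_average(client_weights, client_sizes):
    """Aggregate locally trained parameters, weighted by local data size.

    Only parameter vectors are exchanged -- the raw (e.g. biometric)
    samples never leave the individual device.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical drones with locally updated weights; the second
# drone trained on twice as many local samples.
w_drone_a = [0.2, 0.4]
w_drone_b = [0.8, 0.1]
global_w = federated_average([w_drone_a, w_drone_b], [100, 200])
print([round(v, 6) for v in global_w])  # → [0.6, 0.2]
```

A real deployment would iterate this exchange over communication rounds and combine it with a continual-learning strategy on each device, but the privacy property is already visible: only weights cross the network.<br />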
<br />
Federated Continual Learning tasks[[#ref-Dong2022|&#91;1&#93;]][[#ref-Yoon2021|&#91;8&#93;]] face unique challenges at the extreme edge, on autonomous nano-drones.<br />
Given the limited memory, processing power, and the requirement for real-time computation, it is mandatory to develop efficient learning algorithms capable of solving real-world problems such as face recognition.<br />
This scenario becomes even more complex when considering a swarm of drones[[#ref-Friess2023|&#91;2&#93;]], where each of them can encounter different faces in diverse environments. <br />
Through federated learning, the collective knowledge contained in a shared model can be individually updated, without sharing sensitive information such as raw biometric data.<br />
This expands the number of recognizable faces over time, while ensuring that the learning process is efficient and adaptable to the ever-changing environments in which the nano-drones operate.<br />
<br />
<br />
== Character ==<br />
<br />
* 10% literature research<br />
* 50% deep learning<br />
* 30% on-device implementation<br />
* 10% evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Must be familiar with Python, in the context of deep learning and quantization<br />
* Must be familiar with C, for layer implementation and model deployment<br />
<br />
<br />
= Project Goals =<br />
<br />
The main tasks of this project are:<br />
<br />
<ul><br />
<li><p>'''Task 1: Familiarize yourself with the project specifics (1-2 Weeks)'''</p><br />
<p> Learn about DNN training with PyTorch and how to visualize results with TensorBoard. </p><br />
<p> Read up on Continual Learning and Federated Learning, in the context of extreme edge devices. </p><br />
<p> Familiarize yourself with the PULP-TrainLib[[#ref-Nadalini2022|&#91;6&#93;]] on-device learning framework. </p></li><br />
<br />
<li><p>'''Task 2 - Develop an efficient Federated Continual Learning algorithm for Face Recognition (5-6 weeks)'''</p><br />
<p>Create a PyTorch framework for face recognition, including Datasets and Dataloaders. Expand it considering the specific data management required in Continual Learning. </p><br />
<p>Propose a continual learning algorithm and benchmark it considering neural backbones feasible for the extreme edge, such as MobileNetv2. </p><br />
<p>Expand the algorithm to enable federated learning. </p><br />
<br />
<li><p>'''Task 3 - Deploy FCL algorithm on PULP platforms (5-6 weeks)'''</p><br />
<p> Evaluate the proposed algorithm considering hardware-associated metrics (e.g., latency, parameters, memory). Optimize the algorithm considering an accuracy-complexity trade-off. </p><br />
<p> Deploy the pretrained model, expanding PULP-TrainLib to include the proposed learning algorithm. </p><br />
<p> Perform in-field evaluation of the FCL algorithm considering a simple, two-drone system. Intermediate evaluation of CL could be beneficial for this task. </p><br />
<br />
<li><p>'''Task 4 - Gather and Present Final Results (2-3 Weeks)'''</p><br />
<p>Gather final results.</p><br />
<p>Prepare presentation (15/20 min. + 5 min. discussion).</p><br />
<p>Write a final report. Include all major decisions taken during the design process and argue your choice. Include everything that deviates from the very standard case - show off everything that took time to figure out and all your ideas that have influenced the project.</p></li></ul><br />
<br />
= Project Organization =<br />
<br />
== Weekly Meetings ==<br />
<br />
The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, clear up any questions from either side, and ensure the student's progress.<br />
<br />
== Report ==<br />
<br />
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word-processing software may be used for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector-drawing software for block diagrams.<br />
<br />
==== Final Report ====<br />
<br />
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.<br />
<br />
== Presentation ==<br />
<br />
At the end of the project, the outcome of the thesis will be presented in a 15- to 20-minute talk, followed by 5 minutes of discussion, in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.<br />
<br />
= References =<br />
<br />
<div id="refs" class="references csl-bib-body"><br />
<div id="ref-Dong2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;1&#93; </span><span class="csl-right-inline">Jiahua Dong and Lixu Wang and Zhen Fang and Gan Sun and Shichao Xu and Xiao Wang and Qi Zhu <span><span class="nocase">Federated Class-Incremental Learning </span> </span> 2022.</span><br />
</div><br />
<br />
<div id="ref-Friess2023" class="csl-entry"><br />
<span class="csl-left-margin">&#91;2&#93; </span><span class="csl-right-inline">Carl Friess and Vlad Niculescu and Tommaso Polonelli and Michele Magno and Luca Benini <span><span class="nocase"> Fully Onboard SLAM for Distributed Mapping with a Swarm of Nano-Drone</span>. </span> 2023.</span><br />
</div><br />
<br />
<div id="ref-Lange2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;3&#93; </span><span class="csl-right-inline">M. De Lange and R. Aljundi and M. Masana and S. Parisot and X. Jia and A. Leonardis and G. Slabaugh and T. Tuytelaars <span>A Continual Learning Survey: Defying Forgetting in Classification Tasks </span>2022. </span><br />
</div><br />
<br />
<div id="ref-McMahan2017" class="csl-entry"><br />
<span class="csl-left-margin">&#91;4&#93; </span><span class="csl-right-inline">McMahan, Brendan and Moore, Eider and Ramage, Daniel and Hampson, Seth and Arcas, Blaise Aguera y <span> Communication-Efficient Learning of Deep Networks from Decentralized Data </span>2017.</span><br />
</div><br />
<br />
<div id="ref-reddi2020" class="csl-entry"><br />
<span class="csl-left-margin">&#91;5&#93; </span><span class="csl-right-inline">Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunathm Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. <span>“<span class="nocase">MLperf inference benchmark. </span> 2020</span></span><br />
</div><br />
<br />
<div id="ref-Nadalini2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;6&#93; </span><span class="csl-right-inline">Nadalini, Davide<br />
and Rusci, Manuele<br />
and Tagliavini, Giuseppe<br />
and Ravaglia, Leonardo<br />
and Benini, Luca<br />
and Conti, Francesco <span><span class="nocase">PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning </span></span>2022</span><br />
</div><br />
<br />
<div id="ref-Wang2024" class="csl-entry"><br />
<span class="csl-left-margin">&#91;7&#93; </span><span class="csl-right-inline">Liyuan Wang and Xingxing Zhang and Hang Su and Jun Zhu <span><span class="nocase">A Comprehensive Survey of Continual Learning: Theory, Method and Application </span></span>2024</span><br />
</div><br />
<br />
<div id="ref-Yoon2021" class="csl-entry"><br />
<span class="csl-left-margin">&#91;8&#93; </span><span class="csl-right-inline">Jaehong Yoon and Wonyong Jeong and Giwoong Lee and Eunho Yang and Sung Ju Hwang <span><span class="nocase">Federated Continual Learning with Weighted Inter-client Transfer </span></span>2021</span><br />
</div><br />
<br />
</div></div>Cioflanchttp://iis-projects.ee.ethz.ch/index.php?title=ASR-WaveformerASR-Waveformer2024-02-28T22:02:35Z<p>Wiesep: </p>
<hr />
<div><!-- ASR-Waveformer: Solving Automatic Speech Recognition through Linear Attention --><br />
<br />
= Overview =<br />
<br />
== Status: In Progress ==<br />
<br />
* Type: Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cioflanc| Cristian Cioflan]] (IIS): [mailto:cioflanc@iis.ee.ethz.ch cioflanc@iis.ee.ethz.ch]<br />
** [[:User:Wiesep| Philip Wiese]] (IIS): [mailto:wiesep@iis.ee.ethz.ch wiesep@iis.ee.ethz.ch]<br />
<br />
<br />
<!-- TODO: ADD APPROPRIATE CATEGORIES HERE --><br />
<br />
<br />
[[Category:2024]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Deep Learning Projects]]<br />
[[Category:In progress]]<br />
[[Category:Digital]]<br />
[[Category:Cioflanc]]<br />
[[Category:Wiesep]]<br />
<br />
= Introduction =<br />
<br />
The transformer[[#ref-Vaswani2017|&#91;7&#93;]] was initially designed for use in NLP, but its main novelty, the attention mechanism, has proved helpful in various applications, such as Automated Speech Recognition (ASR)[[#ref-Zhang2022|&#91;9&#93;]], speech separation[[#ref-Zhao2023|&#91;8&#93;]], or environmental sound classification[[#ref-Gazneli2022|&#91;2&#93;]].<br />
In parallel with this development, the increasing demand for machine learning at the edge, near the sensor, led to machine learning inference being performed directly on edge devices with limited memory and computational resources -- an approach often referred to as TinyML. <br />
In this context, the attention layer is the main bottleneck preventing transformers' adoption for embedded time-series processing.<br />
The memory and computational requirements of the implementation of the conventional attention mechanism scale quadratically with the input length, severely limiting the ability to process long sequences of data. <br />
A solution to this problem relies on random feature maps to approximate the softmax kernel[[#ref-Choromanski2022|&#91;1&#93;]][[#ref-Katharopoulos2020|&#91;3&#93;]][[#ref-Kitaev2020|&#91;4&#93;]], without the costly explicit calculation of the attention matrix.<br />
The latter class of attention is typically referred to as linear attention.<br />
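A pure-Python sketch of kernelised linear attention in the spirit of [[#ref-Katharopoulos2020|&#91;3&#93;]] (using the elu(x)+1 feature map; all dimensions and values below are illustrative) shows how the n-by-n score matrix is never materialised:<br />

```python
import math

def phi(v):
    # Positive feature map, elu(x) + 1, as used in linear-attention work.
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]

def linear_attention(Q, K, V):
    """Attention without materialising the n-by-n score matrix.

    The sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) are
    accumulated once over the sequence; each query then costs
    O(d * d_v), independent of the sequence length n.
    """
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

# Tiny example: 2 queries, sequence length 3, d = d_v = 2.
Q = [[0.1, 0.2], [0.3, -0.1]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = linear_attention(Q, K, V)
# Each output row is a convex combination of the value rows.
```

Because only the d-by-d_v summary S and the d-vector z are stored, memory stays constant in the sequence length, which is precisely what makes long audio sequences tractable on embedded targets.<br />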
<br />
Transformer architectures relying on linear attention have achieved state-of-the-art accuracy on keyword spotting tasks, as demonstrated by [[#ref-Scherer2024|&#91;5&#93;]].<br />
Keyword spotting is one of the most well-studied problems for TinyML systems, given its real-time, confidential, near-sensor computation requirements.<br />
On the other hand, the more complex acoustic counterpart of keyword spotting, namely ASR (i.e., the task of converting spoken speech into written words), still requires fundamental improvements to achieve feasibility in edge computing settings.<br />
<br />
This project therefore proposes to adapt the state-of-the-art Waveformer[[#ref-Scherer2024|&#91;5&#93;]] architecture, depicted in Figure 1, to the requirements and complexity of ASR.<br />
The second goal addresses the deployment of the proposed architecture on novel parallel ultra-low-power platforms. <br />
While the original architecture has already proved to be feasible for single-core embedded systems, our goal is to achieve real-time ASR on ultra-low-power devices.<br />
To this aim, you will employ and extend our in-house, open-source, state-of-the-art Deeploy deployment tool, generating C-code to achieve on-board inference for the quantized version of your proposed neural network.<br />
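Quantlib and Deeploy handle quantization and code generation automatically; purely as an illustration of the underlying idea (not of those tools' APIs, and with made-up weight values), symmetric per-tensor int8 quantization maps each weight to an integer through a single scale factor:<br />

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation: w_q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.52, -1.27, 0.03, 0.9]          # illustrative float weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# Rounding error is bounded by half a quantisation step (scale / 2).
```

Storing int8 values instead of float32 cuts the model's memory footprint by roughly 4x and enables the integer arithmetic that ultra-low-power cores execute efficiently.<br />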
<br />
<br />
== Character ==<br />
<br />
* 10% literature research<br />
* 30% deep learning<br />
* 50% on-device implementation<br />
* 10% evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Must be familiar with Python, in the context of deep learning and quantization<br />
* Must be familiar with C, for layer implementation and model deployment<br />
<br />
<br />
= Project Goals =<br />
<br />
The main tasks of this project are:<br />
<br />
<ul><br />
<li><p>'''Task 1: Familiarize yourself with the project specifics (1-2 Weeks)'''</p><br />
<p> Learn about DNN training with PyTorch and how to visualize results with TensorBoard. </p><br />
<p> Read up on linear attention and the recent advances in automated speech recognition. </p><br />
<p> Read up on quantization and deployment to ultra-low power platforms. </p></li><br />
<br />
<li><p>'''Task 2 - Develop a Waveformer-based ASR model (4-6 weeks)'''</p><br />
<p>Create a PyTorch framework for ASR, including Datasets and Dataloaders, as well as integrating the training and evaluation mechanisms. </p><br />
<p>Evaluate Waveformer on ASR task(s). </p><br />
<p>Expand the existing architecture considering an accuracy-complexity trade-off. </p><br />
<br />
<li><p>'''Task 3 - Deploy ASR-Waveformer on PULP platforms (4-6 weeks)'''</p><br />
<p> Automate the quantization of Waveformer using Quantlib[[#ref-Spallanzani2022|&#91;6&#93;]], followed by integrating ASR-Waveformer. </p><br />
<p>Expand Deeploy to enable PULP-based deployment of Waveformer. </p><br />
<p> Deploy ASR-Waveformer and evaluate the on-device inference considering hardware-associated metrics (e.g., latency, memory, storage, and energy). </p><br />
<br />
<li><p>'''Task 4 - Gather and Present Final Results (2-3 Weeks)'''</p><br />
<p>Gather final results.</p><br />
<p>Prepare presentation (15/20 min. + 5 min. discussion).</p><br />
<p>Write a final report. Include all major decisions taken during the design process and argue your choice. Include everything that deviates from the very standard case - show off everything that took time to figure out and all your ideas that have influenced the project.</p></li></ul><br />
<br />
= Project Organization =<br />
<br />
== Weekly Meetings ==<br />
<br />
The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, clear up any questions from either side, and ensure the student’s progress.<br />
<br />
== Report ==<br />
<br />
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word processing software is allowed for writing the reports, nevertheless, the use of LaTeX with Tgif (See: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.<br />
<br />
==== Final Report ====<br />
<br />
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.<br />
<br />
== Presentation ==<br />
<br />
At the end of the project, the outcome of the thesis will be presented in a 15-minute/20-minute talk and 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.<br />
<br />
= References =<br />
<br />
<div id="refs" class="references csl-bib-body"><br />
<div id="ref-Choromanski2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;1&#93; </span><span class="csl-right-inline">Choromanski, Krzysztof Marcin and Likhosherstov, Valerii and Dohan, David and Song, Xingyou and Gane, Andreea and Sarlos, Tamas and Hawkins, Peter and Davis, Jared Quincy and Mohiuddin, Afroz and Kaiser, Lukasz and Belanger, David Benjamin and Colwell, Lucy J. and Weller, Adrian <span><span class="nocase">Rethinking Attention with Performers</span> </span> 2022.</span><br />
</div><br />
<br />
<div id="ref-Gazneli2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;2&#93; </span><span class="csl-right-inline">Gazneli, Avi and Zimerman, Gadi and Ridnik, Tal and Sharir, Gilad and Noy, Asaf <span><span class="nocase"> Boosting Augmentations Towards An Efficient Audio Classification Network</span>. </span> 2022.</span><br />
</div><br />
<br />
<div id="ref-Katharopoulos2020" class="csl-entry"><br />
<span class="csl-left-margin">&#91;3&#93; </span><span class="csl-right-inline">Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, François <span>Transformers are RNNs: fast autoregressive transformers with linear attention </span>2020. </span><br />
</div><br />
<br />
<div id="ref-Kitaev2020" class="csl-entry"><br />
<span class="csl-left-margin">&#91;4&#93; </span><span class="csl-right-inline">Kitaev, Nikita and Kaiser, Lukasz and Levskaya, Anselm <span> Reformer: The Efficient Transformer. Technical Report </span>2020.</span><br />
</div><br />
<br />
<div id="ref-Scherer2024" class="csl-entry"><br />
<span class="csl-left-margin">&#91;5&#93; </span><span class="csl-right-inline">Scherer, Moritz and Cioflan, Cristian and Magno, Michele and Benini, Luca <span>“<span class="nocase">Work In Progress: Linear Transformers for TinyML </span> 2024</span></span><br />
</div><br />
<br />
<div id="ref-Spallanzani2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;6&#93; </span><span class="csl-right-inline">Spallanzani, Matteo and Rutishauser, Georg and Scherer, Moritz and Burrello, Alessio and Conti, Francesco and Benini, Luca <span><span class="nocase">QuantLab: a Modular Framework for Training and Deploying Mixed-Precision NNs </span></span>2022</span><br />
</div><br />
<br />
<div id="ref-Vaswani2017" class="csl-entry"><br />
<span class="csl-left-margin">&#91;7&#93; </span><span class="csl-right-inline">Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia <span><span class="nocase">Attention is all you need </span></span>2017</span><br />
</div><br />
<br />
<div id="ref-Zhao2023" class="csl-entry"><br />
<span class="csl-left-margin">&#91;8&#93; </span><span class="csl-right-inline">Zhao, Shengkui and Ma, Bin <span><span class="nocase">MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions </span></span>2023</span><br />
</div><br />
<br />
<div id="ref-Zhang2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;9&#93; </span><span class="csl-right-inline">Zhang, Yu and Qin, James and Park, Daniel S. and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V. and Wu, Yonghui <span><span class="nocase">Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition </span></span>2022</span><br />
</div><br />
<br />
</div></div>Cioflanchttp://iis-projects.ee.ethz.ch/index.php?title=A_RISC-V_ISA_Extension_for_Scalar_Chaining_in_Snitch_(M)A RISC-V ISA Extension for Scalar Chaining in Snitch (M)2024-02-22T19:28:11Z<p>Colluca: </p>
<hr />
<div><!-- A RISC-V ISA Extension for Scalar Chaining in Snitch (1M) --><br />
<br />
[[Category:Digital]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:2024]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Colluca]]<br />
[[Category:In progress]]<br />
<br />
= Overview =<br />
<br />
== Status: In progress ==<br />
<br />
* Type: Semester thesis <br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Colluca | Luca Colagrande]]: [mailto:colluca@iis.ee.ethz.ch colluca@iis.ee.ethz.ch]<br />
** [[:User:Paulsc | Paul Scheffler]]: [mailto:paulsc@iis.ee.ethz.ch paulsc@iis.ee.ethz.ch]<br />
* Student: Jayanth Jonnalagadda [mailto:jjonnalagadd@student.ethz.ch jjonnalagadd@student.ethz.ch]<br />
<br />
= Introduction =<br />
<br />
[[File:snitch_block_diagram.png|thumb|Figure 1: A block diagram of the Snitch cluster architecture]]<br />
<br />
Suppose you have an expression of the form: A += k * (B + C), where A, B and C are vectors of arbitrary length and k is a constant. In the following we will assume, for simplicity, that A, B and C are mapped to stream semantic registers (SSRs) [3], an ISA extension of the RISC-V Snitch core [1, 2], developed in our group. This expression would translate to the following sequence of instructions:<br />
<br />
<syntaxhighlight lang="asm"><br />
A0: FADD d, b, c<br />
A1: FMADD a, k, d, a<br />
A2: FADD d, b, c<br />
A3: FMADD a, k, d, a<br />
A4: FADD d, b, c<br />
A5: FMADD a, k, d, a<br />
A6: FADD d, b, c<br />
A7: FMADD a, k, d, a<br />
</syntaxhighlight><br />
<br />
In processors targeting operating frequencies above 1 GHz, the FP ALU is typically pipelined, so an instruction may take multiple cycles to execute. In our single-issue in-order RISC-V Snitch core, both instructions above take 4 cycles to complete and cannot overlap, due to the RAW dependency on the intermediate result D. A loop over the elements of the vector would incur a stall of 4 cycles on every iteration, significantly reducing the instructions-per-cycle (IPC) and the efficiency of the computation.<br />
<br />
A well-known technique to optimize the above code is loop unrolling. By unrolling the loop and interleaving independent instructions, we can hide the RAW dependency stalls with useful instructions:<br />
<br />
<syntaxhighlight lang="asm"><br />
B0: FADD d1, b, c<br />
B1: FADD d2, b, c<br />
B2: FADD d3, b, c<br />
B3: FADD d4, b, c<br />
B4: FMADD a, k, d1, a<br />
B5: FMADD a, k, d2, a<br />
B6: FMADD a, k, d3, a<br />
B7: FMADD a, k, d4, a<br />
</syntaxhighlight><br />
<br />
This optimization comes at the cost of increased register pressure (we need 8 registers instead of 5), which might require spilling registers to the stack, in turn decreasing the efficiency of the computation.<br />
<br />
On a vector machine (with a large RF), the intermediate result D would be stored in one vector register, whereas here we are essentially building a vector register out of scalar architectural registers, which are a scarce resource. The vector architecture trades off area (and energy efficiency) for the performance improvement by featuring a comparably large register file. An out-of-order processor could execute the original version of the program and still achieve the performance of the unrolled version, thanks to register renaming; it trades off energy and area to perform the translation from architectural registers to physical registers.<br />
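The IPC gap between the two schedules can be estimated with a small in-order pipeline model. The sketch below is a simplification, not a model of the actual Snitch microarchitecture: it assumes a single-issue core with a 4-cycle FP latency, and it models only the d dependencies, assuming the accumulation into a can proceed back-to-back (e.g., through an accumulation bypass), as implied by the unrolled example above.

```python
LATENCY = 4  # assumed FP pipeline depth, in cycles

def schedule(program):
    """Issue-cycle count for a single-issue, in-order instruction stream.
    Each entry is (dest, [sources]); a result becomes readable LATENCY
    cycles after its producer issues, and a consumer stalls until then."""
    ready = {}  # register -> cycle at which its value is available
    cycle = 0
    for dest, srcs in program:
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + LATENCY
        cycle = issue + 1
    return cycle

# A += k * (B + C) over 4 elements: naive ordering vs. 4x unrolled
naive = [("d", ["b", "c"]), ("a", ["k", "d"])] * 4
unrolled = ([(f"d{i}", ["b", "c"]) for i in range(4)]
            + [("a", ["k", f"d{i}"]) for i in range(4)])
```

Under these assumptions, the naive schedule needs 20 cycles for 8 instructions (IPC 0.4), while the unrolled schedule needs 8 (IPC 1.0): each FADD's 4-cycle latency is hidden behind the three independent FADDs issued after it.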
<br />
= Project description =<br />
<br />
With this project, we want to enable the execution of computations similar to the one in the example above, without increasing the [https://en.wikipedia.org/wiki/Instruction_set_architecture#REGISTER-PRESSURE register pressure], while still not incurring any stall on an in-order processor. Your efforts will focus on the state-of-the-art RISC-V Snitch core developed in our group. Snitch is a pseudo dual-issue in-order processor targeting energy-efficient floating-point computations. Snitch-based accelerators, as can be found in Occamy [4], aim to rival the energy efficiency and performance of GPU accelerators. Therefore, preserving the energy efficiency of the core is key.<br />
<br />
Your extension will leverage the following observations. To store the intermediate results (d1, d2, d3, d4) required to hide the latency of dependent instructions, the FP ALU's pipeline registers can be employed as physical registers in place of the four architectural registers. We could then rewrite the code as follows, to ensure that only one architectural register is used:<br />
<br />
<syntaxhighlight lang="asm"><br />
C0: FADD d, b, c<br />
C1: FADD d, b, c<br />
C2: FADD d, b, c<br />
C3: FADD d, b, c<br />
C4: FMADD a, k, d, a<br />
C5: FMADD a, k, d, a<br />
C6: FMADD a, k, d, a<br />
C7: FMADD a, k, d, a<br />
</syntaxhighlight><br />
<br />
The remaining problem with this code is that instruction C1 would overwrite the result of C0, and so on. C4 would therefore read value d4 in register d, instead of value d1 as required, because the writeback of C0 is expected to occur before that of C1. However, if we allow the register file to exert backpressure on the adder, so that the intermediate result d2 cannot overwrite d1 until d1 is consumed from d (effectively postponing the writeback of C1 until after the operand read of C4), the semantics of the original program are preserved. Of course, this requires altering the semantics of the individual instructions, and some state register would have to be programmed in advance to communicate to the core that certain architectural registers have to adopt the altered semantics.<br />
<br />
These altered semantics are similar to the SSR semantics w.r.t. the following properties:<br />
* the value which is stored in an architectural register can only be consumed once<br />
* when a value is consumed, it releases pressure on the upstream unit<br />
<br />
In this sense, the architecture of the core becomes closer to a dataflow architecture [5], with the difference that the RF is always the medium between any two functional units and operand consumption is still explicitly triggered by issuing standard RISC-V instructions.<br />
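The single-consumption, backpressured behavior described above can be sketched as a toy software model. This is an illustrative sketch only: the class name is invented, and the buffer depth of 4 is an assumption standing in for the FP ALU's pipeline registers.

```python
from collections import deque

class ChainedReg:
    """Toy model of the altered register semantics: results queue up in the
    FP ALU's pipeline registers (a bounded buffer of assumed depth 4), each
    value is consumable exactly once, and a write to a full buffer must
    stall the producer until a read releases a slot (backpressure)."""
    def __init__(self, depth=4):
        self.buf = deque()
        self.depth = depth

    def write(self, value):
        if len(self.buf) >= self.depth:
            return False           # producer stalls: writeback postponed
        self.buf.append(value)
        return True

    def read(self):
        return self.buf.popleft()  # single consumption releases backpressure

d = ChainedReg(depth=4)
for i in range(4):                 # C0-C3: four FADD results target d
    assert d.write(f"d{i+1}")
assert not d.write("d5")           # a fifth FADD would have to stall
assert d.read() == "d1"            # C4 reads d1, in program order
assert d.write("d5")               # the stalled writeback can now proceed
```

Reads always return values in production order, so program C0-C7 observes exactly the semantics of the original unrolled version while naming only one architectural register.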
<br />
== Detailed task description ==<br />
<br />
To break it down in more detail, you will:<br />
<br />
* '''Literature review and related work research'''<br />
** Understand the architecture of the Snitch processor<br />
** Understand the concept of stream semantic registers (SSRs)<br />
** Review dataflow architecture concepts<br />
** Research related works<br />
* '''Develop the scalar chaining extension in Snitch'''<br />
** Develop the RTL code to implement your extension <br />
** Develop synthetic test programs to verify your implementation<br />
* '''Evaluate your extension'''<br />
** Synthesize the Snitch core complex and cluster to evaluate PPA impact of your extensions<br />
** Develop a set of micro-kernels to evaluate the performance improvements thanks to your extension<br />
* '''Future work'''<br />
** Critically review the limitations of your extension<br />
** What could be done in the future to improve your work?<br />
<br />
== Stretch goals ==<br />
<br />
Additional optional stretch goals may include:<br />
* Dataflow operation without the need to explicitly trigger operand consumption through instruction issuing<br />
* Implement some of the proposed future work items you may come up with<br />
<br />
== Character ==<br />
<br />
* 15% Literature/architecture review<br />
* 40% RTL design and verification<br />
* 35% Baremetal software development<br />
* 10% Physical design evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Strong interest in computer architecture<br />
* Experience with digital design in SystemVerilog as taught in VLSI I<br />
* Preferred: Experience in bare-metal or embedded C programming<br />
* Preferred: Experience with ASIC implementation flow as taught in VLSI II<br />
<br />
= References =<br />
<br />
[1] [https://ieeexplore.ieee.org/document/9216552 Snitch paper] <br /><br />
[2] [https://github.com/pulp-platform/snitch_cluster Snitch Github repository] <br /><br />
[3] [https://arxiv.org/abs/1911.08356 Stream Semantic Registers (SSRs)] <br /><br />
[4] [https://pulp-platform.org/occamy/ Occamy] <br /><br />
[5] [https://en.wikipedia.org/wiki/Dataflow_architecture Dataflow architecture] <br /></div>Collucahttp://iis-projects.ee.ethz.ch/index.php?title=Neural_Recording_Interface_and_Spike_Sorting_AlgorithmNeural Recording Interface and Spike Sorting Algorithm2024-02-21T08:22:28Z<p>Yiychen: </p>
<hr />
<div>[[File:SpikeSortingYY.jpg|thumb|600px]]<br />
== Description ==<br />
Over the last decades, the growing burden that neurological disorders place on patients and societies has attracted increasing attention, calling for a better understanding of the brain and the nervous system. Since the electrophysiological signals generated by neurons are typically massive in number, large-scale, high-accuracy, high-density, and robust recording devices are needed. Directly interfacing with the brain to record its electrical signals, modulate its activity, and even stimulate certain events is the main approach in neuroscience research. In the mid-1990s, the first neuro-prosthetic device was implanted in a human, led by a research group from the University of Michigan. Since then, cell-signal recording using penetrating electrodes has grown steadily, overcoming a series of fundamental challenges such as difficulty of stimulation, non-ideal overall performance, low-quality data, and a lack of data-analysis methods. <br />
<br />
To study the network activity of neurons and tackle these challenges, researchers use various methods, including imaging techniques such as calcium imaging, intracellular recording methods such as patch clamp, and extracellular recording techniques such as electrical imaging. By recording the optical or electrical signals generated by neurons, such as action potentials (APs) and local-field potentials (LFPs), single-neuron behavior and neural signaling in neuronal networks can be studied. Among all these approaches, the microelectrode array (MEA) has been one of the most efficient means of acquiring neural signals from a large number of neurons in terms of the number of recording sites, temporal resolution, spatial resolution, and signal-to-noise ratio. <br />
<br />
At ETH Zurich, we invented a new neural-signal detection scheme that can detect neurons firing action-potential (AP) spikes (https://ieeexplore.ieee.org/document/10185425). We are currently designing the next-generation ultra-low-power neural recording interface, and we invite students to join us in designing the low-power on-chip spike-sorting algorithm. Spike sorting is the grouping of spikes into clusters based on the similarity of their shapes. The first figure summarizes the basic steps of a spike-sorting algorithm. <br />
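The basic spike-sorting steps summarized in the figure (detection, waveform extraction, feature extraction, clustering) can be prototyped in a few lines of NumPy. The sketch below runs on synthetic data; the threshold rule, window lengths, and the crude two-way split are illustrative assumptions, not the on-chip algorithm to be designed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1 s trace at 20 kHz with two spike shapes (illustrative values)
fs, n = 20_000, 20_000
trace = rng.normal(0, 0.1, n)
shapes = [3.0 * np.hanning(20), -2.0 * np.hanning(20)]
positions = rng.choice(np.arange(100, n - 100, 200), 40, replace=False)
for i, p in enumerate(positions):
    trace[p:p + 20] += shapes[i % 2]

# 1. Detection: robust noise estimate and amplitude threshold
thr = 5 * np.median(np.abs(trace)) / 0.6745
above = np.flatnonzero(np.abs(trace) > thr)
peaks = above[np.insert(np.diff(above) > 20, 0, True)]  # first crossing per spike

# 2. Extraction: aligned waveform snippets around each detection
snippets = np.stack([trace[p - 5:p + 15] for p in peaks])

# 3. Feature extraction: project snippets onto the first two PCs
X = snippets - snippets.mean(axis=0)
pcs = X @ np.linalg.svd(X, full_matrices=False)[2][:2].T

# 4. Clustering: crude two-way split on the first PC (stand-in for k-means)
labels = (pcs[:, 0] > np.median(pcs[:, 0])).astype(int)
```

Each stage maps to a hardware block in the figure; the project's challenge is to realize these steps within the power and area budget of an implantable interface.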
<br />
[[File:Spikes.png|thumb|600px]]<br />
In this project, the student will<br />
<br />
1. Learn about neural recording and the state-of-the-art work in the field<br />
<br />
2. Get familiar with and implement a neural spike-sorting algorithm on FPGA and on-chip<br />
<br />
3. Explore new digital circuit structures for neural recording processing<br />
<br />
<br />
== Status: Available ==<br />
<br />
We are looking for a master's student interested in doing a semester or thesis project.<br />
If you are interested in this position on an exciting and challenging topic, please send your most recent curriculum vitae, including a transcript of grades, by email to:<br />
<br />
Yiyang Chen <[mailto:yiychen@iis.ee.ethz.ch yiychen@iis.ee.ethz.ch]> <br />
<br />
== Prerequisites ==<br />
<br />
The ideal candidate should have a multi-disciplinary background, strong mathematical aptitude, and programming skills (MATLAB or Python). <br />
<br />
===Character===<br />
* 20% Literature review<br />
* 20% Theory<br />
* 60% Programming<br />
<br />
===Professor===<br />
Prof. Taekwang Jang <[mailto:tjang@ethz.ch tjang@ethz.ch]><br />
<br />
[[#top|↑ top]]<br />
[[Category:EECIS]]<br />
[[Category:Available]]<br />
[[Category:2023]]<br />
[[Category:Hot]]<br />
<br />
===Practical Details===<br />
* '''[[Project Plan]]'''<br />
* '''[[Project Meetings]]'''<br />
* '''[[Final Report]]'''<br />
* '''[[Final Presentation]]'''</div>Yiychenhttp://iis-projects.ee.ethz.ch/index.php?title=Extending_our_FPU_with_Internal_High-Precision_Accumulation_(M)Extending our FPU with Internal High-Precision Accumulation (M)2024-02-15T13:54:07Z<p>Lbertaccini: Created page with "<!-- Extending our FPU with Internal High-Precision Accumulation (M) --> Category:Digital Category:Acceleration_and_Transprecision Category:High Performance SoCs..."</p>
<hr />
<div><!-- Extending our FPU with Internal High-Precision Accumulation (M) --><br />
<br />
[[Category:Digital]]<br />
[[Category:Acceleration_and_Transprecision]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:Computer Architecture]]<br />
[[Category:2024]]<br />
[[Category:Master Thesis]]<br />
[[Category:Available]]<br />
[[Category:Lbertaccini]]<br />
<br />
= Overview =<br />
<br />
== Status: Available==<br />
<br />
* Type: Master Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Lbertaccini | Luca Bertaccini]]: [mailto:lbertaccini@iis.ee.ethz.ch lbertaccini@iis.ee.ethz.ch]<br />
<br />
= Introduction =<br />
[[File:Fpu_block_diagram.png|thumb|300px|FPnew block diagram [1]. Each operation group block can be instantiated through a parameter. In the figure, the FPU was instantiated without a DivSqrt module.]]<br />
<br />
<br />
Low-precision floating-point (FP) formats are gaining more and more traction in the context of neural network (NN) training. Employing low-precision formats, such as 8-bit FP data types, reduces the model's memory footprint and opens new opportunities to increase the system's energy efficiency. For these reasons, many commercial platforms already provide support for 8-bit FP data types. These formats provide only a few mantissa bits and are, therefore, not suited for accumulation. They are instead used in mixed-precision operations, where the accumulation is performed in higher precision, e.g., using FP16 or FP32.<br />
<br />
The FP unit (FPU) developed at IIS [1], [2] already provides hardware support for low-precision FP formats (down to 8 bits). The goal of this project is to add support for internal high-precision accumulation in the FPU. In this way, the accumulated value does not have to be written to and read from the FP register file at every accumulation, thus saving energy. At the same time, this decouples the accumulator size from the register-file entry size: the internal accumulators can have a custom width, potentially even larger than what is offered by one register-file entry.<br />
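The benefit of a wide internal accumulator can be illustrated numerically. In the sketch below (NumPy, with illustrative values), accumulating many small FP16 addends in FP16 itself stalls once the running sum grows large enough that each addend falls below half an ULP, while accumulating internally in FP32 and rounding once at the end recovers the expected result.

```python
import numpy as np

vals = np.full(4096, 1e-4, dtype=np.float16)  # many small addends

# Accumulate directly in the storage format (FP16): once the running sum
# reaches ~0.25, each addend falls below half an ULP and is rounded away.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in a wider internal format (FP32) and round once at the end,
# mimicking an FPU-internal high-precision accumulator.
acc32 = np.float32(0.0)
for v in vals:
    acc32 = np.float32(acc32 + np.float32(v))
result = np.float16(acc32)

exact = 4096 * 1e-4  # = 0.4096
```

The FP16-only accumulation saturates far below the exact sum, while the wide-accumulator result matches it to within one FP16 rounding step; this is precisely the error the proposed hardware extension avoids without round-tripping through the register file.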
<br />
== Character ==<br />
<br />
* 20% Literature / architecture review<br />
* 40% RTL implementation<br />
* 40% Evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Strong interest in computer architecture<br />
* Experience with digital design in SystemVerilog as taught in VLSI I<br />
* Experience with ASIC implementation flow (synthesis) as taught in VLSI II<br />
<br />
= References =<br />
<br />
[1] https://arxiv.org/abs/2207.03192 MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores<br />
<br />
[2] https://github.com/pulp-platform/cvfpu</div>Lbertaccinihttp://iis-projects.ee.ethz.ch/index.php?title=Extreme-Edge_Experience_Replay_for_Keyword_SpottingExtreme-Edge Experience Replay for Keyword Spotting2024-02-15T11:56:26Z<p>Cioflanc: Created page with "<!-- Extreme-Edge Experience Replay for Keyword Spotting (1B/1S/1M) --> = Overview = == Status: Available == * Type: Semester Thesis * Professor: Prof. Dr. L. Benini * Supe..."</p>
<hr />
<div><!-- Extreme-Edge Experience Replay for Keyword Spotting (1B/1S/1M) --><br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cioflanc| Cristian Cioflan]] (IIS): [mailto:cioflanc@iis.ee.ethz.ch cioflanc@iis.ee.ethz.ch]<br />
** [[Dr. Miguel de Prado]] (Verses)<br />
<br />
<br />
<!-- TODO: ADD APPROPRIATE CATEGORIES HERE --><br />
<br />
<br />
[[Category:2024]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Hot]]<br />
[[Category:Deep Learning Projects]]<br />
[[Category:Available]]<br />
[[Category:Digital]]<br />
[[Category:Cioflanc]]<br />
<br />
= Introduction =<br />
<br />
In an ever-changing world, deploying neural networks at the edge with the sole purpose of solving predefined, offline-learned tasks becomes obsolete. Oftentimes, models experience on-site domain shifts (e.g., keyword spotting systems pretrained for warehouses and utilized in construction sites)[[#ref-cioflan2022|&#91;1&#93;]], or new functionalities (i.e., new classes)[[#ref-Hemati2023|&#91;2&#93;]] are added directly on the target device. In such contexts, on-device continual learning -- both domain-incremental and class-incremental learning -- enables the deployed neural network to remain up-to-date. Nevertheless, while learning new tasks, the model must not forget the previously learned ones, thus avoiding the so-called catastrophic forgetting effect[[#ref-Maltoni2017|&#91;3&#93;]].<br />
<br />
Rehearsal-based methods[[#ref-Rolnick2019|&#91;4&#93;]][[#ref-Zhuo2023|&#91;6&#93;]] avoid or reduce catastrophic forgetting by maintaining a subset of already-seen samples in a memory buffer, also called a reservoir. During the adaptation stage, this subset is replayed together with the newly available samples to jointly train the model, thus preventing overfitting on the new domain/class and maintaining generalization. The TinyML constraints associated with low-power, always-on keyword spotting (i.e., memory, storage, latency)[[#ref-reddi2020|&#91;5&#93;]] limit the size of the memory buffer, so the number of stored samples is drastically reduced. Moreover, since each sample is a candidate for the replay buffer, the selection method running alongside inference must be lightweight and must not incur significant overhead for a real-time application.<br />
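One standard lightweight baseline for the buffer-update policy is reservoir sampling, which maintains a uniform random subset of the sample stream with constant work per incoming sample. The sketch below is illustrative: the class name and capacity are assumptions, and the project's own selection method would replace the uniform policy.

```python
import random

class ReplayBuffer:
    """Reservoir sampling (illustrative sketch): maintains a uniform random
    subset of the sample stream with O(1) work and a single RNG draw per
    sample -- a lightweight baseline for the on-device update policy."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def maybe_add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = random.randrange(self.seen)  # keep with prob. capacity/seen
            if j < self.capacity:
                self.samples[j] = sample

random.seed(0)
buf = ReplayBuffer(capacity=32)
for x in range(1000):
    buf.maybe_add(x)
```

Because the per-sample cost is a single random draw and at most one buffer write, such a policy fits the real-time constraint; the open question addressed by the project is whether a smarter, still-lightweight selection beats the uniform baseline in accuracy.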
<br />
The objective of this project is to propose an energy-efficient, real-time rehearsal-based method for keyword spotting, in the context of class-, task-, and domain-incremental learning.<br />
<br />
<br />
== Character ==<br />
<br />
* 20% literature research<br />
* 70% architectural implementation and optimizations<br />
* 10% evaluation<br />
<br />
== Prerequisites ==<br />
<br />
* Must be familiar with Python.<br />
* Knowledge of deep learning basics, including some deep learning frameworks like PyTorch or TensorFlow from a course, project, or self-taught with some tutorials.<br />
<br />
= Project Goals =<br />
<br />
The main tasks of this project are:<br />
<br />
<ul><br />
<li><p>'''Task 1: Familiarize yourself with the project specifics (1-2 Weeks)'''</p><br />
<p> Learn about DNN training and PyTorch, how to visualize results with TensorBoard. Read up on class-, task-, and domain-incremental learning, as well as rehearsal-based methods. Read up on DNN models aimed at time series (e.g., DS-CNNs, TCNs, Transformer and Conformer networks) and the recent advances in keyword spotting. </p></li><br />
<br />
<li><p>'''Task 2: Propose evaluation methodology and evaluate related work (4-6 weeks)'''</p><br />
<p>Propose an evaluation methodology (e.g.,[[#ref-Maltoni2017|&#91;3&#93;]]) for audio-based tasks in the context of continual learning. </p><br />
<p>Considering publicly-available methods and state-of-the-art implementation, evaluate experience replay techniques for keyword spotting. </p><br />
<p>Expand the accuracy evaluation with respect to the TinyML constraints. </p><br />
<br />
<li><p>'''Task 3: Propose novel selection method (4-6 weeks)'''</p><br />
<p>Propose and implement lightweight memory buffer update technique and analyse it considering previously defined evaluation methodology. </p><br />
<p>Evaluate the proposed methodology considering different neural network topologies and model sizes. </p><br />
<br />
<li><p>'''<b>(Only if conducted as Master's thesis)</b> Task 4: Hardware-in-the-loop evaluation (6-8 weeks weeks)'''</p><br />
<p>Familiarize yourself with GAP9 architecture and deployment tools. </p><br />
<p>Implement the proposed selection method(s) and network update scheme. </p><br />
<p>On-device evaluation of learning costs. </p><br />
<br />
<li><p>'''Task 5 - Gather and Present Final Results (2-3 Weeks)'''</p><br />
<p>Gather final results.</p><br />
<p>Prepare presentation (15/20 min. + 5 min. discussion).</p><br />
<p>Write a final report. Include all major decisions taken during the design process and argue your choice. Include everything that deviates from the very standard case - show off everything that took time to figure out and all your ideas that have influenced the project.</p></li></ul><br />
<br />
= Project Organization =<br />
<br />
== Weekly Meetings ==<br />
<br />
The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, clear up any questions from either side, and ensure the student’s progress.<br />
<br />
== Report ==<br />
<br />
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word processing software is allowed for writing the reports, nevertheless, the use of LaTeX with Tgif (See: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.<br />
<br />
==== Final Report ====<br />
<br />
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.<br />
<br />
== Presentation ==<br />
<br />
At the end of the project, the outcome of the thesis will be presented in a 15-minute/20-minute talk and 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.<br />
<br />
= References =<br />
<br />
<div id="refs" class="references csl-bib-body"><br />
<div id="ref-cioflan2022" class="csl-entry"><br />
<span class="csl-left-margin">&#91;1&#93; </span><span class="csl-right-inline">Cristian Cioflan, Lukas Cavigelli, Manuele Rusci, Miguel De Prado, Luca Benini <span><span class="nocase">Towards On-device Domain Adaptation for Noise-Robust Keyword Spotting</span> </span> 2022.</span><br />
</div><br />
<div id="ref-Hemati2023" class="csl-entry"><br />
<span class="csl-left-margin">&#91;2&#93; </span><span class="csl-right-inline">Hamed Hemati and Andrea Cossu and Antonio Carta and Julio Hurtado and Lorenzo Pellegrini and Davide Bacciu and Vincenzo Lomonaco and Damian Borth <span><span class="nocase">Class-Incremental Learning with Repetition</span>. </span> 2023.</span><br />
</div><br />
<div id="ref-Maltoni2017" class="csl-entry"><br />
<span class="csl-left-margin">&#91;3&#93; </span><span class="csl-right-inline">Vincenzo Lomonaco and Davide Maltoni <span>CORe50: a New Dataset and Benchmark for Continuous Object Recognition </span>2017. </span><br />
</div><br />
<div id="ref-Rolnick2019" class="csl-entry"><br />
<span class="csl-left-margin">&#91;4&#93; </span><span class="csl-right-inline">David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, Gregory Wayne, <span> Experience Replay for Continual Learning </span>2019.</span><br />
</div><br />
<div id="ref-reddi2020" class="csl-entry"><br />
<span class="csl-left-margin">&#91;5&#93; </span><span class="csl-right-inline">Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunathm Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. <span>“<span class="nocase">MLperf inference benchmark. </span> 2020</span></span><br />
</div><br />
<div id="ref-Zhuo2023" class="csl-entry"><br />
<span class="csl-left-margin">&#91;6&#93; </span><span class="csl-right-inline">Tao Zhuo and Zhiyong Cheng and Zan Gao and Mohan Kankanhalli <span><span class="nocase">Continual Learning with Strong Experience Replay </span></span>2023</span><br />
</div><br />
<br />
</div></div>Cioflanchttp://iis-projects.ee.ethz.ch/index.php?title=Building_an_RTL_top_level_for_a_Mempool-based_Heterogeneous_SoC_(M/1-3S)Building an RTL top level for a Mempool-based Heterogeneous SoC (M/1-3S)2024-02-12T10:43:39Z<p>Cykoenig: </p>
<hr />
<div><!-- Building an RTL top level for a Mempool-based Heterogeneous SoC (M/1-3S) --><br />
<br />
[[Category:Digital]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:Heterogeneous Acceleration Systems]]<br />
[[Category:Computer Architecture]]<br />
[[Category:2024]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Cykoenig]]<br />
[[Category:Smazzola]]<br />
[[Category:Mbertuletti]]<br />
[[Category:Reserved]]<br />
<br />
= Overview =<br />
<br />
== Status: Reserved ==<br />
<br />
* Type: Computer Architecture Master / Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cykoenig | Cyril Koenig]]: [mailto:cykoenig@iis.ee.ethz.ch cykoenig@iis.ee.ethz.ch]<br />
** [[:User:Smazzola | Sergio Mazzola]]: [mailto:smazzola@iis.ee.ethz.ch smazzola@iis.ee.ethz.ch] <br />
** [[:User:Mbertuletti | Marco Bertuletti]]: [mailto:mbertuletti@iis.ee.ethz.ch mbertuletti@iis.ee.ethz.ch] <br />
<br />
= Introduction =<br />
<br />
MemPool [1] is an example of the massively parallel SoCs built at IIS. It integrates 256 Snitch cores and 1 MiB of shared L1 memory. Despite its size, MemPool gives all cores low-latency access to the shared L1 memory, with a maximum latency of only five cycles when no contention occurs. This enables efficient communication among all cores, making MemPool suitable for various workload domains and easy to program. <br />
<br />
Today, MemPool is a standalone cluster of accelerators with distributed memory, but it is meant to be programmed by and for a host subsystem.<br />
<br />
Cheshire [2] is an open-source SoC from our group that features a 64-bit RISC-V core and various peripherals such as UART, SPI, I2C, and VGA. It is intended as a pluggable host system that can be reused in heterogeneous SoCs.<br />
<br />
The goal of this work is to build an RTL top level for a future SoC combining a Cheshire host subsystem with a MemPool accelerator subsystem.<br />
<br />
= Project =<br />
<br />
This work will go through several of the steps required when proposing a new SoC. After an initial architectural proposal, the student will build the top level of the future SoC in SystemVerilog and verify the communication between the host and accelerator subsystems.<br />
<br />
Then, the student will adapt the existing FPGA flow of Cheshire to test the Linux boot on this new platform.<br />
<br />
Finally, a Master thesis student will extend this work with one of the following points:<br />
<br />
* Extending the HERO runtime for MemPool and benchmarking OpenMP [3] kernels on this platform<br />
* Adapting previous synthesis and implementation flows to obtain an area estimate of the SoC in GF12<br />
* Integrating a verified RISC-V-compliant IOMMU [4] to simplify shared-memory-based communication between MemPool and Cheshire<br />
<br />
== Character ==<br />
<br />
* 40% Architecture pre-study, RTL top level<br />
* 20% Verification of the memory accesses across the chip<br />
* 40% FPGA implementation and booting Linux<br />
<br />
Master thesis:<br />
<br />
After completing the three points above, an estimated 30% of the thesis time will be dedicated to one of the stretch goals defined in the Project section.<br />
<br />
== Prerequisites ==<br />
<br />
* Good knowledge of computer architectures<br />
* Proficient in SystemVerilog<br />
* Proficient in C<br />
* Willing to learn about Linux and Linux drivers<br />
<br />
= References =<br />
<br />
[1] https://pulp-platform.org/docs/lugano2023/MemPool_05_06_23.pdf<br />
<br />
[2] https://github.com/pulp-platform/cheshire<br />
<br />
[3] https://www.openmp.org/specifications/<br />
<br />
[4] https://github.com/zero-day-labs/riscv-iommu/tree/main</div>Cykoenighttp://iis-projects.ee.ethz.ch/index.php?title=Extending_the_HERO_RISC-V_HPC_stack_to_support_multiple_devices_on_heterogeneous_SoCs_(M/1-3S)Extending the HERO RISC-V HPC stack to support multiple devices on heterogeneous SoCs (M/1-3S)2024-01-25T09:15:46Z<p>Cykoenig: </p>
<hr />
<div><!-- Creating Extending the HERO RISC-V HPC stack to support multiple devices on heterogeneous SoCs (M/1-3S) --><br />
<br />
[[Category:Digital]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:Heterogeneous Acceleration Systems]]<br />
[[Category:Computer Architecture]]<br />
[[Category:2024]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Master Thesis]]<br />
[[Category:Cykoenig]]<br />
[[Category:Available]]<br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Computer Architecture Master / Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cykoenig | Cyril Koenig]]: [mailto:cykoenig@iis.ee.ethz.ch cykoenig@iis.ee.ethz.ch]<br />
<br />
= Introduction =<br />
<br />
[https://www.openmp.org/specifications/ OpenMP] is an Application Programming Interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. OpenMP allows developers to write parallel programs that can run on a wide range of hardware, including multi-core processors and symmetric multiprocessing (SMP) systems. OpenMP uses pragma directives to exploit parallelism in the annotated code regions. These directives are embedded in the source code and guide the compiler in generating parallel executable code. In addition to compiler support, an OpenMP runtime library abstracts the details of thread creation and management from the programmer, simplifying the parallelization process.<br />
<br />
Starting from version 4.0, OpenMP introduced the target directive, which allows offloading computations to accelerators by explicitly marking the code regions to be executed on the accelerator.<br />
<br />
The HERO stack, developed at IIS, provides an implementation of the OpenMP runtime that runs on several of our SoCs with maximal code reuse.<br />
<br />
HERO is today capable of compiling separate applications for different devices, but its OpenMP implementation does not yet support offloading to different devices within the same application.<br />
<br />
= Project =<br />
<br />
The goal of the project is to extend the HERO software stack and toolchain to allow offloading from one host running Linux to multiple devices simultaneously.<br />
<br />
The proposed extension of HERO will be tested on our latest (emulated) heterogeneous platform.<br />
<br />
[[File:Hero_carfield.png|500px|]]<br />
<br />
The project aims to enable the following code:<br />
<br />
<syntaxhighlight lang=C><br />
#include <stdio.h><br />
<br />
int main(int argc, char *argv[]) {<br />
  printf("I am CVA6\n");<br />
  double *a, *b, *c, *d, *e, *f;<br />
<br />
  // Process the data on the safety island<br />
  #pragma omp target device(MEMCPY_SAFETY) map(to : a, b) map(from : c)<br />
  { /* ... */ }<br />
<br />
  // Process the data on the cluster<br />
  #pragma omp target device(MEMCPY_CLUSTER) map(to : d, e) map(from : f)<br />
  { /* ... */ }<br />
<br />
  printf("Back on CVA6\n");<br />
  return 0;<br />
}<br />
</syntaxhighlight><br />
<br />
This will rely on the pre-existing HERO libraries and drivers:<br />
<br />
[[File:Hero_heterogeneous.png|500px|]]<br />
<br />
== Character ==<br />
<br />
* 20% Study the LLVM project and Hero extensions<br />
* 20% Get familiar with the SoC architecture and its FPGA implementation<br />
* 60% Propose and implement the runtime libraries extensions (written in C) to communicate with multiple devices<br />
* (Optional) Consider compiler support for asynchronous and simultaneous offloading<br />
<br />
== Prerequisites ==<br />
<br />
* Good knowledge of computer architectures<br />
* Proficient in C, knowledge of C++<br />
* Willing to learn about Linux and Linux drivers<br />
<br />
= References =<br />
<br />
[https://www.openmp.org/specifications/ OpenMP]<br />
[https://arxiv.org/pdf/1712.06497.pdf Original HERO paper]</div>Cykoenighttp://iis-projects.ee.ethz.ch/index.php?title=Writing_a_Hero_runtime_for_EPAC_(1-3S/B)Writing a Hero runtime for EPAC (1-3S/B)2024-01-25T09:08:19Z<p>Cykoenig: </p>
<hr />
<div><!-- Writing a Hero runtime for EPAC (1-3S/B) --><br />
<br />
[[Category:Digital]]<br />
[[Category:High Performance SoCs]]<br />
[[Category:Heterogeneous Acceleration Systems]]<br />
[[Category:Computer Architecture]]<br />
[[Category:2024]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Cykoenig]]<br />
[[Category:Reserved]]<br />
<br />
= Overview =<br />
<br />
== Status: Reserved ==<br />
<br />
* Type: Computer Architecture Bachelor / Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Cykoenig | Cyril Koenig]]: [mailto:cykoenig@iis.ee.ethz.ch cykoenig@iis.ee.ethz.ch]<br />
<br />
= Introduction =<br />
<br />
EPAC is one of the chips resulting from the European Processor Initiative (EPI) consortium, in which ETH Zurich is involved. The EPAC and EPAC1.5 chips have been successfully taped out in the past years and are available at IIS for testing and software development. This heterogeneous chip contains four RISC-V Avispado cores along with two STX accelerator tiles and one variable-precision floating-point core.<br />
<br />
[[File:Epac_backend.png|500px]]<br />
''Source: [https://www.european-processor-initiative.eu/accelerator/]''<br />
<br />
[https://www.openmp.org/specifications/ OpenMP] is an Application Programming Interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. OpenMP allows developers to write parallel programs that can run on a wide range of hardware, including multi-core processors and symmetric multiprocessing (SMP) systems. OpenMP uses pragma directives to exploit parallelism in the annotated code regions. These directives are embedded in the source code and guide the compiler in generating parallel executable code. In addition to compiler support, an OpenMP runtime library abstracts the details of thread creation and management from the programmer, simplifying the parallelization process.<br />
<br />
Starting from version 4.0, OpenMP introduced the target directive, which allows offloading computations to accelerators by explicitly marking the code regions to be executed on the accelerator.<br />
<br />
<syntaxhighlight lang="c"><br />
#include <stdio.h><br />
#include <omp.h><br />
<br />
int main() {<br />
  printf("Hello from the host\n");<br />
<br />
  #pragma omp target device(1)<br />
  {<br />
    printf("Hello from the accelerator main thread\n");<br />
    #pragma omp parallel<br />
    printf("Hello from the accelerator thread %i\n", omp_get_thread_num());<br />
  }<br />
<br />
  return 0;<br />
}</syntaxhighlight><br />
<br />
<br />
The HERO stack, developed at IIS, provides an implementation of the OpenMP runtime that runs on several of our SoCs with maximal code reuse.<br />
<br />
= Project =<br />
<br />
This project aims to port the HERO stack to the EPAC1.5 chip in order to benchmark multiple OpenMP-based kernels and applications on this state-of-the-art heterogeneous SoC.<br />
<br />
To reach this goal, the student will first familiarize themselves with the EPAC1.5 chip architecture and its interface (the chip is programmable via an FPGA host setup). The student will then need to understand the HERO runtime and port it to EPAC. Finally, they will implement benchmarks or applications using OpenMP that target the accelerators inside the chip.<br />
<br />
== Character ==<br />
<br />
* 20% Get familiar with the SoC architecture and its programming interface<br />
* 20% Study the Hero runtimes<br />
* 40% Propose and implement the runtime plugin (written in C) for EPAC<br />
* 20% Benchmark OpenMP kernels on the STX accelerator<br />
<br />
== Prerequisites ==<br />
<br />
* Good knowledge of computer architectures<br />
* Proficient in C, knowledge of C++<br />
* Willing to learn about Linux and Linux drivers<br />
<br />
<br />
= References =<br />
<br />
[https://www.openmp.org/specifications/ OpenMP]<br />
[https://arxiv.org/pdf/1712.06497.pdf Original HERO paper]<br />
[https://www.european-processor-initiative.eu/accelerator/ EPAC chip]</div>Cykoenighttp://iis-projects.ee.ethz.ch/index.php?title=ErgErg2024-01-15T07:47:48Z<p>Balasr: Balasr moved page Erg to FPGA mapping of RPC DRAM: Broken name</p>
<hr />
<div><!-- FPGA mapping of RPC DRAM (1-2S/B) --><br />
<br />
[[Category:Digital]]<br />
[[Category:Computer Architecture]]<br />
[[Category:SW/HW Predictability and Security]]<br />
[[Category:HW/SW Safety and Security]]<br />
[[Category:Real-Time Embedded Systems]]<br />
[[Category:Computer Architecture]]<br />
[[Category:2024]]<br />
[[Category:Semester Thesis]]<br />
[[Category:Bachelor Thesis]]<br />
[[Category:Aottaviano]]<br />
[[Category:Balasr]]<br />
[[Category:Tbenz]]<br />
[[Category:Nwistoff]]<br />
[[Category:Available]]<br />
<br />
= Overview =<br />
<br />
== Status: Available ==<br />
<br />
* Type: Bachelor or Semester Thesis<br />
* Professor: Prof. Dr. L. Benini<br />
* Supervisors:<br />
** [[:User:Aottaviano | Alessandro Ottaviano]]: [mailto:aottaviano@iis.ee.ethz.ch aottaviano@iis.ee.ethz.ch]<br />
** [[:User:Tbenz | Thomas Benz]]: [mailto:tbenz@iis.ee.ethz.ch tbenz@iis.ee.ethz.ch]<br />
** [[:User:Balasr | Robert Balas]]: [mailto:balasr@iis.ee.ethz.ch balasr@iis.ee.ethz.ch]<br />
** [[:User:Nwistoff | Nils Wistoff]]: [mailto:nwistoff@iis.ee.ethz.ch nwistoff@iis.ee.ethz.ch]<br />
<br />
= Introduction =<br />
<br />
Recently, a new class of off-chip memories hit the market, targeting low-area and low-pin-count FPGAs and ASICs. These reduced-pin-count DDR (RPC DDR) [1] memories only require a simple on-chip PHY and can operate with regular digital IO pads, making them usable on our ASICs. <br />
<br />
In previous projects, we implemented a memory controller for RPC DRAM and taped it out on two different chips [2][3]. While their testing is ongoing, a readily available FPGA mapping is useful for fast prototyping and assessment of the IP.<br />
<br />
[[File:Rpc_dram.png|thumb|350px|]]<br />
<br />
= Project =<br />
<br />
The goal of the project is to map the RPC memory controller on FPGA and verify its functionality according to specifications. Functionality is verified by means of:<br />
<br />
* Cycle accurate RTL simulations using a model of the RPC DRAM endpoint provided by the DRAM manufacturer<br />
<br />
* A custom PCB with standard FMC termination that can be plugged into an FPGA board. The target board is the Digilent Genesys II.<br />
<br />
Work on both setups above has already been started to provide a reference setup for the student.<br />
<br />
As a last note, the memory controller is part of Cheshire [4], a platform hosting a Linux-capable 64-bit RISC-V core. An FPGA flow for mapping Cheshire onto a Genesys II FPGA is already available.<br />
<br />
The student will:<br />
<br />
* Get familiar with the RPC DRAM protocol and the controller design<br />
* Implement the design on FPGA: while a 1-to-1 correspondence between ASIC and FPGA exists, differences may arise at the level of memory buffers and low-level cells (e.g., clock multiplexers), requiring a dynamic switch between the two targets.<br />
* Verify design functionality using the two setups described above, and assess performance (bandwidth, utilization)<br />
<br />
== Character ==<br />
<br />
* 20% Literature Review<br />
* 60% Hardware Design<br />
* 20% Verification and Benchmarking<br />
<br />
== Prerequisites ==<br />
<br />
* Strong interest in computer architecture and memory systems<br />
* Experience with digital design in SystemVerilog as taught in VLSI I<br />
* Preferred: Knowledge or experience with AXI and RISC-V<br />
<br />
= References =<br />
<br />
<div> [1] “Etron Technology Inc. RPC DRAM.” https://etronamerica.com/products/rpc-dram/ </div><br />
<div> [2] http://asic.ethz.ch/2022/Neo.html </div><br />
<div> [3] http://asic.ethz.ch/2021/Dogeram.html </div><br />
<div> [4] “Cheshire: A Lightweight, Linux-Capable RISC-V Host Platform for Domain-Specific Accelerator Plug-In” https://ieeexplore.ieee.org/abstract/document/10163410 </div></div>Aottavianohttp://iis-projects.ee.ethz.ch/index.php?title=ASIC_implementation_of_a_beamspace_massive_MIMO-OFDM_detector_for_5G/6GASIC implementation of a beamspace massive MIMO-OFDM detector for 5G/6G2024-01-08T14:33:41Z<p>Smirfarsh: /* Short Description */</p>
<hr />
<div><br />
[[File:iip_lmmse_systolic_array.png|400px|thumb|Illustration of the systolic array of processing elements (PEs) implementing LMMSE]]<br />
[[File:iip_LMMSE_sys_PEs.png|500px|thumb|Internal architectures of the diagonal and nondiagonal PEs]]<br />
<br />
<br />
==Short Description== <br />
<br />
Millimeter-wave (mmWave) and massive multiple-input multiple-output (MIMO) are key technologies in 5G and beyond-5G wireless communication systems. While mmWave provides vast unused spectrum to increase the communication bandwidth, massive MIMO improves power and spectral efficiency via high array gain and spatial multiplexing. To deal with the inter-symbol interference caused by the frequency selectivity of wideband wireless channels, orthogonal frequency division multiplexing (OFDM) is employed in the 5G standard [1], as in many other standards such as LTE and IEEE 802.11 (Wi-Fi). <br />
The implementation of a wideband mmWave massive MIMO-OFDM detector is challenging for the following reasons:<br />
<br />
* The large number of base-station (BS) antennas in massive MIMO results in large data dimensions to be processed, which complicates the detection task. Usually, a linear detector such as the linear minimum mean squared error (LMMSE) detector is employed to tame the detection complexity.<br />
<br />
* The high baseband sampling rates (on the order of giga-samples per second) needed to support wideband communication generate an excessive amount of data to be processed per second.<br />
<br />
* An OFDM detector has to detect the data in the frequency domain for N ~ 1000 subcarriers. In a straightforward implementation, this requires N preprocessing operations for a linear detector, resulting in prohibitive complexity. <br />
<br />
We have developed two main techniques for reducing the implementation complexity (i.e., silicon area and power consumption) of a wideband mmWave massive MIMO-OFDM detector:<br />
<br />
* By exploiting the time-domain sparsity of the channel impulse response, which results in a smooth frequency-domain response, we use subcarrier interpolation of equalization matrices to eliminate the need to compute the equalization matrix for each subcarrier, thereby gaining considerable area and power savings [3].<br />
<br />
* The beamspace-domain sparsity of mmWave massive MIMO channels offers an opportunity to reduce the implementation complexity of matrix-vector multiplications involving such sparse matrices and vectors [4, 5, 6].<br />
<br />
The goal of this project is to put together the existing building blocks of the wideband beamspace massive MIMO detector with OFDM interpolation developed so far, in order to gain insight into how much these techniques ultimately reduce the implementation area and power consumption of a wideband mmWave massive MIMO detector for future wireless communication applications. The project mainly consists of Verilog HDL coding to integrate the existing building blocks and test the overall detector, as well as fixed-point Matlab simulations to generate stimuli and expected outputs for the design. The steps of the project are as follows:<br />
<br />
#Re-implement the linear minimum mean squared error (LMMSE) preprocessing engine based on the existing design [3], using advanced fixed-point representation techniques that allow for reduced signal bitwidths and thus reduced area/power. <br />
#Perform row-wise scaling on the resulting LMMSE equalization matrix, or convert it into a custom floating-point format, to reduce its bitwidth and thereby the area/power of subsequent operations.<br />
#Build the full OFDM receiver by instantiating multiple instances of the preprocessing engine on basepoint subcarriers and performing linear interpolation on the remaining subcarriers.<br />
#Integrate the LMMSE preprocessing engines with the beamspace fast Fourier transform (FFT) [7], as well as a recently developed beamspace matrix-vector multiplication engine using a custom floating-point number format, which will be provided.<br />
#Perform functional tests to ensure proper operation, then run the backend flow to produce the layout of the detector.<br />
#Perform stimuli-based power simulation in Cadence Innovus to assess the overall power savings achieved. <br />
#If time permits, prepare the design for a tapeout.<br />
<br />
If you are interested in improving your digital VLSI design skills, learning a lot about wireless communications and number representations through detailed Matlab simulations, and finally designing a chip for future wireless communication systems, this project is a great fit.<br />
<br />
<br />
<br />
----<br />
<br />
'''''References:'''''<br />
<br />
[1] European Telecommunications Standards Institute, “5G NR Base Station (BS) radio transmission and reception,” Apr. 2020, 3GPP TS 38.104 version 16.4.0 Release 16<br />
<br />
[2] C. Jeon, Z. Li and C. Studer, "Approximate Gram-Matrix Interpolation for Wideband Massive MU-MIMO Systems," in IEEE Transactions on Vehicular Technology, May 2020<br />
<br />
[3] D. Walter, “ASIC implementation of an interpolation-based wideband massive MIMO detector,” ETH Zürich master thesis, Oct. 2023<br />
<br />
[4] S. H. Mirfarshbafan and C. Studer, "Sparse Beamspace Equalization for Massive MU-MIMO MMWave Systems," ICASSP, Apr. 2020<br />
<br />
[5] S. H. Mirfarshbafan and C. Studer, "SPADE: Sparsity-Adaptive Equalization for MMwave Massive MU-MIMO," IEEE Statistical Signal Processing Workshop (SSP), Aug. 2021<br />
<br />
[6] M. Mahdavi, O. Edfors, V. Öwall and L. Liu, "Angular-Domain Massive MIMO Detection: Algorithm, Implementation, and Design Tradeoffs," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 6, pp. 1948-1961, June 2020<br />
<br />
[7] S. H. Mirfarshbafan, S. Taner and C. Studer, "SMUL-FFT: A Streaming Multiplierless Fast Fourier Transform," in IEEE Transactions on Circuits and Systems II: Express Briefs<br />
<br />
----<br />
<br />
<br />
<br />
===Status: Available ===<br />
: Master project (shorter version possible for semester project)<br />
: Contact: [https://iis.ee.ethz.ch/people/person-detail.Mjc3NjIw.TGlzdC8zOTg5LDk5MDE4ODk4MA==.html Seyed Hadi Mirfarshbafan]<br />
<br />
===Prerequisites===<br />
: Verilog<br />
: VLSI I & II <br />
: Matlab<br />
: Some background in wireless communications [optional]<br />
===Character===<br />
: 80% VLSI implementation<br />
: 20% MATLAB simulation<br />
<br />
===Professor===<br />
<!-- : [http://www.iis.ee.ethz.ch/people/person-detail.html?persid=194234 Luca Benini] ---><br />
<!-- : [http://www.iis.ee.ethz.ch/people/person-detail.html?persid=78758 Qiuting Huang] ---><br />
<!--: [http://www.iis.ee.ethz.ch/people/person-detail.html?persid=80923 Mathieu Luisier] ---><br />
<!--: [https://ee.ethz.ch/the-department/people-a-z/person-detail.MjUwODc0.TGlzdC8zMjc5LC0xNjUwNTg5ODIw.html Taekwang Jang] ---><br />
: [https://ee.ethz.ch/the-department/faculty/professors/person-detail.OTY5ODg=.TGlzdC80MTEsMTA1ODA0MjU5.html Christoph Studer]<br />
<!-- : [http://www.iis.ee.ethz.ch/people/person-detail.html?persid=79172 Andreas Schenk] ---><br />
<br />
==Links== <br />
<br />
[[Category:Available]]<br />
[[Category:IIP]]<br />
[[Category:IIP_5G]]<br />
[[Category:2024]]<br />
<br />
[[#top|↑ top]]</div>Smirfarsh