Status: In Progress


The transformer[7] was initially designed for natural language processing (NLP), but its main novelty, the attention mechanism, has proved helpful in various applications, such as Automatic Speech Recognition (ASR)[9], speech separation[8], or environmental sound classification[2]. In parallel, the increasing demand for machine learning at the edge, near the sensor, has led to inference being performed directly on edge devices with limited memory and computational resources, an approach often referred to as TinyML. In this context, the attention layer is the main bottleneck preventing the adoption of transformers for embedded time-series processing: the memory and computational requirements of the conventional attention mechanism scale quadratically with the input length, severely limiting the ability to process long sequences of data. One solution to this problem relies on random feature maps to approximate the softmax kernel[1][3][4], avoiding the costly explicit calculation of the attention matrix. This class of attention mechanisms is typically referred to as linear attention.
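To make the difference between the two attention variants concrete, the following is a minimal NumPy sketch, not the Waveformer implementation. It assumes positive random features in the spirit of [1]; the function names, the `num_features` parameter, and the choice of NumPy are illustrative assumptions. The conventional variant builds the full n x n attention matrix, whereas the linear variant only forms an m x d summary of the keys and values, so its cost grows linearly with the sequence length n.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Conventional attention: materializes the full (n, n) attention matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def positive_random_features(x, proj):
    """Map x of shape (n, d) to positive features of shape (n, m).

    For Gaussian rows in proj, the dot product of two feature vectors
    approximates exp(x_i . x_j) in expectation.
    """
    m = proj.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ proj.T - half_sq_norm) / np.sqrt(m)


def linear_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention without forming the (n, n) matrix.

    num_features and seed are illustrative parameters of this sketch.
    """
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((num_features, d))
    # Split the 1/sqrt(d) softmax scaling evenly between queries and keys.
    q_feat = positive_random_features(q / d**0.25, proj)  # (n, m)
    k_feat = positive_random_features(k / d**0.25, proj)  # (n, m)
    kv = k_feat.T @ v                    # (m, d_v): one pass over the keys
    norm = q_feat @ k_feat.sum(axis=0)   # (n,): per-query normalizer
    return (q_feat @ kv) / norm[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = 0.3 * rng.standard_normal((64, 8))
    k = 0.3 * rng.standard_normal((64, 8))
    v = rng.standard_normal((64, 4))
    diff = softmax_attention(q, k, v) - linear_attention(q, k, v, num_features=4096)
    print("mean absolute deviation:", np.mean(np.abs(diff)))
```

The approximation quality depends on the number of random features and on the magnitude of the queries and keys, so the deviation printed above is only indicative for this toy input.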

Transformers relying on linear attention have achieved state-of-the-art accuracy on keyword spotting tasks, as demonstrated in [5]. Keyword spotting is one of the most well-studied problems for TinyML systems, given its real-time, confidential, near-sensor computation requirements. On the other hand, its more complex acoustic counterpart, namely ASR (i.e., the task of converting speech into written text), still requires fundamental improvements to become feasible in edge computing settings.

The first goal of this project is therefore to adapt the state-of-the-art Waveformer[5] architecture, depicted in Figure 1, to the requirements and complexity of ASR. The second goal is to deploy the proposed architecture on novel parallel ultra-low-power platforms. While the original architecture has already proved feasible for single-core embedded systems, our goal is to achieve real-time ASR on ultra-low-power devices. To this end, you will employ and extend our in-house, open-source, state-of-the-art Deeploy deployment tool, generating C code to achieve on-board inference for the quantized version of your proposed neural network.


  • 10% literature research
  • 30% deep learning
  • 50% on-device implementation
  • 10% evaluation


  • Must be familiar with Python, in the context of deep learning and quantization
  • Must be familiar with C, for layer implementation and model deployment

Project Goals

The main tasks of this project are:

  • Task 1 - Familiarize yourself with the project specifics (1-2 weeks)

    Learn about DNN training with PyTorch and how to visualize results with TensorBoard.

    Read up on linear attention and the recent advances in automatic speech recognition.

    Read up on quantization and deployment to ultra-low-power platforms.

  • Task 2 - Develop a Waveformer-based ASR model (4-6 weeks)

    Create a PyTorch framework for ASR, including Datasets and Dataloaders, as well as integrating the training and evaluation mechanisms.

    Evaluate Waveformer on ASR task(s).

    Expand the existing architecture considering an accuracy-complexity trade-off.

  • Task 3 - Deploy ASR-Waveformer on PULP platforms (4-6 weeks)

    Automate the quantization of Waveformer using Quantlib[6], followed by integrating ASR-Waveformer.

    Expand Deeploy to enable PULP-based deployment of Waveformer.

    Deploy ASR-Waveformer and evaluate the on-device inference considering hardware-associated metrics (e.g., latency, memory, storage, energy).

  • Task 4 - Gather and Present Final Results (2-3 weeks)

    Gather final results.

    Prepare presentation (15/20 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all the ideas that have influenced the project.

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, to clear up any questions from either side, and to ensure the student’s progress.


Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used for writing the report; nevertheless, the use of LaTeX with Tgif or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. must be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.


At the end of the project, the outcome of the thesis will be presented in a 15-minute/20-minute talk and 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.


[1] Choromanski, Krzysztof Marcin and Likhosherstov, Valerii and Dohan, David and Song, Xingyou and Gane, Andreea and Sarlos, Tamas and Hawkins, Peter and Davis, Jared Quincy and Mohiuddin, Afroz and Kaiser, Lukasz and Belanger, David Benjamin and Colwell, Lucy J. and Weller, Adrian. Rethinking Attention with Performers. 2022.

[2] Gazneli, Avi and Zimerman, Gadi and Ridnik, Tal and Sharir, Gilad and Noy, Asaf. Boosting Augmentations Towards An Efficient Audio Classification Network. 2022.

[3] Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, François. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020.

[4] Kitaev, Nikita and Kaiser, Lukasz and Levskaya, Anselm. Reformer: The Efficient Transformer. Technical Report. 2020.

[5] Scherer, Moritz and Cioflan, Cristian and Magno, Michele and Benini, Luca. Work In Progress: Linear Transformers for TinyML. 2024.

[6] Spallanzani, Matteo and Rutishauser, Georg and Scherer, Moritz and Burrello, Alessio and Conti, Francesco and Benini, Luca. QuantLab: a Modular Framework for Training and Deploying Mixed-Precision NNs. 2022.

[7] Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia. Attention Is All You Need. 2017.

[8] Zhao, Shengkui and Ma, Bin. MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. 2023.

[9] Zhang, Yu and Qin, James and Park, Daniel S. and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V. and Wu, Yonghui. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. 2022.