On-Device Learnable Embeddings for Acoustic Environments



Overview

Status: Available

Introduction

The objective of keyword spotting (KWS)[7] is to detect a set of predefined keywords[5] within a stream of user utterances. KWS modules typically run on low-memory, low-power devices, so they must achieve high accuracy while respecting the memory footprint and latency of the system, i.e., TinyML constraints[4]. In devices intended for personal use, adapting the model to the user's characteristics can considerably improve system performance[3][2]. Similarly, in noisy settings, adapting to the on-site noise can recover the accuracy lost relative to a silent room. We want to explore how environment-specific features can be exploited to improve the accuracy of a small-footprint KWS model.

Feature embeddings[6][2] (see https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) map a discrete variable to a vector of continuous numbers. Their advantage over other encoding methods (e.g., one-hot encoding) lies in their low dimensionality and in the fact that they can be learned in a supervised fashion. Additionally, by projecting the input features into the neural network's embedding space, one can perform meaningful comparisons between features through the use of distance measures. Historically, this concept has been successfully applied in recommender systems, where a user is provided with relevant suggestions (e.g., films, books) by comparing their interests against clusters formed by other users' interests. In other words, the engine adapts itself based on user-specific features.
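To make this concrete, here is a minimal PyTorch sketch of an embedding lookup: a learnable table maps discrete environment IDs to low-dimensional continuous vectors, which can then be compared with a distance measure. All dimensions and the environment count are illustrative placeholders, not project specifications.

    import torch
    import torch.nn as nn

    # Hypothetical setup: 16 environments (users/noise profiles), 8-dim embeddings.
    num_envs, emb_dim = 16, 8
    env_table = nn.Embedding(num_envs, emb_dim)  # learnable lookup table

    # Map two discrete environment IDs to continuous vectors.
    ids = torch.tensor([3, 7])
    vecs = env_table(ids)  # shape: (2, 8)

    # Embeddings live in a metric space: compare environments via cosine similarity.
    sim = nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
    print(f"similarity of environments 3 and 7: {sim.item():.3f}")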


Apart from forming a standalone network, embeddings can also be used as inputs to a DNN, as seen in Figure 1. By doing so, a well-pretrained model can be adapted to the environment characteristics, thus exceeding the performance of a generic model, as demonstrated in [2]. Additionally, learning embeddings is more efficient from a hardware perspective than fully updating a system's backbone through backpropagation. Furthermore, a certain environment (e.g., a target user) can be identified based on its embeddings, enabling the system to listen only to pre-registered users. Our first goal is to devise a KWS system integrating environmental features (e.g., speech characteristics, noisy environments, reverb profiles), trained online on the incoming environment-specific characteristics. Our second goal is to deploy the proposed model on an always-on, low-power device, such as the PULP platform GAP9[1]. The deployment should account for the latency cost, as well as the memory requirements and the number of parameters of the model.
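The following toy sketch illustrates this adaptation scheme under stated assumptions (the backbone, feature dimensions, and keyword count are placeholders; the project will define the actual architecture): an environment embedding is concatenated with the acoustic features fed into a pretrained backbone, the backbone is frozen, and online training updates only the embedding parameters.

    import torch
    import torch.nn as nn

    class EnvAwareKWS(nn.Module):
        """Toy environment-conditioned KWS classifier (cf. Figure 1).
        All sizes below are illustrative assumptions."""
        def __init__(self, feat_dim=64, emb_dim=8, num_envs=16, num_keywords=12):
            super().__init__()
            self.env_emb = nn.Embedding(num_envs, emb_dim)
            self.backbone = nn.Sequential(
                nn.Linear(feat_dim + emb_dim, 64), nn.ReLU(),
                nn.Linear(64, num_keywords),
            )

        def forward(self, feats, env_id):
            e = self.env_emb(env_id)                    # (batch, emb_dim)
            return self.backbone(torch.cat([feats, e], dim=-1))

    model = EnvAwareKWS()
    # On-device adaptation: freeze the pretrained backbone and learn only the
    # environment embedding -- far fewer parameters than full backpropagation.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD(model.env_emb.parameters(), lr=1e-2)

    feats = torch.randn(4, 64)                          # dummy acoustic features
    labels = torch.randint(0, 12, (4,))                 # dummy keyword labels
    logits = model(feats, torch.full((4,), 3))          # batch from environment 3
    opt.zero_grad()
    nn.functional.cross_entropy(logits, labels).backward()
    opt.step()

In this toy configuration only num_envs x emb_dim = 128 embedding weights are trainable, versus several thousand backbone weights, which is the hardware-efficiency argument above in miniature.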


Character

  • 10% literature research
  • 30% deep learning
  • 50% on-device implementation
  • 10% evaluation

Prerequisites

  • Must be familiar with Python, in the context of deep learning and quantization
  • Must be familiar with C, for layer implementation and model deployment


Project Goals

The main tasks of this project are:

  • Task 1 - Familiarize yourself with the project specifics (1-2 weeks)

    Learn about DNN training in PyTorch and how to visualize results with TensorBoard.

    Read up on embeddings, as well as on-device learning and fine-tuning.

    Read up on DNN models aimed at time series (e.g., DS-CNNs, TCNs, Transformer and Conformer networks) and the recent advances in keyword spotting.

  • Task 2 - Develop an environment-aware KWS system (3-4 weeks)

    Develop and train a noise-specific embedding in a KWS system (a minimal noise-mixing sketch follows this task list).

    Integrate the proposed system to expand the capabilities of a person-aware KWS system.


  • Task 3 - Deploy environment-aware KWS system on ultra-low-power devices (4-5 weeks)

    Deploy backbone and feature embeddings to perform on-device inference.

    Integrate on-device training using PULP-TrainLib[8] to enable adaptation to unseen environments (e.g., noises, people).

    Evaluate the learning process considering hardware-associated metrics (e.g., latency, memory, storage, energy).

  • (Optional - if conducted as Master's Thesis) Task 4 - Develop an Environment-Aware Automated Speech Recognition (ASR) system (8-12 weeks)

    Propose, train, and evaluate an ASR backbone.

    Deploy the proposed backbone on ultra-low-power platforms and evaluate it for inference.

    Expand the on-device learning algorithm to the ASR task (e.g., replacing CTC loss with CE loss) and evaluate it considering hardware-associated metrics.

  • Task 5 - Gather and Present Final Results (2-3 weeks)

    Gather final results.

    Prepare presentation (15/20 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and argue your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all the ideas that influenced the project.
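The noise-mixing sketch referenced in Task 2, assuming 1-D waveforms of equal length; the SNR scaling formula is standard, while the function name and values are illustrative:

    import torch

    def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
        """Scale `noise` so the mixture reaches the requested SNR, then add it
        to `speech`. Both waveforms are assumed 1-D and of equal length."""
        speech_power = speech.pow(2).mean()
        noise_power = noise.pow(2).mean().clamp(min=1e-10)
        # SNR(dB) = 10 * log10(P_speech / P_scaled_noise)
        scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise

    # Example: a 1 s utterance at 16 kHz mixed with noise at 5 dB SNR,
    # e.g., to train a noise-specific embedding on Speech Commands[5] data.
    utterance = torch.randn(16000)
    noise = torch.randn(16000)
    noisy = mix_at_snr(utterance, noise, snr_db=5.0)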

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, clear up any questions from either side, and ensure the student's progress.

Report

Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used for writing the report; nevertheless, the IIS staff strongly encourages the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software for block diagrams.

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.

Presentation

At the end of the project, the outcome of the thesis will be presented in a 15/20-minute talk followed by a 5-minute discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.

References

[1] GAP9 Product Brief. https://greenwaves-technologies.com/wp-content/uploads/2023/02/GAP9-Product-Brief-V1_14_non_NDA.pdf, 2022.

[2] Cristian Cioflan, Lukas Cavigelli, and Luca Benini. Boosting Keyword Spotting Through On-Device Learnable User Speech Characteristics. 2024.

[3] Samuele Cornell, Jee-weon Jung, Shinji Watanabe, and Stefano Squartini. One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition. 2023.

[4] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunathm Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. MLPerf Inference Benchmark. 2020.

[5] Pete Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. 2018.

[6] Yi Yang and Jacob Eisenstein. Unsupervised Multi-Domain Adaptation with Feature Embeddings. 2015.

[7] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello Edge: Keyword Spotting on Microcontrollers. 2018.

[8] Davide Nadalini, Manuele Rusci, Giuseppe Tagliavini, Leonardo Ravaglia, Luca Benini, and Francesco Conti. PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning. 2022.