Resource-Constrained Few-Shot Learning for Keyword Spotting (1S)
- Type: Semester Thesis
- Professor: Prof. Dr. L. Benini
The objective of keyword spotting is to detect a set of predefined keywords within a stream of user utterances. Once a keyword spotting system is deployed on an embedded device, classic model training becomes infeasible due to constrained computational resources, time limitations, and the small amount of available data. Few-shot learning was proposed primarily to alleviate the limited access to data, yet its scope can be extended further. In the current work, our goal is to build a few-shot learning pipeline targeting hardware-constrained devices, expanding the capabilities of a keyword spotting system.
Machine Learning (ML) is an engineering field in which applications are designed to "analyse and draw inferences from patterns in data", as defined by the Oxford Dictionary. Hence, at the very core of ML lies a dataset: a collection of data exemplifying certain phenomena, systems, or concepts, from which the ML application can derive an understanding of said matter. While there are fields in which data is abundant, with millions of images being uploaded to social networks every day, the same cannot be said about medicine, for instance. As mentioned by Rieke et al., solutions are needed in rare-disease contexts, "where the incident rates are low and data sets at each single institution are too small". In such cases, one possible solution is few-shot learning.
Few-shot learning (FSL) denotes the ML scenario in which the application designed to determine patterns in data learns from a limited number of examples, or samples, using supervised information. This is possible due to the prior knowledge the system has about the nature of the data it is presented with and about the specific task to be learnt. Mainly used in a supervised fashion, this approach extends the supervised learning methodology to data-scarce research areas, such as pharmaceutical sciences. Once an ML model has gained the capability to discriminate between objects of different categories, regardless of the peculiarities of any individual instance, i.e., has learned which characteristics of the objects are necessary and sufficient to categorize them, learning a new object category from only a limited number of instances becomes a much easier task. More on the topic can be read in the survey by Wang et al.
One field where FSL has been successfully applied is Keyword Spotting (KWS). KWS is a speech-based approach to human-computer interaction. In a personal setting, it is mostly used to wake up a virtual assistant, such as Siri or Alexa. Therefore, the set of words to be detected (i.e., classifying a user utterance as one specific word) is usually small. Given the results reported by Deep Neural Networks (DNNs) on classification tasks and the low ratio of classes to samples on which such a system is trained, high accuracies can be obtained. Such a system usually consists of two components. First, a preprocessing step converts the raw waveform into a set of meaningful features. To extract the most relevant information from a user utterance, mel-frequency cepstral coefficients (MFCCs) generally offer the best results; their advantage lies in frequency bands spaced on a scale which closely approximates the human auditory system's response. The second component is the DNN, whose inputs are the aforementioned features and whose output is the probability of those features belonging to each class. A schematic of a KWS system can be seen in Figure 1.
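The mel-scale spacing mentioned above can be illustrated with a short sketch (the 0-8 kHz range and 40 bands are typical choices for 16 kHz speech, assumed here rather than taken from this description): band edges spaced uniformly on the mel scale become progressively wider in Hz, mirroring the ear's decreasing frequency resolution at higher frequencies.

```python
import math

def hz_to_mel(f_hz):
    # Standard mel-scale formula: equal mel steps approximate
    # equal perceptual pitch steps of the human auditory system.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min_hz, f_max_hz, n_bands):
    # Edges of n_bands triangular filters: equally spaced on the mel
    # scale, then mapped back to Hz (n_bands + 2 edge points in total).
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, 40)
```

The lowest bands end up only a few tens of Hz wide, while the highest span several hundred Hz, which is precisely why MFCCs concentrate resolution where speech carries the most discriminative information.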
In the work titled "Few-Shot Keyword Spotting in Any Language", Mazumder et al. propose a few-shot learning approach for fine-tuning a classification model using a limited number of samples (down to five). Because the embedding model that extracts the features is pretrained to discriminate between two orders of magnitude more words, drawn from different language families, the acquired knowledge can effectively be transferred to previously unseen words from unseen languages. Apart from the advantages already mentioned in favour of FSL, it is worth adding that a model requiring fewer samples to adapt to novel scenarios is a model whose computational requirements during fine-tuning are reduced, as is the time needed to acquire said examples.
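As a minimal illustration of how a pretrained embedding model enables learning from only five samples per keyword, consider a nearest-prototype classifier over frozen embeddings. This is a deliberate simplification, not the fine-tuning procedure of Mazumder et al., and the embedding vectors below are purely hypothetical:

```python
def prototype(vectors):
    # Average the few available embeddings of one keyword into a
    # single class prototype (the "learning" step: no gradients needed).
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def classify(embedding, prototypes):
    # Assign the label of the nearest prototype by squared
    # Euclidean distance in embedding space.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: dist(embedding, prototypes[label]))

# Toy 2-D embeddings standing in for the pretrained model's output.
protos = {
    "yes": prototype([[1.0, 0.0], [0.9, 0.1]]),
    "no":  prototype([[0.0, 1.0], [0.1, 0.9]]),
}
```

Because all discriminative power lives in the pretrained embedding space, adding a new keyword costs only one averaging pass over its few samples, which is what makes such schemes attractive under tight compute budgets.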
The reduction in latency and computational effort makes FSL a suitable candidate for on-device learning, the process whereby a model deployed to an embedded device in the desired environment learns, within the device, to adapt to the specifics of that environment. Nevertheless, while apparently adequate for a resource-constrained setup, the sheer size and complexity of the proposal of Mazumder et al. make it infeasible for deployment on area-limited, low-power devices. Therefore, our goal for this project is to propose, implement, and evaluate an architecture, integrated into an FSL framework and well-suited for edge devices, while obtaining results competitive with the baseline.
- 20% literature research
- 70% neural network implementation
- 10% evaluation
- Must be familiar with Python.
- Knowledge of deep learning basics, including experience with a deep learning framework such as PyTorch or TensorFlow, gained from a course, a project, or self-study with tutorials.
The main tasks of this project are:
Task 1: Familiarize yourself with the project specifics (1-2 Weeks)
Learn about DNN training in PyTorch and how to visualize results with TensorBoard. Read up on DNN models aimed at time series (e.g. TCNs, RNNs, transformer networks) and on recent advances in KWS. Read up on feature embeddings.
Task 2: Devise a DNN with a performance comparable to the state-of-the-art (3-4 Weeks)
Select a dataset and analyse the models which can represent baselines for our work. Particularly check for publicly available code.
If no code is available: design, implement, and train a KWS model, considering a three-party trade-off between accuracy, number of parameters, and latency.
Compare the model against the selected baseline and the figures reported in the paper.
Task 3: Integrate user features in the designed DNN (6-7 Weeks)
Implement user embeddings, which map the user-ID to the feature vector, the latter representing an additional input of the DNN.
Devise a greedy update policy for the feature vector.
Compare the model's accuracy against that of the generic DNN, and report the variation in parameter count and latency.
Evaluate the model on self-generated input data to assess its adaptability.
Repeat this task with other conceivable options to integrate user features.
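One conceivable realization of the user-embedding idea above, sketched in plain Python: a table maps each user ID to a feature vector that is concatenated with the acoustic features before entering the DNN. The exponential-moving-average update shown is one hypothetical greedy policy, not one prescribed by this description; the thesis is expected to devise and compare its own.

```python
class UserEmbeddingTable:
    """Maps a user ID to a feature vector appended to the DNN input."""

    def __init__(self, dim, step=0.1):
        self.dim = dim    # length of each user feature vector
        self.step = step  # greedy update rate (hypothetical choice)
        self.table = {}

    def lookup(self, user_id):
        # Previously unseen users start from a zero vector.
        return self.table.setdefault(user_id, [0.0] * self.dim)

    def greedy_update(self, user_id, utterance_embedding):
        # Greedily move the stored vector a small step toward the
        # embedding of the latest utterance from this user.
        vec = self.lookup(user_id)
        self.table[user_id] = [
            (1.0 - self.step) * v + self.step * e
            for v, e in zip(vec, utterance_embedding)
        ]

def augmented_input(acoustic_features, user_vector):
    # The DNN consumes the acoustic features concatenated
    # with the user feature vector as an additional input.
    return list(acoustic_features) + list(user_vector)
```

In a real model the table would likely be an `nn.Embedding` layer trained jointly with the network; the dictionary above only makes the data flow of Task 3 concrete.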
Task 4: Model optimization (Optional)
Optimize the model w.r.t. latency (e.g. skip connections, early exits) and/or memory footprint (e.g. pruning, quantization, weight sharing, knowledge distillation techniques).
Integrate federated learning in order to share knowledge among multiple parties without compromising user privacy and data security.
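To make the memory-footprint side of the optional optimization task concrete, here is a minimal sketch of symmetric uniform int8 quantization (a simplified stand-in; a real deployment would more likely use a framework's quantization tooling, e.g. PyTorch's): each float32 weight is mapped to an 8-bit integer plus one shared scale factor, roughly a 4x storage reduction.

```python
def quantize_int8(weights):
    # Symmetric uniform quantization: map float32 weights to int8
    # values in [-127, 127] plus one shared scale factor.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q_weights, scale):
    # Recover approximate float weights for computation.
    return [q * scale for q in q_weights]
```

The reconstruction error is bounded by half a quantization step, which is typically tolerable for KWS-sized networks; pruning, weight sharing, and distillation attack the footprint along complementary axes.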
Task 5: Gather and Present Final Results (2-3 Weeks)
Gather final results.
Prepare presentation (15 min. + 5 min. discussion).
Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all the ideas that have influenced the project.
The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings are meant to provide a guaranteed time slot for a mutual exchange of information on how to proceed, to clear up any questions from either side, and to ensure the student's progress.
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word processing software is allowed for writing the report; nevertheless, the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.
At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.
- Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, and Vijay Janapa Reddi, "Few-Shot Keyword Spotting in Any Language", 2021.
- Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletarì, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso, "The Future of Digital Health with Federated Learning", 2020.
- Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni, "Generalizing from a Few Examples: A Survey on Few-Shot Learning", 2020.