
Resource-Constrained Few-Shot Learning for Keyword Spotting (1S)


Revision as of 16:55, 28 September 2021 by Cioflanc


Status: Available

Short Description

The objective of keyword spotting is to detect a set of predefined keywords within a stream of user utterances. Once a keyword spotting system is deployed on an embedded device, classic model training becomes infeasible due to the constrained computational resources, time limitations, and small amounts of available data. Few-shot learning was proposed primarily to alleviate limited access to data, yet its scope can be extended to the other constraints as well. In the current work, our goal is to obtain a few-shot learning pipeline targeting hardware-constrained devices, expanding the capabilities of a keyword spotting system.


Machine Learning (ML) is an engineering field in which applications are designed to "analyse and draw inferences from patterns in data", as defined by the Oxford Dictionary. Hence, at the very core of ML lies a dataset: a collection of data exemplifying certain phenomena, systems, concepts, etc., from which the ML application can derive an understanding of the matter at hand. While there are fields in which data is abundant, with millions of images being uploaded to social networks every day, the same cannot be said about medicine, for instance. As mentioned by Rieke et al. [2], solutions are needed for rare disease contexts, "where the incident rates are low and data sets at each single institution are too small". In such cases, one possible solution is few-shot learning.

Few-shot learning (FSL) is the ML scenario in which a model learns to recognize patterns from a limited number of labelled examples, or samples. This is possible because the system already has prior knowledge about the nature of the data it is presented with, so that only the specific task at hand remains to be learnt. Mainly used in a supervised fashion, this approach extends the supervised learning methodology to data-scarce research areas, such as pharmaceutical sciences. Once an ML model has gained the capability to discriminate between objects of different categories regardless of the peculiarities of any individual instance, and has thus learned which characteristics are necessary and sufficient to categorize an object, learning a new category from only a limited number of instances becomes a much easier task. More on the topic can be found in the survey by Wang et al. [3].
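The few-shot setting is often organized into "N-way K-shot" episodes: N classes are drawn, and the learner sees only K labelled samples per class before being tested on further samples of the same classes. The following is a minimal sketch of such an episode sampler, not a prescribed part of this project. The function name `sample_episode`, the toy word list, and the placeholder strings standing in for audio clips are all assumptions made for illustration.

```python
import random


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a {label: [samples]} mapping.

    Returns (support, query): lists of (sample, label) pairs. The support
    set holds k_shot samples per class, the query set n_query per class.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy data (hypothetical): strings stand in for recorded utterances.
words = ["yes", "no", "up", "down", "stop"]
toy = {word: [f"{word}_{i}" for i in range(20)] for word in words}

support, query = sample_episode(toy, n_way=3, k_shot=5, n_query=2,
                                rng=random.Random(0))
print(len(support), len(query))  # 15 6
```

With 3 classes, 5 shots, and 2 query samples per class, the support set always holds 15 items and the query set 6, whichever classes are drawn.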

One field where FSL has been successfully applied is Keyword Spotting (KWS), a speech-based approach to human-computer interaction. In a personal setting, it is mostly used to wake up a virtual assistant, such as Siri or Alexa. Therefore, the set of words to be detected (i.e. the classes to which a user utterance can be assigned) is usually small. Taking into account the results reported for Deep Neural Networks (DNNs) on classification tasks, and given the low ratio of classes to training samples, high accuracies can be obtained. Such a system usually consists of two components. The first is a preprocessing step, in which the raw waveform is converted to a set of meaningful features. To extract the most relevant information from a user utterance, mel-frequency cepstral coefficients (MFCCs) generally offer the best results; their advantage lies in the frequency bands being spaced on a scale that closely approximates the response of the human auditory system. The second component is the DNN, whose inputs are the aforementioned features and whose output is the probability of those features belonging to each class. A schematic of a KWS system can be seen in Figure 1.
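The spacing of the frequency bands can be illustrated with a short sketch. It uses one common formula for the mel scale, mel = 2595 · log10(1 + f / 700); other variants exist, and the page does not say which one the project uses. The helper names, the 0 to 8000 Hz range, and the choice of 10 bands are assumptions for illustration only.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (one common variant of the scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centres(f_min, f_max, n_bands):
    """Centre frequencies (Hz) of n_bands bands equally spaced in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


centres = mel_centres(0.0, 8000.0, 10)
gaps = [b - a for a, b in zip(centres, centres[1:])]

for c in centres:
    print(f"{c:8.1f} Hz")
```

The bands are equidistant in mels, but in Hz the gap between neighbouring centres grows with frequency. Low frequencies are therefore resolved more finely than high ones, mirroring the behaviour of human hearing described above.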


In the work titled "Few-Shot Keyword Spotting in Any Language", Mazumder et al. [1] propose a few-shot learning approach in which a classification model is fine-tuned using a limited number of samples (as few as five). As the embedding model extracting the features is pretrained to discriminate between two orders of magnitude more words, belonging to different language families, the acquired knowledge can effectively be transferred to previously unseen words from unseen languages. Apart from the advantages of FSL already mentioned, it is worth adding that a model requiring fewer samples to adapt to novel scenarios has lower computational requirements during fine-tuning, and that less time is needed to acquire those samples.
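The idea of fitting only a small classifier on top of a frozen, pretrained embedding model can be sketched as follows. This is a simplified stand-in, not the method of Mazumder et al. [1]: it trains a binary logistic-regression head with plain gradient descent, and the 2-D "embeddings" are hypothetical hand-written vectors rather than the output of a real embedding model. The names `train_head` and `predict` are likewise assumptions.

```python
import math


def train_head(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on fixed embeddings (batch gradient descent)."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if x is classified as the new keyword, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical embeddings: five shots of the new keyword (label 1)
# and five samples of other speech (label 0).
target = [(1.0, 0.9), (0.8, 1.1), (1.2, 1.0), (0.9, 0.8), (1.1, 1.2)]
other = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -1.0), (-0.8, -0.9), (-1.1, -1.2)]

w, b = train_head(target + other, [1] * 5 + [0] * 5)
print(predict(w, b, (0.9, 1.0)), predict(w, b, (-1.0, -0.9)))
```

Only the head's few weights are updated, so the cost of adaptation scales with the embedding dimension and the number of shots rather than with the size of the pretrained model.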

The reduction in latency and computational effort makes FSL a suitable candidate for on-device learning, the process in which a model that has been deployed to an embedded device learns on the device itself to adapt to the specifics of its environment. Nevertheless, although the approach appears adequate for a resource-constrained setup, the sheer size and complexity of the model proposed by Mazumder et al. [1] make it infeasible to deploy on area-limited, low-power devices. Therefore, our goal for this project is to propose, implement, and evaluate an architecture that is integrated in an FSL framework and well-suited for edge devices, while obtaining results competitive with the baseline.
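A back-of-the-envelope estimate shows why model size matters on such devices. The sketch below only accounts for weight storage, and every number in it is a hypothetical placeholder: neither the parameter counts nor the 512 KiB budget are taken from [1] or from any specific target platform.

```python
def weight_memory_kib(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in KiB."""
    return n_params * bits_per_weight / 8 / 1024


budget_kib = 512                            # hypothetical on-chip memory budget
large = weight_memory_kib(10_000_000, 32)   # hypothetical large float32 model
small = weight_memory_kib(200_000, 8)       # hypothetical compact 8-bit model

print(f"large: {large:.1f} KiB, fits: {large <= budget_kib}")
print(f"small: {small:.1f} KiB, fits: {small <= budget_kib}")
```

Activations, gradients, and optimizer state needed for on-device fine-tuning come on top of this figure, so the real constraint is tighter than the weight count alone suggests.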


  • 20% literature research
  • 70% neural network implementation
  • 10% evaluation


  • Must be familiar with Python.
  • Knowledge of deep learning basics, including a deep learning framework such as PyTorch or TensorFlow, acquired from a course, a project, or self-study with tutorials.

Project Goals

The main tasks of this project are:

  • Task 1 - Familiarize yourself with the project specifics (1-2 Weeks)

    Learn about DNN training with PyTorch and how to visualize results with TensorBoard. Read up on DNN models aimed at time series (e.g., TCNs, RNNs, transformer networks) and on recent advances in KWS. Read up on few-shot learning, its common approaches, and recent advances on the topic.

  • Task 2 - Implement and evaluate the baseline (2-3 Weeks)

    Investigate, reproduce, and extend the analysis proposed by Mazumder et al. [1].

    If needed, perform hyperparameter tuning in order to reach similar results to those reported by the authors.

  • Task 3 - Implement and evaluate resource-constrained architectures (6-7 Weeks)

    Propose, implement, and evaluate alternative architectures for the few-shot learning framework.

    Optimize the model with respect to a three-way trade-off between accuracy, latency, and memory requirements.

  • Task 4 - Optimize the keyword spotting pipeline (4-5 Weeks)

    Propose, implement, and evaluate improvements within the few-shot learning framework, targeting higher accuracy, reduced number of samples needed for training, lower computational effort, etc.

    Evaluate and analyse the accuracy of the pipeline with respect to the implemented feature extraction method, and implement viable alternatives.

  • Task 5 - Build an on-device demonstration (Optional)

    Prepare an MCU-based demonstration to show the system’s capabilities to adapt to the environment.

  • Task 6 - Gather and Present Final Results (2-3 Weeks)

    Gather final results.

    Prepare presentation (10 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all your ideas that have influenced the project.

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to suggest next steps. These meetings are meant to provide a guaranteed time slot for the mutual exchange of information on how to proceed, to clear up any questions from either side, and to ensure the student’s progress.


Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used for writing the report; nevertheless, the use of LaTeX with Tgif or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.

Final Report

A digital copy of the report, the presentation, the developed software, build scripts/project files, drawings/illustrations, acquired data, etc. must be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.


At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested members of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.


[1] Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, and Vijay Janapa Reddi, Few-shot keyword spotting in any language, 2021.

[2] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletarì, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso, The future of digital health with federated learning, 2020.

[3] Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni, Generalizing from a few examples: A survey on few-shot learning, 2020.