Feature Extraction and Architecture Clustering for Keyword Spotting (1S)
- Status: In progress
- Type: Semester Thesis
- Professor: Prof. Dr. L. Benini
The objective of Keyword Spotting (KWS) is to detect a set of predefined keywords within a stream of user utterances. In most KWS pipelines, as in other audio-based tasks, the acoustic and/or linguistic model is preceded by a feature extraction stage. Several approaches have been proposed for this step, the most notable being the Mel-Frequency Cepstral Coefficients (MFCCs). As this technique represents only one possible approach to feature extraction, and since it has been shown that system performance can be strongly influenced by this choice, the aim of this project is an in-depth evaluation of feature extraction methods for KWS and their pairing with the Deep Neural Network (DNN) architecture employed for recognition.
Feature extraction is the process of deriving essential, non-redundant information from a set of measured data. Through the selection and/or combination of input variables into features, the dimension of the given data can be largely reduced, which also reduces the computational effort within the processing pipeline. Nonetheless, such techniques should not affect the integrity of the data; the extracted features should still accurately describe the data set. Although the two terms are often used interchangeably, we differentiate between feature (pre)processing and feature extraction, defining the former as the techniques applied to alter the data in order to emphasize or remove certain characteristics. To better understand the concept of feature extraction, we will use the Mel-Frequency Cepstral Coefficients (MFCCs) as an example.
Extracted from the audio signal received as input by the system, the MFCCs are one of the current standards in KWS. They are a cepstral representation of the signal but, compared to the plain cepstrum, use frequency bands equally spaced on the Mel scale, which brings the representation closer to the actual response of the human auditory system. The derivation techniques are described in detail in the literature; we enumerate the main steps of PyTorch's MFCC computation:
Windowing - applying a window function (e.g., Hamming window) to each frame, mainly to counteract the infinite-data assumption made during the FFT computation;
Fast Fourier Transform (FFT) - calculating the frequency spectrum;
Triangular Filters - applying triangular filters on a Mel-scale to extract frequency bands;
Logarithm - transforming the MelSpectrogram into LogMelSpectrogram to obtain additive behaviour;
Discrete Cosine Transform (DCT) - calculating the cepstral coefficients from the LogMelSpectrogram.
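The steps above can be sketched in plain NumPy. This is an illustrative reconstruction only, not PyTorch's actual implementation (torchaudio.transforms.MFCC differs in padding, normalization, and default parameters), and all frame, hop, and filter sizes below are assumed values:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40, n_mfcc=13):
    """Illustrative MFCC pipeline: windowing, FFT, Mel filterbank, log, DCT."""
    # 1. Framing + Hamming windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular filters equally spaced on the Mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # 4. Logarithm: MelSpectrogram -> LogMelSpectrogram
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II over the Mel bands -> cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T   # shape: (n_frames, n_mfcc)
```

For a 1 s clip at 16 kHz, this maps 16000 raw samples to a 98x13 feature matrix, which illustrates the dimensionality reduction discussed above.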
Feature extraction is often used in Machine Learning settings, and especially in the context of Deep Neural Networks (DNNs), because reducing the feature dimensionality also reduces the computational load and the storage requirements of the network. By pre-determining the relevant features from the input data using human expertise, the model is relieved of the tedious task of identifying and filtering redundant information. A deep learning pipeline for KWS integrating MFCCs is presented in Figure 1.
In recent years, it has been shown that state-of-the-art results can be achieved not only with MFCCs, but also with the output of an intermediate step, such as the MelSpectrogram or the LogMelSpectrogram. Moreover, surveys on the topic present alternative strategies for feature extraction in audio-based ML tasks. Nevertheless, to the best of our knowledge, no work presents a meta-analysis of the different extraction techniques and their impact in the scope of KWS. As Choi et al. have shown, the choice of the feature extraction mechanism has a significant impact in the context of music tagging, so it is natural to assume that a similar conclusion would be drawn from a comparative evaluation of keyword spotting systems.
Given their interdependence, the feature extraction method should be selected jointly with the neural network. We therefore propose evaluating the extraction techniques over a wide range of DNNs generated using the Once-For-All (OFA) Neural Architecture Search (NAS) algorithm. Furthermore, we propose clustering the resulting models such that, for a given set of constraints (e.g., accuracy, memory, latency), only a limited search space needs to be explored to determine the network (and extraction method) achieving the best trade-off.
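The constraint-driven clustering idea can be sketched as follows. This is a toy illustration under assumed values: the metric table is randomly generated rather than measured on OFA-derived networks, the constraint thresholds are arbitrary, and plain k-means stands in for whatever clustering method the project ultimately selects:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical metrics (accuracy, latency in ms, memory in MB) for 200
# candidate architectures; in the project these would be measured on the
# OFA-generated networks, not drawn at random.
metrics = np.column_stack([
    rng.uniform(0.80, 0.97, 200),   # accuracy
    rng.uniform(5.0, 50.0, 200),    # latency [ms]
    rng.uniform(0.5, 8.0, 200),     # memory [MB]
])

def kmeans(x, k=4, iters=50, seed=0):
    """Plain k-means on z-scored metric vectors; each cluster is one
    reduced search-space subset."""
    z = (x - x.mean(0)) / x.std(0)
    r = np.random.default_rng(seed)
    centers = z[r.choice(len(z), k, replace=False)]
    for _ in range(iters):
        labels = ((z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = z[labels == j].mean(0)
    return labels

labels = kmeans(metrics, k=4)

# Given constraints (here: latency < 20 ms, memory < 2 MB), restrict the
# search to the cluster containing the most feasible models, then pick the
# most accurate model inside that subset.
feasible = (metrics[:, 1] < 20.0) & (metrics[:, 2] < 2.0)
best_cluster = max(range(4), key=lambda j: int((feasible & (labels == j)).sum()))
subset = np.flatnonzero(feasible & (labels == best_cluster))
best = subset[metrics[subset, 0].argmax()]
```

The point of the clustering step is the last four lines: instead of scoring all 200 candidates, only the subset inside one cluster is searched.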
- 20% literature research
- 40% feature extraction implementation
- 40% evaluation
- Must be familiar with Python.
- Basic knowledge of deep learning, including a framework such as PyTorch or TensorFlow, acquired through a course, a project, or self-study with tutorials.
The main tasks of this project are:
Task 1: Familiarize yourself with the project specifics (1-2 Weeks)
Read up on feature extraction methods mentioned in the reference materials. Learn about DNN training and PyTorch, how to visualize results with TensorBoard. Familiarize yourself with the existing frameworks for comparing feature extractors and for architecture clustering.
Task 2: Extend existing analysis on feature extractors (3-4 Weeks)
Compare the selected feature extraction techniques considering hardware-associated metrics (i.e., storage, latency, memory).
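A minimal host-side proxy for such a comparison might look as follows. This is a hypothetical sketch: the `benchmark` helper and the magnitude-spectrogram stand-in extractor are invented for illustration, and wall-clock/tracemalloc numbers on a workstation only approximate on-device storage, latency, and memory:

```python
import time
import tracemalloc
import numpy as np

def benchmark(extractor, signal, runs=20):
    """Rough host-side proxies for the Task 2 metrics: feature storage in
    bytes, median wall-clock latency in ms, and peak Python-level heap
    allocation during one extraction call."""
    feats = extractor(signal)
    storage = feats.nbytes                      # storage of the feature matrix
    times = []
    for _ in range(runs):                       # median latency over runs
        t0 = time.perf_counter()
        extractor(signal)
        times.append(time.perf_counter() - t0)
    latency_ms = 1000.0 * sorted(times)[runs // 2]
    tracemalloc.start()                         # peak allocation of one call
    extractor(signal)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"storage_B": storage, "latency_ms": latency_ms, "peak_mem_B": peak}

# Example with a crude magnitude-spectrogram stand-in for a feature extractor
sig = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
stats = benchmark(lambda s: np.abs(np.fft.rfft(s.reshape(100, 160))), sig)
```

Running the same harness over each candidate extractor yields directly comparable storage/latency/memory triples.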
Task 3: Implement architecture clustering (4-5 Weeks)
Investigate and implement clustering methods suitable for the computational graphs generated using OFA.
Analyse the performance of the resulting search-space subsets, considering their compatibility with the available feature extractors.
Evaluate the clustered subsets considering the set of given hardware constraints.
(Optional) Task 4: Evaluate feature extraction techniques (4-5 Weeks)
Implement and evaluate additional signal processing-based or Deep Learning-based feature extractors.
Extend the analysis to broader search spaces, integrating other DNN models aimed at classifying time series (e.g., TCNs, RNNs, Transformer networks).
Task 5: Gather and Present Final Results (2-3 Weeks)
Gather final results.
Prepare presentation (15 min. + 5 min. discussion).
Write a final report. Document all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all the ideas that influenced the project.
The student shall meet with the advisor(s) every week to discuss any issues that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, for clearing up questions from either side, and for ensuring the student's progress.
Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any form of word processing software is allowed for writing the report; nevertheless, the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.
A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.
At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.
Sabur Ajibola Alim and Nahrul Khair Alang Rashid. Some commonly used speech feature extraction algorithms. 2018.
Axel Berg, Mark O'Connor, and Miguel Tairum Cruz. Keyword transformer: A self-attention model for keyword spotting. 2021.
Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler. A comparison of audio signal preprocessing methods for deep neural networks on music tagging. 2021.
Byeonggeun Kim, Simyung Chang, Jinkyu Lee, and Dooyong Sung. Broadcasted residual learning for efficient keyword spotting. 2021.
James Lyon. Mel Frequency Cepstral Coefficient (MFCC) tutorial.
S. Majumdar and B. Ginsburg. MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In Proc. Interspeech 2020, pp. 3356–3360, 2020.
Garima Sharma, Kartikeyan Umapathy, and Sridhar Krishnan. Trends in audio signal feature extraction methods. 2020.
Urmila Shrawankar and V. M. Thakare. Techniques for feature extraction in speech recognition system: A comparative study. 2013.
Roman Vygon and Nikolay Mikhaylovskiy. Learning efficient representations for keyword spotting with triplet loss. 2021.
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. 2019.
Thorir Mar Ingolfsson, Mark Vero, Xiaying Wang, Lorenzo Lamberti, Luca Benini, and Matteo Spallanzani. Reducing neural architecture search spaces with training-free statistics and computational graph clustering. 2022.