Feature Extraction for Speech Recognition (1S)

From iis-projects

Revision as of 12:23, 27 September 2021 by Cioflanc (talk | contribs)


Status: Available

Short Description

The objective of keyword spotting (KWS) is to detect a set of predefined keywords within a stream of user utterances. In most KWS pipelines, as in any other audio-based task, the acoustic model and/or linguistic model is preceded by a feature extraction stage. Several approaches have been proposed for this step, the most notable being the Mel-Frequency Cepstral Coefficients (MFCC). Since this technique is only one of the possible approaches to feature extraction, and since it has been shown that the performance of the system can be strongly influenced by this choice, the aim of this project is to carry out an in-depth evaluation of feature extraction methods for KWS.


Feature extraction is the process of deriving essential, non-redundant information from a set of measured data. By selecting and/or combining input variables into features, the dimension of the data can be greatly reduced, which in turn reduces the computational effort within the processing pipeline. Nonetheless, such techniques should not affect the integrity of the data; the extracted features should still accurately describe the data set. Although the two terms are often used interchangeably, we differentiate between feature (pre)processing and feature extraction, defining the former as the techniques applied to alter the data in order to emphasize or remove certain characteristics. To illustrate the concept of feature extraction, we use the Mel-Frequency Cepstral Coefficients (MFCC) as an example.

Extracted from the audio signal that the system receives as input, the MFCCs are one of the current standards in KWS [2] [6]. They are a cepstral representation of the signal but, unlike the plain cepstrum, the MFCCs use frequency bands that are equally spaced on the Mel scale. This brings the representation of the signal closer to the actual response of the human auditory system. The derivation techniques are described in detail in [5]; we enumerate the main steps of PyTorch’s MFCC computation:

  • Windowing - applying a window function (e.g., Hamming window) to each frame, mainly to reduce the spectral leakage that arises because the FFT implicitly treats each finite frame as one period of an infinite signal;

  • Fast Fourier Transform (FFT) - calculating the frequency spectrum;

  • Triangular Filters - applying triangular filters on a Mel-scale to extract frequency bands;

  • Logarithm - transforming the MelSpectrogram into LogMelSpectrogram to obtain additive behaviour;

  • Discrete Cosine Transform (DCT) - calculating the cepstral coefficients from the LogMelSpectrogram.
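
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not PyTorch's implementation: the function names, the HTK-style Mel formula m = 2595 * log10(1 + f / 700), the orthonormal DCT-II, and all default parameters (16 kHz sampling rate, 25 ms window, 10 ms hop, 40 Mel bands, 13 coefficients) are assumptions chosen for the example. The intermediate results are returned as well, since they can serve as features in their own right.

```python
import numpy as np


def hz_to_mel(f_hz):
    """HTK-style Mel scale (an assumed variant): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        low, centre, high = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - low) / (centre - low)
        falling = (high - fft_freqs) / (high - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)
    return basis


def extract_features(signal, sample_rate=16000, n_fft=400, hop=160,
                     n_mels=40, n_mfcc=13):
    """Return the Mel spectrogram, log-Mel spectrogram and MFCCs of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        raise ValueError("signal is shorter than one analysis frame")
    # Framing: slice the signal into overlapping frames (no padding).
    n_frames = 1 + (signal.size - n_fft) // hop
    index = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[index]
    # 1. Windowing with a Hamming window.
    frames = frames * np.hamming(n_fft)
    # 2. FFT: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 3. Triangular filters on the Mel scale.
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # 4. Logarithm (a small offset avoids log(0)).
    log_mel = np.log(mel + 1e-6)
    # 5. DCT: keep the first n_mfcc cepstral coefficients.
    mfcc = log_mel @ dct_matrix(n_mfcc, n_mels).T
    return {"mel": mel, "log_mel": log_mel, "mfcc": mfcc}
```

Calling `extract_features` on a one-second clip returns three arrays with one row per frame; which of them is fed to the model is exactly the design choice this project evaluates.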

Feature extraction is often used in Machine Learning settings, and especially in the context of Deep Neural Networks (DNN), because reducing the feature dimensionality reduces both the computational load and the storage requirements of the DNN. By pre-determining the relevant features of the input data using human expertise, the model is relieved of the tedious task of identifying and filtering redundant information. A deep learning pipeline for KWS integrating MFCC is presented in Figure 1.
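
To make the dimensionality argument concrete, the following sketch estimates the size of the DNN input for different feature types. All numbers are assumptions chosen for illustration (a one-second clip at 16 kHz, a 400-sample window, a 160-sample hop, 40 Mel bands, 13 MFCCs, 4 bytes per value), and the helper names are hypothetical; the actual settings depend on the chosen dataset and baseline.

```python
def num_frames(n_samples, win_length, hop_length):
    """Number of full analysis frames when no padding is applied."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


def feature_footprint(n_samples, win_length, hop_length, n_features,
                      bytes_per_value=4):
    """Size of a (frames x features) matrix in values and in bytes."""
    frames = num_frames(n_samples, win_length, hop_length)
    values = frames * n_features
    return {"frames": frames, "values": values, "bytes": values * bytes_per_value}


if __name__ == "__main__":
    sample_rate = 16000           # assumed sampling rate in Hz
    clip = sample_rate            # one second of audio
    print("raw waveform:", clip, "values")
    # Assumed feature sizes: 40 Mel bands, 13 cepstral coefficients.
    for name, n_features in (("MelSpectrogram", 40), ("MFCC", 13)):
        size = feature_footprint(clip, 400, 160, n_features)
        ratio = clip / size["values"]
        print(f"{name}: {size['frames']} frames, {size['values']} values, "
              f"{size['bytes']} bytes, {ratio:.1f}x fewer values than raw audio")
```

The estimate covers only the size of the feature matrix itself; the cost of computing the features and the resulting model size have to be measured separately.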


In recent years, it has been shown that state-of-the-art results can be achieved not only with the MFCCs, but also with the output of an intermediate step, such as the MelSpectrogram [9] or the LogMelSpectrogram [4]. Moreover, surveys on the topic [1] [3] [7] [8] present alternative strategies for feature extraction in audio-based ML tasks. Nevertheless, to the best of our knowledge, no existing work presents a meta-analysis of the different extraction techniques and their impact on KWS. As Choi et al. [3] have shown, the choice of the feature extraction mechanism has a significant impact in the context of music tagging, so it is natural to expect a similar conclusion from a comparative evaluation for a keyword spotting system.

Character

  • 20% literature research
  • 40% feature extraction implementation
  • 40% evaluation

Prerequisites

  • Must be familiar with Python.
  • Knowledge of deep learning basics, including some deep learning framework like PyTorch or TensorFlow from a course, project, or self-taught with some tutorials.

Project Goals

The main tasks of this project are:

  • Task 1: Familiarize yourself with the project specifics (1-2 Weeks)

    Read up on the feature extraction methods mentioned in the reference materials. Learn about DNN training with PyTorch and how to visualize results with TensorBoard. Read up on DNN models aimed at time series (e.g., TCNs, RNNs, transformer networks) and the recent advances in KWS.

  • Task 2: Implement and evaluate the baseline (2-3 Weeks)

    Select a dataset and analyse the models that can serve as baselines for this work. In particular, check for publicly available code. The supervisors will provide you with the MFCC implementation that will serve as the starting point for this analysis.

    If no code is available: design, implement, and train KWS models, considering the state-of-the-art architectures for time series.

    Compare the model against the selected baseline and figures in the paper.

  • Task 3: Implement feature extraction techniques (4-5 Weeks)

    Using the referenced work, implement the feature extraction methods against which MFCC will be compared.

    Optimize said methods with respect to computational effort and storage requirements.

    (Optional) Perform parameter tuning for the implemented techniques.

  • Task 4: Evaluate feature extraction techniques (4-5 Weeks)

    Evaluate and analyse the accuracy of the KWS pipeline with respect to the implemented feature extraction setting.

    Evaluate and analyse the accuracy of the aforementioned methods under similar operating regimes.

    Evaluate and analyse the compatibility between the processing techniques and the DNN architecture.

  • Task 5: Gather and present final results (2-3 Weeks)

    Gather final results.

    Prepare presentation (10 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all your ideas that have influenced the project.

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to suggest next steps. These meetings provide a guaranteed time slot for a mutual exchange of information on how to proceed, for clearing up questions from either side, and for ensuring the student’s progress.


Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used to write the report; nevertheless, the IIS staff strongly encourages the use of LaTeX together with Tgif or any other vector drawing software (for block diagrams).

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.


At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.

References

[1] Sabur Ajibola Alim and Nahrul Khair Alang Rashid. Some commonly used speech feature extraction algorithms. 2018.

[2] Axel Berg, Mark O’Connor, and Miguel Tairum Cruz. Keyword transformer: A self-attention model for keyword spotting. 2021.

[3] Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler. A comparison of audio signal preprocessing methods for deep neural networks on music tagging. 2021.

[4] Byeonggeun Kim, Simyung Chang, Jinkyu Lee, and Dooyong Sung. Broadcasted residual learning for efficient keyword spotting. 2021.

[5] James Lyons. Mel Frequency Cepstral Coefficient (MFCC) tutorial.

[6] S. Majumdar and B. Ginsburg. MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In Proc. Interspeech 2020, pp. 3356–3360. 2020.

[7] Garima Sharma, Kartikeyan Umapathy, and Sridhar Krishnan. Trends in audio signal feature extraction methods. 2020.

[8] Urmila Shrawankar and V M Thakare. Techniques for feature extraction in speech recognition system: A comparative study. 2013.

[9] Roman Vygon and Nikolay Mikhaylovskiy. Learning efficient representations for keyword spotting with triplet loss. 2021.