
Audio Visual Speech Separation and Recognition (1S/1M)


Revision as of 16:23, 26 January 2023 by Cioflanc (talk | contribs) (Status: Available)


Status: Available


Human-computer interaction has evolved rapidly in recent years, thanks to hardware developments that enable complex Deep Neural Networks (DNNs) to perform machine learning tasks. These advances have led to the use of Automatic Speech Recognition (ASR) and Visual Speech Recognition (VSR) in human-machine communication: the former converts spoken speech into written words[1], whilst the latter converts lip movements into written words (i.e., lipreading)[6]. In the context of voice-controlled devices, speaker separation[7] is of the utmost importance as well. It aims to mimic the cocktail party effect observed in humans, which denotes our ability to focus on an individual conversation whilst filtering out other discussions and surrounding noise. This ensures that only a designated individual can control a target device at any given time.

In the current work, we investigate the potential of performing multi-task learning using sensor fusion[2][3][8]. We propose to fuse audio and visual information, thus increasing the data dimensionality and, implicitly, the amount of useful information. The trained model would then provide the target user(s) with the transcript of their respective speech. Lastly, the proposed system must abide by the TinyML[5] constraints of edge devices, namely reduced memory and storage requirements, as well as real-time operation on low-power, battery-operated devices.
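As a rough illustration of what fusing the two modalities can mean at the feature level, the sketch below concatenates each audio frame with the video frame that covers the same relative time. It is only one possible fusion scheme (simple concatenation) and not the method prescribed by this project; the function name `fuse_features`, the frame counts, and the feature sizes are all hypothetical. In a PyTorch model, the same step would typically be a concatenation along the feature dimension.

```python
def fuse_features(audio_frames, video_frames):
    """Concatenate each audio frame with the time-aligned video frame.

    audio_frames: list of per-frame feature vectors (lists of floats)
    video_frames: list of per-frame feature vectors, usually at a lower frame rate
    """
    if not audio_frames or not video_frames:
        raise ValueError("both streams must contain at least one frame")
    fused = []
    for i, audio_vec in enumerate(audio_frames):
        # Map audio frame i to the video frame covering the same relative time.
        j = min(i * len(video_frames) // len(audio_frames), len(video_frames) - 1)
        fused.append(list(audio_vec) + list(video_frames[j]))
    return fused


# Hypothetical sizes: 8 audio frames with 3 features, 2 video frames with 2 features.
audio = [[float(t)] * 3 for t in range(8)]
video = [[10.0, 11.0], [20.0, 21.0]]
fused = fuse_features(audio, video)
print(len(fused), len(fused[0]))  # 8 fused frames, each with 3 + 2 = 5 features
```

The fused sequence keeps the audio frame rate, and every frame grows from 3 to 5 features, which is the increase in data dimensionality mentioned above.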


Character

  • 20% literature research
  • 40% feature extraction implementation
  • 40% evaluation


Prerequisites

  • Must be familiar with Python.
  • Knowledge of deep learning basics, including a deep learning framework such as PyTorch or TensorFlow, acquired through a course, a project, or self-study with tutorials.

Project Goals

The main tasks of this project are:

  • Task 1: Familiarize yourself with the project specifics (1-2 Weeks)

    Learn about DNN training with PyTorch and how to visualize results with TensorBoard. Read up on DNN models for audio-visual speech recognition and speaker separation (e.g., CNNs, transformer networks), as well as the common approaches and recent advances in the field.

  • Task 2: Evaluate baseline model(s) (2-3 Weeks)

    Once a baseline model has been established for the given tasks, implement and evaluate it on the selected datasets (e.g., [4]).

  • Task 3: Implement audio-visual sensor fusion on the selected model(s) (3-4 Weeks)

    Propose fusing methods and evaluate them on the selected speech recognition dataset, in comparison with single-sensor models.

    Optimize the fused model with respect to the TinyML constraints.

    Test its generalization properties on other datasets.

  • Task 4: Perform multi-task learning for speech separation and recognition (3-4 Weeks)

    Extend the model's capabilities to speech separation, considering available datasets containing labels for both tasks.

    Evaluate and optimize the proposed model.

  • Task 5: Gather and Present Final Results (2-3 Weeks)

    Gather final results.

    Prepare presentation (10 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all your ideas that have influenced the project.
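The evaluation steps in Tasks 2 to 4 could be supported by small helpers along the following lines. This is a minimal sketch under stated assumptions, not a prescribed evaluation protocol: the task description does not name any metrics, so word error rate (WER) for recognition, scale-invariant signal-to-distortion ratio (SI-SDR) for separation, and weight storage as a proxy for the TinyML memory constraint are assumptions. All function names and example values are hypothetical.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def si_sdr_db(estimate, target):
    """SI-SDR in dB between an estimated and a target signal.

    Assumption: signals are equal-length lists of samples; mean removal,
    which is often applied first, is left out to keep the sketch short.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    target_energy = sum(t * t for t in target)
    if target_energy == 0:
        raise ValueError("target signal must not be all zeros")
    # Project the estimate onto the target to make the metric scale-invariant.
    scale = sum(e * t for e, t in zip(estimate, target)) / target_energy
    projection = [scale * t for t in target]
    noise_energy = sum((e - p) ** 2 for e, p in zip(estimate, projection))
    signal_energy = sum(p * p for p in projection)
    if noise_energy == 0:
        return math.inf
    if signal_energy == 0:
        return -math.inf
    return 10 * math.log10(signal_energy / noise_energy)


def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone (activations are ignored)."""
    return (num_params * bits_per_weight + 7) // 8


print(word_error_rate("turn on the light", "turn of the light"))  # 0.25
print(si_sdr_db([1.0, 1.0], [1.0, 0.0]))  # 0.0
print(weight_bytes(400_000, 32), weight_bytes(400_000, 8))  # 1600000 400000
```

With the hypothetical 400,000-parameter model above, quantizing the weights from 32 to 8 bits reduces the storage from 1.6 MB to 0.4 MB, which is the kind of trade-off Task 3 asks to evaluate against the TinyML constraints.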

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings provide a guaranteed time slot for the mutual exchange of information on how to proceed, for clearing up any questions from either side, and for ensuring the student’s progress.


Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used for writing the report; nevertheless, the use of LaTeX with Tgif or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.


At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested members of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.


References

[1] Mishaim Malik, Muhammad Kamran Malik, Khawar Mehmood, and Imran Makhdoom. Automatic speech recognition: a survey. 2021.

[2] Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. 2020.

[3] Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, and Andrew Zisserman. Seeing wake words: Audio-visual keyword spotting. 2020.

[4] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. 2015.

[5] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. MLPerf inference benchmark. 2020.

[6] Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, and Li Liu. Deep learning for visual speech analysis: A survey. 2022.

[7] DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. 2018.

[8] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. 2021.