Audio Visual Speech Separation and Recognition (1S/1M)


Status: Available


Human-computer interaction has evolved rapidly in recent years, as advances in hardware have enabled complex Deep Neural Networks (DNNs) to perform machine learning tasks. These advances have led to the adoption of Automatic Speech Recognition (ASR) and Visual Speech Recognition (VSR) in human-machine communication: the former converts spoken speech into written words[1], whilst the latter converts lip movements into written words (i.e., lipreading)[6]. In the context of voice-controlled devices, speaker separation[7] is of utmost importance as well. It aims to mimic the cocktail party effect in humans, which denotes our ability to focus on an individual conversation whilst filtering out other discussions and surrounding noise. This ensures that only a designated individual can control a given target device at a time.

In the current work, we investigate the potential of performing Audio-Visual Speech Separation and Recognition (AVSSR) [2][8][3]. Through multi-task learning, we merge Audio-Visual Speech Separation (AVSS) and Audio-Visual Speech Recognition (AVSR). We propose to fuse audio and visual information, thus increasing the data dimensionality and, implicitly, the amount of useful information. The model trained in this way would then provide the target user(s) with the transcript of their respective speech. Lastly, the proposed system must abide by the TinyML[5] constraints of edge devices, namely reduced memory and storage requirements as well as real-time operation on low-power, battery-operated devices.
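
To make the fusion and multi-task ideas concrete, the following is a minimal sketch rather than the project's architecture. It assumes that audio features arrive at an integer multiple of the video frame rate, that fusion is done by plain frame-wise concatenation, and that the two task losses are combined by a weighted sum. The function names and the weight `alpha` are hypothetical.

```python
from typing import List

Frame = List[float]


def upsample_video(video: List[Frame], factor: int) -> List[Frame]:
    """Repeat every video feature frame `factor` times (nearest-neighbour)."""
    return [list(frame) for frame in video for _ in range(factor)]


def fuse_by_concatenation(audio: List[Frame], video: List[Frame]) -> List[Frame]:
    """Concatenate audio and video features frame by frame.

    Assumption: the number of audio frames is an integer multiple of the
    number of video frames, so video can be upsampled by simple repetition.
    """
    if not audio or not video or len(audio) % len(video) != 0:
        raise ValueError("audio length must be a positive multiple of video length")
    factor = len(audio) // len(video)
    video_up = upsample_video(video, factor)
    return [a + v for a, v in zip(audio, video_up)]


def multitask_loss(separation_loss: float, recognition_loss: float,
                   alpha: float = 0.5) -> float:
    """Weighted sum of the two task losses; `alpha` is a hypothetical knob."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * separation_loss + (1.0 - alpha) * recognition_loss


# Toy example: 4 audio frames (2 features each), 2 video frames (3 features each).
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
video_feats = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
fused = fuse_by_concatenation(audio_feats, video_feats)
print(len(fused), "fused frames with", len(fused[0]), "features each")
```

Concatenation is only one of several fusion strategies; it is used here because it makes the increase in data dimensionality explicit.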


  • 20% literature research
  • 50% architectural implementation and optimizations
  • 30% evaluation


  • Must be familiar with Python.
  • Knowledge of deep learning basics, including a deep learning framework such as PyTorch or TensorFlow, acquired from a course, a project, or self-study with tutorials.

Project Goals

The main tasks of this project are:

  • Task 1: Familiarize yourself with the project specifics (1-2 weeks)

    Learn about DNN training with PyTorch and how to visualize results with TensorBoard. Read up on data fusion and multimodal learning, including common approaches and recent advances on the topic. Read up on multi-task learning in the context of audio-visual time series. Read up on DNN models aimed at time series (e.g., TCNs, TSMs, Transformer and Conformer networks) and the recent advances in AVSR/AVSS.

  • Task 2: Propose and evaluate AVSSR topologies (4-6 weeks)

    Considering state-of-the-art works in AVSS[9] and AVSR[10][11][12] and previous IIS projects, propose and implement AVSSR architectures.

    Propose evaluation metrics and novel loss functions; analyse the models' performance on the GRID dataset.
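
As a starting point for the evaluation metrics in Task 2, two commonly used measures can be sketched in plain Python: word error rate (WER) for recognition and scale-invariant signal-to-noise ratio (SI-SNR) for separation. These are assumptions about reasonable baselines, not the metrics prescribed by the project; the function names are hypothetical and the example sentences merely follow the style of GRID commands.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def si_snr_db(estimate: Sequence[float], target: Sequence[float],
              eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a target signal."""
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual counts as noise.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    noise = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


wer = word_error_rate("bin blue at f two now", "bin blue at s two now")
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 0.9, -1.1]
print(f"WER = {wer:.3f}, SI-SNR = {si_snr_db(noisy, clean):.1f} dB")
```

Because SI-SNR is invariant to the gain of the estimate, its negative is often used directly as a separation loss, which makes it a natural candidate for one term of a multi-task objective.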

  • Task 3: Optimize proposed models considering TinyML constraints (2-3 weeks)

    Reduce the models' hardware-associated costs (i.e., memory, storage, computational complexity), evaluating the trade-offs on the proposed metrics.

    (Only if conducted as Master's thesis) Deploy the proposed architecture on novel parallel ultra-low-power platforms.
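
The hardware-associated costs named in Task 3 can be estimated before any deployment. The sketch below counts parameters, multiply-accumulate operations (MACs), and weight storage for a stack of standard 1-D convolutions. The layer sizes, the output length of 100 frames, and the helper names are purely illustrative assumptions and are not taken from the project.

```python
from typing import List, Tuple

# (in_channels, out_channels, kernel_size); hypothetical layer sizes.
Layer = Tuple[int, int, int]


def conv1d_params(layer: Layer, bias: bool = True) -> int:
    """Number of weights (plus optional biases) of a standard 1-D convolution."""
    c_in, c_out, k = layer
    return c_in * c_out * k + (c_out if bias else 0)


def conv1d_macs(layer: Layer, output_length: int) -> int:
    """Multiply-accumulates needed to produce `output_length` output frames."""
    c_in, c_out, k = layer
    return c_in * c_out * k * output_length


def storage_bytes(num_params: int, bits_per_weight: int) -> int:
    """Storage for the parameters at a given bit width, rounded up to bytes."""
    return (num_params * bits_per_weight + 7) // 8


layers: List[Layer] = [(64, 128, 3), (128, 128, 3), (128, 64, 3)]
total_params = sum(conv1d_params(layer) for layer in layers)
total_macs = sum(conv1d_macs(layer, output_length=100) for layer in layers)
fp32_bytes = storage_bytes(total_params, 32)
int8_bytes = storage_bytes(total_params, 8)
print(f"{total_params} parameters, {total_macs} MACs per 100 frames")
print(f"storage: {fp32_bytes} B (32-bit) vs. {int8_bytes} B (8-bit)")
```

Such back-of-the-envelope numbers only bound the weight storage; activation memory and the accuracy impact of quantization still have to be measured on the actual model.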

  • Task 4: Dataset generalization and ablation study (1-2 weeks)

    Investigate alternative datasets for AVSSR.

    Propose and implement dataset modifications to enable multi-speaker AVSSR.

    Evaluate and compare multimodal learning against audio- and video-only learning.

    Evaluate and compare speaker-overlapping and speaker-disjoint training and testing.
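
Two of the dataset manipulations in Task 4 can be illustrated with a short sketch: creating a two-speaker mixture at a chosen signal-to-noise ratio, and splitting utterances in a speaker-disjoint or speaker-overlapping way. The speaker and utterance identifiers, the function names, and the 25% test fraction are hypothetical; real recordings would be loaded from the chosen dataset.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Utterance = Tuple[str, str]  # (speaker_id, utterance_id), both hypothetical


def mix_at_snr(target: Sequence[float], interferer: Sequence[float],
               snr_db: float) -> List[float]:
    """Add an interfering speaker so that the target-to-interferer ratio is snr_db."""
    n = min(len(target), len(interferer))
    t, i = target[:n], interferer[:n]
    power_t = sum(x * x for x in t) / n if n else 0.0
    power_i = sum(x * x for x in i) / n if n else 0.0
    if power_t == 0.0 or power_i == 0.0:
        raise ValueError("both signals must be non-empty and non-silent")
    gain = math.sqrt(power_t / (power_i * 10.0 ** (snr_db / 10.0)))
    return [a + gain * b for a, b in zip(t, i)]


def speaker_disjoint_split(utterances: Sequence[Utterance],
                           test_speakers: Sequence[str]):
    """Test speakers never appear in the training set."""
    held_out = set(test_speakers)
    train = [u for u in utterances if u[0] not in held_out]
    test = [u for u in utterances if u[0] in held_out]
    return train, test


def speaker_overlapping_split(utterances: Sequence[Utterance],
                              test_fraction: float, seed: int = 0):
    """Hold out a fraction of every speaker's utterances for testing."""
    rng = random.Random(seed)
    by_speaker: Dict[str, List[str]] = {}
    for speaker, utt in utterances:
        by_speaker.setdefault(speaker, []).append(utt)
    train, test = [], []
    for speaker in sorted(by_speaker):
        utts = sorted(by_speaker[speaker])
        rng.shuffle(utts)
        n_test = max(1, round(len(utts) * test_fraction))
        test += [(speaker, u) for u in utts[:n_test]]
        train += [(speaker, u) for u in utts[n_test:]]
    return train, test


mixture = mix_at_snr([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0], snr_db=0.0)
corpus = [(f"s{s}", f"utt{u}") for s in range(1, 4) for u in range(4)]
dis_train, dis_test = speaker_disjoint_split(corpus, ["s3"])
ovl_train, ovl_test = speaker_overlapping_split(corpus, test_fraction=0.25)
```

Comparing results on the two kinds of split indicates how much of a model's performance depends on having seen the test speakers during training.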

  • Task 5: Gather and present final results (2-3 weeks)

    Gather final results.

    Prepare presentation (15 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and justify your choices. Include everything that deviates from the standard case: show off everything that took time to figure out and all the ideas that have influenced the project.

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week to discuss any issues or problems that arose during the previous week and to agree on suggested next steps. These meetings are meant to provide a guaranteed time slot for the mutual exchange of information on how to proceed, to clear up any questions from either side, and to ensure the student’s progress.


Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used to write the report; nevertheless, the IIS staff strongly encourages the use of LaTeX with Tgif or any other vector drawing software (for block diagrams).

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. must be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.


At the end of the project, the outcome of the thesis will be presented in a 15-minute talk followed by 5 minutes of discussion in front of interested members of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.


[1] Mishaim Malik, Muhammad Kamran Malik, Khawar Mehmood, and Imran Makhdoom. Automatic speech recognition: a survey. 2021.

[2] Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. 2020.

[3] Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, and Andrew Zisserman. Seeing wake words: Audio-visual keyword spotting. 2020.

[4] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. 2015.

[5] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. MLPerf inference benchmark. 2020.

[6] Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, and Li Liu. Deep learning for visual speech analysis: A survey. 2022.

[7] DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. 2018.

[8] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. 2021.

[9] Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, and Xiaolin Hu. An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits. 2022.

[10] Brais Martinez, Pingchuan Ma, Stavros Petridis, and Maja Pantic. Lipreading using Temporal Convolutional Networks. 2020.

[11] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented Transformer for Speech Recognition. 2020.

[12] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal Shift Module for Efficient Video Understanding. 2019.