Audio-Visual Speech Separation and Recognition (1S/1M)

Overview

Status: Available

  • Type: Semester Thesis
  • Professor: Prof. Dr. L. Benini
  • Supervisors:
      • Cristian Cioflan (IIS): cioflanc@iis.ee.ethz.ch
      • Dr. Miguel de Prado (Bonseyes Community Association): miguel.deprado@bonseyes.com
      • Dr. José Miranda Calero (Embedded Systems Laboratory, EPFL): jose.mirandacalero@epfl.ch

Introduction

Human-computer interaction has evolved rapidly in recent years, thanks to hardware developments that enable complex Deep Neural Networks (DNNs) to perform machine learning tasks. These scientific advances have led to the adoption of Automatic Speech Recognition (ASR) and Visual Speech Recognition (VSR) in human-machine communication: the former converts spoken utterances into written words[1], whilst the latter converts lip movements into written words (i.e., lipreading)[6]. In the context of voice-controlled devices, speaker separation[7] is also of utmost importance. It aims to mimic the cocktail party effect observed in humans, which denotes our ability to focus on an individual conversation whilst filtering out other discussions and surrounding noise, and it ensures that only a designated individual can control a target device at a time.

In the current work, we investigate the potential of performing Audio-Visual Speech Separation and Recognition (AVSSR)[2][8][3]. Through multi-task learning, we merge Audio-Visual Speech Separation (AVSS) and Audio-Visual Speech Recognition (AVSR). We propose to fuse audio and visual information, thus increasing the data dimensionality and, implicitly, the amount of useful information. The resulting model would then provide the target user(s) with a transcript of their respective speech. Lastly, the proposed system must abide by TinyML[5] constraints for edge devices, namely reduced memory and storage requirements, as well as real-time operation on low-power, battery-operated devices.
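
To make the fusion idea concrete, the following PyTorch sketch concatenates time-aligned audio and video features and feeds them to two task heads, one for separation and one for recognition. It is a minimal illustration only: the module names, feature dimensions, and the late-fusion design are assumptions, not the architecture to be developed in this project.

# Minimal late-fusion sketch (PyTorch). All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionAVSSR(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, vocab_size=32):
        super().__init__()
        # Per-modality encoders over time-aligned audio (e.g., log-mel frames)
        # and video (e.g., lip-region embeddings) sequences.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        # Fusion of the concatenated per-frame features.
        self.fusion = nn.Linear(2 * hidden, hidden)
        # Two task heads: a separation mask for the target speaker and per-frame token logits.
        self.sep_head = nn.Linear(hidden, audio_dim)
        self.asr_head = nn.Linear(hidden, vocab_size)

    def forward(self, audio, video):
        a, _ = self.audio_enc(audio)   # (batch, time, hidden)
        v, _ = self.video_enc(video)   # (batch, time, hidden), assumed time-aligned with audio
        fused = torch.relu(self.fusion(torch.cat([a, v], dim=-1)))
        mask = torch.sigmoid(self.sep_head(fused))   # separation output (mask over audio features)
        logits = self.asr_head(fused)                # recognition output (e.g., fed to a CTC loss)
        return mask, logits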


Character

  • 20% literature research
  • 50% architectural implementation and optimizations
  • 30% evaluation

Prerequisites

  • Must be familiar with Python.
  • Knowledge of deep learning basics, including a deep learning framework such as PyTorch or TensorFlow, gained from a course, a project, or self-study with tutorials.

Project Goals

The main tasks of this project are:

  • Task 1: Familiarize yourself with the project specifics (1-2 Weeks)

    Learn about DNN training in PyTorch and how to visualize results with TensorBoard (a minimal training-step sketch follows this list). Read up on data fusion and multimodal learning, covering common approaches and recent advances on the topic, as well as on multi-task learning in the context of audio-visual time series. Read up on DNN models aimed at time series (e.g., TCNs, TSMs, Transformer and Conformer networks) and the recent advances in AVSR/AVSS.

  • Task 2: Propose and evaluate AVSSR topologies (4-6 Weeks)

    Considering state-of-the-art works in AVSS[9] and AVSR[10][11][12], as well as previous IIS projects, propose and implement AVSSR architectures.

    Propose evaluation metrics and novel loss functions (a sketch of a combined separation/recognition loss follows this list); analyse the models' performance on the GRID dataset.

  • Task 3: Optimize proposed models considering TinyML constraints (2-3 Weeks)

    Reduce the models' hardware-associated costs (i.e., memory, storage, computational complexity), evaluating the trade-offs on the proposed metrics (see the cost and quantization sketch after this list).

    (Only if conducted as Master's thesis) Deploy the proposed architecture on novel parallel ultra-low-power platforms.

  • Task 4: Dataset generalization and ablation study (1-2 Weeks)

    Investigate alternative datasets for AVSSR.

    Propose and implement dataset modifications to enable multi-speaker AVSSR (see the two-speaker mixture sketch after this list).

    Evaluate and compare multimodal learning against audio- and video-only learning.

    Evaluate and compare speaker-overlapping and speaker-disjoint training and testing.

  • Task 5: Gather and Present Final Results (2-3 Weeks)

    Gather final results.

    Prepare presentation (15 min. + 5 min. discussion).

    Write a final report. Include all major decisions taken during the design process and argue your choices. Include everything that deviates from the standard case - show off everything that took time to figure out and all the ideas that have influenced the project.
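
Relating to Tasks 1 and 2, the sketch below shows one plausible multi-task training step that combines a separation loss (negative SI-SNR) with a recognition loss (CTC) and logs both to TensorBoard. The fixed loss weighting, the batch layout, and the assumption that the model returns an estimated waveform and token logits are hypothetical choices rather than project requirements.

# Hypothetical AVSSR training step: SI-SNR (separation) + CTC (recognition), logged to TensorBoard.
import torch
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

def si_snr_loss(est, ref, eps=1e-8):
    # Negative scale-invariant SNR between estimated and reference waveforms of shape (batch, samples).
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

def training_step(model, batch, optimizer, writer: SummaryWriter, step, alpha=0.5):
    audio_mix, video, target_wav, tokens, input_lens, token_lens = batch
    est_wav, logits = model(audio_mix, video)                    # assumed model outputs
    loss_sep = si_snr_loss(est_wav, target_wav)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)    # (time, batch, vocab) for CTC
    loss_asr = F.ctc_loss(log_probs, tokens, input_lens, token_lens, blank=0)
    loss = alpha * loss_sep + (1 - alpha) * loss_asr             # fixed weighting as a placeholder
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    writer.add_scalar("loss/separation", loss_sep.item(), step)
    writer.add_scalar("loss/recognition", loss_asr.item(), step)
    writer.add_scalar("loss/total", loss.item(), step)
    return loss.item()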
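
For Task 3, hardware-associated costs can be approximated before deployment; the sketch below counts parameters, estimates the float32 weight size, and applies post-training dynamic quantization as a first compression step. The placeholder network and the choice of quantization scheme are assumptions for illustration.

# Rough TinyML cost check: parameter count, float32 weight size, and dynamic int8 quantization.
import torch
import torch.nn as nn

def footprint(model):
    params = sum(p.numel() for p in model.parameters())
    size_mib = params * 4 / (1024 ** 2)   # float32 weights only; activations and buffers not counted
    return params, size_mib

# Placeholder network standing in for a candidate AVSSR model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 32))

params, size_mib = footprint(model)
print(f"{params} parameters, ~{size_mib:.2f} MiB as float32")

# Post-training dynamic quantization of the linear layers to int8 (CPU inference only).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)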
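
For the multi-speaker dataset modification in Task 4, one possible approach is to mix two single-speaker utterances at a chosen target-to-interference ratio while keeping the clean target signal and its transcript as labels. The file handling and the SNR convention below are assumptions.

# Sketch of a two-speaker mixture built from two single-speaker utterances (assumed equal sample rates).
import torch
import torchaudio

def make_two_speaker_mix(target_path, interferer_path, snr_db=0.0):
    wav_t, sr_t = torchaudio.load(target_path)
    wav_i, sr_i = torchaudio.load(interferer_path)
    assert sr_t == sr_i, "resample first if the sampling rates differ"
    # Trim both utterances to the shorter length so they fully overlap.
    n = min(wav_t.shape[-1], wav_i.shape[-1])
    wav_t, wav_i = wav_t[..., :n], wav_i[..., :n]
    # Scale the interferer so the target-to-interference ratio equals snr_db.
    pow_t = wav_t.pow(2).mean()
    pow_i = wav_i.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(pow_t / (pow_i * 10 ** (snr_db / 10)))
    mix = wav_t + scale * wav_i
    # The clean target wav_t serves as the separation label; its transcript as the recognition label.
    return mix, wav_t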

Project Organization

Weekly Meetings

The student shall meet with the advisor(s) every week in order to discuss any issues or problems that arose during the previous week and to agree on the next steps. These meetings are meant to provide a guaranteed time slot for a mutual exchange of information on how to proceed, to clear up any questions from either side, and to ensure the student's progress.

Report

Documentation is an important and often overlooked aspect of engineering. One final report has to be completed within this project. Any word processing software may be used for writing the report; nevertheless, the use of LaTeX with Tgif (see: http://bourbon.usc.edu:8001/tgif/index.html and http://www.dz.ee.ethz.ch/en/information/how-to/drawing-schematics.html) or any other vector drawing software (for block diagrams) is strongly encouraged by the IIS staff.

Final Report

A digital copy of the report, the presentation, the developed software, build script/project files, drawings/illustrations, acquired data, etc. needs to be handed in at the end of the project. Note that this task description is part of your report and has to be attached to your final report.

Presentation

At the end of the project, the outcome of the thesis will be presented in a 15-minute talk, followed by 5 minutes of discussion, in front of interested people of the Integrated Systems Laboratory. The presentation is open to the public, so you are welcome to invite interested friends. The exact date will be determined towards the end of the work.

References

[1] Mishaim Malik, Muhammad Kamran Malik, Khawar Mehmood, and Imran Makhdoom. Automatic speech recognition: A survey. 2021.

[2] Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. 2020.

[3] Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, and Andrew Zisserman. Seeing wake words: Audio-visual keyword spotting. 2020.

[4] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. 2015.

[5] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. MLPerf inference benchmark. 2020.

[6] Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, and Li Liu. Deep learning for visual speech analysis: A survey. 2022.

[7] DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. 2018.

[8] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. 2021.

[9] Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, and Xiaolin Hu. An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits. 2022.

[10] Brais Martinez, Pingchuan Ma, Stavros Petridis, and Maja Pantic. Lipreading using Temporal Convolutional Networks. 2020.

[11] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented Transformer for Speech Recognition. 2020.

[12] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal Shift Module for Efficient Video Understanding. 2019.