Difference between revisions of "Deep Learning Projects"

From iis-projects

m (Available Projects)
(Available Projects)
(14 intermediate revisions by 2 users not shown)
Line 14: Line 14:
 
|-
 
|-
 
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
 
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
|-
 
| available || SA/MA || Stand-Alone Edge Computing with GAP8 || Detailled description: [[Stand-Alone_Edge_Computing_with_GAP8]] || Embedded || SW/HW (PCB-level) || [[:User:andrire|Renzo Andri]] Andres Gomez
 
|-
 
| available || MA || INQ Accelerator || INQ is a quantization technique which has been proven to work very well for neural networks. The weights are quantized to levels of +-2^n. As multiplcations with power's of two can be done by just shifting the bits, it is perfect for HW acceleration. In this thesis you will design an ASIC performing INQ quantized networks. || ASIC || HW (ASIC) || [[:User:andrire|Renzo Andri]]
 
 
|-
 
|-
 
| available || MA/SA || On-chip Learning || Neural Networks are compute and resource intensive and are usually run on power-intensive GPU clusters, but we would like to exploit them also on the everywhere IoT devices. To reach that, we need to develop new hardware architecture optimized for this application. This also include to check new algorithmic approach, which can reduce the compute or memory footprint of these networks. || ASIC || HW (ASIC) || [[:User:andrire|Renzo Andri]]
 
| available || MA/SA || On-chip Learning || Neural Networks are compute and resource intensive and are usually run on power-intensive GPU clusters, but we would like to exploit them also on the everywhere IoT devices. To reach that, we need to develop new hardware architecture optimized for this application. This also include to check new algorithmic approach, which can reduce the compute or memory footprint of these networks. || ASIC || HW (ASIC) || [[:User:andrire|Renzo Andri]]
 
|-
 
|-
| available || 1-2x SA || HW Data Compressor for CNNs || The most commonly used hardware accelerators for CNNs are largely limited (energy efficiency, throughput) by the bandwidth to external DRAM. We have recently proposed a novel compression scheme ([https://arxiv.org/pdf/1810.03979.pdf paper]) which would be a very good fit for a hardware implementation. In this project, you will implement the encoder and decoder for ASIC and/or FPGA, such that we can use it and verify that our claim of hardware suitability truly holds. || ASIC/FPGA || HW (ASIC) / HW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
+
| available || 1-2x SA || HW Data Compressor for CNNs || The most commonly used hardware accelerators for CNNs are largely limited (energy efficiency, throughput) by the bandwidth to external DRAM. We have recently proposed a novel compression scheme ([https://arxiv.org/pdf/1810.03979.pdf paper]) which would be a very good fit for a hardware implementation. In this project, you will implement the encoder and decoder on an ASIC and/or FPGA, such that we can use it and verify that our claim of hardware suitability truly holds. || ASIC/FPGA || HW (ASIC) / HW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
 
|-
 
|-
| available || 1-2x SA or 1x MA || Ternary-Weight FPGA System || Together with an external partner we are evaluating how combining binary or ternary-weight CNN can be employed on FPGA to push the throughput/cost ratio higher than embedded GPUs. In this project, you will implement a hardware accelerator and integrate it into a fairly complete FPGA/Zynq-based system with camera etc. for real-time pose detection. || FPGA/Zynq || HW & SW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
+
| available || MA/SA || Low-Power Systolic LSTM Demonstrator || Recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) RNNs,  achieve state-of-the-art performance in time series analysis such as speech recognition. We are currently building up a complete speech recognition demonstrator based on a systolic grid of in-house designed LSTM accelerators called Muntaniala. The goal of the project is to build an "overall low power" Muntaniala systolic system demonstrator using a low power microcontroller (e.g. PULP) or a low power FPGA. || ASIC/FPGA || HW (ASIC) / HW (FPGA) / SW (microcontr.) || [[:User:paulin|Gianna Paulin]]
 
|-
 
|-
| available || 1-2x SA or 1x MA ||Data Bottlenecks in DNNs || In many systems, we have a combination of remote sensing nodes and centralized analysis. Such systems' operating cost and energy consumption is often dominated by communication, such that data compression becomes crucial. The strongest compression is usually archieved when performing the whole analysis on the sensor node and transmitting the result (e.g. a label instead of an image), but the sensor node might not have enough processing power available or the data of multiple sensor nodes has to be combined for a good classification/estimate/result. In this project, you will explore how to train DNNs for such problems with a data bottleneck within the DNN, where you will be using a not-yet-published quantization method. If taken as a MA, the result of the algorithmic exploration can be implemented on an embedded platform. || Workstation || SW (algorithm evals) || [[:User:lukasc|Lukas Cavigelli]], Matteo Spallanzani
+
| available || MA/SA || Quantized Training of Recurrent Neural Networks || Recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) RNNs,  achieve state-of-the-art performance in time series analysis such as speech recognition. RNNs come with additional challenges such as an internal state that needs to be stored and regularly updated, a very large memory footprint and high bandwidth requirements. Research in the last few years has shown that most neural networks can be quantized with a small accuracy cost. The goal of the project is to train a quantized LSTM RNN. || GPU || SW (GPU) || [[:User:paulin|Gianna Paulin]]
 
|}
 
|}
  
Line 46: Line 42:
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
+
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! Supervisors
 +
|-
 +
| taken || 1x SA || Stand-Alone Edge Computing with GAP8 || Detailed description: [[Stand-Alone_Edge_Computing_with_GAP8]] || Embedded || SW/HW (PCB-level) || [[:User:andrire|Renzo Andri]], [[:User:lukasc|Lukas Cavigelli]], Andres Gomez, Naomi Stricker (TIK)
 
|-
 
|-
| taken || SA || SAR Data Analysis || We would like to explore the automated analysis of aerial synthetic aperture radar (SAR) images. Essentially, we have one very high-resolution image of a Swiss city and no labels. This project is not about labeling a lot of data, but to explore various options for supervised (cf. [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7827114 paper]) or semi-/unsupervised learning to segment these images using very few labeled data. || Workstation|| SW (algo evals) || [[:User:xiaywang|Xiaying Wang]], [[:User:lukasc|Lukas Cavigelli]], [[:User:magnom|Michele Magno]]
+
| taken || 2x SA|| TWN HW Accel. || INQ (incremental network quantization) is a quantization technique which has been proven to work very well for neural networks. The weights are quantized to levels of {+-2^n, 0}, and we are particularly interested in the case of {-1,0,1}. This way we get rid of all the multiplications and much more compactly store the weights on-chip, which is great for HW acceleration. In order to keep the flexibility and ease of use in an actual system, we would like to integrate this accelerator into a PULP(issimo) processor system. In this thesis, you will develop the accelerator and/or integrate it into the PULPissimo system. || ASIC || HW (ASIC) & SW || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]], Fabian Schuiki
 
|-
 
|-
| taken || MA/2x SA || DNN Training Accelerator || The compute effort to train state-of-the-art CNNs is tremendous and largely done on GPUs, or less frequently on specialized HW (e.g. Google's TPUs). Their energy effiency and often performance is limited by DRAM accesses. When storing all the data required for the gradient descent step of typical DNNs, there is no way to store it in on-chip SRAM--even across multiple, very large chips. Recently, Invertible ResNets has been presented (cf. [https://arxiv.org/pdf/1707.04585.pdf paper]) and allows to trade these storage requirements for some more compute effort--a huge opportunity. In this project, you will perform an architecture exploration to analyze how this could best be exploited. || ASIC || HW (ASIC) || [[:User:lukasc|Lukas Cavigelli]]
+
| taken || 1x SA || Data Bottlenecks in DNNs || In many systems, we have a combination of remote sensing nodes and centralized analysis. Such systems' operating cost and energy consumption is often dominated by communication, such that data compression becomes crucial. The strongest compression is usually archieved when performing the whole analysis on the sensor node and transmitting the result (e.g. a label instead of an image), but the sensor node might not have enough processing power available or the data of multiple sensor nodes has to be combined for a good classification/estimate/result. In this project, you will explore how to train DNNs for such problems with a data bottleneck within the DNN, where you will be using a not-yet-published quantization method. If taken as a MA, the result of the algorithmic exploration can be implemented on an embedded platform. || Workstation || SW (algorithm evals) || [[:User:lukasc|Lukas Cavigelli]], Matteo Spallanzani
 
|}
 
|}
  
Line 58: Line 56:
 
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
 
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
 
|-
 
|-
| completed FS18 || SA || CBinfer for Speech Recognition || We have recently published an approach to dramatically reduce computation effort when performing object detection on video streams with limited frame-to-frame changes (cf. [https://arxiv.org/pdf/1704.04313.pdf paper]). We think this approach could also be applied to audio signals for continuous listening to void commands: when looking at MFCCs or the short-term Fourier transform, changes in the spectrum between neighboring time windows are also limited. || Embedded GPU (Tegra X2) || SW (GPU, algo evals) || [[:User:lukasc|Lukas Cavigelli]]
+
| complete HS18|| 1x MA || DNN Training Accelerator || The compute effort to train state-of-the-art CNNs is tremendous and largely done on GPUs, or less frequently on specialized HW (e.g. Google's TPUs). Their energy effiency and often performance is limited by DRAM accesses. When storing all the data required for the gradient descent step of typical DNNs, there is no way to store it in on-chip SRAM--even across multiple, very large chips. Recently, Invertible ResNets has been presented (cf. [https://arxiv.org/pdf/1707.04585.pdf paper]) and allows to trade these storage requirements for some more compute effort--a huge opportunity. In this project, you will perform an architecture exploration to analyze how this could best be exploited. || ASIC || HW (ASIC) || [[:User:lukasc|Lukas Cavigelli]]
 +
|-
 +
| completed HS18 || 1x MA || One-shot/Few-shot Learning || One-shot learning comes in handy whenever it is not possible to collect a large dataset. Consider for example face identification as a form of opening you apartment's door, where the user provides a single picture (not 100s) and is recognized reliably from then on. In this project you would apply a method called Prototypical Networks (cf. [[https://arxiv.org/abs/1703.05175 paper], [https://github.com/jakesnell/prototypical-networks code]]) to learn to identify faces. Once you have trained such a DNN, you will optimize it for an embedded system to run it in real time. For a master thesis, an interesting additional step could be to look at expanding this further to share information between multiple nodes/cameras and learn to re-identify faces also as they evolve over time. || Embedded GPU or Microcontroller || SW (algo, uC) || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]]
 +
|-
 +
| completed HS18 || 1x SA || SAR Data Analysis || We would like to explore the automated analysis of aerial synthetic aperture radar (SAR) images. Essentially, we have one very high-resolution image of a Swiss city and no labels. This project is not about labeling a lot of data, but to explore various options for supervised (cf. [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7827114 paper]) or semi-/unsupervised learning to segment these images using very few labeled data. || Workstation|| SW (algo evals) || [[:User:xiaywang|Xiaying Wang]], [[:User:lukasc|Lukas Cavigelli]], [[:User:magnom|Michele Magno]]
 
|-
 
|-
| completed HS18 || MA || One-shot/Few-shot Learning || One-shot learning comes in handy whenever it is not possible to collect a large dataset. Consider for example face identification as a form of opening you apartment's door, where the user provides a single picture (not 100s) and is recognized reliably from then on. In this project you would apply a method called Prototypical Networks (cf. [[https://arxiv.org/abs/1703.05175 paper], [https://github.com/jakesnell/prototypical-networks code]]) to learn to identify faces. Once you have trained such a DNN, you will optimize it for an embedded system to run it in real time. For a master thesis, an interesting additional step could be to look at expanding this further to share information between multiple nodes/cameras and learn to re-identify faces also as they evolve over time. || Embedded GPU or Microcontroller || SW (algo, uC) || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]]
+
| completed HS18 || 1x SA || Ternary-Weight FPGA System || Together with an external partner we are evaluating how combining binary or ternary-weight CNN can be deployed on FPGAs to push the throughput/cost ratio higher than embedded GPUs. In this project, you will implement a hardware accelerator for ternary weight network and integrate it into a fairly complete FPGA/Zynq-based system with camera etc. for real-time pose detection. || FPGA/Zynq || HW & SW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
 
|-
 
|-
 +
| completed FS18 || 1x SA || CBinfer for Speech Recognition || We have recently published an approach to dramatically reduce computation effort when performing object detection on video streams with limited frame-to-frame changes (cf. [https://arxiv.org/pdf/1704.04313.pdf paper]). We think this approach could also be applied to audio signals for continuous listening to void commands: when looking at MFCCs or the short-term Fourier transform, changes in the spectrum between neighboring time windows are also limited. || Embedded GPU (Tegra X2) || SW (GPU, algo evals) || [[:User:lukasc|Lukas Cavigelli]]
 
|}
 
|}
 
==Where to find us==
 
==Where to find us==
 
[[:User:andrire|Renzo Andri]], ETZ J 76.2, andrire@iis.ee.ethz.ch<br />
 
[[:User:andrire|Renzo Andri]], ETZ J 76.2, andrire@iis.ee.ethz.ch<br />
[[:User:lukasc|Lukas Cavigelli]], ETZ J 76.2, cavigelli@iis.ee.ethz.ch
+
[[:User:lukasc|Lukas Cavigelli]], ETZ J 76.2, cavigelli@iis.ee.ethz.ch<br />
 +
[[:User:paulin|Gianna Paulin]], ETZ J 76.2, pauling@iis.ee.ethz.ch

Revision as of 10:43, 15 May 2019

The projects listed below give you an idea of what we do. We constantly have new project ideas, and some of the listed approaches may become obsolete in this rapidly advancing research area. Please contact the people listed for the project closest to what you would like to do, and come talk to us.

Prerequisites

We have no strict, general requirements, as they depend heavily on the exact project steps. The projects will be adapted to the skills and interests of the student(s) -- just come talk to us! If you don't know about GPU programming or CNNs or ... just let us know and together we can determine a useful way forward -- after all, you are here not only to learn about project work but also to develop your technical skills.

Only hard requirements:

  • Excitement for deep learning
  • For VLSI projects: VLSI 1 or equivalent


Available Projects

{| class="wikitable"
|-
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
|-
| available || MA/SA || On-chip Learning || Neural networks are compute- and resource-intensive and are usually run on power-hungry GPU clusters, but we would also like to use them on ubiquitous IoT devices. To get there, we need to develop new hardware architectures optimized for this application. This also includes exploring new algorithmic approaches that can reduce the compute or memory footprint of these networks. || ASIC || HW (ASIC) || [[:User:andrire|Renzo Andri]]
|-
| available || 1-2x SA || HW Data Compressor for CNNs || The most commonly used hardware accelerators for CNNs are largely limited (energy efficiency, throughput) by the bandwidth to external DRAM. We have recently proposed a novel compression scheme ([https://arxiv.org/pdf/1810.03979.pdf paper]) which would be a very good fit for a hardware implementation. In this project, you will implement the encoder and decoder on an ASIC and/or FPGA, so that we can use it and verify that our claim of hardware suitability truly holds. || ASIC/FPGA || HW (ASIC) / HW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
|-
| available || MA/SA || Low-Power Systolic LSTM Demonstrator || Recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) RNNs, achieve state-of-the-art performance in time series analysis such as speech recognition. We are currently building a complete speech recognition demonstrator based on a systolic grid of in-house designed LSTM accelerators called Muntaniala. The goal of the project is to build an "overall low-power" Muntaniala systolic system demonstrator using a low-power microcontroller (e.g. PULP) or a low-power FPGA. || ASIC/FPGA || HW (ASIC) / HW (FPGA) / SW (microcontr.) || [[:User:paulin|Gianna Paulin]]
|-
| available || MA/SA || Quantized Training of Recurrent Neural Networks || Recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) RNNs, achieve state-of-the-art performance in time series analysis such as speech recognition. RNNs come with additional challenges such as an internal state that needs to be stored and regularly updated, a very large memory footprint, and high bandwidth requirements. Research in the last few years has shown that most neural networks can be quantized with a small accuracy cost. The goal of the project is to train a quantized LSTM RNN. || GPU || SW (GPU) || [[:User:paulin|Gianna Paulin]]
|}
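
To make the quantization idea in the last entry concrete, here is a minimal sketch of uniform symmetric quantization of a weight vector. It is only an illustration: the function names, the single per-vector scale, and the 4-bit setting are assumptions, and it quantizes already-given weights rather than performing the quantized training the project targets.

```python
def quantize_uniform(values, num_bits=8):
    """Map floats to signed integer codes using one shared scale (assumed scheme)."""
    q_max = 2 ** (num_bits - 1) - 1          # largest positive code, e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.88]          # hypothetical example weights
codes, scale = quantize_uniform(weights, num_bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

With round-to-nearest, the reconstruction error of each in-range value is at most half a quantization step (`scale / 2`), which is the kind of accuracy/footprint trade-off the entry refers to.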


Workload types: SW (GPU), SW (microcontr.), SW (algorithm evals), HW (FPGA), HW (ASIC), HW (PCB)


On-Going Projects

{| class="wikitable"
|-
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! Supervisors
|-
| taken || 1x SA || Stand-Alone Edge Computing with GAP8 || Detailed description: [[Stand-Alone_Edge_Computing_with_GAP8]] || Embedded || SW/HW (PCB-level) || [[:User:andrire|Renzo Andri]], [[:User:lukasc|Lukas Cavigelli]], Andres Gomez, Naomi Stricker (TIK)
|-
| taken || 2x SA || TWN HW Accel. || INQ (incremental network quantization) is a quantization technique which has been proven to work very well for neural networks. The weights are quantized to levels of {±2^n, 0}, and we are particularly interested in the case of {-1,0,1}. This way we get rid of all the multiplications and can store the weights on-chip much more compactly, which is great for HW acceleration. To keep the flexibility and ease of use in an actual system, we would like to integrate this accelerator into a PULP(issimo) processor system. In this thesis, you will develop the accelerator and/or integrate it into the PULPissimo system. || ASIC || HW (ASIC) & SW || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]], Fabian Schuiki
|-
| taken || 1x SA || Data Bottlenecks in DNNs || In many systems, we have a combination of remote sensing nodes and centralized analysis. Such systems' operating cost and energy consumption are often dominated by communication, so data compression becomes crucial. The strongest compression is usually achieved when performing the whole analysis on the sensor node and transmitting the result (e.g. a label instead of an image), but the sensor node might not have enough processing power available, or the data of multiple sensor nodes has to be combined for a good classification/estimate/result. In this project, you will explore how to train DNNs for such problems with a data bottleneck within the DNN, using a not-yet-published quantization method. If taken as an MA, the result of the algorithmic exploration can be implemented on an embedded platform. || Workstation || SW (algorithm evals) || [[:User:lukasc|Lukas Cavigelli]], Matteo Spallanzani
|}
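
The TWN HW Accel. entry above notes that weights restricted to {-1,0,1} or to powers of two remove the multiplications. The following is a small software sketch of that arithmetic, assuming integer activations; the function names are hypothetical and this is not the accelerator's actual datapath.

```python
def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, 1}: only additions and subtractions."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0 contributes nothing, so the activation is simply skipped.
    return acc


def pow2_multiply(activation, sign, exponent):
    """Multiply an integer by sign * 2**exponent using a left shift (exponent >= 0)."""
    shifted = activation << exponent
    return shifted if sign >= 0 else -shifted


acts = [3, -5, 7, 2]                      # hypothetical integer activations
tern = [1, 0, -1, -1]                     # ternary weights
print(ternary_dot(tern, acts))            # same value as sum(w * a for w, a in zip(tern, acts))
print(pow2_multiply(5, sign=-1, exponent=3))
```

A ternary weight also needs only two bits of storage (fewer with packing), which is the compact on-chip storage the entry mentions.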

Completed Projects

{| class="wikitable"
|-
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
|-
| completed HS18 || 1x MA || DNN Training Accelerator || The compute effort to train state-of-the-art CNNs is tremendous and largely done on GPUs, or less frequently on specialized HW (e.g. Google's TPUs). Their energy efficiency and often their performance are limited by DRAM accesses. The data required for the gradient descent step of typical DNNs cannot be stored in on-chip SRAM--even across multiple, very large chips. Recently, Invertible ResNets have been presented (cf. [https://arxiv.org/pdf/1707.04585.pdf paper]), which allow trading these storage requirements for some more compute effort--a huge opportunity. In this project, you will perform an architecture exploration to analyze how this could best be exploited. || ASIC || HW (ASIC) || [[:User:lukasc|Lukas Cavigelli]]
|-
| completed HS18 || 1x MA || One-shot/Few-shot Learning || One-shot learning comes in handy whenever it is not possible to collect a large dataset. Consider, for example, face identification as a way of opening your apartment's door, where the user provides a single picture (not 100s) and is recognized reliably from then on. In this project you would apply a method called Prototypical Networks (cf. [[https://arxiv.org/abs/1703.05175 paper], [https://github.com/jakesnell/prototypical-networks code]]) to learn to identify faces. Once you have trained such a DNN, you will optimize it for an embedded system to run it in real time. For a master's thesis, an interesting additional step could be to expand this further to share information between multiple nodes/cameras and learn to re-identify faces as they evolve over time. || Embedded GPU or Microcontroller || SW (algo, uC) || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]]
|-
| completed HS18 || 1x SA || SAR Data Analysis || We would like to explore the automated analysis of aerial synthetic aperture radar (SAR) images. Essentially, we have one very high-resolution image of a Swiss city and no labels. This project is not about labeling a lot of data, but about exploring various options for supervised (cf. [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7827114 paper]) or semi-/unsupervised learning to segment these images using very little labeled data. || Workstation || SW (algo evals) || [[:User:xiaywang|Xiaying Wang]], [[:User:lukasc|Lukas Cavigelli]], [[:User:magnom|Michele Magno]]
|-
| completed HS18 || 1x SA || Ternary-Weight FPGA System || Together with an external partner, we are evaluating how binary or ternary-weight CNNs can be deployed on FPGAs to push the throughput/cost ratio higher than that of embedded GPUs. In this project, you will implement a hardware accelerator for ternary-weight networks and integrate it into a fairly complete FPGA/Zynq-based system with camera etc. for real-time pose detection. || FPGA/Zynq || HW & SW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
|-
| completed FS18 || 1x SA || CBinfer for Speech Recognition || We have recently published an approach to dramatically reduce computation effort when performing object detection on video streams with limited frame-to-frame changes (cf. [https://arxiv.org/pdf/1704.04313.pdf paper]). We think this approach could also be applied to audio signals for continuous listening for voice commands: when looking at MFCCs or the short-time Fourier transform, changes in the spectrum between neighboring time windows are also limited. || Embedded GPU (Tegra X2) || SW (GPU, algo evals) || [[:User:lukasc|Lukas Cavigelli]]
|}
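
As a rough illustration of the Prototypical Networks idea named in the One-shot/Few-shot Learning entry, the sketch below classifies a query by its nearest class prototype, where each prototype is the mean of that class's support embeddings. It assumes the embeddings have already been produced by some trained network; the function names, labels, and 2-D vectors are made up for the example.

```python
def class_prototypes(support):
    """support: dict mapping label -> list of embedding vectors; returns label -> mean vector."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return prototypes


def classify(query, prototypes):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Hypothetical embeddings: one picture for "alice" (one-shot), two for "bob".
support = {"alice": [[1.0, 0.0]], "bob": [[0.0, 1.0], [0.2, 0.8]]}
prototypes = class_prototypes(support)
print(classify([0.9, 0.1], prototypes))
print(classify([0.1, 0.8], prototypes))
```

Adding a new person only requires computing one more prototype from their support picture(s), with no retraining of the embedding network, which is what makes the approach attractive for the door-opening scenario.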

Where to find us

Renzo Andri, ETZ J 76.2, andrire@iis.ee.ethz.ch
Lukas Cavigelli, ETZ J 76.2, cavigelli@iis.ee.ethz.ch
Gianna Paulin, ETZ J 76.2, pauling@iis.ee.ethz.ch