We list a few projects below to give you an idea of what we do. However, we constantly have new project ideas, and some approaches may become obsolete in this rapidly advancing research area. Please contact the people responsible for the project most similar to what you would like to do, and '''come talk to us'''.
==What is Deep Learning?==

Nowadays, machine learning systems are the go-to choice when the cost of analytically deriving closed-form expressions to solve a given problem is prohibitive (e.g., it is very time-consuming, or the knowledge about the problem is insufficient). Machine learning systems can be particularly effective when the amount of data is large, since the statistics are expected to become more and more stable as the amount of data increases.

Amongst machine learning systems, deep neural networks (DNNs) have established a reputation for their effectiveness and simplicity. To understand this success as compared to that of other machine learning systems, it is important to consider not only the accuracy of DNNs, but also their computational properties. The training algorithm (an iterative application of backpropagation and stochastic gradient descent) is linear in the data set size, making it more appealing in big-data contexts than, for instance, support vector machines (SVMs). DNNs do not use branching instructions, making them predictable programs and allowing the design of efficient access patterns for the memory hierarchies of computing devices (exploiting spatial and temporal locality). DNNs are parallelizable, both at the neuron level and at the layer level. These predictability and parallelizability properties make DNNs an ideal fit for modern SIMD architectures and distributed computing systems.
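To make this concrete, here is a minimal PyTorch sketch of one such BP+SGD training loop; the data, layer sizes, and hyperparameters are made up purely for illustration:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Toy data: 1024 samples, 16 features, 4 classes (sizes chosen only for illustration).
x = torch.randn(1024, 16)
y = torch.randint(0, 4, (1024,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    # One pass over the data set: the cost grows linearly with its size.
    for xb, yb in zip(x.split(64), y.split(64)):
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()    # backpropagation computes the gradients
        optimizer.step()   # stochastic gradient descent updates the parameters
</syntaxhighlight>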
  
The main drawback of these systems is their size: millions or even billions of parameters are a common feature of many top-performing DNNs, and a proportional number of arithmetic operations must be performed to process each data sample. Hence, to reduce the pressure of DNNs on the underlying computing infrastructure, research in computational deep learning has focused on two families of optimizations: topological and hardware-oriented.

'''Topological optimizations''' are concerned with network topologies (AKA network architectures) that are more efficient in terms of accuracy-per-parameter or accuracy-per-MAC (multiply-accumulate operation). As a specific form of topological optimization, '''pruning''' strategies aim at maximizing the number of zero-valued operands (parameters and/or activations) in order to 1) take advantage of sparsity (for storing the model) and 2) minimize the number of effective arithmetic operations (i.e., the operations not involving zero-valued operands, which must actually be executed). '''Hardware-oriented optimizations''' are instead concerned with replacing time-consuming and energy-hungry operations, such as evaluations of transcendental functions or floating-point MAC operations, with more efficient counterparts, such as piecewise-linear activation functions (e.g., the ReLU) and integer MAC operations (as in quantized neural networks, QNNs).
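As a toy illustration of the pruning idea, the following PyTorch sketch (a simplified example, not one of our in-house tools) zeroes out the smallest-magnitude weights of a fully-connected layer and counts how many MACs remain effective:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float) -> nn.Linear:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    with torch.no_grad():
        w = layer.weight
        k = int(sparsity * w.numel())
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())
    return layer

layer = nn.Linear(256, 128)       # toy layer
magnitude_prune(layer, sparsity=0.9)

total_macs = layer.weight.numel()                # one MAC per weight for each input sample
effective_macs = int((layer.weight != 0).sum())  # MACs whose weight operand is non-zero
print(f"effective MACs per sample: {effective_macs}/{total_macs}")
</syntaxhighlight>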
==Hardware-oriented neural architecture search (NAS)==
The problems of topology selection and pruning can be considered instances of the classical statistics problems of model selection and feature selection, respectively. In the scope of deep learning, model selection is also called neural architecture search (NAS).

When designing a DNN topology, you have a large number of degrees of freedom at your disposal: the number of layers, the number of neurons in each layer, the connectivity of each neuron, and so on; moreover, the number of choices for each degree of freedom is huge. These properties imply that the design space for a DNN can grow exponentially, making exhaustive searches prohibitive. Therefore, to increase the efficiency of the exploration, stochastic optimization tools are the preferred choice: evolutionary algorithms, reinforcement learning, gradient-based techniques, or even random graph generation.

An interesting feature of model selection is that specific constraints can be enforced on the search space so that desired properties are always respected. For instance, given a storage budget describing a hard limitation of the chosen computing platform, the network generation algorithm can be restricted to propose only topologies that do not exceed a given number of parameters. This capability of incorporating HW features as constraints on the search space makes NAS algorithms very interesting in the context of generating HW-friendly DNNs.
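The sketch below illustrates the constraint idea with a deliberately simple random search over hypothetical MLP topologies: candidates whose parameter count exceeds an assumed storage budget are rejected before they are ever trained. Real NAS algorithms are far more sophisticated, but the budget check works the same way:

<syntaxhighlight lang="python">
import random
import torch.nn as nn

PARAM_BUDGET = 100_000   # hypothetical storage budget, expressed as a parameter count

def sample_topology() -> nn.Module:
    """Randomly sample a small MLP topology (depth and layer widths)."""
    depth = random.randint(1, 4)
    widths = [random.choice([32, 64, 128, 256]) for _ in range(depth)]
    layers, in_features = [], 64                 # assume 64 input features
    for w in widths:
        layers += [nn.Linear(in_features, w), nn.ReLU()]
        in_features = w
    layers.append(nn.Linear(in_features, 10))    # assume 10 output classes
    return nn.Sequential(*layers)

def constrained_random_search(n_candidates: int = 100) -> list:
    """Keep only the sampled topologies that respect the hardware constraint."""
    feasible = []
    for _ in range(n_candidates):
        model = sample_topology()
        n_params = sum(p.numel() for p in model.parameters())
        if n_params <= PARAM_BUDGET:             # HW feature enforced as a search-space constraint
            feasible.append(model)
    return feasible

print(len(constrained_random_search()), "candidate topologies fit the budget")
</syntaxhighlight>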
{|
| style="padding: 10px" | [[File:Thorir.jpg|frameless|left|96px]]
|
===[[:User:Thoriri| Thorir Mar Ingolfsson]]===
* '''e-mail''': [mailto:thoriri@iis.ee.ethz.ch thoriri@iis.ee.ethz.ch]
* '''phone''': +41 44 633 88 43
* '''office''': ETZ J76.2
|}
{|
| style="padding: 10px" | [[File:Matteo_sp.JPG|frameless|left|96px]]
|
===[[:User:Spmatteo| Matteo Spallanzani]]===
* '''e-mail''': [mailto:spmatteo@iis.ee.ethz.ch spmatteo@iis.ee.ethz.ch]
* '''phone''': +41 44 633 84 70
* '''office''': ETZ J76.2
|}

==Training algorithms for quantized neural networks (QNNs)==
The typical training algorithm for DNNs is an iterative application of the backpropagation algorithm (BP) and stochastic gradient descent (SGD).

When the quantization is not "aggressive" (i.e., when the parameters and feature maps can be represented as integers with a precision of 8 bits or more), many solutions are available, either in the specialized literature or in commercial software, that can convert models pre-trained with gradient descent into quantized counterparts (post-training quantization).

When the precision is extremely reduced (i.e., 1-bit or 2-bit operands), these solutions can no longer be applied, and quantization-aware training algorithms are needed. The naive application of gradient descent (which, in theory, is not even correct, since the quantizer's gradient is zero almost everywhere) to train such QNNs yields major accuracy drops. Hence, it is likely that suitable training algorithms for QNNs will have to replace the standard BP+SGD scheme, which is suited to differentiable optimization, with search strategies that are more apt for discrete optimization.
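A common building block of quantization-aware training is the straight-through estimator (STE), which gives the non-differentiable quantizer a usable surrogate gradient. The following PyTorch sketch shows the idea for binary weights; it is a generic illustration, not the specific training scheme investigated in this project:

<syntaxhighlight lang="python">
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize weights in the forward pass; pass gradients straight through in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)              # quantized weights in {-1, 0, +1} (0 only where w == 0)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # The true gradient of sign() is zero almost everywhere; the STE ignores this
        # and forwards the gradient, masked where the latent weight is already saturated.
        return grad_output * (w.abs() <= 1).float()

# Usage: keep full-precision "latent" weights and quantize them on the fly.
w = torch.randn(4, 4, requires_grad=True)
w_q = BinarizeSTE.apply(w)                # quantized weights used for the forward computation
loss = (w_q.sum()) ** 2
loss.backward()                           # gradients reach the latent weights via the STE
print(w.grad)
</syntaxhighlight>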
{|
| style="padding: 10px" | [[File:Matteo_sp.JPG|frameless|left|96px]]
|
===[[:User:Spmatteo| Matteo Spallanzani]]===
* '''e-mail''': [mailto:spmatteo@iis.ee.ethz.ch spmatteo@iis.ee.ethz.ch]
* '''phone''': +41 44 633 84 70
* '''office''': ETZ J76.2
|}

==Hardware Acceleration of DNNs and QNNs==
Deep Learning (DL) and Artificial Intelligence (AI) are quickly becoming dominant paradigms for all kinds of analytics, complementing or replacing traditional data science methods. Successful at-scale deployment of these algorithms requires deploying them directly at the data source, i.e. in the IoT end-nodes collecting the data. However, due to the extreme constraints of these devices (in terms of power, memory footprint, and area cost), performing full DL inference in-situ in low-power end-nodes requires a breakthrough in computational performance and efficiency.

It is widely known that the numerical representation typically used when developing DL algorithms (single-precision floating-point) encodes a higher precision than what is actually required to achieve high quality-of-results in inference (Courbariaux et al. 2016); this fact can be exploited in the design of energy-efficient hardware for DL.

For example, by using ternary weights, which means that all network weights are quantized to {-1,0,1}, we can design the fundamental compute units in hardware without using an HW-expensive multiplication unit. Additionally, ternary weights can be stored much more compactly on-chip.
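The following NumPy sketch (toy sizes, random data) shows why ternary weights remove the need for multipliers: every output is just a sum of some activations minus a sum of others:

<syntaxhighlight lang="python">
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with weights in {-1, 0, +1}: no multiplications needed,
    each output is a sum of selected activations minus a sum of others."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))              # toy ternary weight matrix
x = rng.standard_normal(8)
print(np.allclose(ternary_matvec(W, x), W @ x))   # matches the full multiply-accumulate result
# Each ternary weight also needs only 2 bits of storage instead of 32.
</syntaxhighlight>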
{|
| style="padding: 10px" | [[File:gianna.jpg|frameless|left|96px]]
|
===[[:User:Paulin| Gianna Paulin]]===
* '''e-mail''': [mailto:pauling@iis.ee.ethz.ch pauling@iis.ee.ethz.ch]
* '''phone''': +41 44 632 60 80
* '''office''': ETZ J76.2
|}
{|
| style="padding: 10px" | [[File:Georg.jpg|frameless|left|96px]]
|
===[[:User:Georg| Georg Rutishauser]]===
* '''e-mail''': [mailto:georgr@iis.ee.ethz.ch georgr@iis.ee.ethz.ch]
* '''phone''': +41 44 632 54 97
* '''office''': ETZ J68.2
|}
{|
| style="padding: 10px" | [[File:Moritz_scherer.jpg|frameless|left|96px]]
|
===[[:User:Scheremo| Moritz Scherer]]===
* '''e-mail''': [mailto:scheremo@iis.ee.ethz.ch scheremo@iis.ee.ethz.ch]
* '''phone''': +41 44 632 77 86
* '''office''': ETZ J69.2
|}
{|
| style="padding: 10px" | [[File:Tim_Fischer.jpeg|frameless|left|96px]]
|
===[[:User:Fischeti| Tim Fischer]]===
* '''e-mail''': [mailto:fischeti@iis.ee.ethz.ch fischeti@iis.ee.ethz.ch]
* '''phone''': +41 44 632 59 12
* '''office''': ETZ J76.2
|}
{|
| style="padding: 10px" | [[File:Arpan_Suravi_Prasad.jpeg|frameless|left|96px]]
|
===[[:User:Prasadar| Arpan Suravi Prasad]]===
* '''e-mail''': [mailto:prasadar@iis.ee.ethz.ch prasadar@iis.ee.ethz.ch]
* '''phone''': +41 44 632 44 91
* '''office''': ETZ J89
|}
==Event-Driven Computing==
With the increasing demand for "smart" algorithms on mobile and wearable devices, the energy cost of computing is becoming the bottleneck for battery lifetime. One approach to defusing this bottleneck is to reduce the compute activity on such devices: one of the most popular approaches uses sensor information to determine whether it is worth running expensive computations or whether there is not enough activity in the environment. This approach is called event-driven computing.

Event-driven architectures can be implemented for many applications, from pure sensing platforms to multi-core systems for machine learning on the edge.

At IIS, we cover most of these applications. Besides working with novel, state-of-the-art sensors and sensing platforms to push the limits of the lifetime of wearables and mobile devices, we also work with cutting-edge computing systems like Intel Loihi for Spiking Neural Networks to minimize the energy cost of machine intelligence.
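In pseudocode-like Python (with purely hypothetical placeholder functions for the sensor read and the DNN), the basic control flow of such a system looks like this:

<syntaxhighlight lang="python">
import time

ACTIVITY_THRESHOLD = 0.2   # hypothetical threshold, tuned per sensor and application

def read_sensor_energy() -> float:
    """Placeholder for a cheap, always-on sensor read (e.g., audio RMS energy)."""
    return 0.0

def run_dnn_inference() -> str:
    """Placeholder for the expensive classification we only want to run on demand."""
    return "class_0"

def event_driven_loop(max_iterations: int = 1000):
    for _ in range(max_iterations):
        energy = read_sensor_energy()        # the cheap check runs continuously
        if energy > ACTIVITY_THRESHOLD:      # an "event": enough activity in the environment
            label = run_dnn_inference()      # the expensive DNN runs only when triggered
            print("detected:", label)
        time.sleep(0.01)                     # duty-cycle the polling to save energy
</syntaxhighlight>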
  
{|
| style="padding: 10px" | [[File:Adimauro.png|frameless|left|96px]]
|
===[[:User:Adimauro| Alfio Di Mauro]]===
* '''e-mail''': [mailto:adimauro@iis.ee.ethz.ch adimauro@iis.ee.ethz.ch]
* '''phone''': +41 44 632 82 19
* '''office''': ETZ J78
|}
{|
| style="padding: 10px" | [[File:Moritz_scherer.jpg|frameless|left|96px]]
|
===[[:User:Scheremo| Moritz Scherer]]===
* '''e-mail''': [mailto:scheremo@iis.ee.ethz.ch scheremo@iis.ee.ethz.ch]
* '''phone''': +41 44 632 77 86
* '''office''': ETZ J69.2
|}

==Prerequisites==
We have no strict, general requirements, as they are highly dependent on the exact project steps. The projects will be adapted to the skills and interests of the student(s) -- just come talk to us! If you don't know about GPU programming or CNNs or ..., just let us know and we can determine together what is a useful way to go -- after all, you are here not only to learn about project work, but also to develop your technical skills.

The only hard requirements are:
* '''Excitement for deep learning'''
* For '''HW Design''' projects: '''VLSI 1, VLSI 2''' or equivalent

==Tags==
All our projects are categorized into three categories, so look out for the following tags:
* '''Algorithmic''' - you will mainly perform algorithmic evaluations using languages and frameworks such as Python, PyTorch, TensorFlow, and our in-house frameworks like Quantlab, DORY, and NEMO
* '''Embedded Coding''' - you will implement e.g. C code for one of our microcontrollers
* '''HW Design''' - you will design HW, including writing RTL and simulating, synthesizing, and laying out (backend) the design

==Available Projects==
New projects are constantly being added, so check back often! If you have any questions or would like to propose your own ideas, do not hesitate to contact us!
<DynamicPageList>
category = Available
category = Digital
category = Deep Learning Projects
suppresserrors=true
ordermethod=sortkey
order=ascending
</DynamicPageList>
<DynamicPageList>
category = Available
category = Digital
category = Event-Driven Computing
suppresserrors=true
</DynamicPageList>

==Projects in Progress==
<DynamicPageList>
category = In progress
category = Digital
category = Deep Learning Projects
suppresserrors=true
ordermethod=sortkey
order=ascending
</DynamicPageList>
  
 
==Completed Projects==
<DynamicPageList>
category = Completed
category = Digital
category = Deep Learning Projects
suppresserrors=true
ordermethod=sortkey
order=ascending
</DynamicPageList>

{| class="wikitable"
|-
! Status !! Type !! Project Name !! Description !! Platform !! Workload Type !! First Contact(s)
|-
| completed FS19 || 2x SA || TWN HW Accel. || INQ (incremental network quantization) is a quantization technique that has been proven to work very well for neural networks. The weights are quantized to levels of {+-2^n, 0}, and we are particularly interested in the case of {-1,0,1}. This way, we get rid of all multiplications and can store the weights much more compactly on-chip, which is great for HW acceleration. In order to keep the flexibility and ease of use in an actual system, we would like to integrate this accelerator into a PULP(issimo) processor system. In this thesis, you will develop the accelerator and/or integrate it into the PULPissimo system. || ASIC || HW (ASIC) & SW || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]], Georg Rutishauser
|-
| completed FS19 || 1x SA || Stand-Alone Edge Computing with GAP8 || Detailed description: [[Stand-Alone_Edge_Computing_with_GAP8]] || Embedded || SW/HW (PCB-level) || [[:User:andrire|Renzo Andri]], [[:User:lukasc|Lukas Cavigelli]], Andres Gomez, Naomi Stricker (TIK)
|-
| completed FS19 || 1x SA || Data Bottlenecks in DNNs || In many systems, we have a combination of remote sensing nodes and centralized analysis. Such systems' operating cost and energy consumption are often dominated by communication, such that data compression becomes crucial. The strongest compression is usually achieved by performing the whole analysis on the sensor node and transmitting only the result (e.g. a label instead of an image), but the sensor node might not have enough processing power available, or the data of multiple sensor nodes might have to be combined for a good classification/estimate/result. In this project, you will explore how to train DNNs for such problems with a data bottleneck within the DNN, where you will be using a not-yet-published quantization method. If taken as a MA, the result of the algorithmic exploration can be implemented on an embedded platform. || Workstation || SW (algorithm evals) || [[:User:lukasc|Lukas Cavigelli]], Matteo Spallanzani
|-
| completed HS18 || 1x MA || DNN Training Accelerator || The compute effort to train state-of-the-art CNNs is tremendous, and training is largely done on GPUs, or less frequently on specialized HW (e.g. Google's TPUs). Their energy efficiency, and often their performance, is limited by DRAM accesses. When storing all the data required for the gradient-descent step of typical DNNs, there is no way to fit it in on-chip SRAM -- even across multiple, very large chips. Recently, invertible ResNets have been presented (cf. [https://arxiv.org/pdf/1707.04585.pdf paper]), which allow trading these storage requirements for some additional compute effort -- a huge opportunity. In this project, you will perform an architecture exploration to analyze how this could best be exploited. || ASIC || HW (ASIC) || [[:User:lukasc|Lukas Cavigelli]]
|-
| completed HS18 || 1x MA || One-shot/Few-shot Learning || One-shot learning comes in handy whenever it is not possible to collect a large dataset. Consider, for example, face identification as a way of opening your apartment's door, where the user provides a single picture (not hundreds) and is recognized reliably from then on. In this project, you would apply a method called Prototypical Networks (cf. [https://arxiv.org/abs/1703.05175 paper], [https://github.com/jakesnell/prototypical-networks code]) to learn to identify faces. Once you have trained such a DNN, you will optimize it for an embedded system to run it in real time. For a master thesis, an interesting additional step could be to expand this further to share information between multiple nodes/cameras and learn to re-identify faces as they evolve over time. || Embedded GPU or Microcontroller || SW (algo, uC) || [[:User:lukasc|Lukas Cavigelli]], [[:User:andrire|Renzo Andri]]
|-
| completed HS18 || 1x SA || SAR Data Analysis || We would like to explore the automated analysis of aerial synthetic aperture radar (SAR) images. Essentially, we have one very high-resolution image of a Swiss city and no labels. This project is not about labeling a lot of data, but about exploring various options for supervised (cf. [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7827114 paper]) or semi-/unsupervised learning to segment these images using very little labeled data. || Workstation || SW (algo evals) || [[:User:xiaywang|Xiaying Wang]], [[:User:lukasc|Lukas Cavigelli]], [[:User:magnom|Michele Magno]]
|-
| completed HS18 || 1x SA || Ternary-Weight FPGA System || Together with an external partner, we are evaluating how binary or ternary-weight CNNs can be deployed on FPGAs to push the throughput/cost ratio beyond that of embedded GPUs. In this project, you will implement a hardware accelerator for ternary-weight networks and integrate it into a fairly complete FPGA/Zynq-based system with camera etc. for real-time pose detection. || FPGA/Zynq || HW & SW (FPGA) || [[:User:lukasc|Lukas Cavigelli]]
|-
| completed FS18 || 1x SA || CBinfer for Speech Recognition || We have recently published an approach to dramatically reduce the computation effort when performing object detection on video streams with limited frame-to-frame changes (cf. [https://arxiv.org/pdf/1704.04313.pdf paper]). We think this approach could also be applied to audio signals for continuous listening to voice commands: when looking at MFCCs or the short-term Fourier transform, changes in the spectrum between neighboring time windows are also limited. || Embedded GPU (Tegra X2) || SW (GPU, algo evals) || [[:User:lukasc|Lukas Cavigelli]]
|}
 
==Where to find us==
 
[[:User:lukasc|Lukas Cavigelli]], ETZ J 76.2, cavigelli@iis.ee.ethz.ch<br />
 
[[:User:paulin|Gianna Paulin]], ETZ J 76.2, pauling@iis.ee.ethz.ch<br />
 
Georg Rutishauser, ETZ J 68.2, georgr@iis.ee.ethz.ch<br />
 
[[:User:andrire|Renzo Andri]], ETZ J 76.2, andrire@iis.ee.ethz.ch<br />
 
