Deep-Learning Based Phoneme Recognition from an Ultra-Low Power Spiking Cochlea
At the Integrated Systems Laboratory (IIS) we have been working for the past few years on techniques for smart data analytics in ultra-low power (ULP) sensors along the entire technological stack, from HW (e.g. the PULP system) to SW running on microcontrollers, in many cases using convolutional neural networks (CNNs) as the algorithmic “tool” to extract semantic information out of raw data streams. This makes it possible to greatly reduce the amount of data that needs to be collected in a ULP sensor node and sent to a higher-level computing device (e.g. a smartphone or the cloud).
The Institute of NeuroInformatics (INI) of the University of Zurich has recently developed a sensor that fits the “smart low-power sensor” tag well: an artificial cochlea, inspired by the similarly named organ located within the human ear. The natural cochlea is made up of sensory neurons that transform sounds and voices into a representation in terms of action potentials or “spikes”, which are then processed by the cortex. The INI artificial cochlea strives to replicate this functionality in a silicon sensor. In nature, the spiking representation conveys enough information for us to recognize natural sounds, phonemes (and thus language), and music. We have recently realized an interface that can be used to connect the cochlea to a PULP-based chip, as well as to any I2S-equipped microcontroller unit. We believe the next step is to use the sensor to extract semantically advanced information such as phonemes and words in a vastly more energy-efficient way than is possible with conventional sensors.
The goal of this project is to develop a methodology based on convolutional neural networks (CNNs) to extract phoneme information out of the digital spikes produced by the INI cochlea. The project will consist of two main subtasks, plus an optional third one:
1. Create a spiking phoneme dataset using a set of real speakers (both male and female) and/or automated voice generation tools, e.g. by converting existing phoneme datasets to spikes. A diverse dataset helps the trained network generalize and makes training easier.
2. Define a methodology to create “frames” out of raw unframed spiking data, and a topology for the CNN that will be used to extract phonemes. A self-defined topology or a well-known one (e.g. AlexNet) may be used for this task.
3. (Optional) Port the topology defined in step 2 to the Fulmine chip to demonstrate it.
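As an illustration of the framing step in subtask 2, the following is a minimal sketch of one possible approach: binning spikes into fixed-length frames of per-channel spike counts. It is not the project's prescribed method. The event format (timestamp in microseconds, channel index), the function name `frame_spikes`, and the channel count, frame length and bin width in the example are all assumptions made for illustration and do not describe the actual cochlea output.

```python
from typing import List, Sequence, Tuple

# Assumed event format: (timestamp in microseconds, channel index).
Event = Tuple[int, int]


def frame_spikes(events: Sequence[Event], num_channels: int,
                 frame_us: int, bin_us: int) -> List[List[List[int]]]:
    """Split a spike stream into frames of per-channel spike counts.

    Each frame is a num_channels x (frame_us // bin_us) grid; entry [c][b]
    counts the spikes of channel c that fall into time bin b of that frame.
    """
    if frame_us % bin_us != 0:
        raise ValueError("frame_us must be a multiple of bin_us")
    if not events:
        return []
    bins_per_frame = frame_us // bin_us
    t_start = min(t for t, _ in events)
    t_end = max(t for t, _ in events)
    num_frames = (t_end - t_start) // frame_us + 1
    frames = [[[0] * bins_per_frame for _ in range(num_channels)]
              for _ in range(num_frames)]
    for t, ch in events:
        if not 0 <= ch < num_channels:
            continue  # skip events from channels outside the assumed range
        offset = t - t_start
        frame_idx = offset // frame_us
        bin_idx = (offset % frame_us) // bin_us
        frames[frame_idx][ch][bin_idx] += 1
    return frames


# Hypothetical toy stream: 4 channels, 2 ms frames, 1 ms bins.
events = [(0, 0), (400, 1), (999, 1), (1000, 0), (2500, 3)]
frames = frame_spikes(events, num_channels=4, frame_us=2000, bin_us=1000)
```

Each resulting frame is a small 2D array (channels by time bins) that can be fed to a CNN like a single-channel image. The frame length and bin width trade temporal resolution against input size and would need to be tuned on real data.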
The main constraint on the definition and training of the CNN topology is to keep the overall size of the model low, so that it can be implemented effectively on a highly memory-constrained edge node in an IoT device (i.e. PULP or a microcontroller unit). To this end, a few techniques for reduced-precision training developed at the IIS will be made available.
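To get a feeling for the size constraint, the following sketch counts the parameters of a small CNN and estimates the weight storage at different bit widths. The layer shapes, the 40 output classes and the helper names are hypothetical and chosen only for illustration. The estimate covers weights and biases only and ignores activation buffers.

```python
from typing import Tuple

# (in_channels, out_channels, kernel_height, kernel_width)
ConvSpec = Tuple[int, int, int, int]


def conv_params(spec: ConvSpec) -> int:
    """Weights plus one bias per output channel of a convolutional layer."""
    c_in, c_out, k_h, k_w = spec
    return c_out * (c_in * k_h * k_w + 1)


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output of a fully connected layer."""
    return n_out * (n_in + 1)


def model_bytes(num_params: int, bits_per_param: int) -> int:
    """Storage needed for the parameters, rounded up to whole bytes."""
    return (num_params * bits_per_param + 7) // 8


# Hypothetical topology: two 3x3 convolutions and one classifier layer
# with 40 outputs (an assumed number of phoneme classes).
conv_layers = [(1, 8, 3, 3), (8, 16, 3, 3)]
total_params = sum(conv_params(s) for s in conv_layers) + dense_params(256, 40)

for bits in (32, 8, 4):
    print(f"{bits:2d}-bit parameters: {model_bytes(total_params, bits)} bytes")
```

In this toy example the classifier layer dominates the parameter count, and moving from 32-bit to 8-bit parameters reduces the storage by a factor of four, which is why reduced precision matters on a memory-constrained node.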
To work on this project, you will need:
- basic familiarity with a scripting language for deep learning (Python or Lua…)
- a lot of patience!
- to be strongly motivated for a difficult but super-cool project
If you want to work on this project but think that you do not have some of the required skills, we can give you some preliminary exercises to help you fill the gap.
Meetings & Presentations
The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.
Around the middle of the project there is a design review, where senior members of the lab review your work (bring all the relevant information, such as preliminary specifications, block diagrams, synthesis reports, testing strategy, ...) to make sure everything is on track and to decide whether further support is necessary. They also make the final decision on whether the chip is actually manufactured (no reason to worry if the project is on track) and whether more chip area, a different package, ... is provided. For more details confer to .
At the end of the project, you have to present and defend your work in a 15-minute presentation followed by 5 minutes of discussion as part of the IIS colloquium.
- The EDA wiki with lots of information on the ETHZ ASIC design flow (internal only) 
- The IIS/DZ coding guidelines