Exploration and Hardware Acceleration of Intra-Layer Mixed-Precision QNNs
- Type: Semester Thesis
- Professor: Prof. Dr. L. Benini
Quantizing the parameters and intermediate activations of neural networks is a key step in enabling their deployment on embedded systems. Recently, much work has been done on mixed-precision quantization, where different parts of a network are executed in different arithmetic precisions. Most, if not all, of these works approach the problem layer-wise: each layer is assigned one uniform bitwidth for its weights and another for its output activations.
In a previous student project, we found that a large majority of a network's weights can be quantized to very low (ternary) precision, while the remaining weights are kept in full precision, with only a minor drop in statistical accuracy. The distinguishing feature of this approach is that multiple bitwidths are used within a single layer. To exploit these results in a real-world system, we need to map the resulting mixed-precision network to a suitable representation and execute it with an appropriate computing paradigm.
We propose to model the layers of such a mixed-precision network as the superposition of a dense, highly quantized layer and a sparse layer executed in higher precision. Adding the results of both layers together is then equivalent to executing the full, mixed-precision layer.
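The decomposition can be illustrated with a small Python sketch. It assumes a hypothetical selection rule (the `n_hp` largest-magnitude weights stay in full precision), a TWN-style threshold `delta` for ternarization and an arbitrarily chosen scaling factor `alpha`; none of these choices are taken from the previous project, and all function names are illustrative. The dense ternary part needs only additions and subtractions, while the sparse part, stored here as `(row, col, value)` triples, needs one full-precision multiplication per non-zero.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Codes = List[List[int]]  # ternary codes in {-1, 0, +1}
SparseEntries = List[Tuple[int, int, float]]  # (row, col, value) triples


def split_layer(weights: Matrix, n_hp: int, delta: float) -> Tuple[Codes, SparseEntries]:
    """Split a weight matrix into dense ternary codes and sparse full-precision entries.

    Hypothetical rule: the n_hp largest-magnitude weights stay in full precision
    (their ternary code is 0); every other weight becomes +1 if w > delta,
    -1 if w < -delta and 0 otherwise.
    """
    ranked = sorted(
        ((abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)),
        reverse=True,
    )
    hp = {(r, c) for _, r, c in ranked[:n_hp]}
    codes = [
        [0 if (r, c) in hp else (1 if w > delta else -1 if w < -delta else 0)
         for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]
    entries = [(r, c, weights[r][c]) for r, c in sorted(hp)]
    return codes, entries


def ternary_matvec(codes: Codes, alpha: float, x: Vector) -> Vector:
    """Dense low-precision part: additions/subtractions only, one scaling per output."""
    out = []
    for row in codes:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        out.append(alpha * acc)
    return out


def sparse_matvec(entries: SparseEntries, x: Vector, n_rows: int) -> Vector:
    """Sparse high-precision part: one full-precision multiply-accumulate per non-zero."""
    out = [0.0] * n_rows
    for r, c, w in entries:
        out[r] += w * x[c]
    return out


def mixed_layer(codes: Codes, alpha: float, entries: SparseEntries, x: Vector) -> Vector:
    """Superposition: add the outputs of the dense and the sparse layer."""
    dense = ternary_matvec(codes, alpha, x)
    sparse = sparse_matvec(entries, x, len(codes))
    return [d + s for d, s in zip(dense, sparse)]


weights = [[1.0, -0.125, 0.375], [-0.0625, -1.5, 0.5]]
codes, entries = split_layer(weights, n_hp=2, delta=0.25)
print(codes)    # [[0, 0, 1], [0, 0, 1]]
print(entries)  # [(0, 0, 1.0), (1, 1, -1.5)]
print(mixed_layer(codes, alpha=0.5, entries=entries, x=[1.0, 2.0, 3.0]))  # [2.5, -1.5]
```

Because the layer is linear, the split is exact: the result matches a single matrix-vector product with the mixed-precision weights `[[1.0, 0.0, 0.5], [0.0, -1.5, 0.5]]`. Only the cost of the two parts differs.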
This decomposition allows us to reap the energy-efficiency benefits of low-bitwidth quantization while avoiding the considerable accuracy drop usually associated with aggressive quantization. However, it comes at the cost of performing a small fraction of the operations in higher precision. Our goal is therefore to keep the overhead of the sparse, high-precision part of the layer as small as possible while reaching an optimal trade-off between statistical accuracy and energy efficiency.
- Sparse arithmetic benefits greatly from regularity in the pattern of non-zero values. Thus, your first task will be to explore the accuracy impact of imposing regularity on the distribution of the high-precision values. To this end, you will perform a grid search over the relevant parameters, such as the density and spatial distribution of the high-precision values.
- Having found a good trade-off of regularity and network accuracy, you will devise a hardware architecture to accelerate the inference of such mixed-precision networks.
- You will then implement this hardware architecture, evaluate its performance, area and energy consumption on the benchmark network explored in the first step, and compare the results to the literature.
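One possible way to impose regularity in the first task is an N:M-style constraint, in which every block of `m` consecutive weights in a row keeps its `n` largest-magnitude weights in high precision. The Python sketch below enumerates a small grid of `(n, m)` settings and reports, for each, the high-precision density and the squared weight-reconstruction error. The constraint, the helper names, `delta`, `alpha` and the error metric are illustrative assumptions; the error is only a cheap proxy, whereas the actual study would evaluate the statistical accuracy of the network.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Matrix = List[List[float]]
Mask = Set[Tuple[int, int]]  # (row, col) positions kept in high precision


def block_hp_mask(weights: Matrix, n: int, m: int) -> Mask:
    """N:M-style regularity: in every block of m consecutive weights of a row,
    keep the n largest-magnitude weights in high precision."""
    mask: Mask = set()
    for r, row in enumerate(weights):
        for start in range(0, len(row), m):
            block = range(start, min(start + m, len(row)))
            top = sorted(block, key=lambda c: abs(row[c]), reverse=True)[:n]
            mask.update((r, c) for c in top)
    return mask


def reconstruction_error(weights: Matrix, mask: Mask, delta: float, alpha: float) -> float:
    """Squared error between the original and the mixed-precision weights.

    Only a cheap proxy: the real study would evaluate network accuracy.
    """
    err = 0.0
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            if (r, c) in mask:
                continue  # kept exactly in full precision
            q = alpha if w > delta else -alpha if w < -delta else 0.0
            err += (w - q) ** 2
    return err


def grid_search(weights: Matrix, ns: Sequence[int], ms: Sequence[int],
                delta: float, alpha: float) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Map each (n, m) setting to (high-precision density, reconstruction error)."""
    total = sum(len(row) for row in weights)
    results = {}
    for n, m in product(ns, ms):
        if n >= m:
            continue  # n >= m would keep every weight in high precision
        mask = block_hp_mask(weights, n, m)
        results[(n, m)] = (len(mask) / total, reconstruction_error(weights, mask, delta, alpha))
    return results


weights = [[1.0, -0.125, 0.375, -0.75], [-0.0625, -1.5, 0.5, 0.25]]
results = grid_search(weights, ns=[1], ms=[2, 4], delta=0.25, alpha=0.5)
print(results)  # {(1, 2): (0.5, 0.09765625), (1, 4): (0.25, 0.16015625)}
```

In this toy example, the coarser 1:4 pattern halves the high-precision density compared to 1:2, at the price of a larger reconstruction error. A real exploration would also vary the spatial distribution of the blocks (for example, along rows or along input channels) and measure the network's accuracy for each setting.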
The project can be simplified, adapted, or extended to suit your needs and wishes.
- Interest in embedded machine learning applications and/or accelerator design
- Experience with HDLs (preferably SystemVerilog) as taught in VLSI I
- Familiarity with Python