# Exploring schedules for incremental and annealing quantization algorithms

## Introduction

Empirical evidence increasingly suggests that quantisation-aware training algorithms are necessary when targeting aggressive quantisation schemes (1-bit, 2-bit). Most algorithms in this family treat quantisation as an approximation task, where the quantised neural network (QNN) is obtained either by projecting a full-precision network onto a constrained space [1, 2] or by progressively “hardening” a relaxed version of a QNN towards its natural discrete definition [3, 4]. An idea central to some of these algorithms is that of achieving quantisation progressively. The rationale is to allow the part of the network that is still full-precision to compensate for the error introduced at each step of the quantisation process.

For instance, the incremental network quantisation algorithm (INQ) defines a partition P = {p_{1}, …, p_{T}} of the weight space of a given network topology, and assigns to each of its elements p_{i} an integer t_{i} representing its quantisation epoch. The algorithm starts by training a full-precision network; whenever a quantisation epoch t_{i} is reached, the weights in p_{i} are projected onto the corresponding quantised weight space. These weights are then frozen, whereas the weights that have not yet been quantised can still adapt to compensate for the error introduced. In this way, once the last quantisation epoch has been reached, all the network’s weights are quantised.
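The project-and-freeze loop described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the ternary projection, the random placeholder gradients, and all function names (`project_ternary`, `inq_schedule_step`, `sgd_update`) are hypothetical and are not taken from the INQ reference implementation.

```python
import random

def project_ternary(w, threshold=0.5):
    """Project a real weight onto {-1, 0, +1} (assumed quantised space, for illustration)."""
    if w > threshold:
        return 1.0
    if w < -threshold:
        return -1.0
    return 0.0

def inq_schedule_step(weights, frozen, partition, quant_epochs, epoch):
    """Quantise and freeze every subset p_i whose quantisation epoch t_i equals `epoch`."""
    for p_i, t_i in zip(partition, quant_epochs):
        if t_i == epoch:
            for j in p_i:
                weights[j] = project_ternary(weights[j])
                frozen[j] = True

def sgd_update(weights, frozen, grads, lr=0.1):
    """Update only the weights that are still full-precision."""
    for j, g in enumerate(grads):
        if not frozen[j]:
            weights[j] -= lr * g

# Toy run: 6 weights split into T = 3 subsets with quantisation epochs 2, 4, 6.
random.seed(0)
weights = [random.uniform(-1.5, 1.5) for _ in range(6)]
frozen = [False] * len(weights)
partition = [[0, 1], [2, 3], [4, 5]]   # p_1, p_2, p_3 (indices into `weights`)
quant_epochs = [2, 4, 6]               # t_1, t_2, t_3

for epoch in range(1, max(quant_epochs) + 1):
    # Placeholder for real back-propagated gradients.
    grads = [random.gauss(0.0, 1.0) for _ in weights]
    sgd_update(weights, frozen, grads)
    inq_schedule_step(weights, frozen, partition, quant_epochs, epoch)
```

In this sketch the scheduling hyper-parameters are exactly the pair (`partition`, `quant_epochs`), which is the object a parametric schedule model would have to describe.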

Another example is the additive noise annealing algorithm (ANA). Here, the target QNN is regularised by adding noise to its parameters, which allows gradients and updates to be computed. Each parameter w_{j}, j ∊ {1, ..., N}, has an attached annealing schedule f_{j}(t) that governs the amount of noise applied to it. By ensuring that all the schedules {f_{j}(t)}_{j ∊ {1, ..., N}} are decreasing and reach zero by the final training epoch, each regularised layer is sharpened so that it converges towards its quantised counterpart, which is the one that will be implemented on the real hardware. The annealing prioritises the lower layers and then proceeds to the upper layers. The original intuitive rationale was that the features of the lower layers should stabilise before those of the upper layers, allowing the latter to adapt to the hardening of the former. Later, this strategy was also motivated by theoretical results.
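A layer-wise annealing schedule of this kind could look as follows. This is a minimal sketch, assuming a piecewise-linear decay, uniform noise, and one shared schedule per layer; the function names (`linear_schedule`, `layer_schedules`, `noisy_weight`) and the staggering rule are hypothetical choices made for illustration, not the schedules used in the ANA paper.

```python
import random

def linear_schedule(t, start, end, initial=1.0):
    """Noise amplitude f(t): `initial` up to `start`, then linear decay to 0 at `end`."""
    if t <= start:
        return initial
    if t >= end:
        return 0.0
    return initial * (end - t) / (end - start)

def layer_schedules(num_layers, final_epoch):
    """Stagger the decay windows so that lower layers reach zero noise first.

    Layer l decays over [l * step, (l + 1) * step]; the last layer reaches
    zero exactly at `final_epoch`.
    """
    step = final_epoch / num_layers
    return [(l * step, (l + 1) * step) for l in range(num_layers)]

def noisy_weight(w, amplitude, rng):
    """Add uniform noise in [-amplitude, +amplitude] to a weight (assumed noise model)."""
    return w + rng.uniform(-amplitude, amplitude)

schedules = layer_schedules(num_layers=3, final_epoch=30)
amplitudes_at_15 = [linear_schedule(15, s, e) for s, e in schedules]
print(amplitudes_at_15)  # [0.0, 0.5, 1.0]
```

At epoch 15 the first layer is already fully hardened, the second is halfway through its annealing, and the third has not started yet. The schedules here are non-increasing rather than strictly decreasing, and the start and end epochs of each window are the hyper-parameters that would need tuning.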

Although these algorithms have shown promising results, the tuning of their scheduling hyper-parameters is not yet well understood and requires tedious, time-consuming iterative searches. When a learning algorithm is governed by hyper-parameters (e.g., the learning rate and the momentum in stochastic gradient descent), an appealing possibility is to use machine learning systems to learn these hyper-parameters automatically, a process called meta-learning. For instance, this approach has recently been applied to learning the hyper-parameters governing stochastic gradient descent.

## Project description

In this project, we will start by designing suitable parametric models to describe the scheduling processes for the INQ and ANA algorithms. Then, you will be in charge of implementing the experiments and collecting performance data about hand-coded approaches. Subsequently, we will analyse your findings and, if possible, derive some heuristic rules and design meta-learning algorithms that can improve the effectiveness and efficiency of the INQ and ANA algorithms.

If time remains, we could also consider deploying some of the models trained with the improved algorithms on the ternary network accelerators developed by other members of the IIS team.