# Exploring schedules for incremental and annealing quantization algorithms

## Introduction

Empirical evidence suggests that quantization-aware training algorithms are necessary when targeting aggressive quantization schemes (1-bit, 2-bit). Most algorithms in this family interpret the quantization problem as an approximation task, where the quantized network is obtained either by projecting a full-precision network onto a constrained space [1, 2], or by progressively “hardening” a relaxed version of a quantized neural network (QNN) towards its natural discrete definition [3, 4]. An idea central to some of these algorithms is that of achieving quantization progressively. The rationale is to allow the full-precision part of the network to compensate for the error introduced at each step of the quantization process.

For instance, the incremental network quantization algorithm (INQ) defines a partition P = {p_{1}, …, p_{T}} of the weight space of a given network topology, and assigns to each of its elements p_{i} an integer t_{i} representing its quantization epoch. The algorithm starts by training a full-precision network; whenever a quantization epoch t_{i} is reached, the weights in p_{i} are projected onto the corresponding quantized weight space. These weights are then frozen, whereas the weights which have not yet been quantized can adapt to compensate for the error introduced. In this way, once the last quantization epoch has been reached, all the network’s weights are quantized.
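The INQ procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the reference implementation: the random partition, the ternary projection with a fixed threshold, the plain gradient step and the names `make_partition`, `project_ternary`, `run_inq` and `grad_fn` are all assumptions introduced here.

```python
import numpy as np

def make_partition(num_weights, num_groups, rng):
    """Split the weight indices into disjoint groups p_1, ..., p_T.
    A random split is an assumption made for this sketch."""
    perm = rng.permutation(num_weights)
    return np.array_split(perm, num_groups)

def project_ternary(w, threshold=0.5):
    """Hypothetical projection onto {-1, 0, +1}; the quantized
    weight space used in practice may differ."""
    return np.where(np.abs(w) > threshold, np.sign(w), 0.0)

def run_inq(weights, partition, quant_epochs, num_epochs, grad_fn, lr=0.1):
    """Train for `num_epochs`; group p_i is quantized and frozen at epoch t_i."""
    weights = weights.copy()
    frozen = np.zeros(weights.shape, dtype=bool)
    for epoch in range(1, num_epochs + 1):
        # Only the weights that are still full-precision are updated.
        step = lr * grad_fn(weights)
        weights = np.where(frozen, weights, weights - step)
        # Quantize and freeze every group whose epoch has been reached.
        for group, t_i in zip(partition, quant_epochs):
            if epoch == t_i:
                weights[group] = project_ternary(weights[group])
                frozen[group] = True
    return weights, frozen
```

In this sketch the scheduling hyper-parameters are exactly the pair (`partition`, `quant_epochs`), which is the object a parametric schedule model would have to describe.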

Another example is the additive noise annealing algorithm (ANA). In this case, the target QNN is regularised by adding noise to the parameters, which allows gradients and updates to be computed. To each parameter w_{j}, j ∊ {1, ..., N}, an annealing schedule f_{j}(t) is attached, governing the amount of noise applied to that parameter. By ensuring that all the schedules {f_{j}(t)}_{j ∊ {1, ..., N}} are decreasing and reach zero by the final training epoch, each regularised layer is sharpened so that it converges towards the quantized counterpart that will be implemented on the real hardware. The annealing starts with the lower layers and then proceeds to the upper layers. The original intuitive rationale was that the features of the lower layers should stabilize before those of the upper layers, allowing the latter to adapt to the hardening of the former. This strategy has since also been motivated by theoretical results.
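One possible family of annealing schedules with the properties listed above (decreasing, zero at the final epoch, lower layers first) is sketched below. The linear decay, the equal-length consecutive windows, the uniform noise and the names `layer_schedule`, `staggered_windows` and `noisy_weights` are assumptions made for illustration; the schedules used by ANA in practice may look different.

```python
import numpy as np

def layer_schedule(epoch, start, end, sigma0=1.0):
    """Noise amplitude f(t): sigma0 up to `start`, linear decay, zero from `end` on."""
    frac = (end - epoch) / (end - start)
    return sigma0 * float(np.clip(frac, 0.0, 1.0))

def staggered_windows(num_layers, num_epochs):
    """Hypothetical layout: consecutive, equal-length annealing windows,
    so layer 0 (the lowest) is annealed first and the top layer last."""
    edges = np.linspace(0, num_epochs, num_layers + 1)
    return [(edges[i], edges[i + 1]) for i in range(num_layers)]

def noisy_weights(w, sigma, rng):
    """Perturb the parameters with uniform noise of half-width `sigma`
    (the noise distribution is an assumption of this sketch)."""
    return w + sigma * rng.uniform(-1.0, 1.0, size=w.shape)

def noise_levels(num_layers, num_epochs, sigma0=1.0):
    """Table of amplitudes: one row per layer, one column per epoch 0..num_epochs."""
    windows = staggered_windows(num_layers, num_epochs)
    return np.array([
        [layer_schedule(t, s, e, sigma0) for t in range(num_epochs + 1)]
        for (s, e) in windows
    ])
```

Here each layer's schedule is described by two numbers (`start`, `end`), so a network with L layers has 2L scheduling hyper-parameters under this parametrisation.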

Although these algorithms have shown promising results, the tuning of their scheduling hyper-parameters is not yet well understood and requires tedious, time-consuming iterative searches. When a learning algorithm is governed by hyper-parameters (e.g., the learning rate and the momentum in stochastic gradient descent), an appealing possibility is to use machine learning systems to learn these hyper-parameters automatically, a process called meta-learning. For instance, this approach has recently been applied to learning the hyper-parameters governing stochastic gradient descent.

## Project description

In this project, we will start by designing suitable parametric models to describe the scheduling processes of the INQ and ANA algorithms. You will then be in charge of implementing the experiments and collecting performance data on hand-coded approaches. Subsequently, we will analyse your findings and, if possible, derive heuristic rules and design meta-learning algorithms that improve the effectiveness and efficiency of the INQ and ANA algorithms.

If time remains, we could also consider deploying some of the models trained with the improved algorithms on the ternary network accelerators developed by other members of the IIS team.