Softmax for Transformers (M/1-2S)

Overview

Status: Available

Introduction

Transformers [1], initially popularized in Natural Language Processing (NLP), have since found applications far beyond it and are now integral to a wide range of deep learning tasks. However, the computation of the attention mechanism, a crucial part of transformers, encounters a performance bottleneck due to the softmax function. Addressing this limitation, our transformer accelerator ITA [2] introduces ITAmax, a hardware-friendly softmax implementation whose performance relies on an optimized dataflow and low-precision integer arithmetic.
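
For context, the sketch below shows the standard, numerically stable softmax inside scaled dot-product attention in PyTorch. It is purely illustrative (function names are chosen here, and this is not ITA's datapath); the explicit row-maximum pass it performs for numerical stability is exactly the kind of overhead a hardware-friendly softmax tries to avoid.

    import torch

    def attention_with_softmax(q, k, v):
        """Scaled dot-product attention; the row-wise softmax is the part ITAmax targets."""
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale           # (..., seq_len, seq_len)
        # Numerically stable softmax: subtract the row maximum before exponentiating.
        row_max = scores.max(dim=-1, keepdim=True).values    # extra pass over every row
        exp_scores = torch.exp(scores - row_max)
        probs = exp_scores / exp_scores.sum(dim=-1, keepdim=True)
        return probs @ v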

Expanding ITA's capabilities to Large Language Models (LLMs) necessitates using Floating Point (FP) softmax. The challenge here is not precision but reducing the number of passes required by the original softmax algorithm and breaking row dependencies. FlashAttention [3, 4] addresses this issue by employing online normalization, but it still depends on finding the maximum within each row for numerical stability. In the first part of this project, we aim to explore whether the maximum can be eliminated entirely or substituted with a statistical maximum, removing this normalization cost.
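
The following is a minimal sketch of the online-normalization idea, written as an explicit loop for clarity rather than as FlashAttention's tiled kernel, together with a hypothetical variant in which the true row maximum is replaced by a fixed statistical bound m_stat (a name and default chosen here purely for illustration), which is the direction this project explores.

    import torch

    def online_softmax(scores):
        """Softmax with online normalization: a single pass maintains a running maximum m
        and a running denominator d that is rescaled whenever the maximum grows."""
        m = torch.full(scores.shape[:-1], float("-inf"),
                       dtype=scores.dtype, device=scores.device)
        d = torch.zeros(scores.shape[:-1], dtype=scores.dtype, device=scores.device)
        for i in range(scores.shape[-1]):
            x = scores[..., i]
            m_new = torch.maximum(m, x)
            d = d * torch.exp(m - m_new) + torch.exp(x - m_new)  # rescale old sum to new max
            m = m_new
        # Materializing the full probability matrix still needs one more read of the scores.
        return torch.exp(scores - m.unsqueeze(-1)) / d.unsqueeze(-1)

    def softmax_with_static_max(scores, m_stat=8.0):
        """Hypothetical variant: subtract a fixed statistical bound instead of the true row
        maximum. Mathematically identical, but it removes the max pass and running-max
        bookkeeping, at the risk of overflow if scores exceed m_stat by a large margin."""
        exp_scores = torch.exp(scores - m_stat)
        return exp_scores / exp_scores.sum(dim=-1, keepdim=True)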

In the second part, ITAmax will be comprehensively validated through quantization-aware training on standard transformer benchmarks. The evaluation will include a comparison with other hardware-friendly softmax algorithms such as I-BERT [5] and Softermax [6].

Project

  • Exploring Numerical Stability of Softmax (FP): Analyze the numerical distribution of softmax inputs in LLMs and evaluate the effect on accuracy of eliminating the maximum pass or replacing it with a statistical maximum.
  • Softmax Approximation Benchmark (INT): Assess ITAmax's softmax approximation accuracy against the original softmax, I-BERT, and Softermax versions. The benchmark will mainly consist of measuring and comparing the accuracy loss induced by these softmax approximations on real Transformer networks across different tasks (CV, NLP, …); a per-operator error sketch follows this list.
  • ITAmax Precision Exploration: Investigate the impact of different numerical precisions on ITAmax’s performance.
  • (Stretch Goal) Hardware Cost Evaluation: Optionally, explore the hardware cost of Softermax and I-BERT and contrast it with ITAmax; this requires familiarity with digital design (VLSI 1 course).
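
As a rough illustration of the per-operator part of the benchmark bullet above (end-to-end task accuracy on real networks is the actual target metric), the sketch below compares an arbitrary approximate softmax, passed in as a placeholder callable, against the exact PyTorch reference.

    import torch
    import torch.nn.functional as F

    def softmax_approx_error(scores, approx_softmax):
        """Compare an approximate softmax (e.g. a model of ITAmax, I-BERT, or Softermax)
        against the exact reference on the same attention scores."""
        ref = torch.softmax(scores, dim=-1)
        approx = approx_softmax(scores)
        l1 = (ref - approx).abs().sum(dim=-1).mean()              # mean per-row L1 error
        kl = F.kl_div(approx.clamp_min(1e-12).log(), ref,
                      reduction="batchmean")                      # KL(ref || approx)
        return l1.item(), kl.item()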

Character

  • Literature/Architecture Review: 10% of the project will involve understanding the theoretical foundations and design principles of transformers, attention mechanisms, and hardware-friendly softmax algorithms.
  • Python Coding: The majority (60%) of the project will be dedicated to hands-on coding, where the student will implement and experiment with various softmax algorithms.
  • Evaluation: The final 30% will focus on systematically evaluating the algorithms' performance and analyzing the results.

Prerequisites

  • Fundamental deep learning concepts.
  • Numerical representation formats (integer, fixed-point, floating-point).
  • Proficiency in Python programming.
  • Familiarity with the PyTorch deep learning framework.

References

[1] Attention Is All You Need

[2] ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

[3] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[4] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

[5] I-BERT: Integer-only BERT Quantization

[6] Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers