Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, Baris Kasikci — Conference on Empirical Methods in Natural Language Processing (EMNLP) (2025)
Efficient ML · Speech
Keywords: Automatic Speech Recognition, Model Compression, Low-Rank Approximation, Whisper, Efficiency
LiteASR is a novel compression scheme designed to address the computational bottlenecks in modern Automatic Speech Recognition (ASR) encoders. While recent advances like Distil-Whisper have successfully compressed decoders, the encoder remains a large, compute-bound component.
LiteASR demonstrates that intermediate activations in ASR encoders exhibit strong low-rank properties. By exploiting this structure, LiteASR approximates linear layers and self-attention mechanisms, significantly reducing model size and latency without retraining the model from scratch.
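A minimal numpy sketch of the core idea: record a layer's output activations on calibration data, take their top principal directions, and factor the dense weight into two thin matrices. The sizes and the synthetic low-rank weight are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; real Whisper layers are much larger.
d_in, d_out, n_calib, k = 64, 64, 1024, 8

# Synthetic weight built to be rank-k, mimicking the low-rank structure
# LiteASR observes in trained encoder activations.
W = rng.normal(size=(d_out, k)) @ rng.normal(size=(k, d_in)) / k
X = rng.normal(size=(n_calib, d_in))      # calibration inputs

# 1. Record the layer's outputs on calibration data.
Y = X @ W.T

# 2. PCA of the outputs: the top-k principal directions of the activations.
_, _, Vt = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
P = Vt[:k]                                # (k, d_out) projection

# 3. Factor the layer: y ≈ P.T @ ((P @ W) @ x) — two thin matmuls of
#    sizes (k, d_in) and (d_out, k) replace one (d_out, d_in) matmul.
W1 = P @ W
W2 = P.T

Y_approx = (X @ W1.T) @ W2.T
rel_err = np.linalg.norm(Y - Y_approx) / np.linalg.norm(Y)
print(f"relative reconstruction error at rank {k}: {rel_err:.2e}")
```

Because the synthetic activations are genuinely low-rank, the factored layer reproduces the original outputs almost exactly while costing far fewer multiply-adds.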

Unlike traditional pruning or quantization, which focus solely on model weights, LiteASR analyzes the activations generated during inference.
LiteASR goes beyond compressing individual linear layers by optimizing the self-attention mechanism itself.
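One way to see why low rank pays off inside attention: if the query and key projections factor as W ≈ A·B with rank r, the inner factors can be folded into a single small matrix, so attention scores are computed from r-dimensional queries and keys instead of d-dimensional ones. A hedged numpy sketch under that assumption (the rank-r projections are synthetic, not LiteASR's actual factors):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 32, 64, 8     # sequence length, model dim, assumed rank

X = rng.normal(size=(n, d))
# Synthetic rank-r query/key projections (illustrative sizes).
Wq = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
Wk = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

# Baseline attention scores: (X Wq)(X Wk)^T / sqrt(d).
S_full = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)

# Factor each projection once, offline, via truncated SVD: W ≈ A @ B.
Uq, sq, Bq = np.linalg.svd(Wq, full_matrices=False)
Uk, sk, Bk = np.linalg.svd(Wk, full_matrices=False)
Aq, Bq = Uq[:, :r] * sq[:r], Bq[:r]       # Aq: (d, r), Bq: (r, d)
Ak, Bk = Uk[:, :r] * sk[:r], Bk[:r]

# Fold the two inner factors into one small (r, r) matrix, so the
# score matmul runs over r dimensions instead of d.
M = Bq @ Bk.T
S_lr = (X @ Aq) @ M @ (X @ Ak).T / np.sqrt(d)

rel_err = np.linalg.norm(S_full - S_lr) / np.linalg.norm(S_full)
print(f"score error with rank-{r} attention: {rel_err:.2e}")
```

The value and output paths are untouched in this sketch; only the score computation shrinks, which is where the quadratic-in-sequence-length cost lives.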
LiteASR consistently outperforms stock Whisper models on the accuracy-efficiency trade-off, establishing a new Pareto frontier for ASR.

LiteASR’s architecture focuses on transforming the standard Transformer encoder blocks into efficient Low-Rank (LR) equivalents.
This approach ensures that the compression is data-driven, adapting to the actual information flow within the network rather than arbitrary weight magnitudes.
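A data-driven scheme like this needs a per-layer rank choice that follows the activations rather than a fixed global setting. A minimal sketch of one common criterion, picking the smallest rank whose principal components explain a target fraction of activation variance (the threshold and helper name are illustrative assumptions, not the paper's exact rule):

```python
import numpy as np

def choose_rank(acts: np.ndarray, energy: float = 0.99) -> int:
    """Smallest rank whose principal components explain `energy` of the
    activation variance — a per-layer, data-driven choice."""
    _, s, _ = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(2)
# Calibration activations with strong low-rank structure plus small noise.
low = rng.normal(size=(2048, 6)) @ rng.normal(size=(6, 256))
acts = low + 0.01 * rng.normal(size=(2048, 256))
print(choose_rank(acts))   # a small rank (at most 6 here)
```

Layers whose activations spread variance across many directions keep a higher rank; layers with concentrated variance compress harder, which is the sense in which the method adapts to the actual information flow.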
Encoders are the new bottleneck. With the advent of distilled decoders (like Distil-Whisper), the heavy lifting has shifted to the encoder. LiteASR directly targets this bottleneck.
Activations tell the truth. Weight-based compression often misses redundant neurons that appear important but contribute little to actual signal variance. LiteASR’s activation-based analysis is more precise.
Drop-in Replacement. The resulting “Lite-Whisper” models are architecturally compatible with existing serving pipelines (like Hugging Face Transformers or Triton), requiring no complex hardware-specific kernels to see gains.