Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci — Symposium on Operating Systems Design and Implementation (OSDI) (2025)
LLM Serving
Keywords: Large Language Models, Inference Serving, Intra-Device Parallelism, Throughput Optimization, Hardware Efficiency
NanoFlow is a high-performance serving framework designed to maximize hardware resource utilization for Large Language Model inference. Unlike traditional approaches that assume LLM serving is memory-bound, NanoFlow demonstrates that end-to-end serving is in fact compute-bound, which opens new opportunities for optimization through parallelism within each device.
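One way to see why large-batch serving ends up compute-bound is a roofline-style back-of-envelope calculation: for a dense GEMM, arithmetic intensity grows with the number of batched tokens, while the GPU's FLOP-to-bandwidth ratio is fixed. The sketch below illustrates this; the hardware numbers (roughly A100-class) and the model dimension are assumptions for illustration, not figures from the paper.

```python
# Back-of-envelope check of the compute- vs. memory-bound claim for a dense
# (GEMM) layer during batched decoding. All hardware numbers are illustrative
# assumptions (roughly A100-class), not values taken from the paper.

PEAK_TFLOPS = 312    # assumed FP16 tensor-core peak, TFLOP/s
PEAK_BW_TBPS = 2.0   # assumed HBM bandwidth, TB/s
RIDGE = (PEAK_TFLOPS * 1e12) / (PEAK_BW_TBPS * 1e12)  # FLOPs per byte at the roofline ridge

def gemm_arithmetic_intensity(batch_tokens: int, d_in: int, d_out: int) -> float:
    """FLOPs per byte for y = x @ W with FP16 weights, ignoring activation traffic."""
    flops = 2 * batch_tokens * d_in * d_out   # one multiply-accumulate per weight per token
    bytes_moved = 2 * d_in * d_out            # weight matrix read once, 2 bytes per element
    return flops / bytes_moved                # simplifies to batch_tokens

for batch in (1, 64, 512):
    ai = gemm_arithmetic_intensity(batch, 8192, 8192)
    regime = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"batch={batch:4d}  intensity={ai:7.1f} FLOP/B  ridge={RIDGE:.0f} FLOP/B  -> {regime}")
```

With a single decoding token the GEMM is far below the ridge point (memory-bound), but once hundreds of requests are batched the same layer sits well above it, which is the regime the end-to-end serving workload operates in.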
NanoFlow’s key innovation is intra-device parallelism, which overlaps compute-, memory-, and network-bound operations within a single device rather than running them one after another.
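A minimal sketch of the idea, assuming PyTorch with CUDA: issue a compute-bound GEMM and a memory-bound operation on separate CUDA streams so the device can overlap them instead of executing them back to back. The operations and sizes are illustrative stand-ins, not NanoFlow's actual kernels or scheduling policy.

```python
import torch

device = torch.device("cuda")
compute_stream = torch.cuda.Stream(device)
memory_stream = torch.cuda.Stream(device)

# Illustrative tensors: a GEMM operand pair (compute-bound) and a large
# buffer standing in for memory-bound work such as decode attention.
a = torch.randn(8192, 8192, device=device, dtype=torch.float16)
b = torch.randn(8192, 8192, device=device, dtype=torch.float16)
kv_like = torch.randn(64, 1 << 20, device=device, dtype=torch.float16)  # ~128 MB
torch.cuda.synchronize()  # ensure initialization on the default stream has finished

with torch.cuda.stream(compute_stream):
    c = a @ b                 # compute-bound: dense GEMM

with torch.cuda.stream(memory_stream):
    moved = kv_like * 1.0     # memory-bound: streams the whole buffer through HBM

torch.cuda.synchronize()      # wait for both streams before using the results
```

Whether the two kernels truly overlap depends on how many SMs and how much bandwidth each one claims; managing that contention is exactly the kind of problem an intra-device scheduler has to solve.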

This is paired with efficient CPU-side management that complements GPU execution, so scheduling and batch preparation on the host do not stall the device.
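A sketch of this CPU/GPU overlap, again assuming PyTorch with CUDA: while the GPU executes the current step, the host prepares the next batch and ships it over with an asynchronous copy. `prepare_batch` and `run_model` are hypothetical placeholders for a scheduler and a forward pass, not NanoFlow APIs.

```python
import torch

weight = torch.randn(1, 4096, device="cuda")  # stand-in model parameter

def prepare_batch(step: int) -> torch.Tensor:
    # Hypothetical CPU-side work: tokenization, batching, KV-cache bookkeeping.
    # Pinned memory allows a truly asynchronous host-to-device copy later.
    return torch.randint(0, 32000, (256, 1)).pin_memory()

def run_model(batch_gpu: torch.Tensor) -> None:
    # Hypothetical GPU work; kernel launches return immediately (asynchronously).
    _ = batch_gpu.float() @ weight

copy_stream = torch.cuda.Stream()
next_cpu = prepare_batch(0)

for step in range(8):
    with torch.cuda.stream(copy_stream):
        batch_gpu = next_cpu.to("cuda", non_blocking=True)  # async host-to-device copy
    torch.cuda.current_stream().wait_stream(copy_stream)    # compute waits only on the copy
    run_model(batch_gpu)                                     # GPU runs this step...
    next_cpu = prepare_batch(step + 1)                       # ...while the CPU builds the next batch

torch.cuda.synchronize()
```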

NanoFlow consistently outperforms state-of-the-art LLM serving systems:

| System | Relative Throughput |
|---|---|
| NanoFlow | 1.91x |
| TensorRT-LLM | 1.0x (baseline) |
| vLLM | 0.85x |
| DeepSpeed-FastGen | 0.72x |
NanoFlow’s architecture is built on three key layers. Together, this multi-layered approach enables NanoFlow to overlap work across all of a device’s resources instead of exercising them one at a time.
LLM serving workloads mix multiple types of resources: compute, memory, network, and storage. Traditional systems execute the corresponding operations sequentially, leaving most resources idle while they wait for one type of operation to complete.

Concurrent kernel execution across these heterogeneous resources unlocks the hidden parallelism. By running multiple operations simultaneously (e.g., the GPU computes while memory transfers data, and the CPU prepares the next batch while the GPU executes), NanoFlow keeps all resources busy at the same time rather than waiting in sequence. This is the key to the reported 1.91x throughput gain.
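To make the sequential-versus-overlapped comparison concrete, the micro-benchmark sketch below (assuming PyTorch with CUDA) times both schedules with CUDA events. Any speedup it shows depends on the GPU and the chosen kernel sizes; it only illustrates the measurement idea and is not expected to reproduce the paper's 1.91x figure.

```python
import torch

def time_ms(fn) -> float:
    """Time a GPU workload with CUDA events; returns elapsed milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)      # compute-bound GEMM operand
big = torch.randn(128, 1 << 20, device="cuda", dtype=torch.float16)  # ~256 MB, memory-bound op
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
torch.cuda.synchronize()  # finish initialization before timing

def sequential():
    _ = a @ a          # compute-bound, default stream
    _ = big * 2.0      # memory-bound, runs only after the GEMM finishes

def overlapped():
    with torch.cuda.stream(s1):
        _ = a @ a      # compute-bound on its own stream
    with torch.cuda.stream(s2):
        _ = big * 2.0  # memory-bound on a second stream, free to overlap
    torch.cuda.current_stream().wait_stream(s1)  # make the timing event on the default
    torch.cuda.current_stream().wait_stream(s2)  # stream wait for both side streams

print(f"sequential: {time_ms(sequential):.2f} ms  overlapped: {time_ms(overlapped):.2f} ms")
```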