SyFI Lab: Systems for Future Intelligence

The SyFI Lab at the University of Washington builds efficient and resilient infrastructure for the future of AI. As applications grow more complex, we bridge the gap between next-generation models and heterogeneous hardware through cross-stack innovation, delivering scalable, open-source systems validated by industrial partners.

Our research targets three key areas:

  • Efficient AI: Optimizing algorithms and systems to maximize performance for training and inference.
  • Flexible AI: Architecting systems that seamlessly adapt to diverse tasks, strategies, and model structures.
  • Resilient AI: Ensuring AI system reliability at scale while leveraging AI to improve infrastructure robustness.

Publications

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen — Annual Conference on Machine Learning and Systems (MLSys) (2026)

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica — Annual Conference on Machine Learning and Systems (MLSys) (2026)

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

Yi Pan, Yile Gu, Jinbin Luo, Yibo Wu, Ziren Wang, Hongtao Zhang, Ziyi Xu, Shengkai Lin, Baris Kasikci, Stephanie Wang — Annual Conference on Machine Learning and Systems (MLSys) (2026)


Blog Posts

SyFI in January 2026: A Big Month for Systems-Driven AI Research

January 31, 2026

January 2026 was a milestone month for the SyFI Lab, with six papers published across MLSys and ICLR, spanning inference, training, scheduling, retrieval, and model architecture.

Meet LLMc: Beating All Compression with LLMs

October 3, 2025

We present LLMc, an open-source tool to compress natural language using LLMs as the world's most reference-packed dictionary.
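
The post itself describes how LLMc works; purely as a loose, unofficial sketch of the general idea (a language model's next-token predictions act as a shared dictionary, so text can be stored as small prediction ranks rather than raw tokens), here is a minimal rank-coding example. It assumes the Hugging Face transformers library and uses gpt2 as a stand-in model; none of the names or design choices below come from LLMc.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative rank coding, NOT the LLMc implementation. Compressor and
    # decompressor must run the identical model step by step so that the
    # per-position logits (and hence the rank ordering) match exactly.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def compress(text: str) -> list[int]:
        ids = tok(text).input_ids
        ranks = [ids[0]]  # first token id is stored verbatim
        for i in range(1, len(ids)):
            logits = model(torch.tensor([ids[:i]])).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            # Rank 0 means "the model's top guess", so well-predicted
            # text collapses into a stream of very small integers.
            ranks.append((order == ids[i]).nonzero().item())
        return ranks

    @torch.no_grad()
    def decompress(ranks: list[int]) -> str:
        ids = [ranks[0]]
        for r in ranks[1:]:
            logits = model(torch.tensor([ids])).logits[0, -1]
            ids.append(torch.argsort(logits, descending=True)[r].item())
        return tok.decode(ids)

    ranks = compress("Language models make surprisingly good dictionaries.")
    print(ranks)              # mostly small integers
    print(decompress(ranks))  # round-trips to the original text

A real compressor would finish the job by feeding the skewed rank stream (or the model's full probability distribution) to an entropy coder such as arithmetic coding; the sketch only shows where the "reference-packed dictionary" leverage comes from.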

Efficient Serving of SpeechLMs with VoxServe

September 29, 2025

We present VoxServe, a high-throughput, low-latency serving system designed specifically for Speech Language Models.

Talks

Rethinking LLM Serving From the Application’s Perspective

December 12, 2025

In Gim — Yale University

Abstract

As LLMs become the core of modern AI applications, inference efficiency has become critical, not just for speed but also for sustainability. An old lesson of systems design is that efficiency arises from understanding the workload. Yet today’s LLM serving systems are largely application-agnostic: they are optimized for generic text completion, while real applications now perform far richer tasks such as invoking tools, retrieving data, executing code, and coordinating with other agents. This raises a question: How should we rethink LLM serving, not from the system’s perspective, but from the application’s? In this talk, I will explore that question and show how an application-centered approach leads to serving systems that are more programmable, flexible, and application-aware.

Speaker Bio

In Gim is a fourth-year Ph.D. student at Yale University, advised by Prof. Lin Zhong. His research focuses on systems for machine learning, specifically on programmable systems for AI. His first-author work has been recognized at venues including SOSP, MLSys, MobiSys, HotOS, EMNLP, and AAAI.
