January 2026 was a milestone month for the SyFI Lab. Together with our collaborators, our group published six papers: five at MLSys 2026 and one at ICLR 2026, spanning inference, training, scheduling, retrieval, and model architecture. While each paper tackles a specific systems challenge, together they reflect a shared goal: making large-scale AI systems faster, more flexible, and more practical in the real world.
We’re especially proud of the breadth of the work. These papers cut across the full ML systems stack—from low-level operator scheduling and parallelism, to end-to-end LLM inference and retrieval-augmented generation, to new attention mechanisms for long-context models.
Below is a snapshot of what we have been working on so far this year.
Building the Virtuous Cycle for AI-driven LLM Systems
We introduced FlashInfer-Bench to address a recurring pain point in LLM systems research: inconsistent and incomplete benchmarking. This work focuses on closing the loop between measurement and optimization, enabling more realistic evaluation of inference systems and helping the community reason about performance tradeoffs with shared ground truth.
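For a flavor of what closing that loop can look like, here is a minimal, hypothetical sketch (not FlashInfer-Bench's API or methodology): a tiny harness that times candidate implementations of the same operator on a shared workload and reports which one wins.

```python
# Hypothetical measure-then-optimize loop: time candidate implementations of
# the same operation on one shared workload and pick the fastest.
# Illustrative only; this is not the FlashInfer-Bench API.
import time
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def benchmark(fn, x, iters=50):
    # Warm up once, then time repeated calls and return mean seconds per call.
    fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    x = np.random.randn(256, 4096).astype(np.float32)
    candidates = {"naive": softmax_naive, "stable": softmax_stable}
    timings = {name: benchmark(fn, x) for name, fn in candidates.items()}
    for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {t * 1e3:.3f} ms/call")
    print(f"selected: {min(timings, key=timings.get)}")
```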
Accelerating Large-Scale Reasoning Model Inference
Reasoning models are powerful—but expensive. This paper shows how self-speculative decoding, combined with sparsity, can significantly accelerate inference while preserving output quality. The result is a practical path toward deploying large reasoning models in latency-sensitive settings.
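As a toy illustration of the draft-and-verify idea behind speculative decoding, here is a hedged sketch in which the "draft" pass reuses the same toy weights with small entries pruned away as a stand-in for sparsity; the model, pruning rule, and acceptance scheme are all invented for illustration and are not the paper's actual method.

```python
# Toy sketch of self-speculative decoding over a tiny vocabulary.
# The "draft" pass reuses the same weights with small magnitudes zeroed out
# (a stand-in for sparsity); the full pass verifies the drafted tokens.
# Illustrative only; not the paper's algorithm or model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32
W = rng.standard_normal((VOCAB, VOCAB))  # toy "model": next-token logits per token

def full_logits(token):
    return W[token]

def draft_logits(token):
    # Cheap draft: same weights with small entries pruned to zero.
    row = W[token].copy()
    row[np.abs(row) < 1.0] = 0.0
    return row

def speculative_step(token, k=4):
    # 1) Draft k tokens greedily with the cheap pass.
    drafted, cur = [], token
    for _ in range(k):
        cur = int(np.argmax(draft_logits(cur)))
        drafted.append(cur)
    # 2) Verify with the full pass: keep the longest prefix whose greedy
    #    full-model choice agrees with the draft, then append one full-model
    #    token so progress is always made.
    accepted, cur = [], token
    for t in drafted:
        if int(np.argmax(full_logits(cur))) == t:
            accepted.append(t)
            cur = t
        else:
            break
    accepted.append(int(np.argmax(full_logits(cur))))
    return accepted

if __name__ == "__main__":
    seq = [0]
    while len(seq) < 20:
        seq.extend(speculative_step(seq[-1]))
    print(seq[:20])
```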
Transparent and Flexible Intra-Device Parallelism
With DynaFlow, we rethink how work is scheduled within a device. By making operator scheduling programmable, this work enables more transparent and adaptable intra-device parallelism, especially important for heterogeneous hardware and evolving model architectures.
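To give a flavor of programmable intra-device scheduling, here is a hypothetical toy scheduler that walks a dependency graph of operators and spreads each wave of independent ops across parallel streams; the names and structure are invented for illustration and are not DynaFlow's API or scheduling model.

```python
# Toy sketch of assigning independent operators to parallel "streams".
# Ops whose dependencies are all satisfied form a wave; each wave is spread
# across streams so independent work can run concurrently.
# Illustrative only; not DynaFlow's API.
from collections import defaultdict

def schedule(ops, deps, num_streams=2):
    """ops: list of op names; deps: {op: set of ops it depends on}."""
    remaining = {op: set(deps.get(op, ())) for op in ops}
    done, plan = set(), []          # plan: one {stream: [ops]} dict per wave
    while remaining:
        # A wave is every op whose dependencies have all completed.
        wave = [op for op, d in remaining.items() if d <= done]
        if not wave:
            raise ValueError("cycle in dependency graph")
        assignment = defaultdict(list)
        for i, op in enumerate(wave):
            assignment[f"stream{i % num_streams}"].append(op)
            del remaining[op]
        done.update(wave)
        plan.append(dict(assignment))
    return plan

if __name__ == "__main__":
    ops = ["gemm_a", "gemm_b", "norm_a", "norm_b", "fuse"]
    deps = {"norm_a": {"gemm_a"}, "norm_b": {"gemm_b"}, "fuse": {"norm_a", "norm_b"}}
    for wave in schedule(ops, deps):
        print(wave)
```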
Scaling Foundation Model Pre-Training
Training foundation models at scale increasingly runs into context and memory limits. FCP introduces a scalable approach to context parallelism that better balances computation and memory across devices, enabling more efficient large-model pre-training.
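The general idea of context parallelism can be sketched with a toy numpy simulation: the sequence is split into shards, each simulated "device" keeps its own queries, and key/value blocks circulate ring-style while an online softmax merges the partial results. This is a generic illustration under those assumptions, not FCP's algorithm.

```python
# Toy numpy sketch of context parallelism: a long sequence is split into
# shards ("devices"), each device keeps its local queries, and KV blocks are
# passed around ring-style so no device holds the full attention matrix.
# Partial results are merged with an online softmax. Not FCP's algorithm.
import numpy as np

def ring_attention(q_shards, k_shards, v_shards):
    outputs, n = [], len(q_shards)
    for dev in range(n):
        q = q_shards[dev]                              # local queries
        m = np.full(q.shape[0], -np.inf)               # running row max
        denom = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], v_shards[0].shape[1]))
        for step in range(n):
            src = (dev + step) % n                     # KV block arriving this step
            s = q @ k_shards[src].T                    # scores against this block
            new_m = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - new_m)                  # rescale earlier partials
            p = np.exp(s - new_m[:, None])
            denom = denom * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v_shards[src]
            m = new_m
        outputs.append(acc / denom[:, None])
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d, shards = 16, 8, 4
    q = rng.standard_normal((seq, d))
    k = rng.standard_normal((seq, d))
    v = rng.standard_normal((seq, d))
    out = ring_attention(np.array_split(q, shards),
                         np.array_split(k, shards),
                         np.array_split(v, shards))
    # Reference: ordinary full attention over the whole sequence.
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v
    print(np.allclose(out, ref))
```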
Efficient Retrieval-Augmented Generation with Lookahead Retrieval
Retrieval-augmented generation is powerful, but often bottlenecked by retrieval latency. TeleRAG uses lookahead retrieval to overlap and streamline the retrieval-generation loop, significantly improving end-to-end RAG inference efficiency.
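As a simplified stand-in for the overlap idea, here is a hypothetical asyncio sketch in which retrieval for the next query is launched before generation for the current one finishes; the function names, latencies, and pipelining scheme are illustrative and do not reflect TeleRAG's interface or its lookahead mechanism.

```python
# Toy asyncio sketch of overlapping retrieval with generation: while the
# model generates the answer for the current query, retrieval for the next
# query is already in flight. Names and latencies are hypothetical.
import asyncio

async def retrieve(query):
    await asyncio.sleep(0.3)                  # stand-in for vector-search latency
    return f"docs for '{query}'"

async def generate(query, docs):
    await asyncio.sleep(0.5)                  # stand-in for LLM decoding latency
    return f"answer to '{query}' using [{docs}]"

async def serve(queries):
    # Start retrieval for the first query, then pipeline the rest:
    # each generation overlaps with the next query's retrieval.
    pending = asyncio.create_task(retrieve(queries[0]))
    answers = []
    for i, query in enumerate(queries):
        docs = await pending
        if i + 1 < len(queries):
            pending = asyncio.create_task(retrieve(queries[i + 1]))
        answers.append(await generate(query, docs))
    return answers

if __name__ == "__main__":
    for answer in asyncio.run(serve(["q1", "q2", "q3"])):
        print(answer)
```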
Adaptive Sparse Attention for Long-Context Models
At ICLR, we will present Tactic, a new sparse attention mechanism that adapts via clustering and distribution fitting. Tactic enables models to handle long contexts more efficiently—an increasingly critical capability for reasoning, summarization, and agentic workloads.
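For intuition about clustering-based sparse attention in general, here is a toy numpy sketch that groups keys with a small k-means, scores the query against cluster centroids, and attends only to keys in the top-scoring clusters; the clustering and selection here are illustrative and are not Tactic's clustering or distribution-fitting procedure.

```python
# Toy numpy sketch of clustering-based sparse attention: keys are grouped
# into clusters, the query is scored against cluster centroids, and full
# attention runs only over keys in the best-scoring clusters.
# Illustrative of the general idea only; not Tactic's method.
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = x[labels == c].mean(axis=0)
    return centroids, labels

def sparse_attention(q, keys, values, k_clusters=8, top_clusters=2):
    centroids, labels = kmeans(keys, k_clusters)
    # Keep only the clusters whose centroids align best with the query.
    chosen = np.argsort(q @ centroids.T)[-top_clusters:]
    idx = np.flatnonzero(np.isin(labels, chosen))
    s = q @ keys[idx].T
    w = np.exp(s - s.max())
    return (w / w.sum()) @ values[idx], idx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    keys = rng.standard_normal((1024, 64))
    values = rng.standard_normal((1024, 64))
    q = rng.standard_normal(64)
    out, idx = sparse_attention(q, keys, values)
    print(f"attended to {len(idx)} of {len(keys)} keys")
```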
Taken together, these six papers reflect how we think about ML systems research at SyFI: work that cuts across the full stack, from low-level scheduling and parallelism to end-to-end inference, retrieval, and model architecture, all in service of making large-scale AI systems faster, more flexible, and more practical.
We’re excited about where this direction is heading, and grateful to our collaborators, students, and the broader community who made this work possible.
👉 Full list of publications: https://syfi.cs.washington.edu/publications/