January 2026 was a milestone month for the SyFI Lab. Together with our collaborators, our group published six papers: five at MLSys 2026 and one at ICLR 2026, spanning inference, training, scheduling, retrieval, and model architecture. While each paper tackles a specific systems challenge, together they reflect a shared goal: making large-scale AI systems faster, more flexible, and more practical in the real world.
We’re especially proud of the breadth of the work. These papers cut across the full ML systems stack—from low-level operator scheduling and parallelism, to end-to-end LLM inference and retrieval-augmented generation, to new attention mechanisms for long-context models.
Below is a snapshot of what we have been working on so far this year.
Building the Virtuous Cycle for AI-driven LLM Systems
We introduced FlashInfer-Bench to address a recurring pain point in LLM systems research: inconsistent and incomplete benchmarking. This work focuses on closing the loop between measurement and optimization, enabling more realistic evaluation of inference systems and helping the community reason about performance tradeoffs with shared ground truth.
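For a flavor of what closing that loop can look like, here is a minimal, hypothetical sketch (not FlashInfer-Bench's API or methodology): a tiny harness that times candidate implementations of the same operator on a shared workload and reports which one wins.

```python
# Hypothetical measure-then-optimize loop: time candidate implementations of
# the same operation on one shared workload and pick the fastest.
# Illustrative only; this is not the FlashInfer-Bench API.
import time
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def benchmark(fn, x, iters=50):
    # Warm up once, then time repeated calls and return mean seconds per call.
    fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    x = np.random.randn(256, 4096).astype(np.float32)
    candidates = {"naive": softmax_naive, "stable": softmax_stable}
    timings = {name: benchmark(fn, x) for name, fn in candidates.items()}
    for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {t * 1e3:.3f} ms/call")
    print(f"selected: {min(timings, key=timings.get)}")
```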
Accelerating Large-Scale Reasoning Model Inference
Reasoning models are powerful—but expensive. This paper shows how self-speculative decoding, combined with sparsity, can significantly accelerate inference while preserving output quality. The result is a practical path toward deploying large reasoning models in latency-sensitive settings.
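As a toy illustration of the draft-and-verify idea behind speculative decoding, here is a hedged sketch in which the "draft" pass reuses the same toy weights with small entries pruned away as a stand-in for sparsity; the model, pruning rule, and acceptance scheme are all invented for illustration and are not the paper's actual method.

```python
# Toy sketch of self-speculative decoding over a tiny vocabulary.
# The "draft" pass reuses the same weights with small magnitudes zeroed out
# (a stand-in for sparsity); the full pass verifies the drafted tokens.
# Illustrative only; not the paper's algorithm or model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32
W = rng.standard_normal((VOCAB, VOCAB))  # toy "model": next-token logits per token

def full_logits(token):
    return W[token]

def draft_logits(token):
    # Cheap draft: same weights with small entries pruned to zero.
    row = W[token].copy()
    row[np.abs(row) < 1.0] = 0.0
    return row

def speculative_step(token, k=4):
    # 1) Draft k tokens greedily with the cheap pass.
    drafted, cur = [], token
    for _ in range(k):
        cur = int(np.argmax(draft_logits(cur)))
        drafted.append(cur)
    # 2) Verify with the full pass: keep the longest prefix whose greedy
    #    full-model choice agrees with the draft, then append one full-model
    #    token so progress is always made.
    accepted, cur = [], token
    for t in drafted:
        if int(np.argmax(full_logits(cur))) == t:
            accepted.append(t)
            cur = t
        else:
            break
    accepted.append(int(np.argmax(full_logits(cur))))
    return accepted

if __name__ == "__main__":
    seq = [0]
    while len(seq) < 20:
        seq.extend(speculative_step(seq[-1]))
    print(seq[:20])
```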
Transparent and Flexible Intra-Device Parallelism
With DynaFlow, we rethink how work is scheduled within a device. By making operator scheduling programmable, this work enables more transparent and adaptable intra-device parallelism, especially important for heterogeneous hardware and evolving model architectures.
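To give a flavor of programmable intra-device scheduling, here is a hypothetical toy scheduler that walks a dependency graph of operators and spreads each wave of independent ops across parallel streams; the names and structure are invented for illustration and are not DynaFlow's API or scheduling model.

```python
# Toy sketch of assigning independent operators to parallel "streams".
# Ops whose dependencies are all satisfied form a wave; each wave is spread
# across streams so independent work can run concurrently.
# Illustrative only; not DynaFlow's API.
from collections import defaultdict

def schedule(ops, deps, num_streams=2):
    """ops: list of op names; deps: {op: set of ops it depends on}."""
    remaining = {op: set(deps.get(op, ())) for op in ops}
    done, plan = set(), []          # plan: one {stream: [ops]} dict per wave
    while remaining:
        # A wave is every op whose dependencies have all completed.
        wave = [op for op, d in remaining.items() if d <= done]
        if not wave:
            raise ValueError("cycle in dependency graph")
        assignment = defaultdict(list)
        for i, op in enumerate(wave):
            assignment[f"stream{i % num_streams}"].append(op)
            del remaining[op]
        done.update(wave)
        plan.append(dict(assignment))
    return plan

if __name__ == "__main__":
    ops = ["gemm_a", "gemm_b", "norm_a", "norm_b", "fuse"]
    deps = {"norm_a": {"gemm_a"}, "norm_b": {"gemm_b"}, "fuse": {"norm_a", "norm_b"}}
    for wave in schedule(ops, deps):
        print(wave)
```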
Scaling Foundation Model Pre-Training
Training foundation models at scale increasingly runs into context and memory limits. FCP introduces a scalable approach to context parallelism that better balances computation and memory across devices, enabling more efficient large-model pre-training.
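The general idea of context parallelism can be sketched with a toy numpy simulation: the sequence is split into shards, each simulated "device" keeps its own queries, and key/value blocks circulate ring-style while an online softmax merges the partial results. This is a generic illustration under those assumptions, not FCP's algorithm.

```python
# Toy numpy sketch of context parallelism: a long sequence is split into
# shards ("devices"), each device keeps its local queries, and KV blocks are
# passed around ring-style so no device holds the full attention matrix.
# Partial results are merged with an online softmax. Not FCP's algorithm.
import numpy as np

def ring_attention(q_shards, k_shards, v_shards):
    outputs, n = [], len(q_shards)
    for dev in range(n):
        q = q_shards[dev]                              # local queries
        m = np.full(q.shape[0], -np.inf)               # running row max
        denom = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], v_shards[0].shape[1]))
        for step in range(n):
            src = (dev + step) % n                     # KV block arriving this step
            s = q @ k_shards[src].T                    # scores against this block
            new_m = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - new_m)                  # rescale earlier partials
            p = np.exp(s - new_m[:, None])
            denom = denom * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v_shards[src]
            m = new_m
        outputs.append(acc / denom[:, None])
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d, shards = 16, 8, 4
    q = rng.standard_normal((seq, d))
    k = rng.standard_normal((seq, d))
    v = rng.standard_normal((seq, d))
    out = ring_attention(np.array_split(q, shards),
                         np.array_split(k, shards),
                         np.array_split(v, shards))
    # Reference: ordinary full attention over the whole sequence.
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v
    print(np.allclose(out, ref))
```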
Efficient Retrieval-Augmented Generation with Lookahead Retrieval
Retrieval-augmented generation is powerful, but often bottlenecked by retrieval latency. TeleRAG uses lookahead retrieval to overlap and streamline the retrieval-generation loop, significantly improving end-to-end RAG inference efficiency.
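As a simplified stand-in for the overlap idea, here is a hypothetical asyncio sketch in which retrieval for the next query is launched before generation for the current one finishes; the function names, latencies, and pipelining scheme are illustrative and do not reflect TeleRAG's interface or its lookahead mechanism.

```python
# Toy asyncio sketch of overlapping retrieval with generation: while the
# model generates the answer for the current query, retrieval for the next
# query is already in flight. Names and latencies are hypothetical.
import asyncio

async def retrieve(query):
    await asyncio.sleep(0.3)                  # stand-in for vector-search latency
    return f"docs for '{query}'"

async def generate(query, docs):
    await asyncio.sleep(0.5)                  # stand-in for LLM decoding latency
    return f"answer to '{query}' using [{docs}]"

async def serve(queries):
    # Start retrieval for the first query, then pipeline the rest:
    # each generation overlaps with the next query's retrieval.
    pending = asyncio.create_task(retrieve(queries[0]))
    answers = []
    for i, query in enumerate(queries):
        docs = await pending
        if i + 1 < len(queries):
            pending = asyncio.create_task(retrieve(queries[i + 1]))
        answers.append(await generate(query, docs))
    return answers

if __name__ == "__main__":
    for answer in asyncio.run(serve(["q1", "q2", "q3"])):
        print(answer)
```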
Adaptive Sparse Attention for Long-Context Models
At ICLR, we will present Tactic, a new sparse attention mechanism that adapts via clustering and distribution fitting. Tactic enables models to handle long contexts more efficiently—an increasingly critical capability for reasoning, summarization, and agentic workloads.
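For intuition about clustering-based sparse attention in general, here is a toy numpy sketch that groups keys with a small k-means, scores the query against cluster centroids, and attends only to keys in the top-scoring clusters; the clustering and selection here are illustrative and are not Tactic's clustering or distribution-fitting procedure.

```python
# Toy numpy sketch of clustering-based sparse attention: keys are grouped
# into clusters, the query is scored against cluster centroids, and full
# attention runs only over keys in the best-scoring clusters.
# Illustrative of the general idea only; not Tactic's method.
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = x[labels == c].mean(axis=0)
    return centroids, labels

def sparse_attention(q, keys, values, k_clusters=8, top_clusters=2):
    centroids, labels = kmeans(keys, k_clusters)
    # Keep only the clusters whose centroids align best with the query.
    chosen = np.argsort(q @ centroids.T)[-top_clusters:]
    idx = np.flatnonzero(np.isin(labels, chosen))
    s = q @ keys[idx].T
    w = np.exp(s - s.max())
    return (w / w.sum()) @ values[idx], idx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    keys = rng.standard_normal((1024, 64))
    values = rng.standard_normal((1024, 64))
    q = rng.standard_normal(64)
    out, idx = sparse_attention(q, keys, values)
    print(f"attended to {len(idx)} of {len(keys)} keys")
```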
Taken together, these six papers reflect how we think about ML systems research at SyFI: work that cuts across the full stack, from low-level scheduling and parallelism to end-to-end inference, retrieval, and model architecture, all in service of making large-scale AI systems faster, more flexible, and more practical.
We’re excited about where this direction is heading, and grateful to our collaborators, students, and the broader community who made this work possible.
👉 Full list of publications: https://syfi.cs.washington.edu/publications/