SyFI Lab: Systems for Future Intelligence

Talks & Seminars

Rethinking LLM Serving From the Application’s Perspective

Dec 12, 2025 In Gim — Yale University

Abstract

As LLMs become the core of modern AI applications, inference efficiency has become critical, not just for speed but also for sustainability. An old lesson of systems design is that efficiency arises from understanding the workload. Yet today’s LLM serving systems are largely application-agnostic: they are optimized for generic text completion, while real applications now perform far richer tasks such as invoking tools, retrieving data, executing code, and coordinating with other agents. This raises a question: How should we rethink LLM serving, not from the system’s perspective, but from the app...

Speaker Bio

In Gim is a fourth-year Ph.D. student at Yale University, advised by Prof. Lin Zhong. His research focuses on systems for machine learning, specifically on programmable systems for AI. His first-author works have been recognized by venues like SOSP, MLSys, MobiSys, HotOS, EMNLP, and AAAI.

Effectively Scaling Reinforcement Learning for LLMs

Dec 1, 2025 Yi Wu — Tsinghua University

Abstract

RL has been an engine for recent LLM advances, from RLHF for ChatGPT, to reasoning RL for thinking models, and more recently, agentic RL for agent products. In this talk, we will discuss and address the main scaling challenges of RL training for LLMs, starting from RLHF to reasoning RL and agentic RL. We will cover three works: (1) the ReaL system for efficient RLHF (https://github.com/openpsi-project/ReaLHF, https://arxiv.org/abs/2406.14088, MLSys 2025), (2) the AReaL system for fully asynchronous reasoning RL (https://github.com/inclusionAI/AReaL, https://arxiv.org/abs/2505.24298, NeurIPS 20...

Speaker Bio

Yi Wu is an assistant professor at the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. He obtained his Ph.D. from UC Berkeley and was a researcher at OpenAI from 2019 to 2020. His research focuses on reinforcement learning, multi-agent learning, and LLM agents. His representative works include the value iteration network, the MADDPG and MAPPO algorithms, OpenAI's hide-and-seek project, and the AReaL project. He received the best paper award at NIPS 2016, was a best demo award finalist at ICRA 2024, and was named to MIT TR35 Asia Pacific 2025.

RDMA P2P Communication Patterns for KvCache Transfer, Weight Update, and MoE Routing

Nov 21, 2025 Lequn Chen — Perplexity

Abstract

As Large Language Models (LLMs) scale and Mixture-of-Experts (MoE) architectures gain prominence, inter-node communication becomes increasingly critical. Current LLM systems rely heavily on collective communication patterns through APIs like torch.distributed and NCCL, following a Single Program Multiple Data (SPMD) model that imposes unnecessary constraints on peer-to-peer data movement. This talk revisits RDMA-based peer-to-peer communication patterns for modern LLM workloads. While peer-to-peer communication is well-established, it has been largely overlooked in contemporary LLM systems. We...
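To make the SPMD constraint concrete, here is a toy pure-Python simulation, not real NCCL, torch.distributed, or RDMA code: a collective such as all-reduce requires every rank to participate in the same call, while a one-sided peer-to-peer send, e.g. shipping a KV cache from a prefill worker to a decode worker, involves only the two endpoints. All names here (`Rank`, `all_reduce`, `p2p_send`) are illustrative stand-ins.

```python
# Toy contrast between SPMD collectives and point-to-point transfers.
from queue import Queue

class Rank:
    def __init__(self, rank_id, value):
        self.rank_id = rank_id
        self.value = value
        self.inbox = Queue()   # stands in for a remotely writable buffer

def all_reduce(ranks):
    """SPMD collective: every rank must call in; the result lands everywhere."""
    total = sum(r.value for r in ranks)
    for r in ranks:
        r.value = total

def p2p_send(src, dst, payload):
    """One-sided transfer: only src and dst are involved, e.g. moving a
    KV cache from a prefill worker to a decode worker."""
    dst.inbox.put((src.rank_id, payload))

ranks = [Rank(i, value=i + 1) for i in range(4)]
all_reduce(ranks)                                        # all four ranks participate
p2p_send(ranks[0], ranks[3], {"kv_cache": [0.1, 0.2]})   # only two do
```

The point of the sketch is the asymmetry in who must participate: a collective couples every rank's control flow, while a peer-to-peer transfer leaves the other ranks free to do unrelated work.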

Speaker Bio

Lequn received his Ph.D. from the University of Washington in 2024. He is currently a Research Engineer at Perplexity AI, building a better answer engine.

GEPA and prompt optimization for compound AI systems

Nov 7, 2025 Lakshya Agrawal — UC Berkeley

Abstract

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containin...
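A minimal sketch of the genetic-Pareto candidate selection idea described above. The scoring metric and the reflection step are toy stand-ins (real GEPA uses an LLM to reflect on execution traces in natural language); `score` and `reflect_and_mutate` are hypothetical names, not GEPA's API.

```python
# Toy genetic-Pareto loop: keep the prompts that are non-dominated on
# per-instance scores, then "reflect" on their failures to mutate them.
import random

random.seed(0)

def score(prompt, instance):
    # Stand-in metric: reward prompts that mention the instance keyword.
    return 1.0 if instance in prompt else 0.0

def dominated(a_scores, b_scores):
    """True if b is at least as good everywhere and strictly better somewhere."""
    return (all(b >= a for a, b in zip(a_scores, b_scores))
            and any(b > a for a, b in zip(a_scores, b_scores)))

def pareto_frontier(candidates, instances):
    scored = [(p, [score(p, i) for i in instances]) for p in candidates]
    return [p for p, s in scored
            if not any(dominated(s, t) for q, t in scored if q != p)]

def reflect_and_mutate(prompt, instances):
    # Stand-in for LLM reflection: append a hint about a failing instance.
    failing = [i for i in instances if score(prompt, i) == 0.0]
    return prompt + " " + random.choice(failing) if failing else prompt

instances = ["sort", "search"]
pool = ["Handle sort.", "Be concise."]
for _ in range(3):   # a few evolutionary steps
    pool += [reflect_and_mutate(p, instances)
             for p in pareto_frontier(pool, instances)]
```

Keeping the whole Pareto frontier, rather than a single best prompt, is what preserves candidates that excel on different instances long enough for reflection to combine their strengths.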

Speaker Bio

Lakshya A Agrawal is a second-year PhD student in the Sky Lab at UC Berkeley, advised by Prof. Matei Zaharia and Prof. Dan Klein. His research tackles critical challenges in AI, focusing on sample-efficient agentic optimization and AI system reliability. He recently developed GEPA: a reflective prompt optimizer that can outperform reinforcement learning in sample-efficiency, and mmGRPO: an RL algorithm for tuning complex AI systems. His work on reliability includes contributions to MAST, the first taxonomy and study of multi-agent system failures, and the langProBe benchmark. Prior to Berkeley...

Rethinking Prediction for System Tuning and Architectural Modeling

Oct 17, 2025 Jonathan Balkind — UCSB

Abstract

In this talk, I will cover two recent papers from our lab, both adopting prediction in unconventional ways. Time permitting, I will also talk a little about our plans to enable hardware-enforced, privacy-preserving prediction of tenants' applications by cloud providers. To better facilitate application performance programming, we propose a software optimization strategy enabled by a novel low-latency Prediction System Service (PSS). Rather than relying on nuanced domain-specific knowledge or slapdash heuristics, a system service for prediction encourages programmers to spend their time uncoveri...

Speaker Bio

Jonathan Balkind is an Assistant Professor in the Department of Computer Science at the University of California, Santa Barbara. His research interests lie at the intersection of Computer Architecture, Programming Languages, and Operating Systems. Jonathan completed his PhD and MA degrees at Princeton University and his MSci degree at the University of Glasgow. Jonathan was an Open Hardware Trailblazer Fellow and recipient of the NSF CAREER Award. Since 2021, he has served as a Director of the FOSSi Foundation.

SOSP practice talk: Yicheng Liu (UCLA)

Oct 10, 2025 Yicheng Liu — UCLA

Abstract

Modern software inevitably encounters periods of resource overload, during which it must still sustain high service-level objective (SLO) attainment while minimizing request loss. However, achieving this balance is challenging due to subtle and unpredictable internal resource contention among concurrently executing requests. Traditional overload control mechanisms, which rely on global signals, such as queuing delays, fail to handle application resource overload effectively because they cannot accurately predict which requests will monopolize critical resources. In this paper, we propose Atropo...

Speaker Bio

Yicheng is a second-year Ph.D. student at UCLA, co-advised by Sam Kumar and Harry Xu. He was an undergraduate at Shanghai Jiao Tong University (SJTU), where he interned at the Institute of Parallel and Distributed Systems (IPADS), advised by Jinyu Gu. In summer 2024, he interned with the Systems Group at the University of Washington, advised by Baris Kasikci and mentored by Yigong Hu. In summer 2023, he was a visiting researcher at the University of Michigan's OrderLab, advised by Ryan Huang.

MoDM: Mixture of Diffusion Models to Navigate the Performance-Quality Trade-off in Image Generation

Oct 3, 2025 Nishil Talati — UIUC

Abstract

Diffusion-based text-to-image generation models trade latency for quality: small models are fast but generate lower quality images, while large models produce better images but are slow. In this talk, I will present our recent work MoDM, a novel image caching-based serving system for diffusion models that dynamically balances latency and quality through a mixture of diffusion models. The key enabler of this idea is the concept of an image cache that allows consistently high image generation quality at high performance. This design enables adaptive serving by dynamically balancing latency and i...
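A toy sketch of the cache-based routing idea from the abstract above: on a near-match in the image cache, a small model refines the cached result (fast path); on a miss, the large model generates from scratch and populates the cache. The `similarity` measure, the string-valued "models", and the threshold are all illustrative stand-ins, not MoDM's actual design.

```python
# Toy cache-based mixture-of-models server: route to a small model when a
# similar prompt's image is cached, otherwise fall back to the large model.
def similarity(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)   # Jaccard over words

class MoDMServer:
    def __init__(self, threshold=0.5):
        self.cache = {}          # prompt -> cached "image"
        self.threshold = threshold

    def generate(self, prompt):
        best = max(self.cache, key=lambda p: similarity(p, prompt), default=None)
        if best is not None and similarity(best, prompt) >= self.threshold:
            # Cache hit: a small model refines the cached image (fast path).
            return f"small_model(refine {self.cache[best]!r} for {prompt!r})"
        # Cache miss: pay for the large model once, then cache the result.
        image = f"large_model({prompt!r})"
        self.cache[prompt] = image
        return image

server = MoDMServer()
first = server.generate("a red fox in snow")        # miss: large model
second = server.generate("a red fox in deep snow")  # near hit: small model
```

The threshold is the knob that navigates the performance-quality trade-off: raising it sends more traffic to the large model.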

Speaker Bio

Nishil Talati is an Assistant Research Scientist in the CSE department at the University of Michigan and an incoming Assistant Professor in the CS department at University of Illinois, Urbana-Champaign (UIUC). His research focuses on computer architecture and systems software design to enhance the efficiency of generative AI and data analytics applications. Nishil's work has been featured in leading venues including ISCA, MICRO, HPCA, ASPLOS and VLDB, and has been recognized with several awards including Research Faculty Recognition Award, IEEE computing's top 30 early career professional awar...

MSCCL++: Rethinking GPU Communication Abstractions for AI

Sep 12, 2025 Aashaka Shah & Roshan Dathathri — Microsoft

Abstract

AI applications increasingly run on distributed and fast-evolving heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable. In this talk, we will introduce the Microsoft Collective Communication Library++ (MSCCL++), a collective communication framework that provides a design methodology for developing high-performance, portable communication kernels. MSCCL++ has (1) a low-level, performance-preserving primitive interface th...
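To give a feel for the kind of low-level primitive interface the abstract describes, here is a toy single-threaded model of one-sided put/signal/wait communication. The names and semantics are simplified stand-ins, not the actual MSCCL++ API.

```python
# Toy model of put/signal/wait-style one-sided communication primitives.
class Channel:
    """One direction of a peer-to-peer channel between two 'GPUs'."""
    def __init__(self):
        self.remote_buffer = {}   # stands in for peer memory we can write
        self.semaphore = 0

    def put(self, offset, data):
        # One-sided write into the peer's buffer; no receiver call needed.
        self.remote_buffer[offset] = data

    def signal(self):
        # Publish: tell the peer that previously put data is now visible.
        self.semaphore += 1

    def wait(self, expected):
        # Peer side: real code would spin/poll; this sketch just checks.
        assert self.semaphore >= expected, "data not ready"

ch = Channel()
ch.put(0, [1.0, 2.0])   # sender writes
ch.signal()             # sender publishes
ch.wait(expected=1)     # receiver may now safely read remote_buffer[0]
```

The appeal of exposing primitives at this level is that a kernel author can overlap puts with compute and batch signals, instead of being locked into a library's fixed collective schedule.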

Speaker Bio

Aashaka Shah is a researcher in the Research in Software Engineering (RiSE) group at Microsoft Research. She is interested in building high-performance ML systems, in particular, by optimizing GPU interconnect network utilization and efficient memory management. Her works have been published in top-tier systems, architecture, and ML conferences (NSDI, ISCA, ATC, ICLR). She graduated with her PhD from UT Austin, where she worked on problems at the intersection of systems and ML. Roshan Dathathri is a researcher in the Systems Research Group at Microsoft Research. He received his PhD from the Un...

Acceleration of diffusion models

Sep 5, 2025 Xingyang Li — MIT

Abstract

Diffusion models are capable of generating photo-realistic images and videos, showing a promising future for AIGC. However, inference speed, training speed, and memory efficiency hinder their real-world deployment as well as their long-context ability. In this talk, I will present our recent works, Radial Attention and SVDQuant. Radial Attention identifies spatiotemporal energy decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal token distance increases. Guided by this observation, we translate this energy decay into a unified and static mask wit...
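A toy static sparsity mask motivated by the energy-decay observation above: nearby token pairs always attend, while farther bands are kept at geometrically sparser strides. This mirrors the flavor of a distance-decaying static mask, not Radial Attention's exact construction; `base_window` and the stride rule are made up for illustration.

```python
# Toy distance-decaying static attention mask: dense local band, then
# geometrically sparser sampling as token distance grows.
def radial_mask(n, base_window=4):
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            if d <= base_window:
                mask[i][j] = True          # dense local band
            else:
                # Farther bands: keep every 2nd, 4th, 8th, ... column.
                stride = 2 ** (d // base_window)
                mask[i][j] = (j % stride == 0)
    return mask
```

Because the mask is static, it can be precomputed once and fused into the attention kernel, unlike dynamic sparsity that must be recomputed per input.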

Speaker Bio

Xingyang Li is a senior undergraduate at ACM Honors Class, SJTU. He is currently a student intern at MIT HAN Lab, advised by Professor Song Han. His research focuses on developing efficient algorithms and systems for deep learning, with applications in the realm of computer vision. Before starting the internship at MIT, he conducted research in algorithm-hardware co-design for vision applications like 3D Gaussian Splatting and Video Transformers, and his works were published in top-tier EDA conferences including DAC and ICCAD. He is also seeking a Ph.D. position starting in 2026 Fall.

Two fronts of AI: addressing performance and power/cooling challenges

Aug 22, 2025 Esha Choukse — Microsoft

Abstract

This talk explores two fronts of scaling AI: reducing inference latency and boosting throughput on emerging model types and usecases, and addressing the power and cooling demands of hyperscale data centers. I'll highlight platform-level optimizations that improve efficiency and responsiveness, and show how infrastructure design choices—spanning power delivery to efficient cooling—are becoming inseparable from AI system performance and sustainability.

Speaker Bio

Esha Choukse is a Principal Researcher in the Azure Research Systems team. She is currently leading the Efficient AI research project, working on cross-stack efforts to optimize the AI platform (scheduling/routing), hardware, and datacenter infrastructure for emerging GenAI workloads in the cloud, toward the goal of datacenter efficiency and sustainability.

Edge data center: experience and lessons in building the world's largest cluster of edge devices

Aug 15, 2025 Animesh Dangwal — UCSB

Abstract

Edge computing distributes cloud functionality to task-specific, resource-constrained, and low-cost devices operating at data collection points, for low-latency, low-power, and cost-effective compute. This coordination requires redesigning cloud paradigms to either communicate or compute over the edge. Rather than adapting cloud technologies for edge constraints, what if we reconfigure the edge environment to enable seamless adoption of cloud research with minimal modifications to existing algorithms and runtimes? In this work, we define this transformation by modelling an edge rack analogous to...

Speaker Bio

Animesh Dangwal is a 5th year PhD student in the Computer Science department at UC Santa Barbara, working with Professor Chandra Krintz and Professor Rich Wolski. His research interests are in edge computing, serverless computing, and distributed systems. His recent work has focused on bridging the gap between the edge and the cloud by developing flexible and sustainable edge deployments for diverse workloads and hardware.

Whole-Body Conditioned Egocentric Video Prediction

Aug 1, 2025 Yutong Bai — UC Berkeley

Abstract

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, ena...

Speaker Bio

Yutong is currently a Postdoc Researcher at UC Berkeley (BAIR), advised by Prof. Alexei (Alyosha) Efros, Prof. Jitendra Malik, and Prof. Trevor Darrell. Prior to that, she obtained her CS Ph.D. from Johns Hopkins University, advised by Prof. Alan Yuille. She previously interned at Meta AI (FAIR Labs) and Google Brain, and was selected as a 2023 Apple Scholar and an MIT EECS Rising Star. Her work was nominated for the CVPR 2022 Best Paper Award.


Algorithm–system co-design of late binding techniques for adaptive serving

July 18, 2025 Yeonju Ro — University of Texas at Austin

Abstract

LLM inference is becoming increasingly challenging as models grow in architectural complexity and are expected to handle ever-longer input contexts. Architectures like Mixture of Experts (MoE) introduce conditional computation, leading to irregular execution patterns and inefficiencies in memory usage and batching. Meanwhile, context windows often span hundreds of thousands of tokens—driven by bursty inputs and prolonged sessions—straining memory and compute resources, particularly in Transformer layers. To address these challenges, we explore the use of late binding techniques for adaptive se...
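The conditional computation that makes MoE batching irregular can be sketched in a few lines: each token activates only its top-k experts, so per-expert batch sizes vary from step to step. This is a generic top-k gate in pure Python for illustration, not the specific routing scheme from the talk.

```python
# Minimal top-k MoE gate: each token is dispatched to only k experts,
# so the per-expert batches it induces are irregular across steps.
def top_k_route(gate_logits, k=2):
    """gate_logits: list of per-expert scores for one token."""
    ranked = sorted(range(len(gate_logits)), key=lambda e: -gate_logits[e])
    return ranked[:k]   # indices of the chosen experts

def dispatch(batch_logits, k=2):
    """Group token indices by the expert that will process them."""
    per_expert = {}
    for tok, logits in enumerate(batch_logits):
        for e in top_k_route(logits, k):
            per_expert.setdefault(e, []).append(tok)
    return per_expert
```

Because `per_expert` group sizes depend on the input, a serving system cannot size expert batches statically, which is exactly the kind of decision late binding defers to runtime.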

Speaker Bio

Yeonju Ro is a Ph.D. student at the University of Texas at Austin, co-advised by Professors Aditya Akella and Atlas Wang. Her research focuses on systems for machine learning, algorithm–system co-design, and applying machine learning to systems problems. She has worked at Microsoft Azure, Meta, HP Labs, and Samsung Research. She is a recipient of the 2024 IBM PhD Fellowship.


LiquidCache, a novel pushdown-based disaggregated caching system

July 11, 2025 Xiangpeng Hao — University of Wisconsin-Madison

Abstract

We present LiquidCache, a novel pushdown-based disaggregated caching system that evaluates filters on cache servers before transmitting data to compute nodes. This design addresses our key observation that data decoding, not filter evaluation, is the primary bottleneck. LiquidCache transcodes Parquet data into a lightweight, cache-exclusive "Liquid" format that is co-designed with filter evaluation semantics, enabling selective decoding, late filter materialization, and encoding-aware filter evaluation for low decoding costs and high compression ratios, and allowing easy adoption without breaking ecosystem compa...
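A toy illustration of encoding-aware filter pushdown: the cache server keeps a dictionary-encoded column and evaluates an equality filter on the compact integer codes, so matching rows are identified without decoding the whole column. The class and field names are illustrative, not the actual "Liquid" format.

```python
# Toy encoding-aware filter pushdown: evaluate `col == literal` against
# dictionary codes on the cache server, shipping only matching row ids.
class DictEncodedColumn:
    def __init__(self, values):
        self.dictionary = sorted(set(values))
        index = {v: i for i, v in enumerate(self.dictionary)}
        self.codes = [index[v] for v in values]   # compact integer codes

    def filter_eq(self, literal):
        """Return row ids where the column equals `literal`."""
        if literal not in self.dictionary:
            return []                             # no decoding at all
        code = self.dictionary.index(literal)
        return [i for i, c in enumerate(self.codes) if c == code]

col = DictEncodedColumn(["ads", "search", "ads", "mail"])
rows = col.filter_eq("ads")   # cache server sends only these rows
```

Scanning integer codes is far cheaper than decoding values first, which is the intuition behind making decoding, not filtering, the thing to avoid.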

Speaker Bio

Xiangpeng Hao is a fifth-year PhD student at UW-Madison, advised by Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau. His research focuses on building large-scale analytical data systems. Notably, his PhD is supported by industry funding he independently raised through the LiquidCache project.

Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

June 27, 2025 Yuxuan Jiang — University of Michigan

Abstract

Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TrainCheck, a framework that takes a proactive checking approach to address silent training errors. TrainCheck automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TrainCheck, we reproduce 20 real-world silent training errors with diverse root causes. TrainCheck successfully detects 18 errors within a sing...
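A toy version of the infer-then-check workflow: derive simple numeric invariants from a known-good training trace, then check each live step against them. The invariant "language" here is deliberately tiny and made up for illustration; TrainCheck's actual invariants are richer and inferred from instrumented program state.

```python
# Toy invariant inference and proactive per-step checking for a training
# metric such as loss: infer bounds from a healthy trace, flag violations.
import math

def infer_invariants(good_trace):
    """Infer simple numeric invariants from an instrumented healthy run."""
    hi = max(good_trace)
    return {"finite": True, "upper_bound": hi * 1.5}   # slack factor

def check_step(value, invariants):
    """Proactive per-step check during a live training run."""
    violations = []
    if invariants["finite"] and not math.isfinite(value):
        violations.append("finite")
    if math.isfinite(value) and value > invariants["upper_bound"]:
        violations.append("upper_bound")
    return violations
```

Checking every step is what makes the approach proactive: a silent error like a loss spike or NaN is flagged at the step it appears, rather than being discovered after wasted epochs.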

Speaker Bio

Yuxuan Jiang is a second-year Ph.D. student in Computer Science and Engineering at the University of Michigan, advised by Prof. Ryan Huang. His research focuses on building tools to improve the reliability of cloud-scale and machine learning systems, with an emphasis on detecting silent failures and automating quality checks. He is the creator of TrainCheck, a runtime monitoring framework that proactively detects training bugs by inferring and checking invariants during deep learning training. Yuxuan is currently interning at Microsoft Research, where he is working on AIOps, and will be based ...


EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

June 20, 2025 Zhiyuan Zeng — University of Washington

Abstract

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. However, current model evaluations commonly fail to achieve these goals by reducing model performance to a single aggregate metric, thereby obscuring the model's heterogeneous performance across diverse capabilities tested within a benchmark. In this talk, I will introduce how we advance the two goals by formulating the research problem of generating a weakness profile, a set of weaknesses expressed in natural language, given a language model's performance on eve...
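A toy hierarchical capability tree in the spirit of the approach above: leaves are benchmark instances tagged with a capability path, internal nodes aggregate pass rates, and weaknesses are nodes whose pass rate falls below a threshold. The capability names and thresholding rule are made up for illustration.

```python
# Toy capability tree: aggregate pass/fail results up a capability
# hierarchy, then report subtrees with low pass rates as weaknesses.
def build_tree(results):
    """results: list of (capability_path, passed) pairs."""
    tree = {}
    for path, passed in results:
        node = tree
        for cap in path:
            node = node.setdefault(cap, {"_pass": 0, "_total": 0})
            node["_pass"] += int(passed)
            node["_total"] += 1
    return tree

def weaknesses(tree, threshold=0.5, prefix=()):
    weak = []
    for cap, node in tree.items():
        if cap.startswith("_"):
            continue                      # skip the aggregate counters
        rate = node["_pass"] / node["_total"]
        if rate < threshold:
            weak.append((prefix + (cap,), rate))
        weak += weaknesses(node, threshold, prefix + (cap,))
    return weak

results = [(("math", "algebra"), True), (("math", "algebra"), False),
           (("math", "geometry"), False), (("code",), True)]
weak = weaknesses(build_tree(results), threshold=0.5)
```

Unlike a single aggregate score, the tree localizes the failure: here the weakness surfaces at "math" and specifically at "math/geometry", not at "code".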

Speaker Bio

Zhiyuan Zeng is a first-year Ph.D. student in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, advised by Hannaneh Hajishirzi and Pang Wei Koh. Previously, he received his bachelor's degree from the Department of Computer Science and Technology at Tsinghua University in China, where he worked with Danqi Chen at Princeton University and Zhiyuan Liu at Tsinghua University. Zhiyuan is a recipient of the 2022 SenseTime Scholarship, the 2022 China National Scholarship for undergraduate students, and a Gold Medal in the National Olympiad in Informatics (NOI...
