“Adaptive Inference in Large Language Models”
Transformer-based large language models (LLMs) have achieved remarkable success, yet many challenges remain. In this talk, I will address a fundamental question: Do all tokens require the same amount of computation within a Transformer? I will share insights into this question and introduce our dynamic layer-skipping and attention-skipping algorithms for adaptive inference in LLMs. Our findings show that many layers, and the majority of global attention, can be safely skipped without degrading output quality. These skipping patterns reveal a substantial amount of underutilized compute within Transformers, which can be repurposed to generate multiple tokens using only a subset of layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters or auxiliary routines and requires no post-generation verification. Despite being trained on a limited dataset and implemented without code-level optimization, the approach demonstrates promising results on a fine-tuned Qwen3-4B model, achieving up to a 2× speedup. Scaling analysis further suggests that substantially larger gains are possible with more extensive training data. These findings also have implications for the design of next-generation hardware accelerators.
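To make the layer-skipping idea concrete, here is a minimal toy sketch of per-token dynamic layer skipping. Everything in it (the skip rule, the scalar "hidden state", the function names) is hypothetical and for illustration only; it is not the algorithm presented in the talk.

```python
# Toy illustration of dynamic layer skipping: a per-token rule decides
# whether each "layer" runs or is bypassed. Hypothetical sketch only.

def make_layer(weight):
    # Stand-in for a Transformer layer: a simple residual-style update.
    return lambda h: h + weight * h

def should_skip(hidden, layer_idx, threshold=0.05):
    # Hypothetical skip rule (illustrative): bypass odd-indexed layers
    # once the hidden state exceeds a threshold.
    return layer_idx > 0 and abs(hidden) > threshold and layer_idx % 2 == 1

def forward(hidden, layers):
    executed = []
    for i, layer in enumerate(layers):
        if should_skip(hidden, i):
            continue  # token bypasses this layer entirely
        hidden = layer(hidden)
        executed.append(i)
    return hidden, executed

layers = [make_layer(0.1) for _ in range(6)]
out, used = forward(1.0, layers)
print(used)  # only a subset of the 6 layers actually ran
```

In a real model the skip decision would be made per token from learned signals rather than a fixed heuristic; the point is only that tokens can traverse different subsets of layers, freeing compute for ideas like multi-token decoding.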
Xifeng Yan is a professor at the University of California, Santa Barbara, where he holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2006 and was a research staff member at the IBM T. J. Watson Research Center from 2006 to 2008. His current research focuses on exploring foundation models in artificial intelligence, leveraging these models for knowledge discovery, and developing cross-disciplinary applications. His work has been widely cited and recognized with numerous honors. His team developed the first Transformer-based time series forecasting model, initiating a new research direction in the field.
Date/Time:
Feb 10, 2026
4:00 pm - 5:45 pm
Location:
3400 Boelter Hall
420 Westwood Plaza, Los Angeles, CA 90095