Jiahao Zhang & Zifan He

Congratulations to PhD students Jiahao Zhang and Zifan He for winning the Adaptive Computing category of the 2025 AMD Open Hardware competition. Supervised by Professor Jason Cong, the Volgenau Chair for Engineering Excellence at UCLA, the team won the award for their project “FPGA-Optimized Large Language Model Inference: High-Speed and Accurate Design with SpinQuant.” This recognition highlights their contributions at the intersection of algorithm design, quantization, and domain-specific acceleration for large language models (LLMs).

Using FlexLLM, a composable High-Level Synthesis (HLS) library for rapidly building FPGA-based LLM accelerators with hybrid temporal–spatial dataflow and state-of-the-art hardware-efficient quantization, the team implemented a complete inference system for the Llama-3.2 1B model in under two months and in fewer than 1,000 lines of code. The resulting stage-specialized accelerator combines high-accuracy quantization that surpasses the SpinQuant baseline with up to 4.71× end-to-end speedup and 4.13× energy-efficiency gains over an NVIDIA A100 GPU.

Their design also integrates a Hierarchical Memory Transformer (HMT) plug-in that drastically reduces prefill latency and extends the effective context window, enabling efficient long-context LLM inference on AMD FPGAs.

A demo of this work was showcased at the annual PRISM Center Review.

Watch the demonstration here: https://youtu.be/6VsRv5FKEsg.