Some Good Papers on Systems for AI (LLMs, DNNs, etc.)
OSDI 2025
- PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD
- WaferLLM: Large Language Model Inference at Wafer Scale
- Understanding Stragglers in Large Model Training Using What-if Analysis
- Quake: Adaptive Indexing for Vector Search
- DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
- NanoFlow: Towards Optimal Large Language Model Serving Throughput
ISCA 2025
- H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
- LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips
- AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
- LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
- DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign
- Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression
- Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
- HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
- NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly
- REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
- MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
- WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
- Hermes: Algorithm-System Co-design for Efficient Retrieval-Augmented Generation At-Scale
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution
- Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units
- In-Storage Acceleration of Retrieval Augmented Generation as a Service
- Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel’s Gaudi NPU for AI Model Serving
- MagiCache: A Virtual In-Cache Computing Engine
- Folded Banks: 3D-Stacked HBM Design for Fine-Grained Random-Access Bandwidth
- BingoGCN: Towards Scalable and Efficient GNN Acceleration with Fine-Grained Partitioning and SLT
ATC 2025
- Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
- Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- WEAVER: Efficient Multi-LLM Serving with Attention Offloading
- CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge
- TOPPINGS: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference
- AssyLLM: Efficient Federated Fine-tuning of LLMs via Assembling Pre-trained Blocks
- DeepServe: Serverless Large Language Model Serving at Scale
- GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs
- QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
- Obscura: Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation
- Resource Multiplexing in Tuning and Serving Large Language Models
- KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
- LogCrisp: Fast Aggregated Analysis on Large-scale Compressed Logs by Enabling Two-Phase Pattern Extraction and Vectorized Queries
- SNARY: A High-Performance and Generic SmartNIC-accelerated Retrieval System
HPCA 2025
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
- MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- throttLLeM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
- LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
- VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
- InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
- PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
- FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
- Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
- Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
- eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models
- NeuVSA: A Unified and Efficient Accelerator for Neural Vector Search