Some Good Papers on Systems for AI (LLMs, DNNs, etc.)
OSDI 2025
- PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD
- WaferLLM: Large Language Model Inference at Wafer Scale
- Understanding Stragglers in Large Model Training Using What-if Analysis
- Quake: Adaptive Indexing for Vector Search
- DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
- Enabling Efficient GPU Communication over Multiple NICs with FuseLink
- WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
- NanoFlow: Towards Optimal Large Language Model Serving Throughput
ISCA 2025
- H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
- LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips
- AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
- LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
- DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign
- Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression
- Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
- HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
- NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly
- REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
- MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
- WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
- Hermes: Algorithm-System Co-design for Efficient Retrieval-Augmented Generation At-Scale
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution
- Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units
- In-Storage Acceleration of Retrieval Augmented Generation as a Service
- Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel’s Gaudi NPU for AI Model Serving
- MagiCache: A Virtual In-Cache Computing Engine
- Folded Banks: 3D-Stacked HBM Design for Fine-Grained Random-Access Bandwidth
- BingoGCN: Towards Scalable and Efficient GNN Acceleration with Fine-Grained Partitioning and SLT
ATC 2025
- Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
- Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- WEAVER: Efficient Multi-LLM Serving with Attention Offloading
- CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge
- TOPPINGS: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference
- AssyLLM: Efficient Federated Fine-tuning of LLMs via Assembling Pre-trained Blocks
- DeepServe: Serverless Large Language Model Serving at Scale
- GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs
- QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
- Obscura: Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation
- Resource Multiplexing in Tuning and Serving Large Language Models
- KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
- LogCrisp: Fast Aggregated Analysis on Large-scale Compressed Logs by Enabling Two-Phase Pattern Extraction and Vectorized Queries
- SNARY: A High-Performance and Generic SmartNIC-accelerated Retrieval System
HPCA 2025
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
- MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- throttLLeM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
- LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
- VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
- InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
- PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
- FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
- Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
- Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
- eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models
- NeuVSA: A Unified and Efficient Accelerator for Neural Vector Search