Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Shan Yu, Jiarong Xing, Yifan Qiao, and 4 more authors
Under Review at OSDI 2026, 2025
We present Prism, a distributed GPU-sharing inference system for serving multiple LLMs. On real-world workloads, Prism achieves a 3.3x improvement in SLO attainment and a 2x reduction in GPU cost through workload-aware balancing, request scheduling, and model migration with elastic KV cache support.