Mingyuan Ma


I am a Software Engineer at NVIDIA, working on LLM Inference Workload Performance in the Compute Architecture Group. I conduct end-to-end inference performance benchmarking and analysis, and build automation infrastructure for HPC benchmarking.

I received my M.S. in Data Science from Harvard University, with cross-registration in EECS at MIT. Before that, I completed my B.A. degrees in Computer Science and Statistics (double major, with High Distinction honors) at UC Berkeley.

My research interests include LLM Inference Systems, Efficient Deep Learning, and Continual Learning. I have collaborated with SGLang / Sky Computing Lab at UC Berkeley on distributed GPU-sharing inference systems, with Microsoft Research Asia on reasoning frameworks for Small Language Models, and with HPC-AI Lab at NUS on continual learning of vision-language models. I also worked at Moonshot AI (Kimi) on efficient LLM architectures.



News

Jul 1, 2025 Joined NVIDIA as a Software Engineer working on LLM Inference Workload Performance in the Compute Architecture Group :computer:
May 20, 2025 I graduated from Harvard University with an M.S. in Data Science :mortar_board:
Jan 20, 2025 Our paper “Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers” has been accepted to ICLR 2025 :star2:
Jan 15, 2025 Our paper “Octopus: On-device language model for function calling of software APIs” has been accepted to the NAACL 2025 Industry Track (Oral) :sparkles:
Oct 1, 2024 Started collaborating with SGLang / Sky Computing Lab at UC Berkeley on distributed GPU-sharing inference systems :rocket:
Jul 13, 2023 Our paper “Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models” has been accepted to ICCV 2023 :star2:
May 13, 2023 I graduated from UC Berkeley with High Distinction, double majoring in Statistics and Computer Science :star2:
Mar 20, 2023 I will start my Master’s degree in Data Science at Harvard SEAS :sparkles:


Publications

  1. Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models
    Zangwei Zheng, Mingyuan Ma, Kai Wang, and 3 more authors
    ICCV 2023, 2023
  2. Octopus: On-device language model for function calling of software APIs
    Wei Chen, Zhiyuan Li, and Mingyuan Ma
    NAACL 2025 Industry Track (Oral), 2024
  3. Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
    Zhenting Qi, Mingyuan Ma, Jiahang Xu, and 3 more authors
    ICLR 2025, 2024
  4. ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
    Huandong Chang, Zicheng Ma, Mingyuan Ma, and 4 more authors
    arXiv preprint, 2025
  5. Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
    Shan Yu, Jiarong Xing, Yifan Qiao, and 4 more authors
    Under Review at OSDI 2026, 2025