Abstract & TL;DR
MASPRM augments multi-agent systems with a process reward model that scores partial conversations at inference time. By estimating per-agent progress on the fly, MASPRM steers search algorithms toward promising reasoning branches and prunes unproductive ones, improving reliability without additional human supervision.
When paired with outcome reward models, MASPRM delivers sizeable gains on math reasoning benchmarks such as GSM8K and MATH while respecting fixed compute budgets. The same controller transfers across tasks, demonstrating strong generalization and plug-and-play compatibility with existing verifier-style decoders.
+30.7
Exact-match (EM) lift on GSM8K over a straight-through MAS baseline.
+22.9
EM improvement on MATH at fixed compute using MASPRM-guided decoding.
+8.4
Additional EM on MATH when reusing a GSM8K-trained controller.
Key Highlights
Process-Level Value Modeling
MASPRM estimates per-agent returns for partial transcripts, enabling fine-grained control during inference rather than relying solely on final-answer verifiers; a minimal scoring sketch follows these highlights.
Plug-in Controller for Search
The model slots into beam search and MCTS, allocating computation to high-value branches while pruning early failures to respect tight budgets.
Self-Supervised Training Signal
Process rewards are generated automatically by propagating returns from multi-agent rollouts, eliminating the need for step-level human labels.
Generalization Across Tasks
Controllers trained on one benchmark transfer to new domains with minimal adaptation, complementing existing outcome reward models for reliable reasoning.
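To make the per-turn scoring interface described above concrete, here is a minimal sketch under illustrative assumptions (`Turn`, `ProcessRewardModel`, and `score_turns` are hypothetical names, not the released code): the process reward model maps each transcript prefix, one per agent turn, to a scalar value estimate that downstream search can act on.

```python
# Hypothetical sketch of per-turn value scoring; names are illustrative,
# not the MASPRM repository's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    agent: str    # which agent produced this message
    content: str  # the message text

# Here a "process reward model" is any callable mapping the transcript
# prefix ending at a turn to an estimated return in [0, 1].
ProcessRewardModel = Callable[[List[Turn]], float]

def score_turns(transcript: List[Turn], prm: ProcessRewardModel) -> List[float]:
    """Score every prefix of the conversation, one value per agent turn."""
    return [prm(transcript[: i + 1]) for i in range(len(transcript))]

if __name__ == "__main__":
    # Toy stand-in scorer: favor prefixes whose latest message contains an equation.
    toy_prm: ProcessRewardModel = lambda prefix: 0.9 if "=" in prefix[-1].content else 0.3

    convo = [
        Turn("planner", "Split the problem into two sub-goals."),
        Turn("solver", "Sub-goal 1: 12 * 7 = 84."),
        Turn("critic", "Looks right; proceed to sub-goal 2."),
    ]
    print(score_turns(convo, toy_prm))  # [0.3, 0.9, 0.3]
```

In practice the scorer would be the trained MASPRM network; the toy lambda only demonstrates the prefix-scoring shape that the rest of this page builds on.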
Method Overview
1. Training with MCTS Rollouts
MASPRM collects trajectories from cooperative multi-agent reasoning runs augmented with Monte Carlo Tree Search. Returns are propagated to every agent’s intermediate actions, creating dense targets aligned with global success.
- Sample multi-agent conversations while expanding the search frontier using MCTS.
- Attribute credit to individual agent turns by backing up rollout returns.
- Distill these signals into a neural value model conditioned on partial transcripts (see the sketch after this list).
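The sketch below illustrates the credit-assignment step under simplifying assumptions (the `Rollout` alias and `backup_targets` helper are hypothetical): terminal returns from multi-agent rollouts are averaged over every shared transcript prefix, yielding dense Monte Carlo value targets for the distillation step.

```python
# Illustrative sketch of target generation from rollouts; names are
# assumptions for this example, not the paper's exact code.
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# A rollout is the ordered list of (agent, message) turns plus a terminal
# return (e.g. 1.0 if the final answer was verified correct, else 0.0).
Rollout = Tuple[Sequence[Tuple[str, str]], float]

def backup_targets(rollouts: List[Rollout]) -> Dict[Tuple[Tuple[str, str], ...], float]:
    """Average terminal returns over all rollouts sharing each transcript prefix."""
    returns_by_prefix: Dict[Tuple[Tuple[str, str], ...], List[float]] = defaultdict(list)
    for turns, final_return in rollouts:
        for i in range(1, len(turns) + 1):
            returns_by_prefix[tuple(turns[:i])].append(final_return)
    # The Monte Carlo value target for a prefix is the mean return observed after it.
    return {prefix: sum(rs) / len(rs) for prefix, rs in returns_by_prefix.items()}

if __name__ == "__main__":
    rollouts = [
        ([("planner", "split"), ("solver", "12*7=84")], 1.0),
        ([("planner", "split"), ("solver", "12*7=74")], 0.0),
    ]
    for prefix, target in backup_targets(rollouts).items():
        print(f"{len(prefix)}-turn prefix -> {target:.2f}")
```

Prefixes visited by many rollouts receive lower-variance targets, which is one reason search-generated data can stand in for step-level human labels.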
2. Inference-Time Control
During decoding, MASPRM evaluates every candidate branch and drives search policies to focus on trajectories with the highest estimated utility. It naturally integrates with verifier models that score terminal answers.
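A minimal sketch of how such a controller could steer a beam, using hypothetical helpers (`prune_beam`, `select_answer`) and toy stand-ins for the process and outcome reward models: candidate branches are ranked by the PRM at each expansion round, and the outcome verifier adjudicates among finished transcripts.

```python
# Hypothetical sketch of PRM-guided pruning plus verifier-based answer
# selection; helpers and scorers are illustrative, not the released API.
from typing import Callable, List, Sequence

Branch = List[str]  # a partial transcript, one string per agent turn

def prune_beam(
    branches: Sequence[Branch],
    prm: Callable[[Branch], float],
    beam_width: int,
) -> List[Branch]:
    """Keep only the beam_width branches with the highest process-reward scores."""
    return sorted(branches, key=prm, reverse=True)[:beam_width]

def select_answer(finished: Sequence[Branch], orm: Callable[[Branch], float]) -> Branch:
    """Let the outcome reward model (verifier) choose among terminal transcripts."""
    return max(finished, key=orm)

if __name__ == "__main__":
    toy_prm = lambda b: -len(b[-1])           # stand-in: prefer terse recent turns
    toy_orm = lambda b: float("84" in b[-1])  # stand-in: verify the final answer
    beams = [["12*7=84"], ["12 times 7 is, let me think, maybe 74"], ["12*7=74"]]
    survivors = prune_beam(beams, toy_prm, beam_width=2)
    print(select_answer(survivors, toy_orm))  # ['12*7=84']
```

The same per-branch ranking signal can also serve as a node value estimate inside MCTS rather than a flat beam.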
Figure: Controller Outputs
Results & Ablations
MASPRM consistently raises exact-match accuracy on demanding math reasoning benchmarks while keeping generation budgets fixed. Paired with outcome reward models that vet final answers, it yields an inference stack that explores aggressively while staying compute-aware.
| Benchmark | Decoder Setup | MASPRM Impact | Notes | 
|---|---|---|---|
| GSM8K | MAS agents + Outcome Reward Model | +30.7 EM over single straight-through pass | Improved stability during long-chain reasoning. | 
| MATH | MAS agents + Outcome Reward Model | +22.9 EM at the same compute budget | MCTS guided by MASPRM focuses on viable derivations early. | 
| MATH (zero-shot) | Controller trained on GSM8K | +8.4 EM without retraining | Demonstrates cross-task generalization of process rewards. | 
Resources
Reproducibility
The repository includes instructions for generating MASPRM training rollouts, fine-tuning controllers, and integrating them into beam search and MCTS pipelines.
Evaluation Protocols
Detailed compute budgets, rollout depths, and verifier configurations are provided to ease comparison with future multi-agent reasoning systems.
Extensibility
MASPRM is designed as a lightweight plug-in compatible with diverse team structures and scales, making it straightforward to adapt to new domains.
BibTeX
@article{yazdani2025masprm,
  title={{MASPRM}: Multi-Agent System Process Reward Model},
  author={Yazdani, Milad and Mostajabdaveh, Mahdi and Zhou, Zirui and Xiong, Ying},
  journal={arXiv preprint arXiv:2510.24803},
  year={2025}
}