MASPRM: Multi-Agent System Process Reward Model

A compute-aware controller that assigns per-action, per-agent values to multi-agent reasoning trajectories, enabling selective inference-time exploration with beam search and Monte Carlo Tree Search.

arXiv: 2510.24803 · Categories: cs.MA, cs.AI · First posted: 27 Oct 2025
Authors: Milad Yazdani · Mahdi Mostajabdaveh · Zirui Zhou · Ying Xiong

Abstract & TL;DR

MASPRM augments multi-agent systems with a process reward model that scores partial conversations at inference time. By estimating per-agent progress on-the-fly, MASPRM steers search algorithms toward promising reasoning branches and prunes unproductive ones, improving reliability without additional human supervision.

Trained purely from multi-agent Monte Carlo Tree Search rollouts; no step-level human annotations are required.

When paired with outcome reward models, MASPRM delivers sizeable gains on open-ended math reasoning benchmarks while respecting compute budgets. The same controller transfers across tasks, demonstrating strong generalization and plug-and-play compatibility with existing verifier-style decoders.

  • GSM8K: +30.7 exact match (EM) lift over a straight-through MAS baseline.
  • MATH: +22.9 EM improvement at fixed compute using MASPRM-guided decoding.
  • Zero-shot transfer: +8.4 additional EM on MATH when reusing a GSM8K-trained controller.

Key Highlights

Process-Level Value Modeling

MASPRM estimates per-agent returns for partial transcripts, enabling fine-grained control during inference rather than relying solely on final-answer verifiers.

Plug-in Controller for Search

The model slots into beam search and MCTS policies, allocating computation to high-value branches while pruning early failures to respect tight budgets.

Self-Supervised Training Signal

Process rewards are generated automatically by propagating returns from multi-agent rollouts, eliminating the need for step-level human labels.

Generalization Across Tasks

Controllers trained on one benchmark transfer to new domains with minimal adaptation, complementing existing outcome reward models for reliable reasoning.

Method Overview

1. Training with MCTS Rollouts

MASPRM collects trajectories from cooperative multi-agent reasoning runs augmented with Monte Carlo Tree Search. Returns are propagated to every agent’s intermediate actions, creating dense targets aligned with global success.

  • Sample multi-agent conversations while expanding the search frontier using MCTS.
  • Attribute credit to individual agent turns by backing up rollout returns.
  • Distill these signals into a neural value model conditioned on partial transcripts (sketched below).
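
The credit-assignment step can be pictured with a short sketch. The following is a minimal illustration under assumed data structures, not the authors' released code: it assumes each rollout is stored as a list of agent turns plus a scalar terminal return, and backs that return up to every prefix to create dense value targets. Names such as Turn, Rollout, and make_training_examples are hypothetical.

from dataclasses import dataclass

@dataclass
class Turn:
    agent_id: str   # which agent produced this action
    text: str       # the agent's message for this turn

@dataclass
class Rollout:
    turns: list[Turn]   # full multi-agent conversation from one MCTS rollout
    ret: float          # terminal return, e.g. 1.0 if the final answer verifies

def make_training_examples(rollouts: list[Rollout], gamma: float = 1.0):
    """Back up each rollout's terminal return to every intermediate turn."""
    examples = []
    for r in rollouts:
        T = len(r.turns)
        for t in range(T):
            prefix = r.turns[: t + 1]                # transcript up to turn t
            target = (gamma ** (T - 1 - t)) * r.ret  # discounted back-up
            examples.append((prefix, r.turns[t].agent_id, target))
    return examples

Each resulting (prefix, agent, target) triple becomes one regression example for the value model.
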
2. Inference-Time Control

During decoding, MASPRM evaluates every candidate branch and steers the search policy toward trajectories with the highest estimated utility. It integrates naturally with verifier models that score terminal answers; a minimal search sketch follows the list below.

Controller Outputs
  • Per-agent values: guide coordination and specialization.
  • Early pruning: retires low-reward branches before they consume compute.
  • Budget awareness: balances depth vs. breadth under a fixed compute cap.
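
To make the control loop concrete, here is a minimal beam-search sketch under assumed interfaces, not the paper's released implementation: expand proposes candidate next turns for a partial transcript, masprm_score is the process reward model, and verify_answer is an outcome reward model used for final reranking.

def masprm_beam_search(root, expand, masprm_score, verify_answer,
                       beam_width=4, max_depth=8, prune_below=0.1):
    """Beam search over multi-agent transcripts, guided by process values."""
    beams, finished = [root], []
    for _ in range(max_depth):
        candidates = []
        for state in beams:
            children = expand(state)    # propose candidate next agent turns
            if not children:            # no expansion: transcript is complete
                finished.append(state)
                continue
            for child in children:
                value = masprm_score(child)   # per-action, per-agent value
                if value >= prune_below:      # retire weak branches early
                    candidates.append((value, child))
        if not candidates:
            break
        candidates.sort(key=lambda vc: vc[0], reverse=True)
        beams = [child for _, child in candidates[:beam_width]]
    finished.extend(beams)
    # Rerank surviving transcripts with the outcome reward model.
    return max(finished, key=verify_answer)

MCTS integration is analogous: the same per-action values can play the role of node value estimates during selection and backup.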

Results & Ablations

MASPRM consistently raises exact-match accuracy on demanding math reasoning benchmarks while keeping generation budgets fixed. It pairs seamlessly with outcome reward models that vet final answers, yielding an inference stack that is both exploratory and compute-aware.

| Benchmark | Decoder Setup | MASPRM Impact | Notes |
|-----------|---------------|---------------|-------|
| GSM8K | MAS agents + outcome reward model | +30.7 EM over a single straight-through pass | Improved stability during long-chain reasoning. |
| MATH | MAS agents + outcome reward model | +22.9 EM at the same compute budget | MCTS guided by MASPRM focuses on viable derivations early. |
| MATH (zero-shot) | Controller trained on GSM8K | +8.4 EM without retraining | Demonstrates cross-task generalization of process rewards. |

Ablations show that removing MASPRM reverts performance to baseline levels, confirming that per-agent process values, not just outcome verifiers, drive the observed gains.

Resources

Reproducibility

The repository includes instructions for generating MASPRM training rollouts, fine-tuning controllers, and integrating them into beam search and MCTS pipelines.

Evaluation Protocols

Detailed compute budgets, rollout depths, and verifier configurations are provided to ease comparison with future multi-agent reasoning systems.

Extensibility

MASPRM is designed as a lightweight plug-in compatible with diverse team structures and scales, making it straightforward to adapt to new domains; a hypothetical interface sketch follows.
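
As one illustration of what such a plug-in boundary could look like, the following hypothetical Python interface shows the minimal contract a process reward model needs to expose; the repository's actual API may differ.

from typing import Protocol, Sequence

class ProcessRewardModel(Protocol):
    def score(self, transcript: Sequence[str], agent_id: str) -> float:
        """Return the estimated value of a partial transcript for one agent."""
        ...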

BibTeX

@article{yazdani2025masprm,
  title={{MASPRM}: Multi-Agent System Process Reward Model},
  author={Yazdani, Milad and Mostajabdaveh, Mahdi and Zhou, Zirui and Xiong, Ying},
  journal={arXiv preprint arXiv:2510.24803},
  year={2025}
}