Abstract & TL;DR
MASPRM augments multi-agent systems with a process reward model that scores partial conversations at inference time. By estimating per-agent progress on the fly, MASPRM steers search algorithms toward promising reasoning branches and prunes unproductive ones, improving reliability without additional human supervision.
When paired with outcome reward models, MASPRM delivers sizeable gains on math reasoning benchmarks such as GSM8K and MATH while respecting fixed compute budgets. The same controller transfers across tasks, demonstrating strong generalization and plug-and-play compatibility with existing verifier-style decoders.
+30.7
Exact-match (EM) lift on GSM8K over a straight-through MAS baseline.
+22.9
EM improvement on MATH at fixed compute using MASPRM-guided decoding.
+8.4
Additional EM on MATH when reusing a GSM8K-trained controller.
Key Highlights
Process-Level Value Modeling
MASPRM estimates per-agent returns for partial transcripts, enabling fine-grained control during inference rather than relying solely on final-answer verifiers; a minimal scoring sketch follows these highlights.
Plug-in Controller for Search
The model slots into beam search and MCTS, allocating computation to high-value branches while pruning early failures to respect tight budgets.
Self-Supervised Training Signal
Process rewards are generated automatically by propagating returns from multi-agent rollouts, eliminating the need for step-level human labels.
Generalization Across Tasks
Controllers trained on one benchmark transfer to new domains with minimal adaptation, complementing existing outcome reward models for reliable reasoning.
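To make the per-turn scoring interface described above concrete, here is a minimal sketch under illustrative assumptions (`Turn`, `ProcessRewardModel`, and `score_turns` are hypothetical names, not the released code): the process reward model maps each transcript prefix, one per agent turn, to a scalar value estimate that downstream search can act on.

```python
# Hypothetical sketch of per-turn value scoring; names are illustrative,
# not the MASPRM repository's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    agent: str    # which agent produced this message
    content: str  # the message text

# Here a "process reward model" is any callable mapping the transcript
# prefix ending at a turn to an estimated return in [0, 1].
ProcessRewardModel = Callable[[List[Turn]], float]

def score_turns(transcript: List[Turn], prm: ProcessRewardModel) -> List[float]:
    """Score every prefix of the conversation, one value per agent turn."""
    return [prm(transcript[: i + 1]) for i in range(len(transcript))]

if __name__ == "__main__":
    # Toy stand-in scorer: favor prefixes whose latest message contains an equation.
    toy_prm: ProcessRewardModel = lambda prefix: 0.9 if "=" in prefix[-1].content else 0.3

    convo = [
        Turn("planner", "Split the problem into two sub-goals."),
        Turn("solver", "Sub-goal 1: 12 * 7 = 84."),
        Turn("critic", "Looks right; proceed to sub-goal 2."),
    ]
    print(score_turns(convo, toy_prm))  # [0.3, 0.9, 0.3]
```

In practice the scorer would be the trained MASPRM network; the toy lambda only demonstrates the prefix-scoring shape that the rest of this page builds on.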
Method Overview
1. Training with MCTS Rollouts
MASPRM collects trajectories from cooperative multi-agent reasoning runs augmented with Monte Carlo Tree Search. Returns are propagated to every agent’s intermediate actions, creating dense targets aligned with global success.
- Sample multi-agent conversations while expanding the search frontier using MCTS.
- Attribute credit to individual agent turns by backing up rollout returns.
- Distill these signals into a neural value model conditioned on partial transcripts (see the sketch after this list).
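The sketch below illustrates the credit-assignment step under simplifying assumptions (the `Rollout` alias and `backup_targets` helper are hypothetical): terminal returns from multi-agent rollouts are averaged over every shared transcript prefix, yielding dense Monte Carlo value targets for the distillation step.

```python
# Illustrative sketch of target generation from rollouts; names are
# assumptions for this example, not the paper's exact code.
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# A rollout is the ordered list of (agent, message) turns plus a terminal
# return (e.g. 1.0 if the final answer was verified correct, else 0.0).
Rollout = Tuple[Sequence[Tuple[str, str]], float]

def backup_targets(rollouts: List[Rollout]) -> Dict[Tuple[Tuple[str, str], ...], float]:
    """Average terminal returns over all rollouts sharing each transcript prefix."""
    returns_by_prefix: Dict[Tuple[Tuple[str, str], ...], List[float]] = defaultdict(list)
    for turns, final_return in rollouts:
        for i in range(1, len(turns) + 1):
            returns_by_prefix[tuple(turns[:i])].append(final_return)
    # The Monte Carlo value target for a prefix is the mean return observed after it.
    return {prefix: sum(rs) / len(rs) for prefix, rs in returns_by_prefix.items()}

if __name__ == "__main__":
    rollouts = [
        ([("planner", "split"), ("solver", "12*7=84")], 1.0),
        ([("planner", "split"), ("solver", "12*7=74")], 0.0),
    ]
    for prefix, target in backup_targets(rollouts).items():
        print(f"{len(prefix)}-turn prefix -> {target:.2f}")
```

Prefixes visited by many rollouts receive lower-variance targets, which is one reason search-generated data can stand in for step-level human labels.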
2. Inference-Time Control
During decoding, MASPRM evaluates every candidate branch and drives search policies to focus on trajectories with the highest estimated utility. It naturally integrates with verifier models that score terminal answers.
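A minimal sketch of how such a controller could steer a beam, using hypothetical helpers (`prune_beam`, `select_answer`) and toy stand-ins for the process and outcome reward models: candidate branches are ranked by the PRM at each expansion round, and the outcome verifier adjudicates among finished transcripts.

```python
# Hypothetical sketch of PRM-guided pruning plus verifier-based answer
# selection; helpers and scorers are illustrative, not the released API.
from typing import Callable, List, Sequence

Branch = List[str]  # a partial transcript, one string per agent turn

def prune_beam(
    branches: Sequence[Branch],
    prm: Callable[[Branch], float],
    beam_width: int,
) -> List[Branch]:
    """Keep only the beam_width branches with the highest process-reward scores."""
    return sorted(branches, key=prm, reverse=True)[:beam_width]

def select_answer(finished: Sequence[Branch], orm: Callable[[Branch], float]) -> Branch:
    """Let the outcome reward model (verifier) choose among terminal transcripts."""
    return max(finished, key=orm)

if __name__ == "__main__":
    toy_prm = lambda b: -len(b[-1])           # stand-in: prefer terse recent turns
    toy_orm = lambda b: float("84" in b[-1])  # stand-in: verify the final answer
    beams = [["12*7=84"], ["12 times 7 is, let me think, maybe 74"], ["12*7=74"]]
    survivors = prune_beam(beams, toy_prm, beam_width=2)
    print(select_answer(survivors, toy_orm))  # ['12*7=84']
```

The same per-branch ranking signal can also serve as a node value estimate inside MCTS rather than a flat beam.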
Figure: Controller Outputs
Results & Ablations
MASPRM consistently raises exact-match accuracy on demanding math reasoning benchmarks while keeping generation budgets fixed. Paired with outcome reward models that vet final answers, it yields an inference stack that explores aggressively while staying compute-aware.
| Benchmark | Decoder Setup | MASPRM Impact | Notes | 
|---|---|---|---|
| GSM8K | MAS agents + Outcome Reward Model | +30.7 EM over single straight-through pass | Improved stability during long-chain reasoning. | 
| MATH | MAS agents + Outcome Reward Model | +22.9 EM at the same compute budget | MCTS guided by MASPRM focuses on viable derivations early. | 
| MATH (zero-shot) | Controller trained on GSM8K | +8.4 EM without retraining | Demonstrates cross-task generalization of process rewards. | 
Resources
Reproducibility
The repository includes instructions for generating MASPRM training rollouts, fine-tuning controllers, and integrating them into beam search and MCTS pipelines.
Evaluation Protocols
Detailed compute budgets, rollout depths, and verifier configurations are provided to ease comparison with future multi-agent reasoning systems.
Extensibility
MASPRM is designed as a lightweight plug-in compatible with diverse team structures and scales, making it straightforward to adapt to new domains.
BibTeX
@article{yazdani2025masprm,
  title={{MASPRM}: Multi-Agent System Process Reward Model},
  author={Yazdani, Milad and Mostajabdaveh, Mahdi and Zhou, Zirui and Xiong, Ying},
  journal={arXiv preprint arXiv:2510.24803},
  year={2025}
}