When Does Multi-Agent Collaboration Help? An Entropy Perspective

Figure 1: Accuracy comparison between Single-Agent Systems (SAS) and Multi-Agent Systems (MAS) across six reasoning benchmarks. SAS achieves the highest accuracy in 13 out of 30 configurations (43.3%).

Background & Motivation

The Promise of MAS

Multi-agent systems (MAS) have emerged as a prominent paradigm for deploying LLMs on complex tasks. Yet the mechanisms governing their effectiveness remain poorly understood: the underlying rationales for why MAS succeed or fail, particularly for systems built on publicly available LLMs, are largely unexplored.

An Entropy Perspective

We revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by tracing entropy transitions during problem-solving across various topologies, 6 reasoning benchmarks, and 2 agentic tasks. Analyzing 245 features spanning token-, agent-, and round-level entropy, we arrive at four key findings.

Key Findings

Counterintuitive Result

Single-agent systems outperform MAS in ~43.3% of cases (13 out of 30 configurations), challenging the assumption that multi-agent collaboration is always beneficial.

Certainty Preference

Peak entropy directly harms MAS correctness, while stable low entropy directly benefits it. MAS effectiveness is largely determined by first-round entropy dynamics.

Base Entropy

Base models with lower entropy during problem-solving are a causal driver of MAS performance (ATE = -0.12, p < 10^-21), validated through rigorous causal inference.

Task Awareness

Optimal entropy profiles are task-dependent: simple tasks require fast convergence to low entropy, while hard tasks benefit from controlled, moderate exploration.

MAS Topologies

We study five interaction topologies. Each exhibits distinct entropy behavior and vulnerability patterns.

Single

Self-refinement loop. Internal consistency is key.

Sequential

Fixed pipeline. Most fragile: errors propagate without cross-check.

Centralized

Star graph with orchestrator. Benefits most from additional rounds.

Debate

DAG + majority voting. Initial divergence amplifies in later rounds.

Hybrid

Dual feedback (coordinator + peers). Most robust to perturbations.

Experimental Scale

245 Entropy Features
5 Open-Source LLMs
6+2 Benchmarks
5 Topologies
5 Hierarchical Levels

Methodology

We systematically extract entropy from the full reasoning lifecycle of LLM-based MAS, constructing a hierarchical feature set that captures uncertainty at multiple granularities. We then apply SHAP-based interpretability analysis to identify which entropy dynamics predict MAS success or failure.

Hierarchical Entropy Features

For each sample, we record the full output probability distribution at every token position across all agents and rounds, then aggregate into a 245-dimensional feature vector spanning five hierarchical levels:

Token Level: Per-token entropy H(t) = -Σ p(v) log p(v) over the full vocabulary. Captures fine-grained uncertainty at each generation step.

Agent Level: Statistics (mean, max, variance, quartiles) of token-level entropy within each agent's response. Captures individual agent confidence.

Round Level: Cross-agent aggregation per round: inter-agent dispersion, consensus indicators, entropy trends across rounds.

Sample Level: Global features across the entire reasoning trajectory: total entropy, stability indices, answer-token entropy.

Base Model Level: Entropy from the same base model running as SAS on the same problem, providing a reference baseline for comparison.
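The token, agent, and round levels above can be illustrated with a minimal sketch. It assumes the per-token probability distributions are already available as plain lists; the function names, the choice of natural logarithm, and the use of variance as the dispersion measure are illustrative assumptions, not the paper's exact feature definitions.

```python
import math
from statistics import mean, pvariance

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def agent_features(token_dists):
    """Summarize one agent's response from its per-token distributions."""
    h = [token_entropy(d) for d in token_dists]
    return {"mean": mean(h), "max": max(h), "var": pvariance(h), "total": sum(h)}

def round_dispersion(agent_means):
    """Inter-agent dispersion, taken here as the variance of per-agent mean entropy."""
    return pvariance(agent_means)

# Toy example: two agents, two tokens each, vocabulary of size 4.
confident = [[1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
feats = [agent_features(confident), agent_features(uncertain)]
dispersion = round_dispersion([f["mean"] for f in feats])
```

In this toy round the two agents disagree sharply in confidence, so `dispersion` is positive; agents with identical mean entropy would give zero.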

Feature Groups
G_MAS (d=224): MAS-only features.
G_base-H (d=241): G_MAS plus base entropy.
G_base-full (d=245): all features.

Interpretability via SHAP

We train an ensemble of XGBoost and LightGBM classifiers to predict MAS correctness from entropy features, then use SHAP (SHapley Additive exPlanations) to interpret which features matter and how:

Feature Importance Ī_j: Min-max normalized importance from each model, averaged across XGBoost and LightGBM. Higher values indicate stronger predictive power.

SHAP Correlation ρ_j: Pearson correlation between feature values and their SHAP attributions, averaged across models. Positive ρ means higher feature value increases predicted correctness; negative ρ means it decreases correctness.

In the figures below, each feature is characterized by its importance (how much it matters) and SHAP correlation (which direction it pushes the prediction). For example, a feature with high Ī and negative ρ is a strong predictor that, when elevated, harms MAS correctness.
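The two quantities can be sketched in a few lines. This sketch assumes the raw per-model importances and the SHAP attributions have already been computed elsewhere; all numbers and variable names are made up for illustration. It also shows the correlation for a single model, whereas the analysis averages it across models.

```python
import math

def min_max(values):
    """Rescale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def averaged_importance(per_model_importances):
    """Min-max normalize each model's importances, then average across models."""
    normalized = [min_max(imp) for imp in per_model_importances]
    n_models = len(normalized)
    return [sum(col) / n_models for col in zip(*normalized)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

# Hypothetical raw importances for three features from two models
# (different scales, which is why each model is normalized first).
xgb_imp = [0.50, 0.10, 0.30]
lgb_imp = [120.0, 20.0, 70.0]
importance = averaged_importance([xgb_imp, lgb_imp])

# Hypothetical values of one feature and its SHAP attributions per sample.
feature_values = [0.2, 0.5, 0.9, 1.4]
shap_values = [0.30, 0.10, -0.15, -0.40]
rho = pearson(feature_values, shap_values)
```

Here `rho` is strongly negative: larger values of the feature push the predicted correctness down, which is the pattern described for peak-entropy features.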

From Correlation to Causation

SHAP reveals which entropy features correlate with MAS outcomes, but correlation alone cannot establish causal mechanisms. To go further, we employ a rigorous causal inference pipeline:

Causal Discovery: PC and FCI algorithms identify the causal graph structure from observational data, discovering consensus direct causes agreed upon by both methods.

Effect Estimation: The DoWhy framework estimates Average Treatment Effects (ATE) via propensity score matching and inverse probability weighting, quantifying the causal strength.

Refutation Tests: Three robustness checks: random common cause, placebo treatment, and data subset validation. Causal claims are accepted only if all tests pass.

This pipeline allows us to distinguish features that merely co-occur with success from those that causally drive MAS correctness. We additionally perform mediation analysis to trace how first-round entropy propagates through later rounds to influence the final outcome.
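To make the effect-estimation step concrete, here is a minimal inverse probability weighting sketch written by hand rather than with DoWhy. It assumes a binary treatment (for instance a hypothetical "high entropy" indicator), a single discrete confounder, and a binary outcome; the toy data and helper names are invented for illustration.

```python
def ipw_ate(rows):
    """IPW estimate of the ATE from rows of (z, t, y).

    z: discrete confounder, t: binary treatment, y: binary outcome.
    Assumes every stratum of z contains both treated and control units.
    """
    # Propensity e(z) = P(T=1 | Z=z), estimated as the treated fraction per stratum.
    counts = {}
    for z, t, _ in rows:
        n, n_treated = counts.get(z, (0, 0))
        counts[z] = (n + 1, n_treated + t)
    propensity = {z: n_t / n for z, (n, n_t) in counts.items()}

    total = 0.0
    for z, t, y in rows:
        e = propensity[z]
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(rows)

def make(z, t, n_correct, n):
    """Build n rows in stratum z with treatment t, of which n_correct have y=1."""
    return [(z, t, 1)] * n_correct + [(z, t, 0)] * (n - n_correct)

# Toy data: treatment is rare in stratum 0 and common in stratum 1.
rows = make(0, 1, 1, 2) + make(0, 0, 6, 8) + make(1, 1, 2, 8) + make(1, 0, 1, 2)

treated = [y for _, t, y in rows if t == 1]
control = [y for _, t, y in rows if t == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)
ate = ipw_ate(rows)
```

In this toy data the naive difference in means is -0.4, while the weighted estimate is -0.25, the effect within each stratum. The gap is the part of the raw association that comes from the confounder rather than the treatment.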

Entropy Analysis: Three Principles

1. Certainty Preference

Peak entropy directly harms MAS correctness, while stable low entropy directly benefits it. For Qwen models, high entropy variance (ρ ≈ -0.92) and first-round divergence (ρ ≈ -0.87) are the primary failure drivers. Correct samples cluster at low entropy variance.

Figure 3: MAS failure analysis. High inter-agent entropy variance and first-round divergence drive failures. Correct samples cluster at low entropy regions.

2. Base Entropy

The base model's intrinsic entropy during problem-solving constrains MAS effectiveness. Higher base model entropy consistently reduces MAS accuracy, with a sharp drop when entropy exceeds 100. This is not merely correlation: causal analysis confirms base entropy as a direct cause (ATE = -0.12, p < 10^-21).

Figure 2: Base model entropy limits MAS. LLaMA operates in low-entropy range (0-100) with lower accuracy; Qwen uses higher entropy (100-1000) but achieves better performance through verification and correction.

3. Task Awareness

Optimal entropy profiles vary by task difficulty. Simple tasks (GSM8K, ~82% accuracy) require fast convergence to low, stable entropy. Hard tasks (AIME, ~25% accuracy) benefit from controlled exploration but are destroyed by extreme peaks. Medium tasks exhibit a mixed pattern: moderate sustained entropy helps, but early peaks predict failure.

Figure 4: Entropy-SHAP relationships by dataset and architecture. Entropy values and dispersion increase with task difficulty. GSM8K accuracy ~82%, AIME2025 accuracy ~25%.

Causal Discovery

We go beyond correlation to establish causal mechanisms using PC and FCI algorithms for causal discovery, followed by DoWhy for effect estimation. All causal claims pass rigorous refutation tests (random common cause, placebo treatment, data subset).

Consensus Direct Causes of MAS Correctness

Base model avg entropy per token: ATE = -0.12 (p < 10^-21)
Max answer token entropy: ATE = -0.31 (p < 10^-28)
Round-1 total entropy: ATE negative (p < 10^-19)

Consensus causal graph discovered by PC and FCI algorithms.

ATE forest plot: consensus direct causes (red) show more concentrated estimates than indirect causes (blue).

Mediation analysis: 30-33% of the causal effect of first-round inter-agent entropy dispersion is transmitted through second-round entropy.
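One common way to read a mediation proportion is the product-of-coefficients decomposition under linear models. The exact estimator behind the 30-33% figure is not stated here, so the following is an assumed formulation, with X the first-round inter-agent entropy dispersion, M the second-round entropy, and Y the correctness outcome.

```latex
\begin{aligned}
M &= \alpha_0 + a\,X + \varepsilon_M,\\
Y &= \beta_0 + c'\,X + b\,M + \varepsilon_Y,\\
\text{indirect effect} &= a\,b,\qquad
\text{total effect} = c' + a\,b,\\
\text{proportion mediated} &= \frac{a\,b}{c' + a\,b}.
\end{aligned}
```

Under this reading, a proportion of 30-33% means roughly a third of the total effect of X on Y passes through M, with the rest carried by the direct path c'.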

Round & Difficulty Analysis

More Rounds ≠ Better Performance

Expanding from R=2 to R=5 rarely improves and often harms performance while consuming significantly more tokens. Debate and Hybrid architectures degrade with more rounds (divergence amplifies rather than refines). Centralized is the only architecture that consistently benefits from additional rounds.

Entropy dynamics confirm this: max, mean, and total entropy drop sharply from round 1 to round 2, then plateau from round 2 to round 5. Agents essentially stabilize after round 2, and first-round features dominate the top predictors.

Figure 5: Round analysis (R=5). First-round total entropy is the consensus direct cause in the round 1-5 causal graph. Entropy plateaus after round 2.

Task Difficulty Spectrum

Easy: GSM8K (~82%)

Requires fast convergence to low, stable entropy. Round-1 overall entropy (ρ ≈ -0.64) and stability index (ρ ≈ -0.79) strongly predict success.

Medium: MATH500

High average per-agent reasoning entropy (ρ ≈ 0.63) and longer reasoning (ρ ≈ 0.71) correlate with correctness. But excessive early entropy (ρ ≈ -0.73) predicts failure.

Hard: AIME (~25%)

Round-1 total reasoning time is the top predictor (Ī = 1.0, ρ ≈ 0.73). But round-1 max entropy (ρ ≤ -0.70) and inter-agent round-2 variance (|ρ| ≥ 0.66) strongly predict failure.

Agentic Tasks Extension

Our entropy principles generalize beyond reasoning to agentic settings with tool use. We validate on GAIA (165 general AI assistant tasks across 3 difficulty levels, 6 models) and FinanceAgent (financial QA requiring SEC filings, multi-step calculations).

GAIA: General AI Assistants

Key findings on GAIA (165 tasks, 6 models, 5 architectures): (1) SAS achieves the highest accuracy for 2 out of 6 models, and beats at least one MAS topology for every model. (2) Debate consistently performs worst. (3) Tool-call entropy and first-round inter-agent dispersion jointly constrain MAS correctness, mirroring the reasoning benchmark patterns.

GAIA accuracy by model and architecture. Centralized often leads; Debate consistently underperforms.

Accuracy vs. tool-call entropy quintiles: accuracy drops monotonically with increasing tool-call entropy.

SHAP: top predictors are round-1 inter-agent dispersion, mean tool-call entropy, and step-0 mean entropy, all negative.

Architecture radar: no single topology dominates all axes (tool efficacy, low tool-call entropy, low round-1 max entropy).

Step-wise entropy heatmap: correct trajectories (n=303) start at ~0.029 and stay low; incorrect ones (n=3389) start at ~0.057 and remain elevated.

Causal Verification on GAIA

Causal discovery on GAIA (feature space expanded to 295 dimensions with step-level features) identifies a single consensus direct cause: round-1 tool success rate (ATE = 0.068, p = 6.8×10^-4). All three refutation tests pass (random common cause changes estimate by 0.1%, placebo treatment by 97.7%, data subset by 0.9%). The mediation path runs from round-1 inter-agent skewness through round-2 max agent dispersion to correctness (indirect effect = -0.019, bootstrap 95% CI [-0.042, -0.003]). This confirms that the first-round entropy principles observed in reasoning tasks extend to agentic settings with tool use.

FinanceAgent

On a financial QA benchmark requiring SEC filing retrieval, real-time stock data queries, and multi-step numerical calculations (Qwen3-4B, R=2), SAS and Sequential tie at 40% accuracy, while coordination overhead is destructive: Centralized 22%, Hybrid 12%, Debate 2%. SHAP confirms architecture itself as the top predictor (ρ ≈ 0.83-0.88), with step-0 mean entropy (ρ ≈ -0.75) as the second key factor, refining "first-round decisive" to "first reasoning step decisive."

Causal analysis identifies round-1 Q3 per-agent max entropy as the consensus direct cause (ATE = -0.197, p = 1.5×10^-17), with a cross-round mediation proportion of 68.9%, the largest observed in this study. Base model correctness remains overwhelmingly predictive (ρ ≈ 0.96): tool-calling does not loosen the base-model dependency.

RL Fine-tuning: Entropy Relationship Reversal

Using Qwen2.5-7B-SimpleRL-Zoo (trained via zero-shot RL on 8K MATH problems without SFT), we observe a striking reversal:

Standard base models: higher entropy monotonically reduces accuracy.
RL-finetuned models: accuracy peaks near zero entropy, drops to a minimum, then recovers to a secondary plateau. The recovery indicates that moderate entropy represents productive exploration rather than degeneration.
MAS consistently outperforms SAS with RL base models, unlike standard models, where SAS achieves the highest accuracy in 43.3% of configurations.

RL fine-tuning reshapes the entropy-accuracy relationship. Standard models show monotonic decline; RL models show recovery at moderate entropy, indicating productive exploration.

Causal analysis on the RL-finetuned setting reveals two opposing consensus direct causes: sample average entropy per token has a positive ATE (+1.98, p = 3.7×10^-17), indicating productive exploration, while max answer token entropy remains negative (ATE = -0.31, p = 4.1×10^-23), still marking failure. The mediation path runs from round-1 Q3 per-agent total entropy through round-2 total entropy to correctness (indirect effect = +0.175), confirming that RL training produces more reliable entropy estimates where moderate entropy reflects genuine solution diversity rather than noise.

SHAP analysis further supports this duality: on G_MAS, the top predictor is round-1 median entropy (ρ ≈ -0.758, negative), while round-2 entropy shows a positive correlation (ρ ≈ +0.267). The interpretation: early consensus combined with calibrated later-round exploration is the optimal pattern under RL fine-tuning.

Entropy Judger

Building on our entropy insights, we introduce the Entropy Judger: an ensemble of XGBoost and LightGBM classifiers that leverages learned entropy patterns to select high-quality outputs from MAS pass@k candidates, without requiring ground-truth labels. It achieves consistent accuracy improvements across all MAS configurations and tasks.

Prediction Accuracy

Cross-validated classification accuracy for predicting MAS correctness:

Feature Group	LLaMA Family	Qwen Family
G_MAS (MAS-only, d=224)	72.6%	79.1%
G_base-H (+ base entropy, d=241)	74.5%	80.7%
G_base-full (all features, d=245)	81.2%	91.6%

Pass@k Selection

The Entropy Judger selects from k=3 repeated runs and is compared against random selection, majority voting, and the oracle upper bound (Pass@k). All strategies consume exactly k runs, so any accuracy gap reflects selection quality rather than additional computation.

Best-of-k selection accuracy across datasets and strategies at k=3, averaged over four models and all architectures.

At k=3, the Entropy Judger consistently outperforms random selection across all benchmarks, while majority voting collapses catastrophically in the low-k regime (e.g., GSM8K MajVote drops from 0.895 to 0.315). The only exception is AIME2025, where almost all runs fail and entropy features carry little discriminative signal. An Early-Stop variant further shows that stronger models require fewer runs: Qwen3-8B on GSM8K commits after a single run with full confidence.
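The four selection strategies compared above can be sketched as follows. The sketch assumes each run carries a `score`, standing in for the classifier's predicted probability of correctness; the field names, answers, and scores are hypothetical.

```python
import random
from collections import Counter

def select_by_judge(candidates):
    """Pick the answer of the run with the highest judge score."""
    return max(candidates, key=lambda c: c["score"])["answer"]

def select_by_majority(candidates):
    """Most common answer; ties resolve to the earliest-seen answer."""
    return Counter(c["answer"] for c in candidates).most_common(1)[0][0]

def select_random(candidates, rng):
    """Pick the answer of a uniformly random run."""
    return rng.choice(candidates)["answer"]

def oracle_hit(candidates, truth):
    """Pass@k: True if any of the k runs is correct."""
    return any(c["answer"] == truth for c in candidates)

# Hypothetical k=3 runs on one problem whose true answer is "15".
runs = [
    {"answer": "12", "score": 0.31},
    {"answer": "15", "score": 0.84},
    {"answer": "18", "score": 0.22},
]
truth = "15"
judge_pick = select_by_judge(runs)
majority_pick = select_by_majority(runs)
random_pick = select_random(runs, random.Random(0))
```

In this toy case all three runs disagree, so the vote is a three-way tie and falls back to the first answer, while the score-based pick recovers the correct run. None of the strategies uses the ground-truth label except the oracle.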

Additional Experiments

Beyond the main analyses above, we conduct extensive supplementary experiments to validate robustness, probe feature properties, and isolate confounding factors. Below we summarize key findings; full details are in the paper appendix.

Feature Redundancy Analysis

Despite substantial pairwise correlation among the 245 features, PCA, recursive feature elimination, and cross-method importance validation confirm that redundancy is benign for SHAP-based predictive analysis (183:1 sample-to-feature ratio + tree-ensemble regularization). For causal discovery, we apply Borda-fusion selection to obtain a non-redundant 28-feature subset.
See Appendix C (Entropy Features).
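For readers unfamiliar with Borda fusion, a minimal sketch of a plain Borda count over several feature rankings is given below. The feature names and rankings are hypothetical, and the sketch covers only the rank-aggregation step; how redundancy is removed when forming the 28-feature subset is not specified here.

```python
def borda_fuse(rankings, top_n):
    """Fuse rankings (each ordered most to least important) by Borda count.

    A feature at position i in a ranking of length m earns m - 1 - i points.
    Ties in total score are broken alphabetically for determinism.
    """
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for position, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (m - 1 - position)
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return ordered[:top_n]

# Hypothetical rankings from three importance methods.
rankings = [
    ["round1_total_entropy", "base_avg_entropy", "answer_max_entropy", "round2_variance"],
    ["base_avg_entropy", "round1_total_entropy", "round2_variance", "answer_max_entropy"],
    ["round1_total_entropy", "answer_max_entropy", "base_avg_entropy", "round2_variance"],
]
selected = borda_fuse(rankings, top_n=2)
```

With these rankings the fused scores are 8, 6, 3, and 1, so the two features that rank consistently high across methods are kept.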

Entropy Calibration Analysis

Calibration quality is determined by model family (Qwen ECE = 0.275 vs. LLaMA ECE = 0.565) and degrades with task difficulty (ECE = 0.110 on GSM8K vs. 0.632 on AIME2025). The "confidently wrong" rate correlates with ECE at r = 0.989. On agentic tasks (GAIA), calibration is uniformly poor (ECE = 0.790).
See Appendix H (Calibration Analysis).
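Expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between confidence and accuracy within each bin. The sketch below uses equal-width bins and assumes confidences in [0, 1] are given; how confidence is derived from entropy, and the number of bins used, are assumptions not taken from the text.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Toy example: four very confident predictions (half wrong) and two hedged ones.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

The overconfident bin dominates the result here: its 0.45 gap carries two thirds of the weight, which mirrors how a high "confidently wrong" rate drives ECE up.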

Controlled SAS vs. MAS Decomposition

A three-way comparison (SAS, MAS Round-1, MAS Round-2) isolates whether entropy shifts come from role assignment or interaction. Role assignment alone is a significant intervention (23/30 cases, p < 0.05). Decomposition shows 83.4% of samples exhibit anchoring (entropy decrease without accuracy gain) while only 6.2% show genuine improvement.
See Appendix I (SAS vs. MAS Comparison).

Robustness Verification

Temperature ablation (0.3-0.9) confirms that relative entropy patterns are preserved across sampling temperatures. Results reproduce at 14B parameter scale (Qwen3-14B), and per-finding causal validation (applied finding-by-finding rather than globally) confirms all main claims independently.
See Appendix E (More Experimental Results).

Divergent Reasoning Styles: Qwen vs. LLaMA

Qualitative case study on AIME2025: Qwen independently re-derives solutions at each agent (self-correcting, higher entropy but suppresses error propagation), while LLaMA accepts and reuses prior answers without verification (lower entropy but errors compound through the chain).
See Appendix F (Model Comparison).

Token-Level Entropy Visualization

Per-token entropy trajectories across five models reveal that moderate, stable entropy correlates with success, while both excessively high entropy (erratic reasoning) and near-zero entropy (premature collapse, as seen in LLaMA's round-2 behavior) predict failure.
See Appendix J (Case Study).

BibTeX

@article{zhao2026does,
  title={When Does Multi-Agent Collaboration Help? An Entropy Perspective},
  author={Zhao, Yuxuan and Chen, Sijia and Su, Ningxin},
  journal={arXiv preprint arXiv:2602.04234},
  year={2026}
}
