RecursiveMAS (Yang et al., 2026) introduced a multi-agent framework in which heterogeneous LLMs collaborate through latent-space recursion: agents pass hidden-state tensors instead of text, which reduces token usage by 75% and yields a 2.4× speedup over text-based alternatives.
This project extends the official release with four engineering contributions: a production-grade web front-end (HOUSE) with Chat and Batch Evaluation modes; an optional LLM enhancement layer that normalises inputs and synthesises outputs using Claude, Gemini, or any OpenAI-compatible endpoint; a sequential-text collaboration mode that removes the RecursiveLink requirement so any instruction-tuned model pair can be used; and a multi-GPU device mapper for distributing agents across GPUs. Three systematic evaluation bugs are also documented and fixed.
HOUSE (Heuristic Orchestration Using Specialist Ensembles) is a Gradio Blocks application that wraps the entire inference pipeline with a warm model cache, streaming logs, and full parameter control.
Ask any question and receive the parsed answer, solver reasoning chain, and collapsible Planner / Critic outputs — with a full run metadata table on every reply.
Run full benchmark datasets (math500, medqa, gpqa, mbppplus) with live log streaming, cooperative stop, and JSONL download.
Four reasoning domains are available: general, medical emergency, software engineering, and scientific research. The selected domain is propagated via thread-local storage, so no API changes are required.
Models load once on first request and stay in VRAM. Style switches evict and reload. No cold-start penalty between questions.
Every reply includes a collapsible table: style, domain, rounds, latent steps, temperature, top-p, seed, device, start/end time, elapsed, token counts.
Two-service Compose setup (recursivemas + serve) sharing a named hf_cache volume. CPU fallback documented for Windows/WSL2 users.
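The thread-local domain propagation mentioned above can be illustrated with a minimal sketch. All names here (`_ctx`, `use_domain`, `current_domain`, `DOMAIN_PROMPTS`) and the prompt strings are hypothetical placeholders, not the contents of prompts.py; only the four domain names come from the text.

```python
import threading
from contextlib import contextmanager

# Hypothetical per-thread context; the real project may organise this differently.
_ctx = threading.local()

# Placeholder prompt texts, keyed by the four domains named above.
DOMAIN_PROMPTS = {
    "general": "placeholder: general reasoning prompt",
    "medical emergency": "placeholder: medical emergency prompt",
    "software engineering": "placeholder: software engineering prompt",
    "scientific research": "placeholder: scientific research prompt",
}


def current_domain():
    """Return the domain set for this thread, defaulting to 'general'."""
    return getattr(_ctx, "domain", "general")


@contextmanager
def use_domain(name):
    """Set the domain for the current thread and restore the old one on exit."""
    if name not in DOMAIN_PROMPTS:
        raise ValueError(f"unknown domain: {name!r}")
    previous = current_domain()
    _ctx.domain = name
    try:
        yield
    finally:
        _ctx.domain = previous


def build_system_prompt():
    """Deep in the call chain, read the domain without an extra parameter."""
    return DOMAIN_PROMPTS[current_domain()]
```

Because the value lives in thread-local storage, functions between the UI handler and the prompt builder keep their existing signatures, and concurrent requests on other threads still see their own domain.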
An optional wrapper that applies a frontier LLM before and after the MAS pipeline — translating, normalising, and synthesising without ever touching the core inference code.
| Backend | SDK | Key source | Model discovery |
|---|---|---|---|
| Claude (Anthropic API) | anthropic | ANTHROPIC_API_KEY env / UI | Live API query |
| Gemini (Google AI) | google.generativeai | GEMINI_API_KEY env / UI | Live API query |
| OpenAI-compatible | openai | Endpoint URL + optional key | Live /models query |
API keys are never hardcoded. .env is in .gitignore.
The same Planner → Critic → Solver topology, but agents communicate via natural-language tokens instead of latent tensors. No RecursiveLinks required — any instruction-tuned model can be plugged in from the local HF cache or any HuggingFace ID.
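As a rough illustration of the text-based topology, the sketch below chains three agents through plain strings. It assumes each agent is simply a callable from prompt text to generated text; the function name `run_sequential_text` and the prompt wording are hypothetical and are not taken from inference_mas.py.

```python
from typing import Callable, Dict

# Assumption: an agent is any callable mapping a prompt string to output text.
Agent = Callable[[str], str]


def run_sequential_text(
    question: str,
    planner: Agent,
    critic: Agent,
    solver: Agent,
    rounds: int = 1,
) -> Dict[str, str]:
    """Run Planner -> Critic -> Solver, passing natural-language text between them."""
    plan = critique = answer = ""
    for _ in range(rounds):
        # The planner sees the question and, after round one, the previous answer.
        plan = planner(
            f"Question: {question}\nPrevious answer: {answer}\nWrite a plan."
        )
        # The critic reviews the plan as text; no latent tensors are exchanged.
        critique = critic(
            f"Question: {question}\nPlan: {plan}\nCritique the plan."
        )
        # The solver receives both intermediate texts and produces the answer.
        answer = solver(
            f"Question: {question}\nPlan: {plan}\nCritique: {critique}\n"
            "Give the final answer."
        )
    return {"plan": plan, "critique": critique, "answer": answer}
```

Since every hand-off is a string, the intermediate plan and critique can be logged or displayed directly, which is what makes this mode inspectable.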
The --method text_recursive CLI flag exposes a code path that was previously hardcoded internally. All adapter validation is bypassed.
Per-agent dropdown listing cached models by size. "Custom model…" sentinel reveals a free-text field for any HF ID.
Read intermediate agent reasoning. Compare with latent mode to understand what the latent channel compresses vs. expresses explicitly.
Filter correct=True trajectories → fine-tune new specialist models. A natural data flywheel for iterative improvement.
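The filtering step for the data flywheel could look like the sketch below. It assumes trajectories are stored one JSON object per line with a boolean `correct` field; every other field name in the usage example is hypothetical, and `filter_correct` is an illustrative helper, not a function from the repository.

```python
import json
from typing import Iterable, List


def filter_correct(lines: Iterable[str]) -> List[dict]:
    """Keep only JSONL records whose 'correct' field is exactly True."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # 'is True' excludes False, None (parse failures) and missing fields.
        if record.get("correct") is True:
            kept.append(record)
    return kept
```

For example, `filter_correct(open("results.jsonl", encoding="utf-8"))` would return the records suitable as fine-tuning candidates, where `results.jsonl` stands in for whatever file the batch evaluation produced.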
Every response is self-documenting: token counts and live VRAM measurements are surfaced in the UI, enabling cost estimation and OOM prevention.
A thread-local accumulator in _token_counter.py intercepts every model.generate() call across all inference modules.
Emits a single line at pipeline completion:
[tokens] prompt=4821 generated=612 total=5433
Shown in the Run Info table on every chat reply.
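A minimal sketch of such an accumulator follows. It is not the code in _token_counter.py: the function names are hypothetical, token sequences are modelled as plain lists of ids, and it assumes a decoder-only style `generate` whose output includes the prompt tokens. Only the format of the summary line is taken from the example above.

```python
import threading

_state = threading.local()  # one independent tally per thread


def reset_counts():
    """Start a fresh tally for the current thread."""
    _state.prompt = 0
    _state.generated = 0


def _add(prompt_tokens, generated_tokens):
    _state.prompt = getattr(_state, "prompt", 0) + prompt_tokens
    _state.generated = getattr(_state, "generated", 0) + generated_tokens


def counting_generate(generate, input_ids, **kwargs):
    """Call generate() and record prompt and newly generated token counts.

    Assumption: each output sequence starts with its prompt, so the number
    of new tokens is the output length minus the prompt length.
    """
    output_ids = generate(input_ids, **kwargs)
    for prompt, full in zip(input_ids, output_ids):
        _add(len(prompt), len(full) - len(prompt))
    return output_ids


def summary_line():
    """Format the tally in the same shape as the log line shown above."""
    prompt = getattr(_state, "prompt", 0)
    generated = getattr(_state, "generated", 0)
    return f"[tokens] prompt={prompt} generated={generated} total={prompt + generated}"
```

Keeping the tally thread-local means two requests served concurrently do not mix their counts, at the cost of having to reset the tally at the start of each pipeline run.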
Hardcoded estimates were replaced with live measurements from scan_cache_dir()
— the same source as the Model Manager tab.
Example correction: sequential_scaled was documented as ~12 GB;
actual measured size from HF cache: ~23 GB.
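One way to derive such figures from the cache is sketched below. `scan_cache_dir()` is the huggingface_hub function; the helpers `model_sizes_gb` and `style_size_gb` are hypothetical names, and they take the scan result as an argument so the logic can be exercised without a real cache. On-disk size is used here as a proxy for the memory a model needs, which is an assumption.

```python
def model_sizes_gb(cache_info):
    """Map each cached model repo id to its on-disk size in GB.

    `cache_info` is expected to look like the object returned by
    huggingface_hub.scan_cache_dir(): it has a `.repos` collection whose
    items expose `.repo_id`, `.repo_type` and `.size_on_disk` (bytes).
    """
    return {
        repo.repo_id: round(repo.size_on_disk / 1e9, 1)
        for repo in cache_info.repos
        if repo.repo_type == "model"
    }


def style_size_gb(cache_info, model_ids):
    """Total on-disk size in GB of the models that make up one style."""
    wanted = set(model_ids)
    total_bytes = sum(
        repo.size_on_disk
        for repo in cache_info.repos
        if repo.repo_type == "model" and repo.repo_id in wanted
    )
    return round(total_bytes / 1e9, 1)


# Usage against the real cache (requires huggingface_hub to be installed):
#   from huggingface_hub import scan_cache_dir
#   print(model_sizes_gb(scan_cache_dir()))
```

Summing bytes before rounding avoids the small drift that comes from adding already rounded per-model figures.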
Distribute individual agents across different GPUs from the UI — no code changes required. Live VRAM fitness checks warn before you hit OOM.
| Agent | Model | Size | Device |
|---|---|---|---|
| Planner | Gemma3-4B | ~8.2 GB | cuda:0 🟢 |
| Critic | Llama3.2-3B | ~6.1 GB | cuda:0 🟢 |
| Solver | Qwen3.5-4B | ~8.6 GB | cuda:1 🟢 |
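The fitness check behind the status indicators might be approximated as follows. This is a sketch under stated assumptions: `check_fit` and the `headroom` factor are hypothetical, and free memory is passed in as plain numbers (in a real deployment these could come from `torch.cuda.mem_get_info`, which is not called here).

```python
def check_fit(assignments, sizes_gb, free_gb, headroom=1.2):
    """Return {device: True/False} telling whether its assigned agents fit.

    assignments: agent name -> device string, e.g. {"Planner": "cuda:0"}
    sizes_gb:    agent name -> model size in GB
    free_gb:     device string -> currently free VRAM in GB
    headroom:    assumed safety multiplier for activations and KV cache
    """
    demand = {}
    for agent, device in assignments.items():
        demand[device] = demand.get(device, 0.0) + sizes_gb[agent]
    return {
        device: need * headroom <= free_gb.get(device, 0.0)
        for device, need in demand.items()
    }


# Sizes and devices taken from the table above; the free-VRAM figures are made up.
assignments = {"Planner": "cuda:0", "Critic": "cuda:0", "Solver": "cuda:1"}
sizes_gb = {"Planner": 8.2, "Critic": 6.1, "Solver": 8.6}
result = check_fit(assignments, sizes_gb, {"cuda:0": 24.0, "cuda:1": 10.0})
```

With these illustrative numbers, `cuda:0` needs about 17.2 GB including headroom and fits in 24 GB, while `cuda:1` needs about 10.3 GB and would be flagged against 10 GB free.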
The per-agent device selection is held in gr.State and forwarded through the entire call chain.
Three systematic bugs in the answer parser were identified during Math500 benchmarking and corrected.
| # | Bug | Effect | Fix |
|---|---|---|---|
| 1 | compare_answers() cascading A-default | Parse failure on either side scored as correct when gold = A, inflating accuracy | default=None on both sides; correct=False on any parse failure |
| 2 | extract_gold_answer() silent A-default | Unparseable gold answers became "A", skewing gold distribution | Return None on failure; propagated to fix #1 |
| 3 | Latent solver output discriminator | max_new_tokens heuristic failed for short answers (e.g. \boxed{B}) | Switch to prompt_len as discriminator |
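The idea behind fixes 1 and 2 can be shown with a small sketch. It is not the repository's compare_answers() or extract_gold_answer(): the helper names `parse_choice` and `compare_choices` and the parsing rules are hypothetical and deliberately simple. The point is only that a parse failure yields None and is never scored as correct.

```python
import re
from typing import Optional

_BOXED = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")


def parse_choice(text: str) -> Optional[str]:
    """Extract a single-letter choice, or None when nothing can be parsed.

    Returning None, rather than defaulting to "A", keeps failures visible.
    """
    match = _BOXED.search(text)
    if match:
        return match.group(1).upper()
    stripped = text.strip().rstrip(".").upper()
    if len(stripped) == 1 and "A" <= stripped <= "Z":
        return stripped
    return None


def compare_choices(pred_text: str, gold_text: str) -> bool:
    """Score as correct only when both sides parse and agree."""
    pred = parse_choice(pred_text)
    gold = parse_choice(gold_text)
    if pred is None or gold is None:
        return False  # any parse failure counts as incorrect
    return pred == gold
```

Under the old behaviour described in the table, an unparseable prediction compared against gold "A" would have been marked correct; here `compare_choices("no idea", "A")` is False.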
All branches are live on the public repository.
| Branch | Key files | Capability added |
|---|---|---|
| feature/domain-system-prompts | prompts.py, inference_mas.py, serve.py | Domain prompts · HOUSE two-tab layout · batch eval · run metadata · warm cache |
| feature/llm-postprocessing | serve.py, answer_utils.py | LLM pre/post-processing · MCQ detection · dynamic VRAM · token counting · 3 eval fixes |
| feature/sequential-text | inference_mas.py, run.py, load_from_repo.py, serve.py | sequential_text style · --method text_recursive CLI · model-selection UI |
| feature/multi-gpu | serve.py | Per-agent GPU assignment · live VRAM fitness check · layout fix |
Engineering RecursiveMAS for production raises deeper questions about latent communication, safety, and self-improvement that go beyond benchmark accuracy.
The text mode is the natural observation window. Running both modes on the same question reveals what the latent channel encodes compactly vs. what requires explicit articulation.
Echo chambers, galaxy-brained reasoning, and oscillation are failure modes specific to iterative multi-agent loops. Diversity constraints on the Critic and uncertainty budgets can mitigate them.
Text-mode trajectories labelled correct=True are natural fine-tuning data for new specialist models — a STaR-style flywheel adapted to the multi-agent setting.
Domain-specialist agents (medical, legal, formal verification) require new adapter pairs. Sparse topologies ($O(N)$ links rather than $O(N^2)$) keep the adapter matrix tractable.
A full treatment of these topics is available in the companion document: EXTENSIONS.md
If you build on this work, please also cite the original RecursiveMAS paper.
Last updated: June 2026
This website (danielesalpietro.github.io/RecursiveMAS) is a static informational page about the HOUSE — RecursiveMAS open-source project. It is hosted on GitHub Pages and does not use cookies, tracking scripts, analytics services, or any form of user profiling.
Data collected
This site does not collect, store, or process any personal data.
No forms, login systems, or comment sections are present.
No third-party advertising or analytics scripts are loaded.
External links
This page links to external services including GitHub, arXiv, HuggingFace, and LinkedIn.
Each of those services is governed by its own privacy policy.
Visiting those links may result in data collection by those services.
GitHub Pages hosting
GitHub may collect technical data (IP address, browser type, request timestamps)
as part of serving the page. Please refer to
GitHub's Privacy Statement for details.
Contact
For any privacy-related questions, contact:
daniele@salpietro.it · www.salpietro.it