Built on RecursiveMAS · arXiv:2604.25917

HOUSE
RecursiveMAS Engineering Extensions

A production-grade interactive front-end, LLM enhancement layer, token-based collaboration mode, and multi-GPU support — extending the RecursiveMAS latent-space framework for real-world use.

6
Collaboration Styles
4
Feature Branches
3
LLM Backends
Models (text mode)
Overview

What This Work Adds

RecursiveMAS (Yang et al., 2026) introduced a multi-agent framework where heterogeneous LLMs collaborate through latent-space recursion — passing hidden-state tensors instead of text, reducing token usage by −75% and achieving a 2.4× speedup over text-based alternatives.

This project extends the official release with four engineering contributions: a production-grade web front-end (HOUSE) with Chat and Batch Evaluation modes; an optional LLM enhancement layer that normalises inputs and synthesises outputs using Claude, Gemini, or any OpenAI-compatible endpoint; a sequential-text collaboration mode that removes the RecursiveLink requirement so any instruction-tuned model pair can be used; and a multi-GPU device mapper for distributing agents across GPUs. Three systematic evaluation bugs are also documented and fixed.

Branch: feature/domain-system-prompts

🏥 HOUSE Web UI

Heuristic Orchestration Using Specialist Ensembles — a Gradio Blocks application that wraps the entire inference pipeline with a warm model cache, streaming logs, and full parameter control.

💬
Chat Mode

Ask any question and receive the parsed answer, solver reasoning chain, and collapsible Planner / Critic outputs — with a full run metadata table on every reply.

📊
Batch Evaluation

Run full benchmark datasets (math500, medqa, gpqa, mbppplus) with live log streaming, cooperative stop, and JSONL download.

🌐
Domain Prompts

Four reasoning domains: general, medical emergency, software engineering, scientific research. Propagated via thread-local storage — zero API changes.

🔥
Warm Model Cache

Models load once on first request and stay in VRAM. Style switches evict and reload. No cold-start penalty between questions.

📋
Run Metadata

Every reply includes a collapsible table: style, domain, rounds, latent steps, temperature, top-p, seed, device, start/end time, elapsed, token counts.

🐳
Docker Ready

Two-service Compose setup (recursivemas + serve) sharing a named hf_cache volume. CPU fallback documented for Windows/WSL2 users.

Branch: feature/llm-postprocessing

✨ LLM Enhancement Layer

An optional wrapper that applies a frontier LLM before and after the MAS pipeline — translating, normalising, and synthesising without ever touching the core inference code.

Pre-processing pipeline
👤 User question (any language / notation)
↓ translate · disambiguate · structure
Frontier LLM (Claude / Gemini / OpenAI-compat)
↓ English · LaTeX · explicit constraints
🤖 MAS pipeline input
MCQ detection adds a suffix that prevents option injection or answer leakage.
Post-processing pipeline
🤖 MAS output (planner · critic · solver · parsed answer)
↓ synthesise · format · language-match
Frontier LLM (same backend)
↓ coherent · fluent · user's language
👤 Final response to user
Falls back gracefully — never blocks the pipeline on any error.
BackendSDKKey sourceModel discovery
Claude (Anthropic API)anthropicANTHROPIC_API_KEY env / UILive API query
Gemini (Google AI)google.generativeaiGEMINI_API_KEY env / UILive API query
OpenAI-compatibleopenaiEndpoint URL + optional keyLive /models query

API keys are never hardcoded. .env is in .gitignore.

Branch: feature/sequential-text

💬 Sequential-Text Mode

The same Planner → Critic → Solver topology, but agents communicate via natural-language tokens instead of latent tensors. No RecursiveLinks required — any instruction-tuned model can be plugged in from the local HF cache or any HuggingFace ID.

Latent mode (paper baseline)

Qwen3-1.7B
Planner
RL →
Llama3.2-1B
Critic
RL →
Qwen2.5-Math
Solver
Inter-agent: hidden-state tensors via RecursiveLink adapters. Fast · opaque.

Text mode new

any model
Planner
→ text →
any model
Critic
→ text →
any model
Solver
Inter-agent: natural-language tokens. Human-readable · auditable.
CLI flag

--method text_recursive — exposed from the previously hardcoded internal code path. All adapter validation is bypassed.

Model selection UI

Per-agent dropdown listing cached models by size. "Custom model…" sentinel reveals a free-text field for any HF ID.

Interpretability use

Read intermediate agent reasoning. Compare with latent mode to understand what the latent channel compresses vs. expresses explicitly.

Self-improvement

Filter correct=True trajectories → fine-tune new specialist models. A natural data flywheel for iterative improvement.

Branch: feature/llm-postprocessing

📊 Token Accounting & VRAM Introspection

Every response is self-documenting: token counts and live VRAM measurements are surfaced in the UI, enabling cost estimation and OOM prevention.

Token counter

Thread-local accumulator in _token_counter.py intercepts every model.generate() call across all inference modules. Emits a single line at pipeline completion:

[tokens] prompt=4821 generated=612 total=5433

Shown in the Run Info table on every chat reply.

Dynamic VRAM estimates

Hardcoded estimates were replaced with live measurements from scan_cache_dir() — the same source as the Model Manager tab.

Example correction: sequential_scaled was documented as ~12 GB; actual measured size from HF cache: ~23 GB.

sequential_light ≈ 5 GB sequential_scaled ≈ 23 GB mixture ≈ 15 GB distillation ≈ 18 GB
Branch: feature/multi-gpu

⚡ Multi-GPU Device Mapping

Distribute individual agents across different GPUs from the UI — no code changes required. Live VRAM fitness checks warn before you hit OOM.

Example assignment — Sequential Scaled (23 GB) across two 16 GB GPUs
AgentModelSizeDevice
PlannerGemma3-4B~8.2 GBcuda:0 🟢
CriticLlama3.2-3B~6.1 GBcuda:0 🟢
SolverQwen3.5-4B~8.6 GBcuda:1 🟢
State persisted in gr.State and forwarded through the entire call chain.
Branch: feature/llm-postprocessing

🐛 Evaluation Infrastructure Fixes

Three systematic bugs in the answer parser were identified during Math500 benchmarking and corrected.

#BugEffectFix
1 compare_answers() cascading A-default Parse failure on either side scored as correct when gold = A, inflating accuracy default=None on both sides; correct=False on any parse failure
2 extract_gold_answer() silent A-default Unparseable gold answers became "A", skewing gold distribution Return None on failure; propagated to fix #1
3 Latent solver output discriminator max_new_tokens heuristic failed for short answers (e.g. \boxed{B}) Switch to prompt_len as discriminator
Development Branches

Branch Overview

All branches are live on the public repository.

BranchKey filesCapability added
feature/domain-system-prompts prompts.py, inference_mas.py, serve.py Domain prompts · HOUSE two-tab layout · batch eval · run metadata · warm cache
feature/llm-postprocessing serve.py, answer_utils.py LLM pre/post-processing · MCQ detection · dynamic VRAM · token counting · 3 eval fixes
feature/sequential-text inference_mas.py, run.py, load_from_repo.py, serve.py sequential_text style · --method text_recursive CLI · model-selection UI
feature/multi-gpu serve.py Per-agent GPU assignment · live VRAM fitness check · layout fix
Open Questions

Reflections & Future Directions

Engineering RecursiveMAS for production raises deeper questions about latent communication, safety, and self-improvement that go beyond benchmark accuracy.

🔍
Latent interpretability

The text mode is the natural observation window. Running both modes on the same question reveals what the latent channel encodes compactly vs. what requires explicit articulation.

🏛️
Constitutional constraints

Echo chambers, galaxy-brained reasoning, and oscillation are failure modes specific to iterative multi-agent loops. Diversity constraints on the Critic and uncertainty budgets can mitigate them.

🔄
Self-improvement loop

Text-mode trajectories labelled correct=True are natural fine-tuning data for new specialist models — a STaR-style flywheel adapted to the multi-agent setting.

🧩
Vertical specialisation

Domain-specialist agents (medical, legal, formal verification) require new adapter pairs. Sparse topologies ($O(N)$ links rather than $O(N^2)$) keep the adapter matrix tractable.

A full treatment of these topics is available in the companion document: EXTENSIONS.md

Citation

How to Cite

If you build on this work, please also cite the original RecursiveMAS paper.

This work (HOUSE extensions)

@misc{salpietro2026house, title = {HOUSE: Engineering Extensions for RecursiveMAS}, author = {Daniele Carmelo Salpietro}, year = {2026}, url = {https://github.com/danielesalpietro/RecursiveMAS}, note = {Extensions to arXiv:2604.25917} }

Base paper (RecursiveMAS)

@misc{recursivemas, title = {Recursive Multi-Agent Systems}, author = {Xiyuan Yang and Jiaru Zou and Rui Pan and Ruizhong Qiu and Pan Lu and Shizhe Diao and Jindong Jiang and Hanghang Tong and Tong Zhang and Markus J. Buehler and Jingrui He and James Zou}, year = {2026}, eprint = {2604.25917}, archivePrefix = {arXiv}, primaryClass = {cs.AI}, url = {https://arxiv.org/abs/2604.25917} }

Privacy Policy

Last updated: June 2026

This website (danielesalpietro.github.io/RecursiveMAS) is a static informational page about the HOUSE — RecursiveMAS open-source project. It is hosted on GitHub Pages and does not use cookies, tracking scripts, analytics services, or any form of user profiling.

Data collected
This site does not collect, store, or process any personal data. No forms, login systems, or comment sections are present. No third-party advertising or analytics scripts are loaded.

External links
This page links to external services including GitHub, arXiv, HuggingFace, and LinkedIn. Each of those services is governed by its own privacy policy. Visiting those links may result in data collection by those services.

GitHub Pages hosting
GitHub may collect technical data (IP address, browser type, request timestamps) as part of serving the page. Please refer to GitHub's Privacy Statement for details.

Contact
For any privacy-related questions, contact: daniele@salpietro.it  ·  www.salpietro.it