HOUSE — RecursiveMAS Extensions

Overview

What This Work Adds

RecursiveMAS (Yang et al., 2026) introduced a multi-agent framework in which heterogeneous LLMs collaborate through latent-space recursion: agents pass hidden-state tensors instead of text, reducing token usage by 75% and achieving a 2.4× speedup over text-based alternatives.

This project extends the official release with four engineering contributions: a production-grade web front-end (HOUSE) with Chat and Batch Evaluation modes; an optional LLM enhancement layer that normalises inputs and synthesises outputs using Claude, Gemini, or any OpenAI-compatible endpoint; a sequential-text collaboration mode that removes the RecursiveLink requirement so any instruction-tuned model pair can be used; and a multi-GPU device mapper for distributing agents across GPUs. Three systematic evaluation bugs are also documented and fixed.

Branch: feature/domain-system-prompts

🏥 HOUSE Web UI

Heuristic Orchestration Using Specialist Ensembles — a Gradio Blocks application that wraps the entire inference pipeline with a warm model cache, streaming logs, and full parameter control.

💬

Chat Mode

Ask any question and receive the parsed answer, solver reasoning chain, and collapsible Planner / Critic outputs — with a full run metadata table on every reply.

📊

Batch Evaluation

Run full benchmark datasets (math500, medqa, gpqa, mbppplus) with live log streaming, cooperative stop, and JSONL download.

🌐

Domain Prompts

Four reasoning domains: general, medical emergency, software engineering, scientific research. Propagated via thread-local storage — zero API changes.
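
A minimal sketch of how thread-local propagation can avoid changing function signatures. The names `DOMAIN_PROMPTS`, `set_domain`, and `get_system_prompt`, as well as the prompt wording, are hypothetical and not taken from the project's `prompts.py`.

```python
import threading

# Hypothetical prompt table; the real wording lives in the project's prompts.py.
DOMAIN_PROMPTS = {
    "general": "You are a careful general-purpose reasoner.",
    "medical": "You are assisting with medical emergency triage.",
    "software": "You are a senior software engineer.",
    "science": "You are a scientific research assistant.",
}

_ctx = threading.local()  # each thread sees its own attributes


def set_domain(domain: str) -> None:
    """Record the active domain for the current thread only."""
    if domain not in DOMAIN_PROMPTS:
        raise ValueError(f"unknown domain: {domain!r}")
    _ctx.domain = domain


def get_system_prompt() -> str:
    """Return the prompt for this thread's domain, defaulting to 'general'."""
    return DOMAIN_PROMPTS[getattr(_ctx, "domain", "general")]


def _worker(domain: str, out: dict) -> None:
    set_domain(domain)
    out[domain] = get_system_prompt()


# Two concurrent requests with different domains do not interfere.
results = {}
threads = [
    threading.Thread(target=_worker, args=(d, results))
    for d in ("medical", "software")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the domain is read from thread-local state deep inside the pipeline, callers in between need no extra parameter. The main thread, which never called `set_domain`, still sees the general prompt.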

🔥

Warm Model Cache

Models load once on first request and stay in VRAM. Style switches evict and reload. No cold-start penalty between questions.
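
The load-once, evict-on-switch behaviour can be sketched as a single-slot cache. `WarmModelCache` and its `loader` callback are illustrative names assumed for this example; the actual cache in `serve.py` may be structured differently.

```python
from typing import Any, Callable, Optional


class WarmModelCache:
    """Keep the models for one style resident; reload only when the style changes."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader              # hypothetical: style name -> loaded models
        self._style: Optional[str] = None
        self._models: Any = None
        self.loads = 0                     # counts cold loads, for illustration

    def get(self, style: str) -> Any:
        if style != self._style:
            self._models = None            # drop the old reference first (evict)
            self._models = self._loader(style)
            self._style = style
            self.loads += 1
        return self._models


# A stub loader stands in for the real model-loading code.
cache = WarmModelCache(loader=lambda style: {"style": style})
cache.get("sequential_light")   # cold load
cache.get("sequential_light")   # warm hit, no reload
cache.get("mixture")            # style switch: evict and reload
```

With real PyTorch models, eviction would typically also involve `gc.collect()` and `torch.cuda.empty_cache()` so the freed memory is actually returned before the next load.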

📋

Run Metadata

Every reply includes a collapsible table: style, domain, rounds, latent steps, temperature, top-p, seed, device, start/end time, elapsed, token counts.

🐳

Docker Ready

Two-service Compose setup (recursivemas + serve) sharing a named hf_cache volume. CPU fallback documented for Windows/WSL2 users.

Branch: feature/llm-postprocessing

✨ LLM Enhancement Layer

An optional wrapper that applies a frontier LLM before and after the MAS pipeline — translating, normalising, and synthesising without ever touching the core inference code.

Pre-processing pipeline

👤 User question (any language / notation)

↓ translate · disambiguate · structure

Frontier LLM (Claude / Gemini / OpenAI-compat)

↓ English · LaTeX · explicit constraints

🤖 MAS pipeline input

MCQ detection adds a suffix that prevents option injection or answer leakage.
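
One plausible way to implement MCQ detection is a regular expression over option lines. The pattern, the two-option threshold, and the suffix wording below are all assumptions for illustration; the project's own heuristic is not documented here.

```python
import re

# Matches option lines such as "A) foo", "B. bar", "(C) baz" at the start of a line.
_OPTION_RE = re.compile(r"^\s*\(?([A-E])[\).:]\s+\S", re.MULTILINE)

# Hypothetical suffix wording appended to the pre-processing instruction.
MCQ_SUFFIX = (
    "\n\nKeep the answer options exactly as given. "
    "Do not add options and do not indicate which one is correct."
)


def looks_like_mcq(question: str) -> bool:
    """True when at least two distinct option letters are found."""
    return len(set(_OPTION_RE.findall(question))) >= 2


def add_mcq_suffix(instruction: str, question: str) -> str:
    """Append the guard suffix only for multiple-choice questions."""
    return instruction + MCQ_SUFFIX if looks_like_mcq(question) else instruction
```

Requiring two distinct letters avoids treating a sentence that merely starts with "A." as a multiple-choice question.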

Post-processing pipeline

🤖 MAS output (planner · critic · solver · parsed answer)

↓ synthesise · format · language-match

Frontier LLM (same backend)

↓ coherent · fluent · user's language

👤 Final response to user

Falls back gracefully — never blocks the pipeline on any error.
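
The graceful-fallback behaviour can be sketched as a wrapper that returns its input unchanged whenever the optional LLM step fails. `with_fallback` and `run_enhanced` are hypothetical names, and the three callables stand in for the real pre-processor, MAS pipeline, and post-processor.

```python
import logging
from typing import Callable

log = logging.getLogger("enhance")


def with_fallback(step: Callable[[str], str], text: str) -> str:
    """Run an optional LLM step; on any failure return the input unchanged."""
    try:
        result = step(text)
        # Treat empty or non-string output as a failure too.
        return result if isinstance(result, str) and result.strip() else text
    except Exception as exc:  # deliberately broad: enhancement is optional
        log.warning("enhancement step failed, using original text: %s", exc)
        return text


def run_enhanced(
    question: str,
    preprocess: Callable[[str], str],
    mas_pipeline: Callable[[str], str],
    postprocess: Callable[[str], str],
) -> str:
    """pre -> MAS -> post, where only the MAS call is allowed to raise."""
    normalised = with_fallback(preprocess, question)
    raw_answer = mas_pipeline(normalised)
    return with_fallback(postprocess, raw_answer)
```

If the frontier LLM is unreachable, the user still receives the raw MAS output rather than an error.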

Backend	SDK	Key source	Model discovery
Claude (Anthropic API)	`anthropic`	`ANTHROPIC_API_KEY` env / UI	Live API query
Gemini (Google AI)	`google.generativeai`	`GEMINI_API_KEY` env / UI	Live API query
OpenAI-compatible	`openai`	Endpoint URL + optional key	Live `/models` query

API keys are never hardcoded. .env is in .gitignore.

Branch: feature/sequential-text

💬 Sequential-Text Mode

The same Planner → Critic → Solver topology, but agents communicate via natural-language tokens instead of latent tensors. No RecursiveLinks required — any instruction-tuned model can be plugged in from the local HF cache or any HuggingFace ID.

Latent mode (paper baseline)

Qwen3-1.7B

Planner

RL →

Llama3.2-1B

Critic

RL →

Qwen2.5-Math

Solver

Inter-agent: hidden-state tensors via RecursiveLink adapters. Fast · opaque.

Text mode (new)

any model

Planner

→ text →

any model

Critic

→ text →

any model

Solver

Inter-agent: natural-language tokens. Human-readable · auditable.
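
A single text-mode round can be sketched with each agent modelled as a plain text-to-text callable. The prompt templates and the `run_text_round` helper are assumptions for illustration, not the project's actual prompts or API.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]  # prompt text in, generated text out


def run_text_round(
    question: str, planner: Agent, critic: Agent, solver: Agent
) -> Dict[str, str]:
    """One Planner -> Critic -> Solver pass with plain-text hand-offs."""
    plan = planner(f"Question:\n{question}\n\nWrite a step-by-step plan.")
    critique = critic(
        f"Question:\n{question}\n\nPlan:\n{plan}\n\nCritique the plan."
    )
    answer = solver(
        f"Question:\n{question}\n\nPlan:\n{plan}\n\nCritique:\n{critique}\n\n"
        "Solve and give the final answer."
    )
    # Every intermediate message is kept, which is what makes the run auditable.
    return {"plan": plan, "critique": critique, "answer": answer}


# Stub agents standing in for real instruction-tuned models.
trace = run_text_round(
    "What is 6 * 7?",
    planner=lambda prompt: "Multiply 6 by 7.",
    critic=lambda prompt: "Plan is sound.",
    solver=lambda prompt: "\\boxed{42}",
)
```

Since the only contract between agents is a string, any model that accepts a prompt and returns text can fill any of the three roles.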

CLI flag

--method text_recursive exposes a code path that was previously internal and hardcoded. Adapter validation is skipped entirely in this mode.

Model selection UI

Per-agent dropdown listing cached models by size. "Custom model…" sentinel reveals a free-text field for any HF ID.

Interpretability use

Read intermediate agent reasoning. Compare with latent mode to understand what the latent channel compresses vs. expresses explicitly.

Self-improvement

Filter correct=True trajectories → fine-tune new specialist models. A natural data flywheel for iterative improvement.
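
Filtering trajectories from a batch-evaluation JSONL export might look like the sketch below. Only the `correct` field is taken from the text above; the other field names in the sample records are made up.

```python
import json
from typing import Iterable, Iterator


def correct_trajectories(lines: Iterable[str]) -> Iterator[dict]:
    """Yield records whose 'correct' field is exactly True; skip bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupted lines
        if isinstance(record, dict) and record.get("correct") is True:
            yield record


# Sample lines with hypothetical fields, in place of a real JSONL download.
sample = [
    '{"question": "1+1", "answer": "2", "correct": true}',
    '{"question": "2+2", "answer": "5", "correct": false}',
    '{"question": "3+3", "answer": null, "correct": null}',
    "not json",
]
kept = list(correct_trajectories(sample))
```

Checking `is True` rather than truthiness keeps records with a missing or null label out of the fine-tuning set.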

Branch: feature/llm-postprocessing

📊 Token Accounting & VRAM Introspection

Every response is self-documenting: token counts and live VRAM measurements are surfaced in the UI, enabling cost estimation and OOM prevention.

Token counter

Thread-local accumulator in _token_counter.py intercepts every model.generate() call across all inference modules. Emits a single line at pipeline completion:

[tokens] prompt=4821 generated=612 total=5433

Shown in the Run Info table on every chat reply.
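
A thread-local accumulator of this kind can be sketched in a few lines. The function names `reset`, `record`, and `summary` are assumptions; only the output line format is taken from the example above.

```python
import threading

_state = threading.local()  # one pair of counters per thread


def reset() -> None:
    """Zero this thread's counters at the start of a pipeline run."""
    _state.prompt = 0
    _state.generated = 0


def record(prompt_tokens: int, output_tokens: int) -> None:
    """Add one generate() call's counts to this thread's running totals."""
    if not hasattr(_state, "prompt"):
        reset()
    _state.prompt += prompt_tokens
    _state.generated += output_tokens


def summary() -> str:
    """Format the totals as a single log line."""
    prompt = getattr(_state, "prompt", 0)
    generated = getattr(_state, "generated", 0)
    return f"[tokens] prompt={prompt} generated={generated} total={prompt + generated}"


# Two generate() calls within one pipeline run.
reset()
record(4000, 500)
record(821, 112)
line = summary()
```

For decoder-only Hugging Face models, `generate()` returns the prompt followed by the new tokens, so the generated count for one call is typically the output length minus the input length.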

Dynamic VRAM estimates

Hardcoded estimates were replaced with live measurements from scan_cache_dir() — the same source as the Model Manager tab.

Example correction: sequential_scaled was documented as ~12 GB; actual measured size from HF cache: ~23 GB.

sequential_light ≈ 5 GB
sequential_scaled ≈ 23 GB
mixture ≈ 15 GB
distillation ≈ 18 GB
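
As a rough sketch, a per-style estimate can be derived by summing on-disk sizes reported by `huggingface_hub.scan_cache_dir()`. The helper names and the repo IDs are hypothetical, and on-disk size is only an approximation of the VRAM needed at load time.

```python
from typing import Dict, Iterable


def cached_model_sizes_gb() -> Dict[str, float]:
    """Map repo_id -> on-disk size in GB for every model in the local HF cache."""
    # Imported lazily so the rest of the example runs without huggingface_hub.
    from huggingface_hub import scan_cache_dir

    return {
        repo.repo_id: repo.size_on_disk / 1e9
        for repo in scan_cache_dir().repos
        if repo.repo_type == "model"
    }


def estimate_style_gb(repo_ids: Iterable[str], sizes_gb: Dict[str, float]) -> float:
    """Sum the cached sizes of a style's models; missing repos count as 0."""
    return round(sum(sizes_gb.get(repo_id, 0.0) for repo_id in repo_ids), 1)


# Made-up repo IDs and sizes so the example runs without a populated cache.
fake_sizes = {"org/planner": 8.2, "org/critic": 6.1, "org/solver": 8.6}
total = estimate_style_gb(["org/planner", "org/critic", "org/solver"], fake_sizes)
```

In a live setting, `cached_model_sizes_gb()` would replace `fake_sizes`, so the estimate tracks whatever is actually in the cache.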

Branch: feature/multi-gpu

⚡ Multi-GPU Device Mapping

Distribute individual agents across different GPUs from the UI — no code changes required. Live VRAM fitness checks warn before you hit OOM.

Example assignment — Sequential Scaled (23 GB) across two 16 GB GPUs

Agent	Model	Size	Device
Planner	Gemma3-4B	~8.2 GB	`cuda:0` 🟢
Critic	Llama3.2-3B	~6.1 GB	`cuda:0` 🟢
Solver	Qwen3.5-4B	~8.6 GB	`cuda:1` 🟢

State persisted in gr.State and forwarded through the entire call chain.
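
The fitness check reduces to summing model sizes per device and comparing against capacity. The sketch below uses the sizes from the table above; `check_fit` is a hypothetical helper, and it ignores the extra headroom a real check would reserve for activations and the KV cache.

```python
from collections import defaultdict
from typing import Dict, Tuple


def check_fit(
    assignment: Dict[str, Tuple[str, float]],  # agent -> (device, model size in GB)
    capacity_gb: Dict[str, float],             # device -> usable VRAM in GB
) -> Dict[str, Tuple[float, bool]]:
    """Return device -> (total assigned GB, fits?) for every device in use."""
    load = defaultdict(float)
    for device, size_gb in assignment.values():
        load[device] += size_gb
    return {
        device: (round(total, 1), total <= capacity_gb.get(device, 0.0))
        for device, total in load.items()
    }


assignment = {
    "planner": ("cuda:0", 8.2),
    "critic": ("cuda:0", 6.1),
    "solver": ("cuda:1", 8.6),
}
report = check_fit(assignment, {"cuda:0": 16.0, "cuda:1": 16.0})

# Placing all three agents on one 16 GB card would not fit.
single = check_fit(
    {agent: ("cuda:0", size) for agent, (_, size) in assignment.items()},
    {"cuda:0": 16.0},
)
```

Here `cuda:0` carries 14.3 GB and `cuda:1` carries 8.6 GB, both under 16 GB, while the single-GPU layout needs 22.9 GB and fails. At runtime, free memory per device could be read with `torch.cuda.mem_get_info()` instead of a fixed capacity table.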

Branch: feature/llm-postprocessing

🐛 Evaluation Infrastructure Fixes

Three systematic bugs in the answer parser were identified during Math500 benchmarking and corrected.

#	Bug	Effect	Fix
1	`compare_answers()` cascading A-default	Parse failure on either side scored as correct when gold = A, inflating accuracy	`default=None` on both sides; `correct=False` on any parse failure
2	`extract_gold_answer()` silent A-default	Unparseable gold answers became `"A"`, skewing gold distribution	Return `None` on failure; propagated to fix #1
3	Latent solver output discriminator	`max_new_tokens` heuristic failed for short answers (e.g. `\boxed{B}`)	Switch to `prompt_len` as discriminator
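
Fixes 1 and 2 share one idea: a parse failure must yield `None`, never a default letter. The sketch below illustrates that idea only; the `parse_choice` helper, the regex, and the function signatures are assumptions and do not reproduce the project's `answer_utils.py`.

```python
import re
from typing import Optional

_BOXED_RE = re.compile(r"\\boxed\{\s*([A-E])\s*\}")


def parse_choice(text: str, default: Optional[str] = None) -> Optional[str]:
    """Extract a boxed option letter; return `default` (None) when nothing parses."""
    match = _BOXED_RE.search(text or "")
    return match.group(1) if match else default


def compare_answers(prediction: str, gold: str) -> bool:
    """Score as correct only when both sides parse and agree."""
    pred_choice = parse_choice(prediction)
    gold_choice = parse_choice(gold)
    if pred_choice is None or gold_choice is None:
        return False  # a parse failure is never a match
    return pred_choice == gold_choice
```

With an `"A"` default on both sides, an unparseable prediction against a gold answer of A would have compared equal; with `None` it is scored as incorrect.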

Development Branches

Branch Overview

All branches are live on the public repository.

Branch	Key files	Capability added
`feature/domain-system-prompts`	`prompts.py`, `inference_mas.py`, `serve.py`	Domain prompts · HOUSE two-tab layout · batch eval · run metadata · warm cache
`feature/llm-postprocessing`	`serve.py`, `answer_utils.py`	LLM pre/post-processing · MCQ detection · dynamic VRAM · token counting · 3 eval fixes
`feature/sequential-text`	`inference_mas.py`, `run.py`, `load_from_repo.py`, `serve.py`	`sequential_text` style · `--method text_recursive` CLI · model-selection UI
`feature/multi-gpu`	`serve.py`	Per-agent GPU assignment · live VRAM fitness check · layout fix

Open Questions

Reflections & Future Directions

Engineering RecursiveMAS for production raises deeper questions about latent communication, safety, and self-improvement that go beyond benchmark accuracy.

🔍

Latent interpretability

The text mode is the natural observation window. Running both modes on the same question reveals what the latent channel encodes compactly vs. what requires explicit articulation.

🏛️

Constitutional constraints

Echo chambers, galaxy-brained reasoning, and oscillation are failure modes specific to iterative multi-agent loops. Diversity constraints on the Critic and uncertainty budgets can mitigate them.

🔄

Self-improvement loop

Text-mode trajectories labelled correct=True are natural fine-tuning data for new specialist models — a STaR-style flywheel adapted to the multi-agent setting.

🧩

Vertical specialisation

Domain-specialist agents (medical, legal, formal verification) require new adapter pairs. Sparse topologies ($O(N)$ links rather than $O(N^2)$) keep the adapter matrix tractable.
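
To make the scaling concrete, the sketch below counts directed links for a fully connected topology versus a simple ring, assuming one adapter per directed link. Both the counting convention and the choice of a ring as the sparse topology are illustrative assumptions.

```python
def full_mesh_links(n_agents: int) -> int:
    """Directed links when every agent can send to every other agent: N(N-1)."""
    return n_agents * (n_agents - 1)


def ring_links(n_agents: int) -> int:
    """Directed links when each agent sends only to its successor in a loop: N."""
    return n_agents if n_agents > 1 else 0


# Print agent count, ring links, and full-mesh links side by side.
for n in (3, 10, 30):
    print(n, ring_links(n), full_mesh_links(n))
```

At 3 agents the gap is small (3 versus 6 links), but at 30 agents a ring needs 30 adapters where a full mesh needs 870.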

A full treatment of these topics is available in the companion document: EXTENSIONS.md

Citation

How to Cite

If you build on this work, please also cite the original RecursiveMAS paper.

This work (HOUSE extensions)

@misc{salpietro2026house,
  title  = {HOUSE: Engineering Extensions for RecursiveMAS},
  author = {Daniele Carmelo Salpietro},
  year   = {2026},
  url    = {https://github.com/danielesalpietro/RecursiveMAS},
  note   = {Extensions to arXiv:2604.25917}
}

Base paper (RecursiveMAS)

@misc{recursivemas,
  title         = {Recursive Multi-Agent Systems},
  author        = {Xiyuan Yang and Jiaru Zou and Rui Pan and Ruizhong Qiu and Pan Lu and Shizhe Diao and Jindong Jiang and Hanghang Tong and Tong Zhang and Markus J. Buehler and Jingrui He and James Zou},
  year          = {2026},
  eprint        = {2604.25917},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2604.25917}
}

Privacy Policy

Last updated: June 2026

This website (danielesalpietro.github.io/RecursiveMAS) is a static informational page about the HOUSE — RecursiveMAS open-source project. It is hosted on GitHub Pages and does not use cookies, tracking scripts, analytics services, or any form of user profiling.

Data collected
This site does not collect, store, or process any personal data. No forms, login systems, or comment sections are present. No third-party advertising or analytics scripts are loaded.

External links
This page links to external services including GitHub, arXiv, HuggingFace, and LinkedIn. Each of those services is governed by its own privacy policy. Visiting those links may result in data collection by those services.

GitHub Pages hosting
GitHub may collect technical data (IP address, browser type, request timestamps) as part of serving the page. Please refer to GitHub's Privacy Statement for details.

Contact
For any privacy-related questions, contact: daniele@salpietro.it · www.salpietro.it
