Rust CLI · NL entry · building in public
Your vision,
AW execution.
aw <intent> is the only command — bare words, no quotes required (stdin pipe also works). An intake employee picks the right tool, a Receiver scopes new projects into a brief, a PM walks the workflow, and a pool of persistent agent employees actually ships. Self-hosted, multi-model, you remain CEO.
Honest status — 2026-05-30 (week-4 polish): end-to-end chain works. 175 unit + 6 integration tests, clippy -D warnings clean. SQLite-canonical store. Fitness-weighted intake with ε-greedy explore, eval-driven hire/fire/clone (gated by AW_AUTO_EVAL=1), multi-model adapter (3 providers), live aw --watch TUI with Tab/Shift+Tab focus + / search + Enter drill into trace, multi-call worker tool loop with meter-driven rotation + PM-side salvage + handoff inlining, 3 cancellation modes (graceful/immediate/rollback), brief-schema validator (13 invariants), 2 bundled workflow templates (webapp + bug-fix), agency-wide $ cap, JSONL migration, monitoring trio (AW_VERBOSE stderr + aw --monitor TUI 500ms + /live HTML 2s), Anthropic native tool_use foundation, dev-log audit dashboard with charts + agent I/O viewer. Math: P_ship(N=4) ≈ 0.97. OpenAI + Gemini native tool_use + real-API smoke + MCP-native v1 are next.
No signup. Today the binary bootstraps a workspace, picks an intake employee by track record, and dispatches your intent to a tool — Receiver / hire / fire / brief / score / trace / and more.
aw <intent>tools.rs)One design principle, load-bearing
Every architectural decision below is downstream of this line. The cost of crossing it once is the whole project drifting into "another CLI with verbs."
Mechanism Rust runtime in the binary
- Rotation engine + meter + handoff bridge
- File locks (RAII guard + stale-lock sweep),
supervisor.log - Workspace IO, role loader, model adapters, cost meters
- SQLite schema (canonical),
brief.jsonvalidator, span / trace plumbing
Small, fast, statically linked. Runs on a $5 VPS.
Policy Prompts of roles (vai)
- When to rotate a session
- Who to fire and why
- What counts as a blocking question
- How the agency talks to its CEO
Plain markdown. Iterate without a rebuild.
Test before adding any clap subcommand: would a human at an agency make this call by judgement? → policy → belongs in a role's prompt, not in the binary. Result: aw <intent> is the only invocation (bare words; quotes optional, stdin pipe also works). --version and --help are conveniences.
How it works
Natural language in, work out. Everything below the shell is a role file you can read and edit.
-
1
You say what you want
aw fix the login bug on /signin. No verbs to memorize, no quotes required. First invocation auto-bootstraps./.aw/withroles/{intake,receiver,pm,dev}.md,config.json, and the global~/.aw/agency/agency.db— idempotent on re-run. -
2
Intake picks a tool
One of your intake employees (fitness-weighted by win rate + cost efficiency + user-reaction blend) reads the intent and returns a strict
{"tool":"<name>","args":{...}}. Tools coverdispatch_new_project(hands off to Receiver),status/edit_brief/approve_brief/cancel_project,hire/fire/promote/demote/clone/lineage,score/activity/trace/explain/list_employees,rate_trace(user-reaction signal),speak/identity. New capability = new tool + registry entry intools.rs, not new flag. -
3
Receiver scopes, PM ships
For a new project, Receiver writes
brief.json— interpretation mirror, operating plan, governance — and exits. PM picks up the brief cold from disk and walks the workflow against a pool of persistent agent employees. You approve, question, fire, promote in plain language.
Watch aw think — three ways to follow live
Agent systems fail silently — a prompt drifts, a tool loops, a model picks the wrong path. aw ships three monitor views covering the contexts you actually work in: same shell, second terminal, browser tab. Pick one or run all three at once on the same trace.
Stderr stream — H1
streaming · same shell · tagged [HH:MM:SS]
$ AW_VERBOSE=1 aw build hello world
[09:26:14] intake: intent received → build…
[09:26:14] intake: selected intake-001
(haiku, 87% over 12 calls)
[09:26:16] intake: LLM 200 OK · 45 in
+ 18 out · 1820ms · tool=
dispatch_new_project
[09:26:20] receiver: LLM 200 OK · 820
in + 340 out · 3990ms
[09:26:20] pm: cold pickup ok · 5 stages
[09:26:20] worker: pm-001 iter 1/6
[09:26:24] worker: → DELIVERABLE (2840 B)
…
Every chain transition (intake / receiver / PM / worker per-iter) lands on stderr with model + tokens + latency. Pipe to tee for a permanent paper trail of the run.
TUI monitor — H2
500ms refresh · native terminal · 3 panes
$ aw --monitor
┌─ aw monitor · 500ms · q quit ─┐
│ trace: tr_8a9c12fb │
│ [in_progress] · 23s · root=u │
│ intent: build hello world │
│ today: $0.0245 · 4820 tokens │
└───────────────────────────────┘
┌─ latest LLM I/O ──────────────┐
│ 09:26:24 pm-001 sonnet │
│ in=1200 out=480 4200ms │
│ user: Stage: scope · User… │
│ resp: <<DELIVERABLE>& #… │
└───────────────────────────────┘
┌─ recent events (7) ───────────┐
│ 09:26:24 stage_complete pm-001│
│ 09:26:20 stage_dispatched │
│ … │
└───────────────────────────────┘
Status-colored badges (green / yellow / red), per-trace cost ticker, top 3 LLM calls truncated 120 chars. q quits, r force-refresh.
Browser /live — H3
2s refresh · full LLM I/O expanded
$ python3 dev-log/dashboard_server.py
# → http://127.0.0.1:8787/live
╔═══════════════════════════════╗
║ aw live · tr_8a9c12fb [active]║
║ duration 23s · 3 calls · 7 ev ║
╠═══════════════════════════════╣
║ ▼ pm-001 sonnet 4200ms ║
║ system: You are pm — a … ║
║ user: Stage: scope … ║
║ response (JSON pretty): ║
║ { ║
║ "tool": … ║
║ } ║
║ ▼ receiver sonnet 3990ms ║
║ … ║
╚═══════════════════════════════╝
Every LLM call expanded by default — full system + user + response, JSON pretty-print, events + spans tables. Browser tab you leave open while iterating.
Plus the audit dashboard at / — every autonomous team session logs slices / decisions / messages / agent_calls (full subagent I/O) into dev-log/orchestration.db. Same server, 11 chart types, slice timeline, expandable LLM I/O viewer. Post-hoc review when the live monitor scrolled past. See docs/monitoring.md for the full comparison + screenshots.
Architecture, locked 2026-05-28
Eight pillars. Each one is a place where we resisted putting a verb in the shell.
Thin shell, NL entry
One invocation: aw <intent> (bare words; stdin pipe also works). Adding a hardcoded verb is the trap. aw setup was proposed in DEC-026 and dropped — bootstrap is mechanism, fires on first NL call.
Role registry
System roles ship with the binary: Intake, Receiver, PM, Dev, Designer, QA. New capability = new roles/<name>.md with TOML frontmatter (allowed_tools[], default_model, daily_token_cap, trace_token_cap, fitness_{success,cost,user_reaction}). Roles are embedded via include_str! and seeded into .aw/roles/ on first call; edit-in-place to override. No rebuild.
Persistent employees
intake-001, dev-001 — long-lived identities with hire date, model, status, lifetime cost / win count. Roles are templates; employees are instantiations with personal history in agency.db + employees/<id>/lessons.md + per-trace notes.
Multi-model by design
Anthropic + OpenAI + Gemini wired (Ollama planned). Per-employee model assignment via role frontmatter; per-call override via env or brief. Anthropic gained native tools[] tool_use in slice s20c (OpenAI + Gemini next). Cognitive diversity is an asset, not a bug; identity survives a model swap.
Brief.json contract
Receiver → PM handoff is a single typed file. Three sections: interpretation mirror, operating plan, governance. PM cold-pickup validates against a 13-invariant schema check (slice s21) — every violation reported at once, not first-failure. No back-channel; the brief is the bridge.
Supervisor primitives
Rotation engine (60% soft / 80% hard meter), file locks with RAII guard, supervisor.log decision trail, handoff bridge, PM-side rotation salvage with cap = 2 retries. Multi-process kill primitive lands with the worker-process slice.
Eval-driven evolution
Auto-fire (rolling rate < 0.5 over ≥ 10 calls) + auto-clone winners onto fresh model variants. Both gated behind AW_AUTO_EVAL=1 (default off — recommendations log as notifications, operator confirms). User-reaction signal (rate_trace tool) blends into fitness once an employee has ≥ 5 reactions. Prompt-level mutation behind AW_AUTO_MUTATE_PROMPT=1 is next.
Owner-as-CEO
Hire / fire / promote / budget approvals all flow through you in NL. Not a Devin-style "watch it work" loop — an agency whose CEO you remain. Push for critical events, pull for status.
What lives on disk
Two roots. The agency persists in SQLite; per-employee notes stay as files. Per-cwd workspace holds config, roles, and project artifacts.
~/.aw/agency/ — the agency
# Persistent across projects · SQLite is the single source of truth ~/.aw/agency/ agency.db # employees, events, spans, traces, agency.db-wal # llm_calls (+ FTS5), notifications, agency.db-shm # rollups_daily — WAL mode employees/ intake-001/ role.md # optional override of seeded prompt lessons.md # append-only learnings notes/<slug>.md # per-trace working notes intake-002/ … dev-001/ …
<cwd>/.aw/ — workspace + projects
# Created on first NL call · idempotent on re-run <cwd>/.aw/ config.json # api_keys, default_provider, cost.* roles/ # intake/receiver/pm/dev/designer/qa.md workflows/ # webapp.json, bug-fix.json (idempotent seed) projects/<slug>/ brief.json # Receiver's scoping (13-invariant validated) brief_original.json # Receiver's first cut; brief.json is post-edit status.json # terminal state + cancellation mode artifacts/ # worker deliverables (cleared on rollback) agents/<id>/sessions/<NNN>/ stage.json stage_report.json handoff.json # only on rotation salvage CANCELLATION.md # only if cancelled immediate/rollback
No roster.json, no profile.json, no events.jsonl. Employee state, every event, every span, every LLM call, every cost rollup — all in agency.db. The directory only holds files that are files by design (role override, lessons, notes).
Honest status
Build started 2026-05-29 after pivoting from Phase 1 (URL scanner — closed per DEC-025). End-to-end chain shipped same day; 30+ slices since across 7 autonomous runs. Build-in-public via daily commits.
Shipped
- Single
awbinary, ~3.7 MB release (arm64; cross-compiles to musl Linux) - Auto-bootstrap — first
aw <intent>seeds./.aw/{roles,config.json}+ global~/.aw/agency/{agency.db,employees/}; idempotent - Shipped role prompts: Intake (with tool registry + fitness weights), Receiver (~150 lines), PM (~300 lines), dev (~165 lines)
- Intake dispatch — fitness-weighted selection, LLM-driven tool pick from a registry of 16+ tools (no keyword classifier; no offline path)
- Multi-model adapter — Anthropic + OpenAI + Gemini providers; per-role model assignment via TOML frontmatter
- End-to-end chain — Intake picks
dispatch_new_project→ Receiver writesbrief.json→ PM cold-pickup → DAG walker overworkflow.stages[]→ dev produces deliverable underartifacts/ - Persistent agent pool —
agency.dbemployees table;hire/fire/promote/demoteas typed UPDATE; auto-clone for winning workers, auto-fire/demote for losers - Cost meters — per-trace + per-day USD via vendor pricing; warn / block thresholds enforced by supervisor; rollups + LLM-call writes transactional
- SQLite-canonical store — every event, span, trace, llm_call (with FTS5), notification, rollup writes to
agency.db.roster.json,profile.json,events.jsonlall ripped (slices 9 + 10c) - Inspection surfaces —
--dashboard,--traces,--trace <id>,--activity,--why,--search(FTS5),--history,--recent,--notifications,--db-stats,--compact-traces,--success-rate-per-stage,--import-jsonl - Live TUI dashboard —
aw --watch, three panes with Tab/Shift+Tab focus cycling, 2s refresh +rforce-refresh,q/Esc/Ctrl-Cquit - Multi-call worker tool loop (slice 20): worker iterates via
read_file/write_file/list_files/run_check, cap atAW_WORKER_MAX_ITER, rotation gate every iteration - PM rotation salvage (slice 17): after worker hits forced rotation, PM picks a fresh employee + inlines prior
handoff.jsonas continuation context. Cap 2 salvages per stage. - Cancellation modes (slice s26):
cancel_projecttool with graceful / immediate / rollback. Mode auto-derived from intent keywords (rollback / immediate / now) or explicitmodearg. - Workflow templates: bundled
webapp.json(5 stages) +bug-fix.json(3 stages); user overrides preserved on re-bootstrap. - Eval gate (slice s23): auto-fire / auto-clone gated behind
AW_AUTO_EVAL=1. Default off — recommendations log as notifications, operator decides. - Bootstrap safety:
AW_REJECT_DEFAULT_AGENCY=1hard gate refuses default~/.aw/agency/whenAW_AGENCY_ROOTunset. - Migration:
aw --import-jsonl <agency_dir>one-shot brings pre-slice-9 dirs into SQLite (idempotent on employee insert). - Brief validator (slice s21): 13-invariant full schema check; returns all violations not first-failure; closes Receiver-hallucinates-bad-brief class.
- User-reaction signal (slice 18 + s18b):
rate_tracetool records explicit feedback;fitness_user_reactionweight folds it into the combiner once an employee has ≥ 5 reactions. - TUI v2 (slice A): ↑/↓ row selection per pane, / opens search filter, Enter drills into trace tree.
- Monitoring trio (slices H1+H2+H3):
AW_VERBOSE=1stderr stream tagged[HH:MM:SS];aw --monitorTUI at 500ms refresh;/liveHTML page at 2s with full LLM I/O. Seedocs/monitoring.md. - Native tool_use foundation (slice s20c-a/b):
LlmCall.tools[]+ Anthropic adapter sends + parses structured tool_use blocks. OpenAI + Gemini + worker integration in flight. - Designer + QA role files (slices G1+G2): full prompts for design + verify stages; webapp + bug-fix workflows now route to real roles instead of falling back to dev.
- Bootstrap safety:
AW_REJECT_DEFAULT_AGENCY=1hard gate refuses default~/.aw/agency/whenAW_AGENCY_ROOTunset (exit 2). Pair with$AW_AGENCY_ROOTin shell rc + CI. - Storage maintenance:
aw --vacuumflushes WAL + runs SQLite VACUUM; prints reclaimed bytes. - dev-log audit infrastructure: every autonomous team run records sessions / slices / decisions / messages / agent_calls (full subagent I/O) / backlog / escalations into
dev-log/orchestration.db. HTML dashboard atdashboard_server.pywith 11 chart types + LLM I/O viewer + slice timeline. - 175 unit + 6 integration tests; clippy
-D warningsclean. Per-agency-root SQLite connection cache + env-var serialization mutex so tests never pollute~/.aw/agency.db
Next
- Real-API E2E smoke (s14b + s28) — exercise webapp + bug-fix templates + forced rotation salvage + cancellation modes + eval auto-execute end-to-end. Captures matrix into
docs/smoke-real-api.md. - Native tool_use finish (s20c-c/d/e) — OpenAI Chat Completions
tools[], GeminifunctionDeclarations, worker integration gated behindAW_USE_NATIVE_TOOLS=1. - MCP-native v1 (3-4 weeks) — aw exposes its control plane via MCP server so external clients (Claude Desktop / Cline / Continue) can drive aw; workers become MCP-client sessions per role frontmatter
mcp_servers[]. Seedocs/mcp-native-architecture.md. - Prompt-level mutation behind
AW_AUTO_MUTATE_PROMPT=1— A/B prompt variants, endorsement counter, auto-retire after re-validation failure. - Multi-process worker + kill primitive — current single-process means "immediate" cancellation can't actually kill anything; real kill lands when worker spawns separate process.
- Cross-project memory + plugin system + DAG parallelism — designed; deferred until usage signal demands them.
roster.json ripped; aw --watch TUI ships (slices 9–11)--vacuum, monitoring trio (H1/H2/H3), native tool_use foundation (s20c-a/b)How aw differs
Honest table — we don't hide where the established projects are ahead.
| Dimension | aw | Devin | CrewAI | AutoGen | MetaGPT |
|---|---|---|---|---|---|
| Single binary, runs locally | ✓ | hosted | library | library | library |
| NL entry, zero hardcoded verbs | ✓ | chat | — | — | — |
| Persistent employee identities | ✓ | — | — | — | — |
| Owner-as-CEO governance | ✓ | autonomous | — | — | — |
| Multi-model per employee | ✓ | single | configurable | configurable | configurable |
| Eval-driven prompt evolution | planned | — | — | — | — |
| Mature today | v0 chain works | ~1 year | ~1 year | ~1 year | ~18 months |
| Team size | 3 founder + AW | ~25 | OSS | MS Research | OSS ~30 |
| Funding | $0 | ~$2B | OSS | internal | OSS |
Sources: each project's documentation, May 2026. aw isn't the right tool for code completion or a chat sidebar — those problems are solved.
Try the binary
Set $ANTHROPIC_API_KEY and the chain runs end-to-end: bootstrap → Intake picks a tool → for new projects, Receiver writes brief.json → PM walks the workflow → dev writes deliverables. No offline path — Intake needs an LLM to think.
# clone and build (Rust 1.75+) git clone https://github.com/dipgle/agent-workforce cd agent-workforce cargo build --release # symlink to your PATH ln -s "$(pwd)/target/release/aw" /usr/local/bin/aw # set your key (Anthropic / OpenAI / Gemini — all wired) export ANTHROPIC_API_KEY=... # power-user safety: refuse default agency dir + pin yours export AW_REJECT_DEFAULT_AGENCY=1 export AW_AGENCY_ROOT=$HOME/.aw-agency # first call auto-bootstraps ./.aw/ + agency dir, runs the chain mkdir myproject && cd myproject AW_VERBOSE=1 aw write a hello.md greeting # stream [HH:MM:SS] transitions to stderr # live monitor in a separate shell while aw runs aw --monitor # TUI, 500ms refresh, follows latest trace python3 dev-log/dashboard_server.py # http://127.0.0.1:8787/live (2s refresh) # peek at what happened ls .aw/projects/*/artifacts/ aw list employees aw --dashboard # one-screen agency snapshot aw --watch # interactive TUI; Tab/Shift+Tab, / search, Enter drill aw --success-rate-per-stage # empirical p_i per stage aw --vacuum # compact SQLite + print reclaimed bytes
Repo: dipgle/agent-workforce · Built in public · Daily commits.
Frequently asked
Why "Agent Workforce" and not just "agent"?
"Agent" is one assistant. A workforce is a structure: triage, scoping, planning, execution, review, hire/fire. The shape of an agency — not the smartness of a single model — is what ships real work.
Why Rust?
The shell, supervisor, rotation engine, file locks — mechanism — should be small, fast, statically linked, runnable on a $5 VPS. The judgement layer (roles) is prompts, not code; that's where iteration lives.
Why no subcommands?
Every hardcoded verb is a bet that we know what should happen. We don't. aw fire designer-002 is a sentence a human would say to a real PM — let the role read it and decide. If we can't enforce that at the smallest scale, we won't at the larger ones.
What about cost?
Per-employee model assignment + per-project budget caps + escalation thresholds (warn at 50%, block at 90%). Cheap models for cheap roles. The supervisor enforces; the PM negotiates with you when the cap looms.
How is this not Devin?
Devin is one autonomous agent you watch. aw is an agency you run. Different shape: persistent employees with history, an owner who approves spend and hires, a brief that bridges roles, a self-hosted binary instead of a hosted service.
What changed in May 2026?
The project pivoted away from "Phase 1" — a URL-scanner that emitted bug tickets. The wedge was wrong. Architecture for Agent Workforce was designed in the same session; we kept the architecture, threw out Phase 1, started clean.
When can I actually use it?
Today: clone, build, set $ANTHROPIC_API_KEY, run aw <intent> (bare words; quotes only for shell metachars). The full chain — Intake → Receiver (validates brief against 13-invariant schema; consults .aw/workflows/ templates) → PM → multi-call worker (with tool loop, rotation salvage + handoff inlining, cancellation modes) — produces a real deliverable under artifacts/. Cost meters with agency-wide $ cap, eval-driven hire/fire/clone (gated by AW_AUTO_EVAL=1), aw --watch TUI v2 (Tab focus + / search + Enter drill), 2 bundled workflow templates (webapp + bug-fix), 6 role files (intake/receiver/pm/dev/designer/qa), JSONL migration, aw --vacuum, monitoring trio (AW_VERBOSE stderr + aw --monitor + /live), and a dev-log audit dashboard at python3 dev-log/dashboard_server.py are all in. OpenAI + Gemini native tool_use + real-API smoke + MCP-native v1 are the next phase. Build in public — 38+ commits across 7 autonomous runs (slices 1–26 + week-4 follow-ups + monitoring trio).
Will the prompts be open?
System role prompts ship with the binary as plain markdown — they're embedded via include_str! and written into .aw/roles/{intake,receiver,pm,dev,designer,qa}.md on your first call. Edit them in place to override; the binary reads the on-disk copy if present. Source at crates/aw/src/roles/.
Watch the agency get built.
v0 chain works (P_ship ≈ 0.97). v1 is MCP-native — aw becomes an MCP server exposing its control plane; workers become MCP clients per role. Design doc. Daily commits — the binary moves before the announcement.
No newsletter. The repo is the changelog.