Rust CLI · NL entry · building in public

Your vision,
AW execution.

aw <intent> is the only command — bare words, no quotes required (stdin pipe also works). An intake employee picks the right tool, a Receiver scopes new projects into a brief, a PM walks the workflow, and a pool of persistent agent employees actually ships. Self-hosted, multi-model, you remain CEO.

Honest status — 2026-05-30 (week-4 polish): end-to-end chain works. 175 unit + 6 integration tests, clippy -D warnings clean. SQLite-canonical store. Fitness-weighted intake with ε-greedy explore, eval-driven hire/fire/clone (gated by AW_AUTO_EVAL=1), multi-model adapter (3 providers), live aw --watch TUI with Tab/Shift+Tab focus + / search + Enter drill into trace, multi-call worker tool loop with meter-driven rotation + PM-side salvage + handoff inlining, 3 cancellation modes (graceful/immediate/rollback), brief-schema validator (13 invariants), 2 bundled workflow templates (webapp + bug-fix), agency-wide $ cap, JSONL migration, monitoring trio (AW_VERBOSE stderr + aw --monitor TUI 500ms + /live HTML 2s), Anthropic native tool_use foundation, dev-log audit dashboard with charts + agent I/O viewer. Math: P_ship(N=4) ≈ 0.97. OpenAI + Gemini native tool_use + real-API smoke + MCP-native v1 are next.

No signup. Today the binary bootstraps a workspace, picks an intake employee by track record, and dispatches your intent to a tool — Receiver / hire / fire / brief / score / trace / and more.

1 canonical invocation: aw <intent>
19 NL-dispatched tools (registry in tools.rs)
3 model providers wired (Anthropic + OpenAI + Gemini)
0 hardcoded subcommands. New capability = new tool + role file

One design principle, load-bearing

Every architectural decision below is downstream of this line. The cost of crossing it once is the whole project drifting into "another CLI with verbs."

Mechanism: Rust runtime in the binary

  • Rotation engine + meter + handoff bridge
  • File locks (RAII guard + stale-lock sweep), supervisor.log
  • Workspace IO, role loader, model adapters, cost meters
  • SQLite schema (canonical), brief.json validator, span / trace plumbing

Small, fast, statically linked. Runs on a $5 VPS.
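
The "RAII guard + stale-lock sweep" item above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the names LockGuard and sweep_stale, and the use of file mtime to decide staleness, are assumptions made for the example.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Holds an exclusive lock file; the file is removed when the guard drops.
/// (Hypothetical type, for illustration only.)
pub struct LockGuard {
    path: PathBuf,
}

impl LockGuard {
    /// Acquire the lock by atomically creating the file.
    /// Fails with `AlreadyExists` if another holder has it.
    pub fn acquire(path: &Path) -> io::Result<LockGuard> {
        OpenOptions::new().write(true).create_new(true).open(path)?;
        Ok(LockGuard { path: path.to_path_buf() })
    }
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        // Best effort: a lock that fails to unlink is left for the sweep.
        let _ = fs::remove_file(&self.path);
    }
}

/// Remove a lock file whose mtime is older than `max_age`.
/// Returns true if a stale lock was swept. (Assumed staleness rule.)
pub fn sweep_stale(path: &Path, max_age: Duration) -> io::Result<bool> {
    let modified = match fs::metadata(path) {
        Ok(meta) => meta.modified()?,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    let age = SystemTime::now()
        .duration_since(modified)
        .unwrap_or(Duration::ZERO);
    if age > max_age {
        fs::remove_file(path)?;
        return Ok(true);
    }
    Ok(false)
}
```

Because release happens in Drop, an early return or a panic inside the critical section still unlocks; the sweep only covers the case where the process died before Drop could run.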

Policy: role prompts

  • When to rotate a session
  • Who to fire and why
  • What counts as a blocking question
  • How the agency talks to its CEO

Plain markdown. Iterate without a rebuild.

Test before adding any clap subcommand: would a human at an agency make this call by judgement? If so, it is policy, and policy belongs in a role's prompt, not in the binary. Result: aw <intent> is the only invocation (bare words; quotes optional, stdin pipe also works). --version and --help are conveniences.

How it works

Natural language in, work out. Everything below the shell is a role file you can read and edit.

  1. You say what you want

    aw fix the login bug on /signin. No verbs to memorize, no quotes required. First invocation auto-bootstraps ./.aw/ with roles/{intake,receiver,pm,dev}.md, config.json, and the global ~/.aw/agency/agency.db — idempotent on re-run.

  2. Intake picks a tool

    One of your intake employees (fitness-weighted by win rate + cost efficiency + user-reaction blend) reads the intent and returns a strict {"tool":"<name>","args":{...}}. Tools cover dispatch_new_project (hands off to Receiver), status / edit_brief / approve_brief / cancel_project, hire / fire / promote / demote / clone / lineage, score / activity / trace / explain / list_employees, rate_trace (user-reaction signal), speak / identity. New capability = new tool + registry entry in tools.rs, not new flag.

  3. Receiver scopes, PM ships

    For a new project, Receiver writes brief.json — interpretation mirror, operating plan, governance — and exits. PM picks up the brief cold from disk and walks the workflow against a pool of persistent agent employees. You approve, question, fire, promote in plain language.
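
The dispatch in step 2 comes down to resolving the "tool" string from Intake's reply against a fixed registry and rejecting anything else. A minimal sketch, assuming a static name table: the Tool enum, REGISTRY, and resolve are hypothetical names, and only a handful of the tool names listed above are included. JSON parsing is left out to keep the example dependency-free.

```rust
/// A few of the tool names listed in step 2; the real registry lives in tools.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tool {
    DispatchNewProject,
    Status,
    CancelProject,
    Hire,
    Fire,
    RateTrace,
}

/// Static name-to-variant table (assumed shape).
/// Adding a capability means adding a row here, not a new flag.
const REGISTRY: &[(&str, Tool)] = &[
    ("dispatch_new_project", Tool::DispatchNewProject),
    ("status", Tool::Status),
    ("cancel_project", Tool::CancelProject),
    ("hire", Tool::Hire),
    ("fire", Tool::Fire),
    ("rate_trace", Tool::RateTrace),
];

/// Resolve the "tool" field of Intake's reply; unknown names are an error,
/// never a silent fallback.
pub fn resolve(name: &str) -> Result<Tool, String> {
    REGISTRY
        .iter()
        .find(|(n, _)| *n == name)
        .map(|(_, tool)| *tool)
        .ok_or_else(|| format!("unknown tool: {name}"))
}
```

Under this sketch, resolve("fire") yields Tool::Fire, while a name the registry does not know returns an error the caller can surface instead of guessing.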

Watch aw think — three ways to follow live

Agent systems fail silently — a prompt drifts, a tool loops, a model picks the wrong path. aw ships three monitor views covering the contexts you actually work in: same shell, second terminal, browser tab. Pick one or run all three at once on the same trace.

🖥

Stderr stream — H1

streaming · same shell · tagged [HH:MM:SS]

$ AW_VERBOSE=1 aw build hello world
[09:26:14] intake: intent received → build…
[09:26:14] intake: selected intake-001
              (haiku, 87% over 12 calls)
[09:26:16] intake: LLM 200 OK · 45 in
              + 18 out · 1820ms · tool=
              dispatch_new_project
[09:26:20] receiver: LLM 200 OK · 820
              in + 340 out · 3990ms
[09:26:20] pm: cold pickup ok · 5 stages
[09:26:20] worker: pm-001 iter 1/6
[09:26:24] worker: → DELIVERABLE (2840 B)
…

Every chain transition (intake / receiver / PM / worker per-iter) lands on stderr with model + tokens + latency. Pipe to tee for a permanent paper trail of the run.
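
For readers who want to parse or reproduce the stream, the line shape in the sample above can be sketched as follows. The function names and the choice to pass the clock in as seconds since midnight are assumptions for the example; the source only shows the rendered output.

```rust
/// Render one transition line in the `[HH:MM:SS] stage: message` shape.
/// `secs_of_day` is seconds since local midnight, supplied by the caller.
pub fn transition_line(secs_of_day: u32, stage: &str, msg: &str) -> String {
    let h = secs_of_day / 3600 % 24;
    let m = secs_of_day / 60 % 60;
    let s = secs_of_day % 60;
    format!("[{h:02}:{m:02}:{s:02}] {stage}: {msg}")
}

/// Summarize one successful LLM call with tokens and latency,
/// mirroring the "45 in + 18 out · 1820ms" fragment in the sample.
pub fn llm_summary(tokens_in: u32, tokens_out: u32, latency_ms: u64) -> String {
    format!("LLM 200 OK · {tokens_in} in + {tokens_out} out · {latency_ms}ms")
}
```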

📺

TUI monitor — H2

500ms refresh · native terminal · 3 panes

$ aw --monitor
┌─ aw monitor · 500ms · q quit ─┐
│ trace: tr_8a9c12fb            │
│ [in_progress] · 23s · root=u  │
│ intent: build hello world     │
│ today: $0.0245 · 4820 tokens  │
└───────────────────────────────┘
┌─ latest LLM I/O ──────────────┐
│ 09:26:24 pm-001 sonnet        │
│   in=1200 out=480 4200ms      │
│   user: Stage: scope · User…  │
│   resp: <<DELIVERABLE>> #…    │
└───────────────────────────────┘
┌─ recent events (7) ───────────┐
│ 09:26:24 stage_complete pm-001│
│ 09:26:20 stage_dispatched     │
│ …                             │
└───────────────────────────────┘

Status-colored badges (green / yellow / red), per-trace cost ticker, top 3 LLM calls truncated to 120 chars. q quits, r forces a refresh.

🌐

Browser /live — H3

2s refresh · full LLM I/O expanded

$ python3 dev-log/dashboard_server.py
# → http://127.0.0.1:8787/live

╔═══════════════════════════════╗
║ aw live · tr_8a9c12fb [active]║
║ duration 23s · 3 calls · 7 ev ║
╠═══════════════════════════════╣
║ ▼ pm-001  sonnet  4200ms      ║
║   system: You are pm — a …    ║
║   user:   Stage: scope …      ║
║   response (JSON pretty):     ║
║     {                         ║
║       "tool": …               ║
║     }                         ║
║ ▼ receiver  sonnet  3990ms    ║
║   …                           ║
╚═══════════════════════════════╝

Every LLM call expanded by default — full system + user + response, JSON pretty-print, events + spans tables. Browser tab you leave open while iterating.

Plus the audit dashboard at / — every autonomous team session logs slices / decisions / messages / agent_calls (full subagent I/O) into dev-log/orchestration.db. Same server, 11 chart types, slice timeline, expandable LLM I/O viewer. Post-hoc review when the live monitor scrolled past. See docs/monitoring.md for the full comparison + screenshots.

Architecture, locked 2026-05-28

Eight pillars. Each one is a place where we resisted putting a verb in the shell.

Thin shell, NL entry

One invocation: aw <intent> (bare words; stdin pipe also works). Adding a hardcoded verb is the trap. aw setup was proposed in DEC-026 and dropped — bootstrap is mechanism, fires on first NL call.
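
A thin shell of this kind mostly has to turn "bare words or piped stdin" into one intent string. A minimal sketch, assuming command-line words take precedence over piped input (the source does not say which wins); intent_from is a hypothetical name.

```rust
/// Join bare words into one intent string; fall back to piped stdin text
/// when no words were given. Returns None when both are empty.
pub fn intent_from(words: &[String], piped: Option<&str>) -> Option<String> {
    let joined = words.join(" ");
    let joined = joined.trim();
    if !joined.is_empty() {
        return Some(joined.to_string());
    }
    piped
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
}
```

In a real binary the words would come from std::env::args().skip(1) and the piped text from reading stdin to a string; both are kept out of the function so it stays easy to test.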

🎭

Role registry

System roles ship with the binary: Intake, Receiver, PM, Dev, Designer, QA. New capability = new roles/<name>.md with TOML frontmatter (allowed_tools[], default_model, daily_token_cap, trace_token_cap, fitness_{success,cost,user_reaction}). Roles are embedded via include_str! and seeded into .aw/roles/ on first call; edit-in-place to override. No rebuild.
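
The seed-then-override behaviour described above can be sketched as "write the embedded copy only if the file is missing, then always read from disk". The constant below stands in for an include_str! embed, and load_role is a hypothetical name; the real loader also has to parse the frontmatter, which is omitted here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Stand-in for a prompt embedded at compile time via include_str!.
pub const EMBEDDED_DEV_ROLE: &str = "# dev\nYou are dev.\n";

/// Seed the role file if it is missing, then return whatever is on disk.
/// An edited on-disk copy therefore always wins over the embedded one.
pub fn load_role(roles_dir: &Path, name: &str, embedded: &str) -> io::Result<String> {
    fs::create_dir_all(roles_dir)?;
    let path = roles_dir.join(format!("{name}.md"));
    if !path.exists() {
        fs::write(&path, embedded)?;
    }
    fs::read_to_string(&path)
}
```

Seeding is idempotent in this sketch: a second call never overwrites the file, which is what makes edit-in-place safe across re-runs.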

👥

Persistent employees

intake-001, dev-001 — long-lived identities with hire date, model, status, lifetime cost / win count. Roles are templates; employees are instantiations with personal history in agency.db + employees/<id>/lessons.md + per-trace notes.
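
Selection among these persistent employees is described elsewhere on this page as fitness-weighted with ε-greedy exploration. One way to read that, as a sketch only: with probability ε pick uniformly (explore), otherwise sample in proportion to fitness (exploit). The proportional-sampling reading, the function name, and the caller-supplied random draws are all assumptions; fitness values are assumed non-negative.

```rust
/// Pick an employee index.
/// `u_explore` and `u_pick` are uniform draws in [0, 1) supplied by the caller,
/// which keeps the function deterministic and testable.
pub fn pick_employee(
    fitness: &[f64],
    epsilon: f64,
    u_explore: f64,
    u_pick: f64,
) -> Option<usize> {
    if fitness.is_empty() {
        return None;
    }
    let n = fitness.len();
    let total: f64 = fitness.iter().sum();
    // Explore: uniform pick. Also used when nobody has any fitness yet.
    if u_explore < epsilon || total <= 0.0 {
        return Some(((u_pick * n as f64) as usize).min(n - 1));
    }
    // Exploit: roulette-wheel sampling proportional to fitness.
    let target = u_pick * total;
    let mut acc = 0.0;
    for (i, f) in fitness.iter().enumerate() {
        acc += f;
        if target < acc {
            return Some(i);
        }
    }
    Some(n - 1)
}
```

The explore branch is what gives a newly hired or recently cloned employee a chance to build a track record before fitness weighting would otherwise starve it of calls.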

🔀

Multi-model by design

Anthropic + OpenAI + Gemini wired (Ollama planned). Per-employee model assignment via role frontmatter; per-call override via env or brief. Anthropic gained native tools[] tool_use in slice s20c (OpenAI + Gemini next). Cognitive diversity is an asset, not a bug; identity survives a model swap.

📜

Brief.json contract

Receiver → PM handoff is a single typed file. Three sections: interpretation mirror, operating plan, governance. PM cold-pickup validates against a 13-invariant schema check (slice s21) — every violation reported at once, not first-failure. No back-channel; the brief is the bridge.
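
"Every violation reported at once" is the difference between returning on the first failed check and accumulating all of them. A minimal sketch of the accumulating style: the Brief struct, its three fields, and the three checks are hypothetical stand-ins, one per section named above, not the real 13 invariants.

```rust
/// Hypothetical, trimmed-down brief with one field per section.
pub struct Brief {
    pub interpretation: String, // interpretation mirror
    pub stages: Vec<String>,    // operating plan
    pub budget_usd: f64,        // governance
}

/// Run every check and collect all violations instead of stopping at the first.
pub fn validate(brief: &Brief) -> Result<(), Vec<String>> {
    let mut violations = Vec::new();
    if brief.interpretation.trim().is_empty() {
        violations.push("interpretation mirror is empty".to_string());
    }
    if brief.stages.is_empty() {
        violations.push("operating plan has no stages".to_string());
    }
    if brief.budget_usd.is_nan() || brief.budget_usd <= 0.0 {
        violations.push("governance budget must be positive".to_string());
    }
    if violations.is_empty() {
        Ok(())
    } else {
        Err(violations)
    }
}
```

Returning the full list lets whoever wrote the brief fix everything in one pass rather than rediscovering one problem per retry.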

🛡

Supervisor primitives

Rotation engine (60% soft / 80% hard meter), file locks with RAII guard, supervisor.log decision trail, handoff bridge, PM-side rotation salvage with cap = 2 retries. Multi-process kill primitive lands with the worker-process slice.
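
The 60% soft / 80% hard meter and the salvage cap reduce to two small decisions. In this sketch the meter is assumed to compare units used against a cap (for example, context tokens), and the thresholds are treated as inclusive; both are assumptions, as are the names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Rotation {
    Continue,
    Soft, // plan a handoff
    Hard, // forced rotation
}

pub const SOFT_PCT: u64 = 60;
pub const HARD_PCT: u64 = 80;
pub const MAX_SALVAGES: u32 = 2;

/// Classify meter usage. Integer math avoids float rounding at the thresholds:
/// used / cap >= pct / 100  is checked as  used * 100 >= cap * pct.
pub fn rotation_for(used: u64, cap: u64) -> Rotation {
    if cap == 0 {
        return Rotation::Hard;
    }
    let scaled = used.saturating_mul(100);
    if scaled >= cap.saturating_mul(HARD_PCT) {
        Rotation::Hard
    } else if scaled >= cap.saturating_mul(SOFT_PCT) {
        Rotation::Soft
    } else {
        Rotation::Continue
    }
}

/// PM-side salvage is allowed only while under the per-stage cap.
pub fn may_salvage(salvages_so_far: u32) -> bool {
    salvages_so_far < MAX_SALVAGES
}
```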

🧮

Eval-driven evolution

Auto-fire (rolling rate < 0.5 over ≥ 10 calls) + auto-clone winners onto fresh model variants. Both gated behind AW_AUTO_EVAL=1 (default off — recommendations log as notifications, operator confirms). User-reaction signal (rate_trace tool) blends into fitness once an employee has ≥ 5 reactions. Prompt-level mutation behind AW_AUTO_MUTATE_PROMPT=1 is next.
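
The two thresholds above translate directly into code. In this sketch the fire rule follows the stated numbers; the fitness combiner is an assumption (a normalized weighted mean that leaves the reaction term out until there are enough reactions), and the Weights struct is a hypothetical stand-in for the fitness_{success,cost,user_reaction} frontmatter keys.

```rust
pub const MIN_CALLS: u32 = 10;
pub const FIRE_BELOW: f64 = 0.5;
pub const MIN_REACTIONS: u32 = 5;

/// Recommend firing only with enough evidence: rate < 0.5 over >= 10 calls.
pub fn should_fire(wins: u32, calls: u32) -> bool {
    calls >= MIN_CALLS && (wins as f64 / calls as f64) < FIRE_BELOW
}

/// Auto-execution is opt-in; anything other than "1" leaves it as a recommendation.
pub fn auto_eval_enabled(env_value: Option<&str>) -> bool {
    env_value == Some("1")
}

/// Hypothetical weights, one per fitness_* frontmatter key.
pub struct Weights {
    pub success: f64,
    pub cost: f64,
    pub user_reaction: f64,
}

/// Weighted mean of the available signals (assumed combiner).
/// `reaction` is (mean score, number of reactions).
pub fn fitness(w: &Weights, success: f64, cost_eff: f64, reaction: Option<(f64, u32)>) -> f64 {
    let mut num = w.success * success + w.cost * cost_eff;
    let mut den = w.success + w.cost;
    if let Some((score, n)) = reaction {
        if n >= MIN_REACTIONS {
            num += w.user_reaction * score;
            den += w.user_reaction;
        }
    }
    if den > 0.0 { num / den } else { 0.0 }
}
```

Renormalizing by the weights actually used keeps scores comparable between employees who have reactions and those who do not yet.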

🏛

Owner-as-CEO

Hire / fire / promote / budget approvals all flow through you in NL. Not a Devin-style "watch it work" loop — an agency whose CEO you remain. Push for critical events, pull for status.

What lives on disk

Two roots. The agency persists in SQLite; per-employee notes stay as files. Per-cwd workspace holds config, roles, and project artifacts.

~/.aw/agency/ — the agency

# Persistent across projects · SQLite is the single source of truth
~/.aw/agency/
  agency.db               # employees, events, spans, traces,
  agency.db-wal           # llm_calls (+ FTS5), notifications,
  agency.db-shm           # rollups_daily — WAL mode
  employees/
    intake-001/
      role.md             # optional override of seeded prompt
      lessons.md          # append-only learnings
      notes/<slug>.md     # per-trace working notes
    intake-002/ …
    dev-001/ …

<cwd>/.aw/ — workspace + projects

# Created on first NL call · idempotent on re-run
<cwd>/.aw/
  config.json                 # api_keys, default_provider, cost.*
  roles/                      # intake/receiver/pm/dev/designer/qa.md
  workflows/                  # webapp.json, bug-fix.json (idempotent seed)
  projects/<slug>/
    brief.json                # Receiver's scoping (13-invariant validated)
    brief_original.json       # Receiver's first cut; brief.json is post-edit
    status.json               # terminal state + cancellation mode
    artifacts/                # worker deliverables (cleared on rollback)
    agents/<id>/sessions/<NNN>/
      stage.json
      stage_report.json
      handoff.json            # only on rotation salvage
    CANCELLATION.md           # only if cancelled immediate/rollback

No roster.json, no profile.json, no events.jsonl. Employee state, every event, every span, every LLM call, every cost rollup — all in agency.db. The directory only holds files that are files by design (role override, lessons, notes).

Honest status

Build started 2026-05-29 after pivoting from Phase 1 (URL scanner — closed per DEC-025). End-to-end chain shipped same day; 30+ slices since across 7 autonomous runs. Build-in-public via daily commits.

Shipped

  • Single aw binary, ~3.7 MB release (arm64; cross-compiles to musl Linux)
  • Auto-bootstrap — first aw <intent> seeds ./.aw/{roles,config.json} + global ~/.aw/agency/{agency.db,employees/}; idempotent
  • Shipped role prompts: Intake (with tool registry + fitness weights), Receiver (~150 lines), PM (~300 lines), dev (~165 lines)
  • Intake dispatch — fitness-weighted selection, LLM-driven tool pick from a registry of 16+ tools (no keyword classifier; no offline path)
  • Multi-model adapter — Anthropic + OpenAI + Gemini providers; per-role model assignment via TOML frontmatter
  • End-to-end chain — Intake picks dispatch_new_project → Receiver writes brief.json → PM cold-pickup → DAG walker over workflow.stages[] → dev produces deliverable under artifacts/
  • Persistent agent pool — agency.db employees table; hire / fire / promote / demote as typed UPDATE; auto-clone for winning workers, auto-fire/demote for losers
  • Cost meters — per-trace + per-day USD via vendor pricing; warn / block thresholds enforced by supervisor; rollups + LLM-call writes transactional
  • SQLite-canonical store — every event, span, trace, llm_call (with FTS5), notification, rollup writes to agency.db. roster.json, profile.json, events.jsonl all ripped (slices 9 + 10c)
  • Inspection surfaces — --dashboard, --traces, --trace <id>, --activity, --why, --search (FTS5), --history, --recent, --notifications, --db-stats, --compact-traces, --success-rate-per-stage, --import-jsonl
  • Live TUI dashboard — aw --watch, three panes with Tab/Shift+Tab focus cycling, 2s refresh + r force-refresh, q/Esc/Ctrl-C quit
  • Multi-call worker tool loop (slice 20): worker iterates via read_file/write_file/list_files/run_check, cap at AW_WORKER_MAX_ITER, rotation gate every iteration
  • PM rotation salvage (slice 17): after worker hits forced rotation, PM picks a fresh employee + inlines prior handoff.json as continuation context. Cap 2 salvages per stage.
  • Cancellation modes (slice s26): cancel_project tool with graceful / immediate / rollback. Mode auto-derived from intent keywords (rollback / immediate / now) or explicit mode arg.
  • Workflow templates: bundled webapp.json (5 stages) + bug-fix.json (3 stages); user overrides preserved on re-bootstrap.
  • Eval gate (slice s23): auto-fire / auto-clone gated behind AW_AUTO_EVAL=1. Default off — recommendations log as notifications, operator decides.
  • Migration: aw --import-jsonl <agency_dir> one-shot brings pre-slice-9 dirs into SQLite (idempotent on employee insert).
  • Brief validator (slice s21): 13-invariant full schema check; returns all violations not first-failure; closes Receiver-hallucinates-bad-brief class.
  • User-reaction signal (slice 18 + s18b): rate_trace tool records explicit feedback; fitness_user_reaction weight folds it into the combiner once an employee has ≥ 5 reactions.
  • TUI v2 (slice A): ↑/↓ row selection per pane, / opens search filter, Enter drills into trace tree.
  • Monitoring trio (slices H1+H2+H3): AW_VERBOSE=1 stderr stream tagged [HH:MM:SS]; aw --monitor TUI at 500ms refresh; /live HTML page at 2s with full LLM I/O. See docs/monitoring.md.
  • Native tool_use foundation (slice s20c-a/b): LlmCall.tools[] + Anthropic adapter sends + parses structured tool_use blocks. OpenAI + Gemini + worker integration in flight.
  • Designer + QA role files (slices G1+G2): full prompts for design + verify stages; webapp + bug-fix workflows now route to real roles instead of falling back to dev.
  • Bootstrap safety: AW_REJECT_DEFAULT_AGENCY=1 hard gate refuses default ~/.aw/agency/ when AW_AGENCY_ROOT unset (exit 2). Pair with $AW_AGENCY_ROOT in shell rc + CI.
  • Storage maintenance: aw --vacuum flushes WAL + runs SQLite VACUUM; prints reclaimed bytes.
  • dev-log audit infrastructure: every autonomous team run records sessions / slices / decisions / messages / agent_calls (full subagent I/O) / backlog / escalations into dev-log/orchestration.db. HTML dashboard at dashboard_server.py with 11 chart types + LLM I/O viewer + slice timeline.
  • 175 unit + 6 integration tests; clippy -D warnings clean. Per-agency-root SQLite connection cache + env-var serialization mutex so tests never pollute ~/.aw/agency.db
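
The cancellation bullet in the list above says the mode is derived from intent keywords or an explicit mode arg. A minimal sketch of that derivation, with several assumptions made for the example: an explicit mode wins, "rollback" outranks "immediate" and "now", "now" maps to immediate, and graceful is the default. Keywords are matched as whole words so that "know" does not trigger "now".

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum CancelMode {
    Graceful,
    Immediate,
    Rollback,
}

/// Parse an explicit mode argument, case-insensitively.
pub fn parse_mode(s: &str) -> Option<CancelMode> {
    match s.trim().to_ascii_lowercase().as_str() {
        "graceful" => Some(CancelMode::Graceful),
        "immediate" => Some(CancelMode::Immediate),
        "rollback" => Some(CancelMode::Rollback),
        _ => None,
    }
}

/// Derive the mode: explicit arg first, then whole-word keywords in the intent.
pub fn derive_mode(intent: &str, explicit: Option<&str>) -> CancelMode {
    if let Some(mode) = explicit.and_then(parse_mode) {
        return mode;
    }
    let lower = intent.to_ascii_lowercase();
    let has = |kw: &str| {
        lower
            .split(|c: char| !c.is_ascii_alphanumeric())
            .any(|word| word == kw)
    };
    if has("rollback") {
        CancelMode::Rollback
    } else if has("immediate") || has("now") {
        CancelMode::Immediate
    } else {
        CancelMode::Graceful
    }
}
```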

Next

  • Real-API E2E smoke (s14b + s28) — exercise webapp + bug-fix templates + forced rotation salvage + cancellation modes + eval auto-execute end-to-end. Captures matrix into docs/smoke-real-api.md.
  • Native tool_use finish (s20c-c/d/e) — OpenAI Chat Completions tools[], Gemini functionDeclarations, worker integration gated behind AW_USE_NATIVE_TOOLS=1.
  • MCP-native v1 (3-4 weeks) — aw exposes its control plane via MCP server so external clients (Claude Desktop / Cline / Continue) can drive aw; workers become MCP-client sessions per role frontmatter mcp_servers[]. See docs/mcp-native-architecture.md.
  • Prompt-level mutation behind AW_AUTO_MUTATE_PROMPT=1 — A/B prompt variants, endorsement counter, auto-retire after re-validation failure.
  • Multi-process worker + kill primitive — current single-process means "immediate" cancellation can't actually kill anything; real kill lands when worker spawns separate process.
  • Cross-project memory + plugin system + DAG parallelism — designed; deferred until usage signal demands them.
2026-05-28  Architecture designed; Phase 1 closed (DEC-025)
2026-05-29  Full chain shipped (Intake → Receiver → PM → dev end-to-end)
2026-05-30  SQLite-canonical store; roster.json ripped; aw --watch TUI ships (slices 9–11)
2026-05-30  3-week autonomous roadmap closed (slices 16–26 + week-2/3 follow-ups): rotation salvage, brief validator, workflow templates, cancellation modes, eval gate, agency $ cap, import-jsonl, TUI nav
2026-05-30  Pre-API-key polish: TUI v2 search+drill, MCP design doc, Receiver workflow consultation, --vacuum, monitoring trio (H1/H2/H3), native tool_use foundation (s20c-a/b)
+1 wk       Real-API smoke matrix; OpenAI + Gemini native tool_use; worker integration
+3-4 wks    MCP-native v1: aw as MCP server, workers as MCP clients

How aw differs

Honest table — we don't hide where the established projects are ahead.

Dimension                      | aw             | Devin      | CrewAI       | AutoGen      | MetaGPT
Single binary, runs locally    |                | hosted     | library      | library      | library
NL entry, zero hardcoded verbs |                | chat       |              |              |
Persistent employee identities |                |            |              |              |
Owner-as-CEO governance        |                | autonomous |              |              |
Multi-model per employee       |                | single     | configurable | configurable | configurable
Eval-driven prompt evolution   | planned        |            |              |              |
Mature today                   | v0 chain works | ~1 year    | ~1 year      | ~1 year      | ~18 months
Team size                      | 3 founder + AW | ~25        | OSS          | MS Research  | OSS ~30
Funding                        | $0             | ~$2B       | OSS          | internal     | OSS

Sources: each project's documentation, May 2026. aw isn't the right tool for code completion or a chat sidebar — those problems are solved.

Try the binary

Set $ANTHROPIC_API_KEY and the chain runs end-to-end: bootstrap → Intake picks a tool → for new projects, Receiver writes brief.json → PM walks the workflow → dev writes deliverables. No offline path — Intake needs an LLM to think.

# clone and build (Rust 1.75+)
git clone https://github.com/dipgle/agent-workforce
cd agent-workforce
cargo build --release

# symlink to your PATH
ln -s "$(pwd)/target/release/aw" /usr/local/bin/aw

# set your key (Anthropic / OpenAI / Gemini — all wired)
export ANTHROPIC_API_KEY=...

# power-user safety: refuse default agency dir + pin yours
export AW_REJECT_DEFAULT_AGENCY=1
export AW_AGENCY_ROOT=$HOME/.aw-agency

# first call auto-bootstraps ./.aw/ + agency dir, runs the chain
mkdir myproject && cd myproject
AW_VERBOSE=1 aw write a hello.md greeting   # stream [HH:MM:SS] transitions to stderr

# live monitor in a separate shell while aw runs
aw --monitor                          # TUI, 500ms refresh, follows latest trace
python3 dev-log/dashboard_server.py   # http://127.0.0.1:8787/live  (2s refresh)

# peek at what happened
ls .aw/projects/*/artifacts/
aw list employees
aw --dashboard                        # one-screen agency snapshot
aw --watch                            # interactive TUI; Tab/Shift+Tab, / search, Enter drill
aw --success-rate-per-stage           # empirical p_i per stage
aw --vacuum                           # compact SQLite + print reclaimed bytes

Repo: dipgle/agent-workforce · Built in public · Daily commits.

Frequently asked

Why "Agent Workforce" and not just "agent"?

"Agent" is one assistant. A workforce is a structure: triage, scoping, planning, execution, review, hire/fire. The shape of an agency — not the smartness of a single model — is what ships real work.

Why Rust?

The shell, supervisor, rotation engine, file locks — mechanism — should be small, fast, statically linked, runnable on a $5 VPS. The judgement layer (roles) is prompts, not code; that's where iteration lives.

Why no subcommands?

Every hardcoded verb is a bet that we know what should happen. We don't. aw fire designer-002 is a sentence a human would say to a real PM — let the role read it and decide. If we can't enforce that at the smallest scale, we won't at the larger ones.

What about cost?

Per-employee model assignment + per-project budget caps + escalation thresholds (warn at 50%, block at 90%). Cheap models for cheap roles. The supervisor enforces; the PM negotiates with you when the cap looms.
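
The warn-at-50% / block-at-90% escalation can be sketched as a pure function over spend and cap. Amounts are held in millionths of a dollar here so that sub-cent spends stay exact; that unit, the inclusive thresholds, and the names are assumptions for the example.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BudgetState {
    Within,
    Warn,  // PM raises it with the owner
    Block, // supervisor refuses further spend
}

pub const WARN_PCT: u64 = 50;
pub const BLOCK_PCT: u64 = 90;

/// Classify spend against a cap, both in micro-dollars (1 USD = 1_000_000).
/// A zero cap blocks everything.
pub fn budget_state(spent_micros: u64, cap_micros: u64) -> BudgetState {
    if cap_micros == 0 {
        return BudgetState::Block;
    }
    let scaled = spent_micros.saturating_mul(100);
    if scaled >= cap_micros.saturating_mul(BLOCK_PCT) {
        BudgetState::Block
    } else if scaled >= cap_micros.saturating_mul(WARN_PCT) {
        BudgetState::Warn
    } else {
        BudgetState::Within
    }
}
```

With a $1.00 cap, a spend of $0.0245 is budget_state(24_500, 1_000_000), which is well inside the cap; the same function can be applied per trace, per project, or agency-wide by changing what the caller passes in.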

How is this not Devin?

Devin is one autonomous agent you watch. aw is an agency you run. Different shape: persistent employees with history, an owner who approves spend and hires, a brief that bridges roles, a self-hosted binary instead of a hosted service.

What changed in May 2026?

The project pivoted away from "Phase 1" — a URL-scanner that emitted bug tickets. The wedge was wrong. Architecture for Agent Workforce was designed in the same session; we kept the architecture, threw out Phase 1, started clean.

When can I actually use it?

Today: clone, build, set $ANTHROPIC_API_KEY, run aw <intent> (bare words; quotes only for shell metachars). The full chain produces a real deliverable under artifacts/: Intake → Receiver (validates brief against 13-invariant schema; consults .aw/workflows/ templates) → PM → multi-call worker (with tool loop, rotation salvage + handoff inlining, cancellation modes). Also in: cost meters with agency-wide $ cap, eval-driven hire/fire/clone (gated by AW_AUTO_EVAL=1), aw --watch TUI v2 (Tab focus + / search + Enter drill), 2 bundled workflow templates (webapp + bug-fix), 6 role files (intake/receiver/pm/dev/designer/qa), JSONL migration, aw --vacuum, monitoring trio (AW_VERBOSE stderr + aw --monitor + /live), and a dev-log audit dashboard served by python3 dev-log/dashboard_server.py. OpenAI + Gemini native tool_use + real-API smoke + MCP-native v1 are the next phase. Build in public: 38+ commits across 7 autonomous runs (slices 1–26 + week-4 follow-ups + monitoring trio).

Will the prompts be open?

System role prompts ship with the binary as plain markdown — they're embedded via include_str! and written into .aw/roles/{intake,receiver,pm,dev,designer,qa}.md on your first call. Edit them in place to override; the binary reads the on-disk copy if present. Source at crates/aw/src/roles/.

Watch the agency get built.

v0 chain works (P_ship ≈ 0.97). v1 is MCP-native: aw becomes an MCP server exposing its control plane; workers become MCP clients per role. Design doc: docs/mcp-native-architecture.md. Daily commits; the binary moves before the announcement.

No newsletter. The repo is the changelog.