How We Run Growth Ops at BiClaw: AI Agents Doing the Work
An inside look at how BiClaw uses AI agents for content, SEO, and growth operations — what works, what breaks, and what it actually costs.

TL;DR
- BiClaw runs its entire content and growth operation with a small human team plus AI agents; there is no dedicated content team.
- The agent pipeline covers: keyword research → outline → draft → QA → publish → revalidate → verify.
- Daily cost target: under $20/day across all agents. Main challenge: keeping context lean.
- Key lesson: agents fail on ambiguity. Clear SOPs + good guardrails = reliable output.
Why We Built This
BiClaw is a pre-revenue startup. We can't hire 5 content writers, a social media manager, and a growth analyst. But we still need to compete for organic traffic against companies with real teams.
So we built the operation we could afford: a small team of AI agents with clear roles, tight guardrails, and daily human oversight.
This post is a transparent look at how it works, what we've learned, and what still breaks.
The Stack
We run on OpenClaw — the same platform BiClaw is built on. Each agent has:
- A dedicated workspace with its own SKILL.md (instructions + constraints)
- A model assigned to the task type (cheaper for utility, stronger for writing)
- A daily token budget with hard stops
- Logging to usage.jsonl and quality scoring to scores.jsonl
Four agents currently active:
| Agent | Role | Model |
|---|---|---|
| Growth | Blog content, outreach drafts | DeepSeek V3.2 |
| Ops | Infrastructure health, monitoring | GPT-4o-mini |
| Optimizer | Landing page experiments, conversion | GPT-4o-mini |
| Main | Orchestration, daily review, operator comms | Claude Sonnet 4.6 |
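The per-agent setup above can be sketched as a config structure. The field names and the per-agent budget split are illustrative assumptions; only the $20/day total cap and the log file names come from this post:

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """One agent's workspace settings (field names are illustrative)."""
    name: str
    skill_file: str          # SKILL.md: instructions + constraints
    model: str               # routed by task type
    daily_budget_usd: float  # hard stop per agent (split is hypothetical)
    usage_log: str = "usage.jsonl"
    scores_log: str = "scores.jsonl"

AGENTS = [
    AgentConfig("Growth", "growth/SKILL.md", "deepseek-v3.2", 10.0),
    AgentConfig("Ops", "ops/SKILL.md", "gpt-4o-mini", 3.0),
    AgentConfig("Optimizer", "optimizer/SKILL.md", "gpt-4o-mini", 3.0),
    AgentConfig("Main", "main/SKILL.md", "claude-sonnet-4.6", 4.0),
]

# The per-agent budgets must sum to the daily cap.
assert sum(a.daily_budget_usd for a in AGENTS) == 20.0
```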
The Content Pipeline (End to End)
Every blog post goes through this flow:
1. Keyword brief → Growth agent
2. GET /api/content/related (internal link candidates)
3. Outline (gpt-5-mini: fast, cheap)
4. Full draft (~1,800 words, DeepSeek V3.2)
5. QA: validate links, word count, MDX safety
6. Publish via publish-with-verify.sh
7. POST /api/revalidate (blog + slug + sitemap)
8. Live verify: web_fetch checks H1 + TL;DR present
9. Log: usage.jsonl + quality score
Steps 1–9 happen without human involvement. Human review comes after: we spot-check 2–3 posts per batch for tone, accuracy, and competitive positioning.
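Steps 7–8 compress to a cache bust plus a content check. A minimal sketch of what that pair might look like; the endpoint path comes from the list above, but the request payload shape and the string heuristics are illustrative assumptions, and `fetch` is injectable so the logic can be tested without a network:

```python
import json
import urllib.request

def revalidate_and_verify(base_url: str, slug: str, fetch=None) -> bool:
    """POST /api/revalidate for the new slug, then confirm the live page
    renders an H1 and a TL;DR block (heuristics are simplifications)."""
    fetch = fetch or (lambda url, data=None: urllib.request.urlopen(
        urllib.request.Request(url, data=data)).read().decode())
    # Payload shape is hypothetical: bust the blog index, the post, the sitemap.
    fetch(f"{base_url}/api/revalidate",
          data=json.dumps({"paths": ["/blog", f"/blog/{slug}", "/sitemap.xml"]}).encode())
    html = fetch(f"{base_url}/blog/{slug}")
    return "<h1" in html and "TL;DR" in html
```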
Quality Controls We Actually Use
Minimum bar (server-side enforced):
- 900 words minimum
- 3 internal links
- 2 external links (HTTP 200 verified)
- MDX pre-compilation (broken content stays draft)
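The hard gate reduces to a handful of countable checks. A sketch using the thresholds above; the link-counting regex is a simplification, and `link_ok` stands in for the real HTTP 200 verification (MDX pre-compilation is omitted):

```python
import re

MIN_WORDS, MIN_INTERNAL, MIN_EXTERNAL = 900, 3, 2

def qa_gate(markdown: str, link_ok=lambda url: True) -> list[str]:
    """Return a list of failures; an empty list means the post may publish."""
    failures = []
    words = len(markdown.split())
    if words < MIN_WORDS:
        failures.append(f"word count {words} < {MIN_WORDS}")
    # Pull markdown link targets; split into internal (/path) and external (http...).
    links = re.findall(r"\]\((\S+?)\)", markdown)
    internal = [u for u in links if u.startswith("/")]
    external = [u for u in links if u.startswith("http")]
    if len(internal) < MIN_INTERNAL:
        failures.append(f"internal links {len(internal)} < {MIN_INTERNAL}")
    live = [u for u in external if link_ok(u)]
    if len(live) < MIN_EXTERNAL:
        failures.append(f"verified external links {len(live)} < {MIN_EXTERNAL}")
    return failures
```

A post that fails any check stays in draft rather than publishing broken.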
Soft quality checklist (agent self-review before publish):
- TL;DR with 4–6 bullets
- At least 1 table
- At least 1 concrete example with numbers
- H1 ≠ page title (question or outcome-first)
- Meta description 140–155 chars
Human review triggers:
- Quality score below 3.5 (auto-flagged)
- Post touches pricing or competitor comparisons (sensitive)
- External link returns non-200 (quarantined)
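The three triggers are cheap to evaluate mechanically before routing to a human. A sketch with the thresholds from the list above; the post dict's field names are illustrative:

```python
def review_triggers(post: dict) -> list[str]:
    """Reasons a post gets routed to human review (empty list = no flags)."""
    reasons = []
    if post.get("quality_score", 5.0) < 3.5:
        reasons.append("quality score below 3.5")
    # Sensitive topics always get a human pass.
    if any(t in post.get("topics", []) for t in ("pricing", "competitor")):
        reasons.append("touches pricing or competitor comparison")
    if any(code != 200 for code in post.get("external_link_status", [])):
        reasons.append("external link returned non-200")
    return reasons
```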
What the Daily Rhythm Looks Like
07:30 VN — Morning brief delivered to Telegram: experiment results, GA4 top pages, cost vs cap, overnight publishes.
Morning (manual) — Tuan reviews briefs, calls out priority topics or corrections.
During the day — Growth agent runs content batches (max 5 posts/run). Ops agent monitors infra. Main orchestrates.
18:00 — Daily review cron: reads all agent logs, compiles consolidated report, saves to reviews/daily-YYYY-MM-DD.md.
Monday 18:00 — Weekly synthesis: model performance, cost trends, quality scores. Adjusts model routing if needed.
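The 18:00 review is mostly log folding. A sketch of how the usage.jsonl / scores.jsonl records might roll up into the consolidated report; the record shapes (`cost_usd`, `score`) are assumptions:

```python
import json

def daily_summary(usage_lines, scores_lines) -> dict:
    """Fold JSONL log lines into the headline numbers for the daily review."""
    cost = sum(json.loads(line)["cost_usd"] for line in usage_lines)
    scores = [json.loads(line)["score"] for line in scores_lines]
    return {
        "total_cost_usd": round(cost, 2),
        "avg_quality": round(sum(scores) / len(scores), 2) if scores else None,
        "posts_scored": len(scores),
    }
```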
Cost Control (The Hard Part)
We hit $68/day in early March from a runaway context window issue: the Growth agent was sending ~40k tokens per request, which gets expensive fast at GPT-4 pricing.
Fixes that worked:
- Compaction threshold lowered (sessions compact earlier)
- Context trim: tool results pruned after 15 min
- Model swap: gpt-5 → DeepSeek V3.2 for content (same quality, 80% cheaper)
- Hard stop at $20/day with 80% warning alert
Current daily spend: tracking toward $10–15/day.
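The hard stop itself is a few lines. A sketch of the cap check, using the $20 cap and 80% warning threshold described above:

```python
DAILY_CAP_USD = 20.0
WARN_FRACTION = 0.8  # alert the operator at 80% of the cap

def budget_check(spend_today: float) -> str:
    """'ok' | 'warn' (send alert) | 'stop' (hard stop: no further API calls)."""
    if spend_today >= DAILY_CAP_USD:
        return "stop"
    if spend_today >= WARN_FRACTION * DAILY_CAP_USD:
        return "warn"
    return "ok"
```

The point is that the stop is enforced in code before each agent run, not by a human watching a dashboard.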
What Still Breaks
Infra errors confuse agents. Before we added the "don't debug infra" rule, the Growth agent would spend 30 minutes trying to fix a CDN cache issue that needed one line from a developer. Now: report the error + URL, move on.
Ambiguous tasks produce mediocre output. "Write about AI agents" returns generic content. "Write a 1,800-word guide on how AI agents sort emails, targeting 'email management software' (2.9k vol, KD 34), with a mini-case from an e-commerce store" returns something publishable.
Quality score gaming. An agent that scores its own output will score itself highly. We cross-check: if the quality score is 4.5 but the post has no table and a weak TL;DR, the score gets manually corrected and the prompt updated.
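The cross-check can be partly automated: cap the self-assigned score whenever objective signals disagree with it. A sketch of that correction (the cap value and signal set are illustrative, not our exact rules):

```python
def cross_check(self_score: float, has_table: bool, tldr_bullets: int) -> float:
    """Cap an agent's self-assigned score when objective checks disagree."""
    score = self_score
    if not has_table:
        score = min(score, 3.5)      # missing table: can't be 4.5
    if tldr_bullets < 4:
        score = min(score, 3.5)      # weak TL;DR: same cap
    return score
```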
External link rot. Links that were valid at publish time break later. We don't have automated re-verification yet — it's on the dev roadmap.
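A re-verification job of the kind the roadmap describes might look like this; `fetch_status(url)` would do a real HEAD request in production, and is injected here so the logic runs without a network (post field names are illustrative):

```python
def recheck_links(posts, fetch_status) -> dict:
    """Return {slug: [broken_urls]} for posts whose external links no longer
    return HTTP 200, so they can be quarantined for repair."""
    broken = {}
    for post in posts:
        dead = [u for u in post["external_links"] if fetch_status(u) != 200]
        if dead:
            broken[post["slug"]] = dead
    return broken
```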
Mini-Case: Week 1 Content Batch
Target: 5 posts, Monday publish.
- Keywords: best-ai-agents-2026 (KD 15), agentic-ai-news-2026 (74k vol), ai-executive-assistant-guide, ai-automation-agency-guide, ai-email-management-software-2026
- Time from brief to all 5 published: ~4 hours (agent run time + human review)
- Human time spent: ~45 min (reviewing drafts, approving publishes)
- Cost: $3.20 for the batch
All 5 passed server-side QA on first attempt. Two needed minor fixes (wrong internal link slug). None needed full rewrites.
What We'd Do Differently
- Build the QA layer first. We spent a week fixing MDX errors that a preflight checker would have caught in seconds.
- Start with 3 posts/batch, not 10. More posts = more errors = more context = more cost. Smaller batches are more reliable.
- Version control content from day 1. We lost some early drafts. Content versioning is now in the DB — every update snapshots the previous version.
- Instrument before you scale. You can't cut costs if you don't know where they're going.
The Honest Assessment
Is this production-grade? Not yet. The failure rate on automated publishes is around 5–10%, though human spot-checks catch most issues before they matter.
But for a pre-revenue startup competing for organic traffic, it works. We're publishing 10–15 posts/week with a combined 3–4 hours of human involvement. The quality floor is enforced by tooling, not discipline.
The goal isn't to remove humans from the loop. It's to put humans where they add the most value: strategy, tone calibration, competitive positioning — not copy-pasting content into a CMS.
Related reading
- What is an agentic AI architecture? A practical guide
- Best business process automation tools in 2026
- From SOP to autopilot: using AI agents for business workflows
We build BiClaw on OpenClaw. If you're curious about the underlying platform, see OpenClaw's documentation.