OpenClaw Training (IV): Production-Grade Engineering Upgrades
This post documents an architecture upgrade — what changed, why, and what trade-offs were made. If you’ve been following the OpenClaw Training series, think of this as the engineering implementation sequel to the previous three posts.
What Changed
The first three posts covered configuring a memory-enabled assistant, building a Workflow system, and project management integration. By then, the system was “usable.” But “usable” and “production-ready” are two different things.
This round of changes touches seven areas:
- AGENTS.md — Memory architecture formalized; knowledge base CRUD validation rules added
- HEARTBEAT.md — Project monitoring module rewritten; Linear Auto labels + flowchain scripts integrated
- WORKFLOW.md — Two new workflows added (email monitoring, project onboarding)
- TOOLS.md — Network request constraints documented; Obsidian vault path fixed
- MEMORY.md — Hierarchical table reorganized; diary write rules made explicit
- flowchain/ — Python script layer abstracted as an executor; modular *.mjs Node.js scripts also supported
- status/ — Runtime state directory separated out, fully decoupled from the memory layer
The through-line is consistent: shifting from “OpenClaw improvises” to “OpenClaw executes to spec”.
Memory Architecture
The Old Way
The early configuration had a simple memory model: MEMORY.md held everything. This worked at small scale, but as the file grew, the signal-to-noise ratio dropped — a pitfall recorded three months ago competed for the same retrieval priority as the work in front of you today, each interfering with the other.
The fundamental problem with flat memory: all information lives at the same retrieval level. There’s no distinction between “this is current state” and “this is long-term knowledge.” Every time context was loaded, OpenClaw had to stuff the entire file in — most of which was irrelevant to the current task — wasting tokens and diluting the weight of actually useful information.
Three-Layer Architecture
The new architecture layers information by timeliness and reusability:
Conversation content
│ written in real-time
▼
Diary layer (memory/YYYY-MM-DD.md) ← timestamped event stream, daily details
│ distilled every night at 23:45
▼
Knowledge base (memory/knowledge/) ← reusable lessons / decisions / people profiles
│ weekly GC archiving
▼
Cold archive (memory/.archive/) ← expired data, not actively loaded
Layer 1 · Diary: Append-only, timestamped, raw event stream. No processing — write directly. Files named by date: memory/2026-03-23.md.
Layer 2 · Knowledge Base: Three subdirectories — lessons/ (lessons learned), decisions/ (key decisions), people/ (people profiles). Each file has structured frontmatter. Content entering this layer is distilled: “Will I still need to know this in six months?”
Layer 3 · Cold Archive: Weekly GC moves expired diary entries to .archive/. Keeps the retrieval space clean, preventing stale context from polluting current sessions.
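The weekly GC step can be sketched roughly as follows — a minimal version assuming diary files are named memory/YYYY-MM-DD.md as described above, and a 30-day retention window (the retention period itself isn't specified in the config, so treat it as an illustrative parameter):

```python
from datetime import date, timedelta
from pathlib import Path
import re
import shutil

# Diary files are named by date: memory/2026-03-23.md
DIARY_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}\.md$")

def gc_diaries(memory_dir: str, keep_days: int = 30) -> list[str]:
    """Move diary files older than keep_days into memory/.archive/."""
    memory = Path(memory_dir)
    archive = memory / ".archive"
    archive.mkdir(exist_ok=True)
    cutoff = date.today() - timedelta(days=keep_days)
    moved = []
    for f in sorted(memory.glob("*.md")):
        if not DIARY_PATTERN.match(f.name):
            continue  # skip MEMORY.md and other non-diary files
        if date.fromisoformat(f.stem) < cutoff:
            shutil.move(str(f), str(archive / f.name))
            moved.append(f.name)
    return moved
```

The key property is that archived files are moved, not deleted — they stay queryable in .archive/ but no longer compete for retrieval priority.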
The Cost of Layering
The advantage of flat structure is precisely its simplicity: one file, write freely, no need to decide “which layer does this information belong to.”
With layering, every write adds a decision: is this a diary entry or a knowledge base item? Should it go in lessons/ or decisions/? Is the timing right for distillation? A wrong judgment means information either lands in the wrong layer or never gets distilled at all — making it actually harder to find than with flat structure.
There’s also maintenance cost: nightly distillation, weekly GC, CRUD validation rules — these mechanisms need to keep running. If any link in the chain fails, the layered structure starts to rot, and it rots more invisibly than flat structure.
So layering is only appropriate when: you’re willing to pay the ongoing maintenance cost, and your information volume has grown to the point where the noise problem in flat structure is genuinely affecting your work. If your memory file is still small, a flat MEMORY.md is enough.
What to Actually Write
The diary layer is a raw record of “what happened”:
### 14:29 — Kicked off "one-click background removal" feature
- GitHub issue created: #1
- Linear issue created: bidirectionally linked with GitHub
- Issue assigned to GitHub Copilot Agent
- TODO: monitor Copilot Agent completion progress
Knowledge base entries have structured frontmatter for retrieval and lifecycle management:
---
title: "AI Agent Task Delegation Pattern"
date: 2026-03-22
category: lessons
priority: 🟡
status: active
last_verified: 2026-03-22
tags: [ai-agent, delegation, copilot]
---
Priority markers: 🔴 core knowledge, never archived; 🟡 generally important; ⚪ low priority. Entries not verified in over 30 days get a ⚠️ stale marker — a reminder that this knowledge may be outdated and needs re-verification.
CRUD Validation Rules
The hardest part of memory management isn’t “writing” — it’s preventing “writing badly.” Over time, the knowledge base starts accumulating contradictory entries, duplicate entries, and outdated content with no annotation.
This is addressed by mandatory pre-write validation:
New knowledge arrives
│
├─ Step 1: Read target file's existing content (create if not exists)
├─ Step 2: Compare with new knowledge
│ ├─ Existing content fully covers it → NOOP (don't write)
│ ├─ New knowledge updates old content → mark old version ~~Superseded~~
│ ├─ New knowledge contradicts old content → keep both, add ⚠️ CONFLICT marker
│ └─ Entirely new knowledge → append new section
└─ Step 3: Update last_verified date in frontmatter
Why you must read before writing: Without this validation, the knowledge base silently rots over time. It doesn’t break immediately — it gradually degrades into another noise file, just one that degrades more slowly than MEMORY.md.
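The decision tree above can be sketched as a single dispatch function. This is illustrative only — the real comparisons ("fully covers", "contradicts") are judgments OpenClaw makes by reading the file, represented here as pre-computed flags:

```python
from enum import Enum

class WriteAction(Enum):
    NOOP = "don't write"
    SUPERSEDE = "mark old version ~~Superseded~~, write new"
    CONFLICT = "keep both, add ⚠️ CONFLICT marker"
    APPEND = "append new section"

def decide_write(covered: bool, updates: bool, contradicts: bool) -> WriteAction:
    """Map the Step-2 comparison outcome to a write action."""
    if covered:          # existing content fully covers the new knowledge
        return WriteAction.NOOP
    if updates:          # new knowledge supersedes old content
        return WriteAction.SUPERSEDE
    if contradicts:      # keep both versions, flag for resolution
        return WriteAction.CONFLICT
    return WriteAction.APPEND
```

Note the ordering matters: NOOP is checked first, so duplicate knowledge never reaches the append branch. Whatever branch is taken, Step 3 (updating last_verified) still runs.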
Heartbeat
Old Way: Freeform Polling
The old heartbeat was conversational: OpenClaw checked GitHub, checked email, then “decided” what to push. The logic lived in context, not in files, so behavior drifted between sessions — start a new one, and the heartbeat might run differently.
New Way: flowchain as the Execution Backbone
The new heartbeat is a step-by-step spec with clear responsibilities at each step:
Step A → Linear Auto task check (timeout detection, validation retry count, concurrency gate)
Step B → flowchain Python scripts (PR scan, CI status, Linear status sync)
Step C → NOW.md overwrite (current state snapshot)
Step A is pure OpenClaw-driven logic: use Linear CLI to query all issues tagged Auto and in In Progress state, check if they’ve timed out (no associated PR in over 3 days), whether validation fallback needs triggering, and whether items can be automatically moved from Backlog to Todo. This logic is lightweight; the rules are written in HEARTBEAT.md and OpenClaw executes by reading the file directly.
Step B handles the heavy lifting requiring cross-platform data aggregation — all delegated to Python scripts:
python3 flowchain/projects.py heartbeat
What the script does internally: load all monitored repos from PROJECTS.json → batch-call GitHub API to fetch open and merged PRs for each repo → extract Linear issue IDs from branch names of merged PRs → call Linear API to advance corresponding issues to Done → deduplicate against seen_prs in heartbeat-state.json → output structured JSON.
The JSON format returned by the script is fixed:
{
"ci_failures": [{"repo": "user/MyApp", "pr_number": 5, "pr_title": "fix: crash on launch"}],
"pr_open_stale": [{"repo": "user/BackClaw", "pr_number": 12, "hours_open": 25}],
"pr_merged_to_done": [{"repo": "user/BackClaw", "pr_number": 8, "identifier": "PROJ-45"}],
"stale_in_progress": [{"identifier": "PROJ-30", "days_in_progress": 4}],
"validation_retries": [{"identifier": "PROJ-20", "retries": 3}]
}
OpenClaw reads this JSON and decides what to push based on which fields are present — ci_failures triggers immediate CI alerts, pr_merged_to_done triggers completion notifications, all empty means silence. The entire monitoring layer has zero improvisation: structured data in, formatted notifications out.
Step C is the mandatory wrap-up after every heartbeat: overwrite NOW.md with the current focus, recent events, and pending items. This file is OpenClaw’s “current state snapshot” — used at session startup to quickly restore context without re-reading large amounts of historical diary entries.
Why use scripts instead of prompts for Step B?
Prompts are stateless approximations. Scripts are deterministic. When you need to “check whether this PR has been open for more than 24 hours without merging,” you need arithmetic, not inference. More importantly, PR deduplication depends on persisting seen_prs state — something prompts fundamentally cannot do, because context gets compressed and disappears when the session ends.
Validation Gate
The most intentionally designed piece of this update: a mandatory validation checkpoint for all AI-executed tasks.
The problem it solves: AI coding agents (Codex, Claude Code, GitHub Copilot Agent) mark tasks “complete” when they finish. But the AI’s definition of “complete” and the product’s definition of “complete” are two different things. Previously there was no mechanism to intercept in between.
AI Agent reports completion
│
└─ Phase 4.5 Validation Gate
├─ Step 1: Read acceptance criteria from Linear issue
├─ Step 2: Run code quality checks (syntax + unit tests)
├─ Step 3: Verify each acceptance criterion
├─ Step 4: Write validation report to Linear comment
└─ Step 5: Pass → mark Done / Fail → auto-fix loop (max 2 attempts)
└─ 3rd failure → escalate for human intervention
Design pattern: you define the acceptance criteria, AI verifies before closing. This gate prevents the failure mode of “issue gets marked Done simply because nothing checked it.”
Retry counts are persisted in status/heartbeat-state.json:
{
"validation_retries": {
"PROJ-74": 1,
"PROJ-83": 0
}
}
After 3 failures, escalate. Counts don’t auto-reset — only cleared after you confirm you’ve handled it. This ensures repeatedly failing tasks get human attention rather than being silently skipped.
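The retry bookkeeping reduces to: increment on failure, escalate at 3, never auto-reset. A sketch against the JSON shape shown above (the function name is illustrative):

```python
import json
from pathlib import Path

MAX_RETRIES = 3

def record_failure(state_path: str, issue_id: str) -> str:
    """Increment an issue's validation retry count; return the next action."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    retries = state.setdefault("validation_retries", {})
    retries[issue_id] = retries.get(issue_id, 0) + 1
    path.write_text(json.dumps(state, indent=2))
    if retries[issue_id] >= MAX_RETRIES:
        return "escalate"  # human intervention; count stays until cleared manually
    return "retry"
```

Because the count lives in status/heartbeat-state.json rather than in context, a session restart between attempts doesn't reset the counter — which is the whole point.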
Acceptance criteria are written in a fixed block in the Linear issue description:
## Acceptance Criteria
- [ ] User-visible error message on network timeout
- [ ] Error message includes retry suggestion
- [ ] No regression in existing login flow
If an issue is created without acceptance criteria, OpenClaw infers and drafts them from the description, then pushes them to you for confirmation. Issues without acceptance criteria are not allowed to enter the automated development pipeline — this is a hard rule, not a suggestion.
Two New Workflows
Email Monitoring (Workflow 04)
The problem email monitoring solves: how to get signal from a noisy inbox without reading every message.
The solution: a rules file (status/MAILLIST.json) that encodes your personal judgment of “what matters, what doesn’t”:
{
"last_summary_date": "2026-03-23",
"last_urgent_ids": ["msg_abc123", "msg_def456"],
"last_run": "2026-03-23T18:25:00+08:00",
"config": {
"immediate": {
"sender_whitelist": ["*@github.com", "[email protected]"],
"subject_keywords": ["production outage", "urgent", "ASAP"]
},
"summary": {
"labels": ["important"],
"senders": ["linear.app"]
},
"ignore": {
"labels": ["promotions", "social"],
"sender_blacklist": ["[email protected]"]
}
}
}
Four tiers (highest to lowest priority):
- 🤖 Copilot PR notifications — Highest priority, triggers validation flow (see below)
- 🔴 Immediate push — Whitelisted senders or urgent keywords, pushed immediately, ignoring quiet hours
- 📋 Daily digest — At most once per day, summarizing “worth reading but not urgent” emails
- 🔇 Ignore — Promotions, social notifications, never surfaced
The file stores both rules (config) and runtime state (last_urgent_ids, last_summary_date, last_run). This isn’t arbitrary design — the benefit of keeping both together is that OpenClaw reads one file at startup to know “what are the rules” and “where did I leave off last time,” then writes back to the same file when done. Single source of truth, cross-session persistence, no inconsistency between rule file updates and state file updates.
Copilot Agent → Email → Validation Closed Loop
This is where email monitoring and the project management workflow intersect — and it’s the segment of the entire automation system that best demonstrates “end-to-end closure.”
When GitHub Copilot Agent finishes development, it opens a PR, and GitHub immediately sends an email notification. When email monitoring captures this email, instead of simply pushing it to you, it automatically triggers the subsequent validation flow:
Copilot Agent opens PR
│
├─ GitHub sends email notification (sender *@github.com, Subject contains "github-copilot[bot]")
│
▼
Email monitoring captures (heartbeat or real-time)
│
├─ Extract {repo} and PR number from Subject
├─ gh pr view → extract Linear issue ID from branch name
├─ Linear issue → In Review
│
▼
Trigger Phase 4.5 Validation Gate
│
├─ Read acceptance criteria → run checks → write validation report to Linear comment
│
├─ Pass → Linear Done, push completion notification
└─ Fail → auto-fix loop (max 2 attempts) → still failing → escalate
The entire chain requires no intervention from you: Copilot Agent opens PR, OpenClaw validates, passes means close, fails means escalate. You only need to write the acceptance criteria in the Linear issue at the beginning, and make the final call when validation repeatedly fails.
The one dependency on branch naming conventions: branch names must include the Linear issue ID (e.g., fix/PROJ-74-remove-bg). OpenClaw uses this to link PRs to issues. Wrong format breaks the automated chain, falling back to a plain push notification.
Execution flow (complete):
Step 1: Read MAILLIST.json, get rules and state
Step 2: Fetch unread emails (excluding promotions/social categories)
Step 3: Match each email by priority
├─ Copilot PR notification → trigger validation chain (Step 3.1)
├─ Immediate push → push directly to Telegram
├─ Summary → add to daily digest queue
└─ Ignore → skip
Step 4: Push regular urgent emails (msg_id dedup) + daily digest (last_summary_date dedup)
Step 5: Update last_run, last_urgent_ids, write back to JSON
Project Onboarding (Workflow 06)
Every new project needs: a GitHub repo, a Linear project, a local clone, an entry in the project registry, and optionally heartbeat monitoring. Previously these steps were manual — you had to remember which steps to do each time, and it was easy to miss some.
Now it’s a six-step playbook:
Step 1: Confirm GitHub repo exists (create if not)
Step 2: Confirm Linear project exists (create if not)
Step 3: Clone locally
Step 4: Register in PROJECTS.json
Step 5: Add heartbeat monitoring flag (if needed)
Step 6: Push completion summary
Why it’s worth a dedicated Workflow: Project onboarding is the most typical “each step is obvious individually, but easy to miss in combination” task. Missing Step 5 (heartbeat monitoring flag) means this project’s PRs and CI might run undetected for weeks — you think you’re monitoring it, but you’re not.
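Steps 4 and 5 — the easiest to miss — reduce to one small idempotent update of PROJECTS.json. A sketch only: the registry schema is an assumption, inferred from the heartbeat description (each entry carries a repo, a local path, and a heartbeat flag):

```python
import json
from pathlib import Path

def register_project(registry_path: str, name: str, repo: str,
                     local_path: str, heartbeat: bool = True) -> None:
    """Steps 4 + 5: add a project to PROJECTS.json; safe to re-run."""
    path = Path(registry_path)
    projects = json.loads(path.read_text()) if path.exists() else {}
    entry = projects.get(name, {})  # re-running updates in place, never duplicates
    entry.update({"repo": repo, "path": local_path, "heartbeat": heartbeat})
    projects[name] = entry
    path.write_text(json.dumps(projects, indent=2))
```

Making heartbeat default to True is a deliberate bias: the failure mode is "monitoring you thought was on but wasn't", so opting out should be the explicit act.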
flowchain Architecture
The biggest structural change in this update: abstracting all Linear and GitHub operations into a Python script layer (flowchain/projects.py).
Before: OpenClaw called Skills directly in conversation (linear-cli, github), assembled command strings from context, and parsed each tool’s different output format.
After: OpenClaw calls projects.py <command> and reads uniformly formatted stdout.
python3 flowchain/projects.py issue create "feat: dark mode" --project MyApp
# stdout: [ok] PROJ-82 — feat: dark mode
python3 flowchain/projects.py issue move PROJ-82 "In Progress"
# stdout: [ok] PROJ-82 → In Progress
python3 flowchain/projects.py sprint MyApp
# stdout: JSON (snapshot of all current project states)
Unified output protocol: [ok] / [warn] / [error] prefixes. OpenClaw only needs to check the first token — no need to parse the raw output formats of various CLI tools.
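Checking the first token is genuinely all the caller needs — a sketch of both sides of the protocol (function names are illustrative, not from projects.py):

```python
def emit(status: str, message: str) -> str:
    """Script side: every output line leads with its status token."""
    assert status in ("ok", "warn", "error")
    return f"[{status}] {message}"

def parse(line: str) -> tuple[str, str]:
    """Caller side: split off the first token — no per-tool parsing needed."""
    token, _, rest = line.partition(" ")
    return token.strip("[]"), rest

print(parse(emit("ok", "PROJ-82 — feat: dark mode")))
# ('ok', 'PROJ-82 — feat: dark mode')
```

The constraint lives on the script side: as long as every command emits through one function, the caller's parser can stay three lines forever.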
Core Problems flowchain Solves
1. Skills Drift
This problem is worth unpacking.
OpenClaw has dozens of Skills, each with its own SKILL.md — a usage guide for the AI. Every time OpenClaw needs to call a Skill, it first reads the SKILL.md, understands the parameter format, then assembles the command. This process consumes context tokens and causes drift:
- linear-cli’s issue list parameters changed subtly in some version; SKILL.md wasn’t updated in time
- OpenClaw’s interpretation of the same parameter might differ slightly between sessions
- Multiple Skills have different output formats; OpenClaw needs to “remember” in conversation how to parse each one
As drift accumulates, you’ll find OpenClaw correctly updating Linear status in session A but producing wrong parameters in session B.
flowchain adds an isolation layer: OpenClaw only knows projects.py’s interface (a few dozen lines of documentation, stable), and doesn’t directly read individual Skill documentation files. Which Skill to call underneath, how to combine parameters, how to handle version differences — all in the script, handled by Python. Stable interface, stable behavior.
This also has a direct context-saving effect: previously every operation required loading linear-cli’s SKILL.md (hundreds of lines); now OpenClaw’s working context only needs projects.py’s interface documentation (a few dozen lines). For a heartbeat running every hour, this difference is significant.
2. State Moved Outside Prompts
Previously “has this PR been seen in the last scan” could only be “remembered” by OpenClaw in context. But context gets compressed and disappears when the session ends.
Now this state lives in status/heartbeat-state.json:
{
"seen_prs": {
"user/BackClaw": [8, 9, 11],
"user/MyApp": [1, 2]
},
"last_heartbeat": "2026-03-23T18:25:00+08:00"
}
Persistent, queryable, manually editable. Even if the OpenClaw session restarts, the next heartbeat knows which PRs have already been notified.
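The dedup cycle is read → filter → merge → write back, all against the JSON shown above (a minimal sketch; merging only appends new PR numbers, so re-runs are idempotent):

```python
import json
from pathlib import Path

def filter_and_record(state_path: str, repo: str, open_prs: list[int]) -> list[int]:
    """Return PRs not yet notified for this repo, and persist them as seen."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {"seen_prs": {}}
    seen = set(state["seen_prs"].get(repo, []))
    new = [pr for pr in open_prs if pr not in seen]
    state["seen_prs"][repo] = sorted(seen | set(new))
    path.write_text(json.dumps(state, indent=2))
    return new
```

Run it twice with the same input and the second call returns an empty list — exactly the property that keeps the hourly heartbeat from re-pushing every PR.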
3. Unified Error Handling
Every script command has a clear failure path: stderr with [error] prefix, OpenClaw detects uniformly and pushes to you. No more “script silently fails, OpenClaw continues pretending to succeed.” Failure paths are as predictable as success paths.
4. Independently Testable
Scripts don’t depend on OpenClaw’s context and can run independently. Previously, verifying monitoring logic required waiting for heartbeat to trigger; now just:
python3 flowchain/projects.py heartbeat
Watch the output in terminal, debug, confirm behavior, then let OpenClaw run it. This is a significant improvement for system stability — you can iterate script logic without affecting OpenClaw’s normal operation.
The status/ Directory
A small but structurally important addition: consolidating all runtime state into a status/ directory.
status/
├── PROJECTS.json # project registry (authoritative source)
├── heartbeat-state.json # last run time, seen PRs, validation retry counts
├── MAILLIST.json # email monitoring rules and state
└── candidates.json # recruiting pipeline state
Complete Data Flow
How status/ files are read and updated after a heartbeat triggers:
Heartbeat triggers (every 60 minutes)
│
├─ Read PROJECTS.json
│ └─ Get all projects with heartbeat: true and their repo paths
│
├─ Read heartbeat-state.json
│ └─ Get seen_prs (for dedup), validation_retries (validation retry counts)
│
├─ Execute flowchain/projects.py heartbeat
│ ├─ Batch-fetch open and merged PRs for each repo
│ ├─ Compare against seen_prs, filter already-notified PRs
│ ├─ merged PR → extract issue ID → call Linear API to mark Done
│ ├─ CI failures → record to ci_failures
│ └─ Output structured JSON
│
├─ OpenClaw reads JSON, decides what to push based on fields
│
├─ Write back heartbeat-state.json
│ └─ Update seen_prs (append new PR IDs), last_heartbeat, validation_retries
│
├─ Read/write MAILLIST.json
│ ├─ Read config rules + last_urgent_ids + last_summary_date
│ ├─ Fetch emails, match rules, decide what to push
│ └─ Write back updated last_urgent_ids, last_summary_date, last_run
│
└─ Overwrite NOW.md (current state snapshot for quick context restore next session)
Why Keep It Separate from memory/
memory/ stores what OpenClaw “knows”: lessons learned, decisions made, understanding of your preferences. This is cognitive content, with “what I’ve learned” as the subject.
status/ stores what the system “has done”: last heartbeat time, already-pushed email IDs, currently monitored project list. This is operational content, with “what the system executed” as the subject.
Mixing the two causes dual problems:
- Retrieval noise: searching for “background on a project” gets seen_prs operational logs in the results
- Unreliable operations: if deduplication logic depends on content in memory/, and memory files can be distilled, compressed, or rewritten, deduplication develops gaps
Design rule: “what OpenClaw knows” → memory/; “what the system did” → status/.
Another important characteristic of status/ files: they can be safely overwritten. Diary files under memory/ are append-only — overwriting means data loss. status/ files are the opposite: writing the entire file back after each operation is the correct behavior, because they store current state snapshots, not historical records. These two sets of files have completely opposite write semantics; mixing them together is an easy source of errors.
Observations
“Write it down before executing” pattern: Much of this update came from discovering that OpenClaw had implicit knowledge — how to monitor projects, how to triage email — that had never been written down. Making implicit knowledge explicit, encoding it into Workflow files, CRUD rules, and acceptance criteria — this is the main work of “good AI system design.” Not tuning prompts; externalizing knowledge into files.
Idempotency is severely underrated: Good system design means executing the same operation twice is safe. last_urgent_ids deduplication, last_summary_date daily protection, validation_retries counting — these are all mechanisms that make operations safely repeatable. Heartbeat runs every hour; without deduplication logic, every run would re-push all content.
Human-AI collaboration design principle: Almost every automated operation has a clear human checkpoint. AI agents execute code, but OpenClaw validates before marking Done. OpenClaw proposes Sprint changes, but you confirm before execution. Workflows update last_run, but SOP content changes require your confirmation before OpenClaw writes back. Automation is real, but escalation paths are explicitly designed — not implicit fallbacks.