My Intern — Research Workflows in Hermes and OpenClaw

Send an intern to research an unfamiliar topic, or ask AI to investigate a technical area, and you think you’ve saved time — you’ve only deferred the judgment cost. When the research comes back, you still face the same question: can I use this conclusion? AI won’t tell you it went off track, because it doesn’t know. Output speed goes up 10×, but hallucination and loss of control don’t go away — and the cost of trust goes up.

Before LLMs, non-academic research and learning had a rough method: search engines and Wikipedia wandering. Start from one entry, follow links, click through, and gradually you get a shape of a field. That process has a natural correction built in: when you hit a concept you don’t understand, you go back and fill the gap. It’s not linear reading — it’s exploratory back-and-forth, more like a tree where you fill in the knowledge surface and skill map yourself.

In the Agent era, that approach is too slow. AI writes all sub-files linearly and never says “concept X in file 3 wasn’t clear — I need to go back.” The deliverable looks complete, but it’s actually a pipeline with no feedback — or a fantasy novel full of hallucinations.

This post is about putting that back-and-forth back: three metrics replace the “I don’t get it yet” intuition, a state machine replaces random “go back and fill” wandering, and a dual control loop replaces human babysitting.

One note: the FSM (finite-state machine) and researcher.py described here are a custom system built on OpenClaw / Hermes Agent, not native AI assistant features. If you use another framework, the ideas transfer; core code implementations are included below.

1. Three Failure Modes in Research: Measure First, Then Control

When you wander Wikipedia, you’re roughly doing three things at once:

“This entry went off-topic” — following links, you drift farther from what you originally wanted
“A few concepts here aren’t covered yet” — a vague sense of gaps still unfilled
“Is this source trustworthy?” — instinctively separating academic sources from random blogs

When AI writes research, all three are missing — it doesn’t know it drifted, doesn’t know what’s uncovered, and doesn’t distinguish source quality.

To restore that mechanism, step one is turning them into numbers:

Intuition	Metric	Meaning
“Off-topic”	divergence (semantic drift)	Semantic distance between actual content and the set direction
“Not fully covered”	coverage (concept coverage)	How many core concepts are substantively discussed in sub-files
“Source unreliable”	credibility (source trust)	Weighted mean credibility of all cited URLs

These three numbers underpin every automatic decision — once you understand them, FSM, ACTUATOR, and ESCALATION logic follow naturally.

Engineers don’t discuss “code quality” without lint, tests, and coverage. Research is the same: without measurement, you only have subjective feel — no engineering metrics.

1.1 Definitions and Thresholds for the Three Metrics

divergence: Embed direction text and each sub-file with an embedding model, take average cosine distance. > 0.4 means drift; > 0.6 triggers ESCALATION for human intervention.
coverage: Extract 5–12 core concepts from direction text; a concept counts as “substantively covered” when at least one sub-file has embedding similarity > 0.65 with it. Covered count / total = coverage. ≥ 0.8 means full coverage.
credibility: Scan all cited URLs, look up domain in a credibility dictionary — top journals (NEJM, Lancet) 4.0, authorities (FDA, NIH, WHO) 3.5, industry media 2.5, unlisted domains default 2.0. < 2.5 means source quality is below bar.

These thresholds are empirical, not fixed standards:

divergence 0.4 / 0.6: 0.4 is the perceptual inflection where content starts to noticeably drift; derived from many rounds of zero-baseline research. For stricter control, lower to 0.3; for higher tolerance, 0.5. 0.6 is the line where “the direction itself may be wrong” — harder to calibrate numerically; mainly human judgment.
coverage 0.8: Maps to “80% of core concepts substantively discussed.” Shallow research (overview only) can drop to 0.7; deep research requiring full coverage can raise it to 0.9.
credibility 2.5: Maps to “industry media and above.” Pure academic research can raise it to 3.0. Tool/engineering research that cites mostly official docs and GitHub will often score below 2.5, because unlisted domains default to 2.0; recalibrate the dictionary for that domain.
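
One way to keep these tuning knobs in one place is a small preset table. This is a minimal sketch under stated assumptions: the `Thresholds` class, the preset names, and `drift_level()` are hypothetical and not part of researcher.py; only the numbers come from the notes above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the empirical thresholds described above."""
    divergence_drift: float = 0.4      # above this: noticeable drift
    divergence_escalate: float = 0.6   # above this: hand over to the human
    coverage_target: float = 0.8       # share of core concepts covered
    credibility_min: float = 2.5       # "industry media and above"


DEFAULT = Thresholds()

# Illustrative presets built from the tuning notes (names are assumptions).
PRESETS = {
    "default": DEFAULT,
    "strict": replace(DEFAULT, divergence_drift=0.3),
    "tolerant": replace(DEFAULT, divergence_drift=0.5),
    "shallow": replace(DEFAULT, coverage_target=0.7),
    "deep": replace(DEFAULT, coverage_target=0.9),
    "academic": replace(DEFAULT, credibility_min=3.0),
}


def drift_level(t: Thresholds, divergence: float) -> str:
    """Classify one divergence reading as 'ok', 'drift', or 'escalate'."""
    if divergence > t.divergence_escalate:
        return "escalate"
    if divergence > t.divergence_drift:
        return "drift"
    return "ok"
```

With these presets, a divergence of 0.35 counts as drift under `strict` but stays within bounds under `default`.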

Driving by dashboard, not gut feel — divergence is the heading indicator, coverage the fuel gauge, credibility the temperature gauge. All three must pass for research to be on-track, complete, and trustworthy.

2. FSM: Split Research into Observable Phases

Research isn’t one shot. It has phases: from “just started, direction unclear” to “direction confirmed, filling content” to “maybe drifting, need correction” to “content stable, awaiting acceptance.” The FSM encodes these as explicit states with entry/exit conditions. At any moment you can answer: where is this research, and where does it go next?

stateDiagram-v2
    [*] --> INIT : User requests research
    INIT --> EXPLORING : Outline confirmed + setpoint
    EXPLORING --> SELF_CORRECTING : divergence > 0.4
    EXPLORING --> CONVERGING : Metrics stable
    SELF_CORRECTING --> CONVERGING : Correction complete
    SELF_CORRECTING --> EXPLORING : Switch strategy, continue
    CONVERGING --> DONE : User accepts
    CONVERGING --> EXPLORING : User rejects, requests more
    DONE --> [*]

    EXPLORING --> ESCALATION : divergence > 0.6
    SELF_CORRECTING --> ESCALATION : Persistent drift
    CONVERGING --> ESCALATION : Cannot converge
    ESCALATION --> EXPLORING : User decides continue
    ESCALATION --> INIT : User adjusts direction
    ESCALATION --> [*] : User terminates

2.1 Rationale and Semantic Boundaries of Each State

INIT: Research just started; outline written but user hasn’t confirmed direction. This state exists for a simple reason — spend 5 minutes confirming direction before writing ten thousand words; the cost of drift far exceeds those 5 minutes.

EXPLORING: Direction confirmed; advance sub-file by sub-file. Each sub-file: search first, then write; after write, checkpoint runs measure. Small steps, each verifiable.

SELF_CORRECTING: divergence > 0.4 or coverage < 0.6; machine adjusts automatically, no user needed. This is where small deviations are handled — problems don’t accumulate until a human must step in.

CONVERGING (convergence): All three metrics pass (coverage ≥ 0.8 + divergence ≤ 0.2 + credibility ≥ 2.5), waiting for user’s final judgment. This is not DONE — the machine can only confirm metrics; whether content is useful is another matter. Final acceptance is human.

DONE: User accepted; written to knowledge_conclusions table; decay timer starts. After 60 days the system asks whether review is needed.

ESCALATION: divergence > 0.6, or FSM repeatedly fails to converge. Severe drift means the direction itself may be wrong — the machine can’t decide that; it escalates to the human.

Why FSM instead of simple “step 1 → 2 → 3”? Real research has loops: after sub-file 5 you realize sub-file 1’s direction was wrong and you need to go back. FSM’s value is giving those loops explicit states so “going back” isn’t random walk but conditional state transition.
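
To make "conditional state transition" concrete, the diagram can be written down as a lookup table that rejects any move the diagram does not allow. This is a sketch transcribed from the state diagram above, not the author's code; `TRANSITIONS`, `transition()`, and the `TERMINATED` pseudo-state (standing in for the diagram's end marker) are assumptions.

```python
# Allowed transitions, transcribed from the state diagram.
# "TERMINATED" is a hypothetical name for the diagram's end marker.
TRANSITIONS = {
    "INIT": {"EXPLORING"},
    "EXPLORING": {"SELF_CORRECTING", "CONVERGING", "ESCALATION"},
    "SELF_CORRECTING": {"CONVERGING", "EXPLORING", "ESCALATION"},
    "CONVERGING": {"DONE", "EXPLORING", "ESCALATION"},
    "ESCALATION": {"EXPLORING", "INIT", "TERMINATED"},
    "DONE": {"TERMINATED"},
    "TERMINATED": set(),
}


def transition(current: str, target: str) -> str:
    """Return the target state if the diagram allows the move, else raise."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


def can_go_back(current: str) -> bool:
    """True if the state has an explicit edge back to EXPLORING or INIT."""
    return bool(TRANSITIONS.get(current, set()) & {"EXPLORING", "INIT"})
```

"Going back" then shows up as an explicit edge (for example CONVERGING to EXPLORING) rather than an ad hoc jump.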

Actual code: ResearchController.step() — FSM single-step control loop (research_controller.py)

# Condensed from research_controller.py
def step(self, reading: dict = None) -> dict:
    """Single-step control loop: read state → compare to setpoint → decide action → update FSM."""
    if reading:
        self.divergence = reading.get("divergence")
        self.coverage   = reading.get("coverage", 0.0)
        self.credibility = reading.get("credibility", 0.0)
        self.coverage_stable_rounds = reading.get("coverage_stable_rounds", 0)
        self.uncovered_concepts = reading.get("uncovered_concepts", [])

    sp = self.setpoint
    div_max   = sp.get("divergence_max", 0.2)
    cov_target = sp.get("coverage_target", 1.0) * 0.9
    cred_min  = sp.get("credibility_min", 2.5)
    STABLE_THRESHOLD = 3

    action_type = None

    # ① coverage < 0.6 with uncovered concepts → AUTO_FILL
    if self.coverage < 0.6 and self.uncovered_concepts:
        action_type = "AUTO_FILL"

    if action_type is None:
        # Layered divergence tolerance (higher coverage → wider tolerance)
        if self.coverage < 0.6:   div_tolerance = div_max        # strict 0.2
        elif self.coverage < 0.8: div_tolerance = 0.25           # medium
        else:                     div_tolerance = 0.35           # relaxed

        if self.state == "DONE":
            action_type = "IDLE"
        elif (reading
              and self.divergence is not None
              and self.coverage >= cov_target
              and self.divergence <= div_tolerance
              and self.credibility >= cred_min):
            # Use the values copied from the reading above: a reading without
            # "divergence" no longer raises KeyError
            self.state = "CONVERGING"
            action_type = "AUTO_CONCLUDE"
        elif self.divergence is not None and self.divergence > 0.6:
            self.state = "ESCALATION"
            action_type = "ESCALATE"
        elif self.divergence is not None and self.divergence > div_max:
            # ② Switch strategy; ESCALATE only after 2 consecutive rounds with no improvement
            if (self.strategy_prev_divergence is not None
                    and self.divergence >= self.strategy_prev_divergence - 0.02):
                self.strategy_failure_count += 1
            else:
                self.strategy_failure_count = 0
            self.strategy_prev_divergence = self.divergence

            if self.strategy_failure_count >= 2:
                self.state = "ESCALATION"
                action_type = "ESCALATE"
            else:
                self.state = "SELF_CORRECTING"
                self.strategy = self._next_strategy()
                action_type = "SELF_CORRECT"
        elif (self.coverage > 0.8
              # "is not None" so that a divergence of exactly 0.0 still qualifies
              and self.divergence is not None and self.divergence <= div_tolerance
              and self.coverage_stable_rounds >= STABLE_THRESHOLD):
            self.state = "CONVERGING"
            action_type = "CHECK_COMPLETE"
        else:
            self.state = "EXPLORING"
            action_type = "CONTINUE"

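    self._save()
    action = {"type": action_type, "strategy": self.strategy}
    if action_type == "AUTO_FILL":
        # Pass the gaps along so the actuator can list them in its notification
        action["auto_fill_concepts"] = list(self.uncovered_concepts)
    return {"state": self.state, "action": action}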

2.2 Computing the Three Metrics

FSM transitions depend on three quantitative metrics. Below: definition, threshold rationale, and real implementation for each.

divergence (semantic drift)

Definition: Distance between what was actually written and the initial direction (direction_text).

Computation: Embed direction_text and each sub-file with an embedding model (default Bailian text-embedding-v3), then take the mean cosine distance. The implementation lives in sensor_measure(), which calls measure_divergence(). It supports two engines: embeddings when an API key is available, with TF-IDF as the fallback.
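
The arithmetic behind "mean cosine distance" can be sketched in a few lines. This is an illustration, not the author's measure_divergence(): it assumes the embeddings are already available as plain lists of floats, and `cosine_distance()` and `mean_divergence()` are hypothetical names.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def mean_divergence(direction_vec: Vector, subfile_vecs: Sequence[Vector]) -> float:
    """Average distance between the direction embedding and each sub-file."""
    if not subfile_vecs:
        return 1.0  # assumption: no sub-files yet counts as fully diverged
    total = sum(cosine_distance(direction_vec, v) for v in subfile_vecs)
    return total / len(subfile_vecs)
```

For example, one sub-file that matches the direction exactly and one that is orthogonal to it average out to a divergence of 0.5.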

Threshold rationale (from 60+ research runs):

> 0.4: Visible drift; ACTUATOR switches search strategy (perplexity → brave / deep_dive)
> 0.6: Severe drift; ESCALATION, notify user to decide direction
≤ 0.2: Normal range; can enter CONVERGING awaiting acceptance

Why these numbers? 0.4 is empirical — in practice “direction correct but expression differs” is usually 0.1–0.3; “partially off direction” 0.4–0.6; “completely off” usually > 0.6.

coverage (concept coverage)

Definition: Of core concepts in direction_text, how many are substantively discussed in sub-files.

Computation: Extract 5–12 core concepts from direction_text (prefer setpoint topics, else jieba tokenization + TF-IDF scoring). Each sub-file: embedding cosine similarity > 0.65 counts as substantive coverage. Covered concepts / total = coverage.
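
The ratio itself is simple once embeddings exist. The sketch below is an assumption-laden illustration rather than the author's _coverage_matrix(): it takes precomputed vectors as plain lists, and `coverage_from_embeddings()` is a hypothetical name. Only the 0.65 threshold comes from the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coverage_from_embeddings(
    concept_vecs: Dict[str, Vector],
    subfile_vecs: List[Vector],
    threshold: float = 0.65,
) -> Tuple[float, List[str]]:
    """Return (coverage ratio, uncovered concepts).

    A concept is covered when at least one sub-file exceeds the threshold.
    """
    uncovered = [
        concept
        for concept, cvec in concept_vecs.items()
        if not any(cosine_similarity(cvec, fvec) > threshold for fvec in subfile_vecs)
    ]
    total = len(concept_vecs)
    coverage = (total - len(uncovered)) / total if total else 0.0
    return coverage, uncovered
```

The uncovered list is what a gap-filling step would consume, so returning it alongside the ratio keeps the two in sync.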

Threshold rationale:

< 0.6: Insufficient coverage; ACTUATOR auto-fills uncovered concepts
0.6–0.8: Basic coverage; continue EXPLORING
≥ 0.8: All core concepts discussed; can enter CONVERGING

Known gap: Bailian embedding tends to fragment concept extraction on long Chinese sentences — coverage may read low while content is actually complete (see “gray state” handling below).

Also, coverage_stable has a 6-hour time gate: two measure runs must be ≥ 6 hours apart; writing 3 sub-files in one session won’t falsely trigger stable count.

Actual code: sensor_measure() — SENSOR unified entry (research_controller.py)

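# Condensed from research_controller.py
import json
import re
import time


def sensor_measure(topic_slug: str) -> dict:
    """SENSOR: read SETPOINT.json + all sub-files, return three metrics."""
    # topic_dir is the topic's directory as a pathlib.Path; resolving it from
    # topic_slug is omitted in this condensed excerpt.
    setpoint = json.loads((topic_dir / "SETPOINT.json").read_text())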
    modules = []
    for f in topic_dir.glob("*.md"):
        if f.name in ("00-index.md",):
            continue
        text = re.sub(r"```[\s\S]*?```", "", f.read_text())    # strip code blocks
        text = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", text)  # strip Markdown links
        modules.append(text)

    outline_text = setpoint.get("direction_text", "")

    # Divergence: embedding cosine distance (TF-IDF fallback)
    div_result = measure_divergence(modules, outline_text)

    # Coverage: concept–sub-file matrix (> 0.65 = substantive coverage)
    concepts = _extract_concepts(outline_text, topics=setpoint.get("topics", []))
    concept_to_files, _ = _coverage_matrix(modules, concepts)
    covered = set(c for c, files in concept_to_files.items() if files)
    coverage = len(covered) / len(concepts) if concepts else 0.0

    # coverage_stable: 6-hour time gate, avoid false trigger in same session
    last_measure_at = setpoint.get("last_measure_at")
    time_ok = last_measure_at is None or (time.time() - last_measure_at) >= 6 * 3600
    history = setpoint.get("coverage_history") or [None]  # empty history → no previous value
    prev_cov = history[-1]
    if time_ok and prev_cov is not None and abs(coverage - prev_cov) < 0.05:
        stable_rounds = setpoint.get("coverage_stable_rounds", 0) + 1
    elif time_ok:
        stable_rounds = 0
    else:
        stable_rounds = setpoint.get("coverage_stable_rounds", 0)

    # Credibility: weighted mean of URL credibility
    credibility = measure_credibility(modules)

    return {
        "coverage": round(coverage, 3),
        "credibility": round(credibility, 2),
        "divergence": div_result.get("final_divergence", 1.0),
        "modules_completed": len(modules),
        "coverage_stable_rounds": stable_rounds,
        "uncovered_concepts": [c for c, files in concept_to_files.items() if not files],
    }

credibility (source trust)

Definition: Weighted average credibility score of all cited URLs.

Computation:

Scan all URLs, look up domain in CREDIBILITY_SCORES
Unlisted domains get 2.0 default (neutral — avoid punishing “new but real” sources with zero)
Bare-link rate (URLs not in [^N]: blocks) > 30% → cap the score at 2.5

CREDIBILITY_SCORES = {
    # 4.0: top journals (NEJM/Lancet/JAMA/Nature/Science)
    "nejm.org": 4.0, "thelancet.com": 4.0, "nature.com": 4.0,
    # 3.5: authorities (FDA/NIH/WHO/CDC)
    "fda.gov": 3.5, "nih.gov": 3.5, "who.int": 3.5,
    # 2.5: industry media / professional journals
    "ign.com": 2.5, "steamdb.info": 2.5,
    # Unlisted default: 2.0
}
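
Using a small excerpt of the dictionary above, the three computation steps can be sketched as follows. This is not the author's measure_credibility(): the text says "weighted mean" without giving the weights, so the sketch assumes equal weights, and the regexes, `domain_score()`, `credibility_sketch()`, and the 0.0 result for text without URLs are all assumptions.

```python
import re
from urllib.parse import urlparse

SCORES = {"nejm.org": 4.0, "fda.gov": 3.5, "ign.com": 2.5}  # excerpt only
DEFAULT_SCORE = 2.0        # unlisted domains
BARE_LINK_CAP = 2.5        # ceiling when too many links are bare
BARE_LINK_RATE_MAX = 0.30

URL_RE = re.compile(r"https?://[^\s)\]>]+")
FOOTNOTE_RE = re.compile(r"^\[\^[^\]]+\]:")  # lines such as "[^3]: https://..."


def domain_score(url: str, scores: dict) -> float:
    """Score a URL by its host, matching a listed domain or its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    for domain, score in scores.items():
        if host == domain or host.endswith("." + domain):
            return score
    return DEFAULT_SCORE


def credibility_sketch(text: str, scores: dict = SCORES) -> float:
    """Equal-weight mean of URL scores, capped when bare links dominate."""
    cited, bare = [], []
    for line in text.splitlines():
        urls = URL_RE.findall(line)
        (cited if FOOTNOTE_RE.match(line.strip()) else bare).extend(urls)
    all_urls = cited + bare
    if not all_urls:
        return 0.0  # assumption: nothing cited, nothing to trust
    mean = sum(domain_score(u, scores) for u in all_urls) / len(all_urls)
    if len(bare) / len(all_urls) > BARE_LINK_RATE_MAX:
        mean = min(mean, BARE_LINK_CAP)
    return round(mean, 2)
```

The cap means a page of top-journal links pasted inline still tops out at 2.5 until the links are moved into footnote blocks.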

Authority/professional domains are auto-added to the dictionary at conclude, so there is no need to ask each time. Commercial platforms, blogs, and marketing content are excluded by default; the user decides whether to add them.

Tier logic (subjective judgment, not objective standard):

4.0: Peer-reviewed journals — external review before publication, statistical significance requirements
3.5: Government/health agencies — legal accountability; publishing wrong info has real consequences
2.5: Industry media/professional journals — editorial standards, no external peer review
2.0: Exists but quality unknown — neutral, avoid zeroing new-but-real sources

Domain is a proxy for content quality, not quality itself. A wrong paper in a top journal is still wrong; measured data in a personal blog may beat official docs. Tiers come from “what source type usually implies” — not a fixed standard.

Adjust by domain: technical research may tier arxiv.org, github.com separately (preprints and repos often more timely in engineering); financial research may raise regulator domains; pure academic may raise credibility_min from 2.5 to 3.0.

2.3 Three-Layer Deviation Signals

The FSM calls detect_research_signals() continuously to pick up deviation signals in three layers:

def detect_research_signals() -> list[dict]:
    signals = []

    # Behavioral layer: research stalled
    for slug, dir_name, mtime in get_topic_dirs():
        days_ago = (now() - mtime) / 86400
        status = get_topic_status(dir_name)["status"]
        if status == "in-progress" and days_ago > 14:
            signals.append({"type": "RESEARCH_STALL_ACTIVITY",
                          "severity": "high", "detail": f"{slug} no updates for 14 days"})
        elif status == "in-progress" and days_ago > 3:
            signals.append({"type": "RESEARCH_STALL_ACTIVITY",
                          "severity": "medium", ...})

    # Temporal layer: past decay cycle
    for conc in db_get_needs_review():
        signals.append({"type": "RESEARCH_STALL_TEMPORAL",
                      "severity": "high" if contradicted else "medium", ...})

    # Content layer: severe divergence (read FSM precomputed results)
    for slug, topic in rc._fsm_load().items():
        if topic.get("state") == "ESCALATION":
            signals.append({"type": "CONTENT_DIVERGENCE",
                          "severity": "high",
                          "detail": f"{slug} div={topic['divergence']:.3f}"})

    return signals

Each layer’s role:

Behavioral (RESEARCH_STALL_ACTIVITY): Detect “research forgotten in a corner.” in-progress with no activity for 3 days → reminder; 14 days → higher severity.
Temporal (RESEARCH_STALL_TEMPORAL): Detect “research conclusions expired.” Research with next_review < now() marked needs_review=1.
Content (CONTENT_DIVERGENCE): Detect “content severely off direction.” FSM in ESCALATION means divergence already exceeded 0.6.
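
The behavioral and temporal rules reduce to two small date checks. The sketch below restates them in isolation under stated assumptions: `stall_severity()`, `next_review()`, and `needs_review()` are hypothetical helpers, and the 60-day decay period is the default described for DONE research.

```python
from datetime import datetime, timedelta
from typing import Optional

DECAY_DAYS = 60  # default decay period for concluded research


def stall_severity(status: str, days_idle: float) -> Optional[str]:
    """Behavioral layer: how urgent is an idle in-progress topic?"""
    if status != "in-progress":
        return None
    if days_idle > 14:
        return "high"
    if days_idle > 3:
        return "medium"
    return None


def next_review(concluded_at: datetime, decay_days: int = DECAY_DAYS) -> datetime:
    """Temporal layer: when a conclusion is due for re-verification."""
    return concluded_at + timedelta(days=decay_days)


def needs_review(next_review_at: datetime, now: datetime) -> bool:
    """True once the review date has passed (next_review < now)."""
    return next_review_at < now
```

A conclusion written on 2024-01-01 becomes due on 2024-03-01 and is flagged from the following day's check onward.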

After detection, cmd_signal_check:

Dedup: Same signal at most once per 6 hours (avoid nagging)
TG push: Direct to user’s Telegram
Set verify deadline: medium severity → 72h, high → 48h
Overdue tracking: Unhandled at deadline → mark stale + another TG reminder
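
The dedup, deadline, and overdue steps can be sketched with an in-memory tracker. This is an illustration only: `SignalTracker` is a hypothetical stand-in for the real table-backed logic, the dedup key (type plus detail) is an assumption, and the Telegram push is left out.

```python
import time
from typing import Optional

DEDUP_WINDOW_S = 6 * 3600                        # same signal at most once per 6 hours
VERIFY_DEADLINE_H = {"medium": 72, "high": 48}   # hours until the signal goes stale


class SignalTracker:
    """Hypothetical in-memory stand-in for the signal table."""

    def __init__(self) -> None:
        self._last_sent = {}   # (type, detail) -> timestamp of last push
        self.pending = []      # signals awaiting verification

    def submit(self, signal: dict, now: Optional[float] = None) -> bool:
        """Return True if the signal should be pushed, False if deduplicated."""
        now = time.time() if now is None else now
        key = (signal["type"], signal.get("detail", ""))
        last = self._last_sent.get(key)
        if last is not None and now - last < DEDUP_WINDOW_S:
            return False
        self._last_sent[key] = now
        hours = VERIFY_DEADLINE_H.get(signal.get("severity"), 72)
        self.pending.append({**signal, "deadline": now + hours * 3600, "stale": False})
        return True

    def mark_overdue(self, now: Optional[float] = None) -> list:
        """Mark unhandled signals past their deadline as stale and return them."""
        now = time.time() if now is None else now
        overdue = [p for p in self.pending if not p["stale"] and now >= p["deadline"]]
        for p in overdue:
            p["stale"] = True
        return overdue
```

Passing `now` explicitly keeps the logic testable without waiting six hours; the entries returned by `mark_overdue()` are the ones that would get a second reminder.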

Research signals share the same memory.action_tracker table as project management signals — not a separate system. View with:

python3 ~/.hermes/scripts/projects.py signal status
# Output includes both RESEARCH_STALL_* and PROJECT_* signals

No separate signal table per workflow — one table so all anomalies are visible in one place.

2.4 Convergence Quad Condition

Entering CONVERGING requires all four:

modules_completed >= min(3, total_modules // 3)  # one third of the modules, capped at 3
AND coverage >= 0.8
AND divergence <= 0.2
AND credibility >= 2.5

coverage_stable (3 rounds) is meant for long research where you measure as you write. If all sub-files are completed in one shot, the history is empty and the stable check can never be satisfied; in that case skip it and use only the three substantive metrics above.

Three rounds is the minimum evidence: it guards against declaring convergence just because coverage happened to hold steady across a few consecutive sub-files. At a slower pace (1–2 sub-files per day), a stable count of 2 is reasonable; for research spanning multiple days or sessions, keep 3 or raise it to 4.
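
As a function, the quad condition above might look like the sketch below. `can_converge()` and its keyword defaults are assumptions for illustration; the expression for the module minimum is copied from the condition as written.

```python
def can_converge(
    modules_completed: int,
    total_modules: int,
    coverage: float,
    divergence: float,
    credibility: float,
    *,
    coverage_min: float = 0.8,
    divergence_max: float = 0.2,
    credibility_min: float = 2.5,
) -> bool:
    """True when all four conditions for entering CONVERGING hold."""
    # One third of the planned modules, but never more than 3 are required.
    min_modules = min(3, total_modules // 3)
    return (
        modules_completed >= min_modules
        and coverage >= coverage_min
        and divergence <= divergence_max
        and credibility >= credibility_min
    )
```

Note what `min(3, total_modules // 3)` does at the edges: for an 11-file outline it requires 3 modules, for a 30-file outline still 3, and for fewer than 3 planned modules it requires none.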

2.5 Gray State: Low Coverage but Substantively Complete Content

A boundary case in practice: Bailian embedding fragments concept extraction on Chinese direction_text; coverage may show 0.3–0.5 while content already fully answers the research direction.

Decision tree:

measure → coverage < 0.6
  ├─ Read 00-index goals + file map → substantively complete?
  ├─ Sample 2–3 sub-files → do they answer the research question?
  ├─ Is the low coverage/credibility an algorithm or source issue rather than missing content?
  │   ├─ Yes → Option 1 (conclude --adopted + manually set decay 60d)
  │   └─ No → Option 2 (add sub-files to fill gaps)
  └─ In report, state clearly "coverage is known algorithm limitation; content substantively complete"

Coverage is auxiliary, not the conclusion. Being blocked by numbers is worse than ignoring them.

3. Dual Control Loop: Machine Executes, Human Sets Setpoint

A basic cybernetics judgment: complex systems need layering — direct control handles high-frequency execution; organizational layer handles low-frequency but critical value judgments.

graph TD
    subgraph Direct_Control_Auto
        A[End of each sub-file<br/>trigger measure] --> B{divergence}
        B -->|0.4-0.6| C[Switch search strategy<br/>perplexity → brave]
        B -->|> 0.6| D[ESCALATION<br/>notify user]
        B -->|< 0.4| E{coverage}
        E -->|< 0.6| F[ACTUATOR auto-fill<br/>uncovered concepts]
        E -->|≥ 0.6| G[Continue EXPLORING]
        C --> G
        F --> G
    end

    subgraph Organizational_User
        H[Step 0 outline confirm<br/>setpoint init] --> I[Step 1 monitor<br/>'Complete yet?']
        I --> J[Step 2 final judgment<br/>accept/reject/adjust]
        D -.escalate.-> I
        J --> K{ESCALATION response}
        K -->|Continue| G
        K -->|Adjust setpoint| H
        K -->|Terminate| L[Archive]
    end

    style D fill:#ff6b6b,color:#fff
    style H fill:#4a90d9,color:#fff
    style J fill:#4a90d9,color:#fff

3.1 What the Direct Control Layer Does (Automatic)

Direct control handles high-frequency, low-value, repetitive tasks:

Operation	Frequency	Trigger	Decision basis
`measure`	End of each sub-file	New sub-file commit	divergence + coverage + credibility
Switch search strategy	Occasional	divergence > 0.4 (up to 0.6)	strategy_failure_count
ACTUATOR auto-fill	Occasional	coverage < 0.6 + divergence < 0.3	uncovered_concepts list
ESCALATION	Rare	divergence > 0.6, or 2 strategy switches with no improvement	severity judgment
Decay signal detection	Daily (cron)	`next_review < now`	detect_research_signals()

Common trait: clear rules, machine can run alone, no value judgment. Automating them means humans don’t have to audit every sub-file.

Search strategy decision tree (ACTUATOR switch logic):

Step 1 entry → default perplexity (web-search)
  ↓
End of each sub-file → measure
  ├─ coverage < 0.6 with uncovered concepts → AUTO_FILL (deep_dive fill gaps)
  ├─ divergence > 0.6 → ESCALATION → TG notify user
  ├─ divergence > 0.4 (and < 0.6) → SELF_CORRECTING → switch brave / deep_dive
  │     2 consecutive strategy switches with no divergence drop → ESCALATION
  ├─ coverage > 0.8 + divergence ≤ 0.35 + stable ≥ 3 rounds → CONVERGING → ask user "Complete yet?"
  └─ else → EXPLORING (continue current strategy)

After a strategy switch there is a closed-loop check: if divergence fails to drop by more than 0.02 after switching to brave / deep_dive, strategy_failure_count increments. ESCALATION fires only after 2 consecutive failures, so a single fluctuation doesn’t bother the user.
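
The counter logic from step() can be isolated to see how a sequence of readings plays out. `StrategyMonitor` is a hypothetical wrapper for illustration; the 0.02 margin and the limit of 2 failures come from the code shown earlier.

```python
from typing import Optional

IMPROVEMENT_MARGIN = 0.02  # divergence must drop by more than this to count
MAX_FAILURES = 2           # consecutive non-improvements before escalating


class StrategyMonitor:
    """Tracks whether strategy switches are actually reducing divergence."""

    def __init__(self) -> None:
        self.prev_divergence: Optional[float] = None
        self.failure_count = 0

    def observe(self, divergence: float) -> str:
        """Record one above-threshold reading; return the resulting action."""
        no_improvement = (
            self.prev_divergence is not None
            and divergence >= self.prev_divergence - IMPROVEMENT_MARGIN
        )
        self.failure_count = self.failure_count + 1 if no_improvement else 0
        self.prev_divergence = divergence
        return "ESCALATE" if self.failure_count >= MAX_FAILURES else "SELF_CORRECT"


monitor = StrategyMonitor()
actions = [monitor.observe(d) for d in (0.45, 0.44, 0.44)]
# First reading has no baseline; the next two fail to improve by more than 0.02.
print(actions)
```

Because the first reading has nothing to compare against, escalation through this path takes at least three above-threshold readings.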

Actual code: Actuator.execute() — actuator actions and TG notification templates (research_controller.py)

# Condensed from research_controller.py
class Actuator:
    def execute(self, action: dict, context: dict) -> dict:
        action_type = action["type"]
        slug = context.get("topic_slug", "")

        if action_type == "CONTINUE":
            return {"ok": True, "action": "continue"}

        elif action_type == "SELF_CORRECT":
            return {"ok": True, "action": "strategy_switched",
                    "new_strategy": action.get("strategy")}

        elif action_type == "ESCALATE":
            tg_send(
                f"🚨 *CONTENT ESCALATION*\n\n"
                f"Research *{slug}* severely diverged from outline\n"
                f"Reply:\n"
                f"· `continue` — keep current direction\n"
                f"· `adjust setpoint` — redefine goal\n"
                f"· `terminate` — end research"
            )
            return {"ok": True, "action": "escalated"}

        elif action_type == "AUTO_FILL":
            concepts = action.get("auto_fill_concepts", [])
            tg_send(
                f"🔧 *AUTO FILL*\n\n"
                f"Research *{slug}* coverage < 0.6, uncovered concepts:\n"
                f"`{'  ·  '.join(concepts)}`\n\n"
                f"Please add related content and continue."
            )
            return {"ok": True, "action": "auto_fill", "concepts": concepts}

        elif action_type == "CHECK_COMPLETE":
            stable = context.get("coverage_stable_rounds", 0)
            tg_send(
                f"🤔 *Direction stable — complete yet?*\n\n"
                f"Research *{slug}* stable for {stable} rounds\n"
                f"Reply `Y` → `researcher.py conclude {slug} --adopted`\n"
                f"Reply `N` → continue adding"
            )
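            return {"ok": True, "action": "check_complete"}

        # Action types not shown in this condensed excerpt fall through to here
        # instead of returning None
        return {"ok": False, "action": "unhandled", "type": action_type}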

3.2 What the Organizational Layer Does (User at Key Decision Points)

The organizational layer intervenes only at three nodes — no interruption otherwise:

Step 0 — Outline confirmation

After pre-research, AI presents outline (after perplexity breadth search). User reviews:

“OK” → run researcher.py init + setpoint --init, FSM enters EXPLORING
“Direction off” → revise outline, reconfirm (see “two-round correction” pattern)
“Add constraint: [X]” → update direction_text with new constraint

Setpoint is the baseline for all later measurement; if the direction is wrong, every divergence reading is measured against the wrong baseline. After the outline is sent, the user must explicitly say “OK” before init runs. When unsure, delay beats drift.

Step 1 — Monitoring

During EXPLORING, every few sub-files the system asks “Complete yet?” User reads measure output (three metrics + sub-file list):

“N” → continue with the next sub-file
“Add one: [X topic]” → temporarily add specified topic
“Almost there” → force CONVERGING

Whether content is complete — you know better than the algorithm. You can feel “it’s clear enough”; the machine can’t.

Step 2 — Final judgment

All three metrics pass; FSM in CONVERGING; system requests final acceptance. After reading all sub-files and 00-index core conclusions:

“Pass, conclude” → researcher.py conclude <slug> --adopted, write knowledge_conclusions, decay timer starts
“Part X needs more” → back to EXPLORING, add specified sub-files
“Overall direction wrong” → back to INIT, revise setpoint

Machine confirms “metrics pass”; can’t confirm “content useful” — that value judgment is human-only.

ESCALATION response

divergence > 0.6 triggers; after Telegram notification:

“Continue, direction is fine” → back to EXPLORING, ACTUATOR switches to the deep_dive strategy
“Adjust setpoint” → back to INIT, revise direction
“Terminate” → archive current research, don’t write knowledge_conclusions
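
Routing the user's reply to the next state can be sketched as a keyword lookup. This is an assumption-heavy illustration: `handle_escalation_reply()` is hypothetical, the keywords are the three reply options from the notification template, and plain substring matching is a simplification that a real handler would likely tighten.

```python
from typing import Optional, Tuple

# keyword -> (next FSM state, follow-up action); None means the research ends
ESCALATION_REPLIES = {
    "adjust setpoint": ("INIT", "revise direction"),
    "terminate": (None, "archive without writing knowledge_conclusions"),
    "continue": ("EXPLORING", "switch to deep_dive strategy"),
}


def handle_escalation_reply(reply: str) -> Tuple[Optional[str], str]:
    """Map a free-text reply to (next_state, follow_up) or raise if unclear."""
    text = reply.strip().lower()
    for keyword, outcome in ESCALATION_REPLIES.items():
        if keyword in text:
            return outcome
    raise ValueError(f"unrecognized escalation reply: {reply!r}")
```

Raising on an unrecognized reply keeps the FSM in ESCALATION rather than guessing, which matches the idea that this decision belongs to the human.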

3.3 Full Walkthrough: From “Research X” to “conclude”

User says: “Research GLP-1 weight-loss drugs — mechanism and suitable populations.” Full flow below. Human–AI collaboration, learn as you go — user follows in real time, reads each sub-file as AI writes; whole process 30–45 minutes.

T0: User request

User: Research GLP-1 weight-loss drugs — mechanism and suitable populations

T+5min: Pre-research

AI assistant:
  → perplexity breadth search "GLP-1 mechanism weight loss clinical"
  → 11-file outline (five-segment structure)
  → direction_text (zero-baseline perspective, explain what GLP-1 is)
  → notify user: "Here's the outline, 5 segments 11 files, direction is [X]"

T+8min: User reviews outline (Step 0 — outline confirmation)

User: Add a cost chapter, rest is fine
AI assistant: [Update outline, add 10-application-cost.md, resend direction_text]
User: OK
AI assistant: [Run init + setpoint --init, FSM enters EXPLORING]

T+25min: First 5 sub-files done (user reads along, Step 1 — sync progress)

[Each sub-file end: search → write → measure, user reads in parallel]
AI assistant: [Finish 01-principles.md → measure]
  → divergence=0.18 / coverage=0.25 / credibility=2.8
  → state: EXPLORING, continue
AI assistant: [Finish 02-deep-clinical-data.md → measure]
  → divergence=0.21 / coverage=0.45 / credibility=3.2
  → state: EXPLORING, continue
... continue

T+32min: SELF_CORRECTING

AI assistant: [Finish 05-evolution-long-term.md → measure]
  → divergence=0.42 / coverage=0.65 / credibility=2.9
  → trigger SELF_CORRECTING: switch search strategy (perplexity → brave)
  → state: SELF_CORRECTING
AI assistant: [Write 06-practice-medication.md with brave search]
  → divergence=0.35 / coverage=0.75 / credibility=3.0
  → state: back to EXPLORING

T+40min: ESCALATION (user responds on the spot)

AI assistant: [Finish 09-application-decision-tree.md → measure]
  → divergence=0.67 / coverage=0.85 / credibility=3.1
  → trigger ESCALATION: notify user
AI assistant: [TG push] "Research div=0.67 severely off — switch strategy or reconfirm direction"

User: [Sees push] Direction is fine, continue with cost chapter
AI assistant: [state: ESCALATION → EXPLORING, switch deep_dive strategy]
  → write 10-application-cost.md
  → measure: divergence=0.18 / coverage=0.95 / credibility=3.2
  → state: CONVERGING

T+45min: Step 2 final judgment (user has read most content)

AI assistant: [Notify] "All three metrics pass, awaiting acceptance"
User: [Skim 00-index + unread sub-files]
User: Pass, conclude
AI assistant: [Run researcher.py conclude <slug> --adopted]
  → write knowledge_conclusions, field_decay_days=60
  → next_review = now + 60 days
  → FSM state: DONE

T+60 days: Decay trigger

[cron runs detect_research_signals daily, finds needs_review]
AI assistant: [TG push] "Research past 60 days, needs verify"
User: Content still correct, still-valid
AI assistant: [Run verify --still-valid]
  → last_verified = now, next_review = now + 60d
  → confidence: medium → high

User intervened only 4 times — outline confirm, mid monitoring, ESCALATION response, final acceptance. Everything else automatic.

“Reduce user involvement” is a common trap. User must intervene on setpoint and final judgment — value judgments the machine can’t do. Making users audit every sub-file is the anti-pattern.

4. Five-Segment Structure: Don’t Assume the Reader; Organize for “Learning”

The “reader” of technical research is often yourself three months later when you’ve forgotten everything — or another engineer in the team who knows nothing about the field. Don’t assume background — at first mention of each concept: one-line definition + life analogy + mechanism expansion.

Five segments is a reference template, not fixed structure (adjust flexibly):

Segment	Purpose	Core question answered
Principles	Explain “what it is” from zero	What & Why — reader can explain to a friend
Deep dive	Data + risks + boundaries	How good/bad — reader knows trust and limits
Evolution	History / future / long view	Evolution — reader knows why it is this way
Practice	How to do it concretely	How to — reader can execute
Application	Personal tailoring + synthesis + acceptance	Putting together — reader can decide for themselves

Anti-pattern: “As a practitioner of XYZ, you surely know…” — assumes the reader knows XYZ; the point of research is to teach someone who doesn’t.

Good pattern: “XYZ is a Z of Y (analogy: everyday XXX is like Y). Its core mechanism is…”

Not every segment is required: pure theory may need only Principles + Deep dive; pure hands-on may need only Practice + Application. Forcing segments thins content.

4.1 Example: A Complete 11-File Research Outline

To make five segments concrete, a full example (from a real weight-loss drug research directory, 11 files in five segments):

File	Segment	Core content	Writing notes
`00-index.md`	Entry	Core conclusions + decision tree + key data + rejected directions	Not template fill — synthesized judgment
`01-principles.md`	Principles	What it is + weight-loss mechanism (gut hormone regulation)	One-line definition + analogy: “gut’s satiety signal to the brain”
`02-deep-clinical-data.md`	Deep dive	STEP 1–4 trials + real-world data	Concrete numbers (avg 15–22% weight loss), not “significantly effective”
`03-deep-side-effects.md`	Deep dive	GI reactions + rare risks	Don’t hide negative data; quantify risk (nausea 40–60%)
`04-evolution-history.md`	Evolution	Diabetes to weight loss “accidental discovery”	Timeline: 2005 launch → 2010 weight loss found → 2021 FDA obesity indication
`05-evolution-long-term.md`	Evolution	Weight rebound after stop + long-term safety	Mark data gaps clearly (“> 5 year data limited”)
`06-practice-medication.md`	Practice	Dose + injection frequency + storage	Actionable: start 0.25mg weekly, titrate after 4 weeks
`07-practice-diet.md`	Practice	Synergy with weight-loss diet	Not “eat healthy” — “protein first + reduce refined carbs”
`08-practice-exercise.md`	Practice	Resistance training to prevent muscle loss	Key insight: ~30% of weight loss is muscle — resistance required
`09-application-decision-tree.md`	Application	Suitable populations + contraindications + alternatives	Decision tree: BMI > 30 → prioritize; BMI 27–30 + comorbidity → evaluate
`10-application-cost.md`	Application	Domestic availability + price + insurance	Current price range + coverage (data needs verify)
`11-references.md`	Citations	Citation list (bidirectional binding format)	Each citation: claim + evidence bidirectional binding

After the five-segment reorganization, the material runs from principles through application and from clinical evidence through personal decision, so the reader can both “explain it to a friend” and “decide for themselves.”
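
The file-name prefixes in the outline already encode the segment, so the grouping can be recovered mechanically. A minimal sketch, assuming the `NN-keyword-topic.md` naming shown in the table; `SEGMENT_BY_KEYWORD`, `segment_of`, and `group_by_segment` are hypothetical names for illustration, not part of researcher.py.

```python
import re

# Hypothetical mapping from the keyword after the numeric prefix to a segment.
SEGMENT_BY_KEYWORD = {
    "index": "Entry",
    "principles": "Principles",
    "deep": "Deep dive",
    "evolution": "Evolution",
    "practice": "Practice",
    "application": "Application",
    "references": "Citations",
}


def segment_of(filename: str) -> str:
    """Return the segment for a name like '02-deep-clinical-data.md'."""
    match = re.match(r"^(\d{2})-([a-z]+)", filename)
    if not match:
        raise ValueError(f"unexpected file name: {filename}")
    return SEGMENT_BY_KEYWORD[match.group(2)]


def group_by_segment(filenames):
    """Group file names by segment, keeping numeric order inside each group."""
    groups = {}
    for name in sorted(filenames):
        groups.setdefault(segment_of(name), []).append(name)
    return groups
```

A grouping like this makes it easy to spot a segment that is empty or overloaded before writing starts.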

4.2 Why This Structure Teaches

The five segments deepen progressively: abstraction drops from concept to decision, and each step is more concrete and actionable.

flowchart LR
    A["Principles<br/>Abstraction: high<br/>Concept layer"]
    B["Deep dive<br/>Abstraction: medium<br/>Evidence layer"]
    C["Evolution<br/>Abstraction: medium<br/>Time layer"]
    D["Practice<br/>Abstraction: low<br/>Operation layer"]
    E["Application<br/>Abstraction: lowest<br/>Decision layer"]
    A --> B --> C --> D --> E
    style A fill:#4a90d9,color:#fff
    style B fill:#7eb0d9,color:#fff
    style C fill:#a8d5e2,color:#333
    style D fill:#f7c59f,color:#333
    style E fill:#ff6b35,color:#fff

Segment	Abstraction	Reader takeaway
Principles	High (concept)	“What it is, why it aids weight loss”
Deep dive	Medium (evidence)	“How big effect/risk, where boundaries are”
Evolution	Medium (time)	“How the drug came to be, long-term outlook”
Practice	Low (operation)	“If I use it, how exactly”
Application	Lowest (decision)	“Should I use it, is it worth it”

After Principles you know what; Deep dive whether to trust; Evolution whether it goes stale; Practice how to use; Application whether you want it.

Anti-pattern: organize by “whatever I found first” — technical details pile up, reader never learns “what this means for me.” Five segments force cognitive order: mental model → evidence → operation → decision.

4.3 Pre-write checklist: Every file’s opening must have “What is X”

First paragraph of each sub-file must start with this template:

# [filename]

> **One-line definition**: X is a Z of Y (life analogy: [analogy]).

[Then mechanism, data, risks, etc.]

Tell the reader what this chapter is before jumping into technical detail.
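
This opening rule is mechanical enough to check before commit. A minimal sketch, assuming the exact template above (a `# ` heading followed by a `> **One-line definition**:` blockquote); `has_definition_opening` is a hypothetical helper, not an existing script in the system.

```python
import re

# Matches the blockquote line from the template, requiring some text after the colon.
DEFINITION_RE = re.compile(r"^>\s*\*\*One-line definition\*\*:\s*\S")


def has_definition_opening(text: str) -> bool:
    """True if the first non-blank line is a '# ' heading and the next
    non-blank line is the one-line definition blockquote."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    return lines[0].startswith("# ") and bool(DEFINITION_RE.match(lines[1]))
```

A check like this only confirms that the definition line exists; whether the analogy is any good still needs a human read.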

4.4 Code-verification research: Write code + run measurements + cite results

The five-segment template has a special variant: code-verification research. Some conclusions can’t come from reading alone; you must write code, run measurements, and use the measured data as the core argument. Such research adds a sub-step, Step 1.5, inside Step 1, and its artifacts land in 4 special subdirectories.

4.4.1 When code verification is needed

Not every research topic needs code. Criteria:

Research type	Code verification?	Reason
Conceptual (philosophy/history/methodology)	No	Conclusions from literature synthesis, no quantification
Tool usage / operational flow	No	Commands are copy-paste runnable, no measurement needed
Algorithm/library capability comparison	Yes	Performance, accuracy, metrics must be measured
Tool/product capability boundaries	Yes	Docs say it works ≠ it actually works — must run
Scientific validation of FSM itself	Yes	FSM is core to this SOP — need controlled experiments proving it works

Examples:

Audio loudness normalization (LUFS) research: to answer “is my audio loudness correct,” you must run measurements. Put the same audio through ffmpeg, pyloudnorm, and a self-implemented K-weighting meter, then check whether the three numbers differ by ≤ 0.04 LU. Docs won’t tell you how much they differ.
YOLO vs MediaPipe pose comparison: a docs claim of “90%+ accuracy” is marketing until measured. Run 100 images each under different lighting, occlusion, and distance, and count the actual detections.
Scientific validation of the FSM research system: the FSM is core to this SOP, so use controlled experiments with the Claude Code CLI to prove that it detects drift and self-corrects.

4.4.2 Flow difference: Step 1 adds sub-step 1.5

Normal Step 1 (search → write → commit); code-verification adds three actions:

flowchart LR
    subgraph "Step 1 standard"
        A1[Search] --> A2[Write sub-file] --> A3[commit]
    end
    subgraph "Step 1.5 code verification (+3 actions)"
        B1[Search] --> B2[Write sub-file] --> B3[Write verification script] --> B4[Run measurement] --> B5[Attach data to citation] --> B6[commit]
    end

Measured data isn’t “reference material” — it’s research output. Claim field: “Measured: my = -22.29 LUFS / ffmpeg = -22.3 / pyloudnorm = -22.332”, not “refer to some webpage.”

4.4.3 Subdirectory convention (4 special directories)

Code-verification artifacts go in 4 special subdirectories — not formal content (excluded from citation quality scan):

Subdirectory	Contents	Example
`source/`	Raw search responses	`source/pplx_q1.json` (perplexity API raw response with full content + citations)
`scripts/`	Executable code	`scripts/lufs_meter.py` (three-way comparison), `scripts/validate_yolo.py` (YOLO measurement)
`assets/`	Images, attachments	`assets/diagram.png`, `assets/test_audio.wav`
`templates/`	Template files	`templates/citation-entry.md` (citation entry template)

check_citation_quality.py excludes these 4 by default (--include forces scan) — they’re fact base and toolbox, not research body.
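
The exclusion rule can be sketched as a directory walk. This is an illustration only, not the actual code of check_citation_quality.py: `body_files` is a hypothetical helper, and its `include` parameter stands in for the `--include` flag described above.

```python
import tempfile
from pathlib import Path

# The 4 special subdirectories from the table above.
EXCLUDED_DIRS = {"source", "scripts", "assets", "templates"}


def body_files(topic_dir: Path, include=()):
    """Yield .md files under topic_dir, skipping the special subdirectories
    unless they are named in `include`."""
    skipped = EXCLUDED_DIRS - set(include)
    for path in sorted(topic_dir.rglob("*.md")):
        parent_dirs = path.relative_to(topic_dir).parts[:-1]  # directories only
        if not any(part in skipped for part in parent_dirs):
            yield path


# Usage: build a throwaway topic directory and list what would be scanned.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "templates").mkdir()
    (root / "01-principles.md").write_text("# 01-principles\n")
    (root / "templates" / "citation-entry.md").write_text("template\n")
    default_scan = [p.name for p in body_files(root)]
    forced_scan = [p.name for p in body_files(root, include={"templates"})]
```

Here `default_scan` holds only the research body, while `forced_scan` also picks up the template file.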

4.4.4 Real runnable command example (LUFS three-way measurement)

From actual LUFS research. All commands runnable; interfaces verified:

# 1. Install dependencies
pip install pyloudnorm ffmpeg-python numpy

# 2. Prepare test audio (any wav file)
#    Test file: assets/test_audio.wav (48kHz, stereo, 1 minute)

# 3. Run three-way comparison (script in scripts/lufs_meter.py)
python scripts/lufs_meter.py assets/test_audio.wav

Example output (measured numbers):

my K-weighting implementation:    -22.29 LUFS
ffmpeg loudnorm:                    -22.30 LUFS
pyloudnorm (ITU-R):                 -22.332 LUFS
Three-way difference:               ≤ 0.04 LU
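
The last line of that output is just the spread between the largest and smallest reading. A minimal sketch of the arithmetic, using the three measured values above; `three_way_spread` is a hypothetical name, and the real scripts/lufs_meter.py may compute it differently.

```python
# Readings copied from the example output above (LUFS).
readings = {
    "my K-weighting implementation": -22.29,
    "ffmpeg loudnorm": -22.30,
    "pyloudnorm (ITU-R)": -22.332,
}


def three_way_spread(values):
    """Largest pairwise difference, in LU, between loudness readings."""
    values = list(values)
    return max(values) - min(values)


spread = three_way_spread(readings.values())
print(f"Three-way difference: {spread:.3f} LU")
```

With these readings the spread is about 0.042 LU, which is the reported 0.04 LU when rounded to two decimals.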

How to write measured data into research:

**§2.2 Three-way measurement validation**: Three LUFS implementations agree on same audio (diff ≤ 0.04 LU),
proving K-weighting filter + Gating algorithm is reproducible[^measured].

[^measured]: [Internal measurement] LUFS three-way comparison — scripts/lufs_meter.py
 | Claim: ①my K-weighting -22.29 LUFS ②ffmpeg loudnorm -22.30 LUFS
              ③pyloudnorm -22.332 LUFS ④diff ≤ 0.04 LU
 | Evidence: §2.2

Note the source type [Internal measurement]: it is not in the 6-type whitelist (Official/Paper/Blog/Community/News/Official docs), so the check script may need to be extended to accept it. More importantly, measured data is the claim itself. It needs no URL, only “which script produced this” as the citation anchor.

4.4.5 Why you can’t skip measurement

In the SOP, the difference between “copy data from docs” and “run the measurement for data” comes down to three things:

Reproducibility: Scripts stay in scripts/; anyone can reproduce your conclusion. Copied doc data can’t be verified.
Surfacing disputes: Measurement finds what docs omit — e.g. ffmpeg defaults to integrated LUFS but short-form video needs short-term / momentary; docs won’t tell you “where they differ.”
Handling future change: Measurement scripts are the base for ongoing validation — tool version upgrade, rerun script to see if conclusions need update. Docs require re-reading to judge.

Code-verification research costs more (scripts, runs, debugging) but argument quality is a tier higher — “measured diff ≤ 0.04 LU” beats “all three tools claim LUFS support.”

5. Citation Standard: Claim + Evidence Bidirectional Binding

Research isn’t URL stacking. Each citation must explicitly state: what this source provides (claim), and where in this document it’s used (evidence). Before citing, ask “why am I citing this?” — eliminate hanging authoritative URLs that don’t actually support the argument.

6-field format, one line per citation + pipe separators:

[^1]: [Source type] Title — URL | Claim: ①<concrete fact/data> ②<...> | Evidence: §X, §Y

Field	Meaning	Example
`[^N]`	Citation number (anchor in body)	`[^1]`
`[Source type]`	Whitelist: Official/Paper/Blog/Community/News/Official docs	`[Paper]`
Title	Readable source name	`FasterWhisper: 4x faster Whisper inference`
URL	Real accessible address	`https://github.com/SYSTRAN/faster-whisper`
Claim	What this source provides	`①CTranslate2 optimized impl ②INT8/FP16 quant ③4× faster vs original benchmark`
Evidence	Paragraphs where used	`§2.1, §3 speed comparison`
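
The six fields can be pulled apart with a small parser. A minimal sketch, assuming one citation per line and pipe separators exactly as in the format above; `parse_citation` is a hypothetical helper, not the implementation inside check_citation_quality.py. The dash between title and URL is written as the escape `\u2014` in the code.

```python
import re

WHITELIST = {"Official", "Paper", "Blog", "Community", "News", "Official docs"}
HEAD_RE = re.compile(r"^\[\^(\w+)\]:\s*\[([^\]]+)\]\s*(.*)$")
URL_RE = re.compile(r"https?://\S+")


def parse_citation(line: str) -> dict:
    """Split one citation line into its six fields."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        raise ValueError("expected 'head | Claim: ... | Evidence: ...'")
    head, claim_part, evidence_part = parts
    head_match = HEAD_RE.match(head)
    if not head_match:
        raise ValueError("missing '[^N]: [Source type]' prefix")
    number, source_type, rest = head_match.groups()
    url_match = URL_RE.search(rest)
    if not url_match:
        raise ValueError("missing URL")
    if not claim_part.startswith("Claim:") or not evidence_part.startswith("Evidence:"):
        raise ValueError("Claim or Evidence field missing")
    return {
        "number": number,
        "source_type": source_type,
        "whitelisted": source_type in WHITELIST,
        # Title is the text before the URL, minus the trailing dash separator.
        "title": rest[: url_match.start()].rstrip(" \u2014-"),
        "url": url_match.group(0),
        "claim": claim_part[len("Claim:"):].strip(),
        "evidence": [s.strip() for s in evidence_part[len("Evidence:"):].split(",")],
    }
```

A line that fails to parse is itself a useful signal: it usually means the claim or the evidence anchor was skipped.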

Anti-pattern (bare URL):

“FasterWhisper is 4× faster than the original. https://github.com/SYSTRAN/faster-whisper”

Reader can’t tell if the URL supports “4× faster” or something else — it just hangs there.

Good pattern (bidirectional binding):

“FasterWhisper is 4× faster than the original[^1].”

At end:

[^1]: [Official] SYSTRAN/faster-whisper — https://github.com/SYSTRAN/faster-whisper | Claim: ①CTranslate2 optimized impl ②INT8/FP16 quant ③4× faster benchmark | Evidence: §2.1, §3

Reader scans end of doc and sees “this source supports these sections” — quick judgment whether to click.

Mechanical check: Before commit run check_citation_quality.py <file>, exit 0 required (whitelist, claim/evidence coverage, bare-link rate).
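
The source does not spell out how the script defines bare-link rate. One plausible definition, shown here purely as an assumption, is the share of URLs that sit outside footnote definition lines; `bare_link_rate` is a hypothetical function name.

```python
import re

URL_RE = re.compile(r"https?://\S+")
FOOTNOTE_DEF_RE = re.compile(r"^\[\^\w+\]:")


def bare_link_rate(markdown: str) -> float:
    """Share of URLs that appear outside footnote definitions (assumed metric)."""
    bare = cited = 0
    for line in markdown.splitlines():
        url_count = len(URL_RE.findall(line))
        if FOOTNOTE_DEF_RE.match(line.strip()):
            cited += url_count
        else:
            bare += url_count
    total = bare + cited
    return bare / total if total else 0.0
```

Under this definition the bare-URL anti-pattern above scores 1.0 and the bound footnote version scores 0.0.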

6. Decay Mechanism: Knowledge Expires; Auto Review

Research complete ≠ forever correct. Algorithm updates, industry shifts, new studies — conclusions have shelf life; this mechanism manages that.

60-day hard cap (engineering/tools): Any research regardless of topic, decay cycle ≤ 60 days. Classic theory (math laws, physical constants) can extend to 180 days with “classic theory exception” noted.

decay_days matrix (background × goal):

Background \ Goal	Principles	Hands-on	Overview
Zero baseline	180d	90d	45d
Some familiarity	90d	45d	30d
Want depth	45d	30d	15d

Note: the matrix holds the raw values. The effective decay for any background × goal combination must not exceed 60 days (the 60-day hard cap), so larger cells are clamped to 60. The classic theory exception may keep 180d but must note the reason in 00-index.md. Implementation: infer_decay(background, goal) is written to the frontmatter automatically at researcher.py init.
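
The matrix plus the hard cap fits in a few lines. A minimal sketch of what infer_decay could look like, assuming the cap is applied as a simple minimum; the `classic_theory` parameter and the key spellings are assumptions for illustration, not the signature in researcher.py.

```python
HARD_CAP_DAYS = 60

# Raw values from the matrix above: background -> goal -> days.
DECAY_MATRIX = {
    "zero baseline": {"principles": 180, "hands-on": 90, "overview": 45},
    "some familiarity": {"principles": 90, "hands-on": 45, "overview": 30},
    "want depth": {"principles": 45, "hands-on": 30, "overview": 15},
}


def infer_decay(background: str, goal: str, classic_theory: bool = False) -> int:
    """Return decay_days: the matrix value, clamped to the 60-day hard cap
    unless the classic theory exception is claimed."""
    raw = DECAY_MATRIX[background][goal]
    if classic_theory:
        return raw  # exception: the reason must be noted in 00-index.md
    return min(raw, HARD_CAP_DAYS)
```

With the cap applied, only the cells at or below 60 days keep their matrix value; everything above collapses to 60 unless the exception is claimed.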

Matrix derivation (subjective, calibrate by domain):

Rows (background depth): deeper learning means a steeper curve, and you notice change faster, so the “want depth” row has shorter decay than “zero baseline.”
Columns (goal type): overview material (landscape, players, market) changes fastest; hands-on material (tools, APIs) sits in the middle; principles (mechanism, math) change slowest.

The 60-day hard cap comes from personal judgment: major framework releases in engineering often land every 3–6 months, so 60 days leaves a buffer without being too loose. AI/ML may suit 30 days; math/physics principles can use 180 days. The matrix numbers are starting points; recalibrate after a few months in your domain.

The decay period is a shelf life. Coffee beans are best 7 days after roasting, bread hardens after 3 days on the shelf, milk is pulled after 14; knowledge is the same. Labeling “this may be wrong in 60 days” is more honest than pretending it stays fresh forever.

flowchart TD
    A[Research complete<br/>conclude --adopted] --> B{60 days passed?}
    B -->|No| C[Keep ADOPTED<br/>knowledge base in use]
    B -->|Yes| D[detect_research_signals<br/>trigger review]
    D --> E{Read 00-index +<br/>sample 2–3 sub-files}
    E -->|Direction OK<br/>data not stale| F[verify --still-valid<br/>decay reset 60 days]
    E -->|Direction OK<br/>but data stale| G[verify --update<br/>patch flow]
    E -->|Core conclusion overturned| H[verify --outdated<br/>archive old research]
    F --> C
    G --> I[Incremental search +<br/>content update +<br/>re-measure]
    I --> F
    H --> J[Start new research]

    style F fill:#4caf50,color:#fff
    style G fill:#ff9800,color:#fff
    style H fill:#f44336,color:#fff

Three verify paths:

Path	When	Action
`--still-valid`	Direction OK, data not stale	Decay reset 60 days, confidence medium → high
`--update`	Direction OK, data stale	Incremental search + patch sub-files + re-measure (don’t reset FSM state)
`--outdated`	Core conclusion overturned	Archive old research, start new
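
The three paths amount to three different record updates. A minimal sketch over a plain dict, assuming field names borrowed from the conclude code (last_verified, next_review, needs_review); the `confidence` field, the "archived" status value, and `apply_verify` itself are hypothetical.

```python
DAY = 86400  # seconds


def apply_verify(record: dict, path: str, now: int, decay_days: int = 60) -> dict:
    """Return a copy of the record after one of the three verify paths."""
    updated = dict(record)
    if path == "still-valid":
        # Reset the decay timer and raise confidence.
        updated.update(last_verified=now, next_review=now + decay_days * DAY,
                       confidence="high", needs_review=0)
    elif path == "update":
        # Enter the patch flow; the timer resets later via still-valid.
        updated.update(status="active", confidence="medium")
    elif path == "outdated":
        # Archive; new research starts as a separate record.
        updated.update(status="archived", needs_review=0)
    else:
        raise ValueError(f"unknown verify path: {path}")
    return updated
```

Only `still-valid` touches next_review, which is why the patch flow has to end with it.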

Actual code: cmd_conclude() — conclude writes knowledge_conclusions + starts decay timer (researcher.py)

# Condensed from researcher.py
def cmd_conclude(topic_slug, conclusion_type, project_ref=None, note=None):
    conc = db_get_conclusion(topic_slug)

    # Principle 4: auto-expand CREDIBILITY_SCORES before conclude
    # Scan all .md (exclude source/assets/templates/scripts)
    modules_for_expand = [f.read_text() for f in topic_dir.rglob("*.md")
                          if not any(p in {"source","assets","templates","scripts"}
                                     for p in f.relative_to(topic_dir).parts)]
    auto_expand_report = rc.auto_expand_credibility_dict(modules_for_expand)

    ts = now_ts()

    if conclusion_type == "adopted":
        db_update_conclusion(
            topic_slug,
            status="completed",
            conclusion_type="adopted",
            conclusion_note=note,
            project_ref=project_ref,
            last_verified=ts,
            next_review=ts + conc["field_decay_days"] * 86400,  # decay timer starts here
            needs_review=0,
        )
        tg_send(f"✅ Research *{conc['topic_title']}* conclusion adopted (project: {project_ref or '—'})")
        return {
            "ok": True,
            "next_review_in_days": conc["field_decay_days"],
            "auto_expand": auto_expand_report,
        }

    elif conclusion_type == "refuted":
        db_update_conclusion(topic_slug, status="contradicted",
                             conclusion_type="refuted", needs_review=1)
        tg_send(f"⚠️ Research *{conc['topic_title']}* conclusion refuted")

    elif conclusion_type == "superseded":
        db_update_conclusion(topic_slug, status="superseded",
                             conclusion_type="superseded", needs_review=0)

Two notable things conclude --adopted does:

next_review = ts + field_decay_days * 86400 — decay starts at this moment, not when research “finished”
auto_expand_credibility_dict() — auto-adds authority/professional domains from citations to credibility dictionary, no manual maintenance each time

6.1 decay_digest: Proactively Push “Which Research Needs Review”

After research completes, the system doesn’t wait to be remembered; it tracks actively. The mechanism is detect_research_signals() (the three-layer detection from §2.3); this section covers how it pairs with cron and daily reflection.

Key design: Don’t make users “remember to review.” After 60 days you’ll likely forget this research exists, or remember but lack motivation. Decay turns “remember to review” into system-initiated push — AI notifies when research is expiring: “Research X expiring soon — review?” User only answers Y/N.

6.2 Cron schedule: When `detect_research_signals` Runs

The decay check isn’t ad hoc. It is embedded in daily reflection and runs every day at 23:45 as infrastructure:

# Key segment in daily-reflection task (pseudocode)
VERIFY_DEADLINE_HOURS = {"medium": 72, "high": 48}

def daily_research_check():
    """Run daily at 23:45 as part of daily-reflection"""
    signals = detect_research_signals()  # three-layer detection
    for sig in signals:
        # Skip signals already pushed within the last 6 hours (avoid nagging)
        if not tracker_check_exists(sig["type"], sig["value"], within_hours=6):
            tg_send(f"[decay] {sig['detail']}")  # TG push
            hours = VERIFY_DEADLINE_HOURS[sig["severity"]]  # assumes each signal carries its severity
            tracker_insert(sig, verify_by=now() + hours * 3600)

detect_research_signals() three layers (summary):

Signal type	Target	Trigger	Severity
`RESEARCH_STALL_ACTIVITY`	In-progress stall	in-progress > 14 days no update	high
`RESEARCH_STALL_ACTIVITY`	Early stall	in-progress > 3 days no update	medium
`RESEARCH_STALL_TEMPORAL`	Completed research expired	`next_review < now()`	high (if contradicted) / medium
`CONTENT_DIVERGENCE`	In-progress severe drift	FSM in ESCALATION	high
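
The table reads as a set of rules over research records. A minimal sketch over plain dicts, not the real detect_research_signals(): the record fields (slug, status, last_update, next_review, fsm_state) and the function name `detect_signals_sketch` are assumptions for illustration.

```python
DAY = 86400  # seconds


def _signal(sig_type, slug, severity, detail):
    return {"type": sig_type, "value": slug, "severity": severity, "detail": detail}


def detect_signals_sketch(records, now):
    """Apply the table's triggers to a list of research records."""
    signals = []
    for rec in records:
        slug = rec["slug"]
        if rec["status"] == "in-progress":
            idle_days = (now - rec["last_update"]) / DAY
            if idle_days > 14:
                signals.append(_signal("RESEARCH_STALL_ACTIVITY", slug, "high",
                                       f"{slug}: no update for over 14 days"))
            elif idle_days > 3:
                signals.append(_signal("RESEARCH_STALL_ACTIVITY", slug, "medium",
                                       f"{slug}: no update for over 3 days"))
            if rec.get("fsm_state") == "ESCALATION":
                signals.append(_signal("CONTENT_DIVERGENCE", slug, "high",
                                       f"{slug}: FSM in ESCALATION"))
        elif rec.get("next_review") is not None and rec["next_review"] < now:
            severity = "high" if rec["status"] == "contradicted" else "medium"
            signals.append(_signal("RESEARCH_STALL_TEMPORAL", slug, severity,
                                   f"{slug}: past its review date"))
    return signals
```

The two activity rows share one signal type, so the 14-day check has to run before the 3-day check.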

Why not separate cron? Decay check needs user action (verify one of three). Embedding in daily-reflection means user gets one summary at 23:45: “what happened today” plus “which research needs review” — high information density, low interruption.

6.3 Full Decay Workflow: Cron Trigger to User Response

flowchart LR
    T0["T0<br/>23:45 cron trigger"] --> T1["T+30s<br/>dedup + TG push"]
    T1 --> T2["T+1 day<br/>user reads 00-index + samples"]
    T2 --> T3{"User response?"}
    T3 -->|Y| T4["T+1 day done<br/>run verify command"]
    T3 -->|N| T5["T+2-3 days<br/>overdue: stale + re-push"]
    T5 --> T3
    style T0 fill:#4a90d9,color:#fff
    style T4 fill:#4caf50,color:#fff
    style T5 fill:#ff9800,color:#fff

T0: cron trigger (daily 23:45)

daily-reflection cron triggers LLM, runs detect_research_signals(), gets signal list (0–N).

T+30s: signal dedup

Each signal: check tracker for same record within 6 hours. If yes, skip (avoid nagging). If no, continue.

T+30s: TG push

For each new signal, send to Telegram:

[decay] Research *<topic>* past 60-day decay cycle, needs verify

Set verify deadline: medium → 72h, high → 48h.
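
The dedup window and the severity-based deadline can be exercised without a database. A minimal sketch using an in-memory list; `InMemoryTracker` and `push_new_signals` are hypothetical stand-ins for the tracker table and the push step, and `send` is any callable that delivers a message.

```python
HOUR = 3600  # seconds
DEADLINE_HOURS = {"medium": 72, "high": 48}


class InMemoryTracker:
    """Hypothetical stand-in for the tracker table."""

    def __init__(self):
        self.rows = []

    def exists(self, sig_type, value, now, within_hours=6):
        """True if the same signal was recorded inside the dedup window."""
        cutoff = now - within_hours * HOUR
        return any(
            row["type"] == sig_type and row["value"] == value
            and row["created_at"] >= cutoff
            for row in self.rows
        )

    def insert(self, sig, now):
        """Record the signal with a verify deadline derived from its severity."""
        row = dict(sig, created_at=now,
                   verify_by=now + DEADLINE_HOURS[sig["severity"]] * HOUR)
        self.rows.append(row)
        return row


def push_new_signals(signals, tracker, now, send):
    """Send and record only the signals not seen in the last 6 hours."""
    pushed = []
    for sig in signals:
        if tracker.exists(sig["type"], sig["value"], now):
            continue
        send(f"[decay] {sig['detail']}")
        pushed.append(tracker.insert(sig, now))
    return pushed
```

Passing `now` explicitly keeps the sketch testable; a real job would read the clock once per run.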

T+1 day: user opens TG

User workflow:

Open notification → see which research expired
Quick read 00-index → recall what this research was
Sample 2–3 sub-files → judge “still correct?”
Choose verify path:
- Fully correct → --still-valid
- Partially stale → --update (patch flow)
- Core conclusion wrong → --outdated

AI runs command:

researcher.py verify <slug> --still-valid

AI reports:

[verify] <topic> verify --still-valid success
next_review: 2026-08-16 (60 days)
confidence: medium → high

T+2–3 days: overdue no response

If no response within 48h/72h, mark stale + TG again:

[decay] Research <topic> still unhandled (past 48h verify deadline)

Second reminder — avoid “forgot → forever expired.”
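
The overdue step can be sketched as a pass over tracker rows. This assumes rows carry created_at, verify_by, and a handled flag, and that the reminder fires once per row; `overdue_reminders` and those field names are hypothetical, not the system's actual schema.

```python
def overdue_reminders(rows, now):
    """Mark unhandled rows past their deadline as stale and return reminder texts.
    Each row is reminded at most once (an assumption of this sketch)."""
    messages = []
    for row in rows:
        if row["handled"] or row.get("stale") or now <= row["verify_by"]:
            continue
        row["stale"] = True
        deadline_hours = (row["verify_by"] - row["created_at"]) // 3600
        messages.append(
            f"[decay] Research {row['topic']} still unhandled "
            f"(past {deadline_hours}h verify deadline)"
        )
    return messages
```

Whether to re-push once or keep re-pushing is a policy choice; the stale flag here picks the quieter option.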

6.4 daily-reflection Aggregation with Decay Signals

The daily-reflection cron (23:45 daily) isn’t only about decay; it aggregates the day’s important events:

Completed research (conclude events)
Research approaching decay
Daily log distillation
Important lessons / decisions

In daily-reflection, decay signals are prioritized: they need user action, which makes them more important than pure logs.

Aggregation (pseudocode):

def daily_reflection():
    # 1. Scan today's sessions, write daily file
    scan_sessions_to_daily()

    # 2. Check research status (key! decay check here)
    research_signals = detect_research_signals()
    for sig in research_signals:
        push_to_telegram(f"[decay] {sig['detail']}")

    # 3. Distill lessons / decisions
    distill_lessons()

    # 4. Push final summary to TG
    push_daily_summary()

Decay doesn’t run separate cron — embedded in daily reflection — one TG message at 23:45, dense, non-interrupting.
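
One way to fold decay signals and the day's events into a single message, with the action-required items first. A minimal sketch; `build_daily_summary`, the `[log]` prefix, and the fallback text are assumptions, not the output format of the real daily-reflection task.

```python
SEVERITY_ORDER = {"high": 0, "medium": 1}


def build_daily_summary(signals, events):
    """Compose one message: decay signals first (high before medium), then logs."""
    lines = []
    for sig in sorted(signals, key=lambda s: SEVERITY_ORDER.get(s["severity"], 2)):
        lines.append(f"[decay] {sig['detail']}")
    lines.extend(f"[log] {event}" for event in events)
    return "\n".join(lines) if lines else "Nothing to report today."
```

Putting signals at the top means the part that needs a Y/N answer is visible without scrolling.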

6.5 Patch Flow (`--update` Detail)

Applies when: content needs supplement after verify --update; or direction OK but some data/cases stale.

Relation to verify flow:

verify --update (mark active + medium confidence)
  ↓
Enter "patch" sub-flow (this section)
  ↓
After patch complete → verify --still-valid → reset decay timer → confidence: medium → high

Step 3.1: Choose Verify Branch

flowchart TD
    A[Receive RESEARCH_STALL_TEMPORAL signal] --> B[Read 00-index.md<br/>sample 2–3 sub-files]
    B --> C{Judgment}
    C -->|Direction OK, data fresh| D[verify --still-valid]
    C -->|Direction OK, some data stale| E[verify --update<br/>patch flow]
    C -->|Core conclusion overturned / field shifted| F[verify --outdated]
    style D fill:#4caf50,color:#fff
    style E fill:#ff9800,color:#fff
    style F fill:#f44336,color:#fff

Step 3.2: Patch Execution — 4 Steps

3.2.1 Run a perplexity incremental search for [topic] + [time window] latest news and lock down 1–3 specific changes (in the main thread; sub-agents are unreliable on web search)
3.2.2 Decide the patch mode (update key data in 00-index / add a new sub-file / append a section)
3.2.3 Patch existing files, run measure to check whether divergence rose, and commit per file (message format: patch(T<num>): <description>)
3.2.4 Run verify --still-valid to reset decay to 60 days; confidence goes medium → high
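
The per-file commit message format from step 3.2.3 is easy to generate and validate. A minimal sketch, assuming `<num>` is the numeric topic ID (as in T46); both helper names are hypothetical.

```python
import re

# patch(T<num>): <description>, with a non-empty description.
PATCH_MSG_RE = re.compile(r"^patch\(T(\d+)\): (\S.*)$")


def patch_commit_message(topic_num: int, description: str) -> str:
    """Build a commit message in the patch(T<num>): <description> format."""
    description = description.strip()
    if not description:
        raise ValueError("description must not be empty")
    return f"patch(T{topic_num}): {description}"


def is_patch_commit_message(message: str) -> bool:
    """True if the message follows the patch commit format."""
    return bool(PATCH_MSG_RE.match(message))
```

A consistent prefix keeps patch commits easy to separate from the original research commits in the log.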

measure side effects (observed):

Writes SETPOINT.json (coverage_history / stable_rounds / last_measure_at)
Writes state.db research_fsm.setpoint_json (dual-write cache)
Does not change research_fsm.state (FSM stays CONVERGING/DONE)
Does not write knowledge_conclusions (that’s verify --still-valid)

Step 3.3: T46 Gray State Special Path

One case skips the patch flow: the content is complete but the FSM metrics fail (state=INIT, low score from the coverage algorithm). The patch path does not apply because nothing has been concluded yet. The only path is:

conclude --adopted (manual, bypass metric check)
  ↓
After 60 days normal verify flow

Failed metrics ≠ bad content — Bailian embedding fragmentation on Chinese direction_text may show coverage 0.3–0.5 while content fully answers direction. Human judgment overrides algorithm; note in conclude report “coverage is known algorithm limitation; content substantively complete”.

6.6 Why Not Make Users “Remember to Review”?

People don’t remember. After 60 days you’ll likely forget or lack motivation. Decay turns “remember to review” into system push — expiry triggers TG; you answer Y/N.

Don’t rely on memory and willpower — encode them in the system — a judgment that recurs throughout this design.

7. Design Considerations: Why This Way, Not That Way

After building the system, a few judgments stand out as shaping the architecture. The reasoning is not “this is better” but “the other approach causes problems.”

7.1 Principles in SOP, Tools in Skill

SOP is canonical definition (location-independent); Skill is self-contained entry (reader understands from Skill alone, no cross-file hops). On conflict, SOP wins.

Principles need stability; tools need flexibility:

Principles (research ≠ blog, five segments, citation bidirectional binding, decay 60-day hard cap) are personal decisions — changing them means “business rules changed” — must live in one place to avoid inconsistent scattered definitions.
Tools (how to call perplexity API, how to run check_citation_quality.py) are implementation details — can change with versions — adjust as needed without re-deciding principles each time.

Test: if a sentence belongs at the top of SOP, don’t put it in Skill. Skill only carries “how to execute these principles with tools.”

Skill self-evolution risk: Hermes Skills self-learn, meaning the AI patches Skill files when it finds a better approach during execution. That is useful, but it has a side effect: Skill descriptions can drift from SOP principles unnoticed.

So Skill updates are treated as patches only. Each run may patch the Skill with what it learned, and the evidence accumulates. In daily/weekly reflection, judge whether a patch should merge back into the SOP:

Skill auto-patch (AI finds better approach during execution)
  ↓
Review Skill changes in daily / weekly reflection
  ↓
Judgment: tool detail optimization or principle-level change?
  ├─ Tool detail → stay in Skill, don't touch SOP
  └─ Principle change → manually merge to SOP, sync Skill to SOP

Don’t let Skill auto-drift replace conscious SOP decisions — Skill is lab; SOP is arbiter.

How this lands in the architecture: originally perplexity / brave-search / duckduckgo-search were three Skills, with deep-dive / cross-validate as two more. After the reorg, the three search Skills merged into one web-search umbrella (with auto fallback), the methodology of deep-dive and cross-validate moved to execute.md (workflow), and only the tool API layer stays in Skill. Deep dive and cross-validation are SOP principles that shouldn’t change because a search tool changed; that judgment came from weekly reflection, not from a Skill auto-patch.

7.2 Counterintuitive Design Choices

Don’t write “process docs” — write “feedback systems.” Traditional SOP is linear “step 1, step 2” — research has loops; at step 5 you may revise step 1’s direction. Linear docs can’t express that. State machine + feedback loops can.

Don’t pursue “full automation.” Research core is setpoint and final judgment — two value judgments machines can’t do. Forcing full automation automates machine bias.

Don’t stack tools. More tools ≠ better SOP. Research SOP core is feedback loop (three metrics + convergence quad + decay three branches); tools are actuators, not the loop itself.

Don’t write outlook chapters. Write if there’s content; skip if not. “AI research + control loop + SOP” already closes the loop — no forced “future outlook.” Such chapters are noise in engineering docs — readers need usable tools now, not predictions.

8. Where This Thinking Can Go

8.1 Wikipedia Wandering → Cybernetics → Research System

Back to the opening analogy. The correction built into Wikipedia wandering (“this concept isn’t clear, I’ll go back”) maps onto cybernetics terms like this:

Cybernetics role	Wandering intuition	Research system
SENSOR	“Feels off-track / incomplete / wrong source”	`sensor_measure()`: divergence + coverage + credibility
COMPARATOR	“Is this beyond my tolerance?”	FSM transition (three metric thresholds)
ACTUATOR	“Go back / switch source / abandon this line”	Switch search strategy + auto-fill + ESCALATION
SETPOINT	“What I originally wanted to understand”	User-confirmed direction text (direction_text)

“Research written well” and “research SOP written well” are different things. Former is execution (sub-file quality); latter is whether the system self-corrects when drifting.

8.2 Reusability: Feedback Loop for Any “Quality by Human Watch” Scenario

Feedback loops aren’t research-only. Any “quality by human watch” scenario can use the same pattern:

Code review: SENSOR = linter / test coverage / complexity; COMPARATOR = PR standard met?; ACTUATOR = block merge / require tests.
Stock daily review: SENSOR = daily P&L + vs expectation; COMPARATOR = stop-loss/take-profit triggered?; ACTUATOR = force close / adjust position.
Content creation: SENSOR = reads / bounce rate / comment quality; COMPARATOR = publish standard met?; ACTUATOR = trigger rewrite / unpublish.
Project management: SENSOR = schedule variance + risk signals; COMPARATOR = escalation triggered?; ACTUATOR = reallocate / start contingency.

Common pattern: turn “quality judgment” from subjective feel into measurable numbers; turn “exception handling” from ad-hoc reaction into preset paths.
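
The sensor / comparator / actuator split can be written down generically. A minimal sketch using the research system's three metrics; the threshold values and action strings are placeholders chosen for illustration, not the system's real settings, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """SENSOR output: the three metrics."""
    divergence: float
    coverage: float
    credibility: float


# Placeholder thresholds; real values would be calibrated per domain.
THRESHOLDS = {"divergence_max": 0.5, "coverage_min": 0.7, "credibility_min": 0.6}

# Preset paths: one action per kind of violation.
ACTIONS = {
    "divergence": "switch search strategy or escalate",
    "coverage": "auto-fill missing concepts",
    "credibility": "switch to a more credible source",
}


def comparator(reading: Reading, thresholds=THRESHOLDS):
    """COMPARATOR: list which metrics are outside tolerance."""
    violations = []
    if reading.divergence > thresholds["divergence_max"]:
        violations.append("divergence")
    if reading.coverage < thresholds["coverage_min"]:
        violations.append("coverage")
    if reading.credibility < thresholds["credibility_min"]:
        violations.append("credibility")
    return violations


def control_step(reading: Reading):
    """ACTUATOR: map each violation to its preset action."""
    return [ACTIONS[name] for name in comparator(reading)]
```

Swapping the metrics and actions for linter output and merge blocking gives the code review version of the same loop.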

8.3 Division of Labor: Machine Runs Loop, Human Sets Goal

Section 3 split research into two layers — direct control and organizational. Each does its job; clear boundary keeps the system stable.

Direct control handles high-frequency, rule-based work: measure after each sub-file, auto-fill when coverage low, switch strategy or ESCALATE when divergence high. Thresholds and code — machine runs alone.

Organizational layer intervenes at few nodes: Step 0 confirm setpoint, Step 1/2 “complete yet?”, ESCALATION direction OK? All value judgments — is content useful, should direction change — machines can’t do.

Two common traps:

Machine replaces human on final judgment → machine bias automated; research becomes “metrics pass but nobody wants it.”
Human audits every sub-file → feedback loop wasted; human back to pipeline QA.

Correct division: the direct-control layer drives toward the setpoint until completion; the organizational layer steps in only at key decision points.

Research engineering isn’t writing more docs — it’s making every step measurable, correctable, and pushable.