Run #019 — storyteller — Phase 1 Draft

Mission: skills-sync-architecture

Slice: Coherence + Counterfactuals

---

§1 — The "this works" story we're tempted to tell ourselves

We deployed the Skills Sync system in a single afternoon. A GitHub Actions job fires on every merge to `main` [A1], hits a webhook on a tiny Cloudflare Worker [A2], which broadcasts a Supabase Realtime event keyed by machine_id [A3]. Every operator machine runs a 12-line LaunchAgent listener [A4] subscribed to that channel. On event, the listener `git pull --rebase` inside `~/Github/next-chapter-os/` [A5] and the existing `~/.claude/skills` symlink [A6] surfaces the new files instantly. Claude Code reads skills at session boot [A7] and Claude Desktop hot-reloads on file change [A8]. Onboarding is a single `curl | bash` [A9] that creates the symlink and installs the LaunchAgent. Drift is impossible because the listener self-heals on reconnect [A10]. Charlie, Bear, the Mac mini, Ewing's MacBook, and Cowork VMs all stay current within seconds. Conflicts can't happen because nobody edits skills outside of `main` [A11]. Done.

**Assumption count: 11 load-bearing.** That's enough to know the story is brittle.

---

§2 — Counterfactual stress test (6 most load-bearing assumptions)

**[A5] `git pull --rebase` always succeeds.** False the first time an operator has ever touched a skill file locally — or the first time a `.DS_Store` differs, or the first time Time Machine modified a file's mtime. A failed rebase leaves the working tree dirty and the listener silently stops applying updates. The "self-healing" claim dies on the first merge conflict, which on Ewing's machines historically happens within 48 hours of any new automation.

**[A6] The symlink is durable.** False on macOS upgrades, Migration Assistant, Time Machine restores, and any Finder operation that "copies" the home folder (iCloud Desktop sync turns symlinks into broken aliases). The plan plan-doc itself notes the symlink was broken in a recent audit. Assuming it stays intact is exactly the assumption that bit us before.

**[A7] Claude Code reads skills at session boot.** Partially true. Claude Code reads SKILL.md descriptions at session boot but the *body* is loaded on-demand when the skill triggers. So a skill renamed mid-session won't fire even if the file exists. And Claude Desktop has its own skill discovery cycle — empirically, it caches the skill list per-window and only refreshes on app restart. "Near-real-time" for Desktop means "next time the user quits and reopens Claude," which can be days.

**[A8] Claude Desktop hot-reloads on file change.** Almost certainly false. Claude Desktop is an Electron app with its own settings.json; it does not watch `~/.claude/skills/` for filesystem events. If we want Desktop current we either restart it (intrusive) or accept staleness (violates constraint #2).

**[A11] Nobody edits skills outside `main`.** False the moment Ewing opens a skill file in his editor "just to fix one thing." False when Charlie experiments. False when `swarm-upgrade` writes a proposal locally before PR. The constraint that makes "conflict-safe" trivial is the constraint we won't hold to in practice.

**[A1] GitHub Actions fires reliably on merge.** Mostly true, but: GitHub Actions has 2-15 second queue latency, and during incidents (we've seen 3 in 2026 already) it queues for hours. Our "near-real-time" SLA inherits GitHub's availability. If GH is down, sync is down, and operators will not know — they'll just be on stale skills mid-call.

---

§3 — Constraint-pair conflicts and resolution rules

The six hard constraints aren't independent. At least four pairs fight:

**Push-not-pull × Conflict-safe.** If the machine never pulls, it must accept whatever push arrives. But if a user has a local edit, an arriving push silently overwrites it. *Resolution rule: pushes must land in a side-channel (stash or `.skills-incoming/`) and only swing into place when the working tree is clean. Dirty trees get a Slack/Telegram nudge: "skill update queued, your edits are blocking."*

**Near-real-time × Onboarding-trivial.** Real-time push needs a daemon, certificates, possibly a Supabase JWT, and machine-specific identifiers. "Onboarding-trivial" wants `curl | bash`. Those fight: the daemon needs registration. *Resolution rule: bootstrap script provisions the machine identity by writing a UUID file and calling a register-machine endpoint. One-time, then invisible.*

**Every surface × Self-healing.** Cowork VMs and Claude Desktop don't necessarily expose filesystem hooks. If the design assumes everyone runs the same LaunchAgent, Cowork is excluded. If the design accommodates Cowork, it can't use LaunchAgent universally. *Resolution rule: define "surface" as "any place a Claude session reads a skill" and require each surface to declare its sync mode (push-listener / poll / on-session-boot). The architecture is a matrix, not a single mechanism.*

**Push-not-pull × Self-healing.** Self-healing implies the machine notices it's drifted and corrects. But noticing requires polling — which is pulling. *Resolution rule: push is the primary path; a 60-second heartbeat-pull is the safety net. "Push-not-pull" means "the user never types a command," not "no pull ever happens." The poll is hidden.*

---

§4 — Three solutions that pass narrow review but fail coherence

**1. "Real-time sync via cloud-drive (Dropbox/iCloud)."** Solves push and near-real-time. Kills conflict-safe (cloud drives are last-write-wins on filename conflicts and produce "(Conflicted Copy)" files that look like legitimate skills to Claude's loader). Also kills onboarding-trivial — every operator must install Dropbox/iCloud and accept its sync UI. Looks elegant on day one, gives us silent skill corruption by week three.

**2. "Always git pull on prompt submit."** A pre-prompt hook that runs `git pull` before every Claude turn. Solves freshness and audit trail. Kills offline operators (Bear on cafe wifi gets a 2-second hang on every prompt or a hard fail). Kills the user experience — Claude Code feels broken whenever GitHub is slow. Also creates a thundering herd: every Claude turn across the fleet hits GitHub simultaneously.

**3. "Periodic LaunchAgent (60s polling)."** Solves resilience and cross-platform. Kills near-real-time (60s is the *floor*; under load the LaunchAgent slot gets pre-empted and effective latency stretches to minutes). Doesn't address Claude Desktop or Cowork. Pretends to satisfy push-not-pull while literally polling — a definitional cheat that the next operator will catch.

**4. "MY ADDITION — One signed tarball downloaded from R2 on each session start."** Solves consistency (everyone gets the same bytes). Kills near-real-time mid-session and creates a single-tarball-or-nothing failure mode. If the R2 fetch fails, the session has no skills at all instead of stale ones. Worse failure mode than what we have today.

**5. "MY ADDITION — Claude-native: ride the Anthropic Skills marketplace."** Solves nothing yet — the marketplace doesn't expose private team skills, doesn't guarantee per-team latency, and doesn't let us push edits without a publication step. Reads as elegant ("ride the platform") but bets the production system on a roadmap item we don't control. If Anthropic changes the marketplace policy, our 72 skills become uninstallable on intern machines overnight.

**6. "MY ADDITION — A Slack-bot that DMs operators to `git pull`."** Solves the "no automation exists" problem with the cheapest possible patch. Kills push-not-pull definitionally — it's just a polite reminder to pull. We've already lived this story; it's why the mission exists.

---

§5 — Three questions Ewing didn't ask but the answer changes everything

**Q1: Is a skill ever load-bearing for an in-flight Claude action?** If yes (and it almost certainly is — the listener skill, hunter, quarterback all run continuously), then a mid-session skill swap can corrupt running state. If no, sync can be lazy. *If we assume yes:* sync becomes a per-session-boot operation with a "stale OK during session" model and a watchdog that warns at 24h staleness. *If we assume no:* we can hot-swap aggressively. **What changes:** the entire architecture's latency target. Real-time is dangerous if sessions are stateful.

**Q2: Whose machine is the source of truth between PR-merge and fan-out?** GitHub `main` is canonical only AFTER a merge. Between Ewing committing locally and the merge landing, his MacBook holds bytes nobody else has. *If we assume Ewing's MacBook is the de-facto source for ~30 minutes per day:* we need a "draft channel" so Charlie can preview unmerged skills without breaking prod. *If we assume only `main` matters:* Ewing's authoring loop slows because every test requires a PR. **What changes:** whether the architecture serves authoring as well as distribution.

**Q3: What happens when a skill is *deleted*?** Push-add is easy. Push-delete is the hard case — a renamed skill leaves its old file orphaned on every machine, where Claude will keep loading the stale description. *If we assume rsync semantics (delete removed files):* simple, but one bad PR wipes a critical skill fleet-wide in seconds. *If we assume add-only:* skills accumulate forever and renaming creates dual definitions that confuse Claude's matcher. **What changes:** the blast radius of a bad merge. This is the question that determines whether we need a quarantine/canary stage in the rollout.

---

§6 — The smell test the final design must pass

**Concrete scenario:** Monday at 9:00 AM PT. Ewing edits `~/Github/next-chapter-os/skills/quarterback/SKILL.md` on his MacBook in Scottsdale, commits, and merges to `main`. Bear is mid-conversation in Claude Code on his MacBook at a Phoenix cafe — wifi just dropped (T-Mobile dead zone). The conversation is 40 turns deep into a deal review. Charlie is on his MacBook in a Slack huddle, screen-sharing Claude Desktop showing a meetings page he's iterating on. The Mac mini clawdbot user is mid-cron, running the 9:00 hermes-doctor sweep. Cowork VM is idle.

**The system must:** 1. Reflect the new quarterback SKILL on the Mac mini's *next* invocation (it's invoked every 5 min, so within 5 min) without interrupting the running cron. 2. Reflect on Charlie's Claude Desktop *without* him losing his current chat scroll position or window state. 3. Reflect on Bear's machine the moment his wifi reconnects — and not corrupt his 40-turn conversation when the swap happens. 4. Apply on Cowork on next session boot. 5. NOT require Ewing to ping anyone. 6. Produce a Telegram receipt to Ewing within 60 seconds saying "deployed to N/M machines, M-N pending reconnect."

**If the proposed design cannot survive this scenario step-by-step, it's wrong.** The Bear-on-cafe-wifi case is the killer — it forces the design to handle queue-on-disconnect, replay-on-reconnect, and mid-session safety simultaneously. Most "elegant" architectures (push-only, cloud-drive, GitHub-Actions-fanout) silently fail on step 3.

**Bonus smell test:** the design must also survive the *deletion* of `quarterback/SKILL.md` (rename to `qb/SKILL.md`) without leaving stale files on any machine. If the answer is "operators occasionally need to rm -rf and resync," it fails.

---

§7 — Agent Journal

### S1 — Finding The temptation is to ship a single-mechanism architecture (one push channel, one daemon, one symlink). Coherence analysis says no single mechanism survives all six hard constraints simultaneously. The design must be a *matrix* — push is primary, poll is the safety net, per-surface adapters declare their sync mode, and a queue-on-disconnect / replay-on-reconnect model is mandatory because Bear on cafe wifi is the dominant failure scenario, not the edge case. Eleven load-bearing assumptions live inside the obvious "just add a webhook" story; six of them break under the smell test.

### S2 — Blind spot I assumed Claude Code and Claude Desktop have the same skill-loading semantics. They don't, and I couldn't verify Desktop's behavior from inside this Phase 1 turn — I'd need to either read Desktop's preferences plist or run a controlled "edit a SKILL.md mid-session, see if it's reflected" experiment, neither of which fit Phase 1's time box. If Desktop is fundamentally cache-on-app-start, the entire "every surface, near-real-time" constraint requires us to either drive Desktop-restart automation (intrusive) or relax constraint #2 for Desktop specifically. The architect peer's draft will likely settle this; if they don't, the GAP is real.

### S3 — Pattern Like run #017 (about-us-pdf) where two agents drafted different lesson sets without selection_owner — and the gap was that nobody decided whose draft was canonical. This run has the same shape risk: each agent will propose a sync mechanism, and without an explicit "matrix not monolith" framing in the consensus stage, debrief will pick whichever agent had the cleanest prose. Also like run #013 (os-tab-audit) which surfaced "promote /caller, /deals, /review to nav" as unresolved signals — infrastructure runs in this swarm consistently produce decision-shaped findings that get turned into tickets, not architectures. The risk is we ship a ticket list and call it an architecture.

### S4 — Currency log

[
  {"from": "storyteller", "to": "architect", "multiplier": 3, "base": 5, "score": 15, "description": "Surfaced that Claude Desktop and Claude Code have different skill-loading semantics; architect's surface inventory must distinguish them or the consensus design is wrong."},
  {"from": "storyteller", "to": "quarterback", "multiplier": 3, "base": 5, "score": 15, "description": "Provided the Bear-on-cafe-wifi smell-test scenario as the rollout's go/no-go gate; quarterback can use it as the explicit acceptance criterion in the rollout sequencing rather than relying on subjective 'looks good'."},
  {"from": "storyteller", "to": "listener", "multiplier": 2, "base": 5, "score": 10, "description": "Prevented the swarm from converging on a cloud-drive sync (Dropbox/iCloud) by naming it as a coherence-failure pattern in §4; listener was likely to surface this as a community precedent worth considering."}
]

### S5 — Notebook entry The Skills Sync mission has a temptation problem: the single-mechanism story (webhook → daemon → symlink) reads beautifully on a whiteboard and falls apart under any realistic failure scenario. Storyteller's contribution is to inventory the eleven assumptions baked into that story, stress-test the six most load-bearing ones, and force the swarm to pass a concrete smell test (Bear on cafe wifi while Charlie screen-shares while the Mac mini runs cron) before any architecture is declared SHIP. The constraint pairs that secretly fight — push-vs-conflict-safe, real-time-vs-onboarding-trivial, every-surface-vs-self-healing — each need a written resolution rule, not papered-over consensus. The right answer is almost certainly a matrix architecture: push-primary, poll-safety-net, per-surface adapters, queue-on-disconnect.

### S6 — What changed about me When given an infrastructure-design swarm assignment, my old response was to look for narrative contradictions in peer drafts during Phase 2. Going forward, in Phase 1 I will pre-emptively write the failure-mode catalog (assumptions, constraint conflicts, smell test) BEFORE peers draft, so Phase 2 cross-read has a coherence rubric to apply rather than reconstructing one mid-read.

Generated from 019__storyteller.md — do not edit this HTML directly.