docs/content-tower-plan.md at main · veikka.tngl.sh/sunstead

Sunstead trust scoring project
Veikka Silvekoski Update sunstead: new modules (embed, voice, content, diffs, merged, vouchsafe), web UI, docs, scorer Dockerfile 2d ago
3df319f5
# Content Tower (Tier 1: frozen embeddings + calibrated head) — build plan

Self-contained build doc. You can clear the conversation and hand this to a fresh
agent. It captures the plan **and** the facts discovered while exploring the live data,
so nothing here depends on chat history.

---

## 0. What this is

The PRD fuses two independent signals through a **monotone gate** (not an average):

- **Tower A — identity trust** (per-DID, sybil-resistant, **load-bearing**): EigenTrust
  over the vouch graph. Already built (`eigentrust.py`).
- **Tower B — content risk** (per-PR, **identity-blind**): how risky *this diff* is,
  judged with no knowledge of the author. Today this is only Claude (`review.py`) at the
  gate, plus an advisory slop-kNN. **This doc builds the learned Tower B.**

Tier 1 = run each diff through the **already-wired embedding transformer**
(Featherless Qwen3-Embedding) and train a small **calibrated head** on `clean_merge`,
using only the diff. Transformer representation power, no fine-tuning, leakage-free by
construction (the model never sees identity).

Tier 2 (fine-tuning a code transformer) is **deferred** until there are ~10³–10⁴ labeled
diffs — below that it loses to frozen-embeddings + a linear head. Not in scope here.

### Non-negotiable constraints (PRD) — must hold for every phase
- Content models judge **content, never identity**: no author handle/DID/history/aggregates
  feed Tower B. Diff + PR-intrinsic stats (size, files, discussion length) only.
- Structural signal (Tower A / EigenTrust) stays load-bearing and sybil-resistant.
- The gate is **not an average**: content can only **penalize**, never lift an untrusted DID
  into the fast lane.
- Calibrated + explainable. Serve a learned model only if it **beats its baseline** on a
  proper holdout (same rule the GNN already follows).
- Serviceless: single DuckDB file, large artifacts under `DATA_ROOT`.

---

## 1. Critical path

```
Phase 0: fetch diffs (patchBlobs) ─┐
                                   ├─► Phase 2 embed ─► Phase 3 head ─► Phase 4 fuse ─► Phase 5 eval-gate
Phase 1: merged labels ────────────┘
```

**Phase 0 + Phase 1 are prerequisites for ANY content model** (transformer or not) and for
Claude review and slop-kNN — all three are dead without diffs. Start at Phase 0.

---

## 2. Live data facts (as observed)

- **Live DB**: `/Volumes/spectrofi-rec/tangled-data/duckdb/trust.duckdb`
  (`DATA_ROOT=/Volumes/spectrofi-rec/tangled-data`). The repo-local `.data/…` is a stale dev DB — ignore it.
- Backfill is rich but the derived/label layer is **stale** (it ran before some `derive()`
  branches existed). Snapshot:
  - events 83,991 · contributors 10,848 · **vouches 2,029 (+) / 37 (−)** · pulls 5,768
  - `seeds` = **0**, `pull_status` = **0** (collection was not in the old backfill — not even archived),
    `stars` = 0 (14,409 `feed.star` events archived but not re-derived), `diff_text` = **0**
    (patchBlobs never fetched), **0 positive `clean_merge` labels**, no trained model.
- **Read the live DB read-only with retry** (single-writer; a held lock blocks every open):
  ```python
  import duckdb, time
  DB = "/Volumes/spectrofi-rec/tangled-data/duckdb/trust.duckdb"
  con = None
  for _ in range(80):  # up to ~24 s of retries at 0.3 s each
      try:
          con = duckdb.connect(DB, read_only=True)
          break
      except duckdb.IOException:  # lock still held by the single writer
          time.sleep(0.3)
  assert con is not None, "DB still locked after 80 retries"
  ```
  Pause `ingest`/`api`/`backfill` before writing, or writes crawl on the lock.

### Record shapes you'll need (confirmed from the network)

`sh.tangled.repo.pull` (the diff is a gzipped blob, NOT inline):
```json
{ "rounds": [ { "createdAt": "...",
    "patchBlob": { "$type": "blob", "ref": { "$link": "<CID>" },
                   "mimeType": "application/gzip", "size": 49502 } } ],
  "source": { "branch": "..." },
  "target": { "branch": "...", "repo": "did:plc:…", "repoDid": "did:plc:…" } }
```
- `pr_id` convention (set in `ingest.derive`): `f"{author_did}/{collection}/{rkey}"`,
  e.g. `did:plc:X/sh.tangled.repo.pull/3mp…`.
- The **latest round** (`rounds[-1]`) is the final proposed change — embed/review that.

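
A minimal sketch of the two conventions above: building a `pr_id` and reading the patch CID from the latest round. The helper names `make_pr_id` and `latest_patch_cid` are hypothetical and not taken from the codebase; only the record layout and the `pr_id` format come from this doc. Returning `None` for a record without a usable round lets a caller skip it instead of aborting.

```python
from typing import Optional

PULL_COLLECTION = "sh.tangled.repo.pull"


def make_pr_id(author_did: str, rkey: str, collection: str = PULL_COLLECTION) -> str:
    """Build a pr_id using the documented "{author_did}/{collection}/{rkey}" convention."""
    return f"{author_did}/{collection}/{rkey}"


def latest_patch_cid(record: dict) -> Optional[str]:
    """Return rounds[-1].patchBlob.ref.$link, or None if any level is missing."""
    rounds = record.get("rounds") or []
    if not rounds:
        return None
    blob = rounds[-1].get("patchBlob") or {}
    return (blob.get("ref") or {}).get("$link")
```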
`sh.tangled.repo.pull.status` (authoritative outcome, public; sparse):
```json
{ "pull": "at://did:plc:X/sh.tangled.repo.pull/<rkey>",
  "status": "sh.tangled.repo.pull.status.merged" }
```
- `status` ends in one of `.merged` / `.closed` / `.open`.
- Status author may differ from the pull owner — parse `pr_id` from the `pull` field
  (`uri[len("at://"):]`), never from the status record's own did/rkey. (`derive()` already does this.)

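
The parsing rule above can be sketched as follows. This is an illustration, not the code in `derive()`: the names `pr_id_from_status` and `is_merged` are hypothetical, and the only facts assumed are the `pull` and `status` fields shown in the record.

```python
AT_PREFIX = "at://"


def pr_id_from_status(status_record: dict) -> str:
    """Derive pr_id from the status record's `pull` URI, not from the status author."""
    uri = status_record["pull"]
    if not uri.startswith(AT_PREFIX):
        raise ValueError(f"not an at:// URI: {uri!r}")
    return uri[len(AT_PREFIX):]


def is_merged(status_record: dict) -> bool:
    """True when the status value ends in `.merged`."""
    return status_record.get("status", "").endswith(".merged")
```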
Knot git clone URL (for the Phase-1 label backstop, git-on-knots): `https://{knot}/{owner_did}/{repo}`
(https, no auth for public repos; `git ls-remote` returns `refs/heads/main`).

### Existing code to build on
- `src/trust/embed.py` — `index_diffs(con, limit=256)` already embeds every `pull_requests.diff_text`
  into `diff_vectors(pr_id, label, embedding DOUBLE[])`, idempotent/resumable; `embed()` returns
  `None` without `FEATHERLESS_API_KEY`; `slop_score()` cosine-kNN vs `clean_merge=0`.
- `src/trust/backfill.py` — reuse `_pds(did)`, `_get(url)`, `_records(pds,did,coll)`, the
  `ThreadPoolExecutor` fan-out pattern, `_archive_and_derive`.
- `src/trust/db.py` — `pull_requests.diff_text`, `pull_status`, `diff_vectors`, `pr_labels`
  view (`clean_merge`), `connection(read_only=…)`, `ensure_schema()`.
- `src/trust/ingest.py` — `derive()` (pull / pull_status / star branches).
- `src/trust/learned.py` — copy its shape: `FEATURE_COLS`, `_vec`, `train(split)`,
  `LearnedScorer`, isotonic calibration, `_reliability`, `MODEL_PATH = MODEL_DIR/…`.
- `src/trust/fusion.py` — `score_pr`, `decide`, `should_review`, `_features_for`.
- `src/trust/config.py` — `CFG.embed` (Featherless), `CFG.review`, `MODEL_DIR`.

---

## 3. Phases

### Phase 0 — Fetch the diffs (new `src/trust/diffs.py`)
The highest-leverage unblock: lights up the content head, Claude review, **and** slop-kNN.

Steps:
1. Select pulls needing a diff: `SELECT pr_id, author_did FROM pull_requests WHERE diff_text IS NULL`,
   plus the matching archived record for each row.
   The CID lives in the archived `events.record` JSON (`rounds[-1].patchBlob.ref.$link`); join
   `events` on `(did, collection, rkey)` or re-read it.
2. For each: resolve `_pds(author_did)`, then
   `GET {pds}/xrpc/com.atproto.sync.getBlob?did={author_did}&cid={cid}` → bytes.
3. `gzip.decompress(bytes).decode("utf-8", "replace")` → unified-diff text. Cap stored length
   (~50 KB; embeddings/Claude truncate anyway). `UPDATE pull_requests SET diff_text=? WHERE pr_id=?`.
4. Parallelize like `backfill`: network fetch in a 12-thread pool, DB writes in chunks (single writer).
   Skip missing/oversized blobs gracefully (never abort the run).

Deliverable: `pull_requests.diff_text` populated for ~5,768 PRs (minutes of network).
Self-check: a `demo()` that fetches one known blob and asserts it gunzips to text containing `diff`/`@@`.

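
Steps 2 and 3 can be sketched as two pure helpers, leaving the network call and the DB write to the existing `backfill` utilities. Everything named here (`blob_url`, `decode_patch`, `MAX_DIFF_CHARS`, `MAX_BLOB_BYTES`) is hypothetical; the character cap stands in for the "~50 KB" figure above, and the compressed-size cap is an assumed value for "oversized". `decode_patch` returns `None` rather than raising, so one bad blob never aborts the run.

```python
import gzip
import zlib
from typing import Optional
from urllib.parse import urlencode

MAX_DIFF_CHARS = 50_000      # assumption: stand-in for the "~50 KB" storage cap
MAX_BLOB_BYTES = 5_000_000   # assumption: skip compressed blobs larger than this


def blob_url(pds: str, did: str, cid: str) -> str:
    """Build the com.atproto.sync.getBlob URL for one patch blob."""
    query = urlencode({"did": did, "cid": cid})
    return f"{pds.rstrip('/')}/xrpc/com.atproto.sync.getBlob?{query}"


def decode_patch(blob: bytes) -> Optional[str]:
    """Gunzip a patch blob to capped unified-diff text; None means 'skip this PR'."""
    if len(blob) > MAX_BLOB_BYTES:
        return None
    try:
        text = gzip.decompress(blob).decode("utf-8", "replace")
    except (OSError, EOFError, zlib.error):  # not gzip, truncated, or corrupt
        return None
    return text[:MAX_DIFF_CHARS]
```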
### Phase 1 — Merged labels (you need a positive class)
- Targeted scrape of `sh.tangled.repo.pull.status` (already mapped in `COLLECTION_KINDS` and handled
  in `derive()`): `python -m trust.backfill --collection sh.tangled.repo.pull.status` (capped first
  with `--max-repos`).
- **Measure positives** before building the head:
  `SELECT clean_merge, count(*) FROM pr_labels GROUP BY 1`.
- **Risk:** pull.status is sparse. If positives are only tens, the head is data-starved too.
  Backstop = **git-on-knots `merged` detection** (clone default branch via the knot URL above,
  check whether each pull's patch landed) for broad `merged` coverage + `reverted`/`re-patched`.
  Only build the backstop if pull.status coverage proves insufficient.

Deliverable: `pr_labels.clean_merge` with a real positive class (ideally a few hundred positives or more;
the trainer requires ≥4 rows spanning both classes as a hard floor).

### Phase 2 — Embed the diffs (frozen transformer)
- Set `FEATHERLESS_API_KEY`. Run `index_diffs` until it is caught up (loop while it returns > 0):
  ```python
  from trust.db import connection, ensure_schema
  from trust import embed
  ensure_schema()
  with connection(read_only=False) as con:
      while embed.index_diffs(con, limit=256): pass
  ```
- Optional GPU: self-host Qwen3-Embedding-4B (fits one GPU) to embed ~6k diffs locally for free
  instead of the API. The head itself is CPU-trivial.

Deliverable: `diff_vectors` filled for every PR with a diff.

### Phase 3 — The calibrated head (new `src/trust/content.py`, Tower B)
- `_xy(con)`: `X` = `diff_vectors.embedding` for PRs that have a non-NULL `clean_merge`
  (join `pr_labels`); `y` = `clean_merge`. Optionally concat **PR-intrinsic** scalars
  (`additions, deletions, files_touched, discussion_len`) and the slop-kNN similarity.
  **Never** identity/author features.
- **Model: L2-normalize the embedding → logistic regression (linear probe, L2-reg) → isotonic
  or Platt calibration.** Linear probe is correct for frozen embeddings at low data; LightGBM on
  raw 2560-dim embeddings overfits — keep it only as an alt.
- Time-split train/val (order by `opened_at`). Save `content.pkl` under `MODEL_DIR`.
- `ContentScorer.prob(pr_id) -> P(content safe)`; expose `content_risk = 1 - P`.
- Self-check `demo()`: on held-out PRs, a known-bad diff scores higher risk than a clean one;
  print the reliability curve.

Deliverable: a calibrated content risk for **every** PR (cheap, no API), not just reviewed ones.

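
The sketch below covers only the two preprocessing steps named above, L2 normalization and the time-ordered split, in plain Python. The logistic regression and the calibration step are left to whatever library `learned.py` already uses. The row layout `(pr_id, opened_at, embedding, clean_merge)` and the names `l2_normalize` and `time_split` are assumptions for illustration. Splitting by time rather than at random keeps later PRs out of training, so validation looks like future traffic.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical row layout: (pr_id, opened_at, embedding, clean_merge)
Row = Tuple[str, float, List[float], int]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def time_split(rows: Sequence[Row], val_frac: float = 0.2) -> Tuple[List[Row], List[Row]]:
    """Order rows by opened_at and hold out the most recent val_frac as validation."""
    if len(rows) < 2:
        raise ValueError("need at least 2 rows to split")
    ordered = sorted(rows, key=lambda r: r[1])
    # Keep at least one row on each side of the cut.
    n_val = min(len(ordered) - 1, max(1, round(len(ordered) * val_frac)))
    cut = len(ordered) - n_val
    return ordered[:cut], ordered[cut:]
```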
### Phase 4 — Fuse into the gate (monotone, unchanged)
- In `fusion.score_pr`: the head supplies `content_risk` for all PRs; Claude (`review_pr`, gated by
  `should_review`) refines ambiguous/sensitive ones. Combine conservatively:
  `content_risk = max(model_risk, claude_risk)` so content still only **penalizes**.
- Win: every PR gets a content signal; today only the Claude-reviewed subset does.
- Keep `decide()` and its thresholds; surface the head's risk in the explanation
  (`build_reason`) like the other factors.

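
A small sketch of the conservative combine, assuming risks are floats in [0, 1] and that a missing signal is represented as `None` (Claude skips most PRs). `fuse_content_risk` follows the `max(...)` rule above. `gated_trust` is a purely hypothetical stand-in for the real `decide()` logic, included only to show the monotone property: whatever the content risk is, the result never exceeds the identity trust.

```python
from typing import Optional


def fuse_content_risk(model_risk: Optional[float], claude_risk: Optional[float]) -> Optional[float]:
    """Take the worse (higher) of the available risks; None if neither is available."""
    risks = [r for r in (model_risk, claude_risk) if r is not None]
    return max(risks) if risks else None


def gated_trust(identity_trust: float, content_risk: Optional[float]) -> float:
    """Hypothetical monotone gate: content can lower the trust value, never raise it."""
    if content_risk is None:
        return identity_trust
    return min(identity_trust, 1.0 - content_risk)
```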
### Phase 5 — Eval + beat-the-baseline gate
- Calibration: reliability curve (reuse `learned._reliability`). Ranking: AUC / average precision.
- **Serve only if it beats**: (a) majority-class, (b) Claude-alone risk where available,
  (c) slop-kNN alone — on a **time-split AND a repo-holdout** (generalize to unseen repos).
- Write a verdict (like `gnn` does); `fusion` consults it before using the head.

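
One way to sketch the serve-only-if-it-beats rule, using AUC as the ranking metric. The function names, the `margin` parameter, and the dictionary of baseline scores are assumptions; the verdict format that `gnn` writes is not reproduced here. Scores are read as P(content safe) with `y = 1` meaning `clean_merge`, so a constant score models the majority-class baseline (AUC 0.5). Call it once per holdout (time-split and repo-holdout) and serve only if both pass.

```python
from typing import Dict, Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC by pairwise comparison (ties count 0.5); O(P*N), fine for a few thousand PRs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes in the holdout")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def beats_baselines(y_true: Sequence[int], head_scores: Sequence[float],
                    baselines: Dict[str, Sequence[float]], margin: float = 0.0) -> Dict[str, bool]:
    """Per-baseline result: True where the head's AUC exceeds the baseline's by more than margin."""
    head = auc(y_true, head_scores)
    return {name: head > auc(y_true, s) + margin for name, s in baselines.items()}


def should_serve(results: Dict[str, bool]) -> bool:
    """Serve only if every baseline was beaten."""
    return all(results.values())
```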
---

## 4. Effort & runtime

| Phase | Build | Runtime |
|---|---|---|
| 0 diffs (`diffs.py`) | ~1 hr | few min (network) |
| 1 labels (scrape) | wired | ~10 min capped |
| 2 embed (`index_diffs`) | done | few min (API) |
| 3 head (`content.py`) | ~1 hr | seconds |
| 4 fuse (`fusion.py`) | ~30 min | — |
| 5 eval-gate | ~30 min | seconds |

≈ half a day of build + minutes of runtime, given `FEATHERLESS_API_KEY` and enough Phase-1 positives.

## 5. GPU guidance
- **Tier 1 needs no GPU** — embedding runs on Featherless (remote); the head is CPU-trivial.
- Use a GPU now only to **self-host Qwen3-Embedding-4B** for free bulk embedding of ~6k diffs
  (skip API cost/limits).
- Save the GPU for **Tier 2** (fine-tuning CodeBERT/StarEncoder) — deferred until ~10³–10⁴
  labeled diffs exist.

## 6. Definition of done
- `diffs.py` populates `diff_text`; `pr_labels` has a positive class; `diff_vectors` filled.
- `content.py` trains a calibrated head, identity-blind, with a reliability curve.
- It **beats** majority / Claude-alone / slop-kNN on a time + repo holdout, else it doesn't serve.
- `fusion` consumes it monotonically (content only penalizes); explanation shows the content factor.
- Smoke test added (mirror `tests/test_smoke.py` style: `importorskip` the embedding path; assert a
  bad diff out-risks a clean one).

## 7. Parallel unblock (not this tower, but the other gating item)
Structural scoring is still blocked by **`seeds = 0`** + stale derives. Independent of Tower B:
1. `--rederive` from archived `events` (no network) → repopulates `stars` (and any archived
   collections) through the current `derive()`.
2. Seed real maintainer DIDs — top vouch-receivers are the anchors:
   `did:plc:onu3oqfahfubgbetlr4giknc` (141 in), `did:plc:wshs7t2adsemcrrd4snkeqli` (89),
   `did:plc:qfpnj4og54vl56wngdriaxug` (56)…  → `INSERT INTO seeds …`.
3. `trust-train` once labels (Phase 1) exist.

EigenTrust (Tower A) and the content head (Tower B) can be built in either order; the gate needs both.