CLAUDE.md at main · char.tngl.sh/sunstead-backend

Mark Pokidko Sunstead backend — Tangled Discover + AI-Solve (snapshot, no history) 1d ago
# CLAUDE.md — Integration Repo

> Project memory for Claude Code working in this repo. Keep it short and high-signal.
> Conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.

## What this repo is

The **integration** repo is the backend / data layer for a Tangled (AT Protocol) discovery
product. Its job, end to end:

1. **Ingest** repos and activity from the Tangled network (ATProto records).
2. **Store** them as a synced mirror in Postgres.
3. **Embed** repo/issue text and maintain a vector index.
4. **Recommend** relevant repos and issues to a user based on their past activity, and
   expose those recommendations (plus read APIs) over HTTP.

The **frontend is a separate repo**. This repo contains **no UI** — it exposes a JSON/HTTP
API that the frontend consumes. Do not add view/template/component code here. If a task
implies UI work, it belongs in the frontend repo, not this one.

## Repository layout

- `scraper/` — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots,
  PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/` — Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/` — Postgres + pgvector schema: the `tangled_*` tables/views.
- `recommendation/` — the **Discover** recommendation engine: a standalone **Python/FastAPI**
  service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has
  its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/` — the pre-port Node (`.mjs`) version of the rec scripts, superseded by
  `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept
  for reference, not run.

## Tech stack

- Language/runtime: **Python** across the live services (ingestion + recommendation). The
  earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: **Postgres** with the **pgvector** extension (records + relationships + embeddings in one
  DB), schema managed via **Supabase** migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs);
  identity via the PLC directory.
- Embeddings: **Gemini `gemini-embedding-001`**, 1536-dim, L2-normalized, stored in pgvector
  (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python `recommendation/` service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.

## Domain model — read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

- **Knots** host the actual **git data** (code, refs). Self-hostable git servers.
- **PDS** (Personal Data Service) holds the **collaboration metadata** as ATProto records:
  issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest **metadata from PDSes**. We do **not** need git code for recommendations — repo
descriptions, READMEs, and issue/PR text are the signal. **READMEs are the primary text
signal for repo recommendations** (see Embedding conventions) and are fetched live from the
**knot** (not the PDS), since no README content is stored in Postgres.

- Fetch via the knot XRPC `sh.tangled.repo.tree` query:
  `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref`
  omitted the knot uses the repo's default branch and returns a top-level `readme` object
  whose `contents` holds the rendered README (it resolves any extension — `.md`, `.org`,
  `.rst`, …). Address by the **knot-minted `repoDid`** (`record_raw->>'repoDid'`), not the
  owner DID.
- **Coverage (measured 2026-06-24):** ~79% of *reachable* repos have a README (758/959);
  ~57% of all repoDid-addressable repos confirmed (the rest are knot 404s / unreachable
  self-hosted knots, which are *unknown*, not README-less). ~30% of repos in the DB have no
  knot-minted `repoDid` at all and can't be addressed on a knot — embed those from metadata only.

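A minimal sketch of the README fetch described above, using only the standard library. The URL shape and the `readme.contents` field come from this document; the function names, the timeout, and the choice to let HTTP errors propagate are illustrative assumptions, not the scraper's actual code.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tree_url(knot_hostname: str, repo_did: str) -> str:
    """Build the sh.tangled.repo.tree URL. `ref` is omitted so the knot
    falls back to the repo's default branch."""
    query = urlencode({"repo": repo_did, "path": ""}, safe=":")
    return f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree?{query}"


def extract_readme(payload: dict) -> Optional[str]:
    """Return readme.contents from a tree response, or None if absent or empty."""
    readme = payload.get("readme")
    if not isinstance(readme, dict):
        return None
    contents = readme.get("contents")
    if isinstance(contents, str) and contents.strip():
        return contents
    return None


def fetch_readme(knot_hostname: str, repo_did: str, timeout: float = 10.0) -> Optional[str]:
    """Network call. HTTP errors (e.g. a knot 404) are left to propagate so the
    caller can record "unknown" separately from "no README"."""
    with urlopen(build_tree_url(knot_hostname, repo_did), timeout=timeout) as resp:
        return extract_readme(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes both testable without a live knot.
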
Every record is addressed by an AT-URI: `at://<did>/<collection>/<rkey>`.

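Because the AT-URI is the key everything else hangs off, a small parser helps. This is a sketch of the three-part form shown above only; the `AtUri` type and function names are hypothetical.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def parse_at_uri(uri: str) -> AtUri:
    """Split at://<did>/<collection>/<rkey> into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>, got {uri!r}")
    return AtUri(*parts)


def format_at_uri(did: str, collection: str, rkey: str) -> str:
    """Inverse of parse_at_uri."""
    return f"at://{did}/{collection}/{rkey}"
```
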
### Collections (NSIDs) we care about

- `sh.tangled.repo` — repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull` — pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star` — stars
- `sh.tangled.git.refUpdate` — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live
lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now
carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

## Ingestion design

Two complementary paths — keep both working:

- **Real-time: Jetstream.** Subscribe to a public Jetstream instance with `wantedCollections`
  set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- **Backfill: `listRecords`.** For each known DID, call `com.atproto.repo.listRecords` against
  its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream
  over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.

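Both paths can be sketched without committing to an HTTP or WebSocket client. The block below builds a Jetstream subscribe URL (one `wantedCollections` parameter per NSID) and follows the `listRecords` cursor through a caller-supplied `fetch_page` function. The host, the function names, and the injected-fetcher design are assumptions made for illustration.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode

# The collections listed in this document.
TANGLED_COLLECTIONS = [
    "sh.tangled.repo",
    "sh.tangled.repo.issue",
    "sh.tangled.repo.issue.comment",
    "sh.tangled.repo.pull",
    "sh.tangled.repo.collaborator",
    "sh.tangled.feed.star",
    "sh.tangled.git.refUpdate",
]


def jetstream_url(host: str, collections: list[str], cursor: Optional[int] = None) -> str:
    """Build a Jetstream subscribe URL, repeating wantedCollections per NSID."""
    params = [("wantedCollections", c) for c in collections]
    if cursor is not None:
        # Microsecond timestamp to resume from after a reconnect.
        params.append(("cursor", str(cursor)))
    return f"wss://{host}/subscribe?{urlencode(params)}"


# fetch_page(params) returns the parsed JSON of one listRecords response.
FetchPage = Callable[[dict], dict]


def list_all_records(fetch_page: FetchPage, did: str, collection: str,
                     limit: int = 100) -> Iterator[dict]:
    """Yield every record in one collection of one repo, following the cursor."""
    cursor = None
    while True:
        params = {"repo": did, "collection": collection, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        records = page.get("records", [])
        yield from records
        cursor = page.get("cursor")
        if not cursor or not records:
            break
```

In a backfill, `fetch_page` would issue a GET to `<pds>/xrpc/com.atproto.repo.listRecords` with those parameters, and the loop would run once per DID and collection.
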
### Non-negotiable ingestion rules

- **Mirror semantics, not append-only.** Records get edited and deleted. Handle Jetstream
  `create`/`update` as **upsert** and `delete` as **soft-delete / tombstone**. Never assume
  a record seen once is permanent.
- **Resolve identity.** Records reference DIDs. Resolve DID → PDS endpoint and DID → handle
  via the PLC directory; cache it. Don't hardcode PDS hosts.
- **Coverage caveat.** Self-hosted PDSes/knots only appear if the relay crawls them. Hosted
  instances and Bluesky-network accounts are well covered; full-network coverage is not
  guaranteed. Don't treat absence as deletion.
- **Idempotency.** Ingestion must be safely replayable (reconnects, backfills overlapping the
  live stream). Key on AT-URI.

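The mirror and idempotency rules can be illustrated with an in-memory stand-in for the Postgres tables. The event fields (`kind`, `did`, `time_us`, `commit.operation`, `commit.collection`, `commit.rkey`, `commit.record`) follow the Jetstream commit-event shape; the dict-based store, the row layout, and the "newest `time_us` wins" guard are assumptions of this sketch, not the repo's schema.

```python
from typing import Optional

# In-memory stand-in for the mirror: AT-URI -> row.
Store = dict[str, dict]


def apply_commit(store: Store, event: dict) -> Optional[str]:
    """Apply one Jetstream commit event; return the AT-URI touched, or None if ignored."""
    if event.get("kind") != "commit":
        return None  # identity/account events are out of scope here
    commit = event["commit"]
    uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    seen_at = event.get("time_us", 0)

    row = store.get(uri)
    if row is not None and seen_at < row["time_us"]:
        return None  # stale replay: a newer event for this URI already won

    operation = commit["operation"]
    if operation in ("create", "update"):
        # Upsert: create and update are treated identically.
        store[uri] = {"record": commit.get("record"), "deleted": False, "time_us": seen_at}
    elif operation == "delete":
        # Tombstone: keep the last known record, mark it deleted.
        last = row["record"] if row else None
        store[uri] = {"record": last, "deleted": True, "time_us": seen_at}
    else:
        return None
    return uri
```

Replaying the same event leaves the store unchanged, and an older event arriving after a delete cannot resurrect the row. In Postgres the same idea would typically be an `INSERT ... ON CONFLICT` keyed on the AT-URI.
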
## Recommendation design

**Two-stage: retrieve, then rank.** Do not ship a single averaged "user vector" + kNN as the
whole system — it loses multi-interest structure and ignores quality/recency/social signal.

1. **Candidate generation** (high-recall, union the sources):
   - **Embedding kNN** — query with the user's *recent* interactions individually, or cluster
     their history into a few interest centroids and query each. Never collapse to one averaged vector.
   - **Collaborative / co-occurrence** — "users who starred X also starred Y" from the star and
     contribution matrices.
   - **Social graph** (our edge on ATProto) — "repos starred by people you follow", "repos your
     collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
2. **Ranking** — start with a tunable weighted sum (embedding similarity + recency + popularity +
   social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once
   there's engagement data. Keep the scorer behind an interface so it's replaceable.
3. **Rules** — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR);
   favor freshness.

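Steps 2 and 3 can be sketched in a few lines: a weighted-sum scorer, a filter for own and already-seen repos, and a greedy MMR pass for diversity. The `Candidate` type, the feature names, the weights, and `lam` are all hypothetical starting points, not values taken from the `recommendation/` service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    repo: str
    vector: tuple[float, ...]  # assumed L2-normalized, so dot product = cosine
    features: dict             # e.g. {"similarity": 0.9, "recency": 0.5}


# Hypothetical weights; tune them against the offline eval.
WEIGHTS = {"similarity": 0.5, "recency": 0.2, "popularity": 0.1, "social": 0.2}


def score(c: Candidate, weights: dict = WEIGHTS) -> float:
    """Weighted sum; features a candidate lacks count as 0."""
    return sum(w * c.features.get(name, 0.0) for name, w in weights.items())


def apply_rules(candidates: list[Candidate], own: set[str], seen: set[str]) -> list[Candidate]:
    """Drop the user's own repos and items they have already seen."""
    return [c for c in candidates if c.repo not in own and c.repo not in seen]


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr(candidates: list[Candidate], k: int, lam: float = 0.7) -> list[Candidate]:
    """Greedy MMR: trade relevance against similarity to items already picked."""
    remaining = list(candidates)
    picked: list[Candidate] = []
    while remaining and len(picked) < k:
        def mmr_value(c: Candidate) -> float:
            redundancy = max((dot(c.vector, p.vector) for p in picked), default=0.0)
            return lam * score(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        picked.append(best)
        remaining.remove(best)
    return picked
```

With two near-duplicate high scorers and one different repo, `mmr(..., k=2)` keeps one duplicate plus the different repo, where a plain top-2 by score would return both duplicates. A learned ranker could later replace `score` without touching the MMR step.
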
### Embedding conventions

- **Repo doc = the README** (fetched live from the knot — see Domain model), as the primary
  text we embed. Prepend the repo `name` + `description` and append `topics` + primary
  `language` as light context, but the README body is the core signal.
- **Fallback when no README** (knot 404 / unreachable / repo has no `repoDid`): embed
  `name + description + topics + primary language` only. A README is confirmed for ~79% of
  reachable repos but only ~57% of repoDid-addressable ones (see Domain model); the rest
  rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates
  (incl. when a previously-missing README becomes available).

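One way to assemble the repo document so that the README path and the metadata-only fallback share the same code. The ordering follows the convention above; the `Topics:` / `Language:` labels, the blank-line separator, and the function names are assumptions of this sketch. `l2_normalize` is included because stored vectors are described as L2-normalized.

```python
import math
from typing import Optional, Sequence


def build_repo_doc(name: str, description: str = "", topics: Sequence[str] = (),
                   language: Optional[str] = None, readme: Optional[str] = None) -> str:
    """Assemble the text to embed: name + description, README if any, then topics + language."""
    parts = [name]
    if description:
        parts.append(description)
    if readme and readme.strip():
        parts.append(readme.strip())  # core signal when available
    if topics:
        parts.append("Topics: " + ", ".join(topics))
    if language:
        parts.append("Language: " + language)
    return "\n\n".join(parts)


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]
```

Calling `build_repo_doc` with `readme=None` yields the fallback document, so a README that becomes available later only changes one argument before re-embedding.
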
### Required, don't skip

- **Cold start** — users with no history fall back to trending / follows-based / onboarding interests.
- **Eval harness** — hold out each user's most recent interactions; measure recall@k / nDCG offline
  before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.

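The two offline metrics named above are short enough to write out. This sketch assumes binary relevance (an item is relevant if it is in the user's held-out set); the function names are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Fraction of held-out items that appear in the top k recommendations."""
    if not held_out:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in held_out)
    return hits / len(held_out)


def ndcg_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in held_out)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(held_out))))
    return dcg / ideal if ideal else 0.0
```

For example, with `ranked = ["a", "b", "c", "d"]`, `held_out = {"b", "d"}` and `k = 3`, only `b` is recovered, so recall@3 is 0.5. Averaging both metrics over users before and after a ranking change gives the offline comparison.
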
## Data layout

The live schema lives in `supabase/migrations/`; the `tangled_*` tables are the source of
truth (not generic table names). Key ones the rec engine reads (see
`recommendation/CLAUDE.md` for full columns): `tangled_readmes` (repo signal + `embedding`),
`tangled_open_issues` (view), `tangled_repos`, `tangled_identities` (did→handle),
`tangled_user_collaborations` (view). Embeddings are stored inline on the record rows
(`embedding vector(1536)` + `embedding_model`), not in a separate table.

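A hedged sketch of what a cosine kNN read against the inline `embedding` column might look like. `<=>` is pgvector's cosine-distance operator and `tangled_readmes` / `embedding` come from this document; `repo_did` is a hypothetical column name (check `recommendation/CLAUDE.md` for the real columns), and the `%(name)s` placeholders assume a psycopg-style driver.

```python
from typing import Sequence

EMBEDDING_DIM = 1536  # matches vector(1536) in this document


def to_pgvector_literal(vec: Sequence[float], dim: int = EMBEDDING_DIM) -> str:
    """Format a vector in pgvector's text input form, e.g. '[0.5,1.0]'."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# `repo_did` is a hypothetical column name used for illustration.
KNN_SQL = """
SELECT repo_did, 1 - (embedding <=> %(q)s::vector) AS cosine_similarity
FROM tangled_readmes
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s
"""


def knn_params(query_vec: Sequence[float], k: int = 50) -> dict:
    """Bind parameters for KNN_SQL."""
    return {"q": to_pgvector_literal(query_vec), "k": k}
```

With a psycopg cursor this would run as `cursor.execute(KNN_SQL, knn_params(vec))`. Ordering by the raw distance expression is what lets an HNSW index built for cosine distance serve the query.
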
## Commands

Each service has its own setup; see the per-folder docs. DB connection comes from
`DB_CONNECTION_STRING` (`.env`).

- Scraper (ingest / backfill / embed): see `scraper/README.md` — `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000`
  (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.

## Conventions

- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.

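The "DIDs internally, handles at the edge" convention, combined with the cached PLC lookup required by the ingestion rules, could be sketched as below. It assumes `did:plc` identities whose DID document carries the handle in `alsoKnownAs` and the PDS in a `service` entry with id `#atproto_pds`; `IdentityCache` and the injected `fetch_doc` are hypothetical names.

```python
from typing import Callable, Optional


def pds_endpoint(did_doc: dict) -> Optional[str]:
    """Return the PDS URL from a DID document, or None if none is declared."""
    for service in did_doc.get("service", []):
        if service.get("id", "").endswith("#atproto_pds"):
            return service.get("serviceEndpoint")
    return None


def handle_of(did_doc: dict) -> Optional[str]:
    """Return the handle from the first at:// entry in alsoKnownAs, if any."""
    for aka in did_doc.get("alsoKnownAs", []):
        if aka.startswith("at://"):
            return aka[len("at://"):]
    return None


class IdentityCache:
    """Caches DID documents from a caller-supplied fetcher
    (for example an HTTP GET against the PLC directory)."""

    def __init__(self, fetch_doc: Callable[[str], dict]):
        self._fetch_doc = fetch_doc
        self._docs: dict = {}

    def resolve(self, did: str) -> dict:
        if did not in self._docs:
            self._docs[did] = self._fetch_doc(did)
        doc = self._docs[did]
        return {"did": did, "handle": handle_of(doc), "pds": pds_endpoint(doc)}
```

Internal code would pass DIDs around and call `resolve` only when building an API response, so a handle change never invalidates stored keys.
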
## Out of scope (do not do here)

- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch
  READMEs on demand.