# CLAUDE.md — Integration Repo

> Project memory for Claude Code working in this repo. Keep it short and high-signal.
> Conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.

## What this repo is

The **integration** repo is the backend / data layer for a Tangled (AT Protocol) discovery
product. Its job, end to end:

1. **Ingest** repos and activity from the Tangled network (ATProto records).
2. **Store** them as a synced mirror in Postgres.
3. **Embed** repo/issue text and maintain a vector index.
4. **Recommend** relevant repos and issues to a user based on their past activity, and
   expose those recommendations (plus read APIs) over HTTP.

The **frontend is a separate repo**. This repo contains **no UI** — it exposes a JSON/HTTP
API that the frontend consumes. Do not add view/template/component code here. If a task
implies UI work, it belongs in the frontend repo, not this one.

## Repository layout

- `scraper/` — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots,
  PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/` — Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/` — Postgres + pgvector schema: the `tangled_*` tables/views.
- `recommendation/` — the **Discover** recommendation engine: a standalone **Python/FastAPI**
  service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has
  its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/` — the pre-port Node (`.mjs`) version of the rec scripts, superseded by
  `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept
  for reference, not run.

## Tech stack

- Language/runtime: **Python** across the live services (ingestion + recommendation). The
  earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: **Postgres** with the **pgvector** extension (records + relationships + embeddings in one
  DB), schema managed via **Supabase** migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs);
  identity via the PLC directory.
- Embeddings: **Gemini `gemini-embedding-001`**, 1536-dim, L2-normalized, stored in pgvector
  (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python `recommendation/` service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.

## Domain model — read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

- **Knots** host the actual **git data** (code, refs). Self-hostable git servers.
- **PDS** (Personal Data Service) holds the **collaboration metadata** as ATProto records:
  issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest **metadata from PDSes**. We do **not** need git code for recommendations — repo
descriptions, READMEs, and issue/PR text are the signal. **READMEs are the primary text
signal for repo recommendations** (see Embedding conventions) and are fetched live from the
**knot** (not the PDS), since no README content is stored in Postgres.

- Fetch via the knot XRPC `sh.tangled.repo.tree` query:
  `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref`
  omitted the knot uses the repo's default branch and returns a top-level `readme` object
  whose `contents` holds the rendered README (it resolves any extension — `.md`, `.org`,
  `.rst`, …). Address by the **knot-minted `repoDid`** (`record_raw->>'repoDid'`), not the
  owner DID.
- **Coverage (measured 2026-06-24):** ~79% of *reachable* repos have a README (758/959);
  ~57% of all repoDid-addressable repos confirmed (the rest are knot 404s / unreachable
  self-hosted knots, which are *unknown*, not README-less). ~30% of repos in the DB have no
  knot-minted `repoDid` at all and can't be addressed on a knot — embed those from metadata only.

Every record is addressed by an AT-URI: `at://<did>/<collection>/<rkey>`.

### Collections (NSIDs) we care about

- `sh.tangled.repo` — repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull` — pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star` — stars
- `sh.tangled.git.refUpdate` — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live
lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now
carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

## Ingestion design

Two complementary paths — keep both working:

- **Real-time: Jetstream.** Subscribe to a public Jetstream instance with `wantedCollections`
  set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- **Backfill: `listRecords`.** For each known DID, call `com.atproto.repo.listRecords` against
  its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream
  over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.

### Non-negotiable ingestion rules

- **Mirror semantics, not append-only.** Records get edited and deleted. Handle Jetstream
  `create`/`update` as **upsert** and `delete` as **soft-delete / tombstone**. Never assume
  a record seen once is permanent.
- **Resolve identity.** Records reference DIDs. Resolve DID → PDS endpoint and DID → handle
  via the PLC directory; cache it. Don't hardcode PDS hosts.
- **Coverage caveat.** Self-hosted PDSes/knots only appear if the relay crawls them. Hosted
  instances and Bluesky-network accounts are well covered; full-network coverage is not
  guaranteed. Don't treat absence as deletion.
- **Idempotency.** Ingestion must be safely replayable (reconnects, backfills overlapping the
  live stream). Key on AT-URI.

## Recommendation design

**Two-stage: retrieve, then rank.** Do not ship a single averaged "user vector" + kNN as the
whole system — it loses multi-interest structure and ignores quality/recency/social signal.

1. **Candidate generation** (high-recall, union the sources):
   - **Embedding kNN** — query with the user's *recent* interactions individually, or cluster
     their history into a few interest centroids and query each. Never collapse to one averaged vector.
   - **Collaborative / co-occurrence** — "users who starred X also starred Y" from the star and
     contribution matrices.
   - **Social graph** (our edge on ATProto) — "repos starred by people you follow", "repos your
     collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
2. **Ranking** — start with a tunable weighted sum (embedding similarity + recency + popularity +
   social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once
   there's engagement data. Keep the scorer behind an interface so it's replaceable.
3. **Rules** — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR);
   favor freshness.

### Embedding conventions

- **Repo doc = the README** (fetched live from the knot — see Domain model), as the primary
  text we embed. Prepend the repo `name` + `description` and append `topics` + primary
  `language` as light context, but the README body is the core signal.
- **Fallback when no README** (knot 404 / unreachable / repo has no `repoDid`): embed
  `name + description + topics + primary language` only. ~57–79% of repos have a README;
  the rest rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates
  (incl. when a previously-missing README becomes available).

### Required, don't skip

- **Cold start** — users with no history fall back to trending / follows-based / onboarding interests.
- **Eval harness** — hold out each user's most recent interactions; measure recall@k / nDCG offline
  before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.

## Data layout

The live schema lives in `supabase/migrations/`; the `tangled_*` tables are the source of
truth (not the generic names below). Key ones the rec engine reads (see
`recommendation/CLAUDE.md` for full columns): `tangled_readmes` (repo signal + `embedding`),
`tangled_open_issues` (view), `tangled_repos`, `tangled_identities` (did→handle),
`tangled_user_collaborations` (view). Embeddings are stored inline on the record rows
(`embedding vector(1536)` + `embedding_model`), not in a separate table.

## Commands

Each service has its own setup; see the per-folder docs. DB connection comes from
`DB_CONNECTION_STRING` (`.env`).

- Scraper (ingest / backfill / embed): see `scraper/README.md`; run `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000`
  (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.

## Conventions

- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.

## Out of scope (do not do here)

- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch
  READMEs on demand.
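
## Appendix: mirror-semantics sketch

The "Mirror semantics" and "Idempotency" rules under Non-negotiable ingestion rules can be illustrated with a minimal sketch. It assumes the Jetstream commit event shape (`did`, `time_us`, `kind`, and a `commit` object with `operation`, `collection`, `rkey`, `record`); verify that against the live stream before relying on it. `MirrorStore` and its in-memory dict are hypothetical stand-ins for the `tangled_*` tables, not code from this repo, and the `time_us` staleness check is one possible way to make overlapping replays safe.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MirrorStore:
    """Hypothetical in-memory stand-in for the Postgres mirror, keyed on AT-URI."""

    rows: dict[str, dict[str, Any]] = field(default_factory=dict)

    def apply(self, event: dict[str, Any]) -> None:
        """Apply one Jetstream event; replaying the same event is a no-op."""
        if event.get("kind") != "commit":
            return  # identity/account events carry no record to mirror

        commit = event["commit"]
        uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"

        seen = self.rows.get(uri)
        if seen is not None and event["time_us"] < seen["time_us"]:
            return  # stale replay (e.g. a backfill overlapping the live stream)

        op = commit["operation"]
        if op in ("create", "update"):
            # Upsert: the newest record body wins and clears any tombstone.
            self.rows[uri] = {
                "record": commit["record"],
                "deleted": False,
                "time_us": event["time_us"],
            }
        elif op == "delete":
            # Soft-delete: keep the row (and last known body) as a tombstone.
            self.rows[uri] = {
                "record": seen["record"] if seen else None,
                "deleted": True,
                "time_us": event["time_us"],
            }
```

In a real deployment the same logic would more likely be a single SQL upsert keyed on the AT-URI column, with the staleness comparison in the `WHERE` clause of the conflict update; the dict version only shows the intended behaviour.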