# CLAUDE.md — Integration Repo

> Project memory for Claude Code working in this repo. Keep it short and high-signal.
> Conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.

## What this repo is

The **integration** repo is the backend / data layer for a Tangled (AT Protocol) discovery
product. Its job, end to end:

1. **Ingest** repos and activity from the Tangled network (ATProto records).
2. **Store** them as a synced mirror in Postgres.
3. **Embed** repo/issue text and maintain a vector index.
4. **Recommend** relevant repos and issues to a user based on their past activity, and
   expose those recommendations (plus read APIs) over HTTP.

The **frontend is a separate repo**. This repo contains **no UI** — it exposes a JSON/HTTP
API that the frontend consumes. Do not add view/template/component code here. If a task
implies UI work, it belongs in the frontend repo, not this one.

## Repository layout

- `scraper/` — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots,
  PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/` — Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/` — Postgres + pgvector schema: the `tangled_*` tables/views.
- `recommendation/` — the **Discover** recommendation engine: a standalone **Python/FastAPI**
  service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has
  its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/` — the pre-port Node (`.mjs`) version of the rec scripts, superseded by
  `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept
  for reference, not run.

## Tech stack

- Language/runtime: **Python** across the live services (ingestion + recommendation). The
  earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: **Postgres** with the **pgvector** extension (records + relationships + embeddings in one
  DB), schema managed via **Supabase** migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs);
  identity via the PLC directory.
- Embeddings: **Gemini `gemini-embedding-001`**, 1536-dim, L2-normalized, stored in pgvector
  (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python `recommendation/` service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.

## Domain model — read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

- **Knots** host the actual **git data** (code, refs). Self-hostable git servers.
- **PDS** (Personal Data Service) holds the **collaboration metadata** as ATProto records:
  issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest **metadata from PDSes**. We do **not** need git code for recommendations — repo
descriptions, READMEs, and issue/PR text are the signal. **READMEs are the primary text
signal for repo recommendations** (see Embedding conventions) and are fetched live from the
**knot** (not the PDS), since no README content is stored in Postgres.

- Fetch via the knot XRPC `sh.tangled.repo.tree` query:
  `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref`
  omitted the knot uses the repo's default branch and returns a top-level `readme` object
  whose `contents` holds the rendered README (it resolves any extension — `.md`, `.org`,
  `.rst`, …). Address by the **knot-minted `repoDid`** (`record_raw->>'repoDid'`), not the
  owner DID.
- **Coverage (measured 2026-06-24):** ~79% of *reachable* repos have a README (758/959);
  ~57% of all repoDid-addressable repos confirmed (the rest are knot 404s / unreachable
  self-hosted knots, which are *unknown*, not README-less). ~30% of repos in the DB have no
  knot-minted `repoDid` at all and can't be addressed on a knot — embed those from metadata only.
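
The README fetch described above can be sketched as a URL builder plus a response parser. This is a minimal illustration, not the scraper's actual code: the helper names `readme_tree_url` and `extract_readme` are hypothetical, and the response shape (a top-level `readme` object with a `contents` string) is taken only from the description above. The HTTP call itself is left to the caller so that a knot 404 or timeout can be recorded as *unknown* rather than README-less.

```python
from typing import Optional
from urllib.parse import urlencode


def readme_tree_url(knot_hostname: str, repo_did: str) -> str:
    """Build the sh.tangled.repo.tree URL for the repo root.

    `ref` is deliberately omitted so the knot falls back to the default branch.
    """
    # Keep ':' unescaped so the DID reads the same as in the documented URL.
    query = urlencode({"repo": repo_did, "path": ""}, safe=":")
    return f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree?{query}"


def extract_readme(tree_response: dict) -> Optional[str]:
    """Return the README text from a decoded tree response, or None if absent."""
    readme = tree_response.get("readme")
    if not isinstance(readme, dict):
        return None
    contents = readme.get("contents")
    if isinstance(contents, str) and contents.strip():
        return contents
    return None
```

A caller would request `readme_tree_url(...)`, decode the JSON body, and pass it to `extract_readme`; a `None` result after a successful response means the repo has no README, which is a different state from a failed request.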

Every record is addressed by an AT-URI: `at://<did>/<collection>/<rkey>`.
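
As a small illustration of that addressing scheme, the sketch below splits an AT-URI into its three parts and rebuilds it. The names `AtUri`, `parse_at_uri`, and `format_at_uri` are hypothetical helpers, and the parser only handles the three-segment record form shown above.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def parse_at_uri(uri: str) -> AtUri:
    """Split at://<did>/<collection>/<rkey> into its components."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>: {uri!r}")
    return AtUri(*parts)


def format_at_uri(did: str, collection: str, rkey: str) -> str:
    """Inverse of parse_at_uri; handy as the primary key for mirrored rows."""
    return f"at://{did}/{collection}/{rkey}"
```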

### Collections (NSIDs) we care about

- `sh.tangled.repo` — repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull` — pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star` — stars
- `sh.tangled.git.refUpdate` — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live
lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now
carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

## Ingestion design

Two complementary paths — keep both working:

- **Real-time: Jetstream.** Subscribe to a public Jetstream instance with `wantedCollections`
  set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- **Backfill: `listRecords`.** For each known DID, call `com.atproto.repo.listRecords` against
  its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream
  over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.
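
The two paths can be sketched as follows. This is an illustration under stated assumptions, not the scraper's code: `jetstream_url`, `iter_records`, and the `PageFetcher` callback are hypothetical names, the Jetstream host is whatever public instance you choose, and the page shape (`records` plus an optional `cursor`) follows the `com.atproto.repo.listRecords` response. The HTTP call is injected so the pagination loop can be tested without a network.

```python
from typing import Callable, Iterator, Optional, Sequence
from urllib.parse import urlencode

TANGLED_COLLECTIONS = [
    "sh.tangled.repo",
    "sh.tangled.repo.issue",
    "sh.tangled.repo.issue.comment",
    "sh.tangled.repo.pull",
    "sh.tangled.repo.collaborator",
    "sh.tangled.feed.star",
    "sh.tangled.git.refUpdate",
]


def jetstream_url(host: str, collections: Sequence[str]) -> str:
    """Build a Jetstream subscribe URL with one wantedCollections param per NSID."""
    query = urlencode([("wantedCollections", c) for c in collections])
    return f"wss://{host}/subscribe?{query}"


# (collection, cursor) -> decoded listRecords response for one DID's PDS.
PageFetcher = Callable[[str, Optional[str]], dict]


def iter_records(fetch_page: PageFetcher, collection: str) -> Iterator[dict]:
    """Yield every record in one collection, following the cursor to the end."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(collection, cursor)
        records = page.get("records", [])
        yield from records
        cursor = page.get("cursor")
        # Stop when the PDS stops returning a cursor or returns an empty page.
        if not cursor or not records:
            break
```

A backfill would call `iter_records` once per entry in `TANGLED_COLLECTIONS` for each known DID, with `fetch_page` bound to that DID's resolved PDS endpoint.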

### Non-negotiable ingestion rules

- **Mirror semantics, not append-only.** Records get edited and deleted. Handle Jetstream
  `create`/`update` as **upsert** and `delete` as **soft-delete / tombstone**. Never assume
  a record seen once is permanent.
- **Resolve identity.** Records reference DIDs. Resolve DID → PDS endpoint and DID → handle
  via the PLC directory; cache it. Don't hardcode PDS hosts.
- **Coverage caveat.** Self-hosted PDSes/knots only appear if the relay crawls them. Hosted
  instances and Bluesky-network accounts are well covered; full-network coverage is not
  guaranteed. Don't treat absence as deletion.
- **Idempotency.** Ingestion must be safely replayable (reconnects, backfills overlapping the
  live stream). Key on AT-URI.
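
The mirror and idempotency rules can be illustrated with an in-memory stand-in for the Postgres tables. This is a sketch under assumptions: `MirrorRow` and `apply_event` are hypothetical names, and the event layout (`kind`, `did`, and a `commit` object with `operation`, `collection`, `rkey`, `record`) is the Jetstream commit shape as I understand it, so check it against the live stream before relying on it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MirrorRow:
    record: Optional[dict]  # last known record body, kept on delete for auditing
    deleted: bool = False   # tombstone flag; rows are never physically removed


def apply_event(mirror: Dict[str, MirrorRow], event: dict) -> None:
    """Apply one Jetstream event to the mirror, keyed on AT-URI."""
    if event.get("kind") != "commit":
        return  # identity/account events are out of scope for this sketch
    commit = event["commit"]
    uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    operation = commit["operation"]
    if operation in ("create", "update"):
        # Upsert: a replayed create and a later update both land on the same key.
        mirror[uri] = MirrorRow(record=commit["record"], deleted=False)
    elif operation == "delete":
        previous = mirror.get(uri)
        mirror[uri] = MirrorRow(
            record=previous.record if previous else None, deleted=True
        )
```

Replaying the same event leaves the mirror unchanged, which is the property reconnects and overlapping backfills depend on. The sketch does not guard against out-of-order replays (an old `create` arriving after a `delete`); a real implementation would likely compare a revision or timestamp field before overwriting.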

## Recommendation design

**Two-stage: retrieve, then rank.** Do not ship a single averaged "user vector" + kNN as the
whole system — it loses multi-interest structure and ignores quality/recency/social signal.

1. **Candidate generation** (high-recall, union the sources):
   - **Embedding kNN** — query with the user's *recent* interactions individually, or cluster
     their history into a few interest centroids and query each. Never collapse to one averaged vector.
   - **Collaborative / co-occurrence** — "users who starred X also starred Y" from the star and
     contribution matrices.
   - **Social graph** (our edge on ATProto) — "repos starred by people you follow", "repos your
     collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
2. **Ranking** — start with a tunable weighted sum (embedding similarity + recency + popularity +
   social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once
   there's engagement data. Keep the scorer behind an interface so it's replaceable.
3. **Rules** — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR);
   favor freshness.
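
The ranking and rules stages above can be sketched as a weighted-sum scorer, a hard filter, and an MMR re-rank. Everything here is illustrative: the `Candidate` fields, the feature names, the weights, and the `lam` trade-off are assumptions rather than the service's tuned values, and features are assumed to be pre-scaled to `[0, 1]` with L2-normalized embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class Candidate:
    repo_did: str
    owner_did: str
    embedding: Sequence[float]  # assumed L2-normalized
    features: Dict[str, float] = field(default_factory=dict)  # each in [0, 1]


# Hypothetical starting weights; tune these against the eval harness.
WEIGHTS = {"similarity": 0.5, "recency": 0.2, "popularity": 0.1, "social": 0.2}

Scorer = Callable[[Candidate], float]


def weighted_score(c: Candidate) -> float:
    """Tunable weighted sum; a learned ranker can replace this callable later."""
    return sum(w * c.features.get(name, 0.0) for name, w in WEIGHTS.items())


def apply_rules(cands: List[Candidate], user_did: str, seen: Set[str]) -> List[Candidate]:
    """Drop the user's own repos and anything they have already seen."""
    return [c for c in cands if c.owner_did != user_did and c.repo_did not in seen]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_rerank(cands: List[Candidate], k: int, lam: float = 0.7,
               score: Scorer = weighted_score) -> List[Candidate]:
    """Greedy MMR: trade relevance against similarity to already-picked items."""
    remaining = list(cands)
    picked: List[Candidate] = []
    while remaining and len(picked) < k:
        def mmr_value(c: Candidate) -> float:
            redundancy = max((_dot(c.embedding, p.embedding) for p in picked), default=0.0)
            return lam * score(c) - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        picked.append(best)
        remaining.remove(best)
    return picked
```

Passing the scorer as a callable is one way to keep it behind an interface: swapping in a learned ranker later means supplying a different `score` function, with the filter and MMR steps unchanged.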

### Embedding conventions

- **Repo doc = the README** (fetched live from the knot — see Domain model), as the primary
  text we embed. Prepend the repo `name` + `description` and append `topics` + primary
  `language` as light context, but the README body is the core signal.
- **Fallback when no README** (knot 404 / unreachable / repo has no `repoDid`): embed
  `name + description + topics + primary language` only. ~57–79% of repos have a README;
  the rest rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates
  (incl. when a previously-missing README becomes available).
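
A minimal sketch of the repo document assembly and the L2 normalization step, assuming hypothetical helper names (`build_repo_doc`, `l2_normalize`) and made-up `Topics:` / `Language:` labels; the actual scraper may format the context lines differently. With `readme=None` the same function produces the metadata-only fallback document.

```python
import math
from typing import List, Optional, Sequence


def build_repo_doc(name: str, description: Optional[str] = None,
                   readme: Optional[str] = None, topics: Sequence[str] = (),
                   language: Optional[str] = None) -> str:
    """Assemble the text to embed: name/description, README body, then context."""
    parts = [name.strip()]
    if description and description.strip():
        parts.append(description.strip())
    if readme and readme.strip():
        parts.append(readme.strip())  # core signal when available
    if topics:
        parts.append("Topics: " + ", ".join(topics))
    if language:
        parts.append("Language: " + language)
    return "\n\n".join(parts)


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so cosine similarity equals the dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]
```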

### Required, don't skip

- **Cold start** — users with no history fall back to trending / follows-based / onboarding interests.
- **Eval harness** — hold out each user's most recent interactions; measure recall@k / nDCG offline
  before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.
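
The two offline metrics named above can be computed per user as in the sketch below. The function names are hypothetical, relevance is treated as binary (an item either is or is not in the held-out set), and the ranked list is assumed to contain no duplicates.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Fraction of held-out items that appear in the top k recommendations."""
    if not held_out:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in held_out)
    return hits / len(held_out)


def ndcg_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Binary-relevance nDCG: hits near the top count more than hits lower down."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in held_out
    )
    ideal_hits = min(k, len(held_out))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Averaging these over all users with held-out interactions gives the numbers to compare before and after a ranking change.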

## Data layout

The live schema lives in `supabase/migrations/`; the `tangled_*` tables are the source of
truth, not generic table names. Key ones the rec engine reads (see
`recommendation/CLAUDE.md` for full columns): `tangled_readmes` (repo signal + `embedding`),
`tangled_open_issues` (view), `tangled_repos`, `tangled_identities` (did→handle),
`tangled_user_collaborations` (view). Embeddings are stored inline on the record rows
(`embedding vector(1536)` + `embedding_model`), not in a separate table.
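
As a rough illustration of reading those inline embeddings, the sketch below prepares a cosine kNN query against `tangled_readmes` using pgvector's `<=>` (cosine distance) operator. Several parts are assumptions: the `repo_did` column name is hypothetical (see `recommendation/CLAUDE.md` for the real columns), the value stored in `embedding_model` is assumed to be the model name, and the `%(name)s` placeholders assume a psycopg-style driver.

```python
from typing import Dict, Sequence, Union

EMBEDDING_DIM = 1536

# Hypothetical query; `repo_did` is an assumed column name.
KNN_SQL = """
SELECT repo_did, 1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM tangled_readmes
WHERE embedding IS NOT NULL
  AND embedding_model = %(model)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def vector_literal(vec: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def knn_params(query_vec: Sequence[float], k: int,
               model: str = "gemini-embedding-001") -> Dict[str, Union[str, int]]:
    """Validate the query vector and build the parameter dict for KNN_SQL."""
    if len(query_vec) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(query_vec)}")
    return {"query": vector_literal(query_vec), "model": model, "k": k}
```

Filtering on `embedding_model` keeps vectors from different models out of the same similarity ranking, which matters if the embedding model is ever changed.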

## Commands

Each service has its own setup; see the per-folder docs. DB connection comes from
`DB_CONNECTION_STRING` (`.env`).

- Scraper (ingest / backfill / embed): see `scraper/README.md` — `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000`
  (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.

## Conventions

- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.

## Out of scope (do not do here)

- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch
  READMEs on demand.