CLAUDE.md — Integration Repo#
Project memory for Claude Code working in this repo. Keep it short and high-signal. The conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.
What this repo is#
The integration repo is the backend / data layer for a Tangled (AT Protocol) discovery product. Its job, end to end:
- Ingest repos and activity from the Tangled network (ATProto records).
- Store them as a synced mirror in Postgres.
- Embed repo/issue text and maintain a vector index.
- Recommend relevant repos and issues to a user based on their past activity, and expose those recommendations (plus read APIs) over HTTP.
The frontend is a separate repo. This repo contains no UI — it exposes a JSON/HTTP API that the frontend consumes. Do not add view/template/component code here. If a task implies UI work, it belongs in the frontend repo, not this one.
Repository layout#
- `scraper/`: ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots, PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/`: Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/`: Postgres + pgvector schema (the `tangled_*` tables/views).
- `recommendation/`: the Discover recommendation engine, a standalone Python/FastAPI service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/`: the pre-port Node (`.mjs`) version of the rec scripts, superseded by `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept for reference, not run.
Tech stack#
- Language/runtime: Python across the live services (ingestion + recommendation). The earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: Postgres with the pgvector extension (records + relationships + embeddings in one DB), schema managed via Supabase migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs); identity via the PLC directory.
- Embeddings: Gemini `gemini-embedding-001`, 1536-dim, L2-normalized, stored in pgvector (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.
The rec/ranking pipeline is the Python recommendation/ service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.
Domain model — read this before touching ingestion#
Tangled is a git collaboration platform on the AT Protocol. The split that matters:
- Knots host the actual git data (code, refs). Self-hostable git servers.
- PDS (Personal Data Service) holds the collaboration metadata as ATProto records: issues, comments, pull requests, stars, collaborators, repo pointers.
We ingest metadata from PDSes. We do not need git code for recommendations — repo descriptions, READMEs, and issue/PR text are the signal. READMEs are the primary text signal for repo recommendations (see Embedding conventions) and are fetched live from the knot (not the PDS), since no README content is stored in Postgres.
- Fetch via the knot XRPC `sh.tangled.repo.tree` query: `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref` omitted, the knot uses the repo's default branch and returns a top-level `readme` object whose `contents` holds the rendered README (it resolves any extension: `.md`, `.org`, `.rst`, …). Address by the knot-minted `repoDid` (`record_raw->>'repoDid'`), not the owner DID.
- Coverage (measured 2026-06-24): ~79% of reachable repos have a README (758/959); ~57% of all repoDid-addressable repos confirmed (the rest are knot 404s / unreachable self-hosted knots, which are unknown, not README-less). ~30% of repos in the DB have no knot-minted `repoDid` at all and can't be addressed on a knot; embed those from metadata only.
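A minimal sketch of the README lookup described above, split into a URL builder and a response parser so it can be exercised without a network call. The function names `build_tree_url` and `extract_readme` are hypothetical; the only response fields assumed are the top-level `readme` object and its `contents` string, as stated above. Verify the shape against a live knot before relying on it.

```python
from typing import Optional
from urllib.parse import urlencode


def build_tree_url(knot_hostname: str, repo_did: str) -> str:
    """Build the sh.tangled.repo.tree URL for the repo root, with `ref` omitted."""
    # An empty `path` addresses the repo root; leaving out `ref` lets the knot
    # fall back to the default branch. urlencode percent-encodes the DID.
    query = urlencode({"repo": repo_did, "path": ""})
    return f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree?{query}"


def extract_readme(payload: dict) -> Optional[str]:
    """Return the rendered README text, or None when the response carries none."""
    readme = payload.get("readme")
    if not isinstance(readme, dict):
        return None
    contents = readme.get("contents")
    if isinstance(contents, str) and contents.strip():
        return contents
    return None
```

A `None` result only means this response had no README. A knot 404 or timeout should be recorded as unknown rather than as README-less, per the coverage note above.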
Every record is addressed by an AT-URI: at://<did>/<collection>/<rkey>.
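Since the AT-URI is the key used throughout ingestion, a small parser helps keep the three parts explicit. This is an illustrative sketch only: `AtUri` and `parse_at_uri` are hypothetical names, and the parser handles just the `at://<did>/<collection>/<rkey>` record form shown above, not every variant the AT-URI spec allows.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str

    def __str__(self) -> str:
        return f"at://{self.did}/{self.collection}/{self.rkey}"


def parse_at_uri(uri: str) -> AtUri:
    """Split a record AT-URI into (did, collection, rkey)."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    # A record URI has exactly three non-empty segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>: {uri!r}")
    return AtUri(*parts)
```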
Collections (NSIDs) we care about#
- `sh.tangled.repo`: repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull`: pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star`: stars
- `sh.tangled.git.refUpdate`: push / ref-update events
Treat this list as the source of truth for ingestion filters. Verify against the live lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).
Ingestion design#
Two complementary paths — keep both working:
- Real-time: Jetstream. Subscribe to a public Jetstream instance with `wantedCollections` set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- Backfill: `listRecords`. For each known DID, call `com.atproto.repo.listRecords` against its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.
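The two paths above can be sketched as a subscription URL builder and a cursor-paginating generator. This is a sketch under stated assumptions, not the scraper's actual code: `jetstream_url`, `iter_records`, and the injected `fetch_page` callable are hypothetical names, and the Jetstream host is passed in by the caller. `fetch_page` stands in for one HTTP GET of `com.atproto.repo.listRecords` on the DID's PDS and is expected to return the decoded JSON body with its `records` list and optional `cursor`.

```python
from typing import Callable, Iterator, List, Optional
from urllib.parse import urlencode

# Hypothetical transport hook: takes listRecords query params, returns the JSON body.
FetchPage = Callable[[dict], dict]


def jetstream_url(host: str, collections: List[str]) -> str:
    """Build a Jetstream subscribe URL with one wantedCollections param per NSID."""
    query = urlencode({"wantedCollections": collections}, doseq=True)
    return f"wss://{host}/subscribe?{query}"


def iter_records(fetch_page: FetchPage, did: str, collection: str,
                 limit: int = 100) -> Iterator[dict]:
    """Yield every record in one collection of one repo, following the cursor."""
    cursor: Optional[str] = None
    while True:
        params = {"repo": did, "collection": collection, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        records = page.get("records", [])
        yield from records
        cursor = page.get("cursor")
        # Stop when the server returns no cursor or an empty page.
        if not cursor or not records:
            break
```

Injecting `fetch_page` keeps the pagination logic testable offline and leaves retries, timeouts, and PDS resolution to the caller.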
Non-negotiable ingestion rules#
- Mirror semantics, not append-only. Records get edited and deleted. Handle Jetstream `create`/`update` as upsert and `delete` as soft-delete / tombstone. Never assume a record seen once is permanent.
- Resolve identity. Records reference DIDs. Resolve DID → PDS endpoint and DID → handle via the PLC directory; cache it. Don't hardcode PDS hosts.
- Coverage caveat. Self-hosted PDSes/knots only appear if the relay crawls them. Hosted instances and Bluesky-network accounts are well covered; full-network coverage is not guaranteed. Don't treat absence as deletion.
- Idempotency. Ingestion must be safely replayable (reconnects, backfills overlapping the live stream). Key on AT-URI.
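The mirror and idempotency rules above can be illustrated with an in-memory stand-in for the Postgres tables. This is a hedged sketch: `Row` and `apply_event` are hypothetical names, the event fields (`kind`, `did`, `time_us`, `commit.operation`, `commit.collection`, `commit.rkey`, `commit.record`) follow the Jetstream commit-event shape, and using `time_us` to discard stale replays is an assumption of this example rather than a documented rule of this repo.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    record: Optional[dict]
    deleted: bool = False
    time_us: int = 0


def apply_event(mirror: Dict[str, Row], event: dict) -> None:
    """Apply one Jetstream event to the mirror, keyed on AT-URI."""
    if event.get("kind") != "commit":
        return  # identity/account events are out of scope for this sketch
    commit = event["commit"]
    uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    time_us = event.get("time_us", 0)
    existing = mirror.get(uri)
    # Assumption: an event older than what we hold is a replay; ignore it.
    if existing is not None and existing.time_us > time_us:
        return
    if commit["operation"] == "delete":
        # Tombstone: keep the last known record, flag it as deleted.
        last = existing.record if existing else None
        mirror[uri] = Row(record=last, deleted=True, time_us=time_us)
    else:
        # create and update are both upserts.
        mirror[uri] = Row(record=commit.get("record"), deleted=False, time_us=time_us)
```

Replaying the same event twice leaves the mirror unchanged, and a URI that never appears in the stream is simply absent, which is not the same as deleted.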
Recommendation design#
Two-stage: retrieve, then rank. Do not ship a single averaged "user vector" + kNN as the whole system — it loses multi-interest structure and ignores quality/recency/social signal.
- Candidate generation (high-recall, union the sources):
  - Embedding kNN: query with the user's recent interactions individually, or cluster their history into a few interest centroids and query each. Never collapse to one averaged vector.
  - Collaborative / co-occurrence: "users who starred X also starred Y" from the star and contribution matrices.
  - Social graph (our edge on ATProto): "repos starred by people you follow", "repos your collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
- Ranking — start with a tunable weighted sum (embedding similarity + recency + popularity + social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once there's engagement data. Keep the scorer behind an interface so it's replaceable.
- Rules — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR); favor freshness.
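To make the ranking and diversity steps concrete, here is a small pure-Python sketch of a weighted-sum scorer behind a replaceable interface, followed by MMR reranking. Everything named here (`Candidate`, `Scorer`, `WeightedSumScorer`, `mmr_rerank`, the feature keys, and the `lam` trade-off) is hypothetical and chosen for illustration; it assumes candidate vectors are L2-normalized and each feature is already scaled to a comparable range.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence


@dataclass
class Candidate:
    uri: str
    vector: Sequence[float]     # assumed L2-normalized, so dot product = cosine
    features: Dict[str, float]  # e.g. similarity, recency, popularity, social


class Scorer(Protocol):
    def score(self, candidate: Candidate) -> float: ...


@dataclass
class WeightedSumScorer:
    weights: Dict[str, float]

    def score(self, candidate: Candidate) -> float:
        # Missing features contribute zero.
        return sum(w * candidate.features.get(name, 0.0)
                   for name, w in self.weights.items())


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_rerank(candidates: Sequence[Candidate], scorer: Scorer,
               k: int, lam: float = 0.7) -> List[Candidate]:
    """Greedy MMR: trade relevance against similarity to already-picked items."""
    remaining = list(candidates)
    picked: List[Candidate] = []
    while remaining and len(picked) < k:
        def mmr(c: Candidate) -> float:
            redundancy = max((_dot(c.vector, p.vector) for p in picked), default=0.0)
            return lam * scorer.score(c) - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        picked.append(best)
        remaining.remove(best)
    return picked
```

Because `mmr_rerank` depends only on the `Scorer` protocol, a learned ranker can later replace `WeightedSumScorer` without touching the reranking step. Filtering out the user's own repos and already-seen items would happen before this call.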
Embedding conventions#
- Repo doc = the README (fetched live from the knot; see Domain model), as the primary text we embed. Prepend the repo `name` + `description` and append `topics` + primary `language` as light context, but the README body is the core signal.
- Fallback when no README (knot 404 / unreachable / repo has no `repoDid`): embed `name + description + topics + primary language` only. ~57–79% of repos have a README; the rest rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates (incl. when a previously-missing README becomes available).
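One way to assemble the repo document described above, including the metadata-only fallback, is sketched below. The function name `build_repo_doc`, the `topics:` / `language:` labels, and the blank-line separators are assumptions of this example, not the scraper's actual format.

```python
from typing import Optional, Sequence


def build_repo_doc(name: str, description: Optional[str],
                   topics: Sequence[str], language: Optional[str],
                   readme: Optional[str] = None) -> str:
    """Compose the text to embed: light context around the README when present."""
    parts = [f"{name}: {description}" if description else name]
    if readme and readme.strip():
        parts.append(readme.strip())  # the core signal when available
    if topics:
        parts.append("topics: " + ", ".join(topics))
    if language:
        parts.append("language: " + language)
    return "\n\n".join(parts)
```

With `readme=None` the same function yields the fallback document, so both paths share one code path and the fallback always contains at least the repo name.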
Required, don't skip#
- Cold start — users with no history fall back to trending / follows-based / onboarding interests.
- Eval harness — hold out each user's most recent interactions; measure recall@k / nDCG offline before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.
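The offline metrics named above can be computed in a few lines. This sketch assumes binary relevance (an item is relevant if it is in the held-out set), a ranked list without duplicates, and a history ordered oldest to newest; `hold_out_recent`, `recall_at_k`, and `ndcg_at_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Set, Tuple


def hold_out_recent(history: Sequence[str], n: int = 1) -> Tuple[List[str], List[str]]:
    """Split an oldest-to-newest history into (train, held_out) by recency."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(history[:-n]), list(history[-n:])


def recall_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Fraction of held-out items that appear in the top k."""
    if not held_out:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in held_out)
    return hits / len(held_out)


def ndcg_at_k(ranked: Sequence[str], held_out: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted hits divided by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in held_out)
    ideal_hits = min(len(held_out), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Averaging these per user over a fixed evaluation set gives a before/after number for any ranking change.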
Data layout#
The live schema lives in supabase/migrations/; the tangled_* tables are the source of
truth (not the generic names below). Key ones the rec engine reads (see
recommendation/CLAUDE.md for full columns): tangled_readmes (repo signal + embedding),
tangled_open_issues (view), tangled_repos, tangled_identities (did→handle),
tangled_user_collaborations (view). Embeddings are stored inline on the record rows
(embedding vector(1536) + embedding_model), not in a separate table.
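As an illustration of how a query vector might be prepared for the inline `embedding` column, the sketch below normalizes a vector, formats it in pgvector's bracketed text form, and shows a cosine-distance kNN query string. The helper names and `KNN_SQL` are hypothetical, the `%s` placeholders assume a psycopg-style driver, and `SELECT *` is used because the real column list lives in `recommendation/CLAUDE.md`; only the `tangled_readmes` table, the `embedding` column, and the 1536 dimension come from the text above.

```python
import math
from typing import List, Sequence

EMBEDDING_DIM = 1536


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so cosine and dot product agree."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def to_pgvector_literal(vec: Sequence[float], dim: int = EMBEDDING_DIM) -> str:
    """Format a vector as pgvector text input, e.g. '[0.1,0.2]'."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vec)}")
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
KNN_SQL = (
    "SELECT * FROM tangled_readmes "
    "WHERE embedding IS NOT NULL "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT %s"
)
```

A caller would pass `(to_pgvector_literal(query_vec), k)` as the query parameters; ordering by the distance expression with a `LIMIT` is the form an HNSW index can serve.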
Commands#
Each service has its own setup; see the per-folder docs. DB connection comes from
DB_CONNECTION_STRING (.env).
- Scraper (ingest / backfill / embed): see `scraper/README.md`; run `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000` (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.
Conventions#
- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.
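For the DID-to-handle step at the API edge, the lookups can be sketched as pure functions over an already-fetched DID document. The helper names are hypothetical. The document fields used (`service` entries with an id ending in `#atproto_pds`, and `alsoKnownAs` entries of the form `at://<handle>`) follow the atproto DID document layout. This sketch covers `did:plc` only and does not verify the handle bidirectionally.

```python
from typing import Optional

PLC_DIRECTORY = "https://plc.directory"


def plc_url(did: str) -> str:
    """URL of the DID document in the PLC directory (did:plc only)."""
    if not did.startswith("did:plc:"):
        raise ValueError(f"not a did:plc identifier: {did!r}")
    return f"{PLC_DIRECTORY}/{did}"


def pds_endpoint(did_doc: dict) -> Optional[str]:
    """Return the PDS service endpoint declared in the DID document, if any."""
    for service in did_doc.get("service", []):
        if str(service.get("id", "")).endswith("#atproto_pds"):
            return service.get("serviceEndpoint")
    return None


def handle_from_doc(did_doc: dict) -> Optional[str]:
    """Return the first at:// alias as a display handle, if any."""
    for alias in did_doc.get("alsoKnownAs", []):
        if alias.startswith("at://"):
            return alias[len("at://"):]
    return None
```

Results would typically be cached per DID and refreshed when a fetch against the cached PDS fails, since accounts can migrate between PDS hosts.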
Out of scope (do not do here)#
- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch READMEs on demand.