CLAUDE.md: Integration Repo

Project memory for Claude Code working in this repo. Keep it short and high-signal. Conventional filename is CLAUDE.md (uppercase); rename if your tooling expects otherwise.

What this repo is

The integration repo is the backend / data layer for a Tangled (AT Protocol) discovery product. Its job, end to end:

Ingest repos and activity from the Tangled network (ATProto records).
Store them as a synced mirror in Postgres.
Embed repo/issue text and maintain a vector index.
Recommend relevant repos and issues to a user based on their past activity, and expose those recommendations (plus read APIs) over HTTP.

The frontend is a separate repo. This repo contains no UI — it exposes a JSON/HTTP API that the frontend consumes. Do not add view/template/component code here. If a task implies UI work, it belongs in the frontend repo, not this one.

Repository layout

scraper/ — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots, PDS/network backfill, repo metadata, READMEs, issues, embeddings. See scraper/README.md.
daily_issue_scraper/ — Cloud Run container that re-runs the issue sync on a daily schedule.
supabase/migrations/ — Postgres + pgvector schema: the tangled_* tables/views.
recommendation/ — the Discover recommendation engine: a standalone Python/FastAPI service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has its own CLAUDE.md / README.md / API.md; intended to be lifted into its own repo later.
recommendationold/ — the pre-port Node (.mjs) version of the rec scripts, superseded by recommendation/ (its reference/src/ holds the same scripts as the porting oracle). Kept for reference, not run.

Tech stack

Language/runtime: Python across the live services (ingestion + recommendation). The earlier Next.js/FastAPI skeleton was cleared; current code is Python.
DB: Postgres with the pgvector extension (records + relationships + embeddings in one DB), schema managed via Supabase migrations (supabase/migrations/).
ATProto: PDS com.atproto.repo.listRecords + knot XRPC (sh.tangled.repo.tree for READMEs); identity via the PLC directory.
Embeddings: Gemini gemini-embedding-001, 1536-dim, L2-normalized, stored in pgvector (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python recommendation/ service — keep a clear HTTP API boundary between it and the ingestion/embedding side.

Domain model: read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

Knots host the actual git data (code, refs). Self-hostable git servers.
PDS (Personal Data Service) holds the collaboration metadata as ATProto records: issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest metadata from PDSes. We do not need git code for recommendations — repo descriptions, READMEs, and issue/PR text are the signal. READMEs are the primary text signal for repo recommendations (see Embedding conventions) and are fetched live from the knot (not the PDS), since no README content is stored in Postgres.

Fetch via the knot XRPC sh.tangled.repo.tree query with an empty path parameter, i.e. https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path= with nothing after the equals sign. With ref omitted, the knot uses the repo's default branch and returns a top-level readme object whose contents field holds the rendered README (it resolves any extension: .md, .org, .rst, …). Address the repo by the knot-minted repoDid (record_raw->>'repoDid'), not the owner DID.
Coverage (measured 2026-06-24): ~79% of reachable repos have a README (758/959). Across all repoDid-addressable repos, ~57% have a confirmed README; the rest are knot 404s / unreachable self-hosted knots, whose README status is unknown, not README-less. ~30% of repos in the DB have no knot-minted repoDid at all and can't be addressed on a knot; embed those from metadata only.
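
A minimal sketch of the README fetch described above, limited to building the query URL and reading the README out of a decoded response. The function names and the knot hostname in the usage note are hypothetical; only the endpoint, the repo/path parameters, and the readme.contents shape come from this document. The HTTP call itself is left to the caller.

```python
from typing import Optional
from urllib.parse import urlencode


def readme_tree_url(knot_hostname: str, repo_did: str) -> str:
    """Build the sh.tangled.repo.tree URL for the root of a repo.

    `ref` is omitted on purpose so the knot falls back to the default branch.
    """
    query = urlencode({"repo": repo_did, "path": ""})
    return f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree?{query}"


def extract_readme(tree_response: dict) -> Optional[str]:
    """Return the README text from a decoded tree response, or None if absent."""
    readme = tree_response.get("readme")
    if not isinstance(readme, dict):
        return None
    contents = readme.get("contents")
    if isinstance(contents, str) and contents.strip():
        return contents
    return None
```

For example, readme_tree_url("knot.example.com", "did:plc:abc123") percent-encodes the colons in the DID. A None result from extract_readme means "no README in this response"; a 404 or timeout should be recorded separately as unknown, in line with the coverage note above.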

Every record is addressed by an AT-URI: at://<did>/<collection>/<rkey>.
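
As an illustration of the at://<did>/<collection>/<rkey> shape, here is a small parser sketch. The AtUri type and both function names are hypothetical helpers, and the parser only handles the three-part record form shown above.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def parse_at_uri(uri: str) -> AtUri:
    """Split a record AT-URI into (did, collection, rkey)."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>: {uri!r}")
    return AtUri(*parts)


def format_at_uri(did: str, collection: str, rkey: str) -> str:
    """Inverse of parse_at_uri."""
    return f"at://{did}/{collection}/{rkey}"
```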

Collections (NSIDs) we care about

sh.tangled.repo — repo record / pointer (owner, name, knot)
sh.tangled.repo.issue
sh.tangled.repo.issue.comment
sh.tangled.repo.pull — pull requests
sh.tangled.repo.collaborator
sh.tangled.feed.star — stars
sh.tangled.git.refUpdate — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

Ingestion design

Two complementary paths — keep both working:

Real-time: Jetstream. Subscribe to a public Jetstream instance with wantedCollections set to the sh.tangled.* NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
Backfill: listRecords. For each known DID, call com.atproto.repo.listRecords against its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream over time and/or by enumerating the relay with com.atproto.sync.listRepos.
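
The two paths above can be sketched as follows. This is an illustration under assumptions: the host passed to jetstream_url is a placeholder, fetch_page is a hypothetical callable that performs the com.atproto.repo.listRecords GET against the DID's PDS and returns the decoded JSON, and the helper names are not taken from the scraper code.

```python
from typing import Callable, Iterator, List, Optional
from urllib.parse import urlencode

# The NSIDs listed under "Collections (NSIDs) we care about".
TANGLED_COLLECTIONS: List[str] = [
    "sh.tangled.repo",
    "sh.tangled.repo.issue",
    "sh.tangled.repo.issue.comment",
    "sh.tangled.repo.pull",
    "sh.tangled.repo.collaborator",
    "sh.tangled.feed.star",
    "sh.tangled.git.refUpdate",
]


def jetstream_url(host: str, collections: List[str]) -> str:
    """Build a Jetstream subscribe URL; wantedCollections is repeated per NSID."""
    query = urlencode([("wantedCollections", c) for c in collections])
    return f"wss://{host}/subscribe?{query}"


def iter_records(
    fetch_page: Callable[[dict], dict],
    did: str,
    collection: str,
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every record in one collection of one repo, following the cursor."""
    cursor: Optional[str] = None
    while True:
        params = {"repo": did, "collection": collection, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        records = page.get("records", [])
        yield from records
        cursor = page.get("cursor")
        # Stop when the PDS returns no cursor or an empty page.
        if not cursor or not records:
            return
```

A backfill would call iter_records once per (DID, collection) pair over TANGLED_COLLECTIONS. Injecting fetch_page keeps retries, rate limiting, and PDS resolution out of the pagination loop.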

Non-negotiable ingestion rules

Mirror semantics, not append-only. Records get edited and deleted. Handle Jetstream create/update as upsert and delete as soft-delete / tombstone. Never assume a record seen once is permanent.
Resolve identity. Records reference DIDs. Resolve DID → PDS endpoint and DID → handle via the PLC directory; cache it. Don't hardcode PDS hosts.
Coverage caveat. Self-hosted PDSes/knots only appear if the relay crawls them. Hosted instances and Bluesky-network accounts are well covered; full-network coverage is not guaranteed. Don't treat absence as deletion.
Idempotency. Ingestion must be safely replayable (reconnects, backfills overlapping the live stream). Key on AT-URI.
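
The mirror and idempotency rules above can be sketched with an in-memory dict standing in for the Postgres tables. This assumes the Jetstream commit event shape (did, kind, and a commit object with rev, operation, collection, rkey, record) and assumes rev values order lexicographically, as TIDs do. MirrorRow and apply_commit are hypothetical names, not the scraper's.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MirrorRow:
    record: Optional[dict]
    rev: str
    deleted: bool = False


def apply_commit(mirror: Dict[str, MirrorRow], event: dict) -> None:
    """Apply one Jetstream event to the mirror, keyed on AT-URI.

    create/update are upserts, delete is a tombstone, and an event whose
    rev is not newer than the stored row is ignored so replays are harmless.
    """
    if event.get("kind") != "commit":
        return  # identity/account events are handled elsewhere
    commit = event["commit"]
    uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    rev = commit["rev"]

    existing = mirror.get(uri)
    if existing is not None and existing.rev >= rev:
        return  # stale or replayed event

    if commit["operation"] == "delete":
        # Soft-delete: keep the last known record, mark the row as deleted.
        last_record = existing.record if existing else None
        mirror[uri] = MirrorRow(record=last_record, rev=rev, deleted=True)
    else:
        mirror[uri] = MirrorRow(record=commit["record"], rev=rev)
```

In SQL the same idea would typically become an INSERT ... ON CONFLICT upsert on the AT-URI column with a guard on rev. Records backfilled via listRecords carry no per-event rev, so a real implementation needs its own rule for merging the two paths.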

Recommendation design

Two-stage: retrieve, then rank. Do not ship a single averaged "user vector" + kNN as the whole system — it loses multi-interest structure and ignores quality/recency/social signal.

Candidate generation (high-recall, union the sources):
- Embedding kNN — query with the user's recent interactions individually, or cluster their history into a few interest centroids and query each. Never collapse to one averaged vector.
- Collaborative / co-occurrence — "users who starred X also starred Y" from the star and contribution matrices.
- Social graph (our edge on ATProto) — "repos starred by people you follow", "repos your collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
Ranking — start with a tunable weighted sum (embedding similarity + recency + popularity + social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once there's engagement data. Keep the scorer behind an interface so it's replaceable.
Rules — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR); favor freshness.
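
A minimal sketch of the ranking stage described above: a weighted-sum scorer behind a small interface, followed by MMR for diversity. All names, feature fields, and weights here are illustrative assumptions, not the recommendation/ service's actual code. Features are assumed to be pre-scaled to [0, 1] and embeddings L2-normalized, so a dot product is cosine similarity.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    uri: str
    embedding: Tuple[float, ...]  # assumed L2-normalized
    similarity: float = 0.0       # embedding similarity to the user's interests
    recency: float = 0.0
    popularity: float = 0.0
    social: float = 0.0           # social proximity
    topic_match: float = 0.0      # language/topic match


class Scorer(Protocol):
    def score(self, candidate: Candidate) -> float: ...


@dataclass
class WeightedSumScorer:
    # Hypothetical starting weights; tune them against the eval harness.
    w_similarity: float = 0.4
    w_recency: float = 0.2
    w_popularity: float = 0.1
    w_social: float = 0.2
    w_topic: float = 0.1

    def score(self, c: Candidate) -> float:
        return (
            self.w_similarity * c.similarity
            + self.w_recency * c.recency
            + self.w_popularity * c.popularity
            + self.w_social * c.social
            + self.w_topic * c.topic_match
        )


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_rerank(
    candidates: Sequence[Candidate], scorer: Scorer, k: int, lam: float = 0.7
) -> List[Candidate]:
    """Greedy MMR: trade relevance against similarity to items already picked."""
    remaining = list(candidates)
    selected: List[Candidate] = []
    while remaining and len(selected) < k:

        def mmr(c: Candidate) -> float:
            redundancy = max(
                (_dot(c.embedding, s.embedding) for s in selected), default=0.0
            )
            return lam * scorer.score(c) - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because mmr_rerank only depends on the Scorer protocol, a learned ranker exposing the same score method could replace WeightedSumScorer without touching the reranking code. Filtering out the user's own repos and already-seen items would happen before this step.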

Embedding conventions

Repo doc = the README (fetched live from the knot — see Domain model), as the primary text we embed. Prepend the repo name + description and append topics + primary language as light context, but the README body is the core signal.
Fallback when no README (knot 404 / unreachable / repo has no repoDid): embed name + description + topics + primary language only. ~57–79% of repos have a README; the rest rely on this fallback, so it must produce a usable vector on its own.
Issue doc = title + body + labels + parent-repo context.
Store vectors in pgvector alongside the record. Re-embed on meaningful record updates (incl. when a previously-missing README becomes available).
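
One way to assemble the repo document described above, including the no-README fallback. The function name, the "Topics:" / "Language:" labels, and the separators are assumptions made for illustration; the scraper's actual text layout may differ.

```python
from typing import Optional, Sequence


def build_repo_doc(
    name: str,
    description: Optional[str] = None,
    readme: Optional[str] = None,
    topics: Sequence[str] = (),
    language: Optional[str] = None,
) -> str:
    """Compose the text to embed for a repo.

    Name and description lead, the README body (when present) is the core,
    and topics plus primary language trail as light context. Without a
    README the same function yields the metadata-only fallback document.
    """
    header = f"{name}: {description}" if description else name
    parts = [header]

    if readme and readme.strip():
        parts.append(readme.strip())

    context = []
    if topics:
        context.append("Topics: " + ", ".join(topics))
    if language:
        context.append(f"Language: {language}")
    if context:
        parts.append("\n".join(context))

    return "\n\n".join(parts)
```

Using a single builder for both cases keeps README-backed and fallback vectors in the same textual format, so a repo whose README later becomes available is simply re-embedded from the richer document.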

Required, don't skip

Cold start — users with no history fall back to trending / follows-based / onboarding interests.
Eval harness — hold out each user's most recent interactions; measure recall@k / nDCG offline before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.
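
The offline metrics named above can be sketched as below, using binary relevance (a held-out item either appears in the ranking or it does not). The function names and the (timestamp, item) interaction shape are assumptions; the ranked list is assumed to contain no duplicates.

```python
import math
from typing import List, Sequence, Set, Tuple


def split_holdout(
    interactions: Sequence[Tuple[float, str]], n_holdout: int = 1
) -> Tuple[List[Tuple[float, str]], List[Tuple[float, str]]]:
    """Split one user's (timestamp, item) history into (train, most-recent holdout)."""
    ordered = sorted(interactions, key=lambda pair: pair[0])
    if n_holdout <= 0:
        return ordered, []
    return ordered[:-n_holdout], ordered[-n_holdout:]


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of held-out items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Averaging both metrics over users, before and after a ranking change, gives the comparison that a merge should be judged on.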

Data layout

The live schema lives in supabase/migrations/; the tangled_* tables are the source of truth, not any generic table names. Key ones the rec engine reads (see recommendation/CLAUDE.md for full columns): tangled_readmes (repo signal + embedding), tangled_open_issues (view), tangled_repos, tangled_identities (did→handle), tangled_user_collaborations (view). Embeddings are stored inline on the record rows (embedding vector(1536) + embedding_model), not in a separate table.
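
A sketch of what a cosine kNN lookup against the inline embedding column could look like. The tangled_readmes table and the embedding column come from this document; the repo_did column, the helper names, and the psycopg-style named placeholders are assumptions, so check recommendation/CLAUDE.md for the real columns. The <=> operator is pgvector's cosine distance.

```python
from typing import Dict, Sequence, Union

EMBEDDING_DIM = 1536

# repo_did is a hypothetical column name used only for illustration.
KNN_SQL = """
SELECT repo_did,
       1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM tangled_readmes
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Render a vector in pgvector's text input format, e.g. [0.5,1.0]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def knn_params(
    vec: Sequence[float], k: int, expected_dim: int = EMBEDDING_DIM
) -> Dict[str, Union[str, int]]:
    """Validate the query vector and build the parameter dict for KNN_SQL."""
    if len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} dims, got {len(vec)}")
    if k <= 0:
        raise ValueError("k must be positive")
    return {"query": to_pgvector_literal(vec), "k": k}
```

With a DB-API cursor this would be run as cursor.execute(KNN_SQL, knn_params(query_vector, 20)). Ordering by the raw distance expression, rather than by the derived similarity, is what allows Postgres to use the HNSW index.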

Commands

Each service has its own setup; see the per-folder docs. DB connection comes from DB_CONNECTION_STRING (.env).

Scraper (ingest / backfill / embed): see scraper/README.md — python scraper/scrape.py <stage>.
Recommendation API: from recommendation/, uvicorn app.main:app --reload --port 8000 (setup + deploy in recommendation/README.md).
Rec tests: from recommendation/, .venv/bin/python -m pytest tests/.

Conventions

Keep ingestion, embedding, recommendation, and API as separable modules/services.
All external IDs are DIDs internally; resolve to handles only at the API edge for display.
Don't put secrets, PDS credentials, or model API keys in code or commits.
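
To illustrate the DID-internal, handle-at-the-edge convention, here is a sketch that reads the PDS endpoint and handle out of a DID document as served by the PLC directory. It assumes the standard atproto DID document layout (a service entry whose id ends in #atproto_pds, and an at:// entry in alsoKnownAs); the function names are hypothetical, and fetching and caching the document are left out.

```python
from typing import Dict, Optional


def plc_url(did: str) -> str:
    """URL of the DID document for a did:plc identity."""
    return f"https://plc.directory/{did}"


def pds_endpoint(did_doc: dict) -> Optional[str]:
    """Return the PDS service endpoint declared in a DID document, if any."""
    for service in did_doc.get("service", []):
        if str(service.get("id", "")).endswith("#atproto_pds"):
            return service.get("serviceEndpoint")
    return None


def handle_from_doc(did_doc: dict) -> Optional[str]:
    """Return the handle from the first at:// entry in alsoKnownAs, if any."""
    for alias in did_doc.get("alsoKnownAs", []):
        if alias.startswith("at://"):
            return alias[len("at://"):]
    return None


def display_name(did: str, handle_cache: Dict[str, str]) -> str:
    """API-edge helper: show the cached handle, falling back to the raw DID."""
    return handle_cache.get(did, did)
```

Internal tables and joins keep using the DID; only response serialization would call display_name.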

Out of scope (do not do here)

Frontend / UI work → separate repo.
Hosting git content or running a knot → not this service's job; we read metadata and fetch READMEs on demand.
