CLAUDE.md: Tangled Recommendation Engine
Context for any Claude session working in this folder. This is a standalone Python/FastAPI service (it will be lifted into its own repo and hosted separately). Read this top-to-bottom before changing anything.
1. What this is
The recommendation backend for Tangled's Discover (contribution-discovery) feature. Given a user's DID it returns repo + issue recommendations. It reads README/issue embeddings (precomputed by the data teammate) from a shared Postgres + pgvector database and reranks them. The Tangled web app ("appview", a separate Go service) calls this over HTTP and renders the results. The service makes no external API calls at runtime — it only reads the DB.
Tangled appview ──HTTP(handle,gh)──► THIS service ──► shared Postgres+pgvector (READ-ONLY)
(Go, separate repo) (Python/FastAPI)
Semantic free-text search (`GET /search`) was built, then removed at the user's request (the Discover UI only consumes `/recommendations`). It's easy to re-add: embed the query with Gemini (`RETRIEVAL_QUERY`) and run the same kNN/merge/shape pipeline with a single "query" seed. The Node script `reference/src/issue_search.mjs` shows the approach.
The service was ported from validated Node scripts in reference/src/*.mjs (the "oracle"):
similar_repos.mjs (per-seed kNN + dedup — closest to our model), issue_experiment.mjs
(issue→README matching), embed_readmes.mjs (Gemini embed + L2-normalize). Consult those
when in doubt about an algorithm detail; they are known-good.
2. Locked decisions (do not silently reverse)
- Standalone Python/FastAPI service. (Earlier drafts considered Go-in-appview and Node — both rejected. Don't reintroduce.)
- Search-per-seed + consensus, NOT clustering. Each of the user's repos is searched independently; a candidate several seeds agree on ranks higher. (An earlier clustering approach was intentionally dropped — simpler, no threshold to tune, better explanations.)
- Consume existing issue embeddings — the data teammate already ingests + embeds issues. We do NOT run an issue ingestion pipeline.
- Contract is fixed by `schema.md` (in the parent repo root) and the Go client `appview/state/discover_engine.go`. The wire format carries no `pulls`, `reasons`, `themes`, `score`, or good-first fields. Consensus/distance are used internally for ranking only and are never emitted.
3. The shared database (READ-ONLY)
- Postgres + pgvector on Google Cloud SQL (public IP, self-signed cert). Connection string is in `.env` as `DB_CONNECTION_STRING`; `app/config.py` auto-appends `sslmode=require` (the psycopg equivalent of the scripts' `rejectUnauthorized:false`).
- Boundaries: every existing table is READ-ONLY for us. The only writes we are ever authorized to make are the embedding columns of `tangled_readmes` (`embedding`/`embedding_model`/`embedded_at`) and our own `rec` schema (not used yet). Never insert/update/delete anything else.
- IP authorization: the DB only accepts authorized IPs. On this machine the IP is already authorized. On a fresh host: `gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me)`. If you can't connect, this is almost always why. (`gcloud` is NOT installed here.)
- The schema is alpha and moves: introspect to confirm before relying on a column.
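The `sslmode=require` auto-append mentioned above could look roughly like this. It is a minimal sketch, not the actual contents of `app/config.py`: the helper name `ensure_sslmode` is hypothetical, and it assumes the connection string is a URL-style DSN (`postgresql://...`) rather than a key/value string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_sslmode(dsn: str, mode: str = "require") -> str:
    """Return dsn with sslmode=<mode> appended unless one is already set.

    Hypothetical helper; assumes a URL-style DSN such as
    postgresql://user:pass@host:5432/dbname.
    """
    parts = urlsplit(dsn)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "sslmode" for key, _ in query):
        return dsn  # respect an explicit setting in .env
    query.append(("sslmode", mode))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With `sslmode=require` the connection is encrypted but the server certificate is not verified, which is why it pairs with a self-signed cert.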
Tables we use (key columns)
- `tangled_readmes` (main repo signal): `repo_did` (pk), `repo_uri`, `owner_handle`, `repo_name`, `content`, `embedding vector(1536)`, `embedding_model`, `status`. The repo OWNER did is parsed from `repo_uri` = `at://<owner_did>/sh.tangled.repo/<rkey>`. HNSW index on `embedding` with `vector_cosine_ops` (cosine = the metric).
- `tangled_open_issues` (VIEW, open issues only): `uri`, `rkey`, `repo_did`, `repo_uri`, `author_did`, `title`, `body`, `issue_created_at`, `embedding vector(1536)`, `record_raw`. (`tangled_issues` is the all-states table; we use the open view for recommendations.)
- `tangled_repos`: `repo_did`, `owner_did`, `rkey`, `name`, `owner_handle`, `record_raw` jsonb (has `topics`, `description`, `createdAt`, `repoDid`).
- `tangled_identities`: `did` → `handle` (used for the owner-handle fallback).
- `tangled_user_collaborations` (VIEW): `user_did` → `repo_did` (collab seeds; rare, ~240 rows).
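Parsing the owner DID out of `repo_uri`, and the owner-handle fallback through `tangled_identities`, can be sketched as follows. The function names and the in-memory `identities` mapping are assumptions made for illustration; the service's real helpers and SQL may differ.

```python
from typing import Mapping, Optional

REPO_COLLECTION = "sh.tangled.repo"


def parse_owner_did(repo_uri: str) -> Optional[str]:
    """Extract <owner_did> from at://<owner_did>/sh.tangled.repo/<rkey>."""
    prefix = "at://"
    if not repo_uri.startswith(prefix):
        return None
    parts = repo_uri[len(prefix):].split("/")
    # Expect exactly: owner DID, collection name, record key.
    if len(parts) != 3 or parts[1] != REPO_COLLECTION:
        return None
    if not parts[0] or not parts[2]:
        return None
    return parts[0]


def resolve_owner_handle(
    owner_handle: Optional[str],
    repo_uri: str,
    identities: Mapping[str, str],  # hypothetical did -> handle lookup
) -> Optional[str]:
    """Prefer the stored handle, else look the parsed owner DID up."""
    if owner_handle:
        return owner_handle
    did = parse_owner_did(repo_uri)
    if did is None:
        return None
    return identities.get(did)
```

A `None` result corresponds to an unresolvable owner, which the caller would drop.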
Embeddings (recipe: match EXACTLY if you ever embed anything new)
The service does NOT embed at runtime (it reads precomputed vectors). This recipe is here
for a future embedding catch-up job; the working impl is reference/src/embed_readmes.mjs.
- Model `gemini-embedding-001` via Gemini API (`generativelanguage.googleapis.com`), header `x-goog-api-key = GEMINI_API_KEY`.
- `outputDimensionality = 1536`.
- `taskType = RETRIEVAL_QUERY` for query text, `RETRIEVAL_DOCUMENT` for stored docs.
- L2-normalize every vector (sub-3072 MRL dims aren't auto-unit; the cosine index needs unit vectors).
- Vectors are passed to SQL as `%s::vector` text literals (`[v1,v2,...]`) and read back via `embedding::text`, exactly like the reference scripts. No pgvector-python adapter needed.
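The normalization and text-literal steps of the recipe can be sketched in plain Python. This is an illustrative sketch only; the working implementation is the Node script, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale vec to unit length (Euclidean norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format as the pgvector text literal '[v1,v2,...]' for %s::vector."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def from_vector_text(text: str) -> List[float]:
    """Parse the text returned by SELECT embedding::text."""
    body = text.strip()[1:-1]  # drop the surrounding brackets
    return [float(x) for x in body.split(",")] if body else []
```

A query would then bind the literal as an ordinary string parameter, for example `cur.execute("... ORDER BY embedding <=> %s::vector LIMIT %s", (to_vector_literal(seed), k))`, where the query shape is an assumption.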
4. Algorithm (in app/recommend.py)
- Seeds = the user's owned (`repo_uri like 'at://<did>/%'`) ∪ collaborated repos that have an embedded README (`db.load_seeds`).
- Per-seed kNN over README embeddings, excluding the user's own/collab repo_dids (`db.knn_repos`, `ORDER BY embedding <=> seed::vector`).
- Merge by candidate repo_did, keeping best (min) distance + the list of seeds that surfaced it = consensus (`app/merge.py`).
- Dedup forks by md5 of `content[:500]` (`app/dedup.py`); apply a distance floor.
- Rerank (`app/rank.py`): `DefaultScorer` = similarity + consensus + recency (+ popularity stub), behind a swappable `Scorer` Protocol; plus a round-robin-across-seeds guard so one busy interest can't bury a lone one.
- Issues: same flow over `tangled_open_issues`, also excluding issues the user authored and issues in the user's own repos.
- Shape to the contract (`app/links.py`, `app/profile.py`): interest chips from seed `record_raw.topics`; `@handle` owners; absolute repo URLs; RFC-3339 timestamps.
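The merge and fork-dedup steps above can be sketched as follows. The names `merge_hits`, `content_hash`, and `collapse_forks` come from the file map, but the signatures, the `Candidate` fields, and the `contents` mapping are assumptions for illustration, not the code in `app/merge.py` or `app/dedup.py`.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Candidate:  # hypothetical shape; the real one lives in types.py
    repo_did: str
    distance: float  # best (minimum) distance over all seeds
    seeds: List[str] = field(default_factory=list)  # seeds that surfaced it

    @property
    def consensus(self) -> int:
        return len(self.seeds)


def merge_hits(hits: Iterable[Tuple[str, str, float]]) -> List[Candidate]:
    """Merge (seed_did, candidate_repo_did, distance) triples by candidate."""
    merged: Dict[str, Candidate] = {}
    for seed_did, repo_did, distance in hits:
        cand = merged.get(repo_did)
        if cand is None:
            merged[repo_did] = Candidate(repo_did, distance, [seed_did])
            continue
        cand.distance = min(cand.distance, distance)
        if seed_did not in cand.seeds:
            cand.seeds.append(seed_did)
    return list(merged.values())


def content_hash(content: str) -> str:
    """md5 of the first 500 characters of a README."""
    return hashlib.md5(content[:500].encode("utf-8")).hexdigest()


def collapse_forks(
    cands: Iterable[Candidate], contents: Dict[str, str]
) -> List[Candidate]:
    """Keep only the closest candidate for each README hash."""
    kept: Dict[str, Candidate] = {}
    for cand in sorted(cands, key=lambda c: c.distance):
        content = contents.get(cand.repo_did, "")
        # Assumption: a repo with no content is never treated as a fork.
        key = content_hash(content) if content else cand.repo_did
        kept.setdefault(key, cand)
    return list(kept.values())
```

Both functions are pure, so they can be exercised with hand-written hits and no database.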
5. File map
app/
main.py FastAPI app + routes (/recommendations, /health) + CORS + startup log
config.py Settings from env/.env (DB conn, web base, tunable knobs); get_settings()
db.py psycopg3 pool + ALL read-only SQL (load_seeds, knn_repos, knn_issues,
open_issue_counts, embedding_counts, ping)
recommend.py orchestration: recommend(did)
merge.py PURE: merge_hits -> consensus candidates
dedup.py PURE: content_hash, collapse_forks
rank.py PURE: Scorer protocol, DefaultScorer, apply_floor, rerank(diversify)
profile.py PURE: build_interests from topics
links.py PURE: slugify, at_owner, repo_url, issue_list_url, to_rfc3339
schemas.py pydantic response models (wire keys match schema.md EXACTLY)
types.py Candidate dataclass
tests/ pytest: unit (pure modules, no DB) + test_integration.py (env-gated)
eval/harness.py offline held-out-seed retrieval: recall@k / nDCG
reference/src/ the validated Node .mjs oracle scripts (+ node_modules has `pg`)
API.md human API docs; README.md run/deploy; Dockerfile; .env / .env.example
The pure modules (merge/dedup/rank/profile/links/types) have no DB or network and are fully unit-tested — keep them that way so logic changes are testable in isolation.
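As an illustration of why purity helps, the round-robin guard from section 4 can be written and tested as a plain function. This is a sketch under assumptions: the name `round_robin` is hypothetical, each item is attributed to a single seed, and the input is already sorted best-first. The real `rerank` in `app/rank.py` may work differently.

```python
from collections import deque
from typing import Deque, Dict, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def round_robin(ranked: Iterable[Tuple[Hashable, T]]) -> List[T]:
    """Interleave items across seeds, keeping each seed's internal order.

    ranked: (seed, item) pairs already sorted best-first.
    """
    queues: Dict[Hashable, Deque[T]] = {}
    for seed, item in ranked:
        queues.setdefault(seed, deque()).append(item)

    out: List[T] = []
    while queues:
        # Take one item from every seed that still has some left.
        for seed in list(queues):
            out.append(queues[seed].popleft())
            if not queues[seed]:
                del queues[seed]
    return out
```

A seed with a single strong match surfaces in the first round instead of after every match of a busier seed.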
6. HTTP API (the contract)
Authoritative shape: ../../schema.md (parent repo) and API.md here. Summary:
- `GET /recommendations?handle=<did>&gh=<user>` → `{ profile, repos[], issues[] }`. `handle` is the user's DID. `gh` is accepted but ignored (no GitHub data). No `k` param: results come back pre-ranked and the frontend paginates 15/row. Empty user → `repos: []`.
- `GET /health` → `{ status, db }`.
Issues are special: the engine canNOT supply a reliable sequential issue number, so it
sends repoDid + rkey and the appview resolves the precise /issues/N URL from its
own SQLite issues table (falling back to the repo's issue list). This is implemented in the
parent repo: appview/state/discover_engine.go (engineIssue, resolveIssueLink),
appview/state/discover.go (passes s.db), tested in discover_engine_test.go. If you
change the issue wire shape here, update those three files + schema.md together.
7. Data realities / caveats (VERIFIED, not assumed)
These drive what we can honestly return — re-check with node/SQL if data has grown:
- READMEs: ~2,400 embedded (0 unembedded). Open issues: ~2,300 embedded. Grows daily; the service reads it live, so counts rise on their own.
- Repos are the real deliverable. Owner handle resolves for ~96% via `owner_handle` → fallback `repo_uri` owner_did → `tangled_identities`; the ~3.5% that are unresolvable are dropped. `repo_name` is never null.
- `stars` = 0, `comments` = 0: no source (`tangled_backlinks` is empty). Stubbed.
- `languages` = [], repo `language` = "": no language field in the shared DB.
- `lastActive` uses `record_raw.createdAt` (creation, not true last-activity; best available). Recency ranking uses the same value.
- Issues are emittable for ~32% of the corpus (repo identity resolves via `repo_uri`). Per user (filtered to their interests) that's a handful. The exact issue number (`record_raw->>'issueId'`) exists for only ~4% in the shared DB, which is why the number is resolved appview-side, not here.
- Seeds are dominated by owned repos; collaborations are rare.
8. Run / test / deploy
# setup (uv is the toolchain here; python 3.12)
uv venv --python 3.12 .venv
uv pip install --python .venv -e ".[dev]"
# run
.venv/bin/python -m uvicorn app.main:app --reload --port 8000
curl 'localhost:8000/health'
curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca' # 10-seed sample user
# docs: http://localhost:8000/docs
# test (unit always; integration auto-runs when DB_CONNECTION_STRING is set)
.venv/bin/python -m pytest tests/ -q
# offline eval baseline (needs DB)
.venv/bin/python eval/harness.py
# deploy
docker build -t tangled-rec . && docker run -p 8000:8000 --env-file .env tangled-rec
# then point the appview: TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations
Config knobs (env, all optional except the two secrets): see app/config.py /
.env.example — TANGLED_WEB_BASE, REC_PER_SEED_LIMIT, REC_DISTANCE_FLOOR,
REC_ISSUE_DISTANCE_FLOOR, REC_MAX_REPOS, REC_MAX_ISSUES.
9. Status & current baseline
- M0–M4 complete and verified: 23 pytest tests pass (18 pure-unit + 5 live integration incl. atproto/nix search sanity + own-repo exclusion). Appview Go side compiles + `go test ./appview/state/` passes.
- Eval baseline (before any tuning): recall@10 ≈ 0.22, recall@20 ≈ 0.23, recall@50 ≈ 0.37, nDCG ≈ 0.24 over 60 users. Re-run `eval/harness.py` and compare BEFORE/AFTER any ranking change; no "feels better" merges.
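For reference when reading those numbers, recall@k and nDCG under the usual binary-relevance definitions can be computed as below. This is a sketch of the standard formulas, not the code in `eval/harness.py`. How the harness builds the held-out `relevant` set and which cutoff it uses for nDCG are not stated here, so both are left as parameters.

```python
import math
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one.

    Assumes ranked contains no duplicate items.
    """
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the top item gets 1/log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Averaging these per-user values over the evaluation users gives a single number to compare before and after a ranking change.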
10. Environment gotchas (this machine)
- No `gcloud`, no Go toolchain, no `nix` installed by default. To verify the Go appview change, a Go 1.25 tarball was fetched to `/tmp/go` (ephemeral). `go.mod` requires go 1.25.
- The reference `.mjs` scripts need Node's `pg`; it lives in `reference/.../node_modules` / the folder's `node_modules`. Run them with `DB_CONNECTION_STRING` in env or `.env`.
- The Bash tool's working directory can reset between calls; use absolute paths or `cd` inside the same command.
- Secret: `DB_CONNECTION_STRING` lives in `.env` (gitignored) and is the only var the service needs. (`GEMINI_API_KEY` is only for the Node reference embedding scripts, not the service.) Never commit secrets or paste them into docs/code.
11. Do NOT
- Write to any shared table except the `tangled_readmes` embedding columns (or a `rec` schema).
- Re-add clustering, or emit `pulls`/`reasons`/`themes`/`score`/good-first in the API.
- Hardcode `https://tangled.org`; use `settings.web_base` (`TANGLED_WEB_BASE`).
- Change the issue wire shape without updating the appview Go files + `schema.md` together.
- Fabricate `stars`/`comments`/`language`; they're honest stubs until a data source exists.