CLAUDE.md: Tangled Recommendation Engine

Context for any Claude session working in this folder. This is a standalone Python/FastAPI service (it will be lifted into its own repo and hosted separately). Read this top-to-bottom before changing anything.


1. What this is

The recommendation backend for Tangled's Discover (contribution-discovery) feature. Given a user's DID it returns repo + issue recommendations. It reads README/issue embeddings (precomputed by the data teammate) from a shared Postgres + pgvector database and reranks them. The Tangled web app ("appview", a separate Go service) calls this over HTTP and renders the results. The service makes no external API calls at runtime — it only reads the DB.

Tangled appview ──HTTP(handle,gh)──► THIS service ──► shared Postgres+pgvector (READ-ONLY)
 (Go, separate repo)                 (Python/FastAPI)

Semantic free-text search (GET /search) was built then removed at the user's request (the Discover UI only consumes /recommendations). It's easy to re-add: embed the query with Gemini (RETRIEVAL_QUERY) and run the same kNN/merge/shape pipeline with a single "query" seed. The Node reference/src/issue_search.mjs shows the approach.

It was ported from validated Node scripts in reference/src/*.mjs (the "oracle"): similar_repos.mjs (per-seed kNN + dedup — closest to our model), issue_experiment.mjs (issue→README matching), embed_readmes.mjs (Gemini embed + L2-normalize). Consult those when in doubt about an algorithm detail; they are known-good.

2. Locked decisions (do not silently reverse)

  • Standalone Python/FastAPI service. (Earlier drafts considered Go-in-appview and Node — both rejected. Don't reintroduce.)
  • Search-per-seed + consensus, NOT clustering. Each of the user's repos is searched independently; a candidate several seeds agree on ranks higher. (An earlier clustering approach was intentionally dropped — simpler, no threshold to tune, better explanations.)
  • Consume existing issue embeddings — the data teammate already ingests + embeds issues. We do NOT run an issue ingestion pipeline.
  • Contract is fixed by schema.md (in the parent repo root) and the Go client appview/state/discover_engine.go. The wire format carries no pulls, reasons, themes, score, or good-first fields. Consensus/distance are used internally for ranking only — never emitted.

3. The shared database (READ-ONLY)

  • Postgres + pgvector on Google Cloud SQL (public IP, self-signed cert). Connection string is in .env as DB_CONNECTION_STRING; app/config.py auto-appends sslmode=require (the psycopg equivalent of the scripts' rejectUnauthorized:false).
  • Boundaries: every existing table is READ-ONLY for us. The only writes we are ever authorized to make are the embedding columns of tangled_readmes (embedding/embedding_model/embedded_at) and our own rec schema (not used yet). Never insert/update/delete anything else.
  • IP authorization: the DB only accepts authorized IPs. On this machine the IP is already authorized. On a fresh host: gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me). If you can't connect, this is almost always why. (gcloud is NOT installed here.)
  • The schema is alpha and moves — introspect to confirm before relying on a column.
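
As an illustration of the sslmode auto-append described above, here is a minimal sketch. It is not the code in app/config.py: the helper name ensure_sslmode is hypothetical, and it assumes DB_CONNECTION_STRING is a postgresql:// URI rather than a key=value DSN.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_sslmode(conn_str: str, mode: str = "require") -> str:
    """Return conn_str with sslmode=<mode> added unless one is already set.

    Hypothetical helper; assumes a postgresql:// URI, not a key=value DSN.
    """
    parts = urlsplit(conn_str)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "sslmode" for key, _ in query):
        return conn_str  # respect an explicit setting
    query.append(("sslmode", mode))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With libpq-style drivers, sslmode=require encrypts the connection without verifying the server certificate, which is why it works against the self-signed cert.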

Tables we use (key columns)

  • tangled_readmes (main repo signal): repo_did (pk), repo_uri, owner_handle, repo_name, content, embedding vector(1536), embedding_model, status. The repo OWNER did is parsed from repo_uri = at://<owner_did>/sh.tangled.repo/<rkey>. HNSW index on embedding with vector_cosine_ops (cosine = the metric).
  • tangled_open_issues (VIEW, open issues only): uri, rkey, repo_did, repo_uri, author_did, title, body, issue_created_at, embedding vector(1536), record_raw. (tangled_issues is the all-states table; we use the open view for recommendations.)
  • tangled_repos: repo_did, owner_did, rkey, name, owner_handle, record_raw jsonb (has topics, description, createdAt, repoDid).
  • tangled_identities: did → handle (used for the owner-handle fallback).
  • tangled_user_collaborations (VIEW): user_did → repo_did (collab seeds; rare, ~240 rows).
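
The owner DID parse from repo_uri mentioned for tangled_readmes can be sketched as below. The function name parse_repo_uri is hypothetical (the service's own helper may differ); the URI layout at://<owner_did>/sh.tangled.repo/<rkey> is the one stated above.

```python
def parse_repo_uri(repo_uri: str) -> tuple[str, str]:
    """Split at://<owner_did>/sh.tangled.repo/<rkey> into (owner_did, rkey).

    Hypothetical helper; raises ValueError on any other shape.
    """
    prefix = "at://"
    if not repo_uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {repo_uri!r}")
    parts = repo_uri[len(prefix):].split("/")
    if len(parts) != 3 or parts[1] != "sh.tangled.repo" or not parts[0] or not parts[2]:
        raise ValueError(f"unexpected repo URI shape: {repo_uri!r}")
    owner_did, _collection, rkey = parts
    return owner_did, rkey
```

Failing loudly on an unexpected shape fits the note that the schema is alpha: a silent mis-parse would feed a wrong DID into the tangled_identities fallback.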

Embeddings (recipe — match EXACTLY if you ever embed anything new)

The service does NOT embed at runtime (it reads precomputed vectors). This recipe is here for a future embedding catch-up job; the working impl is reference/src/embed_readmes.mjs.

  • Model gemini-embedding-001 via Gemini API (generativelanguage.googleapis.com), header x-goog-api-key = GEMINI_API_KEY. outputDimensionality = 1536.
  • taskType = RETRIEVAL_QUERY for query text, RETRIEVAL_DOCUMENT for stored docs.
  • L2-normalize every vector (MRL outputs truncated below the full 3072 dimensions are not returned unit-length; the cosine index needs unit vectors).
  • Vectors are passed to SQL as %s::vector text literals ([v1,v2,...]) and read back via embedding::text — exactly like the reference scripts. No pgvector-python adapter needed.
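
The normalization and text-literal steps of the recipe can be sketched in plain Python. The function names are hypothetical, and the working implementation remains reference/src/embed_readmes.mjs; this only shows the arithmetic and the [v1,v2,...] format.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def to_vector_literal(vec: list[float]) -> str:
    """Format a vector as the pgvector text literal [v1,v2,...]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def from_vector_text(text: str) -> list[float]:
    """Parse the string returned by embedding::text back into floats."""
    return [float(x) for x in text.strip()[1:-1].split(",")]
```

A literal built this way would be bound as an ordinary string parameter, for example to a query ending in ORDER BY embedding <=> %s::vector, so no pgvector-python adapter is involved.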

4. Algorithm (in app/recommend.py)

  1. Seeds = the user's owned (repo_uri like 'at://<did>/%') ∪ collaborated repos that have an embedded README (db.load_seeds).
  2. Per-seed kNN over README embeddings, excluding the user's own/collab repo_dids (db.knn_repos, ORDER BY embedding <=> seed::vector).
  3. Merge by candidate repo_did, keeping best (min) distance + the list of seeds that surfaced it = consensus (app/merge.py).
  4. Dedup forks by md5 of content[:500] (app/dedup.py); apply a distance floor.
  5. Rerank (app/rank.py): DefaultScorer = similarity + consensus + recency (+ popularity stub), behind a swappable Scorer Protocol; plus a round-robin-across-seeds guard so one busy interest can't bury a lone one.
  6. Issues: same flow over tangled_open_issues, also excluding issues the user authored and issues in the user's own repos.
  7. Shape to the contract (app/links.py, app/profile.py): interest chips from seed record_raw.topics; @handle owners; absolute repo URLs; RFC-3339 timestamps.
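
Steps 3 and 4 above (merge by consensus, then collapse forks) can be sketched as follows. This is an illustration, not the contents of app/merge.py or app/dedup.py: the Candidate fields, the hit tuple layout, and the tie-breaking rule (keep the closest copy of a fork) are assumptions. The distance floor and the round-robin guard are left out.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Assumed shape; the real dataclass lives in app/types.py."""
    repo_did: str
    content: str
    distance: float                                  # best (minimum) distance seen
    seeds: list[str] = field(default_factory=list)   # seeds that surfaced this repo

    @property
    def consensus(self) -> int:
        return len(self.seeds)


def merge_hits(hits_by_seed: dict[str, list[tuple[str, str, float]]]) -> list[Candidate]:
    """Merge per-seed kNN hits: seed -> [(repo_did, content, distance), ...]."""
    merged: dict[str, Candidate] = {}
    for seed, hits in hits_by_seed.items():
        for repo_did, content, distance in hits:
            cand = merged.setdefault(repo_did, Candidate(repo_did, content, distance))
            cand.distance = min(cand.distance, distance)
            if seed not in cand.seeds:
                cand.seeds.append(seed)
    return list(merged.values())


def content_hash(content: str) -> str:
    """md5 of the first 500 characters, used only as a fork fingerprint."""
    return hashlib.md5(content[:500].encode("utf-8"), usedforsecurity=False).hexdigest()


def collapse_forks(cands: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per content hash (assumption: the closest one wins)."""
    best: dict[str, Candidate] = {}
    for cand in cands:
        key = content_hash(cand.content)
        if key not in best or cand.distance < best[key].distance:
            best[key] = cand
    return list(best.values())
```

Because both functions take and return plain data, they can be unit-tested without a database, which is the property the pure modules are meant to keep.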

5. File map

app/
  main.py        FastAPI app + routes (/recommendations, /health) + CORS + startup log
  config.py      Settings from env/.env (DB conn, web base, tunable knobs); get_settings()
  db.py          psycopg3 pool + ALL read-only SQL (load_seeds, knn_repos, knn_issues,
                 open_issue_counts, embedding_counts, ping)
  recommend.py   orchestration: recommend(did)
  merge.py       PURE: merge_hits -> consensus candidates
  dedup.py       PURE: content_hash, collapse_forks
  rank.py        PURE: Scorer protocol, DefaultScorer, apply_floor, rerank(diversify)
  profile.py     PURE: build_interests from topics
  links.py       PURE: slugify, at_owner, repo_url, issue_list_url, to_rfc3339
  schemas.py     pydantic response models (wire keys match schema.md EXACTLY)
  types.py       Candidate dataclass
tests/           pytest: unit (pure modules, no DB) + test_integration.py (env-gated)
eval/harness.py  offline held-out-seed retrieval: recall@k / nDCG
reference/src/   the validated Node .mjs oracle scripts (+ node_modules has `pg`)
API.md           human API docs;  README.md  run/deploy;  Dockerfile;  .env / .env.example

The pure modules (merge/dedup/rank/profile/links/types) have no DB or network and are fully unit-tested — keep them that way so logic changes are testable in isolation.
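
To show what a pure, swappable scorer looks like in practice, here is a minimal sketch of a Scorer Protocol with one implementation. It is not the DefaultScorer in app/rank.py: the ScoredInput fields, the weights, and the half-life recency decay are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ScoredInput:
    """Hypothetical scorer input; field names are assumptions."""
    repo_did: str
    distance: float    # cosine distance to the nearest seed
    consensus: int     # number of seeds that surfaced the candidate
    age_days: float    # days since record_raw.createdAt


class Scorer(Protocol):
    def score(self, cand: ScoredInput) -> float: ...


@dataclass(frozen=True)
class WeightedScorer:
    """Similarity + consensus + recency with illustrative weights."""
    w_similarity: float = 1.0
    w_consensus: float = 0.1
    w_recency: float = 0.1
    half_life_days: float = 365.0

    def score(self, cand: ScoredInput) -> float:
        similarity = 1.0 - cand.distance
        recency = 0.5 ** (cand.age_days / self.half_life_days)
        return (self.w_similarity * similarity
                + self.w_consensus * cand.consensus
                + self.w_recency * recency)


def rerank(cands: list[ScoredInput], scorer: Scorer) -> list[ScoredInput]:
    """Sort best-first using whichever scorer is passed in."""
    return sorted(cands, key=scorer.score, reverse=True)
```

Any object with a matching score method satisfies the Protocol, so a ranking experiment can swap scorers in a unit test and in eval/harness.py without touching the database layer.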

6. HTTP API (the contract)

Authoritative shape: ../../schema.md (parent repo) and API.md here. Summary:

  • GET /recommendations?handle=<did>&gh=<user> → { profile, repos[], issues[] }. handle is the user's DID. gh is accepted but ignored (no GitHub data). There is no k param: results come back pre-ranked and the frontend paginates 15/row. Empty user → repos: [].
  • GET /health → { status, db }.
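
For quick manual checks from Python, a client call against the endpoints above might look like this. The helper names are hypothetical and the base URL is whatever host the service runs on; only the query parameters handle and gh and the top-level keys profile, repos, and issues come from the contract summary.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def recommendations_url(base: str, did: str, gh: str = "") -> str:
    """Build the GET /recommendations URL; gh is optional and ignored server-side."""
    params = {"handle": did}
    if gh:
        params["gh"] = gh
    return f"{base.rstrip('/')}/recommendations?{urlencode(params)}"


def fetch_recommendations(base: str, did: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON body (expected keys: profile, repos, issues)."""
    with urlopen(recommendations_url(base, did), timeout=timeout) as resp:
        return json.load(resp)
```

urlencode percent-encodes the colons in the DID, which is the safe form for a query string.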

Issues are special: the engine canNOT supply a reliable sequential issue number, so it sends repoDid + rkey and the appview resolves the precise /issues/N URL from its own SQLite issues table (falling back to the repo's issue list). This is implemented in the parent repo: appview/state/discover_engine.go (engineIssue, resolveIssueLink), appview/state/discover.go (passes s.db), tested in discover_engine_test.go. If you change the issue wire shape here, update those three files + schema.md together.

7. Data realities / caveats (VERIFIED, not assumed)

These drive what we can honestly return — re-check with node/SQL if data has grown:

  • READMEs: ~2,400 embedded (0 unembedded). Open issues: ~2,300 embedded. Grows daily — the service reads it live, so counts rise on their own.
  • Repos are the real deliverable. Owner handle resolves for ~96% via owner_handle → fallback repo_uri owner_did → tangled_identities; ~3.5% unresolvable are dropped. repo_name is never null.
  • stars = 0, comments = 0 — no source (tangled_backlinks is empty). Stubbed.
  • languages = [], repo language = "" — no language field in the shared DB.
  • lastActive uses record_raw.createdAt (creation, not true last-activity — best available). Recency ranking uses the same value.
  • Issues are emittable for ~32% of the corpus (repo identity resolves via repo_uri). Per user (filtered to their interests) that's a handful. The exact issue number (record_raw->>'issueId') exists for only ~4% in the shared DB → that's why the number is resolved appview-side, not here.
  • Seeds are dominated by owned repos; collaborations are rare.

8. Run / test / deploy

# setup (uv is the toolchain here; python 3.12)
uv venv --python 3.12 .venv
uv pip install --python .venv -e ".[dev]"

# run
.venv/bin/python -m uvicorn app.main:app --reload --port 8000
curl 'localhost:8000/health'
curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca'   # 10-seed sample user
# docs: http://localhost:8000/docs

# test  (unit always; integration auto-runs when DB_CONNECTION_STRING is set)
.venv/bin/python -m pytest tests/ -q

# offline eval baseline (needs DB)
.venv/bin/python eval/harness.py

# deploy
docker build -t tangled-rec . && docker run -p 8000:8000 --env-file .env tangled-rec
# then point the appview:  TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations

Config knobs (env, all optional except the two secrets; see app/config.py / .env.example): TANGLED_WEB_BASE, REC_PER_SEED_LIMIT, REC_DISTANCE_FLOOR, REC_ISSUE_DISTANCE_FLOOR, REC_MAX_REPOS, REC_MAX_ISSUES.

9. Status & current baseline

  • M0–M4 complete and verified: 23 pytest tests pass (18 pure-unit + 5 live integration incl. atproto/nix search sanity + own-repo exclusion). Appview Go side compiles + go test ./appview/state/ passes.
  • Eval baseline (before any tuning): recall@10 ≈ 0.22, recall@20 ≈ 0.23, recall@50 ≈ 0.37, nDCG ≈ 0.24 over 60 users. Re-run eval/harness.py and compare BEFORE/AFTER any ranking change — no "feels better" merges.
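
For reference when reading the baseline numbers, the two metrics can be computed as below. This is a sketch with binary relevance, not a copy of eval/harness.py; the harness may define relevance, k, or averaging differently, so treat the function names and details as assumptions.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Recall only asks whether a held-out repo made the top k, while nDCG also rewards placing it near the top, so a ranking change can move one metric without moving the other.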

10. Environment gotchas (this machine)

  • No gcloud, no Go toolchain, no nix installed by default. To verify the Go appview change, a Go 1.25 tarball was fetched to /tmp/go (ephemeral). go.mod requires go 1.25.
  • The reference .mjs scripts need Node's pg — it lives in reference/.../node_modules / the folder's node_modules. Run them with DB_CONNECTION_STRING in env or .env.
  • The Bash tool's working directory can reset between calls — use absolute paths or cd inside the same command.
  • Secret: DB_CONNECTION_STRING lives in .env (gitignored) — the only var the service needs. (GEMINI_API_KEY is only for the Node reference embedding scripts, not the service.) Never commit secrets or paste them into docs/code.

11. Do NOT

  • Write to any shared table except the tangled_readmes embedding columns (or a rec schema).
  • Re-add clustering, or emit pulls/reasons/themes/score/good-first in the API.
  • Hardcode https://tangled.org — use settings.web_base (TANGLED_WEB_BASE).
  • Change the issue wire shape without updating the appview Go files + schema.md together.
  • Fabricate stars/comments/language — they're honest stubs until a data source exists.