# CLAUDE.md — Integration Repo

> Project memory for Claude Code working in this repo. Keep it short and high-signal.
> Conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.

## What this repo is

The **integration** repo is the backend / data layer for a Tangled (AT Protocol) discovery
product. Its job, end to end:

1. **Ingest** repos and activity from the Tangled network (ATProto records).
2. **Store** them as a synced mirror in Postgres.
3. **Embed** repo/issue text and maintain a vector index.
4. **Recommend** relevant repos and issues to a user based on their past activity, and
   expose those recommendations (plus read APIs) over HTTP.

The **frontend is a separate repo**. This repo contains **no UI** — it exposes a JSON/HTTP
API that the frontend consumes. Do not add view/template/component code here. If a task
implies UI work, it belongs in the frontend repo, not this one.

## Repository layout

- `scraper/` — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots,
  PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/` — Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/` — Postgres + pgvector schema: the `tangled_*` tables/views.
- `recommendation/` — the **Discover** recommendation engine: a standalone **Python/FastAPI**
  service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has
  its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/` — the pre-port Node (`.mjs`) version of the rec scripts, superseded by
  `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept
  for reference, not run.

## Tech stack

- Language/runtime: **Python** across the live services (ingestion + recommendation). The
  earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: **Postgres** with the **pgvector** extension (records + relationships + embeddings in one
  DB), schema managed via **Supabase** migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs);
  identity via the PLC directory.
- Embeddings: **Gemini `gemini-embedding-001`**, 1536-dim, L2-normalized, stored in pgvector
  (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python `recommendation/` service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.

## Domain model — read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

- **Knots** host the actual **git data** (code, refs). Self-hostable git servers.
- **PDS** (Personal Data Service) holds the **collaboration metadata** as ATProto records:
  issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest **metadata from PDSes**. We do **not** need git code for recommendations — repo
descriptions, READMEs, and issue/PR text are the signal. **READMEs are the primary text
signal for repo recommendations** (see Embedding conventions) and are fetched live from the
**knot** (not the PDS), since no README content is stored in Postgres.

- Fetch via the knot XRPC `sh.tangled.repo.tree` query:
  `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref`
  omitted the knot uses the repo's default branch and returns a top-level `readme` object
  whose `contents` holds the rendered README (it resolves any extension — `.md`, `.org`,
  `.rst`, …). Address by the **knot-minted `repoDid`** (`record_raw->>'repoDid'`), not the
  owner DID.
- **Coverage (measured 2026-06-24):** ~79% of *reachable* repos have a README (758/959);
  ~57% of all repoDid-addressable repos are confirmed to have one (the rest are knot 404s /
  unreachable self-hosted knots, which are *unknown*, not README-less). ~30% of repos in the
  DB have no knot-minted `repoDid` at all and can't be addressed on a knot — embed those from
  metadata only.
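
As a rough sketch of the fetch described above, the helpers below build the `sh.tangled.repo.tree` URL and pull `readme.contents` out of an already-decoded JSON response. The function names are illustrative assumptions, not existing code in this repo, and the only response shape relied on is the top-level `readme` object with a `contents` field; verify against the live lexicon before depending on anything else.

```python
from __future__ import annotations

from urllib.parse import urlencode


def build_tree_url(knot_hostname: str, repo_did: str) -> str:
    """URL for the knot's sh.tangled.repo.tree query at the repo root.

    `ref` is deliberately omitted so the knot falls back to the default branch.
    """
    query = urlencode({"repo": repo_did, "path": ""})
    return f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree?{query}"


def extract_readme(payload: dict) -> str | None:
    """Return the README text from a decoded tree response, or None if absent.

    None means "no README in this response"; a 404 or unreachable knot should be
    recorded separately as *unknown*, not as README-less.
    """
    readme = payload.get("readme")
    if not isinstance(readme, dict):
        return None
    contents = readme.get("contents")
    if isinstance(contents, str) and contents.strip():
        return contents
    return None
```

The HTTP call itself is left out on purpose, so the sketch stays independent of whichever client the scraper uses.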

Every record is addressed by an AT-URI: `at://<did>/<collection>/<rkey>`.
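
A minimal sketch of splitting and rebuilding that address, assuming the three-part record form shown above (DID authority, collection NSID, record key). `AtUri`, `parse_at_uri`, and `format_at_uri` are hypothetical helper names used only for illustration.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def parse_at_uri(uri: str) -> AtUri:
    """Split at://<did>/<collection>/<rkey> into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>, got {uri!r}")
    return AtUri(*parts)


def format_at_uri(did: str, collection: str, rkey: str) -> str:
    """Inverse of parse_at_uri; handy as the idempotency key for upserts."""
    return f"at://{did}/{collection}/{rkey}"
```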

### Collections (NSIDs) we care about

- `sh.tangled.repo` — repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull` — pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star` — stars
- `sh.tangled.git.refUpdate` — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live
lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now
carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

## Ingestion design

Two complementary paths — keep both working:

- **Real-time: Jetstream.** Subscribe to a public Jetstream instance with `wantedCollections`
  set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- **Backfill: `listRecords`.** For each known DID, call `com.atproto.repo.listRecords` against
  its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream
  over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.
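
The backfill loop can be sketched as a cursor-following generator. This is an illustration under assumptions: `fetch_page` is a hypothetical callable that performs the `com.atproto.repo.listRecords` request for the given query parameters and returns the decoded JSON (a `records` list plus an optional `cursor`); the real scraper may structure this differently.

```python
from typing import Callable, Iterator, Optional

# Hypothetical transport hook: takes query params, returns the decoded JSON page.
FetchPage = Callable[[dict], dict]


def iter_records(
    fetch_page: FetchPage,
    did: str,
    collection: str,
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every record in one collection of one repo, following the cursor."""
    cursor: Optional[str] = None
    while True:
        params = {"repo": did, "collection": collection, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        records = page.get("records") or []
        yield from records
        cursor = page.get("cursor")
        # Stop when the server gives no cursor or an empty page.
        if not cursor or not records:
            return
```

Injecting the transport keeps the pagination logic testable without a live PDS; call it once per collection for each known DID.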

### Non-negotiable ingestion rules

- **Mirror semantics, not append-only.** Records get edited and deleted. Handle Jetstream
  `create`/`update` as **upsert** and `delete` as **soft-delete / tombstone**. Never assume
  a record seen once is permanent.
- **Resolve identity.** Records reference DIDs. Resolve DID → PDS endpoint and DID → handle
  via the PLC directory; cache it. Don't hardcode PDS hosts.
- **Coverage caveat.** Self-hosted PDSes/knots only appear if the relay crawls them. Hosted
  instances and Bluesky-network accounts are well covered; full-network coverage is not
  guaranteed. Don't treat absence as deletion.
- **Idempotency.** Ingestion must be safely replayable (reconnects, backfills overlapping the
  live stream). Key on AT-URI.
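
To make the mirror and idempotency rules concrete, here is a minimal in-memory sketch. It assumes Jetstream commit events shaped as `{"did", "time_us", "kind": "commit", "commit": {"operation", "collection", "rkey", "record", "cid"}}`; the dict-based `mirror` stands in for the Postgres tables, and the `time_us` guard against stale replays is one possible design choice, not something this repo is known to implement.

```python
from __future__ import annotations


def apply_commit(mirror: dict[str, dict], event: dict) -> None:
    """Apply one Jetstream commit event to a mirror keyed on AT-URI."""
    if event.get("kind") != "commit":
        return  # identity/account events are handled elsewhere
    commit = event["commit"]
    uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    time_us = event.get("time_us", 0)

    existing = mirror.get(uri)
    if existing is not None and existing["time_us"] > time_us:
        return  # stale replay: a newer event for this URI was already applied

    operation = commit["operation"]
    if operation in ("create", "update"):
        # Upsert: the latest version of the record replaces whatever was stored.
        mirror[uri] = {
            "record": commit.get("record"),
            "cid": commit.get("cid"),
            "deleted": False,
            "time_us": time_us,
        }
    elif operation == "delete":
        # Tombstone instead of removing the row, so the deletion is remembered.
        row = existing or {"record": None, "cid": None}
        mirror[uri] = {**row, "deleted": True, "time_us": time_us}
```

Replaying the same events, in order or with older ones arriving late, leaves the mirror in the same state, which is the property the overlap between backfill and the live stream depends on.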

## Recommendation design

**Two-stage: retrieve, then rank.** Do not ship a single averaged "user vector" + kNN as the
whole system — it loses multi-interest structure and ignores quality/recency/social signal.

1. **Candidate generation** (high-recall, union the sources):
   - **Embedding kNN** — query with the user's *recent* interactions individually, or cluster
     their history into a few interest centroids and query each. Never collapse to one averaged vector.
   - **Collaborative / co-occurrence** — "users who starred X also starred Y" from the star and
     contribution matrices.
   - **Social graph** (our edge on ATProto) — "repos starred by people you follow", "repos your
     collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
2. **Ranking** — start with a tunable weighted sum (embedding similarity + recency + popularity +
   social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once
   there's engagement data. Keep the scorer behind an interface so it's replaceable.
3. **Rules** — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR);
   favor freshness.
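
The ranking and diversity steps above can be sketched as a weighted-sum scorer followed by greedy MMR. Everything here is illustrative: the `Candidate` shape, the feature names, the weights, and `lam` are assumptions to be tuned against the eval harness, not values taken from the `recommendation/` service.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str
    embedding: tuple[float, ...]
    features: dict[str, float]  # each feature pre-scaled to roughly [0, 1]


# Hypothetical starting weights; these are the tunable part of the scorer.
WEIGHTS = {"similarity": 0.5, "recency": 0.2, "popularity": 0.1, "social": 0.2}


def weighted_score(candidate: Candidate, weights: dict[str, float] = WEIGHTS) -> float:
    """Tunable weighted sum; a missing feature contributes zero."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in weights.items())


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(candidates: list[Candidate], k: int, lam: float = 0.7) -> list[Candidate]:
    """Greedy MMR: trade relevance against similarity to items already picked."""
    remaining = list(candidates)
    picked: list[Candidate] = []
    while remaining and len(picked) < k:

        def mmr(c: Candidate) -> float:
            redundancy = max(
                (cosine(c.embedding, p.embedding) for p in picked), default=0.0
            )
            return lam * weighted_score(c) - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        picked.append(best)
        remaining.remove(best)
    return picked
```

Keeping `weighted_score` as a plain function makes it easy to swap for a learned ranker later without touching the MMR step.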

### Embedding conventions

- **Repo doc = the README** (fetched live from the knot — see Domain model), as the primary
  text we embed. Prepend the repo `name` + `description` and append `topics` + primary
  `language` as light context, but the README body is the core signal.
- **Fallback when no README** (knot 404 / unreachable / repo has no `repoDid`): embed
  `name + description + topics + primary language` only. ~57–79% of repos have a README;
  the rest rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates
  (incl. when a previously-missing README becomes available).
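
A minimal sketch of assembling the repo document, covering both the README case and the metadata-only fallback. The function name, the `topics:` / `language:` labels, and the blank-line separators are assumptions for illustration; the actual scraper may format the text differently.

```python
from __future__ import annotations

from typing import Sequence


def build_repo_doc(
    name: str,
    description: str | None = None,
    readme: str | None = None,
    topics: Sequence[str] = (),
    language: str | None = None,
) -> str:
    """Text to embed for a repo: light context around the README body.

    With no README the same function yields the metadata-only fallback.
    """
    header = "\n".join(part for part in (name, description) if part)
    body = (readme or "").strip()
    footer_lines = []
    if topics:
        footer_lines.append("topics: " + ", ".join(topics))
    if language:
        footer_lines.append("language: " + language)
    footer = "\n".join(footer_lines)
    # Drop empty sections so the fallback doc has no dangling separators.
    return "\n\n".join(section for section in (header, body, footer) if section)
```

Using one builder for both paths means a repo whose README appears later only changes the `readme` argument, which is a natural trigger for re-embedding.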

### Required, don't skip

- **Cold start** — users with no history fall back to trending / follows-based / onboarding interests.
- **Eval harness** — hold out each user's most recent interactions; measure recall@k / nDCG offline
  before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.
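
For reference, the two offline metrics can be sketched as below, assuming binary relevance (a held-out item is either relevant or not) and a ranked list without duplicates. The function names are illustrative, not the harness's actual API.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of held-out items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Here `ranked` would be the recommender's output for a user and `relevant` the URIs of that user's held-out most recent interactions.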

## Data layout

The live schema lives in `supabase/migrations/`; the `tangled_*` tables are the source of
truth (not the generic names below). Key ones the rec engine reads (see
`recommendation/CLAUDE.md` for full columns): `tangled_readmes` (repo signal + `embedding`),
`tangled_open_issues` (view), `tangled_repos`, `tangled_identities` (did→handle),
`tangled_user_collaborations` (view). Embeddings are stored inline on the record rows
(`embedding vector(1536)` + `embedding_model`), not in a separate table.

## Commands

Each service has its own setup; see the per-folder docs. DB connection comes from
`DB_CONNECTION_STRING` (`.env`).
- Scraper (ingest / backfill / embed): see `scraper/README.md` — `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000`
  (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.

## Conventions

- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.

## Out of scope (do not do here)

- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch
  READMEs on demand.
"""Tangled issue investigation agent."""

__all__ = [
    "AgentState",
    "AnthropicCacheSettings",
    "IssueSessionContext",
    "build_agent_graph",
    "build_issue_agent_graph",
    "create_anthropic_model",
    "load_issue_context",
    "run_agent",
    "run_issue_agent",
]


def __getattr__(name: str):
    if name in {
        "AgentState",
        "AnthropicCacheSettings",
        "build_agent_graph",
        "build_issue_agent_graph",
        "create_anthropic_model",
        "run_agent",
        "run_issue_agent",
    }:
        from agent import agent as _agent

        return getattr(_agent, name)
    if name == "IssueSessionContext":
        from agent.context import IssueSessionContext

        return IssueSessionContext
    if name == "load_issue_context":
        from agent.load_issue import load_issue_context

        return load_issue_context
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
agent/agent.py
#!/usr/bin/env python3
"""LangGraph agent loop on Anthropic with prompt caching.

Core pieces:
  - ``AgentState`` — message list reducer state
  - ``create_anthropic_model`` — ChatAnthropic factory
  - ``cached_system_message`` / ``AnthropicCacheSettings`` — explicit + automatic caching
  - ``build_agent_graph`` — agent → (tools) → agent loop
  - ``run_agent`` — single-turn or threaded invoke helper
"""

from __future__ import annotations

import json
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Annotated, Any, Literal, Sequence

from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)
from langchain_core.tools import BaseTool
from langgraph.checkpoint.base import BaseCheckpointSaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict

from agent.context import IssueSessionContext, build_issue_system_prompt
from agent.load_issue import load_issue_context
from agent.questionnaire_store import parse_questionnaire_json, save_questionnaire
from agent.questionnaire_repo_store import publish_to_repo, publishing_enabled
from agent.questionnaire_prompt import build_questionnaire_system_prompt
from agent.tools import make_file_tools

REPO_ROOT = Path(__file__).resolve().parent.parent

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful assistant for the Sunstead / Tangled hackathon stack.

You can reason about:
- Tangled repos, issues, and README embeddings in Postgres
- The recommendation API (DID → ranked repos/issues)
- The daily scraper that ingests Tangled network data

Be concise and actionable. Use tools when they help answer factual questions.
"""

CacheTTL = Literal["5m", "1h"]


@dataclass(frozen=True)
class AnthropicCacheSettings:
    """Anthropic prompt cache configuration.

    We use two layers (both are valid together):
      1. Explicit ``cache_control`` on the static system block (always cached).
      2. Automatic ``cache_control`` on each ``model.invoke`` call so tools +
         conversation prefix are cached on Anthropic's side (breakpoint moves
         forward as the thread grows).
    """

    type: Literal["ephemeral"] = "ephemeral"
    ttl: CacheTTL = "5m"

    def as_api_dict(self) -> dict[str, str]:
        return {"type": self.type, "ttl": self.ttl}


class AgentState(TypedDict):
    """Graph state: append-only message history."""

    messages: Annotated[list[BaseMessage], add_messages]


def load_env() -> None:
    for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            return
    load_dotenv()


def require_anthropic_api_key() -> str:
    key = os.getenv("ANTHROPIC_API_KEY", "").strip()
    if not key:
        print("ERROR: ANTHROPIC_API_KEY is not set", file=sys.stderr)
        raise SystemExit(1)
    return key


def cached_system_message(
    text: str,
    *,
    cache: AnthropicCacheSettings | None = None,
) -> SystemMessage:
    """System prompt block with explicit Anthropic ``cache_control``."""
    cache = cache or AnthropicCacheSettings()
    return SystemMessage(
        content=[
            {
                "type": "text",
                "text": text,
                "cache_control": cache.as_api_dict(),
            }
        ]
    )


def _tag_last_content_block(
    message: BaseMessage,
    cache: AnthropicCacheSettings,
) -> BaseMessage:
    """Add ``cache_control`` to the last text block of a message (explicit breakpoint)."""
    content = message.content
    if isinstance(content, str):
        return message.model_copy(
            update={
                "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": cache.as_api_dict(),
                    }
                ]
            }
        )
    if not isinstance(content, list) or not content:
        return message
    blocks = [dict(block) if isinstance(block, dict) else block for block in content]
    last = blocks[-1]
    if isinstance(last, dict) and last.get("type") == "text":
        blocks[-1] = {**last, "cache_control": cache.as_api_dict()}
        return message.model_copy(update={"content": blocks})
    return message


def _can_tag_message_for_cache(message: BaseMessage) -> bool:
    """Anthropic forbids cache_control on tool_result blocks."""
    if isinstance(message, ToolMessage):
        return False
    return isinstance(message, (HumanMessage, AIMessage))


def prepare_messages_for_anthropic(
    messages: Sequence[BaseMessage],
    *,
    system_message: SystemMessage,
    cache: AnthropicCacheSettings | None = None,
    cache_conversation_tail: bool = True,
) -> list[BaseMessage]:
    """Build the message list sent to Claude.

    - Prepends the cached system message.
    - Optionally marks the last non-tool message for explicit prefix caching.
      Invoke-level ``cache_control`` still applies to the full request.
    """
    cache = cache or AnthropicCacheSettings()
    history = list(messages)
    # After tool rounds, only invoke-level cache_control is safe — Anthropic
    # rejects cache_control nested inside tool_result content blocks.
    if (
        cache_conversation_tail
        and history
        and not any(isinstance(m, ToolMessage) for m in history)
    ):
        idx = None
        for i in range(len(history) - 1, -1, -1):
            if _can_tag_message_for_cache(history[i]):
                idx = i
                break
        if idx is not None:
            history[idx] = _tag_last_content_block(history[idx], cache)
    return [system_message, *history]


def extract_cache_usage(message: AIMessage) -> dict[str, int]:
    """Pull Anthropic cache token stats from a model response, if present."""
    usage = (message.response_metadata or {}).get("usage") or {}
    return {
        "input_tokens": int(usage.get("input_tokens") or 0),
        "output_tokens": int(usage.get("output_tokens") or 0),
        "cache_creation_input_tokens": int(
            usage.get("cache_creation_input_tokens") or 0
        ),
        "cache_read_input_tokens": int(usage.get("cache_read_input_tokens") or 0),
    }


def create_anthropic_model(
    *,
    model: str | None = None,
    temperature: float | None = None,
    max_tokens: int | None = None,
    api_key: str | None = None,
) -> ChatAnthropic:
    """Construct ``ChatAnthropic`` from env defaults."""
    load_env()
    return ChatAnthropic(
        model=model or os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
        api_key=api_key or require_anthropic_api_key(),
        temperature=float(
            temperature if temperature is not None else os.getenv("ANTHROPIC_TEMPERATURE", "0")
        ),
        max_tokens=int(
            max_tokens if max_tokens is not None else os.getenv("ANTHROPIC_MAX_TOKENS", "4096")
        ),
    )


def create_questionnaire_model(
    *,
    api_key: str | None = None,
) -> ChatAnthropic:
    """Opus by default — questionnaire generation needs deep repo reasoning."""
    load_env()
    return create_anthropic_model(
        model=os.getenv("ANTHROPIC_QUESTIONNAIRE_MODEL", "claude-opus-4-6"),
        max_tokens=int(os.getenv("ANTHROPIC_QUESTIONNAIRE_MAX_TOKENS", "16384")),
        temperature=float(os.getenv("ANTHROPIC_QUESTIONNAIRE_TEMPERATURE", "0")),
        api_key=api_key,
    )


def _cache_settings_from_env() -> AnthropicCacheSettings:
    ttl = os.getenv("ANTHROPIC_CACHE_TTL", "5m").strip()
    if ttl not in ("5m", "1h"):
        ttl = "5m"
    return AnthropicCacheSettings(ttl=ttl)  # type: ignore[arg-type]


def should_continue(state: AgentState) -> Literal["tools", "__end__"]:
    """Route to tools when the model emitted tool calls."""
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END


def extract_ai_text(message: BaseMessage) -> str:
    """Pull plain text from an AIMessage (string or block content)."""
    if not isinstance(message, AIMessage):
        return ""
    content = message.content
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                text = block.get("text")
                if isinstance(text, str) and text.strip():
                    parts.append(text)
        return "\n".join(parts).strip()
    return str(content).strip()


def find_questionnaire_json_text(messages: Sequence[BaseMessage]) -> str:
    """Last non-tool AIMessage whose text is valid questionnaire JSON."""
    for message in reversed(messages):
        if isinstance(message, AIMessage) and not message.tool_calls:
            text = extract_ai_text(message)
            if not text:
                continue
            try:
                parse_questionnaire_json(text)
            except (ValueError, json.JSONDecodeError):
                continue
            return text
    last = messages[-1] if messages else None
    raise ValueError(
        "Agent did not produce questionnaire JSON "
        f"(messages={len(messages)}, last={type(last).__name__ if last else 'none'})"
    )


def build_agent_graph(
    *,
    tools: Sequence[BaseTool] | None = None,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    min_tool_reads: int = 0,
    verbose_tools: bool = False,
):
    """Compile the LangGraph agent loop: agent ⟷ tools (optional)."""
    load_env()
    tools = list(tools or [])
    model = model or create_anthropic_model()
    cache = cache or _cache_settings_from_env()
    system_message = cached_system_message(system_prompt, cache=cache)
    bound_model = model.bind_tools(tools) if tools else model
    tool_node = ToolNode(tools) if tools else None

    def call_model(state: AgentState) -> dict[str, list[BaseMessage]]:
        history = state["messages"]
        tool_read_count = sum(1 for m in history if isinstance(m, ToolMessage))
        has_tool_results = tool_read_count > 0
        payload = prepare_messages_for_anthropic(
            history,
            system_message=system_message,
            cache=cache,
        )
        # Invoke-level cache_control is only sent before any tool results exist.
        invoke_kwargs: dict[str, Any] = {}
        if not has_tool_results:
            invoke_kwargs["cache_control"] = cache.as_api_dict()

        # Force a tool call until the minimum number of tool reads is reached.
        model_to_invoke = bound_model
        if tools and min_tool_reads and tool_read_count < min_tool_reads:
            model_to_invoke = bound_model.bind(tool_choice={"type": "any"})

        response = model_to_invoke.invoke(payload, **invoke_kwargs)
        if isinstance(response, AIMessage):
            stats = extract_cache_usage(response)
            if any(stats[k] for k in ("cache_creation_input_tokens", "cache_read_input_tokens")):
                print(
                    "[anthropic-cache]",
                    f"read={stats['cache_read_input_tokens']}",
                    f"write={stats['cache_creation_input_tokens']}",
                    f"in={stats['input_tokens']}",
                    f"out={stats['output_tokens']}",
                )
            if verbose_tools and response.tool_calls:
                for tc in response.tool_calls:
                    name = tc.get("name", "?") if isinstance(tc, dict) else getattr(tc, "name", "?")
                    args = tc.get("args", {}) if isinstance(tc, dict) else getattr(tc, "args", {})
                    print(f"[agent] calling {name}({args})", file=sys.stderr)
        return {"messages": [response]}

    def run_tools(state: AgentState) -> dict[str, list[BaseMessage]]:
        assert tool_node is not None
        result = tool_node.invoke(state)
        if verbose_tools:
            for msg in result.get("messages", []):
                if isinstance(msg, ToolMessage):
                    # Log the size plus a short preview of each tool result.
                    full = str(msg.content)
                    preview = full[:120] + "…" if len(full) > 120 else full
                    print(
                        f"[agent] tool result ({len(full)} chars): {preview}",
                        file=sys.stderr,
                    )
        return result

    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)

    if tools:
        graph.add_node("tools", run_tools)
        graph.add_edge(START, "agent")
        graph.add_conditional_edges(
            "agent",
            should_continue,
            {"tools": "tools", END: END},
        )
        graph.add_edge("tools", "agent")
    else:
        graph.add_edge(START, "agent")
        graph.add_edge("agent", END)

    return graph.compile(checkpointer=checkpointer)


def build_issue_agent_graph(
    ctx: IssueSessionContext,
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    include_list_tool: bool = False,
):
    """Issue investigator: context upfront, file tools only."""
    tools = make_file_tools(ctx)
    if not include_list_tool:
        tools = [t for t in tools if t.name == "read_repo_file"]
    return build_agent_graph(
        tools=tools,
        system_prompt=build_issue_system_prompt(ctx),
        model=model,
        cache=cache,
        checkpointer=checkpointer,
    )


def build_questionnaire_agent_graph(
    ctx: IssueSessionContext,
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    include_list_tool: bool = False,
):
    """Generate AI-solve questionnaire JSON: Opus + file tools + contract prompt."""
    tools = make_file_tools(ctx)
    if not include_list_tool:
        tools = [t for t in tools if t.name == "read_repo_file"]
    return build_agent_graph(
        tools=tools,
        system_prompt=build_questionnaire_system_prompt(ctx),
        model=model or create_questionnaire_model(),
        cache=cache,
        checkpointer=checkpointer,
        min_tool_reads=int(os.getenv("QUESTIONNAIRE_MIN_TOOL_READS", "0")),
        verbose_tools=os.getenv("AGENT_VERBOSE_TOOLS", "1").strip().lower()
        not in ("0", "false", "no"),
    )


def _finalize_questionnaire_json(
    ctx: IssueSessionContext,
    messages: Sequence[BaseMessage],
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
) -> str:
    """Dedicated JSON-only model call after research (no tools)."""
    model = model or create_questionnaire_model()
    cache = cache or _cache_settings_from_env()
    system_message = cached_system_message(
        build_questionnaire_system_prompt(ctx), cache=cache
    )
    history = [
        *messages,
        HumanMessage(
            content=(
                "You have finished reading the repository. Output the complete questionnaire "
                "JSON object now (schema version 2). Single JSON object only — no markdown "
                "fences, no commentary, no tool calls."
            )
        ),
    ]
    payload = prepare_messages_for_anthropic(
        history, system_message=system_message, cache=cache
    )
    response = model.invoke(payload)
    text = extract_ai_text(response)
    if text:
        return text
    # No text came back: report which block types the model returned instead.
    block_types: list[str] = []
    if isinstance(response, AIMessage) and isinstance(response.content, list):
        for block in response.content:
            if isinstance(block, dict):
                block_types.append(str(block.get("type", "?")))
    raise ValueError(
        "Finalize turn returned empty text "
        f"(blocks={block_types or 'none'})"
    )


def run_questionnaire_agent(
    ctx: IssueSessionContext,
    *,
    graph=None,
    thread_id: str = "default",
    include_list_tool: bool = False,
) -> BaseMessage:
    """Run questionnaire generation.

    The caller supplies no prompt: the contract lives in the system prompt and a
    fixed research kickoff message starts the run.
    """
    app = graph or build_questionnaire_agent_graph(
        ctx, include_list_tool=include_list_tool
    )
    config = {
        "configurable": {"thread_id": thread_id},
        "recursion_limit": int(os.getenv("QUESTIONNAIRE_RECURSION_LIMIT", "50")),
    }
    result = app.invoke(
        {
            "messages": [
                HumanMessage(
                    content=(
                        "Phase 1 — research: use read_repo_file to explore the repository "
                        "for as long as you need (README, issue-related source, tests, "
                        "patterns). Stop calling tools when you have enough context. "
                        "Do not output questionnaire JSON yet."
                    )
                )
            ]
        },
        config=config,
    )
    messages = result["messages"]
    tool_reads = sum(1 for m in messages if isinstance(m, ToolMessage))
    print(f"[agent] research done ({tool_reads} file reads)", file=sys.stderr)

    # Use JSON the research phase already produced; otherwise run a finalize turn.
    try:
        text = find_questionnaire_json_text(messages)
    except ValueError:
        print("[agent] running JSON finalize turn", file=sys.stderr)
        text = _finalize_questionnaire_json(ctx, messages)
    return AIMessage(content=text)
498498+499499+500500+def generate_and_save_questionnaire(
501501+ issue_uri: str,
502502+ *,
503503+ fetch_file_tree: bool = True,
504504+ include_list_tool: bool = False,
505505+ thread_id: str = "job",
506506+ save: bool = True,
507507+) -> dict[str, Any]:
508508+ """Load issue, run questionnaire agent, parse JSON, optionally upsert to Postgres."""
509509+ load_env()
510510+ ctx = load_issue_context(issue_uri, fetch_file_tree=fetch_file_tree)
511511+ reply = run_questionnaire_agent(
512512+ ctx,
513513+ thread_id=thread_id,
514514+ include_list_tool=include_list_tool,
515515+ )
516516+ text = extract_ai_text(reply) if isinstance(reply, AIMessage) else str(reply.content)
517517+ try:
518518+ payload = parse_questionnaire_json(text)
519519+ except (json.JSONDecodeError, ValueError) as exc:
520520+ preview = (text or "")[:400].replace("\n", " ")
521521+ raise ValueError(f"{exc} — response preview: {preview!r}") from exc
522522+523523+ result: dict[str, Any] = {
524524+ "issue_uri": ctx.issue_uri,
525525+ "version": payload.get("version"),
526526+ "question_count": _count_questions(payload.get("items") or []),
527527+ }
528528+ if not save:
529529+ result["payload"] = payload
530530+ print("[agent] --no-save: skipped DB write", file=sys.stderr)
531531+ return result
532532+533533+ row = save_questionnaire(ctx.issue_uri, payload)
534534+ _maybe_publish_to_repo(ctx.issue_uri, payload, row)
535535+ result.update(
536536+ {
537537+ "issue_uri": row["issue_uri"],
538538+ "created_at": row["created_at"].isoformat() if row.get("created_at") else None,
539539+ "updated_at": row["updated_at"].isoformat() if row.get("updated_at") else None,
540540+ }
541541+ )
542542+ return result


def _maybe_publish_to_repo(issue_uri: str, payload: dict, row: dict | None = None) -> None:
    """Best-effort dual-write: also publish the questionnaire to the knot repo when
    QUESTIONNAIRE_PUBLISH_REPO is set. A failure here never fails the DB write."""
    if not publishing_enabled():
        return
    try:
        rel = publish_to_repo(
            issue_uri,
            payload,
            (row or {}).get("created_at"),
            (row or {}).get("updated_at"),
        )
        print(f"[agent] published questionnaire to repo: {rel}", file=sys.stderr)
    except Exception as exc:  # noqa: BLE001 - publishing is best-effort
        print(f"[agent] warning: repo publish failed (DB write still ok): {exc}", file=sys.stderr)


def _count_questions(items: list) -> int:
    """Count every question in the tree, including questions nested under option followups."""
    n = 0
    for q in items:
        n += 1
        for opt in q.get("options") or []:
            n += _count_questions(opt.get("followups") or [])
    return n


def run_issue_agent(
    ctx: IssueSessionContext,
    user_input: str,
    *,
    graph=None,
    thread_id: str = "default",
    include_list_tool: bool = False,
) -> BaseMessage:
    """Run the issue agent with metadata + file tree already in the system prompt."""
    app = graph or build_issue_agent_graph(ctx, include_list_tool=include_list_tool)
    config = {"configurable": {"thread_id": thread_id}}
    result = app.invoke(
        {"messages": [HumanMessage(content=user_input)]},
        config=config,
    )
    return result["messages"][-1]


def run_agent(
    user_input: str,
    *,
    graph=None,
    thread_id: str = "default",
) -> BaseMessage:
    """Invoke the compiled graph with a single user turn."""
    app = graph or build_agent_graph()
    config = {"configurable": {"thread_id": thread_id}}
    result = app.invoke(
        {"messages": [HumanMessage(content=user_input)]},
        config=config,
    )
    return result["messages"][-1]


def main(argv: list[str] | None = None) -> None:
    import argparse

    parser = argparse.ArgumentParser(description="Run the Tangled issue investigation agent.")
    parser.add_argument("prompt", nargs="?", help="User message (omit for stdin)")
    parser.add_argument(
        "--issue-uri",
        metavar="URI",
        help="Issue at:// URI — loads meta + file tree from DB/knot",
    )
    parser.add_argument(
        "--no-file-tree",
        action="store_true",
        help="Skip live knot tree walk (use with --list-tool)",
    )
    parser.add_argument(
        "--list-tool",
        action="store_true",
        help="Also expose list_repo_files (default: read_repo_file only)",
    )
    parser.add_argument(
        "--questionnaire",
        "--questionaire",  # misspelled alias, accepted alongside the correct flag
        action="store_true",
        help="Generate AI-solve questionnaire JSON (uses Opus)",
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="Do not write questionnaire JSON to Postgres (questionnaire mode)",
    )
    parser.add_argument("--thread-id", default="cli", help="Checkpoint thread id")
    args = parser.parse_args(argv)

    text = args.prompt
    if not text and not sys.stdin.isatty():
        text = sys.stdin.read().strip()
    if not text and not args.questionnaire:
        print("ERROR: provide a prompt argument, stdin, or --questionnaire", file=sys.stderr)
        raise SystemExit(1)

    if not args.issue_uri:
        print("ERROR: --issue-uri is required", file=sys.stderr)
        raise SystemExit(1)

    load_env()
    ctx = load_issue_context(
        args.issue_uri.strip(),
        fetch_file_tree=not args.no_file_tree,
    )
    if args.questionnaire:
        print("[agent] questionnaire mode — will read repo via tools first", file=sys.stderr)
        reply = run_questionnaire_agent(
            ctx,
            thread_id=args.thread_id,
            include_list_tool=args.list_tool,
        )
        text_out = reply.content if isinstance(reply.content, str) else str(reply.content)
        if not args.no_save:
            try:
                payload = parse_questionnaire_json(text_out)
                row = save_questionnaire(ctx.issue_uri, payload)
                _maybe_publish_to_repo(ctx.issue_uri, payload, row)
                print(
                    f"[agent] saved questionnaire for {row['issue_uri']}",
                    file=sys.stderr,
                )
            except Exception as exc:  # noqa: BLE001
                print(f"[agent] warning: could not save to DB: {exc}", file=sys.stderr)
    else:
        reply = run_issue_agent(
            ctx,
            text,
            thread_id=args.thread_id,
            include_list_tool=args.list_tool,
        )
    # Both modes print the final reply to stdout; diagnostics go to stderr.
    print(reply.content)


if __name__ == "__main__":
    main()
+294
agent/ai-solve-questionnaire-contract.md
# AI-Solve Questionnaire — Engine Contract

**Status:** Draft contract for the AI engine (backend) developer
**Date:** 2026-06-25
**Owner (frontend/appview):** Miko
**Audience:** Engine service developer

---

## 1. Context

Feature: on an issue page, a logged-in user can start an **AI-solve** workflow. The AI
engine generates a **branching questionnaire** about what kind of solution should be
implemented. Many users answer it; when the engine detects consensus, it generates the
solution and opens a pull request **authored as its own AT-Protocol / Tangled user**.

The **appview (frontend) is a thin UI**. The **engine owns all logic**: questionnaire
generation, answer aggregation, consensus detection, code generation, and PR authoring.

The appview talks to the engine with **exactly two HTTP calls**, both made server-side
from the appview (the engine base URL is never exposed to the browser):

1. `GET` the questionnaire for an issue.
2. `POST` a user's completed answer-set.

The branching is **walked client-side** in the appview after the single `GET` — the
engine does **not** serve one-question-at-a-time. It returns the whole questionnaire once.

Everything after the `POST` (consensus, codegen, PR) is internal to the engine. The PR
appears on the issue page because the engine authors a normal `sh.tangled.repo.pull`
record that references the issue — it rides the existing PR/reference ingestion. No third
call is required for that.

---

## 2. The questionnaire structure

The questionnaire is a **tree of nested sequences**, not a `next`-pointer graph. This is
the core of the contract — please implement to this shape.

### Design rationale

The questionnaire must support, modularly:

- **Sub-questions** that are only asked when a particular option is chosen.
- **Regular questions** that are always asked, in order — including *after* a branching
  question's sub-questions have been answered.

A flat `next`-pointer graph models this poorly: every branch leaf must manually point back
to the shared follow-up question to re-converge, so adding one always-asked question means
editing every leaf. Instead we use recursion:

> An option may carry its own ordered list of follow-up questions (`followups`). That list
> has the **same shape** as the top-level list. When the user finishes a `followups` list,
> traversal automatically returns ("pops") to the parent sequence and continues with the
> next item. Re-convergence is free; no manual wiring.

One node type, recursive at any depth, one renderer/walker.

### Schema

```jsonc
// Questionnaire (root)
{
  "issue": "at://…/sh.tangled.repo.issue/…",
  "version": 2,
  "introduction": {
    "project": "What the repo is…",
    "issue": "What this issue asks…",
    "approach": "How the questionnaire guides toward a PR…"
  },
  "items": [ /* ordered array of Question */ ]
}

// Question
{
  "id": "scope",
  "prompt": "Short headline question",
  "context": "Why we ask this now — bridges from intro or parent branch",
  "explanation": "Extended tradeoffs and repo-specific detail",
  "options": [ /* array of Option, >= 2 */ ]
}

// Option — label only (no separate value field)
{
  "label": "Full detailed description of this choice",
  "followups": [ /* optional: array of Question, same shape as items */ ]
}
```

**Field rules**

| Field | Type | Required | Notes |
|---|---|---|---|
| `issue` | string (AT-URI) | yes | Echoes the issue this questionnaire is for. |
| `version` | integer | yes | Schema version; `2` for now. |
| `introduction` | object | yes | `project`, `issue`, `approach` — narrative setup shown before questions. |
| `items` | Question[] | yes | Top-level ordered sequence. Non-empty. |
| `Question.id` | string | yes | **Globally unique** across the whole tree. Stable across re-fetches. |
| `Question.prompt` | string | yes | Short headline (plain text). |
| `Question.context` | string | yes | Bridges logically from intro/parent; must chain narratively. |
| `Question.explanation` | string | yes | Extended detail on tradeoffs and repo facts. |
| `Question.options` | Option[] | yes | At least 2 options. |
| `Option.label` | string | yes | **Full option text** — detailed description, not a terse button label. |
| `Option.followups` | Question[] | no | Omit or `[]` = no sub-questions. |
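
The field rules can be checked mechanically before a questionnaire is cached or served. The following is a minimal sketch, not part of the contract: the function name `validate_questionnaire` and the error-message wording are illustrative assumptions. It only enforces what the table states (required fields, non-empty `items`, at least two options, globally unique `id`s).

```python
from typing import Any


def validate_questionnaire(doc: dict[str, Any]) -> list[str]:
    """Return human-readable rule violations; an empty list means the doc passes.

    Hypothetical helper: the name and messages are assumptions, the rules come
    from the field-rules table.
    """
    errors: list[str] = []
    issue = doc.get("issue")
    if not isinstance(issue, str) or not issue.startswith("at://"):
        errors.append("issue must be an at:// URI string")
    version = doc.get("version")
    if not isinstance(version, int) or isinstance(version, bool):
        errors.append("version must be an integer")
    intro = doc.get("introduction")
    if not isinstance(intro, dict) or any(
        not isinstance(intro.get(key), str) for key in ("project", "issue", "approach")
    ):
        errors.append("introduction needs string fields project, issue, approach")
    items = doc.get("items")
    if not isinstance(items, list) or not items:
        errors.append("items must be a non-empty list")
        return errors

    seen_ids: set[str] = set()

    def check(questions: list[Any], path: str) -> None:
        for i, q in enumerate(questions):
            where = f"{path}[{i}]"
            if not isinstance(q, dict):
                errors.append(f"{where}: question must be an object")
                continue
            for key in ("id", "prompt", "context", "explanation"):
                if not isinstance(q.get(key), str) or not q[key]:
                    errors.append(f"{where}: missing string field {key!r}")
            qid = q.get("id")
            if isinstance(qid, str):
                if qid in seen_ids:
                    errors.append(f"{where}: duplicate id {qid!r}")
                seen_ids.add(qid)
            options = q.get("options")
            if not isinstance(options, list) or len(options) < 2:
                errors.append(f"{where}: needs at least 2 options")
                continue
            for j, opt in enumerate(options):
                if not isinstance(opt, dict) or not isinstance(opt.get("label"), str):
                    errors.append(f"{where}.options[{j}]: missing string label")
                    continue
                # followups has the same shape as items, so recurse.
                check(opt.get("followups") or [], f"{where}.options[{j}].followups")

    check(items, "items")
    return errors
```

Collecting every violation instead of raising on the first one makes it easier to feed the full list back into a regeneration prompt, if the engine chooses to retry.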

### Worked example

```json
{
  "issue": "at://did:plc:abc/sh.tangled.repo.issue/3lk2…",
  "version": 2,
  "introduction": {
    "project": "A small CLI tool for…",
    "issue": "Add a flag to…",
    "approach": "First we pick where the fix lives, then shared test preferences."
  },
  "items": [
    {
      "id": "scope",
      "prompt": "Where should the fix live?",
      "context": "The issue touches both the CLI and core library — we need to pick a home first.",
      "explanation": "A new module keeps concerns isolated; extending util is faster but couples the change.",
      "options": [
        {
          "label": "Create a new module dedicated to this feature, imported by the CLI entrypoint.",
          "followups": [
            {
              "id": "mod_name_style",
              "prompt": "Module naming style?",
              "context": "Because you chose a new module, naming should match repo conventions.",
              "explanation": "Flat names match existing `util_*` files; nested packages group related commands.",
              "options": [
                { "label": "Flat single file at repo root (e.g. `pairing.nu`)." },
                { "label": "Nested under an existing package directory." }
              ]
            }
          ]
        },
        { "label": "Extend the existing shared util module — smallest diff, reuses exports." }
      ]
    },
    {
      "id": "tests",
      "prompt": "Add tests?",
      "context": "Regardless of where the fix lives, we need agreement on test coverage.",
      "explanation": "The repo has unit tests in `tests/` but no integration harness for hardware.",
      "options": [
        { "label": "Yes — add unit tests for the new code path." },
        { "label": "No — manual verification only for this change." }
      ]
    }
  ]
}
```

Behaviour:
- A user who picks **New module** is asked `mod_name_style`, then `tests`.
- A user who picks **Existing util** skips `mod_name_style` and goes straight to `tests`.
- `tests` is always asked regardless of the `scope` branch.

### Traversal semantics (how the frontend walks it)

The appview walks the tree depth-first with a stack. The engine doesn't need to run this,
but it defines exactly which questions a given user sees and the order answers are recorded:

```
stack = [ (items, 0) ]
while stack not empty:
    (list, i) = stack.top
    if i >= len(list): stack.pop(); continue
    q = list[i]
    present q to user; user picks option `opt` at index `k` of q.options
    record answer { questionId: q.id, optionIndex: k }
    stack.top.i += 1                      # move past q in its own frame
    if opt.followups is non-empty:
        stack.push( (opt.followups, 0) )  # dive into sub-questions first
# done when the stack is empty
```
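
For engine-side tests it can help to replay the same walk in Python. The sketch below is an assumption-laden illustration, not the appview's actual code: `walk_questionnaire` and the `choose` callback (which stands in for the user and returns the chosen option's index) are hypothetical names. `optionIndex` is the position of the chosen option within `q["options"]`.

```python
from typing import Any, Callable

Question = dict[str, Any]
Answer = dict[str, Any]


def walk_questionnaire(
    items: list[Question],
    choose: Callable[[Question], int],
) -> list[Answer]:
    """Depth-first walk; returns answers in the order the questions are shown."""
    answers: list[Answer] = []
    # Each frame is [question list, index of the next question in that list].
    stack: list[list[Any]] = [[items, 0]]
    while stack:
        frame = stack[-1]
        questions, i = frame
        if i >= len(questions):
            stack.pop()  # this sequence is finished; resume the parent
            continue
        q = questions[i]
        k = choose(q)  # hypothetical stand-in for the user's pick
        opt = q["options"][k]
        answers.append({"questionId": q["id"], "optionIndex": k})
        frame[1] = i + 1  # move past q in its own frame
        followups = opt.get("followups") or []
        if followups:
            stack.append([followups, 0])  # dive into sub-questions first
    return answers
```

Calling it with `choose=lambda q: 0` on the worked example yields answers for `scope`, `mod_name_style`, then `tests`, matching the behaviour listed for the **New module** branch.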

---

## 3. API contract

Base URL, auth, and exact paths are TBD between engine and appview (see Open Questions).
Shapes below are the contract.

### 3.1 `GET` questionnaire

Fetch (or generate-and-cache) the questionnaire for an issue.

**Request**

```
GET /questionnaire?issue=<at-uri>
```

| Param | Type | Notes |
|---|---|---|
| `issue` | string (AT-URI) | The `sh.tangled.repo.issue` record URI. |

**Response** `200 OK` — a Questionnaire object (Section 2).

- The same issue should return a **stable** questionnaire (same `id`s) across calls so that
  answers from different users are comparable. Generate once, cache, return cached.
- `404` if the issue is unknown to the engine; `503` if generation is still in progress
  (the appview will show a "preparing…" state and let the user retry).
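
The status rules for the `GET` call reduce to a small decision function. This is a sketch under stated assumptions: `known_issues`, `cache`, and `pending` are hypothetical in-memory stand-ins for whatever storage and job queue the engine really uses, and the function name is illustrative.

```python
from typing import Any, Optional

Questionnaire = dict[str, Any]


def get_questionnaire_response(
    issue_uri: str,
    *,
    known_issues: set[str],
    cache: dict[str, Questionnaire],
    pending: set[str],
) -> tuple[int, Optional[Questionnaire]]:
    """Map an issue URI to (HTTP status, body) per the GET contract."""
    if issue_uri not in known_issues:
        return 404, None  # issue unknown to the engine
    cached = cache.get(issue_uri)
    if cached is not None:
        return 200, cached  # generate once, then always return the cached copy
    # Not generated yet: remember that work is wanted and ask the caller to retry.
    # A real engine would enqueue a generation job here (assumption).
    pending.add(issue_uri)
    return 503, None
```

Returning the cached object unchanged is what keeps question `id`s stable across calls.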

### 3.2 `POST` answers

Submit one user's completed answer-set.

**Request**

```
POST /answers
Content-Type: application/json

{
  "issue": "at://…/sh.tangled.repo.issue/…",
  "did": "did:plc:…",          // the answering user, from the appview's auth session
  "version": 2,                // questionnaire version the answers were collected against
  "answers": [
    { "questionId": "scope",          "optionIndex": 0 },
    { "questionId": "mod_name_style", "optionIndex": 1 },
    { "questionId": "tests",          "optionIndex": 0 }
  ]
}
```

- `answers` is the **flat, ordered list** of `{ questionId, optionIndex }` the user actually
  traversed (only the questions they were shown). Order = traversal order. `questionId`s are
  globally unique, so the engine can reconstruct full context without nesting in the payload.
- `did` is supplied **server-side by the appview** from the authenticated session — the
  engine can trust it as the identity the appview vouches for. It is never taken from the
  browser/client.

**Response** `200 OK` (body optional; the appview ignores it beyond status).

**Idempotency / re-answering:** a user may submit more than once (they re-open the wizard
and change their mind). The engine should **dedupe by `did`** and treat the latest submission
as that user's answer. Define whether resubmission is allowed after consensus is locked.
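
The latest-wins rule can be sketched as a keyed store. Assumptions in this sketch: the class name `AnswerStore` is hypothetical, storage is an in-memory dict rather than Postgres, and submissions are keyed by `(issue, did)` because the same user may answer questionnaires for several issues.

```python
from typing import Any

Submission = dict[str, Any]


class AnswerStore:
    """In-memory latest-wins store: one submission per (issue, did)."""

    def __init__(self) -> None:
        self._latest: dict[tuple[str, str], Submission] = {}

    def submit(self, submission: Submission) -> bool:
        """Store a submission; return True if it replaced an earlier one."""
        key = (submission["issue"], submission["did"])
        replaced = key in self._latest
        self._latest[key] = submission  # the newest submission overwrites the old
        return replaced

    def for_issue(self, issue: str) -> list[Submission]:
        """All current (deduplicated) submissions for one issue."""
        return [s for (i, _did), s in self._latest.items() if i == issue]
```

Consensus detection would then read from `for_issue`, so each user counts once regardless of how many times they resubmitted. Whether `submit` should refuse writes after consensus is locked is left open, as in the contract.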

---

## 4. Out of scope for these two calls (engine-internal)

- Aggregating answers across users and **detecting consensus**.
- Generating the solution / code.
- **Authoring the PR as the engine's own AT-Proto user** — author a `sh.tangled.repo.pull`
  record that **references the issue** (e.g. via the issue AT-URI in the pull's references /
  body) so the appview's existing reference-link rendering surfaces it on the issue page.

No additional appview→engine call is needed to display the solution/PR; it arrives via
normal record ingestion.

---

## 5. Optional future extension — reusable question groups (do NOT build yet)

If two different branches ever need the **same** sub-questionnaire, add a reference item
rather than duplicating questions:

```jsonc
// top-level, alongside "items"
"library": {
  "test-prefs": [ /* Question[] */ ]
}

// usable anywhere an item is expected
{ "ref": "test-prefs" }
```

The recursive walker resolves `{ "ref": … }` against `library` and otherwise behaves
identically. Not required for v1 — the schema simply leaves room for it. Flagged here so the
engine doesn't bake in an assumption that blocks it later.

---

## 6. Open questions to confirm with the appview developer

1. **Issue identifier:** AT-URI (assumed here) vs. the numeric per-repo issue id? AT-URI is
   globally unambiguous; confirm the engine can resolve it.
2. **Base URL / auth:** how does the appview authenticate to the engine (service token,
   mTLS, shared network)? What are the real paths?
3. **Caching/staleness:** is the questionnaire generated once per issue and frozen, or can it
   change (e.g. if the issue body is edited)? If it can change, how do we avoid invalidating
   in-flight answers? (The `version` field is here to support this.)
4. **Resubmission after consensus:** allowed, or should the appview hide the wizard once the
   engine reports a solution exists? (Note: with only two calls, the appview has no
   "status" endpoint — it infers "solved" from the linked PR. Confirm that's acceptable, or
   we add a lightweight status signal.)
5. **PR ↔ issue linkage:** confirm the engine sets the pull record's references to the issue
   AT-URI so the appview's existing linked-PR rendering picks it up.
+157
agent/atproto.py
"""ATProto / PDS helpers for live issue loading."""

from __future__ import annotations

import re
from typing import Any

import httpx

DEFAULT_PDS = "https://tngl.sh"
ISSUE_COLLECTION = "sh.tangled.repo.issue"
STATE_COLLECTION = "sh.tangled.repo.issue.state"
REPO_COLLECTION = "sh.tangled.repo"
STATE_OPEN = "sh.tangled.repo.issue.state.open"
STATE_CLOSED = "sh.tangled.repo.issue.state.closed"

_AT_URI_RE = re.compile(
    r"^at://(?P<did>did:[^/]+)/(?P<collection>[^/]+)/(?P<rkey>[^/]+)$"
)


def parse_at_uri(uri: str) -> tuple[str, str, str]:
    match = _AT_URI_RE.match(uri.strip())
    if not match:
        raise ValueError(f"Not a valid at:// URI: {uri!r}")
    return match.group("did"), match.group("collection"), match.group("rkey")


def pds_host_for_did(client: httpx.Client, did: str) -> str | None:
    resp = client.get(f"https://plc.directory/{did}", timeout=15.0)
    if resp.status_code != 200:
        return None
    doc = resp.json()
    for svc in doc.get("service", []):
        if svc.get("type") == "AtprotoPersonalDataServer":
            endpoint = svc.get("serviceEndpoint")
            if isinstance(endpoint, str):
                return endpoint.rstrip("/")
    return None


def handle_from_plc(client: httpx.Client, did: str) -> str | None:
    resp = client.get(f"https://plc.directory/{did}", timeout=15.0)
    if resp.status_code != 200:
        return None
    for alias in resp.json().get("alsoKnownAs", []):
        if isinstance(alias, str) and alias.startswith("at://"):
            return alias.removeprefix("at://")
    return None


def get_record(
    client: httpx.Client,
    pds_host: str,
    repo_did: str,
    collection: str,
    rkey: str,
) -> dict[str, Any]:
    resp = client.get(
        f"{pds_host.rstrip('/')}/xrpc/com.atproto.repo.getRecord",
        params={"repo": repo_did, "collection": collection, "rkey": rkey},
        timeout=20.0,
    )
    resp.raise_for_status()
    data = resp.json()
    if not isinstance(data, dict):
        raise RuntimeError("getRecord returned non-object")
    return data


def list_records(
    client: httpx.Client,
    pds_host: str,
    repo_did: str,
    collection: str,
    *,
    limit: int = 100,
) -> list[dict[str, Any]]:
    resp = client.get(
        f"{pds_host.rstrip('/')}/xrpc/com.atproto.repo.listRecords",
        params={"repo": repo_did, "collection": collection, "limit": limit},
        timeout=20.0,
    )
    resp.raise_for_status()
    page = resp.json().get("records") or []
    return [r for r in page if isinstance(r, dict)]


def issue_state_for_uri(
    client: httpx.Client,
    pds_host: str,
    author_did: str,
    issue_uri: str,
    issue_rkey: str,
) -> str:
    try:
        # The com.atproto.repo.listRecords lexicon caps `limit` at 100; a larger value
        # is rejected by the PDS, which would make every issue fall back to "open".
        records = list_records(client, pds_host, author_did, STATE_COLLECTION, limit=100)
    except Exception:
        return "open"
    for rec in records:
        value = rec.get("value")
        if not isinstance(value, dict):
            continue
        target = value.get("issue")
        if target == issue_uri:
            state = value.get("state")
            if state == STATE_CLOSED:
                return "closed"
            return "open"
    return "open"


def repo_did_from_at_uri(uri: str) -> str | None:
    if not uri.startswith("at://"):
        return None
    did = uri.removeprefix("at://").split("/", 1)[0]
    return did if did.startswith("did:") else None


def resolve_repo(
    client: httpx.Client,
    repo_ref: Any,
) -> dict[str, Any]:
    """Resolve issue's ``repo`` field to repo_did, knot_hostname, name, owner_handle."""
    if not isinstance(repo_ref, str) or not repo_ref.strip():
        raise RuntimeError("Issue record has no repo reference")

    if repo_ref.startswith("at://"):
        owner_did, collection, repo_rkey = parse_at_uri(repo_ref)
        if collection != REPO_COLLECTION:
            raise RuntimeError(f"Unexpected repo collection: {collection}")
        pds = pds_host_for_did(client, owner_did) or DEFAULT_PDS
        rec = get_record(client, pds, owner_did, REPO_COLLECTION, repo_rkey)
        value = rec.get("value") if isinstance(rec.get("value"), dict) else {}
        repo_did = value.get("repoDid") if isinstance(value.get("repoDid"), str) else owner_did
        knot = value.get("knotHostname") or value.get("knotHost") or value.get("knot")
        name = value.get("name")
        owner_handle = handle_from_plc(client, owner_did)
        if not isinstance(knot, str) or not knot.strip():
            raise RuntimeError("Repo record missing knot / knotHostname")
        return {
            "repo_did": repo_did,
            "repo_uri": repo_ref,
            "repo_name": name if isinstance(name, str) else "",
            "repo_owner_did": owner_did,
            "repo_owner_handle": owner_handle or "",
            "knot_hostname": knot.strip(),
        }

    if repo_ref.startswith("did:"):
        repo_did = repo_ref
        raise RuntimeError(
            f"Issue references repo by DID only ({repo_did}). "
            "Need at:// owner/repo record URI or an indexed tangled_repos row."
        )

    raise RuntimeError(f"Unsupported repo reference: {repo_ref!r}")
+103
agent/context.py
"""Issue session context injected before the agent runs (no issue-fetch tools)."""

from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class IssueSessionContext:
    """Everything the caller already knows about the issue + repo."""

    issue_uri: str
    issue_rkey: str
    title: str
    body: str
    state: str
    author_did: str
    author_handle: str
    repo_did: str
    repo_owner_handle: str
    repo_name: str
    knot_hostname: str
    # Repo paths relative to root (provided by caller — primary navigation aid).
    file_tree: list[str] = field(default_factory=list)
    ref: str = "HEAD"
    extra: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> IssueSessionContext:
        known = {f.name for f in cls.__dataclass_fields__.values()}  # type: ignore[attr-defined]
        core = {k: v for k, v in data.items() if k in known and k != "extra"}
        extra = dict(data.get("extra") or {})
        for k, v in data.items():
            if k not in known:
                extra[k] = v
        return cls(**core, extra=extra)

    def to_dict(self) -> dict[str, Any]:
        payload = asdict(self)
        extra = payload.pop("extra", {})
        if extra:
            payload.update(extra)
        return payload


ISSUE_AGENT_SYSTEM_PROMPT = """\
You investigate a single Tangled issue. The issue metadata, repo identifiers, and
repository file tree are already provided below — do not ask the user to resolve
handles or DIDs.

Your job:
1. Read the issue title/body and identify which files are relevant.
2. Use ``read_repo_file`` to pull exact source from the knot when you need code.
3. Use ``list_repo_files`` only if the provided file tree is incomplete or you
   need to explore a subdirectory that was not listed.

Rules:
- Prefer paths from the provided file tree.
- Read the smallest set of files needed to answer well.
- Cite file paths when referencing code.
- You cannot file issues, push code, or browse outside this repo.
"""


def format_issue_context_block(ctx: IssueSessionContext) -> str:
    """Serialize session context for the system prompt (cache-friendly static prefix)."""
    tree = ctx.file_tree
    if len(tree) > 500:
        tree_display = tree[:500] + [f"... (+{len(tree) - 500} more paths)"]
    else:
        tree_display = tree

    block = {
        "issue": {
            "uri": ctx.issue_uri,
            "rkey": ctx.issue_rkey,
            "title": ctx.title,
            "body": ctx.body,
            "state": ctx.state,
            "author": {"did": ctx.author_did, "handle": ctx.author_handle},
        },
        "repo": {
            "did": ctx.repo_did,
            "owner_handle": ctx.repo_owner_handle,
            "name": ctx.repo_name,
            "knot_hostname": ctx.knot_hostname,
            "ref": ctx.ref,
        },
        "file_tree": tree_display,
    }
    if ctx.extra:
        block["extra"] = ctx.extra
    return json.dumps(block, indent=2, ensure_ascii=False)


def build_issue_system_prompt(ctx: IssueSessionContext) -> str:
    return (
        f"{ISSUE_AGENT_SYSTEM_PROMPT}\n\n"
        f"## Session context (issue + repo)\n\n"
        f"```json\n{format_issue_context_block(ctx)}\n```"
    )
+249
agent/load_issue.py
"""Load issue session context from a single issue URI (live PDS + knot)."""

from __future__ import annotations

import os
from collections import deque
from dataclasses import replace

import httpx
import psycopg
from psycopg.rows import dict_row

from agent.atproto import (
    DEFAULT_PDS,
    ISSUE_COLLECTION,
    get_record,
    handle_from_plc,
    issue_state_for_uri,
    parse_at_uri,
    pds_host_for_did,
    resolve_repo,
)
from agent.context import IssueSessionContext
from agent.tangled_client import DEFAULT_TIMEOUT, list_tree, normalize_tree_entries

_ISSUE_SQL = """
    select
        i.uri as issue_uri,
        i.rkey as issue_rkey,
        i.title,
        i.body,
        i.state,
        i.author_did,
        i.author_handle,
        i.repo_did,
        i.repo_uri,
        coalesce(r.owner_handle, ti.handle) as repo_owner_handle,
        r.name as repo_name,
        r.knot_hostname
    from tangled_issues i
    left join tangled_repos r on r.repo_did = i.repo_did
    left join tangled_identities ti
        on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
    where i.uri = %s
"""

_REPO_SQL = """
    select repo_did, name as repo_name, owner_handle as repo_owner_handle,
           knot_hostname, uri as repo_uri
    from tangled_repos
    where repo_did = %s
    limit 1
"""


def _join_path(parent: str, name: str) -> str:
    if not parent:
        return name
    return f"{parent.rstrip('/')}/{name}"


def build_file_tree(
    knot_hostname: str,
    repo_did: str,
    *,
    ref: str = "HEAD",
    max_paths: int = 400,
    max_depth: int = 4,
) -> list[str]:
    paths: list[str] = []
    queue: deque[tuple[str, int]] = deque([("", 0)])

    with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client:
        while queue and len(paths) < max_paths:
            directory, depth = queue.popleft()
            try:
                tree = list_tree(
                    client,
                    knot_hostname=knot_hostname,
                    repo_did=repo_did,
                    path=directory,
                    ref=ref,
                )
            except Exception:
                continue
            for entry in normalize_tree_entries(tree):
                full = _join_path(directory, entry["name"])
                if entry["type"] == "dir":
                    if depth + 1 < max_depth:
                        queue.append((full, depth + 1))
                else:
                    paths.append(full)

    return sorted(paths)


def _repo_from_db(repo_did: str) -> dict | None:
    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        return None
    if "sslmode=" not in dsn:
        sep = "&" if "?" in dsn else "?"
        dsn = f"{dsn}{sep}sslmode=require"
    try:
        with psycopg.connect(dsn, row_factory=dict_row) as conn:
            return conn.execute(_REPO_SQL, (repo_did,)).fetchone()
    except Exception:
        return None


def _db_row(issue_uri: str) -> dict | None:
    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        return None
    if "sslmode=" not in dsn:
        sep = "&" if "?" in dsn else "?"
        dsn = f"{dsn}{sep}sslmode=require"
    try:
        with psycopg.connect(dsn, row_factory=dict_row) as conn:
            return conn.execute(_ISSUE_SQL, (issue_uri,)).fetchone()
    except Exception:
        return None


def _resolve_repo_did_only(
    client: httpx.Client,
    repo_did: str,
    db_row: dict | None,
) -> dict[str, str]:
    repo_row = _repo_from_db(repo_did)
    knot = (repo_row or {}).get("knot_hostname") or (db_row or {}).get("knot_hostname")
    name = (repo_row or {}).get("repo_name") or (db_row or {}).get("repo_name")
    owner_handle = (repo_row or {}).get("repo_owner_handle") or (db_row or {}).get(
        "repo_owner_handle"
    )
    repo_uri = (repo_row or {}).get("repo_uri") or (db_row or {}).get("repo_uri") or ""

    if isinstance(knot, str) and knot.strip():
        return {
            "repo_did": repo_did,
            "knot_hostname": knot.strip(),
            "repo_name": name if isinstance(name, str) else "",
            "repo_owner_handle": owner_handle if isinstance(owner_handle, str) else "",
            "repo_uri": repo_uri if isinstance(repo_uri, str) else "",
        }

    raise RuntimeError(
        f"Cannot resolve knot for repo_did={repo_did}. "
        "Issue should reference at://owner/sh.tangled.repo/rkey when possible."
    )


def fetch_issue_live(issue_uri: str) -> IssueSessionContext:
    """Load everything from Tangled live (PDS + knot). DB not required."""
    author_did, collection, rkey = parse_at_uri(issue_uri)
    if collection != ISSUE_COLLECTION:
        raise ValueError(f"Expected {ISSUE_COLLECTION}, got {collection}")

    db_row = _db_row(issue_uri)

    with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client:
        pds = pds_host_for_did(client, author_did) or DEFAULT_PDS
        record = get_record(client, pds, author_did, collection, rkey)
        value = record.get("value")
        if not isinstance(value, dict):
            raise RuntimeError("Issue record missing value")

        title = value.get("title") if isinstance(value.get("title"), str) else ""
        body = value.get("body") if isinstance(value.get("body"), str) else ""
        author_handle = handle_from_plc(client, author_did) or ""
        state = issue_state_for_uri(client, pds, author_did, issue_uri, rkey)

        repo_ref = value.get("repo")
        if isinstance(repo_ref, str) and repo_ref.startswith("did:"):
            repo = _resolve_repo_did_only(client, repo_ref, db_row)
        else:
            repo = resolve_repo(client, repo_ref)

    file_tree = build_file_tree(repo["knot_hostname"], repo["repo_did"])

    return IssueSessionContext(
        issue_uri=issue_uri,
        issue_rkey=rkey,
        title=title or (db_row or {}).get("title") or "",
        body=body or (db_row or {}).get("body") or "",
        state=state or (db_row or {}).get("state") or "open",
        author_did=author_did,
        author_handle=author_handle or (db_row or {}).get("author_handle") or "",
        repo_did=repo["repo_did"],
        repo_owner_handle=repo.get("repo_owner_handle") or "",
        repo_name=repo.get("repo_name") or "",
        knot_hostname=repo["knot_hostname"],
        file_tree=file_tree,
        ref="HEAD",
    )
196196+197197+198198+def load_issue_context(
199199+ issue_uri: str,
200200+ *,
201201+ fetch_file_tree: bool = True,
202202+ ref: str = "HEAD",
203203+) -> IssueSessionContext:
204204+ """Hydrate session from live Tangled APIs; DB is optional cache only."""
205205+ ctx = fetch_issue_live(issue_uri)
206206+ if not fetch_file_tree:
207207+ return replace(ctx, file_tree=[], ref=ref)
208208+ if ref != ctx.ref:
209209+ return replace(
210210+ ctx,
211211+ file_tree=build_file_tree(ctx.knot_hostname, ctx.repo_did, ref=ref),
212212+ ref=ref,
213213+ )
214214+ return ctx
215215+216216+217217+# Backwards-compatible alias
218218+fetch_issue_context = load_issue_context
219219+220220+221221+def resolve_issue_uri(issue_id: str) -> str:
222222+ """Resolve a full ``at://`` URI or a per-repo issue rkey via ``tangled_issues``."""
223223+ raw = issue_id.strip()
224224+ if raw.startswith("at://"):
225225+ return raw
226226+227227+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
228228+ if not dsn:
229229+ raise RuntimeError(
230230+ "DB_CONNECTION_STRING is required to resolve issue rkey without at:// URI"
231231+ )
232232+ if "sslmode=" not in dsn:
233233+ sep = "&" if "?" in dsn else "?"
234234+ dsn = f"{dsn}{sep}sslmode=require"
235235+236236+ with psycopg.connect(dsn, row_factory=dict_row) as conn:
237237+ rows = conn.execute(
238238+ "select uri from tangled_issues where rkey = %s order by fetched_at desc",
239239+ (raw,),
240240+ ).fetchall()
241241+242242+ if not rows:
243243+ raise ValueError(f"No issue with rkey {raw!r} in tangled_issues — pass full at:// URI")
244244+ if len(rows) > 1:
245245+ uris = [r["uri"] for r in rows[:5]]
246246+ raise ValueError(
247247+ f"Ambiguous rkey {raw!r} ({len(rows)} issues). Pass full at:// URI. Examples: {uris}"
248248+ )
249249+ return rows[0]["uri"]
+226
agent/questionnaire_prompt.py
···11+"""System prompt for AI-solve questionnaire generation."""
22+33+from __future__ import annotations
44+55+from agent.context import IssueSessionContext, format_issue_context_block
66+77+QUESTIONNAIRE_AGENT_SYSTEM_PROMPT = """\
88+You are the **AI-solve questionnaire engine** for Tangled issues.
99+1010+Your job is to produce a **branching questionnaire** that helps many contributors agree on
1111+*how* an issue should be implemented. Answers will be aggregated across users; when the
1212+engine detects consensus, it will generate code and open a pull request. Your questions
1313+must therefore surface **real, meaningful implementation choices** — not trivia, not
1414+questions already settled by the issue author, and not preferences that do not affect code.
1515+1616+## What you receive
1717+1818+Issue metadata, repo identifiers, and a file tree are embedded below. You also have
1919+``read_repo_file`` (and optionally ``list_repo_files``) to inspect the codebase on the knot.
2020+2121+**You must read the repo before writing the questionnaire.** At minimum:
2222+- README or docs that explain the project
2323+- Files most likely touched by a fix for this issue (infer from title/body + tree)
2424+- Existing patterns for the kind of change requested (CLI commands, modules, tests, APIs)
2525+2626+**If the issue is a bug** (crash, wrong output, regression, race, etc.), research the bug
2727+before designing questions:
2828+- Trace the **reported symptoms** to the code path (callers, handlers, data flow).
2929+- Read the **failing or suspect code** and any related tests, error handling, or edge cases.
3030+- Form **multiple plausible root causes** when the report is ambiguous — do not assume the
3131+ first theory is correct.
3232+- Identify **several distinct fix strategies** (e.g. guard at call site vs fix underlying
3333+ logic vs add validation vs change defaults vs refactor state handling). Each viable
3434+ strategy should become a branch in the questionnaire — users choose *which fix approach*
3535+ to take, then answer follow-ups specific to that path.
3636+- Where reproduction steps exist in the issue, verify them against the code you read.
3737+3838+Do not guess architecture, naming, conventions, or root cause when the source can answer.
3939+4040+## Required workflow (do not skip)
4141+4242+You have two phases. **Do not emit questionnaire JSON during phase 1.**
4343+4444+1. **Research (tools only)** — call ``read_repo_file`` as many times as you need until you
4545+ understand the repo and issue well enough to write the questionnaire (README, relevant
4646+ source, tests, similar patterns). There is no fixed file limit — keep reading while it
4747+ helps. The file tree in context is not enough — read actual contents.
4848+2. **Generate** — when you are done researching, **stop calling tools**. The system will
4949+ ask you for the questionnaire JSON in a separate step. Do not output JSON during research.
5050+5151+## Output contract
5252+5353+Return **one JSON object** and nothing else — no markdown fences, no commentary, no preamble.
5454+The object must validate against this schema (version **2**):
5555+5656+```jsonc
5757+{
5858+ "issue": "<at-uri>", // echo the issue URI from session context exactly
5959+ "version": 2,
6060+ "introduction": {
6161+ "project": "…", // 2–4 sentences: what this repo is, stack, conventions, status
6262+ "issue": "…", // 2–4 sentences: what the issue asks, constraints, open decisions
6363+ "approach":"…" // 2–4 sentences: how the questionnaire guides toward a solution
6464+ },
6565+ "items": [ /* ordered Question[] */ ]
6666+}
6767+6868+// Question
6969+{
7070+ "id": "scope",
7171+ "prompt": "Short question shown as the headline",
7272+ "context": "1–3 sentences bridging from introduction or parent branch — why we are asking NOW",
7373+ "explanation": "Extended paragraph: tradeoffs, code facts, what changes depending on the answer",
7474+ "options": [ /* Option[], at least 2 */ ]
7575+}
7676+7777+// Option — label only (no separate value field)
7878+{
7979+ "label": "Full detailed description of this choice — complete enough to vote on without reading code",
8080+ "followups": [ /* optional Question[] — same shape as items */ ]
8181+}
8282+```
8383+8484+### Narrative coherence (most important)
8585+8686+The questionnaire is a **guided story**, not a checklist of isolated questions.
8787+8888+1. ``introduction`` sets the scene: project reality, issue goal, and how choices chain into a PR.
8989+2. Each question's ``context`` must **logically follow** from the introduction or from the
9090+ option the user chose in the parent branch. Reference concrete facts from the repo/issue.
9191+3. Each ``explanation`` goes deeper: what files/patterns are involved, what breaks if you
9292+ pick wrong, why reasonable people disagree here.
9393+4. Follow-up questions must **narrow** the chosen branch — not repeat the parent question.
9494+ Their ``context`` should say "Because you chose X…" or "Given the lh.nu namespace…".
9595+5. Top-level questions after branches should **re-converge** with context like "Regardless
9696+ of backend choice…" so shared tail questions feel connected to the path taken.
9797+9898+If contexts do not read as one continuous briefing, rewrite before emitting JSON.
9999+100100+### Tree semantics (critical)
101101+102102+The questionnaire is a **tree of nested sequences**, not a ``next``-pointer graph.
103103+104104+- ``items`` is the top-level ordered list. Every user walks it in order.
105105+- When a user picks an option that has ``followups``, those sub-questions are asked
106106+ **immediately** (depth-first), then traversal **automatically continues** with the next
107107+ item in the parent list. Re-convergence is free — do not wire branches back manually.
108108+- Put **path-specific** questions inside ``followups`` on the option they depend on.
109109+- Put **cross-cutting** questions (tests, docs, breaking changes, migration) as **top-level**
110110+ ``items`` after branching sections so every path reaches them without duplication.
111111+112112+Traversal (for your mental model — the frontend runs this):
113113+114114+```
115115+stack = [ (items, 0) ]
116116+while stack not empty:
117117+ (list, i) = stack.top
118118+ if i >= len(list): stack.pop(); continue
119119+ q = list[i]; user picks option opt
120120+ stack.top.i += 1
121121+ if opt.followups is non-empty:
122122+ stack.push( (opt.followups, 0) )
123123+```
124124+125125+### Field rules
126126+127127+| Field | Rules |
128128+|---|---|
129129+| ``issue`` | Required. Exact AT-URI from context. |
130130+| ``version`` | Always ``2``. |
131131+| ``introduction`` | Required. ``project``, ``issue``, ``approach`` — each a substantive paragraph. |
132132+| ``items`` | Non-empty ordered array. |
133133+| ``Question.id`` | **Globally unique** snake_case id. Stable across re-fetches. |
134134+| ``Question.prompt`` | Short headline — one decision per question. |
135135+| ``Question.context`` | Required. Bridges from intro/parent; must read as the next logical paragraph. |
136136+| ``Question.explanation`` | Required. Extended detail on tradeoffs and repo-specific facts. |
137137+| ``Option.label`` | Required. **The entire option text** — detailed description, not a terse button label. No ``value`` field. |
138138+| ``Option.followups`` | Omit or ``[]`` when no sub-questions. |
139139+140140+## How to design a good questionnaire
141141+142142+### Goal
143143+144144+Surface disagreements that **change the diff**: file placement, API shape, dependency choices,
145145+compatibility, test strategy, error-handling philosophy, scope (minimal vs holistic), etc.
146146+147147+### Recommended shape
148148+149149+1. **Anchor question** — highest-level approach (often ``items[0]``). Branch heavily here.
150150+ For bugs, anchor on **which fix strategy** (root-cause fix vs workaround vs defensive
151151+ guard vs broader refactor) — each option should reflect a real alternative you found in code.
152152+2. **Branch depth** — 2–4 levels of ``followups`` where paths genuinely diverge. Shallow
153153+ branches that only differ in wording are useless.
154154+3. **Shared tail** — 2–4 top-level items after branches for concerns every path shares
155155+ (tests, docs, deprecation, rollout).
156156+4. **Size** — aim for **8–15 distinct question ids** across the full tree for a typical
157157+ issue; more for large features, fewer for tiny fixes. Every question must earn its place.
158158+159159+### Dimensions to branch on (when relevant to this issue)
160160+161161+Use only what applies after reading the code — do not checkbox every row blindly.
162162+163163+**Bug issues** — branch when multiple fixes are viable:
164164+- **Root cause**: patch the faulty logic vs fix upstream/downstream caller
165165+- **Fix depth**: minimal one-line guard vs proper invariant fix vs refactor the subsystem
166166+- **Symptom vs cause**: suppress/handle the error vs eliminate the triggering condition
167167+- **Regression**: add test reproducing the bug; fix only vs fix + harden related paths
168168+- **Blast radius**: local patch vs shared utility change affecting other call sites
169169+170170+**Feature / enhancement issues**:
171171+- **Placement**: new module vs extend existing file/package; public API surface vs internal
172172+- **Interface**: CLI subcommand vs library function vs config flag; naming aligned with repo
173173+- **Behavior**: strict vs permissive validation; fail-fast vs graceful degradation
174174+- **Compatibility**: breaking change vs backward-compatible shim; feature flag vs always-on
175175+- **Dependencies**: reuse existing util vs add dependency (name the tradeoff)
176176+- **Data / state**: persistence, migrations, defaults
177177+- **Errors & UX**: error messages, exit codes, logging level
178178+- **Tests**: unit vs integration; fixtures; what to assert
179179+- **Docs**: README, inline docs, changelog entry
180180+- **Scope**: minimal fix vs refactor while here; out-of-scope follow-ups as explicit option
181181+182182+### Diversity requirements
183183+184184+- Options must represent **distinct implementation paths**, not synonyms.
185185+- Avoid false choices (one option obviously correct given the codebase).
186186+- Include at least one **conservative / minimal** and one **broader** path when reasonable.
187187+- When the issue is ambiguous, ask **clarifying** branch questions early in ``followups``.
188188+- Reflect **repo conventions** you observed (e.g. if tests live in ``*_test.go``, ask about
189189+ test file placement using real paths/patterns from the tree).
190190+191191+### Anti-patterns (do not do these)
192192+193193+- Do not ask what the issue already states as a requirement.
194194+- Do not ask "which files to edit" with a single correct answer you could infer.
195195+- Do not duplicate the same question under multiple branches — hoist to top-level ``items``.
196196+- Do not use flat linked-list / ``next`` field thinking.
197197+- Do not ask open-ended free text — every step is multiple choice.
198198+- Do not invent options that violate project constraints visible in the repo.
199199+- Do not output invalid JSON (trailing commas, comments, single quotes).
200200+201201+## ID conventions
202202+203203+- ``Question.id``: lowercase ``snake_case``, globally unique, semantic (``backend_tool``, ``rename_deprecation``).
204204+- Options have **no id** — answers are recorded by ``questionId`` + ``optionIndex`` (0-based).
205205+206206+## Process
207207+208208+1. Read the issue and repo; draft ``introduction`` first — project, issue, approach.
209209+2. Read relevant source via tools until you understand viable solution paths.
210210+3. Draft the tree: each question's ``context`` + ``explanation`` must chain narratively.
211211+4. Write option ``label`` strings as self-contained descriptions a contributor can judge.
212212+5. Validate: unique ids; ≥ 2 options per question; contexts chain logically; JSON parses.
213213+6. Emit the final JSON object only.
214214+215215+You cannot push code, file issues, or browse outside this repo. Your sole deliverable is
216216+the questionnaire JSON.
217217+"""
218218+219219+220220+def build_questionnaire_system_prompt(ctx: IssueSessionContext) -> str:
221221+ """System prompt for questionnaire generation (issue context appended)."""
222222+ return (
223223+ f"{QUESTIONNAIRE_AGENT_SYSTEM_PROMPT}\n\n"
224224+ f"## Session context (issue + repo)\n\n"
225225+ f"```json\n{format_issue_context_block(ctx)}\n```"
226226+ )
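The validation step in the prompt's Process list (unique ids, at least two options per question, required text fields) can also be checked mechanically before a payload is stored. The following is a minimal sketch under the v2 shape described in the prompt; `validate_questionnaire`, `REQUIRED_TEXT_FIELDS`, and the error strings are hypothetical names for illustration and are not part of this diff. It covers only a subset of the field rules (it does not inspect `introduction` or `issue`).

```python
from typing import Any

# Hypothetical constant: the per-question text fields the prompt marks as required.
REQUIRED_TEXT_FIELDS = ("id", "prompt", "context", "explanation")


def validate_questionnaire(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    errors: list[str] = []
    if payload.get("version") != 2:
        errors.append("version must be 2")
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        return errors + ["items must be a non-empty list"]

    seen: set[str] = set()
    pending = list(items)  # work list; visiting order does not matter for these checks
    while pending:
        q = pending.pop()
        if not isinstance(q, dict):
            errors.append("question is not an object")
            continue
        qid = q.get("id")
        for field in REQUIRED_TEXT_FIELDS:
            value = q.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"question {qid!r}: missing {field}")
        if isinstance(qid, str):
            if qid in seen:
                errors.append(f"duplicate question id {qid!r}")
            seen.add(qid)
        options = q.get("options")
        if not isinstance(options, list) or len(options) < 2:
            errors.append(f"question {qid!r}: needs at least 2 options")
            continue
        for opt in options:
            if not isinstance(opt, dict) or not opt.get("label"):
                errors.append(f"question {qid!r}: option without a label")
                continue
            # Nested questions share the global id namespace, so queue them too.
            pending.extend(opt.get("followups") or [])
    return errors
```

Returning a list of problems rather than raising on the first one makes it possible to feed every issue back to the model in a single retry, if the pipeline chooses to do that.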
+152
agent/questionnaire_repo_store.py
···11+"""Publish AI-solve questionnaires to the knot-hosted git repo (vectorseachdb).
22+33+The generation job dual-writes: it upserts Postgres (agent/questionnaire_store.py)
44+AND, when QUESTIONNAIRE_PUBLISH_REPO is set, publishes the questionnaire as a single
55+JSON file in the embeddings repo on the knot:
66+77+ questionnaires/<did>/<rkey>.json # one file per issue, fetched per-item
88+99+Design choices that make this safe in an ephemeral, possibly-concurrent Cloud Run job:
1010+- **Sparse + partial clone** (`--filter=blob:none --sparse`, sparse-set `questionnaires`)
1111+ so we never download the ~18 MB embedding matrices that share this repo.
1212+- **Per-issue unique path** → concurrent jobs touch different files; no content conflicts.
1313+- **`index.json` is NOT written here** (it would conflict across concurrent jobs) — it's
1414+ rebuilt by scraper/export_questionnaires.py. Consumers can read files by path directly.
1515+- **Push with `pull --rebase` + retry** to tolerate the embeddings export pushing too.
1616+1717+Config (env):
1818+ QUESTIONNAIRE_REPO_GIT_URL e.g. git@tangled.org:did:plc:vg4msk54xucet6of2rdrgahe (required)
1919+ QUESTIONNAIRE_REPO_DIR local checkout dir (default /tmp/qrepo)
2020+ QUESTIONNAIRE_REPO_BRANCH default "main"
2121+ QUESTIONNAIRE_PUBLISH_PUSH "0" to commit but skip push (local testing); default "1"
2222+ QUESTIONNAIRE_SSH_KEY optional path to the deploy key (added to GIT_SSH_COMMAND)
2323+ GIT_SSH_COMMAND respected if already set
2424+"""
2525+2626+from __future__ import annotations
2727+2828+import json
2929+import os
3030+import subprocess
3131+from pathlib import Path
3232+from typing import Any
3333+3434+_PUSH_RETRIES = 4
3535+3636+3737+def publishing_enabled() -> bool:
3838+ return os.getenv("QUESTIONNAIRE_PUBLISH_REPO", "").strip().lower() in ("1", "true", "yes")
3939+4040+4141+def issue_uri_to_relpath(issue_uri: str) -> str:
4242+ """at://<did>/sh.tangled.repo.issue/<rkey> -> questionnaires/<did>/<rkey>.json
4343+ (must match scraper/export_questionnaires.py)."""
4444+ rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
4545+ parts = rest.split("/")
4646+ return f"questionnaires/{parts[0]}/{parts[-1]}.json"
4747+4848+4949+def _resolve_ssh_key() -> str | None:
5050+ """Return a path to a usable private key, or None.
5151+5252+ Prefers QUESTIONNAIRE_SSH_KEY (a path). Otherwise, if QUESTIONNAIRE_SSH_KEY_CONTENTS
5353+ is set (e.g. a Secret Manager env var in Cloud Run), materialize it to a 0600 temp
5454+ file — secret *volume* mounts are world-readable, which ssh rejects, so env-injection
5555+ + chmod is the robust path."""
5656+ path = os.getenv("QUESTIONNAIRE_SSH_KEY", "").strip()
5757+ if path and Path(path).exists():
5858+ return path
5959+ contents = os.getenv("QUESTIONNAIRE_SSH_KEY_CONTENTS", "")
6060+ if contents.strip():
6161+ dest = Path(os.getenv("QUESTIONNAIRE_REPO_DIR", "/tmp/qrepo")).parent / "qrepo_ssh_key"
6262+ body = contents if contents.endswith("\n") else contents + "\n"
6363+ dest.write_text(body)
6464+ dest.chmod(0o600)
6565+ return str(dest)
6666+ return None
6767+6868+6969+def _git_env() -> dict[str, str]:
7070+ env = dict(os.environ)
7171+ if "GIT_SSH_COMMAND" not in env:
7272+ cmd = "ssh -o StrictHostKeyChecking=accept-new -o ConnectTimeout=30"
7373+ key = _resolve_ssh_key()
7474+ if key:
7575+ cmd += f" -i {key} -o IdentitiesOnly=yes"
7676+ env["GIT_SSH_COMMAND"] = cmd
7777+ return env
7878+7979+8080+def _git(repo: Path, *args: str) -> str:
8181+ out = subprocess.run(
8282+ ["git", *args], cwd=str(repo), env=_git_env(),
8383+ capture_output=True, text=True,
8484+ )
8585+ if out.returncode != 0:
8686+ raise RuntimeError(f"git {' '.join(args)} failed: {out.stderr.strip() or out.stdout.strip()}")
8787+ return out.stdout
8888+8989+9090+def _ensure_checkout(url: str, repo: Path, branch: str) -> None:
9191+ if (repo / ".git").is_dir():
9292+ _git(repo, "fetch", "origin", branch)
9393+ _git(repo, "checkout", branch)
9494+ _git(repo, "reset", "--hard", f"origin/{branch}")
9595+ return
9696+ repo.parent.mkdir(parents=True, exist_ok=True)
9797+ subprocess.run(
9898+ ["git", "clone", "--filter=blob:none", "--sparse", "--branch", branch, url, str(repo)],
9999+ env=_git_env(), capture_output=True, text=True, check=True,
100100+ )
101101+ _git(repo, "sparse-checkout", "set", "questionnaires")
102102+103103+104104+def _file_record(issue_uri: str, payload: dict[str, Any], created_at, updated_at) -> str:
105105+ rec = {
106106+ "issue_uri": issue_uri,
107107+ "version": payload.get("version") if isinstance(payload, dict) else None,
108108+ "created_at": created_at.isoformat() if hasattr(created_at, "isoformat") else created_at,
109109+ "updated_at": updated_at.isoformat() if hasattr(updated_at, "isoformat") else updated_at,
110110+ "payload": payload,
111111+ }
112112+ return json.dumps(rec, ensure_ascii=False, indent=2) + "\n"
113113+114114+115115+def publish_to_repo(issue_uri: str, payload: dict[str, Any], created_at=None, updated_at=None) -> str:
116116+ """Write the questionnaire file, commit, and (unless disabled) push. Returns the
117117+ relative path written. Raises on failure — callers treat publishing as best-effort."""
118118+ url = os.getenv("QUESTIONNAIRE_REPO_GIT_URL", "").strip()
119119+ if not url:
120120+ raise RuntimeError("QUESTIONNAIRE_REPO_GIT_URL is not set")
121121+ repo = Path(os.getenv("QUESTIONNAIRE_REPO_DIR", "/tmp/qrepo")).expanduser()
122122+ branch = os.getenv("QUESTIONNAIRE_REPO_BRANCH", "main")
123123+ do_push = os.getenv("QUESTIONNAIRE_PUBLISH_PUSH", "1").strip().lower() not in ("0", "false", "no")
124124+125125+ _ensure_checkout(url, repo, branch)
126126+127127+ rel = issue_uri_to_relpath(issue_uri)
128128+ path = repo / rel
129129+ path.parent.mkdir(parents=True, exist_ok=True)
130130+ path.write_text(_file_record(issue_uri, payload, created_at, updated_at), encoding="utf-8")
131131+132132+ _git(repo, "add", rel)
133133+ if not _git(repo, "status", "--porcelain").strip():
134134+ return rel # no change (identical content) — nothing to commit
135135+ _git(repo, "-c", "user.name=tangled-questionnaire", "-c", "user.email=bot@stuhi.org",
136136+ "commit", "-m", f"questionnaire: {issue_uri}")
137137+138138+ if not do_push:
139139+ return rel
140140+ last_err: Exception | None = None
141141+ for _ in range(_PUSH_RETRIES):
142142+ try:
143143+ _git(repo, "push", "origin", branch)
144144+ return rel
145145+ except RuntimeError as e: # non-fast-forward (a concurrent push) — rebase + retry
146146+ last_err = e
147147+ try:
148148+ _git(repo, "pull", "--rebase", "origin", branch)
149149+ except RuntimeError as pe:
150150+ last_err = pe
151151+ break
152152+ raise RuntimeError(f"push failed after retries: {last_err}")
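`issue_uri_to_relpath` above accepts any string and uses its first and last path segments. Where stricter input checking is wanted, a variant along the following lines could reject malformed URIs before anything is written to the checkout. This is a sketch only: `strict_issue_relpath` is a hypothetical name, and the collection string is taken from the docstring of `issue_uri_to_relpath`.

```python
ISSUE_COLLECTION = "sh.tangled.repo.issue"  # collection named in the docstring above


def strict_issue_relpath(issue_uri: str) -> str:
    """Map at://<did>/sh.tangled.repo.issue/<rkey> to questionnaires/<did>/<rkey>.json.

    Hypothetical stricter variant: raises ValueError instead of guessing when the
    URI does not have exactly the expected three segments.
    """
    prefix = "at://"
    if not issue_uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {issue_uri!r}")
    parts = issue_uri[len(prefix):].split("/")
    if len(parts) != 3 or parts[1] != ISSUE_COLLECTION or not parts[0] or not parts[2]:
        raise ValueError(f"not an issue URI: {issue_uri!r}")
    did, _collection, rkey = parts
    return f"questionnaires/{did}/{rkey}.json"
```

For well-formed issue URIs this yields the same path as the original function, so it would stay compatible with the layout expected by `scraper/export_questionnaires.py`.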
+95
agent/questionnaire_store.py
···11+"""Persist AI-solve questionnaires in Postgres."""
22+33+from __future__ import annotations
44+55+import json
66+import os
77+from typing import Any
88+99+import psycopg
1010+from psycopg.rows import dict_row
1111+from psycopg.types.json import Jsonb
1212+1313+_UPSERT = """
1414+ insert into tangled_issue_questionnaires (issue_uri, payload, updated_at)
1515+ values (%s, %s, now())
1616+ on conflict (issue_uri) do update set
1717+ payload = excluded.payload,
1818+ updated_at = now()
1919+ returning issue_uri, created_at, updated_at
2020+"""
2121+2222+_GET = """
2323+ select issue_uri, payload, created_at, updated_at
2424+ from tangled_issue_questionnaires
2525+ where issue_uri = %s
2626+"""
2727+2828+2929+def _connection_string() -> str:
3030+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
3131+ if not dsn:
3232+ raise RuntimeError("DB_CONNECTION_STRING is not set")
3333+ return dsn
3434+3535+3636+def parse_questionnaire_json(raw: str) -> dict[str, Any]:
3737+ """Parse model output into a questionnaire dict (tolerates fences and preamble)."""
3838+ import re
3939+ from json import JSONDecoder
4040+4141+ text = raw.strip()
4242+ if not text:
4343+ raise ValueError("Empty model response — expected questionnaire JSON")
4444+4545+ fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
4646+ if fence:
4747+ text = fence.group(1).strip()
4848+4949+ decoder = JSONDecoder()
5050+ try:
5151+ data, _ = decoder.raw_decode(text)
5252+ except json.JSONDecodeError:
5353+ start = text.find("{")
5454+ if start < 0:
5555+ preview = text[:300].replace("\n", " ")
5656+ raise ValueError(
5757+ f"No JSON object in model response (preview: {preview!r})"
5858+ ) from None
5959+ data, _ = decoder.raw_decode(text[start:])
6060+6161+ if not isinstance(data, dict) or not isinstance(data.get("items"), list):
6262+ raise ValueError("Invalid questionnaire: expected object with items[]")
6363+ return data
6464+6565+6666+def save_questionnaire(issue_uri: str, payload: dict[str, Any]) -> dict[str, Any]:
6767+ """Insert or replace the questionnaire for an issue. Returns row metadata."""
6868+ if payload.get("issue") and payload["issue"] != issue_uri:
6969+ raise ValueError(
7070+ f"payload.issue ({payload['issue']!r}) does not match issue_uri ({issue_uri!r})"
7171+ )
7272+ with psycopg.connect(_connection_string(), row_factory=dict_row) as conn:
7373+ row = conn.execute(
7474+ _UPSERT,
7575+ (issue_uri, Jsonb(payload)),
7676+ ).fetchone()
7777+ conn.commit()
7878+ return dict(row)
7979+8080+8181+def get_questionnaire(issue_uri: str) -> dict[str, Any] | None:
8282+ """Load cached questionnaire JSON, or None if missing."""
8383+ with psycopg.connect(_connection_string(), row_factory=dict_row) as conn:
8484+ row = conn.execute(_GET, (issue_uri,)).fetchone()
8585+ if not row:
8686+ return None
8787+ payload = row["payload"]
8888+ if isinstance(payload, str):
8989+ payload = json.loads(payload)
9090+ return {
9191+ "issue_uri": row["issue_uri"],
9292+ "payload": payload,
9393+ "created_at": row["created_at"],
9494+ "updated_at": row["updated_at"],
9595+ }
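`parse_questionnaire_json` relies on `json.JSONDecoder.raw_decode`, which parses one JSON value from the start of a string and returns the index where it stopped, so commentary after the object does not cause a failure the way it would with `json.loads`. A minimal standalone sketch of that behavior follows; `extract_first_object` and the sample strings are illustrative names, not part of the module.

```python
import json


def extract_first_object(text: str) -> dict:
    """Decode the JSON object that starts at the first '{' in ``text``."""
    start = text.find("{")
    if start < 0:
        raise ValueError("no JSON object found")
    # raw_decode returns (value, end_index) and does not require the whole
    # string to be consumed, so trailing prose after the object is ignored.
    value, _end = json.JSONDecoder().raw_decode(text[start:])
    return value


reply = 'Here is the questionnaire: {"version": 2, "items": []} Let me know if it helps.'
print(extract_first_object(reply))
```

As in the original function, this only skips preamble that contains no `{` of its own; a stray brace before the real object would still raise `json.JSONDecodeError`.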
+31
agent/questionnaires/README.md
···11+# Questionnaire tree viewer
22+33+Small static frontend for exploring AI-solve questionnaire JSON.
44+55+## Run locally
66+77+Browsers block `fetch()` for local files, so serve this folder:
88+99+```bash
1010+cd agent/questionnaires
1111+python -m http.server 8765
1212+```
1313+1414+Open [http://localhost:8765](http://localhost:8765).
1515+1616+## Features
1717+1818+- **Introduction** — project, issue, and approach context shown at the top and in walk-through
1919+- **Tree view** — nested questions with `context`, `explanation`, and detailed option labels
2020+- **Walk-through** — interactive simulator with narrative context per step (depth-first stack)
2121+- **Schema v2** — options are `{ "label": "detailed description…" }` only; answers use `optionIndex`
2222+- **Load** sample (`test.json`), upload a `.json` / `.txt` file, or paste JSON (supports markdown fences; v1 auto-normalized)
2323+2424+## Sample data
2525+2626+- `test.json` — parsed questionnaire for the AtomicXR lighthouse pair issue
2727+- `test.txt` — same content with markdown code fence (also loadable)
2828+2929+## Output
3030+3131+Walk-through mode builds the flat `POST /answers` payload shape when you finish all questions.
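
The depth-first walk behind the walk-through mode can be sketched in a few lines of Python. This is an illustration rather than the viewer's actual code: `walk` and the `choose` callback are hypothetical names, the `questionId` / `optionIndex` keys follow the schema v2 note above, and the exact `POST /answers` envelope is not shown here, so a flat list of answers is assumed.

```python
from typing import Any, Callable


def walk(items: list[dict[str, Any]], choose: Callable[[dict[str, Any]], int]) -> list[dict[str, Any]]:
    """Walk the questionnaire depth-first and collect one answer per question asked.

    ``choose`` is a hypothetical callback that returns the 0-based index of the
    option picked for a question (for example, from user input).
    """
    answers: list[dict[str, Any]] = []
    stack: list[list[Any]] = [[items, 0]]  # each frame is [question list, next index]
    while stack:
        frame = stack[-1]
        questions, i = frame
        if i >= len(questions):
            stack.pop()  # this list is done; resume the parent list
            continue
        question = questions[i]
        frame[1] += 1  # advance the parent before descending
        index = choose(question)
        answers.append({"questionId": question["id"], "optionIndex": index})
        followups = question["options"][index].get("followups") or []
        if followups:
            stack.append([followups, 0])  # ask sub-questions immediately
    return answers
```

Because the parent index is advanced before the follow-ups are pushed, the walk resumes with the next top-level question once a branch is exhausted, which is why shared tail questions are reached on every path.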
···11+{
22+ "issue": "at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22",
33+ "version": 2,
44+ "introduction": {
55+ "project": "AtomicXR is a Nushell-based CLI (`axr`) for managing XR hardware on Linux — SteamVR lighthouse tracking, calibration, and device tooling. Commands live in `.nu` modules served from the repo knot. The README notes the project is deprecated in favor of Homebrew-XR and Envision-OCI, but the codebase is still a useful reference for CLI patterns.",
66+ "issue": "The issue requests a new `axr lh pair` command so users can pair SteamVR Lighthouse devices without manually running `lighthouse_console`. Today lighthouse workflows live under `steamvr-lh.nu` (`axr steamvr-lh calibrate`, `axr steamvr-lh console`), while the issue asks for the shorter `lh` namespace — an intentional mismatch we must resolve first.",
77+ "approach": "This questionnaire walks from namespace and backend choice through UX, error handling, and shared concerns (tests, docs). Each question's context builds on prior answers; branch follow-ups only appear when your chosen path needs extra detail. Together the answers define a concrete PR plan contributors can reach consensus on."
88+ },
99+ "items": [
1010+ {
1111+ "id": "command_namespace",
1212+ "prompt": "The issue requests `axr lh pair`, but the existing lighthouse module is `steamvr-lh.nu` (commands are `axr steamvr-lh calibrate`, `axr steamvr-lh console`). How should the new pair command be namespaced?",
1313+ "context": "We start here because the issue title says `axr lh pair` but the repo still exposes `axr steamvr-lh …`. Every later decision assumes a command path.",
1414+ "explanation": "The module file `steamvr-lh.nu` registers subcommands via Nushell's module system. Renaming affects import paths, help text, and user muscle memory. Adding `pair` only to the old module is least disruptive; renaming to `lh` matches the issue verbatim.",
1515+ "options": [
1616+ {
1717+ "label": "Rename module to `lh.nu` so commands become `axr lh pair`, `axr lh calibrate`, etc.",
1818+ "followups": [
1919+ {
2020+ "id": "rename_deprecation",
2121+ "prompt": "Should the old `steamvr-lh` name be preserved as a deprecated alias?",
2222+ "context": "Because you chose to rename the module to `lh.nu`, we need to decide whether old scripts using `steamvr-lh` keep working.",
2323+ "explanation": "A thin alias module costs little and prevents breaking existing docs/scripts. Skipping the alias is simpler but contradicts semver expectations if anyone still depends on the old name.",
2424+ "options": [
2525+ {
2626+ "label": "Yes, keep a thin `steamvr-lh.nu` wrapper that re-exports `lh.nu` with a deprecation warning"
2727+ },
2828+ {
2929+ "label": "No, just rename — the project is marked as no longer maintained anyway"
3030+ }
3131+ ]
3232+ }
3333+ ]
3434+ },
3535+ {
3636+ "label": "Add `pair` to the existing `steamvr-lh.nu` module (command becomes `axr steamvr-lh pair`)"
3737+ },
3838+ {
3939+ "label": "Create a new separate `lh.nu` module for pairing only, keep `steamvr-lh.nu` for calibrate/console"
4040+ }
4141+ ]
4242+ },
4343+ {
4444+ "id": "backend_tool",
4545+ "prompt": "Which backend tool should the pair command use?",
4646+ "context": "With the command namespace settled, we pick which external tool wraps the actual pairing protocol.",
4747+ "explanation": "`lighthouse_console` ships with SteamVR today but is slow and script-hostile. `lhctl` is the intended successor but may not be available on all systems yet. Dual mode adds complexity but future-proofs.",
4848+ "options": [
4949+ {
5050+ "label": "Use `lighthouse_console` now (available today via SteamVR, slower but works)",
5151+ "followups": [
5252+ {
5353+ "id": "lhctl_future_proofing",
5454+ "prompt": "Should the implementation be structured to make switching to `lhctl` easier later?",
5555+ "context": "Because you chose `lighthouse_console` today, we decide whether to structure code for a future `lhctl` swap.",
5656+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
5757+ "options": [
5858+ {
5959+ "label": "Yes, abstract the pairing logic behind a helper function so the backend can be swapped"
6060+ },
6161+ {
6262+ "label": "No, just call lighthouse_console directly — refactor when lhctl is available"
6363+ }
6464+ ]
6565+ }
6666+ ]
6767+ },
6868+ {
6969+ "label": "Wait for `lhctl` to be publicly released and implement with that"
7070+ },
7171+ {
7272+ "label": "Support both: detect if `lhctl` is available and prefer it, fall back to `lighthouse_console`",
7373+ "followups": [
7474+ {
7575+ "id": "dual_backend_flag",
7676+ "prompt": "Should the user be able to force a specific backend?",
7777+ "context": "Because you chose dual-backend auto-detection, we clarify whether advanced users can override the choice.",
7878+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
7979+ "options": [
8080+ {
8181+ "label": "Yes, add a `--backend` flag (e.g. `axr lh pair --backend lhctl`)"
8282+ },
8383+ {
8484+ "label": "No, auto-detect only — simpler UX"
8585+ }
8686+ ]
8787+ }
8888+ ]
8989+ }
9090+ ]
9191+ },
9292+ {
9393+ "id": "pairing_workflow",
9494+ "prompt": "How should the pairing workflow work from the user's perspective?",
9595+ "context": "Regardless of backend, users experience pairing differently — interactive scan vs args vs wizard.",
9696+ "explanation": "The repo has precedent for more than one style: the existing `calibrate` command in `steamvr-lh.nu` asks yes/no questions and waits for input, while modules such as `envision.nu` are driven by flags.",
9797+ "options": [
9898+ {
9999+ "label": "Fully interactive: scan for devices, present a list, user selects which to pair",
100100+ "followups": [
101101+ {
102102+ "id": "interactive_multi_select",
103103+ "prompt": "Should the user be able to pair multiple devices in one session?",
104104+ "context": "Because you chose a fully interactive scan flow, we decide single vs multi device per session.",
105105+ "explanation": "Users often pair several devices in one session (e.g. two base stations and two controllers), but the existing codebase only uses `input` and `input list` for single selections, so multi-select would need a loop or a custom prompt.",
106106+ "options": [
107107+ {
108108+ "label": "Yes, allow multi-select from discovered devices"
109109+ },
110110+ {
111111+ "label": "No, pair one device at a time (simpler, matches lighthouse_console behavior)"
112112+ }
113113+ ]
114114+ }
115115+ ]
116116+ },
117117+ {
118118+ "label": "Semi-interactive: user provides device serial/ID as argument, command handles the rest"
119119+ },
120120+ {
121121+ "label": "Guided wizard: step-by-step prompts (put device in pairing mode, confirm, etc.)"
122122+ }
123123+ ]
124124+ },
125125+ {
126126+ "id": "device_types",
127127+ "prompt": "Which lighthouse-tracked device types should the pair command support?",
128128+ "context": "Now we narrow which hardware categories the first implementation supports.",
129129+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
130130+ "options": [
131131+ {
132132+ "label": "All lighthouse devices (base stations, controllers, trackers, HMDs)"
133133+ },
134134+ {
135135+ "label": "Controllers and trackers only (most common pairing need)"
136136+ },
137137+ {
138138+ "label": "Start with controllers only, add other device types in follow-up PRs"
139139+ }
140140+ ]
141141+ },
142142+ {
143143+ "id": "lh_console_discovery",
144144+ "prompt": "The existing `lh-console` helper in `steamvr-lh.nu` checks multiple paths (PATH, Flatpak Steam, native Steam). Should the pair command reuse this helper?",
145145+ "context": "The existing `steamvr-lh.nu` already locates `lighthouse_console` across Steam installs — reuse affects maintainability.",
146146+ "explanation": "The existing `lh-console` helper only locates and runs the binary, whereas pairing needs multi-step interaction with `lighthouse_console`'s async job model, in which commands are queued and results polled.",
147147+ "options": [
148148+ {
149149+ "label": "Yes, reuse the existing `lh-console` helper as-is"
150150+ },
151151+ {
152152+ "label": "Reuse but refactor `lh-console` to also support piping input/capturing output (needed for scripted pairing)"
153153+ },
154154+ {
155155+ "label": "Write a new helper specifically for pairing that handles the async job pattern lighthouse_console needs"
156156+ }
157157+ ]
158158+ },
159159+ {
160160+ "id": "error_handling",
161161+ "prompt": "How should the command handle common failure cases (no Bluetooth, SteamVR not installed, device not found)?",
162162+ "context": "These concerns apply no matter which namespace/backend/workflow you picked above.",
163163+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
164164+ "options": [
165165+ {
166166+ "label": "Pre-flight checks: verify prerequisites before attempting pairing, with actionable error messages"
167167+ },
168168+ {
169169+ "label": "Attempt pairing and surface errors from lighthouse_console/lhctl with minimal wrapping"
170170+ },
171171+ {
172172+ "label": "Pre-flight checks plus a `--force` flag to skip them for advanced users"
173173+ }
174174+ ]
175175+ },
176176+ {
177177+ "id": "timeout_handling",
178178+ "prompt": "Pairing can take a while (especially with lighthouse_console). How should timeouts be handled?",
179179+ "context": "Pairing duration varies by backend; this shapes UX for all paths.",
180180+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
181181+ "options": [
182182+ {
183183+ "label": "Default timeout with a `--timeout` flag to override"
184184+ },
185185+ {
186186+ "label": "No timeout — wait indefinitely until pairing succeeds or user cancels (Ctrl+C)"
187187+ },
188188+ {
189189+ "label": "Progress indicator with a generous default timeout (e.g. 60s) and clear messaging"
190190+ }
191191+ ]
192192+ },
193193+ {
194194+ "id": "testing_strategy",
195195+ "prompt": "How should this feature be tested? (Hardware-dependent features are hard to unit test)",
196196+ "context": "Hardware pairing is hard to automate — we still need a team agreement on test scope.",
197197+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
198198+ "options": [
199199+ {
200200+ "label": "Manual testing only — document test procedure in PR description"
201201+ },
202202+ {
203203+ "label": "Add basic tests for argument parsing and pre-flight checks (mock external commands)"
204204+ },
205205+ {
206206+ "label": "No tests needed — the project is marked as no longer maintained"
207207+ }
208208+ ]
209209+ },
210210+ {
211211+ "id": "documentation",
212212+ "prompt": "What documentation should accompany this feature?",
213213+ "context": "Final shared question: what docs ship with the command given the project deprecation notice.",
214214+ "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.",
215215+ "options": [
216216+ {
217217+ "label": "Inline help text in the Nushell command (consistent with existing commands like `calibrate`)"
218218+ },
219219+ {
220220+ "label": "Inline help text plus a section in the README"
221221+ },
222222+ {
223223+ "label": "Inline help text only — README already says the project is no longer maintained"
224224+ }
225225+ ]
226226+ }
227227+ ]
228228+}
+234
agent/questionnaires/test.txt
···11+```json
22+{
33+ "issue": "at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22",
44+ "version": 1,
55+ "items": [
66+ {
77+ "id": "command_namespace",
88+ "prompt": "The issue requests `axr lh pair`, but the existing lighthouse module is `steamvr-lh.nu` (commands are `axr steamvr-lh calibrate`, `axr steamvr-lh console`). How should the new pair command be namespaced?",
99+ "options": [
1010+ {
1111+ "label": "Rename module to `lh.nu` so commands become `axr lh pair`, `axr lh calibrate`, etc.",
1212+ "value": "rename_to_lh",
1313+ "followups": [
1414+ {
1515+ "id": "rename_deprecation",
1616+ "prompt": "Should the old `steamvr-lh` name be preserved as a deprecated alias?",
1717+ "options": [
1818+ {
1919+ "label": "Yes, keep a thin `steamvr-lh.nu` wrapper that re-exports `lh.nu` with a deprecation warning",
2020+ "value": "keep_alias"
2121+ },
2222+ {
2323+ "label": "No, just rename — the project is marked as no longer maintained anyway",
2424+ "value": "no_alias"
2525+ }
2626+ ]
2727+ }
2828+ ]
2929+ },
3030+ {
3131+ "label": "Add `pair` to the existing `steamvr-lh.nu` module (command becomes `axr steamvr-lh pair`)",
3232+ "value": "keep_steamvr_lh"
3333+ },
3434+ {
3535+ "label": "Create a new separate `lh.nu` module for pairing only, keep `steamvr-lh.nu` for calibrate/console",
3636+ "value": "separate_module"
3737+ }
3838+ ]
3939+ },
4040+ {
4141+ "id": "backend_tool",
4242+ "prompt": "Which backend tool should the pair command use?",
4343+ "options": [
4444+ {
4545+ "label": "Use `lighthouse_console` now (available today via SteamVR, slower but works)",
4646+ "value": "lighthouse_console",
4747+ "followups": [
4848+ {
4949+ "id": "lhctl_future_proofing",
5050+ "prompt": "Should the implementation be structured to make switching to `lhctl` easier later?",
5151+ "options": [
5252+ {
5353+ "label": "Yes, abstract the pairing logic behind a helper function so the backend can be swapped",
5454+ "value": "abstract_backend"
5555+ },
5656+ {
5757+ "label": "No, just call lighthouse_console directly — refactor when lhctl is available",
5858+ "value": "direct_call"
5959+ }
6060+ ]
6161+ }
6262+ ]
6363+ },
6464+ {
6565+ "label": "Wait for `lhctl` to be publicly released and implement with that",
6666+ "value": "wait_for_lhctl"
6767+ },
6868+ {
6969+ "label": "Support both: detect if `lhctl` is available and prefer it, fall back to `lighthouse_console`",
7070+ "value": "dual_backend",
7171+ "followups": [
7272+ {
7373+ "id": "dual_backend_flag",
7474+ "prompt": "Should the user be able to force a specific backend?",
7575+ "options": [
7676+ {
7777+ "label": "Yes, add a `--backend` flag (e.g. `axr lh pair --backend lhctl`)",
7878+ "value": "backend_flag"
7979+ },
8080+ {
8181+ "label": "No, auto-detect only — simpler UX",
8282+ "value": "auto_detect_only"
8383+ }
8484+ ]
8585+ }
8686+ ]
8787+ }
8888+ ]
8989+ },
9090+ {
9191+ "id": "pairing_workflow",
9292+ "prompt": "How should the pairing workflow work from the user's perspective?",
9393+ "options": [
9494+ {
9595+ "label": "Fully interactive: scan for devices, present a list, user selects which to pair",
9696+ "value": "interactive",
9797+ "followups": [
9898+ {
9999+ "id": "interactive_multi_select",
100100+ "prompt": "Should the user be able to pair multiple devices in one session?",
101101+ "options": [
102102+ {
103103+ "label": "Yes, allow multi-select from discovered devices",
104104+ "value": "multi_select"
105105+ },
106106+ {
107107+ "label": "No, pair one device at a time (simpler, matches lighthouse_console behavior)",
108108+ "value": "single_select"
109109+ }
110110+ ]
111111+ }
112112+ ]
113113+ },
114114+ {
115115+ "label": "Semi-interactive: user provides device serial/ID as argument, command handles the rest",
116116+ "value": "semi_interactive"
117117+ },
118118+ {
119119+ "label": "Guided wizard: step-by-step prompts (put device in pairing mode, confirm, etc.)",
120120+ "value": "guided_wizard"
121121+ }
122122+ ]
123123+ },
124124+ {
125125+ "id": "device_types",
126126+ "prompt": "Which lighthouse-tracked device types should the pair command support?",
127127+ "options": [
128128+ {
129129+ "label": "All lighthouse devices (base stations, controllers, trackers, HMDs)",
130130+ "value": "all_devices"
131131+ },
132132+ {
133133+ "label": "Controllers and trackers only (most common pairing need)",
134134+ "value": "controllers_trackers"
135135+ },
136136+ {
137137+ "label": "Start with controllers only, add other device types in follow-up PRs",
138138+ "value": "controllers_first"
139139+ }
140140+ ]
141141+ },
142142+ {
143143+ "id": "lh_console_discovery",
144144+ "prompt": "The existing `lh-console` helper in `steamvr-lh.nu` checks multiple paths (PATH, Flatpak Steam, native Steam). Should the pair command reuse this helper?",
145145+ "options": [
146146+ {
147147+ "label": "Yes, reuse the existing `lh-console` helper as-is",
148148+ "value": "reuse_helper"
149149+ },
150150+ {
151151+ "label": "Reuse but refactor `lh-console` to also support piping input/capturing output (needed for scripted pairing)",
152152+ "value": "refactor_helper"
153153+ },
154154+ {
155155+ "label": "Write a new helper specifically for pairing that handles the async job pattern lighthouse_console needs",
156156+ "value": "new_helper"
157157+ }
158158+ ]
159159+ },
160160+ {
161161+ "id": "error_handling",
162162+ "prompt": "How should the command handle common failure cases (no Bluetooth, SteamVR not installed, device not found)?",
163163+ "options": [
164164+ {
165165+ "label": "Pre-flight checks: verify prerequisites before attempting pairing, with actionable error messages",
166166+ "value": "preflight_checks"
167167+ },
168168+ {
169169+ "label": "Attempt pairing and surface errors from lighthouse_console/lhctl with minimal wrapping",
170170+ "value": "passthrough_errors"
171171+ },
172172+ {
173173+ "label": "Pre-flight checks plus a `--force` flag to skip them for advanced users",
174174+ "value": "preflight_with_force"
175175+ }
176176+ ]
177177+ },
178178+ {
179179+ "id": "timeout_handling",
180180+ "prompt": "Pairing can take a while (especially with lighthouse_console). How should timeouts be handled?",
181181+ "options": [
182182+ {
183183+ "label": "Default timeout with a `--timeout` flag to override",
184184+ "value": "configurable_timeout"
185185+ },
186186+ {
187187+ "label": "No timeout — wait indefinitely until pairing succeeds or user cancels (Ctrl+C)",
188188+ "value": "no_timeout"
189189+ },
190190+ {
191191+ "label": "Progress indicator with a generous default timeout (e.g. 60s) and clear messaging",
192192+ "value": "progress_with_timeout"
193193+ }
194194+ ]
195195+ },
196196+ {
197197+ "id": "testing_strategy",
198198+ "prompt": "How should this feature be tested? (Hardware-dependent features are hard to unit test)",
199199+ "options": [
200200+ {
201201+ "label": "Manual testing only — document test procedure in PR description",
202202+ "value": "manual_only"
203203+ },
204204+ {
205205+ "label": "Add basic tests for argument parsing and pre-flight checks (mock external commands)",
206206+ "value": "basic_tests"
207207+ },
208208+ {
209209+ "label": "No tests needed — the project is marked as no longer maintained",
210210+ "value": "no_tests"
211211+ }
212212+ ]
213213+ },
214214+ {
215215+ "id": "documentation",
216216+ "prompt": "What documentation should accompany this feature?",
217217+ "options": [
218218+ {
219219+ "label": "Inline help text in the Nushell command (consistent with existing commands like `calibrate`)",
220220+ "value": "inline_help_only"
221221+ },
222222+ {
223223+ "label": "Inline help text plus a section in the README",
224224+ "value": "inline_and_readme"
225225+ },
226226+ {
227227+ "label": "Inline help text only — README already says the project is no longer maintained",
228228+ "value": "inline_no_readme"
229229+ }
230230+ ]
231231+ }
232232+ ]
233233+}
234234+```
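The questionnaire files above nest follow-up questions under individual options (`items` → `options` → `followups`). A minimal sketch of how a consumer might walk that shape is shown here; `SAMPLE` is a trimmed stand-in for the real file and `iter_questions` is a hypothetical helper, neither of which exists in the repo.

```python
from __future__ import annotations

import json
from collections.abc import Iterator
from typing import Any

# Trimmed sample that mirrors the questionnaire shape (assumption: the real
# file has the same items -> options -> followups nesting shown in the diff).
SAMPLE = json.loads("""
{
  "version": 1,
  "items": [
    {
      "id": "backend_tool",
      "options": [
        {
          "label": "Use lighthouse_console now",
          "value": "lighthouse_console",
          "followups": [
            {
              "id": "lhctl_future_proofing",
              "options": [
                {"label": "Yes", "value": "abstract_backend"},
                {"label": "No", "value": "direct_call"}
              ]
            }
          ]
        },
        {"label": "Wait for lhctl", "value": "wait_for_lhctl"}
      ]
    },
    {
      "id": "device_types",
      "options": [{"label": "All lighthouse devices", "value": "all_devices"}]
    }
  ]
}
""")


def iter_questions(
    items: list[dict[str, Any]], depth: int = 0
) -> Iterator[tuple[int, str]]:
    """Yield (depth, id) for every question, descending into option follow-ups."""
    for item in items:
        yield depth, item["id"]
        for option in item.get("options", []):
            # Follow-ups are optional and only present on some options.
            yield from iter_questions(option.get("followups", []), depth + 1)


for depth, question_id in iter_questions(SAMPLE["items"]):
    print("  " * depth + question_id)
```

Follow-ups come out directly after their parent question, which makes the walk convenient for checks such as confirming every `id` is unique.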
···11+"""Daily sync pipeline — imports stage runners from scraper/."""
22+33+from __future__ import annotations
44+55+import os
66+import sys
77+import time
88+from collections.abc import Callable
99+from dataclasses import dataclass, field
1010+from pathlib import Path
1111+from typing import Any
1212+1313+REPO_ROOT = Path(__file__).resolve().parent.parent
1414+SCRAPER_ROOT = REPO_ROOT / "scraper"
1515+if str(SCRAPER_ROOT) not in sys.path:
1616+ sys.path.insert(0, str(SCRAPER_ROOT))
1717+1818+from check_readmes import run_check_readmes # noqa: E402
1919+from db import connect, init_schema, set_crawl_state # noqa: E402
2020+from embed_issues import run_embed_issues # noqa: E402
2121+from embed_readmes import run_embed_readmes # noqa: E402
2222+from fetch_collaborators import run_fetch_collaborators # noqa: E402
2323+from fetch_issues import run_fetch_issues # noqa: E402
2424+from progress import banner, log, summary_block # noqa: E402
2525+from stage2_network import run_stage2_network # noqa: E402
2626+from stage2_pds import run_stage2_accounts_only, run_stage2_repos_only # noqa: E402
2727+2828+CRAWL_KEY = "sync:daily"
2929+3030+StageFn = Callable[[str], dict[str, Any]]
3131+3232+3333+@dataclass
3434+class Stage:
3535+ key: str
3636+ title: str
3737+ run: StageFn
3838+ enabled: bool = True
3939+4040+4141+@dataclass
4242+class SyncReport:
4343+ started_at: float = field(default_factory=time.time)
4444+ stages: dict[str, dict[str, Any]] = field(default_factory=dict)
4545+ errors: list[str] = field(default_factory=list)
4646+4747+ @property
4848+ def elapsed_s(self) -> float:
4949+ return time.time() - self.started_at
5050+5151+5252+def _env_flag(name: str, *, default: bool) -> bool:
5353+ raw = os.getenv(name, "").strip().lower()
5454+ if not raw:
5555+ return default
5656+ return raw in ("1", "true", "yes")
5757+5858+5959+def _configure_daily_env() -> None:
6060+ """Defaults tuned for scheduled daily runs (override via env)."""
6161+ os.environ.setdefault("TANGLED_ISSUE_REFRESH", "1")
6262+ os.environ.setdefault("TANGLED_ISSUE_ALL_USERS", "1")
6363+ os.environ.setdefault("TANGLED_STAGE2_NETWORK_REFRESH", "0")
6464+6565+6666+def _format_stats(stats: dict[str, Any]) -> str:
6767+ """One-line summary of rows processed for sync logs."""
6868+ if not stats:
6969+ return "(no stats)"
7070+ ordered = (
7171+ "repos_stored",
7272+ "already_in_db",
7373+ "account_count",
7474+ "users_scanned",
7575+ "issues_upserted",
7676+ "open_issues",
7777+ "found",
7878+ "missing",
7979+ "repos_fetched",
8080+ "collaborator_edges",
8181+ "embedded",
8282+ "batches",
8383+ "errors",
8484+ "resolve_failed",
8585+ "record_failed",
8686+ "already_synced",
8787+ "skipped",
8888+ "skipped_knot",
8989+ "error",
9090+ )
9191+ parts: list[str] = []
9292+ seen: set[str] = set()
9393+ for key in ordered:
9494+ if key in stats and stats[key] is not None:
9595+ parts.append(f"{key}={stats[key]}")
9696+ seen.add(key)
9797+ for key, value in stats.items():
9898+ if key not in seen and value is not None:
9999+ parts.append(f"{key}={value}")
100100+ return ", ".join(parts) if parts else "(no stats)"
101101+102102+103103+def build_stages() -> list[Stage]:
104104+ return [
105105+ Stage(
106106+ key="network",
107107+ title="Discover repos (tangled.org search)",
108108+ run=run_stage2_network,
109109+ enabled=_env_flag("TANGLED_SYNC_NETWORK", default=True),
110110+ ),
111111+ Stage(
112112+ key="accounts",
113113+ title="Refresh tngl.sh accounts",
114114+ run=run_stage2_accounts_only,
115115+ enabled=_env_flag("TANGLED_SYNC_ACCOUNTS", default=True),
116116+ ),
117117+ Stage(
118118+ key="repos",
119119+ title="Scan tngl.sh repo records (heavy)",
120120+ run=run_stage2_repos_only,
121121+ enabled=_env_flag("TANGLED_SYNC_TNGL_REPOS", default=False),
122122+ ),
123123+ Stage(
124124+ key="issues",
125125+ title="Re-scan all users for issues",
126126+ run=run_fetch_issues,
127127+ enabled=_env_flag("TANGLED_SYNC_ISSUES", default=True),
128128+ ),
129129+ Stage(
130130+ key="readmes",
131131+ title="Fetch missing READMEs from knots",
132132+ run=run_check_readmes,
133133+ enabled=_env_flag("TANGLED_SYNC_READMES", default=True),
134134+ ),
135135+ Stage(
136136+ key="collaborators",
137137+ title="Fetch repo collaborators",
138138+ run=run_fetch_collaborators,
139139+ enabled=_env_flag("TANGLED_SYNC_COLLABORATORS", default=True),
140140+ ),
141141+ Stage(
142142+ key="embed_readmes",
143143+ title="Embed READMEs (Gemini)",
144144+ run=run_embed_readmes,
145145+ enabled=_env_flag("TANGLED_SYNC_EMBED_READMES", default=True),
146146+ ),
147147+ Stage(
148148+ key="embed_issues",
149149+ title="Embed issues (Gemini)",
150150+ run=run_embed_issues,
151151+ enabled=_env_flag("TANGLED_SYNC_EMBED_ISSUES", default=True),
152152+ ),
153153+ ]
154154+155155+156156+def run_daily_sync(dsn: str, *, only: set[str] | None = None) -> SyncReport:
157157+ _configure_daily_env()
158158+ report = SyncReport()
159159+ stages = [s for s in build_stages() if s.enabled and (not only or s.key in only)]
160160+161161+ banner("DAILY SYNC — Tangled → Postgres")
162162+ log("sync", f"Stages: {', '.join(s.key for s in stages) or '(none)'}")
163163+164164+ init_schema(dsn)
165165+166166+ with connect(dsn) as conn:
167167+ set_crawl_state(
168168+ conn,
169169+ key=CRAWL_KEY,
170170+ status="running",
171171+ meta={"stages": [s.key for s in stages]},
172172+ )
173173+ conn.commit()
174174+175175+ for i, stage in enumerate(stages, start=1):
176176+ log("sync", f"── Stage {i}/{len(stages)}: {stage.title} ({stage.key}) ──")
177177+ t0 = time.time()
178178+ try:
179179+ stats = stage.run(dsn)
180180+ report.stages[stage.key] = {
181181+ "status": "ok",
182182+ "elapsed_s": round(time.time() - t0, 1),
183183+ "stats": stats,
184184+ }
185185+ log(
186186+ "sync",
187187+ f"✓ {stage.key} done in {report.stages[stage.key]['elapsed_s']}s — {_format_stats(stats)}",
188188+ )
189189+ except Exception as exc:
190190+ msg = f"{stage.key}: {exc}"
191191+ report.errors.append(msg)
192192+ report.stages[stage.key] = {
193193+ "status": "error",
194194+ "elapsed_s": round(time.time() - t0, 1),
195195+ "error": str(exc),
196196+ }
197197+ log("sync", f"✗ {msg}")
198198+ if _env_flag("TANGLED_SYNC_FAIL_FAST", default=False):
199199+ break
200200+201201+ final_status = "complete" if not report.errors else "partial"
202202+ with connect(dsn) as conn:
203203+ set_crawl_state(
204204+ conn,
205205+ key=CRAWL_KEY,
206206+ status=final_status,
207207+ meta={
208208+ "elapsed_s": round(report.elapsed_s, 1),
209209+ "stages": report.stages,
210210+ "errors": report.errors,
211211+ },
212212+ )
213213+ conn.commit()
214214+215215+ lines = [f"Elapsed: {report.elapsed_s:.0f}s", ""]
216216+ for key, info in report.stages.items():
217217+ mark = "OK" if info["status"] == "ok" else "ERR"
218218+ line = f" [{mark}] {key} ({info['elapsed_s']}s)"
219219+ if info["status"] == "ok" and info.get("stats"):
220220+ line += f" — {_format_stats(info['stats'])}"
221221+ lines.append(line)
222222+ if report.errors:
223223+ lines.append("")
224224+ lines.append("Errors:")
225225+ lines.extend(f" - {e}" for e in report.errors)
226226+227227+ summary_block("Daily sync finished", lines)
228228+229229+ if report.errors and _env_flag("TANGLED_SYNC_STRICT", default=True):
230230+ raise SystemExit(1)
231231+ return report
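The stage filter in `run_daily_sync` combines two conditions: the per-stage env flag and the optional `only` set. A minimal sketch of that interaction follows, assuming the same flag rule as `_env_flag`; `MiniStage` and `select_stages` are hypothetical stand-ins written for illustration, not part of the module.

```python
from __future__ import annotations

import os
from dataclasses import dataclass


def env_flag(name: str, *, default: bool) -> bool:
    """Same rule as `_env_flag`: unset or blank gives the default, else 1/true/yes."""
    raw = os.getenv(name, "").strip().lower()
    if not raw:
        return default
    return raw in ("1", "true", "yes")


@dataclass
class MiniStage:  # hypothetical stand-in for Stage, without the runner
    key: str
    enabled: bool


def select_stages(stages: list[MiniStage], only: set[str] | None = None) -> list[str]:
    """Mirror the filter: enabled AND (no filter given OR key is in the filter)."""
    return [s.key for s in stages if s.enabled and (not only or s.key in only)]


os.environ.pop("TANGLED_SYNC_TNGL_REPOS", None)  # unset, so the default applies
os.environ["TANGLED_SYNC_ISSUES"] = "yes"
os.environ["TANGLED_SYNC_READMES"] = "off"  # anything outside 1/true/yes is False

stages = [
    MiniStage("repos", env_flag("TANGLED_SYNC_TNGL_REPOS", default=False)),
    MiniStage("issues", env_flag("TANGLED_SYNC_ISSUES", default=True)),
    MiniStage("readmes", env_flag("TANGLED_SYNC_READMES", default=True)),
]

print(select_stages(stages))                       # ['issues']
print(select_stages(stages, {"repos", "issues"}))  # ['issues']
```

One consequence worth noting: naming a stage in `only` does not enable it. A stage that is off by default, such as `repos`, also needs its env flag set before it will run.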
+1
questionnaire.txt
···11+{"issue":"at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22","version":2,"introduction":{"project":"AtomicXR is a Nushell-based CLI tool (`axr`) for configuring VR/XR on Linux, primarily targeting Fedora Atomic and Universal Blue distributions like Bazzite. The CLI is structured as Nushell modules under `cli/atomic-xr/`, each exporting subcommands (e.g., `envision.nu`, `flatpak.nu`, `steamvr-lh.nu`). The project is marked as no longer maintained in the README, but the repo owner is actively requesting this feature. All modules follow a consistent pattern: they use `std log`, define helper functions, and export public commands.","issue":"The issue requests a new `axr lh pair` command for pairing SteamVR Lighthouse base stations and tracked devices without requiring users to manually invoke `lighthouse_console`. The author proposes two backend tools: (1) `lighthouse_console` from SteamVR, which is available now but slower due to its async job model, or (2) `lhctl`, a faster alternative that is not yet publicly released. The author is open to implementing option 1 now and migrating to option 2 later. Key open decisions include which backend to target, how to structure the command within the existing module hierarchy, what pairing workflow to expose, and how to handle the eventual transition between backends.","approach":"This questionnaire walks through the major implementation decisions in order: first the backend tool strategy (which directly shapes the entire implementation), then the command naming and module placement within the existing CLI structure, followed by the pairing workflow UX, error handling approach, and finally testing and documentation. 
Branch-specific follow-ups drill into details that only matter for a given backend choice, while shared tail questions cover cross-cutting concerns like deprecation planning and documentation."},"items":[{"id":"backend_strategy","prompt":"Which backend tool strategy should the pairing command use?","context":"The issue explicitly presents two backend options — lighthouse_console (available now, slower) and lhctl (faster, not yet public). This is the foundational decision because it determines the command's implementation, performance characteristics, and maintenance trajectory.","explanation":"The existing `steamvr-lh.nu` module already has a `lh-console` helper function that locates `lighthouse_console` from PATH, Flatpak Steam, or native Steam installations. Using `lighthouse_console` means leveraging this existing infrastructure but dealing with its async job model (commands are queued and results polled). `lhctl` would be significantly faster but introduces a dependency on unreleased software. A third option is to build an abstraction layer that supports both, allowing a seamless swap later. Each path has different implications for code complexity, user experience, and long-term maintenance.","options":[{"label":"Implement using lighthouse_console now (available immediately). Use the existing `lh-console` helper in `steamvr-lh.nu` to invoke lighthouse_console for pairing operations. Accept the slower async job model as a tradeoff for immediate availability. Plan to replace the backend later when lhctl is released.","followups":[{"id":"lh_console_async_handling","prompt":"How should the command handle lighthouse_console's async job model?","context":"Because you chose to use lighthouse_console, the pairing process involves submitting async jobs and polling for results. 
The existing `lh-console` helper simply runs the binary, but pairing requires multi-step interaction: discovering devices, initiating pairing, and confirming success.","explanation":"lighthouse_console uses an async job queue where you submit a command and then poll for its completion. For pairing, this typically involves: (1) scanning for nearby devices, (2) sending a pair command to a specific device, and (3) waiting for confirmation. The implementation could either parse stdout from lighthouse_console interactively, use its batch/scripting mode if available, or wrap the entire flow in a loop that polls for job completion. Each approach has different reliability and complexity tradeoffs.","options":[{"label":"Parse lighthouse_console's stdout interactively — run lighthouse_console as a subprocess, send commands via stdin, and parse output line-by-line to track job status and extract pairing results. This gives the most control but requires robust text parsing."},{"label":"Use lighthouse_console's command-line arguments for batch operations — pass all necessary arguments upfront (e.g., device serial, pair command) and capture the final output. Simpler implementation but less interactive feedback for the user."},{"label":"Wrap lighthouse_console calls in a polling loop — submit the pair command, then repeatedly invoke lighthouse_console to check job status until completion or timeout. More resilient to output format changes but slower due to repeated process spawning."}]},{"id":"lh_console_migration_prep","prompt":"How much should the implementation prepare for a future lhctl migration?","context":"Since lighthouse_console is intended as a temporary backend, the code could be structured to make swapping backends easier later, or it could be kept simple with the understanding that a rewrite will happen.","explanation":"Adding an abstraction layer (e.g., a common interface that both backends implement) increases initial complexity but makes the future swap trivial. 
Alternatively, keeping the implementation tightly coupled to lighthouse_console is simpler now but means more work during migration. The existing codebase doesn't use abstraction patterns — modules directly call external tools (e.g., `flatpak run`, `distrobox enter`, `rpm-ostree`). Following this convention suggests a direct implementation is more idiomatic.","options":[{"label":"Keep it simple and direct — implement pairing tightly coupled to lighthouse_console, following the existing pattern in the codebase where modules directly invoke external tools. Accept that a future migration to lhctl will require rewriting the pairing logic."},{"label":"Create a thin abstraction — define the pairing workflow as a sequence of steps (discover, select, pair, verify) with the lighthouse_console implementation behind helper functions. When lhctl arrives, only the helper functions need to change, not the user-facing command logic."}]}]},{"label":"Wait for lhctl and implement using it directly. Defer this feature until lhctl is publicly released, then implement with the faster tool from the start. This avoids throwaway work but leaves users without the feature for an indefinite period.","followups":[{"id":"lhctl_interim_solution","prompt":"Should there be an interim solution while waiting for lhctl?","context":"Because you chose to wait for lhctl, users currently have no streamlined way to pair lighthouse devices through the axr CLI. The existing `axr steamvr-lh console` command already exposes lighthouse_console directly, but it requires users to know the pairing commands themselves.","explanation":"The existing `console` command in `steamvr-lh.nu` already lets users run lighthouse_console with arbitrary arguments. An interim approach could enhance the console command's documentation or add a help subcommand that prints the manual pairing steps, giving users guidance without building full automation. 
Alternatively, the feature could simply wait with no interim measure.","options":[{"label":"Add a help/guide subcommand (e.g., `axr steamvr-lh pair-guide`) that prints step-by-step instructions for manually pairing via lighthouse_console. Low effort, immediately useful, and can be removed or replaced when lhctl arrives."},{"label":"No interim solution — wait for lhctl and implement the full feature then. Users can continue using `axr steamvr-lh console` directly in the meantime."}]}]},{"label":"Implement with lighthouse_console now, but design the command to auto-detect and prefer lhctl when it becomes available. The command checks for lhctl on PATH first; if not found, falls back to lighthouse_console. This follows the same pattern as the existing `lh-console` helper which already checks multiple locations for lighthouse_console.","followups":[{"id":"dual_backend_detection","prompt":"How should backend detection and selection work?","context":"Because you chose the dual-backend approach, the command needs logic to detect which tools are available and select the best one. The existing `lh-console` helper in `steamvr-lh.nu` already demonstrates this pattern — it checks PATH, then Flatpak Steam, then native Steam for lighthouse_console.","explanation":"The detection could be a simple `which` check at command startup (matching the existing `lh-console` pattern), or it could include a flag to force a specific backend. A flag would be useful for debugging or when users want to explicitly choose, but adds complexity. The existing codebase doesn't use backend-selection flags — tools are auto-detected silently.","options":[{"label":"Auto-detect only, following the existing `lh-console` pattern — check for lhctl first via `which`, fall back to lighthouse_console. No user-facing flag. 
Log which backend was selected at debug level using `std log`."},{"label":"Auto-detect with an optional override flag (e.g., `--backend lhctl` or `--backend lighthouse_console`) — auto-detect by default but allow users to force a specific backend. Useful for testing or when both tools are installed but one is preferred."}]}]}]},{"id":"command_naming","prompt":"What should the command name and module placement be?","context":"Regardless of which backend is chosen, the new pairing command needs a name and a home within the existing module structure. The issue suggests `axr lh pair`, but the existing lighthouse module is named `steamvr-lh.nu` and is registered in `mod.nu` as `export use steamvr-lh.nu`, making the current command namespace `axr steamvr-lh`.","explanation":"The existing module already exports `calibrate` and `console` commands, accessible as `axr steamvr-lh calibrate` and `axr steamvr-lh console`. Adding `pair` to this module would make it `axr steamvr-lh pair`. However, the issue suggests `axr lh pair`, which would require either renaming the module file to `lh.nu` (a breaking change for existing `axr steamvr-lh` users) or creating a new `lh.nu` module alongside the existing one. The README notes the project is deprecated, which may reduce concern about breaking changes.","options":[{"label":"Add the `pair` command to the existing `steamvr-lh.nu` module, making it `axr steamvr-lh pair`. This is the simplest approach — no module renaming, no breaking changes, consistent with the existing command structure. The command name is slightly longer than the issue suggests but follows established conventions."},{"label":"Rename `steamvr-lh.nu` to `lh.nu` and update `mod.nu` accordingly, making all lighthouse commands available under `axr lh` (e.g., `axr lh pair`, `axr lh calibrate`, `axr lh console`). This matches the issue's suggested naming but is a breaking change for anyone using `axr steamvr-lh` commands. 
Given the project's deprecated status, this may be acceptable."},{"label":"Create a new `lh.nu` module that re-exports from `steamvr-lh.nu` and adds the `pair` command, providing both `axr lh pair` (new) and `axr steamvr-lh pair` (alias). This avoids breaking existing commands while introducing the shorter namespace, but adds module complexity."}]},{"id":"pairing_workflow","prompt":"What pairing workflow should the command expose to users?","context":"Regardless of backend and naming choices, the user-facing pairing workflow needs to be defined. Lighthouse pairing involves discovering nearby Bluetooth LE devices (base stations, controllers, trackers) and establishing a connection with the host system.","explanation":"Pairing can be fully automated (scan, discover all devices, pair them all) or interactive (show discovered devices, let the user select which to pair). The existing `calibrate` command in `steamvr-lh.nu` uses an interactive pattern — it asks the user yes/no questions and waits for input. Other modules like `envision.nu` take a more automated approach with flags. The choice affects UX complexity and safety (auto-pairing everything might pair unintended nearby devices in shared spaces).","options":[{"label":"Interactive device selection — scan for nearby lighthouse devices, display a numbered list of discovered devices (with serial numbers and types), and let the user select which ones to pair using Nushell's `input list` or similar interactive selection. 
This is safer in shared spaces and gives users control over which devices are paired.","followups":[{"id":"interactive_multi_select","prompt":"Should users be able to select multiple devices at once or pair one at a time?","context":"Because you chose interactive device selection, the selection interface needs to handle the common case where users want to pair multiple devices (e.g., two base stations and two controllers) in a single session.","explanation":"Nushell's `input list` supports single selection. For multi-select, the command could loop (pair one, ask if there are more), use a custom multi-select prompt, or accept device identifiers as arguments alongside the interactive mode. The existing codebase uses simple `input` calls and `input list` for single selections.","options":[{"label":"Loop-based approach — after pairing one device, ask 'Would you like to pair another device?' and re-scan. Simple to implement using existing patterns (the `ask yn` helper in `steamvr-lh.nu`), handles the multi-device case naturally."},{"label":"Accept optional device serial numbers as arguments — if serials are provided, pair those directly without interactive selection. If no arguments given, enter interactive mode. This supports both scripted and interactive use cases."}]}]},{"label":"Fully automated — scan for all nearby unpaired lighthouse devices and pair them all automatically. Simpler UX (just run `axr lh pair` and wait), but risks pairing unintended devices in environments where multiple lighthouse setups are nearby (e.g., VR arcades, shared spaces).","followups":[{"id":"auto_pair_confirmation","prompt":"Should fully automated pairing require a confirmation step?","context":"Because you chose fully automated pairing, there's a risk of pairing unintended nearby devices. 
A confirmation step showing what will be paired could mitigate this without adding full interactive selection.","explanation":"The confirmation could show discovered devices and ask 'Pair all N devices? [y/n]' before proceeding, or it could include a `--yes` / `-y` flag to skip confirmation for scripted use. The existing `calibrate` command uses confirmation prompts via the `ask yn` helper.","options":[{"label":"Show discovered devices and require confirmation before pairing — display the list of found devices, then ask 'Pair all N devices? [y/n]' using the existing `ask yn` helper. Add a `--yes` flag to skip confirmation for scripted/automated use."},{"label":"No confirmation — pair immediately upon discovery. Keep the command simple and fast. Users in shared spaces can use the interactive mode (if implemented) or be careful about when they run the command."}]}]}]},{"id":"error_handling","prompt":"How should the command handle common failure scenarios?","context":"Regardless of the pairing workflow chosen, the command needs to handle failures gracefully. Common issues include: Bluetooth not available/enabled, no devices found within timeout, pairing rejected by device, and the backend tool (lighthouse_console or lhctl) not being installed.","explanation":"The existing codebase has two error handling patterns: (1) `error make` with descriptive messages and `help` fields (used in `steamvr-lh.nu`, `runtime.nu`, `oscavmgr.nu`) for fatal errors, and (2) `std log` for warnings and informational messages. The `lh-console` helper already handles the 'tool not found' case with an `error make`. Pairing-specific failures (no devices found, Bluetooth off) need their own handling strategy.","options":[{"label":"Fail fast with descriptive errors — check prerequisites (Bluetooth availability, backend tool installed) upfront before attempting any pairing. 
Use `error make` with helpful messages and remediation steps in the `help` field, matching the existing pattern in `lh-console` and `runtime.nu`. If pairing fails mid-process, report the specific failure and suggest next steps."},{"label":"Retry with guidance — on transient failures (no devices found, pairing timeout), automatically retry a configurable number of times with user-friendly status messages via `std log`. Only fail with `error make` on non-recoverable errors (tool not installed, Bluetooth hardware missing). Add a `--timeout` flag to control how long to scan for devices."}]},{"id":"testing_strategy","prompt":"What testing approach should be used for the new pairing command?","context":"Regardless of all previous choices, the new command needs some form of validation. The existing codebase has no test files — modules are Nushell scripts that directly invoke system tools, making traditional unit testing difficult.","explanation":"The codebase currently has zero tests. Nushell supports testing via `nu --testbin` and the `testing` module, but the existing modules are tightly coupled to system state (Flatpak, rpm-ostree, distrobox, SteamVR). Adding tests for the pairing command would be a first for this project. Options range from no tests (matching current practice) to adding integration tests that mock external tool calls.","options":[{"label":"No automated tests — match the existing codebase convention. Rely on manual testing with actual lighthouse hardware. Document the manual testing procedure in a comment or the PR description."},{"label":"Add basic smoke tests — create a test file that verifies the command's prerequisite checks (e.g., that it properly errors when lighthouse_console is not found) without requiring actual hardware. 
This would be the first test in the project and could establish a testing pattern for future commands."}]},{"id":"documentation","prompt":"What documentation should accompany the new command?","context":"Regardless of all implementation choices, the new command needs some level of documentation. The existing commands use Nushell's built-in doc comments (lines starting with `#` above function definitions) which are displayed by `axr -l` and `axr <command> --help`. The README currently focuses on migration away from AtomicXR.","explanation":"Nushell doc comments are the primary documentation mechanism in this codebase — every exported function has a `# Description` comment above it, and parameters have inline `# comment` annotations. The README is focused on deprecation/migration and doesn't document individual commands. Adding README documentation for a new feature in a deprecated project may send mixed signals, while inline doc comments are lightweight and follow existing conventions.","options":[{"label":"Inline Nushell doc comments only — add descriptive comments above the `pair` command and its parameters, following the existing pattern (e.g., `# Open SteamVR's lighthouse_console` on the `console` command). This is sufficient for `axr steamvr-lh pair --help` output and matches the project's documentation style."},{"label":"Inline doc comments plus a brief section in the README — add doc comments and also add a short section in the README under the legacy CLI usage area, documenting the `pair` command's purpose and basic usage. This helps users who read the README before installing."}]}]}
+5
questionnaire_job/.env.example
···11+# Questionnaire job env (deploy with questionnaire_job/deploy.sh or set on Cloud Run Job)
22+# DB_CONNECTION_STRING=postgresql://...
33+# ANTHROPIC_API_KEY=...
44+# ANTHROPIC_QUESTIONNAIRE_MODEL=claude-opus-4-6
55+# QUESTIONNAIRE_MIN_TOOL_READS=2
···11+# Connection string for the SHARED Postgres database (required when DATA_STORAGE=sql).
22+DB_CONNECTION_STRING=postgresql://user:password@host:5432/postgres
33+44+# Storage backend: sql (Postgres+pgvector) or git (in-memory numpy+jsonl bundle).
55+# DATA_STORAGE=sql
66+# DATA_STORAGE=git
77+# REC_DATA_GIT_URL=https://github.com/org/tangled-rec-data.git
88+# REC_DATA_DIR=/tmp/tangled-rec-data
99+# REC_DATA_GIT_REF=main
1010+# REC_DATA_REFRESH_SEC=0
1111+# REC_DATA_GIT_CLONE_TIMEOUT=120
1212+# REC_DATA_GIT_SSH_KEY=<base64-encoded deploy key for git@ SSH remotes on Cloud Run>
1313+1414+# Google Gemini API key — NOT used by the service. Only needed if you run the Node
1515+# reference embedding scripts in reference/src/ (gemini-embedding-001 @ 1536 dims).
1616+# GEMINI_API_KEY=your-gemini-api-key
1717+1818+# Base URL used to build absolute repo links in responses.
1919+# TANGLED_WEB_BASE=https://tangled.org
2020+2121+# Questionnaire read source. "knot" (default) reads each questionnaire per-issue from
2222+# the knot blob XRPC; "db" reverts to Postgres. In knot mode, set DB_FALLBACK=1 to fall
2323+# back to the DB on a miss during transition.
2424+# QUESTIONNAIRE_SOURCE=knot
2525+# QUESTIONNAIRE_KNOT_HOST=knot1.tangled.sh
2626+# QUESTIONNAIRE_REPO_DID=did:plc:vg4msk54xucet6of2rdrgahe
2727+# QUESTIONNAIRE_KNOT_TIMEOUT=10
2828+# QUESTIONNAIRE_DB_FALLBACK=0
2929+3030+# Recommendation tunables (optional; defaults shown).
3131+# REC_PER_SEED_LIMIT=25
3232+# REC_DISTANCE_FLOOR=0.30
3333+# REC_ISSUE_DISTANCE_FLOOR=0.40
3434+# REC_MIN_README_CHARS=120 # drop near-empty READMEs as seeds + candidates (test repos); 0 disables
3535+# REC_MAX_REPOS=40
3636+# REC_MAX_ISSUES=40
3737+# REC_QUERY_WORKERS=8 # concurrent per-seed kNN queries (DB round-trips dominate latency)
···11+# Recommendation Engine — HTTP API
22+33+Standalone FastAPI service for Tangled repo/issue discovery.
44+55+**Storage:** `DATA_STORAGE=sql` (default, Postgres+pgvector) or `DATA_STORAGE=git`
66+(in-memory numpy+jsonl bundle cloned from `REC_DATA_GIT_URL` at boot). See
77+`.env.example`.
88+99+Endpoints: `/recommendations`, `/questionnaire` (sql only today), `/health`.
1010+1111+Base URL: whatever you deploy to (the Tangled appview points `TANGLED_DISCOVER_ENDPOINT`
1212+at the `/recommendations` path).
1313+1414+---
1515+1616+## `GET /recommendations`
1717+1818+The contract consumed by the Tangled appview. Returns the user's interest chips plus
1919+ranked repo + issue recommendations, with the user's own/collaborated repos and
2020+self-authored issues excluded.
2121+2222+**Query params**
2323+2424+| Param | Required | Notes |
2525+| --- | --- | --- |
2626+| `handle` | yes | The user's Tangled DID, e.g. `did:plc:abc123`. |
2727+| `gh` | no | Connected GitHub username. Accepted but currently ignored (no GitHub data). |
2828+2929+**Response** `200 OK` — see [`schema.md`](../../schema.md) for the authoritative shape. Summary:
3030+3131+```jsonc
3232+{
3333+ "profile": {
3434+ "interests": [{ "label": "nix", "slug": "nix" }], // from the user's repo topics
3535+ "languages": [], // no language signal yet
3636+ "sources": { "tangled": { "repos": 10 } } // github omitted (no data)
3737+ },
3838+ "repos": [{
3939+ "name": "...", "owner": "@handle", "language": "", "description": "...",
4040+ "stars": 0, "openIssues": 3, "lastActive": "<RFC3339>",
4141+ "url": "https://tangled.org/@handle/name",
4242+ "basedOnRepoUrl": "https://tangled.org/@you/your-seed-repo"
4343+ }],
4444+ "issues": [{
4545+ "title": "...", "repo": "handle/name", "owner": "@handle",
4646+ "issueUri": "at://did:plc:…/sh.tangled.repo.issue/3k…",
4747+ "repoDid": "did:plc:...", "rkey": "3k...",
4848+ "url": "https://tangled.org/@handle/name",
4949+ "basedOnRepoUrl": "https://tangled.org/@you/your-seed-repo",
5050+ "repoReadme": "...",
5151+ "labels": [], "comments": 0, "language": "", "lastActive": "<RFC3339>"
5252+ }]
5353+}
5454+```
5555+5656+Notes:
5757+- Empty user → `"repos": []` (the frontend then shows its cold-start view).
5858+- `stars`/`comments`/`language`/`languages` are stubbed (no source in the shared DB yet).
5959+- Issues omit `number` (issue permalink); the frontend resolves it from `(repoDid, rkey)`.
6060+ `url` is the parent repo; `basedOnRepoUrl` is the user's seed repo that surfaced the hit.
6161+- `basedOnRepoUrl` on repos is the same seed attribution (the user's repo whose README
6262+ embedding produced the closest match).
6363+6464+---
6565+6666+## `GET /questionnaire`
6767+6868+Return the cached AI-solve questionnaire JSON for an issue (written by the questionnaire
6969+Cloud Run job). Does not generate on demand — returns `404` if not cached yet.
7070+7171+**Query params**
7272+7373+| Param | Required | Notes |
7474+| --- | --- | --- |
7575+| `issue` | yes* | Full `at://…/sh.tangled.repo.issue/<rkey>` URI, or bare rkey (DB lookup). |
7676+| `issue-uri` | yes* | Alias for `issue`. |
7777+7878+\* Provide one of `issue` or `issue-uri`.
7979+8080+**Response** `200 OK` — questionnaire object (version 2: `introduction`, `items[]`, …).
8181+8282+**Errors**
8383+8484+| Status | When |
8585+| --- | --- |
8686+| `400` | Missing param, invalid URI, or ambiguous rkey |
8787+| `404` | Issue URI valid but no cached questionnaire |
8888+8989+```bash
9090+curl 'localhost:8000/questionnaire?issue=at://did:plc:…/sh.tangled.repo.issue/3lv…'
9191+```
9292+9393+---
9494+9595+## `GET /health`
9696+9797+```jsonc
9898+{ "status": "ok", "db": true }
9999+```
100100+101101+`status` is `"degraded"` with `db:false` (and an `error`) if the database is unreachable.
102102+103103+---
104104+105105+## Conventions
106106+107107+- Timestamps (`lastActive`) are RFC-3339; the frontend humanizes them.
108108+- `owner` carries a leading `@`; repo `url` is absolute.
109109+- Ordering is the engine's call — arrays are returned already ranked, most relevant first.
110110+- Errors: any non-200 (or timeout) makes the appview fall back to its cold-start view; no
111111+ structured error body is required.
+217
recommendation/CLAUDE.md
···11+# CLAUDE.md — Tangled Recommendation Engine
22+33+Context for any Claude session working in this folder. This is a **standalone
44+Python/FastAPI service** (it will be lifted into its own repo and hosted separately).
55+Read this top-to-bottom before changing anything.
66+77+---
88+99+## 1. What this is
1010+1111+The recommendation backend for Tangled's **Discover** (contribution-discovery) feature.
1212+Given a user's DID it returns repo + issue recommendations. It reads README/issue
1313+**embeddings** (precomputed by the data teammate) from a shared Postgres + pgvector
1414+database and reranks them. The Tangled web app ("appview", a separate Go service) calls
1515+this over HTTP and renders the results. The service makes **no external API calls** at
1616+runtime — it only reads the DB.
1717+1818+```
1919+Tangled appview ──HTTP(handle,gh)──► THIS service ──► shared Postgres+pgvector (READ-ONLY)
2020+ (Go, separate repo) (Python/FastAPI)
2121+```
2222+2323+> Semantic free-text search (`GET /search`) was built then **removed at the user's request**
2424+> (the Discover UI only consumes `/recommendations`). It's easy to re-add: embed the query
2525+> with Gemini (`RETRIEVAL_QUERY`) and run the same kNN/merge/shape pipeline with a single
2626+> "query" seed. The Node `reference/src/issue_search.mjs` shows the approach.
2727+2828+It was **ported from validated Node scripts** in `reference/src/*.mjs` (the "oracle"):
2929+`similar_repos.mjs` (per-seed kNN + dedup — closest to our model), `issue_experiment.mjs`
3030+(issue→README matching), `embed_readmes.mjs` (Gemini embed + L2-normalize). Consult those
3131+when in doubt about an algorithm detail; they are known-good.
3232+3333+## 2. Locked decisions (do not silently reverse)
3434+3535+- **Standalone Python/FastAPI** service. (Earlier drafts considered Go-in-appview and
3636+ Node — both rejected. Don't reintroduce.)
3737+- **Search-per-seed + consensus**, NOT clustering. Each of the user's repos is searched
3838+ independently; a candidate several seeds agree on ranks higher. (An earlier clustering
3939+ approach was intentionally dropped — simpler, no threshold to tune, better explanations.)
4040+- **Consume existing issue embeddings** — the data teammate already ingests + embeds
4141+ issues. We do NOT run an issue ingestion pipeline.
4242+- **Contract is fixed** by `schema.md` (in the parent repo root) and the Go client
4343+ `appview/state/discover_engine.go`. The wire format carries **no** `pulls`, `reasons`,
4444+ `themes`, `score`, or good-first fields. Consensus/distance are used internally for
4545+ ranking only — never emitted.
4646+4747+## 3. The shared database (READ-ONLY)
4848+4949+- Postgres + pgvector on Google Cloud SQL (public IP, self-signed cert). Connection string
5050+ is in `.env` as `DB_CONNECTION_STRING`; `app/config.py` auto-appends `sslmode=require`
5151+ (the psycopg equivalent of the scripts' `rejectUnauthorized:false`).
5252+- **Boundaries:** every existing table is READ-ONLY for us. The only writes we are ever
5353+ authorized to make are the embedding columns of `tangled_readmes`
5454+ (`embedding`/`embedding_model`/`embedded_at`) and our own `rec` schema (not used yet).
5555+ **Never** insert/update/delete anything else.
5656+- **IP authorization:** the DB only accepts authorized IPs. On this machine the IP is
5757+ already authorized. On a fresh host:
5858+ `gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me)`.
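5959+ If you can't connect, this is almost always why. Note that `--authorized-networks` replaces the instance's existing list, so include every network that must stay authorized. (`gcloud` is NOT installed here.)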
6060+- The schema is alpha and **moves** — introspect to confirm before relying on a column.
6161+6262+### Tables we use (key columns)
6363+- `tangled_readmes` (main repo signal): `repo_did` (pk), `repo_uri`, `owner_handle`,
6464+ `repo_name`, `content`, `embedding vector(1536)`, `embedding_model`, `status`. The repo
6565+ OWNER did is parsed from `repo_uri` = `at://<owner_did>/sh.tangled.repo/<rkey>`.
6666+ HNSW index on `embedding` with `vector_cosine_ops` (cosine = the metric).
6767+- `tangled_open_issues` (VIEW, open issues only): `uri`, `rkey`, `repo_did`, `repo_uri`,
6868+ `author_did`, `title`, `body`, `issue_created_at`, `embedding vector(1536)`, `record_raw`.
6969+ (`tangled_issues` is the all-states table; we use the open view for recommendations.)
7070+- `tangled_repos`: `repo_did`, `owner_did`, `rkey`, `name`, `owner_handle`,
7171+ `record_raw` jsonb (has `topics`, `description`, `createdAt`, `repoDid`).
7272+- `tangled_identities`: `did` → `handle` (used for the owner-handle fallback).
7373+- `tangled_user_collaborations` (VIEW): `user_did` → `repo_did` (collab seeds; rare, ~240 rows).
7474+7575+### Embeddings (recipe — match EXACTLY if you ever embed anything new)
7676+The service does NOT embed at runtime (it reads precomputed vectors). This recipe is here
7777+for a future embedding catch-up job; the working impl is `reference/src/embed_readmes.mjs`.
7878+- Model `gemini-embedding-001` via Gemini API (`generativelanguage.googleapis.com`),
7979+ header `x-goog-api-key = GEMINI_API_KEY`. `outputDimensionality = 1536`.
8080+- `taskType = RETRIEVAL_QUERY` for query text, `RETRIEVAL_DOCUMENT` for stored docs.
8181+- **L2-normalize every vector** (sub-3072 MRL dims aren't auto-unit; the cosine index needs
8282+ unit vectors).
8383+- Vectors are passed to SQL as `%s::vector` text literals (`[v1,v2,...]`) and read back via
8484+ `embedding::text` — exactly like the reference scripts. No pgvector-python adapter needed.
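
The normalization and text-literal steps of the recipe can be sketched in plain Python. This is not the reference implementation (that is the Node script named above); the function names are hypothetical, and the zero-vector behaviour is an assumption.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def to_vector_literal(vec: list[float]) -> str:
    """Render a vector as the `[v1,v2,...]` text bound to `%s::vector`."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def parse_vector_text(text: str) -> list[float]:
    """Parse the `embedding::text` form back into floats."""
    inner = text.strip()[1:-1]
    return [float(x) for x in inner.split(",")] if inner else []
```

A value produced by `to_vector_literal(l2_normalize(raw))` would then be passed as an ordinary string parameter, which is why no pgvector adapter is needed.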
8585+8686+## 4. Algorithm (in `app/recommend.py`)
8787+8888+1. **Seeds** = the user's owned (`repo_uri like 'at://<did>/%'`) ∪ collaborated repos that
8989+ have an embedded README (`db.load_seeds`).
9090+2. **Per-seed kNN** over README embeddings, excluding the user's own/collab repo_dids
9191+ (`db.knn_repos`, `ORDER BY embedding <=> seed::vector`).
9292+3. **Merge** by candidate repo_did, keeping best (min) distance + the list of seeds that
9393+ surfaced it = consensus (`app/merge.py`).
9494+4. **Dedup** forks by md5 of `content[:500]` (`app/dedup.py`); apply a **distance floor**.
9595+5. **Rerank** (`app/rank.py`): `DefaultScorer` = similarity + consensus + recency
9696+ (+ popularity stub), behind a swappable `Scorer` Protocol; plus a **round-robin-across-
9797+ seeds** guard so one busy interest can't bury a lone one.
9898+6. **Issues**: same flow over `tangled_open_issues`, also excluding issues the user authored
9999+ and issues in the user's own repos.
100100+7. **Shape** to the contract (`app/links.py`, `app/profile.py`): interest chips from seed
101101+ `record_raw.topics`; `@handle` owners; absolute repo URLs; RFC-3339 timestamps.
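
Steps 3 and 4 can be sketched as pure functions. The names `merge_hits`, `content_hash`, `collapse_forks`, `apply_floor`, and `Candidate` come from the file map, but the signatures, the fields, and the choice of which fork survives (the closest one) are assumptions made for illustration, not the contents of `app/merge.py`, `app/dedup.py`, or `app/rank.py`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Candidate:
    repo_did: str
    distance: float  # best (minimum) cosine distance across seeds
    content: str  # README text, used only for fork detection
    seeds: list[str] = field(default_factory=list)  # seeds that surfaced it


def merge_hits(hits_by_seed: dict[str, list[tuple[str, float, str]]]) -> list[Candidate]:
    """Merge per-seed kNN hits by repo_did, keeping min distance and all seeds."""
    merged: dict[str, Candidate] = {}
    for seed, hits in hits_by_seed.items():
        for repo_did, distance, content in hits:
            cand = merged.get(repo_did)
            if cand is None:
                merged[repo_did] = Candidate(repo_did, distance, content, [seed])
                continue
            cand.distance = min(cand.distance, distance)
            if seed not in cand.seeds:
                cand.seeds.append(seed)
    return list(merged.values())


def content_hash(content: str) -> str:
    """md5 of the first 500 characters, as described in step 4."""
    return hashlib.md5(content[:500].encode("utf-8")).hexdigest()


def collapse_forks(cands: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per content hash (assumption: the closest one)."""
    best: dict[str, Candidate] = {}
    for cand in cands:
        key = content_hash(cand.content)
        if key not in best or cand.distance < best[key].distance:
            best[key] = cand
    return list(best.values())


def apply_floor(cands: list[Candidate], floor: float) -> list[Candidate]:
    """Drop candidates whose cosine distance is above the floor."""
    return [c for c in cands if c.distance <= floor]
```

The consensus signal is simply `len(candidate.seeds)`; in this sketch it is carried along for the reranker and never emitted.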
102102+103103+## 5. File map
104104+105105+```
106106+app/
107107+ main.py FastAPI app + routes (/recommendations, /health) + CORS + startup log
108108+ config.py Settings from env/.env (DB conn, web base, tunable knobs); get_settings()
109109+ db.py psycopg3 pool + ALL read-only SQL (load_seeds, knn_repos, knn_issues,
110110+ open_issue_counts, embedding_counts, ping)
111111+ recommend.py orchestration: recommend(did)
112112+ merge.py PURE: merge_hits -> consensus candidates
113113+ dedup.py PURE: content_hash, collapse_forks
114114+ rank.py PURE: Scorer protocol, DefaultScorer, apply_floor, rerank(diversify)
115115+ profile.py PURE: build_interests from topics
116116+ links.py PURE: slugify, at_owner, repo_url, issue_list_url, to_rfc3339
117117+ schemas.py pydantic response models (wire keys match schema.md EXACTLY)
118118+ types.py Candidate dataclass
119119+tests/ pytest: unit (pure modules, no DB) + test_integration.py (env-gated)
120120+eval/harness.py offline held-out-seed retrieval: recall@k / nDCG
121121+reference/src/ the validated Node .mjs oracle scripts (+ node_modules has `pg`)
122122+API.md human API docs; README.md run/deploy; Dockerfile; .env / .env.example
123123+```
124124+125125+The pure modules (merge/dedup/rank/profile/links/types) have **no DB or network** and are
126126+fully unit-tested — keep them that way so logic changes are testable in isolation.
127127+128128+## 6. HTTP API (the contract)
129129+130130+Authoritative shape: `../../schema.md` (parent repo) and `API.md` here. Summary:
131131+132132+- `GET /recommendations?handle=<did>&gh=<user>` → `{ profile, repos[], issues[] }`.
133133+ `handle` is the user's DID. `gh` is accepted but **ignored** (no GitHub data). No `k`
134134+ param — return pre-ranked; the frontend paginates 15/row. Empty user → `repos: []`.
135135+- `GET /health` → `{ status, db }`.
136136+137137+**Issues are special:** the engine canNOT supply a reliable sequential issue `number`, so it
138138+sends `repoDid` + `rkey` and the **appview resolves** the precise `/issues/N` URL from its
139139+own SQLite `issues` table (falling back to the repo's issue list). This is implemented in the
140140+parent repo: `appview/state/discover_engine.go` (`engineIssue`, `resolveIssueLink`),
141141+`appview/state/discover.go` (passes `s.db`), tested in `discover_engine_test.go`. If you
142142+change the issue wire shape here, update those three files + `schema.md` together.
143143+144144+## 7. Data realities / caveats (VERIFIED, not assumed)
145145+146146+These drive what we can honestly return — re-check with `node`/SQL if data has grown:
147147+- READMEs: ~2,400 embedded (0 unembedded). Open issues: ~2,300 embedded. **Grows daily** —
148148+ the service reads it live, so counts rise on their own.
149149+- **Repos are the real deliverable.** Owner handle resolves for ~96% via
150150+ `owner_handle` → fallback `repo_uri` owner_did → `tangled_identities`; ~3.5% unresolvable
151151+ are dropped. `repo_name` is never null.
152152+- **`stars` = 0, `comments` = 0** — no source (`tangled_backlinks` is empty). Stubbed.
153153+- **`languages` = [], repo `language` = ""** — no language field in the shared DB.
154154+- **`lastActive`** uses `record_raw.createdAt` (creation, not true last-activity — best
155155+ available). Recency ranking uses the same value.
156156+- **Issues are emittable for ~32%** of the corpus (repo identity resolves via `repo_uri`).
157157+ Per user (filtered to their interests) that's a handful. The exact issue number
158158+ (`record_raw->>'issueId'`) exists for only ~4% in the shared DB → that's why the number is
159159+ resolved appview-side, not here.
160160+- Seeds are dominated by **owned** repos; collaborations are rare.
161161+162162+## 8. Run / test / deploy
163163+164164+```bash
165165+# setup (uv is the toolchain here; python 3.12)
166166+uv venv --python 3.12 .venv
167167+uv pip install --python .venv -e ".[dev]"
168168+169169+# run
170170+.venv/bin/python -m uvicorn app.main:app --reload --port 8000
171171+curl 'localhost:8000/health'
172172+curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca' # 10-seed sample user
173173+# docs: http://localhost:8000/docs
174174+175175+# test (unit always; integration auto-runs when DB_CONNECTION_STRING is set)
176176+.venv/bin/python -m pytest tests/ -q
177177+178178+# offline eval baseline (needs DB)
179179+.venv/bin/python eval/harness.py
180180+181181+# deploy
182182+docker build -t tangled-rec . && docker run -p 8000:8000 --env-file .env tangled-rec
183183+# then point the appview: TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations
184184+```
185185+186186+Config knobs (env, all optional except `DB_CONNECTION_STRING`): see `app/config.py` /
187187+`.env.example` — `TANGLED_WEB_BASE`, `REC_PER_SEED_LIMIT`, `REC_DISTANCE_FLOOR`,
188188+`REC_ISSUE_DISTANCE_FLOOR`, `REC_MAX_REPOS`, `REC_MAX_ISSUES`.
189189+190190+## 9. Status & current baseline
191191+192192+- M0–M4 complete and verified: 23 pytest tests pass (18 pure-unit + 5 live integration incl.
193193+ atproto/nix search sanity + own-repo exclusion). Appview Go side compiles + `go test
194194+ ./appview/state/` passes.
195195+- **Eval baseline (before any tuning):** recall@10 ≈ 0.22, recall@20 ≈ 0.23, recall@50 ≈
196196+ 0.37, nDCG ≈ 0.24 over 60 users. Re-run `eval/harness.py` and compare BEFORE/AFTER any
197197+ ranking change — no "feels better" merges.
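
For reference when reading the baseline numbers, the two metrics can be written as below. This is a generic binary-relevance formulation, not a copy of `eval/harness.py`; the harness may define the held-out set or the nDCG cutoff differently.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG; assumes `ranked` contains no duplicates."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Averaging these per user over the eval set gives figures comparable in kind to the baseline, provided the same held-out seeds are used before and after a change.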
198198+199199+## 10. Environment gotchas (this machine)
200200+201201+- **No `gcloud`, no Go toolchain, no `nix`** installed by default. To verify the Go appview
202202+ change, a Go 1.25 tarball was fetched to `/tmp/go` (ephemeral). `go.mod` requires go 1.25.
203203+- The reference `.mjs` scripts need Node's `pg` — it lives in `reference/.../node_modules` /
204204+ the folder's `node_modules`. Run them with `DB_CONNECTION_STRING` in env or `.env`.
205205+- The Bash tool's working directory can reset between calls — use absolute paths or `cd`
206206+ inside the same command.
207207+- **Secret**: `DB_CONNECTION_STRING` lives in `.env` (gitignored) — the only var the service
208208+ needs. (`GEMINI_API_KEY` is only for the Node reference embedding scripts, not the service.)
209209+ Never commit secrets or paste them into docs/code.
210210+211211+## 11. Do NOT
212212+213213+- Write to any shared table except the `tangled_readmes` embedding columns (or a `rec` schema).
214214+- Re-add clustering, or emit `pulls`/`reasons`/`themes`/`score`/good-first in the API.
215215+- Hardcode `https://tangled.org` — use `settings.web_base` (`TANGLED_WEB_BASE`).
216216+- Change the issue wire shape without updating the appview Go files + `schema.md` together.
217217+- Fabricate `stars`/`comments`/`language` — they're honest stubs until a data source exists.
···11+# Tangled Recommendation Engine
22+33+A standalone Python/FastAPI service that powers Tangled's **Discover** feature: given a
44+user's DID it returns repo + issue recommendations. It reads README/issue **embeddings**
55+(precomputed with Gemini `gemini-embedding-001`, 1536-dim) from the shared
66+Postgres + pgvector database and reranks them.
77+88+The Tangled appview connects over HTTP via `TANGLED_DISCOVER_ENDPOINT` → see
99+[`schema.md`](../../schema.md) for the wire contract and [`API.md`](./API.md) for all endpoints.
1010+1111+## How it works
1212+1313+For each repo the user already works on (owned ∪ collaborations), we run an independent
1414+pgvector kNN over README embeddings. Results are merged — a candidate several of the user's
1515+repos point at ranks higher (consensus) — then deduped (forks), floored, and reranked
1616+(similarity + consensus + recency) with a round-robin guard so one busy interest can't bury
1717+the others. The user's own work is excluded. Issues use the same flow over open-issue
1818+embeddings.
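
The round-robin guard can be illustrated with a small sketch. It assumes each candidate has already been scored and attributed to one seed; the function name, the input shape, and the seed ordering rule are hypothetical, and the real scorer weights are not shown here.

```python
from collections import defaultdict


def round_robin_by_seed(scored: list[tuple[str, str, float]], limit: int) -> list[str]:
    """Interleave candidates across seeds so one seed cannot fill the list.

    `scored` holds (candidate_id, seed, score) with higher score = better.
    """
    buckets: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for cand_id, seed, score in scored:
        buckets[seed].append((score, cand_id))
    for hits in buckets.values():
        hits.sort(key=lambda pair: -pair[0])
    # Visit seeds in order of their best hit (assumption, for determinism).
    seed_order = sorted(buckets, key=lambda s: -buckets[s][0][0])
    cursors = {seed: 0 for seed in seed_order}
    out: list[str] = []
    seen: set[str] = set()
    while len(out) < limit:
        progressed = False
        for seed in seed_order:
            hits, i = buckets[seed], cursors[seed]
            while i < len(hits) and hits[i][1] in seen:
                i += 1  # skip candidates already emitted via another seed
            if i < len(hits):
                out.append(hits[i][1])
                seen.add(hits[i][1])
                i += 1
                progressed = True
            cursors[seed] = i
            if len(out) >= limit:
                break
        if not progressed:
            break  # every bucket is exhausted
    return out
```

With three strong hits for one seed and a single weaker hit for another, the lone seed's hit still lands in second place instead of being pushed below the cap.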
1919+2020+## Layout
2121+2222+```
2323+app/ FastAPI app + config + db + the pipeline stages
2424+ merge.py dedup.py rank.py profile.py links.py # pure, unit-tested
2525+ db.py recommend.py schemas.py main.py
2626+tests/ pytest unit tests (+ env-gated integration)
2727+eval/ offline hold-out eval harness (recall@k / nDCG)
2828+reference/ the validated Node .mjs scripts this engine was ported from (oracle)
2929+```
3030+3131+## Configuration (env / `.env`)
3232+3333+| Var | Required | Default | Notes |
3434+| --- | --- | --- | --- |
3535+| `DB_CONNECTION_STRING` | yes | — | Shared Postgres. `sslmode=require` is added automatically. |
3636+| `TANGLED_WEB_BASE` | no | `https://tangled.org` | Base for generated repo URLs. |
3737+| `REC_PER_SEED_LIMIT` | no | `25` | kNN neighbours per seed. |
3838+| `REC_DISTANCE_FLOOR` | no | `0.30` | Drop repo matches above this cosine distance. |
3939+| `REC_ISSUE_DISTANCE_FLOOR` | no | `0.40` | Floor for issue matches. |
4040+| `REC_MIN_README_CHARS` | no | `120` | Drop near-empty READMEs as seeds + candidates (filters test/throwaway repos). `0` disables. |
4141+| `REC_QUERY_WORKERS` | no | `8` | Concurrent per-seed kNN queries. The DB is remote, so this cuts request latency. |
4242+| `REC_MAX_REPOS` / `REC_MAX_ISSUES` | no | `40` | Caps per section. |
4343+4444+## Run locally
4545+4646+```bash
4747+uv venv --python 3.12 .venv
4848+uv pip install --python .venv -e ".[dev]"
4949+.venv/bin/python -m uvicorn app.main:app --reload --port 8000
5050+# smoke
5151+curl 'localhost:8000/health'
5252+curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca'
5353+# interactive docs: http://localhost:8000/docs
5454+```
5555+5656+## Test
5757+5858+```bash
5959+.venv/bin/python -m pytest tests/ # unit (no DB)
6060+RUN_DB_TESTS=1 .venv/bin/python -m pytest tests/ # + integration (needs DB_CONNECTION_STRING)
6161+.venv/bin/python eval/harness.py # offline recall@k / nDCG
6262+```
6363+6464+## Deploy
6565+6666+```bash
6767+docker build -t tangled-rec .
6868+docker run -p 8000:8000 --env-file .env tangled-rec
6969+```
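7070+7171+**Cloud SQL access:** the shared DB only accepts authorized IPs. Authorize the host's egress IP. Note that `--authorized-networks` replaces the instance's existing list, so include every network that must stay authorized: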
7272+7373+```bash
7474+gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me)
7575+```
7676+7777+Point the appview at the deployment: `TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations`.
"""Configuration loaded from the environment (and a local .env file).

All knobs live here so the rest of the service stays pure/testable. The secrets
are DB_CONNECTION_STRING (reused from the existing recommendation/.env) and the
optional REC_DATA_GIT_SSH_KEY deploy key; the service makes no external API
calls at runtime.
"""

from __future__ import annotations

import os
from dataclasses import dataclass
from functools import lru_cache

from dotenv import load_dotenv

# Load ./.env (the folder root) if present; real env vars take precedence.
load_dotenv()


def _ensure_sslmode(conn: str) -> str:
    """Cloud SQL public IP uses a self-signed cert. `sslmode=require` encrypts
    without verifying the chain; it is the Python equivalent of the reference
    scripts' ssl: { rejectUnauthorized: false }."""
    if not conn:
        return conn
    if "sslmode=" in conn:
        return conn
    sep = "&" if "?" in conn else "?"
    return f"{conn}{sep}sslmode=require"


@dataclass(frozen=True)
class Settings:
    # --- storage backend ---
    data_storage: str = "sql"  # "sql" | "git"
    data_dir: str = "/tmp/tangled-rec-data"
    data_git_url: str = ""
    data_git_ref: str = ""
    data_refresh_sec: int = 0  # 0 = load once at boot only
    data_git_clone_timeout: int = 120
    data_git_ssh_key_b64: str = ""  # optional deploy key for git@ SSH remotes

    # --- connection (sql mode) ---
    db_connection_string: str = ""

    # --- link building ---
    web_base: str = "https://tangled.org"

    # --- recommendation tunables ---
    per_seed_limit: int = 25  # kNN neighbours fetched per seed repo
    distance_floor: float = 0.30  # drop repo candidates above this cosine distance
    issue_distance_floor: float = 0.40  # README-seed -> issue distances run higher
    min_readme_chars: int = 120  # drop near-empty READMEs as seeds AND candidates
    #                              (test/throwaway repos embed to a generic vector
    #                              that's trivially "similar" to other empty READMEs)
    max_repos: int = 40  # cap on returned repos (frontend paginates 15/row)
    max_issues: int = 40
    max_interests: int = 8  # interest chips derived from seed topics
    query_workers: int = 8  # concurrent per-seed kNN queries (DB is remote/slow)

    # --- questionnaire read source ---
    # The knot is the source of truth: questionnaires are read per-issue from the
    # knot-hosted repo (one blob fetch), not Postgres. "db" reverts to the old path.
    questionnaire_source: str = "knot"  # "knot" | "db"
    questionnaire_knot_host: str = "knot1.tangled.sh"
    questionnaire_repo_did: str = "did:plc:vg4msk54xucet6of2rdrgahe"
    questionnaire_knot_timeout: float = 10.0
    questionnaire_db_fallback: bool = False  # in knot mode, fall back to DB on miss


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    conn = os.environ.get("DB_CONNECTION_STRING", "")
    storage = os.environ.get("DATA_STORAGE", "sql").strip().lower()
    if storage not in ("sql", "git"):
        raise ValueError(f"DATA_STORAGE must be 'sql' or 'git', got {storage!r}")
    return Settings(
        data_storage=storage,
        data_dir=os.environ.get("REC_DATA_DIR", "/tmp/tangled-rec-data"),
        data_git_url=os.environ.get("REC_DATA_GIT_URL", "").strip(),
        data_git_ref=os.environ.get("REC_DATA_GIT_REF", "").strip(),
        data_refresh_sec=int(os.environ.get("REC_DATA_REFRESH_SEC", "0")),
        data_git_clone_timeout=int(os.environ.get("REC_DATA_GIT_CLONE_TIMEOUT", "120")),
        data_git_ssh_key_b64=os.environ.get("REC_DATA_GIT_SSH_KEY", "").strip(),
        db_connection_string=_ensure_sslmode(conn),
        web_base=os.environ.get("TANGLED_WEB_BASE", "https://tangled.org").rstrip("/"),
        per_seed_limit=int(os.environ.get("REC_PER_SEED_LIMIT", "25")),
        distance_floor=float(os.environ.get("REC_DISTANCE_FLOOR", "0.30")),
        issue_distance_floor=float(os.environ.get("REC_ISSUE_DISTANCE_FLOOR", "0.40")),
        min_readme_chars=int(os.environ.get("REC_MIN_README_CHARS", "120")),
        max_repos=int(os.environ.get("REC_MAX_REPOS", "40")),
        max_issues=int(os.environ.get("REC_MAX_ISSUES", "40")),
        max_interests=int(os.environ.get("REC_MAX_INTERESTS", "8")),
        query_workers=int(os.environ.get("REC_QUERY_WORKERS", "8")),
        questionnaire_source=os.environ.get("QUESTIONNAIRE_SOURCE", "knot").strip().lower(),
        questionnaire_knot_host=os.environ.get("QUESTIONNAIRE_KNOT_HOST", "knot1.tangled.sh").strip(),
        questionnaire_repo_did=os.environ.get(
            "QUESTIONNAIRE_REPO_DID", "did:plc:vg4msk54xucet6of2rdrgahe"
        ).strip(),
        questionnaire_knot_timeout=float(os.environ.get("QUESTIONNAIRE_KNOT_TIMEOUT", "10")),
        questionnaire_db_fallback=os.environ.get("QUESTIONNAIRE_DB_FALLBACK", "").strip().lower()
        in ("1", "true", "yes"),
    )
+290
recommendation/app/db.py
"""Read-only data access over the shared Postgres + pgvector database.

Boundaries (per the project brief): every table here is READ-ONLY. The only
writes this service is ever authorized to make are the embedding columns of
tangled_readmes and its own `rec` schema; neither happens in this module.

Vectors are passed as `%s::vector` text literals and read back via
`embedding::text`, exactly like the validated reference scripts.
"""

from __future__ import annotations

from functools import lru_cache

from psycopg.rows import dict_row
from psycopg_pool import ConnectionPool

from app.config import get_settings


def _git_store():
    if get_settings().data_storage == "git":
        from app.git_store import get_git_store

        return get_git_store()
    return None


@lru_cache(maxsize=1)
def get_pool() -> ConnectionPool:
    s = get_settings()
    if s.data_storage == "git":
        raise RuntimeError("SQL pool unavailable in DATA_STORAGE=git mode")
    if not s.db_connection_string:
        raise RuntimeError("DB_CONNECTION_STRING is not set")
    pool = ConnectionPool(
        conninfo=s.db_connection_string,
        min_size=1,
        # Enough connections for the concurrent per-seed kNN fan-out, plus headroom
        # for health/startup probes.
        max_size=max(5, s.query_workers + 2),
        kwargs={"row_factory": dict_row},
        open=True,
    )
    return pool


def ping() -> bool:
    if get_settings().data_storage == "git":
        from app.git_store import is_ready

        return is_ready()
    with get_pool().connection() as conn:
        row = conn.execute("select 1 as ok").fetchone()
    return bool(row and row["ok"] == 1)


def embedding_counts() -> dict:
    """Coverage snapshot, used by /health and logged at startup."""
    store = _git_store()
    if store:
        return store.embedding_counts()
    sql = """
        select
          (select count(*) from tangled_readmes where embedding is not null) as readmes_embedded,
          (select count(*) from tangled_open_issues where embedding is not null) as open_issues_embedded,
          (select count(distinct split_part(replace(repo_uri,'at://',''),'/',1))
             from tangled_readmes where embedding is not null and repo_uri is not null) as addressable_users
    """
    with get_pool().connection() as conn:
        return dict(conn.execute(sql).fetchone())


# --- recommendation data access -------------------------------------------------

# A user's seeds: repos they own (repo_uri encodes the owner DID) UNION repos
# they collaborate on. Both must have an embedded README to be useful as a seed.
# A near-empty README (`< min_chars`) is filtered out: it embeds to a generic
# vector that pulls in unrelated near-empty repos, so it's a poor seed.
_SEEDS_SQL = """
    select r.repo_did,
           r.repo_name,
           r.content,
           r.embedding::text as etext,
           tr.record_raw->'topics' as topics,
           coalesce(r.owner_handle, ti.handle) as owner_handle
    from tangled_readmes r
    left join tangled_repos tr
      on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
    left join tangled_identities ti
      on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1)
    where r.embedding is not null
      and length(trim(coalesce(r.content, ''))) >= %(min_chars)s
      and (   r.repo_uri like 'at://' || %(did)s || '/%%'
           or r.repo_did in (
                select repo_did from tangled_user_collaborations where user_did = %(did)s
              ) )
"""

# Per-seed kNN over README embeddings. Owner handle resolves via the readmes
# column first, then a fallback to tangled_identities keyed on the owner DID
# parsed out of repo_uri. Excludes the user's own/collab repos and near-empty
# READMEs (`< min_chars`); those are throwaway/test repos we shouldn't surface.
_KNN_REPOS_SQL = """
    select r.repo_did,
           r.repo_name,
           r.content,
           r.repo_uri,
           coalesce(r.owner_handle, ti.handle) as owner_handle,
           tr.record_raw->>'description' as description,
           tr.record_raw->'topics' as topics,
           tr.record_raw->>'createdAt' as created_at,
           round((r.embedding <=> %(vec)s::vector)::numeric, 4) as distance
    from tangled_readmes r
    left join tangled_repos tr
      on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
    left join tangled_identities ti
      on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1)
    where r.embedding is not null
      and length(trim(coalesce(r.content, ''))) >= %(min_chars)s
      and not (r.repo_did = any(%(exclude)s))
    order by r.embedding <=> %(vec)s::vector
    limit %(limit)s
"""

_OPEN_ISSUE_COUNTS_SQL = """
    select repo_did, count(*)::int as n
    from tangled_open_issues
    where repo_did = any(%(dids)s)
    group by repo_did
"""

# Per-seed kNN over OPEN issue embeddings (same vector space as READMEs).
# Repo identity is resolved through the issue's repo_uri: owner handle via
# tangled_identities, repo name via (owner_did, rkey) -> tangled_repos. Excludes
# issues the user authored and issues in the user's own/collab repos. We only
# keep issues whose repo identity fully resolves (handle + name) so the appview
# can build a valid link.
_KNN_ISSUES_SQL = """
    select i.uri,
           i.rkey,
           i.repo_did,
           i.title,
           i.body as content,
           i.author_did,
           i.issue_created_at as created_at,
           ti.handle as owner_handle,
           tr.name as repo_name,
           tr.record_raw->>'description' as repo_description,
           rm.content as repo_readme,
           round((i.embedding <=> %(vec)s::vector)::numeric, 4) as distance
    from tangled_open_issues i
    join tangled_identities ti
      on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
    join tangled_repos tr
      on tr.owner_did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
     and tr.rkey = split_part(i.repo_uri, '/', 5)
    left join tangled_readmes rm
      on rm.repo_did = i.repo_did
     and rm.status = 'found'
    where i.embedding is not null
      and i.repo_uri is not null
      and ti.handle is not null
      and tr.name is not null
      and i.author_did <> %(author)s
      and not (i.repo_did = any(%(exclude)s))
    order by i.embedding <=> %(vec)s::vector
    limit %(limit)s
"""


def knn_issues(vec_text: str, exclude_dids: list[str], author_did: str, limit: int) -> list[dict]:
    store = _git_store()
    if store:
        return store.knn_issues(vec_text, exclude_dids, author_did, limit)
    params = {"vec": vec_text, "exclude": exclude_dids, "author": author_did, "limit": limit}
    with get_pool().connection() as conn:
        return [dict(r) for r in conn.execute(_KNN_ISSUES_SQL, params).fetchall()]


def load_seeds(did: str, min_chars: int = 0) -> list[dict]:
    store = _git_store()
    if store:
        return store.load_seeds(did, min_chars)
    params = {"did": did, "min_chars": min_chars}
    with get_pool().connection() as conn:
        return [dict(r) for r in conn.execute(_SEEDS_SQL, params).fetchall()]


def knn_repos(vec_text: str, exclude_dids: list[str], limit: int, min_chars: int = 0) -> list[dict]:
    store = _git_store()
    if store:
        return store.knn_repos(vec_text, exclude_dids, limit, min_chars)
    params = {"vec": vec_text, "exclude": exclude_dids, "limit": limit, "min_chars": min_chars}
    with get_pool().connection() as conn:
        return [dict(r) for r in conn.execute(_KNN_REPOS_SQL, params).fetchall()]


def open_issue_counts(repo_dids: list[str]) -> dict[str, int]:
    store = _git_store()
    if store:
        return store.open_issue_counts(repo_dids)
    if not repo_dids:
        return {}
    with get_pool().connection() as conn:
        rows = conn.execute(_OPEN_ISSUE_COUNTS_SQL, {"dids": repo_dids}).fetchall()
    return {r["repo_did"]: r["n"] for r in rows}


# --- questionnaires (read-only cache written by the AI-solve job) --------------

_RESOLVE_ISSUE_URI_SQL = """
    select uri
    from tangled_issues
    where rkey = %s
    order by fetched_at desc
"""

_GET_QUESTIONNAIRE_SQL = """
    select issue_uri, payload, created_at, updated_at
    from tangled_issue_questionnaires
    where issue_uri = %s
"""


def resolve_issue_uri(issue_id: str) -> str:
    """Resolve a full ``at://`` URI or a per-repo issue rkey."""
    store = _git_store()
    if store:
        return store.resolve_issue_uri(issue_id)
    raw = issue_id.strip()
    if raw.startswith("at://"):
        return raw

    with get_pool().connection() as conn:
        rows = conn.execute(_RESOLVE_ISSUE_URI_SQL, (raw,)).fetchall()

    if not rows:
        raise ValueError(
            f"No issue with rkey {raw!r} in tangled_issues — pass full at:// URI"
        )
    if len(rows) > 1:
        uris = [r["uri"] for r in rows[:5]]
        raise ValueError(
            f"Ambiguous rkey {raw!r} ({len(rows)} issues). "
            f"Pass full at:// URI. Examples: {uris}"
        )
    return rows[0]["uri"]


_QUESTIONNAIRES_PRESENT_SQL = """
    select issue_uri
    from tangled_issue_questionnaires
    where issue_uri = any(%(uris)s)
"""


def questionnaires_present(issue_uris: list[str]) -> set[str]:
    """Of the given issue URIs, which have a questionnaire. Used to set the
    `hasQuestionnaire` hint on recommended issues.

    One batched query: a fast existence check off the dual-written DB cache (the
    questionnaire *content* is still read from the knot, the source of truth). In
    git mode the DB isn't available, so the flag is reported False (no index yet)."""
    if not issue_uris or _git_store():
        return set()
    with get_pool().connection() as conn:
        rows = conn.execute(_QUESTIONNAIRES_PRESENT_SQL, {"uris": list(issue_uris)}).fetchall()
    return {r["issue_uri"] for r in rows}


def get_questionnaire(issue_uri: str) -> dict | None:
    """Load cached questionnaire row, or None if not generated yet."""
    if _git_store():
        return None
    import json

    with get_pool().connection() as conn:
        row = conn.execute(_GET_QUESTIONNAIRE_SQL, (issue_uri,)).fetchone()
    if not row:
        return None
    payload = row["payload"]
    if isinstance(payload, str):
        payload = json.loads(payload)
    return {
        "issue_uri": row["issue_uri"],
        "payload": payload,
        "created_at": row["created_at"],
        "updated_at": row["updated_at"],
    }
+30
recommendation/app/dedup.py
"""Content-based dedup: collapse fork READMEs whose first 500 characters match."""

from __future__ import annotations

from hashlib import md5

from app.types import Candidate


def content_hash(content: str | None = None, *, content_sha500: str | None = None) -> str:
    # Prefer a precomputed hash when the row ships one; otherwise hash the
    # first 500 characters of the README text.
    if content_sha500:
        return content_sha500
    return md5((content or "")[:500].encode("utf-8")).hexdigest()


def row_content_hash(row: dict) -> str:
    sha = row.get("content_sha500")
    if isinstance(sha, str) and sha:
        return sha
    return content_hash(row.get("content"))


def collapse_forks(candidates: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per content_hash: the one with the smallest distance."""
    best: dict[str, Candidate] = {}
    for c in candidates:
        prev = best.get(c.content_hash)
        if prev is None or c.distance < prev.distance:
            best[c.content_hash] = c
    return list(best.values())
+383
recommendation/app/git_store.py
"""In-memory recommendation index loaded from git-shipped numpy + jsonl bundles.

Contract (frozen):
    data/repos.f32.npy, data/repos.jsonl
    data/issues.f32.npy, data/issues.jsonl
    manifest.json
"""

from __future__ import annotations

import json
import logging
import subprocess
import threading
from dataclasses import dataclass
from pathlib import Path
from typing import Any

import numpy as np

from app.config import Settings
from app.vectors import parse_vector_text, vector_to_text

log = logging.getLogger("rec.git")

_store: GitDataStore | None = None
_load_error: str | None = None
_loading = False
_reload_lock = threading.Lock()


@dataclass
class GitDataStore:
    manifest: dict[str, Any]
    repo_vectors: np.ndarray
    repo_meta: list[dict[str, Any]]
    issue_vectors: np.ndarray
    issue_meta: list[dict[str, Any]]
    repo_row_by_did: dict[str, int]
    issue_uri_by_rkey: dict[str, list[str]]
    issue_count_by_repo_did: dict[str, int]
    owner_did_by_repo_did: dict[str, str]

    @classmethod
    def load_from_dir(cls, data_root: Path) -> GitDataStore:
        data_dir = data_root / "data"
        manifest_path = data_root / "manifest.json"
        if not manifest_path.exists():
            raise FileNotFoundError(f"manifest.json not found under {data_root}")

        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        repo_vectors = np.load(data_dir / "repos.f32.npy").astype(np.float32, copy=False)
        issue_vectors = np.load(data_dir / "issues.f32.npy").astype(np.float32, copy=False)
        repo_meta = _read_jsonl(data_dir / "repos.jsonl")
        issue_meta = _read_jsonl(data_dir / "issues.jsonl")

        if len(repo_meta) != repo_vectors.shape[0]:
            raise ValueError(
                f"repos.jsonl rows ({len(repo_meta)}) != repos matrix "
                f"({repo_vectors.shape[0]})"
            )
        if len(issue_meta) != issue_vectors.shape[0]:
            raise ValueError(
                f"issues.jsonl rows ({len(issue_meta)}) != issues matrix "
                f"({issue_vectors.shape[0]})"
            )

        repo_row_by_did = {}
        owner_did_by_repo_did = {}
        for i, row in enumerate(repo_meta):
            did = row.get("repo_did")
            if isinstance(did, str) and did:
                repo_row_by_did[did] = i
            uri = row.get("subject_uri") or row.get("repo_uri") or ""
            if isinstance(uri, str) and uri.startswith("at://"):
                owner = uri.removeprefix("at://").split("/", 1)[0]
                if did:
                    owner_did_by_repo_did[did] = owner

        issue_uri_by_rkey: dict[str, list[str]] = {}
        issue_count_by_repo_did: dict[str, int] = {}
        for row in issue_meta:
            uri = row.get("subject_uri") or row.get("uri") or ""
            rkey = row.get("rkey")
            if isinstance(rkey, str) and isinstance(uri, str):
                issue_uri_by_rkey.setdefault(rkey, []).append(uri)
            repo_did = row.get("repo_did")
            if isinstance(repo_did, str) and repo_did:
                issue_count_by_repo_did[repo_did] = issue_count_by_repo_did.get(repo_did, 0) + 1

        log.info(
            "git store loaded: repos=%s issues=%s dim=%s metric=%s",
            len(repo_meta),
            len(issue_meta),
            manifest.get("dim"),
            manifest.get("metric"),
        )
        return cls(
            manifest=manifest,
            repo_vectors=repo_vectors,
            repo_meta=repo_meta,
            issue_vectors=issue_vectors,
            issue_meta=issue_meta,
            repo_row_by_did=repo_row_by_did,
            issue_uri_by_rkey=issue_uri_by_rkey,
            issue_count_by_repo_did=issue_count_by_repo_did,
            owner_did_by_repo_did=owner_did_by_repo_did,
        )
109109+110110+ def embedding_counts(self) -> dict[str, int]:
111111+ return {
112112+ "readmes_embedded": len(self.repo_meta),
113113+ "open_issues_embedded": len(self.issue_meta),
114114+ "addressable_users": len(
115115+ {d for d in self.owner_did_by_repo_did.values() if d}
116116+ ),
117117+ }
118118+119119+ def load_seeds(self, did: str, min_chars: int = 0) -> list[dict]:
120120+ seeds: list[dict] = []
121121+ for i, row in enumerate(self.repo_meta):
122122+ uri = row.get("subject_uri") or ""
123123+ owner = self.owner_did_by_repo_did.get(row.get("repo_did", ""), "")
124124+ if owner != did and not (
125125+ isinstance(uri, str) and uri.startswith(f"at://{did}/")
126126+ ):
127127+ continue
128128+ content_len = int(row.get("content_len") or 0)
129129+ if content_len < min_chars:
130130+ continue
131131+ vec = self.repo_vectors[i]
132132+ seeds.append(
133133+ {
134134+ "repo_did": row["repo_did"],
135135+ "repo_name": row.get("repo_name") or "",
136136+ "content": "",
137137+ "content_sha500": row.get("content_sha500") or "",
138138+ "etext": vector_to_text(vec),
139139+ "topics": row.get("topics"),
140140+ "owner_handle": row.get("owner_handle") or "",
141141+ }
142142+ )
143143+ return seeds
144144+145145+ def knn_repos(
146146+ self,
147147+ vec_text: str,
148148+ exclude_dids: list[str],
149149+ limit: int,
150150+ min_chars: int = 0,
151151+ ) -> list[dict]:
152152+ q = parse_vector_text(vec_text)
153153+ exclude = set(exclude_dids)
154154+ scores = self.repo_vectors @ q
155155+ distances = 1.0 - scores
156156+ candidates: list[tuple[int, float]] = []
157157+ for i, row in enumerate(self.repo_meta):
158158+ repo_did = row.get("repo_did")
159159+ if not repo_did or repo_did in exclude:
160160+ continue
161161+ if int(row.get("content_len") or 0) < min_chars:
162162+ continue
163163+ candidates.append((i, float(distances[i])))
164164+ candidates.sort(key=lambda t: t[1])
165165+ return [
166166+ self._repo_hit(self.repo_meta[i], dist)
167167+ for i, dist in candidates[:limit]
168168+ ]
169169+170170+ def knn_issues(
171171+ self,
172172+ vec_text: str,
173173+ exclude_dids: list[str],
174174+ author_did: str,
175175+ limit: int,
176176+ ) -> list[dict]:
177177+ q = parse_vector_text(vec_text)
178178+ exclude = set(exclude_dids)
179179+ scores = self.issue_vectors @ q
180180+ distances = 1.0 - scores
181181+ candidates: list[tuple[int, float]] = []
182182+ for i, row in enumerate(self.issue_meta):
183183+ repo_did = row.get("repo_did")
184184+ if not repo_did or repo_did in exclude:
185185+ continue
186186+ if row.get("author_did") == author_did:
187187+ continue
188188+ if not row.get("owner_handle") or not row.get("repo_name"):
189189+ continue
190190+ candidates.append((i, float(distances[i])))
191191+ candidates.sort(key=lambda t: t[1])
192192+ return [
193193+ self._issue_hit(self.issue_meta[i], dist)
194194+ for i, dist in candidates[:limit]
195195+ ]
196196+197197+ def open_issue_counts(self, repo_dids: list[str]) -> dict[str, int]:
198198+ return {d: self.issue_count_by_repo_did.get(d, 0) for d in repo_dids}
199199+200200+ def resolve_issue_uri(self, issue_id: str) -> str:
201201+ raw = issue_id.strip()
202202+ if raw.startswith("at://"):
203203+ return raw
204204+ matches = self.issue_uri_by_rkey.get(raw, [])
205205+ if not matches:
206206+ raise ValueError(
207207+ f"No issue with rkey {raw!r} in git bundle — pass full at:// URI"
208208+ )
209209+ if len(matches) > 1:
210210+ raise ValueError(
211211+ f"Ambiguous rkey {raw!r} ({len(matches)} issues). "
212212+ f"Pass full at:// URI. Examples: {matches[:5]}"
213213+ )
214214+ return matches[0]
215215+216216+ @staticmethod
217217+ def _repo_hit(row: dict[str, Any], distance: float) -> dict[str, Any]:
218218+ return {
219219+ "repo_did": row.get("repo_did"),
220220+ "repo_name": row.get("repo_name") or "",
221221+ "content": "",
222222+ "content_sha500": row.get("content_sha500") or "",
223223+ "repo_uri": row.get("subject_uri") or row.get("repo_uri") or "",
224224+ "owner_handle": row.get("owner_handle") or "",
225225+ "description": (row.get("description") or "").strip(),
226226+ "topics": row.get("topics"),
227227+ "created_at": row.get("created_at") or "",
228228+ "distance": round(distance, 4),
229229+ }
230230+231231+ @staticmethod
232232+ def _issue_hit(row: dict[str, Any], distance: float) -> dict[str, Any]:
233233+ return {
234234+ "uri": row.get("subject_uri") or row.get("uri") or "",
235235+ "rkey": row.get("rkey") or "",
236236+ "repo_did": row.get("repo_did") or "",
237237+ "title": (row.get("title") or "").strip(),
238238+ "content": row.get("body") or "",
239239+ "author_did": row.get("author_did") or "",
240240+ "created_at": row.get("created_at") or "",
241241+ "owner_handle": row.get("owner_handle") or "",
242242+ "repo_name": row.get("repo_name") or "",
243243+ "repo_readme": "",
244244+ "distance": round(distance, 4),
245245+ }


def _read_jsonl(path: Path) -> list[dict[str, Any]]:
    rows: list[dict[str, Any]] = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def is_ready() -> bool:
    return _store is not None


def load_error() -> str | None:
    return _load_error


def is_loading() -> bool:
    return _loading


def _prepare_ssh(settings: Settings) -> None:
    import base64
    import os

    raw = settings.data_git_ssh_key_b64
    if not raw:
        return
    key_path = Path("/tmp/rec_git_ssh_key")
    key_path.write_bytes(base64.b64decode(raw))
    key_path.chmod(0o600)
    os.environ["GIT_SSH_COMMAND"] = (
        f"ssh -i {key_path} -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null"
    )


def _run_git(args: list[str], *, timeout: int) -> None:
    try:
        proc = subprocess.run(
            args,
            check=True,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError as exc:
        raise RuntimeError(
            "git binary not found in container — rebuild image with git installed"
        ) from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"git timed out after {timeout}s: {' '.join(args)}"
        ) from exc
    except subprocess.CalledProcessError as exc:
        stderr = (exc.stderr or "").strip()
        stdout = (exc.stdout or "").strip()
        detail = stderr or stdout or str(exc)
        raise RuntimeError(f"git failed ({exc.returncode}): {detail}") from exc
    if proc.stderr:
        log.debug("git stderr: %s", proc.stderr.strip())


def _clone_or_pull(url: str, dest: Path, ref: str | None, *, timeout: int) -> None:
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists() and (dest / ".git").exists():
        log.info("git pull %s", dest)
        _run_git(
            ["git", "-C", str(dest), "fetch", "--depth", "1", "origin", ref or "HEAD"],
            timeout=timeout,
        )
        _run_git(
            ["git", "-C", str(dest), "checkout", "FETCH_HEAD"],
            timeout=timeout,
        )
        return
    args = ["git", "clone", "--depth", "1"]
    if ref:
        args.extend(["--branch", ref])
    args.extend([url, str(dest)])
    log.info("git clone %s -> %s", url, dest)
    _run_git(args, timeout=timeout)


def load_git_store(settings: Settings) -> GitDataStore:
    """Clone/fetch (if configured) and load the numpy+jsonl bundle into memory."""
    global _store, _load_error, _loading
    with _reload_lock:
        _loading = True
        _load_error = None
        try:
            _prepare_ssh(settings)
            root = Path(settings.data_dir)
            if settings.data_git_url:
                _clone_or_pull(
                    settings.data_git_url,
                    root,
                    settings.data_git_ref or None,
                    timeout=settings.data_git_clone_timeout,
                )
            elif not (root / "manifest.json").exists() and (
                root / "data" / "repos.f32.npy"
            ).exists():
                root = root.parent
            _store = GitDataStore.load_from_dir(root)
            return _store
        except Exception as exc:
            _load_error = str(exc)
            raise
        finally:
            _loading = False


def start_git_load_background(settings: Settings) -> threading.Thread:
    """Load git bundle in a daemon thread so uvicorn can bind PORT immediately."""

    def _worker() -> None:
        try:
            store = load_git_store(settings)
            log.info("git store loaded in background; %s", store.embedding_counts())
        except Exception:
            log.exception("git store background load failed: %s", _load_error)

    thread = threading.Thread(target=_worker, name="git-load", daemon=True)
    thread.start()
    return thread


def get_git_store() -> GitDataStore:
    if _store is None:
        raise RuntimeError("Git data store is not loaded — call load_git_store() at startup")
    return _store


def reload_git_store(settings: Settings) -> GitDataStore:
    return load_git_store(settings)
"""Pure formatting helpers: slugs, @handles, absolute URLs, RFC-3339 times.

Per the contract: `owner` carries a leading `@`, repo URLs are absolute, and
timestamps are machine-readable (the frontend humanizes them).
"""

from __future__ import annotations

import re
from datetime import datetime

_SLUG_RE = re.compile(r"[^a-z0-9]+")


def slugify(text: str) -> str:
    return _SLUG_RE.sub("-", (text or "").strip().lower()).strip("-")


def at_owner(handle: str) -> str:
    handle = (handle or "").lstrip("@")
    return f"@{handle}"


def repo_url(web_base: str, handle: str, name: str) -> str:
    return f"{web_base.rstrip('/')}/{at_owner(handle)}/{name}"


def issue_list_url(web_base: str, handle: str, name: str) -> str:
    return f"{repo_url(web_base, handle, name)}/issues"


def to_rfc3339(value) -> str:
    """datetime -> ISO-8601 string; pass through strings (already ISO from the
    DB's record_raw); empty for None."""
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)
+112
recommendation/app/main.py
"""FastAPI application: /health, /recommendations, /questionnaire."""

from __future__ import annotations

import logging
from typing import Any

from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware

from app import db, questionnaires, recommend as rec
from app.config import get_settings
from app.lifespan import lifespan
from app.questionnaires import IssueUriError, QuestionnaireNotFoundError
from app.schemas import Recommendations

log = logging.getLogger("rec")
logging.basicConfig(level=logging.INFO)

app = FastAPI(
    title="Tangled Recommendation Engine",
    version="0.1.0",
    description="Repo/issue discovery for Tangled — README-embedding kNN + rerank.",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["GET"],
    allow_headers=["*"],
)


def _git_status() -> dict[str, Any] | None:
    settings = get_settings()
    if settings.data_storage != "git":
        return None
    from app.git_store import is_loading, is_ready, load_error

    if is_ready():
        return {"ready": True}
    err = load_error()
    if err:
        return {"ready": False, "error": err}
    if is_loading():
        return {"ready": False, "status": "loading"}
    # Load not started yet: reported the same as an in-progress load.
    return {"ready": False, "status": "loading"}


@app.get("/health")
def health() -> dict:
    settings = get_settings()
    git = _git_status()
    if git and not git.get("ready"):
        return {
            "status": "degraded" if git.get("error") else "loading",
            "storage": settings.data_storage,
            "db": False,
            **git,
        }
    try:
        ok = db.ping()
        counts = db.embedding_counts()
    except Exception as exc:  # noqa: BLE001
        return {
            "status": "degraded",
            "storage": settings.data_storage,
            "db": False,
            "error": str(exc),
            **(git or {}),
        }
    return {
        "status": "ok",
        "storage": settings.data_storage,
        "db": ok,
        "counts": counts,
        **(git or {}),
    }


@app.get("/recommendations", response_model=Recommendations, response_model_exclude_none=True)
def recommendations(
    handle: str = Query(..., description="The user's Tangled DID (e.g. did:plc:...)"),
    gh: str | None = Query(None, description="Connected GitHub username (ignored: no GitHub data)"),
) -> Recommendations:
    git = _git_status()
    if git and not git.get("ready"):
8989+ raise HTTPException(
9090+ status_code=503,
9191+ detail=git.get("error") or "git data store is still loading",
9292+ )
9393+ return rec.recommend(handle)
9494+9595+9696+@app.get("/questionnaire")
9797+def questionnaire(
9898+ issue: str | None = Query(None, description="Issue at:// URI or rkey"),
9999+ issue_uri: str | None = Query(None, alias="issue-uri", description="Alias for issue"),
100100+) -> dict[str, Any]:
101101+ raw = (issue or issue_uri or "").strip()
102102+ if not raw:
103103+ raise HTTPException(status_code=400, detail="issue query param is required")
104104+ try:
105105+ return questionnaires.load_questionnaire_payload(raw)
106106+ except IssueUriError as exc:
107107+ raise HTTPException(status_code=400, detail=str(exc)) from exc
108108+ except QuestionnaireNotFoundError:
109109+ raise HTTPException(
110110+ status_code=404,
111111+ detail="Questionnaire not found for this issue",
112112+ ) from None
+48
recommendation/app/merge.py
···11+"""Merge per-seed kNN hits into one candidate set (consensus aggregation).
22+33+Each of the user's seed repos is searched independently. Here we union the
44+results, keyed by candidate repo_did: a candidate surfaced by several seeds
55+keeps its best (minimum) distance and records every seed that found it — the
66+length of that seed list is the consensus signal used later for ranking.
77+"""
88+99+from __future__ import annotations
1010+1111+from app.dedup import row_content_hash
1212+from app.types import Candidate
1313+1414+1515+def merge_hits(
1616+ per_seed_hits: list[tuple[str, list[dict]]],
1717+ seed_content_hashes: set[str],
1818+ key_field: str = "repo_did",
1919+) -> list[Candidate]:
2020+ """per_seed_hits: list of (seed_label, [row, ...]). Each row needs the
2121+ `key_field`, `content`, and `distance`. Rows whose content matches one of
2222+ the user's own seeds (`seed_content_hashes`) are dropped (own forks)."""
2323+ merged: dict[str, Candidate] = {}
2424+ for seed_label, rows in per_seed_hits:
2525+ for row in rows:
2626+ h = row_content_hash(row)
2727+ if h in seed_content_hashes:
2828+ continue # a fork of the user's own repo
2929+ key = row[key_field]
3030+ dist = float(row["distance"])
3131+ cand = merged.get(key)
3232+ if cand is None:
3333+ merged[key] = Candidate(
3434+ key=key,
3535+ content_hash=h,
3636+ distance=dist,
3737+ seeds=[seed_label],
3838+ primary_seed=seed_label,
3939+ payload=row,
4040+ )
4141+ else:
4242+ if seed_label not in cand.seeds:
4343+ cand.seeds.append(seed_label)
4444+ if dist < cand.distance:
4545+ cand.distance = dist
4646+ cand.primary_seed = seed_label
4747+ cand.payload = row
4848+ return list(merged.values())
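The consensus aggregation described in the module docstring can be sketched in a few lines. This is a simplified illustration, not the module itself: it skips the content-hash filter, uses a hypothetical `Hit` dataclass in place of `Candidate`, and runs on made-up rows.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    # Hypothetical stand-in for app.types.Candidate (fewer fields).
    key: str
    distance: float
    seeds: list = field(default_factory=list)
    primary_seed: str = ""


def merge(per_seed_hits):
    """Union per-seed rows by key, keeping the minimum distance per key."""
    merged = {}
    for seed_label, rows in per_seed_hits:
        for row in rows:
            key, dist = row["repo_did"], float(row["distance"])
            hit = merged.get(key)
            if hit is None:
                merged[key] = Hit(key, dist, [seed_label], seed_label)
                continue
            if seed_label not in hit.seeds:
                hit.seeds.append(seed_label)  # one more seed agrees
            if dist < hit.distance:
                hit.distance = dist           # closer match wins
                hit.primary_seed = seed_label
    return list(merged.values())


# Made-up sample: two seeds both surface "did:x".
hits = merge([
    ("seed-a", [{"repo_did": "did:x", "distance": 0.30},
                {"repo_did": "did:y", "distance": 0.20}]),
    ("seed-b", [{"repo_did": "did:x", "distance": 0.10}]),
])
by_key = {h.key: h for h in hits}
```

Here `did:x` ends up with distance 0.10, two seeds (consensus 2), and `seed-b` as its primary seed, while `did:y` stays a single-seed hit.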
+35
recommendation/app/profile.py
···11+"""Derive the user's interest chips from their seed repos' topics.
22+33+The contract's `profile.interests` are shown in the onboarding reveal. We build
44+them from the most frequent `record_raw.topics` across the user's seed repos —
55+grounded in real data rather than invented cluster labels.
66+"""
77+88+from __future__ import annotations
99+1010+from collections import Counter
1111+1212+from app.links import slugify
1313+1414+1515+def build_interests(seed_rows: list[dict], max_interests: int) -> list[dict]:
1616+ """seed_rows: dicts with a `topics` field (list[str] | None). Returns
1717+ [{label, slug}] ordered by frequency, de-duplicated by slug."""
1818+ counter: Counter[str] = Counter()
1919+ label_for_slug: dict[str, str] = {}
2020+ for row in seed_rows:
2121+ topics = row.get("topics") or []
2222+ for topic in topics:
2323+ if not topic or not str(topic).strip():
2424+ continue
2525+ label = str(topic).strip()
2626+ slug = slugify(label)
2727+ if not slug:
2828+ continue
2929+ counter[slug] += 1
3030+ label_for_slug.setdefault(slug, label)
3131+3232+ interests = []
3333+ for slug, _count in counter.most_common(max_interests):
3434+ interests.append({"label": label_for_slug[slug], "slug": slug})
3535+ return interests
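To make the frequency ordering and slug de-duplication concrete, here is a standalone sketch of the same counting approach on made-up seed rows. The function name `interests_from` and the sample topics are illustrative assumptions; the slug rule is copied from `app.links.slugify`.

```python
import re
from collections import Counter

_SLUG_RE = re.compile(r"[^a-z0-9]+")


def slugify(text: str) -> str:
    return _SLUG_RE.sub("-", (text or "").strip().lower()).strip("-")


def interests_from(seed_rows, max_interests):
    counter = Counter()
    label_for_slug = {}
    for row in seed_rows:
        for topic in row.get("topics") or []:
            label = str(topic or "").strip()
            slug = slugify(label)
            if not slug:
                continue
            counter[slug] += 1
            # The first spelling seen becomes the display label.
            label_for_slug.setdefault(slug, label)
    return [
        {"label": label_for_slug[slug], "slug": slug}
        for slug, _count in counter.most_common(max_interests)
    ]


# Made-up rows: "Rust"/"rust" and "CLI"/"cli" collapse to one slug each.
rows = [
    {"topics": ["Rust", "CLI"]},
    {"topics": ["rust"]},
    {"topics": ["rust", "cli", "Web Dev"]},
    {"topics": None},
]
top = interests_from(rows, 2)
```

With these rows `rust` is counted three times and `cli` twice, so `Web Dev` falls outside the top two.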
+89
recommendation/app/quality.py
···11+"""Quality heuristics for issue recommendations (pure: no DB, no network).
22+33+Issues are ranked purely by body-embedding similarity, with no notion of whether
44+an issue is a real contribution opportunity or a throwaway. A test/sandbox repo's
55+issue, or a placeholder issue ("hello world", "test issue to explore tangled",
66+"[READ-ONLY]"), can embed close to a user's interests and rank at the top.
77+88+Our repo standard (REC_MIN_README_CHARS) can't be applied to issues — the issue
99+corpus and the README corpus barely overlap, so almost no issue's parent repo has
1010+a README in the DB and a length gate would drop everything. Instead we judge the
1111+parent repo by name/description and the issue by title/body, matching the kinds of
1212+throwaway content observed in production.
1313+1414+Keep these conservative: a false positive silently hides a real contribution.
1515+"""
1616+1717+from __future__ import annotations
1818+1919+import re
2020+2121+# Repo name tokens that mark a scratchpad/sandbox. Matched on word tokens (split
2222+# on non-alphanumerics), so "latest"/"fastest"/"contest" are NOT caught.
2323+_TEST_TOKENS = frozenset({
2424+ "test", "tests", "testing", "tester",
2525+ "sandbox", "playground", "scratch", "scratchpad",
2626+ "demo", "demos", "example", "examples", "sample", "samples",
2727+ "tmp", "temp", "placeholder", "throwaway",
2828+ "foo", "bar", "baz", "qux", "foobar",
2929+ "helloworld",
3030+})
3131+_TOKEN_SPLIT = re.compile(r"[^a-z0-9]+")
3232+_TESTNUM_RE = re.compile(r"^test\d+$") # test100, test2, ...
3333+3434+# Placeholder / "just exploring" phrases in an issue title or body (or a repo
3535+# description). Phrase-anchored so normal text mentioning "tests" is not caught.
3636+_PLACEHOLDER_RE = re.compile(
3737+ r"""
3838+ \btest\s+issue\b
3939+ | \btest\s+repo\b
4040+ | \bthis\s+is\s+(?:just\s+)?a\s+test\b
4141+ | \bjust\s+a\s+test\b
4242+ | \bjust\s+testing\b
4343+ | \btesting\s+(?:the\s+)?(?:tangled|programmatic|access|repo|issue|out|this)\b
4444+ | \bhello,?\s+world\b
4545+ | \bhallo\b
4646+ | \blorem\s+ipsum\b
4747+ | \bread[-\s]?only\s+mirror\b
4848+ | \[read[-\s]?only\]
4949+ | \bignore\s+(?:this|me|please)\b
5050+ | \bplaceholder\b
5151+ | \bexplor(?:e|ing)\s+(?:what\s+)?tangled\b
5252+ | \basdf\b | \bqwerty\b
5353+ """,
5454+ re.IGNORECASE | re.VERBOSE,
5555+)
5656+5757+5858+def _tokens(text: str) -> set[str]:
5959+ return {t for t in _TOKEN_SPLIT.split((text or "").lower()) if t}
6060+6161+6262+def _is_gibberish(text: str) -> bool:
6363+ """A single run of letters with very few distinct characters, e.g.
6464+ 'adadadaddaaddada' or 'adwawdawd' — typical of throwaway repo descriptions."""
6565+ t = (text or "").strip().lower()
6666+ if not t or " " in t or len(t) < 6:
6767+ return False
6868+ return len(set(t)) / len(t) < 0.4
6969+7070+7171+def is_test_repo(name: str, description: str = "") -> bool:
7272+ toks = _tokens(name)
7373+ if toks & _TEST_TOKENS or any(_TESTNUM_RE.match(t) for t in toks):
7474+ return True
7575+ desc = (description or "").strip()
7676+ if desc and (_PLACEHOLDER_RE.search(desc) or _is_gibberish(desc)):
7777+ return True
7878+ return False
7979+8080+8181+def is_placeholder_issue(title: str, body: str = "") -> bool:
8282+ blob = f"{title or ''}\n{body or ''}"
8383+ return bool(_PLACEHOLDER_RE.search(blob))
8484+8585+8686+def drop_issue(repo_name: str, repo_description: str, title: str, body: str) -> bool:
8787+ """True if this issue should be excluded: its repo is a sandbox/test repo, or
8888+ its content is a placeholder/test issue."""
8989+ return is_test_repo(repo_name, repo_description) or is_placeholder_issue(title, body)
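The two less obvious heuristics above are the whole-token match (which is why "latest" is not flagged) and the distinct-character ratio behind the gibberish check. The sketch below isolates both; `_TEST_TOKENS` is an abbreviated subset of the real set, and `looks_like_test_name` and `distinct_ratio` are illustrative names rather than functions from the module.

```python
import re

_TOKEN_SPLIT = re.compile(r"[^a-z0-9]+")
_TEST_TOKENS = frozenset({"test", "sandbox", "demo", "tmp"})  # abbreviated subset


def tokens(text: str) -> set:
    # Split on non-alphanumerics so only whole tokens are compared.
    return {t for t in _TOKEN_SPLIT.split((text or "").lower()) if t}


def looks_like_test_name(name: str) -> bool:
    return bool(tokens(name) & _TEST_TOKENS)


def distinct_ratio(text: str) -> float:
    # Share of distinct characters; the module treats < 0.4 as gibberish
    # for a single space-free run of at least 6 characters.
    t = text.strip().lower()
    return len(set(t)) / len(t)


flagged = looks_like_test_name("my-test-repo")
not_flagged = looks_like_test_name("latest-build")
low = distinct_ratio("adwawdawd")   # 3 distinct characters out of 9
high = distinct_ratio("parser")     # 5 distinct characters out of 6
```

Substring matching would have flagged `latest-build` because it contains "test"; token matching does not, which is the conservative behaviour the module docstring asks for.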
+84
recommendation/app/questionnaires.py
···11+"""Questionnaire HTTP helpers — resolve issue URI and load the questionnaire.
22+33+Source of truth is the **knot**: a questionnaire is one JSON file in the knot-hosted
44+repo (`questionnaires/<did>/<rkey>.json`), fetched per-issue via the knot blob XRPC —
55+no clone, no DB. (`QUESTIONNAIRE_SOURCE=db` reverts to the old Postgres read; in knot
66+mode `QUESTIONNAIRE_DB_FALLBACK=1` falls back to the DB on a miss during transition.)
77+"""
88+99+from __future__ import annotations
1010+1111+import json
1212+import urllib.error
1313+import urllib.parse
1414+import urllib.request
1515+1616+from app import db
1717+from app.config import get_settings
1818+1919+2020+class IssueUriError(ValueError):
2121+ """Invalid or ambiguous issue identifier."""
2222+2323+2424+class QuestionnaireNotFoundError(LookupError):
2525+ """No questionnaire for this issue."""
2626+2727+2828+def resolve_issue_param(issue: str) -> str:
2929+ """Normalize ``issue`` query param to a full at:// issue URI."""
3030+ try:
3131+ return db.resolve_issue_uri(issue)
3232+ except ValueError as exc:
3333+ raise IssueUriError(str(exc)) from exc
3434+3535+3636+def _knot_blob_url(issue_uri: str, settings) -> str:
3737+ """at://<did>/sh.tangled.repo.issue/<rkey> -> knot blob URL for its questionnaire.
3838+ Path convention matches agent/questionnaire_repo_store.py + export_questionnaires.py."""
3939+ rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
4040+ parts = rest.split("/")
4141+ path = f"questionnaires/{parts[0]}/{parts[-1]}.json"
4242+ qs = urllib.parse.urlencode({"repo": settings.questionnaire_repo_did, "path": path})
4343+ return f"https://{settings.questionnaire_knot_host}/xrpc/sh.tangled.repo.blob?{qs}"
4444+4545+4646+def _fetch_from_knot(issue_uri: str, settings) -> dict | None:
4747+ """Fetch + parse the questionnaire file from the knot, or None if absent.
4848+4949+ The blob XRPC returns ``{"content": "<file text>", ...}``; the file text is the
5050+ record written by the generator: ``{issue_uri, version, created_at, updated_at, payload}``."""
5151+ url = _knot_blob_url(issue_uri, settings)
5252+ # Knots 403 the default Python-urllib User-Agent; send an explicit one.
5353+ req = urllib.request.Request(
5454+ url, headers={"User-Agent": "tangled-rec/1.0", "Accept": "application/json"}
5555+ )
5656+ try:
5757+ with urllib.request.urlopen(req, timeout=settings.questionnaire_knot_timeout) as resp:
5858+ blob = json.loads(resp.read().decode("utf-8"))
5959+ except urllib.error.HTTPError as exc:
6060+ if exc.code in (404, 400):
6161+ return None
6262+ raise
6363+ content = blob.get("content") if isinstance(blob, dict) else None
6464+ if not content:
6565+ return None
6666+ rec = json.loads(content) if isinstance(content, str) else content
6767+ return rec if isinstance(rec, dict) and rec.get("payload") is not None else None
6868+6969+7070+def load_questionnaire_payload(issue: str) -> dict:
7171+ """Return the questionnaire JSON object for an issue URI or rkey."""
7272+ settings = get_settings()
7373+ issue_uri = resolve_issue_param(issue)
7474+7575+ if settings.questionnaire_source == "knot":
7676+ rec = _fetch_from_knot(issue_uri, settings)
7777+ if rec is None and settings.questionnaire_db_fallback:
7878+ rec = db.get_questionnaire(issue_uri)
7979+ else:
8080+ rec = db.get_questionnaire(issue_uri)
8181+8282+ if not rec:
8383+ raise QuestionnaireNotFoundError(issue_uri)
8484+ return rec["payload"]
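The path convention used by `_knot_blob_url` is easier to verify with a concrete URI. The sketch below reproduces the URL construction with explicit parameters instead of a settings object; the host `knot.example` and both DIDs are invented for illustration and do not correspond to a real knot.

```python
import urllib.parse


def knot_blob_url(issue_uri: str, knot_host: str, repo_did: str) -> str:
    # at://<did>/sh.tangled.repo.issue/<rkey> -> questionnaires/<did>/<rkey>.json
    rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
    parts = rest.split("/")
    path = f"questionnaires/{parts[0]}/{parts[-1]}.json"
    # urlencode percent-escapes ":" and "/" inside the query values.
    qs = urllib.parse.urlencode({"repo": repo_did, "path": path})
    return f"https://{knot_host}/xrpc/sh.tangled.repo.blob?{qs}"


# Hypothetical identifiers.
url = knot_blob_url(
    "at://did:plc:author/sh.tangled.repo.issue/3kabc",
    "knot.example",
    "did:plc:qrepo",
)
split = urllib.parse.urlsplit(url)
query = urllib.parse.parse_qs(split.query)
```

Parsing the query string back, rather than comparing raw strings, keeps the check independent of how the escaping is spelled.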
+97
recommendation/app/rank.py
···11+"""Scoring + diversified rerank.
22+33+The scorer is intentionally behind a small interface (Protocol) so it can be
44+swapped for a learned ranker later without touching the pipeline. The default
55+is a transparent weighted sum: similarity + consensus + recency + popularity.
66+77+Diversify uses round-robin across each candidate's primary seed so that one
88+high-volume interest can't bury a user's lone interests (the failure mode the
99+original clustering experiment was built to avoid).
1010+"""
1111+1212+from __future__ import annotations
1313+1414+from datetime import datetime, timezone
1515+from typing import Protocol
1616+1717+from app.types import Candidate
1818+1919+2020+def _recency(created_at: str | None, half_life_days: float = 365.0) -> float:
2121+ """Map an ISO timestamp to (0, 1]; newer is higher. 0 if absent/unparseable."""
2222+ if not created_at:
2323+ return 0.0
2424+ try:
2525+ dt = datetime.fromisoformat(str(created_at).replace("Z", "+00:00"))
2626+ except (ValueError, TypeError):
2727+ return 0.0
2828+ if dt.tzinfo is None:
2929+ dt = dt.replace(tzinfo=timezone.utc)
3030+ age_days = (datetime.now(timezone.utc) - dt).total_seconds() / 86400.0
3131+ if age_days < 0:
3232+ age_days = 0.0
3333+ return 0.5 ** (age_days / half_life_days)
3434+3535+3636+class Scorer(Protocol):
3737+ def score(self, c: Candidate) -> float: ...
3838+3939+4040+class DefaultScorer:
4141+ def __init__(
4242+ self,
4343+ w_similarity: float = 1.0,
4444+ w_consensus: float = 0.10,
4545+ w_recency: float = 0.05,
4646+ w_popularity: float = 0.0, # stub until stars are ingested
4747+ ) -> None:
4848+ self.w_similarity = w_similarity
4949+ self.w_consensus = w_consensus
5050+ self.w_recency = w_recency
5151+ self.w_popularity = w_popularity
5252+5353+ def score(self, c: Candidate) -> float:
5454+ similarity = 1.0 - c.distance
5555+ consensus = c.consensus - 1 # 0 for a single-seed hit
5656+ recency = _recency(c.payload.get("created_at"))
5757+ popularity = 0.0
5858+ return (
5959+ self.w_similarity * similarity
6060+ + self.w_consensus * consensus
6161+ + self.w_recency * recency
6262+ + self.w_popularity * popularity
6363+ )
6464+6565+6666+def apply_floor(candidates: list[Candidate], floor: float) -> list[Candidate]:
6767+ return [c for c in candidates if c.distance <= floor]
6868+6969+7070+def rerank(
7171+ candidates: list[Candidate],
7272+ scorer: Scorer,
7373+ max_n: int,
7474+ diversify: bool = True,
7575+) -> list[Candidate]:
7676+ scored = sorted(candidates, key=scorer.score, reverse=True)
7777+ if not diversify:
7878+ return scored[:max_n]
7979+8080+ # Group by primary seed; preserve score order within each group.
8181+ groups: dict[str, list[Candidate]] = {}
8282+ for c in scored:
8383+ groups.setdefault(c.primary_seed, []).append(c)
8484+8585+ # Order groups by their best member's score (global best leads).
8686+ ordered_groups = sorted(groups.values(), key=lambda g: scorer.score(g[0]), reverse=True)
8787+8888+ out: list[Candidate] = []
8989+ idx = 0
9090+ while len(out) < max_n and any(idx < len(g) for g in ordered_groups):
9191+ for g in ordered_groups:
9292+ if idx < len(g):
9393+ out.append(g[idx])
9494+ if len(out) >= max_n:
9595+ break
9696+ idx += 1
9797+ return out
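Two pieces of this module benefit from a worked example: the half-life recency curve and the round-robin interleave. The sketch below isolates both on plain values; `decay` and `round_robin` are illustrative names, and the groups are made-up labels standing in for score-sorted candidates grouped by primary seed.

```python
def decay(age_days: float, half_life_days: float = 365.0) -> float:
    # Halves every `half_life_days`; an item of age 0 scores 1.0.
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def round_robin(groups, max_n):
    """Take the idx-th item of each group in turn until max_n are picked."""
    out = []
    idx = 0
    while len(out) < max_n and any(idx < len(g) for g in groups):
        for g in groups:
            if idx < len(g):
                out.append(g[idx])
                if len(out) >= max_n:
                    break
        idx += 1
    return out


# Groups are assumed to be pre-sorted: best group first, best item first.
groups = [["a1", "a2", "a3", "a4"], ["b1"], ["c1", "c2"]]
picked = round_robin(groups, 5)

# Default weights applied to one candidate: distance 0.2, found by 2 seeds,
# created one half-life ago.
score = 1.0 * (1.0 - 0.2) + 0.10 * (2 - 1) + 0.05 * decay(365.0)
```

The large group `a` contributes only two of the five slots, so the single-item group `b` is not buried. With the default weights, similarity dominates the score and consensus and recency act mostly as tie-breakers.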
+198
recommendation/app/recommend.py
···11+"""Recommendation orchestration: seeds -> per-seed kNN -> merge -> dedup ->
22+floor -> rerank -> contract shape.
33+44+This is the only place that stitches the (pure) stages to the (impure) data
55+access. Keeping it thin makes the algorithm easy to read top-to-bottom.
66+"""
77+88+from __future__ import annotations
99+1010+from concurrent.futures import ThreadPoolExecutor
1111+1212+from app import db
1313+from app.config import Settings, get_settings
1414+from app.dedup import collapse_forks, row_content_hash
1515+from app.links import at_owner, repo_url, to_rfc3339
1616+from app.merge import merge_hits
1717+from app.profile import build_interests
1818+from app.quality import drop_issue
1919+from app.rank import DefaultScorer, apply_floor, rerank
2020+from app.schemas import (
2121+ IssueOut,
2222+ Profile,
2323+ Recommendations,
2424+ RepoOut,
2525+ Sources,
2626+ TangledSource,
2727+)
2828+from app.types import Candidate
2929+3030+3131+def _empty(settings: Settings, seed_count: int) -> Recommendations:
3232+ return Recommendations(
3333+ profile=Profile(
3434+ interests=[],
3535+ languages=[],
3636+ sources=Sources(tangled=TangledSource(repos=seed_count)),
3737+ ),
3838+ repos=[],
3939+ issues=[],
4040+ )
4141+4242+4343+def _seed_label(seed: dict) -> str:
4444+ return seed.get("repo_name") or seed["repo_did"]
4545+4646+4747+def _seed_url_map(seeds: list[dict], settings: Settings) -> dict[str, str]:
4848+ """Map seed label (repo name or did) -> absolute Tangled repo URL."""
4949+ out: dict[str, str] = {}
5050+ for seed in seeds:
5151+ handle = (seed.get("owner_handle") or "").strip()
5252+ name = (seed.get("repo_name") or "").strip()
5353+ label = _seed_label(seed)
5454+ out[label] = repo_url(settings.web_base, handle, name) if handle and name else ""
5555+ return out
5656+5757+5858+def _based_on_repo_url(c: Candidate, seed_urls: dict[str, str]) -> str:
5959+ return seed_urls.get(c.primary_seed, "")
6060+6161+6262+def _repo_out(
6363+ c: Candidate,
6464+ settings: Settings,
6565+ open_issues: dict[str, int],
6666+ seed_urls: dict[str, str],
6767+) -> RepoOut:
6868+ p = c.payload
6969+ handle = p.get("owner_handle") or ""
7070+ name = p.get("repo_name") or ""
7171+ return RepoOut(
7272+ name=name,
7373+ owner=at_owner(handle),
7474+ language="", # no language signal in the shared DB yet
7575+ description=(p.get("description") or "").strip(),
7676+ stars=0, # no star signal yet (tangled_backlinks empty)
7777+ openIssues=open_issues.get(c.key, 0),
7878+ lastActive=to_rfc3339(p.get("created_at")),
7979+ url=repo_url(settings.web_base, handle, name),
8080+ basedOnRepoUrl=_based_on_repo_url(c, seed_urls),
8181+ )
8282+8383+8484+def _issue_out(
8585+ c: Candidate, settings: Settings, seed_urls: dict[str, str], with_questionnaire: set[str]
8686+) -> IssueOut:
8787+ p = c.payload
8888+ handle = p.get("owner_handle") or ""
8989+ name = p.get("repo_name") or ""
9090+ uri = (p.get("uri") or "").strip()
9191+ return IssueOut(
9292+ title=(p.get("title") or "").strip(),
9393+ repo=f"{handle}/{name}",
9494+ owner=at_owner(handle),
9595+ issueUri=uri,
9696+ repoDid=p.get("repo_did") or "",
9797+ rkey=p.get("rkey") or "",
9898+ url=repo_url(settings.web_base, handle, name),
9999+ basedOnRepoUrl=_based_on_repo_url(c, seed_urls),
100100+ repoReadme=(p.get("repo_readme") or "").strip(),
101101+ hasQuestionnaire=uri in with_questionnaire,
102102+ labels=[], # issue records carry no labels in the shared DB
103103+ comments=0, # no comment source yet
104104+ language="",
105105+ lastActive=to_rfc3339(p.get("created_at")),
106106+ )
107107+108108+109109+def _fetch_per_seed(seeds, query, workers) -> list[tuple[str, list[dict]]]:
110110+ """Run `query(seed) -> (label, rows)` across the user's seeds concurrently.
111111+112112+ The DB is remote with multi-second round-trips, so the per-seed kNN queries
113113+ dominate request latency; fanning them out across a thread pool cuts it to
114114+ roughly one query's worth. `ThreadPoolExecutor.map` preserves seed order, so
115115+ the downstream merge/rerank stay deterministic (tie-breaks unchanged).
116116+ """
117117+ n = max(1, min(len(seeds), workers))
118118+ with ThreadPoolExecutor(max_workers=n) as ex:
119119+ return list(ex.map(query, seeds))
120120+121121+122122+def _recommend_repos(seeds, exclude_dids, seed_hashes, settings) -> list[RepoOut]:
123123+ seed_urls = _seed_url_map(seeds, settings)
124124+125125+ def query(s):
126126+ rows = db.knn_repos(
127127+ s["etext"], exclude_dids, settings.per_seed_limit, settings.min_readme_chars
128128+ )
129129+ return (_seed_label(s), rows)
130130+131131+ per_seed_hits = _fetch_per_seed(seeds, query, settings.query_workers)
132132+133133+ candidates = merge_hits(per_seed_hits, seed_hashes)
134134+ candidates = collapse_forks(candidates)
135135+ candidates = apply_floor(candidates, settings.distance_floor)
136136+ candidates = [c for c in candidates if (c.payload.get("owner_handle") or "").strip()]
137137+ ranked = rerank(candidates, DefaultScorer(), settings.max_repos, diversify=True)
138138+139139+ counts = db.open_issue_counts([c.key for c in ranked])
140140+ return [_repo_out(c, settings, counts, seed_urls) for c in ranked]
141141+142142+143143+def _recommend_issues(did, seeds, exclude_dids, settings) -> list[IssueOut]:
144144+ seed_urls = _seed_url_map(seeds, settings)
145145+146146+ def query(s):
147147+ rows = db.knn_issues(s["etext"], exclude_dids, did, settings.per_seed_limit)
148148+ return (_seed_label(s), rows)
149149+150150+ per_seed_hits = _fetch_per_seed(seeds, query, settings.query_workers)
151151+152152+ # Key by issue uri — each issue is already unique. We deliberately do NOT run
153153+ # collapse_forks here: that collapses by md5(content[:500]), which is right for
154154+ # fork READMEs but would merge genuinely distinct issues that share an empty or
155155+ # boilerplate body, silently dropping real recommendations.
156156+ candidates = merge_hits(per_seed_hits, seed_content_hashes=set(), key_field="uri")
157157+ candidates = apply_floor(candidates, settings.issue_distance_floor)
158158+ # Drop issues whose parent repo is a sandbox/test repo or whose content is a
159159+ # placeholder/test issue — they embed close to real interests but aren't real
160160+ # contribution opportunities. (The README-length repo standard can't be used
161161+ # here: issue-parent repos almost never have a README in the DB.)
162162+ candidates = [
163163+ c
164164+ for c in candidates
165165+ if not drop_issue(
166166+ c.payload.get("repo_name") or "",
167167+ c.payload.get("repo_description") or "",
168168+ c.payload.get("title") or "",
169169+ c.payload.get("content") or "",
170170+ )
171171+ ]
172172+ ranked = rerank(candidates, DefaultScorer(), settings.max_issues, diversify=True)
173173+ with_questionnaire = db.questionnaires_present(
174174+ [c.payload.get("uri") for c in ranked if c.payload.get("uri")]
175175+ )
176176+ return [_issue_out(c, settings, seed_urls, with_questionnaire) for c in ranked]
177177+178178+179179+def recommend(did: str, settings: Settings | None = None) -> Recommendations:
180180+ settings = settings or get_settings()
181181+182182+ seeds = db.load_seeds(did, settings.min_readme_chars)
183183+ if not seeds:
184184+ return _empty(settings, 0)
185185+186186+ seed_hashes = {row_content_hash(s) for s in seeds}
187187+ exclude_dids = [s["repo_did"] for s in seeds]
188188+189189+ repos = _recommend_repos(seeds, exclude_dids, seed_hashes, settings)
190190+ issues = _recommend_issues(did, seeds, exclude_dids, settings)
191191+192192+ interests = build_interests(seeds, settings.max_interests)
193193+ profile = Profile(
194194+ interests=[{"label": i["label"], "slug": i["slug"]} for i in interests],
195195+ languages=[],
196196+ sources=Sources(tangled=TangledSource(repos=len(seeds))),
197197+ )
198198+ return Recommendations(profile=profile, repos=repos, issues=issues)
+67
recommendation/app/schemas.py
···11+"""Pydantic response models — field names match the schema.md wire contract
22+exactly (camelCase where the Go client expects it), so no aliasing is needed.
33+"""
44+55+from __future__ import annotations
66+77+from pydantic import BaseModel
88+99+1010+class Interest(BaseModel):
1111+ label: str
1212+ slug: str
1313+1414+1515+class TangledSource(BaseModel):
1616+ repos: int
1717+1818+1919+class GithubSource(BaseModel):
2020+ handle: str
2121+ repos: int
2222+2323+2424+class Sources(BaseModel):
2525+ tangled: TangledSource
2626+ github: GithubSource | None = None # omitted when GitHub isn't connected
2727+2828+2929+class Profile(BaseModel):
3030+ interests: list[Interest]
3131+ languages: list[str]
3232+ sources: Sources
3333+3434+3535+class RepoOut(BaseModel):
3636+ name: str
3737+ owner: str # "@handle"
3838+ language: str
3939+ description: str
4040+ stars: int
4141+ openIssues: int
4242+ lastActive: str # RFC-3339
4343+ url: str # absolute — recommended repo
4444+ basedOnRepoUrl: str = "" # user's seed repo that surfaced this hit
4545+4646+4747+class IssueOut(BaseModel):
4848+ title: str
4949+ repo: str # "owner/name"
5050+ owner: str # "@handle"
5151+ issueUri: str = "" # at://…/sh.tangled.repo.issue/<rkey>
5252+ repoDid: str # appview resolves number+url from (repoDid, rkey)
5353+ rkey: str
5454+ url: str = "" # absolute — parent repo the issue belongs to
5555+ basedOnRepoUrl: str = "" # user's seed repo that surfaced this hit
5656+ repoReadme: str = "" # parent repo README the issue belongs to
5757+ hasQuestionnaire: bool = False # an AI-solve questionnaire exists for this issue
5858+ labels: list[str]
5959+ comments: int
6060+ language: str
6161+ lastActive: str # RFC-3339
6262+6363+6464+class Recommendations(BaseModel):
6565+ profile: Profile
6666+ repos: list[RepoOut]
6767+ issues: list[IssueOut]
+27
recommendation/app/search.py
···11+"""Parallel per-seed vector search against the DB."""
22+33+from __future__ import annotations
44+55+from collections.abc import Callable
66+from concurrent.futures import ThreadPoolExecutor
77+from typing import Any, TypeVar
88+99+T = TypeVar("T")
1010+1111+1212+def parallel_seed_search(
1313+ seeds: list[dict[str, Any]],
1414+ search: Callable[[dict[str, Any]], list[dict]],
1515+ *,
1616+ max_workers: int,
1717+) -> list[tuple[str, list[dict]]]:
1818+ """Run one kNN query per seed, up to ``max_workers`` at a time."""
1919+ if not seeds:
2020+ return []
2121+ workers = max(1, min(max_workers, len(seeds)))
2222+ with ThreadPoolExecutor(max_workers=workers) as pool:
2323+ rows_by_seed = list(pool.map(search, seeds))
2424+ return [
2525+ (s.get("repo_name") or s["repo_did"], rows)
2626+ for s, rows in zip(seeds, rows_by_seed, strict=True)
2727+ ]
+31
recommendation/app/types.py
···11+"""Shared domain types for the recommendation pipeline.
22+33+These are deliberately plain dataclasses so the pure stages (merge / dedup /
44+rank) are trivially unit-testable without a database or network.
55+"""
66+77+from __future__ import annotations
88+99+from dataclasses import dataclass, field
1010+1111+1212+@dataclass
1313+class Candidate:
1414+ """A recommended repo (or issue) accumulated across the user's seeds.
1515+1616+ `distance` is the best (minimum) cosine distance seen for this candidate.
1717+ `seeds` records which of the user's seed repos surfaced it — its length is
1818+ the consensus signal (more seeds agreeing -> higher rank). `payload` holds
1919+ the raw DB row fields used later for shaping (name, owner_handle, etc.).
2020+ """
2121+2222+ key: str # repo_did for repos; issue uri for issues
2323+ content_hash: str
2424+ distance: float
2525+ seeds: list[str] = field(default_factory=list)
2626+ primary_seed: str = "" # seed that gave the best (min) distance
2727+ payload: dict = field(default_factory=dict)
2828+2929+ @property
3030+ def consensus(self) -> int:
3131+ return len(self.seeds)
+23
recommendation/app/vectors.py
···11+"""Vector parsing/formatting shared by SQL (pgvector text) and git (numpy) backends."""
22+33+from __future__ import annotations
44+55+import json
66+77+import numpy as np
88+99+1010+def parse_vector_text(text: str) -> np.ndarray:
1111+ """Parse a pgvector-style literal ``[v1,v2,...]`` into a float32 vector (no normalization is applied here)."""
1212+ raw = text.strip()
1313+ if raw.startswith("[") and raw.endswith("]"):
1414+ raw = raw[1:-1]
1515+ parts = [p.strip() for p in raw.split(",") if p.strip()]
1616+ if not parts:
1717+ raise ValueError("empty vector text")
1818+ vec = np.asarray([float(p) for p in parts], dtype=np.float32)
1919+ return vec
2020+2121+2222+def vector_to_text(vec: np.ndarray) -> str:
2323+ return "[" + ",".join(repr(float(x)) for x in vec) + "]"
···11+"""Offline eval: held-out-seed retrieval (recall@k / nDCG).
22+33+A content-similarity recommender excludes the user's own repos from results, so we
44+can't hold out an owned repo and expect it *recommended*. Instead we measure the
55+underlying relevance signal: hold out each user's most recent repo, generate
66+candidates from their REMAINING repos (excluding the other seeds but NOT the
77+held-out target), and check where the held-out repo ranks. A good engine ranks
88+"what the user built next" near the top of what it would surface.
99+1010+Run: `python eval/harness.py` (needs DB_CONNECTION_STRING). Establishes a
1111+baseline BEFORE any ranking changes — no "feels better" tuning.
1212+"""
1313+1414+from __future__ import annotations
1515+1616+import math
1717+import sys
1818+1919+from app import db
2020+from app.config import get_settings
K_VALUES = (10, 20, 50)
PER_SEED_K = 50  # neighbours pulled per remaining seed
MAX_USERS = 60  # sample size (keeps the run quick)
MIN_SEEDS = 3  # need enough seeds to hold one out and still have signal
# Candidate content gate, mirroring the live service (REC_MIN_README_CHARS).
# Set to 0 to reproduce the pre-gate baseline.
MIN_README_CHARS = get_settings().min_readme_chars

_USERS_SQL = """
    select split_part(replace(repo_uri, 'at://', ''), '/', 1) as owner_did,
           count(*)::int as n
    from tangled_readmes
    where embedding is not null and repo_uri is not null
    group by 1
    having count(*) between %(lo)s and 30
    order by n desc
    limit %(max_users)s
"""

# Owned repos for one user, with createdAt so we can hold out the most recent.
_OWNED_SQL = """
    select r.repo_did,
           r.embedding::text as etext,
           tr.record_raw->>'createdAt' as created_at
    from tangled_readmes r
    left join tangled_repos tr
      on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
    where r.embedding is not null
      and r.repo_uri like 'at://' || %(did)s || '/%%'
"""


def _users() -> list[str]:
    with db.get_pool().connection() as conn:
        rows = conn.execute(_USERS_SQL, {"lo": MIN_SEEDS, "max_users": MAX_USERS}).fetchall()
    return [r["owner_did"] for r in rows]


def _owned(did: str) -> list[dict]:
    with db.get_pool().connection() as conn:
        return [dict(r) for r in conn.execute(_OWNED_SQL, {"did": did}).fetchall()]


def _rank_of_target(seeds: list[dict], target: dict) -> int | None:
    """Generate candidates from the remaining seeds and return the 1-based rank
    of the held-out target repo (None if outside the candidate pool)."""
    rest = [s for s in seeds if s["repo_did"] != target["repo_did"]]
    exclude = [s["repo_did"] for s in rest]  # exclude seeds, but allow the target
    best: dict[str, float] = {}
    for s in rest:
        for row in db.knn_repos(s["etext"], exclude, PER_SEED_K, MIN_README_CHARS):
            rd = row["repo_did"]
            d = float(row["distance"])
            if rd not in best or d < best[rd]:
                best[rd] = d
    ranked = sorted(best, key=best.get)
    tdid = target["repo_did"]
    return ranked.index(tdid) + 1 if tdid in ranked else None


def main() -> int:
    if not get_settings().db_connection_string:
        print("DB_CONNECTION_STRING not set", file=sys.stderr)
        return 1

    users = _users()
    evaluated = 0
    hits = {k: 0 for k in K_VALUES}
    ndcg_sum = 0.0

    for did in users:
        seeds = _owned(did)
        if len(seeds) < MIN_SEEDS:
            continue
        # hold out the most recent repo (fallback: last by repo_did for stability)
        target = max(seeds, key=lambda s: (s.get("created_at") or "", s["repo_did"]))
        rank = _rank_of_target(seeds, target)
        evaluated += 1
        if rank is not None:
            for k in K_VALUES:
                if rank <= k:
                    hits[k] += 1
            ndcg_sum += 1.0 / math.log2(rank + 1)  # ideal DCG = 1 (single relevant item)

    if evaluated == 0:
        print("no users evaluated")
        return 1

    print(f"evaluated users: {evaluated} (per-seed k={PER_SEED_K})")
    for k in K_VALUES:
        print(f" recall@{k:<3} = {hits[k] / evaluated:.3f}")
    print(f" nDCG = {ndcg_sum / evaluated:.3f}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
import pg from "pg";
import { readFileSync } from "node:fs";
function loadConn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env", "../../.env"]) {
    try {
      const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m);
      if (m) return m[1].trim();
    } catch {}
  }
  throw new Error("DB_CONNECTION_STRING not found");
}
const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== all tables/views (every non-system schema) ===");
console.table((await pool.query(`
  select table_schema, table_name, table_type
  from information_schema.tables
  where table_schema not in ('pg_catalog','information_schema')
  order by table_schema, table_name`)).rows);

console.log("\n=== columns matching 'embed', 'readme' or 'vector' (any table) ===");
const hits = await pool.query(`
  select table_schema, table_name, column_name, data_type
  from information_schema.columns
  where table_schema not in ('pg_catalog','information_schema')
    and (column_name ~* 'embed|readme|vector')
  order by table_schema, table_name, ordinal_position`);
console.table(hits.rows.length ? hits.rows : [{ note: "no columns named embed*/readme*/vector*" }]);

console.log("\n=== tables matching 'embed' or 'readme' by NAME ===");
const tn = await pool.query(`
  select table_schema, table_name from information_schema.tables
  where table_name ~* 'embed|readme'
  order by 1,2`);
console.table(tn.rows.length ? tn.rows : [{ note: "no table named embed*/readme*" }]);

// For every table that matched by column or by name, show its row count.
// A table can match several times (multiple columns, or column + name), so count each once.
const seen = new Set();
for (const r of [...hits.rows, ...tn.rows]) {
  const t = `"${r.table_schema}"."${r.table_name}"`;
  if (seen.has(t)) continue;
  seen.add(t);
  try {
    const c = await pool.query(`select count(*)::int n from ${t}`);
    console.log(`count ${t}: ${c.rows[0].n}`);
  } catch (e) {
    console.log(`count ${t}: failed (${e.message})`);
  }
}

console.log("\n=== columns on tangled_repos (did a readme/embedding col get added here?) ===");
console.table((await pool.query(`
  select column_name, data_type from information_schema.columns
  where table_schema='public' and table_name='tangled_repos' order by ordinal_position`)).rows);

await pool.end();
+51
recommendation/reference/src/check_readmes.mjs
import pg from "pg";
import { readFileSync } from "node:fs";
function loadConn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m);
      if (m) return m[1].trim();
    } catch {}
  }
  throw new Error("no conn");
}
const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== full tangled_readmes columns ===");
console.table((await pool.query(`
  select c.ordinal_position, c.column_name, c.data_type, c.udt_name, c.is_nullable
  from information_schema.columns c
  where c.table_schema='public' and c.table_name='tangled_readmes'
  order by c.ordinal_position`)).rows);

console.log("\n=== embedding column: pgvector dimensions ===");
console.table((await pool.query(`
  select a.attname, format_type(a.atttypid, a.atttypmod) as type
  from pg_attribute a
  join pg_class c on c.oid=a.attrelid join pg_namespace n on n.oid=c.relnamespace
  where n.nspname='public' and c.relname='tangled_readmes' and a.attnum>0 and not a.attisdropped
    and format_type(a.atttypid,a.atttypmod) ~* 'vector'`)).rows);

console.log("\n=== counts ===");
console.table((await pool.query(`
  select
    count(*)::int total,
    count(*) filter (where embedding is not null)::int with_embedding,
    count(distinct embedding_model)::int models
  from tangled_readmes`)).rows);

console.log("\n=== embedding_model values ===");
console.table((await pool.query(`select embedding_model, count(*)::int n from tangled_readmes group by 1 order by 2 desc`)).rows);

console.log("\n=== indexes on tangled_readmes (ivfflat/hnsw?) ===");
console.table((await pool.query(`select indexname, indexdef from pg_indexes where schemaname='public' and tablename='tangled_readmes'`)).rows);

console.log("\n=== sample row (text truncated, embedding omitted) ===");
const s = await pool.query(`select * from tangled_readmes limit 1`);
if (s.rows.length) {
  const r = { ...s.rows[0] };
  for (const k of Object.keys(r)) {
    if (k === "embedding") r[k] = `<vector len=${typeof r[k] === "string" ? (r[k].match(/,/g)?.length ?? 0) + 1 : "?"}>`;
    else if (typeof r[k] === "string" && r[k].length > 160) r[k] = r[k].slice(0, 160) + "…";
  }
  console.log(JSON.stringify(r, null, 2));
}
await pool.end();
// Embed all unembedded READMEs in tangled_readmes using Google Gemini embeddings.
//
// - Reads the worklist (status='found' AND content IS NOT NULL AND embedding IS NULL),
//   the exact predicate behind tangled_readmes_unembedded_idx.
// - Embeds doc = "# <name>\n\n<description>\n\n<README>" with gemini-embedding-001 at
//   outputDimensionality=1536 (matches the vector(1536) column), task RETRIEVAL_DOCUMENT.
// - L2-normalizes (sub-3072 MRL dims aren't auto-normalized) so the HNSW cosine index is happy.
// - UPDATEs only the embedding columns, only where embedding IS NULL → idempotent / re-runnable.
//
// Env: DB_CONNECTION_STRING (or ../.env), GEMINI_API_KEY (required).
// Optional: LIMIT (0=all), CONCURRENCY (default 4), DRY_RUN=1 (count only), MAX_CHARS (default 8000).

import pg from "pg";
import { readFileSync } from "node:fs";

function fromEnvFile(key) {
  for (const p of ["../.env", ".env", "../../.env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)\\s*$`, "m"));
      if (m) return m[1].trim().replace(/^["']|["']$/g, "");
    } catch {}
  }
  return undefined;
}

const CONN = process.env.DB_CONNECTION_STRING || fromEnvFile("DB_CONNECTION_STRING");
const API_KEY = process.env.GEMINI_API_KEY || fromEnvFile("GEMINI_API_KEY");
const MODEL = process.env.GEMINI_EMBED_MODEL || fromEnvFile("GEMINI_EMBED_MODEL") || "gemini-embedding-001";
const DIMS = 1536;
const LIMIT = parseInt(process.env.LIMIT ?? "0", 10);
const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "4", 10);
const MAX_CHARS = parseInt(process.env.MAX_CHARS ?? "8000", 10);
const DRY_RUN = process.env.DRY_RUN === "1";

if (!CONN) { console.error("DB_CONNECTION_STRING not set"); process.exit(1); }
if (!API_KEY && !DRY_RUN) { console.error("GEMINI_API_KEY not set (add it to recommendation/.env)"); process.exit(1); }

const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 5 });
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

function buildDoc({ repo_name, description, content }) {
  const parts = [];
  if (repo_name) parts.push(`# ${repo_name}`);
  if (description && description.trim()) parts.push(description.trim());
  parts.push(content);
  return parts.join("\n\n").slice(0, MAX_CHARS);
}

function l2normalize(v) {
  let s = 0;
  for (const x of v) s += x * x;
  const n = Math.sqrt(s) || 1; // guard against an all-zero vector
  return v.map((x) => x / n);
}

async function embedOnce(text, dims) {
  const url = `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`;
  const body = {
    model: `models/${MODEL}`,
    content: { parts: [{ text }] },
    taskType: "RETRIEVAL_DOCUMENT",
    outputDimensionality: dims,
  };
  const resp = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
    body: JSON.stringify(body),
  });
  const txt = await resp.text();
  if (!resp.ok) {
    const err = new Error(`HTTP ${resp.status}: ${txt.slice(0, 200)}`);
    err.status = resp.status;
    throw err;
  }
  const j = JSON.parse(txt);
  const values = j?.embedding?.values;
  if (!Array.isArray(values)) throw new Error(`no embedding in response: ${txt.slice(0, 150)}`);
  return values;
}

// Embed with retries.
// - On 400 (often too-long input): halve the input and retry immediately, repeating
//   until the input is 2000 chars or shorter; after that a 400 is fatal.
// - Other 4xx except 429 are fatal. 429, 5xx and network errors back off exponentially
//   (800 ms doubling, capped at 30 s) and give up after 5 failed attempts in total.
async function embedWithRetry(text) {
  let attempt = 0;
  let input = text;
  while (true) {
    try {
      const v = await embedOnce(input, DIMS);
      return l2normalize(v);
    } catch (e) {
      attempt++;
      if (e.status === 400 && input.length > 2000) {
        input = input.slice(0, Math.floor(input.length / 2));
        continue;
      }
      if (attempt >= 5 || (e.status && e.status >= 400 && e.status < 500 && e.status !== 429)) {
        throw e;
      }
      const backoff = Math.min(30000, 800 * 2 ** (attempt - 1));
      await sleep(backoff);
    }
  }
}

async function main() {
  const worklistSql = `
    select r.repo_did, r.repo_name, r.content,
           coalesce(tr.record_raw->>'description', '') as description,
           length(r.content) as len
    from tangled_readmes r
    left join tangled_repos tr
      on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
    where r.status = 'found' and r.content is not null and r.embedding is null
    order by r.repo_did
    ${LIMIT > 0 ? `limit ${LIMIT}` : ""}`;

  const { rows } = await pool.query(worklistSql);
  const totalReadmes = (await pool.query(`select count(*)::int n from tangled_readmes`)).rows[0].n;
  const alreadyEmbedded = (await pool.query(`select count(*)::int n from tangled_readmes where embedding is not null`)).rows[0].n;

  console.log(`tangled_readmes total=${totalReadmes} already embedded=${alreadyEmbedded}`);
  console.log(`worklist (to embed now)=${rows.length} model=${MODEL} dims=${DIMS} concurrency=${CONCURRENCY}${LIMIT ? ` limit=${LIMIT}` : ""}`);
  if (DRY_RUN) { console.log("\nDRY_RUN=1 → not embedding, not writing."); await pool.end(); return; }
  if (rows.length === 0) { console.log("\nNothing to embed. ✔"); await pool.end(); return; }

  let done = 0, ok = 0, failed = 0;
  const errors = [];
  const queue = rows.slice();

  async function worker(id) {
    while (queue.length) {
      const r = queue.pop();
      try {
        const doc = buildDoc(r);
        const vec = await embedWithRetry(doc);
        const literal = `[${vec.join(",")}]`;
        const res = await pool.query(
          `update tangled_readmes
           set embedding = $1::vector, embedding_model = $2, embedded_at = now()
           where repo_did = $3 and embedding is null`,
          [literal, MODEL, r.repo_did],
        );
        if (res.rowCount > 0) ok++;
      } catch (e) {
        failed++;
        errors.push({ repo_did: r.repo_did, name: r.repo_name, err: e.message });
      }
      if (++done % 25 === 0 || done === rows.length) {
        process.stderr.write(` ...${done}/${rows.length} (ok=${ok} fail=${failed})\n`);
      }
    }
  }

  await Promise.all(Array.from({ length: CONCURRENCY }, (_, i) => worker(i)));

  console.log(`\n================ EMBEDDING DONE ================`);
  console.log(`embedded ok : ${ok}`);
  console.log(`failed : ${failed}`);
  if (errors.length) {
    console.log("\nfirst errors:");
    for (const e of errors.slice(0, 10)) console.log(` ${e.name ?? e.repo_did}: ${e.err}`);
  }
  const remaining = (await pool.query(
    `select count(*)::int n from tangled_readmes where status='found' and content is not null and embedding is null`,
  )).rows[0].n;
  console.log(`\nremaining unembedded (status=found): ${remaining}`);
  await pool.end();
}

main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
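For readers following the Python port, the two numeric pieces of this script (unit-length normalisation and the capped exponential backoff) can be sketched as below. This is an illustrative translation of `l2normalize` and the backoff expression only; the function names `l2_normalize` and `backoff_ms` and their defaults are assumptions here, not the port's actual API.

```python
import math


def l2_normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def backoff_ms(attempt, base_ms=800, cap_ms=30_000):
    """Delay before the next try after the given 1-based failed attempt."""
    return min(cap_ms, base_ms * 2 ** (attempt - 1))


unit = l2_normalize([3.0, 4.0])
print(unit)

# The script gives up on the 5th failure, so it sleeps after failures 1 to 4 at most.
print([backoff_ms(a) for a in range(1, 5)])  # [800, 1600, 3200, 6400]
```

With the default base and a five-attempt limit the 30 s cap is never reached; it would only matter if the attempt limit were raised to 7 or more.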
+37
recommendation/reference/src/explore_users.mjs
// Find owners with several embedded repos, and measure how SPREAD their repos are
// (high mean pairwise cosine distance = multi-interest user, a good demo candidate).
import pg from "pg";
import { readFileSync } from "node:fs";
function fromEnv(k) {
  if (process.env[k]) return process.env[k];
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`, "m"));
      if (m) return m[1].trim();
    } catch {}
  }
}
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });

const ownerDid = (uri) => uri ? uri.replace("at://", "").split("/")[0] : null;
function parseVec(s) { return s.replace(/^\[|\]$/g, "").split(",").map(Number); }
// Dot product; equals cosine similarity because stored embeddings are already unit-norm.
function cos(a, b) { let d = 0; for (let i = 0; i < a.length; i++) d += a[i] * b[i]; return d; }

const owners = (await pool.query(`
  select split_part(replace(repo_uri,'at://',''),'/',1) as owner_did,
         count(*)::int n, array_agg(repo_name) as names
  from tangled_readmes
  where embedding is not null and repo_uri is not null
  group by 1 having count(*) between 4 and 12
  order by n desc limit 25`)).rows;

const scored = [];
for (const o of owners) {
  const rows = (await pool.query(
    `select repo_name, embedding::text as e from tangled_readmes where embedding is not null and repo_uri like $1`,
    [`at://${o.owner_did}/%`])).rows;
  const vecs = rows.map((r) => parseVec(r.e));
  let sum = 0, cnt = 0;
  for (let i = 0; i < vecs.length; i++) {
    for (let j = i + 1; j < vecs.length; j++) { sum += 1 - cos(vecs[i], vecs[j]); cnt++; }
  }
  const meanDist = cnt ? sum / cnt : 0;
  scored.push({ owner_did: o.owner_did, n: o.n, meanDist: +meanDist.toFixed(3), names: rows.map((r) => r.repo_name) });
}
scored.sort((a, b) => b.meanDist - a.meanDist);
console.log("most multi-interest owners (high mean pairwise README distance):\n");
for (const s of scored.slice(0, 8)) {
  console.log(`mean_dist=${s.meanDist} n=${s.n} ${s.owner_did}`);
  console.log(` repos: ${s.names.join(", ")}\n`);
}
await pool.end();
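The spread score used above is the mean cosine distance over all unordered pairs of an owner's README vectors. A small Python sketch of the same computation follows, assuming (as the script does) that the vectors are already unit-norm so the dot product is the cosine similarity; `mean_pairwise_cosine_distance` is an illustrative name.

```python
from itertools import combinations


def mean_pairwise_cosine_distance(vectors):
    """Mean of (1 - dot(u, v)) over all unordered pairs of unit vectors.

    Returns 0.0 when there are fewer than two vectors (no pairs to average).
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - sum(a * b for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


# Toy unit vectors: two identical, one orthogonal to both.
spread = mean_pairwise_cosine_distance([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(spread)
```

The toy input has three pairs with distances 1, 0 and 1, so an owner whose repos point in unrelated directions scores closer to 1 and a single-topic owner scores closer to 0.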
+42
recommendation/reference/src/fetch_issues.mjs
// Fetch real sh.tangled.repo.issue records live from repo-owner PDSes.
import pg from "pg";
import { readFileSync } from "node:fs";
function fromEnv(k) {
  if (process.env[k]) return process.env[k];
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`, "m"));
      if (m) return m[1].trim();
    } catch {}
  }
}
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });

// Owners of embedded repos, with their PDS host.
const rows = (await pool.query(`
  select distinct tr.owner_did, pa.pds_host,
    (select repo_name from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null limit 1) as a_repo
  from tangled_repos tr
  join tangled_pds_accounts pa on pa.did = tr.owner_did
  where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null)
  limit 80`)).rows;
await pool.end();

console.log(`probing ${rows.length} owner PDSes for sh.tangled.repo.issue ...`);
const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);

const found = [];
async function listIssues(r) {
  const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`;
  try {
    // Abort the request if the PDS has not answered within 10 s.
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), 10000);
    const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
    clearTimeout(t);
    if (!resp.ok) return;
    const j = await resp.json();
    for (const rec of j.records ?? []) found.push({ owner: r.owner_did, uri: rec.uri, value: rec.value });
  } catch {}
}
// simple concurrency: 12 workers draining a shared queue
const q = rows.slice();
await Promise.all(Array.from({ length: 12 }, async () => { while (q.length) await listIssues(q.pop()); }));

console.log(`\nfound ${found.length} issue records`);
if (found.length) {
  console.log("\nsample issue record value keys:", Object.keys(found[0].value));
  console.log("sample record:", JSON.stringify(found[0], null, 2).slice(0, 900));
  console.log("\nfirst few titles:");
  for (const f of found.slice(0, 8)) console.log(` - ${f.value.title ?? "(no title)"} [repo ref: ${JSON.stringify(f.value.repo ?? f.value.subject ?? "?")}]`);
}
+95
recommendation/reference/src/issue_experiment.mjs
// Full experiment: fetch real Tangled issues live, embed as queries, vector-search READMEs.
import pg from "pg";
import { readFileSync } from "node:fs";
function fromEnv(k) {
  if (process.env[k]) return process.env[k];
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`, "m"));
      if (m) return m[1].trim();
    } catch {}
  }
}
const API_KEY = fromEnv("GEMINI_API_KEY");
const MODEL = "gemini-embedding-001";
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 });
const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);

async function embedQuery(text) {
  const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, {
    method: "POST",
    headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
    body: JSON.stringify({ model: `models/${MODEL}`, content: { parts: [{ text: text.slice(0, 8000) }] }, taskType: "RETRIEVAL_QUERY", outputDimensionality: 1536 }),
  });
  if (!resp.ok) throw new Error(`embed HTTP ${resp.status}`);
  const v = (await resp.json()).embedding.values;
  // L2-normalize and serialize as a pgvector literal.
  let s = 0;
  for (const x of v) s += x * x;
  const n = Math.sqrt(s) || 1;
  return `[${v.map((x) => x / n).join(",")}]`;
}

// Map an issue.repo reference (bare DID or at://owner/sh.tangled.repo/rkey) -> knot repoDid in readmes.
async function resolveRepoDid(ref) {
  if (!ref) return null;
  if (ref.startsWith("at://")) {
    const m = ref.match(/^at:\/\/([^/]+)\/[^/]+\/(.+)$/);
    if (!m) return null;
    const r = await pool.query(`select coalesce(repo_did, record_raw->>'repoDid') as rd from tangled_repos where owner_did=$1 and rkey=$2 limit 1`, [m[1], m[2]]);
    return r.rows[0]?.rd ?? null;
  }
  return ref; // bare DID == repoDid
}

async function fetchIssues() {
  const rows = (await pool.query(`
    select distinct tr.owner_did, pa.pds_host
    from tangled_repos tr join tangled_pds_accounts pa on pa.did = tr.owner_did
    where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null)
    limit 120`)).rows;
  const found = [];
  const q = rows.slice();
  await Promise.all(Array.from({ length: 14 }, async () => {
    while (q.length) {
      const r = q.pop();
      const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`;
      try {
        const ctrl = new AbortController();
        const t = setTimeout(() => ctrl.abort(), 10000);
        const resp = await fetch(url, { signal: ctrl.signal });
        clearTimeout(t);
        if (!resp.ok) continue;
        const j = await resp.json();
        for (const rec of j.records ?? []) if (rec.value?.title) found.push(rec.value);
      } catch {}
    }
  }));
  return found;
}

async function main() {
  const issues = await fetchIssues();
  console.log(`fetched ${issues.length} live issues\n`);
  // attach resolved repoDid + whether embedded; prefer substantive bodies whose repo is embedded
  for (const iss of issues) {
    iss._repoDid = await resolveRepoDid(iss.repo);
    iss._embedded = iss._repoDid
      ? (await pool.query(`select repo_name from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [iss._repoDid])).rows[0]?.repo_name ?? null
      : null;
  }
  const pick = issues
    .filter((i) => (i.body ?? "").length > 60)
    .sort((a, b) => (b._embedded ? 1 : 0) - (a._embedded ? 1 : 0) || (b.body?.length ?? 0) - (a.body?.length ?? 0))
    .slice(0, 4);

  for (const iss of pick) {
    console.log("\n" + "=".repeat(72));
    console.log(`ISSUE: ${iss.title}`);
    console.log(`own repo: ${iss._embedded ? iss._embedded + " (embedded ✓)" : "(parent README not embedded / unresolved)"}`);
    console.log(`body: ${(iss.body ?? "").replace(/\s+/g, " ").slice(0, 200)}…`);
    const qvec = await embedQuery(`${iss.title}\n\n${iss.body ?? ""}`);
    const hits = (await pool.query(`
      select repo_name, repo_did, round((embedding <=> $1::vector)::numeric,4) dist, (repo_did=$2) is_parent
      from tangled_readmes where embedding is not null
      order by embedding <=> $1::vector limit 8`, [qvec, iss._repoDid])).rows;
    console.log("top README matches:");
    hits.forEach((h, i) => console.log(` ${i + 1}. ${h.is_parent ? "👉" : " "} ${(h.repo_name ?? "(no name)").padEnd(34)} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`));
    if (iss._embedded) {
      const rnk = (await pool.query(`
        select 1 + count(*)::int rnk from tangled_readmes
        where embedding is not null and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`,
        [qvec, iss._repoDid])).rows[0].rnk;
      console.log(` → own repo overall rank: #${rnk} of all embedded READMEs`);
    }
  }
  await pool.end();
}
main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
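The reference handling in `resolveRepoDid` splits into a pure parsing step and a DB lookup. The parsing step alone could look like this in Python; it is a sketch that reuses the script's regular expression, and the function name `parse_repo_ref`, the tagged-tuple return shape, and the example DID and rkey are all hypothetical.

```python
import re

# Same pattern as the script: at://<owner DID>/<collection>/<rkey>
_AT_URI = re.compile(r"^at://([^/]+)/[^/]+/(.+)$")


def parse_repo_ref(ref):
    """Classify an issue's repo reference.

    Returns ("at_uri", owner_did, rkey) for an AT URI (needs a tangled_repos
    lookup to reach the repoDid), ("did", ref) for a bare DID (already the
    repoDid), or None when the reference is missing or malformed.
    """
    if not ref:
        return None
    if ref.startswith("at://"):
        m = _AT_URI.match(ref)
        return ("at_uri", m.group(1), m.group(2)) if m else None
    return ("did", ref)


print(parse_repo_ref("at://did:plc:example123/sh.tangled.repo/3kexamplerkey"))
print(parse_repo_ref("did:plc:example123"))
```

Keeping the parse separate from the query makes the malformed-URI and bare-DID branches testable without a database.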
+84
recommendation/reference/src/issue_search.mjs
// Experiment: embed a Tangled issue as a query and vector-search the README embeddings.
// Validates the matching: (a) does the issue's OWN repo rank highly? (b) are other hits topical?
import pg from "pg";
import { readFileSync } from "node:fs";

function fromEnv(key) {
  if (process.env[key]) return process.env[key];
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)$`, "m"));
      if (m) return m[1].trim().replace(/^["']|["']$/g, "");
    } catch {}
  }
}
const CONN = fromEnv("DB_CONNECTION_STRING");
const API_KEY = fromEnv("GEMINI_API_KEY");
const MODEL = "gemini-embedding-001";
const N = parseInt(process.env.ISSUES ?? "3", 10);

const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 3 });

async function embedQuery(text) {
  const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, {
    method: "POST",
    headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
    body: JSON.stringify({
      model: `models/${MODEL}`,
      content: { parts: [{ text: text.slice(0, 8000) }] },
      taskType: "RETRIEVAL_QUERY",
      outputDimensionality: 1536,
    }),
  });
  if (!resp.ok) throw new Error(`embed HTTP ${resp.status}: ${(await resp.text()).slice(0, 200)}`);
  const v = (await resp.json()).embedding.values;
  // L2-normalize and serialize as a pgvector literal.
  let s = 0;
  for (const x of v) s += x * x;
  const n = Math.sqrt(s) || 1;
  return `[${v.map((x) => x / n).join(",")}]`;
}

async function main() {
  const total = (await pool.query(`select count(*)::int n from tangled_issues`)).rows[0].n;
  console.log(`tangled_issues total: ${total}`);
  const joinable = (await pool.query(`
    select count(*)::int n from tangled_issues i
    where exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null)`)).rows[0].n;
  console.log(`issues whose parent repo has an embedded README: ${joinable}\n`);
  if (joinable === 0) { console.log("No joinable issues — cannot run the own-repo sanity check."); await pool.end(); return; }

  // Pick a few substantive issues (decent body) whose repo is embedded.
  const issues = (await pool.query(`
    select i.uri, i.repo_did, i.title, i.body,
           (select repo_name from tangled_readmes r where r.repo_did = i.repo_did limit 1) as parent_repo
    from tangled_issues i
    where i.title is not null and length(coalesce(i.body,'')) > 80
      and exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null)
    order by length(i.body) desc
    limit ${N}`)).rows;

  for (const iss of issues) {
    const queryText = `${iss.title}\n\n${iss.body}`;
    console.log("\n" + "=".repeat(70));
    console.log(`ISSUE: ${iss.title}`);
    console.log(`parent repo: ${iss.parent_repo} (${iss.repo_did})`);
    console.log(`body: ${iss.body.replace(/\s+/g, " ").slice(0, 180)}…`);
    const qvec = await embedQuery(queryText);
    const hits = (await pool.query(`
      select repo_name, repo_did, round((embedding <=> $1::vector)::numeric, 4) as dist,
             (repo_did = $2) as is_parent
      from tangled_readmes
      where embedding is not null
      order by embedding <=> $1::vector
      limit 8`, [qvec, iss.repo_did])).rows;
    console.log("top README matches:");
    hits.forEach((h, idx) => {
      console.log(` ${idx + 1}. ${h.is_parent ? "👉 " : " "}${h.repo_name?.padEnd(32) ?? "(no name)"} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`);
    });
    // Where does the own repo rank overall?
    const rank = (await pool.query(`
      select 1 + count(*)::int as rnk
      from tangled_readmes
      where embedding is not null
        and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`,
      [qvec, iss.repo_did])).rows[0].rnk;
    console.log(` → own repo overall rank: #${rank} of all embedded READMEs`);
  }
  await pool.end();
}
main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
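The final query computes the own repo's rank as one plus the number of READMEs strictly closer to the issue embedding. The same rule is sketched below in Python over an in-memory map of distances; `overall_rank` and the sample values are illustrative, not taken from a real run.

```python
def overall_rank(distances, own_key):
    """1-based rank of own_key: 1 + count of entries strictly closer.

    Ties with the own repo do not push it down, mirroring the strict `<`
    comparison in the SQL.
    """
    own = distances[own_key]
    return 1 + sum(1 for d in distances.values() if d < own)


# Hypothetical cosine distances from one issue embedding to four READMEs.
distances = {"repo-a": 0.10, "repo-b": 0.30, "repo-c": 0.20, "repo-d": 0.20}
print(overall_rank(distances, "repo-c"))  # 2
```

Because the comparison is strict, `repo-c` and `repo-d` both rank second here, which is why the script can report a rank even when several READMEs tie.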
import pg from "pg";
import { readFileSync } from "node:fs";
function conn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)$/m);
      if (m) return m[1].trim();
    } catch {}
  }
}
const pool = new pg.Pool({ connectionString: conn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== embedded rows: dims + L2 norm ===");
console.table((await pool.query(`
  select repo_name, embedding_model,
         vector_dims(embedding) as dims,
         round(sqrt((select sum(x*x) from unnest(embedding::real[]) x))::numeric, 5) as l2_norm
  from tangled_readmes where embedding is not null
  order by embedded_at desc limit 5`)).rows);

console.log("\n=== nearest-neighbor sanity (cosine) for one embedded repo ===");
const seed = (await pool.query(`select repo_did, repo_name from tangled_readmes where embedding is not null limit 1`)).rows[0];
if (seed) {
  console.log(`seed: ${seed.repo_name} (${seed.repo_did})`);
  const nn = await pool.query(`
    select repo_name, round((embedding <=> (select embedding from tangled_readmes where repo_did=$1))::numeric, 4) as cosine_dist
    from tangled_readmes
    where embedding is not null and repo_did <> $1
    order by embedding <=> (select embedding from tangled_readmes where repo_did=$1)
    limit 5`, [seed.repo_did]);
  console.table(nn.rows);
}
await pool.end();
recommendation/tests/__init__.py
This is a binary file and will not be displayed.
+30
recommendation/tests/test_dedup.py
from app.dedup import content_hash, collapse_forks
from app.types import Candidate


def test_content_hash_is_deterministic_and_prefix_based():
    a = content_hash("hello world" + "x" * 1000)
    b = content_hash("hello world" + "x" * 1000)
    assert a == b
    # only the first 500 chars matter -> differing tails hash the same
    assert content_hash("p" * 500 + "AAA") == content_hash("p" * 500 + "BBB")


def test_content_hash_handles_none_and_empty():
    assert content_hash(None) == content_hash("")


def _cand(key, h, dist):
    return Candidate(key=key, content_hash=h, distance=dist, seeds=["s"])


def test_collapse_forks_keeps_min_distance_per_content():
    cands = [
        _cand("repoA", "samehash", 0.20),
        _cand("repoB", "samehash", 0.10),  # fork with closer distance -> winner
        _cand("repoC", "other", 0.30),
    ]
    out = collapse_forks(cands)
    keys = {c.key for c in out}
    assert keys == {"repoB", "repoC"}
    assert len(out) == 2
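For orientation, here is one way the behaviour pinned down by these tests could be implemented. This is a sketch under stated assumptions, not the contents of `app/dedup.py`: the SHA-256 choice, the `PREFIX_CHARS` constant name, and this stand-in `Candidate` dataclass are all hypothetical; only the 500-character prefix, the `None`/empty equivalence, and the keep-the-closest rule come from the tests.

```python
import hashlib
from dataclasses import dataclass, field

PREFIX_CHARS = 500  # the tests imply only the first 500 chars are hashed


@dataclass
class Candidate:  # stand-in for app.types.Candidate (assumed shape)
    key: str
    content_hash: str
    distance: float
    seeds: list = field(default_factory=list)


def content_hash(text):
    """Hash the README prefix; None is treated like an empty string."""
    prefix = (text or "")[:PREFIX_CHARS]
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def collapse_forks(cands):
    """Keep one candidate per content hash: the one with the smallest distance."""
    best = {}
    for c in cands:
        current = best.get(c.content_hash)
        if current is None or c.distance < current.distance:
            best[c.content_hash] = c
    return list(best.values())


out = collapse_forks([
    Candidate("repoA", "samehash", 0.20),
    Candidate("repoB", "samehash", 0.10),
    Candidate("repoC", "other", 0.30),
])
print(sorted(c.key for c in out))  # ['repoB', 'repoC']
```

Hashing only a prefix keeps forks that merely append to a README in the same bucket, at the cost of conflating unrelated READMEs that share a long boilerplate opening.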
from app.merge import merge_hits


def hit(repo_did, content, distance):
    return {"repo_did": repo_did, "content": content, "distance": distance}


def test_consensus_accumulates_across_seeds():
    per_seed = [
        ("seed-nix", [hit("R1", "nix stuff", 0.18), hit("R2", "cli stuff", 0.25)]),
        ("seed-cli", [hit("R1", "nix stuff", 0.12), hit("R3", "web stuff", 0.22)]),
    ]
    cands = merge_hits(per_seed, seed_content_hashes=set())
    by_key = {c.key: c for c in cands}

    # R1 surfaced by both seeds -> consensus 2, best (min) distance, primary = closer seed
    assert by_key["R1"].consensus == 2
    assert by_key["R1"].distance == 0.12
    assert by_key["R1"].primary_seed == "seed-cli"
    # R2/R3 surfaced once
    assert by_key["R2"].consensus == 1
    assert by_key["R3"].consensus == 1


def test_skips_user_own_forks_by_content_hash():
    from app.dedup import content_hash

    own = content_hash("my own readme")
    per_seed = [("seed", [hit("R1", "my own readme", 0.05), hit("R2", "fresh", 0.2)])]
    cands = merge_hits(per_seed, seed_content_hashes={own})
    keys = {c.key for c in cands}
    assert keys == {"R2"}
+22
recommendation/tests/test_profile.py
···11+from app.profile import build_interests
22+33+44+def test_interests_aggregate_topics_by_frequency():
55+ seeds = [
66+ {"topics": ["nix", "cli"]},
77+ {"topics": ["nix", "atproto"]},
88+ {"topics": ["nix"]},
99+ {"topics": None}, # tolerate missing topics
1010+ {"topics": ["CLI Tools"]}, # slug normalizes
1111+ ]
1212+ interests = build_interests(seeds, max_interests=5)
1313+ labels = [i["label"] for i in interests]
1414+ slugs = [i["slug"] for i in interests]
1515+ assert labels[0] == "nix" # most frequent first
1616+ assert "cli-tools" in slugs # multi-word topic is slugified
1717+ assert len(interests) <= 5
1818+ assert all(set(i.keys()) == {"label", "slug"} for i in interests)
1919+2020+2121+def test_interests_empty_when_no_topics():
2222+ assert build_interests([{"topics": None}, {}], max_interests=5) == []
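The profile tests above imply a frequency count over seed topics, with a slug derived from each label. A minimal sketch that satisfies those assertions is shown here. The slug rule (lowercase, alphanumeric runs joined by hyphens) and the decision to keep the first label seen for each slug are assumptions; the real `app.profile` implementation is not part of this diff.

```python
import re
from collections import Counter


def slugify(label):
    """Lowercase and join alphanumeric runs with hyphens, e.g. "CLI Tools" -> "cli-tools"."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def build_interests(seeds, max_interests=5):
    """Return up to `max_interests` topics, most frequent first."""
    counts = Counter()
    labels = {}  # slug -> first label seen for that slug (an assumption of this sketch)
    for seed in seeds:
        # Tolerate both a missing "topics" key and an explicit None.
        for topic in seed.get("topics") or []:
            slug = slugify(topic)
            if not slug:
                continue
            counts[slug] += 1
            labels.setdefault(slug, topic)
    return [{"label": labels[s], "slug": s} for s, _ in counts.most_common(max_interests)]
```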
+75
recommendation/tests/test_quality.py
···11+"""Unit tests for the issue quality filter (pure, no DB).
22+33+Real examples are drawn from observed recommendation output: the engine was
44+surfacing throwaway/test issues (e.g. "hello, world" whose body is "test issue
55+to explore what tangled looks like") because issues are ranked purely by body
66+embedding similarity, with no quality signal. These tests pin down what we drop
77+and — just as importantly — what we must keep.
88+"""
99+1010+from __future__ import annotations
1111+1212+from app.quality import drop_issue, is_placeholder_issue, is_test_repo
1313+1414+# --- repos that are clearly sandboxes / test scratchpads -------------------------
1515+def test_test_repo_by_name():
1616+ assert is_test_repo("tngl-mcp-test", "")
1717+ assert is_test_repo("test-repo", "")
1818+ assert is_test_repo("test100", "")
1919+ assert is_test_repo("sandbox", "")
2020+ assert is_test_repo("playground", "")
2121+ assert is_test_repo("my-demo", "")
2222+2323+2424+def test_test_repo_by_description():
2525+ assert is_test_repo("blaaaa", "adadadaddaaddada") # gibberish description
2626+ assert is_test_repo("whatever", "just a test")
2727+ assert is_test_repo("x", "this is a test")
2828+2929+3030+def test_real_repos_are_not_flagged():
3131+ assert not is_test_repo("knot-docker", "Docker config for a Tangled knotserver")
3232+ assert not is_test_repo("tangled-cli", "CLI for Tangled")
3333+ assert not is_test_repo("hydrant", "an atproto crawler")
3434+ assert not is_test_repo("drifting-starlight", "")
3535+ assert not is_test_repo("latest", "") # 'test' is a substring, not a token
3636+ assert not is_test_repo("fastest-router", "")
3737+ assert not is_test_repo("contest-platform", "")
3838+3939+4040+# --- placeholder / test issues ---------------------------------------------------
4141+def test_placeholder_issue_titles():
4242+ assert is_placeholder_issue("hello, world", "")
4343+ assert is_placeholder_issue("CLI test issue", "")
4444+ assert is_placeholder_issue("Test Issue", "")
4545+ assert is_placeholder_issue("[READ-ONLY]", "")
4646+4747+4848+def test_placeholder_issue_bodies():
4949+ assert is_placeholder_issue("hello, world", "test issue to explore what tangled looks like\n- and so on")
5050+ assert is_placeholder_issue("[READ-ONLY]", "this is a read-only mirror of https://github.com/npmx-dev/npmx")
5151+ assert is_placeholder_issue("untitled", "Testing programmatic access to Tangled via tang CLI")
5252+ assert is_placeholder_issue("x", "just testing, ignore this")
5353+ assert is_placeholder_issue("x", "lorem ipsum dolor sit amet")
5454+5555+5656+def test_real_issues_are_not_flagged():
5757+ assert not is_placeholder_issue(
5858+ "`KNOT_REPO_SCAN_PATH` doesn't seem to be respected",
5959+ "i've been hosting a knot for the past few versions and the log shows...",
6060+ )
6161+ assert not is_placeholder_issue("PR Phase 2: Reviewer Workflow (Commenting and Reviews)", "Implement the reviewer workflow")
6262+ assert not is_placeholder_issue("Finish migration from GitHub to Tangled", "- [x] Remove dependabot.yml file")
6363+ assert not is_placeholder_issue("[crawler] add `com.atproto.sync.listReposByCollection` support", "right now we use describeRepo")
6464+ assert not is_placeholder_issue("Improve Documentation", "Lots of little jobs here")
6565+ assert not is_placeholder_issue("Add tests for the ranker", "we need more test coverage on the scorer") # legit work *about* tests
6666+6767+6868+# --- combined gate used by the engine --------------------------------------------
6969+def test_drop_issue_combines_both_signals():
7070+ # dropped because the repo is a test repo (even if the issue title looks fine)
7171+ assert drop_issue("tngl-mcp-test", "", "Add README with project overview", "real-sounding body")
7272+ # dropped because the issue body is a placeholder (even if the repo looks fine)
7373+ assert drop_issue("static", "", "hello, world", "test issue to explore what tangled looks like")
7474+ # kept: real repo + real issue
7575+ assert not drop_issue("knot-docker", "Docker config", "KNOT_REPO_SCAN_PATH bug", "the scan path is ignored")
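The name-based cases above hinge on treating `test` as a token rather than a substring, so that `latest` and `contest-platform` survive while `test100` and `tngl-mcp-test` are dropped. The sketch below covers only that name check. The marker list and the function name `looks_like_test_name` are hypothetical; the description heuristics (such as detecting a gibberish description) and the placeholder-issue rules in the real `app.quality` are not reproduced here.

```python
import re

# Hypothetical marker list. A marker only counts when it is not glued to other
# letters, so "latest" or "fastest-router" do not match on the embedded "test".
_TEST_NAME = re.compile(r"(?<![a-z])(?:test\d*|sandbox|playground|demo)(?![a-z])")


def looks_like_test_name(repo_name):
    """True when the repo name contains a sandbox/test marker as its own token."""
    return bool(_TEST_NAME.search(repo_name.lower()))
```

Letter-based lookarounds are used instead of `\b` because `\b` treats digits as word characters, which would make `test100` fail to match.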
···11+# Connection string for the SHARED Postgres database.
22+# This database is OWNED BY THE DATA-COLLECTION TEAMMATE.
33+# Existing tables are read-only for the rec engine EXCEPT the embedding columns of
44+# tangled_readmes (embedding / embedding_model / embedded_at), which we fill.
55+DB_CONNECTION_STRING=postgresql://user:password@host:5432/postgres
66+77+# Google Gemini API key (Google AI Studio) for README embeddings.
88+# Model: gemini-embedding-001 at outputDimensionality=1536 (matches the vector(1536) column).
99+GEMINI_API_KEY=your-gemini-api-key
1010+# Optional override:
1111+# GEMINI_EMBED_MODEL=gemini-embedding-001
···11+import pg from "pg";
22+import { readFileSync } from "node:fs";
33+function loadConn() {
44+ if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
55+ for (const p of ["../.env", ".env", "../../.env"]) {
66+ try {
77+ const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m);
88+ if (m) return m[1].trim();
99+ } catch {}
1010+ }
1111+ throw new Error("DB_CONNECTION_STRING not found");
1212+}
1313+const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });
1414+1515+console.log("=== all tables/views (every schema) ===");
1616+console.table((await pool.query(`
1717+ select table_schema, table_name, table_type
1818+ from information_schema.tables
1919+ where table_schema not in ('pg_catalog','information_schema')
2020+ order by table_schema, table_name`)).rows);
2121+2222+console.log("\n=== columns matching 'embed', 'readme', or 'vector' (any table) ===");
2323+const hits = await pool.query(`
2424+ select table_schema, table_name, column_name, data_type
2525+ from information_schema.columns
2626+ where table_schema not in ('pg_catalog','information_schema')
2727+ and (column_name ~* 'embed|readme|vector')
2828+ order by table_schema, table_name, ordinal_position`);
2929+console.table(hits.rows.length ? hits.rows : [{ note: "no columns named embed*/readme*/vector*" }]);
3030+3131+console.log("\n=== tables matching 'embed' or 'readme' by NAME ===");
3232+const tn = await pool.query(`
3333+ select table_schema, table_name from information_schema.tables
3434+ where table_name ~* 'embed|readme'
3535+ order by 1,2`);
3636+console.table(tn.rows.length ? tn.rows : [{ note: "no table named embed*/readme*" }]);
3737+3838+// For every matching column/table, print the table's row count (a table can appear more than once).
3939+for (const r of [...hits.rows, ...tn.rows]) {
4040+ const t = `"${r.table_schema}"."${r.table_name}"`;
4141+ try {
4242+ const c = await pool.query(`select count(*)::int n from ${t}`);
4343+ console.log(`count ${t}: ${c.rows[0].n}`);
4444+ } catch (e) { /* ignore dup */ }
4545+}
4646+4747+console.log("\n=== columns on tangled_repos (did a readme/embedding col get added here?) ===");
4848+console.table((await pool.query(`
4949+ select column_name, data_type from information_schema.columns
5050+ where table_schema='public' and table_name='tangled_repos' order by ordinal_position`)).rows);
5151+5252+await pool.end();
+51
recommendationold/src/check_readmes.mjs
···11+import pg from "pg";
22+import { readFileSync } from "node:fs";
33+function loadConn() {
44+ if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
55+ for (const p of ["../.env", ".env"]) {
66+ try { const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); if (m) return m[1].trim(); } catch {}
77+ }
88+ throw new Error("DB_CONNECTION_STRING not found");
99+}
1010+const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });
1111+1212+console.log("=== full tangled_readmes columns ===");
1313+console.table((await pool.query(`
1414+ select c.ordinal_position, c.column_name, c.data_type, c.udt_name, c.is_nullable
1515+ from information_schema.columns c
1616+ where c.table_schema='public' and c.table_name='tangled_readmes'
1717+ order by c.ordinal_position`)).rows);
1818+1919+console.log("\n=== embedding column: pgvector dimensions ===");
2020+console.table((await pool.query(`
2121+ select a.attname, format_type(a.atttypid, a.atttypmod) as type
2222+ from pg_attribute a
2323+ join pg_class c on c.oid=a.attrelid join pg_namespace n on n.oid=c.relnamespace
2424+ where n.nspname='public' and c.relname='tangled_readmes' and a.attnum>0 and not a.attisdropped
2525+ and format_type(a.atttypid,a.atttypmod) ~* 'vector'`)).rows);
2626+2727+console.log("\n=== counts ===");
2828+console.table((await pool.query(`
2929+ select
3030+ count(*)::int total,
3131+ count(*) filter (where embedding is not null)::int with_embedding,
3232+ count(distinct embedding_model)::int models
3333+ from tangled_readmes`)).rows);
3434+3535+console.log("\n=== embedding_model values ===");
3636+console.table((await pool.query(`select embedding_model, count(*)::int n from tangled_readmes group by 1 order by 2 desc`)).rows);
3737+3838+console.log("\n=== indexes on tangled_readmes (ivfflat/hnsw?) ===");
3939+console.table((await pool.query(`select indexname, indexdef from pg_indexes where schemaname='public' and tablename='tangled_readmes'`)).rows);
4040+4141+console.log("\n=== sample row (text truncated, embedding omitted) ===");
4242+const s = await pool.query(`select * from tangled_readmes limit 1`);
4343+if (s.rows.length) {
4444+ const r = { ...s.rows[0] };
4545+ for (const k of Object.keys(r)) {
4646+ if (k === "embedding") r[k] = `<vector len=${typeof r[k] === "string" ? (r[k].match(/,/g)?.length ?? 0) + 1 : "?"}>`;
4747+ else if (typeof r[k] === "string" && r[k].length > 160) r[k] = r[k].slice(0, 160) + "…";
4848+ }
4949+ console.log(JSON.stringify(r, null, 2));
5050+}
5151+await pool.end();
+80
recommendationold/src/clustered_recommend.mjs
···11+// Cluster-then-retrieve recommender: preserves a user's multiple distinct interests.
22+// Contrasts NAIVE pooled top-K (one cluster can dominate) vs CLUSTERED round-robin (balanced).
33+import pg from "pg";
44+import { readFileSync } from "node:fs";
55+import { createHash } from "node:crypto";
66+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
77+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 });
88+99+const USER = process.env.USER_DID || "did:plc:y7g2koy4nqw7434s67fgfjca";
1010+const K = parseInt(process.env.K ?? "10", 10);
1111+const T = parseFloat(process.env.CLUSTER_T ?? "0.22"); // cosine-dist threshold to consider two seeds "same interest"
1212+const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex");
1313+const parseVec = (s) => s.replace(/^\[|\]$/g, "").split(",").map(Number);
1414+const cosDist = (a, b) => { let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return 1 - d; }; // assumes unit-norm vectors (embeddings are L2-normalized when written)
1515+1616+async function main() {
1717+ // 1) the user's contributed repos (here: owned) with embeddings
1818+ const seeds = (await pool.query(
1919+ `select repo_did, repo_name, content, embedding::text as etext
2020+ from tangled_readmes where embedding is not null and repo_uri like $1`, [`at://${USER}/%`])).rows;
2121+ if (seeds.length < 2) { console.log("not enough embedded seed repos for this user"); await pool.end(); return; }
2222+ seeds.forEach((s) => (s.vec = parseVec(s.etext)));
2323+ console.log(`USER ${USER}`);
2424+ console.log(`contributed repos (${seeds.length}): ${seeds.map((s) => s.repo_name).join(", ")}\n`);
2525+2626+ // 2) cluster seeds: single-linkage connected components at threshold T (union-find)
2727+ const parent = seeds.map((_, i) => i);
2828+ const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
2929+ for (let i = 0; i < seeds.length; i++)
3030+ for (let j = i + 1; j < seeds.length; j++)
3131+ if (cosDist(seeds[i].vec, seeds[j].vec) < T) parent[find(i)] = find(j);
3232+ const clusters = new Map();
3333+ seeds.forEach((s, i) => { const r = find(i); (clusters.get(r) ?? clusters.set(r, []).get(r)).push(s); });
3434+ const clusterList = [...clusters.values()];
3535+ console.log(`→ ${clusterList.length} interest cluster(s):`);
3636+ clusterList.forEach((c, i) => console.log(` [${i + 1}] ${c.map((s) => s.repo_name).join(", ")}`));
3737+3838+ // 3) retrieve neighbors per seed (drop user's own repos), tag with cluster + min dist
3939+ const ownRepoDids = new Set(seeds.map((s) => s.repo_did));
4040+ const seenContent = new Set(seeds.map((s) => hash(s.content)));
4141+ // candidate -> { repo_name, dist, clusterIdx }
4242+ const cand = new Map();
4343+ for (let ci = 0; ci < clusterList.length; ci++) {
4444+ for (const seed of clusterList[ci]) {
4545+ const rows = (await pool.query(
4646+ `select repo_name, repo_did, content, round((embedding <=> $1::vector)::numeric,4) dist
4747+ from tangled_readmes where embedding is not null and repo_did <> all($2)
4848+ order by embedding <=> $1::vector limit 25`, [seed.etext, [...ownRepoDids]])).rows;
4949+ for (const r of rows) {
5050+ const h = hash(r.content);
5151+ if (seenContent.has(h)) continue; // collapse forks / user's own content
5252+ const prev = cand.get(h);
5353+ const dist = Number(r.dist);
5454+ if (!prev || dist < prev.dist) cand.set(h, { repo_name: r.repo_name, dist, clusterIdx: ci });
5555+ }
5656+ }
5757+ }
5858+ const all = [...cand.values()];
5959+6060+ // 4a) NAIVE pooled: global top-K by distance
6161+ const naive = [...all].sort((a, b) => a.dist - b.dist).slice(0, K);
6262+6363+ // 4b) CLUSTERED round-robin: rank within each cluster, then take turns → balanced coverage
6464+ const perCluster = clusterList.map((_, ci) => all.filter((c) => c.clusterIdx === ci).sort((a, b) => a.dist - b.dist));
6565+ const clustered = [];
6666+ const used = new Set();
6767+ for (let round = 0; clustered.length < K && round < 50; round++) {
6868+ for (let ci = 0; ci < perCluster.length && clustered.length < K; ci++) {
6969+      const next = perCluster[ci].find((c) => !used.has(c)); // track candidate objects: distinct repos can share a repo_name
7070+      if (next) { used.add(next); clustered.push(next); }
7171+ }
7272+ }
7373+7474+ const fmt = (arr) => arr.map((c, i) => ` ${String(i + 1).padStart(2)}. ${(c.repo_name ?? "?").padEnd(30)} dist=${c.dist} [interest ${c.clusterIdx + 1}]`).join("\n");
7575+ const cover = (arr) => { const s = new Set(arr.map((c) => c.clusterIdx)); return `${s.size}/${clusterList.length} interests`; };
7676+ console.log(`\n===== NAIVE pooled top-${K} (covers ${cover(naive)}) =====\n${fmt(naive)}`);
7777+ console.log(`\n===== CLUSTERED round-robin top-${K} (covers ${cover(clustered)}) =====\n${fmt(clustered)}`);
7878+ await pool.end();
7979+}
8080+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
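The script above groups a user's seed repos into interests by single-linkage clustering (two seeds join when their cosine distance is under a threshold) and then takes turns across clusters when filling the top-K. A compact Python sketch of those two steps is given below. It works on plain lists of floats rather than database rows, assumes the vectors are already unit-norm, and the function names are illustrative rather than taken from the repo.

```python
def cosine_distance(a, b):
    """1 - dot product; equals cosine distance only for unit-norm vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cluster_seeds(vectors, threshold=0.22):
    """Single-linkage connected components via union-find; returns lists of indices."""
    parent = list(range(len(vectors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def round_robin(ranked_per_cluster, k):
    """Take the next-best item from each cluster in turn until k items are picked."""
    out = []
    depth = 0
    while len(out) < k and any(depth < len(r) for r in ranked_per_cluster):
        for ranked in ranked_per_cluster:
            if depth < len(ranked) and len(out) < k:
                out.append(ranked[depth])
        depth += 1
    return out
```

Single linkage can chain: A joins B and B joins C even when A and C are far apart, so a low threshold is the safer default when the goal is to keep interests separate.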
···11+// Embed all unembedded READMEs in tangled_readmes using Google Gemini embeddings.
22+//
33+// - Reads the worklist (status='found' AND content IS NOT NULL AND embedding IS NULL),
44+// the exact predicate behind tangled_readmes_unembedded_idx.
55+// - Embeds doc = "# <name>\n\n<description>\n\n<README>" with gemini-embedding-001 at
66+// outputDimensionality=1536 (matches the vector(1536) column), task RETRIEVAL_DOCUMENT.
77+// - L2-normalizes (sub-3072 MRL dims aren't auto-normalized) so the HNSW cosine index is happy.
88+// - UPDATEs only the embedding columns, only where embedding IS NULL → idempotent / re-runnable.
99+//
1010+// Env: DB_CONNECTION_STRING (or ../.env), GEMINI_API_KEY (required).
1111+// Optional: LIMIT (0=all), CONCURRENCY (default 4), DRY_RUN=1 (count only), MAX_CHARS (default 8000).
1212+1313+import pg from "pg";
1414+import { readFileSync } from "node:fs";
1515+1616+function fromEnvFile(key) {
1717+ for (const p of ["../.env", ".env", "../../.env"]) {
1818+ try {
1919+ const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)\\s*$`, "m"));
2020+ if (m) return m[1].trim().replace(/^["']|["']$/g, "");
2121+ } catch {}
2222+ }
2323+ return undefined;
2424+}
2525+2626+const CONN = process.env.DB_CONNECTION_STRING || fromEnvFile("DB_CONNECTION_STRING");
2727+const API_KEY = process.env.GEMINI_API_KEY || fromEnvFile("GEMINI_API_KEY");
2828+const MODEL = process.env.GEMINI_EMBED_MODEL || fromEnvFile("GEMINI_EMBED_MODEL") || "gemini-embedding-001";
2929+const DIMS = 1536;
3030+const LIMIT = parseInt(process.env.LIMIT ?? "0", 10);
3131+const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "4", 10);
3232+const MAX_CHARS = parseInt(process.env.MAX_CHARS ?? "8000", 10);
3333+const DRY_RUN = process.env.DRY_RUN === "1";
3434+3535+if (!CONN) { console.error("DB_CONNECTION_STRING not set"); process.exit(1); }
3636+if (!API_KEY && !DRY_RUN) { console.error("GEMINI_API_KEY not set (add it to recommendation/.env)"); process.exit(1); }
3737+3838+const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 5 });
3939+const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
4040+4141+function buildDoc({ repo_name, description, content }) {
4242+ const parts = [];
4343+ if (repo_name) parts.push(`# ${repo_name}`);
4444+ if (description && description.trim()) parts.push(description.trim());
4545+ parts.push(content);
4646+ return parts.join("\n\n").slice(0, MAX_CHARS);
4747+}
4848+4949+function l2normalize(v) {
5050+ let s = 0;
5151+ for (const x of v) s += x * x;
5252+ const n = Math.sqrt(s) || 1;
5353+ return v.map((x) => x / n);
5454+}
5555+5656+async function embedOnce(text, dims) {
5757+ const url = `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`;
5858+ const body = {
5959+ model: `models/${MODEL}`,
6060+ content: { parts: [{ text }] },
6161+ taskType: "RETRIEVAL_DOCUMENT",
6262+ outputDimensionality: dims,
6363+ };
6464+ const resp = await fetch(url, {
6565+ method: "POST",
6666+ headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
6767+ body: JSON.stringify(body),
6868+ });
6969+ const txt = await resp.text();
7070+ if (!resp.ok) {
7171+ const err = new Error(`HTTP ${resp.status}: ${txt.slice(0, 200)}`);
7272+ err.status = resp.status;
7373+ throw err;
7474+ }
7575+ const j = JSON.parse(txt);
7676+ const values = j?.embedding?.values;
7777+ if (!Array.isArray(values)) throw new Error(`no embedding in response: ${txt.slice(0, 150)}`);
7878+ return values;
7979+}
8080+8181+// Embed with retries; on 400 (often too-long input) halve the input and retry while it is longer than 2000 chars.
8282+async function embedWithRetry(text) {
8383+ let attempt = 0;
8484+ let input = text;
8585+ while (true) {
8686+ try {
8787+ const v = await embedOnce(input, DIMS);
8888+ return l2normalize(v);
8989+ } catch (e) {
9090+ attempt++;
9191+ if (e.status === 400 && input.length > 2000) {
9292+ input = input.slice(0, Math.floor(input.length / 2));
9393+ continue;
9494+ }
9595+ if (attempt >= 5 || (e.status && e.status >= 400 && e.status < 500 && e.status !== 429)) {
9696+ throw e;
9797+ }
9898+ const backoff = Math.min(30000, 800 * 2 ** (attempt - 1));
9999+ await sleep(backoff);
100100+ }
101101+ }
102102+}
103103+104104+async function main() {
105105+ const worklistSql = `
106106+ select r.repo_did, r.repo_name, r.content,
107107+ coalesce(tr.record_raw->>'description', '') as description,
108108+ length(r.content) as len
109109+ from tangled_readmes r
110110+ left join tangled_repos tr
111111+ on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
112112+ where r.status = 'found' and r.content is not null and r.embedding is null
113113+ order by r.repo_did
114114+ ${LIMIT > 0 ? `limit ${LIMIT}` : ""}`;
115115+116116+ const { rows } = await pool.query(worklistSql);
117117+ const totalReadmes = (await pool.query(`select count(*)::int n from tangled_readmes`)).rows[0].n;
118118+ const alreadyEmbedded = (await pool.query(`select count(*)::int n from tangled_readmes where embedding is not null`)).rows[0].n;
119119+120120+ console.log(`tangled_readmes total=${totalReadmes} already embedded=${alreadyEmbedded}`);
121121+ console.log(`worklist (to embed now)=${rows.length} model=${MODEL} dims=${DIMS} concurrency=${CONCURRENCY}${LIMIT ? ` limit=${LIMIT}` : ""}`);
122122+ if (DRY_RUN) { console.log("\nDRY_RUN=1 → not embedding, not writing."); await pool.end(); return; }
123123+ if (rows.length === 0) { console.log("\nNothing to embed. ✔"); await pool.end(); return; }
124124+125125+ let done = 0, ok = 0, failed = 0;
126126+ const errors = [];
127127+ const queue = rows.slice();
128128+129129+ async function worker(id) {
130130+ while (queue.length) {
131131+ const r = queue.pop();
132132+ try {
133133+ const doc = buildDoc(r);
134134+ const vec = await embedWithRetry(doc);
135135+ const literal = `[${vec.join(",")}]`;
136136+ const res = await pool.query(
137137+ `update tangled_readmes
138138+ set embedding = $1::vector, embedding_model = $2, embedded_at = now()
139139+ where repo_did = $3 and embedding is null`,
140140+ [literal, MODEL, r.repo_did],
141141+ );
142142+ if (res.rowCount > 0) ok++;
143143+ } catch (e) {
144144+ failed++;
145145+ errors.push({ repo_did: r.repo_did, name: r.repo_name, err: e.message });
146146+ }
147147+ if (++done % 25 === 0 || done === rows.length) {
148148+ process.stderr.write(` ...${done}/${rows.length} (ok=${ok} fail=${failed})\n`);
149149+ }
150150+ }
151151+ }
152152+153153+ await Promise.all(Array.from({ length: CONCURRENCY }, (_, i) => worker(i)));
154154+155155+ console.log(`\n================ EMBEDDING DONE ================`);
156156+ console.log(`embedded ok : ${ok}`);
157157+ console.log(`failed : ${failed}`);
158158+ if (errors.length) {
159159+ console.log("\nfirst errors:");
160160+ for (const e of errors.slice(0, 10)) console.log(` ${e.name ?? e.repo_did}: ${e.err}`);
161161+ }
162162+ const remaining = (await pool.query(
163163+ `select count(*)::int n from tangled_readmes where status='found' and content is not null and embedding is null`,
164164+ )).rows[0].n;
165165+ console.log(`\nremaining unembedded (status=found): ${remaining}`);
166166+ await pool.end();
167167+}
168168+169169+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
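Two small pieces of the embedding script are worth isolating: the L2 normalization applied before writing a vector, and the capped exponential backoff between retries. The Python sketch below mirrors the constants used in the script (800 ms base, 30 s cap); the function names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def backoff_ms(attempt, base_ms=800, cap_ms=30_000):
    """Delay before retry number `attempt` (1-based): doubles each time, up to the cap."""
    return min(cap_ms, base_ms * 2 ** (attempt - 1))
```

With unit-norm vectors, the dot product equals cosine similarity, which is what lets the other scripts compute distance as `1 - dot`. Because the script gives up on the fifth failed attempt, the longest delay it actually sleeps is `backoff_ms(4)`, that is 6400 ms, so the 30 s cap only matters if the attempt limit is raised.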
+37
recommendationold/src/explore_users.mjs
···11+// Find owners with several embedded repos, and measure how SPREAD their repos are
22+// (high mean pairwise cosine distance = multi-interest user — good demo candidate).
33+import pg from "pg";
44+import { readFileSync } from "node:fs";
55+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
66+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });
77+88+const ownerDid = (uri) => uri ? uri.replace("at://", "").split("/")[0] : null;
99+function parseVec(s){ return s.replace(/^\[|\]$/g, "").split(",").map(Number); }
1010+function cos(a, b){ let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return d; } // already unit-norm
1111+1212+const owners = (await pool.query(`
1313+ select split_part(replace(repo_uri,'at://',''),'/',1) as owner_did,
1414+ count(*)::int n, array_agg(repo_name) as names
1515+ from tangled_readmes
1616+ where embedding is not null and repo_uri is not null
1717+ group by 1 having count(*) between 4 and 12
1818+ order by n desc limit 25`)).rows;
1919+2020+const scored = [];
2121+for (const o of owners) {
2222+ const rows = (await pool.query(
2323+ `select repo_name, embedding::text as e from tangled_readmes where embedding is not null and repo_uri like $1`,
2424+ [`at://${o.owner_did}/%`])).rows;
2525+ const vecs = rows.map((r) => parseVec(r.e));
2626+ let sum = 0, cnt = 0;
2727+ for (let i = 0; i < vecs.length; i++) for (let j = i + 1; j < vecs.length; j++) { sum += 1 - cos(vecs[i], vecs[j]); cnt++; }
2828+ const meanDist = cnt ? sum / cnt : 0;
2929+ scored.push({ owner_did: o.owner_did, n: o.n, meanDist: +meanDist.toFixed(3), names: rows.map((r) => r.repo_name) });
3030+}
3131+scored.sort((a, b) => b.meanDist - a.meanDist);
3232+console.log("most multi-interest owners (high mean pairwise README distance):\n");
3333+for (const s of scored.slice(0, 8)) {
3434+ console.log(`mean_dist=${s.meanDist} n=${s.n} ${s.owner_did}`);
3535+ console.log(` repos: ${s.names.join(", ")}\n`);
3636+}
3737+await pool.end();
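The spread score used above is the mean pairwise cosine distance over one owner's README embeddings. A short Python sketch of that measure follows, assuming unit-norm vectors as the script does; the function name is illustrative.

```python
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Average of (1 - dot product) over all unordered pairs; 0.0 with fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - sum(x * y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)
```

A higher value means the owner's repos point in more different directions in embedding space, which is why the script sorts owners by this score in descending order.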
+42
recommendationold/src/fetch_issues.mjs
···11+// Fetch real sh.tangled.repo.issue records live from repo-owner PDSes.
22+import pg from "pg";
33+import { readFileSync } from "node:fs";
44+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
55+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });
66+77+// Owners of embedded repos, with their PDS host.
88+const rows = (await pool.query(`
99+ select distinct tr.owner_did, pa.pds_host,
1010+ (select repo_name from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null limit 1) as a_repo
1111+ from tangled_repos tr
1212+ join tangled_pds_accounts pa on pa.did = tr.owner_did
1313+ where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null)
1414+ limit 80`)).rows;
1515+await pool.end();
1616+1717+console.log(`probing ${rows.length} owner PDSes for sh.tangled.repo.issue ...`);
1818+const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);
1919+2020+let found = [];
2121+async function listIssues(r) {
2222+ const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`;
2323+ try {
2424+ const ctrl = new AbortController(); const t = setTimeout(() => ctrl.abort(), 10000);
2525+ const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
2626+ clearTimeout(t);
2727+ if (!resp.ok) return;
2828+ const j = await resp.json();
2929+ for (const rec of j.records ?? []) found.push({ owner: r.owner_did, uri: rec.uri, value: rec.value });
3030+ } catch {}
3131+}
3232+// simple concurrency
3333+const q = rows.slice();
3434+await Promise.all(Array.from({ length: 12 }, async () => { while (q.length) await listIssues(q.pop()); }));
3535+3636+console.log(`\nfound ${found.length} issue records`);
3737+if (found.length) {
3838+ console.log("\nsample issue record value keys:", Object.keys(found[0].value));
3939+ console.log("sample record:", JSON.stringify(found[0], null, 2).slice(0, 900));
4040+ console.log("\nfirst few titles:");
4141+ for (const f of found.slice(0, 8)) console.log(` - ${f.value.title ?? "(no title)"} [repo ref: ${JSON.stringify(f.value.repo ?? f.value.subject ?? "?")}]`);
4242+}
+95
recommendationold/src/issue_experiment.mjs
···11+// Full experiment: fetch real Tangled issues live, embed as queries, vector-search READMEs.
22+import pg from "pg";
33+import { readFileSync } from "node:fs";
44+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
55+const API_KEY = fromEnv("GEMINI_API_KEY");
66+const MODEL = "gemini-embedding-001";
77+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 });
88+const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);
99+1010+async function embedQuery(text) {
1111+ const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, {
1212+ method: "POST", headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
1313+ body: JSON.stringify({ model: `models/${MODEL}`, content: { parts: [{ text: text.slice(0, 8000) }] }, taskType: "RETRIEVAL_QUERY", outputDimensionality: 1536 }),
1414+ });
1515+ if (!resp.ok) throw new Error(`embed HTTP ${resp.status}`);
1616+ const v = (await resp.json()).embedding.values;
1717+ let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1;
1818+ return `[${v.map((x) => x / n).join(",")}]`;
1919+}
2020+2121+// Map an issue.repo reference (bare DID or at://owner/sh.tangled.repo/rkey) -> knot repoDid in readmes.
2222+async function resolveRepoDid(ref) {
2323+ if (!ref) return null;
2424+ if (ref.startsWith("at://")) {
2525+ const m = ref.match(/^at:\/\/([^/]+)\/[^/]+\/(.+)$/);
2626+ if (!m) return null;
2727+ const r = await pool.query(`select coalesce(repo_did, record_raw->>'repoDid') as rd from tangled_repos where owner_did=$1 and rkey=$2 limit 1`, [m[1], m[2]]);
2828+ return r.rows[0]?.rd ?? null;
2929+ }
3030+ return ref; // bare DID == repoDid
3131+}
3232+3333+async function fetchIssues() {
3434+ const rows = (await pool.query(`
3535+ select distinct tr.owner_did, pa.pds_host
3636+ from tangled_repos tr join tangled_pds_accounts pa on pa.did = tr.owner_did
3737+ where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null)
3838+ limit 120`)).rows;
3939+ const found = [];
4040+ const q = rows.slice();
4141+ await Promise.all(Array.from({ length: 14 }, async () => {
4242+ while (q.length) {
4343+ const r = q.pop();
4444+ const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`;
4545+ try {
4646+ const ctrl = new AbortController(); const t = setTimeout(() => ctrl.abort(), 10000);
4747+ const resp = await fetch(url, { signal: ctrl.signal });
4848+ clearTimeout(t);
4949+ if (!resp.ok) continue;
5050+ const j = await resp.json();
5151+ for (const rec of j.records ?? []) if (rec.value?.title) found.push(rec.value);
5252+ } catch {}
5353+ }
5454+ }));
5555+ return found;
5656+}
5757+5858+async function main() {
5959+ const issues = await fetchIssues();
6060+ console.log(`fetched ${issues.length} live issues\n`);
6161+ // attach resolved repoDid + whether embedded; prefer substantive bodies whose repo is embedded
6262+ for (const iss of issues) {
6363+ iss._repoDid = await resolveRepoDid(iss.repo);
6464+ iss._embedded = iss._repoDid
6565+ ? (await pool.query(`select repo_name from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [iss._repoDid])).rows[0]?.repo_name ?? null
6666+ : null;
6767+ }
6868+ const pick = issues
6969+ .filter((i) => (i.body ?? "").length > 60)
7070+ .sort((a, b) => (b._embedded ? 1 : 0) - (a._embedded ? 1 : 0) || (b.body?.length ?? 0) - (a.body?.length ?? 0))
7171+ .slice(0, 4);
7272+7373+ for (const iss of pick) {
7474+ console.log("\n" + "=".repeat(72));
7575+ console.log(`ISSUE: ${iss.title}`);
7676+ console.log(`own repo: ${iss._embedded ? iss._embedded + " (embedded ✓)" : "(parent README not embedded / unresolved)"}`);
7777+ console.log(`body: ${(iss.body ?? "").replace(/\s+/g, " ").slice(0, 200)}…`);
7878+ const qvec = await embedQuery(`${iss.title}\n\n${iss.body ?? ""}`);
7979+ const hits = (await pool.query(`
8080+ select repo_name, repo_did, round((embedding <=> $1::vector)::numeric,4) dist, (repo_did=$2) is_parent
8181+ from tangled_readmes where embedding is not null
8282+ order by embedding <=> $1::vector limit 8`, [qvec, iss._repoDid])).rows;
8383+ console.log("top README matches:");
8484+ hits.forEach((h, i) => console.log(` ${i + 1}. ${h.is_parent ? "👉" : " "} ${(h.repo_name ?? "(no name)").padEnd(34)} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`));
8585+ if (iss._embedded) {
8686+ const rnk = (await pool.query(`
8787+ select 1 + count(*)::int rnk from tangled_readmes
8888+ where embedding is not null and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`,
8989+ [qvec, iss._repoDid])).rows[0].rnk;
9090+ console.log(` → own repo overall rank: #${rnk} of all embedded READMEs`);
9191+ }
9292+ }
9393+ await pool.end();
9494+}
9595+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
+84
recommendationold/src/issue_search.mjs
···11+// Experiment: embed a Tangled issue as a query and vector-search the README embeddings.
22+// Validates the matching: (a) does the issue's OWN repo rank highly? (b) are other hits topical?
33+import pg from "pg";
44+import { readFileSync } from "node:fs";
55+66+function fromEnv(key) {
77+ if (process.env[key]) return process.env[key];
88+ for (const p of ["../.env", ".env"]) {
99+ try { const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)$`, "m")); if (m) return m[1].trim().replace(/^["']|["']$/g, ""); } catch {}
1010+ }
1111+}
1212+const CONN = fromEnv("DB_CONNECTION_STRING");
1313+const API_KEY = fromEnv("GEMINI_API_KEY");
1414+const MODEL = "gemini-embedding-001";
1515+const N = parseInt(process.env.ISSUES ?? "3", 10);
1616+1717+const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 3 });
1818+1919+async function embedQuery(text) {
2020+ const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, {
2121+ method: "POST",
2222+ headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
2323+ body: JSON.stringify({
2424+ model: `models/${MODEL}`,
2525+ content: { parts: [{ text: text.slice(0, 8000) }] },
2626+ taskType: "RETRIEVAL_QUERY",
2727+ outputDimensionality: 1536,
2828+ }),
2929+ });
3030+ if (!resp.ok) throw new Error(`embed HTTP ${resp.status}: ${(await resp.text()).slice(0, 200)}`);
3131+ const v = (await resp.json()).embedding.values;
3232+ let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1;
3333+ return `[${v.map((x) => x / n).join(",")}]`;
3434+}
3535+3636+async function main() {
3737+ const total = (await pool.query(`select count(*)::int n from tangled_issues`)).rows[0].n;
3838+ console.log(`tangled_issues total: ${total}`);
3939+ const joinable = (await pool.query(`
4040+ select count(*)::int n from tangled_issues i
4141+ where exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null)`)).rows[0].n;
4242+ console.log(`issues whose parent repo has an embedded README: ${joinable}\n`);
4343+ if (joinable === 0) { console.log("No joinable issues — cannot run the own-repo sanity check."); await pool.end(); return; }
4444+4545+ // Pick a few substantive issues (decent body) whose repo is embedded.
4646+ const issues = (await pool.query(`
4747+ select i.uri, i.repo_did, i.title, i.body,
4848+ (select repo_name from tangled_readmes r where r.repo_did = i.repo_did limit 1) as parent_repo
4949+ from tangled_issues i
5050+ where i.title is not null and length(coalesce(i.body,'')) > 80
5151+ and exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null)
5252+ order by length(i.body) desc
5353+ limit ${N}`)).rows;
5454+5555+ for (const iss of issues) {
5656+ const queryText = `${iss.title}\n\n${iss.body}`;
5757+ console.log("\n" + "=".repeat(70));
5858+ console.log(`ISSUE: ${iss.title}`);
5959+ console.log(`parent repo: ${iss.parent_repo} (${iss.repo_did})`);
6060+ console.log(`body: ${iss.body.replace(/\s+/g, " ").slice(0, 180)}…`);
6161+ const qvec = await embedQuery(queryText);
6262+ const hits = (await pool.query(`
6363+ select repo_name, repo_did, round((embedding <=> $1::vector)::numeric, 4) as dist,
6464+ (repo_did = $2) as is_parent
6565+ from tangled_readmes
6666+ where embedding is not null
6767+ order by embedding <=> $1::vector
6868+ limit 8`, [qvec, iss.repo_did])).rows;
6969+ console.log("top README matches:");
7070+ hits.forEach((h, idx) => {
7171+ console.log(` ${idx + 1}. ${h.is_parent ? "👉 " : " "}${(h.repo_name ?? "(no name)").padEnd(32)} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`);
7272+ });
7373+ // Where does the own repo rank overall?
7474+ const rank = (await pool.query(`
7575+ select 1 + count(*)::int as rnk
7676+ from tangled_readmes
7777+ where embedding is not null
7878+ and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`,
7979+ [qvec, iss.repo_did])).rows[0].rnk;
8080+ console.log(` → own repo overall rank: #${rank} of all embedded READMEs`);
8181+ }
8282+ await pool.end();
8383+}
8484+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
···11+import pg from "pg";
22+import { readFileSync } from "node:fs";
33+44+// Read DB_CONNECTION_STRING from repo-root .env (ignore the gcloud helper line).
55+function loadConn() {
66+ if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
77+ for (const p of ["../.env", ".env", "../../.env"]) {
88+ try {
99+ const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m);
1010+ if (m) return m[1].trim();
1111+ } catch {}
1212+ }
1313+ throw new Error("DB_CONNECTION_STRING not found");
1414+}
1515+1616+const SAMPLE = process.env.SAMPLE ? parseInt(process.env.SAMPLE, 10) : 0; // 0 = all
1717+const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "30", 10);
1818+const TIMEOUT_MS = parseInt(process.env.TIMEOUT_MS ?? "9000", 10);
1919+2020+const pool = new pg.Pool({
2121+ connectionString: loadConn(),
2222+ ssl: { rejectUnauthorized: false },
2323+ connectionTimeoutMillis: 10_000,
2424+ max: 4,
2525+});
2626+2727+const sql = `
2828+ select knot_hostname,
2929+ coalesce(record_raw->>'repoDid', repo_did) as repodid,
3030+ record_raw->>'name' as name
3131+ from tangled_repos
3232+ where knot_hostname is not null
3333+ and coalesce(record_raw->>'repoDid', repo_did) is not null
3434+ ${SAMPLE ? "order by random() limit " + SAMPLE : ""}`;
3535+3636+const { rows } = await pool.query(sql);
3737+await pool.end();
3838+3939+const totalRepos = rows.length;
4040+console.log(`Checking README presence for ${totalRepos} repos (repoDid-addressable) ...`);
4141+console.log(`concurrency=${CONCURRENCY} timeout=${TIMEOUT_MS}ms sample=${SAMPLE || "ALL"}\n`);
4242+4343+async function checkRepo(r) {
4444+ // sh.tangled.repo.tree defaults to the repo's default branch when ref is omitted,
4545+ // and returns a top-level `readme` (with `contents`) when the knot finds a README
4646+ // under any extension (.md/.org/.rst/...). One request per repo.
4747+ const url = `https://${r.knot_hostname}/xrpc/sh.tangled.repo.tree?repo=${encodeURIComponent(r.repodid)}&path=`;
4848+ const ctrl = new AbortController();
4949+ const t = setTimeout(() => ctrl.abort(), TIMEOUT_MS);
5050+ try {
5151+ const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
5252+ const txt = await resp.text();
5353+ if (!resp.ok) return { status: "http_" + resp.status };
5454+ let j; try { j = JSON.parse(txt); } catch { return { status: "bad_json" }; }
5555+ const files = Array.isArray(j?.files) ? j.files : [];
5656+ const readmeObj = !!(j?.readme && typeof j.readme === "object" &&
5757+ typeof j.readme.contents === "string" && j.readme.contents.trim().length > 0);
5858+ const readmeFile = files.some((f) => /^readme(\.|$)/i.test(f?.name ?? ""));
5959+ const empty = files.length === 0 && !readmeObj;
6060+ return { status: "ok", reachable: true, hasReadme: readmeObj || readmeFile, empty };
6161+ } catch (e) {
6262+ return { status: e.name === "AbortError" ? "timeout" : "neterr" };
6363+ } finally {
6464+ clearTimeout(t);
6565+ }
6666+}
6767+6868+let done = 0;
6969+const stats = { reachable: 0, hasReadme: 0, empty: 0 };
7070+const statusCounts = {};
7171+const byKnot = {}; // knot -> {total, reachable, hasReadme}
7272+7373+async function worker(queue) {
7474+ while (queue.length) {
7575+ const r = queue.pop();
7676+ const res = await checkRepo(r);
7777+ statusCounts[res.status] = (statusCounts[res.status] ?? 0) + 1;
7878+ const k = (byKnot[r.knot_hostname] ??= { total: 0, reachable: 0, hasReadme: 0 });
7979+ k.total++;
8080+ if (res.status === "ok") {
8181+ stats.reachable++; k.reachable++;
8282+ if (res.hasReadme) { stats.hasReadme++; k.hasReadme++; }
8383+ if (res.empty) stats.empty++;
8484+ }
8585+ if (++done % 100 === 0) process.stderr.write(` ...${done}/${totalRepos}\n`);
8686+ }
8787+}
8888+8989+const queue = rows.slice();
9090+await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(queue)));
9191+9292+const pct = (n, d) => (d === 0 ? "n/a" : ((100 * n) / d).toFixed(1) + "%");
9393+9494+console.log("\n================ README COVERAGE ================");
9595+console.log(`repoDid-addressable repos checked : ${totalRepos}`);
9696+console.log(`reachable (knot responded w/ tree): ${stats.reachable} (${pct(stats.reachable, totalRepos)} of checked)`);
9797+console.log(` ├─ have a README : ${stats.hasReadme} (${pct(stats.hasReadme, stats.reachable)} of reachable)`);
9898+console.log(` └─ empty repo (no files) : ${stats.empty}`);
9999+console.log(`README % of ALL checked repos : ${pct(stats.hasReadme, totalRepos)}`);
100100+console.log("\nstatus breakdown:", JSON.stringify(statusCounts));
101101+console.log("\nper-knot (knots with >=10 repos):");
102102+for (const [knot, k] of Object.entries(byKnot).sort((a, b) => b[1].total - a[1].total)) {
103103+ if (k.total >= 10) console.log(` ${knot.padEnd(26)} total=${String(k.total).padStart(4)} reachable=${String(k.reachable).padStart(4)} readme=${String(k.hasReadme).padStart(4)} (${pct(k.hasReadme, k.reachable)} of reachable)`);
104104+}
+47
recommendationold/src/search_issues_by_readme.mjs
···11+// Search ISSUES from a repo's README embedding (README -> issue cosine, in-DB pgvector).
22+// This is the "recommend issues to work on" path: given repos a user knows, surface relevant issues.
33+import pg from "pg";
44+import { readFileSync } from "node:fs";
55+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim().replace(/^["']|["']$/g, "");}catch{}} }
66+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });
77+const K = parseInt(process.env.K ?? "8", 10);
88+// which issue table to search
99+const ISSUE_TBL = process.env.ISSUE_TBL || "tangled_issues";
1010+1111+async function coverage() {
1212+ for (const t of ["tangled_issues", "tangled_open_issues"]) {
1313+ try {
1414+ const r = (await pool.query(`select count(*)::int total, count(*) filter (where embedding is not null)::int emb, count(distinct embedding_model) models, max(embedding_model) model, max(vector_dims(embedding)) dims from ${t}`)).rows[0];
1515+ console.log(`${t}: total=${r.total} embedded=${r.emb} model=${r.model} dims=${r.dims}`);
1616+ } catch (e) { console.log(`${t}: ${e.message}`); }
1717+ }
1818+}
1919+2020+async function main() {
2121+ await coverage();
2222+ const seeds = process.env.SEED ? [process.env.SEED] : ["tangled-cli", "atproto-oauth", "nixpkgs", "knot-docker"];
2323+ for (const s of seeds) {
2424+ const seed = (await pool.query(
2525+ `select repo_name, repo_did, embedding::text et from tangled_readmes
2626+ where embedding is not null and repo_name ilike $1 order by length(content) desc limit 1`, [s])).rows[0];
2727+ console.log("\n" + "=".repeat(74));
2828+ if (!seed) { console.log(`SEED "${s}" not found`); continue; }
2929+ console.log(`SEED REPO README: ${seed.repo_name}`);
3030+ const hits = (await pool.query(`
3131+ select i.title, i.repo_did, left(regexp_replace(coalesce(i.body,''), '\\s+', ' ', 'g'), 120) as body,
3232+ rd.repo_name as issue_repo,
3333+ round((i.embedding <=> $1::vector)::numeric, 4) as dist
3434+ from ${ISSUE_TBL} i
3535+ left join tangled_readmes rd on rd.repo_did = i.repo_did
3636+ where i.embedding is not null
3737+ order by i.embedding <=> $1::vector
3838+ limit ${K}`, [seed.et])).rows;
3939+ console.log(`top ${hits.length} matching issues:`);
4040+ hits.forEach((h, idx) => {
4141+ console.log(` ${idx + 1}. [${h.dist}] "${(h.title ?? "(no title)").slice(0, 60)}" (repo: ${h.issue_repo ?? h.repo_did?.slice(0, 16)})`);
4242+ if (h.body?.trim()) console.log(` ${h.body}…`);
4343+ });
4444+ }
4545+ await pool.end();
4646+}
4747+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
+57
recommendationold/src/similar_repos.mjs
···11+// README -> README similarity search (pure in-DB pgvector cosine; no embedding API call).
22+// Given a seed repo (a repo the user contributed to), find the most similar repos by README.
33+// Dedups READMEs whose first 500 characters hash identically (forks and near-identical copies).
44+import pg from "pg";
55+import { readFileSync } from "node:fs";
66+import { createHash } from "node:crypto";
77+function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim().replace(/^["']|["']$/g, "");}catch{}} }
88+const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });
99+const K = parseInt(process.env.K ?? "8", 10);
1010+const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex");
1111+1212+// Seeds: env SEED (repo_name ilike or repo_did), else a diverse default set.
1313+const seeds = process.env.SEED ? [process.env.SEED] : ["tangled-cli", "atproto-oauth", "nixpkgs", "holbert-ng"];
1414+1515+async function findSeed(s) {
1616+ const byDid = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [s]);
1717+ if (byDid.rows[0]) return byDid.rows[0];
1818+ const byName = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_name ilike $1 and embedding is not null order by length(content) desc limit 1`, [s]);
1919+ return byName.rows[0] ?? null;
2020+}
2121+2222+async function main() {
2323+ for (const s of seeds) {
2424+ const seed = await findSeed(s);
2525+ console.log("\n" + "=".repeat(74));
2626+ if (!seed) { console.log(`SEED "${s}" — no embedded README found`); continue; }
2727+ console.log(`SEED REPO: ${seed.repo_name} (owner @${seed.owner_handle ?? "?"})`);
2828+ console.log(` readme: ${(seed.content ?? "").replace(/\s+/g, " ").slice(0, 160)}…`);
2929+3030+ // Pull a wide candidate set, then dedup in JS.
3131+ const cand = (await pool.query(`
3232+ select repo_name, owner_handle, repo_did, content,
3333+ round((embedding <=> $1::vector)::numeric, 4) as dist
3434+ from tangled_readmes
3535+ where embedding is not null and repo_did <> $2
3636+ order by embedding <=> $1::vector
3737+ limit 60`, [seed.embedding, seed.repo_did])).rows;
3838+3939+ const seenContent = new Set([hash(seed.content)]); // also drop forks identical to the seed
4040+ const out = [];
4141+ let dupSkipped = 0;
4242+ for (const c of cand) {
4343+ const h = hash(c.content);
4444+ if (seenContent.has(h)) { dupSkipped++; continue; }
4545+ seenContent.add(h);
4646+ out.push(c);
4747+ if (out.length >= K) break;
4848+ }
4949+ console.log(`top ${out.length} similar repos (deduped, ${dupSkipped} fork/dup hits collapsed):`);
5050+ out.forEach((h, i) => {
5151+ console.log(` ${String(i + 1).padStart(2)}. ${(h.repo_name ?? "(no name)").padEnd(30)} @${(h.owner_handle ?? "?").padEnd(20)} cos_dist=${h.dist}`);
5252+ console.log(` ${(h.content ?? "").replace(/\s+/g, " ").slice(0, 110)}…`);
5353+ });
5454+ }
5555+ await pool.end();
5656+}
5757+main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
+29
recommendationold/src/verify_embeddings.mjs
···11+import pg from "pg";
22+import { readFileSync } from "node:fs";
33+function conn() {
44+ if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
55+ for (const p of ["../.env", ".env"]) { try { const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)$/m); if (m) return m[1].trim(); } catch {} }
66+}
77+const pool = new pg.Pool({ connectionString: conn(), ssl: { rejectUnauthorized: false }, max: 3 });
88+99+console.log("=== embedded rows: dims + L2 norm ===");
1010+console.table((await pool.query(`
1111+ select repo_name, embedding_model,
1212+ vector_dims(embedding) as dims,
1313+ round(sqrt((select sum(x*x) from unnest(embedding::real[]) x))::numeric, 5) as l2_norm
1414+ from tangled_readmes where embedding is not null
1515+ order by embedded_at desc limit 5`)).rows);
1616+1717+console.log("\n=== nearest-neighbor sanity (cosine) for one embedded repo ===");
1818+const seed = (await pool.query(`select repo_did, repo_name from tangled_readmes where embedding is not null limit 1`)).rows[0];
1919+if (seed) {
2020+ console.log(`seed: ${seed.repo_name} (${seed.repo_did})`);
2121+ const nn = await pool.query(`
2222+ select repo_name, round((embedding <=> (select embedding from tangled_readmes where repo_did=$1 and embedding is not null limit 1))::numeric, 4) as cosine_dist
2323+ from tangled_readmes
2424+ where embedding is not null and repo_did <> $1
2525+ order by embedding <=> (select embedding from tangled_readmes where repo_did=$1 and embedding is not null limit 1)
2626+ limit 5`, [seed.repo_did]);
2727+ console.table(nn.rows);
2828+}
2929+await pool.end();
+120
scraper/README.md
···11+# Tangled scraper (stages 0–2)
22+33+Loads Tangled **lexicons** (schemas) into Postgres and probes **knot servers** (git infrastructure), storing the probe results alongside them.
44+55+## What this does / does NOT do
66+77+| Stage | Gets | Does NOT get |
88+|-------|------|--------------|
99+| **0** | All `sh.tangled.*` lexicon JSON from tangled.org/core | Live API data |
1010+| **1** | Knot hostname, version, owner DID, capabilities | Git repos, commits, or source code |
1111+1212+**Actual code** (files, commits, branches) is **Stage 6** — git XRPC on each knot (`sh.tangled.repo.log`, `.tree`, `.blob`, etc.).
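
As a rough illustration of what a git XRPC read against a knot could look like, the sketch below builds a `sh.tangled.repo.tree` request URL for one repo and checks the decoded response for a README. The query parameters (`repo`, `path`) and the response fields (`files`, `readme.contents`) are assumptions taken from how this repo's experiment scripts call the endpoint, not from a published spec; `tree_url` and `has_readme` are hypothetical helper names.

```python
import re
from urllib.parse import quote

# Matches "README", "readme.md", "Readme.org", ... but not "readmeish.txt".
README_NAME = re.compile(r"^readme(\.|$)", re.IGNORECASE)


def tree_url(knot_hostname: str, repo_did: str) -> str:
    """Build the XRPC URL for a repo's root tree (no ref given; assumed to mean the default branch)."""
    return (
        f"https://{knot_hostname}/xrpc/sh.tangled.repo.tree"
        f"?repo={quote(repo_did, safe='')}&path="
    )


def has_readme(tree: dict) -> bool:
    """Return True if a decoded tree response appears to contain a README."""
    # Assumed shape: an optional top-level `readme` object with string `contents`.
    readme = tree.get("readme")
    if isinstance(readme, dict):
        contents = readme.get("contents")
        if isinstance(contents, str) and contents.strip():
            return True
    # Fallback: look for a README-like name in the top-level file listing.
    files = tree.get("files")
    if not isinstance(files, list):
        return False
    return any(
        isinstance(f, dict) and README_NAME.match(f.get("name") or "")
        for f in files
    )
```

The HTTP call is deliberately left out so the helpers can be exercised against saved responses.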
1313+1414+## Setup
1515+1616+From the **repo root**:
1717+1818+```bash
1919+# 1. DB connection (you already have this in .env)
2020+# DB_CONNECTION_STRING=postgresql://...
2121+2222+# 2. Python venv + deps
2323+python3 -m venv scraper/.venv
2424+source scraper/.venv/bin/activate
2525+pip install -r scraper/requirements.txt
2626+2727+# 3. git is required on first run (stage 0 clones tangled.org/core lexicons)
2828+git --version
2929+```
3030+3131+## Run
3232+3333+```bash
3434+source scraper/.venv/bin/activate
3535+3636+# Create tables
3737+python scraper/scrape.py init
3838+3939+# Stage 0 — lexicons (~89 JSON files, prints each NSID)
4040+python scraper/scrape.py stage0
4141+4242+# Stage 1 — probe knots
4343+python scraper/scrape.py stage1
4444+4545+# Or both in one go
4646+python scraper/scrape.py stage0-1
4747+4848+# Check counts
4949+python scraper/scrape.py status
5050+```
5151+5252+Progress is printed as timestamped lines, e.g.:
5353+5454+```
5555+[12:34:56] [stage 0] (12/89) sh.tangled.repo (record)
5656+[12:35:01] [stage 1] OK knot1.tangled.sh version=1.14.0-alpha owner=did:plc:...
5757+```
5858+5959+## Knot configuration (optional)
6060+6161+```bash
6262+# Explicit seed list (comma-separated hostnames)
6363+export TANGLED_KNOT_SEEDS=knot1.tangled.sh,my.knot.example
6464+6565+# Auto-probe knot2..knot5 in addition to defaults
6666+export TANGLED_KNOT_PROBE_MAX=5
6767+6868+# Extra hostnames
6969+export TANGLED_KNOT_EXTRA=custom.knot.example
7070+```
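
One way the three variables above could combine into a probe list is sketched below. The default seed (`knot1.tangled.sh`), the `knot{n}.tangled.sh` naming pattern used for auto-probing, and the precedence between the variables are assumptions made for illustration; `knot_candidates` is a hypothetical helper and the scraper's real logic may differ.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Assumed default seed, taken from the sample progress line above.
DEFAULT_SEEDS = ["knot1.tangled.sh"]


def _split_hosts(raw: str) -> list[str]:
    """Split a comma-separated hostname list, dropping blanks and whitespace."""
    return [h.strip() for h in raw.split(",") if h.strip()]


def knot_candidates(env: Mapping[str, str] | None = None) -> list[str]:
    """Return de-duplicated knot hostnames to probe, in first-seen order."""
    env = os.environ if env is None else env
    # An explicit seed list replaces the defaults (assumed precedence).
    hosts = _split_hosts(env.get("TANGLED_KNOT_SEEDS", "")) or list(DEFAULT_SEEDS)
    probe_max = int(env.get("TANGLED_KNOT_PROBE_MAX", "0") or 0)
    # Assumed naming pattern: knot2..knotN under tangled.sh.
    hosts += [f"knot{n}.tangled.sh" for n in range(2, probe_max + 1)]
    hosts += _split_hosts(env.get("TANGLED_KNOT_EXTRA", ""))
    return list(dict.fromkeys(hosts))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.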
7171+7272+## Stage 2 — Discover repos via Tangled PDS
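7373+7474+`sh.tangled.sync.listRepos` on knots currently returns **404** (the endpoint is not deployed yet), so repos cannot be enumerated from the knots directly.
7575+Stage 2 instead discovers them through the Tangled PDS at **`https://tngl.sh`**, in three phases: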
7676+7777+| Phase | What | API |
7878+|-------|------|-----|
7979+| 1 | List all accounts | `com.atproto.sync.listRepos` |
8080+| 2 | Repo records per account | `com.atproto.repo.listRecords` (`sh.tangled.repo`) |
8181+| 3 | Enrich from knot | `sh.tangled.repo.describeRepo` |
8282+8383+**~7,928 accounts** on tngl.sh (as of testing). Full repo scan takes a while.
8484+8585+```bash
8686+# Phase 1 only: count/list accounts (fast, ~10s)
8787+python scraper/scrape.py stage2-accounts
8888+8989+# Phase 2 only: scan repo records (requires accounts in DB)
9090+python scraper/scrape.py stage2-repos
9191+9292+# All phases in one run
9393+python scraper/scrape.py stage2
9494+9595+python scraper/scrape.py status
9696+```
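
Phases 1 and 2 both page through cursor-based XRPC list endpoints. The loop below is a minimal sketch of that pattern, assuming the usual AT Protocol response shape (a list field plus an optional `cursor`). `xrpc_url`, `paginate`, and the injected `fetch_page` callable are hypothetical names, and the HTTP call itself is left to the caller so the sketch does not depend on any particular client library.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator
from typing import Any
from urllib.parse import urlencode

PDS_URL = "https://tngl.sh"  # default PDS named in this README


def xrpc_url(method: str, **params: Any) -> str:
    """Build an XRPC GET URL, skipping parameters whose value is None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{PDS_URL}/xrpc/{method}?{query}"


def paginate(
    fetch_page: Callable[[str | None], dict[str, Any]],
    items_key: str,
) -> Iterator[dict[str, Any]]:
    """Yield items from each page until the server stops returning a cursor."""
    cursor: str | None = None
    while True:
        page = fetch_page(cursor)
        items = page.get(items_key) or []
        yield from items
        cursor = page.get("cursor")
        # Stop on a missing cursor, or on an empty page to avoid looping forever.
        if not cursor or not items:
            break
```

For Phase 1, `fetch_page` would GET `xrpc_url("com.atproto.sync.listRepos", cursor=cursor)` and read the `"repos"` list; for Phase 2 it would GET `com.atproto.repo.listRecords` with `repo=<did>` and `collection="sh.tangled.repo"`, reading `"records"`.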
9797+9898+### Optional env vars
9999+100100+```bash
101101+# Test with first N accounts only
102102+export TANGLED_STAGE2_ACCOUNT_LIMIT=50
103103+104104+# Resolve handles via plc.directory (slower)
105105+export TANGLED_RESOLVE_HANDLES=1
106106+107107+# Skip knot describeRepo enrichment
108108+export TANGLED_STAGE2_ENRICH_KNOTS=0
109109+110110+# Override PDS (default https://tngl.sh)
111111+export TANGLED_PDS_URL=https://tngl.sh
112112+```
113113+114114+## SQL tables created
115115+116116+- `tangled_lexicons` — NSID → full lexicon JSON
117117+- `tangled_knots` — probed knot servers
118118+- `tangled_pds_accounts` — every account on tngl.sh PDS
119119+- `tangled_repos` — `sh.tangled.repo` records + optional knot metadata
120120+- `tangled_crawl_state` — run metadata per stage
+66
scraper/appview_client.py
···11+from __future__ import annotations
22+33+import re
44+from typing import Any
55+from urllib.parse import urlencode
66+77+import httpx
88+99+APPVIEW_BASE = "https://tangled.org"
1010+SEARCH_PATH = "/search"
1111+1212+# href="/owner/repo" — exclude site chrome and static assets
1313+REPO_HREF = re.compile(r'href="/([a-zA-Z0-9._-]+)/([a-zA-Z0-9._-]+)"')
1414+TOTAL_RE = re.compile(r"Returned\s+(\d+)\s+of\s+(\d+)", re.I)
1515+1616+SKIP_OWNERS = frozenset(
1717+ {
1818+ "static",
1919+ "search",
2020+ "login",
2121+ "signup",
2222+ "explore",
2323+ "settings",
2424+ "blog",
2525+ "docs",
2626+ "brand",
2727+ "chat",
2828+ "pwa-manifest.json",
2929+ }
3030+)
3131+3232+3333+def parse_search_total(html: str) -> int | None:
3434+ match = TOTAL_RE.search(html)
3535+ if not match:
3636+ return None
3737+ return int(match.group(2))
3838+3939+4040+def parse_repo_links(html: str) -> list[tuple[str, str]]:
4141+ seen: set[tuple[str, str]] = set()
4242+ out: list[tuple[str, str]] = []
4343+ for owner, repo in REPO_HREF.findall(html):
4444+ if owner in SKIP_OWNERS or owner.endswith(".json"):
4545+ continue
4646+ key = (owner, repo)
4747+ if key not in seen:
4848+ seen.add(key)
4949+ out.append(key)
5050+ return out
5151+5252+5353+def fetch_search_page(
5454+ client: httpx.Client,
5555+ *,
5656+ offset: int = 0,
5757+ limit: int = 100,
5858+ sort: str = "newest",
5959+ query: str = "",
6060+) -> tuple[str, list[tuple[str, str]], int | None]:
6161+ params = {"q": query, "sort": sort, "offset": offset, "limit": limit}
6262+ url = f"{APPVIEW_BASE}{SEARCH_PATH}?{urlencode(params)}"
6363+ resp = client.get(url)
6464+ resp.raise_for_status()
6565+ html = resp.text
6666+ return html, parse_repo_links(html), parse_search_total(html)
+427
scraper/backfill_repos_from_issues.py
···11+#!/usr/bin/env python3
22+"""Backfill tangled_repos for issues that reference repos not yet ingested.
33+44+Issues are scraped from issue authors' PDSes; repos come from separate crawls
55+(stage2-network, stage2 PDS, manual seed). This script closes the gap by
66+fetching sh.tangled.repo from each missing repo owner's PDS using repo_uri on
77+the issue record.
88+99+Usage:
1010+ python scraper/scrape.py backfill-repos-from-issues
1111+ TANGLED_BACKFILL_REPO_LIMIT=50 python scraper/scrape.py backfill-repos-from-issues
1212+1313+After a successful run, fetch READMEs and embeddings for the new repos:
1414+ python scraper/scrape.py check-readmes
1515+ python scraper/scrape.py embed-readmes
1616+"""
1717+1818+from __future__ import annotations
1919+2020+import json
2121+import os
2222+import threading
2323+from concurrent.futures import ThreadPoolExecutor, as_completed
2424+from dataclasses import dataclass
2525+from typing import Any
2626+2727+import httpx
2828+2929+from db import connect, set_crawl_state, upsert_atproto_record
3030+from parallel import concurrency_env
3131+from pds_client import DEFAULT_PDS, handle_from_plc, pds_host_for_did
3232+from progress import banner, log, phase, step, summary_block
3333+from stage2_network import COLLECTION, fetch_repo_record, upsert_identity
3434+3535+CRAWL_KEY = "repos:issue_backfill"
3636+DISCOVERED_VIA = "issue_backfill"
3737+3838+3939+def _repo_limit() -> int | None:
4040+ raw = os.getenv("TANGLED_BACKFILL_REPO_LIMIT", "").strip()
4141+ if not raw:
4242+ return None
4343+ return max(1, int(raw))
4444+4545+4646+def _missing_repos_sql(*, limit: int | None) -> str:
4747+ query = """
4848+ with missing as (
4949+ select i.repo_did
5050+ from tangled_issues i
5151+ left join tangled_repos r on r.repo_did = i.repo_did
5252+ where i.repo_did is not null
5353+ and r.repo_did is null
5454+ group by i.repo_did
5555+ ),
5656+ best_uri as (
5757+ select distinct on (i.repo_did)
5858+ i.repo_did,
5959+ i.repo_uri,
6060+ count(*) over (partition by i.repo_did) as issue_count
6161+ from tangled_issues i
6262+ inner join missing m on m.repo_did = i.repo_did
6363+ where i.repo_uri is not null
6464+ and i.repo_uri like 'at://did:%/sh.tangled.repo/%'
6565+ order by i.repo_did, i.fetched_at desc nulls last
6666+ )
6767+ select
6868+ b.repo_did,
6969+ b.repo_uri,
7070+ b.issue_count,
7171+ split_part(replace(b.repo_uri, 'at://', ''), '/', 1) as owner_did,
7272+ split_part(b.repo_uri, '/', 5) as repo_rkey,
7373+ ti.handle as owner_handle,
7474+ ti.pds_host
7575+ from best_uri b
7676+ left join tangled_identities ti
7777+ on ti.did = split_part(replace(b.repo_uri, 'at://', ''), '/', 1)
7878+ order by b.issue_count desc, b.repo_did
7979+ """
8080+ if limit:
8181+ query += f" limit {limit}"
8282+ return query
8383+8484+8585+def _count_missing_sql() -> str:
8686+ return """
8787+ select
8888+ count(distinct i.repo_did) filter (
8989+ where i.repo_uri is not null
9090+ and i.repo_uri like 'at://did:%/sh.tangled.repo/%'
9191+ ) as backfillable,
9292+ count(distinct i.repo_did) filter (
9393+ where i.repo_uri is null
9494+ or i.repo_uri not like 'at://did:%/sh.tangled.repo/%'
9595+ ) as not_backfillable,
9696+ count(distinct i.repo_did) as total_missing
9797+ from tangled_issues i
9898+ left join tangled_repos r on r.repo_did = i.repo_did
9999+ where i.repo_did is not null
100100+ and r.repo_did is null
101101+ """
102102+103103+104104+@dataclass
105105+class MissingRepo:
106106+ repo_did: str
107107+ repo_uri: str
108108+ issue_count: int
109109+ owner_did: str
110110+ repo_rkey: str
111111+ owner_handle: str | None
112112+ pds_host: str | None
113113+114114+115115+@dataclass
116116+class BackfillResult:
117117+ row: MissingRepo
118118+ status: str # ok | pds_failed | record_failed | error
119119+ owner_handle: str | None = None
120120+ pds_host: str | None = None
121121+ record: dict[str, Any] | None = None
122122+ error: str | None = None
123123+124124+125125+class _PdsCache:
126126+ def __init__(self) -> None:
127127+ self._hosts: dict[str, str | None] = {}
128128+ self._handles: dict[str, str | None] = {}
129129+ self._lock = threading.Lock()
130130+131131+ def resolve_pds(
132132+ self, client: httpx.Client, owner_did: str, hint: str | None
133133+ ) -> str | None:
134134+ if hint:
135135+ return hint.rstrip("/")
136136+ with self._lock:
137137+ if owner_did in self._hosts:
138138+ return self._hosts[owner_did]
139139+ try:
140140+ pds = pds_host_for_did(client, owner_did)
141141+ except httpx.HTTPError:
142142+ pds = None
143143+ host = pds.rstrip("/") if pds else None
144144+ with self._lock:
145145+ self._hosts[owner_did] = host
146146+ return host
147147+148148+ def resolve_handle(
149149+ self, client: httpx.Client, owner_did: str, hint: str | None
150150+ ) -> str | None:
151151+ if hint:
152152+ return hint
153153+ with self._lock:
154154+ if owner_did in self._handles:
155155+ return self._handles[owner_did]
156156+ try:
157157+ handle = handle_from_plc(client, owner_did)
158158+ except httpx.HTTPError:
159159+ handle = None
160160+ with self._lock:
161161+ self._handles[owner_did] = handle
162162+ return handle
163163+164164+165165+def upsert_issue_backfill_repo(
166166+ conn,
167167+ *,
168168+ owner_did: str,
169169+ owner_handle: str | None,
170170+ repo_rkey: str,
171171+ pds_host: str,
172172+ record: dict[str, Any],
173173+) -> None:
174174+ uri = record["uri"]
175175+ value = record["value"]
176176+ rkey = uri.rsplit("/", 1)[-1]
177177+ repo_did = value.get("repoDid") if isinstance(value.get("repoDid"), str) else None
178178+ knot = value.get("knot") if isinstance(value.get("knot"), str) else None
179179+ name = value.get("name") if isinstance(value.get("name"), str) else None
180180+ if not name and not repo_rkey.startswith("3l"):
181181+ name = repo_rkey
182182+183183+ conn.execute(
184184+ """
185185+ insert into tangled_repos (
186186+ uri, owner_did, owner_handle, rkey, repo_did, name, knot_hostname,
187187+ cid, record_raw, discovered_via, last_synced_at
188188+ )
189189+ values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, %s, now())
190190+ on conflict (uri) do update set
191191+ owner_did = excluded.owner_did,
192192+ owner_handle = coalesce(excluded.owner_handle, tangled_repos.owner_handle),
193193+ repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did),
194194+ name = coalesce(excluded.name, tangled_repos.name),
195195+ knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname),
196196+ cid = excluded.cid,
197197+ record_raw = excluded.record_raw,
198198+ discovered_via = coalesce(tangled_repos.discovered_via, excluded.discovered_via),
199199+ last_synced_at = now()
200200+ """,
201201+ (
202202+ uri,
203203+ owner_did,
204204+ owner_handle,
205205+ rkey,
206206+ repo_did,
207207+ name,
208208+ knot,
209209+ record.get("cid") if isinstance(record.get("cid"), str) else None,
210210+ json.dumps(value),
211211+ DISCOVERED_VIA,
212212+ ),
213213+ )
214214+215215+ upsert_atproto_record(
216216+ conn,
217217+ uri=uri,
218218+ author_did=owner_did,
219219+ collection=COLLECTION,
220220+ rkey=rkey,
221221+ payload=value,
222222+ cid=record.get("cid") if isinstance(record.get("cid"), str) else None,
223223+ repo_did=repo_did,
224224+ )


def _fetch_one(row: MissingRepo, cache: _PdsCache) -> BackfillResult:
    """Resolve the owner's PDS and handle, then fetch the repo record for one missing repo."""
    # Start from "error" so any unexpected exit path is counted as a failure.
    result = BackfillResult(row=row, status="error")
    try:
        with httpx.Client(timeout=60.0, follow_redirects=True) as client:
            pds = cache.resolve_pds(client, row.owner_did, row.pds_host)
            if not pds:
                result.status = "pds_failed"
                return result

            owner_handle = cache.resolve_handle(client, row.owner_did, row.owner_handle)
            result.owner_handle = owner_handle
            result.pds_host = pds

            record = fetch_repo_record(
                client,
                pds_host=pds,
                owner_did=row.owner_did,
                rkey=row.repo_rkey,
                repo_slug=row.repo_rkey,
            )
            if not record:
                result.status = "record_failed"
                return result

            result.record = record
            result.status = "ok"
            return result
    except Exception as exc:
        # Also covers httpx.HTTPError; both cases were handled identically.
        result.status = "error"
        result.error = str(exc)[:200]
        return result
262262+263263+264264+def run_backfill_repos_from_issues(dsn: str) -> dict[str, Any]:
265265+ workers = concurrency_env("TANGLED_BACKFILL_REPO_CONCURRENCY", default=20)
266266+ repo_limit = _repo_limit()
267267+268268+ banner("BACKFILL — Repos referenced by issues but missing from tangled_repos")
269269+ log("backfill", f"Concurrency: {workers}")
270270+ if repo_limit:
271271+ log("backfill", f"Repo limit: {repo_limit}")
272272+273273+ stats: dict[str, Any] = {
274274+ "backfillable": 0,
275275+ "not_backfillable": 0,
276276+ "total_missing": 0,
277277+ "queued": 0,
278278+ "repos_stored": 0,
279279+ "pds_failed": 0,
280280+ "record_failed": 0,
281281+ "errors": 0,
282282+ }
283283+284284+ with connect(dsn) as conn:
285285+ counts = conn.execute(_count_missing_sql()).fetchone()
286286+ if counts:
287287+ stats["backfillable"] = int(counts.get("backfillable") or 0)
288288+ stats["not_backfillable"] = int(counts.get("not_backfillable") or 0)
289289+ stats["total_missing"] = int(counts.get("total_missing") or 0)
290290+291291+ log(
292292+ "backfill",
293293+ f"Missing repos: {stats['total_missing']} "
294294+ f"({stats['backfillable']} with parseable repo_uri, "
295295+ f"{stats['not_backfillable']} without)",
296296+ )
297297+298298+ rows = conn.execute(_missing_repos_sql(limit=repo_limit)).fetchall()
299299+ pending = [
300300+ MissingRepo(
301301+ repo_did=row["repo_did"],
302302+ repo_uri=row["repo_uri"],
303303+ issue_count=int(row["issue_count"] or 0),
304304+ owner_did=row["owner_did"],
305305+ repo_rkey=row["repo_rkey"],
306306+ owner_handle=row.get("owner_handle"),
307307+ pds_host=row.get("pds_host"),
308308+ )
309309+ for row in rows
310310+ if row.get("owner_did") and row.get("repo_rkey")
311311+ ]
312312+ stats["queued"] = len(pending)
313313+314314+ if not pending:
315315+ log("backfill", "Nothing to backfill.")
316316+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
317317+ conn.commit()
318318+ return stats
319319+320320+ phase(1, f"Fetch sh.tangled.repo for {len(pending)} missing repos")
321321+ set_crawl_state(
322322+ conn,
323323+ key=CRAWL_KEY,
324324+ status="running",
325325+ meta={**stats, "workers": workers},
326326+ )
327327+ conn.commit()
328328+329329+ cache = _PdsCache()
330330+ done = 0
331331+ done_lock = threading.Lock()
332332+333333+ with ThreadPoolExecutor(max_workers=workers) as pool:
334334+ futures = {
335335+ pool.submit(_fetch_one, row, cache): row for row in pending
336336+ }
337337+338338+ for future in as_completed(futures):
339339+ row = futures[future]
340340+ label = f"{row.owner_did[:20]}…/{row.repo_rkey}"
341341+342342+ try:
343343+ result = future.result()
344344+ except Exception as exc:
345345+ result = BackfillResult(
346346+ row=row,
347347+ status="error",
348348+ error=str(exc)[:200],
349349+ )
350350+351351+ with done_lock:
352352+ done += 1
353353+ n = done
354354+355355+ if result.status == "ok" and result.record:
356356+ upsert_identity(
357357+ conn,
358358+ did=row.owner_did,
359359+ handle=result.owner_handle,
360360+ pds_host=result.pds_host,
361361+ )
362362+ upsert_issue_backfill_repo(
363363+ conn,
364364+ owner_did=row.owner_did,
365365+ owner_handle=result.owner_handle,
366366+ repo_rkey=row.repo_rkey,
367367+ pds_host=result.pds_host or DEFAULT_PDS,
368368+ record=result.record,
369369+ )
370370+ stats["repos_stored"] += 1
371371+ if n <= 10 or n % 25 == 0:
372372+ step(
373373+ "backfill",
374374+ n,
375375+ len(pending),
376376+ f"OK {label} issues={row.issue_count}",
377377+ )
378378+ elif result.status == "pds_failed":
379379+ stats["pds_failed"] += 1
380380+ if n <= 10 or n % 50 == 0:
381381+ step(
382382+ "backfill",
383383+ n,
384384+ len(pending),
385385+ f"SKIP {label} — could not resolve PDS",
386386+ )
387387+ elif result.status == "record_failed":
388388+ stats["record_failed"] += 1
389389+ if n <= 10 or n % 50 == 0:
390390+ step(
391391+ "backfill",
392392+ n,
393393+ len(pending),
394394+ f"FAIL {label} — no sh.tangled.repo on PDS",
395395+ )
396396+ else:
397397+ stats["errors"] += 1
398398+ if n <= 10 or n % 50 == 0:
399399+ step(
400400+ "backfill",
401401+ n,
402402+ len(pending),
403403+ f"ERROR {label}: {result.error or 'unknown'}",
404404+ )
405405+406406+ if n % 25 == 0:
407407+ conn.commit()
408408+409409+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
410410+ conn.commit()
411411+412412+ summary_block(
413413+ "Issue repo backfill complete",
414414+ [
415415+ f"Missing repos (total): {stats['total_missing']}",
416416+ f"Backfillable (repo_uri): {stats['backfillable']}",
417417+ f"Queued this run: {stats['queued']}",
418418+ f"Repos stored/updated: {stats['repos_stored']}",
419419+ f"PDS resolve failed: {stats['pds_failed']}",
420420+ f"Record fetch failed: {stats['record_failed']}",
421421+ f"Errors: {stats['errors']}",
422422+ "",
423423+ "Next: python scraper/scrape.py check-readmes",
424424+ " python scraper/scrape.py embed-readmes",
425425+ ],
426426+ )
427427+ return stats
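The backfill above fans network fetches out to a thread pool while the main thread alone consumes results, writes them through one connection, and commits every few rows. A minimal sketch of that shape follows, assuming a stand-in `fetch` function and an in-memory list in place of the database; `fetch`, `run`, and `commit_every` are hypothetical names used only for illustration, not part of the scraper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(item: int) -> dict:
    """Stand-in for a network fetch; odd items simulate a missing record."""
    if item % 2:
        return {"item": item, "status": "record_failed"}
    return {"item": item, "status": "ok"}


def run(items: list[int], workers: int = 4, commit_every: int = 3) -> dict:
    stats = {"ok": 0, "record_failed": 0, "errors": 0, "commits": 0}
    stored: list[int] = []  # stand-in for rows written through one DB connection

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, item): item for item in items}

        for done, future in enumerate(as_completed(futures), start=1):
            try:
                result = future.result()
            except Exception:
                # A worker that raised is counted instead of aborting the run.
                result = {"item": futures[future], "status": "errors"}

            # Only the main thread touches `stored` and `stats`, so no lock is needed.
            if result["status"] == "ok":
                stored.append(result["item"])
            stats[result["status"]] += 1

            if done % commit_every == 0:
                stats["commits"] += 1  # conn.commit() in a real pipeline

    stats["commits"] += 1  # final commit after the loop
    return stats
```

Keeping all writes on the consuming thread means a single connection never has to be shared between workers, at the cost of serializing the database work.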
+387
scraper/check_readmes.py
···11+#!/usr/bin/env python3
22+"""Fetch and store README files from knot git for all scraped repos."""
33+44+from __future__ import annotations
55+66+import os
77+import sys
88+import threading
99+from concurrent.futures import ThreadPoolExecutor, as_completed
1010+from dataclasses import dataclass
1111+from pathlib import Path
1212+from typing import Any
1313+1414+import httpx
1515+from dotenv import load_dotenv
1616+1717+from db import connect, init_schema, set_crawl_state
1818+from parallel import concurrency_env
1919+from pds_client import knot_xrpc
2020+from progress import banner, log, metric, phase, step, summary_block
2121+2222+REPO_ROOT = Path(__file__).resolve().parent.parent
2323+CRAWL_KEY = "readmes:check"
2424+README_NAMES = frozenset(
2525+ {"readme.md", "readme", "readme.markdown", "readme.mdown", "readme.mkd"}
2626+)
2727+2828+2929+@dataclass
3030+class ReadmeResult:
3131+ repo_did: str
3232+ repo_uri: str | None
3333+ owner_handle: str | None
3434+ repo_name: str | None
3535+ knot_hostname: str
3636+ status: str
3737+ readme_path: str | None = None
3838+ content: str | None = None
3939+ size_bytes: int | None = None
4040+ error_message: str | None = None
4141+4242+4343+def _repo_limit() -> int | None:
4444+ raw = os.getenv("TANGLED_README_REPO_LIMIT", "").strip()
4545+ if not raw:
4646+ return None
4747+ return max(1, int(raw))
4848+4949+5050+def _skip_existing() -> bool:
5151+ return os.getenv("TANGLED_README_REFRESH", "").strip().lower() not in (
5252+ "1",
5353+ "true",
5454+ "yes",
5555+ )
5656+5757+5858+def _repos_query(*, skip_existing: bool, repo_limit: int | None) -> str:
5959+ skip_clause = ""
6060+ if skip_existing:
6161+ skip_clause = """
6262+ and not exists (
6363+ select 1 from tangled_readmes t
6464+ where t.repo_did = tangled_repos.repo_did
6565+ and t.status in ('found', 'missing')
6666+ )
6767+ """
6868+ query = f"""
6969+ select repo_did, uri, owner_handle, name, knot_hostname
7070+ from tangled_repos
7171+ where repo_did is not null
7272+ and knot_hostname is not null
7373+ {skip_clause}
7474+ order by uri
7575+ """
7676+ if repo_limit:
7777+ query += f" limit {repo_limit}"
7878+ return query
7979+8080+8181+def _find_readme_in_tree(tree: dict[str, Any]) -> str | None:
8282+ for entry in tree.get("files") or []:
8383+ if not isinstance(entry, dict):
8484+ continue
8585+ name = entry.get("name")
8686+ if isinstance(name, str) and name.lower() in README_NAMES:
8787+ if entry.get("type") == "file" or entry.get("mode") in (
8888+ "100644",
8989+ "100755",
9090+ "blob",
9191+ ):
9292+ return name
9393+            # Fallback: treat any entry not explicitly typed as a directory as a file
9494+ if entry.get("type") != "dir":
9595+ return name
9696+ return None
9797+9898+9999+def fetch_readme(
100100+ client: httpx.Client,
101101+ *,
102102+ knot_hostname: str,
103103+ repo_did: str,
104104+) -> ReadmeResult:
105105+ base = ReadmeResult(
106106+ repo_did=repo_did,
107107+ repo_uri=None,
108108+ owner_handle=None,
109109+ repo_name=None,
110110+ knot_hostname=knot_hostname,
111111+ status="error",
112112+ )
113113+114114+ status, tree = knot_xrpc(
115115+ client,
116116+ knot_hostname,
117117+ "sh.tangled.repo.tree",
118118+ {"repo": repo_did, "ref": "HEAD"},
119119+ )
120120+ if status != 200 or not isinstance(tree, dict):
121121+ base.status = "error"
122122+ base.error_message = f"tree HTTP {status}"
123123+ return base
124124+125125+ readme_path = _find_readme_in_tree(tree)
126126+ if not readme_path:
127127+ base.status = "missing"
128128+ return base
129129+130130+ status, blob = knot_xrpc(
131131+ client,
132132+ knot_hostname,
133133+ "sh.tangled.repo.blob",
134134+ {"repo": repo_did, "ref": "HEAD", "path": readme_path},
135135+ )
136136+ if status != 200 or not isinstance(blob, dict):
137137+ base.status = "error"
138138+ base.readme_path = readme_path
139139+ base.error_message = f"blob HTTP {status}"
140140+ return base
141141+142142+ content = blob.get("content")
143143+ if not isinstance(content, str):
144144+ base.status = "error"
145145+ base.readme_path = readme_path
146146+ base.error_message = "blob response missing content"
147147+ return base
148148+149149+ base.status = "found"
150150+ base.readme_path = readme_path
151151+ base.content = content
152152+ base.size_bytes = len(content.encode("utf-8"))
153153+ return base
154154+155155+156156+def upsert_readme(conn, row: ReadmeResult) -> None:
157157+ conn.execute(
158158+ """
159159+ insert into tangled_readmes (
160160+ repo_did, repo_uri, owner_handle, repo_name, knot_hostname,
161161+ readme_path, status, content, size_bytes, error_message, fetched_at
162162+ )
163163+ values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())
164164+ on conflict (repo_did) do update set
165165+ repo_uri = excluded.repo_uri,
166166+ owner_handle = excluded.owner_handle,
167167+ repo_name = excluded.repo_name,
168168+ knot_hostname = excluded.knot_hostname,
169169+ readme_path = excluded.readme_path,
170170+ status = excluded.status,
171171+ content = excluded.content,
172172+ size_bytes = excluded.size_bytes,
173173+ error_message = excluded.error_message,
174174+ fetched_at = now(),
175175+ embedding = case
176176+ when tangled_readmes.content is distinct from excluded.content then null
177177+ else tangled_readmes.embedding
178178+ end,
179179+ embedding_model = case
180180+ when tangled_readmes.content is distinct from excluded.content then null
181181+ else tangled_readmes.embedding_model
182182+ end,
183183+ embedded_at = case
184184+ when tangled_readmes.content is distinct from excluded.content then null
185185+ else tangled_readmes.embedded_at
186186+ end
187187+ """,
188188+ (
189189+ row.repo_did,
190190+ row.repo_uri,
191191+ row.owner_handle,
192192+ row.repo_name,
193193+ row.knot_hostname,
194194+ row.readme_path,
195195+ row.status,
196196+ row.content,
197197+ row.size_bytes,
198198+ row.error_message,
199199+ ),
200200+ )
201201+202202+203203+def run_check_readmes(dsn: str) -> dict[str, int]:
204204+ workers = concurrency_env("TANGLED_README_CONCURRENCY", default=20)
205205+ repo_limit = _repo_limit()
206206+207207+ banner("README CHECK — fetch README from knot git for each repo")
208208+ log("readmes", f"Concurrency: {workers}")
209209+ if repo_limit:
210210+ log("readmes", f"Repo limit: {repo_limit}")
211211+ skip_existing = _skip_existing()
212212+ if skip_existing:
213213+ log(
214214+ "readmes",
215215+ "Skip existing: on — found/missing rows kept (set TANGLED_README_REFRESH=1 to re-fetch)",
216216+ )
217217+ else:
218218+ log("readmes", "Skip existing: off — re-fetching all")
219219+220220+ with connect(dsn) as conn:
221221+ reachable = {
222222+ r["hostname"]
223223+ for r in conn.execute(
224224+ "select hostname from tangled_knots where reachable = true"
225225+ ).fetchall()
226226+ }
227227+ total_eligible = conn.execute(
228228+ """
229229+ select count(*) as n from tangled_repos
230230+ where repo_did is not null and knot_hostname is not null
231231+ """
232232+ ).fetchone()["n"]
233233+ repos = conn.execute(
234234+ _repos_query(skip_existing=skip_existing, repo_limit=repo_limit)
235235+ ).fetchall()
236236+237237+ if not repos:
238238+ if skip_existing:
239239+ log("readmes", "Nothing to fetch — all eligible repos already checked.")
240240+ return {
241241+ "found": 0,
242242+ "missing": 0,
243243+ "error": 0,
244244+ "skipped": 0,
245245+ "already_in_db": total_eligible,
246246+ }
247247+ raise RuntimeError("No repos with repo_did in tangled_repos.")
248248+249249+ already_in_db = total_eligible - len(repos) if skip_existing else 0
250250+ if skip_existing:
251251+ metric("Eligible repos", total_eligible)
252252+ metric("Already in DB (skipped)", already_in_db)
253253+ metric("To fetch", len(repos))
254254+ log("readmes", f"Checking READMEs for {len(repos)} repos …")
255255+256256+ stats = {
257257+ "found": 0,
258258+ "missing": 0,
259259+ "error": 0,
260260+ "skipped": 0,
261261+ "already_in_db": already_in_db,
262262+ }
263263+ stats_lock = threading.Lock()
264264+ done = 0
265265+ done_lock = threading.Lock()
266266+267267+ phase(1, "Parallel tree + blob fetch on knots")
268268+269269+ def work(repo: dict[str, Any]) -> ReadmeResult:
270270+ knot = repo["knot_hostname"]
271271+ repo_did = repo["repo_did"]
272272+ if knot not in reachable:
273273+ return ReadmeResult(
274274+ repo_did=repo_did,
275275+ repo_uri=repo.get("uri"),
276276+ owner_handle=repo.get("owner_handle"),
277277+ repo_name=repo.get("name"),
278278+ knot_hostname=knot or "",
279279+ status="skipped",
280280+ error_message=f"knot not reachable: {knot}",
281281+ )
282282+ with httpx.Client(timeout=60.0, follow_redirects=True) as client:
283283+ result = fetch_readme(client, knot_hostname=knot, repo_did=repo_did)
284284+ result.repo_uri = repo.get("uri")
285285+ result.owner_handle = repo.get("owner_handle")
286286+ result.repo_name = repo.get("name")
287287+ return result
288288+289289+ with connect(dsn) as conn:
290290+ set_crawl_state(
291291+ conn,
292292+ key=CRAWL_KEY,
293293+ status="running",
294294+ meta={"repo_count": len(repos), "workers": workers},
295295+ )
296296+ conn.commit()
297297+298298+ with ThreadPoolExecutor(max_workers=workers) as pool:
299299+ futures = {pool.submit(work, dict(repo)): repo for repo in repos}
300300+301301+ for future in as_completed(futures):
302302+ repo = futures[future]
303303+ label = f"{repo.get('owner_handle') or '?'}/{repo.get('name') or repo['repo_did'][:16]}"
304304+305305+ try:
306306+ result = future.result()
307307+ except Exception as exc:
308308+ result = ReadmeResult(
309309+ repo_did=repo["repo_did"],
310310+ repo_uri=repo.get("uri"),
311311+ owner_handle=repo.get("owner_handle"),
312312+ repo_name=repo.get("name"),
313313+ knot_hostname=repo.get("knot_hostname") or "",
314314+ status="error",
315315+ error_message=str(exc),
316316+ )
317317+318318+ upsert_readme(conn, result)
319319+320320+ with stats_lock:
321321+ stats[result.status if result.status in stats else "error"] += 1
322322+323323+ with done_lock:
324324+ done += 1
325325+ n = done
326326+327327+ if result.status == "found":
328328+ if n <= 10 or n % 50 == 0:
329329+ step(
330330+ "readmes",
331331+ n,
332332+ len(repos),
333333+ f"OK {label} {result.readme_path} ({result.size_bytes} B)",
334334+ )
335335+ elif n <= 10 or n % 100 == 0:
336336+ step(
337337+ "readmes",
338338+ n,
339339+ len(repos),
340340+ f"{result.status.upper()} {label} {result.error_message or ''}",
341341+ )
342342+343343+ if n % 50 == 0:
344344+ conn.commit()
345345+346346+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
347347+ conn.commit()
348348+349349+ summary_block(
350350+ "README check complete",
351351+ [
352352+ f"Repos checked: {len(repos)}",
353353+ f"Already in DB: {stats['already_in_db']}",
354354+ f"Found README: {stats['found']}",
355355+ f"Missing README: {stats['missing']}",
356356+ f"Errors: {stats['error']}",
357357+ f"Skipped knot: {stats['skipped']}",
358358+ "",
359359+ "Query: select status, count(*) from tangled_readmes group by 1;",
360360+ ],
361361+ )
362362+ return stats
363363+364364+365365+def main() -> None:
366366+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
367367+ if candidate.exists():
368368+ load_dotenv(candidate)
369369+ break
370370+ else:
371371+ load_dotenv()
372372+373373+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
374374+ if not dsn:
375375+ print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
376376+ raise SystemExit(1)
377377+378378+ init_schema(dsn)
379379+ run_check_readmes(dsn)
380380+381381+382382+if __name__ == "__main__":
383383+ try:
384384+ main()
385385+ except KeyboardInterrupt:
386386+ print("\nInterrupted.", file=sys.stderr)
387387+ raise SystemExit(130) from None
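The upsert in `upsert_readme` clears `embedding`, `embedding_model`, and `embedded_at` whenever the stored content differs from the incoming content, so the embed step picks the row up again. The sketch below reproduces that idea with SQLite from the standard library purely for illustration: the table name `readmes` and the function `demo` are hypothetical, SQLite's null-safe `is not` stands in for Postgres `is distinct from`, and an SQLite version with upsert support (3.24 or later) is assumed.

```python
import sqlite3

# Null out the embedding only when the incoming content differs from the stored one.
UPSERT = """
insert into readmes (repo_did, content) values (?, ?)
on conflict (repo_did) do update set
    embedding = case
        when readmes.content is not excluded.content then null
        else readmes.embedding
    end,
    content = excluded.content
"""


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table readmes (repo_did text primary key, content text, embedding text)"
    )
    seen = []

    conn.execute(UPSERT, ("did:example:1", "hello"))
    conn.execute("update readmes set embedding = '[0.1,0.2]'")

    # Re-fetching identical content keeps the existing embedding.
    conn.execute(UPSERT, ("did:example:1", "hello"))
    seen.append(conn.execute("select embedding from readmes").fetchone()[0])

    # Changed content invalidates the embedding so it is recomputed later.
    conn.execute(UPSERT, ("did:example:1", "hello, world"))
    seen.append(conn.execute("select embedding from readmes").fetchone()[0])

    conn.close()
    return seen
```

A null-safe comparison matters here because a plain `<>` yields null when either side is null, which would leave a stale embedding in place when content goes from present to absent.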
···11+#!/usr/bin/env python3
22+"""Compute embeddings for tangled_issues (title + body)."""
33+44+from __future__ import annotations
55+66+import os
77+import sys
88+from pathlib import Path
99+1010+import httpx
1111+from dotenv import load_dotenv
1212+1313+from db import connect, init_schema, register_pgvector, set_crawl_state
1414+from embeddings import (
1515+ DEFAULT_DIM,
1616+ DEFAULT_MODEL,
1717+ batch_size,
1818+ embed_texts,
1919+ embedding_model,
2020+ gemini_api_key,
2121+ truncate,
2222+)
2323+from progress import banner, log, phase, step, summary_block
2424+2525+REPO_ROOT = Path(__file__).resolve().parent.parent
2626+CRAWL_KEY = "issues:embed"
2727+2828+2929+def _issue_limit() -> int | None:
3030+ raw = os.getenv("TANGLED_ISSUE_EMBED_LIMIT", "").strip()
3131+ if not raw:
3232+ return None
3333+ return max(1, int(raw))
3434+3535+3636+def _force_reembed() -> bool:
3737+ return os.getenv("TANGLED_ISSUE_EMBED_FORCE", "").strip().lower() in ("1", "true", "yes")
3838+3939+4040+def _issue_text(title: str | None, body: str | None) -> str:
4141+ parts = [p for p in (title, body) if p and p.strip()]
4242+ return truncate("\n\n".join(parts))
4343+4444+4545+def run_embed_issues(dsn: str) -> dict[str, int]:
4646+ api_key = gemini_api_key()
4747+ model = embedding_model()
4848+ bs = batch_size()
4949+ issue_limit = _issue_limit()
5050+ force = _force_reembed()
5151+5252+ banner("ISSUE EMBED — Gemini → tangled_issues.embedding")
5353+ log("embed-issues", f"Model: {model} dim={DEFAULT_DIM} L2-normalized batch={bs}")
5454+ if issue_limit:
5555+ log("embed-issues", f"Limit: {issue_limit}")
5656+ if force:
5757+ log("embed-issues", "Force re-embed enabled")
5858+5959+ where = "1=1"
6060+ if not force:
6161+ where += " and embedding is null"
6262+ query = f"""
6363+ select uri, author_handle, title, body
6464+ from tangled_issues
6565+ where {where}
6666+ and coalesce(nullif(trim(title), ''), nullif(trim(body), '')) is not null
6767+ order by fetched_at desc
6868+ """
6969+ if issue_limit:
7070+ query += f" limit {issue_limit}"
7171+7272+ with connect(dsn) as conn:
7373+ rows = conn.execute(query).fetchall()
7474+7575+ if not rows:
7676+ log("embed-issues", "Nothing to embed (run fetch-issues first).")
7777+ return {"embedded": 0, "batches": 0, "errors": 0}
7878+7979+ log("embed-issues", f"Embedding {len(rows)} issues …")
8080+ stats = {"embedded": 0, "batches": 0, "errors": 0}
8181+8282+ phase(1, "Gemini batchEmbedContents → tangled_issues.embedding")
8383+8484+ with httpx.Client() as client, connect(dsn) as conn:
8585+ register_pgvector(conn)
8686+ set_crawl_state(
8787+ conn,
8888+ key=CRAWL_KEY,
8989+ status="running",
9090+ meta={"count": len(rows), "model": model, "dim": DEFAULT_DIM},
9191+ )
9292+ conn.commit()
9393+9494+ for start in range(0, len(rows), bs):
9595+ batch = rows[start : start + bs]
9696+ texts = [_issue_text(r.get("title"), r.get("body")) for r in batch]
9797+ labels = [
9898+ f"{r.get('author_handle') or '?'}: {(r.get('title') or '')[:40]}"
9999+ for r in batch
100100+ ]
101101+102102+ try:
103103+ vectors = embed_texts(client, api_key=api_key, texts=texts)
104104+ except Exception as exc:
105105+ stats["errors"] += len(batch)
106106+ step(
107107+ "embed-issues",
108108+ min(start + len(batch), len(rows)),
109109+ len(rows),
110110+ f"ERROR batch: {exc}",
111111+ )
112112+ continue
113113+114114+ for row, vec in zip(batch, vectors, strict=True):
115115+ conn.execute(
116116+ """
117117+ update tangled_issues
118118+ set embedding = %s, embedding_model = %s, embedded_at = now()
119119+ where uri = %s
120120+ """,
121121+ (vec, model, row["uri"]),
122122+ )
123123+124124+ stats["embedded"] += len(batch)
125125+ stats["batches"] += 1
126126+ conn.commit()
127127+ n = stats["embedded"]
128128+ if n <= 10 or n % bs == 0 or n == len(rows):
129129+ step("embed-issues", n, len(rows), f"OK {labels[-1]}")
130130+131131+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
132132+ conn.commit()
133133+134134+ summary_block(
135135+ "Issue embed complete",
136136+ [f"Embedded: {stats['embedded']}", f"Errors: {stats['errors']}"],
137137+ )
138138+ return stats
139139+140140+141141+def main() -> None:
142142+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
143143+ if candidate.exists():
144144+ load_dotenv(candidate)
145145+ break
146146+ else:
147147+ load_dotenv()
148148+149149+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
150150+ if not dsn:
151151+ print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
152152+ raise SystemExit(1)
153153+154154+ init_schema(dsn)
155155+ run_embed_issues(dsn)
156156+157157+158158+if __name__ == "__main__":
159159+ try:
160160+ main()
161161+ except KeyboardInterrupt:
162162+ print("\nInterrupted.", file=sys.stderr)
163163+ raise SystemExit(130) from None
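The embed loop above processes rows in fixed-size batches and treats a failed API call as a per-batch error: the batch is counted in `errors`, skipped, and the loop moves on. A minimal sketch of that control flow, assuming a caller-supplied embedding function; `embed_in_batches` and `flaky_embed` are hypothetical names, and the one-element vectors are placeholders, not real embeddings.

```python
def embed_in_batches(texts, embed_fn, batch_size):
    """Embed texts batch by batch; a failing batch is counted and skipped."""
    stats = {"embedded": 0, "batches": 0, "errors": 0}
    vectors = {}  # index in `texts` -> vector, stand-in for the UPDATE per row

    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        try:
            vecs = embed_fn(batch)
        except Exception:
            stats["errors"] += len(batch)
            continue  # later batches are still attempted

        for offset, vec in enumerate(vecs):
            vectors[start + offset] = vec
        stats["embedded"] += len(batch)
        stats["batches"] += 1

    return stats, vectors


def flaky_embed(batch):
    """Fake embedder that fails for any batch containing the text 'boom'."""
    if "boom" in batch:
        raise RuntimeError("simulated API failure")
    return [[float(len(text))] for text in batch]
```

Because rows that failed keep a null embedding, a later run without the force flag selects them again, which is what makes skipping a batch safe.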
+171
scraper/embed_readmes.py
···11+#!/usr/bin/env python3
22+"""Compute and store one embedding vector per README in tangled_readmes."""
33+44+from __future__ import annotations
55+66+import os
77+import sys
88+from pathlib import Path
99+1010+import httpx
1111+from dotenv import load_dotenv
1212+1313+from db import connect, init_schema, register_pgvector, set_crawl_state
1414+from embeddings import (
1515+ DEFAULT_DIM,
1616+ DEFAULT_MODEL,
1717+ batch_size,
1818+ embed_texts,
1919+ embedding_model,
2020+ gemini_api_key,
2121+ truncate,
2222+)
2323+from progress import banner, log, phase, step, summary_block
2424+2525+REPO_ROOT = Path(__file__).resolve().parent.parent
2626+CRAWL_KEY = "readmes:embed"
2727+2828+2929+def _repo_limit() -> int | None:
3030+ raw = os.getenv("TANGLED_EMBED_README_LIMIT", "").strip()
3131+ if not raw:
3232+ return None
3333+ return max(1, int(raw))
3434+3535+3636+def _force_reembed() -> bool:
3737+ return os.getenv("TANGLED_EMBED_FORCE", "").strip().lower() in ("1", "true", "yes")
3838+3939+4040+def _select_query(*, force: bool, limit: int | None) -> str:
4141+ where = "status = 'found' and content is not null"
4242+ if not force:
4343+ where += " and embedding is null"
4444+ query = f"""
4545+ select repo_did, owner_handle, repo_name, content
4646+ from tangled_readmes
4747+ where {where}
4848+ order by fetched_at desc
4949+ """
5050+ if limit:
5151+ query += f" limit {limit}"
5252+ return query
5353+5454+5555+def run_embed_readmes(dsn: str) -> dict[str, int]:
5656+ api_key = gemini_api_key()
5757+ model = embedding_model()
5858+ bs = batch_size()
5959+ repo_limit = _repo_limit()
6060+ force = _force_reembed()
6161+6262+ banner("README EMBED — Gemini → tangled_readmes.embedding")
6363+ log("embed", f"Model: {model} dim={DEFAULT_DIM} L2-normalized batch={bs}")
6464+ if repo_limit:
6565+ log("embed", f"Limit: {repo_limit}")
6666+ if force:
6767+ log("embed", "Force re-embed all matching rows")
6868+6969+ with connect(dsn) as conn:
7070+ register_pgvector(conn)
7171+ rows = conn.execute(_select_query(force=force, limit=repo_limit)).fetchall()
7272+7373+ if not rows:
7474+ log("embed", "Nothing to embed (run check-readmes first, or set TANGLED_EMBED_FORCE=1).")
7575+ return {"embedded": 0, "batches": 0, "errors": 0}
7676+7777+ log("embed", f"Embedding {len(rows)} READMEs …")
7878+ stats = {"embedded": 0, "batches": 0, "errors": 0}
7979+8080+ phase(1, "Gemini batchEmbedContents → tangled_readmes.embedding")
8181+8282+ with httpx.Client() as client, connect(dsn) as conn:
8383+ register_pgvector(conn)
8484+ set_crawl_state(
8585+ conn,
8686+ key=CRAWL_KEY,
8787+ status="running",
8888+ meta={"count": len(rows), "model": model, "dim": DEFAULT_DIM},
8989+ )
9090+ conn.commit()
9191+9292+ for start in range(0, len(rows), bs):
9393+ batch = rows[start : start + bs]
9494+ texts = [truncate(r["content"]) for r in batch]
9595+ labels = [
9696+ f"{r.get('owner_handle') or '?'}/{r.get('repo_name') or r['repo_did'][:16]}"
9797+ for r in batch
9898+ ]
9999+100100+ try:
101101+ vectors = embed_texts(client, api_key=api_key, texts=texts)
102102+ except Exception as exc:
103103+ stats["errors"] += len(batch)
104104+ step(
105105+ "embed",
106106+ min(start + len(batch), len(rows)),
107107+ len(rows),
108108+ f"ERROR batch @ {start}: {exc}",
109109+ )
110110+ continue
111111+112112+ for row, vec in zip(batch, vectors, strict=True):
113113+ conn.execute(
114114+ """
115115+ update tangled_readmes
116116+ set embedding = %s,
117117+ embedding_model = %s,
118118+ embedded_at = now()
119119+ where repo_did = %s
120120+ """,
121121+ (vec, model, row["repo_did"]),
122122+ )
123123+124124+ stats["embedded"] += len(batch)
125125+ stats["batches"] += 1
126126+ conn.commit()
127127+128128+ n = stats["embedded"]
129129+ if n <= 10 or n % bs == 0 or n == len(rows):
130130+ step("embed", n, len(rows), f"OK {labels[-1]}")
131131+132132+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
133133+ conn.commit()
134134+135135+ summary_block(
136136+ "README embed complete",
137137+ [
138138+ f"Embedded: {stats['embedded']}",
139139+ f"Batches: {stats['batches']}",
140140+ f"Errors: {stats['errors']}",
141141+ "",
142142+ "Cosine search (L2-normalized vectors):",
143143+ " order by embedding <=> query_vec",
144144+ ],
145145+ )
146146+ return stats
147147+148148+149149+def main() -> None:
150150+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
151151+ if candidate.exists():
152152+ load_dotenv(candidate)
153153+ break
154154+ else:
155155+ load_dotenv()
156156+157157+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
158158+ if not dsn:
159159+ print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
160160+ raise SystemExit(1)
161161+162162+ init_schema(dsn)
163163+ run_embed_readmes(dsn)
164164+165165+166166+if __name__ == "__main__":
167167+ try:
168168+ main()
169169+ except KeyboardInterrupt:
170170+ print("\nInterrupted.", file=sys.stderr)
171171+ raise SystemExit(130) from None
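The summary above points at cosine search with `order by embedding <=> query_vec`, where `<=>` is pgvector's cosine distance operator. For L2-normalized vectors, cosine distance reduces to one minus the dot product. The sketch below checks that identity and ranks a few toy vectors in plain Python; the function names and the two-dimensional vectors are illustrative assumptions, not data from the pipeline.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """General cosine distance, valid for vectors of any length."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query, docs):
    """Order document names by ascending cosine distance to the query.

    Assumes every vector in `docs` is already L2-normalized, so the distance
    is simply 1 - dot(query, doc) once the query is normalized too.
    """
    q = l2_normalize(query)
    return sorted(docs, key=lambda name: 1.0 - dot(q, docs[name]))


DOCS = {
    "a": l2_normalize([1.0, 0.0]),
    "b": l2_normalize([0.0, 1.0]),
    "c": l2_normalize([1.0, 1.0]),
}
```

Calling `rank([2.0, 0.1], DOCS)` orders the names from closest to farthest, mirroring what the `order by` clause does inside Postgres.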
+103
scraper/embeddings.py
"""Gemini embeddings: gemini-embedding-001, 1536-dim, L2-normalized for cosine."""

from __future__ import annotations

import math
import os

import httpx

DEFAULT_MODEL = "gemini-embedding-001"
DEFAULT_DIM = 1536
MAX_CHARS = 24_000  # character cap applied by truncate() before embedding
GEMINI_BATCH_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-embedding-001:batchEmbedContents"
)


def embedding_model() -> str:
    # Model label that callers store next to each vector. embed_texts() always
    # requests DEFAULT_MODEL, so overriding this changes only the stored label.
    return os.getenv("TANGLED_EMBEDDING_MODEL", DEFAULT_MODEL).strip() or DEFAULT_MODEL


def batch_size() -> int:
    # Texts per batchEmbedContents call, clamped to the range 1..100 (default 16).
    raw = os.getenv("TANGLED_EMBED_BATCH_SIZE", "16").strip()
    return max(1, min(100, int(raw)))


def gemini_api_key() -> str:
    key = (
        os.getenv("GEMINI_API_KEY", "").strip()
        or os.getenv("GOOGLE_API_KEY", "").strip()
    )
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY (or GOOGLE_API_KEY) is not set. "
            "Add it to .env to compute embeddings."
        )
    return key


def truncate(text: str) -> str:
    # Strip surrounding whitespace, then cap the text at MAX_CHARS characters.
    text = text.strip()
    return text[:MAX_CHARS] if len(text) > MAX_CHARS else text


def l2_normalize(vec: list[float]) -> list[float]:
    # Scale to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]


def embed_texts(
    client: httpx.Client,
    *,
    api_key: str,
    texts: list[str],
    task_type: str = "RETRIEVAL_DOCUMENT",
) -> list[list[float]]:
    """Embed texts via Gemini batchEmbedContents; returns L2-normalized 1536-dim vectors."""
    if not texts:
        return []

    # One request object per input text; the response preserves this order.
    requests = [
        {
            "model": f"models/{DEFAULT_MODEL}",
            "content": {"parts": [{"text": text}]},
            "taskType": task_type,
            "outputDimensionality": DEFAULT_DIM,
        }
        for text in texts
    ]

    resp = client.post(
        GEMINI_BATCH_URL,
        headers={
            "x-goog-api-key": api_key,
            "Content-Type": "application/json",
        },
        json={"requests": requests},
        timeout=120.0,
    )
    if resp.status_code != 200:
        raise RuntimeError(
            f"Gemini embeddings HTTP {resp.status_code}: {resp.text[:500]}"
        )

    embeddings = resp.json().get("embeddings") or []
    if len(embeddings) != len(texts):
        raise RuntimeError(f"Expected {len(texts)} embeddings, got {len(embeddings)}")

    # Validate each vector's shape before normalizing it.
    vectors: list[list[float]] = []
    for row in embeddings:
        values = row.get("values")
        if not isinstance(values, list):
            raise RuntimeError("Gemini response missing embedding values")
        if len(values) != DEFAULT_DIM:
            raise RuntimeError(
                f"Expected dim {DEFAULT_DIM}, got {len(values)}. "
                "Check outputDimensionality support for your API key."
            )
        vectors.append(l2_normalize(values))
    return vectors
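The request and response handling in `embed_texts` can be exercised without network access by separating payload construction from response validation. The sketch below mirrors the field names used above (`model`, `content.parts[].text`, `taskType`, `outputDimensionality`, `embeddings[].values`); the helper names `build_requests` and `parse_embeddings`, the 4-dimensional size, and the fake response are assumptions for illustration only.

```python
import math

DIM = 4  # illustrative size only; the module above uses 1536


def build_requests(texts, model="gemini-embedding-001",
                   task_type="RETRIEVAL_DOCUMENT", dim=DIM):
    """Build the JSON body: one request object per input text."""
    return {
        "requests": [
            {
                "model": f"models/{model}",
                "content": {"parts": [{"text": text}]},
                "taskType": task_type,
                "outputDimensionality": dim,
            }
            for text in texts
        ]
    }


def parse_embeddings(payload, expected, dim=DIM):
    """Validate count and dimension, then L2-normalize each vector."""
    embeddings = payload.get("embeddings") or []
    if len(embeddings) != expected:
        raise RuntimeError(f"Expected {expected} embeddings, got {len(embeddings)}")

    vectors = []
    for row in embeddings:
        values = row.get("values")
        if not isinstance(values, list) or len(values) != dim:
            raise RuntimeError("Unexpected embedding shape")
        norm = math.sqrt(sum(v * v for v in values))
        # A zero vector is kept as-is to avoid dividing by zero.
        vectors.append([v / norm for v in values] if norm else values)
    return vectors


# Hand-written stand-in for an API response, not real model output.
FAKE_RESPONSE = {"embeddings": [{"values": [3.0, 4.0, 0.0, 0.0]},
                                {"values": [0.0, 0.0, 0.0, 0.0]}]}
```

Failing loudly on a count or dimension mismatch keeps a short or malformed response from being written to the vector column.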
+202
scraper/export_embeddings.py
···11+"""Export embeddings from the shared Postgres into the embeddings git repo.
22+33+This is the "transfer" step that publishes the Discover engine's embeddings to the
44+network: it reads the precomputed vectors from Postgres (READ-ONLY) and writes the
55+files consumed by `tangled-discover-embeddings` (a knot-hosted git repo) — a single
66+`.npy` matrix + a `.jsonl` sidecar per section, plus a manifest. Commit + push that
77+repo afterwards (the push emits `sh.tangled.git.refUpdate`, the consumers' re-pull
88+signal).
99+1010+This is the canonical, pipeline-wireable copy. An identical-logic, self-contained
1111+copy also lives in the embeddings repo at `scripts/export_embeddings.py`; the only
1212+difference here is that the OUTPUT directory is configurable (this script lives in the
1313+backend repo, not in the embeddings repo).
1414+1515+ # writes into ../tangled-discover-embeddings by default:
1616+ python scraper/export_embeddings.py
1717+ # or point it anywhere:
1818+ EMBEDDINGS_REPO_DIR=/path/to/tangled-discover-embeddings python scraper/export_embeddings.py
1919+ python scraper/export_embeddings.py /path/to/tangled-discover-embeddings
2020+2121+Vectors are read as pgvector text literals ('[v1,...]'), exactly as in recommendation/app/db.py
2222+and scraper/seed_user.py; they are already 1536-d and L2-normalized. This script makes no DB writes.
2323+"""
2424+2525+from __future__ import annotations
2626+2727+import datetime as dt
2828+import hashlib
2929+import json
3030+import os
3131+import sys
3232+from pathlib import Path
3333+3434+import numpy as np
3535+import psycopg
3636+from psycopg.rows import dict_row
3737+3838+try:
3939+ from dotenv import load_dotenv
4040+except ImportError: # dotenv optional if the var is already in env
4141+ def load_dotenv(*_a, **_k): # type: ignore
4242+ return False
4343+4444+BACKEND_ROOT = Path(__file__).resolve().parent.parent # the sunsteadhack repo
4545+DIM = 1536
4646+MODEL = "gemini-embedding-001"
4747+4848+4949+def _out_dir() -> Path:
5050+ """Where to write the embeddings repo files. Precedence: argv[1] > env > default
5151+ sibling repo (../tangled-discover-embeddings)."""
5252+ if len(sys.argv) > 1:
5353+ return Path(sys.argv[1]).expanduser().resolve()
5454+ env = os.environ.get("EMBEDDINGS_REPO_DIR")
5555+ if env:
5656+ return Path(env).expanduser().resolve()
5757+ return (BACKEND_ROOT.parent / "tangled-discover-embeddings").resolve()
5858+5959+6060+# Repos: mirror recommendation/app/db.py joins so description/topics/created_at/handle
6161+# resolve the same way the engine sees them. content stays in the DB — we ship only its
6262+# length (for the min-chars gate) and md5(first 500 chars) (for fork dedup).
6363+_REPOS_SQL = """
6464+ select r.repo_did,
6565+ r.repo_uri,
6666+ coalesce(r.owner_handle, ti.handle) as owner_handle,
6767+ r.repo_name,
6868+ tr.record_raw->>'description' as description,
6969+ tr.record_raw->'topics' as topics,
7070+ tr.record_raw->>'createdAt' as created_at,
7171+ length(trim(coalesce(r.content, ''))) as content_len,
7272+ md5(substring(coalesce(r.content, '') for 500)) as content_sha500,
7373+ r.embedding_model,
7474+ r.embedded_at,
7575+ r.embedding::text as etext
7676+ from tangled_readmes r
7777+ left join tangled_repos tr
7878+ on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
7979+ left join tangled_identities ti
8080+ on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1)
8181+ where r.embedding is not null
8282+ order by r.repo_did
8383+"""
8484+8585+# Issues: only those whose identity fully resolves (same inner joins as _KNN_ISSUES_SQL),
8686+# i.e. exactly the set the engine can emit.
8787+_ISSUES_SQL = """
8888+ select i.uri,
8989+ i.rkey,
9090+ i.repo_did,
9191+ i.repo_uri,
9292+ i.author_did,
9393+ i.title,
9494+ i.body,
9595+ ti.handle as owner_handle,
9696+ tr.name as repo_name,
9797+ tr.record_raw->>'description' as repo_description,
9898+ i.issue_created_at as created_at,
9999+ i.embedding_model,
100100+ i.embedding::text as etext
101101+ from tangled_open_issues i
102102+ join tangled_identities ti
103103+ on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
104104+ join tangled_repos tr
105105+ on tr.owner_did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
106106+ and tr.rkey = split_part(i.repo_uri, '/', 5)
107107+ where i.embedding is not null
108108+ and i.repo_uri is not null
109109+ and ti.handle is not null
110110+ and tr.name is not null
111111+ order by i.uri
112112+"""
113113+114114+115115+def _dsn() -> str:
116116+ for candidate in (BACKEND_ROOT / ".env", BACKEND_ROOT / "recommendation" / ".env", BACKEND_ROOT / "scraper" / ".env"):
117117+ if candidate.exists():
118118+ load_dotenv(candidate)
119119+ break
120120+ else:
121121+ load_dotenv()
122122+ conn = os.environ.get("DB_CONNECTION_STRING", "").strip()
123123+ if not conn:
124124+ raise SystemExit("DB_CONNECTION_STRING not set (env or .env)")
125125+ if "sslmode=" not in conn: # Cloud SQL public IP, self-signed cert
126126+ conn += ("&" if "?" in conn else "?") + "sslmode=require"
127127+ return conn
128128+129129+130130+def _parse_vec(etext: str) -> np.ndarray:
131131+ v = np.fromstring(etext.strip()[1:-1], sep=",", dtype=np.float32)
132132+ if v.shape[0] != DIM:
133133+ raise ValueError(f"expected dim {DIM}, got {v.shape[0]}")
134134+ return v
135135+136136+137137+def _json_default(o):
138138+ if isinstance(o, (dt.datetime, dt.date)):
139139+ return o.isoformat()
140140+ return str(o)
141141+142142+143143+def _export_section(conn, data_dir: Path, name: str, sql: str, meta_fields: list[str]) -> dict:
144144+ rows = conn.execute(sql).fetchall()
145145+ if not rows:
146146+ raise SystemExit(f"{name}: no embedded rows found")
147147+ matrix = np.vstack([_parse_vec(r["etext"]) for r in rows]).astype(np.float32)
148148+149149+ npy_path = data_dir / f"{name}.f32.npy"
150150+ jsonl_path = data_dir / f"{name}.jsonl"
151151+ np.save(npy_path, matrix)
152152+ with open(jsonl_path, "w", encoding="utf-8") as fh:
153153+ for i, r in enumerate(rows):
154154+ rec = {"row": i, "subject_uri": r["uri"] if "uri" in r else r["repo_uri"]}
155155+ rec.update({k: r[k] for k in meta_fields})
156156+ fh.write(json.dumps(rec, default=_json_default, ensure_ascii=False) + "\n")
157157+158158+ sha = hashlib.sha256(npy_path.read_bytes()).hexdigest()
159159+ print(f" {name}: {matrix.shape[0]} vectors -> {npy_path} ({npy_path.stat().st_size // 1024} KiB)")
160160+ return {
161161+ "count": int(matrix.shape[0]),
162162+ "vectors": f"data/{name}.f32.npy",
163163+ "meta": f"data/{name}.jsonl",
164164+ "sha256": sha,
165165+ }
166166+167167+168168+def main() -> int:
169169+ out = _out_dir()
170170+ data_dir = out / "data"
171171+ data_dir.mkdir(parents=True, exist_ok=True)
172172+ print(f"exporting embeddings (read-only) -> {out}")
173173+ with psycopg.connect(_dsn(), row_factory=dict_row) as conn:
174174+ repos = _export_section(
175175+ conn, data_dir, "repos", _REPOS_SQL,
176176+ ["repo_did", "repo_name", "owner_handle", "description", "topics",
177177+ "created_at", "content_len", "content_sha500", "embedding_model", "embedded_at"],
178178+ )
179179+ issues = _export_section(
180180+ conn, data_dir, "issues", _ISSUES_SQL,
181181+ ["repo_did", "rkey", "repo_uri", "author_did", "title", "body",
182182+ "owner_handle", "repo_name", "repo_description", "created_at", "embedding_model"],
183183+ )
184184+185185+ manifest = {
186186+ "schema_version": 1,
187187+ "model": MODEL,
188188+ "dim": DIM,
189189+ "metric": "cosine",
190190+ "normalized": True,
191191+ "task_type": "RETRIEVAL_DOCUMENT",
192192+ "generated_at": dt.datetime.now(dt.timezone.utc).isoformat(),
193193+ "sections": {"repos": repos, "issues": issues},
194194+ }
195195+ (out / "manifest.json").write_text(json.dumps(manifest, indent=2) + "\n")
196196+ print(f"wrote {out / 'manifest.json'} (repos={repos['count']}, issues={issues['count']})")
197197+ print("next: cd into the embeddings repo, then git add -A && git commit && git push")
198198+ return 0
199199+200200+201201+if __name__ == "__main__":
202202+ raise SystemExit(main())
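
The manifest written above records a `sha256` for each section's vectors file, so a consumer can check a fresh pull before loading it. The following is a minimal sketch of such a check, assuming the manifest layout produced by `main()`; `verify_manifest` and `_write_demo_repo` are hypothetical helper names, and the demo writes placeholder bytes rather than a real `.npy` file.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def verify_manifest(repo_dir: Path) -> dict:
    """Map each section name to True if its vectors file matches the manifest sha256."""
    manifest = json.loads((repo_dir / "manifest.json").read_text(encoding="utf-8"))
    results = {}
    for name, section in manifest["sections"].items():
        vectors_path = repo_dir / section["vectors"]
        actual = hashlib.sha256(vectors_path.read_bytes()).hexdigest()
        results[name] = actual == section["sha256"]
    return results


def _write_demo_repo(repo_dir: Path) -> None:
    """Build a tiny stand-in export (placeholder bytes, not a real matrix)."""
    data_dir = repo_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    payload = b"placeholder vector bytes"
    (data_dir / "repos.f32.npy").write_bytes(payload)
    manifest = {
        "schema_version": 1,
        "sections": {
            "repos": {
                "count": 1,
                "vectors": "data/repos.f32.npy",
                "meta": "data/repos.jsonl",
                "sha256": hashlib.sha256(payload).hexdigest(),
            }
        },
    }
    (repo_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2) + "\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    _write_demo_repo(demo)
    ok_before = verify_manifest(demo)
    # Simulate a corrupted or partially pulled vectors file.
    (demo / "data" / "repos.f32.npy").write_bytes(b"tampered")
    ok_after = verify_manifest(demo)
```

A consumer could run a check like this after each re-pull and refuse to swap in a section whose hash does not match.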
+128
scraper/export_questionnaires.py
···11+"""Export AI-solve questionnaires from Postgres into the embeddings git repo.
22+33+Mirrors scraper/export_embeddings.py, but questionnaires are read PER ISSUE (not
44+bulk), so the layout is one JSON file per issue rather than a single matrix:
55+66+ <repo>/questionnaires/<did>/<rkey>.json # one per issue
77+ <repo>/questionnaires/index.json # entries: [{issue_uri, path, updated_at, sha256}]
88+99+This is the one-time migration (there's ~1 row today) and a bulk re-sync tool. The
1010+live generation job writes these files itself via agent/questionnaire_repo_store.py;
1111+this script just mirrors whatever is currently in the DB. READ-ONLY against the DB.
1212+1313+ python scraper/export_questionnaires.py
1414+ EMBEDDINGS_REPO_DIR=/path/to/tangled-discover-embeddings python scraper/export_questionnaires.py
1515+"""
1616+1717+from __future__ import annotations
1818+1919+import datetime as dt
2020+import hashlib
2121+import json
2222+import os
2323+import sys
2424+from pathlib import Path
2525+2626+import psycopg
2727+from psycopg.rows import dict_row
2828+2929+try:
3030+ from dotenv import load_dotenv
3131+except ImportError:
3232+ def load_dotenv(*_a, **_k): # type: ignore
3333+ return False
3434+3535+BACKEND_ROOT = Path(__file__).resolve().parent.parent
3636+3737+_SELECT = """
3838+ select issue_uri, payload, created_at, updated_at
3939+ from tangled_issue_questionnaires
4040+ order by issue_uri
4141+"""
4242+4343+4444+def _out_dir() -> Path:
4545+ if len(sys.argv) > 1:
4646+ return Path(sys.argv[1]).expanduser().resolve()
4747+ env = os.environ.get("EMBEDDINGS_REPO_DIR")
4848+ if env:
4949+ return Path(env).expanduser().resolve()
5050+ return (BACKEND_ROOT.parent / "tangled-discover-embeddings").resolve()
5151+5252+5353+def _dsn() -> str:
5454+ for c in (BACKEND_ROOT / ".env", BACKEND_ROOT / "recommendation" / ".env", BACKEND_ROOT / "scraper" / ".env"):
5555+ if c.exists():
5656+ load_dotenv(c)
5757+ break
5858+ else:
5959+ load_dotenv()
6060+ conn = os.environ.get("DB_CONNECTION_STRING", "").strip()
6161+ if not conn:
6262+ raise SystemExit("DB_CONNECTION_STRING not set (env or .env)")
6363+ if "sslmode=" not in conn:
6464+ conn += ("&" if "?" in conn else "?") + "sslmode=require"
6565+ return conn
6666+6767+6868+def issue_uri_to_relpath(issue_uri: str) -> str:
6969+ """at://<did>/sh.tangled.repo.issue/<rkey> -> questionnaires/<did>/<rkey>.json
7070+7171+ Shared convention with agent/questionnaire_repo_store.py — keep in sync."""
7272+ rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
7373+ parts = rest.split("/")
7474+ did, rkey = parts[0], parts[-1]
7575+ return f"questionnaires/{did}/{rkey}.json"
7676+7777+7878+def file_record(issue_uri, payload, created_at, updated_at) -> dict:
7979+ """The per-issue file shape (mirrors agent.questionnaire_store.get_questionnaire)."""
8080+ return {
8181+ "issue_uri": issue_uri,
8282+ "version": payload.get("version") if isinstance(payload, dict) else None,
8383+ "created_at": created_at.isoformat() if hasattr(created_at, "isoformat") else created_at,
8484+ "updated_at": updated_at.isoformat() if hasattr(updated_at, "isoformat") else updated_at,
8585+ "payload": payload,
8686+ }
8787+8888+8989+def main() -> int:
9090+ out = _out_dir()
9191+ qdir = out / "questionnaires"
9292+ qdir.mkdir(parents=True, exist_ok=True)
9393+ entries = []
9494+ with psycopg.connect(_dsn(), row_factory=dict_row) as conn:
9595+ rows = conn.execute(_SELECT).fetchall()
9696+ print(f"exporting {len(rows)} questionnaire(s) (read-only) -> {qdir}")
9797+ for r in rows:
9898+ payload = r["payload"]
9999+ if isinstance(payload, str):
100100+ payload = json.loads(payload)
101101+ rel = issue_uri_to_relpath(r["issue_uri"])
102102+ path = out / rel
103103+ path.parent.mkdir(parents=True, exist_ok=True)
104104+ body = json.dumps(file_record(r["issue_uri"], payload, r["created_at"], r["updated_at"]),
105105+ ensure_ascii=False, indent=2) + "\n"
106106+ path.write_text(body, encoding="utf-8")
107107+ entries.append({
108108+ "issue_uri": r["issue_uri"],
109109+ "path": rel,
110110+ "updated_at": r["updated_at"].isoformat() if hasattr(r["updated_at"], "isoformat") else r["updated_at"],
111111+ "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
112112+ })
113113+ print(f" {rel} ({len(body)} bytes)")
114114+115115+ index = {
116116+ "schema_version": 1,
117117+ "kind": "questionnaires",
118118+ "generated_at": dt.datetime.now(dt.timezone.utc).isoformat(),
119119+ "count": len(entries),
120120+ "entries": sorted(entries, key=lambda e: e["issue_uri"]),
121121+ }
122122+ (qdir / "index.json").write_text(json.dumps(index, indent=2) + "\n")
123123+ print(f"wrote {qdir / 'index.json'} (count={len(entries)})")
124124+ return 0
125125+126126+127127+if __name__ == "__main__":
128128+ raise SystemExit(main())
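
On the reading side, the path convention above means a questionnaire can be found from the issue URI alone, without consulting `index.json`. The sketch below illustrates that lookup under the same convention; `load_questionnaire` is a hypothetical helper, and the DID, rkey, and payload in the demo are made up. Note that the DID is used directly as a directory name, which contains colons and so assumes a filesystem that allows them.

```python
import json
import tempfile
from pathlib import Path


def issue_uri_to_relpath(issue_uri: str) -> str:
    """Same convention as the exporter: first path part is the DID, last is the rkey."""
    rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
    parts = rest.split("/")
    return f"questionnaires/{parts[0]}/{parts[-1]}.json"


def load_questionnaire(repo_dir: Path, issue_uri: str):
    """Return the per-issue record, or None if no file exists for this issue."""
    path = repo_dir / issue_uri_to_relpath(issue_uri)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Demo with a made-up issue URI and payload.
uri = "at://did:plc:example123/sh.tangled.repo.issue/3kabc"
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    target = repo / issue_uri_to_relpath(uri)
    target.parent.mkdir(parents=True, exist_ok=True)
    record = {"issue_uri": uri, "version": 1, "payload": {"version": 1}}
    target.write_text(json.dumps(record), encoding="utf-8")
    found = load_questionnaire(repo, uri)
    missing = load_questionnaire(repo, "at://did:plc:example123/sh.tangled.repo.issue/nope")
```

Returning `None` for a missing file keeps "no questionnaire yet" distinct from a malformed one, which would still raise from `json.loads`.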
+361
scraper/fetch_collaborators.py
···11+#!/usr/bin/env python3
22+"""Fetch collaborator lists for all repos via knot listCollaborators."""
33+44+from __future__ import annotations
55+66+import os
77+import sys
88+import threading
99+from concurrent.futures import ThreadPoolExecutor, as_completed
1010+from dataclasses import dataclass, field
1111+from pathlib import Path
1212+from typing import Any
1313+1414+import httpx
1515+from dotenv import load_dotenv
1616+1717+from db import connect, init_schema, set_crawl_state
1818+from parallel import concurrency_env
1919+from pds_client import knot_xrpc
2020+from progress import banner, log, metric, phase, step, summary_block
2121+2222+REPO_ROOT = Path(__file__).resolve().parent.parent
2323+CRAWL_KEY = "collaborators:fetch"
2424+PAGE_LIMIT = 1000
2525+2626+2727+@dataclass
2828+class CollabFetchResult:
2929+ repo_did: str
3030+ repo_uri: str | None
3131+ knot_hostname: str
3232+ status: str # ok | skipped_knot | error
3333+ collaborators: list[dict[str, Any]] = field(default_factory=list)
3434+ error: str | None = None
3535+3636+3737+def _repo_limit() -> int | None:
3838+ raw = os.getenv("TANGLED_COLLAB_REPO_LIMIT", "").strip()
3939+ if not raw:
4040+ return None
4141+ return max(1, int(raw))
4242+4343+4444+def _skip_existing() -> bool:
4545+ return os.getenv("TANGLED_COLLAB_REFRESH", "").strip().lower() not in (
4646+ "1",
4747+ "true",
4848+ "yes",
4949+ )
5050+5151+5252+def fetch_repo_collaborators(
5353+ client: httpx.Client,
5454+ *,
5555+ knot_hostname: str,
5656+ repo_did: str,
5757+) -> list[dict[str, Any]]:
5858+ items: list[dict[str, Any]] = []
5959+ cursor: str | None = None
6060+6161+ while True:
6262+ params: dict[str, Any] = {
6363+ "subject": repo_did,
6464+ "limit": PAGE_LIMIT,
6565+ }
6666+ if cursor:
6767+ params["cursor"] = cursor
6868+6969+ status, payload = knot_xrpc(
7070+ client,
7171+ knot_hostname,
7272+ "sh.tangled.repo.listCollaborators",
7373+ params,
7474+ )
7575+ if status != 200 or not isinstance(payload, dict):
7676+ raise RuntimeError(f"listCollaborators HTTP {status}")
7777+7878+ page = payload.get("items") or []
7979+ if isinstance(page, list):
8080+ items.extend(item for item in page if isinstance(item, dict))
8181+8282+ cursor = payload.get("cursor")
8383+ if not cursor or not page:
8484+ break
8585+8686+ return items
8787+8888+8989+def upsert_collaborators(
9090+ conn,
9191+ *,
9292+ repo_did: str,
9393+ collaborators: list[dict[str, Any]],
9494+) -> int:
9595+ conn.execute(
9696+ "delete from tangled_repo_collaborators where repo_did = %s",
9797+ (repo_did,),
9898+ )
9999+100100+ stored = 0
101101+ for item in collaborators:
102102+ collab_did = item.get("subject")
103103+ if not isinstance(collab_did, str) or not collab_did.startswith("did:"):
104104+ continue
105105+ conn.execute(
106106+ """
107107+ insert into tangled_repo_collaborators (
108108+ repo_did, collaborator_did, added_by, record_uri, record_cid,
109109+ created_at, last_synced_at
110110+ )
111111+ values (%s, %s, %s, %s, %s, %s::timestamptz, now())
112112+ on conflict (repo_did, collaborator_did) do update set
113113+ added_by = excluded.added_by,
114114+ record_uri = excluded.record_uri,
115115+ record_cid = excluded.record_cid,
116116+ created_at = excluded.created_at,
117117+ last_synced_at = now()
118118+ """,
119119+ (
120120+ repo_did,
121121+ collab_did,
122122+ item.get("addedBy") if isinstance(item.get("addedBy"), str) else None,
123123+ item.get("uri") if isinstance(item.get("uri"), str) else None,
124124+ item.get("cid") if isinstance(item.get("cid"), str) else None,
125125+ item.get("createdAt") if isinstance(item.get("createdAt"), str) else None,
126126+ ),
127127+ )
128128+ stored += 1
129129+130130+ conn.execute(
131131+ """
132132+ insert into tangled_repo_collaborators_sync (repo_did, collaborator_count, synced_at)
133133+ values (%s, %s, now())
134134+ on conflict (repo_did) do update set
135135+ collaborator_count = excluded.collaborator_count,
136136+ synced_at = now()
137137+ """,
138138+ (repo_did, stored),
139139+ )
140140+ return stored
141141+142142+143143+def _fetch_one(repo: dict[str, Any], reachable: set[str]) -> CollabFetchResult:
144144+ repo_did = repo["repo_did"]
145145+ knot = repo.get("knot_hostname") or ""
146146+ base = CollabFetchResult(
147147+ repo_did=repo_did,
148148+ repo_uri=repo.get("uri"),
149149+ knot_hostname=knot,
150150+ status="error",
151151+ )
152152+153153+ if not knot or knot not in reachable:
154154+ base.status = "skipped_knot"
155155+ base.error = f"knot not reachable: {knot or 'missing'}"
156156+ return base
157157+158158+ try:
159159+ with httpx.Client(timeout=60.0, follow_redirects=True) as client:
160160+ collaborators = fetch_repo_collaborators(
161161+ client, knot_hostname=knot, repo_did=repo_did
162162+ )
163163+ base.collaborators = collaborators
164164+ base.status = "ok"
165165+ return base
166166+ except Exception as exc:
167167+ base.error = str(exc)
168168+ return base
169169+170170+171171+def run_fetch_collaborators(dsn: str) -> dict[str, int]:
172172+ workers = concurrency_env("TANGLED_COLLAB_CONCURRENCY", default=20)
173173+ repo_limit = _repo_limit()
174174+ skip_existing = _skip_existing()
175175+176176+ banner("COLLABORATORS — knot listCollaborators for every repo")
177177+ log("collab", f"Concurrency: {workers}")
178178+ if repo_limit:
179179+ log("collab", f"Repo limit: {repo_limit}")
180180+ if skip_existing:
181181+ log(
182182+ "collab",
183183+ "Skip existing: on (set TANGLED_COLLAB_REFRESH=1 to re-fetch all)",
184184+ )
185185+ else:
186186+ log("collab", "Skip existing: off — refreshing every repo")
187187+188188+ with connect(dsn) as conn:
189189+ reachable = {
190190+ row["hostname"]
191191+ for row in conn.execute(
192192+ "select hostname from tangled_knots where reachable = true"
193193+ ).fetchall()
194194+ }
195195+ skip_clause = ""
196196+ if skip_existing:
197197+ skip_clause = """
198198+ and not exists (
199199+ select 1 from tangled_repo_collaborators_sync s
200200+ where s.repo_did = tangled_repos.repo_did
201201+ )
202202+ """
203203+ query = f"""
204204+ select uri, repo_did, knot_hostname, owner_handle, name
205205+ from tangled_repos
206206+ where repo_did is not null
207207+ and knot_hostname is not null
208208+ {skip_clause}
209209+ order by uri
210210+ """
211211+ if repo_limit:
212212+ query += f" limit {repo_limit}"
213213+ repos = conn.execute(query).fetchall()
214214+ synced_count = 0
215215+ if skip_existing:
216216+ synced_count = conn.execute(
217217+ "select count(*) as n from tangled_repo_collaborators_sync"
218218+ ).fetchone()["n"]
219219+ total_eligible = conn.execute(
220220+ """
221221+ select count(*) as n from tangled_repos
222222+ where repo_did is not null and knot_hostname is not null
223223+ """
224224+ ).fetchone()["n"]
225225+226226+ if not repos:
227227+ log("collab", "Nothing to fetch — all eligible repos already synced.")
228228+ return {
229229+ "repos_fetched": 0,
230230+ "collaborator_edges": 0,
231231+ "already_synced": total_eligible,
232232+ "skipped_knot": 0,
233233+ "errors": 0,
234234+ }
235235+236236+ already_synced = synced_count if skip_existing else 0
237237+ if skip_existing:
238238+ metric("Eligible repos", total_eligible)
239239+ metric("Already synced (skipped)", already_synced)
240240+ metric("To fetch", len(repos))
241241+242242+ stats = {
243243+ "repos_fetched": 0,
244244+ "collaborator_edges": 0,
245245+ "already_synced": already_synced,
246246+ "skipped_knot": 0,
247247+ "errors": 0,
248248+ }
249249+ done = 0
250250+ done_lock = threading.Lock()
251251+252252+ phase(1, f"Parallel listCollaborators ({workers} workers)")
253253+254254+ with connect(dsn) as conn:
255255+ set_crawl_state(
256256+ conn,
257257+ key=CRAWL_KEY,
258258+ status="running",
259259+ meta={"repo_count": len(repos), "workers": workers},
260260+ )
261261+ conn.commit()
262262+263263+ with ThreadPoolExecutor(max_workers=workers) as pool:
264264+ futures = {
265265+ pool.submit(_fetch_one, dict(repo), reachable): repo for repo in repos
266266+ }
267267+268268+ for future in as_completed(futures):
269269+ repo = futures[future]
270270+ label = f"{repo.get('owner_handle') or '?'}/{repo.get('name') or repo['repo_did'][:16]}"
271271+272272+ try:
273273+ result = future.result()
274274+ except Exception as exc:
275275+ result = CollabFetchResult(
276276+ repo_did=repo["repo_did"],
277277+ repo_uri=repo.get("uri"),
278278+ knot_hostname=repo.get("knot_hostname") or "",
279279+ status="error",
280280+ error=str(exc),
281281+ )
282282+283283+ with done_lock:
284284+ done += 1
285285+ n = done
286286+287287+ if result.status == "ok":
288288+ count = upsert_collaborators(
289289+ conn,
290290+ repo_did=result.repo_did,
291291+ collaborators=result.collaborators,
292292+ )
293293+ stats["repos_fetched"] += 1
294294+ stats["collaborator_edges"] += count
295295+ if n <= 10 or n % 100 == 0 or count > 0:
296296+ step(
297297+ "collab",
298298+ n,
299299+ len(repos),
300300+ f"OK {label} {count} collaborator(s)",
301301+ )
302302+ elif result.status == "skipped_knot":
303303+ stats["skipped_knot"] += 1
304304+ if n <= 10 or n % 200 == 0:
305305+ step("collab", n, len(repos), f"SKIP {label} {result.error}")
306306+ else:
307307+ stats["errors"] += 1
308308+ if n <= 10 or n % 100 == 0:
309309+ step(
310310+ "collab",
311311+ n,
312312+ len(repos),
313313+ f"ERROR {label} {result.error or 'unknown'}",
314314+ )
315315+316316+ if n % 50 == 0:
317317+ conn.commit()
318318+319319+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
320320+ conn.commit()
321321+322322+ summary_block(
323323+ "Collaborators fetch complete",
324324+ [
325325+ f"Repos fetched: {stats['repos_fetched']}",
326326+ f"Collaborator edges: {stats['collaborator_edges']}",
327327+ f"Already synced: {stats['already_synced']}",
328328+ f"Skipped knot: {stats['skipped_knot']}",
329329+ f"Errors: {stats['errors']}",
330330+ "",
331331+ "Repos a user collaborates on:",
332332+ " select * from tangled_user_collaborations",
333333+ " where user_did = 'did:plc:...';",
334334+ ],
335335+ )
336336+ return stats
337337+338338+339339+def main() -> None:
340340+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
341341+ if candidate.exists():
342342+ load_dotenv(candidate)
343343+ break
344344+ else:
345345+ load_dotenv()
346346+347347+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
348348+ if not dsn:
349349+ print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
350350+ raise SystemExit(1)
351351+352352+ init_schema(dsn)
353353+ run_fetch_collaborators(dsn)
354354+355355+356356+if __name__ == "__main__":
357357+ try:
358358+ main()
359359+ except KeyboardInterrupt:
360360+ print("\nInterrupted.", file=sys.stderr)
361361+ raise SystemExit(130) from None
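
The cursor loop in `fetch_repo_collaborators` stops when the cursor or the page is empty. A server that keeps returning the same non-empty cursor would keep that loop running, so a page cap and a seen-cursor check are a cheap safeguard. The sketch below shows that pattern in isolation; `paginate`, `fake_fetch`, and the `PAGES` data are hypothetical stand-ins, not the knot API.

```python
from typing import Callable, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 50) -> list:
    """Collect dict items across pages, stopping on an empty, invalid, or repeated cursor."""
    items = []
    cursor = None
    seen = set()
    for _ in range(max_pages):
        payload = fetch_page(cursor)
        page = payload.get("items") or []
        items.extend(item for item in page if isinstance(item, dict))
        next_cursor = payload.get("cursor")
        if not next_cursor or not page:
            break  # normal end of the listing
        if not isinstance(next_cursor, str) or next_cursor in seen:
            break  # malformed or repeated cursor: stop rather than loop
        seen.add(next_cursor)
        cursor = next_cursor
    return items


# Fake server whose second page repeats its own cursor.
PAGES = {
    None: {"items": [{"subject": "did:plc:a"}], "cursor": "c1"},
    "c1": {"items": [{"subject": "did:plc:b"}], "cursor": "c1"},
}


def fake_fetch(cursor):
    return PAGES[cursor]


collected = paginate(fake_fetch)
```

With the guard in place the fake listing ends after two pages instead of requesting `c1` repeatedly.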
+604
scraper/fetch_issues.py
···11+#!/usr/bin/env python3
22+"""Scrape sh.tangled.repo.issue (+ state) from every known user PDS."""
33+44+from __future__ import annotations
55+66+import json
77+import os
88+import sys
99+import threading
1010+import time
1111+from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
1212+from dataclasses import dataclass, field
1313+from pathlib import Path
1414+from typing import Any
1515+1616+import httpx
1717+from dotenv import load_dotenv
1818+1919+from db import connect, init_schema, set_crawl_state
2020+from parallel import concurrency_env
2121+from pds_client import list_records, pds_host_for_did
2222+from progress import banner, log, metric, phase, step, summary_block
2323+2424+REPO_ROOT = Path(__file__).resolve().parent.parent
2525+CRAWL_KEY = "issues:fetch"
2626+ISSUE_COLLECTION = "sh.tangled.repo.issue"
2727+STATE_COLLECTION = "sh.tangled.repo.issue.state"
2828+STATE_OPEN = "sh.tangled.repo.issue.state.open"
2929+STATE_CLOSED = "sh.tangled.repo.issue.state.closed"
3030+HTTP_TIMEOUT = httpx.Timeout(connect=5.0, read=15.0, write=10.0, pool=10.0)
3131+LOG_EVERY = 10
3232+HEARTBEAT_SEC = 15
3333+INFLIGHT_CHUNK = 200
3434+3535+3636+class _PdsCache:
3737+ def __init__(self) -> None:
3838+ self._hosts: dict[str, str | None] = {}
3939+ self._lock = threading.Lock()
4040+4141+ def resolve(self, client: httpx.Client, user_did: str, hint: str | None) -> str | None:
4242+ if hint:
4343+ return hint.rstrip("/")
4444+ with self._lock:
4545+ if user_did in self._hosts:
4646+ return self._hosts[user_did]
4747+ try:
4848+ pds = pds_host_for_did(client, user_did)
4949+ except httpx.HTTPError:
5050+ pds = None
5151+ with self._lock:
5252+ self._hosts[user_did] = pds.rstrip("/") if pds else None
5353+ return self._hosts[user_did]
5454+5555+5656+@dataclass
5757+class UserIssueResult:
5858+ user_did: str
5959+ handle: str | None
6060+ status: str # ok | error
6161+ issues: list[dict[str, Any]] = field(default_factory=list)
6262+ states: list[dict[str, Any]] = field(default_factory=list)
6363+ error: str | None = None
6464+6565+6666+def _user_limit() -> int | None:
6767+ raw = os.getenv("TANGLED_ISSUE_USER_LIMIT", "").strip()
6868+ if not raw:
6969+ return None
7070+ return max(1, int(raw))
7171+7272+7373+def _max_pages() -> int:
7474+ raw = os.getenv("TANGLED_ISSUE_MAX_PAGES", "50").strip()
7575+ return max(1, int(raw))
7676+7777+7878+def _skip_existing() -> bool:
7979+ return os.getenv("TANGLED_ISSUE_REFRESH", "").strip().lower() not in (
8080+ "1",
8181+ "true",
8282+ "yes",
8383+ )
8484+8585+8686+def _all_users() -> bool:
8787+ return os.getenv("TANGLED_ISSUE_ALL_USERS", "1").strip().lower() not in (
8888+ "0",
8989+ "false",
9090+ "no",
9191+ )
9292+9393+9494+def _users_query(*, skip_existing: bool, user_limit: int | None, all_users: bool) -> str:
9595+ skip_clause = ""
9696+ if skip_existing:
9797+ skip_clause = """
9898+ and not exists (
9999+ select 1 from tangled_issue_user_sync s where s.user_did = u.did
100100+ )
101101+ """
102102+ pds_union = ""
103103+ if all_users:
104104+ pds_union = """
105105+ union all
106106+ select did, handle, pds_host from tangled_pds_accounts
107107+ """
108108+ query = f"""
109109+ select distinct on (u.did) u.did, u.handle, u.pds_host
110110+ from (
111111+ select did, handle, pds_host from tangled_identities
112112+ union all
113113+ select owner_did as did,
114114+ max(owner_handle) as handle,
115115+ null::text as pds_host
116116+ from tangled_repos
117117+ where owner_did is not null
118118+ group by owner_did
119119+ {pds_union}
120120+ ) u
121121+ where u.did is not null
122122+ {skip_clause}
123123+ order by u.did, u.pds_host nulls last, u.handle nulls last
124124+ """
125125+ if user_limit:
126126+ query += f" limit {user_limit}"
127127+ return query
128128+129129+130130+def _total_users_sql(*, all_users: bool) -> str:
131131+ pds_union = ""
132132+ if all_users:
133133+ pds_union = "union select did from tangled_pds_accounts"
134134+ return f"""
135135+ select count(*) as n from (
136136+ select did from tangled_identities
137137+ union
138138+ select owner_did from tangled_repos where owner_did is not null
139139+ {pds_union}
140140+ ) x
141141+ """
142142+143143+144144+def _rkey_from_uri(uri: str) -> str:
145145+ return uri.rsplit("/", 1)[-1]
146146+147147+148148+def _parse_repo_refs(value: dict[str, Any]) -> tuple[str | None, str | None]:
149149+ repo = value.get("repo")
150150+ if isinstance(repo, str):
151151+ if repo.startswith("did:"):
152152+ return repo, None
153153+ if repo.startswith("at://"):
154154+ return _repo_did_from_at_uri(repo), repo
155155+ return None, (repo if isinstance(repo, str) else None)
156156+157157+158158+def _repo_did_from_at_uri(uri: str) -> str | None:
159159+ if not uri.startswith("at://"):
160160+ return None
161161+ parts = uri.removeprefix("at://").split("/")
162162+ return parts[0] if parts and parts[0].startswith("did:") else None
163163+164164+165165+def _list_all_records(
166166+ client: httpx.Client,
167167+ pds_host: str,
168168+ user_did: str,
169169+ collection: str,
170170+ *,
171171+ max_pages: int,
172172+) -> list[dict[str, Any]]:
173173+ records: list[dict[str, Any]] = []
174174+ cursor: str | None = None
175175+ seen_cursors: set[str] = set()
176176+177177+ for _ in range(max_pages):
178178+ data = list_records(
179179+ client, pds_host, user_did, collection, cursor=cursor, limit=100
180180+ )
181181+ page = data.get("records") or []
182182+ records.extend(rec for rec in page if isinstance(rec, dict))
183183+ next_cursor = data.get("cursor")
184184+ if not next_cursor or not page:
185185+ break
186186+ if not isinstance(next_cursor, str) or next_cursor in seen_cursors:
187187+ break
188188+ seen_cursors.add(next_cursor)
189189+ cursor = next_cursor
190190+ return records
191191+192192+193193+def _state_map(states: list[dict[str, Any]]) -> dict[str, str]:
194194+ mapping: dict[str, str] = {}
195195+ for rec in states:
196196+ value = rec.get("value")
197197+ if not isinstance(value, dict):
198198+ continue
199199+ issue_uri = value.get("issue")
200200+ state = value.get("state")
201201+ if not isinstance(state, str):
202202+ continue
203203+ if state == STATE_CLOSED:
204204+ normalized = "closed"
205205+ elif state == STATE_OPEN:
206206+ normalized = "open"
207207+ else:
208208+ normalized = "open"  # unrecognized state values default to open
209209+ if isinstance(issue_uri, str) and issue_uri:
210210+ mapping[issue_uri] = normalized
211211+ else:
212212+ rkey = _rkey_from_uri(rec["uri"]) if isinstance(rec.get("uri"), str) else None
213213+ if rkey:
214214+ mapping[f"rkey:{rkey}"] = normalized
215215+ return mapping
216216+217217+218218+def _issue_state(uri: str, rkey: str, states: dict[str, str]) -> str:
219219+ if uri in states:
220220+ return states[uri]
221221+ return states.get(f"rkey:{rkey}", "open")
222222+223223+224224+def _optional_timestamp(value: Any) -> str | None:
225225+ if not isinstance(value, str):
226226+ return None
227227+ value = value.strip()
228228+ return value if value else None
229229+230230+231231+def upsert_issue(
232232+ conn,
233233+ *,
234234+ record: dict[str, Any],
235235+ author_did: str,
236236+ author_handle: str | None,
237237+ state: str,
238238+) -> None:
239239+ uri = record["uri"]
240240+ value = record["value"]
241241+ rkey = _rkey_from_uri(uri)
242242+ repo_did, repo_uri = _parse_repo_refs(value)
243243+ title = value.get("title") if isinstance(value.get("title"), str) else None
244244+ body = value.get("body") if isinstance(value.get("body"), str) else None
245245+ created = _optional_timestamp(value.get("createdAt"))
246246+247247+ conn.execute(
248248+ """
249249+ insert into tangled_issues (
250250+ uri, author_did, author_handle, rkey, repo_did, repo_uri,
251251+ title, body, state, issue_created_at, cid, record_raw, fetched_at
252252+ )
253253+ values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s::timestamptz, %s, %s::jsonb, now())
254254+ on conflict (uri) do update set
255255+ author_did = excluded.author_did,
256256+ author_handle = excluded.author_handle,
257257+ rkey = excluded.rkey,
258258+ repo_did = coalesce(excluded.repo_did, tangled_issues.repo_did),
259259+ repo_uri = coalesce(excluded.repo_uri, tangled_issues.repo_uri),
260260+ title = excluded.title,
261261+ body = excluded.body,
262262+ state = excluded.state,
263263+ issue_created_at = excluded.issue_created_at,
264264+ cid = excluded.cid,
265265+ record_raw = excluded.record_raw,
266266+ fetched_at = now(),
267267+ embedding = case
268268+ when tangled_issues.title is distinct from excluded.title
269269+ or tangled_issues.body is distinct from excluded.body
270270+ then null else tangled_issues.embedding end,
271271+ embedding_model = case
272272+ when tangled_issues.title is distinct from excluded.title
273273+ or tangled_issues.body is distinct from excluded.body
274274+ then null else tangled_issues.embedding_model end,
275275+ embedded_at = case
276276+ when tangled_issues.title is distinct from excluded.title
277277+ or tangled_issues.body is distinct from excluded.body
278278+ then null else tangled_issues.embedded_at end
279279+ """,
280280+ (
281281+ uri,
282282+ author_did,
283283+ author_handle,
284284+ rkey,
285285+ repo_did,
286286+ repo_uri,
287287+ title,
288288+ body,
289289+ state,
290290+ created,
291291+ record.get("cid") if isinstance(record.get("cid"), str) else None,
292292+ json.dumps(value),
293293+ ),
294294+ )
295295+296296+297297+def _mark_user_synced(
298298+ conn,
299299+ *,
300300+ user_did: str,
301301+ issue_count: int,
302302+ status: str,
303303+ error_message: str | None = None,
304304+) -> None:
305305+ conn.execute(
306306+ """
307307+ insert into tangled_issue_user_sync (
308308+ user_did, issue_count, synced_at, status, error_message
309309+ )
310310+ values (%s, %s, now(), %s, %s)
311311+ on conflict (user_did) do update set
312312+ issue_count = excluded.issue_count,
313313+ synced_at = now(),
314314+ status = excluded.status,
315315+ error_message = excluded.error_message
316316+ """,
317317+ (user_did, issue_count, status, error_message),
318318+ )
319319+320320+321321+def _fetch_user_issues(
322322+ user_did: str,
323323+ handle: str | None,
324324+ pds_host: str | None,
325325+ cache: _PdsCache,
326326+ max_pages: int,
327327+) -> UserIssueResult:
328328+ result = UserIssueResult(user_did=user_did, handle=handle, status="error")
329329+ try:
330330+ with httpx.Client(timeout=HTTP_TIMEOUT, follow_redirects=True) as client:
331331+ pds = cache.resolve(client, user_did, pds_host)
332332+ if not pds:
333333+ result.error = "could not resolve PDS"
334334+ return result
335335+ issues = _list_all_records(
336336+ client, pds, user_did, ISSUE_COLLECTION, max_pages=max_pages
337337+ )
338338+ states: list[dict[str, Any]] = []
339339+ if issues:
340340+ states = _list_all_records(
341341+ client, pds, user_did, STATE_COLLECTION, max_pages=max_pages
342342+ )
343343+ result.issues = issues
344344+ result.states = states
345345+ result.status = "ok"
346346+ return result
347347+ except httpx.TimeoutException:
348348+ result.error = "PDS timeout"
349349+ return result
350350+ except httpx.HTTPError as exc:
351351+ result.error = str(exc)[:200]
352352+ return result
353353+ except Exception as exc:
354354+ result.error = str(exc)[:200]
355355+ return result
356356+357357+358358+def _heartbeat_loop(
359359+ *,
360360+ done: list[int],
361361+ total: int,
362362+ inflight: list[int],
363363+ last_done_at: list[float],
364364+ stop: threading.Event,
365365+) -> None:
366366+ while not stop.wait(HEARTBEAT_SEC):
367367+ n = done[0]
368368+ pending = total - n
369369+ active = inflight[0]
370370+ idle = time.monotonic() - last_done_at[0]
371371+ log(
372372+ "issues",
373373+ f"… heartbeat {n}/{total} done ({active} in-flight, "
374374+ f"{pending} pending, last +{idle:.0f}s)",
375375+ )
376376+377377+378378+def run_fetch_issues(dsn: str) -> dict[str, int]:
379379+ workers = concurrency_env("TANGLED_ISSUE_CONCURRENCY", default=10)
380380+ user_limit = _user_limit()
381381+ skip_existing = _skip_existing()
382382+ all_users = _all_users()
383383+ max_pages = _max_pages()
384384+385385+ banner("ISSUES — scrape sh.tangled.repo.issue from user PDSes")
386386+ log("issues", f"Concurrency: {workers} PDS read timeout: 15s")
387387+ log("issues", f"Max listRecords pages/user/collection: {max_pages}")
388388+ log("issues", f"User scope: {'all known DIDs (+ tngl PDS accounts)' if all_users else 'identities + repo owners'}")
389389+ if user_limit:
390390+ log("issues", f"User limit: {user_limit}")
391391+ if skip_existing:
392392+ log("issues", "Skip existing: on (set TANGLED_ISSUE_REFRESH=1 to re-scan all)")
393393+ else:
394394+ log("issues", "Skip existing: off — re-scanning every user (daily sync)")
395395+396396+ with connect(dsn) as conn:
397397+ users = conn.execute(
398398+ _users_query(skip_existing=skip_existing, user_limit=user_limit, all_users=all_users)
399399+ ).fetchall()
400400+ total_users = conn.execute(_total_users_sql(all_users=all_users)).fetchone()["n"]
401401+ synced = 0
402402+ if skip_existing:
403403+ synced = conn.execute("select count(*) as n from tangled_issue_user_sync").fetchone()["n"]
404404+405405+ if not users:
406406+ log("issues", "Nothing to fetch — all users already scanned.")
407407+ return {
408408+ "users_scanned": 0,
409409+ "issues_upserted": 0,
410410+ "open_issues": 0,
411411+ "already_synced": synced,
412412+ "errors": 0,
413413+ }
414414+415415+ already_synced = synced if skip_existing else 0
416416+ metric("Known users", total_users)
417417+ if skip_existing:
418418+ metric("Already synced (skipped)", already_synced)
419419+ metric("To scan", len(users))
420420+421421+ stats = {
422422+ "users_scanned": 0,
423423+ "issues_upserted": 0,
424424+ "open_issues": 0,
425425+ "already_synced": already_synced,
426426+ "errors": 0,
427427+ }
428428+ done_box = [0]
429429+ inflight_box = [0]
430430+ last_done_at = [time.monotonic()]
431431+ done_lock = threading.Lock()
432432+ pds_cache = _PdsCache()
433433+434434+ phase(1, f"Parallel PDS listRecords ({workers} workers)")
435435+ log("issues", f"Progress every {LOG_EVERY} users + heartbeat every {HEARTBEAT_SEC}s")
436436+437437+ stop_heartbeat = threading.Event()
438438+ heartbeat = threading.Thread(
439439+ target=_heartbeat_loop,
440440+ kwargs={
441441+ "done": done_box,
442442+ "total": len(users),
443443+ "inflight": inflight_box,
444444+ "last_done_at": last_done_at,
445445+ "stop": stop_heartbeat,
446446+ },
447447+ daemon=True,
448448+ )
449449+ heartbeat.start()
450450+451451+ try:
452452+ with connect(dsn) as conn:
453453+ set_crawl_state(
454454+ conn,
455455+ key=CRAWL_KEY,
456456+ status="running",
457457+ meta={"user_count": len(users), "workers": workers},
458458+ )
459459+ conn.commit()
460460+461461+ user_iter = iter(users)
462462+ pending_futures: dict[Any, dict[str, Any]] = {}
463463+464464+ def submit_more(pool: ThreadPoolExecutor) -> None:
465465+ while len(pending_futures) < INFLIGHT_CHUNK:
466466+ try:
467467+ row = next(user_iter)
468468+ except StopIteration:
469469+ break
470470+ fut = pool.submit(
471471+ _fetch_user_issues,
472472+ row["did"],
473473+ row.get("handle"),
474474+ row.get("pds_host"),
475475+ pds_cache,
476476+ max_pages,
477477+ )
478478+ pending_futures[fut] = row
479479+ inflight_box[0] = len(pending_futures)
480480+481481+ with ThreadPoolExecutor(max_workers=workers) as pool:
482482+ submit_more(pool)
483483+484484+ while pending_futures:
485485+ done_set, _ = wait(pending_futures, timeout=HEARTBEAT_SEC, return_when=FIRST_COMPLETED)
486486+ if not done_set:
487487+ continue
488488+489489+ for future in done_set:
490490+ row = pending_futures.pop(future)
491491+ label = row.get("handle") or row["did"][:20]
492492+493493+ try:
494494+ result = future.result()
495495+ except Exception as exc:
496496+ result = UserIssueResult(
497497+ user_did=row["did"],
498498+ handle=row.get("handle"),
499499+ status="error",
500500+ error=str(exc)[:200],
501501+ )
502502+503503+ with done_lock:
504504+ done_box[0] += 1
505505+ n = done_box[0]
506506+ last_done_at[0] = time.monotonic()
507507+508508+ if result.status == "ok":
509509+ states = _state_map(result.states)
510510+ upserted = 0
511511+ open_n = 0
512512+ for rec in result.issues:
513513+ if not isinstance(rec.get("uri"), str) or not isinstance(
514514+ rec.get("value"), dict
515515+ ):
516516+ continue
517517+ rkey = _rkey_from_uri(rec["uri"])
518518+ state = _issue_state(rec["uri"], rkey, states)
519519+ upsert_issue(
520520+ conn,
521521+ record=rec,
522522+ author_did=result.user_did,
523523+ author_handle=result.handle,
524524+ state=state,
525525+ )
526526+ upserted += 1
527527+ if state == "open":
528528+ open_n += 1
529529+530530+ _mark_user_synced(
531531+ conn,
532532+ user_did=result.user_did,
533533+ issue_count=upserted,
534534+ status="ok",
535535+ )
536536+ stats["users_scanned"] += 1
537537+ stats["issues_upserted"] += upserted
538538+ stats["open_issues"] += open_n
539539+ msg = f"OK {label} {upserted} issue(s) ({open_n} open)"
540540+ else:
541541+ _mark_user_synced(
542542+ conn,
543543+ user_did=result.user_did,
544544+ issue_count=0,
545545+ status="error",
546546+ error_message=result.error,
547547+ )
548548+ stats["errors"] += 1
549549+ msg = f"ERROR {label} {result.error or 'unknown'}"
550550+551551+ if n <= 10 or n % LOG_EVERY == 0 or result.issues:
552552+ step("issues", n, len(users), msg)
553553+554554+ if n % 25 == 0:
555555+ conn.commit()
556556+557557+ submit_more(pool)
558558+ inflight_box[0] = len(pending_futures)
559559+560560+ set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
561561+ conn.commit()
562562+ finally:
563563+ stop_heartbeat.set()
564564+ heartbeat.join(timeout=1)
565565+566566+ summary_block(
567567+ "Issues fetch complete",
568568+ [
569569+ f"Users scanned: {stats['users_scanned']}",
570570+ f"Issues upserted: {stats['issues_upserted']}",
571571+ f"Open (this run): {stats['open_issues']}",
572572+ f"Already synced: {stats['already_synced']}",
573573+ f"Errors: {stats['errors']}",
574574+ "",
575575+ "Query open issues:",
576576+ " select count(*) from tangled_open_issues;",
577577+ ],
578578+ )
579579+ return stats
580580+581581+582582+def main() -> None:
583583+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
584584+ if candidate.exists():
585585+ load_dotenv(candidate)
586586+ break
587587+ else:
588588+ load_dotenv()
589589+590590+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
591591+ if not dsn:
592592+ print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
593593+ raise SystemExit(1)
594594+595595+ init_schema(dsn)
596596+ run_fetch_issues(dsn)
597597+598598+599599+if __name__ == "__main__":
600600+ try:
601601+ main()
602602+ except KeyboardInterrupt:
603603+ print("\nInterrupted.", file=sys.stderr)
604604+ raise SystemExit(130) from None
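Review note: the scheduling loop in `run_fetch_issues` keeps a bounded number of futures pending and tops the pool up after each completed batch, so a large user list is never submitted all at once. The sketch below isolates that pattern under stated assumptions: `run_bounded`, `fetch`, and `max_inflight` are hypothetical names chosen for illustration and are not part of this PR.

```python
from __future__ import annotations

from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Any, Callable, Iterable


def run_bounded(
    items: Iterable[Any],
    fetch: Callable[[Any], Any],
    *,
    workers: int = 4,
    max_inflight: int = 8,
) -> list[tuple[Any, Any, str | None]]:
    """Run fetch(item) on a thread pool with at most max_inflight pending futures.

    Returns (item, result, error) triples in completion order; error is None
    on success and a truncated message when fetch raised.
    """
    it = iter(items)
    pending: dict[Future, Any] = {}
    out: list[tuple[Any, Any, str | None]] = []

    def submit_more(pool: ThreadPoolExecutor) -> None:
        # Top up until the in-flight cap is reached or the input is exhausted.
        while len(pending) < max_inflight:
            try:
                item = next(it)
            except StopIteration:
                break
            pending[pool.submit(fetch, item)] = item

    with ThreadPoolExecutor(max_workers=workers) as pool:
        submit_more(pool)
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                item = pending.pop(fut)
                try:
                    out.append((item, fut.result(), None))
                except Exception as exc:  # one failed item must not stop the run
                    out.append((item, None, str(exc)[:200]))
            submit_more(pool)
    return out
```

Capping the pending set keeps memory flat and lets the caller process results on a single thread, which is why the database writes in the PR can share one connection.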
+556
scraper/ingest_handle.py
···11+#!/usr/bin/env python3
22+"""Full ingest for one Tangled handle: identity → repos → READMEs + embeddings → issues + embeddings.
33+44+Onboards a single user for recommendations/testing without a network-wide crawl.
55+66+Usage (from scraper/, with repo-root .env):
77+ python ingest_handle.py arsenii.tngl.sh
88+ python ingest_handle.py did:plc:abc123
99+ python ingest_handle.py arsenii.tngl.sh --skip-issues
1010+ python ingest_handle.py arsenii.tngl.sh --force-embed
1111+1212+Requires: DB_CONNECTION_STRING, GEMINI_API_KEY (for embeddings).
1313+"""
1414+1515+from __future__ import annotations
1616+1717+import argparse
1818+import json
1919+import os
2020+import sys
2121+from pathlib import Path
2222+2323+import httpx
2424+from dotenv import load_dotenv
2525+2626+from db import connect, init_schema
2727+from embeddings import (
2828+ batch_size,
2929+ embed_texts,
3030+ embedding_model,
3131+ gemini_api_key,
3232+ truncate,
3333+)
3434+from fetch_issues import (
3535+ UserIssueResult,
3636+ _fetch_user_issues,
3737+ _issue_state,
3838+ _mark_user_synced,
3939+ _PdsCache,
4040+ _rkey_from_uri,
4141+ _state_map,
4242+ upsert_issue,
4343+)
4444+from progress import banner, log, summary_block
4545+4646+REPO_ROOT = Path(__file__).resolve().parent.parent
4747+REPO_COLLECTION = "sh.tangled.repo"
4848+RESOLVE_PDS = (
4949+ "https://tngl.sh",
5050+ "https://bsky.social",
5151+ "https://public.api.bsky.app",
5252+)
5353+5454+5555+def load_env() -> None:
5656+ for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
5757+ if candidate.exists():
5858+ load_dotenv(candidate)
5959+ return
6060+ load_dotenv()
6161+6262+6363+def normalize_handle(raw: str) -> str:
6464+ return raw.strip().lstrip("@")
6565+6666+6767+def resolve_handle_http(client: httpx.Client, handle: str) -> str | None:
6868+ for base in RESOLVE_PDS:
6969+ try:
7070+ resp = client.get(
7171+ f"{base.rstrip('/')}/xrpc/com.atproto.identity.resolveHandle",
7272+ params={"handle": handle},
7373+ timeout=20.0,
7474+ )
7575+ if resp.status_code == 200:
7676+ did = resp.json().get("did")
7777+ if isinstance(did, str) and did.startswith("did:"):
7878+ return did
7979+ except httpx.HTTPError:
8080+ continue
8181+ return None
8282+8383+8484+def resolve_did(client: httpx.Client, conn, handle_or_did: str) -> str:
8585+ raw = handle_or_did.strip()
8686+ if raw.startswith("did:"):
8787+ return raw
8888+ handle = normalize_handle(raw)
8989+ did = resolve_handle_http(client, handle)
9090+ if did:
9191+ return did
9292+ row = conn.execute(
9393+ "select did from tangled_identities where handle = %s limit 1",
9494+ (handle,),
9595+ ).fetchone()
9696+ if row:
9797+ return row["did"]
9898+ raise SystemExit(
9999+ f"ERROR: could not resolve handle {handle!r} "
100100+ f"(tried {', '.join(RESOLVE_PDS)} and tangled_identities)"
101101+ )
102102+103103+104104+def resolve_identity(client: httpx.Client, did: str) -> tuple[str, str | None]:
105105+ """Return (pds_endpoint, handle) from the PLC DID document."""
106106+ resp = client.get(f"https://plc.directory/{did}", timeout=20.0)
107107+ resp.raise_for_status()
108108+ doc = resp.json()
109109+ pds = next(
110110+ s["serviceEndpoint"]
111111+ for s in doc["service"]
112112+ if s.get("id") == "#atproto_pds"
113113+ )
114114+ handle = None
115115+ for aka in doc.get("alsoKnownAs", []):
116116+ if isinstance(aka, str) and aka.startswith("at://"):
117117+ handle = aka.removeprefix("at://")
118118+ break
119119+ return pds.rstrip("/"), handle
120120+121121+122122+def list_repos(client: httpx.Client, pds: str, did: str) -> list[dict]:
123123+ records: list[dict] = []
124124+ cursor: str | None = None
125125+ while True:
126126+ params: dict[str, str | int] = {
127127+ "repo": did,
128128+ "collection": REPO_COLLECTION,
129129+ "limit": 100,
130130+ }
131131+ if cursor:
132132+ params["cursor"] = cursor
133133+ resp = client.get(
134134+ f"{pds}/xrpc/com.atproto.repo.listRecords",
135135+ params=params,
136136+ timeout=30.0,
137137+ )
138138+ resp.raise_for_status()
139139+ data = resp.json()
140140+ page = data.get("records") or []
141141+ records.extend(rec for rec in page if isinstance(rec, dict))
142142+ cursor = data.get("cursor")
143143+ if not cursor or not page:
144144+ break
145145+ return records
146146+147147+148148+def fetch_readme(
149149+ client: httpx.Client, knot: str, repo_did: str
150150+) -> tuple[str | None, str | None]:
151151+ resp = client.get(
152152+ f"https://{knot}/xrpc/sh.tangled.repo.tree",
153153+ params={"repo": repo_did, "path": ""},
154154+ timeout=30.0,
155155+ )
156156+ if resp.status_code != 200:
157157+ return None, None
158158+ readme = (resp.json() or {}).get("readme")
159159+ if not isinstance(readme, dict):
160160+ return None, None
161161+ contents = readme.get("contents")
162162+ if not isinstance(contents, str) or not contents.strip():
163163+ return None, None
164164+ filename = readme.get("filename")
165165+ return (filename if isinstance(filename, str) else None), contents
166166+167167+168168+def vector_literal(vec: list[float]) -> str:
169169+ return "[" + ",".join(repr(x) for x in vec) + "]"
170170+171171+172172+def ingest_repos_and_readmes(
173173+ conn,
174174+ *,
175175+ http: httpx.Client,
176176+ did: str,
177177+ handle: str | None,
178178+ pds: str,
179179+ api_key: str,
180180+ model: str,
181181+ force_embed: bool,
182182+) -> dict[str, int]:
183183+ stats = {"repos": 0, "readmes_found": 0, "readmes_embedded": 0, "readmes_missing": 0}
184184+185185+ conn.execute(
186186+ """
187187+ insert into tangled_identities (did, handle, pds_host, last_synced_at)
188188+ values (%s, %s, %s, now())
189189+ on conflict (did) do update set
190190+ handle = coalesce(excluded.handle, tangled_identities.handle),
191191+ pds_host = coalesce(excluded.pds_host, tangled_identities.pds_host),
192192+ last_synced_at = now()
193193+ """,
194194+ (did, handle, pds),
195195+ )
196196+197197+ records = list_repos(http, pds, did)
198198+ log("repos", f"Found {len(records)} sh.tangled.repo record(s) on PDS")
199199+200200+ ingested: list[dict] = []
201201+ for rec in records:
202202+ uri = rec["uri"]
203203+ value = rec["value"]
204204+ if not isinstance(value, dict):
205205+ continue
206206+ rkey = uri.rsplit("/", 1)[-1]
207207+ repo_did = value.get("repoDid")
208208+ knot = value.get("knot")
209209+ name = value.get("name") or rkey
210210+ if not repo_did or not knot:
211211+ log("repos", f" SKIP {name}: missing repoDid/knot")
212212+ continue
213213+ path, content = fetch_readme(http, knot, repo_did)
214214+ status = "found" if content else "missing"
215215+ if status == "found":
216216+ stats["readmes_found"] += 1
217217+ else:
218218+ stats["readmes_missing"] += 1
219219+ log(
220220+ "repos",
221221+ f" {name:20} readme={status}"
222222+ + (f" ({len(content)} chars)" if content else ""),
223223+ )
224224+ ingested.append(
225225+ {
226226+ "uri": uri,
227227+ "value": value,
228228+ "rkey": rkey,
229229+ "repo_did": repo_did,
230230+ "knot": knot,
231231+ "name": name,
232232+ "cid": rec.get("cid"),
233233+ "readme_path": path,
234234+ "content": content,
235235+ "status": status,
236236+ }
237237+ )
238238+ stats["repos"] += 1
239239+240240+ found_rows = [r for r in ingested if r["status"] == "found"]
241241+ if force_embed:
242242+ to_embed = found_rows
243243+ else:
244244+ dids = [r["repo_did"] for r in found_rows]
245245+ if dids:
246246+ existing = {
247247+ row["repo_did"]
248248+ for row in conn.execute(
249249+ "select repo_did from tangled_readmes "
250250+ "where repo_did = any(%s) and embedding is not null",
251251+ (dids,),
252252+ ).fetchall()
253253+ }
254254+ else:
255255+ existing = set()
256256+ to_embed = [r for r in found_rows if r["repo_did"] not in existing]
257257+ vectors: dict[str, str] = {}
258258+ if to_embed:
259259+ vecs = embed_texts(
260260+ http,
261261+ api_key=api_key,
262262+ texts=[truncate(r["content"]) for r in to_embed],
263263+ )
264264+ vectors = {r["repo_did"]: vector_literal(v) for r, v in zip(to_embed, vecs, strict=True)}
265265+ stats["readmes_embedded"] = len(vectors)
266266+ log("embed", f"Embedded {len(vectors)} README(s) ({model}, 1536-d, L2)")
267267+268268+ for r in ingested:
269269+ conn.execute(
270270+ """
271271+ insert into tangled_repos (
272272+ uri, owner_did, owner_handle, rkey, repo_did, name, knot_hostname,
273273+ cid, record_raw, discovered_via, last_synced_at
274274+ )
275275+ values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, 'ingest_handle', now())
276276+ on conflict (uri) do update set
277277+ owner_did = excluded.owner_did,
278278+ owner_handle = excluded.owner_handle,
279279+ repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did),
280280+ name = coalesce(excluded.name, tangled_repos.name),
281281+ knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname),
282282+ cid = excluded.cid,
283283+ record_raw = excluded.record_raw,
284284+ last_synced_at = now()
285285+ """,
286286+ (
287287+ r["uri"],
288288+ did,
289289+ handle,
290290+ r["rkey"],
291291+ r["repo_did"],
292292+ r["name"],
293293+ r["knot"],
294294+ r["cid"],
295295+ json.dumps(r["value"]),
296296+ ),
297297+ )
298298+299299+ vec = vectors.get(r["repo_did"])
300300+ conn.execute(
301301+ """
302302+ insert into tangled_readmes (
303303+ repo_did, repo_uri, owner_handle, repo_name, knot_hostname,
304304+ readme_path, status, content, size_bytes, fetched_at,
305305+ embedding, embedding_model, embedded_at
306306+ )
307307+ values (%s, %s, %s, %s, %s, %s, %s, %s, %s, now(),
308308+ %s::vector, %s, case when %s::text is null then null else now() end)
309309+ on conflict (repo_did) do update set
310310+ repo_uri = excluded.repo_uri,
311311+ owner_handle = excluded.owner_handle,
312312+ repo_name = excluded.repo_name,
313313+ knot_hostname = excluded.knot_hostname,
314314+ readme_path = excluded.readme_path,
315315+ status = excluded.status,
316316+ content = excluded.content,
317317+ size_bytes = excluded.size_bytes,
318318+ fetched_at = now(),
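319319+ embedding = coalesce(excluded.embedding, tangled_readmes.embedding),
320320+ embedding_model = coalesce(excluded.embedding_model, tangled_readmes.embedding_model),
321321+ embedded_at = coalesce(excluded.embedded_at, tangled_readmes.embedded_at)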
322322+ """,
323323+ (
324324+ r["repo_did"],
325325+ r["uri"],
326326+ handle,
327327+ r["name"],
328328+ r["knot"],
329329+ r["readme_path"],
330330+ r["status"],
331331+ r["content"],
332332+ len(r["content"].encode()) if r["content"] else None,
333333+ vec,
334334+ model if vec else None,
335335+ vec,
336336+ ),
337337+ )
338338+339339+ return stats
340340+341341+342342+def ingest_issues(
343343+ conn,
344344+ *,
345345+ did: str,
346346+ handle: str | None,
347347+ pds: str,
348348+ max_pages: int,
349349+) -> dict[str, int]:
350350+ stats = {"issues": 0, "open": 0, "errors": 0}
351351+ cache = _PdsCache()
352352+ result: UserIssueResult = _fetch_user_issues(
353353+ did, handle, pds, cache, max_pages=max_pages
354354+ )
355355+ if result.status != "ok":
356356+ stats["errors"] = 1
357357+ log("issues", f"ERROR fetching issues: {result.error}")
358358+ _mark_user_synced(
359359+ conn,
360360+ user_did=did,
361361+ issue_count=0,
362362+ status="error",
363363+ error_message=result.error,
364364+ )
365365+ return stats
366366+367367+ states = _state_map(result.states)
368368+ for rec in result.issues:
369369+ if not isinstance(rec.get("uri"), str) or not isinstance(rec.get("value"), dict):
370370+ continue
371371+ rkey = _rkey_from_uri(rec["uri"])
372372+ state = _issue_state(rec["uri"], rkey, states)
373373+ upsert_issue(
374374+ conn,
375375+ record=rec,
376376+ author_did=did,
377377+ author_handle=handle,
378378+ state=state,
379379+ )
380380+ stats["issues"] += 1
381381+ if state == "open":
382382+ stats["open"] += 1
383383+384384+ _mark_user_synced(conn, user_did=did, issue_count=stats["issues"], status="ok")
385385+ log("issues", f"Upserted {stats['issues']} issue(s) ({stats['open']} open)")
386386+ return stats
387387+388388+389389+def embed_user_issues(
390390+ conn,
391391+ *,
392392+ http: httpx.Client,
393393+ did: str,
394394+ api_key: str,
395395+ model: str,
396396+ force: bool,
397397+) -> int:
398398+ where = "repo_did in (select repo_did from tangled_repos where owner_did = %s)"
399399+ params: list = [did]
400400+ if not force:
401401+ where += " and embedding is null"
402402+ rows = conn.execute(
403403+ f"""
404404+ select uri, title, body
405405+ from tangled_issues
406406+ where {where}
407407+ and coalesce(nullif(trim(title), ''), nullif(trim(body), '')) is not null
408408+ order by fetched_at desc
409409+ """,
410410+ params,
411411+ ).fetchall()
412412+ if not rows:
413413+ log("embed-issues", "No issues to embed for this user")
414414+ return 0
415415+416416+ bs = batch_size()
417417+ embedded = 0
418418+ for start in range(0, len(rows), bs):
419419+ batch = rows[start : start + bs]
420420+ texts = [
421421+ truncate("\n\n".join(p for p in (r.get("title"), r.get("body")) if p and p.strip()))
422422+ for r in batch
423423+ ]
424424+ vectors = embed_texts(http, api_key=api_key, texts=texts)
425425+ for row, vec in zip(batch, vectors, strict=True):
426426+ conn.execute(
427427+ """
428428+ update tangled_issues
429429+ set embedding = %s::vector,
430430+ embedding_model = %s,
431431+ embedded_at = now()
432432+ where uri = %s
433433+ """,
434434+ (vector_literal(vec), model, row["uri"]),
435435+ )
436436+ embedded += len(batch)
437437+ log("embed-issues", f"Embedded {embedded} issue(s)")
438438+ return embedded
439439+440440+441441+def run(
442442+ handle_or_did: str,
443443+ *,
444444+ skip_issues: bool,
445445+ force_embed: bool,
446446+ max_pages: int,
447447+ init_db: bool,
448448+) -> int:
449449+ load_env()
450450+ dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
451451+ if not dsn:
452452+ print("ERROR: DB_CONNECTION_STRING is not set", file=sys.stderr)
453453+ return 1
454454+455455+ api_key = gemini_api_key()
456456+ model = embedding_model()
457457+458458+ banner(f"INGEST HANDLE — {handle_or_did}")
459459+ if init_db:
460460+ log("setup", "Applying migrations…")
461461+ init_schema(dsn)
462462+463463+ repo_stats: dict[str, int] = {}
464464+ issue_stats: dict[str, int] = {}
465465+ issues_embedded = 0
466466+467467+ with httpx.Client(timeout=60.0, follow_redirects=True) as http, connect(dsn) as conn:
468468+ did = resolve_did(http, conn, handle_or_did)
469469+ pds, handle = resolve_identity(http, did)
470470+ log("identity", f"DID={did}")
471471+ log("identity", f"handle={handle} pds={pds}")
472472+473473+ repo_stats = ingest_repos_and_readmes(
474474+ conn,
475475+ http=http,
476476+ did=did,
477477+ handle=handle,
478478+ pds=pds,
479479+ api_key=api_key,
480480+ model=model,
481481+ force_embed=force_embed,
482482+ )
483483+484484+ if not skip_issues:
485485+ issue_stats = ingest_issues(
486486+ conn, did=did, handle=handle, pds=pds, max_pages=max_pages
487487+ )
488488+ issues_embedded = embed_user_issues(
489489+ conn,
490490+ http=http,
491491+ did=did,
492492+ api_key=api_key,
493493+ model=model,
494494+ force=force_embed,
495495+ )
496496+497497+ conn.commit()
498498+499499+ summary_block(
500500+ f"Ingest complete — {handle or did}",
501501+ [
502502+ f"DID: {did}",
503503+ f"Handle: {handle or '(unknown)'}",
504504+ f"Repos: {repo_stats.get('repos', 0)}",
505505+ f"READMEs found: {repo_stats.get('readmes_found', 0)}",
506506+ f"READMEs embedded: {repo_stats.get('readmes_embedded', 0)}",
507507+ f"READMEs missing: {repo_stats.get('readmes_missing', 0)}",
508508+ f"Issues upserted: {issue_stats.get('issues', 0)}",
509509+ f"Open issues: {issue_stats.get('open', 0)}",
510510+ f"Issues embedded: {issues_embedded}",
511511+ "",
512512+ "Test recommendations:",
513513+ f" curl 'http://localhost:8000/recommendations?handle={did}'",
514514+ ],
515515+ )
516516+ return 0
517517+518518+519519+def main(argv: list[str] | None = None) -> int:
520520+ parser = argparse.ArgumentParser(
521521+ description="Ingest one Tangled user by handle: repos, README embeddings, issues."
522522+ )
523523+ parser.add_argument("handle", help="Handle (e.g. arsenii.tngl.sh) or did:plc:…")
524524+ parser.add_argument(
525525+ "--skip-issues",
526526+ action="store_true",
527527+ help="Only ingest repos + README embeddings",
528528+ )
529529+ parser.add_argument(
530530+ "--force-embed",
531531+ action="store_true",
532532+ help="Re-embed READMEs and issues even if vectors already exist",
533533+ )
534534+ parser.add_argument(
535535+ "--max-pages",
536536+ type=int,
537537+ default=int(os.getenv("TANGLED_ISSUE_MAX_PAGES", "50")),
538538+ help="Max listRecords pages per issue collection (default: 50)",
539539+ )
540540+ parser.add_argument(
541541+ "--init-db",
542542+ action="store_true",
543543+ help="Run supabase migrations before ingest",
544544+ )
545545+ args = parser.parse_args(argv)
546546+ return run(
547547+ args.handle,
548548+ skip_issues=args.skip_issues,
549549+ force_embed=args.force_embed,
550550+ max_pages=max(1, args.max_pages),
551551+ init_db=args.init_db,
552552+ )
553553+554554+555555+if __name__ == "__main__":
556556+ raise SystemExit(main())
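Review note: `resolve_identity` reads the PDS endpoint and handle out of the DID document and assumes both the `service` list and an `#atproto_pds` entry are present. The sketch below shows a more tolerant version of the same extraction, assuming a hypothetical helper name `parse_did_doc` that does not exist in this PR; it returns `None` for missing parts instead of raising.

```python
from __future__ import annotations

from typing import Any


def parse_did_doc(doc: dict[str, Any]) -> tuple[str | None, str | None]:
    """Return (pds_endpoint, handle); either value is None when absent."""
    # First service entry whose id marks it as the account's PDS, if any.
    pds = next(
        (
            s.get("serviceEndpoint")
            for s in doc.get("service") or []
            if isinstance(s, dict) and s.get("id") == "#atproto_pds"
        ),
        None,
    )
    pds = pds.rstrip("/") if isinstance(pds, str) else None

    # The handle is the first alsoKnownAs entry using the at:// scheme.
    handle = None
    for aka in doc.get("alsoKnownAs") or []:
        if isinstance(aka, str) and aka.startswith("at://"):
            handle = aka.removeprefix("at://")
            break
    return pds, handle
```

A caller could then turn a missing PDS into a clear error message rather than letting a `StopIteration` or `KeyError` surface from the parsing step.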
+10
scraper/parallel.py
···11+from __future__ import annotations
22+33+import os
44+55+66+def concurrency_env(name: str, default: int = 20, *, max_cap: int = 64) -> int:
77+ raw = os.getenv(name, "").strip()
88+ if not raw:
99+ return default
1010+ return max(1, min(max_cap, int(raw)))
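Review note: `concurrency_env` clamps the value to the range 1 to `max_cap`, but `int(raw)` raises `ValueError` when the variable holds non-numeric text such as `auto`. If falling back to the default is preferred over failing fast, a variant could look like the sketch below. The name `concurrency_env_safe` is hypothetical and the fallback behaviour is an assumption, not something this PR specifies.

```python
from __future__ import annotations

import os


def concurrency_env_safe(name: str, default: int = 20, *, max_cap: int = 64) -> int:
    """Read a worker count from the environment, clamped to [1, max_cap].

    Unset, blank, or non-integer values fall back to `default`.
    """
    raw = os.getenv(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Assumption: a malformed setting should not abort the whole run.
        return default
    return max(1, min(max_cap, value))
```

Failing fast, as the PR does, is also defensible for a batch job; the trade-off is between a loud misconfiguration and a run that silently uses the default.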
···11+-- Stage 0 + 1 tables for the Tangled scraper.
22+-- Safe to re-run: uses IF NOT EXISTS.
33+44+create extension if not exists "pgcrypto";
55+66+create table if not exists public.tangled_lexicons (
77+ nsid text primary key,
88+ lexicon_type text not null,
99+ definition jsonb not null,
1010+ source_path text not null,
1111+ fetched_at timestamptz not null default now()
1212+);
1313+1414+create table if not exists public.tangled_knots (
1515+ hostname text primary key,
1616+ reachable boolean not null default false,
1717+ owner_did text,
1818+ version text,
1919+ capabilities jsonb,
2020+ version_raw jsonb,
2121+ owner_raw jsonb,
2222+ probe_error text,
2323+ first_seen_at timestamptz not null default now(),
2424+ last_probed_at timestamptz not null default now()
2525+);
2626+2727+create table if not exists public.tangled_crawl_state (
2828+ key text primary key,
2929+ status text not null default 'pending',
3030+ meta jsonb,
3131+ last_error text,
3232+ updated_at timestamptz not null default now()
3333+);
3434+3535+create index if not exists tangled_knots_reachable_idx
3636+ on public.tangled_knots (reachable);
3737+3838+create index if not exists tangled_lexicons_type_idx
3939+ on public.tangled_lexicons (lexicon_type);
···11+-- Stage 2: PDS accounts + repo records from tngl.sh
22+33+create table if not exists public.tangled_pds_accounts (
44+ did text primary key,
55+ pds_host text not null,
66+ head text,
77+ rev text,
88+ active boolean,
99+ handle text,
1010+ list_repos_raw jsonb not null,
1111+ repo_record_count integer not null default 0,
1212+ first_seen_at timestamptz not null default now(),
1313+ last_synced_at timestamptz not null default now()
1414+);
1515+1616+create table if not exists public.tangled_repos (
1717+ uri text primary key,
1818+ owner_did text not null,
1919+ rkey text not null,
2020+ repo_did text,
2121+ name text,
2222+ knot_hostname text,
2323+ cid text,
2424+ record_raw jsonb not null,
2525+ describe_raw jsonb,
2626+ first_seen_at timestamptz not null default now(),
2727+ last_synced_at timestamptz not null default now(),
2828+ unique (owner_did, rkey)
2929+);
3030+3131+create index if not exists tangled_pds_accounts_handle_idx
3232+ on public.tangled_pds_accounts (handle);
3333+3434+create index if not exists tangled_repos_owner_did_idx
3535+ on public.tangled_repos (owner_did);
3636+3737+create index if not exists tangled_repos_repo_did_idx
3838+ on public.tangled_repos (repo_did);
3939+4040+create index if not exists tangled_repos_knot_hostname_idx
4141+ on public.tangled_repos (knot_hostname);
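Review note: in `tangled_repos` the `uri` primary key and the `unique (owner_did, rkey)` constraint describe the same record, because an AT URI has the shape `at://<did>/<collection>/<rkey>`. The sketch below splits a URI into those parts; `AtUri` and `split_at_uri` are hypothetical names for illustration and are not helpers from this PR.

```python
from __future__ import annotations

from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def split_at_uri(uri: str) -> AtUri:
    """Split at://<did>/<collection>/<rkey> into its three components."""
    if not uri.startswith("at://"):
        raise ValueError(f"not an at:// URI: {uri!r}")
    parts = uri.removeprefix("at://").split("/")
    # Exactly three non-empty segments are expected for a record URI.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://did/collection/rkey, got {uri!r}")
    return AtUri(*parts)
```

For example, `split_at_uri("at://did:plc:abc123/sh.tangled.repo/3kxyz").rkey` gives the value stored in the `rkey` column, and `.did` gives `owner_did`.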
···11+-- Stages 3–6: identities, federated records, git XRPC snapshots, source archives.
22+-- Raw-first design: store payloads as JSON/bytea; typed views can come later.
33+44+-- -----------------------------------------------------------------------------
55+-- Stage 3 — User / identity enrichment
66+-- -----------------------------------------------------------------------------
77+88+create table if not exists public.tangled_identities (
99+ did text primary key,
1010+ handle text,
1111+ pds_host text,
1212+ profile_record jsonb, -- sh.tangled.actor.profile payload
1313+ did_doc jsonb, -- full DID document from PLC
1414+ first_seen_at timestamptz not null default now(),
1515+ last_synced_at timestamptz not null default now()
1616+);
1717+1818+create index if not exists tangled_identities_handle_idx
1919+ on public.tangled_identities (handle);
2020+2121+create index if not exists tangled_identities_pds_host_idx
2222+ on public.tangled_identities (pds_host);
2323+2424+-- -----------------------------------------------------------------------------
2525+-- Stage 5 — Federated ATProto records (issues, PRs, stars, comments, …)
2626+-- One row per record; collection = lexicon NSID e.g. sh.tangled.repo.issue
2727+-- -----------------------------------------------------------------------------
2828+2929+create table if not exists public.tangled_atproto_records (
3030+ uri text primary key, -- at://did/collection/rkey
3131+ author_did text not null,
3232+ collection text not null,
3333+ rkey text not null,
3434+ cid text,
3535+ payload jsonb not null, -- record.value exactly as returned
3636+ repo_did text, -- denormalized when record links to a repo
3737+ subject_uri text, -- denormalized target (issue/PR/star subject)
3838+ fetched_at timestamptz not null default now(),
3939+ unique (author_did, collection, rkey)
4040+);
4141+4242+create index if not exists tangled_atproto_records_collection_idx
4343+ on public.tangled_atproto_records (collection);
4444+4545+create index if not exists tangled_atproto_records_repo_did_idx
4646+ on public.tangled_atproto_records (repo_did);
4747+4848+create index if not exists tangled_atproto_records_author_did_idx
4949+ on public.tangled_atproto_records (author_did);
5050+5151+create index if not exists tangled_atproto_records_payload_gin_idx
5252+ on public.tangled_atproto_records using gin (payload);
5353+5454+-- Backlink index rows discovered before fetching the full record (Stage 5 crawl queue)
5555+create table if not exists public.tangled_backlinks (
5656+ id bigserial primary key,
5757+ repo_did text not null,
5858+ collection text not null, -- e.g. sh.tangled.repo.issue
5959+ source_field text not null, -- e.g. repo
6060+ author_did text not null,
6161+ rkey text not null,
6262+ record_uri text generated always as (
6363+ 'at://' || author_did || '/' || collection || '/' || rkey
6464+ ) stored,
6565+ fetched boolean not null default false,
6666+ discovered_at timestamptz not null default now(),
6767+ unique (repo_did, collection, author_did, rkey)
6868+);
6969+7070+create index if not exists tangled_backlinks_repo_collection_idx
7171+ on public.tangled_backlinks (repo_did, collection);
7272+7373+create index if not exists tangled_backlinks_unfetched_idx
7474+ on public.tangled_backlinks (fetched) where fetched = false;
7575+7676+-- -----------------------------------------------------------------------------
7777+-- Stage 6 — Knot/git XRPC response snapshots (commits, branches, tree, diff, …)
7878+-- -----------------------------------------------------------------------------
7979+8080+create table if not exists public.tangled_xrpc_snapshots (
8181+ id bigserial primary key,
8282+ method text not null, -- e.g. sh.tangled.repo.log
8383+ repo_did text,
8484+ params jsonb not null,
8585+ params_hash text not null,
8686+ payload jsonb, -- null when response is binary (see git tables)
8787+ payload_encoding text not null default 'application/json',
8888+ fetched_at timestamptz not null default now(),
8989+ unique (method, repo_did, params_hash) -- rows with NULL repo_did are not deduplicated: NULLs compare as distinct in a unique constraint
9090+);
9191+9292+create index if not exists tangled_xrpc_snapshots_method_idx
9393+ on public.tangled_xrpc_snapshots (method);
9494+9595+create index if not exists tangled_xrpc_snapshots_repo_did_idx
9696+ on public.tangled_xrpc_snapshots (repo_did);
9797+9898+-- Full repo snapshot archives (tar.gz @ HEAD or branch)
9999+create table if not exists public.tangled_git_archives (
100100+ repo_did text not null,
101101+ git_ref text not null default 'HEAD',
102102+ format text not null default 'tar.gz',
103103+ size_bytes bigint not null,
104104+ sha256 text,
105105+ content bytea not null,
106106+ fetched_at timestamptz not null default now(),
107107+ primary key (repo_did, git_ref, format)
108108+);
109109+110110+create index if not exists tangled_git_archives_size_idx
111111+ on public.tangled_git_archives (size_bytes);
112112+113113+-- Individual git blob objects (optional dedup layer for file-level storage)
114114+create table if not exists public.tangled_git_blobs (
115115+ repo_did text not null,
116116+ oid text not null,
117117+ size_bytes bigint,
118118+ content bytea not null,
119119+ fetched_at timestamptz not null default now(),
120120+ primary key (repo_did, oid)
121121+);
122122+123123+-- -----------------------------------------------------------------------------
124124+-- Convenience views (query metadata without re-parsing JSON)
125125+-- tangled_issues view moved to 20250624160000 (dedicated table).
126126+127127+create or replace view public.tangled_pulls as
128128+select
129129+ uri,
130130+ author_did,
131131+ repo_did,
132132+ payload ->> 'title' as title,
133133+ payload ->> 'body' as body,
134134+ payload ->> 'createdAt' as created_at, -- text as stored in the record, not timestamptz
135135+ payload
136136+from public.tangled_atproto_records
137137+where collection = 'sh.tangled.repo.pull';
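The `tangled_xrpc_snapshots` table deduplicates on `(method, repo_did, params_hash)`, but the migration does not say how `params_hash` is derived. A minimal sketch, assuming it is a SHA-256 digest over a canonical JSON encoding of `params`; the function name and the choice of hash are assumptions for illustration, not taken from the scraper:

```python
import hashlib
import json


def params_hash(params: dict) -> str:
    """Hypothetical helper: stable hash of XRPC params for the unique key."""
    # Canonical form: sorted keys and no insignificant whitespace, so two
    # dicts with the same content always serialize to the same bytes.
    canonical = json.dumps(
        params, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys matters because `jsonb` does not preserve key order, so a hash over the raw request string could differ between two logically identical requests.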
···11+-- Track where each repo was discovered (tngl PDS crawl vs appview/network index).
22+33+alter table public.tangled_repos
44+ add column if not exists discovered_via text;
55+66+alter table public.tangled_repos
77+ add column if not exists owner_handle text;
88+99+create index if not exists tangled_repos_discovered_via_idx
1010+ on public.tangled_repos (discovered_via);
1111+1212+create index if not exists tangled_repos_owner_handle_idx
1313+ on public.tangled_repos (owner_handle);
1414+1515+-- Backfill existing Stage 2 rows.
1616+update public.tangled_repos
1717+set discovered_via = 'tngl_pds'
1818+where discovered_via is null;
···11+-- One embedding vector per README (pgvector).
22+-- Model: Gemini gemini-embedding-001, 1536-dim, L2-normalized for cosine (<=>).
33+44+create extension if not exists vector;
55+66+alter table public.tangled_readmes
77+ add column if not exists embedding vector(1536),
88+ add column if not exists embedding_model text,
99+ add column if not exists embedded_at timestamptz;
1010+1111+comment on column public.tangled_readmes.embedding is
1212+ 'L2-normalized gemini-embedding-001 vector (1536); cosine via <=>.';
1313+1414+create index if not exists tangled_readmes_embedding_hnsw_idx
1515+ on public.tangled_readmes using hnsw (embedding vector_cosine_ops)
1616+ where embedding is not null;
1717+1818+create index if not exists tangled_readmes_unembedded_idx
1919+ on public.tangled_readmes (repo_did)
2020+ where status = 'found' and content is not null and embedding is null;
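The comments above state that vectors are L2-normalized before storage and compared with `<=>`, which is pgvector's cosine distance (1 minus cosine similarity). A small pure-Python sketch of both steps; the function names are hypothetical and the real embedding stage may well use NumPy or the model client's own normalization:

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity the <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For vectors that are already unit length, both norms are 1, so the distance reduces to `1 - dot(a, b)`.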
···11+-- Repo ↔ collaborator edges (from knot sh.tangled.repo.listCollaborators).
22+33+create table if not exists public.tangled_repo_collaborators (
44+ repo_did text not null,
55+ collaborator_did text not null,
66+ added_by text,
77+ record_uri text,
88+ record_cid text,
99+ created_at timestamptz,
1010+ first_seen_at timestamptz not null default now(),
1111+ last_synced_at timestamptz not null default now(),
1212+ primary key (repo_did, collaborator_did)
1313+);
1414+1515+create index if not exists tangled_repo_collaborators_user_idx
1616+ on public.tangled_repo_collaborators (collaborator_did);
1717+1818+create index if not exists tangled_repo_collaborators_repo_idx
1919+ on public.tangled_repo_collaborators (repo_did); -- overlaps the primary key, whose leading column is already repo_did
2020+2121+-- Tracks repos we already checked (including zero collaborators).
2222+create table if not exists public.tangled_repo_collaborators_sync (
2323+ repo_did text primary key,
2424+ collaborator_count integer not null default 0,
2525+ synced_at timestamptz not null default now()
2626+);
2727+2828+create or replace view public.tangled_user_collaborations as
2929+select
3030+ c.collaborator_did as user_did,
3131+ c.repo_did,
3232+ r.owner_handle,
3333+ r.name as repo_name,
3434+ r.uri as repo_uri,
3535+ c.added_by,
3636+ c.created_at
3737+from public.tangled_repo_collaborators c
3838+left join public.tangled_repos r on r.repo_did = c.repo_did;
···11+-- Dedicated issues table (replaces the old tangled_issues view on atproto_records).
22+33+create extension if not exists vector;
44+55+do $$
66+begin
77+ if exists (
88+ select 1 from pg_catalog.pg_class c
99+ join pg_catalog.pg_namespace n on n.oid = c.relnamespace
1010+ where n.nspname = 'public' and c.relname = 'tangled_issues' and c.relkind = 'v'
1111+ ) then
1212+ execute 'drop view public.tangled_issues';
1313+ end if;
1414+end $$;
1515+1616+-- If a previous partial run left the table, keep it.
1717+create table if not exists public.tangled_issues (
1818+ uri text primary key,
1919+ author_did text not null,
2020+ author_handle text,
2121+ rkey text not null,
2222+ repo_did text,
2323+ repo_uri text,
2424+ title text,
2525+ body text,
2626+ state text not null default 'open', -- open | closed
2727+ issue_created_at timestamptz,
2828+ cid text,
2929+ record_raw jsonb not null,
3030+ fetched_at timestamptz not null default now(),
3131+ embedding vector(1536),
3232+ embedding_model text,
3333+ embedded_at timestamptz
3434+);
3535+3636+create index if not exists tangled_issues_author_did_idx
3737+ on public.tangled_issues (author_did);
3838+3939+create index if not exists tangled_issues_repo_did_idx
4040+ on public.tangled_issues (repo_did);
4141+4242+create index if not exists tangled_issues_state_idx
4343+ on public.tangled_issues (state);
4444+4545+create index if not exists tangled_issues_embedding_hnsw_idx
4646+ on public.tangled_issues using hnsw (embedding vector_cosine_ops)
4747+ where embedding is not null;
4848+4949+-- Tracks which user PDSes were scanned for issues (including zero issues).
5050+create table if not exists public.tangled_issue_user_sync (
5151+ user_did text primary key,
5252+ issue_count integer not null default 0,
5353+ synced_at timestamptz not null default now()
5454+);
5555+5656+create or replace view public.tangled_open_issues as
5757+select *
5858+from public.tangled_issues
5959+where state = 'open';
···11+-- Track per-user issue scan outcomes (including failures we should not retry forever).
22+33+alter table public.tangled_issue_user_sync
44+ add column if not exists status text not null default 'ok',
55+ add column if not exists error_message text;
···11+-- AI-solve questionnaires: one cached JSON tree per issue (engine GET /questionnaire).
22+33+create table if not exists public.tangled_issue_questionnaires (
44+ issue_uri text primary key,
55+ payload jsonb not null,
66+ created_at timestamptz not null default now(),
77+ updated_at timestamptz not null default now(),
88+ constraint tangled_issue_questionnaires_payload_is_object
99+ check (jsonb_typeof(payload) = 'object')
1010+);
1111+1212+comment on table public.tangled_issue_questionnaires is
1313+ 'Cached branching questionnaire JSON per sh.tangled.repo.issue AT-URI (AI-solve engine).';
1414+1515+comment on column public.tangled_issue_questionnaires.issue_uri is
1616+ 'at://…/sh.tangled.repo.issue/<rkey> — same key as tangled_issues.uri when indexed.';
1717+1818+comment on column public.tangled_issue_questionnaires.payload is
1919+ 'Full questionnaire object (version 2): introduction, items, followups tree.';
2020+2121+create index if not exists tangled_issue_questionnaires_updated_at_idx
2222+ on public.tangled_issue_questionnaires (updated_at desc);
2323+2424+create index if not exists tangled_issue_questionnaires_payload_gin_idx
2525+ on public.tangled_issue_questionnaires using gin (payload);
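Several of these tables key rows by AT-URI in the form `at://did/collection/rkey`, and `tangled_backlinks.record_uri` builds one by plain concatenation. A minimal sketch of building and splitting such a URI on the application side; `AtUri`, `build_at_uri` and `parse_at_uri` are hypothetical names, and the parser only handles the three-segment record form used in this schema:

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str


def build_at_uri(did: str, collection: str, rkey: str) -> str:
    """Mirror the generated column: 'at://' || did || '/' || collection || '/' || rkey."""
    return f"at://{did}/{collection}/{rkey}"


def parse_at_uri(uri: str) -> AtUri:
    """Split a record AT-URI into its three parts, rejecting other shapes."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://did/collection/rkey, got {uri!r}")
    return AtUri(*parts)
```

Parsing once at the boundary keeps `author_did`, `collection` and `rkey` consistent with the `uri` column they are denormalized from.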