Sunstead backend — Tangled Discover + AI-Solve (snapshot, no history)

author
Mark Pokidko
date (Jun 25, 2026, 4:37 PM +0300) commit c760f08e
+16549
+7
.gitignore
.env
.env.*
!.env.example
.DS_Store
node_modules/
__pycache__/
*.pyc
+175
CLAUDE.md
# CLAUDE.md — Integration Repo

> Project memory for Claude Code working in this repo. Keep it short and high-signal.
> Conventional filename is `CLAUDE.md` (uppercase); rename if your tooling expects otherwise.

## What this repo is

The **integration** repo is the backend / data layer for a Tangled (AT Protocol) discovery
product. Its job, end to end:

1. **Ingest** repos and activity from the Tangled network (ATProto records).
2. **Store** them as a synced mirror in Postgres.
3. **Embed** repo/issue text and maintain a vector index.
4. **Recommend** relevant repos and issues to a user based on their past activity, and
   expose those recommendations (plus read APIs) over HTTP.

The **frontend is a separate repo**. This repo contains **no UI** — it exposes a JSON/HTTP
API that the frontend consumes. Do not add view/template/component code here. If a task
implies UI work, it belongs in the frontend repo, not this one.

## Repository layout

- `scraper/` — ingestion + backfill + embedding (Python). Stages 0–6: lexicons, knots,
  PDS/network backfill, repo metadata, READMEs, issues, embeddings. See `scraper/README.md`.
- `daily_issue_scraper/` — Cloud Run container that re-runs the issue sync on a daily schedule.
- `supabase/migrations/` — Postgres + pgvector schema: the `tangled_*` tables/views.
- `recommendation/` — the **Discover** recommendation engine: a standalone **Python/FastAPI**
  service that reads embeddings from the shared DB and returns repo/issue recs over HTTP. Has
  its own `CLAUDE.md` / `README.md` / `API.md`; intended to be lifted into its own repo later.
- `recommendationold/` — the pre-port Node (`.mjs`) version of the rec scripts, superseded by
  `recommendation/` (its `reference/src/` holds the same scripts as the porting oracle). Kept
  for reference, not run.

## Tech stack

- Language/runtime: **Python** across the live services (ingestion + recommendation). The
  earlier Next.js/FastAPI skeleton was cleared; current code is Python.
- DB: **Postgres** with the **pgvector** extension (records + relationships + embeddings in one
  DB), schema managed via **Supabase** migrations (`supabase/migrations/`).
- ATProto: PDS `com.atproto.repo.listRecords` + knot XRPC (`sh.tangled.repo.tree` for READMEs);
  identity via the PLC directory.
- Embeddings: **Gemini `gemini-embedding-001`**, 1536-dim, L2-normalized, stored in pgvector
  (cosine / HNSW). The recommendation service reads these; it does not embed at runtime.

The rec/ranking pipeline is the Python `recommendation/` service — keep a clear HTTP API
boundary between it and the ingestion/embedding side.

## Domain model — read this before touching ingestion

Tangled is a git collaboration platform on the AT Protocol. The split that matters:

- **Knots** host the actual **git data** (code, refs). Self-hostable git servers.
- **PDS** (Personal Data Service) holds the **collaboration metadata** as ATProto records:
  issues, comments, pull requests, stars, collaborators, repo pointers.

We ingest **metadata from PDSes**. We do **not** need git code for recommendations — repo
descriptions, READMEs, and issue/PR text are the signal. **READMEs are the primary text
signal for repo recommendations** (see Embedding conventions) and are fetched live from the
**knot** (not the PDS), since no README content is stored in Postgres.

- Fetch via the knot XRPC `sh.tangled.repo.tree` query:
  `https://<knot_hostname>/xrpc/sh.tangled.repo.tree?repo=<repoDid>&path=`. With `ref`
  omitted the knot uses the repo's default branch and returns a top-level `readme` object
  whose `contents` holds the rendered README (it resolves any extension — `.md`, `.org`,
  `.rst`, …). Address by the **knot-minted `repoDid`** (`record_raw->>'repoDid'`), not the
  owner DID.
- **Coverage (measured 2026-06-24):** ~79% of *reachable* repos have a README (758/959);
  ~57% of all repoDid-addressable repos confirmed (the rest are knot 404s / unreachable
  self-hosted knots, which are *unknown*, not README-less). ~30% of repos in the DB have no
  knot-minted `repoDid` at all and can't be addressed on a knot — embed those from metadata only.

Every record is addressed by an AT-URI: `at://<did>/<collection>/<rkey>`.

### Collections (NSIDs) we care about

- `sh.tangled.repo` — repo record / pointer (owner, name, knot)
- `sh.tangled.repo.issue`
- `sh.tangled.repo.issue.comment`
- `sh.tangled.repo.pull` — pull requests
- `sh.tangled.repo.collaborator`
- `sh.tangled.feed.star` — stars
- `sh.tangled.git.refUpdate` — push / ref-update events

Treat this list as the source of truth for ingestion filters. Verify against the live
lexicons before assuming a field shape — Tangled is alpha and schemas move (e.g. repos now
carry a stable DID; some wire formats changed around the v1.13/v1.14 knot releases).

## Ingestion design

Two complementary paths — keep both working:

- **Real-time: Jetstream.** Subscribe to a public Jetstream instance with `wantedCollections`
  set to the `sh.tangled.*` NSIDs above. JSON in, no CBOR decoding. This is the primary feed.
- **Backfill: `listRecords`.** For each known DID, call `com.atproto.repo.listRecords` against
  its PDS, once per collection, paginating the cursor. Discover DIDs from the Jetstream stream
  over time and/or by enumerating the relay with `com.atproto.sync.listRepos`.
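
The backfill path above (one `listRecords` walk per DID and collection, following the cursor) can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: `iter_records` and the injected `fetch_page` callable are hypothetical names, and the transport (HTTP client, PDS resolution, retries) is deliberately left out. The only protocol details assumed are the `repo`, `collection`, `limit`, and `cursor` query parameters and the `records` / `cursor` response fields of `com.atproto.repo.listRecords`.

```python
from typing import Any, Callable, Iterator, Optional

# Hypothetical transport: takes the query parameters for one
# com.atproto.repo.listRecords call and returns the decoded JSON body.
# Injected so the pagination loop can be exercised without a network.
FetchPage = Callable[[dict[str, Any]], dict[str, Any]]


def iter_records(
    fetch_page: FetchPage,
    did: str,
    collection: str,
    page_size: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield every record in one collection of one repo, following the cursor."""
    cursor: Optional[str] = None
    while True:
        params: dict[str, Any] = {
            "repo": did,
            "collection": collection,
            "limit": page_size,
        }
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        records = page.get("records") or []
        yield from records
        cursor = page.get("cursor")
        # Stop when the server returns no cursor or an empty page.
        if not cursor or not records:
            return
```

Each yielded record carries its own `uri`, so a caller can key the upsert on the AT-URI and replay the walk safely when a backfill overlaps the live stream.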

### Non-negotiable ingestion rules

- **Mirror semantics, not append-only.** Records get edited and deleted. Handle Jetstream
  `create`/`update` as **upsert** and `delete` as **soft-delete / tombstone**. Never assume
  a record seen once is permanent.
- **Resolve identity.** Records reference DIDs. Resolve DID → PDS endpoint and DID → handle
  via the PLC directory; cache it. Don't hardcode PDS hosts.
- **Coverage caveat.** Self-hosted PDSes/knots only appear if the relay crawls them. Hosted
  instances and Bluesky-network accounts are well covered; full-network coverage is not
  guaranteed. Don't treat absence as deletion.
- **Idempotency.** Ingestion must be safely replayable (reconnects, backfills overlapping the
  live stream). Key on AT-URI.

## Recommendation design

**Two-stage: retrieve, then rank.** Do not ship a single averaged "user vector" + kNN as the
whole system — it loses multi-interest structure and ignores quality/recency/social signal.

1. **Candidate generation** (high-recall, union the sources):
   - **Embedding kNN** — query with the user's *recent* interactions individually, or cluster
     their history into a few interest centroids and query each. Never collapse to one averaged vector.
   - **Collaborative / co-occurrence** — "users who starred X also starred Y" from the star and
     contribution matrices.
   - **Social graph** (our edge on ATProto) — "repos starred by people you follow", "repos your
     collaborators are active in". Cheap, strong, no embeddings needed. Prioritize wiring this up.
2. **Ranking** — start with a tunable weighted sum (embedding similarity + recency + popularity +
   social proximity + language/topic match). Swap in a learned ranker (LightGBM/XGBoost) once
   there's engagement data. Keep the scorer behind an interface so it's replaceable.
3. **Rules** — drop the user's own repos and already-seen items; enforce diversity (e.g. MMR);
   favor freshness.

### Embedding conventions

- **Repo doc = the README** (fetched live from the knot — see Domain model), as the primary
  text we embed. Prepend the repo `name` + `description` and append `topics` + primary
  `language` as light context, but the README body is the core signal.
- **Fallback when no README** (knot 404 / unreachable / repo has no `repoDid`): embed
  `name + description + topics + primary language` only. ~57–79% of repos have a README;
  the rest rely on this fallback, so it must produce a usable vector on its own.
- Issue doc = title + body + labels + parent-repo context.
- Store vectors in pgvector alongside the record. Re-embed on meaningful record updates
  (incl. when a previously-missing README becomes available).

### Required, don't skip

- **Cold start** — users with no history fall back to trending / follows-based / onboarding interests.
- **Eval harness** — hold out each user's most recent interactions; measure recall@k / nDCG offline
  before shipping any ranking change. Track star-through-rate online. No "it feels better" merges.

## Data layout

The live schema lives in `supabase/migrations/`; the `tangled_*` tables are the source of
truth (not the generic names below). Key ones the rec engine reads (see
`recommendation/CLAUDE.md` for full columns): `tangled_readmes` (repo signal + `embedding`),
`tangled_open_issues` (view), `tangled_repos`, `tangled_identities` (did→handle),
`tangled_user_collaborations` (view). Embeddings are stored inline on the record rows
(`embedding vector(1536)` + `embedding_model`), not in a separate table.

## Commands

Each service has its own setup; see the per-folder docs. DB connection comes from
`DB_CONNECTION_STRING` (`.env`).
- Scraper (ingest / backfill / embed): see `scraper/README.md` — `python scraper/scrape.py <stage>`.
- Recommendation API: from `recommendation/`, `uvicorn app.main:app --reload --port 8000`
  (setup + deploy in `recommendation/README.md`).
- Rec tests: from `recommendation/`, `.venv/bin/python -m pytest tests/`.

## Conventions

- Keep ingestion, embedding, recommendation, and API as separable modules/services.
- All external IDs are DIDs internally; resolve to handles only at the API edge for display.
- Don't put secrets, PDS credentials, or model API keys in code or commits.

## Out of scope (do not do here)

- Frontend / UI work → separate repo.
- Hosting git content or running a knot → not this service's job; we read metadata and fetch
  READMEs on demand.
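
The ranking stage described under "Recommendation design" (a tunable weighted sum kept behind a replaceable interface, plus the rule that drops already-seen items) might look something like the sketch below. Everything here is illustrative: the `Candidate` fields, the `Scorer` protocol, the `rank` helper, and especially the weight values are assumptions, not the `recommendation/` service's real code, and each feature is assumed to be pre-normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Candidate:
    """One retrieved item; every feature is assumed pre-normalized to [0, 1]."""
    uri: str
    similarity: float
    recency: float
    popularity: float
    social: float
    topic_match: float


class Scorer(Protocol):
    """Interface the ranker depends on, so a learned model can replace it later."""
    def score(self, c: Candidate) -> float: ...


@dataclass(frozen=True)
class WeightedSumScorer:
    # Placeholder weights (assumed); tune them against the offline eval harness.
    w_similarity: float = 0.4
    w_recency: float = 0.2
    w_popularity: float = 0.1
    w_social: float = 0.2
    w_topic: float = 0.1

    def score(self, c: Candidate) -> float:
        return (
            self.w_similarity * c.similarity
            + self.w_recency * c.recency
            + self.w_popularity * c.popularity
            + self.w_social * c.social
            + self.w_topic * c.topic_match
        )


def rank(
    candidates: Sequence[Candidate],
    scorer: Scorer,
    exclude: frozenset[str] = frozenset(),
) -> list[Candidate]:
    """Drop excluded URIs (own repos, already-seen items), then sort by score."""
    kept = [c for c in candidates if c.uri not in exclude]
    return sorted(kept, key=scorer.score, reverse=True)
```

Because `rank` only depends on the `Scorer` protocol, a LightGBM or XGBoost wrapper exposing the same `score` method could be dropped in without touching the call sites. Diversity rules such as MMR are not shown here.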
+13
agent/.env.example
ANTHROPIC_API_KEY=
# DB_CONNECTION_STRING=postgresql://... (questionnaire cache + optional issue lookup)
# ANTHROPIC_MODEL=claude-sonnet-4-6
# ANTHROPIC_QUESTIONNAIRE_MODEL=claude-opus-4-6
# ANTHROPIC_QUESTIONNAIRE_MAX_TOKENS=16384
# ANTHROPIC_QUESTIONNAIRE_TEMPERATURE=0
# QUESTIONNAIRE_MIN_TOOL_READS=0  # optional: force N reads before tools become optional (0=off)
# QUESTIONNAIRE_RECURSION_LIMIT=50
# AGENT_VERBOSE_TOOLS=1
# ANTHROPIC_CACHE_TTL=5m
# AGENT_MAX_FILE_CHARS=32000
# ANTHROPIC_MAX_TOKENS=4096
# ANTHROPIC_TEMPERATURE=0
+37
agent/__init__.py
"""Tangled issue investigation agent."""

__all__ = [
    "AgentState",
    "AnthropicCacheSettings",
    "IssueSessionContext",
    "build_agent_graph",
    "build_issue_agent_graph",
    "create_anthropic_model",
    "load_issue_context",
    "run_agent",
    "run_issue_agent",
]


def __getattr__(name: str):
    if name in {
        "AgentState",
        "AnthropicCacheSettings",
        "build_agent_graph",
        "build_issue_agent_graph",
        "create_anthropic_model",
        "run_agent",
        "run_issue_agent",
    }:
        from agent import agent as _agent

        return getattr(_agent, name)
    if name == "IssueSessionContext":
        from agent.context import IssueSessionContext

        return IssueSessionContext
    if name == "load_issue_context":
        from agent.load_issue import load_issue_context

        return load_issue_context
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
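
The package's module-level `__getattr__` defers the heavy imports until a name is first requested (the PEP 562 mechanism). The standalone sketch below shows the same idea on a synthetic module so it can run without the `agent` package. `make_lazy_module` and the `loaders` mapping are hypothetical names, and caching the resolved value with `setattr` is an addition the original file does not make.

```python
import types
from typing import Any, Callable


def make_lazy_module(name: str, loaders: dict[str, Callable[[], Any]]) -> types.ModuleType:
    """Build a module whose attributes are produced on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str) -> Any:
        # Only called when normal attribute lookup on the module fails.
        try:
            loader = loaders[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = loader()
        # Cache the result so later lookups bypass __getattr__ entirely.
        setattr(module, attr, value)
        return value

    module.__getattr__ = __getattr__
    return module
```

In the real package the loader bodies are the deferred `from agent import agent` style imports, so importing `agent` stays cheap until one of the exported names is actually used.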
+688
agent/agent.py
#!/usr/bin/env python3
"""LangGraph agent loop on Anthropic with prompt caching.

Core pieces:
- ``AgentState`` — message list reducer state
- ``create_anthropic_model`` — ChatAnthropic factory
- ``cached_system_message`` / ``AnthropicCacheSettings`` — explicit + automatic caching
- ``build_agent_graph`` — agent → (tools) → agent loop
- ``run_agent`` — single-turn or threaded invoke helper
"""

from __future__ import annotations

import json
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Annotated, Any, Literal, Sequence

from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)
from langchain_core.tools import BaseTool
from langgraph.checkpoint.base import BaseCheckpointSaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict

from agent.context import IssueSessionContext, build_issue_system_prompt
from agent.load_issue import load_issue_context
from agent.questionnaire_store import parse_questionnaire_json, save_questionnaire
from agent.questionnaire_repo_store import publish_to_repo, publishing_enabled
from agent.questionnaire_prompt import build_questionnaire_system_prompt
from agent.tools import make_file_tools

REPO_ROOT = Path(__file__).resolve().parent.parent

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful assistant for the Sunstead / Tangled hackathon stack.

You can reason about:
- Tangled repos, issues, and README embeddings in Postgres
- The recommendation API (DID → ranked repos/issues)
- The daily scraper that ingests Tangled network data

Be concise and actionable. Use tools when they help answer factual questions.
"""

CacheTTL = Literal["5m", "1h"]


@dataclass(frozen=True)
class AnthropicCacheSettings:
    """Anthropic prompt cache configuration.

    We use two layers (both are valid together):
    1. Explicit ``cache_control`` on the static system block (always cached).
    2. Automatic ``cache_control`` on each ``model.invoke`` call so tools +
       conversation prefix are cached on Anthropic's side (breakpoint moves
       forward as the thread grows).
    """

    type: Literal["ephemeral"] = "ephemeral"
    ttl: CacheTTL = "5m"

    def as_api_dict(self) -> dict[str, str]:
        return {"type": self.type, "ttl": self.ttl}


class AgentState(TypedDict):
    """Graph state: append-only message history."""

    messages: Annotated[list[BaseMessage], add_messages]


def load_env() -> None:
    # Load the first .env found (repo root, then this package); otherwise fall
    # back to python-dotenv's default search.
    for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            return
    load_dotenv()


def require_anthropic_api_key() -> str:
    key = os.getenv("ANTHROPIC_API_KEY", "").strip()
    if not key:
        print("ERROR: ANTHROPIC_API_KEY is not set", file=sys.stderr)
        raise SystemExit(1)
    return key


def cached_system_message(
    text: str,
    *,
    cache: AnthropicCacheSettings | None = None,
) -> SystemMessage:
    """System prompt block with explicit Anthropic ``cache_control``."""
    cache = cache or AnthropicCacheSettings()
    return SystemMessage(
        content=[
            {
                "type": "text",
                "text": text,
                "cache_control": cache.as_api_dict(),
            }
        ]
    )


def _tag_last_content_block(
    message: BaseMessage,
    cache: AnthropicCacheSettings,
) -> BaseMessage:
    """Add ``cache_control`` to the last text block of a message (explicit breakpoint)."""
    content = message.content
    if isinstance(content, str):
        return message.model_copy(
            update={
                "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": cache.as_api_dict(),
                    }
                ]
            }
        )
    if not isinstance(content, list) or not content:
        return message
    blocks = [dict(block) if isinstance(block, dict) else block for block in content]
    last = blocks[-1]
    if isinstance(last, dict) and last.get("type") == "text":
        blocks[-1] = {**last, "cache_control": cache.as_api_dict()}
        return message.model_copy(update={"content": blocks})
    return message


def _can_tag_message_for_cache(message: BaseMessage) -> bool:
    """Anthropic forbids cache_control on tool_result blocks."""
    if isinstance(message, ToolMessage):
        return False
    return isinstance(message, (HumanMessage, AIMessage))


def prepare_messages_for_anthropic(
    messages: Sequence[BaseMessage],
    *,
    system_message: SystemMessage,
    cache: AnthropicCacheSettings | None = None,
    cache_conversation_tail: bool = True,
) -> list[BaseMessage]:
    """Build the message list sent to Claude.

    - Prepends the cached system message.
    - Optionally marks the last non-tool message for explicit prefix caching.
      Invoke-level ``cache_control`` still applies to the full request.
    """
    cache = cache or AnthropicCacheSettings()
    history = list(messages)
    # After tool rounds, only invoke-level cache_control is safe — Anthropic
    # rejects cache_control nested inside tool_result content blocks.
    if (
        cache_conversation_tail
        and history
        and not any(isinstance(m, ToolMessage) for m in history)
    ):
        idx = None
        for i in range(len(history) - 1, -1, -1):
            if _can_tag_message_for_cache(history[i]):
                idx = i
                break
        if idx is not None:
            history[idx] = _tag_last_content_block(history[idx], cache)
    return [system_message, *history]


def extract_cache_usage(message: AIMessage) -> dict[str, int]:
    """Pull Anthropic cache token stats from a model response, if present."""
    usage = (message.response_metadata or {}).get("usage") or {}
    return {
        "input_tokens": int(usage.get("input_tokens") or 0),
        "output_tokens": int(usage.get("output_tokens") or 0),
        "cache_creation_input_tokens": int(
            usage.get("cache_creation_input_tokens") or 0
        ),
        "cache_read_input_tokens": int(usage.get("cache_read_input_tokens") or 0),
    }


def create_anthropic_model(
    *,
    model: str | None = None,
    temperature: float | None = None,
    max_tokens: int | None = None,
    api_key: str | None = None,
) -> ChatAnthropic:
    """Construct ``ChatAnthropic`` from env defaults."""
    load_env()
    return ChatAnthropic(
        model=model or os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
        api_key=api_key or require_anthropic_api_key(),
        temperature=float(
            temperature if temperature is not None else os.getenv("ANTHROPIC_TEMPERATURE", "0")
        ),
        max_tokens=int(
            max_tokens if max_tokens is not None else os.getenv("ANTHROPIC_MAX_TOKENS", "4096")
        ),
    )


def create_questionnaire_model(
    *,
    api_key: str | None = None,
) -> ChatAnthropic:
    """Opus by default — questionnaire generation needs deep repo reasoning."""
    load_env()
    return create_anthropic_model(
        model=os.getenv("ANTHROPIC_QUESTIONNAIRE_MODEL", "claude-opus-4-6"),
        max_tokens=int(os.getenv("ANTHROPIC_QUESTIONNAIRE_MAX_TOKENS", "16384")),
        temperature=float(os.getenv("ANTHROPIC_QUESTIONNAIRE_TEMPERATURE", "0")),
        api_key=api_key,
    )


def _cache_settings_from_env() -> AnthropicCacheSettings:
    ttl = os.getenv("ANTHROPIC_CACHE_TTL", "5m").strip()
    if ttl not in ("5m", "1h"):
        ttl = "5m"
    return AnthropicCacheSettings(ttl=ttl)  # type: ignore[arg-type]


def should_continue(state: AgentState) -> Literal["tools", "__end__"]:
    """Route to tools when the model emitted tool calls."""
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END


def extract_ai_text(message: BaseMessage) -> str:
    """Pull plain text from an AIMessage (string or block content)."""
    if not isinstance(message, AIMessage):
        return ""
    content = message.content
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                text = block.get("text")
                if isinstance(text, str) and text.strip():
                    parts.append(text)
        return "\n".join(parts).strip()
    return str(content).strip()


def find_questionnaire_json_text(messages: Sequence[BaseMessage]) -> str:
    """Last non-tool AIMessage whose text is valid questionnaire JSON."""
    for message in reversed(messages):
        if isinstance(message, AIMessage) and not message.tool_calls:
            text = extract_ai_text(message)
            if not text:
                continue
            try:
                parse_questionnaire_json(text)
            except (ValueError, json.JSONDecodeError):
                continue
            return text
    last = messages[-1] if messages else None
    raise ValueError(
        "Agent did not produce questionnaire JSON "
        f"(messages={len(messages)}, last={type(last).__name__ if last else 'none'})"
    )


def build_agent_graph(
    *,
    tools: Sequence[BaseTool] | None = None,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    min_tool_reads: int = 0,
    verbose_tools: bool = False,
):
    """Compile the LangGraph agent loop: agent ⟷ tools (optional)."""
    load_env()
    tools = list(tools or [])
    model = model or create_anthropic_model()
    cache = cache or _cache_settings_from_env()
    system_message = cached_system_message(system_prompt, cache=cache)
    bound_model = model.bind_tools(tools) if tools else model
    tool_node = ToolNode(tools) if tools else None

    def call_model(state: AgentState) -> dict[str, list[BaseMessage]]:
        history = state["messages"]
        tool_read_count = sum(1 for m in history if isinstance(m, ToolMessage))
        has_tool_results = tool_read_count > 0
        payload = prepare_messages_for_anthropic(
            history,
            system_message=system_message,
            cache=cache,
        )
        # Invoke-level cache_control is only sent before any tool results exist.
        invoke_kwargs: dict[str, Any] = {}
        if not has_tool_results:
            invoke_kwargs["cache_control"] = cache.as_api_dict()

        # Force a tool call until the minimum number of reads has been reached.
        model_to_invoke = bound_model
        if tools and min_tool_reads and tool_read_count < min_tool_reads:
            model_to_invoke = bound_model.bind(tool_choice={"type": "any"})

        response = model_to_invoke.invoke(payload, **invoke_kwargs)
        if isinstance(response, AIMessage):
            stats = extract_cache_usage(response)
            if any(stats[k] for k in ("cache_creation_input_tokens", "cache_read_input_tokens")):
                print(
                    "[anthropic-cache]",
                    f"read={stats['cache_read_input_tokens']}",
                    f"write={stats['cache_creation_input_tokens']}",
                    f"in={stats['input_tokens']}",
                    f"out={stats['output_tokens']}",
                )
            if verbose_tools and response.tool_calls:
                for tc in response.tool_calls:
                    name = tc.get("name", "?") if isinstance(tc, dict) else getattr(tc, "name", "?")
                    args = tc.get("args", {}) if isinstance(tc, dict) else getattr(tc, "args", {})
                    print(f"[agent] calling {name}({args})", file=sys.stderr)
        return {"messages": [response]}

    def run_tools(state: AgentState) -> dict[str, list[BaseMessage]]:
        assert tool_node is not None
        result = tool_node.invoke(state)
        if verbose_tools:
            # Log only the size of each tool result; the content itself is not printed.
            for msg in result.get("messages", []):
                if isinstance(msg, ToolMessage):
                    print(f"[agent] tool result ({len(str(msg.content))} chars)", file=sys.stderr)
        return result

    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)

    if tools:
        graph.add_node("tools", run_tools)
        graph.add_edge(START, "agent")
        graph.add_conditional_edges(
            "agent",
            should_continue,
            {"tools": "tools", END: END},
        )
        graph.add_edge("tools", "agent")
    else:
        graph.add_edge(START, "agent")
        graph.add_edge("agent", END)

    return graph.compile(checkpointer=checkpointer)


def build_issue_agent_graph(
    ctx: IssueSessionContext,
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    include_list_tool: bool = False,
):
    """Issue investigator: context upfront, file tools only."""
    tools = make_file_tools(ctx)
    if not include_list_tool:
        tools = [t for t in tools if t.name == "read_repo_file"]
    return build_agent_graph(
        tools=tools,
        system_prompt=build_issue_system_prompt(ctx),
        model=model,
        cache=cache,
        checkpointer=checkpointer,
    )


def build_questionnaire_agent_graph(
    ctx: IssueSessionContext,
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
    checkpointer: BaseCheckpointSaver | None = None,
    include_list_tool: bool = False,
):
    """Generate AI-solve questionnaire JSON: Opus + file tools + contract prompt."""
    tools = make_file_tools(ctx)
    if not include_list_tool:
        tools = [t for t in tools if t.name == "read_repo_file"]
    return build_agent_graph(
        tools=tools,
        system_prompt=build_questionnaire_system_prompt(ctx),
        model=model or create_questionnaire_model(),
        cache=cache,
        checkpointer=checkpointer,
        min_tool_reads=int(os.getenv("QUESTIONNAIRE_MIN_TOOL_READS", "0")),
        verbose_tools=os.getenv("AGENT_VERBOSE_TOOLS", "1").strip().lower()
        not in ("0", "false", "no"),
    )


def _finalize_questionnaire_json(
    ctx: IssueSessionContext,
    messages: Sequence[BaseMessage],
    *,
    model: ChatAnthropic | None = None,
    cache: AnthropicCacheSettings | None = None,
) -> str:
    """Dedicated JSON-only model call after research (no tools)."""
    model = model or create_questionnaire_model()
    cache = cache or _cache_settings_from_env()
    system_message = cached_system_message(
        build_questionnaire_system_prompt(ctx), cache=cache
    )
    history = [
        *messages,
        HumanMessage(
            content=(
                "You have finished reading the repository. Output the complete questionnaire "
                "JSON object now (schema version 2). Single JSON object only — no markdown "
                "fences, no commentary, no tool calls."
            )
        ),
    ]
    payload = prepare_messages_for_anthropic(
        history, system_message=system_message, cache=cache
    )
    response = model.invoke(payload)
    text = extract_ai_text(response)
    if text:
        return text
    block_types: list[str] = []
    if isinstance(response, AIMessage) and isinstance(response.content, list):
        for block in response.content:
            if isinstance(block, dict):
                block_types.append(str(block.get("type", "?")))
    raise ValueError(
        "Finalize turn returned empty text "
        f"(blocks={block_types or 'none'})"
    )


def run_questionnaire_agent(
    ctx: IssueSessionContext,
    *,
    graph=None,
    thread_id: str = "default",
    include_list_tool: bool = False,
) -> BaseMessage:
    """Run questionnaire generation (no user prompt — instructions are in the system prompt)."""
    app = graph or build_questionnaire_agent_graph(
        ctx, include_list_tool=include_list_tool
    )
    config = {
        "configurable": {"thread_id": thread_id},
        "recursion_limit": int(os.getenv("QUESTIONNAIRE_RECURSION_LIMIT", "50")),
    }
    result = app.invoke(
        {
            "messages": [
                HumanMessage(
                    content=(
                        "Phase 1 — research: use read_repo_file to explore the repository "
                        "for as long as you need (README, issue-related source, tests, "
                        "patterns). Stop calling tools when you have enough context. "
                        "Do not output questionnaire JSON yet."
                    )
                )
            ]
        },
        config=config,
    )
    messages = result["messages"]
    tool_reads = sum(1 for m in messages if isinstance(m, ToolMessage))
    print(f"[agent] research done ({tool_reads} file reads)", file=sys.stderr)

    # Use JSON from the research phase if the model already produced it;
    # otherwise run a dedicated JSON-only turn.
    try:
        text = find_questionnaire_json_text(messages)
    except ValueError:
        print("[agent] running JSON finalize turn", file=sys.stderr)
        text = _finalize_questionnaire_json(ctx, messages)
    return AIMessage(content=text)


def generate_and_save_questionnaire(
    issue_uri: str,
    *,
    fetch_file_tree: bool = True,
    include_list_tool: bool = False,
    thread_id: str = "job",
    save: bool = True,
) -> dict[str, Any]:
    """Load issue, run questionnaire agent, parse JSON, optionally upsert to Postgres."""
    load_env()
    ctx = load_issue_context(issue_uri, fetch_file_tree=fetch_file_tree)
    reply = run_questionnaire_agent(
        ctx,
        thread_id=thread_id,
        include_list_tool=include_list_tool,
    )
    text = extract_ai_text(reply) if isinstance(reply, AIMessage) else str(reply.content)
    try:
        payload = parse_questionnaire_json(text)
    except (json.JSONDecodeError, ValueError) as exc:
        preview = (text or "")[:400].replace("\n", " ")
        raise ValueError(f"{exc} — response preview: {preview!r}") from exc

    result: dict[str, Any] = {
        "issue_uri": ctx.issue_uri,
        "version": payload.get("version"),
        "question_count": _count_questions(payload.get("items") or []),
    }
    if not save:
        result["payload"] = payload
        print("[agent] --no-save: skipped DB write", file=sys.stderr)
        return result

    row = save_questionnaire(ctx.issue_uri, payload)
    _maybe_publish_to_repo(ctx.issue_uri, payload, row)
    result.update(
        {
            "issue_uri": row["issue_uri"],
            "created_at": row["created_at"].isoformat() if row.get("created_at") else None,
            "updated_at": row["updated_at"].isoformat() if row.get("updated_at") else None,
        }
    )
    return result


def _maybe_publish_to_repo(issue_uri: str, payload: dict, row: dict | None = None) -> None:
    """Best-effort dual-write: also publish the questionnaire to the knot repo when
    QUESTIONNAIRE_PUBLISH_REPO is set. A failure here never fails the DB write."""
    if not publishing_enabled():
        return
    try:
        rel = publish_to_repo(
            issue_uri,
            payload,
            (row or {}).get("created_at"),
            (row or {}).get("updated_at"),
        )
        print(f"[agent] published questionnaire to repo: {rel}", file=sys.stderr)
    except Exception as exc:  # noqa: BLE001 - publishing is best-effort
        print(f"[agent] warning: repo publish failed (DB write still ok): {exc}", file=sys.stderr)


def _count_questions(items: list) -> int:
    # Count every question, including follow-ups nested under options.
    n = 0
    for q in items:
        n += 1
        for opt in q.get("options") or []:
            n += _count_questions(opt.get("followups") or [])
    return n


def run_issue_agent(
    ctx: IssueSessionContext,
    user_input: str,
    *,
    graph=None,
    thread_id: str = "default",
    include_list_tool: bool = False,
) -> BaseMessage:
    """Run the issue agent with metadata + file tree already in the system prompt."""
    app = graph or build_issue_agent_graph(ctx, include_list_tool=include_list_tool)
    config = {"configurable": {"thread_id": thread_id}}
    result = app.invoke(
        {"messages": [HumanMessage(content=user_input)]},
        config=config,
    )
    return result["messages"][-1]


def run_agent(
    user_input: str,
    *,
    graph=None,
    thread_id: str = "default",
) -> BaseMessage:
    """Invoke the compiled graph with a single user turn."""
    app = graph or build_agent_graph()
    config = {"configurable": {"thread_id": thread_id}}
    result = app.invoke(
        {"messages": [HumanMessage(content=user_input)]},
        config=config,
    )
    return result["messages"][-1]


def main(argv: list[str] | None = None) -> None:
    import argparse

    parser = argparse.ArgumentParser(description="Run the Tangled issue investigation agent.")
    parser.add_argument("prompt", nargs="?", help="User message (omit for stdin)")
    parser.add_argument(
        "--issue-uri",
        metavar="URI",
        help="Issue at:// URI — loads meta + file tree from DB/knot",
    )
    parser.add_argument(
        "--no-file-tree",
        action="store_true",
        help="Skip live knot tree walk (use with --list-tool)",
    )
    parser.add_argument(
        "--list-tool",
        action="store_true",
        help="Also expose list_repo_files (default: read_repo_file only)",
    )
    # The misspelled "--questionaire" is accepted as an alias for the same flag.
    parser.add_argument(
        "--questionnaire",
        "--questionaire",
        action="store_true",
        help="Generate AI-solve questionnaire JSON (uses Opus)",
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="Do not write questionnaire JSON to Postgres (questionnaire mode)",
    )
    parser.add_argument("--thread-id", default="cli", help="Checkpoint thread id")
    args = parser.parse_args(argv)

    text = args.prompt
    if not text and not sys.stdin.isatty():
        text = sys.stdin.read().strip()
    if not text and not args.questionnaire:
        print("ERROR: provide a prompt argument, stdin, or --questionnaire", file=sys.stderr)
        raise SystemExit(1)

    if not args.issue_uri:
        print("ERROR: --issue-uri is required", file=sys.stderr)
        raise SystemExit(1)

    load_env()
    ctx = load_issue_context(
        args.issue_uri.strip(),
        fetch_file_tree=not args.no_file_tree,
    )
    if args.questionnaire:
        print("[agent] questionnaire mode — will read repo via tools first", file=sys.stderr)
        reply = run_questionnaire_agent(
            ctx,
            thread_id=args.thread_id,
            include_list_tool=args.list_tool,
        )
        text_out = reply.content if isinstance(reply.content, str) else str(reply.content)
        if not args.no_save:
            try:
                payload = parse_questionnaire_json(text_out)
                row = save_questionnaire(ctx.issue_uri, payload)
                _maybe_publish_to_repo(ctx.issue_uri, payload, row)
                print(
                    f"[agent] saved questionnaire for {row['issue_uri']}",
                    file=sys.stderr,
                )
            except Exception as exc:  # noqa: BLE001
                print(f"[agent] warning: could not save to DB: {exc}", file=sys.stderr)
    else:
        reply = run_issue_agent(
            ctx,
            text,
            thread_id=args.thread_id,
            include_list_tool=args.list_tool,
        )
    # String content prints as-is; block content is flattened to its text parts
    # (the original printed reply.content in both branches, so lists showed as a repr).
    if isinstance(reply.content, str):
        print(reply.content)
    else:
        print(extract_ai_text(reply))


if __name__ == "__main__":
    main()
+294
agent/ai-solve-questionnaire-contract.md
··· 1 + # AI-Solve Questionnaire — Engine Contract 2 + 3 + **Status:** Draft contract for the AI engine (backend) developer 4 + **Date:** 2026-06-25 5 + **Owner (frontend/appview):** Miko 6 + **Audience:** Engine service developer 7 + 8 + --- 9 + 10 + ## 1. Context 11 + 12 + Feature: on an issue page, a logged-in user can start an **AI-solve** workflow. The AI 13 + engine generates a **branching questionnaire** about what kind of solution should be 14 + implemented. Many users answer it; when the engine detects consensus, it generates the 15 + solution and opens a pull request **authored as its own AT-Protocol / Tangled user**. 16 + 17 + The **appview (frontend) is a thin UI**. The **engine owns all logic**: questionnaire 18 + generation, answer aggregation, consensus detection, code generation, and PR authoring. 19 + 20 + The appview talks to the engine with **exactly two HTTP calls**, both made server-side 21 + from the appview (the engine base URL is never exposed to the browser): 22 + 23 + 1. `GET` the questionnaire for an issue. 24 + 2. `POST` a user's completed answer-set. 25 + 26 + The branching is **walked client-side** in the appview after the single `GET` — the 27 + engine does **not** serve one-question-at-a-time. It returns the whole questionnaire once. 28 + 29 + Everything after the `POST` (consensus, codegen, PR) is internal to the engine. The PR 30 + appears on the issue page because the engine authors a normal `sh.tangled.repo.pull` 31 + record that references the issue — it rides the existing PR/reference ingestion. No third 32 + call is required for that. 33 + 34 + --- 35 + 36 + ## 2. The questionnaire structure 37 + 38 + The questionnaire is a **tree of nested sequences**, not a `next`-pointer graph. This is 39 + the core of the contract — please implement to this shape. 40 + 41 + ### Design rationale 42 + 43 + The questionnaire must support, modularly: 44 + 45 + - **Sub-questions** that are only asked when a particular option is chosen. 
46 + - **Regular questions** that are always asked, in order — including *after* a branching 47 + question's sub-questions have been answered. 48 + 49 + A flat `next`-pointer graph models this poorly: every branch leaf must manually point back 50 + to the shared follow-up question to re-converge, so adding one always-asked question means 51 + editing every leaf. Instead we use recursion: 52 + 53 + > An option may carry its own ordered list of follow-up questions (`followups`). That list 54 + > has the **same shape** as the top-level list. When the user finishes a `followups` list, 55 + > traversal automatically returns ("pops") to the parent sequence and continues with the 56 + > next item. Re-convergence is free; no manual wiring. 57 + 58 + One node type, recursive at any depth, one renderer/walker. 59 + 60 + ### Schema 61 + 62 + ```jsonc 63 + // Questionnaire (root) 64 + { 65 + "issue": "at://…/sh.tangled.repo.issue/…", 66 + "version": 2, 67 + "introduction": { 68 + "project": "What the repo is…", 69 + "issue": "What this issue asks…", 70 + "approach": "How the questionnaire guides toward a PR…" 71 + }, 72 + "items": [ /* ordered array of Question */ ] 73 + } 74 + 75 + // Question 76 + { 77 + "id": "scope", 78 + "prompt": "Short headline question", 79 + "context": "Why we ask this now — bridges from intro or parent branch", 80 + "explanation": "Extended tradeoffs and repo-specific detail", 81 + "options": [ /* array of Option, >= 2 */ ] 82 + } 83 + 84 + // Option — label only (no separate value field) 85 + { 86 + "label": "Full detailed description of this choice", 87 + "followups": [ /* optional: array of Question, same shape as items */ ] 88 + } 89 + ``` 90 + 91 + **Field rules** 92 + 93 + | Field | Type | Required | Notes | 94 + |---|---|---|---| 95 + | `issue` | string (AT-URI) | yes | Echoes the issue this questionnaire is for. | 96 + | `version` | integer | yes | Schema version; `2` for now. 
| 97 + | `introduction` | object | yes | `project`, `issue`, `approach` — narrative setup shown before questions. | 98 + | `items` | Question[] | yes | Top-level ordered sequence. Non-empty. | 99 + | `Question.id` | string | yes | **Globally unique** across the whole tree. Stable across re-fetches. | 100 + | `Question.prompt` | string | yes | Short headline (plain text). | 101 + | `Question.context` | string | yes | Bridges logically from intro/parent; must chain narratively. | 102 + | `Question.explanation` | string | yes | Extended detail on tradeoffs and repo facts. | 103 + | `Question.options` | Option[] | yes | At least 2 options. | 104 + | `Option.label` | string | yes | **Full option text** — detailed description, not a terse button label. | 105 + | `Option.followups` | Question[] | no | Omit or `[]` = no sub-questions. | 106 + 107 + ### Worked example 108 + 109 + ```json 110 + { 111 + "issue": "at://did:plc:abc/sh.tangled.repo.issue/3lk2…", 112 + "version": 2, 113 + "introduction": { 114 + "project": "A small CLI tool for…", 115 + "issue": "Add a flag to…", 116 + "approach": "First we pick where the fix lives, then shared test preferences." 117 + }, 118 + "items": [ 119 + { 120 + "id": "scope", 121 + "prompt": "Where should the fix live?", 122 + "context": "The issue touches both the CLI and core library — we need to pick a home first.", 123 + "explanation": "A new module keeps concerns isolated; extending util is faster but couples the change.", 124 + "options": [ 125 + { 126 + "label": "Create a new module dedicated to this feature, imported by the CLI entrypoint.", 127 + "followups": [ 128 + { 129 + "id": "mod_name_style", 130 + "prompt": "Module naming style?", 131 + "context": "Because you chose a new module, naming should match repo conventions.", 132 + "explanation": "Flat names match existing `util_*` files; nested packages group related commands.", 133 + "options": [ 134 + { "label": "Flat single file at repo root (e.g. `pairing.nu`)." 
}, 135 + { "label": "Nested under an existing package directory." } 136 + ] 137 + } 138 + ] 139 + }, 140 + { "label": "Extend the existing shared util module — smallest diff, reuses exports." } 141 + ] 142 + }, 143 + { 144 + "id": "tests", 145 + "prompt": "Add tests?", 146 + "context": "Regardless of where the fix lives, we need agreement on test coverage.", 147 + "explanation": "The repo has unit tests in `tests/` but no integration harness for hardware.", 148 + "options": [ 149 + { "label": "Yes — add unit tests for the new code path." }, 150 + { "label": "No — manual verification only for this change." } 151 + ] 152 + } 153 + ] 154 + } 155 + ``` 156 + 157 + Behaviour: 158 + - A user who picks **New module** is asked `mod_name_style`, then `tests`. 159 + - A user who picks **Existing util** skips `mod_name_style` and goes straight to `tests`. 160 + - `tests` is always asked regardless of the `scope` branch. 161 + 162 + ### Traversal semantics (how the frontend walks it) 163 + 164 + The appview walks the tree depth-first with a stack. The engine doesn't need to run this, 165 + but it defines exactly which questions a given user sees and the order answers are recorded: 166 + 167 + ``` 168 + stack = [ (items, 0) ] 169 + while stack not empty: 170 + (list, i) = stack.top 171 + if i >= len(list): stack.pop(); continue 172 + q = list[i] 173 + present q to user; user picks option at index `k`; opt = q.options[k] 174 + record answer { questionId: q.id, optionIndex: k } 175 + stack.top.i += 1 # move past q in its own frame 176 + if opt.followups is non-empty: 177 + stack.push( (opt.followups, 0) ) # dive into sub-questions first 178 + # done when the stack is empty 179 + ``` 180 + 181 + --- 182 + 183 + ## 3. API contract 184 + 185 + Base URL, auth, and exact paths are TBD between engine and appview (see Open Questions). 186 + Shapes below are the contract. 187 + 188 + ### 3.1 `GET` questionnaire 189 + 190 + Fetch (or generate-and-cache) the questionnaire for an issue. 
191 + 192 + **Request** 193 + 194 + ``` 195 + GET /questionnaire?issue=<at-uri> 196 + ``` 197 + 198 + | Param | Type | Notes | 199 + |---|---|---| 200 + | `issue` | string (AT-URI) | The `sh.tangled.repo.issue` record URI. | 201 + 202 + **Response** `200 OK` — a Questionnaire object (Section 2). 203 + 204 + - The same issue should return a **stable** questionnaire (same `id`s) across calls so that 205 + answers from different users are comparable. Generate once, cache, return cached. 206 + - `404` if the issue is unknown to the engine; `503` if generation is still in progress 207 + (the appview will show a "preparing…" state and let the user retry). 208 + 209 + ### 3.2 `POST` answers 210 + 211 + Submit one user's completed answer-set. 212 + 213 + **Request** 214 + 215 + ``` 216 + POST /answers 217 + Content-Type: application/json 218 + 219 + { 220 + "issue": "at://…/sh.tangled.repo.issue/…", 221 + "did": "did:plc:…", // the answering user, from the appview's auth session 222 + "version": 2, // questionnaire version the answers were collected against 223 + "answers": [ 224 + { "questionId": "scope", "optionIndex": 0 }, 225 + { "questionId": "mod_name_style", "optionIndex": 1 }, 226 + { "questionId": "tests", "optionIndex": 0 } 227 + ] 228 + } 229 + ``` 230 + 231 + - `answers` is the **flat, ordered list** of `{ questionId, optionIndex }` the user actually 232 + traversed (only the questions they were shown). Order = traversal order. `questionId`s are 233 + globally unique, so the engine can reconstruct full context without nesting in the payload. 234 + - `did` is supplied **server-side by the appview** from the authenticated session — the 235 + engine can trust it as the identity the appview vouches for. It is never taken from the 236 + browser/client. 237 + 238 + **Response** `200 OK` (body optional; the appview ignores it beyond status). 
239 + 240 + **Idempotency / re-answering:** a user may submit more than once (they re-open the wizard 241 + and change their mind). The engine should **dedupe by `did`** and treat the latest submission 242 + as that user's answer. Define whether resubmission is allowed after consensus is locked. 243 + 244 + --- 245 + 246 + ## 4. Out of scope for these two calls (engine-internal) 247 + 248 + - Aggregating answers across users and **detecting consensus**. 249 + - Generating the solution / code. 250 + - **Authoring the PR as the engine's own AT-Proto user** — author a `sh.tangled.repo.pull` 251 + record that **references the issue** (e.g. via the issue AT-URI in the pull's references / 252 + body) so the appview's existing reference-link rendering surfaces it on the issue page. 253 + 254 + No additional appview→engine call is needed to display the solution/PR; it arrives via 255 + normal record ingestion. 256 + 257 + --- 258 + 259 + ## 5. Optional future extension — reusable question groups (do NOT build yet) 260 + 261 + If two different branches ever need the **same** sub-questionnaire, add a reference item 262 + rather than duplicating questions: 263 + 264 + ```jsonc 265 + // top-level, alongside "items" 266 + "library": { 267 + "test-prefs": [ /* Question[] */ ] 268 + } 269 + 270 + // usable anywhere an item is expected 271 + { "ref": "test-prefs" } 272 + ``` 273 + 274 + The recursive walker resolves `{ "ref": … }` against `library` and otherwise behaves 275 + identically. Not required for v1 — the schema simply leaves room for it. Flagged here so the 276 + engine doesn't bake in an assumption that blocks it later. 277 + 278 + --- 279 + 280 + ## 6. Open questions to confirm with the appview developer 281 + 282 + 1. **Issue identifier:** AT-URI (assumed here) vs. the numeric per-repo issue id? AT-URI is 283 + globally unambiguous; confirm the engine can resolve it. 284 + 2. 
**Base URL / auth:** how does the appview authenticate to the engine (service token, 285 + mTLS, shared network)? What are the real paths? 286 + 3. **Caching/staleness:** is the questionnaire generated once per issue and frozen, or can it 287 + change (e.g. if the issue body is edited)? If it can change, how do we avoid invalidating 288 + in-flight answers (the `version` field is here to support this). 289 + 4. **Resubmission after consensus:** allowed, or should the appview hide the wizard once the 290 + engine reports a solution exists? (Note: with only two calls, the appview has no 291 + "status" endpoint — it infers "solved" from the linked PR. Confirm that's acceptable, or 292 + we add a lightweight status signal.) 293 + 5. **PR ↔ issue linkage:** confirm the engine sets the pull record's references to the issue 294 + AT-URI so the appview's existing linked-PR rendering picks it up.
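
The traversal semantics in Section 2 and the flat `answers` list in Section 3.2 can be sketched in Python as below. This is an illustrative sketch only, not the engine's or the appview's actual code: `walk`, `is_valid_path`, and the trimmed `ITEMS` sample are hypothetical names introduced here, and the sketch assumes questions and options are plain dicts shaped like the schema above.

```python
from typing import Any, Callable

Question = dict[str, Any]
Answer = dict[str, Any]


def walk(items: list[Question], choose: Callable[[Question], int]) -> list[Answer]:
    """Depth-first walk; `choose` returns the picked option index for a question."""
    answers: list[Answer] = []
    stack: list[list[Any]] = [[items, 0]]  # frames of [question list, position]
    while stack:
        frame = stack[-1]
        questions, pos = frame
        if pos >= len(questions):
            stack.pop()  # sequence finished, resume the parent sequence
            continue
        q = questions[pos]
        k = choose(q)
        opt = q["options"][k]
        answers.append({"questionId": q["id"], "optionIndex": k})
        frame[1] = pos + 1  # move past q in its own frame
        followups = opt.get("followups") or []
        if followups:
            stack.append([followups, 0])  # sub-questions are asked first
    return answers


def is_valid_path(items: list[Question], answers: list[Answer]) -> bool:
    """Replay submitted answers; True only if they form one complete traversal."""
    pending = iter(answers)

    def choose(q: Question) -> int:
        a = next(pending, None)
        if a is None or a.get("questionId") != q["id"]:
            raise ValueError("missing or out-of-order answer")
        k = a.get("optionIndex")
        if not isinstance(k, int) or not 0 <= k < len(q["options"]):
            raise ValueError("option index out of range")
        return k

    try:
        walk(items, choose)
    except ValueError:
        return False
    return next(pending, None) is None  # reject trailing extra answers


# Hypothetical sample: the worked example reduced to ids and short labels.
ITEMS = [
    {"id": "scope", "options": [
        {"label": "New module", "followups": [
            {"id": "mod_name_style", "options": [{"label": "Flat"}, {"label": "Nested"}]},
        ]},
        {"label": "Existing util"},
    ]},
    {"id": "tests", "options": [{"label": "Yes"}, {"label": "No"}]},
]

picks = {"scope": 0, "mod_name_style": 1, "tests": 0}
path = walk(ITEMS, lambda q: picks[q["id"]])
print([a["questionId"] for a in path])  # ['scope', 'mod_name_style', 'tests']
```

Replaying a submission through the same walker is one way the engine could reject answer-sets that skip a shown question or include one the user could not have reached. Whether the engine should validate at all is left to the open questions above.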
+157
agent/atproto.py
··· 1 + """ATProto / PDS helpers for live issue loading.""" 2 + 3 + from __future__ import annotations 4 + 5 + import re 6 + from typing import Any 7 + 8 + import httpx 9 + 10 + DEFAULT_PDS = "https://tngl.sh" 11 + ISSUE_COLLECTION = "sh.tangled.repo.issue" 12 + STATE_COLLECTION = "sh.tangled.repo.issue.state" 13 + REPO_COLLECTION = "sh.tangled.repo" 14 + STATE_OPEN = "sh.tangled.repo.issue.state.open" 15 + STATE_CLOSED = "sh.tangled.repo.issue.state.closed" 16 + 17 + _AT_URI_RE = re.compile( 18 + r"^at://(?P<did>did:[^/]+)/(?P<collection>[^/]+)/(?P<rkey>[^/]+)$" 19 + ) 20 + 21 + 22 + def parse_at_uri(uri: str) -> tuple[str, str, str]: 23 + match = _AT_URI_RE.match(uri.strip()) 24 + if not match: 25 + raise ValueError(f"Not a valid at:// URI: {uri!r}") 26 + return match.group("did"), match.group("collection"), match.group("rkey") 27 + 28 + 29 + def pds_host_for_did(client: httpx.Client, did: str) -> str | None: 30 + resp = client.get(f"https://plc.directory/{did}", timeout=15.0) 31 + if resp.status_code != 200: 32 + return None 33 + doc = resp.json() 34 + for svc in doc.get("service", []): 35 + if svc.get("type") == "AtprotoPersonalDataServer": 36 + endpoint = svc.get("serviceEndpoint") 37 + if isinstance(endpoint, str): 38 + return endpoint.rstrip("/") 39 + return None 40 + 41 + 42 + def handle_from_plc(client: httpx.Client, did: str) -> str | None: 43 + resp = client.get(f"https://plc.directory/{did}", timeout=15.0) 44 + if resp.status_code != 200: 45 + return None 46 + for alias in resp.json().get("alsoKnownAs", []): 47 + if isinstance(alias, str) and alias.startswith("at://"): 48 + return alias.removeprefix("at://") 49 + return None 50 + 51 + 52 + def get_record( 53 + client: httpx.Client, 54 + pds_host: str, 55 + repo_did: str, 56 + collection: str, 57 + rkey: str, 58 + ) -> dict[str, Any]: 59 + resp = client.get( 60 + f"{pds_host.rstrip('/')}/xrpc/com.atproto.repo.getRecord", 61 + params={"repo": repo_did, "collection": collection, "rkey": rkey}, 62 + 
timeout=20.0, 63 + ) 64 + resp.raise_for_status() 65 + data = resp.json() 66 + if not isinstance(data, dict): 67 + raise RuntimeError("getRecord returned non-object") 68 + return data 69 + 70 + 71 + def list_records( 72 + client: httpx.Client, 73 + pds_host: str, 74 + repo_did: str, 75 + collection: str, 76 + *, 77 + limit: int = 100, 78 + ) -> list[dict[str, Any]]: 79 + resp = client.get( 80 + f"{pds_host.rstrip('/')}/xrpc/com.atproto.repo.listRecords", 81 + params={"repo": repo_did, "collection": collection, "limit": limit}, 82 + timeout=20.0, 83 + ) 84 + resp.raise_for_status() 85 + page = resp.json().get("records") or [] 86 + return [r for r in page if isinstance(r, dict)] 87 + 88 + 89 + def issue_state_for_uri( 90 + client: httpx.Client, 91 + pds_host: str, 92 + author_did: str, 93 + issue_uri: str, 94 + issue_rkey: str, 95 + ) -> str: 96 + try: 97 + records = list_records(client, pds_host, author_did, STATE_COLLECTION, limit=100) # listRecords caps limit at 100 98 + except Exception: 99 + return "open" 100 + for rec in records: 101 + value = rec.get("value") 102 + if not isinstance(value, dict): 103 + continue 104 + target = value.get("issue") 105 + if target == issue_uri: 106 + state = value.get("state") 107 + if state == STATE_CLOSED: 108 + return "closed" 109 + return "open" 110 + return "open" 111 + 112 + 113 + def repo_did_from_at_uri(uri: str) -> str | None: 114 + if not uri.startswith("at://"): 115 + return None 116 + did = uri.removeprefix("at://").split("/", 1)[0] 117 + return did if did.startswith("did:") else None 118 + 119 + 120 + def resolve_repo( 121 + client: httpx.Client, 122 + repo_ref: Any, 123 + ) -> dict[str, Any]: 124 + """Resolve issue's ``repo`` field to repo_did, knot_hostname, name, owner_handle.""" 125 + if not isinstance(repo_ref, str) or not repo_ref.strip(): 126 + raise RuntimeError("Issue record has no repo reference") 127 + 128 + if repo_ref.startswith("at://"): 129 + owner_did, collection, repo_rkey = parse_at_uri(repo_ref) 130 + if collection != 
REPO_COLLECTION: 131 + raise RuntimeError(f"Unexpected repo collection: {collection}") 132 + pds = pds_host_for_did(client, owner_did) or DEFAULT_PDS 133 + rec = get_record(client, pds, owner_did, REPO_COLLECTION, repo_rkey) 134 + value = rec.get("value") if isinstance(rec.get("value"), dict) else {} 135 + repo_did = value.get("repoDid") if isinstance(value.get("repoDid"), str) else owner_did 136 + knot = value.get("knotHostname") or value.get("knotHost") or value.get("knot") 137 + name = value.get("name") 138 + owner_handle = handle_from_plc(client, owner_did) 139 + if not isinstance(knot, str) or not knot.strip(): 140 + raise RuntimeError("Repo record missing knot / knotHostname") 141 + return { 142 + "repo_did": repo_did, 143 + "repo_uri": repo_ref, 144 + "repo_name": name if isinstance(name, str) else "", 145 + "repo_owner_did": owner_did, 146 + "repo_owner_handle": owner_handle or "", 147 + "knot_hostname": knot.strip(), 148 + } 149 + 150 + if repo_ref.startswith("did:"): 151 + repo_did = repo_ref 152 + raise RuntimeError( 153 + f"Issue references repo by DID only ({repo_did}). " 154 + "Need an at:// owner/repo record URI or an indexed tangled_repos row." 155 + ) 156 + 157 + raise RuntimeError(f"Unsupported repo reference: {repo_ref!r}")
+103
agent/context.py
··· 1 + """Issue session context injected before the agent runs (no issue-fetch tools).""" 2 + 3 + from __future__ import annotations 4 + 5 + import json 6 + from dataclasses import asdict, dataclass, field 7 + from typing import Any 8 + 9 + 10 + @dataclass 11 + class IssueSessionContext: 12 + """Everything the caller already knows about the issue + repo.""" 13 + 14 + issue_uri: str 15 + issue_rkey: str 16 + title: str 17 + body: str 18 + state: str 19 + author_did: str 20 + author_handle: str 21 + repo_did: str 22 + repo_owner_handle: str 23 + repo_name: str 24 + knot_hostname: str 25 + # Repo paths relative to root (provided by caller — primary navigation aid). 26 + file_tree: list[str] = field(default_factory=list) 27 + ref: str = "HEAD" 28 + extra: dict[str, Any] = field(default_factory=dict) 29 + 30 + @classmethod 31 + def from_dict(cls, data: dict[str, Any]) -> IssueSessionContext: 32 + known = {f.name for f in cls.__dataclass_fields__.values()} # type: ignore[attr-defined] 33 + core = {k: v for k, v in data.items() if k in known and k != "extra"} 34 + extra = dict(data.get("extra") or {}) 35 + for k, v in data.items(): 36 + if k not in known: 37 + extra[k] = v 38 + return cls(**core, extra=extra) 39 + 40 + def to_dict(self) -> dict[str, Any]: 41 + payload = asdict(self) 42 + extra = payload.pop("extra", {}) 43 + if extra: 44 + payload.update(extra) 45 + return payload 46 + 47 + 48 + ISSUE_AGENT_SYSTEM_PROMPT = """\ 49 + You investigate a single Tangled issue. The issue metadata, repo identifiers, and 50 + repository file tree are already provided below — do not ask the user to resolve 51 + handles or DIDs. 52 + 53 + Your job: 54 + 1. Read the issue title/body and identify which files are relevant. 55 + 2. Use ``read_repo_file`` to pull exact source from the knot when you need code. 56 + 3. Use ``list_repo_files`` only if the provided file tree is incomplete or you 57 + need to explore a subdirectory that was not listed. 
58 + 59 + Rules: 60 + - Prefer paths from the provided file tree. 61 + - Read the smallest set of files needed to answer well. 62 + - Cite file paths when referencing code. 63 + - You cannot file issues, push code, or browse outside this repo. 64 + """ 65 + 66 + 67 + def format_issue_context_block(ctx: IssueSessionContext) -> str: 68 + """Serialize session context for the system prompt (cache-friendly static prefix).""" 69 + tree = ctx.file_tree 70 + if len(tree) > 500: 71 + tree_display = tree[:500] + [f"... (+{len(tree) - 500} more paths)"] 72 + else: 73 + tree_display = tree 74 + 75 + block = { 76 + "issue": { 77 + "uri": ctx.issue_uri, 78 + "rkey": ctx.issue_rkey, 79 + "title": ctx.title, 80 + "body": ctx.body, 81 + "state": ctx.state, 82 + "author": {"did": ctx.author_did, "handle": ctx.author_handle}, 83 + }, 84 + "repo": { 85 + "did": ctx.repo_did, 86 + "owner_handle": ctx.repo_owner_handle, 87 + "name": ctx.repo_name, 88 + "knot_hostname": ctx.knot_hostname, 89 + "ref": ctx.ref, 90 + }, 91 + "file_tree": tree_display, 92 + } 93 + if ctx.extra: 94 + block["extra"] = ctx.extra 95 + return json.dumps(block, indent=2, ensure_ascii=False) 96 + 97 + 98 + def build_issue_system_prompt(ctx: IssueSessionContext) -> str: 99 + return ( 100 + f"{ISSUE_AGENT_SYSTEM_PROMPT}\n\n" 101 + f"## Session context (issue + repo)\n\n" 102 + f"```json\n{format_issue_context_block(ctx)}\n```" 103 + )
+249
agent/load_issue.py
··· 1 + """Load issue session context from a single issue URI (live PDS + knot).""" 2 + 3 + from __future__ import annotations 4 + 5 + import os 6 + from collections import deque 7 + from dataclasses import replace 8 + 9 + import httpx 10 + import psycopg 11 + from psycopg.rows import dict_row 12 + 13 + from agent.atproto import ( 14 + DEFAULT_PDS, 15 + ISSUE_COLLECTION, 16 + get_record, 17 + handle_from_plc, 18 + issue_state_for_uri, 19 + parse_at_uri, 20 + pds_host_for_did, 21 + resolve_repo, 22 + ) 23 + from agent.context import IssueSessionContext 24 + from agent.tangled_client import DEFAULT_TIMEOUT, list_tree, normalize_tree_entries 25 + 26 + _ISSUE_SQL = """ 27 + select 28 + i.uri as issue_uri, 29 + i.rkey as issue_rkey, 30 + i.title, 31 + i.body, 32 + i.state, 33 + i.author_did, 34 + i.author_handle, 35 + i.repo_did, 36 + i.repo_uri, 37 + coalesce(r.owner_handle, ti.handle) as repo_owner_handle, 38 + r.name as repo_name, 39 + r.knot_hostname 40 + from tangled_issues i 41 + left join tangled_repos r on r.repo_did = i.repo_did 42 + left join tangled_identities ti 43 + on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1) 44 + where i.uri = %s 45 + """ 46 + 47 + _REPO_SQL = """ 48 + select repo_did, name as repo_name, owner_handle as repo_owner_handle, 49 + knot_hostname, uri as repo_uri 50 + from tangled_repos 51 + where repo_did = %s 52 + limit 1 53 + """ 54 + 55 + 56 + def _join_path(parent: str, name: str) -> str: 57 + if not parent: 58 + return name 59 + return f"{parent.rstrip('/')}/{name}" 60 + 61 + 62 + def build_file_tree( 63 + knot_hostname: str, 64 + repo_did: str, 65 + *, 66 + ref: str = "HEAD", 67 + max_paths: int = 400, 68 + max_depth: int = 4, 69 + ) -> list[str]: 70 + paths: list[str] = [] 71 + queue: deque[tuple[str, int]] = deque([("", 0)]) 72 + 73 + with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client: 74 + while queue and len(paths) < max_paths: 75 + directory, depth = queue.popleft() 76 + try: 77 + tree = 
list_tree( 78 + client, 79 + knot_hostname=knot_hostname, 80 + repo_did=repo_did, 81 + path=directory, 82 + ref=ref, 83 + ) 84 + except Exception: 85 + continue 86 + for entry in normalize_tree_entries(tree): 87 + full = _join_path(directory, entry["name"]) 88 + if entry["type"] == "dir": 89 + if depth + 1 < max_depth: 90 + queue.append((full, depth + 1)) 91 + else: 92 + paths.append(full) 93 + 94 + return sorted(paths) 95 + 96 + 97 + def _repo_from_db(repo_did: str) -> dict | None: 98 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 99 + if not dsn: 100 + return None 101 + if "sslmode=" not in dsn: 102 + sep = "&" if "?" in dsn else "?" 103 + dsn = f"{dsn}{sep}sslmode=require" 104 + try: 105 + with psycopg.connect(dsn, row_factory=dict_row) as conn: 106 + return conn.execute(_REPO_SQL, (repo_did,)).fetchone() 107 + except Exception: 108 + return None 109 + 110 + 111 + def _db_row(issue_uri: str) -> dict | None: 112 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 113 + if not dsn: 114 + return None 115 + if "sslmode=" not in dsn: 116 + sep = "&" if "?" in dsn else "?" 
117 + dsn = f"{dsn}{sep}sslmode=require" 118 + try: 119 + with psycopg.connect(dsn, row_factory=dict_row) as conn: 120 + return conn.execute(_ISSUE_SQL, (issue_uri,)).fetchone() 121 + except Exception: 122 + return None 123 + 124 + 125 + def _resolve_repo_did_only( 126 + client: httpx.Client, 127 + repo_did: str, 128 + db_row: dict | None, 129 + ) -> dict[str, str]: 130 + repo_row = _repo_from_db(repo_did) 131 + knot = (repo_row or {}).get("knot_hostname") or (db_row or {}).get("knot_hostname") 132 + name = (repo_row or {}).get("repo_name") or (db_row or {}).get("repo_name") 133 + owner_handle = (repo_row or {}).get("repo_owner_handle") or (db_row or {}).get( 134 + "repo_owner_handle" 135 + ) 136 + repo_uri = (repo_row or {}).get("repo_uri") or (db_row or {}).get("repo_uri") or "" 137 + 138 + if isinstance(knot, str) and knot.strip(): 139 + return { 140 + "repo_did": repo_did, 141 + "knot_hostname": knot.strip(), 142 + "repo_name": name if isinstance(name, str) else "", 143 + "repo_owner_handle": owner_handle if isinstance(owner_handle, str) else "", 144 + "repo_uri": repo_uri if isinstance(repo_uri, str) else "", 145 + } 146 + 147 + raise RuntimeError( 148 + f"Cannot resolve knot for repo_did={repo_did}. " 149 + "Issue should reference at://owner/sh.tangled.repo/rkey when possible." 150 + ) 151 + 152 + 153 + def fetch_issue_live(issue_uri: str) -> IssueSessionContext: 154 + """Load everything from Tangled live (PDS + knot). 
DB not required.""" 155 + author_did, collection, rkey = parse_at_uri(issue_uri) 156 + if collection != ISSUE_COLLECTION: 157 + raise ValueError(f"Expected {ISSUE_COLLECTION}, got {collection}") 158 + 159 + db_row = _db_row(issue_uri) 160 + 161 + with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client: 162 + pds = pds_host_for_did(client, author_did) or DEFAULT_PDS 163 + record = get_record(client, pds, author_did, collection, rkey) 164 + value = record.get("value") 165 + if not isinstance(value, dict): 166 + raise RuntimeError("Issue record missing value") 167 + 168 + title = value.get("title") if isinstance(value.get("title"), str) else "" 169 + body = value.get("body") if isinstance(value.get("body"), str) else "" 170 + author_handle = handle_from_plc(client, author_did) or "" 171 + state = issue_state_for_uri(client, pds, author_did, issue_uri, rkey) 172 + 173 + repo_ref = value.get("repo") 174 + if isinstance(repo_ref, str) and repo_ref.startswith("did:"): 175 + repo = _resolve_repo_did_only(client, repo_ref, db_row) 176 + else: 177 + repo = resolve_repo(client, repo_ref) 178 + 179 + file_tree = build_file_tree(repo["knot_hostname"], repo["repo_did"]) 180 + 181 + return IssueSessionContext( 182 + issue_uri=issue_uri, 183 + issue_rkey=rkey, 184 + title=title or (db_row or {}).get("title") or "", 185 + body=body or (db_row or {}).get("body") or "", 186 + state=state or (db_row or {}).get("state") or "open", 187 + author_did=author_did, 188 + author_handle=author_handle or (db_row or {}).get("author_handle") or "", 189 + repo_did=repo["repo_did"], 190 + repo_owner_handle=repo.get("repo_owner_handle") or "", 191 + repo_name=repo.get("repo_name") or "", 192 + knot_hostname=repo["knot_hostname"], 193 + file_tree=file_tree, 194 + ref="HEAD", 195 + ) 196 + 197 + 198 + def load_issue_context( 199 + issue_uri: str, 200 + *, 201 + fetch_file_tree: bool = True, 202 + ref: str = "HEAD", 203 + ) -> IssueSessionContext: 204 + """Hydrate session from live 
Tangled APIs; DB is optional cache only.""" 205 + ctx = fetch_issue_live(issue_uri) 206 + if not fetch_file_tree: 207 + return replace(ctx, file_tree=[], ref=ref) 208 + if ref != ctx.ref: 209 + return replace( 210 + ctx, 211 + file_tree=build_file_tree(ctx.knot_hostname, ctx.repo_did, ref=ref), 212 + ref=ref, 213 + ) 214 + return ctx 215 + 216 + 217 + # Backwards-compatible alias 218 + fetch_issue_context = load_issue_context 219 + 220 + 221 + def resolve_issue_uri(issue_id: str) -> str: 222 + """Resolve a full ``at://`` URI or a per-repo issue rkey via ``tangled_issues``.""" 223 + raw = issue_id.strip() 224 + if raw.startswith("at://"): 225 + return raw 226 + 227 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 228 + if not dsn: 229 + raise RuntimeError( 230 + "DB_CONNECTION_STRING is required to resolve issue rkey without at:// URI" 231 + ) 232 + if "sslmode=" not in dsn: 233 + sep = "&" if "?" in dsn else "?" 234 + dsn = f"{dsn}{sep}sslmode=require" 235 + 236 + with psycopg.connect(dsn, row_factory=dict_row) as conn: 237 + rows = conn.execute( 238 + "select uri from tangled_issues where rkey = %s order by fetched_at desc", 239 + (raw,), 240 + ).fetchall() 241 + 242 + if not rows: 243 + raise ValueError(f"No issue with rkey {raw!r} in tangled_issues — pass full at:// URI") 244 + if len(rows) > 1: 245 + uris = [r["uri"] for r in rows[:5]] 246 + raise ValueError( 247 + f"Ambiguous rkey {raw!r} ({len(rows)} issues). Pass full at:// URI. Examples: {uris}" 248 + ) 249 + return rows[0]["uri"]
+226
agent/questionnaire_prompt.py
+"""System prompt for AI-solve questionnaire generation."""
+
+from __future__ import annotations
+
+from agent.context import IssueSessionContext, format_issue_context_block
+
+QUESTIONNAIRE_AGENT_SYSTEM_PROMPT = """\
+You are the **AI-solve questionnaire engine** for Tangled issues.
+
+Your job is to produce a **branching questionnaire** that helps many contributors agree on
+*how* an issue should be implemented. Answers will be aggregated across users; when the
+engine detects consensus, it will generate code and open a pull request. Your questions
+must therefore surface **real, meaningful implementation choices** — not trivia, not
+questions already settled by the issue author, and not preferences that do not affect code.
+
+## What you receive
+
+Issue metadata, repo identifiers, and a file tree are embedded below. You also have
+``read_repo_file`` (and optionally ``list_repo_files``) to inspect the codebase on the knot.
+
+**You must read the repo before writing the questionnaire.** At minimum:
+- README or docs that explain the project
+- Files most likely touched by a fix for this issue (infer from title/body + tree)
+- Existing patterns for the kind of change requested (CLI commands, modules, tests, APIs)
+
+**If the issue is a bug** (crash, wrong output, regression, race, etc.), research the bug
+before designing questions:
+- Trace the **reported symptoms** to the code path (callers, handlers, data flow).
+- Read the **failing or suspect code** and any related tests, error handling, or edge cases.
+- Form **multiple plausible root causes** when the report is ambiguous — do not assume the
+  first theory is correct.
+- Identify **several distinct fix strategies** (e.g. guard at call site vs fix underlying
+  logic vs add validation vs change defaults vs refactor state handling). Each viable
+  strategy should become a branch in the questionnaire — users choose *which fix approach*
+  to take, then answer follow-ups specific to that path.
+- Where reproduction steps exist in the issue, verify them against the code you read.
+
+Do not guess architecture, naming, conventions, or root cause when the source can answer.
+
+## Required workflow (do not skip)
+
+You have two phases. **Do not emit questionnaire JSON during phase 1.**
+
+1. **Research (tools only)** — call ``read_repo_file`` as many times as you need until you
+   understand the repo and issue well enough to write the questionnaire (README, relevant
+   source, tests, similar patterns). There is no fixed file limit — keep reading while it
+   helps. The file tree in context is not enough — read actual contents.
+2. **Generate** — when you are done researching, **stop calling tools**. The system will
+   ask you for the questionnaire JSON in a separate step. Do not output JSON during research.
+
+## Output contract
+
+Return **one JSON object** and nothing else — no markdown fences, no commentary, no preamble.
+The object must validate against this schema (version **2**):
+
+```jsonc
+{
+  "issue": "<at-uri>",   // echo the issue URI from session context exactly
+  "version": 2,
+  "introduction": {
+    "project": "…",      // 2–4 sentences: what this repo is, stack, conventions, status
+    "issue":   "…",      // 2–4 sentences: what the issue asks, constraints, open decisions
+    "approach":"…"       // 2–4 sentences: how the questionnaire guides toward a solution
+  },
+  "items": [ /* ordered Question[] */ ]
+}
+
+// Question
+{
+  "id": "scope",
+  "prompt": "Short question shown as the headline",
+  "context": "1–3 sentences bridging from introduction or parent branch — why we are asking NOW",
+  "explanation": "Extended paragraph: tradeoffs, code facts, what changes depending on the answer",
+  "options": [ /* Option[], at least 2 */ ]
+}
+
+// Option — label only (no separate value field)
+{
+  "label": "Full detailed description of this choice — complete enough to vote on without reading code",
+  "followups": [ /* optional Question[] — same shape as items */ ]
+}
+```
+
+### Narrative coherence (most important)
+
+The questionnaire is a **guided story**, not a checklist of isolated questions.
+
+1. ``introduction`` sets the scene: project reality, issue goal, and how choices chain into a PR.
+2. Each question's ``context`` must **logically follow** from the introduction or from the
+   option the user chose in the parent branch. Reference concrete facts from the repo/issue.
+3. Each ``explanation`` goes deeper: what files/patterns are involved, what breaks if you
+   pick wrong, why reasonable people disagree here.
+4. Follow-up questions must **narrow** the chosen branch — not repeat the parent question.
+   Their ``context`` should say "Because you chose X…" or "Given the lh.nu namespace…".
+5. Top-level questions after branches should **re-converge** with context like "Regardless
+   of backend choice…" so shared tail questions feel connected to the path taken.
+
+If contexts do not read as one continuous briefing, rewrite before emitting JSON.
+
+### Tree semantics (critical)
+
+The questionnaire is a **tree of nested sequences**, not a ``next``-pointer graph.
+
+- ``items`` is the top-level ordered list. Every user walks it in order.
+- When a user picks an option that has ``followups``, those sub-questions are asked
+  **immediately** (depth-first), then traversal **automatically continues** with the next
+  item in the parent list. Re-convergence is free — do not wire branches back manually.
+- Put **path-specific** questions inside ``followups`` on the option they depend on.
+- Put **cross-cutting** questions (tests, docs, breaking changes, migration) as **top-level**
+  ``items`` after branching sections so every path reaches them without duplication.
+
+Traversal (for your mental model — the frontend runs this):
+
+```
+stack = [ (items, 0) ]
+while stack not empty:
+  (list, i) = stack.top
+  if i >= len(list): stack.pop(); continue
+  q = list[i]; user picks option opt
+  stack.top.i += 1
+  if opt.followups is non-empty:
+    stack.push( (opt.followups, 0) )
+```
+
+### Field rules
+
+| Field | Rules |
+|---|---|
+| ``issue`` | Required. Exact AT-URI from context. |
+| ``version`` | Always ``2``. |
+| ``introduction`` | Required. ``project``, ``issue``, ``approach`` — each a substantive paragraph. |
+| ``items`` | Non-empty ordered array. |
+| ``Question.id`` | **Globally unique** snake_case id. Stable across re-fetches. |
+| ``Question.prompt`` | Short headline — one decision per question. |
+| ``Question.context`` | Required. Bridges from intro/parent; must read as the next logical paragraph. |
+| ``Question.explanation`` | Required. Extended detail on tradeoffs and repo-specific facts. |
+| ``Option.label`` | Required. **The entire option text** — detailed description, not a terse button label. No ``value`` field. |
+| ``Option.followups`` | Omit or ``[]`` when no sub-questions. |
+
+## How to design a good questionnaire
+
+### Goal
+
+Surface disagreements that **change the diff**: file placement, API shape, dependency choices,
+compatibility, test strategy, error-handling philosophy, scope (minimal vs holistic), etc.
+
+### Recommended shape
+
+1. **Anchor question** — highest-level approach (often ``items[0]``). Branch heavily here.
+   For bugs, anchor on **which fix strategy** (root-cause fix vs workaround vs defensive
+   guard vs broader refactor) — each option should reflect a real alternative you found in code.
+2. **Branch depth** — 2–4 levels of ``followups`` where paths genuinely diverge. Shallow
+   branches that only differ in wording are useless.
+3. **Shared tail** — 2–4 top-level items after branches for concerns every path shares
+   (tests, docs, deprecation, rollout).
+4. **Size** — aim for **8–15 distinct question ids** across the full tree for a typical
+   issue; more for large features, fewer for tiny fixes. Every question must earn its place.
+
+### Dimensions to branch on (when relevant to this issue)
+
+Use only what applies after reading the code — do not checkbox every row blindly.
+
+**Bug issues** — branch when multiple fixes are viable:
+- **Root cause**: patch the faulty logic vs fix upstream/downstream caller
+- **Fix depth**: minimal one-line guard vs proper invariant fix vs refactor the subsystem
+- **Symptom vs cause**: suppress/handle the error vs eliminate the triggering condition
+- **Regression**: add test reproducing the bug; fix only vs fix + harden related paths
+- **Blast radius**: local patch vs shared utility change affecting other call sites
+
+**Feature / enhancement issues**:
+- **Placement**: new module vs extend existing file/package; public API surface vs internal
+- **Interface**: CLI subcommand vs library function vs config flag; naming aligned with repo
+- **Behavior**: strict vs permissive validation; fail-fast vs graceful degradation
+- **Compatibility**: breaking change vs backward-compatible shim; feature flag vs always-on
+- **Dependencies**: reuse existing util vs add dependency (name the tradeoff)
+- **Data / state**: persistence, migrations, defaults
+- **Errors & UX**: error messages, exit codes, logging level
+- **Tests**: unit vs integration; fixtures; what to assert
+- **Docs**: README, inline docs, changelog entry
+- **Scope**: minimal fix vs refactor while here; out-of-scope follow-ups as explicit option
+
+### Diversity requirements
+
+- Options must represent **distinct implementation paths**, not synonyms.
+- Avoid false choices (one option obviously correct given the codebase).
+- Include at least one **conservative / minimal** and one **broader** path when reasonable.
+- When the issue is ambiguous, ask **clarifying** branch questions early in ``followups``.
+- Reflect **repo conventions** you observed (e.g. if tests live in ``*_test.go``, ask about
+  test file placement using real paths/patterns from the tree).
+
+### Anti-patterns (do not do these)
+
+- Do not ask what the issue already states as a requirement.
+- Do not ask "which files to edit" with a single correct answer you could infer.
+- Do not duplicate the same question under multiple branches — hoist to top-level ``items``.
+- Do not use flat linked-list / ``next`` field thinking.
+- Do not ask open-ended free text — every step is multiple choice.
+- Do not invent options that violate project constraints visible in the repo.
+- Do not output invalid JSON (trailing commas, comments, single quotes).
+
+## ID conventions
+
+- ``Question.id``: lowercase ``snake_case``, globally unique, semantic (``backend_tool``, ``rename_deprecation``).
+- Options have **no id** — answers are recorded by ``questionId`` + ``optionIndex`` (0-based).
+
+## Process
+
+1. Read the issue and repo; draft ``introduction`` first — project, issue, approach.
+2. Read relevant source via tools until you understand viable solution paths.
+3. Draft the tree: each question's ``context`` + ``explanation`` must chain narratively.
+4. Write option ``label`` strings as self-contained descriptions a contributor can judge.
+5. Validate: unique ids; ≥ 2 options per question; contexts chain logically; JSON parses.
+6. Emit the final JSON object only.
+
+You cannot push code, file issues, or browse outside this repo. Your sole deliverable is
+the questionnaire JSON.
+"""
+
+
+def build_questionnaire_system_prompt(ctx: IssueSessionContext) -> str:
+    """System prompt for questionnaire generation (issue context appended)."""
+    return (
+        f"{QUESTIONNAIRE_AGENT_SYSTEM_PROMPT}\n\n"
+        f"## Session context (issue + repo)\n\n"
+        f"```json\n{format_issue_context_block(ctx)}\n```"
+    )
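The traversal pseudocode in the "Tree semantics" part of the prompt translates almost line for line into Python. The following is a minimal sketch of that stack walk, not the frontend's actual implementation (that lives in `viewer.js`, which is not shown here). The function name `walk` and the `choose` callback are hypothetical; the dict shapes follow the v2 schema in the prompt.

```python
from typing import Any, Callable

Question = dict[str, Any]


def walk(items: list[Question], choose: Callable[[Question], int]) -> list[tuple[str, int]]:
    """Walk a v2 questionnaire depth-first; return (question id, option index) pairs.

    `choose` stands in for the user: it receives a question and returns the
    0-based index of the option picked.
    """
    answers: list[tuple[str, int]] = []
    stack: list[list[Any]] = [[items, 0]]  # each frame is [question list, next index]
    while stack:
        frame = stack[-1]
        questions, i = frame
        if i >= len(questions):
            stack.pop()  # this list is exhausted; resume the parent list
            continue
        question = questions[i]
        option_index = choose(question)
        answers.append((question["id"], option_index))
        frame[1] = i + 1  # advance the parent before descending
        followups = question["options"][option_index].get("followups") or []
        if followups:
            stack.append([followups, 0])
    return answers


# Tiny illustrative tree (ids and labels are made up).
SAMPLE = [
    {
        "id": "scope",
        "options": [
            {"label": "Minimal guard", "followups": [
                {"id": "scope_guard_site", "options": [{"label": "Caller"}, {"label": "Callee"}]},
            ]},
            {"label": "Broader refactor"},
        ],
    },
    {"id": "tests", "options": [{"label": "Unit"}, {"label": "Integration"}]},
]

if __name__ == "__main__":
    print(walk(SAMPLE, lambda q: 0))
    print(walk(SAMPLE, lambda q: 1))
```

Because the parent index is advanced before the follow-ups are pushed, re-convergence needs no extra wiring: once a follow-up list is exhausted, the walk resumes at the next parent item.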
+152
agent/questionnaire_repo_store.py
+"""Publish AI-solve questionnaires to the knot-hosted git repo (vectorseachdb).
+
+The generation job dual-writes: it upserts Postgres (agent/questionnaire_store.py)
+AND, when QUESTIONNAIRE_PUBLISH_REPO is set, publishes the questionnaire as a single
+JSON file in the embeddings repo on the knot:
+
+    questionnaires/<did>/<rkey>.json   # one file per issue, fetched per-item
+
+Design choices that make this safe in an ephemeral, possibly-concurrent Cloud Run job:
+  - **Sparse + partial clone** (`--filter=blob:none --sparse`, sparse-set `questionnaires`)
+    so we never download the ~18 MB embedding matrices that share this repo.
+  - **Per-issue unique path** → concurrent jobs touch different files; no content conflicts.
+  - **`index.json` is NOT written here** (it would conflict across concurrent jobs) — it's
+    rebuilt by scraper/export_questionnaires.py. Consumers can read files by path directly.
+  - **Push with `pull --rebase` + retry** to tolerate the embeddings export pushing too.
+
+Config (env):
+  QUESTIONNAIRE_REPO_GIT_URL   e.g. git@tangled.org:did:plc:vg4msk54xucet6of2rdrgahe (required)
+  QUESTIONNAIRE_REPO_DIR       local checkout dir (default /tmp/qrepo)
+  QUESTIONNAIRE_REPO_BRANCH    default "main"
+  QUESTIONNAIRE_PUBLISH_PUSH   "0" to commit but skip push (local testing); default "1"
+  QUESTIONNAIRE_SSH_KEY        optional path to the deploy key (added to GIT_SSH_COMMAND)
+  GIT_SSH_COMMAND              respected if already set
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+from pathlib import Path
+from typing import Any
+
+_PUSH_RETRIES = 4
+
+
+def publishing_enabled() -> bool:
+    return os.getenv("QUESTIONNAIRE_PUBLISH_REPO", "").strip().lower() in ("1", "true", "yes")
+
+
+def issue_uri_to_relpath(issue_uri: str) -> str:
+    """at://<did>/sh.tangled.repo.issue/<rkey> -> questionnaires/<did>/<rkey>.json
+    (must match scraper/export_questionnaires.py)."""
+    rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
+    parts = rest.split("/")
+    return f"questionnaires/{parts[0]}/{parts[-1]}.json"
+
+
+def _resolve_ssh_key() -> str | None:
+    """Return a path to a usable private key, or None.
+
+    Prefers QUESTIONNAIRE_SSH_KEY (a path). Otherwise, if QUESTIONNAIRE_SSH_KEY_CONTENTS
+    is set (e.g. a Secret Manager env var in Cloud Run), materialize it to a 0600 temp
+    file — secret *volume* mounts are world-readable, which ssh rejects, so env-injection
+    + chmod is the robust path."""
+    path = os.getenv("QUESTIONNAIRE_SSH_KEY", "").strip()
+    if path and Path(path).exists():
+        return path
+    contents = os.getenv("QUESTIONNAIRE_SSH_KEY_CONTENTS", "")
+    if contents.strip():
+        dest = Path(os.getenv("QUESTIONNAIRE_REPO_DIR", "/tmp/qrepo")).parent / "qrepo_ssh_key"
+        body = contents if contents.endswith("\n") else contents + "\n"
+        dest.write_text(body)
+        dest.chmod(0o600)
+        return str(dest)
+    return None
+
+
+def _git_env() -> dict[str, str]:
+    env = dict(os.environ)
+    if "GIT_SSH_COMMAND" not in env:
+        cmd = "ssh -o StrictHostKeyChecking=accept-new -o ConnectTimeout=30"
+        key = _resolve_ssh_key()
+        if key:
+            cmd += f" -i {key} -o IdentitiesOnly=yes"
+        env["GIT_SSH_COMMAND"] = cmd
+    return env
+
+
+def _git(repo: Path, *args: str) -> str:
+    out = subprocess.run(
+        ["git", *args], cwd=str(repo), env=_git_env(),
+        capture_output=True, text=True,
+    )
+    if out.returncode != 0:
+        raise RuntimeError(f"git {' '.join(args)} failed: {out.stderr.strip() or out.stdout.strip()}")
+    return out.stdout
+
+
+def _ensure_checkout(url: str, repo: Path, branch: str) -> None:
+    if (repo / ".git").is_dir():
+        _git(repo, "fetch", "origin", branch)
+        _git(repo, "checkout", branch)
+        _git(repo, "reset", "--hard", f"origin/{branch}")
+        return
+    repo.parent.mkdir(parents=True, exist_ok=True)
+    subprocess.run(
+        ["git", "clone", "--filter=blob:none", "--sparse", "--branch", branch, url, str(repo)],
+        env=_git_env(), capture_output=True, text=True, check=True,
+    )
+    _git(repo, "sparse-checkout", "set", "questionnaires")
+
+
+def _file_record(issue_uri: str, payload: dict[str, Any], created_at, updated_at) -> str:
+    rec = {
+        "issue_uri": issue_uri,
+        "version": payload.get("version") if isinstance(payload, dict) else None,
+        "created_at": created_at.isoformat() if hasattr(created_at, "isoformat") else created_at,
+        "updated_at": updated_at.isoformat() if hasattr(updated_at, "isoformat") else updated_at,
+        "payload": payload,
+    }
+    return json.dumps(rec, ensure_ascii=False, indent=2) + "\n"
+
+
+def publish_to_repo(issue_uri: str, payload: dict[str, Any], created_at=None, updated_at=None) -> str:
+    """Write the questionnaire file, commit, and (unless disabled) push. Returns the
+    relative path written. Raises on failure — callers treat publishing as best-effort."""
+    url = os.getenv("QUESTIONNAIRE_REPO_GIT_URL", "").strip()
+    if not url:
+        raise RuntimeError("QUESTIONNAIRE_REPO_GIT_URL is not set")
+    repo = Path(os.getenv("QUESTIONNAIRE_REPO_DIR", "/tmp/qrepo")).expanduser()
+    branch = os.getenv("QUESTIONNAIRE_REPO_BRANCH", "main")
+    do_push = os.getenv("QUESTIONNAIRE_PUBLISH_PUSH", "1").strip().lower() not in ("0", "false", "no")
+
+    _ensure_checkout(url, repo, branch)
+
+    rel = issue_uri_to_relpath(issue_uri)
+    path = repo / rel
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(_file_record(issue_uri, payload, created_at, updated_at), encoding="utf-8")
+
+    _git(repo, "add", rel)
+    if not _git(repo, "status", "--porcelain").strip():
+        return rel  # no change (identical content) — nothing to commit
+    _git(repo, "-c", "user.name=tangled-questionnaire", "-c", "user.email=bot@stuhi.org",
+         "commit", "-m", f"questionnaire: {issue_uri}")
+
+    if not do_push:
+        return rel
+    last_err: Exception | None = None
+    for _ in range(_PUSH_RETRIES):
+        try:
+            _git(repo, "push", "origin", branch)
+            return rel
+        except RuntimeError as e:  # non-fast-forward (a concurrent push) — rebase + retry
+            last_err = e
+            try:
+                _git(repo, "pull", "--rebase", "origin", branch)
+            except RuntimeError as pe:
+                last_err = pe
+                break
+    raise RuntimeError(f"push failed after retries: {last_err}")
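The push loop at the end of `publish_to_repo` is easier to reason about with the git calls factored out. The sketch below reproduces only the control flow, with the push and rebase steps passed in as callables so it can run without a repository. The name `push_with_rebase` and its parameters are hypothetical, and unlike the original it returns the number of push attempts.

```python
from typing import Callable, Optional


def push_with_rebase(
    push: Callable[[], None],
    rebase: Callable[[], None],
    retries: int = 4,
) -> int:
    """Try to push; on failure rebase onto the remote and try again.

    Returns the number of push attempts made. Raises RuntimeError when every
    attempt fails or when a rebase itself fails (as the original loop does).
    """
    last_err: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            push()
            return attempt
        except RuntimeError as err:  # e.g. non-fast-forward after a concurrent push
            last_err = err
            try:
                rebase()
            except RuntimeError as rebase_err:
                last_err = rebase_err
                break  # a failed rebase will not fix itself; stop retrying
    raise RuntimeError(f"push failed after retries: {last_err}")


if __name__ == "__main__":
    state = {"pushes": 0}

    def flaky_push() -> None:
        # Simulate two concurrent pushes winning the race before ours lands.
        state["pushes"] += 1
        if state["pushes"] < 3:
            raise RuntimeError("non-fast-forward")

    print(push_with_rebase(flaky_push, lambda: None))
```

Because each job writes a distinct per-issue file, the rebase step is expected to apply cleanly. The retry exists to absorb ordering races, not content conflicts.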
+95
agent/questionnaire_store.py
+"""Persist AI-solve questionnaires in Postgres."""
+
+from __future__ import annotations
+
+import json
+import os
+from typing import Any
+
+import psycopg
+from psycopg.rows import dict_row
+from psycopg.types.json import Jsonb
+
+_UPSERT = """
+insert into tangled_issue_questionnaires (issue_uri, payload, updated_at)
+values (%s, %s, now())
+on conflict (issue_uri) do update set
+    payload = excluded.payload,
+    updated_at = now()
+returning issue_uri, created_at, updated_at
+"""
+
+_GET = """
+select issue_uri, payload, created_at, updated_at
+from tangled_issue_questionnaires
+where issue_uri = %s
+"""
+
+
+def _connection_string() -> str:
+    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
+    if not dsn:
+        raise RuntimeError("DB_CONNECTION_STRING is not set")
+    return dsn
+
+
+def parse_questionnaire_json(raw: str) -> dict[str, Any]:
+    """Parse model output into a questionnaire dict (tolerates fences and preamble)."""
+    import re
+    from json import JSONDecoder
+
+    text = raw.strip()
+    if not text:
+        raise ValueError("Empty model response — expected questionnaire JSON")
+
+    fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
+    if fence:
+        text = fence.group(1).strip()
+
+    decoder = JSONDecoder()
+    try:
+        data, _ = decoder.raw_decode(text)
+    except json.JSONDecodeError:
+        start = text.find("{")
+        if start < 0:
+            preview = text[:300].replace("\n", " ")
+            raise ValueError(
+                f"No JSON object in model response (preview: {preview!r})"
+            ) from None
+        data, _ = decoder.raw_decode(text[start:])
+
+    if not isinstance(data, dict) or not isinstance(data.get("items"), list):
+        raise ValueError("Invalid questionnaire: expected object with items[]")
+    return data
+
+
+def save_questionnaire(issue_uri: str, payload: dict[str, Any]) -> dict[str, Any]:
+    """Insert or replace the questionnaire for an issue. Returns row metadata."""
+    if payload.get("issue") and payload["issue"] != issue_uri:
+        raise ValueError(
+            f"payload.issue ({payload['issue']!r}) does not match issue_uri ({issue_uri!r})"
+        )
+    with psycopg.connect(_connection_string(), row_factory=dict_row) as conn:
+        row = conn.execute(
+            _UPSERT,
+            (issue_uri, Jsonb(payload)),
+        ).fetchone()
+        conn.commit()
+    return dict(row)
+
+
+def get_questionnaire(issue_uri: str) -> dict[str, Any] | None:
+    """Load cached questionnaire JSON, or None if missing."""
+    with psycopg.connect(_connection_string(), row_factory=dict_row) as conn:
+        row = conn.execute(_GET, (issue_uri,)).fetchone()
+    if not row:
+        return None
+    payload = row["payload"]
+    if isinstance(payload, str):
+        payload = json.loads(payload)
+    return {
+        "issue_uri": row["issue_uri"],
+        "payload": payload,
+        "created_at": row["created_at"],
+        "updated_at": row["updated_at"],
+    }
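`parse_questionnaire_json` leans on `json.JSONDecoder.raw_decode`, which parses one JSON value from the start of a string and ignores whatever follows it. A standalone sketch of that technique, using only the standard library so it runs without `psycopg`, might look like this. The name `extract_json_object` is hypothetical, and the sketch omits the `items[]` shape check that the real function performs.

```python
import json
import re

# Same fence pattern as parse_questionnaire_json: an optional "json" tag,
# then a greedy {...} capture between triple backticks.
_FENCE = re.compile(r"```(?:json)?\s*(\{.*\})\s*```", re.DOTALL)


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object found in model output.

    Tolerates a markdown fence, a prose preamble, and trailing text.
    """
    text = raw.strip()
    fence = _FENCE.search(text)
    if fence:
        text = fence.group(1)
    start = text.find("{")
    if start < 0:
        raise ValueError("no JSON object found")
    # raw_decode stops at the end of the first complete value, so any
    # commentary after the closing brace is ignored rather than rejected.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


if __name__ == "__main__":
    fenced = 'Here you go:\n```json\n{"version": 2, "items": []}\n```\nHope that helps.'
    print(extract_json_object(fenced))
    print(extract_json_object('preamble {"items": [1, 2]} trailing words'))
```

`json.JSONDecodeError` is a subclass of `ValueError`, so callers can catch `ValueError` alone for both the "no object" and "malformed object" cases.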
+31
agent/questionnaires/README.md
+# Questionnaire tree viewer
+
+Small static frontend for exploring AI-solve questionnaire JSON.
+
+## Run locally
+
+Browsers block `fetch()` for local files, so serve this folder:
+
+```bash
+cd agent/questionnaires
+python -m http.server 8765
+```
+
+Open [http://localhost:8765](http://localhost:8765).
+
+## Features
+
+- **Introduction** — project, issue, and approach context shown at the top and in walk-through
+- **Tree view** — nested questions with `context`, `explanation`, and detailed option labels
+- **Walk-through** — interactive simulator with narrative context per step (depth-first stack)
+- **Schema v2** — options are `{ "label": "detailed description…" }` only; answers use `optionIndex`
+- **Load** sample (`test.json`), upload a `.json` / `.txt` file, or paste JSON (supports markdown fences; v1 auto-normalized)
+
+## Sample data
+
+- `test.json` — parsed questionnaire for the AtomicXR lighthouse pair issue
+- `test.txt` — same content with markdown code fence (also loadable)
+
+## Output
+
+Walk-through mode builds the flat `POST /answers` payload shape when you finish all questions.
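Before loading a file into the viewer, it can help to check the few schema v2 rules that are mechanical: `version` is 2, `items` is non-empty, ids are globally unique, every question has at least two options, and every option has a label. The sketch below is one possible checker under those assumptions. The function name `validate_questionnaire` and the error strings are hypothetical, and it does not check `introduction`, `context`, or `explanation`.

```python
from typing import Any


def validate_questionnaire(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    errors: list[str] = []
    seen_ids: set[str] = set()

    def visit(questions: list[dict[str, Any]], path: str) -> None:
        for q_idx, question in enumerate(questions):
            where = f"{path}[{q_idx}]"
            qid = question.get("id")
            if not isinstance(qid, str) or not qid:
                errors.append(f"{where}: missing id")
            elif qid in seen_ids:
                errors.append(f"{where}: duplicate id {qid!r}")
            else:
                seen_ids.add(qid)
            options = question.get("options") or []
            if len(options) < 2:
                errors.append(f"{where}: needs at least 2 options")
            for o_idx, option in enumerate(options):
                if not option.get("label"):
                    errors.append(f"{where}.options[{o_idx}]: missing label")
                # Follow-ups share the Question shape, so recurse into them.
                visit(option.get("followups") or [], f"{where}.options[{o_idx}].followups")

    if doc.get("version") != 2:
        errors.append("version must be 2")
    items = doc.get("items")
    if not isinstance(items, list) or not items:
        errors.append("items must be a non-empty list")
    else:
        visit(items, "items")
    return errors


if __name__ == "__main__":
    sample = {
        "version": 2,
        "items": [
            {"id": "scope", "options": [{"label": "Only one option"}]},
            {"id": "scope", "options": [{"label": "A"}, {"label": "B"}]},
        ],
    }
    for problem in validate_questionnaire(sample):
        print(problem)
```

Ids are tracked in one set across the whole tree, including follow-ups, because the prompt requires them to be globally unique rather than unique per branch.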
+389
agent/questionnaires/index.html
··· 1 + <!DOCTYPE html> 2 + <html lang="en"> 3 + <head> 4 + <meta charset="UTF-8" /> 5 + <meta name="viewport" content="width=device-width, initial-scale=1.0" /> 6 + <title>Questionnaire tree viewer</title> 7 + <link rel="preconnect" href="https://fonts.googleapis.com" /> 8 + <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /> 9 + <link 10 + href="https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;500&family=IBM+Plex+Sans:wght@400;500;600&display=swap" 11 + rel="stylesheet" 12 + /> 13 + <link rel="stylesheet" href="./styles.css" /> 14 + </head> 15 + <body> 16 + <header> 17 + <div> 18 + <h1>Questionnaire tree viewer</h1> 19 + <p class="subtitle"> 20 + Explore branching questions &amp; followups — static tree + interactive walk-through 21 + </p> 22 + </div> 23 + <div class="toolbar"> 24 + <label class="file-btn"> 25 + Load JSON 26 + <input id="file-input" type="file" accept=".json,.txt,application/json" /> 27 + </label> 28 + <button type="button" id="load-sample">Load sample</button> 29 + <button type="button" id="paste-toggle">Paste JSON</button> 30 + </div> 31 + </header> 32 + 33 + <div id="paste-panel" class="panel hidden" style="margin-bottom: 1rem"> 34 + <div class="panel-head"><h2>Paste questionnaire JSON</h2></div> 35 + <div class="panel-body"> 36 + <textarea 37 + id="paste-area" 38 + rows="8" 39 + style="width: 100%; font-family: var(--mono); background: var(--bg); color: var(--text); border: 1px solid var(--border); border-radius: 8px; padding: 0.75rem" 40 + placeholder='{ "issue": "at://…", "version": 1, "items": [ … ] }' 41 + ></textarea> 42 + <div style="margin-top: 0.5rem; display: flex; gap: 0.5rem"> 43 + <button type="button" class="primary" id="paste-apply">Apply</button> 44 + <button type="button" id="paste-cancel">Cancel</button> 45 + </div> 46 + </div> 47 + </div> 48 + 49 + <div id="error" class="error hidden"></div> 50 + 51 + <div id="app" class="hidden"> 52 + <div id="intro-root"></div> 53 + <div class="meta" 
id="meta"></div> 54 + 55 + <div class="tabs"> 56 + <button type="button" class="tab is-active" data-tab="split">Split view</button> 57 + <button type="button" class="tab" data-tab="tree">Tree only</button> 58 + <button type="button" class="tab" data-tab="walk">Walk only</button> 59 + </div> 60 + 61 + <div class="layout" id="layout-split"> 62 + <section class="panel"> 63 + <div class="panel-head"> 64 + <h2>Question tree</h2> 65 + <span class="chip" id="tree-hint">click a question id to focus</span> 66 + </div> 67 + <div class="panel-body" id="tree-root"></div> 68 + </section> 69 + 70 + <section class="panel"> 71 + <div class="panel-head"> 72 + <h2>Walk-through</h2> 73 + <div class="toolbar"> 74 + <button type="button" id="walk-back" disabled>Back</button> 75 + <button type="button" id="walk-reset">Reset</button> 76 + </div> 77 + </div> 78 + <div class="panel-body" id="walk-root"></div> 79 + </section> 80 + </div> 81 + 82 + <div class="panel hidden" id="layout-tree"> 83 + <div class="panel-head"><h2>Question tree</h2></div> 84 + <div class="panel-body" id="tree-root-full"></div> 85 + </div> 86 + 87 + <div class="panel hidden" id="layout-walk"> 88 + <div class="panel-head"> 89 + <h2>Walk-through</h2> 90 + <div class="toolbar"> 91 + <button type="button" id="walk-back-solo" disabled>Back</button> 92 + <button type="button" id="walk-reset-solo">Reset</button> 93 + </div> 94 + </div> 95 + <div class="panel-body" id="walk-root-solo"></div> 96 + </div> 97 + </div> 98 + 99 + <p id="empty" class="empty">Load a questionnaire JSON file to begin.</p> 100 + 101 + <script type="module"> 102 + import { 103 + parseQuestionnaire, 104 + stats, 105 + QuestionnaireWalker, 106 + renderTree, 107 + renderIntroduction, 108 + renderQuestionMeta, 109 + escapeHtml, 110 + } from "./viewer.js"; 111 + 112 + /** @type {import('./viewer.js').Questionnaire | null} */ 113 + let questionnaire = null; 114 + /** @type {QuestionnaireWalker | null} */ 115 + let walker = null; 116 + let focusedQuestionId 
= null; 117 + let activeTab = "split"; 118 + 119 + const els = { 120 + app: document.getElementById("app"), 121 + empty: document.getElementById("empty"), 122 + error: document.getElementById("error"), 123 + meta: document.getElementById("meta"), 124 + introRoot: document.getElementById("intro-root"), 125 + treeRoot: document.getElementById("tree-root"), 126 + treeRootFull: document.getElementById("tree-root-full"), 127 + walkRoot: document.getElementById("walk-root"), 128 + walkRootSolo: document.getElementById("walk-root-solo"), 129 + fileInput: document.getElementById("file-input"), 130 + pastePanel: document.getElementById("paste-panel"), 131 + pasteArea: document.getElementById("paste-area"), 132 + layoutSplit: document.getElementById("layout-split"), 133 + layoutTree: document.getElementById("layout-tree"), 134 + layoutWalk: document.getElementById("layout-walk"), 135 + walkBack: document.getElementById("walk-back"), 136 + walkBackSolo: document.getElementById("walk-back-solo"), 137 + walkReset: document.getElementById("walk-reset"), 138 + walkResetSolo: document.getElementById("walk-reset-solo"), 139 + }; 140 + 141 + function showError(message) { 142 + els.error.textContent = message; 143 + els.error.classList.remove("hidden"); 144 + } 145 + 146 + function clearError() { 147 + els.error.classList.add("hidden"); 148 + els.error.textContent = ""; 149 + } 150 + 151 + function loadQuestionnaire(data) { 152 + clearError(); 153 + questionnaire = data; 154 + walker = new QuestionnaireWalker(questionnaire); 155 + focusedQuestionId = null; 156 + els.empty.classList.add("hidden"); 157 + els.app.classList.remove("hidden"); 158 + renderAll(); 159 + } 160 + 161 + function renderIntro() { 162 + els.introRoot.replaceChildren(); 163 + if (questionnaire.introduction) { 164 + els.introRoot.appendChild(renderIntroduction(questionnaire.introduction)); 165 + } 166 + } 167 + 168 + function renderMeta() { 169 + const s = stats(questionnaire); 170 + els.meta.innerHTML = ` 171 + 
<span><strong>issue</strong> ${escapeHtml(questionnaire.issue)}</span> 172 + <span><strong>version</strong> ${questionnaire.version}</span> 173 + <div class="chips"> 174 + <span class="chip">${s.totalQuestions} questions</span> 175 + <span class="chip">${s.totalOptions} options</span> 176 + <span class="chip">${s.branchCount} branches</span> 177 + <span class="chip">depth ${s.maxDepth}</span> 178 + </div> 179 + `; 180 + } 181 + 182 + function isActive(id) { 183 + if (focusedQuestionId === id) return true; 184 + return walker?.activePathIds().includes(id) ?? false; 185 + } 186 + 187 + function onSelectQuestion(id) { 188 + focusedQuestionId = id; 189 + renderTrees(); 190 + } 191 + 192 + function renderTrees() { 193 + for (const root of [els.treeRoot, els.treeRootFull]) { 194 + root.replaceChildren( 195 + renderTree(questionnaire.items, 1, isActive, onSelectQuestion) 196 + ); 197 + } 198 + } 199 + 200 + function renderWalk(target) { 201 + if (walker.showIntroduction && questionnaire.introduction) { 202 + target.innerHTML = ` 203 + <div class="walk-card intro-walk"> 204 + ${renderIntroduction(questionnaire.introduction).outerHTML} 205 + <button type="button" class="primary walk-begin" style="margin-top: 1rem"> 206 + Begin questionnaire 207 + </button> 208 + </div> 209 + `; 210 + target.querySelector(".walk-begin")?.addEventListener("click", () => { 211 + walker.dismissIntroduction(); 212 + renderAll(); 213 + }); 214 + return; 215 + } 216 + 217 + const question = walker.current(); 218 + 219 + if (!question) { 220 + target.innerHTML = ` 221 + <div class="walk-card"> 222 + <p class="walk-done">Questionnaire complete</p> 223 + <p class="subtitle">${walker.answers.length} answers recorded</p> 224 + </div> 225 + <div class="answer-log"> 226 + <h3>Answer payload</h3> 227 + <pre class="json-preview">${escapeHtml( 228 + JSON.stringify(walker.toAnswerPayload(), null, 2) 229 + )}</pre> 230 + </div> 231 + `; 232 + return; 233 + } 234 + 235 + const crumbs = walker.answers 236 + 
.map( 237 + (a) => 238 + `<span class="crumb">${escapeHtml(a.questionId)} · #${a.optionIndex + 1}</span>` 239 + ) 240 + .join(""); 241 + const step = walker.answers.length + 1; 242 + 243 + const options = question.options 244 + .map((opt, optionIndex) => { 245 + const hasFollowups = Boolean(opt.followups?.length); 246 + return ` 247 + <button 248 + type="button" 249 + class="walk-option${hasFollowups ? " has-followups" : ""}" 250 + data-index="${optionIndex}" 251 + > 252 + ${escapeHtml(opt.label)} 253 + ${hasFollowups ? '<small>Opens follow-up questions</small>' : ""} 254 + </button> 255 + `; 256 + }) 257 + .join(""); 258 + 259 + target.innerHTML = ` 260 + <div class="breadcrumb"> 261 + ${crumbs || '<span class="crumb">start</span>'} 262 + <span class="crumb is-current">${escapeHtml(question.id)}</span> 263 + </div> 264 + <div class="walk-card"> 265 + <div class="walk-step">Question ${step}</div> 266 + <div class="walk-id">${escapeHtml(question.id)}</div> 267 + <p class="walk-prompt">${escapeHtml(question.prompt)}</p> 268 + ${renderQuestionMeta(question)} 269 + <div class="walk-options">${options}</div> 270 + </div> 271 + ${ 272 + walker.answers.length 273 + ? `<div class="answer-log"><h3>So far</h3>${walker.answers 274 + .map( 275 + (a) => ` 276 + <dl class="answer-row"> 277 + <dt>${escapeHtml(a.questionId)}</dt> 278 + <dd>${escapeHtml(a.label)}</dd> 279 + </dl>` 280 + ) 281 + .join("")}</div>` 282 + : "" 283 + } 284 + `; 285 + 286 + target.querySelectorAll(".walk-option").forEach((btn) => { 287 + btn.addEventListener("click", () => { 288 + walker.pick(Number(btn.dataset.index)); 289 + focusedQuestionId = walker.current()?.id ?? 
null; 290 + renderAll(); 291 + }); 292 + }); 293 + } 294 + 295 + function renderWalkPanels() { 296 + renderWalk(els.walkRoot); 297 + renderWalk(els.walkRootSolo); 298 + const canBack = 299 + walker.showIntroduction || 300 + walker.answers.length > 0; 301 + els.walkBack.disabled = !canBack; 302 + els.walkBackSolo.disabled = !canBack; 303 + } 304 + 305 + function renderAll() { 306 + renderIntro(); 307 + renderMeta(); 308 + renderTrees(); 309 + renderWalkPanels(); 310 + } 311 + 312 + function setTab(tab) { 313 + activeTab = tab; 314 + document.querySelectorAll(".tab").forEach((btn) => { 315 + btn.classList.toggle("is-active", btn.dataset.tab === tab); 316 + }); 317 + els.layoutSplit.classList.toggle("hidden", tab !== "split"); 318 + els.layoutTree.classList.toggle("hidden", tab !== "tree"); 319 + els.layoutWalk.classList.toggle("hidden", tab !== "walk"); 320 + } 321 + 322 + async function loadSample() { 323 + try { 324 + const res = await fetch("./test.json"); 325 + if (!res.ok) throw new Error(`HTTP ${res.status}`); 326 + loadQuestionnaire(parseQuestionnaire(await res.text())); 327 + } catch (err) { 328 + showError( 329 + `Could not load test.json (${err.message}). 
Use a local server: python -m http.server 8765` 330 + ); 331 + } 332 + } 333 + 334 + async function loadFile(file) { 335 + try { 336 + const text = await file.text(); 337 + loadQuestionnaire(parseQuestionnaire(text)); 338 + } catch (err) { 339 + showError(err.message); 340 + } 341 + } 342 + 343 + document.querySelectorAll(".tab").forEach((btn) => { 344 + btn.addEventListener("click", () => setTab(btn.dataset.tab)); 345 + }); 346 + 347 + els.fileInput.addEventListener("change", () => { 348 + const file = els.fileInput.files?.[0]; 349 + if (file) loadFile(file); 350 + els.fileInput.value = ""; 351 + }); 352 + 353 + document.getElementById("load-sample").addEventListener("click", loadSample); 354 + document.getElementById("paste-toggle").addEventListener("click", () => { 355 + els.pastePanel.classList.toggle("hidden"); 356 + }); 357 + document.getElementById("paste-cancel").addEventListener("click", () => { 358 + els.pastePanel.classList.add("hidden"); 359 + }); 360 + document.getElementById("paste-apply").addEventListener("click", () => { 361 + try { 362 + loadQuestionnaire(parseQuestionnaire(els.pasteArea.value)); 363 + els.pastePanel.classList.add("hidden"); 364 + } catch (err) { 365 + showError(err.message); 366 + } 367 + }); 368 + 369 + function resetWalk() { 370 + walker.reset(); 371 + focusedQuestionId = walker.current()?.id ?? null; 372 + renderAll(); 373 + } 374 + 375 + function backWalk() { 376 + walker.back(); 377 + focusedQuestionId = walker.current()?.id ?? null; 378 + renderAll(); 379 + } 380 + 381 + els.walkReset.addEventListener("click", resetWalk); 382 + els.walkResetSolo.addEventListener("click", resetWalk); 383 + els.walkBack.addEventListener("click", backWalk); 384 + els.walkBackSolo.addEventListener("click", backWalk); 385 + 386 + loadSample(); 387 + </script> 388 + </body> 389 + </html>
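The inline script above drives the walk view through a `walker` object with `current()`, `pick(index)`, `back()`, `reset()`, and an `answers` array. The sketch below is one possible way such an object could behave, not the code shipped in `viewer.js`: `createWalker` is a hypothetical name, introduction handling is left out, and the rule that an option's follow-ups are asked before the remaining sibling questions is an assumption.

```javascript
// Hypothetical walker sketch; the real implementation lives in viewer.js and may differ.
function createWalker(items) {
  // Each answer records the question id, the chosen option index, and its label.
  const answers = [];

  // Rebuild the queue of pending questions by replaying every answer in order.
  // Assumption: follow-ups of the chosen option are asked before the remaining siblings.
  function pending() {
    let queue = [...items];
    for (const answer of answers) {
      const [question, ...rest] = queue;
      const followups = question.options[answer.optionIndex].followups ?? [];
      queue = [...followups, ...rest];
    }
    return queue;
  }

  return {
    answers,
    // The question to show next, or null once every question is answered.
    current() {
      return pending()[0] ?? null;
    },
    // Record the chosen option for the current question.
    pick(optionIndex) {
      const question = pending()[0];
      if (!question) return;
      const option = question.options[optionIndex];
      if (!option) {
        throw new RangeError(`No option ${optionIndex} on "${question.id}"`);
      }
      answers.push({ questionId: question.id, optionIndex, label: option.label });
    },
    // Undo the most recent answer.
    back() {
      answers.pop();
    },
    // Clear all answers and start over.
    reset() {
      answers.length = 0;
    },
  };
}
```

Deriving the queue from the answer list on every call keeps `back()` trivial: removing the last answer is enough to restore the previous question, at the cost of a replay that is cheap for questionnaires of this size.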
+497
agent/questionnaires/styles.css
··· 1 + :root { 2 + color-scheme: dark; 3 + --bg: #0f1117; 4 + --surface: #171b26; 5 + --surface-2: #1e2433; 6 + --border: #2a3144; 7 + --text: #e8ecf4; 8 + --muted: #8b95ad; 9 + --accent: #6ea8fe; 10 + --accent-dim: #3d5f99; 11 + --question: #7ee787; 12 + --option: #f2cc60; 13 + --branch: #d2a8ff; 14 + --active: #ffa657; 15 + --danger: #ff7b72; 16 + --radius: 10px; 17 + --font: "IBM Plex Sans", system-ui, sans-serif; 18 + --mono: "IBM Plex Mono", ui-monospace, monospace; 19 + } 20 + 21 + * { 22 + box-sizing: border-box; 23 + } 24 + 25 + html, 26 + body { 27 + margin: 0; 28 + min-height: 100%; 29 + background: var(--bg); 30 + color: var(--text); 31 + font-family: var(--font); 32 + line-height: 1.5; 33 + } 34 + 35 + body { 36 + padding: 1.25rem; 37 + } 38 + 39 + a { 40 + color: var(--accent); 41 + } 42 + 43 + header { 44 + display: flex; 45 + flex-wrap: wrap; 46 + gap: 1rem; 47 + align-items: flex-end; 48 + justify-content: space-between; 49 + margin-bottom: 1.25rem; 50 + } 51 + 52 + h1 { 53 + margin: 0; 54 + font-size: 1.35rem; 55 + font-weight: 600; 56 + } 57 + 58 + .subtitle { 59 + margin: 0.25rem 0 0; 60 + color: var(--muted); 61 + font-size: 0.9rem; 62 + } 63 + 64 + .toolbar { 65 + display: flex; 66 + flex-wrap: wrap; 67 + gap: 0.5rem; 68 + align-items: center; 69 + } 70 + 71 + button, 72 + .file-btn { 73 + appearance: none; 74 + border: 1px solid var(--border); 75 + background: var(--surface); 76 + color: var(--text); 77 + border-radius: 8px; 78 + padding: 0.45rem 0.75rem; 79 + font: inherit; 80 + cursor: pointer; 81 + } 82 + 83 + button:hover, 84 + .file-btn:hover { 85 + border-color: var(--accent-dim); 86 + background: var(--surface-2); 87 + } 88 + 89 + button.primary { 90 + background: var(--accent-dim); 91 + border-color: var(--accent); 92 + } 93 + 94 + button.primary:hover { 95 + background: #4a74b8; 96 + } 97 + 98 + button:disabled { 99 + opacity: 0.45; 100 + cursor: not-allowed; 101 + } 102 + 103 + .file-btn input { 104 + display: none; 105 + } 106 + 
107 + .tabs { 108 + display: flex; 109 + gap: 0.35rem; 110 + margin-bottom: 1rem; 111 + } 112 + 113 + .tab { 114 + border-radius: 999px; 115 + } 116 + 117 + .tab.is-active { 118 + background: var(--accent-dim); 119 + border-color: var(--accent); 120 + } 121 + 122 + .layout { 123 + display: grid; 124 + grid-template-columns: minmax(0, 1.2fr) minmax(320px, 0.8fr); 125 + gap: 1rem; 126 + align-items: start; 127 + } 128 + 129 + @media (max-width: 960px) { 130 + .layout { 131 + grid-template-columns: 1fr; 132 + } 133 + } 134 + 135 + .panel { 136 + background: var(--surface); 137 + border: 1px solid var(--border); 138 + border-radius: var(--radius); 139 + overflow: hidden; 140 + } 141 + 142 + .panel-head { 143 + padding: 0.75rem 1rem; 144 + border-bottom: 1px solid var(--border); 145 + display: flex; 146 + justify-content: space-between; 147 + gap: 0.75rem; 148 + align-items: center; 149 + } 150 + 151 + .panel-head h2 { 152 + margin: 0; 153 + font-size: 0.95rem; 154 + font-weight: 600; 155 + } 156 + 157 + .panel-body { 158 + padding: 1rem; 159 + max-height: calc(100vh - 220px); 160 + overflow: auto; 161 + } 162 + 163 + .meta { 164 + display: flex; 165 + flex-wrap: wrap; 166 + gap: 0.5rem 1rem; 167 + margin-bottom: 1rem; 168 + font-size: 0.82rem; 169 + color: var(--muted); 170 + } 171 + 172 + .intro-panel { 173 + background: var(--surface); 174 + border: 1px solid var(--border); 175 + border-radius: var(--radius); 176 + padding: 1rem 1.15rem; 177 + margin-bottom: 1rem; 178 + } 179 + 180 + .intro-panel h2 { 181 + margin: 0 0 0.75rem; 182 + font-size: 1rem; 183 + } 184 + 185 + .intro-block + .intro-block { 186 + margin-top: 0.85rem; 187 + padding-top: 0.85rem; 188 + border-top: 1px solid var(--border); 189 + } 190 + 191 + .intro-block h3 { 192 + margin: 0 0 0.35rem; 193 + font-size: 0.78rem; 194 + font-family: var(--mono); 195 + text-transform: uppercase; 196 + letter-spacing: 0.05em; 197 + color: var(--accent); 198 + } 199 + 200 + .intro-block p { 201 + margin: 0; 202 + 
font-size: 0.92rem; 203 + color: var(--text); 204 + } 205 + 206 + .q-meta-label { 207 + display: block; 208 + font-family: var(--mono); 209 + font-size: 0.68rem; 210 + text-transform: uppercase; 211 + letter-spacing: 0.05em; 212 + margin-bottom: 0.25rem; 213 + } 214 + 215 + .q-context, 216 + .q-explanation { 217 + margin: 0.65rem 0 0.85rem 0; 218 + padding: 0.65rem 0.75rem; 219 + border-radius: 8px; 220 + font-size: 0.88rem; 221 + } 222 + 223 + .q-context { 224 + background: color-mix(in srgb, var(--accent) 10%, var(--surface-2)); 225 + border-left: 3px solid var(--accent); 226 + } 227 + 228 + .q-context p, 229 + .q-explanation p { 230 + margin: 0; 231 + } 232 + 233 + .q-explanation { 234 + background: color-mix(in srgb, var(--question) 8%, var(--surface-2)); 235 + border-left: 3px solid var(--question); 236 + } 237 + 238 + .tree-meta { 239 + margin: 0 0 0.75rem 1.85rem; 240 + } 241 + 242 + .intro-walk .intro-panel { 243 + background: transparent; 244 + border: none; 245 + padding: 0; 246 + margin: 0; 247 + } 248 + 249 + .meta strong { 250 + color: var(--text); 251 + font-family: var(--mono); 252 + font-weight: 500; 253 + } 254 + 255 + .chips { 256 + display: flex; 257 + flex-wrap: wrap; 258 + gap: 0.35rem; 259 + } 260 + 261 + .chip { 262 + font-family: var(--mono); 263 + font-size: 0.75rem; 264 + padding: 0.15rem 0.45rem; 265 + border-radius: 999px; 266 + background: var(--surface-2); 267 + border: 1px solid var(--border); 268 + } 269 + 270 + /* Tree view */ 271 + .tree-level { 272 + display: grid; 273 + gap: 1rem; 274 + } 275 + 276 + .tree-question { 277 + position: relative; 278 + padding-left: calc(var(--depth, 1) * 0.35rem); 279 + } 280 + 281 + .tree-question-head { 282 + display: inline-flex; 283 + align-items: center; 284 + gap: 0.45rem; 285 + width: 100%; 286 + text-align: left; 287 + margin-bottom: 0.35rem; 288 + } 289 + 290 + .tree-question-head.is-active { 291 + outline: 2px solid var(--active); 292 + outline-offset: 2px; 293 + } 294 + 295 + .tree-badge 
{ 296 + display: inline-grid; 297 + place-items: center; 298 + min-width: 1.4rem; 299 + height: 1.4rem; 300 + border-radius: 4px; 301 + background: color-mix(in srgb, var(--question) 18%, transparent); 302 + color: var(--question); 303 + font-family: var(--mono); 304 + font-size: 0.7rem; 305 + font-weight: 600; 306 + } 307 + 308 + .tree-badge.opt { 309 + background: color-mix(in srgb, var(--option) 18%, transparent); 310 + color: var(--option); 311 + } 312 + 313 + .tree-id, 314 + .tree-value { 315 + font-family: var(--mono); 316 + font-size: 0.82rem; 317 + } 318 + 319 + .tree-value { 320 + color: var(--option); 321 + } 322 + 323 + .tree-prompt, 324 + .tree-option-label { 325 + margin: 0 0 0.6rem 1.85rem; 326 + font-size: 0.9rem; 327 + } 328 + 329 + .tree-options { 330 + margin-left: 0.75rem; 331 + padding-left: 0.85rem; 332 + border-left: 2px solid var(--border); 333 + display: grid; 334 + gap: 0.75rem; 335 + } 336 + 337 + .tree-option-head { 338 + display: flex; 339 + align-items: center; 340 + gap: 0.45rem; 341 + } 342 + 343 + .tree-branch { 344 + margin-top: 0.5rem; 345 + margin-left: 0.5rem; 346 + padding: 0.75rem 0 0.25rem 0.75rem; 347 + border-left: 2px dashed color-mix(in srgb, var(--branch) 55%, var(--border)); 348 + border-radius: 0 0 0 6px; 349 + } 350 + 351 + .tree-branch::before { 352 + content: "followups"; 353 + display: block; 354 + font-family: var(--mono); 355 + font-size: 0.68rem; 356 + letter-spacing: 0.04em; 357 + text-transform: uppercase; 358 + color: var(--branch); 359 + margin-bottom: 0.5rem; 360 + } 361 + 362 + /* Walk mode */ 363 + .walk-card { 364 + background: var(--surface-2); 365 + border: 1px solid var(--border); 366 + border-radius: var(--radius); 367 + padding: 1rem; 368 + } 369 + 370 + .walk-step { 371 + font-family: var(--mono); 372 + font-size: 0.75rem; 373 + color: var(--muted); 374 + margin-bottom: 0.35rem; 375 + } 376 + 377 + .walk-id { 378 + font-family: var(--mono); 379 + font-size: 0.82rem; 380 + color: var(--question); 381 
+ margin-bottom: 0.5rem; 382 + } 383 + 384 + .walk-prompt { 385 + margin: 0 0 1rem; 386 + font-size: 1rem; 387 + } 388 + 389 + .walk-options { 390 + display: grid; 391 + gap: 0.5rem; 392 + } 393 + 394 + .walk-option { 395 + text-align: left; 396 + padding: 0.7rem 0.85rem; 397 + } 398 + 399 + .walk-option small { 400 + display: block; 401 + margin-top: 0.25rem; 402 + font-family: var(--mono); 403 + color: var(--muted); 404 + font-size: 0.72rem; 405 + } 406 + 407 + .walk-option.has-followups { 408 + border-color: color-mix(in srgb, var(--branch) 50%, var(--border)); 409 + } 410 + 411 + .walk-done { 412 + color: var(--question); 413 + font-weight: 600; 414 + } 415 + 416 + .breadcrumb { 417 + display: flex; 418 + flex-wrap: wrap; 419 + gap: 0.35rem; 420 + margin-bottom: 0.75rem; 421 + } 422 + 423 + .crumb { 424 + font-family: var(--mono); 425 + font-size: 0.72rem; 426 + padding: 0.2rem 0.45rem; 427 + border-radius: 6px; 428 + background: var(--bg); 429 + border: 1px solid var(--border); 430 + color: var(--muted); 431 + } 432 + 433 + .crumb.is-current { 434 + color: var(--active); 435 + border-color: color-mix(in srgb, var(--active) 50%, var(--border)); 436 + } 437 + 438 + .answer-log { 439 + margin-top: 1rem; 440 + padding-top: 1rem; 441 + border-top: 1px solid var(--border); 442 + } 443 + 444 + .answer-log h3 { 445 + margin: 0 0 0.5rem; 446 + font-size: 0.85rem; 447 + } 448 + 449 + .answer-row { 450 + display: grid; 451 + grid-template-columns: auto 1fr; 452 + gap: 0.35rem 0.75rem; 453 + font-size: 0.82rem; 454 + margin-bottom: 0.35rem; 455 + } 456 + 457 + .answer-row dt { 458 + font-family: var(--mono); 459 + color: var(--question); 460 + } 461 + 462 + .answer-row dd { 463 + margin: 0; 464 + color: var(--muted); 465 + } 466 + 467 + .json-preview { 468 + margin-top: 1rem; 469 + padding: 0.75rem; 470 + background: var(--bg); 471 + border: 1px solid var(--border); 472 + border-radius: 8px; 473 + font-family: var(--mono); 474 + font-size: 0.72rem; 475 + white-space: 
pre-wrap; 476 + word-break: break-word; 477 + max-height: 220px; 478 + overflow: auto; 479 + } 480 + 481 + .empty { 482 + color: var(--muted); 483 + font-size: 0.9rem; 484 + } 485 + 486 + .error { 487 + color: var(--danger); 488 + background: color-mix(in srgb, var(--danger) 12%, transparent); 489 + border: 1px solid color-mix(in srgb, var(--danger) 35%, var(--border)); 490 + border-radius: 8px; 491 + padding: 0.75rem 1rem; 492 + margin-bottom: 1rem; 493 + } 494 + 495 + .hidden { 496 + display: none !important; 497 + }
+228
agent/questionnaires/test.json
··· 1 + { 2 + "issue": "at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22", 3 + "version": 2, 4 + "introduction": { 5 + "project": "AtomicXR is a Nushell-based CLI (`axr`) for managing XR hardware on Linux — SteamVR lighthouse tracking, calibration, and device tooling. Commands live in `.nu` modules served from the repo knot. The README notes the project is deprecated in favor of Homebrew-XR and Envision-OCI, but the codebase is still a useful reference for CLI patterns.", 6 + "issue": "The issue requests a new `axr lh pair` command so users can pair SteamVR Lighthouse devices without manually running `lighthouse_console`. Today lighthouse workflows live under `steamvr-lh.nu` (`axr steamvr-lh calibrate`, `axr steamvr-lh console`), while the issue asks for the shorter `lh` namespace — an intentional mismatch we must resolve first.", 7 + "approach": "This questionnaire walks from namespace and backend choice through UX, error handling, and shared concerns (tests, docs). Each question's context builds on prior answers; branch follow-ups only appear when your chosen path needs extra detail. Together the answers define a concrete PR plan contributors can reach consensus on." 8 + }, 9 + "items": [ 10 + { 11 + "id": "command_namespace", 12 + "prompt": "The issue requests `axr lh pair`, but the existing lighthouse module is `steamvr-lh.nu` (commands are `axr steamvr-lh calibrate`, `axr steamvr-lh console`). How should the new pair command be namespaced?", 13 + "context": "We start here because the issue title says `axr lh pair` but the repo still exposes `axr steamvr-lh …`. Every later decision assumes a command path.", 14 + "explanation": "The module file `steamvr-lh.nu` registers subcommands via Nushell's module system. Renaming affects import paths, help text, and user muscle memory. 
Adding `pair` only to the old module is least disruptive; renaming to `lh` matches the issue verbatim.", 15 + "options": [ 16 + { 17 + "label": "Rename module to `lh.nu` so commands become `axr lh pair`, `axr lh calibrate`, etc.", 18 + "followups": [ 19 + { 20 + "id": "rename_deprecation", 21 + "prompt": "Should the old `steamvr-lh` name be preserved as a deprecated alias?", 22 + "context": "Because you chose to rename the module to `lh.nu`, we need to decide whether old scripts using `steamvr-lh` keep working.", 23 + "explanation": "A thin alias module costs little and prevents breaking existing docs/scripts. Skipping the alias is simpler but contradicts semver expectations if anyone still depends on the old name.", 24 + "options": [ 25 + { 26 + "label": "Yes, keep a thin `steamvr-lh.nu` wrapper that re-exports `lh.nu` with a deprecation warning" 27 + }, 28 + { 29 + "label": "No, just rename — the project is marked as no longer maintained anyway" 30 + } 31 + ] 32 + } 33 + ] 34 + }, 35 + { 36 + "label": "Add `pair` to the existing `steamvr-lh.nu` module (command becomes `axr steamvr-lh pair`)" 37 + }, 38 + { 39 + "label": "Create a new separate `lh.nu` module for pairing only, keep `steamvr-lh.nu` for calibrate/console" 40 + } 41 + ] 42 + }, 43 + { 44 + "id": "backend_tool", 45 + "prompt": "Which backend tool should the pair command use?", 46 + "context": "With the command namespace settled, we pick which external tool wraps the actual pairing protocol.", 47 + "explanation": "`lighthouse_console` ships with SteamVR today but is slow and script-hostile. `lhctl` is the intended successor but may not be available on all systems yet. 
Dual mode adds complexity but future-proofs.", 48 + "options": [ 49 + { 50 + "label": "Use `lighthouse_console` now (available today via SteamVR, slower but works)", 51 + "followups": [ 52 + { 53 + "id": "lhctl_future_proofing", 54 + "prompt": "Should the implementation be structured to make switching to `lhctl` easier later?", 55 + "context": "Because you chose `lighthouse_console` today, we decide whether to structure code for a future `lhctl` swap.", 56 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 57 + "options": [ 58 + { 59 + "label": "Yes, abstract the pairing logic behind a helper function so the backend can be swapped" 60 + }, 61 + { 62 + "label": "No, just call lighthouse_console directly — refactor when lhctl is available" 63 + } 64 + ] 65 + } 66 + ] 67 + }, 68 + { 69 + "label": "Wait for `lhctl` to be publicly released and implement with that" 70 + }, 71 + { 72 + "label": "Support both: detect if `lhctl` is available and prefer it, fall back to `lighthouse_console`", 73 + "followups": [ 74 + { 75 + "id": "dual_backend_flag", 76 + "prompt": "Should the user be able to force a specific backend?", 77 + "context": "Because you chose dual-backend auto-detection, we clarify whether advanced users can override the choice.", 78 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 79 + "options": [ 80 + { 81 + "label": "Yes, add a `--backend` flag (e.g. 
`axr lh pair --backend lhctl`)" 82 + }, 83 + { 84 + "label": "No, auto-detect only — simpler UX" 85 + } 86 + ] 87 + } 88 + ] 89 + } 90 + ] 91 + }, 92 + { 93 + "id": "pairing_workflow", 94 + "prompt": "How should the pairing workflow work from the user's perspective?", 95 + "context": "Regardless of backend, users experience pairing differently — interactive scan vs args vs wizard.", 96 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 97 + "options": [ 98 + { 99 + "label": "Fully interactive: scan for devices, present a list, user selects which to pair", 100 + "followups": [ 101 + { 102 + "id": "interactive_multi_select", 103 + "prompt": "Should the user be able to pair multiple devices in one session?", 104 + "context": "Because you chose a fully interactive scan flow, we decide single vs multi device per session.", 105 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. 
Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 106 + "options": [ 107 + { 108 + "label": "Yes, allow multi-select from discovered devices" 109 + }, 110 + { 111 + "label": "No, pair one device at a time (simpler, matches lighthouse_console behavior)" 112 + } 113 + ] 114 + } 115 + ] 116 + }, 117 + { 118 + "label": "Semi-interactive: user provides device serial/ID as argument, command handles the rest" 119 + }, 120 + { 121 + "label": "Guided wizard: step-by-step prompts (put device in pairing mode, confirm, etc.)" 122 + } 123 + ] 124 + }, 125 + { 126 + "id": "device_types", 127 + "prompt": "Which lighthouse-tracked device types should the pair command support?", 128 + "context": "Now we narrow which hardware categories the first implementation supports.", 129 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 130 + "options": [ 131 + { 132 + "label": "All lighthouse devices (base stations, controllers, trackers, HMDs)" 133 + }, 134 + { 135 + "label": "Controllers and trackers only (most common pairing need)" 136 + }, 137 + { 138 + "label": "Start with controllers only, add other device types in follow-up PRs" 139 + } 140 + ] 141 + }, 142 + { 143 + "id": "lh_console_discovery", 144 + "prompt": "The existing `lh-console` helper in `steamvr-lh.nu` checks multiple paths (PATH, Flatpak Steam, native Steam). Should the pair command reuse this helper?", 145 + "context": "The existing `steamvr-lh.nu` already locates `lighthouse_console` across Steam installs — reuse affects maintainability.", 146 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. 
Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 147 + "options": [ 148 + { 149 + "label": "Yes, reuse the existing `lh-console` helper as-is" 150 + }, 151 + { 152 + "label": "Reuse but refactor `lh-console` to also support piping input/capturing output (needed for scripted pairing)" 153 + }, 154 + { 155 + "label": "Write a new helper specifically for pairing that handles the async job pattern lighthouse_console needs" 156 + } 157 + ] 158 + }, 159 + { 160 + "id": "error_handling", 161 + "prompt": "How should the command handle common failure cases (no Bluetooth, SteamVR not installed, device not found)?", 162 + "context": "These concerns apply no matter which namespace/backend/workflow you picked above.", 163 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 164 + "options": [ 165 + { 166 + "label": "Pre-flight checks: verify prerequisites before attempting pairing, with actionable error messages" 167 + }, 168 + { 169 + "label": "Attempt pairing and surface errors from lighthouse_console/lhctl with minimal wrapping" 170 + }, 171 + { 172 + "label": "Pre-flight checks plus a `--force` flag to skip them for advanced users" 173 + } 174 + ] 175 + }, 176 + { 177 + "id": "timeout_handling", 178 + "prompt": "Pairing can take a while (especially with lighthouse_console). How should timeouts be handled?", 179 + "context": "Pairing duration varies by backend; this shapes UX for all paths.", 180 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. 
Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 181 + "options": [ 182 + { 183 + "label": "Default timeout with a `--timeout` flag to override" 184 + }, 185 + { 186 + "label": "No timeout — wait indefinitely until pairing succeeds or user cancels (Ctrl+C)" 187 + }, 188 + { 189 + "label": "Progress indicator with a generous default timeout (e.g. 60s) and clear messaging" 190 + } 191 + ] 192 + }, 193 + { 194 + "id": "testing_strategy", 195 + "prompt": "How should this feature be tested? (Hardware-dependent features are hard to unit test)", 196 + "context": "Hardware pairing is hard to automate — we still need a team agreement on test scope.", 197 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 198 + "options": [ 199 + { 200 + "label": "Manual testing only — document test procedure in PR description" 201 + }, 202 + { 203 + "label": "Add basic tests for argument parsing and pre-flight checks (mock external commands)" 204 + }, 205 + { 206 + "label": "No tests needed — the project is marked as no longer maintained" 207 + } 208 + ] 209 + }, 210 + { 211 + "id": "documentation", 212 + "prompt": "What documentation should accompany this feature?", 213 + "context": "Final shared question: what docs ship with the command given the project deprecation notice.", 214 + "explanation": "This decision affects how the lighthouse pair command is implemented in the Nushell CLI. 
Contributors may disagree because the repo already has patterns that could conflict with the issue wording.", 215 + "options": [ 216 + { 217 + "label": "Inline help text in the Nushell command (consistent with existing commands like `calibrate`)" 218 + }, 219 + { 220 + "label": "Inline help text plus a section in the README" 221 + }, 222 + { 223 + "label": "Inline help text only — README already says the project is no longer maintained" 224 + } 225 + ] 226 + } 227 + ] 228 + }
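The viewer header shows chips for `totalQuestions`, `totalOptions`, `branchCount`, and `maxDepth`. A recursive pass over the `items` tree in the fixture above could produce those numbers. The sketch below is an illustration under assumptions: `questionnaireStats` is a hypothetical name, a "branch" is taken to mean an option with at least one follow-up, and top-level questions are counted as depth 1 (matching the `renderTree(questionnaire.items, 1, …)` call).

```javascript
// Hypothetical stats helper; the definitions of "branch" and "depth" are assumptions.
function questionnaireStats(items) {
  const stats = { totalQuestions: 0, totalOptions: 0, branchCount: 0, maxDepth: 0 };

  // Visit every question at the given nesting depth, descending into follow-ups.
  function visit(questions, depth) {
    for (const question of questions) {
      stats.totalQuestions += 1;
      stats.maxDepth = Math.max(stats.maxDepth, depth);
      for (const option of question.options ?? []) {
        stats.totalOptions += 1;
        const followups = option.followups ?? [];
        if (followups.length > 0) {
          stats.branchCount += 1; // this option opens a follow-up branch
          visit(followups, depth + 1);
        }
      }
    }
  }

  visit(items, 1);
  return stats;
}
```

Called as `questionnaireStats(questionnaire.items)`, it returns a plain object whose fields can be interpolated into the chip markup.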
+234
agent/questionnaires/test.txt
··· 1 + ```json 2 + { 3 + "issue": "at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22", 4 + "version": 1, 5 + "items": [ 6 + { 7 + "id": "command_namespace", 8 + "prompt": "The issue requests `axr lh pair`, but the existing lighthouse module is `steamvr-lh.nu` (commands are `axr steamvr-lh calibrate`, `axr steamvr-lh console`). How should the new pair command be namespaced?", 9 + "options": [ 10 + { 11 + "label": "Rename module to `lh.nu` so commands become `axr lh pair`, `axr lh calibrate`, etc.", 12 + "value": "rename_to_lh", 13 + "followups": [ 14 + { 15 + "id": "rename_deprecation", 16 + "prompt": "Should the old `steamvr-lh` name be preserved as a deprecated alias?", 17 + "options": [ 18 + { 19 + "label": "Yes, keep a thin `steamvr-lh.nu` wrapper that re-exports `lh.nu` with a deprecation warning", 20 + "value": "keep_alias" 21 + }, 22 + { 23 + "label": "No, just rename — the project is marked as no longer maintained anyway", 24 + "value": "no_alias" 25 + } 26 + ] 27 + } 28 + ] 29 + }, 30 + { 31 + "label": "Add `pair` to the existing `steamvr-lh.nu` module (command becomes `axr steamvr-lh pair`)", 32 + "value": "keep_steamvr_lh" 33 + }, 34 + { 35 + "label": "Create a new separate `lh.nu` module for pairing only, keep `steamvr-lh.nu` for calibrate/console", 36 + "value": "separate_module" 37 + } 38 + ] 39 + }, 40 + { 41 + "id": "backend_tool", 42 + "prompt": "Which backend tool should the pair command use?", 43 + "options": [ 44 + { 45 + "label": "Use `lighthouse_console` now (available today via SteamVR, slower but works)", 46 + "value": "lighthouse_console", 47 + "followups": [ 48 + { 49 + "id": "lhctl_future_proofing", 50 + "prompt": "Should the implementation be structured to make switching to `lhctl` easier later?", 51 + "options": [ 52 + { 53 + "label": "Yes, abstract the pairing logic behind a helper function so the backend can be swapped", 54 + "value": "abstract_backend" 55 + }, 56 + { 57 + "label": "No, just call 
lighthouse_console directly — refactor when lhctl is available", 58 + "value": "direct_call" 59 + } 60 + ] 61 + } 62 + ] 63 + }, 64 + { 65 + "label": "Wait for `lhctl` to be publicly released and implement with that", 66 + "value": "wait_for_lhctl" 67 + }, 68 + { 69 + "label": "Support both: detect if `lhctl` is available and prefer it, fall back to `lighthouse_console`", 70 + "value": "dual_backend", 71 + "followups": [ 72 + { 73 + "id": "dual_backend_flag", 74 + "prompt": "Should the user be able to force a specific backend?", 75 + "options": [ 76 + { 77 + "label": "Yes, add a `--backend` flag (e.g. `axr lh pair --backend lhctl`)", 78 + "value": "backend_flag" 79 + }, 80 + { 81 + "label": "No, auto-detect only — simpler UX", 82 + "value": "auto_detect_only" 83 + } 84 + ] 85 + } 86 + ] 87 + } 88 + ] 89 + }, 90 + { 91 + "id": "pairing_workflow", 92 + "prompt": "How should the pairing workflow work from the user's perspective?", 93 + "options": [ 94 + { 95 + "label": "Fully interactive: scan for devices, present a list, user selects which to pair", 96 + "value": "interactive", 97 + "followups": [ 98 + { 99 + "id": "interactive_multi_select", 100 + "prompt": "Should the user be able to pair multiple devices in one session?", 101 + "options": [ 102 + { 103 + "label": "Yes, allow multi-select from discovered devices", 104 + "value": "multi_select" 105 + }, 106 + { 107 + "label": "No, pair one device at a time (simpler, matches lighthouse_console behavior)", 108 + "value": "single_select" 109 + } 110 + ] 111 + } 112 + ] 113 + }, 114 + { 115 + "label": "Semi-interactive: user provides device serial/ID as argument, command handles the rest", 116 + "value": "semi_interactive" 117 + }, 118 + { 119 + "label": "Guided wizard: step-by-step prompts (put device in pairing mode, confirm, etc.)", 120 + "value": "guided_wizard" 121 + } 122 + ] 123 + }, 124 + { 125 + "id": "device_types", 126 + "prompt": "Which lighthouse-tracked device types should the pair command support?", 127 + 
"options": [ 128 + { 129 + "label": "All lighthouse devices (base stations, controllers, trackers, HMDs)", 130 + "value": "all_devices" 131 + }, 132 + { 133 + "label": "Controllers and trackers only (most common pairing need)", 134 + "value": "controllers_trackers" 135 + }, 136 + { 137 + "label": "Start with controllers only, add other device types in follow-up PRs", 138 + "value": "controllers_first" 139 + } 140 + ] 141 + }, 142 + { 143 + "id": "lh_console_discovery", 144 + "prompt": "The existing `lh-console` helper in `steamvr-lh.nu` checks multiple paths (PATH, Flatpak Steam, native Steam). Should the pair command reuse this helper?", 145 + "options": [ 146 + { 147 + "label": "Yes, reuse the existing `lh-console` helper as-is", 148 + "value": "reuse_helper" 149 + }, 150 + { 151 + "label": "Reuse but refactor `lh-console` to also support piping input/capturing output (needed for scripted pairing)", 152 + "value": "refactor_helper" 153 + }, 154 + { 155 + "label": "Write a new helper specifically for pairing that handles the async job pattern lighthouse_console needs", 156 + "value": "new_helper" 157 + } 158 + ] 159 + }, 160 + { 161 + "id": "error_handling", 162 + "prompt": "How should the command handle common failure cases (no Bluetooth, SteamVR not installed, device not found)?", 163 + "options": [ 164 + { 165 + "label": "Pre-flight checks: verify prerequisites before attempting pairing, with actionable error messages", 166 + "value": "preflight_checks" 167 + }, 168 + { 169 + "label": "Attempt pairing and surface errors from lighthouse_console/lhctl with minimal wrapping", 170 + "value": "passthrough_errors" 171 + }, 172 + { 173 + "label": "Pre-flight checks plus a `--force` flag to skip them for advanced users", 174 + "value": "preflight_with_force" 175 + } 176 + ] 177 + }, 178 + { 179 + "id": "timeout_handling", 180 + "prompt": "Pairing can take a while (especially with lighthouse_console). 
How should timeouts be handled?", 181 + "options": [ 182 + { 183 + "label": "Default timeout with a `--timeout` flag to override", 184 + "value": "configurable_timeout" 185 + }, 186 + { 187 + "label": "No timeout — wait indefinitely until pairing succeeds or user cancels (Ctrl+C)", 188 + "value": "no_timeout" 189 + }, 190 + { 191 + "label": "Progress indicator with a generous default timeout (e.g. 60s) and clear messaging", 192 + "value": "progress_with_timeout" 193 + } 194 + ] 195 + }, 196 + { 197 + "id": "testing_strategy", 198 + "prompt": "How should this feature be tested? (Hardware-dependent features are hard to unit test)", 199 + "options": [ 200 + { 201 + "label": "Manual testing only — document test procedure in PR description", 202 + "value": "manual_only" 203 + }, 204 + { 205 + "label": "Add basic tests for argument parsing and pre-flight checks (mock external commands)", 206 + "value": "basic_tests" 207 + }, 208 + { 209 + "label": "No tests needed — the project is marked as no longer maintained", 210 + "value": "no_tests" 211 + } 212 + ] 213 + }, 214 + { 215 + "id": "documentation", 216 + "prompt": "What documentation should accompany this feature?", 217 + "options": [ 218 + { 219 + "label": "Inline help text in the Nushell command (consistent with existing commands like `calibrate`)", 220 + "value": "inline_help_only" 221 + }, 222 + { 223 + "label": "Inline help text plus a section in the README", 224 + "value": "inline_and_readme" 225 + }, 226 + { 227 + "label": "Inline help text only — README already says the project is no longer maintained", 228 + "value": "inline_no_readme" 229 + } 230 + ] 231 + } 232 + ] 233 + } 234 + ```
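Unlike `test.json`, the `test.txt` fixture wraps its JSON in a Markdown code fence, which is how model output often arrives. Whether `parseQuestionnaire` already removes such a fence is not visible in this part of the diff, so the following is only a sketch of one way to do it. `stripJsonFence` and `parseFencedJson` are hypothetical names, and the fence marker is built with `repeat` so the example does not contain a literal fence of its own.

```javascript
// Hypothetical helpers; not taken from viewer.js.
const FENCE = "`".repeat(3); // the three-backtick Markdown fence marker

// Return the text between an opening fence line and a closing fence,
// or the trimmed input unchanged when it is not fenced.
function stripJsonFence(text) {
  const trimmed = text.trim();
  const fenced =
    trimmed.length >= FENCE.length * 2 &&
    trimmed.startsWith(FENCE) &&
    trimmed.endsWith(FENCE);
  if (!fenced) return trimmed;

  // Skip the whole opening line, which may carry a language tag such as "json".
  const firstNewline = trimmed.indexOf("\n");
  if (firstNewline === -1) return trimmed;

  return trimmed.slice(firstNewline + 1, trimmed.length - FENCE.length).trim();
}

// Parse JSON that may or may not be wrapped in a fence.
function parseFencedJson(text) {
  return JSON.parse(stripJsonFence(text));
}
```

Because unfenced input passes through untouched, the same entry point can serve pasted text, uploaded `.json` files, and fenced `.txt` dumps.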
+345
agent/questionnaires/viewer.js
··· 1 + /** @typedef {{ label: string, followups?: Question[] }} Option */ 2 + /** @typedef {{ id: string, prompt: string, context: string, explanation: string, options: Option[] }} Question */ 3 + /** @typedef {{ project: string, issue: string, approach: string }} Introduction */ 4 + /** @typedef {{ issue: string, version: number, introduction?: Introduction, items: Question[] }} Questionnaire */ 5 + 6 + /** 7 + * @param {unknown} raw 8 + * @param {number} index 9 + * @returns {Option} 10 + */ 11 + export function normalizeOption(raw, index) { 12 + if (typeof raw === "string") { 13 + return { label: raw, followups: [] }; 14 + } 15 + if (!raw || typeof raw !== "object") { 16 + throw new Error(`Invalid option at index ${index}`); 17 + } 18 + const obj = /** @type {Record<string, unknown>} */ (raw); 19 + const label = 20 + typeof obj.label === "string" 21 + ? obj.label 22 + : typeof obj.text === "string" 23 + ? obj.text 24 + : null; 25 + if (!label) { 26 + throw new Error(`Option at index ${index} must have a label string`); 27 + } 28 + const followups = Array.isArray(obj.followups) 29 + ? obj.followups.map((q, i) => normalizeQuestion(q, i)) 30 + : []; 31 + return { label, followups }; 32 + } 33 + 34 + /** 35 + * @param {unknown} raw 36 + * @param {number} index 37 + * @returns {Question} 38 + */ 39 + export function normalizeQuestion(raw, index) { 40 + if (!raw || typeof raw !== "object") { 41 + throw new Error(`Invalid question at index ${index}`); 42 + } 43 + const q = /** @type {Record<string, unknown>} */ (raw); 44 + if (typeof q.id !== "string" || typeof q.prompt !== "string") { 45 + throw new Error(`Question at index ${index} needs id and prompt`); 46 + } 47 + if (!Array.isArray(q.options) || q.options.length < 2) { 48 + throw new Error(`Question "${q.id}" needs at least 2 options`); 49 + } 50 + return { 51 + id: q.id, 52 + prompt: q.prompt, 53 + context: typeof q.context === "string" ? q.context : "", 54 + explanation: typeof q.explanation === "string" ? 
q.explanation : "", 55 + options: q.options.map((opt, i) => normalizeOption(opt, i)), 56 + }; 57 + } 58 + 59 + /** 60 + * Parse questionnaire JSON, tolerating markdown code fences and v1 shape. 61 + * @param {string} raw 62 + * @returns {Questionnaire} 63 + */ 64 + export function parseQuestionnaire(raw) { 65 + let text = raw.trim(); 66 + const fence = text.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/); 67 + if (fence) text = fence[1].trim(); 68 + const data = JSON.parse(text); 69 + if (!data || !Array.isArray(data.items)) { 70 + throw new Error("Invalid questionnaire: expected { issue, version, items }"); 71 + } 72 + 73 + /** @type {Introduction | undefined} */ 74 + let introduction; 75 + if (data.introduction && typeof data.introduction === "object") { 76 + const intro = data.introduction; 77 + if ( 78 + typeof intro.project === "string" && 79 + typeof intro.issue === "string" && 80 + typeof intro.approach === "string" 81 + ) { 82 + introduction = { 83 + project: intro.project, 84 + issue: intro.issue, 85 + approach: intro.approach, 86 + }; 87 + } 88 + } 89 + 90 + return { 91 + issue: String(data.issue ?? ""), 92 + version: Number(data.version ?? 
2), 93 + introduction, 94 + items: data.items.map((q, i) => normalizeQuestion(q, i)), 95 + }; 96 + } 97 + 98 + /** @returns {{ totalQuestions: number, totalOptions: number, maxDepth: number, branchCount: number }} */ 99 + export function stats(q) { 100 + let totalQuestions = 0; 101 + let totalOptions = 0; 102 + let maxDepth = 0; 103 + let branchCount = 0; 104 + 105 + function walk(questions, depth) { 106 + maxDepth = Math.max(maxDepth, depth); 107 + for (const question of questions) { 108 + totalQuestions += 1; 109 + for (const opt of question.options) { 110 + totalOptions += 1; 111 + if (opt.followups?.length) { 112 + branchCount += 1; 113 + walk(opt.followups, depth + 1); 114 + } 115 + } 116 + } 117 + } 118 + 119 + walk(q.items, 1); 120 + return { totalQuestions, totalOptions, maxDepth, branchCount }; 121 + } 122 + 123 + export class QuestionnaireWalker { 124 + /** @param {Questionnaire} questionnaire */ 125 + constructor(questionnaire) { 126 + this.questionnaire = questionnaire; 127 + this.reset(); 128 + } 129 + 130 + reset() { 131 + /** @type {{ list: Question[], index: number }[]} */ 132 + this.stack = [{ list: this.questionnaire.items, index: 0 }]; 133 + /** @type {{ questionId: string, optionIndex: number, label: string }[]} */ 134 + this.answers = []; 135 + this.done = false; 136 + this.showIntroduction = Boolean(this.questionnaire.introduction); 137 + } 138 + 139 + dismissIntroduction() { 140 + this.showIntroduction = false; 141 + } 142 + 143 + /** @returns {Question | null} */ 144 + current() { 145 + if (this.showIntroduction) return null; 146 + while (this.stack.length) { 147 + const frame = this.stack[this.stack.length - 1]; 148 + if (frame.index >= frame.list.length) { 149 + this.stack.pop(); 150 + continue; 151 + } 152 + return frame.list[frame.index]; 153 + } 154 + this.done = true; 155 + return null; 156 + } 157 + 158 + /** @param {number} optionIndex */ 159 + pick(optionIndex) { 160 + const question = this.current(); 161 + if (!question) return; 
162 + const option = question.options[optionIndex]; 163 + if (!option) return; 164 + 165 + this.answers.push({ 166 + questionId: question.id, 167 + optionIndex, 168 + label: option.label, 169 + }); 170 + this.stack[this.stack.length - 1].index += 1; 171 + if (option.followups?.length) { 172 + this.stack.push({ list: option.followups, index: 0 }); 173 + } 174 + if (!this.current()) this.done = true; 175 + } 176 + 177 + back() { 178 + if (this.showIntroduction) return; 179 + if (!this.answers.length) { 180 + if (this.questionnaire.introduction) { 181 + this.showIntroduction = true; 182 + this.done = false; 183 + } 184 + return; 185 + } 186 + this.answers.pop(); 187 + this.done = false; 188 + this.stack = [{ list: this.questionnaire.items, index: 0 }]; 189 + for (const answer of this.answers) { 190 + const question = this.current(); 191 + if (!question) break; 192 + const option = question.options[answer.optionIndex]; 193 + if (!option) break; 194 + this.stack[this.stack.length - 1].index += 1; 195 + if (option.followups?.length) { 196 + this.stack.push({ list: option.followups, index: 0 }); 197 + } 198 + } 199 + } 200 + 201 + /** @returns {string[]} */ 202 + activePathIds() { 203 + return this.answers.map((a) => a.questionId); 204 + } 205 + 206 + /** @returns {object} */ 207 + toAnswerPayload(did = "did:plc:preview") { 208 + return { 209 + issue: this.questionnaire.issue, 210 + did, 211 + version: this.questionnaire.version, 212 + answers: this.answers.map(({ questionId, optionIndex }) => ({ 213 + questionId, 214 + optionIndex, 215 + })), 216 + }; 217 + } 218 + } 219 + 220 + /** 221 + * @param {Introduction} intro 222 + */ 223 + export function renderIntroduction(intro) { 224 + const el = document.createElement("section"); 225 + el.className = "intro-panel"; 226 + el.innerHTML = ` 227 + <h2>Context</h2> 228 + <div class="intro-block"> 229 + <h3>Project</h3> 230 + <p>${escapeHtml(intro.project)}</p> 231 + </div> 232 + <div class="intro-block"> 233 + <h3>Issue</h3> 234 
+ <p>${escapeHtml(intro.issue)}</p> 235 + </div> 236 + <div class="intro-block"> 237 + <h3>How this questionnaire fits</h3> 238 + <p>${escapeHtml(intro.approach)}</p> 239 + </div> 240 + `; 241 + return el; 242 + } 243 + 244 + /** 245 + * @param {Question} question 246 + */ 247 + export function renderQuestionMeta(question) { 248 + const parts = []; 249 + if (question.context) { 250 + parts.push( 251 + `<div class="q-context"><span class="q-meta-label">Context</span><p>${escapeHtml(question.context)}</p></div>` 252 + ); 253 + } 254 + if (question.explanation) { 255 + parts.push( 256 + `<div class="q-explanation"><span class="q-meta-label">Why this matters</span><p>${escapeHtml(question.explanation)}</p></div>` 257 + ); 258 + } 259 + return parts.join(""); 260 + } 261 + 262 + /** 263 + * @param {Question[]} questions 264 + * @param {number} depth 265 + * @param {(id: string) => boolean} isActive 266 + * @param {(id: string) => void} onSelectQuestion 267 + */ 268 + export function renderTree(questions, depth, isActive, onSelectQuestion) { 269 + const frag = document.createDocumentFragment(); 270 + 271 + for (const question of questions) { 272 + const qEl = document.createElement("div"); 273 + qEl.className = "tree-question"; 274 + qEl.style.setProperty("--depth", String(depth)); 275 + 276 + const head = document.createElement("button"); 277 + head.type = "button"; 278 + head.className = 279 + "tree-question-head" + (isActive(question.id) ? 
" is-active" : ""); 280 + head.innerHTML = ` 281 + <span class="tree-badge">Q</span> 282 + <span class="tree-id">${escapeHtml(question.id)}</span> 283 + `; 284 + head.title = question.prompt; 285 + head.addEventListener("click", () => onSelectQuestion(question.id)); 286 + qEl.appendChild(head); 287 + 288 + const prompt = document.createElement("p"); 289 + prompt.className = "tree-prompt"; 290 + prompt.textContent = question.prompt; 291 + qEl.appendChild(prompt); 292 + 293 + const meta = document.createElement("div"); 294 + meta.className = "tree-meta"; 295 + meta.innerHTML = renderQuestionMeta(question); 296 + qEl.appendChild(meta); 297 + 298 + const optionsEl = document.createElement("div"); 299 + optionsEl.className = "tree-options"; 300 + 301 + question.options.forEach((option, optionIndex) => { 302 + const optEl = document.createElement("div"); 303 + optEl.className = "tree-option"; 304 + 305 + const optHead = document.createElement("div"); 306 + optHead.className = "tree-option-head"; 307 + optHead.innerHTML = ` 308 + <span class="tree-badge opt">${optionIndex + 1}</span> 309 + `; 310 + optEl.appendChild(optHead); 311 + 312 + const optLabel = document.createElement("p"); 313 + optLabel.className = "tree-option-label"; 314 + optLabel.textContent = option.label; 315 + optEl.appendChild(optLabel); 316 + 317 + if (option.followups?.length) { 318 + const branch = document.createElement("div"); 319 + branch.className = "tree-branch"; 320 + branch.appendChild( 321 + renderTree(option.followups, depth + 1, isActive, onSelectQuestion) 322 + ); 323 + optEl.appendChild(branch); 324 + } 325 + 326 + optionsEl.appendChild(optEl); 327 + }); 328 + 329 + qEl.appendChild(optionsEl); 330 + frag.appendChild(qEl); 331 + } 332 + 333 + const wrap = document.createElement("div"); 334 + wrap.className = "tree-level"; 335 + wrap.appendChild(frag); 336 + return wrap; 337 + } 338 + 339 + export function escapeHtml(text) { 340 + return String(text) 341 + .replaceAll("&", "&amp;") 342 + 
.replaceAll("<", "&lt;") 343 + .replaceAll(">", "&gt;") 344 + .replaceAll('"', "&quot;"); 345 + }
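The `stats()` traversal in `viewer.js` is a depth-first walk over questions and their option follow-ups. For readers on the Python side of this repo, the same walk might be sketched as below. This is an illustrative port, not code from the commit: `questionnaire_stats` and the sample data are hypothetical, and options are assumed to be dicts already (the JS version normalizes string options first).

```python
from typing import Any


def questionnaire_stats(items: list[dict[str, Any]]) -> dict[str, int]:
    """Count questions, options, branches and nesting depth (mirrors stats() in viewer.js)."""
    totals = {"totalQuestions": 0, "totalOptions": 0, "maxDepth": 0, "branchCount": 0}

    def walk(questions: list[dict[str, Any]], depth: int) -> None:
        totals["maxDepth"] = max(totals["maxDepth"], depth)
        for question in questions:
            totals["totalQuestions"] += 1
            for option in question.get("options", []):
                totals["totalOptions"] += 1
                followups = option.get("followups") or []
                if followups:
                    # An option with follow-ups opens one branch, one level deeper.
                    totals["branchCount"] += 1
                    walk(followups, depth + 1)

    walk(items, 1)
    return totals


# Hypothetical two-level questionnaire: one root question, one branch.
sample = [
    {
        "id": "backend",
        "options": [
            {"label": "A", "followups": [
                {"id": "async", "options": [{"label": "x"}, {"label": "y"}]},
            ]},
            {"label": "B"},
        ],
    },
]
result = questionnaire_stats(sample)
```

Depth starts at 1 for the root list, so a questionnaire with no follow-ups reports `maxDepth` of 1, matching the `walk(q.items, 1)` call in the original.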
+6
agent/requirements.txt
langgraph>=0.2.60,<1.0
langchain-anthropic>=0.3.10,<1.0
langchain-core>=0.3.30,<1.0
python-dotenv>=1.0,<2.0
httpx>=0.28,<1.0
psycopg[binary]>=3.2,<4.0
+100
agent/tangled_client.py
"""Live knot git access for Tangled repos (tree + blob)."""

from __future__ import annotations

from typing import Any

import httpx

DEFAULT_TIMEOUT = httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=10.0)


def knot_xrpc(
    client: httpx.Client,
    knot_hostname: str,
    method: str,
    params: dict[str, Any],
) -> tuple[int, Any]:
    host = knot_hostname.removeprefix("https://").rstrip("/")
    resp = client.get(f"https://{host}/xrpc/{method}", params=params)
    if resp.status_code != 200:
        return resp.status_code, {"error": resp.status_code, "body": resp.text[:500]}
    try:
        return resp.status_code, resp.json()
    except ValueError:
        return resp.status_code, {"raw": resp.text[:500]}


def list_tree(
    client: httpx.Client,
    *,
    knot_hostname: str,
    repo_did: str,
    path: str = "",
    ref: str = "HEAD",
) -> dict[str, Any]:
    status, payload = knot_xrpc(
        client,
        knot_hostname,
        "sh.tangled.repo.tree",
        {"repo": repo_did, "ref": ref, "path": path},
    )
    if status != 200 or not isinstance(payload, dict):
        raise RuntimeError(f"tree failed HTTP {status}: {payload!r}")
    return payload


def read_blob(
    client: httpx.Client,
    *,
    knot_hostname: str,
    repo_did: str,
    path: str,
    ref: str = "HEAD",
) -> str:
    status, payload = knot_xrpc(
        client,
        knot_hostname,
        "sh.tangled.repo.blob",
        {"repo": repo_did, "ref": ref, "path": path},
    )
    if status != 200 or not isinstance(payload, dict):
        raise RuntimeError(f"blob failed HTTP {status}: {payload!r}")
    content = payload.get("content")
    if not isinstance(content, str):
        raise RuntimeError("blob response missing text content")
    return content


def describe_repo_on_knot(
    client: httpx.Client,
    knot_hostname: str,
    repo_did: str,
) -> dict[str, Any] | None:
    host = knot_hostname.removeprefix("https://").rstrip("/")
    resp = client.get(
        f"https://{host}/xrpc/sh.tangled.repo.describeRepo",
        params={"repoDid": repo_did},
        timeout=20.0,
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()


def normalize_tree_entries(tree: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten knot tree response into simple name/type entries."""
    out: list[dict[str, str]] = []
    for entry in tree.get("files") or []:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        if not isinstance(name, str):
            continue
        kind = entry.get("type")
        if not isinstance(kind, str):
            mode = entry.get("mode")
            kind = "dir" if mode == "040000" else "file"
        out.append({"name": name, "type": kind})
    return out
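To make the fallback in `normalize_tree_entries` concrete, here is a small offline sketch. The function body is copied from `agent/tangled_client.py`; the `sample_tree` payload is hypothetical and only exercises the keys the function reads (`files`, `name`, `type`, `mode`), so it should not be taken as a real knot response.

```python
from typing import Any


def normalize_tree_entries(tree: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten knot tree response into simple name/type entries."""
    out: list[dict[str, str]] = []
    for entry in tree.get("files") or []:
        if not isinstance(entry, dict):
            continue  # skip malformed entries
        name = entry.get("name")
        if not isinstance(name, str):
            continue  # an entry without a string name is dropped
        kind = entry.get("type")
        if not isinstance(kind, str):
            # No explicit type: infer it from the git mode (040000 is a tree).
            mode = entry.get("mode")
            kind = "dir" if mode == "040000" else "file"
        out.append({"name": name, "type": kind})
    return out


# Hypothetical payload, not captured from a knot.
sample_tree = {
    "files": [
        {"name": "src", "mode": "040000"},        # type inferred as "dir"
        {"name": "README.md", "type": "file"},    # explicit type wins
        {"mode": "100644"},                       # no name: skipped
        "junk",                                   # not a dict: skipped
    ]
}
entries = normalize_tree_entries(sample_tree)
```

Anything without an explicit `type` and without mode `040000` is reported as `file`, which under this logic would include symlinks and submodules.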
+98
agent/tools.py
"""File tools for issue investigation (knot git only)."""

from __future__ import annotations

import json
import os
from typing import Any

import httpx
from langchain_core.tools import BaseTool, tool

from agent.context import IssueSessionContext
from agent.tangled_client import DEFAULT_TIMEOUT, list_tree, normalize_tree_entries, read_blob

DEFAULT_MAX_FILE_CHARS = 32_000


def _truncate(text: str, limit: int) -> dict[str, Any]:
    if len(text) <= limit:
        return {"content": text, "truncated": False, "size_chars": len(text)}
    return {
        "content": text[:limit],
        "truncated": True,
        "size_chars": len(text),
        "note": f"truncated to {limit} chars; request a narrower path or smaller file",
    }


def make_file_tools(
    ctx: IssueSessionContext,
    *,
    max_file_chars: int | None = None,
) -> list[BaseTool]:
    """Build tools bound to a single issue session (repo/knot from context)."""
    limit = max_file_chars or int(os.getenv("AGENT_MAX_FILE_CHARS", str(DEFAULT_MAX_FILE_CHARS)))

    @tool
    def read_repo_file(path: str, ref: str | None = None) -> str:
        """Read exact file contents from the issue's repository on the knot.

        Args:
            path: File path relative to repo root (e.g. README.md, src/lib.rs).
            ref: Git ref (branch/tag/commit). Defaults to the session ref.
        """
        git_ref = ref or ctx.ref
        path = path.lstrip("/")
        with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client:
            try:
                text = read_blob(
                    client,
                    knot_hostname=ctx.knot_hostname,
                    repo_did=ctx.repo_did,
                    path=path,
                    ref=git_ref,
                )
            except Exception as exc:  # noqa: BLE001 - return to model as tool output
                return json.dumps({"error": str(exc), "path": path, "ref": git_ref})
        payload = _truncate(text, limit)
        payload.update({"path": path, "ref": git_ref, "repo_did": ctx.repo_did})
        return json.dumps(payload, ensure_ascii=False)

    @tool
    def list_repo_files(path: str = "", ref: str | None = None) -> str:
        """List files in a repository directory on the knot.

        Use only when the session file tree is insufficient. Prefer known paths
        from context when possible.

        Args:
            path: Directory relative to repo root (empty string = root).
            ref: Git ref. Defaults to the session ref.
        """
        git_ref = ref or ctx.ref
        directory = path.lstrip("/")
        with httpx.Client(timeout=DEFAULT_TIMEOUT, follow_redirects=True) as client:
            try:
                tree = list_tree(
                    client,
                    knot_hostname=ctx.knot_hostname,
                    repo_did=ctx.repo_did,
                    path=directory,
                    ref=git_ref,
                )
                entries = normalize_tree_entries(tree)
            except Exception as exc:  # noqa: BLE001
                return json.dumps(
                    {"error": str(exc), "path": directory or "/", "ref": git_ref}
                )
        return json.dumps(
            {
                "path": directory or "/",
                "ref": git_ref,
                "entries": entries,
            },
            ensure_ascii=False,
        )

    return [read_repo_file, list_repo_files]
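The tool output shape depends on `_truncate`, so a quick offline check of both branches may help when reading `read_repo_file`. The function below is copied from `agent/tools.py`; the inputs and the tiny limit of 5 are illustrative only (the file's default is 32,000 characters).

```python
from typing import Any


def _truncate(text: str, limit: int) -> dict[str, Any]:
    if len(text) <= limit:
        return {"content": text, "truncated": False, "size_chars": len(text)}
    return {
        "content": text[:limit],
        "truncated": True,
        "size_chars": len(text),  # size of the full text, not of the cut content
        "note": f"truncated to {limit} chars; request a narrower path or smaller file",
    }


short = _truncate("hello", 5)        # exactly at the limit: returned whole
long_ = _truncate("hello world", 5)  # over the limit: cut to the first 5 chars
```

Note that `size_chars` always reports the original length, so a caller can tell how much was dropped. Separately, `make_file_tools` resolves the limit with `max_file_chars or ...`, so passing `max_file_chars=0` falls through to the `AGENT_MAX_FILE_CHARS` environment default rather than truncating everything.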
+6
daily_issue_scraper/.dockerignore
.env
.env.*
*.pyc
__pycache__/
.venv/
.git/
+21
daily_issue_scraper/Dockerfile
# Daily Tangled sync — Cloud Run Job / GCE
FROM python:3.12-slim-bookworm

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY scraper/requirements.txt /app/scraper/requirements.txt
RUN pip install --no-cache-dir -r /app/scraper/requirements.txt

COPY scraper/ /app/scraper/
COPY supabase/migrations/ /app/supabase/migrations/
COPY daily_issue_scraper/ /app/daily_issue_scraper/

ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app/scraper:/app

# Secrets at runtime: DB_CONNECTION_STRING, GEMINI_API_KEY
CMD ["python", "-m", "daily_issue_scraper.main"]
+5
daily_issue_scraper/__init__.py
"""Daily Tangled sync container — orchestrates scraper stages."""

from daily_issue_scraper.pipeline import SyncReport, run_daily_sync

__all__ = ["SyncReport", "run_daily_sync"]
+58
daily_issue_scraper/cloudbuild.yaml
# Build and push the daily sync image to Artifact Registry.
#
# Trigger (from repo root):
#   gcloud builds submit --config=daily_issue_scraper/cloudbuild.yaml .
#
# One-time setup:
#   gcloud artifacts repositories create tangled \
#     --repository-format=docker --location=${REGION}
#
# Schedule (Cloud Run Job example):
#   gcloud run jobs create tangled-daily-sync \
#     --image=${REGION}-docker.pkg.dev/${PROJECT_ID}/tangled/daily-issue-scraper:latest \
#     --region=${REGION} \
#     --set-secrets=DB_CONNECTION_STRING=tangled-db-url:latest,GEMINI_API_KEY=gemini-api-key:latest \
#     --task-timeout=3600 --memory=1Gi --cpu=1 --max-retries=1
#
#   gcloud scheduler jobs create http tangled-daily-sync-trigger \
#     --schedule="0 3 * * *" \
#     --uri="https://${REGION}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/tangled-daily-sync:run" \
#     --http-method=POST \
#     --oauth-service-account-email=${SCHEDULER_SA}@${PROJECT_ID}.iam.gserviceaccount.com

substitutions:
  _REGION: europe-west1
  _REPOSITORY: tangled
  _IMAGE: daily-issue-scraper

steps:
  - id: build
    name: gcr.io/cloud-builders/docker
    args:
      - build
      - -f
      - daily_issue_scraper/Dockerfile
      - -t
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}
      - -t
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest
      - .

  - id: push-build-id
    name: gcr.io/cloud-builders/docker
    args:
      - push
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}

  - id: push-latest
    name: gcr.io/cloud-builders/docker
    args:
      - push
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest

images:
  - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}
  - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest

options:
  logging: CLOUD_LOGGING_ONLY
+68
daily_issue_scraper/deploy.sh
#!/usr/bin/env bash
# Build image (Cloud Build), push to Artifact Registry, deploy Cloud Run Job.
#
# Usage (from repo root):
#   ./daily_issue_scraper/deploy.sh
#
# Optional overrides:
#   PROJECT_ID=my-project REGION=europe-west1 JOB_NAME=tangled-daily-sync ./daily_issue_scraper/deploy.sh
#
# Requires: gcloud auth, and the secrets below stored in Google Secret Manager.
# Secrets are referenced (never passed as plaintext env vars) so DB creds / API
# keys are not visible in `gcloud run jobs describe` or to run.viewer roles.
# One-time setup (run once per secret):
#   printf '%s' "$DB_CONNECTION_STRING" | gcloud secrets create tangled-db-url --data-file=-
#   printf '%s' "$GEMINI_API_KEY" | gcloud secrets create gemini-api-key --data-file=-
# Grant the job's runtime service account roles/secretmanager.secretAccessor.

set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
REGION="${REGION:-europe-west1}"
REPOSITORY="${REPOSITORY:-tangled}"
IMAGE_NAME="${IMAGE_NAME:-daily-issue-scraper}"
JOB_NAME="${JOB_NAME:-tangled-daily-sync}"
TASK_TIMEOUT="${TASK_TIMEOUT:-3600}"
MEMORY="${MEMORY:-1Gi}"
CPU="${CPU:-1}"
MAX_RETRIES="${MAX_RETRIES:-1}"
# Secret Manager secret names (override if yours differ). Mapped to env vars in the job.
DB_SECRET="${DB_SECRET:-tangled-db-url}"
GEMINI_SECRET="${GEMINI_SECRET:-gemini-api-key}"

PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null)}"
if [[ -z "$PROJECT_ID" || "$PROJECT_ID" == "(unset)" ]]; then
  echo "ERROR: Set PROJECT_ID or run: gcloud config set project YOUR_PROJECT_ID" >&2
  exit 1
fi

IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/${IMAGE_NAME}:latest"

echo "==> Project: $PROJECT_ID"
echo "==> Region: $REGION"
echo "==> Image: $IMAGE"
echo "==> Job: $JOB_NAME"
echo "==> Secrets: DB_CONNECTION_STRING<-$DB_SECRET, GEMINI_API_KEY<-$GEMINI_SECRET"
echo

echo "==> Build & push (Cloud Build)"
gcloud builds submit \
  --project="$PROJECT_ID" \
  --config="$ROOT/daily_issue_scraper/cloudbuild.yaml" \
  "$ROOT"

echo
echo "==> Deploy Cloud Run Job"
gcloud run jobs deploy "$JOB_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --image="$IMAGE" \
  --set-secrets="DB_CONNECTION_STRING=${DB_SECRET}:latest,GEMINI_API_KEY=${GEMINI_SECRET}:latest" \
  --task-timeout="$TASK_TIMEOUT" \
  --memory="$MEMORY" \
  --cpu="$CPU" \
  --max-retries="$MAX_RETRIES"

echo
echo "Done. Run once:"
echo "  gcloud run jobs execute $JOB_NAME --project=$PROJECT_ID --region=$REGION"
+59
daily_issue_scraper/main.py
#!/usr/bin/env python3
"""Container entrypoint for the daily Tangled sync job."""

from __future__ import annotations

import argparse
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

from daily_issue_scraper.pipeline import run_daily_sync

REPO_ROOT = Path(__file__).resolve().parent.parent


def load_env() -> None:
    for candidate in (
        REPO_ROOT / ".env",
        Path(__file__).resolve().parent / ".env",
        REPO_ROOT / "scraper" / ".env",
    ):
        if candidate.exists():
            load_dotenv(candidate)
            return
    load_dotenv()


def require_dsn() -> str:
    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        print("ERROR: DB_CONNECTION_STRING is not set", file=sys.stderr)
        raise SystemExit(1)
    return dsn


def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(description="Run the daily Tangled sync pipeline.")
    parser.add_argument(
        "--only",
        nargs="+",
        metavar="STAGE",
        help="Run specific stages only (network, accounts, repos, issues, readmes, ...)",
    )
    args = parser.parse_args(argv)

    load_env()
    dsn = require_dsn()
    only = set(args.only) if args.only else None
    run_daily_sync(dsn, only=only)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.", file=sys.stderr)
        raise SystemExit(130) from None
+231
daily_issue_scraper/pipeline.py
"""Daily sync pipeline — imports stage runners from scraper/."""

from __future__ import annotations

import os
import sys
import time
from collections.abc import Callable
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

REPO_ROOT = Path(__file__).resolve().parent.parent
SCRAPER_ROOT = REPO_ROOT / "scraper"
if str(SCRAPER_ROOT) not in sys.path:
    sys.path.insert(0, str(SCRAPER_ROOT))

from check_readmes import run_check_readmes  # noqa: E402
from db import connect, init_schema, set_crawl_state  # noqa: E402
from embed_issues import run_embed_issues  # noqa: E402
from embed_readmes import run_embed_readmes  # noqa: E402
from fetch_collaborators import run_fetch_collaborators  # noqa: E402
from fetch_issues import run_fetch_issues  # noqa: E402
from progress import banner, log, summary_block  # noqa: E402
from stage2_network import run_stage2_network  # noqa: E402
from stage2_pds import run_stage2_accounts_only, run_stage2_repos_only  # noqa: E402

CRAWL_KEY = "sync:daily"

StageFn = Callable[[str], dict[str, Any]]


@dataclass
class Stage:
    key: str
    title: str
    run: StageFn
    enabled: bool = True


@dataclass
class SyncReport:
    started_at: float = field(default_factory=time.time)
    stages: dict[str, dict[str, Any]] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)

    @property
    def elapsed_s(self) -> float:
        return time.time() - self.started_at


def _env_flag(name: str, *, default: bool) -> bool:
    raw = os.getenv(name, "").strip().lower()
    if not raw:
        return default
    return raw in ("1", "true", "yes")


def _configure_daily_env() -> None:
    """Defaults tuned for scheduled daily runs (override via env)."""
    os.environ.setdefault("TANGLED_ISSUE_REFRESH", "1")
    os.environ.setdefault("TANGLED_ISSUE_ALL_USERS", "1")
    os.environ.setdefault("TANGLED_STAGE2_NETWORK_REFRESH", "0")


def _format_stats(stats: dict[str, Any]) -> str:
    """One-line summary of rows processed for sync logs."""
    if not stats:
        return "(no stats)"
    ordered = (
        "repos_stored",
        "already_in_db",
        "account_count",
        "users_scanned",
        "issues_upserted",
        "open_issues",
        "found",
        "missing",
        "repos_fetched",
        "collaborator_edges",
        "embedded",
        "batches",
        "errors",
        "resolve_failed",
        "record_failed",
        "already_synced",
        "skipped",
        "skipped_knot",
        "error",
    )
    parts: list[str] = []
    seen: set[str] = set()
    for key in ordered:
        if key in stats and stats[key] is not None:
            parts.append(f"{key}={stats[key]}")
            seen.add(key)
    for key, value in stats.items():
        if key not in seen and value is not None:
            parts.append(f"{key}={value}")
    return ", ".join(parts) if parts else "(no stats)"


def build_stages() -> list[Stage]:
    return [
        Stage(
            key="network",
            title="Discover repos (tangled.org search)",
            run=run_stage2_network,
            enabled=_env_flag("TANGLED_SYNC_NETWORK", default=True),
        ),
        Stage(
            key="accounts",
            title="Refresh tngl.sh accounts",
            run=run_stage2_accounts_only,
            enabled=_env_flag("TANGLED_SYNC_ACCOUNTS", default=True),
        ),
        Stage(
            key="repos",
            title="Scan tngl.sh repo records (heavy)",
            run=run_stage2_repos_only,
            enabled=_env_flag("TANGLED_SYNC_TNGL_REPOS", default=False),
        ),
        Stage(
            key="issues",
            title="Re-scan all users for issues",
            run=run_fetch_issues,
            enabled=_env_flag("TANGLED_SYNC_ISSUES", default=True),
        ),
        Stage(
            key="readmes",
            title="Fetch missing READMEs from knots",
            run=run_check_readmes,
            enabled=_env_flag("TANGLED_SYNC_READMES", default=True),
        ),
        Stage(
            key="collaborators",
            title="Fetch repo collaborators",
            run=run_fetch_collaborators,
            enabled=_env_flag("TANGLED_SYNC_COLLABORATORS", default=True),
        ),
        Stage(
            key="embed_readmes",
            title="Embed READMEs (Gemini)",
            run=run_embed_readmes,
            enabled=_env_flag("TANGLED_SYNC_EMBED_READMES", default=True),
        ),
        Stage(
            key="embed_issues",
            title="Embed issues (Gemini)",
            run=run_embed_issues,
            enabled=_env_flag("TANGLED_SYNC_EMBED_ISSUES", default=True),
        ),
    ]


def run_daily_sync(dsn: str, *, only: set[str] | None = None) -> SyncReport:
    _configure_daily_env()
    report = SyncReport()
    stages = [s for s in build_stages() if s.enabled and (not only or s.key in only)]

    banner("DAILY SYNC — Tangled → Postgres")
    log("sync", f"Stages: {', '.join(s.key for s in stages) or '(none)'}")

    init_schema(dsn)

    with connect(dsn) as conn:
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={"stages": [s.key for s in stages]},
        )
        conn.commit()

    for i, stage in enumerate(stages, start=1):
        log("sync", f"── Stage {i}/{len(stages)}: {stage.title} ({stage.key}) ──")
        t0 = time.time()
        try:
            stats = stage.run(dsn)
            report.stages[stage.key] = {
                "status": "ok",
                "elapsed_s": round(time.time() - t0, 1),
                "stats": stats,
            }
            log(
                "sync",
                f"✓ {stage.key} done in {report.stages[stage.key]['elapsed_s']}s — {_format_stats(stats)}",
            )
        except Exception as exc:
            msg = f"{stage.key}: {exc}"
            report.errors.append(msg)
            report.stages[stage.key] = {
                "status": "error",
                "elapsed_s": round(time.time() - t0, 1),
                "error": str(exc),
            }
            log("sync", f"✗ {msg}")
            if _env_flag("TANGLED_SYNC_FAIL_FAST", default=False):
                break

    final_status = "complete" if not report.errors else "partial"
    with connect(dsn) as conn:
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status=final_status,
            meta={
                "elapsed_s": round(report.elapsed_s, 1),
                "stages": report.stages,
                "errors": report.errors,
            },
        )
        conn.commit()

    lines = [f"Elapsed: {report.elapsed_s:.0f}s", ""]
    for key, info in report.stages.items():
        mark = "OK" if info["status"] == "ok" else "ERR"
        line = f"  [{mark}] {key} ({info['elapsed_s']}s)"
        if info["status"] == "ok" and info.get("stats"):
            line += f" — {_format_stats(info['stats'])}"
        lines.append(line)
    if report.errors:
        lines.append("")
        lines.append("Errors:")
        lines.extend(f"  - {e}" for e in report.errors)

    summary_block("Daily sync finished", lines)

    if report.errors and _env_flag("TANGLED_SYNC_STRICT", default=True):
        raise SystemExit(1)
    return report
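Stage selection in `run_daily_sync` combines two filters: the per-stage `TANGLED_SYNC_*` flag parsed by `_env_flag`, and the optional `--only` set. The sketch below reproduces that logic in isolation. `_env_flag` is copied from `pipeline.py`; the `DEMO_*` variable names and the `(key, enabled)` tuples are hypothetical stand-ins for the real `Stage` objects.

```python
import os


def _env_flag(name: str, *, default: bool) -> bool:
    raw = os.getenv(name, "").strip().lower()
    if not raw:
        return default  # unset or blank: keep the stage's default
    return raw in ("1", "true", "yes")


# Hypothetical variables, used only for this demonstration.
os.environ["DEMO_YES"] = " YES "   # whitespace and case are ignored
os.environ["DEMO_ON"] = "on"       # not in the accepted set, so parsed as False
os.environ.pop("DEMO_UNSET", None)

flags = {
    "yes": _env_flag("DEMO_YES", default=False),
    "on": _env_flag("DEMO_ON", default=True),
    "unset": _env_flag("DEMO_UNSET", default=True),
}

# Same predicate as run_daily_sync: enabled AND (no --only given OR key listed).
stages = [("network", True), ("repos", False), ("issues", True)]
only = {"issues", "repos"}
selected = [key for key, enabled in stages if enabled and (not only or key in only)]
```

Two consequences follow from this sketch: a value such as `on` disables a stage whose default is enabled, and `--only repos` cannot run the `repos` stage unless `TANGLED_SYNC_TNGL_REPOS` is also set to a truthy value, because `--only` narrows the enabled set rather than overriding it.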
+1
questionnaire.txt
··· 1 + {"issue":"at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22","version":2,"introduction":{"project":"AtomicXR is a Nushell-based CLI tool (`axr`) for configuring VR/XR on Linux, primarily targeting Fedora Atomic and Universal Blue distributions like Bazzite. The CLI is structured as Nushell modules under `cli/atomic-xr/`, each exporting subcommands (e.g., `envision.nu`, `flatpak.nu`, `steamvr-lh.nu`). The project is marked as no longer maintained in the README, but the repo owner is actively requesting this feature. All modules follow a consistent pattern: they use `std log`, define helper functions, and export public commands.","issue":"The issue requests a new `axr lh pair` command for pairing SteamVR Lighthouse base stations and tracked devices without requiring users to manually invoke `lighthouse_console`. The author proposes two backend tools: (1) `lighthouse_console` from SteamVR, which is available now but slower due to its async job model, or (2) `lhctl`, a faster alternative that is not yet publicly released. The author is open to implementing option 1 now and migrating to option 2 later. Key open decisions include which backend to target, how to structure the command within the existing module hierarchy, what pairing workflow to expose, and how to handle the eventual transition between backends.","approach":"This questionnaire walks through the major implementation decisions in order: first the backend tool strategy (which directly shapes the entire implementation), then the command naming and module placement within the existing CLI structure, followed by the pairing workflow UX, error handling approach, and finally testing and documentation. 
Branch-specific follow-ups drill into details that only matter for a given backend choice, while shared tail questions cover cross-cutting concerns like deprecation planning and documentation."},"items":[{"id":"backend_strategy","prompt":"Which backend tool strategy should the pairing command use?","context":"The issue explicitly presents two backend options — lighthouse_console (available now, slower) and lhctl (faster, not yet public). This is the foundational decision because it determines the command's implementation, performance characteristics, and maintenance trajectory.","explanation":"The existing `steamvr-lh.nu` module already has a `lh-console` helper function that locates `lighthouse_console` from PATH, Flatpak Steam, or native Steam installations. Using `lighthouse_console` means leveraging this existing infrastructure but dealing with its async job model (commands are queued and results polled). `lhctl` would be significantly faster but introduces a dependency on unreleased software. A third option is to build an abstraction layer that supports both, allowing a seamless swap later. Each path has different implications for code complexity, user experience, and long-term maintenance.","options":[{"label":"Implement using lighthouse_console now (available immediately). Use the existing `lh-console` helper in `steamvr-lh.nu` to invoke lighthouse_console for pairing operations. Accept the slower async job model as a tradeoff for immediate availability. Plan to replace the backend later when lhctl is released.","followups":[{"id":"lh_console_async_handling","prompt":"How should the command handle lighthouse_console's async job model?","context":"Because you chose to use lighthouse_console, the pairing process involves submitting async jobs and polling for results. 
The existing `lh-console` helper simply runs the binary, but pairing requires multi-step interaction: discovering devices, initiating pairing, and confirming success.","explanation":"lighthouse_console uses an async job queue where you submit a command and then poll for its completion. For pairing, this typically involves: (1) scanning for nearby devices, (2) sending a pair command to a specific device, and (3) waiting for confirmation. The implementation could either parse stdout from lighthouse_console interactively, use its batch/scripting mode if available, or wrap the entire flow in a loop that polls for job completion. Each approach has different reliability and complexity tradeoffs.","options":[{"label":"Parse lighthouse_console's stdout interactively — run lighthouse_console as a subprocess, send commands via stdin, and parse output line-by-line to track job status and extract pairing results. This gives the most control but requires robust text parsing."},{"label":"Use lighthouse_console's command-line arguments for batch operations — pass all necessary arguments upfront (e.g., device serial, pair command) and capture the final output. Simpler implementation but less interactive feedback for the user."},{"label":"Wrap lighthouse_console calls in a polling loop — submit the pair command, then repeatedly invoke lighthouse_console to check job status until completion or timeout. More resilient to output format changes but slower due to repeated process spawning."}]},{"id":"lh_console_migration_prep","prompt":"How much should the implementation prepare for a future lhctl migration?","context":"Since lighthouse_console is intended as a temporary backend, the code could be structured to make swapping backends easier later, or it could be kept simple with the understanding that a rewrite will happen.","explanation":"Adding an abstraction layer (e.g., a common interface that both backends implement) increases initial complexity but makes the future swap trivial. 
Alternatively, keeping the implementation tightly coupled to lighthouse_console is simpler now but means more work during migration. The existing codebase doesn't use abstraction patterns — modules directly call external tools (e.g., `flatpak run`, `distrobox enter`, `rpm-ostree`). Following this convention suggests a direct implementation is more idiomatic.","options":[{"label":"Keep it simple and direct — implement pairing tightly coupled to lighthouse_console, following the existing pattern in the codebase where modules directly invoke external tools. Accept that a future migration to lhctl will require rewriting the pairing logic."},{"label":"Create a thin abstraction — define the pairing workflow as a sequence of steps (discover, select, pair, verify) with the lighthouse_console implementation behind helper functions. When lhctl arrives, only the helper functions need to change, not the user-facing command logic."}]}]},{"label":"Wait for lhctl and implement using it directly. Defer this feature until lhctl is publicly released, then implement with the faster tool from the start. This avoids throwaway work but leaves users without the feature for an indefinite period.","followups":[{"id":"lhctl_interim_solution","prompt":"Should there be an interim solution while waiting for lhctl?","context":"Because you chose to wait for lhctl, users currently have no streamlined way to pair lighthouse devices through the axr CLI. The existing `axr steamvr-lh console` command already exposes lighthouse_console directly, but it requires users to know the pairing commands themselves.","explanation":"The existing `console` command in `steamvr-lh.nu` already lets users run lighthouse_console with arbitrary arguments. An interim approach could enhance the console command's documentation or add a help subcommand that prints the manual pairing steps, giving users guidance without building full automation. 
Alternatively, the feature could simply wait with no interim measure.","options":[{"label":"Add a help/guide subcommand (e.g., `axr steamvr-lh pair-guide`) that prints step-by-step instructions for manually pairing via lighthouse_console. Low effort, immediately useful, and can be removed or replaced when lhctl arrives."},{"label":"No interim solution — wait for lhctl and implement the full feature then. Users can continue using `axr steamvr-lh console` directly in the meantime."}]}]},{"label":"Implement with lighthouse_console now, but design the command to auto-detect and prefer lhctl when it becomes available. The command checks for lhctl on PATH first; if not found, falls back to lighthouse_console. This follows the same pattern as the existing `lh-console` helper which already checks multiple locations for lighthouse_console.","followups":[{"id":"dual_backend_detection","prompt":"How should backend detection and selection work?","context":"Because you chose the dual-backend approach, the command needs logic to detect which tools are available and select the best one. The existing `lh-console` helper in `steamvr-lh.nu` already demonstrates this pattern — it checks PATH, then Flatpak Steam, then native Steam for lighthouse_console.","explanation":"The detection could be a simple `which` check at command startup (matching the existing `lh-console` pattern), or it could include a flag to force a specific backend. A flag would be useful for debugging or when users want to explicitly choose, but adds complexity. The existing codebase doesn't use backend-selection flags — tools are auto-detected silently.","options":[{"label":"Auto-detect only, following the existing `lh-console` pattern — check for lhctl first via `which`, fall back to lighthouse_console. No user-facing flag. 
Log which backend was selected at debug level using `std log`."},{"label":"Auto-detect with an optional override flag (e.g., `--backend lhctl` or `--backend lighthouse_console`) — auto-detect by default but allow users to force a specific backend. Useful for testing or when both tools are installed but one is preferred."}]}]}]},{"id":"command_naming","prompt":"What should the command name and module placement be?","context":"Regardless of which backend is chosen, the new pairing command needs a name and a home within the existing module structure. The issue suggests `axr lh pair`, but the existing lighthouse module is named `steamvr-lh.nu` and is registered in `mod.nu` as `export use steamvr-lh.nu`, making the current command namespace `axr steamvr-lh`.","explanation":"The existing module already exports `calibrate` and `console` commands, accessible as `axr steamvr-lh calibrate` and `axr steamvr-lh console`. Adding `pair` to this module would make it `axr steamvr-lh pair`. However, the issue suggests `axr lh pair`, which would require either renaming the module file to `lh.nu` (a breaking change for existing `axr steamvr-lh` users) or creating a new `lh.nu` module alongside the existing one. The README notes the project is deprecated, which may reduce concern about breaking changes.","options":[{"label":"Add the `pair` command to the existing `steamvr-lh.nu` module, making it `axr steamvr-lh pair`. This is the simplest approach — no module renaming, no breaking changes, consistent with the existing command structure. The command name is slightly longer than the issue suggests but follows established conventions."},{"label":"Rename `steamvr-lh.nu` to `lh.nu` and update `mod.nu` accordingly, making all lighthouse commands available under `axr lh` (e.g., `axr lh pair`, `axr lh calibrate`, `axr lh console`). This matches the issue's suggested naming but is a breaking change for anyone using `axr steamvr-lh` commands. 
Given the project's deprecated status, this may be acceptable."},{"label":"Create a new `lh.nu` module that re-exports from `steamvr-lh.nu` and adds the `pair` command, providing both `axr lh pair` (new) and `axr steamvr-lh pair` (alias). This avoids breaking existing commands while introducing the shorter namespace, but adds module complexity."}]},{"id":"pairing_workflow","prompt":"What pairing workflow should the command expose to users?","context":"Regardless of backend and naming choices, the user-facing pairing workflow needs to be defined. Lighthouse pairing involves discovering nearby Bluetooth LE devices (base stations, controllers, trackers) and establishing a connection with the host system.","explanation":"Pairing can be fully automated (scan, discover all devices, pair them all) or interactive (show discovered devices, let the user select which to pair). The existing `calibrate` command in `steamvr-lh.nu` uses an interactive pattern — it asks the user yes/no questions and waits for input. Other modules like `envision.nu` take a more automated approach with flags. The choice affects UX complexity and safety (auto-pairing everything might pair unintended nearby devices in shared spaces).","options":[{"label":"Interactive device selection — scan for nearby lighthouse devices, display a numbered list of discovered devices (with serial numbers and types), and let the user select which ones to pair using Nushell's `input list` or similar interactive selection. 
This is safer in shared spaces and gives users control over which devices are paired.","followups":[{"id":"interactive_multi_select","prompt":"Should users be able to select multiple devices at once or pair one at a time?","context":"Because you chose interactive device selection, the selection interface needs to handle the common case where users want to pair multiple devices (e.g., two base stations and two controllers) in a single session.","explanation":"Nushell's `input list` supports single selection. For multi-select, the command could loop (pair one, ask if there are more), use a custom multi-select prompt, or accept device identifiers as arguments alongside the interactive mode. The existing codebase uses simple `input` calls and `input list` for single selections.","options":[{"label":"Loop-based approach — after pairing one device, ask 'Would you like to pair another device?' and re-scan. Simple to implement using existing patterns (the `ask yn` helper in `steamvr-lh.nu`), handles the multi-device case naturally."},{"label":"Accept optional device serial numbers as arguments — if serials are provided, pair those directly without interactive selection. If no arguments given, enter interactive mode. This supports both scripted and interactive use cases."}]}]},{"label":"Fully automated — scan for all nearby unpaired lighthouse devices and pair them all automatically. Simpler UX (just run `axr lh pair` and wait), but risks pairing unintended devices in environments where multiple lighthouse setups are nearby (e.g., VR arcades, shared spaces).","followups":[{"id":"auto_pair_confirmation","prompt":"Should fully automated pairing require a confirmation step?","context":"Because you chose fully automated pairing, there's a risk of pairing unintended nearby devices. 
A confirmation step showing what will be paired could mitigate this without adding full interactive selection.","explanation":"The confirmation could show discovered devices and ask 'Pair all N devices? [y/n]' before proceeding, or it could include a `--yes` / `-y` flag to skip confirmation for scripted use. The existing `calibrate` command uses confirmation prompts via the `ask yn` helper.","options":[{"label":"Show discovered devices and require confirmation before pairing — display the list of found devices, then ask 'Pair all N devices? [y/n]' using the existing `ask yn` helper. Add a `--yes` flag to skip confirmation for scripted/automated use."},{"label":"No confirmation — pair immediately upon discovery. Keep the command simple and fast. Users in shared spaces can use the interactive mode (if implemented) or be careful about when they run the command."}]}]}]},{"id":"error_handling","prompt":"How should the command handle common failure scenarios?","context":"Regardless of the pairing workflow chosen, the command needs to handle failures gracefully. Common issues include: Bluetooth not available/enabled, no devices found within timeout, pairing rejected by device, and the backend tool (lighthouse_console or lhctl) not being installed.","explanation":"The existing codebase has two error handling patterns: (1) `error make` with descriptive messages and `help` fields (used in `steamvr-lh.nu`, `runtime.nu`, `oscavmgr.nu`) for fatal errors, and (2) `std log` for warnings and informational messages. The `lh-console` helper already handles the 'tool not found' case with an `error make`. Pairing-specific failures (no devices found, Bluetooth off) need their own handling strategy.","options":[{"label":"Fail fast with descriptive errors — check prerequisites (Bluetooth availability, backend tool installed) upfront before attempting any pairing. 
Use `error make` with helpful messages and remediation steps in the `help` field, matching the existing pattern in `lh-console` and `runtime.nu`. If pairing fails mid-process, report the specific failure and suggest next steps."},{"label":"Retry with guidance — on transient failures (no devices found, pairing timeout), automatically retry a configurable number of times with user-friendly status messages via `std log`. Only fail with `error make` on non-recoverable errors (tool not installed, Bluetooth hardware missing). Add a `--timeout` flag to control how long to scan for devices."}]},{"id":"testing_strategy","prompt":"What testing approach should be used for the new pairing command?","context":"Regardless of all previous choices, the new command needs some form of validation. The existing codebase has no test files — modules are Nushell scripts that directly invoke system tools, making traditional unit testing difficult.","explanation":"The codebase currently has zero tests. Nushell supports testing via `nu --testbin` and the `testing` module, but the existing modules are tightly coupled to system state (Flatpak, rpm-ostree, distrobox, SteamVR). Adding tests for the pairing command would be a first for this project. Options range from no tests (matching current practice) to adding integration tests that mock external tool calls.","options":[{"label":"No automated tests — match the existing codebase convention. Rely on manual testing with actual lighthouse hardware. Document the manual testing procedure in a comment or the PR description."},{"label":"Add basic smoke tests — create a test file that verifies the command's prerequisite checks (e.g., that it properly errors when lighthouse_console is not found) without requiring actual hardware. 
This would be the first test in the project and could establish a testing pattern for future commands."}]},{"id":"documentation","prompt":"What documentation should accompany the new command?","context":"Regardless of all implementation choices, the new command needs some level of documentation. The existing commands use Nushell's built-in doc comments (lines starting with `#` above function definitions) which are displayed by `axr -l` and `axr <command> --help`. The README currently focuses on migration away from AtomicXR.","explanation":"Nushell doc comments are the primary documentation mechanism in this codebase — every exported function has a `# Description` comment above it, and parameters have inline `# comment` annotations. The README is focused on deprecation/migration and doesn't document individual commands. Adding README documentation for a new feature in a deprecated project may send mixed signals, while inline doc comments are lightweight and follow existing conventions.","options":[{"label":"Inline Nushell doc comments only — add descriptive comments above the `pair` command and its parameters, following the existing pattern (e.g., `# Open SteamVR's lighthouse_console` on the `console` command). This is sufficient for `axr steamvr-lh pair --help` output and matches the project's documentation style."},{"label":"Inline doc comments plus a brief section in the README — add doc comments and also add a short section in the README under the legacy CLI usage area, documenting the `pair` command's purpose and basic usage. This helps users who read the README before installing."}]}]}
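The questionnaire above nests follow-up questions under individual options. A minimal sketch of walking that shape, relying only on keys visible in the sample (`items`, `options`, `followups`, `id`). The function name `iter_question_ids` is hypothetical and not part of the repo, and the `sample` document is a small stand-in, not real data.

```python
from typing import Any, Iterator


def iter_question_ids(items: list[dict[str, Any]]) -> Iterator[str]:
    """Yield every question id depth-first, including option follow-ups."""
    for item in items:
        yield item["id"]
        for option in item.get("options", []):
            # Follow-ups exist only on some options; treat a missing key as empty.
            yield from iter_question_ids(option.get("followups", []))


# Stand-in document with the same nesting as the sample questionnaire.
sample = {
    "version": 2,
    "items": [
        {
            "id": "backend_strategy",
            "options": [
                {"label": "A", "followups": [{"id": "async_handling", "options": []}]},
                {"label": "B"},
            ],
        },
        {"id": "command_naming", "options": [{"label": "C"}]},
    ],
}

question_ids = list(iter_question_ids(sample["items"]))
print(question_ids)
```

A traversal like this could back a count such as the `question_count` field the job prints, though how the job actually computes that value is not shown here.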
+5
questionnaire_job/.env.example
# Questionnaire job env (deploy with questionnaire_job/deploy.sh or set on Cloud Run Job)
# DB_CONNECTION_STRING=postgresql://...
# ANTHROPIC_API_KEY=...
# ANTHROPIC_QUESTIONNAIRE_MODEL=claude-opus-4-6
# QUESTIONNAIRE_MIN_TOOL_READS=2
+22
questionnaire_job/Dockerfile
# AI-solve questionnaire generator — Cloud Run Job
FROM python:3.12-slim-bookworm

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates git openssh-client \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY agent/requirements.txt /app/agent/requirements.txt
RUN pip install --no-cache-dir -r /app/agent/requirements.txt

COPY agent/ /app/agent/
COPY questionnaire_job/ /app/questionnaire_job/

ENV PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app

# Secrets at runtime: DB_CONNECTION_STRING, ANTHROPIC_API_KEY
# Issue id at runtime: ISSUE_URI / ISSUE_ID env or CLI arg (via job execute --args)
ENTRYPOINT ["python", "-m", "questionnaire_job.main"]
CMD []
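The Dockerfile comment lists several ways to supply the issue id at runtime. `questionnaire_job/main.py` checks them in a fixed order: `--issue-uri`, then `--issue-id`, then the positional argument, then the `ISSUE_URI` and `ISSUE_ID` environment variables. The sketch below restates that precedence as a pure function so it can be reasoned about in isolation; `pick_issue_id` and `thread_id_for` are hypothetical names, not functions in the repo.

```python
from typing import Mapping, Optional


def pick_issue_id(
    issue_uri: Optional[str],
    issue_id_flag: Optional[str],
    positional: Optional[str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty source, in the order main() checks them."""
    candidates = (
        issue_uri,
        issue_id_flag,
        positional,
        env.get("ISSUE_URI", "").strip(),
        env.get("ISSUE_ID", "").strip(),
    )
    return next((c for c in candidates if c), None)


def thread_id_for(issue_uri: str) -> str:
    """Mirror the thread id main() builds: "job-" plus the last path segment."""
    return f"job-{issue_uri.rsplit('/', 1)[-1]}"


chosen = pick_issue_id(None, None, "3lvzel2uo3a22", {"ISSUE_URI": "at://example"})
print(chosen)
```

Because CLI arguments come first, an `--args` value passed to `gcloud run jobs execute` would take priority over an `ISSUE_URI` baked into the job's environment.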
+46
questionnaire_job/cloudbuild.yaml
# Build and push the issue questionnaire job image to Artifact Registry.
#
# From repo root:
#   gcloud builds submit --config=questionnaire_job/cloudbuild.yaml .
#
# Execute for one issue:
#   gcloud run jobs execute tangled-questionnaire \
#     --region=europe-west1 \
#     --args="--issue-uri,at://did:plc:…/sh.tangled.repo.issue/…"

substitutions:
  _REGION: europe-west1
  _REPOSITORY: tangled
  _IMAGE: issue-questionnaire

steps:
  - id: build
    name: gcr.io/cloud-builders/docker
    args:
      - build
      - -f
      - questionnaire_job/Dockerfile
      - -t
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}
      - -t
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest
      - .

  - id: push-build-id
    name: gcr.io/cloud-builders/docker
    args:
      - push
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}

  - id: push-latest
    name: gcr.io/cloud-builders/docker
    args:
      - push
      - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest

images:
  - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID}
  - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest

options:
  logging: CLOUD_LOGGING_ONLY
+72
questionnaire_job/deploy.sh
#!/usr/bin/env bash
# Build image (Cloud Build), push to Artifact Registry, deploy Cloud Run Job.
#
# Usage (from repo root):
#   ./questionnaire_job/deploy.sh
#
# Optional overrides:
#   PROJECT_ID=cleveland-464404-m0 JOB_NAME=tangled-questionnaire ./questionnaire_job/deploy.sh
#
# Requires .env at repo root with at least:
#   DB_CONNECTION_STRING=...
#   ANTHROPIC_API_KEY=...

set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
ENV_FILE="${ENV_FILE:-$ROOT/.env}"
REGION="${REGION:-europe-west1}"
REPOSITORY="${REPOSITORY:-tangled}"
IMAGE_NAME="${IMAGE_NAME:-issue-questionnaire}"
JOB_NAME="${JOB_NAME:-tangled-questionnaire}"
TASK_TIMEOUT="${TASK_TIMEOUT:-3600}"
MEMORY="${MEMORY:-2Gi}"
CPU="${CPU:-2}"
MAX_RETRIES="${MAX_RETRIES:-1}"

PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null)}"
if [[ -z "$PROJECT_ID" || "$PROJECT_ID" == "(unset)" ]]; then
  echo "ERROR: Set PROJECT_ID or run: gcloud config set project YOUR_PROJECT_ID" >&2
  exit 1
fi

if [[ ! -f "$ENV_FILE" ]]; then
  echo "ERROR: Env file not found: $ENV_FILE" >&2
  exit 1
fi

IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/${IMAGE_NAME}:latest"

echo "==> Project: $PROJECT_ID"
echo "==> Region: $REGION"
echo "==> Image: $IMAGE"
echo "==> Job: $JOB_NAME"
echo "==> Env file: $ENV_FILE"
echo

echo "==> Build & push (Cloud Build)"
gcloud builds submit \
  --project="$PROJECT_ID" \
  --config="$ROOT/questionnaire_job/cloudbuild.yaml" \
  "$ROOT"

echo
echo "==> Deploy Cloud Run Job"
gcloud run jobs deploy "$JOB_NAME" \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --image="$IMAGE" \
  --env-vars-file="$ENV_FILE" \
  --task-timeout="$TASK_TIMEOUT" \
  --memory="$MEMORY" \
  --cpu="$CPU" \
  --max-retries="$MAX_RETRIES"

echo
echo "Done. Run for one issue (full URI):"
echo " gcloud run jobs execute $JOB_NAME --project=$PROJECT_ID --region=$REGION \\"
echo " --args='--issue-uri,at://did:plc:…/sh.tangled.repo.issue/…'"
echo
echo "Or by rkey (must be unique in tangled_issues):"
echo " gcloud run jobs execute $JOB_NAME --project=$PROJECT_ID --region=$REGION \\"
echo " --args='3lvzel2uo3a22'"
+75
questionnaire_job/job.yaml
# Cloud Run Job — AI-solve questionnaire generator
#
# Substitute ${PROJECT_ID} before apply (envsubst, below), or use deploy.sh which sets
# the image via gcloud.
#
#   export PROJECT_ID=cleveland-464404-m0
#   envsubst < questionnaire_job/job.yaml | gcloud run jobs replace -
#
# Execute:
#   gcloud run jobs execute tangled-questionnaire \
#     --region=europe-west1 \
#     --project=$PROJECT_ID \
#     --args="--issue-uri,at://did:plc:…/sh.tangled.repo.issue/…"

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: tangled-questionnaire
  labels:
    cloud.googleapis.com/location: europe-west1
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
    spec:
      taskCount: 1
      parallelism: 1
      template:
        spec:
          maxRetries: 1
          timeoutSeconds: 3600
          serviceAccountName: default
          containers:
            - name: questionnaire
              # envsubst only expands $VAR / ${VAR}, so the placeholder needs the ${...} form.
              image: europe-west1-docker.pkg.dev/${PROJECT_ID}/tangled/issue-questionnaire:latest
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
              env:
                - name: DB_CONNECTION_STRING
                  valueFrom:
                    secretKeyRef:
                      name: tangled-db-url
                      key: latest
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: anthropic-api-key
                      key: latest
                - name: ANTHROPIC_QUESTIONNAIRE_MODEL
                  value: claude-opus-4-6
                - name: QUESTIONNAIRE_MIN_TOOL_READS
                  value: "2"
                - name: AGENT_VERBOSE_TOOLS
                  value: "1"
                # Dual-write: also publish each questionnaire to the knot repo.
                - name: QUESTIONNAIRE_PUBLISH_REPO
                  value: "1"
                - name: QUESTIONNAIRE_REPO_GIT_URL
                  value: git@tangled.org:did:plc:vg4msk54xucet6of2rdrgahe
                - name: QUESTIONNAIRE_REPO_DIR
                  value: /tmp/qrepo
                # SSH deploy key (registered with the repo owner's Tangled account),
                # injected as an env var and chmod'd to 0600 at runtime by the publisher.
                - name: QUESTIONNAIRE_SSH_KEY_CONTENTS
                  valueFrom:
                    secretKeyRef:
                      name: tangled-knot-ssh-key
                      key: latest
              # Issue id supplied per execution via --args (see deploy.sh)
              command:
                - python
                - -m
                - questionnaire_job.main
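The manifest comment says the SSH deploy key arrives in `QUESTIONNAIRE_SSH_KEY_CONTENTS` and is chmod'd to 0600 at runtime by the publisher. The publisher itself is not shown in this section, so the following is only a sketch of one way to do that step on a POSIX system; `write_ssh_key`, the temp directory, and the `placeholder-key` fallback are all assumptions for illustration.

```python
import os
import stat
import tempfile


def write_ssh_key(contents: str, path: str) -> str:
    """Write key material to `path`, readable and writable by the owner only."""
    # Create with 0600 from the start so the key is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Key files are expected to end with a newline; add one if it is missing.
        handle.write(contents if contents.endswith("\n") else contents + "\n")
    # If the file already existed, os.open keeps its old mode; tighten it explicitly.
    os.chmod(path, 0o600)
    return path


key_path = write_ssh_key(
    os.environ.get("QUESTIONNAIRE_SSH_KEY_CONTENTS", "placeholder-key"),
    os.path.join(tempfile.mkdtemp(), "id_deploy"),
)
mode = stat.S_IMODE(os.stat(key_path).st_mode)
print(oct(mode))
```

A publisher could then point git at the file through the `GIT_SSH_COMMAND` environment variable (for example `ssh -i <path> -o IdentitiesOnly=yes`), though whether this repo does so is not confirmed here.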
+119
questionnaire_job/main.py
#!/usr/bin/env python3
"""Cloud Run Job entrypoint: generate AI-solve questionnaire for one issue."""

from __future__ import annotations

import argparse
import json
import os
import sys

from dotenv import load_dotenv

from agent.agent import generate_and_save_questionnaire
from agent.load_issue import resolve_issue_uri


def load_env() -> None:
    load_dotenv()


def require_env(name: str) -> str:
    value = os.getenv(name, "").strip()
    if not value:
        print(f"ERROR: {name} is not set", file=sys.stderr)
        raise SystemExit(1)
    return value


def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(
        description="Generate and cache an AI-solve questionnaire for a Tangled issue."
    )
    parser.add_argument(
        "issue_id",
        nargs="?",
        help="Issue at:// URI or rkey (rkey requires DB lookup)",
    )
    parser.add_argument(
        "--issue-uri",
        metavar="URI",
        help="Full at:// issue URI (overrides positional issue_id)",
    )
    parser.add_argument(
        "--issue-id",
        metavar="ID",
        dest="issue_id_flag",
        help="Same as positional: at:// URI or rkey",
    )
    parser.add_argument(
        "--no-file-tree",
        action="store_true",
        help="Skip knot file-tree walk before generation",
    )
    parser.add_argument(
        "--list-tool",
        action="store_true",
        help="Expose list_repo_files in addition to read_repo_file",
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="Print questionnaire JSON only; do not write to Postgres",
    )
    args = parser.parse_args(argv)

    load_env()
    require_env("DB_CONNECTION_STRING")
    require_env("ANTHROPIC_API_KEY")

    raw = (
        args.issue_uri
        or args.issue_id_flag
        or args.issue_id
        or os.getenv("ISSUE_URI", "").strip()
        or os.getenv("ISSUE_ID", "").strip()
    )
    if not raw:
        print(
            "ERROR: provide issue id via arg, --issue-uri, ISSUE_URI, or ISSUE_ID",
            file=sys.stderr,
        )
        raise SystemExit(1)

    issue_uri = resolve_issue_uri(raw)
    print(f"[job] generating questionnaire for {issue_uri}", file=sys.stderr)

    result = generate_and_save_questionnaire(
        issue_uri,
        fetch_file_tree=not args.no_file_tree,
        include_list_tool=args.list_tool,
        thread_id=f"job-{issue_uri.rsplit('/', 1)[-1]}",
        save=not args.no_save,
    )

    if args.no_save and "payload" in result:
        print(json.dumps(result["payload"], indent=2, ensure_ascii=False))
    else:
        print(json.dumps(result, indent=2))
    if args.no_save:
        print(
            f"[job] done (no-save) version={result.get('version')} "
            f"questions={result.get('question_count')}",
            file=sys.stderr,
        )
    else:
        print(
            f"[job] saved version={result.get('version')} "
            f"questions={result.get('question_count')} "
            f"updated_at={result.get('updated_at')}",
            file=sys.stderr,
        )


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.", file=sys.stderr)
        raise SystemExit(130) from None
+14
recommendation/.dockerignore
.env
.env.*
*.pyc
__pycache__/
.venv/
.git/
reference/
eval/
tests/
tangled_recommendation.egg-info/
*.egg-info/
node_modules/
package.json
package-lock.json
+37
recommendation/.env.example
# Connection string for the SHARED Postgres database (required when DATA_STORAGE=sql).
DB_CONNECTION_STRING=postgresql://user:password@host:5432/postgres

# Storage backend: sql (Postgres+pgvector) or git (in-memory numpy+jsonl bundle).
# DATA_STORAGE=sql
# DATA_STORAGE=git
# REC_DATA_GIT_URL=https://github.com/org/tangled-rec-data.git
# REC_DATA_DIR=/tmp/tangled-rec-data
# REC_DATA_GIT_REF=main
# REC_DATA_REFRESH_SEC=0
# REC_DATA_GIT_CLONE_TIMEOUT=120
# REC_DATA_GIT_SSH_KEY=<base64-encoded deploy key for git@ SSH remotes on Cloud Run>

# Google Gemini API key — NOT used by the service. Only needed if you run the Node
# reference embedding scripts in reference/src/ (gemini-embedding-001 @ 1536 dims).
# GEMINI_API_KEY=your-gemini-api-key

# Base URL used to build absolute repo links in responses.
# TANGLED_WEB_BASE=https://tangled.org

# Questionnaire read source. "knot" (default) reads each questionnaire per-issue from
# the knot blob XRPC; "db" reverts to Postgres. In knot mode, set
# QUESTIONNAIRE_DB_FALLBACK=1 to fall back to the DB on a miss during transition.
# QUESTIONNAIRE_SOURCE=knot
# QUESTIONNAIRE_KNOT_HOST=knot1.tangled.sh
# QUESTIONNAIRE_REPO_DID=did:plc:vg4msk54xucet6of2rdrgahe
# QUESTIONNAIRE_KNOT_TIMEOUT=10
# QUESTIONNAIRE_DB_FALLBACK=0

# Recommendation tunables (optional; defaults shown).
# REC_PER_SEED_LIMIT=25
# REC_DISTANCE_FLOOR=0.30
# REC_ISSUE_DISTANCE_FLOOR=0.40
# REC_MIN_README_CHARS=120 # drop near-empty READMEs as seeds + candidates (test repos); 0 disables
# REC_MAX_REPOS=40
# REC_MAX_ISSUES=40
# REC_QUERY_WORKERS=8 # concurrent per-seed kNN queries (DB round-trips dominate latency)
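The tunables at the end of this file are optional, with the defaults shown in the comments. A minimal sketch of loading them with those defaults follows. The variable names and default values come from the file above; the `Tunables` dataclass and `load_tunables` function are hypothetical, and the service's real config loader may look different.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Tunables:
    # Defaults copied from the commented values in .env.example.
    per_seed_limit: int = 25
    distance_floor: float = 0.30
    issue_distance_floor: float = 0.40
    min_readme_chars: int = 120
    max_repos: int = 40
    max_issues: int = 40
    query_workers: int = 8


def _read(env: Mapping[str, str], name: str, cast: Callable[[str], T], fallback: T) -> T:
    """Parse one variable, keeping the fallback when it is unset or blank."""
    raw = env.get(name, "").strip()
    return cast(raw) if raw else fallback


def load_tunables(env: Mapping[str, str]) -> Tunables:
    d = Tunables()
    return Tunables(
        per_seed_limit=_read(env, "REC_PER_SEED_LIMIT", int, d.per_seed_limit),
        distance_floor=_read(env, "REC_DISTANCE_FLOOR", float, d.distance_floor),
        issue_distance_floor=_read(env, "REC_ISSUE_DISTANCE_FLOOR", float, d.issue_distance_floor),
        min_readme_chars=_read(env, "REC_MIN_README_CHARS", int, d.min_readme_chars),
        max_repos=_read(env, "REC_MAX_REPOS", int, d.max_repos),
        max_issues=_read(env, "REC_MAX_ISSUES", int, d.max_issues),
        query_workers=_read(env, "REC_QUERY_WORKERS", int, d.query_workers),
    )


tunables = load_tunables({"REC_MAX_REPOS": "10", "REC_DISTANCE_FLOOR": "0.25"})
print(tunables)
```

Passing the environment as a mapping (for instance `os.environ`) keeps the loader easy to test. Note that this sketch does not strip trailing `# ...` comments from values, so inline comments like those in the example file would need to be removed by whatever reads the `.env`.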
+14
recommendation/.gitignore
node_modules/
dist/
.env
.env.*
!.env.example
*.log
.DS_Store

# Python
.venv/
__pycache__/
*.egg-info/
.pytest_cache/
build/
+111
recommendation/API.md
# Recommendation Engine — HTTP API

Standalone FastAPI service for Tangled repo/issue discovery.

**Storage:** `DATA_STORAGE=sql` (default, Postgres+pgvector) or `DATA_STORAGE=git`
(in-memory numpy+jsonl bundle cloned from `REC_DATA_GIT_URL` at boot). See
`.env.example`.

Endpoints: `/recommendations`, `/questionnaire` (sql only today), `/health`.

Base URL: whatever you deploy to (the Tangled appview points `TANGLED_DISCOVER_ENDPOINT`
at the `/recommendations` path).

---

## `GET /recommendations`

The contract consumed by the Tangled appview. Returns the user's interest chips plus
ranked repo + issue recommendations, with the user's own/collaborated repos and
self-authored issues excluded.

**Query params**

| Param | Required | Notes |
| --- | --- | --- |
| `handle` | yes | The user's Tangled DID, e.g. `did:plc:abc123`. |
| `gh` | no | Connected GitHub username. Accepted but currently ignored (no GitHub data). |

**Response** `200 OK` — see [`schema.md`](../../schema.md) for the authoritative shape. Summary:

```jsonc
{
  "profile": {
    "interests": [{ "label": "nix", "slug": "nix" }], // from the user's repo topics
    "languages": [], // no language signal yet
    "sources": { "tangled": { "repos": 10 } } // github omitted (no data)
  },
  "repos": [{
    "name": "...", "owner": "@handle", "language": "", "description": "...",
    "stars": 0, "openIssues": 3, "lastActive": "<RFC3339>",
    "url": "https://tangled.org/@handle/name",
    "basedOnRepoUrl": "https://tangled.org/@you/your-seed-repo"
  }],
  "issues": [{
    "title": "...", "repo": "handle/name", "owner": "@handle",
    "issueUri": "at://did:plc:…/sh.tangled.repo.issue/3k…",
    "repoDid": "did:plc:...", "rkey": "3k...",
    "url": "https://tangled.org/@handle/name",
    "basedOnRepoUrl": "https://tangled.org/@you/your-seed-repo",
    "repoReadme": "...",
    "labels": [], "comments": 0, "language": "", "lastActive": "<RFC3339>"
  }]
}
```

Notes:
- Empty user → `"repos": []` (the frontend then shows its cold-start view).
- `stars`/`comments`/`language`/`languages` are stubbed (no source in the shared DB yet).
- Issues omit `number` (issue permalink); the frontend resolves it from `(repoDid, rkey)`.
  `url` is the parent repo; `basedOnRepoUrl` is the user's seed repo that surfaced the hit.
- `basedOnRepoUrl` on repos is the same seed attribution (the user's repo whose README
  embedding produced the closest match).

---

## `GET /questionnaire`

Return the cached AI-solve questionnaire JSON for an issue (written by the questionnaire
Cloud Run job). Does not generate on demand — returns `404` if not cached yet.

**Query params**

| Param | Required | Notes |
| --- | --- | --- |
| `issue` | yes* | Full `at://…/sh.tangled.repo.issue/<rkey>` URI, or bare rkey (DB lookup). |
| `issue-uri` | yes* | Alias for `issue`. |

\* Provide one of `issue` or `issue-uri`.

**Response** `200 OK` — questionnaire object (version 2: `introduction`, `items[]`, …).

**Errors**

| Status | When |
| --- | --- |
| `400` | Missing param, invalid URI, or ambiguous rkey |
| `404` | Issue URI valid but no cached questionnaire |

```bash
curl 'localhost:8000/questionnaire?issue=at://did:plc:…/sh.tangled.repo.issue/3lv…'
```

---

## `GET /health`

```jsonc
{ "status": "ok", "db": true }
```

`status` is `"degraded"` with `db:false` (and an `error`) if the database is unreachable.

---

## Conventions

- Timestamps (`lastActive`) are RFC-3339; the frontend humanizes them.
- `owner` carries a leading `@`; repo `url` is absolute.
- Ordering is the engine's call — arrays are returned already ranked, most relevant first.
- Errors: any non-200 (or timeout) makes the appview fall back to its cold-start view; no
  structured error body is required.
+217
recommendation/CLAUDE.md
··· 1 + # CLAUDE.md — Tangled Recommendation Engine 2 + 3 + Context for any Claude session working in this folder. This is a **standalone 4 + Python/FastAPI service** (it will be lifted into its own repo and hosted separately). 5 + Read this top-to-bottom before changing anything. 6 + 7 + --- 8 + 9 + ## 1. What this is 10 + 11 + The recommendation backend for Tangled's **Discover** (contribution-discovery) feature. 12 + Given a user's DID it returns repo + issue recommendations. It reads README/issue 13 + **embeddings** (precomputed by the data teammate) from a shared Postgres + pgvector 14 + database and reranks them. The Tangled web app ("appview", a separate Go service) calls 15 + this over HTTP and renders the results. The service makes **no external API calls** at 16 + runtime — it only reads the DB. 17 + 18 + ``` 19 + Tangled appview ──HTTP(handle,gh)──► THIS service ──► shared Postgres+pgvector (READ-ONLY) 20 + (Go, separate repo) (Python/FastAPI) 21 + ``` 22 + 23 + > Semantic free-text search (`GET /search`) was built then **removed at the user's request** 24 + > (the Discover UI only consumes `/recommendations`). It's easy to re-add: embed the query 25 + > with Gemini (`RETRIEVAL_QUERY`) and run the same kNN/merge/shape pipeline with a single 26 + > "query" seed. The Node `reference/src/issue_search.mjs` shows the approach. 27 + 28 + It was **ported from validated Node scripts** in `reference/src/*.mjs` (the "oracle"): 29 + `similar_repos.mjs` (per-seed kNN + dedup — closest to our model), `issue_experiment.mjs` 30 + (issue→README matching), `embed_readmes.mjs` (Gemini embed + L2-normalize). Consult those 31 + when in doubt about an algorithm detail; they are known-good. 32 + 33 + ## 2. Locked decisions (do not silently reverse) 34 + 35 + - **Standalone Python/FastAPI** service. (Earlier drafts considered Go-in-appview and 36 + Node — both rejected. Don't reintroduce.) 37 + - **Search-per-seed + consensus**, NOT clustering. 
Each of the user's repos is searched 38 + independently; a candidate several seeds agree on ranks higher. (An earlier clustering 39 + approach was intentionally dropped — simpler, no threshold to tune, better explanations.) 40 + - **Consume existing issue embeddings** — the data teammate already ingests + embeds 41 + issues. We do NOT run an issue ingestion pipeline. 42 + - **Contract is fixed** by `schema.md` (in the parent repo root) and the Go client 43 + `appview/state/discover_engine.go`. The wire format carries **no** `pulls`, `reasons`, 44 + `themes`, `score`, or good-first fields. Consensus/distance are used internally for 45 + ranking only — never emitted. 46 + 47 + ## 3. The shared database (READ-ONLY) 48 + 49 + - Postgres + pgvector on Google Cloud SQL (public IP, self-signed cert). Connection string 50 + is in `.env` as `DB_CONNECTION_STRING`; `app/config.py` auto-appends `sslmode=require` 51 + (the psycopg equivalent of the scripts' `rejectUnauthorized:false`). 52 + - **Boundaries:** every existing table is READ-ONLY for us. The only writes we are ever 53 + authorized to make are the embedding columns of `tangled_readmes` 54 + (`embedding`/`embedding_model`/`embedded_at`) and our own `rec` schema (not used yet). 55 + **Never** insert/update/delete anything else. 56 + - **IP authorization:** the DB only accepts authorized IPs. On this machine the IP is 57 + already authorized. On a fresh host: 58 + `gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me)`. 59 + If you can't connect, this is almost always why. (`gcloud` is NOT installed here.) 60 + - The schema is alpha and **moves** — introspect to confirm before relying on a column. 61 + 62 + ### Tables we use (key columns) 63 + - `tangled_readmes` (main repo signal): `repo_did` (pk), `repo_uri`, `owner_handle`, 64 + `repo_name`, `content`, `embedding vector(1536)`, `embedding_model`, `status`. 
The repo 65 + OWNER did is parsed from `repo_uri` = `at://<owner_did>/sh.tangled.repo/<rkey>`. 66 + HNSW index on `embedding` with `vector_cosine_ops` (cosine = the metric). 67 + - `tangled_open_issues` (VIEW, open issues only): `uri`, `rkey`, `repo_did`, `repo_uri`, 68 + `author_did`, `title`, `body`, `issue_created_at`, `embedding vector(1536)`, `record_raw`. 69 + (`tangled_issues` is the all-states table; we use the open view for recommendations.) 70 + - `tangled_repos`: `repo_did`, `owner_did`, `rkey`, `name`, `owner_handle`, 71 + `record_raw` jsonb (has `topics`, `description`, `createdAt`, `repoDid`). 72 + - `tangled_identities`: `did` → `handle` (used for the owner-handle fallback). 73 + - `tangled_user_collaborations` (VIEW): `user_did` → `repo_did` (collab seeds; rare, ~240 rows). 74 + 75 + ### Embeddings (recipe — match EXACTLY if you ever embed anything new) 76 + The service does NOT embed at runtime (it reads precomputed vectors). This recipe is here 77 + for a future embedding catch-up job; the working impl is `reference/src/embed_readmes.mjs`. 78 + - Model `gemini-embedding-001` via Gemini API (`generativelanguage.googleapis.com`), 79 + header `x-goog-api-key = GEMINI_API_KEY`. `outputDimensionality = 1536`. 80 + - `taskType = RETRIEVAL_QUERY` for query text, `RETRIEVAL_DOCUMENT` for stored docs. 81 + - **L2-normalize every vector** (sub-3072 MRL dims aren't auto-unit; the cosine index needs 82 + unit vectors). 83 + - Vectors are passed to SQL as `%s::vector` text literals (`[v1,v2,...]`) and read back via 84 + `embedding::text` — exactly like the reference scripts. No pgvector-python adapter needed. 85 + 86 + ## 4. Algorithm (in `app/recommend.py`) 87 + 88 + 1. **Seeds** = the user's owned (`repo_uri like 'at://<did>/%'`) ∪ collaborated repos that 89 + have an embedded README (`db.load_seeds`). 90 + 2. **Per-seed kNN** over README embeddings, excluding the user's own/collab repo_dids 91 + (`db.knn_repos`, `ORDER BY embedding <=> seed::vector`). 
92 + 3. **Merge** by candidate repo_did, keeping best (min) distance + the list of seeds that 93 + surfaced it = consensus (`app/merge.py`). 94 + 4. **Dedup** forks by md5 of `content[:500]` (`app/dedup.py`); apply a **distance floor**. 95 + 5. **Rerank** (`app/rank.py`): `DefaultScorer` = similarity + consensus + recency 96 + (+ popularity stub), behind a swappable `Scorer` Protocol; plus a **round-robin-across- 97 + seeds** guard so one busy interest can't bury a lone one. 98 + 6. **Issues**: same flow over `tangled_open_issues`, also excluding issues the user authored 99 + and issues in the user's own repos. 100 + 7. **Shape** to the contract (`app/links.py`, `app/profile.py`): interest chips from seed 101 + `record_raw.topics`; `@handle` owners; absolute repo URLs; RFC-3339 timestamps. 102 + 103 + ## 5. File map 104 + 105 + ``` 106 + app/ 107 + main.py FastAPI app + routes (/recommendations, /health) + CORS + startup log 108 + config.py Settings from env/.env (DB conn, web base, tunable knobs); get_settings() 109 + db.py psycopg3 pool + ALL read-only SQL (load_seeds, knn_repos, knn_issues, 110 + open_issue_counts, embedding_counts, ping) 111 + recommend.py orchestration: recommend(did) 112 + merge.py PURE: merge_hits -> consensus candidates 113 + dedup.py PURE: content_hash, collapse_forks 114 + rank.py PURE: Scorer protocol, DefaultScorer, apply_floor, rerank(diversify) 115 + profile.py PURE: build_interests from topics 116 + links.py PURE: slugify, at_owner, repo_url, issue_list_url, to_rfc3339 117 + schemas.py pydantic response models (wire keys match schema.md EXACTLY) 118 + types.py Candidate dataclass 119 + tests/ pytest: unit (pure modules, no DB) + test_integration.py (env-gated) 120 + eval/harness.py offline held-out-seed retrieval: recall@k / nDCG 121 + reference/src/ the validated Node .mjs oracle scripts (+ node_modules has `pg`) 122 + API.md human API docs; README.md run/deploy; Dockerfile; .env / .env.example 123 + ``` 124 + 125 + The pure modules 
(merge/dedup/rank/profile/links/types) have **no DB or network** and are 126 + fully unit-tested — keep them that way so logic changes are testable in isolation. 127 + 128 + ## 6. HTTP API (the contract) 129 + 130 + Authoritative shape: `../../schema.md` (parent repo) and `API.md` here. Summary: 131 + 132 + - `GET /recommendations?handle=<did>&gh=<user>` → `{ profile, repos[], issues[] }`. 133 + `handle` is the user's DID. `gh` is accepted but **ignored** (no GitHub data). No `k` 134 + param — return pre-ranked; the frontend paginates 15/row. Empty user → `repos: []`. 135 + - `GET /health` → `{ status, db }`. 136 + 137 + **Issues are special:** the engine canNOT supply a reliable sequential issue `number`, so it 138 + sends `repoDid` + `rkey` and the **appview resolves** the precise `/issues/N` URL from its 139 + own SQLite `issues` table (falling back to the repo's issue list). This is implemented in the 140 + parent repo: `appview/state/discover_engine.go` (`engineIssue`, `resolveIssueLink`), 141 + `appview/state/discover.go` (passes `s.db`), tested in `discover_engine_test.go`. If you 142 + change the issue wire shape here, update those three files + `schema.md` together. 143 + 144 + ## 7. Data realities / caveats (VERIFIED, not assumed) 145 + 146 + These drive what we can honestly return — re-check with `node`/SQL if data has grown: 147 + - READMEs: ~2,400 embedded (0 unembedded). Open issues: ~2,300 embedded. **Grows daily** — 148 + the service reads it live, so counts rise on their own. 149 + - **Repos are the real deliverable.** Owner handle resolves for ~96% via 150 + `owner_handle` → fallback `repo_uri` owner_did → `tangled_identities`; ~3.5% unresolvable 151 + are dropped. `repo_name` is never null. 152 + - **`stars` = 0, `comments` = 0** — no source (`tangled_backlinks` is empty). Stubbed. 153 + - **`languages` = [], repo `language` = ""** — no language field in the shared DB. 
154 + - **`lastActive`** uses `record_raw.createdAt` (creation, not true last-activity — best 155 + available). Recency ranking uses the same value. 156 + - **Issues are emittable for ~32%** of the corpus (repo identity resolves via `repo_uri`). 157 + Per user (filtered to their interests) that's a handful. The exact issue number 158 + (`record_raw->>'issueId'`) exists for only ~4% in the shared DB → that's why the number is 159 + resolved appview-side, not here. 160 + - Seeds are dominated by **owned** repos; collaborations are rare. 161 + 162 + ## 8. Run / test / deploy 163 + 164 + ```bash 165 + # setup (uv is the toolchain here; python 3.12) 166 + uv venv --python 3.12 .venv 167 + uv pip install --python .venv -e ".[dev]" 168 + 169 + # run 170 + .venv/bin/python -m uvicorn app.main:app --reload --port 8000 171 + curl 'localhost:8000/health' 172 + curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca' # 10-seed sample user 173 + # docs: http://localhost:8000/docs 174 + 175 + # test (unit always; integration auto-runs when DB_CONNECTION_STRING is set) 176 + .venv/bin/python -m pytest tests/ -q 177 + 178 + # offline eval baseline (needs DB) 179 + .venv/bin/python eval/harness.py 180 + 181 + # deploy 182 + docker build -t tangled-rec . && docker run -p 8000:8000 --env-file .env tangled-rec 183 + # then point the appview: TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations 184 + ``` 185 + 186 + Config knobs (env, all optional except `DB_CONNECTION_STRING`, the only secret): see `app/config.py` / 187 + `.env.example` — `TANGLED_WEB_BASE`, `REC_PER_SEED_LIMIT`, `REC_DISTANCE_FLOOR`, 188 + `REC_ISSUE_DISTANCE_FLOOR`, `REC_MAX_REPOS`, `REC_MAX_ISSUES`. 189 + 190 + ## 9. Status & current baseline 191 + 192 + - M0–M4 complete and verified: 23 pytest tests pass (18 pure-unit + 5 live integration incl. 193 + atproto/nix search sanity + own-repo exclusion). Appview Go side compiles + `go test 194 + ./appview/state/` passes. 
195 + - **Eval baseline (before any tuning):** recall@10 ≈ 0.22, recall@20 ≈ 0.23, recall@50 ≈ 196 + 0.37, nDCG ≈ 0.24 over 60 users. Re-run `eval/harness.py` and compare BEFORE/AFTER any 197 + ranking change — no "feels better" merges. 198 + 199 + ## 10. Environment gotchas (this machine) 200 + 201 + - **No `gcloud`, no Go toolchain, no `nix`** installed by default. To verify the Go appview 202 + change, a Go 1.25 tarball was fetched to `/tmp/go` (ephemeral). `go.mod` requires go 1.25. 203 + - The reference `.mjs` scripts need Node's `pg` — it lives in `reference/.../node_modules` / 204 + the folder's `node_modules`. Run them with `DB_CONNECTION_STRING` in env or `.env`. 205 + - The Bash tool's working directory can reset between calls — use absolute paths or `cd` 206 + inside the same command. 207 + - **Secret**: `DB_CONNECTION_STRING` lives in `.env` (gitignored) — the only var the service 208 + needs. (`GEMINI_API_KEY` is only for the Node reference embedding scripts, not the service.) 209 + Never commit secrets or paste them into docs/code. 210 + 211 + ## 11. Do NOT 212 + 213 + - Write to any shared table except the `tangled_readmes` embedding columns (or a `rec` schema). 214 + - Re-add clustering, or emit `pulls`/`reasons`/`themes`/`score`/good-first in the API. 215 + - Hardcode `https://tangled.org` — use `settings.web_base` (`TANGLED_WEB_BASE`). 216 + - Change the issue wire shape without updating the appview Go files + `schema.md` together. 217 + - Fabricate `stars`/`comments`/`language` — they're honest stubs until a data source exists.
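Steps 3 and 4 of the algorithm in section 4 (merge by consensus, fork dedup on the md5 of `content[:500]`, distance floor) can be sketched as below. This is an illustration under stated assumptions, not the code in `app/merge.py`, `app/dedup.py`, or `app/rank.py`: the `Candidate` fields, the function signatures, and the choice to keep the closest candidate per hash are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Candidate:
    repo_did: str
    content: str
    distance: float                                  # best (minimum) cosine distance seen
    seeds: list[str] = field(default_factory=list)   # seeds that surfaced it (consensus)


def merge_hits(hits_by_seed: dict[str, list[dict]]) -> dict[str, Candidate]:
    """Merge per-seed kNN hits by repo_did, keeping the min distance and the seed list."""
    merged: dict[str, Candidate] = {}
    for seed_did, hits in hits_by_seed.items():
        for hit in hits:
            dist = float(hit["distance"])  # the DB may return a Decimal
            cand = merged.get(hit["repo_did"])
            if cand is None:
                merged[hit["repo_did"]] = Candidate(
                    hit["repo_did"], hit.get("content") or "", dist, [seed_did]
                )
            else:
                cand.distance = min(cand.distance, dist)
                if seed_did not in cand.seeds:
                    cand.seeds.append(seed_did)
    return merged


def content_hash(content: str) -> str:
    """Fingerprint a README by the md5 of its first 500 characters."""
    return hashlib.md5(content[:500].encode("utf-8")).hexdigest()


def collapse_forks(cands: list[Candidate]) -> list[Candidate]:
    """Assumption: among candidates sharing a hash, keep the closest one."""
    best: dict[str, Candidate] = {}
    for c in cands:
        key = content_hash(c.content)
        if key not in best or c.distance < best[key].distance:
            best[key] = c
    return list(best.values())


def apply_floor(cands: list[Candidate], floor: float) -> list[Candidate]:
    """Drop candidates whose cosine distance is above the floor."""
    return [c for c in cands if c.distance <= floor]
```

Keeping these helpers free of DB and network access is what lets them be unit-tested with plain dictionaries, as the file map describes for the pure modules.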
+18
recommendation/Dockerfile
··· 1 + FROM python:3.12-slim 2 + 3 + WORKDIR /app 4 + 5 + # git mode clones REC_DATA_GIT_URL at boot (SSH remotes need openssh-client). 6 + RUN apt-get update \ 7 + && apt-get install -y --no-install-recommends git openssh-client ca-certificates \ 8 + && rm -rf /var/lib/apt/lists/* 9 + 10 + COPY pyproject.toml ./ 11 + COPY app ./app 12 + RUN pip install --no-cache-dir . 13 + 14 + EXPOSE 8000 15 + 16 + # PORT lets hosts (Cloud Run, Fly, etc.) inject the listen port. 17 + ENV PORT=8000 18 + CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${PORT}"]
+77
recommendation/README.md
··· 1 + # Tangled Recommendation Engine 2 + 3 + A standalone Python/FastAPI service that powers Tangled's **Discover** feature: given a 4 + user's DID it returns repo + issue recommendations. It reads README/issue **embeddings** 5 + (precomputed with Gemini `gemini-embedding-001`, 1536-dim) from the shared 6 + Postgres + pgvector database and reranks them. 7 + 8 + The Tangled appview connects over HTTP via `TANGLED_DISCOVER_ENDPOINT` → see 9 + [`schema.md`](../../schema.md) for the wire contract and [`API.md`](./API.md) for all endpoints. 10 + 11 + ## How it works 12 + 13 + For each repo the user already works on (owned ∪ collaborations), we run an independent 14 + pgvector kNN over README embeddings. Results are merged — a candidate several of the user's 15 + repos point at ranks higher (consensus) — then deduped (forks), floored, and reranked 16 + (similarity + consensus + recency) with a round-robin guard so one busy interest can't bury 17 + the others. The user's own work is excluded. Issues use the same flow over open-issue 18 + embeddings. 19 + 20 + ## Layout 21 + 22 + ``` 23 + app/ FastAPI app + config + db + the pipeline stages 24 + merge.py dedup.py rank.py profile.py links.py # pure, unit-tested 25 + db.py recommend.py schemas.py main.py 26 + tests/ pytest unit tests (+ env-gated integration) 27 + eval/ offline hold-out eval harness (recall@k / nDCG) 28 + reference/ the validated Node .mjs scripts this engine was ported from (oracle) 29 + ``` 30 + 31 + ## Configuration (env / `.env`) 32 + 33 + | Var | Required | Default | Notes | 34 + | --- | --- | --- | --- | 35 + | `DB_CONNECTION_STRING` | yes | — | Shared Postgres. `sslmode=require` is added automatically. | 36 + | `TANGLED_WEB_BASE` | no | `https://tangled.org` | Base for generated repo URLs. | 37 + | `REC_PER_SEED_LIMIT` | no | `25` | kNN neighbours per seed. | 38 + | `REC_DISTANCE_FLOOR` | no | `0.30` | Drop repo matches above this cosine distance. 
| 39 + | `REC_ISSUE_DISTANCE_FLOOR` | no | `0.40` | Floor for issue matches. | 40 + | `REC_MIN_README_CHARS` | no | `120` | Drop near-empty READMEs as seeds + candidates (filters test/throwaway repos). `0` disables. | 41 + | `REC_QUERY_WORKERS` | no | `8` | Concurrent per-seed kNN queries. The DB is remote, so this cuts request latency. | 42 + | `REC_MAX_REPOS` / `REC_MAX_ISSUES` | no | `40` | Caps per section. | 43 + 44 + ## Run locally 45 + 46 + ```bash 47 + uv venv --python 3.12 .venv 48 + uv pip install --python .venv -e ".[dev]" 49 + .venv/bin/python -m uvicorn app.main:app --reload --port 8000 50 + # smoke 51 + curl 'localhost:8000/health' 52 + curl 'localhost:8000/recommendations?handle=did:plc:y7g2koy4nqw7434s67fgfjca' 53 + # interactive docs: http://localhost:8000/docs 54 + ``` 55 + 56 + ## Test 57 + 58 + ```bash 59 + .venv/bin/python -m pytest tests/ # unit (no DB) 60 + RUN_DB_TESTS=1 .venv/bin/python -m pytest tests/ # + integration (needs DB_CONNECTION_STRING) 61 + .venv/bin/python eval/harness.py # offline recall@k / nDCG 62 + ``` 63 + 64 + ## Deploy 65 + 66 + ```bash 67 + docker build -t tangled-rec . 68 + docker run -p 8000:8000 --env-file .env tangled-rec 69 + ``` 70 + 71 + **Cloud SQL access:** the shared DB only accepts authorized IPs. Add the host's egress IP: 72 + 73 + ```bash 74 + gcloud sql instances patch <instance> --authorized-networks=$(curl -s ifconfig.me) 75 + ``` 76 + 77 + Point the appview at the deployment: `TANGLED_DISCOVER_ENDPOINT=https://<host>/recommendations`.
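The rerank step described under "How it works" (similarity + consensus + recency, with a round-robin guard across seeds) could look roughly like the sketch below. The weights, the `Scored` type, and the interleaving strategy are assumptions made for illustration; the real scorer lives in `app/rank.py` and may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Scored:
    repo_did: str
    distance: float          # best cosine distance (lower is closer)
    seeds: tuple[str, ...]   # seeds that surfaced it; first entry = attributed seed
    age_days: float


def score(c: Scored) -> float:
    # Hypothetical weights: similarity dominates, consensus and recency only nudge.
    similarity = 1.0 - c.distance
    consensus = 0.05 * (len(c.seeds) - 1)
    recency = 0.05 / (1.0 + c.age_days / 365.0)
    return similarity + consensus + recency


def rerank(cands: list[Scored], limit: int) -> list[Scored]:
    """Sort within each seed's bucket, then interleave buckets round-robin."""
    by_seed: dict[str, list[Scored]] = defaultdict(list)
    for c in cands:
        by_seed[c.seeds[0]].append(c)
    for bucket in by_seed.values():
        bucket.sort(key=score, reverse=True)
    # Strongest interest goes first in each round.
    buckets = sorted(by_seed.values(), key=lambda b: score(b[0]), reverse=True)
    out: list[Scored] = []
    for row in zip_longest(*buckets):
        out.extend(c for c in row if c is not None)
    return out[:limit]
```

With three close matches from one seed and a single weaker match from another, a pure score sort would push the lone match to fourth place; the interleave places it second, which is the "one busy interest can't bury the others" behaviour.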
+1
recommendation/app/__init__.py
··· 1 + """Tangled recommendation engine (standalone FastAPI service)."""
+103
recommendation/app/config.py
··· 1 + """Configuration loaded from the environment (and a local .env file). 2 + 3 + All knobs live here so the rest of the service stays pure/testable. The only 4 + secret is DB_CONNECTION_STRING (reused from the existing recommendation/.env); 5 + the service makes no external API calls at runtime. 6 + """ 7 + 8 + from __future__ import annotations 9 + 10 + import os 11 + from dataclasses import dataclass 12 + from functools import lru_cache 13 + 14 + from dotenv import load_dotenv 15 + 16 + # Load ./.env (the folder root) if present; real env vars take precedence. 17 + load_dotenv() 18 + 19 + 20 + def _ensure_sslmode(conn: str) -> str: 21 + """Cloud SQL public IP uses a self-signed cert. `sslmode=require` encrypts 22 + without verifying the chain — the Python equivalent of the reference scripts' 23 + ssl: { rejectUnauthorized: false }.""" 24 + if not conn: 25 + return conn 26 + if "sslmode=" in conn: 27 + return conn 28 + sep = "&" if "?" in conn else "?" 29 + return f"{conn}{sep}sslmode=require" 30 + 31 + 32 + @dataclass(frozen=True) 33 + class Settings: 34 + # --- storage backend --- 35 + data_storage: str = "sql" # "sql" | "git" 36 + data_dir: str = "/tmp/tangled-rec-data" 37 + data_git_url: str = "" 38 + data_git_ref: str = "" 39 + data_refresh_sec: int = 0 # 0 = load once at boot only 40 + data_git_clone_timeout: int = 120 41 + data_git_ssh_key_b64: str = "" # optional deploy key for git@ SSH remotes 42 + 43 + # --- connection (sql mode) --- 44 + db_connection_string: str = "" 45 + 46 + # --- link building --- 47 + web_base: str = "https://tangled.org" 48 + 49 + # --- recommendation tunables --- 50 + per_seed_limit: int = 25 # kNN neighbours fetched per seed repo 51 + distance_floor: float = 0.30 # drop repo candidates above this cosine distance 52 + issue_distance_floor: float = 0.40 # README-seed -> issue distances run higher 53 + min_readme_chars: int = 120 # drop near-empty READMEs as seeds AND candidates 54 + # (test/throwaway repos embed to a generic 
vector 55 + # that's trivially "similar" to other empty READMEs) 56 + max_repos: int = 40 # cap on returned repos (frontend paginates 15/row) 57 + max_issues: int = 40 58 + max_interests: int = 8 # interest chips derived from seed topics 59 + query_workers: int = 8 # concurrent per-seed kNN queries (DB is remote/slow) 60 + 61 + # --- questionnaire read source --- 62 + # The knot is the source of truth: questionnaires are read per-issue from the 63 + # knot-hosted repo (one blob fetch), not Postgres. "db" reverts to the old path. 64 + questionnaire_source: str = "knot" # "knot" | "db" 65 + questionnaire_knot_host: str = "knot1.tangled.sh" 66 + questionnaire_repo_did: str = "did:plc:vg4msk54xucet6of2rdrgahe" 67 + questionnaire_knot_timeout: float = 10.0 68 + questionnaire_db_fallback: bool = False # in knot mode, fall back to DB on miss 69 + 70 + 71 + @lru_cache(maxsize=1) 72 + def get_settings() -> Settings: 73 + conn = os.environ.get("DB_CONNECTION_STRING", "") 74 + storage = os.environ.get("DATA_STORAGE", "sql").strip().lower() 75 + if storage not in ("sql", "git"): 76 + raise ValueError(f"DATA_STORAGE must be 'sql' or 'git', got {storage!r}") 77 + return Settings( 78 + data_storage=storage, 79 + data_dir=os.environ.get("REC_DATA_DIR", "/tmp/tangled-rec-data"), 80 + data_git_url=os.environ.get("REC_DATA_GIT_URL", "").strip(), 81 + data_git_ref=os.environ.get("REC_DATA_GIT_REF", "").strip(), 82 + data_refresh_sec=int(os.environ.get("REC_DATA_REFRESH_SEC", "0")), 83 + data_git_clone_timeout=int(os.environ.get("REC_DATA_GIT_CLONE_TIMEOUT", "120")), 84 + data_git_ssh_key_b64=os.environ.get("REC_DATA_GIT_SSH_KEY", "").strip(), 85 + db_connection_string=_ensure_sslmode(conn), 86 + web_base=os.environ.get("TANGLED_WEB_BASE", "https://tangled.org").rstrip("/"), 87 + per_seed_limit=int(os.environ.get("REC_PER_SEED_LIMIT", "25")), 88 + distance_floor=float(os.environ.get("REC_DISTANCE_FLOOR", "0.30")), 89 + 
issue_distance_floor=float(os.environ.get("REC_ISSUE_DISTANCE_FLOOR", "0.40")), 90 + min_readme_chars=int(os.environ.get("REC_MIN_README_CHARS", "120")), 91 + max_repos=int(os.environ.get("REC_MAX_REPOS", "40")), 92 + max_issues=int(os.environ.get("REC_MAX_ISSUES", "40")), 93 + max_interests=int(os.environ.get("REC_MAX_INTERESTS", "8")), 94 + query_workers=int(os.environ.get("REC_QUERY_WORKERS", "8")), 95 + questionnaire_source=os.environ.get("QUESTIONNAIRE_SOURCE", "knot").strip().lower(), 96 + questionnaire_knot_host=os.environ.get("QUESTIONNAIRE_KNOT_HOST", "knot1.tangled.sh").strip(), 97 + questionnaire_repo_did=os.environ.get( 98 + "QUESTIONNAIRE_REPO_DID", "did:plc:vg4msk54xucet6of2rdrgahe" 99 + ).strip(), 100 + questionnaire_knot_timeout=float(os.environ.get("QUESTIONNAIRE_KNOT_TIMEOUT", "10")), 101 + questionnaire_db_fallback=os.environ.get("QUESTIONNAIRE_DB_FALLBACK", "").strip().lower() 102 + in ("1", "true", "yes"), 103 + )
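To make the behaviour of `_ensure_sslmode` in `app/config.py` concrete, the snippet below repeats its logic as a standalone function (renamed `ensure_sslmode` here so it runs without the package) and applies it to a few placeholder connection strings. The connection strings are made up for illustration.

```python
def ensure_sslmode(conn: str) -> str:
    """Append sslmode=require unless the string is empty or already sets sslmode."""
    if not conn or "sslmode=" in conn:
        return conn
    sep = "&" if "?" in conn else "?"
    return f"{conn}{sep}sslmode=require"


examples = [
    "",                                                     # unset: left empty
    "postgresql://user:password@host:5432/postgres",        # no query string yet
    "postgresql://user:password@host/postgres?connect_timeout=5",
    "postgresql://user:password@host/postgres?sslmode=verify-full",  # left as-is
]
for conn in examples:
    print(repr(ensure_sslmode(conn)))
```

An explicit `sslmode` in `DB_CONNECTION_STRING` therefore always wins, which is the way to opt into a stricter mode such as `verify-full`.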
+290
recommendation/app/db.py
··· 1 + """Read-only data access over the shared Postgres + pgvector database. 2 + 3 + Boundaries (per the project brief): every table here is READ-ONLY. The only 4 + writes this service is ever authorized to make are the embedding columns of 5 + tangled_readmes and its own `rec` schema — neither happens in this module. 6 + 7 + Vectors are passed as `%s::vector` text literals and read back via 8 + `embedding::text`, exactly like the validated reference scripts. 9 + """ 10 + 11 + from __future__ import annotations 12 + 13 + from functools import lru_cache 14 + 15 + from psycopg.rows import dict_row 16 + from psycopg_pool import ConnectionPool 17 + 18 + from app.config import get_settings 19 + 20 + 21 + def _git_store(): 22 + if get_settings().data_storage == "git": 23 + from app.git_store import get_git_store 24 + 25 + return get_git_store() 26 + return None 27 + 28 + 29 + @lru_cache(maxsize=1) 30 + def get_pool() -> ConnectionPool: 31 + s = get_settings() 32 + if s.data_storage == "git": 33 + raise RuntimeError("SQL pool unavailable in DATA_STORAGE=git mode") 34 + if not s.db_connection_string: 35 + raise RuntimeError("DB_CONNECTION_STRING is not set") 36 + pool = ConnectionPool( 37 + conninfo=s.db_connection_string, 38 + min_size=1, 39 + # Enough connections for the concurrent per-seed kNN fan-out, plus headroom 40 + # for health/startup probes. 
41 + max_size=max(5, s.query_workers + 2), 42 + kwargs={"row_factory": dict_row}, 43 + open=True, 44 + ) 45 + return pool 46 + 47 + 48 + def ping() -> bool: 49 + if get_settings().data_storage == "git": 50 + from app.git_store import is_ready 51 + 52 + return is_ready() 53 + with get_pool().connection() as conn: 54 + row = conn.execute("select 1 as ok").fetchone() 55 + return bool(row and row["ok"] == 1) 56 + 57 + 58 + def embedding_counts() -> dict: 59 + """Coverage snapshot — used by /health and logged at startup.""" 60 + store = _git_store() 61 + if store: 62 + return store.embedding_counts() 63 + sql = """ 64 + select 65 + (select count(*) from tangled_readmes where embedding is not null) as readmes_embedded, 66 + (select count(*) from tangled_open_issues where embedding is not null) as open_issues_embedded, 67 + (select count(distinct split_part(replace(repo_uri,'at://',''),'/',1)) 68 + from tangled_readmes where embedding is not null and repo_uri is not null) as addressable_users 69 + """ 70 + with get_pool().connection() as conn: 71 + return dict(conn.execute(sql).fetchone()) 72 + 73 + 74 + # --- recommendation data access ------------------------------------------------- 75 + 76 + # A user's seeds: repos they own (repo_uri encodes the owner DID) UNION repos 77 + # they collaborate on. Both must have an embedded README to be useful as a seed. 78 + # A near-empty README (`< min_chars`) is filtered out: it embeds to a generic 79 + # vector that pulls in unrelated near-empty repos, so it's a poor seed. 
80 + _SEEDS_SQL = """ 81 + select r.repo_did, 82 + r.repo_name, 83 + r.content, 84 + r.embedding::text as etext, 85 + tr.record_raw->'topics' as topics, 86 + coalesce(r.owner_handle, ti.handle) as owner_handle 87 + from tangled_readmes r 88 + left join tangled_repos tr 89 + on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did 90 + left join tangled_identities ti 91 + on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1) 92 + where r.embedding is not null 93 + and length(trim(coalesce(r.content, ''))) >= %(min_chars)s 94 + and ( r.repo_uri like 'at://' || %(did)s || '/%%' 95 + or r.repo_did in ( 96 + select repo_did from tangled_user_collaborations where user_did = %(did)s 97 + ) ) 98 + """ 99 + 100 + # Per-seed kNN over README embeddings. Owner handle resolves via the readmes 101 + # column first, then a fallback to tangled_identities keyed on the owner DID 102 + # parsed out of repo_uri. Excludes the user's own/collab repos and near-empty 103 + # READMEs (`< min_chars`) — those are throwaway/test repos we shouldn't surface. 
104 + _KNN_REPOS_SQL = """ 105 + select r.repo_did, 106 + r.repo_name, 107 + r.content, 108 + r.repo_uri, 109 + coalesce(r.owner_handle, ti.handle) as owner_handle, 110 + tr.record_raw->>'description' as description, 111 + tr.record_raw->'topics' as topics, 112 + tr.record_raw->>'createdAt' as created_at, 113 + round((r.embedding <=> %(vec)s::vector)::numeric, 4) as distance 114 + from tangled_readmes r 115 + left join tangled_repos tr 116 + on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did 117 + left join tangled_identities ti 118 + on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1) 119 + where r.embedding is not null 120 + and length(trim(coalesce(r.content, ''))) >= %(min_chars)s 121 + and not (r.repo_did = any(%(exclude)s)) 122 + order by r.embedding <=> %(vec)s::vector 123 + limit %(limit)s 124 + """ 125 + 126 + _OPEN_ISSUE_COUNTS_SQL = """ 127 + select repo_did, count(*)::int as n 128 + from tangled_open_issues 129 + where repo_did = any(%(dids)s) 130 + group by repo_did 131 + """ 132 + 133 + # Per-seed kNN over OPEN issue embeddings (same vector space as READMEs). 134 + # Repo identity is resolved through the issue's repo_uri: owner handle via 135 + # tangled_identities, repo name via (owner_did, rkey) -> tangled_repos. Excludes 136 + # issues the user authored and issues in the user's own/collab repos. We only 137 + # keep issues whose repo identity fully resolves (handle + name) so the appview 138 + # can build a valid link. 
139 + _KNN_ISSUES_SQL = """ 140 + select i.uri, 141 + i.rkey, 142 + i.repo_did, 143 + i.title, 144 + i.body as content, 145 + i.author_did, 146 + i.issue_created_at as created_at, 147 + ti.handle as owner_handle, 148 + tr.name as repo_name, 149 + tr.record_raw->>'description' as repo_description, 150 + rm.content as repo_readme, 151 + round((i.embedding <=> %(vec)s::vector)::numeric, 4) as distance 152 + from tangled_open_issues i 153 + join tangled_identities ti 154 + on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1) 155 + join tangled_repos tr 156 + on tr.owner_did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1) 157 + and tr.rkey = split_part(i.repo_uri, '/', 5) 158 + left join tangled_readmes rm 159 + on rm.repo_did = i.repo_did 160 + and rm.status = 'found' 161 + where i.embedding is not null 162 + and i.repo_uri is not null 163 + and ti.handle is not null 164 + and tr.name is not null 165 + and i.author_did <> %(author)s 166 + and not (i.repo_did = any(%(exclude)s)) 167 + order by i.embedding <=> %(vec)s::vector 168 + limit %(limit)s 169 + """ 170 + 171 + 172 + def knn_issues(vec_text: str, exclude_dids: list[str], author_did: str, limit: int) -> list[dict]: 173 + store = _git_store() 174 + if store: 175 + return store.knn_issues(vec_text, exclude_dids, author_did, limit) 176 + params = {"vec": vec_text, "exclude": exclude_dids, "author": author_did, "limit": limit} 177 + with get_pool().connection() as conn: 178 + return [dict(r) for r in conn.execute(_KNN_ISSUES_SQL, params).fetchall()] 179 + 180 + 181 + def load_seeds(did: str, min_chars: int = 0) -> list[dict]: 182 + store = _git_store() 183 + if store: 184 + return store.load_seeds(did, min_chars) 185 + params = {"did": did, "min_chars": min_chars} 186 + with get_pool().connection() as conn: 187 + return [dict(r) for r in conn.execute(_SEEDS_SQL, params).fetchall()] 188 + 189 + 190 + def knn_repos(vec_text: str, exclude_dids: list[str], limit: int, min_chars: int = 0) -> list[dict]: 
191 + store = _git_store() 192 + if store: 193 + return store.knn_repos(vec_text, exclude_dids, limit, min_chars) 194 + params = {"vec": vec_text, "exclude": exclude_dids, "limit": limit, "min_chars": min_chars} 195 + with get_pool().connection() as conn: 196 + return [dict(r) for r in conn.execute(_KNN_REPOS_SQL, params).fetchall()] 197 + 198 + 199 + def open_issue_counts(repo_dids: list[str]) -> dict[str, int]: 200 + store = _git_store() 201 + if store: 202 + return store.open_issue_counts(repo_dids) 203 + if not repo_dids: 204 + return {} 205 + with get_pool().connection() as conn: 206 + rows = conn.execute(_OPEN_ISSUE_COUNTS_SQL, {"dids": repo_dids}).fetchall() 207 + return {r["repo_did"]: r["n"] for r in rows} 208 + 209 + 210 + # --- questionnaires (read-only cache written by the AI-solve job) -------------- 211 + 212 + _RESOLVE_ISSUE_URI_SQL = """ 213 + select uri 214 + from tangled_issues 215 + where rkey = %s 216 + order by fetched_at desc 217 + """ 218 + 219 + _GET_QUESTIONNAIRE_SQL = """ 220 + select issue_uri, payload, created_at, updated_at 221 + from tangled_issue_questionnaires 222 + where issue_uri = %s 223 + """ 224 + 225 + 226 + def resolve_issue_uri(issue_id: str) -> str: 227 + """Resolve a full ``at://`` URI or a per-repo issue rkey.""" 228 + store = _git_store() 229 + if store: 230 + return store.resolve_issue_uri(issue_id) 231 + raw = issue_id.strip() 232 + if raw.startswith("at://"): 233 + return raw 234 + 235 + with get_pool().connection() as conn: 236 + rows = conn.execute(_RESOLVE_ISSUE_URI_SQL, (raw,)).fetchall() 237 + 238 + if not rows: 239 + raise ValueError( 240 + f"No issue with rkey {raw!r} in tangled_issues — pass full at:// URI" 241 + ) 242 + if len(rows) > 1: 243 + uris = [r["uri"] for r in rows[:5]] 244 + raise ValueError( 245 + f"Ambiguous rkey {raw!r} ({len(rows)} issues). " 246 + f"Pass full at:// URI. 
Examples: {uris}" 247 + ) 248 + return rows[0]["uri"] 249 + 250 + 251 + _QUESTIONNAIRES_PRESENT_SQL = """ 252 + select issue_uri 253 + from tangled_issue_questionnaires 254 + where issue_uri = any(%(uris)s) 255 + """ 256 + 257 + 258 + def questionnaires_present(issue_uris: list[str]) -> set[str]: 259 + """Of the given issue URIs, which have a questionnaire. Used to set the 260 + `hasQuestionnaire` hint on recommended issues. 261 + 262 + One batched query — fast existence check off the dual-written DB cache (the 263 + questionnaire *content* is still read from the knot, the source of truth). In 264 + git mode the DB isn't available, so the flag is reported False (no index yet).""" 265 + if not issue_uris or _git_store(): 266 + return set() 267 + with get_pool().connection() as conn: 268 + rows = conn.execute(_QUESTIONNAIRES_PRESENT_SQL, {"uris": list(issue_uris)}).fetchall() 269 + return {r["issue_uri"] for r in rows} 270 + 271 + 272 + def get_questionnaire(issue_uri: str) -> dict | None: 273 + """Load cached questionnaire row, or None if not generated yet.""" 274 + if _git_store(): 275 + return None 276 + import json 277 + 278 + with get_pool().connection() as conn: 279 + row = conn.execute(_GET_QUESTIONNAIRE_SQL, (issue_uri,)).fetchone() 280 + if not row: 281 + return None 282 + payload = row["payload"] 283 + if isinstance(payload, str): 284 + payload = json.loads(payload) 285 + return { 286 + "issue_uri": row["issue_uri"], 287 + "payload": payload, 288 + "created_at": row["created_at"], 289 + "updated_at": row["updated_at"], 290 + }
+30
recommendation/app/dedup.py
"""Content-based dedup: collapse fork READMEs that share identical text."""

from __future__ import annotations

from hashlib import md5

from app.types import Candidate


def content_hash(content: str | None = None, *, content_sha500: str | None = None) -> str:
    if content_sha500:
        return content_sha500
    return md5((content or "")[:500].encode("utf-8")).hexdigest()


def row_content_hash(row: dict) -> str:
    sha = row.get("content_sha500")
    if isinstance(sha, str) and sha:
        return sha
    return content_hash(row.get("content"))


def collapse_forks(candidates: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per content_hash: the one with the smallest distance."""
    best: dict[str, Candidate] = {}
    for c in candidates:
        prev = best.get(c.content_hash)
        if prev is None or c.distance < prev.distance:
            best[c.content_hash] = c
    return list(best.values())
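A small usage sketch of the dedup step in `dedup.py`. The `Candidate` class here is a stand-in dataclass carrying only the fields `collapse_forks` reads (the real `app.types.Candidate` is not part of this file), and the keys, README text, and distances are made-up sample values.

```python
from __future__ import annotations

from dataclasses import dataclass
from hashlib import md5


@dataclass
class Candidate:
    # Hypothetical stand-in for app.types.Candidate: only the fields used here.
    key: str
    content_hash: str
    distance: float


def content_hash(content: str | None) -> str:
    # Same rule as dedup.py: hash only the first 500 characters of the text.
    return md5((content or "")[:500].encode("utf-8")).hexdigest()


def collapse_forks(candidates: list[Candidate]) -> list[Candidate]:
    # Keep the closest candidate for each distinct content hash.
    best: dict[str, Candidate] = {}
    for c in candidates:
        prev = best.get(c.content_hash)
        if prev is None or c.distance < prev.distance:
            best[c.content_hash] = c
    return list(best.values())


readme = "# widget\nA tiny widget library."
cands = [
    Candidate("did:plc:upstream", content_hash(readme), 0.31),
    Candidate("did:plc:fork", content_hash(readme), 0.27),  # identical README
    Candidate("did:plc:other", content_hash("# something else"), 0.40),
]
kept = collapse_forks(cands)
kept_keys = sorted(c.key for c in kept)
```

Because only the first 500 characters are hashed, two READMEs that diverge later in the text would still collapse into one candidate; that appears to be a deliberate trade-off for catching lightly edited forks.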
+383
recommendation/app/git_store.py
··· 1 + """In-memory recommendation index loaded from git-shipped numpy + jsonl bundles. 2 + 3 + Contract (frozen): 4 + data/repos.f32.npy, data/repos.jsonl 5 + data/issues.f32.npy, data/issues.jsonl 6 + manifest.json 7 + """ 8 + 9 + from __future__ import annotations 10 + 11 + import json 12 + import logging 13 + import subprocess 14 + import threading 15 + from dataclasses import dataclass 16 + from pathlib import Path 17 + from typing import Any 18 + 19 + import numpy as np 20 + 21 + from app.config import Settings 22 + from app.vectors import parse_vector_text, vector_to_text 23 + 24 + log = logging.getLogger("rec.git") 25 + 26 + _store: GitDataStore | None = None 27 + _load_error: str | None = None 28 + _loading = False 29 + _reload_lock = threading.Lock() 30 + 31 + 32 + @dataclass 33 + class GitDataStore: 34 + manifest: dict[str, Any] 35 + repo_vectors: np.ndarray 36 + repo_meta: list[dict[str, Any]] 37 + issue_vectors: np.ndarray 38 + issue_meta: list[dict[str, Any]] 39 + repo_row_by_did: dict[str, int] 40 + issue_uri_by_rkey: dict[str, list[str]] 41 + issue_count_by_repo_did: dict[str, int] 42 + owner_did_by_repo_did: dict[str, str] 43 + 44 + @classmethod 45 + def load_from_dir(cls, data_root: Path) -> GitDataStore: 46 + data_dir = data_root / "data" 47 + manifest_path = data_root / "manifest.json" 48 + if not manifest_path.exists(): 49 + raise FileNotFoundError(f"manifest.json not found under {data_root}") 50 + 51 + manifest = json.loads(manifest_path.read_text(encoding="utf-8")) 52 + repo_vectors = np.load(data_dir / "repos.f32.npy").astype(np.float32, copy=False) 53 + issue_vectors = np.load(data_dir / "issues.f32.npy").astype(np.float32, copy=False) 54 + repo_meta = _read_jsonl(data_dir / "repos.jsonl") 55 + issue_meta = _read_jsonl(data_dir / "issues.jsonl") 56 + 57 + if len(repo_meta) != repo_vectors.shape[0]: 58 + raise ValueError( 59 + f"repos.jsonl rows ({len(repo_meta)}) != repos matrix " 60 + f"({repo_vectors.shape[0]})" 61 + ) 62 + if 
len(issue_meta) != issue_vectors.shape[0]: 63 + raise ValueError( 64 + f"issues.jsonl rows ({len(issue_meta)}) != issues matrix " 65 + f"({issue_vectors.shape[0]})" 66 + ) 67 + 68 + repo_row_by_did = {} 69 + owner_did_by_repo_did = {} 70 + for i, row in enumerate(repo_meta): 71 + did = row.get("repo_did") 72 + if isinstance(did, str) and did: 73 + repo_row_by_did[did] = i 74 + uri = row.get("subject_uri") or row.get("repo_uri") or "" 75 + if isinstance(uri, str) and uri.startswith("at://"): 76 + owner = uri.removeprefix("at://").split("/", 1)[0] 77 + if did: 78 + owner_did_by_repo_did[did] = owner 79 + 80 + issue_uri_by_rkey: dict[str, list[str]] = {} 81 + issue_count_by_repo_did: dict[str, int] = {} 82 + for row in issue_meta: 83 + uri = row.get("subject_uri") or row.get("uri") or "" 84 + rkey = row.get("rkey") 85 + if isinstance(rkey, str) and isinstance(uri, str): 86 + issue_uri_by_rkey.setdefault(rkey, []).append(uri) 87 + repo_did = row.get("repo_did") 88 + if isinstance(repo_did, str) and repo_did: 89 + issue_count_by_repo_did[repo_did] = issue_count_by_repo_did.get(repo_did, 0) + 1 90 + 91 + log.info( 92 + "git store loaded: repos=%s issues=%s dim=%s metric=%s", 93 + len(repo_meta), 94 + len(issue_meta), 95 + manifest.get("dim"), 96 + manifest.get("metric"), 97 + ) 98 + return cls( 99 + manifest=manifest, 100 + repo_vectors=repo_vectors, 101 + repo_meta=repo_meta, 102 + issue_vectors=issue_vectors, 103 + issue_meta=issue_meta, 104 + repo_row_by_did=repo_row_by_did, 105 + issue_uri_by_rkey=issue_uri_by_rkey, 106 + issue_count_by_repo_did=issue_count_by_repo_did, 107 + owner_did_by_repo_did=owner_did_by_repo_did, 108 + ) 109 + 110 + def embedding_counts(self) -> dict[str, int]: 111 + return { 112 + "readmes_embedded": len(self.repo_meta), 113 + "open_issues_embedded": len(self.issue_meta), 114 + "addressable_users": len( 115 + {d for d in self.owner_did_by_repo_did.values() if d} 116 + ), 117 + } 118 + 119 + def load_seeds(self, did: str, min_chars: int = 0) 
-> list[dict]: 120 + seeds: list[dict] = [] 121 + for i, row in enumerate(self.repo_meta): 122 + uri = row.get("subject_uri") or "" 123 + owner = self.owner_did_by_repo_did.get(row.get("repo_did", ""), "") 124 + if owner != did and not ( 125 + isinstance(uri, str) and uri.startswith(f"at://{did}/") 126 + ): 127 + continue 128 + content_len = int(row.get("content_len") or 0) 129 + if content_len < min_chars: 130 + continue 131 + vec = self.repo_vectors[i] 132 + seeds.append( 133 + { 134 + "repo_did": row["repo_did"], 135 + "repo_name": row.get("repo_name") or "", 136 + "content": "", 137 + "content_sha500": row.get("content_sha500") or "", 138 + "etext": vector_to_text(vec), 139 + "topics": row.get("topics"), 140 + "owner_handle": row.get("owner_handle") or "", 141 + } 142 + ) 143 + return seeds 144 + 145 + def knn_repos( 146 + self, 147 + vec_text: str, 148 + exclude_dids: list[str], 149 + limit: int, 150 + min_chars: int = 0, 151 + ) -> list[dict]: 152 + q = parse_vector_text(vec_text) 153 + exclude = set(exclude_dids) 154 + scores = self.repo_vectors @ q 155 + distances = 1.0 - scores 156 + candidates: list[tuple[int, float]] = [] 157 + for i, row in enumerate(self.repo_meta): 158 + repo_did = row.get("repo_did") 159 + if not repo_did or repo_did in exclude: 160 + continue 161 + if int(row.get("content_len") or 0) < min_chars: 162 + continue 163 + candidates.append((i, float(distances[i]))) 164 + candidates.sort(key=lambda t: t[1]) 165 + return [ 166 + self._repo_hit(self.repo_meta[i], dist) 167 + for i, dist in candidates[:limit] 168 + ] 169 + 170 + def knn_issues( 171 + self, 172 + vec_text: str, 173 + exclude_dids: list[str], 174 + author_did: str, 175 + limit: int, 176 + ) -> list[dict]: 177 + q = parse_vector_text(vec_text) 178 + exclude = set(exclude_dids) 179 + scores = self.issue_vectors @ q 180 + distances = 1.0 - scores 181 + candidates: list[tuple[int, float]] = [] 182 + for i, row in enumerate(self.issue_meta): 183 + repo_did = row.get("repo_did") 184 
+ if not repo_did or repo_did in exclude: 185 + continue 186 + if row.get("author_did") == author_did: 187 + continue 188 + if not row.get("owner_handle") or not row.get("repo_name"): 189 + continue 190 + candidates.append((i, float(distances[i]))) 191 + candidates.sort(key=lambda t: t[1]) 192 + return [ 193 + self._issue_hit(self.issue_meta[i], dist) 194 + for i, dist in candidates[:limit] 195 + ] 196 + 197 + def open_issue_counts(self, repo_dids: list[str]) -> dict[str, int]: 198 + return {d: self.issue_count_by_repo_did.get(d, 0) for d in repo_dids} 199 + 200 + def resolve_issue_uri(self, issue_id: str) -> str: 201 + raw = issue_id.strip() 202 + if raw.startswith("at://"): 203 + return raw 204 + matches = self.issue_uri_by_rkey.get(raw, []) 205 + if not matches: 206 + raise ValueError( 207 + f"No issue with rkey {raw!r} in git bundle — pass full at:// URI" 208 + ) 209 + if len(matches) > 1: 210 + raise ValueError( 211 + f"Ambiguous rkey {raw!r} ({len(matches)} issues). " 212 + f"Pass full at:// URI. 
Examples: {matches[:5]}" 213 + ) 214 + return matches[0] 215 + 216 + @staticmethod 217 + def _repo_hit(row: dict[str, Any], distance: float) -> dict[str, Any]: 218 + return { 219 + "repo_did": row.get("repo_did"), 220 + "repo_name": row.get("repo_name") or "", 221 + "content": "", 222 + "content_sha500": row.get("content_sha500") or "", 223 + "repo_uri": row.get("subject_uri") or row.get("repo_uri") or "", 224 + "owner_handle": row.get("owner_handle") or "", 225 + "description": (row.get("description") or "").strip(), 226 + "topics": row.get("topics"), 227 + "created_at": row.get("created_at") or "", 228 + "distance": round(distance, 4), 229 + } 230 + 231 + @staticmethod 232 + def _issue_hit(row: dict[str, Any], distance: float) -> dict[str, Any]: 233 + return { 234 + "uri": row.get("subject_uri") or row.get("uri") or "", 235 + "rkey": row.get("rkey") or "", 236 + "repo_did": row.get("repo_did") or "", 237 + "title": (row.get("title") or "").strip(), 238 + "content": row.get("body") or "", 239 + "author_did": row.get("author_did") or "", 240 + "created_at": row.get("created_at") or "", 241 + "owner_handle": row.get("owner_handle") or "", 242 + "repo_name": row.get("repo_name") or "", 243 + "repo_readme": "", 244 + "distance": round(distance, 4), 245 + } 246 + 247 + 248 + def _read_jsonl(path: Path) -> list[dict[str, Any]]: 249 + rows: list[dict[str, Any]] = [] 250 + with path.open(encoding="utf-8") as fh: 251 + for line in fh: 252 + line = line.strip() 253 + if line: 254 + rows.append(json.loads(line)) 255 + return rows 256 + 257 + 258 + def is_ready() -> bool: 259 + return _store is not None 260 + 261 + 262 + def load_error() -> str | None: 263 + return _load_error 264 + 265 + 266 + def is_loading() -> bool: 267 + return _loading 268 + 269 + 270 + def _prepare_ssh(settings: Settings) -> None: 271 + import base64 272 + import os 273 + 274 + raw = settings.data_git_ssh_key_b64 275 + if not raw: 276 + return 277 + key_path = Path("/tmp/rec_git_ssh_key") 278 + 
key_path.write_bytes(base64.b64decode(raw)) 279 + key_path.chmod(0o600) 280 + os.environ["GIT_SSH_COMMAND"] = ( 281 + f"ssh -i {key_path} -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null" 282 + ) 283 + 284 + 285 + def _run_git(args: list[str], *, timeout: int) -> None: 286 + try: 287 + proc = subprocess.run( 288 + args, 289 + check=True, 290 + capture_output=True, 291 + text=True, 292 + timeout=timeout, 293 + ) 294 + except FileNotFoundError as exc: 295 + raise RuntimeError( 296 + "git binary not found in container — rebuild image with git installed" 297 + ) from exc 298 + except subprocess.TimeoutExpired as exc: 299 + raise RuntimeError( 300 + f"git timed out after {timeout}s: {' '.join(args)}" 301 + ) from exc 302 + except subprocess.CalledProcessError as exc: 303 + stderr = (exc.stderr or "").strip() 304 + stdout = (exc.stdout or "").strip() 305 + detail = stderr or stdout or str(exc) 306 + raise RuntimeError(f"git failed ({exc.returncode}): {detail}") from exc 307 + if proc.stderr: 308 + log.debug("git stderr: %s", proc.stderr.strip()) 309 + 310 + 311 + def _clone_or_pull(url: str, dest: Path, ref: str | None, *, timeout: int) -> None: 312 + dest.parent.mkdir(parents=True, exist_ok=True) 313 + if dest.exists() and (dest / ".git").exists(): 314 + log.info("git pull %s", dest) 315 + _run_git( 316 + ["git", "-C", str(dest), "fetch", "--depth", "1", "origin", ref or "HEAD"], 317 + timeout=timeout, 318 + ) 319 + _run_git( 320 + ["git", "-C", str(dest), "checkout", "FETCH_HEAD"], 321 + timeout=timeout, 322 + ) 323 + return 324 + args = ["git", "clone", "--depth", "1"] 325 + if ref: 326 + args.extend(["--branch", ref]) 327 + args.extend([url, str(dest)]) 328 + log.info("git clone %s -> %s", url, dest) 329 + _run_git(args, timeout=timeout) 330 + 331 + 332 + def load_git_store(settings: Settings) -> GitDataStore: 333 + """Clone/pull (if configured) and mmap-load numpy+jsonl once.""" 334 + global _store, _load_error, _loading 335 + with _reload_lock: 
336 + _loading = True 337 + _load_error = None 338 + try: 339 + _prepare_ssh(settings) 340 + root = Path(settings.data_dir) 341 + if settings.data_git_url: 342 + _clone_or_pull( 343 + settings.data_git_url, 344 + root, 345 + settings.data_git_ref or None, 346 + timeout=settings.data_git_clone_timeout, 347 + ) 348 + elif not (root / "manifest.json").exists() and ( 349 + root / "data" / "repos.f32.npy" 350 + ).exists(): 351 + root = root.parent 352 + _store = GitDataStore.load_from_dir(root) 353 + return _store 354 + except Exception as exc: 355 + _load_error = str(exc) 356 + raise 357 + finally: 358 + _loading = False 359 + 360 + 361 + def start_git_load_background(settings: Settings) -> threading.Thread: 362 + """Load git bundle in a daemon thread so uvicorn can bind PORT immediately.""" 363 + 364 + def _worker() -> None: 365 + try: 366 + store = load_git_store(settings) 367 + log.info("git store loaded in background; %s", store.embedding_counts()) 368 + except Exception: 369 + log.exception("git store background load failed: %s", _load_error) 370 + 371 + thread = threading.Thread(target=_worker, name="git-load", daemon=True) 372 + thread.start() 373 + return thread 374 + 375 + 376 + def get_git_store() -> GitDataStore: 377 + if _store is None: 378 + raise RuntimeError("Git data store is not loaded — call load_git_store() at startup") 379 + return _store 380 + 381 + 382 + def reload_git_store(settings: Settings) -> GitDataStore: 383 + return load_git_store(settings)
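`knn_repos` and `knn_issues` in `git_store.py` use `1.0 - (matrix @ q)` as the distance. That equals cosine distance only when both the stored rows and the query vector are unit-normalized, which the file seems to assume but does not verify. A minimal sketch of that identity, written with plain Python lists rather than numpy so it runs anywhere; every name and sample vector here is illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def cosine_distance(a, b):
    # Reference definition that works for vectors of any length.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


rows = [[3.0, 4.0], [1.0, 0.0], [-2.0, 1.0]]
query = [1.0, 1.0]

unit_rows = [normalize(r) for r in rows]
unit_query = normalize(query)

# The shortcut used by the git store: 1 - dot product, then sort ascending.
fast = [1.0 - dot(r, unit_query) for r in unit_rows]
exact = [cosine_distance(r, query) for r in rows]
order = sorted(range(len(rows)), key=lambda i: fast[i])
```

If the shipped `.npy` matrices were not normalized at export time, the shortcut would rank by raw dot product instead, so a one-time norm check at load would be a cheap safeguard.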
+58
recommendation/app/lifespan.py
"""FastAPI lifespan: load git bundle at boot (and optional periodic refresh)."""

from __future__ import annotations

import logging
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app import db
from app.config import get_settings

log = logging.getLogger("rec")


def _start_git_refresh(stop: threading.Event) -> threading.Thread | None:
    settings = get_settings()
    if settings.data_refresh_sec <= 0:
        return None
    from app.git_store import is_ready, reload_git_store

    def loop() -> None:
        while not stop.wait(settings.data_refresh_sec):
            if not is_ready():
                continue
            try:
                reload_git_store(settings)
                log.info("git store refreshed")
            except Exception as exc:  # noqa: BLE001
                log.warning("git refresh failed: %s", exc)

    thread = threading.Thread(target=loop, name="git-refresh", daemon=True)
    thread.start()
    return thread


@asynccontextmanager
async def lifespan(_app: FastAPI):
    settings = get_settings()
    stop = threading.Event()
    if settings.data_storage == "git":
        from app.git_store import start_git_load_background

        log.info(
            "git mode: loading bundle in background from %s",
            settings.data_git_url or settings.data_dir,
        )
        start_git_load_background(settings)
        _start_git_refresh(stop)
    else:
        try:
            counts = db.embedding_counts()
            log.info("startup: sql reachable; %s", counts)
        except Exception as exc:  # noqa: BLE001
            log.warning("startup: sql not reachable yet: %s", exc)
    yield
    stop.set()
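The refresh loop in `lifespan.py` relies on `threading.Event.wait(timeout)` returning `False` on timeout and `True` once the event is set, so a single call acts as both the sleep and the shutdown check. A standalone sketch of that pattern; `start_refresh`, the short interval, and the `calls` list are illustrative stand-ins, not part of the service.

```python
import threading
import time


def start_refresh(stop: threading.Event, interval_sec: float, refresh) -> threading.Thread:
    def loop() -> None:
        # wait() returns False on timeout (run another refresh) and
        # True as soon as stop.set() is called (leave the loop).
        while not stop.wait(interval_sec):
            refresh()

    thread = threading.Thread(target=loop, name="refresh-sketch", daemon=True)
    thread.start()
    return thread


stop = threading.Event()
calls: list[int] = []
worker = start_refresh(stop, 0.01, lambda: calls.append(1))

# Let a couple of refreshes happen, then shut down promptly.
deadline = time.monotonic() + 2.0
while len(calls) < 2 and time.monotonic() < deadline:
    time.sleep(0.005)
stop.set()
worker.join(timeout=1.0)
```

Compared with `time.sleep` in a loop, this wakes up immediately on shutdown instead of waiting out the remainder of the interval.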
+39
recommendation/app/links.py
"""Pure formatting helpers: slugs, @handles, absolute URLs, RFC-3339 times.

Per the contract: `owner` carries a leading `@`, repo URLs are absolute, and
timestamps are machine-readable (the frontend humanizes them).
"""

from __future__ import annotations

import re
from datetime import datetime

_SLUG_RE = re.compile(r"[^a-z0-9]+")


def slugify(text: str) -> str:
    return _SLUG_RE.sub("-", (text or "").strip().lower()).strip("-")


def at_owner(handle: str) -> str:
    handle = (handle or "").lstrip("@")
    return f"@{handle}"


def repo_url(web_base: str, handle: str, name: str) -> str:
    return f"{web_base.rstrip('/')}/{at_owner(handle)}/{name}"


def issue_list_url(web_base: str, handle: str, name: str) -> str:
    return f"{repo_url(web_base, handle, name)}/issues"


def to_rfc3339(value) -> str:
    """datetime -> ISO-8601 string; pass through strings (already ISO from the
    DB's record_raw); empty for None."""
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)
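A few sample calls may make the formatting contract in `links.py` concrete. The helpers are copied from the file so the snippet runs on its own; the base URL `https://tangled.example/` and the handle `alice.example` are placeholders, not real endpoints.

```python
import re
from datetime import datetime, timezone

_SLUG_RE = re.compile(r"[^a-z0-9]+")


def slugify(text: str) -> str:
    # Lowercase, collapse every non-alphanumeric run to "-", trim the ends.
    return _SLUG_RE.sub("-", (text or "").strip().lower()).strip("-")


def at_owner(handle: str) -> str:
    # Exactly one leading "@", whether or not the caller supplied one.
    return "@" + (handle or "").lstrip("@")


def repo_url(web_base: str, handle: str, name: str) -> str:
    return f"{web_base.rstrip('/')}/{at_owner(handle)}/{name}"


def to_rfc3339(value) -> str:
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)


slug = slugify("  Rust & WebAssembly! ")
owner_a = at_owner("alice.example")
owner_b = at_owner("@alice.example")
url = repo_url("https://tangled.example/", "alice.example", "widget")
stamp = to_rfc3339(datetime(2026, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```

Note that `to_rfc3339` passes a naive `datetime` through without an offset, so callers that need strict RFC 3339 output would have to supply timezone-aware values.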
+112
recommendation/app/main.py
··· 1 + """FastAPI application: /health, /recommendations, /questionnaire.""" 2 + 3 + from __future__ import annotations 4 + 5 + import logging 6 + from typing import Any 7 + 8 + from fastapi import FastAPI, HTTPException, Query 9 + from fastapi.middleware.cors import CORSMiddleware 10 + 11 + from app import db, questionnaires, recommend as rec 12 + from app.config import get_settings 13 + from app.lifespan import lifespan 14 + from app.questionnaires import IssueUriError, QuestionnaireNotFoundError 15 + from app.schemas import Recommendations 16 + 17 + log = logging.getLogger("rec") 18 + logging.basicConfig(level=logging.INFO) 19 + 20 + app = FastAPI( 21 + title="Tangled Recommendation Engine", 22 + version="0.1.0", 23 + description="Repo/issue discovery for Tangled — README-embedding kNN + rerank.", 24 + lifespan=lifespan, 25 + ) 26 + 27 + app.add_middleware( 28 + CORSMiddleware, 29 + allow_origins=["*"], 30 + allow_methods=["GET"], 31 + allow_headers=["*"], 32 + ) 33 + 34 + 35 + def _git_status() -> dict[str, Any] | None: 36 + settings = get_settings() 37 + if settings.data_storage != "git": 38 + return None 39 + from app.git_store import is_loading, is_ready, load_error 40 + 41 + if is_ready(): 42 + return {"ready": True} 43 + err = load_error() 44 + if err: 45 + return {"ready": False, "error": err} 46 + if is_loading(): 47 + return {"ready": False, "status": "loading"} 48 + return {"ready": False, "status": "loading"} 49 + 50 + 51 + @app.get("/health") 52 + def health() -> dict: 53 + settings = get_settings() 54 + git = _git_status() 55 + if git and not git.get("ready"): 56 + return { 57 + "status": "degraded" if git.get("error") else "loading", 58 + "storage": settings.data_storage, 59 + "db": False, 60 + **git, 61 + } 62 + try: 63 + ok = db.ping() 64 + counts = db.embedding_counts() 65 + except Exception as exc: # noqa: BLE001 66 + return { 67 + "status": "degraded", 68 + "storage": settings.data_storage, 69 + "db": False, 70 + "error": str(exc), 71 + 
**(git or {}), 72 + } 73 + return { 74 + "status": "ok", 75 + "storage": settings.data_storage, 76 + "db": ok, 77 + "counts": counts, 78 + **(git or {}), 79 + } 80 + 81 + 82 + @app.get("/recommendations", response_model=Recommendations, response_model_exclude_none=True) 83 + def recommendations( 84 + handle: str = Query(..., description="The user's Tangled DID (e.g. did:plc:...)"), 85 + gh: str | None = Query(None, description="Connected GitHub username (ignored: no GitHub data)"), 86 + ) -> Recommendations: 87 + git = _git_status() 88 + if git and not git.get("ready"): 89 + raise HTTPException( 90 + status_code=503, 91 + detail=git.get("error") or "git data store is still loading", 92 + ) 93 + return rec.recommend(handle) 94 + 95 + 96 + @app.get("/questionnaire") 97 + def questionnaire( 98 + issue: str | None = Query(None, description="Issue at:// URI or rkey"), 99 + issue_uri: str | None = Query(None, alias="issue-uri", description="Alias for issue"), 100 + ) -> dict[str, Any]: 101 + raw = (issue or issue_uri or "").strip() 102 + if not raw: 103 + raise HTTPException(status_code=400, detail="issue query param is required") 104 + try: 105 + return questionnaires.load_questionnaire_payload(raw) 106 + except IssueUriError as exc: 107 + raise HTTPException(status_code=400, detail=str(exc)) from exc 108 + except QuestionnaireNotFoundError: 109 + raise HTTPException( 110 + status_code=404, 111 + detail="Questionnaire not found for this issue", 112 + ) from None
+48
recommendation/app/merge.py
"""Merge per-seed kNN hits into one candidate set (consensus aggregation).

Each of the user's seed repos is searched independently. Here we union the
results, keyed by candidate repo_did: a candidate surfaced by several seeds
keeps its best (minimum) distance and records every seed that found it; the
length of that seed list is the consensus signal used later for ranking.
"""

from __future__ import annotations

from app.dedup import row_content_hash
from app.types import Candidate


def merge_hits(
    per_seed_hits: list[tuple[str, list[dict]]],
    seed_content_hashes: set[str],
    key_field: str = "repo_did",
) -> list[Candidate]:
    """per_seed_hits: list of (seed_label, [row, ...]). Each row needs the
    `key_field`, `content`, and `distance`. Rows whose content matches one of
    the user's own seeds (`seed_content_hashes`) are dropped (own forks)."""
    merged: dict[str, Candidate] = {}
    for seed_label, rows in per_seed_hits:
        for row in rows:
            h = row_content_hash(row)
            if h in seed_content_hashes:
                continue  # a fork of the user's own repo
            key = row[key_field]
            dist = float(row["distance"])
            cand = merged.get(key)
            if cand is None:
                merged[key] = Candidate(
                    key=key,
                    content_hash=h,
                    distance=dist,
                    seeds=[seed_label],
                    primary_seed=seed_label,
                    payload=row,
                )
            else:
                if seed_label not in cand.seeds:
                    cand.seeds.append(seed_label)
                if dist < cand.distance:
                    cand.distance = dist
                    cand.primary_seed = seed_label
                    cand.payload = row
    return list(merged.values())
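The consensus aggregation in `merge.py` can be traced on a tiny input. This is a simplified sketch: the `Candidate` dataclass is a hypothetical stand-in with fewer fields than `app.types.Candidate`, the own-fork content-hash filter is left out, and the seed labels and distances are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical, trimmed-down version of app.types.Candidate.
    key: str
    distance: float
    seeds: list = field(default_factory=list)
    primary_seed: str = ""


def merge_hits(per_seed_hits):
    merged = {}
    for seed_label, rows in per_seed_hits:
        for row in rows:
            key, dist = row["repo_did"], float(row["distance"])
            cand = merged.get(key)
            if cand is None:
                merged[key] = Candidate(key, dist, [seed_label], seed_label)
                continue
            # Seen before: record the extra seed and keep the best distance.
            if seed_label not in cand.seeds:
                cand.seeds.append(seed_label)
            if dist < cand.distance:
                cand.distance = dist
                cand.primary_seed = seed_label
    return list(merged.values())


hits = [
    ("seed-a", [{"repo_did": "did:plc:x", "distance": 0.30},
                {"repo_did": "did:plc:y", "distance": 0.20}]),
    ("seed-b", [{"repo_did": "did:plc:x", "distance": 0.25}]),
]
by_key = {c.key: c for c in merge_hits(hits)}
```

Here `did:plc:x` ends up with two seeds and the smaller distance from `seed-b`, while `did:plc:y` keeps a single seed; the length of `seeds` is what the ranking stage can treat as the consensus signal.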
+35
recommendation/app/profile.py
"""Derive the user's interest chips from their seed repos' topics.

The contract's `profile.interests` are shown in the onboarding reveal. We build
them from the most frequent `record_raw.topics` across the user's seed repos,
grounded in real data rather than invented cluster labels.
"""

from __future__ import annotations

from collections import Counter

from app.links import slugify


def build_interests(seed_rows: list[dict], max_interests: int) -> list[dict]:
    """seed_rows: dicts with a `topics` field (list[str] | None). Returns
    [{label, slug}] ordered by frequency, de-duplicated by slug."""
    counter: Counter[str] = Counter()
    label_for_slug: dict[str, str] = {}
    for row in seed_rows:
        topics = row.get("topics") or []
        for topic in topics:
            if not topic or not str(topic).strip():
                continue
            label = str(topic).strip()
            slug = slugify(label)
            if not slug:
                continue
            counter[slug] += 1
            label_for_slug.setdefault(slug, label)

    interests = []
    for slug, _count in counter.most_common(max_interests):
        interests.append({"label": label_for_slug[slug], "slug": slug})
    return interests
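To illustrate how `build_interests` de-duplicates by slug while keeping the first-seen label, here is a self-contained run on invented seed rows. `slugify` is inlined from `links.py`; the topic values are sample data only.

```python
import re
from collections import Counter

_SLUG_RE = re.compile(r"[^a-z0-9]+")


def slugify(text: str) -> str:
    return _SLUG_RE.sub("-", (text or "").strip().lower()).strip("-")


def build_interests(seed_rows, max_interests):
    counter = Counter()
    label_for_slug = {}
    for row in seed_rows:
        for topic in row.get("topics") or []:
            label = str(topic or "").strip()
            slug = slugify(label)
            if not slug:
                continue
            counter[slug] += 1
            # The first spelling seen for a slug becomes its display label.
            label_for_slug.setdefault(slug, label)
    return [
        {"label": label_for_slug[slug], "slug": slug}
        for slug, _count in counter.most_common(max_interests)
    ]


seed_rows = [
    {"topics": ["Rust", "CLI"]},
    {"topics": ["rust", "cli", "Web Assembly"]},
    {"topics": ["RUST", "  "]},
    {"topics": None},  # repos without topics are simply skipped
]
interests = build_interests(seed_rows, max_interests=2)
```

"Rust", "rust", and "RUST" all map to the slug `rust`, so the chip is counted three times but shown once under its first spelling.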
+89
recommendation/app/quality.py
"""Quality heuristics for issue recommendations (pure: no DB, no network).

Issues are ranked purely by body-embedding similarity, with no notion of whether
an issue is a real contribution opportunity or a throwaway. A test/sandbox repo's
issue, or a placeholder issue ("hello world", "test issue to explore tangled",
"[READ-ONLY]"), can embed close to a user's interests and rank at the top.

Our repo standard (REC_MIN_README_CHARS) can't be applied to issues: the issue
corpus and the README corpus barely overlap, so almost no issue's parent repo has
a README in the DB and a length gate would drop everything. Instead we judge the
parent repo by name/description and the issue by title/body, matching the kinds of
throwaway content observed in production.

Keep these conservative: a false positive silently hides a real contribution.
"""

from __future__ import annotations

import re

# Repo name tokens that mark a scratchpad/sandbox. Matched on word tokens (split
# on non-alphanumerics), so "latest"/"fastest"/"contest" are NOT caught.
_TEST_TOKENS = frozenset({
    "test", "tests", "testing", "tester",
    "sandbox", "playground", "scratch", "scratchpad",
    "demo", "demos", "example", "examples", "sample", "samples",
    "tmp", "temp", "placeholder", "throwaway",
    "foo", "bar", "baz", "qux", "foobar",
    "helloworld",
})
_TOKEN_SPLIT = re.compile(r"[^a-z0-9]+")
_TESTNUM_RE = re.compile(r"^test\d+$")  # test100, test2, ...

# Placeholder / "just exploring" phrases in an issue title or body (or a repo
# description). Phrase-anchored so normal text mentioning "tests" is not caught.
_PLACEHOLDER_RE = re.compile(
    r"""
      \btest\s+issue\b
    | \btest\s+repo\b
    | \bthis\s+is\s+(?:just\s+)?a\s+test\b
    | \bjust\s+a\s+test\b
    | \bjust\s+testing\b
    | \btesting\s+(?:the\s+)?(?:tangled|programmatic|access|repo|issue|out|this)\b
    | \bhello,?\s+world\b
    | \bhallo\b
    | \blorem\s+ipsum\b
    | \bread[-\s]?only\s+mirror\b
    | \[read[-\s]?only\]
    | \bignore\s+(?:this|me|please)\b
    | \bplaceholder\b
    | \bexplor(?:e|ing)\s+(?:what\s+)?tangled\b
    | \basdf\b | \bqwerty\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def _tokens(text: str) -> set[str]:
    return {t for t in _TOKEN_SPLIT.split((text or "").lower()) if t}


def _is_gibberish(text: str) -> bool:
    """A single run of letters with very few distinct characters, e.g.
    'adadadaddaaddada' or 'adwawdawd' — typical of throwaway repo descriptions."""
    t = (text or "").strip().lower()
    if not t or " " in t or len(t) < 6:
        return False
    return len(set(t)) / len(t) < 0.4


def is_test_repo(name: str, description: str = "") -> bool:
    toks = _tokens(name)
    if toks & _TEST_TOKENS or any(_TESTNUM_RE.match(t) for t in toks):
        return True
    desc = (description or "").strip()
    if desc and (_PLACEHOLDER_RE.search(desc) or _is_gibberish(desc)):
        return True
    return False


def is_placeholder_issue(title: str, body: str = "") -> bool:
    blob = f"{title or ''}\n{body or ''}"
    return bool(_PLACEHOLDER_RE.search(blob))


def drop_issue(repo_name: str, repo_description: str, title: str, body: str) -> bool:
    """True if this issue should be excluded: its repo is a sandbox/test repo, or
    its content is a placeholder/test issue."""
    return is_test_repo(repo_name, repo_description) or is_placeholder_issue(title, body)
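The difference between token matching and substring matching is the key design point in `quality.py`, so a few concrete inputs may help. This sketch uses an abbreviated token set (an assumption made for brevity; the file's full set is longer) and made-up repo names.

```python
import re

# Abbreviated subset of the real token list, for illustration only.
_TEST_TOKENS = frozenset({"test", "sandbox", "demo", "tmp"})
_TOKEN_SPLIT = re.compile(r"[^a-z0-9]+")
_TESTNUM_RE = re.compile(r"^test\d+$")


def tokens(text: str) -> set:
    # Split on anything that is not a lowercase letter or digit.
    return {t for t in _TOKEN_SPLIT.split((text or "").lower()) if t}


def name_looks_like_test(name: str) -> bool:
    toks = tokens(name)
    return bool(toks & _TEST_TOKENS) or any(_TESTNUM_RE.match(t) for t in toks)


def is_gibberish(text: str) -> bool:
    # One unbroken run of characters with a low ratio of distinct characters.
    t = (text or "").strip().lower()
    if not t or " " in t or len(t) < 6:
        return False
    return len(set(t)) / len(t) < 0.4


results = {
    "my-test-repo": name_looks_like_test("my-test-repo"),
    "latest-news": name_looks_like_test("latest-news"),
    "test100": name_looks_like_test("test100"),
    "contest": name_looks_like_test("contest"),
}
gibberish = {s: is_gibberish(s) for s in ("adadadaddaaddada", "tangled", "aaa bbb")}
```

Whole-token matching still flags legitimate names that contain a listed word as a separate token (for example a repo called `widget-demo`), which is the kind of false positive the module docstring warns about.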
+84
recommendation/app/questionnaires.py
··· 1 + """Questionnaire HTTP helpers — resolve issue URI and load the questionnaire. 2 + 3 + Source of truth is the **knot**: a questionnaire is one JSON file in the knot-hosted 4 + repo (`questionnaires/<did>/<rkey>.json`), fetched per-issue via the knot blob XRPC — 5 + no clone, no DB. (`QUESTIONNAIRE_SOURCE=db` reverts to the old Postgres read; in knot 6 + mode `QUESTIONNAIRE_DB_FALLBACK=1` falls back to the DB on a miss during transition.) 7 + """ 8 + 9 + from __future__ import annotations 10 + 11 + import json 12 + import urllib.error 13 + import urllib.parse 14 + import urllib.request 15 + 16 + from app import db 17 + from app.config import get_settings 18 + 19 + 20 + class IssueUriError(ValueError): 21 + """Invalid or ambiguous issue identifier.""" 22 + 23 + 24 + class QuestionnaireNotFoundError(LookupError): 25 + """No questionnaire for this issue.""" 26 + 27 + 28 + def resolve_issue_param(issue: str) -> str: 29 + """Normalize ``issue`` query param to a full at:// issue URI.""" 30 + try: 31 + return db.resolve_issue_uri(issue) 32 + except ValueError as exc: 33 + raise IssueUriError(str(exc)) from exc 34 + 35 + 36 + def _knot_blob_url(issue_uri: str, settings) -> str: 37 + """at://<did>/sh.tangled.repo.issue/<rkey> -> knot blob URL for its questionnaire. 38 + Path convention matches agent/questionnaire_repo_store.py + export_questionnaires.py.""" 39 + rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri 40 + parts = rest.split("/") 41 + path = f"questionnaires/{parts[0]}/{parts[-1]}.json" 42 + qs = urllib.parse.urlencode({"repo": settings.questionnaire_repo_did, "path": path}) 43 + return f"https://{settings.questionnaire_knot_host}/xrpc/sh.tangled.repo.blob?{qs}" 44 + 45 + 46 + def _fetch_from_knot(issue_uri: str, settings) -> dict | None: 47 + """Fetch + parse the questionnaire file from the knot, or None if absent. 
48 + 49 + The blob XRPC returns ``{"content": "<file text>", ...}``; the file text is the 50 + record written by the generator: ``{issue_uri, version, created_at, updated_at, payload}``.""" 51 + url = _knot_blob_url(issue_uri, settings) 52 + # Knots 403 the default Python-urllib User-Agent; send an explicit one. 53 + req = urllib.request.Request( 54 + url, headers={"User-Agent": "tangled-rec/1.0", "Accept": "application/json"} 55 + ) 56 + try: 57 + with urllib.request.urlopen(req, timeout=settings.questionnaire_knot_timeout) as resp: 58 + blob = json.loads(resp.read().decode("utf-8")) 59 + except urllib.error.HTTPError as exc: 60 + if exc.code in (404, 400): 61 + return None 62 + raise 63 + content = blob.get("content") if isinstance(blob, dict) else None 64 + if not content: 65 + return None 66 + rec = json.loads(content) if isinstance(content, str) else content 67 + return rec if isinstance(rec, dict) and rec.get("payload") is not None else None 68 + 69 + 70 + def load_questionnaire_payload(issue: str) -> dict: 71 + """Return the questionnaire JSON object for an issue URI or rkey.""" 72 + settings = get_settings() 73 + issue_uri = resolve_issue_param(issue) 74 + 75 + if settings.questionnaire_source == "knot": 76 + rec = _fetch_from_knot(issue_uri, settings) 77 + if rec is None and settings.questionnaire_db_fallback: 78 + rec = db.get_questionnaire(issue_uri) 79 + else: 80 + rec = db.get_questionnaire(issue_uri) 81 + 82 + if not rec: 83 + raise QuestionnaireNotFoundError(issue_uri) 84 + return rec["payload"]
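The path convention used by `_knot_blob_url` can be exercised without a network call. The following is a minimal sketch of the same URL construction with the settings passed as plain arguments. The host `knot.example.com` and both DIDs are placeholders invented for the example, not values from this repo.

```python
import urllib.parse


def knot_blob_url(issue_uri: str, knot_host: str, repo_did: str) -> str:
    """Build the blob XRPC URL for an issue's questionnaire file.

    Same steps as _knot_blob_url, but with host and repo DID as explicit
    arguments (an assumption of this sketch) instead of a settings object.
    """
    rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
    parts = rest.split("/")
    # First segment is the issue author's DID, last segment is the record key.
    path = f"questionnaires/{parts[0]}/{parts[-1]}.json"
    qs = urllib.parse.urlencode({"repo": repo_did, "path": path})
    return f"https://{knot_host}/xrpc/sh.tangled.repo.blob?{qs}"


url = knot_blob_url(
    "at://did:plc:alice/sh.tangled.repo.issue/3kabc",  # placeholder issue URI
    knot_host="knot.example.com",                      # placeholder host
    repo_did="did:plc:questionnaires",                 # placeholder repo DID
)
query = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
print(query["path"][0])  # questionnaires/did:plc:alice/3kabc.json
```

A bare rkey with no slashes would make the first and last segments identical and produce `questionnaires/<rkey>/<rkey>.json`, which is presumably why `load_questionnaire_payload` resolves the parameter to a full `at://` URI before fetching.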
+97
recommendation/app/rank.py
··· 1 + """Scoring + diversified rerank. 2 + 3 + The scorer is intentionally behind a small interface (Protocol) so it can be 4 + swapped for a learned ranker later without touching the pipeline. The default 5 + is a transparent weighted sum: similarity + consensus + recency + popularity. 6 + 7 + Diversify uses round-robin across each candidate's primary seed so that one 8 + high-volume interest can't bury a user's lone interests (the failure mode the 9 + original clustering experiment was built to avoid). 10 + """ 11 + 12 + from __future__ import annotations 13 + 14 + from datetime import datetime, timezone 15 + from typing import Protocol 16 + 17 + from app.types import Candidate 18 + 19 + 20 + def _recency(created_at: str | None, half_life_days: float = 365.0) -> float: 21 + """Map an ISO timestamp to (0, 1]; newer is higher. 0 if absent/unparseable.""" 22 + if not created_at: 23 + return 0.0 24 + try: 25 + dt = datetime.fromisoformat(str(created_at).replace("Z", "+00:00")) 26 + except (ValueError, TypeError): 27 + return 0.0 28 + if dt.tzinfo is None: 29 + dt = dt.replace(tzinfo=timezone.utc) 30 + age_days = (datetime.now(timezone.utc) - dt).total_seconds() / 86400.0 31 + if age_days < 0: 32 + age_days = 0.0 33 + return 0.5 ** (age_days / half_life_days) 34 + 35 + 36 + class Scorer(Protocol): 37 + def score(self, c: Candidate) -> float: ... 
38 + 39 + 40 + class DefaultScorer: 41 + def __init__( 42 + self, 43 + w_similarity: float = 1.0, 44 + w_consensus: float = 0.10, 45 + w_recency: float = 0.05, 46 + w_popularity: float = 0.0, # stub until stars are ingested 47 + ) -> None: 48 + self.w_similarity = w_similarity 49 + self.w_consensus = w_consensus 50 + self.w_recency = w_recency 51 + self.w_popularity = w_popularity 52 + 53 + def score(self, c: Candidate) -> float: 54 + similarity = 1.0 - c.distance 55 + consensus = c.consensus - 1 # 0 for a single-seed hit 56 + recency = _recency(c.payload.get("created_at")) 57 + popularity = 0.0 58 + return ( 59 + self.w_similarity * similarity 60 + + self.w_consensus * consensus 61 + + self.w_recency * recency 62 + + self.w_popularity * popularity 63 + ) 64 + 65 + 66 + def apply_floor(candidates: list[Candidate], floor: float) -> list[Candidate]: 67 + return [c for c in candidates if c.distance <= floor] 68 + 69 + 70 + def rerank( 71 + candidates: list[Candidate], 72 + scorer: Scorer, 73 + max_n: int, 74 + diversify: bool = True, 75 + ) -> list[Candidate]: 76 + scored = sorted(candidates, key=scorer.score, reverse=True) 77 + if not diversify: 78 + return scored[:max_n] 79 + 80 + # Group by primary seed; preserve score order within each group. 81 + groups: dict[str, list[Candidate]] = {} 82 + for c in scored: 83 + groups.setdefault(c.primary_seed, []).append(c) 84 + 85 + # Order groups by their best member's score (global best leads). 86 + ordered_groups = sorted(groups.values(), key=lambda g: scorer.score(g[0]), reverse=True) 87 + 88 + out: list[Candidate] = [] 89 + idx = 0 90 + while len(out) < max_n and any(idx < len(g) for g in ordered_groups): 91 + for g in ordered_groups: 92 + if idx < len(g): 93 + out.append(g[idx]) 94 + if len(out) >= max_n: 95 + break 96 + idx += 1 97 + return out
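To see what the round-robin in `rerank` changes compared with a plain top-N cut, the sketch below applies the same grouping and interleaving to bare `(name, primary_seed, score)` tuples. The tuple layout, the function name `round_robin`, and the sample scores are assumptions made for illustration; the real function works on `Candidate` objects and a `Scorer`.

```python
from collections import defaultdict


def round_robin(items: list[tuple[str, str, float]], max_n: int) -> list[str]:
    """Interleave items across their primary seed, best-scoring group first.

    items are (name, primary_seed, score) tuples; hypothetical stand-ins
    for scored Candidate objects.
    """
    scored = sorted(items, key=lambda it: it[2], reverse=True)

    # Group by seed; each group keeps descending score order.
    groups: dict[str, list[tuple[str, str, float]]] = defaultdict(list)
    for it in scored:
        groups[it[1]].append(it)

    # Groups are visited in order of their best member's score.
    ordered = sorted(groups.values(), key=lambda g: g[0][2], reverse=True)

    out: list[str] = []
    depth = 0
    while len(out) < max_n and any(depth < len(g) for g in ordered):
        for g in ordered:
            if depth < len(g):
                out.append(g[depth][0])
                if len(out) >= max_n:
                    break
        depth += 1
    return out


items = [
    ("a1", "seed-a", 0.95), ("a2", "seed-a", 0.94), ("a3", "seed-a", 0.93),
    ("b1", "seed-b", 0.80), ("c1", "seed-c", 0.70),
]
print(round_robin(items, 4))  # ['a1', 'b1', 'c1', 'a2']
```

A score-only cut of the same input would return `a1, a2, a3, b1` and drop `seed-c` entirely, which is the "high-volume interest buries a lone interest" case described in the module docstring.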
+198
recommendation/app/recommend.py
··· 1 + """Recommendation orchestration: seeds -> per-seed kNN -> merge -> dedup -> 2 + floor -> rerank -> contract shape. 3 + 4 + This is the only place that stitches the (pure) stages to the (impure) data 5 + access. Keeping it thin makes the algorithm easy to read top-to-bottom. 6 + """ 7 + 8 + from __future__ import annotations 9 + 10 + from concurrent.futures import ThreadPoolExecutor 11 + 12 + from app import db 13 + from app.config import Settings, get_settings 14 + from app.dedup import collapse_forks, row_content_hash 15 + from app.links import at_owner, repo_url, to_rfc3339 16 + from app.merge import merge_hits 17 + from app.profile import build_interests 18 + from app.quality import drop_issue 19 + from app.rank import DefaultScorer, apply_floor, rerank 20 + from app.schemas import ( 21 + IssueOut, 22 + Profile, 23 + Recommendations, 24 + RepoOut, 25 + Sources, 26 + TangledSource, 27 + ) 28 + from app.types import Candidate 29 + 30 + 31 + def _empty(settings: Settings, seed_count: int) -> Recommendations: 32 + return Recommendations( 33 + profile=Profile( 34 + interests=[], 35 + languages=[], 36 + sources=Sources(tangled=TangledSource(repos=seed_count)), 37 + ), 38 + repos=[], 39 + issues=[], 40 + ) 41 + 42 + 43 + def _seed_label(seed: dict) -> str: 44 + return seed.get("repo_name") or seed["repo_did"] 45 + 46 + 47 + def _seed_url_map(seeds: list[dict], settings: Settings) -> dict[str, str]: 48 + """Map seed label (repo name or did) -> absolute Tangled repo URL.""" 49 + out: dict[str, str] = {} 50 + for seed in seeds: 51 + handle = (seed.get("owner_handle") or "").strip() 52 + name = (seed.get("repo_name") or "").strip() 53 + label = _seed_label(seed) 54 + out[label] = repo_url(settings.web_base, handle, name) if handle and name else "" 55 + return out 56 + 57 + 58 + def _based_on_repo_url(c: Candidate, seed_urls: dict[str, str]) -> str: 59 + return seed_urls.get(c.primary_seed, "") 60 + 61 + 62 + def _repo_out( 63 + c: Candidate, 64 + settings: 
Settings, 65 + open_issues: dict[str, int], 66 + seed_urls: dict[str, str], 67 + ) -> RepoOut: 68 + p = c.payload 69 + handle = p.get("owner_handle") or "" 70 + name = p.get("repo_name") or "" 71 + return RepoOut( 72 + name=name, 73 + owner=at_owner(handle), 74 + language="", # no language signal in the shared DB yet 75 + description=(p.get("description") or "").strip(), 76 + stars=0, # no star signal yet (tangled_backlinks empty) 77 + openIssues=open_issues.get(c.key, 0), 78 + lastActive=to_rfc3339(p.get("created_at")), 79 + url=repo_url(settings.web_base, handle, name), 80 + basedOnRepoUrl=_based_on_repo_url(c, seed_urls), 81 + ) 82 + 83 + 84 + def _issue_out( 85 + c: Candidate, settings: Settings, seed_urls: dict[str, str], with_questionnaire: set[str] 86 + ) -> IssueOut: 87 + p = c.payload 88 + handle = p.get("owner_handle") or "" 89 + name = p.get("repo_name") or "" 90 + uri = (p.get("uri") or "").strip() 91 + return IssueOut( 92 + title=(p.get("title") or "").strip(), 93 + repo=f"{handle}/{name}", 94 + owner=at_owner(handle), 95 + issueUri=uri, 96 + repoDid=p.get("repo_did") or "", 97 + rkey=p.get("rkey") or "", 98 + url=repo_url(settings.web_base, handle, name), 99 + basedOnRepoUrl=_based_on_repo_url(c, seed_urls), 100 + repoReadme=(p.get("repo_readme") or "").strip(), 101 + hasQuestionnaire=uri in with_questionnaire, 102 + labels=[], # issue records carry no labels in the shared DB 103 + comments=0, # no comment source yet 104 + language="", 105 + lastActive=to_rfc3339(p.get("created_at")), 106 + ) 107 + 108 + 109 + def _fetch_per_seed(seeds, query, workers) -> list[tuple[str, list[dict]]]: 110 + """Run `query(seed) -> (label, rows)` across the user's seeds concurrently. 111 + 112 + The DB is remote with multi-second round-trips, so the per-seed kNN queries 113 + dominate request latency; fanning them out across a thread pool cuts it to 114 + roughly one query's worth. 
`ThreadPoolExecutor.map` preserves seed order, so 115 + the downstream merge/rerank stay deterministic (tie-breaks unchanged). 116 + """ 117 + n = max(1, min(len(seeds), workers)) 118 + with ThreadPoolExecutor(max_workers=n) as ex: 119 + return list(ex.map(query, seeds)) 120 + 121 + 122 + def _recommend_repos(seeds, exclude_dids, seed_hashes, settings) -> list[RepoOut]: 123 + seed_urls = _seed_url_map(seeds, settings) 124 + 125 + def query(s): 126 + rows = db.knn_repos( 127 + s["etext"], exclude_dids, settings.per_seed_limit, settings.min_readme_chars 128 + ) 129 + return (s["repo_name"] or s["repo_did"], rows) 130 + 131 + per_seed_hits = _fetch_per_seed(seeds, query, settings.query_workers) 132 + 133 + candidates = merge_hits(per_seed_hits, seed_hashes) 134 + candidates = collapse_forks(candidates) 135 + candidates = apply_floor(candidates, settings.distance_floor) 136 + candidates = [c for c in candidates if (c.payload.get("owner_handle") or "").strip()] 137 + ranked = rerank(candidates, DefaultScorer(), settings.max_repos, diversify=True) 138 + 139 + counts = db.open_issue_counts([c.key for c in ranked]) 140 + return [_repo_out(c, settings, counts, seed_urls) for c in ranked] 141 + 142 + 143 + def _recommend_issues(did, seeds, exclude_dids, settings) -> list[IssueOut]: 144 + seed_urls = _seed_url_map(seeds, settings) 145 + 146 + def query(s): 147 + rows = db.knn_issues(s["etext"], exclude_dids, did, settings.per_seed_limit) 148 + return (s["repo_name"] or s["repo_did"], rows) 149 + 150 + per_seed_hits = _fetch_per_seed(seeds, query, settings.query_workers) 151 + 152 + # Key by issue uri — each issue is already unique. We deliberately do NOT run 153 + # collapse_forks here: that collapses by md5(content[:500]), which is right for 154 + # fork READMEs but would merge genuinely distinct issues that share an empty or 155 + # boilerplate body, silently dropping real recommendations. 
156 + candidates = merge_hits(per_seed_hits, seed_content_hashes=set(), key_field="uri") 157 + candidates = apply_floor(candidates, settings.issue_distance_floor) 158 + # Drop issues whose parent repo is a sandbox/test repo or whose content is a 159 + # placeholder/test issue — they embed close to real interests but aren't real 160 + # contribution opportunities. (The README-length repo standard can't be used 161 + # here: issue-parent repos almost never have a README in the DB.) 162 + candidates = [ 163 + c 164 + for c in candidates 165 + if not drop_issue( 166 + c.payload.get("repo_name") or "", 167 + c.payload.get("repo_description") or "", 168 + c.payload.get("title") or "", 169 + c.payload.get("content") or "", 170 + ) 171 + ] 172 + ranked = rerank(candidates, DefaultScorer(), settings.max_issues, diversify=True) 173 + with_questionnaire = db.questionnaires_present( 174 + [c.payload.get("uri") for c in ranked if c.payload.get("uri")] 175 + ) 176 + return [_issue_out(c, settings, seed_urls, with_questionnaire) for c in ranked] 177 + 178 + 179 + def recommend(did: str, settings: Settings | None = None) -> Recommendations: 180 + settings = settings or get_settings() 181 + 182 + seeds = db.load_seeds(did, settings.min_readme_chars) 183 + if not seeds: 184 + return _empty(settings, 0) 185 + 186 + seed_hashes = {row_content_hash(s) for s in seeds} 187 + exclude_dids = [s["repo_did"] for s in seeds] 188 + 189 + repos = _recommend_repos(seeds, exclude_dids, seed_hashes, settings) 190 + issues = _recommend_issues(did, seeds, exclude_dids, settings) 191 + 192 + interests = build_interests(seeds, settings.max_interests) 193 + profile = Profile( 194 + interests=[{"label": i["label"], "slug": i["slug"]} for i in interests], 195 + languages=[], 196 + sources=Sources(tangled=TangledSource(repos=len(seeds))), 197 + ) 198 + return Recommendations(profile=profile, repos=repos, issues=issues)
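The comment in `_recommend_issues` about skipping `collapse_forks` can be illustrated with two unrelated issues that both have an empty body. The sketch assumes the content hash is an MD5 of the first 500 characters, as that comment describes; the helper name `content_hash` and the sample issues are made up for the example, and the real `row_content_hash` may differ in detail.

```python
import hashlib


def content_hash(content: str) -> str:
    """MD5 over the first 500 characters (assumed from the comment in recommend.py)."""
    return hashlib.md5((content or "")[:500].encode("utf-8")).hexdigest()


# Two distinct, made-up issues that happen to share an empty body.
issues = [
    {"uri": "at://did:plc:a/sh.tangled.repo.issue/1", "title": "Crash on startup", "content": ""},
    {"uri": "at://did:plc:b/sh.tangled.repo.issue/2", "title": "Add dark mode", "content": ""},
]

by_hash = {content_hash(i["content"]): i for i in issues}
by_uri = {i["uri"]: i for i in issues}

print(len(by_hash), len(by_uri))  # 1 2
```

Keying by hash keeps one of the two issues, while keying by URI keeps both, which matches the choice of `key_field="uri"` for the issue pipeline.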
+67
recommendation/app/schemas.py
··· 1 + """Pydantic response models — field names match the schema.md wire contract 2 + exactly (camelCase where the Go client expects it), so no aliasing is needed. 3 + """ 4 + 5 + from __future__ import annotations 6 + 7 + from pydantic import BaseModel 8 + 9 + 10 + class Interest(BaseModel): 11 + label: str 12 + slug: str 13 + 14 + 15 + class TangledSource(BaseModel): 16 + repos: int 17 + 18 + 19 + class GithubSource(BaseModel): 20 + handle: str 21 + repos: int 22 + 23 + 24 + class Sources(BaseModel): 25 + tangled: TangledSource 26 + github: GithubSource | None = None # omitted when GitHub isn't connected 27 + 28 + 29 + class Profile(BaseModel): 30 + interests: list[Interest] 31 + languages: list[str] 32 + sources: Sources 33 + 34 + 35 + class RepoOut(BaseModel): 36 + name: str 37 + owner: str # "@handle" 38 + language: str 39 + description: str 40 + stars: int 41 + openIssues: int 42 + lastActive: str # RFC-3339 43 + url: str # absolute — recommended repo 44 + basedOnRepoUrl: str = "" # user's seed repo that surfaced this hit 45 + 46 + 47 + class IssueOut(BaseModel): 48 + title: str 49 + repo: str # "owner/name" 50 + owner: str # "@handle" 51 + issueUri: str = "" # at://…/sh.tangled.repo.issue/<rkey> 52 + repoDid: str # appview resolves number+url from (repoDid, rkey) 53 + rkey: str 54 + url: str = "" # absolute — parent repo the issue belongs to 55 + basedOnRepoUrl: str = "" # user's seed repo that surfaced this hit 56 + repoReadme: str = "" # parent repo README the issue belongs to 57 + hasQuestionnaire: bool = False # an AI-solve questionnaire exists for this issue 58 + labels: list[str] 59 + comments: int 60 + language: str 61 + lastActive: str # RFC-3339 62 + 63 + 64 + class Recommendations(BaseModel): 65 + profile: Profile 66 + repos: list[RepoOut] 67 + issues: list[IssueOut]
+27
recommendation/app/search.py
"""Parallel per-seed vector search against the DB."""

from __future__ import annotations

from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from typing import Any, TypeVar

T = TypeVar("T")


def parallel_seed_search(
    seeds: list[dict[str, Any]],
    search: Callable[[dict[str, Any]], list[dict]],
    *,
    max_workers: int,
) -> list[tuple[str, list[dict]]]:
    """Run one kNN query per seed, up to ``max_workers`` at a time."""
    if not seeds:
        return []
    workers = max(1, min(max_workers, len(seeds)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows_by_seed = list(pool.map(search, seeds))
    return [
        (s.get("repo_name") or s["repo_did"], rows)
        for s, rows in zip(seeds, rows_by_seed, strict=True)
    ]
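Both `parallel_seed_search` and `_fetch_per_seed` rely on `Executor.map` returning results in input order even when the calls finish out of order. The sketch below checks that property with a fake query whose first seed is the slowest. `slow_lookup`, the seed dictionaries, and the delays are invented for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(seed: dict) -> list[str]:
    """Stand-in for a kNN query; sleeps to simulate a remote round-trip."""
    time.sleep(seed["delay"])
    return [f"hit-for-{seed['name']}"]


# The first seed finishes last, the last seed finishes first.
seeds = [
    {"name": "first", "delay": 0.20},
    {"name": "second", "delay": 0.05},
    {"name": "third", "delay": 0.00},
]

with ThreadPoolExecutor(max_workers=3) as pool:
    rows_by_seed = list(pool.map(slow_lookup, seeds))

labelled = [(s["name"], rows) for s, rows in zip(seeds, rows_by_seed)]
print([name for name, _ in labelled])  # ['first', 'second', 'third']
```

One consequence worth keeping in mind: an exception raised inside a worker surfaces when its result is consumed from the iterator, so a single failed seed query would abort the `list(...)` call unless the query function handles the error itself.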
+31
recommendation/app/types.py
"""Shared domain types for the recommendation pipeline.

These are deliberately plain dataclasses so the pure stages (merge / dedup /
rank) are trivially unit-testable without a database or network.
"""

from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A recommended repo (or issue) accumulated across the user's seeds.

    `distance` is the best (minimum) cosine distance seen for this candidate.
    `seeds` records which of the user's seed repos surfaced it; its length is
    the consensus signal (more seeds agreeing -> higher rank). `payload` holds
    the raw DB row fields used later for shaping (name, owner_handle, etc.).
    """

    key: str  # repo_did for repos; issue uri for issues
    content_hash: str
    distance: float
    seeds: list[str] = field(default_factory=list)
    primary_seed: str = ""  # seed that gave the best (min) distance
    payload: dict = field(default_factory=dict)

    @property
    def consensus(self) -> int:
        return len(self.seeds)
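The `Candidate` docstring implies an accumulation rule: keep the minimum distance, remember which seed produced it, and collect every seed that surfaced the item. `merge_hits` itself is not shown here, so the following is only a sketch of that rule under those assumptions, using a trimmed-down `MiniCandidate` and a hypothetical `merge` function with made-up hit rows.

```python
from dataclasses import dataclass, field


@dataclass
class MiniCandidate:
    """Trimmed-down stand-in for Candidate (no content_hash or payload)."""
    key: str
    distance: float
    seeds: list[str] = field(default_factory=list)
    primary_seed: str = ""

    @property
    def consensus(self) -> int:
        return len(self.seeds)


def merge(per_seed_hits: list[tuple[str, list[dict]]]) -> dict[str, MiniCandidate]:
    """Fold per-seed hit lists into one candidate per key (hypothetical helper)."""
    merged: dict[str, MiniCandidate] = {}
    for seed_label, rows in per_seed_hits:
        for row in rows:
            c = merged.get(row["key"])
            if c is None:
                merged[row["key"]] = MiniCandidate(
                    row["key"], row["distance"], [seed_label], seed_label
                )
                continue
            if seed_label not in c.seeds:
                c.seeds.append(seed_label)
            # A closer hit from another seed takes over as the primary seed.
            if row["distance"] < c.distance:
                c.distance = row["distance"]
                c.primary_seed = seed_label
    return merged


hits = [
    ("cli-tool", [{"key": "did:x", "distance": 0.30}, {"key": "did:y", "distance": 0.20}]),
    ("parser", [{"key": "did:x", "distance": 0.25}]),
]
for c in merge(hits).values():
    print(c.key, c.distance, c.primary_seed, c.consensus)
```

With the default weights in `DefaultScorer`, and ignoring the recency term, `did:x` would score 0.75 + 0.10 against 0.80 for `did:y`, so agreement between two seeds can lift a slightly more distant candidate above a closer single-seed hit.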
+23
recommendation/app/vectors.py
"""Vector parsing/formatting shared by SQL (pgvector text) and git (numpy) backends."""

from __future__ import annotations

import json

import numpy as np


def parse_vector_text(text: str) -> np.ndarray:
    """Parse a pgvector-style literal ``[v1,v2,...]`` into a float32 vector.

    Values are returned as parsed; the vector is not rescaled to unit length.
    """
    raw = text.strip()
    if raw.startswith("[") and raw.endswith("]"):
        raw = raw[1:-1]
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty vector text")
    vec = np.asarray([float(p) for p in parts], dtype=np.float32)
    return vec


def vector_to_text(vec: np.ndarray) -> str:
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"
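`parse_vector_text` returns the parsed values without rescaling them, so any caller that needs unit-length vectors has to normalize separately. The sketch below shows the parse, normalize, and format round trip using only the standard library. It deliberately swaps NumPy arrays for plain lists, and `to_unit` is a hypothetical helper that does not exist in the module.

```python
import math


def parse_vector(text: str) -> list[float]:
    """Parse a pgvector-style literal such as "[1, 2, 3]" into floats."""
    raw = text.strip()
    if raw.startswith("[") and raw.endswith("]"):
        raw = raw[1:-1]
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty vector text")
    return [float(p) for p in parts]


def to_unit(vec: list[float]) -> list[float]:
    """Scale a vector to Euclidean length 1 (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def vector_to_text(vec: list[float]) -> str:
    """Format floats back into the bracketed, comma-separated literal."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


v = parse_vector("[3, 4]")
u = to_unit(v)
print(vector_to_text(v))  # [3.0,4.0]
print(math.isclose(math.sqrt(sum(x * x for x in u)), 1.0))  # True
```

Whether normalization matters depends on the distance operator used in the kNN queries: cosine distance is scale-invariant, while inner-product or L2 comparisons are not.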
+41
recommendation/cloudbuild.yaml
··· 1 + # Build and push the recommendation API image to Artifact Registry. 2 + # 3 + # Trigger (from repo root): 4 + # gcloud builds submit --config=recommendation/cloudbuild.yaml recommendation/ 5 + # 6 + # Or use ./recommendation/deploy.sh (build + push + Cloud Run Service deploy). 7 + 8 + substitutions: 9 + _REGION: europe-west1 10 + _REPOSITORY: tangled 11 + _IMAGE: recommendation-api 12 + 13 + steps: 14 + - id: build 15 + name: gcr.io/cloud-builders/docker 16 + args: 17 + - build 18 + - -t 19 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID} 20 + - -t 21 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest 22 + - . 23 + 24 + - id: push-build-id 25 + name: gcr.io/cloud-builders/docker 26 + args: 27 + - push 28 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID} 29 + 30 + - id: push-latest 31 + name: gcr.io/cloud-builders/docker 32 + args: 33 + - push 34 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest 35 + 36 + images: 37 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:${BUILD_ID} 38 + - ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPOSITORY}/${_IMAGE}:latest 39 + 40 + options: 41 + logging: CLOUD_LOGGING_ONLY
+95
recommendation/deploy.sh
··· 1 + #!/usr/bin/env bash 2 + # Build image (Cloud Build), push to Artifact Registry, deploy Cloud Run Service. 3 + # 4 + # Usage (from repo root): 5 + # ./recommendation/deploy.sh 6 + # 7 + # Optional overrides: 8 + # PROJECT_ID=cleveland-464404-m0 REGION=europe-west1 ./recommendation/deploy.sh 9 + # 10 + # Requires: gcloud auth, .env with DB_CONNECTION_STRING (repo root or recommendation/.env). 11 + 12 + set -euo pipefail 13 + 14 + ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" 15 + SERVICE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" 16 + ENV_FILE="${ENV_FILE:-}" 17 + if [[ -z "$ENV_FILE" ]]; then 18 + if [[ -f "$ROOT/.env" ]]; then 19 + ENV_FILE="$ROOT/.env" 20 + elif [[ -f "$SERVICE_DIR/.env" ]]; then 21 + ENV_FILE="$SERVICE_DIR/.env" 22 + fi 23 + fi 24 + 25 + REGION="${REGION:-europe-west1}" 26 + REPOSITORY="${REPOSITORY:-tangled}" 27 + IMAGE_NAME="${IMAGE_NAME:-recommendation-api}" 28 + SERVICE_NAME="${SERVICE_NAME:-tangled-recommendation}" 29 + MEMORY="${MEMORY:-512Mi}" 30 + CPU="${CPU:-1}" 31 + MIN_INSTANCES="${MIN_INSTANCES:-0}" 32 + MAX_INSTANCES="${MAX_INSTANCES:-3}" 33 + ALLOW_UNAUTHENTICATED="${ALLOW_UNAUTHENTICATED:-1}" 34 + 35 + PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null)}" 36 + if [[ -z "$PROJECT_ID" || "$PROJECT_ID" == "(unset)" ]]; then 37 + echo "ERROR: Set PROJECT_ID or run: gcloud config set project YOUR_PROJECT_ID" >&2 38 + exit 1 39 + fi 40 + 41 + if [[ -z "$ENV_FILE" || ! -f "$ENV_FILE" ]]; then 42 + echo "ERROR: Env file not found. 
Set ENV_FILE or create $ROOT/.env" >&2 43 + exit 1 44 + fi 45 + 46 + IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/${IMAGE_NAME}:latest" 47 + 48 + echo "==> Project: $PROJECT_ID" 49 + echo "==> Region: $REGION" 50 + echo "==> Image: $IMAGE" 51 + echo "==> Service: $SERVICE_NAME" 52 + echo "==> Env file: $ENV_FILE" 53 + echo 54 + 55 + echo "==> Build & push (Cloud Build)" 56 + gcloud builds submit \ 57 + --project="$PROJECT_ID" \ 58 + --config="$SERVICE_DIR/cloudbuild.yaml" \ 59 + "$SERVICE_DIR" 60 + 61 + echo 62 + echo "==> Deploy Cloud Run Service" 63 + DEPLOY_ARGS=( 64 + run deploy "$SERVICE_NAME" 65 + --project="$PROJECT_ID" 66 + --region="$REGION" 67 + --image="$IMAGE" 68 + --env-vars-file="$ENV_FILE" 69 + --port=8000 70 + --memory="$MEMORY" 71 + --cpu="$CPU" 72 + --min-instances="$MIN_INSTANCES" 73 + --max-instances="$MAX_INSTANCES" 74 + --timeout=60 75 + ) 76 + 77 + if [[ "$ALLOW_UNAUTHENTICATED" == "1" ]]; then 78 + DEPLOY_ARGS+=(--allow-unauthenticated) 79 + else 80 + DEPLOY_ARGS+=(--no-allow-unauthenticated) 81 + fi 82 + 83 + gcloud "${DEPLOY_ARGS[@]}" 84 + 85 + echo 86 + echo "==> Service URL" 87 + gcloud run services describe "$SERVICE_NAME" \ 88 + --project="$PROJECT_ID" \ 89 + --region="$REGION" \ 90 + --format='value(status.url)' 91 + 92 + echo 93 + echo "Smoke test:" 94 + URL="$(gcloud run services describe "$SERVICE_NAME" --project="$PROJECT_ID" --region="$REGION" --format='value(status.url)')" 95 + echo " curl \"${URL}/health\""
+118
recommendation/eval/harness.py
··· 1 + """Offline eval: held-out-seed retrieval (recall@k / nDCG). 2 + 3 + A content-similarity recommender excludes the user's own repos from results, so we 4 + can't hold out an owned repo and expect it *recommended*. Instead we measure the 5 + underlying relevance signal: hold out each user's most recent repo, generate 6 + candidates from their REMAINING repos (excluding the other seeds but NOT the 7 + held-out target), and check where the held-out repo ranks. A good engine ranks 8 + "what the user built next" near the top of what it would surface. 9 + 10 + Run: `python eval/harness.py` (needs DB_CONNECTION_STRING). Establishes a 11 + baseline BEFORE any ranking changes — no "feels better" tuning. 12 + """ 13 + 14 + from __future__ import annotations 15 + 16 + import math 17 + import sys 18 + 19 + from app import db 20 + from app.config import get_settings 21 + 22 + K_VALUES = (10, 20, 50) 23 + PER_SEED_K = 50 # neighbours pulled per remaining seed 24 + MAX_USERS = 60 # sample size (keeps the run quick) 25 + MIN_SEEDS = 3 # need enough seeds to hold one out and still have signal 26 + # Candidate content gate, mirroring the live service (REC_MIN_README_CHARS). 27 + # Set to 0 to reproduce the pre-gate baseline. 28 + MIN_README_CHARS = get_settings().min_readme_chars 29 + 30 + _USERS_SQL = """ 31 + select split_part(replace(repo_uri, 'at://', ''), '/', 1) as owner_did, 32 + count(*)::int as n 33 + from tangled_readmes 34 + where embedding is not null and repo_uri is not null 35 + group by 1 36 + having count(*) between %(lo)s and 30 37 + order by n desc 38 + limit %(max_users)s 39 + """ 40 + 41 + # Owned repos for one user, with createdAt so we can hold out the most recent. 
42 + _OWNED_SQL = """ 43 + select r.repo_did, 44 + r.embedding::text as etext, 45 + tr.record_raw->>'createdAt' as created_at 46 + from tangled_readmes r 47 + left join tangled_repos tr 48 + on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did 49 + where r.embedding is not null 50 + and r.repo_uri like 'at://' || %(did)s || '/%%' 51 + """ 52 + 53 + 54 + def _users() -> list[str]: 55 + with db.get_pool().connection() as conn: 56 + rows = conn.execute(_USERS_SQL, {"lo": MIN_SEEDS, "max_users": MAX_USERS}).fetchall() 57 + return [r["owner_did"] for r in rows] 58 + 59 + 60 + def _owned(did: str) -> list[dict]: 61 + with db.get_pool().connection() as conn: 62 + return [dict(r) for r in conn.execute(_OWNED_SQL, {"did": did}).fetchall()] 63 + 64 + 65 + def _rank_of_target(seeds: list[dict], target: dict) -> int | None: 66 + """Generate candidates from the remaining seeds and return the 1-based rank 67 + of the held-out target repo (None if outside the candidate pool).""" 68 + rest = [s for s in seeds if s["repo_did"] != target["repo_did"]] 69 + exclude = [s["repo_did"] for s in rest] # exclude seeds, but allow the target 70 + best: dict[str, float] = {} 71 + for s in rest: 72 + for row in db.knn_repos(s["etext"], exclude, PER_SEED_K, MIN_README_CHARS): 73 + rd = row["repo_did"] 74 + d = float(row["distance"]) 75 + if rd not in best or d < best[rd]: 76 + best[rd] = d 77 + ranked = sorted(best, key=best.get) 78 + tdid = target["repo_did"] 79 + return ranked.index(tdid) + 1 if tdid in ranked else None 80 + 81 + 82 + def main() -> int: 83 + if not get_settings().db_connection_string: 84 + print("DB_CONNECTION_STRING not set", file=sys.stderr) 85 + return 1 86 + 87 + users = _users() 88 + evaluated = 0 89 + hits = {k: 0 for k in K_VALUES} 90 + ndcg_sum = 0.0 91 + 92 + for did in users: 93 + seeds = _owned(did) 94 + if len(seeds) < MIN_SEEDS: 95 + continue 96 + # hold out the most recent repo (fallback: last by repo_did for stability) 97 + target = max(seeds, 
key=lambda s: (s.get("created_at") or "", s["repo_did"])) 98 + rank = _rank_of_target(seeds, target) 99 + evaluated += 1 100 + if rank is not None: 101 + for k in K_VALUES: 102 + if rank <= k: 103 + hits[k] += 1 104 + ndcg_sum += 1.0 / math.log2(rank + 1) # ideal DCG = 1 (single relevant item) 105 + 106 + if evaluated == 0: 107 + print("no users evaluated") 108 + return 1 109 + 110 + print(f"evaluated users: {evaluated} (per-seed k={PER_SEED_K})") 111 + for k in K_VALUES: 112 + print(f" recall@{k:<3} = {hits[k] / evaluated:.3f}") 113 + print(f" nDCG = {ndcg_sum / evaluated:.3f}") 114 + return 0 115 + 116 + 117 + if __name__ == "__main__": 118 + raise SystemExit(main())
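The metrics printed by the harness depend only on the list of held-out ranks, so the aggregation can be checked in isolation. The sketch below recomputes recall@k and the single-relevant-item nDCG from a hand-written rank list. The function name `summarize` and the sample ranks are illustrative; `None` stands for a target that never entered the candidate pool.

```python
import math
from typing import Optional


def summarize(ranks: list[Optional[int]], k_values=(10, 20, 50)) -> dict[str, float]:
    """Aggregate 1-based ranks into recall@k and mean nDCG.

    With one relevant item per user the ideal DCG is 1, so each user's
    nDCG is simply 1 / log2(rank + 1), or 0 when the target is missing.
    """
    n = len(ranks)
    if n == 0:
        raise ValueError("no users evaluated")
    out: dict[str, float] = {}
    for k in k_values:
        out[f"recall@{k}"] = sum(1 for r in ranks if r is not None and r <= k) / n
    out["nDCG"] = sum(1.0 / math.log2(r + 1) for r in ranks if r is not None) / n
    return out


# Four hypothetical users: ranks 1, 3, missing, and 12.
result = summarize([1, 3, None, 12])
for name, value in result.items():
    print(f"{name:10} = {value:.3f}")
```

For this input recall@10 is 0.500 and recall@20 is 0.750; users whose target is missing still count in the denominator, which is how the harness treats them as well.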
+161
recommendation/package-lock.json
··· 1 + { 2 + "name": "tangled-recommendation", 3 + "version": "0.1.0", 4 + "lockfileVersion": 3, 5 + "requires": true, 6 + "packages": { 7 + "": { 8 + "name": "tangled-recommendation", 9 + "version": "0.1.0", 10 + "dependencies": { 11 + "pg": "^8.22.0" 12 + } 13 + }, 14 + "node_modules/pg": { 15 + "version": "8.22.0", 16 + "resolved": "https://registry.npmjs.org/pg/-/pg-8.22.0.tgz", 17 + "integrity": "sha512-8wih1vVIBMxoUM2oB4soJsD9tDnDpLv4OXBJ+EJzFsvycD+lfyIreC2gGHq78f8jbLLt+bvlPTFdFZfJkOuzAA==", 18 + "license": "MIT", 19 + "dependencies": { 20 + "pg-connection-string": "^2.14.0", 21 + "pg-pool": "^3.14.0", 22 + "pg-protocol": "^1.15.0", 23 + "pg-types": "2.2.0", 24 + "pgpass": "1.0.5" 25 + }, 26 + "engines": { 27 + "node": ">= 16.0.0" 28 + }, 29 + "optionalDependencies": { 30 + "pg-cloudflare": "^1.4.0" 31 + }, 32 + "peerDependencies": { 33 + "pg-native": ">=3.0.1" 34 + }, 35 + "peerDependenciesMeta": { 36 + "pg-native": { 37 + "optional": true 38 + } 39 + } 40 + }, 41 + "node_modules/pg-cloudflare": { 42 + "version": "1.4.0", 43 + "resolved": "https://registry.npmjs.org/pg-cloudflare/-/pg-cloudflare-1.4.0.tgz", 44 + "integrity": "sha512-Vo7z/6rrQYxpNRylp4Tlob2elzbh+N/MOQbxFVWCxS7oEx6jF53GTJFxK2WWpKuBRkmiin4Mt+xofFDjx09R0A==", 45 + "license": "MIT", 46 + "optional": true 47 + }, 48 + "node_modules/pg-connection-string": { 49 + "version": "2.14.0", 50 + "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.14.0.tgz", 51 + "integrity": "sha512-XwWDGcLRGCXAR8F/AM5bG7Q+A3Wm2s6QeEjlOKZLlH3UYcguiqCWKyWXVag5TLTIjR7oOJUY8kcADaZgWPyLeg==", 52 + "license": "MIT" 53 + }, 54 + "node_modules/pg-int8": { 55 + "version": "1.0.1", 56 + "resolved": "https://registry.npmjs.org/pg-int8/-/pg-int8-1.0.1.tgz", 57 + "integrity": "sha512-WCtabS6t3c8SkpDBUlb1kjOs7l66xsGdKpIPZsg4wR+B3+u9UAum2odSsF9tnvxg80h4ZxLWMy4pRjOsFIqQpw==", 58 + "license": "ISC", 59 + "engines": { 60 + "node": ">=4.0.0" 61 + } 62 + }, 63 + "node_modules/pg-pool": { 64 + "version": 
"3.14.0", 65 + "resolved": "https://registry.npmjs.org/pg-pool/-/pg-pool-3.14.0.tgz", 66 + "integrity": "sha512-gKtPkFdQPU3DksooVLi9LsjZxrsBUZIpa+7aVx+LV5pNh0KzP4Zleud2po+ConrxbuXGBJ6Hfer6hdgpIBpBaw==", 67 + "license": "MIT", 68 + "peerDependencies": { 69 + "pg": ">=8.0" 70 + } 71 + }, 72 + "node_modules/pg-protocol": { 73 + "version": "1.15.0", 74 + "resolved": "https://registry.npmjs.org/pg-protocol/-/pg-protocol-1.15.0.tgz", 75 + "integrity": "sha512-cq9sECI5s0+uPUXjbz8ioyPJni6RzsRib0US67i5IoTZKw8fNeYlVE7u8F4dG7vEJJtc5wdD1K189lCCUwqWTQ==", 76 + "license": "MIT" 77 + }, 78 + "node_modules/pg-types": { 79 + "version": "2.2.0", 80 + "resolved": "https://registry.npmjs.org/pg-types/-/pg-types-2.2.0.tgz", 81 + "integrity": "sha512-qTAAlrEsl8s4OiEQY69wDvcMIdQN6wdz5ojQiOy6YRMuynxenON0O5oCpJI6lshc6scgAY8qvJ2On/p+CXY0GA==", 82 + "license": "MIT", 83 + "dependencies": { 84 + "pg-int8": "1.0.1", 85 + "postgres-array": "~2.0.0", 86 + "postgres-bytea": "~1.0.0", 87 + "postgres-date": "~1.0.4", 88 + "postgres-interval": "^1.1.0" 89 + }, 90 + "engines": { 91 + "node": ">=4" 92 + } 93 + }, 94 + "node_modules/pgpass": { 95 + "version": "1.0.5", 96 + "resolved": "https://registry.npmjs.org/pgpass/-/pgpass-1.0.5.tgz", 97 + "integrity": "sha512-FdW9r/jQZhSeohs1Z3sI1yxFQNFvMcnmfuj4WBMUTxOrAyLMaTcE1aAMBiTlbMNaXvBCQuVi0R7hd8udDSP7ug==", 98 + "license": "MIT", 99 + "dependencies": { 100 + "split2": "^4.1.0" 101 + } 102 + }, 103 + "node_modules/postgres-array": { 104 + "version": "2.0.0", 105 + "resolved": "https://registry.npmjs.org/postgres-array/-/postgres-array-2.0.0.tgz", 106 + "integrity": "sha512-VpZrUqU5A69eQyW2c5CA1jtLecCsN2U/bD6VilrFDWq5+5UIEVO7nazS3TEcHf1zuPYO/sqGvUvW62g86RXZuA==", 107 + "license": "MIT", 108 + "engines": { 109 + "node": ">=4" 110 + } 111 + }, 112 + "node_modules/postgres-bytea": { 113 + "version": "1.0.1", 114 + "resolved": "https://registry.npmjs.org/postgres-bytea/-/postgres-bytea-1.0.1.tgz", 115 + "integrity": 
"sha512-5+5HqXnsZPE65IJZSMkZtURARZelel2oXUEO8rH83VS/hxH5vv1uHquPg5wZs8yMAfdv971IU+kcPUczi7NVBQ==", 116 + "license": "MIT", 117 + "engines": { 118 + "node": ">=0.10.0" 119 + } 120 + }, 121 + "node_modules/postgres-date": { 122 + "version": "1.0.7", 123 + "resolved": "https://registry.npmjs.org/postgres-date/-/postgres-date-1.0.7.tgz", 124 + "integrity": "sha512-suDmjLVQg78nMK2UZ454hAG+OAW+HQPZ6n++TNDUX+L0+uUlLywnoxJKDou51Zm+zTCjrCl0Nq6J9C5hP9vK/Q==", 125 + "license": "MIT", 126 + "engines": { 127 + "node": ">=0.10.0" 128 + } 129 + }, 130 + "node_modules/postgres-interval": { 131 + "version": "1.2.0", 132 + "resolved": "https://registry.npmjs.org/postgres-interval/-/postgres-interval-1.2.0.tgz", 133 + "integrity": "sha512-9ZhXKM/rw350N1ovuWHbGxnGh/SNJ4cnxHiM0rxE4VN41wsg8P8zWn9hv/buK00RP4WvlOyr/RBDiptyxVbkZQ==", 134 + "license": "MIT", 135 + "dependencies": { 136 + "xtend": "^4.0.0" 137 + }, 138 + "engines": { 139 + "node": ">=0.10.0" 140 + } 141 + }, 142 + "node_modules/split2": { 143 + "version": "4.2.0", 144 + "resolved": "https://registry.npmjs.org/split2/-/split2-4.2.0.tgz", 145 + "integrity": "sha512-UcjcJOWknrNkF6PLX83qcHM6KHgVKNkV62Y8a5uYDVv9ydGQVwAHMKqHdJje1VTWpljG0WYpCDhrCdAOYH4TWg==", 146 + "license": "ISC", 147 + "engines": { 148 + "node": ">= 10.x" 149 + } 150 + }, 151 + "node_modules/xtend": { 152 + "version": "4.0.2", 153 + "resolved": "https://registry.npmjs.org/xtend/-/xtend-4.0.2.tgz", 154 + "integrity": "sha512-LKYU1iAXJXUgAXn9URjiu+MWhyUXHsvfp7mcuYm9dSUKK0/CjtrUwFAxD82/mCWbtLsGjFIad0wIsod4zrTAEQ==", 155 + "license": "MIT", 156 + "engines": { 157 + "node": ">=0.4" 158 + } 159 + } 160 + } 161 + }
+15
recommendation/package.json
··· 1 + { 2 + "name": "tangled-recommendation-reference", 3 + "version": "0.1.0", 4 + "private": true, 5 + "type": "module", 6 + "description": "Validated Node reference scripts (oracle) for the Python recommendation engine — see app/ for the service.", 7 + "scripts": { 8 + "embed:check": "DRY_RUN=1 node reference/src/embed_readmes.mjs", 9 + "embed": "node reference/src/embed_readmes.mjs", 10 + "readme-coverage": "node reference/src/readme_coverage.mjs" 11 + }, 12 + "dependencies": { 13 + "pg": "^8.22.0" 14 + } 15 + }
+29
recommendation/pyproject.toml
···
+ [project]
+ name = "tangled-recommendation"
+ version = "0.1.0"
+ description = "Recommendation + search engine for Tangled (AT Protocol) repo/issue discovery"
+ requires-python = ">=3.11"
+ dependencies = [
+     "fastapi>=0.115",
+     "uvicorn[standard]>=0.30",
+     "psycopg[binary,pool]>=3.2",
+     "pydantic>=2.7",
+     "python-dotenv>=1.0",
+     "numpy>=1.26",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0",
+     "httpx>=0.27",
+ ]
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+
+ [tool.setuptools.packages.find]
+ include = ["app*"]
+
+ [build-system]
+ requires = ["setuptools>=68"]
+ build-backend = "setuptools.build_meta"
+52
recommendation/reference/src/check_new.mjs
··· 1 + import pg from "pg"; 2 + import { readFileSync } from "node:fs"; 3 + function loadConn() { 4 + if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING; 5 + for (const p of ["../.env", ".env", "../../.env"]) { 6 + try { 7 + const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); 8 + if (m) return m[1].trim(); 9 + } catch {} 10 + } 11 + throw new Error("DB_CONNECTION_STRING not found"); 12 + } 13 + const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 }); 14 + 15 + console.log("=== all tables/views (every schema) ==="); 16 + console.table((await pool.query(` 17 + select table_schema, table_name, table_type 18 + from information_schema.tables 19 + where table_schema not in ('pg_catalog','information_schema') 20 + order by table_schema, table_name`)).rows); 21 + 22 + console.log("\n=== columns matching 'embed' or 'readme' (any table) ==="); 23 + const hits = await pool.query(` 24 + select table_schema, table_name, column_name, data_type 25 + from information_schema.columns 26 + where table_schema not in ('pg_catalog','information_schema') 27 + and (column_name ~* 'embed|readme|vector') 28 + order by table_schema, table_name, ordinal_position`); 29 + console.table(hits.rows.length ? hits.rows : [{ note: "no columns named embed*/readme*/vector*" }]); 30 + 31 + console.log("\n=== tables matching 'embed' or 'readme' by NAME ==="); 32 + const tn = await pool.query(` 33 + select table_schema, table_name from information_schema.tables 34 + where table_name ~* 'embed|readme' 35 + order by 1,2`); 36 + console.table(tn.rows.length ? 
tn.rows : [{ note: "no table named embed*/readme*" }]); 37 + 38 + // If a readme column/table exists, show count + sample 39 + for (const r of [...hits.rows, ...tn.rows]) { 40 + const t = `"${r.table_schema}"."${r.table_name}"`; 41 + try { 42 + const c = await pool.query(`select count(*)::int n from ${t}`); 43 + console.log(`count ${t}: ${c.rows[0].n}`); 44 + } catch (e) { /* ignore dup */ } 45 + } 46 + 47 + console.log("\n=== columns on tangled_repos (did a readme/embedding col get added here?) ==="); 48 + console.table((await pool.query(` 49 + select column_name, data_type from information_schema.columns 50 + where table_schema='public' and table_name='tangled_repos' order by ordinal_position`)).rows); 51 + 52 + await pool.end();
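The `loadConn` helper in this script prefers `process.env` and otherwise pulls `DB_CONNECTION_STRING` out of a `.env` file with a multiline regex. A minimal sketch of only the parsing step follows, written as a pure function so it can be checked without touching the filesystem. The name `parseEnvValue` and the sample values are hypothetical; the quote stripping follows the variant used in `embed_readmes.mjs`, and this is not a full dotenv parser.

```javascript
// Hypothetical helper: extract KEY=value from .env-style text.
// Assumes `key` is a plain identifier (it is interpolated into the regex unescaped).
function parseEnvValue(text, key) {
  const m = text.match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)$`, "m"));
  if (!m) return undefined;
  // Trim surrounding whitespace, then drop a leading and/or trailing quote character.
  return m[1].trim().replace(/^["']|["']$/g, "");
}

// Made-up sample content, for illustration only.
const sample = [
  "# local settings",
  'DB_CONNECTION_STRING="postgres://user:pw@localhost:5432/tangled"',
  "GEMINI_API_KEY = abc123  ",
].join("\n");

console.log(parseEnvValue(sample, "DB_CONNECTION_STRING"));
console.log(parseEnvValue(sample, "GEMINI_API_KEY"));
console.log(parseEnvValue(sample, "MISSING"));
```

Keeping the parser separate from the file lookup means the search path list (`../.env`, `.env`, `../../.env`) can vary per script without duplicating the regex.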
+51
recommendation/reference/src/check_readmes.mjs
··· 1 + import pg from "pg"; 2 + import { readFileSync } from "node:fs"; 3 + function loadConn() { 4 + if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING; 5 + for (const p of ["../.env", ".env"]) { 6 + try { const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); if (m) return m[1].trim(); } catch {} 7 + } 8 + throw new Error("no conn"); 9 + } 10 + const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 }); 11 + 12 + console.log("=== full tangled_readmes columns ==="); 13 + console.table((await pool.query(` 14 + select c.ordinal_position, c.column_name, c.data_type, c.udt_name, c.is_nullable 15 + from information_schema.columns c 16 + where c.table_schema='public' and c.table_name='tangled_readmes' 17 + order by c.ordinal_position`)).rows); 18 + 19 + console.log("\n=== embedding column: pgvector dimensions ==="); 20 + console.table((await pool.query(` 21 + select a.attname, format_type(a.atttypid, a.atttypmod) as type 22 + from pg_attribute a 23 + join pg_class c on c.oid=a.attrelid join pg_namespace n on n.oid=c.relnamespace 24 + where n.nspname='public' and c.relname='tangled_readmes' and a.attnum>0 and not a.attisdropped 25 + and format_type(a.atttypid,a.atttypmod) ~* 'vector'`)).rows); 26 + 27 + console.log("\n=== counts ==="); 28 + console.table((await pool.query(` 29 + select 30 + count(*)::int total, 31 + count(*) filter (where embedding is not null)::int with_embedding, 32 + count(distinct embedding_model)::int models 33 + from tangled_readmes`)).rows); 34 + 35 + console.log("\n=== embedding_model values ==="); 36 + console.table((await pool.query(`select embedding_model, count(*)::int n from tangled_readmes group by 1 order by 2 desc`)).rows); 37 + 38 + console.log("\n=== indexes on tangled_readmes (ivfflat/hnsw?) 
==="); 39 + console.table((await pool.query(`select indexname, indexdef from pg_indexes where schemaname='public' and tablename='tangled_readmes'`)).rows); 40 + 41 + console.log("\n=== sample row (text truncated, embedding omitted) ==="); 42 + const s = await pool.query(`select * from tangled_readmes limit 1`); 43 + if (s.rows.length) { 44 + const r = { ...s.rows[0] }; 45 + for (const k of Object.keys(r)) { 46 + if (k === "embedding") r[k] = `<vector len=${typeof r[k] === "string" ? (r[k].match(/,/g)?.length ?? 0) + 1 : "?"}>`; 47 + else if (typeof r[k] === "string" && r[k].length > 160) r[k] = r[k].slice(0, 160) + "…"; 48 + } 49 + console.log(JSON.stringify(r, null, 2)); 50 + } 51 + await pool.end();
+80
recommendation/reference/src/clustered_recommend.mjs
··· 1 + // Cluster-then-retrieve recommender: preserves a user's multiple distinct interests. 2 + // Contrasts NAIVE pooled top-K (one cluster can dominate) vs CLUSTERED round-robin (balanced). 3 + import pg from "pg"; 4 + import { readFileSync } from "node:fs"; 5 + import { createHash } from "node:crypto"; 6 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 7 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 }); 8 + 9 + const USER = process.env.USER_DID || "did:plc:y7g2koy4nqw7434s67fgfjca"; 10 + const K = parseInt(process.env.K ?? "10", 10); 11 + const T = parseFloat(process.env.CLUSTER_T ?? "0.22"); // cosine-dist threshold to consider two seeds "same interest" 12 + const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex"); 13 + const parseVec = (s) => s.replace(/^\[|\]$/g, "").split(",").map(Number); 14 + const cosDist = (a, b) => { let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return 1 - d; }; 15 + 16 + async function main() { 17 + // 1) the user's contributed repos (here: owned) with embeddings 18 + const seeds = (await pool.query( 19 + `select repo_did, repo_name, content, embedding::text as etext 20 + from tangled_readmes where embedding is not null and repo_uri like $1`, [`at://${USER}/%`])).rows; 21 + if (seeds.length < 2) { console.log("not enough embedded seed repos for this user"); await pool.end(); return; } 22 + seeds.forEach((s) => (s.vec = parseVec(s.etext))); 23 + console.log(`USER ${USER}`); 24 + console.log(`contributed repos (${seeds.length}): ${seeds.map((s) => s.repo_name).join(", ")}\n`); 25 + 26 + // 2) cluster seeds: single-linkage connected components at threshold T (union-find) 27 + const parent = seeds.map((_, i) => i); 28 + const find = (x) => (parent[x] === 
x ? x : (parent[x] = find(parent[x]))); 29 + for (let i = 0; i < seeds.length; i++) 30 + for (let j = i + 1; j < seeds.length; j++) 31 + if (cosDist(seeds[i].vec, seeds[j].vec) < T) parent[find(i)] = find(j); 32 + const clusters = new Map(); 33 + seeds.forEach((s, i) => { const r = find(i); (clusters.get(r) ?? clusters.set(r, []).get(r)).push(s); }); 34 + const clusterList = [...clusters.values()]; 35 + console.log(`→ ${clusterList.length} interest cluster(s):`); 36 + clusterList.forEach((c, i) => console.log(` [${i + 1}] ${c.map((s) => s.repo_name).join(", ")}`)); 37 + 38 + // 3) retrieve neighbors per seed (drop user's own repos), tag with cluster + min dist 39 + const ownRepoDids = new Set(seeds.map((s) => s.repo_did)); 40 + const seenContent = new Set(seeds.map((s) => hash(s.content))); 41 + // candidate -> { repo_name, dist, clusterIdx } 42 + const cand = new Map(); 43 + for (let ci = 0; ci < clusterList.length; ci++) { 44 + for (const seed of clusterList[ci]) { 45 + const rows = (await pool.query( 46 + `select repo_name, repo_did, content, round((embedding <=> $1::vector)::numeric,4) dist 47 + from tangled_readmes where embedding is not null and repo_did <> all($2) 48 + order by embedding <=> $1::vector limit 25`, [seed.etext, [...ownRepoDids]])).rows; 49 + for (const r of rows) { 50 + const h = hash(r.content); 51 + if (seenContent.has(h)) continue; // collapse forks / user's own content 52 + const prev = cand.get(h); 53 + const dist = Number(r.dist); 54 + if (!prev || dist < prev.dist) cand.set(h, { repo_name: r.repo_name, dist, clusterIdx: ci }); 55 + } 56 + } 57 + } 58 + const all = [...cand.values()]; 59 + 60 + // 4a) NAIVE pooled: global top-K by distance 61 + const naive = [...all].sort((a, b) => a.dist - b.dist).slice(0, K); 62 + 63 + // 4b) CLUSTERED round-robin: rank within each cluster, then take turns → balanced coverage 64 + const perCluster = clusterList.map((_, ci) => all.filter((c) => c.clusterIdx === ci).sort((a, b) => a.dist - b.dist)); 65 + 
const clustered = []; 66 + const used = new Set(); 67 + for (let round = 0; clustered.length < K && round < 50; round++) { 68 + for (let ci = 0; ci < perCluster.length && clustered.length < K; ci++) { 69 + const next = perCluster[ci].find((c) => !used.has(c.repo_name)); 70 + if (next) { used.add(next.repo_name); clustered.push(next); } 71 + } 72 + } 73 + 74 + const fmt = (arr) => arr.map((c, i) => ` ${String(i + 1).padStart(2)}. ${(c.repo_name ?? "?").padEnd(30)} dist=${c.dist} [interest ${c.clusterIdx + 1}]`).join("\n"); 75 + const cover = (arr) => { const s = new Set(arr.map((c) => c.clusterIdx)); return `${s.size}/${clusterList.length} interests`; }; 76 + console.log(`\n===== NAIVE pooled top-${K} (covers ${cover(naive)}) =====\n${fmt(naive)}`); 77 + console.log(`\n===== CLUSTERED round-robin top-${K} (covers ${cover(clustered)}) =====\n${fmt(clustered)}`); 78 + await pool.end(); 79 + } 80 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
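The two core steps of `clustered_recommend.mjs`, single-linkage clustering via union-find and the round-robin merge, can be sketched without a database. The version below is an illustration under assumptions: vectors are already unit-length (so cosine distance is `1 - dot`), the ranked lists are disjoint (the script additionally de-duplicates by `repo_name`), and the names `clusterByThreshold`, `roundRobin`, and the toy data are hypothetical.

```javascript
// Cosine distance for unit-length vectors: 1 - dot product.
const cosDist = (a, b) => {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return 1 - dot;
};

// Single-linkage clustering: link any two items closer than `threshold`,
// then return the connected components as arrays of indices (union-find).
function clusterByThreshold(vecs, threshold) {
  const parent = vecs.map((_, i) => i);
  const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  for (let i = 0; i < vecs.length; i++)
    for (let j = i + 1; j < vecs.length; j++)
      if (cosDist(vecs[i], vecs[j]) < threshold) parent[find(i)] = find(j);
  const groups = new Map();
  vecs.forEach((_, i) => {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(i);
  });
  return [...groups.values()];
}

// Round-robin merge: take the next-best candidate from each ranked list in turn.
function roundRobin(rankedLists, k) {
  const out = [];
  const longest = Math.max(0, ...rankedLists.map((l) => l.length));
  for (let pos = 0; pos < longest && out.length < k; pos++)
    for (const list of rankedLists)
      if (pos < list.length && out.length < k) out.push(list[pos]);
  return out;
}

// Toy 2-D unit vectors: two near the x-axis, one on the y-axis.
const toyVecs = [[1, 0], [0.995, 0.0999], [0, 1]];
const toyClusters = clusterByThreshold(toyVecs, 0.22);
console.log(toyClusters); // indices grouped per interest cluster
console.log(roundRobin([["a1", "a2", "a3"], ["b1"]], 3));
```

Single linkage can chain loosely related seeds into one cluster when intermediate repos bridge them, so the threshold (`CLUSTER_T`, default `0.22` in the script) is worth tuning per dataset.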
+29
recommendation/reference/src/discover_api.mjs
··· 1 + // Probe knot1.tangled.sh to discover the real XRPC endpoint names/paths. 2 + const repo = "did:plc:qsctypxlsrippb5wculrsj7q"; // had a cached languages snapshot 3 + const host = "knot1.tangled.sh"; 4 + 5 + const candidates = [ 6 + `/xrpc/sh.tangled.repo.languages?repo=${repo}`, 7 + `/xrpc/sh.tangled.repo.branches?repo=${repo}&limit=100`, 8 + `/xrpc/sh.tangled.repo.getDefaultBranch?repo=${repo}`, 9 + `/xrpc/sh.tangled.git.temp.getTree?repo=${repo}&ref=HEAD&path=`, 10 + `/xrpc/sh.tangled.git.temp.getTree?repo=${repo}&ref=main&path=`, 11 + `/xrpc/sh.tangled.git.temp.getEntry?repo=${repo}&ref=HEAD&path=README.md`, 12 + `/xrpc/sh.tangled.git.listRefs?repo=${repo}`, 13 + `/xrpc/sh.tangled.git.temp.listBranches?repo=${repo}`, 14 + ]; 15 + 16 + for (const path of candidates) { 17 + const url = `https://${host}${path}`; 18 + try { 19 + const ctrl = new AbortController(); 20 + const t = setTimeout(() => ctrl.abort(), 10000); 21 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 22 + clearTimeout(t); 23 + const txt = await resp.text(); 24 + console.log(`[${resp.status}] ${path}`); 25 + console.log(` -> ${txt.slice(0, 240).replace(/\n/g, " ")}`); 26 + } catch (e) { 27 + console.log(`[ERR] ${path} -> ${e.name}:${e.message}`); 28 + } 29 + }
+33
recommendation/reference/src/discover_api2.mjs
··· 1 + const repo = "did:plc:qsctypxlsrippb5wculrsj7q"; 2 + const host = "knot1.tangled.sh"; 3 + const ref = "trunk"; 4 + 5 + const candidates = [ 6 + `/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=${ref}&path=`, 7 + `/xrpc/sh.tangled.repo.getTree?repo=${repo}&ref=${ref}&path=`, 8 + `/xrpc/sh.tangled.repo.index?repo=${repo}&ref=${ref}`, 9 + `/xrpc/sh.tangled.repo.index?repo=${repo}`, 10 + `/xrpc/sh.tangled.repo.readme?repo=${repo}&ref=${ref}`, 11 + `/xrpc/sh.tangled.repo.getReadme?repo=${repo}&ref=${ref}`, 12 + `/xrpc/sh.tangled.repo.tags?repo=${repo}&limit=100`, 13 + `/xrpc/sh.tangled.repo.listFiles?repo=${repo}&ref=${ref}&path=`, 14 + `/xrpc/sh.tangled.repo.files?repo=${repo}&ref=${ref}&path=`, 15 + `/xrpc/sh.tangled.repo.blob?repo=${repo}&ref=${ref}&path=README.md`, 16 + `/xrpc/sh.tangled.repo.getBlob?repo=${repo}&ref=${ref}&path=README.md`, 17 + `/xrpc/sh.tangled.repo.entry?repo=${repo}&ref=${ref}&path=README.md`, 18 + ]; 19 + 20 + for (const path of candidates) { 21 + const url = `https://${host}${path}`; 22 + try { 23 + const ctrl = new AbortController(); 24 + const t = setTimeout(() => ctrl.abort(), 10000); 25 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 26 + clearTimeout(t); 27 + const txt = await resp.text(); 28 + console.log(`[${resp.status}] ${path.split("?")[0].replace("/xrpc/", "")}`); 29 + if (resp.ok) console.log(` -> ${txt.slice(0, 400).replace(/\n/g, " ")}`); 30 + } catch (e) { 31 + console.log(`[ERR] ${path} -> ${e.name}`); 32 + } 33 + }
+23
recommendation/reference/src/discover_api3.mjs
··· 1 + const repo = "did:plc:qsctypxlsrippb5wculrsj7q"; 2 + const host = "knot1.tangled.sh"; 3 + async function get(path) { 4 + const ctrl = new AbortController(); 5 + const t = setTimeout(() => ctrl.abort(), 10000); 6 + try { 7 + const resp = await fetch(`https://${host}${path}`, { signal: ctrl.signal, headers: { accept: "application/json" } }); 8 + const txt = await resp.text(); 9 + return { status: resp.status, txt }; 10 + } finally { clearTimeout(t); } 11 + } 12 + // Full tree (does it include a readme field? top-level keys?) 13 + const full = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=trunk&path=`); 14 + let j; try { j = JSON.parse(full.txt); } catch {} 15 + console.log("tree top-level keys:", j ? Object.keys(j) : "(parse fail)"); 16 + console.log("file names:", (j?.files || []).map((f) => f.name)); 17 + console.log("has top-level 'readme' key:", j && "readme" in j, "->", JSON.stringify(j?.readme)?.slice(0, 120)); 18 + // Without ref 19 + const noref = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&path=`); 20 + console.log("\ntree WITHOUT ref: status", noref.status, "->", noref.txt.slice(0, 120).replace(/\n/g, " ")); 21 + // Empty ref 22 + const emptyref = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=&path=`); 23 + console.log("tree EMPTY ref: status", emptyref.status, "->", emptyref.txt.slice(0, 120).replace(/\n/g, " "));
+169
recommendation/reference/src/embed_readmes.mjs
··· 1 + // Embed all unembedded READMEs in tangled_readmes using Google Gemini embeddings. 2 + // 3 + // - Reads the worklist (status='found' AND content IS NOT NULL AND embedding IS NULL), 4 + // the exact predicate behind tangled_readmes_unembedded_idx. 5 + // - Embeds doc = "# <name>\n\n<description>\n\n<README>" with gemini-embedding-001 at 6 + // outputDimensionality=1536 (matches the vector(1536) column), task RETRIEVAL_DOCUMENT. 7 + // - L2-normalizes (sub-3072 MRL dims aren't auto-normalized) so the HNSW cosine index is happy. 8 + // - UPDATEs only the embedding columns, only where embedding IS NULL → idempotent / re-runnable. 9 + // 10 + // Env: DB_CONNECTION_STRING (or ../.env), GEMINI_API_KEY (required). 11 + // Optional: LIMIT (0=all), CONCURRENCY (default 4), DRY_RUN=1 (count only), MAX_CHARS (default 8000). 12 + 13 + import pg from "pg"; 14 + import { readFileSync } from "node:fs"; 15 + 16 + function fromEnvFile(key) { 17 + for (const p of ["../.env", ".env", "../../.env"]) { 18 + try { 19 + const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)\\s*$`, "m")); 20 + if (m) return m[1].trim().replace(/^["']|["']$/g, ""); 21 + } catch {} 22 + } 23 + return undefined; 24 + } 25 + 26 + const CONN = process.env.DB_CONNECTION_STRING || fromEnvFile("DB_CONNECTION_STRING"); 27 + const API_KEY = process.env.GEMINI_API_KEY || fromEnvFile("GEMINI_API_KEY"); 28 + const MODEL = process.env.GEMINI_EMBED_MODEL || fromEnvFile("GEMINI_EMBED_MODEL") || "gemini-embedding-001"; 29 + const DIMS = 1536; 30 + const LIMIT = parseInt(process.env.LIMIT ?? "0", 10); 31 + const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "4", 10); 32 + const MAX_CHARS = parseInt(process.env.MAX_CHARS ?? 
"8000", 10); 33 + const DRY_RUN = process.env.DRY_RUN === "1"; 34 + 35 + if (!CONN) { console.error("DB_CONNECTION_STRING not set"); process.exit(1); } 36 + if (!API_KEY && !DRY_RUN) { console.error("GEMINI_API_KEY not set (add it to recommendation/.env)"); process.exit(1); } 37 + 38 + const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 5 }); 39 + const sleep = (ms) => new Promise((r) => setTimeout(r, ms)); 40 + 41 + function buildDoc({ repo_name, description, content }) { 42 + const parts = []; 43 + if (repo_name) parts.push(`# ${repo_name}`); 44 + if (description && description.trim()) parts.push(description.trim()); 45 + parts.push(content); 46 + return parts.join("\n\n").slice(0, MAX_CHARS); 47 + } 48 + 49 + function l2normalize(v) { 50 + let s = 0; 51 + for (const x of v) s += x * x; 52 + const n = Math.sqrt(s) || 1; 53 + return v.map((x) => x / n); 54 + } 55 + 56 + async function embedOnce(text, dims) { 57 + const url = `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`; 58 + const body = { 59 + model: `models/${MODEL}`, 60 + content: { parts: [{ text }] }, 61 + taskType: "RETRIEVAL_DOCUMENT", 62 + outputDimensionality: dims, 63 + }; 64 + const resp = await fetch(url, { 65 + method: "POST", 66 + headers: { "content-type": "application/json", "x-goog-api-key": API_KEY }, 67 + body: JSON.stringify(body), 68 + }); 69 + const txt = await resp.text(); 70 + if (!resp.ok) { 71 + const err = new Error(`HTTP ${resp.status}: ${txt.slice(0, 200)}`); 72 + err.status = resp.status; 73 + throw err; 74 + } 75 + const j = JSON.parse(txt); 76 + const values = j?.embedding?.values; 77 + if (!Array.isArray(values)) throw new Error(`no embedding in response: ${txt.slice(0, 150)}`); 78 + return values; 79 + } 80 + 81 + // Embed with retries; on 400 (often too-long input) retry once with a hard truncation. 
82 + async function embedWithRetry(text) { 83 + let attempt = 0; 84 + let input = text; 85 + while (true) { 86 + try { 87 + const v = await embedOnce(input, DIMS); 88 + return l2normalize(v); 89 + } catch (e) { 90 + attempt++; 91 + if (e.status === 400 && input.length > 2000) { 92 + input = input.slice(0, Math.floor(input.length / 2)); 93 + continue; 94 + } 95 + if (attempt >= 5 || (e.status && e.status >= 400 && e.status < 500 && e.status !== 429)) { 96 + throw e; 97 + } 98 + const backoff = Math.min(30000, 800 * 2 ** (attempt - 1)); 99 + await sleep(backoff); 100 + } 101 + } 102 + } 103 + 104 + async function main() { 105 + const worklistSql = ` 106 + select r.repo_did, r.repo_name, r.content, 107 + coalesce(tr.record_raw->>'description', '') as description, 108 + length(r.content) as len 109 + from tangled_readmes r 110 + left join tangled_repos tr 111 + on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did 112 + where r.status = 'found' and r.content is not null and r.embedding is null 113 + order by r.repo_did 114 + ${LIMIT > 0 ? `limit ${LIMIT}` : ""}`; 115 + 116 + const { rows } = await pool.query(worklistSql); 117 + const totalReadmes = (await pool.query(`select count(*)::int n from tangled_readmes`)).rows[0].n; 118 + const alreadyEmbedded = (await pool.query(`select count(*)::int n from tangled_readmes where embedding is not null`)).rows[0].n; 119 + 120 + console.log(`tangled_readmes total=${totalReadmes} already embedded=${alreadyEmbedded}`); 121 + console.log(`worklist (to embed now)=${rows.length} model=${MODEL} dims=${DIMS} concurrency=${CONCURRENCY}${LIMIT ? ` limit=${LIMIT}` : ""}`); 122 + if (DRY_RUN) { console.log("\nDRY_RUN=1 → not embedding, not writing."); await pool.end(); return; } 123 + if (rows.length === 0) { console.log("\nNothing to embed. 
✔"); await pool.end(); return; } 124 + 125 + let done = 0, ok = 0, failed = 0; 126 + const errors = []; 127 + const queue = rows.slice(); 128 + 129 + async function worker(id) { 130 + while (queue.length) { 131 + const r = queue.pop(); 132 + try { 133 + const doc = buildDoc(r); 134 + const vec = await embedWithRetry(doc); 135 + const literal = `[${vec.join(",")}]`; 136 + const res = await pool.query( 137 + `update tangled_readmes 138 + set embedding = $1::vector, embedding_model = $2, embedded_at = now() 139 + where repo_did = $3 and embedding is null`, 140 + [literal, MODEL, r.repo_did], 141 + ); 142 + if (res.rowCount > 0) ok++; 143 + } catch (e) { 144 + failed++; 145 + errors.push({ repo_did: r.repo_did, name: r.repo_name, err: e.message }); 146 + } 147 + if (++done % 25 === 0 || done === rows.length) { 148 + process.stderr.write(` ...${done}/${rows.length} (ok=${ok} fail=${failed})\n`); 149 + } 150 + } 151 + } 152 + 153 + await Promise.all(Array.from({ length: CONCURRENCY }, (_, i) => worker(i))); 154 + 155 + console.log(`\n================ EMBEDDING DONE ================`); 156 + console.log(`embedded ok : ${ok}`); 157 + console.log(`failed : ${failed}`); 158 + if (errors.length) { 159 + console.log("\nfirst errors:"); 160 + for (const e of errors.slice(0, 10)) console.log(` ${e.name ?? e.repo_did}: ${e.err}`); 161 + } 162 + const remaining = (await pool.query( 163 + `select count(*)::int n from tangled_readmes where status='found' and content is not null and embedding is null`, 164 + )).rows[0].n; 165 + console.log(`\nremaining unembedded (status=found): ${remaining}`); 166 + await pool.end(); 167 + } 168 + 169 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
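Two small numeric pieces of `embed_readmes.mjs` are easy to check in isolation: the L2 normalization applied before writing to the `vector(1536)` column, and the exponential backoff used between retries. The sketch below restates both as pure functions; `backoffMs` is a hypothetical name for the inline expression in `embedWithRetry`, and the sample inputs are made up.

```javascript
// Scale a vector to unit length. A zero vector is returned unchanged,
// because the norm falls back to 1 to avoid dividing by zero.
function l2normalize(v) {
  let sumSq = 0;
  for (const x of v) sumSq += x * x;
  const norm = Math.sqrt(sumSq) || 1;
  return v.map((x) => x / norm);
}

// Delay before the next try after failure number `attempt` (1-based):
// 800 ms, doubling each time, capped at 30 s.
const backoffMs = (attempt) => Math.min(30000, 800 * 2 ** (attempt - 1));

const unit = l2normalize([3, 4]);
console.log(unit, Math.hypot(...unit)); // a unit-length vector
console.log([1, 2, 3, 4].map(backoffMs)); // delays in milliseconds
```

Because the script gives up once the attempt counter reaches 5, the longest delay it can actually sleep appears to be the fourth one; the 30 s cap only matters if the attempt limit is raised.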
+37
recommendation/reference/src/explore_users.mjs
··· 1 + // Find owners with several embedded repos, and measure how SPREAD their repos are 2 + // (high mean pairwise cosine distance = multi-interest user — good demo candidate). 3 + import pg from "pg"; 4 + import { readFileSync } from "node:fs"; 5 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 6 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 }); 7 + 8 + const ownerDid = (uri) => uri ? uri.replace("at://", "").split("/")[0] : null; 9 + function parseVec(s){ return s.replace(/^\[|\]$/g, "").split(",").map(Number); } 10 + function cos(a, b){ let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return d; } // already unit-norm 11 + 12 + const owners = (await pool.query(` 13 + select split_part(replace(repo_uri,'at://',''),'/',1) as owner_did, 14 + count(*)::int n, array_agg(repo_name) as names 15 + from tangled_readmes 16 + where embedding is not null and repo_uri is not null 17 + group by 1 having count(*) between 4 and 12 18 + order by n desc limit 25`)).rows; 19 + 20 + const scored = []; 21 + for (const o of owners) { 22 + const rows = (await pool.query( 23 + `select repo_name, embedding::text as e from tangled_readmes where embedding is not null and repo_uri like $1`, 24 + [`at://${o.owner_did}/%`])).rows; 25 + const vecs = rows.map((r) => parseVec(r.e)); 26 + let sum = 0, cnt = 0; 27 + for (let i = 0; i < vecs.length; i++) for (let j = i + 1; j < vecs.length; j++) { sum += 1 - cos(vecs[i], vecs[j]); cnt++; } 28 + const meanDist = cnt ? 
sum / cnt : 0; 29 + scored.push({ owner_did: o.owner_did, n: o.n, meanDist: +meanDist.toFixed(3), names: rows.map((r) => r.repo_name) }); 30 + } 31 + scored.sort((a, b) => b.meanDist - a.meanDist); 32 + console.log("most multi-interest owners (high mean pairwise README distance):\n"); 33 + for (const s of scored.slice(0, 8)) { 34 + console.log(`mean_dist=${s.meanDist} n=${s.n} ${s.owner_did}`); 35 + console.log(` repos: ${s.names.join(", ")}\n`); 36 + } 37 + await pool.end();
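The spread score used here is the mean cosine distance over all unordered pairs of an owner's README embeddings. A self-contained sketch of that computation follows, assuming the embeddings arrive as pgvector text literals and are already unit-norm; `meanPairwiseCosDist` and the toy literals are hypothetical.

```javascript
// Parse a pgvector text literal such as "[0.6,0.8]" into a number array.
const parseVec = (s) => s.replace(/^\[|\]$/g, "").split(",").map(Number);

// Dot product; equals cosine similarity when both inputs are unit-length.
const dot = (a, b) => {
  let d = 0;
  for (let i = 0; i < a.length; i++) d += a[i] * b[i];
  return d;
};

// Mean of (1 - cosine similarity) over every unordered pair; 0 when fewer than two vectors.
function meanPairwiseCosDist(vecs) {
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < vecs.length; i++)
    for (let j = i + 1; j < vecs.length; j++) {
      sum += 1 - dot(vecs[i], vecs[j]);
      pairs++;
    }
  return pairs ? sum / pairs : 0;
}

// Toy literals: two identical directions and one orthogonal to them.
const toy = ["[1,0]", "[0,1]", "[1,0]"].map(parseVec);
console.log(meanPairwiseCosDist(toy).toFixed(3));
```

The pair count grows quadratically with the number of repos, which is one reason the script restricts itself to owners with between 4 and 12 embedded repos.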
+42
recommendation/reference/src/fetch_issues.mjs
··· 1 + // Fetch real sh.tangled.repo.issue records live from repo-owner PDSes. 2 + import pg from "pg"; 3 + import { readFileSync } from "node:fs"; 4 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 5 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 }); 6 + 7 + // Owners of embedded repos, with their PDS host. 8 + const rows = (await pool.query(` 9 + select distinct tr.owner_did, pa.pds_host, 10 + (select repo_name from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null limit 1) as a_repo 11 + from tangled_repos tr 12 + join tangled_pds_accounts pa on pa.did = tr.owner_did 13 + where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null) 14 + limit 80`)).rows; 15 + await pool.end(); 16 + 17 + console.log(`probing ${rows.length} owner PDSes for sh.tangled.repo.issue ...`); 18 + const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`); 19 + 20 + let found = []; 21 + async function listIssues(r) { 22 + const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`; 23 + try { 24 + const ctrl = new AbortController(); const t = setTimeout(() => ctrl.abort(), 10000); 25 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 26 + clearTimeout(t); 27 + if (!resp.ok) return; 28 + const j = await resp.json(); 29 + for (const rec of j.records ?? 
[]) found.push({ owner: r.owner_did, uri: rec.uri, value: rec.value }); 30 + } catch {} 31 + } 32 + // simple concurrency 33 + const q = rows.slice(); 34 + await Promise.all(Array.from({ length: 12 }, async () => { while (q.length) await listIssues(q.pop()); })); 35 + 36 + console.log(`\nfound ${found.length} issue records`); 37 + if (found.length) { 38 + console.log("\nsample issue record value keys:", Object.keys(found[0].value)); 39 + console.log("sample record:", JSON.stringify(found[0], null, 2).slice(0, 900)); 40 + console.log("\nfirst few titles:"); 41 + for (const f of found.slice(0, 8)) console.log(` - ${f.value.title ?? "(no title)"} [repo ref: ${JSON.stringify(f.value.repo ?? f.value.subject ?? "?")}]`); 42 + }
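The request this script sends to each PDS is a plain `com.atproto.repo.listRecords` query. The sketch below isolates the URL construction so the scheme handling and DID encoding can be verified offline; `listRecordsUrl` is a hypothetical name, and `pds.example.com` and the DID are placeholders.

```javascript
// Accept either a bare host or a full origin, as the script's pdsUrl does.
const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);

// Build the listRecords URL for one account and one collection.
function listRecordsUrl(pdsHost, did, collection, limit = 30) {
  const qs = `repo=${encodeURIComponent(did)}&collection=${collection}&limit=${limit}`;
  return `${pdsUrl(pdsHost)}/xrpc/com.atproto.repo.listRecords?${qs}`;
}

console.log(listRecordsUrl("pds.example.com", "did:plc:abc", "sh.tangled.repo.issue"));
console.log(listRecordsUrl("http://localhost:2583", "did:plc:abc", "sh.tangled.repo.issue", 5));
```

As written, the script reads only the first page (up to 30 records) per owner; a complete sync would presumably need to follow the `cursor` value that `listRecords` returns.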
+95
recommendation/reference/src/issue_experiment.mjs
··· 1 + // Full experiment: fetch real Tangled issues live, embed as queries, vector-search READMEs. 2 + import pg from "pg"; 3 + import { readFileSync } from "node:fs"; 4 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 5 + const API_KEY = fromEnv("GEMINI_API_KEY"); 6 + const MODEL = "gemini-embedding-001"; 7 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 }); 8 + const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`); 9 + 10 + async function embedQuery(text) { 11 + const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, { 12 + method: "POST", headers: { "content-type": "application/json", "x-goog-api-key": API_KEY }, 13 + body: JSON.stringify({ model: `models/${MODEL}`, content: { parts: [{ text: text.slice(0, 8000) }] }, taskType: "RETRIEVAL_QUERY", outputDimensionality: 1536 }), 14 + }); 15 + if (!resp.ok) throw new Error(`embed HTTP ${resp.status}`); 16 + const v = (await resp.json()).embedding.values; 17 + let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1; 18 + return `[${v.map((x) => x / n).join(",")}]`; 19 + } 20 + 21 + // Map an issue.repo reference (bare DID or at://owner/sh.tangled.repo/rkey) -> knot repoDid in readmes. 22 + async function resolveRepoDid(ref) { 23 + if (!ref) return null; 24 + if (ref.startsWith("at://")) { 25 + const m = ref.match(/^at:\/\/([^/]+)\/[^/]+\/(.+)$/); 26 + if (!m) return null; 27 + const r = await pool.query(`select coalesce(repo_did, record_raw->>'repoDid') as rd from tangled_repos where owner_did=$1 and rkey=$2 limit 1`, [m[1], m[2]]); 28 + return r.rows[0]?.rd ?? 
null; 29 + } 30 + return ref; // bare DID == repoDid 31 + } 32 + 33 + async function fetchIssues() { 34 + const rows = (await pool.query(` 35 + select distinct tr.owner_did, pa.pds_host 36 + from tangled_repos tr join tangled_pds_accounts pa on pa.did = tr.owner_did 37 + where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null) 38 + limit 120`)).rows; 39 + const found = []; 40 + const q = rows.slice(); 41 + await Promise.all(Array.from({ length: 14 }, async () => { 42 + while (q.length) { 43 + const r = q.pop(); 44 + const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`; 45 + try { 46 + const ctrl = new AbortController(); const t = setTimeout(() => ctrl.abort(), 10000); 47 + const resp = await fetch(url, { signal: ctrl.signal }); 48 + clearTimeout(t); 49 + if (!resp.ok) continue; 50 + const j = await resp.json(); 51 + for (const rec of j.records ?? []) if (rec.value?.title) found.push(rec.value); 52 + } catch {} 53 + } 54 + })); 55 + return found; 56 + } 57 + 58 + async function main() { 59 + const issues = await fetchIssues(); 60 + console.log(`fetched ${issues.length} live issues\n`); 61 + // attach resolved repoDid + whether embedded; prefer substantive bodies whose repo is embedded 62 + for (const iss of issues) { 63 + iss._repoDid = await resolveRepoDid(iss.repo); 64 + iss._embedded = iss._repoDid 65 + ? (await pool.query(`select repo_name from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [iss._repoDid])).rows[0]?.repo_name ?? null 66 + : null; 67 + } 68 + const pick = issues 69 + .filter((i) => (i.body ?? "").length > 60) 70 + .sort((a, b) => (b._embedded ? 1 : 0) - (a._embedded ? 1 : 0) || (b.body?.length ?? 0) - (a.body?.length ?? 
0)) 71 + .slice(0, 4); 72 + 73 + for (const iss of pick) { 74 + console.log("\n" + "=".repeat(72)); 75 + console.log(`ISSUE: ${iss.title}`); 76 + console.log(`own repo: ${iss._embedded ? iss._embedded + " (embedded ✓)" : "(parent README not embedded / unresolved)"}`); 77 + console.log(`body: ${(iss.body ?? "").replace(/\s+/g, " ").slice(0, 200)}…`); 78 + const qvec = await embedQuery(`${iss.title}\n\n${iss.body ?? ""}`); 79 + const hits = (await pool.query(` 80 + select repo_name, repo_did, round((embedding <=> $1::vector)::numeric,4) dist, (repo_did=$2) is_parent 81 + from tangled_readmes where embedding is not null 82 + order by embedding <=> $1::vector limit 8`, [qvec, iss._repoDid])).rows; 83 + console.log("top README matches:"); 84 + hits.forEach((h, i) => console.log(` ${i + 1}. ${h.is_parent ? "👉" : " "} ${(h.repo_name ?? "(no name)").padEnd(34)} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`)); 85 + if (iss._embedded) { 86 + const rnk = (await pool.query(` 87 + select 1 + count(*)::int rnk from tangled_readmes 88 + where embedding is not null and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`, 89 + [qvec, iss._repoDid])).rows[0].rnk; 90 + console.log(` → own repo overall rank: #${rnk} of all embedded READMEs`); 91 + } 92 + } 93 + await pool.end(); 94 + } 95 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
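`resolveRepoDid` accepts two shapes for an issue's `repo` field: a bare DID, or an `at://owner/collection/rkey` URI that must be looked up in `tangled_repos`. The parsing half can be sketched without the database. `parseRepoRef` and its return shape are hypothetical, and the sample DIDs and rkey are placeholders.

```javascript
// Classify an issue's repo reference without touching the database.
// Returns null for empty or malformed input.
function parseRepoRef(ref) {
  if (!ref) return null;
  if (!ref.startsWith("at://")) return { kind: "did", repoDid: ref };
  // at://<owner DID>/<collection>/<rkey>
  const m = ref.match(/^at:\/\/([^/]+)\/[^/]+\/(.+)$/);
  return m ? { kind: "at-uri", ownerDid: m[1], rkey: m[2] } : null;
}

console.log(parseRepoRef("at://did:plc:abc/sh.tangled.repo/3kxyz"));
console.log(parseRepoRef("did:plc:xyz"));
console.log(parseRepoRef("at://did:plc:abc"));
```

For the `at-uri` case the script then queries `tangled_repos` by `owner_did` and `rkey` to obtain the knot-side repo DID; the bare-DID case is used as the repo DID directly.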
+84
recommendation/reference/src/issue_search.mjs
··· 1 + // Experiment: embed a Tangled issue as a query and vector-search the README embeddings. 2 + // Validates the matching: (a) does the issue's OWN repo rank highly? (b) are other hits topical? 3 + import pg from "pg"; 4 + import { readFileSync } from "node:fs"; 5 + 6 + function fromEnv(key) { 7 + if (process.env[key]) return process.env[key]; 8 + for (const p of ["../.env", ".env"]) { 9 + try { const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)$`, "m")); if (m) return m[1].trim().replace(/^["']|["']$/g, ""); } catch {} 10 + } 11 + } 12 + const CONN = fromEnv("DB_CONNECTION_STRING"); 13 + const API_KEY = fromEnv("GEMINI_API_KEY"); 14 + const MODEL = "gemini-embedding-001"; 15 + const N = parseInt(process.env.ISSUES ?? "3", 10); 16 + 17 + const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 3 }); 18 + 19 + async function embedQuery(text) { 20 + const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, { 21 + method: "POST", 22 + headers: { "content-type": "application/json", "x-goog-api-key": API_KEY }, 23 + body: JSON.stringify({ 24 + model: `models/${MODEL}`, 25 + content: { parts: [{ text: text.slice(0, 8000) }] }, 26 + taskType: "RETRIEVAL_QUERY", 27 + outputDimensionality: 1536, 28 + }), 29 + }); 30 + if (!resp.ok) throw new Error(`embed HTTP ${resp.status}: ${(await resp.text()).slice(0, 200)}`); 31 + const v = (await resp.json()).embedding.values; 32 + let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1; 33 + return `[${v.map((x) => x / n).join(",")}]`; 34 + } 35 + 36 + async function main() { 37 + const total = (await pool.query(`select count(*)::int n from tangled_issues`)).rows[0].n; 38 + console.log(`tangled_issues total: ${total}`); 39 + const joinable = (await pool.query(` 40 + select count(*)::int n from tangled_issues i 41 + where exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is 
not null)`)).rows[0].n; 42 + console.log(`issues whose parent repo has an embedded README: ${joinable}\n`); 43 + if (joinable === 0) { console.log("No joinable issues — cannot run the own-repo sanity check."); await pool.end(); return; } 44 + 45 + // Pick a few substantive issues (decent body) whose repo is embedded. 46 + const issues = (await pool.query(` 47 + select i.uri, i.repo_did, i.title, i.body, 48 + (select repo_name from tangled_readmes r where r.repo_did = i.repo_did limit 1) as parent_repo 49 + from tangled_issues i 50 + where i.title is not null and length(coalesce(i.body,'')) > 80 51 + and exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null) 52 + order by length(i.body) desc 53 + limit ${N}`)).rows; 54 + 55 + for (const iss of issues) { 56 + const queryText = `${iss.title}\n\n${iss.body}`; 57 + console.log("\n" + "=".repeat(70)); 58 + console.log(`ISSUE: ${iss.title}`); 59 + console.log(`parent repo: ${iss.parent_repo} (${iss.repo_did})`); 60 + console.log(`body: ${iss.body.replace(/\s+/g, " ").slice(0, 180)}…`); 61 + const qvec = await embedQuery(queryText); 62 + const hits = (await pool.query(` 63 + select repo_name, repo_did, round((embedding <=> $1::vector)::numeric, 4) as dist, 64 + (repo_did = $2) as is_parent 65 + from tangled_readmes 66 + where embedding is not null 67 + order by embedding <=> $1::vector 68 + limit 8`, [qvec, iss.repo_did])).rows; 69 + console.log("top README matches:"); 70 + hits.forEach((h, idx) => { 71 + console.log(` ${idx + 1}. ${h.is_parent ? "👉 " : " "}${h.repo_name?.padEnd(32) ?? "(no name)"} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`); 72 + }); 73 + // Where does the own repo rank overall? 
74 + const rank = (await pool.query(` 75 + select 1 + count(*)::int as rnk 76 + from tangled_readmes 77 + where embedding is not null 78 + and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`, 79 + [qvec, iss.repo_did])).rows[0].rnk; 80 + console.log(` → own repo overall rank: #${rank} of all embedded READMEs`); 81 + } 82 + await pool.end(); 83 + } 84 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
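The "own repo overall rank" query above counts how many embedded READMEs sit strictly closer to the query vector than the issue's own repo, then adds one. A minimal Python sketch of that same arithmetic, assuming the distances have already been fetched into a dict (the function name `own_repo_rank` and the dict shape are hypothetical, not part of the repo):

```python
from __future__ import annotations


def own_repo_rank(distances: dict[str, float], own_repo: str) -> int | None:
    """Return the 1-based rank of `own_repo` by ascending distance.

    Mirrors the SQL `1 + count(*)` over rows with a strictly smaller
    distance, so ties share the better rank. Returns None when the own
    repo has no distance (for example, its README is not embedded).
    """
    own = distances.get(own_repo)
    if own is None:
        return None
    return 1 + sum(1 for d in distances.values() if d < own)
```

Because the comparison is strict, two repos at exactly the same distance both get the better rank, which matches what the SQL subquery would report.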
+30
recommendation/reference/src/probe_live.ts
import { pool, withClient } from "./db.js";

async function main() {
  const rows = await withClient((c) => c.query(`
    select knot_hostname, record_raw->>'repoDid' as repodid, record_raw->>'name' as name
    from tangled_repos
    where knot_hostname='knot1.tangled.sh' and coalesce(record_raw->>'repoDid','')<>''
    limit 6`).then(r => r.rows));
  for (const r of rows) {
    const url = `https://${r.knot_hostname}/xrpc/sh.tangled.git.temp.getTree?repo=${encodeURIComponent(r.repodid)}&ref=HEAD&path=`;
    try {
      const ctrl = new AbortController();
      const t = setTimeout(() => ctrl.abort(), 12000);
      const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
      clearTimeout(t);
      const txt = await resp.text();
      let j: any; try { j = JSON.parse(txt); } catch { j = null; }
      const fileNames = (j?.files || []).map((f: any) => f.name);
      const readmeInTree = fileNames.some((n: string) => /^readme/i.test(n));
      console.log(`\n[${resp.status}] ${r.name ?? "(no name)"} ${r.repodid}`);
      console.log(` readme field: ${j?.readme ? JSON.stringify(Object.keys(j.readme)) : "none"} | readmeInTree=${readmeInTree}`);
      console.log(` files: ${JSON.stringify(fileNames).slice(0, 200)}`);
      if (!j) console.log(` raw: ${txt.slice(0, 200)}`);
    } catch (e: any) {
      console.log(`\n[ERR] ${r.repodid}: ${e.message}`);
    }
  }
  await pool.end();
}
main().catch((e) => { console.error(e); process.exit(1); });
+104
recommendation/reference/src/readme_coverage.mjs
··· 1 + import pg from "pg"; 2 + import { readFileSync } from "node:fs"; 3 + 4 + // Read DB_CONNECTION_STRING from repo-root .env (ignore the gcloud helper line). 5 + function loadConn() { 6 + if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING; 7 + for (const p of ["../.env", ".env", "../../.env"]) { 8 + try { 9 + const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); 10 + if (m) return m[1].trim(); 11 + } catch {} 12 + } 13 + throw new Error("DB_CONNECTION_STRING not found"); 14 + } 15 + 16 + const SAMPLE = process.env.SAMPLE ? parseInt(process.env.SAMPLE, 10) : 0; // 0 = all 17 + const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "30", 10); 18 + const TIMEOUT_MS = parseInt(process.env.TIMEOUT_MS ?? "9000", 10); 19 + 20 + const pool = new pg.Pool({ 21 + connectionString: loadConn(), 22 + ssl: { rejectUnauthorized: false }, 23 + connectionTimeoutMillis: 10_000, 24 + max: 4, 25 + }); 26 + 27 + const sql = ` 28 + select knot_hostname, 29 + coalesce(record_raw->>'repoDid', repo_did) as repodid, 30 + record_raw->>'name' as name 31 + from tangled_repos 32 + where knot_hostname is not null 33 + and coalesce(record_raw->>'repoDid', repo_did) is not null 34 + ${SAMPLE ? "order by random() limit " + SAMPLE : ""}`; 35 + 36 + const { rows } = await pool.query(sql); 37 + await pool.end(); 38 + 39 + const totalRepos = rows.length; 40 + console.log(`Checking README presence for ${totalRepos} repos (repoDid-addressable) ...`); 41 + console.log(`concurrency=${CONCURRENCY} timeout=${TIMEOUT_MS}ms sample=${SAMPLE || "ALL"}\n`); 42 + 43 + async function checkRepo(r) { 44 + // sh.tangled.repo.tree defaults to the repo's default branch when ref is omitted, 45 + // and returns a top-level `readme` (with `contents`) when the knot finds a README 46 + // under any extension (.md/.org/.rst/...). One request per repo. 
47 + const url = `https://${r.knot_hostname}/xrpc/sh.tangled.repo.tree?repo=${encodeURIComponent(r.repodid)}&path=`; 48 + const ctrl = new AbortController(); 49 + const t = setTimeout(() => ctrl.abort(), TIMEOUT_MS); 50 + try { 51 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 52 + const txt = await resp.text(); 53 + if (!resp.ok) return { status: "http_" + resp.status }; 54 + let j; try { j = JSON.parse(txt); } catch { return { status: "bad_json" }; } 55 + const files = Array.isArray(j?.files) ? j.files : []; 56 + const readmeObj = !!(j?.readme && typeof j.readme === "object" && 57 + typeof j.readme.contents === "string" && j.readme.contents.trim().length > 0); 58 + const readmeFile = files.some((f) => /^readme(\.|$)/i.test(f?.name ?? "")); 59 + const empty = files.length === 0 && !readmeObj; 60 + return { status: "ok", reachable: true, hasReadme: readmeObj || readmeFile, empty }; 61 + } catch (e) { 62 + return { status: e.name === "AbortError" ? "timeout" : "neterr" }; 63 + } finally { 64 + clearTimeout(t); 65 + } 66 + } 67 + 68 + let done = 0; 69 + const stats = { reachable: 0, hasReadme: 0, empty: 0 }; 70 + const statusCounts = {}; 71 + const byKnot = {}; // knot -> {reachable, hasReadme} 72 + 73 + async function worker(queue) { 74 + while (queue.length) { 75 + const r = queue.pop(); 76 + const res = await checkRepo(r); 77 + statusCounts[res.status] = (statusCounts[res.status] ?? 
0) + 1; 78 + const k = (byKnot[r.knot_hostname] ??= { total: 0, reachable: 0, hasReadme: 0 }); 79 + k.total++; 80 + if (res.status === "ok") { 81 + stats.reachable++; k.reachable++; 82 + if (res.hasReadme) { stats.hasReadme++; k.hasReadme++; } 83 + if (res.empty) stats.empty++; 84 + } 85 + if (++done % 100 === 0) process.stderr.write(` ...${done}/${totalRepos}\n`); 86 + } 87 + } 88 + 89 + const queue = rows.slice(); 90 + await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(queue))); 91 + 92 + const pct = (n, d) => (d === 0 ? "n/a" : ((100 * n) / d).toFixed(1) + "%"); 93 + 94 + console.log("\n================ README COVERAGE ================"); 95 + console.log(`repoDid-addressable repos checked : ${totalRepos}`); 96 + console.log(`reachable (knot responded w/ tree): ${stats.reachable} (${pct(stats.reachable, totalRepos)} of checked)`); 97 + console.log(` ├─ have a README : ${stats.hasReadme} (${pct(stats.hasReadme, stats.reachable)} of reachable)`); 98 + console.log(` └─ empty repo (no files) : ${stats.empty}`); 99 + console.log(`README % of ALL checked repos : ${pct(stats.hasReadme, totalRepos)}`); 100 + console.log("\nstatus breakdown:", JSON.stringify(statusCounts)); 101 + console.log("\nper-knot (knots with >=10 repos):"); 102 + for (const [knot, k] of Object.entries(byKnot).sort((a, b) => b[1].total - a[1].total)) { 103 + if (k.total >= 10) console.log(` ${knot.padEnd(26)} total=${String(k.total).padStart(4)} reachable=${String(k.reachable).padStart(4)} readme=${String(k.hasReadme).padStart(4)} (${pct(k.hasReadme, k.reachable)} of reachable)`); 104 + }
+57
recommendation/reference/src/similar_repos.mjs
// README -> README similarity search (pure in-DB pgvector cosine; no embedding API call).
// Given a seed repo (a repo the user contributed to), find the most similar repos by README.
// Dedups exact-duplicate READMEs (forks) and near-identical hits.
import pg from "pg";
import { readFileSync } from "node:fs";
import { createHash } from "node:crypto";
function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });
const K = parseInt(process.env.K ?? "8", 10);
const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex");

// Seeds: env SEED (repo_name ilike or repo_did), else a diverse default set.
const seeds = process.env.SEED ? [process.env.SEED] : ["tangled-cli", "atproto-oauth", "nixpkgs", "holbert-ng"];

async function findSeed(s) {
  const byDid = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [s]);
  if (byDid.rows[0]) return byDid.rows[0];
  const byName = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_name ilike $1 and embedding is not null order by length(content) desc limit 1`, [s]);
  return byName.rows[0] ?? null;
}

async function main() {
  for (const s of seeds) {
    const seed = await findSeed(s);
    console.log("\n" + "=".repeat(74));
    if (!seed) { console.log(`SEED "${s}" — no embedded README found`); continue; }
    console.log(`SEED REPO: ${seed.repo_name} (owner @${seed.owner_handle ?? "?"})`);
    console.log(` readme: ${(seed.content ?? "").replace(/\s+/g, " ").slice(0, 160)}…`);

    // Pull a wide candidate set, then dedup in JS.
    const cand = (await pool.query(`
      select repo_name, owner_handle, repo_did, content,
             round((embedding <=> $1::vector)::numeric, 4) as dist
      from tangled_readmes
      where embedding is not null and repo_did <> $2
      order by embedding <=> $1::vector
      limit 60`, [seed.embedding, seed.repo_did])).rows;

    const seenContent = new Set([hash(seed.content)]); // also drop forks identical to the seed
    const out = [];
    let dupSkipped = 0;
    for (const c of cand) {
      const h = hash(c.content);
      if (seenContent.has(h)) { dupSkipped++; continue; }
      seenContent.add(h);
      out.push(c);
      if (out.length >= K) break;
    }
    console.log(`top ${out.length} similar repos (deduped, ${dupSkipped} fork/dup hits collapsed):`);
    out.forEach((h, i) => {
      console.log(` ${String(i + 1).padStart(2)}. ${(h.repo_name ?? "(no name)").padEnd(30)} @${(h.owner_handle ?? "?").padEnd(20)} cos_dist=${h.dist}`);
      console.log(` ${(h.content ?? "").replace(/\s+/g, " ").slice(0, 110)}…`);
    });
  }
  await pool.end();
}
main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
+29
recommendation/reference/src/verify_embeddings.mjs
import pg from "pg";
import { readFileSync } from "node:fs";
function conn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env"]) { try { const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)$/m); if (m) return m[1].trim(); } catch {} }
}
const pool = new pg.Pool({ connectionString: conn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== embedded rows: dims + L2 norm ===");
console.table((await pool.query(`
  select repo_name, embedding_model,
         vector_dims(embedding) as dims,
         round(sqrt((select sum(x*x) from unnest(embedding::real[]) x))::numeric, 5) as l2_norm
  from tangled_readmes where embedding is not null
  order by embedded_at desc limit 5`)).rows);

console.log("\n=== nearest-neighbor sanity (cosine) for one embedded repo ===");
const seed = (await pool.query(`select repo_did, repo_name from tangled_readmes where embedding is not null limit 1`)).rows[0];
if (seed) {
  console.log(`seed: ${seed.repo_name} (${seed.repo_did})`);
  const nn = await pool.query(`
    select repo_name, round((embedding <=> (select embedding from tangled_readmes where repo_did=$1))::numeric, 4) as cosine_dist
    from tangled_readmes
    where embedding is not null and repo_did <> $1
    order by embedding <=> (select embedding from tangled_readmes where repo_did=$1)
    limit 5`, [seed.repo_did]);
  console.table(nn.rows);
}
await pool.end();
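The script above checks two properties of the stored vectors: that their L2 norm is close to 1 (the query-side code divides by the norm before storing) and that pgvector's `<=>` operator, which is cosine distance, returns sensible neighbors. A small pure-Python sketch of both quantities, assuming plain lists of floats; the helper names are illustrative and do not exist in the repo:

```python
import math


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = l2_norm(v) or 1.0  # same guard as `Math.sqrt(s) || 1` in the JS scripts
    return [x / n for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's `<=>` computes.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (l2_norm(a) * l2_norm(b))
```

For unit-length vectors the denominator is 1, so cosine distance reduces to `1 - dot(a, b)`; that is why an `l2_norm` column reading 1.00000 is a useful sanity check before trusting the nearest-neighbor output.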
recommendation/tests/__init__.py

This is a binary file and will not be displayed.

+30
recommendation/tests/test_dedup.py
from app.dedup import content_hash, collapse_forks
from app.types import Candidate


def test_content_hash_is_deterministic_and_prefix_based():
    a = content_hash("hello world" + "x" * 1000)
    b = content_hash("hello world" + "x" * 1000)
    assert a == b
    # only the first 500 chars matter -> differing tails hash the same
    assert content_hash("p" * 500 + "AAA") == content_hash("p" * 500 + "BBB")


def test_content_hash_handles_none_and_empty():
    assert content_hash(None) == content_hash("")


def _cand(key, h, dist):
    return Candidate(key=key, content_hash=h, distance=dist, seeds=["s"])


def test_collapse_forks_keeps_min_distance_per_content():
    cands = [
        _cand("repoA", "samehash", 0.20),
        _cand("repoB", "samehash", 0.10),  # fork with closer distance -> winner
        _cand("repoC", "other", 0.30),
    ]
    out = collapse_forks(cands)
    keys = {c.key for c in out}
    assert keys == {"repoB", "repoC"}
    assert len(out) == 2
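These tests pin down two behaviors: the content hash looks only at a 500-character prefix and treats `None` like the empty string, and fork collapsing keeps the closest candidate per hash. One way an implementation could satisfy them is sketched below. It is not the code in `app/dedup.py`: the `Candidate` dataclass is a stand-in with only the fields the tests use, MD5 is assumed because the reference Node script `similar_repos.mjs` uses it, and sorting the result by distance is an extra choice the tests do not require.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical stand-in for app.types.Candidate (fields taken from the tests).
    key: str
    content_hash: str
    distance: float
    seeds: list[str] = field(default_factory=list)


def content_hash(text: str | None, prefix_chars: int = 500) -> str:
    """Hash only the first `prefix_chars` characters; None is treated as ""."""
    prefix = (text or "")[:prefix_chars]
    return hashlib.md5(prefix.encode("utf-8")).hexdigest()


def collapse_forks(cands: list[Candidate]) -> list[Candidate]:
    """Keep one candidate per content hash: the one with the smallest distance."""
    best: dict[str, Candidate] = {}
    for c in cands:
        cur = best.get(c.content_hash)
        if cur is None or c.distance < cur.distance:
            best[c.content_hash] = c
    # Ordering is an assumption of this sketch: closest candidates first.
    return sorted(best.values(), key=lambda c: c.distance)
```

Hashing a prefix rather than the whole README keeps the check cheap and still catches forks whose READMEs differ only near the end, at the cost of occasionally merging unrelated repos that share a long boilerplate header.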
+179
recommendation/tests/test_git_store.py
··· 1 + from __future__ import annotations 2 + 3 + import json 4 + from pathlib import Path 5 + 6 + import numpy as np 7 + import pytest 8 + 9 + from app.config import Settings 10 + from app.dedup import content_hash, row_content_hash 11 + from app.git_store import GitDataStore, load_git_store 12 + from app import db, recommend 13 + 14 + 15 + def _unit(v: list[float]) -> np.ndarray: 16 + a = np.asarray(v, dtype=np.float32) 17 + return a / np.linalg.norm(a) 18 + 19 + 20 + def _write_bundle(root: Path) -> None: 21 + data = root / "data" 22 + data.mkdir(parents=True) 23 + repo_vecs = np.stack( 24 + [ 25 + _unit([1, 0, 0]), 26 + _unit([0.9, 0.1, 0]), 27 + _unit([0, 1, 0]), 28 + ] 29 + ) 30 + issue_vecs = np.stack( 31 + [ 32 + _unit([0.95, 0.05, 0]), 33 + _unit([0, 0.95, 0.05]), 34 + ] 35 + ) 36 + np.save(data / "repos.f32.npy", repo_vecs) 37 + np.save(data / "issues.f32.npy", issue_vecs) 38 + 39 + repos = [ 40 + { 41 + "row": 0, 42 + "subject_uri": "at://did:plc:alice/sh.tangled.repo/r1", 43 + "repo_did": "did:repo:alice-r1", 44 + "repo_name": "alice-r1", 45 + "owner_handle": "alice", 46 + "description": "Alice repo one", 47 + "topics": ["nix"], 48 + "created_at": "2026-01-01T00:00:00Z", 49 + "content_len": 200, 50 + "content_sha500": "aaa", 51 + "embedding_model": "gemini-embedding-001", 52 + "embedded_at": "2026-01-01T00:00:00Z", 53 + }, 54 + { 55 + "row": 1, 56 + "subject_uri": "at://did:plc:bob/sh.tangled.repo/r9", 57 + "repo_did": "did:repo:bob-r9", 58 + "repo_name": "bob-r9", 59 + "owner_handle": "bob", 60 + "description": "Bob similar repo", 61 + "topics": ["cli"], 62 + "created_at": "2026-01-02T00:00:00Z", 63 + "content_len": 180, 64 + "content_sha500": "bbb", 65 + "embedding_model": "gemini-embedding-001", 66 + "embedded_at": "2026-01-02T00:00:00Z", 67 + }, 68 + { 69 + "row": 2, 70 + "subject_uri": "at://did:plc:carol/sh.tangled.repo/web", 71 + "repo_did": "did:repo:carol-web", 72 + "repo_name": "web", 73 + "owner_handle": "carol", 74 + "description": 
"Different topic", 75 + "topics": ["web"], 76 + "created_at": "2026-01-03T00:00:00Z", 77 + "content_len": 500, 78 + "content_sha500": "ccc", 79 + "embedding_model": "gemini-embedding-001", 80 + "embedded_at": "2026-01-03T00:00:00Z", 81 + }, 82 + ] 83 + issues = [ 84 + { 85 + "row": 0, 86 + "subject_uri": "at://did:plc:bob/sh.tangled.repo.issue/i1", 87 + "repo_did": "did:repo:bob-r9", 88 + "rkey": "i1", 89 + "repo_uri": "at://did:plc:bob/sh.tangled.repo/r9", 90 + "author_did": "did:plc:other", 91 + "title": "Fix CLI", 92 + "body": "details", 93 + "owner_handle": "bob", 94 + "repo_name": "bob-r9", 95 + "repo_description": "Bob similar repo", 96 + "created_at": "2026-01-04T00:00:00Z", 97 + "embedding_model": "gemini-embedding-001", 98 + }, 99 + { 100 + "row": 1, 101 + "subject_uri": "at://did:plc:carol/sh.tangled.repo.issue/i9", 102 + "repo_did": "did:repo:carol-web", 103 + "rkey": "i9", 104 + "repo_uri": "at://did:plc:carol/sh.tangled.repo/web", 105 + "author_did": "did:plc:carol", 106 + "title": "Web thing", 107 + "body": "body", 108 + "owner_handle": "carol", 109 + "repo_name": "web", 110 + "repo_description": "Different topic", 111 + "created_at": "2026-01-05T00:00:00Z", 112 + "embedding_model": "gemini-embedding-001", 113 + }, 114 + ] 115 + (data / "repos.jsonl").write_text( 116 + "\n".join(json.dumps(r) for r in repos) + "\n", encoding="utf-8" 117 + ) 118 + (data / "issues.jsonl").write_text( 119 + "\n".join(json.dumps(r) for r in issues) + "\n", encoding="utf-8" 120 + ) 121 + (root / "manifest.json").write_text( 122 + json.dumps( 123 + { 124 + "model": "gemini-embedding-001", 125 + "dim": 3, 126 + "metric": "cosine", 127 + "counts": {"repos": 3, "issues": 2}, 128 + } 129 + ), 130 + encoding="utf-8", 131 + ) 132 + 133 + 134 + @pytest.fixture() 135 + def git_bundle(tmp_path, monkeypatch): 136 + root = tmp_path / "bundle" 137 + _write_bundle(root) 138 + monkeypatch.setenv("DATA_STORAGE", "git") 139 + monkeypatch.setenv("REC_DATA_DIR", str(root)) 140 + 
monkeypatch.delenv("REC_DATA_GIT_URL", raising=False) 141 + from app.config import get_settings 142 + 143 + get_settings.cache_clear() 144 + load_git_store(get_settings()) 145 + yield root 146 + get_settings.cache_clear() 147 + 148 + 149 + def test_row_content_hash_prefers_sha500(): 150 + assert row_content_hash({"content_sha500": "deadbeef", "content": "x"}) == "deadbeef" 151 + assert content_hash("hello") == row_content_hash({"content": "hello"}) 152 + 153 + 154 + def test_git_store_load_and_knn(git_bundle): 155 + store = GitDataStore.load_from_dir(git_bundle) 156 + seeds = store.load_seeds("did:plc:alice", min_chars=120) 157 + assert len(seeds) == 1 158 + assert seeds[0]["repo_did"] == "did:repo:alice-r1" 159 + 160 + hits = store.knn_repos(seeds[0]["etext"], ["did:repo:alice-r1"], limit=5, min_chars=120) 161 + assert hits 162 + assert hits[0]["repo_did"] == "did:repo:bob-r9" 163 + assert hits[0]["distance"] < 0.2 164 + 165 + 166 + def test_git_recommend_end_to_end(git_bundle): 167 + res = recommend.recommend("did:plc:alice") 168 + assert res.profile.sources.tangled.repos == 1 169 + assert res.repos 170 + assert res.repos[0].name == "bob-r9" 171 + assert res.issues 172 + assert res.issues[0].issueUri.endswith("/i1") 173 + 174 + 175 + def test_db_dispatch_git_mode(git_bundle): 176 + counts = db.embedding_counts() 177 + assert counts["readmes_embedded"] == 3 178 + assert db.ping() is True 179 + assert db.get_questionnaire("at://x") is None
+72
recommendation/tests/test_integration.py
··· 1 + """Integration tests against the shared Postgres DB. 2 + 3 + Skipped unless DB_CONNECTION_STRING is configured (via env or .env). These assert 4 + the core correctness guarantees: no own-work leakage and a valid contract shape. 5 + """ 6 + 7 + from __future__ import annotations 8 + 9 + import pytest 10 + 11 + from app import db, recommend 12 + from app.config import get_settings 13 + 14 + settings = get_settings() 15 + 16 + pytestmark = pytest.mark.skipif( 17 + not settings.db_connection_string, 18 + reason="DB_CONNECTION_STRING not configured", 19 + ) 20 + 21 + # A user with several embedded owned repos (CLI / dotfiles flavour). 22 + SAMPLE_DID = "did:plc:y7g2koy4nqw7434s67fgfjca" 23 + 24 + 25 + def test_health_counts_present(): 26 + counts = db.embedding_counts() 27 + assert counts["readmes_embedded"] > 1000 28 + assert counts["addressable_users"] > 100 29 + 30 + 31 + def test_recommend_excludes_own_repos_and_is_well_formed(): 32 + seeds = db.load_seeds(SAMPLE_DID) 33 + assert len(seeds) >= 2, "fixture user should have multiple seeds" 34 + own_names = {(s["repo_name"] or "").lower() for s in seeds} 35 + 36 + res = recommend.recommend(SAMPLE_DID) 37 + 38 + # never recommend the user's own repos 39 + rec_names = {(r.name or "").lower() for r in res.repos} 40 + assert own_names.isdisjoint(rec_names), "own repos leaked into recommendations" 41 + 42 + # contract shape: profile + well-formed repos 43 + assert res.profile.sources.tangled.repos == len(seeds) 44 + assert res.profile.sources.github is None # no GitHub data 45 + assert len(res.repos) > 0 46 + for r in res.repos: 47 + assert r.owner.startswith("@") 48 + assert r.url.startswith("https://") 49 + assert r.url.endswith(f"/{r.name}") 50 + assert isinstance(r.basedOnRepoUrl, str) 51 + if r.basedOnRepoUrl: 52 + assert r.basedOnRepoUrl.startswith("https://") 53 + 54 + # issues, when present, carry the identity the appview needs to build links 55 + for i in res.issues: 56 + assert i.repoDid and i.rkey 57 + 
assert i.issueUri.startswith("at://") 58 + assert "/sh.tangled.repo.issue/" in i.issueUri 59 + assert i.owner.startswith("@") 60 + assert "/" in i.repo 61 + assert i.url.startswith("https://") 62 + assert i.url.endswith(f"/{i.repo.split('/', 1)[-1]}") 63 + assert isinstance(i.repoReadme, str) 64 + assert isinstance(i.basedOnRepoUrl, str) 65 + if i.basedOnRepoUrl: 66 + assert i.basedOnRepoUrl.startswith("https://") 67 + 68 + 69 + def test_empty_user_returns_no_repos(): 70 + res = recommend.recommend("did:plc:doesnotexistxxxxxxxxxxxx") 71 + assert res.repos == [] 72 + assert res.issues == []
+32
recommendation/tests/test_links.py
from datetime import datetime, timezone

from app.links import slugify, at_owner, repo_url, to_rfc3339


def test_slugify():
    assert slugify("CLI Tools") == "cli-tools"
    assert slugify("AT Protocol!") == "at-protocol"
    assert slugify(" spaced out ") == "spaced-out"


def test_at_owner_prefixes_once():
    assert at_owner("icyphox.sh") == "@icyphox.sh"
    assert at_owner("@icyphox.sh") == "@icyphox.sh"


def test_repo_url_is_absolute_with_at_handle():
    assert repo_url("https://tangled.org", "icyphox.sh", "legit") == \
        "https://tangled.org/@icyphox.sh/legit"


def test_to_rfc3339_from_datetime():
    dt = datetime(2026, 6, 24, 9, 11, 0, tzinfo=timezone.utc)
    assert to_rfc3339(dt) == "2026-06-24T09:11:00+00:00"


def test_to_rfc3339_passthrough_for_already_iso_string():
    assert to_rfc3339("2026-06-02T11:46:00+03:00") == "2026-06-02T11:46:00+03:00"


def test_to_rfc3339_empty_for_none():
    assert to_rfc3339(None) == ""
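The link helpers exercised above are small enough to sketch in full. The version below is one possible implementation that satisfies these assertions; the real `app/links.py` may differ, and details such as stripping a trailing slash from the base URL are assumptions of this sketch.

```python
from __future__ import annotations

import re
from datetime import datetime


def slugify(label: str) -> str:
    """Lowercase, collapse every non-alphanumeric run to '-', trim stray dashes."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def at_owner(handle: str) -> str:
    """Prefix a handle with '@' exactly once."""
    return handle if handle.startswith("@") else "@" + handle


def repo_url(base: str, handle: str, name: str) -> str:
    """Absolute repo URL of the form <base>/@<handle>/<name>."""
    return f"{base.rstrip('/')}/{at_owner(handle)}/{name}"


def to_rfc3339(value: datetime | str | None) -> str:
    """datetime -> ISO 8601 string; strings pass through; None -> ''."""
    if value is None:
        return ""
    if isinstance(value, datetime):
        return value.isoformat()
    return value
```

`datetime.isoformat()` on a timezone-aware value emits the `+00:00` offset form rather than a trailing `Z`, which is the form the tests expect.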
+32
recommendation/tests/test_merge.py
from app.merge import merge_hits


def hit(repo_did, content, distance):
    return {"repo_did": repo_did, "content": content, "distance": distance}


def test_consensus_accumulates_across_seeds():
    per_seed = [
        ("seed-nix", [hit("R1", "nix stuff", 0.18), hit("R2", "cli stuff", 0.25)]),
        ("seed-cli", [hit("R1", "nix stuff", 0.12), hit("R3", "web stuff", 0.22)]),
    ]
    cands = merge_hits(per_seed, seed_content_hashes=set())
    by_key = {c.key: c for c in cands}

    # R1 surfaced by both seeds -> consensus 2, best (min) distance, primary = closer seed
    assert by_key["R1"].consensus == 2
    assert by_key["R1"].distance == 0.12
    assert by_key["R1"].primary_seed == "seed-cli"
    # R2/R3 surfaced once
    assert by_key["R2"].consensus == 1
    assert by_key["R3"].consensus == 1


def test_skips_user_own_forks_by_content_hash():
    from app.dedup import content_hash

    own = content_hash("my own readme")
    per_seed = [("seed", [hit("R1", "my own readme", 0.05), hit("R2", "fresh", 0.2)])]
    cands = merge_hits(per_seed, seed_content_hashes={own})
    keys = {c.key for c in cands}
    assert keys == {"R2"}
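The merge step described by these tests folds per-seed nearest-neighbor lists into one candidate per repo, tracking how many seeds surfaced it (consensus), the best distance, and which seed produced that best distance. A hedged sketch follows. `MergedCandidate` is a hypothetical type rather than the repo's `Candidate`, and `content_hash` is assumed to be the 500-character prefix hash used for fork dedup.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def content_hash(text: str | None) -> str:
    # Assumption: same prefix hash as the dedup module (first 500 chars, MD5).
    return hashlib.md5((text or "")[:500].encode("utf-8")).hexdigest()


@dataclass
class MergedCandidate:
    # Hypothetical shape; only the attributes the tests read.
    key: str
    distance: float
    primary_seed: str
    seeds: list[str] = field(default_factory=list)

    @property
    def consensus(self) -> int:
        """Number of distinct seeds that surfaced this repo."""
        return len(self.seeds)


def merge_hits(per_seed, seed_content_hashes):
    """Merge [(seed_key, [hit, ...]), ...] into one candidate per repo_did."""
    merged: dict[str, MergedCandidate] = {}
    for seed_key, hits in per_seed:
        for h in hits:
            # Skip READMEs identical to one of the user's own (forks of their work).
            if content_hash(h["content"]) in seed_content_hashes:
                continue
            c = merged.get(h["repo_did"])
            if c is None:
                merged[h["repo_did"]] = MergedCandidate(
                    h["repo_did"], h["distance"], seed_key, [seed_key]
                )
                continue
            if seed_key not in c.seeds:
                c.seeds.append(seed_key)
            if h["distance"] < c.distance:
                c.distance, c.primary_seed = h["distance"], seed_key
    return list(merged.values())
```

Counting distinct seeds rather than raw hits means a repo returned twice by the same seed does not inflate its consensus.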
+22
recommendation/tests/test_profile.py
from app.profile import build_interests


def test_interests_aggregate_topics_by_frequency():
    seeds = [
        {"topics": ["nix", "cli"]},
        {"topics": ["nix", "atproto"]},
        {"topics": ["nix"]},
        {"topics": None},  # tolerate missing topics
        {"topics": ["CLI Tools"]},  # slug normalizes
    ]
    interests = build_interests(seeds, max_interests=5)
    labels = [i["label"] for i in interests]
    slugs = [i["slug"] for i in interests]
    assert labels[0] == "nix"  # most frequent first
    assert "cli-tools" in slugs  # multi-word topic is slugified
    assert len(interests) <= 5
    assert all(set(i.keys()) == {"label", "slug"} for i in interests)


def test_interests_empty_when_no_topics():
    assert build_interests([{"topics": None}, {}], max_interests=5) == []
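The profile tests describe a frequency count over seed topics, keyed by slug, that tolerates seeds with missing topics. A minimal sketch of that aggregation is below. It is an illustration rather than the contents of `app/profile.py`; keeping the first-seen spelling as the label is an assumption.

```python
from __future__ import annotations

import re
from collections import Counter


def slugify(label: str) -> str:
    """Lowercase, collapse non-alphanumeric runs to '-', trim stray dashes."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def build_interests(seeds: list[dict], max_interests: int = 5) -> list[dict]:
    """Return up to `max_interests` {'label', 'slug'} dicts, most frequent first."""
    counts: Counter[str] = Counter()
    labels: dict[str, str] = {}
    for seed in seeds:
        # `or []` covers both a missing key and an explicit None.
        for topic in seed.get("topics") or []:
            slug = slugify(topic)
            if not slug:
                continue
            counts[slug] += 1
            labels.setdefault(slug, topic)  # assumption: keep the first spelling seen
    return [{"label": labels[s], "slug": s} for s, _ in counts.most_common(max_interests)]
```

`Counter.most_common` orders equal counts by first insertion, so the output is deterministic for a given seed order.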
+75
recommendation/tests/test_quality.py
"""Unit tests for the issue quality filter (pure, no DB).

Real examples are drawn from observed recommendation output: the engine was
surfacing throwaway/test issues (e.g. "hello, world" whose body is "test issue
to explore what tangled looks like") because issues are ranked purely by body
embedding similarity, with no quality signal. These tests pin down what we drop
and, just as importantly, what we must keep.
"""

from __future__ import annotations

from app.quality import drop_issue, is_placeholder_issue, is_test_repo

# --- repos that are clearly sandboxes / test scratchpads -------------------------
def test_test_repo_by_name():
    assert is_test_repo("tngl-mcp-test", "")
    assert is_test_repo("test-repo", "")
    assert is_test_repo("test100", "")
    assert is_test_repo("sandbox", "")
    assert is_test_repo("playground", "")
    assert is_test_repo("my-demo", "")


def test_test_repo_by_description():
    assert is_test_repo("blaaaa", "adadadaddaaddada")  # gibberish description
    assert is_test_repo("whatever", "just a test")
    assert is_test_repo("x", "this is a test")


def test_real_repos_are_not_flagged():
    assert not is_test_repo("knot-docker", "Docker config for a Tangled knotserver")
    assert not is_test_repo("tangled-cli", "CLI for Tangled")
    assert not is_test_repo("hydrant", "an atproto crawler")
    assert not is_test_repo("drifting-starlight", "")
    assert not is_test_repo("latest", "")  # 'test' is a substring, not a token
    assert not is_test_repo("fastest-router", "")
    assert not is_test_repo("contest-platform", "")


# --- placeholder / test issues ---------------------------------------------------
def test_placeholder_issue_titles():
    assert is_placeholder_issue("hello, world", "")
    assert is_placeholder_issue("CLI test issue", "")
    assert is_placeholder_issue("Test Issue", "")
    assert is_placeholder_issue("[READ-ONLY]", "")


def test_placeholder_issue_bodies():
    assert is_placeholder_issue("hello, world", "test issue to explore what tangled looks like\n- and so on")
    assert is_placeholder_issue("[READ-ONLY]", "this is a read-only mirror of https://github.com/npmx-dev/npmx")
    assert is_placeholder_issue("untitled", "Testing programmatic access to Tangled via tang CLI")
    assert is_placeholder_issue("x", "just testing, ignore this")
    assert is_placeholder_issue("x", "lorem ipsum dolor sit amet")


def test_real_issues_are_not_flagged():
    assert not is_placeholder_issue(
        "`KNOT_REPO_SCAN_PATH` doesn't seem to be respected",
        "i've been hosting a knot for the past few versions and the log shows...",
    )
    assert not is_placeholder_issue("PR Phase 2: Reviewer Workflow (Commenting and Reviews)", "Implement the reviewer workflow")
    assert not is_placeholder_issue("Finish migration from GitHub to Tangled", "- [x] Remove dependabot.yml file")
    assert not is_placeholder_issue("[crawler] add `com.atproto.sync.listReposByCollection` support", "right now we use describeRepo")
    assert not is_placeholder_issue("Improve Documentation", "Lots of little jobs here")
    assert not is_placeholder_issue("Add tests for the ranker", "we need more test coverage on the scorer")  # legit work *about* tests


# --- combined gate used by the engine --------------------------------------------
def test_drop_issue_combines_both_signals():
    # dropped because the repo is a test repo (even if the issue title looks fine)
    assert drop_issue("tngl-mcp-test", "", "Add README with project overview", "real-sounding body")
    # dropped because the issue body is a placeholder (even if the repo looks fine)
    assert drop_issue("static", "", "hello, world", "test issue to explore what tangled looks like")
    # kept: real repo + real issue
    assert not drop_issue("knot-docker", "Docker config", "KNOT_REPO_SCAN_PATH bug", "the scan path is ignored")
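The name tests above hinge on one rule: a marker word such as `test` counts only as its own token, not as a substring of a longer word. The real `app.quality` implementation is not shown in this diff, so the following is only a minimal sketch of that rule under stated assumptions. The function name `looks_like_test_name` and the marker list are hypothetical, and the description/gibberish checks are deliberately left out.

```python
import re

# Hypothetical marker words; the actual list used by app.quality is not shown.
_MARKERS = ("test", "sandbox", "playground", "demo")

# A marker matches only when no letter is glued to either side of it, so
# "test100" and "my-demo" match while "latest" and "contest" do not.
_MARKER_RE = re.compile(r"(?<![a-z])(?:" + "|".join(_MARKERS) + r")(?![a-z])")


def looks_like_test_name(name: str) -> bool:
    """Return True when a repo name contains a marker word as its own token."""
    return _MARKER_RE.search(name.lower()) is not None
```

Lookarounds keep digits and separators (`-`, `_`) as valid token boundaries, which is what lets `test100` through while rejecting `fastest-router`.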
+54
recommendation/tests/test_questionnaire_api.py
from __future__ import annotations

from unittest.mock import patch

import pytest
from fastapi.testclient import TestClient

from app.main import app
from app.questionnaires import IssueUriError, QuestionnaireNotFoundError

client = TestClient(app)

SAMPLE_URI = "at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22"
SAMPLE_PAYLOAD = {
    "issue": SAMPLE_URI,
    "version": 2,
    "introduction": {"project": "p", "issue": "i", "approach": "a"},
    "items": [],
}


def test_questionnaire_missing_param():
    res = client.get("/questionnaire")
    assert res.status_code == 400


def test_questionnaire_not_found():
    with patch("app.questionnaires.load_questionnaire_payload") as load:
        load.side_effect = QuestionnaireNotFoundError(SAMPLE_URI)
        res = client.get("/questionnaire", params={"issue": SAMPLE_URI})
    assert res.status_code == 404


def test_questionnaire_returns_payload():
    with patch("app.questionnaires.load_questionnaire_payload", return_value=SAMPLE_PAYLOAD):
        res = client.get("/questionnaire", params={"issue": SAMPLE_URI})
    assert res.status_code == 200
    assert res.json() == SAMPLE_PAYLOAD


def test_questionnaire_accepts_issue_uri_alias():
    with patch("app.questionnaires.load_questionnaire_payload", return_value=SAMPLE_PAYLOAD) as load:
        res = client.get("/questionnaire", params={"issue-uri": SAMPLE_URI})
    assert res.status_code == 200
    load.assert_called_once_with(SAMPLE_URI)


def test_questionnaire_bad_issue_uri():
    with patch(
        "app.questionnaires.load_questionnaire_payload",
        side_effect=IssueUriError("bad"),
    ):
        res = client.get("/questionnaire", params={"issue": "not-a-uri"})
    assert res.status_code == 400
+62
recommendation/tests/test_questionnaire_knot.py
"""Unit tests for the knot-backed questionnaire read (mocked network)."""

from __future__ import annotations

import json
import urllib.error
from contextlib import contextmanager
from unittest.mock import patch

import pytest

from app import questionnaires as q
from app.config import get_settings

URI = "at://did:plc:zhxv5pxpmojhnvaqy4mwailv/sh.tangled.repo.issue/3lln56674n622"
PAYLOAD = {"issue": URI, "version": 2, "items": [{"id": "x"}]}
FILE_RECORD = {"issue_uri": URI, "version": 2, "created_at": "t", "updated_at": "t", "payload": PAYLOAD}


class _Resp:
    def __init__(self, body: bytes):
        self._body = body

    def read(self):
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *a):
        return False


def test_knot_blob_url_path():
    import urllib.parse

    s = get_settings()
    url = q._knot_blob_url(URI, s)
    assert "sh.tangled.repo.blob" in url
    params = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
    assert params["repo"] == [s.questionnaire_repo_did]
    assert params["path"] == ["questionnaires/did:plc:zhxv5pxpmojhnvaqy4mwailv/3lln56674n622.json"]


def test_fetch_parses_blob_content():
    blob = json.dumps({"content": json.dumps(FILE_RECORD), "encoding": "utf-8"}).encode()
    with patch("app.questionnaires.urllib.request.urlopen", return_value=_Resp(blob)):
        rec = q._fetch_from_knot(URI, get_settings())
    assert rec["payload"] == PAYLOAD


def test_load_payload_from_knot():
    blob = json.dumps({"content": json.dumps(FILE_RECORD)}).encode()
    with patch("app.questionnaires.urllib.request.urlopen", return_value=_Resp(blob)):
        assert q.load_questionnaire_payload(URI) == PAYLOAD


def test_missing_questionnaire_404_raises_not_found():
    err = urllib.error.HTTPError(url="u", code=404, msg="nf", hdrs=None, fp=None)
    with patch("app.questionnaires.urllib.request.urlopen", side_effect=err):
        with pytest.raises(q.QuestionnaireNotFoundError):
            q.load_questionnaire_payload(URI)
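`test_knot_blob_url_path` pins down how an issue AT URI maps to a file path in the questionnaire repo: `questionnaires/<did>/<rkey>.json`. The helper behind `q._knot_blob_url` is not part of this diff, so the sketch below only illustrates that mapping. The function name `questionnaire_path` is hypothetical, and it raises a plain `ValueError` where the real module presumably raises `IssueUriError`.

```python
def questionnaire_path(issue_uri: str) -> str:
    """Map at://<did>/<collection>/<rkey> to questionnaires/<did>/<rkey>.json."""
    prefix = "at://"
    if not issue_uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {issue_uri!r}")

    # After the scheme, an AT record URI has exactly three non-empty segments.
    parts = issue_uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>: {issue_uri!r}")

    did, _collection, rkey = parts
    return f"questionnaires/{did}/{rkey}.json"
```

The collection segment is validated for presence but otherwise ignored, since only the DID and record key appear in the expected path.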
+50
recommendation/tests/test_rank.py
from app.rank import DefaultScorer, apply_floor, rerank
from app.types import Candidate


def cand(key, dist, seeds, primary=None, created_at=None):
    return Candidate(
        key=key,
        content_hash=key,
        distance=dist,
        seeds=list(seeds),
        primary_seed=primary or seeds[0],
        payload={"created_at": created_at} if created_at else {},
    )


def test_apply_floor_drops_distant_candidates():
    cands = [cand("a", 0.10, ["s"]), cand("b", 0.45, ["s"])]
    kept = apply_floor(cands, floor=0.30)
    assert [c.key for c in kept] == ["a"]


def test_scorer_prefers_closer_distance():
    s = DefaultScorer()
    close = cand("close", 0.10, ["s"])
    far = cand("far", 0.28, ["s"])
    assert s.score(close) > s.score(far)


def test_consensus_boosts_score():
    s = DefaultScorer()
    solo = cand("solo", 0.20, ["s1"])
    agreed = cand("agreed", 0.20, ["s1", "s2", "s3"])
    assert s.score(agreed) > s.score(solo)


def test_rerank_diversify_represents_lone_interest_seed():
    # one busy seed with many close hits, one lone seed with a single decent hit
    busy = [cand(f"nix{i}", 0.10 + i * 0.001, ["nix"], primary="nix") for i in range(10)]
    lone = [cand("font1", 0.20, ["font"], primary="font")]
    out = rerank(busy + lone, DefaultScorer(), max_n=5, diversify=True)
    primaries = {c.primary_seed for c in out}
    assert "font" in primaries  # lone seed not buried by the busy cluster
    assert out[0].primary_seed == "nix"  # global best still leads


def test_rerank_without_diversify_is_global_top_n():
    busy = [cand(f"nix{i}", 0.10 + i * 0.001, ["nix"], primary="nix") for i in range(10)]
    lone = [cand("font1", 0.20, ["font"], primary="font")]
    out = rerank(busy + lone, DefaultScorer(), max_n=5, diversify=False)
    assert all(c.primary_seed == "nix" for c in out)  # font buried
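The ranking tests above describe three behaviours: a distance floor, a score that rewards both closeness and agreement between seeds, and a diversified rerank that keeps a lone interest from being buried. `app.rank` itself is not in this part of the diff, so what follows is one possible sketch that satisfies the same properties, not the engine's actual code. The `Hit` class, the `0.05` consensus bonus, the inclusive floor comparison, and the round-robin strategy are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    key: str
    distance: float  # cosine distance: lower means closer
    seeds: list = field(default_factory=list)
    primary_seed: str = ""


def score(hit, consensus_bonus=0.05):
    """Closer hits score higher; each extra agreeing seed adds a small bonus."""
    return (1.0 - hit.distance) + consensus_bonus * (len(hit.seeds) - 1)


def apply_floor(hits, floor):
    """Keep only hits at or below the distance floor (boundary is an assumption)."""
    return [h for h in hits if h.distance <= floor]


def rerank(hits, max_n, diversify=True):
    ranked = sorted(hits, key=score, reverse=True)
    if not diversify:
        return ranked[:max_n]

    # One queue per primary seed, each already in score order. Dict insertion
    # order means seeds are visited in order of their best hit.
    queues = {}
    for h in ranked:
        queues.setdefault(h.primary_seed, []).append(h)

    # Round-robin: every seed gets a slot before any seed gets a second one.
    picked = []
    while len(picked) < max_n and any(queues.values()):
        for q in queues.values():
            if q and len(picked) < max_n:
                picked.append(q.pop(0))

    # Present the final list best-first so the global best still leads.
    return sorted(picked, key=score, reverse=True)
```

Round-robin is the simplest way to guarantee representation. A quota or penalty-based scheme would trade that guarantee for finer control over how many slots a weak interest can claim.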
+69
recommendation/tests/test_recommend_shape.py
from __future__ import annotations

from app.config import Settings
from app.recommend import _issue_out, _repo_out, _seed_url_map
from app.types import Candidate


def _settings() -> Settings:
    return Settings(db_connection_string="", web_base="https://tangled.org")


def test_seed_url_map_builds_absolute_urls():
    seeds = [
        {"repo_did": "did:plc:a", "repo_name": "nixpkgs", "owner_handle": "nixos"},
        {"repo_did": "did:plc:b", "repo_name": "cli", "owner_handle": "me"},
    ]
    urls = _seed_url_map(seeds, _settings())
    assert urls["nixpkgs"] == "https://tangled.org/@nixos/nixpkgs"
    assert urls["cli"] == "https://tangled.org/@me/cli"


def test_repo_out_includes_recommended_and_seed_urls():
    c = Candidate(
        key="did:plc:target",
        content_hash="h",
        distance=0.1,
        seeds=["my-cli"],
        primary_seed="my-cli",
        payload={
            "owner_handle": "them",
            "repo_name": "cool-repo",
            "description": "desc",
            "created_at": "2026-01-01T00:00:00Z",
        },
    )
    out = _repo_out(c, _settings(), {}, {"my-cli": "https://tangled.org/@me/my-cli"})
    assert out.url == "https://tangled.org/@them/cool-repo"
    assert out.basedOnRepoUrl == "https://tangled.org/@me/my-cli"


def test_issue_out_includes_parent_and_seed_urls():
    c = Candidate(
        key="at://did/issue/1",
        content_hash="h",
        distance=0.2,
        seeds=["dotfiles"],
        primary_seed="dotfiles",
        payload={
            "uri": "at://did:plc:them/sh.tangled.repo.issue/abc",
            "owner_handle": "them",
            "repo_name": "proj",
            "title": "Fix bug",
            "repo_did": "did:plc:target",
            "rkey": "abc",
            "repo_readme": "# Proj",
            "created_at": "2026-01-01T00:00:00Z",
        },
    )
    out = _issue_out(
        c,
        _settings(),
        {"dotfiles": "https://tangled.org/@me/dotfiles"},
        {"at://did:plc:them/sh.tangled.repo.issue/abc"},
    )
    assert out.issueUri == "at://did:plc:them/sh.tangled.repo.issue/abc"
    assert out.url == "https://tangled.org/@them/proj"
    assert out.basedOnRepoUrl == "https://tangled.org/@me/dotfiles"
    assert out.repoReadme == "# Proj"
    assert out.hasQuestionnaire is True
+22
recommendation/tests/test_search.py
from __future__ import annotations

from app.search import parallel_seed_search


def test_parallel_seed_search_preserves_seed_order():
    calls: list[str] = []

    def search(seed: dict) -> list[dict]:
        calls.append(seed["id"])
        return [{"repo_did": f"hit-{seed['id']}", "content": "x", "distance": 0.1}]

    seeds = [{"id": "a", "repo_did": "a"}, {"id": "b", "repo_did": "b"}]
    hits = parallel_seed_search(seeds, search, max_workers=2)

    assert calls == ["a", "b"]
    assert [label for label, _ in hits] == ["a", "b"]
    assert len(hits[0][1]) == 1


def test_parallel_seed_search_empty():
    assert parallel_seed_search([], lambda s: [], max_workers=4) == []
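The search tests require that results come back in seed order even though the per-seed searches run on a worker pool. `app.search.parallel_seed_search` is not shown here, so the following is only a generic sketch of the order-preserving pattern using the standard library. The name `ordered_parallel_map` is hypothetical, and it returns bare results, not the `(label, hits)` pairs the real function appears to produce.

```python
from concurrent.futures import ThreadPoolExecutor


def ordered_parallel_map(items, fn, max_workers=4):
    """Run fn over items on a thread pool; results come back in input order."""
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs, regardless
        # of which worker happens to finish first.
        return list(pool.map(fn, items))
```

Relying on `Executor.map` avoids the manual bookkeeping that `as_completed` would need to restore input order.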
+11
recommendationold/.env.example
# Connection string for the SHARED Postgres database.
# This database is OWNED BY THE DATA-COLLECTION TEAMMATE.
# Existing tables are read-only for the rec engine EXCEPT the embedding columns of
# tangled_readmes (embedding / embedding_model / embedded_at), which we fill.
DB_CONNECTION_STRING=postgresql://user:password@host:5432/postgres

# Google Gemini API key (Google AI Studio) for README embeddings.
# Model: gemini-embedding-001 at outputDimensionality=1536 (matches the vector(1536) column).
GEMINI_API_KEY=your-gemini-api-key
# Optional override:
# GEMINI_EMBED_MODEL=gemini-embedding-001
+7
recommendationold/.gitignore
node_modules/
dist/
.env
.env.*
!.env.example
*.log
.DS_Store
+161
recommendationold/package-lock.json
··· 1 + { 2 + "name": "tangled-recommendation", 3 + "version": "0.1.0", 4 + "lockfileVersion": 3, 5 + "requires": true, 6 + "packages": { 7 + "": { 8 + "name": "tangled-recommendation", 9 + "version": "0.1.0", 10 + "dependencies": { 11 + "pg": "^8.22.0" 12 + } 13 + }, 14 + "node_modules/pg": { 15 + "version": "8.22.0", 16 + "resolved": "https://registry.npmjs.org/pg/-/pg-8.22.0.tgz", 17 + "integrity": "sha512-8wih1vVIBMxoUM2oB4soJsD9tDnDpLv4OXBJ+EJzFsvycD+lfyIreC2gGHq78f8jbLLt+bvlPTFdFZfJkOuzAA==", 18 + "license": "MIT", 19 + "dependencies": { 20 + "pg-connection-string": "^2.14.0", 21 + "pg-pool": "^3.14.0", 22 + "pg-protocol": "^1.15.0", 23 + "pg-types": "2.2.0", 24 + "pgpass": "1.0.5" 25 + }, 26 + "engines": { 27 + "node": ">= 16.0.0" 28 + }, 29 + "optionalDependencies": { 30 + "pg-cloudflare": "^1.4.0" 31 + }, 32 + "peerDependencies": { 33 + "pg-native": ">=3.0.1" 34 + }, 35 + "peerDependenciesMeta": { 36 + "pg-native": { 37 + "optional": true 38 + } 39 + } 40 + }, 41 + "node_modules/pg-cloudflare": { 42 + "version": "1.4.0", 43 + "resolved": "https://registry.npmjs.org/pg-cloudflare/-/pg-cloudflare-1.4.0.tgz", 44 + "integrity": "sha512-Vo7z/6rrQYxpNRylp4Tlob2elzbh+N/MOQbxFVWCxS7oEx6jF53GTJFxK2WWpKuBRkmiin4Mt+xofFDjx09R0A==", 45 + "license": "MIT", 46 + "optional": true 47 + }, 48 + "node_modules/pg-connection-string": { 49 + "version": "2.14.0", 50 + "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.14.0.tgz", 51 + "integrity": "sha512-XwWDGcLRGCXAR8F/AM5bG7Q+A3Wm2s6QeEjlOKZLlH3UYcguiqCWKyWXVag5TLTIjR7oOJUY8kcADaZgWPyLeg==", 52 + "license": "MIT" 53 + }, 54 + "node_modules/pg-int8": { 55 + "version": "1.0.1", 56 + "resolved": "https://registry.npmjs.org/pg-int8/-/pg-int8-1.0.1.tgz", 57 + "integrity": "sha512-WCtabS6t3c8SkpDBUlb1kjOs7l66xsGdKpIPZsg4wR+B3+u9UAum2odSsF9tnvxg80h4ZxLWMy4pRjOsFIqQpw==", 58 + "license": "ISC", 59 + "engines": { 60 + "node": ">=4.0.0" 61 + } 62 + }, 63 + "node_modules/pg-pool": { 64 + "version": 
"3.14.0", 65 + "resolved": "https://registry.npmjs.org/pg-pool/-/pg-pool-3.14.0.tgz", 66 + "integrity": "sha512-gKtPkFdQPU3DksooVLi9LsjZxrsBUZIpa+7aVx+LV5pNh0KzP4Zleud2po+ConrxbuXGBJ6Hfer6hdgpIBpBaw==", 67 + "license": "MIT", 68 + "peerDependencies": { 69 + "pg": ">=8.0" 70 + } 71 + }, 72 + "node_modules/pg-protocol": { 73 + "version": "1.15.0", 74 + "resolved": "https://registry.npmjs.org/pg-protocol/-/pg-protocol-1.15.0.tgz", 75 + "integrity": "sha512-cq9sECI5s0+uPUXjbz8ioyPJni6RzsRib0US67i5IoTZKw8fNeYlVE7u8F4dG7vEJJtc5wdD1K189lCCUwqWTQ==", 76 + "license": "MIT" 77 + }, 78 + "node_modules/pg-types": { 79 + "version": "2.2.0", 80 + "resolved": "https://registry.npmjs.org/pg-types/-/pg-types-2.2.0.tgz", 81 + "integrity": "sha512-qTAAlrEsl8s4OiEQY69wDvcMIdQN6wdz5ojQiOy6YRMuynxenON0O5oCpJI6lshc6scgAY8qvJ2On/p+CXY0GA==", 82 + "license": "MIT", 83 + "dependencies": { 84 + "pg-int8": "1.0.1", 85 + "postgres-array": "~2.0.0", 86 + "postgres-bytea": "~1.0.0", 87 + "postgres-date": "~1.0.4", 88 + "postgres-interval": "^1.1.0" 89 + }, 90 + "engines": { 91 + "node": ">=4" 92 + } 93 + }, 94 + "node_modules/pgpass": { 95 + "version": "1.0.5", 96 + "resolved": "https://registry.npmjs.org/pgpass/-/pgpass-1.0.5.tgz", 97 + "integrity": "sha512-FdW9r/jQZhSeohs1Z3sI1yxFQNFvMcnmfuj4WBMUTxOrAyLMaTcE1aAMBiTlbMNaXvBCQuVi0R7hd8udDSP7ug==", 98 + "license": "MIT", 99 + "dependencies": { 100 + "split2": "^4.1.0" 101 + } 102 + }, 103 + "node_modules/postgres-array": { 104 + "version": "2.0.0", 105 + "resolved": "https://registry.npmjs.org/postgres-array/-/postgres-array-2.0.0.tgz", 106 + "integrity": "sha512-VpZrUqU5A69eQyW2c5CA1jtLecCsN2U/bD6VilrFDWq5+5UIEVO7nazS3TEcHf1zuPYO/sqGvUvW62g86RXZuA==", 107 + "license": "MIT", 108 + "engines": { 109 + "node": ">=4" 110 + } 111 + }, 112 + "node_modules/postgres-bytea": { 113 + "version": "1.0.1", 114 + "resolved": "https://registry.npmjs.org/postgres-bytea/-/postgres-bytea-1.0.1.tgz", 115 + "integrity": 
"sha512-5+5HqXnsZPE65IJZSMkZtURARZelel2oXUEO8rH83VS/hxH5vv1uHquPg5wZs8yMAfdv971IU+kcPUczi7NVBQ==", 116 + "license": "MIT", 117 + "engines": { 118 + "node": ">=0.10.0" 119 + } 120 + }, 121 + "node_modules/postgres-date": { 122 + "version": "1.0.7", 123 + "resolved": "https://registry.npmjs.org/postgres-date/-/postgres-date-1.0.7.tgz", 124 + "integrity": "sha512-suDmjLVQg78nMK2UZ454hAG+OAW+HQPZ6n++TNDUX+L0+uUlLywnoxJKDou51Zm+zTCjrCl0Nq6J9C5hP9vK/Q==", 125 + "license": "MIT", 126 + "engines": { 127 + "node": ">=0.10.0" 128 + } 129 + }, 130 + "node_modules/postgres-interval": { 131 + "version": "1.2.0", 132 + "resolved": "https://registry.npmjs.org/postgres-interval/-/postgres-interval-1.2.0.tgz", 133 + "integrity": "sha512-9ZhXKM/rw350N1ovuWHbGxnGh/SNJ4cnxHiM0rxE4VN41wsg8P8zWn9hv/buK00RP4WvlOyr/RBDiptyxVbkZQ==", 134 + "license": "MIT", 135 + "dependencies": { 136 + "xtend": "^4.0.0" 137 + }, 138 + "engines": { 139 + "node": ">=0.10.0" 140 + } 141 + }, 142 + "node_modules/split2": { 143 + "version": "4.2.0", 144 + "resolved": "https://registry.npmjs.org/split2/-/split2-4.2.0.tgz", 145 + "integrity": "sha512-UcjcJOWknrNkF6PLX83qcHM6KHgVKNkV62Y8a5uYDVv9ydGQVwAHMKqHdJje1VTWpljG0WYpCDhrCdAOYH4TWg==", 146 + "license": "ISC", 147 + "engines": { 148 + "node": ">= 10.x" 149 + } 150 + }, 151 + "node_modules/xtend": { 152 + "version": "4.0.2", 153 + "resolved": "https://registry.npmjs.org/xtend/-/xtend-4.0.2.tgz", 154 + "integrity": "sha512-LKYU1iAXJXUgAXn9URjiu+MWhyUXHsvfp7mcuYm9dSUKK0/CjtrUwFAxD82/mCWbtLsGjFIad0wIsod4zrTAEQ==", 155 + "license": "MIT", 156 + "engines": { 157 + "node": ">=0.4" 158 + } 159 + } 160 + } 161 + }
+15
recommendationold/package.json
{
  "name": "tangled-recommendation",
  "version": "0.1.0",
  "private": true,
  "type": "module",
  "description": "Recommendation engine for Tangled (AT Protocol) repo/issue discovery",
  "scripts": {
    "embed:check": "DRY_RUN=1 node src/embed_readmes.mjs",
    "embed": "node src/embed_readmes.mjs",
    "readme-coverage": "node src/readme_coverage.mjs"
  },
  "dependencies": {
    "pg": "^8.22.0"
  }
}
+52
recommendationold/src/check_new.mjs
import pg from "pg";
import { readFileSync } from "node:fs";
function loadConn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env", "../../.env"]) {
    try {
      const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m);
      if (m) return m[1].trim();
    } catch {}
  }
  throw new Error("DB_CONNECTION_STRING not found");
}
const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== all tables/views (every schema) ===");
console.table((await pool.query(`
  select table_schema, table_name, table_type
  from information_schema.tables
  where table_schema not in ('pg_catalog','information_schema')
  order by table_schema, table_name`)).rows);

console.log("\n=== columns matching 'embed' or 'readme' (any table) ===");
const hits = await pool.query(`
  select table_schema, table_name, column_name, data_type
  from information_schema.columns
  where table_schema not in ('pg_catalog','information_schema')
    and (column_name ~* 'embed|readme|vector')
  order by table_schema, table_name, ordinal_position`);
console.table(hits.rows.length ? hits.rows : [{ note: "no columns named embed*/readme*/vector*" }]);

console.log("\n=== tables matching 'embed' or 'readme' by NAME ===");
const tn = await pool.query(`
  select table_schema, table_name from information_schema.tables
  where table_name ~* 'embed|readme'
  order by 1,2`);
console.table(tn.rows.length ? tn.rows : [{ note: "no table named embed*/readme*" }]);

// If a readme column/table exists, show count + sample
for (const r of [...hits.rows, ...tn.rows]) {
  const t = `"${r.table_schema}"."${r.table_name}"`;
  try {
    const c = await pool.query(`select count(*)::int n from ${t}`);
    console.log(`count ${t}: ${c.rows[0].n}`);
  } catch (e) { /* ignore dup */ }
}

console.log("\n=== columns on tangled_repos (did a readme/embedding col get added here?) ===");
console.table((await pool.query(`
  select column_name, data_type from information_schema.columns
  where table_schema='public' and table_name='tangled_repos' order by ordinal_position`)).rows);

await pool.end();
+51
recommendationold/src/check_readmes.mjs
import pg from "pg";
import { readFileSync } from "node:fs";
function loadConn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env"]) {
    try { const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); if (m) return m[1].trim(); } catch {}
  }
  throw new Error("no conn");
}
const pool = new pg.Pool({ connectionString: loadConn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== full tangled_readmes columns ===");
console.table((await pool.query(`
  select c.ordinal_position, c.column_name, c.data_type, c.udt_name, c.is_nullable
  from information_schema.columns c
  where c.table_schema='public' and c.table_name='tangled_readmes'
  order by c.ordinal_position`)).rows);

console.log("\n=== embedding column: pgvector dimensions ===");
console.table((await pool.query(`
  select a.attname, format_type(a.atttypid, a.atttypmod) as type
  from pg_attribute a
  join pg_class c on c.oid=a.attrelid join pg_namespace n on n.oid=c.relnamespace
  where n.nspname='public' and c.relname='tangled_readmes' and a.attnum>0 and not a.attisdropped
    and format_type(a.atttypid,a.atttypmod) ~* 'vector'`)).rows);

console.log("\n=== counts ===");
console.table((await pool.query(`
  select
    count(*)::int total,
    count(*) filter (where embedding is not null)::int with_embedding,
    count(distinct embedding_model)::int models
  from tangled_readmes`)).rows);

console.log("\n=== embedding_model values ===");
console.table((await pool.query(`select embedding_model, count(*)::int n from tangled_readmes group by 1 order by 2 desc`)).rows);

console.log("\n=== indexes on tangled_readmes (ivfflat/hnsw?) ===");
console.table((await pool.query(`select indexname, indexdef from pg_indexes where schemaname='public' and tablename='tangled_readmes'`)).rows);

console.log("\n=== sample row (text truncated, embedding omitted) ===");
const s = await pool.query(`select * from tangled_readmes limit 1`);
if (s.rows.length) {
  const r = { ...s.rows[0] };
  for (const k of Object.keys(r)) {
    if (k === "embedding") r[k] = `<vector len=${typeof r[k] === "string" ? (r[k].match(/,/g)?.length ?? 0) + 1 : "?"}>`;
    else if (typeof r[k] === "string" && r[k].length > 160) r[k] = r[k].slice(0, 160) + "…";
  }
  console.log(JSON.stringify(r, null, 2));
}
await pool.end();
+80
recommendationold/src/clustered_recommend.mjs
// Cluster-then-retrieve recommender: preserves a user's multiple distinct interests.
// Contrasts NAIVE pooled top-K (one cluster can dominate) vs CLUSTERED round-robin (balanced).
import pg from "pg";
import { readFileSync } from "node:fs";
import { createHash } from "node:crypto";
function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 });

const USER = process.env.USER_DID || "did:plc:y7g2koy4nqw7434s67fgfjca";
const K = parseInt(process.env.K ?? "10", 10);
const T = parseFloat(process.env.CLUSTER_T ?? "0.22"); // cosine-dist threshold to consider two seeds "same interest"
const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex");
const parseVec = (s) => s.replace(/^\[|\]$/g, "").split(",").map(Number);
const cosDist = (a, b) => { let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return 1 - d; };

async function main() {
  // 1) the user's contributed repos (here: owned) with embeddings
  const seeds = (await pool.query(
    `select repo_did, repo_name, content, embedding::text as etext
     from tangled_readmes where embedding is not null and repo_uri like $1`, [`at://${USER}/%`])).rows;
  if (seeds.length < 2) { console.log("not enough embedded seed repos for this user"); await pool.end(); return; }
  seeds.forEach((s) => (s.vec = parseVec(s.etext)));
  console.log(`USER ${USER}`);
  console.log(`contributed repos (${seeds.length}): ${seeds.map((s) => s.repo_name).join(", ")}\n`);

  // 2) cluster seeds: single-linkage connected components at threshold T (union-find)
  const parent = seeds.map((_, i) => i);
  const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  for (let i = 0; i < seeds.length; i++)
    for (let j = i + 1; j < seeds.length; j++)
      if (cosDist(seeds[i].vec, seeds[j].vec) < T) parent[find(i)] = find(j);
  const clusters = new Map();
  seeds.forEach((s, i) => { const r = find(i); (clusters.get(r) ?? clusters.set(r, []).get(r)).push(s); });
  const clusterList = [...clusters.values()];
  console.log(`→ ${clusterList.length} interest cluster(s):`);
  clusterList.forEach((c, i) => console.log(` [${i + 1}] ${c.map((s) => s.repo_name).join(", ")}`));

  // 3) retrieve neighbors per seed (drop user's own repos), tag with cluster + min dist
  const ownRepoDids = new Set(seeds.map((s) => s.repo_did));
  const seenContent = new Set(seeds.map((s) => hash(s.content)));
  // candidate -> { repo_name, dist, clusterIdx }
  const cand = new Map();
  for (let ci = 0; ci < clusterList.length; ci++) {
    for (const seed of clusterList[ci]) {
      const rows = (await pool.query(
        `select repo_name, repo_did, content, round((embedding <=> $1::vector)::numeric,4) dist
         from tangled_readmes where embedding is not null and repo_did <> all($2)
         order by embedding <=> $1::vector limit 25`, [seed.etext, [...ownRepoDids]])).rows;
      for (const r of rows) {
        const h = hash(r.content);
        if (seenContent.has(h)) continue; // collapse forks / user's own content
        const prev = cand.get(h);
        const dist = Number(r.dist);
        if (!prev || dist < prev.dist) cand.set(h, { repo_name: r.repo_name, dist, clusterIdx: ci });
      }
    }
  }
  const all = [...cand.values()];

  // 4a) NAIVE pooled: global top-K by distance
  const naive = [...all].sort((a, b) => a.dist - b.dist).slice(0, K);

  // 4b) CLUSTERED round-robin: rank within each cluster, then take turns → balanced coverage
  const perCluster = clusterList.map((_, ci) => all.filter((c) => c.clusterIdx === ci).sort((a, b) => a.dist - b.dist));
  const clustered = [];
  const used = new Set();
  for (let round = 0; clustered.length < K && round < 50; round++) {
    for (let ci = 0; ci < perCluster.length && clustered.length < K; ci++) {
      const next = perCluster[ci].find((c) => !used.has(c.repo_name));
      if (next) { used.add(next.repo_name); clustered.push(next); }
    }
  }

  const fmt = (arr) => arr.map((c, i) => ` ${String(i + 1).padStart(2)}. ${(c.repo_name ?? "?").padEnd(30)} dist=${c.dist} [interest ${c.clusterIdx + 1}]`).join("\n");
  const cover = (arr) => { const s = new Set(arr.map((c) => c.clusterIdx)); return `${s.size}/${clusterList.length} interests`; };
  console.log(`\n===== NAIVE pooled top-${K} (covers ${cover(naive)}) =====\n${fmt(naive)}`);
  console.log(`\n===== CLUSTERED round-robin top-${K} (covers ${cover(clustered)}) =====\n${fmt(clustered)}`);
  await pool.end();
}
main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
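The script above combines two ideas: single-linkage clustering of seed embeddings via union-find (steps 1 and 2), and a round-robin pick across clusters (step 4b). The sketch below restates just those two pieces in Python, the language of the ported service, with no database access. It is an illustration under assumptions, not the ported implementation: the function names are hypothetical, vectors are assumed to be L2-normalized so that `1 - dot` is the cosine distance, and the `repo_name` de-duplication from the original loop is omitted.

```python
def cosine_distance(a, b):
    """1 - dot(a, b); assumes both vectors are already L2-normalized."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cluster_seeds(vectors, threshold=0.22):
    """Single-linkage clusters: connected components of 'distance < threshold'."""
    parent = list(range(len(vectors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                parent[find(i)] = find(j)

    # Group seed indices by their component root.
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def round_robin(per_cluster, k):
    """Take the best remaining candidate from each cluster in turn, up to k."""
    queues = [list(c) for c in per_cluster]  # each list is already best-first
    picked = []
    while len(picked) < k and any(queues):
        for q in queues:
            if q and len(picked) < k:
                picked.append(q.pop(0))
    return picked
```

With two interests where one has many close candidates, `round_robin` gives the smaller interest its slot in the first pass, which is the balanced coverage the script's final printout compares against the naive pooled top-K.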
+29
recommendationold/src/discover_api.mjs
// Probe knot1.tangled.sh to discover the real XRPC endpoint names/paths.
const repo = "did:plc:qsctypxlsrippb5wculrsj7q"; // had a cached languages snapshot
const host = "knot1.tangled.sh";

const candidates = [
  `/xrpc/sh.tangled.repo.languages?repo=${repo}`,
  `/xrpc/sh.tangled.repo.branches?repo=${repo}&limit=100`,
  `/xrpc/sh.tangled.repo.getDefaultBranch?repo=${repo}`,
  `/xrpc/sh.tangled.git.temp.getTree?repo=${repo}&ref=HEAD&path=`,
  `/xrpc/sh.tangled.git.temp.getTree?repo=${repo}&ref=main&path=`,
  `/xrpc/sh.tangled.git.temp.getEntry?repo=${repo}&ref=HEAD&path=README.md`,
  `/xrpc/sh.tangled.git.listRefs?repo=${repo}`,
  `/xrpc/sh.tangled.git.temp.listBranches?repo=${repo}`,
];

for (const path of candidates) {
  const url = `https://${host}${path}`;
  try {
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), 10000);
    const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
    clearTimeout(t);
    const txt = await resp.text();
    console.log(`[${resp.status}] ${path}`);
    console.log(` -> ${txt.slice(0, 240).replace(/\n/g, " ")}`);
  } catch (e) {
    console.log(`[ERR] ${path} -> ${e.name}:${e.message}`);
  }
}
+33
recommendationold/src/discover_api2.mjs
const repo = "did:plc:qsctypxlsrippb5wculrsj7q";
const host = "knot1.tangled.sh";
const ref = "trunk";

const candidates = [
  `/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=${ref}&path=`,
  `/xrpc/sh.tangled.repo.getTree?repo=${repo}&ref=${ref}&path=`,
  `/xrpc/sh.tangled.repo.index?repo=${repo}&ref=${ref}`,
  `/xrpc/sh.tangled.repo.index?repo=${repo}`,
  `/xrpc/sh.tangled.repo.readme?repo=${repo}&ref=${ref}`,
  `/xrpc/sh.tangled.repo.getReadme?repo=${repo}&ref=${ref}`,
  `/xrpc/sh.tangled.repo.tags?repo=${repo}&limit=100`,
  `/xrpc/sh.tangled.repo.listFiles?repo=${repo}&ref=${ref}&path=`,
  `/xrpc/sh.tangled.repo.files?repo=${repo}&ref=${ref}&path=`,
  `/xrpc/sh.tangled.repo.blob?repo=${repo}&ref=${ref}&path=README.md`,
  `/xrpc/sh.tangled.repo.getBlob?repo=${repo}&ref=${ref}&path=README.md`,
  `/xrpc/sh.tangled.repo.entry?repo=${repo}&ref=${ref}&path=README.md`,
];

for (const path of candidates) {
  const url = `https://${host}${path}`;
  try {
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), 10000);
    const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
    clearTimeout(t);
    const txt = await resp.text();
    console.log(`[${resp.status}] ${path.split("?")[0].replace("/xrpc/", "")}`);
    if (resp.ok) console.log(` -> ${txt.slice(0, 400).replace(/\n/g, " ")}`);
  } catch (e) {
    console.log(`[ERR] ${path} -> ${e.name}`);
  }
}
+23
recommendationold/src/discover_api3.mjs
const repo = "did:plc:qsctypxlsrippb5wculrsj7q";
const host = "knot1.tangled.sh";
async function get(path) {
  const ctrl = new AbortController();
  const t = setTimeout(() => ctrl.abort(), 10000);
  try {
    const resp = await fetch(`https://${host}${path}`, { signal: ctrl.signal, headers: { accept: "application/json" } });
    const txt = await resp.text();
    return { status: resp.status, txt };
  } finally { clearTimeout(t); }
}
// Full tree (does it include a readme field? top-level keys?)
const full = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=trunk&path=`);
let j; try { j = JSON.parse(full.txt); } catch {}
console.log("tree top-level keys:", j ? Object.keys(j) : "(parse fail)");
console.log("file names:", (j?.files || []).map((f) => f.name));
console.log("has top-level 'readme' key:", j && "readme" in j, "->", JSON.stringify(j?.readme)?.slice(0, 120));
// Without ref
const noref = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&path=`);
console.log("\ntree WITHOUT ref: status", noref.status, "->", noref.txt.slice(0, 120).replace(/\n/g, " "));
// Empty ref
const emptyref = await get(`/xrpc/sh.tangled.repo.tree?repo=${repo}&ref=&path=`);
console.log("tree EMPTY ref: status", emptyref.status, "->", emptyref.txt.slice(0, 120).replace(/\n/g, " "));
+169
recommendationold/src/embed_readmes.mjs
// Embed all unembedded READMEs in tangled_readmes using Google Gemini embeddings.
//
// - Reads the worklist (status='found' AND content IS NOT NULL AND embedding IS NULL),
//   the exact predicate behind tangled_readmes_unembedded_idx.
// - Embeds doc = "# <name>\n\n<description>\n\n<README>" with gemini-embedding-001 at
//   outputDimensionality=1536 (matches the vector(1536) column), task RETRIEVAL_DOCUMENT.
// - L2-normalizes (sub-3072 MRL dims aren't auto-normalized) so the HNSW cosine index is happy.
// - UPDATEs only the embedding columns, only where embedding IS NULL → idempotent / re-runnable.
//
// Env: DB_CONNECTION_STRING (or ../.env), GEMINI_API_KEY (required).
// Optional: LIMIT (0=all), CONCURRENCY (default 4), DRY_RUN=1 (count only), MAX_CHARS (default 8000).

import pg from "pg";
import { readFileSync } from "node:fs";

function fromEnvFile(key) {
  for (const p of ["../.env", ".env", "../../.env"]) {
    try {
      const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)\\s*$`, "m"));
      if (m) return m[1].trim().replace(/^["']|["']$/g, "");
    } catch {}
  }
  return undefined;
}

const CONN = process.env.DB_CONNECTION_STRING || fromEnvFile("DB_CONNECTION_STRING");
const API_KEY = process.env.GEMINI_API_KEY || fromEnvFile("GEMINI_API_KEY");
const MODEL = process.env.GEMINI_EMBED_MODEL || fromEnvFile("GEMINI_EMBED_MODEL") || "gemini-embedding-001";
const DIMS = 1536;
const LIMIT = parseInt(process.env.LIMIT ?? "0", 10);
const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "4", 10);
const MAX_CHARS = parseInt(process.env.MAX_CHARS ?? "8000", 10);
const DRY_RUN = process.env.DRY_RUN === "1";

if (!CONN) { console.error("DB_CONNECTION_STRING not set"); process.exit(1); }
if (!API_KEY && !DRY_RUN) { console.error("GEMINI_API_KEY not set (add it to recommendation/.env)"); process.exit(1); }

const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 5 });
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

function buildDoc({ repo_name, description, content }) {
  const parts = [];
  if (repo_name) parts.push(`# ${repo_name}`);
  if (description && description.trim()) parts.push(description.trim());
  parts.push(content);
  return parts.join("\n\n").slice(0, MAX_CHARS);
}

function l2normalize(v) {
  let s = 0;
  for (const x of v) s += x * x;
  const n = Math.sqrt(s) || 1;
  return v.map((x) => x / n);
}

async function embedOnce(text, dims) {
  const url = `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`;
  const body = {
    model: `models/${MODEL}`,
    content: { parts: [{ text }] },
    taskType: "RETRIEVAL_DOCUMENT",
    outputDimensionality: dims,
  };
  const resp = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json", "x-goog-api-key": API_KEY },
    body: JSON.stringify(body),
  });
  const txt = await resp.text();
  if (!resp.ok) {
    const err = new Error(`HTTP ${resp.status}: ${txt.slice(0, 200)}`);
    err.status = resp.status;
    throw err;
  }
  const j = JSON.parse(txt);
  const values = j?.embedding?.values;
  if (!Array.isArray(values)) throw new Error(`no embedding in response: ${txt.slice(0, 150)}`);
  return values;
}

// Embed with retries; on 400 (often too-long input) halve the input and retry
// while it is still longer than 2000 chars. Other 4xx (except 429) fail immediately.
async function embedWithRetry(text) {
  let attempt = 0;
  let input = text;
  while (true) {
    try {
      const v = await embedOnce(input, DIMS);
      return l2normalize(v);
    } catch (e) {
      attempt++;
      if (e.status === 400 && input.length > 2000) {
        input = input.slice(0, Math.floor(input.length / 2));
        continue;
      }
      if (attempt >= 5 || (e.status && e.status >= 400 && e.status < 500 && e.status !== 429)) {
        throw e;
      }
      const backoff = Math.min(30000, 800 * 2 ** (attempt - 1));
      await sleep(backoff);
    }
  }
}

async function main() {
  const worklistSql = `
    select r.repo_did, r.repo_name, r.content,
           coalesce(tr.record_raw->>'description', '') as description,
           length(r.content) as len
    from tangled_readmes r
    left join tangled_repos tr
      on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
    where r.status = 'found' and r.content is not null and r.embedding is null
    order by r.repo_did
    ${LIMIT > 0 ? `limit ${LIMIT}` : ""}`;

  const { rows } = await pool.query(worklistSql);
  const totalReadmes = (await pool.query(`select count(*)::int n from tangled_readmes`)).rows[0].n;
  const alreadyEmbedded = (await pool.query(`select count(*)::int n from tangled_readmes where embedding is not null`)).rows[0].n;

  console.log(`tangled_readmes total=${totalReadmes} already embedded=${alreadyEmbedded}`);
  console.log(`worklist (to embed now)=${rows.length} model=${MODEL} dims=${DIMS} concurrency=${CONCURRENCY}${LIMIT ? ` limit=${LIMIT}` : ""}`);
  if (DRY_RUN) { console.log("\nDRY_RUN=1 → not embedding, not writing."); await pool.end(); return; }
  if (rows.length === 0) { console.log("\nNothing to embed. ✔"); await pool.end(); return; }

  let done = 0, ok = 0, failed = 0;
  const errors = [];
  const queue = rows.slice();

  async function worker(id) {
    while (queue.length) {
      const r = queue.pop();
      try {
        const doc = buildDoc(r);
        const vec = await embedWithRetry(doc);
        const literal = `[${vec.join(",")}]`;
        const res = await pool.query(
          `update tangled_readmes
           set embedding = $1::vector, embedding_model = $2, embedded_at = now()
           where repo_did = $3 and embedding is null`,
          [literal, MODEL, r.repo_did],
        );
        if (res.rowCount > 0) ok++;
      } catch (e) {
        failed++;
        errors.push({ repo_did: r.repo_did, name: r.repo_name, err: e.message });
      }
      if (++done % 25 === 0 || done === rows.length) {
        process.stderr.write(` ...${done}/${rows.length} (ok=${ok} fail=${failed})\n`);
      }
    }
  }

  await Promise.all(Array.from({ length: CONCURRENCY }, (_, i) => worker(i)));

  console.log(`\n================ EMBEDDING DONE ================`);
  console.log(`embedded ok : ${ok}`);
  console.log(`failed : ${failed}`);
  if (errors.length) {
    console.log("\nfirst errors:");
    for (const e of errors.slice(0, 10)) console.log(` ${e.name ?? e.repo_did}: ${e.err}`);
  }
  const remaining = (await pool.query(
    `select count(*)::int n from tangled_readmes where status='found' and content is not null and embedding is null`,
  )).rows[0].n;
  console.log(`\nremaining unembedded (status=found): ${remaining}`);
  await pool.end();
}

main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
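Note on the retry timing in `embedWithRetry`: the wait between attempts is `min(30000, 800 * 2^(attempt - 1))` ms. The sketch below isolates that expression as a pure function so the schedule can be read off directly. `backoffMs` is a hypothetical helper name used only for illustration; the script inlines the expression.

```javascript
// Hypothetical helper mirroring the backoff expression in embedWithRetry.
function backoffMs(attempt, baseMs = 800, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// The script throws once attempt reaches 5, so it only sleeps after attempts 1..4.
const schedule = [1, 2, 3, 4].map((a) => backoffMs(a));
console.log(schedule); // [ 800, 1600, 3200, 6400 ]
```

With these constants the 30 s cap is never reached within five attempts; it would only apply from a seventh attempt onward.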
+37
recommendationold/src/explore_users.mjs
// Find owners with several embedded repos, and measure how SPREAD their repos are
// (high mean pairwise cosine distance = multi-interest user, a good demo candidate).
import pg from "pg";
import { readFileSync } from "node:fs";
function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });

const ownerDid = (uri) => uri ? uri.replace("at://", "").split("/")[0] : null;
function parseVec(s){ return s.replace(/^\[|\]$/g, "").split(",").map(Number); }
function cos(a, b){ let d = 0; for (let i = 0; i < a.length; i++) d += a[i]*b[i]; return d; } // already unit-norm

const owners = (await pool.query(`
  select split_part(replace(repo_uri,'at://',''),'/',1) as owner_did,
         count(*)::int n, array_agg(repo_name) as names
  from tangled_readmes
  where embedding is not null and repo_uri is not null
  group by 1 having count(*) between 4 and 12
  order by n desc limit 25`)).rows;

const scored = [];
for (const o of owners) {
  const rows = (await pool.query(
    `select repo_name, embedding::text as e from tangled_readmes where embedding is not null and repo_uri like $1`,
    [`at://${o.owner_did}/%`])).rows;
  const vecs = rows.map((r) => parseVec(r.e));
  let sum = 0, cnt = 0;
  for (let i = 0; i < vecs.length; i++) for (let j = i + 1; j < vecs.length; j++) { sum += 1 - cos(vecs[i], vecs[j]); cnt++; }
  const meanDist = cnt ? sum / cnt : 0;
  scored.push({ owner_did: o.owner_did, n: o.n, meanDist: +meanDist.toFixed(3), names: rows.map((r) => r.repo_name) });
}
scored.sort((a, b) => b.meanDist - a.meanDist);
console.log("most multi-interest owners (high mean pairwise README distance):\n");
for (const s of scored.slice(0, 8)) {
  console.log(`mean_dist=${s.meanDist} n=${s.n} ${s.owner_did}`);
  console.log(` repos: ${s.names.join(", ")}\n`);
}
await pool.end();
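Note on the spread metric in `explore_users.mjs`: it averages `1 - cos(a, b)` over every unordered pair of an owner's README vectors, relying on the vectors already being unit-length. A minimal sketch with toy 2-D vectors (assumed data, not taken from the database) shows how a focused owner and a spread-out owner score. `meanPairwiseDistance` is a hypothetical name for the inline loop.

```javascript
// Dot product; equals cosine similarity only when both inputs are unit-length.
function dot(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) d += a[i] * b[i];
  return d;
}

// Mean of (1 - cosine similarity) over all unordered pairs; 0 with fewer than two vectors.
function meanPairwiseDistance(vecs) {
  let sum = 0, cnt = 0;
  for (let i = 0; i < vecs.length; i++) {
    for (let j = i + 1; j < vecs.length; j++) {
      sum += 1 - dot(vecs[i], vecs[j]);
      cnt++;
    }
  }
  return cnt ? sum / cnt : 0;
}

const focused = [[1, 0], [1, 0], [1, 0]];  // three identical directions
const spread = [[1, 0], [0, 1], [-1, 0]];  // orthogonal and opposite directions
console.log(meanPairwiseDistance(focused)); // 0
console.log(meanPairwiseDistance(spread));  // pair distances 1, 2, 1, so the mean is 4/3
```

Cosine distance ranges from 0 to 2, so a mean close to 0 indicates near-duplicate READMEs and larger values indicate unrelated topics.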
+42
recommendationold/src/fetch_issues.mjs
// Fetch real sh.tangled.repo.issue records live from repo-owner PDSes.
import pg from "pg";
import { readFileSync } from "node:fs";
function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} }
const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 });

// Owners of embedded repos, with their PDS host.
const rows = (await pool.query(`
  select distinct tr.owner_did, pa.pds_host,
         (select repo_name from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null limit 1) as a_repo
  from tangled_repos tr
  join tangled_pds_accounts pa on pa.did = tr.owner_did
  where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null)
  limit 80`)).rows;
await pool.end();

console.log(`probing ${rows.length} owner PDSes for sh.tangled.repo.issue ...`);
const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`);

const found = [];
async function listIssues(r) {
  const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`;
  const ctrl = new AbortController();
  const t = setTimeout(() => ctrl.abort(), 10000);
  try {
    const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } });
    if (!resp.ok) return;
    const j = await resp.json();
    for (const rec of j.records ?? []) found.push({ owner: r.owner_did, uri: rec.uri, value: rec.value });
  } catch {
    // best-effort probe: unreachable or slow PDSes are skipped
  } finally {
    clearTimeout(t); // always clear the abort timer, including on early return or error
  }
}
// simple concurrency
const q = rows.slice();
await Promise.all(Array.from({ length: 12 }, async () => { while (q.length) await listIssues(q.pop()); }));

console.log(`\nfound ${found.length} issue records`);
if (found.length) {
  console.log("\nsample issue record value keys:", Object.keys(found[0].value));
  console.log("sample record:", JSON.stringify(found[0], null, 2).slice(0, 900));
  console.log("\nfirst few titles:");
  for (const f of found.slice(0, 8)) console.log(` - ${f.value.title ?? "(no title)"} [repo ref: ${JSON.stringify(f.value.repo ?? f.value.subject ?? "?")}]`);
}
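Note on pagination in `fetch_issues.mjs`: the script requests a single page of at most 30 records per owner. `com.atproto.repo.listRecords` also accepts a `cursor` parameter and returns one in its response, so a fuller crawl would loop until the cursor is absent. The sketch below only builds the request URL. `listRecordsUrl` is a hypothetical helper, and the host and DID in the usage lines are placeholders.

```javascript
// Hypothetical helper: build a com.atproto.repo.listRecords URL for one owner.
// Accepts a bare hostname or a full origin, like pdsUrl() in the script.
function listRecordsUrl(pdsHost, did, { collection = "sh.tangled.repo.issue", limit = 30, cursor } = {}) {
  const base = /^https?:\/\//.test(pdsHost) ? pdsHost : `https://${pdsHost}`;
  let url = `${base}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(did)}` +
    `&collection=${encodeURIComponent(collection)}&limit=${limit}`;
  // Only append the cursor when a previous page returned one.
  if (cursor) url += `&cursor=${encodeURIComponent(cursor)}`;
  return url;
}

// Placeholder host and DID, for illustration only.
const firstPage = listRecordsUrl("pds.example.com", "did:plc:example");
const nextPage = listRecordsUrl("pds.example.com", "did:plc:example", { cursor: "abc123" });
console.log(firstPage);
console.log(nextPage);
```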
+95
recommendationold/src/issue_experiment.mjs
··· 1 + // Full experiment: fetch real Tangled issues live, embed as queries, vector-search READMEs. 2 + import pg from "pg"; 3 + import { readFileSync } from "node:fs"; 4 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 5 + const API_KEY = fromEnv("GEMINI_API_KEY"); 6 + const MODEL = "gemini-embedding-001"; 7 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 4 }); 8 + const pdsUrl = (h) => (/^https?:\/\//.test(h) ? h : `https://${h}`); 9 + 10 + async function embedQuery(text) { 11 + const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, { 12 + method: "POST", headers: { "content-type": "application/json", "x-goog-api-key": API_KEY }, 13 + body: JSON.stringify({ model: `models/${MODEL}`, content: { parts: [{ text: text.slice(0, 8000) }] }, taskType: "RETRIEVAL_QUERY", outputDimensionality: 1536 }), 14 + }); 15 + if (!resp.ok) throw new Error(`embed HTTP ${resp.status}`); 16 + const v = (await resp.json()).embedding.values; 17 + let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1; 18 + return `[${v.map((x) => x / n).join(",")}]`; 19 + } 20 + 21 + // Map an issue.repo reference (bare DID or at://owner/sh.tangled.repo/rkey) -> knot repoDid in readmes. 22 + async function resolveRepoDid(ref) { 23 + if (!ref) return null; 24 + if (ref.startsWith("at://")) { 25 + const m = ref.match(/^at:\/\/([^/]+)\/[^/]+\/(.+)$/); 26 + if (!m) return null; 27 + const r = await pool.query(`select coalesce(repo_did, record_raw->>'repoDid') as rd from tangled_repos where owner_did=$1 and rkey=$2 limit 1`, [m[1], m[2]]); 28 + return r.rows[0]?.rd ?? 
null; 29 + } 30 + return ref; // bare DID == repoDid 31 + } 32 + 33 + async function fetchIssues() { 34 + const rows = (await pool.query(` 35 + select distinct tr.owner_did, pa.pds_host 36 + from tangled_repos tr join tangled_pds_accounts pa on pa.did = tr.owner_did 37 + where exists (select 1 from tangled_readmes r where r.repo_did = coalesce(tr.repo_did, tr.record_raw->>'repoDid') and r.embedding is not null) 38 + limit 120`)).rows; 39 + const found = []; 40 + const q = rows.slice(); 41 + await Promise.all(Array.from({ length: 14 }, async () => { 42 + while (q.length) { 43 + const r = q.pop(); 44 + const url = `${pdsUrl(r.pds_host)}/xrpc/com.atproto.repo.listRecords?repo=${encodeURIComponent(r.owner_did)}&collection=sh.tangled.repo.issue&limit=30`; 45 + try { 46 + const ctrl = new AbortController(); const t = setTimeout(() => ctrl.abort(), 10000); 47 + const resp = await fetch(url, { signal: ctrl.signal }); 48 + clearTimeout(t); 49 + if (!resp.ok) continue; 50 + const j = await resp.json(); 51 + for (const rec of j.records ?? []) if (rec.value?.title) found.push(rec.value); 52 + } catch {} 53 + } 54 + })); 55 + return found; 56 + } 57 + 58 + async function main() { 59 + const issues = await fetchIssues(); 60 + console.log(`fetched ${issues.length} live issues\n`); 61 + // attach resolved repoDid + whether embedded; prefer substantive bodies whose repo is embedded 62 + for (const iss of issues) { 63 + iss._repoDid = await resolveRepoDid(iss.repo); 64 + iss._embedded = iss._repoDid 65 + ? (await pool.query(`select repo_name from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [iss._repoDid])).rows[0]?.repo_name ?? null 66 + : null; 67 + } 68 + const pick = issues 69 + .filter((i) => (i.body ?? "").length > 60) 70 + .sort((a, b) => (b._embedded ? 1 : 0) - (a._embedded ? 1 : 0) || (b.body?.length ?? 0) - (a.body?.length ?? 
0)) 71 + .slice(0, 4); 72 + 73 + for (const iss of pick) { 74 + console.log("\n" + "=".repeat(72)); 75 + console.log(`ISSUE: ${iss.title}`); 76 + console.log(`own repo: ${iss._embedded ? iss._embedded + " (embedded ✓)" : "(parent README not embedded / unresolved)"}`); 77 + console.log(`body: ${(iss.body ?? "").replace(/\s+/g, " ").slice(0, 200)}…`); 78 + const qvec = await embedQuery(`${iss.title}\n\n${iss.body ?? ""}`); 79 + const hits = (await pool.query(` 80 + select repo_name, repo_did, round((embedding <=> $1::vector)::numeric,4) dist, (repo_did=$2) is_parent 81 + from tangled_readmes where embedding is not null 82 + order by embedding <=> $1::vector limit 8`, [qvec, iss._repoDid])).rows; 83 + console.log("top README matches:"); 84 + hits.forEach((h, i) => console.log(` ${i + 1}. ${h.is_parent ? "👉" : " "} ${(h.repo_name ?? "(no name)").padEnd(34)} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`)); 85 + if (iss._embedded) { 86 + const rnk = (await pool.query(` 87 + select 1 + count(*)::int rnk from tangled_readmes 88 + where embedding is not null and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`, 89 + [qvec, iss._repoDid])).rows[0].rnk; 90 + console.log(` → own repo overall rank: #${rnk} of all embedded READMEs`); 91 + } 92 + } 93 + await pool.end(); 94 + } 95 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
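Note on `resolveRepoDid` in `issue_experiment.mjs`: an issue's `repo` field is treated either as a bare DID (used as the repo DID directly) or as an `at://owner/collection/rkey` URI that needs a lookup in `tangled_repos`. A minimal sketch of just the parsing step, without the database call, is below. `parseRepoRef` and its return shape are hypothetical, and the sample values are placeholders.

```javascript
// Hypothetical parser for the two reference shapes the script handles.
function parseRepoRef(ref) {
  if (!ref) return null;
  // Anything that is not an at:// URI is treated as a bare repo DID.
  if (!ref.startsWith("at://")) return { kind: "did", repoDid: ref };
  const m = ref.match(/^at:\/\/([^/]+)\/([^/]+)\/(.+)$/);
  if (!m) return null; // malformed at:// URI (missing collection or rkey)
  return { kind: "at-uri", ownerDid: m[1], collection: m[2], rkey: m[3] };
}

console.log(parseRepoRef("did:plc:example"));
console.log(parseRepoRef("at://did:plc:example/sh.tangled.repo/3kexamplerkey"));
console.log(parseRepoRef("at://did:plc:example")); // null
```

In the script, the `ownerDid` and `rkey` parts are what get passed to the `tangled_repos` query.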
+84
recommendationold/src/issue_search.mjs
··· 1 + // Experiment: embed a Tangled issue as a query and vector-search the README embeddings. 2 + // Validates the matching: (a) does the issue's OWN repo rank highly? (b) are other hits topical? 3 + import pg from "pg"; 4 + import { readFileSync } from "node:fs"; 5 + 6 + function fromEnv(key) { 7 + if (process.env[key]) return process.env[key]; 8 + for (const p of ["../.env", ".env"]) { 9 + try { const m = readFileSync(p, "utf8").match(new RegExp(`^\\s*${key}\\s*=\\s*(.+)$`, "m")); if (m) return m[1].trim().replace(/^["']|["']$/g, ""); } catch {} 10 + } 11 + } 12 + const CONN = fromEnv("DB_CONNECTION_STRING"); 13 + const API_KEY = fromEnv("GEMINI_API_KEY"); 14 + const MODEL = "gemini-embedding-001"; 15 + const N = parseInt(process.env.ISSUES ?? "3", 10); 16 + 17 + const pool = new pg.Pool({ connectionString: CONN, ssl: { rejectUnauthorized: false }, max: 3 }); 18 + 19 + async function embedQuery(text) { 20 + const resp = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:embedContent`, { 21 + method: "POST", 22 + headers: { "content-type": "application/json", "x-goog-api-key": API_KEY }, 23 + body: JSON.stringify({ 24 + model: `models/${MODEL}`, 25 + content: { parts: [{ text: text.slice(0, 8000) }] }, 26 + taskType: "RETRIEVAL_QUERY", 27 + outputDimensionality: 1536, 28 + }), 29 + }); 30 + if (!resp.ok) throw new Error(`embed HTTP ${resp.status}: ${(await resp.text()).slice(0, 200)}`); 31 + const v = (await resp.json()).embedding.values; 32 + let s = 0; for (const x of v) s += x * x; const n = Math.sqrt(s) || 1; 33 + return `[${v.map((x) => x / n).join(",")}]`; 34 + } 35 + 36 + async function main() { 37 + const total = (await pool.query(`select count(*)::int n from tangled_issues`)).rows[0].n; 38 + console.log(`tangled_issues total: ${total}`); 39 + const joinable = (await pool.query(` 40 + select count(*)::int n from tangled_issues i 41 + where exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is 
not null)`)).rows[0].n; 42 + console.log(`issues whose parent repo has an embedded README: ${joinable}\n`); 43 + if (joinable === 0) { console.log("No joinable issues — cannot run the own-repo sanity check."); await pool.end(); return; } 44 + 45 + // Pick a few substantive issues (decent body) whose repo is embedded. 46 + const issues = (await pool.query(` 47 + select i.uri, i.repo_did, i.title, i.body, 48 + (select repo_name from tangled_readmes r where r.repo_did = i.repo_did limit 1) as parent_repo 49 + from tangled_issues i 50 + where i.title is not null and length(coalesce(i.body,'')) > 80 51 + and exists (select 1 from tangled_readmes r where r.repo_did = i.repo_did and r.embedding is not null) 52 + order by length(i.body) desc 53 + limit ${N}`)).rows; 54 + 55 + for (const iss of issues) { 56 + const queryText = `${iss.title}\n\n${iss.body}`; 57 + console.log("\n" + "=".repeat(70)); 58 + console.log(`ISSUE: ${iss.title}`); 59 + console.log(`parent repo: ${iss.parent_repo} (${iss.repo_did})`); 60 + console.log(`body: ${iss.body.replace(/\s+/g, " ").slice(0, 180)}…`); 61 + const qvec = await embedQuery(queryText); 62 + const hits = (await pool.query(` 63 + select repo_name, repo_did, round((embedding <=> $1::vector)::numeric, 4) as dist, 64 + (repo_did = $2) as is_parent 65 + from tangled_readmes 66 + where embedding is not null 67 + order by embedding <=> $1::vector 68 + limit 8`, [qvec, iss.repo_did])).rows; 69 + console.log("top README matches:"); 70 + hits.forEach((h, idx) => { 71 + console.log(` ${idx + 1}. ${h.is_parent ? "👉 " : " "}${h.repo_name?.padEnd(32) ?? "(no name)"} dist=${h.dist}${h.is_parent ? " <-- OWN REPO" : ""}`); 72 + }); 73 + // Where does the own repo rank overall? 
74 + const rank = (await pool.query(` 75 + select 1 + count(*)::int as rnk 76 + from tangled_readmes 77 + where embedding is not null 78 + and (embedding <=> $1::vector) < (select embedding <=> $1::vector from tangled_readmes where repo_did=$2 limit 1)`, 79 + [qvec, iss.repo_did])).rows[0].rnk; 80 + console.log(` → own repo overall rank: #${rank} of all embedded READMEs`); 81 + } 82 + await pool.end(); 83 + } 84 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
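Note on the "own repo overall rank" query: it computes the rank as one plus the number of embedded READMEs whose distance to the query is strictly smaller than the own repo's distance. The same rule can be sketched in memory with toy distances (assumed values). `rankOf` is a hypothetical name.

```javascript
// Rank = 1 + number of candidates strictly closer than the target.
function rankOf(targetDist, allDists) {
  return 1 + allDists.filter((d) => d < targetDist).length;
}

// Two READMEs are closer than 0.31; the tie at 0.31 does not push the rank down.
console.log(rankOf(0.31, [0.12, 0.25, 0.31, 0.31, 0.47])); // 3
```

Because the comparison is strict, READMEs tied with the own repo share its rank, so a fork with an identical README does not make the result look worse.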
+30
recommendationold/src/probe_live.ts
··· 1 + import { pool, withClient } from "./db.js"; 2 + 3 + async function main() { 4 + const rows = await withClient((c) => c.query(` 5 + select knot_hostname, record_raw->>'repoDid' as repodid, record_raw->>'name' as name 6 + from tangled_repos 7 + where knot_hostname='knot1.tangled.sh' and coalesce(record_raw->>'repoDid','')<>'' 8 + limit 6`).then(r => r.rows)); 9 + for (const r of rows) { 10 + const url = `https://${r.knot_hostname}/xrpc/sh.tangled.git.temp.getTree?repo=${encodeURIComponent(r.repodid)}&ref=HEAD&path=`; 11 + try { 12 + const ctrl = new AbortController(); 13 + const t = setTimeout(() => ctrl.abort(), 12000); 14 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 15 + clearTimeout(t); 16 + const txt = await resp.text(); 17 + let j: any; try { j = JSON.parse(txt); } catch { j = null; } 18 + const fileNames = (j?.files || []).map((f: any) => f.name); 19 + const readmeInTree = fileNames.some((n: string) => /^readme/i.test(n)); 20 + console.log(`\n[${resp.status}] ${r.name ?? "(no name)"} ${r.repodid}`); 21 + console.log(` readme field: ${j?.readme ? JSON.stringify(Object.keys(j.readme)) : "none"} | readmeInTree=${readmeInTree}`); 22 + console.log(` files: ${JSON.stringify(fileNames).slice(0, 200)}`); 23 + if (!j) console.log(` raw: ${txt.slice(0, 200)}`); 24 + } catch (e: any) { 25 + console.log(`\n[ERR] ${r.repodid}: ${e.message}`); 26 + } 27 + } 28 + await pool.end(); 29 + } 30 + main().catch((e) => { console.error(e); process.exit(1); });
+104
recommendationold/src/readme_coverage.mjs
··· 1 + import pg from "pg"; 2 + import { readFileSync } from "node:fs"; 3 + 4 + // Read DB_CONNECTION_STRING from repo-root .env (ignore the gcloud helper line). 5 + function loadConn() { 6 + if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING; 7 + for (const p of ["../.env", ".env", "../../.env"]) { 8 + try { 9 + const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)\s*$/m); 10 + if (m) return m[1].trim(); 11 + } catch {} 12 + } 13 + throw new Error("DB_CONNECTION_STRING not found"); 14 + } 15 + 16 + const SAMPLE = process.env.SAMPLE ? parseInt(process.env.SAMPLE, 10) : 0; // 0 = all 17 + const CONCURRENCY = parseInt(process.env.CONCURRENCY ?? "30", 10); 18 + const TIMEOUT_MS = parseInt(process.env.TIMEOUT_MS ?? "9000", 10); 19 + 20 + const pool = new pg.Pool({ 21 + connectionString: loadConn(), 22 + ssl: { rejectUnauthorized: false }, 23 + connectionTimeoutMillis: 10_000, 24 + max: 4, 25 + }); 26 + 27 + const sql = ` 28 + select knot_hostname, 29 + coalesce(record_raw->>'repoDid', repo_did) as repodid, 30 + record_raw->>'name' as name 31 + from tangled_repos 32 + where knot_hostname is not null 33 + and coalesce(record_raw->>'repoDid', repo_did) is not null 34 + ${SAMPLE ? "order by random() limit " + SAMPLE : ""}`; 35 + 36 + const { rows } = await pool.query(sql); 37 + await pool.end(); 38 + 39 + const totalRepos = rows.length; 40 + console.log(`Checking README presence for ${totalRepos} repos (repoDid-addressable) ...`); 41 + console.log(`concurrency=${CONCURRENCY} timeout=${TIMEOUT_MS}ms sample=${SAMPLE || "ALL"}\n`); 42 + 43 + async function checkRepo(r) { 44 + // sh.tangled.repo.tree defaults to the repo's default branch when ref is omitted, 45 + // and returns a top-level `readme` (with `contents`) when the knot finds a README 46 + // under any extension (.md/.org/.rst/...). One request per repo. 
47 + const url = `https://${r.knot_hostname}/xrpc/sh.tangled.repo.tree?repo=${encodeURIComponent(r.repodid)}&path=`; 48 + const ctrl = new AbortController(); 49 + const t = setTimeout(() => ctrl.abort(), TIMEOUT_MS); 50 + try { 51 + const resp = await fetch(url, { signal: ctrl.signal, headers: { accept: "application/json" } }); 52 + const txt = await resp.text(); 53 + if (!resp.ok) return { status: "http_" + resp.status }; 54 + let j; try { j = JSON.parse(txt); } catch { return { status: "bad_json" }; } 55 + const files = Array.isArray(j?.files) ? j.files : []; 56 + const readmeObj = !!(j?.readme && typeof j.readme === "object" && 57 + typeof j.readme.contents === "string" && j.readme.contents.trim().length > 0); 58 + const readmeFile = files.some((f) => /^readme(\.|$)/i.test(f?.name ?? "")); 59 + const empty = files.length === 0 && !readmeObj; 60 + return { status: "ok", reachable: true, hasReadme: readmeObj || readmeFile, empty }; 61 + } catch (e) { 62 + return { status: e.name === "AbortError" ? "timeout" : "neterr" }; 63 + } finally { 64 + clearTimeout(t); 65 + } 66 + } 67 + 68 + let done = 0; 69 + const stats = { reachable: 0, hasReadme: 0, empty: 0 }; 70 + const statusCounts = {}; 71 + const byKnot = {}; // knot -> {reachable, hasReadme} 72 + 73 + async function worker(queue) { 74 + while (queue.length) { 75 + const r = queue.pop(); 76 + const res = await checkRepo(r); 77 + statusCounts[res.status] = (statusCounts[res.status] ?? 
0) + 1; 78 + const k = (byKnot[r.knot_hostname] ??= { total: 0, reachable: 0, hasReadme: 0 }); 79 + k.total++; 80 + if (res.status === "ok") { 81 + stats.reachable++; k.reachable++; 82 + if (res.hasReadme) { stats.hasReadme++; k.hasReadme++; } 83 + if (res.empty) stats.empty++; 84 + } 85 + if (++done % 100 === 0) process.stderr.write(` ...${done}/${totalRepos}\n`); 86 + } 87 + } 88 + 89 + const queue = rows.slice(); 90 + await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(queue))); 91 + 92 + const pct = (n, d) => (d === 0 ? "n/a" : ((100 * n) / d).toFixed(1) + "%"); 93 + 94 + console.log("\n================ README COVERAGE ================"); 95 + console.log(`repoDid-addressable repos checked : ${totalRepos}`); 96 + console.log(`reachable (knot responded w/ tree): ${stats.reachable} (${pct(stats.reachable, totalRepos)} of checked)`); 97 + console.log(` ├─ have a README : ${stats.hasReadme} (${pct(stats.hasReadme, stats.reachable)} of reachable)`); 98 + console.log(` └─ empty repo (no files) : ${stats.empty}`); 99 + console.log(`README % of ALL checked repos : ${pct(stats.hasReadme, totalRepos)}`); 100 + console.log("\nstatus breakdown:", JSON.stringify(statusCounts)); 101 + console.log("\nper-knot (knots with >=10 repos):"); 102 + for (const [knot, k] of Object.entries(byKnot).sort((a, b) => b[1].total - a[1].total)) { 103 + if (k.total >= 10) console.log(` ${knot.padEnd(26)} total=${String(k.total).padStart(4)} reachable=${String(k.reachable).padStart(4)} readme=${String(k.hasReadme).padStart(4)} (${pct(k.hasReadme, k.reachable)} of reachable)`); 104 + }
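Note on README detection: `readme_coverage.mjs` matches file names with `/^readme(\.|$)/i`, while `probe_live.ts` uses the looser `/^readme/i`. The short sketch below runs both patterns over a few assumed file names to show where they disagree.

```javascript
const strictReadme = /^readme(\.|$)/i; // pattern from readme_coverage.mjs
const looseReadme = /^readme/i;        // pattern from probe_live.ts

// Assumed sample names, chosen to show the difference between the two patterns.
const names = ["README.md", "readme", "Readme.org", "README-old.md", "readme_notes.txt"];
for (const name of names) {
  console.log(name.padEnd(18), "strict:", strictReadme.test(name), "loose:", looseReadme.test(name));
}
```

The strict pattern accepts `README`, `README.md`, and `Readme.org` but rejects names such as `README-old.md`, which the loose pattern would count as a README.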
+47
recommendationold/src/search_issues_by_readme.mjs
··· 1 + // Search ISSUES from a repo's README embedding (README -> issue cosine, in-DB pgvector). 2 + // This is the "recommend issues to work on" path: given repos a user knows, surface relevant issues. 3 + import pg from "pg"; 4 + import { readFileSync } from "node:fs"; 5 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 6 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 }); 7 + const K = parseInt(process.env.K ?? "8", 10); 8 + // which issue table to search 9 + const ISSUE_TBL = process.env.ISSUE_TBL || "tangled_issues"; 10 + 11 + async function coverage() { 12 + for (const t of ["tangled_issues", "tangled_open_issues"]) { 13 + try { 14 + const r = (await pool.query(`select count(*)::int total, count(*) filter (where embedding is not null)::int emb, count(distinct embedding_model) models, max(embedding_model) model, max(vector_dims(embedding)) dims from ${t}`)).rows[0]; 15 + console.log(`${t}: total=${r.total} embedded=${r.emb} model=${r.model} dims=${r.dims}`); 16 + } catch (e) { console.log(`${t}: ${e.message}`); } 17 + } 18 + } 19 + 20 + async function main() { 21 + await coverage(); 22 + const seeds = process.env.SEED ? 
[process.env.SEED] : ["tangled-cli", "atproto-oauth", "nixpkgs", "knot-docker"]; 23 + for (const s of seeds) { 24 + const seed = (await pool.query( 25 + `select repo_name, repo_did, embedding::text et from tangled_readmes 26 + where embedding is not null and repo_name ilike $1 order by length(content) desc limit 1`, [s])).rows[0]; 27 + console.log("\n" + "=".repeat(74)); 28 + if (!seed) { console.log(`SEED "${s}" not found`); continue; } 29 + console.log(`SEED REPO README: ${seed.repo_name}`); 30 + const hits = (await pool.query(` 31 + select i.title, i.repo_did, left(regexp_replace(coalesce(i.body,''), '\\s+', ' ', 'g'), 120) as body, 32 + rd.repo_name as issue_repo, 33 + round((i.embedding <=> $1::vector)::numeric, 4) as dist 34 + from ${ISSUE_TBL} i 35 + left join tangled_readmes rd on rd.repo_did = i.repo_did 36 + where i.embedding is not null 37 + order by i.embedding <=> $1::vector 38 + limit ${K}`, [seed.et])).rows; 39 + console.log(`top ${hits.length} matching issues:`); 40 + hits.forEach((h, idx) => { 41 + console.log(` ${idx + 1}. [${h.dist}] "${(h.title ?? "(no title)").slice(0, 60)}" (repo: ${h.issue_repo ?? h.repo_did?.slice(0, 16)})`); 42 + if (h.body?.trim()) console.log(` ${h.body}…`); 43 + }); 44 + } 45 + await pool.end(); 46 + } 47 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
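Note on `ISSUE_TBL`: the table name comes from the environment and is interpolated directly into the SQL text, since identifiers cannot be passed as bind parameters. For a local experiment script that may be acceptable; if the value could ever come from a less trusted source, one option is an allow-list. The sketch below is an assumption about how that could look, not something the script does. `pickIssueTable` is a hypothetical helper, and the two table names are the ones `coverage()` already probes.

```javascript
// Allow-list of tables the script is known to query (taken from coverage()).
const ALLOWED_ISSUE_TABLES = new Set(["tangled_issues", "tangled_open_issues"]);

// Hypothetical guard: return a safe table name or throw before any SQL is built.
function pickIssueTable(name) {
  const tbl = name || "tangled_issues"; // same default as the script
  if (!ALLOWED_ISSUE_TABLES.has(tbl)) throw new Error(`unsupported ISSUE_TBL: ${tbl}`);
  return tbl;
}

console.log(pickIssueTable(undefined));             // tangled_issues
console.log(pickIssueTable("tangled_open_issues")); // tangled_open_issues
```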
+57
recommendationold/src/similar_repos.mjs
··· 1 + // README -> README similarity search (pure in-DB pgvector cosine; no embedding API call). 2 + // Given a seed repo (a repo the user contributed to), find the most similar repos by README. 3 + // Dedups exact-duplicate READMEs (forks) and near-identical hits. 4 + import pg from "pg"; 5 + import { readFileSync } from "node:fs"; 6 + import { createHash } from "node:crypto"; 7 + function fromEnv(k){ if(process.env[k])return process.env[k]; for(const p of["../.env",".env"]){try{const m=readFileSync(p,"utf8").match(new RegExp(`^\\s*${k}\\s*=\\s*(.+)$`,"m"));if(m)return m[1].trim();}catch{}} } 8 + const pool = new pg.Pool({ connectionString: fromEnv("DB_CONNECTION_STRING"), ssl: { rejectUnauthorized: false }, max: 3 }); 9 + const K = parseInt(process.env.K ?? "8", 10); 10 + const hash = (s) => createHash("md5").update((s ?? "").slice(0, 500)).digest("hex"); 11 + 12 + // Seeds: env SEED (repo_name ilike or repo_did), else a diverse default set. 13 + const seeds = process.env.SEED ? [process.env.SEED] : ["tangled-cli", "atproto-oauth", "nixpkgs", "holbert-ng"]; 14 + 15 + async function findSeed(s) { 16 + const byDid = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_did=$1 and embedding is not null limit 1`, [s]); 17 + if (byDid.rows[0]) return byDid.rows[0]; 18 + const byName = await pool.query(`select repo_did, repo_name, owner_handle, content, embedding from tangled_readmes where repo_name ilike $1 and embedding is not null order by length(content) desc limit 1`, [s]); 19 + return byName.rows[0] ?? null; 20 + } 21 + 22 + async function main() { 23 + for (const s of seeds) { 24 + const seed = await findSeed(s); 25 + console.log("\n" + "=".repeat(74)); 26 + if (!seed) { console.log(`SEED "${s}" — no embedded README found`); continue; } 27 + console.log(`SEED REPO: ${seed.repo_name} (owner @${seed.owner_handle ?? "?"})`); 28 + console.log(` readme: ${(seed.content ?? 
"").replace(/\s+/g, " ").slice(0, 160)}…`); 29 + 30 + // Pull a wide candidate set, then dedup in JS. 31 + const cand = (await pool.query(` 32 + select repo_name, owner_handle, repo_did, content, 33 + round((embedding <=> $1::vector)::numeric, 4) as dist 34 + from tangled_readmes 35 + where embedding is not null and repo_did <> $2 36 + order by embedding <=> $1::vector 37 + limit 60`, [seed.embedding, seed.repo_did])).rows; 38 + 39 + const seenContent = new Set([hash(seed.content)]); // also drop forks identical to the seed 40 + const out = []; 41 + let dupSkipped = 0; 42 + for (const c of cand) { 43 + const h = hash(c.content); 44 + if (seenContent.has(h)) { dupSkipped++; continue; } 45 + seenContent.add(h); 46 + out.push(c); 47 + if (out.length >= K) break; 48 + } 49 + console.log(`top ${out.length} similar repos (deduped, ${dupSkipped} fork/dup hits collapsed):`); 50 + out.forEach((h, i) => { 51 + console.log(` ${String(i + 1).padStart(2)}. ${(h.repo_name ?? "(no name)").padEnd(30)} @${(h.owner_handle ?? "?").padEnd(20)} cos_dist=${h.dist}`); 52 + console.log(` ${(h.content ?? "").replace(/\s+/g, " ").slice(0, 110)}…`); 53 + }); 54 + } 55 + await pool.end(); 56 + } 57 + main().catch((e) => { console.error("FATAL:", e); process.exit(1); });
+29
recommendationold/src/verify_embeddings.mjs
import pg from "pg";
import { readFileSync } from "node:fs";

function conn() {
  if (process.env.DB_CONNECTION_STRING) return process.env.DB_CONNECTION_STRING;
  for (const p of ["../.env", ".env"]) {
    try {
      const m = readFileSync(p, "utf8").match(/^\s*DB_CONNECTION_STRING\s*=\s*(.+)$/m);
      if (m) return m[1].trim();
    } catch {}
  }
}

const pool = new pg.Pool({ connectionString: conn(), ssl: { rejectUnauthorized: false }, max: 3 });

console.log("=== embedded rows: dims + L2 norm ===");
console.table((await pool.query(`
  select repo_name, embedding_model,
         vector_dims(embedding) as dims,
         round(sqrt((select sum(x*x) from unnest(embedding::real[]) x))::numeric, 5) as l2_norm
  from tangled_readmes where embedding is not null
  order by embedded_at desc limit 5`)).rows);

console.log("\n=== nearest-neighbor sanity (cosine) for one embedded repo ===");
const seed = (await pool.query(`select repo_did, repo_name from tangled_readmes where embedding is not null limit 1`)).rows[0];
if (seed) {
  console.log(`seed: ${seed.repo_name} (${seed.repo_did})`);
  const nn = await pool.query(`
    select repo_name, round((embedding <=> (select embedding from tangled_readmes where repo_did=$1))::numeric, 4) as cosine_dist
    from tangled_readmes
    where embedding is not null and repo_did <> $1
    order by embedding <=> (select embedding from tangled_readmes where repo_did=$1)
    limit 5`, [seed.repo_did]);
  console.table(nn.rows);
}
await pool.end();
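The two quantities this script reports can be reproduced outside the database for a quick cross-check. The sketch below is an illustration in plain Python, not part of the repo: `l2_norm` mirrors the `sqrt(sum(x*x))` expression, and `cosine_distance` mirrors what pgvector's `<=>` operator returns (one minus cosine similarity). Both function names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean length of a vector; close to 1.0 for unit-normalized embeddings."""
    return math.sqrt(sum(x * x for x in v))


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes.

    Assumes both vectors are non-zero; a zero vector would divide by zero here.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (l2_norm(a) * l2_norm(b))


if __name__ == "__main__":
    print(l2_norm([3.0, 4.0]))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

If the stored embeddings are unit-normalized, `l2_norm` should come out near 1 for every row, which is the property the first query is checking.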
+120
scraper/README.md
# Tangled scraper (stages 0–1)

Loads Tangled **lexicons** (schemas) and probes **knot servers** (git infrastructure) into Postgres.

## What this does / does NOT do

| Stage | Gets | Does NOT get |
|-------|------|--------------|
| **0** | All `sh.tangled.*` lexicon JSON from tangled.org/core | Live API data |
| **1** | Knot hostname, version, owner DID, capabilities | Git repos, commits, or source code |

**Actual code** (files, commits, branches) is **Stage 6** — git XRPC on each knot (`sh.tangled.repo.log`, `.tree`, `.blob`, etc.).

## Setup

From the **repo root**:

```bash
# 1. DB connection (you already have this in .env)
# DB_CONNECTION_STRING=postgresql://...

# 2. Python venv + deps
python3 -m venv scraper/.venv
source scraper/.venv/bin/activate
pip install -r scraper/requirements.txt

# 3. git is required on first run (stage 0 clones tangled.org/core lexicons)
git --version
```

## Run

```bash
source scraper/.venv/bin/activate

# Create tables
python scraper/scrape.py init

# Stage 0 — lexicons (~89 JSON files, prints each NSID)
python scraper/scrape.py stage0

# Stage 1 — probe knots
python scraper/scrape.py stage1

# Or both in one go
python scraper/scrape.py stage0-1

# Check counts
python scraper/scrape.py status
```

Progress is printed as timestamped lines, e.g.:

```
[12:34:56] [stage 0] (12/89) sh.tangled.repo (record)
[12:35:01] [stage 1] OK knot1.tangled.sh version=1.14.0-alpha owner=did:plc:...
```

## Knot configuration (optional)

```bash
# Explicit seed list (comma-separated hostnames)
export TANGLED_KNOT_SEEDS=knot1.tangled.sh,my.knot.example

# Auto-probe knot2..knot5 in addition to defaults
export TANGLED_KNOT_PROBE_MAX=5

# Extra hostnames
export TANGLED_KNOT_EXTRA=custom.knot.example
```

## Stage 2 — Discover repos via Tangled PDS

`sh.tangled.sync.listRepos` on knots returns **404** (not deployed yet).
Stage 2 uses **`https://tngl.sh`** instead:

| Phase | What | API |
|-------|------|-----|
| 1 | List all accounts | `com.atproto.sync.listRepos` |
| 2 | Repo records per account | `com.atproto.repo.listRecords` (`sh.tangled.repo`) |
| 3 | Enrich from knot | `sh.tangled.repo.describeRepo` |

**~7,928 accounts** on tngl.sh (as of testing). Full repo scan takes a while.

```bash
# Step 1 only — count/list accounts (fast, ~10s)
python scraper/scrape.py stage2-accounts

# Step 2 only — scan repo records (requires accounts in DB)
python scraper/scrape.py stage2-repos

# All phases in one run
python scraper/scrape.py stage2

python scraper/scrape.py status
```

### Optional env vars

```bash
# Test with first N accounts only
export TANGLED_STAGE2_ACCOUNT_LIMIT=50

# Resolve handles via plc.directory (slower)
export TANGLED_RESOLVE_HANDLES=1

# Skip knot describeRepo enrichment
export TANGLED_STAGE2_ENRICH_KNOTS=0

# Override PDS (default https://tngl.sh)
export TANGLED_PDS_URL=https://tngl.sh
```

## SQL tables created

- `tangled_lexicons` — NSID → full lexicon JSON
- `tangled_knots` — probed knot servers
- `tangled_pds_accounts` — every account on tngl.sh PDS
- `tangled_repos` — `sh.tangled.repo` records + optional knot metadata
- `tangled_crawl_state` — run metadata per stage
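Phase 1 of Stage 2 lists accounts through `com.atproto.sync.listRepos`, which returns results in pages linked by a `cursor`. The sketch below shows one way such a cursor loop might look. It is an illustration only, not the scraper's actual code: `iter_account_dids` and `fetch_page` are hypothetical names, and the page fetcher is injected so the loop can be exercised without network access. The DIDs in the demo are placeholders.

```python
from typing import Callable, Iterator, Optional


def iter_account_dids(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield every account DID, following the cursor until the server stops returning one.

    `fetch_page(cursor)` is assumed to return the decoded JSON body of one
    listRepos response, e.g. from GET {pds}/xrpc/com.atproto.sync.listRepos
    with the cursor passed as a query parameter.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        for repo in page.get("repos") or []:
            did = repo.get("did")
            if isinstance(did, str):
                yield did
        cursor = page.get("cursor")
        if not cursor:
            break


# Offline demo with two fake pages (placeholder DIDs).
PAGES = {
    None: {"cursor": "c1", "repos": [{"did": "did:plc:aaa"}, {"did": "did:plc:bbb"}]},
    "c1": {"repos": [{"did": "did:plc:ccc"}]},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return PAGES[cursor]


dids = list(iter_account_dids(fake_fetch))
print(dids)
```

Injecting the fetcher keeps the pagination logic separate from HTTP details such as timeouts and retries, which a real client would need to add.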
+66
scraper/appview_client.py
from __future__ import annotations

import re
from typing import Any
from urllib.parse import urlencode

import httpx

APPVIEW_BASE = "https://tangled.org"
SEARCH_PATH = "/search"

# href="/owner/repo" — exclude site chrome and static assets
REPO_HREF = re.compile(r'href="/([a-zA-Z0-9._-]+)/([a-zA-Z0-9._-]+)"')
TOTAL_RE = re.compile(r"Returned\s+(\d+)\s+of\s+(\d+)", re.I)

SKIP_OWNERS = frozenset(
    {
        "static",
        "search",
        "login",
        "signup",
        "explore",
        "settings",
        "blog",
        "docs",
        "brand",
        "chat",
        "pwa-manifest.json",
    }
)


def parse_search_total(html: str) -> int | None:
    match = TOTAL_RE.search(html)
    if not match:
        return None
    return int(match.group(2))


def parse_repo_links(html: str) -> list[tuple[str, str]]:
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for owner, repo in REPO_HREF.findall(html):
        if owner in SKIP_OWNERS or owner.endswith(".json"):
            continue
        key = (owner, repo)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


def fetch_search_page(
    client: httpx.Client,
    *,
    offset: int = 0,
    limit: int = 100,
    sort: str = "newest",
    query: str = "",
) -> tuple[str, list[tuple[str, str]], int | None]:
    params = {"q": query, "sort": sort, "offset": offset, "limit": limit}
    url = f"{APPVIEW_BASE}{SEARCH_PATH}?{urlencode(params)}"
    resp = client.get(url)
    resp.raise_for_status()
    html = resp.text
    return html, parse_repo_links(html), parse_search_total(html)
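To see which links the two regexes above accept, it helps to run them on a small hand-written page. The sketch below copies `REPO_HREF` and `TOTAL_RE` from the module and uses a shortened `SKIP_OWNERS`; the sample HTML and the handles in it are invented for illustration and are not real search output.

```python
import re

REPO_HREF = re.compile(r'href="/([a-zA-Z0-9._-]+)/([a-zA-Z0-9._-]+)"')
TOTAL_RE = re.compile(r"Returned\s+(\d+)\s+of\s+(\d+)", re.I)
SKIP_OWNERS = frozenset({"static", "search", "login"})  # shortened for the example


def parse_repo_links(html):
    """Return unique (owner, repo) pairs in first-seen order, skipping site chrome."""
    seen, out = set(), []
    for owner, repo in REPO_HREF.findall(html):
        if owner in SKIP_OWNERS or owner.endswith(".json"):
            continue
        if (owner, repo) not in seen:
            seen.add((owner, repo))
            out.append((owner, repo))
    return out


# Invented sample page: one static asset, one duplicate, one three-segment path.
SAMPLE = """
<p>Returned 2 of 41 results</p>
<a href="/static/app.css">css</a>
<a href="/alice.example/tool">tool</a>
<a href="/alice.example/tool">tool again</a>
<a href="/bob/lib">lib</a>
<a href="/bob/lib/issues">issues</a>
"""

links = parse_repo_links(SAMPLE)
total = int(TOTAL_RE.search(SAMPLE).group(2))
print(links, total)
```

Only two-segment paths match, because the pattern requires the closing quote immediately after the second segment; `/bob/lib/issues` is therefore ignored rather than misread as a repo.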
+427
scraper/backfill_repos_from_issues.py
#!/usr/bin/env python3
"""Backfill tangled_repos for issues that reference repos not yet ingested.

Issues are scraped from issue authors' PDSes; repos come from separate crawls
(stage2-network, stage2 PDS, manual seed). This script closes the gap by
fetching sh.tangled.repo from each missing repo owner's PDS using repo_uri on
the issue record.

Usage:
    python scraper/scrape.py backfill-repos-from-issues
    TANGLED_BACKFILL_REPO_LIMIT=50 python scraper/scrape.py backfill-repos-from-issues

After a successful run, fetch READMEs and embeddings for the new repos:
    python scraper/scrape.py check-readmes
    python scraper/scrape.py embed-readmes
"""

from __future__ import annotations

import json
import os
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Any

import httpx

from db import connect, set_crawl_state, upsert_atproto_record
from parallel import concurrency_env
from pds_client import DEFAULT_PDS, handle_from_plc, pds_host_for_did
from progress import banner, log, phase, step, summary_block
from stage2_network import COLLECTION, fetch_repo_record, upsert_identity

CRAWL_KEY = "repos:issue_backfill"
DISCOVERED_VIA = "issue_backfill"


def _repo_limit() -> int | None:
    raw = os.getenv("TANGLED_BACKFILL_REPO_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _missing_repos_sql(*, limit: int | None) -> str:
    query = """
        with missing as (
            select i.repo_did
            from tangled_issues i
            left join tangled_repos r on r.repo_did = i.repo_did
            where i.repo_did is not null
              and r.repo_did is null
            group by i.repo_did
        ),
        best_uri as (
            select distinct on (i.repo_did)
                i.repo_did,
                i.repo_uri,
                count(*) over (partition by i.repo_did) as issue_count
            from tangled_issues i
            inner join missing m on m.repo_did = i.repo_did
            where i.repo_uri is not null
              and i.repo_uri like 'at://did:%/sh.tangled.repo/%'
            order by i.repo_did, i.fetched_at desc nulls last
        )
        select
            b.repo_did,
            b.repo_uri,
            b.issue_count,
            split_part(replace(b.repo_uri, 'at://', ''), '/', 1) as owner_did,
            split_part(b.repo_uri, '/', 5) as repo_rkey,
            ti.handle as owner_handle,
            ti.pds_host
        from best_uri b
        left join tangled_identities ti
            on ti.did = split_part(replace(b.repo_uri, 'at://', ''), '/', 1)
        order by b.issue_count desc, b.repo_did
    """
    if limit:
        query += f" limit {limit}"
    return query


def _count_missing_sql() -> str:
    return """
        select
            count(distinct i.repo_did) filter (
                where i.repo_uri is not null
                  and i.repo_uri like 'at://did:%/sh.tangled.repo/%'
            ) as backfillable,
            count(distinct i.repo_did) filter (
                where i.repo_uri is null
                   or i.repo_uri not like 'at://did:%/sh.tangled.repo/%'
            ) as not_backfillable,
            count(distinct i.repo_did) as total_missing
        from tangled_issues i
        left join tangled_repos r on r.repo_did = i.repo_did
        where i.repo_did is not null
          and r.repo_did is null
    """


@dataclass
class MissingRepo:
    repo_did: str
    repo_uri: str
    issue_count: int
    owner_did: str
    repo_rkey: str
    owner_handle: str | None
    pds_host: str | None


@dataclass
class BackfillResult:
    row: MissingRepo
    status: str  # ok | pds_failed | record_failed | error
    owner_handle: str | None = None
    pds_host: str | None = None
    record: dict[str, Any] | None = None
    error: str | None = None


class _PdsCache:
    def __init__(self) -> None:
        self._hosts: dict[str, str | None] = {}
        self._handles: dict[str, str | None] = {}
        self._lock = threading.Lock()

    def resolve_pds(
        self, client: httpx.Client, owner_did: str, hint: str | None
    ) -> str | None:
        if hint:
            return hint.rstrip("/")
        with self._lock:
            if owner_did in self._hosts:
                return self._hosts[owner_did]
        try:
            pds = pds_host_for_did(client, owner_did)
        except httpx.HTTPError:
            pds = None
        host = pds.rstrip("/") if pds else None
        with self._lock:
            self._hosts[owner_did] = host
        return host

    def resolve_handle(
        self, client: httpx.Client, owner_did: str, hint: str | None
    ) -> str | None:
        if hint:
            return hint
        with self._lock:
            if owner_did in self._handles:
                return self._handles[owner_did]
        try:
            handle = handle_from_plc(client, owner_did)
        except httpx.HTTPError:
            handle = None
        with self._lock:
            self._handles[owner_did] = handle
        return handle


def upsert_issue_backfill_repo(
    conn,
    *,
    owner_did: str,
    owner_handle: str | None,
    repo_rkey: str,
    pds_host: str,
    record: dict[str, Any],
) -> None:
    uri = record["uri"]
    value = record["value"]
    rkey = uri.rsplit("/", 1)[-1]
    repo_did = value.get("repoDid") if isinstance(value.get("repoDid"), str) else None
    knot = value.get("knot") if isinstance(value.get("knot"), str) else None
    name = value.get("name") if isinstance(value.get("name"), str) else None
    if not name and not repo_rkey.startswith("3l"):
        name = repo_rkey

    conn.execute(
        """
        insert into tangled_repos (
            uri, owner_did, owner_handle, rkey, repo_did, name, knot_hostname,
            cid, record_raw, discovered_via, last_synced_at
        )
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, %s, now())
        on conflict (uri) do update set
            owner_did = excluded.owner_did,
            owner_handle = coalesce(excluded.owner_handle, tangled_repos.owner_handle),
            repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did),
            name = coalesce(excluded.name, tangled_repos.name),
            knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname),
            cid = excluded.cid,
            record_raw = excluded.record_raw,
            discovered_via = coalesce(tangled_repos.discovered_via, excluded.discovered_via),
            last_synced_at = now()
        """,
        (
            uri,
            owner_did,
            owner_handle,
            rkey,
            repo_did,
            name,
            knot,
            record.get("cid") if isinstance(record.get("cid"), str) else None,
            json.dumps(value),
            DISCOVERED_VIA,
        ),
    )

    upsert_atproto_record(
        conn,
        uri=uri,
        author_did=owner_did,
        collection=COLLECTION,
        rkey=rkey,
        payload=value,
        cid=record.get("cid") if isinstance(record.get("cid"), str) else None,
        repo_did=repo_did,
    )


def _fetch_one(row: MissingRepo, cache: _PdsCache) -> BackfillResult:
    result = BackfillResult(row=row, status="error")
    try:
        with httpx.Client(timeout=60.0, follow_redirects=True) as client:
            pds = cache.resolve_pds(client, row.owner_did, row.pds_host)
            if not pds:
                result.status = "pds_failed"
                return result

            owner_handle = cache.resolve_handle(client, row.owner_did, row.owner_handle)
            result.owner_handle = owner_handle
            result.pds_host = pds

            record = fetch_repo_record(
                client,
                pds_host=pds,
                owner_did=row.owner_did,
                rkey=row.repo_rkey,
                repo_slug=row.repo_rkey,
            )
            if not record:
                result.status = "record_failed"
                return result

            result.record = record
            result.status = "ok"
            return result
    except httpx.HTTPError as exc:
        result.status = "error"
        result.error = str(exc)[:200]
        return result
    except Exception as exc:
        result.status = "error"
        result.error = str(exc)[:200]
        return result


def run_backfill_repos_from_issues(dsn: str) -> dict[str, Any]:
    workers = concurrency_env("TANGLED_BACKFILL_REPO_CONCURRENCY", default=20)
    repo_limit = _repo_limit()

    banner("BACKFILL — Repos referenced by issues but missing from tangled_repos")
    log("backfill", f"Concurrency: {workers}")
    if repo_limit:
        log("backfill", f"Repo limit: {repo_limit}")

    stats: dict[str, Any] = {
        "backfillable": 0,
        "not_backfillable": 0,
        "total_missing": 0,
        "queued": 0,
        "repos_stored": 0,
        "pds_failed": 0,
        "record_failed": 0,
        "errors": 0,
    }

    with connect(dsn) as conn:
        counts = conn.execute(_count_missing_sql()).fetchone()
        if counts:
            stats["backfillable"] = int(counts.get("backfillable") or 0)
            stats["not_backfillable"] = int(counts.get("not_backfillable") or 0)
            stats["total_missing"] = int(counts.get("total_missing") or 0)

        log(
            "backfill",
            f"Missing repos: {stats['total_missing']} "
            f"({stats['backfillable']} with parseable repo_uri, "
            f"{stats['not_backfillable']} without)",
        )

        rows = conn.execute(_missing_repos_sql(limit=repo_limit)).fetchall()
        pending = [
            MissingRepo(
                repo_did=row["repo_did"],
                repo_uri=row["repo_uri"],
                issue_count=int(row["issue_count"] or 0),
                owner_did=row["owner_did"],
                repo_rkey=row["repo_rkey"],
                owner_handle=row.get("owner_handle"),
                pds_host=row.get("pds_host"),
            )
            for row in rows
            if row.get("owner_did") and row.get("repo_rkey")
        ]
        stats["queued"] = len(pending)

        if not pending:
            log("backfill", "Nothing to backfill.")
            set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
            conn.commit()
            return stats

        phase(1, f"Fetch sh.tangled.repo for {len(pending)} missing repos")
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={**stats, "workers": workers},
        )
        conn.commit()

        cache = _PdsCache()
        done = 0
        done_lock = threading.Lock()

        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {
                pool.submit(_fetch_one, row, cache): row for row in pending
            }

            for future in as_completed(futures):
                row = futures[future]
                label = f"{row.owner_did[:20]}…/{row.repo_rkey}"

                try:
                    result = future.result()
                except Exception as exc:
                    result = BackfillResult(
                        row=row,
                        status="error",
                        error=str(exc)[:200],
                    )

                with done_lock:
                    done += 1
                    n = done

                if result.status == "ok" and result.record:
                    upsert_identity(
                        conn,
                        did=row.owner_did,
                        handle=result.owner_handle,
                        pds_host=result.pds_host,
                    )
                    upsert_issue_backfill_repo(
                        conn,
                        owner_did=row.owner_did,
                        owner_handle=result.owner_handle,
                        repo_rkey=row.repo_rkey,
                        pds_host=result.pds_host or DEFAULT_PDS,
                        record=result.record,
                    )
                    stats["repos_stored"] += 1
                    if n <= 10 or n % 25 == 0:
                        step(
                            "backfill",
                            n,
                            len(pending),
                            f"OK {label} issues={row.issue_count}",
                        )
                elif result.status == "pds_failed":
                    stats["pds_failed"] += 1
                    if n <= 10 or n % 50 == 0:
                        step(
                            "backfill",
                            n,
                            len(pending),
                            f"SKIP {label} — could not resolve PDS",
                        )
                elif result.status == "record_failed":
                    stats["record_failed"] += 1
                    if n <= 10 or n % 50 == 0:
                        step(
                            "backfill",
                            n,
                            len(pending),
                            f"FAIL {label} — no sh.tangled.repo on PDS",
                        )
                else:
                    stats["errors"] += 1
                    if n <= 10 or n % 50 == 0:
                        step(
                            "backfill",
                            n,
                            len(pending),
                            f"ERROR {label}: {result.error or 'unknown'}",
                        )

                if n % 25 == 0:
                    conn.commit()

        set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
        conn.commit()

    summary_block(
        "Issue repo backfill complete",
        [
            f"Missing repos (total): {stats['total_missing']}",
            f"Backfillable (repo_uri): {stats['backfillable']}",
            f"Queued this run: {stats['queued']}",
            f"Repos stored/updated: {stats['repos_stored']}",
            f"PDS resolve failed: {stats['pds_failed']}",
            f"Record fetch failed: {stats['record_failed']}",
            f"Errors: {stats['errors']}",
            "",
            "Next: python scraper/scrape.py check-readmes",
            " python scraper/scrape.py embed-readmes",
        ],
    )
    return stats
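The backfill query derives `owner_did` and `repo_rkey` from `repo_uri` with `split_part`. The same decomposition can be sketched in Python for checking individual URIs by hand. This is an illustration under assumptions, not code from the repo: `parse_repo_uri` and `AtUri` are hypothetical names, and the check here is slightly stricter than the SQL `like 'at://did:%/sh.tangled.repo/%'` filter because it requires exactly three path segments.

```python
from typing import NamedTuple, Optional


class AtUri(NamedTuple):
    owner_did: str
    collection: str
    rkey: str


def parse_repo_uri(uri: Optional[str]) -> Optional[AtUri]:
    """Split at://<did>/sh.tangled.repo/<rkey> into its parts, or return None."""
    if not uri or not uri.startswith("at://did:"):
        return None
    parts = uri[len("at://"):].split("/")
    # Stricter than the SQL filter: exactly did / collection / rkey.
    if len(parts) != 3 or parts[1] != "sh.tangled.repo" or not parts[2]:
        return None
    return AtUri(parts[0], parts[1], parts[2])


# Placeholder DID and rkey, for illustration only.
parsed = parse_repo_uri("at://did:plc:abc123/sh.tangled.repo/3kxyz")
print(parsed)
```

Splitting on `/` works for both `did:plc:` and `did:web:` identifiers as long as the DID itself contains no slash.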
+387
scraper/check_readmes.py
#!/usr/bin/env python3
"""Fetch and store README files from knot git for all scraped repos."""

from __future__ import annotations

import os
import sys
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any

import httpx
from dotenv import load_dotenv

from db import connect, init_schema, set_crawl_state
from parallel import concurrency_env
from pds_client import knot_xrpc
from progress import banner, log, metric, phase, step, summary_block

REPO_ROOT = Path(__file__).resolve().parent.parent
CRAWL_KEY = "readmes:check"
README_NAMES = frozenset(
    {"readme.md", "readme", "readme.markdown", "readme.mdown", "readme.mkd"}
)


@dataclass
class ReadmeResult:
    repo_did: str
    repo_uri: str | None
    owner_handle: str | None
    repo_name: str | None
    knot_hostname: str
    status: str
    readme_path: str | None = None
    content: str | None = None
    size_bytes: int | None = None
    error_message: str | None = None


def _repo_limit() -> int | None:
    raw = os.getenv("TANGLED_README_REPO_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _skip_existing() -> bool:
    return os.getenv("TANGLED_README_REFRESH", "").strip().lower() not in (
        "1",
        "true",
        "yes",
    )


def _repos_query(*, skip_existing: bool, repo_limit: int | None) -> str:
    skip_clause = ""
    if skip_existing:
        skip_clause = """
          and not exists (
              select 1 from tangled_readmes t
              where t.repo_did = tangled_repos.repo_did
                and t.status in ('found', 'missing')
          )
        """
    query = f"""
        select repo_did, uri, owner_handle, name, knot_hostname
        from tangled_repos
        where repo_did is not null
          and knot_hostname is not null
          {skip_clause}
        order by uri
    """
    if repo_limit:
        query += f" limit {repo_limit}"
    return query


def _find_readme_in_tree(tree: dict[str, Any]) -> str | None:
    for entry in tree.get("files") or []:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        if isinstance(name, str) and name.lower() in README_NAMES:
            if entry.get("type") == "file" or entry.get("mode") in (
                "100644",
                "100755",
                "blob",
            ):
                return name
            # tree listing uses name only for files
            if entry.get("type") != "dir":
                return name
    return None


def fetch_readme(
    client: httpx.Client,
    *,
    knot_hostname: str,
    repo_did: str,
) -> ReadmeResult:
    base = ReadmeResult(
        repo_did=repo_did,
        repo_uri=None,
        owner_handle=None,
        repo_name=None,
        knot_hostname=knot_hostname,
        status="error",
    )

    status, tree = knot_xrpc(
        client,
        knot_hostname,
        "sh.tangled.repo.tree",
        {"repo": repo_did, "ref": "HEAD"},
    )
    if status != 200 or not isinstance(tree, dict):
        base.status = "error"
        base.error_message = f"tree HTTP {status}"
        return base

    readme_path = _find_readme_in_tree(tree)
    if not readme_path:
        base.status = "missing"
        return base

    status, blob = knot_xrpc(
        client,
        knot_hostname,
        "sh.tangled.repo.blob",
        {"repo": repo_did, "ref": "HEAD", "path": readme_path},
    )
    if status != 200 or not isinstance(blob, dict):
        base.status = "error"
        base.readme_path = readme_path
        base.error_message = f"blob HTTP {status}"
        return base

    content = blob.get("content")
    if not isinstance(content, str):
        base.status = "error"
        base.readme_path = readme_path
        base.error_message = "blob response missing content"
        return base

    base.status = "found"
    base.readme_path = readme_path
    base.content = content
    base.size_bytes = len(content.encode("utf-8"))
    return base


def upsert_readme(conn, row: ReadmeResult) -> None:
    conn.execute(
        """
        insert into tangled_readmes (
            repo_did, repo_uri, owner_handle, repo_name, knot_hostname,
            readme_path, status, content, size_bytes, error_message, fetched_at
        )
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())
        on conflict (repo_did) do update set
            repo_uri = excluded.repo_uri,
            owner_handle = excluded.owner_handle,
            repo_name = excluded.repo_name,
            knot_hostname = excluded.knot_hostname,
            readme_path = excluded.readme_path,
            status = excluded.status,
            content = excluded.content,
            size_bytes = excluded.size_bytes,
            error_message = excluded.error_message,
            fetched_at = now(),
            embedding = case
                when tangled_readmes.content is distinct from excluded.content then null
                else tangled_readmes.embedding
            end,
            embedding_model = case
                when tangled_readmes.content is distinct from excluded.content then null
                else tangled_readmes.embedding_model
            end,
            embedded_at = case
                when tangled_readmes.content is distinct from excluded.content then null
                else tangled_readmes.embedded_at
            end
        """,
        (
            row.repo_did,
            row.repo_uri,
            row.owner_handle,
            row.repo_name,
            row.knot_hostname,
            row.readme_path,
            row.status,
            row.content,
            row.size_bytes,
            row.error_message,
        ),
    )


def run_check_readmes(dsn: str) -> dict[str, int]:
    workers = concurrency_env("TANGLED_README_CONCURRENCY", default=20)
    repo_limit = _repo_limit()

    banner("README CHECK — fetch README from knot git for each repo")
    log("readmes", f"Concurrency: {workers}")
    if repo_limit:
        log("readmes", f"Repo limit: {repo_limit}")
    skip_existing = _skip_existing()
    if skip_existing:
        log(
            "readmes",
            "Skip existing: on — found/missing rows kept (set TANGLED_README_REFRESH=1 to re-fetch)",
        )
    else:
        log("readmes", "Skip existing: off — re-fetching all")

    with connect(dsn) as conn:
        reachable = {
            r["hostname"]
            for r in conn.execute(
                "select hostname from tangled_knots where reachable = true"
            ).fetchall()
        }
        total_eligible = conn.execute(
            """
            select count(*) as n from tangled_repos
            where repo_did is not null and knot_hostname is not null
            """
        ).fetchone()["n"]
        repos = conn.execute(
            _repos_query(skip_existing=skip_existing, repo_limit=repo_limit)
        ).fetchall()

    if not repos:
        if skip_existing:
            log("readmes", "Nothing to fetch — all eligible repos already checked.")
            return {
                "found": 0,
                "missing": 0,
                "error": 0,
                "skipped": 0,
                "already_in_db": total_eligible,
            }
        raise RuntimeError("No repos with repo_did in tangled_repos.")

    already_in_db = total_eligible - len(repos) if skip_existing else 0
    if skip_existing:
        metric("Eligible repos", total_eligible)
        metric("Already in DB (skipped)", already_in_db)
        metric("To fetch", len(repos))
    log("readmes", f"Checking READMEs for {len(repos)} repos …")

    stats = {
        "found": 0,
        "missing": 0,
        "error": 0,
        "skipped": 0,
        "already_in_db": already_in_db,
    }
    stats_lock = threading.Lock()
    done = 0
    done_lock = threading.Lock()

    phase(1, "Parallel tree + blob fetch on knots")

    def work(repo: dict[str, Any]) -> ReadmeResult:
        knot = repo["knot_hostname"]
        repo_did = repo["repo_did"]
        if knot not in reachable:
            return ReadmeResult(
                repo_did=repo_did,
                repo_uri=repo.get("uri"),
                owner_handle=repo.get("owner_handle"),
                repo_name=repo.get("name"),
                knot_hostname=knot or "",
                status="skipped",
                error_message=f"knot not reachable: {knot}",
            )
        with httpx.Client(timeout=60.0, follow_redirects=True) as client:
            result = fetch_readme(client, knot_hostname=knot, repo_did=repo_did)
        result.repo_uri = repo.get("uri")
        result.owner_handle = repo.get("owner_handle")
        result.repo_name = repo.get("name")
        return result

    with connect(dsn) as conn:
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={"repo_count": len(repos), "workers": workers},
        )
        conn.commit()

        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(work, dict(repo)): repo for repo in repos}

            for future in as_completed(futures):
                repo = futures[future]
                label = f"{repo.get('owner_handle') or '?'}/{repo.get('name') or repo['repo_did'][:16]}"

                try:
                    result = future.result()
                except Exception as exc:
                    result = ReadmeResult(
                        repo_did=repo["repo_did"],
                        repo_uri=repo.get("uri"),
                        owner_handle=repo.get("owner_handle"),
                        repo_name=repo.get("name"),
                        knot_hostname=repo.get("knot_hostname") or "",
                        status="error",
                        error_message=str(exc),
                    )

                upsert_readme(conn, result)

                with stats_lock:
                    stats[result.status if result.status in stats else "error"] += 1

                with done_lock:
                    done += 1
                    n = done

                if result.status == "found":
                    if n <= 10 or n % 50 == 0:
                        step(
                            "readmes",
                            n,
                            len(repos),
                            f"OK {label} {result.readme_path} ({result.size_bytes} B)",
                        )
                elif n <= 10 or n % 100 == 0:
                    step(
                        "readmes",
                        n,
                        len(repos),
                        f"{result.status.upper()} {label} {result.error_message or ''}",
                    )

                if n % 50 == 0:
                    conn.commit()

        set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
        conn.commit()

    summary_block(
        "README check complete",
        [
            f"Repos checked: {len(repos)}",
            f"Already in DB: {stats['already_in_db']}",
            f"Found README: {stats['found']}",
            f"Missing README: {stats['missing']}",
            f"Errors: {stats['error']}",
            f"Skipped knot: {stats['skipped']}",
            "",
            "Query: select status, count(*) from tangled_readmes group by 1;",
        ],
    )
    return stats


def main() -> None:
    for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            break
    else:
        load_dotenv()

    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
        raise SystemExit(1)

    init_schema(dsn)
    run_check_readmes(dsn)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.", file=sys.stderr)
        raise SystemExit(130) from None
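Both this script and the issue backfill share one concurrency shape: worker threads only fetch, while the consuming thread drains `as_completed`, writes each result, and commits in batches. The sketch below isolates that shape with the database replaced by plain callables. It is a simplified illustration, not the repo's code: `run_batched`, `fetch`, `store`, and `commit` are hypothetical names. Because only the consuming thread touches the counter, the sketch uses no locks.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batched(items, fetch, store, commit, workers=4, commit_every=50):
    """Fetch items in parallel; store and commit from the calling thread only."""
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                result = ("ok", future.result())
            except Exception as exc:
                # A failed fetch becomes an error row instead of aborting the run.
                result = ("error", str(exc)[:200])
            store(item, result)
            done += 1
            if done % commit_every == 0:
                commit()
    commit()  # flush whatever the last partial batch left uncommitted
    return done


# Offline demo: item 3 fails, everything else succeeds.
def demo_fetch(n):
    if n == 3:
        raise ValueError("boom")
    return n * 10


stored = {}
commits = []
total = run_batched(
    range(5),
    demo_fetch,
    store=lambda item, result: stored.__setitem__(item, result),
    commit=lambda: commits.append(len(stored)),
    commit_every=2,
)
print(total, stored, commits)
```

Keeping writes on one thread means a single database connection can be used without sharing it across workers, at the cost of the writer becoming the bottleneck if storing is slower than fetching.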
+251
scraper/db.py
from __future__ import annotations

import json
from contextlib import contextmanager
from pathlib import Path
from typing import Any, Iterator

import psycopg
from psycopg.rows import dict_row

MIGRATIONS_DIR = (
    Path(__file__).resolve().parent.parent / "supabase" / "migrations"
)


def register_pgvector(conn: psycopg.Connection) -> None:
    try:
        from pgvector.psycopg import register_vector

        register_vector(conn)
    except (ImportError, psycopg.ProgrammingError):
        pass


@contextmanager
def connect(dsn: str) -> Iterator[psycopg.Connection]:
    with psycopg.connect(dsn, row_factory=dict_row) as conn:
        register_pgvector(conn)
        yield conn


def init_schema(dsn: str) -> None:
    paths = sorted(MIGRATIONS_DIR.glob("*.sql"))
    if not paths:
        raise RuntimeError(f"No migrations found in {MIGRATIONS_DIR}")
    with connect(dsn) as conn:
        for path in paths:
            conn.execute(path.read_text())
        conn.commit()


def upsert_lexicon(
    conn: psycopg.Connection,
    *,
    nsid: str,
    lexicon_type: str,
    definition: dict[str, Any],
    source_path: str,
) -> None:
    conn.execute(
        """
        insert into tangled_lexicons (nsid, lexicon_type, definition, source_path, fetched_at)
        values (%s, %s, %s::jsonb, %s, now())
        on conflict (nsid) do update set
            lexicon_type = excluded.lexicon_type,
            definition = excluded.definition,
            source_path = excluded.source_path,
            fetched_at = now()
        """,
        (nsid, lexicon_type, json.dumps(definition), source_path),
    )


def upsert_knot(
    conn: psycopg.Connection,
    *,
    hostname: str,
    reachable: bool,
    owner_did: str | None,
    version: str | None,
    capabilities: list[str] | None,
    version_raw: dict[str, Any] | None,
    owner_raw: dict[str, Any] | None,
    probe_error: str | None,
) -> None:
    conn.execute(
        """
        insert into tangled_knots (
            hostname, reachable, owner_did, version, capabilities,
            version_raw, owner_raw, probe_error, last_probed_at
        )
        values (%s, %s, %s, %s, %s::jsonb, %s::jsonb, %s::jsonb, %s, now())
        on conflict (hostname) do update set
            reachable = excluded.reachable,
            owner_did = excluded.owner_did,
            version = excluded.version,
            capabilities = excluded.capabilities,
            version_raw = excluded.version_raw,
            owner_raw = excluded.owner_raw,
            probe_error = excluded.probe_error,
            last_probed_at = now()
        """,
        (
            hostname,
            reachable,
            owner_did,
            version,
            json.dumps(capabilities) if capabilities is not None else None,
            json.dumps(version_raw) if version_raw is not None else None,
            json.dumps(owner_raw) if owner_raw is not None else None,
            probe_error,
        ),
    )


def set_crawl_state(
    conn: psycopg.Connection,
    *,
    key: str,
    status: str,
    meta: dict[str, Any] | None = None,
    last_error: str | None = None,
) -> None:
    conn.execute(
        """
        insert into tangled_crawl_state (key, status, meta, last_error, updated_at)
        values (%s, %s, %s::jsonb, %s, now())
        on conflict (key) do update set
            status = excluded.status,
            meta = excluded.meta,
            last_error = excluded.last_error,
            updated_at = now()
        """,
        (key, status, json.dumps(meta) if meta else None, last_error),
    )


def count_lexicons(conn: psycopg.Connection) -> int:
    row = conn.execute("select count(*) as n from tangled_lexicons").fetchone()
    return int(row["n"]) if row else 0


def count_knots(conn: psycopg.Connection) -> int:
    row = conn.execute("select count(*) as n from tangled_knots").fetchone()
    return int(row["n"]) if row else 0


def count_pds_accounts(conn: psycopg.Connection) -> int:
    row = conn.execute("select count(*) as n from tangled_pds_accounts").fetchone()
    return int(row["n"]) if row else 0


def count_repos(conn: psycopg.Connection) -> int:
    row = conn.execute("select count(*) as n from tangled_repos").fetchone()
    return int(row["n"]) if row else 0


def count_accounts_with_repos(conn: psycopg.Connection) -> int:
    row = conn.execute(
        "select count(*) as n from tangled_pds_accounts where repo_record_count > 0"
    ).fetchone()
    return int(row["n"]) if row else 0


def upsert_xrpc_snapshot(
    conn: psycopg.Connection,
    *,
    method: str,
    repo_did: str | None,
    params: dict[str, Any],
    params_hash: str,
    payload: dict[str, Any] | list[Any] | None,
    payload_encoding: str = "application/json",
) -> None:
    conn.execute(
        """
        insert into tangled_xrpc_snapshots (
            method, repo_did, params, params_hash, payload, payload_encoding, fetched_at
        )
        values (%s, %s, %s::jsonb, %s, %s::jsonb, %s, now())
        on conflict (method, repo_did, params_hash) do update set
            params = excluded.params,
            payload = excluded.payload,
            payload_encoding = excluded.payload_encoding,
            fetched_at = now()
        """,
        (
            method,
            repo_did,
            json.dumps(params),
            params_hash,
            json.dumps(payload) if payload is not None else None,
            payload_encoding,
        ),
    )


def upsert_atproto_record(
    conn: psycopg.Connection,
    *,
    uri: str,
    author_did: str,
    collection: str,
    rkey: str,
    payload: dict[str, Any],
    cid: str | None = None,
    repo_did: str | None = None,
    subject_uri: str | None = None,
) -> None:
    conn.execute(
        """
        insert into tangled_atproto_records (
            uri, author_did, collection, rkey, cid, payload, repo_did, subject_uri, fetched_at
        )
        values (%s, %s, %s, %s, %s, %s::jsonb, %s, %s, now())
        on conflict (uri) do update set
            cid = excluded.cid,
            payload = excluded.payload,
            repo_did = excluded.repo_did,
            subject_uri = excluded.subject_uri,
            fetched_at = now()
        """,
        (
            uri,
            author_did,
            collection,
            rkey,
            cid,
            json.dumps(payload),
            repo_did,
            subject_uri,
        ),
    )


def count_xrpc_snapshots(conn: psycopg.Connection) -> int:
    row = conn.execute("select count(*) as n from tangled_xrpc_snapshots").fetchone()
    return int(row["n"]) if row else 0


def table_counts(conn: psycopg.Connection) -> dict[str, int]:
    tables = [
        "tangled_lexicons",
        "tangled_knots",
        "tangled_pds_accounts",
        "tangled_repos",
        "tangled_identities",
        "tangled_atproto_records",
        "tangled_backlinks",
        "tangled_xrpc_snapshots",
        "tangled_git_archives",
        "tangled_git_blobs",
        "tangled_readmes",
        "tangled_repo_collaborators",
        "tangled_issues",
    ]
    counts: dict[str, int] = {}
    for table in tables:
        row = conn.execute(f"select count(*) as n from {table}").fetchone()
        counts[table] = int(row["n"]) if row else 0
    return counts
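The upsert helpers above share one pattern: `insert ... on conflict (...) do update set col = excluded.col`, which makes each write idempotent per key. The following is a minimal sketch of that pattern, assuming SQLite's in-memory engine as a stand-in for Postgres (SQLite 3.24+ accepts the same `on conflict ... do update` clause). The `crawl_state` table is a simplified, hypothetical cut of `tangled_crawl_state`, not the real schema.

```python
from __future__ import annotations

import json
import sqlite3
from typing import Any


def set_crawl_state(
    conn: sqlite3.Connection,
    *,
    key: str,
    status: str,
    meta: dict[str, Any] | None = None,
) -> None:
    # Same shape as the Postgres helper: insert, or overwrite the row for this key.
    conn.execute(
        """
        insert into crawl_state (key, status, meta)
        values (?, ?, ?)
        on conflict (key) do update set
            status = excluded.status,
            meta = excluded.meta
        """,
        # Mirrors the original truthiness check: None and {} both store NULL.
        (key, status, json.dumps(meta) if meta else None),
    )


def demo() -> list[tuple[str, str, str | None]]:
    # Hypothetical simplified table; the real one lives in supabase/migrations.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table crawl_state (key text primary key, status text, meta text)"
    )
    set_crawl_state(conn, key="issues:embed", status="running", meta={"count": 3})
    set_crawl_state(conn, key="issues:embed", status="complete", meta={})
    rows = conn.execute("select key, status, meta from crawl_state").fetchall()
    conn.close()
    return rows
```

Calling `demo()` leaves a single row for `issues:embed`, carrying the second call's status. Because the helper tests `if meta` rather than `if meta is not None`, an empty dict is stored as NULL, which may or may not be the intent.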
+163
scraper/embed_issues.py
#!/usr/bin/env python3
"""Compute embeddings for tangled_issues (title + body)."""

from __future__ import annotations

import os
import sys
from pathlib import Path

import httpx
from dotenv import load_dotenv

from db import connect, init_schema, register_pgvector, set_crawl_state
from embeddings import (
    DEFAULT_DIM,
    DEFAULT_MODEL,
    batch_size,
    embed_texts,
    embedding_model,
    gemini_api_key,
    truncate,
)
from progress import banner, log, phase, step, summary_block

REPO_ROOT = Path(__file__).resolve().parent.parent
CRAWL_KEY = "issues:embed"


def _issue_limit() -> int | None:
    raw = os.getenv("TANGLED_ISSUE_EMBED_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _force_reembed() -> bool:
    return os.getenv("TANGLED_ISSUE_EMBED_FORCE", "").strip().lower() in ("1", "true", "yes")


def _issue_text(title: str | None, body: str | None) -> str:
    parts = [p for p in (title, body) if p and p.strip()]
    return truncate("\n\n".join(parts))


def run_embed_issues(dsn: str) -> dict[str, int]:
    api_key = gemini_api_key()
    model = embedding_model()
    bs = batch_size()
    issue_limit = _issue_limit()
    force = _force_reembed()

    banner("ISSUE EMBED — Gemini → tangled_issues.embedding")
    log("embed-issues", f"Model: {model} dim={DEFAULT_DIM} L2-normalized batch={bs}")
    if issue_limit:
        log("embed-issues", f"Limit: {issue_limit}")
    if force:
        log("embed-issues", "Force re-embed enabled")

    where = "1=1"
    if not force:
        where += " and embedding is null"
    query = f"""
        select uri, author_handle, title, body
        from tangled_issues
        where {where}
          and coalesce(nullif(trim(title), ''), nullif(trim(body), '')) is not null
        order by fetched_at desc
    """
    if issue_limit:
        query += f" limit {issue_limit}"

    with connect(dsn) as conn:
        rows = conn.execute(query).fetchall()

    if not rows:
        log("embed-issues", "Nothing to embed (run fetch-issues first).")
        return {"embedded": 0, "batches": 0, "errors": 0}

    log("embed-issues", f"Embedding {len(rows)} issues …")
    stats = {"embedded": 0, "batches": 0, "errors": 0}

    phase(1, "Gemini batchEmbedContents → tangled_issues.embedding")

    with httpx.Client() as client, connect(dsn) as conn:
        register_pgvector(conn)
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={"count": len(rows), "model": model, "dim": DEFAULT_DIM},
        )
        conn.commit()

        for start in range(0, len(rows), bs):
            batch = rows[start : start + bs]
            texts = [_issue_text(r.get("title"), r.get("body")) for r in batch]
            labels = [
                f"{r.get('author_handle') or '?'}: {(r.get('title') or '')[:40]}"
                for r in batch
            ]

            try:
                vectors = embed_texts(client, api_key=api_key, texts=texts)
            except Exception as exc:
                stats["errors"] += len(batch)
                step(
                    "embed-issues",
                    min(start + len(batch), len(rows)),
                    len(rows),
                    f"ERROR batch: {exc}",
                )
                continue

            for row, vec in zip(batch, vectors, strict=True):
                conn.execute(
                    """
                    update tangled_issues
                    set embedding = %s, embedding_model = %s, embedded_at = now()
                    where uri = %s
                    """,
                    (vec, model, row["uri"]),
                )

            stats["embedded"] += len(batch)
            stats["batches"] += 1
            conn.commit()
            n = stats["embedded"]
            if n <= 10 or n % bs == 0 or n == len(rows):
                step("embed-issues", n, len(rows), f"OK {labels[-1]}")

        set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
        conn.commit()

    summary_block(
        "Issue embed complete",
        [f"Embedded: {stats['embedded']}", f"Errors: {stats['errors']}"],
    )
    return stats


def main() -> None:
    for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            break
    else:
        load_dotenv()

    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
        raise SystemExit(1)

    init_schema(dsn)
    run_embed_issues(dsn)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.", file=sys.stderr)
        raise SystemExit(130) from None
+171
scraper/embed_readmes.py
#!/usr/bin/env python3
"""Compute and store one embedding vector per README in tangled_readmes."""

from __future__ import annotations

import os
import sys
from pathlib import Path

import httpx
from dotenv import load_dotenv

from db import connect, init_schema, register_pgvector, set_crawl_state
from embeddings import (
    DEFAULT_DIM,
    DEFAULT_MODEL,
    batch_size,
    embed_texts,
    embedding_model,
    gemini_api_key,
    truncate,
)
from progress import banner, log, phase, step, summary_block

REPO_ROOT = Path(__file__).resolve().parent.parent
CRAWL_KEY = "readmes:embed"


def _repo_limit() -> int | None:
    raw = os.getenv("TANGLED_EMBED_README_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _force_reembed() -> bool:
    return os.getenv("TANGLED_EMBED_FORCE", "").strip().lower() in ("1", "true", "yes")


def _select_query(*, force: bool, limit: int | None) -> str:
    where = "status = 'found' and content is not null"
    if not force:
        where += " and embedding is null"
    query = f"""
        select repo_did, owner_handle, repo_name, content
        from tangled_readmes
        where {where}
        order by fetched_at desc
    """
    if limit:
        query += f" limit {limit}"
    return query


def run_embed_readmes(dsn: str) -> dict[str, int]:
    api_key = gemini_api_key()
    model = embedding_model()
    bs = batch_size()
    repo_limit = _repo_limit()
    force = _force_reembed()

    banner("README EMBED — Gemini → tangled_readmes.embedding")
    log("embed", f"Model: {model} dim={DEFAULT_DIM} L2-normalized batch={bs}")
    if repo_limit:
        log("embed", f"Limit: {repo_limit}")
    if force:
        log("embed", "Force re-embed all matching rows")

    with connect(dsn) as conn:
        register_pgvector(conn)
        rows = conn.execute(_select_query(force=force, limit=repo_limit)).fetchall()

    if not rows:
        log("embed", "Nothing to embed (run check-readmes first, or set TANGLED_EMBED_FORCE=1).")
        return {"embedded": 0, "batches": 0, "errors": 0}

    log("embed", f"Embedding {len(rows)} READMEs …")
    stats = {"embedded": 0, "batches": 0, "errors": 0}

    phase(1, "Gemini batchEmbedContents → tangled_readmes.embedding")

    with httpx.Client() as client, connect(dsn) as conn:
        register_pgvector(conn)
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={"count": len(rows), "model": model, "dim": DEFAULT_DIM},
        )
        conn.commit()

        for start in range(0, len(rows), bs):
            batch = rows[start : start + bs]
            texts = [truncate(r["content"]) for r in batch]
            labels = [
                f"{r.get('owner_handle') or '?'}/{r.get('repo_name') or r['repo_did'][:16]}"
                for r in batch
            ]

            try:
                vectors = embed_texts(client, api_key=api_key, texts=texts)
            except Exception as exc:
                stats["errors"] += len(batch)
                step(
                    "embed",
                    min(start + len(batch), len(rows)),
                    len(rows),
                    f"ERROR batch @ {start}: {exc}",
                )
                continue

            for row, vec in zip(batch, vectors, strict=True):
                conn.execute(
                    """
                    update tangled_readmes
                    set embedding = %s,
                        embedding_model = %s,
                        embedded_at = now()
                    where repo_did = %s
                    """,
                    (vec, model, row["repo_did"]),
                )

            stats["embedded"] += len(batch)
            stats["batches"] += 1
            conn.commit()

            n = stats["embedded"]
            if n <= 10 or n % bs == 0 or n == len(rows):
                step("embed", n, len(rows), f"OK {labels[-1]}")

        set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
        conn.commit()

    summary_block(
        "README embed complete",
        [
            f"Embedded: {stats['embedded']}",
            f"Batches: {stats['batches']}",
            f"Errors: {stats['errors']}",
            "",
            "Cosine search (L2-normalized vectors):",
            " order by embedding <=> query_vec",
        ],
    )
    return stats


def main() -> None:
    for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            break
    else:
        load_dotenv()

    dsn = os.getenv("DB_CONNECTION_STRING", "").strip()
    if not dsn:
        print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr)
        raise SystemExit(1)

    init_schema(dsn)
    run_embed_readmes(dsn)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nInterrupted.", file=sys.stderr)
        raise SystemExit(130) from None
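The summary printed by `run_embed_readmes` suggests ranking with `order by embedding <=> query_vec`, where `<=>` is pgvector's cosine-distance operator. For L2-normalized vectors, cosine distance reduces to one minus the dot product, so the ordering can be reproduced outside the database. A small pure-Python sketch of that equivalence follows; the two-dimensional vectors and document names are made up for illustration.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    # General definition: 1 - cos(angle between a and b).
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query: list[float], docs: dict[str, list[float]]) -> list[str]:
    # With unit vectors the norms are 1, so distance is simply 1 - dot product.
    return sorted(docs, key=lambda name: 1.0 - dot(query, docs[name]))


# Made-up 2-d stand-ins for the 1536-d stored vectors.
docs = {
    "a": l2_normalize([1.0, 0.0]),
    "b": l2_normalize([1.0, 1.0]),
    "c": l2_normalize([0.0, 1.0]),
}
query = l2_normalize([2.0, 0.5])
order = rank(query, docs)
```

Here `order` comes out nearest-first, matching what sorting by the full cosine-distance formula would give.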
+103
scraper/embeddings.py
"""Gemini embeddings: gemini-embedding-001, 1536-dim, L2-normalized for cosine."""

from __future__ import annotations

import math
import os

import httpx

DEFAULT_MODEL = "gemini-embedding-001"
DEFAULT_DIM = 1536
MAX_CHARS = 24_000
GEMINI_BATCH_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-embedding-001:batchEmbedContents"
)


def embedding_model() -> str:
    return os.getenv("TANGLED_EMBEDDING_MODEL", DEFAULT_MODEL).strip() or DEFAULT_MODEL


def batch_size() -> int:
    raw = os.getenv("TANGLED_EMBED_BATCH_SIZE", "16").strip()
    return max(1, min(100, int(raw)))


def gemini_api_key() -> str:
    key = (
        os.getenv("GEMINI_API_KEY", "").strip()
        or os.getenv("GOOGLE_API_KEY", "").strip()
    )
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY (or GOOGLE_API_KEY) is not set. "
            "Add it to .env to compute embeddings."
        )
    return key


def truncate(text: str) -> str:
    text = text.strip()
    return text[:MAX_CHARS] if len(text) > MAX_CHARS else text


def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]


def embed_texts(
    client: httpx.Client,
    *,
    api_key: str,
    texts: list[str],
    task_type: str = "RETRIEVAL_DOCUMENT",
) -> list[list[float]]:
    """Embed texts via Gemini batchEmbedContents; returns L2-normalized 1536-dim vectors."""
    if not texts:
        return []

    requests = [
        {
            "model": f"models/{DEFAULT_MODEL}",
            "content": {"parts": [{"text": text}]},
            "taskType": task_type,
            "outputDimensionality": DEFAULT_DIM,
        }
        for text in texts
    ]

    resp = client.post(
        GEMINI_BATCH_URL,
        headers={
            "x-goog-api-key": api_key,
            "Content-Type": "application/json",
        },
        json={"requests": requests},
        timeout=120.0,
    )
    if resp.status_code != 200:
        raise RuntimeError(
            f"Gemini embeddings HTTP {resp.status_code}: {resp.text[:500]}"
        )

    embeddings = resp.json().get("embeddings") or []
    if len(embeddings) != len(texts):
        raise RuntimeError(f"Expected {len(texts)} embeddings, got {len(embeddings)}")

    vectors: list[list[float]] = []
    for row in embeddings:
        values = row.get("values")
        if not isinstance(values, list):
            raise RuntimeError("Gemini response missing embedding values")
        if len(values) != DEFAULT_DIM:
            raise RuntimeError(
                f"Expected dim {DEFAULT_DIM}, got {len(values)}. "
                "Check outputDimensionality support for your API key."
            )
        vectors.append(l2_normalize(values))
    return vectors
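As written, `embed_texts` posts to a URL and request body that both name `DEFAULT_MODEL`, while the embed scripts record `embedding_model()` (which honours `TANGLED_EMBEDDING_MODEL`) in the `embedding_model` column. An override would therefore appear to change the stored label but not the model actually called. One possible way to keep the two in step is to derive the URL and the body from a single `model` argument. The sketch below assumes that approach; `build_batch_request` and `BASE_URL` are hypothetical names, and no network call is made.

```python
from __future__ import annotations

from typing import Any

# Same host and API version as GEMINI_BATCH_URL, split out so the model can vary.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_batch_request(
    texts: list[str],
    *,
    model: str = "gemini-embedding-001",
    dim: int = 1536,
    task_type: str = "RETRIEVAL_DOCUMENT",
) -> tuple[str, dict[str, Any]]:
    """Return (url, json_body) for one batchEmbedContents call (hypothetical helper)."""
    url = f"{BASE_URL}/models/{model}:batchEmbedContents"
    body = {
        "requests": [
            {
                # One entry per text, same fields as embed_texts builds.
                "model": f"models/{model}",
                "content": {"parts": [{"text": text}]},
                "taskType": task_type,
                "outputDimensionality": dim,
            }
            for text in texts
        ]
    }
    return url, body


url, body = build_batch_request(["first", "second"], model="some-other-model")
```

A caller could then pass `model=embedding_model()` so that the value written to `embedding_model` is the one the request really used. Whether a different model supports a 1536-dimensional output is a separate question that this sketch does not answer.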
+202
scraper/export_embeddings.py
"""Export embeddings from the shared Postgres into the embeddings git repo.

This is the "transfer" step that publishes the Discover engine's embeddings to the
network: it reads the precomputed vectors from Postgres (READ-ONLY) and writes the
files consumed by `tangled-discover-embeddings` (a knot-hosted git repo) — a single
`.npy` matrix + a `.jsonl` sidecar per section, plus a manifest. Commit + push that
repo afterwards (the push emits `sh.tangled.git.refUpdate`, the consumers' re-pull
signal).

This is the canonical, pipeline-wireable copy. An identical-logic, self-contained
copy also lives in the embeddings repo at `scripts/export_embeddings.py`; the only
difference here is that the OUTPUT directory is configurable (this script lives in the
backend repo, not in the embeddings repo).

    # writes into ../tangled-discover-embeddings by default:
    python scraper/export_embeddings.py
    # or point it anywhere:
    EMBEDDINGS_REPO_DIR=/path/to/tangled-discover-embeddings python scraper/export_embeddings.py
    python scraper/export_embeddings.py /path/to/tangled-discover-embeddings

Vectors read as pgvector text literals ('[v1,...]') exactly like recommendation/app/db.py
and scraper/seed_user.py; they are already 1536-d and L2-normalized. No DB writes.
"""

from __future__ import annotations

import datetime as dt
import hashlib
import json
import os
import sys
from pathlib import Path

import numpy as np
import psycopg
from psycopg.rows import dict_row

try:
    from dotenv import load_dotenv
except ImportError:  # dotenv optional if the var is already in env
    def load_dotenv(*_a, **_k):  # type: ignore
        return False

BACKEND_ROOT = Path(__file__).resolve().parent.parent  # the sunsteadhack repo
DIM = 1536
MODEL = "gemini-embedding-001"


def _out_dir() -> Path:
    """Where to write the embeddings repo files. Precedence: argv[1] > env > default
    sibling repo (../tangled-discover-embeddings)."""
    if len(sys.argv) > 1:
        return Path(sys.argv[1]).expanduser().resolve()
    env = os.environ.get("EMBEDDINGS_REPO_DIR")
    if env:
        return Path(env).expanduser().resolve()
    return (BACKEND_ROOT.parent / "tangled-discover-embeddings").resolve()


# Repos: mirror recommendation/app/db.py joins so description/topics/created_at/handle
# resolve the same way the engine sees them. content stays in the DB — we ship only its
# length (for the min-chars gate) and md5(first 500 chars) (for fork dedup).
_REPOS_SQL = """
select r.repo_did,
       r.repo_uri,
       coalesce(r.owner_handle, ti.handle) as owner_handle,
       r.repo_name,
       tr.record_raw->>'description' as description,
       tr.record_raw->'topics' as topics,
       tr.record_raw->>'createdAt' as created_at,
       length(trim(coalesce(r.content, ''))) as content_len,
       md5(substring(coalesce(r.content, '') for 500)) as content_sha500,
       r.embedding_model,
       r.embedded_at,
       r.embedding::text as etext
from tangled_readmes r
left join tangled_repos tr
  on coalesce(tr.repo_did, tr.record_raw->>'repoDid') = r.repo_did
left join tangled_identities ti
  on ti.did = split_part(replace(r.repo_uri, 'at://', ''), '/', 1)
where r.embedding is not null
order by r.repo_did
"""

# Issues: only those whose identity fully resolves (same inner joins as _KNN_ISSUES_SQL),
# i.e. exactly the set the engine can emit.
_ISSUES_SQL = """
select i.uri,
       i.rkey,
       i.repo_did,
       i.repo_uri,
       i.author_did,
       i.title,
       i.body,
       ti.handle as owner_handle,
       tr.name as repo_name,
       tr.record_raw->>'description' as repo_description,
       i.issue_created_at as created_at,
       i.embedding_model,
       i.embedding::text as etext
from tangled_open_issues i
join tangled_identities ti
  on ti.did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
join tangled_repos tr
  on tr.owner_did = split_part(replace(i.repo_uri, 'at://', ''), '/', 1)
 and tr.rkey = split_part(i.repo_uri, '/', 5)
where i.embedding is not null
  and i.repo_uri is not null
  and ti.handle is not null
  and tr.name is not null
order by i.uri
"""


def _dsn() -> str:
    for candidate in (BACKEND_ROOT / ".env", BACKEND_ROOT / "recommendation" / ".env", BACKEND_ROOT / "scraper" / ".env"):
        if candidate.exists():
            load_dotenv(candidate)
            break
    else:
        load_dotenv()
    conn = os.environ.get("DB_CONNECTION_STRING", "").strip()
    if not conn:
        raise SystemExit("DB_CONNECTION_STRING not set (env or .env)")
    if "sslmode=" not in conn:  # Cloud SQL public IP, self-signed cert
        conn += ("&" if "?" in conn else "?") + "sslmode=require"
    return conn


def _parse_vec(etext: str) -> np.ndarray:
    v = np.fromstring(etext.strip()[1:-1], sep=",", dtype=np.float32)
    if v.shape[0] != DIM:
        raise ValueError(f"expected dim {DIM}, got {v.shape[0]}")
    return v


def _json_default(o):
    if isinstance(o, (dt.datetime, dt.date)):
        return o.isoformat()
    return str(o)


def _export_section(conn, data_dir: Path, name: str, sql: str, meta_fields: list[str]) -> dict:
    rows = conn.execute(sql).fetchall()
    if not rows:
        raise SystemExit(f"{name}: no embedded rows found")
    matrix = np.vstack([_parse_vec(r["etext"]) for r in rows]).astype(np.float32)

    npy_path = data_dir / f"{name}.f32.npy"
    jsonl_path = data_dir / f"{name}.jsonl"
    np.save(npy_path, matrix)
    with open(jsonl_path, "w", encoding="utf-8") as fh:
        for i, r in enumerate(rows):
            rec = {"row": i, "subject_uri": r["uri"] if "uri" in r else r["repo_uri"]}
            rec.update({k: r[k] for k in meta_fields})
            fh.write(json.dumps(rec, default=_json_default, ensure_ascii=False) + "\n")

    sha = hashlib.sha256(npy_path.read_bytes()).hexdigest()
    print(f" {name}: {matrix.shape[0]} vectors -> {npy_path} ({npy_path.stat().st_size // 1024} KiB)")
    return {
        "count": int(matrix.shape[0]),
        "vectors": f"data/{name}.f32.npy",
        "meta": f"data/{name}.jsonl",
        "sha256": sha,
    }


def main() -> int:
    out = _out_dir()
    data_dir = out / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    print(f"exporting embeddings (read-only) -> {out}")
    with psycopg.connect(_dsn(), row_factory=dict_row) as conn:
        repos = _export_section(
            conn, data_dir, "repos", _REPOS_SQL,
            ["repo_did", "repo_name", "owner_handle", "description", "topics",
             "created_at", "content_len", "content_sha500", "embedding_model", "embedded_at"],
        )
        issues = _export_section(
            conn, data_dir, "issues", _ISSUES_SQL,
            ["repo_did", "rkey", "repo_uri", "author_did", "title", "body",
             "owner_handle", "repo_name", "repo_description", "created_at", "embedding_model"],
        )

    manifest = {
        "schema_version": 1,
        "model": MODEL,
        "dim": DIM,
        "metric": "cosine",
        "normalized": True,
        "task_type": "RETRIEVAL_DOCUMENT",
        "generated_at": dt.datetime.now(dt.timezone.utc).isoformat(),
        "sections": {"repos": repos, "issues": issues},
    }
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2) + "\n")
    print(f"wrote {out / 'manifest.json'} (repos={repos['count']}, issues={issues['count']})")
    print("next: cd into the embeddings repo, then git add -A && git commit && git push")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
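The exporter writes, per section, a vectors file, a `.jsonl` sidecar whose `row` field indexes into the matrix, and a manifest entry with `count` and `sha256`. A consumer of the embeddings repo might want to check those three against each other before loading anything. The following is a minimal sketch of such a check using only the standard library. `verify_section` is a hypothetical helper, and the demo writes placeholder bytes rather than a real `.npy` file.

```python
from __future__ import annotations

import hashlib
import json
import tempfile
from pathlib import Path


def verify_section(repo_dir: Path, section: dict) -> bool:
    """Check one manifest section against the files it points at."""
    vectors = repo_dir / section["vectors"]
    meta = repo_dir / section["meta"]
    digest = hashlib.sha256(vectors.read_bytes()).hexdigest()
    with open(meta, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    # The exporter numbers sidecar records 0..n-1 in matrix row order.
    in_order = all(rec["row"] == i for i, rec in enumerate(rows))
    return digest == section["sha256"] and len(rows) == section["count"] and in_order


def demo() -> tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "data").mkdir()
        blob = b"placeholder bytes standing in for repos.f32.npy"
        (repo / "data" / "repos.f32.npy").write_bytes(blob)
        (repo / "data" / "repos.jsonl").write_text(
            '{"row": 0, "subject_uri": "at://did:plc:example/sh.tangled.repo/abc"}\n',
            encoding="utf-8",
        )
        section = {
            "count": 1,
            "vectors": "data/repos.f32.npy",
            "meta": "data/repos.jsonl",
            "sha256": hashlib.sha256(blob).hexdigest(),
        }
        ok = verify_section(repo, section)
        # A wrong checksum should be rejected.
        tampered = verify_section(repo, {**section, "sha256": "0" * 64})
        return ok, tampered
```

In this sketch `demo()` reports the intact section as valid and the tampered one as invalid. A real consumer would follow a successful check with `numpy.load` on the vectors file and confirm that the matrix has `count` rows of `dim` columns.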
+128
scraper/export_questionnaires.py
"""Export AI-solve questionnaires from Postgres into the embeddings git repo.

Mirrors scraper/export_embeddings.py, but questionnaires are read PER ISSUE (not
bulk), so the layout is one JSON file per issue rather than a single matrix:

    <repo>/questionnaires/<did>/<rkey>.json   # one per issue
    <repo>/questionnaires/index.json          # {issue_uri -> path, updated_at, sha256}

This is the one-time migration (there's ~1 row today) + a bulk re-sync tool. The
live generation job writes these files itself via agent/questionnaire_repo_store.py;
this script just mirrors whatever is currently in the DB. READ-ONLY against the DB.

    python scraper/export_questionnaires.py
    EMBEDDINGS_REPO_DIR=/path/to/tangled-discover-embeddings python scraper/export_questionnaires.py
"""

from __future__ import annotations

import datetime as dt
import hashlib
import json
import os
import sys
from pathlib import Path

import psycopg
from psycopg.rows import dict_row

try:
    from dotenv import load_dotenv
except ImportError:
    def load_dotenv(*_a, **_k):  # type: ignore
        return False

BACKEND_ROOT = Path(__file__).resolve().parent.parent

_SELECT = """
select issue_uri, payload, created_at, updated_at
from tangled_issue_questionnaires
order by issue_uri
"""


def _out_dir() -> Path:
    if len(sys.argv) > 1:
        return Path(sys.argv[1]).expanduser().resolve()
    env = os.environ.get("EMBEDDINGS_REPO_DIR")
    if env:
        return Path(env).expanduser().resolve()
    return (BACKEND_ROOT.parent / "tangled-discover-embeddings").resolve()


def _dsn() -> str:
    for c in (BACKEND_ROOT / ".env", BACKEND_ROOT / "recommendation" / ".env", BACKEND_ROOT / "scraper" / ".env"):
        if c.exists():
            load_dotenv(c)
            break
    else:
        load_dotenv()
    conn = os.environ.get("DB_CONNECTION_STRING", "").strip()
    if not conn:
        raise SystemExit("DB_CONNECTION_STRING not set (env or .env)")
    if "sslmode=" not in conn:
        conn += ("&" if "?" in conn else "?") + "sslmode=require"
    return conn


def issue_uri_to_relpath(issue_uri: str) -> str:
    """at://<did>/sh.tangled.repo.issue/<rkey> -> questionnaires/<did>/<rkey>.json

    Shared convention with agent/questionnaire_repo_store.py — keep in sync."""
    rest = issue_uri[len("at://"):] if issue_uri.startswith("at://") else issue_uri
    parts = rest.split("/")
    did, rkey = parts[0], parts[-1]
    return f"questionnaires/{did}/{rkey}.json"


def file_record(issue_uri, payload, created_at, updated_at) -> dict:
    """The per-issue file shape (mirrors agent.questionnaire_store.get_questionnaire)."""
    return {
        "issue_uri": issue_uri,
        "version": payload.get("version") if isinstance(payload, dict) else None,
        "created_at": created_at.isoformat() if hasattr(created_at, "isoformat") else created_at,
        "updated_at": updated_at.isoformat() if hasattr(updated_at, "isoformat") else updated_at,
        "payload": payload,
    }


def main() -> int:
    out = _out_dir()
    qdir = out / "questionnaires"
    qdir.mkdir(parents=True, exist_ok=True)
    entries = []
    with psycopg.connect(_dsn(), row_factory=dict_row) as conn:
        rows = conn.execute(_SELECT).fetchall()
    print(f"exporting {len(rows)} questionnaire(s) (read-only) -> {qdir}")
    for r in rows:
        payload = r["payload"]
        if isinstance(payload, str):
            payload = json.loads(payload)
        rel = issue_uri_to_relpath(r["issue_uri"])
        path = out / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        body = json.dumps(file_record(r["issue_uri"], payload, r["created_at"], r["updated_at"]),
                          ensure_ascii=False, indent=2) + "\n"
        path.write_text(body, encoding="utf-8")
        entries.append({
            "issue_uri": r["issue_uri"],
            "path": rel,
            "updated_at": r["updated_at"].isoformat() if hasattr(r["updated_at"], "isoformat") else r["updated_at"],
            "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        })
        print(f" {rel} ({len(body)} bytes)")

    index = {
        "schema_version": 1,
        "kind": "questionnaires",
        "generated_at": dt.datetime.now(dt.timezone.utc).isoformat(),
        "count": len(entries),
        "entries": sorted(entries, key=lambda e: e["issue_uri"]),
    }
    (qdir / "index.json").write_text(json.dumps(index, indent=2) + "\n")
    print(f"wrote {qdir / 'index.json'} (count={len(entries)})")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
+361
scraper/fetch_collaborators.py
··· 1 + #!/usr/bin/env python3 2 + """Fetch collaborator lists for all repos via knot listCollaborators.""" 3 + 4 + from __future__ import annotations 5 + 6 + import os 7 + import sys 8 + import threading 9 + from concurrent.futures import ThreadPoolExecutor, as_completed 10 + from dataclasses import dataclass, field 11 + from pathlib import Path 12 + from typing import Any 13 + 14 + import httpx 15 + from dotenv import load_dotenv 16 + 17 + from db import connect, init_schema, set_crawl_state 18 + from parallel import concurrency_env 19 + from pds_client import knot_xrpc 20 + from progress import banner, log, metric, phase, step, summary_block 21 + 22 + REPO_ROOT = Path(__file__).resolve().parent.parent 23 + CRAWL_KEY = "collaborators:fetch" 24 + PAGE_LIMIT = 1000 25 + 26 + 27 + @dataclass 28 + class CollabFetchResult: 29 + repo_did: str 30 + repo_uri: str | None 31 + knot_hostname: str 32 + status: str # ok | skipped_knot | error 33 + collaborators: list[dict[str, Any]] = field(default_factory=list) 34 + error: str | None = None 35 + 36 + 37 + def _repo_limit() -> int | None: 38 + raw = os.getenv("TANGLED_COLLAB_REPO_LIMIT", "").strip() 39 + if not raw: 40 + return None 41 + return max(1, int(raw)) 42 + 43 + 44 + def _skip_existing() -> bool: 45 + return os.getenv("TANGLED_COLLAB_REFRESH", "").strip().lower() not in ( 46 + "1", 47 + "true", 48 + "yes", 49 + ) 50 + 51 + 52 + def fetch_repo_collaborators( 53 + client: httpx.Client, 54 + *, 55 + knot_hostname: str, 56 + repo_did: str, 57 + ) -> list[dict[str, Any]]: 58 + items: list[dict[str, Any]] = [] 59 + cursor: str | None = None 60 + 61 + while True: 62 + params: dict[str, Any] = { 63 + "subject": repo_did, 64 + "limit": PAGE_LIMIT, 65 + } 66 + if cursor: 67 + params["cursor"] = cursor 68 + 69 + status, payload = knot_xrpc( 70 + client, 71 + knot_hostname, 72 + "sh.tangled.repo.listCollaborators", 73 + params, 74 + ) 75 + if status != 200 or not isinstance(payload, dict): 76 + raise 
RuntimeError(f"listCollaborators HTTP {status}") 77 + 78 + page = payload.get("items") or [] 79 + if isinstance(page, list): 80 + items.extend(item for item in page if isinstance(item, dict)) 81 + 82 + cursor = payload.get("cursor") 83 + if not cursor or not page: 84 + break 85 + 86 + return items 87 + 88 + 89 + def upsert_collaborators( 90 + conn, 91 + *, 92 + repo_did: str, 93 + collaborators: list[dict[str, Any]], 94 + ) -> int: 95 + conn.execute( 96 + "delete from tangled_repo_collaborators where repo_did = %s", 97 + (repo_did,), 98 + ) 99 + 100 + stored = 0 101 + for item in collaborators: 102 + collab_did = item.get("subject") 103 + if not isinstance(collab_did, str) or not collab_did.startswith("did:"): 104 + continue 105 + conn.execute( 106 + """ 107 + insert into tangled_repo_collaborators ( 108 + repo_did, collaborator_did, added_by, record_uri, record_cid, 109 + created_at, last_synced_at 110 + ) 111 + values (%s, %s, %s, %s, %s, %s::timestamptz, now()) 112 + on conflict (repo_did, collaborator_did) do update set 113 + added_by = excluded.added_by, 114 + record_uri = excluded.record_uri, 115 + record_cid = excluded.record_cid, 116 + created_at = excluded.created_at, 117 + last_synced_at = now() 118 + """, 119 + ( 120 + repo_did, 121 + collab_did, 122 + item.get("addedBy") if isinstance(item.get("addedBy"), str) else None, 123 + item.get("uri") if isinstance(item.get("uri"), str) else None, 124 + item.get("cid") if isinstance(item.get("cid"), str) else None, 125 + item.get("createdAt") if isinstance(item.get("createdAt"), str) else None, 126 + ), 127 + ) 128 + stored += 1 129 + 130 + conn.execute( 131 + """ 132 + insert into tangled_repo_collaborators_sync (repo_did, collaborator_count, synced_at) 133 + values (%s, %s, now()) 134 + on conflict (repo_did) do update set 135 + collaborator_count = excluded.collaborator_count, 136 + synced_at = now() 137 + """, 138 + (repo_did, stored), 139 + ) 140 + return stored 141 + 142 + 143 + def _fetch_one(repo: 
dict[str, Any], reachable: set[str]) -> CollabFetchResult: 144 + repo_did = repo["repo_did"] 145 + knot = repo.get("knot_hostname") or "" 146 + base = CollabFetchResult( 147 + repo_did=repo_did, 148 + repo_uri=repo.get("uri"), 149 + knot_hostname=knot, 150 + status="error", 151 + ) 152 + 153 + if not knot or knot not in reachable: 154 + base.status = "skipped_knot" 155 + base.error = f"knot not reachable: {knot or 'missing'}" 156 + return base 157 + 158 + try: 159 + with httpx.Client(timeout=60.0, follow_redirects=True) as client: 160 + collaborators = fetch_repo_collaborators( 161 + client, knot_hostname=knot, repo_did=repo_did 162 + ) 163 + base.collaborators = collaborators 164 + base.status = "ok" 165 + return base 166 + except Exception as exc: 167 + base.error = str(exc) 168 + return base 169 + 170 + 171 + def run_fetch_collaborators(dsn: str) -> dict[str, int]: 172 + workers = concurrency_env("TANGLED_COLLAB_CONCURRENCY", default=20) 173 + repo_limit = _repo_limit() 174 + skip_existing = _skip_existing() 175 + 176 + banner("COLLABORATORS — knot listCollaborators for every repo") 177 + log("collab", f"Concurrency: {workers}") 178 + if repo_limit: 179 + log("collab", f"Repo limit: {repo_limit}") 180 + if skip_existing: 181 + log( 182 + "collab", 183 + "Skip existing: on (set TANGLED_COLLAB_REFRESH=1 to re-fetch all)", 184 + ) 185 + else: 186 + log("collab", "Skip existing: off — refreshing every repo") 187 + 188 + with connect(dsn) as conn: 189 + reachable = { 190 + row["hostname"] 191 + for row in conn.execute( 192 + "select hostname from tangled_knots where reachable = true" 193 + ).fetchall() 194 + } 195 + skip_clause = "" 196 + if skip_existing: 197 + skip_clause = """ 198 + and not exists ( 199 + select 1 from tangled_repo_collaborators_sync s 200 + where s.repo_did = tangled_repos.repo_did 201 + ) 202 + """ 203 + query = f""" 204 + select uri, repo_did, knot_hostname, owner_handle, name 205 + from tangled_repos 206 + where repo_did is not null 207 + and 
knot_hostname is not null 208 + {skip_clause} 209 + order by uri 210 + """ 211 + if repo_limit: 212 + query += f" limit {repo_limit}" 213 + repos = conn.execute(query).fetchall() 214 + synced_count = 0 215 + if skip_existing: 216 + synced_count = conn.execute( 217 + "select count(*) as n from tangled_repo_collaborators_sync" 218 + ).fetchone()["n"] 219 + total_eligible = conn.execute( 220 + """ 221 + select count(*) as n from tangled_repos 222 + where repo_did is not null and knot_hostname is not null 223 + """ 224 + ).fetchone()["n"] 225 + 226 + if not repos: 227 + log("collab", "Nothing to fetch — all eligible repos already synced.") 228 + return { 229 + "repos_fetched": 0, 230 + "collaborator_edges": 0, 231 + "already_synced": total_eligible, 232 + "skipped_knot": 0, 233 + "errors": 0, 234 + } 235 + 236 + already_synced = synced_count if skip_existing else 0 237 + if skip_existing: 238 + metric("Eligible repos", total_eligible) 239 + metric("Already synced (skipped)", already_synced) 240 + metric("To fetch", len(repos)) 241 + 242 + stats = { 243 + "repos_fetched": 0, 244 + "collaborator_edges": 0, 245 + "already_synced": already_synced, 246 + "skipped_knot": 0, 247 + "errors": 0, 248 + } 249 + done = 0 250 + done_lock = threading.Lock() 251 + 252 + phase(1, f"Parallel listCollaborators ({workers} workers)") 253 + 254 + with connect(dsn) as conn: 255 + set_crawl_state( 256 + conn, 257 + key=CRAWL_KEY, 258 + status="running", 259 + meta={"repo_count": len(repos), "workers": workers}, 260 + ) 261 + conn.commit() 262 + 263 + with ThreadPoolExecutor(max_workers=workers) as pool: 264 + futures = { 265 + pool.submit(_fetch_one, dict(repo), reachable): repo for repo in repos 266 + } 267 + 268 + for future in as_completed(futures): 269 + repo = futures[future] 270 + label = f"{repo.get('owner_handle') or '?'}/{repo.get('name') or repo['repo_did'][:16]}" 271 + 272 + try: 273 + result = future.result() 274 + except Exception as exc: 275 + result = CollabFetchResult( 276 + 
repo_did=repo["repo_did"], 277 + repo_uri=repo.get("uri"), 278 + knot_hostname=repo.get("knot_hostname") or "", 279 + status="error", 280 + error=str(exc), 281 + ) 282 + 283 + with done_lock: 284 + done += 1 285 + n = done 286 + 287 + if result.status == "ok": 288 + count = upsert_collaborators( 289 + conn, 290 + repo_did=result.repo_did, 291 + collaborators=result.collaborators, 292 + ) 293 + stats["repos_fetched"] += 1 294 + stats["collaborator_edges"] += count 295 + if n <= 10 or n % 100 == 0 or count > 0: 296 + step( 297 + "collab", 298 + n, 299 + len(repos), 300 + f"OK {label} {count} collaborator(s)", 301 + ) 302 + elif result.status == "skipped_knot": 303 + stats["skipped_knot"] += 1 304 + if n <= 10 or n % 200 == 0: 305 + step("collab", n, len(repos), f"SKIP {label} {result.error}") 306 + else: 307 + stats["errors"] += 1 308 + if n <= 10 or n % 100 == 0: 309 + step( 310 + "collab", 311 + n, 312 + len(repos), 313 + f"ERROR {label} {result.error or 'unknown'}", 314 + ) 315 + 316 + if n % 50 == 0: 317 + conn.commit() 318 + 319 + set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats) 320 + conn.commit() 321 + 322 + summary_block( 323 + "Collaborators fetch complete", 324 + [ 325 + f"Repos fetched: {stats['repos_fetched']}", 326 + f"Collaborator edges: {stats['collaborator_edges']}", 327 + f"Already synced: {stats['already_synced']}", 328 + f"Skipped knot: {stats['skipped_knot']}", 329 + f"Errors: {stats['errors']}", 330 + "", 331 + "Repos a user collaborates on:", 332 + " select * from tangled_user_collaborations", 333 + " where user_did = 'did:plc:...';", 334 + ], 335 + ) 336 + return stats 337 + 338 + 339 + def main() -> None: 340 + for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"): 341 + if candidate.exists(): 342 + load_dotenv(candidate) 343 + break 344 + else: 345 + load_dotenv() 346 + 347 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 348 + if not dsn: 349 + print("ERROR: DB_CONNECTION_STRING not set", 
file=sys.stderr) 350 + raise SystemExit(1) 351 + 352 + init_schema(dsn) 353 + run_fetch_collaborators(dsn) 354 + 355 + 356 + if __name__ == "__main__": 357 + try: 358 + main() 359 + except KeyboardInterrupt: 360 + print("\nInterrupted.", file=sys.stderr) 361 + raise SystemExit(130) from None
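Note on `fetch_repo_collaborators` above: its cursor loop runs until the knot stops returning a cursor, so a knot that keeps echoing the same cursor would never terminate. The sketch below shows the same loop with the page cap and seen-cursor guard that `_list_all_records` in `fetch_issues.py` already uses. It is a minimal illustration, not the repo's code: `paginate`, `fetch_page`, and the default `max_pages` value are hypothetical names and choices, with `fetch_page` standing in for the `knot_xrpc` call.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def paginate(
    fetch_page: Callable[[Optional[str]], Page],
    *,
    max_pages: int = 50,  # assumed cap, mirroring TANGLED_ISSUE_MAX_PAGES
) -> List[Dict[str, Any]]:
    """Collect dict items across cursor pages, stopping on a repeated cursor."""
    items: List[Dict[str, Any]] = []
    cursor: Optional[str] = None
    seen: set = set()

    for _ in range(max_pages):
        payload = fetch_page(cursor)
        page = payload.get("items") or []
        if isinstance(page, list):
            # Keep only well-formed entries, as the original loop does.
            items.extend(item for item in page if isinstance(item, dict))

        next_cursor = payload.get("cursor")
        if not next_cursor or not page:
            break
        # Guard: a non-string or already-seen cursor would loop forever.
        if not isinstance(next_cursor, str) or next_cursor in seen:
            break
        seen.add(next_cursor)
        cursor = next_cursor

    return items


# Usage with a fake server whose second page repeats its own cursor.
_FAKE_PAGES = {
    None: {"items": [{"subject": "did:plc:a"}], "cursor": "c1"},
    "c1": {"items": [{"subject": "did:plc:b"}, "junk"], "cursor": "c1"},
}
collected = paginate(lambda cursor: _FAKE_PAGES[cursor])
```

With the fake pages shown, the loop stops after the second request because `c1` has already been seen, and the non-dict entry is dropped.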
+604
scraper/fetch_issues.py
··· 1 + #!/usr/bin/env python3 2 + """Scrape sh.tangled.repo.issue (+ state) from every known user PDS.""" 3 + 4 + from __future__ import annotations 5 + 6 + import json 7 + import os 8 + import sys 9 + import threading 10 + import time 11 + from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait 12 + from dataclasses import dataclass, field 13 + from pathlib import Path 14 + from typing import Any 15 + 16 + import httpx 17 + from dotenv import load_dotenv 18 + 19 + from db import connect, init_schema, set_crawl_state 20 + from parallel import concurrency_env 21 + from pds_client import list_records, pds_host_for_did 22 + from progress import banner, log, metric, phase, step, summary_block 23 + 24 + REPO_ROOT = Path(__file__).resolve().parent.parent 25 + CRAWL_KEY = "issues:fetch" 26 + ISSUE_COLLECTION = "sh.tangled.repo.issue" 27 + STATE_COLLECTION = "sh.tangled.repo.issue.state" 28 + STATE_OPEN = "sh.tangled.repo.issue.state.open" 29 + STATE_CLOSED = "sh.tangled.repo.issue.state.closed" 30 + HTTP_TIMEOUT = httpx.Timeout(connect=5.0, read=15.0, write=10.0, pool=10.0) 31 + LOG_EVERY = 10 32 + HEARTBEAT_SEC = 15 33 + INFLIGHT_CHUNK = 200 34 + 35 + 36 + class _PdsCache: 37 + def __init__(self) -> None: 38 + self._hosts: dict[str, str | None] = {} 39 + self._lock = threading.Lock() 40 + 41 + def resolve(self, client: httpx.Client, user_did: str, hint: str | None) -> str | None: 42 + if hint: 43 + return hint.rstrip("/") 44 + with self._lock: 45 + if user_did in self._hosts: 46 + return self._hosts[user_did] 47 + try: 48 + pds = pds_host_for_did(client, user_did) 49 + except httpx.HTTPError: 50 + pds = None 51 + with self._lock: 52 + self._hosts[user_did] = pds.rstrip("/") if pds else None 53 + return self._hosts[user_did] 54 + 55 + 56 + @dataclass 57 + class UserIssueResult: 58 + user_did: str 59 + handle: str | None 60 + status: str # ok | error 61 + issues: list[dict[str, Any]] = field(default_factory=list) 62 + states: list[dict[str, Any]] = 
field(default_factory=list) 63 + error: str | None = None 64 + 65 + 66 + def _user_limit() -> int | None: 67 + raw = os.getenv("TANGLED_ISSUE_USER_LIMIT", "").strip() 68 + if not raw: 69 + return None 70 + return max(1, int(raw)) 71 + 72 + 73 + def _max_pages() -> int: 74 + raw = os.getenv("TANGLED_ISSUE_MAX_PAGES", "50").strip() 75 + return max(1, int(raw)) 76 + 77 + 78 + def _skip_existing() -> bool: 79 + return os.getenv("TANGLED_ISSUE_REFRESH", "").strip().lower() not in ( 80 + "1", 81 + "true", 82 + "yes", 83 + ) 84 + 85 + 86 + def _all_users() -> bool: 87 + return os.getenv("TANGLED_ISSUE_ALL_USERS", "1").strip().lower() not in ( 88 + "0", 89 + "false", 90 + "no", 91 + ) 92 + 93 + 94 + def _users_query(*, skip_existing: bool, user_limit: int | None, all_users: bool) -> str: 95 + skip_clause = "" 96 + if skip_existing: 97 + skip_clause = """ 98 + and not exists ( 99 + select 1 from tangled_issue_user_sync s where s.user_did = u.did 100 + ) 101 + """ 102 + pds_union = "" 103 + if all_users: 104 + pds_union = """ 105 + union all 106 + select did, handle, pds_host from tangled_pds_accounts 107 + """ 108 + query = f""" 109 + select distinct on (u.did) u.did, u.handle, u.pds_host 110 + from ( 111 + select did, handle, pds_host from tangled_identities 112 + union all 113 + select owner_did as did, 114 + max(owner_handle) as handle, 115 + null::text as pds_host 116 + from tangled_repos 117 + where owner_did is not null 118 + group by owner_did 119 + {pds_union} 120 + ) u 121 + where u.did is not null 122 + {skip_clause} 123 + order by u.did, u.pds_host nulls last, u.handle nulls last 124 + """ 125 + if user_limit: 126 + query += f" limit {user_limit}" 127 + return query 128 + 129 + 130 + def _total_users_sql(*, all_users: bool) -> str: 131 + pds_union = "" 132 + if all_users: 133 + pds_union = "union select did from tangled_pds_accounts" 134 + return f""" 135 + select count(*) as n from ( 136 + select did from tangled_identities 137 + union 138 + select owner_did 
from tangled_repos where owner_did is not null 139 + {pds_union} 140 + ) x 141 + """ 142 + 143 + 144 + def _rkey_from_uri(uri: str) -> str: 145 + return uri.rsplit("/", 1)[-1] 146 + 147 + 148 + def _parse_repo_refs(value: dict[str, Any]) -> tuple[str | None, str | None]: 149 + repo = value.get("repo") 150 + if isinstance(repo, str): 151 + if repo.startswith("did:"): 152 + return repo, None 153 + if repo.startswith("at://"): 154 + return _repo_did_from_at_uri(repo), repo 155 + return None, repo if isinstance(repo, str) else None 156 + 157 + 158 + def _repo_did_from_at_uri(uri: str) -> str | None: 159 + if not uri.startswith("at://"): 160 + return None 161 + parts = uri.removeprefix("at://").split("/") 162 + return parts[0] if parts and parts[0].startswith("did:") else None 163 + 164 + 165 + def _list_all_records( 166 + client: httpx.Client, 167 + pds_host: str, 168 + user_did: str, 169 + collection: str, 170 + *, 171 + max_pages: int, 172 + ) -> list[dict[str, Any]]: 173 + records: list[dict[str, Any]] = [] 174 + cursor: str | None = None 175 + seen_cursors: set[str] = set() 176 + 177 + for _ in range(max_pages): 178 + data = list_records( 179 + client, pds_host, user_did, collection, cursor=cursor, limit=100 180 + ) 181 + page = data.get("records") or [] 182 + records.extend(rec for rec in page if isinstance(rec, dict)) 183 + next_cursor = data.get("cursor") 184 + if not next_cursor or not page: 185 + break 186 + if not isinstance(next_cursor, str) or next_cursor in seen_cursors: 187 + break 188 + seen_cursors.add(next_cursor) 189 + cursor = next_cursor 190 + return records 191 + 192 + 193 + def _state_map(states: list[dict[str, Any]]) -> dict[str, str]: 194 + mapping: dict[str, str] = {} 195 + for rec in states: 196 + value = rec.get("value") 197 + if not isinstance(value, dict): 198 + continue 199 + issue_uri = value.get("issue") 200 + state = value.get("state") 201 + if not isinstance(state, str): 202 + continue 203 + if state == STATE_CLOSED: 204 + normalized = 
"closed" 205 + elif state == STATE_OPEN: 206 + normalized = "open" 207 + else: 208 + normalized = "open" 209 + if isinstance(issue_uri, str) and issue_uri: 210 + mapping[issue_uri] = normalized 211 + else: 212 + rkey = _rkey_from_uri(rec["uri"]) if isinstance(rec.get("uri"), str) else None 213 + if rkey: 214 + mapping[f"rkey:{rkey}"] = normalized 215 + return mapping 216 + 217 + 218 + def _issue_state(uri: str, rkey: str, states: dict[str, str]) -> str: 219 + if uri in states: 220 + return states[uri] 221 + return states.get(f"rkey:{rkey}", "open") 222 + 223 + 224 + def _optional_timestamp(value: Any) -> str | None: 225 + if not isinstance(value, str): 226 + return None 227 + value = value.strip() 228 + return value if value else None 229 + 230 + 231 + def upsert_issue( 232 + conn, 233 + *, 234 + record: dict[str, Any], 235 + author_did: str, 236 + author_handle: str | None, 237 + state: str, 238 + ) -> None: 239 + uri = record["uri"] 240 + value = record["value"] 241 + rkey = _rkey_from_uri(uri) 242 + repo_did, repo_uri = _parse_repo_refs(value) 243 + title = value.get("title") if isinstance(value.get("title"), str) else None 244 + body = value.get("body") if isinstance(value.get("body"), str) else None 245 + created = _optional_timestamp(value.get("createdAt")) 246 + 247 + conn.execute( 248 + """ 249 + insert into tangled_issues ( 250 + uri, author_did, author_handle, rkey, repo_did, repo_uri, 251 + title, body, state, issue_created_at, cid, record_raw, fetched_at 252 + ) 253 + values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s::timestamptz, %s, %s::jsonb, now()) 254 + on conflict (uri) do update set 255 + author_did = excluded.author_did, 256 + author_handle = excluded.author_handle, 257 + rkey = excluded.rkey, 258 + repo_did = coalesce(excluded.repo_did, tangled_issues.repo_did), 259 + repo_uri = coalesce(excluded.repo_uri, tangled_issues.repo_uri), 260 + title = excluded.title, 261 + body = excluded.body, 262 + state = excluded.state, 263 + issue_created_at = 
excluded.issue_created_at, 264 + cid = excluded.cid, 265 + record_raw = excluded.record_raw, 266 + fetched_at = now(), 267 + embedding = case 268 + when tangled_issues.title is distinct from excluded.title 269 + or tangled_issues.body is distinct from excluded.body 270 + then null else tangled_issues.embedding end, 271 + embedding_model = case 272 + when tangled_issues.title is distinct from excluded.title 273 + or tangled_issues.body is distinct from excluded.body 274 + then null else tangled_issues.embedding_model end, 275 + embedded_at = case 276 + when tangled_issues.title is distinct from excluded.title 277 + or tangled_issues.body is distinct from excluded.body 278 + then null else tangled_issues.embedded_at end 279 + """, 280 + ( 281 + uri, 282 + author_did, 283 + author_handle, 284 + rkey, 285 + repo_did, 286 + repo_uri, 287 + title, 288 + body, 289 + state, 290 + created, 291 + record.get("cid") if isinstance(record.get("cid"), str) else None, 292 + json.dumps(value), 293 + ), 294 + ) 295 + 296 + 297 + def _mark_user_synced( 298 + conn, 299 + *, 300 + user_did: str, 301 + issue_count: int, 302 + status: str, 303 + error_message: str | None = None, 304 + ) -> None: 305 + conn.execute( 306 + """ 307 + insert into tangled_issue_user_sync ( 308 + user_did, issue_count, synced_at, status, error_message 309 + ) 310 + values (%s, %s, now(), %s, %s) 311 + on conflict (user_did) do update set 312 + issue_count = excluded.issue_count, 313 + synced_at = now(), 314 + status = excluded.status, 315 + error_message = excluded.error_message 316 + """, 317 + (user_did, issue_count, status, error_message), 318 + ) 319 + 320 + 321 + def _fetch_user_issues( 322 + user_did: str, 323 + handle: str | None, 324 + pds_host: str | None, 325 + cache: _PdsCache, 326 + max_pages: int, 327 + ) -> UserIssueResult: 328 + result = UserIssueResult(user_did=user_did, handle=handle, status="error") 329 + try: 330 + with httpx.Client(timeout=HTTP_TIMEOUT, follow_redirects=True) as client: 331 
+ pds = cache.resolve(client, user_did, pds_host) 332 + if not pds: 333 + result.error = "could not resolve PDS" 334 + return result 335 + issues = _list_all_records( 336 + client, pds, user_did, ISSUE_COLLECTION, max_pages=max_pages 337 + ) 338 + states: list[dict[str, Any]] = [] 339 + if issues: 340 + states = _list_all_records( 341 + client, pds, user_did, STATE_COLLECTION, max_pages=max_pages 342 + ) 343 + result.issues = issues 344 + result.states = states 345 + result.status = "ok" 346 + return result 347 + except httpx.TimeoutException: 348 + result.error = "PDS timeout" 349 + return result 350 + except httpx.HTTPError as exc: 351 + result.error = str(exc)[:200] 352 + return result 353 + except Exception as exc: 354 + result.error = str(exc)[:200] 355 + return result 356 + 357 + 358 + def _heartbeat_loop( 359 + *, 360 + done: list[int], 361 + total: int, 362 + inflight: list[int], 363 + last_done_at: list[float], 364 + stop: threading.Event, 365 + ) -> None: 366 + while not stop.wait(HEARTBEAT_SEC): 367 + n = done[0] 368 + pending = total - n 369 + active = inflight[0] 370 + idle = time.monotonic() - last_done_at[0] 371 + log( 372 + "issues", 373 + f"… heartbeat {n}/{total} done ({active} in-flight, " 374 + f"{pending} pending, last +{idle:.0f}s)", 375 + ) 376 + 377 + 378 + def run_fetch_issues(dsn: str) -> dict[str, int]: 379 + workers = concurrency_env("TANGLED_ISSUE_CONCURRENCY", default=10) 380 + user_limit = _user_limit() 381 + skip_existing = _skip_existing() 382 + all_users = _all_users() 383 + max_pages = _max_pages() 384 + 385 + banner("ISSUES — scrape sh.tangled.repo.issue from user PDSes") 386 + log("issues", f"Concurrency: {workers} PDS read timeout: 15s") 387 + log("issues", f"Max listRecords pages/user/collection: {max_pages}") 388 + log("issues", f"User scope: {'all known DIDs (+ tngl PDS accounts)' if all_users else 'identities + repo owners'}") 389 + if user_limit: 390 + log("issues", f"User limit: {user_limit}") 391 + if skip_existing: 392 
+ log("issues", "Skip existing: on (set TANGLED_ISSUE_REFRESH=1 to re-scan all)") 393 + else: 394 + log("issues", "Skip existing: off — re-scanning every user (daily sync)") 395 + 396 + with connect(dsn) as conn: 397 + users = conn.execute( 398 + _users_query(skip_existing=skip_existing, user_limit=user_limit, all_users=all_users) 399 + ).fetchall() 400 + total_users = conn.execute(_total_users_sql(all_users=all_users)).fetchone()["n"] 401 + synced = 0 402 + if skip_existing: 403 + synced = conn.execute("select count(*) as n from tangled_issue_user_sync").fetchone()["n"] 404 + 405 + if not users: 406 + log("issues", "Nothing to fetch — all users already scanned.") 407 + return { 408 + "users_scanned": 0, 409 + "issues_upserted": 0, 410 + "open_issues": 0, 411 + "already_synced": synced, 412 + "errors": 0, 413 + } 414 + 415 + already_synced = synced if skip_existing else 0 416 + metric("Known users", total_users) 417 + if skip_existing: 418 + metric("Already synced (skipped)", already_synced) 419 + metric("To scan", len(users)) 420 + 421 + stats = { 422 + "users_scanned": 0, 423 + "issues_upserted": 0, 424 + "open_issues": 0, 425 + "already_synced": already_synced, 426 + "errors": 0, 427 + } 428 + done_box = [0] 429 + inflight_box = [0] 430 + last_done_at = [time.monotonic()] 431 + done_lock = threading.Lock() 432 + pds_cache = _PdsCache() 433 + 434 + phase(1, f"Parallel PDS listRecords ({workers} workers)") 435 + log("issues", f"Progress every {LOG_EVERY} users + heartbeat every {HEARTBEAT_SEC}s") 436 + 437 + stop_heartbeat = threading.Event() 438 + heartbeat = threading.Thread( 439 + target=_heartbeat_loop, 440 + kwargs={ 441 + "done": done_box, 442 + "total": len(users), 443 + "inflight": inflight_box, 444 + "last_done_at": last_done_at, 445 + "stop": stop_heartbeat, 446 + }, 447 + daemon=True, 448 + ) 449 + heartbeat.start() 450 + 451 + try: 452 + with connect(dsn) as conn: 453 + set_crawl_state( 454 + conn, 455 + key=CRAWL_KEY, 456 + status="running", 457 + 
meta={"user_count": len(users), "workers": workers}, 458 + ) 459 + conn.commit() 460 + 461 + user_iter = iter(users) 462 + pending_futures: dict[Any, dict[str, Any]] = {} 463 + 464 + def submit_more(pool: ThreadPoolExecutor) -> None: 465 + while len(pending_futures) < INFLIGHT_CHUNK: 466 + try: 467 + row = next(user_iter) 468 + except StopIteration: 469 + break 470 + fut = pool.submit( 471 + _fetch_user_issues, 472 + row["did"], 473 + row.get("handle"), 474 + row.get("pds_host"), 475 + pds_cache, 476 + max_pages, 477 + ) 478 + pending_futures[fut] = row 479 + inflight_box[0] = len(pending_futures) 480 + 481 + with ThreadPoolExecutor(max_workers=workers) as pool: 482 + submit_more(pool) 483 + 484 + while pending_futures: 485 + done_set, _ = wait(pending_futures, timeout=HEARTBEAT_SEC, return_when=FIRST_COMPLETED) 486 + if not done_set: 487 + continue 488 + 489 + for future in done_set: 490 + row = pending_futures.pop(future) 491 + label = row.get("handle") or row["did"][:20] 492 + 493 + try: 494 + result = future.result() 495 + except Exception as exc: 496 + result = UserIssueResult( 497 + user_did=row["did"], 498 + handle=row.get("handle"), 499 + status="error", 500 + error=str(exc)[:200], 501 + ) 502 + 503 + with done_lock: 504 + done_box[0] += 1 505 + n = done_box[0] 506 + last_done_at[0] = time.monotonic() 507 + 508 + if result.status == "ok": 509 + states = _state_map(result.states) 510 + upserted = 0 511 + open_n = 0 512 + for rec in result.issues: 513 + if not isinstance(rec.get("uri"), str) or not isinstance( 514 + rec.get("value"), dict 515 + ): 516 + continue 517 + rkey = _rkey_from_uri(rec["uri"]) 518 + state = _issue_state(rec["uri"], rkey, states) 519 + upsert_issue( 520 + conn, 521 + record=rec, 522 + author_did=result.user_did, 523 + author_handle=result.handle, 524 + state=state, 525 + ) 526 + upserted += 1 527 + if state == "open": 528 + open_n += 1 529 + 530 + _mark_user_synced( 531 + conn, 532 + user_did=result.user_did, 533 + 
issue_count=upserted, 534 + status="ok", 535 + ) 536 + stats["users_scanned"] += 1 537 + stats["issues_upserted"] += upserted 538 + stats["open_issues"] += open_n 539 + msg = f"OK {label} {upserted} issue(s) ({open_n} open)" 540 + else: 541 + _mark_user_synced( 542 + conn, 543 + user_did=result.user_did, 544 + issue_count=0, 545 + status="error", 546 + error_message=result.error, 547 + ) 548 + stats["errors"] += 1 549 + msg = f"ERROR {label} {result.error or 'unknown'}" 550 + 551 + if n <= 10 or n % LOG_EVERY == 0 or result.issues: 552 + step("issues", n, len(users), msg) 553 + 554 + if n % 25 == 0: 555 + conn.commit() 556 + 557 + submit_more(pool) 558 + inflight_box[0] = len(pending_futures) 559 + 560 + set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats) 561 + conn.commit() 562 + finally: 563 + stop_heartbeat.set() 564 + heartbeat.join(timeout=1) 565 + 566 + summary_block( 567 + "Issues fetch complete", 568 + [ 569 + f"Users scanned: {stats['users_scanned']}", 570 + f"Issues upserted: {stats['issues_upserted']}", 571 + f"Open (this run): {stats['open_issues']}", 572 + f"Already synced: {stats['already_synced']}", 573 + f"Errors: {stats['errors']}", 574 + "", 575 + "Query open issues:", 576 + " select count(*) from tangled_open_issues;", 577 + ], 578 + ) 579 + return stats 580 + 581 + 582 + def main() -> None: 583 + for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"): 584 + if candidate.exists(): 585 + load_dotenv(candidate) 586 + break 587 + else: 588 + load_dotenv() 589 + 590 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 591 + if not dsn: 592 + print("ERROR: DB_CONNECTION_STRING not set", file=sys.stderr) 593 + raise SystemExit(1) 594 + 595 + init_schema(dsn) 596 + run_fetch_issues(dsn) 597 + 598 + 599 + if __name__ == "__main__": 600 + try: 601 + main() 602 + except KeyboardInterrupt: 603 + print("\nInterrupted.", file=sys.stderr) 604 + raise SystemExit(130) from None
+556
scraper/ingest_handle.py
··· 1 + #!/usr/bin/env python3 2 + """Full ingest for one Tangled handle: identity → repos → READMEs + embeddings → issues + embeddings. 3 + 4 + Onboards a single user for recommendations/testing without a network-wide crawl. 5 + 6 + Usage (from scraper/, with repo-root .env): 7 + python ingest_handle.py arsenii.tngl.sh 8 + python ingest_handle.py did:plc:abc123 9 + python ingest_handle.py arsenii.tngl.sh --skip-issues 10 + python ingest_handle.py arsenii.tngl.sh --force-embed 11 + 12 + Requires: DB_CONNECTION_STRING, GEMINI_API_KEY (for embeddings). 13 + """ 14 + 15 + from __future__ import annotations 16 + 17 + import argparse 18 + import json 19 + import os 20 + import sys 21 + from pathlib import Path 22 + 23 + import httpx 24 + from dotenv import load_dotenv 25 + 26 + from db import connect, init_schema, register_pgvector 27 + from embeddings import ( 28 + batch_size, 29 + embed_texts, 30 + embedding_model, 31 + gemini_api_key, 32 + truncate, 33 + ) 34 + from fetch_issues import ( 35 + UserIssueResult, 36 + _fetch_user_issues, 37 + _issue_state, 38 + _mark_user_synced, 39 + _PdsCache, 40 + _rkey_from_uri, 41 + _state_map, 42 + upsert_issue, 43 + ) 44 + from progress import banner, log, summary_block 45 + 46 + REPO_ROOT = Path(__file__).resolve().parent.parent 47 + REPO_COLLECTION = "sh.tangled.repo" 48 + RESOLVE_PDS = ( 49 + "https://tngl.sh", 50 + "https://bsky.social", 51 + "https://public.api.bsky.app", 52 + ) 53 + 54 + 55 + def load_env() -> None: 56 + for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"): 57 + if candidate.exists(): 58 + load_dotenv(candidate) 59 + return 60 + load_dotenv() 61 + 62 + 63 + def normalize_handle(raw: str) -> str: 64 + return raw.strip().lstrip("@") 65 + 66 + 67 + def resolve_handle_http(client: httpx.Client, handle: str) -> str | None: 68 + for base in RESOLVE_PDS: 69 + try: 70 + resp = client.get( 71 + f"{base.rstrip('/')}/xrpc/com.atproto.identity.resolveHandle", 72 + params={"handle": handle}, 73 + 
timeout=20.0, 74 + ) 75 + if resp.status_code == 200: 76 + did = resp.json().get("did") 77 + if isinstance(did, str) and did.startswith("did:"): 78 + return did 79 + except httpx.HTTPError: 80 + continue 81 + return None 82 + 83 + 84 + def resolve_did(client: httpx.Client, conn, handle_or_did: str) -> str: 85 + raw = handle_or_did.strip() 86 + if raw.startswith("did:"): 87 + return raw 88 + handle = normalize_handle(raw) 89 + did = resolve_handle_http(client, handle) 90 + if did: 91 + return did 92 + row = conn.execute( 93 + "select did from tangled_identities where handle = %s limit 1", 94 + (handle,), 95 + ).fetchone() 96 + if row: 97 + return row["did"] 98 + raise SystemExit( 99 + f"ERROR: could not resolve handle {handle!r} " 100 + f"(tried {', '.join(RESOLVE_PDS)} and tangled_identities)" 101 + ) 102 + 103 + 104 + def resolve_identity(client: httpx.Client, did: str) -> tuple[str, str | None]: 105 + """Return (pds_endpoint, handle) from the PLC DID document.""" 106 + resp = client.get(f"https://plc.directory/{did}", timeout=20.0) 107 + resp.raise_for_status() 108 + doc = resp.json() 109 + pds = next( 110 + s["serviceEndpoint"] 111 + for s in doc["service"] 112 + if s.get("id") == "#atproto_pds" 113 + ) 114 + handle = None 115 + for aka in doc.get("alsoKnownAs", []): 116 + if isinstance(aka, str) and aka.startswith("at://"): 117 + handle = aka.removeprefix("at://") 118 + break 119 + return pds.rstrip("/"), handle 120 + 121 + 122 + def list_repos(client: httpx.Client, pds: str, did: str) -> list[dict]: 123 + records: list[dict] = [] 124 + cursor: str | None = None 125 + while True: 126 + params: dict[str, str | int] = { 127 + "repo": did, 128 + "collection": REPO_COLLECTION, 129 + "limit": 100, 130 + } 131 + if cursor: 132 + params["cursor"] = cursor 133 + resp = client.get( 134 + f"{pds}/xrpc/com.atproto.repo.listRecords", 135 + params=params, 136 + timeout=30.0, 137 + ) 138 + resp.raise_for_status() 139 + data = resp.json() 140 + page = data.get("records") or 
[] 141 + records.extend(rec for rec in page if isinstance(rec, dict)) 142 + cursor = data.get("cursor") 143 + if not cursor or not page: 144 + break 145 + return records 146 + 147 + 148 + def fetch_readme( 149 + client: httpx.Client, knot: str, repo_did: str 150 + ) -> tuple[str | None, str | None]: 151 + resp = client.get( 152 + f"https://{knot}/xrpc/sh.tangled.repo.tree", 153 + params={"repo": repo_did, "path": ""}, 154 + timeout=30.0, 155 + ) 156 + if resp.status_code != 200: 157 + return None, None 158 + readme = (resp.json() or {}).get("readme") 159 + if not isinstance(readme, dict): 160 + return None, None 161 + contents = readme.get("contents") 162 + if not isinstance(contents, str) or not contents.strip(): 163 + return None, None 164 + filename = readme.get("filename") 165 + return (filename if isinstance(filename, str) else None), contents 166 + 167 + 168 + def vector_literal(vec: list[float]) -> str: 169 + return "[" + ",".join(repr(x) for x in vec) + "]" 170 + 171 + 172 + def ingest_repos_and_readmes( 173 + conn, 174 + *, 175 + http: httpx.Client, 176 + did: str, 177 + handle: str | None, 178 + pds: str, 179 + api_key: str, 180 + model: str, 181 + force_embed: bool, 182 + ) -> dict[str, int]: 183 + stats = {"repos": 0, "readmes_found": 0, "readmes_embedded": 0, "readmes_missing": 0} 184 + 185 + conn.execute( 186 + """ 187 + insert into tangled_identities (did, handle, pds_host, last_synced_at) 188 + values (%s, %s, %s, now()) 189 + on conflict (did) do update set 190 + handle = coalesce(excluded.handle, tangled_identities.handle), 191 + pds_host = coalesce(excluded.pds_host, tangled_identities.pds_host), 192 + last_synced_at = now() 193 + """, 194 + (did, handle, pds), 195 + ) 196 + 197 + records = list_repos(http, pds, did) 198 + log("repos", f"Found {len(records)} sh.tangled.repo record(s) on PDS") 199 + 200 + ingested: list[dict] = [] 201 + for rec in records: 202 + uri = rec["uri"] 203 + value = rec["value"] 204 + if not isinstance(value, dict): 205 
+ continue 206 + rkey = uri.rsplit("/", 1)[-1] 207 + repo_did = value.get("repoDid") 208 + knot = value.get("knot") 209 + name = value.get("name") or rkey 210 + if not repo_did or not knot: 211 + log("repos", f" SKIP {name}: missing repoDid/knot") 212 + continue 213 + path, content = fetch_readme(http, knot, repo_did) 214 + status = "found" if content else "missing" 215 + if status == "found": 216 + stats["readmes_found"] += 1 217 + else: 218 + stats["readmes_missing"] += 1 219 + log( 220 + "repos", 221 + f" {name:20} readme={status}" 222 + + (f" ({len(content)} chars)" if content else ""), 223 + ) 224 + ingested.append( 225 + { 226 + "uri": uri, 227 + "value": value, 228 + "rkey": rkey, 229 + "repo_did": repo_did, 230 + "knot": knot, 231 + "name": name, 232 + "cid": rec.get("cid"), 233 + "readme_path": path, 234 + "content": content, 235 + "status": status, 236 + } 237 + ) 238 + stats["repos"] += 1 239 + 240 + found_rows = [r for r in ingested if r["status"] == "found"] 241 + if force_embed: 242 + to_embed = found_rows 243 + else: 244 + dids = [r["repo_did"] for r in found_rows] 245 + if dids: 246 + existing = { 247 + row["repo_did"] 248 + for row in conn.execute( 249 + "select repo_did from tangled_readmes " 250 + "where repo_did = any(%s) and embedding is not null", 251 + (dids,), 252 + ).fetchall() 253 + } 254 + else: 255 + existing = set() 256 + to_embed = [r for r in found_rows if r["repo_did"] not in existing] 257 + vectors: dict[str, str] = {} 258 + if to_embed: 259 + vecs = embed_texts( 260 + http, 261 + api_key=api_key, 262 + texts=[truncate(r["content"]) for r in to_embed], 263 + ) 264 + vectors = {r["repo_did"]: vector_literal(v) for r, v in zip(to_embed, vecs, strict=True)} 265 + stats["readmes_embedded"] = len(vectors) 266 + log("embed", f"Embedded {len(vectors)} README(s) ({model}, 1536-d, L2)") 267 + 268 + for r in ingested: 269 + conn.execute( 270 + """ 271 + insert into tangled_repos ( 272 + uri, owner_did, owner_handle, rkey, repo_did, name, 
knot_hostname, 273 + cid, record_raw, discovered_via, last_synced_at 274 + ) 275 + values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, 'ingest_handle', now()) 276 + on conflict (uri) do update set 277 + owner_did = excluded.owner_did, 278 + owner_handle = excluded.owner_handle, 279 + repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did), 280 + name = coalesce(excluded.name, tangled_repos.name), 281 + knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname), 282 + cid = excluded.cid, 283 + record_raw = excluded.record_raw, 284 + last_synced_at = now() 285 + """, 286 + ( 287 + r["uri"], 288 + did, 289 + handle, 290 + r["rkey"], 291 + r["repo_did"], 292 + r["name"], 293 + r["knot"], 294 + r["cid"], 295 + json.dumps(r["value"]), 296 + ), 297 + ) 298 + 299 + vec = vectors.get(r["repo_did"]) 300 + conn.execute( 301 + """ 302 + insert into tangled_readmes ( 303 + repo_did, repo_uri, owner_handle, repo_name, knot_hostname, 304 + readme_path, status, content, size_bytes, fetched_at, 305 + embedding, embedding_model, embedded_at 306 + ) 307 + values (%s, %s, %s, %s, %s, %s, %s, %s, %s, now(), 308 + %s::vector, %s, case when %s::text is null then null else now() end) 309 + on conflict (repo_did) do update set 310 + repo_uri = excluded.repo_uri, 311 + owner_handle = excluded.owner_handle, 312 + repo_name = excluded.repo_name, 313 + knot_hostname = excluded.knot_hostname, 314 + readme_path = excluded.readme_path, 315 + status = excluded.status, 316 + content = excluded.content, 317 + size_bytes = excluded.size_bytes, 318 + fetched_at = now(), 319 + embedding = excluded.embedding, 320 + embedding_model = excluded.embedding_model, 321 + embedded_at = excluded.embedded_at 322 + """, 323 + ( 324 + r["repo_did"], 325 + r["uri"], 326 + handle, 327 + r["name"], 328 + r["knot"], 329 + r["readme_path"], 330 + r["status"], 331 + r["content"], 332 + len(r["content"].encode()) if r["content"] else None, 333 + vec, 334 + model if vec else None, 335 + vec, 336 + 
), 337 + ) 338 + 339 + return stats 340 + 341 + 342 + def ingest_issues( 343 + conn, 344 + *, 345 + did: str, 346 + handle: str | None, 347 + pds: str, 348 + max_pages: int, 349 + ) -> dict[str, int]: 350 + stats = {"issues": 0, "open": 0, "errors": 0} 351 + cache = _PdsCache() 352 + result: UserIssueResult = _fetch_user_issues( 353 + did, handle, pds, cache, max_pages=max_pages 354 + ) 355 + if result.status != "ok": 356 + stats["errors"] = 1 357 + log("issues", f"ERROR fetching issues: {result.error}") 358 + _mark_user_synced( 359 + conn, 360 + user_did=did, 361 + issue_count=0, 362 + status="error", 363 + error_message=result.error, 364 + ) 365 + return stats 366 + 367 + states = _state_map(result.states) 368 + for rec in result.issues: 369 + if not isinstance(rec.get("uri"), str) or not isinstance(rec.get("value"), dict): 370 + continue 371 + rkey = _rkey_from_uri(rec["uri"]) 372 + state = _issue_state(rec["uri"], rkey, states) 373 + upsert_issue( 374 + conn, 375 + record=rec, 376 + author_did=did, 377 + author_handle=handle, 378 + state=state, 379 + ) 380 + stats["issues"] += 1 381 + if state == "open": 382 + stats["open"] += 1 383 + 384 + _mark_user_synced(conn, user_did=did, issue_count=stats["issues"], status="ok") 385 + log("issues", f"Upserted {stats['issues']} issue(s) ({stats['open']} open)") 386 + return stats 387 + 388 + 389 + def embed_user_issues( 390 + conn, 391 + *, 392 + http: httpx.Client, 393 + did: str, 394 + api_key: str, 395 + model: str, 396 + force: bool, 397 + ) -> int: 398 + where = "repo_did in (select repo_did from tangled_repos where owner_did = %s)" 399 + params: list = [did] 400 + if not force: 401 + where += " and embedding is null" 402 + rows = conn.execute( 403 + f""" 404 + select uri, title, body 405 + from tangled_issues 406 + where {where} 407 + and coalesce(nullif(trim(title), ''), nullif(trim(body), '')) is not null 408 + order by fetched_at desc 409 + """, 410 + params, 411 + ).fetchall() 412 + if not rows: 413 + 
log("embed-issues", "No issues to embed for this user") 414 + return 0 415 + 416 + bs = batch_size() 417 + embedded = 0 418 + for start in range(0, len(rows), bs): 419 + batch = rows[start : start + bs] 420 + texts = [ 421 + truncate("\n\n".join(p for p in (r.get("title"), r.get("body")) if p and p.strip())) 422 + for r in batch 423 + ] 424 + vectors = embed_texts(http, api_key=api_key, texts=texts) 425 + for row, vec in zip(batch, vectors, strict=True): 426 + conn.execute( 427 + """ 428 + update tangled_issues 429 + set embedding = %s::vector, 430 + embedding_model = %s, 431 + embedded_at = now() 432 + where uri = %s 433 + """, 434 + (vector_literal(vec), model, row["uri"]), 435 + ) 436 + embedded += len(batch) 437 + log("embed-issues", f"Embedded {embedded} issue(s)") 438 + return embedded 439 + 440 + 441 + def run( 442 + handle_or_did: str, 443 + *, 444 + skip_issues: bool, 445 + force_embed: bool, 446 + max_pages: int, 447 + init_db: bool, 448 + ) -> int: 449 + load_env() 450 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 451 + if not dsn: 452 + print("ERROR: DB_CONNECTION_STRING is not set", file=sys.stderr) 453 + return 1 454 + 455 + api_key = gemini_api_key() 456 + model = embedding_model() 457 + 458 + banner(f"INGEST HANDLE — {handle_or_did}") 459 + if init_db: 460 + log("setup", "Applying migrations…") 461 + init_schema(dsn) 462 + 463 + repo_stats: dict[str, int] = {} 464 + issue_stats: dict[str, int] = {} 465 + issues_embedded = 0 466 + 467 + with httpx.Client(timeout=60.0, follow_redirects=True) as http, connect(dsn) as conn: 468 + did = resolve_did(http, conn, handle_or_did) 469 + pds, handle = resolve_identity(http, did) 470 + log("identity", f"DID={did}") 471 + log("identity", f"handle={handle} pds={pds}") 472 + 473 + repo_stats = ingest_repos_and_readmes( 474 + conn, 475 + http=http, 476 + did=did, 477 + handle=handle, 478 + pds=pds, 479 + api_key=api_key, 480 + model=model, 481 + force_embed=force_embed, 482 + ) 483 + 484 + if not 
skip_issues: 485 + issue_stats = ingest_issues( 486 + conn, did=did, handle=handle, pds=pds, max_pages=max_pages 487 + ) 488 + issues_embedded = embed_user_issues( 489 + conn, 490 + http=http, 491 + did=did, 492 + api_key=api_key, 493 + model=model, 494 + force=force_embed, 495 + ) 496 + 497 + conn.commit() 498 + 499 + summary_block( 500 + f"Ingest complete — {handle or did}", 501 + [ 502 + f"DID: {did}", 503 + f"Handle: {handle or '(unknown)'}", 504 + f"Repos: {repo_stats.get('repos', 0)}", 505 + f"READMEs found: {repo_stats.get('readmes_found', 0)}", 506 + f"READMEs embedded: {repo_stats.get('readmes_embedded', 0)}", 507 + f"READMEs missing: {repo_stats.get('readmes_missing', 0)}", 508 + f"Issues upserted: {issue_stats.get('issues', 0)}", 509 + f"Open issues: {issue_stats.get('open', 0)}", 510 + f"Issues embedded: {issues_embedded}", 511 + "", 512 + "Test recommendations:", 513 + f" curl 'http://localhost:8000/recommendations?handle={did}'", 514 + ], 515 + ) 516 + return 0 517 + 518 + 519 + def main(argv: list[str] | None = None) -> int: 520 + parser = argparse.ArgumentParser( 521 + description="Ingest one Tangled user by handle: repos, README embeddings, issues." 522 + ) 523 + parser.add_argument("handle", help="Handle (e.g. 
arsenii.tngl.sh) or did:plc:…") 524 + parser.add_argument( 525 + "--skip-issues", 526 + action="store_true", 527 + help="Only ingest repos + README embeddings", 528 + ) 529 + parser.add_argument( 530 + "--force-embed", 531 + action="store_true", 532 + help="Re-embed READMEs and issues even if vectors already exist", 533 + ) 534 + parser.add_argument( 535 + "--max-pages", 536 + type=int, 537 + default=int(os.getenv("TANGLED_ISSUE_MAX_PAGES", "50")), 538 + help="Max listRecords pages per issue collection (default: 50)", 539 + ) 540 + parser.add_argument( 541 + "--init-db", 542 + action="store_true", 543 + help="Run supabase migrations before ingest", 544 + ) 545 + args = parser.parse_args(argv) 546 + return run( 547 + args.handle, 548 + skip_issues=args.skip_issues, 549 + force_embed=args.force_embed, 550 + max_pages=max(1, args.max_pages), 551 + init_db=args.init_db, 552 + ) 553 + 554 + 555 + if __name__ == "__main__": 556 + raise SystemExit(main())
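The `vector_literal` helper in this file serializes an embedding as a pgvector text literal so the SQL can cast it with `%s::vector` and no driver-side adapter is needed. A minimal sketch of that formatting follows; `parse_vector_literal` is a hypothetical inverse that is not part of the repo and is included only to show that `repr` round-trips the floats.

```python
from __future__ import annotations


def vector_literal(vec: list[float]) -> str:
    # Same formatting as the helper in the ingest script: '[v1,v2,...]'.
    return "[" + ",".join(repr(x) for x in vec) + "]"


def parse_vector_literal(text: str) -> list[float]:
    # Hypothetical inverse, for illustration only (not in the repo).
    inner = text.strip()[1:-1]
    return [float(part) for part in inner.split(",")] if inner else []


sample = [0.1, -2.5, 3.0]
literal = vector_literal(sample)
print(literal)  # [0.1,-2.5,3.0]
assert parse_vector_literal(literal) == sample
```

Using `repr` rather than a fixed-precision format keeps the stored value equal to the float the embedding API returned.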
+10
scraper/parallel.py
from __future__ import annotations

import os


def concurrency_env(name: str, default: int = 20, *, max_cap: int = 64) -> int:
    raw = os.getenv(name, "").strip()
    if not raw:
        return default
    return max(1, min(max_cap, int(raw)))
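`concurrency_env` clamps the configured value to the range 1 to `max_cap`, but `int(raw)` raises `ValueError` when the variable holds something that is not an integer. If falling back to the default is preferred over failing, a lenient variant might look like the sketch below. The name `concurrency_env_lenient` and the `DEMO_WORKERS` variable are assumptions for illustration, not part of the repo.

```python
from __future__ import annotations

import os


def concurrency_env_lenient(name: str, default: int = 20, *, max_cap: int = 64) -> int:
    # Hypothetical variant of concurrency_env: return the default on
    # unparsable input instead of raising ValueError.
    raw = os.getenv(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    # Same clamp as the original: at least 1, at most max_cap.
    return max(1, min(max_cap, value))


os.environ["DEMO_WORKERS"] = "500"
print(concurrency_env_lenient("DEMO_WORKERS"))  # 64
os.environ["DEMO_WORKERS"] = "many"
print(concurrency_env_lenient("DEMO_WORKERS"))  # 20
```

Whether a silent fallback is better than a loud failure depends on how the scraper is deployed; the original behavior surfaces configuration typos immediately.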
+133
scraper/pds_client.py
from __future__ import annotations

import hashlib
import json
from typing import Any

import httpx

DEFAULT_PDS = "https://tngl.sh"
DEFAULT_TIMEOUT = 30.0
PAGE_SIZE = 1000


def _base(pds_host: str) -> str:
    return pds_host.rstrip("/")


def params_hash(params: dict[str, Any]) -> str:
    normalized = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode()).hexdigest()


def knot_xrpc(
    client: httpx.Client,
    knot_hostname: str,
    method: str,
    params: dict[str, Any],
) -> tuple[int, Any]:
    resp = client.get(f"https://{knot_hostname}/xrpc/{method}", params=params)
    if resp.status_code != 200:
        return resp.status_code, {"error": resp.status_code, "body": resp.text[:500]}
    try:
        return resp.status_code, resp.json()
    except ValueError:
        return resp.status_code, {"raw": resp.text[:500]}


def describe_pds(client: httpx.Client, pds_host: str) -> dict[str, Any]:
    resp = client.get(f"{_base(pds_host)}/xrpc/com.atproto.server.describeServer")
    resp.raise_for_status()
    return resp.json()


def sync_list_repos(
    client: httpx.Client,
    pds_host: str,
    *,
    cursor: str | None = None,
    limit: int = PAGE_SIZE,
) -> dict[str, Any]:
    params: dict[str, str | int] = {"limit": limit}
    if cursor:
        params["cursor"] = cursor
    resp = client.get(
        f"{_base(pds_host)}/xrpc/com.atproto.sync.listRepos",
        params=params,
    )
    resp.raise_for_status()
    return resp.json()


def list_records(
    client: httpx.Client,
    pds_host: str,
    did: str,
    collection: str,
    *,
    cursor: str | None = None,
    limit: int = 100,
) -> dict[str, Any]:
    params: dict[str, str | int] = {
        "repo": did,
        "collection": collection,
        "limit": limit,
    }
    if cursor:
        params["cursor"] = cursor
    resp = client.get(
        f"{_base(pds_host)}/xrpc/com.atproto.repo.listRecords",
        params=params,
    )
    resp.raise_for_status()
    return resp.json()


def list_repo_records(
    client: httpx.Client,
    pds_host: str,
    did: str,
    *,
    cursor: str | None = None,
    limit: int = 100,
) -> dict[str, Any]:
    return list_records(client, pds_host, did, "sh.tangled.repo", cursor=cursor, limit=limit)


def describe_repo_on_knot(
    client: httpx.Client,
    knot_hostname: str,
    repo_did: str,
) -> dict[str, Any] | None:
    resp = client.get(
        f"https://{knot_hostname}/xrpc/sh.tangled.repo.describeRepo",
        params={"repoDid": repo_did},
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()


def pds_host_for_did(client: httpx.Client, did: str) -> str | None:
    resp = client.get(f"https://plc.directory/{did}")
    if resp.status_code != 200:
        return None
    doc = resp.json()
    for svc in doc.get("service", []):
        if svc.get("type") == "AtprotoPersonalDataServer":
            endpoint = svc.get("serviceEndpoint")
            if isinstance(endpoint, str):
                return endpoint.rstrip("/")
    return None


def handle_from_plc(client: httpx.Client, did: str) -> str | None:
    resp = client.get(f"https://plc.directory/{did}")
    if resp.status_code != 200:
        return None
    doc = resp.json()
    for alias in doc.get("alsoKnownAs", []):
        if alias.startswith("at://"):
            return alias.removeprefix("at://")
    return None
+42
scraper/progress.py
from __future__ import annotations

import sys
from datetime import datetime, timezone


def log(stage: str, message: str) -> None:
    ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
    print(f"[{ts}] [{stage}] {message}", flush=True)


def step(stage: str, current: int, total: int, detail: str) -> None:
    ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
    print(f"[{ts}] [{stage}] ({current}/{total}) {detail}", flush=True)


def banner(title: str) -> None:
    line = "=" * 60
    print(line, flush=True)
    print(title, flush=True)
    print(line, flush=True)


def die(message: str, code: int = 1) -> None:
    print(f"ERROR: {message}", file=sys.stderr, flush=True)
    raise SystemExit(code)


def phase(number: int, title: str) -> None:
    print(flush=True)
    print(f"── Phase {number}: {title} " + "─" * max(0, 44 - len(title)), flush=True)


def metric(label: str, value: str | int) -> None:
    log("·", f"{label}: {value}")


def summary_block(title: str, lines: list[str]) -> None:
    print(flush=True)
    log("summary", title)
    for line in lines:
        print(f" {line}", flush=True)
+4
scraper/requirements.txt
httpx>=0.28,<1.0
pgvector>=0.4,<1.0
psycopg[binary]>=3.2,<4.0
python-dotenv>=1.0,<2.0
+202
scraper/scrape.py
··· 1 + #!/usr/bin/env python3 2 + from __future__ import annotations 3 + 4 + import argparse 5 + import os 6 + import sys 7 + from pathlib import Path 8 + 9 + from dotenv import load_dotenv 10 + 11 + REPO_ROOT = Path(__file__).resolve().parent.parent 12 + if str(REPO_ROOT) not in sys.path: 13 + sys.path.insert(0, str(REPO_ROOT)) 14 + 15 + from daily_issue_scraper.pipeline import run_daily_sync 16 + from db import ( 17 + connect, 18 + count_accounts_with_repos, 19 + count_knots, 20 + count_lexicons, 21 + count_pds_accounts, 22 + count_repos, 23 + init_schema, 24 + table_counts, 25 + ) 26 + from progress import banner, die, log 27 + from stage0_lexicons import run_stage0 28 + from stage1_knots import run_stage1 29 + from stage2_network import run_stage2_network 30 + from stage2_pds import run_stage2, run_stage2_accounts_only, run_stage2_repos_only 31 + from check_readmes import run_check_readmes 32 + from embed_readmes import run_embed_readmes 33 + from fetch_collaborators import run_fetch_collaborators 34 + from fetch_issues import run_fetch_issues 35 + from embed_issues import run_embed_issues 36 + from backfill_repos_from_issues import run_backfill_repos_from_issues 37 + from stage4_repo_metadata import run_stage4 38 + 39 + 40 + def load_env() -> None: 41 + for candidate in (REPO_ROOT / ".env", Path(__file__).parent / ".env"): 42 + if candidate.exists(): 43 + load_dotenv(candidate) 44 + log("setup", f"Loaded env from {candidate}") 45 + return 46 + load_dotenv() 47 + 48 + 49 + def require_dsn() -> str: 50 + dsn = os.getenv("DB_CONNECTION_STRING", "").strip() 51 + if not dsn: 52 + die( 53 + "DB_CONNECTION_STRING is not set.\n" 54 + "Add it to the repo-root .env file, e.g.:\n" 55 + " DB_CONNECTION_STRING=postgresql://user:pass@host:5432/postgres" 56 + ) 57 + return dsn 58 + 59 + 60 + def cmd_init(dsn: str) -> None: 61 + banner("INIT — Create scraper tables") 62 + init_schema(dsn) 63 + log("init", "Schema ready (stages 0–6 migrations applied).") 64 + 65 + 66 + def 
cmd_status(dsn: str) -> None: 67 + banner("STATUS") 68 + with connect(dsn) as conn: 69 + lex = count_lexicons(conn) 70 + knots = count_knots(conn) 71 + accounts = count_pds_accounts(conn) 72 + accounts_with_repos = count_accounts_with_repos(conn) 73 + repos = count_repos(conn) 74 + reachable = conn.execute( 75 + "select count(*) as n from tangled_knots where reachable = true" 76 + ).fetchone() 77 + counts = table_counts(conn) 78 + states = conn.execute( 79 + "select key, status, meta, updated_at from tangled_crawl_state order by key" 80 + ).fetchall() 81 + 82 + log("status", "── Stages 0–2 (implemented) ──") 83 + log("status", f" tangled_lexicons: {lex}") 84 + log("status", f" tangled_knots: {knots} ({reachable['n'] if reachable else 0} reachable)") 85 + log("status", f" tangled_pds_accounts: {accounts} ({accounts_with_repos} with repos)") 86 + log("status", f" tangled_repos: {repos}") 87 + 88 + log("status", "── Stages 3–6 (schema ready, scrapers pending) ──") 89 + log("status", f" tangled_identities: {counts['tangled_identities']}") 90 + log("status", f" tangled_atproto_records: {counts['tangled_atproto_records']}") 91 + log("status", f" tangled_backlinks: {counts['tangled_backlinks']}") 92 + log("status", f" tangled_xrpc_snapshots: {counts['tangled_xrpc_snapshots']}") 93 + log("status", f" tangled_git_archives: {counts['tangled_git_archives']}") 94 + log("status", f" tangled_git_blobs: {counts['tangled_git_blobs']}") 95 + log("status", f" tangled_readmes: {counts['tangled_readmes']}") 96 + log("status", f" tangled_issues: {counts['tangled_issues']}") 97 + log("status", f" tangled_repo_collaborators: {counts['tangled_repo_collaborators']}") 98 + 99 + if states: 100 + log("status", "Crawl state:") 101 + for row in states: 102 + meta = row.get("meta") or {} 103 + extra = "" 104 + if isinstance(meta, dict) and "account_count" in meta: 105 + extra = f" accounts={meta['account_count']}" 106 + log("status", f" {row['key']}: {row['status']}{extra} @ 
{row['updated_at']}") 107 + else: 108 + log("status", "No crawl runs recorded yet.") 109 + 110 + 111 + def main(argv: list[str] | None = None) -> None: 112 + parser = argparse.ArgumentParser( 113 + description="Scrape Tangled into Postgres.", 114 + ) 115 + parser.add_argument( 116 + "command", 117 + choices=[ 118 + "init", 119 + "stage0", 120 + "stage1", 121 + "stage0-1", 122 + "stage2", 123 + "stage2-accounts", 124 + "stage2-repos", 125 + "stage2-network", 126 + "stage4", 127 + "check-readmes", 128 + "embed-readmes", 129 + "fetch-collaborators", 130 + "fetch-issues", 131 + "embed-issues", 132 + "backfill-repos-from-issues", 133 + "sync-daily", 134 + "status", 135 + ], 136 + help=( 137 + "init=tables | stage0=lexicons | stage1=knots | stage2=full PDS crawl | " 138 + "stage2-accounts=count/list accounts | stage2-repos=scan repo records | " 139 + "stage2-network=all repos (Bluesky+tngl via appview) | " 140 + "stage4=deeper repo metadata (branches, tags, collaborators) | " 141 + "check-readmes=fetch README from knot git for each repo | " 142 + "embed-readmes=Gemini embeddings into tangled_readmes.embedding | " 143 + "fetch-collaborators=listCollaborators for all repos | " 144 + "fetch-issues=scrape issues from user PDSes | " 145 + "embed-issues=Gemini embeddings into tangled_issues.embedding | " 146 + "backfill-repos-from-issues=fetch repos referenced by issues but missing from tangled_repos | " 147 + "sync-daily=run full daily sync pipeline" 148 + ), 149 + ) 150 + args = parser.parse_args(argv) 151 + 152 + load_env() 153 + dsn = require_dsn() 154 + 155 + if args.command == "init": 156 + cmd_init(dsn) 157 + return 158 + 159 + init_schema(dsn) 160 + 161 + if args.command == "stage0": 162 + run_stage0(dsn) 163 + elif args.command == "stage1": 164 + run_stage1(dsn) 165 + elif args.command == "stage0-1": 166 + run_stage0(dsn) 167 + print() 168 + run_stage1(dsn) 169 + elif args.command == "stage2": 170 + run_stage2(dsn) 171 + elif args.command == "stage2-accounts": 172 + 
run_stage2_accounts_only(dsn) 173 + elif args.command == "stage2-repos": 174 + run_stage2_repos_only(dsn) 175 + elif args.command == "stage2-network": 176 + run_stage2_network(dsn) 177 + elif args.command == "stage4": 178 + run_stage4(dsn) 179 + elif args.command == "check-readmes": 180 + run_check_readmes(dsn) 181 + elif args.command == "embed-readmes": 182 + run_embed_readmes(dsn) 183 + elif args.command == "fetch-collaborators": 184 + run_fetch_collaborators(dsn) 185 + elif args.command == "fetch-issues": 186 + run_fetch_issues(dsn) 187 + elif args.command == "embed-issues": 188 + run_embed_issues(dsn) 189 + elif args.command == "backfill-repos-from-issues": 190 + run_backfill_repos_from_issues(dsn) 191 + elif args.command == "sync-daily": 192 + run_daily_sync(dsn) 193 + elif args.command == "status": 194 + cmd_status(dsn) 195 + 196 + 197 + if __name__ == "__main__": 198 + try: 199 + main() 200 + except KeyboardInterrupt: 201 + print("\nInterrupted.", file=sys.stderr) 202 + raise SystemExit(130) from None
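The `if`/`elif` chain in `main` maps each command name to one runner, with `stage0-1` running two in sequence. The same mapping could also be written as a lookup table. The sketch below uses stand-in handlers; `fake_stage0`, `fake_stage1`, `COMMANDS`, and `dispatch` are hypothetical names, not part of the repo.

```python
from __future__ import annotations

from typing import Callable

calls: list[str] = []


def fake_stage0(dsn: str) -> None:
    # Stand-in for run_stage0; records the call instead of touching a database.
    calls.append(f"stage0:{dsn}")


def fake_stage1(dsn: str) -> None:
    # Stand-in for run_stage1.
    calls.append(f"stage1:{dsn}")


# Each command maps to the handlers it runs, in order.
COMMANDS: dict[str, tuple[Callable[[str], None], ...]] = {
    "stage0": (fake_stage0,),
    "stage1": (fake_stage1,),
    "stage0-1": (fake_stage0, fake_stage1),
}


def dispatch(command: str, dsn: str) -> None:
    for handler in COMMANDS[command]:
        handler(dsn)


dispatch("stage0-1", "postgresql://example")
print(calls)  # ['stage0:postgresql://example', 'stage1:postgresql://example']
```

With a table like this, the `choices=` list for `argparse` could be derived from the dictionary keys, so a new command would only need to be registered in one place.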
+213
scraper/seed_user.py
··· 1 + #!/usr/bin/env python3 2 + """Targeted ingest of a SINGLE Tangled user's repos + READMEs + embeddings. 3 + 4 + Same data path as the normal crawl (stage2 repo records -> README from knot -> 5 + Gemini embedding), but scoped to one DID so we can onboard a specific user for 6 + testing without triggering a network-wide scan or a mass re-embed. 7 + 8 + Reuses the canonical scraper modules (db.connect, embeddings.embed_texts) so the 9 + rows and vectors are identical to what the daily scraper would produce. Repos are 10 + tagged discovered_via='manual_seed' for traceability. 11 + 12 + Usage (from the scraper/ dir, with DB_CONNECTION_STRING + GEMINI_API_KEY in env): 13 + python seed_user.py <handle-or-did> 14 + """ 15 + 16 + from __future__ import annotations 17 + 18 + import json 19 + import sys 20 + 21 + import httpx 22 + 23 + from db import connect, register_pgvector 24 + from embeddings import embed_texts, embedding_model, gemini_api_key, truncate 25 + 26 + REPO_COLLECTION = "sh.tangled.repo" 27 + 28 + 29 + def resolve_did(client: httpx.Client, handle_or_did: str) -> str: 30 + if handle_or_did.startswith("did:"): 31 + return handle_or_did 32 + r = client.get( 33 + "https://public.api.bsky.app/xrpc/com.atproto.identity.resolveHandle", 34 + params={"handle": handle_or_did}, 35 + ) 36 + r.raise_for_status() 37 + return r.json()["did"] 38 + 39 + 40 + def resolve_identity(client: httpx.Client, did: str) -> tuple[str, str | None]: 41 + """Return (pds_endpoint, handle) from the PLC directory DID document.""" 42 + doc = client.get(f"https://plc.directory/{did}").json() 43 + pds = next( 44 + s["serviceEndpoint"] for s in doc["service"] if s["id"] == "#atproto_pds" 45 + ) 46 + handle = None 47 + for aka in doc.get("alsoKnownAs", []): 48 + if aka.startswith("at://"): 49 + handle = aka[len("at://"):] 50 + break 51 + return pds, handle 52 + 53 + 54 + def list_repos(client: httpx.Client, pds: str, did: str) -> list[dict]: 55 + r = client.get( 56 + 
f"{pds}/xrpc/com.atproto.repo.listRecords", 57 + params={"repo": did, "collection": REPO_COLLECTION, "limit": 100}, 58 + ) 59 + r.raise_for_status() 60 + return r.json().get("records", []) 61 + 62 + 63 + def fetch_readme(client: httpx.Client, knot: str, repo_did: str) -> tuple[str | None, str | None]: 64 + """Documented contract: tree with `ref` omitted -> top-level `readme.contents` 65 + (the knot resolves any extension). Returns (filename, contents).""" 66 + r = client.get( 67 + f"https://{knot}/xrpc/sh.tangled.repo.tree", 68 + params={"repo": repo_did, "path": ""}, 69 + ) 70 + if r.status_code != 200: 71 + return None, None 72 + readme = (r.json() or {}).get("readme") 73 + if not isinstance(readme, dict): 74 + return None, None 75 + contents = readme.get("contents") 76 + if not isinstance(contents, str) or not contents.strip(): 77 + return None, None 78 + return readme.get("filename"), contents 79 + 80 + 81 + def main(handle_or_did: str) -> int: 82 + api_key = gemini_api_key() 83 + model = embedding_model() 84 + import os 85 + 86 + dsn = os.environ["DB_CONNECTION_STRING"] 87 + 88 + with httpx.Client(timeout=60.0, follow_redirects=True) as http: 89 + did = resolve_did(http, handle_or_did) 90 + pds, handle = resolve_identity(http, did) 91 + print(f"DID={did} handle={handle} pds={pds}") 92 + 93 + records = list_repos(http, pds, did) 94 + print(f"repo records: {len(records)}") 95 + 96 + # Pull each README from its knot up front. 
97 + ingested = [] # (record, name, knot, readme_path, content) 98 + for rec in records: 99 + uri = rec["uri"] 100 + value = rec["value"] 101 + rkey = uri.rsplit("/", 1)[-1] 102 + repo_did = value.get("repoDid") 103 + knot = value.get("knot") 104 + name = value.get("name") or rkey 105 + if not repo_did or not knot: 106 + print(f" SKIP {rkey}: missing repoDid/knot") 107 + continue 108 + path, content = fetch_readme(http, knot, repo_did) 109 + status = "found" if content else "missing" 110 + print(f" {name:18} repoDid={repo_did[:20]}… readme={status}" 111 + + (f" ({len(content)} chars)" if content else "")) 112 + ingested.append({ 113 + "uri": uri, "value": value, "rkey": rkey, "repo_did": repo_did, 114 + "knot": knot, "name": name, "cid": rec.get("cid"), 115 + "readme_path": path, "content": content, "status": status, 116 + }) 117 + 118 + # Embed the found READMEs in one batch (raw content — matches embed_readmes.py). 119 + found = [r for r in ingested if r["status"] == "found"] 120 + vectors: dict[str, list[float]] = {} 121 + if found: 122 + with httpx.Client() as http: 123 + vecs = embed_texts( 124 + http, api_key=api_key, 125 + texts=[truncate(r["content"]) for r in found], 126 + ) 127 + # Store as a pgvector text literal ('[v1,v2,...]'::vector) — same as the 128 + # rec engine, so we don't depend on the optional pgvector psycopg adapter. 
129 + vectors = { 130 + r["repo_did"]: "[" + ",".join(repr(x) for x in v) + "]" 131 + for r, v in zip(found, vecs, strict=True) 132 + } 133 + print(f"embedded {len(vectors)} READMEs ({model}, 1536-d, L2)") 134 + 135 + with connect(dsn) as conn: 136 + register_pgvector(conn) 137 + conn.execute( 138 + """ 139 + insert into tangled_identities (did, handle, pds_host, last_synced_at) 140 + values (%s, %s, %s, now()) 141 + on conflict (did) do update set 142 + handle = coalesce(excluded.handle, tangled_identities.handle), 143 + pds_host = coalesce(excluded.pds_host, tangled_identities.pds_host), 144 + last_synced_at = now() 145 + """, 146 + (did, handle, pds), 147 + ) 148 + 149 + for r in ingested: 150 + conn.execute( 151 + """ 152 + insert into tangled_repos ( 153 + uri, owner_did, owner_handle, rkey, repo_did, name, knot_hostname, 154 + cid, record_raw, discovered_via, last_synced_at 155 + ) 156 + values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, 'manual_seed', now()) 157 + on conflict (uri) do update set 158 + owner_did = excluded.owner_did, 159 + owner_handle = excluded.owner_handle, 160 + repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did), 161 + name = coalesce(excluded.name, tangled_repos.name), 162 + knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname), 163 + cid = excluded.cid, 164 + record_raw = excluded.record_raw, 165 + last_synced_at = now() 166 + """, 167 + (r["uri"], did, handle, r["rkey"], r["repo_did"], r["name"], 168 + r["knot"], r["cid"], json.dumps(r["value"])), 169 + ) 170 + 171 + conn.execute( 172 + """ 173 + insert into tangled_readmes ( 174 + repo_did, repo_uri, owner_handle, repo_name, knot_hostname, 175 + readme_path, status, content, size_bytes, fetched_at, 176 + embedding, embedding_model, embedded_at 177 + ) 178 + values (%s, %s, %s, %s, %s, %s, %s, %s, %s, now(), 179 + %s::vector, %s, case when %s::text is null then null else now() end) 180 + on conflict (repo_did) do update set 181 + repo_uri = 
excluded.repo_uri, 182 + owner_handle = excluded.owner_handle, 183 + repo_name = excluded.repo_name, 184 + knot_hostname = excluded.knot_hostname, 185 + readme_path = excluded.readme_path, 186 + status = excluded.status, 187 + content = excluded.content, 188 + size_bytes = excluded.size_bytes, 189 + fetched_at = now(), 190 + embedding = excluded.embedding, 191 + embedding_model = excluded.embedding_model, 192 + embedded_at = excluded.embedded_at 193 + """, 194 + ( 195 + r["repo_did"], r["uri"], handle, r["name"], r["knot"], 196 + r["readme_path"], r["status"], r["content"], 197 + len(r["content"].encode()) if r["content"] else None, 198 + vectors.get(r["repo_did"]), 199 + model if r["repo_did"] in vectors else None, 200 + vectors.get(r["repo_did"]), 201 + ), 202 + ) 203 + conn.commit() 204 + 205 + print(f"done: {len(ingested)} repos, {len(vectors)} embedded -> tangled_repos + tangled_readmes") 206 + return 0 207 + 208 + 209 + if __name__ == "__main__": 210 + if len(sys.argv) != 2: 211 + print("usage: python seed_user.py <handle-or-did>", file=sys.stderr) 212 + raise SystemExit(2) 213 + raise SystemExit(main(sys.argv[1]))
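`resolve_identity` in this script reads the PDS endpoint with `next(...)` and no default, so a DID document without an `#atproto_pds` service would raise `StopIteration` (and a missing `service` key would raise `KeyError`). A more tolerant parse might look like the sketch below. `parse_did_document` and the sample document values are assumptions for illustration only.

```python
from __future__ import annotations

from typing import Any


def parse_did_document(doc: dict[str, Any]) -> tuple[str | None, str | None]:
    # Hypothetical helper: returns (pds_endpoint, handle), using None for
    # anything the document does not provide instead of raising.
    pds = None
    for svc in doc.get("service") or []:
        if isinstance(svc, dict) and svc.get("id") == "#atproto_pds":
            endpoint = svc.get("serviceEndpoint")
            if isinstance(endpoint, str):
                pds = endpoint.rstrip("/")
                break
    handle = None
    for aka in doc.get("alsoKnownAs") or []:
        if isinstance(aka, str) and aka.startswith("at://"):
            handle = aka[len("at://"):]
            break
    return pds, handle


# Illustrative document; the identifiers are made up.
sample_doc = {
    "id": "did:plc:example",
    "alsoKnownAs": ["at://alice.example.com"],
    "service": [
        {
            "id": "#atproto_pds",
            "type": "AtprotoPersonalDataServer",
            "serviceEndpoint": "https://pds.example.com/",
        }
    ],
}
print(parse_did_document(sample_doc))  # ('https://pds.example.com', 'alice.example.com')
```

A caller would still need to decide what to do when the endpoint is `None`, since the later `listRecords` request cannot proceed without a PDS.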
+119
scraper/stage0_lexicons.py
from __future__ import annotations

import json
import shutil
import subprocess
from pathlib import Path
from typing import Any

from db import connect, set_crawl_state, upsert_lexicon
from progress import banner, log, step

REPO_ROOT = Path(__file__).resolve().parent.parent
DEFAULT_CORE_GIT = "https://tangled.org/tangled.org/core.git"
DEFAULT_CORE_DIR = REPO_ROOT / ".cache" / "tangled-core"
LEXICONS_DIR = "lexicons"


def _lexicon_type(definition: dict[str, Any]) -> str:
    main = definition.get("defs", {}).get("main", {})
    lex_type = main.get("type")
    if lex_type:
        return str(lex_type)
    return "unknown"


def ensure_core_repo(core_dir: Path, git_url: str) -> Path:
    lexicons = core_dir / LEXICONS_DIR
    if lexicons.is_dir() and any(lexicons.rglob("*.json")):
        log("stage 0", f"Using existing lexicons at {lexicons}")
        return lexicons

    log("stage 0", "Cloning Tangled core lexicons (first run only)...")
    log("stage 0", f" git clone --depth 1 {git_url}")
    core_dir.parent.mkdir(parents=True, exist_ok=True)
    if core_dir.exists():
        log("stage 0", f" Removing incomplete clone at {core_dir}")
        # shutil.rmtree avoids shelling out to `rm -rf`, which is not portable.
        shutil.rmtree(core_dir)

    subprocess.run(
        ["git", "clone", "--depth", "1", git_url, str(core_dir)],
        check=True,
    )
    if not lexicons.is_dir():
        raise RuntimeError(f"Expected lexicons directory at {lexicons} after clone")
    log("stage 0", "Clone complete.")
    return lexicons


def collect_lexicon_files(lexicons_dir: Path) -> list[Path]:
    return sorted(lexicons_dir.rglob("*.json"))


def run_stage0(
    dsn: str,
    *,
    core_dir: Path = DEFAULT_CORE_DIR,
    git_url: str = DEFAULT_CORE_GIT,
) -> dict[str, int]:
    banner("STAGE 0 — Load Tangled lexicons (schemas)")
    log("stage 0", "Lexicons are JSON schema specs — not live API endpoints.")
    log("stage 0", "This stage stores every sh.tangled.* definition for later validation.")

    lexicons_dir = ensure_core_repo(core_dir, git_url)
    files = collect_lexicon_files(lexicons_dir)
    if not files:
        raise RuntimeError(f"No lexicon JSON files found under {lexicons_dir}")

    log("stage 0", f"Found {len(files)} lexicon files to import.")

    stats = {"records": 0, "queries": 0, "procedures": 0, "tokens": 0, "other": 0}

    with connect(dsn) as conn:
        set_crawl_state(conn, key="stage0:lexicons", status="running")

        for i, path in enumerate(files, start=1):
            rel = path.relative_to(core_dir).as_posix()
            try:
                # Read as UTF-8 explicitly so the result does not depend on the locale.
                definition = json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                step("stage 0", i, len(files), f"SKIP {rel} — invalid JSON: {exc}")
                continue

            nsid = definition.get("id")
            if not nsid:
                step("stage 0", i, len(files), f"SKIP {rel} — missing id field")
                continue

            lex_type = _lexicon_type(definition)
            upsert_lexicon(
                conn,
                nsid=nsid,
                lexicon_type=lex_type,
                definition=definition,
                source_path=rel,
            )

            bucket = {
                "record": "records",
                "query": "queries",
                "procedure": "procedures",
                "token": "tokens",
            }.get(lex_type, "other")
            stats[bucket] += 1
            step("stage 0", i, len(files), f"{nsid} ({lex_type})")

        set_crawl_state(
            conn,
            key="stage0:lexicons",
            status="complete",
            meta={"file_count": len(files), **stats},
        )
        conn.commit()

    log("stage 0", "Done.")
    log(
        "stage 0",
        f" records={stats['records']} queries={stats['queries']} "
        f"procedures={stats['procedures']} tokens={stats['tokens']} other={stats['other']}",
    )
    return stats
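The type bucketing that `run_stage0` performs can be tried without a database or a clone of the core repo. The sketch below is a standalone illustration, not part of the scraper: `classify_lexicon`, `tally`, and `SAMPLES` are hypothetical names, and the sample documents are made up to show the expected shape of a lexicon file (an `id` plus `defs.main.type`).

```python
from collections import Counter
from typing import Any

# Maps a lexicon's main type to the counter key used in the stage 0 stats.
BUCKETS = {
    "record": "records",
    "query": "queries",
    "procedure": "procedures",
    "token": "tokens",
}


def classify_lexicon(definition: dict[str, Any]) -> str:
    """Return defs.main.type as a string, or "unknown" when it is absent."""
    main = definition.get("defs", {}).get("main", {})
    lex_type = main.get("type")
    return str(lex_type) if lex_type else "unknown"


def tally(definitions: list[dict[str, Any]]) -> Counter:
    """Count documents per bucket, skipping any that lack an "id" field."""
    counts: Counter = Counter()
    for definition in definitions:
        if not definition.get("id"):
            continue
        counts[BUCKETS.get(classify_lexicon(definition), "other")] += 1
    return counts


# Made-up sample documents, for illustration only.
SAMPLES = [
    {"id": "sh.tangled.repo", "defs": {"main": {"type": "record"}}},
    {"id": "sh.tangled.knot.version", "defs": {"main": {"type": "query"}}},
    {"id": "example.shared.defs", "defs": {"other": {"type": "object"}}},
    {"defs": {"main": {"type": "record"}}},  # no id, so it is skipped
]
```

A document with no `defs.main` block still gets stored under the type `unknown` and counted as `other`; only a missing `id` causes a skip.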
+159
scraper/stage1_knots.py
from __future__ import annotations

import os
from typing import Any
from urllib.parse import urlparse

import httpx

from db import connect, set_crawl_state, upsert_knot
from progress import banner, log, step

KNOT_VERSION_METHOD = "sh.tangled.knot.version"
KNOT_OWNER_METHOD = "sh.tangled.owner"
DEFAULT_SEEDS = ["knot1.tangled.sh"]
PROBE_TIMEOUT = 15.0


def _normalize_hostname(value: str) -> str:
    value = value.strip()
    if value.startswith("http://") or value.startswith("https://"):
        value = urlparse(value).netloc or value
    return value.rstrip("/")


def knot_seeds() -> list[str]:
    raw = os.getenv("TANGLED_KNOT_SEEDS", "")
    if raw.strip():
        return [_normalize_hostname(part) for part in raw.split(",") if part.strip()]

    seeds = list(DEFAULT_SEEDS)

    # Optional auto-discovery: probe knot1..knotN (off by default).
    max_auto = int(os.getenv("TANGLED_KNOT_PROBE_MAX", "0"))
    for n in range(2, max_auto + 1):
        seeds.append(f"knot{n}.tangled.sh")

    extra = os.getenv("TANGLED_KNOT_EXTRA", "")
    for part in extra.split(","):
        host = _normalize_hostname(part)
        if host and host not in seeds:
            seeds.append(host)

    return seeds


def _xrpc_url(hostname: str, method: str) -> str:
    return f"https://{hostname}/xrpc/{method}"


def probe_knot(client: httpx.Client, hostname: str) -> dict[str, Any]:
    result: dict[str, Any] = {
        "hostname": hostname,
        "reachable": False,
        "owner_did": None,
        "version": None,
        "capabilities": None,
        "version_raw": None,
        "owner_raw": None,
        "probe_error": None,
    }

    try:
        version_resp = client.get(_xrpc_url(hostname, KNOT_VERSION_METHOD))
        if version_resp.status_code != 200:
            result["probe_error"] = f"{KNOT_VERSION_METHOD} HTTP {version_resp.status_code}"
            return result

        version_raw = version_resp.json()
        result["version_raw"] = version_raw
        result["version"] = version_raw.get("version")
        caps = version_raw.get("capabilities")
        if isinstance(caps, list):
            result["capabilities"] = [str(c) for c in caps]

        owner_resp = client.get(_xrpc_url(hostname, KNOT_OWNER_METHOD))
        if owner_resp.status_code == 200:
            owner_raw = owner_resp.json()
            result["owner_raw"] = owner_raw
            owner = owner_raw.get("owner")
            if isinstance(owner, str):
                result["owner_did"] = owner

        result["reachable"] = True
        return result
    except httpx.HTTPError as exc:
        result["probe_error"] = str(exc)
        return result
    except ValueError as exc:
        result["probe_error"] = f"invalid JSON: {exc}"
        return result


def run_stage1(dsn: str) -> dict[str, int]:
    banner("STAGE 1 — Probe knot servers (infrastructure)")
    log("stage 1", "Knots are git host servers — NOT the source code itself.")
    log("stage 1", "This stage checks which knots are alive and records their version/owner.")
    log("stage 1", "Actual repo code comes in Stage 6 (git log/tree/blob XRPC).")

    seeds = knot_seeds()
    log("stage 1", f"Probing {len(seeds)} knot hostname(s): {', '.join(seeds)}")
    if os.getenv("TANGLED_KNOT_PROBE_MAX", "0") == "0":
        log(
            "stage 1",
            "Tip: set TANGLED_KNOT_SEEDS=knot1.tangled.sh,custom.knot.example "
            "or TANGLED_KNOT_PROBE_MAX=5 to auto-try knot2..knot5.",
        )

    stats = {"reachable": 0, "unreachable": 0}

    with httpx.Client(timeout=PROBE_TIMEOUT, follow_redirects=True) as client, connect(
        dsn
    ) as conn:
        set_crawl_state(conn, key="stage1:knots", status="running", meta={"seeds": seeds})

        for i, hostname in enumerate(seeds, start=1):
            step("stage 1", i, len(seeds), f"Probing {hostname} ...")
            probe = probe_knot(client, hostname)

            upsert_knot(
                conn,
                hostname=hostname,
                reachable=probe["reachable"],
                owner_did=probe["owner_did"],
                version=probe["version"],
                capabilities=probe["capabilities"],
                version_raw=probe["version_raw"],
                owner_raw=probe["owner_raw"],
                probe_error=probe["probe_error"],
            )

            if probe["reachable"]:
                stats["reachable"] += 1
                caps = probe["capabilities"] or []
                log(
                    "stage 1",
                    f" OK {hostname} version={probe['version']} "
                    f"owner={probe['owner_did'] or '?'} capabilities={caps}",
                )
            else:
                stats["unreachable"] += 1
                log("stage 1", f" FAIL {hostname} {probe['probe_error']}")

        set_crawl_state(
            conn,
            key="stage1:knots",
            status="complete",
            meta={"seeds": seeds, **stats},
        )
        conn.commit()

    log("stage 1", "Done.")
    log(
        "stage 1",
        f" reachable={stats['reachable']} unreachable={stats['unreachable']}",
    )
    if stats["reachable"] == 0:
        log("stage 1", "WARNING: no reachable knots — check network or seed hostnames.")

    return stats
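The precedence between the three seed environment variables is easy to misread, so here is a minimal sketch of the same rules that takes a plain dict instead of reading `os.environ`. `normalize_hostname` and `build_seeds` are hypothetical stand-ins written for this example (they mirror `_normalize_hostname` and `knot_seeds` but are not the scraper's functions), and the `*.example` hostnames are placeholders.

```python
from urllib.parse import urlparse


def normalize_hostname(value: str) -> str:
    """Strip whitespace, a leading http(s) scheme, and any trailing slash."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        value = urlparse(value).netloc or value
    return value.rstrip("/")


def build_seeds(env: dict[str, str]) -> list[str]:
    """Build the knot seed list from a dict standing in for the environment."""
    # An explicit seed list wins outright: the other two variables are ignored.
    raw = env.get("TANGLED_KNOT_SEEDS", "")
    if raw.strip():
        return [normalize_hostname(p) for p in raw.split(",") if p.strip()]

    seeds = ["knot1.tangled.sh"]

    # Optional numbered probing: knot2..knotN, off when the value is 0.
    max_auto = int(env.get("TANGLED_KNOT_PROBE_MAX", "0"))
    seeds.extend(f"knot{n}.tangled.sh" for n in range(2, max_auto + 1))

    # Extra hosts are appended after the defaults, without duplicates.
    for part in env.get("TANGLED_KNOT_EXTRA", "").split(","):
        host = normalize_hostname(part)
        if host and host not in seeds:
            seeds.append(host)
    return seeds
```

Passing the environment in as an argument keeps the function pure, which makes the precedence rules straightforward to check in a unit test.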
+509
scraper/stage2_network.py
from __future__ import annotations

import json
import os
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Any

import httpx

from appview_client import fetch_search_page
from db import connect, set_crawl_state, upsert_atproto_record
from parallel import concurrency_env
from pds_client import DEFAULT_PDS, list_records, pds_host_for_did
from progress import banner, log, metric, phase, step, summary_block

CRAWL_KEY = "stage2:network"
COLLECTION = "sh.tangled.repo"
RESOLVE_PDS = ("https://bsky.social", "https://tngl.sh")


def _page_limit() -> int:
    return max(1, min(100, int(os.getenv("TANGLED_NETWORK_PAGE_SIZE", "100"))))


def _repo_limit() -> int | None:
    raw = os.getenv("TANGLED_STAGE2_NETWORK_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _skip_existing() -> bool:
    return os.getenv("TANGLED_STAGE2_NETWORK_REFRESH", "").strip().lower() not in (
        "1",
        "true",
        "yes",
    )


def _link_key(handle: str, slug: str) -> tuple[str, str]:
    return handle.lower(), slug.lower()


def _load_existing_links(conn) -> set[tuple[str, str]]:
    """(owner_handle, slug) pairs already stored — match on name or rkey."""
    rows = conn.execute(
        """
        select owner_handle, name, rkey
        from tangled_repos
        where owner_handle is not null
        """
    ).fetchall()
    existing: set[tuple[str, str]] = set()
    for row in rows:
        handle = row.get("owner_handle")
        if not isinstance(handle, str) or not handle:
            continue
        for slug in (row.get("name"), row.get("rkey")):
            if isinstance(slug, str) and slug:
                existing.add(_link_key(handle, slug))
    return existing


def _partition_links(
    links: list[tuple[str, str]], existing: set[tuple[str, str]]
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    pending: list[tuple[str, str]] = []
    skipped: list[tuple[str, str]] = []
    for handle, slug in links:
        if _link_key(handle, slug) in existing:
            skipped.append((handle, slug))
        else:
            pending.append((handle, slug))
    return pending, skipped


def resolve_handle(client: httpx.Client, handle: str) -> str | None:
    for base in RESOLVE_PDS:
        try:
            resp = client.get(
                f"{base}/xrpc/com.atproto.identity.resolveHandle",
                params={"handle": handle},
            )
            if resp.status_code == 200:
                did = resp.json().get("did")
                if isinstance(did, str):
                    return did
        except httpx.HTTPError:
            continue
    return None


def fetch_repo_record(
    client: httpx.Client,
    *,
    pds_host: str,
    owner_did: str,
    rkey: str,
    repo_slug: str,
) -> dict[str, Any] | None:
    """Fetch sh.tangled.repo from owner's PDS (Bluesky or tngl)."""
    base = pds_host.rstrip("/")
    try:
        resp = client.get(
            f"{base}/xrpc/com.atproto.repo.getRecord",
            params={
                "repo": owner_did,
                "collection": COLLECTION,
                "rkey": rkey,
            },
        )
        if resp.status_code == 200:
            return resp.json()
    except httpx.HTTPError:
        pass

    cursor: str | None = None
    while True:
        try:
            data = list_records(
                client, pds_host, owner_did, COLLECTION, cursor=cursor, limit=100
            )
        except httpx.HTTPError:
            return None

        for rec in data.get("records") or []:
            value = rec.get("value")
            uri = rec.get("uri")
            if not isinstance(value, dict) or not isinstance(uri, str):
                continue
            name = value.get("name")
            if uri.endswith(f"/{repo_slug}") or name == repo_slug:
                return {"uri": uri, "cid": rec.get("cid"), "value": value}

        cursor = data.get("cursor")
        if not cursor or not data.get("records"):
            break
    return None


@dataclass
class NetworkFetchResult:
    owner_handle: str
    repo_slug: str
    status: str  # ok | resolve_failed | record_failed | error
    owner_did: str | None = None
    pds_host: str | None = None
    record: dict[str, Any] | None = None
    error: str | None = None


class _ResolveCache:
    def __init__(self) -> None:
        self._handle_did: dict[str, str | None] = {}
        self._did_pds: dict[str, str | None] = {}
        self._lock = threading.Lock()

    def resolve_owner(
        self, client: httpx.Client, handle: str
    ) -> tuple[str | None, str | None]:
        with self._lock:
            if handle in self._handle_did:
                did = self._handle_did[handle]
                if did is None:
                    return None, None
                pds = self._did_pds.get(did)
                if pds is not None:
                    return did, pds

        did = resolve_handle(client, handle)
        pds = None
        if did:
            pds = pds_host_for_did(client, did) or DEFAULT_PDS

        with self._lock:
            self._handle_did[handle] = did
            if did:
                self._did_pds[did] = pds
        return did, pds


def _fetch_one_link(
    owner_handle: str,
    repo_slug: str,
    cache: _ResolveCache,
) -> NetworkFetchResult:
    result = NetworkFetchResult(
        owner_handle=owner_handle,
        repo_slug=repo_slug,
        status="error",
    )
    try:
        with httpx.Client(timeout=60.0, follow_redirects=True) as client:
            owner_did, pds_host = cache.resolve_owner(client, owner_handle)
            if not owner_did:
                result.status = "resolve_failed"
                return result

            result.owner_did = owner_did
            result.pds_host = pds_host

            record = fetch_repo_record(
                client,
                pds_host=pds_host or DEFAULT_PDS,
                owner_did=owner_did,
                rkey=repo_slug,
                repo_slug=repo_slug,
            )
            if not record:
                result.status = "record_failed"
                return result

            result.record = record
            result.status = "ok"
            return result
    except httpx.HTTPError as exc:
        result.status = "error"
        result.error = str(exc)
        return result
    except Exception as exc:
        result.status = "error"
        result.error = str(exc)
        return result


def upsert_identity(conn, *, did: str, handle: str | None, pds_host: str | None) -> None:
    conn.execute(
        """
        insert into tangled_identities (did, handle, pds_host, last_synced_at)
        values (%s, %s, %s, now())
        on conflict (did) do update set
            handle = coalesce(excluded.handle, tangled_identities.handle),
            pds_host = coalesce(excluded.pds_host, tangled_identities.pds_host),
            last_synced_at = now()
        """,
        (did, handle, pds_host),
    )


def upsert_network_repo(
    conn,
    *,
    owner_did: str,
    owner_handle: str,
    repo_slug: str,
    pds_host: str,
    record: dict[str, Any],
) -> None:
    uri = record["uri"]
    value = record["value"]
    rkey = uri.rsplit("/", 1)[-1]
    repo_did = value.get("repoDid") if isinstance(value.get("repoDid"), str) else None
    knot = value.get("knot") if isinstance(value.get("knot"), str) else None
    name = value.get("name") if isinstance(value.get("name"), str) else None
    if not name:
        name = repo_slug if not repo_slug.startswith("3l") else None

    conn.execute(
        """
        insert into tangled_repos (
            uri, owner_did, owner_handle, rkey, repo_did, name, knot_hostname,
            cid, record_raw, discovered_via, last_synced_at
        )
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb, 'appview_search', now())
        on conflict (uri) do update set
            owner_did = excluded.owner_did,
            owner_handle = excluded.owner_handle,
            repo_did = coalesce(excluded.repo_did, tangled_repos.repo_did),
            name = coalesce(excluded.name, tangled_repos.name),
            knot_hostname = coalesce(excluded.knot_hostname, tangled_repos.knot_hostname),
            cid = excluded.cid,
            record_raw = excluded.record_raw,
            discovered_via = coalesce(tangled_repos.discovered_via, excluded.discovered_via),
            last_synced_at = now()
        """,
        (
            uri,
            owner_did,
            owner_handle,
            rkey,
            repo_did,
            name,
            knot,
            record.get("cid") if isinstance(record.get("cid"), str) else None,
            json.dumps(value),
        ),
    )

    upsert_atproto_record(
        conn,
        uri=uri,
        author_did=owner_did,
        collection=COLLECTION,
        rkey=rkey,
        payload=value,
        cid=record.get("cid") if isinstance(record.get("cid"), str) else None,
        repo_did=repo_did,
    )


def run_stage2_network(dsn: str) -> dict[str, Any]:
    workers = concurrency_env("TANGLED_STAGE2_NETWORK_CONCURRENCY", default=20)

    banner("STAGE 2-network — All Tangled repos (Bluesky + tngl.sh)")
    log("stage 2-network", "Uses tangled.org search index — only accounts WITH repos.")
    log("stage 2-network", "Does NOT scan all Bluesky users — only Tangled repo creators.")
    log("stage 2-network", "Resolves each owner handle → DID → PDS, then fetches sh.tangled.repo.")
    log("stage 2-network", f"Concurrency: {workers}")

    page_size = _page_limit()
    repo_limit = _repo_limit()
    if repo_limit:
        log("stage 2-network", f"Repo limit: {repo_limit}")
    if _skip_existing():
        log("stage 2-network", "Skip existing: on (set TANGLED_STAGE2_NETWORK_REFRESH=1 to re-fetch all)")
    else:
        log("stage 2-network", "Skip existing: off — refreshing every link")

    stats = {
        "search_links": 0,
        "repos_stored": 0,
        "already_in_db": 0,
        "resolve_failed": 0,
        "record_failed": 0,
        "errors": 0,
    }

    all_links: list[tuple[str, str]] = []
    seen_links: set[tuple[str, str]] = set()
    total_index: int | None = None

    phase(1, "Crawl tangled.org/search index")

    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        offset = 0
        while True:
            _html, links, total = fetch_search_page(
                client, offset=offset, limit=page_size
            )
            if total is not None:
                total_index = total

            new = 0
            for link in links:
                if link not in seen_links:
                    seen_links.add(link)
                    all_links.append(link)
                    new += 1

            log(
                "stage 2-network",
                f" search offset {offset}: +{new} links (unique: {len(all_links)}"
                + (f" / {total_index})" if total_index else ")"),
            )

            if repo_limit and len(all_links) >= repo_limit:
                all_links = all_links[:repo_limit]
                break
            if total_index is not None and offset + page_size >= total_index:
                break
            if new == 0 and offset > 0:
                break
            offset += page_size

    metric("Unique repos in search index", len(all_links))
    stats["search_links"] = len(all_links)

    pending_links = all_links
    if _skip_existing():
        with connect(dsn) as conn:
            existing = _load_existing_links(conn)
        pending_links, skipped_links = _partition_links(all_links, existing)
        stats["already_in_db"] = len(skipped_links)
        metric("Already in DB (skipped)", len(skipped_links))
        metric("To fetch", len(pending_links))
        if not pending_links:
            log("stage 2-network", "Nothing new to fetch.")
        elif len(skipped_links) <= 10:
            for handle, slug in skipped_links:
                log("stage 2-network", f" skip {handle}/{slug}")

    phase(2, f"Resolve owners & fetch repo records ({workers} workers)")

    cache = _ResolveCache()
    done = 0
    done_lock = threading.Lock()
    total_work = len(pending_links)

    with connect(dsn) as conn:
        set_crawl_state(
            conn,
            key=CRAWL_KEY,
            status="running",
            meta={
                "link_count": len(all_links),
                "pending_count": len(pending_links),
                "skipped_count": stats["already_in_db"],
                "total_index": total_index,
                "workers": workers,
            },
        )
        conn.commit()

        if not pending_links:
            set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
            conn.commit()
        else:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                futures = {
                    pool.submit(_fetch_one_link, handle, slug, cache): (handle, slug)
                    for handle, slug in pending_links
                }

                for future in as_completed(futures):
                    owner_handle, repo_slug = futures[future]
                    label = f"{owner_handle}/{repo_slug}"

                    try:
                        result = future.result()
                    except Exception as exc:
                        result = NetworkFetchResult(
                            owner_handle=owner_handle,
                            repo_slug=repo_slug,
                            status="error",
                            error=str(exc),
                        )

                    with done_lock:
                        done += 1
                        n = done

                    if result.status == "ok" and result.record and result.owner_did:
                        upsert_identity(
                            conn,
                            did=result.owner_did,
                            handle=owner_handle,
                            pds_host=result.pds_host,
                        )
                        upsert_network_repo(
                            conn,
                            owner_did=result.owner_did,
                            owner_handle=owner_handle,
                            repo_slug=repo_slug,
                            pds_host=result.pds_host or DEFAULT_PDS,
                            record=result.record,
                        )
                        stats["repos_stored"] += 1
                        if n <= 10 or n % 50 == 0:
                            pds_label = (
                                "bsky"
                                if result.pds_host and "bsky" in result.pds_host
                                else "tngl"
                            )
                            step(
                                "stage 2-network",
                                n,
                                total_work,
                                f"OK {label} did={result.owner_did[:20]}… pds={pds_label}",
                            )
                    elif result.status == "resolve_failed":
                        stats["resolve_failed"] += 1
                        if n <= 10 or n % 100 == 0:
                            step(
                                "stage 2-network",
                                n,
                                total_work,
                                f"SKIP {label} — handle not resolved",
                            )
                    elif result.status == "record_failed":
                        stats["record_failed"] += 1
                        if n <= 10 or n % 100 == 0:
                            step(
                                "stage 2-network",
                                n,
                                total_work,
                                f"FAIL {label} — no record on {result.pds_host or '?'}",
                            )
                    else:
                        stats["errors"] += 1
                        if n <= 10 or n % 100 == 0:
                            step(
                                "stage 2-network",
                                n,
                                total_work,
                                f"ERROR {label}: {result.error or 'unknown'}",
                            )

                    if n % 50 == 0:
                        conn.commit()

            set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
            conn.commit()

    summary_block(
        "Stage 2-network complete",
        [
            f"Search index links: {len(all_links)}",
            f"Already in DB (skip): {stats['already_in_db']}",
            f"Repos stored/updated: {stats['repos_stored']}",
            f"Handle resolve failed: {stats['resolve_failed']}",
            f"Record fetch failed: {stats['record_failed']}",
            f"Errors: {stats['errors']}",
            "",
            "Query: select discovered_via, count(*) from tangled_repos group by 1;",
        ],
    )
    return stats
+473
scraper/stage2_pds.py
from __future__ import annotations

import json
import os
from typing import Any
from urllib.parse import urlparse

import httpx

from db import connect, set_crawl_state
from pds_client import (
    DEFAULT_PDS,
    describe_pds,
    describe_repo_on_knot,
    handle_from_plc,
    list_repo_records,
    sync_list_repos,
)
from progress import banner, log, metric, phase, step, summary_block

CRAWL_KEY_ACCOUNTS = "stage2:accounts"
CRAWL_KEY_REPOS = "stage2:repos"
COLLECTION = "sh.tangled.repo"


def _pds_host() -> str:
    return os.getenv("TANGLED_PDS_URL", DEFAULT_PDS).strip()


def _account_limit() -> int | None:
    raw = os.getenv("TANGLED_STAGE2_ACCOUNT_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _resolve_handles() -> bool:
    # Lowercased so "TRUE"/"Yes" behave like the stage2_network flags.
    return os.getenv("TANGLED_RESOLVE_HANDLES", "0").strip().lower() in {"1", "true", "yes"}


def _enrich_knots() -> bool:
    return os.getenv("TANGLED_STAGE2_ENRICH_KNOTS", "1").strip().lower() not in {"0", "false", "no"}


def _rkey_from_uri(uri: str) -> str:
    return uri.rsplit("/", 1)[-1]


def _repo_name(value: dict[str, Any], rkey: str) -> str | None:
    name = value.get("name")
    if isinstance(name, str) and name:
        return name
    if rkey and not rkey.startswith("3l"):
        return rkey
    return None


def update_account_scan(
    conn,
    *,
    did: str,
    handle: str | None,
    repo_record_count: int,
) -> None:
    conn.execute(
        """
        update tangled_pds_accounts
        set
            handle = coalesce(%s, handle),
            repo_record_count = %s,
            last_synced_at = now()
        where did = %s
        """,
        (handle, repo_record_count, did),
    )


def upsert_accounts_batch(
    conn,
    *,
    pds_host: str,
    entries: list[dict[str, Any]],
) -> None:
    if not entries:
        return
    conn.cursor().executemany(
        """
        insert into tangled_pds_accounts (
            did, pds_host, head, rev, active, handle, list_repos_raw,
            repo_record_count, last_synced_at
        )
        values (%s, %s, %s, %s, %s, null, %s::jsonb, 0, now())
        on conflict (did) do update set
            pds_host = excluded.pds_host,
            head = excluded.head,
            rev = excluded.rev,
            active = excluded.active,
            list_repos_raw = excluded.list_repos_raw,
            last_synced_at = now()
        """,
        [
            (
                entry["did"],
                pds_host,
                entry.get("head"),
                entry.get("rev"),
                entry.get("active"),
                json.dumps(entry),
            )
            for entry in entries
            if isinstance(entry.get("did"), str)
        ],
    )


def upsert_repo_record(
    conn,
    *,
    uri: str,
    owner_did: str,
    rkey: str,
    repo_did: str | None,
    name: str | None,
    knot_hostname: str | None,
    cid: str | None,
    record_raw: dict[str, Any],
    describe_raw: dict[str, Any] | None = None,
) -> None:
    conn.execute(
        """
        insert into tangled_repos (
            uri, owner_did, rkey, repo_did, name, knot_hostname, cid,
            record_raw, describe_raw, last_synced_at
        )
        values (%s, %s, %s, %s, %s, %s, %s, %s::jsonb, %s::jsonb, now())
        on conflict (uri) do update set
            repo_did = excluded.repo_did,
            name = excluded.name,
            knot_hostname = excluded.knot_hostname,
            cid = excluded.cid,
            record_raw = excluded.record_raw,
            describe_raw = coalesce(excluded.describe_raw, tangled_repos.describe_raw),
            last_synced_at = now()
        """,
        (
            uri,
            owner_did,
            rkey,
            repo_did,
            name,
            knot_hostname,
            cid,
            json.dumps(record_raw),
            json.dumps(describe_raw) if describe_raw else None,
        ),
    )


def phase1_enumerate_accounts(dsn: str, pds_host: str, client: httpx.Client) -> list[str]:
    phase(1, "Enumerate accounts on Tangled PDS")
    log("stage 2", f"PDS host: {pds_host}")
    log("stage 2", "Calling com.atproto.server.describeServer ...")

    try:
        info = describe_pds(client, pds_host)
        domains = info.get("availableUserDomains") or []
        metric("PDS DID", info.get("did", "?"))
        metric("User domains", ", ".join(domains) if domains else "(none listed)")
    except httpx.HTTPError as exc:
        log("stage 2", f"WARNING: describeServer failed ({exc}) — continuing anyway")

    account_limit = _account_limit()
    if account_limit:
        log("stage 2", f"Account limit active: {account_limit} (unset TANGLED_STAGE2_ACCOUNT_LIMIT for full crawl)")

    log("stage 2", "Paging com.atproto.sync.listRepos ...")

    all_dids: list[str] = []
    cursor: str | None = None
    page = 0

    with connect(dsn) as conn:
        set_crawl_state(conn, key=CRAWL_KEY_ACCOUNTS, status="running")
        conn.commit()

        while True:
            page += 1
            data = sync_list_repos(client, pds_host, cursor=cursor)
            batch = data.get("repos") or []
            cursor = data.get("cursor")

            page_entries: list[dict[str, Any]] = []
            for entry in batch:
                did = entry.get("did")
                if not isinstance(did, str):
                    continue
                page_entries.append(entry)
                all_dids.append(did)
                if account_limit and len(all_dids) >= account_limit:
                    break

            upsert_accounts_batch(conn, pds_host=pds_host, entries=page_entries)

            conn.commit()
            log(
                "stage 2",
                f" page {page}: +{len(page_entries)} accounts (running total: {len(all_dids)})",
            )

            if account_limit and len(all_dids) >= account_limit:
                log("stage 2", f" stopped at account limit ({account_limit})")
                break
            if not cursor or not batch:
                break

        set_crawl_state(
            conn,
            key=CRAWL_KEY_ACCOUNTS,
            status="complete",
            meta={"pds_host": pds_host, "account_count": len(all_dids), "pages": page},
        )
        conn.commit()

    metric("Total accounts on PDS", len(all_dids))
    return all_dids


def phase2_scan_repo_records(
    dsn: str,
    pds_host: str,
    client: httpx.Client,
    account_dids: list[str],
) -> dict[str, int]:
    phase(2, "Scan sh.tangled.repo records per account")
    log("stage 2", f"Checking {len(account_dids)} accounts for repo records ...")

    stats = {"accounts_with_repos": 0, "accounts_without_repos": 0, "repo_records": 0, "errors": 0}
    resolve_handles = _resolve_handles()
    if resolve_handles:
        log("stage 2", "Handle resolution enabled (PLC lookup per account — slower)")

    with connect(dsn) as conn:
        set_crawl_state(conn, key=CRAWL_KEY_REPOS, status="running")
        conn.commit()

        for i, did in enumerate(account_dids, start=1):
            handle: str | None = None
            if resolve_handles:
                handle = handle_from_plc(client, did)

            try:
                cursor: str | None = None
                repo_count = 0
                while True:
                    data = list_repo_records(client, pds_host, did, cursor=cursor)
                    records = data.get("records") or []
                    cursor = data.get("cursor")

                    for rec in records:
                        uri = rec.get("uri")
                        value = rec.get("value")
                        if not isinstance(uri, str) or not isinstance(value, dict):
                            continue

                        rkey = _rkey_from_uri(uri)
                        repo_did = value.get("repoDid")
                        if isinstance(repo_did, str):
                            repo_did_val: str | None = repo_did
                        else:
                            repo_did_val = None

                        knot = value.get("knot")
                        knot_hostname = knot if isinstance(knot, str) else None

                        upsert_repo_record(
                            conn,
                            uri=uri,
                            owner_did=did,
                            rkey=rkey,
                            repo_did=repo_did_val,
                            name=_repo_name(value, rkey),
                            knot_hostname=knot_hostname,
                            cid=rec.get("cid") if isinstance(rec.get("cid"), str) else None,
                            record_raw=value,
                        )
                        conn.execute(
                            "update tangled_repos set discovered_via = 'tngl_pds' where uri = %s",
                            (uri,),
                        )
                        repo_count += 1
                        stats["repo_records"] += 1

                    if not cursor or not records:
                        break

                update_account_scan(
                    conn,
                    did=did,
                    handle=handle,
                    repo_record_count=repo_count,
                )

                if repo_count:
                    stats["accounts_with_repos"] += 1
                    label = handle or did
                    step("stage 2", i, len(account_dids), f"{label} → {repo_count} repo(s)")
                else:
                    stats["accounts_without_repos"] += 1
                    if i % 100 == 0 or i == len(account_dids):
                        step(
                            "stage 2",
                            i,
                            len(account_dids),
                            f"… {stats['accounts_with_repos']} accounts with repos so far",
                        )

            except httpx.HTTPError as exc:
                stats["errors"] += 1
                step("stage 2", i, len(account_dids), f"ERROR {did}: {exc}")

            if i % 50 == 0:
                conn.commit()

        set_crawl_state(
            conn,
            key=CRAWL_KEY_REPOS,
            status="complete",
            meta=stats,
        )
        conn.commit()

    return stats


def phase3_enrich_from_knots(dsn: str, client: httpx.Client) -> dict[str, int]:
    phase(3, "Enrich repos from knot describeRepo (optional)")
    stats = {"enriched": 0, "skipped": 0, "errors": 0}

    if not _enrich_knots():
        log("stage 2", "Skipped (TANGLED_STAGE2_ENRICH_KNOTS=0)")
        return stats

    with connect(dsn) as conn:
        knots = conn.execute(
            "select hostname from tangled_knots where reachable = true order by hostname"
        ).fetchall()
        repos = conn.execute(
            """
            select uri, repo_did, knot_hostname
            from tangled_repos
            where repo_did is not null and knot_hostname is not null
            order by uri
            """
        ).fetchall()

    reachable = {row["hostname"] for row in knots}
    log("stage 2", f"Enriching {len(repos)} repos via {len(reachable)} reachable knot(s) ...")

    with connect(dsn) as conn:
        for i, row in enumerate(repos, start=1):
            knot = row["knot_hostname"]
            repo_did = row["repo_did"]
            if knot not in reachable:
                stats["skipped"] += 1
                continue

            try:
                describe = describe_repo_on_knot(client, knot, repo_did)
                if describe:
                    conn.execute(
                        """
                        update tangled_repos
                        set describe_raw = %s::jsonb, last_synced_at = now()
                        where uri = %s
                        """,
                        (json.dumps(describe), row["uri"]),
                    )
                    stats["enriched"] += 1
                    if i <= 10 or i % 25 == 0:
                        step("stage 2", i, len(repos), f"describeRepo OK {repo_did}")
                else:
                    stats["skipped"] += 1
            except httpx.HTTPError as exc:
                stats["errors"] += 1
                step("stage 2", i, len(repos), f"describeRepo FAIL {repo_did}: {exc}")

            if i % 50 == 0:
                conn.commit()
        conn.commit()

    metric("Knot enrichments", stats["enriched"])
    return stats


def run_stage2_accounts_only(dsn: str) -> dict[str, Any]:
    banner("STAGE 2a — Count accounts on Tangled PDS")
    pds_host = _pds_host()
    with httpx.Client(timeout=30.0, follow_redirects=True) as client:
        dids = phase1_enumerate_accounts(dsn, pds_host, client)
    summary_block(
        "Stage 2a complete",
        [
            f"PDS: {pds_host}",
            f"Accounts: {len(dids)}",
            "Next step: python scraper/scrape.py stage2-repos",
        ],
    )
    return {"account_count": len(dids)}


def run_stage2_repos_only(dsn: str) -> dict[str, Any]:
    banner("STAGE 2b — Scan repo records (accounts must exist in DB)")
    pds_host = _pds_host()

    with connect(dsn) as conn:
        rows = conn.execute(
            "select did from tangled_pds_accounts order by did"
        ).fetchall()
    if not rows:
        raise RuntimeError(
            "No accounts in tangled_pds_accounts. Run stage2-accounts first:\n"
            " python scraper/scrape.py stage2-accounts"
        )

    account_dids = [row["did"] for row in rows]
    log("stage 2", f"Loaded {len(account_dids)} accounts from DB")

    with httpx.Client(timeout=30.0, follow_redirects=True) as client:
        repo_stats = phase2_scan_repo_records(dsn, pds_host, client, account_dids)
        knot_stats = phase3_enrich_from_knots(dsn, client)

    summary_block(
        "Stage 2b complete",
        [
            f"Accounts scanned: {len(account_dids)}",
            f"Accounts with repos: {repo_stats['accounts_with_repos']}",
            f"Repo records stored: {repo_stats['repo_records']}",
            f"Knot enrichments: {knot_stats['enriched']}",
            f"Errors: {repo_stats['errors'] + knot_stats['errors']}",
        ],
    )
    return {**repo_stats, **knot_stats}


def run_stage2(dsn: str) -> dict[str, Any]:
    banner("STAGE 2 — Discover repos via Tangled PDS (tngl.sh)")
    log("stage 2", "Step-by-step: accounts → repo records → knot enrichment")
    log("stage 2", "Note: sh.tangled.sync.listRepos on knots returns 404 — we use PDS instead.")

    pds_host = _pds_host()
    host_label = urlparse(pds_host).netloc or pds_host

    with httpx.Client(timeout=30.0, follow_redirects=True) as client:
        account_dids = phase1_enumerate_accounts(dsn, pds_host, client)
        repo_stats = phase2_scan_repo_records(dsn, pds_host, client, account_dids)
        knot_stats = phase3_enrich_from_knots(dsn, client)

    summary_block(
        "Stage 2 complete",
        [
            f"PDS ({host_label}): {len(account_dids)} accounts",
            f"Accounts with repos: {repo_stats['accounts_with_repos']}",
            f"Empty accounts: {repo_stats['accounts_without_repos']}",
            f"Repo records stored: {repo_stats['repo_records']}",
            f"Knot enrichments: {knot_stats['enriched']}",
            f"Errors: {repo_stats['errors'] + knot_stats['errors']}",
        ],
    )
    return {
"account_count": len(account_dids), 471 + **repo_stats, 472 + **knot_stats, 473 + }
+326
scraper/stage4_repo_metadata.py
from __future__ import annotations

import json
import os
from typing import Any

import httpx

from db import connect, set_crawl_state, upsert_atproto_record, upsert_xrpc_snapshot
from pds_client import (
    DEFAULT_PDS,
    describe_repo_on_knot,
    knot_xrpc,
    list_records,
    params_hash,
    pds_host_for_did,
)
from progress import banner, log, metric, phase, step, summary_block

CRAWL_KEY = "stage4:repo_metadata"

# Knot XRPC methods fetched per repo (deeper than Stage 2 metadata record alone).
KNOT_METHODS: list[tuple[str, str, dict[str, Any] | None]] = [
    ("sh.tangled.repo.getDefaultBranch", "repo", None),
    ("sh.tangled.repo.languages", "repo", None),
    ("sh.tangled.repo.branches", "repo", {"limit": 100}),
    ("sh.tangled.repo.tags", "repo", {"limit": 100}),
]

COLLABORATOR_COLLECTION = "sh.tangled.repo.collaborator"


def _repo_limit() -> int | None:
    raw = os.getenv("TANGLED_STAGE4_REPO_LIMIT", "").strip()
    if not raw:
        return None
    return max(1, int(raw))


def _branch_limit() -> int:
    return max(1, int(os.getenv("TANGLED_STAGE4_BRANCH_LIMIT", "100")))


def _collab_page_limit() -> int:
    return max(1, min(1000, int(os.getenv("TANGLED_STAGE4_COLLAB_LIMIT", "100"))))


def _rkey_from_uri(uri: str) -> str:
    return uri.rsplit("/", 1)[-1]


def _store_snapshot(
    conn,
    *,
    method: str,
    repo_did: str,
    params: dict[str, Any],
    payload: Any,
) -> bool:
    if not isinstance(payload, (dict, list)):
        return False
    ph = params_hash(params)
    upsert_xrpc_snapshot(
        conn,
        method=method,
        repo_did=repo_did,
        params=params,
        params_hash=ph,
        payload=payload,
    )
    return True


def _fetch_knot_method(
    client: httpx.Client,
    conn,
    *,
    knot_hostname: str,
    repo_did: str,
    method: str,
    param_key: str,
    extra: dict[str, Any] | None,
) -> tuple[bool, str | None]:
    params: dict[str, Any] = {param_key: repo_did}
    if extra:
        params.update(extra)
    if method == "sh.tangled.repo.branches":
        params["limit"] = _branch_limit()

    status, payload = knot_xrpc(client, knot_hostname, method, params)
    if status != 200:
        return False, f"HTTP {status}"

    if isinstance(payload, dict) and payload.get("error"):
        return False, str(payload.get("body", payload))

    ok = _store_snapshot(conn, method=method, repo_did=repo_did, params=params, payload=payload)
    return ok, None


def _fetch_collaborators(
    client: httpx.Client,
    conn,
    *,
    knot_hostname: str,
    repo_did: str,
) -> int:
    """Paginate sh.tangled.repo.listCollaborators (subject=repo_did)."""
    stored = 0
    cursor: str | None = None
    page = 0

    while True:
        page += 1
        params: dict[str, Any] = {
            "subject": repo_did,
            "limit": _collab_page_limit(),
        }
        if cursor:
            params["cursor"] = cursor

        status, payload = knot_xrpc(
            client, knot_hostname, "sh.tangled.repo.listCollaborators", params
        )
        if status != 200 or not isinstance(payload, dict):
            break

        if _store_snapshot(
            conn,
            method="sh.tangled.repo.listCollaborators",
            repo_did=repo_did,
            params=params,
            payload=payload,
        ):
            stored += 1

        cursor = payload.get("cursor")
        items = payload.get("items") or []
        if not cursor or not items:
            break

    return stored


def _fetch_pds_collaborator_records(
    client: httpx.Client,
    conn,
    *,
    owner_did: str,
    repo_did: str,
) -> int:
    pds = pds_host_for_did(client, owner_did) or DEFAULT_PDS
    stored = 0
    cursor: str | None = None

    while True:
        try:
            data = list_records(
                client,
                pds,
                owner_did,
                COLLABORATOR_COLLECTION,
                cursor=cursor,
                limit=100,
            )
        except httpx.HTTPError:
            break

        records = data.get("records") or []
        for rec in records:
            uri = rec.get("uri")
            value = rec.get("value")
            if not isinstance(uri, str) or not isinstance(value, dict):
                continue
            if value.get("repo") != repo_did:
                continue

            upsert_atproto_record(
                conn,
                uri=uri,
                author_did=owner_did,
                collection=COLLABORATOR_COLLECTION,
                rkey=_rkey_from_uri(uri),
                payload=value,
                cid=rec.get("cid") if isinstance(rec.get("cid"), str) else None,
                repo_did=repo_did,
            )
            stored += 1

        cursor = data.get("cursor")
        if not cursor or not records:
            break

    return stored


def run_stage4(dsn: str) -> dict[str, Any]:
    banner("STAGE 4 — Deeper repo metadata")
    log("stage 4", "Enriches each repo with knot git stats + collaborators.")
    log("stage 4", "Stores raw XRPC JSON in tangled_xrpc_snapshots.")
    log("stage 4", "Stores collaborator records in tangled_atproto_records.")

    repo_limit = _repo_limit()
    if repo_limit:
        log("stage 4", f"Repo limit: {repo_limit} (unset TANGLED_STAGE4_REPO_LIMIT for all)")

    with connect(dsn) as conn:
        reachable = {
            row["hostname"]
            for row in conn.execute(
                "select hostname from tangled_knots where reachable = true"
            ).fetchall()
        }
        query = """
            select uri, owner_did, repo_did, knot_hostname, name, record_raw
            from tangled_repos
            where repo_did is not null
            order by uri
        """
        if repo_limit:
            query += f" limit {repo_limit}"
        repos = conn.execute(query).fetchall()

    if not repos:
        raise RuntimeError("No repos with repo_did in tangled_repos. Run stage2-repos first.")

    log("stage 4", f"Found {len(repos)} repos to enrich.")

    stats = {
        "repos_processed": 0,
        "repos_skipped_knot": 0,
        "describe_repo_updated": 0,
        "xrpc_snapshots": 0,
        "collaborator_records": 0,
        "errors": 0,
    }

    phase(1, "Knot metadata (branches, tags, languages, collaborators)")
    phase(2, "Owner PDS collaborator records")

    with httpx.Client(timeout=60.0, follow_redirects=True) as client, connect(dsn) as conn:
        set_crawl_state(conn, key=CRAWL_KEY, status="running", meta={"repo_count": len(repos)})
        conn.commit()

        for i, repo in enumerate(repos, start=1):
            repo_did = repo["repo_did"]
            knot = repo["knot_hostname"]
            owner_did = repo["owner_did"]
            label = repo["name"] or repo_did

            if not knot or knot not in reachable:
                stats["repos_skipped_knot"] += 1
                if i <= 5 or i % 50 == 0:
                    step("stage 4", i, len(repos), f"SKIP {label} — knot unreachable ({knot})")
                continue

            try:
                # describeRepo → tangled_repos.describe_raw
                describe = describe_repo_on_knot(client, knot, repo_did)
                if describe:
                    conn.execute(
                        """
                        update tangled_repos
                        set describe_raw = %s::jsonb, last_synced_at = now()
                        where uri = %s
                        """,
                        (json.dumps(describe), repo["uri"]),
                    )
                    stats["describe_repo_updated"] += 1

                # Knot XRPC snapshots
                for method, param_key, extra in KNOT_METHODS:
                    ok, err = _fetch_knot_method(
                        client,
                        conn,
                        knot_hostname=knot,
                        repo_did=repo_did,
                        method=method,
                        param_key=param_key,
                        extra=extra,
                    )
                    if ok:
                        stats["xrpc_snapshots"] += 1
                    elif err and i <= 3:
                        log("stage 4", f" {method}: {err}")

                stats["xrpc_snapshots"] += _fetch_collaborators(
                    client, conn, knot_hostname=knot, repo_did=repo_did
                )

                # PDS collaborator records
                collab_n = _fetch_pds_collaborator_records(
                    client, conn, owner_did=owner_did, repo_did=repo_did
                )
                stats["collaborator_records"] += collab_n

                stats["repos_processed"] += 1
                step(
                    "stage 4",
                    i,
                    len(repos),
                    f"{label} snapshots+ collab_records={collab_n}",
                )

            except httpx.HTTPError as exc:
                stats["errors"] += 1
                step("stage 4", i, len(repos), f"ERROR {label}: {exc}")

            if i % 25 == 0:
                conn.commit()

        set_crawl_state(conn, key=CRAWL_KEY, status="complete", meta=stats)
        conn.commit()

    summary_block(
        "Stage 4 complete",
        [
            f"Repos processed: {stats['repos_processed']}",
            f"Skipped (bad knot): {stats['repos_skipped_knot']}",
            f"describeRepo updated: {stats['describe_repo_updated']}",
            f"XRPC snapshots stored: {stats['xrpc_snapshots']}",
            f"Collaborator records: {stats['collaborator_records']}",
            f"Errors: {stats['errors']}",
        ],
    )
    return stats
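Both collaborator fetchers above use the same cursor loop: request a page, consume its records, and stop when the cursor is missing or the page is empty. The following is a minimal sketch of that pattern in isolation. `iter_pages`, `fake_fetch`, and the in-memory `_PAGES` table are hypothetical names for illustration and are not part of the scraper.

```python
from typing import Any, Callable, Iterator, Optional


def iter_pages(
    fetch_page: Callable[[Optional[str]], dict],
) -> Iterator[Any]:
    """Yield records from a cursor-paginated endpoint until it is exhausted."""
    cursor: Optional[str] = None
    while True:
        data = fetch_page(cursor)
        records = data.get("records") or []
        yield from records
        cursor = data.get("cursor")
        # Stop on a missing cursor OR an empty page, so a server that keeps
        # returning a cursor with no records cannot trap the loop.
        if not cursor or not records:
            break


# Hypothetical stand-in for a PDS: two pages of data, then an empty page
# that still carries a cursor.
_PAGES = {
    None: {"records": [{"uri": "at://did:example:a/sh.tangled.repo/1"}], "cursor": "p2"},
    "p2": {"records": [{"uri": "at://did:example:a/sh.tangled.repo/2"}], "cursor": "p3"},
    "p3": {"records": [], "cursor": "p4"},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return _PAGES[cursor]


uris = [rec["uri"] for rec in iter_pages(fake_fetch)]
```

The empty-page check is what ends the loop here: the third page still returns a cursor, but no fourth request is made.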
+38
scraper/wait_then_run.sh
#!/usr/bin/env bash
# Wait for a running stage2-repos scrape, then run the next command.
# Does NOT kill the stage2 process.
#
# Usage:
# ./scraper/wait_then_run.sh stage4
# ./scraper/wait_then_run.sh stage4 status
#
# Clears TANGLED_STAGE4_REPO_LIMIT so stage4 runs on ALL repos.

set -euo pipefail

ROOT="$(cd "$(dirname "$0")/.." && pwd)"
cd "$ROOT"

NEXT="${1:-stage4}"
shift || true

PID="$(pgrep -f "scrape.py stage2-repos" | head -1 || true)"

if [[ -z "$PID" ]]; then
  echo "No stage2-repos process found — running ${NEXT} now."
else
  echo "Waiting for stage2-repos (PID ${PID}) to finish ..."
  echo " (stage2 still running — this script will NOT kill it)"
  while kill -0 "$PID" 2>/dev/null; do
    sleep 30
  done
  echo "Stage 2 finished."
fi

# shellcheck disable=SC1091
source scraper/.venv/bin/activate

unset TANGLED_STAGE4_REPO_LIMIT

echo "Starting: python scraper/scrape.py ${NEXT} $*"
python scraper/scrape.py "$NEXT" "$@"
+38
scripts/test-questionnaire.sh
#!/usr/bin/env bash
# Local questionnaire loop test (same code path as Cloud Run Job).
#
# Usage (from repo root):
# ./scripts/test-questionnaire.sh
# ./scripts/test-questionnaire.sh 'at://did:plc:…/sh.tangled.repo.issue/…'
# ./scripts/test-questionnaire.sh --save # write to Postgres when done
#
# Requires: venv, .env with ANTHROPIC_API_KEY + DB_CONNECTION_STRING

set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
ISSUE_URI="${1:-at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22}"
SAVE=0
if [[ "${1:-}" == "--save" ]]; then
  SAVE=1
  ISSUE_URI="${2:-at://did:plc:zmjoeu3stwcn44647rhxa44o/sh.tangled.repo.issue/3lvzel2uo3a22}"
elif [[ "${2:-}" == "--save" ]]; then
  SAVE=1
fi

cd "$ROOT"
# shellcheck disable=SC1091
source venv/bin/activate
export PYTHONUNBUFFERED=1
export AGENT_VERBOSE_TOOLS=1

EXTRA=()
if [[ "$SAVE" -eq 0 ]]; then
  EXTRA+=(--no-save)
fi

echo "==> Issue: $ISSUE_URI"
echo "==> Logs on stderr; JSON on stdout"
echo

# EXTRA is empty when --save is given; the ${EXTRA[@]+...} form avoids an
# "unbound variable" error under `set -u` on bash older than 4.4.
python -m questionnaire_job.main ${EXTRA[@]+"${EXTRA[@]}"} --issue-uri "$ISSUE_URI"
+39
supabase/migrations/20250624000000_tangled_scraper_stage0_1.sql
-- Stage 0 + 1 tables for the Tangled scraper.
-- Safe to re-run: uses IF NOT EXISTS.

create extension if not exists "pgcrypto";

create table if not exists public.tangled_lexicons (
    nsid text primary key,
    lexicon_type text not null,
    definition jsonb not null,
    source_path text not null,
    fetched_at timestamptz not null default now()
);

create table if not exists public.tangled_knots (
    hostname text primary key,
    reachable boolean not null default false,
    owner_did text,
    version text,
    capabilities jsonb,
    version_raw jsonb,
    owner_raw jsonb,
    probe_error text,
    first_seen_at timestamptz not null default now(),
    last_probed_at timestamptz not null default now()
);

create table if not exists public.tangled_crawl_state (
    key text primary key,
    status text not null default 'pending',
    meta jsonb,
    last_error text,
    updated_at timestamptz not null default now()
);

create index if not exists tangled_knots_reachable_idx
    on public.tangled_knots (reachable);

create index if not exists tangled_lexicons_type_idx
    on public.tangled_lexicons (lexicon_type);
+41
supabase/migrations/20250624100000_tangled_scraper_stage2.sql
-- Stage 2: PDS accounts + repo records from tngl.sh

create table if not exists public.tangled_pds_accounts (
    did text primary key,
    pds_host text not null,
    head text,
    rev text,
    active boolean,
    handle text,
    list_repos_raw jsonb not null,
    repo_record_count integer not null default 0,
    first_seen_at timestamptz not null default now(),
    last_synced_at timestamptz not null default now()
);

create table if not exists public.tangled_repos (
    uri text primary key,
    owner_did text not null,
    rkey text not null,
    repo_did text,
    name text,
    knot_hostname text,
    cid text,
    record_raw jsonb not null,
    describe_raw jsonb,
    first_seen_at timestamptz not null default now(),
    last_synced_at timestamptz not null default now(),
    unique (owner_did, rkey)
);

create index if not exists tangled_pds_accounts_handle_idx
    on public.tangled_pds_accounts (handle);

create index if not exists tangled_repos_owner_did_idx
    on public.tangled_repos (owner_did);

create index if not exists tangled_repos_repo_did_idx
    on public.tangled_repos (repo_did);

create index if not exists tangled_repos_knot_hostname_idx
    on public.tangled_repos (knot_hostname);
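The `tangled_repos` table keys each row by its AT-URI and also enforces `unique (owner_did, rkey)`. A small sketch of how the three values relate may help, assuming the `at://did/collection/rkey` shape noted elsewhere in these migrations. `build_record_uri` is a hypothetical helper; `rkey_from_uri` mirrors the scraper's `_rkey_from_uri`.

```python
def build_record_uri(author_did: str, collection: str, rkey: str) -> str:
    """Compose an AT-URI from its three parts (assumed shape: at://did/collection/rkey)."""
    return f"at://{author_did}/{collection}/{rkey}"


def rkey_from_uri(uri: str) -> str:
    """Return the record key, i.e. the last path segment of the AT-URI."""
    return uri.rsplit("/", 1)[-1]


# Illustrative values only; the DID and rkey are made up.
example_uri = build_record_uri("did:plc:example", "sh.tangled.repo", "3abc")
example_rkey = rkey_from_uri(example_uri)
```

Because the DID and collection are fixed for one owner's repo records, the `(owner_did, rkey)` pair identifies the same row as the full `uri`.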
+137
supabase/migrations/20250624110000_tangled_scraper_stage3_6.sql
-- Stages 3–6: identities, federated records, git XRPC snapshots, source archives.
-- Raw-first design: store payloads as JSON/bytea; typed views can come later.

-- -----------------------------------------------------------------------------
-- Stage 3 — User / identity enrichment
-- -----------------------------------------------------------------------------

create table if not exists public.tangled_identities (
    did text primary key,
    handle text,
    pds_host text,
    profile_record jsonb, -- sh.tangled.actor.profile payload
    did_doc jsonb, -- full DID document from PLC
    first_seen_at timestamptz not null default now(),
    last_synced_at timestamptz not null default now()
);

create index if not exists tangled_identities_handle_idx
    on public.tangled_identities (handle);

create index if not exists tangled_identities_pds_host_idx
    on public.tangled_identities (pds_host);

-- -----------------------------------------------------------------------------
-- Stage 5 — Federated ATProto records (issues, PRs, stars, comments, …)
-- One row per record; collection = lexicon NSID e.g. sh.tangled.repo.issue
-- -----------------------------------------------------------------------------

create table if not exists public.tangled_atproto_records (
    uri text primary key, -- at://did/collection/rkey
    author_did text not null,
    collection text not null,
    rkey text not null,
    cid text,
    payload jsonb not null, -- record.value exactly as returned
    repo_did text, -- denormalized when record links to a repo
    subject_uri text, -- denormalized target (issue/PR/star subject)
    fetched_at timestamptz not null default now(),
    unique (author_did, collection, rkey)
);

create index if not exists tangled_atproto_records_collection_idx
    on public.tangled_atproto_records (collection);

create index if not exists tangled_atproto_records_repo_did_idx
    on public.tangled_atproto_records (repo_did);

create index if not exists tangled_atproto_records_author_did_idx
    on public.tangled_atproto_records (author_did);

create index if not exists tangled_atproto_records_payload_gin_idx
    on public.tangled_atproto_records using gin (payload);

-- Backlink index rows discovered before fetching the full record (Stage 5 crawl queue)
create table if not exists public.tangled_backlinks (
    id bigserial primary key,
    repo_did text not null,
    collection text not null, -- e.g. sh.tangled.repo.issue
    source_field text not null, -- e.g. repo
    author_did text not null,
    rkey text not null,
    record_uri text generated always as (
        'at://' || author_did || '/' || collection || '/' || rkey
    ) stored,
    fetched boolean not null default false,
    discovered_at timestamptz not null default now(),
    unique (repo_did, collection, author_did, rkey)
);

create index if not exists tangled_backlinks_repo_collection_idx
    on public.tangled_backlinks (repo_did, collection);

create index if not exists tangled_backlinks_unfetched_idx
    on public.tangled_backlinks (fetched) where fetched = false;

-- -----------------------------------------------------------------------------
-- Stage 6 — Knot/git XRPC response snapshots (commits, branches, tree, diff, …)
-- -----------------------------------------------------------------------------

create table if not exists public.tangled_xrpc_snapshots (
    id bigserial primary key,
    method text not null, -- e.g. sh.tangled.repo.log
    repo_did text,
    params jsonb not null,
    params_hash text not null,
    payload jsonb, -- null when response is binary (see git tables)
    payload_encoding text not null default 'application/json',
    fetched_at timestamptz not null default now(),
    unique (method, repo_did, params_hash)
);

create index if not exists tangled_xrpc_snapshots_method_idx
    on public.tangled_xrpc_snapshots (method);

create index if not exists tangled_xrpc_snapshots_repo_did_idx
    on public.tangled_xrpc_snapshots (repo_did);

-- Full repo snapshot archives (tar.gz @ HEAD or branch)
create table if not exists public.tangled_git_archives (
    repo_did text not null,
    git_ref text not null default 'HEAD',
    format text not null default 'tar.gz',
    size_bytes bigint not null,
    sha256 text,
    content bytea not null,
    fetched_at timestamptz not null default now(),
    primary key (repo_did, git_ref, format)
);

create index if not exists tangled_git_archives_size_idx
    on public.tangled_git_archives (size_bytes);

-- Individual git blob objects (optional dedup layer for file-level storage)
create table if not exists public.tangled_git_blobs (
    repo_did text not null,
    oid text not null,
    size_bytes bigint,
    content bytea not null,
    fetched_at timestamptz not null default now(),
    primary key (repo_did, oid)
);

-- -----------------------------------------------------------------------------
-- Convenience views (query metadata without re-parsing JSON)
-- tangled_issues view moved to 20250624160000 (dedicated table).

create or replace view public.tangled_pulls as
select
    uri,
    author_did,
    repo_did,
    payload ->> 'title' as title,
    payload ->> 'body' as body,
    payload ->> 'createdAt' as created_at,
    payload
from public.tangled_atproto_records
where collection = 'sh.tangled.repo.pull';
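The `unique (method, repo_did, params_hash)` constraint on `tangled_xrpc_snapshots` only deduplicates well if the hash is stable across equivalent parameter dicts. The real `params_hash` lives in `pds_client` and is not shown in this diff, so the following is a sketch of one plausible approach, not the project's implementation. `stable_params_hash` is a hypothetical name, and SHA-256 over canonical JSON is an assumption.

```python
import hashlib
import json
from typing import Any


def stable_params_hash(params: dict[str, Any]) -> str:
    """Hash request params so that key order does not change the result."""
    # sort_keys plus fixed separators gives one canonical JSON string per dict.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same parameters in a different order hash identically; a different
# limit produces a different snapshot key.
hash_a = stable_params_hash({"repo": "did:example:r1", "limit": 100})
hash_b = stable_params_hash({"limit": 100, "repo": "did:example:r1"})
hash_c = stable_params_hash({"repo": "did:example:r1", "limit": 50})
```

With a hash like this, re-running a stage with the same parameters would upsert onto the existing snapshot row rather than adding a new one.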
+18
supabase/migrations/20250624120000_tangled_scraper_stage2_network.sql
-- Track where each repo was discovered (tngl PDS crawl vs appview/network index).

alter table public.tangled_repos
    add column if not exists discovered_via text;

alter table public.tangled_repos
    add column if not exists owner_handle text;

create index if not exists tangled_repos_discovered_via_idx
    on public.tangled_repos (discovered_via);

create index if not exists tangled_repos_owner_handle_idx
    on public.tangled_repos (owner_handle);

-- Backfill existing Stage 2 rows.
update public.tangled_repos
set discovered_via = 'tngl_pds'
where discovered_via is null;
+18
supabase/migrations/20250624130000_tangled_readmes.sql
-- README content fetched from knot git (sh.tangled.repo.tree + blob).

create table if not exists public.tangled_readmes (
    repo_did text primary key,
    repo_uri text,
    owner_handle text,
    repo_name text,
    knot_hostname text not null,
    readme_path text,
    status text not null, -- found | missing | error | skipped
    content text,
    size_bytes integer,
    error_message text,
    fetched_at timestamptz not null default now()
);

create index if not exists tangled_readmes_status_idx
    on public.tangled_readmes (status);
+20
supabase/migrations/20250624140000_tangled_readmes_embeddings.sql
-- One embedding vector per README (pgvector).
-- Model: Gemini gemini-embedding-001, 1536-dim, L2-normalized for cosine (<=>).

create extension if not exists vector;

alter table public.tangled_readmes
    add column if not exists embedding vector(1536),
    add column if not exists embedding_model text,
    add column if not exists embedded_at timestamptz;

comment on column public.tangled_readmes.embedding is
    'L2-normalized gemini-embedding-001 vector (1536); cosine via <=>.';

create index if not exists tangled_readmes_embedding_hnsw_idx
    on public.tangled_readmes using hnsw (embedding vector_cosine_ops)
    where embedding is not null;

create index if not exists tangled_readmes_unembedded_idx
    on public.tangled_readmes (repo_did)
    where status = 'found' and content is not null and embedding is null;
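The migration stores L2-normalized vectors and queries them with pgvector's `<=>` operator, which returns cosine distance (1 minus cosine similarity). The sketch below shows the same arithmetic in plain Python on tiny vectors. `l2_normalize` and `cosine_distance` are illustrative names, not functions from this repo.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


unit = l2_normalize([3.0, 4.0])
same_direction = cosine_distance(unit, [6.0, 8.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
```

Cosine distance ignores magnitude, so normalizing before storage does not change `<=>` rankings; it mainly keeps stored vectors comparable under other metrics too.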
+38
supabase/migrations/20250624150000_tangled_collaborators.sql
-- Repo ↔ collaborator edges (from knot sh.tangled.repo.listCollaborators).

create table if not exists public.tangled_repo_collaborators (
    repo_did text not null,
    collaborator_did text not null,
    added_by text,
    record_uri text,
    record_cid text,
    created_at timestamptz,
    first_seen_at timestamptz not null default now(),
    last_synced_at timestamptz not null default now(),
    primary key (repo_did, collaborator_did)
);

create index if not exists tangled_repo_collaborators_user_idx
    on public.tangled_repo_collaborators (collaborator_did);

create index if not exists tangled_repo_collaborators_repo_idx
    on public.tangled_repo_collaborators (repo_did);

-- Tracks repos we already checked (including zero collaborators).
create table if not exists public.tangled_repo_collaborators_sync (
    repo_did text primary key,
    collaborator_count integer not null default 0,
    synced_at timestamptz not null default now()
);

create or replace view public.tangled_user_collaborations as
select
    c.collaborator_did as user_did,
    c.repo_did,
    r.owner_handle,
    r.name as repo_name,
    r.uri as repo_uri,
    c.added_by,
    c.created_at
from public.tangled_repo_collaborators c
left join public.tangled_repos r on r.repo_did = c.repo_did;
+59
supabase/migrations/20250624160000_tangled_issues_table.sql
-- Dedicated issues table (replaces the old tangled_issues view on atproto_records).

create extension if not exists vector;

do $$
begin
    if exists (
        select 1 from pg_catalog.pg_class c
        join pg_catalog.pg_namespace n on n.oid = c.relnamespace
        where n.nspname = 'public' and c.relname = 'tangled_issues' and c.relkind = 'v'
    ) then
        execute 'drop view public.tangled_issues';
    end if;
end $$;

-- If a previous partial run left the table, keep it.
create table if not exists public.tangled_issues (
    uri text primary key,
    author_did text not null,
    author_handle text,
    rkey text not null,
    repo_did text,
    repo_uri text,
    title text,
    body text,
    state text not null default 'open', -- open | closed
    issue_created_at timestamptz,
    cid text,
    record_raw jsonb not null,
    fetched_at timestamptz not null default now(),
    embedding vector(1536),
    embedding_model text,
    embedded_at timestamptz
);

create index if not exists tangled_issues_author_did_idx
    on public.tangled_issues (author_did);

create index if not exists tangled_issues_repo_did_idx
    on public.tangled_issues (repo_did);

create index if not exists tangled_issues_state_idx
    on public.tangled_issues (state);

create index if not exists tangled_issues_embedding_hnsw_idx
    on public.tangled_issues using hnsw (embedding vector_cosine_ops)
    where embedding is not null;

-- Tracks which user PDSes were scanned for issues (including zero issues).
create table if not exists public.tangled_issue_user_sync (
    user_did text primary key,
    issue_count integer not null default 0,
    synced_at timestamptz not null default now()
);

create or replace view public.tangled_open_issues as
select *
from public.tangled_issues
where state = 'open';
+5
supabase/migrations/20250624170000_tangled_issue_user_sync_status.sql
-- Track per-user issue scan outcomes (including failures we should not retry forever).

alter table public.tangled_issue_user_sync
    add column if not exists status text not null default 'ok',
    add column if not exists error_message text;
+25
supabase/migrations/20250624200000_tangled_issue_questionnaires.sql
-- AI-solve questionnaires: one cached JSON tree per issue (engine GET /questionnaire).

create table if not exists public.tangled_issue_questionnaires (
    issue_uri text primary key,
    payload jsonb not null,
    created_at timestamptz not null default now(),
    updated_at timestamptz not null default now(),
    constraint tangled_issue_questionnaires_payload_is_object
        check (jsonb_typeof(payload) = 'object')
);

comment on table public.tangled_issue_questionnaires is
    'Cached branching questionnaire JSON per sh.tangled.repo.issue AT-URI (AI-solve engine).';

comment on column public.tangled_issue_questionnaires.issue_uri is
    'at://…/sh.tangled.repo.issue/<rkey> — same key as tangled_issues.uri when indexed.';

comment on column public.tangled_issue_questionnaires.payload is
    'Full questionnaire object (version 2): introduction, items, followups tree.';

create index if not exists tangled_issue_questionnaires_updated_at_idx
    on public.tangled_issue_questionnaires (updated_at desc);

create index if not exists tangled_issue_questionnaires_payload_gin_idx
    on public.tangled_issue_questionnaires using gin (payload);
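The `payload_is_object` check rejects any questionnaire payload whose top-level JSON value is not an object. A writer could apply the same test before inserting, so a bad payload fails early instead of at the database. This is a sketch only: `is_json_object` is a hypothetical helper, and the sample keys follow the column comment rather than a confirmed schema.

```python
import json


def is_json_object(raw: str) -> bool:
    """Mirror jsonb_typeof(payload) = 'object': true only for a top-level JSON object."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict)


# Assumed shape, based on the column comment (version, introduction, items).
sample_ok = json.dumps({"version": 2, "introduction": "…", "items": []})
sample_array = json.dumps([{"version": 2}])

ok_result = is_json_object(sample_ok)
array_result = is_json_object(sample_array)
invalid_result = is_json_object("{not json")
```

A top-level array is valid JSON but would still violate the constraint, which is why the check tests the decoded type and not just parseability.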