handover/TESTING_PROTOCOL.md at 1ee9e5cfd0d4875b75f91e03d9f29d4008b8cf38 · nateholland.bsky.social/PoseDetection

virtualintern feat: use camera input aspect ratio for object detection 2mo ago
af9479b8
# Testing Protocol: Comparing Models in the Kima App

This document describes how to validate that a model behaves correctly in the kima app and how to A/B test different candidates against each other. The protocol mirrors what worked in the R&D project (60s back-to-back captures of the same video segment) but is portable to any Android app that can load multiple TFLite models.

## Setup requirements

### Hardware
- **A test phone** that runs the kima app. The R&D project used a Samsung Galaxy A36 5G; different phones will give different absolute Hz numbers, but the relative ordering of models should be similar.
- **A mounted phone holder / tripod** that does not move between captures. This is the single most important variable for getting reproducible results.
- **A monitor or large screen** for playing the test video. The phone films the screen.

### Test video
- **Use the same video for all comparisons.** The R&D project used `~/Downloads/20250901_100828.mp4` (a 4K, 165s, 16:9 basketball court clip with shooting). Any consistent video with visible basketballs and hoops works; pick one once and keep using it.
- **Use a fixed segment** (e.g. seconds 10–70 = 60s of video). Don't sample randomly; the same frames need to flow into every model under test.
- Play the video **fullscreen** in QuickTime or any other player, so that window chrome stays out of the frame.

### Android side
- Camera pinned to `RATIO_4_3`, activity orientation locked to landscape (see `ANDROID_INTEGRATION.md`).
- The kima app needs a way to **switch between models at runtime** (a debug picker, an intent extra, or just rebuilding with a different default model in assets).
- The kima app needs a way to **log detector output to a file or logcat** so you can analyze it after the run. Minimum schema below.

## Minimum logging schema

For each model run, capture per-frame events with the following structure (this is what `compare_logs.py` consumes):

```json
{
  "device_model": "samsung SM-A366B",
  "model_name": "yolo26n_v11_rect_512x384",
  "run_id": "1775898858800_yolo26n_v11_rect_512x384",
  "started_wall_ms": 1775898858801,
  "stopped_wall_ms": 1775898918803,
  "events": [
    {
      "wall_ms": 1775898858900,
      "skeleton": null,
      "objects": [
        {
          "label": "basketball",
          "confidence": 0.872,
          "bbox": { "left": 412.5, "top": 230.1, "right": 488.2, "bottom": 308.9 },
          "frame_size": { "width": 1280, "height": 960 }
        }
      ]
    }
  ]
}
```

Key fields:
- `wall_ms`: absolute wall-clock timestamp of each detector emission. Used to compute Hz and to align events across runs.
- `objects[]`: list of detections per frame. An empty list means the detector ran but found nothing.
- `confidence`: the confidence score from the detector output (0–1).
- `bbox.left/top/right/bottom`: pixel coordinates in the original frame's coordinate space (after un-letterboxing).

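Since the schema is what the tooling depends on, it can help to check a captured log against it before feeding it to the report generator. The following is a minimal sketch of such a check, not the validation `compare_logs.py` performs (that script is not reproduced here); the function name `validate_run` and the exact set of required keys are assumptions drawn from the example above.

```python
# Hypothetical helper: checks a parsed run log against the fields shown above.
REQUIRED_RUN_KEYS = {"device_model", "model_name", "run_id",
                     "started_wall_ms", "stopped_wall_ms", "events"}
REQUIRED_BBOX_KEYS = {"left", "top", "right", "bottom"}


def validate_run(run: dict) -> list:
    """Return a list of human-readable schema problems (empty list = OK)."""
    problems = []
    missing = REQUIRED_RUN_KEYS - set(run)
    if missing:
        problems.append(f"missing run keys: {sorted(missing)}")
    for i, event in enumerate(run.get("events") or []):
        if "wall_ms" not in event:
            problems.append(f"event {i}: missing wall_ms")
        for j, obj in enumerate(event.get("objects") or []):
            conf = obj.get("confidence")
            if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
                problems.append(f"event {i} object {j}: confidence not in 0-1")
            if REQUIRED_BBOX_KEYS - set(obj.get("bbox") or {}):
                problems.append(f"event {i} object {j}: incomplete bbox")
    return problems


sample_run = {
    "device_model": "samsung SM-A366B",
    "model_name": "yolo26n_v11_rect_512x384",
    "run_id": "1775898858800_yolo26n_v11_rect_512x384",
    "started_wall_ms": 1775898858801,
    "stopped_wall_ms": 1775898918803,
    "events": [{
        "wall_ms": 1775898858900,
        "skeleton": None,
        "objects": [{
            "label": "basketball",
            "confidence": 0.872,
            "bbox": {"left": 412.5, "top": 230.1, "right": 488.2, "bottom": 308.9},
            "frame_size": {"width": 1280, "height": 960},
        }],
    }],
}

print(validate_run(sample_run))
```

An empty list means the log has every field the example uses. In practice the dict would come from `json.load` on a captured file.
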
The R&D project's reference implementation lives in:
- `posedetection/src/commonMain/kotlin/com/nate/posedetection/ExperimentLogger.kt`: the logger
- `posedetection/src/commonMain/kotlin/com/nate/posedetection/ExperimentAuto.kt`: the auto-mode driver

The kima app can implement its own equivalent in any language/framework. The schema is the contract.

## The protocol

### Step 0: Position the phone

Mount the phone in landscape, aimed at the monitor showing the test video. **Don't move it between captures.** Even small bumps shift the framing and can swing yolo26n confidence by 0.10 or more.

### Step 1: Noise-floor sanity check

Before testing the model you actually care about, capture a known-stable model and verify it lands at the expected confidence. The handover bundle includes `yolo11n_noise_floor_baseline.tflite` for exactly this purpose.

```
Run: yolo11n_noise_floor_baseline.tflite for 60s
Expected mean confidence: ~0.72 (range 0.71–0.76 across many sessions)
```

If this lands more than 0.05 off the expected value, **reposition the phone** before continuing. The lighting, phone framing, or video timing has drifted, and any subsequent A/B test will be measuring noise instead of model quality.

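The sanity check above can be automated against the logged JSON. This is a rough sketch that assumes "mean confidence" is the plain average over every logged detection in the run; the exact averaging used by `compare_logs.py` is not shown in this document, and the function names are illustrative.

```python
from statistics import mean

EXPECTED_BASELINE_CONF = 0.72  # expected value quoted in this protocol
MAX_DRIFT = 0.05               # "more than 0.05 off" means reposition


def mean_confidence(run):
    """Average confidence over all detections in a run, or None if there are none."""
    confs = [obj["confidence"]
             for event in run["events"]
             for obj in event["objects"]]
    return mean(confs) if confs else None


def noise_floor_ok(run, expected=EXPECTED_BASELINE_CONF, max_drift=MAX_DRIFT):
    """True when the baseline run lands within max_drift of the expected value."""
    conf = mean_confidence(run)
    return conf is not None and abs(conf - expected) <= max_drift


# Tiny synthetic baseline run (hypothetical numbers, not measured data).
baseline_run = {"events": [
    {"wall_ms": 0, "objects": [{"confidence": 0.70}]},
    {"wall_ms": 150, "objects": [{"confidence": 0.74}]},
]}
print(noise_floor_ok(baseline_run))
```

A run with no detections at all is treated as a failed check, since that also points at a setup problem.
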
### Step 2: Capture the actual A/B set

Run **60-second back-to-back captures** of each model under test, against the same video segment, **without moving anything** between captures. The R&D project used this sequence as the standard 5-way comparison:

1. `yolo11n_noise_floor_baseline.tflite` (sanity)
2. The model you're considering shipping (e.g. `yolo26n_v11_rect_512x384.tflite`)
3. A direct competitor (e.g. `yolo26n_v11_square_416.tflite` for the rect-vs-square comparison)
4. A speed alternative (e.g. `yolo26n_v11_rect_384x288.tflite`)
5. A quality reference (e.g. `yolo26n_v11_square_512.tflite` or `yolo11n_dataset640.tflite`)

60 seconds is the minimum for stable measurements. 10s clips have a noise floor of ~±0.10 mean conf, which is too high to detect real differences. 60s clips drop the noise floor to ~±0.05 (still significant for yolo26n, see below).

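One way to see the short-clip noise in your own setup is to slice a 60s capture into 10s windows and look at how far the per-window mean confidence wanders. The sketch below is an illustrative analysis, not part of the R&D tooling; the function name and the windowing by `started_wall_ms` are assumptions.

```python
from collections import defaultdict
from statistics import mean


def window_means(run, window_ms=10_000):
    """Mean confidence per fixed-size time window, keyed by window index."""
    buckets = defaultdict(list)
    start = run["started_wall_ms"]
    for event in run["events"]:
        idx = (event["wall_ms"] - start) // window_ms
        for obj in event["objects"]:
            buckets[idx].append(obj["confidence"])
    return {idx: mean(confs) for idx, confs in sorted(buckets.items())}


# Synthetic run with two windows (hypothetical numbers, not measured data).
demo_run = {
    "started_wall_ms": 0,
    "events": [
        {"wall_ms": 1_000, "objects": [{"confidence": 0.6}]},
        {"wall_ms": 11_000, "objects": [{"confidence": 0.8}]},
    ],
}
per_window = window_means(demo_run)
spread = max(per_window.values()) - min(per_window.values())
print(per_window, round(spread, 2))
```

If the spread between windows is comparable to the difference you are trying to measure between two models, the clip is too short to separate them.
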
### Step 3: Run the report generator

Drop the captured JSON files into a folder (one subfolder per run, with the JSON inside) and run `reference_code/compare_logs.py`:

```bash
python3 compare_logs.py /path/to/captures/ -o /tmp/reports --bucket 250
open /tmp/reports/report.html
```

The HTML report shows side-by-side scorecards (events, det rate, mean conf, mean bbox area, per-class), per-bucket histograms, and per-class timelines. Look at:
- **Mean confidence**: the headline quality number
- **Detection rate**: should be 100% on a video with always-visible objects
- **Mean bbox area**: proxy for box tightness; lower = tighter (usually better)
- **Per-class timelines**: visualize where each model loses detections

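For a quick look at the headline numbers without generating the HTML report, the scorecard fields can be approximated directly from one log. This sketch assumes detection rate means the share of events with at least one detection and that bbox area is computed in raw pixel units; `compare_logs.py` may define or scale these differently, so treat the output as a rough cross-check only.

```python
from statistics import mean


def scorecard(run):
    """Approximate scorecard for one run (definitions are assumptions, see above)."""
    events = run["events"]
    objects = [obj for event in events for obj in event["objects"]]
    areas = [
        (o["bbox"]["right"] - o["bbox"]["left"]) * (o["bbox"]["bottom"] - o["bbox"]["top"])
        for o in objects
    ]
    return {
        "events": len(events),
        # Share of detector emissions that contained at least one object.
        "det_rate": sum(1 for e in events if e["objects"]) / len(events) if events else 0.0,
        "mean_conf": mean(o["confidence"] for o in objects) if objects else None,
        "mean_bbox_area_px2": mean(areas) if areas else None,
    }


# Synthetic run: one event with a detection, one without (hypothetical numbers).
demo_run = {"events": [
    {"wall_ms": 0, "objects": [
        {"confidence": 0.5, "bbox": {"left": 0, "top": 0, "right": 10, "bottom": 20}},
    ]},
    {"wall_ms": 150, "objects": []},
]}
card = scorecard(demo_run)
print(card)
```
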
### Step 4: Interpret with the noise floor in mind

**The yolo26n family is noticeably more sensitive to small framing changes than yolo11n.** Across multiple R&D sessions, the same `yolo26n_v11_square_416.tflite` measured between **0.64 and 0.84 mean confidence** depending on how the phone was positioned. The yolo11n baseline barely budged across the same conditions.

What this means:
- **Within-session comparisons are reliable.** All models in the same back-to-back capture sequence saw the same framing, so their relative ordering is meaningful. If model A scores 0.80 and model B scores 0.70 in the same session, A is genuinely better.
- **Cross-session comparisons are not reliable** unless you re-validate via the noise-floor model. Don't quote V5.x results from this project's history as direct benchmarks for the kima app's measurements.

If your A/B test gives results that surprise you, run Step 1 again, since the phone might have shifted between captures.

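A small helper can make the within-session rule concrete: rank the models from one session and mark which ones trail the best by more than the ~0.05 noise floor quoted for 60s clips. This is an illustrative sketch; the function name and the choice to compare every model against the session's best are assumptions, and the model names below are placeholders.

```python
NOISE_FLOOR = 0.05  # approximate 60s noise floor quoted in this protocol


def rank_session(mean_confs, noise_floor=NOISE_FLOOR):
    """Rank models from ONE session by mean confidence.

    Returns (name, mean_conf, clearly_worse_than_best) tuples, best first.
    Inputs must come from the same back-to-back capture sequence.
    """
    ranked = sorted(mean_confs.items(), key=lambda kv: kv[1], reverse=True)
    best_conf = ranked[0][1]
    return [(name, conf, (best_conf - conf) > noise_floor) for name, conf in ranked]


# The 0.80 vs 0.70 case from the text, plus a near tie (hypothetical names).
session = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.78}
for row in rank_session(session):
    print(row)
```

Here `model_b` is flagged as clearly worse than `model_a`, while `model_c` falls inside the noise floor and should be treated as a tie.
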
## Recommended A/B sequence for kima

When you start integrating models, run this sequence in order. Each step should yield a clear yes/no decision before proceeding.

### A. "Does the model load and run?"
- Build kima with `yolo26n_v11_rect_512x384.tflite` in assets.
- Cold-start the app and verify that logcat shows `delegate=GPU inputShape=[1, 384, 512, 3]`.
- Aim the camera at any basketball-containing scene and verify detections appear.
- **Decision**: if yes, proceed. If no, debug per `ANDROID_INTEGRATION.md` "Common gotchas".

### B. "Is the geometry right?"
- Run a 30s capture against the test video.
- In the report, check **mean bbox area**. For `yolo26n_v11_rect_512x384` it should be in the 480–560 range (loose box) on the standard test video.
- Visually check 5–10 frames in the per-class image gallery (if the kima app supports this). The boxes should hug the basketball, not pad around it.
- **Decision**: if boxes look correctly placed, the letterbox preprocessing is right. If they're consistently offset or distorted, re-check the un-letterbox math.

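When re-checking the un-letterbox math, it can help to compare against a reference calculation. The sketch below assumes the common convention of scaling the frame uniformly to fit the model input and centring it with equal padding on both sides; the kima preprocessing described in `ANDROID_INTEGRATION.md` may differ (for example, top-left padding), so treat this as a sanity reference rather than the app's actual code.

```python
def unletterbox(box, input_size, frame_size):
    """Map a (left, top, right, bottom) box from model-input pixels to frame pixels.

    Assumes a uniform scale plus centred padding (an assumption, see above).
    """
    in_w, in_h = input_size
    frame_w, frame_h = frame_size
    scale = min(in_w / frame_w, in_h / frame_h)
    pad_x = (in_w - frame_w * scale) / 2
    pad_y = (in_h - frame_h * scale) / 2
    left, top, right, bottom = box
    return (
        (left - pad_x) / scale,
        (top - pad_y) / scale,
        (right - pad_x) / scale,
        (bottom - pad_y) / scale,
    )


# 4:3 frame into the 512x384 rect input: aspect ratios match, so no padding.
rect_box = unletterbox((164.8, 92.0, 195.2, 123.6), (512, 384), (1280, 960))

# 16:9 frame into a 416x416 square input: padding appears above and below.
square_box = unletterbox((208, 208, 208, 208), (416, 416), (1280, 720))

print(rect_box, square_box)
```

In the second case the centre of the model input maps back to the centre of the frame, which is a quick way to confirm that the padding offsets are being subtracted.
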
### C. "Is the speed acceptable?"
- Run a 60s capture and look at events/sec (Hz).
- Expect ~5–7 Hz for `yolo26n_v11_rect_512x384` on a modern phone. Anything below 3 Hz means the GPU delegate didn't engage.
- **Decision**: if Hz is in the expected range, proceed. If much slower, check delegate selection in logcat.

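The Hz figure can be derived from the log alone. This is a minimal sketch that assumes Hz is simply the number of logged events divided by the run's wall-clock duration; the thresholds come from the bullets above, while the function names and verdict strings are illustrative.

```python
def events_per_second(run):
    """Detector emissions per second over the whole run."""
    duration_s = (run["stopped_wall_ms"] - run["started_wall_ms"]) / 1000
    return len(run["events"]) / duration_s if duration_s > 0 else 0.0


def speed_verdict(hz):
    """Classify a measured rate using the thresholds quoted in this protocol."""
    if hz < 3:
        return "check delegate: GPU delegate probably did not engage"
    if hz < 5:
        return "slower than the expected ~5-7 Hz"
    return "ok"


# Synthetic 60s run with 360 evenly spaced events (hypothetical numbers).
demo_run = {
    "started_wall_ms": 0,
    "stopped_wall_ms": 60_000,
    "events": [{"wall_ms": i * 166, "objects": []} for i in range(360)],
}
hz = events_per_second(demo_run)
print(hz, speed_verdict(hz))
```
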
### D. "Is it production quality?"
- Run the noise-floor sanity check (`yolo11n_noise_floor_baseline.tflite` should hit ~0.72).
- Run the A/B comparison: `yolo26n_v11_rect_512x384` vs `yolo26n_v11_square_416`.
- Expect `yolo26n_v11_rect_512x384` to win on confidence by ~0.05–0.15 in a within-session comparison.
- **Decision**: if the rect model wins, ship it. If not, the kima camera/orientation config probably isn't matching the model's expected aspect ratio; re-check `ANDROID_INTEGRATION.md` steps 1 and 2.

### E. "Does it work in real-world conditions?"
- Beyond the test video, capture 5–10 minutes of real basketball footage with the kima app and the chosen model.
- Look for consistent detections across distance, angle, and lighting changes.
- This is the only check the test video can't cover: the model was trained on indoor court footage that is similar, but not identical, to whatever real users will film.

## What NOT to test against

- **Don't test fp16 vs fp32 separately.** They give the same speed and a ~0.02 conf difference on this hardware. This is settled: use fp32, since file size doesn't matter at this scale.
- **Don't run val mAP comparisons** as the deciding metric. The val set is held-out from Roboflow and doesn't reflect on-device performance well (see `LESSONS_LEARNED.md`).
- **Don't compare against the V5.x historical numbers in this handover.** They were measured on a different phone setup and the within-session noise is real. The history is useful for understanding the *direction* of each tuning lever, not the absolute numbers in the kima app's measurement environment.

## If you only have time for one test

Run the A/B at step **D**: capture `yolo11n_noise_floor_baseline`, `yolo26n_v11_rect_512x384`, and `yolo26n_v11_square_416` back-to-back for 60 seconds each on the same test video. If the rect model wins on confidence and the noise floor is on target, ship the rect model. If the noise floor is off, fix the phone setup first and re-run.