handover/MODELS.md at master · nateholland.bsky.social/PoseDetection

# Models — Detailed Comparison

All measurements were taken on a **Samsung Galaxy A36 5G** with the **TFLite GPU delegate**. The test app filmed a 4K (3840×2160) basketball video played on a Mac monitor in landscape 16:9, capturing 60 seconds starting at video offset 10 s. The phone was in landscape orientation with the camera pinned to `RATIO_4_3`, and the Android detector used letterbox preprocessing.

**Run-to-run noise floor on this test setup**: ~±0.05 mean confidence on stable models like `yolo11n_dataset_dataset`. The yolo26n family is more sensitive (~±0.10): small changes in phone position can shift its measured confidence noticeably without affecting framerate or detection rate. Run a noise check before any A/B test: capture `yolo11n_noise_floor_baseline.tflite` first and expect ~0.72 mean conf.

## All 6 candidates at a glance

| File | Arch | Input shape | Pixels | **Hz** | **Mean conf range** | val mAP50 | Size | Use case |
|---|---|---|---:|---:|---:|---:|---:|---|
| **`yolo26n_v11_rect_512x384.tflite`** ⭐ | yolo26n | `[1, 384, 512, 3]` | 197k | **6.7** | **0.78–0.83** | 0.7632 | 9.3 MB | **Recommended production default** |
| `yolo26n_v11_rect_384x288.tflite` | yolo26n | `[1, 288, 384, 3]` | 111k | **10.5** | 0.59–0.73 | 0.6456 | 9.3 MB | Real-time tracking, >10 Hz needed |
| `yolo26n_v11_square_512.tflite` | yolo26n | `[1, 512, 512, 3]` | 262k | 5.0 | 0.78–0.83 | 0.8206 | 9.3 MB | Quality-first, square inputs OK |
| `yolo26n_v11_square_416.tflite` | yolo26n | `[1, 416, 416, 3]` | 173k | 7.5 | 0.64–0.84 | 0.7649 | 9.3 MB | Square baseline reference |
| `yolo11n_dataset640.tflite` | yolo11n | `[1, 640, 640, 3]` | 410k | 3.2 | 0.81–0.85 | 0.881 | 10.7 MB | Historical conf champion ⚠️ caveats |
| `yolo11n_noise_floor_baseline.tflite` | yolo11n | `[1, 416, 416, 3]` | 173k | 6.9 | 0.72 (stable) | — | 10.6 MB | **Sanity check / noise floor reference** |

⭐ = recommended default. ⚠️ = see caveats below.

## Tier 1 — Recommended Production Default

### `yolo26n_v11_rect_512x384.tflite`

The clear winner of the R&D session. Trained at `imgsz=512` with `rect=True`, which produces 384×512 (height × width) rectangular tensors per batch because the dataset's natural aspect ratio is 4:3. The training and inference tensor shapes match exactly, eliminating the train/inference distribution shift that hurts square-trained models on rectangular cameras.
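As a rough illustration of where the 384×512 and 288×384 shapes come from, the sketch below scales the short side by the 4:3 aspect ratio and rounds it up to a multiple of the model stride. The `rect_tensor_shape` helper and the stride of 32 are assumptions made for this example; they are not taken from the training logs.

```python
import math


def rect_tensor_shape(imgsz, aspect_w=4, aspect_h=3, stride=32):
    """Return (height, width) for a landscape tensor whose long side is imgsz.

    Hypothetical helper: the short side is scaled by aspect_h / aspect_w and
    rounded up to the next multiple of `stride` (assumed to be 32 here).
    """
    short_side = math.ceil(imgsz * aspect_h / (aspect_w * stride)) * stride
    return short_side, imgsz


if __name__ == "__main__":
    for imgsz in (512, 384):
        height, width = rect_tensor_shape(imgsz)
        print(f"imgsz={imgsz} -> {height}x{width} ({height * width} pixels)")
```

Under these assumptions both `imgsz` values used in this document land on stride-aligned 4:3 shapes without any extra rounding.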

**Strengths:**
- Best on-device confidence in the ~7 Hz speed class.
- 14% more pixels than the square 416² baseline (197k vs 173k) at roughly 11% lower frame rate (6.7 Hz vs 7.5 Hz), which is close to a strict Pareto improvement.
- Natural 4:3 alignment with the camera. No wasted padding pixels at inference.
- val mAP50 of 0.7632 essentially matches the square 416² baseline (0.7649), but the model **significantly outperforms it on the on-device test** (+0.07 to +0.15 mean conf in the same session). The val/on-device gap is the `rect=True` benefit; see `LESSONS_LEARNED.md`.
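To make the "no wasted padding" point concrete, here is a small sketch of the letterbox arithmetic. The `letterbox` and `padding_fraction` helpers and the 1440×1080 camera frame are assumptions for illustration only; the Android detector's actual implementation is not shown in this document.

```python
def letterbox(src_w, src_h, dst_w, dst_h):
    """Fit a (src_w, src_h) frame inside (dst_w, dst_h), keeping aspect ratio.

    Returns (scaled_w, scaled_h, pad_x, pad_y), where pad_x and pad_y are the
    padding added on EACH side (left/right and top/bottom) in tensor pixels.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    scaled_w = round(src_w * scale)
    scaled_h = round(src_h * scale)
    pad_x = (dst_w - scaled_w) / 2
    pad_y = (dst_h - scaled_h) / 2
    return scaled_w, scaled_h, pad_x, pad_y


def padding_fraction(src_w, src_h, dst_w, dst_h):
    """Fraction of the input tensor that holds padding instead of image content."""
    scaled_w, scaled_h, _, _ = letterbox(src_w, src_h, dst_w, dst_h)
    return 1 - (scaled_w * scaled_h) / (dst_w * dst_h)


if __name__ == "__main__":
    frame_w, frame_h = 1440, 1080  # hypothetical 4:3 camera frame
    for dst_w, dst_h in ((512, 384), (416, 416), (512, 512)):
        frac = padding_fraction(frame_w, frame_h, dst_w, dst_h)
        print(f"{dst_w}x{dst_h}: {frac:.0%} of the tensor is padding")
```

For any 4:3 frame, the rectangular 512×384 tensor is filled completely, while a square tensor spends a quarter of its pixels on padding.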

**Trade-offs:**
- `rect=True` auto-disables mosaic augmentation during training. The val mAP took a small hit, but on-device it beats every prior square-trained variant in within-session comparisons.
- Locked to landscape 4:3. The kima app must pin CameraX to `RATIO_4_3` in landscape orientation.

## Tier 2 — Speed-Critical (10+ Hz)

### `yolo26n_v11_rect_384x288.tflite`

Same training method as the Tier 1 model but at `imgsz=384` with `rect=True`, which yields 288×384 tensors. The smaller pixel budget makes it faster but less accurate.

**Strengths:**
- Cleared the **10 Hz target** in measurement (10.5 Hz). Besides the deprecated 320² square, it is the only model in the lineup that does so.
- Same `rect=True` training benefit: the rectangular geometry alignment translates into much higher confidence than the comparable 320² square model (which dropped to ~0.59 even though it has a similar pixel count).

**Trade-offs:**
- Quality drops noticeably vs the 512×384 rect: mean conf ~0.59–0.73 (depending on session noise). That is acceptable for a real-time tracker that smooths per-frame jitter but weak for one-shot frame analysis.
- Same 4:3 landscape requirement.

**When to use it:** if the downstream consumer is a Kalman tracker or smoothing filter on basketball position that benefits from 10+ Hz position updates and can tolerate per-frame noise.
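A minimal sketch of the kind of smoothing such a consumer might apply. It uses an alpha-beta filter, a simpler fixed-gain relative of the Kalman filter, rather than a full Kalman tracker. The class name, the gains, and the sample measurements are all assumptions chosen for illustration.

```python
class AlphaBetaTracker:
    """Fixed-gain constant-velocity smoother for a 2D position (hypothetical)."""

    def __init__(self, alpha=0.5, beta=0.1):
        self.alpha = alpha      # how strongly a measurement corrects position
        self.beta = beta        # how strongly a measurement corrects velocity
        self.pos = None         # last smoothed (x, y)
        self.vel = (0.0, 0.0)   # estimated velocity in units per second

    def update(self, measurement, dt):
        """Fold in one (x, y) detection taken dt seconds after the previous one."""
        if self.pos is None:
            self.pos = measurement
            return self.pos
        # Predict where the ball should be if it kept its current velocity.
        pred_x = self.pos[0] + self.vel[0] * dt
        pred_y = self.pos[1] + self.vel[1] * dt
        # Residual between the detection and the prediction.
        res_x = measurement[0] - pred_x
        res_y = measurement[1] - pred_y
        # Blend prediction and measurement, then nudge the velocity estimate.
        self.pos = (pred_x + self.alpha * res_x, pred_y + self.alpha * res_y)
        self.vel = (self.vel[0] + self.beta * res_x / dt,
                    self.vel[1] + self.beta * res_y / dt)
        return self.pos


if __name__ == "__main__":
    dt = 1 / 10.5  # update interval at the measured 10.5 Hz
    tracker = AlphaBetaTracker()
    # Made-up box centres (normalized), including one jittery frame.
    for centre in [(0.40, 0.50), (0.42, 0.50), (0.47, 0.53), (0.46, 0.50)]:
        print(tracker.update(centre, dt))
```

Higher update rates shrink `dt`, so each prediction step covers less motion and a jittery frame moves the smoothed estimate less.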

## Tier 3 — Quality-First (slower)

### `yolo26n_v11_square_512.tflite`

Square 512², trained without `rect=True`. Slower than the rect 512×384 (5.0 Hz vs 6.7 Hz) but with higher val mAP50 (0.8206 vs 0.7632) and similar on-device confidence. The extra pixels go to padding the 4:3 source rather than to effective content, so the on-device quality benefit is minor. It is still a known-good square model if landscape lock isn't acceptable.

**When to use it:** if the kima app cannot pin orientation to landscape (e.g., the app design needs portrait and landscape support) and you need the highest quality among square models.

### `yolo11n_dataset640.tflite` ⚠️

Trained early on, using the **stretched** Roboflow dataset v8 (before we discovered that Roboflow's "Resize: Stretch to" preprocessing was destroying aspect ratios; see `LESSONS_LEARNED.md`). It has the **highest measured on-device mean confidence** of the entire R&D session (~0.84) but with **known box-shape artifacts**, because the model learned distorted geometry.

**Use only as a quality reference** to establish what "best possible accuracy on this clip" looks like. Do not ship.
 66## Reference baselines
 67
 68### `yolo26n_v11_square_416.tflite`
 69
 70Square 416² yolo26n trained on the v11 letterboxed dataset without `rect=True`. The historical "best balance" champion before the rect=True experiments. Similar pixel count (173k) to the new rect 512×384 (197k, +14%). **This is the right A/B comparison target** when validating that the rect=True model wins in the kima app's specific test environment.
 71
 72### `yolo11n_noise_floor_baseline.tflite`
 73
 74This is `yolo11n_dataset_dataset.tflite` from the original repo — a stable, oft-tested model whose measured mean confidence stays consistently in the 0.71–0.76 range across many sessions on this test setup. **Use it as a sanity check before any A/B test**: capture this model on the test video, compare against the historical ~0.72 baseline, and if it's >0.05 off, the test setup has drifted (phone position, lighting, framing) — fix that before judging other models.
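The sanity check above amounts to a simple threshold test. The sketch below uses the ~0.72 reference and the 0.05 tolerance from this document; the function names and the sample confidences are hypothetical.

```python
BASELINE_MEAN_CONF = 0.72  # historical mean confidence of the baseline model
MAX_DRIFT = 0.05           # tolerance before the setup counts as drifted


def mean_confidence(confidences):
    """Average the per-detection confidences logged during one capture."""
    if not confidences:
        raise ValueError("no detections captured; cannot compute a mean")
    return sum(confidences) / len(confidences)


def setup_has_drifted(measured_mean, baseline=BASELINE_MEAN_CONF,
                      tolerance=MAX_DRIFT):
    """Return True when the baseline run is more than `tolerance` off."""
    return abs(measured_mean - baseline) > tolerance


if __name__ == "__main__":
    run = [0.70, 0.74, 0.69, 0.75]  # made-up confidences for illustration
    measured = mean_confidence(run)
    print(f"mean conf {measured:.3f}, drifted: {setup_has_drifted(measured)}")
```

If `setup_has_drifted` returns `True`, re-check phone position, lighting, and framing before capturing any of the candidate models.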

## Cross-model summary table

For quick decision-making:

| Use case | Model | Hz | Mean conf | Notes |
|---|---|---:|---:|---|
| **Default** | `yolo26n_v11_rect_512x384.tflite` | 6.7 | ~0.80 | Use this unless you have a specific reason not to |
| Real-time tracking @ 10+ Hz | `yolo26n_v11_rect_384x288.tflite` | 10.5 | ~0.65 | Tracker absorbs the per-frame noise |
| Highest one-shot accuracy (rect-aspect lock OK) | `yolo26n_v11_square_512.tflite` | 5.0 | ~0.81 | Slower, marginal quality win |
| Highest one-shot accuracy (no aspect lock) | `yolo26n_v11_square_416.tflite` | 7.5 | ~0.78 | Square, no orientation constraints |
| A/B test sanity check | `yolo11n_noise_floor_baseline.tflite` | 6.9 | 0.72 | Run first, verify ~0.72 before testing other models |

## Output schema (all models)

All six models share the same output tensor schema, since they all came from the ultralytics export pipeline with `nms=True` (which is auto-handled for end2end yolo26 models):

- Shape: `[1, 300, 6]`
- Per-row format: `[x1, y1, x2, y2, conf, cls]` where:
  - `x1, y1` = top-left corner, normalized `[0, 1]` over the model's input tensor
  - `x2, y2` = bottom-right corner, same normalization
  - `conf` = confidence score in `[0, 1]`
  - `cls` = float class index (0.0 = basketball, 1.0 = basketball_hoop)
- Up to 300 detections per frame, sorted by confidence descending. Trailing rows have `conf = 0`.
- **No additional NMS needed in the Android code** — it's already baked into the graph.
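The schema above can be decoded roughly as follows. This is a sketch in plain Python over nested lists, not the Android detector's code; `decode_detections` and the 0.25 confidence threshold are assumptions for illustration. Boxes are scaled to pixels of the model input tensor, so undoing the letterbox to reach camera-frame coordinates would be a separate step.

```python
CLASS_NAMES = {0: "basketball", 1: "basketball_hoop"}


def decode_detections(output, input_w, input_h, conf_threshold=0.25):
    """Convert a [1, 300, 6] output (nested lists) into a list of dicts.

    Each row is [x1, y1, x2, y2, conf, cls] with coordinates normalized to
    [0, 1] over the model input tensor. The threshold is a hypothetical value.
    """
    rows = output[0]  # drop the batch dimension
    detections = []
    for x1, y1, x2, y2, conf, cls in rows:
        if conf < conf_threshold:
            # Rows are sorted by confidence, so all remaining rows are lower.
            break
        detections.append({
            "label": CLASS_NAMES[int(round(cls))],
            "conf": conf,
            "box": (x1 * input_w, y1 * input_h, x2 * input_w, y2 * input_h),
        })
    return detections


if __name__ == "__main__":
    # Two made-up detections followed by empty trailing rows (conf = 0).
    fake_output = [[[0.25, 0.5, 0.5, 0.75, 0.9, 0.0],
                    [0.1, 0.1, 0.2, 0.2, 0.6, 1.0]]
                   + [[0.0] * 6 for _ in range(298)]]
    for det in decode_detections(fake_output, input_w=512, input_h=384):
        print(det)
```

Because NMS is already in the graph, the only filtering done here is the confidence cut-off.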

## Class labels

```
0  basketball
1  basketball_hoop
```