handover/LESSONS_LEARNED.md at master · nateholland.bsky.social/PoseDetection

# Lessons Learned

Compressed list of the key findings from the R&D session that produced these models. Each one cost real time to learn, so read this before making decisions about training or integration.

## Training-side

### 1. **NEVER train on a stretched dataset.**

Roboflow's `Preprocessing → Resize → Stretch to NxN` distorts every training image to a square, destroying aspect ratios. YOLO expects letterboxed input, so a stretch-trained model learns the wrong geometry priors and produces loose, distorted bboxes at inference, even after the inference-side preprocessing is fixed.

**Cost paid**: Weeks of accuracy issues blamed on the model architecture before discovering it was the dataset preprocessing. Required regenerating the Roboflow dataset (versions v9 → v10 → v11) and retraining everything.

**Always use** `Preprocessing → Resize → "Fit" or "None"`. "None" is preferred: let ultralytics' built-in `LetterBox` handle padding at training time with the correct gray-114 fill.

### 2. **`rect=True` training beats square training for non-square deployment.**

When the inference shape will be non-square (e.g. matching a 4:3 camera frame), use `rect=True` in `TRAINING_HYPERPARAMETERS`. The dataloader then produces rectangular batches matching the dataset's natural aspect ratio, so the model learns features tuned for the actual deployment shape. Mosaic augmentation auto-disables (mosaic produces square outputs), but the geometric alignment is worth more than the lost augmentation.

**Measured benefit**: yolo26n trained square at imgsz=384, then exported at `[288, 384]` → 0.624 mean conf. Same architecture trained with `rect=True` at imgsz=384 → **0.734 mean conf, +0.110**, at zero speed cost.

The val mAP told a misleading story: the square-trained model scored slightly higher on val, and the on-device test reversed that result decisively. **Trust the on-device measurement, not val mAP, when train-time and inference-time distributions differ.**

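A minimal sketch of the `rect=True` setup from lesson 2, assuming the hyperparameters are passed as keyword arguments to ultralytics' `model.train(...)`. The weights name, dataset YAML path, and epoch count are placeholders rather than values from the R&D session, and `train_rect` is a hypothetical helper.

```python
# Placeholder values: only `imgsz` and `rect` come from the lesson above.
TRAINING_HYPERPARAMETERS = {
    "imgsz": 384,   # must be a single int for training (see lesson 3)
    "rect": True,   # rectangular batches that follow the dataset aspect ratio
    "epochs": 100,  # placeholder, not the value used in the R&D session
}


def train_rect(weights="yolo26n.pt", data="dataset/data.yaml"):
    """Train with rectangular batches. Requires the ultralytics package."""
    # Imported lazily so the config can be inspected without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data, **TRAINING_HYPERPARAMETERS)
```

Keeping `rect` and `imgsz` in one dict makes it easy to confirm, before a long run starts, that the training shape is the one the export will use.
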
### 3. **Ultralytics collapses a list `imgsz` to a single int during training, with only a warning.**

`imgsz=[288, 384]` in `TRAINING_HYPERPARAMETERS` produces a warning containing `'train' and 'val' imgsz must be an integer`, and training then proceeds at the larger dimension as a square. "Train at a rectangular shape" via list syntax therefore doesn't work; use `rect=True` instead.

The same `imgsz=[H, W]` list **does** work in `model.export(...)`, which makes the problem subtle: you can train square and export rect without realizing the two don't match.

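One way to keep the training and export shapes consistent is to derive both from a single `[H, W]` value and fail loudly if a list ever reaches the training config. The helper names below are hypothetical. The only ultralytics behavior assumed is what lesson 3 describes: an int `imgsz` for training, and `[H, W]` accepted by `model.export`.

```python
def train_imgsz_from(export_hw):
    """Return the single int to train with (the long side), to be combined with rect=True."""
    h, w = export_hw
    return max(h, w)


def require_int_imgsz(imgsz):
    """Raise instead of letting a list imgsz be collapsed during training."""
    if isinstance(imgsz, bool) or not isinstance(imgsz, int):
        raise TypeError(
            f"training imgsz must be an int, got {imgsz!r}; "
            "use rect=True for rectangular training"
        )
    return imgsz


EXPORT_HW = [288, 384]  # [H, W], the export shape quoted in lesson 2
TRAIN_IMGSZ = require_int_imgsz(train_imgsz_from(EXPORT_HW))

# Later, with ultralytics installed (not executed here):
#   model.train(data=..., imgsz=TRAIN_IMGSZ, rect=True)
#   model.export(format="tflite", imgsz=EXPORT_HW)
```
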
### 4. **mAP50 on the val set doesn't reliably predict on-device confidence.**

Several times in this session a model with higher mAP50 scored lower on the on-device test than a model with lower mAP50. Likely reasons: the val set has a different content distribution than the test video (court angle, lighting), val scoring uses a different confidence threshold than the on-device test, and training-time augmentations change which content the model is good at. **The 60s on-device capture is the deciding measurement.**

### 5. **Always probe the exported TFLite shape with `tf.lite.Interpreter` after export.**

Filenames lie. The original `yolo11n_su_416.tflite` had an actual input shape of `[1, 512, 512, 3]` because an earlier ultralytics export silently rounded `imgsz=416` up to 512 (probably a stride-alignment quirk in onnx2tf). It cost a full retraining cycle to discover. Probe every exported tflite with:

```python
import tensorflow as tf

# The filename does not guarantee the shape, so read it from the model itself.
path = "yolo11n_su_416.tflite"
it = tf.lite.Interpreter(model_path=path)
it.allocate_tensors()
# The input shape is NHWC, e.g. [1, 512, 512, 3].
print('input ', it.get_input_details()[0]['shape'])
print('output', it.get_output_details()[0]['shape'])
```

If the shape doesn't match expectations, rename the file or fix the export before syncing into the app.

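The probe above only prints the shape. A small guard can turn a mismatch into a hard failure before the file is synced. This is a sketch with hypothetical helper names (`input_hw`, `check_export`), assuming the NHWC input layout seen in the `[1, 512, 512, 3]` example.

```python
def input_hw(shape):
    """Extract (height, width) from an NHWC input shape such as [1, 512, 512, 3]."""
    if len(shape) != 4:
        raise ValueError(f"expected a 4-D NHWC shape, got {list(shape)}")
    _, h, w, _ = (int(d) for d in shape)
    return h, w


def check_export(shape, expected_hw):
    """Raise if the probed input shape differs from the size the export was meant to have."""
    actual = input_hw(shape)
    if actual != tuple(expected_hw):
        raise ValueError(f"model input is {actual}, expected {tuple(expected_hw)}")
    return actual
```

With the interpreter from the probe, `check_export(it.get_input_details()[0]['shape'], (416, 416))` would have raised for the mislabeled 512×512 file.
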
## Inference-side

### 6. **Letterboxing is critical for square models, irrelevant for matched-aspect models.**

The Android `ImageDetector` MUST do letterboxed preprocessing (scale to fit, gray-114 pad) when feeding a non-matching aspect to a square model; otherwise YOLO's geometric priors are violated and quality plummets. When the camera aspect matches the model aspect (4:3 camera + 4:3 rect model), the letterbox math reduces to a plain resize with `padX=padY=0`, but the math still needs to be in place for the case where the aspects don't match.

Output bboxes are normalized `[0, 1]` over the **letterboxed** input tensor. After inference, you must un-letterbox the bboxes back to the original image's coordinate space: convert to tensor pixels, subtract the padding offset, then divide by the scale ratio. Skipping this step gives bboxes that are offset and wrongly scaled, most visibly at the frame edges.

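The letterbox geometry from lesson 6 can be sketched in a few lines. This is Python rather than the Android detector's Kotlin, and it covers only the coordinate math, not the gray-114 pixel fill. It assumes centered padding and boxes in normalized `(x1, y1, x2, y2)` form. Both are assumptions, as are the function names.

```python
def letterbox_params(img_w, img_h, in_w, in_h):
    """Scale-to-fit parameters: one uniform scale plus centered padding, in tensor pixels."""
    scale = min(in_w / img_w, in_h / img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    pad_x = (in_w - new_w) / 2
    pad_y = (in_h - new_h) / 2
    return scale, pad_x, pad_y


def unletterbox(box, img_w, img_h, in_w, in_h):
    """Map a normalized xyxy box on the letterboxed tensor back to source-image pixels."""
    scale, pad_x, pad_y = letterbox_params(img_w, img_h, in_w, in_h)

    def to_src(x, y):
        # Normalized -> tensor pixels, remove the padding offset, undo the scale.
        sx = (x * in_w - pad_x) / scale
        sy = (y * in_h - pad_y) / scale
        # Clamp to the source frame to absorb rounding at the edges.
        return min(max(sx, 0.0), img_w), min(max(sy, 0.0), img_h)

    x1, y1, x2, y2 = box
    return (*to_src(x1, y1), *to_src(x2, y2))
```

For a matched-aspect pair such as a 640×480 frame into a 384×288 tensor, `letterbox_params` returns zero padding, so the same code path handles both the square and the rect case.
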
### 7. **Camera frames default to whatever CameraX picks.**

Stock `ImageAnalysis.Builder()` with no resolution config uses the device's default, which may be 4:3 OR 16:9 depending on the native aspect of the phone's main camera. **Always pin** the aspect ratio with `setTargetAspectRatio(AspectRatio.RATIO_4_3)` so the input distribution to the model is consistent across devices. The kima app must do this if it ships the rect 4:3 model.

### 8. **Activity orientation lock is a hard requirement for non-square models.**

The Android camera pipeline rotates the bitmap to "upright" relative to the phone's current orientation. A landscape model fed a portrait-rotated bitmap gets garbage. Either:
- **Lock the activity to landscape** in `AndroidManifest.xml` (`android:screenOrientation="landscape"`), or
- **Detect orientation in the detector** and rotate the bitmap to landscape before resizing if `imgH > imgW`. This adds a minor CPU cost (one `Matrix.postRotate(-90f)` per frame).

The R&D project locked landscape via the manifest. If kima needs portrait support, use the rotation fallback.

### 9. **fp32 vs fp16 on the TFLite GPU delegate is essentially equivalent.**

Tested both for yolo26n at multiple imgsz. Differences:
- **Speed**: identical within noise (±0.05 Hz on a 6-Hz model). The GPU delegate isn't accelerating fp16 over fp32 in any measurable way on the A36 5G; both appear to bottleneck on memory bandwidth, not weight precision.
- **Quality**: fp16 costs ~0.02 mean confidence vs fp32. Detectable but small.
- **File size**: fp16 is roughly half the size, ~5 MB vs ~9 MB.

For production: **use fp32**. The extra ~4 MB is negligible next to the production app's overall size, and the small quality win is real.

### 10. **GPU delegate selection is silent on failure: log it explicitly.**

If the TFLite GPU delegate fails to construct (e.g. on some emulators, low-end devices, or after Android updates), the interpreter falls back to CPU with ~10× slower inference. **The fallback is silent unless you log it explicitly.** The reference `CustomObjectModel.android.kt` adds a `Logger.i { "TFLite: ... delegate=$selected ..." }` line so `adb logcat -s TFLite:I` shows the actual delegate after every model load.

## Test-protocol-side

### 11. **The yolo26n family is sensitive to small framing changes.**

The same model file scored between **0.64 and 0.84** mean confidence across different sessions in this project, depending on how the phone was positioned. The `yolo11n` family is much more stable across the same conditions (~0.72 ±0.03). Possible cause: yolo26n's feature pyramid is more sensitive to subpixel-scale alignment of small objects.

Practical implication: **always run a noise-floor sanity check** on a stable model (`yolo11n_noise_floor_baseline.tflite`) before any A/B test, and **only compare yolo26n models within the same back-to-back capture session**. Cross-session yolo26n comparisons are unreliable.

### 12. **60-second test clips are the minimum for stable measurements.**

10-second clips have a noise floor of ~±0.10 mean conf, which is too high to detect real model improvements (most candidate-vs-candidate differences are in the 0.03–0.15 range). 60-second clips drop the noise floor to ~±0.05 for stable models and ~±0.10 for the noise-prone yolo26n family. Bigger sample → tighter measurement.

### 13. **Run-to-run variance on the SAME model is real and significant.**

The same `yolo11n_dataset_dataset.tflite` file measured between 0.597 and 0.753 mean conf across different sessions on the same test setup. Same file. Same video. The absolute numbers differ because of phone positioning, lighting, video timing, thermal state, etc.

**Within-session relative ordering is the only reliable signal.** Don't quote absolute numbers across sessions.

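Lessons 11 to 13 amount to a simple decision rule: compare two same-session captures and call a winner only if the gap exceeds the noise floor. The sketch below is one possible formalization, not the procedure used in the R&D session. In particular, treating the noise floor as half the spread of repeated baseline captures is an assumption, and all function names are hypothetical.

```python
from statistics import mean


def mean_confidence(confidences):
    """Mean per-frame detection confidence for one capture."""
    return mean(confidences)


def noise_floor(baseline_means):
    """Half the spread of repeated captures of the same stable baseline model (assumed definition)."""
    return (max(baseline_means) - min(baseline_means)) / 2


def compare(mean_a, mean_b, floor):
    """Return 'A', 'B', or 'inconclusive' for two captures from the same session."""
    diff = mean_b - mean_a
    if abs(diff) <= floor:
        return "inconclusive"
    return "B" if diff > 0 else "A"
```

With the lesson 2 numbers, `compare(0.624, 0.734, 0.10)` returns `"B"`: the +0.110 gap only just clears the ±0.10 floor quoted for yolo26n on 60-second clips, which is why back-to-back captures matter.
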
## Pipeline-side

### 14. **Mount Drive at the start of every Colab session, not the end.**

`drive.mount('/content/drive')` may pop a "Connect to Google Drive" card requiring a click. If you defer this to the post-training sync cell, the user is unlikely to be at their keyboard 30 minutes later when training finishes, and the whole pipeline stalls. Mount Drive in the first ~5 cells, before triggering the long-running training, so any auth happens while the user is present.

### 15. **`google.colab.files.download()` is unreliable.**

The JS download trigger gets silently blocked by browsers (popup blockers, unfocused Colab tab, etc.). The cell completes successfully and the file never arrives on the Mac, with no error. **Always use Drive sync** for moving files Colab → Mac. `drive.mount` + `shutil.copy2` to `/content/drive/MyDrive/...` reliably mirrors to Drive Desktop within ~10 seconds.

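The Drive sync from lessons 14 and 15 can be wrapped in a small helper that also verifies the copy. This is a sketch: `sync_to_drive` is a hypothetical name, and the `models` folder in the usage comment is a placeholder, not the path used in the project.

```python
import shutil
from pathlib import Path


def sync_to_drive(src, drive_dir):
    """Copy one artifact into a mounted Drive folder and check that it arrived intact."""
    src = Path(src)
    dest_dir = Path(drive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = Path(shutil.copy2(src, dest_dir / src.name))
    if dest.stat().st_size != src.stat().st_size:
        raise IOError(f"size mismatch after copying {src} to {dest}")
    return dest


# In Colab, mount early in the notebook (lesson 14), then sync after export:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   sync_to_drive('/content/model.tflite', '/content/drive/MyDrive/models')
```

Unlike `files.download()`, a failed copy here raises in the cell, so a missing file is visible when it happens.
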
### 16. **Colab cold-restart loses Python state but Drive mount auth is browser-cached.**

If you switch the runtime accelerator (CPU ↔ GPU) or hit the idle timeout, the Python runtime resets and you have to re-run cells 2 (apt/pip), 3 (verify), 5 (config), 7/9/11 (dataset), and 13 (training). The `drive.mount()` auth, however, is cached by the browser session, so a re-mount on a cold runtime is silent: no second auth click is required as long as the same browser tab is still open.

### 17. **The auto-suffix gotcha when re-running ultralytics on the same dataset alias.**

Running the training cell twice with the same `dataset_alias` produces run dirs `yolo26n_dataset_v11`, `yolo26n_dataset_v112`, `yolo26n_dataset_v113`, etc., because ultralytics appends a counter (2, 3, ...) directly to the run name. Any cell that hardcodes the path will silently use the OLD weights. Always use a glob like `glob.glob('/content/runs/.../yolo26n_dataset_v11*')` sorted by mtime descending to pick the most recent.

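A minimal sketch of the glob-by-mtime lookup from lesson 17. `latest_run_dir` is a hypothetical helper, and the caller supplies the full pattern, since the runs root is elided above.

```python
import glob
import os


def latest_run_dir(pattern):
    """Return the most recently modified directory matching a glob pattern."""
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        raise FileNotFoundError(f"no run directory matches {pattern!r}")
    # Newest modification time wins, so a re-run shadows older runs.
    return max(candidates, key=os.path.getmtime)
```

Downstream cells can then build the weights path from the returned directory instead of hardcoding a run name.
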
## Architecture / model-choice-side

### 18. **yolo26n is a strict improvement over yolo11n on this task.**

In the multi-arch comparison (V5.4), yolo26n at 416² beat yolo11n at 416² on both speed AND quality (within the same session). The newer architecture's improvements seem to actually deliver. **Default to yolo26n** for any new experiments unless you have a specific reason not to, and keep lesson 11 in mind: compare yolo26n candidates only within one capture session.

### 19. **yolo11s is too slow for this hardware tier.**

yolo11s (the "small" variant of yolo11) has ~3-4× the parameters of yolo11n and runs at ~2.5 Hz at imgsz=416 on the A36 5G. Quality wins are marginal vs yolo26n. **Don't use yolo11s here.**

### 20. **`imgsz=320` is too small for basketball detection on this dataset.**

At 320² square + letterbox of a 16:9 source, the basketball ends up at ~6 pixels, near YOLO's small-object detection floor. mAP50 dropped from 0.76 (at 416) to 0.63 (at 320). Detection rate stayed at 100%, but per-frame confidence dropped from ~0.82 to ~0.63. Use rectangular models or a larger imgsz instead.

### 21. **Rectangular models always beat square models with letterbox at the same compute budget.**

For a non-square camera source, a rect model at `(H, W)` that matches the source aspect has `H*W` pixels of effective content. A square model at imgsz=N has only `N * N * min(H, W) / max(H, W)` pixels of effective content; the rest is gray padding. For a 4:3 source that is 75% of the square input, with 25% wasted on padding, so at the same total pixel count (and roughly the same inference cost) the rect model carries ~33% more useful pixels. The R&D session's 60s on-device tests confirmed this is real.

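The arithmetic in lesson 21 can be checked directly. The sketch below uses exact fractions and assumes the rect model matches the source aspect exactly. The function names are hypothetical.

```python
from fractions import Fraction


def useful_fraction_square(src_w, src_h):
    """Fraction of a square letterboxed input that holds image content rather than padding."""
    return Fraction(min(src_w, src_h), max(src_w, src_h))


def rect_gain(src_w, src_h):
    """Useful-pixel ratio of rect over square at an equal total pixel budget."""
    return 1 / useful_fraction_square(src_w, src_h)


# 4:3 source: 3/4 of the square is content, so the rect model has 4/3 the useful pixels.
print(useful_fraction_square(4, 3), rect_gain(4, 3))
```

For a 16:9 source the square input is only 9/16 content, so the gap widens as the source gets wider.
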
## TL;DR for the next session

1. Use `yolo26n_v11_rect_512x384.tflite` as the default.
2. Pin the camera to 4:3, lock the activity to landscape.
3. The Android detector MUST do letterbox preprocessing + un-letterbox of output bboxes.
4. fp32. Don't bother with fp16.
5. Always run the noise-floor sanity check before any A/B test.
6. 60-second back-to-back captures, never 10s.
7. Within-session relative ordering is the only reliable signal.