Lessons Learned#
Compressed list of the key findings from the R&D session that produced these models. Each one cost real time to learn — read this before making decisions about training or integration.
Training-side#
1. NEVER train on a stretched dataset.#
Roboflow's Preprocessing → Resize → Stretch to NxN distorts every training image to a square, destroying aspect ratios. YOLO is trained to expect letterboxed input, so a stretch-trained model learns wrong geometry priors and produces loose, distorted bboxes at inference even after fixing preprocessing on the inference side.
Cost paid: Weeks of accuracy issues blamed on the model architecture before discovering it was the dataset preprocessing. Required regenerating the Roboflow dataset (versions v9 → v10 → v11) and retraining everything.
Always use Preprocessing → Resize → "Fit" or "None". "None" is preferred — let ultralytics' built-in LetterBox handle padding at training time with the correct gray-114 fill.
2. rect=True training beats square training for non-square deployment.#
When the inference shape will be non-square (e.g. matching a 4:3 camera frame), use rect=True in TRAINING_HYPERPARAMETERS. The dataloader produces rectangular batches matching the dataset's natural aspect ratio. The model learns features tuned for the actual deployment shape. Mosaic augmentation auto-disables (mosaic produces square outputs), but the geometric alignment is worth more than the augmentation loss.
Measured benefit: yolo26n trained at imgsz=384 square then exported at [288, 384] → 0.624 mean conf. Same architecture trained with rect=True imgsz=384 → 0.734 mean conf, +0.110, zero speed cost.
The val mAP told a misleading story: square-trained scored slightly higher on val. The on-device test reversed it dramatically. Trust the on-device measurement, not val mAP, when train-time and inference-time distributions differ.
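A minimal sketch of the rect=True setup, assuming TRAINING_HYPERPARAMETERS is a plain dict forwarded as keyword arguments to ultralytics' `model.train(...)`. The `launch_training` helper, the `data_yaml` argument, and the `epochs` value are hypothetical placeholders; only `imgsz=384` and `rect=True` come from the measurements above.

```python
# Hypothetical hyperparameter dict; only imgsz and rect are taken from the notes.
TRAINING_HYPERPARAMETERS = {
    "imgsz": 384,   # must be a single int for training
    "rect": True,   # rectangular batches following the dataset's aspect ratio
    "epochs": 100,  # placeholder value, not from the R&D session
}


def launch_training(model, data_yaml):
    """Forward the dict to model.train(); `model` is assumed to be an ultralytics YOLO object."""
    return model.train(data=data_yaml, **TRAINING_HYPERPARAMETERS)
```

Passing the model in, rather than constructing it inside the helper, keeps the hyperparameters in one place that every training cell can share.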
3. Ultralytics silently collapses tuple imgsz to a single int during training.#
imgsz=[288, 384] in TRAINING_HYPERPARAMETERS produces the warning "'train' and 'val' imgsz must be an integer" and then quietly trains at the larger dimension as a square. So "train at a rectangular shape" via tuple syntax doesn't work; you must use rect=True instead.
The same imgsz=[H, W] tuple does work in model.export(...) though, which makes the bug subtle: you can train square and export rect without realizing they don't match.
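One way to stop this from happening silently is to validate the hyperparameters before training starts. The sketch below assumes the hyperparameters live in a dict. `check_train_imgsz` is a hypothetical helper, not an ultralytics function.

```python
def check_train_imgsz(hyperparams):
    """Raise if imgsz is not a single int, so a [H, W] value is caught before training."""
    imgsz = hyperparams.get("imgsz")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(imgsz, bool) or not isinstance(imgsz, int):
        raise ValueError(
            f"training imgsz must be a single int, got {imgsz!r}; "
            "use rect=True for rectangular training"
        )
    return imgsz
```

Calling this at the top of the training cell turns the easily missed warning into a hard failure.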
4. mAP50 on the val set doesn't reliably predict on-device confidence.#
Multiple times in this session a model with higher mAP50 scored lower on the on-device test than a model with lower mAP50. Reasons include: the val set has a different content distribution than the test video (court angle, lighting), the val set uses a different scoring threshold than the on-device test, and dataset augmentations in training change which content the model is good at. The 60s on-device capture is the deciding measurement.
5. Always probe the exported TFLite shape with tf.lite.Interpreter after export.#
Filenames lie. The original yolo11n_su_416.tflite had actual input shape [1, 512, 512, 3] because an earlier ultralytics export silently rounded imgsz=416 up to 512 (probably a stride-alignment quirk in onnx2tf). Cost a full retraining cycle to discover. Probe every exported tflite with:
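```python
import tensorflow as tf

path = "yolo11n_su_416.tflite"  # exported model to probe

it = tf.lite.Interpreter(model_path=path)
it.allocate_tensors()
# Print the real tensor shapes rather than trusting the filename
print('input ', it.get_input_details()[0]['shape'])
print('output', it.get_output_details()[0]['shape'])
```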
If the shape doesn't match expectations, rename the file or fix the export before syncing into the app.
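The probe can also be turned into a fail-fast check. This is a sketch that assumes an NHWC input with batch size 1 and 3 channels, as in the [1, 512, 512, 3] example above. `check_input_shape` is a hypothetical helper, not part of TensorFlow.

```python
def check_input_shape(actual_shape, expected_hw):
    """Compare a probed NHWC input shape against the expected (height, width)."""
    actual = [int(d) for d in actual_shape]
    expected = [1, expected_hw[0], expected_hw[1], 3]
    if actual != expected:
        raise ValueError(f"input shape {actual} != expected {expected}")
    return True
```

It would be called with the probed shape, for example `check_input_shape(it.get_input_details()[0]['shape'], (416, 416))`.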
Inference-side#
6. Letterboxing is critical for square models, irrelevant for matched-aspect models.#
The Android ImageDetector MUST do letterboxed preprocessing (scale to fit, gray-114 pad) when feeding a non-matching aspect to a square model — otherwise YOLO's geometric priors are violated and quality plummets. When the camera aspect matches the model aspect (4:3 camera + 4:3 rect model), the letterbox math reduces to a plain resize with padX=padY=0 — but you still need the math in place for the case where it doesn't match.
Output bboxes are normalized [0, 1] over the letterboxed input tensor. After inference, you must un-letterbox the bboxes back to the original image's coordinate space (subtract padding offset, divide by scale ratio). Skipping this step gives bboxes that are slightly offset and scaled wrong, most visibly at the frame edges.
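The letterbox and un-letterbox arithmetic can be sketched in a few lines. It is shown here in Python for clarity; the Android detector would port the same math. The sketch assumes centred padding and boxes in normalized (x1, y1, x2, y2) form. Both are assumptions, so check them against the actual model output layout. The function names are hypothetical.

```python
def letterbox_params(img_w, img_h, model_w, model_h):
    """Return (scale, pad_x, pad_y) for a scale-to-fit resize centred in the model input."""
    scale = min(model_w / img_w, model_h / img_h)
    pad_x = (model_w - img_w * scale) / 2.0
    pad_y = (model_h - img_h * scale) / 2.0
    return scale, pad_x, pad_y


def unletterbox_box(box_norm, img_w, img_h, model_w, model_h):
    """Map a normalized (x1, y1, x2, y2) box on the model input back to source pixels."""
    scale, pad_x, pad_y = letterbox_params(img_w, img_h, model_w, model_h)
    x1, y1, x2, y2 = box_norm
    # Normalized -> model-input pixels, subtract the padding, then undo the scale.
    return (
        (x1 * model_w - pad_x) / scale,
        (y1 * model_h - pad_y) / scale,
        (x2 * model_w - pad_x) / scale,
        (y2 * model_h - pad_y) / scale,
    )
```

For a 640×480 frame fed to a 512×384 model the padding comes out as zero, which is the matched-aspect case described in item 6.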
7. Camera frames default to whatever CameraX picks.#
Stock ImageAnalysis.Builder() with no resolution config uses the device's default — which may be 4:3 OR 16:9 depending on the phone's main camera native aspect. Always pin the aspect ratio with setTargetAspectRatio(AspectRatio.RATIO_4_3) so the input distribution to the model is consistent across devices. The kima app must do this if it ships the rect 4:3 model.
8. Activity orientation lock is a hard requirement for non-square models.#
The Android camera pipeline rotates the bitmap to "upright" relative to the phone's current orientation. A landscape model fed a portrait-rotated bitmap gets garbage. Either:
- Lock the activity to landscape in AndroidManifest.xml (android:screenOrientation="landscape"), or
- Detect orientation in the detector and rotate the bitmap to landscape before resizing if imgH > imgW. Adds minor CPU cost (one Matrix.postRotate(-90f) per frame).
The R&D project locked landscape via manifest. If kima needs portrait support, use the rotation fallback.
9. fp32 vs fp16 on the TFLite GPU delegate is essentially equivalent.#
Tested both for yolo26n at multiple imgsz. Differences:
- Speed: identical within noise (±0.05 Hz on a 6-Hz model). The GPU delegate isn't accelerating fp16 over fp32 in any measurable way on the A36 5G — both bottleneck on memory bandwidth, not weight precision.
- Quality: fp16 costs ~0.02 mean confidence vs fp32. Detectable but small.
- File size: fp16 is about half the size of fp32 (~5 MB vs ~9 MB).
For production: use fp32. The extra 4 MB is negligible vs the production app's overall size, and the small quality win is real.
10. GPU delegate selection is silent on failure — log it explicitly.#
If the TFLite GPU delegate fails to construct (e.g. on some emulators, low-end devices, or after Android updates), the interpreter falls back to CPU at ~10× slower inference. The fallback is silent unless you log it explicitly. The reference CustomObjectModel.android.kt adds a Logger.i { "TFLite: ... delegate=$selected ..." } line so adb logcat -s TFLite:I shows the actual delegate after every model load.
Test-protocol-side#
11. The yolo26n family is sensitive to small framing changes.#
Same model file scored between 0.64 and 0.84 mean confidence across different sessions in this project, depending on how the phone was positioned. The yolo11n family is much more stable across the same conditions (~0.72 ±0.03). Possible cause: yolo26n's feature pyramid is more sensitive to subpixel-scale alignment of small objects.
Practical implication: always run a noise-floor sanity check on a stable model (yolo11n_noise_floor_baseline.tflite) before any A/B test, and only compare yolo26n models within the same back-to-back capture session. Cross-session yolo26n comparisons are unreliable.
12. 60-second test clips are the minimum for stable measurements.#
10-second clips have a noise floor of ~±0.10 mean conf, which is too high to detect real model improvements (most candidate-vs-candidate differences are in the 0.03–0.15 range). 60-second clips drop the noise floor to ~±0.05 for stable models and ~±0.10 for the noise-prone yolo26n family. Bigger sample → tighter measurement.
13. Run-to-run variance on the SAME model is real and significant.#
The same yolo11n_dataset_dataset.tflite file measured between 0.597 and 0.753 mean conf across different sessions on the same test setup. Same file. Same video. Different absolute numbers because of phone positioning, lighting, video timing, thermal state, etc.
Within-session relative ordering is the only reliable signal. Don't quote absolute numbers across sessions.
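The within-session rule can be encoded as a small helper. This sketch assumes the per-frame confidences of each capture are available as lists of floats. `compare_within_session` and its verdict strings are hypothetical, and the noise floor passed in should be the estimate from item 12 (about 0.05 for stable models, about 0.10 for the yolo26n family).

```python
from statistics import mean


def compare_within_session(conf_a, conf_b, noise_floor):
    """Return (delta, verdict) for two models captured back to back in one session."""
    delta = mean(conf_b) - mean(conf_a)
    # A difference inside the noise floor is not evidence either way.
    if abs(delta) <= noise_floor:
        return delta, "inconclusive"
    verdict = "b_better" if delta > 0 else "a_better"
    return delta, verdict
```

Only feed it captures from the same session; the absolute means it computes are not comparable across sessions.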
Pipeline-side#
14. Mount Drive at the start of every Colab session, not the end.#
drive.mount('/content/drive') may pop a "Connect to Google Drive" card requiring a click. If you defer this to the post-training sync cell, the user is unlikely to be at their keyboard 30 minutes later when training finishes — and the whole pipeline stalls. Mount Drive in the first ~5 cells, before triggering the long-running training, so any auth happens while the user is present.
15. google.colab.files.download() is unreliable.#
The JS download trigger gets silently blocked by browsers (popup blockers, unfocused Colab tab, etc.). The cell completes successfully and the file never arrives on the Mac — no error. Always use Drive sync for moving files Colab → Mac. drive.mount + shutil.copy2 to /content/drive/MyDrive/... reliably mirrors to Drive Desktop within ~10 seconds.
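A minimal sketch of the Drive sync step, assuming Drive is already mounted and the destination folder sits under the mount point. `sync_to_drive` is a hypothetical helper built on `shutil.copy2`.

```python
import shutil
from pathlib import Path


def sync_to_drive(src, drive_dir):
    """Copy one file into a (mounted) Drive folder and verify that it landed."""
    src = Path(src)
    dest_dir = Path(drive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves timestamps and returns the destination path.
    dest = Path(shutil.copy2(src, dest_dir / src.name))
    if dest.stat().st_size != src.stat().st_size:
        raise IOError(f"size mismatch after copying {src} to {dest}")
    return dest
```

Unlike files.download(), a failed copy raises in the cell, so a missing file is visible immediately.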
16. Colab cold-restart loses Python state but Drive mount auth is browser-cached.#
If you switch the runtime accelerator (CPU ↔ GPU) or hit the idle timeout, the Python runtime resets and you have to re-run cells 2 (apt/pip), 3 (verify), 5 (config), 7/9/11 (dataset), 13 (training). But drive.mount() is cached by the browser session, so a re-mount on a cold runtime is silent — no second auth click required, as long as the same browser tab is still open.
17. The auto-suffix gotcha when re-running ultralytics on the same dataset alias.#
Running the training cell twice with the same dataset_alias produces run dirs yolo26n_dataset_v11, yolo26n_dataset_v112, yolo26n_dataset_v113, etc. Any cell that hardcodes the path will silently use the OLD weights. Always use a glob like glob.glob('/content/runs/.../yolo26n_dataset_v11*') sorted by mtime descending to pick the most recent.
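The glob-and-sort step might look like the sketch below. `latest_run_dir` is a hypothetical helper; the pattern passed in should be the real run path for the notebook.

```python
import glob
import os


def latest_run_dir(pattern):
    """Return the most recently modified directory matching a glob pattern."""
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        raise FileNotFoundError(f"no run directory matches {pattern!r}")
    # Newest modification time wins, so re-runs with auto-suffixes are picked up.
    return max(candidates, key=os.path.getmtime)
```

Raising when nothing matches avoids a downstream cell quietly continuing with a stale or empty path.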
Architecture / model-choice-side#
18. yolo26n is a strict improvement over yolo11n on this task.#
In the multi-arch comparison (V5.4), yolo26n at 416² beat yolo11n at 416² on both speed AND quality (within the same session). The newer architecture's improvements seem to actually deliver. Default to yolo26n for any new experiments unless you have a specific reason not to.
19. yolo11s is too slow for this hardware tier.#
yolo11s (the "small" variant of yolo11) has ~3-4× the parameters of yolo11n and on the A36 5G runs at ~2.5 Hz at imgsz=416. Quality wins are marginal vs yolo26n. Don't use yolo11s here.
20. imgsz=320 is too small for basketball detection on this dataset.#
At 320² square + letterbox of a 16:9 source, the basketball ends up at ~6 pixels, near YOLO's small-object detection floor. mAP50 dropped from 0.76 (at 416) to 0.63 (at 320). Detection rate stayed at 100% but per-frame confidence dropped from ~0.82 to ~0.63. Use rectangular models or larger imgsz instead.
21. Rectangular models always beat square models with letterbox at the same compute budget.#
For a non-square camera source, a rect model at (H, W) has H × W pixels of effective content. A square model at imgsz=N has N × N × min(H, W) / max(H, W) pixels of effective content (the rest is gray padding waste). At the same total inference cost, the square model spends 25% of its input on padding for a 4:3 source, so the rect model carries about 33% more useful pixels. The R&D session's 60s on-device tests confirmed this is real.
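The pixel accounting can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names, assuming the rect model's aspect ratio matches the source exactly.

```python
def rect_content_pixels(h, w):
    """A rect model whose aspect matches the source uses every input pixel."""
    return h * w


def square_content_pixels(n, src_h, src_w):
    """Letterboxing a src_h x src_w frame into n x n keeps only the scaled-to-fit region."""
    return n * n * min(src_h, src_w) / max(src_h, src_w)
```

For a 4:3 source, `square_content_pixels(416, 3, 4)` gives 129792.0 of the 173056 input pixels, so a quarter of the square input is padding.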
TL;DR for the next session#
- Use yolo26n_v11_rect_512x384.tflite as the default.
- Pin the camera to 4:3, lock the activity to landscape.
- The Android detector MUST do letterbox preprocessing + un-letterbox of output bboxes.
- fp32. Don't bother with fp16.
- Always run the noise-floor sanity check before any A/B test.
- 60-second back-to-back captures, never 10s.
- Within-session relative ordering is the only reliable signal.