Models: Detailed Comparison

All measurements were taken on a Samsung Galaxy A36 5G using the TFLite GPU delegate. The test app filmed a 4K (3840×2160) basketball video played on a Mac monitor in landscape 16:9, capturing 60 seconds starting at video offset 10 s. The phone was in landscape orientation with the camera pinned to RATIO_4_3, and the Android detector used letterbox preprocessing.

Run-to-run noise floor on this test setup: about ±0.05 mean confidence on stable models like yolo11n_dataset_dataset. The yolo26n family is more sensitive (about ±0.10): small changes in phone position can shift its measured confidence noticeably without affecting framerate or detection rate. Run a noise check (capture yolo11n_noise_floor_baseline.tflite first; expect ~0.72 mean conf) before any A/B test.
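As a rough illustration of how these noise floors might be applied when reading A/B results, the sketch below treats a difference in mean confidence as meaningful only if it exceeds the larger of the two families' noise floors. The function name, the family keys, and the "take the larger floor" rule are assumptions made for this example, not part of the test app.

```python
# Noise floors quoted in the text above; the dictionary keys are illustrative labels.
NOISE_FLOOR = {"yolo11n": 0.05, "yolo26n": 0.10}


def exceeds_noise(conf_a, conf_b, family_a, family_b):
    """Return True if the gap between two mean confidences is larger than
    the noisier family's run-to-run noise floor (an assumed decision rule)."""
    threshold = max(NOISE_FLOOR[family_a], NOISE_FLOOR[family_b])
    return abs(conf_a - conf_b) > threshold


if __name__ == "__main__":
    # A 0.08 gap involving a yolo26n model sits inside the ±0.10 noise band.
    print(exceeds_noise(0.80, 0.72, "yolo26n", "yolo11n"))
    # A 0.15 gap between two yolo26n models is outside it.
    print(exceeds_noise(0.80, 0.65, "yolo26n", "yolo26n"))
```

A stricter rule (for example, requiring the gap to hold across several sessions) may be more appropriate when the two candidates are close.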

All 6 candidates at a glance

| File | Arch | Input shape | Pixels | Hz | Mean conf range | val mAP50 | Size | Use case |
|---|---|---|---|---|---|---|---|---|
| `yolo26n_v11_rect_512x384.tflite` ⭐ | yolo26n | `[1, 384, 512, 3]` | 197k | 6.7 | 0.78–0.83 | 0.7632 | 9.3 MB | Recommended production default |
| `yolo26n_v11_rect_384x288.tflite` | yolo26n | `[1, 288, 384, 3]` | 110k | 10.5 | 0.59–0.73 | 0.6456 | 9.3 MB | Real-time tracking, >10 Hz needed |
| `yolo26n_v11_square_512.tflite` | yolo26n | `[1, 512, 512, 3]` | 262k | 5.0 | 0.78–0.83 | 0.8206 | 9.3 MB | Quality-first, square inputs OK |
| `yolo26n_v11_square_416.tflite` | yolo26n | `[1, 416, 416, 3]` | 173k | 7.5 | 0.64–0.84 | 0.7649 | 9.3 MB | Square baseline reference |
| `yolo11n_dataset640.tflite` | yolo11n | `[1, 640, 640, 3]` | 410k | 3.2 | 0.81–0.85 | 0.881 | 10.7 MB | Historical conf champion ⚠️ caveats |
| `yolo11n_noise_floor_baseline.tflite` | yolo11n | `[1, 416, 416, 3]` | 173k | 6.9 | 0.72 (stable) | n/a | 10.6 MB | Sanity check / noise floor reference |

⭐ = recommended default. ⚠️ = see caveats below.

Tier 1: Recommended Production Default

`yolo26n_v11_rect_512x384.tflite`

The clear winner of the R&D session. Trained at imgsz=512 with rect=True, which produces 384×512 rectangular tensors per batch (the dataset's natural aspect ratio is 4:3). The training and inference tensor shapes match exactly, eliminating the train/inference distribution shift that hurts square-trained models on rectangular cameras.

Strengths:

Best on-device confidence in the 6.7 Hz speed class.
14% more pixels than the square 416² baseline at only 13% slower speed — close to a strict Pareto improvement.
Natural 4:3 alignment with the camera. No wasted padding pixels at inference.
mAP50 0.7632 essentially matches the square 416² baseline (0.7649) on the val set, but significantly outperforms it on the on-device test (+0.07 to +0.15 mean conf in the same session). The val/on-device gap is the rect=True benefit — see LESSONS_LEARNED.md.
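The pixel and speed comparison in the list above can be checked with a few lines of arithmetic. This sketch uses the input shapes and the rounded Hz values from the at-a-glance table; the variable names are illustrative. Because the table rates are rounded, the slowdown computed here lands in the same ballpark as the figure quoted above rather than matching it exactly, and it also depends on whether the slowdown is expressed as lower frame rate or as longer time per frame.

```python
def pixels(height, width):
    """Number of pixels in one input tensor of the given height and width."""
    return height * width


rect_px = pixels(384, 512)      # yolo26n_v11_rect_512x384
square_px = pixels(416, 416)    # yolo26n_v11_square_416

rect_hz, square_hz = 6.7, 7.5   # rounded rates from the table

pixel_gain = rect_px / square_px - 1          # extra pixels, as a fraction
rate_drop = 1 - rect_hz / square_hz           # lower frame rate, as a fraction
latency_increase = square_hz / rect_hz - 1    # extra time per frame, as a fraction

if __name__ == "__main__":
    print(f"pixels: {rect_px} vs {square_px} (+{pixel_gain:.1%})")
    print(f"frame rate drop: {rate_drop:.1%}")
    print(f"per-frame latency increase: {latency_increase:.1%}")
```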

Trade-offs:

rect=True auto-disables mosaic augmentation during training. Val mAP took a small hit, but on-device the model beats every prior square-trained variant in within-session comparisons.
Locked to landscape 4:3. The kima app must pin CameraX accordingly.

Tier 2: Speed-Critical (10+ Hz)

`yolo26n_v11_rect_384x288.tflite`

Same training method as the Tier 1 model but at imgsz=384 rect=True → 288×384 tensors. Smaller pixel budget = faster, less accurate.

Strengths:

Cleared the 10 Hz target in measurement — the only model in the lineup besides the deprecated 320² square that does so.
Same rect=True training benefit — the rectangular geometry alignment translates into much higher confidence than the comparable 320² square model (which dropped to ~0.59 even though it has a similar pixel count).

Trade-offs:

Quality drops noticeably vs the 512×384 rect: mean conf ~0.59–0.73 (depending on session noise), which is OK for a real-time tracker that smooths per-frame jitter but not great for one-shot frame analysis.
Same 4:3 landscape requirement.

When to use it: if the downstream consumer is a Kalman tracker / smoothing filter on basketball position that benefits from 10+ Hz position updates and can tolerate per-frame noise.
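To illustrate how a smoothing filter can absorb per-frame jitter at this update rate, here is a minimal sketch of an alpha-beta filter, a fixed-gain simplification of a Kalman tracker. It is not the kima app's tracker: the class name, the gain values, and the one-filter-per-axis design are all assumptions made for the example.

```python
class AlphaBetaTracker:
    """1-D alpha-beta filter with a constant-velocity model.

    Use one instance per axis (for example, box-center x and y).
    The gains alpha and beta are illustrative defaults, not tuned values.
    """

    def __init__(self, alpha=0.5, beta=0.1):
        self.alpha = alpha
        self.beta = beta
        self.x = None   # smoothed position; None until the first measurement
        self.v = 0.0    # estimated velocity in position units per second

    def update(self, measurement, dt):
        """Fold in one measurement taken dt seconds after the previous one."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        if self.x is None:
            # First measurement initializes the state directly.
            self.x = measurement
            return self.x
        predicted = self.x + self.v * dt          # constant-velocity prediction
        residual = measurement - predicted        # innovation
        self.x = predicted + self.alpha * residual
        self.v = self.v + (self.beta / dt) * residual
        return self.x


if __name__ == "__main__":
    tracker = AlphaBetaTracker()
    dt = 1 / 10.5  # seconds between updates at the measured 10.5 Hz
    for cx in [0.50, 0.50, 0.60, 0.50, 0.51]:  # made-up normalized x positions
        print(round(tracker.update(cx, dt), 4))
```

With alpha = 0.5, a single-frame jump of 0.10 moves the smoothed position by only 0.05, which is the kind of damping that makes the lower per-frame confidence tolerable for tracking.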

Tier 3: Quality-First (slower)

`yolo26n_v11_square_512.tflite`

Square 512², trained without rect=True. Slower than the rect 512×384 (5.0 Hz vs 6.7 Hz) but with higher val mAP50 (0.8206 vs 0.7632) and similar on-device confidence. The extra pixels go to padding the 4:3 source rather than to effective content, so the on-device quality benefit is minor, but it is a known-good square model if landscape lock isn't acceptable.
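The padding argument can be made concrete with a small letterbox calculation. The sketch below assumes a standard aspect-preserving letterbox (scale to fit, pad the remainder) and a hypothetical 640×480 camera frame standing in for any 4:3 source; the function names are illustrative.

```python
def letterbox_content(src_w, src_h, dst_w, dst_h):
    """Size of the scaled image content when a source frame is fitted into a
    destination tensor while preserving aspect ratio."""
    scale = min(dst_w / src_w, dst_h / src_h)
    return round(src_w * scale), round(src_h * scale)


def padding_fraction(src_w, src_h, dst_w, dst_h):
    """Fraction of the destination tensor that is padding rather than content."""
    content_w, content_h = letterbox_content(src_w, src_h, dst_w, dst_h)
    return 1 - (content_w * content_h) / (dst_w * dst_h)


if __name__ == "__main__":
    src = (640, 480)  # hypothetical 4:3 camera frame
    for name, dst in [("square 512", (512, 512)),
                      ("rect 512x384", (512, 384)),
                      ("square 416", (416, 416))]:
        w, h = letterbox_content(*src, *dst)
        pad = padding_fraction(*src, *dst)
        print(f"{name}: content {w}x{h}, padding {pad:.0%}")
```

Under these assumptions the square 512² tensor holds the same 512×384 of image content as the rect model, with a quarter of its pixels spent on padding.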

When to use it: if the kima app cannot pin orientation to landscape (e.g., app design needs portrait + landscape support) and you need the highest quality among square models.

`yolo11n_dataset640.tflite` ⚠️

Trained early in the project on the stretched Roboflow dataset v8 (before we discovered that Roboflow's "Resize: Stretch to" preprocessing was destroying aspect ratios; see LESSONS_LEARNED.md). It has the highest measured on-device mean confidence of the entire R&D session (~0.84), but with known box-shape artifacts because the model learned distorted geometry.

Use only as a quality reference to establish what "best possible accuracy on this clip" looks like. Do not ship.

Reference baselines

`yolo26n_v11_square_416.tflite`

Square 416² yolo26n trained on the v11 letterboxed dataset without rect=True. The historical "best balance" champion before the rect=True experiments. Similar pixel count (173k) to the new rect 512×384 (197k, +14%). This is the right A/B comparison target when validating that the rect=True model wins in the kima app's specific test environment.

`yolo11n_noise_floor_baseline.tflite`

This is yolo11n_dataset_dataset.tflite from the original repo — a stable, oft-tested model whose measured mean confidence stays consistently in the 0.71–0.76 range across many sessions on this test setup. Use it as a sanity check before any A/B test: capture this model on the test video, compare against the historical ~0.72 baseline, and if it's >0.05 off, the test setup has drifted (phone position, lighting, framing) — fix that before judging other models.
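The drift check described above could be automated along these lines. This is a sketch only: it assumes "mean confidence" is the plain average over the per-detection confidences logged during the capture, and the function and constant names are made up for the example.

```python
BASELINE_MEAN_CONF = 0.72   # historical baseline quoted above
DRIFT_TOLERANCE = 0.05      # allowed deviation before the setup counts as drifted


def mean_confidence(confidences):
    """Average of the logged confidences (assumed definition of 'mean conf')."""
    if not confidences:
        raise ValueError("no detections were logged")
    return sum(confidences) / len(confidences)


def setup_has_drifted(confidences):
    """True if the baseline model's mean confidence is more than
    DRIFT_TOLERANCE away from the historical baseline."""
    return abs(mean_confidence(confidences) - BASELINE_MEAN_CONF) > DRIFT_TOLERANCE


if __name__ == "__main__":
    good_run = [0.70, 0.74, 0.73]   # made-up values near the baseline
    bad_run = [0.60, 0.62, 0.64]    # made-up values from a drifted setup
    print(setup_has_drifted(good_run), setup_has_drifted(bad_run))
```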

Cross-model summary table

For quick decision-making:

| Use case | Model | Hz | Mean conf | Notes |
|---|---|---|---|---|
| Default | `yolo26n_v11_rect_512x384.tflite` | 6.7 | ~0.80 | Use this unless you have a specific reason not to |
| Real-time tracking @ 10+ Hz | `yolo26n_v11_rect_384x288.tflite` | 10.5 | ~0.65 | Tracker absorbs the per-frame noise |
| Highest one-shot accuracy (rect-aspect lock OK) | `yolo26n_v11_square_512.tflite` | 5.0 | ~0.81 | Slower, marginal quality win |
| Highest one-shot accuracy (no aspect lock) | `yolo26n_v11_square_416.tflite` | 7.5 | ~0.78 | Square, no orientation constraints |
| A/B test sanity check | `yolo11n_noise_floor_baseline.tflite` | 6.9 | 0.72 | Run first, verify ~0.72 before testing other models |

Output schema (all models)

All six models share the same output tensor schema, since they all came from the ultralytics export pipeline with nms=True (which is auto-handled for end2end yolo26 models):

Shape: [1, 300, 6]
Per-row format: [x1, y1, x2, y2, conf, cls] where:
- x1, y1 = top-left corner, normalized [0, 1] over the model's input tensor
- x2, y2 = bottom-right corner, same normalization
- conf = confidence score in [0, 1]
- cls = float class index (0.0 = basketball, 1.0 = basketball_hoop)
Up to 300 detections per frame, sorted by confidence descending. Trailing rows have conf = 0.
No additional NMS needed in the Android code — it's already baked into the graph.
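A minimal sketch of reading this schema is shown below, using plain nested lists in place of the real output tensor. The function name and the 0.25 confidence threshold are assumptions for illustration. Boxes are scaled to input-tensor pixels only; mapping them back to the camera frame would additionally require undoing the letterbox, which is not shown.

```python
CLASS_NAMES = {0: "basketball", 1: "basketball_hoop"}


def decode_detections(output, input_w, input_h, conf_threshold=0.25):
    """Decode an output shaped [1, 300, 6] into a list of detections.

    Each row is [x1, y1, x2, y2, conf, cls] with coordinates normalized to
    the model's input tensor. conf_threshold is an illustrative default.
    """
    detections = []
    for x1, y1, x2, y2, conf, cls in output[0]:
        if conf < conf_threshold:
            # Rows are sorted by confidence descending, so the rest are lower.
            break
        detections.append({
            "label": CLASS_NAMES[int(round(cls))],
            "conf": conf,
            # Box corners in input-tensor pixels.
            "box_px": (x1 * input_w, y1 * input_h, x2 * input_w, y2 * input_h),
        })
    return detections


if __name__ == "__main__":
    # Two made-up detections followed by empty trailing rows (conf = 0).
    rows = [[0.25, 0.25, 0.5, 0.5, 0.9, 0.0],
            [0.1, 0.1, 0.2, 0.2, 0.6, 1.0]]
    rows += [[0.0] * 6 for _ in range(298)]
    for det in decode_detections([rows], input_w=512, input_h=384):
        print(det)
```

Stopping at the first row below the threshold relies on the descending sort described above; if that ordering is ever not guaranteed, replacing `break` with `continue` is the safer choice.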

Class labels

0  basketball
1  basketball_hoop
