Testing Protocol — Comparing Models in the Kima App#

This document describes how to validate that a model behaves correctly in the kima app and how to A/B test different candidates against each other. The protocol mirrors what worked in the R&D project (60s back-to-back captures of the same video segment) but is portable to any Android app that can load multiple TFLite models.

Setup requirements#

Hardware#

A test phone that runs the kima app (the R&D used a Samsung Galaxy A36 5G — different phones will give different absolute Hz numbers but the relative ordering of models should be similar).
A mounted phone holder / tripod that does not move between captures. This is the single most important variable for getting reproducible results.
A monitor or large screen for playing the test video. The phone films the screen.

Test video#

Use the same video for all comparisons. The R&D project used ~/Downloads/20250901_100828.mp4 (a 4K, 165s, 16:9 basketball court clip with shooting). Any consistent video with visible basketballs and hoops works — pick one once and keep using it.
Use a fixed segment (e.g. seconds 10–70 = 60s of video). Don't sample randomly; the same frames need to flow into every model under test.
Play the video fullscreen in QuickTime or any other player, so that window chrome does not end up in the frame.

Android side#

Camera pinned to RATIO_4_3, activity orientation locked to landscape — see ANDROID_INTEGRATION.md.
The kima app needs a way to switch between models at runtime (a debug picker, an intent extra, or just rebuilding with a different default model in assets).
The kima app needs a way to log detector output to a file or logcat so you can analyze it after the run. Minimum schema below.

Minimum logging schema#

For each model run, capture per-frame events with the following structure (this is what compare_logs.py consumes):

{
  "device_model": "samsung SM-A366B",
  "model_name": "yolo26n_v11_rect_512x384",
  "run_id": "1775898858800_yolo26n_v11_rect_512x384",
  "started_wall_ms": 1775898858801,
  "stopped_wall_ms": 1775898918803,
  "events": [
    {
      "wall_ms": 1775898858900,
      "skeleton": null,
      "objects": [
        {
          "label": "basketball",
          "confidence": 0.872,
          "bbox": { "left": 412.5, "top": 230.1, "right": 488.2, "bottom": 308.9 },
          "frame_size": { "width": 1280, "height": 960 }
        }
      ]
    }
  ]
}

Key fields:

wall_ms — absolute wall-clock timestamp of each detector emission. Used to compute Hz and to align events across runs.
objects[] — list of detections per frame. Empty list = the detector ran but found nothing.
confidence — the conf score from the detector output (0–1).
bbox.left/top/right/bottom — pixel coords in the original frame's coordinate space (post un-letterbox).
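
The fields above are enough to derive the headline numbers by hand. The following is a minimal sketch, assuming a run has already been loaded (for example with `json.load`) into a dict that follows the schema. `summarize_run` is a hypothetical helper, not part of compare_logs.py, and averaging confidence over all detections is an assumption about how the report defines "mean conf".

```python
from statistics import mean

def summarize_run(run: dict) -> dict:
    """Derive basic metrics from one run log that follows the schema above."""
    events = run["events"]
    duration_s = (run["stopped_wall_ms"] - run["started_wall_ms"]) / 1000.0
    # Flatten every detection of every event into one list.
    detections = [obj for ev in events for obj in ev["objects"]]
    # An event counts as a hit when the detector returned at least one object.
    hits = sum(1 for ev in events if ev["objects"])
    return {
        "events": len(events),
        "hz": len(events) / duration_s if duration_s > 0 else 0.0,
        "det_rate": hits / len(events) if events else 0.0,
        "mean_conf": mean(d["confidence"] for d in detections) if detections else 0.0,
    }
```

An event with an empty `objects` list lowers the detection rate but does not affect the mean confidence, which matches the "ran but found nothing" meaning described above.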

The R&D project's reference implementation lives in:

posedetection/src/commonMain/kotlin/com/nate/posedetection/ExperimentLogger.kt — the logger
posedetection/src/commonMain/kotlin/com/nate/posedetection/ExperimentAuto.kt — the auto-mode driver

The kima app can implement its own equivalent in any language/framework. The schema is the contract.
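
Because the schema is the contract, it can help to check a captured log before feeding it to the report generator. This is a minimal sketch; `schema_problems` and the key sets are hypothetical names, and the checks cover only the fields listed in this document, not whatever compare_logs.py may additionally require.

```python
REQUIRED_RUN_KEYS = {"device_model", "model_name", "run_id",
                     "started_wall_ms", "stopped_wall_ms", "events"}
REQUIRED_BBOX_KEYS = {"left", "top", "right", "bottom"}

def schema_problems(run: dict) -> list:
    """Return human-readable schema violations; an empty list means the log looks OK."""
    problems = []
    missing = REQUIRED_RUN_KEYS - set(run)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    for i, ev in enumerate(run.get("events", [])):
        if "wall_ms" not in ev:
            problems.append(f"event {i}: missing wall_ms")
        for j, obj in enumerate(ev.get("objects", [])):
            # A missing confidence falls back to -1.0 and is reported as out of range.
            if not 0.0 <= obj.get("confidence", -1.0) <= 1.0:
                problems.append(f"event {i} object {j}: confidence outside 0-1")
            if REQUIRED_BBOX_KEYS - set(obj.get("bbox", {})):
                problems.append(f"event {i} object {j}: incomplete bbox")
    return problems
```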

The protocol#

Step 0 — Position the phone#

Mount the phone in landscape, aimed at the monitor showing the test video. Don't move it between captures. Even small bumps shift the framing and can swing yolo26n confidence by 0.10+.

Step 1 — Noise-floor sanity check#

Before testing the model you actually care about, capture a known-stable model and verify it lands at the expected confidence. The handover bundle includes yolo11n_noise_floor_baseline.tflite for exactly this purpose.

Run: yolo11n_noise_floor_baseline.tflite for 60s
Expected mean confidence: ~0.72 (range 0.71–0.76 across many sessions)

If this lands more than 0.05 off the expected value, reposition the phone before continuing. The lighting, phone framing, or video timing has drifted and any subsequent A/B test will be measuring noise instead of model quality.
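
The sanity check can be expressed as a small gate on the baseline run. This is a sketch under the numbers stated in this step (expected ~0.72, tolerance 0.05); `baseline_drift` and `noise_floor_ok` are hypothetical helpers operating on a run dict in the logging schema.

```python
from statistics import mean

EXPECTED_BASELINE_CONF = 0.72  # expected mean confidence quoted in this protocol
MAX_BASELINE_DRIFT = 0.05      # tolerance quoted in this protocol

def baseline_drift(run: dict) -> float:
    """Signed difference between the run's mean confidence and the expected value."""
    confs = [obj["confidence"] for ev in run["events"] for obj in ev["objects"]]
    if not confs:
        raise ValueError("baseline run contains no detections")
    return mean(confs) - EXPECTED_BASELINE_CONF

def noise_floor_ok(run: dict) -> bool:
    """True when the baseline landed within tolerance of the expected value."""
    return abs(baseline_drift(run)) <= MAX_BASELINE_DRIFT
```

The signed drift is kept separate from the pass/fail result so that a log of repeated sessions shows in which direction the setup is moving.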

Step 2 — Capture the actual A/B set#

Run 60-second back-to-back captures of each model under test, against the same video segment, without moving anything between captures. The R&D project used this sequence as the standard 5-way:

yolo11n_noise_floor_baseline.tflite (sanity)
The model you're considering shipping (e.g. yolo26n_v11_rect_512x384.tflite)
A direct competitor (e.g. yolo26n_v11_square_416.tflite for the rect-vs-square comparison)
A speed alternative (e.g. yolo26n_v11_rect_384x288.tflite)
A quality reference (e.g. yolo26n_v11_square_512.tflite or yolo11n_dataset640.tflite)

60 seconds is the minimum for stable measurements. 10s clips have a noise floor of ~±0.10 mean conf which is too high to detect real differences. 60s clips drop the noise floor to ~±0.05 (still significant for yolo26n, see below).
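
One way to get a feel for why short clips are noisy is to split a 60s capture into consecutive 10s windows and look at how much the per-window mean confidence moves. This is only an illustrative sketch: `window_mean_confs` is a hypothetical helper, and the spread between windows of one capture is not the same quantity as the capture-to-capture noise floor quoted above.

```python
from collections import defaultdict
from statistics import mean

def window_mean_confs(run: dict, window_s: float = 10.0) -> list:
    """Mean confidence per consecutive window, measured from the run start."""
    buckets = defaultdict(list)
    for ev in run["events"]:
        # Index of the window this event falls into.
        idx = int((ev["wall_ms"] - run["started_wall_ms"]) // (window_s * 1000))
        buckets[idx].extend(obj["confidence"] for obj in ev["objects"])
    # Windows without any detection are skipped rather than counted as zero.
    return [mean(confs) for _, confs in sorted(buckets.items()) if confs]
```

Comparing `max(...) - min(...)` over the returned list for 10s windows against the single 60s mean gives a rough picture of how much a shorter capture could have misled you.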

Step 3 — Run the report generator#

Drop the captured JSON files into a folder (one subfolder per run, with the JSON inside) and run reference_code/compare_logs.py:

python3 compare_logs.py /path/to/captures/ -o /tmp/reports --bucket 250
open /tmp/reports/report.html

The HTML report shows side-by-side scorecards (events, det rate, mean conf, mean bbox area, per-class), per-bucket histograms, and per-class timelines. Look at:

Mean confidence — the headline quality number
Detection rate — should be 100% on a video with always-visible objects
Mean bbox area — proxy for box tightness; lower = tighter (usually better)
Per-class timelines — visualize where each model loses detections
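
The per-class timeline idea can be reproduced directly from a run log. The sketch below assumes that `--bucket 250` means a bucket width of 250 ms, which this document does not state explicitly; `per_class_timeline` and `empty_buckets` are hypothetical helpers and not the report generator's actual code.

```python
from collections import Counter

def per_class_timeline(run: dict, bucket_ms: int = 250) -> dict:
    """Map each label to a Counter of {bucket index: number of detections}."""
    timeline = {}
    for ev in run["events"]:
        bucket = (ev["wall_ms"] - run["started_wall_ms"]) // bucket_ms
        for obj in ev["objects"]:
            timeline.setdefault(obj["label"], Counter())[bucket] += 1
    return timeline

def empty_buckets(timeline: dict, label: str, n_buckets: int) -> list:
    """Bucket indices in which `label` was never detected."""
    seen = timeline.get(label, Counter())
    return [b for b in range(n_buckets) if seen[b] == 0]
```

Runs of consecutive empty buckets for a class point at the stretches of video where that model loses the object.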

Step 4 — Interpret with the noise floor in mind#

The yolo26n family is noticeably more sensitive to small framing changes than yolo11n. Across multiple R&D sessions, the same yolo26n_v11_square_416.tflite measured between 0.64 and 0.84 mean confidence depending on how the phone was positioned. The yolo11n baseline barely budged across the same conditions.

What this means:

Within-session comparisons are reliable. All models in the same back-to-back capture sequence saw the same framing, so their relative ordering is meaningful. If model A scores 0.80 and model B scores 0.70 in the same session, A is genuinely better.
Cross-session comparisons are not reliable unless you re-validate via the noise-floor model. Don't quote V5.x results from this project's history as direct benchmarks for the kima app's measurements.

If your A/B test gives results that surprise you, run Step 1 again — the phone might have shifted between captures.

Recommended A/B sequence for kima#

When you start integrating models, run this sequence in order. Each step should yield a clear yes/no decision before proceeding.

A. "Does the model load and run?"#

Build kima with yolo26n_v11_rect_512x384.tflite in assets.
Cold-start the app and verify that logcat shows delegate=GPU inputShape=[1, 384, 512, 3].
Aim the camera at any basketball-containing scene and verify detections appear.
Decision: if yes, proceed. If no, debug per ANDROID_INTEGRATION.md "Common gotchas".
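
The logcat check in this step can be automated on a saved log. This sketch assumes the line contains the text exactly as quoted above (`delegate=... inputShape=[...]`); the surrounding log tag and the helper names `parse_delegate_line` and `model_loaded_as_expected` are hypothetical.

```python
import re

# Matches e.g. "delegate=GPU inputShape=[1, 384, 512, 3]" anywhere in a line.
LOG_PATTERN = re.compile(r"delegate=(\w+)\s+inputShape=\[([\d,\s]+)\]")

def parse_delegate_line(line: str):
    """Return (delegate, shape) from a logcat line, or None if it does not match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    shape = [int(x) for x in m.group(2).split(",")]
    return m.group(1), shape

def model_loaded_as_expected(line: str, expected_shape=(1, 384, 512, 3)) -> bool:
    """True when the line reports the GPU delegate and the expected input shape."""
    parsed = parse_delegate_line(line)
    return (parsed is not None
            and parsed[0] == "GPU"
            and tuple(parsed[1]) == tuple(expected_shape))
```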

B. "Is the geometry right?"#

Run a 30s capture against the test video.
In the report, check mean bbox area. For yolo26n_v11_rect_512x384 it should be in the 480–560 range (loose box) on the standard test video.
Visually check 5–10 frames in the per-class image gallery (if the kima app supports this) — the boxes should hug the basketball, not pad around it.
Decision: if boxes look correctly placed, the letterbox preprocessing is right. If they're consistently offset or distorted, re-check the un-letterbox math.
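
For reference when re-checking the un-letterbox math, here is a sketch of the usual inverse mapping. It assumes the frame was scaled uniformly to fit the model input and padded symmetrically (centered letterbox); if the kima preprocessing pads differently, the offsets change accordingly. `unletterbox_bbox` is a hypothetical helper.

```python
def unletterbox_bbox(bbox, input_w, input_h, frame_w, frame_h):
    """Map (left, top, right, bottom) from model-input pixels back to frame pixels."""
    # Uniform scale that fits the whole frame inside the model input.
    scale = min(input_w / frame_w, input_h / frame_h)
    # Padding added on each side to center the scaled frame.
    pad_x = (input_w - frame_w * scale) / 2.0
    pad_y = (input_h - frame_h * scale) / 2.0
    left, top, right, bottom = bbox
    return ((left - pad_x) / scale, (top - pad_y) / scale,
            (right - pad_x) / scale, (bottom - pad_y) / scale)
```

With a 4:3 frame (1280x960) and a 512x384 input the padding is zero and only the scale applies. With a square 416 input the same frame gets vertical padding, so a box that is consistently shifted up or down is a typical sign that the padding term was dropped.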

C. "Is the speed acceptable?"#

Run a 60s capture and look at events/sec (Hz).
Expect ~5–7 Hz for yolo26n_v11_rect_512x384 on a modern phone. Anything below 3 Hz usually means the GPU delegate didn't engage.
Decision: if Hz is in the expected range, proceed. If much slower, check delegate selection in logcat.
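
The Hz figure can be computed from the `wall_ms` timestamps alone. This sketch measures from the first to the last emission rather than from run start to stop, which is an assumption about how the report defines Hz. The 3 Hz and 5 Hz thresholds come from this step; the "slower than expected" label for the range in between is my own.

```python
def mean_hz(wall_ms: list) -> float:
    """Events per second, measured from the first to the last emission."""
    if len(wall_ms) < 2:
        return 0.0
    span_s = (max(wall_ms) - min(wall_ms)) / 1000.0
    # N timestamps delimit N-1 intervals.
    return (len(wall_ms) - 1) / span_s if span_s > 0 else 0.0

def speed_verdict(hz: float) -> str:
    """Classify a measured rate against the thresholds in this step."""
    if hz < 3.0:
        return "check delegate"
    if hz < 5.0:
        return "slower than expected"
    return "ok"
```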

D. "Is it production quality?"#

Run the noise-floor sanity check (yolo11n_noise_floor_baseline.tflite should hit ~0.72).
Run the A/B comparison: yolo26n_v11_rect_512x384 vs yolo26n_v11_square_416.
Expect the rect_512x384 to win on confidence by ~0.05–0.15 in within-session comparison.
Decision: if the rect model wins, ship it. If not, the kima camera/orientation config probably doesn't match the model's expected aspect ratio; re-check steps 1 and 2 of ANDROID_INTEGRATION.md.
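
The decision logic of this step can be written down as a small function over the per-model mean confidences of one session. It is a sketch of the rule as described here, not a replacement for looking at the report; `step_d_decision` is a hypothetical helper and the default model names are simply the ones used in this document.

```python
def step_d_decision(mean_conf: dict,
                    baseline: str = "yolo11n_noise_floor_baseline",
                    candidate: str = "yolo26n_v11_rect_512x384",
                    competitor: str = "yolo26n_v11_square_416") -> str:
    """Apply the step D rule to {model name: mean confidence} from one session."""
    # The session only counts if the noise-floor model landed near 0.72.
    if abs(mean_conf[baseline] - 0.72) > 0.05:
        return "fix phone setup and re-run"
    margin = mean_conf[candidate] - mean_conf[competitor]
    if margin > 0:
        return f"ship {candidate} (margin {margin:+.2f})"
    return "re-check camera/orientation config"
```

All three values must come from the same back-to-back sequence, since only within-session comparisons are treated as reliable in this protocol.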

E. "Does it work in real-world conditions?"#

Beyond the test video, capture 5–10 minutes of real basketball footage with the kima app and the chosen model.
Look for consistent detections across distance, angle, lighting changes.
This is the only step that covers what the test video can't catch: the model was trained on indoor court footage that is similar, but not identical, to whatever real users will film.

What NOT to test against#

Don't test fp16 vs fp32 separately. They run at the same speed and differ by only ~0.02 in confidence on this hardware. This is settled: use fp32, since file size doesn't matter at this scale.
Don't run val mAP comparisons as the deciding metric. The val set is held-out from Roboflow and doesn't reflect on-device performance well — see LESSONS_LEARNED.md.
Don't compare against the V5.x historical numbers in this handover. They were measured on a different phone setup, and the cross-session noise is real. The history is useful for understanding the direction of each tuning lever, not the absolute numbers in the kima app's measurement environment.

If you only have time for one test#

Run the A/B at step D: capture yolo11n_noise_floor_baseline, yolo26n_v11_rect_512x384, and yolo26n_v11_square_416 back-to-back for 60 seconds each on the same test video. If the rect model wins on confidence and the noise floor is on target, ship the rect model. If the noise floor is off, fix the phone setup first and re-run.
