Closed-loop training & comparison tools

These scripts close the loop between Colab-trained TFLite object-detection models and on-device measurement on a real Android phone, so we can objectively compare model A vs. model B by running them against the same video on the same hardware.

collab/Model Training.ipynb       (edit cells, retrain in Colab)
        │ trained .tflite via Drive sync
        ▼
collab/output/<run-id>/*.tflite
        │ tools/sync_models.py
        ▼
sample/composeApp/.../assets/<run-id>__<model>.tflite
        │ ./gradlew :sample:composeApp:assembleDebug
        │ adb install -r
        ▼
Phone (Experiment Mode)
        │ JSON log per detection
        ▼
/sdcard/Android/data/com.nate.posedetection.androidApp/files/experiment_logs/
        │ tools/run_experiment.sh (manual) or tools/run_auto_experiment.sh (unattended)
        │ adb pull
        ▼
experiments/<run-id>/{log.json, manifest.json}
        │ tools/compare_logs.py
        ▼
report.html + summary.csv

The package id is com.nate.posedetection.androidApp (note the .androidApp suffix — the Kotlin namespace is com.nate.posedetection without it). Every adb command targets the suffixed name.

Prerequisites

macOS with QuickTime Player (preinstalled).
adb at /Users/virtualintern/Library/Android/sdk/platform-tools/adb.
Phone with USB debugging on, paired with this Mac, camera permission pre-granted to the sample app. Verify with adb devices.
Python 3 for the orchestration scripts. compare_logs.py optionally uses matplotlib for embedded timeline charts — install with uv pip install matplotlib if you want them; the report still renders without.

tools/sync_models.py

Copies trained .tflite files from collab/output/<run-id>/ into the sample app's Android assets directory, renaming each one to <run-id>__<model-base>.tflite. The double underscore is deliberate: the picker in DiscoverModels.android.kt replaces single underscores with spaces, so the double underscore renders as a clear visual break between the run id and the model name in the dropdown.
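A minimal sketch of the naming convention, assuming the picker label is derived by dropping the .tflite extension and replacing every underscore with a space. The helpers synced_name and picker_label are hypothetical illustrations, not functions from sync_models.py or DiscoverModels.android.kt, and the model name best_float16 is made up.

```python
from pathlib import Path


def synced_name(run_id: str, model_path: str) -> str:
    """Asset filename for a trained model: <run-id>__<model-base>.tflite."""
    base = Path(model_path).stem  # "best_float16" from ".../best_float16.tflite"
    return f"{run_id}__{base}.tflite"


def picker_label(asset_name: str) -> str:
    """Approximate dropdown label: extension dropped, each "_" becomes a space.

    Assumption: this mirrors the picker's underscore-to-space replacement;
    the real Kotlin code may differ in details.
    """
    return Path(asset_name).stem.replace("_", " ")
```

With these assumptions, synced_name("v0_smoke", "collab/output/v0_smoke/best_float16.tflite") yields v0_smoke__best_float16.tflite, and picker_label turns that into "v0 smoke  best float16", where the two spaces mark the run/model boundary.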

tools/sync_models.py                     # default: sync the latest run
tools/sync_models.py --list              # show available runs newest-first
tools/sync_models.py --run v0_smoke      # sync a specific run
tools/sync_models.py --clean             # remove synced models

--clean preserves the three baseline models that ship with the repo (yolo11n_dataset_dataset.tflite, yolo11n_su_416.tflite, yolov10n_float16.tflite).
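The run selection and --clean behaviour can be sketched as below. This assumes that "latest run" means the most recently modified directory under collab/output/ and that synced files are recognised by the double underscore in their name; list_runs and clean_synced are hypothetical names, not the script's actual implementation.

```python
from pathlib import Path
from typing import List

# The three models that ship with the repo and must survive --clean.
BASELINES = {
    "yolo11n_dataset_dataset.tflite",
    "yolo11n_su_416.tflite",
    "yolov10n_float16.tflite",
}


def list_runs(output_dir: Path) -> List[str]:
    """Run directories under collab/output/, newest first (by mtime)."""
    runs = [p for p in output_dir.iterdir() if p.is_dir()]
    runs.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in runs]


def clean_synced(assets_dir: Path) -> List[str]:
    """Delete synced <run-id>__<model>.tflite files, keeping the baselines."""
    removed = []
    for path in sorted(assets_dir.glob("*__*.tflite")):
        if path.name in BASELINES:
            continue  # defensive: baselines have no "__", but never delete them
        path.unlink()
        removed.append(path.name)
    return removed
```

Under this scheme the default invocation would sync list_runs(...)[0].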

After syncing, rebuild and reinstall the app:

./gradlew :sample:composeApp:assembleDebug
adb install -r sample/composeApp/build/outputs/apk/debug/composeApp-debug.apk

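If the install fails with INSTALL_FAILED_UPDATE_INCOMPATIBLE (signature mismatch), the app already on the phone was signed with a different debug key. Run adb uninstall com.nate.posedetection.androidApp first, then reinstall. Uninstalling wipes everything the app stored locally, including any files under its experiment_logs directory that you have not pulled yet.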
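If you want to script the reinstall, a sketch like the one below keeps the uninstall opt-in because it destroys local data. It only assumes the standard adb install -r and adb uninstall commands; the function names are hypothetical, and it checks both the exit code and the "Success" line in case an adb build reports failures with exit code 0.

```python
import subprocess

PACKAGE = "com.nate.posedetection.androidApp"  # note the .androidApp suffix
APK = "sample/composeApp/build/outputs/apk/debug/composeApp-debug.apk"


def is_signature_mismatch(adb_output: str) -> bool:
    """True if `adb install` failed because the installed app has another signature."""
    return "INSTALL_FAILED_UPDATE_INCOMPATIBLE" in adb_output


def install_debug_apk(apk: str = APK, allow_uninstall: bool = False) -> None:
    """adb install -r; optionally uninstall and retry on a signature mismatch.

    Uninstalling wipes the app's local data (including unpulled logs), so it is opt-in.
    """
    result = subprocess.run(["adb", "install", "-r", apk], capture_output=True, text=True)
    if result.returncode == 0 and "Success" in result.stdout:
        return
    output = result.stdout + result.stderr
    if not (allow_uninstall and is_signature_mismatch(output)):
        raise RuntimeError(f"adb install failed:\n{output}")
    subprocess.run(["adb", "uninstall", PACKAGE], check=True)
    subprocess.run(["adb", "install", "-r", apk], check=True)
```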

tools/run_auto_experiment.sh — fully unattended orchestration

Drives one experiment run end-to-end with no human at the phone. The orchestrator launches the sample app via am start with intent extras that tell the in-app auto-driver to select a model, wait until a wall-clock target, run the experiment buffer for a fixed duration, write the JSON, and finish the activity. The Mac side races to the same wall-clock target and plays the video in QuickTime at the configured offset.

tools/run_auto_experiment.sh \
    --video ~/Downloads/clip.mp4 \
    --duration 10 \
    --model-tag yolo11n_su_416 \
    [--start-offset 10] \
    [--device <serial>]

Flow:

Snapshots existing experiment-log files on the device for the diff-pull.
Pre-opens the video in QuickTime, seeks to --start-offset seconds, and pauses. This step runs inside a 600 s AppleEvent timeout so that opening a very large 4K file does not fail with an AppleEvent timeout. The slow indexing happens here, before any wall-clock timing matters.
Force-stops the sample app and clears Logcat.
Computes start_at_wall_ms = now + 9 s (configurable via the APP_BOOT_PAD_SECONDS env var). Launches AppActivity via am start with --ez experiment_auto true --es model_name <X> --el start_at_wall_ms <Y> --el duration_ms <Z>.
Sleeps locally until start_at_wall_ms, then issues osascript ... play front document (sub-second since the file is already loaded). t0_wall_ms is captured immediately before the play.
Sleeps --duration seconds, then pauses QuickTime.
Diff-polls /sdcard/.../experiment_logs for the new JSON file (timeout: PULL_TIMEOUT_SECONDS, default 15 s) and pulls it into experiments/<run-id>/.
Captures the ExperimentAuto Logcat excerpt and writes experiments/<run-id>/manifest.json with t0_wall_ms, start_at_wall_ms, model tag, video offset, device label, and the captured log lines.
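The launch-and-wait part of this flow can be sketched as follows. The intent flags (--ez, --es, --el) are the standard am start extras named above; the component string is a placeholder because the fully qualified AppActivity name is not given here, and am_start_argv and sleep_until are hypothetical helpers rather than the script's own code.

```python
import os
import time
from typing import List, Optional

# Same default as the script: 9 s for cold start + model load + camera spin-up.
BOOT_PAD_S = float(os.environ.get("APP_BOOT_PAD_SECONDS", "9"))


def now_ms() -> int:
    """Current wall-clock time in epoch milliseconds."""
    return int(time.time() * 1000)


def am_start_argv(component: str, model_name: str, start_at_wall_ms: int,
                  duration_ms: int, serial: Optional[str] = None) -> List[str]:
    """adb argv that launches the app with the auto-driver's intent extras.

    `component` ("<package>/<activity>") is a placeholder supplied by the caller.
    """
    argv = ["adb"]
    if serial:
        argv += ["-s", serial]
    argv += [
        "shell", "am", "start", "-n", component,
        "--ez", "experiment_auto", "true",
        "--es", "model_name", model_name,  # keep it space-free: adb shell re-splits args
        "--el", "start_at_wall_ms", str(start_at_wall_ms),
        "--el", "duration_ms", str(duration_ms),
    ]
    return argv


def sleep_until(wall_ms: int) -> None:
    """Block until this machine's wall clock reaches wall_ms (no-op if already past)."""
    remaining_s = (wall_ms - now_ms()) / 1000
    if remaining_s > 0:
        time.sleep(remaining_s)
```

On the Mac side the sequence would be: start = now_ms() + int(BOOT_PAD_S * 1000), run the argv with subprocess.run(..., check=True), then sleep_until(start) before telling QuickTime to play. Because the phone and the Mac each compare start_at_wall_ms against their own clock, the alignment is only as good as the agreement between the two clocks.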

The Compose auto-driver inside CameraSample reads LocalExperimentAutoSpec and finds the model whose display name (or its underscore equivalent) matches model_name case-insensitively. It then waits for customObjectFlow.drop(1).first(), the pipeline's first real emission (bounded by an 8 s timeout), and waits until start_at_wall_ms. Finally it runs the buffer for duration_ms, writes the JSON via the same ExperimentLogger the manual button uses, logs the saved path, and finishes the activity.
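The matching rule can be illustrated with a short sketch. The real driver is Kotlin; this Python version assumes the display name is the file stem with underscores turned into spaces, so converting spaces back to underscores recovers the tag you pass as --model-tag. find_model and underscore_equivalent are hypothetical names.

```python
from typing import Iterable, Optional


def underscore_equivalent(display_name: str) -> str:
    """Display name with spaces turned back into underscores."""
    return display_name.replace(" ", "_")


def find_model(display_names: Iterable[str], model_name: str) -> Optional[str]:
    """First display name matching model_name, compared case-insensitively."""
    wanted = model_name.casefold()
    for name in display_names:
        if wanted in (name.casefold(), underscore_equivalent(name).casefold()):
            return name
    return None
```

Under this assumption a tag such as <run-id>__<name> matches a label containing two spaces, which is why the synced filename can be passed through unchanged.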

Logcat tag is ExperimentAuto. To watch a run live:

adb logcat -c && adb logcat -s ExperimentAuto:I

Tuning knobs (env vars):

APP_BOOT_PAD_SECONDS (default 9) — wall-clock budget for cold start + model load + camera spin-up. Bump if you see start_at_wall_ms firing before the phone is ready.
PULL_TIMEOUT_SECONDS (default 15) — how long to wait for the new JSON to appear on the phone after the buffer should have stopped.
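The diff-poll that PULL_TIMEOUT_SECONDS bounds can be sketched as a set difference against the snapshot taken before the run. The log directory is the one from the overview; listing it with adb shell ls and filtering on .json are assumptions, and the function names are hypothetical. The lister is passed in as a callable so the loop can be exercised without a phone.

```python
import os
import subprocess
import time
from typing import Callable, Set

PULL_TIMEOUT_S = float(os.environ.get("PULL_TIMEOUT_SECONDS", "15"))
LOG_DIR = "/sdcard/Android/data/com.nate.posedetection.androidApp/files/experiment_logs"


def adb_list_logs() -> Set[str]:
    """JSON filenames currently in the app's experiment_logs directory."""
    out = subprocess.run(["adb", "shell", "ls", LOG_DIR],
                         capture_output=True, text=True).stdout
    return {line.strip() for line in out.splitlines() if line.strip().endswith(".json")}


def wait_for_new_logs(list_logs: Callable[[], Set[str]], before: Set[str],
                      timeout_s: float = PULL_TIMEOUT_S,
                      interval_s: float = 0.5) -> Set[str]:
    """Poll until files not in `before` appear, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        new = list_logs() - before
        if new:
            return new
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new experiment log after {timeout_s:.0f} s")
        time.sleep(interval_s)
```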

tools/run_experiment.sh — manual orchestration

Drives one experiment run when you can be at the phone to tap Start/Stop.

tools/run_experiment.sh \
    --video ~/Downloads/test.mp4 \
    --duration 10 \
    --model-tag yolo11n_su_416

Flow:

Snapshots the existing experiment-log files on the phone so it can diff-pull only the new ones afterwards.
Prompts you: "Phone ready? Model picked, Experiment Mode on, Start tapped?"
Records t0_wall_ms and plays the video in QuickTime (osascript).
Sleeps for --duration seconds.
Pauses QuickTime.
Prompts you to tap Stop on the phone.
Pulls only the newly-written log file(s) into experiments/<run-id>/.
Writes experiments/<run-id>/manifest.json with t0_wall_ms, model tag, device label, video path, duration.
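A sketch of the final step, assuming the run directory follows the <t0_wall_ms>_<model_tag> pattern from the Layout section and that t0_wall_ms is epoch milliseconds captured just before playback (for example int(time.time() * 1000)). The JSON key spellings are guesses based on the field list above; the real script may name them differently.

```python
import json
from pathlib import Path


def write_manifest(experiments_dir: Path, t0_wall_ms: int, model_tag: str,
                   device_label: str, video_path: str, duration_s: float) -> Path:
    """Create experiments/<t0_wall_ms>_<model_tag>/manifest.json and return its path."""
    run_dir = experiments_dir / f"{t0_wall_ms}_{model_tag}"
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "t0_wall_ms": t0_wall_ms,  # alignment origin used by compare_logs.py
        "model_tag": model_tag,
        "device_label": device_label,
        "video_path": video_path,
        "duration": duration_s,
    }
    path = run_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```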

--device <serial> is only needed if adb devices shows more than one device (otherwise the script auto-selects the single connected one). --no-confirm skips the interactive prompts for unattended dry-runs.
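The auto-selection can be sketched by parsing adb devices output, where each attached device is a "<serial><TAB><state>" line. Only devices in the device state are usable; offline and unauthorized entries are skipped. parse_adb_devices and pick_device are hypothetical names for illustration.

```python
import subprocess
from typing import List, Optional


def parse_adb_devices(output: str) -> List[str]:
    """Serials in the `device` state from `adb devices` output."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Header ("List of devices attached") and daemon notices never have "device" second.
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def pick_device(requested: Optional[str] = None) -> str:
    """Use --device if given, else require exactly one ready device."""
    if requested:
        return requested
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True).stdout
    serials = parse_adb_devices(out)
    if len(serials) != 1:
        raise SystemExit(f"expected exactly one device, found {len(serials)}; pass --device <serial>")
    return serials[0]
```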

tools/compare_logs.py — model comparison report

Reads each experiments/<run-id>/manifest.json together with that run's log file(s), and produces a self-contained HTML report and a CSV summary.

tools/compare_logs.py experiments/                       # report.html in cwd
tools/compare_logs.py experiments/ -o ./reports/         # write into reports/
tools/compare_logs.py experiments/ --bucket 250          # 250ms buckets
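Run discovery can be sketched as a one-level glob for manifest.json, treating every other JSON file in the same directory as a log. This is an assumption about how the script pairs manifests with logs; load_runs is a hypothetical name. Directories without a manifest are ignored.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_runs(experiments_dir: Path) -> List[Dict]:
    """Each experiments/<run-id>/ that has a manifest.json, plus its log files."""
    runs = []
    for manifest_path in sorted(experiments_dir.glob("*/manifest.json")):
        run_dir = manifest_path.parent
        manifest = json.loads(manifest_path.read_text())
        logs = [json.loads(p.read_text())
                for p in sorted(run_dir.glob("*.json")) if p.name != "manifest.json"]
        runs.append({"run_id": run_dir.name, "manifest": manifest, "logs": logs})
    return runs
```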

The report contains:

Per-run scorecard: detection rate, mean confidence, events/sec, classes seen, mean bbox area.
Per-class jitter table: stddev of bbox-center deltas across consecutive frames. Lower = more stable. (Note: a moving subject also produces high jitter — useful for spotting flapping detections more than for absolute model quality.)
Per-bucket comparison table: for each time bucket (default 100 ms, configurable via --bucket), each model's frame count, top class, and max confidence. Aligned to each run's t0_wall_ms from the manifest so model A and model B share the same x-axis when run against the same video.
Per-class confidence timelines: matplotlib PNG charts embedded as base64 data URIs (omitted with a warning banner if matplotlib is not installed).
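The jitter and bucket metrics above can be sketched as follows. The event schema is hypothetical (t_wall_ms, class, confidence, and bbox as [x1, y1, x2, y2]), "delta" is read as the Euclidean distance between consecutive box centers of the same class, and "frame count" is taken as the number of distinct timestamps in a bucket. Events before t0_wall_ms are dropped so every run shares the same origin.

```python
import math
import statistics
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def center(bbox: List[float]) -> Tuple[float, float]:
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def per_class_jitter(events: List[Dict]) -> Dict[str, float]:
    """Population stddev of center displacement between consecutive detections per class."""
    by_class = defaultdict(list)
    for e in sorted(events, key=lambda e: e["t_wall_ms"]):
        by_class[e["class"]].append(center(e["bbox"]))
    jitter = {}
    for cls, centers in by_class.items():
        deltas = [math.dist(a, b) for a, b in zip(centers, centers[1:])]
        if deltas:  # a class seen only once has no consecutive pair
            jitter[cls] = statistics.pstdev(deltas)
    return jitter


def bucket_table(events: List[Dict], t0_wall_ms: int, bucket_ms: int = 100) -> Dict[int, Dict]:
    """Per bucket (offset from t0 // bucket_ms): frame count, top class, max confidence."""
    buckets = defaultdict(list)
    for e in events:
        offset = e["t_wall_ms"] - t0_wall_ms
        if offset >= 0:
            buckets[offset // bucket_ms].append(e)
    table = {}
    for idx, evs in sorted(buckets.items()):
        top_class, _ = Counter(e["class"] for e in evs).most_common(1)[0]
        table[idx] = {
            "frames": len({e["t_wall_ms"] for e in evs}),
            "top_class": top_class,
            "max_conf": max(e["confidence"] for e in evs),
        }
    return table
```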

summary.csv has one row per run with the same scorecard fields, for spreadsheet exploration.

Typical comparison workflow

# 1. Train a new model in collab/Model Training.ipynb (via Colab MCP).
#    Output lands at collab/output/<run-id>/<name>.tflite via Drive sync.

# 2. Sync the trained model into the app.
tools/sync_models.py --run <run-id>

# 3. Build and install.
./gradlew :sample:composeApp:assembleDebug && \
    adb install -r sample/composeApp/build/outputs/apk/debug/composeApp-debug.apk

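# 4. Run experiment A against the trained model. When the script prompts,
#    pick the model in the phone's picker, turn on Experiment Mode, and tap Start.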
tools/run_experiment.sh \
    --video ~/Movies/test_clip.mp4 \
    --duration 10 \
    --model-tag <run-id>__<name>

# 5. (At the phone) Switch to a baseline model in the picker, tap Start
#    Experiment, and re-run against the same video.
tools/run_experiment.sh \
    --video ~/Movies/test_clip.mp4 \
    --duration 10 \
    --model-tag yolo11n_su_416

# 6. Generate the comparison report.
tools/compare_logs.py experiments/ -o reports/

# 7. Open reports/report.html in a browser.

Layout

tools/
├── README.md                (this file)
├── sync_models.py           (Phase 2)
├── run_experiment.sh        (Phase 5, manual orchestration)
├── run_auto_experiment.sh   (unattended orchestration via intent extras)
└── compare_logs.py          (Phase 6)

experiments/             (created by run_experiment.sh and run_auto_experiment.sh; gitignored)
└── <t0_wall_ms>_<model_tag>/
    ├── manifest.json
    └── <model>__<runId>.json

collab/
├── Model Training.ipynb (Drive-synced; iterated via Colab MCP)
└── output/              (Drive-synced; tflites land here)
    └── <run-id>/
        └── *.tflite
