perf: parallel pose+object detection + CPU+XNNPACK default on Android
On Samsung SM-A366B the GPU delegate offloaded only 26/685 YOLO26n ops;
the CPU↔GPU roundtrip tax dominated, and in BOTH mode the GPU also contended
with the pose pipeline. Flipping the default to CPU+XNNPACK (setUseXNNPACK(true)
is required on tensorflow-lite-support 0.5.0, otherwise pure CPU is ~100× slower)
raised BOTH-mode FPS 8.8 → 9.9.
Removing the even/odd alternation in CameraView.android.kt activated the
already-existing `poseExecutor` parallel path in Utils.android.kt, so in BOTH
mode each camera frame now runs both detectors concurrently instead of every
other frame. Net: effective per-detector rate doubled.
numThreads dropped 4 → 3 for the object interpreter so concurrent pose + object
don't oversubscribe the 8-core CPU.
Runtime override preserved: `adb shell setprop debug.tflite.delegate GPU|NNAPI|CPU`.
Combined with tonight's INT8 baseline re-export, on-device BOTH-mode FPS went
8.77 → 16.02 (+83%) with effective object-detection rate rising from ~4.4 to ~16 FPS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>