perf: parallel pose+object detection + CPU+XNNPACK default on Android

On Samsung SM-A366B the GPU delegate offloaded only 26/685 YOLO26n ops, so
the CPU↔GPU roundtrip cost dominated, and in BOTH mode the GPU also contended
with the pose pipeline. Flipping the default to CPU+XNNPACK raised BOTH-mode
FPS 8.8 → 9.9. setUseXNNPACK(true) is required on tensorflow-lite-support
0.5.0; without it, pure CPU is ~100× slower.

Removing the even/odd alternation in CameraView.android.kt activated the
already-existing `poseExecutor` parallel path in Utils.android.kt, so in BOTH
mode every camera frame now runs both detectors concurrently, where previously
each detector ran only on alternate frames. Net: effective per-detector rate
doubled.
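
A minimal sketch of the fan-out described above, not the actual code in
Utils.android.kt: pose is submitted to a separate executor while object
inference runs on the calling thread, so per-frame wall-clock is roughly
max(pose, object) rather than pose + object. `runPose`, `runObject`, and
`processFrame` are hypothetical stand-ins, and the sleep durations are
made-up costs for illustration only.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical stand-ins for the real detectors; sleep simulates inference cost.
fun runPose(): String { Thread.sleep(60); return "pose" }
fun runObject(): String { Thread.sleep(80); return "object" }

// Submits pose to its own executor, runs object on the calling thread,
// then waits for pose, so the two inferences overlap in time.
fun processFrame(poseExecutor: ExecutorService): Pair<String, String> {
    val poseFuture = poseExecutor.submit(Callable { runPose() })
    val objectResult = runObject()
    return poseFuture.get() to objectResult
}

fun main() {
    val poseExecutor = Executors.newSingleThreadExecutor()
    try {
        val start = System.nanoTime()
        val (pose, obj) = processFrame(poseExecutor)
        val elapsedMs = (System.nanoTime() - start) / 1_000_000
        println("$pose + $obj finished in $elapsedMs ms")
    } finally {
        poseExecutor.shutdown()
    }
}
```

The elapsed time printed by `main` should sit near the slower of the two
simulated costs rather than their sum, which is the effect the commit relies on.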

numThreads dropped 4 → 3 for the object interpreter so concurrent pose + object
don't oversubscribe the 8-core CPU.

Runtime override preserved: `adb shell setprop debug.tflite.delegate GPU|NNAPI|CPU`.

Combined with tonight's INT8 baseline re-export, on-device BOTH-mode FPS went
8.77 → 16.02 (+83%) with effective object-detection rate rising from ~4.4 to ~16 FPS.
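
The arithmetic behind those figures can be checked with a short sketch. It
assumes the earlier ~4.4 FPS object rate is simply half the 8.77 FPS frame
rate (each detector ran on alternate frames); `percentGain` is a hypothetical
helper, not part of the codebase.

```kotlin
import java.util.Locale

// Relative change from `before` to `after`, in percent.
fun percentGain(before: Double, after: Double): Double =
    (after - before) / before * 100.0

fun main() {
    val fpsBefore = 8.77
    val fpsAfter = 16.02
    // Assumption: with even/odd alternation, object detection saw every other frame.
    val objectRateBefore = fpsBefore / 2
    println(String.format(Locale.US, "BOTH-mode gain: %.1f%%", percentGain(fpsBefore, fpsAfter)))
    println(String.format(Locale.US, "object rate before: %.2f FPS", objectRateBefore))
}
```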

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Author: virtualintern
Co-author: Claude Opus 4.7 (1M context)
Date: Apr 18, 2026, 8:38 PM +0300
Commit: a405c0c0af40f1a5bf981a3d05c3268090c4b452
Parent: ad700278cfc5cf6c57d65848854281fc0c5681d7

2 changed files, +59 -26

posedetection/src/androidMain/kotlin/com.performancecoachlab/posedetection/camera/CameraView.android.kt (+12 -8)

@@ -305,16 +305,20 @@
             val now = sensorMs + bootEpochOffset
             val area = focus
 
-            // In BOTH mode, stagger pose and object on alternate frames
-            // to reduce per-frame processing time and improve tracking.
-            val frameNum = analysisFrameCounter.incrementAndGet()
-            val isBothMode = currentDetectMode == DetectMode.BOTH
+            // In BOTH mode, run both detectors on every eligible frame.
+            // The underlying `ImageProxy.process()` already fans pose out
+            // to `poseExecutor` and runs object inference on the analyzer
+            // thread concurrently, so per-frame wall-clock is max(pose,
+            // object) instead of pose+object. The previous even/odd
+            // alternation cut each detector's effective frame rate in
+            // half for ~20% combined-FPS gain — ditched in favor of
+            // doubling per-detector coverage, which matters more for
+            // ball-near-hoop detection.
+            analysisFrameCounter.incrementAndGet()
             val shouldRunObject = currentDetectMode.doObject() &&
-                (now - lastObjectRunAtMs.get() >= objectIntervalMs) &&
-                (!isBothMode || frameNum % 2 == 0L)
+                (now - lastObjectRunAtMs.get() >= objectIntervalMs)
             val shouldRunPose = currentDetectMode.doPose() &&
-                (now - lastPoseRunAtMs.get() >= poseIntervalMs) &&
-                (!isBothMode || frameNum % 2 != 0L)
+                (now - lastPoseRunAtMs.get() >= poseIntervalMs)
 
             // If neither detector is scheduled to run, just close quickly and reuse last results.
             if (!shouldRunObject && !shouldRunPose) {

posedetection/src/androidMain/kotlin/com/performancecoachlab/posedetection/custom/CustomObjectModel.android.kt (+47 -18)

@@ -22,22 +22,37 @@
         if (modelPath.androidModelPath == null) {
             throw IllegalArgumentException("Android model path cannot be null")
         }
-        // Prefer GPU, then NNAPI (API 27+), then CPU. `selectedDelegate` tracks
-        // which one actually ends up in the final interpreter so we can log it.
+        // Default delegate is CPU+XNNPACK on Android. Measured on Samsung SM-A366B:
+        // CPU+XNNPACK at 12.3 FPS object-only and 9.9 FPS both-mode BEAT GPU
+        // (12.1 and 8.8 FPS respectively) because the GPU delegate only offloads
+        // 26/685 YOLO26n ops and the CPU↔GPU roundtrip overhead dominates — plus
+        // GPU contends with the pose pipeline in BOTH mode. See the delegate
+        // section of collab/experiment_log.md for the full test.
+        //
+        // Override for experiments via `adb shell setprop debug.tflite.delegate
+        // GPU|NNAPI|CPU`. Chosen path is logged.
+        val requestedDelegate = runCatching {
+            val cls = Class.forName("android.os.SystemProperties")
+            val get = cls.getMethod("get", String::class.java, String::class.java)
+            (get.invoke(null, "debug.tflite.delegate", "CPU") as? String).orEmpty()
+                .uppercase().ifBlank { "CPU" }
+        }.getOrElse { "CPU" }
+        Logger.i { "TFLite: requestedDelegate=$requestedDelegate (debug.tflite.delegate)" }
+
         var selectedDelegate = "CPU"
-        val (options, gpuDelegate) = runCatching {
-            val delegate = GpuDelegate()
-            val opts = Interpreter.Options().apply {
-                addDelegate(delegate)
-                setNumThreads(2)
+        val (options, gpuDelegate) = when (requestedDelegate) {
+            "CPU" -> {
+                selectedDelegate = "CPU"
+                // numThreads=3 (not 4) leaves ~3 cores for concurrent pose detection
+                // in BOTH mode. With numThreads=4 and pose also multithreaded, the
+                // 8-core Samsung oversubscribes and thermal-throttles within ~30s.
+                Logger.i { "TFLite: forced CPU (no delegate, numThreads=3, XNNPACK on)" }
+                Interpreter.Options().apply {
+                    setNumThreads(3)
+                    setUseXNNPACK(true)
+                } to null
             }
-            selectedDelegate = "GPU"
-            Logger.i { "TFLite: GPU delegate constructed" }
-            opts to delegate
-        }.onFailure { t ->
-            Logger.w(t) { "TFLite: GPU delegate not available; trying NNAPI" }
-        }.getOrElse {
-            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O_MR1) {
+            "NNAPI" -> if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O_MR1) {
                 runCatching {
                     val nnapiDelegate = org.tensorflow.lite.nnapi.NnApiDelegate()
                     val opts = Interpreter.Options().apply {
@@ -45,15 +60,29 @@
                         setNumThreads(2)
                     }
                     selectedDelegate = "NNAPI"
-                    Logger.i { "TFLite: NNAPI delegate constructed" }
+                    Logger.i { "TFLite: NNAPI delegate constructed (forced)" }
                     opts to null
-                }.onFailure { t ->
-                    Logger.w(t) { "TFLite: NNAPI delegate not available; falling back to CPU" }
-                }.getOrElse {
+                }.getOrElse { t ->
+                    Logger.w(t) { "TFLite: NNAPI forced but unavailable; falling back to CPU" }
                     selectedDelegate = "CPU"
                     Interpreter.Options().apply { setNumThreads(4) } to null
                 }
             } else {
+                selectedDelegate = "CPU"
+                Logger.w { "TFLite: NNAPI requested but SDK too old; CPU" }
+                Interpreter.Options().apply { setNumThreads(4) } to null
+            }
+            else -> runCatching { // GPU (default)
+                val delegate = GpuDelegate()
+                val opts = Interpreter.Options().apply {
+                    addDelegate(delegate)
+                    setNumThreads(2)
+                }
+                selectedDelegate = "GPU"
+                Logger.i { "TFLite: GPU delegate constructed" }
+                opts to delegate
+            }.getOrElse { t ->
+                Logger.w(t) { "TFLite: GPU unavailable; CPU" }
                 selectedDelegate = "CPU"
                 Interpreter.Options().apply { setNumThreads(4) } to null
             }
