Android Integration Requirements

This document explains what the kima Android app must implement to consume the models in models/. All requirements are derived from working code in reference_code/ — copy or adapt those files as needed.

Critical: camera config + activity orientation

The recommended model yolo26n_v11_rect_512x384.tflite expects a landscape 4:3 input. If the camera delivers any other aspect ratio or orientation, the detector's letterbox math will pad with gray and waste capacity. The square models (square_416, square_512) tolerate any aspect ratio gracefully via letterboxing, but they're slower for the same effective resolution.

1. Pin CameraX `ImageAnalysis` to 4:3

In whatever Compose / Activity sets up the CameraX use cases, the ImageAnalysis.Builder needs setTargetAspectRatio(AspectRatio.RATIO_4_3):

import androidx.camera.core.AspectRatio
import androidx.camera.core.ImageAnalysis

val imageAnalysis = ImageAnalysis.Builder()
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
    .setOutputImageFormat(ImageAnalysis.OUTPUT_IMAGE_FORMAT_RGBA_8888)
    .setTargetAspectRatio(AspectRatio.RATIO_4_3)   // ← CRITICAL for the rect model
    .build()

setTargetAspectRatio is a hint: CameraX picks the closest available 4:3 mode (typically 1280×960, 1440×1080, or 2048×1536 depending on the device). The actual W×H still varies from device to device, but on nearly all devices the aspect ratio comes out as 4:3.

If a device can't deliver 4:3 (rare), the existing letterbox math in the detector handles it gracefully — at worst, the model wastes some pixels on padding and quality drops slightly, but it doesn't crash.

See reference_code/CameraView_snippet.kt for the full surrounding context.
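
Because the aspect ratio is requested rather than enforced, it may be worth checking the frames that actually arrive. The helper below is a minimal sketch, not part of the reference code: the name isLandscapeFourByThree and the 1% tolerance are assumptions chosen for illustration. In an analyzer it could be called with the frame's width and height and the result logged once per session.

```kotlin
import kotlin.math.abs

/**
 * Returns true if width:height is within [tolerance] of a landscape 4:3 ratio.
 * Hypothetical helper for a one-off sanity check on delivered frames.
 */
fun isLandscapeFourByThree(width: Int, height: Int, tolerance: Float = 0.01f): Boolean {
    if (width <= 0 || height <= 0) return false
    val aspect = width.toFloat() / height.toFloat()
    return abs(aspect - 4f / 3f) <= tolerance
}
```

A 1280×960 or 1440×1080 frame passes this check; a 1280×720 (16:9) frame or a portrait 960×1280 frame does not, which signals that the letterbox path will be padding.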

2. Lock the activity to landscape

Otherwise the camera frame arrives portrait-rotated and the 512×384 (W×H) landscape model gets a 384×512 portrait input: wrong aspect, large letterbox bars. Add to the main activity in AndroidManifest.xml:

<activity
    android:name=".MainActivity"
    android:screenOrientation="landscape"
    ...>

See reference_code/AndroidManifest_snippet.xml for an example.

Alternative if landscape lock isn't acceptable (e.g., kima needs portrait support too): inside the detector, rotate the bitmap by 90° before resizing if imgH > imgW. Adds minimal CPU cost. Implementation sketch in LESSONS_LEARNED.md.
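
The coordinate bookkeeping for this alternative can be sketched as follows. This is not the sketch from LESSONS_LEARNED.md; it is an illustration that assumes the portrait frame is rotated 90° clockwise before letterboxing, and the names Box, needsRotation, and unrotateClockwise90 are hypothetical.

```kotlin
/** Axis-aligned box in pixel coordinates (y grows downward). */
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)

/** True when the frame is portrait and would be rotated before letterboxing. */
fun needsRotation(imgW: Int, imgH: Int): Boolean = imgH > imgW

/**
 * Maps a box found on the 90°-clockwise-rotated frame back to the original
 * portrait frame. [origH] is the height of the original (unrotated) frame.
 * The forward rotation sends (x, y) to (origH - y, x), so the inverse is
 * x = y', y = origH - x'.
 */
fun unrotateClockwise90(box: Box, origH: Float): Box = Box(
    left = box.top,
    top = origH - box.right,
    right = box.bottom,
    bottom = origH - box.left,
)
```

The un-rotation would run after un-letterboxing, so that the rest of the app only ever sees boxes in the original frame's coordinates. A counter-clockwise rotation needs the mirrored formulas.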

Critical: letterbox-aware preprocessing in the detector

YOLO models are trained on letterboxed input — the source image is scaled to fit the model's input shape while preserving aspect ratio, then padded with gray (RGB 114,114,114) to fill any remaining space. The Android detector MUST do the same, otherwise the model's geometry priors are violated and detection quality plummets.

If the camera aspect already matches the model aspect (4:3 camera + 4:3 model), the letterbox math degenerates to a plain resize with padX=0, padY=0 — but you still need the math in place for the cases where it doesn't match.
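
To make the degenerate case concrete, here is a small pure-Kotlin sketch of the letterbox parameters. It assumes the same rounding rule as the detector (round to nearest, at least 1 pixel); the LetterboxParams type and letterboxParams function are names introduced here for illustration only.

```kotlin
import kotlin.math.min
import kotlin.math.roundToInt

/** Scale factor, scaled content size, and padding on each axis. */
data class LetterboxParams(
    val scale: Float,
    val scaledW: Int,
    val scaledH: Int,
    val padX: Float,
    val padY: Float,
)

/** Computes how an imgW×imgH frame fits into an inputW×inputH model input. */
fun letterboxParams(imgW: Int, imgH: Int, inputW: Int, inputH: Int): LetterboxParams {
    // The smaller ratio wins so the whole frame fits inside the input.
    val scale = min(inputW.toFloat() / imgW, inputH.toFloat() / imgH)
    val scaledW = (imgW * scale).roundToInt().coerceAtLeast(1)
    val scaledH = (imgH * scale).roundToInt().coerceAtLeast(1)
    return LetterboxParams(
        scale, scaledW, scaledH,
        padX = (inputW - scaledW) / 2f,
        padY = (inputH - scaledH) / 2f,
    )
}
```

For a 1280×960 frame and a 512×384 input, the content fills the input and both pads are 0. For a 1280×720 frame, the content shrinks to 512×288 and padY becomes 48, so a quarter of the input rows are gray padding.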

Reference implementation

reference_code/ImageDetector.android.kt is the working letterbox-aware detector from this project. Key points:

// Letterbox: scale source to fit model input while preserving aspect ratio.
val scale = min(
    inputW.toFloat() / imgW.toFloat(),
    inputH.toFloat() / imgH.toFloat()
)
val scaledW = (imgW * scale).roundToInt().coerceAtLeast(1)
val scaledH = (imgH * scale).roundToInt().coerceAtLeast(1)
val padX = (inputW - scaledW) / 2f
val padY = (inputH - scaledH) / 2f

val resized = Bitmap.createBitmap(inputW, inputH, Bitmap.Config.ARGB_8888)
val canvas = Canvas(resized)
canvas.drawColor(Color.rgb(114, 114, 114))   // ← gray fill matching YOLO training
canvas.drawBitmap(
    srcBitmap, null,
    RectF(padX, padY, padX + scaledW, padY + scaledH),
    Paint(Paint.FILTER_BITMAP_FLAG)
)

After inference, the output bboxes (which are normalized [0, 1] over the letterboxed input tensor) need to be un-letterboxed back to the original image's coordinate space:

val x1pLb = min(x1, x2) * inputW   // letterboxed pixel coords
val y1pLb = min(y1, y2) * inputH
val x2pLb = max(x1, x2) * inputW
val y2pLb = max(y1, y2) * inputH

// Subtract padding offset, divide by scale ratio → original image coords
val left = ((x1pLb - padX) / scale).coerceIn(0f, imgWF)
val top = ((y1pLb - padY) / scale).coerceIn(0f, imgHF)
val right = ((x2pLb - padX) / scale).coerceIn(0f, imgWF)
val bottom = ((y2pLb - padY) / scale).coerceIn(0f, imgHF)

If you skip the un-letterbox step, bboxes will be slightly offset and scaled wrong — most visibly at the edges of the frame.
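
A worked example shows the size of the error. The function below is a one-axis sketch of the same arithmetic as the snippet above; unletterbox is a hypothetical name, and the numbers assume a 1280×720 frame letterboxed into a 512×384 input (scale 0.4, padY 48).

```kotlin
/**
 * Maps one normalized coordinate on the letterboxed tensor back to source pixels.
 * [inputSize] is the model input size on that axis, [pad] the padding on that
 * axis, and [imgSize] the source image size used for clamping.
 */
fun unletterbox(norm: Float, inputSize: Int, pad: Float, scale: Float, imgSize: Float): Float =
    ((norm * inputSize - pad) / scale).coerceIn(0f, imgSize)
```

With those numbers, a box edge at normalized y = 0.125 sits exactly on the first non-padded row and maps to y = 0 in the source frame. Multiplying 0.125 by the frame height instead gives 90 px, which is the kind of offset the paragraph above warns about.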

TFLite interpreter setup

The reference CustomObjectModel.android.kt shows the full delegate fallback chain (GPU → NNAPI → CPU) with logging. Key snippet:

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate

var selectedDelegate = "CPU"
val (options, gpuDelegate) = runCatching {
    val delegate = GpuDelegate()
    val opts = Interpreter.Options().apply {
        addDelegate(delegate)
        setNumThreads(2)
    }
    selectedDelegate = "GPU"
    Logger.i { "TFLite: GPU delegate constructed" }
    opts to delegate
}.onFailure { t ->
    Logger.w(t) { "TFLite: GPU delegate not available; trying NNAPI" }
}.getOrElse {
    // fall back to NNAPI, then CPU 4-thread
    // ... see reference_code/CustomObjectModel.android.kt
}

val interpreter = Interpreter(model, options)

// IMPORTANT: log the selected delegate and the actual loaded shapes so you can
// verify that GPU is engaging and that the file is the model you expect
val inputShape = interpreter.getInputTensor(0)?.shape()
val outputShape = interpreter.getOutputTensor(0)?.shape()
Logger.i {
    "TFLite: model='$path' delegate=$selectedDelegate " +
        "inputShape=${inputShape?.toList()} outputShape=${outputShape?.toList()}"
}

Verify after install: after launching the app, adb logcat -s TFLite:I (assuming the Logger tag is TFLite) should show one line per model load. Confirm:

delegate=GPU (not CPU, which would be roughly 10× slower)
inputShape=[1, 384, 512, 3] for yolo26n_v11_rect_512x384.tflite (or match the chosen model)
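
The shape of the GPU → NNAPI → CPU fallback can be sketched independently of TFLite. The helper below is a generic illustration, not the code in CustomObjectModel.android.kt: Selected and firstAvailable are hypothetical names, and the factories stand in for whatever builds the interpreter options for each delegate.

```kotlin
/** Which candidate won, and the value its factory built. */
data class Selected<T>(val name: String, val value: T)

/**
 * Tries each candidate in order and returns the first whose factory succeeds.
 * Every failure is reported through [onFailure], so a fallback is never silent.
 */
fun <T> firstAvailable(
    candidates: List<Pair<String, () -> T>>,
    onFailure: (name: String, error: Throwable) -> Unit = { _, _ -> },
): Selected<T> {
    for ((name, factory) in candidates) {
        val result = runCatching(factory)
        result.onFailure { onFailure(name, it) }
        if (result.isSuccess) return Selected(name, result.getOrThrow())
    }
    throw IllegalStateException("No candidate could be constructed")
}
```

Returning the winning name alongside the value keeps the delegate= field in the log line honest, since it is derived from what was actually constructed rather than from a separately mutated variable.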

Output parsing

All models emit 300 post-NMS detections of 6 fields each, nominally as a [1, 300, 6] tensor:

val outputShape = interpreter.getOutputTensor(0).shape()  // [1, 300, 6]
val output = TensorBuffer.createFixedSize(outputShape, DataType.FLOAT32)
interpreter.run(tensorImage.buffer, output.buffer)
val array = output.floatArray   // 1800 floats

// Layout: 300 rows of 6 columns each, sorted by confidence descending
// Row i: array[i*6 .. i*6+5] = [x1, y1, x2, y2, conf, cls]
for (i in 0 until 300) {
    val row = i * 6
    val conf = array[row + 4]
    if (conf < 0.25f) break  // remaining rows are empty/zero
    val x1 = array[row + 0]
    val y1 = array[row + 1]
    val x2 = array[row + 2]
    val y2 = array[row + 3]
    val cls = array[row + 5].toInt()  // 0=basketball, 1=basketball_hoop
    // (then un-letterbox per the snippet above and emit a detection)
}

The reference ImageDetector.android.kt handles a subtle wrinkle: depending on how ultralytics happens to export, the output may be [1, 300, 6] (300 detections, 6 fields each) OR [1, 6, 300] (6 fields, 300 detections each). The reference code auto-detects via the dim2 == 6 ? elements-first : channels-first check. Worth keeping that flexibility.
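
A layout-agnostic parser along these lines might look like the sketch below. It is an illustration of the idea, not the reference implementation: RawDetection and parseDetections are hypothetical names, and the rule "last dimension equals 6 means one detection per row" is the assumption being encoded.

```kotlin
/** One detection with corner coordinates normalized to the letterboxed input. */
data class RawDetection(
    val x1: Float, val y1: Float, val x2: Float, val y2: Float,
    val conf: Float, val cls: Int,
)

/** Parses a flat output buffer shaped [1, N, 6] or [1, 6, N]. */
fun parseDetections(
    flat: FloatArray,
    shape: IntArray,
    confThreshold: Float = 0.25f,
): List<RawDetection> {
    require(shape.size == 3 && shape[0] == 1) { "Unexpected shape ${shape.toList()}" }
    val rowsLayout = shape[2] == 6 // [1, N, 6]: six consecutive floats per detection
    require(rowsLayout || shape[1] == 6) { "Neither dimension is 6: ${shape.toList()}" }
    val count = if (rowsLayout) shape[1] else shape[2]
    require(flat.size == count * 6) { "Buffer size does not match shape" }

    // Reads field f of detection i in whichever layout was detected.
    fun at(i: Int, f: Int) = if (rowsLayout) flat[i * 6 + f] else flat[f * count + i]

    val out = ArrayList<RawDetection>()
    for (i in 0 until count) {
        val conf = at(i, 4)
        if (conf < confThreshold) continue
        out.add(RawDetection(at(i, 0), at(i, 1), at(i, 2), at(i, 3), conf, at(i, 5).toInt()))
    }
    return out
}
```

This version skips low-confidence rows with continue rather than break, so it does not depend on the rows being sorted by confidence; the cost is scanning all N rows. A [1, 6, 6] output would be ambiguous and is treated here as one detection per row.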

Permissions + dependencies

The kima app likely already has these, but for completeness:

<!-- AndroidManifest.xml -->
<uses-permission android:name="android.permission.CAMERA" />
<uses-feature android:name="android.hardware.camera" android:required="false" />

// build.gradle.kts dependencies
implementation("androidx.camera:camera-core:1.x.x")
implementation("androidx.camera:camera-camera2:1.x.x")
implementation("androidx.camera:camera-lifecycle:1.x.x")
implementation("androidx.camera:camera-view:1.x.x")

implementation("org.tensorflow:tensorflow-lite:2.x.x")
implementation("org.tensorflow:tensorflow-lite-gpu:2.x.x")
implementation("org.tensorflow:tensorflow-lite-support:0.x.x")

Common gotchas

Wrong delegate: if the GPU delegate fails silently (e.g. on some emulators or low-end devices) and falls back to CPU, inference goes from ~7 Hz to ~1 Hz. The logging in CustomObjectModel.android.kt makes this visible.
Wrong input dtype: input is float32 even for the fp16-internal models. The fp16 versions only have fp16 weights internally; their I/O tensors are still float32. Don't try to feed Int8/Uint8 buffers without re-quantizing.
Pixel normalization: YOLO expects pixel values in [0, 1], not [0, 255]. The reference uses NormalizeOp(0f, 255f), which performs (x - 0) / 255 → [0, 1]. Skipping this is a common cause of "model loads fine, detections all garbage".
NMS double-application: these models have NMS baked in. Don't run another NMS pass on the output, because it can discard valid detections.
Coordinate system after rotation: if you rotate the bitmap to landscape inside the detector (the alternative described under "Lock the activity to landscape"), you also need to track that rotation when emitting bounding box coordinates back to the rest of the app. Otherwise downstream consumers see boxes in the wrong reference frame.
Filename probe before shipping: see LESSONS_LEARNED.md and the MODELS.md shape table. After copying any .tflite file, load it once with tf.lite.Interpreter and verify that the input shape matches what you expect. Filenames have lied before in this project.
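
The pixel normalization and dtype points can be illustrated without the support library. The function below is a hand-rolled sketch, not the NormalizeOp path the reference uses: argbToNormalizedRgb is a hypothetical name, and it assumes packed ARGB ints (the format Bitmap.getPixels produces) and an interleaved R, G, B float32 layout.

```kotlin
/**
 * Converts packed ARGB pixels into float RGB values in [0, 1],
 * interleaved as R, G, B per pixel. Alpha is dropped.
 */
fun argbToNormalizedRgb(pixels: IntArray): FloatArray {
    val out = FloatArray(pixels.size * 3)
    for ((i, p) in pixels.withIndex()) {
        out[i * 3] = ((p shr 16) and 0xFF) / 255f     // red
        out[i * 3 + 1] = ((p shr 8) and 0xFF) / 255f  // green
        out[i * 3 + 2] = (p and 0xFF) / 255f          // blue
    }
    return out
}
```

A white pixel becomes three 1.0 values and the letterbox gray (114, 114, 114) becomes roughly 0.447 per channel. If the values fed to the model are still in the 0 to 255 range, the normalization step has been skipped.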
