
Building an AI Asset Pipeline That Actually Works at Scale

March 20, 2026

I needed 55 unit sprites, 365 icons, and a full terrain tileset—all AI-generated, all consistent. Getting a single image right takes an afternoon. Getting a pipeline that produces 400 correct images overnight takes longer. Here's what I actually had to fix.

The Gemini image editing model takes a reference image and a prompt and produces a variation. This is exactly what I needed: take a vanilla game sprite, say "make this look like a wizard," get back a wizard. The problem is the reference image has to actually reach the model.

My first attempt passed the reference as a base64 data URI. This is a reasonable reading of the API docs. In practice, the model ignores base64 data URIs—the reference has no effect and the model generates something based on the text prompt alone. The fix is to upload the reference to fal-ai's CDN storage first and pass the hosted URL instead. With a CDN URL the model actually uses the reference. This is not documented prominently. I found it after generating forty images that all looked identical regardless of what reference I sent.
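A minimal sketch of the guard I ended up wanting at the top of the pipeline. The `upload` callable is whatever pushes the file to public storage and returns its URL (in my case fal's CDN upload; `fal_client.upload_file` is the assumed name for that role, passed in here so the logic stays testable). The point of the guard is that a data URI fails silently: the call succeeds and the reference is simply ignored.

```python
def hosted_reference(path: str, upload) -> str:
    """Return a reference URL the model will actually honor.

    `upload` is any callable that puts the file on public storage and
    returns its URL (e.g. fal's CDN upload; the exact client function
    name is an assumption). The explicit rejection of data URIs exists
    because that failure mode is silent otherwise.
    """
    url = upload(path)
    if url.startswith("data:"):
        raise ValueError("reference is a base64 data URI; the model ignores these")
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected a hosted URL, got: {url[:40]!r}")
    return url
```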

Once the reference was reaching the model, the next problem was that transparent pixels are undefined to the model—they render as black. If your sprite has dark areas and your background is also black, the model can't distinguish them. The fix is to flatten the reference to white before uploading. Then, because white backgrounds make foreground extraction unreliable later in the pipeline, I switched to magenta. Magenta doesn't appear in fantasy game sprites, so it's safe to chroma-key out afterward. The removal has to be a flood-fill from the four corners of the image, not a global color threshold—global removal at any useful fuzz level eats pixels from characters with cool or purple tones. Corner flood-fill is safe because characters don't reach the corners. Also required: "no drop shadows" in the prompt. Shadows composite over the magenta and produce a dark-purple fringe that can't be cleanly removed at any threshold.
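The magenta round-trip can be sketched as two small functions. This uses nested lists of RGBA tuples as a stand-in for real pixel buffers (the pipeline itself did the flatten and flood-fill with ImageMagick); the tolerance value is illustrative, not the one I tuned.

```python
from collections import deque

MAGENTA = (255, 0, 255, 255)

def flatten_to_magenta(img):
    """Composite transparent pixels over magenta before upload,
    so the model never sees undefined (black-rendering) pixels."""
    return [[MAGENTA if px[3] == 0 else px for px in row] for row in img]

def key_out_corners(img, tol=40):
    """Remove the background by flood-filling from the four corners.

    A global color threshold at any useful fuzz would also eat cool or
    purple pixels inside the character; corner seeds can't, because the
    sprite never reaches the corners."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    def near_magenta(px):
        return all(abs(a - b) <= tol for a, b in zip(px[:3], MAGENTA[:3]))

    seen = set()
    q = deque([(0, 0), (0, w - 1), (h - 1, 0), (h - 1, w - 1)])
    while q:
        y, x = q.popleft()
        if (y, x) in seen or not (0 <= y < h and 0 <= x < w):
            continue
        seen.add((y, x))
        if not near_magenta(out[y][x]):
            continue  # hit the character's edge; stop here
        out[y][x] = (0, 0, 0, 0)  # back to transparent
        q.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return out
```

Note that a purple character pixel survives even when it is numerically close to magenta, as long as it is sealed off from the corners by non-background pixels.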

Each Unciv unit has three PNG layers: a base sprite and two color tint masks showing where team colors appear. My first approach generated these in separate API calls. The results were inconsistent because the model never saw all three layers together. The fix was packing all three into a single horizontal 3-panel image and generating them in one call. Two things broke immediately. Using a 16:9 aspect ratio made the model produce vertical stacks instead of horizontal panels. And the tint masks came out as solid silhouettes—fully colored shapes—instead of the sparse overlay maps the game engine expects. The aspect ratio fix was switching to 1:1 with "horizontal layout" in the prompt. The mask fix was changing the prompt vocabulary from "color masks" to "sparse tint maps overlays." That specific phrase is what the model understands. Other phrasings produced solid fills every time.
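The packing itself is mechanical; the hard-won part was the prompt, not the code. A sketch of the pack/crop round-trip, again on nested-list pixel buffers standing in for real images:

```python
def pack_triptych(base, mask1, mask2):
    """Pack three same-size layers into one horizontal 3-panel image,
    so the model sees base sprite and both tint masks in a single call."""
    return [rb + r1 + r2 for rb, r1, r2 in zip(base, mask1, mask2)]

def split_triptych(panel):
    """Crop the generated 3-panel output back into the three layers
    the game engine expects as separate PNGs."""
    w = len(panel[0]) // 3
    base  = [row[:w]        for row in panel]
    mask1 = [row[w:2 * w]   for row in panel]
    mask2 = [row[2 * w:]    for row in panel]
    return base, mask1, mask2
```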

For icons, the efficient batch was a 3x3 grid: nine vanilla icons arranged in a grid, sent as one image, output cropped back into nine individuals. This reduced API calls by 9x across 365 icons in six categories. All icons needed transparent backgrounds because Unciv's rendering applies color tints to icons—an opaque white background turns into a solid-colored square in-game. I used BiRefNet, a foreground extraction model, as a dedicated pipeline stage after generation.
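The grid batching follows the same shape as the triptych, generalized to n-by-n. A sketch on nested-list buffers (the real cropping was an ImageMagick step); icons are laid out and recovered in row-major order:

```python
def pack_grid(icons, n=3):
    """Tile n*n same-size icons into one n-by-n sheet, row-major,
    so nine icons cost one API call instead of nine."""
    h = len(icons[0])
    rows = []
    for band in range(n):
        group = icons[band * n:(band + 1) * n]
        for y in range(h):
            rows.append([px for icon in group for px in icon[y]])
    return rows

def split_grid(sheet, n=3):
    """Crop an n-by-n sheet back into n*n individual icons."""
    h, w = len(sheet) // n, len(sheet[0]) // n
    return [[row[gx * w:(gx + 1) * w] for row in sheet[gy * h:(gy + 1) * h]]
            for gy in range(n) for gx in range(n)]
```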

BiRefNet has one failure mode I didn't expect: it sometimes returns a valid-looking PNG of exactly 334 bytes that is entirely transparent. Correct headers, correct format, no foreground. A naive does-the-file-exist-and-have-bytes check doesn't catch it. The check that works is a file size above 5000 bytes combined with an alpha-channel mean above 0.1. The mean check (magick "$f" -channel A -separate -format '%[fx:mean]' info:) detects images where BiRefNet found no foreground. A blank output has mean 0.0; a successful extraction is well above 0.1.
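The same validation can be sketched as one predicate. The decoded image is again a nested list of RGBA tuples standing in for a real pixel buffer, and the thresholds are the ones from the text:

```python
def looks_extracted(png_bytes, pixels, min_bytes=5000, min_alpha_mean=0.1):
    """True only if the extraction output is both big enough on disk
    AND actually contains a foreground (mirrors the magick
    %[fx:mean] alpha check). A 334-byte all-transparent PNG fails
    both halves."""
    if len(png_bytes) < min_bytes:
        return False
    alphas = [px[3] / 255 for row in pixels for px in row]
    return sum(alphas) / len(alphas) > min_alpha_mean
```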

Two other things broke at scale that were obvious in retrospect. Running six concurrent agents that all wrote BiRefNet output to /tmp/nobg.png caused race conditions: one agent's output overwrote another's before it could be processed. Always use mktemp. And checking for valid PNGs with grep "89 50 4e 47" against xxd output doesn't work, because xxd groups its hex dump two bytes at a time: the signature prints as 8950 4e47, so the spaced pattern never matches. Use file "$f" | grep -q PNG instead.
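Both fixes have direct stdlib equivalents if your glue code is Python rather than shell: mkstemp for per-call paths, and checking the actual 8-byte PNG signature instead of grepping hex output.

```python
import os
import tempfile

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the real 8-byte PNG signature

def unique_output_path(suffix=".png"):
    """Per-call temp path (the mktemp fix): six concurrent agents can
    never clobber each other the way a shared /tmp/nobg.png does."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    return path

def is_png(path):
    """Check the signature bytes directly, sidestepping the
    xxd-grouping trap entirely."""
    with open(path, "rb") as f:
        return f.read(8) == PNG_MAGIC
```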

The final pipeline isn't clever. It's the list of techniques that were actually required: CDN upload instead of base64, magenta background with corner flood-fill removal, "sparse tint maps" in the prompt, triptych batching, BiRefNet with alpha mean validation, mktemp for parallel temp files. Each one came from a specific failure the previous version couldn't handle. That's what a working pipeline looks like after you've built the one that didn't work.