We wanted to answer a very specific question: which of these LLMs is best at coding a game?
So we fired up Rosebud, grabbed a simple single-file game template, and challenged three powerful language models (Gemini 2.5 Pro, Grok, and DeepSeek) to build us a 3D open-world game.
Spoiler: the results were all over the place—from black screens to beautiful lighting, broken characters to mind-blowing visual effects.
Curious which model came out on top? Keep reading.
The Setup: Same Prompt, Same Template, 3 Different AIs
We used a premade asset-based game created in Rosebud AI, our game creator platform where you can build games just by describing them.
To make it fair for all the LLMs, we stripped it down to a single code file and asked each model to enhance it using this prompt:
“I want to create an awesome open world game with rolling hills, trees, and a daylight cycle. It has to have a character in third-person view, a mini map, and collectible diamonds. Make sure the final result is not simple. Return full code without breaking the main structure.”
Each LLM got the same prompt, same structure, and same rules. But their behavior? Wildly different.
Round One: First Attempts (A.K.A. The Black Screen Era)
- Gemini impressed early with almost 700 lines of code, a working minimap, trees, and a nice day-night cycle… but the game was nearly pitch black and movement was glitchy.
- Grok was slower to respond but gave us a visually rich world, although the character couldn’t move and diamonds vanished after being collected.
- DeepSeek delivered the smallest codebase (~250 lines), and unfortunately, a black screen. Repeatedly.
At this point, things weren’t looking great. Most versions were either broken, visually dull, or unplayable.
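For context on why “black screen” kept coming up: in a browser 3D scene (Rosebud games run on web tech; we’re assuming a Three.js-style setup here, not quoting any model’s actual output), a lit material renders pure black until the scene has a light and a correctly placed camera. A minimal sketch of the boilerplate every attempt had to get right:

```typescript
import * as THREE from 'three';

// Minimal sketch: the pieces a lit scene needs before anything shows up.
// Without a light, MeshStandardMaterial renders pure black -- the classic
// "black screen" (or "nearly pitch black") failure mode.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 1000);
camera.position.set(0, 5, 10);
camera.lookAt(0, 0, 0);

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(innerWidth, innerHeight);
document.body.appendChild(renderer.domElement);

// Both lights matter: a directional "sun" for direction and shadows,
// plus ambient light so the shadowed side isn't pitch black.
const sun = new THREE.DirectionalLight(0xffffff, 1.0);
sun.position.set(50, 100, 50);
scene.add(sun);
scene.add(new THREE.AmbientLight(0x404040, 0.6));

const ground = new THREE.Mesh(
  new THREE.PlaneGeometry(200, 200),
  new THREE.MeshStandardMaterial({ color: 0x44aa44 })
);
ground.rotation.x = -Math.PI / 2; // lay the plane flat
scene.add(ground);

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```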
But we kept iterating, refining prompts, and starting new chats—sometimes that helps LLMs produce better results.
Debugging Hell and Glimmers of Hope
Gemini was the most verbose model—consistently generating over 1,000 lines of code with vivid lighting and effects. Eventually, after several iterations, we got a version with:
- A working minimap
- A character that could run
- Proper lighting with shadows
- A smooth day-night cycle (sketched below)
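For the curious, a day-night cycle like this usually boils down to a few lines. A hypothetical sketch, again assuming a Three.js-style scene (the `sun` and `ambient` names mirror the snippet above, and the cycle length is an arbitrary choice, not Gemini’s):

```typescript
import * as THREE from 'three';

// Hypothetical day-night cycle: sweep a directional "sun" in a circle
// over the world and fade ambient light with the time of day.
const DAY_LENGTH_S = 120; // one full cycle every two minutes (arbitrary)

function updateDaylight(
  sun: THREE.DirectionalLight,
  ambient: THREE.AmbientLight,
  elapsedS: number
): void {
  const t = (elapsedS % DAY_LENGTH_S) / DAY_LENGTH_S; // 0..1 through the day
  const angle = t * Math.PI * 2;

  // The sun orbits the world; below the horizon (sin < 0) means night.
  sun.position.set(Math.cos(angle) * 100, Math.sin(angle) * 100, 30);
  const daylight = Math.max(0, Math.sin(angle)); // 0 at midnight, 1 at noon
  sun.intensity = daylight;
  ambient.intensity = 0.15 + 0.45 * daylight; // floor keeps night visible
}
```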
That said, movement bugs and inconsistent performance still plagued it.
Grok was surprisingly consistent. It produced solid, readable code and a playable world—but weirdly trippy glitches, like the terrain moving under your feet, made things unpredictable.
Still, Grok delivered the best collectible mechanics of the three.
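The core of a collectible mechanic is just a per-frame distance check. A hedged sketch of that loop with hypothetical names (`player`, `diamonds`, `score`), not Grok’s actual code:

```typescript
import * as THREE from 'three';

// Sketch of a collectible loop: each frame, grab any diamond within
// pickup range, remove it from the world, and bump the score.
const PICKUP_RADIUS = 1.5;
let score = 0;

function collectNearbyDiamonds(
  player: THREE.Object3D,
  diamonds: THREE.Mesh[],
  scene: THREE.Scene
): void {
  // Iterate backwards so splicing mid-loop doesn't skip entries.
  for (let i = diamonds.length - 1; i >= 0; i--) {
    if (player.position.distanceTo(diamonds[i].position) < PICKUP_RADIUS) {
      scene.remove(diamonds[i]); // remove from the world...
      diamonds.splice(i, 1);     // ...and from the live list
      score++;
    }
  }
}
```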
DeepSeek… let’s just say it tried.
Multiple versions, mostly black screens, but we eventually got a decent result where the player could collect diamonds and stay above ground. No minimap, no lighting magic—but at least it worked.
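“Stay above ground” sounds trivial, but it’s exactly where several generated versions failed. The usual fix is to sample the terrain height under the player each frame and clamp, sketched here with a hypothetical `heightAt` function (in practice it might reuse the same noise function that displaces the terrain mesh; this is not DeepSeek’s code):

```typescript
import * as THREE from 'three';

// Sketch of a ground clamp: never let the player's feet sink below
// the terrain surface directly beneath them.
function keepPlayerAboveGround(
  player: THREE.Object3D,
  heightAt: (x: number, z: number) => number, // hypothetical terrain sampler
  playerHalfHeight = 1.0
): void {
  const groundY = heightAt(player.position.x, player.position.z);
  if (player.position.y < groundY + playerHalfHeight) {
    player.position.y = groundY + playerHalfHeight;
  }
}
```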
The Winner: Gemini (Barely)
After hours of prompting and testing, Gemini edged out the others with the most complete-looking game. The final version had:
- Beautiful lighting and shadows
- A functioning minimap (see the sketch after this list)
- Collectible diamonds
- A running player character (with occasional quirks)
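If you’re wondering what a “functioning minimap” involves: one common technique (not necessarily what Gemini generated) is rendering the scene a second time from a top-down orthographic camera into a small corner viewport:

```typescript
import * as THREE from 'three';

// Sketch of a viewport-based minimap: a top-down orthographic camera
// rendered into a scissored square in the corner of the canvas.
const mapCamera = new THREE.OrthographicCamera(-50, 50, 50, -50, 0.1, 500);
mapCamera.up.set(0, 0, -1); // looking straight down, keep "north" up

function renderWithMinimap(
  renderer: THREE.WebGLRenderer,
  scene: THREE.Scene,
  mainCamera: THREE.Camera,
  player: THREE.Object3D
): void {
  // Main view fills the canvas.
  renderer.setScissorTest(false);
  renderer.setViewport(0, 0, innerWidth, innerHeight);
  renderer.render(scene, mainCamera);

  // Keep the top-down camera centered over the player.
  mapCamera.position.set(player.position.x, 200, player.position.z);
  mapCamera.lookAt(player.position.x, 0, player.position.z);

  // Scissor confines the second render (and its clear) to the corner.
  const size = 180; // minimap size in pixels
  renderer.setScissorTest(true);
  renderer.setScissor(innerWidth - size - 10, innerHeight - size - 10, size, size);
  renderer.setViewport(innerWidth - size - 10, innerHeight - size - 10, size, size);
  renderer.render(scene, mapCamera);
}
```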
But let’s be real—none of these models nailed it on the first try. We had to battle errors, rendering issues, and confusing limitations.
If your goal is to ship a game fast and not spend hours debugging vague LLM outputs…
Just Use Rosebud (Seriously)
All of this testing happened inside Rosebud AI, where even broken LLM code can be fixed quickly thanks to prompts, real-time previews, and clean code structure.
We love pushing AI tools to their limits—but if you want a smoother ride from idea to playable game, Rosebud is the best companion for your game dev journey.
Ready to test your own ideas? Remix a game in Rosebud, use your favorite LLM (or skip it entirely), and start building today.
Join our Discord to see what other creators are making, swap prompts, and share your weirdest AI bugs.