The LLM Battle Arena is an experiment in how large language models behave when placed in a structured reasoning environment.
Two models play chess against each other, generating moves through prompts while the system enforces legality and maintains game state.
LLM versus LLM
Each side is controlled by a language model that receives a structured description of the current position. The model responds with the move it wants to play.
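As a rough sketch of what that structured description might look like, here is a hypothetical prompt builder using python-chess for state. The exact fields and wording the project uses are not shown here, so treat the format as an illustration:

```python
# Hypothetical per-turn position description; the project's actual
# prompt format may differ. Assumes the python-chess library.
import chess

def describe_position(board: chess.Board) -> str:
    """Build a structured text description of the position for the model."""
    side = "White" if board.turn == chess.WHITE else "Black"
    lines = [
        f"You are playing {side}.",
        f"FEN: {board.fen()}",
        "Board:",
        str(board),  # ASCII diagram, ranks 8 down to 1
        f"Legal moves: {', '.join(board.san(m) for m in board.legal_moves)}",
        "Reply with a single move in SAN, e.g. Nf3.",
    ]
    return "\n".join(lines)
```

Including the legal-move list in the prompt is one design choice among several; it trades prompt length for fewer illegal replies.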
Watching two models play each other surfaces differences in:
- how aggressively they trade material
- how well they defend under pressure
- how quickly they collapse in lost positions
Legal move enforcement
LLMs are not chess engines. Left on their own, they will propose illegal moves or forget where pieces actually are.
Every move is checked by a chess engine. If a move is illegal, it is rejected and the model is asked again. The chess engine is the source of truth for the board.
This keeps the game coherent even when the model’s internal picture starts to drift.
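The validate-and-retry loop can be sketched as follows, using python-chess as the rules layer; `ask_model` is a hypothetical callback into the LLM, not part of any library:

```python
# Sketch of legal move enforcement with retries. Assumes python-chess;
# ask_model(board, feedback) is a hypothetical LLM call.
import chess

MAX_RETRIES = 3

def get_legal_move(board: chess.Board, ask_model) -> chess.Move:
    """Query the model, rejecting illegal moves until one parses and is legal."""
    feedback = ""
    for _ in range(MAX_RETRIES):
        reply = ask_model(board, feedback)
        try:
            # parse_san raises ValueError on unparsable or illegal input
            return board.parse_san(reply.strip())
        except ValueError:
            feedback = f"'{reply}' is not a legal move here. Try again."
    # Fall back to any legal move so the game can continue.
    return next(iter(board.legal_moves))
```

Because `board` never changes unless a move passes validation, the library (not the model's memory) remains the source of truth for the position.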
Prompt structure for strategy
The prompts do more than just ask for the next move.
Models are encouraged to:
- describe immediate threats
- consider captures and checks
- think about positional advantages
Good prompts do not turn an LLM into a chess engine, but they do change the shape of its play. With the right framing, models tend to blunder less often and play more human-looking moves.
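A hypothetical version of that strategic framing, appended to the position description, might look like this; the project's actual wording is not shown, so this is purely illustrative:

```python
# Illustrative strategy framing; the real prompt text may differ.
STRATEGY_PROMPT = """\
Before choosing a move:
1. List any immediate threats against your pieces.
2. Consider every capture and check available to you.
3. Note one positional factor (center control, king safety, piece activity).
Then state your chosen move in SAN on the final line."""

def build_prompt(position_text: str) -> str:
    """Combine the position description with the strategic framing."""
    return f"{position_text}\n\n{STRATEGY_PROMPT}"
```

The numbered checklist nudges the model to enumerate tactics before committing, which is the kind of framing the section above describes.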
Failure modes
The most interesting part of the project is where the models fail.
They struggle with:
- maintaining accurate board state over long games
- planning more than a few moves ahead
- avoiding repetition and going in circles
These weaknesses show up even more clearly when two models with different tendencies face each other.
What I learned
LLMs can play plausible chess when the environment does a lot of work for them: validating moves, maintaining state, and nudging them toward structured thinking.
Prompt structure has an outsized impact on the quality of play. Small changes in wording can move a model from random-feeling moves toward something that looks like a plan.
Most importantly, this reinforced a broader lesson: when you put LLMs into structured decision systems, you have to design the guardrails as carefully as the prompts. The interesting part is not the “wow” moment when the model finds a tactic. It is how the overall system behaves under pressure and over time.