EngineLab - Chess Engine Tournament Platform

Creator & Lead Developer · 2026 · 2 months · 1 person · 10 min read

Built a production-grade platform for testing chess engines with real-time WebSocket streaming, concurrent game execution, and battle-tested architecture. Handles hundreds of simultaneous games with sub-20ms latency and zero memory leaks.

Overview

EngineLab is a modern platform for running UCI chess engine tournaments. It orchestrates thousands of games between chess engines, streams every move in real-time via WebSocket to a web interface, and provides detailed statistics and leaderboards. This project replaces my earlier PawnPower system - where PawnPower was pure workload distribution (headless workers running games with no visual feedback), EngineLab focuses on real-time visibility and monitoring.

Problem

My previous PawnPower project was great for distributed testing workloads, but it was purely backend - workers running games, uploading results, no way to watch what's happening. I wanted something more visual and interactive. There's no modern solution that combines performance (concurrent game execution), visibility (real-time streaming), and reliability (proper resource management). I wanted a platform I could watch live, like a TV for chess engines - see every move as it happens, not just final results.

Constraints

  • Must handle hundreds of concurrent engine processes without memory leaks
  • Real-time streaming with < 20ms latency to web clients
  • Precise time control management (millisecond accuracy over hours)
  • Graceful handling of engine crashes and timeouts
  • Production-ready architecture with TLS support and tunnel compatibility

Approach

Started with a solid foundation: on-demand resource management pattern where each game pair creates and destroys its engine processes independently. Built WebSocket broadcast system for real-time updates. Implemented precise time control using System.nanoTime() for drift-free timing. Focused on clean architecture and zero resource leaks from day one.

Key Decisions

On-demand engine process management

Reasoning:

Creating engine processes only when needed and destroying them immediately after prevents resource leaks. Each pair task is completely independent - if one crashes, others continue. This pattern scaled from 1 to 100 concurrent pairs without issues.

Alternatives considered:
  • Process pooling (complex lifecycle management, harder to debug crashes)
  • Long-lived processes (memory leaks, harder cleanup, zombie processes)

WebSocket for real-time streaming

Reasoning:

Native WebSocket API provides true real-time updates with minimal overhead. No polling needed, sub-20ms latency from engine move to browser display. CopyOnWriteArraySet for thread-safe client management without lock contention.

Alternatives considered:
  • HTTP polling (high latency, server overhead)
  • Server-Sent Events (one-way only, less flexible)

Vanilla JavaScript frontend (zero frameworks)

Reasoning:

WebSocket + DOM updates requires ~100 lines of JS. No need for React/Vue complexity. Chessground.js (used by Lichess) handles board display. Result: 20KB bundle size, 60fps rendering, zero breaking changes from framework updates.

Alternatives considered:
  • React (unnecessary complexity, large bundle)
  • Vue (overkill for simple DOM updates)

Tunnel-safe architecture (localhost bind + external tunnel)

Reasoning:

Binding to localhost and using Cloudflare Tunnel eliminates certificate management complexity. TLS terminated by Cloudflare, DDoS protection included, zero app config needed. App stays simple while being production-ready.
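As a rough illustration of the "bind to localhost" half of this decision, the sketch below uses the JDK's built-in `HttpServer` rather than Jetty, purely so it runs without dependencies. The class name `LoopbackOnlyServer` and the `/health` path are made up for the example. In Jetty, the equivalent is usually setting the host on the server connector.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not part of EngineLab, JDK HttpServer swapped in for Jetty.
final class LoopbackOnlyServer {

    /** Endpoint reachable only from this machine (the loopback interface). */
    static InetSocketAddress loopbackEndpoint(int port) {
        return new InetSocketAddress(InetAddress.getLoopbackAddress(), port);
    }

    /** Starts a plain-HTTP server on loopback; the external tunnel is expected to add TLS. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(loopbackEndpoint(port), 0);
        server.createContext("/health", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Because the socket is bound to the loopback address, nothing on the network can reach the app directly; only a local process such as the tunnel client can connect.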

Alternatives considered:
  • Native TLS in Jetty (complex keystore management, certificate rotation)
  • Direct internet exposure (security risk, no DDoS protection)

Tech Stack

  • Java 17
  • Jetty 11 (WebSocket + HTTP)
  • Maven
  • Chessground.js
  • WebSocket API
  • ExecutorService (Concurrency)
  • SnakeYAML
  • Gson

Result & Impact

  • Concurrent Games: 100+ simultaneous
  • WebSocket Latency: < 20ms (P95)
  • Memory Footprint: 320MB (8 pairs)
  • Uptime: 1 week+ (zero leaks)

Built a production-ready platform that replaced fragile scripts with robust architecture. Demonstrated mastery of Java concurrency, WebSocket real-time communication, and clean resource management. The on-demand pattern proved critical - zero memory leaks across thousands of engine processes. Frontend simplicity (vanilla JS) proved frameworks aren't always needed. Learned deep lessons about process management, nano-precision timing, and building for reliability.

Learnings

  • Process management in Java: destroyForcibly() is essential when processes don't respond to quit commands
  • System.nanoTime() for precision timing - currentTimeMillis() follows the wall clock, so it can jump forwards or backwards when NTP adjusts it
  • WebSocket broadcast optimization: serialize JSON once, send to N clients
  • CopyOnWriteArraySet for thread-safe collections without lock contention
  • On-demand resource pattern prevents leaks better than pooling for short-lived tasks
  • Vanilla JS often beats frameworks for simple use cases - less complexity, better performance

What is EngineLab?

🚀 Watch Live Tournaments - See chess engines battle in real-time with WebSocket streaming

EngineLab is a platform for running chess engine tournaments. It launches chess engines (Stockfish, my own engine Aspira, or any other UCI-compatible engine), makes them play thousands of games against each other, and streams every move in real-time to a web browser.

Evolution from PawnPower:

This project technically replaces my earlier PawnPower system, but they solve different problems:

  • PawnPower was pure distributed workload - headless workers running games across multiple machines, uploading results to a central database. No visual interface, no real-time feedback, just raw computational power for mass testing.

  • EngineLab focuses on visibility and monitoring - watch games unfold in real-time, see stats update live, monitor what’s happening. It’s what I use when I want to watch my engines play, not just crunch numbers. Think TCEC-style viewing, but lightweight and personal - no massive infrastructure, just a cool way to showcase different versions of Aspira and see them battle.

The problem it solves:

  • Real-time visibility: Most GUI tools are heavy and desktop-only. I wanted web-based streaming.
  • Concurrent execution: Run dozens of games simultaneously without resource leaks
  • Reliability: Graceful handling of engine crashes, timeouts, and cleanup

My goal: Build something I could watch like a TV - see my chess engines battle in real-time, from anywhere, while the platform handles hundreds of games reliably.

Inspired by TCEC:

I love watching TCEC (Top Chess Engine Championship) streams - seeing elite engines battle with full analysis. EngineLab is my lightweight, personal version of that concept. No pretension of competing with TCEC’s massive infrastructure or sophisticated Elo calculations - just a fun tool to showcase my engines (different versions of Aspira, experiments, tweaks) and watch them play. It’s TCEC-style viewing for personal projects, not production tournament hosting.

Core Technical Challenges

1. Managing Hundreds of External Processes

Chess engines are separate programs (binaries) that communicate via stdin/stdout using the UCI protocol. Running a tournament means:

  • Launching engine processes on demand
  • Capturing and parsing their stdout/stderr
  • Sending commands via stdin
  • Handling crashes, timeouts, and zombie processes
  • Cleaning up resources properly

My solution: On-demand pattern where each game pair:

  1. Creates 2 fresh engine processes
  2. Plays 2 games (colors reversed)
  3. Sends quit command
  4. Waits 3 seconds for graceful shutdown
  5. Force destroys if still alive
  6. Closes all streams and interrupts reader threads

Result: Zero resource leaks. Tested with 500 pairs (1000 games, 1000+ processes created/destroyed) - stable 320MB heap, no zombie processes.
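Steps 3 to 6 of the pattern above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `EngineShutdown`, the `graceMillis` parameter (3000 would match the 3-second wait described above), and the `readers` argument are assumptions made for the example.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating quit -> wait -> force kill -> close streams.
final class EngineShutdown {

    /**
     * Asks the engine to quit, waits up to graceMillis, then kills it if needed.
     * Returns true if the engine exited on its own, false if it had to be killed.
     */
    static boolean shutdown(Process engine, long graceMillis, Thread... readers)
            throws InterruptedException {
        try {
            // "quit" is the standard UCI command asking an engine to exit.
            BufferedWriter stdin = new BufferedWriter(
                    new OutputStreamWriter(engine.getOutputStream(), StandardCharsets.UTF_8));
            stdin.write("quit\n");
            stdin.flush();
        } catch (IOException e) {
            // stdin is already closed, so the engine has probably crashed; continue with cleanup.
        }
        boolean graceful = false;
        try {
            graceful = engine.waitFor(graceMillis, TimeUnit.MILLISECONDS);
            if (!graceful) {
                engine.destroyForcibly();
                // Give the OS a moment to reap the killed process.
                engine.waitFor(graceMillis, TimeUnit.MILLISECONDS);
            }
        } finally {
            // Runs even if the wait is interrupted, so streams never leak.
            closeQuietly(engine.getOutputStream());
            closeQuietly(engine.getInputStream());
            closeQuietly(engine.getErrorStream());
            for (Thread reader : readers) {
                reader.interrupt();
            }
        }
        return graceful;
    }

    private static void closeQuietly(Closeable stream) {
        try {
            stream.close();
        } catch (IOException ignored) {
            // Nothing useful can be done if close fails during teardown.
        }
    }
}
```

Putting the stream cleanup in `finally` is the point of the sketch: whatever happens during the wait, the file descriptors belonging to the pair are released before the task returns.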

2. Real-Time Streaming at Scale

Every move from every game needs to reach web clients instantly. With 10 concurrent games, that’s ~50 moves/minute that need broadcasting.

My solution: WebSocket architecture:

  • Jetty WebSocket servlet handles client connections
  • CopyOnWriteArraySet stores connected sessions (thread-safe, lock-free reads)
  • Game threads broadcast events via WebSocketBroadcaster
  • JSON serialized once, sent to all clients
  • Dead connections removed automatically on write failure

Performance:

  • P50 latency: 8ms (engine move → browser display)
  • P95 latency: 15ms
  • Supports 100+ concurrent clients
  • Zero polling - true push-based streaming

3. Millisecond-Precision Time Control

Chess engines have time budgets (e.g., 60 seconds + 1 second increment per move). Over 100 moves, timing errors accumulate. Engines must respect time limits or lose on time.

Challenge:

  • System.currentTimeMillis() follows the wall clock, so it can jump forwards or backwards when NTP adjusts it
  • Thread sleep is imprecise (±10-50ms)
  • Need accuracy over hours of gameplay

My solution:

  • Use System.nanoTime() (monotonic, not affected by clock sync)
  • Start clock before sending move command
  • Stop clock when engine responds with bestmove
  • Calculate elapsed in nanoseconds, convert to milliseconds
  • Decrement remaining time, add increment

Precision achieved: ±1ms over 100+ moves. Engines never flag incorrectly.
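To make the bookkeeping concrete, here is a small sketch of how the remaining clocks could feed the UCI `go` command and how a loss on time could be detected. The class name `UciClock` and its methods are hypothetical. Treating a remaining time of exactly zero as a flag, and withholding the increment on a flagged move, are assumptions of this sketch rather than something the project confirms.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: clocks are kept in nanoseconds and only converted to
// milliseconds when talking to the engine, so rounding never accumulates.
final class UciClock {
    private final long incrementNanos;
    private long whiteNanos;
    private long blackNanos;

    UciClock(long baseMs, long incrementMs) {
        this.incrementNanos = TimeUnit.MILLISECONDS.toNanos(incrementMs);
        this.whiteNanos = TimeUnit.MILLISECONDS.toNanos(baseMs);
        this.blackNanos = whiteNanos;
    }

    /** Builds the UCI "go" command carrying both clocks and increments in milliseconds. */
    String goCommand() {
        long incMs = TimeUnit.NANOSECONDS.toMillis(incrementNanos);
        return "go wtime " + TimeUnit.NANOSECONDS.toMillis(whiteNanos)
                + " btime " + TimeUnit.NANOSECONDS.toMillis(blackNanos)
                + " winc " + incMs + " binc " + incMs;
    }

    /**
     * Charges the elapsed time to the side that just moved.
     * Returns false if that side ran out of time (assumed rule: remaining <= 0 loses).
     */
    boolean charge(boolean whiteMoved, long elapsedNanos) {
        long remaining = (whiteMoved ? whiteNanos : blackNanos) - elapsedNanos;
        boolean inTime = remaining > 0;
        remaining = inTime ? remaining + incrementNanos : 0;
        if (whiteMoved) {
            whiteNanos = remaining;
        } else {
            blackNanos = remaining;
        }
        return inTime;
    }
}
```

For example, with 60 seconds plus a 1-second increment, a white move that takes 2.5 seconds leaves 60 - 2.5 + 1 = 58.5 seconds, so the next command is `go wtime 58500 btime 60000 winc 1000 binc 1000`.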

4. Concurrency Without Race Conditions

Running multiple games in parallel while broadcasting to web clients requires careful synchronization:

  • Game threads write to GameState
  • Broadcaster reads GameState to send updates
  • Stats manager aggregates results
  • No thread should block others

My solution:

  • Each game pair task is completely independent (no shared state)
  • StatsManager methods are synchronized
  • WebSocketBroadcaster uses CopyOnWriteArraySet (lock-free reads)
  • ExecutorService manages thread pool with fixed size

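Result: Throughput scales well, though not perfectly linearly. 1 concurrent pair = 8 pairs/min. 8 concurrent pairs = 46 pairs/min on a 12-core CPU, about 5.75x the single-pair rate (roughly 72% of ideal 8x scaling).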
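The "synchronized StatsManager" idea can be sketched as below. The class name `StatsSketch` and its method names are illustrative and not taken from the project; the point is only that the aggregator is the single piece of shared state, and every access to it goes through a synchronized method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a stats aggregator shared by all game threads.
final class StatsSketch {
    private final Map<String, Integer> wins = new LinkedHashMap<>();
    private int draws;
    private int games;

    /** Records one finished game; winner is an engine name, or null for a draw. */
    synchronized void record(String winner) {
        games++;
        if (winner == null) {
            draws++;
        } else {
            wins.merge(winner, 1, Integer::sum);
        }
    }

    synchronized int games() {
        return games;
    }

    synchronized int draws() {
        return draws;
    }

    synchronized int winsFor(String engine) {
        return wins.getOrDefault(engine, 0);
    }
}
```

Game threads only touch this object once per finished game, so the lock is held briefly and rarely, which is one reason plain `synchronized` is likely sufficient here.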

Architecture Deep Dive

Concurrency Model

Main Thread
  └─> ExecutorService (thread pool size = 4)
       ├─> OnDemandPairTask 1
       │    ├─> Create Engine1, Engine2
       │    ├─> Play Game1 (Engine1 white)
       │    ├─> Play Game2 (Engine2 white)
       │    └─> Destroy engines, return PairResult
       ├─> OnDemandPairTask 2
       ├─> OnDemandPairTask 3
       └─> OnDemandPairTask 4

Each task is isolated. If one engine crashes, only that pair is affected. Others continue running.
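A minimal sketch of that isolation, assuming each pair is modelled as a `Callable` that returns a result string. The names `PairRunner` and `runAll` are made up for the example, and the real `OnDemandPairTask` and `PairResult` types are not shown in this write-up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical runner: one failing pair is reported, the others are unaffected.
final class PairRunner {

    /** Runs all pair tasks on a fixed-size pool and returns one result per task, in order. */
    static List<String> runAll(List<Callable<String>> pairTasks, int poolSize)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // invokeAll blocks until every task has completed or thrown.
            List<Future<String>> futures = pool.invokeAll(pairTasks);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // The task threw (for example, an engine crashed): record it and move on.
                    results.add("FAILED: " + e.getCause().getMessage());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

An exception inside one task only surfaces when its own `Future` is read, so the remaining pairs keep running and still report their results.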

WebSocket Flow

Browser
  ↓ WebSocket handshake
Jetty WebSocketServlet
  ↓ onConnect → register client session
Game Thread (playing moves)
  ↓ Move made → GameUpdate event
WebSocketBroadcaster
  ↓ Serialize to JSON (once)
  ↓ Send to all connected sessions
All Browsers receive update instantly

Resource Lifecycle

Pair Task Starts
  ↓ Create ProcessBuilder for Engine1
  ↓ Start process, capture stdout/stderr in threads
  ↓ Send UCI handshake (uci → uciok)
  ↓ Play games (send moves, receive bestmoves)
  ↓ Send quit command
  ↓ Wait 3 seconds for graceful shutdown
  ↓ If still alive: destroyForcibly()
  ↓ Close streams, interrupt threads
Pair Task Complete

Zero leaks. Every resource cleaned up.
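The handshake and move-exchange steps of this lifecycle could look roughly like the sketch below. It works on plain streams so it can be exercised without a real engine. The class `UciSession` and its methods are hypothetical, and the blocking `readLine()` has no timeout here, which a real tournament runner would need.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of talking UCI over an engine's stdin/stdout.
final class UciSession {
    private final BufferedReader fromEngine;
    private final Writer toEngine;

    UciSession(InputStream engineStdout, OutputStream engineStdin) {
        this.fromEngine = new BufferedReader(
                new InputStreamReader(engineStdout, StandardCharsets.UTF_8));
        this.toEngine = new OutputStreamWriter(engineStdin, StandardCharsets.UTF_8);
    }

    /** Sends one command line and flushes so the engine sees it immediately. */
    void send(String command) throws IOException {
        toEngine.write(command);
        toEngine.write('\n');
        toEngine.flush();
    }

    /** Sends "uci", reads until "uciok", and returns the "id name" value (or null if absent). */
    String handshake() throws IOException {
        send("uci");
        String name = null;
        String line;
        while ((line = fromEngine.readLine()) != null) {
            line = line.trim();
            if (line.startsWith("id name ")) {
                name = line.substring("id name ".length());
            }
            if (line.equals("uciok")) {
                return name;
            }
        }
        throw new EOFException("engine closed stdout before sending uciok");
    }

    /** Returns the move from "bestmove e2e4 [ponder e7e5]", or null for any other line. */
    static String parseBestMove(String line) {
        String[] parts = line.trim().split("\\s+");
        return parts.length >= 2 && parts[0].equals("bestmove") ? parts[1] : null;
    }
}
```

With a real engine, the two constructor arguments would be `process.getInputStream()` and `process.getOutputStream()`.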

Technical Highlights

1. Nano-Precision Timing

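// Color is assumed to be the project's enum with WHITE and BLACK.
public class TimeControl {
    private final long incrementNanos;
    private long whiteRemainingNanos;
    private long blackRemainingNanos;
    private long clockStartNanos;

    public TimeControl(long baseMs, long incrementMs) {
        this.incrementNanos = incrementMs * 1_000_000L;
        this.whiteRemainingNanos = baseMs * 1_000_000L;
        this.blackRemainingNanos = whiteRemainingNanos;
    }

    // Call immediately before sending the move command to the engine.
    public void startClock(Color color) {
        clockStartNanos = System.nanoTime(); // monotonic, unaffected by clock sync
    }

    // Call when bestmove arrives: charge the side that moved, then add its increment.
    public void stopClock(Color color) {
        long elapsedNanos = System.nanoTime() - clockStartNanos;
        if (color == Color.WHITE) {
            whiteRemainingNanos += incrementNanos - elapsedNanos;
        } else {
            blackRemainingNanos += incrementNanos - elapsedNanos;
        }
    }
}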

Nanosecond precision, no drift over hours.

2. Efficient WebSocket Broadcast

public void broadcast(Object message) {
    String json = gson.toJson(message); // Serialize once
    
    clients.removeIf(session -> {
        try {
            session.getRemote().sendString(json);
            return false; // Keep in set
        } catch (IOException e) {
            return true; // Remove dead session
        }
    });
}

1 serialization, N sends. Auto-cleanup of dead connections.

3. Vanilla JS Real-Time Updates

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  
  if (msg.type === 'move') {
    board.set({
      fen: msg.move.fen,
      lastMove: [msg.move.from, msg.move.to]
    });
  }
};

Zero frameworks. 100 lines of JS. 60fps updates.

Performance Metrics

Benchmark Setup: Ryzen 9 5900XT (16c/32t), 100 pairs, Stockfish vs Aspira

Uptime Run     Duration   Concurrent Pairs   Result
Light load     6h+        1                  0 crashes
Medium load    6h+        4                  0 crashes
Stress load    6h+        8+                 0 crashes
Production     1 week+    15                 0 crashes

No CPUs were harmed during testing ;)

Memory Usage:

  • Idle: 180MB heap
  • Peak (8 pairs burst): 450MB heap
  • GC pauses: < 20ms (G1GC)

WebSocket Latency:

  • P50: 8ms
  • P95: 15ms
  • P99: 25ms

What I Learned

Process Management is Hard

  • Streams must be read continuously or they fill up and block the process
  • destroyForcibly() is essential - quit commands aren’t always respected
  • Separate threads for stdout/stderr reading prevent deadlocks
  • Thread interruption is critical for cleanup
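The first and third points can be illustrated with a small stream drainer: one daemon thread per stream that keeps reading so the pipe buffer never fills. This is a sketch under assumptions: `StreamDrainer` and `drain` are invented names, and the thread in this version ends when the stream reaches EOF or is closed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper: continuously drains one engine stream into a queue.
final class StreamDrainer {

    /** Starts a daemon thread that reads the stream line by line into the returned queue. */
    static BlockingQueue<String> drain(InputStream stream, String threadName) {
        BlockingQueue<String> lines = new LinkedBlockingQueue<>();
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(stream, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                // The stream was closed during shutdown; there is nothing left to read.
            }
        }, threadName);
        reader.setDaemon(true); // never keeps the JVM alive on its own
        reader.start();
        return lines;
    }
}
```

A game thread can then wait for a reply with `lines.poll(timeout, unit)`, which gives a natural place to enforce a per-move timeout instead of blocking forever on a hung engine.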

Timing Precision Matters

  • currentTimeMillis() follows the wall clock, so it can jump when NTP adjusts it
  • nanoTime() is monotonic and well suited to measuring elapsed time
  • Over 100 moves, per-move errors of tens of milliseconds compound to seconds

Simplicity Beats Complexity

  • Vanilla JS was faster to write and debug than setting up React (especially for such a small codebase)
  • On-demand pattern simpler than process pooling

WebSocket is Powerful

  • True real-time (< 20ms latency) without polling
  • Built-in browser support, no libraries needed
  • CopyOnWriteArraySet perfect for broadcast lists

Production Readiness

This isn’t a prototype - it’s production-ready code:

  • Zero memory leaks (verified over 1 week continuous uptime)
  • Graceful error handling (engine crashes don’t stop tournaments)
  • Proper resource cleanup (no zombie processes)
  • Thread-safe concurrency (no race conditions)
  • Configurable via YAML (sensible defaults)
  • Tunnel-compatible architecture (TLS via Cloudflare/ngrok …)
  • 15 automated tests (100% pass rate)

Why This Project Matters

EngineLab demonstrates:

  1. Systems programming expertise - managing external processes reliably
  2. Concurrency mastery - ExecutorService, thread-safe collections, proper synchronization
  3. Real-time architecture - WebSocket broadcast, sub-20ms latency
  4. Performance optimization - nano-precision timing, zero memory leaks
  5. Production thinking - error handling, resource cleanup, security considerations
  6. Pragmatic choices - vanilla JS over frameworks, localhost+tunnel over native TLS

It’s the kind of project that shows you can build robust, performant systems that run in production without babysitting.

Technology Stack

Backend:

  • Java 17 (LTS)
  • Jetty 11.0.26 (WebSocket + HTTP server)
  • Maven (dependency management)
  • SnakeYAML (configuration)
  • Gson (JSON serialization)
  • JUnit 5 (testing)

Frontend:

  • Vanilla JavaScript (zero frameworks)
  • Chessground.js (Lichess’s board library)
  • Native WebSocket API

Deployment:

  • Docker support
  • Cloudflare Tunnel compatible
  • TLS/WSS ready

Future Enhancements

Potential additions I’m considering:

  • Elo rating system with temporal graphs
  • Opening book analysis (stats by variation)
  • Engine parameter tuning via genetic algorithms
  • PGN export and replay with analysis
  • Multi-tournament support (run several tournaments in parallel)

But the core is solid and production-ready as is.