Aspira - High-Performance Chess Engine
Built a UCI-compliant chess engine from scratch in Java, reaching 2200+ Elo on Lichess while processing 20M nodes/second. The current version is a complete rewrite that forced mastery of low-level optimization, bitboard manipulation, and algorithmic complexity. My goal is to cross the 3000 Elo milestone and compete with other top engines.
Overview
Aspira is a chess engine written entirely in Java from the ground up. What started as 'let's make something that plays legal moves' evolved into one of the most mentally demanding projects I've worked on. Chess engines have this special property: everything depends on everything else. One small mistake, one shortcut, one assumption that isn't 100% correct — and suddenly nothing makes sense anymore.
Problem
Chess engines are unforgiving. You don't just debug crashes — you debug ideas. A slightly wrong make/undo corrupts the position three plies later. One incorrect bit operation and evaluation becomes noise. The challenge wasn't just writing code that worked, it was writing code that worked correctly under extreme performance constraints, with no room for approximations.
Constraints
- Java performance overhead compared to C/C++ engines
- No chess libraries - everything built from scratch for deep understanding
- Memory allocation in hot loops significantly impacts performance
- Bitboard operations must be perfectly optimized
- Every component must be correct - no shortcuts allowed
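The allocation constraint, for instance, pushes toward reusable buffers. Here is a minimal sketch of that pattern (an illustrative class, not Aspira's actual implementation): move generation writes into a fixed array that is cleared and reused at every node, so the hot loop creates no garbage.

```java
public final class MoveList {
    // Pre-allocated, reusable move buffer: generation writes into a fixed
    // array instead of allocating a new collection at every search node,
    // keeping the hot loop free of garbage-collector pressure.
    // 256 slots comfortably cover any legal chess position's move count.
    private final int[] moves = new int[256];
    private int size;

    public void clear()       { size = 0; }          // reuse instead of reallocate
    public void add(int move) { moves[size++] = move; }
    public int get(int i)     { return moves[i]; }
    public int size()         { return size; }
}
```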
Approach
The current version is not 'v1 with patches' — it's a full rewrite with everything I learned the hard way baked in. I implemented the complete chess ruleset, then focused on performance through bitboards and magic bitboards for sliding pieces. The architecture prioritizes correctness first, then performance through careful optimization of hot paths.
Key Decisions
Bitboard-based representation using magic bitboards
Magic bitboards provide O(1) lookup for sliding piece moves, crucial for the 20+ million nodes per second target. The complexity of implementation is worth the performance gain in the search tree.
Alternatives considered:
- Mailbox representation (simpler but slower)
- Rotated bitboards (complex, similar performance)
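The core of the magic-bitboard lookup is a single multiply-and-shift that compresses the relevant occupancy bits into a dense table index. A minimal sketch of that indexing step, assuming the per-square mask, magic constant, shift, and attack tables have been precomputed (the names here are illustrative):

```java
public final class MagicLookup {
    // Magic-bitboard indexing: mask out the squares relevant to the
    // sliding piece, multiply by a precomputed "magic" constant, then
    // shift so the relevant bits collapse into a small dense index.
    // The caller then reads the attack set from a precomputed table,
    // e.g. long attacks = ROOK_ATTACKS[square][magicIndex(...)];
    public static int magicIndex(long occupancy, long mask, long magic, int shift) {
        return (int) (((occupancy & mask) * magic) >>> shift);
    }
}
```

Because the multiply and unsigned shift are constant-time, every sliding-piece move lookup is O(1) regardless of how many blockers are on the board.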
Full rewrite instead of patching initial version
The initial design had fundamental architectural issues that couldn't be fixed incrementally. Starting fresh with lessons learned resulted in cleaner, faster, more maintainable code.
Alternatives considered:
- Incremental refactoring (would have taken longer with worse results)
Mono-threaded design
Multi-threading in chess engines is significantly harder to implement correctly and requires proper hardware to pay off. Focusing on single-thread performance first establishes a solid baseline before adding concurrency complexity.
Alternatives considered:
- Lazy SMP (complex, would slow down initial development)
Hand-crafted evaluation (HCE) before NNUE
I want to master traditional evaluation and reach a high Elo with HCE before introducing neural-network complexity. This builds a deeper understanding of evaluation fundamentals.
Alternatives considered:
- Jump straight to NNUE (faster Elo gain but less educational)
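The shape of an HCE term is simple: material value plus a piece-square bonus. A toy sketch of a single knight term, with illustrative centipawn values and a synthetic centralization table (real PSQTs are hand-tuned per piece and game phase, and these numbers are not Aspira's):

```java
public final class HceSketch {
    // Illustrative material values in centipawns:
    // pawn, knight, bishop, rook, queen.
    static final int[] MATERIAL = {100, 320, 330, 500, 900};

    // Toy piece-square table for a knight: bonus on central squares,
    // penalty on the rim, computed from distance to the board center.
    static final int[] KNIGHT_PSQT = new int[64];
    static {
        for (int sq = 0; sq < 64; sq++) {
            int file = sq & 7, rank = sq >> 3;
            int centerDist = Math.max(Math.abs(2 * file - 7),
                                      Math.abs(2 * rank - 7)) / 2;
            KNIGHT_PSQT[sq] = 20 - 10 * centerDist;
        }
    }

    // Evaluation contribution of one knight: material + square bonus.
    public static int knightScore(int square) {
        return MATERIAL[1] + KNIGHT_PSQT[square];
    }
}
```

The full evaluation sums terms like this over every piece for both sides and returns the difference from the side to move's point of view.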
Tech Stack
- Java
- Bitboard manipulation
- Magic bitboards
- Zobrist hashing
- UCI protocol
- Alpha-beta pruning
- Quiescence search
- Transposition tables
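Zobrist hashing is what makes the transposition table cheap: every (piece, square) pair gets a fixed random 64-bit key, and because XOR is its own inverse, making a move updates the position hash with two XORs instead of rehashing the whole board. A minimal sketch (the seed and table layout are illustrative):

```java
import java.util.Random;

public final class ZobristSketch {
    // One random 64-bit key per (piece type, square):
    // 12 piece types (6 per side) x 64 squares. A fixed seed keeps the
    // keys reproducible across runs; the seed value is illustrative.
    static final long[][] PIECE_KEYS = new long[12][64];
    static {
        Random rng = new Random(0x5EEDL);
        for (long[] row : PIECE_KEYS)
            for (int sq = 0; sq < 64; sq++) row[sq] = rng.nextLong();
    }

    // Incremental update for a quiet move: XOR the piece out of its
    // origin square and into its destination. Captures, castling rights,
    // en passant, and side to move each add further XOR terms.
    public static long movePiece(long hash, int piece, int from, int to) {
        return hash ^ PIECE_KEYS[piece][from] ^ PIECE_KEYS[piece][to];
    }
}
```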
Result & Impact
- Performance: 20-22M nodes/sec (Ryzen 5 5500U)
- Elo Rating: 2100+ on Lichess
- Lines of Code: several full rewrites
This project fundamentally changed how I approach complex systems. It forced me to write correct code everywhere - there's no hiding in a chess engine. If one part is sloppy, the whole thing explodes. I spent nights debugging perft suites, tracking down single-bit errors that corrupted positions three moves later. The discipline required here translated to all my other work: careful design, proper testing, and deep understanding over quick hacks.
Learnings
- Performance comes from correctness, not clever tricks. Most gains came from fixing bugs and simplifying logic.
- Complex systems require understanding at every level. Abstractions that look clean can hide critical performance issues.
- Debugging conceptually wrong code is harder than debugging syntactically wrong code.
- Incremental complexity management: build a solid foundation before adding features.
- Profile and measure instead of guessing at optimizations.
The Journey
Aspira didn’t start as an attempt to build a strong engine, and it definitely didn’t stay simple for long.
What’s Implemented
The current baseline includes:
- Complete Chess Rules: Castling, en passant, promotion, repetition detection
- Move Generation: Bitboard-based with magic bitboards for sliding pieces
- Search Algorithm: Alpha-beta pruning in negamax variant with quiescence search
- Evaluation: Material evaluation + Piece Square Tables (PSQT)
- Optimizations:
- Transposition tables with Zobrist hashing
- Move ordering (History heuristic, MVV-LVA, TT move)
- Delta pruning in quiescence
- Mate distance pruning
- Null move pruning
- Iterative deepening
- Protocols: Full UCI support
- Testing: Perft testing suite for move generation correctness
- Time Management: Autonomous play based on remaining time
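The search core (alpha-beta pruning in a negamax frame) can be sketched over an abstract game tree. The Node type and its evaluation values below are toy stand-ins for Aspira's board, make/undo, and evaluator; in the real engine, depth 0 would fall through into quiescence search rather than a raw static eval:

```java
import java.util.List;

public final class SearchSketch {
    // Minimal game-tree node standing in for a chess position.
    static final class Node {
        final int eval;            // static eval from the side to move
        final List<Node> children; // legal moves; empty list = leaf
        Node(int eval, List<Node> children) {
            this.eval = eval;
            this.children = children;
        }
    }

    // Negamax: score(position) = -score(child), since a child's score is
    // from the opponent's point of view. Alpha-beta prunes subtrees the
    // opponent would never allow (alpha >= beta).
    public static int negamax(Node n, int depth, int alpha, int beta) {
        if (depth == 0 || n.children.isEmpty()) return n.eval;
        int best = Integer.MIN_VALUE + 1; // +1 so negation never overflows
        for (Node child : n.children) {
            int score = -negamax(child, depth - 1, -beta, -alpha);
            if (score > best) best = score;
            if (best > alpha) alpha = best;
            if (alpha >= beta) break; // beta cutoff: refutation found
        }
        return best;
    }
}
```

Move ordering (TT move first, then MVV-LVA captures, then history) exists purely to make that `break` fire as early as possible.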
Current Development
I’m implementing additional techniques to push Elo higher:
- LMR (Late Move Reductions) + PVS: Expected significant Elo gain
- Enhanced Evaluation: Passed pawns, king safety, pawn structure
- Optimizations:
- Converting to fully legal move generation
- Move packing (32-bit → 16-bit)
- Pre-allocated MoveList stack to eliminate runtime allocations
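The 32-bit to 16-bit move packing can be sketched as a bitfield: 6 bits for the origin square, 6 for the destination, and 4 for promotion/flags. This layout is a common convention and an assumption here, not necessarily Aspira's exact encoding:

```java
public final class PackedMove {
    // A chess move fits in 16 bits:
    //   bits 0-5   from square (0-63)
    //   bits 6-11  to square   (0-63)
    //   bits 12-15 promotion piece / special-move flags
    public static short pack(int from, int to, int flags) {
        return (short) (from | (to << 6) | (flags << 12));
    }
    public static int from(short move)  { return move & 0x3F; }
    public static int to(short move)    { return (move >>> 6) & 0x3F; }
    public static int flags(short move) { return (move >>> 12) & 0xF; }
}
```

Halving the move size shrinks transposition-table entries and move lists, which pays off mostly through better cache utilization.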
The NNUE Step
The next major milestone is NNUE (efficiently updatable neural network) evaluation. I've already run successful tests, but it requires:
- Mass data generation from current HCE
- Training on millions of positions (several hours of compute)
- Careful integration to maintain performance
With proper NNUE implementation and training, Aspira could reach the 3000+ Elo zone.
Performance Evolution
The journey to 20M+ nodes per second wasn’t one big optimization:
- March 2025 (Ryzen 7 7800X3D): ~15M nps with legal move generation
- December 2025 (Ryzen 5 5500U): ~13M nps, improved to 18M nps (perft semi-bulk)
- January 2026 (Ryzen 5 5500U): 20-22M nps (~30M nps on Ryzen 7 7800X3D)
Each improvement came from:
- Removing unnecessary allocations
- Rewriting slow paths
- Simplifying “clean” code that wasn’t fast
- Fixing correctness issues that had performance side-effects
Why This Was Hard
I've literally spent nights debugging positions to pass perft suites. Move generation seems simple, but every bug you create along the way eventually surfaces during perft testing.
You spend hours staring at code that looks correct, only to realize the bug is conceptually wrong, not syntactically wrong.
That’s what makes this project special. It forced me to think deeply about every decision, every data structure, every bit operation.
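The perft routine behind those debugging nights is deceptively short: it counts the leaf nodes of the legal-move tree to a fixed depth, and any mismatch against published reference counts pinpoints a generation bug. The Board interface here is a hypothetical stand-in for the engine's bitboard position:

```java
import java.util.List;

public final class PerftSketch {
    // Stand-in for the engine's position: generate legal moves, make
    // and undo them. In Aspira this is the bitboard board state.
    public interface Board {
        List<Integer> legalMoves();
        void make(int move);
        void undo(int move);
    }

    // perft(d): number of leaf nodes of the legal-move tree at depth d.
    // Comparing per-move subtotals ("divide") against reference values
    // isolates a bug down to a single square, flag, or edge case.
    public static long perft(Board b, int depth) {
        if (depth == 0) return 1;
        long nodes = 0;
        for (int move : b.legalMoves()) {
            b.make(move);
            nodes += perft(b, depth - 1);
            b.undo(move);
        }
        return nodes;
    }
}
```

The brutal part is that perft only tells you a count is wrong, not where: the bug is usually a conceptually wrong assumption about castling rights, en passant, or pins, not a typo.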
What’s Next
Continue development toward 3000+ Elo through:
- Complete LMR/PVS implementation
- Refine evaluation with advanced positional understanding
- Perfect the HCE baseline
- Implement and train NNUE
- Optimize memory layout and cache efficiency
The name comes from aspiring — not just to build something stronger, but to understand something deeply enough that it stops being mysterious. Somewhere along the way, it also started consuming my soul.
Contributing
Aspira is open source and welcomes contributions. Whether it’s performance improvements, evaluation tweaks, or bug fixes, I’m always open to discussions about engine design and chess programming.
Special thanks to the Stockfish Discord Community for the invaluable discussions, feedback, and shared knowledge about engine design and NNUE implementation.