A decade of perft: what's changed for CPUs (and GPUs) in 10 years
About ten years ago I posted Chess Move generation benchmark : from mobile phones to supercomputers — running my perft move-generator across every device I could get my hands on, from a Tegra 2 phone to a Tesla P100. It's been long enough that I figured it was time to redo the exercise on today's silicon, and add two more microbenchmarks while I was at it.
Two things have changed since 2016 that matter for this post:
- The CPUs themselves. Big core-count jumps, AVX-512 on the desktop, Apple Silicon turning up out of nowhere, Arm laptops with Snapdragon X — the desktop/laptop CPU landscape looks very different than it did when Skylake was the new thing.
- How the code was written. The 2016 perft code was hand-tuned by me. The 2026 perft numbers come from the same algorithmic core but reworked by AI coding agents - mostly around multi-threading, better branch layout, and tightening the inner loop. Perft is still scalar integer/bitwise code; it doesn't use SIMD on any of these CPUs. The matmul and memory-bandwidth code below is fully AI-written, and that code does use the wide ISAs (AVX-512 / NEON / AMX).
So this post is really two things at once: a snapshot of where commodity silicon is in 2026, and a small data point on what AI-assisted optimization buys you on a workload I've been staring at for over a decade.
The benchmarks
- Perft (Position 2 / Kiwipete): the same chess move-generation throughput benchmark as last time. Counts every legal move tree node at a given depth on the CPW Position 2:
r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
No transposition tables. Scalar integer code, no SIMD. Reported in nodes/second.
- SGEMM (matmul): square fp32 × fp32 → fp32 matrix multiply at N = 1024, 2048, 4096. Hand-written kernel (no MKL/BLAS), AI-generated, tuned per-ISA. Reported in GFLOPS.
- Memory bandwidth: sum of 256 million fp32s, straight streaming read. AI-generated code. Reported in achieved GB/s.
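For readers who haven't met perft before, here is a minimal sketch of the counting itself. It is not the benchmark's generator: tic-tac-toe stands in for chess purely so the move generator fits in a few lines, and every name in it (`has_won`, `legal_moves`, `perft`, `nodes_per_second`) is made up for this illustration. The shape is the same as the real thing, though: generate the legal moves, recurse, count the positions at the target depth, and cache nothing.

```python
import time

# Toy perft: count the positions reachable in exactly `depth` plies,
# with no transposition table. Tic-tac-toe is an illustrative stand-in
# for chess; this is NOT the move generator used for the numbers here.

WIN_LINES = (
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
)

def has_won(mask):
    """True if the 9-bit occupancy mask contains a completed line."""
    return any(all((mask >> sq) & 1 for sq in line) for line in WIN_LINES)

def legal_moves(own, opp):
    """Empty squares, or no moves at all once either side has won."""
    if has_won(own) or has_won(opp):
        return []
    occupied = own | opp
    return [sq for sq in range(9) if not (occupied >> sq) & 1]

def perft(own, opp, depth):
    """`own` is the side to move, `opp` the side that just moved."""
    if depth == 0:
        return 1
    total = 0
    for sq in legal_moves(own, opp):
        # Make the move, then swap sides for the recursive call.
        total += perft(opp, own | (1 << sq), depth - 1)
    return total

def nodes_per_second(depth):
    """Time one perft run from the empty board and report nodes/second."""
    start = time.perf_counter()
    nodes = perft(0, 0, depth)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return nodes / elapsed
```

From the empty board, `perft(0, 0, 1)` is 9 and `perft(0, 0, 2)` is 72. The chess version differs only in how expensive and branchy move generation is, which is exactly what the benchmark stresses.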
Perft on modern CPUs
| CPU | ISA | Single-thread (Mnps, d5) | Multi-thread (Mnps, d7) |
|---|---|---|---|
| AMD Ryzen 9 9950X3D | AVX-512 | 2,729 | 44,383 |
| Intel Core Ultra 7 270K | AVX2 | 2,598 | 50,246 |
| Apple M4 | NEON / AMX | 2,898 | 17,329 |
| Snapdragon X | NEON | 1,586 | 12,580 |
Single-thread at depth 5, multi-thread at depth 7. All Position 2, no hash. Code is scalar integer; the ISA column is just for reference — it doesn't affect perft.
A few things jump out:
- Apple M4 leads single-thread - by a hair over the 9950X3D, in a laptop. Six years ago this would have been unthinkable. Perft is branch- and memory-latency-bound, so the win is wide decode and a deep out-of-order window, not vector width.
- Intel's Core Ultra 7 270K wins multi-threaded mostly on raw core count. The 9950X3D's AVX-512 advantage doesn't apply here - neither chip uses SIMD for perft.
- Snapdragon X is the slowest of the four, but at ~1.6 Bnps single-thread it's still about 2× the fastest desktop CPU from 2016, in a laptop, partly thanks to the AI-assisted optimizations in the code.
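To put a rough number on the core-count point, the multi-thread column can be divided by the single-thread column. This is only a sketch: the two columns were measured at different depths (d7 vs d5), so the ratio is an indication rather than a clean parallel-scaling figure, and the helper names are made up for the example.

```python
# Perft throughput in Mnps, copied from the table above:
# (single-thread at depth 5, multi-thread at depth 7).
PERFT_MNPS = {
    "AMD Ryzen 9 9950X3D": (2729, 44383),
    "Intel Core Ultra 7 270K": (2598, 50246),
    "Apple M4": (2898, 17329),
    "Snapdragon X": (1586, 12580),
}

def mt_over_st(single, multi):
    """Ratio of multi-thread to single-thread throughput."""
    return multi / single

def scaling_table():
    """Ratio per CPU, rounded to one decimal place."""
    return {
        cpu: round(mt_over_st(single, multi), 1)
        for cpu, (single, multi) in PERFT_MNPS.items()
    }
```

On these numbers the ratios come out at about 16.3× for the 9950X3D, 19.3× for the 270K, 6.0× for the M4 and 7.9× for the Snapdragon X. The M4 has the fastest single thread but the smallest multiplier, which is why it drops to third place in the multi-thread column.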
Then vs now
Pulling the relevant rows from the 2016 post. All 2016 numbers are single-thread (the old code wasn't parallelized):
| Device | Year | Perft (Mnps) | Notes |
|---|---|---|---|
| LG Optimus 2X (Tegra 2) | 2011 | 17 | Phone |
| Snapdragon 410 | 2015 | 40 | Phone |
| Snapdragon 652 | 2016 | 211 | Phone |
| Intel Core i5-3210M | 2012 | 495 | Laptop |
| Intel Core i7-4790K | 2014 | 723 | Desktop, best in 2016 |
| Apple M4 (single-thread) | 2024 | 2,898 | Laptop |
| AMD 9950X3D (single-thread) | 2025 | 2,729 | Desktop |
| Intel 270K (multi-thread) | 2024 | 50,246 | Desktop, all cores |
Apples-to-apples (single-thread → single-thread): the best 2016 desktop CPU did 723 Mnps, and the fastest 2026 single-thread result is ~2,900 Mnps on the M4, a ~4× jump in ten years. Some of that is the AI-assisted rewrite (better threading, branch layout, inner-loop hygiene), some of it is just per-core µarch improvements; I'd put it at roughly 50/50.
Once you allow multi-threading, the 2026 Intel 270K reaches ~50 Bnps (~69× the 2016 best). About 3.6× of that is the single-thread gain (2,598 vs 723 Mnps); the rest is core count.
The phone story is even funnier. The Snapdragon 652 (best phone in the 2016 post) did 211 Mnps. A Snapdragon X laptop in 2026 does 1.59 Bnps single-threaded, which is on the same order as a high-end desktop from a decade ago.
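The ratios quoted in this section are plain divisions over the table values. The sketch below reproduces them and adds one reading of the "roughly 50/50" estimate. That reading is an assumption on my part: it treats the code and silicon contributions as multiplying together, so an even split gives each side the square root of the total. The function names are made up for the example.

```python
# Table values in Mnps.
BEST_2016_ST = 723      # Intel Core i7-4790K, single-thread
M4_ST = 2898            # Apple M4, single-thread
SNAPDRAGON_X_ST = 1586  # Snapdragon X, single-thread
INTEL_270K_MT = 50246   # Intel Core Ultra 7 270K, all threads

def speedup(new, old):
    """How many times faster `new` is than `old`."""
    return new / old

def even_split(total_speedup):
    """Per-factor speedup if two multiplicative contributions are equal.

    Assumption: code and silicon gains compose by multiplication, so an
    even split means each contributes sqrt(total).
    """
    return total_speedup ** 0.5

st_gain = speedup(M4_ST, BEST_2016_ST)          # about 4.0
mt_gain = speedup(INTEL_270K_MT, BEST_2016_ST)  # about 69.5
per_factor = even_split(st_gain)                # about 2.0
```

Under that assumption, a ~4× single-thread jump splits into roughly 2× from the rewrite and 2× from the silicon.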
SGEMM (fp32×fp32→fp32, GFLOPS)
| CPU | N=1024 | N=2048 | N=4096 | ISA |
|---|---|---|---|---|
| AMD Ryzen 9 5900X | 1,203 | 1,163 | 1,130 | AVX2 |
| AMD Ryzen 9 9950X3D | 2,998 | 3,628 | 3,032 | AVX-512 |
| Intel Core Ultra 7 270K | 1,777 | 1,893 | 1,911 | AVX2 |
| Apple M4 | 1,250 | 1,630 | 1,600 | NEON + AMX |
| Snapdragon X | 629 | 587 | 461 | NEON |
This is a hand-written matmul kernel — no BLAS, no MKL, no Accelerate framework — so it's measuring "what a competent SIMD implementation gets you on this ISA," not the absolute peak the hardware can do via vendor libraries.
- AVX-512 is the headline. The 9950X3D is roughly 2.5-3× the 5900X across the three sizes: wider FMA, doubled register file, larger L3, and two generations of front-end improvements (Zen 3 to Zen 5).
- Apple M4 gets ~1.6 TFLOPS from NEON + AMX. Not bad for a laptop chip, though it lags AVX-512 by ~2×.
- Snapdragon X tails the field at the 4k size — bandwidth-starved once the working set spills cache. Pure NEON without AMX-equivalent matrix instructions shows here.
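For anyone who wants to turn the GFLOPS figures back into wall-clock time, here is a small sketch of the conversion. It assumes the usual convention of counting 2·N³ floating-point operations for an N×N×N matmul (one multiply plus one add per inner-loop step); the post doesn't spell out the counting used, so treat that as an assumption. The function names are made up for the example.

```python
def sgemm_flops(n):
    """Operation count for an n x n x n matmul, assuming 2*n^3."""
    return 2 * n ** 3

def sgemm_gflops(n, seconds):
    """Achieved GFLOPS for one n x n x n multiply that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9

def sgemm_seconds(n, gflops):
    """Time for one multiply at a given GFLOPS rate (inverse of the above)."""
    return sgemm_flops(n) / (gflops * 1e9)
```

Under that convention, a single N=4096 multiply at the 9950X3D's 3,032 GFLOPS takes about 45 ms, and the same multiply at the Snapdragon X's 461 GFLOPS takes about 0.3 s.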
Memory bandwidth (streaming read)
Sum of 256 million floats, single-shot streaming. The "what's the bandwidth wall I'm actually going to hit on a real workload" number.
| CPU | Achieved GB/s |
|---|---|
| AMD Ryzen 9 5900X | 44.4 |
| AMD Ryzen 9 9950X3D | 77.7 |
| Intel Core Ultra 7 270K | 87.0 |
| Apple M4 | 112.6 |
| Snapdragon X | 124.0 |
The on-package LPDDR parts (Apple M4, Snapdragon X) clearly run away with this one. The desktop CPUs are capped by DDR5 channel count.
Worth calling out: Snapdragon X at 124 GB/s is more memory bandwidth than any mainstream desktop CPU in this list — in a laptop. That's the "memory is on-package" thing showing up.
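The GB/s figure is just bytes read divided by elapsed time. A minimal sketch of that arithmetic follows, assuming "256 million" means 256 × 10⁶ elements and that GB means 10⁹ bytes; neither is stated explicitly above, and the names are made up for the example.

```python
BYTES_PER_FP32 = 4
ELEMENTS = 256_000_000  # assumption: "256 million" read as 256 * 10**6

def streamed_bytes(count=ELEMENTS):
    """Bytes read when summing `count` fp32 values once."""
    return count * BYTES_PER_FP32

def achieved_gb_per_s(seconds, count=ELEMENTS):
    """Bandwidth in GB/s (10**9 bytes) for one pass over the array."""
    return streamed_bytes(count) / seconds / 1e9

def pass_seconds(gb_per_s, count=ELEMENTS):
    """Time for one pass at a given bandwidth (inverse of the above)."""
    return streamed_bytes(count) / (gb_per_s * 1e9)
```

With those assumptions the array is about 1.02 GB, so one pass takes roughly 8.3 ms at the Snapdragon X's 124 GB/s and roughly 23 ms at the 5900X's 44.4 GB/s.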
And the GPUs
For continuity with the 2016 post, here's where GPUs are today. Same Position 2, depth 7, no transposition tables (raw move-generation throughput). Numbers from perft_gpu_2026.
| GPU | Year | Perft (Bnps, Pos2 no-hash) |
|---|---|---|
| GTX 780 Ti | 2013 | 33.3 |
| GTX Titan X | 2015 | 66.0 |
| GTX 1080 | 2016 | 89.2 |
| Tesla P100 | 2016 | 118.7 |
| RTX 4090 | 2022 | ~1,308 |
| RTX PRO 6000 Blackwell | 2025 | ~2,110 |
The RTX PRO 6000 Blackwell does ~2.1 trillion nodes/sec on raw move generation — about 18× the Tesla P100 (the supercomputer-class GPU from the 2016 post) and ~42× the best 2026 CPU result (Intel 270K, all threads).
With transposition tables turned on, the same Blackwell card reaches ~619 trillion nps at depth 14 from the starting position. (See the repo README for the deeper-depth tables.)
What this told me
A few things I didn't expect when I started this:
- Apple Silicon caught up faster than I thought it would on integer/branchy code. Single-thread perft on the M4 beats the single-thread result of a 16-core x86 desktop chip. The conventional wisdom about "Apple is great at the easy stuff but x86 still wins on serious branchy workloads" is overdue for a rewrite.
- AI-assisted optimization is a real lever, even without changing the ISA. Perft is still pure scalar integer code on every CPU here — the rewrite didn't touch SIMD — and it still got something on the order of a 2× single-thread win over what I had in 2016, separate from the silicon improvements.
- Memory bandwidth went somewhere weird. The fastest memory-bandwidth machine in this set is an Arm laptop. Desktop DDR5 didn't keep up with on-package LPDDR.
- GPUs kept doing their thing. ~18× from P100 to RTX PRO 6000 Blackwell on raw perft is roughly in line with the long-term trend, which is its own quiet miracle, given that the AI-driven advances in recent GPUs (tensor cores) don't help branchy / integer workloads.
Code:
Old post: Chess Move generation benchmark : from mobile phones to supercomputers (2016)
