Monday, May 18, 2026

A decade of perft: what's changed for CPUs (and GPUs) in 10 years

About ten years ago I posted "Chess Move generation benchmark : from mobile phones to supercomputers", where I ran my perft move generator across every device I could get my hands on, from a Tegra 2 phone to a Tesla P100. It's been long enough that I figured it was time to redo the exercise on today's silicon, and add two more microbenchmarks while I was at it.

Two things have changed since 2016 that matter for this post:

  1. The CPUs themselves. Big core-count jumps, AVX-512 on the desktop, Apple Silicon turning up out of nowhere, Arm laptops with Snapdragon X — the desktop/laptop CPU landscape looks very different than it did when Skylake was the new thing.
  2. How the code was written. The 2016 perft code was hand-tuned by me. The 2026 perft numbers come from the same algorithmic core, reworked by AI coding agents: mostly multi-threading, better branch layout, and a tighter inner loop. Perft is still scalar integer/bitwise code; it doesn't use SIMD on any of these CPUs. The matmul and memory-bandwidth code below is fully AI-written, and it does use the wide ISAs (AVX-512 / NEON / AMX).

So this post is really two things at once: a snapshot of where commodity silicon is in 2026, and a small data point on what AI-assisted optimization buys you on a workload I've been staring at for over a decade.


The benchmarks

  • Perft (Position 2 / Kiwipete): the same chess move-generation throughput benchmark as last time. Counts the leaf nodes of the legal-move tree at a given depth from CPW Position 2: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -. No transposition tables. Scalar integer code, no SIMD. Reported in nodes/second.
  • SGEMM (matmul): square fp32 × fp32 → fp32 matrix multiply at N = 1024, 2048, 4096. Custom kernel (no MKL/BLAS), AI-generated and tuned per ISA. Reported in GFLOPS.
  • Memory bandwidth: sum of 256 million fp32 values, a straight streaming read. AI-generated code. Reported in achieved GB/s.
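For readers who haven't met perft before: it is a depth-limited walk of the legal-move tree that counts the nodes at the target depth. A full chess move generator doesn't fit in a few lines, so the sketch below uses tic-tac-toe on 9-bit bitboards as a stand-in. It is not the benchmark code; the names `perft`, `has_won`, and `WIN_LINES` are mine, and only the shape of the recursion (scalar bitwise move generation, no hash table) carries over.

```python
# Perft on tic-tac-toe bitboards: an illustrative stand-in for chess perft.
# Bit i of a board is square i (0..8); each side has its own 9-bit board.

FULL = 0b111111111
WIN_LINES = (
    0b000000111, 0b000111000, 0b111000000,  # rows
    0b001001001, 0b010010010, 0b100100100,  # columns
    0b100010001, 0b001010100,               # diagonals
)


def has_won(bits: int) -> bool:
    """True if `bits` contains any complete line."""
    return any(bits & line == line for line in WIN_LINES)


def perft(to_move: int, just_moved: int, depth: int) -> int:
    """Count the nodes exactly `depth` plies below this position."""
    if depth == 0:
        return 1          # reached the target depth: count this node
    if has_won(just_moved):
        return 0          # game already over: no legal moves from here
    nodes = 0
    empty = FULL & ~(to_move | just_moved)
    while empty:
        move = empty & -empty   # isolate the lowest set bit
        empty ^= move           # and clear it
        # Sides swap: the player who just placed `move` becomes `just_moved`.
        nodes += perft(just_moved, to_move | move, depth - 1)
    return nodes


if __name__ == "__main__":
    for d in range(1, 7):
        print(d, perft(0, 0, d))
```

From the empty board, depths 1 through 5 give 9, 72, 504, 3024, and 15120 (no game can end before the fifth ply); depth 6 is smaller than 15120 × 4 because lines where X has already won are cut off. A throughput number like the ones in this post is then just that node count divided by wall time.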



Perft on modern CPUs

CPU                       ISA          Single-thread (Mnps, d5)   Multi-thread (Mnps, d7)
AMD Ryzen 9 9950X3D       AVX-512      2,729                      44,383
Intel Core Ultra 7 270K   AVX2         2,598                      50,246
Apple M4                  NEON / AMX   2,898                      17,329
Snapdragon X              NEON         1,586                      12,580

Single-thread at depth 5, multi-thread at depth 7. All Position 2, no hash. Code is scalar integer; the ISA column is just for reference — it doesn't affect perft.
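To give a feel for how long a single-thread run lasts, here is a rough back-of-the-envelope sketch. It assumes the reported Mnps is the depth-5 node count divided by wall time, and it uses the reference perft(5) count for Position 2 from the Chess Programming Wiki (193,690,690). Both the helper name and that interpretation of the metric are my assumptions, not something stated in the benchmark code.

```python
# Rough run duration for the single-thread depth-5 numbers in the table.
# Assumption: Mnps = perft(5) node count / wall time, with perft(5) for
# Position 2 taken from the Chess Programming Wiki reference results.

KIWIPETE_PERFT_5 = 193_690_690

SINGLE_THREAD_MNPS = {
    "AMD Ryzen 9 9950X3D": 2729,
    "Intel Core Ultra 7 270K": 2598,
    "Apple M4": 2898,
    "Snapdragon X": 1586,
}


def run_ms(nodes: int, mnps: float) -> float:
    """Wall time in milliseconds to visit `nodes` at `mnps` million nodes/s."""
    return nodes / (mnps * 1e6) * 1e3


if __name__ == "__main__":
    for cpu, mnps in SINGLE_THREAD_MNPS.items():
        print(f"{cpu:26s} {run_ms(KIWIPETE_PERFT_5, mnps):6.1f} ms")
```

Under those assumptions a depth-5 run lasts somewhere between roughly 65 and 125 ms on these chips, which is short enough that timer resolution and warm-up matter.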

A few things jump out:

  • Apple M4 leads single-thread, about 6% ahead of the 9950X3D, in a laptop. Six years ago this would have been unthinkable. Perft is branch- and memory-latency-bound, so the win comes from wide decode and a deep out-of-order window, not vector width.
  • Intel's Core Ultra 7 270K wins multi-threaded, mostly on raw core count. The 9950X3D's AVX-512 advantage doesn't apply here, since neither chip uses SIMD for perft.
  • Snapdragon X is the slowest of the four, but at ~1.6 Bnps single-thread it's still about 2× the fastest desktop CPU from the 2016 post, in a laptop. Part of that gap comes from the AI-assisted optimizations in the code rather than the silicon.

Then vs now

Pulling the relevant rows from the 2016 post. All 2016 numbers are single-thread (the old code wasn't parallelized):

Device                        Year   Perft, ST (Mnps)   Notes
LG Optimus 2X (Tegra 2)       2011   17                 Phone
Snapdragon 410                2015   40                 Phone
Snapdragon 652                2016   211                Phone
Intel Core i5-3210M           2012   495                Laptop
Intel Core i7-4790K           2014   723                Desktop, best in 2016
Apple M4 (single-thread)      2024   2,898              Laptop
AMD 9950X3D (single-thread)   2025   2,729              Desktop
Intel 270K (multi-thread)     2024   50,246             Desktop, all cores

Apples-to-apples (single-thread → single-thread): the best desktop CPU in the 2016 post did 723 Mnps, and the fastest 2026 single-thread result is ~2,900 Mnps on the M4, a ~4× jump in about a decade. Some of that is the AI-assisted rewrite (branch layout, inner-loop hygiene; the threading work doesn't help a single-thread run), and some of it is per-core µarch improvement. I'd put it at roughly 50/50, i.e. about 2× from each.

Once you allow multi-threading, the 2026 Intel 270K reaches ~50 Bnps (~69× the 2016 best), but that's almost entirely core count.

The phone story is even funnier. The Snapdragon 652 (best phone in the 2016 post) did 211 Mnps. A Snapdragon X laptop in 2026 does 1.59 Bnps single-threaded, which is on the same order as a high-end desktop from a decade ago.
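The ratios quoted in this section are easy to re-derive from the table. The sketch below only uses numbers from this post; the helper name `speedup` and the even multiplicative split used to model "50/50" are my own reading of it.

```python
import math

# Perft throughput in Mnps, copied from the tables in this post.
I7_4790K_ST = 723          # best desktop in the 2016 post, single-thread
SNAPDRAGON_652_ST = 211    # best phone in the 2016 post, single-thread
M4_ST = 2898
SNAPDRAGON_X_ST = 1586
INTEL_270K_MT = 50246      # all cores


def speedup(new_mnps: float, old_mnps: float) -> float:
    """How many times faster `new_mnps` is than `old_mnps`."""
    return new_mnps / old_mnps


st_gain = speedup(M4_ST, I7_4790K_ST)                 # roughly 4x
mt_gain = speedup(INTEL_270K_MT, I7_4790K_ST)         # roughly 69x
phone_gain = speedup(SNAPDRAGON_X_ST, SNAPDRAGON_652_ST)

# Assumption: "50/50" means software and silicon each contribute the same
# multiplicative factor, so each is the square root of the total gain.
per_factor = math.sqrt(st_gain)

if __name__ == "__main__":
    print(f"single-thread gain: {st_gain:.2f}x")
    print(f"multi-thread gain:  {mt_gain:.1f}x")
    print(f"phone -> Arm laptop: {phone_gain:.1f}x")
    print(f"even split of ST gain: {per_factor:.2f}x each")
```

Speedups compose multiplicatively, so an even split of a 4× gain is about 2× from the rewrite and 2× from the hardware, not 1.5× plus 1.5×.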


SGEMM (fp32×fp32→fp32, GFLOPS)

CPU                       N=1024   N=2048   N=4096   ISA
AMD Ryzen 9 5900X         1,203    1,163    1,130    AVX2
AMD Ryzen 9 9950X3D       2,998    3,628    3,032    AVX-512
Intel Core Ultra 7 270K   1,777    1,893    1,911    AVX2
Apple M4                  1,250    1,630    1,600    NEON + AMX
Snapdragon X              629      587      461      NEON

This is a from-scratch matmul kernel (no BLAS, no MKL, no Accelerate framework), so it's measuring "what a competent SIMD implementation gets you on this ISA," not the absolute peak the hardware can reach via vendor libraries.

  • AVX-512 is the headline. The 9950X3D delivers roughly 2.5–3× the 5900X: wider FMA, doubled register file, larger L3, and two generations of front-end improvements (Zen 3 to Zen 5).
  • Apple M4 gets ~1.6 TFLOPS from NEON + AMX. Not bad for a laptop chip, though it lags AVX-512 by ~2×.
  • Snapdragon X trails the field at every size and falls off further at N=4096, where it looks bandwidth-starved once the working set spills out of cache. Pure NEON without AMX-equivalent matrix instructions shows here.
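For context on the GFLOPS figures, this is a minimal sketch of how such a number is usually computed. It assumes the common convention of 2·N³ floating-point operations for an N×N matmul (one multiply and one add per inner-loop step); the post doesn't state which convention the benchmark uses. The pure-Python loop is only there to show the measurement and is nowhere near the tuned SIMD kernels.

```python
import time


def sgemm_flops(n: int) -> int:
    """Floating-point ops for an n x n matmul under the 2*n^3 convention."""
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Achieved GFLOPS for one n x n matmul that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9


def seconds_for(n: int, rate_gflops: float) -> float:
    """Time one n x n matmul takes at a sustained `rate_gflops`."""
    return sgemm_flops(n) / (rate_gflops * 1e9)


def naive_matmul(a, b, n):
    """Reference i-k-j triple loop on lists of lists (illustration only)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    n = 64
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    t0 = time.perf_counter()
    naive_matmul(a, b, n)
    dt = time.perf_counter() - t0
    print(f"naive Python, N={n}: {gflops(n, dt):.4f} GFLOPS")
    # Implied time per N=4096 multiply at the 9950X3D's table rate.
    print(f"N=4096 at 3,032 GFLOPS: {seconds_for(4096, 3032) * 1e3:.1f} ms")
```

Under that convention an N=4096 multiply is about 137 GFLOP of work, so the 9950X3D's table entry corresponds to roughly 45 ms per multiply.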

Memory bandwidth (streaming read)

Sum of 256 million floats, single-shot streaming. The "what's the bandwidth wall I'm actually going to hit on a real workload" number.

CPU                       Achieved GB/s
AMD Ryzen 9 5900X         44.4
AMD Ryzen 9 9950X3D       77.7
Intel Core Ultra 7 270K   87.0
Apple M4                  112.6
Snapdragon X              124.0

The on-package LPDDR parts (Apple M4, Snapdragon X) clearly run away with this one. The desktop CPUs are capped by their dual-channel DDR interfaces (DDR4 on the 5900X, DDR5 on the other two).

Worth calling out: Snapdragon X at 124 GB/s is more memory bandwidth than any mainstream desktop CPU in this list — in a laptop. That's the "memory is on-package" thing showing up.
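As a sanity check on the units, here is a small sketch of how an achieved-GB/s figure falls out of a streaming sum. It assumes "256 million" means 256 × 10⁶ elements (it could equally be 2²⁸) and that GB means 10⁹ bytes; neither is stated in the post. The timed loop uses a small `array('f')` and Python's `sum`, which measures interpreter overhead rather than DRAM bandwidth, so treat it purely as an illustration of the arithmetic.

```python
import time
from array import array

BYTES_PER_F32 = 4
BENCH_COUNT = 256_000_000  # assumption: "256 million" = 256e6 elements


def gb_per_s(count: int, seconds: float) -> float:
    """Achieved read bandwidth in GB/s (1 GB = 1e9 bytes) for one pass."""
    return count * BYTES_PER_F32 / seconds / 1e9


def pass_ms(count: int, rate_gb_s: float) -> float:
    """Milliseconds one pass over `count` fp32 values takes at `rate_gb_s`."""
    return count * BYTES_PER_F32 / (rate_gb_s * 1e9) * 1e3


if __name__ == "__main__":
    # Tiny zero-filled demo buffer, not the 1 GB benchmark array.
    count = 1_000_000
    data = array("f", bytes(BYTES_PER_F32 * count))
    t0 = time.perf_counter()
    total = sum(data)
    dt = time.perf_counter() - t0
    print(f"Python sum over {count} floats: {gb_per_s(count, dt):.3f} GB/s")

    for cpu, rate in (("AMD Ryzen 9 5900X", 44.4), ("Snapdragon X", 124.0)):
        print(f"{cpu:20s} one pass ~ {pass_ms(BENCH_COUNT, rate):.1f} ms")
```

With those assumptions the array is about 1 GB, so a full pass takes roughly 8 ms on the Snapdragon X and roughly 23 ms on the 5900X.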


And the GPUs

For continuity with the 2016 post, here's where GPUs are today. Same Position 2, depth 7, no transposition tables (raw move-generation throughput). Numbers from perft_gpu_2026.

GPU                      Year   Perft (Bnps, Pos2 no-hash)
GTX 780 Ti               2013   33.3
GTX Titan X              2015   66.0
GTX 1080                 2016   89.2
Tesla P100               2016   118.7
RTX 4090                 2022   ~1,308
RTX PRO 6000 Blackwell   2025   ~2,110

The RTX PRO 6000 Blackwell does ~2.1 trillion nodes/sec on raw move generation — about 18× the Tesla P100 (the supercomputer-class GPU from the 2016 post) and ~42× the best 2026 CPU result (Intel 270K, all threads).
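The ratios above, plus the annual growth rate they imply, can be checked with a few lines. The numbers are copied from the tables in this post; the nine-year span (2016 to 2025) comes from the Year column, and treating the growth as a constant compound rate is my simplification.

```python
import math

# Perft throughput in Bnps (Position 2, no hash), from the tables in this post.
P100 = 118.7            # Tesla P100, 2016
BLACKWELL = 2110.0      # RTX PRO 6000 Blackwell, 2025
INTEL_270K_MT = 50.246  # best 2026 CPU result, all threads

YEARS = 2025 - 2016

gpu_ratio = BLACKWELL / P100            # P100 -> Blackwell
gpu_vs_cpu = BLACKWELL / INTEL_270K_MT  # Blackwell vs best CPU

# Simplifying assumption: constant compound growth over the whole span.
annual_growth = gpu_ratio ** (1 / YEARS)
doubling_years = math.log(2) / math.log(annual_growth)

if __name__ == "__main__":
    print(f"Blackwell / P100:     {gpu_ratio:.1f}x")
    print(f"Blackwell / 270K MT:  {gpu_vs_cpu:.1f}x")
    print(f"implied growth:       {(annual_growth - 1) * 100:.0f}% per year")
    print(f"implied doubling:     every {doubling_years:.1f} years")
```

Read that way, ~18× in nine years works out to a bit under 40% per year, or a doubling roughly every two years.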

With transposition tables turned on, the same Blackwell card reaches ~619 trillion nps at depth 14 from the starting position. (See the repo README for the deeper-depth tables.)


What this told me

A few things I didn't expect when I started this:

  1. Apple Silicon caught up faster than I thought it would on integer/branchy code. Single-thread perft on the M4 beats a 16-core x86 desktop chip. The conventional wisdom about "Apple is great at the easy stuff but x86 still wins on serious branchy workloads" is overdue for a rewrite.
  2. AI-assisted optimization is a real lever, even without changing the ISA. Perft is still pure scalar integer code on every CPU here — the rewrite didn't touch SIMD — and it still got something on the order of a 2× single-thread win over what I had in 2016, separate from the silicon improvements.
  3. Memory bandwidth went somewhere weird. The fastest memory-bandwidth machine in this set is an Arm laptop. Desktop DDR5 didn't keep up with on-package LPDDR.
  4. GPUs kept doing their thing. ~18× from P100 to RTX PRO 6000 Blackwell on raw perft is roughly in line with the long-term trend, which is its own quiet miracle, given that the AI-driven advances in recent GPUs (tensor cores) don't help branchy / integer workloads.



Code:

Old post: Chess Move generation benchmark : from mobile phones to supercomputers (2016)