random programming / benchmarking , etc: 2026-05-17

About ten years ago I posted "Chess Move generation benchmark : from mobile phones to supercomputers", running my perft move generator across every device I could get my hands on, from a Tegra 2 phone to a Tesla P100. It's been long enough that I figured it was time to redo the exercise on today's silicon, and add two more microbenchmarks while I was at it.
Two things have changed since 2016 that matter for this post:
The CPUs themselves. Big core-count jumps, AVX-512 on the desktop, Apple Silicon turning up out of nowhere, Arm laptops with Snapdragon X — the desktop/laptop CPU landscape looks very different than it did when Skylake was the new thing.
How the code was written. The 2016 perft code was hand-tuned by me. The 2026 perft numbers come from the same algorithmic core, reworked by AI coding agents - mostly around multi-threading, better branch layout, and tightening the inner loop. Perft is still scalar integer/bitwise code; it doesn't use SIMD on any of these CPUs. The matmul and memory-bandwidth code below is fully AI-written, and that code does use the wide ISAs (AVX-512 / NEON / AMX).
So this post is really two things at once: a snapshot of where commodity silicon is in 2026, and a small data point on what AI-assisted optimization buys you on a workload I've been staring at for over a decade.

The benchmarks
Perft (Position 2 / Kiwipete) - the same chess move-generation throughput benchmark as last time. Counts the nodes of the legal-move tree at a given depth, starting from Chess Programming Wiki (CPW) Position 2: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -. No transposition tables. Scalar integer code, no SIMD. Reported in nodes/second.
SGEMM (matmul) - square fp32 × fp32 → fp32 matrix multiply at N = 1024, 2048, 4096. Custom kernel written from scratch (no MKL/BLAS), AI-generated, tuned per ISA. Reported in GFLOPS.
Memory bandwidth - sum of 256 million fp32s, straight streaming read. AI-generated code. Reported in achieved GB/s.
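For readers who haven't met perft before, here is a minimal sketch of the idea. It is not the benchmark code: a full chess move generator won't fit in a snippet, so tic-tac-toe stands in as the game, and every name here (`has_won`, `legal_moves`, `perft`, `nodes_per_second`) is made up for illustration. The shape is the same, though: make a move, recurse, unmake, count the nodes at the target depth, and divide by wall time.

```python
import time

# All eight winning lines on a 3x3 board, cells indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def has_won(board, player):
    """True if `player` occupies any complete line."""
    return any(all(board[i] == player for i in line) for line in LINES)


def legal_moves(board, to_move):
    """Empty cells, or no moves at all if the previous mover already won."""
    previous = "O" if to_move == "X" else "X"
    if has_won(board, previous):
        return []
    return [i for i, cell in enumerate(board) if cell == "."]


def perft(board, to_move, depth):
    """Count the nodes exactly `depth` plies below this position."""
    if depth == 0:
        return 1
    other = "O" if to_move == "X" else "X"
    nodes = 0
    for move in legal_moves(board, to_move):
        board[move] = to_move                    # make the move
        nodes += perft(board, other, depth - 1)
        board[move] = "."                        # unmake it
    return nodes


def nodes_per_second(depth):
    """Run one perft from the empty board; return (nodes, nodes/second)."""
    start = time.perf_counter()
    nodes = perft(["."] * 9, "X", depth)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return nodes, nodes / elapsed


if __name__ == "__main__":
    for d in range(1, 7):
        nodes, nps = nodes_per_second(d)
        print(f"depth {d}: {nodes} nodes, {nps:,.0f} nodes/s")
```

The first five depths come out as 9, 72, 504, 3024 and 15120, since no game can end before the fifth ply. A chess perft has the same structure with a much more expensive `legal_moves`.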

Perft on modern CPUs
CPU	ISA	Single-thread (Mnps, d5)	Multi-thread (Mnps, d7)
AMD Ryzen 9 9950X3D	AVX-512	2,729	44,383
Intel Core Ultra 7 270K	AVX2	2,598	50,246
Apple M4	NEON / AMX	2,898	17,329
Snapdragon X	NEON	1,586	12,580
Single-thread at depth 5, multi-thread at depth 7. All Position 2, no hash. Code is scalar integer; the ISA column is just for reference — it doesn't affect perft.
A few things jump out:
Apple M4 leads single-thread - by a hair over the 9950X3D, in a laptop. Six years ago this would have been unthinkable. Perft is branch- and memory-latency-bound, so the win is wide decode and a deep out-of-order window, not vector width.
Intel's Core Ultra 7 270K wins multi-threaded mostly on raw core count. The 9950X3D's AVX-512 advantage doesn't apply here - neither chip uses SIMD for perft.
Snapdragon X is the slowest of the four, but at ~1.6 Bnps single-thread it's still about 2× the fastest desktop CPU from the 2016 post, in a laptop, partly thanks to the AI-assisted optimizations in the code.
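To put a rough number on the "core count" point, the multi-thread column can be divided by the single-thread column. This is only a sketch: the figures are copied from the table above, the dictionary and function names are mine, and the two columns were measured at different depths (d5 vs d7), so the ratio is an indication of scaling rather than a clean parallel-efficiency figure.

```python
# (single-thread Mnps at depth 5, multi-thread Mnps at depth 7), from the table.
PERFT_MNPS = {
    "AMD Ryzen 9 9950X3D": (2729, 44383),
    "Intel Core Ultra 7 270K": (2598, 50246),
    "Apple M4": (2898, 17329),
    "Snapdragon X": (1586, 12580),
}


def mt_over_st(results):
    """Multi-thread throughput as a multiple of single-thread throughput."""
    return {cpu: mt / st for cpu, (st, mt) in results.items()}


if __name__ == "__main__":
    for cpu, ratio in mt_over_st(PERFT_MNPS).items():
        print(f"{cpu:26s} {ratio:5.1f}x")
```

The ratios work out to roughly 16× for the 9950X3D, 19× for the 270K, 6× for the M4 and 8× for the Snapdragon X.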

Then vs now
Pulling the relevant rows from the 2016 post. All 2016 numbers are single-thread (the old code wasn't parallelized):
Device	Year	Perft, ST (Mnps)	Notes
LG Optimus 2X (Tegra 2)	2011	17	Phone
Snapdragon 410	2015	40	Phone
Snapdragon 652	2016	211	Phone
Intel Core i5-3210M	2012	495	Laptop
Intel Core i7-4790K	2014	723	Desktop, best in 2016
Apple M4 (single-thread)	2024	2,898	Laptop
AMD 9950X3D (single-thread)	2025	2,729	Desktop
Intel 270K (multi-thread)	2024	50,246	Desktop, all cores
Apples-to-apples (single-thread → single-thread): the best desktop CPU in the 2016 post did 723 Mnps, and the fastest 2026 single-thread result is ~2,900 Mnps on the M4, a ~4× jump in ten years. Some of that is the AI-assisted rewrite (better threading, branch layout, inner-loop hygiene), some of it is just per-core µarch improvements; I'd put it at roughly 50/50, i.e. about 2× from each.
Once you allow multi-threading, the 2026 Intel 270K reaches ~50 Bnps (~69× the 2016 best), but that's almost entirely core count.
The phone story is even funnier. The Snapdragon 652 (best phone in the 2016 post) did 211 Mnps. A Snapdragon X laptop in 2026 does 1.59 Bnps single-threaded, which is about 7.5× that phone and roughly 2× a high-end desktop from a decade ago.
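The ratios in this section are simple divisions over the table, so here they are spelled out. One assumption is mine: I read the "50/50" split multiplicatively, so if software and hardware contribute equally to a total gain G, each contributes a factor of sqrt(G). The constant names are just labels for the table rows.

```python
import math

# Perft throughput in Mnps, copied from the table above.
BEST_2016_ST = 723        # Intel Core i7-4790K, single-thread
M4_ST = 2898              # Apple M4, single-thread
INTEL_270K_MT = 50246     # Intel 270K, all cores
SD652_ST = 211            # Snapdragon 652, single-thread
SDX_ST = 1586             # Snapdragon X, single-thread

st_gain = M4_ST / BEST_2016_ST           # single-thread vs single-thread
mt_gain = INTEL_270K_MT / BEST_2016_ST   # all cores vs 2016 single-thread
phone_gain = SDX_ST / SD652_ST           # Snapdragon X vs Snapdragon 652

# Assumption: an even split means equal multiplicative factors.
per_factor = math.sqrt(st_gain)

if __name__ == "__main__":
    print(f"single-thread gain: {st_gain:.2f}x")
    print(f"multi-thread gain:  {mt_gain:.1f}x")
    print(f"Snapdragon gain:    {phone_gain:.2f}x")
    print(f"even split:         {per_factor:.2f}x software, {per_factor:.2f}x hardware")
```

Under that reading, the ~4× single-thread gain splits into about 2× from the rewrite and about 2× from the silicon.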

SGEMM (fp32×fp32→fp32, GFLOPS)
CPU	N=1024	N=2048	N=4096	ISA
AMD Ryzen 9 5900X	1,203	1,163	1,130	AVX2
AMD Ryzen 9 9950X3D	2,998	3,628	3,032	AVX-512
Intel Core Ultra 7 270K	1,777	1,893	1,911	AVX2
Apple M4	1,250	1,630	1,600	NEON + AMX
Snapdragon X	629	587	461	NEON
This is a from-scratch matmul kernel (no BLAS, no MKL, no Accelerate framework), so it's measuring "what a competent SIMD implementation gets you on this ISA," not the absolute peak the hardware can do via vendor libraries.
AVX-512 is the headline. The 9950X3D delivers roughly 2.5-3× the 5900X's throughput, from wider FMA, a doubled register file, larger L3, and two generations of front-end improvements.
Apple M4 gets ~1.6 TFLOPS from NEON + AMX. Not bad for a laptop chip, though it lags AVX-512 by ~2×.
Snapdragon X trails the field at every size and falls off further at N = 4096 (629 → 461 GFLOPS), bandwidth-starved once the working set spills cache. Pure NEON without AMX-equivalent matrix instructions shows here.
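For anyone wanting to turn the GFLOPS figures back into wall time: a sketch under the usual convention that an N×N×N matrix multiply costs 2·N³ floating-point operations (one multiply and one add per inner-product term). That convention is my assumption about how the numbers were derived; the function names are illustrative.

```python
def sgemm_flops(n):
    """Floating-point operations for an n x n x n matmul (2*n^3 convention)."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Achieved GFLOPS for one n x n x n matmul that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9


def seconds_for(n, gflops_rate):
    """Wall time implied by a given GFLOPS figure at size n."""
    return sgemm_flops(n) / (gflops_rate * 1e9)


if __name__ == "__main__":
    # N = 4096 rows from the table: fastest and slowest entries.
    print(f"9950X3D:      {seconds_for(4096, 3032) * 1e3:.1f} ms per multiply")
    print(f"Snapdragon X: {seconds_for(4096, 461) * 1e3:.1f} ms per multiply")
```

At N = 4096 that is about 45 ms per multiply on the 9950X3D and about 298 ms on the Snapdragon X.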

Memory bandwidth (streaming read)
Sum of 256 million floats, single-shot streaming. The "what's the bandwidth wall I'm actually going to hit on a real workload" number.
CPU	Achieved GB/s
AMD Ryzen 9 5900X	44.4
AMD Ryzen 9 9950X3D	77.7
Intel Core Ultra 7 270K	87.0
Apple M4	112.6
Snapdragon X	124.0
The on-package LPDDR parts (Apple M4, Snapdragon X) clearly run away with this one. The desktop CPUs are capped by DDR channel count (DDR4 on the 5900X, DDR5 on the other two).
Worth calling out: Snapdragon X at 124 GB/s is more memory bandwidth than any mainstream desktop CPU in this list — in a laptop. That's the "memory is on-package" thing showing up.
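The "achieved GB/s" figure is just bytes read divided by wall time. A small sketch of that arithmetic, with two assumptions of mine: "256 million" is taken as 256 × 10⁶ elements (not 2²⁸), and GB means 10⁹ bytes. Function names are illustrative.

```python
N_ELEMENTS = 256_000_000   # assumption: 256 x 10^6 fp32 values
BYTES_PER_FP32 = 4


def achieved_gb_per_s(n_elements, bytes_per_element, seconds):
    """Bytes streamed divided by wall time, in GB/s (1 GB = 1e9 bytes)."""
    return n_elements * bytes_per_element / seconds / 1e9


def seconds_per_pass(n_elements, bytes_per_element, gb_per_s):
    """Time for one full pass over the array at a given bandwidth."""
    return n_elements * bytes_per_element / (gb_per_s * 1e9)


if __name__ == "__main__":
    for cpu, bw in [("AMD Ryzen 9 5900X", 44.4), ("Snapdragon X", 124.0)]:
        ms = seconds_per_pass(N_ELEMENTS, BYTES_PER_FP32, bw) * 1e3
        print(f"{cpu:20s} {ms:5.1f} ms per pass")
```

Under those assumptions the array is 1.024 GB, so one pass takes about 23 ms on the 5900X and about 8 ms on the Snapdragon X.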

And the GPUs
For continuity with the 2016 post, here's where GPUs are today. Same Position 2, depth 7, no transposition tables (raw move-generation throughput). Numbers from perft_gpu_2026.
GPU	Year	Perft (Bnps, Pos2 no-hash)
GTX 780 Ti	2013	33.3
GTX Titan X	2015	66.0
GTX 1080	2016	89.2
Tesla P100	2016	118.7
RTX 4090	2022	~1,308
RTX PRO 6000 Blackwell	2025	~2,110
The RTX PRO 6000 Blackwell does ~2.1 trillion nodes/sec on raw move generation — about 18× the Tesla P100 (the supercomputer-class GPU from the 2016 post) and ~42× the best 2026 CPU result (Intel 270K, all threads).
With transposition tables turned on, the same Blackwell card reaches ~619 trillion nps at depth 14 from the starting position. (See the repo README for the deeper-depth tables.)
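One way to read the P100 → Blackwell jump is as a doubling time. This is a back-of-the-envelope sketch using the years and Bnps figures from the table; treating the gap as exactly nine years (2016 to 2025) and assuming smooth exponential growth are simplifications of mine.

```python
import math


def doubling_time_years(speedup, years):
    """Years per 2x, assuming steady exponential growth over the period."""
    return years / math.log2(speedup)


P100_BNPS, P100_YEAR = 118.7, 2016
BLACKWELL_BNPS, BLACKWELL_YEAR = 2110.0, 2025

speedup = BLACKWELL_BNPS / P100_BNPS
years = BLACKWELL_YEAR - P100_YEAR

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x over {years} years")
    print(f"doubling time: {doubling_time_years(speedup, years):.2f} years")
```

That works out to raw perft throughput doubling a little more often than once every two and a quarter years.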

What this told me
A few things I didn't expect when I started this:
Apple Silicon caught up faster than I thought it would on integer/branchy code. Single-thread perft on the M4 beats the single-thread result of a 16-core x86 desktop chip. The conventional wisdom about "Apple is great at the easy stuff but x86 still wins on serious branchy workloads" is overdue for a rewrite.
AI-assisted optimization is a real lever, even without changing the ISA. Perft is still pure scalar integer code on every CPU here — the rewrite didn't touch SIMD — and it still got something on the order of a 2× single-thread win over what I had in 2016, separate from the silicon improvements.
Memory bandwidth went somewhere weird. The fastest memory-bandwidth machine in this set is an Arm laptop. Desktop DDR5 didn't keep up with on-package LPDDR.
GPUs kept doing their thing. ~18× from P100 to RTX PRO 6000 Blackwell on raw perft is roughly in line with the long-term trend, which is its own quiet miracle, given that the AI-driven advances in recent GPUs (tensor cores) don't help branchy / integer workloads.

Code:
perft (CPU)
perft_gpu_2026
Old post: Chess Move generation benchmark : from mobile phones to supercomputers (2016) 

Monday, May 18, 2026

A decade of perft: what's changed for CPUs (and GPUs) in 10 years

