Chess Move generation benchmark : from mobile phones to supercomputers
Introduction
Writing computer programs that can play chess has been hobby of mine since a very long time. I probably started thinking about it more than 15 years back. I still remember watching matches between the fritz chess program running on Pentium 1 (100Mhz, 8 MB RAM) and my father (who is a chess enthusiast).
I wrote my first successful program when I was in college (about 12 years ago). By successful I mean it was competitive against my friends and dad but not too strong. Relatively good chess players among my friends and my dad could beat it about half of the games.
Since then I have attempted multiple times to improve or re-write it without much success. Most of the times I start something but then get too lazy and abandon the effort.
About 3 years ago, I decided to do a complete re-write: i.e,: without copying single line of code from my original program. One of the shortcomings of my original chess program was that it didn't handle special moves like en-passent and promotion to pieces other than queen correctly. I was not too sure if it had more bugs in move-generation. This time I followed advice from chess programming wiki (CPW) and talkchess.com forums and decided to implement bug-free move generator first.
I started with a 0x88 based board representation (https://chessprogramming.wikispaces.com/0x88), made the move generation almost bug-free but I wasn't very happy with it's speed. Bitboard representation (https://chessprogramming.wikispaces.com/Bitboards) seemed promising so I again rewrote the move generator with Bitboards.
I spent quite a lot of time optimizing the perft routine (a test for move generation correctness and speed: https://chessprogramming.wikispaces.com/perft) and finally my perft routine became one of the fastest available. For the want of more speed, I spent another big chunk of time porting the perft routine to run on Nvidia GPUs with very promising results. AFAIK, it's currently the fastest brute-force perft counter in the world (without using hash tables. I have a version with hash table support too but speeds are not easily comparable due to the way different programs manage or use hash tables).
Writing rest of the chess engine got side-tracked as I spent nearly a year (not full time) in optimizing perft routine for CPUs and then GPUs. I did get back to writing rest of the chess program (only the CPU version). At present it plays chess but I don't call it complete yet. There are a few things I still want to implement before releasing it. My new program is a big improvement over my original one. It easily beats all my friends and dad - even when running on a phone (so I don't have much motivation for improving it further :-/). Against other chess engines it's still very weak. It's probably about 2200 ELO on CCRL scale - top engines being close to 3400 at the time of writing (http://www.computerchess.org.uk/ccrl/404/).
The benchmarks
This article is about comparing speed of move generation (perft routine without hash) across a wide range of devices - phones, tablets, laptops, PCs and GPUs. I also have some numbers for the complete chess engine (chess search nps) too but as nobody has figured out a way to implement efficient chess search algorithm on GPU yet, the numbers only compare some phones and PCs.
In the tables/charts below you will see results for two benchmarks: 'chess' and 'perft'.
- perft is no. of counted moves per second in brute force move-generation for the second test position of CPW: "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -". For depth 5 (for phones and CPU results) or depth 7 (for GPU results)
- chess is no. of nodes evaluated per second for starting position searching till depth 10 or 11.
For chess_cpu, I use the following commands for benchmarking:
uci
ucinewgame
position startpos
go infinite
Which will start the search and print out nps numbers (among other stats).
If you run the perft programs you will get output like this:
...
Perft 5: 193690690, Time taken: 0.216556 seconds, nps: 894414087
...
GPU Perft 7: 374190009323, Time taken: 4.19305 seconds, nps: 89240608095
Notes on 32 vs 64 bit, CPU Cores, compilers, hardware bit-manipulation instructions etc:
As my program uses bitboard representation where pieces of different types are represented by a 64 bit integers and bitwise operations on 64 bit numbers are heavily used for move generation. This means a 64 bit machine / build will perform much faster than the 32 bit one. I have observed slightly more than 2x speed up just going from 32 bit compile to 64 bit compile.
Apart from bitwise operations, two additional operations that are heavily used are:
- popCount (count no of set bits in the 64 bit number). For example this can be to find no. pieces of a type (e.g: numKnights = popCount(knights); numMajorPieces=popCount(queens | rooks);)
- clz (count leading zeros - find the position of first set bit). This is used at many places to enumerate all pieces in a bitboard (e,g, to loop over all bishops in the bitboard)
Most modern hardware (arm, intel and even nvidia gpus) have native instructions to either directly support or speed-up these operations. It sometimes depends on the compiler if it's able to use them effectively. e.g For linux builds I simply use __builtin_popcountll() / __builtin_ffsll() routines and hope the compiler will replace it with native instruction (or efficient sequences).
Although I have tried making the comparison as fair as possible, I have noticed that the compiler used, the kind of optimizations it can perform and whether it can make use of hardware 'clz' and 'popCount' instructions makes a very big difference in the results of perft benchmark.
Note that both perft_cpu and chess_cpu programs are single-threaded. You won't see any benefit on CPUs with more no. of cores/threads. perft_cpu uses very little memory (<3.5 MB total for some lookup-tables) and chess_cpu uses about 260MB (256 MB hash table in addition to the lookup tables). Having more memory capicity isn't going to help these benchmarks. These benchmarks are purely single-threaded CPU performance benchmarks but might benefit from bigger caches or lower latency RAM.
perft_gpu obvisouly is multi-threaded and uses GPU video memory to store portions of the the search tree (in addition to the lookup tables that are present in constant or texture memories).
Results
Device / processor
|
Perft (nps)
|
chess (nps)
|
notes
|
Lg Optimus 2X (Tegra 2, 2xA9 @ 1Ghz, 512MB RAM)
|
17044859
|
gcc 4.8 build used.
gcc 5.3 build doesn't work on this device (kernel too old!)
|
|
Coolpad Dazen1 (Snapdragon 410, 4xA53 @
1.2GHz, 2 GB RAM)
|
39928436
|
230479
|
32 bit (limited by OS).
With gcc 4.8 build: 20M perft and 160k chess. |
Tegra Note 7 (Tegra
4, 4xA15 @1.8Ghz, 1 GB RAM)
|
63811018
|
274913
|
32 bit (CPU doesn't support 64 bit)
|
Xiaomi Mipad (Tegra K1, 4xA15 @2.2Ghz, 2 GB RAM)
|
93773557
|
543880
|
32 bit (neither of CPU, OS support 64bit)
|
Le Eco Le 2 (Snapdragon 652, 4xA72@1.8Ghz+4xA53@1.6Ghz, 3GB)
|
211000000
|
936920
|
64 bit. With 32 bit: perft: 95M,
chess: 570k
|
Home Laptop (intel core i5 3210m, 2C/4T, @~3.1 Ghz, 6GB RAM)
|
495492997
|
MSVC compile, 64 bit
|
|
Home PC (intel core i7 4790k, 4C/8T @~4.4 Ghz, 8GB RAM)
|
723352641
|
3487225
|
gcc compile. 64 bit. Without intrinsics, perft: 220MNps
|
GTX 780Ti stock (GK110 15 SM)
|
33318412188
|
||
GTX 970 (GM204 13 SM, @~1400/3600)
|
47374624070
|
Home PC, GPU slightly over-clocked.
|
|
GTX Titan X stock (GM200, 24 SM, @~1200/3500)
|
66021489568
|
||
GTX 1080 stock (GP104, 20 SM, @~1880/5000)
|
89240608095
|
||
P100 (GP100 pre-preduction card, unknown
clocks/config)
|
118658287137
|
Lot of numbers! Here is a chart for perft scores:
GPUs are so much faster that the mobile and CPU results got squashed. We see healthy increase in performance with newer GPU generations.
Maybe logarithmic scale would help?
Much better than before. Phone to laptops and desktop CPUs show gradual increase. Going to GPUs has a big jump and after that we again have gradual increases with different GPU generations. However it still doesn't show big jump among phones or going from phones to PC. Here are CPU only scores in linear scale:
The results here might not be apples to apples due to differences in compilers and builds used. Few things to note:
- The Lg optimus 2X phone (cortex A9 based tegra 2 @ 1GHz) was running the program compiled with older gcc compiler (gcc 4.8) because the gcc 5.3.1 build (which was used for other results) won't work. it kept complaining that kernel is too old. With gcc 4.8 build, I get ~20M nps on coolpad dazen 1 (snapdragon 410) instead of the ~40m score with gcc 5.3.1. My guess is gcc 4.8 is not utilizing the clz and popCount hardware instructions efficiently. I am not sure if arm A9 even supports them. If it does, clock per clock, for 32 bit apps A53 is probably only as fast as A9 (actually a tiny bit slower).
- Coolpad Dazen 1 phone (A53 based Snapdragon 410) is capable of running 64 bit but it's running 32 bit build as the OS (Android 4.4.4) doesn't support 64 bit :-/. With 64 bit OS it could have been quite a bit faster (I would estimate nearly 100 Mnps instead of 40).
- LeEco Le 2 (Cortex A72 based Snapdragon 652) is the only phone that is running 64 bit build. With 32 bit build, it scores 95Mnps in perft - less than half of 64 bit score.
- Tegra note 7 and Xiaomi Mipad have A15 based processors that doesn't support 64 bit - although 32 bit performance is quite decent.
- The laptop was running a build on windows compiled using MSVC - which is slightly faster than the GCC 5.3.1 build used for other systems. Here is a comparison of performance of various compilers (taken on home PC - 4790k):
Intel core i7 4790k
|
perft
|
GCC fastest build (plain magics)
|
723352641
|
MSVC fastest build (fancy magics + __forceinline)
|
815811111
|
Intel compiler fastest build (plain magics)
|
1069033463
|
It's surprising to see almost 50% performance gain just by using a different compiler. The ICC compiler is not free - I downloaded 30 day trial version for my benchmarking.
Finally here is a graph of chess_cpu results on some of the devices:
One result that stands out is the comparatively low performance of the tegra note 7, and the big jump between tegra note 7 and the tegra k1 which can't be explained just by the CPU frequency. Maybe differences in cache sizes or memory?
Conclusion
Looking at all results, we can say that today's mobile chips are catching up quickly with desktop CPUs - at least for my perft and chess benchmarks. The fastest desktop CPU (i7 4790k) is only about 3.5x faster than the fastest mobile processor (snapdragon 652) that I was able to test. Some might argue that 4790k isn't the fastest CPU - but actually it is for single threaded performance. I have tested 6700k too and it scores exactly same as my 4790k (~1.1 Bnps in perft with intel build). If you compare clock-per-clock performance, it's even close : skylake has < 50% per clock advantage over arm A72!.
Snapdragon 652 is definitely not the fastest mobile cpu - I am eager to get my hands on devices based on snapdragon 820/821, Exynos 8890 or Kirin 950 processors which might have better single threaded performance. Apple A9 and A9X processors are probably even faster.
In my benchmarks you might see GPUs being disproportionately faster. Keep in mind that CPU version of my perft benchmark is single threaded. Although the current benchmark results show fastest GPU being >100X faster than fastest CPU, once the benchmark is modified to utilize all CPU cores, the speedup might drop to only ~5-10x (e.g: when comparing Nvidia Tesla P100 with Intel Xeon E7-8890 v4). It's still surprising to see GPUs performing so well given that they don't have native support for 64 bit integers.
We have compared a wide range of devices - the difference between the slowest phone and the fastest system (GP100 based personal supercomputer :-) is nearly 7000X!
More chess benchmarks comparing phones and PCs:
http://talkchess.com/forum/viewtopic.php?t=43212&postdays=0&postorder=asc&topic_view=flat&start=630