Choosing a processor for a build farm

From Ant-Computing
Revision as of 10:42, 18 January 2015 by Willy (Talk | contribs) (Results)

Overview

Build farms are network clusters of nodes with high CPU performance which are dedicated to building software. The general approach consists of running tools like "distcc" on the developer's workstation, which delegate the job of compiling to all available nodes. The CPU architecture is irrelevant here since cross-compilation for any platform is involved anyway.

Build farms are only interesting if they can build faster than any commonly available, cheaper solution, starting with the developer's workstation. Note that developers' workstations are commonly very powerful, so in order to provide any benefit, a cluster aiming at being faster than such a workstation still needs to be affordable.

In terms of compilation performance, the metrics are lines of code per second per dollar (performance) and lines of code per joule (efficiency). A number of measurements were run on various hardware, and this research is still ongoing. To sum up the observations, the most interesting solutions are in the middle range. Devices that are too cheap have CPUs that are too small or too little RAM bandwidth, and devices that are too expensive optimise for areas irrelevant to build speed, or have pricing that grows exponentially with performance.
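As an illustration, the two metrics can be computed as follows. This is only a sketch: the function names are hypothetical, and the line count, price and power figures in the example are made up, not measurements from this page.

```shell
#!/bin/sh
# Hypothetical helpers computing the two metrics named above.

# Lines of code per second per dollar (performance).
# $1 = lines of code built, $2 = build time in seconds, $3 = price in dollars
loc_per_sec_per_dollar() {
    awk -v loc="$1" -v t="$2" -v price="$3" \
        'BEGIN { printf "%.2f\n", loc / t / price }'
}

# Lines of code per joule (efficiency); energy = average power * time.
# $1 = lines of code built, $2 = build time in seconds, $3 = average power in watts
loc_per_joule() {
    awk -v loc="$1" -v t="$2" -v watts="$3" \
        'BEGIN { printf "%.2f\n", loc / (t * watts) }'
}

# Made-up example: 100000 lines built in 25 s on a 100-dollar board drawing 5 W.
loc_per_sec_per_dollar 100000 25 100   # prints 40.00
loc_per_joule 100000 25 5              # prints 800.00
```

Both numbers only make sense when compared between machines building the same code with the same compiler.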

Hardware considerations

Nowadays, most processors are optimised for higher graphics performance. Unfortunately, it's still not possible to run GCC on the GPU, so transistors, space, power and thermal budget are wasted on a part that is totally unused in a build farm. Similarly, a build farm hardly needs a floating point unit. Admittedly, if some code to be built uses a lot of floating point operations, the compiler has to use floating point as well to deal with constants and some optimisations, but such code is often marginal in a whole program, let alone in a whole distribution (except maybe in HPC environments).

Thus, any CPU with many cores at a high frequency and high memory bandwidth may be eligible for testing, even if it has neither FPU nor GPU.

Methodology

First, software changes a lot, which means that comparing numbers between machines is not always easy. It would be possible to insist on building an outdated piece of code with an outdated compiler, but that would be pointless. It is better to build modern code that the developer needs to build right now, with the compiler they want to use. As a consequence, a single benchmark result is useless on its own: it always needs to be compared to a run on another system with the same compiler and code. After all, the purpose of a build farm is to offload the developer's system, so it makes sense to use this system as the reference and to run the same toolchain version on the device being evaluated.

Porting a compiler to another machine

The most common operation here is what is called a Canadian build. It consists of building on machine A a compiler that runs on machine B and produces code for machine C. For example, a developer using an x86-64 system could build an ARMv7 compiler producing code for a MIPS platform. Canadian builds sometimes fail because of bugs in the compiler's build system, which sometimes mixes up variables between build, host and target. In its defense, the principle is complex, and detecting unwanted sharing there is even more difficult than detecting similar issues in more common cross-compilation operations.

In case of failure, it can be easier to proceed in two steps:

  • Canadian build on the developer's system of a compiler that runs on the tested device and produces code for the tested device. This results in a compiler that runs natively on the test device.
  • native build on the test device of a cross-compiler for the target system, using the previously built compiler.

Since that's a somewhat painful (or at least annoying) task, it makes sense to back up the resulting compilers and simply copy them to future devices to be tested if they use the same architecture.
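The build, host and target roles map onto the --build, --host and --target options of GCC's configure script. The sketch below only prints the option sets for the direct Canadian build and for the two-step fallback; it does not run any build. The helper name and the three triplet strings are assumptions chosen to match the x86-64 / ARMv7 / MIPS example, and a real setup may use different triplets.

```shell
#!/bin/sh
# Hypothetical helper: print the configure options for a build/host/target combination.
# $1 = build machine, $2 = host machine (where the compiler runs), $3 = target
triplet_flags() {
    printf '%s\n' "--build=$1 --host=$2 --target=$3"
}

# Assumed triplets, for illustration only.
DEV=x86_64-linux-gnu         # developer's system (machine A)
DEVICE=arm-linux-gnueabihf   # tested device (machine B)
TARGET=mips-linux-gnu        # platform the farm produces code for (machine C)

echo "direct Canadian build:       $(triplet_flags "$DEV" "$DEVICE" "$TARGET")"
echo "step 1, on developer system: $(triplet_flags "$DEV" "$DEVICE" "$DEVICE")"
echo "step 2, on tested device:    $(triplet_flags "$DEVICE" "$DEVICE" "$TARGET")"
```

In step 1 host and target are identical, which is why the result is a native compiler for the device; step 2 is then an ordinary cross-compiler build performed on the device itself.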

Tests

Conditions

This test consisted of building haproxy-git-6dcb0a84 using gcc-4.7.4, producing code for i386. In all tests, no disk I/O was involved because the compiler, the sources and the resulting binaries were all placed in a RAM disk. The APU and the ARMs were running from a Formilux RAM disk.

Test method

HAProxy's sources matching Git commit ID 6dcb0a84 are extracted into /dev/shm. A purpose-built toolchain based on gcc-4.7.4 and glibc-2.18 is extracted into /dev/shm as well. The "make" utility is installed on the system if not present. HAProxy is always built under the same conditions: TARGET is set to "linux2628", EXTRA is empty, CC points to the cross-compiler, and LD is set to "#" to disable the final linking phase, which cannot be parallelized. Then, for each parallelism level from 1 to the number of CPU cores, in powers of two, the build is run at least 3 times and the shortest build time is noted. Example:

root@t034:~# cd /dev/shm
root@t034:shm# tar xf haproxy-6dcb0a84.tar
root@t034:shm# tar xf i586-gcc47l_glibc218-linux-gnu-full-arm.tgz
root@t034:shm# cd haproxy
root@t034:haproxy# make clean
root@t034:haproxy# time make -j 4 TARGET=linux2628 EXTRA= CC=../i586-*-gnu/bin/i586-*-gcc LD='#'
...
real  0m25.032s
user  1m33.770s
sys   0m1.690s
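
The sweep itself can be scripted. The following is a minimal sketch, not the script used for these measurements: the function names are hypothetical, CC is expected to be set in the environment to the cross-compiler path, and timing relies on GNU date for the %N format.

```shell
#!/bin/sh
# Hypothetical helpers sketching the test method above.

# Print the parallelism levels: powers of two from 1 up to the core count $1.
job_levels() {
    j=1
    while [ "$j" -le "$1" ]; do
        printf '%s\n' "$j"
        j=$((j * 2))
    done
}

# Read one time in seconds per line on stdin and print the shortest one.
shortest() {
    sort -n | head -n 1
}

# Run the build $2 times with -j $1 and print the shortest wall-clock time.
# Assumes GNU date and the make arguments from the example above.
best_build_time() {
    i=0
    while [ "$i" -lt "$2" ]; do
        make clean >/dev/null
        start=$(date +%s.%N)
        make -j "$1" TARGET=linux2628 EXTRA= CC="$CC" LD='#' >/dev/null
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done | shortest
}

# Sweep all levels for a machine with $1 cores, three runs per level.
sweep() {
    for jobs in $(job_levels "$1"); do
        printf '%s processes: %s s\n' "$jobs" "$(best_build_time "$jobs" 3)"
    done
}
```

After sourcing these functions from the haproxy directory, with CC exported, a 4-core machine would be measured with "sweep 4".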

Devices

Machines involved in this test were 32- and 64-bit x86 as well as 32-bit ARMv7 platforms:

Date        Machine           CPU family  CPU model         CPU freq (nom/max)  Cores  Threads  RAM size/type  RAM width (bits)  RAM freq (MHz)
2014/08/05  ThinkPad t430s    x86-64      Core i5-3320M     2.6/3.3 GHz         2      4        8 GB DDR3      64                1600
2014/08/05  C2Q               i686        Core2 Quad Q8300  3.0 GHz (OC)        4      4        8 GB DDR3      128               1066
2014/08/05  PC-Engines apu1c  x86-64      AMD T40-E         1.0/1.0 GHz         2      2        2 GB DDR3      64                1066
2014/08/05  Asus EEE PC       i686        Atom N2800        1.86/1.86 GHz       2      4        4 GB DDR2      64                1066
2014/08/05  Marvell XP-GP     armv7       mv78460           1.6/1.6 GHz         4      4        4 GB DDR3      64                1866
2014/08/05  OpenBlocks AX3    armv7       mv78260           1.33/1.33 GHz       2      2        2 GB DDR3      64                1333
2015/01/18  Jesusrun T034     armv7       RK3288            1.8/1.8 GHz         4      4        2 GB LPDDR2    32                1066
2015/01/18  AMD2              x86-64      Phenom 9950       3.0 GHz (OC)        4      4        2 GB DDR3      128               1066

Results

The results are presented below as build times for various levels of build parallelism.

Date        Machine  Processes  Time (seconds)  Observations
2014/08/05  apu1c    1          116.3
2014/08/05  apu1c    2          59.4            CPU is very hot
2014/08/05  apu1c    4          64.0            Expected, more processes than cores
2014/08/05  t430s    1          19.3            1 core at 3.3 GHz
2014/08/05  t430s    2          10.9            2 cores at 3.1 GHz
2014/08/05  t430s    4          9.1             2 cores at 3.1 GHz
2014/08/05  AX3      2          93.5            running in Thumb2 mode
2014/08/05  XP-GP    2          74.7            running in Thumb2 mode
2014/08/05  XP-GP    4          39.75           running in Thumb2 mode
2014/08/05  EEE PC   2          61.0
2014/08/05  EEE PC   4          46.6
2014/08/05  C2Q      1          36.6
2014/08/05  C2Q      2          18.9            2 cores on the same die
2014/08/05  C2Q      4          10.8            L2 cache not shared between the 2 dies
2015/01/18  C2Q      1          30.0
2015/01/18  C2Q      2          16.1            2 cores on the same die
2015/01/18  C2Q      4          8.77            L2 cache not shared between the 2 dies
2015/01/18  T034     1          74.7
2015/01/18  T034     2          41.7
2015/01/18  T034     4          25.0            slow memory seems to be a bottleneck
2015/01/18  AMD2     1          26.4
2015/01/18  AMD2     2          13.8
2015/01/18  AMD2     4          7.47

Analysis on 2014/08/05

The Core2 Quad is outdated. It's about half as powerful as the new Core i5 despite running at roughly the same frequency. ARMs do not perform that well here. The XP-GP needs all of its 4 cores to achieve the performance of one core of the C2Q. Since it's running at about half the frequency, we can consider that each core of this Armada-XP chip delivers approximately half of the performance of a C2Q core at the same frequency in this workload. The Atom in the EEE PC, despite a slightly higher frequency than the Armada-XP, is not even able to catch up with it. The APU platform is significantly more efficient than the Atom at similar frequency, given that it delivers per core at 1.0 GHz the same performance as the Atom at 1.86 GHz. However, the Atom can use its HyperThreading to save about 25% more build time and reach a build time that the APU cannot achieve.
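
The ratios behind this analysis can be rechecked from the 2014/08/05 rows of the results table. The sketch below is only a convenience for redoing the arithmetic; the helper names ratio and saved_pct are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers to recheck the figures quoted in the analysis.

# How many times longer the first build time is than the second.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Percentage of build time saved when going from $1 seconds to $2 seconds.
saved_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

ratio 36.6 19.3       # C2Q vs t430s, 1 process each: prints 1.90
ratio 39.75 36.6      # XP-GP with 4 processes vs C2Q with 1: prints 1.09
ratio 61.0 59.4       # EEE PC vs apu1c, 2 processes each: prints 1.03
saved_pct 61.0 46.6   # Atom going from 2 to 4 processes: prints 23.6
```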

The conclusion here is that low-end x86 CPUs such as the Core i3 3217U at 1.8 GHz should still be able to achieve half of the Core i5's performance, or be on par with the C2Q, despite consuming only 17W instead of the C2Q's 77W. All x86 machines are still expensive because you need to add memory and sometimes a small SSD if you cannot boot them over the network. Given the arrival of the new Cortex A17 at 2+ GHz, supposed to be 60% faster than A9s clock-for-clock (Armada XP's PJ4B core is very similar to the A9), there is some hope of seeing interesting improvements there. If an A17 could deliver half of the i5's performance for a quarter of its price (or half the price of a fully equipped low-end i3), a build farm based on these devices would make sense.

Analysis on 2015/01/18

As expected, the Cortex A17 running at the heart of the RK3288 shows very good performance, and a single core performs about 60% faster than Armada XP's clock for clock, resulting in each core being 1.92 times faster thanks to the higher frequency. This is visible in the single-core test, which shows exactly the same speed as two cores on the XP-GP, and in the two-core test, which is almost twice as fast on the RK3288. However, this 4-core CPU doesn't scale well to 3 or 4 processes. The very likely reason is that not only is the RAM limited to a 32-bit bus, but it also runs at only 1066 MHz. In comparison, the Armada XP uses 64-bit memory at 1600 MHz, resulting in about 3 times the bandwidth. The 4-core run on the RK3288 was only 67% faster than the 2-core one. Linear scaling should have shown around 21 seconds for 4 processes instead of 25. It is possible that other devices running faster DDR3/DDR3L and more channels would not experience this performance loss. That said, this device is by far the fastest of all non-x86 devices here and is even much faster than all low-end CPUs tested so far. 4 cores of the RK3288 give approximately the same build power as one core of an Intel Core i5 at 3 GHz. The device is cheap (less than 75 EUR shipping included) and can really compete with low-end PCs, which still require the addition of RAM and storage. With four such devices, for less than 300 EUR, you get the equivalent of four 3 GHz Intel cores with 8 GB of RAM, and it is completely fanless.
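
The scaling and bandwidth figures in this paragraph can be rechecked as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the memory frequencies are those quoted in the paragraph above and are treated as transfer rates, and the bandwidth is a theoretical peak (bus width times transfer rate), not a measurement.

```shell
#!/bin/sh
# Hypothetical helpers to recheck the scaling and bandwidth arithmetic.

# How many times larger $1 is than $2.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Theoretical peak memory bandwidth in MB/s.
# $1 = bus width in bits, $2 = transfer rate in millions of transfers per second
peak_bw_mb() {
    awk -v width="$1" -v rate="$2" 'BEGIN { printf "%d\n", width * rate / 8 }'
}

ratio 41.7 25.0    # T034 speedup from 2 to 4 processes: prints 1.67
awk 'BEGIN { printf "%.2f\n", 41.7 / 2 }'   # ideal 4-process time: prints 20.85

rk=$(peak_bw_mb 32 1066)    # RK3288: 4264 MB/s
xp=$(peak_bw_mb 64 1600)    # Armada XP: 12800 MB/s
ratio "$xp" "$rk"           # prints 3.00
```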