Table of contents

8.1.2010    The year I started blogging (blogware)
9.1.2010    Linux initramfs with iSCSI and bonding support for PXE booting
9.1.2010    Using manually tweaked PTX assembly in your CUDA 2 program
9.1.2010    OpenCL autoconf m4 macro
9.1.2010    Mandelbrot with MPI
10.1.2010   Using dynamic libraries for modular client threads
11.1.2010   Creating an OpenGL 3 context with GLX
11.1.2010   Creating a double buffered X window with the DBE X extension
12.1.2010   A simple random file read benchmark
14.12.2011  Change local passwords via RoundCube safer
5.1.2012    Multi-GPU CUDA stress test
6.1.2012    CUDA (Driver API) + nvcc autoconf macro
29.5.2012   CUDA (or OpenGL) video capture in Linux
31.7.2012   GPGPU abstraction framework (CUDA/OpenCL + OpenGL)
7.8.2012    OpenGL (4.3) compute shader example
10.12.2012  GPGPU face-off: K20 vs 7970 vs GTX680 vs M2050 vs GTX580
4.8.2013    DAViCal with Windows Phone 8 GDR2
5.5.2015    Sample pattern generator



10.12.2012

GPGPU face-off: K20 vs 7970 vs GTX680 vs M2050 vs GTX580

Introduction

Very recently I got my hands on a Tesla K20. Eager to see what it was made of, I selected some of my better optimized GPGPU programs and went on to gather some performance numbers against a variety of current GPUs.

Tested devices

From NVIDIA's Kepler family, I have:

  - Tesla K20
  - GeForce GTX 680

Then from NVIDIA's Fermi family, I have:

  - Tesla M2050
  - GeForce GTX 580

On the AMD front, I have the only current architecture relevant for GPGPU:

  - Radeon HD 7970

Test applications

The majority of transistors in any GPU today is dedicated to single precision floating point and integer arithmetic, and that is what we will be testing today. I'm not saying DP arithmetic isn't relevant, especially for the Teslas, but making a fair test case is tricky: many programs that operate on double precision floating point data actually end up, after compilation, performing more integer arithmetic (index/pointer calculations, counters, etc.) than actual DP FP operations. Anyway, I'll tackle doubles properly some other time.
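To illustrate the point with a trivial made-up double precision kernel (not one of the benchmarks): the only DP instruction per element below is a fused multiply-add, while computing the index, checking the bound and forming the load/store addresses all compile into integer instructions, which easily outnumber the DP work.

    // Hypothetical example: a plain DAXPY in CUDA.
    __global__ void daxpy(int n, double a, const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer mul + add
        if (i < n)                                      // integer compare
            y[i] = a * x[i] + y[i];                     // one DFMA, plus integer address
                                                        // arithmetic for the loads and the store
    }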

I have selected four test applications that stress different features of a GPU:

  1. Digital Hydraulics code is all about basic floating point arithmetic, both algebraic and transcendental. No dynamic branching, very little memory traffic.
  2. Ambient Occlusion code is a very mixed load of floating point and integer arithmetic, dynamic branching, texture sampling and memory access. Despite the memory traffic, this is a very compute intensive kernel.
  3. Running Sum code, in contrast to the above, is memory intensive. It shuffles data through at a high rate without doing much computation on it. It does rely heavily on the on-chip L1 cache, though, so it's not a raw memory bandwidth test. (A rough sketch of this kind of access pattern follows the list.)
  4. Geometry Sampling code is texture sampling intensive. It sweeps through geometry data in "waves", and stresses the samplers, texture caches, and memory equally. It also has high register usage and thus low occupancy.
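As a minimal sketch of the access pattern the Running Sum case stands for (made-up names, not the actual benchmark kernel): think of a per-thread windowed sum, where each output reads many inputs, neighbouring threads read overlapping ranges that get served from the transparent cache, and very little arithmetic is done per byte moved.

    // Illustrative sketch only, not the real Running Sum kernel.
    #define WINDOW 64

    __global__ void running_sum(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + WINDOW > n)
            return;

        float sum = 0.0f;
        for (int k = 0; k < WINDOW; ++k)
            sum += in[i + k];   // overlapping reads across neighbouring threads -> cache reuse
        out[i] = sum;
    }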

Performance

The test applications are all implemented in both CUDA and OpenCL. CUDA usually outperforms OpenCL on NVIDIA hardware, so I have selected the CUDA port for NVIDIA's devices. OpenCL is used for AMD.

[Chart: Relative performance (higher is better)]

AMD is the winner here, as has been the case in GPGPU lately. While the different NVIDIA architectures perform very differently from each other, devices of the same architecture perform very predictably: take the number of cores, multiply by the frequency, and there you go. There are differences in memory bandwidth, though, especially if ECC is enabled. The Kepler architecture is in general less robust than the Fermi architecture, and its raw potential is less often fully realized.
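As a back-of-the-envelope example of that rule (using the public board specifications, not numbers from this test setup): the K20 has 2496 cores at 706 MHz and the GTX 680 has 1536 cores at 1006 MHz, so 2496 × 0.706 ≈ 1762 versus 1536 × 1.006 ≈ 1545 core·GHz, which predicts the K20 to land roughly 14% ahead of the GTX 680 in the compute-bound cases.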

CUDA vs OpenCL

As a bonus, I'm also comparing the CUDA ports of the applications against their identical OpenCL ports. I'm not going to get very analytical about the results, because I simply don't know all the little things that end up being done differently between the two. Although OpenCL and CUDA code should go through exactly the same compiler technology, there are some differences in how data types and operations are defined at the language level, which force the compiler to produce slightly different code. For example, thread indices are unsigned int (32-bit) in CUDA but size_t (usually 64-bit) in OpenCL. Operations on such built-in variables therefore compile into more 32-bit integer instructions in OpenCL than in CUDA, and register usage may also increase (which affects occupancy).
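A minimal side-by-side of that index difference (a made-up example, not benchmark code):

    // CUDA: the built-in index components are 32-bit unsigned ints,
    // so the whole index computation stays in 32-bit integer arithmetic.
    __global__ void scale(float *data, float f)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;  // 32-bit mul + add
        data[i] *= f;
    }

    // The OpenCL equivalent, for comparison:
    //   __kernel void scale(__global float *data, float f)
    //   {
    //       size_t i = get_global_id(0);   // size_t, typically 64-bit
    //       data[i] *= f;                  // 64-bit index/address arithmetic
    //   }
    // Explicitly narrowing with "uint i = (uint)get_global_id(0);" brings the
    // OpenCL version back to 32-bit index arithmetic.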

Anyway, even though there should generally be no major differences in performance, there sometimes are. Over the years CUDA has performed quite solidly, whereas OpenCL's performance has varied from one driver version to the next. For a long time CUDA was always faster, but that is no longer true. In the following I have run the four applications as OpenCL and as CUDA on the two Tesla cards. The setups use CUDA 5.0 and driver version 304.60. Both CUDA and OpenCL here use 32-bit addressing, so pointer arithmetic should compile to an equal number of instructions.

[Chart: Relative performance (higher is better)]

With the latest drivers, the performance difference between OpenCL and CUDA is the smallest I have ever seen. Only in the first application is there a significant difference, with CUDA being 19% faster. Of course, performance is not the only difference between the two GPGPU APIs: CUDA exposes NVIDIA-specific features that are unavailable in OpenCL. These include the ability to configure the L1 cache / shared memory split (a major advantage for kernels that rely heavily on the transparent caches), warp vote functions (which can be emulated in OpenCL via shared counters, but at a performance cost), shuffle instructions, some native instructions (e.g. sincos()), and so on. Also, there can be huge differences in how (non-native) math library functions are implemented.
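Here is what a few of those CUDA-side features look like in one made-up kernel (illustrative only, not from the benchmarks; the shuffle needs -arch=sm_30, so this targets Kepler):

    // Illustrative only: a warp vote (__ballot), a Kepler shuffle (__shfl) and a
    // fast native intrinsic (__sincosf) in one kernel, plus the host-side L1 split.
    __global__ void features_demo(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];

        unsigned int mask = __ballot(x > 0.0f);           // warp vote: bitmask of lanes with x > 0
        float left = __shfl(x, (threadIdx.x + 31) % 32);  // read a neighbouring lane's x (sm_30+)

        float s, c;
        __sincosf(x, &s, &c);                             // fast native sin/cos approximation

        out[i] = s + c + left + __popc(mask);
    }

    // Host side: prefer a larger L1 cache over shared memory for this kernel.
    // cudaFuncSetCacheConfig(features_demo, cudaFuncCachePreferL1);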

Comments

  23.1.2013

It's fun to see AMD outperforming a card that is about 10x more expensive. For integer, AMD is unbeatable on my front, but I've heard that on floating point (especially with doubles) NVidia is much better.

It would be very interesting to know what the AMD & NVidia kernel analyzers say about ALU & FPU utilisation for these tests. It could be easy to carry out (since you already have the code & HW anyway) and the results would say a lot, I think.
- Mate Soos

  15.6.2013

Great review that confirms what I have observed myself. It is a good time to convert your apps from CUDA to OpenCL in order to keep the competition between GPU manufacturers alive.
- vic20

  18.2.2014

Awesome review! I happened to have tested a GTX 650, GTX 640 and GTX 580 with both CUDA and OpenCL in my Bayesian MCMC analysis, which involves double precision floating point calculation. Basically the two are side by side, with CUDA slightly outperforming in a few more situations. Not surprising, since they are all Nvidia devices. Meanwhile I tried OpenCL on an Intel HD 4000 (which comes with the i5-3570K CPU), and it surprised me. I'm wondering if I should go and get a pair of HD 7970s when the price is back to normal.
- Qiyun Zhu





