Table of contents

8.1.2010    The year I started blogging (blogware)
9.1.2010    Linux initramfs with iSCSI and bonding support for PXE booting
9.1.2010    Using manually tweaked PTX assembly in your CUDA 2 program
9.1.2010    OpenCL autoconf m4 macro
9.1.2010    Mandelbrot with MPI
10.1.2010   Using dynamic libraries for modular client threads
11.1.2010   Creating an OpenGL 3 context with GLX
11.1.2010   Creating a double buffered X window with the DBE X extension
12.1.2010   A simple random file read benchmark
14.12.2011  Change local passwords via RoundCube safer
5.1.2012    Multi-GPU CUDA stress test
6.1.2012    CUDA (Driver API) + nvcc autoconf macro
29.5.2012   CUDA (or OpenGL) video capture in Linux
31.7.2012   GPGPU abstraction framework (CUDA/OpenCL + OpenGL)
7.8.2012    OpenGL (4.3) compute shader example
10.12.2012  GPGPU face-off: K20 vs 7970 vs GTX680 vs M2050 vs GTX580
4.8.2013    DAViCal with Windows Phone 8 GDR2
5.5.2015    Sample pattern generator



5.1.2012

Multi-GPU CUDA stress test

Update 16-03-2020: Versions 1.1 and up support tensor cores.

Update 30-11-2016: Versions 0.7 and up also benchmark (report Gflop/s).

I work with GPUs a lot and have seen them fail in a variety of ways: memory or cores overclocked too far at the factory, instability when hot, instability when cold (not kidding), partially unreliable memory, and so on. What's more, failing GPUs often fail silently: when they are just a little unstable they produce incorrect results, and I have seen such GPUs consistently produce correct results in some applications and incorrect results in others.

What I needed in my toolbox was a stress test for multi-GPGPU setups that uses all of the GPUs' memory and checks the results while keeping the GPUs burning. There are not a lot of tools that can do this, let alone for Linux, so I hacked together my own. It runs on Linux and uses the CUDA driver API.

My program forks one process for each GPU on the machine, one process for keeping track of the GPU temperatures where available (e.g. Fermi Teslas don't have temperature sensors), and one process for reporting the progress. Each GPU process allocates 90% of the free GPU memory, initializes two random 2048*2048 matrices, and continuously runs efficient CUBLAS matrix-matrix multiplication routines on them, storing the results across the allocated memory. Both floats and doubles are supported. Correctness of the calculations is checked by comparing the result of each new calculation against a previous one -- on the GPU. This way the GPUs are 100% busy all the time while the CPUs stay idle. The number of erroneous calculations is brought back to the CPU and reported to the user along with the number of operations performed so far and the GPU temperatures.
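
To illustrate the idea, below is a minimal single-GPU sketch of the approach using the CUDA runtime API plus CUBLAS. It is not the actual gpu_burn source (which uses the driver API, one process per GPU, and a separately compiled compare.ptx kernel); the countFaulty kernel and the fixed iteration count are made up for this example:

// Minimal sketch (runtime API + CUBLAS): keep the GPU busy with SGEMMs and
// verify each new product against a reference product on the device itself,
// so the host stays idle.
// Build with something like: nvcc -O2 sketch.cu -lcublas -o sketch
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical comparison kernel: counts elements that differ from the
// reference beyond a small tolerance (gpu_burn ships a similar kernel as compare.ptx).
__global__ void countFaulty(const float *result, const float *reference,
                            unsigned int *faults, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(result[i] - reference[i]) > 1e-3f * fabsf(reference[i]))
        atomicAdd(faults, 1u);
}

int main() {
    const int n = 2048;                       // matrix side, as in the text
    const size_t elems = (size_t)n * n;

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Two random input matrices, generated on the host once.
    std::vector<float> hostA(elems), hostB(elems);
    for (size_t i = 0; i < elems; ++i) { hostA[i] = drand48(); hostB[i] = drand48(); }

    float *A, *B, *ref, *out;
    unsigned int *faults;
    cudaMalloc(&A, elems * sizeof(float));
    cudaMalloc(&B, elems * sizeof(float));
    cudaMalloc(&ref, elems * sizeof(float));
    cudaMalloc(&out, elems * sizeof(float));  // the real tool spreads many such result buffers over ~90% of free memory
    cudaMalloc(&faults, sizeof(unsigned int));
    cudaMemcpy(A, hostA.data(), elems * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B, hostB.data(), elems * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(faults, 0, sizeof(unsigned int));

    // Reference product, computed once; every later product must match it.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, A, n, B, n, &beta, ref, n);

    // Burn loop: SGEMM plus on-GPU comparison; everything stays on the device.
    for (int iter = 0; iter < 1000; ++iter) {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, A, n, B, n, &beta, out, n);
        countFaulty<<<(unsigned)((elems + 255) / 256), 256>>>(out, ref, faults, elems);
    }

    // Only the error count is brought back to the CPU.
    unsigned int errors = 0;
    cudaMemcpy(&errors, faults, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("multiplications: 1000, faulty elements: %u\n", errors);

    cudaFree(A); cudaFree(B); cudaFree(ref); cudaFree(out); cudaFree(faults);
    cublasDestroy(handle);
    return 0;
}

The real program additionally cycles the output pointer over all of the allocated result buffers, so the whole allocation gets written and verified.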

Real-time progress and summaries every ~10% are printed as shown below. Matrices processed are cumulative, whereas errors are counted per summary. Figures for the individual GPUs are separated by dashes (slashes in older versions). The program exits with a conclusion after it has been running for the number of seconds given as the last command line parameter. If you want to burn using doubles instead, give the parameter "-d" before the burn duration. The example below was run on a machine that had one working GPU and one faulty GPU (factory-overclocked too far and thus slightly unstable; you wouldn't have noticed it during gaming):

% ./gpu_burn 120
GPU 0: GeForce GTX 1080 (UUID: GPU-f998a3ce-3aad-fa45-72e2-2898f9138c15)
GPU 1: GeForce GTX 1080 (UUID: GPU-0749d3d5-0206-b657-f0ba-1c4d30cc3ffd)
Initialized device 0 with 8110 MB of memory (7761 MB available, using 6985 MB of it), using FLOATS
Initialized device 1 with 8113 MB of memory (7982 MB available, using 7184 MB of it), using FLOATS
10.8%  proc'd: 3472 (4871 Gflop/s) - 3129 (4683 Gflop/s)   errors: 0 - 0   temps: 56 C - 56 C 
	Summary at:   Mon Oct 31 10:32:22 EET 2016

22.5%  proc'd: 6944 (4786 Gflop/s) - 7152 (4711 Gflop/s)   errors: 0 - 0   temps: 61 C - 60 C 
	Summary at:   Mon Oct 31 10:32:36 EET 2016

33.3%  proc'd: 10850 (4843 Gflop/s) - 10728 (4633 Gflop/s)   errors: 2264 (WARNING!) - 0   temps: 63 C - 61 C 
	Summary at:   Mon Oct 31 10:32:49 EET 2016

44.2%  proc'd: 14756 (4861 Gflop/s) - 13857 (4675 Gflop/s)   errors: 1703 (WARNING!) - 0   temps: 66 C - 63 C 
	Summary at:   Mon Oct 31 10:33:02 EET 2016

55.0%  proc'd: 18228 (4840 Gflop/s) - 17433 (4628 Gflop/s)   errors: 3399 (WARNING!) - 0   temps: 69 C - 65 C 
	Summary at:   Mon Oct 31 10:33:15 EET 2016

66.7%  proc'd: 22134 (4824 Gflop/s) - 21009 (4652 Gflop/s)   errors: 3419 (WARNING!) - 0   temps: 70 C - 65 C 
	Summary at:   Mon Oct 31 10:33:29 EET 2016

77.5%  proc'd: 25606 (4844 Gflop/s) - 25032 (4648 Gflop/s)   errors: 5715 (WARNING!) - 0   temps: 71 C - 66 C 
	Summary at:   Mon Oct 31 10:33:42 EET 2016

88.3%  proc'd: 29078 (4835 Gflop/s) - 28161 (4602 Gflop/s)   errors: 7428 (WARNING!) - 0   temps: 73 C - 67 C 
	Summary at:   Mon Oct 31 10:33:55 EET 2016

100.0%  proc'd: 33418 (4752 Gflop/s) - 32184 (4596 Gflop/s)   errors: 9183 (WARNING!) - 0   temps: 74 C - 68 C 
Killing processes.. done

Tested 2 GPUs:
	GPU 0: FAULTY
	GPU 1: OK

With this tool I've been able to spot unstable GPUs that performed well under every other load they were subjected to. So far it has also never missed a GPU that was known to be unstable. *knocks on wood*

Grab it from GitHub https://github.com/wilicc/gpu-burn or below:
gpu_burn-0.4.tar.gz
gpu_burn-0.6.tar.gz (compatible with nvidia-smi and nvcc as of 04-12-2015)
gpu_burn-0.8.tar.gz (includes benchmark, Gflop/s)
gpu_burn-0.9.tar.gz (compute profile 30, compatible w/ CUDA 9)
gpu_burn-1.0.tar.gz (async compare, CPU no longer busy)
gpu_burn-1.1.tar.gz (tensor core support)
To build and burn with floats for an hour: make && ./gpu_burn 3600
If you're running a Tesla, burning with doubles instead stresses the card more (as Rick W kindly pointed out in the comments): make && ./gpu_burn -d 3600
If you have a Turing-architecture card, you might want to benchmark the Tensor cores as well: ./gpu-burn -tc 3600 (Thanks Igor Moura!)
You might have to point the Makefile to your CUDA installation if it's not in the default path, and also to a gcc version your nvcc can work with. The program expects to find nvidia-smi in your default path.
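
Judging from the Makefile snippets quoted in the comments below, the two variables to touch look roughly like this (the paths are illustrative; adjust them to your installation):

# Point this to your CUDA installation
CUDAPATH=/usr/local/cuda
# Have this point to a gcc version old enough for your nvcc
GCCPATH=/usr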

Comments

  21.1.2012

Gahh! You gave me a tar-bomb! Stop that!
- Iesos

  24.1.2012

You didn't literally burn your card, did you? ;-)
- wili

  27.3.2012

Hi,
I was trying to use your tool to stress test one of our older CUDA Systems (Intel(R) Core(TM)2 Q9650, 8 GiB Ram, GTX 285 Cards). When I run the tool I get the following output:
./gpu_burn 1
GPU 0: GeForce GTX 285 (UUID: N/A)
Initialized device 0 with 1023 MB of memory (967 MB available, using 871 MB of it)
Couldn't init a GPU test: Error in "load module": CUDA_ERROR_NO_BINARY_FOR_GPU
100.0%  proc'd: 0   errors: 164232 (WARNING!)   temps: 46 C 
        Summary at:   Tue Mar 27 16:24:16 CEST 2012

100.0%  proc'd: 0   errors: 354700 (WARNING!)   temps: 46 C 
Killing processes.. done

Tested 1 GPUs:
        0: FAULTY

I guess the card is not properly supported, it is at least weird that proc'd is always 0. 
Any hints on that?
- Ulli Brennenstuhl

  27.3.2012

Well... Could have figured that out faster. Had to make a change in the Makefile, as the GTX 285 cards only have compute capability 1.3. (-arch=compute_13)
- Ulli Brennenstuhl

  15.10.2012

Hi, gpu_burn Looks like exactly what I'm looking for! I want to run it remotely (over SSH) on a machine I've just booted off a text-mode-only Live CD (PLD Linux RescueCD 11.2).

What exactly do I have to install on the remote system for it to run? A full-blown X installation? Or is it enough to copy over my NVIDIA driver kernel module file and a few libraries (perhaps to an LD_LIBRARY_PATH'ed dir)? I would like to install as little stuff as possible on that remote machine... Thanks in advance for any hints. 
- durval

  24.10.2012

Hi!
X is definitely not a prerequisite.  I haven't got a clean Linux installation at hand right now so I'm unable to confirm this, but I think you need:

From my application:
gpu_burn (the binary)
compare.ptx (the compiled result comparing kernel)

And then you need the nvidia kernel module loaded, which has to match the running kernel's version:
nvidia.ko

Finally, the gpu_burn binary is linked against these libraries from the CUDA toolkit, which should be found from LD_LIBRARY_PATH in your case:
libcublas.so
libcudart.so (dependency of libcublas.so)

And the CUDA library that is installed by the nvidia kernel driver installer:
libcuda.so

Hope this helps, and sorry for the late reply :-)
- wili

  1.11.2012

How can I solve "Couldn't init a GPU test: Error in "load module": CUDA_ERROR_FILE_NOT_FOUND"?
Although I specified the absolute path, this always shows me this error message. Can you tell me the reason?

Run length not specified in the command line.  Burning for 10 secs
GPU 0: Quadro 400 (UUID: GPU-d2f766b0-8edd-13d6-710c-569d6e138412)
GPU 1: GeForce GTX 690 (UUID: GPU-e6df228c-b7a7-fde5-3e08-d2cd3485aed7)
GPU 2: GeForce GTX 690 (UUID: GPU-b70fd5b0-129f-4b39-f1ff-938cbad4ed26)
Initialized device 0 with 2047 MB of memory (1985 MB available, using 1786 MB of it)
Couldn't init a GPU test: Error in "load module": CUDA_ERROR_FILE_NOT_FOUND

0.0%  proc'd: 4236484 / 0 / 0   errors: 0 / 0 / 0   temps: 63 C / 42 C / 40 C 

...
...
...

100.0%  proc'd: 1765137168 / -830702792 / 1557019224   errors: 0 / 0 / 0   temps: 62 C / 38 C / 36 C 
100.0%  proc'd: 1769373652 / -826466308 / 1561255708   errors: 0 / 0 / 0   temps: 62 C / 38 C / 36 C 
100.0%  proc'd: 1773610136 / -822229824 / 1565492192   errors: 0 / 0 / 0   temps: 62 C / 38 C / 36 C 
100.0%  proc'd: 1777846620 / -817993340 / 1569728676   errors: 0 / 0 / 0   temps: 62 C / 38 C / 36 C 
Killing processes.. done

Tested 3 GPUs:
	0: OK
	1: OK
	2: OK
- Encheon Lim

  2.11.2012

Hi,

Did you try to run the program in the compilation directory?
The program searches for the file "compare.ptx" in the current work directory, and gives that error if it's not found.  The file is generated during "make".
- wili

  6.3.2013

Hi wili,

did You try Your program on K20 cards? I see errors very often, also on GPUs which are ok. I cross-checked with codes developed in our institute and all of our cards run fine, but gpu_burn gives errors (not in each run, but nearly). Do You have an idea?

Regards,
Henrik 
- Henrik

  7.3.2013

Hi Henrik,

Right now I only have access to one K20.  I've just run multiple burns on it (the longest one 6 hours), and have not had a single error (CUDA 5.0.35, driver 310.19).  I know this is not what you would like to hear, but given that I'm using proven-and-tested CUBLAS to perform the calculations, there should be no errors on fully stable cards.

How hot are the cards running?  The K20 that I have access to heats up to exactly 90'C.  One thing I have also noticed is that some cards might work OK on some computers but not on other (otherwise stable) computers.  So IF the errors are real, they might not be the K20s' fault entirely.

Regardless, I've made a small adjustment to the program about how results are being compared (more tolerant with regards to rounding); you might want to test this version out.

Best regards,
- wili

  22.3.2013

Hi wili,

Currently we have lots of GPU issues with "fallen off the bus". It was interesting to run this test. When I set it to 10 min, the job keeps running until it reaches 100%, then hangs at "Killing processes.." even though no errors were counted.

According to /var/log/messages, the GPU fell off the bus again when the test had been running for 5 min.

Here are the details:

->date ; ./gpu_burn 600 ; date
Fri Mar 22 15:15:53 EST 2013
GPU 0: Tesla K10.G1.8GB (UUID: GPU-af90ada7-7ce4-ae5c-bd28-0ef1745a3ad0)
GPU 1: Tesla K10.G1.8GB (UUID: GPU-a8b75d1d-a592-6f88-c781-65174986329c)
Initialized device 0 with 4095 MB of memory (4028 MB available, using 3625 MB of it)
Initialized device 1 with 4095 MB of memory (4028 MB available, using 3625 MB of it)
10.3%  proc'd: 18984 / 18080   errors: 0 / 0   temps: 87 C / 66 C
        Summary at:   Fri Mar 22 15:16:56 EST 2013

20.3%  proc'd: 32544 / 37064   errors: 0 / 0   temps: 95 C / 76 C
        Summary at:   Fri Mar 22 15:17:56 EST 2013

30.7%  proc'd: 39776 / 56048   errors: 0 / 0   temps: 96 C / 81 C
        Summary at:   Fri Mar 22 15:18:58 EST 2013

40.8%  proc'd: 44296 / 74128   errors: 0 / 0   temps: 97 C / 85 C
        Summary at:   Fri Mar 22 15:19:59 EST 2013

51.7%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
        Summary at:   Fri Mar 22 15:21:04 EST 2013

62.0%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
        Summary at:   Fri Mar 22 15:22:06 EST 2013

72.3%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
        Summary at:   Fri Mar 22 15:23:08 EST 2013

82.5%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
        Summary at:   Fri Mar 22 15:24:09 EST 2013

92.8%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
        Summary at:   Fri Mar 22 15:25:11 EST 2013

100.0%  proc'd: 47912 / 91304   errors: 0 / 0   temps: 98 C / 88 C
Killing processes..


 ->grep fallen /var/log/messages

Mar 22 15:20:53 sstar105 kernel: NVRM: GPU at 0000:06:00.0 has fallen off the bus.
Mar 22 15:20:53 sstar105 kernel: NVRM: GPU at 0000:06:00.0 has fallen off the bus.
Mar 22 15:20:53 sstar105 kernel: NVRM: GPU at 0000:05:00.0 has fallen off the bus.

What does that mean? Does the GPU stop functioning because of the high temperatures? 

Please advise. Thanks.

- runny

  22.3.2013

Hi runny,

Yeah this is very likely the case.
It looks like the cards stop crunching data halfway through the test, when the first one hits 98 C.  This is a high temperature and Teslas should shut down automatically at roughly 100 C.  Also, I've never seen "has fallen off the bus" being caused by other issues than overheating (have seen it due to overheating twice).

Best regards,
- wili

  25.3.2013

Hi Wili,

Thanks for your reply.

I changed the GPU clock and ran your test for 10 min, and it passed this time. The temperature reached 97 C.

The node may survive a GPU crash but it will sacrifice performance. Do you know if there is a way to prevent this from happening from the programmer's side?

Many thanks,
- runny

  16.5.2013


Hello, 
based on the following output, can I say that my graphics card has problems?
Thanks

##############################

alechand@pcsantos2:~/Downloads/GPU_BURN$ ./gpu_burn 3600
GPU 0: GeForce GTX 680 (UUID: GPU-b242223a-b6ca-bd7f-3afc-162cba21e710)
Initialized device 0 with 2047 MB of memory (1808 MB available, using 1628 MB of it)
Failure during compute: Error in "Read faultyelemdata": CUDA_ERROR_LAUNCH_FAILED
10.0%  proc'd: 11468360   errors: 22936720 (WARNING!)   temps: 32 C 
	Summary at:   Thu May 16 09:23:06 BRT 2013

20.1%  proc'd: 22604124   errors: 22271528 (WARNING!)   temps: 32 C 
	Summary at:   Thu May 16 09:29:07 BRT 2013

30.1%  proc'd: 33668668   errors: 22129088 (WARNING!)   temps: 32 C 
	Summary at:   Thu May 16 09:35:08 BRT 2013

40.1%  proc'd: 44763812   errors: 22190288 (WARNING!)   temps: 32 C 
	Summary at:   Thu May 16 09:41:09 BRT 2013

50.1%  proc'd: 55869696   errors: 22211768 (WARNING!)   temps: 32 C 
	Summary at:   Thu May 16 09:47:10 BRT 2013

60.2%  proc'd: 67029916   errors: 22320440 (WARNING!)   temps: 31 C 
	Summary at:   Thu May 16 09:53:11 BRT 2013

70.2%  proc'd: 78271124   errors: 22482416 (WARNING!)   temps: 31 C 
	Summary at:   Thu May 16 09:59:12 BRT 2013

80.2%  proc'd: 89538144   errors: 22534040 (WARNING!)   temps: 31 C 
	Summary at:   Thu May 16 10:05:13 BRT 2013

90.2%  proc'd: 100684312   errors: 22292336 (WARNING!)   temps: 31 C 
	Summary at:   Thu May 16 10:11:14 BRT 2013

100.0%  proc'd: 111385148   errors: 21401672 (WARNING!)   temps: 31 C 
Killing processes.. done

Tested 1 GPUs:
	0: FAULTY
alechand@pcsantos2:~/Downloads/GPU_BURN$

######################################
- Alechand

  22.5.2013

@Alechand: I have the same problem.
In my case changing USEMEM to

#define USEMEM 0.75625

let my GPU burn ;)

The error also occurs at "checkError(cuMemcpyDtoH(&faultyElems, d_faultyElemData, sizeof(int)), "Read faultyelemdata");", but I don't think that a simple (only 1 int value!) cudaMemcpy makes the GPU crash. After crashing (e.g. USEMEM 0.8) the allocated memory consumes 0% (per nvidia-smi).
I also inserted a sleep(10) between "our->compute();" and "our->compare();". During this sleep I could see that even in the "USEMEM 0.9" case the memory is successfully allocated (running nvidia-smi in another shell).

Are there any ideas on how to fix this in a more elegant way?
Thanks in advance for any hints!
- Chris

  23.5.2013

Hi,

Thanks for noting this bug.  Unfortunately I'm unable to reproduce the bug which makes it difficult for me to fix.  Could either of you guys give me the nvidia driver and CUDA versions and the card model you're experiencing this problem with and I could try to fix me up a similar system?

Thanks
- wili

  23.5.2013

Hi,
I'm using NVIDIA-SMI 4.310.23 and Cuda Driver Version: 310.23 (libcuda.so.310.23). The other libraries have version 5.0.43 (libcublas.so.5.0.43, etc. all from CUDA 5.0 SDK). GeForce GTX 680 
Another problem I had is discussed here:
http://troylee2008.blogspot.de/2012/05/cudagetdevicecount-returned-38.html
Since I use the NVIDIA GPU only for CUDA I have to manually create these files and maybe my GPU is in a deep idle power state which leads to the "Read faultyelemdata"-Problem. I also tried to get the current power state using nvidia-smi. -> without success; i only get N/A.

In some of the samples NVIDIA performs a 'warmup'.
For example: stereoDisparity.cu
    // First run the warmup kernel (which we'll use to get the GPU in the correct max power state
    stereoDisparityKernel<<<numBlocks, numThreads>>>(d_img0, d_img1, d_odata, w, h, minDisp, maxDisp); 

It also turned out that after a reboot in some test cases gpu_burn crashes (USEMEM 0.7). But in 95% of all 'gpu_burn after reboot' cases everything went fine. 
I think there is a problem with my setup and not with your code.

At the moment I use 4 executables with different USEMEM arguments to allocate the whole memory and avoid any problems:
    ./gpu_burn 3600 0.25 #1st ~500MB 
    ./gpu_burn 3600 0.33 #2nd ~500MB
    ./gpu_burn 3600 0.5 #3rd ~500MB
    ./gpu_burn 3600 0.9 #4th ~500MB

Thank you very much for the code! At the moment it does what it should (under the discussed assumptions).
- Chris

  24.5.2013

Hi Chris,

Wow, you have quite a new CUDA SDK.  The latest production version is 5.0.35 and even in the developer center they don't offer 5.0.43, so I will not be able to get my hands on the same version.

Power state "N/A" with GTX 680 is quite normal, nothing to get alarmed about.  Also, even if the card was in a deep idle state (Teslas can go dormant such that they take dozens of seconds to wake up), this error should not occur. 

Also, the warmup in SDK samples is only for getting reliable benchmarks timings, not for avoiding issues like these.  Given that the cuMemcpyDtoH is in fact right after the "compare" kernel launch and the returned error code is not something that cuMemcpyDtoH is allowed to return, I think this is likely a bug in the compare kernel (the error code is delayed until the next call) and therefore in my code.
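
As a hypothetical illustration of that kind of deferred reporting (a runtime-API sketch, not the actual gpu_burn driver-API code): an execution failure in an asynchronously launched kernel typically surfaces only at the next, otherwise innocent CUDA call.

// Hypothetical demonstration, not gpu_burn code: an asynchronous kernel failure
// is usually reported by the next CUDA call, not by the launch itself.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void badKernel(int *p) { *p = 42; }   // dereferences an invalid pointer at execution time

int main() {
    int *valid = 0;
    cudaMalloc(&valid, sizeof(int));

    badKernel<<<1, 1>>>((int *)0);               // the launch is only queued here
    printf("right after launch: %s\n",           // usually still prints "no error"
           cudaGetErrorString(cudaGetLastError()));

    int host = 0;                                // the kernel's failure surfaces at this later call
    cudaError_t err = cudaMemcpy(&host, valid, sizeof(int), cudaMemcpyDeviceToHost);
    printf("at the next call:   %s\n", cudaGetErrorString(err));

    cudaFree(valid);
    return 0;
}

Adding an explicit synchronization right after the launch (cuCtxSynchronize in the driver API) would make such a failure show up at the launch site instead.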

I have now made certain changes to the program and would appreciate if you took a shot with this version and reported back if the problem persists.  I'm still unable to replicate the failure conditions myself so I'm working blind here :-)

Best regards,
- wili

  27.6.2013

Thanks for this helpful utility.

I was using it to test a K20 but found the power consumption displayed by nvidia-smi during testing was only ~110/225W.

I modified the test to do double precision instead of single and my power number went up to ~175W.

Here is my patch:  ftp://ftp.microway.com/for-customer/gpu-burn-single-to-double.diff
- Rick W

  28.6.2013

Hi Rick!

This is a very good observation!  I found the difference in K20 power draw to be smaller than what you reported but, still, doubles burn the card harder.  I have now added support for doubles (by specifying "-d" as a command line parameter).  I implemented it a bit differently to your patch, by using templates.

Best regards,
- wili

  23.7.2013

Thank you, gpu_burn was very helpful. We use it for stress testing our Supermicro GPU servers.
- Dmitry

  12.3.2014

Very useful utility. Thank you!

Would be even better if initCuda() obeyed CUDA_VISIBLE_DEVICES, as we could use this utility to simulate more complex multiuser workloads. But from what I can tell it automatically runs on all GPUs in the system regardless of this setting.
- pulsewidth

  18.3.2014

Can you please give instructions on how to install and use gpu_burn?  We build many Tesla K10, K20 and C2075 systems but have no way of stress testing for errors and stability. Also, we are usually Windows based while our customer is CentOS based.  Thank you for any help.
- sokhac@chassisplans.com

  18.3.2014

Hey, there seems to be an overflow occurring in your proc'd and errors: variables.
I am getting negative values and huge sudden changes.
Testing it on a titan with Cuda 5.5 and driver 331.38

For example:

 ./gpu_burn 300
GPU 0: GeForce GTX TITAN (UUID: GPU-6f344d7d-5f0e-8974-047e-bfcf4a559f14)
Initialized device 0 with 6143 MB of memory (5082 MB available, using 4573 MB of it), using FLOATS
11.0%  proc'd: 11410   errors: 0   temps: 71 C 
	Summary at:   Tue Mar 18 14:06:17 EDT 2014

21.7%  proc'd: 14833   errors: 0   temps: 70 C 
	Summary at:   Tue Mar 18 14:06:49 EDT 2014

33.3%  proc'd: 18256   errors: 0   temps: 69 C 
	Summary at:   Tue Mar 18 14:07:24 EDT 2014

44.0%  proc'd: 20538   errors: 0   temps: 72 C 
	Summary at:   Tue Mar 18 14:07:56 EDT 2014

55.0%  proc'd: 21679   errors: 0   temps: 76 C 
	Summary at:   Tue Mar 18 14:08:29 EDT 2014

66.7%  proc'd: 23961   errors: 0   temps: 78 C 
	Summary at:   Tue Mar 18 14:09:04 EDT 2014

77.7%  proc'd: 26243   errors: 0   temps: 78 C 
	Summary at:   Tue Mar 18 14:09:37 EDT 2014

88.3%  proc'd: 27384   errors: 0   temps: 78 C 
	Summary at:   Tue Mar 18 14:10:09 EDT 2014

98.7%  proc'd: 4034576   errors: -1478908168  (WARNING!)  temps: 73 C 
	Summary at:   Tue Mar 18 14:10:40 EDT 2014

100.0%  proc'd: 35754376   errors: -938894720  (WARNING!)  temps: 72 C  
Killing processes.. done

Tested 1 GPUs:
	GPU 0: FAULTY


- smth chntla

  18.3.2014

looking at my dmesg output, it looks like gpu_burn is segfaulting:
[3788076.522693] gpu_burn[10677]: segfault at 7fff85734ff8 ip 00007f27e09796be sp 00007fff85735000 error 6 in libcuda.so.331.38[7f27e0715000+b6b000]
[3789172.403516] gpu_burn[11794]: segfault at 7fff407f1ff0 ip 00007fefd9c366b8 sp 00007fff407f1ff0 error 6 in libcuda.so.331.38[7fefd99d2000+b6b000]
[3789569.269295] gpu_burn[12303]: segfault at 7fff04de5ff8 ip 00007f8842346538 sp 00007fff04de5fe0 error 6 in libcuda.so.331.38[7f88420e2000+b6b000]
[3789984.624949] gpu_burn[12659]: segfault at 7fff5814dff8 ip 00007f7ed89656be sp 00007fff5814e000 error 6 in libcuda.so.331.38[7f7ed8701000+b6b000]
- smth chntla

  20.3.2014

Hi, thank you for your useful tool!
I use your program for testing GPGPUs!
I have a question!
In the test results, I see "proc'd" but I don't know what it means.
Can you explain it? 
- alsub2

  1.7.2014

Hi, 
I have some observations to share regarding the stability of the output of this multi-GPU stress test. I have two systems:
1. System 1 has a Tesla K20c. Launching this test for 10 or 30 sec the result is OK, while launching it for 60 sec the test shows errors (the temperature stays the same in both cases).  On this same system, running cudamemtester (http://sourceforge.net/projects/cudagpumemtest/) registers no errors. Trying other benchmarks like the ones in the SHOC package, everything is OK, indicating that the GPU card is defect-free.


2. System 2 has a Quadro 4000 that I know for sure to be faulty (different tests like cudamemtester and amber show that it is faulty), yet running the multi-GPU stress test for any duration (even days), everything seems to be OK!!
How could you explain that??

Moreover, when directing the output of this multi-GPU stress test to a file for later analysis, I notice that whenever there are errors, the output file quickly becomes huge (about 800 MB for a 60 sec run!!)

can anyone explain what is going on here please??

many thanks
- Wissam

  1.10.2014

Any idea what could be the problem?? The system is from Penguin Computing, GPUs are on their Altus 2850GTi, Dual AMD Opteron 6320, CUDA 6.5.14, gcc 4.4.7, CentOS6.  Thanks -- Joe

/local/jgillen/gpuburn> uname -a
Linux n20 2.6.32-279.22.1.el6.637g0002.x86_64 #1 SMP Fri Feb 15 19:03:25 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

/local/jgillen/gpuburn> ./gpu_burn 100
GPU 0: Tesla K20m (UUID: GPU-360ca00b-275c-16ba-76e6-ed0b7f9690c2)
GPU 1: Tesla K20m (UUID: GPU-cf91aacf-f174-1b30-b110-99c6f4e2a5cd)
Initialized device 0 with 4799 MB of memory (4704 MB available, using 4234 MB of it), using FLOATS
Initialized device 1 with 4799 MB of memory (4704 MB available, using 4234 MB of it), using FLOATS
Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
1.0%  proc'd: -1 / 0   errors: -281  (DIED!)/ 0   temps: -- / -- Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
1.0%  proc'd: -1 / -1   errors: -284  (DIED!)/ -1  (DIED!)  temps: -- / -- 

No clients are alive!  Aborting
/local/jgillen/gpuburn> ./gpu_burn -d 100
GPU 0: Tesla K20m (UUID: GPU-360ca00b-275c-16ba-76e6-ed0b7f9690c2)
GPU 1: Tesla K20m (UUID: GPU-cf91aacf-f174-1b30-b110-99c6f4e2a5cd)
Initialized device 0 with 4799 MB of memory (4704 MB available, using 4234 MB of it), using DOUBLES
Initialized device 1 with 4799 MB of memory (4704 MB available, using 4234 MB of it), using DOUBLES
Failure during compute: Error in "Read faultyelemdata": 
2.0%  proc'd: 0 / -1   errors: 0 / -1  (DIED!)  temps: -- / -- Failure during compute: Error in "Read faultyelemdata": 
2.0%  proc'd: -1 / -1   errors: -1  (DIED!)/ -1  (DIED!)  temps: -- / -- 

No clients are alive!  Aborting
/local/jgillen/gpuburn> 
- Joe

  2.10.2014

Difficult to say from this.  Do CUDA sample programs (especially ones that use CUBLAS) work?
- wili

  11.1.2015

For me it segfaults.

build with g++ 4.9 and cuda 6.5.14
linux 3.17.6
nvidia (beta) 346.22

[1126594.592699] gpu_burn[2038]: segfault at 7fff6e126000 ip 000000000040219a sp 00007fff6e120d80 error 4 in gpu_burn[400000+f000]
[1126615.669239] gpu_burn[2083]: segfault at 7fff80faf000 ip 000000000040216a sp 00007fff80fa94c0 error 4 in gpu_burn[400000+f000]
[1126644.488424] gpu_burn[2119]: segfault at 7fffa3840000 ip 000000000040216a sp 00007fffa383ad40 error 4 in gpu_burn[400000+f000]
[1127041.656267] gpu_burn[2219]: segfault at 7fff981de000 ip 00000000004021f0 sp 00007fff981d92e0 error 4 in gpu_burn[400000+e000]
- sl1pkn07

  4.5.2015

Great tool. Helped me identify some bizarre behavior on a GPU. Thanks!
- naveed

  4.12.2015

Hi: Great and useful job!!!

can you guide me to understand how to pilot GPUs .. in other words how to drive a process on a specific GPU!
Or is it the driver that does that by itself!?!?!

MANY THANKS for your help and compliments for your work!


CIAOOOOOO
Piero
- Piero

  4.12.2015

Hi Piero!

It is the application's responsibility to query the available devices and pick one from the enumeration.  I'm not aware of a way to force a specific GPU via e.g. an environment variable.

BR,
- wili

  17.2.2016

Hi,
I just tested this tool on one machine with two  K20x (RHEL 6.6)  and it is working like a charm
Thank you for doing a great job 
Best regards from Germany
Mike
- Mike

  18.2.2016

Doesn't work on my system:
$ ./gpu_burn 3600
GPU 0: GeForce GTX 980 (UUID: GPU-23b4ffc9-5548-310e-0a67-e07e0d2a83ff)
GPU 1: Tesla M2090 (UUID: GPU-96f2dc19-daa5-f147-7fa6-0bc86a1a1dd2)
terminate called after throwing an instance of 'std::string'
0.0%  proc'd: -1   errors: 0  (DIED!)  temps: -- 

No clients are alive!  Aborting
$ uname -a
Linux node00 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27
$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.2.1511 (Core) 
Release:	7.2.1511
Codename:	Core
- Richard

  20.6.2016

Great nifty little thing! Many thanks!
P.S.1 (I have modified the makefile slightly for PASCAL)
P.S.2 (Could you please spoil us by implementing more tests, and telling the performance and numerical accuracy of the card? I would be very curious to see how different SM architectures affect performance)   
- obm

  20.6.2016

Hi obm!
Glad that you found the tool useful.  I don't have a Pascal card around yet to see whether compute versions need to get changed.  The point about printing out performance figures is a good one.  I can't print out flops values since I don't know the internal implementation of CUBLAS and as it also might change over time, but I could certainly print out multiplication operations per second.  I'll add that when I have the time!  Numerical accuracy should be the same for all cards using the same CUBLAS library, since all cards nowadays conform to the IEEE floating point standard. 
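
(For a rough conversion, assuming the textbook operation count of about 2*N^3 flops per N-by-N matrix multiplication: one 2048x2048 multiply is roughly 2 * 2048^3 ≈ 17.2 Gflop, so multiplications per second times ~17.2 gives an approximate Gflop/s figure of the kind the newer versions print.)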
- wili

  26.7.2016

Very nice tool, just used it to check my cuda setup on a quad titan x compute node.
- Steve

  3.8.2016

Wili,
Thank you for a very useful tool. I have been using gpu_burn since you first put it online. I have used your test to verify that a 16-node, 96-GPU cluster could run without throttling for thermal protection, and I drove several brands of workstations to brown-out conditions with 3x K20 and K40 GPUs with verified inadequate current on the P/S 12V rails. Thank You!
- Mike

  16.9.2016

Thanks for writing this, I found it useful to check a used GPU I bought off eBay.

I had to change the Makefile architecture to compute_20 since CUDA 7.0+ doesn't support compute_13 any more. I also needed to update the Temperature matching.

Have you considered hosting this on github instead of your blog? People would be able to more easily make contributions.
- Alex

  16.9.2016

Hi Alex!

That's a very good suggestion, I'll look into it once I have the time :-)
Also please note that the newer version:
"gpu_burn-0.6.tar.gz (compatible with nvidia-smi and nvcc as of 04-12-2015)"
already uses compute_20 and has the temperature parsing matched with the newer nvidia-smi.  The older 0.4 version (which I assume you used) is only there for compatibility with older setups.
- wili

  27.9.2016

great utility!
I confirm this is working on Ubuntu 16.04 w/ CUDA 8 beta.  Just had to change Makefile (CUDAPATH=/usr/local/cuda-8.0) and comment out error line in /usr/local/cuda-8.0/include/host_config.h (because GCC was too new):

#if __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 3)
//#error -- unsupported GNU version! gcc versions later than 5.3 are not supported!
#endif /* __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 1) */

GPU I tested was TitanX (Maxwell).

Lastly, I agree with comment(s) above about turning this into a small benchmark to compare different systems' GPU performance (iterations per second)
- Jason

  23.3.2017

Works great with various GPU cards, but not with the new GTX 1080 Ti. Parsing of nvidia-smi might be the main issue there.
- Leo

  19.5.2017

Any idea on this?

./gpu_burn 3600
GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-a27460fa-1802-cff6-f0b7-f4f1fe935a67)
GPU 1: Tesla P100-SXM2-16GB (UUID: GPU-4ab21a43-10cb-3df1-8a6e-c8c27ff1bf9f)
GPU 2: Tesla P100-SXM2-16GB (UUID: GPU-daabb1f4-529e-08d6-3c58-25ca54b7dbbe)
GPU 3: Tesla P100-SXM2-16GB (UUID: GPU-f378575d-7a45-7e81-9460-c5358f2a7957)
Initialized device 0 with 16276 MB of memory (15963 MB available, using 14366 MB of it), using FLOATS
Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
0.0%  proc'd: -1 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -5277  (DIED!)- 0 - 0 - 0   temps: 35 C - 36 C - 38 C - 37 C Initialized device 1 with 16276 MB of memory (15963 MB available, using 14366 MB of it), using FLOATS
0.0%  proc'd: -1 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -5489  (DIED!)- 0 - 0 - 0   temps: 35 C - 36 C - 38 C - 37 C Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
0.1%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -6667  (DIED!)- -919  (DIED!)- 0 - 0   temps: 35 C - 36 C - 38 C - 37 C Initialized device 3 with 16276 MB of memory (15963 MB available, using 14366 MB of it), using FLOATS
0.1%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -6928  (DIED!)- -1180  (DIED!)- 0 - 0   temps: 35 C - 36 C - 38 C - 37 C Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
0.1%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s) - 0 (0 Gflop/s) - -1 (0 Gflop/s)   errors: -6935  (DIED!)- -1187  (DIED!)- 0 - -1  (DIED!)  temps: 35 C - 36 C - 38 C - 37 C Initialized device 2 with 16276 MB of memory (15963 MB available, using 14366 MB of it), using FLOATS
0.1%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s) - 0 (0 Gflop/s) - -1 (0 Gflop/s)   errors: -7121  (DIED!)- -1373  (DIED!)- 0 - -1  (DIED!)  temps: 35 C - 36 C - 38 C - 37 C Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
0.1%  proc'd: -1 (0 Gflop/s) - -1 (0 Gflop/s) - -1 (0 Gflop/s) - -1 (0 Gflop/s)   errors: -7123  (DIED!)- -1375  (DIED!)- -1  (DIED!)- -1  (DIED!)  temps: 35 C - 36 C - 38 C - 37 C 

No clients are alive!  Aborting
- Lexasoft

  19.5.2017

Hi Lexasoft,
Looks bad, haven't seen this yet.  I'll give it a go next week with updated tools and a pair of 1080Tis and see if this comes up.  (I don't have access to P100s myself (congrats BTW ;-))
BR
- wili

  29.5.2017

Version 0.7 seems to work fine with the default CUDA (7.5.18) that comes in Ubuntu 16.04.1 LTS on dual 1080Tis:

GPU 0: Graphics Device (UUID: GPU-f2a70d44-7e37-fb35-91a3-09f49eb8be76)
GPU 1: Graphics Device (UUID: GPU-62859d61-d08d-8769-e506-ee302442b0f0)
Initialized device 0 with 11169 MB of memory (10876 MB available, using 9789 MB of it), using FLOATS
Initialized device 1 with 11172 MB of memory (11003 MB available, using 9903 MB of it), using FLOATS
10.6%  proc'd: 6699 (6324 Gflop/s) - 6160 (6411 Gflop/s)   errors: 0 - 0   temps: 63 C - 54 C 
	Summary at:   Mon May 29 13:26:34 EEST 2017

21.7%  proc'd: 14007 (6208 Gflop/s) - 13552 (6317 Gflop/s)   errors: 0 - 0   temps: 75 C - 67 C 
	Summary at:   Mon May 29 13:26:54 EEST 2017

32.2%  proc'd: 20706 (6223 Gflop/s) - 20328 (6238 Gflop/s)   errors: 0 - 0   temps: 82 C - 75 C 
	Summary at:   Mon May 29 13:27:13 EEST 2017

42.8%  proc'd: 27405 (5879 Gflop/s) - 27720 (6222 Gflop/s)   errors: 0 - 0   temps: 85 C - 80 C 
	Summary at:   Mon May 29 13:27:32 EEST 2017

53.3%  proc'd: 34104 (5877 Gflop/s) - 34496 (6179 Gflop/s)   errors: 0 - 0   temps: 87 C - 82 C 
	Summary at:   Mon May 29 13:27:51 EEST 2017

63.9%  proc'd: 40803 (5995 Gflop/s) - 41272 (6173 Gflop/s)   errors: 0 - 0   temps: 86 C - 83 C 
	Summary at:   Mon May 29 13:28:10 EEST 2017

75.0%  proc'd: 46893 (5989 Gflop/s) - 48664 (6092 Gflop/s)   errors: 0 - 0   temps: 87 C - 84 C 
	Summary at:   Mon May 29 13:28:30 EEST 2017

85.6%  proc'd: 53592 (5784 Gflop/s) - 55440 (6080 Gflop/s)   errors: 0 - 0   temps: 86 C - 83 C 
	Summary at:   Mon May 29 13:28:49 EEST 2017

96.1%  proc'd: 60291 (5969 Gflop/s) - 62216 (5912 Gflop/s)   errors: 0 - 0   temps: 87 C - 84 C 
	Summary at:   Mon May 29 13:29:08 EEST 2017

100.0%  proc'd: 63336 (5961 Gflop/s) - 64680 (5805 Gflop/s)   errors: 0 - 0   temps: 86 C - 84 C 
Killing processes.. done

Tested 2 GPUs:
	GPU 0: OK
	GPU 1: OK
- wili

  1.6.2017

I'd just like to thank you for such a useful application. I'm testing a little big monster with 10 GTX 1080 Ti cards and it helped a lot.

The complete system setup is:

Supermicro 4028GR-TR2
2x Intel  E5-2620V4
128 GB of memory
8x nvidia GTX 1080 Ti
1 SSD for the O.S.
CentOS 7.3 x64

I installed the latest CUDA development kit "cuda_8.0.61_375.26" but the GPU cards were not recognized, so I upgraded the driver to the latest 381.22. After that it works like a charm.

I'll have access to this server a little longer, so if you want me to test something on it, just contact me.
- ibarco

  11.7.2017

This tool is awesome and works as advertised!  We are using it to stress test our Supermicro GPU servers (GeForce GTX 1080 for now).
- theonewolf

  23.8.2017

Hi,
first, thanks for this awesome tool !

But I may need some help to be sure I understand what I get.

I run gpu_burn and get:
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-d164c8c4-1bdb-a4ab-a9fe-3dffdd8ec75f)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-6e1a32bb-5291-4611-4d0c-78b5ff2766ce)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-c5d10fad-f861-7cc0-b37b-7cfbd9bceb67)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-df9c5c01-4fbc-3fb2-9f41-9c6900bb78c8)
Initialized device 0 with 16276 MB of memory (15945 MB available, using 14350 MB of it), using FLOATS
Initialized device 2 with 16276 MB of memory (15945 MB available, using 14350 MB of it), using FLOATS
Initialized device 3 with 16276 MB of memory (15945 MB available, using 14350 MB of it), using FLOATS
Initialized device 1 with 16276 MB of memory (15945 MB available, using 14350 MB of it), using FLOATS
40.0%  proc'd: 894 (3984 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 38 C - 45 C - 51 C - 38 C
        Summary at:   Wed Aug 23 11:08:01 EDT 2017

60.0%  proc'd: 1788 (7998 Gflop/s) - 894 (3736 Gflop/s) - 894 (3750 Gflop/s) - 894 (3740 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 44 C - 53 C - 59 C - 45 C
        Summary at:   Wed Aug 23 11:08:03 EDT 2017

80.0%  proc'd: 2682 (8014 Gflop/s) - 1788 (8015 Gflop/s) - 1788 (8016 Gflop/s) - 1788 (8016 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 44 C - 53 C - 59 C - 45 C
        Summary at:   Wed Aug 23 11:08:05 EDT 2017

100.0%  proc'd: 3576 (8019 Gflop/s) - 2682 (8015 Gflop/s) - 3576 (8015 Gflop/s) - 2682 (8015 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 44 C - 53 C - 59 C - 45 C
        Summary at:   Wed Aug 23 11:08:07 EDT 2017

100.0%  proc'd: 4470 (8018 Gflop/s) - 3576 (8015 Gflop/s) - 3576 (8015 Gflop/s) - 3576 (8016 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 47 C - 56 C - 62 C - 48 C
Killing processes.. done

Tested 4 GPUs:
        GPU 0: OK
        GPU 1: OK
        GPU 2: OK
        GPU 3: OK

More precisely, regarding this information:
proc'd: 3576 (8019 Gflop/s) - 2682 (8015 Gflop/s) - 3576 (8015 Gflop/s) - 2682 (8015 Gflop/s)   errors: 0 - 0 - 0 - 0   temps: 44 C - 53 C - 59 C - 45 C

Is this information listed in the same order as the GPU list ?

(proc'd: 3576 (8019 Gflop/s), 44 C => information for GPU 0 ?
proc'd: 2682 (8015 Gflop/s), 53 C => information for GPU 1 ?
proc'd: 3576 (8015 Gflop/s), 59 C => information for GPU 2 ?
proc'd: 2682 (8015 Gflop/s), 45 C => information for GPU 3 ?)

Thank you!
Ced
- Ced

  23.8.2017

Hi Ced!

Yes, the processed/gflops, temps, and GPU lists should all have the same order.  If you think they don't match, I can double check the code.
- wili

  23.8.2017

Hi wili! 
Thank you 
As the devices were not initialized in the same order as the GPU list, I was not sure.
Otherwise, I don't have any indication that the order could mismatch.

- Ced

  23.8.2017

Hi Ced,

Yes the initialization is multi-threaded and is reported in non-deterministic order.  Otherwise the order should match.  Also note that some devices take longer to init and start crunching data, that's why you may see some devices starting to report their progress late.  
- wili

  30.8.2017

Hi Wili,

Indeed, every test I performed confirmed that the order matches perfectly.
Thank you again.
- Ced

  28.9.2017

Hi Wili,

it seems that CUDA 9.0 updated the minimum supported virtual architecture:

nvcc fatal   : Value 'compute_20' is not defined for option 'gpu-architecture'

Replacing that with compute_30 is enough.

Best,

Alfredo
- Alfredo

  28.9.2017

Thanks Alfredo, updated to version 0.9.
- wili

  13.11.2017

Hi Wili,

Is it possible to run your tool on CPU only so as to compare CPU results to GPU results ?

Best regards

 
- Ced

  13.11.2017

Hi Ced,

It runs on CUDA, and as such only on NVidia GPUs.  Until NVidia releases CPU drivers for CUDA (don't hold your breath), I'm afraid the answer is no.

BR,
- wili

  13.11.2017

Ok, maybe I am asking too much :)
Thank you anyway.

Best regards
- Ced

  16.11.2017

Thank you for this tool. Still super handy
- CapnScabby

  23.11.2017

wili, why not distribute your tool through some more accessible platform, e.g. github?

This seems to be the best torture test for CUDA GPUs, so people are surely interested. I myself have used it a few times, but it's just too much (mental) effort to check on a website whether there's an update instead of just doing a "git remote update; git tag" to see if there's a new release!

Look, microway already took the code and uploaded it to github, so why not do it yourself, have people fork your repo and take (more of the) credit while making it easier to access? ;)

Cheers,
Szilard
- pszilard

  25.11.2017

Hi Pszilard,

You can now find it at GitHub, https://github.com/wilicc/gpu-burn

I've known that it needs to go up somewhere easier to access for a while now, especially after there seemed to be outside interest.  (It's just that I like to host all my hobby projects on my own servers.)
I guess I have to budge here this time; thanks for giving the nudge.

BR,
- wili

  4.12.2017

hi, here is the message with quadro 6000 :
GPU 0: Quadro 6000 (UUID: GPU-cfb253b7-0520-7498-dee2-c75060d9ed25)
Initialized device 0 with 5296 MB of memory (4980 MB available, using 4482 MB of it), using FLOATS
Couldn't init a GPU test: Error in "load module": 
10.8%  proc'd: -807182460 (87002046464 Gflop/s)   errors: 0   temps: 71 C 1 C  
	Summary at:   lundi 4 décembre 2017, 16:01:29 (UTC+0100)

21.7%  proc'd: -1334328390 (74880032768 Gflop/s)   errors: 0   temps: 70 C C C 
	Summary at:   lundi 4 décembre 2017, 16:01:42 (UTC+0100)

23.3%  proc'd: 1110985114 (599734133719040 Gflop/s)   errors: 0   temps: 70 C ^C

my configuration works with K5000
- mulot

  6.12.2017

Hi Mulot,

"compare.ptx" has to be found from your working directory.  Between these runs, you probably invoked gpu-burn with different working directories.

(PS.  A more verbose "file not found" print added to github)

BR,
- wili

  23.1.2018

Hi,

I hit the "Error in "load module":" error as well, in my case it was actually an unhandled CUDA error code 218, which says CUDA_ERROR_INVALID_PTX. For me this was caused by misconfigured nvcc path (where nvcc was from CUDA 7.5, but rest of the system used 9.1), once I was correctly using nvcc from cuda 9.1, this started working.

Thanks,
- av

  24.1.2018

Hi,

I've tried your tool with the Titan V and it seems to run fine, but when I monitor the power draw with nvidia-smi it never reaches 250 W, which is the card's spec. The problem doesn't occur when I use an older generation card like the GTX 1070 Ti with the same driver version 387.34. Does the GeForce Linux driver have a problem with the Titan V, or does gpu_burn need modifications for Volta GPUs? Many thanks!
- Natsu

  31.1.2018

This is the best tool for maximizing GPU power consumption! I tried V100 PCIE x8 and it rocks!!! 

Thank you!!! 
- dhenzjhen

  15.3.2018

Hi,

Great tool! I tried it with a 1080 Ti and a Titan Xp; maybe a missing feature is the choice of which GPU we want to stress. However, is it normal to obtain temperature values that stay constant at 84/85 on both GPUs while they are stressed?
- piovrasca

  15.3.2018

Hi,

A good feature suggestion!  Maybe I’ll add that at some point.  Temperatures sound about right, they are nowadays regulated after all.
- wili

  3.4.2018

Hello,
I had a problem where a multi-GPU server was hard-crashing under a load using CUDA for neural networks training.
Geeks3d's GPUtest wasn't able to reproduce the problem.
Then I've stumbled upon your blog. Your tool singled out the faulty GPU in 5 seconds!
You, sir, have spared me a lot of headache. I cannot thank you enough!
- Kurnass

  17.4.2018

How about making the output stream better suited for log files (running it with SLURM)?
It seems to use back ticks extensively; thus the file grows to a couple of hundred MiB in 30 sec.
- qnib

  22.5.2018

I tried your tool on a Jetson TX2 Tegra board, and on Tegra there is no nvidia-smi package available, so temps are not available. The problem I have is that I get a segmentation fault! Is that because of the missing nvidia-smi, or should your stress test work on Tegra?
- gernot

  22.5.2018

Hi gernot,

I’ve never tried running this on a Tegra, so it might or might not work. It should print a better error message, granted, but as I don’t have a Tegra for testing you’re unfortunately on your own :-)
- wili

  4.6.2018

I can't seem to get this to build; I keep getting:
make
PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:.:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl nvcc -I/usr/local/cuda/include -arch=compute_30 -ptx compare.cu -o compare.ptx
g++ -O3 -Wno-unused-result -I/usr/local/cuda/include -c gpu_burn-drv.cpp
gpu_burn-drv.cpp:49:10: fatal error: cuda.h: No such file or directory
 #include <cuda.h>
          ^~~~~~~~
compilation terminated.
make: *** [Makefile:11: drv] Error 1
cuda.h can be found in include/linux. I have tried updating the path in the Makefile but it doesn't work, it still says it can't find it. This is on the latest Arch Linux (system just updated and rebooted before this). Any ideas? This is exactly what I am looking for here at work, it will speed up my GPU testing significantly, so hopefully I can get it working.
- kitsuna

  5.6.2018

Spun up an Ubuntu system and it worked right away, so I'll just use this on the GPU test system. Great program, now I can stress 5 cards at once!
- kitsuna

  11.6.2018

gernot,

Same problem here.

I figured out that the Tegra OS does not have nvidia-smi.

I got the program working by commenting out everything regarding nvidia-smi or temperature reading. (You will have to find another tool if you want to read the temperatures.)
- c

  26.6.2018

Hi

Anybody got this working on Ubuntu 18.04 and Cuda 9.2?
- Siroberts

  28.7.2018

Hi, I am trying the test suite with two cards: one Gigabyte 1070 Gamer together with a GB 1070 Ti under Ubuntu 16.04. The system usually crashes within several minutes; the power consumption of the 1070 goes above 180 W and the system reboots. 

When I test only one card (modifying the code), it runs without any trouble (the 1070 using ~180 W, the 1070 Ti less). Removing one card, the other runs without any pain. So what I can see is that the two cards cannot work together, but they work smoothly alone. (BTW, when I block the additional process testing further cards, the code still gives the temperature of the other card.) 

Anyway, this is what I can see. Is there anybody who could help me solve this issue? I guess the FE is overclocked to keep pace with the Ti. (The cards would be used for simple FP computations and rarely to visualize the results on remote Linux clients.) 
- nt
- Nantucket

  13.8.2018

Any ideas about these errors when running make on Fedora 28 and CUDA V9.1.85 ?

/usr/include/c++/8/type_traits(1049): error: type name is not allowed

/usr/include/c++/8/type_traits(1049): error: type name is not allowed

/usr/include/c++/8/type_traits(1049): error: identifier "__is_assignable" is undefined

3 errors detected in the compilation of "/tmp/tmpxft_00001256_00000000-6_compare.cpp1.ii".
- ss

  14.8.2018

SS,
Your GCC is not working correctly.  I can see it's including GCC version 8 headers, but CUDA (even 9.2) doesn't support GCC versions newer than 7.  Maybe you've somehow managed to use an older GCC version with newer GCC headers.
- wili

  15.8.2018

wili, thanks for spotting that! Yes, I have GCC version 8 installed by default with Fedora 28, but CUDA from negativo (https://negativo17.org/nvidia-driver/) installs "cuda-gcc" and "cuda-gcc-c++" version 6.4.0. 

I got gpu-burn working by modifying the Makefile to use:

1) the correct CUDAPATH  and header directories
2) cuda-g++
3) the -ccbin flag for nvcc

I'll include it below in case it helps others.

==========

CUDAPATH=/usr/include/cuda

# Have this point to an old enough gcc (for nvcc)
GCCPATH=/usr

HOST_COMPILER=cuda-g++

NVCC=nvcc
CCPATH=${GCCPATH}/bin

drv:
	PATH=${PATH}:.:${CCPATH}:${PATH} ${NVCC} -I${CUDAPATH} -ccbin=${HOST_COMPILER} -arch=compute_30 -ptx compare.cu -o compare.ptx
	${HOST_COMPILER} -O3 -Wno-unused-result -I${CUDAPATH} -c gpu_burn-drv.cpp
	${HOST_COMPILER} -o gpu_burn gpu_burn-drv.o -O3 -lcuda -L${CUDAPATH} -L${CUDAPATH} -Wl,-rpath=${CUDAPATH} -Wl,-rpath=${CUDAPATH} -lcublas -lcudart -o gpu_burn
- ss

  18.8.2018

Is "gpu-burn -d" supposed to print incremental status updates like "gpu-burn" does? 

For DOUBLES, I'm not seeing those, and after running for several (10?) seconds, it freezes all graphical parts of my system until eventually terminating in the usual way with "Killing processes.. done Tested 1 GPUs: 	GPU 0: OK"
- ss

  18.8.2018

Hi SS,
Which GPU are you testing? Does it support double precision arithmetics?
- wili

  18.8.2018

It's a GT1030 (with GDDR5) and at least according to wikipedia it does support single, double, and half precision.
- ss

  26.11.2018

I am running gpu_burn ver 0.9 on a GTX 670 and this works perfectly. When I try to test a GTX 580 it fails, I think due to the compute capability of the card only being 2.0. I tried to edit the Makefile from (-arch=compute_30) to (-arch=compute_20) and recompile, but this fails to compile.
Any ideas on how to get it to support the older card would be much appreciated.
- id23raw

  27.11.2018

id23raw,

Fermi architecture was deprecated in CUDA 8 and dropped from CUDA 9, so the blame is on NVidia.  Your only choice is to downgrade to an older CUDA and then use compute_20.

BR,
- wili

  27.11.2018

Thank you for the prompt reply to my question Wili.

- id23raw

  12.2.2019

My question is: can it save LOG information?
- ship

  12.2.2019

Hi ship,
The only way available to you right now is to run ./gpu_burn 120 1>> LOG_FILE 2>> LOG_FILE
- wili

  25.2.2019

Thanks for the code

  4.3.2019

Is there any chance this works with the new 2080ti's, I need to stress test 4 2080ti's simultaneously.
- JO

  4.3.2019

Hi JO,
I haven't had the chance to test this myself, but according to external reports it works fine on Turing cards as well.  So give it a spin and let me know how it works.  One observation worth noting is that the "heat up" process has increased in length compared to previous generations (as has been the case with many Teslas for example), so do run it with longer durations.  (Not e.g. with just 10 secs.)
BR,
- wili

  7.3.2019

I got the following error, ideas? 
./gpu
Run length not specified in the command line.  Burning for 10 secs
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-2f4ea2c2-8119-cda0-63a0-36dfca404b53)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-fc39fba9-e82e-3fbb-81e7-d455355ecdd1)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-0463eba2-0217-fbcc-1b86-02f4bf9b3f34)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-1f424260-52fb-f7ee-3887-e91ece7b7438)
GPU 4: Tesla V100-SXM2-16GB (UUID: GPU-449c96b9-d7b4-1ed9-7d9f-e479a9b56100)
GPU 5: Tesla V100-SXM2-16GB (UUID: GPU-f73428a8-e2ed-8854-cc90-cab903c695d0)
GPU 6: Tesla V100-SXM2-16GB (UUID: GPU-bd886dab-500d-f38f-f93c-c0e53bbc6a4d)
GPU 7: Tesla V100-SXM2-16GB (UUID: GPU-07e97e68-877d-03a9-ee5a-e85d315130bc)
Couldn't init a GPU test: Error in "init": CUBLAS_STATUS_NOT_INITIALIZED
- dave

  20.3.2019

ran ./gpu_burn 60 with my two 1080's and computer crashed within two seconds... thanks!
- 1080 guy

  12.4.2019

Version 0.9 worked perfectly for me (CUDA 9.1 on two Tesla K20Xm's and one GT 710).
- wkernkamp

  7.5.2019

My gpu was under utilized (showed 30-50% utilization) and after running your code it shows >95%  utilization. Your code has got the awakening power for sure.  Thanks.
- sedawk

  18.5.2019

I have 1 GPU (Tesla V100), CUDA 10.1, Debian 9. My GPU runs hot (90+) when I run for over 30 sec. I don't want to burn up my GPU; is there a way I can throttle it down a bit, or is this normal?
- jdm

  26.6.2019

Thank you Will for the test. Tried it on 4 2080 Tis with CUDA 10. Works fine. In single precision 1 GPU has 5 errors and the rest are OK, but in double precision all GPUs are OK. Any idea?
- andy

  21.7.2019

GPU 0: GeForce 940MX (UUID: GPU-bb6cb5d0-ca68-c456-1588-9c0bcb409409)
Initialized device 0 with 4046 MB of memory (3611 MB available, using 3250 MB of it), using FLOATS
Couldn't init a GPU test: Error in "load module": 
11.7%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 50 C 
	Summary at:   Dom Jul 21 11:51:59 -03 2019

23.3%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 52 C 
	Summary at:   Dom Jul 21 11:52:06 -03 2019

35.0%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 55 C 
	Summary at:   Dom Jul 21 11:52:13 -03 2019

46.7%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 56 C 
	Summary at:   Dom Jul 21 11:52:20 -03 2019

58.3%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 57 C 
	Summary at:   Dom Jul 21 11:52:27 -03 2019

68.3%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 58 C 
	Summary at:   Dom Jul 21 11:52:33 -03 2019

80.0%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 58 C 
	Summary at:   Dom Jul 21 11:52:40 -03 2019

91.7%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 57 C 
	Summary at:   Dom Jul 21 11:52:47 -03 2019

100.0%  proc'd: 0 (0 Gflop/s)   errors: 0   temps: 56 C 
Killing processes.. done

Tested 1 GPUs:
	GPU 0: OK

I didn't understand the output; I'm not sure if the test was done successfully.
- dfvneto

  24.7.2019

gpu-burn doesn't compile.

Make produces lots of deprecation errors.

Using Titan RTX, CUDA 10.1, and nvidia driver 418.67.
- airyimbin

  30.8.2019

Odd one I'm seeing at the moment:

GPU 0: GeForce TRX 2070 (UUID: GPU-9d0c1beb-ed60-283c-d7ac88404a8b)
Initialized device 0 with 7981 MB of memory (7577 MB available, using 6819MB of it), using FLOATS
Failure during compute: Error in "SGEMM": CUBLAS_STATUS_EXECUTION_FAILED
10.0% proc'd: -1 (0 Gflop/s) errors: -1 (DIED!) temps: 47 C

No clients are alive! Aborting

This is with nvidia driver 430, libcublas9.1.85

I've tested the same motherboard with multiple 2070's which all show the same issue yet it works fine with a 2080ti in that system. The same 2070's then work on other (typically) older systems. At a bit of a loss as to where to go from here - any advice appreciated
- gardron

  5.10.2019

Just wanted to take a moment and thank you for this awesome tool.

I had an unreliable graphics card that was rock stable on almost EVERY workload (gaming, benchmarks, etc), except some CUDA workloads (especially DNN training). The program calling CUDA would sometimes die with completely random errors (INVALID_VALUE, INVALID_ARGUMENT, etc.).

I couldn't figure out what the issue was or how to reproduce, until I tested your tool and it consistently failed! I sent the card back for warranty, tested the replacement card and BOOM! It's been rock solid for the past 1.5 years.

So, thanks so much!
- Mehran

  6.2.2020

The commit 73d13cbba6cc959b3214343cb3cc83ec5354a9d2 is the right way.

However, this change does not make any difference in the binary. It seems CUDA just uses free for cuMemFreeHost.

Driver Version: 410.79       CUDA Version: 10.0
- frank

  3.3.2020

Hi 

I have to make a V100 card work at a specific load, not full load. 

For example, the V100's full load is 250 W; are there ./gpu_burn commands to make the V100 work at 125 W, 150 W ... 200 W etc.?

 Thank you : )
- Binhao

  2.4.2020

Hi Wili

Hello, may I ask if there is any command that can make ./gpu_burn print the running log more slowly?
- Focus

  2.4.2020

Hi Focus,

Not right now unless you go and modify the code (increment 10.0f on gpu_burn-drv.cpp:525).  I'll make a note that I'll make this adjustable in the future (not sure when I have the time).  Thanks for the suggestion!
- wili

  7.6.2020

Hello and thanks for the program.

When I see:
100.0%  proc'd: 33418 (4752 Gflop/s) - 32184 (4596 Gflop/s)   errors: 9183 (WARNING!) - 0   temps: 74 C - 68 C 
What does the "4596 Gflop/s" mean? Is it the Gflops for one GPU or for all the GPUs running?
- Rozo

  7.6.2020

Hi Rozo,

The first GPU churns out 4752 Gflop/s and the second 4596 Gflop/s.  The order is listed at the beginning of the run.
- wili

  26.6.2020

Hello and many thanks for this program.

I have always had a need to test NVIDIA GPUs because I either buy them on eBay or I buy systems that have them installed, and I need to be absolutely sure they are 100% functional. I had used the OCCT program in Windows, but that only allowed me to test one at a time for 1 hour. Also, OCCT only allows 2 GB of GPU memory to be tested, whereas GPU-Burn uses 90% of available memory, so it is much more thorough.

Now I am testing up to five at a time.

Thanks
- Steve

  26.6.2020

Hello again,

I am reporting a bug: one of the five GPUs I am testing has its thread stop and hang at 1,000 iterations. The other four GPUs keep iterating up to final counts of 32,250, 32,250, 32,250 and 32,200.

All five of the GPUs are Quadro K600s.

The program then tries to exit at 100% testing done but hangs on the GPU that was hung. The only way to exit the program is with a <CTRL>C.

I also want to point out that the error count for all five GPUs shows 0, even for the one that hung.

If you have a chance to fix this, a solution may be to detect a hung GPU thread, stop it, continue testing the remaining GPUs, and then state in the final report that it is faulty.

Thanks
- Steve

  26.6.2020

Hi Steve,

Happy to hear you were able to test 5 GPUs at once.  Totally agree with you, a hung GPU (thread) should be detected and reported accordingly.  That case didn't receive much testing from me because the faulty GPUs I've had kept on running but just produced incorrect results.
I'll add this on the TODO list.

Thanks!
- wili
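One possible shape for that TODO item, sketched as a generic POSIX watchdog (hypothetical names, not gpu_burn's actual code): the parent remembers when each client process last reported progress and kills any client that stays silent too long, so the run can finish and the final report can mark that GPU as faulty.

// hung_client_watchdog.cpp: compile-only sketch of a watchdog over forked client processes.
#include <cstdio>
#include <ctime>
#include <signal.h>
#include <sys/types.h>
#include <vector>

struct Client {
    pid_t pid;
    std::time_t lastUpdate;  // refreshed whenever the child reports new results
    bool hung = false;
};

// Call periodically from the parent's reporting loop.
void checkForHungClients(std::vector<Client> &clients, int timeoutSeconds) {
    std::time_t now = std::time(nullptr);
    for (Client &c : clients) {
        if (!c.hung && now - c.lastUpdate > timeoutSeconds) {
            c.hung = true;
            std::printf("Client %d made no progress for %d s, killing it\n",
                        (int)c.pid, timeoutSeconds);
            kill(c.pid, SIGKILL);  // the final summary can then report this GPU as FAULTY
        }
    }
}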

  6.7.2020

Hello and thanks for the program !

I am running tests on two GPUs: a Quadro RTX 4000 and a Tesla K80.
The burn runs fine on both, but with nvidia-smi I can see almost maximal power consumption on the K80 (290/300 W), while the RTX 4000 only draws 48/125 W.

Is there a way to increase the power consumption during the burn on the RTX 4000?

Thanks

Thomas
- Thomas

  9.7.2020

Hi Wili,

Hope you can help me out with the below problem :

test1@test1-pc:~$ nvidia-smi
Thu Jul 9 10:32:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    11W / 151W |    222MiB /  8116MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1006      G   /usr/lib/xorg/Xorg                 81MiB |
|    0   N/A  N/A      1150      G   /usr/bin/gnome-shell              124MiB |
|    0   N/A  N/A      3387      G   ...mviewer/tv_bin/TeamViewer       11MiB |
+-----------------------------------------------------------------------------+

test1@test1-pc:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

test1@test1-pc:~$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright © 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

test1@test1-pc:~$ cd gpu-burn

test1@test1-pc:~/gpu-burn$ make
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:.:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin /usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -arch=compute_30 -ptx compare.cu -o compare.ptx
nvcc fatal   : Value 'compute_30' is not defined for option 'gpu-architecture'
Makefile:9: recipe for target 'drv' failed
make: *** [drv] Error 1

As I am new to Ubuntu, could you point me to what I am missing that causes this error?

Thanks for your time.
- robbocop

  9.9.2020

Hi there, 

It is a great tool for GPU stress testing. Can I ask whether it is compatible with the latest CUDA, like 10.x?

Best regards 

Jason
- Jason

  12.10.2020

Another person trying to use this on a TX2 here; I had the same issue as the other person.

A fix that works is to comment out the function call and the line before it that tries to call the function for checking the temps.  There are other ways to monitor temp on the TX2, and the code works with that removed.
- LovesTha
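For anyone else on a TX2: besides removing the nvidia-smi-based temperature polling, the standard Linux thermal sysfs can serve as a replacement temperature source. A minimal sketch (the zone that corresponds to the GPU is board-specific; on Jetson boards it is typically named something like "GPU-therm", which is an assumption here, not gpu_burn code):

// thermal_zones.cpp: print all thermal zone names and temperatures from sysfs.
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    for (int zone = 0; zone < 32; ++zone) {
        std::string base = "/sys/class/thermal/thermal_zone" + std::to_string(zone);
        std::ifstream typeFile(base + "/type");
        std::ifstream tempFile(base + "/temp");
        if (!typeFile || !tempFile) break;  // no more zones

        std::string type;
        long milliDegC = 0;
        typeFile >> type;
        tempFile >> milliDegC;              // value is reported in millidegrees Celsius
        std::printf("%-16s %ld C\n", type.c_str(), milliDegC / 1000);
    }
    return 0;
}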

  13.11.2020

Hi Wili

I tried to run gpu-burn in the following environment but it failed with the error messages below:

1) OS: SLES 15 SP2
2) CUDA 11.1 installed and gpu-burn compiled
3) NVIDIA-Linux-x86_64-450.80.02

# gpu-burn 
GPU 0: Quadro K5200 (UUID: GPU-684e262c-85a7-d822-c3a4-bb41918db340)
GPU 1: Tesla K40c (UUID: GPU-bb529e3e-aa59-3f8f-c6e6-ee3b1cba7bf5)
Initialized device 0 with 11441 MB of memory (11324 MB available, using 10191 MB of it), using DOUBLES
Initialized device 1 with 7616 MB of memory (7510 MB available, using 6759 MB of it), using DOUBLES

0.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: 1605296569  (WARNING!)- 0   temps: 40 C - 42 C 
0.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -2136859607  (WARNING!)- 0   temps: 40 C - 42 C 
0.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -1584048487  (WARNING!)- 0   temps: 40 C - 42 C 
...100.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -1767488304  (WARNING!)- -1767488304  (WARNING!)  temps: 41 C - 43 C 
100.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -1214677184  (WARNING!)- -1214677184  (WARNING!)  temps: 41 C - 43 C 
100.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -661866064  (WARNING!)- -661866064  (WARNING!)  temps: 41 C - 43 C 
100.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: -109054944  (WARNING!)- -109054944  (WARNING!)  temps: 41 C - 43 C 
100.0%  proc'd: 0 (0 Gflop/s) - 0 (0 Gflop/s)   errors: 443756176  (WARNING!)- 443756176  (WARNING!)  temps: 41 C - 43 C 
Killing processes.. done

Tested 2 GPUs:
	GPU 0: FAULTY
	GPU 1: FAULTY

I don't believe both GPUs are faulty... Any compatibility issue with the environment?

- William

  28.1.2021

Hi, is there an option for sparse tensor cores?
- Radim Kořínek

  1.2.2021

fae@intel:~/gpu_burn$ sudo make
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:.:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin /usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -arch=compute_30 -ptx compare.cu -o compare.ptx
nvcc fatal   : Value 'compute_30' is not defined for option 'gpu-architecture'
Makefile:10: recipe for target 'drv' failed
make: *** [drv] Error 1

This symptom is caused by nvcc's --gpu-architecture (-arch) option no longer supporting compute_30 in CUDA v11.1, so we should edit the Makefile. The drv rule currently reads:


CUDAPATH=/usr/local/cuda

# Have this point to an old enough gcc (for nvcc)
GCCPATH=/usr

NVCC=${CUDAPATH}/bin/nvcc
CCPATH=${GCCPATH}/bin

drv:
    PATH=${PATH}:.:${CCPATH}:${PATH} ${NVCC} -I${CUDAPATH}/include -arch=compute_30 -ptx compare.cu -o compare.ptx
    g++ -O3 -Wno-unused-result -I${CUDAPATH}/include -c gpu_burn-drv.cpp
    g++ -o gpu_burn gpu_burn-drv.o -O3 -lcuda -L${CUDAPATH}/lib64 -L${CUDAPATH}/lib -Wl,-rpath=${CUDAPATH}/lib64 -Wl,-rpath=${CUDAPATH}/lib -lcublas -lcudart -o gpu_burn


Change -arch=compute_30 to -arch=compute_80, or consult the Virtual Architecture Feature List for the value that matches your GPU.
- Andy Yang

  21.3.2023

How can I modify the runtime?
- Jin Ping Pig

  19.4.2023

-arch=compute_80 for the NVIDIA A30.
- scottyb

  22.6.2023

Hello,
While trying to run gpu-burn on K8s, I spun up multiple gpu-burn pods by enabling NVIDIA's time-slicing feature. Out of 8 gpu-burn pods, 6 ran completely fine and returned the expected throughput. However, 2 pods return FAULTY GPU with the error: "Couldn't init a GPU test: Error in "C alloc": CUDA_ERROR_INVALID_VALUE". Any idea how I should proceed with this?
- ckuduvalli

  23.1.2024

Ahhh? This isn't even as good as a mining program, guys.
- 大笨蛋

  9.9.2024

16.8%  proc'd: 60072747 (113040 Gflop/s) - 61195827 (116874 Gflop/s) - 60717181 (114812 Gflop/s) - 60298700 (114959 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59667636 (112078 Gflop/s) - 58898861 (111560 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60072747 (113040 Gflop/s) - 61195827 (116874 Gflop/s) - 60717181 (114812 Gflop/s) - 60298700 (114959 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59667636 (112078 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60072747 (113040 Gflop/s) - 61195827 (116874 Gflop/s) - 60717181 (114812 Gflop/s) - 60298700 (114959 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60072747 (113040 Gflop/s) - 61197164 (116745 Gflop/s) - 60717181 (114812 Gflop/s) - 60298700 (114959 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60717181 (114812 Gflop/s) - 60298700 (114959 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60717181 (114812 Gflop/s) - 60300037 (114361 Gflop/s) - 60377583 (114886 Gflop/s) - 60861577 (115925 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60717181 (114812 Gflop/s) - 60300037 (114361 Gflop/s) - 60377583 (114886 Gflop/s) - 60862914 (115543 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60718518 (114107 Gflop/s) - 60300037 (114361 Gflop/s) - 60377583 (114886 Gflop/s) - 60862914 (115543 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60718518 (114107 Gflop/s) - 60300037 (114361 Gflop/s) - 60378920 (114074 Gflop/s) - 60862914 (115543 Gflop/s) - 59668973 (114397 Gflop/s) - 58900198 (111332 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60718518 (114107 Gflop/s) - 60300037 (114361 Gflop/s) - 60378920 (114074 Gflop/s) - 60862914 (115543 Gflop/s) - 59668973 (114397 Gflop/s) - 58901535 (110352 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 
16.8%  proc'd: 60074084 (115018 Gflop/s) - 61197164 (116745 Gflop/s) - 60718518 (114107 Gflop/s) - 60300037 (114361 Gflop/s) - 60378920 (114074 Gflop/s) - 60862914 (115543 Gflop/s) - 59670310 (112828 Gflop/s) - 58901535 (110352 Gflop/s)   errors: 0 - 0 - 0 - 0 - 167742  (WARNING!)- 0 - 0 - 0   temps: 77 C - 75 C - 77 C - 70 C - 76 C - 73 C - 85 C - 69 C 

The factory test confirmed that the fifth GPU card was faulty; the problem was resolved after replacing it.
- wade





