There are two main ways to use CUDA: the runtime API and the driver API.
I've written programs both ways, but I find the driver API more to my liking. Things are explicit, host and device code are hard to confuse with one another, and it feels like you have more control over how things are done. Not too long ago the driver API also exposed some CUDA features that the runtime API didn't, such as half float textures, but this is no longer the case. In fact, since the runtime API is the de facto standard today, you will have better luck with library support, documentation, tutorials, etc. using the runtime API. First impressions die hard though, I suppose, which is half the reason I keep picking the driver API given the choice.
My favourite way of using the driver API is to procedurally generate the device code at runtime, compile it to PTX with nvcc on the fly, and upload the resulting PTX source. The main reason is that I often have algorithms with complicated configuration switches, and the GPU code can be optimized for specific values of them. Making the GPU kernels dynamic enough to handle all configurations is a waste of clock cycles, and preprocessor directives are simply not powerful enough to handle complex cases of code customization. Besides, preprocessor magic is resolved at build time, so you can't change the configuration at runtime, and that's too static for my taste.
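To make the idea concrete, here is a minimal sketch of the host-side generation step in plain C. Everything in it is an assumption for illustration: the `kernel_config` struct, the `scale` kernel and both helper functions are hypothetical names, not code from my programs. It only builds the specialised source text and the nvcc command line; running the command and loading the result is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical configuration switches, known only at runtime. */
struct kernel_config {
    float gain;        /* baked into the kernel as a literal */
    int clamp_output;  /* 1: emit the clamping line, 0: leave it out */
};

/* Write CUDA C source specialised for cfg into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int generate_kernel_source(char *buf, size_t size,
                           const struct kernel_config *cfg)
{
    assert(buf != NULL && cfg != NULL);
    int n = snprintf(buf, size,
        /* extern "C" keeps the kernel name unmangled in the PTX */
        "extern \"C\" __global__ void scale(float *data, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) {\n"
        "        float v = data[i] * %.6ff;\n"
        "%s"
        "        data[i] = v;\n"
        "    }\n"
        "}\n",
        cfg->gain,
        cfg->clamp_output
            ? "        v = fminf(fmaxf(v, 0.0f), 1.0f);\n"
            : "");
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Build the command that compiles cu_path to PTX at ptx_path.
 * Returns the command length, or -1 if buf is too small. */
int build_nvcc_command(char *buf, size_t size, const char *nvcc,
                       const char *cu_path, const char *ptx_path)
{
    assert(strlen(nvcc) > 0);
    int n = snprintf(buf, size, "%s -ptx -o %s %s", nvcc, ptx_path, cu_path);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

The generated kernel contains no trace of the switch that was turned off, which is the whole point: the GPU never evaluates a configuration branch. After writing the source to a file and running the command, the PTX text can be handed to `cuModuleLoadData` and the kernel looked up by name with `cuModuleGetFunction`.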
Let the above serve as an intro to what I'm looking for in a CUDA autoconf macro:
So I hacked together the following m4 macro. Insert AX_CHECK_CUDA in your configure.ac file, and if CUDA is not installed in the default path, /usr/local/cuda, tell the configure script where to find it with, for example, --with-cuda=/opt/cuda. The rest should be explained in the comments.