Table of contents

8.1.2010 The year I started blogging (blogware)
9.1.2010 Linux initramfs with iSCSI and bonding support for PXE booting
9.1.2010 Using manually tweaked PTX assembly in your CUDA 2 program
9.1.2010 OpenCL autoconf m4 macro
9.1.2010 Mandelbrot with MPI
10.1.2010 Using dynamic libraries for modular client threads
11.1.2010 Creating an OpenGL 3 context with GLX
11.1.2010 Creating a double buffered X window with the DBE X extension
12.1.2010 A simple random file read benchmark
14.12.2011 Change local passwords via RoundCube safer
5.1.2012 Multi-GPU CUDA stress test
6.1.2012 CUDA (Driver API) + nvcc autoconf macro
29.5.2012 CUDA (or OpenGL) video capture in Linux
31.7.2012 GPGPU abstraction framework (CUDA/OpenCL + OpenGL)
7.8.2012 OpenGL (4.3) compute shader example
10.12.2012 GPGPU face-off: K20 vs 7970 vs GTX680 vs M2050 vs GTX580
4.8.2013 DAViCal with Windows Phone 8 GDR2
5.5.2015 Sample pattern generator


Using manually tweaked PTX assembly in your CUDA 2 program

So you want to optimize or rewrite the PTX code that the CUDA 2.x compiler produced for you? Well, you should; not only is PTX virtual assembly fairly easy to write, but CUDA compiler technology is far from mature, and there are surely manual optimizations to be made. Or, if you're feeling heroic, you can just compile empty stub functions and fill in the actual PTX code yourself.

In this entry I'll show you how to export PTX code from your CUDA program, and how to compile it back once you've edited it to your liking. Even though this is rather straightforward, I had to mess around for hours before I figured out how to do it. Or maybe I'm just a poor googler.

First, let's say your CUDA code is called mycode.cu and your program binary will be mycode. Create a devcode structure like this:

PROFILE=13 # Which CUDA compute profile to use
mkdir mycode.devcode
nvcc -v -arch=compute_$PROFILE -code=sm_$PROFILE -ext=all \
    --export-dir=mycode.devcode -c mycode.cu -o mycode.o.ptx
rm mycode.devcode/*/sm_$PROFILE
ln -s ../../mycode.cubin `echo mycode.devcode/*`/sm_$PROFILE
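
The last two commands above swap the device code object for a relative symbolic link, and it is easy to get the ../../ part wrong. Here is a minimal sketch that reproduces the layout in a scratch directory, without nvcc, to show when the link resolves. The directory name demo_entry is a made-up stand-in for whatever nvcc creates under mycode.devcode, and link_state is a hypothetical helper, not part of any CUDA tool.

```shell
#!/bin/sh
# Sketch only: imitates the devcode layout in a scratch directory, no nvcc needed.
# "demo_entry" is a made-up stand-in for the directory nvcc creates.

# Print "ok" if the path is a symlink whose target exists, "dangling" if it is
# a symlink to a missing target, and "not-a-link" otherwise.
link_state() {
    if [ ! -L "$1" ]; then
        echo "not-a-link"
    elif [ -e "$1" ]; then
        echo "ok"
    else
        echo "dangling"
    fi
}

scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir -p mycode.devcode/demo_entry
ln -s ../../mycode.cubin mycode.devcode/demo_entry/sm_13

before=$(link_state mycode.devcode/demo_entry/sm_13)   # no cubin exists yet
: > mycode.cubin                                       # empty stand-in for the real cubin
after=$(link_state mycode.devcode/demo_entry/sm_13)

echo "before: $before, after: $after"
```

The link target is resolved relative to the directory holding the link, so ../../ climbs from mycode.devcode/demo_entry back to the directory where mycode.cubin will be written.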

Now you have a structure that the CUDA runtime library can use when it runs the final program binary, except that instead of the actual device code object there is a symbolic link pointing to mycode.cubin in your current directory. Note that you don't have to do this again every time you compile a new version of your code. Next you can compile your existing C for CUDA code (or a framework) into an editable ptx file like this:

nvcc --ptx -v -arch=compute_$PROFILE -code=sm_$PROFILE mycode.cu

You are now free to edit mycode.ptx and work your magic. When you're done, compile it into the device binary file mycode.cubin (which makes the previously created symbolic link valid):

nvcc -v -arch=compute_$PROFILE -code=sm_$PROFILE --cubin mycode.ptx

Now you can continue building your program as usual. During linking, just include the object file mycode.o.ptx (and link against the CUDA runtime lib).
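
Since only the PTX-to-cubin step has to be repeated after each edit, it can be wrapped in a small helper. This is a minimal sketch assuming the file names used in this entry; rebuild_cubin and the NVCC and PROFILE override variables are hypothetical names of mine, not part of the CUDA toolkit, and the -nt test is a common shell extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch only: rebuild the cubin when the edited PTX file has changed.
# NVCC and PROFILE can be overridden from the environment (hypothetical knobs).
NVCC=${NVCC:-nvcc}
PROFILE=${PROFILE:-13}

# Recompile the PTX into a cubin if the cubin is missing or older than the PTX.
# Prints "rebuilt" or "up-to-date"; returns non-zero if something goes wrong.
rebuild_cubin() {
    ptx=${1:-mycode.ptx}
    cubin=${ptx%.ptx}.cubin
    if [ ! -f "$ptx" ]; then
        echo "missing $ptx" >&2
        return 2
    fi
    if [ ! -e "$cubin" ] || [ "$ptx" -nt "$cubin" ]; then
        "$NVCC" -arch=compute_$PROFILE -code=sm_$PROFILE --cubin \
            -o "$cubin" "$ptx" || return 1
        echo "rebuilt"
    else
        echo "up-to-date"
    fi
}
```

Calling rebuild_cubin mycode.ptx after each editing session keeps mycode.cubin, and therefore the symbolic link in the devcode structure, in step with the PTX file.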



Hello Ville,
Can you elaborate on the steps after generating the cubin file? How do I link the object file?
Thanks in advance
- Jay


Hello Jay,
Sorry for not replying earlier; I don't yet have a notification system in the blog software so I didn't see this until today :-)
The file you should link in your binary is the mycode.o.ptx file, which is a normal object (.o) file.  The device binaries (the cubin file) are loaded at runtime by this object.
So in the simplest case, you might compile and link your program like this:  g++ -o program main.cpp mycode.o.ptx -lcudart

Please note, however, that CUDA 2 is obsolete.  Later CUDA versions support much cleaner ways to include customized PTX in your kernels.  Also, this example is for the runtime API.  The driver API allows you to explicitly upload a modified .ptx file at runtime.

Best regards
- wili


It seems that after modifying the PTX file, it is not used any further. Only the .cu file is used for generating the cubin. So even if we link with mycode.o.ptx, which was generated before modifying the PTX, how can it make a difference?
- Ginu


Hi Ginu!
It's been a while since I've used this approach, so my recollections are a bit vague.  But re-reading the post, the idea is that the program binary containing the mycode.o.ptx object will search for the "mycode.cubin" file at runtime from your filesystem, and the actual kernel code is not built into the program binary.  After you have modified the PTX and run the last command (nvcc ... --cubin), your "mycode.cubin" will have been replaced by one built from the edited PTX, which is what the program will load when you run it.
It might be that CUDA has changed in this regard and this approach doesn't work with current CUDAs--I'm not sure, I haven't tried. :-)
Hope this helps,
- wili
