9.1.2010 | Using manually tweaked PTX assembly in your CUDA 2 program
So you want to optimize or rewrite the PTX code CUDA 2.x compiler produced for you? Well, you should; not only is PTX virtual assembly fairly easy to write, but CUDA compiler technology is far from mature and there are surely manual optimizations to be made. Or you can just compile empty stub functions and fill in the actual PTX code yourself if you're feeling heroic.
In this entry I'll show you how to export PTX code from your CUDA program, and how to compile it back once you've edited it to your liking. Even though this is rather straightforward, I had to mess around for hours before I figured it out. Or maybe I'm just a poor googler.
First, let's say your CUDA code is called mycode.cu and your program binary will be mycode. Create a devcode structure like this:
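The command itself is missing from this copy of the post, so here is a sketch of what it likely looked like. It relies on the old CUDA 2.x nvcc "device code repository" options (-ext/-int select which device code goes external vs. embedded, -dir names the repository directory); the exact layout inside devcode/ depends on your code and toolchain, so inspect it before symlinking:

```shell
# Compile the host-side object, exporting all device code into ./devcode
# instead of embedding it into the object file:
nvcc -ext=all -int=none -dir=devcode -c mycode.cu -o mycode.o.ptx

# Replace the exported device code file with a symlink to a mycode.cubin
# in the current directory (path inside devcode/ varies; locate it first):
exported=$(find devcode -type f | head -n 1)
ln -sf "$PWD/mycode.cubin" "$exported"
```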
Now you have a structure that the CUDA runtime library can use when it runs the final program binary, except that instead of the actual device code object there is a symbolic link pointing to mycode.cubin in your current directory. Note that you don't have to do this again every time you compile a new version of your code. Next you can compile your existing C for CUDA code (or a framework) into an editable ptx file like this:
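The original command is likewise missing here; with nvcc it would be along these lines (the --ptx option is standard nvcc and emits human-readable PTX instead of a binary):

```shell
# Translate the C for CUDA source into editable PTX assembly:
nvcc --ptx mycode.cu -o mycode.ptx
```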
You are free to edit mycode.ptx now and work your magic in. When you're done, you should compile it as the device binary file mycode.cubin (validating the previously created link):
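Again the command is lost from this copy. Judging by the comment thread below, the post's original command may have been of the form "nvcc ... --cubin mycode.cu", which would recompile the .cu and discard the PTX edits; to actually keep the edits, one plausible form assembles the edited .ptx file itself:

```shell
# Assemble the hand-edited PTX into the device binary the symlink points at:
nvcc --cubin mycode.ptx -o mycode.cubin
# (equivalently: ptxas mycode.ptx -o mycode.cubin)
```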
Now you can continue building your program as usual. During linking, just include the object file mycode.o.ptx (and link against the CUDA runtime library).
Hello Ville, Can you elaborate on the steps after generating the cubin file? How do I link the object file? Thanks in advance- Jay
Hello Jay, Sorry for not replying earlier; I don't yet have a notification system in the blog software, so I didn't see this until today :-) The file you should link into your binary is the mycode.o.ptx file, which is a normal object (.o) file. The device binaries (the cubin file) are loaded at runtime by this object. So in the simplest case, you might compile and link your program like this: g++ -lcudart -o program main.cpp mycode.o.ptx Please note, however, that CUDA 2 is obsolete. Later CUDAs support much cleaner ways to include customized PTX in your kernels. Also, this example is for the runtime API. The driver API allows you to explicitly upload a modified .ptx file at runtime. Best regards- wili
It seems that after modifying the PTX file, it is not used further. Only the .cu file is used for generating the cubin. So even if we link with mycode.o.ptx, which was generated before modifying the PTX, how can it make a difference?- Ginu
Hi Ginu! It's been a while since I've used this approach, so my recollections are a bit vague. But re-reading the post, the idea is that the program binary containing the mycode.o.ptx object will search for the "mycode.cubin" file at runtime from your filesystem; the actual kernel code is not built into the program binary. After you have modified the PTX and run the last command (nvcc ... --cubin mycode.cu), your "mycode.cubin" will have been replaced by the edited PTX, which is what the program will load when you run it. It might be that CUDA has changed in this regard and this approach doesn't work with current CUDAs--I'm not sure, I haven't tried. :-) Hope this helps,- wili