9.1.2010 | Using manually tweaked PTX assembly in your CUDA 2 program
So you want to optimize or rewrite the PTX code CUDA 2.x compiler produced for you? Well, you should; not only is PTX virtual assembly fairly easy to write, but CUDA compiler technology is far from mature and there are surely manual optimizations to be made. Or you can just compile empty stub functions and fill in the actual PTX code yourself if you're feeling heroic.
In this entry I'll show you how to export PTX code from your CUDA program, and how to compile it back once you've edited it to your liking. Even though this is rather straightforward, I had to mess around for hours before I figured it out. Or maybe I'm just a poor googler.
First, let's say your CUDA code is called mycode.cu and your program binary will be mycode. Create a devcode structure like this:
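The command itself is missing from this copy of the post, so here is a sketch of what it likely looked like. It relies on the old CUDA 2.x nvcc "device code repository" options (-ext/-int select which device code goes external vs. embedded, -dir names the repository directory); the exact layout inside devcode/ depends on your code and toolchain, so inspect it before symlinking:

```shell
# Compile the host-side object, exporting all device code into ./devcode
# instead of embedding it into the object file:
nvcc -ext=all -int=none -dir=devcode -c mycode.cu -o mycode.o.ptx

# Replace the exported device code file with a symlink to a mycode.cubin
# in the current directory (path inside devcode/ varies; locate it first):
exported=$(find devcode -type f | head -n 1)
ln -sf "$PWD/mycode.cubin" "$exported"
```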
Now you have a structure that the CUDA runtime library can use when it runs the final program binary, except that instead of the actual device code object there is a symbolic link pointing to mycode.cubin in your current directory. Note that you don't have to do this again every time you compile a new version of your code. Next you can compile your existing C for CUDA code (or a framework) into an editable ptx file like this:
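The original command is likewise missing here; with nvcc it would be along these lines (the --ptx option is standard nvcc and emits human-readable PTX instead of a binary):

```shell
# Translate the C for CUDA source into editable PTX assembly:
nvcc --ptx mycode.cu -o mycode.ptx
```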
You are free to edit mycode.ptx now and work your magic in. When you're done, you should compile it as the device binary file mycode.cubin (validating the previously created link):
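Again the command is lost from this copy. Judging by the comment thread below, the post's original command may have been of the form "nvcc ... --cubin mycode.cu", which would recompile the .cu and discard the PTX edits; to actually keep the edits, one plausible form assembles the edited .ptx file itself:

```shell
# Assemble the hand-edited PTX into the device binary the symlink points at:
nvcc --cubin mycode.ptx -o mycode.cubin
# (equivalently: ptxas mycode.ptx -o mycode.cubin)
```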
Now you can continue building your program as usual. During linking, just include the object file mycode.o.ptx (and link against the CUDA runtime library).
Hello Ville, Can you elaborate on the steps after generating the cubin file? How do I link the object file? Thanks in advance- Jay
Hello Jay, Sorry for not replying earlier; I don't yet have a notification system in the blog software, so I didn't see this until today :-) The file you should link into your binary is the mycode.o.ptx file, which is a normal object (.o) file. The device binaries (the cubin file) are loaded at runtime by this object. So in the simplest case, you might compile and link your program like this: g++ -lcudart -o program main.cpp mycode.o.ptx Please note, however, that CUDA 2 is obsolete. Later CUDAs support much cleaner ways to include customized PTX in your kernels. Also, this example is for the runtime API. The driver API allows you to explicitly upload a modified .ptx file at runtime. Best regards- wili
It seems that after modifying the PTX file, it is not used further. Only the .cu file is used for generating the cubin. So even if we link with mycode.o.ptx, which was generated before modifying the PTX, how can it make a difference?- Ginu
Hi Ginu! It's been a while since I've used this approach, so my recollections are a bit vague. But re-reading the post, the idea is that the program binary containing the mycode.o.ptx object will search for the "mycode.cubin" file at runtime from your filesystem; the actual kernel code is not built into the program binary. After you have modified the PTX and run the last command (nvcc ... --cubin mycode.cu), your "mycode.cubin" will have been replaced by the edited PTX, which is what the program will load when you run it. It might be that CUDA has changed in this regard and this approach doesn't work with current CUDAs--I'm not sure, I haven't tried. :-) Hope this helps,- wili