Autotuning OpenCL kernels – CLTune on Windows 7

CLTune is a C++ library for automatically tuning OpenCL kernels to extract the maximum speed from your device. I’m going to try building and using it on Windows 7 with MinGW-w64 (GCC 4.9.1) to see what I can achieve. While properly written OpenCL code should work on any conformant device and platform, there’s no guarantee it will be fast. What’s fast on an NVIDIA GTX 560 Ti isn’t going to get maximum speed out of an Intel CPU. A kernel that squeezes the maximum throughput out of an Intel CPU when using the Intel OpenCL runtime probably won’t do so well on the AMD CPU runtime. The problem even exists between different generations of NVIDIA GPUs – each new compute capability requires different tuning.

Just what does “tuning” an OpenCL kernel involve? Let’s start with a pretty basic example:

__kernel void copy(const __global float* in, __global float* out, int N)
{
    for (size_t idx = get_global_id(0); idx < N; idx += get_global_size(0))
        out[idx] = in[idx];
}

On NVIDIA GPUs (e.g. GTX 560 Ti, GTX Titan) this is pretty close to the fastest implementation for arrays from 1MiB to 1GiB. On an AMD Radeon R9 290X, unrolling the loop by a factor of 2 or 4 gives the fastest copy for array sizes from 1MiB to about 128MiB, after which the basic version above is just as fast. Another possibility is to do the copy using float2 or float4 instead, but this is slower because memory accesses aren’t properly coalesced – these GPUs perform best when adjacent threads access adjacent 32-bit floats in the array. This is a really simple toy example, yet differences between NVIDIA and AMD are already visible.
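To make the unrolling concrete, here is a sketch of what an unroll-by-2 variant could look like (my own illustration, not code from either vendor; for simplicity it assumes N is a multiple of twice the global work size, so a real kernel would need a clean-up pass for the remainder):

__kernel void copy_unroll2(const __global float* in, __global float* out, int N)
{
    const size_t stride = get_global_size(0);
    // Each iteration copies two elements one grid-stride apart, so adjacent
    // threads still touch adjacent floats and the accesses stay coalesced.
    for (size_t idx = get_global_id(0); idx < N; idx += 2 * stride)
    {
        out[idx]          = in[idx];
        out[idx + stride] = in[idx + stride];
    }
}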

Downloading and building CLTune

You can grab the latest version of CLTune using

git clone https://github.com/CNugteren/CLTune.git

You’ll also need CMake version 2.8.10 or later; I’ve used version 3.4.1. The most important configuration settings are the OpenCL include folder and library path. I have used the AMD APP SDK, even though I actually own an NVIDIA GPU. The AMD APP SDK provides OpenCL 1.2 headers and library files, notably libOpenCL.a to link against. This is great because it works properly when building with MinGW-w64, whereas OpenCL.lib files don’t always work with MinGW-w64 and generating a .a import library from OpenCL.dll isn’t easy either.

This all works properly with NVIDIA GPUs, so long as no OpenCL 1.2 functions get called (if one is called, the NVIDIA implementation will do something unpredictable since the function doesn’t exist for it; this isn’t a problem if you’re using recent drivers, though). I also tried building CLTune against the NVIDIA OpenCL 1.1 headers, but the unit_test.exe binary that you’ll run later didn’t work when I did this, possibly because I have OpenCL 1.2 CPU runtimes from Intel and AMD installed. The build itself succeeded, so I suspect that a system with only OpenCL 1.1 will also work.

Building CLTune is straightforward. In the CLTune directory, create a new folder called build; this is the output folder for CMake and the folder in which CLTune will actually be built. Now run CMake as follows:

cd build
cmake -G "MinGW Makefiles" "-DOPENCL_INCLUDE_DIRS=C:/Program Files (x86)/AMD APP SDK/2.9-1/include" "-DOPENCL_LIBRARIES=C:/Program Files (x86)/AMD APP SDK/2.9-1/lib/x86_64/libOpenCL.a" -DTESTS=TRUE ..\

Once CMake is finished, run the following commands (you need to be inside the CLTune build folder):

make
.\unit_test.exe

The last step is important because it checks that the build actually produced something that works. There are also sample_conv.exe, sample_gemm.exe and sample_simple.exe inside the build folder. Unfortunately, these examples are written with the platform hard-coded to platform 0 (this has been fixed in CLTune’s develop branch). If these samples run, then everything is probably fine.

I have encountered a problem with the sample_* programs for certain CLI parameter choices. Sometimes, OpenCL error code -36 gets generated, followed by -5 and the sample crashes. -36 is CL_INVALID_COMMAND_QUEUE while -5 is CL_OUT_OF_RESOURCES. This probably means that the specific combination of work group sizes, register usage, local memory usage, etc. exceeds one of the limits of my NVIDIA GPU, a GeForce GTX 560 Ti.

Using CLTune

I’ve started the CLTuneDemos repository to show code samples that demonstrate how to use CLTune. As time allows, I’ll hopefully add some more examples; anyone who wants to contribute more is welcome. My first example is the median filter.

Median filtering is used in image processing to remove noise while not blurring edges quite as much as the Gaussian blur. Median filtering replaces each pixel in the image with the median of its neighbouring pixels, for example using a 3×3 window around the pixel. Finding the median requires sorting the neighbouring pixels, or it can be done using selection algorithms. I’m trying to demonstrate CLTune here, not the best median filtering algorithm, so I’ve stuck to a basic version that sorts all the entries in the window.
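To make the idea concrete, here is a small host-side C++ sketch of the median of a 3×3 window (purely illustrative, not the kernel from my repository; it uses a selection algorithm rather than a full sort):

#include <algorithm>
#include <vector>

// Naive reference: median of the 3x3 window centred on (x, y).
// Assumes (x, y) is not on the image border, for simplicity.
float Median3x3(const std::vector<float>& image, int width, int x, int y)
{
    float window[9];
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            window[n++] = image[(y + dy) * width + (x + dx)];
    // nth_element only places the middle element in its sorted position,
    // which is all that is needed to find the median.
    std::nth_element(window, window + 4, window + 9);
    return window[4];
}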

The code in my repository has been commented extensively so I won’t reproduce all of it here. Instead, I’m going to focus on the important lines that use CLTune. The first step is to create a Tuner object for the targeted platform and device:

cltune::Tuner tuner(Cfg->m_iPlatformID, Cfg->m_iDeviceID);
tuner.UseFullSearch();
// Outputs the search process to a file
tuner.OutputSearchLog("search_log.txt");

m_iPlatformID is an integer that stores the platform number starting at 0, while m_iDeviceID stores the number of the specific device to target in that platform. UseFullSearch() tells CLTune that all combinations of parameters must be enumerated and tested. There are other options: UseAnnealing, UsePSO and UseRandomSearch, which use simulated annealing, particle swarm optimisation and plain random search respectively.
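A full search gets expensive quickly as the number of parameters grows. As a rough sketch of what switching strategies could look like (check cltune.h for the exact signatures in your CLTune version; the 0.25 fraction is just an illustrative value):

// Instead of tuner.UseFullSearch(): sample only part of the search space.
tuner.UseRandomSearch(0.25);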

Next, you need to tell the Tuner object where to find your OpenCL kernel:

auto Kernel = std::vector<std::string>{"./src/medfilt.cl"};
auto BaselineKernel = std::vector<std::string>{"./src/medfilt_baseline.cl"};
auto kernelID = tuner.AddKernel(Kernel, "medfilt", {work_x, work_y}, {1, 1});
tuner.SetReference(BaselineKernel, "medfilt", {work_x, work_y}, {8, 8});

Kernel defines all the files that need to be compiled to build the kernel that will be autotuned, while BaselineKernel lists all the files that make up a simple, non-tunable version of your code that is supposed to always give the correct answer. This is very important, as the reference output is used to flag bugs between different versions of your kernel, e.g. one with and one without local memory.

The next step is to define each tunable parameter. These parameters get inserted into your kernel at compile time with the -D flag, i.e. as macros. One important task here is to define any relationships between your parameters and the local and global work sizes used when launching your kernel. CLTune supports this by allowing you to specify whether the global/local work size should be divided or multiplied by any of the parameters. Here’s how I set it up in my example:

tuner.DivGlobalSize(kernelID, {"TBX", "TBY"});
tuner.MulGlobalSize(kernelID, {"TBX", "TBY"});
tuner.MulLocalSize(kernelID, {"TBX", "TBY"});

The Tuner object takes the work sizes specified with AddKernel previously and applies the maths operations to them in sequence. Here, the global work size (work_x, work_y) will be divided by the values being tested for the parameters TBX and TBY, and then multiplied by those same values. As an equation, the final global work size is ((work_x/TBX)*TBX, (work_y/TBY)*TBY). Integer arithmetic is used throughout, so this actually gives the largest multiple of the local work group size that does not exceed (work_x, work_y). The local work group size is simply (1*TBX, 1*TBY).
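For completeness: the parameters themselves are declared on the Tuner by listing the values each one may take, before the tuner enumerates the combinations. The value sets and the local-memory switch below are only illustrative; the actual parameter definitions for the median filter are in the repository:

// Declare the tunable parameters and the values CLTune may choose for them.
// Each chosen value is passed to the kernel as a -D macro at compile time.
tuner.AddParameter(kernelID, "TBX", {4, 8, 16, 32});
tuner.AddParameter(kernelID, "TBY", {4, 8, 16, 32});
tuner.AddParameter(kernelID, "LOCAL_MEM", {0, 1});  // hypothetical on/off switch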

After all this setup, it’s time to call tuner.Tune() and print the results. Then the hard work starts: getting your kernel to work under every configuration. This is a great way to catch bugs too.
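I’ve glossed over passing the kernel arguments in the snippets above. In outline, the remaining calls look roughly like this (the buffer and scalar names are placeholders, the argument order has to match the kernel’s signature, and the exact method names should be checked against cltune.h for your version):

// std::vectors become OpenCL buffers, plain values become scalar arguments.
tuner.AddArgumentInput(inputImage);    // input image data
tuner.AddArgumentOutput(outputImage);  // filtered output
tuner.AddArgumentScalar(width);
tuner.AddArgumentScalar(height);

// Run the selected configurations, verify each result against the
// reference kernel, and report the timings.
tuner.Tune();
tuner.PrintToScreen();
tuner.PrintToFile("medfilt_results.csv");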

Now for some results, to give you an idea of how much tuning influences running time. My test is a randomly generated 1024×1024 image that undergoes a 5×5 median filter.

Device | Unoptimised running time | Optimised running time
NVIDIA GeForce GTX 560 Ti (driver version 341.44, OpenCL 1.1) | 23 ms | 18 ms (TBX=32, TBY=4, using local memory)
Intel Core i7-2600K (Intel runtime, OpenCL 2.0, build 162) | 147 ms | 106 ms (TBX=16, TBY=8, using local memory)
Intel Core i7-2600K (AMD runtime, APP SDK 2.9.1, runtime version 938.2, OpenCL 1.2) | 150 ms | 122 ms (TBX=16, TBY=8, not using local memory)

The “unoptimised” kernel actually wasn’t always the slowest of all the combinations tested; one of the tests on the GTX 560 Ti took longer than 900 ms! The optimal configuration on my GPU (TBX=32, TBY=4, using local memory) is actually quite bad for the CPU. Intel and AMD also differ on whether one should use local memory, with AMD preferring that one doesn’t in this test. The optimal configuration using Intel’s runtime is TBX=16, TBY=8 with local memory, which is also pretty close to optimal on the GTX 560 Ti, but is much worse than optimal when using the AMD runtime. The optimal configuration on the AMD runtime (TBX=16, TBY=8, no local memory) is close to optimal under the Intel and NVIDIA runtimes as well, so in this case one could perhaps hard-code that configuration.

It’s quite interesting to see how the results change if one does a 7×7 median filter instead. On the GTX 560 Ti, the optimal work group size is still TBX=32, TBY=4, but not using local memory was slightly faster than using it. On the Intel runtime, however, the optimal configuration was now TBX=4, TBY=4 with local memory; TBX=16, TBY=8 was now a lot slower. The optimum for the AMD CPU runtime was TBX=8, TBY=8 with local memory, again quite different from the previous optimum, and the AMD runtime now also shows a preference towards using local memory. The full set of results is available as CSV files here.

Conclusion

If anything, this should have proven to you that tuning every parameter of your kernel is important, and there’s a tool to do it automatically – CLTune. I can think of a previous complex project where putting this to use on my kernels could have given quite a boost in speed. Since I wasn’t getting close to device limits then (or now), I knew there was room to improve but I just never found it. Perhaps this could have helped.

There are some shortcomings with CLTune though:

  1. Lack of image support: CLTune accepts arguments as scalars or as std::vector, which get turned into OpenCL buffer objects. I haven’t found a way to pass in image objects.
  2. CLTune focuses on one kernel at a time; it isn’t a general algorithm autotuner. I imagine complex algorithms will be difficult to tune one kernel at a time. You could autotune the longest-running kernel, figure out the output data dimensions and then move on to the next kernel called and repeat the tuning. If one of the kernels takes much longer than the others, this sequential approach could do well enough. If all of the kernels take similarly long, good luck.
  3. Documentation is non-existent; even the header file has very few comments. Working your way through the sample applications, followed by trial and error, is the best approach at the moment.

Considering that this tool doesn’t require one to write all the OpenCL boilerplate code, and that developing the CLTune side of this demo was actually quite fast, I think it’s a tool worth keeping around. The vast majority of my time on this demo went into debugging the median filter kernel. That shows another great use of the tool: some of my configurations using local memory worked fine, but most didn’t, and the comparison against the reference kernel made it easier to catch the problem.


Comments


Cedric Nugteren:

    I am the author of CLTune and was happy to see such a detailed blog post, thanks! In the meantime I have addressed one of the flaws you mentioned and have improved the documentation a bit in version 2.2.0, which is released today:

    * Two new simpler examples to get you started quickly.
    * Basic API documentation.
    * A link to a 19-minute video (http://on-demand.gputechconf.com/gtc/2016/video/S6206.html) and the corresponding slides explaining CLTune.
    * A link to this blog-post (useful as documentation as well!)
