GPU Programming 4: Actual CUDA!

I’m combining the final module of Introduction To Concurrent Programming with the first two modules of Introduction to Parallel Programming in CUDA, since all three modules focus on the same type of material: what does the most basic CUDA programming look like? How does it flow? What are the first questions you need to ask when approaching this?

Interacting with the GPU

We’re working with an external device, something that has its own memory and other resources. This dictates a flow between the host machine and the GPU.

1/2: Data

We need to explicitly initialize our variables on the GPU and push data from the host machine (where our main program is running) to the GPU. Here’s an example of importing an image to the GPU.
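Something like the following sketch, assuming the image is already loaded on the host as h_rgb with 3 interleaved bytes per pixel (all the variable names here are my own):

```cuda
int numPixels = width * height;
size_t numBytes = numPixels * 3 * sizeof(unsigned char);      // 3 interleaved channels

unsigned char *d_rgb = nullptr;                               // "d_" marks device-side pointers
cudaMalloc(&d_rgb, numBytes);                                 // initialize the variable on the GPU
cudaMemcpy(d_rgb, h_rgb, numBytes, cudaMemcpyHostToDevice);   // push the data host -> GPU
```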

Of course you’ll also need to get information back and eventually free that memory on the GPU. These aren’t smart pointers!
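Again as a sketch, assuming the kernel wrote its output into a device buffer d_red with a host-side twin h_red:

```cuda
cudaMemcpy(h_red, d_red, numPixels * sizeof(unsigned char),
           cudaMemcpyDeviceToHost);   // pull the result back to the host
cudaFree(d_rgb);                      // no smart pointers: free device memory by hand
cudaFree(d_red);
```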

2/2: Function Calls

When you create a function, by default it runs on your computer. The GPU has its own resources, global variables, etc., so you need a way to run functions in that context. This is where the __host__, __global__ and __device__ keywords come in.

| Keyword | Purpose | Examples |
| --- | --- | --- |
| __host__ (e.g. __host__ void main()) | 100% of this happens on the host machine. Without a keyword this is the default state. | main; functions that load data to and from the GPU and call __global__ functions |
| __global__ | Called from the host, but executed on the GPU. | The kernel function |
| __device__ | 100% of this is on the GPU. | Functions that support the kernel function |

So we should keep in mind that any objects initialized inside a __global__ function exist only on the GPU.
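As a toy sketch of how the three keywords sit together (the bodies and names are my own placeholders):

```cuda
__device__ float scale(float x) {                   // GPU-only helper, callable from kernels
    return 2.0f * x;
}

__global__ void scaleAll(float *data) {             // called from the host, executed on the GPU
    data[threadIdx.x] = scale(data[threadIdx.x]);   // one element per thread
}

__host__ int main() {                               // __host__ is the default and usually omitted
    // allocate, copy in, launch scaleAll, copy out, free
    return 0;
}
```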

Kernel Functions – The Guts of the Operation

Now we’re getting to the function that actually performs the operation we want for each piece of data. It’s best to think of it as the core of a for loop: the outer loops update your i, j, k variables, and inside the loop you calculate the address.
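In plain serial code the picture would be something like this sketch (names are mine):

```cuda
// Nested loops update the indices; the body computes an address and does the
// work. CUDA keeps only the body (the kernel) and replaces the loops with the
// grid/block/thread hierarchy.
for (int block = 0; block < numBlocks; ++block)                  // outer loop: blocks
    for (int thread = 0; thread < threadsPerBlock; ++thread) {   // inner loop: threads
        int address = block * threadsPerBlock + thread;
        red[address] = rgb[address * 3];                         // the "kernel body"
    }
```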

i, j, k: Grids, Blocks and Threads

In our for loop analogy, the data structure we’re iterating on is a three dimensional matrix. But each dimension isn’t just a spatial dimension; it can be defined in a variety of ways. The limits on how these can be sized are in this table on the development website (thank you Stack Overflow!)

  • Each kernel instance runs on its own thread. In our for loop this is the dimension that changes the most, the one that increments on every iteration (since the thread id does indeed increment every time we call the kernel function).
  • A block is a group of threads. Each block has a maximum of 1024 threads, and under the hood a single block is mapped to one or more warps, which for our purposes are groups of 32 threads (check the table on the development website). Therefore it’s recommended to make your block size a multiple of 32 to fully utilize the GPU (see the sketch after this list for how the thread and block ids combine into an address).
    • In our for loop it’s the middle dimension, but for organization purposes blocks can be defined in up to 3 dimensions. I haven’t picked up whether there’s any technical benefit to using multiple dimensions.
    • Threads in the same block can share variable information; it looks like this will be tackled in a future module.
    • It’s worth looking at operations where we have 1024 pieces of data or fewer; it would be fun to play with what we can do here. For images we could also incorporate the number of channels into the block.
  • Grids are yet another dimension above the block, the i in our for loop analogy. Like blocks, a grid can actually have up to 3 dimensions. However, grids are not directly tied to the hardware.
    • The course talks about how you can use the grid to represent the time dimension for data, downplaying the role of grids and leading you to think that blocks can handle the spatial dimension.
    • In images (even in the course) the grid is most useful for tackling spatial dimensions, as long as you keep your block size in mind.
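Here’s a minimal sketch of how those ids surface inside a kernel in the common 1-D case (names are mine):

```cuda
// The built-in variables blockIdx, blockDim and threadIdx combine into one
// global index per thread, exactly like a loop counter.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // block offset + position in block
    if (i < n)                // guard: the last block may hang past the end of the data
        data[i] += 1.0f;
}
```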

Defining a Kernel Function

To continue our for loop analogy, the kernel function is the core of our for loop. That means we need to know what i, j, and k are. Ultimately this depends on your data, but here we’ll follow the most complicated example from Introduction to Parallel Programming in CUDA with some tweaks for clarity. In this example I’ll be fully defining our address, assuming that our blocks are defined in 3 dimensions and, for some reason, our grid is also defined in 3 dimensions.
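Here’s my reconstruction of that kernel as a sketch (the names are my own):

```cuda
__global__ void extractRed(const unsigned char *rgb, unsigned char *red, int numPixels)
{
    // 1. Which block is this, flattened across the 3-D grid?
    int blockId = blockIdx.x
                + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;

    // 2. Convert the block id to an address: block number * threads per block,
    //    like (row we want) * (size of a row).
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int blockOffset = blockId * threadsPerBlock;

    // 3. Add this thread's position within its (3-D) block for the final address.
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    int pixel = blockOffset + threadId;

    if (pixel < numPixels)             // guard against a rounded-up launch
        red[pixel] = rgb[pixel * 3];   // extract the red channel
}
```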

This takes in an RGB image and extracts the red channel in the last line. To find which pixel to work on, we:

  1. Calculate which block we’ll access in the grid.
  2. Convert the blockId to an address, since our input/output matrices are one dimensional. Since this is an image, this is similar to picking a row and then calculating the memory address: (row # we want to access) * (the size of a row).
  3. Calculate the final address for the information we’re operating on in this thread. The input and output matrices are the same size (since we split the color channels), so this is similar to saying (row # we want to access) * (size of a row) + (column # we want to access).

This allows us to perform our operation. Note that this approach means each block focuses on a consecutive run of data, which might not be the best approach for your use case. Looking at the grid/block approach in the next section, for instance, you might want to change this address calculation so each block works on a rectangle of pixels in the image.

Calling a Kernel Function

But how do we define the block and grid sizes? Anything special about calling the kernel? Here’s where we’ll find that out.
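A sketch of the call site, reusing the extractRed kernel and device pointers from above (numBlocks is computed in the next snippet):

```cuda
dim3 gridSize(numBlocks, 1, 1);     // G: how many blocks are in the grid
dim3 blockSize(1024, 1, 1);         // B: threads per block (a multiple of 32)
extractRed<<<gridSize, blockSize>>>(d_rgb, d_red, numPixels);
cudaDeviceSynchronize();            // launches are asynchronous; wait for the GPU
```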

Note how the grid and block sizes are passed into the kernel using <<<G, B>>>, what’s called an execution configuration. It looks similar to a template, which probably inspired the syntax. The grid and block sizes are all we need to get our feet off the ground, but there are other configurations we can set with this syntax.
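For example (a sketch; width and height are assumed to be host-side variables):

```cuda
int numPixels = width * height;
int threadsPerBlock = 1024;         // the per-block maximum
int numBlocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;   // round up
```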

This gives one proposal for how to calculate our grid and block sizes: we calculate the number of blocks by splitting the image into 1024-pixel chunks. Note that which 1024 pixels land in a given chunk depends on the kernel’s address calculation.
