While module 3 of Introduction to Parallel Programming with CUDA has a really helpful section on using the nvidia-smi command to inspect your GPU and its current load, the main focus is sharing data between the GPU and the CPU. It’s a helpful bird’s-eye view, and it gestures at the considerations you should weigh when choosing one or more of these approaches. I’ll be going more in depth on those considerations here.
Data Transfer Methods
There are four main methods that CUDA supports for transferring data between the CPU and GPU. Here is a side-by-side comparison where we’re transferring an array called var.
(Figures: side-by-side comparison of the four data transfer methods.)
Each of these methods is useful in different scenarios. We’ll dive into each one and where I’d use it.
Method 1: Just Copy a Normal Variable
By default we create variables either on the heap (with malloc) or on the stack (without malloc); either way, they live in pageable memory of some form. To make sure the whole buffer is available before sending it to the GPU, CUDA temporarily copies the variable into reserved memory called pinned memory, or page-locked memory. In code, this looks like this:
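(A minimal sketch rather than the course’s exact listing; var, d_var, and N are placeholder names.)

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 20;
    float *var = (float *)malloc(N * sizeof(float));    // ordinary pageable host memory
    float *d_var;
    cudaMalloc((void **)&d_var, N * sizeof(float));     // device memory

    // CUDA stages the pageable buffer through pinned memory under the hood.
    cudaMemcpy(d_var, var, N * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that work on d_var ...

    cudaMemcpy(var, d_var, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_var);
    free(var);
    return 0;
}
```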
If you’re just starting to integrate GPU code into your project, of course start with this. After that, it’s really a question of whether Method 2 (initializing your variable in pinned memory) is appropriate for your situation or not.
Method 2: Initialize the Variable in Pinned Memory
What if we just created the variable in pinned memory to begin with? cudaHostAlloc() (or its sibling cudaMallocHost()) lets us do exactly that.
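(A minimal sketch; the variable names are placeholders.)

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    float *var;
    cudaMallocHost((void **)&var, N * sizeof(float));   // pinned (page-locked) host memory
    float *d_var;
    cudaMalloc((void **)&d_var, N * sizeof(float));

    cudaMemcpy(d_var, var, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that work on d_var ...
    cudaMemcpy(var, d_var, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_var);
    cudaFreeHost(var);    // pinned memory is released with cudaFreeHost, not free()
    return 0;
}
```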
Otherwise it’s exactly the same as our first example. That’s why these two figures are almost identical – the copy to pinned memory in the first method happens completely under the hood (so it’s a lighter gray). The second figure also has the space to show how we copy data from the GPU back to the CPU.
(Figures: data flow for Method 1 vs. Method 2.)
We might be tempted to do this every time, because copying data from its original location into pinned memory can take a long time. If we’re going to modify small portions of the variable repeatedly, this method shines. However, reserving pinned memory takes pages away from what the operating system can manage on its own, so we’ll want to be careful about how much we pin. This blog post describes the situation nicely.
Method 3: Mapped Memory
CUDA also provides a way to avoid copying data from the CPU to the GPU at all (as long as we can take a speed hit). Instead, the variable stays in CPU memory and is mapped so that the GPU can access it directly. This of course means the GPU has to reach over to CPU memory for every access, but according to the course this is much less of a speed hit than copying everything. You don’t even have to call cudaMalloc() and create a new variable. Just initialize the variable as a mapped allocation:
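(A minimal sketch; I’m assuming cudaHostAlloc() with the cudaHostAllocMapped flag here, and the kernel and names are placeholders.)

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // each access reaches across to host memory
}

int main(void) {
    const size_t N = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);               // enable mapped pinned memory
    float *var;
    cudaHostAlloc((void **)&var, N * sizeof(float), cudaHostAllocMapped);

    // On systems with unified virtual addressing the host pointer can be passed
    // straight to the kernel; otherwise fetch the device-side alias explicitly.
    float *d_var;
    cudaHostGetDevicePointer((void **)&d_var, var, 0);

    scale<<<(N + 255) / 256, 256>>>(d_var, N);
    cudaDeviceSynchronize();

    cudaFreeHost(var);
    return 0;
}
```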
This variable can be passed as an argument to our kernel, unlike the previous two methods, which require a separate device variable. As long as you can take the speed hit, this approach seems like the most stable and minimally invasive way to send information. A fun side note: cudaHostAlloc() can also be used to implement the previous methods. The API page for this function is a fun read.
Method 4: Unified Memory
Of course software developers are going to iterate, so CUDA has introduced the concept of “unified memory”: a method that essentially says “we’ll handle the variable for you.” Depending on your system this can result in very different behavior, as shown in this page and this page of NVIDIA’s documentation. Similar to Method 3, this method creates a variable that can be passed directly into the kernel and requires only one or two lines.
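(A minimal sketch using cudaMallocManaged, the unified-memory allocator; the kernel and names are placeholders.)

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const size_t N = 1 << 20;
    float *var;
    cudaMallocManaged((void **)&var, N * sizeof(float)); // one pointer, usable on CPU and GPU

    for (size_t i = 0; i < N; ++i) var[i] = 1.0f;        // touch it on the CPU...
    scale<<<(N + 255) / 256, 256>>>(var, N);             // ...and pass it straight to the kernel
    cudaDeviceSynchronize();                             // wait before touching it on the CPU again

    cudaFree(var);                                       // managed memory is freed with cudaFree
    return 0;
}
```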
I’d look up the device type you’re using on those two pages I linked in order to fully understand the pros and cons of this approach.
Memory Cleanup
A function that this module mentions in a blink-and-you’ll-miss-it moment is cudaDeviceReset(). It’s a single call that attempts to free the memory this whole process has taken up on the GPU. The API documentation describes the specific structures it’s able to find and free. This is incredibly useful, and from what I can tell it should be a standard cleanup call to prevent memory leaks on the GPU. The current documentation doesn’t specifically say it handles cudaHostAlloc() allocations, though, so that would be important to check before relying on it if we’re using Method 3.
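(For reference, a minimal sketch of where such a call might sit, after explicit frees.)

```cuda
#include <cuda_runtime.h>

int main(void) {
    float *d_var;
    cudaMalloc((void **)&d_var, 1024 * sizeof(float));

    // ... do work on the GPU ...

    cudaFree(d_var);        // free what we can explicitly
    cudaDeviceReset();      // destroy the context and release remaining device resources
    return 0;
}
```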