# Mathematica and CUDA

The Compute Unified Device Architecture (CUDA) was released by NVIDIA in 2007 as a way to make the GPU usable for general-purpose computation. While GPU programming has been around for many years, its difficulty had limited adoption. CUDALink aims to make GPU programming easy and to accelerate that adoption. With Mathematica, much of the host-side setup is handled for you, and one only needs to write CUDA kernels. This is done by utilizing all levels of the NVIDIA architecture stack.

## CUDA Architecture

CUDA programming is based on the data-parallel model. From a high-level standpoint, the problem is first partitioned across hundreds or thousands of threads for computation. If the following is your computation,

    OutputData = Table[fun[InputData[[i, j]]], {i, 10000}, {j, 10000}]

then in the CUDA programming paradigm, this computation is equivalent to:

    CUDALaunch[fun, InputData, OutputData, {10000, 10000}]

where fun is the computation function. The above launches 10000×10000 threads, passes each thread its indices, applies the function to InputData, and places the results in OutputData. CUDALink's equivalent of CUDALaunch is CUDAFunction.
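To make the equivalence concrete, here is a minimal CPU-only Python sketch of the data-parallel model. It does not touch a GPU: `cuda_launch_emulated` is a hypothetical helper named for illustration, standing in for the `CUDALaunch` call above, and the loop visits the logical threads one after another where a GPU would run them concurrently.

```python
from itertools import product


def cuda_launch_emulated(fun, input_data, output_data, dims):
    """Emulate a 2D launch on the CPU: one logical thread per (i, j) index.

    Hypothetical helper for illustration only; a real GPU would run
    these threads concurrently rather than in a sequential loop.
    """
    rows, cols = dims
    for i, j in product(range(rows), range(cols)):
        # Each "thread" reads one input element and writes one output element.
        output_data[i][j] = fun(input_data[i][j])


def square(x):
    return x * x


# A small 3 x 4 problem so the emulation finishes instantly.
input_data = [[i + j for j in range(4)] for i in range(3)]
output_data = [[None] * 4 for _ in range(3)]
cuda_launch_emulated(square, input_data, output_data, (3, 4))

# The Table-style formulation of the same computation.
table_result = [[square(input_data[i][j]) for j in range(4)] for i in range(3)]
```

Both formulations fill the same output array; the launch form only makes explicit that every index is handled by its own independent thread.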

CUDA can launch thousands of threads because of its hardware architecture. The following sections discuss that architecture, along with how threads are partitioned for execution.

## CUDA - Grid and Blocks

Unlike message-passing or thread-based parallel programming models, CUDA programming maps problems onto a one-, two-, or three-dimensional grid. The following shows a typical two-dimensional CUDA thread configuration.

Each grid contains multiple blocks, and each block contains multiple threads. In terms of the above image, a grid, block, and thread are as follows.

Whether to use a one-, two-, or three-dimensional thread configuration depends on the problem. In the case of image processing, for example, we map the image onto the threads as shown in the following figure and apply a function to each pixel.
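The mapping from blocks and threads to pixels can be sketched on the CPU in Python. This is an illustration under stated assumptions, not CUDA code: the 16×16 block size and the function names are chosen for the example, and the index arithmetic mirrors the usual "block index times block size plus thread index" pattern.

```python
def grid_dims(width, height, block=(16, 16)):
    """Number of blocks in x and y needed to cover a width x height image."""
    bw, bh = block
    # Ceiling division, so a partially filled block still gets launched.
    return ((width + bw - 1) // bw, (height + bh - 1) // bh)


def global_index(block_idx, thread_idx, block=(16, 16)):
    """Pixel coordinates handled by one thread of one block."""
    x = block_idx[0] * block[0] + thread_idx[0]
    y = block_idx[1] * block[1] + thread_idx[1]
    return (x, y)


def covered_pixels(width, height, block=(16, 16)):
    """Collect every in-bounds pixel touched by the emulated launch."""
    gx, gy = grid_dims(width, height, block)
    pixels = set()
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    x, y = global_index((bx, by), (tx, ty), block)
                    # Bounds guard: threads past the image edge do nothing.
                    if x < width and y < height:
                        pixels.add((x, y))
    return pixels
```

For a 100×50 image this gives a 7×4 grid of blocks. Some threads in the last column and row of blocks fall outside the image, which is why the bounds guard is needed.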

A one-dimensional cellular automaton can map onto a one-dimensional grid.
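As a rough CPU-only illustration of that one-dimensional mapping, the Python sketch below advances an elementary cellular automaton by one step. Each loop iteration plays the role of one thread. The function name, the choice of rule 30, and the wrap-around boundary are assumptions made for the example.

```python
def ca_step(cells, rule=30):
    """Advance a 1D binary cellular automaton by one step.

    Iteration i stands in for thread i of a one-dimensional grid:
    it reads cells i-1, i, i+1 and writes only new cell i.
    Boundaries wrap around (an assumption of this sketch).
    """
    n = len(cells)
    new_cells = [0] * n  # separate output buffer, so no thread overwrites another's input
    for i in range(n):  # on a GPU, i would be the global thread index
        left = cells[(i - 1) % n]
        center = cells[i]
        right = cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right
        # Bit number `neighborhood` of the rule gives the new cell value.
        new_cells[i] = (rule >> neighborhood) & 1
    return new_cells


first = ca_step([0, 0, 0, 1, 0, 0, 0])
second = ca_step(first)
```

Because every cell's new value depends only on the previous generation, all cells can be updated independently, which is what makes the problem a natural fit for one thread per cell.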

## CUDA Program Cycle

When programming in CUDA, the three basic steps are the following:

1. Copy the data to the GPU and launch the threads.
2. Wait until the GPU has finished the computation.
3. Copy the results from the GPU back to the host.

The image above shows the typical process of a CUDA program in detail.

1. Allocate memory on the GPU. GPU and CPU memory are physically separate, and the programmer must manage the allocation and the copies.
2. Copy the memory from the CPU to the GPU.
3. Configure the threads: choose the correct block and grid dimensions for the problem.
4. Launch the configured threads.
5. Synchronize the CUDA threads to ensure that the device has completed all its tasks before doing further operations on the GPU memory.
6. Once the threads have completed, copy the memory back from the GPU to the CPU.
7. Free the GPU memory.
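
The cycle above can be walked through with a small CPU-only Python sketch. `FakeDevice` and its methods are hypothetical names invented for this illustration and are not part of CUDA or CUDALink; the class simply models a device whose memory is separate from the host's.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU with its own, separate memory."""

    def __init__(self):
        self.memory = {}       # handle -> device buffer
        self.next_handle = 0

    def allocate(self, n):
        handle = self.next_handle
        self.next_handle += 1
        self.memory[handle] = [0] * n
        return handle

    def copy_to_device(self, handle, host_data):
        self.memory[handle][:] = host_data

    def launch(self, kernel, handle, grid_dim, block_dim):
        buf = self.memory[handle]
        for block_idx in range(grid_dim):
            for thread_idx in range(block_dim):
                kernel(buf, block_idx, block_dim, thread_idx)

    def synchronize(self):
        pass  # the emulated launch has already finished when it returns

    def copy_to_host(self, handle):
        return list(self.memory[handle])

    def free(self, handle):
        del self.memory[handle]


def add_two(buf, block_idx, block_dim, thread_idx):
    """Kernel body: each thread adds 2 to one element."""
    i = block_idx * block_dim + thread_idx
    if i < len(buf):  # guard threads past the end of the data
        buf[i] += 2


host_data = list(range(10))
device = FakeDevice()
handle = device.allocate(len(host_data))              # allocate GPU memory
device.copy_to_device(handle, host_data)              # copy CPU -> GPU
block_dim = 4                                         # configure the threads
grid_dim = (len(host_data) + block_dim - 1) // block_dim
device.launch(add_two, handle, grid_dim, block_dim)   # launch the kernel
device.synchronize()                                  # wait for completion
result = device.copy_to_host(handle)                  # copy GPU -> CPU
device.free(handle)                                   # free GPU memory
```

Only `add_two` corresponds to the kernel; everything else is host-side bookkeeping.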

When using Mathematica, one need not worry about many of these steps. With Mathematica, one only needs to write CUDA kernels.

## CUDA Memory Hierarchy

CUDA memory is separated into different areas; each of these memory areas has its own advantages and restrictions. The image below shows all types of CUDA memory.

### Global Memory

Global memory is the main memory: it takes up most of the memory on a graphics board and is the slowest of the memories. All threads can access global memory. For performance reasons, however, accesses to it should be avoided where possible.

### Texture Memory

Texture memory resides in the same area as global memory, but is read-only. In contrast to global memory, it offers much higher performance. It only supports the types char, int, and float.

### Constant Memory

Constant memory is fast and available to every thread in the grid. It is cached, but limited to 64 kB in total.

### Shared Memory

Shared memory is also fast, and each piece of it is tied to a single block. On current hardware, it is limited to 16 kB per block.
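
The per-block limit turns into simple arithmetic when sizing a tile of data to stage in shared memory. The Python sketch below uses the 16 kB figure quoted above; the function names and the 4-byte element size are assumptions for illustration, and the check ignores any shared memory the runtime may reserve for its own use.

```python
SHARED_MEMORY_LIMIT_BYTES = 16 * 1024  # per-block limit quoted in the text


def tile_bytes(tile_width, tile_height, bytes_per_element=4):
    """Bytes needed to hold one tile (4 bytes per element assumed by default)."""
    return tile_width * tile_height * bytes_per_element


def fits_in_shared_memory(tile_width, tile_height, bytes_per_element=4):
    """True if the tile fits within the per-block shared memory limit."""
    needed = tile_bytes(tile_width, tile_height, bytes_per_element)
    return needed <= SHARED_MEMORY_LIMIT_BYTES
```

A 16×16 tile of 4-byte values needs 1024 bytes and fits comfortably, while a 128×128 tile needs 65536 bytes and does not.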

### Local Memory

Local memory is local to each thread, but resides in global memory unless the compiler places the variables in registers. Access to it should also be kept to a minimum, even though it is not as slow as global memory.

## CUDA Compute Capability

The "compute capability" specifies which operations the hardware in use supports. Currently, the following compute capabilities are available:

| Compute Capability | Extra Features |
|---|---|
| 1.0 | base implementation |
| 1.1 | atomic operations |
| 1.2 | shared atomic operations and warp vote functions |
| 1.3 | support for double-precision operations |
| 2.0 | double precision, L2 cache, and concurrent kernels |
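
A program that depends on a particular feature can check the capability before choosing a code path. The Python sketch below encodes the capabilities listed above as a lookup. It assumes that features are cumulative, so each capability also includes everything from lower ones, and the names are hypothetical, chosen for this illustration.

```python
# Features introduced at each compute capability, as listed above.
# Versions are (major, minor) tuples so they compare correctly.
FEATURES_BY_CAPABILITY = {
    (1, 0): ["base implementation"],
    (1, 1): ["atomic operations"],
    (1, 2): ["shared atomic operations", "warp vote functions"],
    (1, 3): ["double-precision operations"],
    (2, 0): ["L2 cache", "concurrent kernels"],
}


def features_up_to(capability):
    """All features available at `capability`, assuming features accumulate."""
    features = []
    for version in sorted(FEATURES_BY_CAPABILITY):
        if version <= capability:
            features.extend(FEATURES_BY_CAPABILITY[version])
    return features


def supports(capability, feature):
    """True if `feature` is available at the given compute capability."""
    return feature in features_up_to(capability)
```

For example, `supports((1, 2), "double-precision operations")` is false, while the same query for `(1, 3)` or `(2, 0)` is true.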