Overview
The AMD APP SDK includes a CPU runtime for OpenCL, allowing OpenCL execution on x86 multi-core CPUs, as well as a GPU runtime providing OpenCL execution on AMD GPUs. This enables programmers to use all the cores in their system, both conventional processors and stream processors, with the same codebase. This article discusses the history and motivation for OpenCL and the AMD APP SDK, gives a brief introduction to OpenCL programming, and describes how NVIDIA CUDA programs can be easily ported to OpenCL, enabling high-performance, cross-platform parallel computing.
History and Motivation
The move towards parallel processors is now well underway, driven by the capabilities and limitations of modern semiconductor manufacturing. Parallel processing (see Figure 1) provides continued performance increases at reasonable power consumption. However, this move comes at a cost: in order for future programs to take advantage of increases in processing power, they must be written in a scalable fashion. Writing scalable, parallel programs has been a difficult exercise for many years, and now it is becoming mandatory for all computationally-intensive programs.
Figure 1: Some current Parallel Processors
Complicating this move towards parallel processing is the diversity and heterogeneity of the various parallel architectures that are now available. Traditional CPUs have become multi-core parallel processors, with two to eight cores in a socket. GPUs, which have always been very parallel, boasting hundreds of parallel execution units on a single die, have now become increasingly programmable, to the point that it is now often useful to think of GPUs as many-core processors instead of special purpose accelerators. All of this diversity has been reflected in an array of incompatible tools and programming models required for programming these architectures, resulting in a lot of developer pain and high costs when trying to write cross-platform programs.
OpenCL is an answer to this problem. When using OpenCL, developers can use a single, unified toolchain and language to target all of the parallel processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.
OpenCL is now the preferred programming environment for AMD GPUs, and forms the core of the AMD APP SDK.
What is OpenCL
OpenCL is a standard managed by the Khronos Group, which also manages OpenGL, the cross-platform graphics API. OpenCL contributors include AMD, Apple, Intel and NVIDIA, among others from all corners of the computer industry. Like OpenGL, OpenCL provides an API and a runtime. When an OpenCL program is executed, a series of API calls configures the system for execution, an embedded just-in-time (JIT) compiler compiles the OpenCL code, and the runtime asynchronously coordinates execution between parallel kernels. OpenCL’s strengths are its practicality, flexibility and retargetability.
OpenCL Platform Model
OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements. One of OpenCL’s strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, such as the ATI Radeon HD 5870, a CPU, such as the AMD Phenom II X4 processor, or another processor such as the Cell Broadband Engine. The OpenCL platform model (see Figure 2) is designed to present a uniform view of many different kinds of parallel processors.
Figure 2: OpenCL Platform Model
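As a brief illustration of the platform model, the following host-side sketch enumerates the compute devices a platform exposes and reports how many compute units each contains. It is a minimal example, with error checking omitted; the fixed array of eight devices is an arbitrary choice for illustration.

    /* Minimal sketch: enumerate the OpenCL platform model from host code. */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint num_devices;

        clGetPlatformIDs(1, &platform, NULL);
        /* Ask for every compute device (CPU or GPU) the platform exposes. */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint i = 0; i < num_devices; ++i) {
            char name[128];
            cl_uint compute_units;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(compute_units), &compute_units, NULL);
            printf("Device %u: %s, %u compute units\n", i, name, compute_units);
        }
        return 0;
    }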
OpenCL Execution Model
OpenCL has a flexible execution model that incorporates both task and data parallelism. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model. Tasks themselves can be comprised of data-parallel kernels, which apply a single function over a range of data elements, in parallel, allowing only restricted synchronization and communication during the execution of a kernel. These concepts will be further explained in this section.
Kernels
As mentioned, OpenCL kernels provide data parallelism. The kernel execution model is based on a hierarchical abstraction of the computation being performed. OpenCL kernels are executed over an index space, which can be one-, two- or three-dimensional. Figure 3 shows an example of a two-dimensional index space, which has Gx * Gy elements. For every element of the kernel index space, a work-item is executed. All work-items execute the same program, although their execution may differ due to branching based on data characteristics or the index assigned to each work-item.
Figure 3: Executing Kernels – Work-Groups and Work-Items
The index space is regularly subdivided into work-groups, which are tilings of the entire index space. In Figure 3, we see a work-group of size Sx * Sy elements. Each work-item in the work-group receives a work-group id, labeled (wx, wy) in the figure, as well as a local id, labeled (sx, sy) in the figure. Each work-item also receives a global id, which can be derived from its work-group and local ids.
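These ids map directly onto built-in functions inside a kernel. The following sketch, a hypothetical element-wise scaling kernel over a two-dimensional index space, shows how a work-item obtains them; the kernel name and arguments are illustrative only.

    __kernel void scale(__global float* data, float alpha, int width)
    {
        /* Global ids identify this work-item within the Gx * Gy index space. */
        int gx = get_global_id(0);
        int gy = get_global_id(1);

        /* The global id can also be derived from the work-group id (wx) and
           local id (sx) shown in Figure 3:
           gx == get_group_id(0) * get_local_size(0) + get_local_id(0).       */

        /* Each work-item updates exactly one element. */
        data[gy * width + gx] = alpha * data[gy * width + gx];
    }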
The work-items may only communicate and synchronize locally, within a work-group, via a barrier mechanism. This provides scalability, traditionally the bane of parallel programming. Because communication and synchronization at the finest granularity are restricted in scope, the OpenCL runtime has great freedom in how work-items are scheduled and executed.
Command Queues
The division of a kernel into work-items and work-groups supports data-parallelism, but OpenCL supports another kind of parallelism as well, called task-parallelism. This is done via OpenCL command queues (see Figure 4). An OpenCL command queue is created by the developer through an API call, and associated with a specific compute device. If a developer wishes to target multiple OpenCL compute devices simultaneously, she should create multiple command queues.
Figure 4: Task Parallelism within a Command Queue
To execute a kernel, the host pushes it onto a particular command queue. Enqueueing a kernel is asynchronous, so the host program may enqueue many different kernels without waiting for any of them to complete. When enqueueing a kernel, the developer optionally specifies a list of events that must occur before the kernel executes. Events are generated by kernel completion, as well as by memory read, write, and copy commands. This allows the developer to specify a dependence graph between kernel executions and memory transfers in a particular command queue or between command queues themselves, which the OpenCL runtime will traverse during execution. Figure 4 shows a task graph illustrating the power of this approach, where arrows indicate dependencies between tasks. For example, Kernel A will not execute until Write A and Write B have finished, and Kernel D will not execute until Kernel B and Kernel C have finished.
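A hedged sketch of how the dependences in Figure 4 might be expressed through event wait lists follows. It assumes the command queue, buffers, kernels, host pointers, and work sizes have already been created; whether Kernels B and C themselves depend on Kernel A is assumed here purely for illustration.

    cl_event writeA, writeB, evA, evB, evC;
    cl_event afterWrites[2], afterBC[2];

    /* Asynchronously copy the input buffers to the device. */
    clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, bytes, hostA, 0, NULL, &writeA);
    clEnqueueWriteBuffer(queue, bufB, CL_FALSE, 0, bytes, hostB, 0, NULL, &writeB);

    /* Kernel A may not start until Write A and Write B have finished. */
    afterWrites[0] = writeA; afterWrites[1] = writeB;
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL, 2, afterWrites, &evA);

    /* Kernels B and C wait only on Kernel A, so the runtime may run them in parallel. */
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL, 1, &evA, &evB);
    clEnqueueNDRangeKernel(queue, kernelC, 1, NULL, &global, NULL, 1, &evA, &evC);

    /* Kernel D may not start until Kernel B and Kernel C have finished. */
    afterBC[0] = evB; afterBC[1] = evC;
    clEnqueueNDRangeKernel(queue, kernelD, 1, NULL, &global, NULL, 2, afterBC, NULL);

Note that for independent commands in a single queue to actually overlap, the queue must be created with the out-of-order execution property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE); an in-order queue still respects the events but executes commands in the order they were enqueued.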
The ability to construct arbitrary task graphs is a powerful way of constructing task-parallel applications. The OpenCL runtime has the freedom to execute the task graph in parallel, as long as it respects the dependencies encoded in the task graph. Task graphs are general enough to represent the kinds of parallelism useful across the spectrum of hardware architectures, from CPUs to GPUs.
Developers are also free to construct multiple command queues, either for parallelizing an application across multiple compute devices, or for expressing more parallelism via completely independent streams of computation. OpenCL’s ability to use both data and task parallelism simultaneously is a great benefit to parallel application developers, regardless of their intended hardware target.
Synchronization
Besides the task-parallel constructs in OpenCL that allow synchronization and communication between kernels, OpenCL supports barrier synchronization within a work-group. This mechanism allows work-items to coordinate and share data in the local memory space using only very lightweight and efficient barriers. Work-items in different work-groups should never try to synchronize or share data directly, since the runtime provides no guarantee that all work-items are concurrently executing, and such synchronization easily introduces deadlocks.
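As a sketch of this local synchronization, the hypothetical kernel below stages one tile of input into local memory and reverses it within each work-group; the barrier guarantees the whole tile has been written before any work-item reads it.

    __kernel void reverse_tiles(__global const float* in,
                                __global float* out,
                                __local float* tile)   /* one tile per work-group */
    {
        int lid  = get_local_id(0);
        int size = get_local_size(0);
        int gid  = get_global_id(0);

        tile[lid] = in[gid];

        /* Lightweight work-group barrier: every work-item in the group must
           arrive here before any work-item continues past it. */
        barrier(CLK_LOCAL_MEM_FENCE);

        out[gid] = tile[size - 1 - lid];
    }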
Work-items in different work-groups may coordinate execution through the use of atomic memory transactions, an OpenCL extension supported by some OpenCL runtimes, such as the AMD APP SDK OpenCL runtime for x86 multi-core compute devices. For example, work-items may append variable numbers of results to a shared queue in global memory (see the sketch below). However, it is good practice for work-items not to communicate directly in general, as without careful design, scalability and deadlock can become difficult problems. The hierarchy of synchronization and communication provided by OpenCL is a good fit for many of today’s parallel architectures, while still providing developers the ability to write efficient code, even for parallel computations with non-trivial synchronization and communication patterns.
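A hedged sketch of such a shared output queue follows, assuming the device and runtime support the cl_khr_global_int32_base_atomics extension; the kernel and buffer names are illustrative.

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    __kernel void collect_positive(__global const float* in,
                                   __global float* out_queue,
                                   volatile __global int* out_count)
    {
        int gid = get_global_id(0);
        float v = in[gid];
        if (v > 0.0f) {
            /* atom_inc returns the previous counter value, so each work-item,
               regardless of which work-group it belongs to, claims a unique
               slot in the shared queue. */
            int slot = atom_inc(out_count);
            out_queue[slot] = v;
        }
    }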
OpenCL Memory Model
OpenCL has a relaxed consistency memory model, structured as shown in Figure 5.
Figure 5: OpenCL Memory Model
Each compute device has a global memory space, which is the largest memory space available to the device and typically resides in off-chip DRAM. There is also a read-only, limited-size constant memory space, which allows for efficient reuse of read-only parameters in a computation. Each compute unit on the device has a local memory, which is typically on the processor die and therefore has much higher bandwidth and lower latency than global memory. Local memory can be read and written by any work-item in a work-group, and thus allows for local communication within a work-group. Additionally, attached to each processing element is a private memory, which is typically not used directly by programmers, but is used to hold data for each work-item that does not fit in the processing element’s registers.
Parameter | ATI Radeon HD 5870
Global memory size | 1 GB
Global memory bandwidth | 153 GB/s
Number of Compute Units | 20
Number of Processing Elements (per Compute Unit) | 16
Width of each Processing Element | 5×32-bit
Size of local memory (per Compute Unit) | 32 KB
Clock Frequency | 850 MHz
Peak Performance | 2.7 TFlops (single precision)
As OpenCL has a relaxed consistency model, different work-items may see a different view of global memory as the computation progresses. Within a work-item, reads and writes to all memory spaces are consistently ordered, but between work-items, synchronization is necessary in order to ensure consistency. This relaxed consistency model is an important part of OpenCL’s efforts to provide parallel scalability: parallel programs that rely on strong memory consistency for synchronization and communication usually fail to execute in parallel, because memory ordering requirements force a serialization of such programs during execution, hindering scalability. Requiring explicit synchronization and communication between work-items encourages programmers to write scalable code, avoiding the trap often seen in parallel programming where code looks parallel, but ends up executing in serial due to frequent and implicit synchronization induced by reliance on a strict memory ordering model.
Additionally, OpenCL views the global memory space of each compute device as private and separate from host memory. Moving data between compute devices and the host requires the programmer to manually manage communication between the host and the compute devices. This is done through the use of explicit memory reads and writes between devices.
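A minimal sketch of this explicit data movement is shown below, assuming a context, command queue, kernel, and host arrays (host_in, host_out) already exist; error checking is omitted.

    size_t n = 1024;                       /* illustrative problem size */
    size_t bytes = n * sizeof(float);

    /* Allocate a buffer in the compute device's global memory. */
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* Explicitly copy host data into device global memory (blocking write). */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_in, 0, NULL, NULL);

    /* Run a kernel on the device-resident data. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Explicitly copy the results back to host memory (blocking read). */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);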
Porting CUDA to OpenCL
The data-parallel programming model in OpenCL shares some commonalities with NVIDIA’s C for CUDA programming model, making it relatively straightforward to convert programs from CUDA to OpenCL.
Terminology
C for CUDA terminology | OpenCL terminology
Thread | Work-item
Thread block | Work-group
Global memory | Global memory
Constant memory | Constant memory
Shared memory | Local memory
Local memory | Private memory

Table 1: General Terminology
Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems.
Writing Kernels: Qualifiers
C for CUDA terminology | OpenCL terminology
__global__ function (callable from host, not callable from device) | __kernel function (callable from device, including CPU device)
__device__ function (not callable from host) | No annotation necessary
__constant__ variable declaration | __constant variable declaration
__device__ variable declaration | __global variable declaration
__shared__ variable declaration | __local variable declaration

Table 2: Qualifiers for Kernel Functions
Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points, and __device__ functions are to be executed on the GPU, but are not callable from the host. In OpenCL, entry point functions are annotated with the __kernel qualifier, but non-entry point functions do not need to be annotated.
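The difference is easiest to see side by side. The sketch below shows an illustrative entry point and helper function written in both dialects; the function names are hypothetical.

    /* C for CUDA */
    __device__ float square(float x) { return x * x; }      /* device-only helper   */
    __global__ void scale(float* data, float alpha)          /* GPU entry point      */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = alpha * square(data[i]);
    }

    /* OpenCL */
    float square(float x) { return x * x; }                  /* no annotation needed */
    __kernel void scale(__global float* data, float alpha)   /* entry point          */
    {
        int i = get_global_id(0);
        data[i] = alpha * square(data[i]);
    }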
Writing Kernels: Indexing
C for CUDA terminology | OpenCL terminology
gridDim | get_num_groups()
blockDim | get_local_size()
blockIdx | get_group_id()
threadIdx | get_local_id()
No direct equivalent. Combine blockDim, blockIdx, and threadIdx to calculate a global index. | get_global_id()
No direct equivalent. Combine gridDim and blockDim to calculate the global size. | get_global_size()

Table 3: Indexing functions for use in Kernels
Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.
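In practice the correspondence is straightforward. In the illustrative kernel below, each OpenCL call is paired with the C for CUDA expression it replaces (one-dimensional case).

    __kernel void index_demo(__global int* global_ids)
    {
        int lid   = get_local_id(0);     /* CUDA: threadIdx.x */
        int group = get_group_id(0);     /* CUDA: blockIdx.x  */
        int lsize = get_local_size(0);   /* CUDA: blockDim.x  */
        int gid   = get_global_id(0);    /* CUDA: blockIdx.x * blockDim.x + threadIdx.x */

        /* The global id can also be reconstructed by hand, as CUDA requires. */
        global_ids[gid] = group * lsize + lid;
    }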
Writing Kernels: Synchronization
C for CUDA terminology | OpenCL terminology
__syncthreads() | barrier()
__threadfence() | No direct equivalent
__threadfence_block() | mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
No direct equivalent | read_mem_fence()
No direct equivalent | write_mem_fence()

Table 4: Synchronization functions for use in Kernel Functions
Table 4 shows functions provided for synchronization in kernel functions. __syncthreads() and barrier() both provide a mechanism for synchronizing all work-items in a work-group, where calling the function suspends work-item execution until all work-items in the work-group have called the barrier.
__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force orderings on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to be completed before execution proceeds, preventing the compiler, runtime, and hardware from reordering any loads or stores through the mem_fence(). This can be used to ensure that all data produced in a work-group is flushed to global memory before proceeding to signal another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.
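A hedged sketch of that pattern follows; the buffer layout and flag scheme are purely illustrative, and, as noted above, communication between work-groups requires careful design to remain correct and scalable.

    __kernel void produce(__global float* results,
                          volatile __global int* ready)
    {
        int gid = get_global_id(0);
        results[gid] = (float)gid * 2.0f;       /* illustrative computation */

        /* Force the store to results[gid] to complete before the store to
           the flag; neither the compiler nor the hardware may reorder the
           two stores across the fence. */
        mem_fence(CLK_GLOBAL_MEM_FENCE);

        ready[gid] = 1;   /* signal that results[gid] has been written */
    }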
Important API objects
C for CUDA terminology | OpenCL terminology
CUdevice | cl_device_id
CUcontext | cl_context
CUmodule | cl_program
CUfunction | cl_kernel
CUdeviceptr | cl_mem
No direct equivalent. Closest approximation would be the CUDA Stream mechanism. | cl_command_queue

Table 5: Selected API objects used in Host code
Table 5 shows some objects provided by the respective APIs, which are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL’s task parallelism capabilities by allowing the developer to declare dependences between tasks executing on a device. CUDA does not provide such flexibility; the closest it offers is the Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. This is not as general as the task parallelism provided by OpenCL command queues: Streams do not allow parallelism within a queue, and synchronizing between streams is difficult, while command queues provide parallelism within and between queues, as well as flexible synchronization through OpenCL events.
Important API Calls
C for CUDA terminology | OpenCL terminology
cuInit() | No OpenCL initialization required
cuDeviceGet() | clGetContextInfo()
cuCtxCreate() | clCreateContextFromType()
No direct equivalent | clCreateCommandQueue()
cuModuleLoad() [requires pre-compiled binary] | clCreateProgramWithSource() or clCreateProgramWithBinary()
No direct equivalent. CUDA programs are compiled off-line | clBuildProgram()
cuModuleGetFunction() | clCreateKernel()
cuMemAlloc() | clCreateBuffer()
cuMemcpyHtoD() | clEnqueueWriteBuffer()
cuMemcpyDtoH() | clEnqueueReadBuffer()
cuFuncSetBlockShape() | No direct equivalent [functionality in clEnqueueNDRangeKernel()]
cuParamSeti() | clSetKernelArg()
cuParamSetSize() | No direct equivalent [functionality in clSetKernelArg()]
cuLaunchGrid() | clEnqueueNDRangeKernel()
cuMemFree() | clReleaseMemObject()

Table 6: Selected API calls used in Host code
Table 6 lists some important API calls used in host code to set up parallel computations and execute them, as well as to manage data on compute devices. For the most part, these functions are fairly similar, although functionality is sometimes divided slightly differently, as shown in the table. The biggest difference is that OpenCL supports both runtime compilation and offline compilation of programs, whereas CUDA only allows programs to be compiled offline. To precompile OpenCL programs, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, along with the clCreateProgramWithBinary() call to create an OpenCL program object from that compiled binary.
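A sketch of that precompilation flow is shown below for the single-device case, assuming a program object that has already been built; error checking and the file I/O for saving the binary are omitted.

    /* Query the size of the compiled binary, then fetch it. */
    size_t binary_size;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(binary_size), &binary_size, NULL);

    unsigned char* binary = (unsigned char*)malloc(binary_size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(binary), &binary, NULL);

    /* ... save 'binary' to disk; in a later run, rebuild the program from it. */
    cl_int binary_status;
    cl_program prebuilt = clCreateProgramWithBinary(context, 1, &device,
                                                    &binary_size,
                                                    (const unsigned char**)&binary,
                                                    &binary_status, NULL);
    clBuildProgram(prebuilt, 1, &device, NULL, NULL, NULL);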
Some things to keep in mind while porting from CUDA to OpenCL
- Pointers in OpenCL kernels must be annotated with their memory space. For example, a pointer to local memory would be declared as __local int* p; this applies to kernel arguments as well: data being passed to a kernel is usually in arrays represented by __global pointers (see the sketch after this list).
- CUDA encourages the use of scalar code in kernels. While this works in OpenCL as well, depending on the desired target architecture, it may be more efficient to write programs operating on OpenCL’s vector types, such as float4, as opposed to pure scalar types. This is useful for both AMD CPUs and AMD GPUs, which can operate efficiently on vector types. OpenCL also provides flexible swizzle/broadcast primitives for efficient creation and rearrangement of vector types.
- CUDA does not provide rich facilities for task parallelism, and so it may be beneficial to think about how to take advantage of OpenCL’s task parallelism as you port your application.
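The short kernel below, an illustrative vectorized saxpy, combines the first two points: every pointer argument carries an address-space qualifier, and the arithmetic operates on float4 vectors so each work-item processes four elements at once.

    __kernel void saxpy4(__global const float4* x,
                         __global float4* y,
                         float alpha)
    {
        int gid = get_global_id(0);
        /* One float4 multiply-add touches four scalar elements; swizzles
           such as x[gid].xyzw could rearrange components if needed. */
        y[gid] = alpha * x[gid] + y[gid];
    }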
Conclusion
The AMD APP SDK provides a powerful, cross-platform way to unlock the performance of AMD GPUs as well as multi-core CPUs. Programming in OpenCL enables one codebase to serve the needs of today’s diversity of parallel hardware architectures. OpenCL provides a flexible programming model that allows developers to utilize both data parallelism and task parallelism. It is also fairly straightforward to map NVIDIA C for CUDA data-parallel kernels to OpenCL, which can free code from proprietary, vendor-specific toolchains and make data-parallel applications retargetable. With OpenCL and the AMD APP SDK, developers are able to unlock the potential of today’s diverse and powerful parallel hardware.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.