
The Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU (see GPU Applications). Other computing devices, like FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.

This difference in capabilities between the GPU and the CPU exists because they are designed with different goals in mind. While the CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing thousands of them in parallel (amortizing the slower single-thread performance to achieve greater throughput).

The GPU is specialized for highly parallel computations and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. The schematic Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.

Figure 1. The GPU Devotes More Transistors to Data Processing

Devoting more transistors to data processing, for example, floating-point computations, is beneficial for highly parallel computations.

In general, an application has a mix of parallel parts and sequential parts, so systems are designed with a mix of GPUs and CPUs in order to maximize overall performance. Applications with a high degree of parallelism can exploit this massively parallel nature of the GPU to achieve higher performance than on the CPU.

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions (a hierarchy of thread groups, shared memories, and barrier synchronization) that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).

Figure 3. Automatic Scalability

Note. A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware Implementation for more details). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.
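
The scaling behaviour described in this note can be approximated with a small host-side model in plain C++ (not device code). The sketch assumes, purely for illustration, that each multiprocessor runs one block at a time and that every block takes the same time; real GPUs typically keep several blocks resident per multiprocessor. The name waves_needed is hypothetical and not part of CUDA.

```cpp
#include <cassert>

// Hypothetical helper: number of scheduling "waves" needed to run
// num_blocks independent thread blocks on num_sms multiprocessors,
// assuming one resident block per multiprocessor and equal block runtimes.
int waves_needed(int num_blocks, int num_sms)
{
    assert(num_blocks >= 0 && num_sms > 0);
    return (num_blocks + num_sms - 1) / num_sms;  // ceiling division
}
```

Under these assumptions, a program of 8 blocks needs 4 waves on 2 multiprocessors but only 2 waves on 4, without any change to the program itself.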

This document is organized into the following sections:

Introduction is a general introduction to CUDA.
Programming Model outlines the CUDA programming model.
Programming Interface describes the programming interface.
Hardware Implementation describes the hardware implementation.
Performance Guidelines gives some guidance on how to achieve maximum performance.
CUDA-Enabled GPUs lists all CUDA-enabled devices.
C++ Language Extensions is a detailed description of all extensions to the C++ language.
Cooperative Groups describes synchronization primitives for various groups of CUDA threads.
CUDA Dynamic Parallelism describes how to launch and synchronize one kernel from another.
Virtual Memory Management describes how to manage the unified virtual address space.
Stream Ordered Memory Allocator describes how applications can order memory allocation and deallocation.
Graph Memory Nodes describes how graphs can create and own memory allocations.
Mathematical Functions lists the mathematical functions supported in CUDA.
C++ Language Support lists the C++ features supported in device code.
Texture Fetching gives more details on texture fetching.
Compute Capabilities gives the technical specifications of various devices, as well as more architectural details.
Driver API introduces the low-level driver API.
CUDA Environment Variables lists all the CUDA environment variables.
Unified Memory Programming introduces the Unified Memory programming model.

CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.

A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using the <<<...>>> execution configuration syntax (see C++ Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.

As an illustration, the following sample code, using the built-in variable threadIdx, adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    // Each thread handles the element selected by its own index
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads (one block of N threads)
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way. For a one-dimensional block, they are the same.
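
For two-dimensional and three-dimensional blocks, the relation is the usual linearization with x varying fastest: for a block of size (Dx, Dy, Dz), the thread with index (x, y, z) has ID x + y*Dx + z*Dx*Dy. The following is a minimal host-side sketch of that formula in plain C++; the struct Dim3 and the function linear_thread_id are hypothetical names used only for illustration and are not CUDA built-ins.

```cpp
#include <cassert>

// Plain C++ stand-in for CUDA's dim3/uint3; the name is hypothetical.
struct Dim3 { unsigned x, y, z; };

// Linear thread ID of the thread with index idx inside a block of
// size dim: x + y * Dx + z * Dx * Dy.
unsigned linear_thread_id(Dim3 idx, Dim3 dim)
{
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For a one-dimensional block (Dy = Dz = 1) the function returns idx.x unchanged, which matches the statement that index and ID coincide in that case.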

As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    // Two-dimensional thread index selects one matrix element
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same streaming multiprocessor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
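
The arithmetic behind these two statements can be sketched on the host in plain C++. The constant and function names below are hypothetical; only the 1024-thread figure comes from the text, and the sketch does not model any other per-dimension limits a device may impose.

```cpp
#include <cassert>

// Upper bound on threads per block quoted in the text for current GPUs.
const unsigned kMaxThreadsPerBlock = 1024;

// Threads in one block of shape (bx, by, bz); asserts the 1024 limit.
unsigned threads_per_block(unsigned bx, unsigned by, unsigned bz)
{
    unsigned n = bx * by * bz;
    assert(n >= 1 && n <= kMaxThreadsPerBlock);
    return n;
}

// Total threads launched: threads per block times number of blocks.
unsigned long long total_threads(unsigned threads_in_block,
                                 unsigned long long num_blocks)
{
    return static_cast<unsigned long long>(threads_in_block) * num_blocks;
}
```

For instance, a 16x16 block holds 256 threads, and a grid of 4096 such blocks launches 1048576 threads in total.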

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 4. The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number of processors in the system.

Figure 4. Grid of Thread Blocks

The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.

Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional unique index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.

Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    // Global element index: block offset plus index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
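
When the size is not evenly divisible, a common approach is to round the block count up and rely on the bounds check inside the kernel to idle the surplus threads. The following host-side sketch in plain C++ shows the rounding and emulates the guard in one dimension; blocks_to_cover and active_threads are hypothetical helper names, not CUDA functions.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with `threads` threads
// per block, rounding up so a partial final block is still launched.
unsigned blocks_to_cover(unsigned n, unsigned threads)
{
    assert(threads > 0);
    return (n + threads - 1) / threads;
}

// Count how many of the launched threads pass the i < n guard,
// emulating the bounds check a kernel would perform in one dimension.
unsigned active_threads(unsigned n, unsigned threads)
{
    unsigned active = 0;
    unsigned blocks = blocks_to_cover(n, threads);
    for (unsigned block = 0; block < blocks; ++block)
        for (unsigned t = 0; t < threads; ++t)
            if (block * threads + t < n)  // same form as the kernel guard
                ++active;
    return active;
}
```

With n = 100 and 16 threads per block, 7 blocks (112 threads) are launched and exactly 100 of them pass the guard, so every element is processed once and no thread accesses memory out of range.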

Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 3, enabling programmers to write code that scales with the number of cores.
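
The independence requirement can be illustrated with a host-side emulation in plain C++, in which each "block" of a vector addition is an ordinary function call and the blocks are run serially in an arbitrary order. This is only a sketch of the idea, not how a GPU schedules work; run_block and vec_add_in_order are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of one VecAdd thread block: every "thread" t in
// block `block` handles element block * block_dim + t.
void run_block(int block, int block_dim, const std::vector<float>& a,
               const std::vector<float>& b, std::vector<float>& c)
{
    for (int t = 0; t < block_dim; ++t) {
        std::size_t i = static_cast<std::size_t>(block) * block_dim + t;
        if (i < c.size())
            c[i] = a[i] + b[i];
    }
}

// Run the blocks serially in the order given by `order`. Because no
// block reads what another block writes, any permutation of the block
// indices gives the same result.
std::vector<float> vec_add_in_order(const std::vector<int>& order,
                                    int block_dim,
                                    const std::vector<float>& a,
                                    const std::vector<float>& b)
{
    assert(a.size() == b.size() && block_dim > 0);
    std::vector<float> c(a.size(), 0.0f);
    for (int block : order)
        run_block(block, block_dim, a, b, c);
    return c;
}
```

Running the four blocks of an 8-element addition as {0, 1, 2, 3} or as {3, 1, 0, 2} produces identical output, which is the property that lets the hardware pick any schedule.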

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.

For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.

With the introduction of NVIDIA Compute Capability 9.0, the CUDA programming model introduces an optional level of hierarchy called Thread Block Clusters that are made up of thread blocks. Similar to how threads in a thread block are guaranteed to be co-scheduled on a streaming multiprocessor, thread blocks in a cluster are also guaranteed to be co-scheduled on a GPU Processing Cluster (GPC) in the GPU.

Similar to thread blocks, clusters are also organized into a one-dimensional, two-dimensional, or three-dimensional layout as illustrated by Figure 5. The number of thread blocks in a cluster can be user-defined, and a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. A thread block cluster size beyond 8 is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.

Figure 5. Grid of Thread Block Clusters

Note. In a kernel launched using cluster support, the gridDim variable still denotes the size in terms of number of thread blocks, for compatibility purposes. The rank of a block in a cluster can be found using the Cluster Group API.

A thread block cluster can be enabled either by a compile-time kernel attribute using __cluster_dims__(X,Y,Z) or by using the CUDA kernel launch API cudaLaunchKernelEx. The example below shows how to launch a cluster using a compile-time kernel attribute. The cluster size using the kernel attribute is fixed at compile time, and the kernel can then be launched using the classical <<< , >>> syntax. If a kernel uses a compile-time cluster size, the cluster size cannot be modified when launching the kernel.

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}
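
The two constraints stated for cluster launches (at most 8 thread blocks per cluster for a portable size, and a grid dimension that is a multiple of the cluster size) can be checked on the host before launching. The following plain C++ sketch is an illustration only; Dim3 and is_valid_cluster_launch are hypothetical names, and the check does not query any architecture-specific limits.

```cpp
#include <cassert>

// Plain C++ stand-in for CUDA's dim3; the name is hypothetical.
struct Dim3 { unsigned x, y, z; };

// Portable cluster size limit stated in the text: at most 8 blocks.
const unsigned kPortableClusterBlocks = 8;

// True when `cluster` is a portable cluster shape and `grid` (counted in
// thread blocks) is a whole multiple of it in every dimension.
bool is_valid_cluster_launch(Dim3 grid, Dim3 cluster)
{
    assert(cluster.x > 0 && cluster.y > 0 && cluster.z > 0);
    unsigned blocks_in_cluster = cluster.x * cluster.y * cluster.z;
    return blocks_in_cluster <= kPortableClusterBlocks &&
           grid.x % cluster.x == 0 &&
           grid.y % cluster.y == 0 &&
           grid.z % cluster.z == 0;
}
```

With the cluster shape (2, 1, 1) used in the sample, a 4x4 grid of blocks passes the check while a 5x4 grid does not, because 5 is not a multiple of 2.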

A thread block cluster size can also be set at runtime and the kernel can be launched using the CUDA kernel launch API cudaLaunchKernelEx. The code example below shows how to launch a cluster kernel using the extensible API:

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

In GPUs with compute capability 9.0, all the thread blocks in the cluster are guaranteed to be co-scheduled on a single GPU Processing Cluster (GPC) and allow thread blocks in the cluster to perform hardware-supported synchronization using the Cluster Group API cluster.sync(). The cluster group also provides member functions to query the cluster group size in terms of number of threads or number of blocks using the num_threads() and num_blocks() API respectively. The rank of a thread or block in the cluster group can be queried using the dim_threads() and dim_blocks() API respectively.

Thread blocks that belong to a cluster have access to the Distributed Shared Memory. Thread blocks in a cluster have the ability to read, write, and perform atomics to any address in the distributed shared memory. Distributed Shared Memory gives an example of performing histograms in distributed shared memory.

As illustrated by Figure 7, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C++ program. This is the case, for example, when the kernels execute on a GPU and the rest of the C++ program executes on a CPU.

The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in Programming Interface). This includes device memory allocation and deallocation as well as data transfer between host and device memory.

Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single, coherent memory image with a common address space. This capability enables oversubscription of device memory and can greatly simplify the task of porting applications by eliminating the need to explicitly mirror data on host and device. See Unified Memory Programming for an introduction to Unified Memory.

Figure 7. Heterogeneous Programming

Note. Serial code executes on the host while parallel code executes on the device.

An asynchronous operation is defined as an operation that is initiated by a CUDA thread and is executed asynchronously as-if by another thread. In a well-formed program one or more CUDA threads synchronize with the asynchronous operation. The CUDA thread that initiated the asynchronous operation is not required to be among the synchronizing threads.

Such an asynchronous thread (an as-if thread) is always associated with the CUDA thread that initiated the asynchronous operation. An asynchronous operation uses a synchronization object to synchronize the completion of the operation. Such a synchronization object can be explicitly managed by a user (e.g., cuda::memcpy_async) or implicitly managed within a library (e.g., cooperative_groups::memcpy_async).

A synchronization object could be a cuda::barrier or a cuda::pipeline. These objects are explained in detail in Asynchronous Barrier and Asynchronous Data Copies using cuda::pipeline. These synchronization objects can be used at different thread scopes. A scope defines the set of threads that may use the synchronization object to synchronize with the asynchronous operation. The following table defines the thread scopes available in CUDA C++ and the threads that can be synchronized with each.

Thread Scope                                Description
cuda::thread_scope::thread_scope_thread     Only the CUDA thread which initiated asynchronous operations synchronizes.
cuda::thread_scope::thread_scope_block      All or any CUDA threads within the same thread block as the initiating thread synchronize.
cuda::thread_scope::thread_scope_device     All or any CUDA threads in the same GPU device as the initiating thread synchronize.
cuda::thread_scope::thread_scope_system     All or any CUDA or CPU threads in the same system as the initiating thread synchronize.

These thread scopes are implemented as extensions to standard C++ in the CUDA Standard C++ library.

The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU

The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.

Devices with the same major revision number are of the same core architecture. The major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8 for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, and 3 for devices based on the Kepler architecture.

The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

Turing is the architecture for devices of compute capability 7.5, and is an incremental update based on the Volta architecture.

CUDA-Enabled GPUs lists all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.

Note. The compute capability version of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation.

The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.

CUDA C++ provides a simple path for users familiar with the C++ programming language to easily write programs for execution by the device

It consists of a minimal set of extensions to the C++ language and a runtime library

The core language extensions have been introduced in Programming Model. They allow programmers to define a kernel as a C++ function and use some new syntax to specify the grid and block dimension each time the function is called. A complete description of all extensions can be found in C++ Language Extensions. Any source file that contains some of these extensions must be compiled with nvcc as outlined in Compilation with NVCC

The runtime is introduced in CUDA Runtime. It provides C and C++ functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. A complete description of the runtime can be found in the CUDA reference manual.

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts (the analogue of host processes for the device) and CUDA modules (the analogue of dynamically loaded libraries for the device). Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. As the runtime is interoperable with the driver API, most applications that need some driver API features can default to use the runtime API and only use the driver API where needed. The driver API is introduced in Driver API and fully described in the reference manual.

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc's basic workflow consists in separating device code from host code and then:

compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),
and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more details in Execution Configuration) by the necessary CUDA runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C++ code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.

Applications can then:

Either link to the compiled host code (this is the most common case),
Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

Any PTX or NVVM IR code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility

When the device driver just-in-time compiles some PTX or NVVM IR code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache - referred to as compute cache - is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver

Environment variables are available to control just-in-time compilation as described in CUDA Environment Variables

As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation library for CUDA C++; more information can be found in the NVRTC User guide

To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C++ application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

embeds binary code compatible with compute capability 5.0 and 6.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 7.0 (third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:

5.0 binary code for devices with compute capability 5.0 and 5.2,
6.0 binary code for devices with compute capability 6.0 and 6.1,
7.0 binary code for devices with compute capability 7.0 and 7.5,
PTX code which is compiled to binary code at runtime for devices with compute capability 8.0 and 8.6.
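
The selection listed above can be reproduced with a small host-side model in plain C++. It is a sketch under two assumed compatibility rules (a binary built for X.y runs on devices X.z with z >= y, and PTX built for one capability can be just-in-time compiled for any equal or higher capability); it is not the code that nvcc generates. The function select_code and its integer encoding of compute capabilities are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Compute capability encoded as major * 10 + minor, e.g. 7.5 -> 75.
// Assumed rules: a binary built for X.y runs on devices X.z with z >= y;
// PTX built for a capability can be JIT-compiled for any equal or higher one.
std::string select_code(int device_cc, const std::vector<int>& binaries,
                        const std::vector<int>& ptx)
{
    assert(device_cc >= 10);

    // Prefer the newest embedded binary of the same major revision.
    int best_bin = -1;
    for (int cc : binaries)
        if (cc / 10 == device_cc / 10 && cc <= device_cc && cc > best_bin)
            best_bin = cc;
    if (best_bin >= 0)
        return "binary " + std::to_string(best_bin);

    // Otherwise fall back to the newest PTX that the device can compile.
    int best_ptx = -1;
    for (int cc : ptx)
        if (cc <= device_cc && cc > best_ptx)
            best_ptx = cc;
    if (best_ptx >= 0)
        return "ptx " + std::to_string(best_ptx);

    return "none";
}
```

With binaries {50, 60, 70} and PTX {70}, as embedded by the nvcc command above, the model returns "binary 70" for a 7.5 device and "ptx 70" for an 8.6 device, in line with the list.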

x.cu can have an optimized code path that uses warp shuffle operations, for example, which are only supported in devices of compute capability 3.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_35 for example, __CUDA_ARCH__ is equal to 350.
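
The value of __CUDA_ARCH__ follows the pattern major * 100 + minor * 10, of which 350 for compute capability 3.5 is one instance. The host-side sketch below, in plain C++, computes that value and applies the warp shuffle threshold mentioned above; cuda_arch_value and can_use_warp_shuffle are hypothetical helper names, and in real device code the comparison would be made with the preprocessor against the macro itself.

```cpp
#include <cassert>

// Value of __CUDA_ARCH__ for compute capability major.minor,
// following the pattern in the text (3.5 -> 350).
int cuda_arch_value(int major, int minor)
{
    assert(major >= 1 && minor >= 0 && minor <= 9);
    return major * 100 + minor * 10;
}

// The warp shuffle path mentioned in the text needs compute capability 3.0+.
bool can_use_warp_shuffle(int major, int minor)
{
    return cuda_arch_value(major, minor) >= 300;
}
```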

Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.

The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

The nvcc user manual lists various shorthands for the -arch, -code, and -gencode compiler options. For example, -arch=sm_70 is a shorthand for -arch=compute_70 -code=compute_70,sm_70 (which is the same as -gencode arch=compute_70,code=\"compute_70,sm_70\").

The runtime is implemented in the cudart library, which is linked to the application, either statically via cudart.lib or libcudart.a, or dynamically via cudart.dll or libcudart.so. Applications that require cudart.dll and/or cudart.so for dynamic linking typically include them as part of the application installation package. It is only safe to pass the address of CUDA runtime symbols between components that link to the same instance of the CUDA runtime.

All its entry points are prefixed with cuda

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Device Memory gives an overview of the runtime functions used to manage device memory

Shared Memory illustrates the use of shared memory, introduced in Thread Hierarchy, to maximize performance

Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory

Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system

Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host

Error Checking describes how to properly check the errors generated by the runtime

Call Stack mentions the runtime functions used to manage the CUDA C++ call stack

Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware

Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D

There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the error handling and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.

The runtime creates a CUDA context for each device in the system (see Context for more details on CUDA contexts). This context is the primary context for this device and is initialized at the first runtime function which requires an active context on this device. It is shared among all the host threads of the application. As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory. This all happens transparently. If needed, for example, for driver API interoperability, the primary context of a device can be accessed from the driver API as described in Interoperability between Runtime and Driver APIs.

When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on (i.e., the current device as defined in Device Selection). The next runtime function call made by any host thread that has this device as current will create a new primary context for this device.

Note. The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation or during termination after main will result in undefined behavior.

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory

Device memory can be allocated either as linear memory or as CUDA arrays

CUDA arrays are opaque memory layouts optimized for texture fetching. They are described in Texture and Surface Memory

Linear memory is allocated in a single unified address space, which means that separately allocated entities can reference one another via pointers, for example, in a binary tree or linked list. The size of the address space depends on the host system (CPU) and the compute capability of the used GPU.

Table 1. Linear Memory Address Space

                                            x86_64 (AMD64)    POWER (ppc64le)    ARM64
up to compute capability 5.3 (Maxwell)      40bit             40bit              40bit
compute capability 6.0 (Pascal) or newer    up to 47bit       up to 49bit        up to 48bit

Note. On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates an uncommitted 40bit virtual address reservation to ensure that memory allocations (pointers) fall into the supported range. This reservation appears as reserved virtual memory, but does not occupy any physical memory until the program actually allocates memory.

Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are typically done using cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be copied from host memory to device memory:

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays because they ensure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, and therefore ensure best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // Rows are pitch bytes apart, so step through the array as bytes
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

The following code sample allocates a width x height x depth 3D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    // ptr is a void*, so it must be cast before doing byte arithmetic
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

Note. To avoid allocating too much memory and thus impacting system-wide performance, request the allocation parameters from the user based on the problem size. If the allocation fails, you can fall back to other slower memory types (cudaMallocHost(), cudaHostRegister(), etc.), or return an error telling the user how much memory was requested and denied. If your application cannot request the allocation parameters for some reason, we recommend using cudaMallocManaged() for platforms that support it.

The reference manual lists all the various functions used to copy memory between linear memory allocated with cudaMalloc(), linear memory allocated with cudaMallocPitch() or cudaMalloc3D(), CUDA arrays, and memory allocated for variables declared in global or constant memory space.

The following code sample illustrates various ways of accessing global variables via the runtime API:

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

cudaGetSymbolAddress() is used to retrieve the address pointing to the memory allocated for a variable declared in global memory space. The size of the allocated memory is obtained through cudaGetSymbolSize().
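As a rough illustration of these two calls, the sketch below looks up the address and size of a device variable and then uses them with ordinary runtime functions. The variable devTable and the helper zeroDeviceSymbol() are hypothetical names introduced only for this example, and error handling is reduced to simple status checks.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical device variable in global memory space (illustration only).
__device__ float devTable[64];

// Hypothetical helper: zero-fills devTable through the address returned by
// cudaGetSymbolAddress() and reads back its first element.
// Returns the symbol size in bytes, or 0 if any CUDA call failed.
size_t zeroDeviceSymbol(float* firstElement)
{
    void*  address = nullptr;
    size_t bytes   = 0;

    // Address of the memory allocated for the symbol
    if (cudaGetSymbolAddress(&address, devTable) != cudaSuccess) return 0;
    // Size of the memory allocated for the symbol
    if (cudaGetSymbolSize(&bytes, devTable) != cudaSuccess) return 0;

    // The address can be used like a pointer obtained from cudaMalloc()
    if (cudaMemset(address, 0, bytes) != cudaSuccess) return 0;
    if (cudaMemcpy(firstElement, address, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 0;
    return bytes;
}
```

Calling zeroDeviceSymbol(&x) from host code should return 64 * sizeof(float) and leave x equal to 0.0f.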

An access policy window specifies a contiguous region of global memory and a persistence property in the L2 cache for accesses within that region.

The code example below shows how to set an L2 persisting access window using a CUDA Stream:

CUDA Stream Example

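cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persisting accesses
                                                                              // (must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for the L2 cache hit ratio within the window
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);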

When a kernel subsequently executes in the CUDA stream, memory accesses within the global memory extent [ptr..ptr+num_bytes) are more likely to persist in the L2 cache than accesses to other global memory locations.

L2 persistence can also be set for a CUDA Graph Kernel Node as shown in the example below:

CUDA GraphKernelNode Example

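cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persisting accesses
                                                                            // (must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for the L2 cache hit ratio within the window
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);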

The hitRatio parameter can be used to specify the fraction of accesses that receive the hitProp property. In both of the examples above, 60% of the memory accesses in the global memory region [ptr..ptr+num_bytes) have the persisting property and 40% of the memory accesses have the streaming property. Which specific memory accesses are classified as persisting (the hitProp) is random with a probability of approximately hitRatio; the probability distribution depends upon the hardware architecture and the memory extent.

For example, if the L2 set-aside cache size is 16KB and the num_bytes in the accessPolicyWindow is 32KB:

With a hitRatio of 0.5, the hardware will select, at random, 16KB of the 32KB window to be designated as persisting and cached in the set-aside L2 cache area.
With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window in the set-aside L2 cache area. Since the set-aside area is smaller than the window, cache lines will be evicted to keep the most recently used 16KB of the 32KB data in the set-aside portion of the L2 cache.

The hitRatio can therefore be used to avoid thrashing of cache lines and to reduce the overall amount of data moved into and out of the L2 cache.

A hitRatio value below 1.0 can be used to manually control the amount of data that different accessPolicyWindows from concurrent CUDA streams can cache in L2. For example, let the L2 set-aside cache size be 16KB; two concurrent kernels in two different CUDA streams, each with a 16KB accessPolicyWindow, and both with a hitRatio value of 1.0, might evict each other's cache lines when competing for the shared L2 resource. However, if both accessPolicyWindows have a hitRatio value of 0.5, they will be less likely to evict their own or each other's persisting cache lines.
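The arithmetic behind these examples can be sketched in a few lines of host code. Both helpers below are hypothetical and are not part of the CUDA API; they only model the expected amount of persisting data as hitRatio times the window size, which is a simplification of what the hardware actually does.

```cuda
#include <cstddef>

// Hypothetical helper (not a CUDA API): expected number of bytes of an access
// policy window that are treated as persisting for a given hitRatio.
inline size_t expectedPersistingBytes(size_t windowBytes, double hitRatio)
{
    return static_cast<size_t>(windowBytes * hitRatio);
}

// Hypothetical helper (not a CUDA API): true if the expected persisting data
// of all windows fits in the L2 set-aside area, in which case the windows are
// less likely to evict each other's cache lines.
inline bool fitsInSetAside(size_t setAsideBytes, const size_t* windowBytes,
                           const double* hitRatios, int numWindows)
{
    size_t total = 0;
    for (int i = 0; i < numWindows; ++i)
        total += expectedPersistingBytes(windowBytes[i], hitRatios[i]);
    return total <= setAsideBytes;
}
```

With a 16KB set-aside area, a single 32KB window at hitRatio 0.5 gives 16KB of expected persisting data, and two 16KB windows fit together at hitRatio 0.5 but not at 1.0.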

The following example shows how to set aside L2 cache for persistent accesses, use the set-aside L2 cache in CUDA kernels via a CUDA Stream, and then reset the L2 cache:

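cudaStream_t stream;
cudaStreamCreate(&stream);                                                   // Create CUDA stream

cudaDeviceProp prop;                                                         // CUDA device properties variable
cudaGetDeviceProperties(&prop, device_id);                                   // Query GPU properties
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);                    // Set aside 3/4 of L2 cache for persisting accesses, or the max allowed

size_t window_size = min((size_t)prop.accessPolicyMaxWindowSize, num_bytes); // Minimum of user-defined num_bytes and max window size

cudaStreamAttrValue stream_attribute;                                        // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1); // Global memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                 // Number of bytes for persisting accesses
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                         // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Persistence property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming; // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute); // Set the attributes to a CUDA Stream

for (int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size, block_size, 0, stream>>>(data1);               // data1 is used by a kernel multiple times,
}                                                                            // so [data1..data1+num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size, block_size, 0, stream>>>(data1);                   // A different kernel in the same stream can also benefit
                                                                             // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;                           // Setting the window size to 0 disables it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute); // Overwrite the access policy attribute of the CUDA Stream
cudaCtxResetPersistingL2Cache();                                             // Remove any persistent lines in L2

cuda_kernelC<<<grid_size, block_size, 0, stream>>>(data2);                   // data2 can now benefit from the full L2 in normal mode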

As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.

Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad memory (or software managed cache) to minimize global memory accesses from a CUDA block as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 8. A is therefore read B.width times from global memory and B is read A.height times.

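// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Matrix multiplication kernel
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

int main()
{
    ...
    // Kernel invocation on matrices d_A, d_B and d_C held in device memory
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(d_B.width / dimBlock.x, d_A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    ...
}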

Figure 8. Matrix Multiplication without Shared Memory

The following code sample is an implementation of matrix multiplication that does take advantage of shared memory. In this implementation, each thread block is responsible for computing one square sub-matrix Csub of C and each thread within the block is responsible for computing one element of Csub. As illustrated in Figure 9, Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (A.width, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, A.width) that has the same column indices as Csub. In order to fit into the device's resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and once done writes the result to global memory.

By blocking the computation this way, we take advantage of fast shared memory and save a lot of global memory bandwidth since A is only read (B.width / block_size) times from global memory and B is read (A.height / block_size) times.

The Matrix type from the previous code sample is augmented with a stride field, so that sub-matrices can be efficiently represented with the same type. __device__ functions are used to get and set elements and build any sub-matrix from a matrix.

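// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}

// Matrix multiplication kernel
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrices Asub of A and Bsub of B
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Make sure the sub-matrices are loaded before computing
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Make sure the computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory; each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}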

Figure 9. Matrix Multiplication with Shared Memory

Thread block clusters, introduced in compute capability 9.0, provide the ability for threads in a thread block cluster to access the shared memory of all the participating thread blocks in the cluster. This partitioned shared memory is called Distributed Shared Memory, and the corresponding address space is called the Distributed Shared Memory address space. Threads that belong to a thread block cluster can read, write, or perform atomics in the distributed address space, regardless of whether the address belongs to the local thread block or a remote thread block. Whether or not a kernel uses distributed shared memory, the shared memory size specifications, static or dynamic, are still per thread block. The size of distributed shared memory is just the number of thread blocks per cluster multiplied by the size of shared memory per thread block.

Accessing data in distributed shared memory requires all the thread blocks to exist. A user can guarantee that all thread blocks have started executing by using cluster.sync() from the Cluster Group API. The user also needs to ensure that all distributed shared memory operations are completed before a thread block exits.

CUDA provides a mechanism to access distributed shared memory, and applications can benefit from leveraging its capabilities. Let's look at a simple histogram computation and how to optimize it on the GPU using thread block clusters. A standard way of computing histograms is to do the computation in the shared memory of each thread block and then perform global memory atomics. A limitation of this approach is the shared memory capacity. Once the histogram bins no longer fit in shared memory, a user needs to compute the histogram, and hence the atomics, directly in global memory. With distributed shared memory, CUDA provides an intermediate step where, depending on the number of histogram bins, the histogram can be computed in shared memory, in distributed shared memory, or directly in global memory.

The CUDA kernel example below shows how to compute histograms in shared memory or distributed shared memory, depending on the number of histogram bins:

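#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Distributed shared memory histogram kernel
__global__ void clusterHist_kernel(int* bins, const int nbins, const int bins_per_block,
                                   const int* __restrict__ input, size_t array_size)
{
    extern __shared__ int smem[];
    int tid = cg::this_grid().thread_rank();
    cg::cluster_group cluster = cg::this_cluster();

    // Initialize this block's shared memory histogram to zeros
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        smem[i] = 0;

    // Ensures shared memory is zeroed in every block of the cluster and that
    // all thread blocks have started executing and exist concurrently
    cluster.sync();

    for (int i = tid; i < array_size; i += blockDim.x * gridDim.x) {
        // Find the right histogram bin, clamping out-of-range values
        int binid = min(max(input[i], 0), nbins - 1);

        // Destination block rank and offset within the distributed histogram
        int dst_block_rank = binid / bins_per_block;
        int dst_offset     = binid % bins_per_block;

        // Pointer to the target block's shared memory, then atomic update
        int* dst_smem = cluster.map_shared_rank(smem, dst_block_rank);
        atomicAdd(dst_smem + dst_offset, 1);
    }

    // Ensures all distributed shared memory operations have completed and no
    // thread block exits while others still access its shared memory
    cluster.sync();

    // Accumulate the cluster-local histogram into the global histogram
    int* lbins = bins + cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        atomicAdd(&lbins[i], smem[i]);
}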

The above kernel can be launched at runtime with a cluster size depending on the amount of distributed shared memory required. If the histogram is small enough to fit in the shared memory of just one block, the user can launch the kernel with cluster size 1. The code snippet below shows how to launch a cluster kernel dynamically depending on shared memory requirements:

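// Launch via extensible launch
{
    cudaLaunchConfig_t config = {0};
    config.gridDim  = array_size / threads_per_block;
    config.blockDim = threads_per_block;

    // cluster_size depends on the histogram size.
    // cluster_size == 1 implies no distributed shared memory,
    // just thread block local shared memory
    int cluster_size = 2; // size 2 is an example here
    int nbins_per_block = nbins / cluster_size;

    // Dynamic shared memory size is per block.
    // Distributed shared memory size = cluster_size * nbins_per_block * sizeof(int)
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);

    cudaFuncSetAttribute((void*)clusterHist_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         config.dynamicSmemBytes);

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = cluster_size;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;

    config.numAttrs = 1;
    config.attrs = attribute;

    cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins,
                       nbins_per_block, input, array_size);
}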

The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()):

cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
cudaHostRegister() page-locks a range of memory allocated by malloc() (see reference manual for limitations).

Using page-locked host memory has several benefits:

Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.
On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory as detailed in Mapped Memory.
On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining as described in Write-Combining Memory.

Page-locked host memory is a scarce resource, however, so allocations in page-locked memory will start failing long before allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for paging, consuming too much page-locked memory reduces overall system performance.

Note. Page-locked host memory is not cached on non I/O coherent Tegra devices. Also, cudaHostRegister() is not supported on non I/O coherent Tegra devices.

The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.

A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel does not provide the same bandwidth as device memory, but does have some advantages:

There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

Since mapped page-locked memory is shared between host and device, however, the application must synchronize memory accesses using streams or events (see Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability by checking the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for devices that support mapped page-locked host memory.

Note that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.

Also note that the CUDA runtime requires that 1-byte, 2-byte, 4-byte, and 8-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. As an example, the CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned writes into two 4-byte writes between the device and the host.
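The sequence of calls described above for mapped page-locked memory might look roughly like the following sketch. The kernel scaleInPlace() and the helper scaleMappedBuffer() are hypothetical names used only for illustration, device 0 is assumed, and the helper is assumed to issue the first CUDA calls of the process.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): scales each element in place.
__global__ void scaleInPlace(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: fills a mapped page-locked buffer with 1.0f, scales it
// from a kernel through its device pointer, and returns the first element.
// Returns -1.0f if mapping is unsupported or a CUDA call fails.
float scaleMappedBuffer(int n, float factor)
{
    // Enable mapping before any other CUDA call is performed
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1.0f;

    // Check that the device supports mapped page-locked host memory
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;
    if (prop.canMapHostMemory != 1) return -1.0f;

    // Allocate page-locked host memory that is mapped into the device address space
    float* hostPtr = nullptr;
    if (cudaHostAlloc((void**)&hostPtr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return -1.0f;
    for (int i = 0; i < n; ++i)
        hostPtr[i] = 1.0f;

    // Retrieve the device-side address of the same block
    float* devPtr = nullptr;
    cudaError_t err = cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);
    if (err == cudaSuccess) {
        scaleInPlace<<<(n + 255) / 256, 256>>>(devPtr, n, factor);
        err = cudaGetLastError();
        // Synchronize before the host reads data written by the device
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
    }

    float result = (err == cudaSuccess) ? hostPtr[0] : -1.0f;
    cudaFreeHost(hostPtr);
    return result;
}
```

No cudaMalloc() or cudaMemcpy() appears here: the kernel reads and writes host memory directly, and the cudaDeviceSynchronize() call is what prevents the host from reading the buffer while the device may still be writing it.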

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks. The following device operations are asynchronous with respect to the host:

Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy():

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically once the device has completed all work in the stream.

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.

Note. #define CUDA_API_PER_THREAD_DEFAULT_STREAM 1 cannot be used to enable this behavior when the code is compiled by nvcc, as nvcc implicitly includes cuda_runtime.h at the top of the translation unit. In this case the --default-stream per-thread compilation flag needs to be used or the CUDA_API_PER_THREAD_DEFAULT_STREAM macro needs to be defined with the -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 compiler flag.

For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.

For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

a page-locked host memory allocation,
a device memory allocation,
a device memory set,
a memory copy between two addresses to the same device memory,
any CUDA command to the NULL stream,
a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x.

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:

Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:

All independent operations should be issued before dependent operations,
Synchronization of any kind should be delayed as long as possible.

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution (see Overlap of Data Transfer and Kernel Execution), concurrent kernel execution (see Concurrent Kernel Execution), and/or concurrent data transfers (see Concurrent Data Transfers).

For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution):

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);

then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].

On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: the memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution). However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. If the code is rewritten as above, the kernel executions overlap (assuming the device supports concurrent kernel execution) since the second kernel launch is issued to stream[1] before the memory copy from device to host is issued to stream[0]. In that case, however, the memory copy from device to host issued to stream[0] only overlaps with the last thread blocks of the kernel launch issued to stream[1] as per Implicit Synchronization, which can represent only a small portion of the total execution time of the kernel.

The runtime provides a way to insert a CPU function call at any point into a stream via cudaLaunchHostFunc(). The provided function is executed on the host once all commands issued to the stream before the callback have completed.

The following code sample adds the host function MyCallback to each of two streams after issuing a host-to-device memory copy, a kernel launch and a device-to-host memory copy into each stream. The function will begin execution on the host after each of the device-to-host memory copies completes:

void CUDART_CB MyCallback(void* data)
{
    printf("Inside callback %zu\n", (size_t)data);
}
...
for (size_t i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
    cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
    cudaLaunchHostFunc(stream[i], MyCallback, (void*)i);
}

The commands that are issued in a stream after a host function do not start executing before the function has completed.

A host function enqueued into a stream must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself if it makes such a call, leading to a deadlock.

CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.

To see the optimizations possible with graphs, consider what happens in a stream: when you place a kernel into a stream, the host driver performs a sequence of operations in preparation for the execution of the kernel on the GPU. These operations, necessary for setting up and launching the kernel, are an overhead cost which must be paid for each kernel that is issued. For a GPU kernel with a short execution time, this overhead cost can be a significant fraction of the overall end-to-end execution time.

Work submission using graphs is separated into three distinct stages. definition, instantiation, and execution

During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them
Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph
An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation
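The three stages can be sketched with the explicit graph API. This is a minimal illustration rather than a reference implementation: the kernel `addOne`, the buffer size, and the launch count are hypothetical, error checking is omitted for brevity, and the five-argument form of `cudaGraphInstantiate()` is assumed to be available in the installed toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: adds 1 to each element.
__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    int n = 1024;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // 1. Definition: describe the work as a graph with a single kernel node.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    void* kernelArgs[] = { &d_data, &n };
    cudaKernelNodeParams params = {};
    params.func = (void*)addOne;
    params.gridDim = dim3((n + 255) / 256);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = kernelArgs;
    params.extra = nullptr;

    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params);

    // 2. Instantiation: validate the template and build an executable graph.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // 3. Execution: launch the same executable graph several times.
    for (int iter = 0; iter < 10; ++iter)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    float first = 0.0f;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] after the launches: %f\n", first);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

Instantiation happens once, outside the loop, so each iteration only pays the cost of `cudaGraphLaunch()`.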

Stream capture provides a mechanism to create a graph from existing stream-based APIs. A section of code which launches work into streams, including existing code, can be bracketed with calls to cudaStreamBeginCapture() and cudaStreamEndCapture(). See below.

cudaGraph_t graph;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);

cudaStreamEndCapture(stream, &graph);

A call to cudaStreamBeginCapture() places a stream in capture mode. When a stream is being captured, work launched into the stream is not enqueued for execution. It is instead appended to an internal graph that is progressively being built up. This graph is then returned by calling cudaStreamEndCapture(), which also ends capture mode for the stream. A graph which is actively being constructed by stream capture is referred to as a capture graph.

Stream capture can be used on any CUDA stream except cudaStreamLegacy (the “NULL stream”). Note that it can be used on cudaStreamPerThread. If a program is using the legacy stream, it may be possible to redefine stream 0 to be the per-thread stream with no functional change. See Default Stream.

Whether a stream is being captured can be queried with cudaStreamIsCapturing().
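As a rough sketch of such a query, the program below checks the capture status of a stream before, during, and after a capture. The helper `report` and the captured memset are illustrative assumptions, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints whether the stream is currently being captured.
static void report(const char* when, cudaStream_t stream)
{
    cudaStreamCaptureStatus status;
    cudaStreamIsCapturing(stream, &status);
    printf("%s: %s\n", when,
           status == cudaStreamCaptureStatusActive ? "capturing" : "not capturing");
}

int main()
{
    // Allocate before the capture starts; only the asynchronous memset is captured.
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    report("before capture", stream);

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(d_buf, 0, 256 * sizeof(float), stream);  // appended to the capture graph
    report("during capture", stream);

    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);
    report("after capture", stream);

    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```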

Stream capture can handle cross-stream dependencies expressed with cudaEventRecord() and cudaStreamWaitEvent(), provided the event being waited upon was recorded into the same capture graph.

When an event is recorded in a stream that is in capture mode, it results in a captured event. A captured event represents a set of nodes in a capture graph.

When a captured event is waited on by a stream, it places the stream in capture mode if it is not already, and the next item in the stream will have additional dependencies on the nodes in the captured event. The two streams are then being captured to the same capture graph.

When cross-stream dependencies are present in stream capture, cudaStreamEndCapture() must still be called in the same stream where cudaStreamBeginCapture() was called; this is the origin stream. Any other streams which are being captured to the same capture graph, due to event-based dependencies, must also be joined back to the origin stream. This is illustrated below. All streams being captured to the same capture graph are taken out of capture mode upon cudaStreamEndCapture(). Failure to rejoin to the origin stream will result in failure of the overall capture operation.

// stream1 is the origin stream
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

kernel_A<<< ..., stream1 >>>(...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(...);
kernel_C<<< ..., stream2 >>>(...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

kernel_D<<< ..., stream1 >>>(...);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);

// stream1 and stream2 no longer in capture mode

The graph returned by the above code is shown in Figure 11.

Note: When a stream is taken out of capture mode, the next non-captured item in the stream (if any) will still have a dependency on the most recent prior non-captured item, despite intermediate items having been removed.

It is invalid to synchronize or query the execution status of a stream which is being captured or a captured event, because they do not represent items scheduled for execution. It is also invalid to query the execution status of or synchronize a broader handle which encompasses an active stream capture, such as a device or context handle when any associated stream is in capture mode.

When any stream in the same context is being captured, and it was not created with cudaStreamNonBlocking, any attempted use of the legacy stream is invalid. This is because the legacy stream handle at all times encompasses these other streams; enqueueing to the legacy stream would create a dependency on the streams being captured, and querying it or synchronizing it would query or synchronize the streams being captured.

It is therefore also invalid to call synchronous APIs in this case. Synchronous APIs, such as cudaMemcpy(), enqueue work to the legacy stream and synchronize it before returning.

Note: As a general rule, when a dependency relation would connect something that is captured with something that was not captured and instead enqueued for execution, CUDA prefers to return an error rather than ignore the dependency. An exception is made for placing a stream into or out of capture mode; this severs a dependency relation between items added to the stream immediately before and after the mode transition.

It is invalid to merge two separate capture graphs by waiting on a captured event from a stream which is being captured and is associated with a different capture graph than the event. It is invalid to wait on a non-captured event from a stream which is being captured without specifying the cudaEventWaitExternal flag.

A small number of APIs that enqueue asynchronous operations into streams are not currently supported in graphs and will return an error if called with a stream which is being captured, such as cudaStreamAttachMemAsync().

Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. In situations where the workflow is not changing, the overhead of definition and instantiation can be amortized over many executions, and graphs provide a clear advantage over streams.

A graph is a snapshot of a workflow, including kernels, parameters, and dependencies, in order to replay it as rapidly and efficiently as possible. In situations where the workflow changes, the graph becomes out of date and must be modified. Major changes to graph structure, such as topology or types of nodes, will require re-instantiation of the source graph because various topology-related optimization techniques must be re-applied.

The cost of repeated instantiation can reduce the overall performance benefit from graph execution, but it is common for only node parameters, such as kernel parameters and cudaMemcpy addresses, to change while graph topology remains the same. For this case, CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. This is much more efficient than re-instantiation.

Updates will take effect the next time the graph is launched, so they will not impact previous graph launches, even if they are running at the time of the update. A graph may be updated and relaunched repeatedly, so multiple updates/launches can be queued on a stream.

CUDA provides two mechanisms for updating instantiated graph parameters: whole graph update and individual node update. Whole graph update allows the user to supply a topologically identical cudaGraph_t object whose nodes contain updated parameters. Individual node update allows the user to explicitly update the parameters of individual nodes. Using an updated cudaGraph_t is more convenient when a large number of nodes are being updated, or when the graph topology is unknown to the caller (i.e., the graph resulted from stream capture of a library call). Using individual node update is preferred when the number of changes is small and the user has the handles to the nodes requiring updates. Individual node update skips the topology checks and comparisons for unchanged nodes, so it can be more efficient in many cases.

CUDA also provides a mechanism for enabling and disabling individual nodes without affecting their current parameters.

The following sections explain each approach in more detail.

cudaGraphExecUpdate() allows an instantiated graph (the "original graph") to be updated with the parameters from a topologically identical graph (the "updating" graph). The topology of the updating graph must be identical to the original graph used to instantiate the cudaGraphExec_t. In addition, the order in which nodes were added to, or removed from, the original graph must match the order in which the nodes were added to (or removed from) the updating graph. Therefore, when using stream capture, the nodes must be captured in the same order, and when using the explicit graph node creation APIs, all nodes must be added and/or deleted in the same order.

The following example shows how the API could be used to update an instantiated graph.

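cudaGraphExec_t graphExec = NULL;

for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // In this example we use stream capture to create the graph.
    // You can also use the Graph API to produce a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // Call a user-defined, stream based workload, for example
    do_cuda_work(stream);
    cudaStreamEndCapture(stream, &graph);

    // If we've already instantiated the graph, try to update it directly
    // and avoid the instantiation overhead
    if (graphExec != NULL) {
        // If the graph fails to update, errorNode will be set to the
        // node causing the failure and updateResult will be set to a
        // reason code.
        cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }

    // Instantiate during the first iteration or whenever the update
    // fails for any reason
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
        // If a previous update failed, destroy the cudaGraphExec_t
        // before re-instantiating it
        if (graphExec != NULL) {
            cudaGraphExecDestroy(graphExec);
        }
        // The error node and error message parameters are unused here.
        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }

    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}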

A typical workflow is to create the initial cudaGraph_t using either the stream capture or graph API. The cudaGraph_t is then instantiated and launched as normal. After the initial launch, a new cudaGraph_t is created using the same method as the initial graph and cudaGraphExecUpdate() is called. If the graph update is successful, indicated by the updateResult parameter in the above example, the updated cudaGraphExec_t is launched. If the update fails for any reason, cudaGraphExecDestroy() and cudaGraphInstantiate() are called to destroy the original cudaGraphExec_t and instantiate a new one.

It is also possible to update the cudaGraph_t nodes directly (i.e., using cudaGraphKernelNodeSetParams()) and subsequently update the cudaGraphExec_t; however, it is more efficient to use the explicit node update APIs covered in the next section.

Please see the Graph API for more information on usage and current limitations.

A kernel launch will fail if it is issued to a stream that is not associated to the current device, as illustrated in the following code sample.

size_t size = 1024 * sizeof(float);
cudaSetDevice(0);               // Set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0);          // Create stream s0 on device 0
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 0 in s0
cudaSetDevice(1);               // Set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1);          // Create stream s1 on device 1
MyKernel<<<100, 64, 0, s1>>>(); // Launch kernel on device 1 in s1

// This kernel launch will fail:
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 1 in s0

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

Depending on the system properties, specifically the PCIe and/or NVLINK topology, devices are able to address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.
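The capability query can be sketched as a loop over every ordered pair of devices. The output format is an illustrative choice and error checking is omitted; on a single-GPU system the loop body simply never runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Peer access is queried per direction, so both (a, b) and (b, a) are checked.
    for (int dev = 0; dev < deviceCount; ++dev) {
        for (int peer = 0; peer < deviceCount; ++peer) {
            if (dev == peer)
                continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            printf("device %d -> device %d: peer access %s\n",
                   dev, peer, canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```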

Peer-to-peer memory access is only supported in 64-bit applications and must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. On non-NVSwitch enabled systems, each device can support a system-wide maximum of eight peer connections.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.

cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0);   // Enable peer-to-peer access
                                    // with device 0

// Launch kernel on device 1
// This kernel launch can access memory on device 0 at address p0
MyKernel<<<1000, 128>>>(p0);

Memory copies can be performed between the memories of two different devices.

When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.

Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.

cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
cudaSetDevice(1);                   // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);              // Allocate memory on device 1
cudaSetDevice(0);                   // Set device 0 as current
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaMemcpyPeer(p1, 1, p0, 0, size); // Copy p0 to p1
MyKernel<<<1000, 128>>>(p1);        // Launch kernel on device 1

A copy (in the implicit NULL stream) between the memories of two different devices:

does not start until all commands previously issued to either device have completed and
runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. As a consequence:

The location of any memory on the host allocated through CUDA, or on any of the devices which use the unified address space, can be determined from the value of the pointer using cudaPointerGetAttributes().
When copying to or from the memory of any device which uses the unified address space, the cudaMemcpyKind parameter of cudaMemcpy*() can be set to cudaMemcpyDefault to determine locations from the pointers. This also works for host pointers not allocated through CUDA, as long as the current device uses unified addressing.
Allocations via cudaHostAlloc() are automatically portable (see Portable Memory) across all the devices for which the unified address space is used, and pointers returned by cudaHostAlloc() can be used directly from within kernels running on these devices (i.e., there is no need to obtain a device pointer via cudaHostGetDevicePointer() as described in Mapped Memory).

Applications may query if the unified address space is used for a particular device by checking that the unifiedAddressing device property (see Device Enumeration) is equal to 1.
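A minimal sketch of these queries follows. It assumes device 0 exists and a toolkit in which `cudaPointerAttributes` exposes the `type` field; the buffer sizes and printed text are illustrative, and error checking is omitted.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;  // assumption: at least one CUDA device is present
    cudaSetDevice(dev);

    // Check whether this device takes part in the unified address space.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("device %d unifiedAddressing = %d\n", dev, prop.unifiedAddressing);

    const size_t bytes = 256 * sizeof(float);
    float* d_buf = nullptr;
    float* h_buf = nullptr;
    cudaMalloc((void**)&d_buf, bytes);
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
    memset(h_buf, 0, bytes);

    // Determine where each allocation lives from the pointer value alone.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);
    printf("d_buf: %s memory on device %d\n",
           attr.type == cudaMemoryTypeDevice ? "device" : "other", attr.device);
    cudaPointerGetAttributes(&attr, h_buf);
    printf("h_buf: %s memory\n",
           attr.type == cudaMemoryTypeHost ? "host" : "other");

    // With unified addressing, the copy direction is inferred from the pointers.
    if (prop.unifiedAddressing)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```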

Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process. It is not valid outside this process, however, and therefore cannot be directly referenced by threads belonging to a different process.

To share device memory pointers and events across processes, an application must use the Inter Process Communication API, which is described in detail in the reference manual. The IPC API is only supported for 64-bit processes on Linux and for devices of compute capability 2.0 and higher. Note that the IPC API is not supported for cudaMallocManaged allocations.

Using this API, an application can get the IPC handle for a given device memory pointer using cudaIpcGetMemHandle(), pass it to another process using standard IPC mechanisms (for example, interprocess shared memory or files), and use cudaIpcOpenMemHandle() to retrieve a device pointer from the IPC handle that is a valid pointer within this other process. Event handles can be shared using similar entry points.

Note that allocations made by cudaMalloc() may be sub-allocated from a larger block of memory for performance reasons. In such case, CUDA IPC APIs will share the entire underlying memory block, which may cause other sub-allocations to be shared, which can potentially lead to information disclosure between processes. To prevent this behavior, it is recommended to only share allocations with a 2MiB aligned size.

An example of using the IPC API is where a single primary process generates a batch of input data, making the data available to multiple secondary processes without requiring regeneration or copying.

Applications using CUDA IPC to communicate with each other should be compiled, linked, and run with the same CUDA driver and runtime.

Note: Since CUDA 11.5, only events-sharing IPC APIs are supported on L4T and embedded Linux Tegra devices with compute capability 7.x and higher. The memory-sharing IPC APIs are still not supported on Tegra platforms.

All runtime functions return an error code, but for an asynchronous function (see Asynchronous Concurrent Execution), this error code cannot possibly report any of the asynchronous errors that could occur on the device since the function returns before the device has completed the task; the error code only reports errors that occur on the host prior to executing the task, typically related to parameter validation; if an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.

The only way to check for asynchronous errors just after some asynchronous function call is therefore to synchronize just after the call by calling cudaDeviceSynchronize() (or by using any other synchronization mechanisms described in Asynchronous Concurrent Execution) and checking the error code returned by cudaDeviceSynchronize().

The runtime maintains an error variable for each host thread that is initialized to cudaSuccess and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error). cudaPeekAtLastError() returns this variable. cudaGetLastError() returns this variable and resets it to cudaSuccess.

Kernel launches do not return any error code, so cudaPeekAtLastError() or cudaGetLastError() must be called just after the kernel launch to retrieve any pre-launch errors. To ensure that any error returned by cudaPeekAtLastError() or cudaGetLastError() does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to cudaSuccess just before the kernel launch, for example, by calling cudaGetLastError() just before the kernel launch. Kernel launches are asynchronous, so to check for asynchronous errors, the application must synchronize in-between the kernel launch and the call to cudaPeekAtLastError() or cudaGetLastError().

Note that cudaErrorNotReady that may be returned by cudaStreamQuery() and cudaEventQuery() is not considered an error and is therefore not reported by cudaPeekAtLastError() or cudaGetLastError().
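The checking sequence described above can be sketched as follows. The kernel `scale` and the problem size are hypothetical, and the sketch checks the return value of cudaDeviceSynchronize() directly to pick up asynchronous errors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: scales each element.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Reset the per-thread error variable so that earlier errors are not
    // mistaken for launch errors.
    cudaGetLastError();

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Pre-launch errors, for example an invalid launch configuration.
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Asynchronous errors raised while the kernel was running.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("asynchronous error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

Synchronizing after every launch serializes host and device, so a pattern like this is usually reserved for debugging builds.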

Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture by calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API.

The texture object or the texture reference specifies:

The texture, which is the piece of texture memory that is fetched. Texture objects are created at runtime and the texture is specified when creating the texture object as described in Texture Object API. Texture references are created at compile time and the texture is specified at runtime by binding the texture reference to the texture through runtime functions as described in [[DEPRECATED]] Texture Reference API; several distinct texture references might be bound to the same texture or to textures that overlap in memory. A texture can be any region of linear memory or a CUDA array (described in CUDA Arrays).
Its dimensionality, which specifies whether the texture is addressed as a one-dimensional array using one texture coordinate, a two-dimensional array using two texture coordinates, or a three-dimensional array using three texture coordinates. Elements of the array are called texels, short for texture elements. The texture width, height, and depth refer to the size of the array in each dimension. Table 15 lists the maximum texture width, height, and depth depending on the compute capability of the device.
The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in Built-in Vector Types that are derived from the basic integer and single-precision floating-point types.
The read mode, which is equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is cudaReadModeNormalizedFloat and the type of the texel is a 16-bit or 8-bit integer type, the value returned by the texture fetch is actually returned as floating-point type and the full range of the integer type is mapped to [0.0, 1.0] for unsigned integer type and [-1.0, 1.0] for signed integer type; for example, an unsigned 8-bit texture element with the value 0xff reads as 1. If it is cudaReadModeElementType, no conversion is performed.
Whether texture coordinates are normalized or not. By default, textures are referenced (by the functions of Texture Functions) using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. For example, a texture that is 64x32 in size will be referenced with coordinates in the range [0, 63] and [0, 31] for the x and y dimensions, respectively. Normalized texture coordinates cause the coordinates to be specified in the range [0.0, 1.0-1/N] instead of [0, N-1], so the same 64x32 texture would be addressed by normalized coordinates in the range [0, 1-1/N] in both the x and y dimensions. Normalized texture coordinates are a natural fit to some applications' requirements, if it is preferable for the texture coordinates to be independent of the texture size.
The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. The addressing mode defines what happens in that case. The default addressing mode is to clamp the coordinates to the valid range: [0, N) for non-normalized coordinates and [0.0, 1.0) for normalized coordinates. If the border mode is specified instead, texture fetches with out-of-range texture coordinates return zero. For normalized coordinates, the wrap mode and the mirror mode are also available. When using the wrap mode, each coordinate x is converted to frac(x) = x - floor(x), where floor(x) is the largest integer not greater than x. When using the mirror mode, each coordinate x is converted to frac(x) if floor(x) is even and 1-frac(x) if floor(x) is odd. The addressing mode is specified as an array of size three whose first, second, and third elements specify the addressing mode for the first, second, and third texture coordinates, respectively; the addressing modes are cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap, and cudaAddressModeMirror; cudaAddressModeWrap and cudaAddressModeMirror are only supported for normalized texture coordinates.
The filtering mode, which specifies how the value returned when fetching the texture is computed based on the input texture coordinates. Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures, bilinear interpolation for two-dimensional textures, and trilinear interpolation for three-dimensional textures. Texture Fetching gives more details on texture fetching. The filtering mode is equal to cudaFilterModePoint or cudaFilterModeLinear. If it is cudaFilterModePoint, the returned value is the texel whose texture coordinates are the closest to the input texture coordinates. If it is cudaFilterModeLinear, the returned value is the linear interpolation of the two (for a one-dimensional texture), four (for a two-dimensional texture), or eight (for a three-dimensional texture) texels whose texture coordinates are the closest to the input texture coordinates. cudaFilterModeLinear is only valid for returned values of floating-point type.
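The coordinate arithmetic in the list above can be mirrored on the host. The helpers below are plain C++ (which also compiles as CUDA host code); they illustrate the formulas only and do not show how the hardware implements them. All function names are hypothetical, and the linear scaling of intermediate 8-bit values is an assumption extrapolated from the 0xff example.

```cpp
#include <cmath>
#include <cstdint>

// Wrap mode for a normalized coordinate: frac(x) = x - floor(x).
inline float wrapCoord(float x)
{
    return x - std::floor(x);
}

// Mirror mode for a normalized coordinate:
// frac(x) if floor(x) is even, 1 - frac(x) if floor(x) is odd.
inline float mirrorCoord(float x)
{
    float fl = std::floor(x);
    float fr = x - fl;
    bool even = std::fmod(fl, 2.0f) == 0.0f;
    return even ? fr : 1.0f - fr;
}

// Normalized-float read of an unsigned 8-bit texel: [0, 255] maps to [0.0, 1.0].
// Assumption: values between the end points scale linearly.
inline float normalizedFromU8(std::uint8_t texel)
{
    return texel / 255.0f;
}

// Convert a non-normalized coordinate in [0, N-1] to a normalized one in [0, 1-1/N].
inline float toNormalized(float x, int n)
{
    return x / static_cast<float>(n);
}
```

For a texture that is 64 texels wide, `toNormalized(63.0f, 64)` gives `1 - 1/64`, the upper end of the normalized range quoted above.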

Texture Object API introduces the texture object API.

[[DEPRECATED]] Texture Reference API introduces the texture reference API.

16-Bit Floating-Point Textures explains how to deal with 16-bit floating-point textures.

Textures can also be layered as described in Layered Textures.

Cubemap Textures and Cubemap Layered Textures describe a special type of texture, the cubemap texture.

Texture Gather describes a special texture fetch, texture gather.

A texture object is created using cudaCreateTextureObject() from a resource description of type struct cudaResourceDesc, which specifies the texture, and from a texture description defined as such:

struct cudaTextureDesc
{
    enum cudaTextureAddressMode addressMode[3];
    enum cudaTextureFilterMode  filterMode;
    enum cudaTextureReadMode    readMode;
    int                         sRGB;
    int                         normalizedCoords;
    unsigned int                maxAnisotropy;
    enum cudaTextureFilterMode  mipmapFilterMode;
    float                       mipmapLevelBias;
    float                       minMipmapLevelClamp;
    float                       maxMipmapLevelClamp;
};

addressMode specifies the addressing mode;
filterMode specifies the filter mode;
readMode specifies the read mode;
normalizedCoords specifies whether texture coordinates are normalized or not;
See reference manual for sRGB, maxAnisotropy, mipmapFilterMode, mipmapLevelBias, minMipmapLevelClamp, and maxMipmapLevelClamp

The following code sample applies a simple transformation kernel to a texture.

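// Simple transformation kernel
__global__ void transformKernel(float* output,
                                cudaTextureObject_t texObj,
                                int width, int height,
                                float theta)
{
    // Calculate normalized texture coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    float u = x / (float)width;
    float v = y / (float)height;

    // Transform coordinates
    u -= 0.5f;
    v -= 0.5f;
    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;

    // Read from texture and write to global memory
    output[y * width + x] = tex2D<float>(texObj, tu, tv);
}

// Host code
int main()
{
    const int height = 1024;
    const int width = 1024;
    float angle = 0.5;

    // Allocate and set some host data
    float* h_data = (float*)std::malloc(sizeof(float) * width * height);
    for (int i = 0; i < height * width; ++i)
        h_data[i] = i;

    // Allocate CUDA array in device memory
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);

    // Copy the host data to the CUDA array (the source rows have no padding)
    const size_t spitch = width * sizeof(float);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, spitch, width * sizeof(float),
                        height, cudaMemcpyHostToDevice);

    // Specify texture
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    // Specify texture object parameters
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModeLinear;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;

    // Create texture object
    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

    // Allocate result of transformation in device memory
    float* output;
    cudaMalloc(&output, width * height * sizeof(float));

    // Invoke kernel
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
    transformKernel<<<numBlocks, threadsPerBlock>>>(output, texObj,
                                                    width, height, angle);

    // Copy data from device back to host
    cudaMemcpy(h_data, output, width * height * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Destroy texture object and free memory
    cudaDestroyTextureObject(texObj);
    cudaFreeArray(cuArray);
    cudaFree(output);
    std::free(h_data);
    return 0;
}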

Texture Reference API is deprecated

Some of the attributes of a texture reference are immutable and must be known at compile time; they are specified when declaring the texture reference. A texture reference is declared at file scope as a variable of type texture

texture<DataType, Type, ReadMode> texRef;

where

DataType specifies the type of the texel;
Type specifies the type of the texture reference and is equal to cudaTextureType1D, cudaTextureType2D, or cudaTextureType3D, for a one-dimensional, two-dimensional, or three-dimensional texture, respectively, or cudaTextureType1DLayered or cudaTextureType2DLayered for a one-dimensional or two-dimensional layered texture respectively; Type is an optional argument which defaults to cudaTextureType1D;
ReadMode specifies the read mode; it is an optional argument which defaults to cudaReadModeElementType

A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function

The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime. As explained in the reference manual, the runtime API has a low-level C-style interface and a high-level C++-style interface. The texture type is defined in the high-level API as a structure publicly derived from the textureReference type defined in the low-level API as such

struct textureReference {
    int                          normalized;
    enum cudaTextureFilterMode   filterMode;
    enum cudaTextureAddressMode  addressMode[3];
    struct cudaChannelFormatDesc channelDesc;
    int                          sRGB;
    unsigned int                 maxAnisotropy;
    enum cudaTextureFilterMode   mipmapFilterMode;
    float                        mipmapLevelBias;
    float                        minMipmapLevelClamp;
    float                        maxMipmapLevelClamp;
};

normalized specifies whether texture coordinates are normalized or not;
filterMode specifies the filtering mode;
addressMode specifies the addressing mode;

channelDesc describes the format of the texel; it must match the DataType argument of the texture reference declaration; channelDesc is of the following type

struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;
};

where x, y, z, and w are equal to the number of bits of each component of the returned value and f is

cudaChannelFormatKindSigned if these components are of signed integer type,
cudaChannelFormatKindUnsigned if they are of unsigned integer type,
cudaChannelFormatKindFloat if they are of floating point type
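
As a rough illustration of the description above, the following host-only C++ sketch models a channel descriptor as four per-component bit counts plus a component kind. The names ChannelKind, ChannelDesc, and describeScalar are hypothetical stand-ins chosen for this example; they are not the CUDA runtime types, and the sketch covers single-component texels only.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for illustration only; not the CUDA runtime types.
enum class ChannelKind { Signed, Unsigned, Float };

struct ChannelDesc {
    int x, y, z, w;  // number of bits of each component of the returned value
    ChannelKind f;   // signed integer, unsigned integer, or floating point
};

// Describe a single-component texel of arithmetic type T:
// all bits go to x; y, z, and w are unused and therefore zero.
template <typename T>
inline ChannelDesc describeScalar() {
    static_assert(std::is_arithmetic<T>::value, "T must be an arithmetic type");
    const int bits = static_cast<int>(sizeof(T)) * 8;
    const ChannelKind kind =
        std::is_floating_point<T>::value ? ChannelKind::Float
        : std::is_signed<T>::value       ? ChannelKind::Signed
                                         : ChannelKind::Unsigned;
    return ChannelDesc{bits, 0, 0, 0, kind};
}
```

Under this model, describeScalar<float>() would yield 32 bits in x with the floating-point kind on platforms where float is 4 bytes, which is the kind of value the descriptor has to agree with for a float texture.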

See reference manual for sRGB, maxAnisotropy, mipmapFilterMode, mipmapLevelBias, minMipmapLevelClamp, and maxMipmapLevelClamp

normalized, addressMode, and filterMode may be directly modified in host code

Before a kernel can use a texture reference to read from texture memory, the texture reference must be bound to a texture using cudaBindTexture() or cudaBindTexture2D() for linear memory, or cudaBindTextureToArray() for CUDA arrays. cudaUnbindTexture() is used to unbind a texture reference. Once a texture reference has been unbound, it can be safely rebound to another array, even if kernels that use the previously bound texture have not completed. It is recommended to allocate two-dimensional textures in linear memory using cudaMallocPitch() and use the pitch returned by cudaMallocPitch() as input parameter to cudaBindTexture2D()

The following code samples bind a 2D texture reference to linear memory pointed to by devPtr

Using the low-level API

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

Using the high-level API

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

The following code samples bind a 2D texture reference to a CUDA array cuArray

Using the low-level API

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

Using the high-level API

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

The format specified when binding a texture to a texture reference must match the parameters specified when declaring the texture reference; otherwise, the results of texture fetches are undefined

There is a limit to the number of textures that can be bound to a kernel as specified in Table 15

The following code sample applies some simple transformation kernel to a texture

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

A one-dimensional or two-dimensional layered texture (also known as texture array in Direct3D and array texture in OpenGL) is a texture made up of a sequence of layers, all of which are regular textures of same dimensionality, size, and data type

A one-dimensional layered texture is addressed using an integer index and a floating-point texture coordinate; the index denotes a layer within the sequence and the coordinate addresses a texel within that layer. A two-dimensional layered texture is addressed using an integer index and two floating-point texture coordinates; the index denotes a layer within the sequence and the coordinates address a texel within that layer
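
The addressing scheme can be pictured with a small host-only C++ model, given here as a sketch rather than as the device behavior. The type Layered1D and the function fetchLayered1D are hypothetical names for this example, the storage layout (layer-major) is an assumption, and the lookup uses simple point sampling with clamping instead of any hardware filtering mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of a 1D layered texture: `layers` rows of `width` texels.
struct Layered1D {
    std::size_t width;
    std::size_t layers;
    std::vector<float> texels;  // layer-major: texels[layer * width + i]
};

// Point-sample the texel at unnormalized coordinate x inside one layer.
// The integer index selects the layer; the floating-point coordinate only
// ever addresses texels within that layer (clamped to [0, width - 1]).
inline float fetchLayered1D(const Layered1D& tex, float x, int layer) {
    assert(layer >= 0 && static_cast<std::size_t>(layer) < tex.layers);
    const float maxX = static_cast<float>(tex.width - 1);
    const float clamped = std::fmin(std::fmax(x, 0.0f), maxX);
    const std::size_t i = static_cast<std::size_t>(std::floor(clamped));
    return tex.texels[static_cast<std::size_t>(layer) * tex.width + i];
}
```

With two layers of four texels each, a fetch at x = 2.7 in layer 1 reads the third texel of the second layer in this model, and no coordinate value can reach into a neighboring layer.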

A layered texture can only be a CUDA array created by calling cudaMalloc3DArray() with the cudaArrayLayered flag (and a height of zero for a one-dimensional layered texture)

Layered textures are fetched using the device functions described in tex1DLayered() and tex2DLayered(). Texture filtering (see Texture Fetching) is done only within a layer, not across layers

Layered textures are only supported on devices of compute capability 2.0 and higher

A cubemap texture is a special type of two-dimensional layered texture that has six layers representing the faces of a cube

The width of a layer is equal to its height
The cubemap is addressed using three texture coordinates x, y, and z that are interpreted as a direction vector emanating from the center of the cube and pointing to one face of the cube and a texel within the layer corresponding to that face. More specifically, the face is selected by the coordinate with largest magnitude m and the corresponding layer is addressed using coordinates (s/m+1)/2 and (t/m+1)/2 where s and t are defined in Table 2

Table 2. Cubemap Fetch

Condition                  Sign     face   m     s     t
|x| > |y| and |x| > |z|    x > 0    0      x     -z    -y
                           x < 0    1      -x    z     -y
|y| > |x| and |y| > |z|    y > 0    2      y     x     z
                           y < 0    3      -y    x     -z
|z| > |x| and |z| > |y|    z > 0    4      z     x     -y
                           z < 0    5      -z    -x    -y
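
The face selection in Table 2 can be written out as a short host-only C++ function. This is a sketch of the table's arithmetic, not the hardware implementation: CubemapCoord and selectCubemapFace are hypothetical names, the direction vector is assumed to be nonzero, and ties between magnitudes (which the table does not cover) are resolved here in favor of x, then y, purely as an assumption.

```cpp
#include <cassert>
#include <cmath>

struct CubemapCoord {
    int face;    // layer index 0..5
    float u, v;  // in-layer coordinates (s/m+1)/2 and (t/m+1)/2
};

// Map a direction vector to a cube face and in-layer coordinates per Table 2.
// Precondition: (x, y, z) is not the zero vector.
inline CubemapCoord selectCubemapFace(float x, float y, float z) {
    const float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    assert(ax + ay + az > 0.0f);
    int face;
    float m, s, t;
    if (ax >= ay && ax >= az) {         // x has the largest magnitude
        if (x > 0) { face = 0; m = x;  s = -z; t = -y; }
        else       { face = 1; m = -x; s = z;  t = -y; }
    } else if (ay >= az) {              // y has the largest magnitude
        if (y > 0) { face = 2; m = y;  s = x;  t = z; }
        else       { face = 3; m = -y; s = x;  t = -z; }
    } else {                            // z has the largest magnitude
        if (z > 0) { face = 4; m = z;  s = x;  t = -y; }
        else       { face = 5; m = -z; s = -x; t = -y; }
    }
    return CubemapCoord{face, (s / m + 1.0f) / 2.0f, (t / m + 1.0f) / 2.0f};
}
```

For example, the direction (0, -2, 1) has y as its largest-magnitude coordinate with y < 0, so the sketch selects face 3 with m = 2, s = 0, and t = -1, giving in-layer coordinates (0.5, 0.25).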

A cubemap texture can only be a CUDA array created by calling cudaMalloc3DArray() with the cudaArrayCubemap flag

Cubemap textures are fetched using the device function described in texCubemap()

Cubemap textures are only supported on devices of compute capability 2.0 and higher

Texture gather is a special texture fetch that is available for two-dimensional textures only. It is performed by the tex2Dgather() function, which has the same parameters as tex2D(), plus an additional comp parameter equal to 0, 1, 2, or 3 (see tex2Dgather()). It returns four 32-bit numbers that correspond to the value of the component comp of each of the four texels that would have been used for bilinear filtering during a regular texture fetch. For example, if these texels are of values (253, 20, 31, 255), (250, 25, 29, 254), (249, 16, 37, 253), (251, 22, 30, 250), and comp is 2, tex2Dgather() returns (31, 29, 37, 30)
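
The component selection in that example can be mimicked on the host with a few lines of C++. This is only a sketch of the arithmetic in the paragraph above: Texel and gatherComponent are hypothetical names, the four texels are passed in explicitly rather than located from texture coordinates, and the output order simply follows the input order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Texel = std::array<int, 4>;  // four components per texel

// Return component `comp` (0..3) of each of the four texels that make up
// a bilinear-filtering footprint, in the order the texels are given.
inline std::array<int, 4> gatherComponent(const std::array<Texel, 4>& footprint,
                                          int comp) {
    assert(comp >= 0 && comp < 4);
    std::array<int, 4> out{};
    for (std::size_t i = 0; i < footprint.size(); ++i) {
        out[i] = footprint[i][static_cast<std::size_t>(comp)];
    }
    return out;
}
```

Feeding it the four texels from the example with comp equal to 2 picks out the third component of each one.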

Note that texture coordinates are computed with only 8 bits of fractional precision. tex2Dgather() may therefore return unexpected results for cases where tex2D() would use 1.0 for one of its weights (α or β, see Linear Filtering). For example, with an x texture coordinate of 2.49805, xB = x - 0.5 = 1.99805; however, the fractional part of xB is stored in an 8-bit fixed-point format. Since 0.99805 is closer to 256.f/256.f than it is to 255.f/256.f, xB has the value 2. A tex2Dgather() in this case would therefore return indices 2 and 3 in x, instead of indices 1 and 2
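
The rounding effect in this example can be reproduced with a short host-only C++ sketch. It assumes a simple model in which xB is rounded to the nearest multiple of 1/256 before its integer part is taken; that model is an assumption chosen to match the numbers above, not a statement of how the hardware is implemented, and quantizedTexelIndex and exactTexelIndex are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// First texel index in x when xB = x - 0.5 is held with 8 fractional bits,
// assuming round-to-nearest quantization. Precondition: x >= 0.5.
inline long quantizedTexelIndex(double x) {
    assert(x >= 0.5);
    const double xB = x - 0.5;
    const long fixedPoint = std::lround(xB * 256.0);  // units of 1/256
    return fixedPoint / 256;                          // integer part
}

// First texel index in x with no quantization, for comparison.
inline long exactTexelIndex(double x) {
    return static_cast<long>(std::floor(x - 0.5));
}
```

For x = 2.49805 the quantized value rounds up to exactly 2, so the first index becomes 2 instead of 1, which matches the shift from indices 1 and 2 to indices 2 and 3 described above.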

Texture gather is only supported for CUDA arrays created with the cudaArrayTextureGather flag and of width and height less than the maximum specified in Table 15 for texture gather, which is smaller than for regular texture fetch

Texture gather is only supported on devices of compute capability 2.0 and higher

A surface object is created using cudaCreateSurfaceObject() from a resource description of type struct cudaResourceDesc

The following code sample applies some simple transformation kernel to a texture

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

Surface Reference API is deprecated

A surface reference is declared at file scope as a variable of type surface

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

where Type specifies the type of the surface reference and is equal to cudaSurfaceType1D, cudaSurfaceType2D, cudaSurfaceType3D, cudaSurfaceTypeCubemap, cudaSurfaceType1DLayered, cudaSurfaceType2DLayered, or cudaSurfaceTypeCubemapLayered; Type is an optional argument which defaults to cudaSurfaceType1D. A surface reference can only be declared as a static global variable and cannot be passed as an argument to a function

Before a kernel can use a surface reference to access a CUDA array, the surface reference must be bound to the CUDA array using cudaBindSurfaceToArray()

The following code samples bind a surface reference to a CUDA array cuArray

Using the low-level API

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

Using the high-level API

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

A CUDA array must be read and written using surface functions of matching dimensionality and type and via a surface reference of matching dimensionality; otherwise, the results of reading and writing the CUDA array are undefined

Unlike texture memory, surface memory uses byte addressing. This means that the x-coordinate used to access a texture element via texture functions needs to be multiplied by the byte size of the element to access the same element via a surface function. For example, the element at texture coordinate x of a one-dimensional floating-point CUDA array bound to a texture reference texRef and a surface reference surfRef is read using tex1D(texRef, x) via texRef, but surf1Dread(surfRef, 4*x) via surfRef. Similarly, the element at texture coordinate x and y of a two-dimensional floating-point CUDA array bound to a texture reference texRef and a surface reference surfRef is accessed using tex2D(texRef, x, y) via texRef, but surf2Dread(surfRef, 4*x, y) via surfRef (the byte offset of the y-coordinate is internally calculated from the underlying line pitch of the CUDA array)
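
The coordinate conversion amounts to one multiplication, as the following host-only C++ sketch shows. The helper name surfaceByteX is hypothetical and exists only for this illustration; it computes the byte offset in x for an element index and leaves the y-coordinate untouched, as described above.

```cpp
#include <cassert>
#include <cstddef>

// Convert an element index in x (as used by texture functions) into the
// byte offset in x expected by surface functions, for elements of type T.
template <typename T>
constexpr std::size_t surfaceByteX(std::size_t x) {
    return x * sizeof(T);
}
```

For a floating-point array on a platform with 4-byte float, surfaceByteX<float>(x) evaluates to 4*x, which is the factor that appears in the surf1Dread and surf2Dread calls in the paragraph above.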

The following code sample applies some simple transformation kernel to a texture

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

Some resources from OpenGL and Direct3D may be mapped into the address space of CUDA, either to enable CUDA to read data written by OpenGL or Direct3D, or to enable CUDA to write data for consumption by OpenGL or Direct3D

A resource must be registered to CUDA before it can be mapped using the functions mentioned in OpenGL Interoperability and Direct3D Interoperability. These functions return a pointer to a CUDA graphics resource of type struct cudaGraphicsResource. Registering a resource is potentially high-overhead and therefore typically called only once per resource. A CUDA graphics resource is unregistered using cudaGraphicsUnregisterResource(). Each CUDA context which intends to use the resource is required to register it separately

Once a resource is registered to CUDA, it can be mapped and unmapped as many times as necessary using cudaGraphicsMapResources() and cudaGraphicsUnmapResources(). cudaGraphicsResourceSetMapFlags() can be called to specify usage hints (write-only, read-only) that the CUDA driver can use to optimize resource management

A mapped resource can be read from or written to by kernels using the device memory address returned by cudaGraphicsResourceGetMappedPointer() for buffers and cudaGraphicsSubResourceGetMappedArray() for CUDA arrays

Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results. OpenGL Interoperability and Direct3D Interoperability give specifics for each graphics API and some code samples. SLI Interoperability gives specifics for when the system is in SLI mode

The OpenGL resources that may be mapped into the address space of CUDA are OpenGL buffer, texture, and renderbuffer objects

A buffer object is registered using cudaGraphicsGLRegisterBuffer(). In CUDA, it appears as a device pointer and can therefore be read and written by kernels or via cudaMemcpy() calls

A texture or renderbuffer object is registered using cudaGraphicsGLRegisterImage(). In CUDA, it appears as a CUDA array. Kernels can read from the array by binding it to a texture or surface reference. They can also write to it via the surface write functions if the resource has been registered with the cudaGraphicsRegisterFlagsSurfaceLoadStore flag. The array can also be read and written via cudaMemcpy2D() calls. cudaGraphicsGLRegisterImage() supports all texture formats with 1, 2, or 4 components and an internal type of float (for example, GL_RGBA_FLOAT32), normalized integer (for example, GL_RGBA8, GL_INTENSITY16), and unnormalized integer (for example, GL_RGBA8UI) (please note that since unnormalized integer formats require OpenGL 3.0, they can only be written by shaders, not the fixed function pipeline)

The OpenGL context whose resources are being shared has to be current to the host thread making any OpenGL interoperability API calls

Please note: when an OpenGL texture is made bindless (for example, by requesting an image or texture handle using the glGetTextureHandle*/glGetImageHandle* APIs), it cannot be registered with CUDA. The application needs to register the texture for interop before requesting an image or texture handle

The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

On Windows and for Quadro GPUs, cudaWGLGetDevice() can be used to retrieve the CUDA device associated to the handle returned by wglEnumGpusNV(). Quadro GPUs offer higher performance OpenGL interoperability than GeForce and Tesla GPUs in a multi-GPU configuration where OpenGL rendering is performed on the Quadro GPU and CUDA computations are performed on other GPUs in the system

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

In a system with multiple GPUs, all CUDA-enabled GPUs are accessible via the CUDA driver and runtime as separate devices. There are however special considerations as described below when the system is in SLI mode

First, an allocation in one CUDA device on one GPU will consume memory on other GPUs that are part of the SLI configuration of the Direct3D or OpenGL device. Because of this, allocations may fail earlier than otherwise expected

Second, applications should create multiple CUDA contexts, one for each GPU in the SLI configuration. While this is not a strict requirement, it avoids unnecessary data transfers between devices. The application can use the cudaD3D[9|10|11]GetDevices() for Direct3D and cudaGLGetDevices() for OpenGL set of calls to identify the CUDA device handle(s) for the device(s) that are performing the rendering in the current and next frame. Given this information the application will typically choose the appropriate device and map Direct3D or OpenGL resources to the CUDA device returned by cudaD3D[9|10|11]GetDevices() or cudaGLGetDevices() when the deviceList parameter is set to cudaD3D[9|10|11]DeviceListCurrentFrame or cudaGLDeviceListCurrentFrame

Please note that a resource returned from cudaGraphicsD3D[9|10|11]RegisterResource and cudaGraphicsGLRegister[Buffer|Image] must only be used on the device on which the registration happened. Therefore, on SLI configurations where data for different frames is computed on different CUDA devices, it is necessary to register the resources for each device separately

See Direct3D Interoperability and OpenGL Interoperability for details on how the CUDA runtime interoperates with Direct3D and OpenGL, respectively

External resource interoperability allows CUDA to import certain resources that are explicitly exported by other APIs. These objects are typically exported by other APIs using handles native to the Operating System, like file descriptors on Linux or NT handles on Windows. They could also be exported using other unified interfaces such as the NVIDIA Software Communication Interface. There are two types of resources that can be imported: memory objects and synchronization objects

Memory objects can be imported into CUDA using cudaImportExternalMemory(). An imported memory object can be accessed from within kernels using device pointers mapped onto the memory object via cudaExternalMemoryGetMappedBuffer() or CUDA mipmapped arrays mapped via cudaExternalMemoryGetMappedMipmappedArray(). Depending on the type of memory object, it may be possible for more than one mapping to be set up on a single memory object. The mappings must match the mappings set up in the exporting API. Any mismatched mappings result in undefined behavior. Imported memory objects must be freed using cudaDestroyExternalMemory(). Freeing a memory object does not free any mappings to that object. Therefore, any device pointers mapped onto that object must be explicitly freed using cudaFree() and any CUDA mipmapped arrays mapped onto that object must be explicitly freed using cudaFreeMipmappedArray(). It is illegal to access mappings to an object after it has been destroyed

Synchronization objects can be imported into CUDA using cudaImportExternalSemaphore(). An imported synchronization object can then be signaled using cudaSignalExternalSemaphoresAsync() and waited on using cudaWaitExternalSemaphoresAsync(). It is illegal to issue a wait before the corresponding signal has been issued. Also, depending on the type of the imported synchronization object, there may be additional constraints imposed on how they can be signaled and waited on, as described in subsequent sections. Imported semaphore objects must be freed using cudaDestroyExternalSemaphore(). All outstanding signals and waits must have completed before the semaphore object is destroyed

On Linux and Windows 10, both dedicated and non-dedicated memory objects exported by Vulkan can be imported into CUDA. On Windows 7, only dedicated memory objects can be imported. When importing a Vulkan dedicated memory object, the flag cudaExternalMemoryDedicated must be set

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object as shown below. Note that CUDA does not assume ownership of the NT handle and it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle if one exists as shown below

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Vulkan API. Additionally, if the mipmapped array is bound as a color target in Vulkan, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Vulkan parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.
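
A minimal sketch of the file-descriptor import, with a hypothetical helper name; `fd` is assumed to come from the Vulkan export call made by the application.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: imports a Vulkan semaphore through a file descriptor.
cudaExternalSemaphore_t importVulkanSemaphoreFromFd(int fd)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    desc.handle.fd = fd;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;

    // After a successful import CUDA owns 'fd'; it must not be used again.
    return extSem;
}
```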

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel with enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object as shown below. Note that CUDA does not assume ownership of the NT handle and it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed
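
A minimal sketch of the NT-handle semaphore import. The helper name is hypothetical, and closing the handle inside the helper assumes the caller has no further use for it.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Vulkan semaphore through its NT handle.
cudaExternalSemaphore_t importVulkanSemaphoreFromNTHandle(HANDLE handle)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;

    cudaExternalSemaphore_t extSem = nullptr;
    cudaError_t err = cudaImportExternalSemaphore(&extSem, &desc);

    // CUDA does not own the NT handle; release the reference it holds.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extSem : nullptr;
}
```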

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle if one exists as shown below
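
A minimal sketch of the named-handle semaphore import; the helper name is hypothetical and `name` is assumed to match the name used at export time.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Vulkan semaphore through a named handle.
cudaExternalSemaphore_t importVulkanSemaphoreFromNamedNTHandle(LPCWSTR name)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.name = name;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;
    return extSem;
}
```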

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying semaphore it is automatically destroyed when all other references to the resource are destroyed
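
A minimal sketch of the D3DKMT semaphore import, with a hypothetical helper name. Because the handle holds no reference, it is not closed.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Vulkan semaphore through a globally
// shared D3DKMT handle.
cudaExternalSemaphore_t importVulkanSemaphoreFromKMTHandle(HANDLE handle)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void*)handle;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;
    return extSem;
}
```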

A shareable Direct3D 12 heap memory object, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateHeap, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.
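
A minimal sketch of the heap import. The helper name is hypothetical, and `handle` is assumed to have been created by the application for the shared heap (for example with ID3D12Device::CreateSharedHandle).

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 12 heap via its NT handle.
cudaExternalMemory_t importD3D12HeapFromNTHandle(HANDLE handle,
                                                 unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.handle = (void*)handle;
    desc.size = size;

    cudaExternalMemory_t extMem = nullptr;
    cudaError_t err = cudaImportExternalMemory(&extMem, &desc);

    // The handle keeps the heap alive until it is closed.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extMem : nullptr;
}
```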

A shareable Direct3D 12 heap memory object can also be imported using a named handle if one exists as shown below
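
A minimal sketch of the named-handle heap import, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 12 heap via a named handle.
cudaExternalMemory_t importD3D12HeapFromNamedNTHandle(LPCWSTR name,
                                                      unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.name = name;
    desc.size = size;

    cudaExternalMemory_t extMem = nullptr;
    if (cudaImportExternalMemory(&extMem, &desc) != cudaSuccess)
        return nullptr;
    return extMem;
}
```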

A shareable Direct3D 12 committed resource, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateCommittedResource, can be imported into CUDA using the NT handle associated with that object as shown below. When importing a Direct3D 12 committed resource, the flag cudaExternalMemoryDedicated must be set. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.
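
A minimal sketch of the committed-resource import. The helper name is hypothetical; the point of the sketch is that cudaExternalMemoryDedicated is set unconditionally.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 12 committed resource.
cudaExternalMemory_t importD3D12CommittedResourceFromNTHandle(HANDLE handle,
                                                              unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.handle = (void*)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;  // required for committed resources

    cudaExternalMemory_t extMem = nullptr;
    cudaError_t err = cudaImportExternalMemory(&extMem, &desc);

    // The handle keeps the resource alive until it is closed.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extMem : nullptr;
}
```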

A shareable Direct3D 12 committed resource can also be imported using a named handle if one exists as shown below
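
A minimal sketch of the named-handle variant for committed resources, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 12 committed resource
// through a named handle.
cudaExternalMemory_t importD3D12CommittedResourceFromNamedNTHandle(LPCWSTR name,
                                                                   unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.name = name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;  // required for committed resources

    cudaExternalMemory_t extMem = nullptr;
    if (cudaImportExternalMemory(&extMem, &desc) != cudaSuccess)
        return nullptr;
    return extMem;
}
```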

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match those specified when creating the mapping using the corresponding Direct3D 12 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 12, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 12 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.
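
The parameter conversion for Direct3D 12 can be sketched as follows. Both helper names are hypothetical, and the format table deliberately covers only three formats as an illustration; a real application would extend it to every format it uses.

```cuda
#include <cuda_runtime.h>
#include <d3d12.h>

// Hypothetical helper: maps a few DXGI formats to CUDA channel descriptors.
cudaChannelFormatDesc cudaFormatForDxgiFormat(DXGI_FORMAT format)
{
    switch (format) {
    case DXGI_FORMAT_R8_UINT:
        return cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    case DXGI_FORMAT_R32_FLOAT:
        return cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    case DXGI_FORMAT_R32G32B32A32_FLOAT:
        return cudaCreateChannelDesc(32, 32, 32, 32, cudaChannelFormatKindFloat);
    default:  // unsupported in this sketch
        return cudaCreateChannelDesc(0, 0, 0, 0, cudaChannelFormatKindNone);
    }
}

// Hypothetical helper: derives CUDA array flags from Direct3D 12 properties.
unsigned int cudaFlagsForD3D12Resource(D3D12_SRV_DIMENSION dimension,
                                       D3D12_RESOURCE_FLAGS resourceFlags,
                                       bool allowSurfaceLoadStore)
{
    unsigned int flags = 0;
    switch (dimension) {
    case D3D12_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap; break;
    case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D12_SRV_DIMENSION_TEXTURE1DARRAY:
    case D3D12_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered; break;
    default: break;
    }
    // Resources that can be render targets need cudaArrayColorAttachment.
    if (resourceFlags & D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET)
        flags |= cudaArrayColorAttachment;
    if (allowSurfaceLoadStore)
        flags |= cudaArraySurfaceLoadStore;
    return flags;
}
```

The results feed the `formatDesc` and `flags` fields of cudaExternalMemoryMipmappedArrayDesc before calling cudaExternalMemoryGetMappedMipmappedArray().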

A shareable Direct3D 11 texture resource, viz. ID3D11Texture1D, ID3D11Texture2D or ID3D11Texture3D, can be created by setting either the D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX flag (on Windows 7) or the D3D11_RESOURCE_MISC_SHARED_NTHANDLE flag (on Windows 10) when calling ID3D11Device::CreateTexture1D, ID3D11Device::CreateTexture2D or ID3D11Device::CreateTexture3D respectively. A shareable Direct3D 11 buffer resource, ID3D11Buffer, can be created by specifying any of the above flags when calling ID3D11Device::CreateBuffer. A shareable resource created by specifying D3D11_RESOURCE_MISC_SHARED_NTHANDLE can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed. When importing a Direct3D 11 resource, the flag cudaExternalMemoryDedicated must be set.
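
A minimal sketch of the NT-handle import for a Direct3D 11 resource. The helper name is hypothetical, and `handle` is assumed to have been obtained by the application for a resource created with D3D11_RESOURCE_MISC_SHARED_NTHANDLE.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 11 resource via NT handle.
cudaExternalMemory_t importD3D11ResourceFromNTHandle(HANDLE handle,
                                                     unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.handle = (void*)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;  // required for D3D11 resources

    cudaExternalMemory_t extMem = nullptr;
    cudaError_t err = cudaImportExternalMemory(&extMem, &desc);

    // The handle keeps the resource alive until it is closed.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extMem : nullptr;
}
```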

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // Rows are 'pitch' bytes apart, so step through them as bytes
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

A shareable Direct3D 11 resource can also be imported using a named handle if one exists as shown below
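
A minimal sketch of the named-handle import for a Direct3D 11 resource, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 11 resource via a named handle.
cudaExternalMemory_t importD3D11ResourceFromNamedNTHandle(LPCWSTR name,
                                                          unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.name = name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;  // required for D3D11 resources

    cudaExternalMemory_t extMem = nullptr;
    if (cudaImportExternalMemory(&extMem, &desc) != cudaSuccess)
        return nullptr;
    return extMem;
}
```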

A shareable Direct3D 11 resource, created by specifying the D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX, can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed
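
A minimal sketch of the D3DKMT import for a Direct3D 11 resource. The helper name is hypothetical; the handle is not closed because it holds no reference.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 11 resource through a
// globally shared D3DKMT handle.
cudaExternalMemory_t importD3D11ResourceFromKMTHandle(HANDLE handle,
                                                      unsigned long long size)
{
    cudaExternalMemoryHandleDesc desc = {};
    desc.type = cudaExternalMemoryHandleTypeD3D11ResourceKmt;
    desc.handle.win32.handle = (void*)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;  // required for D3D11 resources

    cudaExternalMemory_t extMem = nullptr;
    if (cudaImportExternalMemory(&extMem, &desc) != cudaSuccess)
        return nullptr;
    return extMem;
}
```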

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match those specified when creating the mapping using the corresponding Direct3D 11 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 11, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 11 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.
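
The flag conversion for Direct3D 11 can be sketched as below. The helper name is hypothetical, and `bindFlags` is assumed to be the BindFlags value the resource was created with.

```cuda
#include <cuda_runtime.h>
#include <d3d11.h>

// Hypothetical helper: derives CUDA array flags from Direct3D 11 properties.
unsigned int cudaFlagsForD3D11Resource(D3D11_SRV_DIMENSION dimension,
                                       UINT bindFlags,
                                       bool allowSurfaceLoadStore)
{
    unsigned int flags = 0;
    switch (dimension) {
    case D3D11_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap; break;
    case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D11_SRV_DIMENSION_TEXTURE1DARRAY:
    case D3D11_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered; break;
    default: break;
    }
    // Resources that can be render targets need cudaArrayColorAttachment.
    if (bindFlags & D3D11_BIND_RENDER_TARGET)
        flags |= cudaArrayColorAttachment;
    if (allowSurfaceLoadStore)
        flags |= cudaArraySurfaceLoadStore;
    return flags;
}
```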

A shareable Direct3D 11 fence object, created by setting the flag D3D11_FENCE_FLAG_SHARED in the call to ID3D11Device5::CreateFence, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.
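
A minimal sketch of the fence import, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 11 fence via its NT handle.
cudaExternalSemaphore_t importD3D11FenceFromNTHandle(HANDLE handle)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.handle = handle;

    cudaExternalSemaphore_t extSem = nullptr;
    cudaError_t err = cudaImportExternalSemaphore(&extSem, &desc);

    // The handle keeps the fence alive until it is closed.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extSem : nullptr;
}
```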

A shareable Direct3D 11 fence object can also be imported using a named handle if one exists as shown below
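
A minimal sketch of the named-handle fence import, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a shared Direct3D 11 fence via a named handle.
cudaExternalSemaphore_t importD3D11FenceFromNamedNTHandle(LPCWSTR name)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.name = name;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;
    return extSem;
}
```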

A shareable Direct3D 11 keyed mutex object associated with a shareable Direct3D 11 resource, viz. IDXGIKeyedMutex, created by setting the flag D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.
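
A minimal sketch of the keyed mutex import, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Direct3D 11 keyed mutex via its NT handle.
cudaExternalSemaphore_t importD3D11KeyedMutexFromNTHandle(HANDLE handle)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.handle = handle;

    cudaExternalSemaphore_t extSem = nullptr;
    cudaError_t err = cudaImportExternalSemaphore(&extSem, &desc);

    // The handle keeps the underlying resource alive until it is closed.
    CloseHandle(handle);
    return (err == cudaSuccess) ? extSem : nullptr;
}
```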

A shareable Direct3D 11 keyed mutex object can also be imported using a named handle if one exists as shown below
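
A minimal sketch of the named-handle keyed mutex import, with a hypothetical helper name:

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Direct3D 11 keyed mutex via a named handle.
cudaExternalSemaphore_t importD3D11KeyedMutexFromNamedNTHandle(LPCWSTR name)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.name = name;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;
    return extSem;
}
```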

A shareable Direct3D 11 keyed mutex object can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed
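
A minimal sketch of the D3DKMT keyed mutex import. The helper name is hypothetical, and the handle is left open because it holds no reference.

```cuda
#include <cuda_runtime.h>
#include <windows.h>

// Hypothetical helper: imports a Direct3D 11 keyed mutex through a
// globally shared D3DKMT handle.
cudaExternalSemaphore_t importD3D11KeyedMutexFromKMTHandle(HANDLE handle)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutexKmt;
    desc.handle.win32.handle = (void*)handle;

    cudaExternalSemaphore_t extSem = nullptr;
    if (cudaImportExternalSemaphore(&extSem, &desc) != cudaSuccess)
        return nullptr;
    return extSem;
}
```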

An imported Direct3D 11 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 11. Additionally, the wait that waits on this signal must be issued after this signal has been issued
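
Signaling an imported fence can be sketched as follows. The helper name is hypothetical; `extSem` is assumed to be a fence imported earlier, and the signal is enqueued on `stream`.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: enqueues a signal that sets the fence to 'value'.
cudaError_t signalD3D11Fence(cudaExternalSemaphore_t extSem,
                             unsigned long long value,
                             cudaStream_t stream)
{
    cudaExternalSemaphoreSignalParams params = {};
    params.params.fence.value = value;
    return cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
```

The matching Direct3D 11 wait should be issued only after this call has returned, in line with the ordering rule above.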

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

An imported Direct3D 11 fence object can be waited on as shown below. Waiting on such a fence object waits until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 11. Additionally, the signal must be issued before this wait can be issued
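
Waiting on an imported fence can be sketched as follows. The helper name is hypothetical; `extSem` is assumed to be a fence imported earlier.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: makes 'stream' wait until the fence value is
// greater than or equal to 'value'.
cudaError_t waitD3D11Fence(cudaExternalSemaphore_t extSem,
                           unsigned long long value,
                           cudaStream_t stream)
{
    cudaExternalSemaphoreWaitParams params = {};
    params.params.fence.value = value;
    return cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
```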

An imported Direct3D 11 keyed mutex object can be signaled as shown below. Signaling such a keyed mutex object by specifying a key value releases the keyed mutex for that value. The corresponding wait that waits on this signal must be issued in Direct3D 11 with the same key value. Additionally, the Direct3D 11 wait must be issued after this signal has been issued
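
Releasing an imported keyed mutex can be sketched as follows. The helper name is hypothetical; `key` is the value the Direct3D 11 side will later acquire with.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: releases the keyed mutex for 'key' on 'stream'.
cudaError_t signalD3D11KeyedMutex(cudaExternalSemaphore_t extSem,
                                  unsigned long long key,
                                  cudaStream_t stream)
{
    cudaExternalSemaphoreSignalParams params = {};
    params.params.keyedMutex.key = key;
    return cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
```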

An imported Direct3D 11 keyed mutex object can be waited on as shown below. A timeout value in milliseconds is needed when waiting on such a keyed mutex. The wait operation waits until the keyed mutex value is equal to the specified key value or until the timeout has elapsed. The timeout interval can also be infinite, in which case the timeout never elapses; the Windows INFINITE macro must be used to specify an infinite timeout. The corresponding signal that this wait is waiting on must be issued in Direct3D 11. Additionally, the Direct3D 11 signal must be issued before this wait can be issued.
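
Acquiring an imported keyed mutex can be sketched as follows. The helper name is hypothetical; passing the Windows INFINITE macro as `timeoutMs` requests a wait that never times out.

```cuda
#include <cuda_runtime.h>
#include <windows.h>  // provides the INFINITE macro for callers

// Hypothetical helper: waits on 'stream' until the keyed mutex is released
// with 'key', or until 'timeoutMs' milliseconds have elapsed.
cudaError_t waitD3D11KeyedMutex(cudaExternalSemaphore_t extSem,
                                unsigned long long key,
                                unsigned int timeoutMs,
                                cudaStream_t stream)
{
    cudaExternalSemaphoreWaitParams params = {};
    params.params.keyedMutex.key = key;
    params.params.keyedMutex.timeoutMs = timeoutMs;
    return cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}
```

A call such as `waitD3D11KeyedMutex(extSem, 0, INFINITE, stream)` would block the stream until key 0 is released.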

For allocating an NvSciBuf object compatible with a given CUDA device, the corresponding GPU id must be set with NvSciBufGeneralAttrKey_GpuId in the NvSciBuf attribute list as shown below. Optionally, applications can specify the following attributes:

NvSciBufGeneralAttrKey_NeedCpuAccess: Specifies if CPU access is required for the buffer.
NvSciBufRawBufferAttrKey_Align: Specifies the alignment requirement of NvSciBufType_RawBuffer.
NvSciBufGeneralAttrKey_RequiredPerm: Different access permissions can be configured for different UMDs per NvSciBuf memory object instance. For example, to provide the GPU with read-only access permissions to the buffer, create a duplicate NvSciBuf object using NvSciBufObjDupWithReducePerm() with NvSciBufAccessPerm_Readonly as the input parameter. Then import this newly created duplicate object with reduced permission into CUDA as shown.
NvSciBufGeneralAttrKey_EnableGpuCache: Controls GPU L2 cacheability.
NvSciBufGeneralAttrKey_EnableGpuCompression: Specifies GPU compression.

Note: For more details on these attributes and their valid input options, refer to the NvSciBuf documentation.

The following code snippet illustrates their sample usage
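
A rough sketch of such an allocation is given here. Everything in it beyond the attribute keys named above is an assumption: the helper name, the parameters, and the exact NvSciBuf call signatures are written from memory and should be checked against the installed NvSciBuf headers. Error handling is reduced to a single check on the final result.

```cuda
#include <cuda.h>
#include <nvscibuf.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: allocates a raw NvSciBuf buffer usable by CUDA
// device 'dev'. The reconciled list is returned through 'reconciledOut'
// so that the application can maintain it.
NvSciBufObj allocRawBufferForDevice(NvSciBufModule module, CUdevice dev,
                                    uint64_t sizeInBytes,
                                    NvSciBufAttrList* reconciledOut)
{
    NvSciBufType bufType = NvSciBufType_RawBuffer;
    uint64_t align = 0;
    bool needCpuAccess = true;
    NvSciBufAttrValAccessPerm perm = NvSciBufAccessPerm_ReadWrite;

    // The GPU id attribute carries the UUID of the target CUDA device.
    CUuuid uuid;
    cuDeviceGetUuid(&uuid, dev);
    NvSciRmGpuId gpuId = {};
    memcpy(gpuId.bytes, uuid.bytes, sizeof(uuid.bytes));

    NvSciBufAttrKeyValuePair attrs[] = {
        { NvSciBufGeneralAttrKey_Types,         &bufType,       sizeof(bufType) },
        { NvSciBufRawBufferAttrKey_Size,        &sizeInBytes,   sizeof(sizeInBytes) },
        { NvSciBufRawBufferAttrKey_Align,       &align,         sizeof(align) },
        { NvSciBufGeneralAttrKey_NeedCpuAccess, &needCpuAccess, sizeof(needCpuAccess) },
        { NvSciBufGeneralAttrKey_RequiredPerm,  &perm,          sizeof(perm) },
        { NvSciBufGeneralAttrKey_GpuId,         &gpuId,         sizeof(gpuId) },
    };

    NvSciBufAttrList unreconciled = nullptr;
    NvSciBufAttrList conflicts = nullptr;
    NvSciBufObj obj = nullptr;

    // Create the list, fill it, reconcile it, then allocate from it.
    NvSciBufAttrListCreate(module, &unreconciled);
    NvSciBufAttrListSetAttrs(unreconciled, attrs, sizeof(attrs) / sizeof(attrs[0]));
    NvSciBufAttrListReconcile(&unreconciled, 1, reconciledOut, &conflicts);
    if (NvSciBufObjAlloc(*reconciledOut, &obj) != NvSciError_Success)
        return nullptr;
    return obj;
}
```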

The allocated NvSciBuf memory object can be imported into CUDA using the NvSciBufObj handle. The application should query the allocated NvSciBufObj for the attributes required to fill in the CUDA External Memory Descriptor. Note that the attribute list and NvSciBuf objects should be maintained by the application. If the NvSciBuf object imported into CUDA is also mapped by other drivers, then, depending on the value of the NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency output attribute, the application must use NvSciSync objects (refer to Importing Synchronization Objects) as appropriate barriers to maintain coherence between CUDA and the other drivers.

Note. For more details on how to allocate and maintain NvSciBuf objects refer to NvSciBuf API Documentation.

Signaling an imported NvSciSyncObj object, that is, an NvSciSync-backed semaphore object, initializes the fence parameter passed as input. This fence is then waited upon by the wait operation that corresponds to that signal, and that wait must be issued after the signal has been issued. If the flags are set to cudaExternalSemaphoreSignalSkipNvSciBufMemSync, the memory synchronization operations (over all the imported NvSciBuf objects in this process) that are executed by default as part of the signal operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.

Waiting on an imported NvSciSyncObj object, that is, an NvSciSync-backed semaphore object, waits until the input fence parameter is signaled by the corresponding signaler. The signal must be issued before the wait can be issued. If the flags are set to cudaExternalSemaphoreWaitSkipNvSciBufMemSync, the memory synchronization operations (over all the imported NvSciBuf objects in this process) that are executed by default as part of the wait operation are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.

CUDA User Objects can be used to help manage the lifetime of resources used by asynchronous work in CUDA. In particular, this feature is useful for CUDA Graphs and stream capture

Various resource management schemes are not compatible with CUDA graphs. Consider for example an event-based pool or a synchronous-create, asynchronous-destroy scheme.

// Copy a host array to constant memory and back
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

// Initialize a single device variable from the host
__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

// Store a device allocation's address in a device-side pointer
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

These schemes are difficult to use with CUDA graphs for two reasons: the pointer or handle for the resource is not fixed, which requires indirection or a graph update, and synchronous CPU code is needed each time the work is submitted. They also do not work with stream capture if these considerations are hidden from the caller of the library, and because they use APIs that are disallowed during capture. Various solutions exist, such as exposing the resource to the caller. CUDA user objects present another approach.

A CUDA user object associates a user-specified destructor callback with an internal refcount, similar to C++ shared_ptr. References may be owned by user code on the CPU and by CUDA graphs. Note that for user-owned references, unlike C++ smart pointers, there is no object representing the reference; users must track user-owned references manually. A typical use case would be to immediately move the sole user-owned reference to a CUDA graph after the user object is created
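The reference-counting behavior described above can be modeled on the host in plain C++. The sketch below is not the CUDA API: UserObjectModel, retain, and release are hypothetical names chosen for illustration, and the model ignores graphs and asynchronous execution entirely. It only shows a destructor callback that runs once, when the last manually tracked reference is dropped.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Host-only model of a refcounted object with a destructor callback.
// Hypothetical illustration; this is NOT the CUDA user object API.
class UserObjectModel {
public:
    UserObjectModel(std::function<void()> destroy, unsigned initial_refs)
        : destroy_(std::move(destroy)), refs_(initial_refs) {
        assert(initial_refs > 0);
    }

    // Add `count` references. There is no object representing a reference,
    // so the caller must remember how many it owns.
    void retain(unsigned count = 1) {
        assert(refs_ > 0);  // cannot revive a destroyed object
        refs_ += count;
    }

    // Drop `count` references; the callback runs exactly once at zero.
    void release(unsigned count = 1) {
        assert(count <= refs_);
        refs_ -= count;
        if (refs_ == 0 && destroy_) {
            destroy_();
            destroy_ = nullptr;
        }
    }

    unsigned refs() const { return refs_; }

private:
    std::function<void()> destroy_;
    unsigned refs_;
};
```

In this model, creating the object with one reference and then releasing it corresponds to the sole user-owned reference going away; any reference handed to another owner must be released by that owner before the callback fires.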

When a reference is associated with a CUDA graph, CUDA manages the graph operations automatically. A cloned cudaGraph_t retains a copy of every reference owned by the source cudaGraph_t, with the same multiplicity. An instantiated cudaGraphExec_t retains a copy of every reference in the source cudaGraph_t. When a cudaGraphExec_t is destroyed without being synchronized, the references are retained until the execution is completed.

References owned by graphs in child graph nodes are associated with the child graphs, not the parents. If a child graph is updated or deleted, the references change accordingly. If an executable graph or child graph is updated with cudaGraphExecUpdate or cudaGraphExecChildGraphNodeSetParams, the references in the new source graph are cloned and replace the references in the target graph. In either case, if previous launches are not synchronized, any references that would be released are held until the launches have finished executing.

There is not currently a mechanism to wait on user object destructors via a CUDA API. Users may signal a synchronization object manually from the destructor code. In addition, it is not legal to call CUDA APIs from the destructor, similar to the restriction on cudaLaunchHostFunc. This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, if the dependency is one way and the thread doing the call cannot block forward progress of CUDA work

User objects are created with cudaUserObjectCreate, which is a good starting point to browse related APIs

There are two version numbers that developers should care about when developing a CUDA application: the compute capability, which describes the general specifications and features of the compute device (see Compute Capability), and the version of the CUDA driver API, which describes the features supported by the driver API and runtime.

The version of the driver API is defined in the driver header file as CUDA_VERSION. It allows developers to check whether their application requires a newer device driver than the one currently installed. This is important, because the driver API is backward compatible, meaning that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will continue to work on subsequent device driver releases as illustrated in Figure 12. The driver API is not forward compatible, which means that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will not work on previous versions of the device driver.
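The backward-compatibility rule can be expressed as a simple comparison. The sketch below assumes the usual CUDA_VERSION encoding of 1000 * major + 10 * minor; encode_version and driver_can_run are hypothetical helper names, and the version numbers in the usage note are purely illustrative. It does not model the separate forward-compatible upgrade path for user-mode components.

```cpp
#include <cassert>

// CUDA_VERSION-style encoding: 1000 * major + 10 * minor.
inline int encode_version(int major, int minor) {
    assert(major > 0 && minor >= 0 && minor < 100);
    return 1000 * major + 10 * minor;
}

// Backward compatible, not forward compatible: a binary built against
// driver API version B runs only if the installed driver version is >= B.
inline bool driver_can_run(int installed_driver_version,
                           int built_against_version) {
    assert(installed_driver_version > 0 && built_against_version > 0);
    return installed_driver_version >= built_against_version;
}
```

For example, driver_can_run(encode_version(12, 4), encode_version(11, 8)) is true, while swapping the two arguments gives false.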

It is important to note that there are limitations on which combinations of versions are supported:

Since only one version of the CUDA Driver can be installed at a time on a system, the installed driver must be of the same or higher version than the maximum Driver API version against which any application, plug-ins, or libraries that must run on that system were built.
All plug-ins and libraries used by an application must use the same version of the CUDA Runtime unless they statically link to the Runtime, in which case multiple versions of the runtime can coexist in the same process space. Note that if nvcc is used to link the application, the static version of the CUDA Runtime library will be used by default, and all CUDA Toolkit libraries are statically linked against the CUDA Runtime.
All plug-ins and libraries used by an application must use the same version of any libraries that use the runtime (such as cuFFT, cuBLAS, ...) unless statically linking to those libraries.

Figure 12. The Driver API Is Backward but Not Forward Compatible

For Tesla GPU products, CUDA 10 introduced a new forward-compatible upgrade path for the user-mode components of the CUDA Driver. This feature is described in CUDA Compatibility. The requirements on the CUDA Driver version described here apply to the version of the user-mode components

On Tesla solutions running Windows Server 2008 and later or Linux, one can set any device in a system in one of the three following modes using NVIDIA's System Management Interface (nvidia-smi), which is a tool distributed as part of the driver:

Default compute mode. Multiple host threads can use the device (by calling cudaSetDevice() on this device, when using the runtime API, or by making current a context associated with the device, when using the driver API) at the same time.
Exclusive-process compute mode. Only one CUDA context may be created on the device across all processes in the system. The context may be current to as many threads as desired within the process that created that context
Prohibited compute mode. No CUDA context can be created on the device

This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice() might be associated with a device other than device 0 if device 0 turns out to be in prohibited mode or in exclusive-process mode and used by another process. cudaSetValidDevices() can be used to set a device from a prioritized list of devices.

Note also that devices featuring the Pascal architecture onwards (compute capability with major revision number 6 and higher) support Compute Preemption. This allows compute tasks to be preempted at instruction-level granularity, rather than at thread block granularity as in the prior Maxwell and Kepler GPU architectures, with the benefit that applications with long-running kernels can be prevented from either monopolizing the system or timing out. However, there are context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists. The individual attribute query function cudaDeviceGetAttribute() with the attribute cudaDevAttrComputePreemptionSupported can be used to determine whether the device in use supports Compute Preemption. Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode.

Applications may query the compute mode of a device by checking the computeMode device property (see Device Enumeration).

GPUs that have a display output dedicate some DRAM memory to the so-called primary surface, which is used to refresh the display device whose output is viewed by the user. When users initiate a mode switch of the display by changing the resolution or bit depth of the display (using NVIDIA control panel or the Display control panel on Windows), the amount of memory needed for the primary surface changes. For example, if the user changes the display resolution from 1280x1024x32-bit to 1600x1200x32-bit, the system must dedicate 7.68 MB to the primary surface rather than 5.24 MB. (Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface.) On Windows, other events that may initiate display mode switches include launching a full-screen DirectX application, hitting Alt+Tab to task switch away from a full-screen DirectX application, or hitting Ctrl+Alt+Del to lock the computer.
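The two figures in the example follow directly from width x height x bytes per pixel. A minimal sketch of that arithmetic, assuming a single surface with no padding or anti-aliasing (surface_bytes is a hypothetical helper name):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one unpadded full-screen surface.
inline std::uint64_t surface_bytes(std::uint64_t width,
                                   std::uint64_t height,
                                   std::uint64_t bits_per_pixel) {
    assert(bits_per_pixel % 8 == 0);  // whole bytes per pixel only
    return width * height * (bits_per_pixel / 8);
}
```

surface_bytes(1280, 1024, 32) is 5,242,880 bytes (about 5.24 MB) and surface_bytes(1600, 1200, 32) is 7,680,000 bytes (7.68 MB), with MB taken as 10^6 bytes.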

If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory allocations dedicated to CUDA applications. Therefore, a mode switch causes any call to the CUDA runtime to fail and return an invalid context error.

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike on CPU cores, instructions are issued in order and there is no branch prediction or speculative execution.

SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices. Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x provide the specifics for devices of compute capabilities 3.x, 5.x, 6.x, and 7.x, respectively.

The NVIDIA GPU architecture uses a little-endian representation

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp

When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block
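Because warps contain consecutive thread IDs starting at 0, the warp that a thread belongs to follows from integer division by the warp size. The sketch below is a host-only illustration; WarpPosition and warp_position are hypothetical names, and "lane" is used here simply to mean a thread's position within its warp.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

struct WarpPosition {
    int warp;  // index of the warp within the block
    int lane;  // position of the thread within that warp
};

// Map a linear thread ID within a block to its warp and lane.
inline WarpPosition warp_position(int thread_id) {
    assert(thread_id >= 0);
    return WarpPosition{thread_id / kWarpSize, thread_id % kWarpSize};
}
```

Threads 0 to 31 land in warp 0, threads 32 to 63 in warp 1, and so on; thread 100, for instance, is lane 4 of warp 3.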

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths
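The cost of divergence can be illustrated with a small host-side model that counts how many sides of a two-way branch a warp has to execute. This is a simplification under stated assumptions, not a description of the hardware scheduler: warp_mask and branch_paths_executed are hypothetical helpers, and each thread's branch decision is represented as one bit of a 32-bit mask.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// Build a mask with bit i set when thread (first_thread_id + i)
// satisfies the branch condition `pred`.
template <typename Pred>
std::uint32_t warp_mask(int first_thread_id, Pred pred) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (pred(first_thread_id + lane)) {
            mask |= (std::uint32_t{1} << lane);
        }
    }
    return mask;
}

// Number of branch sides the warp must execute: 1 when all active
// threads agree, 2 when they diverge, 0 when no thread is active.
inline int branch_paths_executed(std::uint32_t taken_mask,
                                 std::uint32_t active_mask = 0xFFFFFFFFu) {
    const std::uint32_t taken = taken_mask & active_mask;
    const std::uint32_t not_taken = ~taken_mask & active_mask;
    assert((taken | not_taken) == active_mask);
    return (taken != 0 ? 1 : 0) + (not_taken != 0 ? 1 : 0);
}
```

A condition such as tid % 2 == 0 splits every warp and costs two passes, whereas a condition that depends only on tid / 32 is uniform within each warp and costs one.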

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.

Prior to NVIDIA Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from

Starting with the NVIDIA Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about the warp-synchronicity² of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with NVIDIA Volta and beyond. See Compute Capability 7.x for further details.

Notes

The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled). Threads can be inactive for a variety of reasons, including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x), and which thread performs the final write is undefined.

If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they occur is undefined

The execution context (program counters, registers, and so on) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There is also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Compute Capabilities. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.

The total number of warps in a block is as follows:

$\mathrm{ceil}\left(\frac{T}{W_{size}},\ 1\right)$

where

$T$ is the number of threads per block,
$W_{size}$ is the warp size, which is equal to 32,
$\mathrm{ceil}(x, y)$ is equal to $x$ rounded up to the nearest multiple of $y$.
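Rounding T / Wsize up to the nearest multiple of 1 is an integer ceiling division. A minimal host-side sketch, with warps_per_block as a hypothetical helper name:

```cpp
#include <cassert>

// Number of warps needed to hold `threads_per_block` threads:
// T / Wsize rounded up to the next whole number.
inline int warps_per_block(int threads_per_block, int warp_size = 32) {
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}
```

A block of 100 threads therefore occupies 4 warps, and a block of 512 threads occupies 16.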

The total number of registers and total amount of shared memory allocated for a block are documented in the CUDA Occupancy Calculator provided in the CUDA Toolkit

At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting the host to the devices, by using asynchronous function calls and streams as described in Asynchronous Concurrent Execution. It should assign to each processor the type of work it does best: serial workloads to the host; parallel workloads to the devices.

For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases. Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less efficient since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.

At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor

As described in Hardware Multithreading, a GPU multiprocessor primarily relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects an instruction that is ready to execute. This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism, or more commonly an instruction of another warp, exploiting thread-level parallelism. If a ready-to-execute instruction is selected, it is issued to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely "hidden". The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions). If we assume instructions with maximum throughput, it is equal to:

4L for devices of compute capability 5.x, 6.1, 6.2, 7.x and 8.x since for these devices, a multiprocessor issues one instruction per warp over one clock cycle for four warps at a time, as mentioned in Compute Capabilities.
2L for devices of compute capability 6.0 since for these devices, the two instructions issued every cycle are one instruction for two different warps.
8L for devices of compute capability 3.x since for these devices, the eight instructions issued every cycle are four pairs for four different warps, each pair being for the same warp.
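The three cases above amount to a per-architecture multiplier applied to L. The sketch below encodes only the compute capabilities listed in the text and returns -1 for anything else; instructions_to_hide_latency is a hypothetical helper name, and the result assumes instructions with maximum throughput.

```cpp
#include <cassert>

// Instructions needed to hide `latency_cycles` of latency, assuming
// maximum-throughput instructions. Returns -1 for compute capabilities
// that the list above does not cover.
inline int instructions_to_hide_latency(int cc_major, int cc_minor,
                                        int latency_cycles) {
    assert(latency_cycles >= 0);
    if (cc_major == 3) {
        return 8 * latency_cycles;  // 3.x
    }
    if (cc_major == 6 && cc_minor == 0) {
        return 2 * latency_cycles;  // 6.0
    }
    const bool four_per_cycle =
        cc_major == 5 || cc_major == 7 || cc_major == 8 ||
        (cc_major == 6 && (cc_minor == 1 || cc_minor == 2));
    return four_per_cycle ? 4 * latency_cycles : -1;
}
```

With a latency of 4 clock cycles, a compute capability 7.0 device needs 16 instructions in flight, while a 6.0 device needs 8 and a 3.5 device needs 32.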

The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not available yet

If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time. Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies (assuming that warps execute instructions with maximum throughput; otherwise fewer warps are needed). If the individual warps exhibit instruction-level parallelism, i.e., have multiple independent instructions in their instruction stream, fewer warps are needed because multiple independent instructions from a single warp can be issued back to back.

If some input operand resides in off-chip memory, the latency is much higher: typically hundreds of clock cycles. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low (this ratio is commonly called the arithmetic intensity of the program).

Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence (Memory Fence Functions) or synchronization point (Synchronization Functions). A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point. Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points.

The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call (Execution Configuration), the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading. Register and shared memory usage are reported by the compiler when compiling with the --ptxas-options=-v option.

The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory

The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices of compute capability 6.x, if a kernel uses 64 registers and each block has 512 threads and requires very little shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor since they require 2 x 512 x 64 = 65,536 registers, which exactly matches the number of registers available on the multiprocessor. But as soon as the kernel uses one more register, only one block (i.e., 16 warps) can be resident since two blocks would require 2 x 512 x 65 = 66,560 registers, which is more than the number of registers available on the multiprocessor. Therefore, the compiler attempts to minimize register usage while keeping register spilling (see Device Memory Accesses) and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.
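The register arithmetic in this example can be reproduced with one integer division. The sketch below is a deliberately simplified model: it considers registers only, ignores register allocation granularity and the separate limits on resident blocks, warps, and shared memory, and takes the 65,536-register figure from the example above. blocks_limited_by_registers is a hypothetical helper name.

```cpp
#include <cassert>

// Upper bound on resident blocks per multiprocessor when registers are
// the only constraint (simplified: no allocation granularity).
inline int blocks_limited_by_registers(int registers_per_multiprocessor,
                                       int threads_per_block,
                                       int registers_per_thread) {
    assert(registers_per_multiprocessor > 0);
    assert(threads_per_block > 0 && registers_per_thread > 0);
    const int registers_per_block = threads_per_block * registers_per_thread;
    return registers_per_multiprocessor / registers_per_block;
}
```

With 65,536 registers and 512 threads per block, 64 registers per thread allows 2 resident blocks, while 65 registers per thread allows only 1.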

The register file is organized as 32-bit registers. So, each variable stored in a register needs at least one 32-bit register, for example, a double variable uses two 32-bit registers
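As a rough host-side illustration of this rule, the number of 32-bit registers a register-resident variable occupies is its size in bytes divided by 4, rounded up. registers_for_bytes is a hypothetical helper; actual register allocation is decided by the compiler and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Minimum number of 32-bit registers needed to hold a variable of the
// given size: at least one, and one more for every extra 4 bytes.
inline std::size_t registers_for_bytes(std::size_t size_in_bytes) {
    assert(size_in_bytes > 0);
    return (size_in_bytes + 3) / 4;
}
```

An 8-byte double maps to two registers, while a 4-byte float and a 1-byte char each still need one.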

The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parametrize execution configurations based on register file size and shared memory size, which depend on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime (see reference manual).

The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible
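As a rough illustration of why under-populated warps waste resources, the sketch below counts the warps a block occupies and the lanes left without a thread. The helper names are hypothetical, and a warp size of 32 is assumed as the default.

```cpp
#include <cassert>

// A partially filled warp still occupies a full warp, so round up.
int warpsPerBlock(int threadsPerBlock, int warpSize = 32)
{
    assert(threadsPerBlock > 0 && warpSize > 0);
    return (threadsPerBlock + warpSize - 1) / warpSize;
}

// Lanes that are allocated to the block but have no thread to run.
int idleLanes(int threadsPerBlock, int warpSize = 32)
{
    return warpsPerBlock(threadsPerBlock, warpSize) * warpSize - threadsPerBlock;
}
```

Under this model a block of 100 threads occupies 4 warps and leaves 28 lanes idle, whereas a block of 128 threads leaves none.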

Several API functions exist to assist programmers in choosing thread block size and cluster size based on register and shared memory requirements

The occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor, can provide an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics: multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor, and further dividing concurrent warps by the maximum warps per multiprocessor gives the occupancy as a percentage
The occupancy-based launch configurator APIs, cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxPotentialBlockSizeVariableSMem, heuristically calculate an execution configuration that achieves the maximum multiprocessor-level occupancy
The occupancy calculator API, cudaOccupancyMaxActiveClusters, can provide an occupancy prediction based on the cluster size, block size and shared memory usage of a kernel. This function reports occupancy in terms of the maximum number of active clusters of a given size on the GPU present in the system

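The following code sample calculates the occupancy of MyKernel. It then reports the occupancy level as the ratio of concurrent warps to maximum warps per multiprocessor.

#include <iostream>

// Device code (the kernel body is illustrative)
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;        // Occupancy in terms of active blocks per multiprocessor
    int blockSize = 32;
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    // Convert active blocks to warps, then to a percentage of the maximum
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;
    return 0;
}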

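The following code sample configures an occupancy-based kernel launch of MyKernel according to the user input.

// Device code (the kernel body is illustrative)
__global__ void MyKernel(int *array, int arrayCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayCount) {
        array[idx] *= array[idx];
    }
}

// Host code
int launchMyKernel(int *array, int arrayCount)
{
    int blockSize;      // Block size returned by the launch configurator
    int minGridSize;    // Minimum grid size needed to reach maximum occupancy on a full device launch
    int gridSize;       // Actual grid size needed, based on the input size

    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, arrayCount);

    // Round up according to the array size
    gridSize = (arrayCount + blockSize - 1) / blockSize;

    MyKernel<<<gridSize, blockSize>>>(array, arrayCount);
    cudaDeviceSynchronize();
    return 0;
}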

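The following code sample shows how to use the cluster occupancy API to find the maximum number of active clusters of a given size. The example code below calculates occupancy for a cluster of size 2 and 128 threads per block.

A cluster size of 8 is forward compatible starting with compute capability 9.0. However, it is recommended that users query the maximum cluster size before launching a cluster kernel. The maximum cluster size can be queried using the cudaOccupancyMaxPotentialClusterSize API.

{
  // kernel, number_of_blocks and dynamic_shared_memory_size are placeholders
  // supplied by the application
  cudaLaunchConfig_t config = {0};
  config.gridDim = number_of_blocks;
  config.blockDim = 128; // threads_per_block = 128
  config.dynamicSmemBytes = dynamic_shared_memory_size;

  cudaLaunchAttribute attribute[1];
  attribute[0].id = cudaLaunchAttributeClusterDimension;
  attribute[0].val.clusterDim.x = 2; // cluster_size = 2
  attribute[0].val.clusterDim.y = 1;
  attribute[0].val.clusterDim.z = 1;
  config.attrs = attribute;
  config.numAttrs = 1;

  int max_cluster_size = 0;
  cudaOccupancyMaxPotentialClusterSize(&max_cluster_size, (void *)kernel, &config);

  int max_active_clusters = 0;
  cudaOccupancyMaxActiveClusters(&max_active_clusters, (void *)kernel, &config);

  std::cout << "Max Active Clusters of size 2: " << max_active_clusters << std::endl;
}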

The CUDA Nsight Compute User Interface also provides a standalone occupancy calculator and launch configurator implementation in /include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. The Nsight Compute version of the occupancy calculator is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread).

The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth

That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device, since these have much lower bandwidth than data transfers between global memory and the device

That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared memory and caches (i.e., L1 cache and L2 cache available on devices of compute capability 2.x and higher, texture cache and constant cache available on all devices).

Shared memory is equivalent to a user-managed cache. The application explicitly allocates and accesses it. As illustrated in CUDA Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:

Load data from device memory to shared memory,
Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads,
Process the data in shared memory,
Synchronize again if necessary to make sure that shared memory has been updated with the results,
Write the results back to device memory

For some applications (for example, those for which global memory access patterns are data-dependent), a traditional hardware-managed cache is more appropriate to exploit data locality. As mentioned in Compute Capability 3.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0, for devices of compute capability 3.x, 7.x, 8.x and 9.0, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call.

The throughput of memory accesses by a kernel can vary by an order of magnitude depending on access pattern for each type of memory. The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible based on the optimal memory access patterns described in Device Memory Accesses. This optimization is especially important for global memory accesses as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput, so non-optimal global memory accesses generally have a high impact on performance

Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move more code from the host to the device, even if that means running kernels that do not expose enough parallelism to execute on the device with full efficiency. Intermediate data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory

Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always performs better than making each transfer separately

On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked host memory as described in Page-Locked Host Memory

In addition, when using mapped page-locked memory [Mapped Memory], there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers are implicitly performed each time the kernel accesses the mapped memory. For maximum performance, these memory accesses must be coalesced as with accesses to global memory [see Device Memory Accesses]. Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead of explicit copies between device and host memory can be a win for performance

On integrated systems where device memory and host memory are physically the same, any copy between host and device memory is superfluous and mapped page-locked memory should be used instead. Applications may query whether a device is integrated by checking that the integrated device property (see Device Enumeration) is equal to 1.

An instruction that accesses addressable memory (i.e., global, local, shared, constant, or texture memory) might need to be re-issued multiple times depending on the distribution of the memory addresses across the threads within the warp. How the distribution affects the instruction throughput in this way is specific to each type of memory and is described in the following sections. For example, for global memory, as a general rule, the more scattered the addresses are, the lower the throughput.

Global Memory

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8
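The effect of address distribution on the number of transactions can be approximated on the host. The sketch below is a deliberately simplified model, assuming one transaction per distinct naturally aligned 32-byte segment touched by a warp; actual behaviour depends on the compute capability. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// True if address is a multiple of size, i.e. naturally aligned.
bool isNaturallyAligned(std::uint64_t address, std::uint64_t size)
{
    assert(size > 0);
    return address % size == 0;
}

// Byte addresses accessed by the threads of one warp: base + lane * stride.
std::vector<std::uint64_t> warpAddresses(std::uint64_t base, std::uint64_t strideBytes,
                                         int threads = 32)
{
    std::vector<std::uint64_t> addresses;
    for (int lane = 0; lane < threads; ++lane)
        addresses.push_back(base + lane * strideBytes);
    return addresses;
}

// Simplified model: one transaction per distinct aligned segment touched.
std::size_t segmentsTouched(const std::vector<std::uint64_t>& addresses,
                            std::uint64_t wordSize, std::uint64_t segmentSize = 32)
{
    assert(wordSize > 0 && segmentSize > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses)
        for (std::uint64_t s = a / segmentSize; s <= (a + wordSize - 1) / segmentSize; ++s)
            segments.insert(s);
    return segments.size();
}
```

In this model, 32 threads reading consecutive 4-byte words from an aligned base touch 4 segments (128 bytes moved for 128 bytes used), while a 32-byte stride touches 32 segments (1024 bytes moved for 128 bytes used), which corresponds to the factor of 8 mentioned above.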

How many transactions are necessary and how much throughput is ultimately affected varies with the compute capability of the device. Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0 give more details on how global memory accesses are handled for various compute capabilities.

To maximize global memory throughput, it is therefore important to maximize coalescing by:

Following the optimal access patterns based on Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0,
Using data types that meet the size and alignment requirement detailed in the section Size and Alignment Requirement below,
Padding data in some cases, for example, when accessing a two-dimensional array as described in the section Two-Dimensional Arrays below

Size and Alignment Requirement

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory

The alignment requirement is automatically fulfilled for the Built-in Vector Types

For structures, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8) or __align__(16), such as

struct __align__(8) {
    float x;
    float y;
};

or

struct __align__(16) {
    float x;
    float y;
    float z;
};

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocation of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.
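One way to keep sub-arrays aligned in such a custom allocation scheme is to round each starting offset up to the alignment of the element type. The following host-side sketch illustrates the idea; `alignUp` and `BlockPartitioner` are hypothetical helpers, not part of the CUDA API, and the sketch only computes offsets within a block that is assumed to be allocated elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Rounds offset up to the next multiple of alignment.
// alignment must be a power of two.
std::size_t alignUp(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hands out byte offsets inside one large block, padding between
// sub-arrays so that each one starts on its required alignment.
struct BlockPartitioner {
    std::size_t next = 0;  // first unused byte offset in the block

    std::size_t take(std::size_t bytes, std::size_t alignment)
    {
        std::size_t start = alignUp(next, alignment);
        next = start + bytes;
        return start;
    }
};
```

Because the block's own starting address is aligned to at least 256 bytes, an offset that is a multiple of 8 or 16 gives an address with the same alignment. For example, after a 6-byte array at offset 0, a 16-byte-aligned array would start at offset 16 rather than 6.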

Two-Dimensional Arrays

A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type type* (where type meets the requirement described in Maximize Utilization)

BaseAddress + width * ty + tx

For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size

In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
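The padding rule can be sketched as simple index arithmetic. This is only an illustration of rounding a width up to a multiple of the warp size; it is not what cudaMallocPitch() computes, since the pitch returned by the runtime is chosen by the implementation. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Width in elements, rounded up to the next multiple of the warp size.
std::size_t paddedWidth(std::size_t width, std::size_t warpSize = 32)
{
    assert(warpSize > 0);
    return (width + warpSize - 1) / warpSize * warpSize;
}

// Element index of (tx, ty) in a row-major array whose rows are padded
// to rowWidth elements.
std::size_t elementIndex(std::size_t rowWidth, std::size_t tx, std::size_t ty)
{
    assert(tx < rowWidth);
    return rowWidth * ty + tx;
}
```

Under these assumptions an array of logical width 100 would be stored with rows of 128 elements, so element (3, 2) sits at index 128 * 2 + 3 = 259.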

Local Memory

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are

Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling)

Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. Even if it has not, subsequent compilation phases might still decide otherwise if they find it consumes too much register space for the targeted architecture. Inspection of the cubin object using cuobjdump will tell if this is the case. Also, the compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option. Note that some mathematical functions have implementation paths that might access local memory.

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (for example, same index in an array variable, same member in a structure variable).

On some devices of compute capability 3.x, local memory accesses are always cached in L1 and L2 in the same way as global memory accesses (see Compute Capability 3.x).

On devices of compute capability 5.x and 6.x, local memory accesses are always cached in L2 in the same way as global memory accesses (see Compute Capability 5.x and Compute Capability 6.x).

Shared Memory

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts
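To make the notion of an n-way bank conflict concrete, the following host-side sketch computes n for the byte addresses issued by one warp. It assumes 32 banks with successive 32-bit words mapped to successive banks, and it assumes that threads reading the same word are served by a single access; both are assumptions of this sketch, and the actual mapping is described in the compute capability sections. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Byte addresses of 32-bit words accessed with a fixed stride (in words).
std::vector<std::uint32_t> stridedWordAccess(std::uint32_t strideWords, int threads = 32)
{
    std::vector<std::uint32_t> addresses;
    for (int lane = 0; lane < threads; ++lane)
        addresses.push_back(static_cast<std::uint32_t>(lane) * strideWords * 4u);
    return addresses;
}

// Returns n for an n-way bank conflict; 1 means the request is conflict-free.
int bankConflictDegree(const std::vector<std::uint32_t>& byteAddresses,
                       int numBanks = 32, int bankWidthBytes = 4)
{
    assert(numBanks > 0 && bankWidthBytes > 0);
    std::vector<std::set<std::uint32_t>> wordsPerBank(numBanks);
    for (std::uint32_t a : byteAddresses) {
        std::uint32_t word = a / bankWidthBytes;
        wordsPerBank[word % numBanks].insert(word);  // identical words count once
    }
    std::size_t worst = 1;
    for (const auto& words : wordsPerBank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

In this model a stride of one word is conflict-free, a stride of two words gives a 2-way conflict, and a stride of 32 words sends every thread to the same bank, giving a 32-way conflict.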

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts. This is described in Compute Capability 3.x, Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x, and Compute Capability 9.0 for devices of compute capability 3.x, 5.x, 6.x, 7.x, 8.x and 9.0 respectively.

Constant Memory

The constant memory space resides in device memory and is cached in the constant cache

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise

Texture and Surface Memory

The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency

Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:

If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved provided that there is locality in the texture fetches or surface reads;
Addressing calculations are performed outside the kernel by dedicated units;
Packed data may be broadcast to separate variables in a single operation;
8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory)

Table 3 gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities

Table 3. Throughput of Native Arithmetic Instructions (Number of Results per Clock Cycle per Multiprocessor)
Compute Capability columns: 3.5, 3.7 | 5.0, 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.x | 8.0 | 8.6 | 8.9 | 9.0
16-bit floating-point add, multiply, multiply-add: N/A256128225612825631282564
32-bit floating-point add, multiply, multiply-add: 1921286412864128
64-bit floating-point add, multiply, multiply-add: 6454324326322264
32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm (__log2f), base 2 exponential (exp2f), sine (__sinf), cosine (__cosf): 32163216
32-bit integer add, extended-precision add, subtract, extended-precision subtract: 1601286412864
32-bit integer multiply, multiply-add, extended-precision multiply-add: 32Multiple instruct. 647
24-bit integer multiply (__[u]mul24): Multiple instruct.
32-bit integer shift: 648643264
compare, minimum, maximum: 160643264
32-bit integer bit reverse: 3264326416
Bit field extract/insert: 32643264Multiple Instruct. 64
32-bit bitwise AND, OR, XOR: 1601286412864
count of leading zeros, most significant non-sign bit: 32163216
population count: 32163216
warp shuffle: 3232932
warp reduce: Multiple instruct. 16
warp vote: 12864
sum of absolute difference: 32643264
SIMD video instructions vabsdiff2: 160Multiple instruct.
SIMD video instructions vabsdiff4: 160Multiple instruct. 64
All other SIMD video instructions: 32Multiple instruct.
Type conversions from 8-bit and 16-bit integer to 32-bit integer types: 12832163264
Type conversions from and to 64-bit types: 321041641611162216
All other type conversions: 32163216

Other instructions and functions are implemented on top of the native instructions. The implementation may be different for devices of different compute capabilities, and the number of native instructions after compilation may fluctuate with every compiler version. For complicated functions, there can be multiple code paths depending on input. cuobjdump can be used to inspect a particular implementation in a cubin object

The implementations of some functions are readily available in the CUDA header files (math_functions.h, device_functions.h, ...).

In general, code compiled with -ftz=true (denormalized numbers are flushed to zero) tends to have higher performance than code compiled with -ftz=false. Similarly, code compiled with -prec-div=false (less precise division) tends to have higher performance than code compiled with -prec-div=true, and code compiled with -prec-sqrt=false (less precise square root) tends to have higher performance than code compiled with -prec-sqrt=true. The nvcc user manual describes these compilation flags in more detail.

Single-Precision Floating-Point Division

__fdividef(x, y) (see Intrinsic Functions) provides faster single-precision floating-point division than the division operator.

Single-Precision Floating-Point Reciprocal Square Root

To preserve IEEE-754 semantics the compiler can optimize 1.0/sqrtf() into rsqrtf() only when both reciprocal and square root are approximate (i.e., with -prec-div=false and -prec-sqrt=false). It is therefore recommended to invoke rsqrtf() directly where desired.

Single-Precision Floating-Point Square Root

Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication so that it gives correct results for 0 and infinity

Sine and Cosine

sinf(x), cosf(x), tanf(x), sincosf(x), and corresponding double-precision instructions are much more expensive and even more so if the argument x is large in magnitude.

More precisely, the argument reduction code [see Mathematical Functions for implementation] comprises two code paths referred to as the fast path and the slow path, respectively

The fast path is used for arguments sufficiently small in magnitude and essentially consists of a few multiply-add operations. The slow path is used for arguments large in magnitude and consists of lengthy computations required to achieve correct results over the entire argument range

At present, the argument reduction code for the trigonometric functions selects the fast path for arguments whose magnitude is less than 105615.0f for the single-precision functions, and less than 2147483648.0 for the double-precision functions.

As the slow path requires more registers than the fast path, an attempt has been made to reduce register pressure in the slow path by storing some intermediate variables in local memory, which may affect performance because of the high latency and low bandwidth of local memory (see Device Memory Accesses). At present, 28 bytes of local memory are used by single-precision functions, and 44 bytes are used by double-precision functions. However, the exact amount is subject to change.

Due to the lengthy computations and use of local memory in the slow path, the throughput of these trigonometric functions is lower by one order of magnitude when the slow path reduction is required as opposed to the fast path reduction

Integer Arithmetic

Integer division and modulo operation are costly as they compile to up to 20 instructions. They can be replaced with bitwise operations in some cases. If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
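The two identities can be checked directly on the host. The sketch below uses unsigned operands, which is an assumption of this example: for negative signed values a right shift and a division round differently, so the equivalence holds as stated only for non-negative i. The helper names are hypothetical.

```cpp
#include <cassert>

// log2 of a power of two, found by counting how far n must be shifted.
unsigned log2PowerOfTwo(unsigned n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of two
    unsigned shift = 0;
    while ((n >> shift) != 1u)
        ++shift;
    return shift;
}

// i / n for a power-of-two n, using a shift instead of a division.
unsigned divPow2(unsigned i, unsigned n)
{
    return i >> log2PowerOfTwo(n);
}

// i % n for a power-of-two n, using a mask instead of a modulo.
unsigned modPow2(unsigned i, unsigned n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return i & (n - 1);
}
```

For instance, with n = 8 the value 1003 gives a quotient of 125 and a remainder of 3 through either form.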

__brev and __popc map to a single instruction and __brevll and __popcll to a few instructions

__[u]mul24 are legacy intrinsic functions that no longer have any reason to be used

Half Precision Arithmetic

In order to achieve good performance for 16-bit precision floating-point add, multiply or multiply-add, it is recommended that the half2 datatype is used for half precision and __nv_bfloat162 be used for __nv_bfloat16 precision. Vector intrinsics [for example, __hadd2, __hsub2, __hmul2, __hfma2] can then be used to do two operations in a single instruction. Using half2 or __nv_bfloat162 in place of two calls using half or __nv_bfloat16 may also help performance of other intrinsics, such as warp shuffles

The intrinsic __halves2half2 is provided to convert two half precision values to the half2 datatype

The intrinsic __halves2bfloat162 is provided to convert two __nv_bfloat16 precision values to the __nv_bfloat162 datatype

Type Conversion

Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for

Functions operating on variables of type char or short whose operands generally need to be converted to int,
Double-precision floating-point constants (i.e., those constants defined without any type suffix) used as input to single-precision floating-point computations (as mandated by C/C++ standards)

This last case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f.
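The promotion rule can be observed at compile time in standard C++. In the sketch below, which uses hypothetical function names, the unsuffixed constant forces the multiplication to be carried out in double precision, while the f-suffixed constant keeps it in single precision.

```cpp
#include <type_traits>

// An unsuffixed constant is a double, so the float operand is promoted
// and the result of the expression is a double.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double constant is evaluated in double precision");

// With the f suffix the whole expression stays in single precision.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float constant is evaluated in single precision");

// Converts x to double, multiplies, then converts the result back to float.
float halveViaDouble(float x)
{
    return static_cast<float>(x * 0.5);
}

// Stays in single precision throughout; no conversion instructions needed.
float halveSinglePrecision(float x)
{
    return x * 0.5f;
}
```

Both functions return the same value for inputs such as 3.0f; the difference lies in the conversions the compiler has to insert around the double-precision multiply.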

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp.

To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in SIMT Architecture. A trivial example is when the controlling condition only depends on (threadIdx / warpSize) where warpSize is the warp size. In this case, no warp diverges since the controlling condition is perfectly aligned with the warps.

Sometimes, the compiler may unroll loops or it may optimize out short if or switch blocks by using branch predication instead, as detailed below. In these cases, no warp can ever diverge. The programmer can also control loop unrolling using the #pragma unroll directive (see #pragma unroll).
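The difference between a warp-aligned condition and a per-thread condition can be modelled on the host by counting the warps in which threads disagree. This is only a sketch under the assumption of a one-dimensional block with linear thread IDs; `countDivergentWarps` is a hypothetical helper.

```cpp
#include <cassert>
#include <functional>

// Counts the warps of a 1D block in which the condition is true for some
// threads and false for others, i.e. the warps that would diverge.
int countDivergentWarps(int threadsPerBlock, int warpSize,
                        const std::function<bool(int)>& takesBranch)
{
    assert(threadsPerBlock > 0 && warpSize > 0);
    int divergent = 0;
    for (int first = 0; first < threadsPerBlock; first += warpSize) {
        bool sawTrue = false;
        bool sawFalse = false;
        for (int tid = first; tid < first + warpSize && tid < threadsPerBlock; ++tid) {
            if (takesBranch(tid))
                sawTrue = true;
            else
                sawFalse = true;
        }
        if (sawTrue && sawFalse)
            ++divergent;
    }
    return divergent;
}
```

For a block of 128 threads, a condition such as `(tid / 32) % 2 == 0` gives zero divergent warps because every thread of a warp evaluates it identically, whereas `tid % 2 == 0` makes all four warps diverge.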

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands

Applications that allocate and free memory too often may find that the allocation calls tend to get slower over time up to a limit. This is typically expected due to the nature of releasing memory back to the operating system for its own use. For best performance in this regard, we recommend the following:

Try to size your allocation to the problem at hand. Don't try to allocate all available memory with cudaMalloc / cudaMallocHost / cuMemCreate, as this forces memory to be resident immediately and prevents other applications from being able to use that memory. This can put more pressure on operating system schedulers, or just prevent other applications using the same GPU from running entirely
Try to allocate memory in appropriately sized allocations early in the application and free allocations only when the application does not have any use for them. Reduce the number of cudaMalloc+cudaFree calls in the application, especially in performance-critical regions
If an application cannot allocate enough device memory, consider falling back on other memory types such as cudaMallocHost or cudaMallocManaged, which may not be as performant, but will enable the application to make progress
For platforms that support the feature, cudaMallocManaged allows for oversubscription, and with the correct cudaMemAdvise policies enabled, will allow the application to retain most if not all the performance of cudaMalloc. cudaMallocManaged also won't force an allocation to be resident until it is needed or prefetched, reducing the overall pressure on the operating system schedulers and better enabling multi-tenant use cases

The __host__ execution space specifier declares a function that is

Executed on the host,
Callable from the host only

It is equivalent to declare a function with only the __host__ execution space specifier or to declare it without any of the __host__, __device__, or __global__ execution space specifier; in either case the function is compiled for the host only

The __global__ and __host__ execution space specifiers cannot be used together

The __device__ and __host__ execution space specifiers can be used together however, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device

__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 800
   // Device code path for compute capability 8.x
#elif __CUDA_ARCH__ >= 700
   // Device code path for compute capability 7.x
#elif __CUDA_ARCH__ >= 600
   // Device code path for compute capability 6.x
#elif __CUDA_ARCH__ >= 500
   // Device code path for compute capability 5.x
#elif !defined(__CUDA_ARCH__)
   // Host code path
#endif
}

The __shared__ memory space specifier, optionally used together with __device__, declares a variable that

Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Has a distinct object per block,
Is only accessible from all the threads within the block,
Does not have a constant address

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion start at the same address in memory, so that the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of

short array0[128];
float array1[64];
int   array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 =   (int*)&array1[64];
}

Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[127];   // 254 bytes in: not 4-byte aligned
}
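To connect the offset scheme with the launch side, here is a small sketch that carves one dynamically sized shared buffer into a float array followed by an int array, with the byte count passed as the third execution-configuration argument. The kernel name carveShared, the helper runCarveShared, and the sizes are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>

// One dynamic shared buffer split into nFloats floats followed by nInts ints.
// Both element types are 4 bytes wide, so the int sub-array stays aligned.
__global__ void carveShared(int* out, int nFloats, int nInts)
{
    extern __shared__ float buffer[];
    float* fArr = buffer;
    int*   iArr = (int*)&fArr[nFloats];

    int t = threadIdx.x;
    if (t < nFloats) fArr[t] = 1.0f;
    if (t < nInts)   iArr[t] = 2;
    __syncthreads();                 // all writes finished before thread 0 reads

    if (t == 0) {
        int sum = 0;
        for (int i = 0; i < nFloats; ++i) sum += (int)fArr[i];
        for (int i = 0; i < nInts; ++i)   sum += iArr[i];
        *out = sum;
    }
}

// Hypothetical host helper: launches one block and returns the computed sum.
int runCarveShared()
{
    const int nFloats = 64, nInts = 32;
    size_t bytes = nFloats * sizeof(float) + nInts * sizeof(int);
    int* dOut = nullptr;
    int hOut = -1;
    cudaMalloc(&dOut, sizeof(int));
    // Third launch parameter: size in bytes of the dynamic shared memory.
    carveShared<<<1, 64, bytes>>>(dOut, nFloats, nInts);
    cudaMemcpy(&hOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```

With 64 floats set to 1 and 32 ints set to 2, the helper is expected to return 128.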

Alignment requirements for the built-in vector types are listed in Table 4

The __grid_constant__ annotation for compute architectures greater than or equal to 7.0 annotates a const-qualified __global__ function parameter of non-reference type that

Has the lifetime of the grid,
Is private to the grid, i.e., the object is not accessible to host threads and threads from other grids, including sub-grids,
Has a distinct object per grid, i.e., all threads in the grid see the same address,
Is read-only, i.e., modifying a __grid_constant__ object or any of its sub-objects is undefined behavior, including mutable members

Requirements

Kernel parameters annotated with __grid_constant__ must have const-qualified non-reference types
All function declarations must match with respect to any __grid_constant__ parameters
A function template specialization must match the primary template declaration with respect to any __grid_constant__ parameters
A function template instantiation directive must match the primary template declaration with respect to any __grid_constant__ parameters

If the address of a __global__ function parameter is taken, the compiler will ordinarily make a copy of the kernel parameter in thread local memory and use the address of the copy, to partially support C++ semantics, which allow each thread to modify its own local copy of function parameters. Annotating a __global__ function parameter with __grid_constant__ ensures that the compiler will not create a copy of the kernel parameter in thread local memory, but will instead use the generic address of the parameter itself. Avoiding the local copy may result in improved performance

struct S { int x; };
__device__ void unknown_function(S const&);
__global__ void kernel(const __grid_constant__ S s)
{
    // s is read-only: modifying it (for example through a const_cast)
    // is undefined behavior.

    // The compiler will not create a per-thread local copy of s here:
    unknown_function(s);
}

nvcc supports restricted pointers via the __restrict__ keyword

Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, and which inhibits all kinds of optimization, from code re-ordering to common sub-expression elimination

Here is an example subject to the aliasing issue, where use of restricted pointer can help the compiler to reduce the number of instructions

void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}

In C-type languages, the pointers a, b, and c may be aliased, so any write through c could modify elements of a or b. This means that to guarantee functional correctness, the compiler cannot load a[0] and b[0] into registers, multiply them, and store the result to both c[0] and c[1], because the results would differ from the abstract execution model if, say, a[0] is really the same location as c[0]. So the compiler cannot take advantage of the common sub-expression. Likewise, the compiler cannot just reorder the computation of c[4] into the proximity of the computation of c[0] and c[1] because the preceding write to c[3] could change the inputs to the computation of c[4]

By making a, b, and c restricted pointers, the programmer asserts to the compiler that the pointers are in fact not aliased, which in this case means writes through c would never overwrite elements of a or b. This changes the function prototype as follows

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c);

Note that all pointer arguments need to be made restricted for the compiler optimizer to derive any benefit. With the __restrict__ keywords added, the compiler can now reorder and do common sub-expression elimination at will, while retaining functionality identical with the abstract execution model

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[4] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[5] = t1;
    ...
}

The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions

Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy

These are vector types derived from the basic integer and floating-point types. They are structures and the 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example,

int2 make_int2(int x, int y);

which creates a vector of type int2 with value (x, y).

The alignment requirements of the vector types are detailed in Table 4

Table 4. Alignment Requirements

Type                      Alignment
char1, uchar1             1
char2, uchar2             2
char3, uchar3             1
char4, uchar4             4
short1, ushort1           2
short2, ushort2           4
short3, ushort3           2
short4, ushort4           8
int1, uint1               4
int2, uint2               8
int3, uint3               4
int4, uint4               16
long1, ulong1             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long2, ulong2             8 if sizeof(long) is equal to sizeof(int), 16 otherwise
long3, ulong3             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long4, ulong4             16
longlong1, ulonglong1     8
longlong2, ulonglong2     16
longlong3, ulonglong3     8
longlong4, ulonglong4     16
float1                    4
float2                    8
float3                    4
float4                    16
double1                   8
double2                   16
double3                   8
double4                   16
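A few of the table entries can be checked at compile time, and the make_ constructors can be exercised from host or device code. The sketch below assumes the vector types come in through cuda_runtime.h; addInt2 is a hypothetical helper, not part of the CUDA API.

```cuda
#include <cuda_runtime.h>

// Alignment values taken from Table 4, verified by the compiler.
static_assert(alignof(char3)  == 1,  "char3 is aligned to 1 byte");
static_assert(alignof(int2)   == 8,  "int2 is aligned to 8 bytes");
static_assert(alignof(float3) == 4,  "float3 is aligned to 4 bytes");
static_assert(alignof(float4) == 16, "float4 is aligned to 16 bytes");

// Hypothetical helper: component-wise sum of two int2 values,
// built with the make_int2 constructor function.
__host__ __device__ inline int2 addInt2(int2 a, int2 b)
{
    return make_int2(a.x + b.x, a.y + b.y);
}
```

For example, addInt2(make_int2(1, 2), make_int2(3, 4)) yields a value whose x field is 4 and whose y field is 6.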

The CUDA programming model assumes a device with a weakly-ordered memory model, that is, the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread. It is undefined behavior for two threads to read from or write to the same memory location without synchronization

In the following example, thread 1 executes writeXY(), while thread 2 executes readXY().

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    int A = X;
}

The two threads read and write from the same memory locations X and Y simultaneously. Any data-race is undefined behavior, and has no defined semantics. The resulting values for A and B can be anything

Memory fence functions can be used to enforce a sequentially-consistent ordering on memory accesses. The memory fence functions differ in the scope in which the orderings are enforced but they are independent of the accessed memory space (shared memory, global memory, page-locked host memory, and the memory of a peer device)

void __threadfence_block();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_block) and ensures that

All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block();
All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block()

void __threadfence();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_device) and ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence()

void __threadfence_system();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_system) and ensures that all writes to all memory made by the calling thread before the call to __threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to __threadfence_system()

__threadfence_system() is only supported by devices of compute capability 2.x and higher

In the previous code sample, we can insert fences in the code as follows.

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    __threadfence();
    int A = X;
}

For this code, the following outcomes can be observed

A equal to 1 and B equal to 2,
A equal to 10 and B equal to 2,
A equal to 10 and B equal to 20

The fourth outcome (A equal to 1 and B equal to 20) is not possible, because the first write must be visible before the second write. If thread 1 and 2 belong to the same block, it is enough to use __threadfence_block(). If thread 1 and 2 do not belong to the same block, __threadfence() must be used if they are CUDA threads from the same device and __threadfence_system() must be used if they are CUDA threads from two different devices

A common use case is when threads consume some data produced by other threads, as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result. In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum (see Atomic Functions). The last block is the one that receives the counter value equal to gridDim.x-1. If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory

Memory fence functions only affect the ordering of memory operations by a thread; they do not, by themselves, ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)). In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile (see Volatile Qualifier)

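__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array (helper defined elsewhere).
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum to global memory.
        // Because "result" is volatile, the store bypasses the L1 cache, so
        // the last block reads the partial sums computed by all other blocks.
        result[blockIdx.x] = partialSum;
        // Ensure the partial sum is written before "count" is incremented.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines whether its block is the last one to finish.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Make sure every thread reads the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            // Store the total and reset count for the next kernel call.
            result[0] = totalSum;
            count = 0;
        }
    }
}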

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block

__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects
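The read-after-write hazard mentioned above can be illustrated with a block that reverses an array through shared memory: every thread reads an element written by a different thread, so a barrier is needed between the write and the read. This is only a sketch; the kernel name reverseBlock, the helper runReverseBlock, and the block size are assumptions.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64   // illustrative block size

__global__ void reverseBlock(int* data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;

    tile[t] = data[t];                // each thread writes one element
    __syncthreads();                  // all writes complete before any read
    data[t] = tile[BLOCK - 1 - t];    // read an element written by another thread
}

// Hypothetical host helper: returns true if the array came back reversed.
bool runReverseBlock()
{
    int h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h[i] = i;
    int* d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseBlock<<<1, BLOCK>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < BLOCK; ++i)
        if (h[i] != BLOCK - 1 - i) return false;
    return true;
}
```

The barrier is reached unconditionally by every thread of the block, which satisfies the rule on conditional code stated above.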

Devices of compute capability 2.x and higher support three variations of __syncthreads() described below

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them

void __syncwarp(unsigned mask=0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined

Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp

Note: For .target sm_6x or below, all threads in mask must execute the same __syncwarp() in convergence, and the union of all values in mask must be equal to the active mask. Otherwise, the behavior is undefined.
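The store, __syncwarp(), load pattern described above might look like the following sketch, in which each lane publishes a value to shared memory and then reads its neighbor's value. It assumes a launch with exactly one full warp of 32 threads; rotateInWarp and runRotateInWarp are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void rotateInWarp(int* out)
{
    __shared__ int slot[32];
    int lane = threadIdx.x % 32;

    slot[lane] = lane * 10;             // store to memory
    __syncwarp();                       // full mask: orders the stores before the loads
    out[lane] = slot[(lane + 1) % 32];  // safely read a value stored by another lane
}

// Hypothetical host helper: returns true if every lane saw its neighbor's value.
bool runRotateInWarp()
{
    int h[32];
    int* d = nullptr;
    cudaMalloc(&d, sizeof(h));
    rotateInWarp<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int lane = 0; lane < 32; ++lane)
        if (h[lane] != ((lane + 1) % 32) * 10) return false;
    return true;
}
```

All 32 lanes reach the same __syncwarp() call in convergence, so the sketch also respects the sm_6x restriction in the note.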

template<class DataType>
Type tex1Dfetch(
   texture<DataType, cudaTextureType1D,
           cudaReadModeElementType> texRef,
   int x);

float tex1Dfetch(
   texture<unsigned char, cudaTextureType1D,
           cudaReadModeNormalizedFloat> texRef,
   int x);

float tex1Dfetch(
   texture<signed char, cudaTextureType1D,
           cudaReadModeNormalizedFloat> texRef,
   int x);

float tex1Dfetch(
   texture<unsigned short, cudaTextureType1D,
           cudaReadModeNormalizedFloat> texRef,
   int x);

float tex1Dfetch(
   texture<signed short, cudaTextureType1D,
           cudaReadModeNormalizedFloat> texRef,
   int x);

fetches from the region of linear memory bound to the one-dimensional texture reference texRef using integer texture coordinate x. tex1Dfetch() only works with non-normalized coordinates, so only the border and clamp addressing modes are supported. It does not perform any texture filtering. For integer types, it may optionally promote the integer to single-precision floating point

Besides the functions shown above, 2- and 4-tuples are supported; for example

float4 tex1Dfetch(
   texture<uchar4, cudaTextureType1D,
           cudaReadModeNormalizedFloat> texRef,
   int x);

fetches from the region of linear memory bound to texture reference texRef using texture coordinate x

The read-only data cache load function is only supported by devices of compute capability 3.5 and higher

T __ldg(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation is cached in the read-only data cache (see Global Memory)
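A minimal sketch of __ldg() in use follows. It assumes a device of compute capability 3.5 or higher and that the input buffer is not written while the kernel runs; the names scaleKernel and runScale are hypothetical.

```cuda
#include <cuda_runtime.h>

// Reads the input through the read-only data cache and writes a scaled copy.
__global__ void scaleKernel(const float* __restrict__ in,
                            float* __restrict__ out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * __ldg(&in[i]);
}

// Hypothetical host helper: returns true if out[i] == 2 * i for all i.
bool runScale()
{
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dOut, sizeof(h));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 127) / 128, 128>>>(dIn, dOut, 2.0f, n);
    cudaMemcpy(h, dOut, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * (float)i) return false;
    return true;
}
```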

These load functions are only supported by devices of compute capability 3.5 and higher

T __ldcg(const T* address);
T __ldca(const T* address);
T __ldcs(const T* address);
T __ldlu(const T* address);
T __ldcv(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation uses the corresponding cache operator (see PTX ISA)

These store functions are only supported by devices of compute capability 3.5 and higher

void __stwb(T* address, T value);
void __stcg(T* address, T value);
void __stcs(T* address, T value);
void __stwt(T* address, T value);

stores the value argument of type T to the location at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation uses the corresponding cache operator (see PTX ISA)

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. For example, atomicAdd() reads a word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. Atomic functions can only be used in device functions

The atomic functions described in this section have ordering cuda::memory_order_relaxed and are only atomic at a particular scope

Atomic APIs with _system suffix (example: atomicAdd_system) are atomic at scope cuda::thread_scope_system
Atomic APIs without a suffix (example: atomicAdd) are atomic at scope cuda::thread_scope_device
Atomic APIs with _block suffix (example: atomicAdd_block) are atomic at scope cuda::thread_scope_block

In the following example both the CPU and the GPU atomically update an integer value at address addr.

__global__ void mykernel(int *addr) {
  atomicAdd_system(addr, 10);       // only available on devices with compute capability 6.x
}

void foo() {
  int *addr;
  cudaMallocManaged(&addr, 4);
  *addr = 0;

  mykernel<<<...>>>(addr);
  __sync_fetch_and_add(addr, 10);   // CPU atomic operation
}

Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers is not available on devices with compute capability lower than 6.0 but it can be implemented as follows

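#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}
#endif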
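The same compare-and-swap loop can build operations that have no built-in floating-point form. As an illustration only, the sketch below implements an atomic maximum on float; atomicMaxFloat, maxKernel, and runMaxFloat are hypothetical names, not CUDA APIs.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: atomically stores max(*address, val), returns the old value.
__device__ float atomicMaxFloat(float* address, float val)
{
    int* addressAsInt = (int*)address;
    int old = *addressAsInt, assumed;
    do {
        assumed = old;
        // Nothing to do if the stored value is already at least val.
        if (__int_as_float(assumed) >= val) break;
        old = atomicCAS(addressAsInt, assumed, __float_as_int(val));
    // Integer comparison, as in the double-precision example, avoids a hang on NaN.
    } while (assumed != old);
    return __int_as_float(old);
}

__global__ void maxKernel(const float* in, int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMaxFloat(result, in[i]);
}

// Hypothetical host helper: returns the maximum of the values 0, 1, ..., n-1.
float runMaxFloat(int n)
{
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    float *dIn = nullptr, *dOut = nullptr;
    float init = -1.0f, out = 0.0f;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, &init, sizeof(float), cudaMemcpyHostToDevice);
    maxKernel<<<(n + 127) / 128, 128>>>(dIn, n, dOut);
    cudaMemcpy(&out, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    delete[] h;
    return out;
}
```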

There are system-wide and block-wide variants of the following device-wide atomic APIs, with the following exceptions

Devices with compute capability less than 6.0 only support device-wide atomic operations,
Tegra devices with compute capability less than 7.2 do not support system-wide atomic operations

int atomicAdd(int* address, int val);
unsigned int atomicAdd(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicAdd(unsigned long long int* address,
                                 unsigned long long int val);
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
__half2 atomicAdd(__half2 *address, __half2 val);
__half atomicAdd(__half *address, __half val);
__nv_bfloat162 atomicAdd(__nv_bfloat162 *address, __nv_bfloat162 val);
__nv_bfloat16 atomicAdd(__nv_bfloat16 *address, __nv_bfloat16 val);

reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old

The 32-bit floating-point version of atomicAdd() is only supported by devices of compute capability 2.x and higher

The 64-bit floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher

The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher. The atomicity of the __half2 or __nv_bfloat162 add operation is guaranteed separately for each of the two __half or __nv_bfloat16 elements; the entire __half2 or __nv_bfloat162 is not guaranteed to be atomic as a single 32-bit access

The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher

The 16-bit __nv_bfloat16 floating-point version of atomicAdd() is only supported by devices of compute capability 8.x and higher
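A typical use of the unsigned int overload is a histogram, where many threads may increment the same counter. The sketch below is illustrative; the kernel name histogram, the helper runHistogram, and the bin count are assumptions.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 4   // illustrative bin count

__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Several threads may target the same bin; atomicAdd() performs the
        // read, add, and write as one transaction. The old value is unused here.
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
    }
}

// Hypothetical host helper: fills bins[0..NUM_BINS-1] for data[i] = i % 256.
void runHistogram(int n, unsigned int* bins)
{
    unsigned char* h = new unsigned char[n];
    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i % 256);
    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, h, n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));
    histogram<<<(n + 127) / 128, 128>>>(dData, n, dBins);
    cudaMemcpy(bins, dBins, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);
    delete[] h;
}
```

With n equal to 1024, each of the four bins is expected to end up with 256 entries.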

int atomicCAS(int* address, int compare, int val);
unsigned int atomicCAS(unsigned int* address,
                       unsigned int compare,
                       unsigned int val);
unsigned long long int atomicCAS(unsigned long long int* address,
                                 unsigned long long int compare,
                                 unsigned long long int val);
unsigned short int atomicCAS(unsigned short int *address,
                             unsigned short int compare,
                             unsigned short int val);

reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap)

int __all_sync(unsigned mask, int predicate);
int __any_sync(unsigned mask, int predicate);
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();

Deprecation notice: __any, __all, and __ballot have been deprecated in CUDA 9.0 for all devices

Removal notice: When targeting devices with compute capability 7.x or higher, __any, __all, and __ballot are no longer available and their sync variants should be used instead

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input an integer predicate from each thread in the warp and compare those values with zero. The results of the comparisons are combined (reduced) across the active threads of the warp in one of the following ways, broadcasting a single return value to each participating thread

__all_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
__any_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.
__ballot_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
__activemask(): Returns a 32-bit integer mask of all currently active threads in the calling warp. The Nth bit is set if the Nth lane in the warp is active when __activemask() is called. Inactive threads are represented by 0 bits in the returned mask. Threads which have exited the program are always marked as inactive. Note that threads that are convergent at an __activemask() call are not guaranteed to be convergent at subsequent instructions unless those instructions are synchronizing warp-builtin functions

Notes

For __all_sync, __any_sync, and __ballot_sync, a mask must be passed that specifies the threads participating in the call. A bit, representing the thread's lane ID, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All active threads named in mask must execute the same intrinsic with the same mask, or the result is undefined
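As a sketch of how the vote functions combine per-thread predicates, the kernel below counts how many lanes of a single warp hold a positive value and checks whether all of them do. It assumes a launch with exactly 32 threads and a full mask; countPositive and runCountPositive are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void countPositive(const int* in, int* warpCount, int* allPositive)
{
    int lane = threadIdx.x;                       // single warp of 32 threads assumed
    int pred = in[lane] > 0;

    unsigned ballot = __ballot_sync(0xffffffffu, pred);  // one bit per lane
    int all = __all_sync(0xffffffffu, pred);             // non-zero only if every lane voted true

    if (lane == 0) {
        *warpCount = __popc(ballot);              // number of set bits in the ballot
        *allPositive = all;
    }
}

// Hypothetical host helper: uses in[i] = i - 8, returns the count of positive
// inputs and writes the __all_sync result to *allPositiveOut.
int runCountPositive(int* allPositiveOut)
{
    int h[32];
    for (int i = 0; i < 32; ++i) h[i] = i - 8;
    int *dIn = nullptr, *dCount = nullptr, *dAll = nullptr;
    int count = -1;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dCount, sizeof(int));
    cudaMalloc(&dAll, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    countPositive<<<1, 32>>>(dIn, dCount, dAll);
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(allPositiveOut, dAll, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dCount);
    cudaFree(dAll);
    return count;
}
```

With inputs -8 through 23, lanes 9 to 31 vote true, so the count is 23 and the __all_sync result is zero.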

The __shfl_sync() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp (and named in mask), moving 4 or 8 bytes of data per thread depending on the type

Threads within a warp are referred to as lanes, and may have an index between 0 and warpSize-1 (inclusive). Four source-lane addressing modes are supported

__shfl_sync(): Direct copy from indexed lane
__shfl_up_sync(): Copy from a lane with lower ID relative to caller
__shfl_down_sync(): Copy from a lane with higher ID relative to caller
__shfl_xor_sync(): Copy from a lane based on bitwise XOR of own lane ID

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

All of the __shfl_sync() intrinsics take an optional width parameter which alters the behavior of the intrinsic. width must be a power of 2; results are undefined if width is not a power of 2, or is greater than warpSize.

__shfl_sync() returns the value of var held by the thread whose ID is given by srcLane. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the lane srcLane modulo width (i.e., within the same subsection).

__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller's lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.

__shfl_down_sync() calculates a source lane ID by adding delta to the caller's lane ID. The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. As for __shfl_up_sync(), the ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged.

__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller's lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads is able to access elements from earlier groups of threads; however, if they attempt to access elements from later groups of threads, their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
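As an illustration only, the source-lane arithmetic described above can be modelled on the host in plain C++. The helper names (src_lane_idx, src_lane_up, src_lane_down, src_lane_xor, butterfly_sum) and the constant kWarpSize are hypothetical and not part of CUDA; the sketch assumes a warp size of 32, a width that is a power of 2 no larger than 32, and non-negative delta values.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // assumption: warpSize is 32

// First lane of the width-sized subsection that contains `lane`.
inline int subsection_base(int lane, int width) {
    assert(width > 0 && width <= kWarpSize && (width & (width - 1)) == 0);
    return lane & ~(width - 1);
}

// Model of __shfl_sync: srcLane is taken modulo width inside the caller's subsection.
inline int src_lane_idx(int lane, int srcLane, int width) {
    return subsection_base(lane, width) | (srcLane & (width - 1));
}

// Model of __shfl_up_sync: no wrap-around, so the lowest delta lanes read themselves.
inline int src_lane_up(int lane, int delta, int width) {
    int j = lane - delta;
    return (j < subsection_base(lane, width)) ? lane : j;
}

// Model of __shfl_down_sync: no wrap-around, so the highest delta lanes read themselves.
inline int src_lane_down(int lane, int delta, int width) {
    int last = subsection_base(lane, width) + width - 1;
    int j = lane + delta;
    return (j > last) ? lane : j;
}

// Model of __shfl_xor_sync: earlier groups are reachable, later groups are not.
inline int src_lane_xor(int lane, int laneMask, int width) {
    int last = subsection_base(lane, width) + width - 1;
    int j = lane ^ laneMask;
    return (j > last) ? lane : j;
}

// Butterfly (XOR) reduction over a full warp: every lane ends with the total.
inline std::array<int, kWarpSize> butterfly_sum(std::array<int, kWarpSize> v) {
    for (int m = kWarpSize / 2; m > 0; m /= 2) {
        std::array<int, kWarpSize> next = v;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            next[lane] = v[lane] + v[src_lane_xor(lane, m, kWarpSize)];
        }
        v = next;
    }
    return v;
}
```

The model only computes which lane a caller would read from; it does not capture the participation mask or the undefined results for inactive source lanes.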

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread's lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined

All following functions and types are defined in the namespace nvcuda::wmma. Sub-byte operations are considered preview, i.e., the data structures and APIs for them are subject to change and may not be compatible with future releases. This extra functionality is defined in the nvcuda::wmma::experimental namespace.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

fragment

An overloaded class containing a section of a matrix distributed across all threads in the warp. The mapping of matrix elements into fragment internal storage is unspecified and subject to change in future architectures

Only certain combinations of template arguments are allowed. The first template parameter specifies how the fragment will participate in the matrix operation. Acceptable values for Use are

matrix_a when the fragment is used as the first multiplicand, A,
matrix_b when the fragment is used as the second multiplicand, B, or
accumulator when the fragment is used as the source or destination accumulators (C or D, respectively)

The m, n and k sizes describe the shape of the warp-wide matrix tiles participating in the multiply-accumulate operation. The dimension of each tile depends on its role. For matrix_a the tile takes dimension m x k; for matrix_b the dimension is k x n, and accumulator tiles are m x n

The data type, T, may be double, float, __half, __nv_bfloat16, char, or unsigned char for multiplicands and double, float, int, or __half for accumulators. As documented in Element Types and Matrix Sizes, limited combinations of accumulator and multiplicand types are supported. The Layout parameter must be specified for matrix_a and matrix_b fragments. row_major or col_major indicate that elements within a matrix row or column are contiguous in memory, respectively. The Layout parameter for an accumulator matrix should retain the default value of void. A row or column layout is specified only when the accumulator is loaded or stored as described below

load_matrix_sync

Waits until all warp lanes have arrived at load_matrix_sync and then loads the matrix fragment a from memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or a multiple of 4 for float element type (i.e., a multiple of 16 bytes in both cases). If the fragment is an accumulator, the layout argument must be specified as either mem_row_major or mem_col_major. For matrix_a and matrix_b fragments, the layout is inferred from the fragment's layout parameter. The values of mptr, ldm, layout and all template parameters for a must be the same for all threads in the warp. This function must be called by all threads in the warp, or the result is undefined.

store_matrix_sync

Waits until all warp lanes have arrived at store_matrix_sync and then stores the matrix fragment a to memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or a multiple of 4 for float element type (i.e., a multiple of 16 bytes in both cases). The layout of the output matrix must be specified as either mem_row_major or mem_col_major. The values of mptr, ldm, layout and all template parameters for a must be the same for all threads in the warp.

fill_fragment

Fill a matrix fragment with a constant value v. Because the mapping of matrix elements to each fragment is unspecified, this function is ordinarily called by all threads in the warp with a common value for v

mma_sync

Waits until all warp lanes have arrived at mma_sync, and then performs the warp-synchronous matrix multiply-accumulate operation D=A*B+C. The in-place operation, C=A*B+C, is also supported. The value of satf and template parameters for each matrix fragment must be the same for all threads in the warp. Also, the template parameters m, n and k must match between fragments A, B, C and D. This function must be called by all threads in the warp, or the result is undefined
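To make the tile shapes and the meaning of ldm concrete, here is a minimal host-side reference for D = A*B + C in plain C++. It is a sketch, not the WMMA API: Layout, elem_index and mma_reference are hypothetical names, the accumulator is assumed to be stored row-major with a stride of n, and the tiny sizes in the usage note are chosen for readability rather than being shapes that Tensor Cores support.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for row_major / col_major.
enum class Layout { RowMajor, ColMajor };

// Linear index of element (row, col) given the leading dimension ldm in elements.
inline int elem_index(int row, int col, int ldm, Layout layout) {
    return layout == Layout::RowMajor ? row * ldm + col : col * ldm + row;
}

// Reference multiply-accumulate: A is m x k, B is k x n, C and D are m x n.
inline std::vector<float> mma_reference(const std::vector<float>& a, Layout layout_a, int lda,
                                        const std::vector<float>& b, Layout layout_b, int ldb,
                                        const std::vector<float>& c,
                                        int m, int n, int k) {
    assert(static_cast<int>(c.size()) == m * n);
    std::vector<float> d(c.size());
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = c[i * n + j];  // start from the source accumulator
            for (int p = 0; p < k; ++p) {
                acc += a[elem_index(i, p, lda, layout_a)] *
                       b[elem_index(p, j, ldb, layout_b)];
            }
            d[i * n + j] = acc;
        }
    }
    return d;
}
```

For example, with a 2 x 2 row-major A = {1, 2, 3, 4}, a column-major B = {5, 7, 6, 8} and C filled with ones, the result is {20, 23, 44, 51}.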

If satf (saturate to finite value) mode is true, the following additional numerical properties apply for the destination accumulator:

If an element result is +Infinity, the corresponding accumulator will contain +MAX_NORM
If an element result is -Infinity, the corresponding accumulator will contain -MAX_NORM
If an element result is NaN, the corresponding accumulator will contain +0
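The three saturation rules can be written as a small host-side function. This is only a sketch: saturate_to_finite is a hypothetical name, and it assumes a float accumulator whose MAX_NORM is the largest finite float.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical model of satf for one float accumulator element.
inline float saturate_to_finite(float result) {
    const float max_norm = std::numeric_limits<float>::max();  // assumed MAX_NORM
    if (std::isnan(result)) {
        return 0.0f;  // NaN becomes +0
    }
    if (std::isinf(result)) {
        return result > 0.0f ? max_norm : -max_norm;  // +/-Infinity clamps to +/-MAX_NORM
    }
    return result;  // finite values pass through unchanged
}
```

Finite results are untouched, so the mode only changes what happens on overflow or invalid operations.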

Because the map of matrix elements into each thread's fragment is unspecified, individual matrix elements must be accessed from memory (shared or global) after calling store_matrix_sync. In the special case where all threads in the warp will apply an element-wise operation uniformly to all fragment elements, direct element access can be implemented using the following fragment class members

As an example, the following code scales an accumulator matrix tile by half

Tensor Cores support alternate types of floating point operations on devices with compute capability 8.0 and higher.

__nv_bfloat16

This data format is an alternate fp16 format that has the same range as f32 but reduced precision (7 bits). You can use this data format directly with the __nv_bfloat16 type available in cuda_bf16.h. Matrix fragments with __nv_bfloat16 data types are required to be composed with accumulators of float type. The shapes and operations supported are the same as with __half.

tf32

This data format is a special floating point format supported by Tensor Cores, with the same range as f32 and reduced precision (>=10 bits). The internal layout of this format is implementation defined. In order to use this floating point format with WMMA operations, the input matrices must be manually converted to tf32 precision.

To facilitate conversion, a new intrinsic __float_to_tf32 is provided. While the input and output arguments to the intrinsic are of float type, the output will be tf32 numerically. This new precision is intended to be used with Tensor Cores only, and if mixed with other float type operations, the precision and range of the result will be undefined.
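The effect of "same range, reduced precision" can be illustrated numerically on the host. The sketch below is not __float_to_tf32: truncate_to_tf32_model is a hypothetical helper that assumes exactly 10 explicit mantissa bits and simply truncates the remaining 13 bits of an IEEE-754 float, whereas the real intrinsic's rounding and internal layout are implementation defined.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model: keep sign, 8 exponent bits and the top 10 mantissa bits.
inline float truncate_to_tf32_model(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // read the float's bit pattern
    bits &= 0xFFFFE000u;                   // clear the 13 low mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof out);  // write it back as a float
    return out;
}
```

Because the exponent field is untouched, large and small magnitudes survive; only increments finer than 2^-10 relative to the leading bit are lost.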

Once an input matrix (matrix_a or matrix_b) is converted to tf32 precision, the combination of a fragment with precision::tf32 precision and a data type of float passed to load_matrix_sync will take advantage of this new capability. Both accumulator fragments must have float data types. The only supported matrix size is 16x16x8 (m-n-k).

The elements of the fragment are represented as float, hence the mapping from element_type to storage_element_type is precision::tf32 -> float.

Sub-byte WMMA operations provide a way to access the low-precision capabilities of Tensor Cores. They are considered a preview feature, i.e., the data structures and APIs for them are subject to change and may not be compatible with future releases. This functionality is available via the nvcuda::wmma::experimental namespace.

For 4-bit precision, the APIs available remain the same, but you must specify experimental::precision::u4 or experimental::precision::s4 as the fragment data type. Since the elements of the fragment are packed together, num_storage_elements will be smaller than num_elements for that fragment. The num_elements variable for a sub-byte fragment hence returns the number of elements of sub-byte type element_type. This is true for single-bit precision as well, in which case the mapping from element_type to storage_element_type is as follows:

The only allowed layouts for sub-byte fragments are row_major for matrix_a and col_major for matrix_b.

For sub-byte operations the value of ldm in load_matrix_sync should be a multiple of 32 for element types experimental::precision::u4 and experimental::precision::s4, or a multiple of 128 for element type experimental::precision::b1 (i.e., a multiple of 16 bytes in both cases).
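The "16 bytes in both cases" remark is simple arithmetic on the element width, which the following sketch checks at compile time. stride_bytes and ldm_is_valid are hypothetical helper names; the bit widths (4 for u4/s4, 1 for b1, 16 for __half, 32 for float) follow from the type names.

```cpp
// Hypothetical helper: size in bytes of a stride of ldm elements.
constexpr int stride_bytes(int ldm_elements, int bits_per_element) {
    return ldm_elements * bits_per_element / 8;
}

static_assert(stride_bytes(32, 4) == 16, "32 four-bit elements span 16 bytes");
static_assert(stride_bytes(128, 1) == 16, "128 one-bit elements span 16 bytes");
static_assert(stride_bytes(8, 16) == 16, "8 __half elements span 16 bytes");
static_assert(stride_bytes(4, 32) == 16, "4 float elements span 16 bytes");

// True when ldm covers a whole number of 16-byte (128-bit) units.
constexpr bool ldm_is_valid(int ldm_elements, int bits_per_element) {
    return (ldm_elements * bits_per_element) % 128 == 0;
}
```

The same check therefore covers every element type mentioned so far: the stride in bits must be a multiple of 128.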

Note: Support for the following variants for MMA instructions is deprecated and will be removed in sm_90.

experimental::precision::u4
experimental::precision::s4
experimental::precision::b1 with bmmaBitOp set to bmmaBitOpXOR

bmma_sync

Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:

bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b

bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher

The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits
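One output element of the bit matrix multiply-accumulate can be modelled on the host as a population count over a 128-bit XOR or AND. The names Bits128, BitOp, popcount32 and bmma_element are hypothetical stand-ins used only for this sketch; the 128-bit row and column are represented as four 32-bit words.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bits128 = std::array<std::uint32_t, 4>;  // one 128-bit row or column

enum class BitOp { Xor, And };  // stand-ins for bmmaBitOpXOR / bmmaBitOpAND

// Portable population count (number of set bits).
inline int popcount32(std::uint32_t v) {
    int n = 0;
    while (v != 0) {
        v &= v - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// One element of D = (A op B) + C: combine row and column, count bits, add c.
inline int bmma_element(const Bits128& row, const Bits128& col, BitOp op, int c) {
    int acc = c;
    for (int w = 0; w < 4; ++w) {
        std::uint32_t combined =
            (op == BitOp::Xor) ? (row[w] ^ col[w]) : (row[w] & col[w]);
        acc += popcount32(combined);
    }
    return acc;
}
```

With XOR the count is the Hamming distance between the row and the column; with AND it is the number of positions where both have a set bit.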

The special format required by tensor cores may be different for each major and minor device architecture. This is further complicated by threads holding only a fragment (an opaque architecture-specific ABI data structure) of the overall matrix, with the developer not allowed to make assumptions on how the individual parameters are mapped to the registers participating in the matrix multiply-accumulate.

Since fragments are architecture-specific, it is unsafe to pass them from function A to function B if the functions have been compiled for different link-compatible architectures and linked together into the same device executable. In this case, the size and layout of the fragment will be specific to one architecture and using WMMA APIs in the other will lead to incorrect results or potentially, corruption

An example of two link-compatible architectures, where the layout of the fragment differs, is sm_70 and sm_75

This undefined behavior might also be undetectable at compilation time and by tools at runtime, so extra care is needed to make sure the layout of the fragments is consistent. This linking hazard is most likely to appear when linking with a legacy library that is both built for a different link-compatible architecture and expecting to be passed a WMMA fragment

Note that in the case of weak linkages (for example, a CUDA C++ inline function), the linker may choose any available function definition which may result in implicit passes between compilation units.

To avoid these sorts of problems, the matrix should always be stored out to memory for transit through external interfaces (e.g., wmma::store_matrix_sync(dst, …);) and then it can be safely passed to bar() as a pointer type (e.g., float *dst).

Note that since sm_70 can run on sm_75, the above example sm_75 code can be changed to sm_70 and correctly work on sm_75. However, it is recommended to have sm_75 native code in your application when linking with other sm_75 separately compiled binaries

Tensor Cores support a variety of element types and matrix sizes. The following table presents the various combinations of matrix_a, matrix_b and accumulator matrix supported

| Matrix A | Matrix B | Accumulator | Matrix Size (m-n-k) |
|---|---|---|---|
| __half | __half | float | 16x16x16 |
| __half | __half | float | 32x8x16 |
| __half | __half | float | 8x32x16 |
| __half | __half | __half | 16x16x16 |
| __half | __half | __half | 32x8x16 |
| __half | __half | __half | 8x32x16 |
| unsigned char | unsigned char | int | 16x16x16 |
| unsigned char | unsigned char | int | 32x8x16 |
| unsigned char | unsigned char | int | 8x32x16 |
| signed char | signed char | int | 16x16x16 |
| signed char | signed char | int | 32x8x16 |
| signed char | signed char | int | 8x32x16 |

Alternate Floating Point support

| Matrix A | Matrix B | Accumulator | Matrix Size (m-n-k) |
|---|---|---|---|
| __nv_bfloat16 | __nv_bfloat16 | float | 16x16x16 |
| __nv_bfloat16 | __nv_bfloat16 | float | 32x8x16 |
| __nv_bfloat16 | __nv_bfloat16 | float | 8x32x16 |
| precision::tf32 | precision::tf32 | float | 16x16x8 |

Double Precision Support

| Matrix A | Matrix B | Accumulator | Matrix Size (m-n-k) |
|---|---|---|---|
| double | double | double | 8x8x4 |

Experimental support for sub-byte operations

| Matrix A | Matrix B | Accumulator | Matrix Size (m-n-k) |
|---|---|---|---|
| precision::u4 | precision::u4 | int | 8x8x32 |
| precision::s4 | precision::s4 | int | 8x8x32 |
| precision::b1 | precision::b1 | int | 8x8x128 |

Without the arrive/wait barrier, synchronization is achieved using __syncthreads() (to synchronize all threads in a block) or group.sync() when using Cooperative Groups.

Threads are blocked at the synchronization point (block.sync()) until all threads have reached the synchronization point. In addition, memory updates that happened before the synchronization point are guaranteed to be visible to all threads in the block after the synchronization point, i.e., equivalent to atomic_thread_fence(memory_order_seq_cst, thread_scope_block) as well as the sync.

This pattern has three stages

Code before sync performs memory updates that will be read after the sync
Synchronization point
Code after sync point with visibility of memory updates that happened before sync point

The temporally-split synchronization pattern with the std::barrier is as follows

In this pattern, the synchronization point (block.sync()) is split into an arrive point (bar.arrive()) and a wait point (bar.wait(std::move(token))). A thread begins participating in a cuda::barrier with its first call to bar.arrive(). When a thread calls bar.wait(std::move(token)) it will be blocked until participating threads have completed bar.arrive() the expected number of times as specified by the expected arrival count argument passed to init(). Memory updates that happen before participating threads' call to bar.arrive() are guaranteed to be visible to participating threads after their call to bar.wait(std::move(token)). Note that the call to bar.arrive() does not block a thread; it can proceed with other work that does not depend upon memory updates that happen before other participating threads' call to bar.arrive().

The arrive and then wait pattern has five stages which may be iteratively repeated:

Code before arrive performs memory updates that will be read after the wait
Arrive point with implicit memory fence (i.e., equivalent to atomic_thread_fence(memory_order_seq_cst, thread_scope_block))
Code between arrive and wait
Wait point
Code after the wait, with visibility of updates that were performed before the arrive

Initialization must happen before any thread begins participating in a cuda::barrier.

Before any thread can participate in cuda::barrier, the barrier must be initialized using init() with an expected arrival count, block.size() in this example. Initialization must happen before any thread calls bar.arrive(). This poses a bootstrapping challenge in that threads must synchronize before participating in the cuda::barrier, but threads are creating a cuda::barrier in order to synchronize. In this example, threads that will participate are part of a cooperative group and use block.sync() to bootstrap initialization. In this example a whole thread block is participating in initialization, hence __syncthreads() could also be used.

The second parameter of init() is the expected arrival count, i.e., the number of times bar.arrive() will be called by participating threads before a participating thread is unblocked from its call to bar.wait(std::move(token)). In the prior example the cuda::barrier is initialized with the number of threads in the thread block, i.e., cooperative_groups::this_thread_block().size(), and all threads within the thread block participate in the barrier.

A cuda::barrier is flexible in specifying how threads participate (split arrive/wait) and which threads participate. In contrast, this_thread_block.sync() from cooperative groups or __syncthreads() applies to a whole thread block, and __syncwarp(mask) applies to a specified subset of a warp. If the intention of the user is to synchronize a full thread block or a full warp, we recommend using __syncthreads() and __syncwarp(mask) respectively for performance reasons.

A cuda::barrier counts down from the expected arrival count to zero as participating threads call bar.arrive(). When the countdown reaches zero, a cuda::barrier is complete for the current phase. When the last call to bar.arrive() causes the countdown to reach zero, the countdown is automatically and atomically reset. The reset assigns the countdown to the expected arrival count, and moves the cuda::barrier to the next phase.

A token object of class cuda::barrier::arrival_token, as returned from token=bar.arrive(), is associated with the current phase of the barrier. A call to bar.wait(std::move(token)) blocks the calling thread while the cuda::barrier is in the current phase, i.e., while the phase associated with the token matches the phase of the cuda::barrier. If the phase is advanced (because the countdown reaches zero) before the call to bar.wait(std::move(token)) then the thread does not block; if the phase is advanced while the thread is blocked in bar.wait(std::move(token)), the thread is unblocked.
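The countdown, automatic reset and phase matching described above can be traced with a small single-threaded model. This is a sketch for reasoning about the rules only: BarrierModel and would_block are hypothetical names, the token is modelled as a plain phase number, and nothing here reflects how cuda::barrier is actually implemented.

```cpp
#include <cassert>

// Hypothetical single-threaded model of the countdown and phase of a barrier.
struct BarrierModel {
    int expected;  // expected arrival count passed to init()
    int pending;   // arrivals still outstanding in the current phase
    int phase;     // current phase number

    explicit BarrierModel(int expected_count)
        : expected(expected_count), pending(expected_count), phase(0) {}

    // Models token = bar.arrive(): returns the phase this arrival belongs to.
    int arrive() {
        assert(pending > 0);
        int token = phase;
        if (--pending == 0) {    // the last arrival completes the phase
            pending = expected;  // automatic reset of the countdown
            ++phase;             // the barrier moves to the next phase
        }
        return token;
    }

    // Models whether bar.wait(std::move(token)) would still block.
    bool would_block(int token) const { return token == phase; }
};
```

For a barrier with an expected count of 3, a token from the first arrival blocks until the third arrival advances the phase; a token obtained afterwards belongs to phase 1 and blocks again.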

It is essential to know when a reset could or could not occur, especially in non-trivial arrive/wait synchronization patterns

A thread's calls to token=bar.arrive() and bar.wait(std::move(token)) must be sequenced such that token=bar.arrive() occurs during the cuda::barrier's current phase, and bar.wait(std::move(token)) occurs during the same or next phase
A thread's call to bar.arrive() must occur when the barrier's counter is non-zero. After barrier initialization, if a thread's call to bar.arrive() causes the countdown to reach zero then a call to bar.wait(std::move(token)) must happen before the barrier can be reused for a subsequent call to bar.arrive()
bar.wait() must only be called using a token object of the current phase or the immediately preceding phase. For any other values of the token object, the behavior is undefined

For simple arrive/wait synchronization patterns, compliance with these usage rules is straightforward

A thread block can be spatially partitioned such that warps are specialized to perform independent computations. Spatial partitioning is used in a producer or consumer pattern, where one subset of threads produces data that is concurrently consumed by the other [disjoint] subset of threads

A producer/consumer spatial partitioning pattern requires two one-sided synchronizations to manage a data buffer between the producer and consumer.

| Producer | Consumer |
|---|---|
| wait for buffer to be ready to be filled | signal buffer is ready to be filled |
| produce data and fill the buffer | |
| signal buffer is filled | wait for buffer to be filled |
| | consume data in filled buffer |

Producer threads wait for consumer threads to signal that the buffer is ready to be filled; however, consumer threads do not wait for this signal. Consumer threads wait for producer threads to signal that the buffer is filled; however, producer threads do not wait for this signal. For full producer/consumer concurrency this pattern has (at least) double buffering where each buffer requires two cuda::barriers.

In this example the first warp is specialized as the producer and the remaining warps are specialized as the consumer. All producer and consumer threads participate (call bar.arrive() or bar.arrive_and_wait()) in each of the four cuda::barriers so the expected arrival counts are equal to block.size().

A producer thread waits for the consumer threads to signal that the shared memory buffer can be filled. In order to wait for a cuda::barrier a producer thread must first arrive on that barrier with ready[i%2].arrive() to get a token and then call ready[i%2].wait(token) with that token. For simplicity ready[i%2].arrive_and_wait() combines these operations.

Producer threads compute and fill the ready buffer; they then signal that the buffer is filled by arriving on the filled barrier, filled[i%2].arrive(). A producer thread does not wait at this point; instead it waits until the next iteration's buffer (double buffering) is ready to be filled.

A consumer thread begins by signaling that both buffers are ready to be filled. A consumer thread does not wait at this point; instead it waits for this iteration's buffer to be filled, filled[i%2].arrive_and_wait(). After the consumer threads consume the buffer they signal that the buffer is ready to be filled again, ready[i%2].arrive(), and then wait for the next iteration's buffer to be filled.

When a thread that is participating in a sequence of synchronizations must exit early from that sequence, that thread must explicitly drop out of participation before exiting. The remaining participating threads can proceed normally with subsequent cuda::barrier arrive and wait operations.

This operation arrives on the cuda::barrier to fulfill the participating thread's obligation to arrive in the current phase, and then decrements the expected arrival count for the next phase so that this thread is no longer expected to arrive on the barrier.
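The bookkeeping behind dropping out can be followed with a single-threaded counter model. It is a sketch under stated assumptions: DropModel and its members are hypothetical names, the model tracks only the counts and the phase, and it ignores the case where every participant drops.

```cpp
#include <cassert>

// Hypothetical model of arriving versus arriving and dropping out.
struct DropModel {
    int expected;  // arrivals expected per phase
    int pending;   // arrivals still outstanding in the current phase
    int phase;     // current phase number

    explicit DropModel(int expected_count)
        : expected(expected_count), pending(expected_count), phase(0) {}

    // Ordinary arrival in the current phase.
    void arrive() {
        assert(pending > 0);
        if (--pending == 0) {
            pending = expected;  // reset to the (possibly reduced) expected count
            ++phase;
        }
    }

    // Arrive in the current phase and stop being expected in later phases.
    void arrive_and_drop() {
        assert(expected > 1);  // assumption: at least one participant remains
        --expected;            // takes effect when the countdown is next reset
        arrive();              // still fulfils this phase's obligation
    }
};
```

After one of three participants drops, the current phase still needs the other two arrivals, and every later phase completes after two arrivals instead of three.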

The CompletionFunction of cuda::barrier is executed once per phase, after the last thread arrives and before any thread is unblocked from the wait. Memory operations performed by the threads that arrived at the barrier during the phase are visible to the thread executing the CompletionFunction, and all memory operations performed within the CompletionFunction are visible to all threads waiting at the barrier once they are unblocked from the wait.

bar must be a pointer to __shared__ memory
expected_count
1.5n, the maximum absolute error is 5 x 10^-12
cyl_bessel_i0(x): 6 (full range)
cyl_bessel_i1(x): 6 (full range)
fmod(x,y): 0 (full range)
remainder(x,y): 0 (full range)
remquo(x,y,iptr): 0 (
Functions from this section can only be used in device code.
Among these functions are the less accurate, but faster versions of some of the functions of Standard Functions. They have the same name prefixed with __ (such as __sinf(x)). They are faster because they map to fewer native instructions. The compiler has an option (-use_fast_math) that forces each function in Table 9 to compile to its intrinsic counterpart. In addition to reducing the accuracy of the affected functions, it may also cause some differences in special case handling. A more robust approach is to selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.
Table 9. Functions Affected by -use_fast_math

| Operator/Function | Device Function |
|---|---|
| x/y | __fdividef(x,y) |
| sinf(x) | __sinf(x) |
| cosf(x) | __cosf(x) |
| tanf(x) | __tanf(x) |
| sincosf(x,sptr,cptr) | __sincosf(x,sptr,cptr) |
| logf(x) | __logf(x) |
| log2f(x) | __log2f(x) |
| log10f(x) | __log10f(x) |
| expf(x) | __expf(x) |
| exp10f(x) | __exp10f(x) |
| powf(x,y) | __powf(x,y) |

Single-Precision Floating-Point Functions
__fadd_[rn,rz,ru,rd]() and __fmul_[rn,rz,ru,rd]() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the '*' and '+' operators will frequently be combined into FMADs.
Functions suffixed with _rn operate using the round to nearest even rounding mode.
Functions suffixed with _rz operate using the round towards zero rounding mode.
Functions suffixed with _ru operate using the round up (to positive infinity) rounding mode.
Functions suffixed with _rd operate using the round down (to negative infinity) rounding mode.
The accuracy of floating-point division varies depending on whether the code is compiled with -prec-div=false or -prec-div=true. When the code is compiled with -prec-div=false, both the regular division / operator and __fdividef(x,y) have the same accuracy, but for 2126 2
__dsqrt_[rn,rz,ru,rd](x)
IEEE-compliant.
Requires compute capability > 2
Bảng sau liệt kê các tính năng ngôn ngữ mới đã được chấp nhận trong tiêu chuẩn C++11. Cột "Đề xuất" cung cấp liên kết đến đề xuất của ủy ban ISO C++ mô tả tính năng này, trong khi cột "Có sẵn trong nvcc [mã thiết bị]" cho biết phiên bản đầu tiên của nvcc có triển khai tính năng này [nếu nó đã được triển khai
Table 12. C++11 Language Features

| Language Feature | C++11 Proposal | Available in nvcc (device code) |
|---|---|---|
| Rvalue references | N2118 | 7.0 |
| Rvalue references for *this | N2439 | 7.0 |
| Initialization of class objects by rvalues | N1610 | 7.0 |
| Non-static data member initializers | N2756 | 7.0 |
| Variadic templates | N2242 | 7.0 |
| Extending variadic template template parameters | N2555 | 7.0 |
| Initializer lists | N2672 | 7.0 |
| Static assertions | N1720 | 7.0 |
| auto-typed variables | N1984 | 7.0 |
| Multi-declarator auto | N1737 | 7.0 |
| Removal of auto as a storage-class specifier | N2546 | 7.0 |
| New function declarator syntax | N2541 | 7.0 |
| Lambda expressions | N2927 | 7.0 |
| Declared type of an expression | N2343 | 7.0 |
| Incomplete return types | N3276 | 7.0 |
| Right angle brackets | N1757 | 7.0 |
| Default template arguments for function templates | DR226 | 7.0 |
| Solving the SFINAE problem for expressions | DR339 | 7.0 |
| Alias templates | N2258 | 7.0 |
| Extern templates | N1987 | 7.0 |
| Null pointer constant | N2431 | 7.0 |
| Strongly-typed enums | N2347 | 7.0 |
| Forward declarations for enums | N2764, DR1206 | 7.0 |
| Standardized attribute syntax | N2761 | 7.0 |
| Generalized constant expressions | N2235 | 7.0 |
| Alignment support | N2341 | 7.0 |
| Conditionally-supported behavior | N1627 | 7.0 |
| Changing undefined behavior into diagnosable errors | N1727 | 7.0 |
| Delegating constructors | N1986 | 7.0 |
| Inheriting constructors | N2540 | 7.0 |
| Explicit conversion operators | N2437 | 7.0 |
| New character types | N2249 | 7.0 |
| Unicode string literals | N2442 | 7.0 |
| Raw string literals | N2442 | 7.0 |
| Universal character names in literals | N2170 | 7.0 |
| User-defined literals | N2765 | 7.0 |
| Standard layout types | N2342 | 7.0 |
| Defaulted functions | N2346 | 7.0 |
| Deleted functions | N2346 | 7.0 |
| Extended friend declarations | N1791 | 7.0 |
| Extending sizeof | N2253, DR850 | 7.0 |
| Inline namespaces | N2535 | 7.0 |
| Unrestricted unions | N2544 | 7.0 |
| Local and unnamed types as template arguments | N2657 | 7.0 |
| Range-based for | N2930 | 7.0 |
| Explicit virtual overrides | N2928, N3206, N3272 | 7.0 |
| Minimal support for garbage collection and reachability-based leak detection | N2670 | N/A (see Restrictions) |
| Allowing move constructors to throw [noexcept] | N3050 | 7.0 |
| Defining move special member functions | N3053 | 7.0 |
| Concurrency | | |
| Sequence points | N2239 | |
| Atomic operations | N2427 | |
| Strong compare and exchange | N2748 | |
| Bidirectional fences | N2752 | |
| Memory model | N2429 | |
| Data-dependency ordering: atomics and memory model | N2664 | |
| Propagating exceptions | N2179 | |
| Allow atomics use in signal handlers | N2547 | |
| Thread-local storage | N2659 | |
| Dynamic initialization and destruction with concurrency | N2660 | |
| C99 features in C++11 | | |
| __func__ predefined identifier | N2340 | 7.0 |
| C99 preprocessor | N1653 | 7.0 |
| long long | N1811 | 7.0 |
| Extended integral types | N1988 | |
The following table lists new language features that have been accepted into the C++14 standard.

Table 13. C++14 Language Features

| Language Feature | C++14 Proposal | Available in nvcc (device code) |
|---|---|---|
| Tweak to certain C++ contextual conversions | N3323 | 9.0 |
| Binary literals | N3472 | 9.0 |
| Functions with deduced return type | N3638 | 9.0 |
| Generalized lambda capture (init-capture) | N3648 | 9.0 |
| Generic (polymorphic) lambda expressions | N3649 | 9.0 |
| Variable templates | N3651 | 9.0 |
| Relaxing requirements on constexpr functions | N3652 | 9.0 |
| Member initializers and aggregates | N3653 | 9.0 |
| Clarifying memory allocation | N3664 | |
| Sized deallocation | N3778 | |
| [[deprecated]] attribute | N3760 | 9.0 |
| Single-quotation-mark as a digit separator | N3781 | 9.0 |
1. The type signature of the following entities shall not depend on whether __CUDA_ARCH__ is defined, or on a particular value of __CUDA_ARCH__:
  - __global__ functions and function templates
  - __device__ and __constant__ variables
  - textures and surfaces
2. If a __global__ function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.
3. In separate compilation mode, the presence or absence of a definition of a function or variable with external linkage shall not depend on whether __CUDA_ARCH__ is defined or on a particular value of __CUDA_ARCH__.
4. In separate compilation, __CUDA_ARCH__ must not be used in headers in such a way that different objects could contain different behavior. Alternatively, it must be guaranteed that all objects are compiled for the same compute architecture. If a weak function or template function is defined in a header and its behavior depends on __CUDA_ARCH__, then the instances of that function in the objects could conflict if the objects are compiled for different compute architectures.
  For example, suppose a header a.h defines a function template getptr whose behavior depends on __CUDA_ARCH__, that a.cu and b.cu both include a.h and instantiate getptr for the same type, and that b.cu expects a non-NULL address. If the two files are compiled for different compute architectures, only one version of getptr is used at link time, so the behavior depends on which version is picked. To avoid this, either a.cu and b.cu must be compiled for the same compute architecture, or __CUDA_ARCH__ should not be used in the shared header function.
The compiler does not guarantee that a diagnostic will be generated for the unsupported uses of __CUDA_ARCH__ described above.
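As a minimal sketch of a usage that stays within these rules, the function below keeps the same signature in every compilation pass and lets only its body depend on __CUDA_ARCH__. The function name compiled_for_device and the fallback macros (which let the sketch build with a plain C++ compiler) are assumptions made for illustration.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption: nvcc defines
// __CUDACC__ and provides the execution space specifiers itself).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// The signature is identical in the host and device passes; only the body
// differs, so the type signature does not depend on __CUDA_ARCH__.
__host__ __device__ inline int compiled_for_device()
{
#ifdef __CUDA_ARCH__
    return 1;   // device compilation pass
#else
    return 0;   // host compilation pass
#endif
}

// By contrast, a declaration like the following would conflict with rule 1,
// because the parameter type would change with __CUDA_ARCH__ (commented out):
//
//   #ifdef __CUDA_ARCH__
//   __global__ void kernel(float *data);
//   #else
//   __global__ void kernel(double *data);
//   #endif
```

When this function lives in a shared header under separate compilation, rule 4 still applies: all objects that include it should be compiled for the same compute architecture.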
The __device__, __shared__, __managed__ and __constant__ memory space specifiers are not allowed on:
- class, struct, and union data members,
- formal parameters,
- non-extern variable declarations within a function that executes on the host.
The __device__, __constant__ and __managed__ memory space specifiers are not allowed on variable declarations that are neither extern nor static within a function that executes on the device.
A __device__, __constant__, __managed__ or __shared__ variable definition cannot have a class type with a non-empty constructor or a non-empty destructor. A constructor for a class type is considered empty at a point in the translation unit if it is either a trivial constructor or it satisfies all of the following conditions:
- The constructor function has been defined.
- The constructor function has no parameters, the initializer list is empty and the function body is an empty compound statement.
- Its class has no virtual functions, no virtual base classes and no non-static data member initializers.
- The default constructors of all base classes of its class can be considered empty.
- For all the non-static data members of its class that are of class type (or array thereof), the default constructors can be considered empty.
A destructor for a class is considered empty at a point in the translation unit if it is either a trivial destructor or it satisfies all of the following conditions:
- The destructor function has been defined.
- The destructor function body is an empty compound statement.
- Its class has no virtual functions and no virtual base classes.
- The destructors of all base classes of its class can be considered empty.
- For all the non-static data members of its class that are of class type (or array thereof), the destructor can be considered empty.
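The difference between a trivial, an empty, and a non-empty constructor can be sketched with three hypothetical types. The type names and the commented-out variable declarations are illustrative assumptions, and the code itself is plain C++ so that it builds without nvcc.

```cpp
#include <cassert>
#include <type_traits>

// Trivial constructor: trivially satisfies the "empty" requirement.
struct TrivialPoint {
    int x;
    int y;
};

// User-provided constructor that is still "empty": no parameters, empty
// initializer list, empty body, no virtual functions or virtual bases.
struct EmptyCtor {
    int value;
    EmptyCtor() {}
};

// Non-empty constructor: the initializer list is not empty.
struct NonEmptyCtor {
    int value;
    NonEmptyCtor() : value(42) {}
};

// Under nvcc one would expect the conditions above to give
// (hypothetical variable names):
//   __device__ TrivialPoint d_point;   // accepted: trivial constructor
//   __device__ EmptyCtor    d_empty;   // accepted: empty constructor
//   __device__ NonEmptyCtor d_bad;     // rejected: non-empty constructor
```

The standard type trait std::is_trivially_default_constructible reports true for TrivialPoint and false for EmptyCtor, which shows that "empty" in the sense above is a broader category than "trivial".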
When compiling in the whole program compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, __managed__ and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.
When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, __managed__ and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable).
Variables marked with the __managed__ memory space specifier ("managed" variables) have the following restrictions:
- The address of a managed variable is not a constant expression.
- A managed variable shall not have a const-qualified type.
- A managed variable shall not have a reference type.
- The address or value of a managed variable shall not be used when the CUDA runtime may not be in a valid state, including the following cases:
  - In static/dynamic initialization or destruction of an object with static or thread-local storage duration.
  - In code that executes after exit() has been called (for example, a function marked with gcc's "__attribute__((destructor))").
  - In code that executes when the CUDA runtime may not be initialized (for example, a function marked with gcc's "__attribute__((constructor))").
- A managed variable cannot be used as an unparenthesized id-expression argument to a decltype() expression.
- Managed variables have the same coherence and consistency behavior as specified for dynamically allocated managed memory.
- When a CUDA program containing managed variables is run on an execution platform with multiple GPUs, the variables are allocated only once, and not per GPU.
- A managed variable declaration without extern linkage is not allowed within a function that executes on the host.
- A managed variable declaration without extern or static linkage is not allowed within a function that executes on the device.
Let F denote a function that is either implicitly-declared or is explicitly-defaulted on its first declaration. The execution space specifiers (__host__, __device__) for F are the union of the execution space specifiers of all the functions that invoke it (note that a ...).
For example, if the implicitly-declared constructor "Derived::Derived" is invoked only from a __device__ function "foo", it is treated as a __device__ function. If the implicitly-declared constructor "Other::Other" is invoked both from a __device__ function "foo" and from a __host__ function "bar", it is treated as a __host__ __device__ function.
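The Derived and Other case described above might look like the following sketch. The class bodies, the Base class, and the return values are assumptions made for illustration; the fallback macros only exist so the code also builds with a plain C++ compiler.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

class Base {
    int x;
public:
    __host__ __device__ Base() : x(10) {}
    __host__ __device__ int get() const { return x; }
};

// No user-declared constructors: Derived::Derived and Other::Other are
// implicitly declared, so their execution spaces come from their callers.
class Derived : public Base { };
class Other : public Base { };

__device__ int foo()
{
    Derived d;   // Derived::Derived is invoked only from a __device__ function
    Other o;     // Other::Other is invoked from a __device__ function ...
    return d.get() + o.get();
}

__host__ int bar()
{
    Other o;     // ... and from a __host__ function, hence __host__ __device__
    return o.get();
}
```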
Additionally, if F is a virtual destructor, then the execution spaces of each virtual destructor D overridden by F are added to the set of execution spaces for F, if D is either not implicitly defined or is explicitly defaulted on a declaration other than its first declaration.
When a __global__ function is launched from device code, each argument must be trivially copyable and trivially destructible.
When a __global__ function is launched from host code, each argument type is allowed to be non-trivially copyable or non-trivially destructible, but the processing for such types does not follow the standard C++ model, as described below. User code must ensure that this workflow does not affect program correctness. The workflow diverges from standard C++ in two areas:
1. Memcpy instead of copy constructor invocation
  When lowering a __global__ function launch from host code, the compiler generates stub functions that copy the parameters one or more times by value, before eventually using memcpy to copy the arguments to the __global__ function's parameter memory on the device. This occurs even if an argument is non-trivially copyable, and therefore may break programs where the copy constructor has side effects.
2. Destructor may be invoked before the __global__ function has finished
  Kernel launches are asynchronous with host execution. As a result, if a __global__ function argument has a non-trivial destructor, the destructor may execute in host code even before the __global__ function has finished execution. This may break programs where the destructor has side effects.
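The first divergence can be illustrated with a host-only analogy: copying an object's bytes with memcpy never runs its copy constructor, so any side effect in that constructor is skipped. The type Counted and the helper copy_into_param_buffer are hypothetical, and the buffer merely stands in for kernel parameter memory; this is not the stub code that the compiler actually generates.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical argument type whose copy constructor has a side effect.
struct Counted {
    static int copies;   // number of copy-constructor calls so far
    int payload;
    explicit Counted(int p) : payload(p) {}
    Counted(const Counted &other) : payload(other.payload) { ++copies; }
};
int Counted::copies = 0;

// Host-only stand-in for the final step of a launch: the argument bytes are
// copied into a parameter buffer with memcpy, so no copy constructor runs.
// Returns how many copy-constructor calls happened during the copy.
inline int copy_into_param_buffer(const Counted &arg, unsigned char *buffer)
{
    int before = Counted::copies;
    std::memcpy(buffer, &arg, sizeof(Counted));
    return Counted::copies - before;
}
```

A program that relies on Counted::copies to track every copy of an argument would therefore miscount copies made this way, which is the kind of breakage the text warns about.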
Variable memory space specifiers are allowed in the declaration of a static variable V within the immediate or nested block scope of a function F where:
- F is a __global__ or __device__-only function.
- F is a __host__ __device__ function and __CUDA_ARCH__ is defined.
If no explicit memory space specifier is present in the declaration of V, an implicit __device__ specifier is assumed during device compilation.
V has the same initialization restrictions as a variable with the same memory space specifiers declared in namespace scope; for example, a __device__ variable cannot have a "non-empty" constructor (see Device Memory Space Specifiers).
When a function in a derived class overrides a virtual function in a base class, the execution space specifiers (i.e., __host__, __device__) on the overridden and overriding functions must match.
It is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions.
If an object is created in host code, invoking a virtual function for that object in device code has undefined behavior.
If an object is created in device code, invoking a virtual function for that object in host code has undefined behavior.
See Windows-Specific for additional constraints when using the Microsoft host compiler.
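The matching requirement for overriding functions can be sketched as follows. The Shape and Triangle types and the count_sides helper are hypothetical, and the fallback macros are an assumption that lets the code build with a plain C++ compiler.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct Shape {
    __host__ __device__ virtual int sides() const { return 0; }
    __host__ __device__ virtual ~Shape() {}
};

struct Triangle : Shape {
    // Same execution space specifiers as Shape::sides, as the rule requires.
    __host__ __device__ int sides() const override { return 3; }
};

// Declaring the override with only __host__ would not match the base class
// function and would violate the rule (left commented out):
//   struct Square : Shape { __host__ int sides() const override { return 4; } };

__host__ __device__ inline int count_sides(const Shape &s)
{
    return s.sides();
}
```

Even with matching specifiers, the object must be created and used on the same side: creating a Triangle in host code and calling sides() on it in device code remains undefined behavior.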
The CUDA compiler follows the IA64 ABI for class layout, while the Microsoft host compiler does not. Let T denote a pointer to member type, or a class type that satisfies any of the following conditions:
- T has virtual functions.
- T has a virtual base class.
- T has multiple inheritance with more than one direct or indirect empty base class.
- All direct and indirect base classes B of T are empty and the type of the first field F of T uses B in its definition, such that B is laid out at offset 0 in the definition of F.
Let C denote T or a class type that has T as a field type or as a base class type. The CUDA compiler may compute the class layout and size differently than the Microsoft host compiler for the type C.
As long as the type C is used exclusively in host code or exclusively in device code, the program should work correctly.
Passing an object of type C between host and device code has undefined behavior, for example, as an argument to a __global__ function or through cudaMemcpy*() calls.
Accessing an object of type C or any subobject in device code, or invoking a member function in device code, has undefined behavior if the object is created in host code.
Accessing an object of type C or any subobject in host code, or invoking a member function in host code, has undefined behavior if the object is created in device code.
A type or template cannot be used in the type, non-type or template template argument of a __global__ function template instantiation or a __device__/__constant__ variable instantiation if either:
- The type or template is defined within a __host__ or __host__ __device__ function.
- The type or template is a class member with private or protected access and its parent class is not defined within a __device__ or __global__ function.
- The type is unnamed.
- The type is compounded from any of the types above.
Let 'V' denote a namespace scope variable or a class static member variable that has a const-qualified type and does not have execution space annotations (for example, __device__, __constant__, __shared__). V is considered to be a host code variable.
The value of V may be directly used in device code, if
- V has been initialized with a constant expression before the point of use,
- the type of V is not volatile-qualified, and
- it has one of the following types:
  - builtin floating point type, except when the Microsoft compiler is used as the host compiler,
  - builtin integral type.
Device source code cannot contain a reference to V or take the address of V.
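A minimal sketch of the permitted use: two const-qualified integral variables at namespace scope whose values are read directly inside a function that may be compiled for the device. The names kMinItems, kMaxItems and clamp_items are hypothetical, and the fallback macros are an assumption that lets the code build without nvcc.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Host code variables: const-qualified, no execution space annotation,
// builtin integral type, initialized with constant expressions.
const int kMinItems = 2;
const int kMaxItems = 10;

__host__ __device__ inline int clamp_items(int n)
{
    // In device code, `const int *p = &kMaxItems;` or
    // `const int &r = kMaxItems;` would not be allowed, because device
    // source code cannot take the address of V or contain a reference to it.
    if (n < kMinItems) return kMinItems;   // value of V used directly
    if (n > kMaxItems) return kMaxItems;   // value of V used directly
    return n;
}
```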
nvcc supports the use of the deprecated attribute when using the gcc, clang, xlC, icc or pgcc host compilers, and the use of the deprecated declspec when using the cl.exe host compiler. It also supports the [[deprecated]] standard attribute when the C++14 dialect has been enabled. The CUDA frontend compiler will generate a deprecation diagnostic for a reference to a deprecated entity from within the body of a __device__, __global__ or __host__ __device__ function when __CUDA_ARCH__ is defined (i.e., during the device compilation phase). Other references to deprecated entities will be handled by the host compiler, e.g., a reference from within a __host__ function.
The CUDA frontend compiler does not support the #pragma gcc diagnostic or #pragma warning mechanisms supported by various host compilers. Therefore, deprecation diagnostics generated by the CUDA frontend compiler are not affected by these pragmas, but diagnostics generated by the host compiler will be affected. To suppress the warning for device code, the user can use the NVIDIA-specific pragma #pragma nv_diag_suppress. The nvcc flag -Wno-deprecated-declarations can be used to suppress all deprecation warnings, and the flag -Werror=deprecated-declarations can be used to turn deprecation warnings into errors.
The execution space specifiers for all member functions of the closure class associated with a lambda expression are derived by the compiler as follows. As described in the C++11 standard, the compiler creates a closure type in the smallest block scope, class scope or namespace scope that contains the lambda expression. The innermost function scope enclosing the closure type is computed, and the corresponding function's execution space specifiers are assigned to the closure class member functions. If there is no enclosing function scope, the execution space specifier is __host__.
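The derivation rule can be sketched with three lambdas, each annotated in a comment with the execution space its closure class members would receive. All names are hypothetical, and the fallback macros are an assumption that lets the code build with a plain C++ compiler.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// No enclosing function scope: closure members are __host__.
auto namespace_scope_lambda = [] { return 1; };

// The innermost enclosing function is __host__ __device__, so the closure
// class member functions are __host__ __device__ as well.
__host__ __device__ inline int doubled(int x)
{
    auto twice = [x] { return 2 * x; };
    return twice();
}

// The enclosing function is an unannotated (hence __host__) function, so
// the closure class member functions are __host__.
inline int host_only_sum(int a, int b)
{
    auto add = [a, b] { return a + b; };
    return add();
}
```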
The closure type of a lambda expression cannot be used in the type or non-type argument of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function.
Let 'V' denote a namespace scope variable or a class static member variable that has been marked constexpr and that does not have execution space annotations (e.g., __device__, __constant__, __shared__). V is considered to be a host code variable.
If V is of scalar type other than long double and the type is not volatile-qualified, the value of V can be directly used in device code. In addition, if V is of a non-scalar type, then scalar elements of V can be used inside a constexpr __device__ or __host__ __device__ function, if the call to the function is a constant expression. Device source code cannot contain a reference to V or take the address of V.
For an input CUDA translation unit, the CUDA compiler may invoke the host compiler to compile the host code within the translation unit. In the code passed to the host compiler, the CUDA compiler will inject additional compiler-generated code if the input CUDA translation unit contains a definition of any of the following entities:
- __global__ function or function template instantiation
- __device__, __constant__ variables
- variables with surface or texture type
The compiler-generated code contains a reference to the defined entity. If the entity is defined within an inline namespace and another entity of the same name and type signature is defined in an enclosing namespace, this reference may be considered ambiguous by the host compiler and host compilation will fail.
This limitation can be avoided by using unique names for such entities defined within an inline namespace.
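The workaround can be sketched as below: the clashing pattern is shown only in comments, and the live code gives the two variables distinct names. The identifiers Gvar, GvarOuter, GvarInner and N1 are hypothetical, and the fallback macro is an assumption that lets the code build without nvcc.

```cpp
#include <cassert>

// Fallback so the sketch also builds without nvcc (assumption).
#ifndef __CUDACC__
#define __device__
#endif

// Problematic pattern (commented out): the same name appears in the
// enclosing namespace and in the inline namespace, so a compiler-generated
// reference to "Gvar" could be ambiguous for the host compiler.
//
//   __device__ int Gvar;
//   inline namespace N1 {
//       __device__ int Gvar;
//   }

// Workaround: give the two entities unique names.
__device__ int GvarOuter = 1;
inline namespace N1 {
    __device__ int GvarInner = 2;
}
```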
If the closure type associated with a lambda expression is used in a template argument of a __global__ function template instantiation, the lambda expression must either be defined in the immediate or nested block scope of a __device__ or __global__ function, or must be an extended lambda.
A __global__ function or function template cannot be declared as constexpr.
A __global__ function or function template cannot have a parameter of type std::initializer_list or va_list.
A __global__ function cannot have a parameter of rvalue reference type.
A variadic __global__ function template has the following restrictions:
- Only a single pack parameter is allowed.
- The pack parameter must be listed last in the template parameter list.
Execution space specifiers on a function that is explicitly-defaulted on its first declaration are ignored by the CUDA compiler. Instead, the CUDA compiler will infer the execution space specifiers as described in Implicitly-declared and explicitly-defaulted functions.
Execution space specifiers are not ignored if the function is explicitly-defaulted, but not on its first declaration.
A __global__ function cannot have a deduced return type.
If a __device__ function has a deduced return type, the CUDA frontend compiler will change the function declaration to have a void return type before invoking the host compiler. This may cause issues for introspecting the deduced return type of the __device__ function in host code. Thus, the CUDA compiler will issue compile-time errors for referencing such a deduced return type outside device function bodies, except if the reference is absent when __CUDA_ARCH__ is undefined.
The polymorphic function wrapper class template nvstd::function is provided in the nvfunctional header. Instances of this class template can be used to store, copy and invoke any callable target, e.g., lambda expressions. nvstd::function can be used in both host and device code.
Instances of nvstd::function in host code cannot be initialized with the address of a __device__ function or with a functor whose operator() is a __device__ function. Instances of nvstd::function in device code cannot be initialized with the address of a __host__ function or with a functor whose operator() is a __host__ function.
nvstd::function instances cannot be passed from host code to device code (and vice versa) at run time. nvstd::function cannot be used in the parameter type of a __global__ function, if the __global__ function is launched from host code.
nvstd::function is defined in the nvfunctional header.
The nvcc flag '--extended-lambda' allows explicit execution space annotations in a lambda expression [30]. The execution space annotations must appear after the 'lambda-introducer' and before the optional 'lambda-declarator'. nvcc defines the macro __CUDACC_EXTENDED_LAMBDA__ when the '--extended-lambda' flag has been specified.
An 'extended __device__ lambda' is a lambda expression that is annotated explicitly with '__device__' and is defined within the immediate or nested block scope of a __host__ or __host__ __device__ function.
An 'extended __host__ __device__ lambda' is a lambda expression that is annotated explicitly with both '__host__' and '__device__' and is defined within the immediate or nested block scope of a __host__ or __host__ __device__ function.
An 'extended lambda' denotes either an extended __device__ lambda or an extended __host__ __device__ lambda. Extended lambdas can be used in the type arguments of __global__ function template instantiations.
If the execution space annotations are not explicitly specified, they are computed based on the scopes enclosing the closure class associated with the lambda, as described in the section on C++11 support. The execution space annotations are applied to all methods of the closure class associated with the lambda.
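As a minimal sketch of where the annotations go, the snippet below places `__host__ __device__` between the lambda-introducer and the parameter list. The fallback macro definitions and the names `apply_twice` and `add_offset_twice` are illustrative assumptions, not part of CUDA. The macros only let the same file build with a plain host compiler; with nvcc the '--extended-lambda' flag would be needed instead.

```cpp
#include <cassert>

// Fallback so this sketch also builds without nvcc (an assumption made for
// illustration; under nvcc these keywords are provided by the compiler).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: applies a callable twice.
template <typename F>
__host__ __device__ int apply_twice(F f, int x) { return f(f(x)); }

// The enclosing function is named and has an explicit return type.
int add_offset_twice(int start, int offset) {
    // The annotation sits after the lambda-introducer [=] and before the
    // lambda-declarator (int v); offset is captured by value.
    auto add = [=] __host__ __device__ (int v) { return v + offset; };
    return apply_twice(add, start);
}
```

Calling `add_offset_twice(1, 2)` applies the lambda twice on the host. Passing the same closure to a `__global__` function template is the case the extended-lambda machinery exists for, and it is not shown here.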
The compiler provides type traits to detect closure types for extended lambdas at compile time:
__nv_is_extended_device_lambda_closure_type(type): If 'type' is the closure class created for an extended __device__ lambda, then the trait is true, otherwise it is false.
__nv_is_extended_host_device_lambda_closure_type(type): If 'type' is the closure class created for an extended __host__ __device__ lambda, then the trait is true, otherwise it is false.
These traits can be used in all compilation modes, irrespective of whether lambdas or extended lambdas are enabled [31].
The CUDA compiler replaces an extended lambda expression with an instance of a placeholder type defined in namespace scope before invoking the host compiler. The template argument of the placeholder type requires taking the address of a function enclosing the original extended lambda expression. This is required for the correct execution of any __global__ function template whose template argument involves the closure type of an extended lambda. The enclosing function is computed as follows.
By definition, the extended lambda is present within the immediate or nested block scope of a __host__ or __host__ __device__ function. If this function is not the operator() of a lambda expression, then it is considered the enclosing function for the extended lambda. Otherwise, the extended lambda is defined within the immediate or nested block scope of the operator() of one or more enclosing lambda expressions. If the outermost such lambda expression is defined in the immediate or nested block scope of a function F, then F is the computed enclosing function; otherwise the enclosing function does not exist.
Extended lambdas are subject to the following restrictions:
1. An extended lambda cannot be defined inside another extended lambda expression.
2. An extended lambda cannot be defined inside a generic lambda expression.
3. If an extended lambda is defined within the immediate or nested block scope of one or more nested lambda expressions, the outermost such lambda expression must be defined inside the immediate or nested block scope of a function.
4. The enclosing function for the extended lambda must be named and its address can be taken. If the enclosing function is a class member, then the following conditions must be satisfied:
  - All classes enclosing the member function must have a name.
  - The member function must not have private or protected access within its parent class.
  - All enclosing classes must not have private or protected access within their respective parent classes.
5. It must be possible to take the address of the enclosing routine unambiguously, at the point where the extended lambda has been defined. This may not be feasible in some cases, e.g., when a class typedef shadows a template type argument of the same name.
6. An extended lambda cannot be defined in a class that is local to a function.
7. The enclosing function for an extended lambda cannot have a deduced return type.
8. __host__ __device__ extended lambdas cannot be generic lambdas.
9. If the enclosing function is an instantiation of a function template or a member function template, and/or the function is a member of a class template, the template(s) must satisfy the following constraints:
  - The template must have at most one variadic parameter, and it must be listed last in the template parameter list.
  - The template parameters must be named.
  - The template instantiation argument types cannot involve types that are either local to a function (except for closure types for extended lambdas), or are private or protected class members.
10. With Visual Studio host compilers, the enclosing function must have external linkage. The restriction exists because this host compiler does not support using the address of functions without external linkage as template arguments, which the CUDA compiler transformations need in order to support extended lambdas.
11. With Visual Studio host compilers, an extended lambda shall not be defined within the body of an 'if-constexpr' block.
12. An extended lambda has the following restrictions on captured variables:
  - In the code sent to the host compiler, the variable may be passed by value to a sequence of helper functions before being used to direct-initialize the field of the class type used to represent the closure type for the extended lambda [32].
  - A variable can only be captured by value.
  - A variable of array type cannot be captured if the number of array dimensions is greater than 7.
  - For a variable of array type, in the code sent to the host compiler, the closure type's array field is first default-initialized, and then each element of the array field is copy-assigned from the corresponding element of the captured array variable. Therefore, the array element type must be default-constructible and copy-assignable in host code.
  - A function parameter that is an element of a variadic argument pack cannot be captured.
  - The type of the captured variable cannot involve types that are either local to a function (except for closure types of extended lambdas), or are private or protected class members.
  - For a __host__ __device__ extended lambda, the types used in the return or parameter types of the lambda expression's operator() cannot involve types that are either local to a function (except for closure types of extended lambdas), or are private or protected class members.
  - Init-capture is not supported for __host__ __device__ extended lambdas. Init-capture is supported for __device__ extended lambdas, except when the init-capture is of array type or of type std::initializer_list.
  - The function call operator for an extended lambda is not constexpr. The closure type for an extended lambda is not a literal type. The constexpr specifier cannot be used in the declaration of an extended lambda.
  - A variable cannot be implicitly captured inside an if-constexpr block lexically nested inside an extended lambda, unless it has already been implicitly captured earlier outside the if-constexpr block or appears in the explicit capture list for the extended lambda.
13. When parsing a function, the CUDA compiler assigns a counter value to each extended lambda within that function. This counter value is used in the substituted named type passed to the host compiler. Hence, whether or not an extended lambda is defined within a function should not depend on a particular value of __CUDA_ARCH__, or on __CUDA_ARCH__ being undefined.
14. As described above, the CUDA compiler replaces a __device__ extended lambda defined in a host function with a placeholder type defined in namespace scope. This placeholder type does not define an operator() function equivalent to the original lambda declaration. An attempt to determine the return type or parameter types of the operator() function may therefore work incorrectly in host code, because the code processed by the host compiler is semantically different from the input code processed by the CUDA compiler. However, it is fine to introspect the return type or parameter types of the operator() function within device code. Note that this restriction does not apply to __host__ __device__ extended lambdas.
15. If the functor object represented by an extended lambda is passed from host to device code (e.g., as the argument of a __global__ function), then any expression in the body of the lambda expression that captures variables must remain unchanged irrespective of whether the __CUDA_ARCH__ macro is defined, and whether the macro has a particular value. This restriction arises because the lambda's closure class layout depends on the order in which captured variables are encountered when the compiler processes the lambda expression.
16. As described previously, the CUDA compiler replaces an extended __device__ lambda expression with an instance of a placeholder type in the code sent to the host compiler. This placeholder type does not define a pointer-to-function conversion operator in host code; however, the conversion operator is provided in device code. Note that this restriction does not apply to __host__ __device__ extended lambdas.
17. As described previously, the CUDA compiler replaces an extended __device__ or __host__ __device__ lambda expression with an instance of a placeholder type in the code sent to the host compiler. This placeholder type may define C++ special member functions (e.g., constructor, destructor). As a result, some standard C++ type traits may return different results for the closure type of the extended lambda in the CUDA frontend compiler versus the host compiler. The following type traits are affected: std::is_trivially_copyable, std::is_trivially_constructible, std::is_trivially_copy_constructible, std::is_trivially_move_constructible, std::is_trivially_destructible.
  Care must be taken that the results of these type traits are not used in __global__ function template instantiation or in __device__ / __constant__ / __managed__ variable template instantiation.
The CUDA compiler will generate compiler diagnostics for a subset of the cases described in 1-12.
When a lambda is defined within a non-static class member function, and the body of the lambda refers to a class member variable, C++11/C++14 rules require that the this pointer of the class is captured by value. If the lambda is an extended __device__ or __host__ __device__ lambda defined in a host function, and the lambda is executed on the GPU, accessing the referenced member variable on the GPU will cause a run time error if the this pointer points to host memory.
C++17 solves this problem by adding a new "*this" capture mode. In this mode, the compiler makes a copy of the object denoted by "*this" instead of capturing the pointer this by value. The "*this" capture mode is described in more detail here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0018r3.html
The CUDA compiler supports the "*this" capture mode for lambdas defined within __device__ and __global__ functions and for extended __device__ lambdas defined in host code, when the --extended-lambda nvcc flag is used.
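The difference between the two capture modes can be seen in plain host C++17, without any CUDA annotations. The sketch below is only an illustration: the class `Counter` and the helper functions are hypothetical names, and nothing here runs on a GPU. It shows that a `[this]` closure reads the live object through a pointer, while a `[*this]` closure carries its own copy, which is what makes the copy safe to use where the original host object is not reachable.

```cpp
#include <cassert>

// Hypothetical class used only for this host-side illustration.
struct Counter {
    int value;

    // [this] captures the pointer: the closure reads the live object.
    int read_via_pointer_capture() {
        auto f = [this] { return value; };
        value += 1;     // visible through the captured this pointer
        return f();
    }

    // [*this] (C++17) copies the whole object into the closure.
    int read_via_copy_capture() {
        auto f = [*this] { return value; };
        value += 1;     // does not affect the copy held by f
        return f();
    }
};

int demo_pointer_capture() { Counter c{1}; return c.read_via_pointer_capture(); }
int demo_copy_capture()    { Counter c{1}; return c.read_via_copy_capture(); }
```

Starting from a value of 1, the pointer-capturing closure sees the incremented member, while the copying closure still sees the value from the moment of capture.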
The "*this" capture mode is not allowed for unannotated lambdas defined in host code, or for extended __host__ __device__ lambdas.
1. ADL Lookup: As described earlier, the CUDA compiler replaces an extended lambda expression with an instance of a placeholder type before invoking the host compiler. One template argument of the placeholder type uses the address of the function enclosing the original lambda expression. This may cause additional namespaces to participate in argument-dependent lookup (ADL) for any host function call whose argument types involve the closure type of the extended lambda expression. This may result in an incorrect function being selected by the host compiler.
  For example, suppose an extended lambda is replaced with a placeholder type that involves namespace N1, and the lambda is passed to N2::doit, whose body calls foo(in) unqualified. Namespace N1 then participates in the ADL lookup for foo(in), and host compilation fails because multiple overload candidates, N1::foo and N2::foo, are found.
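The mechanism can be reproduced in plain C++ with an ordinary class template standing in for the placeholder type. This is a sketch under that assumption: `Placeholder`, `Tag`, and `call_through_adl` are hypothetical names, and no extended lambda is involved. To keep the snippet compilable, only N1 declares `foo` here, so the unqualified call succeeds and resolves to N1::foo purely through ADL.

```cpp
#include <cassert>

namespace N1 {
struct Tag {};                                   // a type tied to namespace N1
template <typename T> int foo(T) { return 1; }
}  // namespace N1

namespace N2 {
// Stand-in for the placeholder type: its template argument makes N1 an
// associated namespace of every Placeholder<N1::Tag> object.
template <typename T> struct Placeholder {};

// Unqualified call: candidates come from ordinary lookup and, via ADL,
// from the namespaces associated with the argument type.
template <typename T> int doit(T in) { return foo(in); }
}  // namespace N2

int call_through_adl() {
    N2::Placeholder<N1::Tag> p;
    return N2::doit(p);   // resolves to N1::foo, found only through ADL
}
```

If N2 also declared a `template <typename T> int foo(T)`, the call inside `doit` would see both candidates and become ambiguous, which is the failure described above.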
In this filtering mode, which is only available for floating-point textures, the value returned by the texture fetch is
- tex(x) = (1−α)T[i] + αT[i+1] for a one-dimensional texture,
- tex(x,y) = (1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1] for a two-dimensional texture,
- tex(x,y,z) =
  (1−α)(1−β)(1−γ)T[i,j,k] + α(1−β)(1−γ)T[i+1,j,k] +
  (1−α)β(1−γ)T[i,j+1,k] + αβ(1−γ)T[i+1,j+1,k] +
  (1−α)(1−β)γT[i,j,k+1] + α(1−β)γT[i+1,j,k+1] +
  (1−α)βγT[i,j+1,k+1] + αβγT[i+1,j+1,k+1]
  for a three-dimensional texture,
where
- i = floor(xB), α = frac(xB), xB = x − 0.5,
- j = floor(yB), β = frac(yB), yB = y − 0.5,
- k = floor(zB), γ = frac(zB), zB = z − 0.5.
α, β, and γ are stored in 9-bit fixed-point format with 8 bits of fractional value (so 1.0 is exactly represented).
Figure 19 illustrates linear filtering of a one-dimensional texture with N=4.
Figure 19. Linear Filtering Mode. Linear filtering of a one-dimensional texture of four texels in clamp addressing mode.
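The one-dimensional formula can be checked on the host with a few lines of C++. This is a sketch, not the hardware path: it assumes unnormalized coordinates and clamp addressing, it keeps α in full float precision rather than the 9-bit fixed-point format, and the function names `texel_clamped` and `tex1d_linear` are made up for the illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reads texel i with clamp addressing: out-of-range indices map to the
// nearest valid texel. T is assumed to be non-empty.
float texel_clamped(const std::vector<float>& T, long i) {
    const long last = static_cast<long>(T.size()) - 1;
    if (i < 0) i = 0;
    if (i > last) i = last;
    return T[static_cast<std::size_t>(i)];
}

// tex(x) = (1 - alpha) * T[i] + alpha * T[i + 1]
// with xB = x - 0.5, i = floor(xB), alpha = frac(xB).
float tex1d_linear(const std::vector<float>& T, float x) {
    const float xB = x - 0.5f;
    const float fl = std::floor(xB);
    const long i = static_cast<long>(fl);
    const float alpha = xB - fl;   // fractional part, in [0, 1)
    return (1.0f - alpha) * texel_clamped(T, i) + alpha * texel_clamped(T, i + 1);
}
```

With the four texels {1, 2, 3, 4}, sampling at x = 1.0 falls halfway between the first two texel centers, and sampling at x = 0.0 or x = 4.0 returns the edge texel because of clamping.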

Table 14. Feature Support per Compute Capability
Compute capability columns: 3.5, 3.7, 5.0, 5.2 | 5.3 | 6.x | 7.x | 8.x | 9.0 (unlisted features are supported for all compute capabilities)
- Atomic functions operating on 32-bit integer values in global memory (Atomic Functions): Yes
- Atomic functions operating on 32-bit integer values in shared memory (Atomic Functions): Yes
- Atomic functions operating on 64-bit integer values in global memory (Atomic Functions): Yes
- Atomic functions ... addition, subtraction, multiplication, comparison, warp shuffle functions, conversion: No, Yes
- Bfloat16-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion: No, Yes
- Tensor Cores: No, Yes
- Mixed Precision Warp-Matrix Functions (Warp Matrix Functions): No, Yes
- Hardware-accelerated memcpy_async (Asynchronous Data Copies using cuda::pipeline): No, Yes
- Hardware-accelerated Split Arrive/Wait Barrier (Asynchronous Barrier): No, Yes
- L2 Cache Residency Management (Device Memory L2 Access Management): No, Yes
- DPX Instructions for Accelerated Dynamic Programming: No, Yes
- Distributed Shared Memory: No, Yes
- Thread Block Cluster: No, Yes
- Tensor Memory Accelerator (TMA) unit: No, Yes
Note that the KB and K units used in the following table correspond to 1024 bytes (i.e., a KiB) and 1024 respectively.
Table 15. Technical Specifications per Compute Capability
Compute capability columns: 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, 9.0
- Maximum number of resident grids per device (Concurrent Kernel Execution): 32, 16, 128, 32, 16, 128, 16, 128
- Maximum dimensionality of a grid of thread blocks: 3
- Maximum x-dimension of a grid of thread blocks: 2^31-1
- Maximum y- or z-dimension of a grid of thread blocks: 65535
- Maximum dimensionality of a thread block: 3
- Maximum x- or
All compute devices follow the IEEE 754-2008 standard for binary floating-point arithmetic with the following deviations:
- There is no dynamically configurable rounding mode.
- There is no mechanism for detecting that a floating-point exception has occurred, and all operations behave as if the IEEE-754 exceptions are always masked, delivering the masked response as defined by IEEE-754 if there is an exceptional event. For the same reason, while SNaN encodings are supported, they are not signaling and are handled as quiet.
- The result of a single-precision floating-point operation involving one or more input NaNs is the quiet NaN of bit pattern 0x7fffffff.
- Double-precision floating-point absolute value and negation are not compliant with IEEE-754 with respect to NaNs.
Code must be compiled with -ftz=false, -prec-div=true, and -prec-sqrt=true to ensure IEEE compliance (this is the default setting; see the nvcc user manual for a description of these compilation flags).
Regardless of the setting of the compiler flag -ftz,
- atomic single-precision floating-point adds on global memory always operate in flush-to-zero mode, i.e., behave equivalent to FADD.F32.FTZ.RN,
- atomic single-precision floating-point adds on shared memory always operate with denormal support, i.e., behave equivalent to FADD.F32.RN.
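To make the difference between the two modes concrete, here is a small host-side model in C++. It is an assumption-laden sketch: `flush_to_zero`, `add_ftz`, and `add_denormal` are hypothetical helpers that model the idea of treating subnormal inputs and results as zero, and they are not the FADD instructions themselves.

```cpp
#include <cassert>
#include <cmath>

// Replaces a subnormal (denormal) value by a zero of the same sign.
float flush_to_zero(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Host-side model of a flush-to-zero add: subnormal inputs and a subnormal
// result are both treated as zero.
float add_ftz(float a, float b) {
    return flush_to_zero(flush_to_zero(a) + flush_to_zero(b));
}

// Plain IEEE add with denormal support, for comparison.
float add_denormal(float a, float b) { return a + b; }
```

The value 1e-40f is below the smallest normal float, so the flush-to-zero model returns zero for it while the plain add keeps the tiny nonzero value.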
In accordance with the IEEE-754R standard, if one of the input parameters to fminf(), fmin(), fmaxf(), or fmax() is NaN, but not the other, the result is the non-NaN parameter.
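The rule can be written out explicitly in host C++. The helper `min_skipping_nan` below is a hypothetical name used for illustration. The standard library's `std::fmin` follows the same convention, so it can serve as a cross-check on the host.

```cpp
#include <cassert>
#include <cmath>

// If exactly one argument is NaN, return the other one; otherwise return
// the smaller value. If both are NaN, the result is NaN.
float min_skipping_nan(float a, float b) {
    if (std::isnan(a)) return b;
    if (std::isnan(b)) return a;
    return a < b ? a : b;
}
```

A practical consequence is that a single NaN in one operand does not propagate through a min or max reduction, unlike through addition or multiplication.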

The conversion of a floating-point value to an integer value in the case where the floating-point value falls outside the range of the integer format is left undefined by IEEE-754. For compute devices, the behavior is to clamp to the end of the supported range. This is unlike the x86 architecture behavior.
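A portable host-side function with the same clamping behavior might look like the sketch below. The name `to_int32_clamped` is hypothetical, truncation toward zero for in-range values follows the ordinary C++ cast, and mapping NaN to 0 is an assumption of this sketch, since the text above does not say what happens to NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Converts a float to a 32-bit signed integer, clamping out-of-range
// values to the ends of the representable range.
std::int32_t to_int32_clamped(float f) {
    if (std::isnan(f)) return 0;   // assumption: NaN maps to 0
    if (f >= 2147483648.0f) return std::numeric_limits<std::int32_t>::max();
    if (f <= -2147483648.0f) return std::numeric_limits<std::int32_t>::min();
    return static_cast<std::int32_t>(f);   // in range: truncates toward zero
}
```

Writing the range checks explicitly avoids relying on a plain cast, whose result for out-of-range values is undefined in C++ and differs between architectures.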
The behavior of integer division by zero and integer overflow is left undefined by IEEE-754. For compute devices, there is no mechanism for detecting that such integer operation exceptions have occurred. Integer division by zero yields an unspecified, machine-specific value.
https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus includes more information on the floating-point accuracy and compliance of NVIDIA GPUs.
On devices of compute capability 3.x, an SM consists of:
- 192 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
When an SM is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any.
An SM has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.
There is an L1 cache for each SM and an L2 cache shared by all SMs. The L1 cache is used to cache accesses to local memory, including temporary register spills. The L2 cache is used to cache accesses to local and global memory. The cache behavior (e.g., whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory in both L1 and L2 via compiler options.
The same on-chip memory is used for both L1 and shared memory. It can be configured as 48 KB of shared memory and 16 KB of L1 cache, as 16 KB of shared memory and 48 KB of L1 cache, or as 32 KB of shared memory and 32 KB of L1 cache, using cudaFuncSetCacheConfig()/cuFuncSetCacheConfig():
```
// Device code
__global__ void MyKernel()
{
    // Kernel body
}

// Host code
int main()
{
    // Runtime API
    // cudaFuncCachePreferShared: shared memory is 48 KB
    // cudaFuncCachePreferEqual:  shared memory is 32 KB
    // cudaFuncCachePreferL1:     shared memory is 16 KB
    // cudaFuncCachePreferNone:   no preference
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
    return 0;
}
```
The default cache configuration is "prefer none," meaning "no preference." If a kernel is configured to have no preference, then it will default to the preference of the current thread/context, which is set using cudaDeviceSetCacheConfig()/cuCtxSetCacheConfig() (see the reference manual for details). If the current thread/context also has no preference (which is again the default setting), then whichever cache configuration was most recently used for any kernel will be the one that is used, unless a different cache configuration is required to launch the kernel (e.g., due to shared memory requirements). The initial configuration is 48 KB of shared memory and 16 KB of L1 cache.
Note: Devices of compute capability 3.7 add an additional 64 KB of shared memory to each of the above configurations, yielding 112 KB, 96 KB, and 80 KB of shared memory per SM, respectively. However, the maximum shared memory per thread block remains 48 KB.
Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration). The maximum L2 cache size is 1.5 MB.
Each SM has a read-only data cache of 48 KB to speed up reads from device memory. It accesses this cache either directly (for devices of compute capability 3.5 or 3.7) or via a texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory. When accessed via the texture unit, the read-only data cache is also referred to as texture cache.
Global memory accesses for devices of compute capability 3.x are cached in L2 and, for devices of compute capability 3.5 or 3.7, may also be cached in the read-only data cache described in the previous section. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory accesses in L1 via the -Xptxas -dlcm=ca option to nvcc.
A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.
If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:
- Two memory requests, one for each half-warp, if the size is 8 bytes,
- Four memory requests, one for each quarter-warp, if the size is 16 bytes.
Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.
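The over-fetch argument and the request splitting above can be checked with a small host-side model. The sketch assumes naturally aligned 4-byte words and charges one transaction per touched segment, which is a simplification of the real memory system; all function names (`segments_touched`, `bytes_transferred`, `strided_warp`, `requests_per_warp`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr int kWarpSize = 32;

// Number of distinct aligned segments of `segment_bytes` touched by the
// byte addresses of one warp-wide access.
inline std::size_t segments_touched(const std::vector<std::uint64_t>& addresses,
                                    std::uint64_t segment_bytes) {
    assert(segment_bytes != 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t address : addresses) segments.insert(address / segment_bytes);
    return segments.size();
}

// Bytes moved when every touched segment costs one full transaction.
inline std::uint64_t bytes_transferred(const std::vector<std::uint64_t>& addresses,
                                       std::uint64_t segment_bytes) {
    return segments_touched(addresses, segment_bytes) * segment_bytes;
}

// Addresses of a warp in which lane k reads a 4-byte word at base + k * stride.
inline std::vector<std::uint64_t> strided_warp(std::uint64_t base,
                                               std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addresses(kWarpSize);
    for (int lane = 0; lane < kWarpSize; ++lane)
        addresses[lane] = base + static_cast<std::uint64_t>(lane) * stride_bytes;
    return addresses;
}

// Number of 128-byte memory requests a warp access is split into,
// given the per-thread word size in bytes.
inline int requests_per_warp(int word_bytes) {
    assert(word_bytes > 0 && word_bytes <= 16);
    return word_bytes <= 4 ? 1 : word_bytes * kWarpSize / 128;
}
```

With a 512-byte stride every lane lands in its own segment, so the model moves 32 x 128 bytes with 128-byte transactions but only 32 x 32 bytes with 32-byte transactions, while a unit-stride access moves 128 bytes either way.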
Note that threads can access any words in any order, including the same words.
If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
Figure 21 shows some examples of global memory accesses and corresponding memory transactions.
Figure 21. Examples of Global Memory Accesses. Examples of global memory accesses by a warp, 4-byte word per thread, and associated memory transactions for compute capability 3.x and beyond.

On devices of compute capability 3.x, shared memory has 32 banks with two addressing modes that are described below.
The addressing mode can be queried using cudaDeviceGetSharedMemConfig() and set using cudaDeviceSetSharedMemConfig() (see the reference manual for more details). Each bank has a bandwidth of 64 bits per clock cycle.
Figure 22 shows some examples of strided access.
Figure 23 shows some examples of memory read accesses that involve the broadcast mechanism.
64-Bit Mode
Successive 64-bit words map to successive banks.
A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 64-bit word (even though the addresses of the two sub-words fall in the same bank). In that case, for read accesses, the 64-bit word is broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined).
32-Bit Mode
Successive 32-bit words map to successive banks.
A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank). In that case, for read accesses, the 32-bit words are broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined).
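The pairing rule for 32-bit mode is easy to misread, so the following host-side sketch encodes it as a predicate on two 32-bit word indices. It is a literal transcription of the rule as stated above, not a hardware model; `bank_of_word` and `conflict_free_32bit_mode` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBanks = 32;

// Bank of a 32-bit word index: successive words map to successive banks.
inline std::uint32_t bank_of_word(std::uint32_t word_index) {
    return word_index % kBanks;
}

// True when two threads accessing 32-bit words i and j do not conflict
// according to the 32-bit mode rule.
inline bool conflict_free_32bit_mode(std::uint32_t i, std::uint32_t j) {
    if (bank_of_word(i) != bank_of_word(j)) return true;  // different banks
    if (i == j) return true;                              // same word
    const std::uint32_t lo = i < j ? i : j;
    const std::uint32_t hi = i < j ? j : i;
    // Same bank, different words: allowed only for the pair (lo, lo + 32)
    // inside one 64-word aligned segment.
    return hi == lo + 32 && lo / 64 == hi / 64;
}

// Words 0 and 32 share a segment; words 32 and 64 straddle two segments.
inline void check_pairing_rule() {
    assert(conflict_free_32bit_mode(0, 32));
    assert(!conflict_free_32bit_mode(32, 64));
}
```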
On devices of compute capability 5.x, an SM consists of:
- 128 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
When an SM is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
An SM has:
- a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
- a unified L1/texture cache of 24 KB used to cache reads from global memory,
- 64 KB of shared memory for devices of compute capability 5.0 or 96 KB of shared memory for devices of compute capability 5.2.
The unified L1/texture cache is also used by the texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory.
There is also an L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills. Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration).
The cache behavior (e.g., whether reads are cached in both the unified L1/texture cache and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load instruction.
Global memory accesses are always cached in L2, and caching in L2 behaves in the same way as for devices of compute capability 3.x (see Global Memory).
Data that is read-only for the entire lifetime of the kernel can also be cached in the unified L1/texture cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:
- Perform the read using inline assembly with the appropriate modifier as described in the PTX reference manual;
- Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;
- Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.
When caching is enabled using one of the three mechanisms listed above, devices of compute capability 5.2 will cache global memory reads in the unified L1/texture cache for all kernel launches except for the kernel launches for which thread blocks consume too much of the SM's register file. These exceptions are reported by the profiler.
On devices of compute capability 5.x, shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank). In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).
Figure 22 shows some examples of strided access.
Figure 23 shows some examples of memory read accesses that involve the broadcast mechanism.
Figure 22. Strided Shared Memory Accesses. Examples for devices of compute capability 3.x (in 32-bit mode) or compute capability 5.x and 6.x.

Left: Linear addressing with a stride of one 32-bit word (no bank conflict). Middle: Linear addressing with a stride of two 32-bit words (two-way bank conflict). Right: Linear addressing with a stride of three 32-bit words (no bank conflict).
Figure 23. Irregular Shared Memory Accesses. Examples for devices of compute capability 3.x, 5.x, or 6.x.

Left: Conflict-free access via random permutation. Middle: Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5. Right: Conflict-free broadcast access (threads access the same word within a bank).
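The stride and broadcast cases in Figures 22 and 23 can be reproduced with a short host-side model of the 32-bank, 32-bit-word layout. The sketch below computes the degree of the worst bank conflict for one warp-wide request; `conflict_degree` and `strided_indices` are hypothetical names, and the model covers only the bank mapping described above, not timing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;

// Degree of the worst bank conflict: the largest number of distinct 32-bit
// words that fall in the same bank. Threads that touch the same word are
// served by one broadcast and therefore count once.
inline int conflict_degree(const std::vector<int>& word_index_per_thread) {
    std::vector<std::set<int>> words_in_bank(kNumBanks);
    for (int word : word_index_per_thread) {
        assert(word >= 0);
        words_in_bank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}

// Word indices for linear addressing: thread t accesses word t * stride.
inline std::vector<int> strided_indices(int stride) {
    std::vector<int> indices(kWarpSize);
    for (int t = 0; t < kWarpSize; ++t) indices[t] = t * stride;
    return indices;
}
```

A degree of 1 means the request is conflict-free. Strides of one and three words give degree 1, a stride of two gives degree 2, and a warp in which every thread reads the same word also gives degree 1, matching the figure captions.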
On devices of compute capability 6.x, an SM consists of:
- 64 (compute capability 6.0) or 128 (6.1 and 6.2) CUDA cores for arithmetic operations,
- 16 (6.0) or 32 (6.1 and 6.2) special function units for single-precision floating-point transcendental functions,
- 2 (6.0) or 4 (6.1 and 6.2) warp schedulers.
When an SM is given warps to execute, it first distributes them among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
An SM has:
- a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
- a unified L1/texture cache for reads from global memory of size 24 KB (6.0 and 6.2) or 48 KB (6.1),
- a shared memory of size 64 KB (6.0 and 6.2) or 96 KB (6.1).
The unified L1/texture cache is also used by the texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory.
There is also an L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills. Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration).
The cache behavior (e.g., whether reads are cached in both the unified L1/texture cache and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load instruction.
On devices of compute capability 7.x, an SM consists of:
- 64 FP32 cores for single-precision arithmetic operations,
- 32 FP64 cores for double-precision arithmetic operations,
- 64 INT32 cores for integer math,
- 8 mixed-precision Tensor Cores for deep learning matrix arithmetic,
- 16 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
An SM has:
- a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
- a unified data cache and shared memory with a total size of 128 KB (Volta) or 96 KB (Turing).
Shared memory is partitioned out of the unified data cache and can be configured to various sizes (see Shared Memory). The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.
The Volta architecture introduces Independent Thread Scheduling among threads in a warp, enabling intra-warp synchronization patterns previously unavailable and simplifying code changes when porting CPU code. However, this can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about the warp-synchronicity of previous hardware architectures.
Below are code patterns of concern and suggested corrective actions for Volta-safe code.
1. For applications using warp intrinsics (__shfl*, __any, __all, __ballot), it is necessary that developers port their code to the new, safe, synchronizing counterparts, with the *_sync suffix. The new warp intrinsics take a mask of threads that explicitly defines which lanes (threads of a warp) must participate in the warp intrinsic. See Warp Vote Functions and Warp Shuffle Functions for details.
  Since the intrinsics are available with CUDA 9.0+, code can (if necessary) be executed conditionally with the following preprocessor macro:
```
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
// *_sync intrinsic
#endif
```
  These intrinsics are available on all architectures, not just Volta or Turing, and in most cases a single code base will suffice for all architectures. Note, however, that for Pascal and earlier architectures, all threads in the mask must execute the same warp intrinsic instruction in convergence, and the union of all values in the mask must be equal to the warp's active mask. The following code pattern is valid on Volta, but not on Pascal or earlier architectures:
```
__global__ void swap_halves(float* data)
{
    int tid = threadIdx.x;
    float val = data[tid];
    float swapped;
    if (tid % warpSize < 16) {
        // First half-warp: full mask although only 16 lanes are converged here
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        // Second half-warp: same intrinsic issued from the other branch
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    data[tid] = swapped;
}
```
  The replacement for __ballot(1) is __activemask(). Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path. The following invalid code example sets bit i of output to 1 when data[i] is greater than threshold. __activemask() is used in an attempt to enable cases where dataLen is not a multiple of 32:
```
// Sets bit i of output[] to 1 if data[i] is greater than threshold,
// using the 32 threads of a warp. INVALID: do not use this pattern.
__device__ void pack_thresholds_invalid(const float* data, int dataLen,
                                        float threshold, unsigned* output)
{
    int warpLane = threadIdx.x % warpSize;
    for (int i = warpLane; i < dataLen; i += warpSize) {
        unsigned active = __activemask();
        unsigned bitPack = __ballot_sync(active, data[i] > threshold);
        if (warpLane == 0) {
            output[i / 32] = bitPack;
        }
    }
}
```
  This code is invalid because CUDA does not guarantee that the warp will diverge ONLY at the loop condition. When divergence happens for other reasons, conflicting results will be computed for the same 32-bit output element by different subsets of threads in the warp. Correct code might use a non-divergent loop condition together with __ballot_sync() to safely enumerate the set of threads in the warp participating in the threshold calculation, as follows:
```
__device__ void pack_thresholds(const float* data, int dataLen,
                                float threshold, unsigned* output)
{
    int warpLane = threadIdx.x % warpSize;
    // i - warpLane is identical for all lanes, so the loop does not diverge
    for (int i = warpLane; i - warpLane < dataLen; i += warpSize) {
        unsigned active = __ballot_sync(0xFFFFFFFF, i < dataLen);
        if (i < dataLen) {
            unsigned bitPack = __ballot_sync(active, data[i] > threshold);
            if (warpLane == 0) {
                output[i / 32] = bitPack;
            }
        }
    }
}
```
  Discovery Pattern demonstrates a valid use case for __activemask().
2. If applications have warp-synchronous code, they will need to insert the new __syncwarp() warp-wide barrier synchronization instruction between any steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid:
```
#define BLOCK_SIZE 256  // power of two, at least 64

__global__ void block_sum(const float* input, float* output)
{
    __shared__ float s_buff[BLOCK_SIZE];
    int tid = threadIdx.x;
    s_buff[tid] = input[tid];
    __syncthreads();

    // Inter-warp reduction
    for (int i = BLOCK_SIZE / 2; i >= 32; i /= 2) {
        if (tid < i) {
            s_buff[tid] += s_buff[tid + i];
        }
        __syncthreads();
    }

    // Intra-warp reduction
    // Butterfly reduction simplifies the __syncwarp() mask
    if (tid < 32) {
        float temp;
        temp = s_buff[tid ^ 16]; __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 8];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 4];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 2];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
    }

    if (tid == 0) {
        *output = s_buff[0] + s_buff[1];
    }
    __syncthreads();
}
```
3. Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and the PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.
The racecheck and synccheck tools provided by cuda-memcheck can aid in locating violations of points 2 and 3.
To aid migration while implementing the above-mentioned corrective actions, developers can opt in to the Pascal scheduling model that does not support independent thread scheduling. See Application Compatibility for details.
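The ballot example under point 1 hinges on whether every lane of a warp takes the same number of loop trips. The following host-side C++ sketch counts trips per lane for two loop shapes, `for (i = lane; i < dataLen; i += 32)` and `for (i = lane; i - lane < dataLen; i += 32)`. Both loop forms and all function names are assumptions chosen for illustration, and the sketch models only the trip counts, not warp scheduling.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

// Trips taken by one lane when the loop condition depends on the lane.
inline int trips_divergent(int lane, int data_len) {
    int trips = 0;
    for (int i = lane; i < data_len; i += kWarpSize) ++trips;
    return trips;
}

// Trips taken by one lane when the loop condition is the same for all lanes.
inline int trips_uniform(int lane, int data_len) {
    int trips = 0;
    for (int i = lane; i - lane < data_len; i += kWarpSize) ++trips;
    return trips;
}

// True when every lane of the warp takes the same number of trips.
template <typename TripFn>
bool all_lanes_agree(TripFn trips, int data_len) {
    assert(data_len >= 0);
    for (int lane = 1; lane < kWarpSize; ++lane)
        if (trips(lane, data_len) != trips(0, data_len)) return false;
    return true;
}
```

For a length of 40, lanes 0 to 7 take two trips and lanes 8 to 31 take one in the first form, so the warp necessarily diverges at the loop condition. In the second form every lane takes two trips, and the in-loop bounds check decides which lanes do work.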
Similar to the Kepler architecture, the amount of the unified data cache reserved for shared memory is configurable on a per-kernel basis. For the Volta architecture (compute capability 7.0), the unified data cache has a size of 128 KB, and the shared memory capacity can be set to 0, 8, 16, 32, 64 or 96 KB. For the Turing architecture (compute capability 7.5), the unified data cache has a size of 96 KB, and the shared memory capacity can be set to either 32 KB or 64 KB. Unlike Kepler, the driver automatically configures the shared memory capacity for each kernel to avoid shared memory occupancy bottlenecks while also allowing concurrent execution with already launched kernels where possible. In most cases, the driver's default behavior should provide optimal performance.
Because the driver is not always aware of the full workload, it is sometimes useful for applications to provide additional hints regarding the desired shared memory configuration. For example, a kernel with little or no shared memory use may request a larger carveout in order to encourage concurrent execution with later kernels that require more shared memory. The new cudaFuncSetAttribute() API allows applications to set a preferred shared memory capacity, or carveout, as a percentage of the maximum supported shared memory capacity (96 KB for Volta, and 64 KB for Turing).
cudaFuncSetAttribute() relaxes enforcement of the preferred shared capacity compared to the legacy cudaFuncSetCacheConfig() API introduced with Kepler. The legacy API treated shared memory capacities as hard requirements for kernel launch. As a result, interleaving kernels with different shared memory configurations would needlessly serialize launches behind shared memory reconfigurations. With the new API, the carveout is treated as a hint. The driver may choose a different configuration if required to execute the function or to avoid thrashing.
```
#define BLOCK_DIM 256

// Device code
__global__ void MyKernel(float* data)
{
    __shared__ float buffer[BLOCK_DIM];
    buffer[threadIdx.x] = data[threadIdx.x];
    // ...
}

// Host code
void launch(float* data, dim3 numBlocks)
{
    int carveout = 50; // prefer shared memory capacity 50% of maximum
    // Named carveout values:
    // carveout = cudaSharedmemCarveoutDefault;   //  (-1)
    // carveout = cudaSharedmemCarveoutMaxL1;     //   (0)
    // carveout = cudaSharedmemCarveoutMaxShared; // (100)
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
    MyKernel<<<numBlocks, BLOCK_DIM>>>(data);
}
```
In addition to an integer percentage, several convenience enums are provided as listed in the code comments above. Where a chosen integer percentage does not map exactly to a supported capacity (SM 7.0 devices support shared capacities of 0, 8, 16, 32, 64, or 96 KB), the next larger capacity is used. For instance, in the example above, 50% of the 96 KB maximum is 48 KB, which is not a supported shared memory capacity. Thus, the preference is rounded up to 64 KB.
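The rounding rule can be written down as a small host-side function. This is a sketch of the arithmetic described above, not of the driver, which treats the carveout as a hint and may pick a different configuration; `carveout_to_capacity_kb` is a hypothetical name, and the capacity list is passed in by the caller.

```cpp
#include <cassert>
#include <vector>

// Rounds a carveout percentage up to the next supported capacity in KB.
// `supported_kb` must be sorted in ascending order; its last entry is the
// maximum supported capacity that the percentage refers to.
inline int carveout_to_capacity_kb(int percent, const std::vector<int>& supported_kb) {
    assert(percent >= 0 && percent <= 100);
    assert(!supported_kb.empty());
    const int max_kb = supported_kb.back();
    // Compare capacity * 100 with percent * max_kb to stay in integers:
    // the requested size is percent * max_kb / 100 KB.
    for (int capacity : supported_kb)
        if (capacity * 100 >= percent * max_kb) return capacity;
    return max_kb;
}
```

With the compute capability 7.0 list {0, 8, 16, 32, 64, 96}, a request of 50 asks for 48 KB and is rounded up to 64 KB, as in the text.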
Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, so they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows:
```
// Device code
__global__ void MyKernel(float* data)
{
    extern __shared__ float buffer[];
    buffer[threadIdx.x] = data[threadIdx.x];
    // ...
}

// Host code
void launch(float* data, dim3 numBlocks, dim3 threadsPerBlock)
{
    int maxbytes = 98304; // 96 KB
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
    MyKernel<<<numBlocks, threadsPerBlock, maxbytes>>>(data);
}
```
Otherwise, shared memory behaves the same way as for devices of compute capability 5.x (see Shared Memory).
On devices of compute capability 8.x, a Streaming Multiprocessor (SM) consists of:
- 64 FP32 cores for single-precision arithmetic operations in devices of compute capability 8.0 and 128 FP32 cores in devices of compute capability 8.6, 8.7 and 8.9,
- 32 FP64 cores for double-precision arithmetic operations in devices of compute capability 8.0 and 2 FP64 cores in devices of compute capability 8.6, 8.7 and 8.9,
- 64 INT32 cores for integer math,
- 4 mixed-precision third-generation Tensor Cores supporting half-precision (fp16), __nv_bfloat16, tf32, sub-byte and double-precision (fp64) matrix arithmetic for compute capabilities 8.0, 8.6 and 8.7 (see Warp Matrix Functions for details),
- 4 mixed-precision fourth-generation Tensor Cores supporting fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute capability 8.9 (see Warp Matrix Functions for details),
- 16 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
An SM has:
- a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
- a unified data cache and shared memory with a total size of 192 KB for devices of compute capability 8.0 and 8.7 (1.5x Volta's 128 KB capacity) and 128 KB for devices of compute capabilities 8.6 and 8.9.
Shared memory is partitioned out of the unified data cache and can be configured to various sizes (see the Shared Memory section). The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.
Similar to the Volta architecture, the amount of the unified data cache reserved for shared memory is configurable on a per-kernel basis. For the NVIDIA Ampere GPU architecture, the unified data cache has a size of 192 KB for devices of compute capability 8.0 and 128 KB for devices of compute capabilities 8.6 and 8.9. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132 or 164 KB for devices of compute capability 8.0, and to 0, 8, 16, 32, 64 or 100 KB for devices of compute capabilities 8.6 and 8.9.
An application can set the carveout, i.e., the preferred shared memory capacity, with cudaFuncSetAttribute():
```
// carveout: an integer percentage or one of the named carveout values
cudaFuncSetAttribute(kernel_name, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
```
The API can specify the carveout either as an integer percentage of the maximum supported shared memory capacity (164 KB for devices of compute capability 8.0 and 100 KB for devices of compute capabilities 8.6 and 8.9), or as one of the following values: cudaSharedmemCarveoutDefault, cudaSharedmemCarveoutMaxL1, or cudaSharedmemCarveoutMaxShared. When using a percentage, the carveout is rounded up to the nearest supported shared memory capacity. For example, for devices of compute capability 8.0, 50% will map to a 100 KB carveout instead of an 82 KB one. Setting cudaFuncAttributePreferredSharedMemoryCarveout is considered a hint by the driver.
Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set cudaFuncAttributeMaxDynamicSharedMemorySize.
Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.
Một Streaming Multiprocessor [SM] bao gồm
- 128 lõi FP32 cho các phép tính số học có độ chính xác đơn,
- 64 lõi FP64 cho các phép toán số học có độ chính xác kép,
- 64 lõi INT32 cho phép toán số nguyên,
- 4 Lõi Tensor thế hệ thứ tư có độ chính xác hỗn hợp hỗ trợ loại đầu vào FP8 mới trong E4M3 hoặc E5M2 cho số mũ [E] và phần định trị [M], độ chính xác một nửa [fp16], __nv_bfloat16, tf32, INT8 và ma trận độ chính xác kép [fp64]
- 16 đơn vị chức năng đặc biệt cho các chức năng siêu việt dấu phẩy động độ chính xác đơn,
- 4 bộ lập lịch dọc
Một SM phân phối tĩnh các sợi dọc của nó giữa các bộ lập lịch của nó. Sau đó, tại mỗi thời điểm phát hành lệnh, mỗi bộ lập lịch đưa ra một lệnh cho một trong các lệnh dọc được chỉ định sẵn sàng thực thi, nếu có
Một SM có
- bộ đệm cố định chỉ đọc được chia sẻ bởi tất cả các đơn vị chức năng và tăng tốc độ đọc từ không gian bộ nhớ không đổi, nằm trong bộ nhớ thiết bị,
- bộ đệm dữ liệu thống nhất và bộ nhớ dùng chung với tổng kích thước 256 KB cho các thiết bị có khả năng tính toán 9. 0 [1. 33x Kiến trúc GPU NVIDIA Ampere [dung lượng 192 KB]
Bộ nhớ dùng chung được phân vùng ra khỏi bộ đệm dữ liệu hợp nhất và có thể được định cấu hình theo nhiều kích cỡ khác nhau [xem phần Bộ nhớ dùng chung]. Bộ đệm dữ liệu còn lại đóng vai trò là bộ đệm L1 và cũng được sử dụng bởi đơn vị kết cấu thực hiện các chế độ lọc dữ liệu và địa chỉ khác nhau được đề cập trong Bộ nhớ kết cấu và bề mặt
Tương tự như kiến trúc GPU NVIDIA Ampere, lượng bộ đệm dữ liệu hợp nhất dành riêng cho bộ nhớ dùng chung có thể định cấu hình trên cơ sở mỗi nhân. Đối với kiến trúc GPU NVIDIA H100 Tensor Core, bộ đệm dữ liệu hợp nhất có kích thước 256 KB cho các thiết bị có khả năng tính toán 9. 0. Dung lượng bộ nhớ dùng chung có thể được đặt thành 0, 8, 16, 32, 64, 100, 132, 164, 196 hoặc 228 KB
Như với kiến trúc GPU NVIDIA Ampere, một ứng dụng có thể định cấu hình dung lượng bộ nhớ dùng chung ưa thích của nó, tôi. e. , carveout. Thiết bị có khả năng tính toán 9. 0 cho phép một khối luồng xử lý tối đa 227 KB bộ nhớ dùng chung. Các hạt nhân dựa vào phân bổ bộ nhớ dùng chung trên 48 KB mỗi khối là dành riêng cho kiến trúc và phải sử dụng bộ nhớ dùng chung động thay vì các mảng bộ nhớ dùng chung có kích thước tĩnh. Các hạt nhân này yêu cầu chọn tham gia rõ ràng bằng cách sử dụng cudaFuncSetAttribute[] để đặt cudaFuncAttributeMaxDynamicSharedMemorySize;
Lưu ý rằng lượng bộ nhớ dùng chung tối đa cho mỗi khối luồng nhỏ hơn phân vùng bộ nhớ dùng chung tối đa có sẵn cho mỗi SM. 1 KB bộ nhớ dùng chung không khả dụng cho khối luồng được dành riêng cho việc sử dụng hệ thống
This section assumes knowledge of the concepts described in CUDA Runtime.
The driver API is implemented in the cuda dynamic library (cuda.dll or cuda.so), which is copied onto the system during the installation of the device driver. All its entry points are prefixed with cu.
It is a handle-based, imperative API: most objects are referenced by opaque handles that may be specified to functions to manipulate the objects.
The objects available in the driver API are summarized in Table 16.
Table 16. Objects Available in the CUDA Driver API

| Object | Handle | Description |
|---|---|---|
| Device | CUdevice | CUDA-enabled device |
| Context | CUcontext | Roughly equivalent to a CPU process |
| Module | CUmodule | Roughly equivalent to a dynamic library |
| Function | CUfunction | Kernel |
| Heap memory | CUdeviceptr | Pointer to device memory |
| CUDA array | CUarray | Opaque container for one-dimensional or two-dimensional data on the device, readable via texture or surface references |

The driver API must be initialized with cuInit() before any function from the driver API is called. A CUDA context must then be created that is attached to a specific device and made current to the calling host thread, as detailed in Context.
Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by the host code, as described in Module. Kernels written in C++ must therefore be compiled separately into PTX or binary objects. Kernels are launched using API entry points, as described in Kernel Execution.
Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the device driver.
Here is the host code of the sample from Kernels written using the driver API:
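```
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate and initialize input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    ...

    // Initialize the driver API and get a handle for device 0
    cuInit(0);
    CUdevice cuDevice;
    cuDeviceGet(&cuDevice, 0);
    // Create context
    CUcontext cuContext;
    cuCtxCreate(&cuContext, 0, cuDevice);

    // Create module from binary file
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "VecAdd.ptx");

    // Allocate vectors in device memory and copy the inputs over
    CUdeviceptr d_A, d_B, d_C;
    cuMemAlloc(&d_A, size);
    cuMemAlloc(&d_B, size);
    cuMemAlloc(&d_C, size);
    cuMemcpyHtoD(d_A, h_A, size);
    cuMemcpyHtoD(d_B, h_B, size);

    // Get function handle from module and invoke the kernel
    CUfunction vecAdd;
    cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    void* args[] = { &d_A, &d_B, &d_C, &N };
    cuLaunchKernel(vecAdd, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1, 0, 0, args, 0);
    ...
}
```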
Full code can be found in the vectorAddDrv CUDA sample.
A CUDA context is analogous to a CPU process. All resources and actions performed within the driver API are encapsulated inside a CUDA context, and the system automatically cleans up these resources when the context is destroyed. Besides objects such as modules and texture or surface references, each context has its own distinct address space. As a result, CUdeviceptr values from different contexts reference different memory locations.
A host thread may have only one device context current at a time. When a context is created with cuCtxCreate(), it is made current to the calling host thread. CUDA functions that operate in a context (most functions that do not involve device enumeration or context management) will return CUDA_ERROR_INVALID_CONTEXT if a valid context is not current to the thread.
Each host thread has a stack of current contexts. cuCtxCreate() pushes the new context onto the top of the stack. cuCtxPopCurrent() may be called to detach the context from the host thread. The context is then "floating" and may be pushed as the current context for any host thread. cuCtxPopCurrent() also restores the previous current context, if any.
A usage count is also maintained for each context. cuCtxCreate() creates a context with a usage count of 1. cuCtxAttach() increments the usage count and cuCtxDetach() decrements it. A context is destroyed when the usage count goes to 0 when calling cuCtxDetach() or cuCtxDestroy().
The driver API is interoperable with the runtime, and it is possible to access the primary context (see Initialization) managed by the runtime from the driver API via cuDevicePrimaryCtxRetain().
Usage count facilitates interoperability between third-party authored code operating in the same context. For example, if three libraries are loaded to use the same context, each library would call cuCtxAttach() to increment the usage count and cuCtxDetach() to decrement the usage count when the library is done using the context. For most libraries, it is expected that the application will have created a context before loading or initializing the library. Libraries that wish to create their own contexts, unbeknownst to their API clients who may or may not have created contexts of their own, would use cuCtxPushCurrent() and cuCtxPopCurrent() as illustrated in Figure 24.
Figure 24. Library Context Management

Modules are dynamically loadable packages of device code and data, akin to DLLs in Windows, that are output by nvcc (see Compilation with NVCC). The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.
This code sample loads a module and retrieves a handle to some kernel:
```
CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.ptx");
CUfunction myKernel;
cuModuleGetFunction(&myKernel, cuModule, "MyKernel");
```
This code sample compiles and loads a new module from PTX code and parses compilation errors:
```
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
const char* PTXCode = "some PTX code";
char error_log[BUFFER_SIZE];
int err;
// Ask the JIT compiler to write its error messages into error_log
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0]  = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1]  = (void*)(size_t)BUFFER_SIZE;
// Compile for the architecture of the current context
options[2] = CU_JIT_TARGET_FROM_CUCONTEXT;
values[2]  = 0;
err = cuModuleLoadDataEx(&cuModule, PTXCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
```
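This code sample compiles, links, and loads a new module from multiple PTX codes and parses link and compilation errors:
```
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[6];
void* values[6];
float walltime;
char error_log[BUFFER_SIZE], info_log[BUFFER_SIZE];
const char* PTXCode0 = "some PTX code";
const char* PTXCode1 = "some other PTX code";
CUlinkState linkState;
int err;
void* cubin;
size_t cubinSize;
options[0] = CU_JIT_WALL_TIME;
values[0] = (void*)&walltime;
options[1] = CU_JIT_INFO_LOG_BUFFER;
values[1] = (void*)info_log;
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
values[2] = (void*)(size_t)BUFFER_SIZE;
options[3] = CU_JIT_ERROR_LOG_BUFFER;
values[3] = (void*)error_log;
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[4] = (void*)(size_t)BUFFER_SIZE;
options[5] = CU_JIT_LOG_VERBOSE;
values[5] = (void*)1;
cuLinkCreate(6, options, values, &linkState);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode0, strlen(PTXCode0) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode1, strlen(PTXCode1) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
cuLinkComplete(linkState, &cubin, &cubinSize);
printf("Link completed in %fms. Linker Output:\n%s\n", walltime, info_log);
cuModuleLoadData(&cuModule, cubin);
cuLinkDestroy(linkState);
```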
Full code can be found in the ptxjit CUDA sample.
cuLaunchKernel() launches a kernel with a given execution configuration.
Parameters are passed either as an array of pointers (next to last parameter of cuLaunchKernel()), where the nth pointer corresponds to the nth parameter and points to a region of memory from which the parameter is copied, or as one of the extra options (last parameter of cuLaunchKernel()).
When parameters are passed as an extra option (the CU_LAUNCH_PARAM_BUFFER_POINTER option), they are passed as a pointer to a single buffer where parameters are assumed to be properly offset with respect to each other by matching the alignment requirement for each parameter type in device code.
Alignment requirements in device code for the built-in vector types are listed in Table 4. For all other basic types, the alignment requirement in device code matches the alignment requirement in host code and can therefore be obtained using __alignof(). The only exception is when the host compiler aligns double and long long (and long on a 64-bit system) on a one-word boundary instead of a two-word boundary (for example, using gcc's compilation flag -mno-align-double).
CUdeviceptr is an integer, but represents a pointer, so its alignment requirement is __alignof(void*).
The following code sample uses a macro (ALIGN_UP()) to adjust the offset of each parameter to meet its alignment requirement and another macro (ADD_TO_PARAM_BUFFER()) to add each parameter to the parameter buffer passed to the CU_LAUNCH_PARAM_BUFFER_POINTER option:
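```
#define ALIGN_UP(offset, alignment) \
      (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

char paramBuffer[1024];
size_t paramBufferSize = 0;

#define ADD_TO_PARAM_BUFFER(value, alignment)                   \
    do {                                                        \
        paramBufferSize = ALIGN_UP(paramBufferSize, alignment); \
        memcpy(paramBuffer + paramBufferSize,                   \
               &(value), sizeof(value));                        \
        paramBufferSize += sizeof(value);                       \
    } while (0)

int i;
ADD_TO_PARAM_BUFFER(i, __alignof(i));
float4 f4;
ADD_TO_PARAM_BUFFER(f4, 16); // float4's alignment is 16
char c;
ADD_TO_PARAM_BUFFER(c, __alignof(c));
float f;
ADD_TO_PARAM_BUFFER(f, __alignof(f));
CUdeviceptr devPtr;
ADD_TO_PARAM_BUFFER(devPtr, __alignof(devPtr));
float2 f2;
ADD_TO_PARAM_BUFFER(f2, 8); // float2's alignment is 8

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &paramBufferSize,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(cuFunction,
               blockWidth, blockHeight, blockDepth,
               gridWidth, gridHeight, gridDepth,
               0, 0, 0, extra);
```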
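To make the offset arithmetic behind ALIGN_UP() and ADD_TO_PARAM_BUFFER() concrete, here is a minimal host-only sketch that needs no CUDA headers. The names `alignUp`, `ParamInfo`, and `layoutParams` are hypothetical, and the sizes and alignments passed in are assumptions supplied by the caller (for example 16 bytes for a float4 and 8 bytes for a device pointer).

```cpp
#include <cassert>
#include <cstddef>

// Rounds offset up to the next multiple of alignment (a power of two).
inline std::size_t alignUp(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical description of one kernel parameter: its size and its
// alignment requirement in device code.
struct ParamInfo {
    std::size_t size;
    std::size_t alignment;
};

// Computes where each parameter would start in a packed parameter buffer
// and returns the total number of bytes used.
std::size_t layoutParams(const ParamInfo* params, std::size_t count, std::size_t* offsets)
{
    std::size_t cursor = 0;
    for (std::size_t i = 0; i < count; ++i) {
        cursor = alignUp(cursor, params[i].alignment);  // skip padding bytes
        offsets[i] = cursor;
        cursor += params[i].size;
    }
    return cursor;
}
```

For a parameter list of int, float4, char, float, device pointer, and float2, this layout places the float4 at offset 16 rather than 4, which is the padding the macros exist to insert.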
The alignment requirement of a structure is equal to the maximum of the alignment requirements of its fields. The alignment requirement of a structure that contains built-in vector types, CUdeviceptr, or non-aligned double and long long might therefore differ between device code and host code. Such a structure might also be padded differently. The following structure, for example, is not padded at all in host code, but it is padded in device code with 12 bytes after field f, since the alignment requirement for field f4 is 16:
```
typedef struct {
    float  f;
    float4 f4;
} myStruct;
```
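The padding difference described above can be reproduced on the host with stand-in types. This is a sketch under stated assumptions: `DeviceFloat4` and `HostFloat4` are hypothetical substitutes for float4 (the real type comes from the CUDA headers), modeled with a 16-byte and a 4-byte alignment requirement respectively.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for float4 with the 16-byte alignment used in device code.
struct alignas(16) DeviceFloat4 { float x, y, z, w; };

// Hypothetical stand-in for the same four floats with only 4-byte alignment.
struct HostFloat4 { float x, y, z, w; };

struct DeviceLayout { float f; DeviceFloat4 f4; };
struct HostLayout   { float f; HostFloat4   f4; };

// Number of padding bytes the compiler inserts between f and f4.
std::size_t devicePaddingAfterF() { return offsetof(DeviceLayout, f4) - sizeof(float); }
std::size_t hostPaddingAfterF()   { return offsetof(HostLayout, f4) - sizeof(float); }

static_assert(sizeof(DeviceLayout) == 32, "4 (f) + 12 (padding) + 16 (f4)");
static_assert(sizeof(HostLayout) == 20, "4 (f) + 16 (f4), no padding");
```

A buffer filled using one layout and read using the other would place f4 at the wrong offset, which is why parameter buffers must follow the device-code alignment rules.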
An application can mix runtime API code with driver API code.
If a context is created and made current via the driver API, subsequent runtime calls will pick up this context instead of creating a new one.
If the runtime is initialized (implicitly, as mentioned in CUDA Runtime), cuCtxGetCurrent() can be used to retrieve the context created during initialization. This context can be used by subsequent driver API calls.
The implicitly created context from the runtime is called the primary context (see Initialization). It can be managed from the driver API with the Primary Context Management functions.
Device memory can be allocated and freed using either API. CUdeviceptr can be cast to regular pointers and vice versa:
```
CUdeviceptr devPtr;
float* d_data;

// Allocation using driver API
cuMemAlloc(&devPtr, size);
d_data = (float*)devPtr;

// Allocation using runtime API
cudaMalloc(&d_data, size);
devPtr = (CUdeviceptr)d_data;
```
In particular, this means that applications written using the driver API can invoke libraries written using the runtime API (such as cuFFT, cuBLAS, ...).
All functions from the device and version management sections of the reference manual can be used interchangeably.
To help retrieve the CUDA Driver API entry points, the CUDA Toolkit provides access to headers containing the function pointer definitions for all CUDA driver APIs. These headers are installed with the CUDA Toolkit and are made available in the toolkit's include/ directory. The table below summarizes the header files containing the typedefs for each CUDA API header file.
Table 17. Typedefs header files for CUDA driver APIs

| API header file | API Typedef header file |
|---|---|
| cuda.h | cudaTypedefs.h |
| cudaGL.h | cudaGLTypedefs.h |
| cudaProfiler.h | cudaProfilerTypedefs.h |
| cudaVDPAU.h | cudaVDPAUTypedefs.h |
| cudaEGL.h | cudaEGLTypedefs.h |
| cudaD3D9.h | cudaD3D9Typedefs.h |
| cudaD3D10.h | cudaD3D10Typedefs.h |
| cudaD3D11.h | cudaD3D11Typedefs.h |

The above headers do not define actual function pointers themselves; they only define the typedefs for function pointers. For example, cudaTypedefs.h has the below typedefs for the driver API cuMemAlloc:
```
typedef CUresult (CUDAAPI *PFN_cuMemAlloc_v3020)(CUdeviceptr_v2 *dptr, size_t bytesize);
typedef CUresult (CUDAAPI *PFN_cuMemAlloc_v2000)(CUdeviceptr_v1 *dptr, unsigned int bytesize);
```
CUDA driver symbols have a version-based naming scheme with a _v* extension in their name, except for the first version. When the signature or the semantics of a specific CUDA driver API changes, the version number of the corresponding driver symbol is incremented. In the case of the cuMemAlloc driver API, the first driver symbol name is cuMemAlloc and the next symbol name is cuMemAlloc_v2. The typedef for the first version, which was introduced in CUDA 2.0 (2000), is PFN_cuMemAlloc_v2000. The typedef for the next version, which was introduced in CUDA 3.2 (3020), is PFN_cuMemAlloc_v3020.
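The numbers in parentheses follow the same encoding as CUDA_VERSION: 1000 times the major version plus 10 times the minor version. The sketch below illustrates that mapping and how it appears in the typedef names; `encodeCudaVersion` and `pfnTypedefName` are hypothetical helpers written for this illustration, not part of any CUDA header.

```cpp
#include <cassert>
#include <string>

// Encodes a CUDA release as an integer: 1000 * major + 10 * minor.
constexpr int encodeCudaVersion(int major, int minor)
{
    return 1000 * major + 10 * minor;
}

// Hypothetical helper: builds the versioned typedef name for a driver API,
// e.g. ("cuMemAlloc", 3, 2) gives "PFN_cuMemAlloc_v3020".
std::string pfnTypedefName(const std::string& api, int major, int minor)
{
    assert(major > 0 && minor >= 0);
    return "PFN_" + api + "_v" + std::to_string(encodeCudaVersion(major, minor));
}
```

Note that the suffix in the typedef name records the CUDA release that introduced a revision, whereas the suffix in the driver symbol name (such as _v2) only counts revisions.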
The typedefs can be used to more easily define a function pointer of the appropriate type in code:
```
PFN_cuMemAlloc_v3020 pfn_cuMemAlloc_v2;
PFN_cuMemAlloc_v2000 pfn_cuMemAlloc_v1;
```
The above method is preferable if users are interested in a specific version of the API. Additionally, the headers have predefined macros for the latest version of all driver symbols that were available when the installed CUDA Toolkit was released; these typedefs carry no _v* suffix. For CUDA 11.3, cuMemAlloc_v2 is the latest version, so we can also define its function pointer as below:
```
PFN_cuMemAlloc pfn_cuMemAlloc;
```
The driver API requires a CUDA version as an argument to get the ABI-compatible version of the requested driver symbol. CUDA driver APIs have a per-function ABI denoted with the _v* extension. For example, consider the versions of cuStreamBeginCapture and their corresponding typedefs from cudaTypedefs.h:
```
// cuda.h
CUresult CUDAAPI cuStreamBeginCapture(CUstream hStream);
CUresult CUDAAPI cuStreamBeginCapture_v2(CUstream hStream, CUstreamCaptureMode mode);

// cudaTypedefs.h
typedef CUresult (CUDAAPI *PFN_cuStreamBeginCapture_v10000)(CUstream hStream);
typedef CUresult (CUDAAPI *PFN_cuStreamBeginCapture_v10010)(CUstream hStream, CUstreamCaptureMode mode);
```
From the above typedefs in the code snippet, the version suffixes _v10000 and _v10010 indicate that the above APIs were introduced in CUDA 10.0 and CUDA 10.1 respectively.
```
#include <cudaTypedefs.h>

// Declare the entry points for cuStreamBeginCapture
PFN_cuStreamBeginCapture_v10000 pfn_cuStreamBeginCapture_v1;
PFN_cuStreamBeginCapture_v10010 pfn_cuStreamBeginCapture_v2;
CUdriverProcAddressQueryResult driverStatus;

// Get the function pointer to the cuStreamBeginCapture driver symbol
cuGetProcAddress("cuStreamBeginCapture", (void**)&pfn_cuStreamBeginCapture_v1, 10000, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
// Get the function pointer to the cuStreamBeginCapture_v2 driver symbol
cuGetProcAddress("cuStreamBeginCapture", (void**)&pfn_cuStreamBeginCapture_v2, 10010, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
```
Referring to the code snippet above, to retrieve the address of the _v1 version of the driver API cuStreamBeginCapture, the CUDA version argument should be exactly 10.0 (10000). Similarly, the CUDA version for retrieving the address of the _v2 version of the API should be 10.1 (10010). Specifying a higher CUDA version for retrieving a specific version of a driver API might not always be portable. For example, using 11030 here would still return the _v2 symbol, but if a hypothetical _v3 version were released in CUDA 11.3, the cuGetProcAddress API would start returning the newer _v3 symbol when paired with a CUDA 11.3 driver. Since the ABI and function signatures of the _v2 and _v3 symbols might differ, calling the _v3 function using the _v10010 typedef intended for the _v2 symbol would exhibit undefined behavior.
To retrieve the latest version of a driver API for a given CUDA Toolkit, we can also specify CUDA_VERSION as the version argument and use the unversioned typedef to define the function pointer. Since _v2 is the latest version of the driver API cuStreamBeginCapture in CUDA 11.3, the code snippet below shows a different method to retrieve it:
```
// Assuming we are using CUDA 11.3 Toolkit

#include <cudaTypedefs.h>

// Declare the entry point
PFN_cuStreamBeginCapture pfn_cuStreamBeginCapture_latest;
CUdriverProcAddressQueryResult driverStatus;

// Initialize the entry point. Specifying CUDA_VERSION will give the function pointer to the
// cuStreamBeginCapture_v2 symbol since it is the latest version on CUDA 11.3.
cuGetProcAddress("cuStreamBeginCapture", (void**)&pfn_cuStreamBeginCapture_latest, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
```
Note that requesting a driver API with an invalid CUDA version will return the error CUDA_ERROR_NOT_FOUND. In the above code examples, passing in a version less than 10000 (CUDA 10.0) would be invalid.
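The selection behavior described for cuStreamBeginCapture can be summarized with a toy model. This is explicitly not the driver's implementation: `SymbolRevision`, `resolveRevision`, and `streamBeginCaptureRevisions` are hypothetical names, the return value 0 merely stands in for CUDA_ERROR_NOT_FOUND, and the model ignores the fact that a real driver can only return revisions it actually ships.

```cpp
#include <cassert>
#include <vector>

// One revision of a driver symbol: its _v number and the encoded CUDA
// version (1000 * major + 10 * minor) that introduced it.
struct SymbolRevision {
    int revision;
    int introducedIn;
};

// Toy model: returns the newest revision introduced at or before
// requestedVersion, or 0 when no revision qualifies (the not-found case).
int resolveRevision(const std::vector<SymbolRevision>& revisions, int requestedVersion)
{
    assert(requestedVersion > 0);
    int best = 0;
    int bestVersion = -1;
    for (const SymbolRevision& r : revisions) {
        if (r.introducedIn <= requestedVersion && r.introducedIn > bestVersion) {
            best = r.revision;
            bestVersion = r.introducedIn;
        }
    }
    return best;
}

// Revisions named in the text: _v1 in CUDA 10.0 and _v2 in CUDA 10.1.
std::vector<SymbolRevision> streamBeginCaptureRevisions()
{
    return { {1, 10000}, {2, 10010} };
}
```

In this model, adding a hypothetical `{3, 11030}` entry would change the answer for a request of 11030 from revision 2 to revision 3, which is the portability hazard the text warns about.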
It is always recommended to install the latest CUDA Toolkit to access new CUDA driver features, but if for some reason a user does not want to update or does not have access to the latest toolkit, the API can be used to access new CUDA features. For discussion, let us assume the user is on CUDA 11.3 and wants to use a new driver API cuFoo available in the CUDA 12.0 driver. The code snippet below illustrates this use case:
```
int main()
{
    // Assuming we have CUDA 12.0 driver installed.

    // Manually define the prototype as cudaTypedefs.h in CUDA 11.3 does not have the cuFoo typedef
    typedef CUresult (CUDAAPI *PFN_cuFoo)(...);
    PFN_cuFoo pfn_cuFoo = NULL;
    CUdriverProcAddressQueryResult driverStatus;

    // Get the address for cuFoo API using cuGetProcAddress. Specify CUDA version as
    // 12000 since cuFoo was introduced then or get the driver version dynamically
    // using cuDriverGetVersion
    int driverVersion;
    cuDriverGetVersion(&driverVersion);
    CUresult status = cuGetProcAddress("cuFoo", (void**)&pfn_cuFoo, driverVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);

    if (status == CUDA_SUCCESS && pfn_cuFoo) {
        pfn_cuFoo(...);
    }
    else {
        printf("Cannot retrieve the address to cuFoo - driverStatus = %d. Check if the latest driver for CUDA 12.0 is installed.\n", driverStatus);
        assert(0);
    }

    // rest of code here
}
```
cuDeviceGetUuid was introduced in CUDA 9.2. This API has a newer revision (cuDeviceGetUuid_v2) introduced in CUDA 11.4. To preserve minor version compatibility, cuDeviceGetUuid will not be version bumped to cuDeviceGetUuid_v2 in cuda.h until CUDA 12.0. This means that calling it directly and calling it through a function pointer obtained via cuGetProcAddress might have different behavior. Example using the API directly:
```
#include <cuda.h>

CUuuid uuid;
CUdevice dev;
CUresult status;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

status = cuDeviceGetUuid(&uuid, dev);
// handle status
```
In this example, assume the user is compiling with CUDA 11.4. Note that this will perform the behavior of cuDeviceGetUuid, not the _v2 version. Now an example of using cuGetProcAddress:
```
#include <cudaTypedefs.h>

CUuuid uuid;
CUdevice dev;
CUresult status;
CUdriverProcAddressQueryResult driverStatus;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

PFN_cuDeviceGetUuid pfn_cuDeviceGetUuid;
status = cuGetProcAddress("cuDeviceGetUuid", (void**)&pfn_cuDeviceGetUuid, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if (CUDA_SUCCESS == status && pfn_cuDeviceGetUuid) {
    // pfn_cuDeviceGetUuid points to ???
}
```
In this example, assume the user is compiling with CUDA 11.4. This will get the function pointer of cuDeviceGetUuid_v2. Calling the function pointer will then invoke the new _v2 function, not the same cuDeviceGetUuid as shown in the previous example.
Let's take the same issue and make one small tweak. The last example used the compile-time constant CUDA_VERSION to determine which function pointer to obtain. More complications arise if the user queries the driver version dynamically using cuDriverGetVersion or cudaDriverGetVersion to pass to cuGetProcAddress. Example:
```
#include <cudaTypedefs.h>

CUuuid uuid;
CUdevice dev;
CUresult status;
int cudaVersion;
CUdriverProcAddressQueryResult driverStatus;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

status = cuDriverGetVersion(&cudaVersion);
// handle status

PFN_cuDeviceGetUuid pfn_cuDeviceGetUuid;
status = cuGetProcAddress("cuDeviceGetUuid", (void**)&pfn_cuDeviceGetUuid, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if (CUDA_SUCCESS == status && pfn_cuDeviceGetUuid) {
    // pfn_cuDeviceGetUuid points to ???
}
```
In this example, assume the user is compiling with CUDA 11.3. The user would debug, test, and deploy this application with the known behavior of getting cuDeviceGetUuid (not the _v2 version). Since CUDA guarantees ABI compatibility between minor versions, this same application is expected to run after the driver is upgraded to CUDA 11.4 (without updating the toolkit and runtime) without requiring recompilation. This will have undefined behavior though: the typedef for PFN_cuDeviceGetUuid will still be a signature for the original version, but since cudaVersion would now be 11040 (CUDA 11.4), cuGetProcAddress would return the function pointer to the _v2 version, meaning calling it might have undefined behavior.
Note that in this case the original (not the _v2 version) typedef looks like:
```
typedef CUresult (CUDAAPI *PFN_cuDeviceGetUuid_v9020)(CUuuid *uuid, CUdevice_v1 dev);
```
But the _v2 version typedef looks like:
```
typedef CUresult (CUDAAPI *PFN_cuDeviceGetUuid_v11040)(CUuuid *uuid, CUdevice_v1 dev);
```
So in this case, the API/ABI is going to be the same and the runtime API call will likely not cause issues, only the potential for an unknown uuid return. In Implications to API/ABI, we discuss a more problematic case of API/ABI compatibility.
The above is a specific concrete example. Now let's use a theoretical example that still has issues with compatibility across driver versions. Example:
```
CUresult cuFoo(int bar);                 // Introduced in CUDA 11.4
CUresult cuFoo_v2(int bar);              // Introduced in CUDA 11.5
CUresult cuFoo_v3(int bar, void* jazz);  // Introduced in CUDA 11.6

typedef CUresult (CUDAAPI *PFN_cuFoo_v11040)(int bar);
typedef CUresult (CUDAAPI *PFN_cuFoo_v11050)(int bar);
typedef CUresult (CUDAAPI *PFN_cuFoo_v11060)(int bar, void* jazz);
```
Notice that the API has been modified twice since its original creation in CUDA 11.4, and the latest revision in CUDA 11.6 also modified the API/ABI interface of the function. The usage in user code compiled against CUDA 11.5 is:
```
#include <cuda.h>
#include <cudaTypedefs.h>

CUresult status;
int cudaVersion;
CUdriverProcAddressQueryResult driverStatus;

status = cuDriverGetVersion(&cudaVersion);
// handle status

PFN_cuFoo_v11040 pfn_cuFoo_v11040;
PFN_cuFoo_v11050 pfn_cuFoo_v11050;

if (cudaVersion < 11050) {
    // We know to get the CUDA 11.4 version
    status = cuGetProcAddress("cuFoo", (void**)&pfn_cuFoo_v11040, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
    // Handle status and validate pfn_cuFoo_v11040
}
else {
    // Assume >= CUDA 11.5 version we can use the second version
    status = cuGetProcAddress("cuFoo", (void**)&pfn_cuFoo_v11050, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
    // Handle status and validate pfn_cuFoo_v11050
}
```
In this example, without updates for the new typedef in CUDA 11.6 and without recompiling the application with those new typedefs and case handling, the application will get the cuFoo_v3 function pointer returned, and any usage of that function would then cause undefined behavior. The point of this example is to illustrate that even explicit version checks for cuGetProcAddress may not safely cover the minor version bumps within a CUDA major release.
The above examples focused on the issues with using the driver API to obtain function pointers to driver APIs. Now we will discuss the potential issues with using the runtime API cudaGetDriverEntryPoint.
We will start by using the runtime API similarly to the above:
```
#include <cuda_runtime.h>
#include <cudaTypedefs.h>

cudaError_t error;
cudaDriverEntryPointQueryResult runtimeStatus;

// Ask the runtime for the function
PFN_cuDeviceGetUuid pfn_cuDeviceGetUuidRuntime;
error = cudaGetDriverEntryPoint("cuDeviceGetUuid", (void**)&pfn_cuDeviceGetUuidRuntime, cudaEnableDefault, &runtimeStatus);
if (cudaSuccess == error && pfn_cuDeviceGetUuidRuntime) {
    // pfn_cuDeviceGetUuidRuntime points to ???
}
```
Con trỏ hàm trong ví dụ này thậm chí còn phức tạp hơn so với ví dụ chỉ trình điều khiển ở trên vì không có quyền kiểm soát phiên bản nào của hàm sẽ lấy; . Xem bảng sau để biết thêm thông tin
| Installed driver version | Static runtime linkage V11.3 | Static runtime linkage V11.4 |
| --- | --- | --- |
| V11.3 | v1 | v1x |
| V11.4 | v1 | v2 |
The problem in the table arises with the combination of a newer CUDA 11.4 Runtime and Toolkit and an older driver (CUDA 11.3), labeled v1x above. With this combination the driver returns the pointer to the older function (not _v2), but the typedef used in the application is for the new function pointer.
More complications arise when we consider different combinations of the CUDA version the application is compiled against, the CUDA runtime version, and the CUDA driver version that the application dynamically links against.
The following function pointer matrix is expected. Column headers read Application Compiled / Runtime Dynamic Linked Version / Driver Version (3 => CUDA 11.3 and 4 => CUDA 11.4).

| Function Pointer | 3/3/3 | 3/3/4 | 3/4/3 | 3/4/4 | 4/3/3 | 4/3/4 | 4/4/3 | 4/4/4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pfn_cuDeviceGetUuidDriver | t1/v1 | t1/v1 | t1/v1 | t1/v1 | N/A | N/A | t2/v1 | t2/ |
If the application is compiled against CUDA version 11.3, it has the typedef for the original function, but if it is compiled against CUDA version 11.4, it has the typedef for the _v2 function. Because of that, notice the number of cases where the typedef does not match the version actually returned and used.
In the examples above that use cuDeviceGetUuid, the effect of the mismatched API is minimal and may go unnoticed by many users, because _v2 was added to support Multi-Instance GPU (MIG) mode. On a system without MIG, the user might not even realize they are getting a different API.
More problematic is an API that changes its application signature (and hence its ABI), such as cuCtxCreate. The _v2 version, introduced in CUDA 3.2, is currently used as the default cuCtxCreate when using cuda.h, but a newer version was introduced in CUDA 11.4 (cuCtxCreate_v3). The API signature has also been modified and now takes extra arguments. So, in some of the cases above, where the typedef for the function pointer does not match the returned function pointer, there is a chance of a non-obvious ABI incompatibility that leads to undefined behavior.
For example, assume the following code is compiled against a CUDA 11.3 toolkit with a CUDA 11.4 driver installed.
Running this code with cudaVersion set to anything >= 11040 (indicating CUDA 11.4) could result in undefined behavior, because not all the parameters required by the _v3 version of the API, cuCtxCreate_v3, are supplied.
The following table lists the CUDA environment variables. Environment variables related to the Multi-Process Service are documented in the Multi-Process Service section of the GPU Deployment and Management guide.
Table 18. CUDA Environment Variables

| Category | Variable | Values | Description |
| --- | --- | --- | --- |
| Device Enumeration and Properties | CUDA_VISIBLE_DEVICES | A comma-separated sequence of GPU identifiers. MIG support: identifiers that start with MIG- | Only the devices whose index is present in the sequence are visible to CUDA applications, and they are enumerated in the order of the sequence. If one of the indices is invalid, only the devices whose index precedes the invalid index are visible to CUDA applications. For example, setting CUDA_VISIBLE_DEVICES to 2,1 causes device 0 to be invisible and device 2 to be enumerated before device 1. Setting CUDA_VISIBLE_DEVICES to 0,2,-1,1 causes devices 0 and 2 to be visible and device 1 to be invisible. The MIG format starts with the MIG keyword, and the GPU UUID should follow the same format as given by nvidia-smi. For example, MIG-GPU-8932f937-d72c-4106-c12f-20bd9faed9f6/1/2. Only single MIG instance enumeration is supported. |
| Device Enumeration and Properties | CUDA_MANAGED_FORCE_DEVICE_ALLOC | 0 or 1 (default is 0) | Forces the driver to place all managed allocations in device memory. |
| Device Enumeration and Properties | CUDA_DEVICE_ORDER | FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST) | FASTEST_FIRST causes CUDA to enumerate the available devices in fastest to slowest order using a simple heuristic. PCI_BUS_ID orders devices by PCI bus ID in ascending order. |
| Compilation | CUDA_CACHE_DISABLE | 0 or 1 (default is 0) | Disables caching (when set to 1) or enables caching (when set to 0) for just-in-time compilation. When disabled, no binary code is added to or retrieved from the cache. |
| Compilation | CUDA_CACHE_PATH | filepath | Specifies the folder where the just-in-time compiler caches binary codes. Defaults: on Windows, %APPDATA%\NVIDIA\ComputeCache; on Linux, ~/.nv/ComputeCache. |
| Compilation | CUDA_CACHE_MAXSIZE | integer (default is 1073741824 (1 GiB) for desktop/server platforms and 268435456 (256 MiB) for embedded platforms; the maximum is 4294967296 (4 GiB)) | Specifies the size in bytes of the cache used by the just-in-time compiler. Binary codes whose size exceeds the cache size are not cached. Older binary codes are evicted from the cache to make room for newer binary codes if needed. |
| Compilation | CUDA_FORCE_PTX_JIT | 0 or 1 (default is 0) | When set to 1, forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. This environment variable can be used to validate that PTX code is embedded in an application and that its just-in-time compilation works as expected, to guarantee application forward compatibility with future architectures (see Just-in-Time Compilation). |
| Compilation | CUDA_DISABLE_PTX_JIT | 0 or 1 (default is 0) | When set to 1, disables the just-in-time compilation of embedded PTX code and uses the compatible binary code embedded in an application (see Application Compatibility). If a kernel does not have embedded binary code, or the embedded binary was compiled for an incompatible architecture, it will fail to load. This environment variable can be used to validate that an application has compatible SASS code generated for each kernel (see Binary Compatibility). |
| Compilation | CUDA_FORCE_JIT | 0 or 1 (default is 0) | When set to 1, forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX or NVVM IR code instead. If a kernel does not have embedded PTX or NVVM IR code, it will fail to load. This environment variable can be used to validate that PTX or NVVM IR code is embedded in an application and that its just-in-time compilation works as expected, to guarantee application forward compatibility with future architectures (see Just-in-Time Compilation). The behavior can be overridden for embedded PTX by setting CUDA_FORCE_PTX_JIT=0. |
| Compilation | CUDA_DISABLE_JIT | 0 or 1 (default is 0) | When set to 1, disables the just-in-time compilation of embedded PTX or NVVM IR code and uses the compatible binary code embedded in an application (see Application Compatibility). If a kernel does not have embedded binary code, or the embedded binary was compiled for an incompatible architecture, it will fail to load. This environment variable can be used to validate that an application has compatible SASS code generated for each kernel (see Binary Compatibility). The behavior can be overridden for embedded PTX by setting CUDA_DISABLE_PTX_JIT=0. |
| Execution | CUDA_LAUNCH_BLOCKING | 0 or 1 (default is 0) | Disables (when set to 1) or enables (when set to 0) asynchronous kernel launches. |
| Execution | CUDA_DEVICE_MAX_CONNECTIONS | 1 to 32 (default is 8) | Sets the number of compute and copy engine concurrent connections (work queues) from the host to each device of compute capability 3.5 and above. |
| Execution | CUDA_AUTO_BOOST | 0 or 1 | Overrides the autoboost behavior set by the --auto-boost-default option of nvidia-smi. If an application requests, via this environment variable, a behavior that differs from nvidia-smi's, the request is honored only if no other application currently running on the same GPU has successfully requested a different behavior; otherwise it is ignored. |
| cuda-gdb (on Linux platform) | CUDA_DEVICE_WAITS_ON_EXCEPTION | 0 or 1 (default is 0) | When set to 1, a CUDA application halts when a device exception occurs, allowing a debugger to be attached for further debugging. |
| MPS service (on Linux platform) | CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT | Percentage value (between 0 and 100, default is 0) | Devices of compute capability 8.x allow a portion of L2 cache to be set aside for persisting data accesses to global memory. When using the CUDA MPS service, the set-aside size can only be controlled using this environment variable, before starting the CUDA MPS control daemon. That is, the environment variable should be set before running the command nvidia-cuda-mps-control -d. |
| Module loading | CUDA_MODULE_LOADING | DEFAULT, LAZY, EAGER | Specifies the module loading mode for the application. When set to EAGER, all kernels from a cubin, fatbin or PTX file are fully loaded upon the corresponding cuModuleLoad* API call. This is the same behavior as in all preceding CUDA releases. When set to LAZY, loading of a specific kernel is delayed until a CUfunc handle is extracted with the cuModuleGetFunction API call. This mode lowers initial module loading latency and initial module-related device memory consumption, at the cost of higher latency of the cuModuleGetFunction API call. Default behavior is EAGER. Default behavior may change in future CUDA releases. |
| Pre-loading dependent libraries | CUDA_FORCE_PRELOAD_LIBRARIES | 0 or 1 (default is 0) | When set to 1, forces the driver to preload the libraries required for NVVM and PTX just-in-time compilation during driver initialization. This increases the memory footprint and the time taken for CUDA driver initialization. This environment variable needs to be set to avoid certain deadlock situations involving multiple CUDA threads. |
Unified Memory is a component of the CUDA programming model, first introduced in CUDA 6.0, that defines a managed memory space in which all processors see a single coherent memory image with a common address space.
Note: A processor refers to any independent execution unit with a dedicated MMU. This includes both CPUs and GPUs of any type and architecture.
The underlying system manages data access and locality within a CUDA program without the need for explicit memory copy calls. This benefits GPU programming in two primary ways:
- GPU programming is simplified by unifying memory spaces coherently across all GPUs and CPUs in the system and by providing tighter and more straightforward language integration for CUDA programmers.
- Data access speed is maximized by transparently migrating data towards the processor using it.
In simple terms, Unified Memory eliminates the need for explicit data movement via the cudaMemcpy*() routines without the performance penalty incurred by placing all data into zero-copy memory. Data movement, of course, still takes place, so a program's run time typically does not decrease.
Unified Memory offers a "single-pointer-to-data" model that is conceptually similar to CUDA's zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory, so a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.
The term Unified Memory describes a system that provides memory management services to a wide range of programs, from those targeting the Runtime API down to those using the Virtual ISA (PTX). Part of this system defines the managed memory space that opts in to Unified Memory services.
Managed memory is interoperable and interchangeable with device-specific allocations, such as those created using the cudaMalloc() routine. All CUDA operations that are valid on device memory are also valid on managed memory.
Note: Unified Memory is not supported on discrete GPUs attached to Tegra.
Unifying memory spaces means there is no longer any need for explicit memory transfers between host and device. Any allocation created in the managed memory space is automatically migrated to where it is needed.
A program allocates managed memory in one of two ways: via the cudaMallocManaged() routine, which is semantically similar to cudaMalloc(), or by declaring a global __managed__ variable. Precise definitions of these are found later in this document.
Note: On supporting platforms with devices of compute capability 6.x and higher, Unified Memory enables applications to allocate and share data using the default system allocator. This allows the GPU to access the entire system virtual memory without using a special allocator. See System Allocator for more detail.
The following code examples illustrate how the use of managed memory can change the way host code is written. First, a simple program written without the benefit of Unified Memory.
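A minimal sketch of such a program is given here. The names host_ret and ret follow the surrounding text; the kernel name AplusB, the helper run_unmanaged, the element count and the operand values 10 and 100 are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int kCount = 1000;  // assumed size: one block, one thread per element

// Each thread stores a + b plus its own thread index.
__global__ void AplusB(int* ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

// Fills host_ret (kCount ints) from the GPU; returns false on a CUDA error.
bool run_unmanaged(int* host_ret) {
    int* ret = nullptr;  // device-side storage for the results
    if (cudaMalloc(&ret, kCount * sizeof(int)) != cudaSuccess) return false;
    AplusB<<<1, kCount>>>(ret, 10, 100);
    // The synchronous copy both waits for the kernel and moves the data.
    cudaError_t err = cudaMemcpy(host_ret, ret, kCount * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(ret);
    return err == cudaSuccess;
}

int main() {
    // Separate host-side storage is needed for the returned values.
    int* host_ret = static_cast<int*>(std::malloc(kCount * sizeof(int)));
    if (host_ret == nullptr) return 1;
    bool ok = run_unmanaged(host_ret);
    if (ok) {
        for (int i = 0; i < kCount; ++i)
            std::printf("%d: A+B = %d\n", i, host_ret[i]);
    }
    std::free(host_ret);
    return ok ? 0 : 1;
}
```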
This first example combines two numbers on the GPU with a per-thread ID and returns the values in an array. Without managed memory, both host-side and device-side storage for the return values is required (host_ret and ret in the example), as is an explicit copy between the two using cudaMemcpy().
Compare this with the Unified Memory version of the program, which allows direct access to GPU data from the host. Notice the cudaMallocManaged() routine, which returns a pointer valid from both host and device code. This allows ret to be used without a separate host_ret copy, greatly simplifying the program and reducing its size.
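The Unified Memory version might look like the sketch below. As before, AplusB, compute_managed, the element count and the operand values are illustrative assumptions; only cudaMallocManaged(), cudaDeviceSynchronize() and the name ret come from the text.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCount = 1000;  // assumed size: one block, one thread per element

__global__ void AplusB(int* ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

// Returns a managed array filled by the GPU, or nullptr on a CUDA error.
// The caller releases it with cudaFree().
int* compute_managed() {
    int* ret = nullptr;  // one pointer, valid on both host and device
    if (cudaMallocManaged(&ret, kCount * sizeof(int)) != cudaSuccess)
        return nullptr;
    AplusB<<<1, kCount>>>(ret, 10, 100);
    // There is no cudaMemcpy to wait on, so synchronize explicitly
    // before the host reads the results.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(ret);
        return nullptr;
    }
    return ret;
}

int main() {
    int* ret = compute_managed();
    if (ret == nullptr) return 1;
    for (int i = 0; i < kCount; ++i)
        std::printf("%d: A+B = %d\n", i, ret[i]);
    cudaFree(ret);
    return 0;
}
```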
Finally, language integration allows direct reference to a GPU-declared __managed__ variable and simplifies a program further when global variables are used.
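A sketch of the same program using a global __managed__ array follows. The kernel name, the helper fill_managed_global, the array size and the operand values are again assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCount = 1000;  // assumed size: one block, one thread per element

// Global managed array, referenced directly from host and device code.
__device__ __managed__ int ret[kCount];

__global__ void AplusB(int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

// Launches the kernel and waits for it; returns false on a CUDA error.
bool fill_managed_global() {
    AplusB<<<1, kCount>>>(10, 100);
    return cudaDeviceSynchronize() == cudaSuccess;
}

int main() {
    if (!fill_managed_global()) return 1;
    // No allocation call and no copy: the host reads ret directly.
    for (int i = 0; i < kCount; ++i)
        std::printf("%d: A+B = %d\n", i, ret[i]);
    return 0;
}
```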
Note the absence of explicit cudaMemcpy() commands and the fact that the return array ret is visible on both CPU and GPU.
The synchronization between host and device deserves a comment. Notice how, in the non-managed example, the synchronous cudaMemcpy() routine is used both to synchronize with the kernel (that is, to wait for it to finish running) and to transfer the data to the host. The Unified Memory examples do not call cudaMemcpy() and therefore require an explicit cudaDeviceSynchronize() before the host program can safely use the output from the GPU.
Unified Memory attempts to optimize memory performance by migrating data towards the device where it is being accessed (that is, moving data to host memory if the CPU is accessing it and to device memory if the GPU will access it). Data migration is fundamental to Unified Memory, but is transparent to a program. The system tries to place data in the location where it can be accessed most efficiently without violating coherency.
The physical location of data is invisible to a program and may change at any time, but accesses to the data's virtual address remain valid and coherent from any processor regardless of locality. Note that maintaining coherence is the primary requirement, ahead of performance.
GPU architectures of compute capability lower than 6.x do not support fine-grained movement of managed data to the GPU on demand. Whenever a GPU kernel is launched, all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access. With compute capability 6.x, a new GPU page faulting mechanism is introduced that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software does not need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to GPU memory on demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.
Devices of compute capability 7.0 support Address Translation Services (ATS) over NVLink. If supported by the host CPU and operating system, ATS allows the GPU to directly access the CPU's page tables. A miss in the GPU MMU results in an Address Translation Request (ATR) to the CPU. The CPU looks in its page tables for the virtual-to-physical mapping for that address and supplies the translation back to the GPU. ATS gives the GPU full access to system memory, such as memory allocated with malloc, memory allocated on the stack, global variables, and file-backed memory. An application can query whether the device supports coherent access to pageable memory via ATS by checking the new pageableMemoryAccessUsesHostPageTables property.
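As a small illustration, the property can be read with cudaDeviceGetAttribute() and the attribute cudaDevAttrPageableMemoryAccessUsesHostPageTables. The wrapper query_attribute below is a hypothetical convenience function, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: returns the attribute value, or -1 if the query fails.
int query_attribute(cudaDeviceAttr attr, int device) {
    int value = 0;
    if (cudaDeviceGetAttribute(&value, attr, device) != cudaSuccess) return -1;
    return value;
}

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;
    // 1 means the device accesses pageable memory through the host's page tables.
    int ats = query_attribute(cudaDevAttrPageableMemoryAccessUsesHostPageTables,
                              device);
    std::printf("pageableMemoryAccessUsesHostPageTables: %d\n", ats);
    return ats < 0 ? 1 : 0;
}
```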
Here is example code that works on any system that satisfies the basic requirements for Unified Memory (see System Requirements).
These new access patterns are supported on systems with the pageableMemoryAccess property.
In the example above, data could be initialized by a third-party CPU library and then accessed directly by the GPU kernel. On systems with pageableMemoryAccess, users may also prefetch pageable memory to the GPU by using cudaMemPrefetchAsync. This can yield performance benefits through optimized data locality.
Note: ATS over NVLink is currently supported only on IBM Power9 systems.
The second generation of NVLink allows direct load/store/atomic access from the CPU to each GPU's memory. Coupled with a new CPU mastering capability, NVLink supports coherency operations that allow data read from GPU memory to be stored in the CPU's cache hierarchy. The lower latency of access from the CPU's cache is key for CPU performance. Devices of compute capability 6.x support only peer-to-peer GPU atomics. Devices of compute capability 7.x can send GPU atomics across NVLink and have them completed at the target CPU, so the second generation of NVLink adds support for atomics initiated by either the GPU or the CPU.
Note that cudaMalloc allocations are not accessible from the CPU. Therefore, to take advantage of hardware coherency, users must use Unified Memory allocators such as cudaMallocManaged or the system allocator with ATS support (see System Allocator). The new property directManagedMemAccessFromHost indicates whether the host can directly access managed memory on the device without migration. By default, any CPU access to cudaMallocManaged allocations resident in GPU memory triggers page faults and data migration. Applications can use the cudaMemAdviseSetAccessedBy performance hint with cudaCpuDeviceId to enable direct access to GPU memory on supported systems.
Consider the example code below.
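A sketch of such a program is shown here. The kernel names write_kernel and append_kernel, the helper write_then_append, and the sizes and operand values are assumptions. The cudaMemAdviseSetAccessedBy hint is left out because the cudaMemAdvise signature differs between CUDA releases, so the sketch only reports the directManagedMemAccessFromHost attribute.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCount = 1000;  // assumed size: one block, one thread per element

__global__ void write_kernel(int* ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append_kernel(int* ret, int a, int b) {
    ret[threadIdx.x] += a + b + threadIdx.x;
}

// Runs write, a CPU read, then append on one managed buffer.
// Returns the buffer (free with cudaFree) or nullptr on a CUDA error.
int* write_then_append() {
    int* ret = nullptr;
    if (cudaMallocManaged(&ret, kCount * sizeof(int)) != cudaSuccess)
        return nullptr;

    write_kernel<<<1, kCount>>>(ret, 10, 100);  // pages populated by the GPU
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(ret); return nullptr; }

    long long sum = 0;  // CPU access between the two kernels
    for (int i = 0; i < kCount; ++i) sum += ret[i];
    std::printf("sum after write: %lld\n", sum);

    append_kernel<<<1, kCount>>>(ret, 10, 100);  // GPU reuses the same memory
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(ret); return nullptr; }
    return ret;
}

int main() {
    int direct = 0;
    cudaDeviceGetAttribute(&direct, cudaDevAttrDirectManagedMemAccessFromHost, 0);
    std::printf("directManagedMemAccessFromHost: %d\n", direct);

    int* ret = write_then_append();
    if (ret == nullptr) return 1;
    cudaFree(ret);
    return 0;
}
```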
After the write kernel completes, ret has been created and initialized in GPU memory. Next, the CPU accesses ret, followed by the append kernel, which uses the same ret memory again. This code shows different behavior depending on the system architecture and its support for hardware coherency:
- On systems with directManagedMemAccessFromHost=1: CPU accesses to the managed buffer do not trigger any migrations.
- On systems with directManagedMemAccessFromHost=0: CPU accesses to the managed buffer page fault and initiate data migration.
Unified Memory is most commonly created using an allocation function that is semantically and syntactically similar to the standard CUDA allocator, cudaMalloc(). The function description is as follows.
The cudaMallocManaged() function reserves size bytes of managed memory and returns a pointer in devPtr. Note the difference in cudaMallocManaged() behavior between GPU architectures. By default, devices of compute capability lower than 6.x allocate managed memory directly on the GPU. Devices of compute capability 6.x and greater, however, do not allocate physical memory when cudaMallocManaged() is called: in this case physical memory is populated on first touch and may be resident on the CPU or the GPU. The managed pointer is valid on all GPUs and the CPU in the system, although program accesses to this pointer must obey the concurrency rules of the Unified Memory programming model (see Coherency and Concurrency). Below is a simple example showing the use of cudaMallocManaged().
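One possible minimal use of cudaMallocManaged() is sketched below: the host writes a string through the managed pointer and a kernel reads it through the same pointer. The kernel name printme, the helper hello_managed, the buffer size and the message are illustrative assumptions.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

constexpr size_t kBufferSize = 100;  // assumed buffer size in bytes

// Device code reads the string through the managed pointer.
__global__ void printme(const char* str) {
    printf("%s", str);
}

// Returns true if allocation, launch and synchronization all succeed.
bool hello_managed() {
    char* s = nullptr;
    if (cudaMallocManaged(&s, kBufferSize) != cudaSuccess) return false;
    // The host writes through the same pointer, with no staging copy.
    std::strncpy(s, "Hello Unified Memory\n", kBufferSize - 1);
    s[kBufferSize - 1] = '\0';
    printme<<<1, 1>>>(s);
    // Wait for the kernel before freeing the buffer.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    cudaFree(s);
    return ok;
}

int main() {
    return hello_managed() ? 0 : 1;
}
```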
A program's behavior is functionally unchanged when cudaMalloc() is replaced with cudaMallocManaged(). Additionally, dual pointers (one to host memory and one to device memory) can be eliminated.
Device code cannot call cudaMallocManaged(). All managed memory must be allocated from the host or at global scope (see the next section). Allocations on the device heap made with malloc() in a kernel are not created in the managed memory space and so are not accessible to CPU code.
To ensure coherency on pre-6.x GPU architectures, the Unified Memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, the GPU has exclusive access to all managed data while any kernel operation is executing, regardless of whether the specific kernel is actively using the data. When managed data is used with cudaMemcpy*() or cudaMemset*(), the system may choose to access the source or destination from the host or the device, which puts constraints on concurrent CPU access to that data while the cudaMemcpy*() or cudaMemset*() call is executing. See Memcpy()/Memset() Behavior With Managed Memory for further details.
The CPU is not permitted to access any managed allocations or variables while the GPU is active on devices with the concurrentManagedAccess property set to 0. On these systems, concurrent CPU/GPU accesses, even to different managed memory allocations, cause a segmentation fault because the page is considered inaccessible to the CPU.
In the example above, the GPU kernel is still active when the CPU touches y. (Note how this occurs before cudaDeviceSynchronize().) The code runs successfully on devices of compute capability 6.x because the GPU page faulting capability lifts all restrictions on simultaneous access. However, such a memory access is invalid on pre-6.x architectures, even though the CPU is accessing different data than the GPU. The program must explicitly synchronize with the GPU before accessing y.
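The safe ordering can be sketched as follows. The variable y follows the text; the second managed variable x, the kernel body and the helper touch_y_safely are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two managed variables; the kernel touches only x.
__device__ __managed__ int x, y = 2;

__global__ void kernel() {
    x = 10;
}

// Launches the kernel, synchronizes, and only then touches y on the CPU.
bool touch_y_safely() {
    kernel<<<1, 1>>>();
    // On devices with concurrentManagedAccess == 0, the CPU must not touch
    // any managed data until the GPU is known to be idle.
    if (cudaDeviceSynchronize() != cudaSuccess) return false;
    y = 20;  // safe on all architectures after the synchronization
    return x == 10 && y == 20;
}

int main() {
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    std::printf("concurrentManagedAccess: %d\n", concurrent);
    return touch_y_safely() ? 0 : 1;
}
```

Moving the assignment to y above the cudaDeviceSynchronize() call would reproduce the problematic ordering described in the text on devices that report concurrentManagedAccess as 0.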
Như ví dụ này cho thấy, trên các hệ thống có trước 6. x kiến trúc GPU, một luồng CPU không được truy cập vào bất kỳ dữ liệu được quản lý nào trong khoảng thời gian giữa việc thực hiện khởi chạy nhân và lệnh gọi đồng bộ hóa tiếp theo, bất kể nhân GPU có thực sự chạm vào cùng dữ liệu đó hay không [hoặc bất kỳ dữ liệu được quản lý nào]. Chỉ tiềm năng truy cập CPU và GPU đồng thời là đủ để nâng cao ngoại lệ ở cấp độ quy trình
Lưu ý rằng nếu bộ nhớ được cấp phát động bằng cudaMallocManaged[] hoặc cuMemAllocManaged[] trong khi GPU đang hoạt động, hoạt động của bộ nhớ sẽ không được chỉ định cho đến khi công việc bổ sung được khởi chạy hoặc GPU được đồng bộ hóa. Cố gắng truy cập bộ nhớ trên CPU trong thời gian này có thể hoặc không gây ra lỗi phân đoạn. Điều này không áp dụng cho bộ nhớ được cấp phát bằng cờ cudaMemAttachHost hoặc CU_MEM_ATTACH_HOST
Lưu ý rằng cần phải đồng bộ hóa rõ ràng ngay cả khi kernel chạy nhanh và kết thúc trước khi CPU chạm vào y trong ví dụ trên. Bộ nhớ hợp nhất sử dụng hoạt động logic để xác định xem GPU có đang ở chế độ chờ hay không. Điều này phù hợp với mô hình lập trình CUDA, mô hình chỉ định rằng nhân có thể chạy bất kỳ lúc nào sau khi khởi chạy và không được đảm bảo sẽ hoàn thành cho đến khi máy chủ đưa ra lệnh gọi đồng bộ hóa
Bất kỳ lệnh gọi chức năng nào đảm bảo GPU hoàn thành công việc của nó một cách hợp lý đều hợp lệ. Điều này bao gồm cudaDeviceSynchronize[];
Các phần phụ thuộc được tạo giữa các luồng sẽ được theo dõi để suy ra việc hoàn thành các luồng khác bằng cách đồng bộ hóa trên một luồng hoặc sự kiện. Các phụ thuộc có thể được tạo thông qua cudaStreamWaitEvent[] hoặc hoàn toàn khi sử dụng luồng mặc định [NULL]
Việc CPU truy cập dữ liệu được quản lý từ trong lệnh gọi lại luồng là hợp pháp, miễn là không có luồng nào khác có khả năng truy cập dữ liệu được quản lý đang hoạt động trên GPU. Ngoài ra, một cuộc gọi lại không được theo sau bởi bất kỳ công việc thiết bị nào có thể được sử dụng để đồng bộ hóa. ví dụ: bằng cách báo hiệu một biến điều kiện từ bên trong cuộc gọi lại;
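The callback-based synchronization described above can be sketched as follows. This is a minimal illustration, not a definitive pattern: DoneSignal, writeValue, signalDone and runCallbackSync are hypothetical names, and the sketch assumes that no other stream is touching managed data and that no CUDA error occurs after the callback is enqueued (the host function is not invoked if the context enters an error state, in which case the wait would never return).

```cuda
#include <condition_variable>
#include <mutex>
#include <cuda_runtime.h>

// Hypothetical helper: a flag the callback sets and the CPU thread waits on.
struct DoneSignal {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

__global__ void writeValue(int *p, int v) { *p = v; }

// Host function enqueued in the stream; it runs after all prior work in that stream.
void CUDART_CB signalDone(void *userData) {
    DoneSignal *s = static_cast<DoneSignal *>(userData);
    {
        std::lock_guard<std::mutex> lock(s->m);
        s->done = true;
    }
    s->cv.notify_one();
}

// Returns the value written by the kernel, or -1 if a CUDA call fails.
int runCallbackSync() {
    cudaStream_t stream;
    int *value = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    if (cudaMallocManaged(&value, sizeof(int)) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1;
    }

    DoneSignal sig;
    writeValue<<<1, 1, 0, stream>>>(value, 42);
    // No device work follows the callback in this stream, so its execution
    // marks the point at which the GPU has finished with the managed data.
    int result = -1;
    if (cudaGetLastError() == cudaSuccess &&
        cudaLaunchHostFunc(stream, signalDone, &sig) == cudaSuccess) {
        std::unique_lock<std::mutex> lock(sig.m);
        sig.cv.wait(lock, [&] { return sig.done; });
        result = *value;  // CPU access after the callback has signaled
    }

    cudaStreamSynchronize(stream);
    cudaFree(value);
    cudaStreamDestroy(stream);
    return result;
}
```

Calling runCallbackSync() from main() is enough to exercise the sketch. A plain cudaStreamSynchronize() would be simpler; the callback form is mainly useful when the waiting thread is not the one that launched the work.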
There are several important points of note:
- The CPU is always permitted to access non-managed zero-copy data while the GPU is active.
- The GPU is considered active when it is running any kernel, even if that kernel does not make use of managed data. If a kernel might use data, then access is forbidden, unless the device property concurrentManagedAccess is 1.
- There are no constraints on concurrent inter-GPU access of managed memory, other than those that apply to multi-GPU access of non-managed memory.
- There are no constraints on concurrent GPU kernels accessing managed data.
Note how the last point allows for races between GPU kernels, as is currently the case for non-managed GPU memory. As mentioned previously, managed memory functions identically to non-managed memory from the perspective of the GPU. The following code example illustrates these points:
```
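// Minimal kernel so the example is complete: writes through the pointer it is given.
__global__ void kernel(int *data) { *data = 1; }

int main() {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    int *non_managed, *managed, *also_managed;
    cudaMallocHost(&non_managed, 4);    // Non-managed, CPU-accessible memory
    cudaMallocManaged(&managed, 4);
    cudaMallocManaged(&also_managed, 4);
    // Point 1: CPU can access non-managed data.
    kernel<<< 1, 1, 0, stream1 >>>(managed);
    *non_managed = 1;
    // Point 2: CPU cannot access any managed data while GPU is busy,
    //          unless concurrentManagedAccess = 1.
    // Note we have not yet synchronized, so "kernel" is still active.
    *also_managed = 2;      // Will issue segmentation fault
    // Point 3: Concurrent GPU kernels can access the same data.
    kernel<<< 1, 1, 0, stream2 >>>(managed);
    // Point 4: Multi-GPU concurrent access is also permitted.
    cudaSetDevice(1);
    kernel<<< 1, 1 >>>(managed);
    return 0;
}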
```
Until now it was assumed that for SM architectures before 6.x: 1) any active kernel may use any managed memory, and 2) it was invalid to use managed memory from the CPU while a kernel is active. Here we present a system for finer-grained control of managed memory designed to work on all devices supporting managed memory, including older architectures with concurrentManagedAccess equal to 0.
The CUDA programming model provides streams as a mechanism for programs to indicate dependence and independence among kernel launches. Kernels launched into the same stream are guaranteed to execute consecutively, while kernels launched into different streams are permitted to execute concurrently. Streams describe independence between work items and hence allow potentially greater efficiency through concurrency.
Unified Memory builds upon the stream-independence model by allowing a CUDA program to explicitly associate managed allocations with a CUDA stream. In this way, the programmer indicates the use of data by kernels based on whether they are launched into a specified stream or not. This enables opportunities for concurrency based on program-specific data access patterns. The function to control this behavior is:
```
cudaError_t cudaStreamAttachMemAsync(cudaStream_t stream,
                                     void *ptr,
                                     size_t length = 0,
                                     unsigned int flags = 0);
```
The cudaStreamAttachMemAsync() function associates length bytes of memory starting from ptr with the specified stream. (Currently, length must always be 0 to indicate that the entire region should be attached.) Because of this association, the Unified Memory system allows CPU access to this memory region so long as all operations in stream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.
Most importantly, if an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a __managed__ variable.
By associating an allocation with a specific stream, the program makes a guarantee that only kernels launched into that stream will touch that data. No error checking is performed by the Unified Memory system: it is the programmer's responsibility to ensure that guarantee is honored.
In addition to allowing greater concurrency, the use of cudaStreamAttachMemAsync() can (and typically does) enable data transfer optimizations within the Unified Memory system that may affect latencies and other overhead.
Associating data with a stream allows fine-grained control over CPU + GPU concurrency, but what data is visible to which streams must be kept in mind when using devices of compute capability lower than 6.x. Looking at the earlier synchronization example:
```
__device__ __managed__ int x, y = 2;

__global__ void kernel() {
    x = 10;
}

int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &y, 0, cudaMemAttachHost);
    cudaDeviceSynchronize();          // Wait for host attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>(); // Note: launches into stream1.
    y = 20;                           // Success: a kernel is running, but "y"
                                      // has been associated with no stream.
    return 0;
}
```
Here, we explicitly associate y with host accessibility, thus enabling access at all times from the CPU. (As before, note the absence of cudaDeviceSynchronize() before the access.) Accesses to y by the GPU running kernel will now produce undefined results.
Note that associating a variable with a stream does not change the association of any other variable. For example, associating x with stream1 does not ensure that only x is accessed by kernels launched in stream1, thus an error is caused by this code:
```
__device__ __managed__ int x, y = 2;

__global__ void kernel() {
    x = 10;
}

int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &x); // Associate "x" with stream1.
    cudaDeviceSynchronize();               // Wait for "x" attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>();      // Note: launches into stream1.
    y = 20;                                // ERROR: "y" is still associated globally
                                           // with all streams by default.
    return 0;
}
```
Note how the access to y will cause an error because, even though x has been associated with a stream, we have told the system nothing about who can see y. The system therefore conservatively assumes that kernel might access it and prevents the CPU from doing so.
The primary use for cudaStreamAttachMemAsync() is to enable independent task parallelism using CPU threads. Typically in such a program, a CPU thread creates its own stream for all work that it generates, because using CUDA's NULL stream would cause dependencies between threads.
The default global visibility of managed data to any GPU stream can make it difficult to avoid interactions between CPU threads in a multi-threaded program. The function cudaStreamAttachMemAsync() is therefore used to associate a thread's managed allocations with that thread's own stream, and the association is typically not changed for the life of the thread.
Such a program would simply add a single call to cudaStreamAttachMemAsync() to use unified memory for its data accesses:
```
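// transform, convert, host_process and N are assumed to be defined elsewhere.
// This function performs some task, in its own private stream.
void run_task(int *in, int *out, int length) {
    // Create a stream for us to use.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Allocate some managed data and associate it with our stream.
    // Note the use of the host-attach flag to cudaMallocManaged();
    // we then associate the allocation with our stream so that
    // our GPU kernel launches can access it.
    int *data;
    cudaMallocManaged((void **)&data, length, cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream, data);
    cudaStreamSynchronize(stream);
    // Iterate on the data in some way, using both host and device.
    for (int i = 0; i < N; i++) {
        transform<<< 100, 256, 0, stream >>>(in, data, length);
        cudaStreamSynchronize(stream);
        host_process(data, length);    // CPU uses managed data.
        convert<<< 100, 256, 0, stream >>>(out, data, length);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
}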
```
In this example, the allocation-stream association is established just once, and then data is used repeatedly by both the host and device. The result is much simpler code than occurs with explicitly copying data between host and device, although the result is the same.
In the previous example cudaMallocManaged() specifies the cudaMemAttachHost flag, which creates an allocation that is initially invisible to device-side execution. (The default allocation would be visible to all GPU kernels on all streams.) This ensures that there is no accidental interaction with another thread's execution in the interval between the data allocation and when the data is acquired for a specific stream.
Without this flag, a new allocation would be considered in use on the GPU if a kernel launched by another thread happens to be running. This might impact the thread's ability to access the newly allocated data from the CPU (for example, within a base-class constructor) before it is able to explicitly attach it to a private stream. To enable safe independence between threads, therefore, allocations should be made specifying this flag.
Note: An alternative would be to place a process-wide barrier across all threads after the allocation has been attached to the stream. This would ensure that all threads complete their data/stream associations before any kernels are launched, avoiding the hazard. A second barrier would be needed before the stream is destroyed, because stream destruction causes allocations to revert to their default visibility. The cudaMemAttachHost flag exists both to simplify this process and because it is not always possible to insert global barriers where required.
Since managed memory can be accessed from either the host or the device, cudaMemcpy*() relies on the type of transfer, specified using cudaMemcpyKind, to determine whether the data should be accessed as a host pointer or a device pointer.
If cudaMemcpyHostTo* is specified and the source data is managed, then it will be accessed from the host if it is coherently accessible from the host in the copy stream [1]; otherwise it will be accessed from the device. Similar rules apply to the destination when cudaMemcpy*ToHost is specified and the destination is managed memory.
If cudaMemcpyDeviceTo* is specified and the source data is managed, then it will be accessed from the device. The source must be coherently accessible from the device in the copy stream [2]. Similar rules apply to the destination when cudaMemcpy*ToDevice is specified and the destination is managed memory.
If cudaMemcpyDefault is specified, then managed data will be accessed from the host either if it cannot be coherently accessed from the device in the copy stream [2], or if the preferred location for the data is cudaCpuDeviceId and it can be coherently accessed from the host. Otherwise it will be accessed from the device.
When using cudaMemset*() with managed memory, the data is always accessed from the device. The data must be coherently accessible from the device in the stream being used for the cudaMemset() operation [2].
When data is accessed from the device either by cudaMemcpy* or cudaMemset*, the stream of operation is considered to be active on the GPU. During this time, any CPU access of data that is associated with that stream, or of data that has global visibility, will result in a segmentation fault if the GPU has a zero value for the device attribute concurrentManagedAccess. The program must synchronize appropriately to ensure the operation has completed before accessing any associated data from the CPU.
[1] For managed memory to be coherently accessible from the host in a given stream, at least one of the following conditions must be satisfied:
- The given stream is associated with a device that has a non-zero value for the device attribute concurrentManagedAccess.
- The memory neither has global visibility, nor is it associated with the given stream.
[2] For managed memory to be coherently accessible from the device in a given stream, at least one of the following conditions must be satisfied:
- The device has a non-zero value for the device attribute concurrentManagedAccess.
- The memory either has global visibility or is associated with the given stream.
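As an illustration of the synchronization requirement for cudaMemset*() on managed memory, the following is a minimal sketch. The function name memsetThenRead and the buffer size are hypothetical; the allocation keeps its default global visibility, so condition [2] holds for any stream.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Returns the first byte of the buffer after the memset, or -1 if a CUDA call fails.
int memsetThenRead() {
    const size_t bytes = 256;
    unsigned char *buf = nullptr;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    if (cudaMallocManaged(&buf, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1;
    }

    // The memset is performed from the device, so the stream counts as
    // active on the GPU until it has been synchronized.
    cudaError_t err = cudaMemsetAsync(buf, 0x5A, bytes, stream);

    // Synchronize before touching buf on the CPU. This is required when
    // concurrentManagedAccess is 0 and is harmless otherwise.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    int first = (err == cudaSuccess) ? static_cast<int>(buf[0]) : -1;
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return first;
}
```

On a system with a working GPU the function should return 0x5A (90); reading buf[0] before the cudaStreamSynchronize() call is exactly the access that the rules above forbid on devices without concurrent managed access.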
Users of the CUDA Runtime API who compile their host code using nvcc have access to additional language integration features, such as shared symbol names and inline kernel launch via the <<<...>>> operator. Unified Memory adds one additional element to CUDA's language integration: variables annotated with the __managed__ keyword can be referenced directly from both host and device code.
The following example, seen earlier in Simplifying GPU Programming, illustrates a simple use of __managed__ global declarations:
```
#include <cstdio>

// A managed variable declaration is an extra annotation with __device__.
__device__ __managed__ int x;

__global__ void kernel() {
    // Reference "x" directly: it is a normal variable on the GPU.
    printf("%d\n", x);
}

int main() {
    // Set "x" from host code. Note it is just a normal variable on the CPU.
    x = 1234;

    // Launch a kernel which uses "x" from the GPU.
    kernel<<< 1, 1 >>>();
    cudaDeviceSynchronize();
    return 0;
}
```
The capability available with __managed__ variables is that the symbol is available in both device code and host code without the need to dereference a pointer, and the data is shared by all. This makes it particularly easy to exchange data between host and device programs without the need for explicit allocations or copying.
Semantically, the behavior of __managed__ variables is identical to that of storage allocated via cudaMallocManaged(). See Explicit Allocation Using cudaMallocManaged() for a detailed explanation. Stream visibility defaults to cudaMemAttachGlobal, but may be constrained using cudaStreamAttachMemAsync().
A valid CUDA context is necessary for the correct operation of __managed__ variables. Accessing __managed__ variables can trigger CUDA context creation if a context for the current device has not already been created. In the example above, accessing x before the kernel launch triggers context creation on device 0. In the absence of that access, the kernel launch would have triggered context creation.
C++ objects declared as __managed__ are subject to certain specific constraints, particularly where static initializers are concerned. Please refer to C++ Language Support in the CUDA C++ Programming Guide for a list of these constraints.
Unified Memory is only supported on devices with compute capability 3.0 or higher. A program may query whether a GPU device supports managed memory by using cudaGetDeviceProperties() and checking the new managedMemory property. The capability can also be determined using the individual attribute query function cudaDeviceGetAttribute() with the attribute cudaDevAttrManagedMemory.
Either property will be set to 1 if managed memory allocations are permitted on the GPU and under the current operating system. Note that Unified Memory is not supported for 32-bit applications (unless on Android), even if a GPU is of sufficient capability.
Devices of compute capability 6.x on supporting platforms can access pageable memory without calling cudaHostRegister on it. An application can query whether the device supports coherently accessing pageable memory by checking the new pageableMemoryAccess property.
With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that the CPU and GPU can access Unified Memory allocations simultaneously. This was illegal on devices of compute capability lower than 6.x, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. A program can query concurrent access support by checking the concurrentManagedAccess property. See Coherency and Concurrency for details.
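The three capability queries mentioned above can be gathered in one place. The sketch below uses cudaDeviceGetAttribute() with the attributes cudaDevAttrManagedMemory, cudaDevAttrPageableMemoryAccess and cudaDevAttrConcurrentManagedAccess; the struct UnifiedMemorySupport and the function queryUnifiedMemorySupport are hypothetical names introduced only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical holder for the three Unified Memory capability flags.
struct UnifiedMemorySupport {
    int managedMemory;            // 1 if managed allocations are permitted
    int pageableMemoryAccess;     // 1 if pageable memory is coherently accessible
    int concurrentManagedAccess;  // 1 if CPU and GPU may access managed data concurrently
};

// Fills 'out' for the given device. Returns false if any query fails
// (for example, when no CUDA device is present).
bool queryUnifiedMemorySupport(int device, UnifiedMemorySupport *out) {
    if (cudaDeviceGetAttribute(&out->managedMemory,
                               cudaDevAttrManagedMemory, device) != cudaSuccess)
        return false;
    if (cudaDeviceGetAttribute(&out->pageableMemoryAccess,
                               cudaDevAttrPageableMemoryAccess, device) != cudaSuccess)
        return false;
    if (cudaDeviceGetAttribute(&out->concurrentManagedAccess,
                               cudaDevAttrConcurrentManagedAccess, device) != cudaSuccess)
        return false;
    return true;
}
```

A program might call queryUnifiedMemorySupport(0, &support) once at startup and, when concurrentManagedAccess is 0, fall back to the stricter synchronization rules described earlier for pre-6.x devices.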
On systems with devices of compute capability lower than 6.x, managed allocations are automatically visible to all GPUs in a system via the peer-to-peer capabilities of the GPUs.
On Linux, the managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have peer-to-peer support. If at any time the application starts using a GPU that does not have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory.
On Windows, if peer mappings are not available (for example, between GPUs of different architectures), then the system will automatically fall back to using zero-copy memory, regardless of whether both GPUs are actually used by a program. If only one GPU is actually going to be used, it is necessary to set the CUDA_VISIBLE_DEVICES environment variable before launching the program. This constrains which GPUs are visible and allows managed memory to be allocated in GPU memory.
Alternatively, on Windows users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other. The error cudaErrorInvalidDevice will be returned if a device that supports managed memory is used and it is not peer-to-peer compatible with any of the other managed-memory-supporting devices that were previously used in that process, even if cudaDeviceReset has been called on those devices. These environment variables are described in CUDA Environment Variables. Note that starting from CUDA 8.0, CUDA_MANAGED_FORCE_DEVICE_ALLOC has no effect on Linux operating systems.
In order to achieve good performance with Unified Memory, the following objectives must be met:
- Faults should be avoided: While replayable faults are fundamental to enabling a simpler programming model, they can be severely detrimental to application performance. Fault handling can take tens of microseconds because it may involve TLB invalidates, data migrations and page table updates. All the while, execution in certain portions of the application will be halted, thereby potentially impacting overall performance.
- Data should be local to the accessing processor: As mentioned before, memory access latencies and bandwidth are significantly better when the data is placed local to the processor accessing it. Therefore, data should be suitably migrated to take advantage of lower latencies and higher bandwidth.
- Memory thrashing should be prevented: If data is frequently accessed by multiple processors and has to be constantly migrated around to achieve data locality, then the overhead of migration may exceed the benefits of locality. Memory thrashing should be prevented to the extent possible. If it cannot be prevented, it must be detected and resolved appropriately.
To achieve the same level of performance as is possible without using Unified Memory, the application has to guide the Unified Memory driver subsystem into avoiding the aforementioned pitfalls. It is worth noting that the Unified Memory driver subsystem can detect common data access patterns and achieve some of these objectives automatically without application participation. But when the data access patterns are non-obvious, explicit guidance from the application is crucial. CUDA 8.0 introduces useful APIs for providing the runtime with memory usage hints (cudaMemAdvise()) and for explicit prefetching (cudaMemPrefetchAsync()). These tools allow the same capabilities as explicit memory copy and pinning APIs without reverting to the limitations of explicit GPU memory allocation.
Note: cudaMemPrefetchAsync() is not supported on Tegra devices.
Data prefetching means migrating data to a processor's memory and mapping it in that processor's page tables before the processor begins accessing that data. The intent of data prefetching is to avoid faults while also establishing data locality. This is most valuable for applications that access data primarily from a single processor at any given time. As the accessing processor changes during the lifetime of the application, the data can be prefetched accordingly to follow the execution flow of the application. Since work is launched in streams in CUDA, data prefetching is also expected to be a streamed operation, as shown in the following API:
```
cudaError_t cudaMemPrefetchAsync(const void *devPtr,
                                 size_t count,
                                 int dstDevice,
                                 cudaStream_t stream);
```
where the memory region specified by the devPtr pointer and count number of bytes, with devPtr rounded down to the nearest page boundary and count rounded up to the nearest page boundary, is migrated to dstDevice by enqueueing a migration operation in stream. Passing in cudaCpuDeviceId for dstDevice will cause data to be migrated to CPU memory.
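The page rounding can be made concrete with a little host-side arithmetic. The sketch below is only a model of one plausible reading of the rule (the start is rounded down, and the end is rounded up so that whole pages are covered); it is not the driver's implementation. PageRange and pageAlignedRange are hypothetical names, and the 4096-byte page size in the usage note is an assumption, since the actual migration granularity is system dependent.

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical description of the page-aligned region that a prefetch covers.
struct PageRange {
    std::uintptr_t begin;  // start address rounded down to a page boundary
    std::size_t bytes;     // length of the covered region, a multiple of pageSize
};

// pageSize must be a power of two.
PageRange pageAlignedRange(std::uintptr_t addr, std::size_t count, std::size_t pageSize) {
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(pageSize - 1);
    const std::uintptr_t begin = addr & mask;                          // round start down
    const std::uintptr_t end = (addr + count + pageSize - 1) & mask;   // round end up
    PageRange r;
    r.begin = begin;
    r.bytes = static_cast<std::size_t>(end - begin);
    return r;
}
```

For example, with an assumed 4096-byte page, pageAlignedRange(4660, 100, 4096) yields begin 4096 and bytes 4096, while a 200-byte request starting at address 4000 straddles a boundary and yields begin 0 and bytes 8192. The practical consequence is that prefetching a small object may also migrate unrelated data that shares its pages.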
Consider the simple code example below:
```
// init_data, use_data, mykernel, compare, N and myGpuId are assumed to be defined elsewhere.
void foo(cudaStream_t s) {
    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);                                   // execute on CPU
    cudaMemPrefetchAsync(data, N, myGpuId, s);            // prefetch to GPU
    mykernel<<<(N + 255) / 256, 256, 0, s>>>(data, N, 1, compare);  // execute on GPU
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);    // prefetch to CPU
    cudaStreamSynchronize(s);
    use_data(data, N);
    cudaFree(data);
}
```
Without performance hints the kernel mykernel will fault on first access to data, which creates the additional overhead of fault processing and generally slows down the application. By prefetching data in advance it is possible to avoid page faults and achieve better performance.
This API follows stream ordering semantics, i.e. the migration does not begin until all prior operations in the stream have completed, and any subsequent operation in the stream does not begin until the migration has completed.
Data prefetching alone is insufficient when multiple processors need to simultaneously access the same data. In such scenarios, it is useful for the application to provide hints on how the data will actually be used. The following advisory API can be used to specify data usage:
```
cudaError_t cudaMemAdvise(const void *devPtr,
                          size_t count,
                          enum cudaMemoryAdvise advice,
                          int device);
```
where advice, specified for data contained in the region starting from address devPtr and with a length of count bytes, rounded to the nearest page boundary, can take the following values:
- cudaMemAdviseSetReadMostly: This implies that the data is mostly going to be read from and only occasionally written to. This allows the driver to create read-only copies of the data in a processor's memory when that processor accesses it. Similarly, if cudaMemPrefetchAsync is called on this region, it will create a read-only copy of the data on the destination processor. When a processor writes to this data, all copies of the corresponding page are invalidated except for the one where the write occurred. The device argument is ignored for this advice. This advice allows multiple processors to simultaneously access the same data at maximal bandwidth, as illustrated in the following code snippet:
```
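// initializeData, kernel, maxDevices, maxOuterLoopIter and maxInnerLoopIter
// are assumed to be defined elsewhere.
char *dataPtr;
size_t dataSize = 4096;
// Allocate managed memory
cudaMallocManaged(&dataPtr, dataSize);
// Set the advice on the memory region
cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, 0);
int outerLoopIter = 0;
while (outerLoopIter < maxOuterLoopIter) {
    // The data is written to in the outer loop on the CPU
    initializeData(dataPtr, dataSize);
    // The data is made available to all GPUs by prefetching.
    // Prefetching here causes read duplication of data instead
    // of data migration
    for (int device = 0; device < maxDevices; device++) {
        cudaMemPrefetchAsync(dataPtr, dataSize, device, 0);
    }
    // The kernel only reads this data in the inner loop
    int innerLoopIter = 0;
    while (innerLoopIter < maxInnerLoopIter) {
        kernel<<<32, 32>>>((const char *)dataPtr);
        innerLoopIter++;
    }
    cudaDeviceSynchronize();  // finish GPU reads before the CPU writes again
    outerLoopIter++;
}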
```
- cudaMemAdviseSetPreferredLocation: This advice sets the preferred location for the data to be the memory belonging to device. Passing in a value of cudaCpuDeviceId for device sets the preferred location as CPU memory. Setting the preferred location does not cause data to migrate to that location immediately. Instead, it guides the migration policy when a fault occurs on that memory region. If the data is already in its preferred location and the faulting processor can establish a mapping without requiring the data to be migrated, then the migration will be avoided. On the other hand, if the data is not in its preferred location or if a direct mapping cannot be established, then it will be migrated to the processor accessing it. It is important to note that setting the preferred location does not prevent data prefetching done using cudaMemPrefetchAsync.
- cudaMemAdviseSetAccessedBy: This advice implies that the data will be accessed by device. This does not cause data migration and has no impact on the location of the data per se. Instead, it causes the data to always be mapped in the specified processor's page tables, as long as the location of the data permits a mapping to be established. If the data gets migrated for any reason, the mappings are updated accordingly. This advice is useful in scenarios where data locality is not important, but avoiding faults is. Consider for example a system containing multiple GPUs with peer-to-peer access enabled, where the data located on one GPU is occasionally accessed by other GPUs. In such scenarios, migrating data over to the other GPUs is not as important because the accesses are infrequent and the overhead of migration may be too high. But preventing faults can still help improve performance, and so having a mapping set up in advance is useful. Note that on CPU access of this data, the data may be migrated to CPU memory because the CPU cannot access GPU memory directly. Any GPU that had the cudaMemAdviseSetAccessedBy flag set for this data will now have its mapping updated to point to the page in CPU memory.
Each advice can also be unset by using one of the following values: cudaMemAdviseUnsetReadMostly, cudaMemAdviseUnsetPreferredLocation and cudaMemAdviseUnsetAccessedBy.
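The preferred-location and accessed-by hints, and their unset counterparts, might be combined as in the following sketch. It assumes the cudaMemAdvise() signature with an int device argument described in this section (CUDA 8.0 through 12.x), and the names adviseOwnerAndReader, ownerGpu and readerGpu are hypothetical. On devices that do not support concurrent managed access the calls are expected to fail, so the first error is simply returned to the caller.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Keeps 'data' resident on ownerGpu while pre-mapping it for readerGpu,
// then withdraws both hints. Returns the first error encountered.
cudaError_t adviseOwnerAndReader(const void *data, size_t bytes,
                                 int ownerGpu, int readerGpu) {
    // Guide the migration policy: faults should leave the data on ownerGpu.
    cudaError_t err = cudaMemAdvise(data, bytes,
                                    cudaMemAdviseSetPreferredLocation, ownerGpu);
    if (err != cudaSuccess) return err;

    // Keep the pages mapped for readerGpu so that its occasional accesses
    // do not fault; this does not move the data.
    err = cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, readerGpu);
    if (err != cudaSuccess) return err;

    // ... kernels on both GPUs would run here ...

    // When the access pattern changes, the hints can be withdrawn.
    err = cudaMemAdvise(data, bytes, cudaMemAdviseUnsetAccessedBy, readerGpu);
    if (err != cudaSuccess) return err;
    return cudaMemAdvise(data, bytes, cudaMemAdviseUnsetPreferredLocation, ownerGpu);
}
```

The data pointer is expected to come from cudaMallocManaged() or a __managed__ variable. The sketch sets hints only; an explicit cudaMemPrefetchAsync() is still needed if the data should actually be moved to ownerGpu up front.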
A program can query memory range attributes assigned through cudaMemAdvise or cudaMemPrefetchAsync by using the following API:
```
cudaError_t cudaMemRangeGetAttribute(void *data,
                                     size_t dataSize,
                                     enum cudaMemRangeAttribute attribute,
                                     const void *devPtr,
                                     size_t count);
```
This function queries an attribute of the memory range starting at devPtr with a size of count bytes. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables. It is possible to query the following attributes:
- cudaMemRangeAttributeReadMostly: the result returned will be 1 if all pages in the given memory range have read-duplication enabled, or 0 otherwise.
- cudaMemRangeAttributePreferredLocation: the result returned will be a GPU device id or cudaCpuDeviceId if all pages in the memory range have the corresponding processor as their preferred location; otherwise cudaInvalidDeviceId will be returned. An application can use this query API to decide whether to stage data through the CPU or the GPU, depending on the preferred location attribute of the managed pointer. Note that the actual location of the pages in the memory range at the time of the query may be different from the preferred location.
- cudaMemRangeAttributeAccessedBy: will return the list of devices that have that advice set for that memory range.
- cudaMemRangeAttributeLastPrefetchLocation: will return the last location to which all pages in the memory range were prefetched explicitly using cudaMemPrefetchAsync. Note that this simply returns the last location that the application requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.
Additionally, multiple attributes can be queried by using the corresponding cudaMemRangeGetAttributes function.
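A sketch of how the single-attribute query might be used for the three scalar attributes follows. RangeInfo and queryRange are hypothetical names; each of these attributes is returned as a 4-byte integer, so dataSize is sizeof(int). The AccessedBy attribute is left out because it returns an array of device ids rather than a single value.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical holder for the scalar range attributes.
struct RangeInfo {
    int readMostly;         // 1 if read-duplication is enabled on every page
    int preferredLocation;  // GPU id, cudaCpuDeviceId, or cudaInvalidDeviceId
    int lastPrefetch;       // last destination requested via cudaMemPrefetchAsync
};

// Queries the attributes of the managed range [devPtr, devPtr + count).
// Returns the first error encountered, or cudaSuccess.
cudaError_t queryRange(const void *devPtr, size_t count, RangeInfo *info) {
    cudaError_t err = cudaMemRangeGetAttribute(&info->readMostly, sizeof(int),
                                               cudaMemRangeAttributeReadMostly,
                                               devPtr, count);
    if (err != cudaSuccess) return err;
    err = cudaMemRangeGetAttribute(&info->preferredLocation, sizeof(int),
                                   cudaMemRangeAttributePreferredLocation,
                                   devPtr, count);
    if (err != cudaSuccess) return err;
    return cudaMemRangeGetAttribute(&info->lastPrefetch, sizeof(int),
                                    cudaMemRangeAttributeLastPrefetchLocation,
                                    devPtr, count);
}
```

An application could, for instance, check preferredLocation against cudaCpuDeviceId before deciding whether to stage a copy through host or device memory.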
Lazy Loading delays the loading of CUDA modules and kernels from program initialization to a point closer to kernel execution. If a program does not use every single kernel it has included, then some kernels will be loaded unnecessarily. This is very common, especially if you include any libraries. Most of the time, programs use only a small number of kernels from the libraries they include.
Thanks to Lazy Loading, programs are able to load only the kernels they are actually going to use, saving time on initialization. This reduces memory overhead, both in GPU memory and in host memory.
Lazy Loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY.
Firstly, the CUDA Runtime will no longer load all modules during program initialization, with the exception of modules containing managed variables. Each module will be loaded on first usage of a variable or a kernel from that module. This optimization is only relevant to CUDA Runtime users. This optimization shipped in CUDA 11.8.
Secondly, loading a module (the cuModuleLoad*() family of functions) will not load kernels immediately; instead, it will delay the loading of a kernel until cuModuleGetFunction() is called. There are certain exceptions here: some kernels have to be loaded during cuModuleLoad*(), such as kernels whose pointers are stored in global variables. This optimization is relevant to both CUDA Runtime and CUDA Driver users. The CUDA Runtime will only call cuModuleGetFunction() when a kernel is used or referenced for the first time. This optimization shipped in CUDA 11.7.
Both of these optimizations are designed to be invisible to the user, assuming the CUDA Programming Model is followed.
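Whether Lazy Loading is actually in effect can be checked at run time through the driver API. The sketch below assumes CUDA 11.7 or later, where cuModuleGetLoadingMode() is available, and that the program is linked against the driver library (for example with -lcuda); the function name moduleLoadingModeName is hypothetical.

```cuda
#include <cuda.h>

// Returns "lazy" or "eager" according to the driver's module loading mode,
// or "unknown" if the driver cannot be initialized or queried.
const char *moduleLoadingModeName() {
    if (cuInit(0) != CUDA_SUCCESS) return "unknown";
    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) != CUDA_SUCCESS) return "unknown";
    return (mode == CU_MODULE_LAZY_LOADING) ? "lazy" : "eager";
}
```

Running the program once with CUDA_MODULE_LOADING=LAZY set in the environment and once with it set to EAGER should make the two modes visible in the printed result of moduleLoadingModeName().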
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regard to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product could result in personal injury. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications, and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, to ensure the product is suitable and fit for the application planned by customer, and to perform the necessary testing for the application in order to avoid a default of the application. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
