PyTorch GPU

Working with CUDA in PyTorch

PyTorch is an open source machine learning framework that enables you to perform scientific and tensor computations. You can use PyTorch to speed up deep learning with GPUs. PyTorch comes with a simple interface, includes dynamic computational graphs, and supports CUDA, including asynchronous execution of GPU operations.

What Is PyTorch?

PyTorch is an open source machine learning framework based on Python. It enables you to perform scientific and tensor computations with the aid of graphics processing units (GPUs). You can use it to develop and train deep neural networks using automatic differentiation (a technique that computes derivatives accurate to machine precision, at a cost that is a small constant multiple of evaluating the function itself).
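
As a minimal sketch of automatic differentiation, the snippet below builds y = x^2 + 2x and lets PyTorch compute dy/dx at x = 3. The variable names are illustrative only.

```python
import torch

# A scalar tensor that autograd should track.
x = torch.tensor(3.0, requires_grad=True)

# The graph for y is recorded dynamically as the expression runs.
y = x ** 2 + 2 * x

# Backpropagate: fills x.grad with dy/dx = 2x + 2.
y.backward()

print(x.grad)  # tensor(8.)
```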

Key features of PyTorch include:

  • Simple interface—includes an easy to use API that can be used with Python, C++, or Java.
  • Pythonic in nature—integrates smoothly with the Python data science stack and enables you to leverage Python services and functionalities.
  • Computational graphs—includes capabilities for dynamic computational graphs that you can customize during runtime.


PyTorch CUDA Support

CUDA is a parallel programming model and computing toolkit developed by NVIDIA. It enables you to perform compute-intensive operations faster by parallelizing them across the many cores of one or more GPUs. CUDA is the dominant API used for deep learning, although other options are available, such as OpenCL. PyTorch provides support for CUDA in the torch.cuda package.

Tensor creation and use

The torch.cuda package keeps track of the currently selected GPU, and any CUDA tensors you allocate are created on that device by default. After a tensor is allocated, you can perform operations on it, and the results are placed on the same device as the tensor.

By default, PyTorch does not allow cross-GPU operations. The exceptions are copy_() and other copy-like methods, such as to() and cuda(). Unless you enable peer-to-peer memory access, attempting to launch an operation on tensors spread across different devices raises an error.
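
The following sketch illustrates device placement: tensors are created on one device, the result of an operation stays on that device, and to() is used as the copy-like method for moving data. It assumes a single GPU at index 0 and falls back to the CPU when no GPU is present; the variable names are illustrative.

```python
import torch

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create tensors directly on the chosen device.
x = torch.ones(2, 3, device=device)
y = torch.full((2, 3), 2.0, device=device)

# The result of an operation lives on the same device as its operands.
z = x + y
print(z.device == x.device)  # True

# Copy-like methods such as to() are the supported way to cross devices.
z_cpu = z.to("cpu")
print(z_cpu.device)  # cpu
```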

Asynchronous execution

GPU operations are asynchronous by default to enable a larger number of computations to be performed in parallel. Asynchronous operations are generally invisible to the user because PyTorch automatically synchronizes data copied between CPU and GPU or GPU and GPU. Additionally, operations are performed in the order of queuing. This ensures that operations are executed in the same fashion as if computations were synchronous.

If you need synchronous execution, you can force it by setting the CUDA_LAUNCH_BLOCKING=1 environment variable. For example, you may want to do this when debugging errors on your GPUs. Synchronous execution ensures that errors are reported when they occur and makes it easier to identify which operation caused the error.
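
The variable is usually set in the shell that launches your script. As a hedged alternative, the sketch below sets it from Python. It assumes that no CUDA work has been done yet in the process, because the setting is read when CUDA is initialized.

```python
import os

# Assumption: this runs before any CUDA work has been done in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the variable is set

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```

Setting the variable on the command line avoids any doubt about ordering, so prefer that when you can.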

Time measurement is another case where the difference between asynchronous and synchronous execution matters. With asynchronous operations, a timer on the CPU can stop before the GPU has finished its work, so your measurements are not accurate. To work around this while leaving asynchronous execution enabled, you can call torch.cuda.synchronize() before reading the clock, or you can use torch.cuda.Event to record times.
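
Both timing approaches can be sketched as follows. The matrix size and the helper name sync are arbitrary choices for illustration, and the event-based variant only runs when a GPU is available.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(512, 512, device=device)

def sync():
    # Wait for all queued GPU work; nothing to do on the CPU.
    if device.type == "cuda":
        torch.cuda.synchronize()

# Option 1: synchronize before reading the wall clock.
sync()
start = time.perf_counter()
b = a @ a
sync()
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"wall clock: {elapsed_ms:.3f} ms")

# Option 2: CUDA events, recorded on the GPU timeline (GPU only).
if device.type == "cuda":
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    b = a @ a
    end_evt.record()
    torch.cuda.synchronize()  # make sure end_evt has completed
    print(f"CUDA events: {start_evt.elapsed_time(end_evt):.3f} ms")
```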

CUDA streams

A CUDA stream is a linear sequence of execution that belongs to a specific GPU. You normally do not need to create one yourself, because each device uses its own default stream. Within each stream, operations are serialized in the order in which they are queued. However, operations from different streams can execute concurrently in any relative order, unless you synchronize them explicitly with methods such as synchronize() or wait_stream().

Keep in mind that when the current stream is the default stream, PyTorch automatically performs the necessary synchronization when data is moved. However, if you use non-default streams, it is your responsibility to perform this synchronization.
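
The sketch below shows one way to do that synchronization by hand with wait_stream(). The stream name side and the tensor size are illustrative, and the CPU branch exists only so the example runs without a GPU.

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    side = torch.cuda.Stream()            # a non-default stream
    a = torch.ones(1024, device=device)   # produced on the default stream

    # The side stream must not start before `a` has been written.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        total = a.sum()                   # queued on the side stream

    # The default stream must wait before it consumes `total`.
    torch.cuda.current_stream().wait_stream(side)
    result = total.item()
else:
    # No GPU: streams do not apply, so compute the same value on the CPU.
    result = torch.ones(1024).sum().item()

print(result)  # 1024.0
```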

How to Use CUDA with PyTorch

There are a few basic commands you should know to get started with PyTorch and CUDA. The most basic of these let you verify that you have the required CUDA libraries and NVIDIA drivers, and that you have an available GPU to work with. You can verify this by calling torch.cuda.is_available(), which returns True when PyTorch can use a GPU.

If the call returns True, you can continue with the following operations.
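
A slightly fuller check might look like the sketch below. In addition to the availability test, it lists the visible GPUs by name. The variable count is illustrative.

```python
import torch

# device_count() returns 0 when no usable GPU is visible.
count = torch.cuda.device_count()

if torch.cuda.is_available():
    print(f"{count} CUDA device(s) found")
    for i in range(count):
        print(i, torch.cuda.get_device_name(i))
    print("current device:", torch.cuda.current_device())
else:
    print("CUDA is not available; falling back to the CPU")
```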

Moving tensors with the to() function

Every tensor has a to() method. It returns the tensor on the device you specify, either CPU or GPU, and copies the data only if the tensor is not already there. You pass it a torch.device object (or an equivalent string), which can be one of the following:

cpu
cuda:{index of GPU}

A newly created tensor is placed on the CPU by default. You can then move it to a GPU when you need to speed up calculations. The following code block shows how to do this.

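import torch

# Use the first GPU if one is available; otherwise stay on the CPU.
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)
a = torch.zeros(4, 3)
a = a.to(device)  # to() returns the moved tensor, so reassign it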

Moving tensors with the cuda() function

You can also use cuda() to place tensors on a GPU. This method takes an optional argument giving the index of the GPU you want to use; if you omit it, the current CUDA device is used (GPU 0 unless you have changed it). Modules offer the same cuda() and to() methods, so you can place your entire network on a single device. The example below does this with to(); calling clf.cuda(0) has the same effect.

clf = myNetwork()  # myNetwork stands for your own torch.nn.Module subclass
clf.to(torch.device("cuda:0"))
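
For a version you can run as written, the sketch below uses a hypothetical TinyNet class as a stand-in for your own network and falls back to the CPU when no GPU is present. The layer sizes and batch shape are arbitrary.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical stand-in for your own network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Moving the module moves all of its parameters and buffers.
clf = TinyNet().to(device)

# Inputs must live on the same device as the model.
batch = torch.randn(8, 4, device=device)
out = clf(batch)

print(out.shape)  # torch.Size([8, 2])
print(next(clf.parameters()).device == batch.device)  # True
```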

Make sure to use the same device for tensors

Although it is useful to be able to specify which GPUs to use for your tensors, you do not want to move all of your tensors manually. Instead, try to create new tensors directly on the device that already holds the data they will be used with. This helps prevent cross-device transfers and the time these transfers cost.

To place new tensors automatically, you can use the Tensor.get_device() method. For a CUDA tensor it returns the index of the GPU that holds the tensor; for a CPU tensor, recent PyTorch versions return -1. You can then use this index to direct the placement of new tensors. The following code shows how this method is used.

# Make sure t2 is on the same device as t1 (t1 must be a CUDA tensor)

dev = t1.get_device()
t2 = torch.zeros(t1.shape).to(dev)
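
If you also need the code to work for CPU tensors, a device-agnostic variant is to read the device attribute of the existing tensor, or to use the *_like and new_* helpers. The sketch below shows three such options; the tensor names are illustrative.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
t1 = torch.randn(4, 3, device=device)

# Option 1: pass the existing tensor's device to the factory function.
t2 = torch.zeros(t1.shape, device=t1.device)

# Option 2: *_like functions copy shape, dtype, and device from t1.
t3 = torch.ones_like(t1)

# Option 3: new_* methods create a tensor with t1's dtype and device.
t4 = t1.new_zeros(2, 2)

print(t2.device == t1.device, t3.device == t1.device, t4.device == t1.device)
# True True True
```

These options create the tensor on the target device directly, which avoids allocating it on the CPU first and then copying it.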

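Another option is to call torch.cuda.set_device() to choose the default GPU. Tensors that you then move with cuda(), without passing an index, are placed on that device.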

torch.cuda.set_device({GPU ID})
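
The sketch below shows set_device() next to the torch.cuda.device context manager, which limits the device selection to a block of code. It assumes a GPU at index 0; on a machine without a GPU it does nothing.

```python
import torch

if torch.cuda.is_available():
    torch.cuda.set_device(0)
    a = torch.zeros(2, 2).cuda()     # lands on GPU 0, the current device

    with torch.cuda.device(0):       # scoped alternative to set_device
        b = torch.ones(2, 2, device="cuda")

    same_device = a.device == b.device
else:
    same_device = True  # nothing to select on a CPU-only machine

print(same_device)  # True
```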

Simplified PyTorch GPU Management With Run:AI

Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:AI:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:AI GPU virtualization platform.