A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning. Google offers TPUs on demand, as a cloud deep learning service called Cloud TPU.
Cloud TPU is tightly integrated with TensorFlow, Google’s open source machine learning (ML) framework. You can use dedicated TensorFlow APIs to run workloads on TPU hardware. Cloud TPU lets you create clusters of TensorFlow computing units, which can also include CPUs and regular graphics processing units (GPUs).
Cloud TPU is optimized for the following scenarios:
Cloud TPUs are not recommended for these scenarios:
Each TPU core has three types of processing units: a scalar unit, a vector unit, and a matrix multiply unit (MXU).
MXUs use bfloat16, a 16-bit floating-point format that keeps the same exponent range as 32-bit floats. This gives machine learning model calculations a wider dynamic range, and typically more stable training, than the traditional IEEE half-precision (fp16) representation.
Each core in a TPU device can perform calculations (known as XLA operations) independently. High-bandwidth interconnects enable the chips to communicate with each other directly.
Cloud TPU offers two deployment options:
A TPU version specifies the hardware characteristics of the device. For example, each TPU v2 core has 8 GB of high-bandwidth memory and one MXU, while each TPU v3 core has 16 GB of high-bandwidth memory and two MXUs.
Google has announced the launch of a fourth-generation TPU ASIC, called TPU v4, which provides more than double the matrix multiplication capacity of v3, greatly improved memory bandwidth, and improved interconnect technology. In MLPerf training benchmarks run at a scale similar to previous competitions, TPU v4 outperformed TPU v3 by an average of 2.7x. Full details of TPU v4 are expected to be released soon.
Here are a few best practices you can use to get the most out of TPU resources on Google Cloud.
Accelerated Linear Algebra (XLA) is a machine learning compiler that can generate executable binaries for TPU, CPU, GPU and other hardware platforms. XLA comes with TensorFlow’s standard codebase. Cloud TPU TensorFlow models are converted to XLA graphs, and XLA graphs are compiled into TPU executables.
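On Cloud TPU this compilation happens automatically, but XLA can also be triggered explicitly in TensorFlow with `jit_compile=True`. The minimal sketch below (function name, shapes, and values are illustrative) shows the kind of graph XLA fuses into a single optimized executable:

```python
import tensorflow as tf

# Requesting XLA compilation explicitly; on Cloud TPU, TensorFlow
# programs are converted to XLA graphs and compiled automatically.
@tf.function(jit_compile=True)
def dense_layer(x, w, b):
    # A matmul, bias add, and activation -- operations XLA can fuse
    # into one executable instead of running them one by one.
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.ones((8, 128))
w = tf.ones((128, 128))
b = tf.zeros((128,))
y = dense_layer(x, w, b)
print(y.shape)  # (8, 128)
```

The same decorator works on CPU and GPU builds of TensorFlow that include XLA, which makes it a convenient way to test TPU-bound code locally.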
The hardware used for Cloud TPU is distinctly different from that used for CPUs and GPUs. At a high level, a CPU runs a few high-performance threads, while a GPU runs many threads, each with relatively low per-thread performance. By contrast, a Cloud TPU with a 128 x 128 matrix unit runs one very powerful thread capable of 16K operations per cycle, backed by 128 x 128 tiny threads connected in the form of a pipeline.
Therefore, when addressing memory on a TPU, prefer to use multiples of 8 (floating point), and when running matrix operations, use multiples of 128.
Here is how to resolve two common problems when training models on a TPU:
Data preprocessing takes too long
The TensorFlow TPU software stack lets CPUs perform complex data preprocessing before sending the data to the TPU. However, TPUs are incredibly fast, and complex input data processing can quickly become a bottleneck.
Google provides a Cloud TPU analysis tool, which lets you measure whether input processing is causing a bottleneck. In that case, you can look for optimizations, such as performing specific preprocessing operations offline as a one-time step, to avoid the slowdown.
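A common way to keep preprocessing off the critical path is TensorFlow's `tf.data` API, which runs map functions in parallel on the CPU and prefetches batches so the accelerator is never left waiting. A sketch, using a synthetic dataset and an illustrative `preprocess` function:

```python
import tensorflow as tf

def preprocess(x):
    # Stand-in for expensive per-example preprocessing.
    return tf.cast(x, tf.float32) / 255.0

ds = (
    tf.data.Dataset.range(1024)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU work
    .batch(128, drop_remainder=True)  # static batch shapes suit XLA/TPU
    .prefetch(tf.data.AUTOTUNE)       # overlap input pipeline with training
)

for batch in ds.take(1):
    print(batch.shape)  # (128,)
```

`drop_remainder=True` matters on TPUs because the compiler specializes the program to fixed shapes; a ragged final batch would trigger recompilation.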
Sharding makes batch size too small
Your model batch size is automatically sharded, or split, between the 8 cores on the TPU. For example, if your batch size is 128, the true batch size running on each TPU core is 16, which utilizes each core at only a fraction of its capacity.
To optimally use memory on the TPU, use the largest batch size that, when divided by 8, still fits in each core's memory. Batch sizes should always be divisible by 128, because a TPU uses 128 x 128 memory cells for processing.
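The sharding arithmetic above can be sketched in a few lines of plain Python (the helper name is illustrative):

```python
NUM_CORES = 8  # a single Cloud TPU device has 8 cores

def per_core_batch(global_batch):
    """Batch size each TPU core actually sees after automatic sharding."""
    assert global_batch % NUM_CORES == 0, "global batch must split evenly"
    return global_batch // NUM_CORES

print(per_core_batch(128))   # 16 -- likely too small to fill the MXU
print(per_core_batch(1024))  # 128 -- matches the 128 x 128 matrix unit
```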
Cloud TPU arrays are padded (or “tiled”), filling one dimension to the nearest multiple of 8, and the other dimension to a multiple of 128. The XLA compiler uses heuristics to arrange the data in an efficient manner, but this can sometimes go wrong. Try different model configurations to see which gives you the best performance.
Take into account memory that is wasted on padding. To make the most efficient use of TPUs, structure your model dimension sizes to fit the dimensions expected by the TPU, to minimize tiling and wasted memory overhead.
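The padding waste described above can be estimated with a short helper that applies the multiples-of-8 and multiples-of-128 tiling rules. This is a sketch under those stated assumptions; the function names are illustrative, and the actual layout the XLA compiler chooses may differ:

```python
def padded_shape(rows, cols):
    """Shape a 2-D array would be tiled to: rows padded to a multiple
    of 8, columns to a multiple of 128."""
    pad = lambda n, m: ((n + m - 1) // m) * m
    return pad(rows, 8), pad(cols, 128)

def padding_waste(rows, cols):
    """Fraction of the padded memory footprint holding no real data."""
    pr, pc = padded_shape(rows, cols)
    return 1 - (rows * cols) / (pr * pc)

# A 100 x 200 array is tiled up to 104 x 256 -- roughly 25% wasted memory.
print(padded_shape(100, 200))               # (104, 256)
print(round(padding_waste(100, 200), 3))    # 0.249
```

Running such an estimate over your model's layer dimensions is a quick way to spot shapes that tile poorly before committing TPU time to them.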
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed, using GPU and CPU hardware.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.