Deep learning is at the center of most artificial intelligence initiatives. It is based on the concept of a deep neural network, which passes inputs through multiple layers of connections. Neural networks can perform many complex cognitive tasks, improving performance dramatically compared to classical machine learning algorithms. However, they often require huge data volumes to train, and can be very computationally intensive.
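To make "multiple layers of connections" concrete, here is a minimal sketch of a forward pass through a tiny fully connected network, written in plain Python. The layer sizes, the weights, and the function names (dense, relu, forward) are hypothetical and chosen only for illustration. Real deep learning models have far more parameters and are built with a framework rather than by hand.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Squash a single value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def forward(inputs, layers):
    """Pass inputs through every layer: ReLU on hidden layers, sigmoid on the last one."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return [sigmoid(v) for v in dense(activations, weights, biases)]


# Hypothetical, hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
LAYERS = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]

prediction = forward([1.0, 0.0], LAYERS)
```

Training consists of adjusting the weights and biases so that predictions like this one match labeled examples, which is the part that demands large datasets and heavy compute.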
Cloud computing services are helping make deep learning more accessible, making it easier to manage large datasets and train algorithms on distributed hardware.
Cloud services are an enabler for deep learning in four respects:
Let’s briefly review the deep learning offerings of major cloud providers—Amazon, Google Cloud, and Microsoft Azure.
In each of these clouds, it is possible to run deep learning workloads in a “do it yourself” model. This involves selecting machine images that come pre-installed with deep learning infrastructure, and running them in an infrastructure as a service (IaaS) model, for example as Amazon EC2 instances or Google Compute Engine VMs.
All the cloud providers we review below offer compute instances suitable for deep learning models, which provide specialized hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and Tensor Processing Units (TPUs). To learn about the compute options offered by each cloud provider, refer to our articles about:
Below, we focus on the platform as a service (PaaS) offering each cloud provides for deep learning users. These PaaS offerings provide the hardware needed for deep learning workloads, as well as software services for managing deep learning pipelines, from data ingestion to production deployment and real-world inference.
Amazon Web Services provides the SageMaker service, which lets you build and manage machine learning models on the cloud, with a focus on deep learning.
Learn more in our guide to AWS deep learning
Google's set of machine learning services, together called Cloud AI, includes general purpose and dedicated services for specific use cases:
Azure Machine Learning is a complete environment for training, deploying, and managing machine learning models.
Key features of Azure Machine Learning:
Learn more in our guide to Azure deep learning
Here are a few key considerations when selecting your cloud-based deep learning service.
Data preparation can be one of the heaviest and most sensitive parts of a deep learning project. There are two common ways to prepare large volumes of data for analytics, which are also used to create deep learning datasets from raw data:
Check which data services are provided by your cloud vendor and whether they support ETL, ELT, or both. Understand which data storage, database or data warehouse services you will use, and how they can make data preparation easier.
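The difference between the two approaches comes down to where the transformation runs. The sketch below illustrates this with Python's built-in sqlite3 module standing in for a cloud data warehouse. The table names, the sample rows, and the cleaning rules are assumptions made for the example, not part of any vendor's service.

```python
import sqlite3

# Hypothetical raw records: a label and a score, both stored as untidy text.
RAW_ROWS = [("  Cat ", "0.91"), ("dog", "0.40"), ("CAT", "not-a-number")]


def clean(label, score):
    """Normalize one raw record; return None if the score is unusable."""
    try:
        return label.strip().lower(), float(score)
    except ValueError:
        return None


def run_etl(conn, rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn.execute("CREATE TABLE etl_samples (label TEXT, score REAL)")
    cleaned = [r for r in (clean(*row) for row in rows) if r is not None]
    conn.executemany("INSERT INTO etl_samples VALUES (?, ?)", cleaned)


def run_elt(conn, rows):
    """ELT: load raw rows as-is, then transform inside the data store with SQL."""
    conn.execute("CREATE TABLE raw_samples (label TEXT, score TEXT)")
    conn.executemany("INSERT INTO raw_samples VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE elt_samples AS
        SELECT lower(trim(label)) AS label, CAST(score AS REAL) AS score
        FROM raw_samples
        WHERE score GLOB '[0-9]*.[0-9]*'
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_ROWS)
run_elt(conn, RAW_ROWS)

etl_result = conn.execute(
    "SELECT label, score FROM etl_samples ORDER BY label"
).fetchall()
elt_result = conn.execute(
    "SELECT label, score FROM elt_samples ORDER BY label"
).fetchall()
```

Both paths end with the same cleaned table. With ELT the raw data also stays in the store, so it can be re-transformed later, at the cost of storing and processing it there.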
Data scientists typically start by developing a model on a local notebook, but it is not feasible to train most deep learning models on a local workstation. A key capability of a cloud deep learning service is the ability to integrate with notebooks and push training jobs seamlessly to cloud-based compute instances.
Evaluate how easy it is to run training jobs on hardware like GPUs, TPUs, and FPGAs, to manage these jobs across data science teams, and to visualize and interpret their results.
Each cloud machine learning service supports different frameworks. You can typically get the broadest framework support in an IaaS model, when deploying deep learning directly on compute instances. However, if you use a full MLOps platform, you will be limited to the frameworks it supports.
Look for support of the following frameworks, which your data science team may need to use now or in the future:
Also evaluate the ability to integrate your own code and algorithms with the platform’s library of built-in algorithms. This can improve productivity, because you can draw on existing building blocks and only develop unique aspects of your model.
Most cloud platforms provide pre-trained, pre-optimized AI services for many applications including:
The advantage of these types of services is that they have been trained on massive data volumes that are not available to individual companies. They can provide very high accuracy for general use cases, and provide excellent performance and low latency in production. Best of all, they are ready to use out of the box.
Deploying a model is only the start, not the end point, of your AI journey. Data changes and user requirements change, and it is essential to monitor a model’s performance over time, tune it, augment it, and if necessary, replace it. Evaluate the tools a cloud service provides for monitoring model performance when it is already in production, and how easy it is to release updates and improvements to live deep learning models.
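As a rough illustration of what monitoring a production model involves, the sketch below tracks accuracy over a sliding window of recent predictions and flags the model when accuracy falls below a threshold. The class name, window size, and threshold are hypothetical. Managed cloud services offer their own monitoring tools, which this does not attempt to reproduce.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions that have known outcomes."""

    def __init__(self, window_size=100, min_accuracy=0.9):
        # A deque with maxlen silently drops the oldest outcome once it is full.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one production prediction matched the true label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """Flag the model only once the window is full and accuracy is too low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Small window so the behavior is easy to follow.
monitor = AccuracyMonitor(window_size=4, min_accuracy=0.75)
for _ in range(4):
    monitor.record("cat", "cat")  # four correct predictions
healthy = monitor.needs_attention()

monitor.record("cat", "dog")  # data changes: two wrong predictions arrive
monitor.record("dog", "cat")
degraded = monitor.needs_attention()
```

A flag like this would typically trigger retraining, tuning, or a rollback, which is why the ease of releasing updates to live models matters as much as the monitoring itself.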
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.
There’s a lot more to learn about cloud deep learning. To continue your research, take a look at the rest of our blogs on this topic:
AWS Deep Learning: Choosing the Best Option for You
Amazon Web Services (AWS) is a cloud computing pioneer providing a wide range of scalable, affordable, and innovative cloud services, including a dedicated solution for deep learning. AWS offers a fully managed machine learning service called SageMaker, as well as the AWS Deep Learning AMI (DLAMI), which is a custom EC2 machine image, and deep learning containers.
This article explains in detail the various deep learning services offered by AWS, and how to leverage AWS technology for training deep learning models.
Azure Machine Learning: From Basic ML to Distributed Deep Learning Models
Microsoft Azure is a top cloud computing vendor offering many enterprise-grade services, including a dedicated solution for machine learning and deep learning, called Azure Machine Learning (Azure ML). Azure ML leverages virtual machines (VMs), datasets, datastores, code models, and deployment environments to enable effective training of deep learning models.
This article explains how Azure ML works, and how to perform distributed training of deep learning models on Azure.
Google TPU: Architecture and Performance Best Practices
Google provides cloud computing services, including dedicated solutions for artificial intelligence (AI), machine learning, and deep learning. Google has long been considered a pioneer and innovator in AI and software development, creating solutions that are adopted worldwide. Tensor Processing Units (TPUs) are another Google innovation, created to help accelerate machine learning.
This article explains what a TPU is and how the technology works, and explores key best practices for optimal cloud TPU performance.
Google Cloud GPU: The Basics and a Quick Tutorial
Google Cloud Platform (GCP) is the world’s third-largest cloud provider. Google offers a number of virtual machines (VMs) that provide graphics processing units (GPUs), including the NVIDIA Tesla K80, P4, T4, P100, and V100.
Learn about Google Cloud GPU and TPU options, and learn how to set up a compute instance with an attached GPU in a few easy steps.
Triton Inference Server: The Basics and a Quick Tutorial
NVIDIA’s open-source Triton Inference Server offers backend support for most machine learning (ML) frameworks, as well as custom C++ and Python backends. This reduces the need for multiple inference servers for different frameworks and allows you to simplify your machine learning infrastructure.
Learn about the NVIDIA Triton Inference Server, including its key features, models and model repositories, and client libraries, and get started with a quick tutorial.
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of IaaS.
Authored by NetApp
Learn about cloud migration and what major challenges to expect when implementing a cloud migration strategy in your organization.
Authored by Lumigo
Learn about the AWS serverless ecosystem and its services, understand the core functionality of AWS Lambda, and discover AWS Lambda monitoring capabilities.
Authored by Spot.io
Learn about financial and economic aspects of cloud computing, how to optimize your cloud costs, and strategies for getting a better return on your cloud investments.