There is increasing demand for deep learning technology, which can discover complex patterns in images, text, speech, and other data, and can power a new generation of applications and data analysis systems.
Many organizations use cloud computing for deep learning. Cloud systems are well suited to ingesting, storing, and processing the large data volumes deep learning requires, and to performing large-scale training of deep learning models on multiple GPUs. With cloud deep learning, you can request as many GPU machines as needed and scale up and down on demand.
Amazon Web Services (AWS) provides an extensive ecosystem of services to support deep learning applications. This article introduces the AWS value proposition for deep learning, including storage resources, fast compute instances with GPU hardware, and high-performance networking.
AWS also provides end-to-end deep learning solutions, including SageMaker and Deep Learning Containers. Read on to learn more about these solutions and more.
In this article, you will learn which AWS services provide the storage, compute, and networking resources deep learning requires; how end-to-end solutions such as Amazon SageMaker, the AWS Deep Learning AMI, Deep Learning Containers, and Elastic Inference work; and how Run:AI complements AWS infrastructure.
Any deep learning project requires three essential resources—storage, compute, and networking. Here are the Amazon services typically used to power deep learning deployments in each of these three categories.
You can use Amazon S3 to store massive amounts of data for your deep learning projects at low cost. S3 can serve as the basis for data science tasks such as data ingestion; extract, transform, and load (ETL); ad-hoc querying; and data wrangling. You can also connect data analysis and visualization tools to S3 to make sense of your data before using it in a deep learning project.
When training is performed, data is typically streamed from S3 to EBS volumes that are attached to the training machines in Amazon EC2. This provides low-latency access to data during model training.
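To make this concrete, here is a minimal sketch of staging data through S3 with boto3; the bucket name, key, and local paths are hypothetical placeholders:

```python
import boto3

# Hypothetical bucket, key, and paths; substitute your own.
BUCKET = "my-dl-training-data"
KEY = "datasets/train.tar.gz"

s3 = boto3.client("s3")

# Ingest: upload a local dataset archive to S3.
s3.upload_file("train.tar.gz", BUCKET, KEY)

# At training time: pull the archive down to an EBS volume
# attached to the EC2 training instance (mounted here at /data).
s3.download_file(BUCKET, KEY, "/data/train.tar.gz")
```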
Amazon EFS is a strong storage option for large-scale batch processing, or for cases where multiple training jobs need access to the same data. It allows developers and data scientists to access large amounts of data directly from a workstation or code repository, with elastic, virtually unlimited capacity and no network file shares to manage.
Amazon FSx for Lustre is another high-performance file system solution suitable for compute-intensive workloads like deep learning.
Amazon Elastic Fabric Adapter (EFA) is a network interface designed for high performance computing (HPC). It bypasses the operating system kernel to enable ultra-fast, low-latency communication between compute instances in large distributed computing jobs.
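As an illustrative sketch, EFA is requested when an instance is launched by setting the network interface type to efa; the AMI, subnet, and security group IDs below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Launch an EFA-enabled instance for distributed training.
# All resource IDs below are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="p4d.24xlarge",  # an EFA-capable instance type
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
        "InterfaceType": "efa",  # request an Elastic Fabric Adapter
    }],
)
```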
Neural network models typically require millions of matrix and vector operations. These operations can easily be parallelized, and this is why GPU hardware, which has a large number of cores, can provide a massive performance improvement.
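A small PyTorch sketch makes the point concrete: the same matrix multiplication runs unchanged on CPU or GPU, but on a GPU its many independent multiply-adds execute in parallel across thousands of cores:

```python
import torch

# A large matrix multiplication: millions of independent
# multiply-add operations that can run in parallel.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# Use a GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# On a GPU, the multiply-adds are spread across thousands of cores.
c = a.to(device) @ b.to(device)
print(c.device)
```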
Amazon has gone through four generations of GPU instances; the latest generation, called P4, was released in November 2020. The P2 family provides up to 16 NVIDIA K80 GPUs per instance; the P3 family provides up to 8 NVIDIA V100 GPUs connected over NVLink; and the P4 family provides 8 NVIDIA A100 GPUs with 400 Gbps instance networking, making it well suited to large-scale distributed training.
The G4 instance is a more cost-effective option offering good performance for deep learning inference applications. It comes with NVIDIA T4 Tensor Core GPUs (up to eight per instance), which deliver strong price/performance for inference workloads.
Beyond offering the building blocks for deep learning applications, Amazon also offers end-to-end deep learning solutions. We’ll cover four options.
Amazon SageMaker is a fully managed machine learning service, which enables data scientists and developers to create and train machine learning models, including deep learning architectures, and deploy them into a hosted production environment.
SageMaker provides an integrated Jupyter notebook, allowing data scientists to easily access data sources without needing to manage server infrastructure. It makes it easy to run common ML and DL algorithms, pre-optimized to run in a distributed environment.
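As a minimal sketch of that workflow using the SageMaker Python SDK (the training script, IAM role, and S3 paths are placeholders, and the framework version should match what your account supports):

```python
from sagemaker.pytorch import PyTorch

# train.py, the role ARN, and the bucket are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
)

# SageMaker provisions the instance, runs the script against the
# S3 data, and tears down the infrastructure when training ends.
estimator.fit({"training": "s3://my-bucket/datasets/"})
```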
The AWS Deep Learning AMI (DLAMI) is a custom EC2 machine image that can be used with multiple instance types, from simple CPU instances to fast GPU instances like P4. Developers and data scientists can use it to instantly set up a pre-configured DL environment on Amazon, including CUDA, cuDNN, and popular frameworks like PyTorch, TensorFlow, and Horovod.
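As an illustrative sketch with boto3, you can look up a current DLAMI by name and launch a GPU instance from it; the name filter and key pair are assumptions, so check the DLAMI release notes for exact image names:

```python
import boto3

ec2 = boto3.client("ec2")

# Find Amazon-owned Deep Learning AMIs. The name pattern is an
# assumption; consult the DLAMI release notes for exact names.
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning AMI GPU PyTorch*"]}],
)
latest = max(images["Images"], key=lambda i: i["CreationDate"])

# Launch a GPU instance from the pre-configured image.
ec2.run_instances(
    ImageId=latest["ImageId"],
    InstanceType="g4dn.xlarge",
    KeyName="my-key-pair",  # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
```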
AWS Deep Learning Containers are pre-built Docker images that provide a complete deep learning development environment. They come pre-installed with TensorFlow and PyTorch, and can be deployed on SageMaker or on Amazon container services, including EKS and ECS. Deep Learning Containers are free to use; you pay only for the Amazon resources needed to run them.
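As a sketch, the SageMaker SDK can resolve the registry URI of a pre-built Deep Learning Container for a given framework; the version, region, and instance type here are example values:

```python
from sagemaker import image_uris

# Resolve the ECR URI of a pre-built PyTorch training container.
# Version, region, and instance type are example values.
uri = image_uris.retrieve(
    framework="pytorch",
    region="us-east-1",
    version="1.13",
    py_version="py39",
    image_scope="training",
    instance_type="ml.p3.2xlarge",
)
print(uri)  # an ECR image you can run on SageMaker, EKS, or ECS
```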
Elastic Inference is a method for attaching GPU-powered acceleration to regular Amazon EC2 instances, much as you would add a GPU to a CPU-based machine. It can provide significant cost savings by allowing you to serve deep learning models from regular EC2 and SageMaker instances, which are significantly cheaper than GPU instances.
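As a sketch using the SageMaker SDK, an accelerator is attached at deployment time through the accelerator_type parameter; the model artifact, inference script, and role below are placeholders:

```python
from sagemaker.pytorch import PyTorchModel

# The model artifact, script, and role ARN are placeholders.
model = PyTorchModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="1.5",  # an EI-compatible PyTorch version
    py_version="py3",
)

# accelerator_type attaches an Elastic Inference accelerator to an
# otherwise CPU-only endpoint, avoiding a full GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",
)
```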
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.