Data scientists use machine learning (ML) to train algorithms that predict future behavior, outcomes, and trends from data. ML enables computers to learn without being explicitly programmed.
Azure ML is a cloud service that supports all types of ML, including traditional supervised and unsupervised models and newer deep learning (DL) techniques. The Azure Machine Learning service provides a few ways to work with ML models:
This is part of our series of articles about cloud deep learning.
The top-level entity in Azure Machine Learning is a workspace. It contains everything you need to work with machine learning models in Azure:
The workspace also uses other Azure resources.
Azure Machine Learning provides two types of fully managed virtual machines (VMs) configured for machine learning jobs.
Azure Machine Learning provides an additional entity called a dataset, which makes your ML data easy to access and use. When you create a dataset, you provide a reference to your source data along with a copy of its metadata. You do not need to copy your data into Azure Machine Learning; you only point to it, which saves storage costs and improves security.
Datasets connect securely to Azure storage through an entity called a datastore. The datastore holds the connection information and allows the dataset to reach your original data, wherever it is located. The datastore retrieves secrets and credentials from the Azure Key Vault instance that is part of the workspace. This integrated setup lets you access storage securely without writing scripts, managing complex configuration, or performing any manual action.
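The relationship between a dataset, a datastore, and the key vault can be sketched in plain Python. This is an illustration of the idea only, not the Azure Machine Learning SDK: the `Datastore` and `Dataset` classes, the `resolve` method, the dictionary standing in for Key Vault, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class Datastore:
    """Hypothetical stand-in for a datastore: connection info only, no data."""
    name: str
    account_url: str   # where the original data lives
    secret_name: str   # name under which the credential is kept in the vault


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a dataset: a pointer plus a copy of metadata."""
    name: str
    datastore: Datastore
    relative_path: str
    metadata: dict = field(default_factory=dict)

    def resolve(self, vault: dict) -> Tuple[str, str]:
        """Return (full URL, credential); the secret is looked up at access time."""
        credential = vault[self.datastore.secret_name]
        url = f"{self.datastore.account_url.rstrip('/')}/{self.relative_path}"
        return url, credential


# A plain dict plays the role of the workspace's Key Vault in this sketch.
vault = {"images-key": "s3cr3t"}
store = Datastore("images", "https://example.invalid/container/", "images-key")
train = Dataset("train-images", store, "train/", {"format": "jpeg"})

url, credential = train.resolve(vault)
print(url)
```

Note that the dataset object never holds the data or the credential itself; it stores only a path and metadata, and the secret stays in the vault until the data is actually read.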
In Azure Machine Learning, a model is simply code that accepts data as input and returns outputs. Models can be added to the system in two ways:
Any model in the workspace can be deployed for production use as a service endpoint. This requires three components:
Thus, Azure Machine Learning lets you set up a complete machine learning environment, including compute resources, datasets, models, and endpoints that can help you deploy a model in production for external applications or end users.
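To make the idea of a model served behind an endpoint more concrete, here is a minimal sketch of a scoring script. Azure Machine Learning entry scripts conventionally define an `init()` function that runs once at startup and a `run()` function that handles each request; everything else below is an assumption made for illustration, including the toy model, the `"data"` payload key, and the response shape.

```python
import json

model = None  # populated once by init()


def _toy_model(rows):
    """Stand-in for a trained model: returns the sum of each input row."""
    return [sum(row) for row in rows]


def init():
    """Called once when the service starts: load the model into memory."""
    global model
    # A real script would deserialize a registered model file here.
    model = _toy_model


def run(raw_data):
    """Called per request: parse the JSON payload and return predictions."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": model(rows)}
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed input is reported to the caller instead of crashing the service.
        return {"error": str(exc)}


# Local smoke test that mimics how the hosting service would call the script.
init()
response = run(json.dumps({"data": [[1, 2, 3], [4, 5, 6]]}))
print(response)
```

Loading the model in `init()` rather than in `run()` keeps per-request latency low, because deserialization happens only once per service instance.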
Azure Machine Learning can also be used to train large-scale deep learning models. Below is a reference architecture provided by Microsoft, which shows how to distribute deep learning jobs across VM clusters with GPU support. The reference architecture refers to an image classification model, but it can be used for many other deep learning use cases.
The architecture consists of four key components:
Azure offers four types of virtual machines that support GPUs and are suitable for training DL algorithms. It is best to start with a single instance and check whether it handles your load with sufficient training performance. If it does not, scale out to a cluster of instances.
The four GPU-enabled VM series are NC, ND, NCv2, and NCv3. They provide successively more powerful NVIDIA GPUs: K80, P40, P100, and V100, respectively. See the official documentation for Azure GPU instances.
Because of network overhead, the efficiency of distributed training is always below 100%: a cluster of N machines never delivers a full N-fold speedup. The main bottleneck is device-to-device synchronization. Distributed training is therefore best suited to large models that cannot be trained on a single VM and must be split across several machines.
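One common way to quantify this overhead is scaling efficiency: the throughput measured on the cluster divided by the ideal throughput of the same number of independent workers. The sketch below assumes you have measured throughput (for example, images per second) on one node and on the cluster; the function names and the numbers are hypothetical.

```python
def scaling_efficiency(single_throughput, cluster_throughput, workers):
    """Fraction of ideal linear speedup actually achieved (1.0 = perfect)."""
    if workers < 1 or single_throughput <= 0:
        raise ValueError("workers must be >= 1 and single_throughput > 0")
    return cluster_throughput / (single_throughput * workers)


def training_hours(single_node_hours, workers, efficiency):
    """Estimated wall-clock time on `workers` nodes at a given efficiency."""
    return single_node_hours / (workers * efficiency)


# Hypothetical measurements: 1,000 images/s on one node, 3,400 images/s on four.
eff = scaling_efficiency(single_throughput=1000.0,
                         cluster_throughput=3400.0,
                         workers=4)
hours = training_hours(single_node_hours=40.0, workers=4, efficiency=eff)
print(f"efficiency={eff:.2f}, estimated hours={hours:.1f}")
```

With these example figures, four nodes reach 85% efficiency, so a 40-hour single-node job takes roughly 11.8 hours rather than the ideal 10.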
When training a deep learning model, you need to ensure that the model has high-performance access to the dataset. Even on a fast GPU instance, slow storage leaves the GPU waiting for data and slows training down.
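A quick back-of-the-envelope check can show whether storage is likely to be the limiting factor. The sketch below multiplies the rate at which the GPU consumes samples by the average sample size and compares the result with the bandwidth the storage can sustain. The function names and all figures are hypothetical.

```python
def required_read_mb_per_s(samples_per_s, avg_sample_mb):
    """Bandwidth storage must sustain so the GPU never waits for input."""
    return samples_per_s * avg_sample_mb


def is_input_bound(samples_per_s, avg_sample_mb, storage_mb_per_s):
    """True when storage delivers less data than the GPU can consume."""
    return required_read_mb_per_s(samples_per_s, avg_sample_mb) > storage_mb_per_s


# Hypothetical figures: a GPU that processes 2,000 images/s, 0.11 MB per image,
# and a storage tier that sustains 150 MB/s.
needed = required_read_mb_per_s(2000, 0.11)
stalled = is_input_bound(2000, 0.11, storage_mb_per_s=150)
print(f"needed={needed:.0f} MB/s, input-bound={stalled}")
```

In this example the GPU needs about 220 MB/s but storage supplies only 150 MB/s, so training would be limited by data access rather than by compute.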
This is why the reference architecture recommends using two measures to improve data access performance:
To summarize, using the reference architecture in the figure above, you can run large-scale, distributed deep learning jobs on Azure with high performance, on a fully managed infrastructure that takes care of compute, storage, deployment and monitoring.
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.