Thursday, February 5, 2026

AWS vs. Azure: A Deep Dive into Model Training – Part 2


In Part 1 of this series, we discussed how Azure and AWS take fundamentally different approaches to machine learning project management and data storage.

Azure ML uses a workspace-centric structure with user-level role-based access control (RBAC), where permissions are granted to individuals based on their responsibilities. In contrast, AWS SageMaker adopts a job-centric architecture that decouples user permissions from job execution, granting access at the job level through IAM roles. For data storage, Azure ML relies on datastores and data assets within workspaces to manage connections and credentials behind the scenes, whereas AWS SageMaker integrates directly with S3 buckets, requiring explicit permission grants for SageMaker execution roles to access data.


Having established how these platforms handle project setup and data access, in Part 2 we'll examine the compute resources and runtime environments that power model training jobs.

Compute

Compute is the virtual machine where your model and code run. Together with network and storage, it is one of the fundamental building blocks of cloud computing. Compute resources typically represent the largest cost component of an ML project, as training models (especially large AI models) requires long training times and often specialized compute instances (e.g., GPU instances) at higher cost. This is why Azure ML defines a dedicated AzureML Compute Operator role (see details in Part 1) for managing compute resources.

Azure and AWS offer numerous instance types that differ in the number of CPUs/GPUs, memory, and disk size and type, each designed for specific purposes. Both platforms use a pay-as-you-go pricing model, charging only for active compute time.

Azure virtual machine series are named in alphabetical order; for instance, D-family VMs are designed for general-purpose workloads and meet the requirements of most development and production environments. AWS compute instances are also grouped into families based on their purpose; for instance, the m5 family contains general-purpose instances for SageMaker ML development. The table below compares compute instances offered by Azure and AWS based on their purpose, hourly pricing, and typical use cases. (Please note that the pricing structure varies by region and plan, so I recommend checking their official websites.)

Now that we've compared compute pricing in AWS and Azure, let's explore how the two platforms differ in integrating compute resources into ML systems.

Azure ML

Azure Compute for ML

Computes are persistent resources in the Azure ML Workspace, typically created once by the AzureML Compute Operator and reused by the data science team. Since compute resources are cost-intensive, this structure allows them to be centrally managed by a role with cloud infrastructure expertise, while data scientists and engineers can focus on development work.

Azure offers a spectrum of compute target options designed for ML development and deployment, depending on the scale of the workload. A compute instance is a single-node machine suitable for interactive development and testing in the Jupyter notebook environment. A compute cluster is another type of compute target that spins up multi-node cluster machines. It can be scaled for parallel processing based on workload demand and supports auto-scaling via the parameters min_instances and max_instances. Additionally, there are serverless compute, Kubernetes clusters, and containers that fit different purposes. Here is a helpful visual summary to guide the decision based on your use case.

image from “[Explore and configure the Azure Machine Learning workspace DP-100](https://www.youtube.com/watch?v=_f5dlIvI5LQ)”
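For the single-node option, the snippet below is a minimal sketch of creating a compute instance with the Python SDK v2; the name and size shown are illustrative placeholders, not values from an actual setup.

from azure.ai.ml.entities import ComputeInstance

# A single-node compute instance for interactive development
# (name and size are illustrative placeholders)
ci = ComputeInstance(
    name="dev-instance",
    size="Standard_DS3_v2"
)
ml_client.compute.begin_create_or_update(ci)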

To create an Azure ML managed compute target, we create an AmlCompute object with the following parameters, using the code below:

  • type: use "amlcompute" for a compute cluster. Alternatively, use "computeinstance" for single-node interactive development and "kubernetes" for AKS clusters.
  • name: specify the compute target name.
  • size: specify the instance size.
  • min_instances and max_instances (optional): set the range of instances allowed to run concurrently.
  • idle_time_before_scale_down (optional): automatically shut down the compute cluster when idle to avoid incurring unnecessary costs.
from azure.ai.ml.entities import AmlCompute

# Create a compute cluster
cpu_cluster = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)

# Create or update the compute
ml_client.compute.begin_create_or_update(cpu_cluster)

Once the compute resource is created, anyone in the shared Workspace can use it by simply referencing its name in an ML job, making it easily accessible for team collaboration.

from azure.ai.ml import command

# Use the persisted compute "cpu-cluster" in the job
job = command(
    code='./src',
    command='python code.py',
    compute='cpu-cluster',
    display_name='train-custom-env',
    experiment_name='training'
)

AWS SageMaker AI

AWS Compute Instance

Compute resources are managed by a standalone AWS service, EC2 (Elastic Compute Cloud). When using these compute resources in SageMaker, developers must explicitly configure the instance type for each job; compute instances are then created on demand and terminated when the job finishes. This approach gives developers more flexibility over compute selection based on the task, but requires more infrastructure knowledge to select and manage the appropriate compute resource. For example, available instance types vary by job type: ml.t3.medium and ml.t3.large are commonly used for powering SageMaker notebooks in interactive development environments, but they are not available for training jobs, which require more powerful instance types from the m5, c5, p3, or g4dn families.

As shown in the code snippet below, AWS SageMaker specifies the compute instance and the number of instances running concurrently as job parameters. A compute instance of the ml.m5.xlarge type is created during job execution and charged based on the job runtime.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1
)

SageMaker jobs spin up on-demand instances by default. They are charged by the second and provide guaranteed capacity for running time-sensitive jobs. For jobs that can tolerate interruptions and higher latency, spot instances are a more cost-effective option that uses spare compute capacity. The downside is the additional waiting period when no spot instances are available. We use the code snippet below to enable the spot instance option for a training job.

  • use_spot_instances: set to True to use spot instances; otherwise the job defaults to on-demand instances
  • max_wait: the maximum amount of time you are willing to wait for available spot instances (waiting time is not charged)
  • max_run: the maximum amount of training time allowed for the job
  • checkpoint_s3_uri: the S3 bucket URI path for saving model checkpoints, so that training can safely restart after an interruption
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,
    checkpoint_s3_uri=""  # S3 URI placeholder for saving checkpoints
)

What does this mean in practice?

  • Azure ML: Azure's persistent compute approach enables centralized management and sharing across multiple developers, allowing data scientists to focus on model development rather than infrastructure management.
  • AWS SageMaker AI: SageMaker requires developers to explicitly define the compute instance type for each job, providing more flexibility but also demanding deeper infrastructure knowledge of instance types, costs, and availability constraints.


Environment

Environment defines where the code or job runs, including the software, operating system, program packages, Docker image, and environment variables. While compute is responsible for the underlying infrastructure and hardware choices, environment setup is crucial for ensuring consistent and reproducible behavior across development and production environments, mitigating package conflicts and dependency issues when the same code is executed in different runtime setups by different developers. Azure ML and SageMaker both support using their curated environments as well as setting up custom environments.

Azure ML

Similar to Data and Compute, Environment is considered a type of resource and asset in the Azure ML Workspace. Azure ML offers a comprehensive list of curated environments for popular Python frameworks (e.g., PyTorch, TensorFlow, scikit-learn) designed for CPU or GPU/CUDA targets.

The code snippet below retrieves the list of all curated environments in Azure ML. They generally follow a naming convention that includes the framework name, version, operating system, Python version, and compute target (CPU/GPU); e.g., AzureML-sklearn-1.0-ubuntu20.04-py38-cpu indicates scikit-learn version 1.0, running on Ubuntu 20.04 with Python 3.8 for CPU compute.

envs = ml_client.environments.list()
for env in envs:
    print(env.name)


# >>> Azure ML Curated Environments
"""
AzureML-AI-Studio-Development
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu
AzureML-ACPT-pytorch-1.12-py38-cuda11.6-gpu
AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.5-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu
AzureML-responsibleai-0.21-ubuntu20.04-py38-cpu
AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu
AzureML-tensorflow-2.5-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.6-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu
AzureML-sklearn-1.0-ubuntu20.04-py38-cpu
AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu
AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu
AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu
AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu
AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu
AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu
AzureML-Triton
AzureML-Designer-Score
AzureML-VowpalWabbit-8.8.0
AzureML-PyTorch-1.3-CPU
"""

To run the training job in a curated environment, we create an environment object by referencing its name and version, then pass it as a job parameter.

# Get a curated environment
environment = ml_client.environments.get("AzureML-sklearn-1.0-ubuntu20.04-py38-cpu", version=44)

# Use the curated environment in a job
job = command(
    code=".",
    command="python train.py",
    environment=environment,
    compute="cpu-cluster"
)

ml_client.jobs.create_or_update(job)

Alternatively, create a custom environment from a Docker image registered in Docker Hub using the code snippet below.

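The snippet below is a minimal sketch of this approach with the Python SDK v2; the environment name and the Docker Hub image reference are illustrative placeholders, not values from an actual project.

from azure.ai.ml.entities import Environment

# Register a custom environment built from a Docker Hub image
# (the image reference is an illustrative placeholder)
custom_env = Environment(
    name="custom-docker-env",
    image="docker.io/<account>/<image>:<tag>",
    description="Custom environment from a Docker Hub image"
)
ml_client.environments.create_or_update(custom_env)

# Use the custom environment in a job, just like a curated one
job = command(
    code=".",
    command="python train.py",
    environment=custom_env,
    compute="cpu-cluster"
)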

AWS SageMaker AI

SageMaker's environment configuration is tightly coupled with job definitions, offering three levels of customization to establish the OS, frameworks, and packages required for job execution. These are Built-in Algorithm, Bring Your Own Script (script mode), and Bring Your Own Container (BYOC), ranging from the easiest but most rigid option to the most complex but most customizable one.

Built-in Algorithms

AWS SageMaker Built-in Algorithm

This is the option requiring the least effort from developers to train and deploy machine learning models at scale in AWS SageMaker; Azure currently does not offer an equivalent built-in algorithm approach through the Python SDK as of February 2026.

SageMaker encapsulates the machine learning algorithm, as well as its Python library and framework dependencies, inside an estimator object. For example, here we instantiate a KMeans estimator by specifying the algorithm-specific hyperparameter k and passing the training data to fit the model. The training job then spins up an ml.m5.large compute instance, and the trained model is saved to the output location.
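Below is a minimal sketch of that flow, assuming role is a valid SageMaker execution role, train_data is a NumPy array of features, and the S3 output path is a placeholder:

import numpy as np
from sagemaker import KMeans

# Built-in KMeans estimator; role, train_data and the output path
# are assumed/placeholder values for illustration
kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://<bucket>/kmeans-output",
    k=10
)

# fit() consumes a RecordSet built from a float32 NumPy array
kmeans.fit(kmeans.record_set(train_data.astype(np.float32)))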

Bring Your Own Script

The bring your own script approach (also known as script mode or bring your own model) allows developers to leverage SageMaker's prebuilt containers for popular Python machine learning frameworks such as scikit-learn, PyTorch, and TensorFlow. It provides the flexibility of customizing the training job through your own script without the need to manage the job execution environment, making it the most popular choice when using specialized algorithms not included in SageMaker's built-in options.

In the example below, we instantiate an estimator using the scikit-learn framework by providing a custom training script train.py, the model's hyperparameters, and the framework and Python versions.

from sagemaker.sklearn import SKLearn

sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    py_version="py3",
    framework_version="1.2-1",
    script_mode=True,
    hyperparameters={"estimators": 20},
)

# Train the estimator
sk_estimator.fit({"train": training_data})

Bring Your Own Container

This is the approach with the highest level of customization, allowing developers to bring a custom environment as a Docker image. It suits scenarios that rely on unsupported Python frameworks, specialized packages, or other programming languages (e.g., R, Java, etc.). The workflow involves building a Docker image that contains all required package dependencies and model training scripts, then pushing it to the Elastic Container Registry (ECR), AWS's container registry service equivalent to Docker Hub.

In the code below, we specify the custom Docker image URI as a parameter to create the estimator, then fit the estimator with the training data.

from sagemaker.estimator import Estimator

image_uri = ":"  # ECR image URI placeholder

byoc_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="",  # S3 output path placeholder
    sagemaker_session=sess,
)

byoc_estimator.fit(training_data)

What does this mean in practice?

  • Azure ML: Provides support for running training jobs using its extensive collection of curated environments covering popular frameworks such as PyTorch, TensorFlow, and scikit-learn, as well as the ability to build and configure custom environments from Docker images for more specialized use cases. However, it is important to note that Azure ML does not currently offer a built-in algorithm approach that encapsulates and packages popular machine learning algorithms directly into the environment the way SageMaker does.
  • AWS SageMaker AI: SageMaker is known for its three levels of customization (Built-in Algorithm, Bring Your Own Script, Bring Your Own Container), which cover a spectrum of developer requirements. Built-in Algorithm and Bring Your Own Script use AWS's managed environments and integrate tightly with ML algorithms or frameworks. They offer simplicity but are less suitable for highly specialized model training processes.

In Summary

Based on the comparisons of Compute and Environment above, together with what we discussed in AWS vs. Azure: A Deep Dive into Model Training – Part 1 (Project Setup and Data Storage), we can see that the two platforms adopt different design principles to structure their machine learning ecosystems.

Azure ML follows a more modular architecture where Data, Compute, and Environment are treated as independent resources and assets within the Azure ML Workspace. Since they can be configured and managed separately, this approach is more beginner-friendly, especially for users without extensive cloud computing or permission management knowledge. For instance, a data scientist can create a training job by attaching an existing compute in the Workspace without needing infrastructure expertise to manage compute instances.

AWS SageMaker has a steeper learning curve, as multiple services are tightly coupled and orchestrated together as a holistic system for ML job execution. However, this job-centric approach offers clear separation between model training and model deployment environments, as well as the ability to run distributed training at scale. By giving developers more infrastructure control, SageMaker is well suited to large-scale data science and AI teams with high MLOps maturity and a need for CI/CD pipelines.

Take-Home Message

In this series, we compare the two most popular cloud platforms for scalable model training, Azure and AWS, breaking the comparison down into the following dimensions:

  • Project and Permission Management
  • Data Storage
  • Compute
  • Environment

In Part 1, we discussed high-level project setup and permission management, then covered storing and accessing the data required for model training.

In Part 2, we examined how Azure ML's persistent, workspace-centric compute resources differ from AWS SageMaker's on-demand, job-specific approach. Additionally, we explored environment customization options, from Azure's curated and custom environments to SageMaker's three levels of customization (Built-in Algorithm, Bring Your Own Script, Bring Your Own Container). This comparison reveals Azure ML's modular, beginner-friendly architecture versus SageMaker's integrated, job-centric design, which offers greater scalability and infrastructure control for teams with MLOps requirements.
