You can train models on GPUs in the SageMaker ecosystem through two different components:

  1. You can instantiate a GPU-powered SageMaker Notebook Instance, for example p2.xlarge (NVIDIA K80) or p3.2xlarge (NVIDIA V100). This is convenient for interactive development: you have the GPU right under your notebook, can run code on it interactively, and can monitor it via nvidia-smi in a terminal tab - a great development experience. However, when you develop directly on a GPU-powered machine, there are times when you are not using the GPU, for example while writing code or browsing documentation. All that time you pay for a GPU that sits idle, so this may not be the most cost-effective option for your use case.

  2. Another option is to use a SageMaker Training Job running on a GPU instance. This is the preferred option for training, because training metadata (data and model paths, hyperparameters, cluster specification, etc.) is persisted in the SageMaker metadata store, logs and metrics are stored in CloudWatch, and the instance automatically shuts itself down at the end of training. Developing on a small CPU instance and launching training tasks with the SageMaker Training API will help you make the most of your budget, while retaining the metadata and artifacts of all your experiments. You can see a well-documented TensorFlow example here.

Answer from Olivier Cruchant on Stack Overflow
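
The training-job option (2. above) ultimately boils down to a CreateTrainingJob API request. Below is a minimal sketch of that payload; the role ARN, image URI, and S3 paths are illustrative placeholders, not real resources:

```python
# Sketch of the CreateTrainingJob request behind a SageMaker Training Job.
# (The sagemaker Python SDK's Estimator assembles this for you.)
# Role ARN, image URI, and S3 URIs below are made-up placeholders.

def build_training_job_request(job_name: str,
                               instance_type: str = "ml.p3.2xlarge") -> dict:
    """Assemble a request for a single-GPU training job."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": "arn:aws:iam::123456789012:role/MySageMakerRole",
        "AlgorithmSpecification": {
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
        # The GPU lives here, in the training cluster, not under your notebook.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Billing stops when training ends or this limit is hit.
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "HyperParameters": {"epochs": "10", "learning_rate": "0.001"},
    }
```

You would pass such a payload to the boto3 SageMaker client's `create_training_job`; with the Python SDK, the same fields map to `Estimator` arguments such as `instance_type` and `max_run`.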
SageMaker Pricing - aws.amazon.com
SageMaker AI follows a pay-as-you-go pricing model with no upfront commitments or minimum fees. The key pricing dimensions for SageMaker AI include instance usage (compute resources used in training, hosting, and notebook instances), storage (SageMaker notebooks, Amazon EBS volumes, and Amazon S3), data processing jobs, model deployment, and MLOps (SageMaker Pipelines and Model Monitor).
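
Since billing is per instance-hour, the cost gap between developing all day on a GPU notebook and developing on a CPU notebook while launching GPU training jobs is easy to sketch. The hourly rates below are illustrative placeholders, not current AWS prices:

```python
# Illustrative cost comparison; rates are placeholders, not AWS prices.
GPU_NOTEBOOK_HOURLY = 3.80   # e.g. a p3-class notebook instance
CPU_NOTEBOOK_HOURLY = 0.05   # e.g. a t3-class notebook instance
GPU_TRAINING_HOURLY = 3.80   # same GPU class, billed only while training runs

def daily_cost(dev_hours: float, train_hours: float) -> tuple[float, float]:
    """Return (all-GPU-notebook cost, CPU-notebook + training-job cost)."""
    all_gpu = GPU_NOTEBOOK_HOURLY * (dev_hours + train_hours)
    split = CPU_NOTEBOOK_HOURLY * dev_hours + GPU_TRAINING_HOURLY * train_hours
    return all_gpu, split

all_gpu, split = daily_cost(dev_hours=7, train_hours=1)
print(f"all-GPU notebook: ${all_gpu:.2f}/day, split workflow: ${split:.2f}/day")
```

At these placeholder rates, seven hours of development plus one hour of training costs $30.40/day on an always-on GPU notebook versus $4.15/day for the split workflow; the idle GPU dominates the bill.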
Discussions

AWS SageMaker on GPU - stackoverflow.com
From my understanding AWS SageMaker is the best one for the job. I managed to load the JupyterLab console on SageMaker and tried to find a GPU kernel, since I know it is best for training neural networks. However, I could not find such a kernel.
Which GPU instances are supported by the SageMaker algorithm forecasting-deepar? - repost.aws, May 25, 2022
I previously ran a hyperparameter tuning job for SageMaker DeepAR with the instance type ml.c5.18xlarge, but it seems insufficient to complete the tuning job within the max_run time specified in my account. Now I am trying an accelerated GPU instance.
GPU optimized instances only able to be launched through Sagemaker Studio? (and not as a sagemaker notebook) - r/aws, April 14, 2021
Have you tried via the CLI? What error do you get?
[deleted by user] - r/aws, July 12, 2019
Regarding this topic: spot instance support for SageMaker training is coming soon! https://www.linkedin.com/feed/update/urn:li:activity:6555355648408846336/
SageMaker Pricing - aws.amazon.com
There is no additional charge for using JumpStart models or solutions; you are charged for the underlying training and inference instance hours, the same as if you had created them manually. Amazon SageMaker Profiler collects system-level data for visualization of high-resolution CPU and GPU trace plots.
Top answer (1 of 2, score 28): the answer from Olivier Cruchant quoted above.

Second answer (2 of 2, score -2):

If you want to train your model in a SageMaker Studio notebook, make sure you choose both a GPU instance type and a GPU image type: https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html and https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html

For example, for TensorFlow on GPU, pair a GPU instance type with the TensorFlow GPU-optimized image.
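
Whichever route you pick, it is worth verifying at runtime that the kernel actually sees a GPU. A small standard-library sketch (it simply shells out to nvidia-smi and returns False on CPU-only instances):

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver/tooling not installed: CPU-only environment
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return False  # tool present but no usable GPU/driver
    # "nvidia-smi -L" prints one "GPU 0: ..." line per device.
    return "GPU" in out.stdout

print("GPU visible:", gpu_visible())
```

In a framework-specific notebook you would typically also confirm the framework itself sees the device (e.g. TensorFlow's `tf.config.list_physical_devices('GPU')`).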

Train and deploy a custom GPU-supported ML model on Amazon SageMaker - docs.aws.amazon.com, AWS Prescriptive Guidance
This pattern helps you train and build a custom GPU-supported ML model using Amazon SageMaker. It provides steps to train and deploy a custom CatBoost model built on an open-source Amazon reviews dataset, and to benchmark its performance.
GPU optimized instances only able to be launched through Sagemaker Studio? (and not as a sagemaker notebook) - r/aws, April 14, 2021

Tried posting this to the aws dev forums with no luck. I assume it's a bit of a niche issue:

Running some TensorFlow neural nets, so I've been using an accelerated computing ml.g4dn.xlarge setup. However, I am unable to launch it as a notebook instance (which would allow me to access it remotely) and am only able to launch it through SageMaker Studio, and therefore only able to use it in JupyterLab. I've tried both the GUI, where it's not in the dropdown, and the AWS CLI, which tells me it's not in the list of possible notebook types. I've also checked, and the region (eu-west-1) seems to be okay.

If anyone has any ideas of things you would look into I'd be very appreciative as I don't have access to premium support and am quite stuck on this one.

How to Use AWS SageMaker on GPU for High-Performance Machine Learning - saturncloud.io, March 12, 2024
Choose the right instance type for your machine learning workload requirements. SageMaker offers various GPU-enabled instances like p3 and g4 instances.
Economics of ML on Amazon AWS SageMaker - to GPU or ... - r/aws, July 12, 2019
For inference, DeepAR supports only CPU instances. Not all algorithms can make use of GPU; have you looked into whether DeepAR can use GPU, and the performance difference? Make sure you pay attention to the training job timeout parameter too: a lot of the SageMaker notebook examples default to something absurdly high. We naively copied code and had a bad parameter, and a p2 took all weekend instead of timing out in an hour.
Amazon SageMaker Inference now supports G6e instances - aws.amazon.com, November 22, 2024
As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G6e instances.
Ultimate AWS SageMaker pricing guide - holori.com, October 23, 2024
Training a model is one of the most resource-intensive and costly aspects. SageMaker charges based on training instance type: more powerful instances like ml.p3.16xlarge with GPUs will cost significantly more than CPU-based instances.
Instance Types for Built-in Algorithms - docs.aws.amazon.com, October 16, 2025
Most Amazon SageMaker AI algorithms have been engineered to take advantage of GPU computing for training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective.
Amazon SageMaker notebook instances - docs.aws.amazon.com
SageMaker notebook instances create the environment by initiating Jupyter servers on Amazon EC2 and providing preconfigured kernels with the following packages: the SageMaker Python SDK, AWS SDK for Python (Boto3), AWS CLI, Conda, pandas, deep learning framework libraries, and other libraries for data science and machine learning.
Amazon SageMaker geospatial capabilities now support Notebook with GPU-based Instances - aws.amazon.com, September 5, 2023
Support for GPU-based instances with the geospatial image makes it easier for data scientists and machine learning (ML) engineers to build, train, and deploy ML models using geospatial data. Customers use the geospatial image within SageMaker Studio Notebooks to develop and run end-to-end geospatial ML workloads.
Instance Types to Use in SageMaker - driv.cs.duke.edu
Notebook instances: 20; running notebook instances: 10; notebook ml.p2.xlarge instances: 10; notebook ml.p3.16xlarge instances: 5; training ml.p3.16xlarge instances: 5; ml.t2.medium: $0.0464/hour, 2 vCPUs, 4 GB main memory, no GPU.
Is Sagemaker gpu instance g5.16xlarge enough for inferencing Mixtral-8x7b-Instruct? - r/aws, March 16, 2024

I am using GPU instance "g5.16xlarge" to do inferencing for Mixtral-8x7b-Instruct in AWS SageMaker, but I am getting an error which says "try changing instance type".

Does anyone know if "g5.16xlarge" is sufficient for Mixtral-8x7b-Instruct? I noticed that in AWS SageMaker JumpStart, the only GPU instance I can select for Mixtral-8x7b-Instruct is "g5.48xlarge". Does that mean I will need at least "g5.48xlarge" to run Mixtral-8x7b-Instruct in AWS SageMaker?

Would really appreciate any input on this. Thanks heaps.
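
A rough memory estimate explains the JumpStart constraint. Mixtral-8x7B has about 46.7B total parameters (all eight experts must be resident in GPU memory, even though only two are active per token), and g5 instances use 24 GB NVIDIA A10G GPUs: one in a g5.16xlarge, eight in a g5.48xlarge. Treat the figures below as back-of-the-envelope numbers, not sizing advice:

```python
# Back-of-the-envelope weight-memory estimate for Mixtral-8x7B.
PARAMS_BILLIONS = 46.7                      # total, not active, parameters
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(dtype: str) -> float:
    """GB needed just to hold the weights (ignores KV cache and activations)."""
    return PARAMS_BILLIONS * BYTES_PER_PARAM[dtype]

G5_16XLARGE_GB = 1 * 24   # one A10G
G5_48XLARGE_GB = 8 * 24   # eight A10Gs

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_gb(dtype):.0f} GB "
          f"(fits g5.16xlarge: {weight_gb(dtype) < G5_16XLARGE_GB}, "
          f"fits g5.48xlarge: {weight_gb(dtype) < G5_48XLARGE_GB})")
```

At fp16, the weights alone (~93 GB) far exceed the single 24 GB GPU in a g5.16xlarge, which is consistent with JumpStart only offering g5.48xlarge for this model; even then, the KV cache and activations consume additional headroom.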