Hi clouduser, if you are looking to set up something similar with just SageMaker and the AWS CLI, here is an article that shows how you can directly set up a training job using a Hugging Face model and Amazon SageMaker. Here is another example setup with PyTorch Training Jobs, and here is another example setup with TensorFlow Training Jobs. I would recommend following one of these three blogs to set up your Amazon SageMaker training job, based on which model you decide to go with. Answer from Autrin Abdi on repost.aws
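As a rough illustration of the SDK route those posts describe, here is a minimal sketch using the SageMaker Python SDK's HuggingFace estimator; the script name, S3 paths, role ARN, and framework versions below are placeholders, not values taken from the linked articles.

    from sagemaker.huggingface import HuggingFace

    # Placeholder role, script, and versions - adjust to your account and region.
    huggingface_estimator = HuggingFace(
        entry_point="train.py",                 # your Transformers training script
        source_dir="./scripts",                 # directory containing train.py
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
        hyperparameters={"epochs": 3, "model_name": "distilbert-base-uncased"},
    )

    # "train" becomes a channel mounted at /opt/ml/input/data/train inside the container.
    huggingface_estimator.fit({"train": "s3://my-bucket/train"})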
🌐
YouTube
youtube.com › watch
How to create Machine Learning training job in SageMaker using AWS Console - YouTube
In this video we would learn how to create training job in SageMaker using AWS Console. We will use our own created ML algorithm docker image from ECR(Elasti...
Published   August 23, 2020
🌐
Readthedocs
sagemaker-examples.readthedocs.io › en › latest › sagemaker-debugger › build_your_own_container_with_debugger › debugger_byoc.html
Build a Custom Training Container and Debug Training Jobs with Amazon SageMaker Debugger — Amazon SageMaker Examples 1.0.0 documentation
Note: If you want to revisit tensor data from a previous training job that has already finished, you can retrieve it by specifying the exact S3 bucket location. The S3 bucket path is configured in a similar way to the following sample: trial="s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-mnist-byoc-tf2-2020-08-27-05-49-34-037/debug-output". ... The following cell retrieves the loss tensor from training and evaluation mode and plots the loss curves. In this notebook example, the dataset was CIFAR-10, which is divided into 50,000 32x32 color training images and 10,000 test images, labeled over 10 categories.
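A short sketch of that retrieval step with the smdebug library, assuming a finished job whose debug output lives at a placeholder S3 path; the tensor name "loss" depends on how the hook collections were configured.

    from smdebug import modes
    from smdebug.trials import create_trial

    # Placeholder path following the pattern shown above.
    trial = create_trial("s3://sagemaker-us-east-1-111122223333/<training-job-name>/debug-output")

    print(trial.tensor_names(collection="losses"))   # list the recorded loss tensors

    loss = trial.tensor("loss")                      # name depends on the framework/collection config
    train_loss = [loss.value(s, mode=modes.TRAIN) for s in loss.steps(mode=modes.TRAIN)]
    eval_loss = [loss.value(s, mode=modes.EVAL) for s in loss.steps(mode=modes.EVAL)]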
Discussions

How to set up a training job in SageMaker?
How can I set up something similar with just SageMaker and the AWS CLI? (Sample code below is from the example.) In the example, it uses the distilbert-base-uncased model, which is loaded via this code -> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased"). Where does the model get downloaded from, and if one were to set up a similar training job ... More on repost.aws
🌐 repost.aws
January 21, 2024
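Regarding the "where does the model get downloaded from" part of the question above: from_pretrained pulls the files from the Hugging Face Hub and caches them locally (by default under ~/.cache/huggingface, overridable with HF_HOME). A small sketch; the cache_dir path is a hypothetical example, not something the original post used.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Downloads from the Hugging Face Hub on first use, then reuses the local cache.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    # cache_dir makes the download location explicit, e.g. inside a training container.
    tokenizer = AutoTokenizer.from_pretrained(
        "distilbert-base-uncased",
        cache_dir="/opt/ml/input/data/model_cache",  # hypothetical path
    )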
Understanding of TrainingJobAnalytics
Hello guys 👋 I’m working on a Text Classification project, so I’m testing different models (with different hyper-parameter configs) to see which one can give me the best results. To be able to compare the different models, I’d like to retrieve the metrics computed during the ... More on discuss.huggingface.co
🌐 discuss.huggingface.co
May 28, 2022
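For the metric-comparison use case described above, TrainingJobAnalytics from the SageMaker Python SDK returns the metrics CloudWatch captured for a job as a DataFrame, assuming the estimator declared matching metric_definitions; the job name and metric names below are placeholders.

    from sagemaker import TrainingJobAnalytics

    analytics = TrainingJobAnalytics(
        training_job_name="huggingface-training-2022-05-28-12-00-00-000",  # placeholder
        metric_names=["eval_loss", "eval_accuracy"],
    )

    df = analytics.dataframe()  # columns: timestamp, metric_name, value
    print(df.pivot_table(index="timestamp", columns="metric_name", values="value"))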
A Practical Guide to Building with AWS Sagemaker - AI Discussions - DeepLearning.AI
This was originally posted to Stanford’s CS 230 EdStem forum and has been modified to be more general and to remove links. I’ve spent the last couple of months working on the CS 230 final project using AWS Sagemaker and I wanted to share what I’ve learned so that other students can take ... More on community.deeplearning.ai
🌐 community.deeplearning.ai
October 3, 2024
Distributed training in Sagemaker using Jupyter model

Here are a few resources that could help.

If you want to use the managed SageMaker container for TensorFlow, you can follow this tutorial: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb - the advantage of the managed container is that it handles setting up distributed training on TensorFlow for you, so this is the easiest option: you only need to scale up the instance count.

If you want to build your own custom model from scratch, you're given a list of peers in a container network and will need to handle setup and communication for distributed training yourself (see the specs at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container). You might need to acquaint yourself with data distribution as well: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/data_distribution_types/data_distribution_types.ipynb

Here's an example notebook going over containerizing a custom Keras model again (albeit without distributed training): https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/keras_bring_your_own/hpo_bring_your_own_keras_container.ipynb

More on reddit.com
🌐 r/aws
September 28, 2017
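To make the "managed container, just scale the instance count" route from the answer above concrete, here is a sketch with the SageMaker TensorFlow estimator; the script, role, framework versions, and the choice of distribution strategy are assumptions, not taken from the thread.

    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        instance_count=2,                    # >1 makes this a multi-node training job
        instance_type="ml.p3.2xlarge",
        framework_version="2.13",
        py_version="py310",
        distribution={"multi_worker_mirrored_strategy": {"enabled": True}},
    )

    estimator.fit({"train": "s3://my-bucket/train"})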
🌐
Hugging Face
huggingface.co › docs › sagemaker › train
Run training on Amazon SageMaker
Look at the train.py file for a complete example of a 🤗 Transformers training script. If output_dir in the TrainingArguments is set to ‘/opt/ml/model’, the Trainer saves all training artifacts, including logs, checkpoints, and models. Amazon SageMaker archives the whole ‘/opt/ml/model’ directory as model.tar.gz and uploads it at the end of the training job ...
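A sketch of the relevant part of such a train.py, assuming model and train_dataset are built earlier in the script; only the /opt/ml/model convention comes from the documentation above.

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="/opt/ml/model",          # archived as model.tar.gz when the job ends
        num_train_epochs=3,
        per_device_train_batch_size=32,
    )

    # model and train_dataset are assumed to be defined earlier in the script.
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model("/opt/ml/model")      # make sure the final weights land in the archived directory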
🌐
AWS
awscli.amazonaws.com › v2 › documentation › api › latest › reference › sagemaker › create-training-job.html
create-training-job — AWS CLI 2.27.40 Command Reference
RetryStrategy - The number of times to retry the job when the job fails due to an InternalServerError. For more information about SageMaker, see How It Works. ... create-training-job --training-job-name <value> [--hyper-parameters <value>] --algorithm-specification <value> --role-arn <value> [--input-data-config <value>] --output-data-config <value> --resource-config <value> [--vpc-config <value>] --stopping-condition <value> [--tags <value>] [--enable-network-isolation | --no-enable-network-isolation] [--enable-inter-container-traffic-encryption | --no-enable-inter-container-traffic-encryption] ...
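The same call can be made from Python with boto3, whose parameters map one-to-one onto the CLI flags above; the image URI, role, bucket names, and instance settings are placeholders.

    import boto3

    sm = boto3.client("sagemaker", region_name="us-east-1")

    sm.create_training_job(
        TrainingJobName="my-training-job",
        AlgorithmSpecification={
            "TrainingImage": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        InputDataConfig=[
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-bucket/train/",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
        ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
        StoppingCondition={"MaxRuntimeInSeconds": 86400},
    )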
🌐
Sicara
sicara.fr › blog
How to Train and Deploy Custom Models with Amazon ...
However, there are a number of ... training job has finished. Specifically, your training algorithm needs to look for data in the /opt/ml/input folder, and store model artifacts (and whatever other output you’d like to keep for later) in /opt/ml/model. SageMaker will copy the training data we’ve uploaded to S3 into the input folder, and copy everything from the model folder to the output S3 bucket. Here’s an example of a Dockerfile ...
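A sketch of a training script that follows that path contract inside a bring-your-own container; the CSV filename and the pickled "model" are stand-ins for illustration, not from the blog post.

    import json
    import os
    import pickle

    import pandas as pd

    prefix = "/opt/ml"
    train_dir = os.path.join(prefix, "input", "data", "train")   # the "train" channel copied from S3
    model_dir = os.path.join(prefix, "model")                    # everything here is uploaded to S3

    # SageMaker writes the job's hyperparameters here as JSON.
    with open(os.path.join(prefix, "input", "config", "hyperparameters.json")) as f:
        hyperparameters = json.load(f)

    df = pd.read_csv(os.path.join(train_dir, "train.csv"))       # assumed filename

    model = {"columns": list(df.columns), "params": hyperparameters}  # stand-in for real training
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)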
🌐
Medium
medium.com › @smertatli › aws-sagemaker-is-one-of-the-most-advanced-machine-learning-services-in-the-cloud-world-46ff67d45c0
Create a Custom Training Job With Your Own Algorithm in Sagemaker | by Mert Atli | Medium
January 29, 2023 - InputDataConfig: The definition of training data. In this example, we have only one data channel called train, and the data for that channel resides in s3 under input_s3_uri, its type is text/csv, and it is not compressed.
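With the SageMaker Python SDK, that single train channel could be expressed roughly like this; input_s3_uri is the placeholder name used in the article, and the bucket path is invented.

    from sagemaker.inputs import TrainingInput

    input_s3_uri = "s3://my-bucket/data/train.csv"   # placeholder

    train_channel = TrainingInput(
        s3_data=input_s3_uri,
        content_type="text/csv",
        compression=None,            # the data is not compressed
    )

    # estimator.fit({"train": train_channel})   # "train" is the channel name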
🌐
AWS
docs.aws.amazon.com › amazon sagemaker › developer guide › model training › train a model with amazon sagemaker
Train a Model with Amazon SageMaker - Amazon SageMaker AI
October 16, 2025 - Hyperparameter Tuning: This SageMaker AI feature helps define a set of hyperparameters for a model and launch many training jobs on a dataset. Depending on the hyperparameter values, the model training performance might vary.
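A sketch of launching such a tuning job with the SDK's HyperparameterTuner, assuming estimator is a configured SageMaker estimator defined earlier; the metric name, regex, and ranges are assumptions for illustration.

    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

    tuner = HyperparameterTuner(
        estimator=estimator,                      # any configured SageMaker estimator
        objective_metric_name="validation:accuracy",
        objective_type="Maximize",
        hyperparameter_ranges={
            "learning_rate": ContinuousParameter(1e-5, 1e-2),
            "batch_size": IntegerParameter(16, 128),
        },
        metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
        max_jobs=20,
        max_parallel_jobs=4,
    )

    tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})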
🌐
Tutorials Dojo
tutorialsdojo.com › home › aws › train and deploy a scikit-learn model in amazon sagemaker
Train and Deploy a Scikit-Learn Model in Amazon SageMaker
January 5, 2024 - Alternatively, you can inspect it in the SageMaker console and navigate to the Hyperparameter Jobs section. You should see something similar to this: Through these steps, you have successfully trained a Scikit-Learn model in Amazon SageMaker and fine-tuned its hyperparameters for improved performance.
🌐
AWS
docs.aws.amazon.com › amazon sagemaker › developer guide › machine learning environments offered by amazon sagemaker ai › amazon sagemaker hyperpod › sagemaker hyperpod recipes › tutorials › sagemaker training jobs pre-training tutorial (gpu)
SageMaker training jobs pre-training tutorial (GPU) - Amazon SageMaker AI
You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.
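A heavily hedged sketch of what that might look like: the training_recipe and recipe_overrides parameters, the recipe identifier, and the instance sizing below are assumptions based on the tutorial's description, not copied from it.

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",  # assumed recipe id
        recipe_overrides={"run": {"results_dir": "/opt/ml/model"}},              # assumed override structure
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        instance_type="ml.p5.48xlarge",
        instance_count=16,
    )

    estimator.fit(inputs={"train": "s3://my-bucket/llama-pretrain-data"})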
🌐
AWS
docs.aws.amazon.com › amazon sagemaker › amazon sagemaker api reference › actions › amazon sagemaker service › createtrainingjob
CreateTrainingJob - Amazon SageMaker
Starts a model training job.
🌐
SageMaker
sagemaker.readthedocs.io › en › stable › overview.html
Using the SageMaker Python SDK — sagemaker 2.254.1 documentation
# The EFS volume must be in the same VPC as your Amazon EC2 instance. estimator = TensorFlow(entry_point='tensorflow_mnist/mnist.py', role='SageMakerRole', instance_count=1, instance_type='ml.c4.xlarge', subnets=['subnet-1', 'subnet-2'], security_group_ids=['sg-1']) file_system_input = FileSystemInput(file_system_id='fs-1', file_system_type='EFS', directory_path='/tensorflow', file_system_access_mode='ro') # Start an Amazon SageMaker training job with EFS using the FileSystemInput class estimator.fit(file_system_input) # This example shows how to use the FileSystemRecordSet class # Configure an estimator with subnets and security groups from your VPC.
🌐
GitHub
github.com › aws › sagemaker-training-toolkit
GitHub - aws/sagemaker-training-toolkit: Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
To train a model using the image on SageMaker, push the image to ECR and start a SageMaker training job with the image URI.
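A sketch of that last step with the SDK's generic Estimator, assuming the image has already been pushed to ECR; the image URI, role, and paths are placeholders.

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/output/",
        hyperparameters={"epochs": "10"},
    )

    estimator.fit({"train": "s3://my-bucket/train/"})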
🌐
AWS re:Post
repost.aws › knowledge-center › sagemaker-training-job-errors
Troubleshoot errors when you run SageMaker AI training jobs | AWS re:Post
March 19, 2025 - The default maximum runtime for a training job is 1 day. You can adjust the runtime to a maximum of 28 days. To increase the maximum runtime value, pass the MaxRuntimeInSeconds parameter in the CreateTrainingJob API or the max_run parameter in your SageMaker AI Python SDK Estimator.
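For the SDK route, max_run is passed when constructing the estimator; everything else below is a placeholder.

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        max_run=7 * 24 * 60 * 60,   # 7 days instead of the 1-day default (cap is 28 days)
    )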
🌐
AWS
awscli.amazonaws.com › v2 › documentation › api › latest › reference › sagemaker › describe-training-job.html
describe-training-job - sagemaker
For example, if BillableTimeInSeconds is 100 and TrainingTimeInSeconds is 500, the savings is 80%. ... Configuration information for the Amazon SageMaker Debugger hook parameters, metric and tensor collections, and storage paths. To learn more about how to configure the DebugHookConfig parameter, ...
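A sketch of computing that savings figure from DescribeTrainingJob output with boto3; the job name is a placeholder, and this assumes both time fields are present in the response (they are reported for managed spot training).

    import boto3

    sm = boto3.client("sagemaker")
    desc = sm.describe_training_job(TrainingJobName="my-training-job")  # placeholder name

    billable = desc["BillableTimeInSeconds"]    # e.g. 100
    training = desc["TrainingTimeInSeconds"]    # e.g. 500
    print(f"Managed spot savings: {100 * (1 - billable / training):.0f}%")   # 80% for the example above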
🌐
GitHub
github.com › aws › amazon-sagemaker-examples
GitHub - aws/amazon-sagemaker-examples: Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
🌐
Readthedocs
sagemaker-examples.readthedocs.io
Amazon SageMaker Examples - Read the Docs
Deploying pre-trained PyTorch vision models with Amazon SageMaker Neo · Use SageMaker Batch Transform for PyTorch Batch Inference · Track, monitor, and explain models · Amazon SageMaker Multi-hop Lineage Queries · Amazon SageMaker Model Monitor · Fairness and Explainability with SageMaker Clarify · Orchestrate workflows · Orchestrate Jobs to Train and Evaluate Models with Amazon SageMaker Pipelines ·