Distributed training in SageMaker using Jupyter model
Here are a few resources that could help.
If you want to use the managed SageMaker container for TensorFlow, you can follow this tutorial: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb. The advantage of the managed container is that it handles setting up distributed training on TensorFlow for you. This makes it the easiest option: you only need to scale up the instance count.
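To make "scale up the instance count" concrete, here is a rough sketch using the SageMaker Python SDK's TensorFlow estimator. It assumes SDK v2 argument names (`instance_count`, `instance_type`), which may differ from the ones in the linked notebook. The script name `train.py`, the instance type, and the version strings are placeholders I made up, so check them against what your account and SDK version support. Depending on the SDK version you may also need to pass a `distribution` argument, so treat this as a starting point and not the definitive setup.

```python
from typing import Any, Dict


def tensorflow_estimator_kwargs(role_arn: str, instance_count: int = 2) -> Dict[str, Any]:
    """Build keyword arguments for sagemaker.tensorflow.TensorFlow (SDK v2 naming assumed)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "entry_point": "train.py",         # hypothetical training script
        "role": role_arn,                  # IAM role SageMaker assumes for the job
        "instance_count": instance_count,  # the knob you scale up for distributed training
        "instance_type": "ml.m5.xlarge",   # placeholder instance type
        "framework_version": "2.11",       # placeholder, check supported versions
        "py_version": "py39",              # placeholder, must match the framework version
    }


def launch(role_arn: str, s3_train_uri: str, instance_count: int = 2) -> None:
    """Start a training job; needs the sagemaker package and AWS credentials."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(**tensorflow_estimator_kwargs(role_arn, instance_count))
    # The channel name "train" is a convention, not a requirement.
    estimator.fit({"train": s3_train_uri})
```

Going from one instance to four would then just be `launch(role_arn, s3_uri, instance_count=4)`, with the rest of the configuration unchanged.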
If you want to build your own custom model from scratch, you're given a list of peers in a container network and will need to handle setup and communication for distributed training yourself, according to the specs at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container. You might also need to acquaint yourself with data distribution: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/data_distribution_types/data_distribution_types.ipynb
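As a rough illustration of what "a list of peers" looks like from inside a custom container: per the linked docs, SageMaker writes a `resourceconfig.json` file under `/opt/ml/input/config/` containing `current_host` and `hosts`. The sketch below only reads that file and derives a rank for each container. The `ClusterInfo` type, the sorting, and the "rank 0 is the leader" convention are my own assumptions for illustration. The actual communication setup (parameter servers, all-reduce, and so on) is still up to you.

```python
import json
from pathlib import Path
from typing import List, NamedTuple

# Location documented by SageMaker for custom training containers.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


class ClusterInfo(NamedTuple):
    """Hypothetical helper type, not part of any SageMaker API."""
    current_host: str
    hosts: List[str]
    rank: int
    is_leader: bool


def load_cluster_info(path: str = RESOURCE_CONFIG_PATH) -> ClusterInfo:
    """Read the peer list and work out this container's position in it."""
    config = json.loads(Path(path).read_text())
    # Sort so every container computes the same ordering. The order is
    # lexicographic, which is consistent across hosts but not numeric.
    hosts = sorted(config["hosts"])
    current = config["current_host"]
    rank = hosts.index(current)
    # Assumed convention: the first host in sorted order coordinates the others.
    return ClusterInfo(current, hosts, rank, rank == 0)
```

Inside the container you would call `load_cluster_info()` at startup and pass `hosts` and `rank` to whatever distributed backend your training script uses.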
Here's an example notebook that walks through containerizing a custom Keras model (albeit without distributed training): https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/keras_bring_your_own/hpo_bring_your_own_keras_container.ipynb