#AmazonEKSSupport | Explore Tumblr posts and blogs

govindhtech · 2 months ago

Text

Amazon SageMaker HyperPod Presents Amazon EKS Support

Amazon SageMaker HyperPod

Cut the training duration of foundation models by up to 40% and scale effectively across over a thousand AI accelerators.

We are happy to inform you today that Amazon SageMaker HyperPod, a specially designed infrastructure with robustness at its core, will enable Amazon Elastic Kubernetes Service (EKS) for foundation model (FM) development. With this new feature, users can use EKS to orchestrate HyperPod clusters, combining the strength of Kubernetes with the robust environment of Amazon SageMaker HyperPod, which is ideal for training big models. By effectively scaling across over a thousand artificial intelligence (AI) accelerators, Amazon SageMaker HyperPod can save up to 40% of training time.

- Advertisement -

SageMaker HyperPod: What is it?

The undifferentiated heavy lifting associated with developing and refining machine learning (ML) infrastructure is eliminated by Amazon SageMaker HyperPod. Workloads can be executed in parallel for better model performance because it is pre-configured with SageMaker’s distributed training libraries, which automatically divide training workloads over more than a thousand AI accelerators. SageMaker HyperPod occasionally saves checkpoints to guarantee your FM training continues uninterrupted.

You no longer need to actively oversee this process because it automatically recognizes hardware failure when it occurs, fixes or replaces the problematic instance, and continues training from the most recent checkpoint that was saved. Up to 40% less training time is required thanks to the robust environment, which enables you to train models in a distributed context without interruption for weeks or months at a time. The high degree of customization offered by SageMaker HyperPod enables you to share compute capacity amongst various workloads, from large-scale training to inference, and to run and scale FM tasks effectively.

Advantages of the Amazon SageMaker HyperPod

Distributed training with a focus on efficiency for big training clusters

Because Amazon SageMaker HyperPod comes preconfigured with Amazon SageMaker distributed training libraries, you can expand training workloads more effectively by automatically dividing your models and training datasets across AWS cluster instances.

Optimum use of the cluster’s memory, processing power, and networking infrastructure

Using two strategies, data parallelism and model parallelism, Amazon SageMaker distributed training library optimizes your training task for AWS network architecture and cluster topology. Model parallelism divides models that are too big to fit on one GPU into smaller pieces, which are then divided among several GPUs for training. To increase training speed, data parallelism divides huge datasets into smaller ones for concurrent training.

- Advertisement -

Robust training environment with no disruptions

You can train FMs continuously for months on end with SageMaker HyperPod because it automatically detects, diagnoses, and recovers from problems, creating a more resilient training environment.

Customers may now use a Kubernetes-based interface to manage their clusters using Amazon SageMaker HyperPod. This connection makes it possible to switch between Slurm and Amazon EKS with ease in order to optimize different workloads, including as inference, experimentation, training, and fine-tuning. Comprehensive monitoring capabilities are provided by the CloudWatch Observability EKS add-on, which offers insights into low-level node metrics on a single dashboard, including CPU, network, disk, and other. This improved observability includes data on container-specific use, node-level metrics, pod-level performance, and resource utilization for the entire cluster, which makes troubleshooting and optimization more effective.

Since its launch at re:Invent 2023, Amazon SageMaker HyperPod has established itself as the go-to option for businesses and startups using AI to effectively train and implement large-scale models. The distributed training libraries from SageMaker, which include Model Parallel and Data Parallel software optimizations to assist cut training time by up to 20%, are compatible with it. With SageMaker HyperPod, data scientists may train models for weeks or months at a time without interruption since it automatically identifies, fixes, or replaces malfunctioning instances. This frees up data scientists to concentrate on developing models instead of overseeing infrastructure.

Because of its scalability and abundance of open-source tooling, Kubernetes has gained popularity for machine learning (ML) workloads. These benefits are leveraged in the integration of Amazon EKS with Amazon SageMaker HyperPod. When developing applications including those needed for generative AI use cases organizations frequently rely on Kubernetes because it enables the reuse of capabilities across environments while adhering to compliance and governance norms. Customers may now scale and maximize resource utilization across over a thousand AI accelerators thanks to today’s news. This flexibility improves the workflows for FM training and inference, containerized app management, and developers.

With comprehensive health checks, automated node recovery, and work auto-resume features, Amazon EKS support in Amazon SageMaker HyperPod fortifies resilience and guarantees continuous training for big-ticket and/or protracted jobs. Although clients can use their own CLI tools, the optional HyperPod CLI, built for Kubernetes settings, can streamline job administration. Advanced observability is made possible by integration with Amazon CloudWatch Container Insights, which offers more in-depth information on the health, utilization, and performance of clusters. Furthermore, data scientists can automate machine learning operations with platforms like Kubeflow. A reliable solution for experiment monitoring and model maintenance is offered by the integration, which also incorporates Amazon SageMaker managed MLflow.

In summary, the HyperPod service fully manages the HyperPod service-generated Amazon SageMaker HyperPod cluster, eliminating the need for undifferentiated heavy lifting in the process of constructing and optimizing machine learning infrastructure. This cluster is built by the cloud admin via the HyperPod cluster API. These HyperPod nodes are orchestrated by Amazon EKS in a manner akin to that of Slurm, giving users a recognizable Kubernetes-based administrator experience.

Important information

The following are some essential details regarding Amazon EKS support in the Amazon SageMaker HyperPod:

Resilient Environment: With comprehensive health checks, automated node recovery, and work auto-resume, this integration offers a more resilient training environment. With SageMaker HyperPod, you may train foundation models continuously for weeks or months at a time without interruption since it automatically finds, diagnoses, and fixes errors. This can result in a 40% reduction in training time.

Improved GPU Observability: Your containerized apps and microservices can benefit from comprehensive metrics and logs from Amazon CloudWatch Container Insights. This makes it possible to monitor cluster health and performance in great detail.

Scientist-Friendly Tool: This release includes interaction with SageMaker Managed MLflow for experiment tracking, a customized HyperPod CLI for job management, Kubeflow Training Operators for distributed training, and Kueue for scheduling. Additionally, it is compatible with the distributed training libraries offered by SageMaker, which offer data parallel and model parallel optimizations to drastically cut down on training time. Large model training is made effective and continuous by these libraries and auto-resumption of jobs.

Flexible Resource Utilization: This integration improves the scalability of FM workloads and the developer experience. Computational resources can be effectively shared by data scientists for both training and inference operations. You can use your own tools for job submission, queuing, and monitoring, and you can use your current Amazon EKS clusters or build new ones and tie them to HyperPod compute.