Amazon SageMaker HyperPod customers

Top AI startups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod
  • Hugging Face

    Hugging Face has been using SageMaker HyperPod to create important new open foundation models like StarCoder, IDEFICS, and Zephyr, which have been downloaded millions of times. SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failure and quickly replace the faulty hardware without disrupting ongoing model training. Because our teams need to innovate quickly, this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.

    Jeff Boudier, head of Product at Hugging Face
  • Perplexity AI

    We were looking for the right ML infrastructure to increase productivity and reduce costs in order to build high-performing large language models. After running a few successful experiments, we switched to AWS from other cloud providers in order to use Amazon SageMaker HyperPod. We have been using HyperPod for the last four months to build and fine-tune the LLMs to power the Perplexity conversational answer engine that answers questions along with references provided in the form of citations. Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure. SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast, which means our developers can iterate more quickly, accelerating the development of new generative AI experiences for our customers.

    Aravind Srinivas, co-founder and CEO at Perplexity AI
  • Articul8 AI

    Amazon SageMaker HyperPod task governance helps maximize GPU utilization across various teams and projects. As a fast-growing GenAI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, thereby reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team's productivity, allowing them to bring new GenAI innovations to market faster than ever before.

    Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale up of our GenAI operations. As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.

    Arun Subramaniyan, Founder and CEO of Articul8 AI
  • Thomson Reuters

    Thomson Reuters, a global AI and content-driven technology company, has been testing the task governance capability in Amazon SageMaker HyperPod to address a key challenge around workload prioritization. With task governance, they can now manage customer workloads such as inference requests alongside their own ongoing model development projects, prioritizing urgent customer requests without disrupting internal research, which leads to better resource utilization and customer satisfaction. “We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” said John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarisation and classification.”

    Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we also are exploring training custom models more efficiently with our unique and proprietary content and human expertise. SageMaker HyperPod’s distributed training libraries help us improve large-scale model training performance. And its resiliency feature saves time as we monitor and manage infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.

    Joel Hron, Head of AI and Labs, Thomson Reuters and John Duprey, Distinguished Engineer, Thomson Reuters Labs
  • Stability AI

    As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod’s managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, helping us build state-of-the-art models faster.

    Emad Mostaque, Founder and CEO, Stability AI
  • Recursal AI

    The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads, spanning application, inference, and training, with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.

    Nathan Wilce, Infrastructure/data lead, Recursal
  • Hippocratic AI

    Hippocratic AI is an AI company developing the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. Amazon SageMaker HyperPod flexible training plans made it easier for the company to gain access to Amazon Elastic Compute Cloud (Amazon EC2) P5 Instances, helping it secure the required compute resources and train models quickly. Hippocratic AI also uses AWS services such as Grafana to track important GPU utilization metrics. Using Amazon EC2 P5 Instances, Hippocratic AI has increased model training speed by four times and scaled its solution to accommodate hundreds of use cases.

  • NinjaTech

    NinjaTech AI, a generative AI company that provides an all-in-one SuperAgent for unlimited productivity, used Amazon SageMaker HyperPod flexible training plans to accelerate fine-tuning of various internal models, including the Llama 3.1 405B model, reduce model training costs, and automate the process. The company aims to provide a seamless experience to users who want access to the various AI agents powering its SuperAgent Technology. To achieve this, they needed a model that could automatically predict user intention and determine which AI agent would be a good fit for it. This mechanism required frequent updates to the model, incorporating customer feedback and new features iteratively, with 10M-100M tokens in each round of LoRA fine-tuning. For a startup, acquiring and operating high-performance compute resources is challenging because of their steep cost and bandwidth issues, particularly in multi-node clusters, which require fast networking and fast storage in addition to accelerated computing. In addition, the training process is time-consuming, involving steps like model downloading, distributed training, checkpointing, monitoring, auto remediation, merging, and quantization. HyperPod’s flexible training plans provided the company with reliable and affordable compute in advance of the training run, matching their specific compute and timeline requirements, while ensuring efficient model training.

  • OpenBabylon

    Developers and data scientists at OpenBabylon, an AI company that customizes large language models for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using SageMaker HyperPod’s multi-node distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating SageMaker HyperPod’s ability to deliver complex projects on schedule and within budget.

  • Salesforce

    Researchers at Salesforce were looking for ways to quickly get started with foundation model training and fine-tuning, without having to worry about the infrastructure or spend weeks optimizing their training stack for each new model. With Amazon SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Amazon SageMaker HyperPod partners

Drive innovation and unlock greater business value with AWS partners that have deep technical knowledge and proven customer success

  • Accenture

    We are extending our partnership with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Our collaboration with AWS will allow us to guide customers towards the latest technological breakthroughs while helping to reduce generative AI application costs. By bringing together centralized governance capabilities in SageMaker HyperPod and our experience in generative AI projects, we can help companies realize the value of generative AI even faster, improving customer experience and increasing return on investment.

    Jennifer Jackson, Global Lead for Accenture AWS Business Group & Senior Managing Director
  • Slalom

    We are thrilled to collaborate with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Working with AWS, we can now help our customers rapidly adopt the latest technological advancements and reduce the costs of their generative AI applications. By bringing together centralized governance capabilities in SageMaker HyperPod and Slalom’s extensive AI and cloud experience, we can deliver exceptional customer experiences alongside increased return on investment.

    Jeff Kempiners, Managing Director of Slalom’s Amazon Center of Excellence (CoE)
  • Rackspace Technology

    We are excited to collaborate with AWS as a launch partner for SageMaker HyperPod task governance. Together, we can help our customers reduce the costs of generative AI applications, while keeping up with the latest technological advancements. By combining SageMaker HyperPod’s centralized governance capabilities with Rackspace’s deep AI and cloud expertise, we can transform customer experiences and improve their return on investment simultaneously.

    Srini Koushik, President, AI, Technology and Sustainability at Rackspace Technology