Build a conversational data assistant, Part 1: Text-to-SQL with Amazon Bedrock Agents


What if you could replace hours of data analysis with a minute-long conversation? Large language models can transform how we bridge the gap between business questions and actionable data insights. For most organizations, this gap remains stubbornly wide, with business teams trapped in endless cycles—decoding metric definitions and hunting for the correct data sources to manually craft each SQL query. Simple business questions can become multi-day ordeals, with analytics teams drowning in routine requests instead of focusing on strategic initiatives.

Amazon’s Worldwide Returns & ReCommerce (WWRR) organization faced this challenge at scale. With users executing over 450,000 SQL queries annually against our petabyte-scale Amazon Redshift data warehouses, our business intelligence and analytics teams had become a critical bottleneck. We needed a self-serve solution that could handle enterprise complexities: thousands of interconnected tables, proprietary business terminology and evolving metrics definitions that vary across business domains, and strict governance requirements.

This post series outlines how WWRR developed the Returns & ReCommerce Data Assist (RRDA), a generative AI-powered conversational interface that transformed data access across all levels of the organization. RRDA empowers over 4,000 non-technical users to identify correct metrics, construct validated SQL, and generate complex visualizations—all through natural conversation. The results: 90% faster query resolution (from hours to minutes) with zero dependency on business intelligence teams.

In Part 1, we focus on building a Text-to-SQL solution with Amazon Bedrock, a managed service for building generative AI applications. Specifically, we demonstrate the capabilities of Amazon Bedrock Agents. Part 2 explains how we extended the solution to provide business insights using Amazon Q in QuickSight, a business intelligence assistant that answers questions with auto-generated visualizations.

The fundamentals of Text-to-SQL

Text-to-SQL systems transform natural language questions into database queries through a multi-step process. At its core, these systems must bridge the gap between human communication and structured query language by:

  • Understanding user intent and extracting key entities from natural language
  • Matching these entities to database schema components (tables, columns, relationships)
  • Generating syntactically correct SQL that accurately represents the user’s question

Enterprise implementations face additional challenges, including ambiguous terminology, complex schema relationships, domain-specific business metrics, and real-time query validation. Let’s explore how RRDA implements these principles.

RRDA architecture overview

RRDA uses a WebSocket connection through Amazon API Gateway to connect users to AWS Lambda for real-time serverless processing. The following diagram shows the overall architecture of RRDA, highlighting how the user’s messages flow through the system to generate either SQL queries or metric visualizations.

RRDA Architecture Overview. The diagram illustrates the system's dual processing pathways: the upper path handles SQL generation through Amazon Bedrock Agents, while the lower path manages visualization requests via Amazon Q in QuickSight.


In the architecture diagram, the input from the user is classified as an intent and then routed to the most appropriate processing pathway. The architecture features two primary processing pathways:

  • For a conversational experience (INFO, QUERY, and UNKNOWN intents), the system checks our semantic cache for verified responses before routing to an Amazon Bedrock agent that orchestrates between action groups. These action groups are components that define specific operations the agent can help perform:
    • Retrieve metric definitions with domain filtering
    • Fetch SQL table metadata and schemas from Amazon DynamoDB
    • Generate SQL code and validate query syntax
  • For visualization requests (SHOW_METRIC intent), the system retrieves relevant Amazon QuickSight Q topics, rewrites the question appropriately, then displays visualizations using Amazon Q in QuickSight.

For Part 1 of this post, we focus on the Amazon Bedrock agent that is capable of answering questions about metrics and generating validated SQL. Part 2 will cover the metric visualization pathway using Amazon Q in QuickSight.

Intent and domain classification

RRDA classifies incoming user messages by user intent and relevant business domain. Intent and domain classification occurs simultaneously in parallel threads to minimize latency. The solution categorizes queries into four distinct intents, and routes requests to our agent or Amazon Q in QuickSight accordingly:

  • INFO – For informational questions about metrics, definitions, table schemas, and so on (for example, “Explain the resolution index metric”)
  • QUERY – For SQL query generation (for example, “Write a query for US resolution index for the past 6 months”)
  • SHOW_METRIC – For requests to display visualizations from Amazon Q in QuickSight (for example, “Show me the resolution index in US over the past 6 months”)
  • UNKNOWN – For requests that don’t match a known pattern

Intent classification uses a lightweight foundation model (FM) through the Amazon Bedrock Converse API. The system uses a specialized prompt containing numerous examples of each intent type and analyzes both the current user message and recent conversation history to determine the most appropriate category. The system uses structured outputs to constrain the model’s response to one of the four predefined intents.
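The following is a minimal sketch of this classification step using boto3 and the Converse API. The prompt, model ID, and helper names are illustrative, and this sketch constrains the output by validating the returned label, standing in for the structured-output mechanism used in production:

import boto3

bedrock = boto3.client("bedrock-runtime")

INTENTS = {"INFO", "QUERY", "SHOW_METRIC", "UNKNOWN"}

# Hypothetical few-shot prompt; the production prompt contains many more examples.
SYSTEM_PROMPT = (
    "Classify the user's message into exactly one intent: INFO, QUERY, "
    "SHOW_METRIC, or UNKNOWN. Respond with the intent label only.\n"
    "Example: 'Explain the resolution index metric' -> INFO\n"
    "Example: 'Write a query for US resolution index for the past 6 months' -> QUERY\n"
    "Example: 'Show me the resolution index in US over the past 6 months' -> SHOW_METRIC"
)

def classify_intent(message: str, history: list[dict]) -> str:
    """Classify a user message with a lightweight FM; history holds prior turns in Converse format."""
    response = bedrock.converse(
        modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # assumption: any lightweight FM works here
        system=[{"text": SYSTEM_PROMPT}],
        messages=history + [{"role": "user", "content": [{"text": message}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    label = response["output"]["message"]["content"][0]["text"].strip().upper()
    return label if label in INTENTS else "UNKNOWN"  # constrain to the four predefined intents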

The following screenshot is an example of a QUERY intent message where a user requests a SQL query for the resolution index metric: “Give me a query for RP resolution index for June 2025 and US marketplace.”

Screenshot of the RRDA chat interface showing a user query "Give me a query for RP Resolution Index for June 2025 and US marketplace" in a blue message bubble. The system responds with a metric definition stating "Resolution Index % = Items resolved within 7 business days / Total returned items" followed by a formatted SQL query using SELECT, COUNT, and CASE statements to calculate the resolution index from the wwrr_metrics.a_rp_resolution table. The response includes a green "Validated SQL Syntax" badge and ends with a suggestion section titled "Get related insights from Amazon Q in QuickSight" with a clickable link asking "What is the Resolution Index for June 2025 in the US marketplace?" The interface shows thumbs up, thumbs down, and refresh icons at the bottom.

In this example, the system identifies “RP” in the user’s message and correctly maps it to the Returns Processing domain, enabling it to access only the specific tables and metric definitions that uniquely apply to the Returns Processing domain-specific definition of the resolution index metric.

Business domain classification is crucial because enterprises can have metrics that might be calculated differently across different programs, teams, or departments. Our domain classification system uses string comparison to pattern match against a predefined dictionary that maps variations, abbreviations, and aliases to canonical domain names (for example, "rp", "returns-processing", and "return-process" substrings in a user message map to the Returns Processing domain; and "promos" to the Promotions domain). The system tokenizes user messages and identifies whole-word or phrase matches rather than partial string matching to help avoid false positives. This domain context persists across conversation turns, enabling targeted knowledge retrieval and domain-specific processing that improves the response accuracy of our agent.
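As a rough illustration, the whole-word matching logic looks something like the following Python sketch; the alias dictionary shown is a small hypothetical excerpt of the full mapping:

import re

# Hypothetical excerpt; the production dictionary maps aliases for every WWRR business domain.
DOMAIN_ALIASES = {
    "rp": "Returns Processing",
    "returns-processing": "Returns Processing",
    "return-process": "Returns Processing",
    "promos": "Promotions",
}

def classify_domain(message: str) -> str | None:
    """Match whole words or phrases against the alias dictionary to avoid false positives."""
    text = message.lower()
    tokens = set(re.findall(r"[a-z0-9-]+", text))
    for alias, domain in DOMAIN_ALIASES.items():
        # Single-token aliases must match a whole token; multi-word aliases match as phrases.
        if alias in tokens or re.search(rf"\b{re.escape(alias)}\b", text):
            return domain
    return None  # unresolved; the agent asks the user to confirm the domain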

Amazon Bedrock agent overview

At the core of RRDA’s architecture is an Amazon Bedrock agent powered by Anthropic’s Claude 3.5 Haiku on Amazon Bedrock. This agent serves as the intelligent decision-maker that orchestrates between the following action groups and interprets results:

  • RetrieveFromKnowledgeBase – Searches the WWRR metrics dictionary for domain-specific metric definitions, calculation logic, and data sources using metadata filtering
  • FetchTableSchema – Gets table information from DynamoDB, providing the agent with up-to-date table schema, column definitions, dimensions for categorical columns, and example queries for that table
  • GenerateSQLQuery – Delegates complex SQL generation tasks to Anthropic’s Claude 3.7 Sonnet, passing along retrieved knowledge base information, table schemas, and user requirements

The following sequence diagram shows how the agent orchestrates between different action groups based on the classified intent.

Sequence diagram showing RRDA's hybrid model architecture with user query processing flow. The diagram illustrates two paths: for INFO/UNKNOWN intents, Claude 3.5 Haiku (Amazon Bedrock Agent) handles RetrieveFromKnowledgeBase and/or FetchTableSchema actions and provides direct responses; for QUERY intents, Claude 3.5 Haiku first retrieves context through RetrieveFromKnowledgeBase and FetchTableSchema, then delegates to Claude 3.7 Sonnet via Amazon Bedrock Converse API for GenerateSQLQuery with context, returning validated SQL with explanation back to the user.

Sequence diagram illustrating RRDA’s hybrid model architecture where Anthropic’s Claude 3.5 Haiku orchestrates agent actions for fast retrieval and interaction, while delegating complex SQL generation tasks to Anthropic’s Claude 3.7 Sonnet.

The diagram illustrates how we use a hybrid model architecture that optimizes for both performance and adaptability: Anthropic’s Claude 3.5 Haiku orchestrates the agent for fast information retrieval and interactive responses, while delegating complex SQL generation to Anthropic’s Claude 3.7 Sonnet, which excels at code generation tasks. This balanced approach delivers both responsive interactions and high-quality output.
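For reference, the following is a minimal sketch of how a chat backend can invoke such an agent with boto3. The identifiers are placeholders, and passing the detected domain as a session attribute is an assumption about how RRDA shares that context with its action groups:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_agent(question: str, session_id: str, domain: str | None) -> str:
    """Send one user turn to the Amazon Bedrock agent and assemble the streamed answer."""
    response = agent_runtime.invoke_agent(
        agentId="AGENT_ID",              # placeholder identifiers
        agentAliasId="AGENT_ALIAS_ID",
        sessionId=session_id,
        inputText=question,
        sessionState={"sessionAttributes": {"domain": domain or ""}},  # assumed context passing
    )
    answer = ""
    for event in response["completion"]:  # event stream of response chunks (and traces, if enabled)
        if "chunk" in event:
            answer += event["chunk"]["bytes"].decode("utf-8")
    return answer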

Metrics dictionary with business domain filtering

RRDA’s metrics dictionary serves as the source of truth for over 1,000 metrics across WWRR’s business domains. Each metric is encoded as a JSON object with metadata (business domain, category, name, definition) and metric usage details (SQL expression, source datasets, filters, granularity). This structure maps business terminology directly to technical implementation for accurate SQL translation.

Each metric object is ingested as an individual chunk into our metrics knowledge base with domain metadata tags. When a user asks “Explain the RP resolution index metric,” RRDA detects Returns Processing as the domain and invokes the RetrieveFromKnowledgeBase action group. This action group retrieves only metrics within that specific business domain by applying a domain filter to the vector search configuration.
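The following sketch shows what such a domain-filtered retrieval can look like with boto3; the knowledge base ID and the "domain" metadata key are assumptions for illustration:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_metrics(query: str, domain: str) -> list[dict]:
    """Retrieve metric definitions, restricted to a single business domain via metadata filtering."""
    response = agent_runtime.retrieve(
        knowledgeBaseId="KB_ID",  # placeholder knowledge base ID
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                # Assumption: each metric chunk carries a 'domain' metadata attribute
                "filter": {"equals": {"key": "domain", "value": domain}},
            }
        },
    )
    return response["retrievalResults"]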

For domain-ambiguous queries, the action group identifies potentially relevant domains, and instructs the agent to ask the user to confirm the correct one. The following screenshot is an example of a user interaction where the domain could not be classified.

Screenshot of RRDA chat interface showing a conversation where user asks 'List all the resolution index metrics', system responds by identifying multiple domains and asking for clarification, user responds 'RP', and system provides detailed definitions for two Returns Processing resolution index metrics with calculations and business context.

RRDA demonstrating domain classification and knowledge base retrieval. The system identifies multiple domains with “resolution index” metrics, asks for clarification, then provides detailed metric definitions for the selected Returns Processing domain.

The system’s response short-listed potentially relevant domains that contain the “resolution index” metric. After a domain is detected, subsequent searches are refined with the confirmed domain filter. This targeted retrieval makes sure RRDA accesses only domain-appropriate metric definitions, improving response accuracy while reducing latency by limiting the search space.

Table schema retrieval

The agent invokes the FetchTableSchema action group to get SQL table metadata with low latency. When the agent needs to understand a table’s structure, this action retrieves column definitions, example queries, and usage instructions from the DynamoDB table metadata store. The following screenshot shows an interaction where the user asks for details about a table.

Screenshot of RRDA chat interface showing user asking about status buckets for RP aggregate dataset, with system response listing categorized status values including Initial Processing, Warehouse Processing, and Disposition buckets, along with usage notes and table reference information.

RRDA providing detailed categorical dimension values for the RP aggregate dataset. The agent retrieves and categorizes status buckets from the table metadata, helping users understand available filter options and providing practical usage recommendations.

The agent responded with specific categorical dimension values (for example, “In Warehouse, Quality Check Complete”) that the user can filter by. These distinct values are important context for the agent to generate valid SQL that aligns with business terminology and database structure. However, these distinct values are not available directly through the table schema—but are collected using the workflow in the following diagram.

We created an AWS Step Functions workflow that enriches the DynamoDB table metadata store with rich metadata including table schemas, dimensions, and example queries.

Diagram showing BatchRefreshTableMetadata AWS Step Functions workflow with four sequential Lambda functions: SyncTableSchemas fetching from AWS Glue, IdentifyCategoricalDimensions using Bedrock API, ExtractCategoricalValues querying Redshift, and GetExampleQueries processing query logs, all updating the DynamoDB Table Metadata Store.

AWS Step Functions workflow for automated table metadata enrichment, updating the DynamoDB metadata store with schema information, categorical dimensions, and usage examples.

The workflow orchestrates the metadata refresh process, running daily to keep information current. It synchronizes schemas from the AWS Glue Data Catalog, uses Amazon Bedrock FMs to identify categorical dimensions worth extracting, populates the top K distinct values by running targeted Redshift queries (for example, SELECT column_name FROM table_name GROUP BY column_name ORDER BY COUNT(*) DESC LIMIT K;), and incorporates example usage patterns from Redshift query logs. This automated pipeline reduces manual metadata maintenance while making sure that the table metadata store reflects current data structures, business terminology, and usage patterns.
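As a simplified sketch, the categorical-value extraction step might use the Amazon Redshift Data API along the following lines; the workgroup, database, and polling logic are illustrative, and the table and column names come from the trusted metadata pipeline rather than user input:

import time

import boto3

redshift_data = boto3.client("redshift-data")

def top_k_values(table: str, column: str, k: int = 25) -> list[str]:
    """Collect the most frequent distinct values of a categorical column for the metadata store."""
    sql = (
        f"SELECT {column} FROM {table} "
        f"GROUP BY {column} ORDER BY COUNT(*) DESC LIMIT {k};"
    )
    statement = redshift_data.execute_statement(
        WorkgroupName="my-redshift-workgroup",  # placeholder; a provisioned ClusterIdentifier also works
        Database="my_database",
        Sql=sql,
    )
    while True:  # simplified polling; production code should bound retries and handle errors
        status = redshift_data.describe_statement(Id=statement["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)
    result = redshift_data.get_statement_result(Id=statement["Id"])
    return [row[0]["stringValue"] for row in result["Records"]]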

SQL generation and validation

The heart of our solution is how RRDA actually creates and validates SQL queries. Rather than generating SQL in one step, we designed a smarter approach using the Amazon Bedrock Agents return of control capability, which works like a conversation between different parts of our system. When a user asks a question like “Write a query to find the resolution index for electronics in the US last quarter,” the process unfolds in several steps:

  1. Our agent consults its knowledge base to identify the correct metric and where this data lives. If anything is unclear, it asks the user follow-up questions like “Are you interested in resolution index for electronics in the US marketplace or North America region?” to verify it understands exactly what’s needed.
  2. When the agent knows which data source to use, it fetches the detailed schema of that table—column names, data types, dimensions, and example queries—so it knows how the data is organized.
  3. The GenerateSQLQuery action group invokes Anthropic’s Claude 3.7 Sonnet with a specialized prompt for SQL generation using the Amazon Bedrock Converse API. The response returns SQL code with an explanation.
  4. Our system intercepts the action group response using return of control. It extracts the generated SQL and validates it against our Redshift database using an EXPLAIN command, which checks for syntax errors without actually running the query. If the system finds syntax problems, it automatically sends the query back to the Amazon Bedrock FM with specific error information, allowing it to fix itself before proceeding.
  5. After a valid query is confirmed, the agent receives it and presents it to the user with a clear explanation in simple language. The agent explains what the query does, what filters are applied, and how the results should be interpreted—and is able to answer follow-up questions as needed.

This approach delivers immediate business value: users receive syntactically correct queries that execute on the first attempt, without needing SQL expertise or database schema knowledge. By validating queries using Redshift’s EXPLAIN command—rather than executing them—RRDA maintains security boundaries while ensuring reliability. Users execute validated queries through their existing tools with proper authentication, preserving enterprise access controls while eliminating the trial-and-error cycle of manual SQL development.

Designing a user experience to earn trust

To build trust with users across our organization, we prioritized creating an interface where users can quickly verify RRDA’s reasoning process. When users receive a response, they can choose View agent actions to see exactly which knowledge bases were searched and which table schemas were accessed.

Screenshot of Agent Actions dialog showing transparency features, with two sections: 'Searched Metrics Dictionary' displaying a search through Returns Processing knowledge base for resolution index metric definition, and 'Fetched Table Schema' listing two tables WWRR_METRICS.D_RP_RESOLUTION and WWRR_METRICS.A_RP_RESOLUTION with external link icons.

For SQL generation—where accuracy is paramount—we display validation status badges that clearly indicate when queries have passed Redshift syntax verification, building confidence in the output’s reliability.

A green status badge saying "Validated SQL Syntax"

We also maintain a question bank of curated and verified answers that we serve to users through our semantic cache if a similar question is asked.

Semantic cache check mark saying "Verified Answer", with tooltip "This result has been verified by humans. Thanks for providing feedback"

Real-time status indicators keep users informed of each step the agent takes during complex operations, alleviating uncertainty about what’s happening behind the scenes.

Animated gif showing a realtime status indicator element to update the user of what the LLM is doing

Best practices

Based on our experience building RRDA, the following are best practices for implementing agentic Text-to-SQL solutions:

  • Implement domain-aware context – Use business domain classification to retrieve domain-specific metric definitions, making sure that SQL generation respects how metrics are calculated differently across business units.
  • Adopt a hybrid model architecture – Use smaller, faster models for orchestration and user interaction, while reserving more powerful models for complex SQL generation tasks to balance performance and accuracy.
  • Validate without execution – Implement real-time SQL validation using lightweight commands like Redshift EXPLAIN to catch syntax issues without executing the query. Separating query generation from execution reduces errors while preserving existing authentication boundaries.
  • Automate metadata maintenance – Build pipelines that continuously refresh metadata, such as schema information and dimensions, without manual intervention to make sure the system references up-to-date information.
  • Design for transparency and user engagement – Build trust by exposing agent reasoning, intermediate steps, and providing clear explanations of generated SQL. Implement real-time status indicators and progress updates for multi-step processes with response times of 15–60 seconds. Transparent processing feedback maintains user engagement and builds confidence during complex operations like SQL generation and knowledge retrieval.

Conclusion

The Returns & ReCommerce Data Assist transforms data access at WWRR by converting natural language into validated SQL queries through Amazon Bedrock. With our domain-aware approach with real-time validation, business users can retrieve accurate data without SQL expertise, dramatically shortening the path from questions to insights. This Text-to-SQL capability is just the first step—Part 2 will explore extending the solution with automated visualization using Amazon Q in QuickSight.


About the authors

Photo of author: Dheer TopraniDheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.

Photo of author: Nicolas AlvarezNicolas Alvarez is a Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team, focusing on building and optimizing recommerce data systems. He plays a key role in developing advanced technical solutions, including Apache Airflow implementations and front-end architecture for the team’s web presence. His work is crucial in enabling data-driven decision making for Amazon’s reverse logistics operations and improving the efficiency of end-of-lifecycle product management.

Photo of author: Lakshdeep VatsaLakshdeep Vatsa is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data and reporting solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for Reverse Logistics and ReCommerce operations. He is deeply passionate about enhancing self-service experiences for users and consistently seeks opportunities to utilize generative BI capabilities to solve complex customer challenges.

Photo of author: Karam MuppidiKaram Muppidi is a Senior Engineering Manager at Amazon Retail, leading data engineering, infrastructure, and analytics teams within the Worldwide Returns and ReCommerce organization. He specializes in using LLMs and multi-agent architectures to transform data analytics and drive organizational adoption of AI tools. He has extensive experience developing enterprise-scale data architectures, analytics services, and governance strategies using AWS and third-party tools. Prior to his current role, Karam developed petabyte-scale data and compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.

Photo of author: Sreeja DasSreeja Das is a Principal Engineer in the Returns and ReCommerce organization at Amazon. In her 10+ years at the company, she has worked at the intersection of high-scale distributed systems in eCommerce and Payments, Enterprise services, and Generative AI innovations. In her current role, Sreeja is focusing on system and data architecture transformation to enable better traceability and self-service in Returns and ReCommerce processes. Previously, she led architecture and tech strategy of some of Amazon’s core systems including order and refund processing systems and billing systems that serve tens of trillions of customer requests everyday.

Read More

Implement user-level access control for multi-tenant ML platforms on Amazon SageMaker AI


Managing access control in enterprise machine learning (ML) environments presents significant challenges, particularly when multiple teams share Amazon SageMaker AI resources within a single Amazon Web Services (AWS) account. Although Amazon SageMaker Studio provides user-level execution roles, this approach becomes unwieldy as organizations scale and team sizes grow. Refer to the Operating model whitepaper for best practices on account structure.

In this post, we discuss permission management strategies, focusing on attribute-based access control (ABAC) patterns that enable granular user access control while minimizing the proliferation of AWS Identity and Access Management (IAM) roles. We also share proven best practices that help organizations maintain security and compliance without sacrificing operational efficiency in their ML workflows.

Challenges with resource isolation across workloads

Consider a centralized account structure at a regulated enterprise, such as in finance or healthcare: a single ML platform team manages a comprehensive set of infrastructure that serves hundreds of data science teams across different business units. With such a structure, the platform team can implement consistent governance policies that enforce best practices. By centralizing these resources and controls, you can achieve better resource utilization, maintain security compliance and audit trails, and unify operational standards across ML initiatives. However, the challenge lies in maintaining workload isolation between teams and managing permissions between users of the same team.

With SageMaker AI, platform teams can create dedicated Amazon SageMaker Studio domains for each business unit, thereby maintaining resource isolation between workloads. Resources created within a domain are visible only to users within the same domain, and are tagged with the domain Amazon Resource Name (ARN). With tens or hundreds of domains, using a team-level or domain-level role compromises security and impairs auditing, whereas maintaining user-level roles leads to hundreds of roles to create and manage, and often runs into IAM service quotas.

We demonstrate how to implement ABAC that uses IAM policy variables to implement user-level access controls while maintaining domain-level execution roles, so you can scale IAM in SageMaker AI securely and effectively. We share some of the common scenarios and sample IAM policies to solve permissions management; however, the patterns can be extended to other services as well.

Key concepts

In this solution, we use two key IAM concepts: source identity and context keys.

An IAM source identity is a custom string that administrators can require to be passed when a role is assumed, and that is used to identify the person or application performing the actions. The source identity is logged to AWS CloudTrail and also persists through role chaining, which takes place when a role is assumed by a second role through the AWS Command Line Interface (AWS CLI) or API (refer to Roles terms and concepts for additional information).

In SageMaker Studio, if the domain is set up to use a source identity, the user profile name is passed as the source identity to any API calls made by the user from a user’s private space using signed requests (using AWS Signature Version 4). Source identity enables auditability because API requests made with the assumed execution role carry the attached source identity in their session context. If external AWS credentials (such as access keys) are used to access AWS services from the SageMaker Studio environment, the SourceIdentity from the execution role assumption will not be set for those credentials.

SageMaker Studio supports two condition context keys: sagemaker:DomainId and sagemaker:UserProfileName for certain actions related to SageMaker domains. These context keys are powerful IAM policy variables that make it possible for admins to create dynamic ABAC policies that automatically scope permissions based on a user’s identity and domain. As the name implies, the DomainId key can be used to scope actions to specific domains, and the UserProfileName key enables user-specific resource access patterns.

Although the source identity and sagemaker:UserProfileName can be used interchangeably in IAM policies, there are key differences:

  • Only SourceIdentity helps in auditing by tracking the user on CloudTrail
  • SageMaker AI context keys don’t persist on role chaining, whereas SourceIdentity does
  • SageMaker AI context keys are available only within SageMaker Studio
  • If only using the SageMaker AI context keys, sts:SourceIdentity doesn’t have to be set in the execution role’s trust policy

In the following sections, we explore a few common scenarios and share IAM policy samples using the preceding principles. With this approach, you can maintain user-level resource isolation without using user-level IAM roles, and adhere to principles of least privilege.

Prerequisites

Before implementing this solution, make sure your SageMaker Studio domain meets the following criteria:

  • Roles used with SageMaker AI (both SageMaker Studio and roles passed to SageMaker pipelines or processing jobs) have the sts:SetSourceIdentity permission in their trust policy. For example:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": ["sts:AssumeRole", "sts:SetSourceIdentity"]
        }
    ]
}
  • The SageMaker Studio domain is configured to propagate the user profile name as the source identity, with ExecutionRoleIdentityConfig set to USER_PROFILE_NAME. For example:
aws sagemaker update-domain \
    --domain-id <value> \
    --domain-settings-for-update "ExecutionRoleIdentityConfig=USER_PROFILE_NAME"
  • Roles used with SageMaker Studio have sagemaker:AddTags permission to enable the role to set tags on the SageMaker AI resources created by the role. The following policy example with a condition enables the role to only add tags at resource creation, and it can’t add or modify tags on existing SageMaker AI resources:
{
    "Sid": "AddTagsOnCreate",
    "Effect": "Allow",
    "Action": ["sagemaker:AddTags"],
    "Resource": ["arn:aws:sagemaker:<region>:<account_number>:*"],
    "Condition": {
        "Null": {
            "sagemaker:TaggingAction": "false"
        }
    }
}

When using ABAC-based approaches with SageMaker AI, access to the sagemaker:AddTags action must be tightly controlled, otherwise ownership of resources and therefore access to resources can be modified by changing the SageMaker AI resource tags.

Solution overview

In this post, we demonstrate how to use IAM policy variables and source identity to implement scalable, user-level access control in SageMaker AI. With this approach, you can do the following:

  • Implement user-level access control without user-level IAM roles
  • Enforce resource isolation between users
  • Maintain least privilege access principles
  • Maintain user-level access control to associated resources like Amazon Simple Storage Service (Amazon S3) buckets and AWS Secrets Manager secrets
  • Scale policy management and governance

In the following sections, we share some common scenarios for implementing access control and how it can be achieved using the SourceIdentity and policy variables.

We recommend aligning SageMaker user profile names with your other internal user identifiers so that profile names map unambiguously to platform users and remain unique.

SageMaker AI access control

When managing a shared SageMaker Studio domain with multiple users, administrators often need to implement resource-level access controls to prevent users from accessing or modifying each other’s resources. For instance, you might want to make sure data scientists can’t accidentally delete another team member’s endpoints or access SageMaker training jobs they don’t own. In such cases, the sagemaker:DomainId and sagemaker:UserProfileName keys can be used to place this restriction. See the following sample policy:

{
    "Sid": "TrainingJobPermissions",
    "Effect": "Allow",
    "Action": [
        "sagemaker:StopTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:UpdateTrainingJob"
    ],
    "Resource": "arn:aws:sagemaker:{region}:{account_number}:training-job/*",
    "Condition": {
        "StringLike": {
            "sagemaker:ResourceTag/sagemaker:user-profile-arn": "arn:aws:sagemaker:<region>:<account_number>:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
        }
        ...
    }
}

You can also write a similar policy using a source identity. The only limitation is that in this case, the domain ID must be specified or left as a wildcard. See the following example:

{
    "Sid": "TrainingJobPermissions",
    "Effect": "Allow",
    "Action": [
        "sagemaker:StopTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:UpdateTrainingJob"
    ],
    "Resource": "arn:aws:sagemaker:{region}:{account_number}:training-job/*",
    "Condition": {
        "StringLike": {
            "sagemaker:ResourceTag/sagemaker:user-profile-arn": "arn:aws:sagemaker:<region>:<account_number>:user-profile/<domain_id>/${aws:SourceIdentity}"
        }
        ...
    }
}

Amazon S3 access control

Amazon S3 is the primary storage service for SageMaker AI, and is deeply integrated across different job types. It enables efficient reading and writing of datasets, code, and model artifacts. Amazon S3 features like lifecycle policies, encryption, versioning, and IAM controls help keep SageMaker AI workflow data secure, durable, and cost-effective. Using a source identity with IAM identity policies, we can restrict users in SageMaker Studio and the jobs they launch to access only their own S3 prefix:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my_bucket",
            "Condition": {
                "StringLikeIfExists": {
                    "s3:prefix": ["my_domain/users/${aws:SourceIdentity}/*"]
                }
            }
        },
        {
            "Sid": "AccessBucketObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [
                "arn:aws:s3:::my_bucket/my_domain/users/${aws:SourceIdentity}/*"
            ]
        }
    ]
}

You can also implement deny policies on resource policies such as S3 bucket policies to make sure they can only be accessed by the appropriate user. The following is an example S3 bucket policy that only allows get, put, and delete object actions for user-a and user-b SageMaker Studio user profiles:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnlessMatchingUser",
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::my-bucket",
            "Condition": {
                "StringNotLike": {
                    "aws:SourceIdentity": ["user-a", "user-b"]
                }
            }
        }
    ]
}

Secrets Manager secret access control

With Secrets Manager, you can securely store and automatically rotate credentials such as database passwords, API tokens, and Git personal access tokens (PATs). With SageMaker AI, you can simply request the secret at runtime, so your notebooks, training jobs, and inference endpoints stay free of hard-coded keys. Secrets stored in AWS Secrets Manager are encrypted by AWS Key Management Service (AWS KMS) keys that you own and control. By granting the SageMaker AI execution role GetSecretValue permission on the specific secret ARN, your code can pull it with one API call, giving you audit trails in CloudTrail for your secrets access. Using the source identity, you can allow access to a user-specific hierarchy level of secrets, such as in the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UserSpecificSecretsAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
            ],
            "Resource": "arn:aws:secretsmanager:<region>:<account_number>:secret:user-secrets/${aws:SourceIdentity}/*"
        }
    ]
}

Amazon EMR cluster access control

SageMaker Studio users can create, discover, and manage Amazon EMR clusters directly from SageMaker Studio. The following policy restricts SageMaker Studio users’ access to EMR clusters by requiring that the cluster be tagged with a user key matching the user’s SourceIdentity. When SageMaker Studio users connect to EMR clusters, this makes sure they can only access clusters explicitly assigned to them through the tag, preventing unauthorized access and enabling user-level access control (refer to IAM policies for tag-based access to clusters and EMR notebooks and Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowClusterCreds",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:GetClusterSessionCredentials"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/user": "${aws:SourceIdentity}"
                }
            }
        }
    ]
}

File system access control in SageMaker training jobs

SageMaker training jobs can use file systems such as Amazon Elastic File System (Amazon EFS) or Amazon FSx for Lustre to efficiently handle large-scale data processing and model scaling workloads. This is particularly valuable when dealing with large datasets that require high throughput, when multiple training jobs need concurrent access to the same data, or when you want to maintain a persistent storage location that can be shared across different training jobs. This can be provided to the SageMaker training job using a FileSystemDataStore parameter.

When using a common file system across users, administrators might want to restrict access to specific folders, so users don’t overwrite each other’s work or data. This can be achieved using a condition called sagemaker:FileSystemDirectoryPath. In this pattern, each user has a directory on the file system that’s the same as their user profile name. You can then use an IAM policy such as the following to make sure the training jobs they run are only able to access their directory:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SageMakerLaunchJobs",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob"
            ],
            "Resource": "arn:aws:sagemaker:<region>:<account_number>:training-job/*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "sagemaker:FileSystemDirectoryPath": [
                        "/fsx/users/${aws:SourceIdentity}”
                    ]
                }
            }
        }
    ]
}
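For completeness, the following is a sketch of the training-job side: a create_training_job call whose file system directory is derived from the caller's user profile name, so the request satisfies the condition above. All identifiers (file system, image URI, role, VPC resources) are placeholders:

import boto3

sm = boto3.client("sagemaker")

USER = "user-a"  # in SageMaker Studio, this is the caller's user profile name (the SourceIdentity)

sm.create_training_job(
    TrainingJobName=f"{USER}-fsx-training-demo",
    AlgorithmSpecification={"TrainingImage": "<training-image-uri>", "TrainingInputMode": "File"},
    RoleArn="arn:aws:iam::<account_number>:role/MyDomainExecutionRole",
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-0123456789abcdef0",
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",
                # Must point at the user's own directory, or the IAM condition denies the call
                "DirectoryPath": f"/fsx/users/{USER}",
            }
        },
    }],
    OutputDataConfig={"S3OutputPath": "s3://my_bucket/output/"},
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    # File system access requires the job to run in the same VPC as the file system
    VpcConfig={"SecurityGroupIds": ["sg-0123456789abcdef0"], "Subnets": ["subnet-0123456789abcdef0"]},
)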

Monitor user access with the source identity

The previous examples demonstrate how the source identity can serve as a global context key to control which AWS resources a SageMaker Studio user can access, such as SageMaker training jobs or S3 prefixes. In addition to access control, the source identity is also valuable for monitoring individual user resource access from SageMaker Studio.

The source identity enables precise tracking of individual user actions in CloudTrail logs by propagating the SageMaker Studio user profile name as the sourceIdentity within the SageMaker Studio execution role session or a chained role. This means that API calls made from SageMaker Studio notebooks, SageMaker training and processing jobs, and SageMaker pipelines include the specific user’s identity in CloudTrail events. For more details on the supported scenarios, refer to Considerations when using sourceIdentity.

As a result, administrators can monitor and audit resource access at the individual user level rather than only by the IAM role, providing clearer visibility and stronger security even when users share the same execution role.

Monitor access to the AWS Glue Data Catalog with AWS Lake Formation permissions

You can also use source identity auditing capabilities to track which specific SageMaker Studio user accessed the AWS Glue Data Catalog with AWS Lake Formation permissions. When a user accesses an AWS Glue table governed by Lake Formation, the lakeformation:GetDataAccess API call is logged in CloudTrail. This event records not only the IAM role used, but also the sourceIdentity propagated from the SageMaker Studio user profile, enabling precise attribution of data access to the individual user.

By reviewing these CloudTrail logs, administrators can see which SageMaker Studio user (using the sourceIdentity field) accessed which Data Catalog resources, enhancing auditability and compliance. Refer to Apply fine-grained data access controls with Lake Formation and Amazon EMR from SageMaker Studio for additional information.
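As a minimal example of this kind of audit, the following sketch pulls recent GetDataAccess events from CloudTrail with boto3 and groups the accessed tables by the propagated source identity; the time window and grouping are illustrative:

import json
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

def data_access_by_user(hours: int = 24) -> dict[str, list[str]]:
    """Group Lake Formation GetDataAccess events from the last N hours by SageMaker Studio user."""
    end = datetime.now(timezone.utc)
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetDataAccess"}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )["Events"]

    access: dict[str, list[str]] = {}
    for event in events:
        detail = json.loads(event["CloudTrailEvent"])
        user = detail["userIdentity"].get("sessionContext", {}).get("sourceIdentity", "unknown")
        table = detail.get("requestParameters", {}).get("tableArn", "n/a")
        access.setdefault(user, []).append(table)
    return access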

Accessing an AWS Glue table with Amazon Athena

When a SageMaker Studio user queries an AWS Glue table through Amazon Athena using a library like the AWS SDK for Pandas, such as running a simple SELECT query from a SageMaker Studio notebook or from a SageMaker processing or training job, this access is logged by Lake Formation in CloudTrail as a GetDataAccess event. The event captures key details, including the IAM role used, the propagated sourceIdentity (which corresponds to the SageMaker Studio user profile name), the AWS Glue table and database accessed, the permissions used (for example, SELECT), and metadata like the Athena query ID.
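For example, such a query from a Studio notebook might look like the following sketch using the AWS SDK for pandas (awswrangler); the database and table names are hypothetical:

import awswrangler as wr  # AWS SDK for pandas

# Runs as the Studio user's execution role session, which carries the user profile
# name as its SourceIdentity, so the access is attributable in CloudTrail.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table LIMIT 10",
    database="my_database",
)
print(df.head())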

The following is a typical CloudTrail log entry for this event (simplified for readability):

{
    "userIdentity": {
        "type": "AssumedRole",
        "sessionContext": {
            "sessionIssuer": {
                "arn": "arn:aws:iam::012345678901:role/my_role"
            },
        "sourceIdentity": "STUDIO_USER_PROFILE_NAME"
        }
    },
    "eventTime": "2025-04-18T13:16:36Z",
    "eventSource": "lakeformation.amazonaws.com",
    "eventName": "GetDataAccess",
    "requestParameters": {
        "tableArn": "arn:aws:glue:us-east-1:012345678901:table/my_database/my_table",
        "permissions": [
            "SELECT"
        ],
        "auditContext": {
            "additionalAuditContext": "{queryId: XX-XX-XX-XX-XXXXXX}"
        }
    },
    "additionalEventData": {
        "requesterService": "ATHENA"
    }
}

Accessing an AWS Glue table with Amazon EMR

When a SageMaker Studio user queries an AWS Glue table through Amazon EMR (PySpark), such as running a simple SELECT query from a SageMaker Studio notebook connected to an EMR cluster with IAM runtime roles (see Configure IAM runtime roles for EMR cluster access in Studio) or from a SageMaker pipeline with an Amazon EMR step, this access is logged by Lake Formation in CloudTrail as a GetDataAccess event. The event captures key details, including the IAM role used, the propagated sourceIdentity (which corresponds to the SageMaker Studio user profile name), the AWS Glue table and database accessed, and the permissions used (for example, SELECT).

The following is a typical CloudTrail log entry for this event (simplified for readability):

{
    "userIdentity": {
        "type": "AssumedRole",
        "sessionContext": {
            "sessionIssuer": {
                "arn": "arn:aws:iam::012345678901:role/my-role"
            },
        "sourceIdentity": "STUDIO_USER_PROFILE_NAME"
        }
    },
    "eventTime": "2025-04-18T13:16:36Z",
    "eventSource": "lakeformation.amazonaws.com",
    "eventName": "GetDataAccess",
    "requestParameters": {
        "tableArn": "arn:aws:glue:us-east-1:012345678901:table/my_database/my_table",
        "permissions": [
            "SELECT"
        ]
    },
    "additionalEventData": {
        "LakeFormationAuthorizedSessionTag": "LakeFormationAuthorizedCaller:Amazon EMR",
    }
}

Best practices

To effectively secure and manage access in environments using ABAC, it’s important to follow proven best practices that enhance security, simplify administration, and maintain clear auditability. The following guidelines can help you implement ABAC with a source identity in a scalable and maintainable way:

  • Use consistent naming conventions – Use consistent naming conventions for resources and tags to make sure policies can reliably reference and enforce permissions based on attributes. Consistent tagging enables effective use of ABAC by matching sourceIdentity values with resource tags, simplifying policy management and reducing errors.
  • Enforce least privilege access – Apply least privilege access by granting only the permissions required to perform a task. Start with AWS managed policies for common use cases, then refine permissions by creating managed policies tailored to your specific needs. Use ABAC with sourceIdentity to dynamically restrict access based on user attributes, maintaining fine-grained, least-privilege permissions aligned with IAM best practices.
  • Audit user access – Regularly audit user access by reviewing CloudTrail logs that include source identity, which captures individual SageMaker Studio user actions even when roles are shared. This provides precise visibility into who accessed which resources and when. For details, see Monitoring user resource access from SageMaker AI Studio Classic with sourceIdentity.
  • Standardize identity-based policies with conditions – Standardize policies incorporating condition context keys to maintain consistent and precise access control while simplifying management across your SageMaker AI environment. For examples and best practices on creating identity-based policies with conditions, refer to Amazon SageMaker AI identity-based policy examples.

Refer to SageMaker Studio Administration Best Practices for additional information on identity and permission management.

Conclusion

In this post, we demonstrated how to implement user-level access control in SageMaker Studio without the overhead of managing individual IAM roles. By combining SageMaker AI resource tags, SageMaker AI context keys, and source identity propagation, you can create dynamic IAM policies that automatically scope permissions based on user identity while maintaining shared execution roles. We showed how to apply these patterns across various AWS services, including SageMaker AI, Amazon S3, Secrets Manager, and Amazon EMR. Additionally, we discussed how the source identity enhances monitoring by propagating the SageMaker Studio user profile name into CloudTrail logs, enabling precise tracking of individual user access to resources like SageMaker jobs and Data Catalog tables. This includes access using Athena and Amazon EMR, providing administrators with clear, user-level visibility for stronger security and compliance across shared execution roles. We encourage you to implement these user-level access control techniques today and experience the benefits of simplified administration and compliance tracking.


About the authors

author-surydurgDurga Sury is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML platforms. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.

author-itziarmItziar Molina Fernandez is a Machine Learning Engineer in the AWS Professional Services team. In her role, she works with customers building large-scale machine learning platforms and generative AI use cases on AWS. In her free time, she enjoys cycling, reading, and exploring new places.

author-parrwmWill Parr is a Machine Learning Engineer at AWS Professional Services, helping customers build scalable ML platforms and production-ready generative AI solutions. With deep expertise in MLOps and cloud-based architecture, he focuses on making machine learning reliable, repeatable, and impactful. Outside of work, he can be found on a tennis court or hiking in the mountains.

Read More

Long-running execution flows now supported in Amazon Bedrock Flows in public preview


Today, we announce the public preview of long-running execution (asynchronous) flow support within Amazon Bedrock Flows. With Amazon Bedrock Flows, you can link foundation models (FMs), Amazon Bedrock Prompt Management, Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and other AWS services together to build and scale predefined generative AI workflows.

As customers across industries build increasingly sophisticated applications, they’ve shared feedback about needing to process larger datasets and run complex workflows that take longer than a few minutes to complete. Many customers told us they want to transform entire books, process massive documents, and orchestrate multi-step AI workflows without worrying about runtime limits, highlighting the need for a solution that can handle long-running background tasks. To address those concerns, Amazon Bedrock Flows introduces a new feature in public preview that extends workflow execution time from 5 minutes (synchronous) to 24 hours (asynchronous).

With Amazon Bedrock long-running execution flows (asynchronous), you can chain together multiple prompts, AI services, and Amazon Bedrock components into complex, long-running workflows (up to 24 hours asynchronously). The new capabilities include built-in execution tracing directly using the AWS Management Console and Amazon Bedrock Flow API for observability. These enhancements significantly streamline workflow development and management in Amazon Bedrock Flows, helping you focus on building and deploying your generative AI applications.

By decoupling the workflow execution (asynchronously through long-running flows that can run for up to 24 hours) from the user’s immediate interaction, you can now build applications that can handle large payloads that take longer than 5 minutes to process, perform resource-intensive tasks, apply multiple rules for decision-making, and even run the flow in the background while integrating with multiple systems—while providing your users with a seamless and responsive experience.

Solution overview

Organizations using Amazon Bedrock Flows now can use long-running execution flow capabilities to design and deploy long-running workflows for building more scalable and efficient generative AI applications. This feature offers the following benefits:

  • Long-running workflows – You can run long-running workflows (up to 24 hours) as background tasks and decouple workflow execution from the user’s immediate interaction.
  • Large payloads – The feature enables large payload processing and resource-intensive tasks that can run for up to 24 hours instead of the previous 5-minute limit.
  • Complex use cases – It can manage the execution of complex, multi-step decision-making generative AI workflows that integrate multiple external systems.
  • Builder-friendly – You can create and manage long-running execution flows through both the Amazon Bedrock API and Amazon Bedrock console.
  • Observability – You can enjoy a seamless user experience with the ability to check flow execution status and retrieve results accordingly. The feature also provides traces so you can view the inputs and outputs from each node.

Dentsu, a leading advertising agency and creative powerhouse, needs to handle complex, multi-step generative AI use cases that require longer execution time. One use case is their Easy Reading application, which converts books with many chapters and illustrations into easily readable formats to enable people with intellectual disabilities to access literature. With Amazon Bedrock long-running execution flows, now Dentsu can:

  • Process large inputs and perform complex, resource-intensive tasks within the workflow. Prior to long-running execution flows, input size was limited by the 5-minute execution limit of flows.
  • Integrate multiple external systems and services into the generative AI workflow.
  • Support both quick, near real-time workflows and longer-running, more complex workflows.

“Amazon Bedrock has been amazing to work with and demonstrate value to our clients,” says Victoria Aiello, Innovation Director, Dentsu Creative Brazil. “Using traces and flows, we are able to show how processing happens behind the scenes of the work AI is performing, giving us better visibility and accuracy on what’s to be produced. For the Easy Reading use case, long-running execution flows will allow for processing of the entire book in one go, taking advantage of the 24-hour flow execution time instead of writing custom code to manage multiple sections of the book separately. This saves us time when producing new books or even integrating with different models; we can test different results according to the needs or content of each book.”

Let’s explore how the new long-running execution flow capability in Amazon Bedrock Flows enables Dentsu to build a more efficient and long-running book processing generative AI application. The following diagram illustrates the end-to-end flow of Dentsu’s book processing application. The process begins when a client uploads a book to Amazon Simple Storage Service (Amazon S3), triggering a flow that processes multiple chapters, where each chapter undergoes accessibility transformations and formatting according to specific user requirements. The transformed chapters are then collected, combined with a table of contents, and stored back in Amazon S3 as a final accessible document. This long-running execution (asynchronous) flow can handle large books efficiently, processing them within the 24-hour execution window while providing status updates and traceability throughout the transformation process.

Detailed AWS workflow diagram showing three-step book conversion process: upload, asynchronous processing, and status monitoring

In the following sections, we demonstrate how to create a long-running execution flow in Amazon Bedrock Flows using Dentsu’s real-world use case of books transformation.

Prerequisites

Before implementing the new capabilities, make sure you have the following:

After these components are in place, you can implement Amazon Bedrock long-running execution flow capabilities in your generative AI use case.

Create a long-running execution flow

Complete the following steps to create your long-running execution flow:

  1. On the Amazon Bedrock console, in the navigation pane under Builder tools, choose Flows.
  2. Choose Create a flow.
  3. Provide a name for your new flow, for example, easy-read-long-running-flow.

For detailed instructions on creating a flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability. Amazon Bedrock provides different node types to build your prompt flow.

The following screenshot shows the high-level flow of Dentsu’s book conversion generative AI-powered application. The workflow demonstrates a sequential process from input handling through content transformation to final storage and delivery.

AWS Bedrock Flow Builder interface displaying easy-read-long-running-flow with connected components for document processing and storage

The following list outlines the core components and nodes within the preceding workflow, designed for document processing and accessibility transformation:

  • Flow Input: Entry point accepting an array of S3 prefixes (chapters) and the accessibility profile
  • Iterator: Processes each chapter (prefix) individually
  • S3 Retrieval: Downloads chapter content from the specified Amazon S3 location
  • Easifier: Applies accessibility transformation rules to the chapter content
  • HTML Formatter: Formats the transformed content with appropriate HTML structure
  • Collector: Assembles the transformed chapters while maintaining order
  • Lambda Function: Combines the chapters into a single document with a table of contents
  • S3 Storage: Stores the final transformed document in Amazon S3
  • Flow Output: Returns the Amazon S3 location of the transformed book with metadata

Test the book processing flow

We are now ready to test the flow through the Amazon Bedrock console or API. We use a fictional book called “Beyond Earth: Humanity’s Journey to the Stars.” This book tells the story of humanity’s greatest adventure beyond our home planet, tracing our journey from the first satellites and moonwalks to space stations and robotic explorers that continue to unveil the mysteries of our solar system.

  1. On the Amazon Bedrock console, choose Flows in the navigation pane.
  2. Choose the flow (easy-read-long-running-flow) and choose Create execution.

The flow must be in the Prepared state before creating an execution.

The Execution tab shows the previous executions for the selected flow.

AWS Bedrock Flow details page showing flow configuration and execution status

  3. Provide the following input, which uses the dyslexia accessibility profile:


{
  "chapterPrefixes": [
    "books/beyond-earth/chapter_1.txt",
    "books/beyond-earth/chapter_2.txt",
    "books/beyond-earth/chapter_3.txt"
  ],
  "metadata": {
    "accessibilityProfile": "dyslexia",
    "bookId": "beyond-earth-002",
    "bookTitle": "Beyond Earth: Humanity's Journey to the Stars"
  }
}

These are the different chapters of our book that need to be transformed.

  4. Choose Create.

AWS Bedrock Flow execution setup modal with name, alias selection, and JSON configuration for book processing

Amazon Bedrock Flows initiates the long-running execution (asynchronous) flow of our workflow. The dashboard displays the executions of our flow with their respective statuses (Running, Succeeded, Failed, TimedOut, Aborted). When an execution is marked as Succeeded, the results become available in our designated S3 bucket.

AWS Bedrock Flow dashboard displaying flow details and active execution status for easy-read implementation

Choosing an execution takes you to the summary page containing its details. The Overview section displays start and end times, plus the execution Amazon Resource Name (ARN)—a unique identifier that’s essential for troubleshooting specific executions later.

AWS execution interface with status, summary details, and workflow diagram of connected services

When you select a node in the flow builder, its configuration details appear. For instance, choosing the Easifier node reveals the prompt used, the selected model (here it’s Amazon Nova Lite), and additional configuration parameters. This is essential information for understanding how that specific component is set up.

AWS Bedrock interface with LLM settings, prompt configuration, and service workflow visualization

The system also provides access to execution traces, offering detailed insights into each processing step, tracking real-time performance metrics, and highlighting issues that occurred during the flow’s execution. Traces can be enabled through the API and sent to Amazon CloudWatch Logs: set the enableTrace field to true in an InvokeFlow request, and each flowOutputEvent in the response is returned alongside a flowTraceEvent.
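
As a minimal sketch, the following shows how traces could be enabled on an InvokeFlow request with the AWS SDK for Python (Boto3); the flow ID, alias ID, and input node name are placeholders that depend on your own flow.

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_flow(
    flowIdentifier="<FLOW_ID>",
    flowAliasIdentifier="<FLOW_ALIAS_ID>",
    enableTrace=True,  # Emit flowTraceEvent entries alongside the output events
    inputs=[
        {
            "nodeName": "FlowInputNode",  # Name of your flow's input node
            "nodeOutputName": "document",
            "content": {
                "document": {
                    "chapterPrefixes": ["books/beyond-earth/chapter_1.txt"],
                    "metadata": {"accessibilityProfile": "dyslexia"},
                }
            },
        }
    ],
)

# The response is an event stream; trace events arrive alongside output events
for event in response["responseStream"]:
    if "flowTraceEvent" in event:
        print("Trace:", event["flowTraceEvent"])
    elif "flowOutputEvent" in event:
        print("Output:", event["flowOutputEvent"])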

AWS flow execution trace showing processing steps for book chapter conversion

We have now successfully created and executed a long-running execution flow. You can also use Amazon Bedrock APIs to programmatically start, stop, list, and get flow executions. For more details on how to configure flows with enhanced safety and traceability, refer to Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
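
As a rough sketch only, programmatic control of executions could look like the following. The operation and parameter names below are assumptions based on the flow execution APIs described in this post; verify them against the current Boto3 documentation for the Amazon Bedrock Agents runtime before use.

import boto3

client = boto3.client("bedrock-agent-runtime")

# Start a long-running (asynchronous) execution of the prepared flow
start = client.start_flow_execution(
    flowIdentifier="<FLOW_ID>",
    flowAliasIdentifier="<FLOW_ALIAS_ID>",
    inputs=[
        {
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document",
            "content": {"document": {"chapterPrefixes": ["books/beyond-earth/chapter_1.txt"]}},
        }
    ],
)

# Check the status of the execution (Running, Succeeded, Failed, TimedOut, Aborted)
execution = client.get_flow_execution(
    flowIdentifier="<FLOW_ID>",
    flowAliasIdentifier="<FLOW_ALIAS_ID>",
    executionIdentifier=start["executionArn"],
)
print(execution["status"])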

Conclusion

The integration of long-running execution flows in Amazon Bedrock Flows represents a significant advancement in generative AI development. With these capabilities, you can create more efficient AI-powered solutions to automate long-running operations, addressing critical challenges in the rapidly evolving field of AI application development.

Long-running execution flow support in Amazon Bedrock Flows is now available in public preview in AWS Regions where Amazon Bedrock Flows is available, except for the AWS GovCloud (US) Regions. To get started, open the Amazon Bedrock console or use the APIs to begin building flows with long-running executions. To learn more, see Create your first flow in Amazon Bedrock and Track each step in your flow by viewing its trace in Amazon Bedrock.

We’re excited to see the innovative applications you will build with these new capabilities. As always, we welcome your feedback through AWS re:Post for Amazon Bedrock or your usual AWS contacts. Join the generative AI builder community at community.aws to share your experiences and learn from others.


About the authors

Shubhankar Sumar is a Senior Solutions Architect at AWS, where he specializes in architecting generative AI-powered solutions for enterprise software and SaaS companies across the UK. With a strong background in software engineering, Shubhankar excels at designing secure, scalable, and cost-effective multi-tenant systems on the cloud. His expertise lies in seamlessly integrating cutting-edge generative AI capabilities into existing SaaS applications, helping customers stay at the forefront of technological innovation.

Amit Lulla is a Principal Solutions Architect at AWS, where he architects enterprise-scale generative AI and machine learning solutions for software companies. With over 15 years in software development and architecture, he’s passionate about turning complex AI challenges into bespoke solutions that deliver real business value. When he’s not architecting cutting-edge systems or mentoring fellow architects, you’ll find Amit on the squash court, practicing yoga, or planning his next travel adventure. He also maintains a daily meditation practice, which he credits for keeping him centered in the fast-paced world of AI innovation.

Huong Nguyen is a Principal Product Manager at AWS. She leads Amazon Bedrock Flows and has 18 years of experience building customer-centric and data-driven products. She is passionate about democratizing responsible machine learning and generative AI to enable customer experience and business innovation. Outside of work, she enjoys spending time with family and friends, listening to audiobooks, traveling, and gardening.

Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He partners with enterprise customers to architect, optimize, and deploy production-grade AI solutions leveraging AWS’s comprehensive machine learning stack. Christian specializes in inference optimization techniques that balance performance, cost, and latency requirements for large-scale deployments. In his spare time, Christian enjoys exploring nature and spending time with family and friends.

Jeremy Bartosiewicz is a Senior Solutions Architect at AWS, with over 15 years of experience working in technology in multiple roles. Coming from a consulting background, Jeremy enjoys working on a multitude of projects that help organizations grow using cloud solutions. He helps support large enterprise customers at AWS and is part of the Advertising and Machine Learning TFCs.

Read More

Fraud detection empowered by federated learning with the Flower framework on Amazon SageMaker AI

Fraud detection empowered by federated learning with the Flower framework on Amazon SageMaker AI

Fraud detection remains a significant challenge in the financial industry, requiring advanced machine learning (ML) techniques to detect fraudulent patterns while maintaining compliance with strict privacy regulations. Traditional ML models often rely on centralized data aggregation, which raises concerns about data security and regulatory constraints.

Fraud cost businesses over $485.6 billion in 2023 alone, according to Nasdaq’s Global Financial Crime Report, with financial institutions under pressure to keep up with evolving threats. Traditional fraud models often rely on isolated data, leading to overfitting and poor real-world performance. Data privacy laws like GDPR and CCPA further limit collaboration. With federated learning using Amazon SageMaker AI, organizations can jointly train models without sharing raw data, boosting accuracy while maintaining compliance.

In this post, we explore how SageMaker and federated learning help financial institutions build scalable, privacy-first fraud detection systems.

Federated learning with the Flower framework on SageMaker AI

With federated learning, multiple institutions can train a shared model while keeping their data decentralized, addressing privacy and security concerns in fraud detection. A key advantage of this approach is that it mitigates the risk of overfitting by learning from a wider distribution of fraud patterns across various datasets. It allows financial institutions to collaborate while maintaining strict data privacy, making sure that no single entity has access to another’s raw data. This not only improves fraud detection accuracy but also adheres to industry regulations and compliance requirements.

Popular frameworks for federated learning include Flower, PySyft, TensorFlow Federated (TFF), and FedML. Among these, the Flower framework stands out for being framework-agnostic, a key advantage that allows it to seamlessly integrate with a wide range of tools such as PyTorch, TensorFlow, Hugging Face, scikit-learn, and more.

Although SageMaker is powerful for centralized ML workflows, Flower is purpose-built for decentralized model training, enabling secure collaboration across data silos without exposing raw data. When deployed on SageMaker, Flower takes advantage of the cloud system’s scalability and automation while enabling flexible, privacy-preserving federated learning workflows. This combination improves time to production, reduces engineering complexity, and supports strict data governance, making it highly suitable for cross-institutional or regulated environments.

Generating synthetic data with Synthetic Data Vault

To strengthen fraud detection while preserving data privacy, organizations can use the Synthetic Data Vault (SDV), a Python library that generates realistic synthetic datasets reflecting real-world patterns. Teams can use SDV to simulate diverse fraud scenarios without exposing sensitive information, helping federated learning models generalize better and detect subtle, evolving fraud tactics. It also helps address data imbalance by amplifying underrepresented fraud cases, improving model accuracy and robustness.

Beyond data generation, SDV captures complex statistical relationships and accelerates model development by reducing dependence on expensive, hard-to-obtain labeled data. In our approach, synthetic data is used primarily as a validation dataset, supporting privacy and consistency across environments, and training datasets can be real or synthetic depending on audit and compliance requirements. This flexibility supports privacy-by-design principles while maintaining adaptability in regulated environments.
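
As an illustration, the following minimal sketch uses the SDV single-table API to learn a synthesizer from a small sample of transactions and oversample synthetic records. The column names and the choice of synthesizer are assumptions for illustration, not part of the production setup described here.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Small illustrative sample; in practice this would be an institution's real data
real_transactions = pd.DataFrame({
    "amount": [120.50, 89.90, 4500.00, 15.00, 310.75],
    "merchant_category": ["grocery", "travel", "electronics", "grocery", "travel"],
    "is_fraud": [0, 0, 1, 0, 0],
})

# Learn column types and statistical relationships from the real data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_transactions)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_transactions)

# Generate a larger synthetic validation set without exposing real records
synthetic_transactions = synthesizer.sample(num_rows=10_000)
print(synthetic_transactions["is_fraud"].value_counts())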

A fair evaluation approach for federated learning models

A critical aspect of federated learning is facilitating a fair and unbiased evaluation of trained models. To achieve this, organizations must adopt a structured dataset strategy. As illustrated in the following figure, Dataset A and Dataset B are used as separate training datasets, with each participating institution contributing distinct datasets that capture different fraud patterns. Instead of evaluating the model using only one dataset, a combined dataset of A and B is used for evaluation. This makes sure that the model is tested on a more comprehensive distribution of real-world fraud cases, helping reduce bias and improve fairness in assessment.

By adopting this evaluation method, organizations can validate the model’s ability to generalize across different data distributions. This approach makes sure fraud detection models aren’t overly reliant on a single institution’s data, improving robustness against evolving fraud tactics. Standard evaluation metrics such as precision, recall, F1-score, and AUC-ROC are used to measure model performance. In the insurance sector, particular attention is given to false negatives—cases where fraudulent claims are missed—because these directly translate to financial losses. Minimizing false negatives is critical to protect against undetected fraud, while also making sure the model performs consistently and fairly across diverse datasets in a federated learning environment.
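
The following sketch shows how this combined evaluation might be computed with scikit-learn; the labels and scores are placeholders standing in for the federated model’s predictions on the combined Dataset A and Dataset B evaluation set.

import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Placeholder ground truth and model scores for the combined A+B evaluation set
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.10, 0.30, 0.82, 0.41, 0.22, 0.91, 0.05, 0.60, 0.73, 0.15])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_prob))
# False negatives are missed fraud cases that translate directly to losses
print("False negatives:", fn)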

Solution overview

The following diagram illustrates how we implemented this approach across two AWS accounts using SageMaker AI and cross-account virtual private cloud (VPC) peering.

Flower supports a wide range of ML frameworks, including PyTorch, TensorFlow, Hugging Face, JAX, Pandas, fast.ai, PyTorch Lightning, MXNet, scikit-learn, and XGBoost. When deploying federated learning on SageMaker, Flower enables a distributed setup where multiple institutions can collaboratively train models while keeping data private. Each participant trains a local model on its own dataset and shares only model updates—not raw data—with a central server. SageMaker orchestrates the complete training, validation, and evaluation process securely and efficiently. The final model remains consistent with the original framework, making it deployable to a SageMaker endpoint using its supported framework container.
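
To make the client side concrete, the following is a minimal sketch of a Flower NumPyClient, assuming the Flower 1.x API and a scikit-learn model. The server address and the synthetic local data are placeholders; in the setup described above, each institution would run a client like this in its own account, reaching the Flower server over the VPC peering connection.

import flwr as fl
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder local data; each institution loads its own private dataset instead
X_local = np.random.rand(1000, 20)
y_local = np.random.randint(0, 2, 1000)

model = LogisticRegression(max_iter=50, warm_start=True)
model.fit(X_local, y_local)  # Initialize coefficient shapes

class FraudClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Only model weights leave the institution, never raw data
        return [model.coef_, model.intercept_]

    def fit(self, parameters, config):
        model.coef_, model.intercept_ = parameters
        model.fit(X_local, y_local)  # One local training round
        return [model.coef_, model.intercept_], len(X_local), {}

    def evaluate(self, parameters, config):
        model.coef_, model.intercept_ = parameters
        accuracy = model.score(X_local, y_local)
        return 1.0 - accuracy, len(X_local), {"accuracy": accuracy}

# Placeholder address of the Flower server reachable through VPC peering
fl.client.start_numpy_client(server_address="10.0.0.1:8080", client=FraudClient())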

To facilitate a smooth and scalable implementation, SageMaker AI provides built-in features for model orchestration, hyperparameter tuning, and automated monitoring. Institutions can continuously improve their models based on the latest fraud patterns without requiring manual updates. Additionally, integrating SageMaker AI with AWS services such as AWS Identity and Access Management (IAM) enhances security and compliance.

For more information, refer to the Flower Federated Learning Workshop, which provides detailed guidance on setting up and running federated learning workloads effectively. By integrating federated learning, synthetic data generation, and structured evaluation strategies, you can develop robust fraud detection systems that are both scalable and privacy-preserving.

Results and key takeaways

The implementation of federated learning for fraud detection has demonstrated significant improvements in model performance and fraud detection accuracy. By training on diverse datasets, the model captures a broader range of fraud patterns, helping reduce bias and overfitting. The incorporation of SDV-generated datasets facilitates a well-rounded training process, improving generalization to real-world fraud scenarios. The federated learning framework on SageMaker enables organizations to scale their fraud detection models while maintaining compliance with data privacy regulations.

Through this approach, organizations have observed a reduction in false positives, helping fraud analysts focus on high-risk transactions more effectively. The ability to train models on a wider range of fraud patterns across multiple institutions has led to a more comprehensive and accurate fraud detection system. Future optimizations might include refining synthetic data techniques and expanding federated learning participation to further enhance fraud detection capabilities.

Conclusion

The Flower framework provides a scalable, privacy-preserving approach to fraud detection by using federated learning on SageMaker AI. By combining decentralized training, synthetic data generation, and fair evaluation strategies, financial institutions can enhance model accuracy while maintaining compliance with regulations. Shin Kong Financial Holding and Shin Kong Life successfully adopted this approach, as highlighted in their official blog post. This methodology sets a new standard for financial security applications, paving the way for broader adoption of federated learning.

Although using Flower on SageMaker for federated learning offers strong privacy and scalability benefits, there are some limitations to consider. Technically, managing heterogeneity across clients (such as different data schemas, compute capacities, or model architectures) can be complex. From a use case perspective, federated learning might not be ideal for scenarios requiring real-time inference or highly synchronous updates, and it depends on stable connectivity across participating nodes. To address these challenges, organizations are exploring the use of high-quality synthetic datasets that preserve data distributions while protecting privacy, improving model generalization and robustness. Next steps include experimenting with these datasets, using the Flower Federated Learning Workshop for hands-on guidance, reviewing the system architecture for deeper understanding, and engaging with the AWS account team to tailor and scale your federated learning solution.


About the Authors

Ray Wang is a Senior Solutions Architect at AWS. With 12 years of experience in backend development and consulting, Ray is dedicated to building modern solutions in the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he passed all 12 AWS certifications to increase the breadth and depth of his technical knowledge. He loves to read and watch sci-fi movies in his spare time.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

James Chan is a Solutions Architect at AWS specializing in the Financial Services Industry (FSI). With extensive experience in the financial services, fintech, and manufacturing sectors, James helps FSI customers at AWS innovate and build scalable cloud solutions and financial system architectures. James specializes in AWS containers, network architecture, and generative AI solutions that combine cloud-native technologies with strict financial compliance requirements.

Mike Xu is an Associate Solutions Architect specializing in AI/ML at Amazon Web Services. He works with customers to design machine learning solutions using services like Amazon SageMaker and Amazon Bedrock. With a background in computer engineering and a passion for generative AI, Mike focuses on helping organizations accelerate their AI/ML journey in the cloud. Outside of work, he enjoys producing electronic music and exploring emerging tech.

Read More

Building intelligent AI voice agents with Pipecat and Amazon Bedrock – Part 2

Building intelligent AI voice agents with Pipecat and Amazon Bedrock – Part 2

Voice AI is changing the way we use technology, allowing for more natural and intuitive conversations. Meanwhile, advanced AI agents can now understand complex questions and act autonomously on our behalf.

In Part 1 of this series, you learned how you can use the combination of Amazon Bedrock and Pipecat, an open source framework for voice and multimodal conversational AI agents, to build applications with human-like conversational AI. You learned about common use cases of voice agents and the cascaded models approach, where you orchestrate several components to build your voice AI agent.

In this post (Part 2), you explore how to use Amazon Nova Sonic, a speech-to-speech foundation model, and the benefits of using a unified model.

Architecture: Using Amazon Nova Sonic speech-to-speech

Amazon Nova Sonic is a speech-to-speech foundation model that delivers real-time, human-like voice conversations with industry-leading price performance and low latency. While the cascaded models approach outlined in Part 1 is flexible and modular, it requires orchestration of automatic speech recognition (ASR), natural language processing (NLU), and text-to-speech (TTS) models. For conversational use cases, this might introduce latency and result in loss of tone and prosody. Nova Sonic combines these components into a unified model that processes audio in real time with a single forward pass, reducing latency while streamlining development.

By unifying these capabilities, the model can dynamically adjust voice responses based on the acoustic characteristics and conversational context of the input, creating more fluid and contextually appropriate dialogue. The system recognizes conversational subtleties such as natural pauses, hesitations, and turn-taking cues, allowing it to respond at appropriate moments and seamlessly manage interruptions during conversation. Amazon Nova Sonic also supports tool use and agentic RAG with Amazon Bedrock Knowledge Bases, enabling your voice agents to retrieve information. Refer to the following figure to understand the end-to-end flow.

End-to-end architecture diagram of voice-enabled AI agent orchestrated by Pipecat, featuring real-time processing and AWS services

The choice between the two approaches depends on your use case. While the capabilities of Amazon Nova Sonic are state-of-the-art, the cascaded models approach outlined in Part 1 might be suitable if you require additional flexibility or modularity for advanced use cases.

AWS collaboration with Pipecat

To achieve a seamless integration, AWS collaborated with the Pipecat team to support Amazon Nova Sonic in version v0.0.67, making it straightforward to integrate state-of-the-art speech capabilities into your applications.

Kwindla Hultman Kramer, Chief Executive Officer at Daily.co and Creator of Pipecat, shares his perspective on this collaboration:

“Amazon’s new Nova Sonic speech-to-speech model is a leap forward for real-time voice AI. The bidirectional streaming API, natural-sounding voices, and robust tool-calling capabilities open up exciting new possibilities for developers. Integrating Nova Sonic with Pipecat means we can build conversational agents that not only understand and respond in real time, but can also take meaningful actions; like scheduling appointments or fetching information-directly through natural conversation. This is the kind of technology that truly transforms how people interact with software, making voice interfaces faster, more human, and genuinely useful in everyday workflows.”

“Looking forward, we’re thrilled to collaborate with AWS on a roadmap that helps customers reimagine their contact centers with integration to Amazon Connect and harness the power of multi-agent workflows through the Strands agentic framework. Together, we’re enabling organizations to deliver more intelligent, efficient, and personalized customer experiences—whether it’s through real-time contact center transformation or orchestrating sophisticated agentic workflows across industries.”

Getting started with Amazon Nova Sonic and Pipecat

To guide your implementation, we provide a comprehensive code example that demonstrates the basic functionality. This example shows how to build a complete voice AI agent with Amazon Nova Sonic and Pipecat.

Prerequisites

Before using the provided code examples with Amazon Nova Sonic, make sure that you have the following:

Implementation steps

After you complete the prerequisites, you can start setting up your sample voice agent:

  1. Clone the repository:
git clone https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock
cd build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock/part-2

  2. Set up a virtual environment:
cd server
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

  3. Create a .env file with your credentials:
DAILY_API_KEY=your_daily_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
  4. Start the server:
python server.py
  5. Connect using a browser at http://localhost:7860 and grant microphone access.
  6. Start the conversation with your AI voice agent.

Customize your voice AI agent

To customize your voice AI agent, start by:

  1. Modifying bot.py to change conversation logic.
  2. Adjusting model selection in bot.py for your latency and quality needs.

To learn more, see the README of our code sample on GitHub.

Clean up

The preceding instructions are for setting up the application in your local environment. The local application uses AWS services and Daily through IAM and API credentials. For security and to avoid unanticipated costs, when you’re finished, delete these credentials so that they can no longer be accessed.

Amazon Nova Sonic and Pipecat in action

The demo showcases a scenario for an intelligent healthcare assistant. The demo was presented at the AWS Summit Sydney 2025 keynote by Rada Stanic, Chief Technologist, and Melanie Li, Senior Specialist Solutions Architect – Generative AI.

The demo showcases a simple fun facts voice agent in a local environment using SmallWebRTCTransport. As the user speaks, the voice agent provides real-time transcription, displayed in the terminal.

Enhancing agentic capabilities with Strands Agents

A practical way to boost agentic capability and understanding is to implement a general tool call that delegates tool selection to an external agent such as a Strands Agent. The delegated Strands Agent can then reason or think about your complex query, perform multi-step tasks with tool calls, and return a summarized response.

To illustrate, let’s review a simple example. If the user asks a question like: “What is the weather like near the Seattle Aquarium?”, the voice agent can delegate to a Strands agent through a general tool call such as handle_query.

The Strands agent will handle the query and think about the task, for example:

<thinking>I need to get the weather information for the Seattle Aquarium. To do this, I need the latitude and longitude of the Seattle Aquarium. I will first use the 'search_places' tool to find the coordinates of the Seattle Aquarium.</thinking> 

The Strands agent will then execute the search_places tool call and a subsequent get_weather tool call, and return a response to the parent agent as part of the handle_query tool call. This is also known as the agents as tools pattern.
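
A minimal sketch of this pattern with the Strands Agents SDK might look like the following; handle_query, search_places, and get_weather are hypothetical tool names used for illustration, with stubbed implementations standing in for real place-search and weather APIs.

from strands import Agent, tool

@tool
def search_places(query: str) -> dict:
    """Return coordinates for a named place (stubbed for illustration)."""
    return {"name": query, "latitude": 47.6076, "longitude": -122.3430}

@tool
def get_weather(latitude: float, longitude: float) -> str:
    """Return current weather for the given coordinates (stubbed for illustration)."""
    return f"Partly cloudy, 18°C at ({latitude}, {longitude})"

# The delegated Strands agent reasons about the query and chains tool calls as needed
delegate_agent = Agent(tools=[search_places, get_weather])

@tool
def handle_query(query: str) -> str:
    """General tool exposed to the voice agent; delegates the whole query."""
    result = delegate_agent(query)
    return str(result)

The voice agent only needs to expose handle_query; the delegated agent decides which tools to call and returns a summarized answer to the parent.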

To learn more, see the example in our hands-on workshop.

Conclusion

Building intelligent AI voice agents is more accessible than ever through the combination of open source frameworks such as Pipecat, and powerful foundation models on Amazon Bedrock.

In this series, you learned about two common approaches for building AI voice agents. In Part 1, you learned about the cascaded models approach, diving into each component of a conversational AI system. In Part 2, you learned how using Amazon Nova Sonic, a speech-to-speech foundation model, can simplify implementation and unify these components into a single model architecture. Looking ahead, stay tuned for exciting developments in multimodal foundation models, including the upcoming Nova any-to-any models—these innovations will continually improve your voice AI applications.

Resources

To learn more about voice AI agents, see the following resources:

To get started with your own voice AI project, contact your AWS account team to explore an engagement with AWS Generative AI Innovation Center (GAIIC).


About the Authors

Adithya Suresh is a Deep Learning Architect at AWS Generative AI Innovation Center based in Sydney, where he collaborates directly with enterprise customers to design and scale transformational generative AI solutions for complex business challenges. He leverages AWS generative AI services to build bespoke AI systems that drive measurable business value across diverse industries.

Daniel Wirjo is a Solutions Architect at AWS, with focus across AI and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.

Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.

Melanie Li, PhD is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Osman Ipek is a seasoned Solutions Architect on Amazon’s Artificial General Intelligence team, specializing in Amazon Nova foundation models. With over 12 years of experience in software and machine learning, he has driven innovative Alexa product experiences reaching millions of users. His expertise spans voice AI, natural language processing, large language models and MLOps, with a passion for leveraging AI to create breakthrough products.

Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.

Read More

Uphold ethical standards in fashion using multimodal toxicity detection with Amazon Bedrock Guardrails

Uphold ethical standards in fashion using multimodal toxicity detection with Amazon Bedrock Guardrails

The global fashion industry is estimated to be valued at $1.84 trillion in 2025, accounting for approximately 1.63% of the world’s GDP (Statista, 2025). With such massive amounts of generated capital, so too comes the enormous potential for toxic content and misuse.

In the fashion industry, teams are frequently innovating quickly, often using AI. Sharing content, whether through videos, designs, or otherwise, can lead to content moderation challenges. There remains a risk (through intentional or unintentional actions) of inappropriate, offensive, or toxic content being produced and shared. This can lead to violations of company policy and irreparable brand reputation damage. Implementing guardrails while using AI to innovate faster within this industry can provide long-lasting benefits.

In this post, we cover the use of the multimodal toxicity detection feature of Amazon Bedrock Guardrails to guard against toxic content. Whether you’re an enterprise giant in the fashion industry or an up-and-coming brand, you can use this solution to screen potentially harmful content before it impacts your brand’s reputation and ethical standards. For the purposes of this post, ethical standards refer to toxic, disrespectful, or harmful content and images that could be created by fashion designers.

Brand reputation represents a priceless currency that transcends trends, with companies competing not just for sales but for consumer trust and loyalty. As technology evolves, the need for effective reputation management strategies should include using AI in responsible ways. In this growing age of innovation, as the fashion industry evolves and creatives innovate faster, brands that strategically manage their reputation while adapting to changing consumer preferences and global trends will distinguish themselves from the rest in the industry (source). Take the first step toward responsible AI within your creative practices with Amazon Bedrock Guardrails.

Solution overview

To incorporate multimodal toxicity detection guardrails in an image generation workflow with Amazon Bedrock, you can use the following AWS services:

  • Amazon Bedrock Guardrails for multimodal toxicity detection
  • Amazon S3 to store the images that require moderation
  • AWS Lambda to run the moderation logic
  • AWS Identity and Access Management (IAM) to grant the Lambda function least-privilege permissions
  • Amazon CloudWatch Logs to record the moderation results

The following diagram illustrates the solution architecture.

Prerequisites

For this solution, you must have the following:

The following IAM policy grants specific permissions for a Lambda function to interact with Amazon CloudWatch Logs, access objects in an S3 bucket, and apply Amazon Bedrock guardrails, enabling the function to log its activities, read from Amazon S3, and use Amazon Bedrock content filtering capabilities. Before using this policy, update the placeholders with your resource-specific values:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudWatchLogsAccess",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT-ID>:*"
        },
        {
            "Sid": "CloudWatchLogsStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<REGION>:<ACCOUNT-ID>:log-group:/aws/lambda/<FUNCTION-NAME>:*"
            ]
        },
        {
            "Sid": "S3ReadAccess",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<BUCKET-NAME>/*"
        },
        {
            "Sid": "BedrockGuardrailsAccess",
            "Effect": "Allow",
            "Action": "bedrock:ApplyGuardrail",
            "Resource": "arn:aws:bedrock:<REGION>:<ACCOUNT-ID>:guardrail/<GUARDRAIL-ID>"
        }
    ]
}

The following steps walk you through how to incorporate multimodal toxicity detection guardrails in an image generation workflow with Amazon Bedrock.

Create a multimodal guardrail in Amazon Bedrock

The foundation of our moderation system is a guardrail in Amazon Bedrock configured specifically for image content. To create a multimodal toxicity detection guardrail, complete the following steps:

  1. On the Amazon Bedrock console, choose Guardrails under Safeguards in the navigation pane.
  2. Choose Create guardrail.
  3. Enter a name and optional description, and create your guardrail.

Configure content filters for multiple modalities

Next, you configure the content filters. Complete the following steps:

  1. On the Configure content filters page, choose Image under Filter for prompts. This allows the guardrail to process visual content alongside text.
  2. Configure the categories for Hate, Insults, Sexual, and Violence to filter both text and image content. The Misconduct and Prompt threat categories are available for text content filtering only.
  3. Create your filters.

By setting up these filters, you create a comprehensive safeguard that can detect potentially harmful content across multiple modalities, enhancing the safety and reliability of your AI applications.
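
If you prefer to script this setup, the following sketch shows roughly how an equivalent guardrail could be created with Boto3. Treat the modality-related fields as assumptions and confirm the exact field names in the CreateGuardrail API reference before use.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="fashion-image-moderation",
    description="Blocks toxic text and image content for fashion workflows",
    contentPolicyConfig={
        "filtersConfig": [
            # Hate, Insults, Sexual, and Violence support both text and images
            {
                "type": category,
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "inputModalities": ["TEXT", "IMAGE"],   # Assumed field name
                "outputModalities": ["TEXT", "IMAGE"],  # Assumed field name
            }
            for category in ["HATE", "INSULTS", "SEXUAL", "VIOLENCE"]
        ],
    },
    blockedInputMessaging="Sorry, this content cannot be processed.",
    blockedOutputsMessaging="Sorry, the model cannot answer this question.",
)
print(response["guardrailId"], response["version"])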

Create an S3 bucket

You need a place for users (or other processes) to upload the images that require moderation. To create an S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a unique name and choose the AWS Region where you want to host the bucket.
  4. For this basic setup, standard settings are usually sufficient.
  5. Create your bucket.

This bucket is where our workflow begins—new images landing here will trigger the next step.

Create a Lambda function

We use a Lambda function, a serverless compute service, written in Python. This function is invoked when a new image arrives in the S3 bucket. The function will send the image to our guardrail in Amazon Bedrock for analysis. Complete the following steps to create your function:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Enter a name and choose a recent Python runtime.
  4. Grant the correct permissions using the IAM execution role. The function needs permission to read the newly uploaded object from your S3 bucket (s3:GetObject) and permission to interact with Amazon Bedrock Guardrails using the bedrock:ApplyGuardrail action for your specific guardrail.
  5. Create your function.

Let’s explore the Python code that powers this function. We use the AWS SDK for Python (Boto3) to interact with Amazon S3 and Amazon Bedrock. The code first identifies the uploaded image from the S3 event trigger. It then checks if the image format is supported (JPEG or PNG) and verifies that the size doesn’t exceed the guardrail limit of 4 MB.

The key step involves preparing the image data for the ApplyGuardrail API call. We package the raw image bytes along with its format into a structure that Amazon Bedrock understands. We use the ApplyGuardrail API; this is efficient because we can check the image against our configured policies without needing to invoke a full foundation model.

Finally, the function calls ApplyGuardrail, passing the image content, the guardrail ID, and the version you noted earlier. It then interprets the response from Amazon Bedrock, logging whether the content was BLOCKED or NONE (meaning it passed the check), along with specific harmful categories detected if it was blocked.

The following is Python code you can use as a starting point (remember to replace the placeholders):

import boto3
import json
import os
import traceback

s3_client = boto3.client('s3')
# Use 'bedrock-runtime' for ApplyGuardrail and InvokeModel
bedrock_runtime_client = boto3.client('bedrock-runtime')

GUARDRAIL_ID = '<YOUR_GUARDRAIL_ID>' 
GUARDRAIL_VERSION = '<SPECIFIC_VERSION>' #e.g, '1'

# Supported image formats by the Guardrail feature
SUPPORTED_FORMATS = {'jpg': 'jpeg', 'jpeg': 'jpeg', 'png': 'png'}

def lambda_handler(event, context):
    # Get bucket name and object key
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    print(f"Processing s3://{bucket}/{key}")

    # Extract file extension and check if supported
    try:
        file_ext = os.path.splitext(key)[1].lower().lstrip('.')
        image_format = SUPPORTED_FORMATS.get(file_ext)
        if not image_format:
            print(f"Unsupported image format: {file_ext}. Skipping.")
            return {'statusCode': 400, 'body': 'Unsupported image format'}
    except Exception as e:
         print(f"Error determining file format for {key}: {e}")
         return {'statusCode': 500, 'body': 'Error determining file format'}


    try:
        # Get image bytes from S3
        response = s3_client.get_object(Bucket=bucket, Key=key)
        image_bytes = response['Body'].read()

        # Basic size check (Guardrail limit is 4MB)
        if len(image_bytes) > 4 * 1024 * 1024:
             print(f"Image size exceeds 4MB limit for {key}. Skipping.")
             return {'statusCode': 400, 'body': 'Image size exceeds 4MB limit'}

        # 3. Prepare content list for ApplyGuardrail API 
        content_to_assess = [
            {
                "image": {
                    "format": image_format, # 'jpeg' or 'png' 
                    "source": {
                        "bytes": image_bytes # Pass raw bytes 
                    }
                }
            }
        ]

        # Call ApplyGuardrail API 
        print(f"Calling ApplyGuardrail for {key} (Format: {image_format})")
        guardrail_response = bedrock_runtime_client.apply_guardrail(
            guardrailIdentifier=GUARDRAIL_ID,
            guardrailVersion=GUARDRAIL_VERSION,
            source='INPUT', # Assess as user input
            content=content_to_assess
        )

        # Process response
        print("Guardrail Assessment Response:", json.dumps(guardrail_response))

        action = guardrail_response.get('action')
        assessments = guardrail_response.get('assessments', [])
        outputs = guardrail_response.get('outputs', []) # Relevant if masking occurs

        print(f"Guardrail Action for {key}: {action}")

        if action == 'BLOCKED':
            print(f"Content BLOCKED. Assessments: {json.dumps(assessments)}")
            # Add specific handling for blocked content
        elif action == 'NONE':
             print("Content PASSED.")
             # Add handling for passed content
        else:
             # Handle other potential actions (e.g., content masked)
             print(f"Guardrail took action: {action}. Outputs: {json.dumps(outputs)}")


        return {
            'statusCode': 200,
            'body': json.dumps(f'Successfully processed {key}. Guardrail action: {action}')
        }

    except bedrock_runtime_client.exceptions.ValidationException as ve:
        print(f"Validation Error calling ApplyGuardrail for {key}: {ve}")
        # You might get this for exceeding size/dimension limits or other issues
        return {'statusCode': 400, 'body': f'Validation Error: {ve}'}
    except Exception as e:
        print(f"Error processing image {key}: {e}")
        # Log the full error for debugging
        traceback.print_exc()
        return {'statusCode': 500, 'body': f'Internal server error processing {key}'}

Check the function’s default execution timeout (found under Configuration, General configuration) to verify it has enough time to download the image and wait for the Amazon Bedrock API response, perhaps setting it to 30 seconds.

Create an Amazon S3 trigger for the Lambda function

With the S3 bucket ready and the function coded, you must now connect them. This is done by setting up an Amazon S3 trigger on the Lambda function:

  1. On the function’s configuration page, choose Add trigger.

  2. Choose S3 as the source.

  3. Point it to the S3 bucket you created earlier.
  4. Configure the trigger to activate on All object create events. This makes sure that whenever a new file is successfully uploaded to the S3 bucket, your Lambda function is automatically invoked.
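
If you prefer to configure the trigger programmatically rather than through the console, the following is a minimal sketch using Boto3; the bucket name, function ARN, and statement ID are placeholders.

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

bucket_name = "<BUCKET-NAME>"
function_arn = "arn:aws:lambda:<REGION>:<ACCOUNT-ID>:function:<FUNCTION-NAME>"

# Allow Amazon S3 to invoke the moderation function
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="AllowS3InvokeModeration",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket_name}",
)

# Invoke the function on all object create events in the bucket
s3.put_bucket_notification_configuration(
    Bucket=bucket_name,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": function_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)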

Test your moderation pipeline

It’s time to see your automated workflow in action! Upload a few test images (JPEG or PNG, under 4 MB) to your designated S3 bucket. Include images that are clearly safe and others that might trigger the harmful content filters you configured in your guardrail. On the CloudWatch console, find the log group associated with your Lambda function. Examining the latest log streams will show you the function’s execution details. You should see messages confirming which file was processed, the call to ApplyGuardrail, and the final guardrail action (NONE or BLOCKED). If an image was blocked, the logs should also show the specific assessment details, indicating which harmful category was detected.

By following these steps, you have established a robust, serverless pipeline for automatically moderating image content using the power of Amazon Bedrock Guardrails. This proactive approach helps maintain safer online environments and aligns with responsible AI practices. The following is an example ApplyGuardrail response for an image that triggered the hate content filter:

{
    "ResponseMetadata": {
        "RequestId": "fa025ab0-905f-457d-ae19-416537e2c69f",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "content-type": "application/json",
            "content-length": "1008",
            "connection": "keep-alive",
        },
        "RetryAttempts": 0
    },
    "usage": {
        "topicPolicyUnits": 0,
        "contentPolicyUnits": 0,
        "wordPolicyUnits": 0,
        "sensitiveInformationPolicyUnits": 0,
        "sensitiveInformationPolicyFreeUnits": 0,
        "contextualGroundingPolicyUnits": 0
    },
    "action": "GUARDRAIL_INTERVENED",
    "outputs": [
        {
            "text": "Sorry, the model cannot answer this question."
        }
    ],
    "assessments": [
        {
            "contentPolicy": {
                "filters": [
                    {
                        "type": "HATE",
                        "confidence": "MEDIUM",
                        "filterStrength": "HIGH",
                        "action": "BLOCKED"
                    }
                ]
            },
            "invocationMetrics": {
                "guardrailProcessingLatency": 918,
                "usage": {
                    "topicPolicyUnits": 0,
                    "contentPolicyUnits": 0,
                    "wordPolicyUnits": 0,
                    "sensitiveInformationPolicyUnits": 0,
                    "sensitiveInformationPolicyFreeUnits": 0,
                    "contextualGroundingPolicyUnits": 0
                },
                "guardrailCoverage": {
                    "images": {
                        "guarded": 1,
                        "total": 1
                    }
                }
            }
        }
    ],
    "guardrailCoverage": {
        "images": {
            "guarded": 1,
            "total": 1
        }
    }
}

Clean up

When you’re ready to remove the moderation pipeline you built, you must clean up the resources you created to avoid unnecessary charges. Complete the following steps:

  1. On the Amazon S3 console, remove the event notification configuration in the bucket that triggers the Lambda function.
  2. Delete the bucket.
  3. On the Lambda console, delete the moderation function you created.
  4. On the IAM console, remove the execution role you created for the Lambda function.
  5. If you created a guardrail specifically for this project and don’t need it for other purposes, remove it using the Amazon Bedrock console.

With these cleanup steps complete, you have successfully removed the components of your image moderation pipeline. You can recreate this solution in the future by following the steps outlined in this post—this highlights the ease of cloud-based, serverless architectures.

Conclusion

In the fashion industry, protecting your brand’s reputation while maintaining creative innovation is paramount. By implementing Amazon Bedrock Guardrails multimodal toxicity detection, fashion brands can automatically screen content for potentially harmful material before it impacts their reputation or violates their ethical standards. As the fashion industry continues to evolve digitally, implementing robust content moderation systems isn’t just about risk management—it’s about building trust with your customers and maintaining brand integrity. Whether you’re an established fashion house or an emerging brand, this solution offers an efficient way to uphold your content standards. The solution we outlined in this post provides a scalable, serverless architecture that accomplishes the following:

  • Automatically processes new image uploads
  • Uses advanced AI capabilities through Amazon Bedrock
  • Provides immediate feedback on content acceptability
  • Requires minimal maintenance after it’s deployed

If you’re interested in further insights on Amazon Bedrock Guardrails and its practical use, refer to the video Amazon Bedrock Guardrails: Make Your AI Safe and Ethical, and the post Amazon Bedrock Guardrails image content filters provide industry-leading safeguards, helping customer block up to 88% of harmful multimodal content: Generally available today.


About the Authors

Jordan Jones is a Solutions Architect at AWS within the Cloud Sales Center organization. He uses cloud technologies to solve complex problems, bringing defense industry experience and expertise in various operating systems, cybersecurity, and cloud architecture. He enjoys mentoring aspiring professionals and speaking on various career panels. Outside of work, he volunteers within the community and can be found watching Golden State Warriors games, solving Sudoku puzzles, or exploring new cultures through world travel.

Jean Jacques Mikem is a Solutions Architect at AWS with a passion for designing secure and scalable technology solutions. He uses his expertise in cybersecurity and technological hardware to architect robust systems that meet complex business needs. With a strong foundation in security principles and computing infrastructure, he excels at creating solutions that bridge business requirements with technical implementation.

Read More

New capabilities in Amazon SageMaker AI continue to transform how organizations develop AI models

New capabilities in Amazon SageMaker AI continue to transform how organizations develop AI models

As AI models become increasingly sophisticated and specialized, the ability to quickly train and customize models can mean the difference between industry leadership and falling behind. That is why hundreds of thousands of customers use the fully managed infrastructure, tools, and workflows of Amazon SageMaker AI to scale and advance AI model development. Since launching in 2017, SageMaker AI has transformed how organizations approach AI model development by reducing complexity while maximizing performance. Since then, we’ve continued to relentlessly innovate, adding more than 420 new capabilities since launch to give customers the best tools to build, train, and deploy AI models quickly and efficiently. Today, we’re pleased to announce new innovations that build on the rich features of SageMaker AI to accelerate how customers build and train AI models.

Amazon SageMaker HyperPod: The infrastructure of choice for developing AI models

AWS launched Amazon SageMaker HyperPod in 2023 to reduce complexity and maximize performance and efficiency when building AI models. With SageMaker HyperPod, you can quickly scale generative AI model development across thousands of AI accelerators and reduce foundation model (FM) training and fine-tuning development costs by up to 40%. Many of today’s top models are trained on SageMaker HyperPod, including models from Hugging Face, Luma AI, Perplexity AI, Salesforce, Thomson Reuters, Writer, and Amazon. By training Amazon Nova FMs on SageMaker HyperPod, Amazon saved months of work and increased utilization of compute resources to more than 90%.

To further streamline workflows and make it faster to develop and deploy models, a new command line interface (CLI) and software development kit (SDK) provides a single, consistent interface that simplifies infrastructure management, unifies job submission across training and inference, and supports both recipe-based and custom workflows with integrated monitoring and control. Today, we are also adding two capabilities to SageMaker HyperPod that can help you reduce training costs and accelerate AI model development.

Reduce the time to troubleshoot performance issues from days to minutes with SageMaker HyperPod observability

To bring new AI innovations to market as quickly as possible, organizations need visibility across AI model development tasks and compute resources to optimize training efficiency and detect and resolve interruptions or performance bottlenecks as soon as possible. For example, to investigate if a training or fine-tuning job failure was the result of a hardware issue, data scientists and machine learning (ML) engineers want to quickly filter to review the monitoring data of the specific GPUs that performed the job rather than manually browsing through the hardware resources of an entire cluster to establish the correlation between the job failure and a hardware issue.

The new observability capability in SageMaker HyperPod transforms how you can monitor and optimize your model development workloads. Through a unified dashboard preconfigured in Amazon Managed Grafana, with the monitoring data automatically published to an Amazon Managed Service for Prometheus workspace, you can now see generative AI task performance metrics, resource utilization, and cluster health in a single view. Teams can now quickly spot bottlenecks, prevent costly delays, and optimize compute resources. You can define automated alerts, specify use case-specific task metrics and events, and publish them to the unified dashboard with just a few clicks.

By reducing troubleshooting time from days to minutes, this capability can help you accelerate your path to production and maximize the return on your AI investments.

DatologyAI builds tools to automatically select the best data on which to train deep learning models.

“We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics—from task-specific GPU utilization to file system (FSx for Lustre) performance—without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems.”
–Josh Wills, Member of Technical Staff at DatologyAI

Articul8 helps companies build sophisticated enterprise generative AI applications.

“With SageMaker HyperPod observability, we can now deploy our metric collection and visualization systems in a single click, saving our teams days of otherwise manual setup and enhancing our cluster observability workflows and insights. Our data scientists can quickly monitor task performance metrics, such as latency, and identify hardware issues without manual configuration. SageMaker HyperPod observability will help streamline our foundation model development processes, allowing us to focus on advancing our mission of delivering accessible and reliable AI-powered innovation to our customers.”
–Renato Nascimento, head of technology at Articul8

Deploy Amazon SageMaker JumpStart models on SageMaker HyperPod for fast, scalable inference

After developing generative AI models on SageMaker HyperPod, many customers import these models to Amazon Bedrock, a fully managed service for building and scaling generative AI applications. However, some customers want to use their SageMaker HyperPod compute resources to speed up their evaluation and move models into production faster.

Now, you can deploy open-weights models from Amazon SageMaker JumpStart, as well as fine-tuned custom models, on SageMaker HyperPod within minutes with no manual infrastructure setup. Data scientists can run inference on SageMaker JumpStart models with a single click, simplifying and accelerating model evaluation. This straightforward, one-time provisioning reduces manual infrastructure setup, providing a reliable and scalable inference environment with minimal effort. Large model downloads are reduced from hours to minutes, accelerating model deployments and shortening the time to market.

H.AI exists to push the boundaries of superintelligence with agentic AI.

“With Amazon SageMaker HyperPod, we used the same high-performance compute to build and deploy the foundation models behind our agentic AI platform. This seamless transition from training to inference streamlined our workflow, reduced time to production, and delivered consistent performance in live environments. SageMaker HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.”
–Laurent Sifre, Co-founder & CTO at H.AI

Seamlessly access the powerful compute resources of SageMaker AI from local development environments

Today, many customers choose from the broad set of fully managed integrated development environments (IDEs) available in SageMaker AI for model development, including JupyterLab, Code Editor based on Code-OSS, and RStudio. Although these IDEs enable secure and efficient setups, some developers prefer to use local IDEs on their personal computers for their debugging capabilities and extensive customization options. However, customers using a local IDE, such as Visual Studio Code, couldn’t easily run their model development tasks on SageMaker AI until now.

With new remote connections to SageMaker AI, developers and data scientists can quickly and seamlessly connect to SageMaker AI from their local VS Code, maintaining access to the custom tools and familiar workflows that help them work most efficiently. Developers can build and train AI models using their local IDE while SageMaker AI manages remote execution, so you can work in your preferred environment while still benefiting from the performance, scalability, and security of SageMaker AI. You can now choose your preferred IDE—whether that is a fully managed cloud IDE or VS Code—to accelerate AI model development using the powerful infrastructure and seamless scalability of SageMaker AI.

CyberArk is a leader in Identity Security, which provides a comprehensive approach centered on privileged controls to protect against advanced cyber threats.

“With remote connections to SageMaker AI, our data scientists have the flexibility to choose the IDE that makes them most productive. Our teams can leverage their customized local setup while accessing the infrastructure and security controls of SageMaker AI. As a security first company, this is extremely important to us as it ensures sensitive data stays protected, while allowing our teams to securely collaborate and boost productivity.”
–Nir Feldman, Senior Vice President of Engineering at CyberArk

Build generative AI models and applications faster with fully managed MLflow 3.0

As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Customers such as Cisco, SonRai, and Xometry are already using managed MLflow on SageMaker AI to efficiently manage ML model experiments at scale. The introduction of fully managed MLflow 3.0 on SageMaker AI makes it straightforward to track experiments, monitor training progress, and gain deeper insights into the behavior of models and AI applications using a single tool, helping you accelerate generative AI development.

Conclusion

In this post, we shared some of the new innovations in SageMaker AI to accelerate how you can build and train AI models.

To learn more about these new features, SageMaker AI, and how companies are using this service, refer to the following resources:


About the author

Ankur Mehrotra joined Amazon back in 2008 and is currently the General Manager of Amazon SageMaker AI. Before Amazon SageMaker AI, he worked on building Amazon.com’s advertising systems and automated pricing technology.


Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box dashboard that delivers insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards, optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance.

With a one-click installation of the Amazon Elastic Kubernetes Service (Amazon EKS) add-on for SageMaker HyperPod observability, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter (EFA), integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators. With this unified view, you can trace model development task performance to cluster resources with aggregation of resource metrics at the task level. The solution also abstracts management of collector agents and scrapers across clusters, offering automatic scalability of collectors across nodes as the cluster grows. The dashboards feature intuitive navigation across metrics and visualizations to help users diagnose problems and take action faster. They are also fully customizable, supporting additional PromQL metric imports and custom Grafana layouts.

These capabilities save teams valuable time and resources during FM development, helping accelerate time-to-market and reduce the cost of generative AI innovations. Instead of spending hours or days configuring, collecting, and analyzing cluster telemetry systems, data scientists and machine learning (ML) engineers can now quickly identify training, tuning, and inference disruptions, underutilization of valuable GPU resources, and hardware performance issues. The pre-built, actionable insights of SageMaker HyperPod observability can be used in several common scenarios when operating FM workloads, such as:

  • Data scientists can monitor resource utilization of submitted training and inference tasks at the per-GPU level, with insights into GPU memory and FLOPs
  • AI researchers can troubleshoot sub-optimal time-to-first-token (TTFT) for their inferencing workloads by correlating the deployment metrics with the corresponding resource bottlenecks
  • Cluster administrators can configure customizable alerts to send notifications to multiple destinations such as Amazon Simple Notification Service (Amazon SNS), PagerDuty, and Slack when hardware falls outside of recommended health thresholds
  • Cluster administrators can quickly identify inefficient resource queuing patterns across teams or namespaces to reconfigure allocation and prioritization policies

In this post, we walk you through installing and using the unified dashboards of the out-of-the-box observability feature in SageMaker HyperPod. We cover the one-click installation from the Amazon SageMaker AI console, navigating the dashboard and metrics it consolidates, and advanced topics such as setting up custom alerts. If you have a running SageMaker HyperPod EKS cluster, then this post will help you understand how to quickly visualize key health and performance telemetry data to derive actionable insights.

Prerequisites

To get started with SageMaker HyperPod observability, you first need to enable AWS IAM Identity Center to use Amazon Managed Grafana. If IAM Identity Center isn’t already enabled in your account, refer to Getting started with IAM Identity Center. Additionally, create at least one user in the IAM Identity Center.

SageMaker HyperPod observability is available for SageMaker HyperPod clusters with an Amazon EKS orchestrator. If you don’t already have a SageMaker HyperPod cluster with an Amazon EKS orchestrator, refer to Amazon SageMaker HyperPod quickstart workshops for instructions to create one.

Enable SageMaker HyperPod observability

To enable SageMaker HyperPod observability, follow these steps:

  1. On the SageMaker AI console, choose Cluster management in the navigation pane.
  2. Open the cluster detail page from the SageMaker HyperPod clusters list.
  3. On the Dashboard tab, in the HyperPod Observability section, choose Quick installation.

SageMaker AI will create a new Prometheus workspace, a new Grafana workspace, and install the SageMaker HyperPod observability add-on to the EKS cluster. The installation typically completes within a few minutes.

screenshot before installation

When the installation process is complete, you can view the add-on details and metrics available.

  1. Choose Manage users to assign a user to a Grafana workspace.
  2. Choose Open dashboard in Grafana to open the Grafana dashboard.

screenshot after installation

  3. When prompted, sign in with IAM Identity Center using the user you configured as a prerequisite.

grafana sign-in screen

After signing in successfully, you will see the SageMaker HyperPod observability dashboard on Grafana.

SageMaker HyperPod observability dashboards

You can choose from multiple dashboards, including Cluster, Tasks, Inference, Training, and File system.

The Cluster dashboard shows cluster-level metrics such as Total Nodes and Total GPUs, and cluster node-level metrics such as GPU Utilization and Filesystem space available. By default, the dashboard shows metrics for the entire cluster, but you can apply filters to show metrics for a specific hostname or GPU ID.

cluster dashboard

The Tasks dashboard is helpful if you want to see resource allocation and utilization metrics at the task level (PyTorchJob, ReplicaSet, and so on). For example, you can compare GPU utilization across multiple tasks running on your cluster and identify which tasks need improvement.

You can also choose an aggregation level from multiple options (Namespace, Task Name, Task Pod), and apply filters (Namespace, Task Type, Task Name, Pod, GPU ID). You can use these aggregation and filtering capabilities to view metrics at the appropriate granularity and drill down into the specific issue you are investigating.

task dashboard

The Inference dashboard shows inference application specific metrics such as Incoming Requests, Latency, and Time to First Byte (TTFB). The Inference dashboard is particularly useful when you use SageMaker HyperPod clusters for inference and need to monitor the traffic of the requests and performance of models.

inference dashboard

Advanced installation

The Quick installation option will create a new workspace for Prometheus and Grafana and select default metrics. If you want to reuse an existing workspace, select additional metrics, or enable Pod logging to Amazon CloudWatch Logs, use the Custom installation option. For more information, see Amazon SageMaker HyperPod.

Set up alerts

Amazon Managed Grafana includes access to an updated alerting system that centralizes alerting information in a single, searchable view (in the navigation pane, choose Alerts to create an alert). Alerting is useful when you want to receive timely notifications, such as when GPU utilization drops unexpectedly, when disk usage of your shared file system exceeds 90%, when multiple instances become unavailable at the same time, and so on. The HyperPod observability dashboard in Amazon Managed Grafana has pre-configured alerts for a few of these key metrics. You can create additional alert rules based on metrics or queries and set up multiple notification channels, such as emails and Slack messages. For instructions on setting up alerts with Slack messages, see the Setting Up Slack Alerts for Amazon Managed Grafana GitHub page.

The number of alerts is limited to 100 per Grafana workspace. If you need a more scalable solution, check out the alerting options in Amazon Managed Service for Prometheus.

High-level overview

The following diagram illustrates the architecture of the new HyperPod observability capability.

architecture diagram

Clean up

If you want to uninstall the SageMaker HyperPod observability feature (for example, to reconfigure it), clean up the resources in the following order:

  1. Remove the SageMaker HyperPod observability add-on, either using the SageMaker AI console or Amazon EKS console.
  2. Delete the Grafana workspace on the Amazon Managed Grafana console.
  3. Delete the Prometheus workspace on the Amazon Managed Service for Prometheus console.

Conclusion

This post provided an overview and usage instructions for SageMaker HyperPod observability, a newly released observability feature for SageMaker HyperPod. This feature reduces the heavy lifting involved in setting up cluster observability and provides centralized visibility into cluster health status and performance metrics.

For more information about SageMaker HyperPod observability, see Amazon SageMaker HyperPod. Please leave your feedback on this post in the comments section.


About the authors

Tomonori Shimomura is a Principal Solutions Architect on the Amazon SageMaker AI team, where he provides in-depth technical consultation to SageMaker AI customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he applies his in-depth skills to cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelor's degree from the University of Virginia and is based in Boston, Massachusetts.

Eric Saleh is a Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He is partnering with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions, which included frontier GenAI services for fine-tuning, RAG, and managed inference. He holds a master’s degree in Business Analytics from UCLA Anderson.

Piyush Kadam is a Senior Product Manager on the Amazon SageMaker AI team, where he specializes in LLMOps products that empower both startups and enterprise customers to rapidly experiment with and efficiently govern foundation models. With a Master’s degree in Computer Science from the University of California, Irvine, specializing in distributed systems and artificial intelligence, Piyush brings deep technical expertise to his role in shaping the future of cloud AI products.

Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker AI team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.

Gopi Sekar is an Engineering Leader for the Amazon SageMaker AI team. He is dedicated to assisting customers and developing products that simplify the adaptation of machine learning to address real-world customer challenges.


Accelerating generative AI development with fully managed MLflow 3.0 on Amazon SageMaker AI

Amazon SageMaker now offers fully managed support for MLflow 3.0, which streamlines AI experimentation and accelerates your generative AI journey from idea to production. This release transforms managed MLflow from experiment tracking into end-to-end observability, reducing time to market for generative AI development.

As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Data scientists and developers struggle to effectively analyze the performance of their models and AI applications from experimentation to production, making it hard to find root causes and resolve issues. Teams spend more time integrating tools than improving the quality of their models or generative AI applications.

With the launch of fully managed MLflow 3.0 on Amazon SageMaker AI, you can accelerate generative AI development by making it easier to track experiments and observe behavior of models and AI applications using a single tool. Tracing capabilities in fully managed MLflow 3.0 provide customers the ability to record the inputs, outputs, and metadata at every step of a generative AI application, so developers can quickly identify the source of bugs or unexpected behaviors. By maintaining records of each model and application version, fully managed MLflow 3.0 offers traceability to connect AI responses to their source components, which means developers can quickly trace an issue directly to the specific code, data, or parameters that generated it. With these capabilities, customers using Amazon SageMaker HyperPod to train and deploy foundation models (FMs) can now use managed MLflow to track experiments, monitor training progress, gain deeper insights into the behavior of models and AI applications, and manage their machine learning (ML) lifecycle at scale. This reduces troubleshooting time and enables teams to focus more on innovation.

This post walks you through the core concepts of fully managed MLflow 3.0 on SageMaker and provides technical guidance on how to use the new features to help accelerate your next generative AI application development.

Getting started

You can get started with fully managed MLflow 3.0 on Amazon SageMaker to track experiments, manage models, and streamline your generative AI/ML lifecycle through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.

Prerequisites

To get started, you need:

Configure your environment to use SageMaker managed MLflow Tracking Server

To perform the configuration, follow these steps:

  1. In the SageMaker Studio UI, in the Applications pane, choose MLflow and choose Create.

  2. Enter a unique name for your tracking server and specify the Amazon Simple Storage Service (Amazon S3) URI where your experiment artifacts will be stored. When you’re ready, choose Create. By default, SageMaker will select version 3.0 to create the MLflow tracking server.
  3. Optionally, you can choose Update to adjust settings such as server size, tags, or AWS Identity and Access Management (IAM) role.

The server will now be provisioned and started automatically, typically within 25 minutes. After setup, you can launch the MLflow UI from SageMaker Studio to start tracking your ML and generative AI experiments. For more details on tracking server configurations, refer to Machine learning experiments using Amazon SageMaker AI with MLflow in the SageMaker Developer Guide.
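If you prefer to check the provisioning status programmatically, the following is a minimal sketch using the boto3 SageMaker client; the tracking server name is illustrative and should match the one you created:

import boto3

# Check the provisioning status of the SageMaker managed MLflow tracking server
sm_client = boto3.client("sagemaker")
response = sm_client.describe_mlflow_tracking_server(
    TrackingServerName="my-tracking-server"  # replace with your tracking server name
)
print(response["TrackingServerStatus"], response["TrackingServerArn"])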

To begin tracking your experiments with your newly created SageMaker managed MLflow tracking server, you need to install both the MLflow and AWS SageMaker MLflow Python packages in your environment. You can use SageMaker Studio managed JupyterLab, SageMaker Studio Code Editor, a local integrated development environment (IDE), or another supported environment where your AI workloads operate to track experiments with the SageMaker managed MLflow tracking server.

To install both Python packages using pip:

pip install mlflow==3.0 sagemaker-mlflow==0.1.0

To connect and start logging your AI experiments, parameters, and models directly to managed MLflow on SageMaker, set the tracking URI to the Amazon Resource Name (ARN) of your SageMaker MLflow tracking server:

import mlflow

# SageMaker MLflow ARN
tracking_server_arn = "arn:aws:sagemaker:<Region>:<Account_id>:mlflow-tracking-server/<Name>" # Enter ARN
mlflow.set_tracking_uri(tracking_server_arn) 
mlflow.set_experiment("customer_support_genai_app")

Now your environment is configured and ready to track your experiments with your SageMaker managed MLflow tracking server.
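As a quick sanity check, you can log a short test run; this is a minimal sketch, and the run name, parameter, and metric are illustrative:

import mlflow

# Log a small test run to confirm connectivity to the tracking server
with mlflow.start_run(run_name="connectivity-check"):
    mlflow.log_param("model_provider", "bedrock")
    mlflow.log_metric("sanity_check", 1.0)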

Implement generative AI application tracing and version tracking

Generative AI applications have multiple components, including code, configurations, and data, which can be challenging to manage without systematic versioning. A LoggedModel entity in managed MLflow 3.0 represents your AI model, agent, or generative AI application within an experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle. A trace is a log of inputs, outputs, and intermediate steps from a single application execution. Traces provide insights into application performance, execution flow, and response quality, enabling debugging and evaluation. With LoggedModel, you can track and compare different versions of your application, making it easier to identify issues, deploy the best version, and maintain a clear record of what was deployed and when.

To implement version tracking and tracing with managed MLflow 3.0 on SageMaker, you can establish a versioned model identity using a Git commit hash and set it as the active model context so all subsequent traces are automatically linked to that specific version. You can then enable automatic logging for Amazon Bedrock interactions and make an API call to Anthropic’s Claude 3.5 Sonnet that will be fully traced, with inputs, outputs, and metadata automatically captured within the established model context. Managed MLflow 3.0 tracing is already integrated with various generative AI libraries and provides a one-line automatic tracing experience for all supported libraries. For information about supported libraries, refer to Supported Integrations in the MLflow documentation.

import subprocess

import boto3
import mlflow

# 1. Define your application version using the git commit hash of the current repository
git_commit = (
    subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
    .decode("ascii")
    .strip()
)
logged_model = "customer_support_agent"
logged_model_name = f"{logged_model}-{git_commit}"

# 2. Set the active model context - traces will be linked to this version
mlflow.set_active_model(name=logged_model_name)

# 3. Enable auto logging for your model provider (Amazon Bedrock)
mlflow.bedrock.autolog()

# 4. Chat with your LLM provider
# Ensure that your boto3 client has the necessary auth information
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="<REPLACE_WITH_YOUR_AWS_REGION>",
)

model = "anthropic.claude-3-5-sonnet-20241022-v2:0"
messages = [{"role": "user", "content": [{"text": "Hello!"}]}]

# All intermediate executions within the chat session will be logged
bedrock.converse(modelId=model, messages=messages)

After logging this information, you can track these generative AI experiments and the logged model for the agent in the managed MLflow 3.0 tracking server UI, as shown in the following screenshot.

In addition to the one-line auto tracing functionality, MLflow offers a Python SDK for manually instrumenting your code and manipulating traces. Refer to the code sample notebook sagemaker_mlflow_strands.ipynb in the aws-samples GitHub repository, where we use MLflow manual instrumentation to trace Strands Agents. With tracing capabilities in fully managed MLflow 3.0, you can record the inputs, outputs, and metadata associated with each intermediate step of a request, so you can pinpoint the source of bugs and unexpected behaviors.

These capabilities provide observability in your AI workload by capturing detailed information about the execution of the workload services, nodes, and tools that you can see under the Traces tab.

You can inspect each trace, as shown in the following image, by choosing the request ID in the traces tab for the desired trace.

Fully managed MLflow 3.0 on Amazon SageMaker also introduces the capability to tag traces. Tags are mutable key-value pairs you can attach to traces to add valuable metadata and context. Trace tags make it straightforward to organize, search, and filter traces based on criteria such as user session, environment, model version, or performance characteristics. You can add, update, or remove tags at any stage—during trace execution using mlflow.update_current_trace() or after a trace is logged using the MLflow APIs or UI. Managed MLflow 3.0 makes it seamless to search and analyze traces, helping teams quickly pinpoint issues, compare agent behaviors, and optimize performance. The tracing UI and Python API both support powerful filtering, so you can drill down into traces based on attributes such as status, tags, user, environment, or execution time as shown in the screenshot below. For example, you can instantly find all traces with errors, filter by production environment, or search for traces from a specific request. This capability is essential for debugging, cost analysis, and continuous improvement of generative AI applications.
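For illustration, the following minimal sketch shows how a tag could be attached from inside a traced function using mlflow.update_current_trace; the function, tag value, and reply are hypothetical placeholders rather than part of the sample application:

import mlflow

@mlflow.trace  # records this function call as a span in the current trace
def answer_question(question: str) -> str:
    # Attach mutable metadata to the trace while it is being recorded
    mlflow.update_current_trace(tags={"environment": "production"})
    # In a real application, you would call your LLM or agent here
    return f"Placeholder answer to: {question}"

answer_question("Where is my refund?")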

The following screenshot displays the traces returned when searching for the tag ‘Production’.

The following code snippet shows how you can search for all traces in production with a successful status:

# Search for traces in the production environment with a successful status
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK' AND tags.environment = 'production'"
)

Generative AI use case walkthrough with MLflow tracing

Building and deploying generative AI agents such as chat-based assistants, code generators, or customer support assistants requires deep visibility into how these agents interact with large language models (LLMs) and external tools. In a typical agentic workflow, the agent loops through reasoning steps, calling LLMs and using tools or subsystems such as search APIs or Model Context Protocol (MCP) servers until it completes the user’s task. These complex, multistep interactions make debugging, optimization, and cost tracking especially challenging.

Traditional observability tools fall short in generative AI because agent decisions, tool calls, and LLM responses are dynamic and context-dependent. Managed MLflow 3.0 tracing provides comprehensive observability by capturing every LLM call, tool invocation, and decision point in your agent’s workflow. You can use this end-to-end trace data to:

  • Debug agent behavior – Pinpoint where an agent’s reasoning deviates or why it produces unexpected outputs.
  • Monitor tool usage – Discover how and when external tools are called and analyze their impact on quality and cost.
  • Track performance and cost – Measure latency, token usage, and API costs at each step of the agentic loop.
  • Audit and govern – Maintain detailed logs for compliance and analysis.

Imagine a real-world scenario using the managed MLflow 3.0 tracing UI for a sample finance customer support agent equipped with a tool to retrieve financial data from a datastore. While you’re developing a generative AI customer support agent or analyzing its behavior in production, you can observe how the agent responds and whether the execution calls a product database tool for more accurate recommendations. For illustration, the first trace, shown in the following screenshot, shows the agent handling a user query without invoking any tools. The trace captures the prompt, agent response, and agent decision points. The agent’s response lacks product-specific details, and the trace makes it clear that no external tool was called, so you can quickly identify this behavior in the agent’s reasoning chain.

The second trace, shown in the following screenshot, captures the same agent, but this time it decides to call the product database tool. The trace logs the tool invocation, the returned product data, and how the agent incorporates this information into its final response. Here, you can observe improved answer quality, a slight increase in latency, and additional API cost with higher token usage.

By comparing these traces side by side, you can debug why the agent sometimes skips using the tool, optimize when and how tools are called, and balance quality against latency and cost. MLflow’s tracing UI makes these agentic loops transparent, actionable, and seamless to analyze at scale. This post’s sample agent and all necessary code is available on the aws-samples GitHub repository, where you can replicate and adapt it for your own applications.

Cleanup

After it’s created, a SageMaker managed MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop tracking servers when they’re not in use to save costs, or you can delete them using API or the SageMaker Studio UI. For more details on pricing, refer to Amazon SageMaker pricing.
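As a minimal sketch, assuming the boto3 SageMaker client and an illustrative tracking server name, you could stop or delete a tracking server as follows:

import boto3

sm_client = boto3.client("sagemaker")

# Stop the tracking server when it is not in use to save costs
sm_client.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Or delete it entirely when it is no longer needed
# sm_client.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")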

Conclusion

Fully managed MLflow 3.0 on Amazon SageMaker AI is now available. Get started with sample code in the aws-samples GitHub repository. We invite you to explore this new capability and experience the enhanced efficiency and control it brings to your ML projects. To learn more, visit Machine Learning Experiments using Amazon SageMaker with MLflow.

For more information, visit the SageMaker Developer Guide and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!

Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Amit Modi is the product leader for SageMaker AIOps and Governance, and Responsible AI at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.

Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.


Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle

Today, we’re excited to announce that Amazon SageMaker HyperPod now supports deploying foundation models (FMs) from Amazon SageMaker JumpStart, as well as custom or fine-tuned models from Amazon S3 or Amazon FSx. With this launch, you can train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

SageMaker HyperPod offers resilient, high-performance infrastructure optimized for large-scale model training and tuning. Since its launch in 2023, SageMaker HyperPod has been adopted by foundation model builders who are looking to lower costs, minimize downtime, and accelerate time to market. With Amazon EKS support in SageMaker HyperPod, you can orchestrate your HyperPod clusters with Amazon EKS. Customers like Perplexity, Hippocratic, Salesforce, and Articul8 use HyperPod to train their foundation models at scale. With the new deployment capabilities, customers can now leverage HyperPod clusters across the full generative AI development lifecycle, from model training and tuning to deployment and scaling.

Many customers use Kubernetes as part of their generative AI strategy, to take advantage of its flexibility, portability, and open source frameworks. You can orchestrate your HyperPod clusters with Amazon EKS support in SageMaker HyperPod so you can continue working with familiar Kubernetes workflows while gaining access to high-performance infrastructure purpose-built for foundation models. Customers benefit from support for custom containers, compute resource sharing across teams, observability integrations, and fine-grained scaling controls. HyperPod extends the power of Kubernetes by streamlining infrastructure setup and allowing customers to focus more on delivering models, not on managing backend complexity.

New Features: Accelerating Foundation Model Deployment with SageMaker HyperPod

Customers prefer Kubernetes for flexibility, granular control over infrastructure, and robust support for open source frameworks. However, running foundation model inference at scale on Kubernetes introduces several challenges. Organizations must securely download models, identify the right containers and frameworks for optimal performance, configure deployments correctly, select appropriate GPU types, provision load balancers, implement observability, and add auto-scaling policies to meet demand spikes. To address these challenges, we’ve launched SageMaker HyperPod capabilities to support the deployment, management, and scaling of generative AI models:

  1. One-click foundation model deployment from SageMaker JumpStart: You can now deploy over 400 open-weights foundation models from SageMaker JumpStart on HyperPod with just a click, including the latest state-of-the-art models like DeepSeek-R1, Mistral, and Llama4. SageMaker JumpStart models will be deployed on HyperPod clusters orchestrated by EKS and will be made available as SageMaker endpoints or Application Load Balancers (ALB).
  2. Deploy fine-tuned models from S3 or FSx for Lustre: You can seamlessly deploy your custom models from S3 or FSx. You can also deploy models from Jupyter notebooks with provided code samples.
  3. Flexible deployment options for different user personas: We’re providing multiple ways to deploy models on HyperPod to support teams that have different preferences and expertise levels. Beyond the one-click experience available in the SageMaker JumpStart UI, you can also deploy models using native kubectl commands, the HyperPod CLI, or the SageMaker Python SDK—giving you the flexibility to work within your preferred environment.
  4. Dynamic scaling based on demand: HyperPod inference now supports automatic scaling of your deployments based on metrics from Amazon CloudWatch and Prometheus with KEDA. With automatic scaling your models can handle traffic spikes efficiently while optimizing resource usage during periods of lower demand.
  5. Efficient resource management with HyperPod Task Governance: One of the key benefits of running inference on HyperPod is the ability to efficiently utilize accelerated compute resources by allocating capacity for both inference and training in the same cluster. You can use HyperPod Task Governance for efficient resource allocation, prioritization of inference tasks over lower priority training tasks to maximize GPU utilization, and dynamic scaling of inference workloads in near real-time.
  6. Integration with SageMaker endpoints: With this launch, you can deploy AI models to HyperPod and register them with SageMaker endpoints. This allows you to use similar invocation patterns as SageMaker endpoints along with integration with other open-source frameworks.
  7. Comprehensive observability: We’ve added the capability to get observability into the inference workloads hosted on HyperPod, including built-in capabilities to scrape metrics and export them to your observability platform. This capability provides visibility into both:
    1. Platform-level metrics such as GPU utilization, memory usage, and node health
    2. Inference-specific metrics like time to first token, request latency, throughput, and model invocations

“With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.”
–Laurent Sifre, Co-founder & CTO, H.AI

Deploying models on HyperPod clusters

In this launch, we are providing new operators that manage the complete lifecycle of your generative AI models in your HyperPod cluster. These operators will provide a simplified way to deploy and invoke your models in your cluster.

Prerequisites: The HyperPod inference operator must be installed and running in your cluster. You can install it with the Helm chart provided in the sagemaker-hyperpod-cli repository, replacing the placeholder values with your own:

helm install hyperpod-inference-operator ./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator \
     -n kube-system \
     --set region="<REGION>" \
     --set eksClusterName="<EKS_CLUSTER_NAME>" \
     --set hyperpodClusterArn="<HP_CLUSTER_ARN>" \
     --set executionRoleArn="<HYPERPOD_INFERENCE_ROLE_ARN>" \
     --set s3.serviceAccountRoleArn="<S3_CSI_ROLE_ARN>" \
     --set s3.node.serviceAccount.create=false \
     --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::<ACCOUNT_ID>:role/keda-operator-role" \
     --set tlsCertificateS3Bucket="<TLS_BUCKET_NAME>" \
     --set alb.region="<REGION>" \
     --set alb.clusterName="<EKS_CLUSTER_NAME>" \
     --set alb.vpcId="<VPC_ID>" \
     --set jumpstartGatedModelDownloadRoleArn="<JUMPSTART_GATED_ROLE_ARN>"

Architecture:

  • When you deploy a model using the HyperPod inference operator, the operator will identify the right instance type in the cluster, download the model from the provided source, and deploy it.
  • The operator will then provision an Application Load Balancer (ALB) and add the model’s pod IP as the target. Optionally, it can register the ALB with a SageMaker endpoint.
  • The operator will also generate a TLS certificate for the ALB, which is saved in S3 at the location specified by the tlsCertificateS3Bucket parameter. The operator imports the certificate into AWS Certificate Manager (ACM) to associate it with the ALB, so clients can connect to the ALB over HTTPS after adding the certificate to their trust store.
  • If you register with a SageMaker endpoint, the operator will allow you to invoke the model using the SageMaker runtime client and handle authentication and security aspects.
  • Metrics can be exported to Amazon CloudWatch and Amazon Managed Service for Prometheus and visualized with Grafana dashboards

Deployment sources

Once you have the operators running in your cluster, you can then deploy AI models from multiple sources using SageMaker JumpStart, S3, or FSx:

SageMaker JumpStart 

Models hosted in SageMaker JumpStart can be deployed to your HyperPod cluster. Navigate to SageMaker Studio, go to SageMaker JumpStart, select the open-weights model you want to deploy, and choose SageMaker HyperPod. After you provide the necessary details, choose Deploy. The inference operator running in the cluster will initiate a deployment in the namespace provided.

Once deployed, you can monitor deployments in SageMaker Studio.

Alternatively, here is a YAML file that you can use to deploy the JumpStart model using kubectl. For example, the following YAML snippet will deploy DeepSeek-R1 Qwen 1.5b from SageMaker JumpStart on an ml.g5.8xlarge instance:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: deepseek-llm-r1-distill-qwen-1-5b-july03
  namespace: default
spec:
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.7
  sageMakerEndpoint:
    name: deepseek-llm-r1-distill-qwen-1-5b
  server:
    instanceType: ml.g5.8xlarge
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://<bucket_name>/certificates

Deploying model from S3 

You can deploy model artifacts directly from S3 to your HyperPod cluster using the InferenceEndpointConfig resource. The inference operator will use the S3 CSI driver to provide the model files to the pods in the cluster. Using this configuration, the operator will download the files located under the prefix deepseek15b, as set by the modelLocation parameter. Here is the complete YAML example and documentation:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    modelLocation: deepseek15b
    modelSourceType: s3
    s3Storage:
      bucketName: mybucket
      region: us-west-2

Deploying model from FSx

Models can also be deployed from FSx for Lustre volumes, high-performance storage that can be used to save model checkpoints. This lets you launch a model without downloading artifacts from S3, saving download time during deployment or scale-up. Setup instructions for FSx in a HyperPod cluster are provided in the Set Up an FSx for Lustre File System workshop. Once set up, you can deploy models using InferenceEndpointConfig. Here is a complete YAML sample:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    fsxStorage:
      fileSystemId: fs-abcd1234
    modelLocation: deepseek-1-5b
    modelSourceType: fsx

Deployment experiences

We are providing multiple deployment experiences: kubectl, the HyperPod CLI, and the SageMaker Python SDK. All deployment options need the HyperPod inference operator to be installed and running in the cluster.

Deploying with kubectl 

You can deploy models using native kubectl with YAML files as shown in the previous sections.

To deploy, you can run kubectl apply -f <manifest_name>.yaml.

Once deployed, you can monitor the status with:

  • kubectl get inferenceendpointconfig will show all InferenceEndpointConfig resources.
  • kubectl describe inferenceendpointconfig <name> will give detailed status information.
  • If using SageMaker JumpStart, kubectl get jumpstartmodels will show all deployed JumpStart models.
  • kubectl describe jumpstartmodel <name> will give detailed status information.
  • kubectl get sagemakerendpointregistrations and kubectl describe sagemakerendpointregistration <name> will provide information on the status of the generated SageMaker endpoint and the ALB.

Other resources that are generated are deployments, services, pods, and ingress. Each resource will be visible from your cluster.

To control the invocation path on your container, you can modify the invocationEndpoint parameter. Your ALB can route requests that are sent to alternate paths such as /v1/chat/completions. To modify the health check path for the container to another path such as /health, you can annotate the generated Ingress object with:

kubectl annotate ingress --overwrite <name> alb.ingress.kubernetes.io/healthcheck-path=/health

Deploying with the HyperPod CLI

The SageMaker HyperPod CLI also offers a way to deploy models from the command line. Once you set your context, you can deploy a model, for example:

!hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --model-version 2.0.4 \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli \
  --tls-certificate-output-s3-uri s3://<bucket_name>/

For more information, see Installing the SageMaker HyperPod CLI and SageMaker HyperPod deployment documentation.

Deploying with Python SDK

The SageMaker Python SDK also supports deploying models on HyperPod clusters. Using the Model, Server, and SageMakerEndpoint configurations, we can construct a specification to deploy on a cluster. An example notebook for deploying with the Python SDK is provided here, for example:

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
# create configs
model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b',
    model_version='2.0.4',
)
server=Server(
    instance_type='ml.g5.8xlarge',
)
endpoint_name=SageMakerEndpoint(name='deepseklr1distill-qwen')
tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<bucket_name>')

# create spec
js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config,
)

# use spec to deploy
js_endpoint.create()

Run inference with deployed models

Once the model is deployed, you can access the model by invoking the model with a SageMaker endpoint or invoking directly using the ALB.

Invoking the model with a SageMaker endpoint

Once a model has been deployed and the SageMaker endpoint is created successfully, you can invoke your model with the SageMaker Runtime client. You can check the status of the deployed SageMaker endpoint by going to the SageMaker AI console, choosing Inference, and then Endpoints. For example, given an input file input.json, we can invoke a SageMaker endpoint using the AWS CLI, which routes the request to the model hosted on HyperPod:

!aws sagemaker-runtime invoke-endpoint \
  --endpoint-name "<ENDPOINT NAME>" \
  --body fileb://input.json \
  --content-type application/json \
  --accept application/json \
  output2.json
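You can also invoke the endpoint with the SageMaker Runtime client in Python. The following is a minimal sketch; the endpoint name placeholder and payload shape are illustrative and depend on your model container:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Invoke the SageMaker endpoint that fronts the model hosted on HyperPod
response = runtime.invoke_endpoint(
    EndpointName="<ENDPOINT NAME>",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps({"inputs": "Hello!"}),  # adjust the payload to your container's schema
)
print(response["Body"].read().decode("utf-8"))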

Invoke the model directly using ALB

You can also invoke the load balancer directly instead of using the SageMaker endpoint. Download the generated certificate from S3 and include it in your trust store or request, or bring your own certificates.

For example, you can invoke a vLLM container deployed after setting the invocationEndpoint value in the deployment YAML shown in the previous section to /v1/chat/completions.

For example, using curl:

curl --cacert /path/to/cert.pem https://<name>.<region>.elb.amazonaws.com/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "/opt/ml/model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
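Equivalently, here is a minimal Python sketch using the requests library; the host name and certificate path are placeholders:

import requests

payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
}

# Verify the TLS connection against the certificate downloaded from S3
response = requests.post(
    "https://<name>.<region>.elb.amazonaws.com/v1/chat/completions",
    json=payload,
    verify="/path/to/cert.pem",
)
print(response.json())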

User experience

These capabilities are designed with different user personas in mind:

  • Administrators: Administrators create the required infrastructure for HyperPod clusters, such as provisioning VPCs, subnets, security groups, and EKS clusters. They also install the required operators in the cluster to support deployment of models and allocation of resources across the cluster.
  • Data scientists: Data scientists deploy foundation models using familiar interfaces—whether that’s the SageMaker console, Python SDK, or kubectl, without needing to understand all Kubernetes concepts. Data scientists can deploy and iterate on FMs efficiently, run experiments, and fine-tune model performance without needing deep infrastructure expertise.
  • Machine Learning Operations (MLOps) engineers: MLOps engineers set up observability and autoscaling policies in the cluster to meet SLAs. They identify the right metrics to export, create the dashboards, and configure autoscaling based on metrics.

Observability

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box observability solution that delivers deep insights into inference workloads and cluster resources. This unified observability solution automatically publishes key metrics from multiple sources including inference containers, NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, and Kueue to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards. With a one-click installation of this HyperPod EKS add-on, users gain access to resource and cluster utilization metrics, along with the following critical inference metrics:

  • model_invocations_total – Total number of invocation requests to the model
  • model_errors_total – Total number of errors during model invocation
  • model_concurrent_requests – Active concurrent model requests
  • model_latency_milliseconds – Model invocation latency in milliseconds
  • model_ttfb_milliseconds – Model time to first byte latency in milliseconds

These metrics capture model inference request and response data regardless of your model type or serving framework when deployed using inference operators with metrics enabled. You can also expose container-specific metrics that are provided by the model container, such as TGI, LMI, and vLLM.

You can enable metrics in JumpStart deployments by setting the metrics.enabled: true parameter:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)

You can enable metrics for fine-tuned models deployed from S3 and FSx using the following configuration. Note that the defaults are port 8080 and the /metrics path; the following example overrides both:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
        port: 8000 # Optional: if overriding the default 8080
        path: "/custom-metrics" # Optional: if overriding the default "/metrics"

For more details, check out the blog post on HyperPod observability and documentation.

Autoscaling

Effective autoscaling handles unpredictable traffic patterns with sudden spikes during peak hours, promotional events, or weekends. Without dynamic autoscaling, organizations must either overprovision resources, leading to significant costs, or risk service degradation during peak loads. LLMs require more sophisticated autoscaling approaches than traditional applications due to several unique characteristics. These models can take minutes to load into GPU memory, necessitating predictive scaling with appropriate buffer time to avoid cold-start penalties. Equally important is the ability to scale in when demand decreases to save costs. Two types of autoscaling are supported: the HyperPod inference operator and KEDA.

Autoscaling provided by HyperPod inference operator

The HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from Amazon CloudWatch and Amazon Managed Service for Prometheus (AMP). This provides a simple and quick way to set up autoscaling for models deployed with the inference operator. Check out the complete autoscaling example in the SageMaker documentation.

Autoscaling with KEDA

If you need more flexibility for complex scaling capabilities and need to manage autoscaling policies independently from model deployment specs, you can use Kubernetes Event-driven Autoscaling (KEDA). KEDA ScaledObject configurations support a wide range of scaling triggers including Amazon CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics like GPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification. For more information, see the Autoscaling documentation.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nd-deepseek-llm-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: deepseek-llm-r1-distill-qwen-1-5b
    apiVersion: apps/v1
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30     # seconds between checks
  cooldownPeriod: 300     # seconds before scaling down
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/ApplicationELB        # or your metric namespace
        metricName: RequestCount              # or your metric name
        dimensionName: LoadBalancer           # or your dimension key
        dimensionValue: app/k8s-default-albnddee-cc02b67f20/0991dc457b6e8447
        statistic: Sum
        threshold: "3"                        # change to your desired threshold
        minMetricValue: "0"                   # optional floor
        region: us-east-2                     # your AWS region
        identityOwner: operator               # use the IRSA SA bound to keda-operator

Task governance

With HyperPod task governance, you can optimize resource utilization by implementing priority-based scheduling. With this approach you can assign higher priority to inference workloads to maintain low-latency requirements during traffic spikes, while still allowing training jobs to utilize available resources during quieter periods. Task governance leverages Kueue for quota management, priority scheduling, and resource sharing policies. Through ClusterQueue configurations, administrators can establish flexible resource sharing strategies that balance dedicated capacity requirements with efficient resource utilization.

Teams can configure priority classes to define their resource allocation preferences. For example, teams should create a dedicated priority class for inference workloads, such as inference with a weight of 100, to ensure they are admitted and scheduled ahead of other task types. By giving inference pods the highest priority, they are positioned to preempt lower-priority jobs when the cluster is under load, which is essential for meeting low-latency requirements during traffic surges. Additionally, teams must appropriately size their quotas. If inference spikes are expected within a shared cluster, the team should reserve a sufficient amount of GPU resources in their ClusterQueue to handle these surges. When the team is not experiencing high traffic, unused resources within their quota can be temporarily allocated to other teams’ tasks. However, once inference demand returns, those borrowed resources can be reclaimed to prioritize pending inference pods.

Here is a sample screenshot that shows both training and deployment workloads running in the same cluster. Deployments have the inference-priority class, which is higher than the training-priority class, so a spike in inference requests has suspended the training job to let deployments scale up and handle the traffic.

For more information, see the SageMaker HyperPod documentation.

Cleanup

You will incur costs for the instances running in your cluster. You can scale down the instances or delete instances in your cluster to stop accruing costs.

Conclusion

With this launch, you can quickly deploy open-weights foundation models from SageMaker JumpStart and custom models from S3 and FSx to your SageMaker HyperPod cluster. SageMaker automatically provisions the infrastructure, deploys the model on your cluster, enables auto-scaling, and configures the SageMaker endpoint. You can use SageMaker to scale the compute resources up and down through HyperPod task governance as the traffic on model endpoints changes, and automatically publish metrics to the HyperPod observability dashboard to provide full visibility into model performance. With these capabilities, you can seamlessly train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

You can start deploying models to HyperPod today in all AWS Regions where SageMaker HyperPod is available. To learn more, visit the Amazon SageMaker HyperPod documentation or try the HyperPod inference getting started guide in the AWS Management Console.

Acknowledgements:

We would like to acknowledge the key contributors for this launch: Pradeep Cruz, Amit Modi, Miron Perel, Suryansh Singh, Shantanu Tripathi, Nilesh Deshpande, Mahadeva Navali Basavaraj, Bikash Shrestha, Rahul Sahu.


About the authors

Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads for Expedia, and was a management consultant at McKinsey.

Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker. His interests include databases, search, machine learning, and AI. He currently focuses on building performant, scalable inference systems for large language models. Outside of work, he enjoys traveling, hiking, and spending time with family.

Chaitanya Hazarey leads software development for inference on SageMaker HyperPod at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.

Andrew Smith is a Senior Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
