Multi-tenant RAG implementation with Amazon Bedrock and Amazon OpenSearch Service for SaaS using JWT

In recent years, the emergence of large language models (LLMs) has accelerated AI adoption across various industries. However, to further augment LLMs’ capabilities and effectively use up-to-date information and domain-specific knowledge, integration with external data sources is essential. Retrieval Augmented Generation (RAG) has gained attention as an effective approach to address this challenge.

RAG is a technique that searches relevant information from existing knowledge bases or documents based on user input, and incorporates this information into the LLM input to generate more accurate and contextually appropriate responses. This technique is being implemented across a wide range of applications, from using technical documentation in product development to answering FAQs in customer support, and even supporting decision-making systems based on the latest data.

The implementation of RAG brings significant value to both software-as-a-service (SaaS) providers and their users (tenants).

SaaS providers can use a multi-tenant architecture that delivers services to multiple tenants from a single code base. As tenants use the service, their data accumulates while being protected by appropriate access control and data isolation. When implementing AI capabilities using LLMs in such environments, RAG makes it possible to use each tenant’s specific data to provide personalized AI services.

Let’s consider a customer service call center SaaS as an example. Each tenant’s historical inquiry records, FAQs, and product manuals are accumulated as tenant-specific knowledge bases. By implementing a RAG system, the LLM can generate appropriate responses relevant to each tenant’s context by referencing these tenant-specific data sources. This enables highly accurate interactions that incorporate tenant-specific business knowledge—a level of customization that would not be possible with generic AI assistants. RAG serves as a crucial component for delivering personalized AI experiences in SaaS, contributing to service differentiation and value enhancement.

However, using tenant-specific data through RAG presents technical challenges from security and privacy perspectives. The primary concern is implementing secure architecture that maintains data isolation between tenants and helps prevent unintended data leakage or cross-tenant access. In multi-tenant environments, the implementation of data security critically impacts the trustworthiness and competitive advantage of SaaS providers.

Amazon Bedrock Knowledge Bases enables simpler RAG implementation. When using OpenSearch as a vector database, there are two options: Amazon OpenSearch Service or Amazon OpenSearch Serverless. Each option has different characteristics and permission models when building multi-tenant environments.

In this post, we introduce tenant isolation patterns using a combination of JSON Web Tokens (JWTs) and fine-grained access control (FGAC), along with tenant resource routing. If the permission model described above prevents you from achieving your FGAC objectives, you can use the solution in this post. The solution is implemented using OpenSearch Service as the vector database and AWS Lambda as the orchestration layer.

In the next section, we explore the specific implementation of tenant isolation using JWT and FGAC in OpenSearch Service, and how this enables a secure multi-tenant RAG environment.

Effectiveness of JWT in multi-tenant data isolation in OpenSearch Service

As introduced in Storing Multi-Tenant SaaS Data with Amazon OpenSearch Service, OpenSearch Service offers multiple methods for managing multi-tenant data: domain-level isolation, index-level isolation, and document-level isolation.

To implement access permission segregation at the index and document levels, you can use FGAC, which is supported by the OpenSearch Security plugin.

In OpenSearch Service, you can achieve granular access control by mapping IAM identities to OpenSearch roles. This enables detailed permission settings in OpenSearch for each IAM identity. However, this approach presents significant scalability challenges. As the number of tenants increases, the required number of IAM users or roles also increases, potentially hitting the limit of AWS service quotas. Additionally, managing numerous IAM entities leads to operational complexity. Although dynamically generated IAM policies could overcome this challenge, each dynamically generated policy is attached to a single IAM role. A single IAM role can be mapped to a single OpenSearch role, but this would still require an IAM role and dynamic policy per tenant for appropriate isolation, which results in similar operational complexity managing numerous entities.

This post provides an alternative approach and focuses on the effectiveness of JWT, a self-contained token for implementing data isolation and access control in multi-tenant environments. Using JWT provides the following advantages:

  • Dynamic tenant identification – JWT payloads can include attribute information (tenant context) to identify tenants. This enables the system to dynamically identify tenants for each request and allows passing this context to subsequent resources and services.
  • Integration with FGAC in OpenSearch – FGAC can directly use attribute information in JWT for role mapping. This allows mapping of access permissions to specific indexes or documents based on information such as tenant IDs in the JWT.

Combining JWT with FGAC provides secure, flexible, and scalable data isolation and access control in a multi-tenant RAG environment using OpenSearch Service. In the next section, we explore specific implementation details and technical considerations for applying this concept in actual systems.

Solution overview

In RAG, data such as relevant documents used to augment LLM outputs are vectorized by embedding language models and indexed in a vector database. User questions in natural language are converted to vectors using the embedding model and searched in the vector database. The data retrieved through vector search is passed to the LLM as context to augment the output. The following diagram illustrates the solution architecture.

architecture diagram

This solution uses OpenSearch Service as the vector data store for storing knowledge sources in RAG. The flow is as follows:

  1. RAG application users for each tenant are created as users in an Amazon Cognito user pool, receiving a JWT enriched with tenant ID information when logging in to the frontend. Each user’s tenant information is stored in Amazon DynamoDB and added to the JWT by a pre-token generation Lambda trigger during user authentication.
  2. When a user initiates a chat on the frontend, the user query is passed to Lambda through Amazon API Gateway along with the JWT.
  3. The user query is vectorized using a text embedding model available in Amazon Bedrock.
  4. Domain and index information for retrieval is obtained from DynamoDB.
  5. Vector search is performed on OpenSearch Service to retrieve information related to the query from the index.
  6. The retrieved information is added to the prompt as context and passed to an LLM available in Amazon Bedrock to generate a response.

The key aspect of this solution is using JWT for tenant data isolation in OpenSearch Service and routing to each tenant’s data. It separates access permissions for each dataset using FGAC available in OpenSearch Service and uses tenant ID information added to the JWT for mapping application users to separated permission sets. The solution provides three different patterns for data isolation granularity to meet customer requirements. Routing is also enabled by defining the mapping between tenant ID information from JWT and data location (domain, index) in DynamoDB.

When users add documents, files are uploaded to Amazon Simple Storage Service (Amazon S3) and metadata is written to a DynamoDB management table. When storing data in OpenSearch Service, the text embedding model (Amazon Bedrock) is called by the ingest pipeline for vectorization. For document creation, update, and deletion, the JWT is attached to requests, allowing tenant identification.

This solution is implemented using the AWS Cloud Development Kit (AWS CDK). For details, refer to the GitHub repository. The instructions to deploy the solution are included in the README file in the repository.

Prerequisites

To try this solution, you must have the following prerequisites:

  • An AWS account.
  • IAM access permissions necessary for running the AWS CDK.
  • A frontend execution environment: Node.js and npm must be installed.
  • The AWS CDK must be configured. For details, refer to Tutorial: Create your first AWS CDK app.
  • Access to the models used in Amazon Bedrock must be configured. This solution uses Anthropic’s Claude 3.5 Sonnet v2 and Amazon Titan Text Embedding V2. For details, refer to Add or remove access to Amazon Bedrock foundation models.

In addition to the resources shown in the architecture diagram, the following resources and configurations are created as AWS CloudFormation custom resources through AWS CDK deployment:

  • Amazon Cognito user pool:
    • Users for tenant-a, tenant-b, tenant-c, and tenant-d
  • DynamoDB table:
    • Mapping between users and tenants
    • Mapping between tenants and OpenSearch connection destinations and indexes
  • OpenSearch Service domain:
    • JWT authentication settings
    • Ingest pipeline for vector embedding
    • FGAC roles and role mappings for each tenant
    • k-NN index

User authentication and JWT generation with Amazon Cognito

This solution uses an Amazon Cognito user pool for RAG application user authentication. Amazon Cognito user pools issue JWTs during authentication. Because FGAC in OpenSearch Service supports JWT authentication, access from users authenticated by the Amazon Cognito user pool can be permitted by registering the public keys issued by the user pool with the OpenSearch Service domain. Additionally, attributes added to the JWT payload are used to segregate tenant data access permissions with FGAC, which we discuss in the following sections. To achieve this, a pre-token generation Lambda trigger is configured in the Amazon Cognito user pool to retrieve each user's tenant ID from DynamoDB and add it to the token. The obtained JWT is retained by the frontend and used for requests to the backend. DynamoDB stores the mapping between user ID (sub) and tenant ID as follows:

{
  "pk": {
    "S": "membership#<Cognito user ID (sub)>"
  },
  "sk": {
    "S": "tenant#tenant-a"
  }
}
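
The following is a minimal sketch of such a pre-token generation Lambda trigger, assuming a hypothetical table name (TenantManagementTable) and the membership key schema shown above; the actual handler in the repository may differ:

import os

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
# Assumed table name; the repository may use a different name or environment variable.
table = dynamodb.Table(os.environ.get("TENANT_TABLE", "TenantManagementTable"))

def handler(event, context):
    # Cognito passes the authenticated user's attributes in the trigger event.
    user_sub = event["request"]["userAttributes"]["sub"]

    # Look up the membership item written for this user (pk = membership#<sub>).
    items = table.query(
        KeyConditionExpression=Key("pk").eq(f"membership#{user_sub}")
    ).get("Items", [])
    if not items:
        raise Exception(f"No tenant mapping found for user {user_sub}")

    # sk is stored as tenant#<tenant ID>; strip the prefix to get the tenant ID.
    tenant_id = items[0]["sk"].split("#", 1)[1]

    # Add the tenant ID to the token so OpenSearch FGAC can map it to a backend role.
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"tenant_id": tenant_id}
    }
    return event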

Although multiple patterns exist for implementing multi-tenant authentication with Amazon Cognito, this implementation uses a single user pool with user-tenant mappings in DynamoDB. Additional considerations are necessary for production environments; refer to Multi-tenant application best practices.

Request routing to tenant data using JWT

In multi-tenant architectures where resources are separated by tenant, it is essential to route requests from each tenant to the appropriate resources. To learn more about tenant routing strategies, see Tenant routing strategies for SaaS applications on AWS. This solution uses an approach similar to the data-driven routing described in that post for routing to OpenSearch Service.

The DynamoDB table stores mapping information for tenant IDs, target OpenSearch Service domains, and indexes as follows:

{
  "pk": {
    "S": "tenant#tenant-a"
  },
  "sk": {
    "S": "os_config"
  },
  "os_host": {
    "S": "<Amazon OpenSearch Service domain endpoint>"
  },
  "os_index": {
    "S": "tenant-a-index"
  },
  "rag_role": {
    "S": "tenant-a_role"
  }
}

The JWT is obtained from the Authorization header in HTTP requests sent from the frontend to the Lambda function through API Gateway. The routing destination is determined by retrieving the routing information using the tenant ID obtained from parsing the JWT. Additionally, the JWT is used as authentication information for requests to OpenSearch, as described in the following section.
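
The following is a minimal sketch of this routing logic inside the Lambda function; the table name, header handling, and helper names are illustrative assumptions:

import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
# Assumed table name holding the tenant#<tenant ID> / os_config items shown above.
table = dynamodb.Table("TenantManagementTable")

def get_tenant_id(jwt_token: str) -> str:
    # Decode the JWT payload (second segment) without verifying it here; signature
    # verification is performed by OpenSearch Service using the public keys
    # registered from the Amazon Cognito user pool.
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["tenant_id"]

def get_routing_config(tenant_id: str) -> dict:
    # Retrieve the OpenSearch Service domain endpoint and index for this tenant.
    item = table.get_item(Key={"pk": f"tenant#{tenant_id}", "sk": "os_config"})["Item"]
    return {"os_host": item["os_host"], "os_index": item["os_index"]}

def lambda_handler(event, context):
    # Header casing depends on the API Gateway integration; lowercase is used here.
    jwt_token = event["headers"]["authorization"].removeprefix("Bearer ").strip()
    routing = get_routing_config(get_tenant_id(jwt_token))
    # routing["os_host"] and routing["os_index"] are used for the vector search request.
    ...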

Multi-tenant isolation of data locations and access permissions in OpenSearch Service

Multi-tenant data isolation strategies in OpenSearch Service include three isolation patterns (domain-level, index-level, and document-level isolation), as well as hybrid models that combine them. This solution uses FGAC for access permission control to tenant data, creating dedicated roles for each tenant.

Mapping between tenant users and FGAC tenant roles is implemented through backend roles. In JWT authentication available in OpenSearch Service, the attribute within the JWT payload to be linked with backend roles can be specified as the Roles key. The following screenshot shows this domain configuration.

JWT-config-in-aos

The JWT payload includes a tenant_id attribute as follows:

"tenant_id": "tenant-a"

Tenant users and FGAC roles are linked by setting this attribute as the Roles key in OpenSearch JWT authentication and mapping roles as follows:

{
  "tenant-a_role": {
    "backend_roles": [
      "tenant-a"
    ]
  }
}

The following screenshot shows an example of tenant role mapping in FGAC in OpenSearch Dashboards.

role-mapping
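
As a sketch, the same role mapping can be applied programmatically through the OpenSearch Security REST API (in this solution it is created by an AWS CDK custom resource); the endpoint and admin credentials below are placeholders:

import requests

OS_ENDPOINT = "https://<opensearch-domain-endpoint>"  # placeholder
ADMIN_AUTH = ("<master-user>", "<master-password>")   # placeholder FGAC master user

# Map the backend role "tenant-a" (the value of the tenant_id JWT claim)
# to the FGAC role tenant-a_role.
response = requests.put(
    f"{OS_ENDPOINT}/_plugins/_security/api/rolesmapping/tenant-a_role",
    auth=ADMIN_AUTH,
    json={"backend_roles": ["tenant-a"]},
)
response.raise_for_status()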

The sample in this solution provides four tenants—tenant-a, tenant-b, tenant-c, and tenant-d—so you can try all three isolation methods. The following diagram illustrates this architecture.

Three isolation method diagram

Each role is assigned permissions to access only the corresponding tenant data. In this section, we introduce how to implement each of the three isolation methods using JWT and FGAC:

  • Domain-level isolation – Assign individual OpenSearch Service domains to each tenant. Because domains are dedicated to each tenant in this pattern of isolation, there’s no need for data isolation within the domain. Therefore, FGAC roles grant access permissions across the indexes. The following code is part of index_permissions in the FGAC role definition that grants access to the indexes:
"index_permissions": [
    {
    "index_patterns": [
        "*"
    ],
  • Index-level isolation – Multiple tenants share an OpenSearch Service domain, with individual indexes assigned to each tenant. Each tenant should only be able to access their own index, so index_permissions in the FGAC role is configured as follows (example for tenant-b):
"index_permissions": [
    {
    "index_patterns": [
        "tenant-b-index*"
    ]
  • Document-level isolation – Multiple tenants share OpenSearch Service domains and indexes, using FGAC document-level security for access permission segregation of tenant data within the index. Each index includes a field to store tenant ID information, and document-level security queries are set for that field. The following code is part of index_permissions for an FGAC role that allows tenant-c to access only its own data in a configuration where tenant-c and tenant-d share an index:
"index_permissions": [
    {
    "index_patterns": [
        "tenant-cd-shared-index*"
    ],
    "dls": """{"bool": {"must": {"match": {"tenant_id": "tenant-c"}}}}""",

The following screenshot shows an example of index permission for document-level isolation in the FGAC role.

fgac role permission setting
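
Putting routing and FGAC together, the Lambda function can pass the caller's JWT through as a bearer token when querying the tenant's index, so OpenSearch Service evaluates the search under the mapped tenant role. The following is a minimal sketch; the vector field name (vector_field) and the embedding call are assumptions for illustration:

import json

import boto3
import requests

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list:
    # Vectorize the query with Amazon Titan Text Embeddings V2.
    body = json.dumps({"inputText": text})
    response = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return json.loads(response["body"].read())["embedding"]

def search(os_host: str, os_index: str, jwt_token: str, query_text: str) -> list:
    knn_query = {
        "size": 3,
        "query": {"knn": {"vector_field": {"vector": embed(query_text), "k": 3}}},
    }
    # The JWT is sent as a bearer token; FGAC maps its tenant_id claim to the tenant's
    # role, so only documents that tenant is allowed to see can be returned.
    response = requests.post(
        f"https://{os_host}/{os_index}/_search",
        headers={"Authorization": f"Bearer {jwt_token}"},
        json=knn_query,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["hits"]["hits"]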

Considerations

The implementation in this post uses a model where DynamoDB tables and S3 buckets are shared between tenants. For production use, consider the partitioning models introduced in Partitioning Pooled Multi-Tenant SaaS Data with Amazon DynamoDB and Partitioning and Isolating Multi-Tenant SaaS Data with Amazon S3, and determine the optimal model based on your requirements.

Additionally, you can use dynamic generation of IAM policies as an additional layer to restrict access permissions to each resource.

Clean up

To avoid unexpected charges, we recommend deleting resources when they are no longer needed. Because the resources are created with the AWS CDK, run the cdk destroy command to delete them. This operation will also delete the documents uploaded to Amazon S3.

Conclusions

In this post, we introduced a solution that uses OpenSearch Service as a vector data store in multi-tenant RAG, achieving data isolation and routing using JWT and FGAC.

This solution uses a combination of JWT and FGAC to implement strict tenant data access isolation and routing, necessitating the use of OpenSearch Service. The RAG application is implemented independently, because at the time of writing, Amazon Bedrock Knowledge Bases can't use JWT-based access to OpenSearch Service.

Multi-tenant RAG usage is important for SaaS companies, and strategies vary depending on requirements such as data isolation strictness, ease of management, and cost. This solution implements multiple isolation models, so you can choose based on your requirements.

For other solutions and information regarding multi-tenant RAG implementation, refer to the following resources:


About the authors

Kazuki Nagasawa is a Cloud Support Engineer at Amazon Web Services. He specializes in Amazon OpenSearch Service and focuses on solving customers’ technical challenges. In his spare time, he enjoys exploring whiskey varieties and discovering new ramen restaurants.

Kensuke Fukumoto is a Senior Solutions Architect at Amazon Web Services. He’s passionate about helping ISVs and SaaS providers modernize their applications and transition to SaaS models. In his free time, he enjoys riding motorcycles and visiting saunas.

Read More

Enhance generative AI solutions using Amazon Q index with Model Context Protocol – Part 1

Today’s enterprises increasingly rely on AI-driven applications to enhance decision-making, streamline workflows, and deliver improved customer experiences. Achieving these outcomes demands secure, timely, and accurate access to authoritative data—especially when such data resides across diverse repositories and applications within strict enterprise security boundaries.

Interoperable technologies powered by open standards like the Model Context Protocol (MCP) are rapidly emerging. MCP simplifies the process for connecting AI applications and agents to third-party tools and data sources, enabling lightweight, real-time interactions and structured operations with minimal engineering effort. Independent software vendor (ISV) applications can securely query their customers’ Amazon Q index using cross-account access, retrieving only the content each user is authorized to see, such as documents, tickets, chat threads, CRM records, and more. Amazon Q connectors regularly sync and index this data to keep it fresh. Amazon Q index’s hybrid semantic-plus-keyword ranking then helps ISVs deliver context-rich answers without building their own search stack.

As large language models (LLMs) and generative AI become integral to enterprise operations, clearly defined integration patterns between MCP and Amazon Q index become increasingly valuable. ISVs exploring the MCP landscape to automate structured actions such as creating tickets or processing approvals can seamlessly integrate Amazon Q index to retrieve authoritative data. Authoritative data enables accurate and confident execution of these actions, reducing risk, minimizing costly errors, and strengthening trust in AI-driven outcomes. For example, a customer support assistant using MCP can automatically open an urgent ticket and instantly retrieve a relevant troubleshooting guide from Amazon Q index to accelerate incident resolution. AWS continues to invest in tighter interoperability between MCP and Amazon Q index within enterprise AI architectures. In this post, we explore best practices and integration patterns for combining Amazon Q index and MCP, enabling enterprises to build secure, scalable, and actionable AI search-and-retrieval architectures.

Key components overview

Let’s break down the two key components referenced throughout the post: MCP and Amazon Q index.

MCP is an open JSON-RPC standard that lets LLMs invoke external tools and data using structured schemas. Each tool schema defines actions, inputs, outputs, versioning, and access scope, giving developers a consistent interface across enterprise systems. To learn more, refer to the MCP User Guide.

Amazon Q index is a fully managed, cross-account, semantic search service within Amazon Q Business that helps ISVs augment their generative AI chat assistants with customer data. It combines semantic and keyword-based ranking to securely retrieve relevant, user-authorized content through the SearchRelevantContent API, so ISVs can enrich their applications with precise, customer-specific context.

Companies like Zoom and PagerDuty use Amazon Q index to enhance their AI-driven search experiences. For example, Zoom uses Amazon Q index to help users securely and contextually access their enterprise knowledge directly within the Zoom AI Companion interface, enhancing real-time productivity during meetings. Similarly, PagerDuty Advance uses Amazon Q index to surface operational runbooks and incident context during live alerts, dramatically improving incident resolution workflows.

Enhancing MCP workflows with Amazon Q index

To fully capitalize on MCP-driven structured actions, modern AI assistants require enterprise-grade knowledge retrieval capabilities—fast responses, precise relevance ranking, and robust permission enforcement. Effective actions depend on timely, accurate, and secure access to authoritative enterprise data. Amazon Q index directly meets these advanced search needs, providing a secure, scalable retrieval layer that enhances and accelerates MCP workflows:

  • Secure ISV integration with the data accessor pattern – ISVs can seamlessly integrate customer enterprise data into their applications using Amazon Q index, providing enriched, generative AI-driven experiences without needing to store or directly index customer data sources. This follows the data accessor pattern, where the ISV acts as a trusted accessor with scoped permissions to securely query the customer’s Amazon Q index and retrieve only authorized results. Companies like Asana, Zoom, and PagerDuty already use this integration approach to enhance their applications securely and efficiently.
  • Highly accurate and managed relevance – Amazon Q index automatically executes both keyword-based (sparse) matching and vector-based (dense/semantic) similarity searches with every SearchRelevantContent API call. Semantic search uses embeddings to understand the contextual meaning of content rather than relying solely on keyword matches, significantly improving accuracy and user satisfaction. Combining semantic and keyword-based search (a hybrid approach) facilitates maximum retrieval accuracy and relevant results.
  • Built-in connectors and automatic indexing – Amazon Q index offers managed, built-in connectors for widely used enterprise applications such as SharePoint, Amazon Simple Storage Service (Amazon S3), and Confluence. These connectors automatically crawl and index enterprise content on a scheduled basis, significantly reducing manual setup and maintenance while keeping data fresh and searchable.
  • Fully managed document-level security – During indexing, Amazon Q index captures source-system ACLs, automatically enforcing these permissions with every query. Users can only search data they’ve been previously granted permission to access. Data is encrypted using customer managed AWS Key Management Service (AWS KMS) keys, with access logged using AWS CloudTrail for auditability.

By managing indexing, ranking, and security, Amazon Q index helps organizations deploy sophisticated enterprise search quickly—typically within weeks. To learn more, see Amazon Q index for independent software vendors (ISVs).

Amazon Q index integration patterns

Now that we’ve explored how Amazon Q index enhances MCP workflows, let’s look at two practical integration patterns enterprises and ISVs commonly adopt to combine these complementary technologies. ISVs and enterprises can access a unified, identity-aware semantic search API called SearchRelevantContent that securely accesses connected enterprise data sources (to learn more, see New capabilities from Amazon Q Business enable ISVs to enhance generative AI experiences).

When planning their integration strategy, organizations typically evaluate factors such as implementation speed, operational complexity, security requirements, and existing MCP commitments. The following patterns highlight common integration approaches, outlining the associated trade-offs and benefits of each scenario:

  • Pattern 1 – Amazon Q index integration with a data accessor (no MCP layer)
  • Pattern 2 – Integrating Amazon Q index using MCP tools

Pattern 1: Amazon Q index integration with a data accessor (no MCP layer)

Customers might opt for simplicity and speed by directly using Amazon Q index without involving MCP. The following diagram illustrates this straightforward and fully managed approach.

Amazon Q index integration with a data accessor (no MCP layer)

This pattern is best suited when your primary requirement is direct, performant search through a fully managed API, and you don’t currently need the orchestration and standardization provided by MCP integration. To learn more, refer to Q index workshop and the following GitHub repo.

The pattern includes the following components:

  • The SearchRelevantContent API is called using a secure, scoped AWS Identity and Access Management (IAM) role provided by the ISV, as sketched after this list. There’s no MCP layer to build, credentials to manage, or infrastructure to run—integration is handled entirely through an AWS managed API.
  • After the ISV-provided IAM role is approved by the enterprise and AWS, AWS manages the backend—including connectors, incremental content crawling, vector and keyword indexing, intelligent ranking, and secure, document-level access control within Amazon Q index.
  • Enterprise permissions are scoped to a single IAM role that the enterprise explicitly approves. Indexed data is encrypted using customer managed KMS keys, with access tightly controlled and fully audited through CloudTrail.
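
For illustration, a direct call to the SearchRelevantContent API might look like the following sketch, assuming the boto3 qbusiness client's search_relevant_content operation; the application and retriever IDs are placeholders, and credential handling for the approved data accessor role and the end-user identity context is omitted:

import boto3

APPLICATION_ID = "<q-business-application-id>"  # placeholder
RETRIEVER_ID = "<q-business-retriever-id>"      # placeholder

# Credentials for the approved cross-account data accessor role are assumed
# to already be configured for this session.
qbusiness = boto3.client("qbusiness")

response = qbusiness.search_relevant_content(
    applicationId=APPLICATION_ID,
    contentSource={"retriever": {"retrieverId": RETRIEVER_ID}},
    queryText="How do I rotate the API keys for the billing service?",
    maxResults=5,
)

for item in response.get("relevantContent", []):
    # Results are already filtered to content the calling user is authorized to see.
    print(item.get("documentTitle"), item.get("documentUri"))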

Pattern 2: Integrating Amazon Q index using MCP tools

By adding Amazon Q index retrieval using MCP, ISVs maintain a consistent MCP-based architecture across actions and retrieval, as illustrated in the following diagram.

Integrating Amazon Q index using MCP tools

This pattern provides a uniform MCP interface for ISVs who already use MCP tools for multiple structured actions. To learn more, refer to the following GitHub repo.

The pattern includes the following components:

  • The SearchRelevantContent API is wrapped as a tool inside an existing MCP system, optionally adding custom logging or throttling (see the sketch after this list).
  • End-users interact only with the ISV’s application. Behind the scenes, the ISV’s MCP server queries Amazon Q index with the approved data accessor role.
  • ISVs must protect tenant isolation, encrypt transit traffic, and log every call. The enterprise offloads patching and intrusion detection to the ISV but retains document‑level ACL enforcement using Amazon Q index.
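
As a sketch of this pattern, the same call can be exposed as an MCP tool, here using the FastMCP helper from the MCP Python SDK; the IDs are placeholders, and error handling, logging, and tenant isolation controls are omitted:

import boto3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("q-index-retrieval")
qbusiness = boto3.client("qbusiness")  # data accessor role credentials assumed

@mcp.tool()
def search_enterprise_knowledge(query: str) -> list:
    """Retrieve user-authorized enterprise content from Amazon Q index."""
    response = qbusiness.search_relevant_content(
        applicationId="<q-business-application-id>",                     # placeholder
        contentSource={"retriever": {"retrieverId": "<retriever-id>"}},  # placeholder
        queryText=query,
        maxResults=5,
    )
    # Return a trimmed view of the results to the calling agent.
    return [
        {"title": item.get("documentTitle"), "uri": item.get("documentUri")}
        for item in response.get("relevantContent", [])
    ]

if __name__ == "__main__":
    mcp.run()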

Considerations for choosing your integration pattern

When choosing your integration pattern, consider these key questions:

  • Is rapid deployment with minimal operational overhead your top priority? Choose Pattern 1 (direct SearchRelevantContent using a data accessor) if you want the fastest route to production-grade, managed retrieval. AWS fully manages indexing, ranking, and document-level permissions, requiring no additional infrastructure from your organization.
  • Are you an ISV aiming to deliver a consistent MCP interface for orchestrating retrieval alongside other tools? Pattern 2 (ISV-hosted MCP) is typically the best choice if you’re an ISV providing a standardized MCP experience to multiple enterprise customers. AWS continues managing indexing, ranking, and permissions, and your organization maintains and operates the MCP server infrastructure for greater orchestration flexibility.

Your ideal integration path ultimately depends on balancing rapid deployment, orchestration flexibility, and compliance requirements specific to your organization.

Determining when MCP-only retrieval is sufficient

Although integrating MCP with Amazon Q index effectively addresses most scenarios for enriching ISV application responses with enterprise data, certain clearly defined use cases benefit from a simpler, MCP-only approach. MCP’s schema-driven architecture is ideal for straightforward, keyword-based queries involving a single or limited set of repositories, such as checking ticket statuses. It also excels when real-time data retrieval is essential, including inventory monitoring, streaming log analysis, or accessing real-time metrics, where pre-indexing content offers little value. Additionally, some vendors offer ready-made, MCP-compatible endpoints, such as Atlassian’s interface for Confluence, so enterprises can quickly plug into these MCP servers, access real-time data without indexing, and use secure, feature-rich integrations that are supported and maintained by the vendor.

In these scenarios, MCP-only retrieval serves as an efficient, lightweight alternative to fully indexed search solutions like Amazon Q index—especially when the need for orchestration, ranking, and semantic understanding is minimal.

Conclusion

In this post, we explored how ISVs can integrate Amazon Q index into the MCP landscape for enterprise data retrieval, complementing other structured-action tools. Authoritative data is critical for structured actions because it enables accurate decision-making, reduces operational risk, minimizes costly errors, and strengthens trust in AI-driven solutions. By combining MCP’s ability to automate real-time actions with the powerful data retrieval capabilities of Amazon Q index, enterprises and ISVs can rapidly address critical business problems using generative AI. This integrated approach reduces complexity, streamlines operations, and helps organizations meet stringent governance, compliance, and performance standards without the need to build custom indexing and retrieval infrastructure. AWS continues to actively invest in enhancing interoperability between MCP and Amazon Q index. Stay tuned for part two of this blog series, where we explore upcoming integration capabilities and share guidance for building your enterprise AI architectures. To explore Amazon Q index and MCP integrations further, refer to the following resources:

You can also contact AWS directly or sign in to your AWS Management Console to get started today.


About the authors

Ebbey Thomas is a Senior Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Vishnu Elangovan is a Worldwide Generative AI Solution Architect with over seven years of experience in Data Engineering and Applied AI/ML. He holds a master’s degree in Data Science and specializes in building scalable artificial intelligence solutions. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Outside his professional pursuits, he enjoys traveling, participating in sports, and exploring new problems to solve.

Read More

Technical approach for classifying human-AI interactions at scale

As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.

But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.

In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system—how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.

For additional project background: Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.

System architecture highlights

The Semantic Telemetry pipeline is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:

  • Hybrid compute engine
    The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes.
  • LLM-centric transformation layer
    At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints:
    • It is model agnostic, providing a generic interface for LLMs from which model-specific interfaces are built.
    • Prompt templates are defined using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
    • Parsing and cleaning logic ensures structured, schema-aligned outputs even when LLM responses are imperfect, such as by removing extra characters from output, resolving inexact label matches (e.g., “create” versus “created”), and relabeling invalid classifications.
Figure 1. Architecture diagram of the LLM workflow

The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.

Engineering challenges & solutions

Building a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.

LLM endpoint latency & variability

Challenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.

Solution: We implemented a combination of:

  • Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
  • Saving output in intervals to write data asynchronously in case of network errors.
  • Utilizing models with higher tokens-per-minute (TPM) limits, such as OpenAI’s GPT-4o mini. GPT-4o mini has a 2M TPM limit, which is a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
  • Timeouts and retries with exponential backoff (a simplified sketch follows this list).
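
One way to implement the timeout and retry behavior, assuming a hypothetical call_llm coroutine rather than any specific client:

import asyncio
import random

async def call_with_backoff(call_llm, prompt, max_retries=5, base_delay=1.0, timeout=60):
    """Call an LLM endpoint with a timeout and exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await asyncio.wait_for(call_llm(prompt), timeout=timeout)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of noise.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())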

Evolving LLM models & prompt alignment

Challenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.

Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:

  • Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
  • Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
  • Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.

To support this process, we used tools like Sammo, which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.

Dynamic concurrency scaling for LLM calls

Challenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage. The models’ speeds can also vary, complicating the selection of optimal concurrency levels. Furthermore, users may choose suboptimal settings due to lack of familiarity, and default concurrency configurations are rarely ideal for every situation. Dynamic adjustments based on throughput, measured in various ways, can assist in determining optimal concurrency levels.

Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:

  • External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
  • Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
  • Latency-based feedback loop: Instead of waiting for rate-limit errors, the system measures the response time of LLM calls. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up (a simplified sketch follows this list).
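
Below is one way such a controller could look; the thresholds and the call_llm coroutine are illustrative placeholders:

import asyncio
import time

async def classify_batch(call_llm, prompts, limit):
    """Run one batch of LLM calls with at most `limit` in flight; return results and stats."""
    semaphore = asyncio.Semaphore(limit)
    latencies, failures = [], 0

    async def one(prompt):
        nonlocal failures
        async with semaphore:
            start = time.monotonic()
            try:
                result = await call_llm(prompt)
                latencies.append(time.monotonic() - start)
                return result
            except Exception:
                failures += 1
                return None

    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, latencies, failures

async def run_with_dynamic_concurrency(call_llm, batches, limit=8, min_limit=1,
                                        max_limit=64, latency_target=5.0,
                                        failure_target=0.05):
    """Adjust concurrency between batches based on observed failures and latency."""
    all_results = []
    for prompts in batches:
        results, latencies, failures = await classify_batch(call_llm, prompts, limit)
        all_results.extend(results)
        failure_rate = failures / max(len(prompts), 1)
        avg_latency = sum(latencies) / max(len(latencies), 1)
        # Back off quickly on failures or rising latency; ramp up cautiously otherwise.
        if failure_rate > failure_target or avg_latency > latency_target:
            limit = max(min_limit, limit // 2)
        else:
            limit = min(max_limit, limit + 1)
    return all_results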



Optimization experiments

To further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.

Batch endpoints (Azure/OpenAI)

Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour period, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they require at least 24 hours to complete requests and provide lower overall throughput compared to non-batch endpoints, making them unsuitable for situations needing quick results.

Conversation batching in prompts during pipeline runtime

Batching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.

Combining classifiers in a single prompt

Combining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.

Classification using text embeddings

An alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.

For example, starting with a set of conversations to validate and test the new model, run these conversations through the original prompt-based classifier to generate a set of golden classifications, then obtain text embeddings (using a tool like text-embedding-3-large) for each conversation. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; if there is a model for each classifier, a single embedding retrieval per conversation suffices for all classifiers.

The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.
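
The following is a minimal sketch of this setup using scikit-learn; the randomly generated embeddings and labels stand in for real conversation embeddings (for example, from text-embedding-3-large) and the golden classifications produced by the prompt-based classifier:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data: one embedding vector per conversation plus its golden label.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 3072))     # placeholder embedding vectors
golden_labels = rng.integers(0, 5, size=1000)  # placeholder class labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, golden_labels, test_size=0.2, random_state=42
)

# A small multi-layer perceptron trained on embeddings stands in for the
# prompt-based classifier; one such model is trained per classification task.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=50)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))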

Prompt compression

Compressing prompts by eliminating unnecessary tokens or by using a tool such as LLMLingua to automate prompt compression can optimize classification prompts either ahead of time or in real time. This approach increases overall throughput and results in cost savings due to a reduced number of tokens, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the compression technique, it could even decrease throughput if the compression process takes longer than simply sending uncompressed text to the LLM.

Text truncation

Truncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput like prompt compression. By reducing the number of tokens per request, throughput rises because more requests can be made before reaching the endpoint’s tokens-per-minute (TPM) limit, and costs decrease due to fewer tokens being processed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before implementation. While this approach brings clear efficiency benefits, it also poses a risk: long conversations may have their most important content cut off, which can reduce classification accuracy.
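
The following is a minimal sketch of token-based truncation using the tiktoken library; the encoding name and length limit are illustrative:

import tiktoken

def truncate_conversation(text: str, max_tokens: int = 2000,
                          encoding_name: str = "cl100k_base") -> str:
    """Keep only the first max_tokens tokens of a conversation before classification."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])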

Conclusion

Building a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.


Read More

Into the Omniverse: How Global Brands Are Scaling Personalized Advertising With AI and 3D Content Generation

In today’s fast-evolving digital landscape, marketing teams face increasing pressure to deliver personalized, brand-accurate content at scale and speed. Traditional content creation workflows are often time-consuming, costly and fragmented across multiple tools and teams.

Universal Scene Description (OpenUSD), an open and extensible 3D framework, is helping teams overcome these challenges by streamlining how marketing content is created, managed and delivered.

Global brands including Coca-Cola, Moët Hennessy, Nestlé and Unilever are harnessing innovative marketing solutions built on NVIDIA Omniverse — a platform for developing OpenUSD applications. These AI-based solutions dramatically accelerate content generation for advertising and consumer engagement:

  • Moët Hennessy boosts local responsiveness by scaling over 3 million content variations globally, at double the speed.
  • Nestlé reduces time and costs associated with advertising by 70% by scaling digital twins.
  • Unilever’s content imagery is being created 2x faster and at half the cost of traditional methods, leading to 100% brand consistency and quicker content creation.

By using the NVIDIA Omniverse Blueprint for precise visual generative AI, solution providers and software developers are enabling organizations to rapidly produce high-quality, brand-accurate, engaging visuals for local markets at scale, streamlining workflows and ensuring creative consistency across every channel.

Accelerating Content Creation From Weeks to Minutes

Industry leaders are already seeing the results of tapping AI and OpenUSD for marketing workflows.

Accenture Song used OpenUSD in Omniverse to launch an AI-powered content service for Nestlé. The content service creates exact 3D virtual replicas of products for e-commerce and digital media channels, demonstrating the impact of digital twins and advanced 3D workflows.

Accenture Song, Nestlé

SKAI Intelligence, a global provider of AI-powered content creation solutions, recently debuted the world’s first end-to-end, retail-focused AI-generated content production pipeline built entirely on NVIDIA Omniverse. The browser-based, AI-native workflow automates the entire content generation process — from product scanning and modeling to animation, lighting and rendering — and delivers up to 95% faster production speeds versus traditional methods.

Katana Studio, a real-time 3D content creation studio and developer behind the COATcreate tool, has used NVIDIA Omniverse to streamline automotive marketing for Nissan, significantly reducing asset creation timelines and costs.

INDG, a digital content automation company, developed the software-as-a-service platform Grip on NVIDIA Omniverse and OpenUSD to empower global brands like Moët Hennessy and Coca-Cola to produce high-quality, brand-consistent content across markets.

Grip, Moët Hennessy

By centralizing OpenUSD asset libraries and creating digital twins of products, Grip enables teams to quickly assemble, adapt and deploy campaign-ready content in just minutes — rather than weeks. This approach directly addresses the challenges of slow, costly and inconsistent manual localization processes that have long hindered marketing efforts.

Grip relies on rules-based AI, NVIDIA RTX GPUs and the NVIDIA AI Enterprise software platform to ensure brand consistency across diverse markets. The Grip platform also integrates Bria’s visual generative AI models to enhance automated content production at scale. Grip’s content engine acts as a virtual art director, codifying and enforcing brand guidelines for every asset while dynamically adjusting composition, lighting and product details.

Dive deeper into Grip’s innovative approach at the company’s upcoming session at SIGGRAPH, a computer graphics conference taking place Aug. 10-14 at the Vancouver Convention Centre and online.

Unilever, in collaboration with Collective World, is using Omniverse, OpenUSD and photorealistic 3D digital twins to accelerate content production. Unilever’s new content-creation workflow, powered by real-time 3D rendering, has cut production timelines from months to days, halved costs and enabled consistent brand experiences across markets with a 5x reduction in content duplication.

Collective World, Unilever, Nexxus

Monks, a digital-first marketing and technology services company, is also using Omniverse and OpenUSD to drive hyperpersonalized and collaborative product experiences. The technologies allow Monks’ services to empower brands to virtually explore and customize product designs in real time.

Hear Monks representatives discuss how they’re building automated pipelines and agentic systems to ease deployment and scaling of AI-driven marketing operations across the enterprise.

Get Plugged Into the World of OpenUSD

Discover the future of 3D content creation and connect with the OpenUSD community by joining NVIDIA at SIGGRAPH. Highlights will include:

  • A special address on Monday, Aug. 11, with NVIDIA AI research leaders Sanja Fidler, Aaron Lefohn and Ming-Yu Liu, who’ll chart the next frontier in computer graphics and physical AI.
  • OpenUSD Day, taking place on Wednesday, Aug. 13, features sessions and a developer meetup where developers and industry leaders can explore how OpenUSD is adopted across every application, from content creation and simulation to physical AI.
  • Hands-on OpenUSD training for all skill levels, including the first-ever in-person opportunity to receive USD certification.

Discover why developers and 3D practitioners are using OpenUSD and learn how to optimize 3D workflows with the self-paced “Learn OpenUSD” curriculum for 3D developers and practitioners, available for free through the NVIDIA Deep Learning Institute.

Explore the Alliance for OpenUSD forum and the AOUSD website.

Stay up to date by subscribing to NVIDIA Omniverse news, joining the Omniverse community and following NVIDIA Omniverse on Instagram, LinkedIn, Medium and X.

Featured image courtesy of Grip, Moët Hennessy.

Read More

FastVLM: Efficient Vision Encoding for Vision Language Models

Vision Language Models (VLMs) enable visual understanding alongside textual inputs. They are typically built by passing visual tokens from a pretrained vision encoder to a pretrained Large Language Model (LLM) through a projection layer. By leveraging the rich visual representations of the vision encoder and the world knowledge and reasoning capabilities of the LLM, VLMs can be useful for a wide range of applications, including accessibility assistants, UI navigation, robotics, and gaming.
VLM accuracy generally improves with higher input image resolution, creating a tradeoff between accuracy…

Apple Machine Learning Research

Mitra: Mixed synthetic priors for enhancing tabular foundation models


Generating diverse synthetic prior distributions leads to a tabular foundation model that outperforms task-specific baselines.

Machine learning

July 22, 01:40 PM

Tabular data powers critical decisions across domains such as healthcare, finance, e-commerce, and the sciences. The machine learning methods traditionally used for tabular data, however, such as random forests and XGBoost, typically result in models tailored to individual datasets, with limited ability to transfer across different distributions.

Inspired by the success of large language models, tabular foundation models (TFMs) promise to change that: instead of requiring a separately trained model for each task, a single pretrained model can generalize to new tasks simply by conditioning on a moderate number of examples, a technique known as in-context learning (ICL).

As part of the latest release of Amazon’s automatic machine learning framework AutoGluon, we are introducing Mitra, a tabular foundation model trained within this ICL-based paradigm. Much the way large language models (LLMs) are trained on diverse corpora of text, Mitra is pretrained on synthetic datasets generated by a carefully designed mixture of prior distributions (priors).

At first blush, it may seem surprising that we used no real-world data in pretraining Mitra. But real-world tabular data is often limited and heterogeneous, with varying feature types, dependencies, and noise levels. It proves more practical to simulate diverse synthetic datasets that cover a wide range of possible data patterns.

We find that the quality of these synthetic priors plays a critical role in how well the model generalizes. Effective priors tend to (1) yield good performance on real tasks; (2) exhibit diversity, which prevents overfitting; and (3) offer unique patterns not found in other priors.

Based on these principles, we construct a mixture that includes structural causal models, which combine graphs of the causal dependencies between variables with (probabilistic) equations describing the effects that varying each variable’s value has on its dependent variables; and popular tree-based methods like gradient boosting, random forests, and decision trees. Together, these priors enable Mitra to learn robust representations and generalize effectively to a wide variety of real-world tabular problems.

Overview of the Mitra framework. We pretrain tabular foundation models (TFMs) on a mixture of synthetic data priors, including structural causal models and tree-based models. Each dataset is split into support and query sets. Mitra supports both 2-D attention across rows and columns and 1-D row-wise attention. At inference, the model conditions on support examples from real datasets to predict query labels using in-context learning (ICL) without gradient updates.

We pretrain Mitra on our selected mixture of priors. Each synthetic task consists of a support set and a query set. The model learns to predict the labels of the query set by attending to the support set; no gradient updates are required. Over millions of such tasks, Mitra learns generalizable patterns of reasoning and adaptation. The architecture is based on 2-D attention across both rows and features, allowing flexible handling of varying table sizes and feature interactions.

We evaluated Mitra on both classification and regression tasks, across major tabular benchmarks such as TabRepo, TabZilla, AMLB, and TabArena. Mitra demonstrated state-of-the-art performance when compared with strong TFMs such as TabPFNv2 and TabICL, as well as with dataset-specific models such as CatBoost, RealMLP, and AutoGluon 1.3 best-quality preset.

The results of the Mitra evaluation. Winner and runner-up for each evaluation metric are shown in green and blue. The abbreviation +e means ensembling in ICL, and +f means fine-tuning. The 95% confidence interval is shown in parentheses for the Elo. The columns in the aggregated metrics are the mean and standard deviation (shown in parentheses) of the corresponding metric.
Decision boundaries of Mitra and baselines on 2-D sinusoidal checkerboard data. Mitra shows more-regular and less-fragmented decision boundaries than TabPFNv2.

Just as foundation models have reshaped the domains of computer vision and natural-language processing, Mitra offers a more general and effective approach to tabular-data prediction. As the field progresses, we envision even richer prior spaces and adaptive mixture strategies. Mitra is open sourced (links below) in the AutoGluon 1.4 release and ready to use. We invite researchers and practitioners to explore this new foundation for tabular prediction.

Learn more:

Acknowledgments: Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, Bernie Wang

Research areas: Machine learning

Tags: Tabular data

Read More

Beyond accelerators: Lessons from building foundation models on AWS with Japan’s GENIAC program

Beyond accelerators: Lessons from building foundation models on AWS with Japan’s GENIAC program

In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator Challenge (GENIAC)—a Japanese national program to boost generative AI by providing companies with funding, mentorship, and massive compute resources for foundation model (FM) development. AWS was selected as the cloud provider for GENIAC’s second cycle (cycle 2). It provided infrastructure and technical guidance for 12 participating organizations. On paper, the challenge seemed straightforward: give each team access to hundreds of GPUs/Trainium chips and let innovation ensue. In practice, successful FM training required far more than raw hardware.

AWS discovered that allocating over 1,000 accelerators was merely the starting point; the real challenge lay in architecting a reliable system and overcoming distributed training obstacles. During GENIAC cycle 2, 12 customers successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 TensorCore GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium1 servers) in a single day. Over the following 6 months, multiple large-scale models were trained, including notable projects such as Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B.

This post shares the key insights from this engagement and valuable lessons for enterprises or national initiatives aiming to build FMs at scale.

Cross-functional engagement teams

A crucial early lesson from technical engagement for the GENIAC was that running a multi-organization, national-scale machine learning (ML) initiative requires coordinated support across diverse internal teams. AWS established a virtual team that brought together account teams, specialist Solutions Architects, and service teams. The GENIAC engagement model thrives on close collaboration between customers and a multi-layered AWS team structure, as illustrated in the following figure.

cross-functional-team-engagement

Customers (Cx) typically consist of business and technical leads, including ML and platform engineers, and are responsible for executing training workloads. AWS account teams (Solutions Architects and Account Managers) manage the relationship, maintain documentation, and keep communication flowing between customers and internal specialists.

The World Wide Specialist Organization (WWSO) Frameworks team specializes in large-scale ML workloads, with a focus on core HPC and container services such as AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker HyperPod. The WWSO Frameworks team established this engagement structure and supervised the technical engagements in the program, leading each engagement in partnership with, and serving as an escalation point for, the other stakeholders. They work directly with the service teams for Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon FSx, and SageMaker HyperPod to navigate engagements and escalations (business and technical) and to make sure the engagement framework stays in working order. They also provide guidance on training and inference to customers and educate other teams on the technology.

The WWSO Frameworks team worked closely with Lead Solutions Architects (Lead SAs), a role specifically designated to support GENIAC engagements. These Lead SAs are a cornerstone of the engagement: as an extension of the Frameworks specialist team, they work directly with customers and the account teams, and they bring in their Frameworks specialist counterparts when in-depth technical discussions or troubleshooting require further expertise. With this layered structure, AWS can scale technical guidance effectively across complex FM training workloads.

Another critical success factor for GENIAC was establishing robust communication channels between customers and AWS team members. The foundation of our communication strategy was a dedicated internal Slack channel for GENIAC program coordination, connecting AWS account teams with Lead SAs. This channel enabled real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues to the appropriate technical specialists and service team members. Complementing this was an external Slack channel that bridged AWS teams with customers, creating a collaborative environment where participants could ask questions, share insights, and receive immediate support. This direct line of communication significantly reduced resolution times and fostered a community of practice among participants.

AWS maintained comprehensive workload tracking documents, which clarified each customer’s training implementation details (model architecture, distributed training frameworks, and related software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including Amazon FSx for Lustre and Amazon S3). This tracking system also maintained a chronological history of customer interactions and support cases. In addition, the engagement team held weekly review meetings to track outstanding customer inquiries and technical issues. This regular cadence made it possible for team members to share lessons learned and apply them to their own customer engagements, fostering continuous improvement and knowledge transfer across the program.

With a structured approach to communication and documentation, we could identify common challenges, such as a misconfigured NCCL library impacting multi-node performance, share solutions across teams, and continuously refine our engagement model. The detailed tracking system provided valuable insights for future GENIAC cycles, helping us anticipate customer needs and proactively address potential bottlenecks in the FM development process.

Reference architectures

Another early takeaway was the importance of solid reference architectures. Rather than let each team configure their own cluster from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster (for a user-managed HPC cluster) and SageMaker HyperPod (for a managed, resilient cluster service). These reference architectures covered the full stack—from compute, network, and storage to container environments and monitoring—and were delivered as a GitHub repository so teams could deploy them with minimal friction.

AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training. It automates the setup of a Slurm-based HPC cluster on AWS, provisioning the environment around the open source Slurm scheduler from a simple YAML configuration. For the GENIAC program, AWS also offered SageMaker HyperPod as another option for some teams. SageMaker HyperPod is a managed service that provisions GPU and Trainium clusters for large-scale ML. HyperPod integrates with orchestrators like Slurm or Kubernetes (Amazon EKS) for scheduling, providing additional managed functionality around cluster resiliency. By including reference architectures for both AWS ParallelCluster and SageMaker HyperPod, the GENIAC program gave participants flexibility: some opted for the fine-grained control of managing their own HPC cluster, whereas others preferred the convenience and resilience of a managed SageMaker HyperPod cluster.
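
As a rough illustration of how compact such a cluster definition can be, the following Python snippet emits a minimal ParallelCluster-style YAML configuration for a Slurm queue of P5 instances with EFA and an FSx for Lustre mount. Subnet IDs, the key pair, and capacity values are placeholders, and the exact schema should be verified against the AWS ParallelCluster documentation for the version you deploy.

```python
# Sketch: emit a minimal AWS ParallelCluster 3.x-style config for a Slurm-based
# GPU cluster. Subnet IDs, key name, and counts are placeholders; verify field
# names against the ParallelCluster documentation for your version.
import yaml

config = {
    "Region": "ap-northeast-1",
    "Image": {"Os": "alinux2"},
    "HeadNode": {
        "InstanceType": "c5.4xlarge",
        "Networking": {"SubnetId": "subnet-xxxxxxxx"},   # placeholder
        "Ssh": {"KeyName": "my-keypair"},                # placeholder
    },
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [{
            "Name": "train",
            "ComputeResources": [{
                "Name": "p5",
                "InstanceType": "p5.48xlarge",
                "MinCount": 0,
                "MaxCount": 16,                          # placeholder capacity
                "Efa": {"Enabled": True},
            }],
            "Networking": {
                "SubnetIds": ["subnet-xxxxxxxx"],        # placeholder
                "PlacementGroup": {"Enabled": True},
            },
        }],
    },
    "SharedStorage": [{
        "MountDir": "/fsx",
        "Name": "fsx-lustre",
        "StorageType": "FsxLustre",
        "FsxLustreSettings": {"StorageCapacity": 1200},
    }],
}

with open("cluster-config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# The resulting file is then passed to the pcluster CLI, for example:
#   pcluster create-cluster --cluster-name geniac-train --cluster-configuration cluster-config.yaml
```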

The reference architecture (shown in the following diagram) seamlessly combines compute, networking, storage, and monitoring into an integrated system specifically designed for large-scale FM training.

Cluster Reference Architecture

The base infrastructure is available as an AWS CloudFormation template that provisions the complete stack with minimal effort. This template automatically configures a dedicated virtual private cloud (VPC) with optimized networking settings and implements a high-performance FSx for Lustre file system for training data (complemented by optional Amazon FSx for OpenZFS support for shared home directories). The architecture is completed with an S3 bucket that provides durable, long-term storage for datasets and model checkpoints, maintaining data availability well beyond individual training cycles. This reference architecture employs a hierarchical storage approach that balances performance and cost-effectiveness. It uses Amazon S3 for durable, long-term storage of training data and checkpoints, and links this bucket to the Lustre file system through a data repository association (DRA). The DRA enables automatic and transparent data transfer between Amazon S3 and FSx for Lustre, allowing high-performance access without manual copying. You can use the following CloudFormation template to create the S3 bucket used in this architecture.
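
For example, linking an existing FSx for Lustre file system to an S3 prefix through a DRA can be done with a boto3 call along these lines; the file system ID, paths, and bucket are placeholders, and the same association can equally be defined directly in the CloudFormation template.

```python
# Sketch: create a data repository association (DRA) between an existing
# FSx for Lustre file system and an S3 prefix using boto3. IDs, paths, and
# bucket names are placeholders.
import boto3

fsx = boto3.client("fsx", region_name="ap-northeast-1")

response = fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",                      # placeholder file system ID
    FileSystemPath="/training-data",                          # path exposed on the Lustre file system
    DataRepositoryPath="s3://my-training-bucket/datasets/",   # placeholder bucket/prefix
    BatchImportMetaDataOnCreate=True,                         # import existing S3 metadata up front
    S3={
        # Keep S3 and Lustre in sync automatically in both directions.
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
print(response["Association"]["AssociationId"])
```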

The optional monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana (or a self-managed Grafana service running on Amazon EC2) to provide comprehensive observability. It integrates DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. For example, the GPU Health Dashboard (see the following screenshot) provides metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi), helping users identify hardware failures as quickly as possible.
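
As a small illustration of this kind of health check, the snippet below queries a Prometheus endpoint over HTTP for the DCGM exporter’s XID error metric. The endpoint URL is a placeholder, and metric and label names should be confirmed against the DCGM exporter version in use.

```python
# Sketch: poll a Prometheus endpoint for GPU XID errors reported by the DCGM
# exporter. The endpoint URL is a placeholder; DCGM_FI_DEV_XID_ERRORS is the
# exporter's XID error metric, but confirm names for your exporter version.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "DCGM_FI_DEV_XID_ERRORS > 0"},  # GPUs currently reporting XID errors
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]
    host = labels.get("Hostname", labels.get("instance", "unknown"))
    print(f"{host} gpu={labels.get('gpu')} xid_errors={value}")
```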

xid-error-dashboard

Reproducible deployment guides and structured enablement sessions

Even the best reference architectures are only useful if teams know how to use them. A critical element of GENIAC’s success was reproducible deployment guides and structured enablement through workshops. On October 3, 2024, AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, inviting Frameworks team members from the United States to share best practices for FM training on AWS.

The enablement session welcomed over 80 participants and provided a comprehensive mix of lectures, hands-on labs, and group discussions—earning a CSAT score of 4.75, reflecting its strong impact and relevance to attendees. The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with the software components necessary to build and train large-scale FMs using AWS. The sessions highlighted practical challenges in FM development—including massive compute requirements, scalable networking, and high-throughput storage—and mapped them to appropriate AWS services and best practices. (For more information, see the slide deck from the lecture session.) Another session focused on best practices, where attendees learned to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA’s DCGM toolkit and custom Grafana dashboards based on the Frameworks team’s experience managing a cluster with 2,000 P5 instances.

Additionally, the WWSO team prepared workshops for both AWS ParallelCluster (Machine Learning on AWS ParallelCluster) and SageMaker HyperPod (Amazon SageMaker HyperPod Workshop), providing detailed deployment guides for the aforementioned reference architecture. Using these materials, participants conducted hands-on exercises deploying their training clusters using Slurm with file systems including FSx for Lustre and FSx for OpenZFS, running multi-node PyTorch distributed training. Another segment of the workshop focused on observability and performance tuning, teaching participants how to monitor resource utilization, network throughput (EFA traffic), and system health. By the end of these enablement sessions, customers and supporting AWS engineers had established a shared baseline of knowledge and a toolkit of best practices. Using the assets and knowledge gained during the workshops, customers participated in onboarding sessions—structured, hands-on meetings with their Lead SAs. These sessions differed from the earlier workshops by focusing on customer-specific cluster deployments tailored to each team’s unique use case. During each session, Lead SAs worked directly with teams to deploy training environments, validate setup using NCCL tests, and resolve technical issues in real time.

Customer feedback

“To fundamentally solve data entry challenges, we significantly improved processing accuracy and cost-efficiency by applying two-stage reasoning and autonomous learning with SLM and LLM for regular items, and visual learning with VLM using 100,000 synthetic data samples for detailed items. We also utilized Amazon EC2 P5 instances to enhance research and development efficiency. These ambitious initiatives were made possible thanks to the support of many people, including AWS. We are deeply grateful for their extensive support.”

– Takuma Inoue, Executive Officer, CTO at AI Inside

“Future chose AWS to develop large-scale language models specialized for Japanese and software development at GENIAC. When training large-scale models using multiple nodes, Future had concerns about environment settings such as inter-node communication, but AWS had a wide range of tools, such as AWS ParallelCluster, and we received strong support from AWS Solutions Architects, which enabled us to start large-scale training quickly.”

– Makoto Morishita, Chief Research Engineer at Future

Results and looking ahead

GENIAC has demonstrated that training FMs at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and a cross-functional engagement team (WWSO Frameworks team, Lead SAs, and account teams), even small teams can successfully execute massive workloads in the cloud. Thanks to this structure, 12 customers launched over 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Multiple large language models (LLMs) and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.

The technical engagement framework established through GENIAC Cycle 2 has provided crucial insights into large-scale FM development. Building on this experience, AWS is advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. We are strengthening cross-functional collaboration and systematizing knowledge sharing to establish a more efficient support structure. Reference architectures and automated training templates continue to be enhanced, and practical technical workshops and best practices are being codified based on lessons learned.

AWS has already begun preparations for the next cycle of GENIAC. As part of the onboarding process, AWS hosted a comprehensive technical event in Tokyo on April 3, 2025, to equip FM builders with hands-on experience and architectural guidance. The event, attended by over 50 participants, showcased the commitment AWS has to supporting scalable, resilient generative AI infrastructure.

geniac-event

The event highlighted the technical engagement model of AWS for GENIAC, alongside other support mechanisms, including the LLM Development Support Program and Generative AI Accelerator. The day featured an intensive workshop on SageMaker HyperPod and Slurm, where participants gained hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools. Sessions covered essential topics, including containerized ML, distributed training strategies, and AWS purpose-built silicon solutions. Classmethod Inc. shared practical SageMaker HyperPod insights, and AWS engineers demonstrated architectural patterns for large-scale GPU workloads. The event showcased AWS’s end-to-end generative AI support landscape, from infrastructure to deployment tools, setting the stage for GENIAC Cycle 3. As AWS continues to expand its support for FM development, the success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively.

Through these initiatives, AWS will continue to provide robust technical support, facilitating the smooth execution of large-scale FM training. We remain committed to contributing to the advancement of generative AI development all over the world through our technical expertise.

This post was contributed by AWS GENIAC Cycle 2 core members Masato Kobayashi, Kenta Ichiyanagi, and Satoshi Shirasawa, Accelerated Computing Specialist Mai Kiuchi, as well as Lead SAs Daisuke Miyamoto, Yoshitaka Haribara, Kei Sasaki, Soh Ohara, and Hiroshi Tokoyo with Executive Sponsorship from Toshi Yasuda. Hiroshi Hata and Tatsuya Urabe also provided support as core member and Lead SA during their time at AWS.

The authors extend their gratitude to WWSO Frameworks members Maxime Hugues, Matthew Nightingale, Aman Shanbhag, Alex Iankoulski, Anoop Saha, Yashesh Shroff, Natarajan Chennimalai Kumar, Shubha Kumbadakone, and Sundar Ranganathan for their technical contributions. Pierre-Yves Aquilanti provided in-depth support during his time at AWS.


About the authors

Keita Watanabe is a Senior Specialist Solutions Architect on the AWS WWSO Frameworks team. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. He leads GENIAC technical engagements.

Masaru Isaka is Principal Business Development on the AWS WWSO Frameworks team, specializing in machine learning and generative AI solutions. Having engaged with GENIAC since its inception, he leads go-to-market strategies for AWS’s generative AI offerings.

Read More