Build public-facing generative AI applications using Amazon Q Business for anonymous users
Amazon Q Business is a generative AI-powered assistant that answers questions, provides summaries, generates content, and securely completes tasks based on enterprise data and information. It connects to company data sources, applications, and internal systems to provide relevant, contextual answers while maintaining organizational security and compliance standards.
Today, we’re excited to announce that Amazon Q Business now supports anonymous user access. With this feature, you can create Amazon Q Business applications in anonymous user mode, where user authentication is not required and content is publicly accessible. These anonymous applications support use cases such as public website Q&A, documentation portals, and customer self-service experiences.
This capability allows guest users to use Amazon Q Business generative AI capabilities to quickly find product information, get technical answers, navigate documentation, and troubleshoot issues. Your public-facing websites, documentation, and support portals can now deliver the same powerful AI-driven assistance that authenticated users receive, creating an experience that enriches the guest user journey across your digital environments.
With this launch, you can seamlessly integrate an anonymous Amazon Q Business application into your websites and web applications through two pathways: either by embedding the ready-to-use web experience into your websites using an iframe for quick deployment, or by using our Chat, ChatSync, and PutFeedback APIs to build completely customized interfaces within your own applications. For anonymous Amazon Q Business applications, we’ve implemented a simple consumption-based pricing model where you’re charged based on the number of Chat or ChatSync API operations your anonymous Amazon Q Business applications make.
In this post, we demonstrate how to build a public-facing generative AI application using Amazon Q Business for anonymous users.
Solution overview
In this solution, we walk you through creating an anonymous Amazon Q Business application using both the AWS Management Console and AWS Command Line Interface (AWS CLI). Our example demonstrates a practical scenario: helping website visitors find information on public-facing documentation websites.
We demonstrate how to test the implementation with sample queries through the built-in web experience URL. The resulting application can be customized and embedded directly into your websites (using the API or the iframe method), providing immediate value for your users.
Prerequisites
To follow along with this post, you will need the following:
- An AWS account.
- At least one Amazon Q Business Pro user that has admin permissions to set up and configure Amazon Q Business. For pricing information, see Amazon Q Business pricing.
- AWS Identity and Access Management (IAM) permissions to create and manage IAM roles and policies.
- Public content to index (documents, FAQs, knowledge base articles) that can be shared with unauthenticated users.
- A supported data source to connect, such as an Amazon Simple Storage Service (Amazon S3) bucket containing your public documents.
- The AWS CLI configured with appropriate permissions (if following the AWS CLI method).
Create an anonymous Amazon Q Business application using the console
In this section, we walk through the steps to implement the solution using the console.
Create an IAM role for the web experience
Before creating your Amazon Q Business application, you will need to set up an IAM role with the appropriate permissions:
- On the IAM console, choose Roles in the navigation pane and choose Create role.
- Choose AWS service as the trusted entity.
- Select Amazon Q Business from the service list.
- Choose Next: Permissions.
- Create a custom policy or attach the necessary read-only policies, and add permissions for anonymous access.
We strongly recommend using a restricted policy, like the one shown in the following screenshot, for the role that will be used to create the web experience for anonymous access application environments.
In such a restricted policy, the resource for calling the Chat API in anonymous access application environments is scoped to your application ARN, for example arn:aws:qbusiness:<your-region>:<your-aws-account-id>:application/<your-application-id>.
- Create an IAM role with a trust policy that allows the Amazon Q Business service principal to assume the role using AWS Security Token Service (AWS STS), specifically scoped to your application’s Amazon Resource Name (ARN) in the designated AWS Region.
Create an Amazon Q Business application
Now you’re ready to create your Amazon Q Business application:
- On the Amazon Q Business console, choose Create application.
- For Application name, enter a name (for example, SupportDocs-Assistant).
- For User access, select Anonymous access for this application environment.
- Select Web experience to create a managed web experience to access the Amazon Q Business application.
You will see a notice about consumption-based billing for anonymous Amazon Q Business applications. For more details on pricing, refer to Amazon Q Business pricing.
- Leave the default service role option unless you have specific requirements.
- For Encryption, use the default AWS managed key unless you need custom encryption.
- For Web experience settings, you can use an existing IAM role from your account or authorize Amazon Q Business to generate a new role with appropriate permissions. For this post, we select Use an existing service role and choose the IAM role created earlier (QBusinessAnonymousWebRole).
- Optionally, customize the web experience title and welcome message.
- Review all your configuration options and choose Create to create the application.
You should see a confirmation that your anonymous access application has been created successfully.
The landing page displayed after successful creation, shown in the following screenshot, provides the parameters and comprehensive details of your newly created Amazon Q Business application.
Add data sources
After you create your application, you need to add an index and data sources. To learn more, refer to Index. You will see a pop-up like the following indicating that anonymous access is enabled.
Complete the following steps:
- From your application dashboard, choose Add index.
- Name your index (for example, Supportdocs-External) and keep the default settings.
- Choose Add an index.
- After you create the index, you can add data sources to it.
For our example, we use the Amazon Q Business public documentation as our data source by adding the URL https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html. The Web Crawler will automatically index the content from this documentation page, making it searchable through your anonymous Amazon Q Business application.
For more information about Web Crawler configuration options and best practices, refer to Connecting Web Crawler to Amazon Q Business.
- From your index dashboard, choose Add data source.
- Enter a name for your data source and optional description.
- For Source, select Source URLs and enter the URLs of the public websites you want to index.
- For Authentication, select No authentication.
- Configure the sync run schedule and field mappings.
- Choose Add data source.
Alternatively, you can add Amazon S3 as the data source:
- From your index dashboard, choose Add data source.
- Select Amazon S3 as the source.
- Configure your S3 bucket settings (make sure the bucket has public access).
- Complete the data source creation process.
You must only ingest publicly available data sources without access control lists (ACLs).
Generate an anonymous web experience URL
After your data sources are set up, complete the following steps:
- From your application dashboard, choose your application.
- In the Web experience settings section, choose Share one-time URL.
The anonymous web experience URL can be shared as a single-use link that must be redeemed and accessed within 5 minutes. After it’s activated, the Amazon Q Business session remains active with a configurable timeout ranging from 15–60 minutes. This enables you to experience the web interface and test its functionality before deploying or offering the anonymous application to guest users.
Test your anonymous Amazon Q Business application
To test the application, choose Preview web experience.
The following screenshot shows the welcome page for your anonymous Amazon Q Business application’s web interface. Let’s begin asking Amazon Q Business some questions about the Amazon Q index.
In the first query, we ask “What is Q index? How is it useful for ISVs?” The following screenshot shows the response.
In the following query, we ask “How can Q index enrich generative AI experiences for ISVs?”
In our next query, we ask “How is Q index priced?”
Having successfully tested our anonymous Amazon Q Business application through the console, we will now explore how to create an equivalent application using the AWS CLI.
Create your anonymous application using the AWS CLI
Make sure that your AWS CLI is configured with permissions to create Amazon Q Business resources and IAM roles.
Create an IAM role for Amazon Q Business
First, create an IAM role that Amazon Q Business can assume to access necessary resources:
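The exact statements will vary by account, but the following is a minimal boto3 sketch of this step (the equivalent AWS CLI commands accept the same documents as JSON parameters). The account ID, Region, and application ID are placeholders, and the trust policy follows the standard Amazon Q Business pattern of allowing sts:AssumeRole only from your account and application ARN:

```python
import json
import boto3

# Placeholders: substitute your account ID, Region, and (once created) application ID.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
APPLICATION_ARN = f"arn:aws:qbusiness:{REGION}:{ACCOUNT_ID}:application/<your-application-id>"

# Trust policy: only the Amazon Q Business service principal in your account,
# scoped to your application ARN, may assume this role via AWS STS.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "qbusiness.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": ACCOUNT_ID},
                "ArnEquals": {"aws:SourceArn": APPLICATION_ARN},
            },
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="QBusinessAnonymousWebRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Assumed by Amazon Q Business for the anonymous web experience",
)
print(role["Role"]["Arn"])
# If you create the role before the application, update the aws:SourceArn
# condition after you have the application ID.
```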
Create an anonymous Amazon Q Business application
Use the following code to create your application:
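Here is a minimal boto3 sketch of that call (the aws qbusiness create-application CLI command takes the same parameters as flags). The identityType value shown is our assumption of how anonymous mode is selected through the API, so verify it against the current API reference; the display name and Region are placeholders:

```python
import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

response = qbusiness.create_application(
    displayName="SupportDocs-Assistant",
    # Assumption: anonymous access mode is selected with this identity type.
    identityType="ANONYMOUS",
    description="Public-facing assistant for product documentation",
)

application_id = response["applicationId"]
print(f"applicationId: {application_id}")
```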
Save the applicationId from the response; you will need it to scope the IAM policy and to create the index and web experience.
Create a restrictive policy for anonymous access
We strongly recommend using the following restricted policy for the role that will be used to call the chat APIs for anonymous access application environments. This policy limits actions to only the necessary APIs and restricts access to only your specific application.
Create the IAM role with the following policy:
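The following boto3 sketch attaches such a policy inline to the role created earlier. It allows only the Chat, ChatSync, and PutFeedback actions and scopes them to your application ARN; the account ID, Region, application ID, and policy name are placeholders:

```python
import json
import boto3

# Placeholders: reuse the applicationId saved from the previous step.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
APPLICATION_ARN = f"arn:aws:qbusiness:{REGION}:{ACCOUNT_ID}:application/<your-application-id>"

# Allow only the chat APIs that support anonymous access, on this application only.
restricted_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "qbusiness:Chat",
                "qbusiness:ChatSync",
                "qbusiness:PutFeedback",
            ],
            "Resource": APPLICATION_ARN,
        }
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="QBusinessAnonymousWebRole",
    PolicyName="QBusinessAnonymousChatAccess",
    PolicyDocument=json.dumps(restricted_policy),
)
```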
Create an index
Create an index for your content, then upload documents using the BatchPutDocument API. For step-by-step guidance, see Select Retriever.
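As a rough sketch of those two steps with boto3, you can create the index and then ingest content after the index becomes active. The document below is a placeholder; in practice you would upload your own public content (for example, the Amazon Q Business documentation in PDF format used later in this post):

```python
import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")
application_id = "<your-application-id>"  # saved from create_application

# Create the index that will hold your publicly available content.
index = qbusiness.create_index(
    applicationId=application_id,
    displayName="Supportdocs-External",
)
index_id = index["indexId"]

# After the index becomes active, ingest public documents (no ACLs) with BatchPutDocument.
qbusiness.batch_put_document(
    applicationId=application_id,
    indexId=index_id,
    documents=[
        {
            "id": "qbusiness-overview",
            "title": "What is Amazon Q Business?",
            "contentType": "PLAIN_TEXT",
            "content": {"blob": b"Amazon Q Business is a generative AI-powered assistant ..."},
        }
    ],
)
```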
Test your anonymous Amazon Q Business application
To demonstrate the chat functionality using the AWS CLI, we uploaded Amazon Q Business documentation in PDF format to our index and tested the application using the following sample queries.
The following is an example chat interaction using the IAM role credentials. We first ask “What is Amazon Q index?”
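A minimal sketch of that interaction with boto3 follows (the aws qbusiness chat-sync CLI command takes the same parameters). Because the application uses anonymous access, no user ID is passed; the application ID is a placeholder, and the call assumes the restricted IAM role credentials are configured in your environment:

```python
import boto3

# Assumes credentials for the restricted role are available in the environment.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

response = qbusiness.chat_sync(
    applicationId="<your-application-id>",
    userMessage="What is Amazon Q index?",  # no userId for anonymous applications
)

print(response["systemMessage"])
for source in response.get("sourceAttributions", []):
    print("-", source.get("title"), source.get("url"))
```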
The following screenshot shows part of the output from the chat-sync API when executed with our anonymous Amazon Q Business application ID, as in the preceding call.
Next, we ask “How can Q index enrich generative AI experiences for ISVs?”
The following screenshot shows part of the output from the chat-sync API when executed with our anonymous Amazon Q Business application ID.
Create a web experience for the anonymous web application
Use the following code to create the web experience:
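The following boto3 sketch shows one way this call can look (the aws qbusiness create-web-experience CLI command takes the same parameters as flags); the title, welcome message, application ID, and role ARN are placeholders:

```python
import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

web_experience = qbusiness.create_web_experience(
    applicationId="<your-application-id>",
    title="Support Documentation Assistant",
    welcomeMessage="Ask me anything about our product documentation.",
    roleArn="arn:aws:iam::111122223333:role/QBusinessAnonymousWebRole",
)
web_experience_id = web_experience["webExperienceId"]
print(f"webExperienceId: {web_experience_id}")
```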
To generate an anonymous URL, use the following code:
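The operation below is our reading of how the one-time URL is generated through the API (create_anonymous_web_experience_url in boto3, mirroring the Share one-time URL console action), so confirm the operation name and parameters against the current API reference; the IDs are placeholders:

```python
import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Assumption: this operation backs the Share one-time URL console action and
# returns a single-use URL that must be redeemed within a few minutes.
url_response = qbusiness.create_anonymous_web_experience_url(
    applicationId="<your-application-id>",
    webExperienceId="<your-web-experience-id>",
)
print(url_response["anonymousUrl"])
```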
You can use the web experience URL generated by the preceding command and embed it into your web applications using an iframe.
Considerations
Consider the following when using anonymous access in Amazon Q Business:
- The following are the only chat APIs that support anonymous access application environments:
- Chat
- ChatSync
- PutFeedback
- You should only ingest publicly available data sources without ACLs. Examples of public data sources include:
- Data from the Amazon Q Business Web Crawler
- Amazon S3 data without ACLs
- Amazon Q Business applications with anonymous access are billed on a consumption-based pricing model.
- Chat history is not available for anonymous application environments.
- Anonymous users and authenticated users are not supported on the same application environments.
- Plugins are not supported for anonymous application environments.
- Amazon QuickSight integration is not supported for anonymous application environments.
- Amazon Q Apps are not supported for anonymous application environments.
- Attachments are not supported for anonymous application environments.
- Admin controls and guardrails are read-only for anonymous application environments, except for blocked words.
- Topic rules using users and groups are not supported for anonymous application environments.
All other Amazon Q Business functionality and features remain unchanged.
Clean up
When you are done with the solution, clean up the resources you created: delete the Amazon Q Business application (along with its index, data sources, and web experience) and remove the IAM role and policies so you don’t incur unnecessary charges.
Conclusion
In this post, we introduced Amazon Q Business anonymous user access mode and demonstrated how to create, configure, and test an anonymous Amazon Q Business application using both the console and AWS CLI. This feature extends enterprise-grade Amazon Q Business generative AI capabilities to anonymous audiences without requiring authentication, opening up new possibilities for enhancing customer experiences on public websites, documentation portals, and self-service knowledge bases. It is available through a consumption-based pricing model that charges based on actual Chat and ChatSync API usage, with index storage costs still applicable.
By following the implementation steps outlined in this post, you can quickly set up an Amazon Q Business application tailored for your external users, secured with appropriate IAM policies, and ready to embed in your end-user-facing applications.
To learn more about this anonymous access feature, see the Amazon Q Business User Guide. For detailed guidance on embedding Amazon Q Business in your web applications, see Add a generative AI experience to your website or web application with Amazon Q embedded. If you’re interested in building completely custom UI experiences with the Amazon Q Business API, check out Customizing an Amazon Q Business web experience.
About the authors
Vishnu Elangovan is a Worldwide Generative AI Solution Architect with over seven years of experience in Applied AI/ML. He holds a master’s degree in Data Science and specializes in building scalable artificial intelligence solutions. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Outside his professional pursuits, he enjoys traveling, participating in sports, and exploring new problems to solve.
Jean-Pierre Dodel is a Principal Product Manager for Amazon Q Business, responsible for delivering key strategic product capabilities including structured data support in Q Business, RAG, and overall product accuracy optimizations. He brings extensive AI/ML and enterprise search experience to the team, with over 7 years of product leadership at AWS.
FloQast builds an AI-powered accounting transformation solution with Anthropic’s Claude 3 on Amazon Bedrock
With the advent of generative AI solutions, a paradigm shift is underway across industries, driven by organizations embracing foundation models (FMs) to unlock unprecedented opportunities. Amazon Bedrock has emerged as the preferred choice for numerous customers seeking to innovate and launch generative AI applications, leading to an exponential surge in demand for model inference capabilities. Amazon Bedrock customers aim to scale their worldwide applications to accommodate a variety of use cases. One such customer is FloQast.
Since its founding in 2013, FloQast has had the privilege of working with over 2,800 organizations across various industries and regions, helping them streamline their accounting operations. From automated reconciliations to tools that manage the entire close process, FloQast has seen firsthand how organizations, big and small, struggle to keep pace with their accounting needs as they scale. FloQast’s software (created by accountants, for accountants) brings AI and automation innovation into everyday accounting workflows. You can reconcile bank statements against internal ledgers, get real-time visibility into financial operations, and much more.
In this post, we share how FloQast built an AI-powered accounting transformation solution using Anthropic’s Claude 3 on Amazon Bedrock.
Accounting operations: Complexity amplified at scale
At the heart of every successful organization—whether small startups or large corporations—lies a well-oiled financial and accounting operation. Accounting is more than just a back-office function; it’s the backbone of every business. From processing payroll to generating financial statements, accounting is a ubiquitous force that touches every facet of business operations.
Consider this: when you sign in to a software system, a log is recorded to make sure there’s an accurate record of activity—essential for accountability and security. Similarly, when an incident occurs in IT, the responding team must provide a precise, documented history for future reference and troubleshooting. The same principle applies to accounting: when a financial event takes place, whether it’s receiving a bill from a vendor or signing a contract with a customer, it must be logged. These logs, known in accounting as journal entries, provide a clear financial record.
Now imagine this process scaled across hundreds, or even thousands, of transactions happening simultaneously in a large organization. The complexity of accounting increases exponentially with growth and diversification. As businesses expand, they encounter a vast array of transactions that require meticulous documentation, categorization, and reconciliation. At scale, upholding the accuracy of each financial event and maintaining compliance becomes a monumental challenge. With advancement in AI technology, the time is right to address such complexities with large language models (LLMs).
Amazon Bedrock has helped democratize access to LLMs, which have been challenging to host and manage. Amazon Bedrock offers a choice of industry-leading FMs along with a broad set of capabilities to build generative AI applications, simplifying development with security, privacy, and responsible AI. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure to securely integrate and deploy generative AI capabilities into your application, handle spiky traffic patterns, and enable new features like cross-Region inference, which helps provide scalability and reliability across AWS Regions.
In this post, we highlight how the AI-powered accounting transformation platform uses Amazon Bedrock. FloQast addresses the most complex and custom aspects of financial processes (the final 20%)—those intricate, bespoke aspects of accounting that are highly specific to each organization and often require manual intervention. FloQast’s AI-powered solution uses advanced machine learning (ML) and natural language commands, enabling accounting teams to automate reconciliation with high accuracy and minimal technical setup.
FloQast AI Transaction Matching
Seamlessly integrated with the existing FloQast suite, the AI Transaction Matching product streamlines and automates your matching and reconciliation processes, delivering unparalleled precision and efficiency.
It offers the following key features:
- AI-driven matching – You can automatically match transactions across multiple data sources with high accuracy
- Flexible rule creation – You can use natural language to create custom matching rules tailored to your unique processes
- Exception handling – You can quickly identify and manage unmatched transactions or discrepancies
- Audit trail – You can maintain a comprehensive audit trail of matching activities for compliance and transparency
- High-volume processing – You can efficiently handle large volumes of transactions, suitable for businesses of all sizes
- Multi-source integration – You can seamlessly integrate and match transactions from various financial systems and data sources
Let’s review how it works:
- Transaction data is gathered from bank statements and enterprise resource planning (ERP) systems.
- An accountant will select specific transactions in both systems and choose Generate AI Rule.
The following screenshot shows the general ledger system on the left and the bank statement on the right.
- Based on the selected transactions, text is generated (see the following screenshot).
- At this point, the accountant has the option to either accept the generated text or edit the text.
- The accountant chooses Save and apply to generate a rule in coded format that is further used to find additional matches, helping the accountant automate transaction reconciliation.
FloQast AI Transaction Matching offers the following benefits:
- Unified environment – It seamlessly integrates with your existing FloQast products for a single source of truth
- AI-powered automation – It uses advanced ML to handle complex matching scenarios
- User-friendly interface – It’s designed by accountants for how accountants work, providing ease of use and adoption
- Real-time insights – You can gain immediate visibility into your transaction data across systems
- Scalability – It can adapt as your transaction volumes grow and business evolves
FloQast AI Annotations
FloQast’s new AI Annotations feature empowers teams to seamlessly and automatically annotate and review sample documents, streamlining compliance and audit processes through advanced automation and ML.
It offers the following key features:
- Automated document annotation – You can upload sample documents to automatically annotate key data points with attributes specified in your testing criteria, saving time on manual reviews
- AI-powered analysis – You can use advanced AI and natural language models to analyze document text, highlighting relevant information according to predefined controls and testing attributes
- Bulk annotation for efficiency – You can select multiple documents or testing controls for bulk annotation, reducing time spent on repetitive document processing
- Structured storage and audit trail – You can maintain a structured record of each annotated document, capturing all extracted data, annotation responses, and status updates for streamlined compliance and audit trails
- Intuitive error handling – Smart checks identify and notify users of processing errors, making sure each annotation is complete and accurate.
The following diagram illustrates the architecture using AWS services.
The workflow starts with user authentication and authorization (steps 1-3). After those steps are complete, the workflow consists of the following steps:
- Users upload supporting documents that provide audit evidence into a secure Amazon Simple Storage Service (Amazon S3) bucket.
- The input documents are encrypted by Amazon S3 when consumed by Amazon Textract.
- Amazon Textract, which encrypts data in transit and at rest, extracts the data from the documents.
- When extraction is complete, the raw data is stored in an encrypted S3 bucket.
- A data sanitization workflow kicks off, orchestrated by AWS Step Functions and composed of AWS Lambda functions.
- Sanitized extracted data is written to an encrypted MongoDB database.
- Amazon Textract is polled for job status, which is then written to MongoDB.
- The user starts the annotation process.
- Application logic reads the data from MongoDB and provides it to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock (see the sketch after these steps).
- The LLM runs the audit rules (shown in the following screenshot) against the extracted data and generates an annotation for each audit rule, including pass/fail details of the audit rule.
- Annotation results are filtered using Amazon Bedrock Guardrails to enhance content safety and privacy in generative AI applications.
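FloQast’s actual prompts, audit rules, and guardrail configuration are not public, so the following is only a hedged boto3 sketch of what the final steps in this list can look like: extracted fields and an audit rule are sent to Anthropic’s Claude 3.5 Sonnet through the Amazon Bedrock Converse API, with a guardrail applied to the response. The field values, prompt wording, and guardrail ID are placeholders:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder audit rule and extracted fields standing in for Textract output stored in MongoDB.
audit_rule = "The invoice total must match the amount approved on the purchase order."
extracted_fields = {"invoice_total": "1,250.00", "po_approved_amount": "1,250.00"}

prompt = (
    "You are reviewing audit evidence. Given the extracted fields below, state whether "
    "the audit rule passes or fails and explain why.\n"
    f"Audit rule: {audit_rule}\n"
    f"Extracted fields: {json.dumps(extracted_fields)}"
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    # Amazon Bedrock Guardrails filters the annotation output; the IDs are placeholders.
    guardrailConfig={"guardrailIdentifier": "<your-guardrail-id>", "guardrailVersion": "1"},
)

print(response["output"]["message"]["content"][0]["text"])
```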
FloQast AI Annotations offers the following benefits:
- Seamless integration with FloQast – This feature is integrated into the FloQast platform, providing access to annotation tools alongside your existing compliance and financial workflows
- Enhanced efficiency with AI-driven workflows – FloQast’s annotation feature uses AI to reduce manual workload, helping teams focus on high-value tasks rather than repetitive document review
- Scalable solution for high-volume document processing – Designed to handle substantial document volumes, FloQast AI Annotations adapts to the demands of growing teams and complex audit requirements
- Real-time document processing insights – You can stay informed with live tracking of each annotation job, with built-in monitoring for smooth and efficient workflows
FloQast’s AI technology choices
FloQast selected Amazon Bedrock because of its unmatched versatility, feature sets, and the robust suite of scalable AI models from top-tier providers like Anthropic. Anthropic’s Claude 3.5 Sonnet provides the advanced reasoning and contextual understanding necessary for handling complex financial workflows. However, a key feature of Amazon Bedrock—Amazon Bedrock Agents—is a game changer for FloQast. Amazon Bedrock Agents enables generative AI applications to run multi-step tasks across company systems and data sources. To learn more, see How Amazon Bedrock Agents works.
Amazon Bedrock Agents provides an intelligent orchestration layer, allowing FloQast to automate accounting workflows efficiently. It has added significant value in the following areas:
- Instruction handling and task automation – Amazon Bedrock Agents enables FloQast to submit natural language instructions that the AI interprets and executes autonomously.
- Session and memory management – sessionAttributes and promptSessionAttributes are passed between sessions related to a single workflow, but most user requests can be singular to a session (see the call sketch after this list).
- Code generation that demonstrates business understanding – Amazon Bedrock Agents offers valuable features through its secure code interpretation capabilities and flexible configuration options. Amazon Bedrock agents can be tailored to the correct persona and business context, while operating within a protected test environment. This allows accountants to submit natural language instructions and input data, which is then processed in a controlled manner that aligns with security best practices. When FloQast integrates with Amazon Bedrock Agents, accountants can submit custom requests, and the agent can generate and test code within an isolated secure environment, with appropriate technical oversight and guardrails in place. The combination of Amazon Bedrock Agents’ secure code interpretation features and FloQast’s deep knowledge of accounting practices enables financial teams to operate efficiently while maintaining proper controls.
- Data integration and output handling – By using Amazon Bedrock Agents, information is passed from upstream integrated financial systems, allowing FloQast to automate data retrieval and transformation tasks.
- Multi-step task orchestration – Amazon Bedrock agents are designed to handle multi-step tasks by orchestrating complex workflows. For example, after FloQast retrieves data from a financial system, that data is passed to the agent, which runs the necessary calculations, generates the output code, and presents the results for user approval—all in one automated process. This orchestration is especially useful in accounting, where multiple steps must be completed in the correct sequence to maintain compliance and accuracy.
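To make the orchestration concrete, here is a hedged boto3 sketch of calling an Amazon Bedrock agent through the bedrock-agent-runtime InvokeAgent API, passing sessionAttributes and promptSessionAttributes and reading the streamed response. The agent IDs, attribute values, and instruction text are placeholders, not FloQast’s actual configuration:

```python
import uuid
import boto3

agents = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholder agent and attribute values; FloQast's actual agent configuration is not public.
response = agents.invoke_agent(
    agentId="<your-agent-id>",
    agentAliasId="<your-agent-alias-id>",
    sessionId=str(uuid.uuid4()),
    inputText=(
        "Match bank transactions to general ledger entries where the amounts agree "
        "and the dates are within two days of each other."
    ),
    # sessionAttributes and promptSessionAttributes carry workflow context between turns.
    sessionState={
        "sessionAttributes": {"workflowId": "reconciliation-2024-10"},
        "promptSessionAttributes": {"entity": "US subsidiary"},
    },
)

# The agent streams its answer back as chunks in an event stream.
answer = b"".join(
    event["chunk"]["bytes"] for event in response["completion"] if "chunk" in event
)
print(answer.decode("utf-8"))
```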
The flexibility of Amazon Bedrock Agents to manage these tasks and integrate them seamlessly into existing workflows enables FloQast to achieve scale, reduce complexity, and implement automation required to cater to the evolving needs of FloQast’s customers.
In FloQast’s evaluation of models for this use case, Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock provided the best results. As a model consumer, FloQast doesn’t need to fine-tune the model, so they use Retrieval Augmented Generation (RAG) with few-shot classification on data collected on the user’s behalf, removing the overhead of fine-tuning an LLM. For this use case, this design produces a higher level of accuracy, a security model that FloQast’s customers understand, and ease of use for developers.
Conclusion
FloQast’s AI-powered accounting transformation solution has had a substantial impact on its users. By automating routine, time-consuming accounting processes, the solution has saved accounting teams countless hours, enabling them to shift away from manual spreadsheet work and focus on higher-value activities, such as reviewing financial outcomes, assessing business health, and making data-driven decisions. This solution has removed the tedium of data reconciliation, delivering measurable improvements, including a 38% reduction in reconciliation time, a 23% decrease in audit process duration and discrepancies, and a 44% improvement in workload management.
Learn more about the FloQast platform at FloQast.com. Contact evelyn.cantu@floqast.com for more information about the FloQast and AWS partnership.
About the authors
Kartik Bhatnagar is a data security-focused Solutions Architect at AWS, based in San Francisco, CA. He has experience working with startups and enterprises across the tech, fintech, healthcare, and media & entertainment industries, in roles including DevOps Engineer and Systems Architect. In his current role, he partners with AWS customers to design and implement scalable, secure, and cost-effective solutions on the AWS platform. Outside of work, he enjoys playing cricket and tennis, food hopping, and hiking.
Aidan Anderson is a dynamic technology leader with over a decade of experience in software engineering, security, and artificial intelligence. Currently serving as the Director of AI Engineering at FloQast, he is at the forefront of integrating AI and automation into accounting workflows, enhancing operational efficiency and accuracy for finance teams. Aidan’s portfolio spans leadership across security, product development, and platform engineering – where he’s consistently driven innovation, built high-performing teams, and delivered impactful solutions in fast-paced startup environments.
Insights in implementing production-ready solutions with generative AI
As generative AI revolutionizes industries, organizations are eager to harness its potential. However, the journey from production-ready solutions to full-scale implementation can present distinct operational and technical considerations. This post explores key insights and lessons learned from AWS customers in Europe, Middle East, and Africa (EMEA) who have successfully navigated this transition, providing a roadmap for others looking to follow suit.
Building a solid business case: Operational excellence drives customer experience
Successful generative AI implementations are founded on business cases with clear value propositions that align with organizational goals, such as improving efficiency, reducing costs, or growing revenue. Typical examples include enhancing customer experience, optimizing operations, maintaining compliance with legal standards, improving the level of service, or increasing employee productivity.
Companies in EMEA have used AWS services to transform their operations and improve customer experience using generative AI, with their stories illustrating how a strong business case can lead to tangible results across various industry verticals.
Il Sole 24 Ore, Italy’s leading multimedia publishing group, partnered with AWS Professional Services to boost the efficiency of a historic service, L’Esperto Risponde, where users can ask fiscal questions and receive responses from a team of experts. Il Sole 24 Ore leveraged its vast internal knowledge with a Retrieval Augmented Generation (RAG) solution powered by AWS. This solution maintained over 90% accuracy in responses and reduced the time spent by experts in searching and processing information, empowering them to focus on more strategic tasks. Additionally, the company is continuously incorporating end-user feedback to keep the service tailored to customer needs. For more information, you can watch the AWS Summit Milan 2024 presentation.
Booking.com, one of the world’s leading digital travel services, is using AWS to power emerging generative AI technology at scale, creating personalized customer experiences while achieving greater scalability and efficiency in its operations. Booking.com uses Amazon SageMaker AI to provide highly personalized customer accommodation recommendations.
“One of the things we really like about AWS’s approach to generative AI is choice. We love open source, and we feel it will play an important role in the evolution of generative AI,”
– Rob Francis, Chief Technology Officer of Booking.com.
With AWS support, Booking.com is enhancing its generative AI capabilities and positioning itself for future growth in the travel and hospitality industry. For more details, you can watch Booking.com’s keynote at AWS re:Invent 2023, their presentation on generative AI from idea to production on AWS at AWS London Summit 2024, and read the case study on how Booking.com helps customers experience a new world of travel using AWS and generative AI.
ENGIE is a global power and utilities company, with 25 business units operating worldwide. ENGIE’s One Data team partnered with AWS Professional Services to develop an AI-powered chatbot that enables natural language conversational search within ENGIE’s Common Data Hub data lake, which holds over 3 petabytes of data. The solution complements traditional keyword-based search by allowing users to discover datasets through simple conversational queries, making it easier to find relevant data among tens of thousands of datasets. This dual approach to data discovery has accelerated the development of data-driven products and enhanced data asset sharing across the organization.
These examples demonstrate how companies across various sectors have successfully used AWS generative AI capabilities to address specific business challenges.
Getting ahead of implementation challenges
Though essential, a solid business case is only the first step. As organizations move their generative AI initiatives forward, they often encounter new challenges related to making the solution scalable, reliable, and compliant. Let’s explore what it takes to successfully advance generative AI projects from the preproduction phase, making sure that the original value of the business case is then fully realized in real-world application.
Achieving scale, reliability, and compliance
Factors to consider in transitioning to full-scale production include scalability, data governance, privacy, consistent and responsible AI behaviors, security, integration with existing systems, monitoring, end-user feedback collection, and business impact measurement. As organizations in EMEA have discovered, success in this transition requires a holistic approach that goes beyond mere technological considerations. With a multitude of customer learnings, paired with AWS expertise, we can identify key strategies for implementation.
Production-ready infrastructure, applications, and processes in the cloud
With the increase in scope, number, and complexity of generative AI applications, organizations have an increased need to reduce undifferentiated effort and set a high-quality bar for production-ready applications. Standard development best practices and effective cloud operating models, like AWS Well-Architected and the AWS Cloud Adoption Framework for Artificial Intelligence, Machine Learning, and Generative AI, are key to enabling teams to spend most of their time on tasks with high business value, rather than on recurrent, manual operations. Such an approach should include established industry standards such as infrastructure as code (IaC), continuous integration and continuous delivery (CI/CD), monitoring and observability, logging and auditing, and solutions for scalability and high availability.
For instance, Iveco Group, a global automotive leader active in the Commercial and Specialty Vehicles and Powertrain businesses, adopted a structured cloud operating model, leveraging IaC via Terraform for consistent and repeatable deployments across environments. A DevOps environment with CI/CD pipelines allows for frequent updates and testing of generative AI models and applications, letting developers focus on improving and expanding the solutions rather than spending time on manual operations. This also helps make sure that generative AI solutions are optimized for performance, security, and cost-efficiency. This integrated approach not only accelerates the path from pre-production to full-scale implementation, but also enables them to adapt quickly to new generative AI advancements, manage complex dependencies, and scale resources as needed, ultimately driving innovation and competitive advantage in the rapidly evolving field of generative AI. See the re:Invent 2024 session for more information.
Accor Group, a major hospitality company that developed a generative AI-powered booking application, showcased how, even when working with new technologies like generative AI, fundamental software development principles remain a must-have. They implemented a three-layered comprehensive testing strategy. First, unit tests verify that the prompts consistently generate acceptable responses from the chatbot, even upon prompt modifications. Second, integration tests verify the end-to-end flow of the REST API and the chatbot’s interaction with the large language model (LLM). The final step is functional testing with predefined scenarios for manual testing and validation. They also implemented feedback systems, essential for the improvement flywheel of customer-facing applications, in the form of in-app surveys, instant feedback options (thumbs-up or thumbs-down), and a dedicated feedback portal for detailed user input. Finally, to measure the effectiveness of the solution and its business impact, they established a system to track room bookings made through the generative AI application.
Danske Bank, a leading Nordic bank, transitioned from a container-based on-premises setup to Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. This allowed them to quickly move their API-based backend services to a cloud-native environment. This decoupled architecture, designed to be provider-agnostic, set them up for flexibility in leveraging different cloud-based generative AI tools and services as needed. The integration with Amazon Bedrock was seamless and impactful, as it provided faster access to multiple foundational models from more providers. This allowed the customer to rapidly experiment, iterate, and evaluate different models for their specific use cases. This case demonstrates how the combination of generative AI services and a cloud-native, API-driven architecture allowed this customer to iterate faster, and keep the focus on business value rather than integration of technologies.
The Schaeffler Group has been driving forward groundbreaking inventions and developments in the field of motion technology for over 75 years. The company developed a comprehensive generative AI framework, which establishes enterprise-grade governance and security guardrails for rolling out generative AI use cases at scale with infrastructure blueprints. A generative AI inference gateway is integrated within the solution, offering centralized access to numerous foundation models while tracking usage and costs. Going forward, Schaeffler envisions further integrating these capabilities into its wider generative AI and data landscape, including more fine-grained access controls to data assets and the adoption of generative AI agents.
These examples highlight a key theme for organizations across industries: Success in generative AI goes beyond developing standalone applications. A thorough cloud-based operating model is crucial for enterprises looking to keep pace with the rapidly evolving technology, with minimal operational overhead.
Security, compliance, and responsible AI
As an organization’s generative AI applications expand to handle increasingly sensitive data, security, compliance, and governance must be prioritized accordingly. This includes implementing authentication and access control, encrypting data at rest and in transit, monitoring and auditing of system access and usage, maintaining compliance with regulations (such as GDPR and the recent EU AI Act), as well as establishing clear policies for data handling and model usage.
Here are some examples of customers who have successfully navigated these critical requirements.
Il Sole 24 Ore implemented a code of self-discipline for ethical AI application. It prescribes the retention of high-quality standards and the centrality of trustworthy data. The principles include regulatory compliance, maintaining data provenance and reliability, incorporating human oversight via human-in-the-loop, inclusivity and diversity in data usage and algorithm adoption, responsibility and accountability, and digital education and communicative transparency. By adhering to these principles, the Il Sole 24 Ore Group demonstrates its commitment to leveraging innovative technologies like generative AI in a safe and responsible manner, particularly in sensitive areas such as providing expert legal and tax advice. This approach allows them to harness the benefits of AI while mitigating potential risks and maintaining the trust of their users.
For Accor Group, the implementation of their next-generation booking application required direct customer interaction, emphasizing the critical need for responsible AI practices. To make sure the chatbot would deliver effective customer service while operating within strict ethical boundaries, they established specific safeguards to minimize misuse:
- Blocking responses to discriminatory queries
- Withholding responses to illegal activities
- Implementing guardrails to keep conversations within appropriate business context
- Installing protections against role-switching or tone-changing attempts during conversations
- Implementing robust technical defenses against prompt injections
Conclusion
The transition from preproduction to full-scale implementation for generative AI applications presents new challenges and opportunities. It requires identifying a solid business case, maintaining high standards for infrastructure and processes, strategic thinking in choosing an efficient cloud operating model, robust data governance, security, compliance, ethical AI practices, and more.
Organizations across EMEA have demonstrated how using AWS services can help overcome hurdles and accelerate the advantages of generative AI by embracing a holistic approach. By learning from these use cases, more enterprises can achieve successful deployments of generative AI solutions, and benefit from this transformative technology in a reliable, productive, and responsible manner.
Explore more generative AI use cases and customer success stories, and discover how to accelerate your AI adoption on the cloud with specialized training and the support of AWS Professional Services and the Generative AI Innovation Center.
About the Authors
Dr. Giorgio Pessot is a Machine Learning Engineer at Amazon Web Services Professional Services. With a background in computational physics, he specializes in architecting enterprise-grade AI systems at the confluence of mathematical theory, DevOps, and cloud technologies, where technology and organizational processes converge to achieve business objectives. When he’s not whipping up cloud solutions, you’ll find Giorgio engineering culinary creations in his kitchen.
Daniel Zagyva is a Senior ML Engineer at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI and machine learning operations.
Nicolò Cosimo Albanese is a Data Scientist and Machine Learning Engineer at Amazon Web Services Professional Services. With a Master of Science in Engineering and postgraduate degrees in Machine Learning and Biostatistics, he specializes in developing AI/ML solutions that drive business value for enterprise customers. His expertise lies at the intersection of statistical modeling, cloud technologies, and scalable machine learning systems.
Subhro Bose is a Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves working on ways for emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey.
Diar Sabri is a Machine Learning Engineer at AWS who helps organizations transform their business through innovative AI solutions. With experience across multiple industries, he excels at bridging the gap between strategic vision and practical technology implementation, enabling customers to achieve meaningful business outcomes.
Aamna Najmi is a GenAI and Data Specialist at AWS. She assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations, bringing a unique perspective of modern data strategies to complement the field of AI. In her spare time, she pursues her passion of experimenting with food and discovering new places.
Anwar Rizal is a Senior Machine Learning consultant for AWS Professional Services based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.
Amer Elhabbash is a Senior Data & AI Delivery Consultant with AWS Professional Services. He has over 25 years of international experience in IT spanning multiple fields and domains: telecommunications, software engineering, databases, data analytics, and AI. He helps AWS customers migrate their legacy data systems and build innovative cloud-native, data-driven solutions.
Hassen Riahi is a Delivery Practice Manager Data & AI at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He collaborates with AWS customers to build data-driven solutions.
Dr. Marco Guerriero leads Data and GenAI at AWS Professional Services for France and Europe South, holding a Ph.D. in Electrical and Computer Engineering from the University of Connecticut. His expertise spans machine learning, statistical inference, and mathematical optimization, with experience at organizations like NATO, GE, and ABB across defense, manufacturing, energy, and industrial automation sectors. With over 60 publications and five US patents to his name, Dr. Guerriero focuses on leveraging emerging technologies like GenAI and Quantum computing to drive business innovation across industries.
Sri Elaprolu is Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 12-year tenure at AWS, he has led ML science teams partnering with organizations like the NFL, Cerner, and NASA. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri holds a Master’s in Engineering Science and an MBA.
Dragica Boca is Managing Director of Professional Services EMEA at Amazon Web Services (AWS), leading enterprise cloud migration and generative AI transformation initiatives. With 30 years of technology consulting experience across Microsoft and IBM Global Business Services, she specializes in implementing production-ready AI solutions for Public Sector and Financial Services organizations. Dragica currently oversees large-scale GenAI implementations across EMEA, helping enterprises navigate the complexities of responsible AI deployment, scalable architecture, and sustainable adoption patterns.
From Kitchen to Drive-Thru: How Yum! Brands Is Accelerating Restaurant Innovation With AI
The quick-service restaurant (QSR) industry is being reinvented by AI.
For example, Yum! Brands — the world’s largest restaurant company and parent company of KFC, Taco Bell, Pizza Hut and Habit Burger & Grill — and NVIDIA announced their partnership at the NVIDIA GTC global AI conference last month, with the goal of deploying NVIDIA-powered AI solutions in 500 restaurants this year.
Joe Park, chief digital and technology officer at Yum! Brands, Inc. and president of Byte by Yum!, joined the NVIDIA AI podcast to share how the company is further accelerating AI deployment with NVIDIA AI.
Yum! Brands’ AI-first strategy aims to enhance every aspect of the restaurant experience, from the kitchen to customer interactions. Park discussed how Yum! Brands uses machine learning to optimize sales forecasts and inventory management, solving the costly problem of food waste and stockouts.
AI can also help with kitchen management. Imagine a local pizzeria getting dozens of calls in a short time and needing to plan which orders to start first so the customers can get their pizzas as fresh as possible.
“We use AI to solve for that,” said Park. “We can collect all kinds of different inputs and apply machine learning to figure out which customers’ pizzas to make now versus later, or which ones could be picked up by the delivery driver.”
With more than 61,000 restaurants operating globally under the Yum! Brands banner, efficiency at scale is always a priority. So, too, is keeping things simple for restaurant managers and team members. To that end, Yum! Brands recently launched Byte by Yum!, a proprietary AI-powered platform that integrates many technologies QSRs rely on.
Beyond AI-accelerated ordering, Yum! is also testing computer vision to analyze drive-thru traffic and exploring new use cases for AI to perceive, alert and adjust staffing, with the goal of optimizing service speed. The team is also excited about voice AI for order taking, including some unique edge use cases.
“If you think about a Baja blast, a chalupa, a gordita — words that aren’t found in Webster’s English dictionary — partnering with NVIDIA has helped us a tremendous amount because we’ve been able to train models with our own custom words and vernacular,” Park explained, referring to common orders at Taco Bell.
“AI at the drive-thru window increases accuracy and speed of order-taking, and helps make team members’ jobs easier,” Park added.
Time Stamps
03:51 – Park discusses how technology and AI are transforming the QSR industry.
06:58 – Yum! Brands’ AI use cases, including sales forecasting and inventory management.
15:07 – Inside the introduction of the Byte by Yum! AI-powered restaurant technology platform.
20:47 – Park’s advice for getting started on the AI journey.
You Might Also Like…
Roboflow Helps Unlock Computer Vision for Every Kind of AI Builder
Roboflow’s mission is to make the world programmable through computer vision. By simplifying computer vision development, they help bridge the gap between AI and people looking to harness it. Cofounder and CEO Joseph Nelson discusses how Roboflow empowers users in manufacturing, healthcare and automotive to solve complex problems with visual AI.
Firsthand’s Jon Heller Shares How AI Agents Enhance Consumer Journeys in Retail
Jon Heller, co-CEO and founder of Firsthand, discusses how the company’s Brand Agents are transforming the retail landscape by personalizing customer journeys, converting marketing interactions into valuable research data and enhancing the customer experience with hyper-personalized insights and recommendations.
AI Agents Take Digital Experiences to the Next Level in Gaming and Beyond
AI agents with advanced perception and cognition capabilities are making digital experiences more dynamic and personalized across industries. Inworld AI’s Chris Covert discusses how intelligent digital humans are reshaping interactive experiences, from gaming to healthcare.
Channel performance and more reporting coming to Performance Max
An overview of the latest reporting features, including channel performance reporting, coming to Performance Max.
Upload and edit your images directly in the Gemini app
We’re rolling out the ability to easily modify both your AI creations and images you upload from your phone or computer.
Control the Composition of AI-Generated Images With the NVIDIA AI Blueprint for 3D-Guided Generative AI
AI-powered image generation has progressed at a remarkable pace — from early examples of models creating images of humans with too many fingers to now producing strikingly photorealistic visuals. Even with such leaps, one challenge remains: achieving creative control.
Creating scenes using text has gotten easier, no longer requiring complex descriptions — and models have improved alignment to prompts. But describing finer details like composition, camera angles and object placement with text alone is hard, and making adjustments is even more complex. Advanced workflows using ControlNets — tools that enhance image generation by providing greater control over the output — offer solutions, but their setup complexity limits broader accessibility.
To help overcome these challenges and fast-track access to advanced AI capabilities, NVIDIA at the CES trade show earlier this year announced the NVIDIA AI Blueprint for 3D-guided generative AI for RTX PCs. This sample workflow includes everything needed to start generating images with full composition control. Users can download the new Blueprint today.
Harness 3D to Control AI-Generated Images
The NVIDIA AI Blueprint for 3D-guided generative AI controls image generation by using a draft 3D scene in Blender to provide a depth map to the image generator — FLUX.1-dev, from Black Forest Labs — which together with a user’s prompt generates the desired images.
The depth map helps the image model understand where things should be placed. The advantage of this technique is that it doesn’t require highly detailed objects or high-quality textures, since they’ll be converted to grayscale. And because the scenes are in 3D, users can easily move objects around and change camera angles.
Under the hood of the blueprint is ComfyUI, a powerful tool that allows creators to chain generative AI models in interesting ways. For example, the ComfyUI Blender plug-in lets users connect Blender to ComfyUI. Plus, an NVIDIA NIM microservice lets users deploy the FLUX.1-dev model and run it at the best performance on GeForce RTX GPUs, tapping into the NVIDIA TensorRT software development kit and optimized formats like FP4 and FP8. The AI Blueprint for 3D-guided generative AI requires an NVIDIA GeForce RTX 4080 GPU or higher.
A Prebuilt Foundation for Generative AI Workflows
The blueprint for 3D-guided generative AI includes everything necessary for getting started with an advanced image generation workflow: Blender, ComfyUI, the Blender plug-ins to connect the two, the FLUX.1-dev NIM microservice and the ComfyUI nodes required to run it. For AI artists, it also comes with an installer and detailed deployment instructions.
The blueprint offers a structured way to dive into image generation, providing a working pipeline that can be tailored to specific needs. Step-by-step documentation, sample assets and a preconfigured environment provide a solid foundation that makes the creative process more manageable and the results more powerful.
For AI developers, the blueprint can act as a foundation for building similar pipelines or expanding existing ones. It comes with source code, sample data, documentation and a working sample for getting started.
Real-Time Generation Powered by RTX AI
AI Blueprints run on NVIDIA RTX AI PCs and workstations, harnessing recent performance breakthroughs from the NVIDIA Blackwell architecture.
The FLUX.1-dev NIM microservice included in the blueprint for 3D-guided generative AI is optimized with TensorRT and quantized to FP4 precision for Blackwell GPUs, enabling more than double the inference speed of native PyTorch FP16.
For users on NVIDIA Ada Lovelace generation GPUs, the FLUX.1-dev NIM microservice comes with FP8 variants, also accelerated by TensorRT. These improvements make high-performance workflows more accessible for rapid iteration and experimentation. Quantization also helps run models with less VRAM. With FP4, for instance, model sizes are reduced by more than 2x compared with FP16.
Customize and Create With RTX AI
There are 10 NIM microservices currently available for RTX, supporting use cases spanning image and language generation to speech AI and computer vision — with more blueprints and services on the way.
Available now at https://build.nvidia.com/nvidia/genai-3d-guided, AI Blueprints and NIM microservices provide powerful foundations for those ready to create, customize and push the boundaries of generative AI on RTX PCs and workstations.
Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, digital humans, productivity apps and more on AI PCs and workstations.
6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention
Meta: Less Wright, Meet Vadakkanchery, Saurabh Mishra, Ela Krepska, Hamid Shojanazeri, Pradeep Fernando
Crusoe: Ethan Petersen, Martin Cala, Chip Smith
PyTorch DCP (Distributed Checkpointing) has recently enabled new optimizations in asynchronous checkpointing that reduce the drop in GPU utilization by minimizing collective overhead and improving overall checkpointing efficiency.
Using Crusoe’s 2K H200 cluster with TorchTitan to train Llama3-70B, we verified that these new features deliver substantial speedups at 1856-GPU scale, reducing the background processing time for async DCP checkpoints from ~436 seconds to ~67 seconds.
This is roughly a 6.5x reduction in background checkpoint processing time, enabling even more total training time to proceed at full training throughput.
Fig 1: 1856-GPU training run with high-frequency checkpointing. The first checkpoint (the dip in tps) does not have a cached save plan, so its background processing takes far longer than the rest, where the cached plan is used.
Background: What is Asynchronous Checkpointing?
In a standard checkpointing workflow, GPUs are blocked while the checkpointing data is offloaded from GPU to CPU and then written to storage. After the save to physical media is complete, training can resume.
Asynchronous checkpointing greatly reduces this downtime by allowing the actual save to storage to be done via CPU threads, so GPU-based training continues while the checkpoint data is persisted in parallel. It is used primarily for intermediate/fault-tolerance checkpoints, as it unblocks the GPUs much faster than synchronous checkpointing.
For example, in our large-scale experiment, GPU training was blocked for less than a second (0.78 seconds at 1856-GPU scale) while checkpoint data was moved from GPU to CPU (staging). At that point, GPU training immediately continues, which is a substantial training-time improvement over traditional checkpointing. For reference, Async Checkpointing is covered in more detail here.
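For reference, a minimal asynchronous save with DCP looks roughly like the following (a sketch only; state_dict and the checkpoint path are placeholders):
import torch.distributed.checkpoint as dcp

# async_save returns once staging (the GPU-to-CPU copy) completes, so the
# training loop continues while the upload to storage runs in the background.
future = dcp.async_save(state_dict, checkpoint_id="checkpoints/step_1000")

# ... keep training ...

# Optionally block later (e.g., before the next save) to confirm persistence.
future.result()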
Challenges with Asynchronous Checkpointing
However, the background processing inherent in Asynchronous Checkpointing has additional challenges that result in a temporary reduction of training throughput while the storage phase is being completed. These are highlighted below.
GPU utilization drop from GIL contention:
The Global Interpreter Lock (GIL) in Python is a mechanism that prevents multiple native threads from executing Python bytecode at the same time. This lock is necessary mainly because CPython’s memory management is not thread-safe.
DCP currently uses background threads for metadata collectives and uploading to storage. Although these expensive steps are done asynchronously, it leads to contention for the GIL with the trainer threads. This causes the GPU utilization (QPS) to suffer significantly and also increases the e2e upload latency. For large-scale checkpoints, the overhead of the CPU parallel processing has a suppressive effect on net GPU training speed since CPUs also drive the training process via GPU kernel launches.
Please refer to the following figure from our experiments:
Fig 2: One can see a sustained drop in training QPS even after staging (i.e. blocking operation to trainer) is complete.
The first dip in Figure 2 (marked by the purple line) indicates that staging is complete, and training can continue. However, a second drop is evident (marked by the area between the purple and yellow lines) which is due to trainer thread and checkpointing threads contending for the Python GIL, leading to degraded training QPS until the checkpoint thread completes execution.
Collective communications cost:
DCP performs multiple collectives today for various reasons: dedupe, global metadata for the checkpoint, resharding, and distributed exception handling. Collectives are costly as these require network I/O and pickling/unpickling of the large metadata being sent across the GPU network. These collectives become extremely expensive as the job scale grows, leading to significantly higher e2e latency and potential for collective timeouts.
Solutions
Process based async checkpointing
DCP now supports async checkpoint save via a background process. This helps avoid the training QPS drop by eliminating Python GIL contention with the trainer threads. Please see Fig 2 for checkpointing via threads and Fig 3 for checkpointing via background process.
Caching of the save plans
DCP has a clear boundary between the planning and storage I/O steps. The SavePlanner in DCP is a stateful component that acts as an access proxy to the state_dict. The planner manages save plans prepared by individual ranks, which carry the metadata necessary to perform the write I/O. The planning step involves a collective operation to gather a comprehensive view of the checkpoint on the coordinator rank. The coordinator rank is responsible for de-duplicating parameters/weights to eliminate redundancies, validating the global plan to ensure accuracy and consistency, and creating the global metadata structs. This is followed by a scatter collective in which the coordinator rank assigns I/O tasks to each rank. Any transformations done on the plans affect how the storage components finally write the data.
During the course of a training job, multiple checkpoints are saved. In the majority of these cases, only the checkpoint data changes between different save instances, and thus, the plan remains the same. This presented an opportunity for us to cache the plans, pay the planning cost only on the first save, and then amortize that cost across all the subsequent attempts. Only the updated plans (plans which changed in the next attempt) are sent via collective, thus reducing the collective overhead significantly.
Experiment Results
Set up: 1856 H200 GPUs, Llama3-70B, HSDP2 with TorchTitan
After deploying both the solutions above, the following are the key results:
- The TPS drop has narrowed significantly: throughput now bottoms out at 372 tps instead of 315 tps, and for a much shorter window (~67 seconds vs ~437 seconds). This window is now mostly attributable to the blocking CPU processing.
- Subsequent checkpoint save attempts also continue to be much faster due to very low overhead at the planning stage. E2E latency is thus improved by over 6.5x. This will allow our partners to increase the checkpointing frequency and reduce the lost training progress (i.e. wasted training time).
Looking at the very first downspike in Figure 1, this drawdown takes training throughput from 700 tps down to 320 tps and suppresses it for roughly 7 minutes (467 seconds). Once the CPUs have finished processing, training continues again at full speed.
Previously, this ~7 minute suppression would be repeated at every checkpoint. However, with the new process-based checkpointing feature, only the first checkpoint has the full drawdown time (mainly due to overhead from daemon process initialization), as all future checkpoints are executed via the background process, mitigating GIL contention with the trainer threads.
This is visually shown in all the subsequent checkpoints where the average MFU suppression time drops to just over a minute, reflected by the sharp spikes that almost immediately revert to full MFU throughput.
Fig 3: The red box shows the non-cached plan checkpoint, which also includes Checkpoint Background Init process overhead, while the purple box highlights the first checkpoint to run with the cached plan.
This means that even large-scale checkpointing, such as shown in Fig 2 at 1856 GPU scale, can be done with ~6x reduced training throughput impact. This enables Asynchronous DCP checkpointing to be run more frequently (thus better rollback protection) while enhancing total training throughput relative to previous Async Checkpointing overhead.
Using DCP’s cached checkpointing:
These features are already available in the PyTorch nightly builds, and you can test out PyTorch’s asynchronous DCP checkpointing directly in TorchTitan. The following instructions enable them (a minimal usage sketch follows the list):
- Process-based asynchronous checkpointing: set async_checkpointer_type to AsyncCheckpointerType.PROCESS in the async_save API (file: pytorch/torch/distributed/checkpoint/state_dict_saver.py).
- Save plan caching: set the enable_plan_caching flag to True in the DefaultSavePlanner (file: pytorch/torch/distributed/checkpoint/default_planner.py).
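A minimal sketch of combining both options, based on the parameter names cited above (exact import locations may differ between nightly builds; state_dict and step are placeholders):
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
from torch.distributed.checkpoint.state_dict_saver import AsyncCheckpointerType

# Reusing one planner instance across saves lets the cached plan be reused
# after the first checkpoint (an assumption consistent with the flags above).
planner = DefaultSavePlanner(enable_plan_caching=True)

future = dcp.async_save(
    state_dict,                                              # model/optimizer state to checkpoint
    checkpoint_id=f"checkpoints/step_{step}",
    planner=planner,
    async_checkpointer_type=AsyncCheckpointerType.PROCESS,   # background process, no GIL contention
)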
Future work
DCP will be rolling out additional optimizations to further reduce checkpointing cost. Currently, even though the save plans are cached, the coordinator rank still prepares the metadata. For larger jobs and models with many tensors, this overhead is non-trivial. In the next iteration, DCP will eliminate the metadata overhead and improve the e2e latency further. DCP will also introduce additional optimizations, such as zero-overhead checkpointing, to enable efficient checkpointing in large-scale jobs.
Stay tuned!
FlexAttention Part II: FlexAttention for Inference
Overview
In the PyTorch 2.5.0 release, we introduced FlexAttention (torch.nn.attention.flex_attention) for ML researchers who’d like to customize their attention kernels without writing kernel code. This blog introduces our decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support.
If you’re looking for an easy way to play around with FlexAttention in your post-training / inference pipeline, PyTorch native post-training library torchtune and inference codebase gpt-fast already have FlexAttention integrated. Try it out!
We are excited to share that our paper on FlexAttention has been accepted for presentation at the MLSys2025 Conference held from May 12-15th in Santa Clara, California.
Title: FlexAttention: A Programming Model for Generating Optimized Attention Kernels (poster).
FlexAttention for Inference
TL;DR: torch.compile lowers flex_attention to a fused FlashDecoding kernel when it runs on a very short query.
One fused attention kernel does not suit all – especially in long-context LLM inference.
The decoding phase of LLM inference is an iterative process: tokens are generated one at a time, requiring N forward passes to generate an N-token sentence. Fortunately, each iteration doesn’t need to recompute self-attention over the full sentence: previously calculated tokens are cached, so we only need to attend the newly generated token to the cached context.
This results in a unique attention pattern where a short query sequence (1 token) attends to a long key-value cache (context length up to 128k). Traditional optimizations for square attention kernels (q_len ≈ kv_len) don’t directly apply here. This pattern poses new challenges for GPU memory utilization and occupancy. We build a dedicated FlexDecoding backend optimized for long-context LLM inference, incorporating decoding-specific techniques from FlashDecoding.
FlexDecoding is implemented as an alternative backend for the torch.nn.attention.flex_attention operator. flex_attention automatically switches to the FlexDecoding backend for its JIT compilation when given a short query and a long KV cache. If the input shape changes significantly, for example when transitioning from the prefill phase to decoding, JIT recompilation generates a separate kernel for each scenario.
from torch.nn.attention.flex_attention import flex_attention

flex_attention = torch.compile(flex_attention)
k_cache = torch.randn(B, H, 16384, D)
v_cache = torch.randn(B, H, 16384, D)
...
# Prefill Phase: query shape = [B, H, 8000, D]
flex_attention(q_prefill, k_cache, v_cache, ...)  # Uses the FlexAttention backend optimized for prefill & training
# Decoding Phase: q_last_token shape = [B, H, 1, D]
flex_attention(q_last_token, k_cache, v_cache, ...)  # Recompiles with the FlexDecoding backend
# Decode 2 tokens at the same time: q_last_2_tokens shape = [B, H, 2, D]
flex_attention(q_last_2_tokens, k_cache, v_cache, ...)  # No recompilation needed! Runs the decoding kernel again.
Working with KV Cache
One of the key optimizations for efficient inference is maintaining a preallocated KV cache that updates in place as new tokens are generated. Instead of enforcing a specific KV cache policy with a dedicated API, FlexDecoding allows users to define and manage the KV cache themselves.
Similar to FlexAttention, FlexDecoding takes user-defined mask_mod and score_mod functions. These functions modify attention scores before the softmax operation.
score_mod(score, b, h, q_idx, kv_idx) -> tensor # return updated score
Score is a scalar PyTorch tensor that represents the dot product of a query token and a key token. The rest of the arguments specify which score is being computed (an illustrative score_mod follows the list):
- b: batch index
- h: attention head index
- q_idx: token position in the query tensor
- kv_idx: token position in the key/value tensor
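For illustration, a hedged sketch of an ALiBi-style score_mod written against this signature (alibi_slopes is our own per-head slope tensor, not part of the API):
import torch

H = 16  # number of attention heads (illustrative)
alibi_slopes = torch.exp2(-torch.arange(1, H + 1, device="cuda") * 8.0 / H)

def alibi(score, b, h, q_idx, kv_idx):
    # Bias each score by a per-head slope times the (negative) query/key distance
    return score + alibi_slopes[h] * (kv_idx - q_idx)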
In the decoding phase, previously calculated tokens are cached, and only the latest generated token (i-th) is used as the query. A naive causal mask on this one token query looks like this:
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))
This is problematic: the new token “saw” should attend to all previously generated tokens, i.e. “The cat sat on the mat and saw”, not just the first entry in the KV cache. To correct this, the score_mod needs to offset q_idx by i for accurate decoding.
Creating a new score_mod for each token to accommodate the offset is slow, since it means FlexAttention needs to be recompiled every iteration for a different score_mod. Instead, we define this offset as a tensor and increment its value at each iteration:
offset = torch.tensor(i, device="cuda")

def causal_w_offset(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx + offset >= kv_idx, score, -float("inf"))

# Attend the i-th token
flex_attention(..., score_mod=causal_w_offset)  # Compiles the kernel here
...
# Attend the i+1-th token
offset = offset + 1  # Increment offset
flex_attention(..., score_mod=causal_w_offset)  # Doesn't need to recompile!
Notably, here offset becomes a captured tensor, and we do not need to recompile if offset changes value.
Manually rewriting your score_mod and mask_mod for offset handling isn’t necessary. We can automate this process with a generic rewriter:
offset = torch.tensor(i, device="cuda")

def get_score_mod_w_offset(score_mod: _score_mod_signature, _offset: torch.Tensor):
    def _score_mod(score, b, h, q, kv):
        return score_mod(score, b, h, q + _offset, kv)
    return _score_mod

def get_mask_mod_w_offset(mask_mod: _mask_mod_signature, _offset: torch.Tensor):
    def _mask_mod(b, h, q, kv):
        return mask_mod(b, h, q + _offset, kv)
    return _mask_mod

causal_w_offset = get_score_mod_w_offset(causal, offset)
BlockMask for Inference
We can also use BlockMask with inference to leverage mask sparsity. The idea is to precompute the BlockMask once during model setup and use slices of it during decoding.
Precomputing BlockMask
During setup, we create a square BlockMask of size MAX_SEQ_LEN x MAX_SEQ_LEN:
from torch.nn.attention.flex_attention import create_block_mask
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=MAX_SEQ_LEN, KV_LEN=MAX_SEQ_LEN)
Using BlockMask During Decoding
For the i-th token, we use a slice of the mask:
block_offset = i // block_mask.BLOCK_SIZE[0]
block_mask_slice = block_mask[:, :, block_offset]
# don't forget to use the mask_mod with offset!
block_mask_slice.mask_mod = get_mask_mod_w_offset(causal_mask, offset)
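Putting these pieces together, a hedged sketch of the decode loop might look like the following (q_last_token, k_cache, v_cache, block_mask, causal, causal_mask, offset and the get_*_w_offset helpers from the snippets above are assumed to be in scope; start_pos and num_new_tokens are illustrative):
score_mod_w_offset = get_score_mod_w_offset(causal, offset)
mask_mod_w_offset = get_mask_mod_w_offset(causal_mask, offset)

for i in range(start_pos, start_pos + num_new_tokens):
    offset.fill_(i)  # update the captured tensor in place so both wrapped mods see the new value
    block_offset = i // block_mask.BLOCK_SIZE[0]
    block_mask_slice = block_mask[:, :, block_offset]
    block_mask_slice.mask_mod = mask_mod_w_offset

    out = flex_attention(
        q_last_token, k_cache, v_cache,
        score_mod=score_mod_w_offset,
        block_mask=block_mask_slice,
    )
    # ... sample the next token from out, write it into the KV cache, update q_last_token ...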
Performance
The FlexDecoding kernel performs on par with FlashDecoding (FAKV) and significantly outperforms PyTorch scaled_dot_product_attention (code).
FlexDecoding boosts LLaMa3.1-8B serving performance by 1.22x–2.04x and LLaMa3.1-70B performance by 0.99x–1.66x compared with SDPA in gpt-fast (code).
Paged Attention
vLLM is one of the most popular LLM serving engines, powered by the efficient memory management of PagedAttention. The existing PagedAttention implementation requires dedicated CUDA kernels and offers limited flexibility for supporting emerging attention variants. In this section, we present a PT2-native PagedAttention implementation enabled by FlexAttention and torch.compile.
PagedAttention scatters the KV cache to reduce memory fragmentation and support higher batch sizes. Without PagedAttention, the KV cache for each request is stored in contiguous memory, requiring two tensors of shape B x H x KV_LEN x D. We call this a logical KV cache. Here, KV_LEN is the maximum sequence length over all requests in a batch. Considering Figure 1(a), KV_LEN is 9, so all requests must be padded to 9 tokens, leading to large memory waste. With PagedAttention, we can chunk each request into multiple pages of the same size page_size and scatter these pages into a physical KV cache of shape 1 x H x max_seq_len x D, where max_seq_len = n_pages x page_size. This avoids padding requests to the same length and saves memory. Specifically, we provide an assign API to update the KV cache via index computations:
def assign(
    batch_idx: torch.Tensor,
    input_pos: torch.Tensor,
    k_val: torch.Tensor,
    v_val: torch.Tensor,
    k_cache: torch.Tensor,
    v_cache: torch.Tensor,
) -> None
Behind this assign API is a page table: a tensor mapping the logical KV cache to the physical KV cache.
assign takes k_val and v_val and scatters them into the physical KV cache, guided by the mapping from the page table.
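As a conceptual illustration only (not the library implementation), the index math behind assign for the single-new-token decode case could look like this; page_table of shape [B, n_pages] (logical page to physical page) and page_size are names we assume the paged KV cache maintains internally:
def assign_one_token(batch_idx, input_pos, k_val, v_val, k_cache, v_cache):
    # batch_idx, input_pos: [B]; k_val, v_val: [B, H, 1, D]
    logical_page = input_pos // page_size                    # which logical page each token lands in
    page_offset = input_pos % page_size                      # position within that page
    physical_page = page_table[batch_idx, logical_page]      # page table lookup
    physical_pos = physical_page * page_size + page_offset   # [B] positions in the physical cache
    # k_cache, v_cache: [1, H, n_pages * page_size, D]
    k_cache[0, :, physical_pos] = k_val[:, :, 0].transpose(0, 1)  # values reshaped to [H, B, D]
    v_cache[0, :, physical_pos] = v_val[:, :, 0].transpose(0, 1)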
Paged Attention with Page Table
A natural question is how to integrate PagedAttention with FlexAttention to support diverse attention variants. A naive idea is to materialize the logical KV cache before computing with FlexAttention, but this leads to redundant memory copies and poor performance. Another idea is to build a dedicated CUDA or Triton kernel for paged attention, similar to the existing PagedAttention implementation. However, this adds significant manual effort and code complexity.
Instead, we design a fused indirect memory access by converting the logical block mask according to the page table. In FlexAttention, we exploit BlockMask to identify logical blocks and skip redundant computation. While PagedAttention adds an extra layer of indirect memory access, we can further convert the logical block mask to the physical block mask corresponding to the page table, as illustrated in Figure 2. Our PagedAttention implementation provides a convert_logical_block_mask via torch.gather calls:
def convert_logical_block_mask(
    block_mask: BlockMask,
    batch_idx: Optional[torch.Tensor] = None,
) -> BlockMask
Paged Attention via Block Mask Conversion
One remaining question is how to rewrite user-specified mask_mod and score_mod functions for PagedAttention. When users specify these modifications, they write them with logical indices, without knowledge of the page table maintained at runtime. The following code shows an automated conversion performed at runtime to rewrite user-specified modifications with physical KV indices. The new_mask_mod takes the physical_kv_idx, converts it back to the logical_kv_idx, and applies the user-specified mask_mod on the logical_kv_idx to produce the correct mask. For efficiency, we maintain physical_to_logical, a mapping from physical_kv_block to logical_kv_block, to facilitate the conversion. For correctness, we mask out-of-boundary blocks as False with a torch.where call. After batching logical KV caches from multiple requests into the same physical KV cache, there are many more physical blocks than logical blocks for each request, so a physical block may not have a corresponding logical block for a specific request during block mask conversion. Masking these as False with torch.where ensures that data from different requests do not interfere with each other. The score_mod can be converted automatically in the same way.
def get_mask_mod(mask_mod: Optional[_mask_mod_signature]) -> _mask_mod_signature:
    if mask_mod is None:
        mask_mod = noop_mask

    def new_mask_mod(
        b: torch.Tensor,
        h: torch.Tensor,
        q_idx: torch.Tensor,
        physical_kv_idx: torch.Tensor,
    ):
        # Recover the logical index from the physical index via the page table
        physical_kv_block = physical_kv_idx // page_size
        physical_kv_offset = physical_kv_idx % page_size
        logical_block_idx = physical_to_logical[b, physical_kv_block]
        logical_kv_idx = logical_block_idx * page_size + physical_kv_offset
        # Out-of-boundary physical blocks (logical_block_idx < 0) are masked out
        return torch.where(
            logical_block_idx >= 0, mask_mod(b, h, q_idx, logical_kv_idx), False
        )

    return new_mask_mod
Figure 3 demonstrates the latency of Paged Attention (code). Overall, there is less than 5% overhead from FlexAttention with PagedAttention compared with FlexAttention alone. We also observe performance on par with Flash Attention v2. A minimal serving example further shows that PagedAttention can support a 76x higher batch size when evaluating on the OpenOrca dataset, which includes 1M GPT-4 completions and 3.2M GPT-3.5 completions.
Paged Attention: Latency under diverse sequence length
Ragged input sequences with Nested Jagged Tensors (NJTs)
FlexAttention now supports ragged-sized input sequences through the use of Nested Jagged Tensors (NJTs). NJTs represent ragged-sized sequences by packing sequences into a single “stacked sequence” and maintaining a set of offsets delimiting sequence boundaries for each batch item.
A block mask can be created for input NJTs through the new create_nested_block_mask() API. The returned block mask is compatible with the ragged structure of the given NJT, treating it as a single “stacked sequence” with inter-sequence attention automatically masked out. The mask_mod or score_mod function can be written as usual.
import torch
from torch.nn.attention.flex_attention import create_nested_block_mask, flex_attention

BATCH = 8
NUM_HEADS = 8
D = 16
device = "cuda"

# Input NJTs of shape (BATCH, SEQ_LEN*, D) with ragged SEQ_LEN
sequence_lengths = [torch.randint(5, 30, ()).item() for _ in range(BATCH)]
query = torch.nested.nested_tensor([
    torch.randn(seq_len, NUM_HEADS * D, device=device)
    for seq_len in sequence_lengths
], layout=torch.jagged)
key = torch.randn_like(query)
value = torch.randn_like(query)
# View as shape (BATCH, NUM_HEADS, SEQ_LEN*, HEAD_DIM)
query = query.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2)
key = key.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2)
value = value.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2)
# Simple causal mask
def my_mask_mod(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx
# Construct a block mask using the ragged structure of the
# specified query NJT. Ragged-sized sequences are treated as a single
# "stacked sequence" with inter-sequence attention masked out.
block_mask = create_nested_block_mask(my_mask_mod, 1, 1, query)
# For cross attention, create_nested_block_mask() also supports a
# rectangular block mask using the ragged structures of both query / key.
#block_mask = create_nested_block_mask(my_mask_mod, 1, 1, query, key)
output = flex_attention(query, key, value, block_mask=block_mask)
Trainable Biases
FlexAttention now supports trainable parameters in score_mod functions. This feature enables users to reference tensors that require gradients within their score_mod implementations, with gradients automatically backpropagating through these parameters during training.
Memory-Efficient Gradient Accumulation
Instead of materializing the full attention scores matrix, FlexAttention uses atomic additions (tl.atomic_add) to accumulate gradients. This approach significantly reduces memory usage at the cost of introducing some non-determinism in gradient calculations.
Handling Broadcasted Operations
Broadcasting operations in the forward pass (e.g., score + bias[h]) require special consideration in the backward pass. When broadcasting a tensor across multiple attention scores within a head or other dimensions, we need to reduce these gradients back to the original tensor shape. Rather than materializing the full attention score matrix to perform this reduction, we use atomic operations. While this incurs some runtime overhead, it allows us to maintain memory efficiency by avoiding the materialization of large intermediate tensors.
Current Limitations
The implementation currently allows only a single read from each input tensor in the score_mod function. For example, bias[q_idx] + bias[kv_idx] would not be supported, as it reads from the same tensor twice. We hope to remove this restriction in the future.
Simple Example:
bias = torch.randn(num_heads, requires_grad=True)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[h]
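As a hedged, self-contained sketch of the same idea, gradients flow from the attention output back to the bias (shapes, device and num_heads are illustrative):
import torch
from torch.nn.attention.flex_attention import flex_attention

flex_attention = torch.compile(flex_attention)

num_heads = 8
bias = torch.randn(num_heads, device="cuda", requires_grad=True)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[h]

q = k = v = torch.randn(2, num_heads, 256, 64, device="cuda")
out = flex_attention(q, k, v, score_mod=score_mod)
out.sum().backward()    # gradients backpropagate through the captured bias
print(bias.grad.shape)  # torch.Size([8])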
Performance Tuning for FlexAttention
TL;DR
For optimal performance, compile FlexAttention using max-autotune, especially when dealing with complex score_mods and mask_mods:
flex_attention = torch.compile(flex_attention, dynamic=True, mode="max-autotune")
What is max-autotune?
max-autotune is a torch.compile mode in which TorchInductor sweeps many kernel parameters (e.g., tile size, num_stages) and selects the best-performing configuration. During the sweep, failing configurations are discarded safely, so autotuning reliably lands on the best viable configuration.
While compilation takes longer with max-autotune, the optimal configuration is cached for future kernel executions.
Here’s an example of FlexAttention compiled with max-autotune:
triton_flex_attention_backward_7 0.2528 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=7, HAS_FULL_BLOCKS=False, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, QK_HEAD_DIM=128, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.08838834764831843, SPARSE_KV_BLOCK_SIZE=1073741824, SPARSE_Q_BLOCK_SIZE=1073741824, V_HEAD_DIM=128, num_stages=4, num_warps=4
Why Use max-autotune for FlexAttention?
The amount of shared memory utilized in FlexAttention depends on the score_mod and mask_mod methods. This variability means that the preconfigured default kernel parameters may lead to performance cliffs or even out-of-shared-memory errors on certain hardware for some masks/mods.
For instance, with document masks, default configurations can halve GPU occupancy, reducing performance to ~75% of its potential on some GPUs. To avoid such issues, we strongly recommend enabling max-autotune.
Updates and Enhancements
- Now available as a prototype feature in PyTorch 2.5.0
- Fixed critical correctness issues, including a bug affecting multiple calls to FlexAttention within the same call to torch.compile
Expanded Architecture Support
- Arbitrary sequence length support – no longer requires multiples of 128
- Added native grouped-query attention (GQA) support via is_gqa=True
- Enhanced dimension flexibility:
- Different QK and V head dimensions
- Non-power-of-two head dimensions
- Trainable attention biases (prototype)
Under the Hood
- New fused CPU backend
- Improved TF32 handling for float32 inputs
- Resolved various dynamic shape issues
- Output layout matching query strides
These updates make FlexAttention more robust and flexible while maintaining its core promise of combining PyTorch’s ease of use with FlashAttention’s performance benefits.