New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluatorsRead More
AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms
New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluatorsRead More
AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms
New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluatorsRead More
Press Play on Don Diablo’s Music Video — Created With NVIDIA RTX-Powered Generative AI
Electronic music icon Don Diablo is known for pushing the boundaries of music, visual arts and live performance — and the music video for his latest single, “BLACKOUT,” explores using generative AI, combining NVIDIA RTX-powered and cloud-based tools in a hybrid workflow.
Set inside an industrial warehouse, the video features Diablo performing in front of an immersive, AI-generated environment — blending stylized effects, a cinematic mood and personalized elements.
The image creation and experimentation was done locally on a desktop powered by a NVIDIA GeForce RTX 5090 GPU, while the final animation was rendered using Kling AI, a cloud-based generative video tool. This hybrid approach gave the team creative control and high-fidelity results.
CTRL + ALT + Create With RTX
All image-based work in the music video — including stylization and identity modeling — was performed locally using ComfyUI, a visual, node-based interface for building and customizing generative AI workflows.
With minimal setup, the team could launch preconfigured workflows and experiment freely with different models and techniques. ComfyUI offered speed, ease of use and support for a wide range of open-source tools — enabling fast iteration and distinctive visuals that aligned with Diablo’s creative direction.
Diablo wanted the visuals to reflect his identity — not just serve as decor. To achieve this, a custom model was trained on a facial dataset of Diablo, using low-rank adaptation in the FLUXGYM user interface.
The resulting model was then loaded into ComfyUI and integrated into a Stable Diffusion pipeline for stylized image generation. An identity-embedding node ensured consistent facial features across frames, as well as stylistic cohesion.
Once the imagery was locked, the team turned to Kling AI in the cloud to animate the visuals into dynamic scenes, harnessing the tool’s advanced physics and cinematic capabilities.
This workflow, all running locally on an RTX 5090 GPU, enabled faster iteration, precise visual control, tighter feedback loops and full creative autonomy. It showcases how creators can use local RTX-powered tools — as well as cloud-based platforms for pre-defined workloads that require more compute — to quickly and easily produce high-impact content on their terms.
“Working with these new AI tools literally feels like stepping into my own mind,” said Diablo. “We were able to creatively raise the bar, shape the visuals in real time and stay focused on the story. AI is a powerful tool for creativity.”
Creative Workflows, Powered by RTX AI
Watch the full “BLACKOUT” music video to see how the team brought their creative ideas to life with RTX AI.
Artists and developers are already using generative AI to streamline their work and push creative boundaries, whether exploring concepts, designing virtual worlds or building intelligent apps. With RTX AI PCs, users can access the latest and greatest models and tools, as well as powerful AI performance.
Learn more about the NVIDIA Studio platform and RTX AI PCs.
Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NVIDIA NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, digital humans, productivity apps and more on AI PCs and workstations.
Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.
How OpenSynth Uses PyTorch to Accelerate Compute for Energy Modelling Applications
OpenSynth has recently leveraged PyTorch to improve the experience of its users and community. OpenSynth is an open source community hosted by LF Energy that is democratising access to synthetic energy demand data.
Access to smart meter data is essential to rapid and successful energy transitions. Researchers, modelers and policymakers need to understand how energy demand profiles are changing, in a system that requires greater real time optimization of demand and supply on the grid. Yet current global energy modeling and policymaking is still largely based on static and highly aggregated data from the past – when energy flowed in one direction, consumer profiles were relatively predictable, and power generation was highly controllable.
The major challenge is that access to demand data is highly restrictive, as a result of privacy protections. Rather than joining industry calls to unlock raw smart meter data through existing mechanisms, by tackling current data regulations and smart meter legislation, OpenSynth believes generating synthetic data is the fastest way to achieve widespread, global access to smart meter datasets.
The community empowers holders of raw smart meter (i.e. demand) data to generate and share synthetic data and models that can be used by researchers, industry innovators and policy-makers.
PyTorch allowed the OpenSynth community to use GPU compute to speed up computation and use distributed training. End users with access to multiple GPUs can split the dataset into multiple smaller datasets to parallelise compute, further speeding up compute. This allows scaling of training to much bigger datasets than before.
The Business Challenge
Centre for Net Zero, the non-profit that originally developed OpenSynth before it was contributed to LF Energy, has also developed an algorithm called Faraday, available via OpenSynth to its users, that can generate synthetic smart meter data. The Faraday algorithm consists of two components: an AutoEncoder module and a Gaussian Mixture Module.
The Gaussian Mixture Model (GMM) of Faraday was originally implemented using scikit-learn’s implementation. Scikit Learn is a popular library used amongst data scientists to train many different machine learning algorithms. However, that implementation does not scale very well on large datasets, as it only supports CPUs (Central Processing Units) – it does not allow accelerated computation using GPUs (Graphical Processing units). GPUs are a more powerful chip that can perform mathematical operations much faster, and is commonly used to train deep learning models.
Furthermore, it also does not allow any parallelisation. Parallelisation compute means splitting the original dataset into multiple independent and smaller datasets, and training smaller models on each individual dataset, then combining the smaller models into a single model.
A different implementation was needed that supports both parallel computation and GPU acceleration.
How OpenSynth Used PyTorch
The OpenSynth community recently ported the GMM module from Faraday to PyTorch. Originally implemented using scikit-learn, this reimplementation enables the use of GPUs for training GMMs, significantly accelerating computational performance.
By leveraging PyTorch’s powerful GPU capabilities, the new GMM module can now handle much larger datasets and faster computation, making it an invaluable tool for practitioners working with large datasets that cannot fit into memory. This update allows users to scale their models and processes more efficiently, leading to faster insights and improved results in energy modeling applications.
A Word from OpenSynth
“Open source is a powerful catalyst for change. Our open data community, OpenSynth, is democratising global access to synthetic energy demand data – unlocking a diversity of downstream applications that can accelerate the decarbonisation of energy systems. PyTorch has an incredible open source ecosystem that enables us to significantly speed up computation for OpenSynth’s users, using distributed GPUs. Without this open source ecosystem, it would have been impossible to implement this change – and slowed down the efforts of those seeking to affect net zero action.” – Sheng Chai, Senior Data Scientist, Centre for Net Zero
Learn More
For more information, visit the LF Energy OpenSynth website.
PyTorch/XLA 2.7 Release Usability, vLLM boosts, JAX bridge, GPU Build
PyTorch/XLA is a Python package that uses the XLA deep learning compiler to enable PyTorch deep learning workloads on various hardware backends, including Google Cloud TPUs, GPUs, and AWS Inferentia/Trainium. The PyTorch/XLA team has been working hard to bring new capabilities to researchers and developers using TPUs/GPUs and XLA backends. In this update, we’ve made many additions and improvements to the framework. Some of the notable highlights are:
- Usability improvements
- Experimental bridge with JAX operations
- A new Pallas-based kernel for ragged paged attention, enabling further optimizations on vLLM TPU
These features, bug fixes, and other details are outlined in the release notes. Let’s now delve into the highlights in detail!
Usability Improvements
Developers are now able to better target areas of code that they want to measure the performance of by marking the exact regions of code that they would like to profile. An example of this is:
server = xp.start_server(8001) xp.start_trace(profiling_dir) # Run some computation ... xp.stop_trace()
PyTorch/XLA 2.7 also introduces an API to query the number of cached compilation graphs, aiding in the detection of unexpected compilations during production inference or training. An additional enhancement optimizes host-to-device transfers by avoiding unnecessary tensor copying, thus improving performance.
JAX Bridge in PyTorch/XLA (Prototype)
We’re experimenting with integrating JAX operations directly into PyTorch/XLA graphs as a way to enable a bridge between the frameworks — this method enables users to call JAX functions inside PyTorch models running with XLA.
As a use case, we’ve explored calling `jax.experimental.shard_alike` from PyTorch/XLA. This function improves sharding propagation in certain code patterns like scan, and we’ve integrated it as part of the GSPMD (Generalized SPMD) workflow in the compiler. This tool is utilized in torchprime to enable support for the SplashAttention Pallas kernel.
import torch_xla.core.xla_builder as xb # Native function written in JAX def jax_function(...): import jax ... return ... res = xb.call_jax(...) </pre?
Ragged Paged Attention Pallas Kernel
Efficient attention for variable-length sequences is critical for scaling large language models, and the new Pallas kernel for ragged paged attention brings a major performance and usability upgrade to vLLM TPU.
This update introduces a custom kernel implemented using the Pallas custom kernel language and is lowered to Mosaic for TPU. It supports ragged (variable-length) input sequences and implements a paged attention pattern. Below are the key features:
- Supports mixed prefill and decode operations to increase inference throughput (e.g., up to a 5x speedup compared to the padded Multi-Queries Paged Attention implementation for llama-3-8b).
- No GMM (Grouped Matmul) Metadata required! We calculate the metadata on the fly in the kernel. This can increase performance by 10%.
- Provides a CUDA Flash Attention equivalent with Paged Attention support and a similar interface.
We are continuously collaborating with the vLLM community to further optimize performance, expand kernel coverage, and streamline TPU inference at scale.
GPU Build is Back
The GPU build was paused in the PyTorch/XLA 2.6 release, but we’ve now re-enabled GPU Continuous Integration (CI) in version 2.7. The current release includes GPU builds with CUDA 12.6, marking an important step forward for GPU support.
While CUDA support is still considered experimental in this release, we plan to expand coverage to additional CUDA versions in upcoming releases.
Get Involved
Please check out the latest changes on GitHub. As always, we’re actively seeking feedback and contributions from the community.
Securing Amazon Bedrock Agents: A guide to safeguarding against indirect prompt injections
Generative AI tools have transformed how we work, create, and process information. At Amazon Web Services (AWS), security is our top priority. Therefore, Amazon Bedrock provides comprehensive security controls and best practices to help protect your applications and data. In this post, we explore the security measures and practical strategies provided by Amazon Bedrock Agents to safeguard your AI interactions against indirect prompt injections, making sure that your applications remain both secure and reliable.
What are indirect prompt injections?
Unlike direct prompt injections that explicitly attempt to manipulate an AI system’s behavior by sending malicious prompts, indirect prompt injections are far more challenging to detect. Indirect prompt injections occur when malicious actors embed hidden instructions or malicious prompts within seemingly innocent external content such as documents, emails, or websites that your AI system processes. When an unsuspecting user asks their AI assistant or Amazon Bedrock Agents to summarize that infected content, the hidden instructions can hijack the AI, potentially leading to data exfiltration, misinformation, or bypassing other security controls. As organizations increasingly integrate generative AI agents into critical workflows, understanding and mitigating indirect prompt injections has become essential for maintaining security and trust in AI systems, especially when using tools such as Amazon Bedrock for enterprise applications.
Understanding indirect prompt injection and remediation challenges
Prompt injection derives its name from SQL injection because both exploit the same fundamental root cause: concatenation of trusted application code with untrusted user or exploitation input. Indirect prompt injection occurs when a large language model (LLM) processes and combines untrusted input from external sources controlled by a bad actor or trusted internal sources that have been compromised. These sources often include sources such as websites, documents, and emails. When a user submits a query, the LLM retrieves relevant content from these sources. This can happen either through a direct API call or by using data sources like a Retrieval Augmented Generation (RAG) system. During the model inference phase, the application augments the retrieved content with the system prompt to generate a response.
When successful, malicious prompts embedded within the external sources can potentially hijack the conversation context, leading to serious security risks, including the following:
- System manipulation – Triggering unauthorized workflows or actions
- Unauthorized data exfiltration – Extracting sensitive information, such as unauthorized user information, system prompts, or internal infrastructure details
- Remote code execution – Running malicious code through the LLM tools
The risk lies in the fact that injected prompts aren’t always visible to the human user. They can be concealed using hidden Unicode characters or translucent text or metadata, or they can be formatted in ways that are inconspicuous to users but fully readable by the AI system.
The following diagram demonstrates an indirect prompt injection where a straightforward email summarization query results in the execution of an untrusted prompt. In the process of responding to the user with the summarization of the emails, the LLM model gets manipulated with the malicious prompts hidden inside the email. This results in unintended deletion of all the emails in the user’s inbox, completely diverging from the original email summarization query.
Unlike SQL injection, which can be effectively remediated through controls such as parameterized queries, an indirect prompt injection doesn’t have a single remediation solution. The remediation strategy for indirect prompt injection varies significantly depending on the application’s architecture and specific use cases, requiring a multi-layered defense approach of security controls and preventive measures, which we go through in the later sections of this post.
Effective controls for safeguarding against indirect prompt injection
Amazon Bedrock Agents has the following vectors that must be secured from an indirect prompt injection perspective: user input, tool input, tool output, and agent final answer. The next sections explore coverage across the different vectors through the following solutions:
- User confirmation
- Content moderation with Amazon Bedrock Guardrails
- Secure prompt engineering
- Implementing verifiers using custom orchestration
- Access control and sandboxing
- Monitoring and logging
- Other standard application security controls
User confirmation
Agent developers can safeguard their application from malicious prompt injections by requesting confirmation from your application users before invoking the action group function. This mitigation protects the tool input vector for Amazon Bedrock Agents. Agent developers can enable User Confirmation for actions under an action group, and they should be enabled especially for mutating actions that could make state changes for application data. When this option is enabled, Amazon Bedrock Agents requires end user approval before proceeding with action invocation. If the end user declines the permission, the LLM takes the user decline as additional context and tries to come up with an alternate course of action. For more information, refer to Get user confirmation before invoking action group function.
Content moderation with Amazon Bedrock Guardrails
Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. It provides robust content filtering capabilities that block denied topics and redact sensitive information such as personally identifiable information (PII), API keys, and bank accounts or card details. The system implements a dual-layer moderation approach by screening both user inputs before they reach the foundation model (FM) and filtering model responses before they’re returned to users, helping make sure malicious or unwanted content is caught at multiple checkpoints.
In Amazon Bedrock Guardrails, tagging dynamically generated or mutated prompts as user input is essential when they incorporate external data (e.g., RAG-retrieved content, third-party APIs, or prior completions). This ensures guardrails evaluate all untrusted content-including indirect inputs like AI-generated text derived from external sources-for hidden adversarial instructions. By applying user input tags to both direct queries and system-generated prompts that integrate external data, developers activate Bedrock’s prompt attack filters on potential injection vectors while preserving trust in static system instructions. AWS emphasizes using unique tag suffixes per request to thwart tag prediction attacks. This approach balances security and functionality: testing filter strengths (Low/Medium/High) ensures high protection with minimal false positives, while proper tagging boundaries prevent over-restricting core system logic. For full defense-in-depth, combine guardrails with input/output content filtering and context-aware session monitoring.
Guardrails can be associated with Amazon Bedrock Agents. Associated agent guardrails are applied to the user input and final agent answer. Current Amazon Bedrock Agents implementation doesn’t pass tool input and output through guardrails. For full coverage of vectors, agent developers can integrate with the ApplyGuardrail API call from within the action group AWS Lambda function to verify tool input and output.
Secure prompt engineering
System prompts play a very important role by guiding LLMs to answer the user query. The same prompt can also be used to instruct an LLM to identify prompt injections and help avoid the malicious instructions by constraining model behavior. In case of the reasoning and acting (ReAct) style orchestration strategy, secure prompt engineering can mitigate exploits from the surface vectors mentioned earlier in this post. As part of ReAct strategy, every observation is followed by another thought from the LLM. So, if our prompt is built in a secure way such that it can identify malicious exploits, then the Agents vectors are secured because LLMs sit at the center of this orchestration strategy, before and after an observation.
Amazon Bedrock Agents has shared a few sample prompts for Sonnet, Haiku, and Amazon Titan Text Premier models in the Agents Blueprints Prompt Library. You can use these prompts either through the AWS Cloud Development Kit (AWS CDK) with Agents Blueprints or by copying the prompts and overriding the default prompts for new or existing agents.
Using a nonce, which is a globally unique token, to delimit data boundaries in prompts helps the model to understand the desired context of sections of data. This way, specific instructions can be included in prompts to be extra cautious of certain tokens that are controlled by the user. The following example demonstrates setting <DATA>
and <nonce>
tags, which can have specific instructions for the LLM on how to deal with those sections:
Implementing verifiers using custom orchestration
Amazon Bedrock provides an option to customize an orchestration strategy for agents. With custom orchestration, agent developers can implement orchestration logic that is specific to their use case. This includes complex orchestration workflows, verification steps, or multistep processes where agents must perform several actions before arriving at a final answer.
To mitigate indirect prompt injections, you can invoke guardrails throughout your orchestration strategy. You can also write custom verifiers within the orchestration logic to check for unexpected tool invocations. Orchestration strategies like plan-verify-execute (PVE) have also been shown to be robust against indirect prompt injections for cases where agents are working in a constrained space and the orchestration strategy doesn’t need a replanning step. As part of PVE, LLMs are asked to create a plan upfront for solving a user query and then the plan is parsed to execute the individual actions. Before invoking an action, the orchestration strategy verifies if the action was part of the original plan. This way, no tool result could modify the agent’s course of action by introducing an unexpected action. Additionally, this technique doesn’t work in cases where the user prompt itself is malicious and is used in generation during planning. But that vector can be protected using Amazon Bedrock Guardrails with a multi-layered approach of mitigating this attack. Amazon Bedrock Agents provides a sample implementation of PVE orchestration strategy.
For more information, refer to Customize your Amazon Bedrock Agent behavior with custom orchestration.
Access control and sandboxing
Implementing robust access control and sandboxing mechanisms provides critical protection against indirect prompt injections. Apply the principle of least privilege rigorously by making sure that your Amazon Bedrock agents or tools only have access to the specific resources and actions necessary for their intended functions. This significantly reduces the potential impact if an agent is compromised through a prompt injection attack. Additionally, establish strict sandboxing procedures when handling external or untrusted content. Avoid architectures where the LLM outputs directly trigger sensitive actions without user confirmation or additional security checks. Instead, implement validation layers between content processing and action execution, creating security boundaries that help prevent compromised agents from accessing critical systems or performing unauthorized operations. This defense-in-depth approach creates multiple barriers that bad actors must overcome, substantially increasing the difficulty of successful exploitation.
Monitoring and logging
Establishing comprehensive monitoring and logging systems is essential for detecting and responding to potential indirect prompt injections. Implement robust monitoring to identify unusual patterns in agent interactions, such as unexpected spikes in query volume, repetitive prompt structures, or anomalous request patterns that deviate from normal usage. Configure real-time alerts that trigger when suspicious activities are detected, enabling your security team to investigate and respond promptly. These monitoring systems should track not only the inputs to your Amazon Bedrock agents, but also their outputs and actions, creating an audit trail that can help identify the source and scope of security incidents. By maintaining vigilant oversight of your AI systems, you can significantly reduce the window of opportunity for bad actors and minimize the potential impact of successful injection attempts. Refer to Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 2 in the AWS Machine Learning Blog for more details on logging and observability for Amazon Bedrock Agents. It’s important to store logs that contain sensitive data such as user prompts and model responses with all the required security controls according to your organizational standards.
Other standard application security controls
As mentioned earlier in the post, there is no single control that can remediate indirect prompt injections. Besides the multi-layered approach with the controls listed above, applications must continue to implement other standard application security controls, such as authentication and authorization checks before accessing or returning user data and making sure that the tools or knowledge bases contain only information from trusted sources. Controls such as sampling based validations for content in knowledge bases or tool responses, similar to the techniques detailed in Create random and stratified samples of data with Amazon SageMaker Data Wrangler, can be implemented to verify that the sources only contain expected information.
Conclusion
In this post, we’ve explored comprehensive strategies to safeguard your Amazon Bedrock Agents against indirect prompt injections. By implementing a multi-layered defense approach—combining secure prompt engineering, custom orchestration patterns, Amazon Bedrock Guardrails, user confirmation features in action groups, strict access controls with proper sandboxing, vigilant monitoring systems and authentication and authorization checks—you can significantly reduce your vulnerability.
These protective measures provide robust security while preserving the natural, intuitive interaction that makes generative AI so valuable. The layered security approach aligns with AWS best practices for Amazon Bedrock security, as highlighted by security experts who emphasize the importance of fine-grained access control, end-to-end encryption, and compliance with global standards.
It’s important to recognize that security isn’t a one-time implementation, but an ongoing commitment. As bad actors develop new techniques to exploit AI systems, your security measures must evolve accordingly. Rather than viewing these protections as optional add-ons, integrate them as fundamental components of your Amazon Bedrock Agents architecture from the earliest design stages.
By thoughtfully implementing these defensive strategies and maintaining vigilance through continuous monitoring, you can confidently deploy Amazon Bedrock Agents to deliver powerful capabilities while maintaining the security integrity your organization and users require. The future of AI-powered applications depends not just on their capabilities, but on our ability to make sure that they operate securely and as intended.
About the Authors
Hina Chaudhry is a Sr. AI Security Engineer at Amazon. In this role, she is entrusted with securing internal generative AI applications along with proactively influencing AI/Gen AI developer teams to have security features that exceed customer security expectations. She has been with Amazon for 8 years, serving in various security teams. She has more than 12 years of combined experience in IT and infrastructure management and information security.
Manideep Konakandla is a Senior AI Security engineer at Amazon where he works on securing Amazon generative AI applications. He has been with Amazon for close to 8 years and has over 11 years of security experience.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Bedrock Security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Sumanik Singh is a Software Developer engineer at Amazon Web Services (AWS) where he works on Amazon Bedrock Agents. He has been with Amazon for more than 6 years which includes 5 years experience working on Dash Replenishment Service. Prior to joining Amazon, he worked as an NLP engineer for a media company based out of Santa Monica. On his free time, Sumanik loves playing table tennis, running and exploring small towns in pacific northwest area.
Build scalable containerized RAG based generative AI applications in AWS using Amazon EKS with Amazon Bedrock
Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.
In this post, we demonstrate a solution using Amazon Elastic Kubernetes Service (EKS) with Amazon Bedrock to build scalable and containerized RAG solutions for your generative AI applications on AWS while bringing your unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.
Amazon EKS provides a scalable, secure, and cost-efficient environment for building RAG applications with Amazon Bedrock and also enables efficient deployment and monitoring of AI-driven workloads while leveraging Bedrock’s FMs for inference. It enhances performance with optimized compute instances, auto-scales GPU workloads while reducing costs via Amazon EC2 Spot Instances and AWS Fargate and provides enterprise-grade security via native AWS mechanisms such as Amazon VPC networking and AWS IAM.
Our solution uses Amazon S3 as the source of unstructured data and populates an Amazon OpenSearch Serverless vector database via the use of Amazon Bedrock Knowledge Bases with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.
Solution overview
The solution uses Amazon EKS managed node groups to automate the provisioning and lifecycle management of nodes (Amazon EC2 instances) for the Amazon EKS Kubernetes cluster. Every managed node in the cluster is provisioned as part of an Amazon EC2 Auto Scaling group that’s managed for you by EKS.
The EKS cluster consists of a Kubernetes deployment that runs across two Availability Zones for high availability where each node in the deployment hosts multiple replicas of a Bedrock RAG container image registered and pulled from Amazon Elastic Container Registry (ECR). This setup makes sure that resources are used efficiently, scaling up or down based on the demand. The Horizontal Pod Autoscaler (HPA) is set up to further scale the number of pods in our deployment based on their CPU utilization.
The RAG Retrieval Application container uses Bedrock Knowledge Bases APIs and Anthropic’s Claude 3.5 Sonnet LLM hosted on Bedrock to implement a RAG workflow. The solution provides the end user with a scalable endpoint to access the RAG workflow using a Kubernetes service that is fronted by an Amazon Application Load Balancer (ALB) provisioned via an EKS ingress controller.
The RAG Retrieval Application container orchestrated by EKS enables RAG with Amazon Bedrock by enriching the generative AI prompt received from the ALB endpoint with data retrieved from an OpenSearch Serverless index that is synced via Bedrock Knowledge Bases from your company-specific data uploaded to Amazon S3.
The following architecture diagram illustrates the various components of our solution:
Prerequisites
Complete the following prerequisites:
- Ensure model access in Amazon Bedrock. In this solution, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
- Install the AWS Command Line Interface (AWS CLI).
- Install Docker.
- Install Kubectl.
- Install Terraform.
Deploy the solution
The solution is available for download on the GitHub repo. Cloning the repository and using the Terraform template will provision the components with their required configurations:
- Clone the Git repository:
- From the
terraform
folder, deploy the solution using Terraform:
Configure EKS
- Configure a secret for the ECR registry:
- Navigate to the
kubernetes/ingress
folder:- Make sure that the
AWS_Region
variable in thebedrockragconfigmap.yaml
file points to your AWS region. - Replace the image URI in line 20 of the
bedrockragdeployment.yaml
file with the image URI of yourbedrockrag
image from your ECR repository.
- Make sure that the
- Provision the EKS deployment, service and ingress:
Create a knowledge base and upload data
To create a knowledge base and upload data, follow these steps:
- Create an S3 bucket and upload your data into the bucket. In our blog post, we uploaded these two files, Amazon Bedrock User Guide and the Amazon FSx for ONTAP User Guide, into our S3 bucket.
- Create an Amazon Bedrock knowledge base. Follow the steps here to create a knowledge base. Accept all the defaults including using the Quick create a new vector store option in Step 7 of the instructions that creates an Amazon OpenSearch Serverless vector search collection as your knowledge base.
- In Step 5c of the instructions to create a knowledge base, provide the S3 URI of the object containing the files for the data source for the knowledge base
- Once the knowledge base is provisioned, obtain the Knowledge Base ID from the Bedrock Knowledge Bases console for your newly created knowledge base.
Query using the Application Load Balancer
You can query the model directly using the API front end provided by the AWS ALB provisioned by the Kubernetes (EKS) Ingress Controller. Navigate to the AWS ALB console and obtain the DNS name for your ALB to use as your API:
Cleanup
To avoid recurring charges, clean up your account after trying the solution:
- From the terraform folder, delete the Terraform template for the solution:
terraform apply --destroy
- Delete the Amazon Bedrock knowledge base. From the Amazon Bedrock console, select the knowledge base you created in this solution, select Delete, and follow the steps to delete the knowledge base.
Conclusion
In this post, we demonstrated a solution that uses Amazon EKS with Amazon Bedrock and provides you with a framework to build your own containerized, automated, scalable, and highly available RAG-based generative AI applications on AWS. Using Amazon S3 and Amazon Bedrock Knowledge Bases, our solution automates bringing your unstructured user file data to Amazon Bedrock within the containerized framework. You can use the approach demonstrated in this solution to automate and containerize your AI-driven workloads while using Amazon Bedrock FMs for inference with built-in efficient deployment, scalability, and availability from a Kubernetes-based containerized deployment.
For more information about how to get started building with Amazon Bedrock and EKS for RAG scenarios, refer to the following resources:
- Amazon Bedrock Workshop GitHub repo
- Amazon EKS Workshop
- Build RAG-based generative AI applications in AWS using Amazon Bedrock and Amazon FSx for NetApp ONTAP
About the Authors
Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for AWS customers and partners. Kanishk specializes in containers, cloud operations, migrations and modernizations, AI/ML, resilience and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.
Sandeep Batchu is a Senior Security Architect at Amazon Web Services, with extensive experience in software engineering, solutions architecture, and cybersecurity. Passionate about bridging business outcomes with technological innovation, Sandeep guides customers through their cloud journey, helping them design and implement secure, scalable, flexible, and resilient cloud architectures.
How Hexagon built an AI assistant using AWS generative AI services
This post was co-written with Julio P. Roque Hexagon ALI.
Recognizing the transformative benefits of generative AI for enterprises, we at Hexagon’s Asset Lifecycle Intelligence division sought to enhance how users interact with our Enterprise Asset Management (EAM) products. Understanding these advantages, we partnered with AWS to embark on a journey to develop HxGN Alix, an AI-powered digital worker using AWS generative AI services. This blog post explores the strategy, development, and implementation of HxGN Alix, demonstrating how a tailored AI solution can drive efficiency and enhance user satisfaction.
Forming a generative AI strategy: Security, accuracy, and sustainability
Our journey to build HxGN Alix was guided by a strategic approach focused on customer needs, business requirements, and technological considerations. In this section, we describe the key components of our strategy.
Understanding consumer generative AI and enterprise generative AI
Generative AI serves diverse purposes, with consumer and enterprise applications differing in scope and focus. Consumer generative AI tools are designed for broad accessibility, enabling users to perform everyday tasks such as drafting content, generating images, or answering general inquiries. In contrast, enterprise generative AI is tailored to address specific business challenges, including scalability, security, and seamless integration with existing workflows. These systems often integrate with enterprise infrastructures, prioritize data privacy, and use proprietary datasets to provide relevance and accuracy. This customization allows businesses to optimize operations, enhance decision-making, and maintain control over their intellectual property.
Commercial compared to open source LLMs
We used multiple evaluation criteria, as illustrated in the following figure, to determine whether to use a commercial or open source large language model (LLM).
The evaluation criteria are as follows:
- Cost management – Help avoid unpredictable expenses associated with LLMs.
- Customization – Tailor the model to understand domain-specific terminology and context.
- Intellectual property and licensing – Maintain control over data usage and compliance.
- Data privacy – Uphold strict confidentiality and adherence to security requirements.
- Control over the model lifecycle – By using open source LLMs, we’re able to control the lifecycle of model customizations based on business needs. This control makes sure updates, enhancements, and maintenance of the model are aligned with evolving business objectives without dependency on third-party providers.
The path to the enterprise generative AI: Crawl, walk, run
By adopting a phased approach (as shown in the following figure), we were able to manage development effectively. Because the technology is new, it was paramount to carefully build the right foundation for adoption of generative AI across different business units.
The phases of the approach are:
- Crawl – Establish foundational infrastructure with a focus on data privacy and security. This phase focused on establishing a secure and compliant foundation to enable the responsible adoption of generative AI. Key priorities included implementing guardrails around security, compliance, and data privacy, making sure that customer and enterprise data remained protected within well-defined access controls. Additionally, we focused on capacity management and cost governance, making sure that AI workloads operated efficiently while maintaining financial predictability. This phase was critical in setting up the necessary policies, monitoring mechanisms, and architectural patterns to support long-term scalability.
- Walk – Integrate customer-specific data to enhance relevance while maintaining tenant-level security. With a solid foundation in place, we transitioned from proof of concept to production-grade implementations. This phase was characterized by deepening our technical expertise, refining operational processes, and gaining real-world experience with generative AI models. As we integrated domain-specific data to improve relevance and usability, we continued to reinforce tenant-level security to provide proper data segregation. The goal of this phase was to validate AI-driven solutions in real-world scenarios, iterating on workflows, accuracy, and optimizing performance for production deployment.
- Run – Develop high-value use cases tailored to customer needs, enhancing productivity and decision-making. Using the foundations established in the walk phase, we moved toward scaling development across multiple teams in a structured and repeatable manner. By standardizing best practices and development frameworks, we enabled different products to adopt AI capabilities efficiently. At this stage, we focused on delivering high-value use cases that directly enhanced customer productivity, decision-making, and operational efficiency.
Identifying the right use case: Digital worker
A critical part of our strategy was identifying a use case that would offer the best return on investment (ROI), depicted in the following figure. We pinpointed the development of a digital worker as an optimal use case because of its potential to:
- Enhance productivity – Recognizing that the productivity of any AI solution lies in a digital worker capable of handling advanced and nuanced domain-specific tasks
- Improve efficiency – Automate routine tasks and streamline workflows
- Enhance user experience – Provide immediate, accurate responses to user inquiries
- Support high security environments – Operate within stringent security parameters required by clients
By focusing on a digital worker, we aimed to deliver significant value to both internal teams and end-users.
Introducing Alix: A digital worker for asset lifecycle intelligence
HxGN Alix is our AI-powered chat assistant designed to act as a digital worker to revolutionize user interaction with EAM products. Developed to operate securely within high-security environments, HxGN Alix serves multiple functions:
- Streamline information access – Provide users with quick, accurate answers, alleviating the need to navigate extensive PDF manuals
- Enhance internal workflows – Assist Customer Success managers and Customer Support teams with efficient information retrieval
- Improve customer satisfaction – Offer EAM end-users an intuitive tool to engage with, thereby elevating their overall experience
By delivering a tailored, AI-driven approach, HxGN Alix addresses specific challenges faced by our clients, transforming the user experience while upholding stringent security standards.
Understanding system needs to guide technology selection
Before selecting the appropriate technology stack for HxGN Alix, we first identified the high-level system components and expectations of our AI assistant infrastructure. Through this process, we made sure that we understood the core components required to build a robust and scalable solution. The following figure illustrates the core components that we identified.
The non-functional requirements are:
- Regional failover – Maintain system resilience by providing the ability to fail over seamlessly in case of Regional outages, maintaining service availability.
- Model lifecycle management – Establish a reliable mechanism for customizing and deploying machine learning models.
- LLM hosting – Host the AI models in an environment that provides stability, scalability, and adheres to our high-security requirements.
- Multilingual capabilities – Make sure that the assistant can communicate effectively in multiple languages to cater to our diverse user base.
- Safety tools – Incorporate safeguards to promote safe and responsible AI use, particularly with regard to data protection and user interactions.
- Data storage – Provide secure storage solutions for managing product documentation and user data, adhering to industry security standards.
- Retrieval Augmented Generation (RAG) – Enhance the assistant’s ability to retrieve relevant information from stored documents, thereby improving response accuracy and providing grounded answers.
- Text embeddings – Use text embeddings to represent and retrieve relevant data, making sure that high-accuracy retrieval tasks are efficiently managed.
Choosing the right technology stack
To develop HxGN Alix, we selected a combination of AWS generative AI services and complementary technologies, focusing on scalability, customization, and security. We finalized the following architecture to serve our technical needs.
The AWS services include:
- Amazon Elastic Kubernetes Service (Amazon EKS) – We used Amazon EKS for compute and model deployment. It facilitates efficient deployment and management of Alix’s models, providing high availability and scalability. We were able to use our existing EKS cluster, which already had the required safety, manageability, and integration with our DevOps environment. This allowed for seamless integration and used existing investments in infrastructure and tooling.
- Amazon Elastic Compute Cloud (Amazon EC2) G6e instances – AWS provides comprehensive, secure, and cost-effective AI infrastructure. We selected G6e.48xlarge instances powered by NVIDIA L40S GPUs—the most cost-efficient GPU instances for deploying generative AI models under 12 billion parameters.
- Mistral NeMo – We chose Mistral NeMo, a 12-billion parameter open source LLM built in collaboration with NVIDIA and released under the Apache 2.0 license. Mistral NeMo offers a large context window of up to 128,000 tokens and is designed for global, multilingual applications. It’s optimized for function calling and performs strongly in multiple languages, including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. The model’s multilingual capabilities and optimization for function calling aligned well with our needs.
- Amazon Bedrock Guardrails – Amazon Bedrock Guardrails provides a comprehensive framework for enforcing safety and compliance within AI applications. It enables the customization of filtering policies, making sure that AI-generated responses align with organizational standards and regulatory requirements. With built-in capabilities to detect and mitigate harmful content, Amazon Bedrock Guardrails enhances user trust and safety while maintaining high performance in AI deployments. This service allows us to define content moderation rules, restrict sensitive topics, and establish enterprise-level security for generative AI interactions.
- Amazon Simple Storage Service (Amazon S3) – Amazon S3 provides secure storage for managing product documentation and user data, adhering to industry security standards.
- Amazon Bedrock Knowledge Bases – Amazon Bedrock Knowledge Bases enhances Alix’s ability to retrieve relevant information from stored documents, improving response accuracy. This service stood out as a managed RAG solution, handling the heavy lifting and enabling us to experiment with different strategies and solve complex challenges efficiently. More on this is discussed in the development journey.
- Amazon Bedrock – We used Amazon Bedrock as a fallback solution to handle Regional failures. In the event of zonal or outages, the system can fall back to the Mistral 7B model using Amazon Bedrock multi- Region endpoints, maintaining uninterrupted service.
- Amazon Bedrock Prompt Management – This feature of Amazon Bedrock simplifies the creation, evaluation, versioning, and sharing of prompts within the engineering team to get the best responses from foundation models (FMs) for our use cases.
The development journey
We embarked on the development of HxGN Alix through a structured, phased approach.
The proof of concept
We initiated the project by creating a proof of concept to validate the feasibility of an AI assistant tailored for secure environments. Although the industry has seen various AI assistants, the primary goal of the proof of concept was to make sure that we could develop a solution while adhering to our high security standards, which required full control over the manageability of the solution.
During the proof of concept, we scoped the project to use an off-the-shelf NeMo model deployed on our existing EKS cluster without integrating internal knowledge bases. This approach helped us verify the ability to integrate the solution with existing products, control costs, provide scalability, and maintain security—minimizing the risk of late-stage discoveries.
After releasing the proof of concept to a small set of internal users, we identified a healthy backlog of work items that needed to go live, including enhancements in security, architectural improvements, network topology adjustments, prompt management, and product integration.
Security enhancements
To adhere to the stringent security requirements of our customers, we used the secure infrastructure provided by AWS. With models deployed in our existing production EKS environment, we were able to use existing tooling for security and monitoring. Additionally, we used isolated private subnets to make sure that code interacting with models wasn’t connected to the internet, further enhancing information protection for users.
Because user interactions are in free-text format and users might input content including personally identifiable information (PII), it was critical not to store any user interactions in any format. This approach provided complete confidentiality of AI use, adhering to strict data privacy standards.
Adjusting response accuracy
During the proof of concept, it became clear that integrating the digital worker with our products was essential. Base models had limited knowledge of our products and often produced hallucinations. We had to choose between pretraining the model with internal documentation or implementing RAG. RAG became the obvious choice for the following reasons:
- We were in the early stages of development and didn’t have enough data to pre-train our models
- RAG helps ground the model’s responses in accurate context by retrieving relevant information, reducing hallucinations
Implementing a RAG system presented its own challenges and required experimentation. Key challenges are depicted in the following figure.
These challenges include:
- Destruction of context when chunking documents – The first step in RAG is to chunk documents to transform them into vectors for meaningful text representation. However, applying this method to tables or complex structures risks losing relational data, which can result in critical information not being retrieved, causing the LLM to provide inaccurate answers. We evaluated various strategies to preserve context during chunking, verifying that important relationships within the data were maintained. To address this, we used the hierarchical chunking capability of Amazon Bedrock Knowledge Bases, which helped us preserve the context in the final chunk.
- Handling documents in different formats – Our product documentation, accumulated over decades, varied greatly in format. The presence of non-textual elements, such as tables, posed significant challenges. Tables can be difficult to interpret when directly queried from PDFs or Word documents. To address this, we normalized and converted these documents into consistent formats suitable for the RAG system, enhancing the model’s ability to retrieve and interpret information accurately. We used the FM parsing capability of Amazon Bedrock Knowledge Bases, which processed the raw document with an LLM before creating a final chunk, verifying that data from non-textual elements was also correctly interpreted.
- Handling LLM boundaries – User queries sometimes exceed the system’s capabilities, such as when they request comprehensive information, like a complete list of product features. Because our documentation is split into multiple chunks, the retrieval system might not return all the necessary documents. To address this, we adjusted the system’s responses so the AI agent could provide coherent and complete answers despite limitations in the retrieved context. We created custom documents containing FAQs and special instructions for these cases and added them to the knowledge base. These acted as few-shot examples, helping the model produce more accurate and complete responses.
- Grounding responses – By nature, an LLM completes sentences based on probability, predicting the next word or phrase by evaluating patterns from its extensive training data. However, sometimes the output isn’t accurate or factually correct, a phenomenon often referred to as hallucination. To address this, we use a combination of specialized prompts along with contextual grounding checks from Amazon Bedrock Guardrails.
- Managing one-line conversation follow-ups – Users often engage in follow-up questions that are brief or context-dependent, such as “Can you elaborate?” or “Tell me more.” When processed in isolation by the RAG system, these queries might yield no results, making it challenging for the AI agent to respond effectively. To address this, we implemented mechanisms to maintain conversational context, enabling HxGN Alix to interpret and respond appropriately.
We tested two approaches:
- Prompt-based search reformulation – The LLM first identifies the user’s intent and generates a more complete query for the knowledge base. Although this requires an additional LLM call, it yields highly relevant results, keeping the final prompt concise.
- Context-based retrieval with chat history – We sent the last five messages from the chat history to the knowledge base, allowing broader results. This approach provided faster response times because it involved only one LLM round trip.
The first method worked better with large document sets by focusing on highly relevant results, whereas the second approach was more effective with a smaller, focused document set. Both methods have their pros and cons, and results vary based on the nature of the documents.
To address these challenges, we developed a pipeline of steps to receive accurate responses from our digital assistant.
The following figure summarizes our RAG implementation journey.
Adjusting the application development lifecycle
For generative AI systems, the traditional application development lifecycle requires adjustments. New processes are necessary to manage accuracy and system performance:
- Testing challenges – Unlike traditional code, generative AI systems can’t rely solely on unit tests. Prompts can return different results each time, making verification more complex.
- Performance variability – Responses from LLMs can vary significantly in latency, ranging from 1–60 seconds depending on the user’s query, unlike traditional APIs with predictable response times.
- Quality assurance (QA) – We had to develop new testing and QA methodologies to make sure that Alix’s responses were consistent and reliable.
- Monitoring and optimization – Continuous monitoring was implemented to track performance metrics and user interactions, allowing for ongoing optimization of the AI system.
Conclusion
The successful launch of HxGN Alix demonstrates the transformative potential of generative AI in enterprise asset management. By using AWS generative AI services and a carefully selected technology stack, we optimized internal workflows and elevated user satisfaction within secure environments. HxGN Alix exemplifies how a strategically designed AI solution can drive efficiency, enhance user experience, and meet the unique security needs of enterprise clients.
Our journey underscores the importance of a strategic approach to generative AI—balancing security, accuracy, and sustainability—while focusing on the right use case and technology stack. The success of HxGN Alix serves as a model for organizations seeking to use AI to solve complex information access challenges.
By using the right technology stack and strategic approach, you can unlock new efficiencies, improve user experience, and drive business success. Connect with AWS to learn more about how AI-driven solutions can transform your operations.
About the Authors
Julio P. Roque is an accomplished Cloud and Digital Transformation Executive and an expert at using technology to maximize shareholder value. He is a strategic leader who drives collaboration, alignment, and cohesiveness across teams and organizations worldwide. He is multilingual, with an expert command of English and Spanish, understanding of Portuguese, and cultural fluency of Japanese.
Manu Mishra is a Senior Solutions Architect at AWS, specializing in artificial intelligence, data and analytics, and security. His expertise spans strategic oversight and hands-on technical leadership, where he reviews and guides the work of both internal and external customers. Manu collaborates with AWS customers to shape technical strategies that drive impactful business outcomes, providing alignment between technology and organizational goals.
Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
How Reasoning AI Agents Transform High-Stakes Decision Making
AI agents powered by large language models (LLMs) have grown past their FAQ chatbot beginnings to become true digital teammates capable of planning, reasoning and taking action — and taking in corrective feedback along the way.
Thanks to reasoning AI models, agents can learn how to think critically and tackle complex tasks. This new class of “reasoning agents” can break down complicated problems, weigh options and make informed decisions — while using only as much compute and as many tokens as needed.
Reasoning agents are making a splash in industries where decisions rely on multiple factors. Such industries range from customer service and healthcare to manufacturing and financial services.
Reasoning On vs. Reasoning Off
Modern AI agents can toggle reasoning on and off, allowing them to efficiently use compute and tokens.
A full chain‑of‑thought pass performed during reasoning can take up to 100x more compute and tokens than a quick, single‑shot reply — so it should only be used when needed. Think of it like turning on headlights — switching on high beams only when it’s dark and turning them back to low when it’s bright enough out.
Single-shot responses are great for simple queries — like checking an order number, resetting a password or answering a quick FAQ. Reasoning might be needed for complex, multistep tasks such as reconciling tax depreciation schedules or orchestrating the seating at a 120‑guest wedding.
New NVIDIA Llama Nemotron models, featuring advanced reasoning capabilities, expose a simple system‑prompt flag to enable or disable reasoning, so developers can programmatically decide per query. This allows agents to perform reasoning only when the stakes demand it — saving users wait times and minimizing costs.
Reasoning AI Agents in Action
Reasoning AI agents are already being used for complex problem-solving across industries, including:
- Healthcare: Enhancing diagnostics and treatment planning.
- Customer Service: Automating and personalizing complex customer interactions, from resolving billing disputes to recommending tailored products.
- Finance: Autonomously analyzing market data and providing investment strategies.
- Logistics and Supply Chain: Optimizing delivery routes, rerouting shipments in response to disruptions and simulating possible scenarios to anticipate and mitigate risks.
- Robotics: Powering warehouse robots and autonomous vehicles, enabling them to plan, adapt and safely navigate dynamic environments.
Many customers are already experiencing enhanced workflows and benefits using reasoning agents.
Amdocs uses reasoning-powered AI agents to transform customer engagement for telecom operators. Its amAIz GenAI platform, enhanced with advanced reasoning models such as NVIDIA Llama Nemotron and amAIz Telco verticalization, enables agents to autonomously handle complex, multistep customer journeys — spanning customer sales, billing and care.
EY is using reasoning agents to significantly improve the quality of responses to tax-related queries. The company compared generic models to tax-specific reasoning models, which revealed up to an 86% improvement in response quality for tax questions when using a reasoning approach.
SAP’s Joule agents — which will be equipped with reasoning capabilities from Llama Nemotron –– can interpret complex user requests, surface relevant insights from enterprise data and execute cross-functional business processes autonomously.
Designing an AI Reasoning Agent
A few key components are required to build an AI agent, including tools, memory and planning modules. Each of these components augments the agent’s ability to interact with the outside world, create and execute detailed plans, and otherwise act semi- or fully autonomously.
Reasoning capabilities can be added to AI agents at various places in the development process. The most natural way to do so is by augmenting planning modules with a large reasoning model, like Llama Nemotron Ultra or DeepSeek-R1. This allows more time and reasoning effort to be used during the initial planning phase of the agentic workflow, which has a direct impact on the overall outcomes of systems.
The AI-Q NVIDIA AI Blueprint and the NVIDIA Agent Intelligence toolkit can help enterprises break down silos, streamline complex workflows and optimize agentic AI performance at scale.
The AI-Q blueprint provides a reference workflow for building advanced agentic AI systems, making it easy to connect to NVIDIA accelerated computing, storage and tools for high-accuracy, high-speed digital workforces. AI-Q integrates fast multimodal data extraction and retrieval using NVIDIA NeMo Retriever, NIM microservices and AI agents.
In addition, the open-source NVIDIA Agent Intelligence toolkit enables seamless connectivity between agents, tools and data. Available on GitHub, this toolkit lets users connect, profile and optimize teams of AI agents, with full system traceability and performance profiling to identify inefficiencies and improve outcomes. It’s framework-agnostic, simple to onboard and can be integrated into existing multi-agent systems as needed.
Build and Test Reasoning Agents With Llama Nemotron
Learn more about Llama Nemotron, which recently was at the top of industry benchmark leaderboards for advanced science, coding and math tasks. Join the community shaping the future of agentic, reasoning-powered AI.
Plus, explore and fine-tune using the open Llama Nemotron post-training dataset to build custom reasoning agents. Experiment with toggling reasoning on and off to optimize for cost and performance.
And test NIM-powered agentic workflows, including retrieval-augmented generation and the NVIDIA AI Blueprint for video search and summarization, to quickly prototype and deploy advanced AI solutions.