Accelerate generative AI inference with NVIDIA Dynamo and Amazon EKS

This post is co-written with Kshitiz Gupta, Wenhan Tan, Arun Raman, Jiahong Liu, and Eliuth Triana Isaza from NVIDIA.

As large language models (LLMs) and generative AI applications become increasingly prevalent, the demand for efficient, scalable, and low-latency inference solutions has grown. Traditional inference systems often struggle to meet these demands, especially in distributed, multi-node environments. NVIDIA Dynamo (no relation to Amazon DynamoDB) is an open source inference framework designed to address these challenges, offering innovative solutions to optimize performance and scalability. It supports AWS services such as Amazon Simple Storage Service (Amazon S3), Elastic Fabric Adapter (EFA), and Amazon Elastic Kubernetes Service (Amazon EKS), and can be deployed on NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) instances, including P6 instances accelerated by NVIDIA Blackwell.

This post introduces NVIDIA Dynamo and explains how to set it up on Amazon EKS for automated scaling and streamlined Kubernetes operations. We provide a hands-on walkthrough, which uses the NVIDIA Dynamo blueprint on the AI on EKS GitHub repo by AWS Labs to provision the infrastructure, configure monitoring, and install the NVIDIA Dynamo operator.

NVIDIA Dynamo: A low-latency distributed inference framework

Designed to be inference-engine agnostic, NVIDIA Dynamo supports TRT-LLM, vLLM, SGLang, and other runtimes. It boosts LLM performance by splitting prefill and decode phases to maximize GPU throughput, dynamically scheduling GPU resources, routing requests to avoid KV cache recomputation, accelerating data transfer with the low-latency NIXL library, and efficiently offloading KV cache across memory hierarchies for higher overall system throughput.

NVIDIA Dynamo is fully open source and has a modular design, so developers can pick and choose the inference serving components, frontend API servers, and inference data transfer libraries that suit their unique needs, facilitating compatibility with your existing AI stack and avoiding costly migration efforts.

To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features, as illustrated in the following figure:

Figure 1: Dynamo’s high-level architecture shows the main features.

In the following sections, we explore the main features of the architecture in more detail.

Disaggregated prefill and decode phases

LLM inference involves two distinct phases: the token-parallel prefill phase (processing the input to generate the first token) and the autoregressive decode phase (generating subsequent tokens). Workloads such as Retrieval Augmented Generation (RAG), which have long inputs and short outputs, and reasoning workloads, which have shorter inputs and long outputs, place different demands on these phases. Traditional LLM systems co-locate the two phases on the same GPU, leading to resource contention and imbalances.

NVIDIA Dynamo addresses this by disaggregating the prefill and decode phases across different GPUs or nodes, allowing each to be optimized independently. For example, using higher tensor parallelism for the autoregressive decode phase and lower tensor parallelism for the token-parallel prefill phase lets both phases run efficiently, improving scalability and performance. In addition, for requests with long inputs, separating the prefill phase onto dedicated prefill engines allows ongoing decode requests to proceed without being blocked by these long prefills.

The following diagram illustrates how disaggregated serving separates prefill and decode on different GPUs to optimize performance.

Figure 2. Disaggregated serving separates prefill and decode on different GPUs to optimize performance.

NVIDIA Dynamo Planner

The NVIDIA Dynamo Planner tackles the challenge of managing GPU resources in dynamic LLM inference environments, where static allocation falters against fluctuating demand and diverse request types (like varying input/output sequence lengths). The NVIDIA Dynamo Planner continuously monitors real-time signals such as request rates, sequence lengths, GPU capacity, and queue wait times. Based on this, it intelligently decides how to best utilize resources—determining whether to serve requests using disaggregated prefill and decode phases or a traditional aggregated approach—and dynamically adjusts the number and type of workers assigned to each phase. For instance, if the NVIDIA Dynamo Planner detects an increase in requests with long input sequences, it can automatically scale up prefill workers to meet heightened demand.

The following diagram illustrates this configuration.

Figure 3: The Dynamo Planner combines prefill and decode specific metrics with SLAs to scale GPUs up and down in disaggregated setups, which ensures optimal GPU utilization.

This dynamic optimization makes sure GPU resources are allocated efficiently across prefill and decode tasks without requiring system downtime. By considering application SLOs (like Time To First Token and Inter-Token Latency) and the costs of KV Cache transfer, the NVIDIA Dynamo Planner makes informed decisions to reallocate GPUs or change serving strategy to alleviate bottlenecks and adapt to further workload shifts. This adaptability allows the system to maintain optimal throughput, handle demand spikes effectively, and deliver peak performance across large-scale distributed deployments.

NVIDIA Dynamo Smart Router

Before responding to a user’s prompt, LLMs must build a contextual understanding of the input request known as the KV cache. The NVIDIA Dynamo Smart Router efficiently minimizes KV cache recomputation by tracking KV cache entries across GPUs in large, multi-node, and disaggregated deployments. When new requests arrive, it directs them to workers that already possess relevant cached data. This is particularly advantageous in use cases where the same request is frequently executed, such as system prompts, single-user multi-turn AI assistant interactions, and agentic workflows.

The following diagram illustrates this configuration.

Figure 4: Dynamo Smart Router helps reduce unnecessary computation

To achieve this, the NVIDIA Dynamo Smart Router calculates an overlap score between an incoming request and the KV cache blocks active across the entire distributed GPU cluster. By considering this score along with the current workload distribution, it intelligently routes each request to the most suitable worker. This process not only minimizes unnecessary recomputation and reduces inference time, but also frees up valuable GPU resources and helps maintain a balanced load across the cluster for optimal performance.

NVIDIA Dynamo KV Cache Block Manager

The NVIDIA Dynamo KV Block Manager addresses the significant cost challenge of storing ever-growing volumes of KV cache directly in expensive GPU high-bandwidth memory (HBM). Although reusing the KV cache is vital for minimizing recomputation and enhancing inference performance, holding extensive cache history in GPU memory becomes prohibitively costly.

To solve this, the NVIDIA Dynamo KV Block Manager implements tiered offloading, intelligently moving older, less frequently accessed, or lower-priority KV cache blocks from fast HBM to more cost-effective storage tiers. These can include shared CPU memory, local SSDs, or networked object storage. This hierarchical strategy helps organizations manage and store significantly larger volumes of KV cache—potentially petabytes—at a fraction of the traditional cost. By freeing up valuable GPU memory while still enabling the reuse of historical KV cache, this feature optimizes resource utilization, supports sustained performance, and improves overall economic efficiency.

Accelerated data transfer with NVIDIA NIXL

High-performance disaggregated serving and efficient KV cache offloading critically depend on ultra-fast, low-latency data transfer between GPUs and across diverse memory or storage tiers. Navigating the complexities of different hardware, network protocols (like EFA), and storage systems to achieve this vital data movement is a major hurdle, often leading to integration challenges and performance bottlenecks in distributed AI deployments.

NVIDIA NIXL is NVIDIA Dynamo’s specialized communication library designed to conquer these data transfer challenges. It provides a high-throughput, low-latency communication backbone through a unified, asynchronous API. NIXL intelligently abstracts the underlying complexity, supporting diverse backends like GPUDirect Storage (GDS), UCX, and Amazon S3, and automatically selects the optimal data path over interconnects such as NVLink or EFA. This drastically simplifies development and promotes accelerated, efficient KV cache movement, which is essential for minimizing latency and maximizing the performance of generative AI models.

The following diagram illustrates the lifecycle of a request under disaggregated inference in NVIDIA Dynamo.

Figure 5: Lifecycle of a request under disaggregated inference in NVIDIA Dynamo.

This diagram shows the decode to prefill flow, but NVIDIA Dynamo also supports the prefill to decode flow. Refer to the NVIDIA Dynamo documentation to dive deeper into the lifecycle of a request under disaggregated inference.

This architecture enables efficient distributed inference by separating the token-parallel prefill phase from the latency-sensitive decode phase, while achieving high throughput through zero-copy GPU transfers and intelligent resource management. Running inference in production isn’t just about performance—you also need logging, monitoring, security, and more. NVIDIA Dynamo on AWS infrastructure is therefore an attractive option: customers can use self-managed AI/ML services; Amazon Elastic File System (Amazon EFS) or Amazon FSx for Lustre for model storage; Network Load Balancer for access; Amazon CloudWatch or Prometheus for observability; AWS Identity and Access Management (IAM), virtual private cloud (VPC) isolation, and security groups for security; and EFA for low-latency inter-node communication (critical for prefill and decode disaggregation).

Solution overview

Amazon EKS is a fully managed Kubernetes service that helps you run Kubernetes seamlessly in both the AWS Cloud and on-premises data centers. It manages the availability and scalability of the Kubernetes control plane and provides compute node automatic scaling and lifecycle management support to help you run highly available containerized applications.

Amazon EKS is an ideal platform for running distributed multi-node inference workloads because of its robust integration with AWS services and performance features. It integrates seamlessly with Amazon EFS, a fully managed shared file system, enabling fast data access and management across nodes using Kubernetes Persistent Volume Claims (PVCs). Efficient multi-node LLM inference also requires enhanced network performance. Elastic Fabric Adapter (EFA), a network interface that offers low-latency, high-throughput connectivity between Amazon EC2 accelerated instances, is well suited for this.

In the following sections, we discuss some of the features of Amazon EKS used in this post.

Automatic scaling

Karpenter adds just-in-time provisioning based on the actual pod specification, such as GPU count, ENI capacity, and Amazon Elastic Block Store (Amazon EBS) throughput, so NVIDIA Dynamo can burst new G6 instances in less than 60 seconds when the Smart Router queues fill.
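
As an illustration, a Karpenter NodePool along the following lines lets the cluster burst onto G6 capacity on demand. This is a minimal sketch rather than the blueprint’s actual configuration; the NodePool name, the EC2NodeClass reference, the taint, and the exact API version all depend on the Karpenter release installed in your cluster:

```bash
# Minimal, illustrative NodePool for bursting onto G6 GPU instances (names and schema are assumptions)
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-g6
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 8
EOF
```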

Flexible worker nodes and GPU support

Amazon EKS supports different EC2 instance families. You can mix CPU (M5), accelerated compute (G6, P5, P6), and Arm-based AWS Graviton nodes in one cluster. To use GPUs in Amazon EKS, you can use the Amazon EKS optimized GPU AMIs, which ship with the correct NVIDIA driver and NVIDIA Container Toolkit versions preinstalled, so pods get /dev/nvidia* devices immediately. The Bottlerocket variant with the NVIDIA add-on gives you an immutable, minimal operating system to get started quickly with NVIDIA Dynamo.
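
A quick way to confirm that the driver and device plugin are healthy is to check that each GPU node advertises the nvidia.com/gpu resource as allocatable:

```bash
# List nodes with the number of allocatable NVIDIA GPUs each one advertises
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```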

Storage integrations for large models

Amazon EKS provides Container Storage Interface (CSI) drivers for AWS storage options such as Amazon EBS, Amazon EFS, and Amazon FSx for Lustre.

We need a shared file system across all the pods that can store and load the model weights quickly, so we use Amazon EFS in this post. Amazon FSx for Lustre is a high-performance shared file system that is more appropriate for workloads that require sub-millisecond latency and bursty throughput, like ML training jobs. In our case, we don’t need that level of performance, because we only download and load the model weights when spinning up or scaling out an NVIDIA Dynamo deployment.
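
For reference, a ReadWriteMany persistent volume claim backed by the EFS CSI driver looks roughly like the following. The storage class name efs-sc, the claim name, and the namespace are assumptions; use whatever the blueprint actually provisions:

```bash
# Illustrative EFS-backed PVC for sharing model weights across pods (names are assumptions)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: dynamo-cloud
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
EOF
```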

EFA support

Amazon EKS offers VPC networking features, including EFA for low latency. EFA is used for communication between the GPU nodes within a single Availability Zone and is accessed through the Libfabric provider. Refer to Supported interfaces and libraries for supported versions of MPI and NCCL.
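
When the EFA device plugin is installed, EFA interfaces appear as an allocatable extended resource that pods can request. You can verify this per node:

```bash
# Check how many EFA interfaces each node exposes through the EFA device plugin
kubectl get nodes -o custom-columns='NODE:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa'
```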

Architecture

The example in this post uses the Kubernetes deployment shown in the following AWS deployment architecture diagram. We provision a VPC and associated resources, including subnets (public and private), NAT gateways, and an internet gateway. We create an EKS cluster and, with the help of Karpenter, launch one G6 instance to run the model, plus a CPU node group with one c7i.16xlarge instance to run the NVIDIA Dynamo store, frontend, and router pods. We use a single Availability Zone to deploy these nodes.

Figure 6: AWS deployment architecture.

In the following sections, we walk you through the deployment of the DeepSeek-R1-Distill-Llama-8B model with disaggregated serving and the NVIDIA Dynamo KV Smart Router on Amazon EC2 accelerated GPU instances using Amazon EKS.

For comprehensive step-by-step instructions, detailed configuration options, and troubleshooting guidance, see the complete NVIDIA Dynamo on Amazon EKS documentation.

Prerequisites

To implement this solution, you must have the AWS Command Line Interface (AWS CLI), kubectl, helm, terraform, Docker, earthly, and Python 3.10+ installed. An EC2 instance (t3.xlarge or higher) with Amazon EKS and Amazon Elastic Container Registry (Amazon ECR) permissions is recommended as your setup host.

See the full prerequisites list for installation commands.
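
As a quick sanity check before you start, confirm the required tools are installed and on your PATH (output will vary by version):

```bash
# Verify the prerequisite tooling is available
aws --version
kubectl version --client
helm version --short
terraform version
docker --version
earthly --version
python3 --version
```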

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the repository and navigate to it:
```bash
git clone https://github.com/awslabs/ai-on-eks.git && cd ai-on-eks
```
  2. Deploy the infrastructure and platform:
```bash
cd infra/nvidia-dynamo
./install.sh
```

This single command provisions your complete environment: VPC, EKS cluster with GPU nodes, ECR repositories, monitoring stack, and the NVIDIA Dynamo platform components (Operator, API Store, NATS, PostgreSQL, MinIO). The EKS cluster and node creation process can take 15–30 minutes to complete.
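
Once the cluster is up and your kubeconfig points at it, you can verify the environment from another terminal. The dynamo-cloud namespace shown here is the one used by the platform components later in this post:

```bash
# Confirm worker nodes have joined and the Dynamo platform pods are running
kubectl get nodes -L node.kubernetes.io/instance-type
kubectl get pods -n dynamo-cloud
```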

  3. Build base images:
```bash
cd blueprints/inference/nvidia-dynamo
source dynamo_env.sh   # Generated by install.sh with AWS credentials and other env variables
./build-base-image.sh vllm --push
```
  4. Deploy inference graphs:
```bash
./deploy.sh
```

The deployment of the LLM inference graph takes a few minutes. You can monitor the pod that is building the inference graph (buildkitd or kaniko). When the build is complete, you will see additional pods prefixed with the deployment name. You can then run the test script to validate the deployment.
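
For example, you can follow the build with standard kubectl commands (the pod name below is a placeholder):

```bash
# Watch pods in the Dynamo namespace as the inference graph builds and the workers start
kubectl get pods -n dynamo-cloud --watch

# Tail logs from the image build pod (replace with the actual buildkitd or kaniko pod name)
kubectl logs -f <build-pod-name> -n dynamo-cloud
```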

Choose your preferred LLM architecture through the interactive menu (agg, disagg, multinode, and so on). For more information, refer to the LLM architectures in the GitHub repo.

Test and validate the solution

Use the following code to validate the platform health and run inference tests:

```bash
./test.sh
```

The test.sh script starts a port forward to the frontend service of the deployment and requests the health check, metrics, and /v1/models endpoints to make sure that the deployment is working as intended. After basic functionality is verified, an example LLM inference request is sent to verify that the inference engine is functional.
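
If you prefer to run the same checks by hand, the following sketch mirrors what the script does. The /health and /metrics paths are assumptions based on the description above; substitute your actual frontend service name:

```bash
# Port-forward the frontend service, then probe the endpoints that test.sh exercises
kubectl port-forward svc/<frontend-service> 3000:3000 -n dynamo-cloud &

curl -s http://localhost:3000/health          # health check (path assumed)
curl -s http://localhost:3000/metrics | head -n 20
curl -s http://localhost:3000/v1/models
```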

Access your deployment

Use the following code to access your deployment:

```bash
kubectl port-forward svc/<frontend-service> 3000:3000 -n dynamo-cloud &
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
        {"role": "user", "content": "Explain what a Q-Bit is in quantum computing."}
    ],
    "max_tokens": 2000,
    "temperature": 0.7,
    "stream": false
}'
```
**Expected Output:**
```
{
  "id": "1918b11a-6d98-4891-bc84-08f99de70fd0",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "A Q-bit, or qubit, is the basic unit of quantum information. Unlike classical bits which are either 0 or 1, qubits can exist in a superposition of both states simultaneously. This allows quantum computers to process multiple possibilities at once, giving them exponential advantages for certain problems like factoring and optimization... [response continues]",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1752018267,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "object": "chat.completion"
}
```

Monitor and observe

AI-on-EKS can deploy the kube-prometheus-stack add-on, so when NVIDIA Dynamo is installed with this blueprint, you get Grafana and Prometheus out of the box (Grafana on port 3000, Prometheus on 9090):

```bash
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-prometheus 9090:9090
```
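
With the Prometheus port-forward in place, you can also query metrics directly from its HTTP API. The example below assumes the NVIDIA DCGM exporter is running in the cluster and that its GPU utilization metric is being scraped; adjust the metric name to match your monitoring stack:

```bash
# Query GPU utilization from Prometheus (assumes the DCGM exporter metric DCGM_FI_DEV_GPU_UTIL is scraped)
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | head -c 500
```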

Clean up

Use the following code to clean up your resources:

```bash
cd infra/nvidia-dynamo
./cleanup.sh
```

This will destroy the NVIDIA Dynamo deployments and start spinning down the infrastructure step by step.

Advanced usage

For custom model deployment, monitoring configuration, troubleshooting, and production considerations, refer to the complete documentation.

This deployment uses the enterprise-grade features of Amazon EKS, including Karpenter automatic scaling, EFA networking, and seamless AWS service integration to provide a production-ready NVIDIA Dynamo environment.

Conclusion

In this post, we discussed the benefits of using NVIDIA Dynamo, a high-throughput, low-latency open source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. We demonstrated how to deploy your LLM application on Amazon EKS using the NVIDIA Dynamo Kubernetes Operator. This example showed g6e instances, but it can also work with other NVIDIA GPU instances, such as P6, P5, P4d, P4de, G5, and G6.

Developers can start with NVIDIA Dynamo today. For more Dynamo examples, check out the examples folder in the GitHub repo.


About the authors

Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based out of San Francisco, Baladithya enjoys tinkering, developing applications, and working on his homelab in his free time.

Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Brian Kreitzer is a Partner Solutions Architect at Amazon Web Services (AWS). He is responsible for working with partners to create accelerators and solutions for AWS customers, engages in technical co-sell opportunities, and evangelizes accelerator and solution adoption to the technical community.

Anton Alexander is a Senior Specialist in Generative AI at AWS, focusing on scaling large training and inference workloads with AWS HyperPod. As a veteran CUDA programmer and Kubernetes expert, he helps enterprises integrate NVIDIA technologies for distributed training, specializing in EKS and Slurm implementations. Anton works closely with MENA Region and Government sector clients to optimize GenAI solutions. He holds a patent pending for machine learning edge computing systems. Outside work, Anton is a Brazilian jiu-jitsu and collegiate boxing champion who enjoys flying planes.

Kshitiz Gupta is a Senior Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running and hiking.

Wenhan Tan is a Solutions Architect at NVIDIA, assisting customers to adopt NVIDIA AI solutions at large-scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.

Arun Raman is a Senior Solution Architect at NVIDIA, specializing in AI applications at the edge, in the cloud, and on premises for the consumer Internet industry. In his current role, he works on end-to-end AI pipelines including preprocessing, training, and inference. In addition to his AI work, he has also worked on a wide range of products, including network routers and switches, multi-cloud infrastructure, and services. He holds a master’s degree in electrical engineering from the University of Texas at Dallas.

Jiahong Liu is a Senior Solution Architect on the Cloud Service Provider team at NVIDIA. He helps clients build and deploy AI and machine learning solutions using NVIDIA’s accelerated computing to solve complex inference problems. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Deadline Extended — Create a Project G-Assist Plug-In for a Chance to Win an NVIDIA GeForce RTX GPU and Laptop

Submissions for NVIDIA’s Plug and Play: Project G-Assist Plug-In Hackathon are due Sunday, July 20, at 11:59pm PT. RTX AI Garage offers all the tools and resources to help.

The hackathon invites the community to expand the capabilities of Project G-Assist, an experimental AI assistant available through the NVIDIA App that helps users control and optimize NVIDIA GeForce RTX systems.

Entrants gain the chance to win a GeForce RTX 5090 laptop, or a limited NVIDIA GeForce RTX 5080 or RTX 5070 Founders Edition graphics card, plus NVIDIA Deep Learning Institute credits. Finalists may also be featured on NVIDIA’s social media channels.

Register for the hackathon and check out the curated technical resources below to bring these submissions to life.

(G-)Assist With AI

When in the heat of a gaming moment or the flow of a creative project, interrupting one’s focus to navigate complex PC settings menus is a common frustration. For example, manually tweaking GPU performance or optimizing system parameters often requires leaving the user’s current application, which breaks concentration.

Enter Project G-Assist, which allows users to control their RTX GPU and other system settings using natural language. It’s powered by a small language model that runs on device and can be accessed directly from the NVIDIA overlay within the NVIDIA App — no need to tab out or switch programs.

Users can also expand its capabilities via plug-ins and even connect it to agentic frameworks such as Langflow. G-Assist plug-ins can be built in several ways, including with Python for rapid development, with C++ for performance-critical apps and with custom system interactions for hardware and operating system automation.

Project G-Assist requires a GeForce RTX 50, 40 or 30 Series Desktop GPU with at least 12GB of VRAM, a Windows 11 or 10 operating system, a compatible CPU (Intel Pentium G Series, Core i3, i5, i7 or higher; AMD FX, Ryzen 3, 5, 7, 9, Threadripper or higher), specific disk space requirements and a recent GeForce Game Ready Driver or NVIDIA Studio Driver.

Cross the Finish Line

As the hackathon’s submission deadline approaches this weekend, RTX AI Garage is providing resources that can help:

Sydney Altobell, a senior software engineer at NVIDIA, offers tips and tricks for working with G-Assist plug-ins in this on-demand webinar. The presentation and Q&A are available on the NVIDIA Developer YouTube channel.

Fellow community developers are collaborating and sharing notes in the NVIDIA Developer Discord server. Altobell and the G-Assist engineering team have already answered many questions about plug-in submissions — keep the questions coming.

Plus, explore NVIDIA’s GitHub repository, which provides everything needed to get started developing with G-Assist, including step-by-step instructions and documentation for building custom functionalities. Take inspiration from sample plug-ins, which include code for using G-Assist to integrate into Discord, IFTTT, Google Gemini and more.

Learn more about the ChatGPT Plug-In Builder to transform ideas into functional G-Assist plug-ins with minimal coding. The tool uses OpenAI’s custom GPT builder to generate plug-in code and streamline the development process.

NVIDIA’s technical blog walks through the architecture of a G-Assist plug-in, using a Twitch integration as an example. Discover how plug-ins work, how they communicate with G-Assist and how to build them from scratch.

Find submission details and requirements on the Hackathon entry page.

Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NVIDIA NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, productivity apps and more on AI PCs and workstations. 

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X

See notice regarding software product information.

AWS doubles investment in AWS Generative AI Innovation Center, marking two years of customer success

When we launched the AWS Generative AI Innovation Center in 2023, we had one clear goal: help customers turn AI potential into real business value. We’ve already guided thousands of customers across industries from financial services to healthcare—including Formula 1, FOX, GovTech Singapore, Itaú Unibanco, Nasdaq, NFL, RyanAir, and S&P Global—from AI experimentation to full-scale deployment, driving millions of dollars in productivity gains and transforming customer experiences. Now, as AI evolves toward more autonomous, agentic systems, we’re announcing an additional $100 million investment in the AWS Generative AI Innovation Center to help customers pioneer this next wave of AI innovation.

Proven results through collaborative innovation

The AWS Generative AI Innovation Center delivers results by empowering customers to innovate freely and maximize value through trusted AI solutions. We combine Amazon’s decades of AI leadership with deep technical knowledge and extensive, secure real-world deployment experience. Our global team of AI scientists, strategists, and engineers works directly alongside customers and partners to solve the most complex challenges in AI implementation. Through our working backward approach, we deliver deployment-ready solutions in as little as 45 days—starting with real customer needs and maintaining a production-first mindset that blends advanced science with practical application. Our experience with thousands of customers has revealed a clear pattern: the most successful AI implementations start with a strong data and cloud foundation. Many of our customers begin their AI journey by first establishing robust cloud and data practices on AWS—centralizing their data lakes, implementing governance, and modernizing their analytics capabilities, which then becomes the springboard for transformative AI initiatives.

To help more customers, we also launched the AWS Generative AI Partner Innovation Alliance, a carefully selected global network of systems integrators and consulting firms. These partners apply our center’s battle-tested methodologies and assets to help customers rapidly scale from AI experimentation to enterprise-wide implementation.

“Two years has flown by quickly, and I couldn’t be prouder of how the AWS Generative AI Innovation Center has helped thousands of customers tackle their most complex AI challenges—from medical research to banking to startup innovation,” said Sri Elaprolu, Director for the Center. “This new $100 million AWS investment empowers us to continue innovating alongside our customers.”

The following are some examples of how the AWS Generative AI Innovation Center customers are turning AI potential into real business results, such as accelerating drug discovery, transforming government services, optimizing manufacturing operations, and more:

Jabil, a global leader in engineering, supply chain, and manufacturing solutions with over 100 sites worldwide and 140,000 employees, is a Fortune 500 company that collaborated with AWS and AWS Partners to achieve a 74% reduction in data processing times, improving operational efficiency. In just 3 weeks, they deployed an intelligent shopfloor assistant using Amazon Q to connect their databases, which now processes over 1,700 policies and specifications across multiple languages, reducing their average troubleshooting time while improving diagnostic accuracy.

Warner Bros. Discovery Sports Europe developed an AI-powered solution, Cycling Central Intelligence (CCI), using Amazon Bedrock, Anthropic’s Claude 3.5, and other AWS services. CCI processes hundreds of documents and databases, so mountain bike racing commentators can access information through natural language queries, dramatically reducing research time and enhancing storytelling capabilities.

The BMW Group manages over 23 million vehicles connected to its Connected Vehicle Backend hosted on AWS, generating approximately 17 billion requests and 197 terabytes of data per day. BMW and AWS continually work together to make these processes even more efficient. One of these innovations is an AI solution that automatically analyzes system architecture, logs, and infrastructure changes. The solution can identify root causes of service disruptions in minutes, a task that previously required hours of manual investigation across multiple teams.

Splash Music, an AI music generation startup, developed an AI-powered music production platform on AWS that converts simple hummed tunes into fully-produced songs. Using Amazon SageMaker HyperPod and AWS Trainium chips, it efficiently processes petabyte-scale music datasets and large language models (LLMs). This approach reduced infrastructure costs by over 50% while enabling rapid scaling. In 5 months, users created over 10,000 AI-generated songs (“wemixes”), garnering 400 million impressions. The platform’s ability to turn basic audio input (just a voice) into complex musical compositions expands creative possibilities for both amateur and professional musicians.

The PGA TOUR has transformed its content workflow by implementing an intelligent image selection system powered by Amazon Nova, an AI solution that automatically evaluates player photos against sophisticated branding and emotional criteria, analyzing multiple images to select the perfect shot for tournament previews and recaps. By replacing manual review processes with the advanced capabilities of Amazon Nova, the PGA TOUR achieved 70% cost savings compared to alternative AI models while maintaining its high editorial standards.

The agentic future is here

Today’s investment is particularly significant as we see AI evolving from systems that simply respond to prompts to autonomous agents capable of reasoning, planning, and executing complex tasks. This evolution holds immense economic potential, with Gartner’s projection that 15% of work decisions will be made autonomously by agentic AI by 2028. Beyond economic projections, agents represent a larger shift in how work is done and how value is both captured and created. With agents, humans can focus less on rote work and more on designing responsibly, thinking strategically, and innovating creatively.

At AWS, we are building a future where these advanced agentic capabilities can extend and enhance your team’s capabilities, while still recognizing the irreplaceable value of human judgment, empathy, and responsible decision making. Our approach to implementing these powerful technologies emphasizes clear governance, accountability, explainability, and robust privacy controls. Amazon applies this approach internally, testing services with our teams before releasing them to customers. For example, Amazon is using agentic AI to power the new Alexa+ and in our fulfillment network for inventory placement and demand forecasting. AWS is using agents in internal sales and marketing activities and is actively deploying agentic tools and accelerators to deliver unprecedented speed-to-value for customers. We’re not just building agents to reimagine services delivery—we’re already transforming how AWS Professional Services operates today.

Agentic AI applications delivering results

Customers working with the AWS Generative AI Innovation Center are already seeing promise and value with agentic AI:

Syngenta, a global agriculture leader, has harnessed AWS’s AI capabilities to develop Cropwise AI, an innovative solution designed to help farmers navigate complex agricultural challenges. When a farmer seeks guidance, Cropwise AI’s advanced system springs into action. Its sophisticated agentic AI architecture analyzes weather patterns, soil conditions, crop growth stages, and product information simultaneously. This approach enables real-time synthesis of insights from various data sources, delivering comprehensive farming recommendations with unprecedented speed. Using Syngenta’s proprietary models, meticulously trained on extensive agricultural data, Cropwise AI generates personalized, data-driven advice. Moving beyond traditional sequential AI processing, Cropwise AI empowers farmers to make swift, informed decisions in the face of ever-changing environmental factors, adapting to the dynamic nature of modern farming.

AstraZeneca is building a generative AI-powered conversational analytics solution that helps their internal teams efficiently analyze healthcare professional documentation across their global markets. The resulting solution employs a dynamic workflow of specialized agents: one converting natural language to SQL queries for data retrieval, another transforming results into Python-based visualizations, and a third generating comprehensive summaries. The transition to agents enabled AstraZeneca to dramatically reduce query response times, showing a 50% improvement. This transformation enabled internal teams to uncover valuable insights about prescriber patterns and care gaps that were previously inaccessible. With the solution’s modular architecture, AstraZeneca can repurpose these agents across multiple platforms, creating a scalable foundation for future AI-driven analytics initiatives.

SonicWall is harnessing the power of specialized AI agents on AWS to transform how customers manage their firewall configurations. Their innovative solution deploys purpose-built agents that work in concert: a router agent analyzes natural language requests to determine query type, while dedicated agents simultaneously handle security policy validation and configuration generation. By orchestrating these specialized agents, SonicWall moved from a potentially vulnerable open source framework to a secure, high-performance solution in just 3 weeks. This transformation improved code maintainability by 40% and significantly reduced operational complexity. Now, managed service providers can instantly configure and upgrade firewalls, turning previously complex security tasks into streamlined, automated processes.

Yahoo Finance is on a mission to revolutionize how millions of investors access market analysis. Building on their AI-powered news summarization solution, they are developing an innovative multi-agent system where a supervisor agent will intelligently orchestrate specialized agents dedicated to analyzing breaking news, processing financial data, and interpreting SEC filings. By using these agents, Yahoo Finance can combine insights from multiple sources based on each investor query, delivering comprehensive financial analysis through a single, seamless interface with no coding experience required.

Accelerating AI innovation through expanded investment

Our doubled investment in the AWS Generative AI Innovation Center marks the next chapter in our commitment to helping our customers drive AI innovation. We’re expanding agentic capabilities, deepening collaborations with startups, advancing forward deployed engineering, and scaling partner programs to help accelerate customer journeys. From sophisticated agent architectures to advanced model optimization, we’re focused on turning emerging technologies into practical business solutions that deliver real value in a secure, responsible, and cost-effective way.

As the AI landscape evolves, our mission remains clear: make sure AWS customers stay at the leading edge of AI innovation, driving meaningful business transformation. Whether you’re starting your AI journey or scaling enterprise-wide, we’re here to help transform your boldest AI visions into production reality.

To take the next step, contact your AWS account team, and explore the following resources:


About the authors

Francessca Vasquez is the Vice President of Professional Services and Agentic AI for Amazon Web Services (AWS). She leads global consulting and product services for AWS, overseeing the sales and delivery P&L and customer engagements across public sector, commercial, and partners worldwide. Her team operates in over 48 countries directing programs that accelerate innovation and time-to-value for customers. Francessca drives the co-innovation and service delivery of emerging technologies, such as Agentic AI, Quantum Computing, and Application Modernization, helping customers successfully build and deploy Generative AI solutions. Her team connects AWS AI and ML experts with customers globally to envision, design, and launch cutting-edge generative AI products, services, and processes. As Executive Sponsor for the AWS Global CIO Council, Francessca strengthens strategic partnerships and enhances customer outcomes through collaborative innovation. Under her leadership, AWS Professional Services helps organizations accelerate their digital transformation and unlock the full potential of cloud computing and artificial intelligence technologies.

Taimur Rashid is an accomplished product and business executive with over two decades of experience encompassing leadership roles in product, market/business development, and cloud solutions architecture and engineering. His expertise spans big tech firms and growth-stage startups, particularly in areas bridging technology, product, business, and Go-To-Market (GTM). He currently leads the Generative AI Innovation and Delivery organization, building end-to-end AI solutions for customers.

NVIDIA CEO Jensen Huang Promotes AI in Washington, DC and China

This month, NVIDIA founder and CEO Jensen Huang promoted AI in both Washington, D.C. and Beijing — emphasizing the benefits that AI will bring to business and society worldwide. 

In the U.S. capital, Huang met with President Trump and U.S. policymakers, reaffirming NVIDIA’s support for the Administration’s effort to create jobs, strengthen domestic AI infrastructure and onshore manufacturing, and ensure that America leads in AI worldwide.  

In Beijing, Huang met with government and industry officials to discuss how AI will raise productivity and expand opportunity. The discussions underscored how researchers worldwide can advance safe and secure AI for the benefit of all. 

Huang also provided an update to customers, noting that NVIDIA is filing applications to sell the NVIDIA H20 GPU again. The U.S. government has assured NVIDIA that licenses will be granted, and NVIDIA hopes to start deliveries soon. Finally, Huang announced a new, fully compliant NVIDIA RTX PRO GPU that “is ideal for digital twin AI for smart factories and logistics.” 

As Huang noted during his visits, the world has reached an inflection point — AI has become a fundamental resource, like energy, water and the internet. Huang emphasized NVIDIA’s commitment to support open-source research, foundation models and applications, which democratize AI and will empower emerging economies in every region, including Latin America, Europe, Asia and beyond.

“General-purpose, open-source research and foundation models are the backbone of AI innovation,” Huang explained to reporters in D.C. “We believe that every civil model should run best on the U.S. technology stack, encouraging nations worldwide to choose America.”  

Build AI-driven policy creation for vehicle data collection and automation using Amazon Bedrock

Vehicle data is critical for original equipment manufacturers (OEMs) to drive continuous product innovation and performance improvements and to support new value-added services. Similarly, the increasing digitalization of vehicle architectures and adoption of software-configurable functions allow OEMs to add new features and capabilities efficiently. Sonatus’s Collector AI and Automator AI products address these two aspects of the move towards Software-Defined Vehicles (SDVs) in the automotive industry.

Collector AI lowers the barrier to using data across the entire vehicle lifecycle through data collection policies that can be created without changes to vehicle electronics or modifications to embedded code. However, OEM engineers and other consumers of vehicle data struggle to choose from the thousands of vehicle signals available to drive their specific use cases and outcomes. Likewise, Automator AI’s no-code methodology for automating vehicle functions using intuitive if-then-style scripted workflows can also be challenging, especially for OEM users who aren’t well versed in the events and signals available on vehicles to incorporate into a desired automated action.

To address these challenges, Sonatus partnered with the AWS Generative AI Innovation Center to develop a natural language interface to generate data collection and automation policies using generative AI. This innovation aims to reduce the policy generation process from days to minutes while making it accessible to both engineers and non-experts alike.

In this post, we explore how we built this system using Sonatus’s Collector AI and Amazon Bedrock. We discuss the background, challenges, and high-level solution architecture.

Collector AI and Automator AI

Sonatus has developed a sophisticated vehicle data collection and automation workflow tool, which comprises two main products:

  • Collector AI – Gathers and transmits precise vehicle data based on configurable trigger events
  • Automator AI – Executes automated actions within the vehicle based on analyzed data and trigger conditions

The current process requires engineers to create data collection or automation policies manually. Depending on the range of an OEM’s use cases, there could be hundreds of policies for a given vehicle model. Also, identifying the correct data to collect for a given intent requires sifting through multiple layers of information and organizational challenges. Our goal was to develop a more intelligent and intuitive way to accomplish the following:

  • Generate policies from the user’s natural language input
  • Significantly reduce policy creation time from days to minutes
  • Provide complete control over the intermediate steps in the generation process
  • Expand policy creation capabilities to non-engineers such as vehicle product owners, product planners, and even procurement
  • Implement a human-in-the-loop review process for both existing and newly created policies

Key challenges

During implementation, we encountered several challenges:

  • Complex event structures – Vehicle models and different policy entities use diverse representations and formats, requiring flexible policy generation
  • Labeled data limitations – Labeled data mapping natural language inputs to desired policies is limited
  • Format translation – The solution must handle different data formats and schemas across customers and vehicle models
  • Quality assurance – Generated policies must be accurate and consistent
  • Explainability – Clear explanations for how policies are generated can help build trust

Success metrics

We defined the following key metrics to measure the success of our solution:

  • Business metrics:
    • Reduced policy generation time
    • Increased number of policies per customer
    • Expanded user base for policy creation
  • Technical metrics:
    • Accuracy of generated policies
    • Quality of results for modified prompts
  • Operational metrics:
    • Reduced policy generation effort and turnaround time compared to manual process
    • Successful integration with existing systems

Solution overview

The Sonatus Advanced Technology team and Generative AI Innovation Center team built an automated policy generation system, as shown in the following diagram.

This is a chain of large language models (LLMs) that perform individual tasks, including entity extraction, signal translation, and signal parametrization.

Entity extraction

A fully generated vehicle policy consists of multiple parts, which could be captured within one single user statement. These are triggers and target data for collector policies, and triggers, actions, and associated tasks for automator policies. The user’s statement is first broken down into its entities using the following steps and rules:

  • Few-shot examples are provided for each entity
  • Trigger outputs must be self-contained with the appropriate signal value and comparison operator information:
    • Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
    • Trigger output: <response>vehicle speed above 0, vehicle signal</response>
  • Triggers and actions are secondarily verified using a classification prompt
  • For Automator AI, triggers and actions must be associated with their corresponding tasks
  • The final output of this process is the intermediate structured XML representation of the user query in natural language:
    • Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
    • Generated XML:
<response>
<task> Lock doors when moving </task>
<triggers> vehicle speed above 0, vehicle signal </triggers>
<actions> lock doors, vehicle signal </actions>
</response> 

The following is a diagram of our improved solution, which converts a user query into XML output.

This is a chain of large language models (LLMs) that perform individual tasks, including entity extraction, signal translation, and signal parametrization.
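
Although the post doesn’t prescribe a specific implementation, you can exercise an entity extraction prompt of this kind directly against Anthropic’s Claude on Amazon Bedrock. The following is a minimal sketch using the AWS CLI; the model ID, prompt wording, and file paths are illustrative only:

```bash
# Hypothetical sketch: ask Claude on Amazon Bedrock to extract task/trigger/action entities as XML
cat > /tmp/entity-extraction-request.json <<'EOF'
{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 512,
  "messages": [
    {
      "role": "user",
      "content": "Extract the <task>, <triggers>, and <actions> entities as XML from this request: Generate an automation policy that locks the doors automatically when the car is moving"
    }
  ]
}
EOF

aws bedrock-runtime invoke-model \
  --model-id anthropic.claude-3-haiku-20240307-v1:0 \
  --content-type application/json \
  --body fileb:///tmp/entity-extraction-request.json \
  /tmp/entity-extraction-response.json

cat /tmp/entity-extraction-response.json
```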

Signal translation and parametrization

To get to the final JSON policy structure from the intermediate structured XML output, the correct signals must be identified, the signal parameters need to be generated, and this information must be combined to follow the application’s expected JSON schema.

The output signal format of choice at this stage is the Vehicle Signal Specification (VSS), an industry-standard specification driven by COVESA. VSS defines vehicle signal naming conventions and strategies that make vehicle signals descriptive and understandable compared to their physical Controller Area Network (CAN) counterparts. This makes VSS not only suitable but essential for the generative AI-based generation process, because descriptive signal names and the availability of their meanings are necessary.

The VSS signals, along with their descriptions and other necessary metadata, are embedded into a vector index. For every XML structure requiring a lookup of a vehicle signal, the process of signal translation includes the following steps:

  1. Available signal data is preprocessed and stored into a vector database.
  2. Each XML representation—triggers, actions, and data—is converted into their corresponding embeddings. In some cases, the XML phrases can also be enhanced for better embedding representation.
  3. For each of the preceding entities:
    1. Top-k similar vector embeddings are identified (for example, k = 20).
    2. Candidate signals are reranked based on their names and descriptions.
    3. The final signal is selected using an LLM selection prompt.
  4. In the case of triggers, after the selection of the correct signal, the trigger value and condition comparator operator are also generated using few-shot examples.
  5. This retrieved and generated information is combined into a predefined trigger, action, data, and task JSON object structure.
  6. Individual JSON objects are assembled to construct the final JSON policy.
  7. This is run through a policy schema validator before it is saved.

The following diagram illustrates the step-by-step process of signal translation. To generate the JSON output from the intermediate XML structure, correct signals are identified using vector-based lookups and reranking techniques.
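
As a concrete illustration of the top-k lookup in step 3, the post doesn’t specify which vector store holds the signal index, but if the VSS signal names and descriptions were indexed in an Amazon Bedrock knowledge base, the candidate retrieval for a trigger phrase could be sketched as follows (the knowledge base ID is a placeholder):

```bash
# Hypothetical top-k lookup against a Bedrock knowledge base holding VSS signal descriptions
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id <your-kb-id> \
  --retrieval-query '{"text": "vehicle speed above 0"}' \
  --retrieval-configuration '{"vectorSearchConfiguration": {"numberOfResults": 20}}'
```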

Solution highlights

In this section, we discuss key components and features of the solution.

Improvement of task adjacency

In automator policies, a task is a discrete unit of work within a larger process. It has a specific purpose and performs a defined set of actions—both within and outside a vehicle. It also optionally defines a set of trigger conditions that, when evaluated to be true, cause the defined actions to start executing. The larger process—the workflow—defines a dependency graph of tasks and the order in which they are executed. The workflow follows these rules:

  • Every automator policy starts with exactly one task
  • A task can point to one or more next tasks
  • One task can only initiate one other task
  • Multiple possible next tasks can exist, but only one can be triggered at a time
  • Each policy workflow runs one task at a given time
  • Tasks can be arranged in linear or branching patterns
  • If none of the conditions are satisfied, the default is to continue monitoring the trigger conditions of the next available tasks

For example:

# Linear Task Adjacency
t1 → t2 → t3 → t4 → t1*
# Branching Task Adjacency
t1 → t2, t3, t4
t3 → t5
t5 → t4

*Loops back to start.

In some of the generated outputs, we identified that there can be two adjacent tasks in which one doesn’t have an action and the other doesn’t have a trigger. To address this, we implemented task merging using Anthropic’s Claude on Amazon Bedrock, which combines such tasks into a single task. Our outcomes were as follows:

  • Solve the task merging issue, where multiple tasks with incomplete information are merged into one task
  • Properly generate tasks that point to multiple next tasks
  • Change the prompt style to decision tree-based planning to make it more flexible

Multi-agent approach for parameter generation

During the signal translation process, an exhaustive list of signals is fed into a vector store, and when corresponding triggers or actions are generated, they are used to search the vector store and select the signal with the highest relevancy. However, this sometimes generates less accurate or ambiguous results.

For example, the following policy asks to cool down the car:

Action: <response> cool down the car </response>

The corresponding signal should try to cool the car cabin, as shown in the following signal:

Vehicle.Cabin.HVAC.Station.Row1.Driver.Temperature

It should not cool the car engine, as shown in the following incorrect signal:

Vehicle.Powertrain.CombustionEngine.EngineCoolant.Temperature

We mitigated this issue by introducing a multi-agent approach. Our approach has two agents:

  • ReasoningAgent – Proposes initial signal names based on the query and knowledge base
  • JudgeAgent – Evaluates and refines the proposed signals

The agents interact iteratively up to a set cycle threshold before claiming success for signal identification.

Reduce redundant LLM calls

To reduce latency, we identified parts of the pipeline that could be merged into a single LLM call. For example, trigger condition value generation and trigger condition operator generation were individual LLM calls. We addressed this by introducing the faster Anthropic Claude 3 Haiku model and merging prompts where possible. The following is an example of a set of prompts before and after merging. The first example is before merging, with the trigger set to when the temperature is above 20 degrees Celsius:

Operator response: <operator> > </operator>
Parameter response: <value> 20 </value>

The following is the combined response for the same trigger:

<response>
<operator> > </operator>
<value> 20 </value>
</response>

Context-driven policy generation

The goal here is to disambiguate the signal translation, similar to the multi-agent approach for parameter generation. To make policy generation more context-aware, we proposed a customer intent clarifier that carries out the following tasks:

  • Retrieves relevant subsystems using knowledge base lookups
  • Identifies the intended target subsystem
  • Allows user verification and override

This approach works by using external and preprocessed information like available vehicle subsystems, knowledge bases, and signals to guide the signal selection. Users can also clarify or override intent in cases of ambiguity early on to reduce wasted iterations and achieve the desired result more quickly. For example, in the case of the previously stated example on an ambiguous generation of “cool the car,” users are asked to clarify which subsystem they meant—to choose from “Engine” or “Cabin.”

Conclusion

Combining early feedback loops and a multi-agent approach has transformed Sonatus’s policy creation system into a more automated and efficient solution. By using Amazon Bedrock, we created a system that not only automates policy creation, reducing the time taken by 70%, but also provides accuracy through context-aware generation and validation. Organizations can achieve similar efficiency gains by implementing this multi-agent approach with Amazon Bedrock for their own complex policy creation workflows, and developers can use these techniques to build natural language interfaces that dramatically reduce technical complexity while maintaining precision in business-critical systems.


About the authors

Giridhar Akila Dhakshinamoorthy is the Senior Staff Engineer and AI/ML Tech Lead in the CTO Office at Sonatus.

Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services, where he helps customers solve their business problems using generative AI and machine learning. He holds an MS with thesis in machine learning from the University of Illinois and has extensive experience solving customer problems in the field of data science.

Parth Patwa is a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He has co-authored research papers at top AI/ML venues and has 1000+ citations.

Yingwei Yu is an Applied Science Manager at Generative AI Innovation Center, AWS, where he leverages machine learning and generative AI to drive innovation across industries. With a PhD in Computer Science from Texas A&M University and years of working experience, Yingwei brings extensive expertise in applying cutting-edge technologies to real-world applications.

Hamed Yazdanpanah was a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.

Read More

How Rapid7 automates vulnerability risk scores with ML pipelines using Amazon SageMaker AI

How Rapid7 automates vulnerability risk scores with ML pipelines using Amazon SageMaker AI

This post is cowritten with Jimmy Cancilla from Rapid7.

Organizations are managing increasingly distributed systems, which span on-premises infrastructure, cloud services, and edge devices. As systems become interconnected and exchange data, the potential pathways for exploitation multiply, and vulnerability management becomes critical to managing risk. Vulnerability management (VM) is the process of identifying, classifying, prioritizing, and remediating security weaknesses in software, hardware, virtual machines, Internet of Things (IoT) devices, and similar assets. When new vulnerabilities are discovered, organizations are under pressure to remediate them. Delayed responses can open the door to exploits, data breaches, and reputational harm. For organizations with thousands or millions of software assets, effective triage and prioritization for the remediation of vulnerabilities are critical.

To support this process, the Common Vulnerability Scoring System (CVSS) has become the industry standard for evaluating the severity of software vulnerabilities. CVSS v3.1, published by the Forum of Incident Response and Security Teams (FIRST), provides a structured and repeatable framework for scoring vulnerabilities across multiple dimensions: exploitability, impact, attack vector, and others. With new threats emerging constantly, security teams need standardized, near real-time data to respond effectively. CVSS v3.1 is used by organizations such as NIST and major software vendors to prioritize remediation efforts, support risk assessments, and comply with standards.

There is, however, a critical gap that emerges before a vulnerability is formally standardized. When a new vulnerability is disclosed, vendors aren’t required to include a CVSS score alongside the disclosure. Additionally, third-party organizations such as NIST aren’t obligated or bound by specific timelines to analyze vulnerabilities and assign CVSS scores. As a result, many vulnerabilities are made public without a corresponding CVSS score. This situation can leave customers uncertain about how to respond: should they patch the newly discovered vulnerability immediately, monitor it for a few days, or deprioritize it? Our goal with machine learning (ML) is to provide Rapid7 customers with a timely answer to this critical question.

Rapid7 helps organizations protect what matters most so innovation can thrive in an increasingly connected world. Rapid7’s comprehensive technology, services, and community-focused research remove complexity, reduce vulnerabilities, monitor for malicious behavior, and shut down attacks. In this post, we share how Rapid7 implemented end-to-end automation for the training, validation, and deployment of ML models that predict CVSS vectors. Rapid7 customers have the information they need to accurately understand their risk and prioritize remediation measures.

Rapid7’s solution architecture

Rapid7 built their end-to-end solution using Amazon SageMaker AI, the Amazon Web Services (AWS) fully managed ML service to build, train, and deploy ML models into production environments. SageMaker AI provides powerful compute for ephemeral tasks, orchestration tools for building automated pipelines, a model registry for tracking model artifacts and versions, and scalable deployment to configurable endpoints.

Rapid7 integrated SageMaker AI with their DevOps tools (GitHub for version control and Jenkins for build automation) to implement continuous integration and continuous deployment (CI/CD) for the ML models used for CVSS scoring. By automating model training and deployment, Rapid7’s CVSS scoring solutions stay up to date with the latest data without additional operational overhead.

The following diagram illustrates the solution architecture.

Architectural diagram showing Jenkins pipeline workflow integrated with AWS SageMaker for ML model training, testing, registry, and deployment monitoring

Orchestrating with SageMaker AI Pipelines

The first step in the journey toward end-to-end automation was removing manual activities previously performed by data scientists. This meant migrating experimental code from Jupyter notebooks to production-ready Python scripts. Rapid7 established a project structure to support both development and production. Each step in the ML pipeline—data download, preprocessing, training, evaluation, and deployment—was defined as a standalone Python module in a common directory.

Designing the pipeline

After refactoring, pipeline steps were moved to SageMaker Training and Processing jobs for remote execution. Steps in the pipeline were defined using Docker images with the required libraries, and orchestrated using SageMaker Pipelines in the SageMaker Python SDK.

CVSS v3.1 vectors consist of eight independent metrics combined into a single vector. To produce an accurate CVSS vector, eight separate models were trained in parallel. However, the data used to train these models was identical. This meant that the training process could share common download and preprocessing steps, followed by separate training, validation, and deployment steps for each metric. The following diagram illustrates the high-level architecture of the implemented pipeline.

Parallel ML model training architecture showing data pipeline and deployment workflow for 8 models with shared preprocessing stage

Data loading and preprocessing

The data used to train the model comprised existing vulnerabilities and their associated CVSS vectors. Because this data source is updated constantly, Rapid7 decided to download the most recent data available at training time and upload it to Amazon S3 for use by subsequent steps. After the data is downloaded, a preprocessing step performs the following tasks:

  1. Structure the data to facilitate ingestion and use in training.
  2. Split the data into three sets: training, validation, and testing (80%, 10%, and 10%).

The preprocessing step was defined with a dependency on the data download step so that the new dataset was available before a new preprocessing job was started. The outputs of the preprocessing job—the resulting training, validation, and test sets—are also uploaded to Amazon S3 to be consumed by the training steps that follow.
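
As a reference, the preprocessing script might perform the split along these lines; this is a minimal sketch that assumes a CSV dataset and the standard SageMaker Processing container paths, not Rapid7's actual implementation.

import pandas as pd

def split_dataset(
    input_path="/opt/ml/processing/input/vulnerabilities.csv",
    output_dir="/opt/ml/processing/output",
    seed=42,
):
    """Shuffle the structured data and write train/validation/test splits (80/10/10)."""
    df = pd.read_csv(input_path)
    shuffled = df.sample(frac=1.0, random_state=seed)

    n = len(shuffled)
    train_end = int(0.8 * n)
    val_end = int(0.9 * n)

    shuffled.iloc[:train_end].to_csv(f"{output_dir}/train.csv", index=False)
    shuffled.iloc[train_end:val_end].to_csv(f"{output_dir}/validation.csv", index=False)
    shuffled.iloc[val_end:].to_csv(f"{output_dir}/test.csv", index=False)

if __name__ == "__main__":
    split_dataset()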

Model training, evaluation, and deployment

For the remaining pipeline steps, Rapid7 executed each step eight times—one time for each metric in the CVSS vector. Rapid7 iterated through each of the eight metrics to define the corresponding training, evaluation, and deployment steps using the SageMaker Pipelines SDK.

The loop follows a similar pattern for each metric. The process starts with a training job using PyTorch framework images provided by Amazon SageMaker AI. The following is a sample script for defining a training job.

from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role=role,
    instance_count=1,
    instance_type=TRAINING_INSTANCE_TYPE,
    output_path=f"s3://{s3_bucket}/cvss/trained-model",
    framework_version="2.2",
    py_version="py310",
    disable_profiler=True,
    environment={"METRIC": cvss_metric},
)
step_train = TrainingStep(
    name=f"TrainModel_{cvss_metric}",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=<<INPUT_DATA_S3_URI>>,
            content_type="text/plain"
        ),
        "validation": TrainingInput(
            s3_data=<<VALIDATION_DATA_S3_URI>>,
            content_type="text/plain"
        )
    }
)
training_steps.append(step_train)

The PyTorch estimator creates model artifacts that are automatically uploaded to the Amazon S3 location defined in the output_path parameter. The same script is used for every CVSS v3.1 metric; each training job focuses on a different metric because a different cvss_metric value is passed to the training script as an environment variable.
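
Inside train.py, the script might read that variable along these lines; this sketch assumes the standard SageMaker training environment variables and is not Rapid7's actual training code.

import os

# Select the target CVSS metric passed through the estimator's environment parameter
CVSS_METRIC = os.environ["METRIC"]  # e.g. "ac" (attack complexity)

# Standard SageMaker training locations for input channels and model artifacts
TRAIN_DIR = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
VAL_DIR = os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation")
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

# ... load the shared dataset from TRAIN_DIR and VAL_DIR, keep only the label
# column for CVSS_METRIC, train the PyTorch classifier, and save artifacts to
# MODEL_DIR so SageMaker uploads them to the configured output path ...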

The SageMaker Pipeline is configured to trigger the execution of a model evaluation step when the model training job for that CVSS v3.1 metric is finished. The model evaluation job takes the newly trained model and test data as inputs, as shown in the following step definition.

from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

script_eval = FrameworkProcessor(...)  # processor definition elided; run() accepts code and source_dir
eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=<<MODEL_ARTIFACTS_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=<<TEST_DATA_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation/",
            destination=f"s3://{s3_bucket}/cvss/evaluation/{cvss_metric}/"
        )
    ],
    source_dir="src",
    code="evaluate.py"
)
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)
step_eval = ProcessingStep(
    name=f"Evaluate_{cvss_metric}",
    step_args=eval_args,
    property_files=[evaluation_report],
)
evaluation_steps.append(step_eval)

The processing job is configured to create a PropertyFile object to store the results from the evaluation step. Here is a sample of what might be found in this file:

{
  "ac": {
    "metrics": {
      "accuracy": 99
    }
  }
}

This information is critical in the last step of the sequence followed for each metric in the CVSS vector. Rapid7 wants to ensure that models deployed in production meet quality standards, and they do that by using a ConditionStep that allows only models whose accuracy is above a critical value to be registered in the SageMaker Model Registry. This process is repeated for all eight models.

from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path=f"{cvss_metric}.metrics.accuracy"
    ),
    right=accuracy_threshold_param
)
step_cond = ConditionStep(
    name=f"CVSS_{cvss_metric}_Accuracy_Condition",
    conditions=[cond_gte],
    if_steps=[step_model_create],
    else_steps=[]
)
conditional_steps.append(step_cond)

Defining the pipeline

With all the steps defined, a pipeline object is created with all the steps for all eight models. The graph for the pipeline definition is shown in the following image.

Flowchart showing attack vector and attack complexity metrics in cybersecurity analysis
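
As a sketch, the assembled pipeline definition might look like the following; step_download and step_preprocess are assumed names for the shared steps, and accuracy_threshold_param is the pipeline parameter used by the condition steps above.

from sagemaker.workflow.pipeline import Pipeline

# Combine the shared steps with the per-metric training, evaluation, and
# conditional registration steps collected in the loops above
pipeline = Pipeline(
    name="cvss-scoring-pipeline",
    parameters=[accuracy_threshold_param],
    steps=[step_download, step_preprocess]
    + training_steps
    + evaluation_steps
    + conditional_steps,
)

pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # launch an execution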

Managing models with SageMaker Model Registry

SageMaker Model Registry is a repository for storing, versioning, and managing ML models throughout the machine learning operations (MLOps) lifecycle. The model registry enables the Rapid7 team to track model artifacts and their metadata (such as performance metrics), and streamline model version management as their CVSS models evolve. Each time a new model is added, a new version is created under the same model group, which helps track model iterations over time. Because new versions are evaluated against an accuracy threshold before registration, they're registered with an Approved status. If a model's accuracy falls below the threshold, the automated deployment pipeline detects this and sends an alert to notify the team about the failed deployment. This enables Rapid7 to maintain an automated pipeline that serves the most accurate model available to date without requiring manual review of new model artifacts.
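
The registration step gated by the condition above (step_model_create) could be defined along the following lines, using the RegisterModel step collection from the SageMaker Python SDK; the instance types and model package group name are illustrative assumptions.

from sagemaker.workflow.step_collections import RegisterModel

step_model_create = RegisterModel(
    name=f"Register_{cvss_metric}",
    estimator=estimator,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/plain"],
    response_types=["application/json"],
    inference_instances=["ml.m5.2xlarge"],
    transform_instances=["ml.m5.2xlarge"],
    model_package_group_name=f"cvss-{cvss_metric}",
    approval_status="Approved",  # versions that pass the accuracy check are auto-approved
)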

Deploying models with inference components

When a set of CVSS scoring models has been selected, they can be deployed in a SageMaker AI endpoint for real-time inference, allowing them to be invoked to calculate a CVSS vector as soon as new vulnerability data is available. SageMaker AI endpoints are accessible URLs where applications can send data and receive predictions. Internally, the CVSS v3.1 vector is prepared using predictions from the eight scoring models, followed by postprocessing logic. Because each invocation runs each of the eight CVSS scoring models one time, their deployment can be optimized for efficient use of compute resources.

When the deployment script runs, it checks the model registry for new versions. If it detects an update, it immediately deploys the new version to a SageMaker endpoint.

Ensuring cost efficiency

Cost efficiency was a key consideration in designing this workflow. Usage patterns for vulnerability scoring are bursty, with periods of high activity followed by long idle intervals. Maintaining dedicated compute resources for each model would be unnecessarily expensive given these idle times. To address this issue, Rapid7 implemented inference components in their SageMaker endpoint. Inference components allow multiple models to share the same underlying compute resources, significantly improving cost efficiency—particularly for bursty inference patterns. This approach enabled Rapid7 to deploy all eight models on a single instance. Performance tests showed that inference requests could be processed in parallel across all eight models, consistently achieving sub-second response times (100–200 ms).
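
For reference, deploying the models as inference components on a shared endpoint might look like the following boto3 sketch; the endpoint, variant, and model names plus the resource requirements are illustrative assumptions.

import boto3

sm = boto3.client("sagemaker")

# The eight CVSS v3.1 base metrics, one model per metric
CVSS_METRICS = ["av", "ac", "pr", "ui", "s", "c", "i", "a"]

for cvss_metric in CVSS_METRICS:
    # Host each metric model as an inference component on the shared endpoint
    sm.create_inference_component(
        InferenceComponentName=f"cvss-{cvss_metric}-ic",
        EndpointName="cvss-scoring-endpoint",
        VariantName="AllTraffic",
        Specification={
            "ModelName": f"cvss-{cvss_metric}-model",
            "ComputeResourceRequirements": {
                "NumberOfCpuCoresRequired": 1,
                "MinMemoryRequiredInMb": 1024,
            },
        },
        RuntimeConfig={"CopyCount": 1},  # one copy per model; all components share the instance
    )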

Monitoring models in production

Rapid7 continually monitors the models in production to ensure high availability and efficient use of compute resources. The SageMaker AI endpoint automatically uploads logs and metrics into Amazon CloudWatch, which are then forwarded and visualized in Grafana. As part of regular operations, Rapid7 monitors these dashboards to visualize metrics such as model latency, the number of instances behind the endpoint, and invocations and errors over time. Additionally, alerts are configured on response time metrics to maintain system responsiveness and prevent delays in the enrichment pipeline. For more information on the various metrics and their usage, refer to the AWS blog post, Best practices for load testing Amazon SageMaker real-time inference endpoints.
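
As one example of such alerting, a CloudWatch alarm on endpoint latency could be configured as follows; the endpoint name, threshold, and SNS topic are illustrative assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average model latency exceeds one second over five consecutive minutes
cloudwatch.put_metric_alarm(
    AlarmName="cvss-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "cvss-scoring-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1_000_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)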

Conclusion

End-to-end automation of vulnerability scoring model development and deployment has given Rapid7 a consistent, fully automated process. The previous manual process for retraining and redeploying these models was fragile, error-prone, and time-intensive. By implementing an automated pipeline with SageMaker, the engineering team now saves at least 2–3 days of maintenance work each month. By eliminating 20 manual operations, Rapid7 software engineers can focus on delivering higher-impact work for their customers. Furthermore, by using inference components, all models can be consolidated onto a single ml.m5.2xlarge instance, rather than deploying a separate endpoint (and instance) for each model. This approach nearly halves the hourly compute cost, resulting in approximately 50% cloud compute savings for this workload. In building this pipeline, Rapid7 benefited from features that reduced time and cost across multiple steps. For example, using custom containers with the necessary libraries improved startup times, while inference components enabled efficient resource utilization—both were instrumental in building an effective solution.

Most importantly, this automation means that Rapid7 customers always receive the most recently published CVEs with a CVSSv3.1 score assigned. This is especially important for InsightVM because Active Risk Scores, Rapid7’s latest risk strategy for understanding vulnerability impact, rely on the CVSSv3.1 score as a key component in their calculation. Providing accurate and meaningful risk scores is critical for the success of security teams, empowering them to prioritize and address vulnerabilities more effectively.

In summary, automating model training and deployment with Amazon SageMaker Pipelines has enabled Rapid7 to deliver scalable, reliable, and efficient ML solutions. By embracing these best practices and lessons learned, teams can streamline their workflows, reduce operational overhead, and remain focused on driving innovation and value for their customers.


About the authors

Jimmy Cancilla is a Principal Software Engineer at Rapid7, focused on applying machine learning and AI to solve complex cybersecurity challenges. He leads the development of secure, cloud-based solutions that use automation and data-driven insights to improve threat detection and vulnerability management. He is driven by a vision of AI as a tool to augment human work, accelerating innovation, enhancing productivity, and enabling teams to achieve more with greater speed and impact.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Read More

Build secure RAG applications with AWS serverless data lakes

Build secure RAG applications with AWS serverless data lakes

Data is your generative AI differentiator, and successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Traditional data architectures often struggle to meet the unique demands of generative AI applications. An effective generative AI data strategy requires several key components, such as seamless integration of diverse data sources, real-time processing capabilities, comprehensive data governance frameworks that maintain data quality and compliance, and secure access patterns that respect organizational boundaries. In particular, Retrieval Augmented Generation (RAG) applications have emerged as one of the most promising developments in this space. RAG is the process of optimizing the output of a foundation model (FM) so it references a knowledge base outside of its training data sources before generating a response. Such systems require secure, scalable, and flexible data ingestion and access patterns to enterprise data.

In this post, we explore how to build a secure RAG application using serverless data lake architecture, an important data strategy to support generative AI development. We use Amazon Web Services (AWS) services including Amazon S3, Amazon DynamoDB, AWS Lambda, and Amazon Bedrock Knowledge Bases to create a comprehensive solution supporting unstructured data assets, which can be extended to structured data. The post covers how to implement fine-grained access controls for your enterprise data and design metadata-driven retrieval systems that respect security boundaries. These approaches will help you maximize the value of your organization's data while maintaining robust security and compliance.

Use case overview

As an example, consider a RAG-based generative AI application. The following diagram shows the typical conversational workflow that is initiated with a user prompt, for example, operation specialists in a retail company querying internal knowledge to get procurement and supplier details. Each user prompt is augmented with relevant contexts from data residing in an enterprise data lake.

In the solution, the user interacts with the Streamlit frontend, which serves as the application interface. Amazon Cognito enables identity provider (IdP) integration through IAM Identity Center, so that only authorized users can access the application. For production use, we recommend using a more robust frontend framework such as AWS Amplify, which provides a comprehensive set of tools and services for building scalable and secure web applications. After the user has successfully signed in, the application retrieves the list of datasets associated with the user's ID from the DynamoDB table. The list of datasets is used as a filter when querying the knowledge base, so answers come only from datasets the user is authorized to access. This is possible because, when the datasets are ingested, the knowledge base is prepopulated with metadata files containing the user principal-dataset mapping stored in Amazon S3. The knowledge base returns the relevant results, which are then sent back to the application and displayed to the user.

The datasets reside in a serverless data lake on Amazon S3 and are governed using Amazon S3 Access Grants with IAM Identity Center trusted identity propagation enabling automated data permissions at scale. When an access grant is created or deleted for a user or group, the information is added to the DynamoDB table through event-driven architecture using AWS CloudTrail and Amazon EventBridge.

The workflow includes the following key data foundation steps:

  1. Apply access policies to extract permissions and filter results based on the prompting user's role and permissions.
  2. Enforce data privacy policies such as personally identifiable information (PII) redactions.
  3. Enforce fine-grained access control.
  4. Grant the user role permissions for sensitive information and compliance policies based on dataset classification in the data lake.
  5. Extract, transform, and load multimodal data assets into a vector store.

In the following sections, we explain why a modern data strategy is important for generative AI and what challenges it solves.

Serverless data lakes powering RAG applications

Organizations implementing RAG applications face several critical challenges that impact both functionality and cost-effectiveness. At the forefront is security and access control. Applications must carefully balance broad data access with strict security boundaries. These systems need to allow access to data sources by only authorized users, apply dynamic filtering based on permissions and classifications, and maintain security context throughout the entire retrieval and generation process. This comprehensive security approach helps prevent unauthorized information exposure while still enabling powerful AI capabilities.

Data discovery and relevance present another significant hurdle. When dealing with petabytes of enterprise data, organizations must implement sophisticated systems for metadata management and advanced indexing. These systems need to understand query context and intent while efficiently ranking retrieval results to make sure users receive the most relevant information. Without proper attention to these aspects, RAG applications risk returning irrelevant or outdated information that diminishes their utility.

Performance considerations become increasingly critical as these systems scale. RAG applications must maintain consistently low latency while processing large document collections, handling multiple concurrent users, integrating data from distributed sources, and retrieving relevant data. The challenge of balancing real-time and historical data access adds another layer of complexity to maintaining responsive performance at scale.

Cost management represents a final key challenge that organizations must address. Without careful architectural planning, RAG implementations can lead to unnecessary expenses through duplicate data storage, excessive vector database operations, and inefficient data transfer patterns. Organizations need to optimize their resource utilization carefully to help prevent these costs from escalating while maintaining system performance and functionality.

A modern data strategy addresses the complex challenges of RAG applications through comprehensive governance frameworks and robust architectural components. At its core, the strategy implements sophisticated governance mechanisms that go beyond traditional data management approaches. These frameworks enable AI systems to dynamically access enterprise information while maintaining strict control over data lineage, access patterns, and regulatory compliance. By implementing comprehensive provenance tracking, usage auditing, and compliance frameworks, organizations can operate their RAG applications within established ethical and regulatory boundaries.

Serverless data lakes serve as the foundational component of this strategy, offering an elegant solution to both performance and cost challenges. Their inherent scalability automatically handles varying workloads without requiring complex capacity planning, and pay-per-use pricing models facilitate cost efficiency. The ability to support multiple data formats—from structured to unstructured—makes them particularly well-suited for RAG applications that need to process and index diverse document types.

To address security and access control challenges, the strategy implements enterprise-level data sharing mechanisms. These include sophisticated cross-functional access controls and federated access management systems that enable secure data exchange across organizational boundaries. Fine-grained permissions at the row, column, and object levels enforce security boundaries while maintaining necessary data access for AI systems.

Data discoverability challenges are met through centralized cataloging systems that help prevent duplicate efforts and enable efficient resource utilization. This comprehensive approach includes business glossaries, technical catalogs, and data lineage tracking, so that teams can quickly locate and understand available data assets. The catalog system is enriched with quality metrics that help maintain data accuracy and consistency across the organization.

Finally, the strategy implements a structured data classification framework that addresses security and compliance concerns. By categorizing information into clear sensitivity levels from public to restricted, organizations can create RAG applications that only retrieve and process information appropriate to user access levels. This systematic approach to data classification helps prevent unauthorized information disclosure while maintaining the utility of AI systems across different business contexts.

Our solution uses AWS services to create a secure, scalable foundation for enterprise RAG applications. The components are explained in the following sections.

Data lake structure using Amazon S3

Our data lake will use Amazon S3 as the primary storage layer, organized with the following structure:

s3://amzn-s3-demo-enterprise-datalake/
├── retail/
│   ├── product-catalog/
│   ├── customer-data/
│   └── sales-history/
├── finance/
│   ├── financial-statements/
│   ├── tax-documents/
│   └── budget-forecasts/
├── supply-chain/
│   ├── inventory-reports/
│   ├── supplier-contracts/
│   └── logistics-data/
└── shared/
    ├── company-policies/
    ├── knowledge-base/
    └── public-data/

Each business domain has dedicated folders containing domain-specific data, with common data stored in a shared folder.

Data sharing options

There are two options for data sharing. The first option is Amazon S3 Access Points, which provide a dedicated access endpoint policy for different applications or user groups. This approach enables fine-grained control without modifying the base bucket policy.

The following code is an example access point configuration. This policy grants the RetailAnalyticsRole read-only access (GetObject and ListBucket permissions) to data in both the retail-specific directory and the shared directory, but it restricts access to other business domain directories. The policy is attached to a dedicated S3 access point, allowing users with this role to retrieve only data relevant to retail operations and commonly shared resources:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailAnalyticsRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/retail/*",
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/shared/*"
      ]
    }
  ]
}

The second option for data sharing is using bucket policies with path-based access control. Bucket policies can implement path-based restrictions to control which user roles can access specific data directories. The following code is an example bucket policy. This bucket policy implements domain-based access control by granting different permissions based on user roles and data paths. The FinanceUserRole can only access data within the finance and shared directories, and the RetailUserRole can only access data within the retail and shared directories. This pattern enforces data isolation between business domains while facilitating access to common resources. Each role is limited to read-only operations (GetObject and ListBucket) on their authorized directories, which means users can only retrieve data relevant to their business functions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/FinanceUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/finance/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/retail/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    }
  ]
}

As your number of datasets and use cases scales, you might require more policy space. Bucket policies work as long as the necessary policies fit within the policy size limits of S3 bucket policies (20 KB), AWS Identity and Access Management (IAM) policies (5 KB), and within the number of IAM principals allowed per account. With an increasing number of datasets, access points offer a better alternative because each access point has its own dedicated policy. You can define quite granular access control patterns because you can have thousands of access points per AWS Region per account, with a policy up to 20 KB in size for each access point. Although S3 Access Points increase the amount of policy space available, they require a mechanism for clients to discover the right access point for the right dataset.

To manage scale, S3 Access Grants provides a simplified model to map identities in directories such as Active Directory, or IAM principals, to datasets in Amazon S3 by prefix, bucket, or object. With the simplified access scheme in S3 Access Grants, you can grant read-only, write-only, or read-write access according to Amazon S3 prefix to both IAM principals and directly to users or groups from a corporate directory. As a result, you can manage automated data permissions at scale.
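
For reference, creating such a grant might look like the following boto3 sketch; the account ID, access grants location ID, and directory group identifier are illustrative assumptions.

import boto3

s3control = boto3.client("s3control")

# Grant a corporate directory group read access to the retail prefix
s3control.create_access_grant(
    AccountId="123456789012",
    AccessGrantsLocationId="default",
    AccessGrantsLocationConfiguration={
        "S3SubPrefix": "amzn-s3-demo-enterprise-datalake/retail/*"
    },
    Grantee={
        "GranteeType": "DIRECTORY_GROUP",
        "GranteeIdentifier": "retail-analysts-group-id",
    },
    Permission="READ",
)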

An Amazon Comprehend PII redaction job identifies and redacts (or masks) sensitive data in documents residing in Amazon S3. After redaction, documents are verified for redaction effectiveness using Amazon Macie. Documents flagged by Macie are sent to another bucket for manual review, and cleared documents are moved to a redacted bucket ready for ingestion. For more details, refer to Protect sensitive data in RAG applications with Amazon Comprehend.
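
Starting such a redaction job could look like the following sketch; the bucket prefixes and the data access role ARN are illustrative assumptions.

import boto3

comprehend = boto3.client("comprehend")

# Asynchronously redact PII in raw documents before they are ingested
comprehend.start_pii_entities_detection_job(
    JobName="datalake-pii-redaction",
    Mode="ONLY_REDACTION",
    InputDataConfig={
        "S3Uri": "s3://amzn-s3-demo-enterprise-datalake/raw/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://amzn-s3-demo-enterprise-datalake/redacted/"},
    RedactionConfig={
        "PiiEntityTypes": ["ALL"],
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    },
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
)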

User-dataset mapping with DynamoDB

To dynamically manage access permissions, you can use DynamoDB to store mapping information between users or roles and datasets. You can automate the mapping from AWS Lake Formation to DynamoDB using CloudTrail and event-driven Lambda invocation. The DynamoDB structure consists of a table named UserDatasetAccess. Its primary key structure is:

  • Partition key: UserIdentifier (string) – IAM role Amazon Resource Name (ARN) or user ID
  • Sort key: DatasetID (string) – Unique identifier for each dataset

Additional attributes consist of:

  • DatasetPath (string) – S3 path to the dataset
  • AccessLevel (string) – READ, WRITE, or ADMIN
  • Classification (string) – PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
  • Domain (string) – Business domain (such as retail or finance)
  • ExpirationTime (number) – Optional Time To Live (TTL) for temporary access

The following DynamoDB item represents an access mapping between a user role (RetailAnalyst) and a specific dataset (retail-products). It defines that this role has READ access to product catalog data in the retail domain with an INTERNAL security classification. When the RAG application processes a query, it references this mapping to determine which datasets the querying user can access, and the application only retrieves and uses data appropriate for the user’s permissions level.

{
  "UserIdentifier": "arn:aws:iam::123456789012:role/RetailAnalyst",
  "DatasetID": "retail-products",
  "DatasetPath": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/",
  "AccessLevel": "READ",
  "Classification": "INTERNAL",
  "Domain": "retail"
}

This approach provides a flexible, programmatic way to control which users can access specific datasets, enabling fine-grained permission management for RAG applications.
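
At query time, the application can look up a user's datasets with a simple key query; the following is a minimal boto3 sketch against the UserDatasetAccess table described above.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserDatasetAccess")

def get_user_datasets(user_arn):
    """Return the dataset mappings that a user principal is authorized to access."""
    response = table.query(
        KeyConditionExpression=Key("UserIdentifier").eq(user_arn)
    )
    return response["Items"]

# Example: collect the dataset IDs used to filter knowledge base queries
items = get_user_datasets("arn:aws:iam::123456789012:role/RetailAnalyst")
dataset_ids = [item["DatasetID"] for item in items]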

Amazon Bedrock Knowledge Bases for unstructured data

Amazon Bedrock Knowledge Bases provides a managed solution for organizing, indexing, and retrieving unstructured data to support RAG applications. For our solution, we use this service to create domain-specific knowledge bases. With the metadata filtering feature provided by Amazon Bedrock Knowledge Bases, you can retrieve not only semantically relevant chunks but also a well-defined subset of those chunks based on applied metadata filters and associated values. In the next sections, we show how you can set this up.

Configuring knowledge bases with metadata filtering

We organize our knowledge bases to support filtering based on:

  1. Business domain (such as finance, retail, or supply-chain)
  2. Security classification (such as public, internal, confidential, or restricted)
  3. Document type (such as policy, report, or guide)

Each document ingested into our knowledge base includes a standardized metadata structure:

{
  "source_uri": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/shoes-inventory-2023.pdf",
  "title": "Shoes Inventory Report 2023",
  "language": "en",
  "last_updated": "2023-12-15T14:30:00Z",
  "author": "Inventory Management Team",
  "business_domain": "retail",
  "security_classification": "internal",
  "document_type": "inventory_report",
  "departments": ["retail", "supply-chain"],
  "tags": ["footwear", "inventory", "2023"],
  "version": "1.2"
}

Code examples shown throughout this post are for reference only and highlight key API calls and logic. Additional implementation code is required for production deployments.

Amazon Bedrock Knowledge Bases API integration

To demonstrate how our RAG application will interact with the knowledge base, here’s a Python sample using the AWS SDK:

# High-level logic for querying the knowledge base with security filters
import boto3

# Amazon Bedrock Agent Runtime client used for knowledge base retrieval
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def query_knowledge_base(query_text, user_role, business_domain=None):
    # Get permitted classifications based on user role
    permitted_classifications = get_permitted_classifications(user_role)
    
    # Build security filter expression
    filter_expression = build_security_filters(permitted_classifications, business_domain)
    
    # Key API call for retrieval with security filtering
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId='your-kb-id',
        retrievalQuery={'text': query_text},
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 5,
                'filter': filter_expression  # Apply security filters here
            }
        }
    )
    
    return response['retrievalResults']
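
One possible implementation of the build_security_filters helper referenced above is sketched here, using the Amazon Bedrock Knowledge Bases metadata filter syntax and the metadata keys from the earlier document example; treat it as an assumption rather than production code.

def build_security_filters(permitted_classifications, business_domain=None):
    """Build a metadata filter limiting results to permitted classifications and domain."""
    conditions = [
        {"in": {"key": "security_classification", "value": permitted_classifications}}
    ]
    if business_domain:
        conditions.append(
            {"equals": {"key": "business_domain", "value": business_domain}}
        )
    # A single condition is returned as-is; multiple conditions are combined with andAll
    return conditions[0] if len(conditions) == 1 else {"andAll": conditions}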

Conclusion

In this post, we’ve explored how to build a secure RAG application using a serverless data lake architecture. The approach we’ve outlined provides several key advantages:

  1. Security-first design – Fine-grained access controls at scale mean that users only access data they’re authorized for
  2. Scalability – Serverless components automatically handle varying workloads
  3. Cost-efficiency – Pay-as-you-go pricing models optimize expenses
  4. Flexibility – Seamless adaptation to different business domains and use cases

By implementing a modern data strategy with proper governance, security controls, and serverless architecture, organizations can make the most of their data assets for generative AI applications while maintaining security and compliance. The RAG architecture we've described enables contextualized, accurate responses that respect security boundaries, providing a powerful foundation for enterprise AI applications across diverse business domains.

For next steps, consider implementing monitoring and observability to track performance and usage patterns.

For performance and usage monitoring:

  • Deploy Amazon CloudWatch metrics and dashboards to track key performance indicators such as query latency, throughput, and error rates
  • Set up CloudWatch Logs Insights to analyze usage patterns and identify optimization opportunities
  • Implement AWS X-Ray tracing to visualize request flows across your serverless components

For security monitoring and defense:

  • Enable Amazon GuardDuty to detect potential threats targeting your S3 data lake, Lambda functions, and other application resources
  • Implement Amazon Inspector for automated vulnerability assessments of your Lambda functions and container images
  • Configure AWS Security Hub to consolidate security findings and measure cloud security posture across your RAG application resources
  • Use Amazon Macie for continuous monitoring of S3 data lake contents to detect sensitive data exposures

For authentication and activity auditing:

  • Analyze AWS CloudTrail logs to audit API calls across your application stack
  • Implement CloudTrail Lake to create SQL-queryable datasets for security investigations
  • Enable Amazon Cognito advanced security features to detect suspicious sign-in activities

For data access controls:

  • Set up CloudWatch alarms to send alerts about unusual data access patterns
  • Configure AWS Config rules to monitor for compliance with access control best practices
  • Implement AWS IAM Access Analyzer to identify unintended resource access

Other important considerations include:

  • Adding feedback loops to continuously improve retrieval quality
  • Exploring multi-Region deployment for improved resilience
  • Implementing caching layers to optimize frequently accessed content
  • Extending the solution to support structured data assets using AWS Glue and AWS Lake Formation for data transformation and data access

With these foundations in place, your organization will be well-positioned to use generative AI technologies securely and effectively across the enterprise.


About the authors

Venkata Sistla is a Senior Specialist Solutions Architect in the Worldwide team at Amazon Web Services (AWS), with over 12 years of experience in cloud architecture. He specializes in designing and implementing enterprise-scale AI/ML systems across financial services, healthcare, mining and energy, independent software vendors (ISVs), sports, and retail sectors. His expertise lies in helping organizations transform their data challenges into competitive advantages through innovative cloud solutions while mentoring teams and driving technological excellence. He focuses on architecting highly scalable infrastructures that accelerate machine learning initiatives and deliver measurable business outcomes.

Aamna Najmi is a Senior GenAI and Data Specialist in the Worldwide team at Amazon Web Services (AWS). She assists customers across industries and Regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations, bringing a unique perspective of modern data strategies to complement the field of AI. In her spare time, she pursues her passion of experimenting with food and discovering new places.


Read More

AI Testing and Evaluation: Learnings from cybersecurity

AI Testing and Evaluation: Learnings from cybersecurity

Illustrated images of Kathleen Sullivan, Ciaran Martin, and Tori Westerhoff for the Microsoft Research podcast

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In this episode, Sullivan speaks with Professor Ciaran Martin of the University of Oxford about risk assessment and testing in the field of cybersecurity. They explore the importance of differentiated standards for organizations of varying sizes, the role of public-private partnerships, and the opportunity to embed security into emerging technologies from the outset. Later, Tori Westerhoff, a principal director on the Microsoft AI Red Team, joins Sullivan to talk about identifying vulnerabilities in AI products and services. Westerhoff describes AI security in terms she’s heard cybersecurity professionals use for their work—a team sport—and points to cybersecurity’s establishment of a shared language and understanding of risk as a model for AI security.

Transcript

[MUSIC]

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

Today, I’m excited to welcome Ciaran Martin to the podcast to explore testing and risk assessment in cybersecurity. Ciaran is a professor of practice in the management of public organizations at the University of Oxford. He previously founded and served as chief executive of the National Cyber Security Centre within the UK’s intelligence, security, and cyber agency.

And after our conversation, we’ll talk to Microsoft’s Tori Westerhoff, a principal director on Microsoft’s AI Red Team, about how we should think about these insights in the context of AI.

Hi, Ciaran. Thank you so much for being here today.


CIARAN MARTIN: Well, thanks so much for inviting me. It’s great to be here.

SULLIVAN: Ciaran, before we get into some regulatory specifics, it’d be great to hear a little bit more about your origin story, and just take us to that day—who tapped you on the shoulder and said, “Ciaran, we need you to run a national cyber center! Do you fancy building one?”

MARTIN: You could argue that I owe my job to Edward Snowden. Not an obvious thing to say. So the National Cyber Security Centre, which didn’t exist at the time—I was invited to join the British government’s cybersecurity effort in a leadership role—is now a subset of GCHQ. That’s the digital intelligence agency. The equivalent in the US obviously is the NSA [National Security Agency]. It had been convulsed by the Snowden disclosures. It was an unprecedented challenge.

I was a 17-year career government fixer with some national security experience. So I was asked to go out and help with the policy response, the media response, the legal response. But I said, look, any crisis, even one as big as this, is over one way or the other in six months. What should I do long term? And they said, well, we were thinking of asking you to try to help transform our cybersecurity mission. So the National Cyber Security Centre was born, and I was very proud to lead it, and all in all, I did it for seven years from startup to handing it on to somebody else.

SULLIVAN: I mean, it’s incredible. And just building on that, people spend a significant portion of their lives online now with a variety of devices, and maybe for listeners who are newer to cybersecurity, could you give us the 90-second lightning talk? Kind of, what does risk assessment and testing look like in this space?

MARTIN: Well, risk assessment and testing, I think, are two different things. You can’t defend everything. If you defend everything, you’re defending nothing. So broadly speaking, organizations face three threats. One is complete disruption of their systems. So just imagine not being able to access your system. The second is data protection, and that could be sensitive customer information. It could be intellectual property. And the third is, of course, you could be at risk of just straightforward being stolen from. I mean, you don’t want any of them to happen, but you have to have a hierarchy of harm.

SULLIVAN: Yes.

MARTIN: So that’s your risk assessment.

The testing side, I think, is slightly different. One of the paradoxes, I think, of cybersecurity is for such a scientific, data-rich subject, the sort of metrics about what works are very, very hard to come by. So you’ve got boards and corporate leadership and senior governmental structures, and they say, “Look, how do I run this organization safely and securely?” And a cybersecurity chief within the organization will say, “Well, we could get this capability in.” Well, the classic question for a leadership team to ask is, well, what risk and harm will this reduce, by how much, and what’s the cost-benefit analysis? And we find that really hard.

So that’s really where testing and assurance comes in. And also as technology changes so fast, we have to figure out, well, if we’re worried about post-quantum cryptography, for example, what standards does it have to meet? How do you assess whether it’s meeting those standards? So it’s a huge issue in cybersecurity and one that we’re always very conscious of. It’s really hard.

SULLIVAN: Given the scope of cybersecurity, are there any differences in testing, let’s say, for maybe a small business versus a critical infrastructure operator? Are there any, sort of, metrics we can look at in terms of distinguishing risk or assessment?

MARTIN: There have to be. One of the reasons I think why we have to be is that no small business can be expected to take on a hostile nation-state that’s well equipped. You have to be realistic.

If you look at government guidance, certainly in the UK 15 years ago on cybersecurity, you were telling small businesses that are living hand to mouth, week by week, trying to make payments at the end of each month, we were telling them they needed sort of nation-state-level cyber defenses. That was never going to happen, even if they could afford it, which they couldn’t. So you have to have some differentiation. So again, you’ve got assessment frameworks and so forth where you have to meet higher standards. So there absolutely has to be that distinction. Otherwise, you end up in a crazy world of crippling small businesses with just unmanageable requirements which they’re never going to meet.

SULLIVAN: It’s such a great point. You touched on this a little bit earlier, as well, but just cybersecurity governance operates in a fast-moving technology and threat environment. How have testing standards evolved, and where do new technical standards usually originate?

MARTIN: I keep saying this is very difficult, and it is. [LAUGHTER] So I think there are two challenges. One is actually about the balance, and this applies to the technology of today as well as the technology of tomorrow. This is about, how do you make sure things are good enough without crowding out new entrants? You want people to be innovative and dynamic. You want disruptors in this business.

But if you say to them, “Look, well, you have to meet these 14 impossibly high technical standards before you can even sell to anybody or sell to the government,” whatever, then you’ve got a problem. And I think we’ve wrestled with that, and there’s no perfect answer. You just have to try and go to … find the sweet spot between two ends of a spectrum. And that’s going to evolve.

The second point, which in some respects if you’ve got the right capabilities is slightly easier but still a big call, is around, you know, those newer and evolving technologies. And here, having, you know, been a bit sort of gloomy and pessimistic, here I think is actually an opportunity. So one of the things we always say in cybersecurity is that the internet was built and developed without security in mind. And that was kind of true in the ’90s and the noughties, as we call them over here.

But I think as you move into things like post-quantum computing, applied use of AI, and so on, you can actually set the standards at the beginning. And that’s really good because it’s saying to people that these are the things that are going to matter in the post-quantum age. Here’s the outline of the standards you’re going to have to meet; start looking at them. So there’s an opportunity actually to make technology safer by design, by getting ahead of it. And I think that’s the era we’re in now.

SULLIVAN: That makes a lot of sense. Just building on that, do businesses and the public trust these standards? And I guess, which standard do you wish the world would just adopt already, and what’s the real reason they haven’t?

MARTIN: Well, again, where do you start? I mean, most members of the public quite rightly haven’t heard of any of these standards. I think public trust and public capital in any society matters. But I think it is important that these things are credible.

And there’s quite a lot of convergence between, you know, the top-level frameworks. And obviously in the US, you know, the NIST [National Institute of Standards and Technology] framework is the one that’s most popular for cybersecurity, but it bears quite a strong resemblance to the international one, ISO[/IEC] 27001, and there are others, as well. But fundamentally, they boil down to kind of five things. Do a risk assessment; work out what your crown jewels are. Protect your perimeter as best you can. Those are the first two.

The third one then is when your perimeter’s breached, be able to detect it more times than not. And when you can’t do that, you go to the fourth one, which is, can you mitigate it? And when all else fails, how quickly can you recover and manage it? I mean, all the standards are expressed in way more technical language than that, but fundamentally, if everybody adopted those five things and operated them in a simple way, you wouldn’t eliminate the harm, but you would reduce it quite substantially.

SULLIVAN: Which policy initiatives are most promising for incentivizing companies to undertake, you know, these cybersecurity testing parameters that you’ve just outlined? Governments, including the UK, have used carrots and sticks, but what do you think will actually move the needle?

MARTIN: I think there are two answers to that, and it comes back to your split between smaller businesses and critically important businesses. In the critically important services, I think it’s easier because most industries are looking for a level playing field. In other words, they realize there have to be rules and they want to apply them to everyone.

We had a fascinating experience when I was in government back in around 2018 where the telecom sector, they came to us and they said, we’ve got a very good cooperative relationship with the British government, but it needs to be put on a proper legal footing because you’re just asking us nicely to do expensive things. And in a regulated sector, if you actually put in some rules—and please develop them jointly with us; that’s the crucial part—then that will help because it means that we’re not going to our boards and saying, or our shareholders, and saying that we should do this, and they’re saying, “Well, do you have to do it? Are our competitors doing it?” And if the answer to that is, yes, we have to, and, yes, our competitors are doing it, then it tends to be OK.

The harder nut to crack is the smaller business. And I think there’s a real mystery here: why has nobody cracked a really good and easy solution for small business? We need to be careful about this because, you know, you can’t throttle small businesses with onerous regulation. At the same time, we’re not brilliant, I think, in any part of the world at using the normal corporate governance rules to try and get people to figure out how to do cybersecurity.

There are initiatives there that are not the sort of pretty heavy stick that you might have to take to a critical function, but they could help. But that is a hard nut to crack. And I look around the world, and, you know, I think if this was easy, somebody would have figured it out by now. I think most of the developed economies around the world really struggle with cybersecurity for smaller businesses.

SULLIVAN: Yeah, it’s a great point. Actually building on one of the comments you made on the role of, kind of, government, how do you see the role of private-public partnerships scaling and strengthening, you know, robust cybersecurity testing?

MARTIN: I think they’re crucial, but they have to be practical. I’ve got a slight, sort of, high horse on this, if you don’t mind, Kathleen. It’s sort of … [LAUGHS]

SULLIVAN: Of course.

MARTIN: I think that there are two types of public-private partnership. One involves committees saying that we should strengthen partnerships and we should all work together and collaborate and share stuff. And we tried that for a very long time, and it didn’t get us very far. There are other types.

We had some at the National Cyber Security Centre where we paid companies to do spectacularly good technical work that the market wouldn’t provide. So I think it’s sort of partnership with a purpose. I think sometimes, and I understand the human instinct to do this, particularly in governments and big business, they think you need to get around a table and work out some grand strategy to fix everything, and the scale of the … not just the problem but the scale of the whole technology is just too big to do that.

So pick a bit of the problem. Find some ways of doing it. Don’t over-lawyer it. [LAUGHTER] I think sometimes people get very nervous. Oh, well, is this our role? You know, should we be doing this, that, and the other? Well, you know, sometimes certainly in this country, you think, well, who’s actually going to sue you over this, you know? So I wouldn’t over-programmatize it. Just get stuck practically into solving some problems.

SULLIVAN: I love that. Actually, [it] made me think, are there any surprising allies that you’ve gained—you know, maybe someone who you never expected to be a cybersecurity champion—through your work?

MARTIN: Ooh! That’s a … that’s a… what a question! To give you a slightly disappointing answer, but it relates to your previous question. In the early part of my career, I was working in institutions like the UK Treasury long before I was in cybersecurity, and the treasury and the British civil service in general, but the treasury in particular sort of trained you to believe that the private sector was amoral, not immoral, amoral. It just didn’t have values. It just had bottom line, and, you know, its job essentially was to provide employment and revenue then for the government to spend on good things that people cared about. And when I got into cybersecurity and people said, look, you need to develop relations with this cybersecurity company, often in the US, actually. I thought, well, what’s in it for them?

And, sure, sometimes you were paying them for specific services, but other times, there was a real public spiritedness about this. There was a realization that if you tried to delineate public-private boundaries, that it wouldn’t really work. It was a shared risk. And you could analyze where the boundaries fell or you could actually go on and do something about it together. So I was genuinely surprised at the allyship from the cybersecurity sector. Absolutely, I really, really was. And I think it’s a really positive part of certainly the UK cybersecurity ecosystem.

SULLIVAN: Wonderful. Well, we’re coming to the end of our time here, but are there maybe any last thoughts or perhaps requests you have for our listeners today?

MARTIN: I think that standards, assurance, and testing really matter, but it’s a bit like the discussion we’re having over AI. Get all these things to take you 80, 90% of the way and then really apply your judgment. There’s been some bad regulation under the auspices of standards and assurance. First of all, it’s, have you done this assessment? Have you done that? Have you looked at this? Well, fine. And you can tick that box, but what does it actually mean when you do it? Think about which bits you know in your heart of hearts are really important to the defense of your organization that may not be covered by this, and just go and do those anyway. Because sure it helps, but it’s not everything.

SULLIVAN: No. Great, great closing sentiment. Well, Ciaran, thank you for joining us today. This has been just a super fun and really insightful conversation. I really enjoyed it. Thank you.

MARTIN: My pleasure, Kathleen, thank you.

[TRANSITION MUSIC]

SULLIVAN: Now, I’m happy to introduce Tori Westerhoff. As a principal director on the Microsoft AI Red Team, Tori leads all AI security and safety red team operations, as well as dangerous capability testing, to directly inform C-suite decision-makers.

So, Tori, welcome!

TORI WESTERHOFF: Thanks. I am so excited to be here.

SULLIVAN: I’d love to just start a little bit more learning about your background. You’ve worn some very intriguing hats. I mean, cognitive neuroscience grad from Yale, national security consultant, strategist in augmented and virtual reality … how do those experiences help shape the way you lead the Microsoft AI Red Team?

WESTERHOFF: I always joke this is the only role I think will always combine the entire patchwork LinkedIn résumé. [LAUGHS]

I think I use those experiences to help me understand the really broad approach that AI Red Team—artist also known as AIRT; I’m sure I’ll slip into our acronym—how we frame up the broad security implications of AI. So I think the cognitive neuroscience element really helped me initially approach AI hacking, right. There’s a lot of social engineering and manipulation within chat interfaces that are enabled by AI. And also, kind of, this, like, metaphor for understanding how to find soft spots in the way that you see human heuristics show up, too. And so I think that was actually my personal “in” to getting hooked into AI red teaming generally.

But my experience in national security and, I’d also say, working through the AR/VR/metaverse space at the time when I was in it helped me balance both how our impact is framed, how we’re thinking about critical industries, how we’re really trying to push our understanding of where security of AI can help people the most. And also do it at a really breakneck speed in an industry that’s evolving all of the time, that’s really pushing you to always be at the bleeding edge of your understanding. So I draw a lot of the energy and the mission criticality and the speed from those experiences as we’re shaping up how we approach it.

SULLIVAN: Can you just give us a quick rundown? What does the Red Team do? What actually, kind of, is involved on a day-to-day basis? And then as we think about, you know, our engagements with large enterprises and companies, how do we work alongside some of those companies in terms of testing?

WESTERHOFF: The way I see our team is almost like an indicator light that works really part and parcel with product development. So the way we’ve organized our expert red teaming efforts is that we work with product development before anything ships out to anyone who can use it. And our job is to act as expert AI manipulators, AI hackers. And we are supposed to take the theories and methods and new research and harness it to find examples of vulnerabilities or soft spots in products to enable product teams to harden those soft spots before anything actually reaches someone who wants to use it.

So if we’re the indicator light, we are also not the full workup, right. I see that as measurement and evals. And we also are not the mechanic, which is that product development team that’s creating mitigations. It’s platform-security folks who are creating mitigations at scale. And there’s a really great throughput of insights from those groups back into our area where we love to inform about them, but we also love to add on to, how do we break the next thing, right? So it’s a continuous cycle.

And part of that is just being really creative and thinking outside of a traditional cybersecurity box. And part of that is also really thinking about how we pull in research—we have a research function within our AI Red Team—and how we automate and scale. This year, we’ve pulled a lot of those assets and insights into the Azure [AI] Foundry AI Red Teaming Agent. And so folks can now access a lot of our mechanisms through that. So you can get a little taste of what we do day to day in the AI Red Teaming Agent.

SULLIVAN: You recently—actually, with your team—published a report that outlined lessons from testing over a hundred generative AI products. Could you share a bit about what you learned? What were some of the important lessons? Where do you see opportunities to improve the state of red teaming as a method for probing AI safety?

WESTERHOFF: I think the most important takeaway from those lessons is that AI security is truly a team sport. You’ll hear cybersecurity folks say that a lot. And part of the rationale there is that defense in depth, and integrating a view towards AI security through the entire development of AI systems, is really the way that we’re going to approach this with intentionality and responsibility.

So in our space, we really focus on novel harm categories. We are pushing bleeding edge, and we also are pushing iterative and, like, contextually based red teaming in product dev. So outside of those hundred that we’ve done, there’s a community [LAUGHS] through the entire, again, multistage life cycle of a product that is really trying to push the cost of attacking those AI systems higher and higher with all of the expertise they bring. So we may be, like, the experts in AI hacking in that line, but there are also so many partners in the Microsoft ecosystem who are thinking about their market context or they really, really know the people who love their products. How are they using it?

And then when you bubble out, you also have industry and government who are working together to push towards the most secure AI implementation for people, right? And I think our team in particular, we feel really grateful to be part of the big AI safety and security ecosystem at Microsoft and also to be able to contribute to the industry writ large.

SULLIVAN: As you know, we had a chance to speak with Professor Ciaran Martin from the University of Oxford about the cybersecurity industry and governance there. What are some of the ideas and tools from that space that are surfacing in how we think about approaching red teaming and AI governance broadly?

WESTERHOFF: Yeah, I think it’s such a broad set of perspectives to bring in, in the AI instance. Something that I’ve noticed interjecting into security at the AI junction, right, is that cybersecurity has so many decades of experience of working through how to build trustworthy computing, for example, or bring an entire industry to bear in that way. And I think that AI security and safety can learn a lot of lessons of how to bring clarity and transparency across the industry to push universal understanding of where the threats really are.

So frameworks coming out of NIST, coming out of MITRE that help us have a universal language that inform governance, I think, are really important because it brings clarity irrespective of where you are looking into AI security, irrespective of your company size, what you’re working on. It means you all understand, “Hey, we are really worried about this fundamental impact.” And I think cybersecurity has done a really good job of driving towards impact as their organizational vector. And I am starting to see that in the AI space, too, where we’re trying to really clarify terms and threats. And you see it in updates of those frameworks, as well, that I really love.

So I think that the innovation is in transparency to folks who are really innovating and doing the work so we all have a shared language, and from that, it really creates communal goals across security instead of a lot of people being worried about the same thing and talking about it in a different way.

SULLIVAN: Mm-hmm. In the cybersecurity context, Ciaran really stressed matching risk frameworks to an organization’s role and scale. Microsoft plays many roles, including building models and shipping applications. How does your red teaming approach shift across those layers? 

WESTERHOFF: I love this question also because I love it as part of our work. So one of the most fascinating things about working on this team has been the diversity of the technology that we end up red teaming and testing. And it feels like we’re in the crucible in that way. Because we see AI applied to so many different architectures, tech stacks, individual features, models, you name it.

Part of my answer is that we still care about the highest-impact things. And so irrespective of the iteration, which is really fascinating and I love, I still think that our team drives to say, “OK, what is that critical vulnerability that is going to affect people in the largest ways, and can we battle test to see if that can occur?”

So in some ways, the task is always the same. I think in the ways that we change our testing, we customize a lot to the access to systems and data and also people’s trust almost as different variables that could affect the impact, right.

So a good example is if we’re thinking through agentic frameworks that have access to functions and tools and preferential ability to act on data, it’s really different to spaces where that action may not be feasible, right. And so I think the tailoring of the way to get to that impact is hyper-custom every time we start an engagement. And part of it is very thesis driven and almost mechanizing empathy.

You almost need to really focus on how people could use, or misuse, it in such a way that you can emulate it beforehand and give a really strong signal to product development, to say this is truly what people could do, and we want to deliver the highest-impact scenarios so you can solve for those and also solve the underlying patterns, actually, that could contribute to maybe that one piece of evidence but also all the related pieces of evidence. So singular drive, but, like, hyper-, hyper-customization to what that piece of tech could do and has access to.

SULLIVAN: What are some of the unexplored testing approaches or considerations from cybersecurity that you think we should encourage AI technologists, policymakers, and other stakeholders to focus on?

WESTERHOFF: I do love that AI humbles us each and every day with new capabilities and the potential for new capabilities. It’s not just saying, “Hey, there’s one test that we want to try,” but more, “Hey, can we create a methodology that we feel really, really solid about so that when we are asked a question we haven’t even thought of, we feel confident that we have the resources and the system?”

So part of me is really intrigued by the process that we’re asked to make without knowing what those capabilities are really going to bring. And then I think tactically, AIRT is really pushing on how we create new research methodologies. How are we investing in, kind of, these longer-term iterations of red teaming? So we’re really excited about pushing out those insights in an experimental and longer-term way.

I think another element is a little bit of that evolution of how industry standards and frameworks are updating to the AI moment and really articulating where AI is either furthering adversarial ability to create those harms or threats or identifying where AI has a net new harm. And I think that demystifies a little bit of what we talked about in terms of the lessons learned, that fundamentally, a lot of the things that we talk about are traditional security vulnerabilities, and we are standing on kind of those cybersecurity shoulders. And I’m starting to see those updates translate in spaces that are already considered trustworthy and kind of the basis on which not only cybersecurity folks build their work but also business decision-makers make decisions on those frameworks.

So to me, integration of AI into those frameworks by those same standards means that we’re evolving security to include AI. We aren’t creating an entirely new industry of AI security and that, I think, really helps anchor people in the really solid foundation that we have in cybersecurity anyways.

I think there’s also some work around how the cyber, like, defenses will actually benefit from AI. So we think a lot about threats because that’s our job. But the other side of cybersecurity is defense. And I’m seeing a ton of people come out with frameworks and methodologies, especially in the research space, on how defensive networks are going to benefit from things like agentic systems.

Generally speaking, I think the best practice is to realize that we’re fundamentally still talking about the same impacts, and we can use the same avenues, conversations, and frameworks. We just really want them to be crisply updated with that understanding of AI applications.

SULLIVAN: How do you think about bringing others into the fold there? I think those standards and frameworks are often informed by technologists. But I’d love for you to expand [that to] policymakers or other kinds of stakeholders in our ecosystem, even, you know, end consumers of these products. Like, how do we communicate some of this to them in a way that resonates and has real meaning?

WESTERHOFF: I’ve found the AI security-safety space to be one of the more collaborative. I actually think the fact that I’m talking to you today is probably evidence that a ton of people are bringing in perspectives that don’t only come from a long-term cybersecurity view. And I see that as a trend in how AI is being approached as opposed to how those areas were moving earlier. So I think that speed, and the idea of conversations and not always having the perfect answer but really trying to be transparent with what everyone does know, is kind of a communal energy in the communities, at least, where we’re playing. [LAUGHS] So I am pretty biased, but that’s at least true in the spaces where we are.

SULLIVAN: No, I think we’re seeing that across the board. I mean, I’d echo [that] sitting in research, as well: that ability to have impact now and at speed, getting the amazing technology and models that we’re creating into the hands of our customers, partners, and ecosystem, is just underscored.

So on the note of speed, let’s shift gears a little bit to just a quick lightning round. I’d love to get maybe some quick thoughts from you, just 30-second answers here. I’ll start with one.

Which headline-grabbing AI threat do you think is mostly hot air?

WESTERHOFF: I think we should pay attention to it all. I’m a red team lead. I love a good question to see if we can find an answer in real life. So no hot air, just questions.

SULLIVAN: Is there some sort of maybe new tool that you can’t wait to sneak into the red team arsenal?

WESTERHOFF: I think there are really interesting methodologies that break our understanding of cybersecurity by looking at the intersection between different layers of AI and how you can manipulate AI-to-AI interaction, especially now when we’re looking at agentic systems. So I would say a method, not a tool.

SULLIVAN: So maybe ending on a little bit of a lighter note, do you have a go-to snack during an all-night red teaming session?

WESTERHOFF: Always coffee. I would love it to be a protein smoothie, but honestly, it is probably Trader Joe’s elote chips. Like the whole bag. [LAUGHTER] It’s going to get me through. I’m going to not love that I did it.

[MUSIC]

SULLIVAN: Amazing. Well, Tori, thanks so much for joining us today, and just a huge thanks also to Ciaran for his insights, as well.

WESTERHOFF: Thank you so much for having me. This was a joy.

SULLIVAN: And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES]
