How E.ON saves £10 million annually with AI diagnostics for smart meters powered by Amazon Textract


E.ON—headquartered in Essen, Germany—is one of Europe’s largest energy companies, with over 72,000 employees serving more than 50 million customers across 15 countries. As a leading provider of energy networks and customer solutions, E.ON focuses on accelerating the energy transition across Europe. A key part of this mission involves the Smart Energy Solutions division, which manages over 5 million smart meters in the UK alone. These devices help millions of customers track their energy consumption in near real time, receive accurate bills without manual readings, reduce their carbon footprints through more efficient energy management, and access flexible tariffs aligned with their usage.

Historically, diagnosing errors on smart meters required an on-site visit, an approach that was both time-consuming and logistically challenging. To address this challenge, E.ON partnered with AWS to develop a remote diagnostic solution powered by Amazon Textract, a machine learning (ML) service that automatically extracts printed text, handwriting, and structure from scanned documents and images. Instead of dispatching engineers, the customer captures a 7-second video of their smart meter, which is automatically uploaded to AWS through the E.ON application for remote analysis. In real-world testing, the solution delivers 84% diagnostic accuracy. Beyond cost savings, this ML-powered solution enhances consistency in diagnostics and can detect malfunctioning meters before issues escalate.

By transforming on-site inspections into quick-turnaround video analysis, E.ON aims to reduce site visits, accelerate repair times, help assets reach their full expected lifespan, and cut annual costs by £10 million. This solution also helps E.ON maintain its 95% smart meter connectivity target, further demonstrating the company’s commitment to customer satisfaction and operational excellence.

In this post, we dive into how this solution works and the impact it’s making.

The challenge: Smart meter diagnostics at scale

Smart meters are designed to provide near real-time billing data and support better energy management. But when something goes wrong, such as a Wide Area Network (WAN) connectivity error, resolving it has traditionally required dispatching a field technician. With 135,000 on-site appointments annually and costs exceeding £20 million, this approach is neither scalable nor sustainable.

The process is also inconvenient for customers, who often need to take time off work or rearrange their schedules. Even then, resolution isn’t guaranteed. Engineers diagnose faults by visually interpreting a set of LED indicators on the Communications Hub, the device that sits directly on top of the smart meter. These LEDs (SW, WAN, HAN, MESH, and GAS) blink at different frequencies (Off, Low, Medium, or High), and accurate diagnosis requires matching these blink patterns to a technical manual. With no standardized digital output and thousands of possible combinations, the risk of human error is high, and without a confirmed fault in advance, engineers might arrive without the tools needed to resolve the issue.

The following visuals make these differences clear. The first is an animation that mimics how the four states blink in real time, with each pulse lasting 0.1 seconds.

Animation showing the four LED pulse states (Off, Low, Medium, High) and the wait time between each 0.1-second flash.


The following diagram presents a simplified 7-second timeline for each state, showing exactly when pulses occur and how they differ in count and spacing.

Timeline visualization of LED pulse patterns over 7 seconds.


E.ON wanted to change this. They set out to eliminate unnecessary visits, reduce diagnostic errors, and improve the customer experience. Partnering with AWS, they developed a more automated, scalable, and cost-effective way to detect smart meter faults without needing to send an engineer on-site.

From manual to automated diagnostics

In partnership with AWS, E.ON developed a solution where customers record and upload short, 7-second videos of their smart meter. These videos are analyzed by a diagnostic tool, which returns the error and a natural language explanation of the issue directly to the customer’s smartphone. If an engineer visit is necessary, the technician arrives equipped with the right tools, having already received an accurate diagnosis.

The following image shows a typical Communications Hub, mounted above the smart meter. The labeled indicators—SW, WAN, HAN, MESH, and GAS—highlight the LEDs used in diagnostics, illustrating how the system identifies and isolates each region for analysis.

A typical Communications Hub, with LED indicators labeled SW, WAN, MESH, HAN, and GAS.


Solution overview

The diagnostic tool follows three main steps, as outlined in the following data flow diagram:

  1. Upon receiving a 7-second video, the solution breaks it into individual frames. A Signal Intensity metric flags frames where an LED is likely active, drastically reducing the total number of frames requiring deeper analysis.
  2. Next, the tool uses Amazon Textract to find text labels (SW, WAN, MESH, HAN, GAS). These labels, serving as landmarks, guide the system to the corresponding LED regions, where custom signal- and brightness-based heuristics determine whether each LED is on or off.
  3. Finally, the tool counts pulses for each LED over 7 seconds. This pulse count maps directly to Off, Low, Medium, or High frequencies, which in turn align with error codes from the meter’s reference manual. The error code can either be returned directly as shown in the conceptual view or translated into a natural language explanation using a dictionary lookup created from the meter’s reference manual.
A conceptual view of the remote diagnostic pipeline, centered around the use of Textract to extract insights from video input and drive error detection.


A 7-second clip is essential to reduce ambiguity around LED pulse frequency. For instance, at the Low frequency an LED might pulse only once, or not at all, in a shorter five-second window, which could be mistaken for Off. By extending to 7 seconds, each frequency (Off, Low, Medium, or High) becomes unambiguous:

  • Off: 0 pulses
  • Low: 1–2 pulses
  • Medium: 3–4 pulses
  • High: 11–12 pulses

Because there’s no overlap among these pulse counts, the system can now accurately classify each LED’s frequency.
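
To make the mapping concrete, the following minimal Python sketch shows how pulse counts over the 7-second window could be translated into frequency states. The function name and the handling of out-of-band counts are illustrative assumptions, not part of E.ON's implementation.

# Illustrative mapping from 7-second pulse counts to LED frequency states.
# The bands mirror the list above; counts outside any band are treated as unknown.
def classify_frequency(pulse_count: int) -> str:
    if pulse_count == 0:
        return "Off"
    if 1 <= pulse_count <= 2:
        return "Low"
    if 3 <= pulse_count <= 4:
        return "Medium"
    if 11 <= pulse_count <= 12:
        return "High"
    return "Unknown"  # e.g., a partially captured pulse train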

In the following sections, we discuss the three key steps of the solution workflow in more detail.

Step 1: Identify key frames

A modern smartphone typically captures 30 frames per second, resulting in 210 frames over a 7-second video. As seen in the earlier image, many of these frames appear as though the LEDs are off, either because the LEDs are inactive or between pulses, highlighting the need for key frame detection. In practice, only a small subset of the 210 frames will contain a visible lit LED, making it unnecessarily expensive to analyze every frame.

To address this, we introduced a Signal Intensity metric. This simple heuristic examines color channels and assigns each frame a likelihood score of containing an active LED. Frames with a score below a certain threshold are discarded, because they’re unlikely to contain active LEDs. Although the metric might generate a few false positives, it effectively trims down the volume of frames for further processing. Testing in field conditions has shown robust performance across various lighting scenarios and angles.
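
As an illustration of this step, the following sketch shows one way such a Signal Intensity heuristic could look, assuming each frame is an RGB NumPy array. The channel weighting and relative threshold are placeholder assumptions, not the metric E.ON uses in production.

import numpy as np

def signal_intensity(frame: np.ndarray) -> float:
    """Score how likely a frame is to contain a lit LED (illustrative heuristic)."""
    # Favor saturated, bright pixels: compare the max color channel to the channel mean.
    channel_max = frame.max(axis=2).astype(np.float32)
    channel_mean = frame.mean(axis=2).astype(np.float32)
    saturation = channel_max - channel_mean
    # A lit LED shows up as a small cluster of very saturated pixels.
    return float(np.percentile(saturation, 99.9))

def select_key_frames(frames, rel_threshold=1.5):
    """Keep frames whose score clearly exceeds the video's baseline score."""
    scores = [signal_intensity(f) for f in frames]
    baseline = np.median(scores)
    return [f for f, s in zip(frames, scores) if s > rel_threshold * baseline]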

Step 2: Inspect light status

With key frames identified, the system next determines which LEDs are active. It uses Amazon Textract to treat the meter’s panel like a document. Amazon Textract identifies all visible text in the frame, and the diagnostic system then parses this output to isolate only the relevant labels: “SW,” “WAN,” “MESH,” “HAN,” and “GAS,” filtering out unrelated text.

The following image shows a key frame processed by Amazon Textract. The bounding boxes show detected text; LED labels appear in red after text matching.

A key frame processed by Amazon Textract. The bounding boxes show detected text; LED labels appear in red after text matching.


Because each Communications Hub follows standard dimensions, the LED for each label is consistently located just above it. Using the bounding box coordinates from Amazon Textract as our landmark, the system calculates an “upward” direction for the meter and places a new bounding region above each label, pinpointing the pixels corresponding to each LED. The resulting key frame highlights exactly where to look for LED activity.

To illustrate this, the following image of a key frame shows how the system maps each detected label (“SW,” “WAN,” “MESH,” “HAN,” “GAS”) to its corresponding LED region. Each region is automatically defined using the Amazon Textract output and geometric rules, allowing the system to isolate just the areas that matter for diagnosis.

A key frame showing the exact LED regions for “SW,” “WAN,” “MESH,” “HAN,” and “GAS.”


With the LED regions now precisely defined, the tool evaluates whether each one is on or off. Because E.ON didn’t have a labeled dataset large enough to train a supervised ML model, we opted for a heuristic approach. We combined the Signal Intensity metric from Step 1 with a brightness threshold to determine LED status. By using relative rather than absolute thresholds, the method remains robust across different lighting conditions and angles, even if an LED’s glow reflects off neighboring surfaces. The end result is a simple on/off status for each LED in every key frame, feeding into the final error classification in Step 3.
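
The following sketch outlines how the Amazon Textract output could be turned into LED regions and a simple on/off decision for one key frame. It uses the real DetectDocumentText API through boto3, but the region offset, brightness measure, and threshold are illustrative assumptions rather than E.ON's exact heuristics.

import boto3
import numpy as np

LED_LABELS = {"SW", "WAN", "MESH", "HAN", "GAS"}
textract = boto3.client("textract")

def locate_led_regions(frame_jpeg_bytes):
    """Map each detected label to a region just above its bounding box."""
    response = textract.detect_document_text(Document={"Bytes": frame_jpeg_bytes})
    regions = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "WORD" and block["Text"].upper() in LED_LABELS:
            box = block["Geometry"]["BoundingBox"]  # normalized coordinates
            # Assume the LED sits roughly one box-height above the label (illustrative offset).
            regions[block["Text"].upper()] = {
                "Left": box["Left"],
                "Top": max(0.0, box["Top"] - box["Height"]),
                "Width": box["Width"],
                "Height": box["Height"],
            }
    return regions

def led_is_on(frame: np.ndarray, region, rel_threshold=1.3) -> bool:
    """Compare the region's brightness to the whole frame (relative threshold)."""
    h, w = frame.shape[:2]
    top, left = int(region["Top"] * h), int(region["Left"] * w)
    bottom, right = top + int(region["Height"] * h), left + int(region["Width"] * w)
    patch = frame[top:bottom, left:right]
    return patch.mean() > rel_threshold * frame.mean()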

Step 3: Aggregate results to determine the error

Now that each key frame has an on/off status for each LED, the final step is to determine how many times each light pulses during the 7-second clip. This pulse count reveals which frequency (Off, Low, Medium, or High) each LED is blinking at, allowing the solution to identify the appropriate error code from the Communications Hub’s reference manual, just like a field engineer would, but in a fully automated way.

To calculate the number of pulses, the system first groups consecutive “on” frames. Because one pulse of light typically lasts 0.1 seconds, or about 2–3 frames, a continuous block of “on” frames represents a single pulse. After grouping these blocks, the total number of pulses for each LED can be counted. Thanks to the 7-second recording window, the mapping from pulse count to frequency is unambiguous.
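
A minimal sketch of this grouping logic follows; it assumes a per-frame on/off list for a single LED and reuses the illustrative classify_frequency helper from earlier. The exact grouping rules in E.ON's solution may differ.

def count_pulses(on_flags):
    """Count pulses by grouping consecutive 'on' frames into single pulses."""
    pulses, in_pulse = 0, False
    for is_on in on_flags:
        if is_on and not in_pulse:
            pulses += 1      # rising edge: a new pulse starts
        in_pulse = is_on     # stay in the same pulse while frames remain 'on'
    return pulses

# Example: one pulse spanning 3 frames, then a second pulse spanning 2 frames
flags = [False, True, True, True, False, False, True, True, False]
print(count_pulses(flags))                       # -> 2
print(classify_frequency(count_pulses(flags)))   # -> "Low"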

After each LED’s frequency is determined, the system simply references the meter’s manual to find the corresponding error. This final diagnostic result is then relayed back to the customer.

The following demo video shows this process in action, with a user uploading a 7-second clip of their meter. In just 5.77 seconds, the application detects a WAN error, explains how it arrived at that conclusion, and outlines the steps an engineer would take to address the issue.

Conclusion

E.ON’s story highlights how a creative application of Amazon Textract, combined with custom image analysis and pulse counting, can solve a real-world challenge at scale. By diagnosing smart meter errors through brief smartphone videos, E.ON aims to lower costs, improve customer satisfaction, and enhance overall energy service reliability.

Although the system is still being field tested, initial results are encouraging: approximately 350 cases per week (18,200 annually) can now be diagnosed remotely, with an estimated £10 million in projected annual savings. Real-world accuracy stands at 84%, without extensive tuning, while controlled environments have shown a 100% success rate. Notably, the tool has even caught errors that field engineers initially missed, pointing to opportunities for refined training and proactive fault detection.

Looking ahead, E.ON plans to expand this approach to other devices and integrate advanced computer vision techniques to further boost accuracy. If you’re interested in exploring a similar solution, consider the following next steps:

  • Explore the Amazon Textract documentation to learn how you can streamline text extraction for your own use cases
  • Alternatively, consider Amazon Bedrock Document Automation for a generative AI-powered alternative to extract insights from multimodal content in audio, documents, images, and video
  • Browse the Amazon Machine Learning Blog to discover innovative ways customers use AWS ML services to drive efficiency and reduce costs
  • Contact your AWS Account Manager to discuss your specific needs to design a proof of concept or production-ready solution

By combining domain expertise with AWS services, E.ON demonstrates how an AI-driven strategy can transform operational efficiency, even in early stages. If you’re considering a similar path, these resources can help you unlock the power of AWS AI and ML to meet your unique business goals.


About the Authors

Sam Charlton is a Product Manager at E.ON who looks for innovative ways to use existing technology against entrenched issues often ignored. Starting in the contact center, he has worked the breadth and depth of E.ON, ensuring a holistic stance for his business’s needs.

Tanrajbir Takher is a Data Scientist at the AWS Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.

Satyam Saxena is an Applied Science Manager at the AWS Generative AI Innovation Center. He leads generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production, with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.

Tom Chester is an AI Strategist at the AWS Generative AI Innovation Center, working directly with AWS customers to understand the business problems they are trying to solve with generative AI and helping them scope and prioritize use cases. Tom has over a decade of experience in data and AI strategy and data science consulting.

Amit Dhingra is a GenAI/ML Sr. Sales Specialist in the UK. He works as a trusted advisor to customers by providing guidance on how they can unlock new value streams, solve key business problems, and deliver results for their customers using AWS generative AI and ML services.


Building intelligent AI voice agents with Pipecat and Amazon Bedrock – Part 1


Voice AI is transforming how we interact with technology, making conversational interactions more natural and intuitive than ever before. At the same time, AI agents are becoming increasingly sophisticated, capable of understanding complex queries and taking autonomous actions on our behalf. As these trends converge, you see the emergence of intelligent AI voice agents that can engage in human-like dialogue while performing a wide range of tasks.

In this series of posts, you will learn how to build intelligent AI voice agents using Pipecat, an open-source framework for voice and multimodal conversational AI agents, with foundation models on Amazon Bedrock. The series includes high-level reference architectures, best practices, and code samples to guide your implementation.

Approaches for building AI voice agents

There are two common approaches for building conversational AI agents:

  • Using cascaded models: In this post (Part 1), you will learn about the cascaded models approach, diving into the individual components of a conversational AI agent. With this approach, voice input passes through a series of architecture components before a voice response is sent back to the user. This approach is also sometimes referred to as pipeline or component model voice architecture.
  • Using speech-to-speech foundation models in a single architecture: In Part 2, you will learn how Amazon Nova Sonic, a state-of-the-art, unified speech-to-speech foundation model can enable real-time, human-like voice conversations by combining speech understanding and generation in a single architecture.

Common use cases

AI voice agents can handle multiple use cases, including but not limited to:

  • Customer Support: AI voice agents can handle customer inquiries 24/7, providing instant responses and routing complex issues to human agents when necessary.
  • Outbound Calling: AI agents can conduct personalized outreach campaigns, scheduling appointments or following up on leads with natural conversation.
  • Virtual Assistants: Voice AI can power personal assistants that help users manage tasks and answer questions.

Architecture: Using cascaded models to build an AI voice agent

To build an agentic voice AI application with the cascaded models approach, you need to orchestrate multiple architecture components involving multiple machine learning and foundation models.


Figure 1: Architecture overview of a Voice AI Agent using Pipecat

These components include:

WebRTC Transport: Enables real-time audio streaming between client devices and the application server.

Voice Activity Detection (VAD): Detects speech using Silero VAD with configurable speech start and speech end times, and noise suppression capabilities to remove background noise and enhance audio quality.

Automatic Speech Recognition (ASR): Uses Amazon Transcribe for accurate, real-time speech-to-text conversion.

Natural Language Understanding (NLU): Interprets user intent using latency-optimized inference on Amazon Bedrock with models like Amazon Nova Pro, optionally enabling prompt caching to optimize for speed and cost efficiency in Retrieval Augmented Generation (RAG) use cases.

Tools Execution and API Integration: Executes actions or retrieves information for RAG by integrating backend services and data sources via Pipecat Flows and leveraging the tool use capabilities of foundation models.

Natural Language Generation (NLG): Generates coherent responses using Amazon Nova Pro on Bedrock, offering the right balance of quality and latency.

Text-to-Speech (TTS): Converts text responses back into lifelike speech using Amazon Polly with generative voices.

Orchestration Framework: Pipecat orchestrates these components, offering a modular Python-based framework for real-time, multimodal AI agent applications.

Best practices for building effective AI voice agents

Developing responsive AI voice agents requires focus on latency and efficiency. While best practices continue to emerge, consider the following implementation strategies to achieve natural, human-like interactions:

Minimize conversation latency: Use latency-optimized inference for foundation models (FMs) like Amazon Nova Pro to maintain natural conversation flow (see the code sketch following these practices).

Select efficient foundation models: Prioritize smaller, faster foundation models (FMs) that can deliver quick responses while maintaining quality.

Implement prompt caching: Utilize prompt caching to optimize for both speed and cost efficiency, especially in complex scenarios requiring knowledge retrieval.

Deploy text-to-speech (TTS) fillers: Use natural filler phrases (such as “Let me look that up for you”) before intensive operations to maintain user engagement while the system makes tool calls or long-running calls to your foundation models.

Build a robust audio input pipeline: Integrate components like noise suppression to support clear audio quality for better speech recognition results.

Start simple and iterate: Begin with basic conversational flows before progressing to complex agentic systems that can handle multiple use cases.

Region availability: Latency-optimized inference and prompt caching may only be available in certain AWS Regions. Evaluate the trade-off between using these advanced capabilities and selecting a Region that is geographically closer to your end users.
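
As a concrete illustration of the latency practice above, here is a minimal boto3 sketch that requests latency-optimized inference through the Amazon Bedrock Converse API. The Region, model ID, and prompt are placeholder assumptions, and availability of the optimized latency tier varies by model and Region.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # example Region

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-pro-v1:0",  # assumed cross-Region inference profile ID
    messages=[{"role": "user", "content": [{"text": "What are my next steps to fix my meter?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.3},
    # Request latency-optimized inference where the model and Region support it.
    performanceConfig={"latency": "optimized"},
)

print(response["output"]["message"]["content"][0]["text"])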

Example implementation: Build your own AI voice agent in minutes

This post provides a sample application on GitHub that demonstrates the concepts discussed. It uses Pipecat and its accompanying state management framework, Pipecat Flows, with Amazon Bedrock, along with Web Real-Time Communication (WebRTC) capabilities from Daily to create a working voice agent you can try in minutes.

Prerequisites

To set up the sample application, you should have the following prerequisites:

  • Python 3.10+
  • An AWS account with appropriate Identity and Access Management (IAM) permissions for Amazon Bedrock, Amazon Transcribe, and Amazon Polly
  • Access to foundation models on Amazon Bedrock
  • Access to an API key for Daily
  • Modern web browser (such as Google Chrome or Mozilla Firefox) with WebRTC support

Implementation Steps

After you complete the prerequisites, you can start setting up your sample voice agent:

  1. Clone the repository:
    git clone https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock 
    cd build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock/part-1 
  2. Set up the environment:
    cd server
    python3 -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install -r requirements.txt
  3. Configure your API key and AWS credentials in .env:
    DAILY_API_KEY=your_daily_api_key
    AWS_ACCESS_KEY_ID=your_aws_access_key_id
    AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
    AWS_REGION=your_aws_region
  4. Start the server:
    python server.py
  5. Connect via browser at http://localhost:7860 and grant microphone access
  6. Start the conversation with your AI voice agent

Customizing your voice AI agent

To customize, you can start by:

  • Modifying flow.py to change conversation logic
  • Adjusting model selection in bot.py for your latency and quality needs

To learn more, see the documentation for Pipecat Flows and review the README of our code sample on GitHub.

Cleanup

The instructions above are for setting up the application in your local environment. The local application uses AWS services and Daily through IAM and API credentials. For security and to avoid unanticipated costs, delete these credentials when you are finished so that they can no longer be used.

Accelerating voice AI implementations

To accelerate AI voice agent implementations, AWS Generative AI Innovation Center (GAIIC) partners with customers to identify high-value use cases and develop proof-of-concept (PoC) solutions that can quickly move to production.

Customer Testimonial: InDebted

InDebted, a global fintech transforming the consumer debt industry, collaborated with AWS to develop its voice AI prototype.

“We believe AI-powered voice agents represent a pivotal opportunity to enhance the human touch in financial services customer engagement. By integrating AI-enabled voice technology into our operations, our goals are to provide customers with faster, more intuitive access to support that adapts to their needs, as well as improving the quality of their experience and the performance of our contact centre operations,” says Mike Zhou, Chief Data Officer at InDebted.

By collaborating with AWS and leveraging Amazon Bedrock, organizations like InDebted can create secure, adaptive voice AI experiences that meet regulatory standards while delivering real, human-centric impact in even the most challenging financial conversations.

Conclusion

Building intelligent AI voice agents is now more accessible than ever through the combination of open-source frameworks such as Pipecat and powerful foundation models with latency-optimized inference and prompt caching on Amazon Bedrock.

In this post, you learned about two common approaches on how to build AI voice agents, delving into the cascaded models approach and its key components. These essential components work together to create an intelligent system that can understand, process, and respond to human speech naturally. By leveraging these rapid advancements in generative AI, you can create sophisticated, responsive voice agents that deliver real value to your users and customers.

To get started with your own voice AI project, try our code sample on GitHub or contact your AWS account team to explore an engagement with the AWS Generative AI Innovation Center (GAIIC).

You can also learn about building AI voice agents using a unified speech-to-speech foundation model, Amazon Nova Sonic, in Part 2.


About the Authors

Adithya Suresh serves as a Deep Learning Architect at the AWS Generative AI Innovation Center, where he partners with technology and business teams to build innovative generative AI solutions that address real-world challenges.

Daniel Wirjo is a Solutions Architect at AWS, focused on FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.

Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.

Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.


Stream multi-channel audio to Amazon Transcribe using the Web Audio API


Multi-channel transcription streaming is a feature of Amazon Transcribe that can be used in many cases with a web browser. Creating this stream source has its challenges, but with the JavaScript Web Audio API, you can connect and combine different audio sources like videos, audio files, or hardware such as microphones to obtain transcripts.

In this post, we guide you through how to use two microphones as audio sources, merge them into a single dual-channel audio stream, perform the required encoding, and stream it to Amazon Transcribe. Source code for a Vue.js application is provided, which requires two microphones connected to your browser. However, the versatility of this approach extends far beyond this use case—you can adapt it to accommodate a wide range of devices and audio sources.

With this approach, you can get transcripts for two sources in a single Amazon Transcribe session, offering cost savings and other benefits compared to using a separate session for each source.

Challenges when using two microphones

For our use case, using a single-channel stream for two microphones and enabling Amazon Transcribe speaker label identification to identify the speakers might be enough, but there are a few considerations:

  • Speaker labels are randomly assigned at session start, meaning you will have to map the results in your application after the stream has started
  • Mislabeled speakers with similar voice tones can happen, which even for a human is hard to distinguish
  • Voice overlapping can occur when two speakers talk at the same time with one audio source

By using two audio sources with microphones, you can address these concerns by making sure each transcription is from a fixed input source. By assigning a device to a speaker, our application knows in advance which transcript to use. However, you might still encounter voice overlapping if two nearby microphones are picking up multiple voices. This can be mitigated by using directional microphones, volume management, and Amazon Transcribe word-level confidence scores.

Solution overview

The following diagram illustrates the solution workflow.


Application diagram for two microphones

We use two audio inputs with the Web Audio API. With this API, we can merge the two inputs, Mic A and Mic B, into a single audio data source, with the left channel representing Mic A and the right channel representing Mic B.

Then, we convert this audio source to PCM (Pulse-Code Modulation) audio. PCM is a common format for audio processing, and it’s one of the formats required by Amazon Transcribe for the audio input. Finally, we stream the PCM audio to Amazon Transcribe for transcription.

Prerequisites

You should have the following prerequisites in place: an AWS account, two microphones connected to your browser, and AWS credentials with an IAM policy that allows streaming transcription over WebSocket, for example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DemoWebAudioAmazonTranscribe",
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscriptionWebSocket",
      "Resource": "*"
    }
  ]
}

Start the application

Complete the following steps to launch the application:

  1. Go to the root directory where you downloaded the code.
  2. Create a .env file to set up your AWS access keys from the env.sample file.
  3. Install packages and run bun install (if you’re using Node.js, run npm install).
  4. Start the web server and run bun dev (if you’re using Node.js, run npm run dev).
  5. Open your browser at http://localhost:5173/.


    Application running on http://localhost:5173 with two connected microphones

Code walkthrough

In this section, we examine the important code pieces for the implementation:

  1. The first step is to list the connected microphones by using the browser API navigator.mediaDevices.enumerateDevices():
const devices = await navigator.mediaDevices.enumerateDevices()
return devices.filter((d) => d.kind === 'audioinput')
  2. Next, you need to obtain the MediaStream object for each of the connected microphones. This can be done using the navigator.mediaDevices.getUserMedia() API, which enables access to the user’s media devices (such as cameras and microphones). You can then retrieve a MediaStream object that represents the audio or video data from those devices:
const streams = []
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    deviceId: device.deviceId,
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
})

if (stream) streams.push(stream)
  3. To combine the audio from the multiple microphones, you need to create an AudioContext interface for audio processing. Within this AudioContext, you can use ChannelMergerNode to merge the audio streams from the different microphones. The connect(destination, src_idx, ch_idx) method arguments are:
    • destination – The destination, in our case mergerNode.
    • src_idx – The source channel index, in our case both 0 (because each microphone is a single-channel audio stream).
    • ch_idx – The channel index for the destination, in our case 0 and 1 respectively, to create a stereo output.
// instance of audioContext
const audioContext = new AudioContext({
       sampleRate: SAMPLE_RATE,
})
// this is used to process the microphone stream data
const audioWorkletNode = new AudioWorkletNode(audioContext, 'recording-processor', {...})
// microphone A
const audioSourceA = audioContext.createMediaStreamSource(mediaStreams[0]);
// microphone B
const audioSourceB = audioContext.createMediaStreamSource(mediaStreams[1]);
// audio node for two inputs
const mergerNode = audioContext.createChannelMerger(2);
// connect the audio sources to the mergerNode destination.  
audioSourceA.connect(mergerNode, 0, 0);
audioSourceB.connect(mergerNode, 0, 1);
// connect our mergerNode to the AudioWorkletNode
mergerNode.connect(audioWorkletNode);
  4. The microphone data is processed in an AudioWorklet that emits data messages every defined number of recording frames. These messages contain the audio data encoded in PCM format to send to Amazon Transcribe. Using the p-event library, you can asynchronously iterate over the events from the Worklet. A more in-depth description of this Worklet is provided in the next section of this post.
import { pEventIterator } from 'p-event'
...

// Register the worklet
try {
  await audioContext.audioWorklet.addModule('./worklets/recording-processor.js')
} catch (e) {
  console.error('Failed to load audio worklet')
}

//  An async iterator 
const audioDataIterator = pEventIterator<'message', MessageEvent<AudioWorkletMessageDataType>>(
  audioWorkletNode.port,
  'message',
)
...

// AsyncIterableIterator: Every time the worklet emits an event with the message `SHARE_RECORDING_BUFFER`, this iterator will return the AudioEvent object that we need.
const getAudioStream = async function* (
  audioDataIterator: AsyncIterableIterator<MessageEvent<AudioWorkletMessageDataType>>,
) {
  for await (const chunk of audioDataIterator) {
    if (chunk.data.message === 'SHARE_RECORDING_BUFFER') {
      const { audioData } = chunk.data
      yield {
        AudioEvent: {
          AudioChunk: audioData,
        },
      }
    }
  }
}
  5. To start streaming the data to Amazon Transcribe, you can use the iterator you created and set NumberOfChannels: 2 and EnableChannelIdentification: true to enable dual-channel transcription. For more information, refer to the AWS SDK StartStreamTranscriptionCommand documentation.
import {
  LanguageCode,
  MediaEncoding,
  StartStreamTranscriptionCommand,
} from '@aws-sdk/client-transcribe-streaming'

const command = new StartStreamTranscriptionCommand({
    LanguageCode: LanguageCode.EN_US,
    MediaEncoding: MediaEncoding.PCM,
    MediaSampleRateHertz: SAMPLE_RATE,
    NumberOfChannels: 2,
    EnableChannelIdentification: true,
    ShowSpeakerLabel: true,
    AudioStream: getAudioStream(audioIterator),
  })
  6. After you send the request, a WebSocket connection is created to exchange audio stream data and Amazon Transcribe results:
const data = await client.send(command)
for await (const event of data.TranscriptResultStream) {
    for (const result of event.TranscriptEvent.Transcript.Results || []) {
        callback({ ...result })
    }
}

The result object will include a ChannelId property that you can use to identify your microphone source, such as ch_0 and ch_1, respectively.

Deep dive: Audio Worklet

Audio Worklets can execute in a separate thread to provide very low-latency audio processing. The implementation and demo source code can be found in the public/worklets/recording-processor.js file.

For our case, we use the Worklet to perform two main tasks:

  1. Process the mergerNode audio in an iterable way. This node includes both of our audio channels and is the input to our Worklet.
  2. Encode the data bytes of the mergerNode output into the PCM signed 16-bit little-endian audio format. We do this for each iteration or when required to emit a message payload to our application.

The general code structure to implement this is as follows:

class RecordingProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
  }
  process(inputs, outputs) {...}
}

registerProcessor('recording-processor', RecordingProcessor)

You can pass custom options to this Worklet instance using the processorOptions attribute. In our demo, we set maxFrameCount: (SAMPLE_RATE * 4) / 10 as a frame-count threshold to determine when to emit a new message payload. An example message is:

this.port.postMessage({
  message: 'SHARE_RECORDING_BUFFER',
  buffer: this._recordingBuffer,
  recordingLength: this.recordedFrames,
  audioData: new Uint8Array(pcmEncodeArray(this._recordingBuffer)), // PCM encoded audio format
})

PCM encoding for two channels

One of the most important sections is how to encode to PCM for two channels. Following the AWS documentation in the Amazon Transcribe API Reference, the AudioChunk size is defined by: Duration (s) * Sample Rate (Hz) * Number of Channels * 2. For two channels, 1 second at 16000Hz is: 1 * 16000 * 2 * 2 = 64000 bytes. Our encoding function should then look like this:

// Notice that input is an array, where each element is a channel with Float32 values between -1.0 and 1.0 from the AudioWorkletProcessor.
const pcmEncodeArray = (input: Float32Array[]) => {
  const numChannels = input.length
  const numSamples = input[0].length
  const bufferLength = numChannels * numSamples * 2 // 2 bytes per sample per channel
  const buffer = new ArrayBuffer(bufferLength)
  const view = new DataView(buffer)

  let index = 0

  for (let i = 0; i < numSamples; i++) {
    // Encode for each channel
    for (let channel = 0; channel < numChannels; channel++) {
      const s = Math.max(-1, Math.min(1, input[channel][i]))
      // Convert the 32 bit float to 16 bit PCM audio waveform samples.
      // Max value: 32767 (0x7FFF), Min value: -32768 (-0x8000) 
      view.setInt16(index, s < 0 ? s * 0x8000 : s * 0x7fff, true)
      index += 2
    }
  }
  return buffer
}

For more information about how the audio data blocks are handled, see AudioWorkletProcessor: process() method. For more information on PCM format encoding, see Multimedia Programming Interface and Data Specifications 1.0.

Conclusion

In this post, we explored the implementation details of a web application that uses the browser’s Web Audio API and Amazon Transcribe streaming to enable real-time dual-channel transcription. By using the combination of AudioContext, ChannelMergerNode, and AudioWorklet, we were able to seamlessly process and encode the audio data from two microphones before sending it to Amazon Transcribe for transcription. The use of the AudioWorklet in particular allowed us to achieve low-latency audio processing, providing a smooth and responsive user experience.

You can build upon this demo to create more advanced real-time transcription applications that cater to a wide range of use cases, from meeting recordings to voice-controlled interfaces.

Try out the solution for yourself, and leave your feedback in the comments.


About the Author

Jorge Lanzarotti is a Sr. Prototyping SA at Amazon Web Services (AWS) based in Tokyo, Japan. He helps customers in the public sector by creating innovative solutions to challenging problems.


How Kepler democratized AI access and enhanced client services with Amazon Q Business


This is a guest post co-authored by Evan Miller, Noah Kershaw, and Valerie Renda of Kepler Group

At Kepler, a global full-service digital marketing agency serving Fortune 500 brands, we understand the delicate balance between creative marketing strategies and data-driven precision. Our company name draws inspiration from the visionary astronomer Johannes Kepler, reflecting our commitment to bringing clarity to complex challenges and illuminating the path forward for our clients.

In this post, we share how implementing Amazon Q Business transformed our operations by democratizing AI access across our organization while maintaining stringent security standards, resulting in an average savings of 2.7 hours per week per employee in manual work and improved client service delivery.

The challenge: Balancing innovation with security

As a digital marketing agency working with Fortune 500 clients, we faced increasing pressure to use AI capabilities while making sure that we maintain the highest levels of data security. Our previous solution lacked essential features, which led team members to consider more generic solutions. Specifically, the original implementation was missing critical capabilities such as chat history functionality, preventing users from accessing or referencing their prior conversations. This absence of conversation context meant users had to repeatedly provide background information in each interaction. Additionally, the solution had no file upload capabilities, limiting users to text-only interactions. These limitations resulted in a basic AI experience where users often had to compromise by rewriting prompts, manually maintaining context, and working around the inability to process different file formats. The restricted functionality ultimately pushed teams to explore alternative solutions that could better meet their comprehensive needs.

Being an International Organization for Standardization (ISO) 27001-certified organization, we needed an enterprise-grade solution that would meet our strict security requirements without compromising on functionality. Our ISO 27001 certification mandates rigorous security controls, which meant that public AI tools weren’t suitable for our needs. We required a solution that could be implemented within our secure environment while maintaining full compliance with our stringent security protocols.

Why we chose Amazon Q Business

Our decision to implement Amazon Q Business was driven by three key factors that aligned perfectly with our needs. First, because our Kepler Intelligence Platform (Kip) infrastructure already resided on Amazon Web Services (AWS), the integration process was seamless. Our Amazon Q Business implementation uses three core connectors (Amazon Simple Storage Service (Amazon S3), Google Drive, and Amazon Athena), though our wider data ecosystem includes 35–45 different platform integrations, primarily flowing through Amazon S3. Second, the commitment from Amazon Q Business to not use our data for model training satisfied our essential security requirements. Finally, the Amazon Q Business apps functionality enabled us to develop no-code solutions for everyday challenges, democratizing access to efficient workflows without requiring additional software developers.

Implementation journey

We began our Amazon Q Business implementation journey in early 2025 with a focused pilot group of 10 participants, expanding to 100 users in February and March, with plans for a full deployment reaching 500+ employees. During this period, we organized an AI-focused hackathon that catalyzed organic adoption and sparked creative solutions. The implementation was unique in how we integrated Amazon Q Business into our existing Kepler Intelligence Platform, rebranding it as Kip AI to maintain consistency with our internal systems.

Kip AI demonstrates how we’ve comprehensively integrated AI capabilities with our existing data infrastructure. We use multiple data sources, including Amazon S3 for our storage needs, Amazon QuickSight for our business intelligence requirements, and Google Drive for team collaboration. At the heart of our system is our custom extract, transform, and load (ETL) pipeline (Kip SSoT), which we’ve designed to feed data into QuickSight for AI-enabled analytics. We’ve configured Amazon Q Business to seamlessly connect with these data sources, allowing our team members to access insights through both a web interface and browser extension. The following figure shows the architecture of Kip AI.

This integrated approach helps ensure that Kepler’s employees can securely access AI capabilities while maintaining data governance and security requirements crucial for their clients. Access to the platform is secured through AWS Identity and Access Management (IAM), connected to our single sign-on provider, ensuring that only authorized personnel can use the system. This careful approach to security and access management has been crucial in maintaining our clients’ trust while rolling out AI capabilities across our organization.

Transformative use cases and results

The implementation of Amazon Q Business has revolutionized several key areas of our operations. Our request for information (RFI) response process, which traditionally consumed significant time and resources, has been streamlined dramatically. Teams now report saving over 10 hours per RFI response, allowing us to pursue more business opportunities efficiently.

Client communications have also seen substantial improvements. The platform helps us draft clear, consistent, and timely communications, from routine emails to comprehensive status reports and presentations. This enhancement in communication quality has strengthened our client relationships and improved service delivery.

Perhaps most significantly, we’ve achieved remarkable efficiency gains across the organization. Our employees report saving an average of 2.7 hours per week in manual work, with user satisfaction rates exceeding 87%. The platform has enabled us to standardize our approach to insight generation, ensuring consistent, high-quality service delivery across all client accounts.

Looking ahead

As we expand Amazon Q Business access to all Kepler employees (over 500) in the coming months, we’re maintaining a thoughtful approach to deployment. We recognize that some clients have specific requirements regarding AI usage, and we’re carefully balancing innovation with client preferences. This strategic approach includes working to update client contracts and helping clients become more comfortable with AI integration while respecting their current guidelines.

Conclusion

Our experience with Amazon Q Business demonstrates how enterprise-grade AI can be successfully implemented while maintaining strict security standards and respecting client preferences. The platform has not only improved our operational efficiency but has also enhanced our ability to deliver consistent, high-quality service to our clients. What’s particularly impressive is the platform’s rapid deployment capabilities—we were able to implement the solution within weeks, without any coding requirements, and eliminate ongoing model maintenance and data source management expenses. As we continue to expand our use of Amazon Q Business, we’re excited about the potential for further innovation and efficiency gains in our digital marketing services.


About the authors

Evan Miller, Global Head of Product and Data Science, is a strategic product leader who joined Kepler in 2013. Currently serving as Global Head of Product and Data Science, he owns the end-to-end product strategy for the Kepler Intelligence Platform (Kip). Under his leadership, Kip has garnered industry recognition, winning awards for Best Performance Management Solution and Best Commerce Technology, while driving significant business impact through innovative features like automated Machine Learning analytics and Marketing Mix Modeling technology.

Noah Kershaw leads the product team at Kepler Group, a global digital marketing agency that helps brands connect with their audiences through data-driven strategies. With a passion for innovation, Noah has been at the forefront of integrating AI solutions to enhance client services and streamline operations. His collaborative approach and enthusiasm for leveraging technology have been key in bringing Kepler’s “Future in Focus” vision to life, helping Kepler and its clients navigate the modern era of marketing with clarity and precision.

Valerie Renda, Director of Data Strategy & Analytics, has a specialized focus on data strategy, analytics, and marketing systems strategy within digital marketing, a field she’s worked in for over eight years. At Kepler, she has made significant contributions to various clients’ data management and martech strategies. She has been instrumental in leading data infrastructure projects, including customer data platform implementations, business intelligence visualization implementations, server-side tracking, martech consolidation, tag migrations, and more. She has also led the development of workflow tools to automate data processes and streamline ad operations to improve internal organizational processes.

Al Destefano is a Sr. Generative AI Specialist on the Amazon Q GTM team based in New York City. At AWS, he uses technical knowledge and business experience to communicate the tangible enterprise benefits when using managed Generative AI AWS services.

Sunanda Patel is a Senior Account Manager with over 15 years of expertise in management consulting and IT sectors, with a focus on business development and people management. Throughout her career, Sunanda has successfully managed diverse client relationships, ranging from non-profit to corporate and large multinational enterprises. Sunanda joined AWS in 2022 as an Account Manager for the Manhattan Commercial sector and now works with strategic commercial accounts, helping them grow in their cloud journey to achieve complex business goals.

Kumar Karra is a Sr. Solutions Architect at AWS supporting SMBs. He is an experienced engineer with deep experience in the software development lifecycle. Kumar looks to solve challenging problems by applying technical, leadership, and business skills. He holds a Master’s Degree in Computer Science and Machine Learning from Georgia Institute of Technology and is based in New York (US).


Build a serverless audio summarization solution with Amazon Bedrock and Whisper


Recordings of business meetings, interviews, and customer interactions have become essential for preserving important information. However, transcribing and summarizing these recordings manually is often time-consuming and labor-intensive. With the progress in generative AI and automatic speech recognition (ASR), automated solutions have emerged to make this process faster and more efficient.

Protecting personally identifiable information (PII) is a vital aspect of data security, driven by both ethical responsibilities and legal requirements. In this post, we demonstrate how to use the OpenAI Whisper Large V3 Turbo foundation model (FM), available in Amazon Bedrock Marketplace (which offers access to over 140 models through a dedicated offering), to produce near real-time transcriptions. These transcriptions are then processed by Amazon Bedrock for summarization and redaction of sensitive information.

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Additionally, you can use Amazon Bedrock Guardrails to automatically redact sensitive information, including PII, from the transcription summaries to support compliance and data protection needs.

In this post, we walk through an end-to-end architecture that combines a React-based frontend with Amazon Bedrock, AWS Lambda, and AWS Step Functions to orchestrate the workflow, facilitating seamless integration and processing.

Solution overview

The solution highlights the power of integrating serverless technologies with generative AI to automate and scale content processing workflows. The user journey begins with uploading a recording through a React frontend application, hosted on Amazon CloudFront and backed by Amazon Simple Storage Service (Amazon S3) and Amazon API Gateway. When the file is uploaded, it triggers a Step Functions state machine that orchestrates the core processing steps, using AI models and Lambda functions for seamless data flow and transformation. The following diagram illustrates the solution architecture.

AWS serverless architecture for audio processing: CloudFront to S3, EventBridge trigger, Lambda and Bedrock for transcription and summarization

The workflow consists of the following steps:

  1. The React application is hosted in an S3 bucket and served to users through CloudFront for fast, global access. API Gateway handles interactions between the frontend and backend services.
  2. Users upload audio or video files directly from the app. These recordings are stored in a designated S3 bucket for processing.
  3. An Amazon EventBridge rule detects the S3 upload event and triggers the Step Functions state machine, initiating the AI-powered processing pipeline.
  4. The state machine performs audio transcription, summarization, and redaction by orchestrating multiple Amazon Bedrock models in sequence. It uses Whisper for transcription, Claude for summarization, and Guardrails to redact sensitive data.
  5. The redacted summary is returned to the frontend application and displayed to the user.

The following diagram illustrates the state machine workflow.

AWS Step Functions state machine for audio processing: Whisper transcription, speaker identification, and Bedrock summary tasks

The Step Functions state machine orchestrates a series of tasks to transcribe, summarize, and redact sensitive information from uploaded audio/video recordings:

  1. A Lambda function is triggered to gather input details (for example, Amazon S3 object path, metadata) and prepare the payload for transcription.
  2. The payload is sent to the OpenAI Whisper Large V3 Turbo model through the Amazon Bedrock Marketplace to generate a near real-time transcription of the recording.
  3. The raw transcript is passed to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock, which produces a concise and coherent summary of the conversation or content.
  4. A second Lambda function validates and forwards the summary to the redaction step.
  5. The summary is processed through Amazon Bedrock Guardrails, which automatically redacts PII and other sensitive data.
  6. The redacted summary is stored or returned to the frontend application through an API, where it is displayed to the user.

Prerequisites

Before you start, complete the following prerequisite steps to create a guardrail and deploy the Whisper model.

Create a guardrail in the Amazon Bedrock console

For instructions for creating guardrails in Amazon Bedrock, refer to Create a guardrail. For details on detecting and redacting PII, see Remove PII from conversations by using sensitive information filters. Configure your guardrail with the following key settings:

  • Enable PII detection and handling
  • Set PII action to Redact
  • Add the relevant PII types, such as:
    • Names and identities
    • Phone numbers
    • Email addresses
    • Physical addresses
    • Financial information
    • Other sensitive personal information

After you create the guardrail, note its Amazon Resource Name (ARN); you will use it when deploying the solution.
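
If you prefer to script the guardrail setup instead of using the console, the following boto3 sketch shows the general shape of the CreateGuardrail call with PII anonymization. The name, messages, and the specific PII entity types shown are illustrative; check the Amazon Bedrock documentation for the full list of supported types.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="audio-summary-pii-guardrail",  # example name
    description="Redacts PII from transcription summaries",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "ADDRESS", "action": "ANONYMIZE"},
        ]
    },
    blockedInputMessaging="This input contains content that cannot be processed.",
    blockedOutputsMessaging="The response was blocked by the guardrail.",
)

print(response["guardrailArn"])  # note this ARN for the deployment step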

Deploy the Whisper model

Complete the following steps to deploy the Whisper Large V3 Turbo model:

  1. On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
  2. Search for and choose Whisper Large V3 Turbo.
  3. On the options menu (three dots), choose Deploy.

Amazon Bedrock console displaying filtered model catalog with Whisper Large V3 Turbo speech recognition model and deployment option

  4. Modify the endpoint name, number of instances, and instance type to suit your specific use case. For this post, we use the default settings.
  5. Modify the Advanced settings section to suit your use case. For this post, we use the default settings.
  6. Choose Deploy.

This creates a new AWS Identity and Access Management (IAM) role and deploys the model.

You can choose Marketplace deployments in the navigation pane, and in the Managed deployments section, you can see the endpoint status as Creating. Wait for the endpoint to finish deployment and the status to change to In Service, then copy the endpoint name; you will use it when deploying the solution.

Amazon Bedrock console: "How it works" overview, managed deployments table with Whisper model endpoint in service

Deploy the solution infrastructure

In the GitHub repo, follow the instructions in the README file to clone the repository, then deploy the frontend and backend infrastructure.

We use the AWS Cloud Development Kit (AWS CDK) to define and deploy the infrastructure. The AWS CDK code deploys the following resources:

  • React frontend application
  • Backend infrastructure
  • S3 buckets for storing uploads and processed results
  • Step Functions state machine with Lambda functions for audio processing and PII redaction
  • API Gateway endpoints for handling requests
  • IAM roles and policies for secure access
  • CloudFront distribution for hosting the frontend

Implementation deep dive

The backend is composed of a sequence of Lambda functions, each handling a specific stage of the audio processing pipeline:

  • Upload handler – Receives audio files and stores them in Amazon S3
  • Transcription with Whisper – Converts speech to text using the Whisper model
  • Speaker detection – Differentiates and labels individual speakers within the audio
  • Summarization using Amazon Bedrock – Extracts and summarizes key points from the transcript
  • PII redaction – Uses Amazon Bedrock Guardrails to remove sensitive information for privacy compliance

Let’s examine some of the key components:

The transcription Lambda function uses the Whisper model to convert audio files to text:

import json
import boto3

# SageMaker runtime client for the endpoint hosting the Whisper model
sagemaker_runtime = boto3.client("sagemaker-runtime")

def transcribe_with_whisper(audio_chunk, endpoint_name):
    # Convert audio to hex string format
    hex_audio = audio_chunk.hex()
    
    # Create payload for Whisper model
    payload = {
        "audio_input": hex_audio,
        "language": "english",
        "task": "transcribe",
        "top_p": 0.9
    }
    
    # Invoke the SageMaker endpoint running Whisper
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    # Parse the transcription response
    response_body = json.loads(response['Body'].read().decode('utf-8'))
    transcription_text = response_body['text']
    
    return transcription_text
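
As a rough sketch of how this helper could be wired into the first step of the state machine, the handler below downloads the uploaded object from Amazon S3 and passes its bytes to transcribe_with_whisper. The event shape and field names are hypothetical and depend on how the state machine is configured:

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical event fields: the S3 location of the upload and the Whisper endpoint name
    bucket = event["bucket"]
    key = event["key"]
    endpoint_name = event["endpoint_name"]

    # Download the recording and transcribe it in a single call
    audio_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    transcription = transcribe_with_whisper(audio_bytes, endpoint_name)

    return {"transcription": transcription}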

We use Amazon Bedrock to generate concise summaries from the transcriptions:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_summary(transcription):
    # Format the prompt with the transcription
    prompt = f"{transcription}\n\nGive me the summary, speakers, key discussions, and action items with owners"

    # Call Claude 3.5 Sonnet on Amazon Bedrock (Messages API) for summarization
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "temperature": 0.7,
            "top_p": 0.9,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": prompt}]}
            ]
        })
    )

    # Extract and return the summary text
    result = json.loads(response.get('body').read())
    return result["content"][0]["text"]

A critical component of our solution is the automatic redaction of PII. We implemented this using Amazon Bedrock Guardrails to support compliance with privacy regulations:

def apply_guardrail(bedrock_runtime, content, guardrail_id):
    # Format content according to API requirements
    formatted_content = [{"text": {"text": content}}]

    # Call the guardrail API
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="OUTPUT",  # Using OUTPUT parameter for proper flow
        content=formatted_content
    )

    # Extract redacted text from response
    if 'action' in response and response['action'] == 'GUARDRAIL_INTERVENED':
        if len(response['outputs']) > 0:
            output = response['outputs'][0]
            if 'text' in output and isinstance(output['text'], str):
                return output['text']

    # Return original content if redaction fails
    return content

When PII is detected, it’s replaced with type indicators (for example, {PHONE} or {EMAIL}), making sure that summaries remain informative while protecting sensitive data.
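
The following snippet illustrates the intended behavior, assuming the guardrail configured earlier; the input text and the exact placeholder format are illustrative:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

summary = "Follow up with Jane Doe on 555-0100 or jane.doe@example.com."  # illustrative input
redacted = apply_guardrail(bedrock_runtime, summary, guardrail_id="<your-guardrail-id>")

# With PII anonymization enabled, the guardrail returns text along the lines of:
# "Follow up with {NAME} on {PHONE} or {EMAIL}."
print(redacted)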

To manage the complex processing pipeline, we use Step Functions to orchestrate the Lambda functions:

{
  "Comment": "Audio Summarization Workflow",
  "StartAt": "TranscribeAudio",
  "States": {
    "TranscribeAudio": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {
          "bucket.$": "$.bucket",
          "key.$": "$.key"
        }
      },
      "Next": "IdentifySpeakers"
    },
    "IdentifySpeakers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "SpeakerIdentificationFunction",
        "Payload": {
          "Transcription.$": "$.Payload"
        }
      },
      "Next": "GenerateSummary"
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "BedrockSummaryFunction",
        "Payload": {
          "SpeakerIdentification.$": "$.Payload"
        }
      },
      "End": true
    }
  }
}

This workflow makes sure each step completes successfully before proceeding to the next, with automatic error handling and retry logic built in.
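
The deployed definition comes from the AWS CDK code, but the following sketch shows how Retry and Catch fields could be attached to the transcription state; the error names, retry settings, and the NotifyFailure state are illustrative:

import json

transcribe_audio_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {"bucket.$": "$.bucket", "key.$": "$.key"},
    },
    # Retry transient Lambda errors with exponential backoff
    "Retry": [
        {
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # Route any remaining failure to a hypothetical error-handling state
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "Next": "IdentifySpeakers",
}

print(json.dumps(transcribe_audio_state, indent=2))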

Test the solution

After you have successfully completed the deployment, you can use the CloudFront URL to test the solution functionality.

Audio/video upload and summary interface with completed file upload for team meeting recording analysis

Security considerations

Security is a critical aspect of this solution, and we’ve implemented several best practices to support data protection and compliance:

  • Sensitive data redaction – Automatically redact PII to protect user privacy.
  • Fine-grained IAM permissions – Apply the principle of least privilege across AWS services and resources.
  • Amazon S3 access controls – Use strict bucket policies to limit access to authorized users and roles.
  • API security – Secure API endpoints using Amazon Cognito for user authentication (optional but recommended).
  • CloudFront protection – Enforce HTTPS and apply modern TLS protocols to facilitate secure content delivery.
  • Amazon Bedrock data security – Amazon Bedrock (including Amazon Bedrock Marketplace) protects customer data and does not send data to providers or train using customer data. This makes sure your proprietary information remains secure when using AI capabilities.

Clean up

To prevent unnecessary charges, make sure to delete the resources provisioned for this solution when you’re done:

  1. Delete the Amazon Bedrock guardrail:
    1. On the Amazon Bedrock console, in the navigation menu, choose Guardrails.
    2. Choose your guardrail, then choose Delete.
  2. Delete the Whisper Large V3 Turbo model deployed through the Amazon Bedrock Marketplace:
    1. On the Amazon Bedrock console, choose Marketplace deployments in the navigation pane.
    2. In the Managed deployments section, select the deployed endpoint and choose Delete.
  3. Delete the AWS CDK stack by running the command cdk destroy, which deletes the AWS infrastructure.

Conclusion

This serverless audio summarization solution demonstrates the benefits of combining AWS services to create a sophisticated, secure, and scalable application. By using Amazon Bedrock for AI capabilities, Lambda for serverless processing, and CloudFront for content delivery, we’ve built a solution that can handle large volumes of audio content efficiently while helping you align with security best practices.

The automatic PII redaction feature supports compliance with privacy regulations, making this solution well-suited for regulated industries such as healthcare, finance, and legal services where data security is paramount. To get started, deploy this architecture within your AWS environment to accelerate your audio processing workflows.


About the Authors

Kaiyin Hu is a Senior Solutions Architect for Strategic Accounts at Amazon Web Services, with years of experience across enterprises, startups, and professional services. Currently, she helps customers build cloud solutions and drives GenAI adoption in the cloud. Previously, Kaiyin worked in the Smart Home domain, assisting customers in integrating voice and IoT technologies.

Sid Vantair is a Solutions Architect with AWS covering Strategic accounts. He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.

Read More

Implement semantic video search using open source large vision models on Amazon SageMaker and Amazon OpenSearch Serverless

As companies and individual users deal with constantly growing amounts of video content, the ability to perform low-effort search to retrieve videos or video segments using natural language becomes increasingly valuable. Semantic video search offers a powerful solution to this problem, so users can search for relevant video content based on textual queries or descriptions. This approach can be used in a wide range of applications, from personal photo and video libraries to professional video editing, or enterprise-level content discovery and moderation, where it can significantly improve the way we interact with and manage video content.

Large-scale pre-training of computer vision models with self-supervision directly from natural language descriptions of images has made it possible to capture a wide set of visual concepts, while also bypassing the need for labor-intensive manual annotation of training data. After pre-training, natural language can be used to either reference the learned visual concepts or describe new ones, effectively enabling zero-shot transfer to a diverse set of computer vision tasks, such as image classification, retrieval, and semantic analysis.

In this post, we demonstrate how to use large vision models (LVMs) for semantic video search using natural language and image queries. We introduce some use case-specific methods, such as temporal frame smoothing and clustering, to enhance the video search performance. Furthermore, we demonstrate the end-to-end functionality of this approach by using both asynchronous and real-time hosting options on Amazon SageMaker AI to perform video, image, and text processing using publicly available LVMs on the Hugging Face Model Hub. Finally, we use Amazon OpenSearch Serverless with its vector engine for low-latency semantic video search.

About large vision models

In this post, we implement video search capabilities using multimodal LVMs, which integrate textual and visual modalities during the pre-training phase, using techniques such as contrastive multimodal representation learning, Transformer-based multimodal fusion, or multimodal prefix language modeling (for more details, see Review of Large Vision Models and Visual Prompt Engineering by J. Wang et al.). Such LVMs have recently emerged as foundational building blocks for various computer vision tasks. Owing to their capability to learn a wide variety of visual concepts from massive datasets, these models can effectively solve diverse downstream computer vision tasks across different image distributions without the need for fine-tuning. In this section, we briefly introduce some of the most popular publicly available LVMs (which we also use in the accompanying code sample).

The CLIP (Contrastive Language-Image Pre-training) model, introduced in 2021, represents a significant milestone in the field of computer vision. Trained on a collection of 400 million image-text pairs harvested from the internet, CLIP showcased the remarkable potential of using large-scale natural language supervision for learning rich visual representations. Through extensive evaluations across over 30 computer vision benchmarks, CLIP demonstrated impressive zero-shot transfer capabilities, often matching or even surpassing the performance of fully supervised, task-specific models. For instance, a notable achievement of CLIP is its ability to match the top accuracy of a ResNet-50 model trained on the 1.28 million images from the ImageNet dataset, despite operating in a true zero-shot setting without a need for fine-tuning or other access to labeled examples.

Following the success of CLIP, the open-source initiative OpenCLIP further advanced the state-of-the-art by releasing an open implementation pre-trained on the massive LAION-2B dataset, comprised of 2.3 billion English image-text pairs. This substantial increase in the scale of training data enabled OpenCLIP to achieve even better zero-shot performance across a wide range of computer vision benchmarks, demonstrating further potential of scaling up natural language supervision for learning more expressive and generalizable visual representations.

Finally, the set of SigLIP (Sigmoid Loss for Language-Image Pre-training) models, including one trained on a 10 billion multilingual image-text dataset spanning over 100 languages, further pushed the boundaries of large-scale multimodal learning. The models propose an alternative loss function for the contrastive pre-training scheme employed in CLIP and have shown superior performance in language-image pre-training, outperforming both CLIP and OpenCLIP baselines on a variety of computer vision tasks.

Solution overview

Our approach uses a multimodal LVM to enable efficient video search and retrieval based on both textual and visual queries. The approach can be logically split into an indexing pipeline, which can be carried out offline, and an online video search logic. The following diagram illustrates the pipeline workflows.

The indexing pipeline is responsible for ingesting video files and preprocessing them to construct a searchable index. The process begins by extracting individual frames from the video files. These extracted frames are then passed through an embedding module, which uses the LVM to map each frame into a high-dimensional vector representation containing its semantic information. To account for temporal dynamics and motion information present in the video, a temporal smoothing technique is applied to the frame embeddings. This step makes sure the resulting representations capture the semantic continuity across multiple subsequent video frames, rather than treating each frame independently (also see the results discussed later in this post, or consult the following paper for more details). The temporally smoothed frame embeddings are then ingested into a vector index data structure, which is designed for efficient storage, retrieval, and similarity search operations. This indexed representation of the video frames serves as the foundation for the subsequent search pipeline.
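
To make the smoothing step concrete, the following is a minimal sketch that applies a moving-average kernel over the time axis of the frame embeddings; the kernel size matches the value used later in this post, but the implementation in the sample code may differ:

import numpy as np

def smooth_frame_embeddings(embeddings: np.ndarray, kernel_size: int = 11) -> np.ndarray:
    """Smooth frame embeddings of shape (num_frames, dim) along the time axis."""
    kernel = np.ones(kernel_size) / kernel_size
    # Convolve each embedding dimension over time; "same" keeps the number of frames
    smoothed = np.apply_along_axis(
        lambda series: np.convolve(series, kernel, mode="same"), 0, embeddings
    )
    # Re-normalize so that cosine similarity scores stay comparable
    return smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)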

The search pipeline facilitates content-based video retrieval by accepting textual queries or visual queries (images) from users. Textual queries are first embedded into the shared multimodal representation space using the LVM’s text encoding capabilities. Similarly, visual queries (images) are processed through the LVM’s visual encoding branch to obtain their corresponding embeddings.

After the textual or visual queries are embedded, we can build a hybrid query to account for keywords or filter constraints provided by the user (for example, to search only across certain video categories, or to search within a particular video). This hybrid query is then used to retrieve the most relevant frame embeddings based on their conceptual similarity to the query, while adhering to any supplementary keyword constraints.

The retrieved frame embeddings are then subjected to temporal clustering (also see the results later in this post for more details), which aims to group contiguous frames into semantically coherent video segments, thereby returning an entire video sequence (rather than disjointed individual frames).
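
A simple way to implement this grouping is to sort the retrieved frame timestamps and merge neighbors that fall within a maximum gap. This sketch captures the idea, although the sample code may use a different clustering criterion:

def cluster_timestamps(timestamps, max_gap_seconds=1.0):
    """Group retrieved frame timestamps (in seconds) into contiguous video segments."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    segments = []
    start = prev = ordered[0]
    for t in ordered[1:]:
        if t - prev > max_gap_seconds:
            segments.append((start, prev))  # close the current segment
            start = t
        prev = t
    segments.append((start, prev))
    return segments

# Example: frames at 34.5-36.0s form one clip, the frame at 65.2s another
print(cluster_timestamps([34.5, 35.0, 36.0, 65.2]))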

Furthermore, maintaining search diversity and quality is crucial when retrieving content from videos. As mentioned previously, our approach incorporates various methods to enhance search results. For example, during the video indexing phase, the following techniques are employed to control the search results (the parameters of which might need to be tuned to get the best results):

  • Adjusting the sampling rate, which determines the number of frames embedded from each second of video. Less frequent frame sampling might make sense when working with longer videos, whereas more frequent frame sampling might be needed to catch fast-occurring events.
  • Modifying the temporal smoothing parameters to, for example, remove inconsistent search hits based on just a single frame hit, or merge repeated frame hits from the same scene.

During the semantic video search phase, you can use the following methods:

  • Applying temporal clustering as a post-filtering step on the retrieved timestamps to group contiguous frames into semantically coherent video clips (that can be, in principle, directly played back by the end-users). This makes sure the search results maintain temporal context and continuity, avoiding disjointed individual frames.
  • Setting the search size, which can be effectively combined with temporal clustering. Increasing the search size makes sure the relevant frames are included in the final results, albeit at the cost of higher computational load (see, for example, this guide for more details).

Our approach aims to strike a balance between retrieval quality, diversity, and computational efficiency by employing these techniques during both the indexing and search phases, ultimately enhancing the user experience in semantic video search.

The proposed solution architecture provides efficient semantic video search by using open source LVMs and AWS services. The architecture can be logically divided into two components: an asynchronous video indexing pipeline and online content search logic. The accompanying sample code on GitHub showcases how to build, experiment locally, as well as host and invoke both parts of the workflow using several open source LVMs available on the Hugging Face Model Hub (CLIP, OpenCLIP, and SigLIP). The following diagram illustrates this architecture.

The pipeline for asynchronous video indexing is comprised of the following steps:

  1. The user uploads a video file to an Amazon Simple Storage Service (Amazon S3) bucket, which initiates the indexing process.
  2. The video is sent to a SageMaker asynchronous endpoint for processing. The processing steps involve:
    • Decoding of frames from the uploaded video file.
    • Generation of frame embeddings by LVM.
    • Application of temporal smoothing, accounting for temporal dynamics and motion information present in the video.
  3. The frame embeddings are ingested into an OpenSearch Serverless vector index, designed for efficient storage, retrieval, and similarity search operations.

SageMaker asynchronous inference endpoints are well-suited for handling requests with large payloads, extended processing times, and near real-time latency requirements. This SageMaker capability queues incoming requests and processes them asynchronously, accommodating large payloads and long processing times. Asynchronous inference enables cost optimization by automatically scaling the instance count to zero when there are no requests to process, so computational resources are used only when actively handling requests. This flexibility makes it an ideal choice for applications involving large data volumes, such as video processing, while maintaining responsiveness and efficient resource utilization.
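
Invoking the asynchronous endpoint amounts to pointing it at the uploaded video in Amazon S3; the endpoint and bucket names below are placeholders:

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="video-indexing-endpoint",                # placeholder endpoint name
    InputLocation="s3://<your-bucket>/videos/sample.mp4",  # uploaded video to index
    ContentType="video/mp4",
)

# The indexing result is written to this S3 location when processing completes
print(response["OutputLocation"])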

OpenSearch Serverless is an on-demand serverless version for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the LVM. The index created in the OpenSearch Serverless collection serves as the vector store, enabling efficient storage and rapid similarity-based retrieval of relevant video segments.

The online content search then can be broken down to the following steps:

  1. The user provides a textual prompt or an image (or both) representing the desired content to be searched.
  2. The user prompt is sent to a real-time SageMaker endpoint, which results in the following actions:
    • An embedding is generated for the text or image query.
    • The query with embeddings is sent to the OpenSearch vector index, which performs a k-nearest neighbors (k-NN) search to retrieve relevant frame embeddings.
    • The retrieved frame embeddings undergo temporal clustering.
  3. The final search results, comprising relevant video segments, are returned to the user.

SageMaker real-time inference suits interactive workloads that need low-latency responses. Deploying models to SageMaker hosting services provides fully managed inference endpoints with automatic scaling, delivering the performance required for real-time use cases.
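
Under the hood, the hybrid retrieval described above can be expressed as an OpenSearch k-NN query combined with a filter. The following sketch assumes an opensearch-py client configured for your OpenSearch Serverless collection; the index name, field names, and Region are illustrative:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"  # adjust to your Region
host = "<collection-id>.us-east-1.aoss.amazonaws.com"  # your collection endpoint
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), region, "aoss")

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

def search_frames(query_embedding, video_id=None, size=20):
    # k-NN search over frame embeddings, optionally restricted to a single video
    bool_query = {"must": [{"knn": {"embedding": {"vector": query_embedding, "k": size}}}]}
    if video_id:
        bool_query["filter"] = [{"term": {"video_id": video_id}}]
    body = {"size": size, "query": {"bool": bool_query}}
    hits = client.search(index="video-frames", body=body)["hits"]["hits"]
    return [(hit["_source"]["timestamp"], hit["_score"]) for hit in hits]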

Code and environment

This post is accompanied by a sample code on GitHub that provides comprehensive annotations and code to set up the necessary AWS resources, experiment locally with sample video files, and then deploy and run the indexing and search pipelines. The code sample is designed to exemplify best practices when developing ML solutions on SageMaker, such as using configuration files to define flexible inference stack parameters and conducting local tests of the inference artifacts before deploying them to SageMaker endpoints. It also contains guided implementation steps with explanations and reference for configuration parameters. Additionally, the notebook automates the cleanup of all provisioned resources.

Prerequisites

The prerequisite to run the provided code is to have an active AWS account and set up Amazon SageMaker Studio. Refer to Use quick setup for Amazon SageMaker AI to set up SageMaker if you’re a first-time user and then follow the steps to open SageMaker Studio.

Deploy the solution

To start the implementation, clone the repository, open the notebook semantic_video_search_demo.ipynb, and follow the steps in the notebook.

In Section 2 of the notebook, install the required packages and dependencies, define global variables, set up Boto3 clients, and attach required permissions to the SageMaker AWS Identity and Access Management (IAM) role to interact with Amazon S3 and OpenSearch Service from the notebook.

In Section 3, create security components for OpenSearch Serverless (encryption policy, network policy, and data access policy) and then create an OpenSearch Serverless collection. For simplicity, in this proof of concept implementation, we allow public internet access to the OpenSearch Serverless collection resource. However, for production environments, we strongly suggest using private connections between your Virtual Private Cloud (VPC) and OpenSearch Serverless resources through a VPC endpoint. For more details, see Access Amazon OpenSearch Serverless using an interface endpoint (AWS PrivateLink).

In Section 4, import and inspect the config file, and choose an embeddings model for video indexing and corresponding embeddings dimension. In Section 5, create a vector index within the OpenSearch collection you created earlier.

To demonstrate the search results, we also provide references to a few sample videos that you can experiment with in Section 6. In Section 7, you can experiment with the proposed semantic video search approach locally in the notebook, before deploying the inference stacks.

In Sections 8, 9, and 10, we provide code to deploy two SageMaker endpoints: an asynchronous endpoint for video embedding and indexing, and a real-time inference endpoint for video search. After these steps, we also test our deployed semantic video search solution with a few example queries.

Finally, Section 11 contains the code to clean up the created resources to avoid recurring costs.

Results

The solution was evaluated across a diverse range of use cases, including the identification of key moments in sports games, specific outfit pieces or color patterns on fashion runways, and other tasks in full-length films on the fashion industry. Additionally, the solution was tested for detecting action-packed moments like explosions in action movies, identifying when individuals entered video surveillance areas, and extracting specific events such as sports award ceremonies.

For our demonstration, we created a video catalog consisting of the following videos: A Look Back at New York Fashion Week: Men’s, F1 Insights powered by AWS, Amazon Air’s newest aircraft, the A330, is here, and Now Go Build with Werner Vogels – Autonomous Trucking.

To demonstrate the search capability for identifying specific objects across this video catalog, we employed four text prompts and four images. The presented results were obtained using the google/siglip-so400m-patch14-384 model, with temporal clustering enabled and a timestamp filter set to 1 second. Additionally, smoothing was enabled with a kernel size of 11, and the search size was set to 20 (which were found to be good default values for shorter videos). The left column in the subsequent figures specifies the search type, either by image or text, along with the corresponding image name or text prompt used.

The following figure shows the text prompts we used and the corresponding results.

The following figure shows the images we used to perform reverse images search and corresponding search results for each image.

As mentioned, we implemented temporal clustering in the lookup code, allowing for the grouping of frames based on their ordered timestamps. The accompanying notebook with sample code showcases the temporal clustering functionality by displaying (a few frames from) the returned video clip and highlighting the key frame with the highest search score within each group, as illustrated in the following figure. This approach facilitates a convenient presentation of the search results, enabling users to return entire playable video clips (even if not all frames were actually indexed in a vector store).

To showcase the hybrid search capabilities with OpenSearch Service, we present results for the textual prompt “sky,” with all other search parameters set identically to the previous configurations. We demonstrate two distinct cases: an unconstrained semantic search across the entire indexed video catalog, and a search confined to a specific video. The following figure illustrates the results obtained from an unconstrained semantic search query.

We conducted the same search for “sky,” but now confined to trucking videos.

To illustrate the effects of temporal smoothing, we generated search signal score charts (based on cosine similarity) for the prompt F1 crews change tyres in the formulaone video, both with and without temporal smoothing. We set a threshold of 0.315 for illustration purposes and highlighted video segments with scores exceeding this threshold. Without temporal smoothing (see the following figure), we observed two adjacent episodes around t=35 seconds and two additional episodes after t=65 seconds. Notably, the third and fourth episodes were significantly shorter than the first two, despite exhibiting higher scores. However, we can do better, if our objective is to prioritize longer semantically cohesive video episodes in the search.

To address this, we apply temporal smoothing. As shown in the following figure, now the first two episodes appear to be merged into a single, extended episode with the highest score. The third episode experienced a slight score reduction, and the fourth episode became irrelevant due to its brevity. Temporal smoothing facilitated the prioritization of longer and more coherent video moments associated with the search query by consolidating adjacent high-scoring segments and suppressing isolated, transient occurrences.

Clean up

To clean up the resources created as part of this solution, refer to the cleanup section in the provided notebook and execute the cells in this section. This will delete the created IAM policies, OpenSearch Serverless resources, and SageMaker endpoints to avoid recurring charges.

Limitations

Throughout our work on this project, we also identified several potential limitations that could be addressed through future work:

  • Video quality and resolution might impact search performance, because blurred or low-resolution videos can make it challenging for the model to accurately identify objects and intricate details.
  • Small objects within videos, such as a hockey puck or a football, might be difficult for LVMs to consistently recognize due to their diminutive size and visibility constraints.
  • LVMs might struggle to comprehend scenes that represent a temporally prolonged contextual situation, such as detecting a point-winning shot in tennis or a car overtaking another vehicle.
  • Accurate automatic measurement of solution performance is hindered without the availability of manually labeled ground truth data for comparison and evaluation.

Summary

In this post, we demonstrated the advantages of the zero-shot approach to implementing semantic video search using either text prompts or images as input. This approach readily adapts to diverse use cases without the need for retraining or fine-tuning models specifically for video search tasks. Additionally, we introduced techniques such as temporal smoothing and temporal clustering, which significantly enhance the quality and coherence of video search results.

The proposed architecture is designed to facilitate a cost-effective production environment with minimal effort, eliminating the requirement for extensive expertise in machine learning. Furthermore, the current architecture seamlessly accommodates the integration of open source LVMs, enabling the implementation of custom preprocessing or postprocessing logic during both the indexing and search phases. This flexibility is made possible by using SageMaker asynchronous and real-time deployment options, providing a powerful and versatile solution.

You can implement semantic video search using different approaches or AWS services. For related content, refer to the following AWS blog posts as examples on semantic search using proprietary ML models: Implement serverless semantic search of image and live video with Amazon Titan Multimodal Embeddings or Build multimodal search with Amazon OpenSearch Service.


About the Authors

Dr. Alexander Arzhanov is an AI/ML Specialist Solutions Architect based in Frankfurt, Germany. He helps AWS customers design and deploy their ML solutions across the EMEA region. Prior to joining AWS, Alexander was researching origins of heavy elements in our universe and grew passionate about ML after using it in his large-scale scientific calculations.

Dr. Ivan Sosnovik is an Applied Scientist in the AWS Machine Learning Solutions Lab. He develops ML solutions to help customers to achieve their business goals.

Nikita Bubentsov is a Cloud Sales Representative based in Munich, Germany, and part of Technical Field Community (TFC) in computer vision and machine learning. He helps enterprise customers drive business value by adopting cloud solutions and supports AWS EMEA organizations in the computer vision area. Nikita is passionate about computer vision and the future potential that it holds.

Read More

Multi-account support for Amazon SageMaker HyperPod task governance

GPUs are a precious resource; they are both in short supply and much more costly than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (for both internal and external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than having siloed infrastructure that might be underutilized. Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.

The specific reasons and setup can vary depending on the size, structure, and requirements of the enterprise. But in general, a multi-account strategy provides greater flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogenous workloads. We use SageMaker HyperPod task governance to enable this feature.

Solution overview

SageMaker HyperPod task governance streamlines resource allocation and provides cluster administrators the capability to set up policies to maximize compute utilization in a cluster. Task governance can be used to create distinct teams with their own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team’s compute quota using role-based access control.

In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.

The following diagram illustrates the solution architecture.

Multi-account AWS architecture: EKS cluster withEKS Pod Identity accessing S3 bucket via access point

In this architecture, one organization is splitting resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training usage. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A’s SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access for prepared data.

Cross-account access for data scientists

When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create an AWS Identity and Access Management (IAM) role per team, called cluster access roles, that are then scoped access only to the team’s task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure the data science members of Team A will not be able to submit tasks on behalf of Team B.

To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A. The cluster access role will have only the needed permissions for data scientists to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.

Next, you will need to assume the cluster access role from a role in Account B. The cluster access role in Account A will then need to have a trust policy for the data scientist role in Account B. The data scientist role is the role in Account B that will be used to assume the cluster access role in Account A. The following code is an example of the policy statement for the data scientist role so that it can assume the cluster access role in Account A:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole"
    }
  ]
}

The following code is an example of the trust policy for the cluster access role so that it allows the data scientist role to assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::XXXXXXXXXXBBB:role/DataScientistRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The final step is to create an access entry for the team’s cluster access role in the EKS cluster. This access entry should also have an access policy, such as EKSEditPolicy, that is scoped to the namespace of the team, as shown in the sketch that follows. This makes sure that Team A users in Account B can’t launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
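
The access entry and the namespace-scoped policy association can be created with a few API calls. In this sketch, the cluster name, role ARN, and namespace are placeholders that should match your task governance setup:

import boto3

eks = boto3.client("eks")  # credentials for Account A

cluster_name = "<hyperpod-eks-cluster>"
cluster_access_role_arn = "arn:aws:iam::<Account-A-ID>:role/ClusterAccessRole"
team_namespace = "hyperpod-ns-team-a"  # namespace created by task governance

# Register the team's cluster access role with the EKS cluster
eks.create_access_entry(
    clusterName=cluster_name,
    principalArn=cluster_access_role_arn,
    type="STANDARD",
)

# Scope its permissions to the team's namespace only
eks.associate_access_policy(
    clusterName=cluster_name,
    principalArn=cluster_access_role_arn,
    policyArn="arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy",
    accessScope={"type": "namespace", "namespaces": [team_namespace]},
)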

For users in Account B, you can repeat the same setup for each team. You must create a unique cluster access role for each team to align the access role for the team with their associated namespace. To summarize, we use two different IAM roles:

  • Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
  • Cluster access role – The role in Account A used to give access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.

Cross-account access to prepared data

In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A’s EKS cluster have access to data stored in Account C. EKS Pod Identity allows you to map an IAM role to a service account in a namespace. If a pod uses a service account that has this association, Amazon EKS sets the AWS credential environment variables in the containers of the pod so that they can assume the mapped role.

S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They act as a way to grant fine-grained access control to specific users or applications accessing a shared dataset within an S3 bucket, without requiring those users or applications to have full access to the entire bucket. Permissions to the access point are granted through S3 access point policies. Each S3 Access Point is configured with an access policy specific to a use case or application. Because the HyperPod cluster in this post can be used by multiple teams, each team could have its own S3 access point and access point policy.

Before following these steps, ensure you have the EKS Pod Identity Add-on installed on your EKS cluster.

  1. In Account A, create an IAM Role that contains S3 permissions (such as s3:ListBucket and s3:GetObject to the access point resource) and has a trust relationship with Pod Identity; this will be your Data Access Role. Below is an example of a trust policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
  2. In Account C, create an S3 access point by following the steps here.
  3. Next, configure your S3 access point to allow access to the role created in step 1. This is an example access point policy that gives Account A permission to the access point in Account C.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>",
        "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/object/*"
      ]
    }
  ]
}
  4. Ensure your S3 bucket policy is updated to allow Account A access. This is an example S3 bucket policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Condition": {
        "StringEquals": {
          "s3:DataAccessPointAccount": "<Account-C-ID>"
        }
      }
    }
  ]
}
  5. In Account A, create a pod identity association for your EKS cluster using the AWS CLI.
aws eks create-pod-identity-association \
    --cluster-name <EKS-Cluster-Name> \
    --role-arn arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name> \
    --namespace hyperpod-ns-eng \
    --service-account my-service-account
  6. Pods that access the cross-account S3 bucket must reference this service account in their pod specification.

You can test cross-account data access by spinning up a test pod and using kubectl exec to run Amazon S3 commands inside it:

kubectl exec -it aws-test -n hyperpod-ns-team-a -- aws s3 ls s3://<access-point>

This example shows creating a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 will need to be in the same AWS Region, and the FSx for Lustre file system will need to be in the same Availability Zone as your SageMaker HyperPod cluster.

Conclusion

In this post, we provided guidance on how to set up cross-account access for data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance documentation.


About the Authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.

Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

Read More

Build a Text-to-SQL solution for data consistency in generative AI using Amazon Nova

Businesses rely on precise, real-time insights to make critical decisions. However, enabling non-technical users to access proprietary or organizational data without technical expertise remains a challenge. Text-to-SQL bridges this gap by generating precise, schema-specific queries that empower faster decision-making and foster a data-driven culture. The problem lies in obtaining deterministic answers—precise, consistent results needed for operations such as generating exact counts or detailed reports—from proprietary or organizational data. Generative AI offers several approaches to query data, but selecting the right method is critical to achieve accuracy and reliability.

This post evaluates the key options for querying data using generative AI, discusses their strengths and limitations, and demonstrates why Text-to-SQL is the best choice for deterministic, schema-specific tasks. We show how to effectively use Text-to-SQL using Amazon Nova, a foundation model (FM) available in Amazon Bedrock, to derive precise and reliable answers from your data.

Options for querying data

Organizations have multiple options for querying data, and the choice depends on the nature of the data and the required outcomes. This section evaluates the following approaches to provide clarity on when to use each and why Text-to-SQL is optimal for deterministic, schema-based tasks:

  • Retrieval Augmented Generation (RAG):
    • Use case – Ideal for extracting insights from unstructured or semi-structured sources like documents or articles.
    • Strengths – Handles diverse data formats and provides narrative-style responses.
    • Limitations – Probabilistic answers can vary, making it unsuitable for deterministic queries, such as retrieving exact counts or matching specific schema constraints.
    • Example – “Summarize feedback from product reviews.”
  • Generative business intelligence (BI):
    • Use case – Suitable for high-level insights and summary generation based on structured and unstructured data.
    • Strengths – Delivers narrative insights for decision-making and trends.
    • Limitations – Lacks the precision required for schema-specific or operational queries. Results often vary in phrasing and focus.
    • Example – “What were the key drivers of sales growth last quarter?”
  • Text-to-SQL:
    • Use case – Excels in querying structured organizational data directly from relational schemas.
    • Strengths – Provides deterministic, reproducible results for specific, schema-dependent queries. Ideal for precise operations such as filtering, counting, or aggregating data.
    • Limitations – Requires structured data and predefined schemas.
    • Example – “How many patients diagnosed with diabetes visited clinics in New York City last month?”

In scenarios demanding precision and consistency, Text-to-SQL outshines RAG and generative BI by delivering accurate, schema-driven results. These characteristics make it the ideal solution for operational and structured data queries.

Solution overview

This solution uses the Amazon Nova Lite and Amazon Nova Pro large language models (LLMs) to simplify querying proprietary data with natural language, making it accessible to non-technical users.

Amazon Bedrock is a fully managed service that simplifies building and scaling generative AI applications by providing access to leading FMs through a single API. It allows developers to experiment with and customize these models securely and privately, integrating generative AI capabilities into their applications without managing infrastructure.

Within this system, Amazon Nova represents a new generation of FMs delivering advanced intelligence and industry-leading price-performance. These models, including Amazon Nova Lite and Amazon Nova Pro, are designed to handle various tasks such as text, image, and video understanding, making them versatile tools for diverse applications.

You can find the deployment code and detailed instructions in our GitHub repo.

The solution consists of the following key features:

  • Dynamic schema context – Retrieves the database schema dynamically for precise query generation
  • SQL query generation – Converts natural language into SQL queries using the Amazon Nova Pro LLM
  • Query execution – Runs queries on organizational databases and retrieves results
  • Formatted responses – Processes raw query results into user-friendly formats using the Amazon Nova Lite LLM

The following diagram illustrates the solution architecture.

Data flow between user, Streamlit app, Amazon Bedrock, and Microsoft SQL Server, illustrating query processing and response generation

In this solution, we use Amazon Nova Pro and Amazon Nova Lite to take advantage of their respective strengths, facilitating efficient and effective processing at each stage:

  • Dynamic schema retrieval and SQL query generation – We use Amazon Nova Pro to handle the translation of natural language inputs into SQL queries. Its advanced capabilities in complex reasoning and understanding make it well-suited for accurately interpreting user intents and generating precise SQL statements.
  • Formatted response generation – After we run the SQL queries, the raw results are processed using Amazon Nova Lite. This model efficiently formats the data into user-friendly outputs, making the information accessible to non-technical users. Its speed and cost-effectiveness are advantageous for this stage, where rapid processing and straightforward presentation are key.

By strategically deploying Amazon Nova Pro and Amazon Nova Lite in this manner, the solution makes sure that each component operates optimally, balancing performance, accuracy, and cost-effectiveness.

Prerequisites

Complete the following prerequisite steps:

  1. Install the AWS Command Line Interface (AWS CLI). For instructions, refer to Installing or updating to the latest version of the AWS CLI.
  2. Configure the basic settings that the AWS CLI uses to interact with AWS. For more information, see Configuration and credential file settings in the AWS CLI.
  3. Make sure Amazon Bedrock is enabled in your AWS account.
  4. Obtain access to Amazon Nova Lite and Amazon Nova Pro.
  5. Install Python 3.9 or later, along with required libraries (Streamlit version 1.8.0 or later, Boto3, pymssql, and environment management packages).
  6. Create a Microsoft SQL Server (version 2016 or later) database with credentials to connect.
    1. Create a secret in AWS Secrets Manager for database credentials and name it mssql_secrets. For instructions, see Create an AWS Secrets Manager secret.

Our sample code uses a Microsoft SQL Server database, but the solution also supports other relational database services.

For more information about prerequisites, refer to the GitHub repo.

Set up the development environment

In the command prompt, navigate to the folder where the code exists and run the following command:

python3.9 -m pip install -r requirements.txt --upgrade

This command installs the required libraries to run the application.

Load the sample dataset in the database

Make sure you have created a secret in Secrets Manager named mssql_secrets as mentioned in the prerequisites. If you named your secret something else, update the code in app.py (line 29) and load_data.py (line 22).

After you create the secret, run the following command from the code folder:

python load_data.py

This command creates a database named Sales with tables Products, Customers, and Orders and loads the sample data in these tables.

Run the application

To run the application, execute the following command:

streamlit run app.py

Example queries

In this section, we explore some sample queries.

For our first query, we ask “Who are the customers who bought smartphones?” This generates the following SQL:

SELECT DISTINCT CustomerName, ProductName, SUM(Quantity) AS TotalSoldQuantity FROM vw_sales WHERE ProductName LIKE '%smartphone%' GROUP BY CustomerName, ProductName, OrderDate; 

We get the following formatted response:

  1. Alice Johnson, who bought 1 smartphone on October 14th, 2023.
  2. Ivy Martinez, who bought 2 smartphones on October 15th, 2023.

Next, we ask “How many smartphones are in stock?” This generates the following SQL:

SELECT DISTINCT ProductName, StockQuantity AS AvailableQuantity FROM vw_sales WHERE ProductName LIKE '%smartphone%'; 

We get the response “There are 100 smartphones currently in stock.”

Sales-bot web interface demonstrating natural language to SQL conversion with Amazon Bedrock, showing smartphone purchase query and results

Code execution flow

In this section, we explore the code execution flow. The code reference is from the GitHub repo. Do not run the different parts of the code individually.

Retrieve schema dynamically

Use INFORMATION_SCHEMA views to extract schema details dynamically (code reference from app.py):

def get_schema_context(db_name, db_view_name):
    conn = connect_to_db()
    cursor = conn.cursor()
    cursor.execute(f"USE {db_name}")
    query = f"SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{db_view_name}'"
    cursor.execute(query)
    schema = cursor.fetchall()
    print("Schema:", schema)
    return '\n'.join([f"- {row[0]}: {row[1]}" for row in schema])

Dynamic schema retrieval adapts automatically to changes by querying metadata tables for updated schema details, such as table names and column types. This facilitates seamless integration of schema updates into the Text-to-SQL system, reducing manual effort and improving scalability.

Test this function to verify it adapts automatically when schema changes occur.
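
For example, a quick check against the sample Sales database created by load_data.py could look like the following; vw_sales is the view referenced throughout the sample code:

# Print the current columns and data types of the view used for query generation
print(get_schema_context("Sales", "vw_sales"))

# After adding a column to the underlying tables or view, rerun the call;
# the new column should appear in the output without any code changes.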

Before generating SQL, fetch schema details for the relevant tables to facilitate accurate query construction.

Generate a SQL query using Amazon Nova Pro

Send the user query and schema context to Amazon Nova Pro (code reference from sql_generator.py):

def generate_sql_query(question: str, schema_context: str, db_name: str, db_view_name: str = None) -> str:
    
    nova_client = NovaClient()
       
    # Base prompt with SQL generation rules
    base_prompt = """
    MS SQL DB {db_name} has one view named '{db_view_name}'.
    Always use '{db_view_name}' as the table name when generating your query.
    Create an MS SQL query by carefully understanding the question and generate the query between the tags <begin sql> and </end sql>.
    The MS SQL query should select all columns from the view named '{db_view_name}'.
    In your SQL query, always use a LIKE condition in the WHERE clause.
    If a question is asked about product stock, always use 'distinct' in your SQL query.
    Never generate an SQL query that gives an error upon execution.
      
    
    Question: {question}
    
    Database Schema : {schema_context}
    
    Generate SQL query:
    """
    
    # Format the prompt with the question and schema context
    formatted_prompt = base_prompt.format(
        question=question,
        db_name=db_name,
        db_view_name=db_view_name if db_view_name else "No view name provided",
        schema_context=schema_context if schema_context else "No additional context provided"
    )
        
    # Invoke Nova model
    response = nova_client.invoke_model(
        model_id='amazon.nova-pro-v1:0',
        prompt=formatted_prompt,
        temperature=0.1  # Lower temperature for more deterministic SQL generation
    )
    
    # Extract SQL query from response using regex
    sql_match = extract_sql_from_nova_response(response)
    if sql_match:
        return sql_match
    else:
        raise ValueError("No SQL query found in the response")
    
def extract_sql_from_nova_response(response):
    try:
        # Navigate the nested dictionary structure
        content = response['output']['message']['content']
        # Get the text from the first content item
        text = content[0]['text']
        
        # Find the positions of begin and end tags
        begin_tag = "<begin sql>"
        end_tag = "</end sql>"
        start_pos = text.find(begin_tag)
        end_pos = text.find(end_tag)
        
        # If both tags are found, extract the SQL between them
        if start_pos != -1 and end_pos != -1:
            # Add length of begin tag to start position to skip the tag itself
            sql_query = text[start_pos + len(begin_tag):end_pos].strip()
            return sql_query
            
        return None
        
    except (KeyError, IndexError):
        # Return None if the expected structure is not found
        return None

This code establishes a structured context for a text-to-SQL use case, guiding Amazon Nova Pro to generate SQL queries based on a predefined database schema. It provides consistency by defining a static database context that clarifies table names, columns, and relationships, helping prevent ambiguity in query formation. Queries are required to reference the vw_sales view, standardizing data extraction for analytics and reporting. Additionally, whenever applicable, the generated queries must include quantity-related fields, making sure that business users receive key insights on product sales, stock levels, or transactional counts. To enhance search flexibility, the LLM is instructed to use the LIKE operator in WHERE conditions instead of exact matches, allowing for partial matches and accommodating variations in user input. By enforcing these constraints, the code optimizes Text-to-SQL interactions, providing structured, relevant, and business-aligned query generation for sales data analysis.

Execute a SQL query

Run the SQL query on the database and capture the result (code reference from app.py):

cursor.execute(sql_command)
result = cursor.fetchall()
print(result)
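
The cursor object comes from a standard database connection that app.py establishes before running the query. The following is a minimal sketch of what that connection might look like, assuming the pyodbc driver and placeholder connection details that are not part of the original code:

import pyodbc

# Hypothetical connection details; substitute the values for your own SQL Server instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-sql-server-endpoint>;"
    "DATABASE=<your-database>;"
    "UID=<username>;PWD=<password>"
)
cursor = conn.cursor()

# sql_command is the query string returned by generate_sql_query
cursor.execute(sql_command)
result = cursor.fetchall()
print(result)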

Format the query results using Amazon Nova Lite

Send the database result from the SQL query to Amazon Nova Lite to format it in a human-readable format and print it on the Streamlit UI (code reference from app.py):

def interact_with_nova(user_input, llm_query, query_response, model="nova"):
    session = boto3.session.Session()
    region = session.region_name

    nova_client = NovaClient(region_name=region)

    final_prompt = f"""Human: You are an expert chatbot who is happy to assist users. The user question is given in the <Question> tag and the query results in the <query_response> tag. Understand the question and use the information in <query_response> to generate an answer. If there is more than one entry, give a numbered list. Never return the <Question> or <query_response> contents verbatim in your response.
    For example: question - "How many mice were sold?"
                 llm response:
                               "There were 3 mice sold in total.
                               - 1 mouse sold to Mia Perez on October 2nd, 2023.
                               - 2 mice sold to Jack Hernandez on October 1st, 2023."
    <Question>
    {user_input}
    </Question>
    <query_response>
    {query_response}
    </query_response>"""

    try:
        response = nova_client.invoke_model(
            model_id='amazon.nova-lite-v1:0',
            prompt=final_prompt,
            max_tokens=4096,
            temperature=0.7
        )

        content = response['output']['message']['content']
        text = content[0]['text']
        return text

    except Exception as e:
        print(f"Error in LLM interaction: {str(e)}")
        return "Sorry, an error occurred while processing your request."

Clean up

Follow these steps to clean up resources in your AWS environment and avoid incurring future costs:

  1. Clean up database resources:
    • Delete the database instance or server created for this solution, along with any associated backups or snapshots.
  2. Clean up security resources:
    • Remove the secrets, IAM roles and policies, and security group rules created for database access.
  3. Clean up the frontend (only if hosting the Streamlit application on Amazon EC2):
    • Stop the EC2 instance hosting the Streamlit application.
    • Delete associated storage volumes.
  4. Clean up additional resources (if applicable):
    • Remove Elastic Load Balancers.
    • Delete virtual private cloud (VPC) configurations.
  5. Check the AWS Management Console to confirm all resources have been deleted.

Conclusion

Text-to-SQL with Amazon Bedrock and Amazon Nova LLMs provides a scalable solution for deterministic, schema-based querying. By delivering consistent and precise results, it empowers organizations to make informed decisions, improve operational efficiency, and reduce reliance on technical resources.

For a more comprehensive example of a Text-to-SQL solution built on Amazon Bedrock, explore the GitHub repo Setup Amazon Bedrock Agent for Text-to-SQL Using Amazon Athena with Streamlit. This open source project demonstrates how to use Amazon Bedrock and Amazon Nova LLMs to build a robust Text-to-SQL agent that can generate complex queries, self-correct, and query diverse data sources.

Start experimenting with Text-to-SQL use cases today by getting started with Amazon Bedrock.


About the authors

Mansi Sharma is a Solutions Architect for Amazon Web Services. Mansi is a trusted technical advisor helping enterprise customers architect and implement cloud solutions at scale. She drives customer success through technical leadership, architectural guidance, and innovative problem-solving while working with cutting-edge cloud technologies. Mansi specializes in generative AI application development and serverless technologies.

Marie Yap is a Principal Solutions Architect for Amazon Web Services.  In this role, she helps various organizations begin their journey to the cloud. She also specializes in analytics and modern data architectures.


Modernize and migrate on-premises fraud detection machine learning workflows to Amazon SageMaker

Modernize and migrate on-premises fraud detection machine learning workflows to Amazon SageMaker

This post is co-written with Qing Chen and Mark Sinclair from Radial.

Radial is the largest 3PL fulfillment provider, also offering integrated payment, fraud detection, and omnichannel solutions to mid-market and enterprise brands. With over 30 years of industry expertise, Radial tailors its services and solutions to align strategically with each brand’s unique needs.

Radial supports brands in tackling common ecommerce challenges, from scalable, flexible fulfillment enabling delivery consistency to providing secure transactions. With a commitment to fulfilling promises from click to delivery, Radial empowers brands to navigate the dynamic digital landscape with the confidence and capability to deliver a seamless, secure, and superior ecommerce experience.

In this post, we share how Radial optimized the cost and performance of their fraud detection machine learning (ML) applications by modernizing their ML workflow using Amazon SageMaker.

The business need for fraud detection models

ML has proven to be an effective approach in fraud detection compared to traditional approaches. ML models can analyze vast amounts of transactional data, learn from historical fraud patterns, and detect anomalies that signal potential fraud in real time. By continuously learning and adapting to new fraud patterns, ML can make sure fraud detection systems stay resilient and robust against evolving threats, enhancing detection accuracy and reducing false positives over time. This post showcases how companies like Radial can modernize and migrate their on-premises fraud detection ML workflows to SageMaker. By using the AWS Experience-Based Acceleration (EBA) program, they can enhance efficiency, scalability, and maintainability through close collaboration.

Challenges of on-premises ML models

Although ML models are highly effective at combating evolving fraud trends, managing these models on premises presents significant scalability and maintenance challenges.

Scalability

On-premises systems are inherently limited by the physical hardware available. During peak shopping seasons, when transaction volumes surge, the infrastructure might struggle to keep up without substantial upfront investment. This can result in slower processing times or a reduced capacity to run multiple ML applications concurrently, potentially leading to missed fraud detections. Scaling an on-premises infrastructure is typically a slow and resource-intensive process, hindering a business’s ability to adapt quickly to increased demand. On the model training side, data scientists often face bottlenecks due to limited resources, forcing them to wait for infrastructure availability or reduce the scope of their experiments. This delays innovation and can lead to suboptimal model performance, putting businesses at a disadvantage in a rapidly changing fraud landscape.

Maintenance

Maintaining an on-premises infrastructure for fraud detection requires a dedicated IT team to manage servers, storage, networking, and backups. Maintaining uptime often involves implementing and maintaining redundant systems, because a failure could result in critical downtime and an increased risk of undetected fraud. Moreover, fraud detection models naturally degrade over time and require regular retraining, deployment, and monitoring. On-premises systems typically lack the built-in automation tools needed to manage the full ML lifecycle. As a result, IT teams must manually handle tasks such as updating models, monitoring for drift, and deploying new versions. This adds operational complexity, increases the likelihood of errors, and diverts valuable resources from other business-critical activities.

Common modernization challenges in ML cloud migration

Organizations face several significant challenges when modernizing their ML workloads through cloud migration. One major hurdle is the skill gap, where developers and data scientists might lack expertise in microservices architecture, advanced ML tools, and DevOps practices for cloud environments. This can lead to development delays, complex and costly architectures, and increased security vulnerabilities. Cross-functional barriers, characterized by limited communication and collaboration between teams, can also impede modernization efforts by hindering information sharing. Slow decision-making is another critical challenge. Many organizations take too long to make choices about their cloud move. They spend too much time thinking about options instead of taking action. This delay can cause them to miss chances to speed up their modernization. It also stops them from using the cloud’s ability to quickly try new things and make changes. In the fast-moving world of ML and cloud technology, being slow to decide can put companies behind their competitors. Another significant obstacle is complex project management, because modernization initiatives often require coordinating work across multiple teams with conflicting priorities. This challenge is compounded by difficulties in aligning stakeholders on business outcomes, quantifying and tracking benefits to demonstrate value, and balancing long-term benefits with short-term goals. To address these challenges and streamline modernization efforts, AWS offers the EBA program. This methodology is designed to assist customers in aligning executives’ vision and resolving roadblocks, accelerating their cloud journey, and achieving a successful migration and modernization of their ML workloads to the cloud.

EBA: AWS team collaboration

EBA is a 3-day interactive workshop that uses SageMaker to accelerate business outcomes. It guides participants through a prescriptive ML lifecycle, starting with identifying business goals and ML problem framing, and progressing through data processing, model development, production deployment, and monitoring.

We recognize that customers have different starting points. For those beginning from scratch, it’s often simpler to start with low code or no code solutions like Amazon SageMaker Canvas and Amazon SageMaker JumpStart, gradually transitioning to developing custom models on Amazon SageMaker Studio. However, because Radial has an existing on-premises ML infrastructure, we can begin directly by using SageMaker to address challenges in their current solution.

During the EBA, experienced AWS ML subject matter experts and the AWS Account Team worked closely with Radial’s cross-functional team. The AWS team offered tailored advice, tackled obstacles, and enhanced the organization’s capacity for ongoing ML integration. Instead of concentrating solely on data and ML technology, the emphasis is on addressing critical business challenges. This strategy helps organizations extract significant value from previously underutilized resources.

Modernizing ML workflows: From a legacy on-premises data center to SageMaker

Before modernization, Radial hosted its ML applications on premises within its data center. The legacy ML workflow presented several challenges, particularly in the time-intensive model development and deployment processes.

Legacy workflow: On-premises ML development and deployment

When the data science team needed to build a new fraud detection model, the development process typically took 2–4 weeks. During this phase, data scientists performed tasks such as the following:

  • Data cleaning and exploratory data analysis (EDA)
  • Feature engineering
  • Model prototyping and training experiments
  • Model evaluation to finalize the fraud detection model

These steps were carried out using on-premises servers, which limited the number of experiments that could be run concurrently due to hardware constraints. After the model was finalized, the data science team handed over the model artifacts and implementation code—along with detailed instructions—to the software developers and DevOps teams. This transition initiated the model deployment process, which involved:

  • Provisioning infrastructure – The software team set up the necessary infrastructure to host the ML API in a test environment.
  • API implementation and testing – Extensive testing and communication between the data science and software teams were required to make sure the model inference API behaved as expected. This phase typically added 2–3 weeks to the timeline.
  • Production deployment – The DevOps and system engineering teams provisioned and scaled on-premises hardware to deploy the ML API into production, a process that could take up to several weeks depending on resource availability.

Overall, the legacy workflow was prone to delays and inefficiencies, with significant communication overhead and a reliance on manual provisioning.

Modern workflow: SageMaker and MLOps

With the migration to SageMaker and the adoption of a machine learning operations (MLOps) architecture, Radial streamlined its entire ML lifecycle—from development to deployment. The new workflow consists of the following stages:

  • Model development – The data science team continues to perform tasks such as data cleaning, EDA, feature engineering, and model training within 2–4 weeks. However, with the scalable and on-demand compute resources of SageMaker, they can conduct more training experiments in the same timeframe, leading to improved model performance and faster iterations.
  • Seamless model deployment – When a model is ready, the data science team approves it in SageMaker and triggers the MLOps pipeline to deploy the model to the test (pre-production) environment. This eliminates the need for back-and-forth communication with the software team at this stage. Key improvements include:
    • The ML API inference code is preconfigured and wrapped by the data scientists during development, providing consistent behavior between development and deployment.
    • Deployment to test environments takes minutes, because the MLOps pipeline automates infrastructure provisioning and deployment.
  • Final integration and testing – The software team quickly integrates the API and performs necessary tests, such as integration and load testing. After the tests are successful, the team triggers the pipeline to deploy the ML models into production, which takes only minutes.

The MLOps pipeline not only automates the provisioning of cloud resources, but also provides consistency between pre-production and production environments, minimizing deployment risks.
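
To make the approval step concrete, the following is a minimal sketch of approving a registered model version with the AWS SDK for Python (Boto3). The model package ARN is a placeholder, and in Radial's setup the status change is picked up by the GitLab model deploy pipeline rather than acted on directly in this snippet:

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder ARN of the model version awaiting approval in the Model Registry
model_package_arn = "arn:aws:sagemaker:<region>:<account-id>:model-package/fraud-detection/3"

# Approving the model version; downstream automation (for example, the GitLab
# model deploy pipeline) can react to this status change and deploy the endpoint
sm_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed offline evaluation; promote to pre-production"
)

An Amazon EventBridge rule on the model package state change event can then start the deployment pipeline automatically, which keeps the hand-off between data scientists and the MLOps team hands-free.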

Legacy vs. modern workflow comparison

The new workflow significantly reduces time and complexity:

  • Manual provisioning and communication overheads are reduced
  • Deployment times are reduced from weeks to minutes
  • Consistency between environments provides smoother transitions from development to production

This transformation enables Radial to respond more quickly to evolving fraud trends while maintaining high standards of efficiency and reliability. The following figure provides a visual comparison of the legacy and modern ML workflows.

Solution overview

When Radial migrated their fraud detection systems to the cloud, they collaborated with AWS Machine Learning Specialists and Solutions Architects to redesign how the company manages the lifecycle of its ML models. By using AWS and integrating continuous integration and delivery (CI/CD) pipelines with GitLab, Terraform, and AWS CloudFormation, Radial developed a scalable, efficient, and secure MLOps architecture. This new design accelerates model development and deployment, so Radial can respond faster to evolving fraud detection challenges.

The architecture incorporates best practices in MLOps, making sure that the different stages of the ML lifecycle—from data preparation to production deployment—are optimized for performance and reliability. Key components of the solution include:

  • SageMaker – Central to the architecture, SageMaker facilitates model training, evaluation, and deployment with built-in tools for monitoring and version control
  • GitLab CI/CD pipelines – These pipelines automate the workflows for testing, building, and deploying ML models, reducing manual overhead and providing consistent processes across environments
  • Terraform and AWS CloudFormation – These services enable infrastructure as code (IaC) to provision and manage AWS resources, providing a repeatable and scalable setup for ML applications

The overall solution architecture is illustrated in the following figure, showcasing how each component integrates seamlessly to support Radial’s fraud detection initiatives.

Account isolation for secure and scalable MLOps

To streamline operations and enforce security, the MLOps architecture is built on a multi-account strategy that isolates environments based on their purpose. This design enforces strict security boundaries, reduces risks, and promotes efficient collaboration across teams. The accounts are as follows:

  • Development account (model development workspace) – The development account is a dedicated workspace for data scientists to experiment and develop models. Secure data management is enforced by isolating datasets within Amazon Simple Storage Service (Amazon S3) buckets. Data scientists use SageMaker Studio for data exploration, feature engineering, and scalable model training. When the model build CI/CD pipeline in GitLab is triggered, Terraform and CloudFormation scripts automate the provisioning of infrastructure and AWS resources needed for SageMaker training pipelines. Trained models that meet predefined evaluation metrics are versioned and registered in the Amazon SageMaker Model Registry. With this setup, data scientists and ML engineers can perform multiple rounds of training experiments, review results, and finalize the best model for deployment testing.
  • Pre-production account (staging environment) – After a model is validated and approved in the development account, it’s moved to the pre-production account for staging. At this stage, the data science team triggers the model deploy CI/CD pipeline in GitLab to configure the endpoint in the pre-production environment. Model artifacts and inference images are synced from the development account to the pre-production environment. The latest approved model is deployed as an API in a SageMaker endpoint, where it undergoes thorough integration and load testing to validate performance and reliability.
  • Production account (live environment) – After passing the pre-production tests, the model is promoted to the production account for live deployment. This account mirrors the configurations of the pre-production environment to maintain consistency and reliability. The MLOps production team triggers the model deploy CI/CD pipeline to launch the production ML API. When it’s live, the model is continuously monitored using Amazon SageMaker Model Monitor and Amazon CloudWatch to make sure it performs as expected. In the event of deployment issues, automated rollback mechanisms revert to a stable model version, minimizing disruptions and maintaining business continuity.

With this multi-account architecture, data scientists can work independently while providing seamless transitions between development and production. The automation of CI/CD pipelines reduces deployment cycles, enhances scalability, and provides the security and performance necessary to maintain effective fraud detection systems.
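
For readers unfamiliar with the Model Registry workflow described above, the following sketch shows how a trained model version might be registered from the development account; the model package group name, image URI, and artifact location are placeholders, not values from Radial's environment:

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder values; in practice these come from the SageMaker training pipeline
model_package_group = "fraud-detection-models"
inference_image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/fraud-inference:latest"
model_artifact_s3_uri = "s3://<dev-bucket>/fraud-model/model.tar.gz"

response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group,
    ModelPackageDescription="Fraud detection model candidate",
    ModelApprovalStatus="PendingManualApproval",  # approved later after review
    InferenceSpecification={
        "Containers": [
            {
                "Image": inference_image_uri,
                "ModelDataUrl": model_artifact_s3_uri,
            }
        ],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    },
)
print(response["ModelPackageArn"])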

Data privacy and compliance requirements

Radial prioritizes the protection and security of their customers’ data. As a leader in ecommerce solutions, they are committed to meeting the high standards of data privacy and regulatory compliance such as CPPA and PCI. Radial fraud detection ML APIs process sensitive information such as transaction details and behavioral analytics. To meet strict compliance requirements, they use AWS Direct Connect, Amazon Virtual Private Cloud (Amazon VPC), and Amazon S3 with AWS Key Management Service (AWS KMS) encryption to build a secure and compliant architecture.

Protecting data in transit with Direct Connect

Data is never exposed to the public internet at any stage. To maintain the secure transfer of sensitive data between on-premises systems and AWS environments, Radial uses Direct Connect, which offers the following capabilities:

  • Dedicated network connection – Direct Connect establishes a private, high-speed connection between the data center and AWS, alleviating the risks associated with public internet traffic, such as interception or unauthorized access
  • Consistent and reliable performance – Direct Connect provides consistent bandwidth and low latency, making sure fraud detection APIs operate without delays, even during peak transaction volumes

Isolating workloads with Amazon VPC

When data reaches AWS, it’s processed in a VPC for maximum security. This offers the following benefits:

  • Private subnets for sensitive data – The components of the fraud detection ML API, including SageMaker endpoints and AWS Lambda functions, reside in private subnets, which are not accessible from the public internet
  • Controlled access with security groups – Strict access control is enforced through security groups and network access control lists (ACLs), allowing only authorized systems and users to interact with VPC resources
    • Data segregation by account – As mentioned previously regarding the multi-account strategy, workloads are isolated across development, staging, and production accounts, each with its own VPC, to limit cross-environment access and maintain compliance

Securing data at rest with Amazon S3 and AWS KMS encryption

Data involved in the fraud detection workflows (for both model development and real-time inference) is securely stored in Amazon S3, with encryption powered by AWS KMS. This offers the following benefits:

  • AWS KMS encryption for sensitive data – Transaction logs, model artifacts, and prediction results are encrypted at rest using managed KMS keys
  • Encryption in transit – Interactions with Amazon S3, including uploads and downloads, are encrypted to make sure data remains secure during transfer
  • Data retention policies – Lifecycle policies enforce data retention limits, making sure sensitive data is stored only as long as necessary for compliance and business purposes before scheduled deletion

Data privacy by design

Data privacy is integrated into every step of the ML API workflow:

  • Secure inference – Incoming transaction data is processed within VPC-secured SageMaker endpoints, making sure predictions are made in a private environment
  • Minimal data retention – Real-time transaction data is anonymized where possible, and only aggregated results are stored for future analysis
  • Access control and governance – Resources are governed by AWS Identity and Access Management (IAM) policies, making sure only authorized personnel and services can access data and infrastructure

Benefits of the new ML workflow on AWS

To summarize, the implementation of the new ML workflow on AWS offers several key benefits:

  • Dynamic scalability – AWS enables Radial to scale their infrastructure dynamically to handle spikes in both model training and real-time inference traffic, providing optimal performance during peak periods.
  • Faster infrastructure provisioning – The new workflow accelerates the model deployment cycle, reducing the time to provision infrastructure and deploy new models by up to several weeks.
  • Consistency in model training and deployment – By streamlining the process, Radial achieves consistent model training and deployment across environments. This reduces communication overhead between the data science team and engineering/DevOps teams, simplifying the implementation of model deployment.
  • Infrastructure as code – With IaC, they benefit from version control and reusability, reducing manual configurations and minimizing the risk of errors during deployment.
  • Built-in model monitoring – The built-in capabilities of SageMaker, such as experiment tracking and data drift detection, help them maintain model performance and provide timely updates.

Key takeaways and lessons learned from Radial’s ML model migration

To help modernize your MLOps workflow on AWS, the following are a few key takeaways and lessons learned from Radial’s experience:

  • Collaborate with AWS for customized solutions – Engage with AWS to discuss your specific use cases and identify templates that closely match your requirements. Although AWS offers a wide range of templates for common MLOps scenarios, they might need to be customized to fit your unique needs. Explore how to adapt these templates for migrating or revamping your ML workflows.
  • Iterative customization and support – As you customize your solution, work closely with both your internal team and AWS Support to address any issues. Plan for execution-based assessments and schedule workshops with AWS to resolve challenges at each stage. This might be an iterative process, but it makes sure your modules are optimized for your environment.
  • Use account isolation for security and collaboration – Use account isolation to separate model development, pre-production, and production environments. This setup promotes seamless collaboration between your data science team and DevOps/MLOps team, while also enforcing strong security boundaries between environments.
  • Maintain scalability with proper configuration – Radial’s fraud detection models successfully handled transaction spikes during peak seasons. To maintain scalability, configure instance quota limits correctly within AWS, and conduct thorough load testing before peak traffic periods to avoid any performance issues during high-demand times.
  • Secure model metadata sharing – Consider opting out of sharing model metadata when building your SageMaker pipeline to make sure your aggregate-level model information remains secure.
  • Prevent image conflicts with proper configuration – When using an AWS managed image for model inference, specify the image's hash digest within your SageMaker pipeline. Because the latest hash digest can change dynamically for the same image version, pinning the digest helps avoid conflicts when retrieving inference images during model deployment (see the sketch after this list).
  • Fine-tune scaling metrics through load testing – Fine-tune scaling metrics, such as instance type and automatic scaling thresholds, based on proper load testing. Simulate your business’s traffic patterns during both normal and peak periods to confirm your infrastructure scales effectively.
  • Applicability beyond fraud detection – Although the implementation described here is tailored to fraud detection, the MLOps architecture is adaptable to a wide range of ML use cases. Companies looking to modernize their MLOps workflows can apply the same principles to various ML projects.
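
Following up on the image digest takeaway, here is a minimal sketch of pinning an inference image by its digest rather than by a mutable tag when defining a SageMaker model; the registry path, digest, role, and bucket are placeholders:

import boto3

sm_client = boto3.client("sagemaker")

# Pin the AWS managed inference image by its sha256 digest (placeholder shown)
# instead of a mutable tag such as ":latest", so every deployment resolves to
# exactly the same image
pinned_image_uri = (
    "<aws-managed-account-id>.dkr.ecr.<region>.amazonaws.com/"
    "sagemaker-xgboost@sha256:<image-digest>"
)

sm_client.create_model(
    ModelName="fraud-detection-model-v3",
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    PrimaryContainer={
        "Image": pinned_image_uri,
        "ModelDataUrl": "s3://<prod-bucket>/fraud-model/model.tar.gz",
    },
)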

Conclusion

This post demonstrated the high-level approach taken by Radial’s fraud team to successfully modernize their ML workflow by implementing an MLOps pipeline and migrating from on premises to the AWS Cloud. This was achieved through close collaboration with AWS during the EBA process. The EBA process begins with 4–6 weeks of preparation, culminating in a 3-day intensive workshop where a minimum viable MLOps pipeline is created using SageMaker, Amazon S3, GitLab, Terraform, and AWS CloudFormation. Following the EBA, teams typically spend an additional 2–6 weeks to refine the pipeline and fine-tune the models through feature engineering and hyperparameter optimization before production deployment. This approach enabled Radial to effectively select relevant AWS services and features, accelerating the training, deployment, and testing of ML models in a pre-production SageMaker environment. As a result, Radial successfully deployed multiple new ML models on AWS in their production environment around Q3 2024, achieving a more than 75% reduction in ML model deployment cycle and a 9% improvement in overall model performance.

“In the ecommerce retail space, mitigating fraudulent transactions and enhancing consumer experiences are top priorities for merchants. High-performing machine learning models have become invaluable tools in achieving these goals. By leveraging AWS services, we have successfully built a modernized machine learning workflow that enables rapid iterations in a stable and secure environment.”

– Lan Zhang, Head of Data Science and Advanced Analytics

To learn more about EBAs and how this approach can benefit your organization, reach out to your AWS Account Manager or Customer Solutions Manager. For additional information, refer to Using experience-based acceleration to achieve your transformation and Get to Know EBA.


About the Authors

Jake Wen is a Solutions Architect at AWS, driven by a passion for Machine Learning, Natural Language Processing, and Deep Learning. He assists Enterprise customers in achieving modernization and scalable deployment in the Cloud. Beyond the tech world, Jake finds delight in skateboarding, hiking, and piloting air drones.

Qing Chen is a senior data scientist at Radial, a full-stack solution provider for ecommerce merchants. In his role, he modernizes and manages the machine learning framework in the payment & fraud organization, driving a solid data-driven fraud decisioning flow to balance risk & customer friction for merchants.

Mark Sinclair is a senior cloud architect at Radial, a full-stack solution provider for ecommerce merchants. In his role, he designs, implements and manages the cloud infrastructure and DevOps for Radial engineering systems, driving a solid engineering architecture and workflow to provide highly scalable transactional services for Radial clients.


Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases

Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases

For an AI model to perform effectively in specialized domains, it requires access to relevant background knowledge. A customer support chat assistant, for instance, needs detailed information about the business it serves, and a legal analysis tool must draw upon a comprehensive database of past cases.

To equip large language models (LLMs) with this knowledge, developers often use Retrieval Augmented Generation (RAG). This technique retrieves pertinent information from a knowledge base and incorporates it into the user’s prompt, significantly improving the model’s responses. However, a key limitation of traditional RAG systems is that they often lose contextual nuances when encoding data, leading to irrelevant or incomplete retrievals from the knowledge base.

Challenges in traditional RAG

In traditional RAG, documents are often divided into smaller chunks to optimize retrieval efficiency. Although this method performs well in many cases, it can introduce challenges when individual chunks lack the necessary context. For example, if a policy states that remote work requires “6 months of tenure” (chunk 1) and “HR approval for exceptions” (chunk 3), but omits the middle chunk linking exceptions to manager approval, a user asking about eligibility for a 3-month tenure employee might receive a misleading “No” instead of the correct “Only with HR approval.” This occurs because isolated chunks fail to preserve dependencies between clauses, highlighting a key limitation of basic chunking strategies in RAG systems.

Contextual retrieval enhances traditional RAG by adding chunk-specific explanatory context to each chunk before generating embeddings. This approach enriches the vector representation with relevant contextual information, enabling more accurate retrieval of semantically related content when responding to user queries. For instance, when asked about remote work eligibility, it fetches both the tenure requirement and the HR exception clause, enabling the LLM to provide an accurate response such as “Normally no, but HR may approve exceptions.” By intelligently stitching fragmented information, contextual retrieval mitigates the pitfalls of rigid chunking, delivering more reliable and nuanced answers.

In this post, we demonstrate how to use contextual retrieval with Anthropic and Amazon Bedrock Knowledge Bases.

Solution overview

This solution uses Amazon Bedrock Knowledge Bases, incorporating a custom Lambda function to transform data during the knowledge base ingestion process. This Lambda function processes documents from Amazon Simple Storage Service (Amazon S3), chunks them into smaller pieces, enriches each chunk with contextual information using Anthropic’s Claude in Amazon Bedrock, and then saves the results back to an intermediate S3 bucket. Here’s a step-by-step explanation:

  1. Read input files from an S3 bucket specified in the event.
  2. Chunk input data into smaller chunks.
  3. Generate contextual information for each chunk using Anthropic’s Claude 3 Haiku.
  4. Write the processed chunks and their metadata back to an intermediate S3 bucket (a simplified sketch of this handler follows the list).
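
To make the flow concrete, the following is a simplified sketch of the enrichment logic inside such a Lambda function. It is not the exact event contract that Amazon Bedrock Knowledge Bases uses for custom transformations (the GitHub repository contains the complete handler), and the event keys, chunk size, and bucket names are placeholders:

import json
import boto3

s3 = boto3.client("s3")
bedrock_runtime = boto3.client("bedrock-runtime")

CLAUDE_HAIKU_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def generate_chunk_context(document_text, chunk_text):
    """Ask Claude 3 Haiku for a short statement situating the chunk within the document."""
    prompt = (
        "Here is a document:\n" + document_text[:8000]
        + "\n\nHere is a chunk from that document:\n" + chunk_text
        + "\n\nWrite one or two sentences of context that situate this chunk within the "
        "overall document to improve search retrieval. Answer with the context only."
    )
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_HAIKU_MODEL_ID, body=body)
    return json.loads(response["body"].read())["content"][0]["text"]

def lambda_handler(event, context):
    # Simplified event shape with placeholder keys; the real handler follows the
    # Amazon Bedrock Knowledge Bases custom transformation contract
    source_bucket = event["source_bucket"]
    source_key = event["source_key"]
    intermediate_bucket = event["intermediate_bucket"]

    document_text = s3.get_object(Bucket=source_bucket, Key=source_key)["Body"].read().decode("utf-8")

    # Naive fixed-size chunking for illustration; the repository uses a fuller strategy
    chunk_size = 2000
    chunks = [document_text[i:i + chunk_size] for i in range(0, len(document_text), chunk_size)]

    enriched = []
    for i, chunk in enumerate(chunks):
        chunk_context = generate_chunk_context(document_text, chunk)
        enriched.append({"chunk_id": i, "content": chunk_context + "\n\n" + chunk})

    # Write the contextualized chunks back for ingestion into the knowledge base
    s3.put_object(
        Bucket=intermediate_bucket,
        Key=source_key + ".chunks.json",
        Body=json.dumps(enriched).encode("utf-8"),
    )
    return {"status": "ok", "chunks": len(enriched)}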

The following diagram is the solution architecture.

Prerequisites

Before you begin, download the required files and follow the instructions in the corresponding GitHub repository to deploy this solution. The architecture is built around using the proposed chunking solution to implement contextual retrieval with Amazon Bedrock Knowledge Bases.

Implement contextual retrieval in Amazon Bedrock

In this section, we demonstrate how to use the proposed custom chunking solution to implement contextual retrieval using Amazon Bedrock Knowledge Bases. Developers can use custom chunking strategies in Amazon Bedrock to optimize how large documents or datasets are divided into smaller, more manageable pieces for processing by foundation models (FMs). This approach enables more efficient and effective handling of long-form content, improving the quality of responses. By tailoring the chunking method to the specific characteristics of the data and the requirements of the task at hand, developers can enhance the performance of natural language processing applications built on Amazon Bedrock. Custom chunking can involve techniques such as semantic segmentation, sliding windows with overlap, or using document structure to create logical divisions in the text.
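
As a simple illustration of one such strategy, the following sketch implements a sliding window chunker with overlap; the word-based token approximation is an assumption for brevity and is not the tokenizer the service uses:

def sliding_window_chunks(text, max_tokens=300, overlap_ratio=0.2):
    """Split text into overlapping chunks, approximating tokens with whitespace-separated words."""
    words = text.split()
    step = max(1, int(max_tokens * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + max_tokens >= len(words):
            break
    return chunks

# Example: 300-word chunks with 20% overlap, mirroring the benchmark configuration later in this post
chunks = sliding_window_chunks("<your document text here>", max_tokens=300, overlap_ratio=0.2)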

To implement contextual retrieval in Amazon Bedrock, complete the following steps, which can be found in the notebook in the GitHub repository.

To set up the environment, follow these steps:

  1. Install the required dependencies:
    %pip install --upgrade pip --quiet
    %pip install -r requirements.txt --no-deps

  2. Import the required libraries and set up AWS clients:
    import os
    import sys
    import time
    import boto3
    import logging
    import pprint
    import json
    from pathlib import Path
    
    # AWS Clients Setup
    s3_client = boto3.client('s3')
    sts_client = boto3.client('sts')
    session = boto3.session.Session()
    region = session.region_name
    account_id = sts_client.get_caller_identity()["Account"]
    bedrock_agent_client = boto3.client('bedrock-agent')
    bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime')
    
    # Configure logging
    logging.basicConfig(
        format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
        level=logging.INFO
    )
    logger = logging.getLogger(__name__)

  3. Define knowledge base parameters:
    # Generate unique suffix for resource names
    timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(time.time()))[-7:]
    suffix = f"{timestamp_str}"
    
    # Resource names
    knowledge_base_name_standard = 'standard-kb'
    knowledge_base_name_custom = 'custom-chunking-kb'
    knowledge_base_description = "Knowledge Base containing complex PDF."
    bucket_name = f'{knowledge_base_name_standard}-{suffix}'
    intermediate_bucket_name = f'{knowledge_base_name_standard}-intermediate-{suffix}'
    lambda_function_name = f'{knowledge_base_name_custom}-lambda-{suffix}'
    foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"
    
    # Define data sources
    data_source=[{"type": "S3", "bucket_name": bucket_name}]

Create knowledge bases with different chunking strategies

To create knowledge bases with different chunking strategies, use the following code.

  1. Standard fixed chunking:
    # Create knowledge base with fixed chunking
    knowledge_base_standard = BedrockKnowledgeBase(
        kb_name=f'{knowledge_base_name_standard}-{suffix}',
        kb_description=knowledge_base_description,
        data_sources=data_source,
        chunking_strategy="FIXED_SIZE",
        suffix=f'{suffix}-f'
    )
    
    # Upload data to S3
    def upload_directory(path, bucket_name):
        for root, dirs, files in os.walk(path):
            for file in files:
                file_to_upload = os.path.join(root, file)
                if file not in ["LICENSE", "NOTICE", "README.md"]:
                    print(f"uploading file {file_to_upload} to {bucket_name}")
                    s3_client.upload_file(file_to_upload, bucket_name, file)
                else:
                    print(f"Skipping file {file_to_upload}")
    
    upload_directory("../synthetic_dataset", bucket_name)
    
    # Start ingestion job
    time.sleep(30)  # ensure KB is available
    knowledge_base_standard.start_ingestion_job()
    kb_id_standard = knowledge_base_standard.get_knowledge_base_id()

  2. Custom chunking with Lambda function:
    import io
    import zipfile

    # Create Lambda function for custom chunking
    def create_lambda_function():
        with open('lambda_function.py', 'r') as file:
            lambda_code = file.read()

        # Package the handler code as a zip archive, as required by the Lambda API
        zip_buffer = io.BytesIO()
        with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
            zip_file.writestr('lambda_function.py', lambda_code)
        zip_buffer.seek(0)

        response = lambda_client.create_function(
            FunctionName=lambda_function_name,
            Runtime='python3.9',
            Role=lambda_role_arn,
            Handler='lambda_function.lambda_handler',
            Code={'ZipFile': zip_buffer.read()},
            Timeout=900,
            MemorySize=256
        )
        return response['FunctionArn']
    
    # Create knowledge base with custom chunking
    knowledge_base_custom = BedrockKnowledgeBase(
        kb_name=f'{knowledge_base_name_custom}-{suffix}',
        kb_description=knowledge_base_description,
        data_sources=data_source,
        lambda_function_name=lambda_function_name,
        intermediate_bucket_name=intermediate_bucket_name,
        chunking_strategy="CUSTOM",
        suffix=f'{suffix}-c'
    )
    
    # Start ingestion job
    time.sleep(30)
    knowledge_base_custom.start_ingestion_job()
    kb_id_custom = knowledge_base_custom.get_knowledge_base_id()

Evaluate performance using RAGAS framework

To evaluate performance using the RAGAS framework, follow these steps:

  1. Set up RAGAS evaluation:
    from ragas import SingleTurnSample, EvaluationDataset
    from ragas import evaluate
    from ragas.metrics import (
        context_recall,
        context_precision,
        answer_correctness
    )
    # ChatBedrock and BedrockEmbeddings are provided by the langchain-aws package
    from langchain_aws import ChatBedrock, BedrockEmbeddings
    import boto3

    # Amazon Bedrock runtime client used by the evaluation LLM and embeddings
    bedrock_client = boto3.client('bedrock-runtime')

    # Initialize Bedrock models for evaluation
    TEXT_GENERATION_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
    EVALUATION_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

    llm_for_evaluation = ChatBedrock(model_id=EVALUATION_MODEL_ID, client=bedrock_client)
    bedrock_embeddings = BedrockEmbeddings(
        model_id="amazon.titan-embed-text-v2:0",
        client=bedrock_client
    )

  2. Prepare evaluation dataset:
    # Define test questions and ground truths
    questions = [
        "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
        "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
        # Add more questions...
    ]

    ground_truths = [
        "The increase in net cash provided by operating activities was primarily due to an increase in net income and favorable changes in operating assets and liabilities.",
        "Octank Financial had the highest net cash used in investing activities in 2021, at $360 million...",
        # Add corresponding ground truths...
    ]

    def prepare_eval_dataset(kb_id, questions, ground_truths):
        samples = []
        for question, ground_truth in zip(questions, ground_truths):
            # Get response and context
            response = retrieve_and_generate(question, kb_id)
            answer = response["output"]["text"]

            # Process contexts
            contexts = []
            for citation in response["citations"]:
                context_texts = [
                    ref["content"]["text"]
                    for ref in citation["retrievedReferences"]
                    if "content" in ref and "text" in ref["content"]
                ]
                contexts.extend(context_texts)

            # Create sample
            sample = SingleTurnSample(
                user_input=question,
                retrieved_contexts=contexts,
                response=answer,
                reference=ground_truth
            )
            samples.append(sample)

        return EvaluationDataset(samples=samples)

  3. Run evaluation and compare results:
    import pandas as pd

    # Evaluate both approaches
    contextual_chunking_dataset = prepare_eval_dataset(kb_id_custom, questions, ground_truths)
    default_chunking_dataset = prepare_eval_dataset(kb_id_standard, questions, ground_truths)

    # Define metrics
    metrics = [context_recall, context_precision, answer_correctness]

    # Run evaluation
    contextual_chunking_result = evaluate(
        dataset=contextual_chunking_dataset,
        metrics=metrics,
        llm=llm_for_evaluation,
        embeddings=bedrock_embeddings,
    )

    default_chunking_result = evaluate(
        dataset=default_chunking_dataset,
        metrics=metrics,
        llm=llm_for_evaluation,
        embeddings=bedrock_embeddings,
    )

    # Compare results
    comparison_df = pd.DataFrame({
        'Default Chunking': default_chunking_result.to_pandas().mean(),
        'Contextual Chunking': contextual_chunking_result.to_pandas().mean()
    })

    # Visualize results
    def highlight_max(s):
        is_max = s == s.max()
        return ['background-color: #90EE90' if v else '' for v in is_max]

    comparison_df.style.apply(
        highlight_max,
        axis=1,
        subset=['Default Chunking', 'Contextual Chunking']
    )
Performance benchmarks

To evaluate the performance of the proposed contextual retrieval approach, we used the AWS Decision Guide: Choosing a generative AI service as the document for RAG testing. We set up two Amazon Bedrock knowledge bases for the evaluation:

  • One knowledge base with the default chunking strategy, which uses 300 tokens per chunk with a 20% overlap
  • Another knowledge base with the custom contextual retrieval chunking approach, which has a custom contextual retrieval Lambda transformer in addition to the fixed chunking strategy that also uses 300 tokens per chunk with a 20% overlap

We used the RAGAS framework to assess the performance of these two approaches using small datasets. Specifically, we looked at the following metrics:

  • context_recall – Context recall measures how many of the relevant documents (or pieces of information) were successfully retrieved
  • context_precision – Context precision is a metric that measures the proportion of relevant chunks in the retrieved_contexts
  • answer_correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth
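
The following code configures these metrics and defines the evaluation questions and ground truths used for the benchmark: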
from ragas import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    answer_correctness
)

#specify the metrics here
metrics = [
    context_recall,
    context_precision,
    answer_correctness
]

questions = [
    "What are the main AWS generative AI services covered in this guide?",
    "How does Amazon Bedrock differ from the other generative AI services?",
    "What are some key factors to consider when choosing a foundation model for your use case?",
    "What infrastructure services does AWS offer to support training and inference of large AI models?",
    "Where can I find more resources and information related to the AWS generative AI services?"
]
ground_truths = [
    "The main AWS generative AI services covered in this guide are Amazon Q Business, Amazon Q Developer, Amazon Bedrock, and Amazon SageMaker AI.",
    "Amazon Bedrock is a fully managed service that allows you to build custom generative AI applications with a choice of foundation models, including the ability to fine-tune and customize the models with your own data.",
    "Key factors to consider when choosing a foundation model include the modality (text, image, etc.), model size, inference latency, context window, pricing, fine-tuning capabilities, data quality and quantity, and overall quality of responses.",
    "AWS offers specialized hardware like AWS Trainium and AWS Inferentia to maximize the performance and cost-efficiency of training and inference for large AI models.",
    "You can find more resources like architecture diagrams, whitepapers, and solution guides on the AWS website. The document also provides links to relevant blog posts and documentation for the various AWS generative AI services."
]

The results obtained using the default chunking strategy are presented in the following table.

The results obtained using the contextual retrieval chunking strategy are presented in the following table. It demonstrates improved performance across the key metrics evaluated, including context recall, context precision, and answer correctness.

By aggregating the results, we can observe that the contextual chunking approach outperformed the default chunking strategy across the context_recall, context_precision, and answer_correctness metrics. This indicates the benefits of the more sophisticated contextual retrieval techniques implemented.

Implementation considerations

When implementing contextual retrieval using Amazon Bedrock, several factors need careful consideration. First, the custom chunking strategy must be optimized for both performance and accuracy, requiring thorough testing across different document types and sizes. The Lambda function’s memory allocation and timeout settings should be calibrated based on the expected document complexity and processing requirements, with initial recommendations of 1024 MB memory and 900-second timeout serving as baseline configurations. Organizations must also configure IAM roles with the principle of least privilege while maintaining sufficient permissions for Lambda to interact with Amazon S3 and Amazon Bedrock services. Additionally, the vectorization process and knowledge base configuration should be fine-tuned to balance between retrieval accuracy and computational efficiency, particularly when scaling to larger datasets.

Infrastructure scalability and monitoring considerations are equally crucial for successful implementation. Organizations should implement robust error-handling mechanisms within the Lambda function to manage various document formats and potential processing failures gracefully. Monitoring systems should be established to track key metrics such as chunking performance, retrieval accuracy, and system latency, enabling proactive optimization and maintenance.

Using Langfuse with Amazon Bedrock is a good option to introduce observability to this solution. The S3 bucket structure for both source and intermediate storage should be designed with clear lifecycle policies and access controls and consider Regional availability and data residency requirements. Furthermore, implementing a staged deployment approach, starting with a subset of data before scaling to full production workloads, can help identify and address potential bottlenecks or optimization opportunities early in the implementation process.

Cleanup

When you’re done experimenting with the solution, clean up the resources you created to avoid incurring future charges.

Conclusion

By combining Anthropic’s sophisticated language models with the robust infrastructure of Amazon Bedrock, organizations can now implement intelligent systems for information retrieval that deliver deeply contextualized, nuanced responses. The implementation steps outlined in this post provide a clear pathway for organizations to use contextual retrieval capabilities through Amazon Bedrock. By following the detailed configuration process, from setting up IAM permissions to deploying custom chunking strategies, developers and organizations can unlock the full potential of context-aware AI systems.

By leveraging Anthropic’s language models, organizations can deliver more accurate and meaningful results to their users while staying at the forefront of AI innovation. You can get started today with contextual retrieval using Anthropic’s language models through Amazon Bedrock and transform how your AI processes information with a small-scale proof of concept using your existing data. For personalized guidance on implementation, contact your AWS account team.


About the Authors

Suheel Farooq is a Principal Engineer in AWS Support Engineering, specializing in Generative AI, Artificial Intelligence, and Machine Learning. As a Subject Matter Expert in Amazon Bedrock and SageMaker, he helps enterprise customers design, build, modernize, and scale their AI/ML and Generative AI workloads on AWS. In his free time, Suheel enjoys working out and hiking.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Vinita is a Senior Serverless Specialist Solutions Architect at AWS. She combines AWS knowledge with strong business acumen to architect innovative solutions that drive quantifiable value for customers and is exceptional at navigating complex challenges. Her technical expertise in application modernization, generative AI, and cloud computing, together with her ability to drive measurable business impact, helps customers get the most from their journey with AWS.

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Venkata Moparthi is a Senior Solutions Architect who specializes in cloud migrations, generative AI, and secure architecture for financial services and other industries. He combines technical expertise with customer-focused strategies to accelerate digital transformation and drive business outcomes through optimized cloud solutions.
