Extract, built with Gemini, uses the model’s advanced visual reasoning and multi-modal capabilities to help councils turn old planning documents—including blurry maps an…
UK Prime Minister, NVIDIA CEO Set the Stage as AI Lights Up Europe
AI isn’t waiting. And this week, neither is Europe.
At London’s Olympia, under a ceiling of steel beams and enveloped by the thrum of startup pitches, it didn’t feel like the start of a conference — it felt like the start of something bigger.
NVIDIA founder and CEO Jensen Huang joined U.K. Prime Minister Sir Keir Starmer to open London Tech Week, a moment that signaled a clear shift: what used to be the domain of ambitious technology startups is now national policy — backed by investments in people, platforms and partnerships.
AI is transforming the entire ecosystem, from healthcare and manufacturing to scientific research, Huang told the audience. “I make this prediction – because of AI, every industry in the UK will be a tech industry,” he said.
Starmer added that his team is looking at every single department in government to see how AI can be used.
Starmer’s goal for the session was clear: to bring to life the real-world impact of the AI revolution and how AI is changing everyday lives for U.K. citizens.
“The U.K. has one of the richest AI communities of anywhere on the planet, the deepest thinkers, the best universities… and the third largest AI capital investment of anywhere in the world,” Huang said.
“So the ability to build these AI supercomputers here in the U.K. will naturally attract more startups, it will naturally enable the rich ecosystem of researchers here to do their life’s work,” Huang added.
To that end, NVIDIA will continue to invest in the U.K. “We’re going to start our AI lab here… we’re going to partner with the UK to upskill the ecosystem of developers into the world of AI,” Huang added.
All of these investments will build on one another. “Infrastructure enables more research, more research, more breakthroughs, more companies,” Huang said. “That flywheel will start taking off; it’s already quite large.”
UK on the Move: Momentum in Action
This wasn’t just a symbolic handshake. It marked the U.K.’s acceleration toward embedding AI at the core of its economic strategy. A major announcement from Prime Minister Starmer confirmed the U.K. will invest ~£1 billion in AI research compute by 2030, with investments commencing this year.
- A national AI skills initiative supported by NVIDIA aims to train developers in advanced AI skills.
- A new NVIDIA AI Technology Center in the U.K. is launching to accelerate research in embodied AI, material science and earth system modeling.
- The U.K.’s Financial Conduct Authority is using NVIDIA tech to power its innovation sandbox for safe and secure AI experimentation.
- The U.K. government and NVIDIA also announced a new initiative to accelerate AI-native 6G research and deployment.
- And further cementing the U.K.’s compute power, Isambard-AI, the U.K.’s fastest AI supercomputer, powered by some 5,500 NVIDIA GH200 Grace Hopper Superchips, is set to be fully operational this summer.
“We need to showcase what we have,” Starmer said. “This is a two-way conversation” between the government and industry.
Starmer underscored the U.K.’s “sovereign AI ambitions,” emphasizing that AI is not just about technology, but about codifying a nation’s culture, common sense and history.
And the movement isn’t confined to the U.K. Across Europe, governments are no longer debating whether AI matters. The question in every capital is no longer why AI, but how soon it can be deployed at scale.
- In Sweden, NVIDIA is working with Wallenberg Investments, AstraZeneca, Ericsson, Saab and SEB to build the country’s first national AI infrastructure, anchored by the NVIDIA Grace Blackwell platform.
- In Germany, the Leibniz Supercomputing Centre is building Blue Lion — a €250 million supercomputer based on the new NVIDIA Vera Rubin architecture, designed for real-time AI, simulation and science.
- In France, a joint venture between MGX, Bpifrance, Mistral AI and NVIDIA will establish Europe’s largest AI Campus in the Paris region, a 1.4 GW facility aiming to build sovereign and sustainable AI infrastructure for the continent.
“In the last 10 years, AI has advanced 1 million times,” Huang said. “The speed of change is incredible.”
NVIDIA’s commitment to the U.K. is evident, with over 1,700 Inception members and 500 employees across four offices.
NVIDIA is actively building the ‘AI factories of the future’ with leading U.K. companies.
And it’s powering the next generation of startups and scale-ups, from Basecamp Research to Wayve.
What Comes Next: NVIDIA GTC Paris at VivaTech
Next, the story moves to Paris, where Jensen Huang will headline NVIDIA GTC Paris live from VivaTech.
June 11 | 11:00 a.m. CEST | Dôme de Paris
VivaTech or GTC Paris pass required to attend
Livestream available globally, free
Expect news on NVIDIA Blackwell, sovereign AI initiatives, new regional partnerships and how European innovators are turning intent into infrastructure with NVIDIA.
One Week. One Story. One Start.
From Downing Street to the Dôme de Paris, this week reads less like a schedule and more like a strategy.
This isn’t just a collection of conferences. It’s a continental shift — where Europe is aligning talent, policy and compute to lead in AI.
This is just chapter one. But the story is already racing ahead.
Updates to Apple’s On-Device and Server Foundation Language Models
With Apple Intelligence, we’re integrating powerful generative AI right into the apps and experiences people use every day, all while protecting their privacy. At the 2025 Worldwide Developers Conference we introduced a new generation of language foundation models specifically developed to enhance the Apple Intelligence features in our latest software releases. We also introduced the new Foundation Models framework, which gives app developers direct access to the on-device foundation language model at the core of Apple Intelligence.
We crafted these generative models to power the wide range of… (Apple Machine Learning Research)
‘AI Maker, Not an AI Taker’: UK Builds Its Vision With NVIDIA Infrastructure
U.K. Prime Minister Keir Starmer’s ambition for Britain to be an “AI maker, not an AI taker” is becoming a reality at London Tech Week.
With NVIDIA’s support, the U.K. is building sovereign compute infrastructure, investing in cutting-edge research and skills, and fostering AI leadership across sectors.
As London Tech Week kicks off today, NVIDIA and some of Britain’s best companies are convening and hosting the first U.K. Sovereign AI Industry Forum.
The initiative unites leading U.K. businesses — including founding members Babcock, BAE Systems, BT, National Grid and Standard Chartered — to strengthen the nation’s economic security by advancing sovereign AI infrastructure and accelerating the growth of the U.K. AI startup ecosystem.
“We have big plans when it comes to developing the next wave of AI innovations here in the U.K. — not only so we can deliver the economic growth needed for our Plan for Change, but maintain our position as a global leader,” U.K. Secretary of State for Science, Innovation and Technology Peter Kyle said. “Central to that is making sure we have the infrastructure to power AI, so I welcome NVIDIA setting up the U.K. Sovereign AI Industry Forum — bringing together leading British businesses to develop and deploy this across the U.K. so we can drive growth and opportunity.”
The U.K. is a global AI hub, leading Europe in newly funded AI startups and total private AI investment through 2024. And the sector is growing fast, backed by over $28 billion in private investment since 2013.
And AI investment benefits the whole of the U.K.
According to an analysis released today by Public First, regions with more AI and data center infrastructure consistently show stronger economic growth. Even a modest increase in AI data center capacity could add nearly £5 billion to national economic output, while a more significant increase, for example, doubling access, could raise the annual benefit to £36.5 billion.
Responding to this opportunity, cloud provider Nscale announced at London Tech Week its commitment to deploy U.K. AI infrastructure with 10,000 NVIDIA Blackwell GPUs by the end of 2026. This facility will help position the U.K. as a global leader in AI, supporting innovation, job creation and the development of a thriving domestic AI ecosystem.
And cloud provider Nebius is continuing the region’s momentum with the launch of its first AI factory in the U.K. It announced it’s bringing 4,000 NVIDIA Blackwell GPUs online, making available scalable, high-performance AI capacity at home in the U.K. — to power U.K. research, academia and public services, including the NHS.
Mind the (Skills) Gap
AI developers are the engine of this new industrial revolution. That’s why NVIDIA is supporting the U.K. government’s national skills drive by training developers in AI.
To support this goal, a new NVIDIA AI Technology Center in the U.K. will provide hands-on training in AI, data science and accelerated computing, focusing on foundation model builders, embodied AI, materials science and earth systems modeling.
Beyond training, this collaboration drives cutting-edge AI applications and research.
For example, the U.K.’s world-leading financial services industry gets a boost from a new AI-powered digital sandbox. The sandbox, a testing environment for safe AI innovation in financial services, will be provided by the Financial Conduct Authority, with infrastructure from NayaOne and support from NVIDIA’s platform.
At the same time, Barclays Eagle Labs’ launch of an Innovation Hub in London will help AI and deep tech startups grow to the next level. NVIDIA is supporting the program by offering startups a pathway to the NVIDIA Inception program with access to advanced tools and training.
Furthermore, the Department for Science, Innovation and Technology announced a collaboration with NVIDIA to promote the nation’s goals for AI development in telecoms. Leading U.K. universities will gain access to a suite of powerful AI tools, 6G research platforms and training resources to bolster research and development on AI-native wireless networks.
The Research Engine
Universities are central to the U.K.’s strategy.
Led by Oxford University, the JADE consortium, comprising 20 universities and the Turing Institute, uses NVIDIA technologies to advance AI development and safety. At University College London, researchers are developing a digital twin of the human body enabled by NVIDIA technology. At the University of Bristol, the Isambard-AI supercomputer, built on NVIDIA Grace Hopper Superchips, is powering progress in AI safety, climate modeling and next-generation science. And at the University of Manchester, the NVIDIA Earth-2 platform is being deployed to develop pollution-flow models.
Meanwhile, U.K. tech leaders use NVIDIA’s foundational technologies to innovate across diverse sectors.
It’s how Wayve trains AI for autonomous vehicles. How JBA Risk Management helps organizations anticipate and mitigate climate risks with new precision. And how Stability AI is unleashing creativity with open-source generative AI that turns ideas into images, text and more — instantly.
NVIDIA also champions the U.K.’s most ambitious AI startups through NVIDIA Inception, providing specialized resources and support for startups building new products and services.
Basecamp Research is revolutionizing drug discovery with AI trained on the planet’s biodiversity. Humanoid advances automation and brings commercially scalable, reliable and safe humanoid robots closer to real-world deployment. Relation is accelerating the discovery of tomorrow’s medicines. And Synthesia turns text into studio-quality, multilingual videos with lifelike avatars.
Industry in Motion
The U.K.’s biggest companies are moving fast, too.
Companies like BT, LSEG and NatWest are transforming industries with AI. BT is powering agentic AI-based autonomous operations; LSEG is empowering customers with highly accurate, AI-driven data and insights; and NatWest is streamlining operations and safeguarding customers.
With government vision, talent and cutting-edge tech converging, the U.K. is taking its place among those making AI advances at home and worldwide.
Watch NVIDIA founder and CEO Jensen Huang’s keynote at NVIDIA GTC Paris at VivaTech.
HuggingFace Safetensors Support in PyTorch Distributed Checkpointing
Summary
PyTorch Distributed Checkpointing (DCP) is investing in removing interoperability blockers so that popular formats, like HuggingFace safetensors, work well with PyTorch’s ecosystem. Since HuggingFace has become a leading format for inference and fine-tuning, DCP is beginning to support HuggingFace safetensors. The first customer of these changes is torchtune, which has seen an improved user experience now that it can cleanly read and write directly to HuggingFace with DCP APIs.
Problem
Since HuggingFace is widely used, with over 5 million users, many ML engineers would like to save and load their checkpoints in safetensors format so they can easily work with its ecosystem. By supporting the safetensors format natively, DCP simplifies checkpointing for users in the following ways:
- DCP has its own custom format, so users who wanted to work with HuggingFace models while leveraging DCP’s performance wins and features had to build custom converters and components to bridge the two systems.
- Instead of users having to download and upload their checkpoints to local storage every time, HuggingFace models can now be saved and loaded directly into the fsspec-supported storage of their choosing.
How to Use
From a user’s perspective, the only change needed to use safetensors is to call load with the new load planner and storage reader, and similarly save with the new save planner and storage writer.
The load and save APIs are called as follows:
load(
    state_dict=state_dict,
    storage_reader=HuggingFaceStorageReader(path=path),
)

save(
    state_dict=state_dict,
    storage_writer=HuggingFaceStorageWriter(
        path=path,
        fqn_to_index_mapping=mapping
    ),
)
The HuggingFaceStorageReader and HuggingFaceStorageWriter can take any fsspec-based path, so they can read and write the HF safetensors format to any fsspec-supported back-end, including local storage and HF storage. Since HuggingFace safetensors metadata doesn’t natively provide the same level of information as DCP metadata, distributed checkpoints are currently not well supported in these APIs, but DCP plans on supporting this natively in the future.
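For illustration, here is a minimal sketch of pointing the reader at different fsspec back-ends; the checkpoint paths are placeholders, and the exact import location of the HuggingFace storage components may vary slightly between PyTorch releases:

# Import paths below are an assumption; check your PyTorch release for the exact module.
from torch.distributed.checkpoint import load, HuggingFaceStorageReader

# Read safetensors files from a local directory...
load(
    state_dict=state_dict,
    storage_reader=HuggingFaceStorageReader(path="/checkpoints/llama-3-8b"),
)

# ...or from any fsspec-supported remote store, such as S3
load(
    state_dict=state_dict,
    storage_reader=HuggingFaceStorageReader(path="s3://my-bucket/llama-3-8b"),
)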
torchtune
Our first customer of HuggingFace DCP support is torchtune – a post-training library written in native PyTorch. The primary way torchtune users retrieve model weights is from the Hugging Face Hub. Before, users had to download the model weights and upload the trained checkpoints via extra CLI commands; the new DCP APIs allow them to directly read and write to HuggingFace, resulting in a much better user experience.
In addition, the support of safetensor serialization in DCP greatly simplifies the checkpointing code in torchtune. No longer will there need to be format-specific checkpointing solutions, thus increasing developer efficiency in the project.
Future Work
DCP plans to handle the distributed loading and saving of HuggingFace safetensors checkpoints with resharding. DCP also plans to support the ability to produce a consolidated final checkpoint to a single file for publishing.
Build a serverless audio summarization solution with Amazon Bedrock and Whisper
Recordings of business meetings, interviews, and customer interactions have become essential for preserving important information. However, transcribing and summarizing these recordings manually is often time-consuming and labor-intensive. With the progress in generative AI and automatic speech recognition (ASR), automated solutions have emerged to make this process faster and more efficient.
Protecting personally identifiable information (PII) is a vital aspect of data security, driven by both ethical responsibilities and legal requirements. In this post, we demonstrate how to use the OpenAI Whisper Large V3 Turbo foundation model (FM), available in Amazon Bedrock Marketplace (which offers access to over 140 models through a dedicated offering), to produce near real-time transcription. These transcriptions are then processed by Amazon Bedrock for summarization and redaction of sensitive information.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Additionally, you can use Amazon Bedrock Guardrails to automatically redact sensitive information, including PII, from the transcription summaries to support compliance and data protection needs.
In this post, we walk through an end-to-end architecture that combines a React-based frontend with Amazon Bedrock, AWS Lambda, and AWS Step Functions to orchestrate the workflow, facilitating seamless integration and processing.
Solution overview
The solution highlights the power of integrating serverless technologies with generative AI to automate and scale content processing workflows. The user journey begins with uploading a recording through a React frontend application, hosted on Amazon CloudFront and backed by Amazon Simple Storage Service (Amazon S3) and Amazon API Gateway. When the file is uploaded, it triggers a Step Functions state machine that orchestrates the core processing steps, using AI models and Lambda functions for seamless data flow and transformation. The following diagram illustrates the solution architecture.
The workflow consists of the following steps:
- The React application is hosted in an S3 bucket and served to users through CloudFront for fast, global access. API Gateway handles interactions between the frontend and backend services.
- Users upload audio or video files directly from the app. These recordings are stored in a designated S3 bucket for processing.
- An Amazon EventBridge rule detects the S3 upload event and triggers the Step Functions state machine, initiating the AI-powered processing pipeline.
- The state machine performs audio transcription, summarization, and redaction by orchestrating multiple Amazon Bedrock models in sequence. It uses Whisper for transcription, Claude for summarization, and Guardrails to redact sensitive data.
- The redacted summary is returned to the frontend application and displayed to the user.
The following diagram illustrates the state machine workflow.
The Step Functions state machine orchestrates a series of tasks to transcribe, summarize, and redact sensitive information from uploaded audio/video recordings:
- A Lambda function is triggered to gather input details (for example, Amazon S3 object path, metadata) and prepare the payload for transcription.
- The payload is sent to the OpenAI Whisper Large V3 Turbo model through the Amazon Bedrock Marketplace to generate a near real-time transcription of the recording.
- The raw transcript is passed to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock, which produces a concise and coherent summary of the conversation or content.
- A second Lambda function validates and forwards the summary to the redaction step.
- The summary is processed through Amazon Bedrock Guardrails, which automatically redacts PII and other sensitive data.
- The redacted summary is stored or returned to the frontend application through an API, where it is displayed to the user.
Prerequisites
Before you start, make sure that you have the following prerequisites in place:
- Before using Amazon Bedrock models, you must request access—a one-time setup step. For this solution, verify that access to Anthropic’s Claude 3.5 Sonnet model is enabled in your Amazon Bedrock account. For instructions, see Access Amazon Bedrock foundation models.
- Set up a guardrail to enable PII redaction by configuring filters that block or mask sensitive information. For guidance on configuring filters for additional use cases, see Remove PII from conversations by using sensitive information filters.
- Deploy the Whisper Large V3 Turbo model within the Amazon Bedrock Marketplace. This post also offers step-by-step guidance for the deployment.
- The AWS Command Line Interface (AWS CLI) should be installed and configured. For instructions, see Installing or updating to the latest version of the AWS CLI.
- Node.js 14.x or later should be installed.
- The AWS CDK CLI should be installed.
- You should have Python 3.8+.
Create a guardrail in the Amazon Bedrock console
For instructions for creating guardrails in Amazon Bedrock, refer to Create a guardrail. For details on detecting and redacting PII, see Remove PII from conversations by using sensitive information filters. Configure your guardrail with the following key settings:
- Enable PII detection and handling
- Set PII action to Redact
- Add the relevant PII types, such as:
- Names and identities
- Phone numbers
- Email addresses
- Physical addresses
- Financial information
- Other sensitive personal information
After you deploy the guardrail, note its Amazon Resource Name (ARN); you will use it later during deployment.
Deploy the Whisper model
Complete the following steps to deploy the Whisper Large V3 Turbo model:
- On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
- Search for and choose Whisper Large V3 Turbo.
- On the options menu (three dots), choose Deploy.
- Modify the endpoint name, number of instances, and instance type to suit your specific use case. For this post, we use the default settings.
- Modify the Advanced settings section to suit your use case. For this post, we use the default settings.
- Choose Deploy.
This creates a new AWS Identity and Access Management (IAM) role and deploys the model.
You can choose Marketplace deployments in the navigation pane, and in the Managed deployments section, you can see the endpoint status as Creating. Wait for the endpoint to finish deployment and the status to change to In Service, then copy the endpoint name; you will use it when deploying the solution infrastructure.
Deploy the solution infrastructure
In the GitHub repo, follow the instructions in the README file to clone the repository, then deploy the frontend and backend infrastructure.
We use the AWS Cloud Development Kit (AWS CDK) to define and deploy the infrastructure. The AWS CDK code deploys the following resources:
- React frontend application
- Backend infrastructure
- S3 buckets for storing uploads and processed results
- Step Functions state machine with Lambda functions for audio processing and PII redaction
- API Gateway endpoints for handling requests
- IAM roles and policies for secure access
- CloudFront distribution for hosting the frontend
Implementation deep dive
The backend is composed of a sequence of Lambda functions, each handling a specific stage of the audio processing pipeline:
- Upload handler – Receives audio files and stores them in Amazon S3
- Transcription with Whisper – Converts speech to text using the Whisper model
- Speaker detection – Differentiates and labels individual speakers within the audio
- Summarization using Amazon Bedrock – Extracts and summarizes key points from the transcript
- PII redaction – Uses Amazon Bedrock Guardrails to remove sensitive information for privacy compliance
Let’s examine some of the key components:
The transcription Lambda function uses the Whisper model to convert audio files to text:
import json
import boto3

# Runtime client for invoking the SageMaker endpoint that hosts Whisper
sagemaker_runtime = boto3.client("sagemaker-runtime")

def transcribe_with_whisper(audio_chunk, endpoint_name):
    # Convert audio to hex string format
    hex_audio = audio_chunk.hex()

    # Create payload for Whisper model
    payload = {
        "audio_input": hex_audio,
        "language": "english",
        "task": "transcribe",
        "top_p": 0.9
    }

    # Invoke the SageMaker endpoint running Whisper
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )

    # Parse the transcription response
    response_body = json.loads(response['Body'].read().decode('utf-8'))
    transcription_text = response_body['text']
    return transcription_text
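As a hypothetical usage sketch (the bucket, key, and endpoint name below are placeholders, not values from this solution), the function can be fed the recording bytes fetched from Amazon S3:

s3 = boto3.client("s3")
audio_bytes = s3.get_object(Bucket="my-recordings-bucket", Key="uploads/meeting.mp3")["Body"].read()

# Transcribe the recording using the deployed Whisper endpoint
transcript = transcribe_with_whisper(audio_bytes, endpoint_name="whisper-large-v3-turbo-endpoint")
print(transcript)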
We use Amazon Bedrock to generate concise summaries from the transcriptions:
# Bedrock runtime client for model invocation
bedrock_runtime = boto3.client("bedrock-runtime")

def generate_summary(transcription):
    # Format the prompt with the transcription
    prompt = f"{transcription}\n\nGive me the summary, speakers, key discussions, and action items with owners"

    # Call Bedrock for summarization (Claude 3.x models on Bedrock use the Messages API)
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "temperature": 0.7,
            "top_p": 0.9,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    # Extract and return the summary text
    result = json.loads(response['body'].read())
    return result['content'][0]['text']
A critical component of our solution is the automatic redaction of PII. We implemented this using Amazon Bedrock Guardrails to support compliance with privacy regulations:
def apply_guardrail(bedrock_runtime, content, guardrail_id):
    # Format content according to API requirements
    formatted_content = [{"text": {"text": content}}]

    # Call the guardrail API
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="OUTPUT",  # Using OUTPUT parameter for proper flow
        content=formatted_content
    )

    # Extract redacted text from response
    if 'action' in response and response['action'] == 'GUARDRAIL_INTERVENED':
        if len(response['outputs']) > 0:
            output = response['outputs'][0]
            if 'text' in output and isinstance(output['text'], str):
                return output['text']

    # Return original content if redaction fails
    return content
When PII is detected, it’s replaced with type indicators (for example, {PHONE} or {EMAIL}), making sure that summaries remain informative while protecting sensitive data.
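A hypothetical usage sketch (the guardrail ID is a placeholder; the summary text is whatever generate_summary returned) ties the pieces together:

# Redact the generated summary before returning it to the frontend
redacted_summary = apply_guardrail(bedrock_runtime, summary, guardrail_id="your-guardrail-id")
# A sentence such as "Call Jane at 555-0123" would come back as "Call {NAME} at {PHONE}"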
To manage the complex processing pipeline, we use Step Functions to orchestrate the Lambda functions:
{
  "Comment": "Audio Summarization Workflow",
  "StartAt": "TranscribeAudio",
  "States": {
    "TranscribeAudio": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {
          "bucket.$": "$.bucket",
          "key.$": "$.key"
        }
      },
      "Next": "IdentifySpeakers"
    },
    "IdentifySpeakers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "SpeakerIdentificationFunction",
        "Payload": {
          "Transcription.$": "$.Payload"
        }
      },
      "Next": "GenerateSummary"
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "BedrockSummaryFunction",
        "Payload": {
          "SpeakerIdentification.$": "$.Payload"
        }
      },
      "End": true
    }
  }
}
This workflow makes sure each step completes successfully before proceeding to the next, with automatic error handling and retry logic built in.
Test the solution
After you have successfully completed the deployment, you can use the CloudFront URL to test the solution functionality.
Security considerations
Security is a critical aspect of this solution, and we’ve implemented several best practices to support data protection and compliance:
- Sensitive data redaction – Automatically redact PII to protect user privacy.
- Fine-grained IAM permissions – Apply the principle of least privilege across AWS services and resources.
- Amazon S3 access controls – Use strict bucket policies to limit access to authorized users and roles.
- API security – Secure API endpoints using Amazon Cognito for user authentication (optional but recommended).
- CloudFront protection – Enforce HTTPS and apply modern TLS protocols to facilitate secure content delivery.
- Amazon Bedrock data security – Amazon Bedrock (including Amazon Bedrock Marketplace) protects customer data and does not send data to providers or train using customer data. This makes sure your proprietary information remains secure when using AI capabilities.
Clean up
To prevent unnecessary charges, make sure to delete the resources provisioned for this solution when you’re done:
- Delete the Amazon Bedrock guardrail:
- On the Amazon Bedrock console, choose Guardrails in the navigation pane.
- Choose your guardrail, then choose Delete.
- Delete the Whisper Large V3 Turbo model deployed through the Amazon Bedrock Marketplace:
- On the Amazon Bedrock console, choose Marketplace deployments in the navigation pane.
- In the Managed deployments section, select the deployed endpoint and choose Delete.
- Delete the AWS CDK stack by running the command cdk destroy, which deletes the AWS infrastructure.
Conclusion
This serverless audio summarization solution demonstrates the benefits of combining AWS services to create a sophisticated, secure, and scalable application. By using Amazon Bedrock for AI capabilities, Lambda for serverless processing, and CloudFront for content delivery, we’ve built a solution that can handle large volumes of audio content efficiently while helping you align with security best practices.
The automatic PII redaction feature supports compliance with privacy regulations, making this solution well-suited for regulated industries such as healthcare, finance, and legal services where data security is paramount. To get started, deploy this architecture within your AWS environment to accelerate your audio processing workflows.
About the Authors
Kaiyin Hu is a Senior Solutions Architect for Strategic Accounts at Amazon Web Services, with years of experience across enterprises, startups, and professional services. Currently, she helps customers build cloud solutions and drives GenAI adoption to cloud. Previously, Kaiyin worked in the Smart Home domain, assisting customers in integrating voice and IoT technologies.
Sid Vantair is a Solutions Architect with AWS covering Strategic accounts. He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.
Implement semantic video search using open source large vision models on Amazon SageMaker and Amazon OpenSearch Serverless
As companies and individual users deal with constantly growing amounts of video content, the ability to perform low-effort search to retrieve videos or video segments using natural language becomes increasingly valuable. Semantic video search offers a powerful solution to this problem, so users can search for relevant video content based on textual queries or descriptions. This approach can be used in a wide range of applications, from personal photo and video libraries to professional video editing, or enterprise-level content discovery and moderation, where it can significantly improve the way we interact with and manage video content.
Large-scale pre-training of computer vision models with self-supervision directly from natural language descriptions of images has made it possible to capture a wide set of visual concepts, while also bypassing the need for labor-intensive manual annotation of training data. After pre-training, natural language can be used to either reference the learned visual concepts or describe new ones, effectively enabling zero-shot transfer to a diverse set of computer vision tasks, such as image classification, retrieval, and semantic analysis.
In this post, we demonstrate how to use large vision models (LVMs) for semantic video search using natural language and image queries. We introduce some use case-specific methods, such as temporal frame smoothing and clustering, to enhance the video search performance. Furthermore, we demonstrate the end-to-end functionality of this approach by using both asynchronous and real-time hosting options on Amazon SageMaker AI to perform video, image, and text processing using publicly available LVMs on the Hugging Face Model Hub. Finally, we use Amazon OpenSearch Serverless with its vector engine for low-latency semantic video search.
About large vision models
In this post, we implement video search capabilities using multimodal LVMs, which integrate textual and visual modalities during the pre-training phase, using techniques such as contrastive multimodal representation learning, Transformer-based multimodal fusion, or multimodal prefix language modeling (for more details, see Review of Large Vision Models and Visual Prompt Engineering by J. Wang et al.). Such LVMs have recently emerged as foundational building blocks for various computer vision tasks. Owing to their capability to learn a wide variety of visual concepts from massive datasets, these models can effectively solve diverse downstream computer vision tasks across different image distributions without the need for fine-tuning. In this section, we briefly introduce some of the most popular publicly available LVMs (which we also use in the accompanying code sample).
The CLIP (Contrastive Language-Image Pre-training) model, introduced in 2021, represents a significant milestone in the field of computer vision. Trained on a collection of 400 million image-text pairs harvested from the internet, CLIP showcased the remarkable potential of using large-scale natural language supervision for learning rich visual representations. Through extensive evaluations across over 30 computer vision benchmarks, CLIP demonstrated impressive zero-shot transfer capabilities, often matching or even surpassing the performance of fully supervised, task-specific models. For instance, a notable achievement of CLIP is its ability to match the top accuracy of a ResNet-50 model trained on the 1.28 million images from the ImageNet dataset, despite operating in a true zero-shot setting without a need for fine-tuning or other access to labeled examples.
Following the success of CLIP, the open-source initiative OpenCLIP further advanced the state-of-the-art by releasing an open implementation pre-trained on the massive LAION-2B dataset, comprised of 2.3 billion English image-text pairs. This substantial increase in the scale of training data enabled OpenCLIP to achieve even better zero-shot performance across a wide range of computer vision benchmarks, demonstrating further potential of scaling up natural language supervision for learning more expressive and generalizable visual representations.
Finally, the set of SigLIP (Sigmoid Loss for Language-Image Pre-training) models, including one trained on a 10 billion multilingual image-text dataset spanning over 100 languages, further pushed the boundaries of large-scale multimodal learning. The models propose an alternative loss function for the contrastive pre-training scheme employed in CLIP and have shown superior performance in language-image pre-training, outperforming both CLIP and OpenCLIP baselines on a variety of computer vision tasks.
Solution overview
Our approach uses a multimodal LVM to enable efficient video search and retrieval based on both textual and visual queries. The approach can be logically split into an indexing pipeline, which can be carried out offline, and an online video search logic. The following diagram illustrates the pipeline workflows.
The indexing pipeline is responsible for ingesting video files and preprocessing them to construct a searchable index. The process begins by extracting individual frames from the video files. These extracted frames are then passed through an embedding module, which uses the LVM to map each frame into a high-dimensional vector representation containing its semantic information. To account for temporal dynamics and motion information present in the video, a temporal smoothing technique is applied to the frame embeddings. This step makes sure the resulting representations capture the semantic continuity across multiple subsequent video frames, rather than treating each frame independently (also see the results discussed later in this post, or consult the following paper for more details). The temporally smoothed frame embeddings are then ingested into a vector index data structure, which is designed for efficient storage, retrieval, and similarity search operations. This indexed representation of the video frames serves as the foundation for the subsequent search pipeline.
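The following is a minimal sketch of the temporal smoothing idea, not the exact implementation from the accompanying sample code; it assumes the frame embeddings are stacked in a (num_frames, dim) NumPy array:

import numpy as np

def smooth_embeddings(frame_embeddings: np.ndarray, kernel_size: int = 11) -> np.ndarray:
    # Average each embedding dimension over a sliding window of neighboring frames
    # so a frame's vector also reflects its temporal context.
    pad = kernel_size // 2
    padded = np.pad(frame_embeddings, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    smoothed = np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(frame_embeddings.shape[1])],
        axis=1,
    )
    # Re-normalize so cosine-similarity scores remain comparable across frames
    return smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)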
The search pipeline facilitates content-based video retrieval by accepting textual queries or visual queries (images) from users. Textual queries are first embedded into the shared multimodal representation space using the LVM’s text encoding capabilities. Similarly, visual queries (images) are processed through the LVM’s visual encoding branch to obtain their corresponding embeddings.
After the textual or visual queries are embedded, we can build a hybrid query to account for keywords or filter constraints provided by the user (for example, to search only across certain video categories, or to search within a particular video). This hybrid query is then used to retrieve the most relevant frame embeddings based on their conceptual similarity to the query, while adhering to any supplementary keyword constraints.
The retrieved frame embeddings are then subjected to temporal clustering (also see the results later in this post for more details), which aims to group contiguous frames into semantically coherent video segments, thereby returning an entire video sequence (rather than disjointed individual frames).
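One way this can be done (a sketch under an assumed hit format, not the exact sample-code logic) is to sort the retrieved hits by timestamp and merge hits that fall within a small time gap:

def cluster_hits(hits, max_gap_seconds=1.0):
    # hits: list of dicts such as {"timestamp": 12.4, "score": 0.31}, in any order
    clusters = []
    for hit in sorted(hits, key=lambda h: h["timestamp"]):
        if clusters and hit["timestamp"] - clusters[-1][-1]["timestamp"] <= max_gap_seconds:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    # Summarize each cluster as a playable segment with its best-scoring key frame
    return [
        {
            "start": c[0]["timestamp"],
            "end": c[-1]["timestamp"],
            "key_frame": max(c, key=lambda h: h["score"]),
        }
        for c in clusters
    ]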
Furthermore, maintaining search diversity and quality is crucial when retrieving content from videos. As mentioned previously, our approach incorporates various methods to enhance search results. For example, during the video indexing phase, the following techniques are employed to control the search results (the parameters of which might need to be tuned to get the best results):
- Adjusting the sampling rate, which determines the number of frames embedded from each second of video. Less frequent frame sampling might make sense when working with longer videos, whereas more frequent frame sampling might be needed to catch fast-occurring events.
- Modifying the temporal smoothing parameters to, for example, remove inconsistent search hits based on just a single frame hit, or merge repeated frame hits from the same scene.
During the semantic video search phase, you can use the following methods:
- Applying temporal clustering as a post-filtering step on the retrieved timestamps to group contiguous frames into semantically coherent video clips (that can be, in principle, directly played back by the end-users). This makes sure the search results maintain temporal context and continuity, avoiding disjointed individual frames.
- Setting the search size, which can be effectively combined with temporal clustering. Increasing the search size makes sure the relevant frames are included in the final results, albeit at the cost of higher computational load (see, for example, this guide for more details).
Our approach aims to strike a balance between retrieval quality, diversity, and computational efficiency by employing these techniques during both the indexing and search phases, ultimately enhancing the user experience in semantic video search.
The proposed solution architecture provides efficient semantic video search by using open source LVMs and AWS services. The architecture can be logically divided into two components: an asynchronous video indexing pipeline and online content search logic. The accompanying sample code on GitHub showcases how to build, experiment locally, as well as host and invoke both parts of the workflow using several open source LVMs available on the Hugging Face Model Hub (CLIP, OpenCLIP, and SigLIP). The following diagram illustrates this architecture.
The pipeline for asynchronous video indexing comprises the following steps:
- The user uploads a video file to an Amazon Simple Storage Service (Amazon S3) bucket, which initiates the indexing process.
- The video is sent to a SageMaker asynchronous endpoint for processing. The processing steps involve:
- Decoding of frames from the uploaded video file.
- Generation of frame embeddings by LVM.
- Application of temporal smoothing, accounting for temporal dynamics and motion information present in the video.
- The frame embeddings are ingested into an OpenSearch Serverless vector index, designed for efficient storage, retrieval, and similarity search operations.
SageMaker asynchronous inference endpoints are well-suited for handling requests with large payloads, extended processing times, and near real-time latency requirements. This SageMaker capability queues incoming requests and processes them asynchronously, accommodating large payloads and long processing times. Asynchronous inference enables cost optimization by automatically scaling the instance count to zero when there are no requests to process, so computational resources are used only when actively handling requests. This flexibility makes it an ideal choice for applications involving large data volumes, such as video processing, while maintaining responsiveness and efficient resource utilization.
OpenSearch Serverless is an on-demand serverless version for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the LVM. The index created in the OpenSearch Serverless collection serves as the vector store, enabling efficient storage and rapid similarity-based retrieval of relevant video segments.
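As a hedged sketch (the collection endpoint, index name, and field name below are placeholders, and opensearch-py with SigV4 signing for the aoss service is assumed), a k-NN lookup against this vector index could look as follows:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), region, "aoss")

client = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# query_embedding is the vector produced by the LVM for the text or image query
response = client.search(
    index="video-frames",
    body={
        "size": 20,
        "query": {"knn": {"frame_vector": {"vector": query_embedding, "k": 20}}},
        "_source": ["video_id", "timestamp"],
    },
)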
The online content search then can be broken down to the following steps:
- The user provides a textual prompt or an image (or both) representing the desired content to be searched.
- The user prompt is sent to a real-time SageMaker endpoint, which results in the following actions:
- An embedding is generated for the text or image query.
- The query with embeddings is sent to the OpenSearch vector index, which performs a k-nearest neighbors (k-NN) search to retrieve relevant frame embeddings.
- The retrieved frame embeddings undergo temporal clustering.
- The final search results, comprising relevant video segments, are returned to the user.
SageMaker real-time inference suits workloads needing real-time, interactive, low-latency responses. Deploying models to SageMaker hosting services provides fully managed inference endpoints with automatic scaling capabilities, providing optimal performance for real-time requirements.
Code and environment
This post is accompanied by a sample code on GitHub that provides comprehensive annotations and code to set up the necessary AWS resources, experiment locally with sample video files, and then deploy and run the indexing and search pipelines. The code sample is designed to exemplify best practices when developing ML solutions on SageMaker, such as using configuration files to define flexible inference stack parameters and conducting local tests of the inference artifacts before deploying them to SageMaker endpoints. It also contains guided implementation steps with explanations and reference for configuration parameters. Additionally, the notebook automates the cleanup of all provisioned resources.
Prerequisites
The prerequisite to run the provided code is to have an active AWS account and set up Amazon SageMaker Studio. Refer to Use quick setup for Amazon SageMaker AI to set up SageMaker if you’re a first-time user and then follow the steps to open SageMaker Studio.
Deploy the solution
To start the implementation, clone the repository, open the notebook semantic_video_search_demo.ipynb, and follow the steps in the notebook.
In Section 2 of the notebook, install the required packages and dependencies, define global variables, set up Boto3 clients, and attach required permissions to the SageMaker AWS Identity and Access Management (IAM) role to interact with Amazon S3 and OpenSearch Service from the notebook.
In Section 3, create security components for OpenSearch Serverless (encryption policy, network policy, and data access policy) and then create an OpenSearch Serverless collection. For simplicity, in this proof of concept implementation, we allow public internet access to the OpenSearch Serverless collection resource. However, for production environments, we strongly suggest using private connections between your Virtual Private Cloud (VPC) and OpenSearch Serverless resources through a VPC endpoint. For more details, see Access Amazon OpenSearch Serverless using an interface endpoint (AWS PrivateLink).
In Section 4, import and inspect the config file, and choose an embeddings model for video indexing and corresponding embeddings dimension. In Section 5, create a vector index within the OpenSearch collection you created earlier.
To demonstrate the search results, we also provide references to a few sample videos that you can experiment with in Section 6. In Section 7, you can experiment with the proposed semantic video search approach locally in the notebook, before deploying the inference stacks.
In Sections 8, 9, and 10, we provide code to deploy two SageMaker endpoints: an asynchronous endpoint for video embedding and indexing, and a real-time inference endpoint for video search. After these steps, we also test our deployed semantic video search solution with a few example queries.
Finally, Section 11 contains the code to clean up the created resources to avoid recurring costs.
Results
The solution was evaluated across a diverse range of use cases, including the identification of key moments in sports games, specific outfit pieces or color patterns on fashion runways, and other tasks in full-length films on the fashion industry. Additionally, the solution was tested for detecting action-packed moments like explosions in action movies, identifying when individuals entered video surveillance areas, and extracting specific events such as sports award ceremonies.
For our demonstration, we created a video catalog consisting of the following videos: A Look Back at New York Fashion Week: Men’s, F1 Insights powered by AWS, Amazon Air’s newest aircraft, the A330, is here, and Now Go Build with Werner Vogels – Autonomous Trucking.
To demonstrate the search capability for identifying specific objects across this video catalog, we employed four text prompts and four images. The presented results were obtained using the google/siglip-so400m-patch14-384 model, with temporal clustering enabled and a timestamp filter set to 1 second. Additionally, smoothing was enabled with a kernel size of 11, and the search size was set to 20 (which were found to be good default values for shorter videos). The left column in the subsequent figures specifies the search type, either by image or text, along with the corresponding image name or text prompt used.
The following figure shows the text prompts we used and the corresponding results.
The following figure shows the images we used to perform reverse images search and corresponding search results for each image.
As mentioned, we implemented temporal clustering in the lookup code, allowing for the grouping of frames based on their ordered timestamps. The accompanying notebook with sample code showcases the temporal clustering functionality by displaying (a few frames from) the returned video clip and highlighting the key frame with the highest search score within each group, as illustrated in the following figure. This approach facilitates a convenient presentation of the search results, enabling users to return entire playable video clips (even if not all frames were actually indexed in a vector store).
To showcase the hybrid search capabilities with OpenSearch Service, we present results for the textual prompt “sky,” with all other search parameters set identically to the previous configurations. We demonstrate two distinct cases: an unconstrained semantic search across the entire indexed video catalog, and a search confined to a specific video. The following figure illustrates the results obtained from an unconstrained semantic search query.
We conducted the same search for “sky,” but now confined to trucking videos.
To illustrate the effects of temporal smoothing, we generated search signal score charts (based on cosine similarity) for the prompt “F1 crews change tyres” in the formulaone video, both with and without temporal smoothing. We set a threshold of 0.315 for illustration purposes and highlighted video segments with scores exceeding this threshold. Without temporal smoothing (see the following figure), we observed two adjacent episodes around t=35 seconds and two additional episodes after t=65 seconds. Notably, the third and fourth episodes were significantly shorter than the first two, despite exhibiting higher scores. However, we can do better if our objective is to prioritize longer, semantically cohesive video episodes in the search.
To address this, we apply temporal smoothing. As shown in the following figure, now the first two episodes appear to be merged into a single, extended episode with the highest score. The third episode experienced a slight score reduction, and the fourth episode became irrelevant due to its brevity. Temporal smoothing facilitated the prioritization of longer and more coherent video moments associated with the search query by consolidating adjacent high-scoring segments and suppressing isolated, transient occurrences.
Clean up
To clean up the resources created as part of this solution, refer to the cleanup section in the provided notebook and execute the cells in this section. This will delete the created IAM policies, OpenSearch Serverless resources, and SageMaker endpoints to avoid recurring charges.
Limitations
Throughout our work on this project, we also identified several potential limitations that could be addressed through future work:
- Video quality and resolution might impact search performance, because blurred or low-resolution videos can make it challenging for the model to accurately identify objects and intricate details.
- Small objects within videos, such as a hockey puck or a football, might be difficult for LVMs to consistently recognize due to their diminutive size and visibility constraints.
- LVMs might struggle to comprehend scenes that represent a temporally prolonged contextual situation, such as detecting a point-winning shot in tennis or a car overtaking another vehicle.
- Accurate automatic measurement of solution performance is hindered without the availability of manually labeled ground truth data for comparison and evaluation.
Summary
In this post, we demonstrated the advantages of the zero-shot approach to implementing semantic video search using either text prompts or images as input. This approach readily adapts to diverse use cases without the need for retraining or fine-tuning models specifically for video search tasks. Additionally, we introduced techniques such as temporal smoothing and temporal clustering, which significantly enhance the quality and coherence of video search results.
The proposed architecture is designed to facilitate a cost-effective production environment with minimal effort, eliminating the requirement for extensive expertise in machine learning. Furthermore, the current architecture seamlessly accommodates the integration of open source LVMs, enabling the implementation of custom preprocessing or postprocessing logic during both the indexing and search phases. This flexibility is made possible by using SageMaker asynchronous and real-time deployment options, providing a powerful and versatile solution.
You can implement semantic video search using different approaches or AWS services. For related content, refer to the following AWS blog posts as examples on semantic search using proprietary ML models: Implement serverless semantic search of image and live video with Amazon Titan Multimodal Embeddings or Build multimodal search with Amazon OpenSearch Service.
About the Authors
Dr. Alexander Arzhanov is an AI/ML Specialist Solutions Architect based in Frankfurt, Germany. He helps AWS customers design and deploy their ML solutions across the EMEA region. Prior to joining AWS, Alexander was researching origins of heavy elements in our universe and grew passionate about ML after using it in his large-scale scientific calculations.
Dr. Ivan Sosnovik is an Applied Scientist in the AWS Machine Learning Solutions Lab. He develops ML solutions to help customers to achieve their business goals.
Nikita Bubentsov is a Cloud Sales Representative based in Munich, Germany, and part of Technical Field Community (TFC) in computer vision and machine learning. He helps enterprise customers drive business value by adopting cloud solutions and supports AWS EMEA organizations in the computer vision area. Nikita is passionate about computer vision and the future potential that it holds.
Multi-account support for Amazon SageMaker HyperPod task governance
GPUs are a precious resource; they are both in short supply and much more costly than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (for internal or external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than having siloed infrastructure that might be underutilized. Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.
The specific reasons and setup can vary depending on the size, structure, and requirements of the enterprise. But in general, a multi-account strategy provides greater flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogenous workloads. We use SageMaker HyperPod task governance to enable this feature.
Solution overview
SageMaker HyperPod task governance streamlines resource allocation and provides cluster administrators the capability to set up policies to maximize compute utilization in a cluster. Task governance can be used to create distinct teams with their own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team’s compute quota using role-based access control.
In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.
The following diagram illustrates the solution architecture.
In this architecture, one organization is splitting resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training usage. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A’s SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access for prepared data.
Cross-account access for data scientists
When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create one AWS Identity and Access Management (IAM) role per team, called a cluster access role, which is scoped to access only that team’s task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure the data science members of Team A will not be able to submit tasks on behalf of Team B.
To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A. The cluster access role will have only the needed permissions for data scientists to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.
Next, you will need to assume the cluster access role from a role in Account B, called the data scientist role; this is the role your data scientists use in Account B. The cluster access role in Account A will then need a trust policy for the data scientist role. The following code is an example of the policy statement for the data scientist role so that it can assume the cluster access role in Account A:
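(The statement below is a minimal sketch; the account ID and role name are placeholders.)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AssumeClusterAccessRole",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name>"
        }
    ]
}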
The following code is an example of the trust policy for the cluster access role that allows the data scientist role to assume it:
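(Again a minimal sketch, with placeholder identifiers.)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataScientistRoleToAssume",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account-B-ID>:role/<Data-Scientist-Role-Name>"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}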
The final step is to create an access entry for the team’s cluster access role in the EKS cluster. This access entry should also have an access policy, such as EKSEditPolicy, that is scoped to the namespace of the team. This makes sure that Team A users in Account B can’t launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
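As a sketch, the access entry and the namespace-scoped policy association can be created with the AWS CLI; the cluster, role, and namespace names below are placeholders.
aws eks create-access-entry \
    --cluster-name <eks-cluster-name> \
    --principal-arn arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name> \
    --type STANDARD

aws eks associate-access-policy \
    --cluster-name <eks-cluster-name> \
    --principal-arn arn:aws:iam::<Account-A-ID>:role/<Cluster-Access-Role-Name> \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
    --access-scope type=namespace,namespaces=<team-namespace>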
For users in Account B, you can repeat the same setup for each team. You must create a unique cluster access role for each team so that each role maps to that team's associated namespace. To summarize, we use two different IAM roles:
- Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
- Cluster access role – The role in Account A used to give access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.
Cross-account access to prepared data
In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A's EKS cluster can access data stored in Account C. EKS Pod Identity allows you to map an IAM role to a service account in a namespace. If a pod uses a service account that has this association, Amazon EKS sets the credential environment variables in the pod's containers so that they can assume the associated IAM role.
S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They grant fine-grained access to specific users or applications working with a shared dataset, without requiring those users or applications to have full access to the entire bucket. Permissions to the access point are granted through access point policies, and each access point is configured with a policy specific to a use case or application. Because the HyperPod cluster in this post can be used by multiple teams, each team could have its own S3 access point and access point policy.
Before following these steps, ensure you have the EKS Pod Identity Add-on installed on your EKS cluster.
- In Account A, create an IAM role that has S3 permissions (such as s3:ListBucket and s3:GetObject on the access point resource) and a trust relationship with EKS Pod Identity; this will be your data access role. The following is an example of a trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
"Effect": "Allow",
"Principal": {
"Service": "pods.eks.amazonaws.com"
},
"Action": [
"sts:AssumeRole",
"sts:TagSession"
]
}
]
}
- In Account C, create an S3 access point by following the steps here.
- Next, configure your S3 access point to allow access to the role created in step 1. The following is an example access point policy that gives the data access role in Account A permission to use the access point in Account C:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>"
},
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>",
"arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/object/*"
]
}
]
}
- Make sure your S3 bucket policy is updated to delegate access control to the access point, so that requests from Account A made through the access point are allowed. The following is an example S3 bucket policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"StringEquals": {
"s3:DataAccessPointAccount": "<Account-C-ID>"
}
}
}
]
}
- In Account A, create a pod identity association for your EKS cluster using the AWS CLI.
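For example (the cluster, namespace, service account, and role names are placeholders):
aws eks create-pod-identity-association \
    --cluster-name <eks-cluster-name> \
    --namespace <team-namespace> \
    --service-account <team-service-account> \
    --role-arn arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>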
- Pods that access the cross-account S3 bucket will need this service account name referenced in their pod specification.
You can test cross-account data access by spinning up a test pod and then executing into it to run Amazon S3 commands:
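(One way to do this with kubectl is sketched below; the pod name, image, service account, and access point ARN are placeholders, and the exact commands depend on your environment.)
kubectl run s3-access-test \
    --namespace <team-namespace> \
    --image=amazon/aws-cli \
    --overrides='{"spec": {"serviceAccountName": "<team-service-account>"}}' \
    --command -- sleep 3600

kubectl exec -it s3-access-test --namespace <team-namespace> -- \
    aws s3 ls s3://arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/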
This example shows creating a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 will need to be in the same AWS Region, and the FSx for Lustre file system will need to be in the same Availability Zone as your SageMaker HyperPod cluster.
Conclusion
In this post, we provided guidance on how to set up cross-account access for data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance documentation.
About the Authors
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.
Build a Text-to-SQL solution for data consistency in generative AI using Amazon Nova
Businesses rely on precise, real-time insights to make critical decisions. However, enabling non-technical users to access proprietary or organizational data without technical expertise remains a challenge. Text-to-SQL bridges this gap by generating precise, schema-specific queries that empower faster decision-making and foster a data-driven culture. The problem lies in obtaining deterministic answers—precise, consistent results needed for operations such as generating exact counts or detailed reports—from proprietary or organizational data. Generative AI offers several approaches to query data, but selecting the right method is critical to achieve accuracy and reliability.
This post evaluates the key options for querying data using generative AI, discusses their strengths and limitations, and demonstrates why Text-to-SQL is the best choice for deterministic, schema-specific tasks. We show how to effectively use Text-to-SQL using Amazon Nova, a foundation model (FM) available in Amazon Bedrock, to derive precise and reliable answers from your data.
Options for querying data
Organizations have multiple options for querying data, and the choice depends on the nature of the data and the required outcomes. This section evaluates the following approaches to provide clarity on when to use each and why Text-to-SQL is optimal for deterministic, schema-based tasks:
- Retrieval Augmented Generation (RAG):
- Use case – Ideal for extracting insights from unstructured or semi-structured sources like documents or articles.
- Strengths – Handles diverse data formats and provides narrative-style responses.
- Limitations – Probabilistic answers can vary, making it unsuitable for deterministic queries, such as retrieving exact counts or matching specific schema constraints.
- Example – “Summarize feedback from product reviews.”
- Generative business intelligence (BI):
- Use case – Suitable for high-level insights and summary generation based on structured and unstructured data.
- Strengths – Delivers narrative insights for decision-making and trends.
- Limitations – Lacks the precision required for schema-specific or operational queries. Results often vary in phrasing and focus.
- Example – “What were the key drivers of sales growth last quarter?”
- Text-to-SQL:
- Use case – Excels in querying structured organizational data directly from relational schemas.
- Strengths – Provides deterministic, reproducible results for specific, schema-dependent queries. Ideal for precise operations such as filtering, counting, or aggregating data.
- Limitations – Requires structured data and predefined schemas.
- Example – “How many patients diagnosed with diabetes visited clinics in New York City last month?”
In scenarios demanding precision and consistency, Text-to-SQL outshines RAG and generative BI by delivering accurate, schema-driven results. These characteristics make it the ideal solution for operational and structured data queries.
Solution overview
This solution uses the Amazon Nova Lite and Amazon Nova Pro large language models (LLMs) to simplify querying proprietary data with natural language, making it accessible to non-technical users.
Amazon Bedrock is a fully managed service that simplifies building and scaling generative AI applications by providing access to leading FMs through a single API. It allows developers to experiment with and customize these models securely and privately, integrating generative AI capabilities into their applications without managing infrastructure.
Within this system, Amazon Nova represents a new generation of FMs delivering advanced intelligence and industry-leading price-performance. These models, including Amazon Nova Lite and Amazon Nova Pro, are designed to handle various tasks such as text, image, and video understanding, making them versatile tools for diverse applications.
You can find the deployment code and detailed instructions in our GitHub repo.
The solution consists of the following key features:
- Dynamic schema context – Retrieves the database schema dynamically for precise query generation
- SQL query generation – Converts natural language into SQL queries using the Amazon Nova Pro LLM
- Query execution – Runs queries on organizational databases and retrieves results
- Formatted responses – Processes raw query results into user-friendly formats using the Amazon Nova Lite LLM
The following diagram illustrates the solution architecture.
In this solution, we use Amazon Nova Pro and Amazon Nova Lite to take advantage of their respective strengths, facilitating efficient and effective processing at each stage:
- Dynamic schema retrieval and SQL query generation – We use Amazon Nova Pro to handle the translation of natural language inputs into SQL queries. Its advanced capabilities in complex reasoning and understanding make it well-suited for accurately interpreting user intents and generating precise SQL statements.
- Formatted response generation – After we run the SQL queries, the raw results are processed using Amazon Nova Lite. This model efficiently formats the data into user-friendly outputs, making the information accessible to non-technical users. Its speed and cost-effectiveness are advantageous for this stage, where rapid processing and straightforward presentation are key.
By strategically deploying Amazon Nova Pro and Amazon Nova Lite in this manner, the solution makes sure that each component operates optimally, balancing performance, accuracy, and cost-effectiveness.
Prerequisites
Complete the following prerequisite steps:
- Install the AWS Command Line Interface (AWS CLI). For instructions, refer to Installing or updating to the latest version of the AWS CLI.
- Configure the basic settings that the AWS CLI uses to interact with AWS. For more information, see Configuration and credential file settings in the AWS CLI.
- Make sure Amazon Bedrock is enabled in your AWS account.
- Obtain access to Amazon Nova Lite and Amazon Nova Pro.
- Install Python 3.9 or later, along with required libraries (Streamlit version 1.8.0 or later, Boto3, pymssql, and environment management packages).
- Create a Microsoft SQL Server (version 2016 or later) database with credentials to connect.
- Create a secret in AWS Secrets Manager for database credentials and name it mssql_secrets. For instructions, see Create an AWS Secrets Manager secret.
Our sample code uses a Microsoft SQL Server database, but this solution supports the following services:
- Amazon Relational Database Service (Amazon RDS)
- Amazon Aurora
- Microsoft SQL Server
- On-premises databases
For more information about prerequisites, refer to the GitHub repo.
Set up the development environment
In the command prompt, navigate to the folder where the code exists and run the following command:
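(A typical invocation is shown below; it assumes the repository ships a requirements.txt file.)
pip install -r requirements.txt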
This command installs the required libraries to run the application.
Load the sample dataset in the database
Make sure you have created a secret in Secrets Manager named mssql_secrets as mentioned in the prerequisites. If you named your secret something else, update the code in app.py (line 29) and load_data.py (line 22).
After you create the secret, run the following command from the code folder:
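(The invocation below is an assumption based on the load_data.py script referenced earlier.)
python load_data.py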
This command creates a database named Sales with tables Products, Customers, and Orders, and loads the sample data into these tables.
Run the application
To run the application, execute the following command:
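(The command below assumes app.py, referenced earlier, is the Streamlit entry point.)
streamlit run app.py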
Example queries
In this section, we explore some sample queries.
For our first query, we ask “Who are the customers who bought smartphones?” This generates the following SQL:
We get the following formatted response:
Alice Johnson, who bought 1 smartphone on October 14th, 2023.
Ivy Martinez, who bought 2 smartphones on October 15th, 2023.
Next, we ask “How many smartphones are in stock?” This generates the following SQL:
We get the response “There are 100 smartphones currently in stock.”
Code execution flow
In this section, we explore the code execution flow. The code snippets are referenced from the GitHub repo; don't run the different parts of the code individually.
Retrieve schema dynamically
Use INFORMATION_SCHEMA views to extract schema details dynamically (code reference from app.py):
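(The snippet below is an illustrative sketch rather than the repo's exact code; it assumes a pymssql connection and builds a plain-text schema description. Function and variable names are assumptions.)
import pymssql

def get_schema_context(conn):
    # conn is a pymssql connection created elsewhere, for example pymssql.connect(...).
    # Query metadata for every table and column so the prompt always reflects the
    # current schema, even after tables or columns change.
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
        FROM INFORMATION_SCHEMA.COLUMNS
        ORDER BY TABLE_NAME, ORDINAL_POSITION
    """)
    schema = {}
    for table_name, column_name, data_type in cursor.fetchall():
        schema.setdefault(table_name, []).append(f"{column_name} ({data_type})")
    # Render one line per table, for example "Orders: OrderID (int), Quantity (int), ...".
    return "\n".join(f"{table}: {', '.join(columns)}" for table, columns in schema.items())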
Dynamic schema retrieval adapts automatically to changes by querying metadata tables for updated schema details, such as table names and column types. This facilitates seamless integration of schema updates into the Text-to-SQL system, reducing manual effort and improving scalability.
Test this function to verify it adapts automatically when schema changes occur.
Before generating SQL, fetch schema details for the relevant tables to facilitate accurate query construction.
Generate a SQL query using Amazon Nova Pro
Send the user query and schema context to Amazon Nova Pro (code reference from sql_generator.py):
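(The snippet below is an illustrative sketch of this call using the Amazon Bedrock Converse API via Boto3; the function name, inference parameters, and prompt wording are assumptions, and in some Regions you may need a cross-Region inference profile ID instead of the base model ID.)
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_sql(user_question, schema_context):
    # The system prompt pins the model to the known schema, the vw_sales view,
    # quantity-related fields, and LIKE-based filters, as described below.
    system = [{
        "text": (
            "You generate T-SQL for Microsoft SQL Server.\n"
            f"Database schema:\n{schema_context}\n"
            "Always query the vw_sales view, include quantity-related fields when "
            "applicable, use LIKE for string filters, and return only the SQL statement."
        )
    }]
    messages = [{"role": "user", "content": [{"text": user_question}]}]
    response = bedrock_runtime.converse(
        modelId="amazon.nova-pro-v1:0",
        system=system,
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()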
This code establishes a structured context for a Text-to-SQL use case, guiding Amazon Nova Pro to generate SQL queries based on a predefined database schema. It provides consistency by defining a static database context that clarifies table names, columns, and relationships, helping prevent ambiguity in query formation. Queries are required to reference the vw_sales view, standardizing data extraction for analytics and reporting. Additionally, whenever applicable, the generated queries must include quantity-related fields, making sure that business users receive key insights on product sales, stock levels, or transactional counts. To enhance search flexibility, the LLM is instructed to use the LIKE operator in WHERE conditions instead of exact matches, allowing for partial matches and accommodating variations in user input. By enforcing these constraints, the code optimizes Text-to-SQL interactions, providing structured, relevant, and business-aligned query generation for sales data analysis.
Execute a SQL query
Run the SQL query on the database and capture the result (code reference from app.py):
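(An illustrative sketch; it assumes the same pymssql connection as before and a read-only query, and the function name is an assumption.)
def run_query(conn, sql_query):
    # Execute the generated SQL and return column names plus rows for formatting.
    cursor = conn.cursor()
    cursor.execute(sql_query)
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    return columns, rows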
Format the query results using Amazon Nova Lite
Send the database result from the SQL query to Amazon Nova Lite to format it in a human-readable format and print it on the Streamlit UI (code reference from app.py):
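(An illustrative sketch; the Bedrock client, prompt wording, and Streamlit call are assumptions based on the flow described in this post.)
import boto3
import streamlit as st

bedrock_runtime = boto3.client("bedrock-runtime")

def format_results(user_question, columns, rows):
    # Ask Amazon Nova Lite to turn the raw rows into a short, business-friendly answer.
    prompt = (
        f"Question: {user_question}\n"
        f"Columns: {columns}\n"
        f"Rows: {rows}\n"
        "Answer the question in plain language for a business user."
    )
    response = bedrock_runtime.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Display the formatted answer in the Streamlit UI, for example:
# st.write(format_results(question, columns, rows))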
Clean up
Follow these steps to clean up resources in your AWS environment and avoid incurring future costs:
- Clean up database resources:
- Delete the RDS instance or Amazon Elastic Compute Cloud (Amazon EC2) instance hosting the database.
- Remove associated storage volumes.
- Clean up security resources:
- Delete the database credentials from Secrets Manager.
- Remove Amazon Bedrock model access for Amazon Nova Pro and Amazon Nova Lite.
- Clean up the frontend (only if hosting the Streamlit application on Amazon EC2):
- Stop the EC2 instance hosting the Streamlit application.
- Delete associated storage volumes.
- Clean up additional resources (if applicable):
- Remove Elastic Load Balancers.
- Delete virtual private cloud (VPC) configurations.
- Check the AWS Management Console to confirm all resources have been deleted.
Conclusion
Text-to-SQL with Amazon Bedrock and Amazon Nova LLMs provides a scalable solution for deterministic, schema-based querying. By delivering consistent and precise results, it empowers organizations to make informed decisions, improve operational efficiency, and reduce reliance on technical resources.
For a more comprehensive example of a Text-to-SQL solution built on Amazon Bedrock, explore the GitHub repo Setup Amazon Bedrock Agent for Text-to-SQL Using Amazon Athena with Streamlit. This open source project demonstrates how to use Amazon Bedrock and Amazon Nova LLMs to build a robust Text-to-SQL agent that can generate complex queries, self-correct, and query diverse data sources.
Start experimenting with Text-to-SQL use cases today by getting started with Amazon Bedrock.
About the authors
Mansi Sharma is a Solutions Architect for Amazon Web Services. Mansi is a trusted technical advisor helping enterprise customers architect and implement cloud solutions at scale. She drives customer success through technical leadership, architectural guidance, and innovative problem-solving while working with cutting-edge cloud technologies. Mansi specializes in generative AI application development and serverless technologies.
Marie Yap is a Principal Solutions Architect for Amazon Web Services. In this role, she helps various organizations begin their journey to the cloud. She also specializes in analytics and modern data architectures.