Build an enterprise synthetic data strategy using Amazon Bedrock

The AI landscape is rapidly evolving, and more organizations are recognizing the power of synthetic data to drive innovation. However, enterprises looking to use AI face a major roadblock: how to safely use sensitive data. Stringent privacy regulations make it risky to use such data, even with robust anonymization. Advanced analytics can potentially uncover hidden correlations and reveal real data, leading to compliance issues and reputational damage. Additionally, many industries struggle with a scarcity of high-quality, diverse datasets needed for critical processes like software testing, product development, and AI model training. This data shortage can hinder innovation, slowing down development cycles across various business operations.

Organizations need innovative solutions to unlock the potential of data-driven processes without compromising ethics or data privacy. This is where synthetic data comes in—a solution that mimics the statistical properties and patterns of real data while being entirely fictitious. By using synthetic data, enterprises can train AI models, conduct analyses, and develop applications without the risk of exposing sensitive information. Synthetic data effectively bridges the gap between data utility and privacy protection. However, creating high-quality synthetic data comes with significant challenges:

  • Data quality – Making sure synthetic data accurately reflects real-world statistical properties and nuances is difficult. The data might not capture rare edge cases or the full spectrum of human interactions.
  • Bias management – Although synthetic data can help reduce bias, it can also inadvertently amplify existing biases if not carefully managed. The quality of synthetic data heavily depends on the model and data used to generate it.
  • Privacy vs. utility – Balancing privacy preservation with data utility is complex. There’s a risk of reverse engineering or data leakage if not properly implemented.
  • Validation challenges – Verifying the quality and representation of synthetic data often requires comparison with real data, which can be problematic when working with sensitive information.
  • Reality gap – Synthetic data might not fully capture the dynamic nature of the real world, potentially leading to a disconnect between model performance on synthetic data and real-world applications.

In this post, we explore how to use Amazon Bedrock for synthetic data generation, considering these challenges alongside the potential benefits to develop effective strategies for various applications across multiple industries, including AI and machine learning (ML). Amazon Bedrock offers a broad set of capabilities to build generative AI applications with a focus on security, privacy, and responsible AI. Built within the AWS landscape, Amazon Bedrock is designed to help maintain the security and compliance standards required for enterprise use.

Attributes of high-quality synthetic data

To be truly effective, synthetic data must be both realistic and reliable. This means it should accurately reflect the complexities and nuances of real-world data while maintaining complete anonymity. A high-quality synthetic dataset exhibits several key characteristics that facilitate its fidelity to the original data:

  • Data structure – The synthetic data should maintain the same structure as the real data, including the same number of columns, data types, and relationships between different data sources.
  • Statistical properties – The synthetic data should mimic the statistical properties of the real data, such as mean, median, standard deviation, correlation between variables, and distribution patterns.
  • Temporal patterns – If the real data exhibits temporal patterns (for example, diurnal or seasonal patterns), the synthetic data should also reflect these patterns.
  • Anomalies and outliers – Real-world data often contains anomalies and outliers. The synthetic data should also include a similar proportion and distribution of anomalies and outliers to accurately represent the real-world scenario.
  • Referential integrity – If the real data has relationships and dependencies between different data sources, the synthetic data should maintain these relationships to facilitate referential integrity.
  • Consistency – The synthetic data should be consistent across different data sources and maintain the relationships and dependencies between them, facilitating a coherent and unified representation of the dataset.
  • Scalability – The synthetic data generation process should be scalable to handle large volumes of data and support the generation of synthetic data for different scenarios and use cases.
  • Diversity – The synthetic data should capture the diversity present in the real data.
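
Several of these attributes, in particular the structure and statistical-property checks, lend themselves to simple programmatic comparison. The following minimal sketch (assuming you hold a real and a synthetic DataFrame with overlapping numeric columns) illustrates the kind of side-by-side check these attributes imply:

import pandas as pd

def compare_statistics(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> pd.DataFrame:
    """Compare basic statistical properties of shared numeric columns."""
    numeric_cols = real_df.select_dtypes("number").columns.intersection(
        synthetic_df.select_dtypes("number").columns
    )
    summary = pd.DataFrame({
        "real_mean": real_df[numeric_cols].mean(),
        "synthetic_mean": synthetic_df[numeric_cols].mean(),
        "real_std": real_df[numeric_cols].std(),
        "synthetic_std": synthetic_df[numeric_cols].std(),
    })
    # Largest absolute gap in pairwise correlations, per column, as a rough fidelity signal
    corr_gap = (real_df[numeric_cols].corr() - synthetic_df[numeric_cols].corr()).abs()
    summary["max_corr_gap"] = corr_gap.max()
    return summary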

Solution overview

Generating useful synthetic data that protects privacy requires a thoughtful approach. The following figure represents the high-level architecture of the proposed solution. The process involves three key steps:

  1. Identify validation rules that define the structure and statistical properties of the real data.
  2. Use those rules to generate code using Amazon Bedrock that creates synthetic data subsets.
  3. Combine multiple synthetic subsets into full datasets.

workflow to generate synthetic data

Let’s explore these three key steps for creating useful synthetic data in more detail.

Step 1: Define data rules and characteristics

To create synthetic datasets, start by establishing clear rules that capture the essence of your target data:

  1. Use domain-specific knowledge to identify key attributes and relationships.
  2. Study existing public datasets, academic resources, and industry documentation.
  3. Use tools like AWS Glue DataBrew, Amazon Bedrock, or open source alternatives (such as Great Expectations) to analyze data structures and patterns.
  4. Develop a comprehensive rule set covering:
    • Data types and value ranges
    • Inter-field relationships
    • Quality standards
    • Domain-specific patterns and anomalies

This foundational step makes sure your synthetic data accurately reflects real-world scenarios in your industry.
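
One lightweight way to record such a rule set is as a structured object that both reviewers and generation code can consume. The schema and field names below are hypothetical placeholders, not tied to any specific dataset:

# Hypothetical rule set for a customer-transactions dataset (illustrative only)
transaction_ruleset = {
    "fields": {
        "transaction_id": {"type": "string", "pattern": "txn-[0-9a-f]{12}"},
        "amount": {"type": "float", "min": 0.01, "max": 10000.00, "distribution": "right-skewed"},
        "currency": {"type": "category", "values": ["USD", "EUR", "CAD"]},
        "timestamp": {"type": "datetime", "pattern": "diurnal"},
    },
    "relationships": [
        "refund rows reference an existing transaction_id",
        "amount correlates positively with customer tier",
    ],
    "quality": {"max_null_rate": 0.05, "max_duplicate_rate": 0.0},
}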

Step 2: Generate code with Amazon Bedrock

Transform your data rules into functional code using Amazon Bedrock language models:

  1. Choose an appropriate Amazon Bedrock model based on code generation capabilities and domain relevance.
  2. Craft a detailed prompt describing the desired code output, including data structures and generation rules.
  3. Use the Amazon Bedrock API to generate Python code based on your prompts.
  4. Iteratively refine the code by:
    • Reviewing for accuracy and efficiency
    • Adjusting prompts as needed
    • Incorporating developer input for complex scenarios

The result is a tailored script that generates synthetic data entries matching your specific requirements and closely mimicking real-world data in your domain.
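
As a rough illustration of the API call in this step, the sketch below uses the boto3 Converse API to ask a Claude model in Amazon Bedrock for generation code. The model ID, Region, and rule text are examples; use a model that is enabled in your account and always review the returned code before running it:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

ruleset_text = "Columns: Region (string), Volume Size (integer, GB), Monthly Storage Cost (float)"
prompt = (
    "Write Python code that generates a 100-row pandas DataFrame following these rules:\n"
    + ruleset_text
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
)

generated_code = response["output"]["message"]["content"][0]["text"]
print(generated_code)  # review and test before using the generated code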

Step 3: Assemble and scale the synthetic dataset

Transform your generated data into a comprehensive, real-world representative dataset:

  1. Use the code from Step 2 to create multiple synthetic subsets for various scenarios.
  2. Merge subsets based on domain knowledge, maintaining realistic proportions and relationships.
  3. Align temporal or sequential components and introduce controlled randomness for natural variation.
  4. Scale the dataset to required sizes, reflecting different time periods or populations.
  5. Incorporate rare events and edge cases at appropriate frequencies.
  6. Generate accompanying metadata describing dataset characteristics and the generation process.

The end result is a diverse, realistic synthetic dataset for uses like system testing, ML model training, or data analysis. The metadata provides transparency into the generation process and data characteristics. Together, these measures result in a robust synthetic dataset that closely parallels real-world data while avoiding exposure of direct sensitive information. This generalized approach can be adapted to various types of datasets, from financial transactions to medical records, using the power of Amazon Bedrock for code generation and the expertise of domain knowledge for data validation and structuring.

Importance of differential privacy in synthetic data generation

Although synthetic data offers numerous benefits for analytics and machine learning, it’s essential to recognize that privacy concerns persist even with artificially generated datasets. As we strive to create high-fidelity synthetic data, we must also maintain robust privacy protections for the original data. Although synthetic data mimics patterns in actual data, if created improperly, it risks revealing details about sensitive information in the source dataset. This is where differential privacy enters the picture. Differential privacy is a mathematical framework that provides a way to quantify and control the privacy risks associated with data analysis. It works by injecting calibrated noise into the data generation process, making it virtually impossible to infer anything about a single data point or confidential information in the source dataset.

Differential privacy protects against re-identification exploits by adversaries attempting to extract details about data. The carefully calibrated noise added to synthetic data makes sure that even if an adversary tries, it is computationally infeasible to tie an output back to specific records in the original data, while still maintaining the overall statistical properties of the dataset. This allows the synthetic data to closely reflect real-world characteristics and remain useful for analytics and modeling while protecting privacy. By incorporating differential privacy techniques into the synthetic data generation process, you can create datasets that not only maintain statistical properties of the original data but also offer strong privacy guarantees. It enables organizations to share data more freely, collaborate on sensitive projects, and develop AI models with reduced risk of privacy breaches. For instance, in healthcare, differentially private synthetic patient data can accelerate research without compromising individual patient confidentiality.

As we continue to advance in the field of synthetic data generation, the incorporation of differential privacy is becoming not just a best practice, but a necessary component for responsible data science. This approach paves the way for a future where data utility and privacy protection coexist harmoniously, fostering innovation while safeguarding individual rights. However, although differential privacy offers strong theoretical guarantees, its practical implementation can be challenging. Organizations must carefully balance the trade-off between privacy and utility, because increasing privacy protection often comes at the cost of reduced data utility.
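
To make the mechanism concrete, the following minimal sketch applies Laplace noise to a single count query using numpy. Production systems typically rely on a vetted library such as OpenDP rather than hand-rolled noise, and must also track the total privacy budget across queries:

import numpy as np

def dp_count(records, epsilon: float) -> float:
    """Differentially private count via the Laplace mechanism.

    A count query has sensitivity 1 (adding or removing one record changes the
    result by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = len(records)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Smaller epsilon means more noise: stronger privacy, lower utility
print(dp_count(range(1000), epsilon=0.5))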

Build synthetic datasets for Trusted Advisor findings with Amazon Bedrock

In this post, we guide you through the process of creating synthetic datasets for AWS Trusted Advisor findings using Amazon Bedrock. Trusted Advisor provides real-time guidance to optimize your AWS environment, improving performance, security, and cost-efficiency through over 500 checks against AWS best practices. We demonstrate the synthetic data generation approach using the “Underutilized Amazon EBS Volumes” check (check ID: DAvU99Dc4C) as an example.

By following this post, you will gain practical knowledge on:

  • Defining data rules for Trusted Advisor findings
  • Using Amazon Bedrock to generate data creation code
  • Assembling and scaling synthetic datasets

This approach can be applied across over 500 Trusted Advisor checks, enabling you to build comprehensive, privacy-aware datasets for testing, training, and analysis. Whether you’re looking to enhance your understanding of Trusted Advisor recommendations or develop new optimization strategies, synthetic data offers powerful possibilities.

Prerequisites

To implement this approach, you must have an AWS account with the appropriate permissions.

  1. AWS Account Setup:
    • IAM permissions for:
      • Amazon Bedrock
      • AWS Trusted Advisor
      • Amazon EBS
  2. AWS Service Access:
    • Access enabled for Amazon Bedrock in your Region
    • Access to Anthropic Claude model in Amazon Bedrock
    • Enterprise or Business support plan for full Trusted Advisor access
  3. Development Environment:
    • Python 3.8 or later installed
    • Required Python packages:
      • pandas
      • numpy
      • boto3
    • The random module used in the code examples is part of the Python standard library (no installation needed)
  4. Knowledge Requirements:
    • Basic understanding of:
      •  Python programming
      •  AWS services (especially EBS and Trusted Advisor)
      •  Data analysis concepts
      •  JSON/YAML file format

Define Trusted Advisor findings rules

Begin by examining real Trusted Advisor findings for the “Underutilized Amazon EBS Volumes” check. Analyze the structure and content of these findings to identify key data elements and their relationships. Pay attention to the following:

  • Standard fields – Check ID, volume ID, volume type, snapshot ID, and snapshot age
  • Volume attributes – Size, type, age, and cost
  • Usage metrics – Read and write operations, throughput, and IOPS
  • Temporal patterns – Volume type and size variations
  • Metadata – Tags, creation date, and last attached date

As you study these elements, note the typical ranges, patterns, and distributions for each attribute. For example, observe how volume sizes correlate with volume types, or how usage patterns differ between development and production environments. This analysis will help you create a set of rules that accurately reflect real-world Trusted Advisor findings.
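
If your account has a Business or Enterprise Support plan (see the prerequisites), you can pull real findings to study with the AWS Support API. The following is a minimal sketch using the check ID referenced in this post:

import boto3

# The AWS Support API is served from the us-east-1 endpoint
support = boto3.client("support", region_name="us-east-1")

result = support.describe_trusted_advisor_check_result(
    checkId="DAvU99Dc4C",  # Underutilized Amazon EBS Volumes
    language="en",
)

# Each flagged resource carries a metadata list whose order matches the check's column definitions
for resource in result["result"]["flaggedResources"][:5]:
    print(resource["metadata"])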

After analyzing real Trusted Advisor outputs for the “Underutilized Amazon EBS Volumes” check, we identified the following crucial patterns and rules:

  • Volume type – Consider gp2, gp3, io1, io2, and st1 volume types. Verify that volume sizes are valid for their volume types.
  • Criteria – Represent multiple AWS Regions, with appropriate volume types. Correlate snapshot ages with volume ages.
  • Data structure – Each finding should include the same columns.

The following is an example ruleset:

Analysis of the AWS Trusted Advisor finding for "Underutilized Amazon EBS Volumes":
1. Columns in the Trusted Advisor Finding:
- Region
- Volume ID
- Volume Name
- Volume Type
- Volume Size
- Monthly Storage Cost
- Snapshot ID
- Snapshot Name
- Snapshot Age
2. Key Columns and Their Significance:
- Region: AWS region where the EBS volume is located
- Volume ID: Unique identifier for the EBS volume
- Volume Type: Type of EBS volume (e.g., gp2, io1, st1)
- Volume Size: Size of the volume in GB
- Monthly Storage Cost: Estimated cost for storing the volume
- Snapshot ID: Identifier of the most recent snapshot (if any)
- Snapshot Age: Age of the most recent snapshot
3. Relationships and Patterns:
- Volume ID and Snapshot ID relationship: Each volume may have zero or more snapshots
- Region and cost correlation: Storage costs may vary by region
- Volume Type and Size correlation: Certain volume types have size limitations
- Volume Size and Cost correlation: Larger volumes generally cost more
- Snapshot Age and utilization: Older snapshots might indicate less active volumes
4. Data Types and Formats:
- Region: String (e.g., "us-east-1")
- Volume ID: String starting with "vol-"
- Volume Name: String (can be null)
- Volume Type: String (gp2, gp3, io1, io2, st1, sc1, standard)
- Volume Size: Integer (in GB)
- Monthly Storage Cost: Decimal number
- Snapshot ID: String starting with "snap-" (can be null)
- Snapshot Name: String (can be null)
- Snapshot Age: Integer (in days, can be null)

Generate code with Amazon Bedrock

With your rules defined, you can now use Amazon Bedrock to generate Python code for creating synthetic Trusted Advisor findings.

The following is an example prompt for Amazon Bedrock:

Give me python code to create a 100 row pandas df with the following data:
<<Copy paste the ruleset from the above step>>

You can submit this prompt to the Amazon Bedrock chat playground using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock, and receive generated Python code. Review this code carefully, verifying it meets all specifications and generates realistic data. If necessary, iterate on your prompt or make manual adjustments to the code to address any missing logic or edge cases.

The resulting code will serve as the foundation for creating varied and realistic synthetic Trusted Advisor findings that adhere to the defined parameters. By using Amazon Bedrock in this way, you can quickly develop sophisticated data generation code that would otherwise require significant manual effort and domain expertise to create.

Create data subsets

With the code generated by Amazon Bedrock and refined with your custom functions, you can now create diverse subsets of synthetic Trusted Advisor findings for the “Underutilized Amazon EBS Volumes” check. This approach allows you to simulate a wide range of real-world scenarios. In the following sample code, we have customized the volume_id and snapshot_id format to begin with vol-9999 and snap-9999, respectively:

import pandas as pd
import numpy as np
import random

def generate_volume_id():
    return f"vol-9999{''.join(random.choices('0123456789abcdef', k=17))}"

def generate_snapshot_id():
    return f"snap-9999{''.join(random.choices('0123456789abcdef', k=17))}"

def generate_volume_name():
    prefixes = ['app', 'db', 'web', 'cache', 'log']
    suffixes = ['prod', 'dev', 'test', 'staging']
    return f"{random.choice(prefixes)}-{random.choice(suffixes)}-{random.randint(1, 100)}"

def step3_generate_base_data():
    # Generate synthetic data
    num_records = 1000
    regions = ['us-east-1', 'us-west-2', 'eu-west-1', 'ap-southeast-1']
    volume_types = ['gp2', 'gp3', 'io1', 'io2', 'st1', 'sc1', 'standard']

    data = {
        'Region': np.random.choice(regions, num_records),
        'Volume ID': [generate_volume_id() for _ in range(num_records)],
        'Volume Name': [generate_volume_name() if random.random() > 0.3 else None for _ in range(num_records)],
        'Volume Type': np.random.choice(volume_types, num_records, p=[0.4, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05]),
        'Volume Size': np.random.choice(range(1, 1001), num_records),
        'Monthly Storage Cost': np.random.uniform(0.1, 100, num_records).round(2),
        'Snapshot ID': [generate_snapshot_id() if random.random() > 0.4 else None for _ in range(num_records)],
        'Snapshot Name': [f"snapshot-{i}" if random.random() > 0.6 else None for i in range(num_records)],
        'Snapshot Age': [random.randint(1, 365) if random.random() > 0.4 else None for _ in range(num_records)]
    }

    df = pd.DataFrame(data)

    # Apply volume size constraints per volume type
    df.loc[df['Volume Type'] == 'gp2', 'Volume Size'] = df.loc[df['Volume Type'] == 'gp2', 'Volume Size'].clip(1, 16384)
    df.loc[df['Volume Type'] == 'io1', 'Volume Size'] = df.loc[df['Volume Type'] == 'io1', 'Volume Size'].clip(4, 16384)
    df.loc[df['Volume Type'] == 'st1', 'Volume Size'] = df.loc[df['Volume Type'] == 'st1', 'Volume Size'].clip(500, 16384)
    df.loc[df['Volume Type'] == 'sc1', 'Volume Size'] = df.loc[df['Volume Type'] == 'sc1', 'Volume Size'].clip(500, 16384)

    # Adjust Monthly Storage Cost based on Volume Size and Type
    df['Monthly Storage Cost'] = df.apply(
        lambda row: row['Volume Size'] * random.uniform(0.05, 0.15) * (1.5 if row['Volume Type'] in ['io1', 'io2'] else 1),
        axis=1
    ).round(2)

    # Keep Snapshot Name and Age consistent with missing Snapshot IDs
    df.loc[df['Snapshot ID'].isnull(), 'Snapshot Name'] = None
    df.loc[df['Snapshot ID'].isnull(), 'Snapshot Age'] = None

    # Flag a portion of volumes as underutilized and inflate their storage cost
    df['Underutilized'] = np.random.choice([True, False], num_records, p=[0.7, 0.3])
    df.loc[df['Underutilized'], 'Monthly Storage Cost'] *= random.uniform(1.2, 2.0)

    return df
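
A quick usage sketch for the function above; the output file name is arbitrary:

base_df = step3_generate_base_data()
print(base_df.head())
print(base_df['Volume Type'].value_counts(normalize=True))

# Persist the subset so it can be combined with others in the next step
base_df.to_csv("synthetic_ebs_findings_base.csv", index=False)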

This code creates subsets that include:

  • Various volume types and instance types
  • Different levels of utilization
  • Occasional misconfigurations (for example, underutilized volumes)
  • Diverse regional distribution

Combine and scale the dataset

The process of combining and scaling synthetic data involves merging multiple generated datasets while introducing realistic anomalies to create a comprehensive and representative dataset. This step is crucial for making sure that your synthetic data reflects the complexity and variability found in real-world scenarios. Organizations typically introduce controlled anomalies at a specific rate (usually 5–10% of the dataset) to simulate various edge cases and unusual patterns that might occur in production environments. These anomalies help in testing system responses, developing monitoring solutions, and training ML models to identify potential issues.
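
The merge itself can be as simple as concatenating the generated subsets and resampling to the target size. The following sketch assumes subsets produced by a generator such as step3_generate_base_data from the previous section; the subset count and scale factor are illustrative:

import pandas as pd

def combine_subsets(subsets, scale=1):
    """Concatenate generated subsets and optionally resample to scale the dataset up."""
    combined = pd.concat(subsets, ignore_index=True)
    if scale > 1:
        # Sample with replacement to reach the target size while keeping the distributions
        combined = combined.sample(n=len(combined) * scale, replace=True)
    # Shuffle so records from different subsets are interleaved
    return combined.sample(frac=1).reset_index(drop=True)

# Example: three scenario subsets, scaled to twice their combined size
combined_df = combine_subsets([step3_generate_base_data() for _ in range(3)], scale=2)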

When generating synthetic data for underutilized EBS volumes, you might introduce anomalies such as oversized volumes (5–10 times larger than needed), volumes with old snapshots (older than 365 days), or high-cost volumes with low utilization. For instance, a synthetic dataset might include a 1 TB gp2 volume that’s only using 100 GB of space, simulating a real-world scenario of overprovisioned resources. See the following code:

import pandas as pd
import numpy as np
import random

def introduce_anomalies(df, anomaly_rate=0.1):
    """
    Introduce various volume-related anomalies into the dataset.

    :param df: The input DataFrame
    :param anomaly_rate: The rate at which to introduce anomalies (default 10%)
    :return: DataFrame with anomalies introduced
    """
    num_anomalies = int(len(df) * anomaly_rate)
    anomaly_indices = np.random.choice(df.index, num_anomalies, replace=False)

    df['Anomaly'] = pd.NA  # Initialize Anomaly column with pandas NA

    for idx in anomaly_indices:
        anomaly_type = random.choice([
            'oversized_volume',
            'old_snapshot',
            'high_cost_low_size',
            'mismatched_type',
            'very_old_volume'
        ])

        if anomaly_type == 'oversized_volume':
            df.at[idx, 'Volume Size'] = int(df.at[idx, 'Volume Size'] * random.uniform(5, 10))
            df.at[idx, 'Monthly Storage Cost'] *= random.uniform(5, 10)

        elif anomaly_type == 'old_snapshot':
            df.at[idx, 'Snapshot Age'] = random.randint(365, 1000)

        elif anomaly_type == 'high_cost_low_size':
            df.at[idx, 'Volume Size'] = random.randint(1, 10)
            df.at[idx, 'Monthly Storage Cost'] *= random.uniform(10, 20)

        elif anomaly_type == 'mismatched_type':
            if df.at[idx, 'Volume Type'] in ['gp2', 'gp3']:
                df.at[idx, 'Volume Type'] = random.choice(['io1', 'io2'])
            else:
                df.at[idx, 'Volume Type'] = random.choice(['gp2', 'gp3'])

        elif anomaly_type == 'very_old_volume':
            df.at[idx, 'Volume Name'] = f"old-volume-{random.randint(1, 100)}"
            if pd.notna(df.at[idx, 'Snapshot Age']):
                df.at[idx, 'Snapshot Age'] = random.randint(1000, 2000)

        df.at[idx, 'Anomaly'] = anomaly_type

    return df
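
As a quick check, apply the function to the combined dataset from the earlier sketch (or any DataFrame with the expected columns) and inspect how the anomalies were distributed; an anomaly_rate of 0.08 is an arbitrary value within the 5–10% range discussed above:

df_with_anomalies = introduce_anomalies(combined_df, anomaly_rate=0.08)
print(df_with_anomalies['Anomaly'].value_counts(dropna=True))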

The following screenshot shows an example of sample rows generated.

Validate the synthetic Trusted Advisor findings

Data validation is a critical step that verifies the quality, reliability, and representativeness of your synthetic data. This process involves performing rigorous statistical analysis to verify that the generated data maintains proper distributions, relationships, and patterns that align with real-world scenarios. Validation should include both quantitative metrics (statistical measures) and qualitative assessments (pattern analysis). Organizations should implement comprehensive validation frameworks that include distribution analysis, correlation checks, pattern verification, and anomaly detection. Regular visualization of the data helps in identifying inconsistencies or unexpected patterns.

For EBS volume data, validation might include analyzing the distribution of volume sizes across different types (gp2, gp3, io1), verifying that cost correlations match expected patterns, and making sure that introduced anomalies (like underutilized volumes) maintain realistic proportions. For instance, validating that the percentage of underutilized volumes aligns with typical enterprise environments (perhaps 15–20% of total volumes) and that the cost-to-size relationships remain realistic across volume types.
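
The following sketch shows a few such checks against the generated DataFrame from the earlier sketches; the 15–20% band is the illustrative figure mentioned above, not a fixed standard:

def validate_synthetic_findings(df):
    """Run lightweight distribution, correlation, and consistency checks."""
    checks = {}

    # Volume size distribution per volume type
    checks["size_by_type"] = df.groupby("Volume Type")["Volume Size"].describe()

    # Cost should broadly track size: expect a positive correlation
    checks["size_cost_corr"] = df["Volume Size"].corr(df["Monthly Storage Cost"])

    # Proportion of underutilized volumes against an illustrative target band
    underutilized_pct = df["Underutilized"].mean()
    checks["underutilized_pct"] = underutilized_pct
    checks["underutilized_within_band"] = 0.15 <= underutilized_pct <= 0.20

    # Referential consistency: snapshot name and age should be null when the snapshot ID is null
    missing_snapshot = df["Snapshot ID"].isnull()
    checks["snapshot_consistency"] = (
        df.loc[missing_snapshot, ["Snapshot Name", "Snapshot Age"]].isnull().all().all()
    )
    return checks

for name, value in validate_synthetic_findings(df_with_anomalies).items():
    print(f"{name}:\n{value}\n")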

The following figures show examples of our validation checks.

  1. The following screenshot shows statistics of the generated synthetic datasets.
  2. The following figure shows the proportion of underutilized volumes in the generated synthetic datasets.
  3. The following figure shows the distribution of volume sizes in the generated synthetic datasets.
  4. The following figure shows the distribution of volume types in the generated synthetic datasets.
  5. The following figure shows the distribution of snapshot ages in the generated synthetic datasets.

Enhancing synthetic data with differential privacy

After exploring the steps to create synthetic datasets for the Trusted Advisor “Underutilized Amazon EBS Volumes” check, it’s worth revisiting how differential privacy strengthens this approach. When a cloud consulting firm analyzes aggregated Trusted Advisor data across multiple clients, differential privacy through OpenDP provides the critical privacy-utility balance needed. By applying carefully calibrated noise to computations of underutilized volume statistics, consultants can generate synthetic datasets that preserve essential patterns across Regions and volume types while mathematically guaranteeing individual client confidentiality. This approach verifies that the synthetic data maintains sufficient accuracy for meaningful trend analysis and recommendations, while eliminating the risk of revealing sensitive client-specific infrastructure details or usage patterns—making it an ideal complement to our synthetic data generation pipeline.

Conclusion

In this post, we showed how to use Amazon Bedrock to create synthetic data for enterprise needs. By combining language models available in Amazon Bedrock with industry knowledge, you can build a flexible and secure way to generate test data. This approach helps create realistic datasets without using sensitive information, saving time and money. It also facilitates consistent testing across projects and avoids ethical issues of using real user data. Overall, this strategy offers a solid solution for data challenges, supporting better testing and development practices.

In part 2 of this series, we will demonstrate how to use pattern recognition for different datasets to automate rule-set generation needed for the Amazon Bedrock prompts to generate corresponding synthetic data.


About the authors

Devi Nair is a Technical Account Manager at Amazon Web Services, providing strategic guidance to enterprise customers as they build, operate, and optimize their workloads on AWS. She focuses on aligning cloud solutions with business objectives to drive long-term success and innovation.

Vishal Karlupia is a Senior Technical Account Manager/Lead at Amazon Web Services, Toronto. He specializes in generative AI applications and helps customers build and scale their AI/ML workloads on AWS. Outside of work, he enjoys being outdoors and keeping bonfires alive.

Srinivas Ganapathi is a Principal Technical Account Manager at Amazon Web Services. He is based in Toronto, Canada, and works with games customers to run efficient workloads on AWS.

Nicolas Simard is a Technical Account Manager based in Montreal. He helps organizations accelerate their AI adoption journey through technical expertise, architectural best practices, and enables them to maximize business value from AWS’s Generative AI capabilities.

Read More

National Robotics Week — Latest Physical AI Research, Breakthroughs and Resources

Check back here throughout the week to learn the latest on physical AI, which enables machines to perceive, plan and act with greater autonomy and intelligence in real-world environments.

This National Robotics Week, running through April 12, NVIDIA is highlighting the pioneering technologies that are shaping the future of intelligent machines and driving progress across manufacturing, healthcare, logistics and more.

Advancements in robotics simulation and robot learning are driving this fundamental shift in the industry. Plus, the emergence of world foundation models is accelerating the evolution of AI-enabled robots capable of adapting to dynamic and complex scenarios.

For example, by providing robot foundation models like NVIDIA GR00T N1, frameworks such as NVIDIA Isaac Sim and Isaac Lab for robot simulation and training, and synthetic data generation pipelines to help train robots for diverse tasks, the NVIDIA Isaac and GR00T platforms are empowering researchers and developers to push the boundaries of robotics.

Hackathon Features Robots Powered by NVIDIA Isaac GR00T N1

The Seeed Studio Embodied AI Hackathon, which took place last month, brought together the robotics community to showcase innovative projects using the LeRobot SO-100ARM motor kit.

The event highlighted how robot learning is advancing AI-driven robotics, with teams successfully integrating the NVIDIA Isaac GR00T N1 model to speed humanoid robot development. A notable project involved developing leader-follower robot pairs capable of learning pick-and-place tasks by post-training robot foundation models on real-world demonstration data.

How the project worked:

  • Real-World Imitation Learning: Robots observe and mimic human-led demonstrations, recorded through Arducam vision systems and an external camera.
  • Post-Training Pipeline: Captured data is structured into a modality.json dataset for efficient GPU-based training with GR00T N1.
  • Bimanual Manipulation: The model is optimized for controlling two robotic arms simultaneously, enhancing cooperative skills.

The dataset is now publicly available on Hugging Face, with implementation details on GitHub.

Team “Firebreathing Rubber Duckies” celebrating with NVIDIA hosts.

Learn more about the project.

Advancing Robotics: IEEE Robotics and Automation Society Honors Emerging Innovators

The IEEE Robotics and Automation Society in March announced the recipients of its 2025 Early Academic Career Award, recognizing outstanding contributions to the fields of robotics and automation.

This year’s honorees — including NVIDIA’s Shuran Song, Abhishek Gupta and Yuke Zhu — are pioneering advancements in scalable robot learning, real-world reinforcement learning and embodied AI. Their work is shaping the next generation of intelligent systems, driving innovation that impacts both research and real-world applications.

Learn more about the award winners:

These researchers will be recognized at the International Conference on Robotics and Automation in May.

Stay up to date on NVIDIA’s leading robotics research through the Robotics Research and Development Digest (R2D2) tech blog series, subscribing to this newsletter and following NVIDIA Robotics on YouTube, Discord and developer forums.

Read More

Repurposing Protein Folding Models for Generation with Latent Diffusion

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models.

The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment of recognition for the role of AI in biology. What comes next after protein folding?

In PLAID, we develop a method that learns to sample from the latent space of protein folding models to generate new proteins. It can accept compositional function and organism prompts, and can be trained on sequence databases, which are 2-4 orders of magnitude larger than structure databases. Unlike many previous protein structure generative models, PLAID addresses the multimodal co-generation problem setting: simultaneously generating both discrete sequence and continuous all-atom structural coordinates.

Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers

Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation.

A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers.

Why Run Whisper on Arm?

Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive.

What You’ll Learn

The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you’ll cover:

1. Set Up Your Environment

Before running Whisper, you must set up your development environment. The learning path walks you through setting up an Arm-based cloud instance and installing all dependencies, such as PyTorch, Transformers, and ffmpeg.

2. Run Whisper with PyTorch and Hugging Face Transformers

Once the environment is ready, you will use the Hugging Face transformer library with PyTorch to load and execute Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach for processing audio files and generating audio transcripts.
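
A minimal example of this step using the Hugging Face pipeline API; the checkpoint and audio file below are placeholders, and the learning path walks through the full Arm-specific setup:

import torch
from transformers import pipeline

# Load a Whisper checkpoint for automatic speech recognition
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # example checkpoint; larger variants trade speed for accuracy
    torch_dtype=torch.float32,
)

# Transcribe a local audio file (ffmpeg is required for audio decoding)
result = asr("sample_audio.wav")
print(result["text"])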

3. Measure and Evaluate Performance

To ensure efficient execution, you learn how to measure transcription speeds and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions on your deployment.

Try it Yourself

Upon completing this tutorial, you’ll know how to:

  • Deploy Whisper on an Arm-based cloud instance.
  • Implement performance optimizations for efficient execution.
  • Evaluate transcription speeds and optimize further based on results.

Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.

Read More


Revisit Large-Scale Image–Caption Data in Pre-training Multimodal Foundation Models

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still unclear. Additionally, different multimodal foundation models may have distinct preferences for specific caption formats while the efforts of studying the optimal captions for each foundation model remain limited. In this work, we introduce a novel, controllable, and scalable captioning pipeline that generates diverse caption formats…Apple Machine Learning Research

Do LLMs Estimate Uncertainty Well in Instruction-Following?

Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies…Apple Machine Learning Research

Llama 4 family of models from Meta are now available in SageMaker JumpStart

Today, we’re excited to announce the availability of Llama 4 Scout and Maverick models in Amazon SageMaker JumpStart and coming soon in Amazon Bedrock. Llama 4 represents Meta’s most advanced multimodal models to date, featuring a mixture of experts (MoE) architecture and context window support up to 10 million tokens. With native multimodality and early fusion technology, Meta states that these new models demonstrate unprecedented performance across text and vision tasks while maintaining efficient compute requirements. With a dramatic increase in supported context length from 128K in Llama 3, Llama 4 is now suitable for multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over extensive codebases. You can now deploy the Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8 models using SageMaker JumpStart in the US East (N. Virginia) AWS Region.

In this blog post, we walk you through how to deploy and prompt a Llama-4-Scout-17B-16E-Instruct model using SageMaker JumpStart.

Llama 4 overview

Meta announced Llama 4 today, introducing three distinct model variants: Scout, which offers advanced multimodal capabilities and a 10M token context window; Maverick, a cost-effective solution with a 128K context window; and Behemoth, in preview. These models are optimized for multimodal reasoning, multilingual tasks, coding, tool-calling, and powering agentic systems.

Llama 4 Maverick is a powerful general-purpose model with 17 billion active parameters, 128 experts, and 400 billion total parameters, optimized for high-quality general assistant and chat use cases. Additionally, Llama 4 Maverick is available in base and instruct variants, with a quantized version (FP8) of the Instruct model for efficient deployment and a non-quantized (BF16) version for maximum accuracy.

Llama 4 Scout, the more compact and smaller model, has 17 billion active parameters, 16 experts, and 109 billion total parameters, and features an industry-leading 10M token context window. These models are designed for industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of AI applications that bridge language barriers.

See Meta’s community license agreement for usage terms and more details.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and can be isolated within your virtual private cloud (VPC). After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.

Prerequisites

To try the Llama 4 models in SageMaker JumpStart, you need the following prerequisites:

Discover Llama 4 models in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the Amazon SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive integrated development environment (IDE) that offers a unified, web-based interface for performing all aspects of the AI development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process.

In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference. You can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page in SageMaker Studio, as shown in the following figure.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI and machine learning (AI/ML) workflows and pipelines.

By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.

Deploy Llama 4 models for inference through the SageMaker JumpStart UI

On the SageMaker JumpStart landing page, you can find all the public pre-trained models offered by SageMaker AI. You can then choose the Meta model provider tab to discover all the available Meta models.

If you’re using SageMaker Classic Studio and don’t see the Llama 4 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, see Shut down and Update Studio Classic Apps.

  1. Search for Meta to view the Meta model card. Each model card shows key information, including:
    • Model name
    • Provider name
    • Task category (for example, Text Generation)
  2. Select the model card to view the model details page.

The model details page includes the following information:

  • The model name and provider information
  • Deploy button to deploy the model
  • About and Notebooks tabs with detailed information

The About tab includes important details, such as:

  • Model description
  • License information
  • Technical specifications
  • Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

  1. Choose Deploy to proceed with deployment.
  2. For Endpoint name, use the automatically generated name or enter a custom one.
  3. For Instance type, use the default: p5.48xlarge.
  4. For Initial instance count, enter the number of instances (default: 1).
    Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed.
  5. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.
  6. Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
  7. Choose Deploy. The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.

Deploy Llama 4 models for inference using the SageMaker Python SDK

When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker AI.

You can deploy the Llama 4 Scout model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id = "meta-vlm-llama-4-scout-17b-16e-instruct")

predictor = model.deploy(accept_eula=False)  # set accept_eula=True to accept the end-user license agreement and deploy

This deploys the model on SageMaker AI with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
"messages": [
 {"role": "system", "content": "You are a helpful assistant"},
 {"role": "user", "content": "How are you doing today"},
 {"role": "assistant", "content": "Good, what can i help you with today?"},
 {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
],
 "temperature": 0.6,
 "top_p": 0.9,
 "max_tokens": 512,
 "logprobs": False
} 
response = predictor.predict(payload) 
response_message = response['choices'][0]['message']['content']
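
When you finish experimenting, delete the endpoint and model to avoid ongoing charges. A minimal cleanup sketch using the same predictor object:

# Remove the SageMaker resources created by model.deploy()
predictor.delete_model()
predictor.delete_endpoint()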

Recommended instances and benchmark

The following table lists all the Llama 4 models available in SageMaker JumpStart along with the model_id, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

Model name Model ID Default instance type Supported instance types
Llama-4-Scout-17B-16E-Instruct meta-vlm-llama-4-scout-17b-16e-instruct ml.p5.48xlarge ml.g6e.48xlarge, ml.p5.48xlarge, ml.p5en.48xlarge
Llama-4-Maverick-17B-128E-Instruct meta-vlm-llama-4-maverick-17b-128e-instruct ml.p5.48xlarge ml.p5.48xlarge, ml.p5en.48xlarge
Llama 4-Maverick-17B-128E-Instruct-FP8 meta-vlm-llama-4-maverick-17b-128-instruct-fp8 ml.p5.48xlarge ml.p5.48xlarge, ml.p5en.48xlarge

Inference and example prompts for Llama 4 Scout 17B 16 Experts model

You can use the Llama 4 Scout model for text and image or vision reasoning use cases. With that model, you can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more.

In the following sections, we show example payloads, invocations, and responses for Llama 4 Scout that you can use against your Llama 4 model deployments using SageMaker JumpStart.

Text-only input

Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."
        },
        {
            "role": "user",
            "content": "What are three key benefits of large language models for businesses?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Response:

Large language models (LLMs) offer numerous benefits to businesses, transforming how they operate, 
interact with customers, and generate insights. Here are three key benefits:

1. **Enhanced Customer Experience through Automation and Personalization**:
   - **Automated Customer Support**: LLMs can power chatbots and virtual assistants 
	that provide 24/7 customer support. These models can understand and respond to a wide range of customer queries, help with troubleshooting, and even process transactions, significantly reducing the workload on human customer support agents.
   - **Personalization**: By analyzing customer data, LLMs can help businesses personalize their 
	communications and offerings. For instance, they can generate personalized emails, product recommendations, and content, enhancing customer engagement and satisfaction.

2. **Efficiency and Productivity Gains**:
   - **Content Generation**: LLMs can automate the creation of various types of content, such as 
   blog posts, reports, product descriptions, and social media updates. This not only speeds up content production but also allows human writers to focus on more creative and strategic tasks.
   - **Data Analysis and Summarization**: These models can quickly analyze large volumes of data, extract relevant information, and summarize findings in a readable format. This capability can significantly reduce the time and effort required for market research, competitive analysis, and internal reporting.

3. **Improved Decision Making with Data-Driven Insights**:
   - **Market and Trend Analysis**: LLMs can process and analyze vast amounts of data from various sources, including news articles, social media, and market reports. This helps businesses stay informed about market trends, consumer sentiment, and competitor activity, enabling more informed strategic decisions.
   - **Risk Management and Compliance**: By analyzing regulatory documents and monitoring communications, LLMs can help businesses identify and mitigate compliance risks. They can also assist in the creation of compliance reports and documentation, reducing the risk of human error.

In summary, large language models offer businesses the potential to automate and enhance customer interactions, 
improve operational efficiency, and gain deeper insights from their data. These capabilities can lead to cost savings, 
increased revenue, and a competitive edge in the market.

Single-image input

In this section, let’s test Llama 4’s multimodal capabilities. By merging text and vision tokens into a unified processing backbone, Llama 4 can seamlessly understand and respond to queries about an image. The following is an example of how you can prompt Llama 4 to answer questions about an image such as the one in the example:

Image:

Input:

import requests
import base64
import boto3
import json

s3 = boto3.client("s3")

s3_bucket = f"jumpstart-cache-prod-{region}"
key_prefix = "inference-notebook-assets"

def download_from_s3(key_filenames):
    for key_filename in key_filenames:
        s3.download_file(s3_bucket, f"{key_prefix}/{key_filename}", key_filename)

# Define image names
heat_map = "heatmap_semantic_similarity_search.png"

# Download and display the heatmap image
download_from_s3(key_filenames=[heat_map])

def img_to_base64(image_path):
    with open(image_path, "rb") as f:
        img = f.read()
    enc_img = base64.b64encode(img).decode('utf-8')
    return enc_img

b64_img = img_to_base64(heat_map)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

payload = json.dumps(data)

Response:

The image presents a heatmap illustrating the semantic textual similarity between 
sentences. The heatmap is divided into two axes, with the x-axis listing six sentences 
and the y-axis listing the same six sentences in a different order.

**Sentences:**

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

**Heatmap:**

The heatmap displays the semantic textual similarity between each pair of sentences, 
with darker shades of red indicating higher similarity and lighter shades of orange 
indicating lower similarity. The color bar on the right side of the heatmap provides 
a scale for interpreting the similarity values, ranging from -0.4 to 1.0.

**Observations:**

* The three sentences related to dogs ("Your dog is so cute.", "How cute your dog is!", 
and "You have such a cute dog!") show high similarity with each other, as indicated by 
the dark red squares in the corresponding section of the heatmap.
* The two sentences related to New York City ("New York City is the place where I work." 
and "I work in New York City.") also show high similarity with each other.
* The two sentences related to favorite colors ("What color do you like the most?" and 
"What is your favorite color?") show high similarity with each other.
* The sentences from different categories (dogs, New York City, and favorite colors) show 
lower similarity with each other, as indicated by the lighter shades of orange in the heatmap.

Overall, the heatmap effectively visualizes the semantic textual similarity between the given 
sentences, allowing for easy identification of similar and dissimilar sentence pairs.

The Llama 4 model on JumpStart can take in the image provided via a URL, underlining its powerful potential for real-time multimodal applications.

Multi-image input

Building on its advanced multimodal functionality, Llama 4 can effortlessly process multiple images at the same time. In this demonstration, the model is prompted with two image URLs and tasked with describing each image and explaining their relationship, showcasing its capacity to synthesize information across several visual inputs. Let’s test this below by passing in the URLs of the following images in the payload.

Image 1:

Image 2:

Input: 

def url_to_base64(image_url):
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        return None

    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return image_base64

image_url_2 = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"
b64_img_url_2 = url_to_base64(image_url_2)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "There are two images. Describe each one and why they are related",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img,
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img_url_2,
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

Response:

The first image is a heatmap illustrating the semantic textual similarity between 
sentences. The heatmap is titled "Semantic Textual Similarity Between Sentences" 
and features a color scale ranging from -0.4 to 1.0, with darker shades indicating 
higher similarity. The sentences being compared are:

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

The heatmap reveals that the sentences about dogs have high similarity scores, with 
values above 0.9, indicating that they convey similar meanings. Similarly, the sentences 
about work locations and favorite colors also show high similarity scores. However, there 
is a low similarity score between the sentences about dogs and those about work locations 
or favorite colors.

The second image presents five stick figures labeled A, B, C, D, and E, each representing 
a person. The figures are drawn using simple lines and geometric shapes, such as circles 
for heads, squares for bodies, and triangles for skirts. The main difference between the 
figures lies in their arm and leg positions:

* Figure A has its arms at its sides and straight legs.
* Figure B has its arms at its sides and straight legs, but with a slightly different posture.
* Figure C has its arms at its sides and straight legs, similar to Figure A.
* Figure D has one arm raised and straight legs.
* Figure E has two arms raised and bent legs.

Overall, the two images are distinct in terms of their content and purpose. The heatmap provides 
a visual representation of semantic similarities between sentences, while the stick figures appear 
to be simple illustrations of people with varying poses.

As you can see, Llama 4 excels in handling multiple images simultaneously, providing detailed and contextually relevant insights that emphasize its robust multimodal processing abilities.

Codebase analysis with Llama 4

Using Llama 4 Scout’s industry-leading context window, this section showcases its ability to deeply analyze expansive codebases. The example extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the model synthesizes information across an entire repository. We used a tool to ingest the whole repository into a single plaintext representation, which we provided to the model as context.
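The following is a minimal sketch of how that packed context might be loaded, assuming the repository has already been flattened into a single plaintext file (the file name repo_packed.txt is illustrative; use whatever your packing tool produces):

# Load a pre-packed plaintext representation of the repository to use as model context.
# The file name repo_packed.txt is an illustrative assumption.
with open("repo_packed.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"Loaded {len(text):,} characters of repository context")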

Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": """You are a helpful AI assistant that summarizes codebases to me to help me understand
            how to analyze code by synthesizing through the entire codebase before responding.
            
            Be thorough in your search as the file may be nested within a markdown code block or within a directory listing.
            """
        },
        {
            "role": "user",
            "content": f"Can you explain to me the buildspec-1-10-2.yml file and how it relates to the rest of the huggingface directory? Here is the AWS Deep Learning Container Repository for reference:nn{text}"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output: 

## Overview of the buildspec-1-10-2.yml File

The provided information appears to be a part of a large, merged representation of 
the Hugging Face repository, generated by Repomix. The `buildspec-1-10-2.yml` file 
is likely a build specification file used in a Continuous Integration/Continuous 
Deployment (CI/CD) pipeline.

## Purpose and Format

The file contains a packed representation of the entire repository's contents, designed
 to be easily consumable by AI systems for analysis, code review, or other automated processes. 
The content is organized into sections:

1.  **Summary**: Provides an overview of the file's purpose, format, and usage guidelines.
2.  **Repository Information**: Not shown in the provided excerpt, but likely contains metadata about the repository.
3.  **Directory Structure**: Lists the directory structure of the repository, including file paths.
4.  **File Entries**: Each file entry consists of a header with the file path and the full contents 
of the file in a code block.

## Relationship to the Hugging Face Directory

The `buildspec-1-10-2.yml` file is part of the Hugging Face repository, which appears to be a 
large project with multiple subdirectories and files. The directory structure section lists various 
subdirectories, including:

*   `.github/`: Contains GitHub-related configuration files, such as issue templates, workflows, 
and CODEOWNERS.
*   `.release_templates/`: Includes release image templates for various frameworks like MXNet, PyTorch, 
and TensorFlow.
*   `autogluon/`: A subdirectory with build artifacts, inference configurations, and other related files.

## buildspec-1-10-2.yml File Contents

Unfortunately, the provided excerpt does not contain the actual contents of the `buildspec-1-10-2.yml` file. 
However, based on the file name and its presence in a CI/CD context, it is likely that this file specifies 
the build process, dependencies, and environment configurations for the Hugging Face project.

## Usage Guidelines

When working with this file, keep in mind:

*   Treat this file as read-only and make changes to the original repository files instead.
*   Use the file path to distinguish between different files in the repository.
*   Handle this file with the same level of security as the original repository, as it may 
contain sensitive information.

## Additional Notes

*   Some files might be excluded based on `.gitignore` rules and Repomix's configuration.
*   Binary files are not included in this packed representation.
*   Files are sorted by Git change count, with files having more changes appearing at the bottom.

## Example Use Case

To analyze the `buildspec-1-10-2.yml` file, you would:

1.  Locate the file in the repository directory structure.
2.  Review the file contents to understand the build process, dependencies, and environment configurations.
3.  Use this information to optimize the build process, troubleshoot issues, or make changes to the repository.

Keep in mind that the actual contents of the `buildspec-1-10-2.yml` file are not provided, so this 
is a general explanation based on the file name and context. 

To get the actual content you would look for 

## File: buildspec-1-10-2.yml 

 block and read its content 

Let me know if I can assist with anything else.

Multi-document processing

Harnessing the same extensive token context window, Llama 4 Scout excels in multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2017-2024), demonstrating its capability to integrate and analyze data spanning multiple years—all without the need for additional processing tools.
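The following is a minimal sketch of how the report text referenced in the prompt (pdf_text) might be prepared, assuming the 10-K filings were downloaded locally as PDFs; the file name and the pypdf dependency are illustrative assumptions:

from pypdf import PdfReader

# Extract plain text from a locally downloaded 10-K PDF; the path is illustrative
reader = PdfReader("amazon-10k-2024.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

print(f"Extracted {len(pdf_text):,} characters from the 10-K report")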

Input: 

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a financial analyst. Extract key financial data from this 10-K report."
        },
        {
            "role": "user",
            "content": f"""Extract the following information from this Amazon 10-K report:
            1. What year is this report for?
            2. Total revenue for the year
            3. Net income
            4. Key business segments and their performance
            5. Important trends mentioned
            
            Here's the 10-K text:
            {pdf_text[:25000]}
            
            Format your response as a JSON object with these fields: year, revenue, net_income, segments, trends.
            """
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output:

To analyze Amazon's growth from 2017 to 2024, we'll examine the revenue and net income 
data available from the 10-K reports.

### Revenue Growth

| Year | Revenue (in billion USD) |
| --- | --- |
| 2014 | 89 |
| 2015 | 107 |
| 2016 | 135.99 |
| 2017 | 177.87 |
| 2018 | 232.88 |
| 2019 | 280.52 |
| 2020 | 386.06 |
| 2021 | 478 |
| 2022 | 513 |
| 2023 | 513.98 |
| 2024 | 574.78 (projected) |

From 2017 to 2024, Amazon's revenue has grown from $177.87 billion to a projected $574.78 billion, 
representing a significant increase of approximately 223% over the 7-year period.

### Net Income Growth

| Year | Net Income (in billion USD) |
| --- | --- |
| 2014 | 0.236 |
| 2015 | 0.596 |
| 2016 | 2.37 |
| 2017 | 3.03 |
| 2018 | 10.07 |
| 2019 | 11.59 |
| 2020 | 18.68 |
| 2021 | 33.4 |
| 2022 | 18.7 (lower due to economic conditions) |
| 2023 | 33.36 |
| 2024 | 37.68 (projected) |

Amazon's net income has grown from $3.03 billion in 2017 to a projected $37.68 billion in 2024, 
representing a substantial increase of approximately 1143% over the 7-year period.

### Segment and Trend Analysis

Throughout the years, Amazon has maintained its three main business segments: North America, 
International, and Amazon Web Services (AWS). The company has continued to expand into new products, 
services, technologies, and geographic regions, driving growth and increasing competition.

The trends mentioned across the 10-K reports highlight the company's:

1. **Seasonality**: Higher sales volume during the fourth quarter.
2. **Intense competition**: Across geographies and industries, including e-commerce, cloud computing, 
and digital content.
3. **Expansion**: Into new products, services, technologies, and geographic regions, which subjects 
the company to additional business, legal, financial, and competitive risks.
4. **International operations**: Subject to risks such as local economic and political conditions, 
government regulation, and restrictions on sales or distribution.

### Conclusion

Amazon has demonstrated remarkable growth from 2017 to 2024, with revenue increasing by 
approximately 223% and net income growing by approximately 1143%. The company's continued 
expansion into new areas, its strong presence in cloud computing through AWS, and its ability 
to adapt to changing market conditions have contributed to its success. However, the company 
also faces intense competition, seasonality, and risks associated with international operations.


Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

# Delete the model and endpoint behind a deployed predictor;
# repeat for each predictor you created (the embedding and text generation endpoints)
predictor.delete_model()
predictor.delete_endpoint()

Alternatively, using the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the embedding and text generation endpoints.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 4 models today.

For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.


About the authors

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. As a member of the Third-party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a global lead for the Meta–AWS Partnership and technical strategy. Based in Seattle, Washington, Marco enjoys writing, reading, exercising, and building applications in his free time.

Chakravarthy Nagarajan is a Principal Solutions Architect specializing in machine learning, big data, and high performance computing. In his current role, he helps customers solve real-world, complex business problems using machine learning and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to take advantage of state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master’s degree in Computer Science.

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based in San Francisco, Baladithya enjoys tinkering, developing applications, and working on his home lab in his free time.

John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 and Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.


Multi-tenancy in RAG applications in a single Amazon Bedrock knowledge base with metadata filtering

Multi-tenancy in RAG applications in a single Amazon Bedrock knowledge base with metadata filtering

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and AWS. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval Augmented Generation (RAG) workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources.

Organizations need to control access to their data across different business units, including companies, departments, or even individuals, while maintaining scalability. When organizations try to separate data sources manually, they often create unnecessary complexity and hit service limitations. This post demonstrates how Amazon Bedrock Knowledge Bases can help you scale your data management effectively while maintaining proper access controls on different management levels.

One such strategy is using Amazon Simple Storage Service (Amazon S3) folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segmentation within a single knowledge base. We also dive into integrating common vector database solutions available for Amazon Bedrock Knowledge Bases and how these integrations enable advanced metadata filtering and querying capabilities.

Organizing S3 folder structures for scalable knowledge bases

Organizations working with multiple customers need a secure and scalable way to keep each customer’s data separate while maintaining efficient access controls. Without proper data segregation, companies risk exposing sensitive information between customers or creating complex, hard-to-maintain systems. For this post, we focus on maintaining access controls across multiple business units within the same management level.

A key strategy involves using S3 folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segregation within a single knowledge base. Instead of creating separate knowledge bases for each customer, you can use a consolidated knowledge base with a well-structured S3 folder hierarchy. For example, imagine a consulting firm that manages documentation for multiple healthcare providers—each customer’s sensitive patient records and operational documents must remain strictly separated. The Amazon S3 structure might look as follows:

s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/

    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/

        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/policies/

        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/procedures/

    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/

        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/policies/

        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/procedures/

This structure makes sure that Customer A’s healthcare documentation remains completely separate from Customer B’s data. When combined with Amazon Bedrock Knowledge Bases metadata filtering, you can verify that users associated with Customer A can only access their organization’s documents, and Customer B’s users can only see their own data—maintaining strict data boundaries while using a single, efficient knowledge base infrastructure.

The Amazon Bedrock Knowledge Bases metadata filtering capability enhances this segregation by allowing you to tag documents with customer-specific identifiers and other relevant attributes. These metadata filters provide an additional layer of security and organization, making sure that queries only return results from the appropriate customer’s dataset.

Solution overview

The following diagram provides a high-level overview of AWS services and features through a sample use case. Although the example uses Customer A and Customer B for illustration, these can represent distinct business units (such as departments, companies, or teams) with different compliance requirements, rather than only individual customers.

The workflow consists of the following steps:

  1. Customer data is uploaded along with metadata indicating data ownership and other properties to specific folders in an S3 bucket.
  2. The S3 bucket, containing customer data and metadata, is configured as a knowledge base data source. Amazon Bedrock Knowledge Bases ingests the data, along with the metadata, from the source repository and a knowledge base sync is performed.
  3. A customer initiates a query using a frontend application with metadata filters against the Amazon Bedrock knowledge base. An access control metadata filter must be in place to make sure that the customer only accesses data they own; the customer can apply additional filters to further refine query results. This combined query and filter is passed to the RetrieveAndGenerate API.
  4. The RetrieveAndGenerate API handles the core RAG workflow. It consists of several sub-steps:
    • The user query is converted into a vector representation (embedding).
    • Using the query embedding and the metadata filter, relevant documents are retrieved from the knowledge base.
    • The original query is augmented with the retrieved documents, providing context for the large language model (LLM).
    • The LLM generates a response based on the augmented query and retrieved context.
  5. Finally, the generated response is sent back to the user.

When implementing Amazon Bedrock Knowledge Bases in scenarios involving sensitive information or requiring access controls, developers must implement proper metadata filtering in their application code. Failure to enforce appropriate metadata-based filtering could result in unauthorized access to sensitive documents within the knowledge base. Metadata filtering serves as a critical security boundary and should be consistently applied across all queries. For comprehensive guidance on implementing secure metadata filtering practices, refer to the Amazon Bedrock Knowledge Base Security documentation.
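As an illustration of treating the metadata filter as a security boundary, the following minimal sketch derives the mandatory access-control filter from the authenticated tenant and combines it with optional user-supplied refinements; the helper names are illustrative assumptions:

from typing import Optional


def build_access_filter(tenant_id: str) -> dict:
    # The access-control filter is derived from the authenticated session,
    # never from free-form user input
    return {"equals": {"key": "customer", "value": tenant_id}}


def combine_with_user_filter(tenant_id: str, user_filter: Optional[dict] = None) -> dict:
    # Optional user-supplied refinements are ANDed with the access-control filter,
    # so they can narrow results but never widen access
    access_filter = build_access_filter(tenant_id)
    if user_filter is None:
        return access_filter
    return {"andAll": [access_filter, user_filter]}


# Example: restrict Customer A's query to documents published in 2022
retrieval_filter = combine_with_user_filter(
    "CustomerA", {"equals": {"key": "publicationYear", "value": 2022}}
)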

Implement metadata filtering

For this use case, two specific example customers, Customer A and Customer B, are aligned to different proprietary compliance documents. The number of customers and folders can scale to N depending on the size of the customer base. We will use the following public documents, which will reside in the respective customer’s S3 folder. Customer A requires the Architecting for HIPAA Security and Compliance on AWS document. Customer B requires access to the Using AWS in the Context of NHS Cloud Security Guidance document.

  1. Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data:

{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}

The following is the JSON metadata for Customer B’s data:

{ "metadataAttributes": { "customer": "CustomerB", "documentType": "NHS Compliance Guidance", "focus": "UK Healthcare Compliance", "publicationYear": 2023, "region": "Europe" }}

  2. Save these files separately with the naming convention <filename>.pdf.metadata.json and store them in the same S3 folder or prefix that stores the source document. For Customer A, name the metadata file architecting-hipaa-compliance-on-aws.pdf.metadata.json and upload it to the folder corresponding to Customer A’s documents. Repeat these steps for Customer B.
  3. Create an Amazon Bedrock knowledge base. For instructions, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases.
  4. After you create your knowledge base, you can sync the data source. For more details, see Sync your data with your Amazon Bedrock knowledge base. A sketch of the upload and sync steps follows this list.
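The following is a minimal sketch of the upload and sync steps above, using the example bucket from this post and placeholder knowledge base and data source IDs (replace these with your own values):

import boto3

s3 = boto3.client("s3")
bedrock_agent = boto3.client("bedrock-agent")

bucket = "amzn-s3-demo-my-knowledge-base-bucket"  # example bucket name from this post
prefix = "customer-data/customerA/policies/"

# Upload Customer A's document and its metadata sidecar file to the same prefix
s3.upload_file(
    "architecting-hipaa-compliance-on-aws.pdf",
    bucket,
    prefix + "architecting-hipaa-compliance-on-aws.pdf",
)
s3.upload_file(
    "architecting-hipaa-compliance-on-aws.pdf.metadata.json",
    bucket,
    prefix + "architecting-hipaa-compliance-on-aws.pdf.metadata.json",
)

# Start an ingestion job (sync) so the knowledge base picks up the new documents and metadata;
# the knowledge base and data source IDs are placeholders
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="EXAMPLEKBID",
    dataSourceId="EXAMPLEDSID",
)
print(response["ingestionJob"]["status"])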

Test metadata filtering

After you sync the data source, you can test the metadata filtering.

The following is an example of setting the customer = CustomerA metadata filter to show that Customer A only has access to the HIPAA compliance document and not the NHS compliance guidance that belongs to Customer B.

To use the metadata filtering options on the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then expand Filters.
  5. Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
  6. When finished, enter your query in the message box, then choose Run.

We enter two queries, “summarize NHS Compliance Guidance” and “summarize HIPAA Compliance Guide.” The following figure shows the two queries: one attempting to query data related to NHS compliance guidance, which fails because it is outside of the Customer A segment, and another successfully querying data on HIPAA compliance, which has been tagged for Customer A.

Implement field-specific chunking

Amazon Bedrock Knowledge Bases supports several document types for Amazon S3 metadata filtering. The supported file formats include:

  • Plain text (.txt)
  • Markdown (.md)
  • HTML (.html)
  • Microsoft Word documents (.doc and .docx)
  • CSV files (.csv)
  • Microsoft Excel spreadsheets (.xls and .xlsx)

When working with CSV data, customers often want to chunk on a specific field in their CSV documents to gain granular control over data retrieval and enhance the efficiency and accuracy of queries. By creating logical divisions based on fields, users can quickly access relevant subsets of data without needing to process the entire dataset.

Additionally, field-specific chunking aids in organizing and maintaining large datasets, facilitating updating or modifying specific portions without affecting the whole. This granularity supports better version control and data lineage tracking, which are crucial for data integrity and compliance. Focusing on relevant chunks can improve the performance of LLMs, ultimately leading to more accurate insights and better decision-making processes within organizations. For more information, see Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

To demonstrate field-specific chunking, we use two sample datasets with the following schemas:

  • Schema 1 – Customer A uses the following synthetic dataset for recording medical case reports (case_reports.csv):

| CaseID | DoctorID | PatientID | Diagnosis | TreatmentPlan | Content |
| --- | --- | --- | --- | --- | --- |
| C001 | D001 | P001 | Hypertension | Lifestyle changes, Medication (Lisinopril) | “Patient diagnosed with hypertension, advised lifestyle changes, and started on Lisinopril.” |
| C002 | D002 | P002 | Diabetes Type 2 | Medication (Metformin), Diet adjustment | “Diabetes Type 2 confirmed, prescribed Metformin, and discussed a low-carb diet plan.” |
| C003 | D003 | P003 | Asthma | Inhaler (Albuterol) | “Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management.” |
| C004 | D004 | P004 | Coronary Artery Disease | Medication (Atorvastatin), Surgery Consultation | “Coronary artery disease diagnosed, started on Atorvastatin, surgery consultation recommended.” |

  • Schema 2 – Customer B uses the following dataset for recording genetic testing results (genetic_testings.csv):

| SampleID | PatientID | TestType | Result |
| --- | --- | --- | --- |
| S001 | P001 | Genome Sequencing | Positive |
| S002 | P002 | Exome Sequencing | Negative |
| S003 | P003 | Targeted Gene Panel | Positive |
| S004 | P004 | Whole Genome Sequencing | Negative |

Complete the following steps:

  1. Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data (note that recordBasedStructureMetadata supports exactly one content field):

{
    "metadataAttributes": {
        "customer": "CustomerA"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "Content"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "CaseID"
                    },
                    {
                        "fieldName": "DoctorID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Diagnosis"
                    },
                    {
                        "fieldName": "TreatmentPlan"
                    }
                ]
            }
        }
    }
}

The following is the JSON metadata for Customer B’s data:

{
    "metadataAttributes": {
        "customer": "CustomerB"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "TestType"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "SampleID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Result"
                    }
                ]
            }
        }
    }
}
  2. Save your files with the naming convention <filename>.csv.metadata.json and store the new JSON file in the same S3 prefix of the bucket where you stored the dataset. For Customer A, name the metadata file case_reports.csv.metadata.json and upload the file to the same folder corresponding to Customer A’s datasets.

Repeat the process for Customer B. You have now created metadata from the source CSV itself, as well as an additional metadata field customer that doesn’t exist in the original dataset. The following image highlights the metadata.

  3. Create an Amazon Bedrock knowledge base.
  4. Sync your data with your Amazon Bedrock knowledge base.

Test field-specific chunking

The following is an example of setting the customer = CustomerA metadata filter demonstrating that Customer A only has access to the medical case reports dataset and not the genetic testing dataset that relates to Customer B. We enter a query requesting information about a patient with PatientID as P003.

To test, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then expand Filters.
  5. Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
  6. When finished, enter your query in the message box, then choose Run.

The knowledge base returns, “Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management,” which is the Content field entry from Customer A’s medical case reports dataset for that PatientID. Although there is a record with the same PatientID in Customer B’s genetic testing dataset, Customer A has access only to the medical case reports data due to the metadata filtering.
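You can run the same check programmatically with the Retrieve API, as shown in the following minimal sketch (the knowledge base ID is a placeholder):

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="EXAMPLEKBID",  # placeholder; use your knowledge base ID
    retrievalQuery={"text": "What was patient P003 diagnosed with?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "customer", "value": "CustomerA"}},
        }
    },
)

# Only chunks tagged customer=CustomerA are returned, even though Customer B's
# genetic testing dataset also contains PatientID P003
for result in response["retrievalResults"]:
    print(result["content"]["text"])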

Apply metadata filtering for the Amazon Bedrock API

You can call the Amazon Bedrock API RetrieveAndGenerate to query a knowledge base and generate responses based on the retrieved results using the specified FM or inference profile. The response only cites sources that are relevant to the query.

The following Python Boto3 example API call applies the metadata filtering for retrieving Customer B data and generates responses based on the retrieved results using the specified FM (Anthropic’s Claude 3 Sonnet) in RetrieveAndGenerate:

import boto3

# RetrieveAndGenerate is served by the Bedrock Agent Runtime client;
# region holds your AWS Region string
bedrock_client = boto3.client("bedrock-agent-runtime", region_name=region)

response = bedrock_client.retrieve_and_generate(
    input={
        "text": "Summarize NHS compliance guidance."
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "example_knowledge_base_id",
            "modelArn": "arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0".format(region),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "filter": {
                        "equals": {
                            "key": "customer",
                            "value": "CustomerB"
                        }
                    }
                }
            }
        }
    }
)

The following GitHub repository provides a notebook that you can follow to deploy an Amazon Bedrock knowledge base with access control implemented using metadata filtering in your own AWS account.

Integrate existing vector databases with Amazon Bedrock Knowledge Bases and validate metadata

There are multiple ways to create vector databases from AWS services and partner offerings to build scalable solutions. If a vector database doesn’t exist, you can use Amazon Bedrock Knowledge Bases to create one using Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL Serverless, or Amazon Neptune Analytics to store embeddings, or you can specify an existing supported vector database such as Redis Enterprise Cloud, Amazon Aurora PostgreSQL with the pgvector extension, MongoDB Atlas, or Pinecone. After you create your knowledge base and either ingest or sync your data, the metadata attached to the data is ingested and automatically populated in the vector database.

In this section, we review how to incorporate and validate metadata filtering with existing vector databases using OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone. To learn how to set up each individual vector database, follow the instructions in Prerequisites for your own vector store for a knowledge base.

OpenSearch Serverless as a knowledge base vector store

With OpenSearch Serverless vector database capabilities, you can implement semantic search, RAG with LLMs, and recommendation engines. To address data segregation between business segments within each Amazon Bedrock knowledge base with an OpenSearch Serverless vector database, use metadata filtering. Metadata filtering allows you to segment data inside of an OpenSearch Serverless vector database. This can be useful when you want to add descriptive data to your documents for more control and granularity in searches.

Each OpenSearch Serverless collection has an OpenSearch Dashboards URL that can be used to add documents and query your database; the structure of the URL is domain-endpoint/_dashboards.

After creating a vector database index, you can use metadata filtering to selectively retrieve items by using JSON query options in the request body. For example, to return records owned by Customer A, you can use the following request:

GET <index_name>/_search
{
  "query": {
    "match": {
      "customer": "CustomerA"
    }
  }
}

This query will return a JSON response containing the indexed documents labeled as belonging to Customer A.
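To run the same validation programmatically against an OpenSearch Serverless collection, a minimal sketch using the opensearch-py client might look like the following (the collection endpoint, index name, and Region are placeholders):

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"  # placeholder Region
host = "your-collection-id.us-east-1.aoss.amazonaws.com"  # placeholder collection endpoint

# Sign requests with SigV4 for the OpenSearch Serverless (aoss) service
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Return only documents whose metadata marks them as belonging to Customer A
response = client.search(
    index="your-index-name",  # placeholder index name
    body={"query": {"match": {"customer": "CustomerA"}}},
)
print(response["hits"]["hits"])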

Aurora PostgreSQL with the pgvector extension as a knowledge base vector store

Pgvector is an extension of PostgreSQL that allows you to extend your relational database into a high-dimensional vector database. It stores each document’s vector in a separate row of a database table. For details on creating an Aurora PostgreSQL table to be used as the vector store for a knowledge base, see Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock.

When storing a vector index for your knowledge base in an Aurora database cluster, make sure that the table for your index contains a column for each metadata property in your metadata files before starting data ingestion.

Continuing with the Customer A example, the customer requires the Architecting for HIPAA Security and Compliance on AWS document.

The following is the JSON metadata for Customer A’s data:

{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}

The schema of the PostgreSQL table you create must contain four essential columns for ID, text content, vector values, and service managed metadata; it must also include additional metadata columns (customer, documentType, focus, publicationYear, region) for each metadata property in the corresponding metadata file. This allows pgvector to perform efficient vector searches and similarity comparisons by running queries directly on the database table. The following table summarizes the columns.

| Column Name | Data Type | Description |
| --- | --- | --- |
| id | UUID primary key | Contains unique identifiers for each record |
| chunks | Text | Contains the chunks of raw text from your data sources |
| embedding | Vector | Contains the vector embeddings of the data sources |
| metadata | JSON | Contains Amazon Bedrock managed metadata required to carry out source attribution and to enable data ingestion and querying |
| customer | Text | Contains the customer ID |
| documentType | Text | Contains the type of document |
| focus | Text | Contains the document focus |
| publicationYear | Int | Contains the year the document was published |
| region | Text | Contains the document’s related AWS Region |

During Amazon Bedrock knowledge base data ingestion, these columns will be populated with the corresponding attribute values. Chunking can break down a single document into multiple separate records (each associated with a different ID).

This PostgreSQL table structure allows for efficient storage and retrieval of document vectors, using PostgreSQL’s robustness and pgvector’s specialized vector handling capabilities for applications like recommendation systems, search engines, or other systems requiring similarity searches in high-dimensional space.

Using this approach, you can implement access control at the table level by creating database tables for each segment. Additional metadata columns can also be included in the table for properties such as the specific document owner (user_id), tags, and so on. These columns further enable fine-grained (row-level) access control and result filtering: if you restrict each user to querying only the rows that contain their user ID (document owner), results are limited to the documents they own.

After creating a vector database table, you can use metadata filtering to selectively retrieve items by using a PostgreSQL query. For example, to return table records owned by Customer A, you can use the following query:

SELECT *
FROM bedrock_integration.bedrock_kb
WHERE customer = 'CustomerA';

This query will return a response containing the database records labeled as belonging to Customer A.
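From application code, you can combine this metadata filter with a pgvector similarity search. The following minimal sketch uses psycopg2 with placeholder connection details and an illustrative query embedding:

import psycopg2

# Placeholder connection details for the Aurora PostgreSQL cluster
conn = psycopg2.connect(
    host="your-aurora-cluster-endpoint",
    dbname="postgres",
    user="bedrock_user",
    password="your-password",
)

# Illustrative query embedding, formatted as a pgvector literal
query_embedding = "[0.12, 0.05, 0.91]"

with conn.cursor() as cur:
    # Filter rows to Customer A, then rank them by cosine distance to the query embedding
    cur.execute(
        """
        SELECT chunks
        FROM bedrock_integration.bedrock_kb
        WHERE customer = %s
        ORDER BY embedding <=> %s::vector
        LIMIT 5;
        """,
        ("CustomerA", query_embedding),
    )
    for (chunk,) in cur.fetchall():
        print(chunk)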

Pinecone as a knowledge base vector store

Pinecone, a fully managed vector database, enables semantic search, high-performance search, and similarity matching. Pinecone databases can be integrated into your AWS environment in the form of Amazon Bedrock knowledge bases, but are first created through the Pinecone console. For detailed documentation about setting up a vector store in Pinecone, see Pinecone as a Knowledge Base for Amazon Bedrock. Then, you can integrate the databases using the Amazon Bedrock console. For more information about Pinecone integration with Amazon Bedrock, see Bring reliable GenAI applications to market with Amazon Bedrock and Pinecone.

You can segment a Pinecone database by adding descriptive metadata to each index and using that metadata to inform query results. Pinecone supports strings and lists of strings to filter vector searches on customer names, customer industry, and so on. Pinecone also supports numbers and booleans.

Use Pinecone’s metadata query language ($eq, $ne, $in, $nin, $and, and $or) to filter query output. The following example query in Python demonstrates how you can retrieve a list of records associated with Customer A from the Pinecone database.

from pinecone import Pinecone

pc = Pinecone(api_key="xxxxxxxxxxx")

index = pc.Index(<index_name>)

# Query the index with a metadata filter so only Customer A's records are returned
index.query(
    namespace="",
    vector=[0.17, 0.96, …, 0.44],  # query embedding
    filter={
        "customer": {"$eq": "CustomerA"}
    },
    top_k=10,
    include_metadata=True  # Include metadata in the response.
)

This query will return a response containing the database records labeled as belonging to Customer A.

Enhanced scaling with multiple data sources

Amazon Bedrock Knowledge Bases now supports multiple data sources across AWS accounts. Amazon Bedrock Knowledge Bases can ingest data from up to five data sources, enhancing the comprehensiveness and relevancy of a knowledge base. This feature allows customers with complex IT systems to incorporate data into generative AI applications without restructuring or migrating data sources. It also provides flexibility for you to scale your Amazon Bedrock knowledge bases when data resides in different AWS accounts.

The feature includes cross-account data access, enabling the configuration of S3 buckets as data sources across different accounts, and efficient data management options for retaining or deleting data when a source is removed. These enhancements alleviate the need to create multiple knowledge bases or redundant data copies.

Clean up

After completing the steps in this blog post, make sure to clean up your resources to avoid incurring unnecessary charges. Delete the Amazon Bedrock knowledge base by navigating to the Amazon Bedrock console, selecting your knowledge base, and choosing Delete from the Actions dropdown menu. If you created vector databases for testing, remember to delete OpenSearch Serverless collections, stop or delete Aurora PostgreSQL instances, and remove any Pinecone indexes you created. Additionally, consider deleting test documents uploaded to S3 buckets specifically for this blog example to avoid storage charges. Review and clean up any IAM roles or policies created for this demonstration if they’re no longer needed.

While Amazon Bedrock Knowledge Bases include charges for data indexing and queries, the underlying storage in S3 and vector databases will continue to incur charges until those resources are removed. For specific pricing details, refer to the Amazon Bedrock pricing page.

Conclusion

In this post, we covered several key strategies for building scalable, secure, and segmented Amazon Bedrock knowledge bases. These include using S3 folder structures and metadata to organize data sources and enable data segmentation within a single knowledge base. Using metadata filtering to create custom queries that target specific data segments helps improve retrieval accuracy and maintain data privacy. We also explored integrating and validating metadata for vector databases, including OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone.

By consolidating data for multiple business segments or customers within a single Amazon Bedrock knowledge base, organizations can optimize costs compared to creating and managing separate knowledge bases. The improved data segmentation and access control measures help make sure each team or customer can only access the information relevant to their domain. The enhanced scalability helps meet the diverse needs of organizations, while maintaining the necessary data segregation and access control.

Try out metadata filtering with Amazon Bedrock Knowledge Bases, and share your thoughts and questions with the authors or in the comments.


About the Authors

Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing 1P and 3P model adoption. Breanne is also on the Women at Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign.

Justin Lin is a Small & Medium Business Solutions Architect at Amazon Web Services. He studied computer science at UW Seattle. Dedicated to designing and developing innovative solutions that empower customers, Justin has been dedicating his time to experimenting with applications in generative AI, natural language processing, and forecasting.

Chloe Gorgen is an Enterprise Solutions Architect at Amazon Web Services, advising AWS customers in various topics including security, analytics, data management, and automation. Chloe is passionate about youth engagement in technology, and supports several AWS initiatives to foster youth interest in cloud-based technology. Chloe holds a Bachelor of Science in Statistics and Analytics from the University of North Carolina at Chapel Hill.
