Amazon AWS – Page 30

Accelerate your Amazon Q implementation: starter kits for SMBs

February 7, 2025

by Nneoma Okoroafor Amazon AWS

Whether you’re a small or medium-sized business (SMB) or a managed service provider at the beginning of your cloud journey, you might be wondering how to get started. Questions like “Am I following best practices?”, “Am I optimizing my cloud costs?”, and “How difficult is the learning curve?” are quite common. AWS is here to provide a concept called starter kits.

Starter kits are complete, deployable solutions that address common, repeatable business problems. They deploy the services that make up a solution according to best practices, helping you optimize costs and become familiar with these kinds of architectural patterns without a large investment in training. Most of all, starter kits save you time—time that can be better spent on your business or with your customers.

In this post, we showcase a starter kit for Amazon Q Business. If you have a repository of documents that you need to turn into a knowledge base quickly, or simply want to test out the capabilities of Amazon Q Business without a large investment of time at the console, then this solution is for you.

This deployment guide covers the steps to set up an Amazon Q solution that connects to Amazon Simple Storage Service (Amazon S3) and a web crawler data source, and integrates with AWS IAM Identity Center for authentication. An AWS CloudFormation template automates the deployment of this solution.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.

Solution overview

The following diagram illustrates the solution architecture.

The workflow involves the following steps:

The user authenticates using an AWS Identity and Access Management (IAM) identity user name and password before accessing the Amazon Q web application.
Upon successful authentication, the user can access the Amazon Q web UI and ask a question.
Amazon Q retrieves relevant information from its index, which is populated using data from the connected data sources (Amazon S3 and a web crawler).
Amazon Q then generates a response using its internal large language model (LLM) and presents it to the user through the Amazon Q web UI.
The user can provide feedback on the response through the Amazon Q web UI.

Prerequisites

Before deploying the solution, make sure you have the following in place:

AWS account – You will need an active AWS account with the necessary permissions to deploy CloudFormation stacks and create the required resources.
Amazon S3 bucket – Make sure you have an existing S3 bucket that will be used as the data source for Amazon Q. To create a S3 bucket, refer to Create your first S3 bucket.
AWS IAM Identity Center – Configure AWS IAM Identity Center in your AWS environment. You will need to provide the necessary details, such as the IAM Identity Center instance Amazon Resource Name (ARN), during the deployment process.

Deploy the solution using AWS CloudFormation

Complete the following steps to deploy the CloudFormation template:

Sign in to the AWS Management Console.
Choose one of the following Launch Stack options for your desired AWS Region to open the AWS CloudFormation console and create a new stack. Please note that this stack will default to us-east-1.
For Stack name, enter a name for your application (for example, AMAZON-Q-STARTER-KIT).
In the Parameters section, for IAMIdentityCenterARN, enter the ARN of your IAM Identity Center instance.
For QBusinessApplicationName, enter a name for the Amazon Q Business application.
For S3DataSourceBucket, enter the name of the S3 bucket you created earlier.
For WebCrawlerDataSourceUrl, enter the URL of the web crawler data source.
Choose Next.

On the Configure stack options page, leave everything as default, select I acknowledge that AWS CloudFormation might create IAM resources and and choose Next.

On the Review and create page, choose Submit.
On the Amazon Q Business console, you will see the new application you created.
Choose the new Amazon Q Business application, and in the Data sources section, select the data source s3_datasource and choose Sync now.
Select the data source webpage-datasource and choose Sync now.
To add groups and users to your Amazon Q application, refer to instructions.

Test the solution

To validate the Amazon Q solution is functioning as expected, perform the following tests:

Test data ingestion:
1. Upload a test file to the S3 bucket.
2. Verify that the file is successfully ingested and processed by Amazon Q.
3. Check the Amazon Q web experience UI for the processed data.
Test web crawler functionality:
Verify that the web crawler is able to retrieve and ingest the data from the website.
Make sure the data is displayed correctly in the Amazon Q web experience UI.

Clean up

To clean up, delete the CloudFormation stack and the S3 bucket you created.

Conclusion

The Amazon Q starter kit provides a streamlined solution for SMBs to use the power of generative AI and intelligent question-answering. By automating the deployment and integration with key data sources, this kit eases the complexity of setting up Amazon Q, empowering businesses to quickly unlock insights and improve productivity.

If your SMB has a repository of documents that need to be transformed into a valuable knowledge base, or you simply want to explore the capabilities of Amazon Q, we encourage you to take advantage of this starter kit. Get started today and experience the transformative benefits of enterprise-grade question-answering tailored for your business needs, and let us know what you think in the comments. To explore more generative AI use cases, refer to AI Use Case Explorer.

About the Authors

Nneoma Okoroafor is a Partner Solutions Architect focused on AI/ML and generative AI. Nneoma is passionate about providing guidance to AWS Partners on using the latest technologies and techniques to deliver innovative solutions to customers.

Joshua Amah is a Partner Solutions Architect with Amazon Web Services. He primarily serves consulting partners, providing architectural guidance and recommendations for new and existing workloads. Outside of work, he enjoys playing soccer, golf, and spending time with family and friends.

Jason Brown is a Partner Solutions Architect focused on helping AWS Distribution Partners and their Seller Partners build and grow their AWS practices. Jason is passionate about building solutions for MSPs and VARs in the small business space. Outside the office, Jason is an avid traveler and enjoys offshore fishing.

Building the future of construction analytics: CONXAI’s AI inference on Amazon EKS

February 7, 2025

by Tim Krause Amazon AWS

This is a guest post co-written with Tim Krause, Lead MLOps Architect at CONXAI.

CONXAI Technology GmbH is pioneering the development of an advanced AI platform for the Architecture, Engineering, and Construction (AEC) industry. Our platform uses advanced AI to empower construction domain experts to create complex use cases efficiently.

Construction sites typically employ multiple CCTV cameras, generating vast amounts of visual data. These camera feeds can be analyzed using AI to extract valuable insights. However, to comply with GDPR regulations, all individuals captured in the footage must be anonymized by masking or blurring their identities.

In this post, we dive deep into how CONXAI hosts the state-of-the-art OneFormer segmentation model on AWS using Amazon Simple Storage Service (Amazon S3), Amazon Elastic Kubernetes Service (Amazon EKS), KServe, and NVIDIA Triton.

Our AI solution is offered in two forms:

Model as a service (MaaS) – Our AI model is accessible through an API, enabling seamless integration. Pricing is based on processing batches of 1,000 images, offering flexibility and scalability for users.
Software as a service (SaaS) – This option provides a user-friendly dashboard, acting as a central control panel. Users can add and manage new cameras, view footage, perform analytical searches, and enforce GDPR compliance with automatic person anonymization.

Our AI model, fine-tuned with a proprietary dataset of over 50,000 self-labeled images from construction sites, achieves significantly greater accuracy compared to other MaaS solutions. With the ability to recognize more than 40 specialized object classes—such as cranes, excavators, and portable toilets—our AI solution is uniquely designed and optimized for the construction industry.

Our journey to AWS

Initially, CONXAI started with a small cloud provider specializing in offering affordable GPUs. However, it lacked essential services required for machine learning (ML) applications, such as frontend and backend infrastructure, DNS, load balancers, scaling, blob storage, and managed databases. At that time, the application was deployed as a single monolithic container, which included Kafka and a database. This setup was neither scalable nor maintainable.

After migrating to AWS, we gained access to a robust ecosystem of services. Initially, we deployed the all-in-one AI container on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. Although this provided a basic solution, it wasn’t scalable, necessitating the development of a new architecture.

Our top reasons for choosing AWS were primarily driven by the team’s extensive experience with AWS. Additionally, the initial cloud credits provided by AWS were invaluable for us as a startup. We now use AWS managed services wherever possible, particularly for data-related tasks, to minimize maintenance overhead and pay only for the resources we actually use.

At the same time, we aimed to remain cloud-agnostic. To achieve this, we chose Kubernetes, enabling us to deploy our stack directly on a customer’s edge—such as on construction sites—when needed. Some customers are potentially very compliance-restrictive, not allowing data to leave the construction site. Another opportunity is federated learning, training on the customer’s edge and only transferring model weights, without sensitive data, into the cloud. In the future, this approach might lead to having one model fine-tuned for each camera to achieve the best accuracy, which requires hardware resources on-site. For the time being, we use Amazon EKS to offload the management overhead to AWS, but we could easily deploy on a standard Kubernetes cluster if needed.

Our previous model was running on TorchServe. With our new model, we first tried performing inference in Python with Flask and PyTorch, as well as with BentoML. Achieving high inference throughput with high GPU utilization for cost-efficiency was very challenging. Exporting the model to ONNX format was particularly difficult because the OneFormer model lacks strong community support. It took us some time to identify why the OneFormer model was so slow in the ONNX Runtime with NVIDIA Triton. We ultimately resolved the issue by converting ONNX to TensorRT.

Defining the final architecture, training the model, and optimizing costs took approximately 2–3 months. Currently, we improve our model by incorporating increasingly accurate labeled data, a process that takes around 3–4 weeks of training on a single GPU. Deployment is fully automated with GitLab CI/CD pipelines, Terraform, and Helm, requiring less than an hour to complete without any downtime. New model versions are typically rolled out in shadow mode for 1–2 weeks to provide stability and accuracy before full deployment.

Solution overview

The following diagram illustrates the solution architecture.

The architecture consists of the following key components:

The S3 bucket (1) is the most important data source. It is cost-effective, scalable, and provides almost unlimited blob storage. We encrypt the S3 bucket, and we delete all data with privacy concerns after processing took place. Almost all microservices read and write files from and to Amazon S3, which ultimately triggers (2) Amazon EventBridge (3). The process begins when a customer uploads an image on Amazon S3 using a presigned URL provided by our API handling user authentication and authorization through Amazon Cognito.
The S3 bucket is configured in such a way that it forwards (2) all events into EventBridge.
TriggerMesh is a Kubernetes controller where we use AWSEventBridgeSource (6). It abstracts the infrastructure automation and automatically creates an Amazon Simple Queue Service (Amazon SQS) (5) processing queue, which acts as a processing buffer. Additionally, it creates an EventBridge rule (4) to forward the S3 event from the event bus into the SQS processing queue. Finally, TriggerMesh creates a Kubernetes Pod to poll events from the processing queue to feed it into the Knative broker (7). The resources in the Kubernetes cluster are deployed in a private subnet.
The central place for Knative Eventing is the Knative broker (7). It is backed by Amazon Managed Streaming for Apache Kafka (Amazon MSK) (8).
The Knative trigger (9) polls the Knative broker based on a specific CloudEventType and forwards it accordingly to the KServe InferenceService (10).
KServe is a standard model inference platform on Kubernetes that uses Knative Serving as its foundation and is fully compatible with Knative Eventing. It also pulls models from a model repository into the container before the model server starts, eliminating the need to build a new container image for each model version.
We use KServe’s “Collocate transformer and predictor in same pod” feature to maximize inference speed and throughput because containers within the same pod can communicate over localhost and the network traffic never leaves the CPU.
After many performance tests, we achieved best performance with the NVIDIA Triton Inference Server (11) after converting our model first into ONNX and then into TensorRT.
Our transformer (12) uses Flask with Gunicorn and is optimized for the number of workers and CPU cores to maintain GPU utilization over 90%. The transformer gets a CloudEvent with the reference of the image Amazon S3 path, downloads it, and performs model inference over HTTP. After getting back the model results, it performs preprocessing and finally uploads the processed model results back to Amazon S3.
We use Karpenter as the cluster auto scaler. Karpenter is responsible for scaling the inference component to handle high user request loads. Karpenter launches new EC2 instances when the system experiences increased demand. This allows the system to automatically scale up computing resources to meet the increased workload.

All this divides our architecture mainly in AWS managed data service and the Kubernetes cluster:

The S3 bucket, EventBridge, and SQS queue as well as Amazon MSK are all fully managed services on AWS. This keeps our data management effort low.
We use Amazon EKS for everything else. TriggerMesh, AWSEventBridgeSource, Knative Broker, Knative Trigger, KServe with our Python transformer, and the Triton Inference Server are also within the same EKS cluster on a dedicated EC2 instance with a GPU. Because our EKS cluster is just used for processing, it is fully stateless.

Summary

From initially having our own highly customized model, transitioning to AWS, improving our architecture, and introducing our new Oneformer model, CONXAI is now proud to provide scalable, reliable, and secure ML inference to customers, enabling construction site improvements and accelerations. We achieved a GPU utilization of over 90%, and the number of processing errors has dropped almost to zero in recent months. One of the major design choices was the separation of the model from the preprocessing and postprocessing code in the transformer. With this technology stack, we gained the ability to scale down to zero on Kubernetes using the Knative serverless feature, while our scale-up time from a cold state is just 5–10 minutes, which can save significant infrastructure costs for potential batch inference use cases.

The next important step is to use these model results with proper analytics and data science. These results can also serve as a data source for generative AI features such as automated report generation. Furthermore, we want to label more diverse images and train the model on additional construction domain classes as part of a continuous improvement process. We also work closely with AWS specialists to bring our model in AWS Inferentia chipsets for better cost-efficiency.

To learn more about the services in this solution, refer to the following resources:

About the Authors

Tim Krause is Lead MLOps Architect at CONXAI. He takes care of all activities when AI meets infrastructure. He joined the company with previous Platform, Kubernetes, DevOps, and Big Data knowledge and was training LLMs from scratch.

Mehdi Yosofie is a Solutions Architect at AWS, working with startup customers, and leveraging his expertise to help startup customers design their workloads on AWS.

How Untold Studios empowers artists with an AI assistant built on Amazon Bedrock

February 7, 2025

by Olivier Vigneresse Amazon AWS

Untold Studios is a tech-driven, leading creative studio specializing in high-end visual effects and animation. Our commitment to innovation led us to a pivotal challenge: how to harness the power of machine learning (ML) to further enhance our competitive edge while balancing this technological advancement with strict data security requirements and the need to streamline access to our existing internal resources.

To give our artists access to technology, we need to create good user interfaces. This is a challenge, especially if the pool of end-users are diverse in terms of their needs and technological experience. We saw an opportunity to use large language models (LLMs) to create a natural language interface, which makes this challenge easier and takes care of a lot of the heavy lifting.

This post details how we used Amazon Bedrock to create an AI assistant (Untold Assistant), providing artists with a straightforward way to access our internal resources through a natural language interface integrated directly into their existing Slack workflow.

Solution overview

The Untold Assistant serves as a central hub for artists. Besides the common AI functionalities like text and image generation, it allows them to interact with internal data, tools, and workflows through natural language queries.

For the UI, we use Slack’s built-in features rather than building custom frontends. Slack already provides applications for workstations and phones, message threads for complex queries, emoji reactions for feedback, and file sharing capabilities. The implementation uses Slack’s event subscription API to process incoming messages and Slack’s Web API to send responses. Users interact with the Untold Assistant through private direct messages or by mentioning it (@-style tagging) in channels for everybody to see. Because our teams already use Slack throughout the day, this eliminates context switching and the need to adopt new software. Every new message is acknowledged by a gear emoji for immediate feedback, which eventually changes to a check mark if the query was successful or an X if an error occurred. The following screenshot shows an example.

With the use of Anthropic’s Claude 3.5 Sonnet model on Amazon Bedrock, the system processes complex requests and generates contextually relevant responses. The serverless architecture provides scalability and responsiveness, and secure storage houses the studio’s vast asset library and knowledge base. Key AWS services used include:

Amazon Bedrock – Including Anthropic’s Claude 3.5 Sonnet LLM, Stability AI’s Stable Diffusion 3 image generation, and knowledge base connectors
AWS Lambda – For workflow execution
Amazon API Gateway – For the Slack event handler
Amazon Simple Storage Service (Amazon S3) – For arbitrary unstructured data
Amazon DynamoDB – For persistent storage

The following diagram illustrates the solution architecture.

The main components for this application are the Slack integration, the Amazon Bedrock integration, the Retrieval Augmented Generation (RAG) implementation, user management, and logging.

Slack integration

We use a two-function approach to meet Slack’s 3-second acknowledgment requirement. The incoming event from Slack is sent to an endpoint in API Gateway, and Slack expects a response in less than 3 seconds, otherwise the request fails. The first Lambda function, with reserved capacity, quickly acknowledges the event and forwards the request to the second function, where it can be handled without time restrictions. The setup handles time-sensitive responses while allowing for thorough request processing. We call the second function directly from the first function without using an event with Amazon Simple Notification Service (Amazon SNS) or a queue with Amazon Simple Queue Service (Amazon SQS) in between to keep the latency as low as possible.

Amazon Bedrock integration

Our Untold Assistant uses Amazon Bedrock with Anthropic’s Claude 3.5 Sonnet model for natural language processing. We use the model’s function calling capabilities, enabling the application to trigger specific tools or actions as needed. This allows the assistant to handle both general queries and complex specialized queries or run tasks across our internal systems.

RAG implementation

Our RAG setup uses Amazon Bedrock connectors to integrate with Confluence and Salesforce, tapping into our existing knowledge bases. For other data sources without a pre-built connector available, we export content to Amazon S3 and use the Amazon S3 connector. For example, we export pre-chunked asset metadata from our asset library to Amazon S3, letting Amazon Bedrock handle embeddings, vector storage, and search. This approach significantly decreased development time and complexity, allowing us to focus on improving user experience.

User management

We map Slack user IDs to our internal user pool, currently in DynamoDB (but designed to work with Amazon Cognito). This system tailors the assistant’s capabilities to each user’s role and clearance level, making sure that it operates within the bounds of each user’s authority while maintaining functionality. The access to data sources is controlled using tools. Every tool encapsulates a data source and the LLM’s access to tools is restricted by the user and their role.

Additionally, if a user tells the assistant something that should be remembered, we store this piece of information in a database and add it to the context every time the user initiates a request. This could be, for example, “Keep all your replies as short as possible” or “If I ask for code it’s always Python.”

Logging and monitoring

We use the built-in integration with Amazon CloudWatch in Lambda to track system performance and error states. For monitoring critical errors, we’ve set up direct notifications to a dedicated Slack channel, allowing for immediate awareness and response. Every query and tool invocation is logged to DynamoDB, providing a rich dataset that we use to analyze usage patterns and optimize the system’s performance and functionality.

Function calling with Amazon Bedrock

Like Anthropic’s Claude, most modern LLMs support function calling, which allows us to extend the capabilities of LLMs beyond merely generating text. We can provide a set of function specifications with a description of what the function is going to do and the names and descriptions of the function’s parameters. Based on this information, the LLM decides if an incoming request can be solved directly or if the best next step to solve the query would be a function call. If that’s the case, the model returns the name of the function to call, as well as the parameters and values. It’s then up to us to run the function and initiate the next step. Agents use this system in a loop to call functions and process their output until a success criterion is reached. In our case, we only implement a single pass function call to keep things simple and robust. However, in certain cases, the function itself uses the LLM to process data and format it nicely for the end-user.

Function calling is a very useful feature that helps us convert unstructured user input into structured automatable instructions. We anticipate that over the next couple of months, we will add many more functions to extend the AI assistant’s capabilities and increase its usefulness. Although frameworks like LangChain offer comprehensive solutions for implementing function calling systems, we opted for a lightweight, custom approach tailored to our specific needs. This decision allowed us to maintain a smaller footprint and focus on the essential features for our use case.

The following is a code example of using the AiTool base class for extendability.

All that’s required to add a new function is creating a class like the one in our example. The class will automatically be discovered and the respective specification added to the request to the LLM if the user has access to the function. All the required information to create the function specification is extracted from the code and docstrings:

NAME – The ID of the function
PROGRESS_MESSAGE – A message that’s sent to the user through Slack for immediate feedback before the function is run
EXCLUSIVE_ACCESS_DEPARTMENTS – If set, only users of the specified departments have access to this tool

The tool in this example updates the user memory. For example, the query “Remember to always use Python as a programming language” will trigger the execution of this tool. The LLM will extract the info string from the request, for example, “code should always be Python.” If the existing user memory that is always added to the context already contains a memory about the same topic (for example, “code should always be Java”), the LLM will also provide the memory ID and the existing memory will be overwritten. Otherwise, a new memory with a new ID is created.

Key features and benefits

Slack serves as a single entry point, allowing artists to query diverse internal systems without leaving their familiar workflow. The following features are powered by function calling using Anthropic’s Claude:

Various knowledge bases for different user roles (Confluence, Salesforce)
Internal asset library (Amazon S3)
Image generation powered by Stable Diffusion
User-specific memory and preferences (for example, default programming languages, default dimensions for image generation, detail level of responses)

By eliminating the need for additional software or context switching, we’ve drastically reduced friction in accessing critical resources. The system is available around the clock for artist queries and tasks, and our framework for function calling with Anthropic’s Claude allows for future expansion of features.

The LLM’s natural language interface is a game changer for user interaction. It’s inherently more flexible and forgiving compared to traditional interfaces, capable of interpreting unstructured input, asking for missing information, and performing tasks like date formatting, unit conversion, and value extraction from natural language descriptions. The system adeptly handles ambiguous queries, extracting relevant information and intent. This means artists can focus on their creative process rather than worrying about precise phrasing or navigating complex menu structures.

Security and control are paramount in our AI adoption strategy. By keeping all data within the AWS ecosystem, we’ve eliminated dependencies on third-party AI tools and mitigated associated risks. This approach allows us to maintain tight control over data access and usage. Additionally, we’ve implemented comprehensive usage analytics, providing insights into adoption patterns and areas for improvement. This data-driven approach makes sure we’re continually refining the tool to meet evolving artist needs.

Impact and future plans

The Untold Assistant currently handles up to 120 queries per day, with about 10–20% of them calling additional tools, like image generation or knowledge base search. Especially for new users who aren’t too familiar with internal workflows and applications yet, it can save a lot of time. Instead of searching in several different Confluence spaces and Slack channels or reaching out to the technology team, they can just ask the Untold Assistant, which acts as a virtual member of the support team. This can cut down the time from minutes to only a few seconds.

Overall, the Untold Assistant, rapidly developed and deployed using AWS services, has delivered several benefits:

Enhanced discoverability and usage of previously underutilized internal resources
Significant reduction in time spent searching for information
Streamlined access to multiple internal systems with an authorization system from a central entry point
Reduced load on the support and technology team
Increased speed of adoption of new technologies by providing a framework for user interaction

Building on this success, we’re expanding functionality through additional function calls. A key planned feature is render job error analysis for artists. This tool will automatically fetch logs from recent renders, analyze potential errors using the capabilities of Anthropic’s Claude, and provide users with explanations and solutions by using both internet resources and our internal knowledge base of known errors.

Additionally, we plan to analyze the saved queries using Amazon Titan Text Embeddings and agglomerative clustering to identify semantically similar questions. When the cluster frequency exceeds our defined threshold (for example, more than 10 similar questions from different users within a week), we enhance our knowledge base or update onboarding materials to address these common queries proactively, reducing repetitive questions and improving the assistant’s efficiency.

These initial usage metrics and the planned technical improvements demonstrate the system’s positive impact on our workflows. By automating common support tasks and continuously improving our knowledge base through data-driven analysis, we reduce the technology team’s support load while maintaining high-quality assistance. The modular architecture allows us to quickly integrate new tools as needs arise, to keep up with the astonishing pace of the progress made in AI and ML.

Conclusion

The Untold Assistant demonstrates how Amazon Bedrock enables rapid development of sophisticated AI applications without compromising security or control. Using function calling and pre-built connectors in Amazon Bedrock eliminated the need for complex vector store integrations and custom embedding pipelines, reducing our development time from months to weeks. The modular architecture using Python classes for tools makes the system highly maintainable and extensible.

By automating routine technical tasks and information retrieval, we’ve freed our artists to focus on creative work that drives business value. The solution’s clean separation between the LLM interface and business logic, built entirely within the AWS ecosystem, enables quick integration of new capabilities while maintaining strict data security. The LLM’s ability to interpret unstructured input and handle ambiguous queries creates a more natural and forgiving interface compared to traditional menu-driven systems. This foundation of technical robustness and improved artist productivity positions us to rapidly adopt emerging AI capabilities while keeping our focus on creative innovation.

To explore how to streamline your company’s workflows using Amazon Bedrock, see Getting started with Amazon Bedrock. If you have questions or suggestions, please leave a comment.

About the Authors

Olivier Vigneresse is a Solutions Architect at AWS. Based in England, he primarily works with SMB Media an&d Entertainment customers. With a background in security and networking, Olivier helps customers achieve success on their cloud journey by providing architectural guidance and best practices; he is also passionate about helping them bring value with Machine Learning and Generative AI use-cases.

Daniel Goller is a Lead R&D Developer at Untold Studios with a focus on cloud infrastructure and emerging technologies. After earning his PhD in Germany, where he collaborated with industry leaders like BMW and Audi, he has spent the past decade implementing software solutions, with a particular emphasis on cloud technology in recent years. At Untold Studios, he leads infrastructure optimisation and AI/ML initiatives, leveraging his technical expertise and background in research to drive innovation in the Media & Entertainment space.

Max Barnett is an Account Manager at AWS who specialises in accelerating the cloud journey of Media & Entertainment customers. He has been helping customers at AWS for the past 4.5 years. Max has been particularly involved with customers in the visual effect space, guiding them as they explore generative AI.

Protect your DeepSeek model deployments with Amazon Bedrock Guardrails

February 7, 2025

by Satveer Khurpa Amazon AWS

The rapid advancement of generative AI has brought powerful publicly available large language models (LLMs), such as DeepSeek-R1, to the forefront of innovation. The DeepSeek-R1 models are now accessible through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart, and distilled variants are available through Amazon Bedrock Custom Model Import. According to DeepSeek AI, these models offer strong capabilities in reasoning, coding, and natural language understanding. However, their deployment in production environments—like all models—requires careful consideration of data privacy requirements, appropriate management of bias in output, and the need for robust monitoring and control mechanisms.

Organizations adopting open source, open weights models such as DeepSeek-R1 have important opportunities to address several key considerations:

Enhancing security measures to prevent potential misuse, guided by resources such as OWASP LLM Top 10 and MITRE Atlas
Making sure to protect sensitive information
Fostering responsible content generation practices
Striving for compliance with relevant industry regulations

These concerns become particularly critical in highly regulated industries such as healthcare, finance, and government services, where data privacy and content accuracy are paramount.

This blog post provides a comprehensive guide to implementing robust safety protections for DeepSeek-R1 and other open weight models using Amazon Bedrock Guardrails. We’ll explore:

How to use the security features offered by Amazon Bedrock to protect your data and applications
Practical implementation of guardrails to prevent prompt attacks and filter harmful content
Implementing a robust defense-in-depth strategy

By following this guide, you’ll learn how to use the advanced capabilities of DeepSeek models while maintaining strong security controls and promoting ethical AI practices. Whether developing customer-facing generative AI applications or internal tools, these implementation patterns will help you meet your requirements for secure and responsible AI. By following this step-by-step approach, organizations can deploy open weights LLMs such as DeepSeek-R1 in line with best practices for AI safety and security.

DeepSeek models and deployment on Amazon Bedrock

DeepSeek AI, a company specializing in open weights foundation AI models, recently launched their DeepSeek-R1 models, which according to their paper have shown outstanding reasoning abilities and performance in industry benchmarks. According to third-party evaluations, these models consistently achieve top three rankings across various metrics, including quality index, scientific reasoning and knowledge, quantitative reasoning, and coding (HumanEval).

The company has further developed their portfolio by releasing six dense models derived from DeepSeek-R1, built on Llama and Qwen architectures, which they’ve made open weight models. These models are now accessible through AWS generative AI solutions: DeepSeek-R1 is available through Amazon Bedrock Marketplace and SageMaker Jumpstart, while the Llama-based distilled versions can be implemented through Amazon Bedrock Custom Model Import.

Amazon Bedrock offers comprehensive security features to help secure hosting and operation of open source and open weights models while maintaining data privacy and regulatory compliance. Key features include data encryption at rest and in transit, fine-grained access controls, secure connectivity options, and various compliance certifications. Additionally, Amazon Bedrock provides guardrails for content filtering and sensitive information protection to support responsible AI use. AWS enhances these capabilities with extensive platform-wide security and compliance measures:

Data encryption at rest and in transit using AWS Key Management Service (AWS KMS)
Access management through AWS Identity and Access Management (IAM)
Network security through Amazon Virtual Private Cloud (Amazon VPC) deployment, VPC endpoints, and AWS Network Firewall for TLS inspection and strict policy rules
Service control policies (SCPs) for AWS account-level governance
Security groups and network access control lists (NACLs) for access restriction
Compliance certifications including HIPAA, SOC, ISO, and GDPR
FedRAMP High authorization in AWS GovCloud (US-West) for Amazon Bedrock
Monitoring and logging through Amazon CloudWatch and AWS CloudTrail

Organizations should customize these security settings based on their specific compliance and security needs when deploying to production environments. AWS conducts vulnerability scanning of all model containers as part of its security process and accepts only models in Safetensors format to help prevent unsafe code execution.

Amazon Bedrock Guardrails

Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. Amazon Bedrock Guardrails can also be integrated with other Amazon Bedrock tools including Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases to build safer and more secure generative AI applications aligned with responsible AI policies. To learn more, see the AWS Responsible AI page.

Core functionality

Amazon Bedrock Guardrails can be used in two ways. First, it can be integrated directly with the InvokeModel and Converse API call, where guardrails are applied to both input prompts and model outputs during the inference process. This method is suitable with models hosted on Amazon Bedrock through the Amazon Bedrock Marketplace and Amazon Bedrock Custom Model Import. Alternatively, the ApplyGuardrail API offers a more flexible approach, allowing for independent evaluation of content without invoking a model. This second method is useful for assessing inputs or outputs at various stages of an application, working with custom or third-party models outside of Amazon Bedrock. Both approaches enable developers to implement safeguards customized to their use cases and aligned with responsible AI policies, ensuring secure and compliant interactions in generative AI applications.

Key Amazon Bedrock Guardrails policies

Amazon Bedrock Guardrails provides the following configurable guardrail policies to help safely build generative AI applications at scale:

Content filters
- Adjustable filtering intensity for harmful content
- Predefined categories: Hate, Insults, Sexual Content, Violence, Misconduct, and Prompt Attacks
- Multi-modal content including text and images (preview)
Topic filters
- Capability to restrict specific topics
- Prevention of unauthorized topics in both queries and responses
Word filters
- Blocks specific words, phrases, and profanity
- Custom filters for offensive language or competitor references
Sensitive information filters
- Personally identifiable information (PII) blocking or masking
- Support for custom regex patterns
- Probabilistic detection for standard formats (such as SSN, DOB, and addresses)
Contextual grounding checks
- Hallucination detection through source grounding
- Query relevance validation
Automated Reasoning checks for hallucination prevention (gated preview)

Other capabilities

Model-agnostic implementation:

Compatible with all Amazon Bedrock foundation models
Supports fine-tuned models
Extends to external custom and third-party models through the ApplyGuardrail API

This comprehensive framework helps customers implement responsible AI, maintaining content safety and user privacy across diverse generative AI applications.

Solution Overview

Guardrail configuration
- Create a guardrail with specific policies tailored to your use case and configure the policies.

Integration with InvokeModel API
- Call the Amazon Bedrock InvokeModel API with the guardrail identifier in your request.
- When you make the API call, Amazon Bedrock applies the specified guardrail to both the input and output.

Guardrail evaluation process

1. Input evaluation: Before sending the prompt to the model, the guardrail evaluates the user input against the configured policies.
2. Parallel policy checking: For improved latency, the input is evaluated in parallel for each configured policy.
3. Input intervention: If the input violates any guardrail policies, a pre-configured blocked message is returned, and the model inference is discarded.
4. Model inference: If the input passes the guardrail checks, the prompt is sent to the specified model for inference.
5. Output evaluation: After the model generates a response, the guardrail evaluates the output against the configured policies.
6. Output intervention: If the model response violates any guardrail policies, it will be either blocked with a pre-configured message or have sensitive information masked, depending on the policy.
7. Response delivery: If the output passes all guardrail checks, the response is returned to the application without modifications

Prerequisites

Before setting up guardrails for models imported using the Amazon Bedrock Custom Model Import feature, make sure you meet these prerequisites:

An AWS account with access to Amazon Bedrock along with the necessary IAM role with the required permissions. For centralized access management, we recommend that you use AWS IAM Identity Center.
Make sure that a custom model is already imported using the Amazon Bedrock Custom Model Import service. For illustration, we’ll use DeepSeek-R1-Distill-Llama-8B, which can be imported using Amazon Bedrock Custom Model Import. You have two options for deploying this model:
- Follow the instructions in Deploy DeepSeek-R1 distilled Llama models to deploy DeepSeek’s distilled Llama model.
- Use the notebook available from aws-samples for deployment.

You can create the guardrail using the AWS Management Console as explained in this blog post. Alternatively, you can follow this notebook for a programmatic example of how to create the guardrail in this solution. This notebook does the following :

Install the required dependencies
Create a guardrail using the boto3 API and filters to meet the use case mentioned previously.
Configure the tokenizer for the imported model.
Test Amazon Bedrock Guardrails using prompts that show various Amazon Bedrock guardrail filters in action.

This approach integrates guardrails into both the user inputs and the model outputs. This makes sure that any potentially harmful or inappropriate content is intercepted during both phases of the interaction. For open weight distilled models imported using Amazon Bedrock Custom Model Import, Amazon Bedrock Marketplace, and Amazon SageMaker JumpStart, critical filters to implement include those for prompt attacks, content moderation, topic restrictions, and sensitive information protection.

Implementing a defense-in-depth strategy with AWS services

While Amazon Bedrock Guardrails provides essential content and prompt safety controls, implementing a comprehensive defense-in-depth strategy is crucial when deploying any foundation model, especially open weights models such as DeepSeek-R1. For detailed guidance on defense-in-depth approaches aligned with OWASP Top 10 for LLMs, see our previous blog post on architecting secure generative AI applications.

Key highlights include:

Developing organizational resiliency by starting with security in mind
Building on a secure cloud foundation using AWS services
Applying a layered defense strategy across multiple trust boundaries
Addressing the OWASP Top 10 risks for LLM applications
Implementing security best practices throughout the AI/ML lifecycle
Using AWS security services in conjunction with AI and machine learning (AI/ML)-specific features
Considering diverse perspectives and aligning security with business objectives
Preparing for and mitigating risks such as prompt injection and data poisoning

The combination of model-level controls (guardrails) with a defense-in-depth strategy creates a robust security posture that can help protect against:

Data exfiltration attempts
Unauthorized access to fine-tuned models or training data
Potential vulnerabilities in model implementation
Malicious use of AI agents and integrations

We recommend conducting thorough threat modeling exercises using AWS guidance for generative AI workloads before deploying any new AI/ML solutions. This helps align security controls with specific risk scenarios and business requirements.

Conclusion

Implementing safety protection for LLMs, including DeepSeek-R1 models, is crucial for maintaining a secure and ethical AI environment. By using Amazon Bedrock Guardrails with the Amazon Bedrock InvokeModel API and the ApplyGuardrails API, you can help mitigate the risks associated with advanced language models while still harnessing their powerful capabilities. However, it’s important to recognize that model-level protections are just one component of a comprehensive security strategy.

The strategies outlined in this post address several key security concerns that are common across various open weights models hosted on Amazon Bedrock using Amazon Bedrock Custom Model Import, Amazon Bedrock Marketplace, and through Amazon SageMaker JumpStart. These include potential vulnerabilities to prompt injection attacks, the generation of harmful content, and other risks identified in recent assessments. By implementing these guardrails alongside a defense-in-depth approach, organizations can significantly reduce the risk of misuse and better align their AI applications with ethical standards and regulatory requirements.

As AI technology continues to evolve, it’s essential to prioritize safety and responsible use of generative AI. Amazon Bedrock Guardrails provides a configurable and robust framework for implementing these safeguards, allowing developers to customize protection measures according to their specific use cases and organizational policies. We strongly recommend conducting thorough threat modeling of your AI workloads using AWS guidance to evaluate security risks and implementing appropriate controls across your entire technology stack.

Remember to regularly review and update not only your guardrails but all security controls to address new potential vulnerabilities and help maintain protection against emerging threats in the rapidly evolving landscape of AI security. While today we focus on DeepSeek-R1 models, the AI landscape is continuously evolving with new models emerging regularly. Amazon Bedrock Guardrails, combined with AWS security services and best practices, provides a consistent security framework that can adapt to protect your generative AI applications across various open weights models, both current and future. By treating security as a continuous process of assessment, improvement, and adaptation, organizations can confidently deploy innovative AI solutions while maintaining robust security controls.

About the Authors

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Fine-tune and host SDXL models cost-effectively with AWS Inferentia2

February 6, 2025

by Deepti Tirumala Amazon AWS

Building upon a previous Machine Learning Blog post to create personalized avatars by fine-tuning and hosting the Stable Diffusion 2.1 model at scale using Amazon SageMaker, this post takes the journey a step further. As technology continues to evolve, newer models are emerging, offering higher quality, increased flexibility, and faster image generation capabilities. One such groundbreaking model is Stable Diffusion XL (SDXL), released by StabilityAI, advancing the text-to-image generative AI technology to unprecedented heights. In this post, we demonstrate how to efficiently fine-tune the SDXL model using SageMaker Studio. We show how to then prepare the fine-tuned model to run on AWS Inferentia2 powered Amazon EC2 Inf2 instances, unlocking superior price performance for your inference workloads.

Solution overview

The SDXL 1.0 is a text-to-image generation model developed by Stability AI, consisting of over 3 billion parameters. It comprises several key components, including a text encoder that converts input prompts into latent representations, and a U-Net model that generates images based on these latent representations through a diffusion process. Despite its impressive capabilities trained on a public dataset, app builders sometimes need to generate images for a specific subject or style that are difficult or inefficient to describe in words. In that situation, fine-tuning is a great option to improve relevance using your own data.

One popular approach to fine-tuning SDXL is to use DreamBooth and Low-Rank Adaptation (LoRA) techniques. You can use DreamBooth to personalize the model by embedding a subject into its output domain using a unique identifier, effectively expanding its language-vision dictionary. This process uses a technique called prior preservation, which retains the model’s existing knowledge about the subject class (such as humans) while incorporating new information from the provided subject images. LoRA is an efficient fine-tuning method that attaches small adapter networks to specific layers of the pre-trained model, freezing most of its weights. By combining these techniques, you can generate a personalized model while tuning an order-of-magnitude fewer parameters, resulting in faster fine-tuning times and optimized storage requirements.

After the model is fine-tuned, you can compile and host the fine-tuned SDXL on Inf2 instances using the AWS Neuron SDK. By doing this, you can benefit from the higher performance and cost-efficiency offered by these specialized AI chips while taking advantage of the seamless integration with popular deep learning frameworks such as TensorFlow and PyTorch. To learn more, visit our Neuron documentation.

Prerequisites

Before you get started, review the list of services and instance types required to run the sample notebooks provided at this GitHub location.

Basic understanding of Stable Diffusion models. Refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker for more information.
General knowledge about foundation models (FMs) and how fine-tuning brings value. Read more on Fine-tune a foundation model.
An Amazon Web Services (AWS) account. Confirm your AWS identity has the requisite permissions, including the ability to create SageMaker resources (domain, model, and endpoints) and Amazon Simple Storage Service (Amazon S3) access to upload model artifacts. Alternatively, you can attach the AmazonSageMakerFullAccess managed policy to your AWS Identity and Access Management (IAM) user or role.
This notebook is tested using the default Python 3 kernel on SageMaker Studio. A GPU instance such as ml.g5.2xlarge is recommended. Refer to the documentation on setting up a domain for SageMaker Studio.
For compiling the fine-tuned model, an inf2.8xlarge or larger Amazon Elastic Compute Cloud (Amazon EC2) instance with Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) is required. The instance comes with the required neuron drivers, libraries and Jupyter Lab preinstalled.

By following these prerequisites, you will have the necessary knowledge and AWS resources to run the sample notebooks and work with Stable Diffusion models and FMs on Amazon SageMaker.

Fine-tuning SDXL on SageMaker

To fine-tune SDXL on SageMaker, follow the steps in the next sections.

Prepare the images

The first step in fine-tuning the SDXL model is to prepare your training images. Using the DreamBooth technique, you need as few as 10–12 images for fine-tuning. It’s recommended to provide a variety of images to help the model better understand and generalize your facial features.

The training images should include selfies taken from different angles, covering various perspectives of your face. Include images with different facial expressions, such as smiling, frowning, and neutral. Preferably, use images with different backgrounds to help the model identify the subject more effectively. By providing a diverse set of images, DreamBooth can better identify the subject from the pictures and generalize your facial features. The following set of images demonstrate this.

Additionally, use 1024×1024 pixel square images for fine-tuning. To simplify the process of preparing the images, there is a utility function that automatically crops and adjusts your images to the correct dimensions.

Train the personalized model

After the images are prepared, you can begin the fine-tuning process. To achieve this, you use the autoTrain library from Hugging Face, an automatic and user-friendly approach to training and deploying state-of-the-art machine learning (ML) models. Seamlessly integrated with the Hugging Face ecosystem, autoTrain is designed to be accessible, and individuals can train custom models without extensive technical expertise or coding proficiency. To use autoTrain, use the following example code:

!autotrain dreambooth 
--prompt "${INSTANCE_PROMPT}" 
--class-prompt "${CLASS_PROMPT}" 
--model ${MODEL_NAME} 
--project-name ${PROJECT_NAME} 
--image-path "${IMAGE_PATH}" 
--resolution ${RESOLUTION} 
--batch-size ${BATCH_SIZE} 
--num-steps ${NUM_STEPS} 
--gradient-accumulation ${GRADIENT_ACCUMULATION} 
--lr ${LEARNING_RATE} 
--fp16 
--gradient-checkpointing

First, you need to set the prompt and class-prompt. The prompt should include a unique identifier or token that the model can reference to the subject. The class-prompt, on the other hand, is used to subsidize the model training with similar subjects of the same class. This is a requirement for the DreamBooth technique to better associate the new token with the subject of interest. This is why the DreamBooth technique can generate exceptional fine-tuned results with fewer input images. Additionally, you’ll notice that even though you didn’t provide examples of the top or back of our head, the model still knows how to generate them because of the class prompt. In this example, you are using <<TOK>> as a unique identifier to avoid a name that the model might already be familiar with.

instance_prompt = "photo of <<TOK>>"
class_prompt = "photo of a person"

Next, you need to provide the model, image-path, and project-name. The model name loads the base model from the Hugging Face Hub or locally. The image-path is the location of the training images. By default, autoTrain uses LoRA, a parameter-efficient way to fine-tune. Unlike traditional fine-tuning, LoRA fine-tunes by attaching a small transformer adapter model to the base model. Only the adapter weights are updated during training to achieve fine-tuning behavior. Additionally, these adapters can be attached and detached at any time, making them highly efficient for storage as well. These supplementary LoRA adapters are 98% smaller in size compared to the original model, allowing us to store and share the LoRA adapters without having to duplicate the base model repeatedly. The following diagram illustrates these concepts.

The rest of the configuration parameters are as follows. You are recommended to start with these values first. Adjust them only if the fine-tuning results don’t meet your expectations.

resolution = 1024          # resolution or size of the generated images
batch_size = 1             # number of samples in one forward and backward pass  
num_steps = 500           # number of training steps
gradient_accumulation = 4  # accumulating gradients over number of batches
learning_rate = 1e-4       # step size
fp16                       # half-precision
gradient-checkpointing     # technique to reduce memory consumption during training

The entire training process takes about 30 mins with the preceding configuration. After the training is done, you can load the LoRA adapter, such as the following code, and generate fine-tuned images.

from diffusers import DiffusionPipeline, StableDiffusionXLImg2ImgPipeline
import random

seed = random.randint(0, 100000)

# loading the base model
pipeline = DiffusionPipeline.from_pretrained(
    model_name_base,
    torch_dtype=torch.float16,
    ).to(device)

# attach the LoRA adapter
pipeline.load_lora_weights(
    project_name,
    weight_name="pytorch_lora_weights.safetensors",
)

# generate fine tuned images
generator = torch.Generator(device).manual_seed(seed)
base_image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    generator=generator,
    height=1024,
    width=1024,
    output_type="pil",
    ).images[0]
base_image

Deploy on Amazon EC2 Inf2 instances

In this section, you learn to compile and host the fine-tuned SDXL model on Inf2 instances. To begin, you need to clone the repository and upload the LoRA adapter onto the Inf2 instance created in the prerequisites section. Then, run the compilation notebook to compile the fine-tuned SDXL model using the Optimum Neuron library. Visit the Optimum Neuron page for more details.

The NeuronStableDiffusionXLPipeline class in Optimum Neuron now has direct support for the LoRA. All you need to do is to supply the base model, LoRA adapters, and supply the model input shapes to start the compilation process. The following code snippet illustrates how to compile and then export the compiled model to a local directory.

from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter_id = "lora"
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024, "num_images_per_prompt": 1}

# Compile
pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,
    lora_model_ids=adapter_id,
    lora_weight_names="pytorch_lora_weights.safetensors",
    lora_adapter_names="sttirum",
    **input_shapes,
)

# Save locally or upload to the HuggingFace Hub
save_directory = "sd_neuron_xl/"
pipe.save_pretrained(save_directory)

The compilation process takes about 35 minutes. After the process is complete, you can use the NeuronStableDiffusionXLPipeline again to load the compiled model back.

from optimum.neuron import NeuronStableDiffusionXLPipeline

stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl")

You can then test the model on Inf2 and make sure that you can still generate the fine-tuned results.

import torch
# Run pipeline
prompt = """
photo of <<TOK>> , 3d portrait, ultra detailed, gorgeous, 3d zbrush, trending on dribbble, 8k render
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = 491057365
generator = [torch.Generator().manual_seed(seed)]
image = stable_diffusion_xl(prompt,
                    num_inference_steps=50,
                    guidance_scale=7,
                    negative_prompt=negative_prompt,
                    generator=generator).images[0]

Here are a few avatar images generated using the fine-tuned model on Inf2. The corresponding prompts are the following:

emoji of << TOK >>, astronaut, space ship background
oil painting of << TOK >>, business woman, suit
photo of << TOK >> , 3d portrait, ultra detailed, 8k render
anime of << TOK >>, ninja style, dark hair

Clean up

To avoid incurring AWS charges after you finish testing this example, make sure you delete the following resources:

Amazon SageMaker Studio Domain
Amazon EC2 Inf2 instance

Conclusion

This post has demonstrated how to fine-tune the Stable Diffusion XL (SDXL) model using DreamBooth and LoRA techniques on Amazon SageMaker, enabling enterprises to generate highly personalized and domain-specific images tailored to their unique requirements using as few as 10–12 training images. By using these techniques, businesses can rapidly adapt the SDXL model to their specific needs, unlocking new opportunities to enhance customer experiences and differentiate their offerings. Moreover, we showcased the process of compiling and deploying the fine-tuned SDXL model for inference on AWS Inferentia2 powered Amazon EC2 Inf2 instances, which deliver an unparalleled price-to-performance ratio for generative AI workloads, enabling enterprises to host fine-tuned SDXL models at scale in a cost-efficient manner. We encourage you to try the example and share your creations with us using hashtags #sagemaker #mme #genai on social platforms. We would love to see what you make.

For more examples about AWS Neuron, refer to aws-neuron-samples.

About the Authors

Deepti Tirumala is a Senior Solutions Architect at Amazon Web Services, specializing in Machine Learning and Generative AI technologies. With a passion for helping customers advance their AWS journey, she works closely with organizations to architect scalable, secure, and cost-effective solutions that leverage the latest innovations in these areas.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS. helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Diwakar Bansal is a Principal GenAI Specialist focused on business development and go-to- market for GenAI and Machine Learning accelerated computing services. Diwakar has led product definition, global business development, and marketing of technology products in the fields of IOT, Edge Computing, and Autonomous Driving focusing on bringing AI and Machine learning to these domains. Diwakar is passionate about public speaking and thought leadership in the Cloud and GenAI space.

How Aetion is using generative AI and Amazon Bedrock to translate scientific intent to results

February 6, 2025

by Javier Beltrán Amazon AWS

This post is co-written with Javier Beltrán, Ornela Xhelili, and Prasidh Chhabri from Aetion.

For decision-makers in healthcare, it is critical to gain a comprehensive understanding of patient journeys and health outcomes over time. Scientists, epidemiologists, and biostatisticians implement a vast range of queries to capture complex, clinically relevant patient variables from real-world data. These variables often involve complex sequences of events, combinations of occurrences and non-occurrences, as well as detailed numeric calculations or categorizations that accurately reflect the diverse nature of patient experiences and medical histories. Expressing these variables as natural language queries allows users to express scientific intent and explore the full complexity of the patient timeline.

Aetion is a leading provider of decision-grade real-world evidence software to biopharma, payors, and regulatory agencies. The company provides comprehensive solutions to healthcare and life science customers to rapidly and transparently transforms real-world data into real-world evidence.

At the core of the Aetion Evidence Platform (AEP) are Measures—logical building blocks used to flexibly capture complex patient variables, enabling scientists to customize their analyses to address the nuances and challenges presented by their research questions. AEP users can use Measures to build cohorts of patients and analyze their outcomes and characteristics.

A user asking a scientific question aims to translate scientific intent, such as “I want to find patients with a diagnosis of diabetes and a subsequent metformin fill,” into algorithms that capture these variables in real-world data. To facilitate this translation, Aetion developed a Measures Assistant to turn users’ natural language expressions of scientific intent into Measures.

In this post, we review how Aetion is using Amazon Bedrock to help streamline the analytical process toward producing decision-grade real-world evidence and enable users without data science expertise to interact with complex real-world datasets.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI startups and Amazon through a unified API. It offers a wide range of FMs, allowing you to choose the model that best suits your specific use case.

Aetion’s technology

Aetion is a healthcare software and services company that uses the science of causal inference to generate real-world evidence on the safety, effectiveness, and value of medications and clinical interventions. Aetion has partnered with the majority of top 20 biopharma, leading payors, and regulatory agencies.

Aetion brings deep scientific expertise and technology to life sciences, regulatory agencies (including FDA and EMA), payors, and health technology assessment (HTA) customers in the US, Canada, Europe, and Japan with analytics that can achieve the following:

Optimize clinical trials by identifying target populations, creating external control arms, and contextualizing settings and populations underrepresented in controlled settings
Expand industry access through label changes, pricing, coverage, and formulary decisions
Conduct safety and effectiveness studies for medications, treatments, and diagnostics

Aetion’s applications, including Discover and Substantiate, are powered by the AEP, a core longitudinal analytic engine capable of applying rigorous causal inference and statistical methods to hundreds of millions of patient journeys.

AetionAI, Aetion’s set of generative AI capabilities, are embedded across the AEP and applications. Measures Assistant is an AetionAI feature in Substantiate.

The following figure illustrates the organization of Aetion’s services.

Measures Assistant

Users build analyses in Aetion Substantiate to turn real-world data into decision-grade real-world evidence. The first step is capturing patient variables from real-world data. Substantiate offers a wide range of Measures, as illustrated in the following screenshot. Measures can often be chained together to capture complex variables.

Suppose the user is assessing a therapy’s cost-effectiveness to help negotiate drug coverage with payors. The first step in this analysis is to filter out negative cost values that might appear in claims data. The user can ask AetionAI how to implement this, as shown in the following screenshot.

In another scenario, a user might want to define an outcome in their analysis as the change in hemoglobin over successive lab tests following the start of treatment. A user asks Measures Assistant a question expressed in natural language and receives instructions on how to implement this.

Solution overview

Patient datasets are ingested into the AEP and transformed into a longitudinal (timeline) format. AEP references this data to generate cohorts and run analyses. Measures are the variables that determine conditions for cohort entry, inclusion or exclusion, and the characteristics of a study.

The following diagram illustrates the solution architecture.

Measures Assistant is a microservice deployed in a Kubernetes on AWS environment and accessed through a REST API. The data transmitted to the service is encrypted using Transport Layer Security 1.2 (TLS). When a user asks a question through the assistant UI, Substantiate initiates a request containing the question and previous history of messages, if available. Measures Assistant incorporates the question into a prompt template and calls the Amazon Bedrock API to invoke Anthropic’s Claude 3 Haiku. The user-provided prompts and the requests sent to the Amazon Bedrock API are encrypted using TLS 1.2.

Aetion chose to use Amazon Bedrock for working with large language models (LLMs) due to its vast model selection from multiple providers, security posture, extensibility, and ease of use. Anthropic’s Claude 3 Haiku LLM was found to be more efficient in runtime and cost than available alternatives.

Measures Assistant maintains a local knowledge base about AEP Measures from scientific experts at Aetion and incorporates this information into its responses as guardrails. These guardrails make sure the service returns valid instructions to the user, and compensates for logical reasoning errors that the core model might exhibit.

The Measures Assistant prompt template contains the following information:

A general definition of the task the LLM is running.
Extracts of AEP documentation, describing each Measure type covered, its input and output types, and how to use it.
An in-context learning technique that includes semantically relevant solved questions and answers in the prompt.
Rules to condition the LLM to behave in a certain manner. For example, how to react to unrelated questions, keep sensitive data secure, or restrict its creativity in developing invalid AEP settings.

To streamline the process, Measures Assistant uses templates composed of two parts:

Static – Fixed instructions to be used with user questions. These instructions cover a broad range of well-defined instructions for Measures Assistant.
Dynamic – Questions and answers are dynamically selected from a local knowledge base based on semantic proximity to the user question. These examples improve the quality of the generated answers by incorporating similar previously asked and answered questions to the prompt. This technique models a small-scale, optimized, in-process knowledge base for a Retrieval Augmented Generation (RAG) pattern.

Mixedbread’s mxbai-embed-large-v1 Sentence Transformer was fine-tuned to generate sentence embeddings for a question-and-answer local knowledge base and users’ questions. Sentence question similarity is calculated through the cosine similarity between embedding vectors.

The generation and maintenance of the question-and-answer pool involve a human in the loop. Subject matter experts continuously test Measures Assistant, and question-and-answer pairs are used to refine it continually to optimize the user experience.

Outcomes

Our implementation of AetionAI capabilities enable users using natural language queries and sentences to describe scientific intent into algorithms that capture these variables in real-world data. Users now can turn questions expressed in natural language into measures in a matter minutes as opposed to days, without the need of support staff and specialized training.

Conclusion

In this post, we covered how Aetion uses AWS services to streamline the user’s path from defining scientific intent to running a study and obtaining results. Measures Assistant enables scientists to implement complex studies and iterate on study designs, instantaneously receiving guidance through responses to quick, natural language queries.

Aetion is continuing to refine the knowledge base available to Measures Assistant and expand innovative generative AI capabilities across its product suite to help improve the user experience and ultimately accelerate the process of turning real-world data into real-world evidence.

With Amazon Bedrock, the future of innovation is at your fingertips. Explore Generative AI Application Builder on AWS to learn more about building generative AI capabilities to unlock new insights, build transformative solutions, and shape the future of healthcare today.

About the Authors

Javier Beltrán is a Senior Machine Learning Engineer at Aetion. His career has focused on natural language processing, and he has experience applying machine learning solutions to various domains, from healthcare to social media.

Ornela Xhelili is a Staff Machine Learning Architect at Aetion. Ornela specializes in natural language processing, predictive analytics, and MLOps, and holds a Master’s of Science in Statistics. Ornela has spent the past 8 years building AI/ML products for tech startups across various domains, including healthcare, finance, analytics, and ecommerce.

Prasidh Chhabri is a Product Manager at Aetion, leading the Aetion Evidence Platform, core analytics, and AI/ML capabilities. He has extensive experience building quantitative and statistical methods to solve problems in human health.

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare life sciences customers and specializes in data analytics services. Mikhail has more than 20 years of industry experience covering a wide range of technologies and sectors.

Lightweight LLM for converting text to structured data

February 6, 2025

by Amazon AWS

Novel training procedure and decoding mechanism enable model to outperform much larger foundation model prompted to perform the same task.Read More

Trellix lowers cost, increases speed, and adds delivery flexibility with cost-effective and performant Amazon Nova Micro and Amazon Nova Lite models

February 5, 2025

by Martin Holste Amazon AWS

This post is co-written with Martin Holste from Trellix.

Security teams are dealing with an evolving universe of cybersecurity threats. These threats are expanding in form factor, sophistication, and the attack surface they target. Constrained by talent and budget limitations, teams are often forced to prioritize the events pursued for investigation, limiting the ability to detect and identify new threats. Trellix Wise is an AI-powered technology enabling security teams to automate threat investigation and add risk scores to events. With Trellix Wise, security teams can now complete what used to take multiple analysts hours of work to investigate in seconds, enabling them to expand the security events they are able to cover.

Trellix, a leading company delivering cybersecurity’s broadest AI-powered platform to over 53,000 customers worldwide, emerged in 2022 from the merger of McAfee Enterprise and FireEye. The company’s comprehensive, open, and native AI-powered security platform helps organizations build operational resilience against advanced threats. Trellix Wise is available to customers as part of the Trellix Security Platform. This post discusses the adoption and evaluation of Amazon Nova foundation models (FMs) by Trellix.

With growing adoption and use, the Trellix team has been exploring ways to optimize the cost structure of Trellix Wise investigations. Smaller, cost-effective FMs seemed promising and Amazon Nova Micro stood out as an option because of its quality and cost. In early evaluations, the Trellix team observed that Amazon Nova Micro delivered inferences three times faster and at nearly 100-fold lower cost.

The following figures are the results of tests by Trellix comparing Amazon Nova Micro to other models on Amazon Bedrock.

The Trellix team identified areas where Amazon Nova Micro can complement their use of Anthropic’s Claude Sonnet, delivering lower costs and higher overall speeds. Additionally, the professional services team at Trellix found Amazon Nova Lite to be a strong model for code generation and code understanding and is now using Amazon Nova Lite to speed up their custom solution delivery workflows.

Trellix Wise, generative-AI-powered threat investigation to assist security analysts

Trellix Wise is built on Amazon Bedrock and uses Anthropic’s Claude Sonnet as its primary model. The platform uses the Amazon OpenSearch Service stores billions of security events collected from the environments monitored. OpenSearch Service comes with a built-in vector database capability, making it straightforward to use data stored in OpenSearch Service as context data in a Retrieval Augmented Generation (RAG) architecture with Amazon Bedrock Knowledge Bases. Using OpenSearch Service and Amazon Bedrock, Trellix Wise carries out its automated, proprietary threat investigation steps on each event. This includes retrieval of required data for analysis, analysis of the data using insights from other custom-built machine learning (ML) models, and risk scoring. This sophisticated approach enables the service to interpret complex security data patterns and make intelligent decisions about each event. The Trellix Wise investigation gives each event a risk score and allows analysts to dive deeper into the results of the analysis, to determine whether human follow-up is necessary.

The following screenshot shows an example of an event on the Trellix Wise dashboard.

With growing scale of adoption, Trellix has been evaluating ways to improve cost and speed. The Trellix team has determined not all stages in the investigation need the accuracy of Claude Sonnet, and that some stages can benefit from faster, lower cost models that nevertheless are highly accurate for the target task. This is where Amazon Nova Micro has helped improve the cost structure of investigations.

Improving investigation cost with Amazon Nova Micro, RAG, and repeat inferences

The threat investigation workflow consists of multiple steps, from data collection, to analysis, to assigning of a risk score for the event. The collections stage retrieves event-related information for analysis. This is implemented through one or more inference calls to a model in Amazon Bedrock. The priority in this stage is to maximize completeness of the retrieval data and minimize inaccuracy (hallucinations). The Trellix team identified this stage as the optimal stage in the workflow to optimize for speed and cost.

The Trellix team concluded, based on their testing, Amazon Nova Micro offered two key advantages. Its speed allows it to process 3-5 inferences in the same time as a single Claude Sonnet inference and it’s cost per inference is almost 100 times lower. The Trellix team determined that by running multiple inferences, you can maximize the coverage of required data and still lower costs by a factor of 30. Although the model responses had a higher variability than the larger models, running multiple passes enables getting to a more exhaustive response-set. The response limitations enforced through proprietary prompt engineering and reference data constrain the response space, limiting hallucinations and inaccuracies in the response.

Before implementing the approach, the Trellix team carried out detailed testing to review the response completeness, cost, and speed. The team realized early in their generative AI journey that standardized benchmarks are not sufficient when evaluating models for a specific use case. A test harness replicating the information gathering workflows was set up and detailed evaluations of multiple models were carried out, to validate the benefits of this approach before moving ahead. The speed and cost benefits observed by Trellix helped validate the benefits before moving the new approach into production. The approach is now deployed in a limited pilot environment. Detailed evaluations are being carried out as part of a phased roll-out into production.

Conclusion

In this post, we shared how Trellix adopted and evaluated Amazon Nova models, resulting in significant inference speedup and lower costs. Reflecting on the project, the Trellix team recognizes the following as key enablers allowing them to achieve these results:

Access to a broad range of models, including smaller highly capable models like Amazon Nova Micro and Amazon Nova Lite, accelerated the team’s ability to easily experiment and adopt new models as appropriate.
The ability to constrain responses to avoid hallucinations, using pre-built use-case specific scaffolding that incorporated proprietary data, processes, and policies, reduced the risk of hallucinations and inaccuracies.
Data services that enabled effective integration of data alongside foundation models simplified implementation and reduced the time to production for new components.

“Amazon Bedrock makes it easy to evaluate new models and approaches as they become available. Using Amazon Nova Micro alongside Anthropic’s Claude Sonnet allows us to deliver the best coverage to our customers, fast, and at the best operating cost.“ says Martin Holste, Senior Director, Engineering, Trellix. “We’re really happy with the flexibility that Amazon Bedrock allows us as we continue to evaluate and improve Trellix Wise and the Trellix Security Platform.”

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About the Authors

Martin Holste is the CTO for Cloud and GenAI at Trellix.
Firat Elbey is a Principal Product Manager at Amazon AGI.
Deepak Mohan is a Principal Product Marketing Manager at AWS.

OfferUp improved local results by 54% and relevance recall by 27% with multimodal search on Amazon Bedrock and Amazon OpenSearch Service

February 5, 2025

by Purna Sanyal Amazon AWS

This post is co-written with Andrés Vélez Echeveri and Sean Azlin from OfferUp.

OfferUp is an online, mobile-first marketplace designed to facilitate local transactions and discovery. Known for its user-friendly app and trust-building features, including user ratings and in-app chat, OfferUp enables users to buy and sell items and explore a broad range of jobs and local services. As part of its ongoing mission to enhance user experience and drive business growth, OfferUp constantly seeks to improve its search capabilities, making it faster and more intuitive for users to discover, transact, and connect in their local communities.

In this two-part blog post series, we explore the key opportunities OfferUp embraced on their journey to boost and transform their existing search solution from traditional lexical search to modern multimodal search powered by Amazon Bedrock and Amazon OpenSearch Service. OfferUp found that multimodal search improved relevance recall by 27%, reduced geographic spread (which means more local results) by 54%, and grew search depth by 6.5%. This series delves into strategies, architecture patterns, business benefits and technical steps to modernize your own search solution

Foundational search architecture

OfferUp hosts millions of active listings, with millions more added monthly by its users. Previously, OfferUp’s search engine was built with Elasticsearch (v7.10) on Amazon Elastic Compute Cloud (Amazon EC2), using a keyword search algorithm to find relevant listings. The following diagram illustrates the data pipeline for indexing and query in the foundational search architecture.

Figure 1: Foundational search architecture

The data indexing workflow consists of the following steps:

As an OfferUp user creates or updates a listing, any new images are uploaded directly to Amazon Simple Storage Service (Amazon S3) using signed upload URLs.
The OfferUp user submits the new or updated listing details (title, description, image ids) to a posting microservice.
The posting microservice then persists the changes using the listing writer microservice in Amazon DynamoDB.
The listing writer microservice publishes listing change events to an Amazon Simple Notification Service (Amazon SNS) topic, which an Amazon Simple Queue Service (Amazon SQS) queue subscribes to.
The listing indexer AWS Lambda function continuously polls the queue and processes incoming listing updates.
The indexer retrieves the full listing details through the listing reader microservice from the DynamoDB table.
Finally, the indexer updates or inserts these listing details into Elasticsearch.

This flow makes sure that new or updated listings are indexed and made available for search queries in Elasticsearch.

The data query workflow consists of the following steps:

OfferUp users perform text searches, such as “summer shirt” or “running shoes”’.
The search microservice processes the query requests and retrieves relevant listings from Elasticsearch using keyword search (BM25 as a ranking algorithm).

Challenges with the foundational search architecture

OfferUp continuously strives to enhance user experience, focusing specifically on improving search relevance, which directly impacts Engagement with Seller Response (EWSR) and drives ad impressions. Although the foundational search architecture effectively surfaces a broad and diverse inventory, OfferUp encountered several limitations that prevent it from achieving optimal outcomes. These challenges include:

Context understanding – Keyword searches don’t account for the context in which a term is used. This can lead to irrelevant results if the same keyword has different meanings or uses. Keywords alone can’t discern user intent. For instance, “apple” could refer to the fruit, the technology company, or the brand name in different contexts.
Synonym and variation awareness – Keyword searches might miss results if the search terms vary or if synonyms are used. For example, searching for “car” might not return results for “sedan”. Similarly, searching for iPhone 11 can return results for iPhone 10 and iPhone 12.
Complex query management – The foundational search approach struggled with complex, multi-concept queries like “red running shoes,” often returning results that included shoes in other colors or footwear not designed for running.

Keyword search, which uses BM25 as a ranking algorithm, lacks the ability to understand semantic relationships between words, often missing semantically relevant results if they don’t contain exact keywords.

Solution overview

To improve search quality, OfferUp explored various software and hardware solutions focused on boosting search relevance while maintaining cost-efficiency. Ultimately, OfferUp selected Amazon Titan Multimodal Embeddings and Amazon OpenSearch Service for their fully managed services, which support a robust multimodal search solution capable of delivering high accuracy and fast responses across search and recommendation use cases. This choice also simplifies the deployment and operation of large-scale search capabilities on the OfferUp app, meeting the high throughput and latency requirements.

Amazon Titan Multimodal Embeddings G1 model

This model is pre-trained on large datasets, so you can use it as-is or customize this model by fine-tuning with your own data for a particular task. This model is used for use cases like searching images by text, by image, or by a combination of text and image for similarity and personalization. It translates the input image or text into an embedding that contains the semantic meaning of both the image and text in the same semantic space. By comparing embeddings, the model produces more relevant and contextual responses than keyword matching alone.

The Amazon Titan Multimodal Embeddings G1 offers the following configurations:

Model ID – amazon.titan-embed-image-v1
Max input text tokens – 256
Max input image size – 25 MB
Output vector size – 1,024 (default), 384, 256
Inference types – On-Demand, Provisioned Throughput

OpenSearch Service’s vector database capabilities

Vector databases enable the storage and indexing of vectors alongside metadata, facilitating low-latency queries to discover assets based on similarity. These databases typically use k-nearest (k-NN) indexes built with advanced algorithms such as Hierarchical Navigable Small Worlds (HNSW) and Inverted File (IVF) systems. Beyond basic k-NN functionality, vector databases offer a robust foundation for applications that require data management, fault tolerance, resource access controls, and an efficient query engine.

OpenSearch is a powerful, open-source suite that provides scalable and flexible tools for search, analytics, security monitoring, and observability—all under the Apache 2.0 license. With Amazon OpenSearch Service, you get a fully managed solution that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud. By using Amazon OpenSearch Service as a vector database, you can combine traditional search, analytics, and vector search into one comprehensive solution. OpenSearch’s vector capabilities help accelerate AI application development, making it easier for teams to operationalize, manage, and integrate AI-driven assets.

To further boost these capabilities, OpenSearch offers advanced features, such as:

Connector for Amazon Bedrock – You can seamlessly integrate Amazon Bedrock machine learning (ML) models with OpenSearch through built-in connectors for services, enabling direct access to advanced ML features.
Ingest Pipeline – With ingest pipelines, you can process, transform, and route data efficiently, maintaining smooth data flows and real-time accessibility for search.
Neural Search – Neural search transforms text and images into vectors and facilitates vector search both at ingestion time and at search time. This allows end-to-end configuration of ingest pipelines, search pipelines, and the necessary connectors without having to leave OpenSearch
Transformed multimodal search architecture – OfferUp transformed its foundational search architecture with Amazon Bedrock Titan Multimodal and Amazon OpenSearch Service.

The following diagram below illustrates the data pipeline for indexing and query in the transformed multimodal search architecture:

Figure 2: Transformed multimodal search architecture

The data indexing workflow consists of the following steps:

As an OfferUp user creates or updates a listing, any new images are uploaded directly to Amazon Simple Storage Service (Amazon S3) using signed upload URLs.
The OfferUp user submits the new or updated listing details (title, description, image ids) to a posting microservice.
The posting microservice then persists the changes using the listing writer microservice in Amazon DynamoDB.
The listing writer microservice publishes listing change events to an Amazon Simple Notification Service (Amazon SNS) topic, which an Amazon Simple Queue Service (Amazon SQS) queue subscribes to.
The listing indexer AWS Lambda function continuously polls the queue and processes incoming listing updates.
The indexer retrieves the full listing details through the listing reader microservice from the DynamoDB table.
The Lambda indexer relies on the image microservice to retrieve listing images and encode them in base64 format.
The indexer lambda sends inserts and updates with listing details and base 64-encoded images to an Amazon OpenSearch Service domain.
An OpenSearch Ingest pipeline invokes the OpenSearch connector for Amazon Bedrock. The Titan Multimodal Embeddings model generates multi-dimensional vector embeddings for the listing image and description.
Listing data and embeddings are then stored in an Amazon OpenSearch index.

The data query workflow consists of the following steps:

OfferUp users perform both text and image searches, such as “gray faux leather sofa” or “running shoes”.
The search microservice captures the query and forwards it to Amazon OpenSearch Service domain, which invokes a neural search pipeline. The neural search pipeline forwards each search request to the same Amazon Titan Multimodal Embeddings model to convert the text and images into multi-dimensional vector embeddings.
OpenSearch Service then uses the vectors to find the k-nearest neighbors (KNN) to the vectorized search term and image to retrieve the relevant listings.

After extensive A/B testing with various k values, OfferUp found that a k value of 128 delivers the best search results while optimizing compute resources.

OfferUp multimodal search migration path

OfferUp adopted a three-step process to implement multimodal search functionality into their foundational search architecture.

Identify the Designated Market Areas (DMAs) – OfferUp categorizes its DMAs into high density and low density. High DMA density represents geographic locations with a higher user concentration, whereas low DMA density refers to locations with fewer users. OfferUp initially identified three business-critical high-density locations where multimodal search solutions demonstrated promising results in offline experiments, making them ideal candidates for multimodal search.
Set up Infrastructure and necessary configurations – This includes the following
- OpenSearch Service: The OpenSearch domain is deployed across 3 Availability Zones (AZs) to provide high availability. The cluster comprises 3 cluster manager nodes (m6g.xlarge.search instance) dedicated to manage cluster operations. For data handling, 24 data nodes (r6gd.2xlarge.search instances) are used, optimized for both storage and processing. The index is configured with 12 shards and three read replicas to enhance read performance. Each shard consumes around 11.6GB of memory.
- Embeddings model: The infrastructure enables access to Amazon Titan Multimodal Embeddings G1 in Amazon Bedrock.
Use backfilling – Backfilling converts an image of every active listing into vectors using Amazon Titan Multimodal Embeddings and stores that in OpenSearch Service. In the first phase, OfferUp backfilled 12 million active listings.
OfferUp rolled out multimodal search experiments in these three DMAs where input token size could vary between 3 – 15.

Benefits of multimodal search

In this section, we discuss the benefits of multimodal search

Business metrics

OfferUp evaluated the impact of multimodal search through A/B testing to manage traffic control and user experiment variations. In this experiment, the control group used the existing keyword-based search, and the variant group experienced the new multimodal search functionality. The test included a substantial user base, allowing for a robust comparison.

The results of the multimodal search implementation were compelling.
User engagement increased by 2.2%, and EWSR saw a 3.8% improvement, highlighting enhanced relevance in search outcomes
Search depth grew by 6.5%, as users explored results more thoroughly, indicating improved relevance beyond the top search items
Importantly, the need for fanout searches (broader search queries) decreased by 54.2%, showing that more users found relevant local results quickly
Ad impressions also rose by 0.91%, sustaining ad visibility while enhancing search performance

Technical metrics

OfferUp conducted additional experiments to assess technical metrics, utilizing 6 months of production system data to examine relevance recall with a focus on the top k=10 most relevant results within high-density and low-density DMAs. By segmenting these locations, OfferUp gained insights into how variations in user distribution across different market densities affect system performance, allowing for a deeper understanding of relevance recall efficiency in diverse markets.

relevance recall (RR)= sum(listing relevance score) / number of retrieved listings

Listing relevance is labeled as (1, 0) and is based on query correlations with the listing retrieved.

1: Listing is relevant
0: listing is not relevant

Conclusion

In this post, we demonstrated how OfferUp transformed its foundational search architecture using Amazon Titan Multimodal Embeddings and OpenSearch Service, significantly increasing user engagement, improving search quality and offering users the ability to search with both text and images. OfferUp selected Amazon Titan Multimodal Embeddings and Amazon OpenSearch Service for their fully managed capabilities, enabling the development of a robust multimodal search solution with high accuracy and a faster time to market for search and recommendation use cases.

We are excited to share these insights with the broader community and support organizations embarking on their own multimodal search journeys or seeking to improve search precision. Based on our experience, we highly recommend using Amazon Bedrock and Amazon OpenSearch services to achieve similar outcomes.

In the next part of the series, we discuss how to build multimodal search solution with an Amazon SageMaker Jupyter notebook, Amazon Titan Multimodal Embeddings model and OpenSearch Service.

About the authors

Purna Sanyal is GenAI Specialist Solution Architect at AWS, helping customers to solve their business problems with successful adoption of cloud native architecture and digital transformation. He has specialization in data strategy, machine learning and Generative AI. He is passionate about building large-scale ML systems that can serve global users with optimal performance.

Andrés Vélez Echeveri is a Staff Data Scientist and Machine Learning Engineer at OfferUp, focused on enhancing the search experience by optimizing retrieval and ranking components within a recommendation system. He has a specialization in machine learning and generative AI. He is passionate about creating scalable AI systems that drive innovation and user impact.

Sean Azlin is a Principal Software Development Engineer at OfferUp, focused on leveraging technology to accelerate innovation, decrease time-to-market, and empower others to succeed and thrive. He is highly experienced in building cloud-native distributed systems at any scale. He is particularly passionate about GenAI and its many potential applications.

Enhancing LLM Capabilities with NeMo Guardrails on Amazon SageMaker JumpStart

February 5, 2025

by Georgi Botsihhin Amazon AWS

As large language models (LLMs) become increasingly integrated into customer-facing applications, organizations are exploring ways to leverage their natural language processing capabilities. Many businesses are investigating how AI can enhance customer engagement and service delivery, and facing challenges in making sure LLMs driven engagements are on topic and follow the desired instructions.

In this blog post, we explore a real-world scenario where a fictional retail store, AnyCompany Pet Supplies, leverages LLMs to enhance their customer experience. Specifically, this post will cover:

What NeMo Guardrails is. We will provide a brief introduction to guardrails and the Nemo Guardrails framework for managing LLM interactions.
Integrating with Amazon SageMaker JumpStart to utilize the latest large language models with managed solutions.
Creating an AI Assistant capable of understanding customer inquiries, providing contextually aware responses, and steering conversations as needed.
Implementing Sophisticated Conversation Flows using variables and branching flows to react to the conversation content, ask for clarifications, provide details, and guide the conversation based on user intent.
Incorporating your Data into the Conversation to provide factual, grounded responses aligned with your use case goals using retrieval augmented generation or by invoking functions as tools.

Through this practical example, we’ll illustrate how startups can harness the power of LLMs to enhance customer experiences and the simplicity of Nemo Guardrails to guide the LLMs driven conversation toward the desired outcomes.

Note: For any considerations of adopting this architecture in a production setting, it is imperative to consult with your company specific security policies and requirements. Each production environment demands a uniquely tailored security architecture that comprehensively addresses its particular risks and regulatory standards. Some links for security best practices are shared below but we strongly recommend reaching out to your account team for detailed guidance and to discuss the appropriate security architecture needed for a secure and compliant deployment.

What is Nemo Guardrails?

First, let’s try to understand what guardrails are and why we need them. Guardrails (or “rails” for short) in LLM applications function much like the rails on a hiking trail — they guide you through the terrain, keeping you on the intended path. These mechanisms help ensure that the LLM’s responses stay within the desired boundaries and produces answers from a set of pre-approved statements.

NeMo Guardrails, developed by NVIDIA, is an open-source solution for building conversational AI products. It allows developers to define and constrain the topics the AI agent will engage with, the possible responses it can provide, and how the agent interacts with various tools at its disposal.

The architecture consists of five processing steps, each with its own set of controls, referred to as “rails” in the framework. Each rail defines the allowed outcomes (see Diagram 1):

Input and Output Rails: These identify the topic and provide a blocking mechanism for what the AI can discuss.
Retrieval and Execution Rails: These govern how the AI interacts with external tools and data sources.
Dialog Rails: These maintain the conversational flow as defined by the developer.

For a retail chatbot like AnyCompany Pet Supplies’ AI assistant, guardrails help make sure that the AI collects the information needed to serve the customer, provides accurate product information, maintains a consistent brand voice, and integrates with the surrounding services supporting to perform actions on behalf of the user.

Diagram 1: The architecture of NeMo Guardrails, showing how interactions, rails and integrations are structured.

Within each rail, NeMo can understand user intent, invoke integrations when necessary, select the most appropriate response based on the intent and conversation history and generate a constrained message as a reply (see Diagram 2).

Diagram 2: The flow from input forms to the final output, including how integrations and AI services are utilized.

An Introduction to Colang

Creating a conversational AI that’s smart, engaging and operates with your use case goals in mind can be challenging. This is where NeMo Guardrails comes in. NeMo Guardrails is a toolset designed to create robust conversational agents, utilizing Colang — a modelling language specifically tailored for defining dialogue flows and guardrails. Let’s delve into how NeMo Guardrails own language can enhance your AI’s performance and provide a guided and seamless user experience.

Colang is purpose-built for simplicity and flexibility, featuring fewer constructs than typical programming languages, yet offering remarkable versatility. It leverages natural language constructs to describe dialogue interactions, making it intuitive for developers and simple to maintain.

Let’s delve into a basic Colang script to see how it works:

define user express greeting
"hello"
"hi"
"what's up?"

define bot express greeting
"Hey there!"

define bot ask how are you
"How are you doing?"

define flow greeting
user express greeting
bot express greeting
bot ask how are you

In this script, we see the three fundamental types of blocks in Colang:

User Message Blocks (define user …): These define possible user inputs.
Bot Message Blocks (define bot …): These specify the bot’s responses.
Flow Blocks (define flow …): These describe the sequence of interactions.

In the example above, we defined a simple dialogue flow where a user expresses gratitude, and the bot responds with a welcoming message. This straightforward approach allows developers to construct intricate conversational pathways that uses the examples given to route the conversation toward the desired responses.

Integrating Llama 3.1 and NeMo Guardrails on SageMaker JumpStart

For this post, we’ll use Llama 3.1 8B instruct model from Meta, a recent model that strikes excellent balance between size, inference cost and conversational capabilities. We will launch it via Amazon SageMaker JumpStart, which provides access to numerous foundation models from providers such as Meta, Cohere, Hugging Face, Anthropic and more.

By leveraging SageMaker JumpStart, you can quickly evaluate and select suitable foundation models based on quality, alignment and reasoning metrics. The selected models can then be further fine-tuned on your data to better match your specific use case needs. On top of ample model choice, the additional benefit is that it enables your data to remain within your Amazon VPC during both inference and fine-tuning.

When integrating models from SageMaker JumpStart with NeMo Guardrails, the direct interaction with the SageMaker inference API requires some customization, which we will explore below.

Creating an Adapter for NeMo Guardrails
To verify compatibility, we need to create an adapter to make sure that requests and responses match the format expected by NeMo Guardrails. Although NeMo Guardrails provides a SagemakerEndpoint wrapper class, it requires some customization to handle the Llama 3.1 model API exposed by SageMaker JumpStart properly.

Below, you will find an implementation of a NeMo-compatible class that arranges the parameters required to call our SageMaker endpoint:

class ContentHandler(LLMContentHandler):
    content_type = 'application/json'
    accepts = 'application/json'

    def transform_input(self, prompt: str, model_kwargs=None):
        if model_kwargs is None:
            model_kwargs = {}

        # Ensure the 'stop' parameter is set
        model_kwargs.setdefault('stop', ['<|eot_id|>'])

        input_data = {
            'inputs': prompt,
            'parameters': model_kwargs
        }
        return json.dumps(input_data).encode("utf-8")

    def transform_output(self, output):
        output_data = json.loads(output.read().decode("utf-8"))
        return output_data.get("generated_text", f"Error: {output_data.get('error', 'Unknown error')}")

class CustomSagemakerEndpoint(SagemakerEndpoint):
    content_handler = ContentHandler()
    endpoint_name = llm_predictor.endpoint_name
    region_name = llm_predictor.sagemaker_session.boto_region_name

Structuring the Prompt for Llama 3.1

The Llama 3.1 model from Meta requires prompts to follow a specific structure, including special tokens like </s> and {role} to define parts of the conversation. When invoking the model through NeMo Guardrails, you must make sure that the prompts are formatted correctly.

To achieve seamless integration, you can modify the prompt.yaml file. Here’s an example:

prompts:
  - prompt: "<|endoftext|>{role}: {user_message}<|endoftext|>{role}: {assistant_response}"

For more details on formatting input text for Llama models, you can explore these resources:

Creating an AI Assistant

In our task to create an intelligent and responsible AI assistant for AnyCompany Pet Supplies, we’re leveraging NeMo Guardrails to build a conversational AI chatbot that can understand customer needs, provide product recommendations, and guide users through the purchase process. Here’s how we implement this.

At the heart of NeMo Guardrails are two key concepts: flows and intents. These work together to create a structured, responsive, and context-aware conversational AI.

Flows in NeMo Guardrails

Flows define the conversation structure and guide the AI’s responses. They are sequences of actions that the AI should follow in specific scenarios. For example:

define flow unrelated question
  user ask unrelated question
  bot refuse unrelated question

define flow help user with pets
  user ask about pets
  bot answer question

These flows outline how the AI should respond in different situations. When a user asks about pets, the chatbot will provide an answer. When faced with an unrelated question, it will politely refuse to answer.

Intent Capturing and Flow Selection

The process of choosing which flow to follow begins with capturing the user intent. NeMo Guardrails uses a multi-faceted approach to understand user intent:

Pattern Matching: The system first looks for predefined patterns that correspond to specific intents:

define user ask about pets
  "What are the best shampoos for dogs?"
  "How frequently should I take my puppy on walks?"
  "How do I train my cat?"

define user ask unrelated question
  "Can you help me with my travel plans?"
  "What's the weather like today?"
  "Can you provide me with some investment advice?"

Dynamic Intent Recognition: After selecting the most likely candidates, NeMo uses a sophisticated intent recognition system defined in the prompts.yml file to narrow down the intent:

- task: generate_user_intent
  content: |-
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a helpful AI assistant specialised in pet products.<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    """
    {{ general_instructions }}
    """

    # This is an example how a conversation between a user and the bot can go:
    {{ sample_conversation }} 
    # The previous conversation was just an example
    
    # This is the current conversation between the user and the bot:
    Assistant: Hello
    {{ history | user_assistant_sequence }}
    
    # Choose the user current intent from this list: {{ potential_user_intents }}

    Ignore the user question. The assistant task is to choose an intent from this list. The last messages are more important for defining the current intent.
    Write only one of: {{ potential_user_intents }}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
  stop:
    - "<|eot_id|>"

This prompt is designed to guide the chatbot in determining the user’s intent. Let’s break it down:

Context Setting: The prompt begins by defining the AI’s role as a pet product specialist. This focuses the chatbot’s attention on pet-related queries.
General Instructions: The {{ general_instructions }} variable contains overall guidelines for the chatbot’s behavior, as defined in our config.yml.
Example Conversation: The {{ sample_conversation }} provides a model of how interactions should flow, giving the chatbot context for understanding user intents.
Current Conversation: The {{ history | user_assistant_sequence }} variable includes the actual conversation history, allowing the chatbot to consider the context of the current interaction.
Intent Selection: The chatbot is instructed to choose from a predefined list of intents {{ potential_user_intents }}. This constrains the chatbot to a set of known intents, ensuring consistency and predictability in intent recognition.
Recency Bias: The prompt specifically mentions that “the last messages are more important for defining the current intent.” This instructs the chatbot to prioritize recent context, which is often most relevant to the current intent.
Single Intent Output: The chatbot is instructed to “Write only one of: {{ potential_user_intents }}“. This provides a clear, unambiguous intent selection.

In Practice:

Here’s how this process works in practice (see Diagram 3):

When a user sends a message, NeMo Guardrails initiates the intent recognition task.
The chatbot reviews the conversation history, focusing on the most recent messages.
It matches the user’s input against a list of predefined intents.
The chatbot selects the most suitable intent based on this analysis.
The identified intent determines the corresponding flow to guide the conversation.

Diagram 3: Two example conversation flows, one denied by the input rails, one allowed to the dialog rail where the LLM picks up the conversation.

For example, if a user asks, “What’s the best food for a kitten?”, the chatbot might classify this as a “product_inquiry” intent. This intent would then activate a flow designed to recommend pet food products.

While this structured approach to intent recognition makes sure that the chatbot’s responses are focused and relevant to the user’s needs, it may introduce latency due to the need to process and analyze conversation history and intent in real-time. Each step, from intent recognition to flow selection, involves computational processing, which can impact the response time, especially in more complex interactions. Finding the right balance between flexibility, control, and real-time processing is crucial for creating an effective and reliable conversational AI system.

Implement Sophisticate Conversation Flows

In our earlier discussion about Colang, we examined its core structure and its role in crafting conversational flows. Now, we will delve into one of Colang’s standout features: the ability to utilize variables to capture and process user input. This functionality enables us to construct conversational agents that are not only more dynamic but also highly responsive, tailoring their interactions based on precise user data.

Continuing with our practical example of developing a pet store assistant chatbot:

define flow answer about pet products
  user express pet products needs
  
  $pet_type = ...

  if $pet_type == "not available"
    bot need clarification
  else
    if $pet_type == "dog":
      bot say "Depending on your dog's coat type, different grooming tools like deshedding brushes or mitts might be useful."
    else if $pet_type == "bird"
      bot say "For birds, it's crucial to use non-toxic cleaning products and sprays to maintain healthy feathers. I recommend looking for products specifically labeled for avian use."
    else
      bot say "For cats, especially those with long hair, a good brushing routine with a wire or bristle brush can help prevent mats and keep their coat healthy."

In the provided example above, we encounter the line:

$pet_type = ...

The ellipsis (...) serves as a placeholder in Colang, signaling where data extraction or inference is to be performed. This notation does not represent executable code but rather suggests that some form of logic or natural language processing should be applied at this stage.

More specifically, the use of an ellipsis here implies that the system is expected to:

Analyze the user’s input previously captured under “user express pet products needs.”
Determine or infer the type of pet being discussed.
Store this information in the $pet_type variable.

The comment accompanying this line sheds more light on the intended data extraction process:

#extract the specific pet type at very high level if available, like dog, cat, bird. Make sure you still class things like puppy as "dog", kitty as "cat", etc. if available or "not available" if none apply

This directive indicates that the extraction should:

Recognize the pet type at a high level (dog, cat, bird).
Classify common variations (e.g., “puppy” as “dog”).
Default to “not available” if no clear pet type is identified.

Returning to our initial code snippet, we use the $pet_type variable to customize responses, enabling the bot to offer specific advice based on whether the user has a dog, bird, or cat.

Next, we will expand on this example to integrate a Retrieval Augmented Generation (RAG) workflow, enhancing our assistant’s capabilities to recommend specific products tailored to the user’s inputs.

Bring Your Data into the Conversation

Incorporating advanced AI capabilities using a model like the Llama 3.1 8B instruct model requires more than just managing the tone and flow of conversations; it necessitates controlling the data the model accesses to respond to user queries. A common technique to achieve this is Retrieval Augmented Generation (RAG). This method involves searching a semantic database for content relevant to a user’s request and incorporating those findings into the model’s response context.

The typical approach uses an embedding model, which converts a sentence into a semantic numeric representation—referred to as a vector. These vectors are then stored in a vector database designed to efficiently search and retrieve closely related semantic information. For more information on this topic, please refer to Getting started with Amazon Titan Text Embeddings in Amazon Bedrock.

NeMo Guardrails simplifies this process: developers can store relevant content in a designated ‘kb’ folder. NeMo automatically reads this data, applies its internal embedding model and stores the vectors in an “Annoy” index, which functions as an in-memory vector database. However, this method might not scale well for extensive data sets typical in e-commerce environments. To address scalability, here are two solutions:

Custom Adapter or Connector: Implement your own extension of the EmbeddingsIndex base class. This allows you to customize storage, search and data embedding processes according to your specific requirements, whether local or remote. This integration makes sure that that relevant information remains in the conversational context throughout the user interaction, though it does not allow for precise control over when or how the information is used. For example:

class CustomVectorDatbaseConnector(EmbeddingsIndex):
    @property
    def embedding_size(self):
        return 768

    async def add_item(self, item: IndexItem):
        """Adds a new item to the index."""
        pass  # Implementation needed

    async def add_items(self, items: List[IndexItem]):
        """Adds multiple items to the index."""
        pass  # Implementation needed

    async def search(self, text: str, max_results: int) -> List[IndexItem]:
        """
        Searches for items in the index that are most similar to the provided text.
        """

Retrieval Augmented Generation via Function Call: Define a function that handles the retrieval process using your preferred provider and technique. This function can directly update the conversational context with relevant data, ensuring that the AI can consider this information in its responses. For example:

async def rag(context: dict, llm: BaseLLM, search: str) -> ActionResult:
   """ retrieval component of a retrieval augmented generation function
       you can directly manipulate the NeMo's in-context knowledge here using:
       context_updates = {"relevant_chunks": json.dumps(data)}
   """
   return "true"

In the conversation rail’s flow, use variables and function calls to precisely manage searches and the integration of results:

#describe the characteristics of a product that would satisfy the user
$product_characteristics = ...
$is_success = execute rag(search=$product_characteristics)

These methods offer different levels of flexibility and control, making them suitable for various applications depending on the complexity of your system. In the next section, we will see how these techniques are applied in a more complex scenario to further enhance the capabilities of our AI assistant.

Complete Example with Variables, Retrievers and Conversation Flows

Scenario Overview

Let’s explore a complex implementation scenario with NeMo Guardrails interacting with multiple tools to drive specific business outcomes. We’ll keep the focus on the pet store e-commerce site that is being upgraded with a conversational sales agent. This agent is integrated directly into the search field at the top of the page. For instance, when a user searches for “double coat shampoo,” the results page displays several products and a chat window automatically engages the user by processing the search terms.

Detailed Conversation Flow

As the user interaction begins, the AI processes the input from the search field:

define flow answer about pet products
  user express pet products needs
  #extract the specific pet type at very high level if available, like dog, cat, bird if available or "pet" if none apply
  $pet_type = ...
  #extract the specific breed of the pet if available or "not available" if none apply
  $pet_breed = ...
  if $pet_breed == "not available"
    bot ask informations about the pet breed

Output: "Would you be able to share the type of your dog breed?"

This initiates the engine’s recognition of the user’s intent to inquire about pet products. Here, the chatbot uses variables to try and extract the type and breed of the pet. If the breed isn’t immediately available from the input, the bot requests further clarification.

Retrieval and Response Generation

If the user responds with the breed (e.g., “It’s a Labradoodle”), the chatbot proceeds to tailor its search for relevant products:

else:
    #describe the characteristics of a product that would satisfy the user
    $product_characteristics = ... 
    #call our previously defined retrieval function   
    $results = execute rag(search=$product_characteristics)
    #write a text message describing in a list all the products in the addional context writing their name and their ASIN number 
    #and any features that relate to the user needs, offering to put one in the cart, in a single paragraph
    $product_message = ...
    bot $product_message

Output: We found several shampoos for Labradoodles: [Product List]. Would you like to add any of these to your cart?

The chatbot uses the extracted variables to refine product search criteria, then retrieves relevant items using an embedded retrieval function. It formats this information into a user-friendly message, listing available products and offering further actions.

Advancing the Sale

If the user expresses a desire to purchase a product (“I’d like to buy the second option from the list”), the chatbot transitions to processing the order:

define flow user wants product
  user express intent to buy
  #the value must be the user city of residence or "unknown"
  $user_city = ...
  #the value must be the user full street address or "unknown"
  $user_address_street = ...
  if $user_city == "unknown"
    if $user_address_street == "unknown"
      bot ask address

Output: "Great choice! To finalize your order, could you please provide your full shipping address?"

At this point, we wouldn’t have the shipping information so the bot ask for it. However, if this was a known customer, the data could be injected into the conversation from other sources. For example, if the user is authenticated and has made previous orders, their shipping address can be retrieved from the user profile database and automatically populated within the conversation flow. Then the model would just have asked for confirmation about the purchase, skipping the part about asking for shipping information.

Completing the Sale

Once our variables are filled and we have enough information to process the order, we can transition the conversation naturally into a sales motion and have the bot finalize the order:

else:
  #the value must be asin of the product the user intend to buy
  $product_asin = ...
  $cart = execute add_order(city=$user_city, street=$user_address_street, product=$product_asin)
  bot $cart

Output: "Success"

In this example, we’ve implemented a mock function called add_order to simulate a backend service call. This function verifies the address and places the chosen product into the user’s session cart. You can capture the return string from this function on the client side and take further action, for instance, if it indicates ‘Success,’ you can then run some JavaScript to display the filled cart to the user. This will show the cart with the item, pre-entered shipping details and a ready checkout button within the user interface, closing the sales loop experience for the user and tying together the conversational interface with the shopping cart and purchasing flow.

Maintaining Conversation Integrity

During this interaction, the NeMo Guardrails framework maintains the conversation within the boundaries set by the Colang configuration. For example, if the user deviates with a question such as ‘What’s the weather like today?’, NeMo Guardrails will classify this as part of a refusal flow and outside the relevant topics of ordering pet supplies. It will then tactfully declines to address the unrelated query and steers the discussion back towards selecting and ordering products, replying with a standard response like, ‘I’m afraid I can’t help with weather information, but let’s continue with your pet supplies order.’ as defined in Colang.

Clean Up

When using Amazon SageMaker JumpStart you’re deploying the selected models using on-demand GPU instances managed by Amazon SageMaker. These instances are billed per second and it’s important to optimize your costs by turning off the endpoint when not needed.

To clean up your resources, please ensure that you run the clean up cells in the three notebooks that you used. Make sure you delete the appropriate model and endpoints by executing similar cells:

llm_model.delete_model()
llm_predictor.delete_predictor()

Please note that in the third notebook, you additionally need to delete the embedding endpoints:

embedding_model.delete_model()
embedding_predictor.delete_predictor()

Additionally, you can make sure that you have deleted the appropriate resources manually by completing the following steps:

Delete the model artifacts:
1. On the Amazon SageMaker console, choose Models under Inference in the navigation pane.
2. Please ensure you do not have llm-model and embedding-model artifacts.
3. To delete these artifacts, choose the appropriate models and click Delete under Actions dropdown menu.
Delete endpoint configurations:
1. On the Amazon SageMaker console, choose Endpoint configuration under Inference in the navigation pane.
2. Please ensure you do not have llm-model and embedding-model endpoint configuration.
3. To delete these configurations, choose the appropriate endpoint configurations and click Delete under Actions dropdown menu.
Delete the endpoints:
1. On the Amazon SageMaker console, choose Endpoints under Inference in the navigation pane.
2. Please ensure you do not have llm-model and embedding-model endpoints running.
3. To delete these endpoints, choose the appropriate model endpoint names and click Delete under Actions dropdown menu.

Best Practices and Considerations

When integrating NeMo Guardrails with SageMaker JumpStart, it’s important to consider AI governance frameworks and security best practices to ensure responsible AI deployment. While this blog focuses on showcasing the core functionality and capabilities of NeMo Guardrails, security aspects are beyond its scope.

For further guidance, please explore:

Conclusion

Integrating NeMo Guardrails with Large Language Models (LLMs) is a powerful step forward in deploying AI in customer-facing applications. The example of AnyCompany Pet Supplies illustrates how these technologies can enhance customer interactions while handling refusal and guiding the conversation toward the implemented outcomes. Looking forward, maintaining this balance of innovation and responsibility will be key to realizing the full potential of AI in various industries. This journey towards ethical AI deployment is crucial for building sustainable, trust-based relationships with customers and shaping a future where technology aligns seamlessly with human values.

Next Steps

You can find the examples used within this article via this link.

We encourage you to explore and implement NeMo Guardrails to enhance your own conversational AI solutions. By leveraging the guardrails and techniques demonstrated in this post, you can quickly constraint LLMs to drive tailored and effective results for your use case.

About the Authors

Georgi Botsihhin is a Startup Solutions Architect at Amazon Web Services (AWS), based in the United Kingdom. He helps customers design and optimize applications on AWS, with a strong interest in AI/ML technology. Georgi is part of the Machine Learning Technical Field Community (TFC) at AWS. In his free time, he enjoys staying active through sports and taking long walks with his dog.

Lorenzo Boccaccia is a Startup Solutions Architect at Amazon Web Services (AWS), based in Spain. He helps startups in creating cost-effective, scalable solutions for their workloads running on AWS, with a focus on containers and EKS. Lorenzo is passionate about Generative AI and is is a certified AWS Solutions Architect Professional, Machine Learning Specialist and part of the Containers TFC. In his free time, he can be found online taking part sim racing leagues.