Read graphs, diagrams, tables, and scanned pages using multimodal prompts in Amazon Bedrock

Large language models (LLMs) have come a long way from being able to read only text to now being able to read and understand graphs, diagrams, tables, and images. In this post, we discuss how to use LLMs from Amazon Bedrock to not only extract text, but also understand information available in images.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. It also provides a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview

In this post, we demonstrate how to use models on Amazon Bedrock to retrieve information from images, tables, and scanned documents. We provide the following examples:

  • Performing object classification and object detection tasks
  • Reading and querying graphs
  • Reading flowcharts and architecture diagrams (such as an AWS architecture diagram) and converting them to text

We also explore specific use cases where Amazon Bedrock can help you generate information from an image:

  • Identify the action in the image
  • Name a product and extract metadata to generate a tagline and description
  • Generate a real estate listing for a property
  • Generate a recipe from an image of a dish

Perform object classification and object detection tasks using an LLM

At the time of writing, the best algorithms for object classification and detection are based on convolutional neural networks. However, LLMs such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock can also perform these tasks using zero-shot prompting, a technique in which the model is given a task without any examples or task-specific training. In the following example, we show how to classify an image based on the classes that we pass in the prompt. The following image is a sample Australian Medicare card.

We use the following prompt with Anthropic’s Claude 3 Sonnet to classify the image:

This is one of the documents that a customer seeking a loan has uploaded. The image can belong to one of the following four classes : ID, bank statement, application form or utility bill. Identify what the class of this document is and give me just the class name. I don’t need any other information for now

We get the following response from the LLM:

Based on the image provided, the class of this document appears to be an ID card or identification document. The image shows what seems to be a Medicare card or similar government-issued ID card displaying names and an identification number.

It has correctly identified the class; however, a response in JSON format would be more straightforward to process programmatically. We add the following to the end of the prompt:

provide the response in json format with the key as “class” and the value as the class of the document

We get the following response:

{
"class": "ID"
}

You can now read the JSON response using a library of your choice, such as the Python JSON library.
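
The following is a minimal sketch of this flow using the Amazon Bedrock Runtime API with boto3; the file name is illustrative, and it assumes your AWS credentials and Region have access to Anthropic’s Claude 3 Sonnet and that the model returns only the JSON object:

import base64
import json

import boto3

# Classify a document image with Anthropic's Claude 3 Sonnet on Amazon Bedrock
# and parse the JSON response. "document.png" is an illustrative file name.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "This is one of the documents that a customer seeking a loan has uploaded. "
    "The image can belong to one of the following four classes: ID, bank statement, "
    "application form or utility bill. Identify what the class of this document is. "
    'Provide the response in JSON format with the key as "class" and the value as '
    "the class of the document."
)

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
                },
                {"type": "text", "text": prompt},
            ],
        }
    ],
}

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
model_output = json.loads(response["body"].read())["content"][0]["text"]
document_class = json.loads(model_output)["class"]  # assumes the model returns only the JSON object
print(document_class)  # for example, "ID"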

In this next example, we use Meta’s Llama 3.2 90B Vision model to identify an object in an image. The following image is of a gearbox.

We use the following prompt to identify the gearbox from the image:

identify the entire machine part in this image. Give just the name of the machine part in a JSON format with the key as “name” and the value as the name of the machine part

The following screenshot shows the prompt and the answer the model gives.
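
If you prefer to call the model programmatically instead of through the console, the following is a minimal sketch using the Amazon Bedrock Converse API, which accepts image bytes directly; the file name is illustrative, and the model ID shown is the US cross-Region inference profile for Llama 3.2 90B Vision, which may differ in your account:

import boto3

# Identify the machine part with Meta's Llama 3.2 90B Vision using the Bedrock Converse API.
# The model ID below is an assumption -- check the model catalog in your account.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("gearbox.png", "rb") as f:  # illustrative file name
    image_bytes = f.read()

prompt = (
    "identify the entire machine part in this image. Give just the name of the machine part "
    'in a JSON format with the key as "name" and the value as the name of the machine part'
)

response = bedrock_runtime.converse(
    modelId="us.meta.llama3-2-90b-instruct-v1:0",
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.0},
)
print(response["output"]["message"]["content"][0]["text"])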

Read and query graphs

In this next example, we show how to read a graph and then interpolate and extrapolate values in the graph. The following graph shows the population growth for a few countries (source).

We pass this image to Anthropic’s Claude 3 Sonnet along with the following prompt:

From this graph create a table for the population of India, with the first column as the year and the second column as the population in millions. Create a row for every 5 years starting from 1950 to 2025.

We ask it to create a row for every 5 years, so the model must interpolate values. We also ask it to extend the table until 2025, and because the data is only until 2021, the model will have to extrapolate the values. We get the following response:

Here is a table for the population of India, with the first column as the year and the second column as the population in millions, for every 5 years starting from 1950 to 2025:

| Year | Population (in millions) |
|------|--------------------------|
| 1950 | 357.6 |
| 1955 | 390.5 |
| 1960 | 427.8 |
| 1965 | 468.0 |
| 1970 | 510.9 |
| 1975 | 556.9 |
| 1980 | 605.7 |
| 1985 | 658.0 |
| 1990 | 713.8 |
| 1995 | 773.0 |
| 2000 | 835.6 |
| 2005 | 901.7 |
| 2010 | 971.5 |
| 2015 | 1045.0 |
| 2020 | 1122.0 |
| 2025 | 1202.4 |

The extrapolation may not be accurate, and this is where a more traditional machine learning (ML) approach would be helpful.
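
To illustrate what a more traditional approach could look like, the following minimal sketch fits a simple quadratic trend to the values the model produced for 1950–2020 and extrapolates it to 2025; the choice of a quadratic fit is an illustrative assumption, not a recommendation:

import numpy as np

# Populations (in millions) for 1950-2020 in 5-year steps, taken from the table above.
years = np.arange(1950, 2021, 5)
population = np.array([357.6, 390.5, 427.8, 468.0, 510.9, 556.9, 605.7, 658.0,
                       713.8, 773.0, 835.6, 901.7, 971.5, 1045.0, 1122.0])

# Fit a quadratic trend and extrapolate to 2025.
trend = np.poly1d(np.polyfit(years, population, deg=2))
print(f"Extrapolated 2025 population: {trend(2025):.1f} million")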

You can also ask the model to combine its knowledge with the knowledge from the graph. The following prompt asks for the population for countries in this graph for 2021, along with the country’s capital:

Can you create a table for 2021 with the first column as the country name, the second column as population in millions and third column as the capital of that country?

The following screenshot shows the response.

The LLM has filled in the table based on the graph and its own knowledge about the capital of each country.

You can try something harder as well. The following image is a weather map of Australia.

You can ask the LLM a specific question, such as the following:

This is a precipitation map for Australia. Is it raining in Sydney right now?

You don’t have to tell the LLM where Sydney is or that the image is for rainfall.

We get the following response:

Based on the precipitation map shown, there does not appear to be any significant rainfall occurring over the Sydney area, which is located on the eastern coast of the Australian continent. The map uses contour lines and colors to represent precipitation levels, but the region around Sydney has no shaded areas or contour lines indicating rainfall at this particular timeframe depicted by the map. Therefore, the available data suggests that it is likely not raining in Sydney at the time represented by this precipitation map.

There are a few interesting observations from this use case:

  • Anthropic’s Claude 3 Sonnet can read maps
  • The model is able to read overlays on a map
  • Phrases such as “region around Sydney” show that the model doesn’t need to work with exact information but can use an approximation, just as humans do

Read flowcharts and architecture diagrams

You can read AWS architecture diagrams using the Meta Llama 3.2 90B Vision model. The following is an example architecture diagram for modernizing applications with microservices using Amazon Elastic Kubernetes Service (Amazon EKS).

We use the following prompt to read this diagram:

The steps in this diagram are explained using numbers 1 to 11. The numbers are shown in blue squares. Can you explain the diagram using the numbers 1 to 11 and an explanation of what happens at each of those steps?

The following screenshot shows the response that we get from the LLM (truncated for brevity).

Furthermore, you can use this diagram to ask follow-up questions:

Why do we need a network load balancer in this architecture

The following screenshot shows the response from the model.

As you can see, the LLM acts as your advisor now for questions related to this architecture.

However, generative AI isn’t limited to software engineering. You can also read diagrams and images from fields such as engineering, architecture, and healthcare.

For this example, we use a process diagram taken from Wikipedia.

To find out what this process diagram is for and to describe the process, you can use the following prompt:

Can you name the process shown in the example. Also describe the process using numbered steps and go from left to right.

The following screenshot shows the response.

The LLM has done a good job figuring out that the diagram is for the Haber process to produce ammonia. It also describes the steps of the process.

Identify actions in an image

You can identify and classify the actions taking place in the image. The model’s ability to accurately identify actions is further enhanced by its capacity to analyze contextual information, such as the surrounding objects, environments, and the positions of individuals or entities within the image. By combining these visual cues and contextual elements, Anthropic’s Claude 3 Sonnet can make informed decisions about the nature of the actions being performed, providing a comprehensive understanding of the scene depicted in the image.

The following is an example where we can not only classify the action of the player but also provide feedback comparing the player’s action to that of a professional player.

We provide the model the following image of a tennis player. The image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.

The following screenshot shows the prompt and the model’s response.

Name a product and extract metadata to generate a tagline and description

In the field of marketing and product development, coming up with a perfect product name and creative promotional content can be challenging. With the image-to-text capabilities of Anthropic’s Claude 3 Sonnet, you can upload the image of the product and the model can generate a unique product name and craft taglines to suit the target audience.

For this example, we provide the following image of a sneaker to the model (the image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock).

The following screenshot shows the prompt.

The following screenshot shows the model’s response.

In the retail and ecommerce domain, you can also use Anthropic’s Claude 3 Sonnet to extract detailed product information from the images for inventory management.

For example, we use the prompt shown in the following screenshot.

The following screenshot shows the model’s response.

Create a real estate listing for a property

You can upload images of a property floor plan and pictures of the interior and exterior of the house, and then get a description to use in a real estate listing. This helps increase the creativity and productivity of real estate agents while advertising properties. Architects could also use this mechanism to explain the floor plan to customers.

We provide the following example floor plan to the model.

The following screenshot shows the prompt.

The following screenshot shows the response.

Generate a recipe from the image of a dish

You can also use Anthropic’s Claude 3 Sonnet to create a recipe based on a picture of a dish. However, out of the box, the model can identify only the dishes included in the dataset used for model training. Factors such as ingredient substitutions, cooking techniques, and cultural variations in cuisine can pose significant challenges.

For example, we provide the following image of a cake to the model to extract the recipe. The image was generated using the Stability AI model (SDXL 1.0) on Amazon Bedrock.

The following screenshot shows the prompt.

The model successfully identifies the dish as Black Forest cake and creates a detailed recipe. The recipe may not create the exact cake shown in the figure, but it does get close to a Black Forest cake.

Conclusion

FMs such as Anthropic’s Claude 3 Sonnet and Meta Llama 3.2 90B Vision model, available on Amazon Bedrock, have demonstrated impressive capabilities in image processing. These FMs unlock a range of powerful features, including image classification, optical character recognition (OCR), and the ability to interpret complex visuals such as graphs and architectural blueprints. Such innovations offer novel solutions to challenging problems, from searching through scanned document archives to generating image-inspired text content and converting visual information into structured data.

To start using these capabilities for your specific needs, we recommend exploring the chat playground feature on Amazon Bedrock, which allows you to interact with and extract information from images.


About the Authors

Mithil Shah is a Principal AI/ML Solution Architect at Amazon Web Services. He helps commercial and public sector customers use AI/ML to achieve their business outcomes. He is currently helping customers build chatbots and search functionality using LLM agents and RAG.

Santosh Kulkarni is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML. He is passionate about generative AI and is helping customers unlock business potential and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

How Crexi achieved ML models deployment on AWS at scale and boosted efficiency

This post is co-written with Isaac Smothers and James Healy-Mirkovich from Crexi. 

With the current demand for AI and machine learning (AI/ML) solutions, the processes to train and deploy models and scale inference are crucial to business success. Even though AI/ML, and especially generative AI, is progressing rapidly, machine learning operations (MLOps) tooling is continuously evolving to keep pace. Customers are looking for success stories about how best to adopt the culture and new operational solutions to support their data scientists. Solutions should be flexible to adopt, allow seamless integration with other systems, and provide a path to automate MLOps using AWS services and third-party tools, as we’ll explore in this post with Pulumi and Datadog. This framework helps achieve operational excellence not only in the DevOps space but also allows stakeholders to use tools such as infrastructure as code (IaC) automation and DevOps Research and Assessment (DORA) observability for their MLOps pipelines.

Commercial Real Estate Exchange, Inc. (Crexi), is a digital marketplace and platform designed to streamline commercial real estate transactions. It allows brokers to manage the entire process from listing to closing on one platform, including digital letters of intent, best and final offer negotiations, and transaction management tools. Its data and research features allow investors and other commercial real estate stakeholders to conduct due diligence and proactively connect with other professionals ahead of the transaction process.

In this post, we will review how Crexi achieved its business needs and developed a versatile and powerful framework for AI/ML pipeline creation and deployment. This customizable and scalable solution allows its ML models to be efficiently deployed and managed to meet diverse project requirements.

Datadog is a SaaS-based monitoring and analytics platform for cloud-scale applications that brings together data from servers, databases, tools, and services to present a unified view of your entire stack. It enables Dev and Ops teams to work collaboratively to avoid downtime, resolve performance problems, and help keep development and deployment cycles on schedule.

Pulumi’s modern infrastructure as code (IaC) platform empowers teams to manage cloud resources using their favorite languages including Python, JavaScript, TypeScript, Go, and C#. Pulumi’s open source SDK integrates with its free and commercial software as a service (SaaS) to simplify infrastructure provisioning, delivery, architecture, policy, and testing on a cloud.

Solution overview

Central to Crexi’s infrastructure are boilerplate AWS Lambda triggers that call Amazon SageMaker endpoints, executing any given model’s inference logic asynchronously. This modular approach supports complex pipeline pathways, with final results directed to Amazon Simple Storage Service (Amazon S3) and Amazon Data Firehose for seamless integration into other systems. One of the SageMaker endpoints also uses Amazon Textract, but any model can be used.

ML pipeline engineering requirements

The goal of the ML pipeline is to build a robust infrastructure for model deployments. The engineering requirements are:

  • Rapid deployment of ML models: Model pipeline deployments should be managed through a continuous integration and continuous deployment (CI/CD) infrastructure, facilitating model pipeline rollbacks, regression testing, and one-click deployments. This automated CI/CD process is used to automatically test and deploy pipeline changes, minimizing the risk of errors and downtime.
  • Distinct separation of concerns for production and development ML pipelines: This requirement prevents ongoing model experiments in the development environment from affecting the production environment, thereby maintaining the stability and reliability of the production models.
  • Model pipeline health monitoring: Health monitoring allows for proactive identification and resolution of potential issues in model pipelines before they impact downstream engineering teams and users.
  • Readily accessible models: Model pipelines should be accessible across engineering teams and straightforward to integrate into new and existing products.

The goal is to build reliable, efficient ML pipelines that can be used by other engineering teams with confidence.

Technical overview

The ML pipeline infrastructure is an amalgamation of various AWS products, designed to seamlessly invoke and retrieve output from ML models. This infrastructure is deployed using Pulumi, a modern IaC tool that allows Crexi to handle the orchestration of AWS products in a streamlined and efficient manner.

The AWS products managed by Pulumi in the infrastructure include AWS Lambda, Amazon SageMaker, Amazon S3, Amazon Data Firehose, and AWS Identity and Access Management (IAM).

To protect the robustness and reliability of the infrastructure, Crexi uses Datadog for pipeline log monitoring, which allows the team to keep a close eye on the pipeline’s performance and quickly identify and address issues that might arise.

Lastly, Crexi uses GitHub Actions to run Pulumi scripts in a CI/CD fashion for ML pipeline deploys, updates, and destroys. These GitHub Actions workflows keep the infrastructure reproducible and sufficiently hardened against code regression.

Pipeline as code

Pulumi-managed ML pipelines are coded as YAML files that data scientists can quickly create and deploy. Deploying IaC using YAML files that data scientists can write has three key advantages:

  • Increased efficiency and speed: A streamlined deployment process allows data scientists to write and deploy their own models. Enabling data scientists in this way reduces delivery time by not requiring additional data engineering or ops personnel (that is, it reduces cross-functional dependencies) for deployments.
  • Flexibility and customization: YAML files allow data scientists to specify the necessary configurations such as instance types, model images, and additional permissions. This level of customization helps the team to optimize the deployed models for specific use cases.
  • Simplicity and readability: YAML files are human-readable, facilitating the evaluation, review, and auditing of infrastructure and deployment configurations.

Implementation

Now, let’s look at the implementation details of the ML pipeline.

The pipeline contains three SageMaker endpoints named model-a, model-b, and model-c. Each endpoint is asynchronous and has a specified number of running instances. Each also has a specified Docker image to run the model hosted on the endpoint, a specified location of the model.tar.gz file that the endpoint will host, and a specified instance type to run the endpoint on. The model-b and model-c endpoints depend on the output from model-a.

The model-a endpoint has access to input Amazon S3 objects in the Crexi AWS account and depends on the crexi-model-input-dev bucket for input. Lastly, the model-c endpoint also has access to input S3 objects in the Crexi AWS account in addition to Amazon Textract.

After a new version of an input is uploaded to the crexi-model-input-dev S3 bucket, a Lambda function passes it to the model-a SageMaker endpoint. After results are ready and delivered to the model-a-model-output bucket, the relevant Lambda functions execute model-b and model-c SageMaker endpoints accordingly.
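
The following is a minimal sketch of what one of these boilerplate Lambda triggers could look like, assuming it is subscribed to S3 object-created events and the asynchronous endpoint is named model-a; the endpoint name, event handling, and content type are illustrative assumptions rather than Crexi’s actual code:

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    """Invoke the model-a asynchronous SageMaker endpoint for each new S3 object."""
    output_locations = []
    for record in event["Records"]:  # S3 object-created events
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = sagemaker_runtime.invoke_endpoint_async(
            EndpointName="model-a",                # illustrative endpoint name
            InputLocation=f"s3://{bucket}/{key}",  # the newly uploaded input object
            ContentType="application/json",
        )
        output_locations.append(response["OutputLocation"])
    return {"outputLocations": output_locations}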

The visualization that follows depicts the pipeline flow.

Solution architecture overview

To automate changes in the resources and new models, the Crexi team manages infrastructure using Pulumi and defines resources using YAML. SageMakerPipelineExample.yaml creates a stack of AWS resources that deploy service models to production. The AWS stack contains the necessary Lambda functions, S3 buckets, SageMaker endpoints, IAM permissions, and so on. As an example, the following is part of the YAML file that defines the SageMaker endpoints.

team: Mlops

identifier: SagemakerPipelineExample

data_dev: 
  buckets:
    - name: "crexi-model-storage-dev" 
      additionalWriters:
        - "arn:aws:iam::<aws_account_id>:role/DataDevelopers"
    - name: "crexi-model-input-dev"

sagemakerPipelines:
  - name: "Infrared"
    models:
      - name: model-a 
        async: true 
        count: 4
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference-with-t
        s3Path: "crexi-model-storage-dev/model-a.tar.gz"
        access:
          filesCrexiAccess: true
        instanceType: ml.c5.4xlarge
        dependsOn:
          s3Buckets:
            - bucketName: "crexi-model-input-dev"
              prefix: "manifests/"
              suffix: ".json"
      - name: model-b 
        async: true 
        count: 1
        instanceType: ml.m5.xlarge
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-b.tar.gz"
        dependsOn: 
          models:
            - "model-a"
      - name: model-c 
        async: true 
        count: 1
        instanceType: ml.m5.large
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-c.tar.gz" 
        access:
          filesCrexiAccess: true
        textract: true 
        dependsOn:
          models:
            - "model-a"

Pipeline deployment

ML pipelines can be quickly deployed, modified, and destroyed using a continuous delivery GitHub workflow named Deploy self-service infrastructure that has been set up in a Crexi repository. After new models are tested and everything is ready in Crexi’s repository, the GitHub workflow triggers deployment using Pulumi and a YAML file with the resources defined in the previous section of this post.

The Deploy self-service infrastructure workflow takes four arguments:

  1. branch
    • Description: GitHub branch to source the pipeline YAML file from
    • Input (options)
      • GitHub branch (for example, main)
  2. action
    • Description: Specifies the type of Pulumi action to run
    • Input (options):
      • up: Create or update resources
      • destroy: Tear down resources
      • preview: Preview changes without applying them
  3. environment
    • Description: Defines the environment against which the action will be executed
    • Input (options):
      • data_dev: Development environment
      • data_prod: Production environment
  4. YAML
    • Description: Path to the infrastructure YAML file that defines the resources to be managed
    • Input (string)
      • Filename of SageMaker model pipeline YAML file to deploy, modify, or destroy

The following screenshot shows GitHub workflow parameters and history.

GitHub workflow

Pipeline monitoring

Pipeline monitoring for Pulumi-deployed ML pipelines uses a comprehensive Datadog dashboard (shown in the following figure) that offers extensive logging capabilities and visualizations. Key metrics and logs are collected and visualized to facilitate real-time monitoring and historical analysis. Pipeline monitoring has dramatically simplified the assessment of a given pipeline’s health status, allowing for the rapid detection of potential bottlenecks and bugs, thereby improving operation of the ML pipelines.

Monitoring dashboard

The dashboard offers several core features:

  • Error tracking: The dashboard tracks 4xx and 5xx errors in aggregate, correlating errors to specific logged events within the model pipelines, which aids in quick and effective diagnosis by providing insights into the frequency and distribution of these errors.
  • Invocation metrics for SageMaker models: The dashboard aggregates data on instance resource utilization, invocation latency, invocation failures, and endpoint backlog for the SageMaker models deployed through Pulumi, giving a detailed view of performance bottlenecks and latencies.
  • Lambda function monitoring: The dashboard monitors the success and failure rates of invocations for triggerable Lambda functions, thus delivering a holistic view of the system’s performance.

Conclusion

The ML pipeline deployment framework explored here offers a robust, scalable, and highly customizable solution for AI/ML needs and addresses Crexi’s requirements. With the power to rapidly build and deploy pipelines, experiments and new ML techniques can be tested at scale with minimal effort. The framework separates the model development workflow from production deployments and allows the team to proactively monitor for issues. Additionally, routing model outputs to S3 supports seamless integration with Snowflake, facilitating storage and accessibility of data. This interconnected ecosystem does more than just improve current operations; it lays the groundwork for continuous innovation. The data housed in Snowflake serves as a rich resource for training new models that can be deployed quickly with new ML pipelines, enabling a cycle of improvement and experimentation that propels Crexi’s projects forward.

If you have any thoughts or questions, leave them in the comments section.


Isaac Smothers is a Senior DevOps Engineer at Crexi. Isaac focuses on automating the creation and maintenance of robust, secure cloud infrastructure with built-in observability. Based in San Luis Obispo, he is passionate about providing self-service solutions that enable developers to build, configure, and manage their services independently, without requiring cloud or DevOps expertise. In his free time, he enjoys hiking, video editing, and gaming.

James Healy-Mirkovich is a principal data scientist at Crexi in Los Angeles. Passionate about making data actionable and impactful, he develops and deploys customer-facing AI/ML solutions and collaborates with product teams to explore the possibilities of AI/ML. Outside work, he unwinds by playing guitar, traveling, and enjoying music and movies.

Marina Novikova is a Senior Partner Solution Architect at AWS. Marina works on the technical co-enablement of AWS ISV Partners in the DevOps and Data and Analytics segments to enrich partner solutions and solve complex challenges for AWS customers. Outside of work, Marina spends time climbing high peaks around the world.

Deploy Meta Llama 3.1 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

We’re excited to announce the availability of Meta Llama 3.1 8B and 70B inference support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Meta Llama 3.1 multilingual large language models (LLMs) are a collection of pre-trained and instruction tuned generative models. Trainium and Inferentia, enabled by the AWS Neuron software development kit (SDK), offer high performance and lower the cost of deploying Meta Llama 3.1 by up to 50%.

In this post, we demonstrate how to deploy Meta Llama 3.1 on Trainium and Inferentia instances in SageMaker JumpStart.

What is the Meta Llama 3.1 family?

The Meta Llama 3.1 multilingual LLMs are a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128,000) and are optimized for inference with support for grouped query attention (GQA). The Meta Llama 3.1 instruction tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

At its core, Meta Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Architecturally, the core LLM for Meta Llama 3 and Meta Llama 3.1 is the same dense architecture.

Meta Llama 3.1 also offers instruct variants, and the instruct model is fine-tuned for tool use. The model has been trained to generate calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, the model also supports zero-shot tool use.

The responsible use guide from Meta can assist you in additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

What is SageMaker JumpStart?

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models are provisioned on dedicated SageMaker Inference instances, including Trainium and Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This provides data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Solution overview

SageMaker JumpStart provides FMs through two primary interfaces: Amazon SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive interactive development environment (IDE) that offers a unified, web-based interface for performing all aspects of the machine learning (ML) development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart on the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use JumpStart models. This approach allows for greater flexibility and integration with existing AI and ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI and ML development efforts, regardless of your preferred interface or workflow.

In the following sections, we demonstrate how to deploy Meta Llama 3.1 on Trainium instances using SageMaker JumpStart in SageMaker Studio for a one-click deployment and the Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

  • An AWS account that will contain all your AWS resources.
  • An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
  • Access to SageMaker Studio, a SageMaker notebook instance, or an IDE such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
  • One instance of ml.trn1.32xlarge for SageMaker hosting.

Deploy Meta Llama 3.1 using the SageMaker JumpStart UI

From the SageMaker JumpStart landing page, you can browse for models, notebooks, and other resources. You can find Meta Llama 3.1 Neuron models by searching by “3.1” or find them in the Meta hub.

If you don’t see Meta Llama 3.1 Neuron models in SageMaker Studio Classic, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

In SageMaker JumpStart, you can access the Meta Llama 3.1 Neuron models listed in the following table.

| Model Card | Description | Key Capabilities |
|---|---|---|
| Meta Llama 3.1 8B Neuron | Llama-3.1-8B is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation supported in 10 languages. | Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents. |
| Meta Llama 3.1 8B Instruct Neuron | Llama-3.1-8B-Instruct is an update to Meta-Llama-3-8B-Instruct, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities. | Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation. |
| Meta Llama 3.1 70B Neuron | Llama-3.1-70B is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation in 10 languages. | Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents. |
| Meta Llama 3.1 70B Instruct Neuron | Llama-3.1-70B-Instruct is an update to Meta-Llama-3-70B-Instruct, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities. | Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation. |

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it.

You can also find two buttons on the model details page, Deploy and Preview notebooks, which help you use the model.

When you choose Deploy, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.

When you acknowledge the terms and choose Deploy, model deployment will start.

Deploy Meta Llama 3.1 using the Python SDK

Alternatively, you can deploy through the example notebook available from the model page by choosing Preview notebooks. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. For example, you can deploy a Meta Llama 3.1 70B Instruct model through SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id = "meta-textgenerationneuron-llama-3-1-70b-instruct") 
predictor = model.deploy(accept_eula=True)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "The color of the sky is blue but sometimes it can also be ",
    "parameters": {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}
response = predictor.predict(payload)

The following table lists all the Meta Llama models available in SageMaker JumpStart, along with the model_id, default instance type, and supported instance types for each model.

| Model Card | Model ID | Default Instance Type | Supported Instance Types |
|---|---|---|---|
| Meta Llama 3.1 8B Neuron | meta-textgenerationneuron-llama-3-1-8b | ml.inf2.48xlarge | ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Meta Llama 3.1 8B Instruct Neuron | meta-textgenerationneuron-llama-3-1-8b-instruct | ml.inf2.48xlarge | ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Meta Llama 3.1 70B Neuron | meta-textgenerationneuron-llama-3-1-70b | ml.trn1.32xlarge | ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge |
| Meta Llama 3.1 70B Instruct Neuron | meta-textgenerationneuron-llama-3-1-70b-instruct | ml.trn1.32xlarge | ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge |

If you want more control of the deployment configurations, such as context length, tensor parallel degree, and maximum rolling batch size, you can modify them using environment variables. The underlying Deep Learning Container (DLC) of the deployment is the Large Model Inference (LMI) NeuronX DLC. Refer to the LMI user guide for the supported environment variables.

SageMaker JumpStart has pre-compiled Neuron graphs for a variety of configurations for the preceding parameters to avoid runtime compilation. The configurations of pre-compiled graphs are listed in the following tables. As long as the environment variables fall into one of the following configurations, compilation of Neuron graphs will be skipped.

Meta Llama 3.1 8B and Meta Llama 3.1 8B Instruct

| OPTION_N_POSITIONS | OPTION_MAX_ROLLING_BATCH_SIZE | OPTION_TENSOR_PARALLEL_DEGREE | OPTION_DTYPE |
|---|---|---|---|
| 8192 | 8 | 2 | bf16 |
| 8192 | 8 | 4 | bf16 |
| 8192 | 8 | 8 | bf16 |
| 8192 | 8 | 12 | bf16 |
| 8192 | 8 | 24 | bf16 |
| 8192 | 8 | 32 | bf16 |

Meta Llama 3.1 70B and Meta Llama 3.1 70B Instruct

| OPTION_N_POSITIONS | OPTION_MAX_ROLLING_BATCH_SIZE | OPTION_TENSOR_PARALLEL_DEGREE | OPTION_DTYPE |
|---|---|---|---|
| 8192 | 8 | 24 | bf16 |
| 8192 | 8 | 32 | bf16 |

The following is an example of deploying Meta Llama 3.1 70B Instruct and setting all the available configurations:

from sagemaker.jumpstart.model import JumpStartModel 

model_id = "meta-textgenerationneuron-llama-3-1-70b-instruct"

model = JumpStartModel( 
    model_id=model_id, 
    env={
        "OPTION_DTYPE": "bf16",
        "OPTION_N_POSITIONS": "8192",
        "OPTION_TENSOR_PARALLEL_DEGREE": "24",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
     }, 
     instance_type="ml.trn1.32xlarge"
) 

pretrained_predictor = model.deploy(accept_eula=True)

Now that you have deployed the Meta Llama 3.1 70B Instruct model, you can run inference with it by invoking the endpoint. The following code snippet demonstrates using the supported inference parameters to control text generation:

payload = {
    'inputs': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
    'parameters': {
        'max_new_tokens': 256,
        'top_p': 0.9,
        'temperature': 0.6,
        'stop': '<|eot_id|>'
    }
}

response = pretrained_predictor.predict(payload)

We get the following output:

{'generated_text': "Eiffel's iron lace\nRiver Seine's gentle flow by\nMontmartre's charm calls<|eot_id|>"}

For more information on the parameters in the payload, refer to Parameters.

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

pretrained_predictor.delete_predictor()

Conclusion

The deployment of Meta Llama 3.1 Neuron models on SageMaker demonstrates a significant advancement in managing and optimizing large-scale generative AI models, with costs reduced by up to 50% compared to GPU-based instances. These models, including variants like Meta Llama 3.1 8B and 70B, use Neuron for efficient inference on Inferentia and Trainium based instances, enhancing their performance and scalability.

The ability to deploy these models through the SageMaker JumpStart UI and Python SDK offers flexibility and ease of use. The Neuron SDK, with its support for popular ML frameworks and high-performance capabilities, enables efficient handling of these large models.

For more information on deploying and fine-tuning pre-trained Meta Llama 3.1 models on GPU-based instances, refer to Llama 3.1 models are now available in Amazon SageMaker JumpStart and Fine-tune Meta Llama 3.1 models for generative AI inference using Amazon SageMaker JumpStart.


About the authors

Sharon Yu is a Software Development Engineer with Amazon SageMaker based in New York City.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

AWS achieves ISO/IEC 42001:2023 Artificial Intelligence Management System accredited certification

Amazon Web Services (AWS) is excited to be the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering: Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines requirements and controls for organizations to promote the responsible development and use of AI systems.

Responsible AI is a long-standing commitment at AWS. From the outset, AWS has prioritized responsible AI innovation and developed rigorous methodologies to build and operate our AI services with consideration for fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency.

AWS is an active stakeholder working with global standard-setting organizations to develop guidelines that play an important role in our industry by improving clarity, definitions and scope, establishing benchmarks for responsible AI practices, and focusing industry efforts on effective options for addressing risk. Our goal is to contribute to and improve AI standards across several critical areas, including risk management, data quality, unwanted bias mitigation, and explainability.

Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. This independent validation gives our customers further assurance of AWS’s commitment to responsible AI and of their own ability to build and operate AI applications responsibly using AWS services.

“At Snowflake, delivering AI capabilities to our customers is a top priority. Our product teams need to build and deploy AI responsibly, and have to depend upon our suppliers to do the same, despite the technical complexity. This is why ISO 42001 is important to us. Having ISO 42001 certification means a company has implemented a thoughtful AI management system. Knowing that AWS has ISO 42001 certified services gives us confidence in their commitment to the responsible development and deployment of their services, and builds trust with our own customers’ confidence in our products,” said Tim Tutt, VP US Public Sector, Snowflake.

An accredited certification, like ISO/IEC 42001, is issued by a certification body that has been recognized by a national or international accreditation authority. This demonstrates that the certification is credible, trustworthy, and based on independent verification. In this case, Schellman Compliance, LLC, an ISO certification body accredited by the ANSI National Accreditation Board (ANAB), granted the certification.

How 123RF saved over 90% of their translation costs by switching to Amazon Bedrock

In the rapidly evolving digital content industry, multilingual accessibility is crucial for global reach and user engagement. 123RF, a leading provider of royalty-free digital content, is an online resource for creative assets, including AI-generated images from text. In 2023, they used Amazon OpenSearch Service to improve discovery of images by using vector-based semantic search. Building on this success, they have now implemented Amazon Bedrock and Anthropic’s Claude 3 Haiku to improve their content moderation and speed up content translation, further enhancing their global reach and efficiency.

Although the company achieved significant success among English-speaking users with its generative AI-based semantic search tool, it faced content discovery challenges in 15 other languages because of English-only titles and keywords. The cost of using Google Translate for continuous translations was prohibitive, and other models such as Anthropic’s Claude Sonnet and OpenAI GPT-4o weren’t cost-effective. Although OpenAI GPT-3.5 met cost criteria, it struggled with consistent output quality. This prompted 123RF to search for a more reliable and affordable solution to enhance multilingual content discovery.

This post explores how 123RF used Amazon Bedrock, Anthropic’s Claude 3 Haiku, and a vector store to efficiently translate content metadata, significantly reduce costs, and improve their global content discovery capabilities.

The challenge: Balancing quality and cost in mass translation

After implementing generative AI-based semantic search and text-to-image generation, 123RF saw significant traction among English-speaking users. This success, however, cast a harsh light on a critical gap in their global strategy: their vast library of digital assets—comprising millions of images, audio files, and motion graphics—needed a similar overhaul for non-English speaking users.

The crux of the problem lay in the nature of their content. User-generated titles, keywords, and descriptions—the lifeblood of searchability in the digital asset world—were predominantly in English. To truly serve a global audience and unlock the full potential of their library, 123RF needed to translate this metadata into 15 different languages. But as they quickly discovered, the path to multilingual content was filled with financial and technical challenges.

The translation conundrum: Beyond word-for-word


Idioms don’t always translate well

As 123RF dove deeper into the challenge, they uncovered layers of complexity that went beyond simple word-for-word translation. The preceding figure shows one particularly difficult example: idioms. A phrase like “The early bird gets the worm,” translated literally, would not convey its meaning as well as a similar Spanish idiom, “A quien madruga, Dios le ayuda”. Another significant hurdle was named entity resolution (NER)—a critical aspect for a service dealing with diverse visual and audio content.

NER involves correctly identifying and handling proper nouns, brand names, specific terminology, and culturally significant references across languages. For instance, a stock photo of the Eiffel Tower should retain its name in all languages, rather than being literally translated. Similarly, brand names like Coca-Cola or Nike should remain unchanged, regardless of the target language.

This challenge is particularly acute in the realm of creative content. Consider a hypothetical stock image titled Young woman using MacBook in a Starbucks. An ideal translation system would need to do the following:

  • Recognize MacBook and Starbucks as brand names that should not be translated
  • Correctly translate Young woman while preserving the original meaning and connotations
  • Handle the preposition in appropriately, which might change based on the grammatical rules of the target language

Moreover, the system needed to handle industry-specific jargon, artistic terms, and culturally specific concepts that might not have direct equivalents in other languages. For instance, how would one translate bokeh effect into languages where this photographic term isn’t commonly used?

These nuances highlighted the inadequacy of simple machine translation tools and underscored the need for a more sophisticated, context-aware solution.

Turning to language models: Large models compared to small models

In their quest for a solution, 123RF explored a spectrum of options, each with its own set of trade-offs:

  • Google Translate – The incumbent solution offered reliability and ease of use. However, it came with a staggering price tag. The company had to clear their backlog of 45 million translations. Adding to this, there was an ongoing monthly financial burden for new content that their customers generated. Though effective, this option threatened to cut into 123RF’s profitability, making it unsustainable in the long run.
  • Large language models – Next, 123RF turned to cutting-edge large language models (LLMs) such as OpenAI GPT-4 and Anthropic’s Claude Sonnet. These models showcased impressive capabilities in understanding context and producing high-quality translations. However, the cost of running these sophisticated models at 123RF’s scale proved prohibitive. Although they excelled in quality, they fell short in cost-effectiveness for a business dealing with millions of short text snippets.
  • Smaller models – In an attempt to find a middle ground, 123RF experimented with less capable models such as OpenAI GPT-3.5. These offered a more palatable price point, aligning better with 123RF’s budget constraints. However, this cost savings came at a price: inconsistency in output quality. The translations, although sometimes acceptable, lacked the reliability and nuance required for professional-grade content description.
  • Fine-tuning – 123RF briefly considered fine-tuning a smaller language model to further reduce cost. However, they understood there would be a number of hurdles: they would have to regularly fine-tune models as new model updates occur, hire subject matter experts to train the models and manage their upkeep and deployment, and potentially manage a model for each of the output languages.

This exploration laid bare a fundamental challenge in the AI translation space: the seemingly unavoidable trade-off between cost and quality. High-quality translations from top-tier models were financially unfeasible, whereas more affordable options couldn’t meet the standard of accuracy and consistency that 123RF’s business demanded.

Solution: Amazon Bedrock, Anthropic’s Claude 3 Haiku, prompt engineering, and a vector store

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Throughout this transformative journey, Amazon Bedrock proved to be the cornerstone of 123RF’s success. Several factors contributed to making it the provider of choice:

  • Model variety – Amazon Bedrock offers access to a range of state-of-the-art language models, allowing 123RF to choose the one best suited for their specific needs, like Anthropic’s Claude 3 Haiku.
  • Scalability – The ability of Amazon Bedrock to handle massive workloads efficiently was crucial for processing millions of translations.
  • Cost-effectiveness – The pricing model of Amazon Bedrock, combined with its efficient resource utilization, played a key role in achieving the dramatic cost reduction.
  • Integration capabilities – The ease of integrating Amazon Bedrock with other AWS services facilitated the implementation of advanced features such as a vector database for dynamic prompting.
  • Security and compliance – 123RF works with user-generated content, and the robust security features of Amazon Bedrock provided peace of mind in handling potentially sensitive information.
  • Flexibility for custom solutions – The openness of Amazon Bedrock to custom implementations, such as the dynamic prompting technique, allowed 123RF to tailor the solution precisely to their needs.

Cracking the code: Prompt engineering techniques

The first breakthrough in 123RF’s translation journey came through a collaborative effort with the AWS team, using the power of Amazon Bedrock and Anthropic’s Claude 3 Haiku. The key to their success lay in the innovative application of prompt engineering techniques—a set of strategies designed to coax the best performance out of LLMs, which is especially important for cost-effective models.

Prompt engineering is crucial when working with LLMs because these models, while powerful, can produce non-deterministic outputs—meaning their responses can vary even for the same input. By carefully crafting prompts, we can provide context and structure that helps mitigate this variability. Moreover, well-designed prompts serve to steer the model towards the specific task at hand, ensuring that the LLM focuses on the most relevant information and produces outputs aligned with the desired outcome. In 123RF’s case, this meant guiding the model to produce accurate, context-aware translations that preserved the nuances of the original content.

Let’s dive into the specific techniques employed.

Assigning a role to the model

The team began by assigning the AI model a specific role—that of an AI language translation assistant. This seemingly simple step was crucial in setting the context for the model’s task. By defining its role, the model was primed to approach the task with the mindset of a professional translator, considering nuances and complexities that a generic language model might overlook.

For example:

You are an AI language translation assistant. 
Your task is to accurately translate a passage of text from English into another specified language.

Separation of data and prompt templates

A clear delineation between the text to be translated and the instructions for translation was implemented. This separation served two purposes:

  • Provided clarity in the model’s input, reducing the chance of confusion or misinterpretation
  • Allowed for simpler automation and scaling of the translation process, because the same prompt template could be used with different input texts

For example:

Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}

Chain of thought

One of the most innovative aspects of the solution was the implementation of a scratchpad section. This allowed the model to externalize its thinking process, mimicking the way a human translator might work through a challenging passage.

The scratchpad prompted the model to consider the following:

  • The overall meaning and intent of the passage
  • Idioms and expressions that might not translate literally
  • Tone, formality, and style of the writing
  • Proper nouns such as names and places that should not be translated
  • Grammatical differences between English and the target language

This step-by-step thought process significantly improved the quality and accuracy of translations, especially for complex or nuanced content.

K-shot examples

The team incorporated multiple examples of high-quality translations directly into the prompt. This technique, known as K-shot learning, provided the model with a number (K) of concrete examples of the desired output quality and style.

By carefully selecting diverse examples that showcased different translation challenges (such as idiomatic expressions, technical terms, and cultural references), the team effectively trained the model to handle a wide range of content types.

For example:

Examples:
<text>The early bird catches the worm.</text>
<translated_text>El que madruga, Dios le ayuda.</translated_text>

The magic formula: Putting it all together

The culmination of these techniques was a prompt template that encapsulated the elements needed for high-quality, context-aware translation. The following example prompt combines the preceding steps; the actual production prompt is not shown here.

You are an AI language translation assistant. Your task is to accurately translate a passage of text from English into another specified language. Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}
Think carefully, in the <scratchpad> section below, think through how you will translate the text while preserving its full meaning and nuance. Consider:
- The overall meaning and intent of the passage
- Idioms and expressions that may not translate literally
- Tone, formality, and style of the writing
- Proper nouns like names and places that should not be translated
- Grammatical differences between English and {{TARGET_LANGUAGE}}
Examples:
<text>The software update is scheduled for next Tuesday.</text>
<translated_text>La actualización del software está programada para el próximo martes.</translated_text>
<text>Breaking news: Elon Musk acquires Twitter for $44 billion.</text>
<translated_text>Última hora: Elon Musk adquiere Twitter por 44 mil millones de dólares.</translated_text>
... [8 more diverse examples] ...
Now provide your final translated version of the text inside <translated_text> tags. Ensure the translation is as accurate and natural-sounding as possible in {{TARGET_LANGUAGE}}. Do not translate any names, places or other proper nouns.
<translated_text>

This template provided a framework for consistent, high-quality translations across a wide range of content types and target languages.
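
To illustrate how such a template can be used at inference time, the following is a minimal sketch that fills the placeholders and calls Anthropic’s Claude 3 Haiku through the Amazon Bedrock Converse API with boto3. The abbreviated template, inference parameters, and tag parsing are illustrative assumptions, not 123RF’s production code.

import re

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Abbreviated version of the template above; the full template also includes
# the scratchpad instructions and the K-shot examples
PROMPT_TEMPLATE = """You are an AI language translation assistant. Your task is to accurately translate a passage of text from English into another specified language. Here is the text to translate:
<text> {text} </text>
Please translate the above text into this language: {target_language}
Now provide your final translated version of the text inside <translated_text> tags.
<translated_text>"""


def translate(text: str, target_language: str) -> str:
    prompt = PROMPT_TEMPLATE.format(text=text, target_language=target_language)
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2, "maxTokens": 1024},
    )
    completion = response["output"]["message"]["content"][0]["text"]
    # The opening <translated_text> tag is already part of the prompt, so the
    # model typically returns the translation followed by the closing tag
    match = re.search(r"(.*?)</translated_text>", completion, re.DOTALL)
    return (match.group(1) if match else completion).strip()


print(translate("The software update is scheduled for next Tuesday.", "Spanish"))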

Further refinement: Dynamic prompting for grounding models

Although the initial implementation yielded impressive results, the AWS team suggested further enhancements through dynamic prompting techniques. This advanced approach aimed to make the model even more adaptive and context-aware. They adopted the Retrieval Augmented Generation (RAG) technique for creating a dynamic prompt template with K-shot examples relevant to each phrase rather than generic examples for each language. This also allowed 123RF to take advantage of their current catalog of high-quality translations to further align the model.

Vector database of high-quality translations

The team proposed creating a vector database for each target language, populated with previous high-quality translations. This database would serve as a rich repository of translation examples, capturing nuances and domain-specific terminologies.

The implementation included the following components:

  1. Embedding generation:
    • Use embedding models such as Amazon Titan or Cohere’s offerings on Amazon Bedrock to convert both source texts and their translations into high-dimensional vectors.
  2. Chunking strategy:
    • To maintain context and ensure meaningful translations, the team implemented a careful chunking strategy:
      1. Each source text (in English) was paired with its corresponding translation in the target language.
      2. These pairs were stored as complete sentences or logical phrases, rather than individual words or arbitrary character lengths.
      3. For longer content, such as paragraphs or descriptions, the text was split into semantically meaningful chunks, ensuring that each chunk contained a complete thought or idea.
      4. Each chunk pair (source and translation) was assigned a unique identifier to maintain the association.
  3. Vector storage:
    • The vector representations of both the source text and its translation were stored together in the database.
    • The storage structure included:
      1. The original source text chunk.
      2. The corresponding translation chunk.
      3. The vector embedding of the source text.
      4. The vector embedding of the translation.
      5. Metadata such as the content type, domain, and any relevant tags.
  4. Database organization:
    • The database was organized by target language, with separate indices or collections for each language pair (for example, English-Spanish and English-French).
    • Within each language pair, the vector pairs were indexed to allow for efficient similarity searches.
  5. Similarity search:
    • For each new translation task, the system would perform a hybrid search to find the most semantically similar sentences from the vector database:
      1. The new text to be translated was converted into a vector using the same embedding model.
      2. A similarity search was performed in the vector space to find the closest matches in the source language.
      3. The corresponding translations of these matches were retrieved, providing relevant examples for the translation task.

This structured approach to storing and retrieving text-translation pairs allowed for efficient, context-aware lookups that significantly improved the quality and relevance of the translations produced by the LLM.
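
The following is a minimal sketch of the embedding and retrieval steps, assuming Amazon Titan Text Embeddings V2 on Amazon Bedrock and a simple in-memory cosine similarity search. A production system would use a purpose-built vector store; the data structures and model ID here are illustrative.

import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str) -> np.ndarray:
    # Convert a sentence into a vector with Amazon Titan Text Embeddings V2
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])


# Toy "vector database" entry: source chunk, its translation, and metadata
translation_memory = [
    {
        "source": "The early bird catches the worm.",
        "translation": "El que madruga, Dios le ayuda.",
        "metadata": {"language_pair": "en-es", "domain": "idioms"},
    },
]
for entry in translation_memory:
    entry["embedding"] = embed(entry["source"])


def top_k_examples(new_text: str, k: int = 2) -> list:
    # Rank stored pairs by cosine similarity between source embeddings
    query = embed(new_text)

    def score(entry):
        e = entry["embedding"]
        return float(np.dot(query, e) / (np.linalg.norm(query) * np.linalg.norm(e)))

    return sorted(translation_memory, key=score, reverse=True)[:k]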

Putting it all together

The top matching examples from the vector database would be dynamically inserted into the prompt, providing the model with highly relevant context for the specific translation task at hand.

This offered the following benefits:

  • Improved handling of domain-specific terminology and phraseology
  • Better preservation of style and tone appropriate to the content type
  • Enhanced ability to resolve named entities and technical terms correctly

The following is an example of a dynamically generated prompt:

[Standard prompt preamble]
...
Examples:
<text>{{Dynamically inserted similar source text 1}}</text>
<translated_text>{{Corresponding high-quality translation 1}}</translated_text>
<text>{{Dynamically inserted similar source text 2}}</text>
<translated_text>{{Corresponding high-quality translation 2}}</translated_text>
...
[Rest of the standard prompt]

This dynamic approach allowed the model to continuously improve and adapt, using the growing database of high-quality translations to inform future tasks.
The following diagram illustrates the process workflow.

How to ground translations with a vector store

The process includes the following steps:

  1. Convert the new text to be translated into a vector using the same embeddings model.
  2. Compare text and embeddings against a database of high-quality existing translations.
  3. Combine similar translations with an existing prompt template of generic translation examples for target language.
  4. Send the new augmented prompt with initial text to be translated to Amazon Bedrock.
  5. Store the output of the translation in an existing database or to be saved for human-in-the-loop evaluation.
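
The following is a minimal sketch of how the retrieved source/translation pairs can be spliced into the prompt (steps 3 and 4 of the preceding workflow). The tag structure mirrors the example prompt shown earlier, and the retrieved examples would normally come from the similarity search in the previous sketch; the sample data is illustrative.

def build_dynamic_prompt(text: str, target_language: str, examples: list) -> str:
    # Format the retrieved source/translation pairs as K-shot examples
    shots = "\n".join(
        f"<text>{ex['source']}</text>\n<translated_text>{ex['translation']}</translated_text>"
        for ex in examples
    )
    return (
        "You are an AI language translation assistant. Your task is to accurately "
        "translate a passage of text from English into another specified language.\n"
        f"Here is the text to translate:\n<text> {text} </text>\n"
        f"Please translate the above text into this language: {target_language}\n"
        f"Examples:\n{shots}\n"
        "Now provide your final translated version of the text inside "
        "<translated_text> tags.\n<translated_text>"
    )


# In practice the examples come from the vector database similarity search
retrieved = [
    {"source": "Stock photo of a sunset over the ocean.",
     "translation": "Foto de archivo de una puesta de sol sobre el océano."},
]
prompt = build_dynamic_prompt(
    "High-resolution image of a mountain landscape at dawn.", "Spanish", retrieved
)
# The prompt is then sent to Amazon Bedrock as shown earlier, and the returned
# translation can be stored for human-in-the-loop evaluation (step 5).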

The results: A 95% cost reduction and beyond

The impact of implementing these advanced techniques on Amazon Bedrock with Anthropic’s Claude 3 Haiku and the engineering effort with AWS account teams was nothing short of innovative for 123RF. By working with AWS, 123RF was able to achieve a staggering 95% reduction in translation costs. But the benefits extended far beyond cost savings:

  • Scalability – The new solution with Anthropic’s Claude 3 Haiku allowed 123RF to rapidly expand their multilingual offerings. They quickly rolled out translations for 9 languages, with plans to cover all 15 target languages in the near future.
  • Quality improvement – Despite the massive cost reduction, the quality of translations saw a marked improvement. The context-aware nature of the LLM, combined with careful prompt engineering, resulted in more natural and accurate translations.
  • Handling of edge cases – The system showed remarkable prowess in handling complex cases such as idiomatic expressions and technical jargon, which had been pain points with previous solutions.
  • Faster time-to-market – The efficiency of the new system significantly reduced the time required to make new content available in multiple languages, giving 123RF a competitive edge in rapidly updating their global offerings.
  • Resource reallocation – The cost savings allowed 123RF to reallocate resources to other critical areas of their business, fostering innovation and growth.

Looking ahead: Continuous improvement and expansion

The success of this project has opened new horizons for 123RF and set the stage for further advancements:

  • Expanding language coverage – With the cost barrier significantly lowered, 123RF is now planning to expand their language offerings beyond the initial 15 target languages, potentially tapping into new markets and user bases.
  • Anthropic’s Claude 3.5 Haiku – The recent release of Anthropic’s Claude 3.5 Haiku has sparked excitement at 123RF. This model promises even greater intelligence and efficiency, potentially allowing for further refinements in translation quality and cost-effectiveness.
  • Broader AI integration – Encouraged by the success in translation, 123RF is exploring additional use cases for generative AI within their operations. Potential areas include the following:
    • Enhanced image tagging and categorization.
    • Content moderation of user-generated images.
    • Personalized content recommendations for users.
  • Continuous learning loop – The team is working on implementing a feedback mechanism where successful translations are automatically added to the vector database, creating a virtuous cycle of continuous improvement.
  • Cross-lingual search enhancement – Using the improved translations, 123RF is developing more sophisticated cross-lingual search capabilities, allowing users to find relevant content regardless of the language they search in.
  • Prompt catalog – They can explore the newly launched Amazon Bedrock Prompt Management as a way to manage prompt templates and iterate on them effectively.

Conclusion

123RF’s success story with Amazon Bedrock and Anthropic’s Claude is more than just a tale of cost reduction—it’s a blueprint for how businesses can use cutting-edge AI to break down language barriers and truly globalize their digital content. This case study demonstrates the transformative power of innovative thinking, advanced prompt engineering, and the right technological partnership.

123RF’s journey offers the following key takeaways:

  • The power of prompt engineering in extracting optimal performance from LLMs
  • The importance of context and domain-specific knowledge in AI translations
  • The potential of dynamic, adaptive AI solutions in solving complex business challenges
  • The critical role of choosing the right technology partner and platform

As we look to the future, it’s clear that the combination of cloud computing, generative AI, and innovative prompt engineering will continue to reshape the landscape of multilingual content management. The barriers of language are crumbling, opening up new possibilities for global communication and content discovery.

For businesses facing similar challenges in global content discovery, 123RF’s journey offers valuable insights and a roadmap to success. It demonstrates that with the right technology partner and a willingness to innovate, even the most daunting language challenges can be transformed into opportunities for growth and global expansion. If you have a similar use case and want help implementing this technique, reach out to your AWS account teams, or sharpen your prompt engineering skills through our prompt engineering workshop available on GitHub.


About the Author

Fahim Surani is a Solutions Architect at Amazon Web Services who helps customers innovate in the cloud. With a focus on Machine Learning and Generative AI, he works with global digital native companies and financial services to architect scalable, secure, and cost-effective products and services on AWS. Prior to joining AWS, he was an architect, an AI engineer, a mobile games developer, and a software engineer. In his free time he likes to run and read science fiction.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, AWS’ flagship generative AI offering for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Read More

Connect SharePoint Online to Amazon Q Business using OAuth 2.0 ROPC flow authentication

Connect SharePoint Online to Amazon Q Business using OAuth 2.0 ROPC flow authentication

Enterprises face significant challenges accessing and utilizing the vast amounts of information scattered across an organization’s various systems. What if you could simply ask a question and get instant, accurate answers from your company’s entire knowledge base, while accounting for an individual user’s data access levels?

Amazon Q Business is a game changing AI assistant that’s revolutionizing how enterprises interact with their data. With Amazon Q Business, you can access relevant information through natural language conversations, drawing insights from diverse data sources within your organization, adhering to the permissions granted to your user account.

At its core, Amazon Q Business works by first indexing the content from a variety of data sources using built-in data source connectors. These connectors function as an integration layer, unifying content from diverse systems such as Salesforce, Microsoft Exchange, and SharePoint into a centralized index. This consolidated index powers the natural language processing and response generation capabilities of Amazon Q. When a user asks a question using the built-in web experience, Amazon Q Business retrieves relevant content from the index, taking into account user profiles and permissions. It then uses large language models (LLMs) to provide accurate, personalized, and well-written responses based on the consolidated data.

For a full list of Amazon Q supported data source connectors, refer to Supported connectors.

The OAuth 2.0 Resource Owner Password Credentials (ROPC) flow authentication approach described in this post is useful when you need Amazon Q Business to crawl through OneNote, when certificate-based authentication is not preferred, or when your organization has a strict policy that requires regular password rotation. For a complete list of authentication mechanisms, refer to SharePoint (Online) connector overview.

We provide a step-by-step guide for the Azure AD configuration and demonstrate how to set up the Amazon Q connector to establish this secure integration.

Solution overview

SharePoint is a web-based solution developed by Microsoft that enables organizations to collaborate, manage documents, and share information efficiently. It offers a wide range of features, including using document libraries, viewing lists, publishing pages, sharing events and links, and allowing users to make comments, making it a great tool for team collaboration and content management.

After integrating SharePoint Online with Amazon Q Business, you can ask questions using natural language about the content stored in the SharePoint sites. For example, if your organization’s human resources team manages an internal SharePoint site and maintains a list of holidays for geographical regions, you can ask, “What are the company holidays for this year?” Amazon Q Business will then list region-specific holidays based on your location (country).
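
End-users typically ask these questions through the built-in web experience, but the same query can also be issued programmatically. The following is a minimal sketch using the boto3 qbusiness client’s chat_sync operation; the application ID is a placeholder, and depending on your identity configuration you may need identity-aware credentials for the call to succeed.

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

response = qbusiness.chat_sync(
    applicationId="<your-amazon-q-business-application-id>",  # placeholder
    userMessage="What are the company holidays for this year?",
)

print(response["systemMessage"])  # generated answer
for source in response.get("sourceAttributions", []):
    # SharePoint documents the answer was drawn from
    print(source.get("title"), source.get("url"))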

The following diagram illustrates the solution architecture. In the upcoming sections, we show you how to implement this architecture. After you integrate Amazon Q Business using the SharePoint connector, Amazon Q Business will crawl through the SharePoint content and update the index whenever content changes. Each published event, page, link, file, comment, OneNote, and attachment on the SharePoint site is treated as a document. In addition to the documents, it crawls through access control lists (ACLs) for each document (user and group information) and stores them in the index. This allows end-users to see chat responses generated only from the documents they have access to.

You can configure Azure AD using either of the following methods:

  • Use the Azure AD console GUI – This is a manual process
  • Use the provided PowerShell script – This is an automated process that takes in the inputs and configures the required permissions

We demonstrate both methods in the following sections.

Prerequisites

To follow along, you need the following prerequisites:

  • The user performing these steps should be a global administrator on Azure AD/Entra ID.
  • You need to configure Microsoft Entra ID and AWS IAM Identity Center. For details, see Configure SAML and SCIM with Microsoft Entra ID and IAM Identity Center.
  • You need a Microsoft Windows instance to run PowerShell scripts and commands with PowerShell 7.4.1+. Details of the required PowerShell modules are described in the following steps.
  • The user should have administrator permissions on the Windows instance.
  • The user running these PowerShell commands should have the right M365 license (for example, M365 E3).

Configure Azure AD using the Azure AD console

To configure Azure AD using the GUI, complete the steps in this section.

Register an Azure AD application

Complete the following steps to register an Azure AD application in the Azure AD tenant that is linked to the SharePoint Online/O365 tenant:

  1. Open the Office 365 Admin Center using the account of a user member of the Tenant Global Admins group.
  2. Navigate to Microsoft Azure Portal.
  3. Search for and choose App registrations.

  1. Choose New registration.

  1. Enter a name for your application, select who can use this application, then choose Register.

An application will be created. You will see a page like the following screenshot.

  1. Note the values for Display name, Application (client) ID, and Directory (tenant) ID. These IDs will be different than what is shown in the screenshot.

Now you can configure the newly registered application with Microsoft Graph and SharePoint API permissions.

When configuring permissions, you have two different options:

  • Option 1 – Allow access to a specific set of SharePoint sites by granting the Sites.Selected permission
  • Option 2 – Allow access to all SharePoint sites by granting the Sites.FullControl.All permission

For option 1, install the MS Graph PowerShell SDK as a prerequisite.

Option 1: Manually allow access to specific SharePoint sites

If you choose option 1, to grant access to specific sites instead of all sites, you need to complete additional prerequisites.

Make sure you have access to another application in Microsoft Entra ID with Sites.FullControl.All application-level permissions, along with its client ID and client secret. This application won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions only to the application you just registered. If you don’t have access to an application with Sites.FullControl permissions, you can follow the previous steps to register a new application and grant Sites.FullControl as described in option 2. We refer to this application as SitesFullControlApp.

To configure your permissions using option 1, complete the following steps:

  1. In the navigation pane, choose API permissions.
  2. Choose the options menu (three dots) next to the permissions that were granted by default when you registered the application, and remove those permissions.

  1. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.Selected and choose Add permissions.

  1. Add the following Microsoft Graph API delegated permissions:
    1. GroupMember.Read.All
    2. User.Read.All

You will see the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions should look like the following screenshot.

  1. To grant permissions to a specific SharePoint site, you’ll need to obtain the Site ID for that site.
    1. Visit the URL of the SharePoint site in your web browser. The URL will be in the format https://yourcompany.sharepoint.com/sites/{SiteName}.
    2. Log in to the site using valid credentials.
    3. Edit the URL in your browser’s address bar by appending /_api/site/id to the end of {SiteName}. For example, if the original URL was https://yourcompany.sharepoint.com/sites/HumanResources, modify it to https://yourcompany.sharepoint.com/sites/HumanResources/_api/site/id.
    4. Press Enter, and the browser will display the Site ID for that particular SharePoint site collection.

  1. Run the script after gathering the following values:
    • AppName – Display name that you captured earlier.
    • AppID – Application (client) ID that you captured earlier.
    • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. This application won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions to the application you registered.
    • SitesFullControlAppClientSecret – Client secret of the SitesFullControlAppID application you entered.
    • SiteID – SharePoint Site ID.
    • TenantId – Directory (tenant) ID that you captured earlier.

param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token by authenticating as the application that has Sites.FullControl.All permissions,
# then use that token to grant site-level permission to the newly registered application (which has only Sites.Selected).

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

This completes the steps to configure permissions for a specific set of SharePoint site collections.

Option 2: Manually allow access to all SharePoint sites

Complete the following steps to allow full control permissions to all the SharePoint site collections:

  1. In the navigation pane, choose API permissions.
  2. Remove the permissions that were granted by default when you registered the application.
  3. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.FullControl.All and choose Add permissions.

Next, you configure Microsoft Graph application permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose Microsoft Graph.
  3. Choose Application permissions.

  1. Add the following application permissions:
    •  GroupMember.Read.All
    •  User.Read.All
    •  Notes.Read.All
    •  Sites.Read.All

Next, you configure SharePoint delegated permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose SharePoint.

  1. Choose Delegated permissions.
  2. Expand AllSites, select AllSites.Read, and choose Add permission.

You will find the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions will look like the following screenshot.

This completes the steps to configure permissions to allow full control permissions to all the SharePoint site collections.

Create a client secret

Complete the following steps to create a client secret:

  1. In the navigation pane, choose Certificates & secrets.
  2. On the Client secrets tab, choose New client secret.
  3. For Description, enter a value, and for Expires, choose an expiration length.
  4. Choose Add.

  1. Save the client secret value.

This value is needed while configuring Amazon Q. Client secret values can’t be viewed except for immediately after creation. Be sure to save the secret.

Deactivate multi-factor authentication

To deactivate multi-factor authentication (MFA), sign in to the Microsoft Entra Admin Center as a security administrator or global administrator and disable the security defaults.

  1. In the navigation pane, choose Identity and Overview.
  2. On the Properties tab, choose Manage security defaults.
  3. Choose Disabled.
  4. For Reason for disabling, select Other and enter your reason.
  5. Choose Save.

Configure Azure AD using the provided PowerShell script

This method supports the same two permission options as the console-based configuration:

  • Option 1: Grant access to specific SharePoint sites. This approach involves granting the Sites.Selected permission, which allows access to a specific set of SharePoint sites.
  • Option 2: Grant access to all SharePoint sites. This approach involves granting the Sites.FullControl.All permission, which allows access to all SharePoint sites in your organization.

When configuring permissions, consider your organization’s SharePoint access requirements. Many SharePoint admins prefer to grant Amazon Q Business access only to specific sites that need to be crawled, in which case Option 1 with the Sites.Selected permission would be suitable.

For either option, the user running the PowerShell script should be an Azure AD tenant admin or have tenant admin permissions. Additionally, the Microsoft Graph PowerShell SDK must be installed on the machine where you run the script (see the prerequisites).

Run one of the provided PowerShell scripts, then follow the additional instructions. The scripts will perform the following tasks:

  • Register a new application in Azure AD/Entra ID
  • Configure the required permissions
  • Provide admin consent for the configured permissions

Option 1: Use a script to allow access to specific SharePoint sites

There is one additional prerequisite for option 1 (granting the Sites.Selected permission): you need access to another application in Microsoft Entra ID that has the Sites.FullControl.All application-level permission. This application is required to grant the Sites.Selected permission to the new application you will register. If you don’t have access to an application with the Sites.FullControl.All permission, you can follow the steps in option 2 to register a new application and grant it the Sites.FullControl.All permission. This application will be referred to as SitesFullControlApp.

Use the following script to grant permissions to a specific SharePoint site. You need the following information before running the script:

  • AppName – Name of the application that you plan to register.
  • AppID – Application (client) ID that was registered.
  • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. This application won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions to the application you plan to register.
  • SitesFullControlAppClientSecret – Client secret of the SitesFullControlApp application.
  • SiteID – SharePoint Site ID.
  • TenantId – Your Azure Active Directory tenant ID.

param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token by authenticating as the application that has Sites.FullControl.All permissions,
# then use that token to grant site-level permission to the newly registered application (which has only Sites.Selected).

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Option 2: Use a script to manually grant access to all SharePoint sites

The following script grants full control permissions to all the SharePoint site collections. You need the following information before running the script:

  • AppName – The name of the application that you plan to register.
  • TenantId – Your Azure Active Directory tenant ID (optional).

param(
  [Parameter(Mandatory=$true,
  HelpMessage="The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory=$false,
  HelpMessage="Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,
  [Parameter(Mandatory=$false)]
  [Switch]
  $StayConnected = $false
)

# Requires an admin
if ($TenantId)
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All" -TenantId $TenantId
}
else
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All"
}
$SitePermissionAllSitesRead = "4e0d77b0-96ba-4398-af14-3baa780278f4"
$GraphPermissionsGroupMemberReadAll  = "98830695-27a2-44f7-8c18-0c3ebc9698f6"
$GraphPermissionsNotesReadAll = "3aeca27b-ee3a-4c2b-8ded-80376e2134a4"
$GraphPermissionsSitesReadAll = "332a536c-c7ef-4017-ab91-336970924f0d"
$GraphPermissionsUserReadAll = "df021288-bdef-4463-88db-98f22de89214"
$GraphPermissionsSitesFullControlAll = "5a54b8b3-347c-476d-8f8e-42d5c7424d29"

# Sharepoint permissions 
$sharePointResourceId = "00000003-0000-0ff1-ce00-000000000000"
$SitePermission = @(
  @{
  Id= $SitePermissionAllSitesRead  #AllSites.Read (Delegated) – Read items in all site collections
  Type="Scope"
}
)

# Graph permissions 
$graphResourceId =  "00000003-0000-0000-c000-000000000000"
$graphPermissions = @(
    @{
        Id =  $GraphPermissionsGroupMemberReadAll  # GroupMember.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsNotesReadAll # Notes.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsSitesReadAll # Sites.Read.All (Application)
        Type = "Role"
    },
    @{
        Id =  $GraphPermissionsUserReadAll # User.Read.All (Application)
        Type = "Role"
    },
     @{
        Id = $GraphPermissionsSitesFullControlAll # Sites.FullControl.All (Delegated)
        Type = "Scope"
    }
)


$requiredResourceAccess = @()

$graphResourceAccess   = @{
ResourceAppId=$graphResourceId
ResourceAccess= $graphPermissions
}

$spResourceAccess = @{
    ResourceAppId = $sharePointResourceId
    ResourceAccess = $SitePermission
  }

$requiredResourceAccess += $spResourceAccess
$requiredResourceAccess += $graphResourceAccess


# Get context for access to tenant ID
$context = Get-MgContext

# Create app registration
$appRegistration = New-MgApplication -DisplayName $AppName -SignInAudience "AzureADMyOrg" `
 -Web @{ RedirectUris="http://localhost"; } `
 -RequiredResourceAccess   $requiredResourceAccess `
 -AdditionalProperties @{}
Write-Host -ForegroundColor Cyan "App registration created with app ID" $appRegistration.AppId

# Add client secret
#$clientSecret = [System.Net.WebUtility]::UrlEncode(([System.Text.Encoding]::UTF8.GetBytes((New-Guid).ToString() + "abcdefghijklmnopqrstuvwxyz0123456789")))
$clientSecretCredential = Add-MgApplicationPassword -ApplicationId $appRegistration.Id -PasswordCredential @{ displayName  = "Client Secret"; EndDateTime = (Get-Date).AddYears(2) } 
Write-Host -ForegroundColor Cyan "Client secret created "

$secretValue = $clientSecretCredential.SecretText
Write-Host  -ForegroundColor  Red "Secret Text is [$secretValue]"
Write-Host  -ForegroundColor  Red  "Please Clear the screen after noting down the Secret value."
#$clientSecretCredential |  Format-List

# Create corresponding service principal
$servicePrincipal = New-MgServicePrincipal -AppId $appRegistration.AppId -AdditionalProperties @{}
Write-Host -ForegroundColor Cyan "Service principal created"
Write-Host
Write-Host -ForegroundColor Green "Success"
Write-Host

#Admin consent
$scp = Get-MgServicePrincipal -Filter "DisplayName eq '$($AppName)'" 
$app = Get-MgServicePrincipal -Filter "AppId eq '$graphResourceId'" 

New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId  $GraphPermissionsGroupMemberReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsNotesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsSitesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsUserReadAll
New-MgOAuth2PermissionGrant -ClientId $scp.id  -consentType "AllPrincipals"  -resourceId $app.Id  -Scope "Sites.FullControl.All"


if ($StayConnected -eq $false)
{
  Disconnect-MgGraph
  Write-Host "Disconnected from Microsoft Graph"
}
else
{
  Write-Host
  Write-Host -ForegroundColor Yellow "The connection to Microsoft Graph is still active. To disconnect, use Disconnect-MgGraph"
}

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Configure Amazon Q

Make sure you have set up Amazon Q Business with Entra ID as your identity provider as mentioned in the prerequisites. Also, make sure the email ID is in lowercase letters while creating the user in Entra ID.

Follow the instructions in Connecting Amazon Q Business to SharePoint (Online) using the console.

For Step 9 (Authentication), we choose OAuth 2.0 and configure it as follows:

  1. For Tenant ID, enter the tenant ID of your SharePoint account.

This is the directory (tenant) ID in your registered Azure application, in the Azure Portal, as shown in the following screenshot (the IDs will be different for your setup).

  1. For the AWS Secrets Manager secret, create a secret on the Secrets Manager console to store your SharePoint authentication credentials:
    1. For Secret name, enter a name for your secret.
    2. For Username, enter the user name for your SharePoint account.
    3. For Password, enter the password for your SharePoint account.
    4. For Client ID, enter the Azure AD client ID generated when you registered SharePoint in Azure AD. This is the application (client) ID created in the Azure Portal when registering the SharePoint application in Azure, as described earlier.
    5. For Client secret, enter the client secret generated earlier.
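
If you prefer to create this secret programmatically, the following is a minimal boto3 sketch. The secret name and key names shown (userName, password, clientId, clientSecret) are assumptions that mirror the console fields described above; confirm the exact key names expected by the SharePoint (Online) connector before using it.

import json

import boto3

secretsmanager = boto3.client("secretsmanager", region_name="us-east-1")

secretsmanager.create_secret(
    Name="QBusiness-SharePoint-ROPC-secret",  # placeholder secret name
    SecretString=json.dumps(
        {
            # Key names are assumptions that mirror the console fields above
            "userName": "sharepoint-crawler@yourcompany.onmicrosoft.com",
            "password": "<sharepoint-password>",
            "clientId": "<application-client-id>",
            "clientSecret": "<azure-client-secret>",
        }
    ),
)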

Frequently asked questions

In this section, we discuss some frequently asked questions.

Amazon Q Business isn’t answering questions

There are a few possible scenarios for this issue. If no users are getting a response from a specific document, verify that you have synced your data source with Amazon Q. Choose View report in the Sync run history section. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.

If a specific user can’t get responses, verify that their email address in SharePoint matches the email address of the corresponding identity in IAM Identity Center and that it’s entered in lowercase in IAM Identity Center. For more information, refer to Known limitations for the SharePoint (Online) connector.

For troubleshooting purposes, you can use the Amazon CloudWatch logs generated for the data source sync job.

Amazon Q Business isn’t answering any questions that are in the event attachment or comments in the SharePoint site

The connector crawls event attachments only when Events is also chosen as an entity to be crawled. Make sure that you chose the corresponding entities in the sync scope.

Error message that authentication failed

In some cases, you might get the error message “Sharepoint Connector Error code: SPE-5001 Error message: Authentication failed:” when trying to sync.

To address this, validate that the user name, password, clientId, clientSecret, and authType values are correct in the secret that you created for this connector. Verify that MFA is deactivated.

Amazon Q Business is showing old data after an update

After the content has been updated on SharePoint, you must re-sync the contents for the updated data to be picked up by Amazon Q. Go to the data sources, select the SharePoint data source, and choose Sync now. After the sync is complete, verify that the updated data is reflected by running queries on Amazon Q.
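
You can also trigger the sync programmatically. The following is a minimal sketch using the boto3 qbusiness client; the application, index, and data source IDs are placeholders.

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Re-crawl the SharePoint data source so updated content is picked up by the index
qbusiness.start_data_source_sync_job(
    applicationId="<application-id>",            # placeholder
    indexId="<index-id>",                        # placeholder
    dataSourceId="<sharepoint-data-source-id>",  # placeholder
)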

Unable to sign in as a new user through the web experience URL

If you experience issues when signing in, clear your browser cookies and sign in as a new user.

Error message that the application is not set up correctly

Verify that the user or group has a subscription to Amazon Q Business. Check the corresponding user or group, choose Manage access and subscriptions, and select the corresponding subscription.

Error message when uploading a file

In some cases, users might get the following message when they upload a file through their user experience: “Your Amazon Q Business subscription doesn’t include file uploads. Please contact your administrator for assistance.” This indicates that the user’s subscription tier doesn’t include file uploads; the administrator needs to move the user to a subscription tier that includes them, such as Amazon Q Business Pro.

Troubleshooting

For troubleshooting guidance, refer to Troubleshooting your SharePoint (Online) connector.

Clean up

Complete the following steps to clean up your resources:

  1. Open the Office 365 Admin Center using the account of a user member of the Tenant Global Admins group.
  2. Navigate to the Microsoft Azure Portal.
  3. Search for and choose App registrations.
  4. Select the app you created earlier, then choose Delete.
  5. On the Amazon Q Business console, choose Applications in the navigation pane.
  6. Select the application you created, and on the Actions menu, choose Delete.

Additional capabilities of Amazon Q Business

Amazon Q Business offers much more than just a powerful AI assistant. Explore its other capabilities that allow you to customize the user experience, empower your workforce, and increase productivity:

  • Admin controls and guardrails – Customize your application environment to your organizational needs. Amazon Q Business offers application environment guardrails or chat controls that you can configure to control the end-user chat experience. For example, admins can define specific topics that should be blocked or controlled in the application. You can customize how Amazon Q Business responds when these topics are mentioned by end-users.
  • Amazon Q Apps – Empower your teams to build lightweight, purpose-built applications that streamline tasks and workflows without coding experience. For example, you could build an app that drafts personalized sales emails to customers informing them of new product launches or generates social media content for specified social media networks based on your data.
  • Plugins for Amazon Q Business – Seamlessly integrate with supported third-party services that allow you to perform specific tasks like creating an incident ticket in ServiceNow or raising an issue in Jira—all without leaving the Amazon Q interface.

Conclusion

In this post, we explored how to integrate Amazon Q Business with SharePoint Online using the OAuth 2.0 ROPC flow authentication method. We provided both manual and automated approaches using PowerShell scripts for configuring the required Azure AD settings. Additionally, we demonstrated how to enter those details along with your SharePoint authentication credentials into the Amazon Q console to finalize the secure connection.

The ROPC flow offers an alternative to certificate-based authentication for connecting Amazon Q Business to SharePoint Online. This can be useful when you want Amazon Q Business to crawl through OneNote, when you don’t want to manage certificates, or in scenarios that require regular password rotation.

By following this post, enterprises can take advantage of the powerful knowledge mining capabilities of Amazon Q to unlock insights from their SharePoint data repositories and knowledge bases.


About the Author

Ramesh Eega is a Global Accounts Solutions Architect based out of Atlanta, GA. He is passionate about helping customers throughout their cloud journey.

Read More

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that John Snow Labs’ Medical LLM – Small and Medical LLM – Medium large language models (LLMs) are now available on Amazon SageMaker JumpStart. Medical LLM is optimized for the following medical language understanding tasks:

  • Summarizing clinical encounters – Summarizing discharge notes, progress notes, radiology reports, pathology reports, and various other medical reports
  • Question answering on clinical notes or biomedical research – Answering questions about a clinical encounter’s principal diagnosis, tests ordered, or a research abstract’s study design or main outcomes

For medical doctors, this tool provides a rapid understanding of a patient’s medical journey, aiding in timely and informed decision-making from extensive documentation. This summarization capability not only boosts efficiency but also makes sure that no critical details are overlooked, thereby supporting optimal patient care and enhancing healthcare outcomes.

In a blind evaluation performed by the John Snow Labs research team, Medical LLM – Small outperformed GPT-4o in medical text summarization, being preferred by doctors 88% more often for factuality, 92% more for clinical relevance, and 68% more for conciseness. The model also excelled in clinical notes question answering, preferred 46% more for factuality, 50% more for relevance, and 44% more for conciseness. In biomedical research question answering, the model was preferred even more dramatically: 175% for factuality, 300% for relevance, and 356% for conciseness. Notably, despite being smaller than competitive models by more than an order of magnitude, the small model performed comparably in open-ended medical question answering tasks.

Medical LLM in SageMaker JumpStart is available in two sizes: Medical LLM – Small and Medical LLM – Medium. The models are deployable on commodity hardware, while still delivering state-of-the-art accuracy. This is significant for medical professionals who need to process millions to billions of patient notes without straining computing budgets.

Both models support a context window of 32,000 tokens, which is roughly 50 pages of text. You can try out the models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy Medical LLM – Small using SageMaker JumpStart.

About John Snow Labs

John Snow Labs, the AI for healthcare company, provides state-of-the-art software, models, and data to help healthcare and life science organizations put AI to good use. John Snow Labs is the developer behind Spark NLP, Healthcare NLP, and Medical LLMs. Its award-winning medical AI software powers the world’s leading pharmaceuticals, academic medical centers, and health technology companies. John Snow Labs’ Medical Language Models is by far the most widely used natural language processing (NLP) library by practitioners in the healthcare space (Gradient Flow, The NLP Industry Survey 2022 and the Generative AI in Healthcare Survey 2024).

John Snow Labs’ state-of-the-art AI models for clinical and biomedical language understanding include:

  • Medical language models, consisting of over 2,400 pre-trained models for analyzing clinical and biomedical text
  • Visual language models, focused on understanding visual documents and forms
  • Peer-reviewed, state-of-the-art accuracy on a variety of common medical language understanding tasks
  • Tested for robustness, fairness, and bias

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models (FMs). ML practitioners can deploy FMs to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy a Medical LLM – Small model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping provide data security. The Medical LLM – Small model is available today for deployment and inference in SageMaker Studio.

Discover the Medical LLM – Small model in SageMaker JumpStart

You can access the FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can discover various models by browsing through different hubs, which are named after model providers. You can find the Medical LLM – Small model in the John Snow Labs hub (see the following screenshot). If you don’t see the Medical LLM – Small model, update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can also find the Medical LLM – Small model by searching for “John Snow Labs” in the search field.

You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find two options, Deploy and Preview notebooks, for deploying the model and creating an endpoint.

Subscribe to the Medical LLM – Small model in AWS Marketplace

This model requires an AWS Marketplace subscription. When you choose Deploy in SageMaker Studio, you will be prompted to subscribe to the AWS Marketplace listing if you don’t already have it. If you are already subscribed, choose Deploy.

If you don’t have an active AWS Marketplace subscription, choose Subscribe. You will be redirected to the listing on AWS Marketplace. Review the terms and conditions and choose Accept offer.

After you’ve successfully subscribed to the model on AWS Marketplace, you can now deploy the model in SageMaker JumpStart.

Deploy the Medical LLM – Small model in SageMaker JumpStart

When you choose Deploy in SageMaker Studio, deployment will start.

You can monitor the progress of the deployment on the endpoint details page that you’re redirected to.

On the same endpoint details page, on the Test inference tab, you can send a test inference request to a deployed model. This is useful if you want to verify that your endpoint responds to requests as expected. The following prompt asks the Medical LLM – Small model a question, with the context followed by the question, and checks the resulting response. Performance metrics, such as execution time, are also included.

You can also test out the medical text summarization response.

Deploy the model and run inference through a notebook

Alternatively, you can choose Open in JupyterLab to deploy the model through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources. You can configure additional parameters as needed, but SageMaker JumpStart enables you to deploy and run inference out of the box with the included code.

The notebook already has the necessary code to deploy the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the API documentation.
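
For reference, the following is a minimal sketch of that programmatic path using the SageMaker Python SDK’s JumpStartModel class. The model ID is a placeholder (copy the exact ID from the John Snow Labs model card in SageMaker JumpStart), the payload format is an assumption, and an active AWS Marketplace subscription is still required.

from sagemaker.jumpstart.model import JumpStartModel

# Placeholder model ID; copy the exact ID from the model card in SageMaker JumpStart
model = JumpStartModel(model_id="<john-snow-labs-medical-llm-small-model-id>")

# Deploys to the model's default instance type; requires an active AWS Marketplace subscription
predictor = model.deploy()

# Payload format is an assumption; refer to the model's example notebook for the exact schema
response = predictor.predict({
    "inputs": "Summarize the following discharge note: ...",
    "parameters": {"max_new_tokens": 512},
})
print(response)

# Clean up when finished to avoid ongoing charges
predictor.delete_model()
predictor.delete_endpoint()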

After you deploy the model, you can run real-time or batch inference against the deployed endpoint. The notebook includes example code and instructions for both.

Clean up

After you’re done running the notebook, delete all resources that you created in the process.

When deploying the endpoint from the SageMaker Studio console, you can delete it by choosing Delete on the endpoint details page.

If you want to unsubscribe from the model package completely, you need to unsubscribe from the product in AWS Marketplace:

  1. Navigate to the Machine Learning tab on your software subscriptions page.
  2. Locate the listing that you want to cancel the subscription for, then choose Cancel Subscription to cancel the subscription.

Complete these cleanup steps to avoid continued billing for the model.

Conclusion

In this post, we showed you how to get started with the first healthcare-specific model available now in SageMaker JumpStart. Check out SageMaker JumpStart in SageMaker Studio now to get started.


About the Authors

Art Tuazon is a Solutions Architect on the CSC team at AWS. She supports both AWS Partners and customers on technical best practices. In her free time, she enjoys running and cooking.

Beau Tse is a Partner Solutions Architect at AWS. He focuses on supporting AWS Partners through their partner journey and is passionate about enabling them on AWS. In his free time, he enjoys traveling and dancing.

David headshotDavid Talby is the Chief Technology Officer at John Snow Labs, helping companies apply artificial intelligence to solve real-world problems in healthcare and life science. He was named USA CTO of the Year by the Global 100 Awards and Game Changers Awards in 2022.

Read More

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Companies across various scales and industries are using large language models (LLMs) to develop generative AI applications that provide innovative experiences for customers and employees. However, building or fine-tuning these pre-trained LLMs on extensive datasets demands substantial computational resources and engineering effort. With the increase in sizes of these pre-trained LLMs, the model customization process becomes complex, time-consuming, and often prohibitively expensive for most organizations that lack the necessary infrastructure and skilled talent.

In this post, we demonstrate how you can address these challenges by using a fully managed environment with Amazon SageMaker Training jobs to fine-tune the Mixtral 8x7B model using PyTorch Fully Sharded Data Parallel (FSDP) and Quantized Low Rank Adaptation (QLoRA).

We guide you through a step-by-step implementation of model fine-tuning on a GEM/viggo dataset, employing the QLoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 Nvidia A100 40GB GPUs).

Business challenge

Today’s businesses are looking to adopt a variety of LLMs to enhance business applications. Primarily, they’re looking for foundation models (FMs) that are open source (that is, released with openly available model weights) and can offer computational efficiency and versatility. Mistral’s Mixtral 8x7B model, released with open weights under the Apache 2.0 license, is one of the models that has gained popularity with large enterprises due to the high performance that it offers across various tasks. Mixtral employs a sparse mixture of experts (SMoE) architecture, selectively activating only a subset of its parameters for each input during model training. This architecture allows these models to use only 13B (about 18.5%) of its 46.7B total parameters during inference, making it high performing and efficient.

These FMs work well for many use cases but lack domain-specific information, which limits their performance on certain tasks. This requires businesses to use fine-tuning strategies to adapt these large FMs to specific domains, thus improving performance on targeted applications. Due to the growing number of model parameters and the increasing context lengths of modern LLMs, this process is memory intensive and requires advanced AI expertise to align and optimize the models effectively. The cost of provisioning and managing the infrastructure further increases the overall cost of ownership of the end-to-end solution.

In the upcoming section, we discuss how you can cost-effectively build such a solution with advanced memory optimization techniques using Amazon SageMaker.

Solution overview

To address the memory challenges of fine-tuning LLMs such as Mixtral, we will adopt the QLoRA method. As shown in the following diagram, QLoRA freezes the original model’s weights and adds low-rank trainable parameters to the transformer layers. QLoRA further uses quantization to represent the actual model’s weights in a compact, optimized format such as 4-bit NormalFloat (NF4), effectively compressing the model and reducing its memory footprint. This enables training and fine-tuning these LLMs even on systems with limited memory while maintaining performance comparable to half-precision fine-tuning. QLoRA’s support for double quantization and paged optimizers reduces the memory footprint further by quantizing the quantization constants and effectively handling any sudden memory demands.

During the forward pass computation of this architecture, the 4-bit weights get dequantized to bfloat16 (BF16) precision. On the other hand, the LoRA adapters continue to operate on BF16 precision data. Both (original weights and adapter output vectors) are then added together element-wise to produce the final result, denoted as h.

During the backward pass of the model, the gradients are computed with respect to only the LoRA parameters, not the original base model weights. Although the dequantized original weights are used in calculations, the original 4-bit quantized weights of the base model remain unchanged.

To adopt this architecture, we use the Hugging Face Parameter-Efficient Fine-Tuning (PEFT) library, which integrates directly with bitsandbytes. This way, the QLoRA fine-tuning technique can be applied with just a few lines of code.
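The following is a minimal sketch (not the repository’s training script, which wires these pieces through SFTTrainer as shown later in this post) of how PEFT and bitsandbytes combine to apply QLoRA: load the base model with 4-bit NF4 quantization, then attach low-rank adapters to it. The LoRA hyperparameter values here are illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base weights to 4-bit NF4 with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",   # gated repo; requires a Hugging Face token
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters; only these are updated during fine-tuning
lora_config = LoraConfig(r=16, lora_alpha=8, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()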

QLoRA operates on a large FM. In the following figure, X denotes the input tokens of the training data, W is the existing (quantized) model weights, and Wa and Wb are the segments of the adapters added by QLoRA. The original model’s weights (W) are frozen, and QLoRA adds adapters (Wa, Wb), which are low-rank trainable parameters, onto the existing transformer layer.

QLoRA explanation showing adapters added onto the existing transformer layer

Figure 1: This figure shows how QLoRA operates. The original model’s weights (W) are frozen, and QLoRA adds in adapters (Wa, Wb) onto the existing transformer layer.

Although QLoRA helps optimize memory during fine-tuning, we will use Amazon SageMaker Training to spin up a resilient training cluster, manage orchestration, and monitor the cluster for failures. By offloading the management and maintenance of the training cluster to SageMaker, we reduce both training time and our total cost of ownership (TCO). Using this approach, you can focus on developing and refining the model while using the fully managed training infrastructure provided by SageMaker Training.

Implementation details

We spin up the cluster by calling the SageMaker control plane through APIs or the AWS Command Line Interface (AWS CLI) or using the SageMaker AWS SDK. In response, SageMaker spins up training jobs with the requested number and type of compute instances. In our example, we use one ml.p4d.24xlarge compute instance.

To take complete advantage of this multi-GPU cluster, we use the recent support of QLoRA and PyTorch FSDP. Although QLoRA reduces computational requirements and memory footprint, FSDP, a data/model parallelism technique, will help shard the model across all eight GPUs (one ml.p4d.24xlarge), enabling training the model even more efficiently. Hugging Face PEFT is where the integration happens, and you can read more about it in the PEFT documentation.

QLoRA adapters are added to the linear layers in the model. Together, these layers (for example, transformer layers, gate networks, and feed-forward networks) form the entire model, as shown in the following diagram. FSDP shards this model across our cluster (shown as small shards in blue).

The following architecture diagram shows how you can use SageMaker Training to have the SageMaker Control Plane spin up a resilient training job cluster. SageMaker downloads the training image from Amazon Elastic Container Registry (Amazon ECR) and will use Amazon Simple Storage Service (Amazon S3) as an input training data source and to store training artifacts.

Architecture Diagram

Figure 3: Architecture Diagram showing how you can utilize SageMaker Training Jobs to spin up a resilient training cluster. Amazon ECR contains the training image, and Amazon S3 contains the training artifacts.

To put this solution into practice, work through the example in the following sections.

Prerequisites

To perform the solution, you need to have the following prerequisites in place:

  1. Create a Hugging Face User Access Token and get access to the gated repo mistralai/Mixtral-8x7B-v0.1 on Hugging Face.
  2. (Optional) Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring. This is recommended if you’d like to visualize model training specific metrics.
  3. Request a service quota at Service Quotas for 1x ml.p4d.24xlarge on Amazon SageMaker. To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, Amazon SageMaker, and choose ml.p4d.24xlarge for training job usage.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess and AmazonEC2FullAccess to give required access to SageMaker to run the examples.

This role is for demonstration purposes only. You need to adjust it to your specific security requirements for production. Adhere to the principle of least privilege while defining IAM policies in production.

  5. (Optional) Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. (You can also use JupyterLab in your local setup.)
  6. Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets.
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd sagemaker-distributed-training-workshop/15_mixtral_finetune_qlora

The 15_mixtral_finetune_qlora directory contains the training scripts that you might need to deploy this sample.

Next, we will run the finetune-mixtral.ipynb notebook to fine-tune the Mixtral 8x7B model using QLoRA on SageMaker. Check out the notebook for more details on each step. In the next section, we walk through the key components of the fine-tuning execution.

Solution walkthrough

To perform the solution, follow the steps in the next sections.

Step 1: Set up required libraries

Install the relevant HuggingFace and SageMaker libraries:

!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "py7zr" "peft==0.12.0" --upgrade --quiet

Step 2: Load dataset

In this example, we use the GEM/viggo dataset from Hugging Face. This is a data-to-text generation dataset in the video game domain. The dataset is clean and organized with about 5,000 data points, and the responses are more conversational than information seeking. This type of dataset is ideal for extracting meaningful information from customer reviews. For example, an ecommerce application such as Amazon.com could use a similarly formatted dataset for fine-tuning a model for natural language processing (NLP) analysis to gauge interest in products sold. The results can be used for recommendation engines. Thus, this dataset is a good candidate for fine-tuning LLMs. To learn more about the viggo dataset, check out this research paper.

Load the dataset and convert it to the required prompt structure. The prompt is constructed with the following elements:

  • Target sentence – Think of this as the final review. In the dataset, this is target.
  • Meaning representation – Think of this as a deconstructed review, broken down by attributes such as inform, request, or give_opinion. In the dataset, this is meaning_representation.

Running the following cell gives us the train_set and test_set (training split and testing split, respectively) with structured prompts. We use the Python map function to structure the dataset splits according to our prompt.

def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""
      Given a target sentence, construct the underlying 
      meaning representation ...
      ['inform', 'request', 'give_opinion', 'confirm', 
      'verify_attribute', 'suggest', 'request_explanation', 
      'recommend', 'request_attribute']

      The attributes must be one of the following:
      ['name', 'exp_release_date', 'release_year', 
      'developer', 'esrb', 'rating', 'genres', 
      'player_perspective', 'has_multiplayer', 'platforms', 
      'available_on_steam', 'has_linux_release', 
      'has_mac_release', 'specifier']

      ### Target sentence:
      {data_point["target"]}

      ### Meaning representation:
      {data_point["meaning_representation"]}
    """
    return {"prompt": full_prompt.strip()}

# Load dataset splits from the Hugging Face Hub
from datasets import load_dataset

dataset_name = "GEM/viggo"
train_set = load_dataset(dataset_name, split="train")
test_set = load_dataset(dataset_name, split="test")

# Columns to drop after mapping, so only the structured prompt remains
columns_to_remove = list(train_set.features)

train_dataset = train_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

test_dataset = test_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

Upload the dataset to Amazon S3. This step is crucial because the dataset stored in Amazon S3 will serve as the input data channel for the SageMaker training cluster. SageMaker will efficiently manage the process of distributing this data across the training cluster, allowing each node to access the necessary information for model training.

input_path = f's3://{sess.default_bucket()}/datasets/mixtral'

# Save datasets to s3
train_dataset.to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
test_dataset.to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

We analyze the distribution of prompt tokens to determine the maximum sequence length required for training our model in the upcoming steps.

The following graph shows the prompt tokens plotted. The x-axis is the length of the prompts, and the y-axis is the number of times that length occurs in the training dataset (frequency). We use this to determine the maximum sequence length and pad the rest of the data points accordingly. The maximum number of words in our example is 173.

Input Tokens Distribution

Figure 4: Graph showing the distribution of input token lengths prompted. The x-axis shows the lengths and the y-axis shows the frequency with which those input token lengths occur in the train and test dataset splits.
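As a rough illustration of this analysis, the following sketch tokenizes each structured prompt in the training split and reports its length distribution. It assumes the train_dataset created earlier and access to the gated Mixtral tokenizer; the notebook may compute this differently.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Token count per structured prompt in the training split
prompt_lengths = [len(tokenizer.encode(example["prompt"])) for example in train_dataset]

lengths_sorted = sorted(prompt_lengths)
print(f"max tokens: {lengths_sorted[-1]}")
print(f"95th percentile: {lengths_sorted[int(0.95 * len(lengths_sorted))]}")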

Step 3: Configure the parameters for SFTTrainer for the fine-tuning task

We use TrlParser to parse hyperparameters from a YAML file that is required to configure the SFTTrainer API for fine-tuning the model. This approach offers flexibility because we can also overwrite the arguments specified in the config file by explicitly passing them through the command line interface.

cat > ./args.yaml <<EOF
model_id: "mistralai/Mixtral-8x7B-v0.1" # Hugging Face model id
max_seq_length: 2048 # based on the prompt length distribution graph
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
test_dataset_path: "/opt/ml/input/data/test/" # path to where SageMaker saves test dataset
output_dir: "/opt/ml/model/mixtral/adapter" # path to where SageMaker will upload the model
...

num_train_epochs: 1 # number of training epochs
per_device_train_batch_size: 10 # batch size per device during training
gradient_accumulation_steps: 1 # number of steps before performing a backward/update pass
optim: adamw_torch # use torch adamw optimizer
...

bf16: true # use bfloat16 precision
tf32: true # use tf32 precision
gradient_checkpointing: true # use gradient checkpointing to save memory

# offload FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap" # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
EOF
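The following is a minimal sketch of how a launch script can consume this file with TrlParser; the actual launch_fsdp_qlora.py in the repository may define additional fields. Running python launch_fsdp_qlora.py --config args.yaml populates both argument classes from the YAML, and any value can still be overridden on the command line.

from dataclasses import dataclass, field

from transformers import TrainingArguments
from trl import TrlParser

@dataclass
class ScriptArguments:
    model_id: str = field(default="mistralai/Mixtral-8x7B-v0.1")
    max_seq_length: int = field(default=2048)
    train_dataset_path: str = field(default="/opt/ml/input/data/train/")
    test_dataset_path: str = field(default="/opt/ml/input/data/test/")

# Reads --config args.yaml and fills both argument classes from it
parser = TrlParser((ScriptArguments, TrainingArguments))
script_args, training_args = parser.parse_args_and_config()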

Step 4: Review the launch script

You are now prepared to fine-tune the model using a combination of PyTorch FSDP and QLoRA. We’ve prepared a script called launch_fsdp_qlora.py that will perform the tasks mentioned in the following steps. The following is a quick review of the key points in this script before launching the training job.

  1. Load the dataset from a JSON file located at the specified path, using the load_dataset function to prepare it for model training.
# Load datasets
train_dataset = load_dataset(
  "json",
  data_files=os.path.join(script_args.train_dataset_path, 
  "dataset.json"),
  split="train",
)
  2. Prepare the tokenizer and the model.

We employ the BitsAndBytes library to configure 4-bit quantization settings for our model, enabling memory-efficient loading and computation.

By setting parameters such as load_in_4bit and bnb_4bit_use_double_quant to True, we enable a dramatic reduction in model size without significant loss in performance. The nf4 quantization type, coupled with bfloat16 compute and storage data types, allows for nuanced control over the quantization process, striking an optimal balance between model compression and accuracy preservation. This configuration enables the deployment of massive models on resource-constrained hardware, making advanced AI more accessible and practical for a wide range of applications.

# Configure model quantization
torch_dtype = torch.bfloat16
quant_storage_dtype = torch.bfloat16

# Configures 4-bit quantization settings for the model
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch_dtype,
  bnb_4bit_quant_storage=quant_storage_dtype,
)

model_loading_params = {
  "quantization_config": quantization_config,
  "torch_dtype": quant_storage_dtype,
  "use_cache": False if 
  training_args.gradient_checkpointing else True
}

# Loads a pre-trained model from the specified model ID
model = AutoModelForCausalLM.from_pretrained(
  script_args.model_id,
  cache_dir="/opt/ml/sagemaker/warmpoolcache",
  **model_loading_params
)
  3. Initiate the training process using SFTTrainer from the Transformer Reinforcement Learning (TRL) library to fine-tune the model. The SFTTrainer simplifies the process of supervised fine-tuning for LLMs. This approach makes fine-tuning efficient to adapt pre-trained models to specific tasks or domains.

We use the LoraConfig class from the Hugging Face’s PEFT library to configure and add LoRA parameters (also called “adapters”) to the model.

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
  lora_alpha=8,
  lora_dropout=0.05,
  r=16,
  ...
)

################
# Training
################
trainer = SFTTrainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  peft_config=peft_config,
  max_seq_length=script_args.max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  ...
)

trainer.train(resume_from_checkpoint=checkpoint)

Step 5: Fine-tune your model

To fine-tune your model, follow the steps in the next sections.

Launch the training job

You are now ready to launch the training. We use the SageMaker Training estimator, which uses torchrun to initiate distributed training.

The SageMaker estimator simplifies the training process by automating several key tasks in this example:

  1. The SageMaker estimator spins up a training cluster of one ml.p4d.24xlarge instance. SageMaker handles the setup and management of these compute instances, which reduces your TCO.
  2. This estimator also uses one of the pre-built containers managed by SageMaker for PyTorch, which includes an optimized build of the PyTorch framework along with its required dependencies and GPU-specific libraries for accelerated computations.
pytorch_estimator = PyTorch(
  entry_point= 'launch_fsdp_qlora.py',
  source_dir="./scripts",
  ...
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  distribution={"torch_distributed": {"enabled": True}},
  hyperparameters={
    # path to the TRL config that was uploaded to Amazon S3
    "config": "/opt/ml/input/data/config/args.yaml"
  }
)

The training process generates trained adapters that will be saved in a default S3 bucket named sagemaker-<region name>-<account_id> for this job.

Monitor your training run

You can monitor training metrics, such as loss and learning rate, for your training run through the Weights & Biases dashboard. The following figures show the results of the training run, where we track GPU utilization and GPU memory utilization.

The example is optimized to use GPU memory to its maximum capacity. Note that increasing the batch size any further will lead to CUDA Out of Memory errors.

The following graph shows the GPU memory utilization (for all eight GPUs) during the training process. You can also observe the GPU memory utilization for any given point in time.

GPU Memory Utilization

Figure 5: This graph shows the GPU Memory utilization plotted for all 8 GPUs in the training job.

The following graph shows the GPU compute utilization (for all eight GPUs) during the training process. You can also observe the GPU compute utilization for any given point in time.

GPU Compute Utilization

Figure 6: This graph shows the GPU Compute utilization plotted for all 8 GPUs in the training job.

Step 6: Merge the trained adapter with the base model for inference

Merge the trained LoRA adapter with the base model. After the merge is complete, run inference to find the results. Specifically, look at how the new fine-tuned and merged model performs compared to the original unmodified Mixtral-8x7B model. The example performs both the adapter merge and inference in the same launch script, merge_model_adapter.py.

Before launching the training job, review the key components of the merge script:

Use the Hugging Face Transformers library. Specifically, use AutoModelForCausalLM to load the base model from the specified Hugging Face model ID (mistralai/Mixtral-8x7B-v0.1). We have configured this library for low CPU memory utilization (low_cpu_mem_usage=True) to reduce the CPU-to-GPU communication overhead, and we’ve also used automatic device mapping (device_map="auto") while offloading the model to a designated folder to manage resource constraints.

# Load a Peft model
base_model = AutoModelForCausalLM.from_pretrained(
  model_id,
  low_cpu_mem_usage=True,
  #torch_dtype=torch.float16,
  device_map="auto",
  offload_folder="/opt/ml/model/"
)

# Load the adapter
peft_model = PeftModel.from_pretrained(
  base_model,
  adapter_dir,
  #torch_dtype=torch.float16,  # Set dtype to float16
  offload_folder="/opt/ml/model/"
)

# Merge the base model with the trained adapter
model = peft_model.merge_and_unload()
print("Merge done")

After the model is merged, send inference requests to generate responses.

def generate_text(model, prompt, max_length=500, num_return_sequences=1):
    ...

    input_ids = tokenizer.encode(prompt_input, 
    return_tensors="pt").to(device)

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            no_repeat_ngram_size=2,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # Decode and return the generated text
    generated_texts = [tokenizer.decode(seq, 
    skip_special_tokens=True) for seq in output]

    return generated_texts

print(f"nnn*** Generating Inference on Base Model: {generate_text(base_model,prompt)}nnn")

print(f"***nnn Generating Inference on Trained Model: {generate_text(model,prompt)}nnn")

Step 7: Launch the SageMaker training job to merge the adapter

Run the following script as part of the SageMaker training job.

First, explore the adapters that were saved as part of the training run.

adapter_dir_path=f"{model_artifacts}/mixtral/adapter/"

print(f'\nAdapter S3 Dir path: {adapter_dir_path}\n')

!aws s3 ls {adapter_dir_path}

# Reference Output
Adapter S3 Dir path:s3://sagemaker-<Region>-<Account-ID>/mixtral-8-7b-finetune-2024-09-08-22-27-42-099/output/model/mixtral/adapter/

PRE checkpoint-64/
PRE runs/
2024-09-08 23:08:07       5101 README.md
2024-09-08 23:07:58        722 adapter_config.json
2024-09-08 23:08:06  969174880 adapter_model.safetensors
2024-09-08 23:08:08        437 special_tokens_map.json
2024-09-08 23:08:04    1795596 tokenizer.json
2024-09-08 23:08:04        997 tokenizer_config.json
2024-09-08 23:08:04       5688 training_args.bin

Create and run the PyTorch estimator to configure the training job.

pytorch_estimator_adapter = PyTorch(
  entry_point= 'merge_model_adapter.py',
  source_dir="./scripts",
  job_name=job_name,
  base_job_name=job_name,
  max_run=5800,
  role=role,
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  hyperparameters={
    "model_id": "mistralai/Mixtral-8x7B-v0.1",  # Hugging Face model id
    "hf_token": "<hf-token>",
    "dataset_name":dataset_name
  }
)

# starting the train job with our uploaded datasets as input
pytorch_estimator_adapter.fit(data, wait=True)

Here’s the target sentence (key prompt) to generate model inference results:

Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. 
Is your opinion true for all games which don't have multiplayer?

Ground truth inference (data label):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation]) 

Original model inference (that is, meaning representation):

inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))

Fine-tuned model inference result (that is, meaning representation):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])

The preceding results compare the inference results of the fine-tuned model against both the ground truth and the inference results of the original unmodified Mixtral 8x7B model. You can observe that the fine-tuned model provides more details and better representation of the meaning than the base model. Run systematic evaluation to quantify the fine-tuned model’s improvements for your production workloads.

Clean up

To clean up your resources to avoid incurring any more charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Clean Up

Figure 7: Screenshot showing that there are no training jobs running anymore. This is what your console should look like once you follow the clean-up steps provided

To learn more about cleaning up your provisioned resources, check out Clean up.

Conclusion

In this post, we provided a step-by-step guide to fine-tune the Mixtral 8x7B MoE model with QLoRA. We used SageMaker Training jobs together with the Hugging Face PEFT library and bitsandbytes quantization to perform the fine-tuning task. The fine-tuning was conducted using the quantized model loaded on a single compute instance, which eliminates the need for a larger cluster. As observed, the model performance improved with just 50 epochs.

To learn more about Mistral on AWS and to find more examples, check out the mistral-on-aws GitHub repository. To get started, check out the notebook on the mixtral_finetune_qlora GitHub repository. To learn more about generative AI on AWS, check out Generative AI on AWS, Amazon Bedrock, and Amazon SageMaker.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Nishant Karve is a Sr. Solutions Architect aligned with the healthcare and life sciences (HCLS) domain. He collaborates with large HCLS customers for their generative AI initiatives and guides them from ideation to production.

Read More

Amazon SageMaker Inference now supports G6e instances

Amazon SageMaker Inference now supports G6e instances

As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G6e instances powered by NVIDIA’s L40S Tensor Core GPUs on Amazon SageMaker. You have the option to provision nodes with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. This launch provides organizations with the capability to use a single-GPU instance (G6e.xlarge) to host powerful open-source foundation models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B, offering a cost-effective and high-performing option. This makes it a perfect choice for those looking to optimize costs while maintaining high performance for inference workloads.

The key highlights for G6e instances include:

  • Twice the GPU memory compared to G5 and G6 instances, enabling deployment of large language models in FP16 up to:
    • 14B parameter model on a single GPU node (G6e.xlarge)
    • 72B parameter model on a 4 GPU node (G6e.12xlarge)
    • 90B parameter model on an 8 GPU node (G6e.48xlarge)
  • Up to 400 Gbps of networking throughput
  • Up to 384 GB of GPU memory

Use cases

G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e provides higher performance and is more cost-effective compared to G5 instances, making it an ideal fit for low-latency, real-time use cases such as:

  • Chatbots and conversational AI
  • Text generation and summarization
  • Image generation and vision models

We have also observed that G6e performs well for inference at high concurrency and with longer context lengths. We have provided complete benchmarks in the following section.

Performance

In the following two figures, we see that for context lengths of 512 and 1,024 tokens, G6e.2xlarge provides up to 37% better latency and 60% higher throughput compared to G5.2xlarge for a Llama 3.1 8B model.

In the following two figures, we see that G5.2xlarge throws a CUDA out of memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge provides great performance.

In the following two figures, we compare G5.48xlarge (8 GPU node) with the G6e.12xlarge (4 GPU) node, which costs 35% less and is more performant. For higher concurrency, we see that G6e.12xlarge provides 60% lower latency and 2.5 times higher throughput.

In the following figure, we compare the cost per 1,000 tokens when deploying Llama 3.1 70B, which further highlights the cost/performance benefits of using G6e instances compared to G5.

Deployment walkthrough

Prerequisites

To try out this solution using SageMaker, you’ll need the following prerequisites:

Deployment

You can clone the repository and use the notebook provided here.
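If you prefer a quick sketch of what the deployment looks like before opening the notebook, the following example deploys an open LLM to a single-GPU G6e instance using the SageMaker Hugging Face LLM (TGI) container. The model ID, container version, and environment values are illustrative assumptions, not the notebook’s exact configuration.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

llm_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.3.1"),  # assumed container version
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model; gated repo
        "HUGGING_FACE_HUB_TOKEN": "<your-hf-token>",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    container_startup_health_check_timeout=900,
)

print(predictor.predict({"inputs": "Why are G6e instances a good fit for LLM inference?"}))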

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

predictor.delete_predictor()

Conclusion

G6e instances on SageMaker unlock the ability to deploy a wide variety of open source models cost-effectively. With superior memory capacity, enhanced performance, and cost-effectiveness, these instances represent a compelling solution for organizations looking to deploy and scale their AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try the code to deploy with G6e.


About the Authors

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Pavan Kumar Madduri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in Generative AI and is passionate about helping customers harness the power of the cloud. He earned his MS in Information Technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Read More

Orchestrate generative AI workflows with Amazon Bedrock and AWS Step Functions

Orchestrate generative AI workflows with Amazon Bedrock and AWS Step Functions

Companies across all industries are harnessing the power of generative AI to address various use cases. Cloud providers have recognized the need to offer model inference through an API call, significantly streamlining the implementation of AI within applications. Although a single API call can address simple use cases, more complex ones may necessitate the use of multiple calls and integrations with other services.

This post discusses how to use AWS Step Functions to efficiently coordinate multi-step generative AI workflows, such as parallelizing API calls to Amazon Bedrock to quickly gather answers to lists of submitted questions. We also touch on the usage of Retrieval Augmented Generation (RAG) to optimize outputs and provide an extra layer of precision, as well as other possible integrations through Step Functions.

Introduction to Amazon Bedrock and Step Functions

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

AWS Step Functions is a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function helps you scale more easily and change applications more quickly. Step Functions is a reliable way to coordinate components and step through the functions of your application. Step Functions provides a graphical console to arrange and visualize the components of your application as a series of steps. This makes it easier to build and run multi-step applications. Step Functions automatically triggers and tracks each step and retries when there are errors, so your application executes in order and as expected. Step Functions logs the state of each step, so when things do go wrong, you can diagnose and debug problems more quickly. You can change and add steps without even writing code, so you can more easily evolve your application and innovate faster.

Orchestrating parallel tasks using the map functionality

Arrays are fundamental data structures in programming, consisting of ordered collections of elements. In the context of Step Functions, arrays play a crucial role in enabling parallel processing and efficient task orchestration. The map functionality in Step Functions uses arrays to execute multiple tasks concurrently, significantly improving performance and scalability for workflows that involve repetitive operations. Step Functions provides two different mapping strategies for iterating through arrays: inline mapping and distributed mapping, each with its own advantages and use cases.

Inline mapping

The inline map functionality allows you to perform parallel processing of array elements within a single Step Functions state machine execution. This approach is suitable when you have a relatively small number of items to process and when the processing of each item is independent of the others.
Here’s how it works:

  1. You define a Map state in your Step Functions state machine.
  2. Step Functions iterates over the array and runs the specified tasks for each element concurrently.
  3. The results of each iteration are collected and made available for subsequent steps in the state machine.

Inline mapping is efficient for lightweight tasks and helps avoid launching multiple Step Functions executions, which can be more costly and resource intensive. But there are limitations. When using inline mapping, only JSON payloads can be accepted as input, your workflow’s execution history can’t exceed 25,000 entries, and you can’t run more than 40 concurrent map iterations.

Distributed mapping

The distributed map functionality is designed for scenarios where many items need to be processed or when the processing of each item is resource intensive or time-consuming. Instead of handling all items within a single execution, Step Functions launches a separate execution for each item in the array, letting you concurrently process large-scale data sources stored in Amazon Simple Storage Service (Amazon S3), such as a single JSON or CSV file containing large amounts of data, or even a large set of Amazon S3 objects. This approach offers the following advantages:

  • Scalability – By distributing the processing across multiple executions, you can scale more efficiently and take advantage of the built-in parallelism in Step Functions
  • Fault isolation – If one execution fails, it doesn’t affect the others, providing better fault tolerance and reliability
  • Resource management – Each execution can be allocated its own resources, helping prevent resource contention and providing consistent performance

However, distributed mapping can incur additional costs due to the overhead of launching multiple Step Functions executions.
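For reference, the following is a rough sketch of a distributed Map state in Amazon States Language, expressed here as a Python dictionary that could be serialized with json.dumps into a state machine definition. The bucket, key, and child state are placeholders.

import json

distributed_map_state = {
    "Type": "Map",
    "ItemReader": {  # stream items from a large CSV object in Amazon S3
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
        "Parameters": {"Bucket": "my-input-bucket", "Key": "questions.csv"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessItem",
        "States": {"ProcessItem": {"Type": "Pass", "End": True}},  # placeholder child workflow
    },
    "MaxConcurrency": 100,
    "End": True,
}

print(json.dumps(distributed_map_state, indent=2))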

Choosing a mapping approach

In summary, inline mapping is suitable for lightweight tasks with a relatively small number of items, whereas distributed mapping is better suited for resource-intensive tasks or large datasets that require better scalability and fault isolation. The choice between the two mapping strategies depends on the specific requirements of your application, such as the number of items, the complexity of processing, and the desired level of parallelism and fault tolerance.

Another important consideration when building generative AI applications using Amazon Bedrock and Step Functions Map states together is the Amazon Bedrock runtime quotas. Generally, these model quotas allow for hundreds or even thousands of requests per minute. However, you may run into issues trying to run a large map on models with low requests-per-minute quotas, such as image generation models. In that scenario, you can include a retrier in the error handling of your Map state.
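As a sketch, a retry policy like the following could be attached to the Amazon Bedrock task inside the Map state. The throttling error name is an assumption about how the service error surfaces in Step Functions; adjust it to the error your executions actually report.

bedrock_retry_policy = {
    "Retry": [
        {
            "ErrorEquals": ["Bedrock.ThrottlingException"],  # assumed error name
            "IntervalSeconds": 5,   # initial wait before the first retry
            "MaxAttempts": 6,
            "BackoffRate": 2.0,     # exponential backoff between attempts
        }
    ]
}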

Solution overview

In the following sections, we get hands-on to see how this solution works. Amazon Bedrock has a variety of model choices to address specific needs of individual use cases. For the purposes of this exercise, we use Amazon Bedrock to run inference on Anthropic’s Claude 3.5 Haiku model to receive answers to an array of questions because it’s a performant, fast, and cost-effective option.

Our goal is to create an express state machine in Step Functions using the inline Map state to parse through the JSON array of questions sent by an API call from an application. For each question, Step Functions will scale out horizontally, creating a simultaneous call to Amazon Bedrock. After all the answers come back, Step Functions will concatenate them into a single response, which our original calling application can then use for further processing or displaying to end-users.

The payload we send consists of an array of nine Request for Proposal (RFP) questions, as well as a company description:

{
  "questions": [
    "Can you describe your technical capabilities and infrastructure?",
    "What security measures do you have in place to protect data and privacy?",
    "Can you provide case studies or examples of similar projects you have handled?",
    "How do you handle project management, and what tools do you use?",
    "What are your support and maintenance services like?",
    "What is your pricing model?",
    "Can you provide references from other clients?",
    "How do you ensure the scalability of your solution?",
    "What is your approach to data backup and recovery?"
  ],
  "description": "Our company, AnyCompany Tech, boasts a robust technical infrastructure that allows us to handle complex projects with ease. Our strength lies in our dynamic team of experts and our cutting-edge technology, which, when combined, can deliver solutions of any scale. We've worked with clients across the globe, for instance, our project with Example Corp involved a sophisticated upgrade of their system. In terms of security, we prioritize data privacy and have put in place stringent measures to ensure that all data is stored securely. We're quite proud of our project with AnyCompany Networks, where we overhauled their security systems to bolster their data protection capabilities. We use a range of project management tools, including Product-1 and Product-2, which allows us to customize our approach to each client's needs. Our pricing model varies depending on the project, but we always aim to provide cost-effective solutions. We've had numerous positive feedback from our clients, with Example Corp and AnyCompany Networks among those who have expressed satisfaction with our services. We're more than happy to provide further references upon request. Software updates and upgrades are a critical part of our service. We have a dedicated team that ensures all systems are up-to-date and running smoothly. Furthermore, our solutions are designed to be scalable, ensuring that they can grow alongside your business. Lastly, in terms of data backup and recovery, we have a comprehensive plan in place, which includes regular data backups and a robust recovery strategy. We understand the importance of data in today's world and we're committed to ensuring its safety and accessibility at all times."
}

You can use the step-by-step guide in this post or use the prebuilt AWS CloudFormation template in the us-west-2 Region to provision the necessary AWS resources. AWS CloudFormation gives developers and businesses a straightforward way to create a collection of related AWS and third-party resources, and provision and manage them in an orderly and predictable fashion.

Prerequisites

You need the following prerequisites to follow along with this solution implementation:

Create a State Machine and add a Map state

In the AWS console in the us-west-2 Region, launch into Step Functions, and select Get started and Create your own to open a blank canvas in Step Functions Workflow Studio.

Edit the state machine by adding an inline Map state with items sourced from a JSON payload.

Next, tell the Map state where the array of questions is located by selecting Provide a path to items array and pointing it to the questions array using JSONPath syntax. Selecting Modify items with ItemSelector allows you to structure the payload, which is then sent to each of the child workflow executions. Here, we map the description through with no change and use $$.Map.Item.Value to map the question from the array at the index of the map iteration.
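Under the hood, Workflow Studio generates Amazon States Language (ASL). The following is a rough sketch of the resulting Map state, expressed as a Python dictionary; the state names and the exact structure Workflow Studio emits may differ.

map_state = {
    "Type": "Map",
    "ItemsPath": "$.questions",                 # the array of RFP questions
    "ItemSelector": {
        "description.$": "$.description",       # passed through unchanged
        "question.$": "$$.Map.Item.Value",      # the question at the current index
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "Bedrock InvokeModel",
        "States": {},  # the InvokeModel task described in the next step goes here
    },
    "End": True,
}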

Invoke an Amazon Bedrock model

Next, add a Bedrock: InvokeModel action task as the next state within the Map state.

Now you can structure your Amazon Bedrock API calls through Workflow Studio. Because we’re using Anthropic’s Claude 3.5 Haiku model on Amazon Bedrock, we select the corresponding model ID for Bedrock model identifier and edit the provided sample with instructions to incorporate the incoming payload. Depending on which model you select, the payload may have a different structure and prompt syntax.

Build the payload

The prompt you build uses the Amazon States Language intrinsic function States.Format in order to do string interpolation, substituting {} for the variables declared after the string. We must also include .$ after our text key to reference a node in this state’s JSON input.

When building out this prompt, you should be very prescriptive in asking the model to do the following:

  • Answer the questions thoroughly using the following description
  • Not repeat the question
  • Only respond with the answer to the question

We set the max_tokens to 800 to allow for longer responses from Amazon Bedrock. Additionally, you can include other inference parameters such as temperature, top_p, top_k, and stop_sequences. Tuning these parameters can help limit the length or influence the randomness or diversity of the model’s response. For the sake of this example, we keep all other optional parameters as default.

{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 800,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text.$": "States.Format('Answer following question thoroughly, using the following description. Do not repeat the question. Only respond with the answer to the question. Question: {} Description: {}', $.questions.question, $.description)"
        }
      ]
    }
  ]
}

Form the response

To provide a cleaner response back to our calling application, we want to use some options to transform the output of the Amazon Bedrock Task state. First, use ResultSelector to filter the response coming back from the service to pull out the text completion, then add the original input back to the output using ResultPath and finish by filtering the final output using OutputPath. That way you don’t have to see the description being mapped unnecessarily for each array item.
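As a sketch, the Amazon Bedrock task state with this output shaping could look like the following, again expressed as a Python dictionary. The model ID and paths are illustrative; confirm the response structure and model availability for your Region, and apply OutputPath as described above to drop the description from each iteration’s output.

invoke_model_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::bedrock:invokeModel",
    "Parameters": {
        "ModelId": "anthropic.claude-3-5-haiku-20241022-v1:0",  # assumed model ID
        "Body": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 800,
            "messages": [],  # the prompt payload shown earlier goes here
        },
    },
    # Keep only the text completion from the service response
    "ResultSelector": {"completion.$": "$.Body.content[0].text"},
    # Merge it back into the state input under "response"
    "ResultPath": "$.response",
    "End": True,
}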

To simulate the state machine being called by an API, choose Execute in Workflow Studio. Using the preceding input, the Step Functions output should look like the following code, although it may vary slightly due to the diversity and randomness of FMs:

{
  "output": [
    {
      "question": "Can you describe your technical capabilities and infrastructure?",
      "response": {
        "completion": "AnyCompany Tech offers a comprehensive technical infrastructure powered by a dynamic expert team and cutting-edge technology capable of handling complex projects at any scale. The company prioritizes data security through stringent protection measures, uses advanced project management tools like Product-1 and Product-2, and offers customizable solutions. They provide global client support with a focus on scalability, software updates, and robust data backup and recovery strategies. Their approach emphasizes cost-effectiveness, client satisfaction, and adaptable technological solutions that can grow with a client's business needs. The team maintains a dedicated focus on keeping systems updated, secure, and responsive to evolving technological requirements."
      }
    },
    {
      "question": "What security measures do you have in place to protect data and privacy?",
      "response": {
        "completion": "At AnyCompany Tech, we have implemented stringent security measures to protect data and privacy. These include a comprehensive data backup and recovery strategy with regular backups, robust security systems that have been successfully deployed for clients like AnyCompany Networks, and dedicated teams ensuring systems are up-to-date. We prioritize data privacy and use advanced technical infrastructure and cutting-edge technology to secure client information. Our approach involves customizable security solutions, scalable systems, and a commitment to protecting data accessibility and safety at all times."
      }
    },
    {
      "question": "Can you provide case studies or examples of similar projects you have handled?",
      "response": {
        "completion": "Based on the description provided, we have successfully completed projects for Example Corp, involving a sophisticated system upgrade, and AnyCompany Networks, where we overhauled their security systems. These case studies demonstrate our ability to handle complex technical infrastructure projects across different industries. Our team has worked with global clients, delivering solutions that prioritize security, data privacy, and scalability. While specific technical details aren't elaborated, our track record shows proven expertise in system improvements and security enhancements. We have received positive client feedback from both Example Corp and AnyCompany Networks, and we are prepared to provide additional references upon request."
      }
    },
    {
      "question": "How do you handle project management, and what tools do you use?",
      "response": {
        "completion": "At AnyCompany Tech, we utilize Product-1 and Product-2 as our primary project management tools, allowing us to create customized approaches for each client's unique needs. Our project management strategy is built on our dynamic team of experts and cutting-edge technology, which enables us to handle complex projects efficiently. We prioritize adaptability, scalability, and client-specific solutions, ensuring that our project management approach can accommodate projects of any scale. Our tools and processes are designed to provide transparent tracking, seamless communication, and effective resource allocation, which allows us to deliver high-quality results consistently. Additionally, we have dedicated teams for software updates, system maintenance, and data backup/recovery to support our robust project management infrastructure."
      }
    },
    {
      "question": "What are your support and maintenance services like?",
      "response": {
        "completion": "Our support and maintenance services are comprehensive and tailored to each client's unique needs. We provide a dedicated team that ensures all systems are up-to-date and running smoothly, with regular software updates and upgrades. Our solutions are designed to be scalable, allowing them to grow alongside your business. We have a robust data backup and recovery strategy that includes regular backups and a comprehensive recovery plan to ensure data safety and accessibility. Our technical infrastructure, combined with our expert team, allows us to deliver high-quality, secure, and efficient maintenance services. We use advanced project management tools like Product-1 and Product-2 to customize our approach, and we prioritize data privacy with stringent security measures. Our goal is to provide cost-effective solutions that meet and exceed client expectations, with a track record of successful projects and satisfied clients like Example Corp and AnyCompany Networks."
      }
    },
    {
      "question": "What is your pricing model?",
      "response": {
        "completion": "Our pricing model is flexible and tailored to each specific project. We aim to provide cost-effective solutions that are customized to meet the unique needs of our clients. The exact pricing depends on the scope, complexity, and requirements of the project, ensuring that we deliver value while addressing the specific challenges of each client's technological needs."
      }
    },
    {
      "question": "Can you provide references from other clients?",
      "response": {
        "completion": "Yes, we can provide references from clients like Example Corp and AnyCompany Networks, who have previously worked with us and expressed satisfaction with our services. We are more than happy to connect you with these references upon request, allowing you to gain direct insights into our capabilities, project performance, and customer satisfaction."
      }
    },
    {
      "question": "How do you ensure the scalability of your solution?",
      "response": {
        "completion": "We ensure scalability through a multi-faceted approach. Our solutions are intentionally designed to be flexible and adaptable, capable of growing alongside our clients' businesses. We leverage cutting-edge technology and our dynamic team of experts to create infrastructure that can handle increasingly complex projects. Our robust technical infrastructure allows us to seamlessly expand systems, and we use advanced project management tools like Product-1 and Product-2 to customize solutions that can scale effectively. Additionally, we have a dedicated team focused on software updates and upgrades, which helps maintain system performance and enables smooth expansion. Our comprehensive approach to scalability is further supported by our data backup and recovery strategies, ensuring system reliability and continuity as businesses grow. By prioritizing adaptability and having the technical expertise to implement scalable solutions, we can confidently meet the evolving needs of our clients across various industries."
      }
    },
    {
      "question": "What is your approach to data backup and recovery?",
      "response": {
        "completion": "Our data backup and recovery approach encompasses regular, comprehensive data backups and a robust recovery strategy. We understand the critical importance of data protection and have implemented a systematic process to ensure data safety and accessibility at all times. Our strategy involves creating multiple backup copies, storing them in secure, redundant locations, and maintaining a well-defined recovery protocol that allows for quick restoration of data in the event of any potential loss or system failure. This approach is designed to minimize downtime and protect our clients' valuable information, reflecting our commitment to maintaining the highest standards of data security and reliability."
      }
    }
  ],
  "outputDetails": {
    "truncated": false
  }
}

Clean up resources

To delete this solution, navigate to the State machines page on the Step Functions console, select your state machine, choose Delete, and enter delete to confirm. It will be marked for deletion and will be deleted when all executions are stopped.

RAG and other possible integrations

RAG is a strategy that enhances the output of a large language model (LLM) by allowing it to reference an authoritative external knowledge base, generating more accurate or secure responses. This powerful tool can extend the capabilities of LLMs to specific domains or an organization’s internal knowledge base without needing to retrain or even fine-tune the model.

A straightforward way to integrate RAG into the preceding RFP example is by adding a Bedrock Runtime Agents: Retrieve action task to your Map state before invoking the model. This enables queries to Amazon Bedrock Knowledge Bases, which supports various vector storage databases, including the Amazon OpenSearch Serverless vector engine, Pinecone, Redis Enterprise Cloud, and soon Amazon Aurora and MongoDB. Using Knowledge Bases to ingest and vectorize example RFPs and documents stored in Amazon S3 eliminates the need to include a description with the question array. Also, because a vector store can accommodate a broader range of information than a single prompt is able to, RAG can greatly enhance the specificity of the responses.
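To make this concrete, the following sketch calls the same Knowledge Bases Retrieve API directly with boto3, which is what the Retrieve task state invokes on your behalf. The knowledge base ID and query are placeholders.

import boto3

kb_client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = kb_client.retrieve(
    knowledgeBaseId="KB1234567890",  # placeholder knowledge base ID
    retrievalQuery={"text": "What security measures do you have in place?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
)

# Each result carries the retrieved chunk text and its source location
for result in response["retrievalResults"]:
    print(result["content"]["text"])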

In addition to Amazon Bedrock Knowledge Bases, there are other options to integrate for RAG depending on your existing tech stack, such as directly with an Amazon Kendra Task state or with a vector database of your choosing through third-party APIs using HTTP Task states.

Step Functions offers composability, allowing you to seamlessly integrate over 9,000 AWS API actions from more than 200 services directly into your workflows. These optimized service integrations simplify the use of common services like AWS Lambda, Amazon Elastic Container Service (Amazon ECS), AWS Glue, and Amazon EMR, offering features such as IAM policy generation and the Run A Job (.sync) pattern, which automatically waits for the completion of asynchronous jobs. Another common pattern seen in generative AI applications is chaining models together to accomplish secondary tasks, like language translation after a primary summarization task is completed. This can be accomplished by adding another Bedrock: InvokeModel action task just as we did earlier.

Conclusion

In this post, we demonstrated the power and flexibility of Step Functions for orchestrating parallel calls to Amazon Bedrock. We explored two mapping strategies—inline and distributed—for processing small and large datasets, respectively. Additionally, we delved into a practical use case of answering a list of RFP questions, demonstrating how Step Functions can efficiently scale out and manage multiple Amazon Bedrock calls.

We introduced the concept of RAG as a strategy for enhancing the output of an LLM by referencing an external knowledge base and demonstrated multiple ways to incorporate RAG into Step Functions state machines. We also highlighted the integration capabilities of Step Functions, particularly the ability to invoke over 9,000 AWS API actions from more than 200 services directly from your workflow.

As next steps, explore the possibilities of application patterns offered by the GenAI Quick Start PoCs GitHub repo as well as various Step Functions integrations through sample project templates within Workflow Studio. Also, consider integrating RAG into your workflows to use your organization’s internal knowledge base or specific domain expertise.


About the Author

Dimitri Restaino is a Brooklyn-based AWS Solutions Architect specialized in designing innovative and efficient solutions for healthcare companies, with a focus on the potential applications of AI, blockchain and other promising industry disruptors. Off the clock, he can be found spending time in nature or setting fastest laps in his racing sim.

Read More