Benchmarking Amazon Nova and GPT-4o models with FloTorch


Based on original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors—latency, accuracy, and cost—across five diverse topics.

Key findings from the benchmark study:

  • GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
  • Amazon Nova Pro outperformed GPT-4o in efficiency, operating 21.97% faster while being 65.26% more cost-effective
  • Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o mini by 4 and 2 percentage points in accuracy, respectively
  • In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o mini, respectively
  • Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve enterprise customers, the evaluation aimed to answer three key questions:

  • How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
  • How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini in these same metrics?
  • How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs—whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta to test factual queries across five domains and eight question types, with a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain | Question | Question Type
Sports | Can you carry less than the maximum number of clubs during a round of golf? | simple
Music | Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)? | simple_w_condition
Open | Can i make cookies in an air fryer? | simple
Finance | Did meta have any mergers or acquisitions in 2022? | simple_w_condition
Movie | In 2016, which movie was distinguished for its visual effects at the oscars? | simple_w_condition

The evaluation considered 200 queries from this dataset, representing the five domains and two question types: simple and simple_w_condition. Both question types are common from users, and a typical web search for a query such as “Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?” will not return the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top five search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model’s response. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top five HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. HTML pages were parsed to extract text for the embedding stage.

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (roughly four characters per token) and a 10% overlap between chunks. FloTorch plans further experiments with different chunking strategies, chunk sizes, and overlap percentages in the coming weeks and will update this post with the results.
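A fixed-size chunker along these lines is straightforward to sketch. The following is a minimal illustration, not FloTorch's implementation, and it uses the rough four-characters-per-token heuristic instead of a real tokenizer:

def chunk_text(text, chunk_tokens=512, overlap_pct=0.10, chars_per_token=4):
    """Split text into fixed-size chunks with a percentage overlap between consecutive chunks."""
    chunk_chars = chunk_tokens * chars_per_token
    step = int(chunk_chars * (1 - overlap_pct))
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_chars]
        if chunk:
            chunks.append(chunk)
    return chunks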

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With the model’s maximum input limit of 8,192 tokens, the system embedded both the knowledge base chunks and the short CRAG queries efficiently. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
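The retrieval code shown later calls a create_embeddings_with_titan_bedrock helper that isn't included in the post. One possible shape for it, assuming the Titan Text Embeddings V2 request format and a (text, token count, embedding) return value, is the following sketch:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def create_embeddings_with_titan_bedrock(text, max_length=1024, normalize=True):
    """Embed text with Amazon Titan Text Embeddings V2 and return (text, token count, embedding)."""
    body = json.dumps({
        "inputText": text[:max_length * 4],   # rough character cap based on the max length setting
        "dimensions": 1024,
        "normalize": normalize,
    })
    response = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    result = json.loads(response["body"].read())
    return text, result.get("inputTextTokenCount"), result["embedding"]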

Vector database

FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
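A k-NN index with HNSW can be created along the following lines with the opensearch-py client. This is an illustrative sketch, not FloTorch's configuration: the endpoint, authentication, index name, and engine/space settings are placeholders.

from opensearchpy import OpenSearch

# Placeholder connection details; in practice, use your domain endpoint and credentials
client = OpenSearch(hosts=[{"host": "my-opensearch-domain-endpoint", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,   # matches the Titan Text Embeddings V2 output size
                "method": {"name": "hnsw", "space_type": "l2", "engine": "nmslib"},
            },
        }
    },
}
client.indices.create(index="crag_titan_chunks", body=index_body)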

Retrieval (and reranking) strategy

FloTorch used a k-nearest neighbors (k-NN) retrieval strategy with k = 5 retrieved chunks. The experiments excluded reranking algorithms to make sure retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Embed each query and retrieve the top-k chunks from the per-query index."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # create_embeddings_with_titan_bedrock returns the query embedding as its third element
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            # Each query has its own index, named after the interaction ID
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results
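The search function referenced above isn't shown in the post. Against an OpenSearch k-NN index like the one sketched earlier, a query might look like the following, returning chunks in the {'text': ...} shape the inference code expects; this is an assumption, not FloTorch's code:

def search(index_name, embedding, k):
    """Retrieve the k nearest chunks to the query embedding from the given index."""
    query = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": embedding, "k": k}}},
    }
    response = client.search(index=index_name, body=query)
    return [{"text": hit["_source"]["text"]} for hit in response["hits"]["hits"]]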

Inferencing

FloTorch accessed the GPT-4o model through the OpenAI API and the Amazon Nova Pro model through the Amazon Bedrock Converse API. GPT-4o supports a context window of 128,000 tokens, compared to a context window of 300,000 tokens for Amazon Nova Pro. The maximum output token limit of GPT-4o is 16,384 vs. the Amazon Nova Pro maximum output token limit of 5,000. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

from typing import Dict, List

from tqdm import tqdm

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        # Retrieve the top search_k chunks for every query in the batch
        search_results_list = search_results(interaction_ids, queries, search_k)

        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results

Evaluation

Both models were evaluated by running batch queries. A batch of eight was selected to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import logging
import time
from typing import Dict, List

import requests

logger = logging.getLogger(__name__)

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]
        retrieval_results = item.get("search_results", [])

        # Build a references section from the retrieved chunks
        references = "# References \n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            # Record end-to-end latency for this request
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.

Latency

Latency for each query response was calculated as the difference between two timestamps: when the API call is made to the inference LLM, and when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled monitoring of input and output tokens and latency as described earlier.

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published price per million input and output tokens separately. The calculation methodology involved multiplying input tokens by corresponding rates and applying the same process for output tokens. The total cost for running 200 queries was determined by combining input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference level between Amazon Nova Pro and GPT-4o LLMs.
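The per-request arithmetic is simple enough to sketch; the rates and token counts below are placeholders for illustration only, not published prices:

def inference_cost(input_tokens, output_tokens, input_rate_per_million, output_rate_per_million):
    """Cost of one request given token counts and per-million-token prices."""
    return (input_tokens / 1_000_000) * input_rate_per_million \
         + (output_tokens / 1_000_000) * output_rate_per_million

# Placeholder rates and token counts; check the providers' pricing pages for current numbers
print(inference_cost(input_tokens=1800, output_tokens=150,
                     input_rate_per_million=0.80, output_rate_per_million=3.20))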

Results

The following table summarizes the results.

Metric | Amazon Nova Pro | GPT-4o | Observation
Accuracy on subset of the CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5% on accuracy
Cost for running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from each of 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
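The monthly savings figure is consistent with reading the cost numbers in the table as average per-query inference costs; under that assumption, the arithmetic is:

# Treating the table's cost figures as average per-query costs
nova_pro_cost_per_query = 0.00030205
gpt_4o_cost_per_query = 0.000869537
monthly_queries = 10_000_000   # 10,000 users x 1,000 questions each

monthly_savings = (gpt_4o_cost_per_query - nova_pro_cost_per_query) * monthly_queries
print(f"${monthly_savings:,.2f} per month")   # roughly $5,675 per month, about $68,098 per year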

Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.

Metric | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation
Accuracy on subset of the CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Both Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini, by 2 and 4 points, respectively
Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and less expensive than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, switching from GPT-4o mini to Amazon Nova Micro would cut inference costs by roughly 73% while delivering slightly more accurate results.

Conclusion

Based on these tests for RAG cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency compared to GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future research will include additional experiments with various query types such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions in the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version that gives customers scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version runs in your AWS account, so you can experiment with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.


About the author

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement state-of-the-art generative AI solutions and agentic workflows for enterprises.

Deploy DeepSeek-R1 distilled models on Amazon SageMaker using a Large Model Inference container

DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning (RL) to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to produce more accurate, transparent, and detailed answers. This model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.

DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.

DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Alibaba’s Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.

In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.

Solution overview

You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.

SageMaker AI offers a choice of which serving container to use for deployments:

  • LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
  • TGI container – A Hugging Face Text Generation Inference (TGI) container. You can find more details in the following GitHub repo.

In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.

LMI containers

LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.

LMI containers provide many features, including:

  • Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
  • Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
  • Continuous batching for maximizing throughput at high concurrency
  • Token streaming
  • Quantization through AWQ, GPTQ, FP8, and more
  • Multi-GPU inference using tensor parallelism
  • Serving LoRA fine-tuned models
  • Text embedding to convert text data into numeric vectors
  • Speculative decoding support to decrease latency

LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables you to use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host DeepSeek-R1-Distill-Llama-8B on an ml.g5.2xlarge SageMaker hosting instance.

Deploy DeepSeek-R1 for inference

The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker AI Studio.

  1. Configure the SageMaker execution role and import the necessary libraries:
!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

import json
import boto3
import sagemaker

# Set up IAM Role
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:

  • Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that has model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
  • Deploy directly from Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”). However, this method tends to be slower and can take significantly longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.
  2. In this example, we deploy the model directly from the Hugging Face Hub:
vllm_config = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}

The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that can be processed by the endpoint. We set it to 16 to limit GPU memory requirements. You should adjust it based on your latency and throughput requirements.
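The Model object created in the next step references inference_image_uri, model_name, and endpoint_name, which aren't defined in the snippets shown here. A minimal sketch of how you might set them follows; the names are placeholders and the container URI is only an example, since the notebook in the GitHub repo pins the exact LMI image to use:

import sagemaker

region = sagemaker.Session().boto_region_name

# Placeholder names; choose your own
model_name = "deepseek-r1-distill-llama-8b-lmi"
endpoint_name = "deepseek-r1-distill-llama-8b-endpoint"

# Example LMI (djl-inference) container URI; the account ID and tag vary by Region and release,
# so copy the exact URI from the notebook or the LMI container documentation
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"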

  3. Create and deploy the model:
# Create a Model object
lmi_model = sagemaker.Model(
    image_uri = inference_image_uri,
    env = vllm_config,
    role = role,
    name = model_name,
    enable_network_isolation=False, # Must be False when pulling model artifacts from the Hugging Face Hub; set to True when deploying weights from Amazon S3
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    }
)
# Deploy to SageMaker
lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.g5.2xlarge",
    container_startup_health_check_timeout = 1600,
    endpoint_name = endpoint_name,
)
  4. Make inference requests:
sagemaker_client = boto3.client('sagemaker-runtime', region_name='us-east-1')
# endpoint_name was defined earlier and passed to deploy(), so it can be reused here

input_payload = {
    "inputs": "What is Amazon SageMaker? Answer concisely.",
    "parameters": {"max_new_tokens": 250, "temperature": 0.1}
}

serialized_payload = json.dumps(input_payload)

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=serialized_payload
)
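The invoke_endpoint call returns the payload as a streaming body; a minimal way to read the generated output follows (the exact response schema depends on the LMI backend configuration):

result = json.loads(response['Body'].read().decode('utf-8'))
print(result)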

Performance and cost considerations

The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale inference, use larger batch sizes for real-time inference to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint usage to optimize costs.

Clean up

Clean up your resources when they’re no longer needed:

predictor.delete_endpoint()
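If you don't have a Predictor object handy (Model.deploy only returns one when a predictor class is configured), an equivalent cleanup with the low-level Boto3 client might look like the following sketch. It assumes the SDK reused endpoint_name for the endpoint configuration, which is its default behavior, and that model_name is the name given to the Model object:

import boto3

sm_client = boto3.client("sagemaker")
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)  # assumes config name matches endpoint name
sm_client.delete_model(ModelName=model_name)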

Security

You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.

By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.

SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We do not share your data with model providers, unless you direct us to, providing you full control over your data. This applies to all models—both proprietary and publicly available, including DeepSeek-R1 on SageMaker.

For more details, see Configure security in Amazon SageMaker AI.

Logging and monitoring

You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.

Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.

For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.

Best practices

It’s always recommended to deploy your LLM endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.

Always apply guardrails to make sure incoming prompts and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoint model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
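If you have already created a guardrail in Amazon Bedrock, you can validate text going to or coming from a SageMaker endpoint with the ApplyGuardrail API. The following is a minimal sketch; the guardrail ID and version are placeholders for a guardrail you have created:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

model_response_text = "..."  # text returned by your SageMaker endpoint

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<your-guardrail-id>",   # placeholder
    guardrailVersion="1",
    source="OUTPUT",   # use "INPUT" to validate user prompts instead
    content=[{"text": {"text": model_response_text}}],
)
if result["action"] == "GUARDRAIL_INTERVENED":
    print("The guardrail blocked or modified this response")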

Inference performance evaluation

In this section, we focus on inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating the performance of LLMs in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants—1.5B, 7B, 8B, 14B, 32B, and 70B—across four performance metrics:

  • End-to-end latency (time between sending a request and receiving the response)
  • Token throughput
  • Time to first token
  • Inter-token latency

The main purpose of this performance evaluation is to give you an indication about relative performance of distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated like a best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and traffic patterns as well as I/O sequence length.
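As an illustration of how these metrics can be captured, the following sketch measures end-to-end latency, time to first token, and a rough inter-token latency against a SageMaker endpoint using response streaming. It assumes the endpoint name from earlier and that the container is configured to stream tokens; it is not the benchmarking harness used for these results:

import json
import time

import boto3

smr = boto3.client("sagemaker-runtime")

payload = {"inputs": "What is Amazon SageMaker?", "parameters": {"max_new_tokens": 256}}

start = time.time()
stream = smr.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,   # endpoint deployed earlier
    ContentType="application/json",
    Body=json.dumps(payload),
)

first_token_latency = None
chunk_timestamps = []
for event in stream["Body"]:
    if "PayloadPart" in event:
        now = time.time()
        if first_token_latency is None:
            first_token_latency = now - start   # time to first token
        chunk_timestamps.append(now)

end_to_end_latency = time.time() - start
if len(chunk_timestamps) > 1:
    # Rough inter-token latency: average gap between streamed chunks
    inter_token_latency = (chunk_timestamps[-1] - chunk_timestamps[0]) / (len(chunk_timestamps) - 1)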

Scenarios

We tested the following scenarios:

  • Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
  • Tokens – We evaluated SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran both tests 50 times each before measuring the average across the different metrics. Then we repeated the test with concurrency 10.
    • Short-length test – 512 input tokens and 256 output tokens.
    • Medium-length test – 3072 input tokens and 256 output tokens.
  • Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and red indicates that a model wasn’t tested with that instance type, either because the instance was excessive for a given model size or too small to fit the model in memory.

Deployment options

Box plots

In the following sections, we use a box plot to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.

Box plot example
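For readers who want to reproduce this kind of chart, a minimal matplotlib sketch follows; the latency values are made-up placeholders, not measurements from this evaluation:

import matplotlib.pyplot as plt

# Hypothetical end-to-end latencies (seconds) per instance type
latencies_by_instance = {
    "ml.g5.xlarge": [2.1, 2.3, 2.2, 2.8, 2.0],
    "ml.g6.xlarge": [2.6, 2.9, 2.7, 3.4, 2.5],
}

plt.boxplot(list(latencies_by_instance.values()), labels=list(latencies_by_instance.keys()))
plt.ylabel("End-to-end latency (s)")
plt.title("Latency distribution per instance type")
plt.show()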

DeepSeek-R1-Distill-Qwen-1.5B

This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.

The following figure illustrates testing with concurrency = 1.

Qwen-1.5-C1

The following figure illustrates testing with concurrency = 10.

Qwen-1.5-C10

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Among all instances, ml.g6e.2xlarge demonstrated the highest performance.

The following figure illustrates testing with concurrency = 1.

Qwen-7-C1

The following figure illustrates testing with concurrency = 10.

Qwen-7-C10

DeepSeek-R1-Distill-Llama-8B

DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.

The following figure illustrates testing with concurrency = 1.

Llama-8B-C1

The following figure illustrates testing with concurrency = 10.

Llama-8B-C10

DeepSeek-R1-Distill-Qwen-14B

We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.

The following figure illustrates testing with concurrency = 1.

Qwen-14B-C1

The following figure illustrates testing with concurrency = 10.

Qwen-14B-C10

DeepSeek-R1-Distill-Qwen-32B

This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.

The following figure illustrates testing with concurrency = 1.

Qwen-32B-C1

The following figure illustrates testing with concurrency = 10.

Qwen-32B-C10

DeepSeek-R1-Distill-Llama-70B

We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.

The following figure illustrates testing with concurrency = 1.

Llama-70B-C1

The following figure illustrates testing with concurrency = 10.

Llama-70B-C10

Conclusion

Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.

The performance evaluation section presents a comprehensive performance evaluation of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to assist in the selection of the optimal instance type for deploying the DeepSeek-R1 solution.

Check out the complete code in the following GitHub repos:

For additional resources, refer to:


About the Authors

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

From fridge to table: Use Amazon Rekognition and Amazon Bedrock to generate recipes and combat food waste

In today’s fast-paced world, time is of the essence and even basic tasks like grocery shopping can feel rushed and challenging. Despite our best intentions to plan meals and shop accordingly, we often end up ordering takeout, leaving unused perishable items to spoil in the refrigerator. This seemingly small issue of wasted groceries, paired with the about-to-perish grocery supplies thrown away by grocery stores, contributes significantly to the global food waste problem. This post demonstrates how we can help solve this problem by harnessing the power of generative AI on AWS.

By using computer vision capabilities through Amazon Rekognition and the content generation capabilities offered by foundation models (FMs) available through Amazon Bedrock, we developed a solution that will recommend recipes based on what you already have in your refrigerator and an inventory of about-to-expire items in local supermarkets, making sure that both food in your home and food in grocery stores are used, saving money and reducing waste.

In this post, we walk through how to build the FoodSavr solution (fictitious name used for the purposes of this post) using Amazon Rekognition Custom Labels to detect the ingredients and generate personalized recipes using Anthropic’s Claude 3.0 on Amazon Bedrock. We demonstrate an end-to-end architecture where a user can upload an image of their fridge, and using the ingredients found there (detected by Amazon Rekognition), the solution will give them a list of recipes (generated by Amazon Bedrock). The architecture also recognizes missing ingredients and provides the user with a list of nearby grocery stores.

Solution overview

The following reference architecture shows how you can use Amazon Bedrock, Amazon Rekognition, and other AWS services to implement the FoodSavr solution.

As shown in the preceding figure, the architecture includes the following steps:

  1. For an end-to-end solution, we recommend having a frontend where your users can upload images of items that they want detected and labeled. To learn more about frontend deployment on AWS, see Front-end Web & Mobile on AWS.
  2. The picture taken by the user is stored in an Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket should be configured with a lifecycle policy that deletes the image after use. To learn more about S3 lifecycle policies, see Managing your storage lifecycle.
  3. This architecture uses different AWS Lambda functions. Lambda is a serverless AWS compute service that runs event-driven code and automatically manages the compute resources. The first Lambda function, DetectIngredients, harnesses the power of Amazon Rekognition by using the Boto3 Python API. Amazon Rekognition is a cutting-edge computer vision service that uses machine learning (ML) models to analyze the uploaded images.
  4. We use Rekognition Custom Labels to train a model with a dataset of ingredients. You can adopt this architecture to use Rekognition Custom Labels with your own use case. With the aid of custom labels trained to recognize various ingredients, Amazon Rekognition identifies the items present in the images.
  5. The detected ingredient names are then securely stored in an Amazon DynamoDB (a fully managed NoSQL database service) table for retrieval and modification. Users are presented with a list of the ingredients that have been detected, along with the option to add other ingredients or delete ingredients that they might not want or that were misidentified.
  6. After the ingredient list is confirmed by the user through the web interface, they can initiate the recipe generation process with a click of a button. This action invokes another Lambda function called GenerateRecipes, which uses the advanced language capabilities of the Amazon Bedrock API (Anthropic’s Claude v3 in this post). This state-of-the-art FM analyzes the confirmed ingredient list retrieved from DynamoDB and generates relevant recipes tailored to those specific ingredients. Additionally, the model provides images to accompany each recipe, providing a visually appealing and inspiring culinary experience.
  7. Amazon Bedrock contains two key FMs that are used for this solution example: Anthropic’s Claude v3 (newer versions have been released since the writing of this post) and Stable Diffusion, used for recipe generation and image generation respectively. For this solution, you can use any combination of FMs that suit your use case. The generated content (recipes as text and recipe images, in this case) can then be displayed to the user on the frontend.
  8. For this use case, you can also set up an optional ordering pipeline, which allows a user to place orders for the ingredients described by the FMs. This would be fronted by a Lambda function, FindGroceryItems, that can look for the recommended grocery items in a database contributed to by local supermarkets. This database would consist of about-to-expire ingredients along with prices for those ingredients.

In the following sections, we dive into how you can set up this architecture on your own account. Step 8 is optional and therefore not covered in this post.

Using Amazon Rekognition to detect images

The image recognition is powered by Amazon Rekognition, which offers pre-trained and customizable computer vision capabilities to allow users to obtain information and insights from their images. For customizability, you can use Rekognition Custom Labels to identify scenes and objects in your images that are specific to your business needs. If your images are already labeled, you can begin training a model from the Amazon Rekognition console. Otherwise, you can label them directly from the Amazon Rekognition labeling interface, or use other services such as Amazon SageMaker Ground Truth. The following screenshot shows an example of what the bounding box process would look like on the Amazon Rekognition labeling interface.

To get started with labeling, see Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions. For this architecture, we collected a dataset of up to 70 images of common food items typically found in refrigerators. We recommend that you gather your own relevant images and store them in an S3 bucket to use for training with Amazon Rekognition. You can then use Rekognition Custom Labels to create labels with food names, and assign bounding boxes on the images so the model knows where to look. To get started with training your own custom model, see Training an Amazon Rekognition Custom Labels model.

When model training is complete, you will see all your trained models under Projects on the AWS Management Console for Amazon Rekognition. Here, you can also look at the model performance, measured by the F1 score (shown in the following screenshot).

You can also iterate and modify your existing models to create newer versions. Before using your model, make sure it’s in the RUNNING state. To use the model, choose the model you want to use, and on the Use model tab, choose Start.

You also have the option to programmatically start and stop your model (the exact API call can be copied from the Amazon Rekognition console, but the following is provided as an example):
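Because the rest of this post uses the Boto3 SDK, here is a minimal sketch of starting and stopping the model programmatically; the model ARN is a placeholder you copy from the console:

import boto3

rekognition = boto3.client("rekognition")

# Start the trained model (project version); MinInferenceUnits controls provisioned throughput
rekognition.start_project_version(
    ProjectVersionArn="MODEL_ARN",
    MinInferenceUnits=1,
)

# Stop it again when you no longer need it to avoid ongoing charges
rekognition.stop_project_version(ProjectVersionArn="MODEL_ARN")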

Use the following API call (which is present in the Lambda function) to detect groceries in an image using your custom labels and custom model:

aws rekognition detect-custom-labels \
--project-version-arn "MODEL_ARN" \
--image '{"S3Object": {"Bucket": "MY_BUCKET","Name": "PATH_TO_MY_IMAGE"}}' \
--region us-east-1

To stop incurring costs, you can also stop your model when not in use:

aws rekognition stop-project-version \
--project-version-arn "MODEL_ARN" \
--region us-east-1

Because we’re using Python, the boto3 Python package is used to make all AWS API calls mentioned in this post. For more information about Boto3, see the Boto3 documentation.

Starting a model might take a few minutes to complete. To check the current status of the model readiness, check the details page for the project or use DescribeProjectVersions. Wait for the model status to change to RUNNING.

In the meantime, you can explore the different statistics provided by Amazon Rekognition about your model. Some notable ones are the model performance (F1 score), precision, and recall. These statistics are gathered by Amazon Rekognition at both the model level (as seen in the earlier screenshot) and the individual custom label level (as shown in the following screenshot).

For more information on these statistics, see Metrics for evaluating your model.

Be aware that, while Anthropic’s Claude models offer impressive multi-modal capabilities for understanding and generating content based on text and images, we chose to use Amazon Rekognition Custom Labels for ingredient detection in this solution. Amazon Rekognition is a specialized computer vision service optimized for tasks such as object detection and image classification, using state-of-the-art models trained on massive datasets. Additionally, Rekognition Custom Labels allows us to train custom models tailored to recognize specific food items and ingredients, providing a level of customization that might not be as straightforward with a general-purpose language model. Furthermore, as a fully managed service, Amazon Rekognition can scale seamlessly to handle large volumes of images. While a hybrid approach combining Rekognition and Claude’s multi-modal capabilities could be explored, we chose Rekognition Custom Labels for its specialized computer vision capabilities, customizability, and to demonstrate combining FMs on Amazon Bedrock with other AWS services for this specific use case.

Using Amazon Bedrock FMs to generate recipes

To generate the recipes, we use Amazon Bedrock, a fully managed service that offers high-performing FMs. We use the Amazon Bedrock API to query Anthropic’s Claude v3 Sonnet model. We use the following prompt to provide context to the FM:

You are an expert chef, with expertise in diverse cuisines and recipes. 
I am currently a novice and I require you to write me recipes based on the ingredients provided below. 
The requirements for the recipes are as follows:
- I need 3 recipes from you
- These recipes can only use ingredients listed below, and nothing else
- For each of the recipes, provide detailed step by step methods for cooking. Format it like this:
1. Step 1: <instructions>
2. Step 2: <instructions>
...
n. Step n: <instructions>
Remember, you HAVE to use ONLY the ingredients that are provided to you. DO NOT use any other ingredient. 
This is crucial. For example, if you are given ingredients "Bread" and "Butter", you can ONLY use Bread and Butter, 
and no other ingredient can be added on. 
An example recipe with these two can be:
Recipe 1: Fried Bread
Ingredients:
- Bread
- Butter
1. Step 1: Heat up the pan until it reaches 40 degrees
2. Step 2: Drop in a knob of butter and melt it
3. Step 3: Once butter is melted, add a piece of bread onto pan
4. Step 4: Cook until the bread is browned and crispy
5. Step 5: Repeat on the other side
6. Step 6: You can repeat this for other breads, too

The following code is the body of the Amazon Bedrock API call:

import json

import boto3

bedrock = boto3.client(service_name='bedrock-runtime')

# user_ingredients_str: ingredients detected in the user's fridge (retrieved from DynamoDB)
# master_ingredients_str: about-to-expire grocery store ingredients (retrieved from DynamoDB)
# prompt: the prompt shown above
content = ("Here is a list of ingredients that a person currently has." + user_ingredients_str +
           "\n\n And here are a list of ingredients at a local grocery store " + master_ingredients_str + prompt)

body = json.dumps({
    "max_tokens": 2047,
    "messages": [{"role": "user", "content": content}],
    "anthropic_version": "bedrock-2023-05-31"
})

modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock.invoke_model(body=body, modelId=modelId)

Using the combination of the prompt and API call, we generate three recipes using the ingredients retrieved from the DynamoDB table. You can add additional parameters to body such as temperature, top_p, and top_k to further set thresholds for your prompt. For more information on getting responses from Anthropic’s Claude 3 model using the Amazon Bedrock API, see Anthropic Claude Messages API. We recommend setting the temperature to something low (such as 0.1 or 0.2) to help ensure deterministic and structured generation of recipes. We also recommend setting the top_p value (nucleus sampling) to something high (such as 0.9) to limit the FM’s predictions to the most probable tokens (in this case, the model will consider the most probable tokens that make up 90% of the total probability mass for its next prediction). top_k is another sampling technique that limits the model’s predictions to the top_k most probable tokens. For example, if top_k = 10, the model will only consider the 10 most probable tokens for its next prediction.

One of the key benefits of using Amazon Bedrock is the ability to use multiple FMs for different tasks within the same solution. In addition to generating textual recipes with Anthropic’s Claude 3, we can also dynamically generate visually appealing images to accompany those recipes. For this task, we chose to use the Stable Diffusion model available on Amazon Bedrock. Amazon Bedrock also offers other powerful image generation models such as Titan, and we’ve given you an example API call for that, too. Similar to using the Amazon Bedrock API to generate a response from Anthropic’s Claude 3, we use the following code:

modelId = "stability.stable-diffusion-xl-v0" 
accept = "application/json"
contentType = "application/json"

body = json.dumps({
"text_prompts": [
{
"text": recipe_name
}
], 
"cfg_scale": 10,
"seed": 20,
"steps": 50
})

response = brt.invoke_model(
body = body,
modelId = modelId,
accept = accept, 
contentType = contentType
)

For Titan, you might use something like:

modelId="amazon.titan-image-generator-v1",
accept="application/json", 
contentType="application/json"

body = json.dumps({
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text":prompt,   # Required
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,   # Range: 1 to 5 
        "quality": "premium",  # Options: standard or premium
        "height": 768,         # Supported height list in the docs 
        "width": 1280,         # Supported width list in the docs
        "cfgScale": 7.5,       # Range: 1.0 (exclusive) to 10.0
        "seed": 42             # Range: 0 to 214783647
    }
})

response = brt.invoke_model(
body = body, 
modelId = modelId,
accept = accept,
contentType = contentType
)

This returns a base64-encoded string that you need to decode in your frontend so that you can display it. For more information about other parameters that you can include in your API call, see Stability.ai Diffusion 1.0 text to image, and Using Amazon Bedrock to generate images with Titan Image Generator models. In the following sections, we walk through the steps to deploy the solution in your AWS account.
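Putting these pieces together, the image_gen helper imported by the GenerateRecipes Lambda function shown later could look roughly like the following sketch, which invokes Stable Diffusion and returns the base64 string for the frontend to decode. The module and function names mirror the Lambda code, but the implementation here is illustrative:

import base64
import json

import boto3

brt = boto3.client(service_name="bedrock-runtime")

def image_gen(recipe_name):
    """Generate one image for a recipe name and return it as a base64-encoded string."""
    body = json.dumps({
        "text_prompts": [{"text": recipe_name}],
        "cfg_scale": 10,
        "seed": 20,
        "steps": 50,
    })
    response = brt.invoke_model(
        body=body,
        modelId="stability.stable-diffusion-xl-v0",
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response["body"].read())
    # Stable Diffusion on Amazon Bedrock returns the generated image in the artifacts list
    imgstr = response_body["artifacts"][0]["base64"]
    # The frontend can decode it for display: image_bytes = base64.b64decode(imgstr)
    return imgstr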

Prerequisites

You need an AWS account to deploy this solution. If you don’t have an existing account, you can sign up for one. The instructions in this post use the us-east-1 AWS Region. Make sure you deploy your resources in a Region with AWS Machine Learning services available. For the Lambda functions to run successfully, Lambda requires an AWS Identity and Access Management (IAM) role and policy with the appropriate permissions. Complete the necessary steps from Defining Lambda function permissions with an execution role to create and attach a Lambda execution role for the Lambda functions to access all necessary actions for DynamoDB, Amazon Rekognition, and Amazon Bedrock.

Create the Lambda function to detect ingredients

Complete the following steps to create your first Lambda function (DetectIngredients):

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function DetectIngredients, select Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import inference  # helper module that wraps the Rekognition detect_custom_labels call

s3 = boto3.client('s3')

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TestDataTable')
table_name = 'TestDataTable'

def lambda_handler(event, context):
    # Clear out results from the previous image before writing new ones
    clearTable()

    labels, label_count = inference.main()

    # The names list will contain all the grocery ingredients detected in the image
    names = []
    for label_dic in labels:
        name = label_dic['Name']
        # Getting rid of unnecessary parts of the label string
        if "Food" in name:
            name = name.replace("Food", "")
        if "In Fridge" in name:
            name = name.replace("In Fridge", "")
        name = name.strip()
        names.append(name)

    # Loop through the list of grocery ingredients to construct the items list;
    # DynamoDB batch writes accept up to 25 items at a time
    items = []
    for name in names:
        if len(items) < 25:
            items.append({
                'grocery_item': name
            })

    # Remove all duplicates from the list (case-insensitive)
    seen = set()
    unique_grocery_items = []
    for item in items:
        val = item['grocery_item'].lower().strip()
        if val not in seen:
            unique_grocery_items.append(item)
            seen.add(val)

    batch_write_all(unique_grocery_items)

    # Sentinel item that signals the detection step is complete
    table.put_item(
        Item={
            'grocery_item': "DONE"
        })

def batch_write_all(items):
    batch_write_requests = [{
        'PutRequest': {
            'Item': item
        }
    } for item in items]

    dynamodb.batch_write_item(
        RequestItems={
            table_name: batch_write_requests
        }
    )

def clearTable():
    response = table.scan()
    with table.batch_writer() as batch:
        for each in response['Items']:
            batch.delete_item(
                Key={
                    'grocery_item': each['grocery_item']
                }
            )
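The DetectIngredients function imports an inference helper module that isn't shown in the post. A minimal sketch of what it might look like follows; it wraps the detect-custom-labels call shown earlier, and the bucket, key, and model ARN are placeholders:

import boto3

rekognition = boto3.client("rekognition")

def main():
    """Run custom-label detection on the uploaded image and return the labels and their count."""
    response = rekognition.detect_custom_labels(
        ProjectVersionArn="MODEL_ARN",
        Image={"S3Object": {"Bucket": "MY_BUCKET", "Name": "PATH_TO_MY_IMAGE"}},
        MinConfidence=70,
    )
    labels = response["CustomLabels"]   # each label dict includes a 'Name' key
    return labels, len(labels)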

Create a DynamoDB table to store ingredients

Complete the following steps to create your DynamoDB table.

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.
  3. For Table name, enter MasterGroceryDB.
  4. For Partition key, use grocery_item (string).
  5. Verify that all entries on the page are accurate, leave the rest of the settings as default, and choose Create.

Wait for the table creation to complete and for your table status to change to Active before proceeding to the next step.

Create the Lambda function to call Amazon Bedrock

Complete the following steps to create another Lambda function that will call the Amazon Bedrock APIs to generate recipes:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function GenerateRecipes, choose Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import re
import base64
import image_gen

dynamodb = boto3.resource('dynamodb')

bedrock = boto3.client(service_name='bedrock-runtime')

def get_ingredients(tableName):
    table = dynamodb.Table(tableName)
    response = table.scan()
    data = response['Items']

    # Support for pagination
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    data = [g_i for g_i in data if g_i['grocery_item'] != 'DONE']
    return data


# Converts DynamoDB grocery items into a comma-separated string
def convertItemsToString(grocery_dict):
    ingredients_list = []
    for each in grocery_dict:
        ingredients_list.append(each['grocery_item'])
    ingredients_list_str = ", ".join(ingredients_list)
    return ingredients_list_str

def read_prompt():
    with open ('Prompt.md', 'r') as f:
        text = f.read() 
    return text

# Gets the names of all the recipes generated by looking for lines that
# start with "Recipe" after a blank line in the model output
def get_recipe_names(response_body):
    recipe_names = []
    for i in range(len(response_body) - 2):
        if response_body[i] == '\n' and response_body[i + 1] == '\n' and response_body[i + 2] == 'R':
            recipe_str = ""
            while i + 2 < len(response_body) and response_body[i + 2] != '\n':
                recipe_str += response_body[i + 2]
                i += 1
            # Strip the "Recipe <number>: " prefix to keep just the name
            recipe_str = recipe_str.replace("Recipe", '')
            recipe_str = recipe_str.replace(": ", '')
            recipe_str = re.sub(r" \d+", "", recipe_str)
            recipe_names.append(recipe_str)
    return recipe_names

def lambda_handler(event, context):
    # Write the ingredients to a .md file
    user_ingredients_dict = get_ingredients('TestDataTable')
    master_ingredients_dict = get_ingredients('MasterGroceryDB')

    # Get string values for ingredients in both databases
    user_ingredients_str = convertItemsToString(user_ingredients_dict)
    master_ingredients_str = convertItemsToString(master_ingredients_dict)

    # The comma-separated ingredient strings are passed into the prompt below

    # Read the prompt + ingredients file
    prompt = read_prompt()
    # Query for recipes using prompt + ingredients

    content = "Here is a list of ingredients that a person currently has." + user_ingredients_str + "nn And here are a list of ingredients at a local grocery store " + master_ingredients_str + prompt

    body = json.dumps({
        "max_tokens": 2047,
        "messages": [{"role": "user", "content": content}],
        "anthropic_version": "bedrock-2023-05-31"
    })


    modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

    
    response = bedrock.invoke_model(body=body, modelId=modelId)


    response_body = json.loads(response.get('body').read())
    response_body_content = response_body.get("content")
    response_body_completion = response_body_content[0]['text']

    recipe_names_list = get_recipe_names(response_body_completion)

    first_image_imgstr = image_gen.image_gen(recipe_names_list[0])
    second_image_imgstr = image_gen.image_gen(recipe_names_list[1])
    third_image_imgstr = image_gen.image_gen(recipe_names_list[2])

    return response_body_completion, first_image_imgstr, second_image_imgstr, third_image_imgstr

Create an S3 bucket to store the images

Lastly, you create an S3 bucket to store the images you upload, which automatically invokes the DetectIngredients Lambda function after each upload. Complete the following steps to create the bucket and configure the Lambda function:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a unique bucket name, set the desired Region to us-east-1, and choose Create bucket.
  4. On the Lambda console, navigate to the DetectIngredients function.
  5. On the Configuration tab, choose Add trigger.
  6. Select the trigger type as S3 and choose the bucket you created.
  7. Set Event type to All object create events and choose Add.
  8. On the Amazon S3 console, navigate to the bucket you created.
  9. Under Properties and Event Notifications, choose Create event notification.
  10. Enter an event name (for example, Trigger DetectIngredients) and set the events to All object create events.
  11. For Destination, select Lambda Function and select the DetectIngredients Lambda function.
  12. Choose Save.
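These console steps can also be scripted. The following boto3 sketch is illustrative only; the bucket name, account ID, and function ARN are placeholders that you would replace with your own values.

import boto3

bucket_name = 'your-ingredient-images-bucket'  # placeholder
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:DetectIngredients'  # placeholder

# Allow Amazon S3 to invoke the DetectIngredients function
boto3.client('lambda').add_permission(
    FunctionName='DetectIngredients',
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{bucket_name}'
)

# Invoke the function on every object create event in the bucket
boto3.client('s3').put_bucket_notification_configuration(
    Bucket=bucket_name,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*']
        }]
    }
)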

Conclusion

In this post, we explored the use of Amazon Rekognition and FMs on Amazon Bedrock with AWS services such as Lambda and DynamoDB to build a comprehensive solution that addresses food waste in the US. Using AWS services including Rekognition Custom Labels and content generation with models on Amazon Bedrock, this application demonstrates the value and practical potential of AWS generative AI capabilities.

Stay on the lookout for a follow-up to this post, where we demonstrate using the multi-modal capabilities of FMs such as Anthropic’s Claude v3.1 on Amazon Bedrock to deploy this entire solution end-to-end.

Although we highlighted a food waste use case in this post, we encourage you to apply this solution to your own use cases. The flexibility of this architecture allows you to adapt these services to multiple scenarios, enabling you to solve a wide range of challenges.

Special thanks to Tommy Xie and Arnav Verma for their contributions to the blog.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Michael Lue is a Sr. Solution Architect at AWS Canada based out of Toronto. He works with Canadian enterprise customers to accelerate their business through optimization, innovation, and modernization. He is particularly passionate and curious about disruptive technologies like containers and AI/ML. In his spare time, he coaches and plays tennis and enjoys hanging at the beach with his French Bulldog, Marleé.

Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Read More

Scaling Recommendation Systems Training to Thousands of GPUs with 2D Sparse Parallelism

Scaling Recommendation Systems Training to Thousands of GPUs with 2D Sparse Parallelism

At Meta, recommendation systems are the cornerstone of delivering relevant and personalized ads to billions of users globally. Through technologies like PyTorch’s TorchRec, we’ve successfully developed solutions that enable model training across hundreds of GPUs. While these systems have served us well, recent research on scaling laws has revealed a compelling opportunity: we can achieve significantly better model performance by training dramatically larger neural networks.

However, this insight presents us with a new challenge. Our current training infrastructure, though highly optimized for hundreds of GPUs, cannot efficiently scale to the thousands of GPUs needed to train these larger models. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models. These challenges require fundamentally new approaches to distributed training, which we address with a novel parallelization strategy.

To address these issues, we introduced 2D embedding parallel, a novel parallelism strategy that overcomes the sparse scaling challenges inherent in training large recommendation models across thousands of GPUs. This is available today in TorchRec through the DMPCollection API. This approach combines two complementary parallelization techniques: data parallelism for the sparse components of the model, and model parallelism for the embedding tables, leveraging TorchRec’s robust sharding capabilities. By strategically integrating these techniques, we’ve created a solution that scales to thousands of GPUs and now powers Meta’s largest recommendation model training runs.

What are the sparse scaling challenges?

We identified three key challenges that prevented us from naively scaling our model to thousands of GPUs:

  • Imbalance and straggler issues: With more GPUs, it's harder to achieve balanced sharding; some ranks end up with a much heavier embedding computation workload, which can slow down the entire training job.
  • Communication across nodes: As training jobs utilize an increased number of GPUs, the all-to-all communication bandwidth can drop under certain network topologies, which can increase communication latency significantly.
  • Memory overhead: The memory used by input features is often negligible; however, as we use thousands of GPUs, we can introduce larger input features, and the memory requirements can become significant.

With 2D embedding parallel, we can describe our new parallelism scheme as follows. In this example, we have two model replicas (Replica 1: GPU1/GPU3; Replica 2: GPU2/GPU4).


Figure 1: Layout illustration of 2D Sparse Parallelism

With 2D sparse parallelism, we address these challenges. Instead of sharding tables across all ranks, we first evenly divide all ranks into several parallel groups:

  1. Within each group, we use model parallelism for the embedding tables, such as column-wise or row-wise sharding. At scale, for our largest tables, we have also developed grid sharding, which shards embedding tables along both the row and column dimensions.
  2. Across groups, we use data parallelism, such that each rank in a group has a corresponding replica rank in the other groups (a replica rank stores the same embedding table shards).
    1. After each group has completed its own backward pass, we all-reduce the embedding table weights across the replicas to keep them synchronized.

Our production solution

TorchRec is our library for building the sparse part of recommendation models in native PyTorch. Its traditional API is DistributedModelParallel, which applies model parallelism to the embedding tables. We introduce a new API alongside it, known as DMPCollection, which serves as the main entry point for enabling 2D parallelism on TorchRec models. We designed it to be as easy a change as applying FSDP/DDP.

To understand what DMPCollection does, we have to understand what DistributedModelParallel (DMP) does first:

  1. Create embedding tables, known as EmbeddingBagCollection and EmbeddingCollections.
  2. Generate a sharding plan with respect to GPU topology, embedding tables, memory available, input data, and more.
  3. Wrap model with DMP and the associated sharding plan passed in.
  4. DMP initializes and shards the embedding tables in accordance with the sharding plan.
  5. On a train step, DMP takes an input batch, communicates it to the appropriate GPUs containing the embedding table shard of interest, looks up the value, and returns it back to the GPU that requested it. This is all done on the global process group, with some exceptions for special sharding (such as table-row-wise sharding).

DistributedModelParallel was built for model parallel with many parts working under the assumption of sharding and working around the global world size. We need to change these parts in a way where we can introduce additional dimensions of parallelism without losing the optimizations and feature set of TorchRec.

DMPCollection changes a few key parts to enable 2D parallelism in an extensible way:

  • Generate the sharding plan for the smaller sharding group once; once it is passed in, we communicate it to the appropriate ranks across the global group and remap the ranks to fit the new sharding group ranks.
  • Create two new NCCL process groups, known as the sharding and replica process groups. The sharding process group is passed into the sharding and train step components of TorchRec. The replica process group is used for weight and optimizer state synchronization; the all-reduce call happens over this process group.
    • The sub NCCL process groups allow us to efficiently communicate only between the ranks that are relevant for a particular comm. Each rank will have two associated process groups.

To the user, the change is very simple, while all the complexity of applying the parallelism strategies to the model is handled under the hood.
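The following is a rough before/after sketch of what that change might look like. The import path and the DMPCollection constructor arguments shown here (such as the sharding group size) are illustrative assumptions rather than the exact signature; check the TorchRec documentation for the current API.

# Before: classic model parallel over the global world size
from torchrec.distributed.model_parallel import DistributedModelParallel

model = DistributedModelParallel(
    module=embedding_module,  # placeholder: for example, an EmbeddingBagCollection
    device=device,            # placeholder: the local CUDA device
)

# After: 2D parallel via DMPCollection
# (argument names below are assumptions for illustration, not the exact signature)
from torchrec.distributed.model_parallel import DMPCollection

model = DMPCollection(
    module=embedding_module,
    device=device,
    sharding_group_size=256,  # L: ranks to apply model parallel across
    world_size=1024,          # T: total trainers; number of replicas = T / L
)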

How do we create these sharding and replication groups?

These process groups are one of the keys to DMPCollection's performant implementation. In our earlier diagram, we showed a simple 2×2 GPU setup; however, at scale, how do we assign which ranks are part of a given sharding group, and what are their replica ranks across the sharding groups?

Consider the following setup with 2 nodes, each with 4 GPUs. The sharding and replication groups under 2D parallel will be as follows:

Sharding Group    Sharding Ranks
0                 0, 2, 4, 6
1                 1, 3, 5, 7

Replication Group    Replication Ranks
0                    0, 1
1                    2, 3
2                    4, 5
3                    6, 7

We use the following formulation:

  1. Divide all trainers into G sharding groups, each with L trainers.
    1. The number of groups, G, is determined by G = T / L, where T is the total number of trainers.
  2. For each group i (where i = 0 to G-1), we assign non-contiguous trainer ranks following:
    1. [i, G+i, 2G+i, …, (L-1)G+i]
  3. From the groups, we create the replication groups, where each set of G contiguous ranks forms one replication group.
    1. (0 to G-1, G to 2G-1, …); each contiguous set stores duplicate embedding table shards.

This means our sharding groups are of size L, which is the number of ranks we apply model parallelism across. This, in turn, gives us replica groups, each of size G, which are the ranks we apply data parallelism across.

In DMPCollection, we’re able to create these process groups efficiently with the use of DeviceMesh, we create the entire GPU topology in a 2×2 matrix, with each row representing the group of sharding ranks and each column representing the corresponding replica ranks,

create peer matrix
num_groups = global_world_size // sharding_group_size
for each group_rank in num_groups:
	peers = [num_groups * rank + group_rank for rank in range(sharding_group_size)]
	add peers to peer matrix

initialize DeviceMesh with two dimensions (shard, replicate)
slice DeviceMesh on shard for sharding process group
slice DeviceMesh on replicate for replica process group
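Translated into runnable PyTorch, the same construction might look like the following minimal sketch. It mirrors the pseudocode above rather than TorchRec's internal implementation, and it assumes torch.distributed has already been initialized across all ranks.

import torch
from torch.distributed.device_mesh import DeviceMesh

global_world_size = 8      # e.g., 2 hosts with 4 GPUs each
sharding_group_size = 4    # L: number of ranks to apply model parallel across
num_groups = global_world_size // sharding_group_size   # G: number of model replicas

# Each row is one sharding group, each column one replication group.
# For 8 ranks and L = 4, this produces [[0, 2, 4, 6], [1, 3, 5, 7]].
peer_matrix = [
    [num_groups * rank + group_rank for rank in range(sharding_group_size)]
    for group_rank in range(num_groups)
]

mesh = DeviceMesh(
    device_type="cuda",
    mesh=torch.tensor(peer_matrix),
    mesh_dim_names=("replicate", "shard"),
)

sharding_pg = mesh["shard"].get_group()      # passed to sharding and train step components
replica_pg = mesh["replicate"].get_group()   # used for weight/optimizer state all-reduce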

With our DeviceMesh approach, should we want to change the topology or provide further flexibility in the future, we can easily extend our creation logic to any form of topologies and even extend for further dimensions of parallelism if needed.

Performance of 2D parallel

Our rank partitioning strategy optimizes communication patterns by strategically placing model replica ranks for each shard within the same compute node. This architecture provides significant performance benefits for the weight synchronization operation. After the backward pass, we perform all-reduce operations to synchronize model weights, which is an expensive operation given the large parameter counts we have to communicate and sync. By placing replicas on the same node, we take advantage of high intra-node bandwidth rather than relying on slower inter-node bandwidth.

This design choice also generally improves the latencies of the other communication collectives. The improvement stems from two factors.

  1. By sharding the embedding tables over a reduced number of ranks and conducting communications for the model within the smaller group, we achieve a lower all-to-all latency.
  2. With the replication in 2D parallel, our embedding lookup latency on a rank is reduced, because we can reduce the local batch size to 1/Nth of the equivalent global batch size, where N is the number of model replicas.

A production model trace exemplifies these two factors. Here, we run the 2D parallel job on 1,024 GPUs, with a sharding group size of 256 GPUs.


Figure 2: Comparing latencies between non 2D parallel and 2D parallel workloads

There are two key levers users have to tune to maximize performance for their workloads:

  1. The size of the model sharding group relative to the global world size. The global world size divided by the sharding group size represents the number of model replicas we will have.
    1. To maximize performance, users can look to scale up their model by up to 8x; this scaling factor keeps the all-reduce within a single host.
      1. For further scaling, the all-reduce would have to happen across hosts. From our experiments, we did not see an obvious performance regression, and in fact noted advantages of an inter-host all-reduce. We can change our sharding and replica topology to an inter-host all-reduce, which can help us introduce fault tolerance strategies should a particular host go down.
  2. The frequency of all-reduce synchronization. DMPCollection comes with a sync() call, which can be tuned to be called every N training steps, performing a form of local SGD training (see the sketch after this list). At scale, reducing the frequency of synchronization can bring significant performance gains.
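To illustrate the second lever, the synchronization cadence reduces to a simple modulo check in the training loop. This is only a sketch, under the assumption that the model has been wrapped in DMPCollection (whose sync() call is described above); the forward pass and loss computation are placeholders.

def train(model, optimizer, dataloader, sync_every_n_steps=16):
    """Training loop sketch: sync embedding weights across replicas every N steps."""
    for step, batch in enumerate(dataloader):
        loss = model(batch)   # placeholder: assumes the wrapped model returns a loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Less frequent synchronization (local SGD style) can improve throughput at scale
        if step % sync_every_n_steps == 0:
            model.sync()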

Future Work

Readers should note that 2D sparse parallel training differs from non-parallelized training because we synchronize the embedding table weights rather than the gradients. This approach is made possible by TorchRec’s use of FBGEMM, which provides optimized kernels under the hood. One of FBGEMM’s key optimizations is the fusion of the optimizer in the backward pass. Instead of fully materializing the embedding table gradients—which would consume significant memory—they are passed directly to the optimizer update. Attempting to materialize and synchronize these gradients would create substantial overhead, making that approach impractical.

Our exploration revealed that to achieve training results comparable to the baseline, we synchronize optimizer states on a delayed schedule, with the timing dependent on the number of sharding/replica groups (i.e., for Adagrad, we update the momentum one sync step behind). This approach also enables users to implement local SGD or semi-synchronized training strategies, which can achieve convergence and potentially produce better loss curves than the baseline.

We thank you for reading our post! This is an exciting direction we have come across that we hope to develop further to maximize performance of recommendation systems and push the state of the art.

Read More

An Efficient and Streaming Audio Visual Active Speaker Detection System

This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges…Apple Machine Learning Research

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

Investment professionals face the mounting challenge of processing vast amounts of data to make timely, informed decisions. The traditional approach of manually sifting through countless research documents, industry reports, and financial statements is not only time-consuming but can also lead to missed opportunities and incomplete analysis. This challenge is particularly acute in credit markets, where the complexity of information and the need for quick, accurate insights directly impacts investment outcomes. Financial institutions need a solution that can not only aggregate and process large volumes of data but also deliver actionable intelligence in a conversational, user-friendly format. The intersection of AI and financial analysis presents a compelling opportunity to transform how investment professionals access and use credit intelligence, leading to more efficient decision-making processes and better risk management outcomes.

Founded in 2013, Octus, formerly Reorg, is the essential credit intelligence and data provider for the world’s leading buy side firms, investment banks, law firms and advisory firms. By surrounding unparalleled human expertise with proven technology, data and AI tools, Octus unlocks powerful truths that fuel decisive action across financial markets. Visit octus.com to learn how we deliver rigorously verified intelligence at speed and create a complete picture for professionals across the entire credit lifecycle. Follow Octus on LinkedIn and X.

Using advanced GenAI, CreditAI by Octus™ is a flagship conversational chatbot that supports natural language queries and real-time data access with source attribution, significantly reducing analysis time and streamlining research workflows. It gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles, helping financial institutions make informed credit decisions while effectively managing risk. Key features include chat history management, being able to ask questions that are targeted to a specific company or more broadly to a sector, and getting suggestions on follow-up questions.

In this post, we demonstrate how Octus migrated its flagship product, CreditAI, to Amazon Bedrock, transforming how investment professionals access and analyze credit intelligence. We walk through the journey Octus took from managing multiple cloud providers and costly GPU instances to implementing a streamlined, cost-effective solution using AWS services including Amazon Bedrock, AWS Fargate, and Amazon OpenSearch Service. We share detailed insights into the architecture decisions, implementation strategies, security best practices, and key learnings that enabled Octus to maintain zero downtime while significantly improving the application’s performance and scalability.

Opportunities for innovation

CreditAI by Octus™ version 1.x uses Retrieval Augmented Generation (RAG). It was built using a combination of in-house and external cloud services on Microsoft Azure for large language models (LLMs), Pinecone for vectorized databases, and Amazon Elastic Compute Cloud (Amazon EC2) for embeddings. Based on our operational experience, and as we started scaling up, we realized that there were several operational inefficiencies and opportunities for improvement:

  • Our in-house services for embeddings (deployed on EC2 instances) were not as scalable and reliable as needed. They also required more time on operational maintenance than our team could spare.
  • The overall solution was incurring high operational costs, especially due to the use of on-demand GPU instances. The real-time nature of our application meant that Spot Instances were not an option. Additionally, our investigation of lower-cost CPU-based instances revealed that they couldn’t meet our latency requirements.
  • The use of multiple external cloud providers complicated DevOps, support, and budgeting.

These operational inefficiencies meant that we had to revisit our solution architecture. It became apparent that a cost-effective solution for our generative AI needs was required. Enter Amazon Bedrock Knowledge Bases. With its support for knowledge bases that simplify RAG operations, vectorized search as part of its integration with OpenSearch Service, availability of multi-tenant embeddings, as well as Anthropic’s Claude suite of LLMs, it was a compelling choice for Octus to migrate its solution architecture. Along the way, it also simplified operations as Octus is an AWS shop more generally. However, we were still curious about how we would go about this migration, and whether there would be any downtime through the transition.

Strategic requirements

To help us move forward systematically, Octus identified the following key requirements to guide the migration to Amazon Bedrock:

  • Scalability – A crucial requirement was the need to scale operations from handling hundreds of thousands of documents to millions of documents. A significant challenge in the previous system was the slow (and relatively unreliable) process of embedding new documents into vector databases, which created bottlenecks in scaling operations.
  • Cost-efficiency and infrastructure optimization – CreditAI 1.x, though performant, was incurring high infrastructure costs due to the use of GPU-based, single-tenant services for embeddings and reranking. We needed multi-tenant alternatives that were much cheaper while enabling elasticity and scale.
  • Response performance and latency – The success of generative AI-based applications depends on the response quality and speed. Given our user base, it’s important that our responses are accurate while valuing users’ time (low latency). This is a challenge when the data size and complexity grow. We want to balance spatial and temporal retrieval in order to give responses that have the best answer and context relevance, especially when we get large quantities of data updated every day.
  • Zero downtime – CreditAI is in production and we could not afford any downtime during this migration.
  • Technological agility and innovation – In the rapidly evolving AI landscape, Octus recognized the importance of maintaining technological competitiveness. We wanted to move away from in-house development and feature maintenance such as embeddings services, rerankers, guardrails, and RAG evaluators. This would allow Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – Octus’s goal is to consolidate cloud providers, and to reduce support overheads and operational complexity.

Migration to Amazon Bedrock and addressing our requirements

Migrating to Amazon Bedrock addressed our aforementioned requirements in the following ways:

  • Scalability – The architecture of Amazon Bedrock, combined with AWS Fargate for Amazon ECS, Amazon Textract, and AWS Lambda, provided the elastic and scalable infrastructure necessary for this expansion while maintaining performance, data integrity, compliance, and security standards. The solution’s efficient document processing and embedding capabilities addressed the previous system’s limitations, enabling faster and more efficient knowledge base updates.
  • Cost-efficiency and infrastructure optimization – By migrating to Amazon Bedrock multi-tenant embedding, Octus achieved significant cost reduction while maintaining performance standards through Anthropic’s Claude Sonnet and improved embedding capabilities. This move alleviated the need for GPU-instance-based services in favor of more cost-effective and serverless Amazon ECS and Fargate solutions.
  • Response performance and latency – Octus verified the quality and latency of responses from Anthropic’s Claude Sonnet to confirm that response accuracy and latency were maintained (or even improved) as part of this migration. With this LLM, CreditAI was now able to respond better to broader, industry-wide queries than before.
  • Zero downtime – We were able to achieve zero downtime migration to Amazon Bedrock for our application using our in-house centralized infrastructure frameworks. Our frameworks comprise infrastructure as code (IaC) through Terraform, continuous integration and delivery (CI/CD), SOC2 security, monitoring, observability, and alerting for our infrastructure and applications.
  • Technological agility and innovation – Amazon Bedrock emerged as an ideal partner, offering solutions specifically designed for AI application development. Amazon Bedrock built-in features, such as embeddings services, reranking, guardrails, and the upcoming RAG evaluator, alleviated the need for in-house development of these components, allowing Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – The comprehensive suite of AWS services offers a streamlined framework that simplifies operations while providing high availability and reliability. This consolidation minimizes the complexity of managing multiple cloud providers and creates a more cohesive technological ecosystem. It also enables economies of scale with development velocity given that over 75 engineers at Octus already use AWS services for application development.

In addition, the Amazon Bedrock Knowledge Bases team worked closely with us to address several critical elements, including expanding embedding limits, managing the metadata limit (250 characters), testing different chunking methods, and syncing throughput to the knowledge base.

In the following sections, we explore our solution and how we addressed the details around the migration to Amazon Bedrock and Fargate.

Solution overview

The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response.

Solution Architecture

In the following sections, we dive into crucial details within key components in our solution. In each case, we connect them to the requirements discussed earlier for readability.

The document ingestion workflow (numbered in blue in the preceding diagram) processes content through five distinct stages:

  1. Documents uploaded to Amazon Simple Storage Service (Amazon S3) automatically invoke Lambda functions through S3 Event Notifications. This event-driven architecture provides immediate processing of new documents.
  2. Lambda functions process the event payload containing document location, perform format validation, and prepare content for extraction. This includes file type verification, size validation, and metadata extraction before routing to Amazon Textract.
  3. Amazon Textract processes the documents to extract both text and structural information. This service handles various formats, including PDFs, images, and forms, while preserving document layout and relationships between content elements.
  4. The extracted content is stored in a dedicated S3 prefix, separate from the source documents, maintaining clear data lineage. Each processed document maintains references to its source file, extraction timestamp, and processing metadata.
  5. The extracted content flows into Amazon Bedrock Knowledge Bases, where our semantic chunking strategy is implemented to divide content into optimal segments. The system then generates embeddings for each chunk and stores these vectors in OpenSearch Service for efficient retrieval. Throughout this process, the system maintains comprehensive metadata to support downstream filtering and source attribution requirements.
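The following is a simplified, illustrative sketch of stages 1 through 3 in the preceding list, not Octus’s production code: a Lambda handler receives the S3 event and starts an asynchronous Amazon Textract job. The file-type check and document locations are placeholders.

import boto3

textract = boto3.client('textract')

def lambda_handler(event, context):
    # Stages 1-2: the S3 event notification carries the location of the uploaded document
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Basic format validation before extraction (placeholder check)
        if not key.lower().endswith(('.pdf', '.png', '.jpg')):
            continue

        # Stage 3: start an asynchronous Textract job to extract text and structure
        textract.start_document_text_detection(
            DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}}
        )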

The Q&A workflow (numbered in yellow in the preceding diagram) processes user interactions through six integrated stages:

  1. The web application, hosted on AWS Fargate, handles user interactions and query inputs, managing initial request validation before routing queries to appropriate processing services.
  2. Amazon Managed Streaming for Kafka (Amazon MSK) serves as the streaming service, providing reliable inter-service communication while maintaining message ordering and high-throughput processing for query handling.
  3. The Q&A handler, running on AWS Fargate, orchestrates the complete query response cycle by coordinating between services and processing responses through the LLM pipeline.
  4. The pipeline integrates with Amazon Bedrock foundation models through these components:
    1. Cohere Embeddings model performs vector transformations of the input.
    2. Amazon OpenSearch Service manages vector embeddings and performs similarity searches.
    3. Amazon Bedrock Knowledge Bases provides efficient access to the document repository.
  5. Amazon Bedrock Guardrails implements content filtering and safety checks as part of the query processing pipeline.
  6. Anthropic Claude LLM performs the natural language processing, generating responses that are then returned to the web application.

This integrated workflow provides efficient query processing while maintaining response quality and system reliability.
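To make the retrieval path concrete, the following minimal sketch shows how a query can be answered against an Amazon Bedrock knowledge base with the RetrieveAndGenerate API. The knowledge base ID and model ARN are placeholders, and Octus’s production Q&A handler orchestrates much more than this single call.

import boto3

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

response = bedrock_agent_runtime.retrieve_and_generate(
    input={'text': 'What are the key credit risks for Company X this quarter?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'YOUR_KB_ID',   # placeholder
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',   # placeholder
        },
    },
)

print(response['output']['text'])                # generated answer
for citation in response.get('citations', []):   # source attribution
    for ref in citation['retrievedReferences']:
        print(ref['location'])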

For scalability: Using OpenSearch Service as our vector database

Amazon OpenSearch Serverless emerged as the optimal solution for CreditAI’s evolving requirements, offering advanced capabilities while maintaining seamless integration within the AWS ecosystem:

  • Vector search capabilities – OpenSearch Serverless provides robust built-in vector search capabilities essential for our needs. The service supports hybrid search, allowing us to combine vector embeddings with raw text search without modifying our embedding model. This capability proved crucial for enabling broader question support in CreditAI 2.x, enhancing its overall usability and flexibility.
  • Serverless architecture benefits – The serverless design alleviates the need to provision, configure, or tune infrastructure, significantly reducing operational complexities. This shift allows our team to focus more time and resources on feature development and application improvements rather than managing underlying infrastructure.
  • AWS integration advantages – The tight integration with other AWS services, particularly Amazon S3 and Amazon Bedrock, streamlines our content ingestion process. This built-in compatibility provides a cohesive and scalable landscape for future enhancements while maintaining optimal performance.

OpenSearch Serverless enabled us to scale our vector search capabilities efficiently while minimizing operational overhead and maintaining high performance standards.

For scalability and security: Splitting data across multiple vector databases with in-house support for intricate permissions

To enhance scalability and security, we implemented isolated knowledge bases (corresponding to vector databases) for each client’s data. Although this approach slightly increases costs, it delivers multiple significant benefits. Primarily, it maintains complete isolation of client data, providing enhanced privacy and security. Thanks to Amazon Bedrock Knowledge Bases, this solution doesn’t compromise on performance. Amazon Bedrock Knowledge Bases enables concurrent embedding and synchronization across multiple knowledge bases, allowing us to maintain real-time updates without delays—something previously unattainable with our earlier GPU-based architecture.

Additionally, we introduced two in-house services within Octus to strengthen this system:

  • AuthZ access management service – This service enforces granular access control, making sure users and applications can only interact with the data they are authorized to access. We had to migrate our AuthZ backend from Airbyte to native SQL replication so that it can support access management in near real time at scale.
  • Global identifiers service – This service provides a unified framework to link identifiers across multiple domains, enabling seamless integration and cross-referencing of identifiers across multiple datasets.

Together, these enhancements create a robust, secure, and highly efficient environment for managing and accessing client data.

For cost efficiency: Adopting a multi-tenant embedding service

In our migration to Amazon Bedrock Knowledge Bases, Octus made a strategic shift from using an open-source embedding service on EC2 instances to using the managed embedding capabilities of Amazon Bedrock through Cohere’s multilingual model. This transition was carefully evaluated based on several key factors.

Our selection of Cohere’s multilingual model was driven by two primary advantages. First, it demonstrated superior retrieval performance in our comparative testing. Second, it offered robust multilingual support capabilities that were essential for our global operations.

The technical benefits of this migration manifested in two distinct areas: document embedding and message embedding. In document embedding, we transitioned from a CPU-based system to Amazon Bedrock Knowledge Bases, which enabled faster and higher throughput document processing through its multi-tenant architecture. For message embedding, we alleviated our dependency on dedicated GPU instances while maintaining optimal performance with 20–30 millisecond embedding times. The Amazon Bedrock Knowledge Bases API also simplified our operations by combining embedding and retrieval functionality into a single API call.

The migration to Amazon Bedrock Knowledge Bases managed embedding delivered two significant advantages: it eliminated the operational overhead of maintaining our own open-source solution while providing access to industry-leading embedding capabilities through Cohere’s model. This helped us achieve both our cost-efficiency and performance objectives without compromises.

For cost-efficiency and response performance: Choice of chunking strategy

Our primary goal was to improve three critical aspects of CreditAI’s responses: quality (accuracy of information), groundedness (ability to trace responses back to source documents), and relevance (providing information that directly answers user queries). To achieve this, we tested three different approaches to breaking down documents into smaller pieces (chunks):

  • Fixed chunking – Breaking text into fixed-length pieces
  • Semantic chunking – Breaking text based on natural semantic boundaries like paragraphs, sections, or complete thoughts
  • Hierarchical chunking – Creating a two-level structure with smaller child chunks for precise matching and larger parent chunks for contextual understanding

Our testing showed that both semantic and hierarchical chunking performed significantly better than fixed chunking in retrieving relevant information. However, each approach came with its own technical considerations.

Hierarchical chunking requires a larger chunk size to maintain comprehensive context during retrieval. This approach creates a two-level structure: smaller child chunks for precise matching and larger parent chunks for contextual understanding. During retrieval, the system first identifies relevant child chunks and then automatically includes their parent chunks to provide broader context. Although this method optimizes both search precision and context preservation, we couldn’t implement it with our preferred Cohere embeddings because they only support chunks up to 512 tokens, which is insufficient for the parent chunks needed to maintain effective hierarchical relationships.

Semantic chunking uses LLMs to intelligently divide text by analyzing both semantic similarity and natural language structures. Instead of arbitrary splits, the system identifies logical break points by calculating embedding-based similarity scores between sentences and paragraphs, making sure semantically related content stays together. The resulting chunks maintain context integrity by considering both linguistic features (like sentence and paragraph boundaries) and semantic coherence, though this precision comes at the cost of additional computational resources for LLM analysis and embedding calculations.

After evaluating our options, we chose semantic chunking despite two trade-offs:

  • It requires additional processing by our LLMs, which increases costs
  • It has a limit of 1,000,000 tokens per document processing batch

We made this choice because semantic chunking offered the best balance between implementation simplicity and retrieval performance. Although hierarchical chunking showed promise, it would have been more complex to implement and harder to scale. This decision helped us maintain high-quality, grounded, and relevant responses while keeping our system manageable and efficient.
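For reference, the chunking strategy is configured on the knowledge base data source. The sketch below shows roughly how a semantic chunking configuration can be supplied when creating a data source with the Bedrock Agent API; the knowledge base ID, bucket ARN, and threshold values are placeholders rather than Octus’s settings, and the exact field names should be verified against the current API reference.

import boto3

bedrock_agent = boto3.client('bedrock-agent')

bedrock_agent.create_data_source(
    knowledgeBaseId='YOUR_KB_ID',        # placeholder
    name='creditai-documents',           # placeholder
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::your-extracted-content-bucket'   # placeholder
        },
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'SEMANTIC',
            'semanticChunkingConfiguration': {
                'maxTokens': 300,                      # example value
                'bufferSize': 1,                       # example value
                'breakpointPercentileThreshold': 95,   # example value
            },
        }
    },
)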

For response performance and technical agility: Adopting Amazon Bedrock Guardrails with Amazon Bedrock Knowledge Bases

Our implementation of Amazon Bedrock Guardrails focused on three key objectives: enhancing response security, optimizing performance, and simplifying guardrail management. This service plays a crucial role in making sure our responses are both safe and efficient.

Amazon Bedrock Guardrails provides a comprehensive framework for content filtering and response moderation. The system works by evaluating content against predefined rules before the LLM processes it, helping prevent inappropriate content and maintaining response quality. Through the Amazon Bedrock Guardrails integration with Amazon Bedrock Knowledge Bases, we can configure, test, and iterate on our guardrails without writing complex code.

We achieved significant technical improvements in three areas:

  • Simplified moderation framework – Instead of managing multiple separate denied topics, we consolidated our content filtering into a unified guardrail service. This approach allows us to maintain a single source of truth for content moderation rules, with support for customizable sample phrases that help fine-tune our filtering accuracy.
  • Performance optimization – We improved system performance by integrating guardrail checks directly into our main prompts, rather than running them as separate operations. This optimization reduced our token usage and minimized unnecessary API calls, resulting in lower latency for each query.
  • Enhanced content control – The service provides configurable thresholds for filtering potentially harmful content and includes built-in capabilities for detecting hallucinations and assessing response relevance. This alleviated our dependency on external services like TruLens while maintaining robust content quality controls.

These improvements have helped us maintain high response quality while reducing both operational complexity and processing overhead. The integration with Amazon Bedrock has given us a more streamlined and efficient approach to content moderation.
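Guardrails can be attached wherever the model is invoked. As a generic illustration (separate from the Knowledge Bases integration described above), the following sketch applies a guardrail through the Converse API; the model ID, guardrail ID, and version are placeholders.

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

response = bedrock_runtime.converse(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',   # placeholder model ID
    messages=[{
        'role': 'user',
        'content': [{'text': 'Summarize recent credit events for Company X.'}],
    }],
    guardrailConfig={
        'guardrailIdentifier': 'YOUR_GUARDRAIL_ID',   # placeholder
        'guardrailVersion': '1',                      # placeholder
    },
)

print(response['output']['message']['content'][0]['text'])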

To achieve zero downtime: Infrastructure migration

Our migration to Amazon Bedrock required careful planning to provide uninterrupted service for CreditAI while significantly reducing infrastructure costs. We achieved this through our comprehensive infrastructure framework that addresses deployment, security, and monitoring needs:

  • IaC implementation – We used reusable Terraform modules to manage our infrastructure consistently across environments. These modules enabled us to share configurations efficiently between services and projects. Our approach supports multi-Region deployments with minimal configuration changes while maintaining infrastructure version control alongside application code.
  • Automated deployment strategy – Our GitOps-embedded framework streamlines the deployment process by implementing a clear branching strategy for different environments. This automation handles CreditAI component deployments through CI/CD pipelines, reducing human error through automated validation and testing. The system also enables rapid rollback capabilities if needed.
  • Security and compliance – To maintain SOC2 compliance and robust security, our framework incorporates comprehensive access management controls and data encryption at rest and in transit. We follow network security best practices, conduct regular security audits and monitoring, and run automated compliance checks in the deployment pipeline.

We maintained zero downtime during the entire migration process while reducing infrastructure costs by 70% by eliminating GPU instances. The successful transition from Amazon ECS on Amazon EC2 to Amazon ECS with Fargate has simplified our infrastructure management and monitoring.

Achieving excellence

CreditAI’s migration to Amazon Bedrock has yielded remarkable results for Octus:

  • Scalability – We have almost doubled the number of documents available for Q&A across three environments in days instead of weeks. Our use of Amazon ECS with Fargate with auto scaling rules and controls gives us elastic scalability for our services during peak usage hours.
  • Cost-efficiency and infrastructure optimization – By moving away from GPU-based clusters to Fargate, our monthly infrastructure costs are now 78.47% lower, and our per-question costs have reduced by 87.6%.
  • Response performance and latency – There has been no drop in latency, and we have seen a 27% increase in questions answered successfully. We have also seen a 250% boost in user engagement. Users especially love our support for broad, industry-wide questions enabled by Anthropic’s Claude Sonnet.
  • Zero downtime – We experienced zero downtime during migration and 99% uptime overall for the whole application.
  • Technological agility and innovation – We have been able to add new document sources in a quarter of the time it took pre-migration. In addition, we adopted enhanced guardrails support for free and no longer have to retrieve documents from the knowledge base and pass the chunks to Anthropic’s Claude Sonnet to trigger a guardrail.
  • Operational consolidation and reliability – Post-migration, our DevOps and SRE teams see 20% less maintenance burden and overheads. Supporting SOC2 compliance is also straightforward now that we’re using only one cloud provider.

Operational monitoring

We use Datadog to monitor both LLM latency and our document ingestion pipeline, providing real-time visibility into system performance. The following screenshot showcases how we use custom Datadog dashboards to provide a live view of the document ingestion pipeline. This visualization offers both a high-level overview and detailed insights into the ingestion process, helping us understand the volume, format, and status of the documents processed. The bottom half of the dashboard presents a time-series view of document processing volumes. The timeline tracks fluctuations in processing rates, identifies peak activity periods, and provides actionable insights to optimize throughput. This detailed monitoring system enables us to maintain efficiency, minimize failures, and provide scalability.

Observability Dashboard

Roadmap

Looking ahead, Octus plans to continue enhancing CreditAI by taking advantage of new capabilities released by Amazon Bedrock that continue to meet and exceed our requirements. Future developments will include:

  • Enhance retrieval by testing and integrating with reranking techniques, allowing the system to prioritize the most relevant search results for better user experience and accuracy.
  • Explore the Amazon Bedrock RAG evaluator to capture detailed metrics on CreditAI’s performance. This will add to the existing mechanisms at Octus to track performance that include tracking unanswered questions.
  • Expand to ingest large-scale structured data, making it capable of handling complex financial datasets. The integration of text-to-SQL will enable users to query structured databases using natural language, simplifying data access.
  • Explore replacing our in-house content extraction service (ADE) with the Amazon Bedrock advanced parsing solution to potentially further reduce document ingestion costs.
  • Improve CreditAI’s disaster recovery and redundancy mechanisms, making sure that our services and infrastructure are more fault tolerant and can recover from outages faster.

These upgrades aim to boost the precision, reliability, and scalability of CreditAI.

Vishal Saxena, CTO at Octus, shares: “CreditAI is a first-of-its-kind generative AI application that focuses on the entire credit lifecycle. It is truly ’AI embedded’ software that combines cutting-edge AI technologies with an enterprise data architecture and a unified cloud strategy.”

Conclusion

CreditAI by Octus is the company’s flagship conversational chatbot that supports natural language queries and gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles. In this post, we described in detail our motivation, process, and results on Octus’s migration to Amazon Bedrock. Through this migration, Octus achieved remarkable results that included an over 75% reduction in operating costs as well as a 250% boost in engagement. Future steps include adopting new features such as reranking, RAG evaluator, and advanced parsing to further reduce costs and improve performance. We believe that the collaboration between Octus and AWS will continue to revolutionize financial analysis and research workflows.

To learn more about Amazon Bedrock, refer to the Amazon Bedrock User Guide.


About the Authors

Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.

Yihnew Eshetu is a Senior Director of AI Engineering at Octus, leading the development of AI solutions at scale to address complex business problems. With seven years of experience in AI/ML, his expertise spans GenAI and NLP, specializing in designing and deploying agentic AI systems. He has played a key role in Octus’s AI initiatives, including leading AI Engineering for its flagship GenAI chatbot, CreditAI.

Harmandeep Sethi is a Senior Director of SRE Engineering and Infrastructure Frameworks at Octus, with nearly 10 years of experience leading high-performing teams in the design, implementation, and optimization of large-scale, highly available, and reliable systems. He has played a pivotal role in transforming and modernizing Credit AI infrastructure and services by driving best practices in observability, resilience engineering, and the automation of operational processes through Infrastructure Frameworks.

Rohan Acharya is an AI Engineer at Octus, specializing in building and optimizing AI-driven solutions at scale. With expertise in GenAI and NLP, he focuses on designing and deploying intelligent systems that enhance automation and decision-making. His work involves developing robust AI architectures and advancing Octus’s AI initiatives, including the evolution of CreditAI.

Hasan Hasibul is a Principal Architect at Octus leading the DevOps team, with nearly 12 years of experience in building scalable, complex architectures while following software development best practices. A true advocate of clean code, he thrives on solving complex problems and automating infrastructure. Passionate about DevOps, infrastructure automation, and the latest advancements in AI, he has architected Octus initial CreditAI, pushing the boundaries of innovation.

Philipe Gutemberg is a Principal Software Engineer and AI Application Development Team Lead at Octus, passionate about leveraging technology for impactful solutions. An AWS Certified Solutions Architect – Associate (SAA), he has expertise in software architecture, cloud computing, and leadership. Philipe led both backend and frontend application development for CreditAI, ensuring a scalable system that integrates AI-driven insights into financial applications. A problem-solver at heart, he thrives in fast-paced environments, delivering innovative solutions for financial institutions while fostering mentorship, team development, and continuous learning.

Kishore Iyer is the VP of AI Application Development and Engineering at Octus. He leads teams that build, maintain and support Octus’s customer-facing GenAI applications, including CreditAI, our flagship AI offering. Prior to Octus, Kishore has 15+ years of experience in engineering leadership roles across large corporations, startups, research labs, and academia. He holds a Ph.D. in computer engineering from Rutgers University.

Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Amazon Bedrock Knowledge Bases. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Tim Ramos is a Senior Account Manager at AWS. He has 12 years of sales experience and 10 years of experience in cloud services, IT infrastructure, and SaaS. Tim is dedicated to helping customers develop and implement digital innovation strategies. His focus areas include business transformation, financial and operational optimization, and security. Tim holds a BA from Gonzaga University and is based in New York City.

Read More

Optimize reasoning models like DeepSeek with prompt optimization on Amazon Bedrock

Optimize reasoning models like DeepSeek with prompt optimization on Amazon Bedrock

DeepSeek-R1 models, now available on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart, as well as a serverless model on Amazon Bedrock, were recently popularized by their long and elaborate thinking style, which, according to DeepSeek’s published results, leads to impressive performance on highly challenging math benchmarks like AIME-2024 and MATH-500, as well as competitive performance compared to then state-of-the-art models like Anthropic’s Claude Sonnet 3.5, GPT-4o, and OpenAI o1 (more details in this paper).

During training, researchers showed how DeepSeek-R1-Zero naturally learns to solve tasks with more thinking time, which leads to a boost in performance. However, what often gets ignored is the number of thinking tokens required at inference time, and the time and cost of generating these tokens before answering the original question.

In this post, we demonstrate how to optimize reasoning models like DeepSeek-R1 using prompt optimization on Amazon Bedrock.

Long reasoning chains and challenges with maximum token limits

Let’s try out a straightforward question on DeepSeek-R1:

For the given math problem: Nate’s dog can dig six holes a day. He digs for 14 days while Nate is on vacation. When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?, write out the steps you would take to solve it.

On the Amazon Bedrock Chat/Text Playground, you can follow along by choosing the new DeepSeek-R1 model, as shown in the following screenshot.

You might see that sometimes, based on the question, reasoning models don’t finish thinking within the overall maximum token budget.

Increasing the output token budget allows the model to think for longer. With the maximum tokens increased from 2,048 to 4,096, you should see the model reasoning for a while before printing the final answer.
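The same adjustment can be made programmatically. The following sketch uses the Amazon Bedrock Converse API with a larger maxTokens value; the DeepSeek-R1 model identifier shown is a placeholder, so use the model ID listed for your Region.

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# The math problem from earlier in this post
question = ("For the given math problem: Nate's dog can dig six holes a day. He digs for 14 days "
            "while Nate is on vacation. When Nate gets home, he starts filling in 9 holes a day, "
            "but the dog keeps digging 6 new holes every night. How many weeks does it take him to "
            "fill in all the holes?, write out the steps you would take to solve it.")

response = bedrock_runtime.converse(
    modelId='us.deepseek.r1-v1:0',   # placeholder; check the model catalog for the exact ID
    messages=[{'role': 'user', 'content': [{'text': question}]}],
    inferenceConfig={'maxTokens': 4096},   # raised from 2,048 so the thinking can finish
)

print(response['output']['message']['content'][0]['text'])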

The appendix at the end of this post provides the complete response. You can also collapse the reasoning steps to view just the final answer.

As we can see in the case with the 2,048-token budget, the thinking process didn’t end. This not only cost us 2,048 tokens’ worth of time and money, but we also didn’t get the final answer! This observation of high token counts for thinking usually leads to a few follow-up questions, such as:

  • Is it possible to reduce the thinking tokens and still get a correct answer?
  • Can the thinking be restricted to a maximum number of thinking tokens, or a thinking budget?
  • At a high level, should thinking-intensive models like DeepSeek be used in real-time applications at all?

In this post, we show you how you can optimize thinking models like DeepSeek-R1 using prompt optimization on Amazon Bedrock, resulting in more succinct thinking traces without sacrificing accuracy.

Optimize DeepSeek-R1 prompts

To get started with prompt optimization, select DeepSeek-R1 in the Amazon Bedrock model playground, enter your prompt, and choose the magic wand icon, or use the Amazon Bedrock optimize_prompt() API. You can also use prompt optimization on the Amazon Bedrock console: add variables if required, set your model to DeepSeek-R1 along with its model parameters, and choose Optimize.
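The following is a minimal sketch of calling the optimize_prompt() API with boto3. The target model ID is a placeholder, and the event-stream handling shown is an assumption; check the API reference for your SDK version.

import boto3

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

original_prompt = "<the prompt you want to optimize>"   # placeholder

# Request an optimized version of the prompt for the target model
response = bedrock_agent_runtime.optimize_prompt(
    input={'textPrompt': {'text': original_prompt}},
    targetModelId='us.deepseek.r1-v1:0',   # placeholder; use the DeepSeek-R1 model ID in your Region
)

# The optimized prompt comes back as an event stream; the event shape below is
# an assumption, so verify it against the API reference for your SDK version
for event in response['optimizedPrompt']:
    if 'optimizedPromptEvent' in event:
        print(event['optimizedPromptEvent'])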

To demonstrate how prompt optimization on Amazon Bedrock can help with reasoning models, we first need a challenging dataset. Humanity’s Last Exam (HLE), a benchmark of extremely challenging questions from dozens of subject areas, is designed to be the “final” closed-ended benchmark of broad academic capabilities. HLE is multi-modal, featuring questions that are either text-only or accompanied by an image reference, and includes both multiple-choice and exact-match questions for automated answer verification. The questions require deep domain knowledge in various verticals; they are unambiguous and resistant to simple internet lookups or database retrieval. For context, several state-of-the-art models (including thinking models) perform poorly on the benchmark (see the results table in this full paper).

Let’s look at an example question from this dataset:

In an alternate universe where the mass of the electron was 1% heavier and the charges of the
electron and proton were both 1% smaller, but all other fundamental constants stayed the same, 
approximately how would the speed of sound in diamond change? 

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same 
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%

The question requires a deep understanding of physics that most large language models (LLMs) today lack. Our goal with prompt optimization on Amazon Bedrock for reasoning models is to reduce the number of thinking tokens without sacrificing accuracy. After prompt optimization, the optimized prompt is as follows:

## Question
<extracted_question_1>In an alternate universe where the mass of the electron was 1% heavier 
and the charges of the electron and proton were both 1% smaller, but all other fundamental constants
stayed the same, approximately how would the speed of sound in diamond change?

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%</extracted_question_1>

## Instruction
Read the question above carefully and provide the most accurate answer possible.
If multiple choice options are provided within the question, respond with the entire text of the
correct answer option, not just the letter or number. Do not include any additional explanations or
preamble in your response.

Remember, your goal is to answer as precisely and accurately as possible!

The following figure shows how, for this specific case, the number of thinking tokens was reduced by about 35% (from roughly 5,000 to 3,300) while the final answer remained correct (B. Decrease by 1.5%). We also noticed that with the original prompts, in this and other examples, part of the reasoning is summarized or repeated before the final answer. The optimized prompt gives clear instructions, separates the different prompt sections, and provides additional guidance based on the type of question and how to answer it. This leads to shorter, clearer reasoning traces and a final answer that can be extracted directly.

Optimized prompts can also turn wrong answers into correct ones, because long-form thinking doesn’t guarantee a correct final answer. In the example shown in the following figure, the number of thinking tokens was reduced from 5,000 to 1,555, and the answer is produced directly rather than after another long, post-thinking explanation.

The preceding two examples demonstrate ways in which prompt optimization can improve results while shortening output tokens for models like DeepSeek R1. Prompt optimization was also applied to 400 questions from HLE. The following table summarizes the results.

| Experiment | Overall Accuracy | Average Prompt Tokens | Average Completion Tokens (Thinking + Response) | Average Tokens (Response Only) | Average Tokens (Thinking Only) | Thinking Completed (6,000 Max Output Tokens) |
|---|---|---|---|---|---|---|
| Baseline DeepSeek | 8.75% | 288 | 3,334 | 271 | 3,063 | 80.0% |
| Prompt Optimized DeepSeek | 11% | 326 | 1,925 | 27 | 1,898 | 90.3% |
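
For readers who want to compute similar statistics on their own outputs, the following sketch shows one way to separate thinking from response text in a Converse API result. The reasoningContent block structure is an assumption based on how Amazon Bedrock returns output for reasoning models, and the word-count token estimate is a deliberate simplification; exact counts come from the usage field or the provider’s tokenizer.

```python
# Minimal sketch: split a DeepSeek-R1 Converse response into thinking and
# response text so token counts like those in the preceding table can be
# approximated. The block structure below is an assumption; verify it against
# the current Converse API reference.
def split_thinking_and_response(converse_response: dict) -> tuple[str, str]:
    thinking_parts, response_parts = [], []
    for block in converse_response["output"]["message"]["content"]:
        if "reasoningContent" in block:
            thinking_parts.append(block["reasoningContent"]["reasoningText"]["text"])
        elif "text" in block:
            response_parts.append(block["text"])
    return "".join(thinking_parts), "".join(response_parts)


def approximate_tokens(text: str) -> int:
    # Rough word-count estimate, good enough for relative comparisons only.
    return len(text.split())
```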

As the table shows, overall accuracy jumps to 11% on this subset of the HLE dataset, the number of thinking and output tokens is reduced (which lowers both time to last token and cost), and the rate of completed thinking increases to roughly 90%. Although the optimized prompt contains no explicit instruction to reduce thinking tokens, our experiments suggest that the clearer, more detailed task instructions reduce the extra effort models like DeepSeek-R1 spend on self-clarification and problem restatement. Prompt optimization for reasoning models leaves the quality of thinking and the overall flow, which is self-adaptive and depends on the question, largely unaffected, leading to better final answers.

Conclusion

In this post, we demonstrated how prompt optimization on Amazon Bedrock can effectively enhance the performance of thinking-intensive models like DeepSeek-R1. Through our experiments with the HLE dataset, we showed that optimized prompts not only reduced the number of thinking tokens by a significant margin, but also improved overall accuracy from 8.75% to 11%. The optimization resulted in more efficient reasoning paths without sacrificing the quality of answers, leading to faster response times and lower costs. This improvement in both efficiency and effectiveness suggests that prompt optimization can be a valuable tool for deploying reasoning-heavy models in production environments where both accuracy and computational resources need to be carefully balanced. As the field of AI continues to evolve with more sophisticated thinking models, techniques like prompt optimization will become increasingly important for practical applications.

To get started with prompt optimization on Amazon Bedrock, refer to Optimize a prompt and Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock.

Appendix

The following is the full response for the question about Nate’s dog:


Okay, let's see. So the problem is about Nate's dog digging holes and Nate trying to fill them. I need to figure out how many weeks it takes Nate to fill all the holes once he starts. Let me break it down step by step.

First, the dog digs 6 holes a day. He does this for 14 days while Nate is on vacation. So when Nate comes back, there are a certain number of holes already dug. Let me calculate that first. If the dog digs 6 holes each day for 14 days, that's 6 multiplied by 14. Let me do that math: 6*14=84. So there are 84 holes when Nate gets home.

Now, once Nate starts filling the holes, he fills 9 holes per day. But here's the catch: the dog keeps digging 6 new holes every night. So each day, Nate fills 9 holes, but the dog adds 6 more overnight. I need to figure out the net change per day. That would be the number of holes filled minus the new holes dug. So 9 filled - 6 dug = 3 holes net reduction per day. Wait, but does this happen every day? Let me make sure.

So Nate works during the day, filling 9 holes. Then, during the night, the dog digs 6 new holes. So each full day (day and night) results in a net decrease of 3 holes. So starting from 84 holes, each day the total number decreases by 3. But I need to check if the process is that Nate fills 9, then the dog digs 6, so the next day starts with 84 - 9 + 6 = 84 - 3 = 81, and so on. So yes, each day the net is -3.

But wait, maybe the timing is important. Let's think: when Nate starts filling, does he fill 9 holes on the first day, and then the dog digs 6 that night? So each day after Nate fills, the dog adds 6. So the sequence is: Day 1: Nate fills 9, holes become 84-9=75. Then night: dog digs 6, holes become 75+6=81. Day 2: Nate fills 9, holes 81-9=72. Night: +6=78. So each full day cycle (day and night) reduces the total by 3. So yes, net 3 per day.

But wait, maybe on the last day, if Nate fills the remaining holes and the dog doesn't get to dig that night because all holes are filled. Hmm, that's a crucial point. For example, if on a certain day, the number of holes left is less than or equal to 9, Nate can fill them all in that day, and the dog doesn't add any more because there are no holes left to dig. So we need to check if on the day when Nate fills the remaining holes, the total would reach zero, and the dog doesn't dig that night. So we can't just do total holes divided by 3, because the last day might not have the dog digging.

So let's model this properly. Let's denote the number of holes each day. Starting with 84 holes.

Each day:
1. Nate fills 9 holes. Subtract 9.
2. If the result is zero or negative, then done.
3. Otherwise, the dog digs 6 holes. Add 6.

So we need to iterate this until the number of holes is zero or less after Nate fills them. Let me try to calculate day by day.

Start: 84
Day 1: After Nate fills: 84-9=75. Since 75>0, dog digs 6: 75+6=81
Day 2: After Nate fills: 81-9=72. Dog digs 6: 72+6=78
Day 3: 78-9=69, +6=75
Day 4: 75-9=66, +6=72
Day 5: 72-9=63, +6=69
Day 6: 69-9=60, +6=66
Day 7: 66-9=57, +6=63
Day 8: 63-9=54, +6=60
Day 9: 60-9=51, +6=57
Day 10: 57-9=48, +6=54
Day 11: 54-9=45, +6=51
Day 12: 51-9=42, +6=48
Day 13: 48-9=39, +6=45
Day 14: 45-9=36, +6=42
Day 15: 42-9=33, +6=39
Day 16: 39-9=30, +6=36
Day 17: 36-9=27, +6=33
Day 18: 33-9=24, +6=30
Day 19: 30-9=21, +6=27
Day 20: 27-9=18, +6=24
Day 21: 24-9=15, +6=21
Day 22: 21-9=12, +6=18
Day 23: 18-9=9, +6=15
Day 24: 15-9=6, +6=12
Day 25: 12-9=3, +6=9
Day 26: 9-9=0

At this point, after Nate fills on day 26, the holes are 0. Since there are no holes left, the dog can't dig any more that night. So the process stops here. So total days: 26 days.

But the question asks for how many weeks. Since a week is 7 days, we divide 26 by 7. 26 divided by 7 is 3 with a remainder of 5. So 3 weeks and 5 days. But the question says "how many weeks does it take him to fill in all the holes". Depending on whether they want it rounded up to full weeks or expressed as a fraction. Let me check the problem statement again.

"How many weeks does it take him to fill in all the holes?"

Typically, such problems expect the answer in whole weeks, rounded up if there's a remainder. Because you can't have a fraction of a week in practical terms. So 26 days is 3 weeks and 5 days, which is more than 3 weeks but less than 4. So they might expect 4 weeks. But let me verify.

Alternatively, maybe they want the exact number in weeks as a fraction. 26 days is 26/7 weeks ≈ 3.714 weeks. But the problem might expect an integer number of weeks, rounded up. Let's see.

But let's check the problem again. It says "how many weeks does it take him to fill in all the holes?" So if it takes 26 days, which is 3 weeks and 5 days, then the answer would be 4 weeks if we're counting full weeks needed. Because after 3 weeks (21 days), he still has 5 days left. So total weeks is 4.

But maybe the answer is 26/7 weeks, which is approximately 3.714, but since the problem might expect an exact answer in weeks, possibly as a fraction. However, the problem might be designed to have a whole number. Wait, let's check the math again.

Wait, when I did the day-by-day count, I got 26 days. Let me recount to make sure I didn't make a mistake. Starting from day 1 to day 26. Let's check day 26:

Start with day 1: 84
After day 1: 81
...
Let me check day 25:
After day 24: 12
Day 25: 12-9=3, then dog digs 6: 3+6=9
Day 26: 9-9=0. So day 26 is the day when Nate fills the last 9 holes. So yes, 26 days.

26 days divided by 7 days/week = 3.714... weeks. So the answer is 26/7 weeks, which is 3 and 5/7 weeks. But the problem might want it expressed as a mixed number or an exact number. However, the question says "how many weeks does it take him", so likely they expect the answer in weeks, possibly as a fraction. But let's see if there's a better way to model this without counting each day.

Alternative approach:

Total holes after vacation: 84. Each day, Nate fills 9, dog digs 6, net -3 per day, except possibly the last day. But if on the last day, Nate fills the remaining holes without the dog digging, then we need to adjust.

Let's model it as: Let H be the number of holes. Start with H=84. Each day, H decreases by 9, then if H>0, increases by 6. We need to find the number of days until H<=0. This can be modeled as: each full cycle (day and night) reduces H by 3, except the last day where if H<=9, then H becomes 0 and no addition. So the number of full cycles (days where H>9 after filling) plus the final day.

Let's denote: let's say after n days, the holes would be: each day, H = H - 9 + 6 = H - 3, but only if after subtracting 9, H >= 1 (since if H-9 <= 0, then the dog doesn't dig). Wait, no. Wait, the dog digs 6 only if after Nate fills 9, there are still holes left. Wait, no. Wait, the problem says the dog keeps digging 6 new holes every night. So regardless of whether there are holes left, the dog digs 6 every night. Wait, but if all holes are filled, can the dog dig new ones? The problem says "fill in all the holes", so once all holes are filled, the dog can't dig new ones because there are no existing holes? Wait, the problem might be implying that the dog digs new holes regardless, but Nate is filling the existing ones. Wait, the problem says "the dog keeps digging 6 new holes every night". So perhaps every night, regardless of the current state, the dog digs 6 new holes. But that would mean that even if Nate fills all holes on a day, the dog digs 6 new ones that night, making it impossible to ever finish. But that can't be, because the problem asks how long to fill all holes, which implies that eventually Nate can catch up.

Wait, this is a critical point. Let me re-examine the problem statement:

"When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?"

So the key is: every night, the dog digs 6 new holes. So even if Nate fills all the holes during the day, the dog will dig 6 new ones that night. But that would mean the process never ends. But the problem must assume that once all holes are filled, the dog stops. But the wording is ambiguous. However, logically, if all holes are filled, there's nothing for the dog to dig. So perhaps the dog only digs 6 holes if there's ground to dig, but the problem states "keeps digging 6 new holes every night", implying that regardless of the current state, the dog digs 6 each night. This is a problem because if Nate ever fills all holes, the next night the dog digs 6, so Nate would have to fill those 6 the next day, but then the dog digs 6 again, leading to an infinite loop. But the problem must have a solution, so likely the dog stops digging once all holes are filled. Therefore, the correct interpretation is that each night, the dog digs 6 new holes, but if there are no holes (i.e., all filled), the dog can't dig. Therefore, the last day when Nate fills the remaining holes, the dog doesn't dig that night.

Therefore, the correct model is:

Each day:
1. Nate fills 9 holes.
2. If holes remain after filling, the dog digs 6 that night.

So the process is: start with H=84.

While H>0:
    H = H - 9
    if H > 0:
        H = H + 6
    days += 1

So let's model this with equations. Let's find the number of days required. This is similar to the classic problem where each day you have a net change, but with a different condition on the last day.

Let's denote that each "cycle" (day and night) reduces H by 3, except possibly the last day where H <= 9, so after filling, H becomes 0, and no addition. So the total number of days is the number of full cycles (each reducing H by 3) plus the final day. But how to calculate this?

Let's think of it as: let's suppose that after n days, the number of holes is H_n = 84 - 3n. But this is only true if every day the net change is -3. However, this would be the case until H_n - 9 <= 0. Wait, no. Wait, the H after filling is H - 9. If H - 9 <= 0, then the dog doesn't dig, so the total reduction is H (to 0). So we need to find the day when H - 9 <= 0.

So let's solve for when H - 9 <= 0: H = 84 - 3(n-1) - 9 <= 0. Wait, this is getting complicated. Let me use another approach.

Let's consider that each day before the final day, the net change is -3. On the final day, the net change is -9 (since the dog doesn't add 6). Let's let x be the number of full days (with net -3), and 1 final day (net -9).

Total reduction: 3x + 9 = 84
So 3x = 84 - 9 = 75
x = 75/3 = 25

So x = 25 days of net -3, and 1 final day of -9. Total days = 25 + 1 = 26 days. Yes, this matches the day-by-day count. Therefore, 26 days.

Convert to weeks: 26/7 = 3.714... weeks. But the problem asks "how many weeks does it take him". If they want a whole number of weeks, rounded up, it's 4 weeks. If they accept fractions, it's 26/7 ≈ 3.714. But the problem might expect the answer in weeks as a mixed number, like 3 and 5/7 weeks. However, typically, such problems expect the answer in whole weeks, rounded up. Let me check the problem statement again.

The problem says "how many weeks does it take him to fill in all the holes". The answer should be in weeks, and since it's more than 3 weeks but less than 4, but you can't have a fraction of a week in practical terms when scheduling, so they might expect 4 weeks. However, mathematically, it's 26/7 weeks. Let me see if the problem expects an exact answer or a whole number.

Looking back at the problem statement, it's a math problem, so likely expects the exact value, which is 26/7 weeks, which can be written as 3 5/7 weeks. However, sometimes problems like this expect the answer in whole weeks, rounding up. Let's check both possibilities.

If the answer is 26 days, which is 3 weeks and 5 days, then depending on the context:
- If partial weeks count as a full week, then 4 weeks.
- If exact, then 3 5/7 weeks.

But the problem doesn't specify, so as a math problem, the precise answer is 26/7 weeks, which is 3 and 5/7 weeks. However, the question might want it in weeks as a whole number, rounded up. Let's see.

But let's check the problem again. The problem says "how many weeks does it take him to fill in all the holes?" The answer choices (if any) aren't given, but since the user hasn't provided options, we need to infer. In math problems like this, unless specified otherwise, the answer is expected in the most precise form, which would be 26/7 weeks or 3 5/7 weeks. However, sometimes problems expect whole numbers, so maybe the answer is 4 weeks. But let's verify with the day count.

26 days is exactly 3 weeks and 5 days. If the question asks for the number of weeks, and if part of a week counts as a full week, then it's 4. But if they want the exact number, it's 3 5/7. However, the problem might have an integer answer. Wait, let's check the calculations again.

Wait, when I did the day-by-day count, I got 26 days. Let me verify that again quickly.

Start with 84.
Each day:
1. Subtract 9.
2. If result > 0, add 6.
Repeat until H <= 0.

Let's model this with a formula. Let's denote that each day, except the last, the net change is -3. The last day is -9.

So total holes = 84 = 3x + 9
So 3x = 75 → x = 25 days of net -3, plus 1 day of -9. Total 26 days.

Yes, that's correct. So 26 days. 26 days divided by 7 days/week = 3.714... weeks.

But the problem might want the answer as a mixed number, so 3 and 5/7 weeks. However, in some contexts, especially in school problems, they might expect the answer in whole weeks, rounded up, so 4 weeks. But since the problem is from Nate's perspective, he would count the exact days, so converting 26 days to weeks gives 3 weeks and 5 days. But the question asks for weeks, so unless specified, it's better to present it as 26/7 weeks, which is 3 5/7 weeks. However, let's check if there's a different approach that gives a whole number.

Alternatively, maybe I made a mistake in the day count. Let me recount the days more carefully. Starting with 84 holes.

Day 1: Fill 9: 84-9=75. Since 75>0, dog digs 6: 75+6=81.
Day 2: 81-9=72>0 → +6=78.
Day 3: 78-9=69>0 → +6=75.
Day 4: 75-9=66>0 → +6=72.
Day 5: 72-9=63>0 → +6=69.
Day 6: 69-9=60>0 → +6=66.
Day 7: 66-9=57>0 → +6=63.
Day 8: 63-9=54>0 → +6=60.
Day 9: 60-9=51>0 → +6=57.
Day 10: 57-9=48>0 → +6=54.
Day 11: 54-9=45>0 → +6=51.
Day 12: 51-9=42>0 → +6=48.
Day 13: 48-9=39>0 → +6=45.
Day 14: 45-9=36>0 → +6=42.
Day 15: 42-9=33>0 → +6=39.
Day 16: 39-9=30>0 → +6=36.
Day 17: 36-9=

About the authors

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.

Xuan Qi is an Applied Scientist at Amazon Bedrock, where she applies her background in physics to tackle complex challenges in machine learning and artificial intelligence. Xuan is passionate about translating scientific concepts into practical applications that drive tangible improvements in technology. Her work focuses on creating more intuitive and efficient AI systems that can better understand and interact with the world. Outside of her professional pursuits, Xuan finds balance and creativity through her love for dancing and playing the violin, bringing the precision and harmony of these arts into her scientific endeavors.

Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas.

Read More

Amazon Bedrock announces general availability of multi-agent collaboration

Amazon Bedrock announces general availability of multi-agent collaboration

Today, we’re announcing the general availability (GA) of multi-agent collaboration on Amazon Bedrock. This capability allows developers to build, deploy, and manage networks of AI agents that work together to execute complex, multi-step workflows efficiently.

Since its preview launch at re:Invent 2024, organizations across industries—including financial services, healthcare, supply chain and logistics, manufacturing, and customer support—have used multi-agent collaboration to orchestrate specialized agents, driving efficiency, accuracy, and automation. With this GA release, we’ve introduced enhancements based on customer feedback, further improving scalability, observability, and flexibility—making AI-driven workflows easier to manage and optimize.

What is multi-agent collaboration?

Generative AI is no longer just about models generating responses; it’s about automation. The next wave of innovation is driven by agents that can reason, plan, and act autonomously across company systems. Generative AI applications no longer just generate content; they also take action, solve problems, and execute complex workflows. The shift is clear: businesses need AI that doesn’t just respond to prompts but orchestrates entire workflows, automating processes end to end.

Agents enable generative AI applications to perform tasks across company systems and data sources, and Amazon Bedrock already simplifies building them. With Amazon Bedrock, customers can quickly create agents that handle sales orders, compile financial reports, analyze customer retention, and much more. However, as applications become more capable, the tasks customers want them to perform can exceed what a single agent can manage—either because the tasks require specialized expertise, involve multiple steps, or demand continuous execution over time.

Coordinating potentially hundreds of agents at scale is also challenging, because managing dependencies, ensuring efficient task distribution, and maintaining performance across a large network of specialized agents requires sophisticated orchestration. Without the right tools, businesses can face inefficiencies, increased latency, and difficulties in monitoring and optimizing performance. For customers looking to advance their agents and tackle more intricate, multi-step workflows, Amazon Bedrock supports multi-agent collaboration, enabling developers to easily build, deploy, and manage multiple specialized agents working together seamlessly.

Multi-agent collaboration enables developers to create networks of specialized agents that communicate and coordinate under the guidance of a supervisor agent. Each agent contributes its expertise to the larger workflow by focusing on a specific task. This approach breaks down complex processes into manageable sub-tasks processed in parallel. By facilitating seamless interaction among agents, Amazon Bedrock enhances operational efficiency and accuracy, ensuring workflows run more effectively at scale. Because each agent only accesses the data required for its role, this approach minimizes exposure of sensitive information while reinforcing security and governance. This allows businesses to scale their AI-driven workflows without the need for manual intervention in coordinating agents. As more agents are added, the supervisor ensures smooth collaboration between them all.

By using multi-agent collaboration on Amazon Bedrock, organizations can:

  • Streamline AI-driven workflows by distributing workloads across specialized agents.
  • Improve execution efficiency by parallelizing tasks where possible.
  • Enhance security and governance by restricting agent access to only necessary data.
  • Reduce operational complexity by eliminating manual intervention in agent coordination.

A key challenge in building effective multi-agent collaboration systems is managing the complexity and overhead of coordinating multiple specialized agents at scale. Amazon Bedrock simplifies the process of building, deploying, and orchestrating effective multi-agent collaboration systems while addressing efficiency challenges through several key features and optimizations:

  • Quick setup – Create, deploy, and manage AI agents working together in minutes without the need for complex coding.
  • Composability – Integrate your existing agents as subagents within a larger agent system, allowing them to seamlessly work together to tackle complex workflows.
  • Efficient inter-agent communication – The supervisor agent can interact with subagents using a consistent interface, supporting parallel communication for more efficient task completion.
  • Optimized collaboration modes – Choose between supervisor mode and supervisor with routing mode. With routing mode, the supervisor agent will route simple requests directly to specialized subagents, bypassing full orchestration. For complex queries or when no clear intention is detected, it automatically falls back to the full supervisor mode, where the supervisor agent analyzes, breaks down problems, and coordinates multiple subagents as needed.
  • Integrated trace and debug console – Visualize and analyze multi-agent interactions behind the scenes using the integrated trace and debug console.

What’s new in general availability?

The GA release introduces several key enhancements based on customer feedback, making multi-agent collaboration more scalable, flexible, and efficient:

  • Inline agent support – Enables the creation of supervisor agents dynamically at runtime, allowing for more flexible agent management without predefined structures.
  • AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) support – Lets customers deploy agent networks as code, enabling scalable, reusable agent templates across AWS accounts.
  • Enhanced traceability and debugging – Provides structured execution logs, sub-step tracking, and Amazon CloudWatch integration to improve monitoring and troubleshooting.
  • Increased collaborator and step count limits – Expands self-service limits for agent collaborators and execution steps, supporting larger-scale workflows.
  • Payload referencing – Reduces latency and costs by allowing the supervisor agent to reference external data sources without embedding them in the agent request.
  • Improved citation handling – Enhances accuracy and attribution when agents pull external data sources into their responses.

These features collectively improve coordination capabilities, communication speed, and overall effectiveness of the multi-agent collaboration framework in tackling complex, real-world problems.

Multi-agent collaboration across industries

Multi-agent collaboration is already transforming AI automation across sectors:

  • Investment advisory – A financial firm uses multiple agents to analyze market trends, risk factors, and investment opportunities to deliver personalized client recommendations.
  • Retail operations – A retailer deploys agents for demand forecasting, inventory tracking, pricing optimization, and order fulfillment to increase operational efficiency.
  • Fraud detection – A banking institution assigns agents to monitor transactions, detect anomalies, validate customer behaviors, and flag potential fraud risks in real time.
  • Customer support – An enterprise customer service platform uses agents for sentiment analysis, ticket classification, knowledge base retrieval, and automated responses to enhance resolution times.
  • Healthcare diagnosis – A hospital system integrates agents for patient record analysis, symptom recognition, medical imaging review, and treatment plan recommendations to assist clinicians.

Deep dive: Syngenta’s use of multi-agent collaboration

Syngenta, a global leader in agricultural innovation, has integrated cutting-edge generative AI into its Cropwise service, resulting in the development of Cropwise AI. This advanced system is designed to enhance the efficiency of agronomic advisors and growers by providing tailored recommendations for crop management practices.

Business challenge

The agricultural sector faces the complex task of optimizing crop yields while ensuring sustainability and profitability. Farmers and agronomic advisors must consider a multitude of factors, including weather patterns, soil conditions, crop growth stages, and potential pest and disease threats. In the past, analyzing these variables required extensive manual effort and expertise. Syngenta recognized the need for a more efficient, data-driven approach to support decision-making in crop management.

Solution: Cropwise AI

To address these challenges, Syngenta collaborated with AWS to develop Cropwise AI, using Amazon Bedrock Agents to create a multi-agent system that integrates various data sources and AI capabilities. This system offers several key features:

  • Advanced seed recommendation and placement – Uses predictive machine learning algorithms to deliver personalized seed recommendations tailored to each grower’s unique environment.
  • Sophisticated predictive modeling – Employs state-of-the-art machine learning algorithms to forecast crop growth patterns, yield potential, and potential risk factors by integrating real-time data with comprehensive historical information.
  • Precision agriculture optimization – Provides hyper-localized, site-specific recommendations for input application, minimizing waste and maximizing resource efficiency.

Agent architecture

Cropwise AI is built on AWS architecture and designed for scalability, maintainability, and security. The system uses Amazon Bedrock Agents to orchestrate multiple AI agents, each specializing in distinct tasks:

  • Data aggregation agent – Collects and integrates extensive datasets, including over 20 years of weather history, soil conditions, and more than 80,000 observations on crop growth stages.
  • Recommendation agent – Analyzes the aggregated data to provide tailored recommendations for precise input applications, product placement, and strategies for pest and disease control.
  • Conversational AI agent – Uses a multilingual conversational large language model (LLM) to interact with users in natural language, delivering insights in a clear format.

This multi-agent collaboration enables Cropwise AI to process complex agricultural data efficiently, offering actionable insights and personalized recommendations to enhance crop yields, sustainability, and profitability.

Results

By implementing Cropwise AI, Syngenta has achieved significant improvements in agricultural practices:

  • Enhanced decision-making: Agronomic advisors and growers receive data-driven recommendations, leading to optimized crop management strategies.
  • Increased yields: Utilizing Syngenta’s seed recommendation models, Cropwise AI helps growers increase yields by up to 5%.
  • Sustainable practices: The system promotes precision agriculture, reducing waste and minimizing environmental impact through optimized input applications.

Highlighting the significance of this advancement, Feroz Sheikh, Chief Information and Digital Officer at Syngenta Group, stated:

“Agricultural innovation leader Syngenta is using Amazon Bedrock Agents as part of its Cropwise AI solution, which gives growers deep insights to help them optimize crop yields, improve sustainability, and drive profitability. With multi-agent collaboration, Syngenta will be able to use multiple agents to further improve their recommendations to growers, transforming how their end-users make decisions and delivering even greater value to the farming community.” 

This collaboration between Syngenta and AWS exemplifies the transformative potential of generative AI and multi-agent systems in agriculture, driving innovation and supporting sustainable farming practices.

How multi-agent collaboration works

Amazon Bedrock automates agent collaboration, including task delegation, execution tracking, and data orchestration. Developers can configure their system in one of two collaboration modes (a minimal API sketch follows the list below):

  • Supervisor mode
    • The supervisor agent receives an input, breaks down complex requests, and assigns tasks to specialized sub-agents.
    • Sub-agents execute tasks in parallel or sequentially, returning responses to the supervisor, which consolidates the results.
  • Supervisor with routing mode
    • Simple queries are routed directly to a relevant sub-agent.
    • Complex or ambiguous requests trigger the supervisor to coordinate multiple agents to complete the task.
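
The following minimal sketch shows how these modes might be configured programmatically, assuming the create_agent and associate_agent_collaborator operations of the bedrock-agent client. The role ARN, alias ARN, model ID, and agent names are placeholders, and exact parameters should be checked against the current API reference.

```python
import boto3

# Minimal sketch: create a supervisor agent in supervisor-with-routing mode and
# associate an existing agent alias as a collaborator. All ARNs, names, and the
# model ID below are placeholders for illustration only.
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

supervisor = bedrock_agent.create_agent(
    agentName="orders-supervisor",  # placeholder name
    foundationModel="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model
    instruction="Route customer requests to the right specialist agent and consolidate their answers.",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",  # placeholder
    agentCollaboration="SUPERVISOR_ROUTER",  # or "SUPERVISOR" for full orchestration
)

bedrock_agent.associate_agent_collaborator(
    agentId=supervisor["agent"]["agentId"],
    agentVersion="DRAFT",
    collaboratorName="order-status-agent",  # placeholder name
    collaborationInstruction="Handle questions about order status and shipping.",
    agentDescriptor={
        "aliasArn": "arn:aws:bedrock:us-east-1:123456789012:agent-alias/EXAMPLE/ALIAS"  # placeholder
    },
    relayConversationHistory="TO_COLLABORATOR",
)
```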

Watch the Amazon Bedrock multi-agent collaboration video to learn how to get started.

Conclusion

By enabling seamless multi-agent collaboration, Amazon Bedrock empowers businesses to scale their generative AI applications with greater efficiency, accuracy, and flexibility. As organizations continue to push the boundaries of AI-driven automation, having the right tools to orchestrate complex workflows will be essential. With Amazon Bedrock, companies can confidently build AI systems that don’t just generate responses but drive real impact—automating processes, solving problems, and unlocking new possibilities across industries.

Amazon Bedrock multi-agent collaboration is now generally available.

Multi-agent collaboration opens new possibilities for AI-driven automation. Whether in finance, healthcare, retail, or agriculture, Amazon Bedrock helps organizations scale AI workflows with efficiency and precision.

Start building today—and let us know what you create!


About the authors

Sri Koneru has spent the last 13.5 years honing her skills in both cutting-edge product development and large-scale infrastructure. At Salesforce for 7.5 years, she had the incredible opportunity to build and launch brand new products from the ground up, reaching over 100,000 external customers. This experience was instrumental in her professional growth. Then, at Google for 6 years, she transitioned to managing critical infrastructure, overseeing capacity, efficiency, fungibility, job scheduling, data platforms, and spatial flexibility for all of Alphabet. Most recently, Sri joined Amazon Web Services leveraging her diverse skillset to make a significant impact on AI/ML services and infrastructure at AWS. Personally, Sri & her husband recently became empty nesters, relocating to Seattle from the Bay Area. They’re a basketball-loving family who even catch pre-season Warriors games but are looking forward to cheering on the Seattle Storm this year. Beyond basketball, Sri enjoys cooking, recipe creation, reading, and her newfound hobby of hiking. While she’s a sun-seeker at heart, she is looking forward to experiencing the unique character of Seattle weather.

Read More

Utah to Advance AI Education, Training

Utah to Advance AI Education, Training

A new AI education initiative in the State of Utah, developed in collaboration with NVIDIA, is set to advance the state’s commitment to workforce training and economic growth.

The public-private partnership aims to equip universities, community colleges and adult education programs across Utah with the resources to develop skills in generative AI.

“AI will continue to grow in importance, affecting every sector of Utah’s economy,” said Spencer Cox, governor of Utah. “We need to prepare our students and faculty for this revolution. Working with NVIDIA is an ideal path to help ensure that Utah is positioned for AI growth in the near and long term.”

As part of the new initiative, Utah’s educators can gain certification through the NVIDIA Deep Learning Institute University Ambassador Program. The program offers high-quality teaching kits, extensive workshop content and access to NVIDIA GPU-accelerated workstations in the cloud.

By empowering educators with the latest AI skills and technologies, the initiative seeks to create a competitive advantage for Utah’s entire higher education system.

“We believe that AI education is more than a pathway to innovation — it’s a foundation for solving some of the world’s most pressing challenges,” said Manish Parashar, director of the University of Utah Scientific Computing and Imaging (SCI) Institute, which leads the One-U Responsible AI Initiative. “By equipping students and researchers with the tools to explore, understand and create with AI, we empower them to be able to drive advancements in medicine, engineering and beyond.”

The initiative will begin with the Utah System of Higher Education (USHE) and several other universities in the state, including the University of Utah, Utah State University, Utah Valley University, Weber State University, Utah Tech University, Southern Utah University, Snow College and Salt Lake Community College.

Setting Up Students and Professionals for Success

The Utah AI education initiative will benefit students entering the job market and working professionals by helping them expand their skill sets beyond community college or adult education courses.

Utah state agencies are exploring how internship and apprenticeship programs can offer students hands-on experience with AI skills, helping bridge the gap between education and industry needs. This initiative aligns with Utah’s broader goals of fostering a tech-savvy workforce and positioning the state as a leader in AI innovation and application.

As AI continues to evolve and gain prevalence across industries, Utah’s proactive approach to equipping educators and students with resources and training will help prepare its workforce for the future of technology, sharpening its competitive edge.

Read More

Semantic Telemetry: Understanding how users interact with AI systems

Semantic Telemetry: Understanding how users interact with AI systems

Semantic Telemetry blog | diagram showing relationships between chat, LLM prompt, and labeled data

AI tools are proving useful across a range of applications, from helping to drive the new era of business transformation to helping artists craft songs. But which applications are providing the most value to users? We’ll dig into that question in a series of blog posts that introduce the Semantic Telemetry project at Microsoft Research. In this initial post, we will introduce a new data science approach that we will use to analyze topics and task complexity of Copilot in Bing usage.

Human-AI interactions can be iterative and complex, requiring a new data science approach to understand user behavior to build and support increasingly high value use cases. Imagine the following chat:

Example chat between user and AI

Here we see that chats can be complex and span multiple topics, such as event planning, team building, and logistics. Generative AI has ushered in a two-fold paradigm shift. First, LLMs give us a new thing to measure, that is, how people interact with AI systems. Second, they give us a new way to measure those interactions, that is, they give us the capability to understand and make inferences on these interactions, at scale. The Semantic Telemetry project has created new measures to classify human-AI interactions and understand user behavior, contributing to efforts in developing new approaches for measuring generative AI (opens in new tab) across various use cases.

Semantic Telemetry rethinks traditional telemetry, in which data is collected to understand systems, for the purpose of analyzing chat-based AI. We employ a data science methodology that uses a large language model (LLM) to generate meaningful categorical labels, enabling us to gain insights into chat log data.

Flow chart illustrating the LLM classification process starting with chat input, then prompting LLM with chat using generated label taxonomy, and output is the labeled chat.
Figure 1: Prompting an LLM to classify a conversation based on LLM generated label taxonomy

This process begins with developing a set of classifications and definitions. We create these classifications by instructing an LLM to generate a short summary of the conversation, and then iteratively prompting the LLM to generate, update, and review classification labels on a batched set of summaries. This process is outlined in the paper: TnT-LLM: Text Mining at Scale with Large Language Models. We then prompt an LLM with these generated classifiers to label new unstructured (and unlabeled) chat log data.
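
To make the labeling step concrete, here is a minimal illustrative sketch of prompting an LLM with a generated label taxonomy to classify a new chat. The taxonomy, prompt wording, and classify_chat helper are assumptions made for illustration, not the classifiers or prompts used by the Semantic Telemetry project.

```python
# Illustrative sketch of LLM-based labeling with a generated taxonomy.
# The taxonomy and prompt below are invented for illustration only.
TOPIC_LABELS = {
    "Technology": "Programming, computers and electronics, engineering, data, ML and AI",
    "Entertainment": "Sports, travel, gaming, small talk",
    "Health": "Symptoms, wellness, medical information",
}

def build_classification_prompt(chat_text: str) -> str:
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOPIC_LABELS.items())
    return (
        "You are labeling an anonymized chat between a user and an AI assistant.\n"
        f"Choose the single primary topic from this taxonomy:\n{label_lines}\n\n"
        f"Chat:\n{chat_text}\n\n"
        "Respond with only the topic name."
    )

def classify_chat(chat_text: str, llm_call) -> str:
    """llm_call is any function that sends a prompt to an LLM and returns its text reply."""
    return llm_call(build_classification_prompt(chat_text)).strip()
```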

Description of LLM generated label taxonomy process

With this approach, we have analyzed how people interact with Copilot in Bing. In this blog, we examine insights into how people are using Copilot in Bing, including how that differs from traditional search engines. Note that all analyses were conducted on anonymous Copilot interactions containing no personal information.

Topics

To get a clear picture of how people are using Copilot in Bing, we need to first classify sessions into topical categories. To do this, we developed a topic classifier. We used the LLM classification approach described above to label the primary topic (domain) for the entire content of the chat. Although a single chat can cover multiple topics, for this analysis, we generated a single label for the primary topic of the conversation. We sampled five million anonymized Copilot in Bing chats during August and September 2024, and found that globally, 21% of all chats were about technology, with a high concentration of these chats in programming and scripting and computers and electronics.

Bubble chart showing topics based on percentage of sample. Primary topics shown are Technology (21%), Entertainment (12.8%), Health (11%), Language, Writing, & Editing (11.6%), Lifestyle (9.2%), Money (8.5%), History, Events, & Law (8.5%), Career (7.8%), Science (6.3%)
Figure 2: Top Copilot in Bing topics based on anonymized data (August-September 2024)
Bubble chart of Technology topic showing subtopics: Programming & scripting, Computers & electronics, Engineering & design, Data analysis, and ML & AI.
Figure 3: Frequent topic summaries in Technology
Bubble chart of Entertainment showing subtopics: Entertainment, Sports & fitness, Travel & tourism, Small talk & chatbot, and Gaming
Figure 4: Frequent topic summaries in Entertainment

Diving into the technology category, we find a lot of professional tasks in programming and scripting, where users request problem-specific assistance such as fixing a SQL query syntax error. In computers and electronics, we observe users getting help with tasks like adjusting screen brightness and troubleshooting internet connectivity issues. We can compare this with our second most common topic, entertainment, in which we see users seeking information related to personal activities like hiking and game nights.

We also note that top topics differ by platform. The figure below depicts topic popularity based on mobile and desktop usage. Mobile device users tend to use the chat for more personal-related tasks such as helping to plant a garden or understanding medical symptoms whereas desktop users conduct more professional tasks like revising an email.

Sankey visual showing top topics for Desktop and Mobile users
Figure 5: Top topics for desktop users and mobile users


Search versus Copilot

Beyond analyzing topics, we compared Copilot in Bing usage to that of traditional search. Chat extends beyond traditional online search by enabling users to summarize, generate, compare, and analyze information. Human-AI interactions are conversational and more complex than traditional search (Figure 6).

Venn diagram showing differences between Bing Search and Copilot in Bing, with intersection in information lookup.
Figure 6: Bing Search Query compared to Copilot in Bing Conversation

A major differentiation between search and chat is the ability to ask more complex questions, but how can we measure this? We think of complexity as a scale ranging from simply asking chat to look up information to evaluating several ideas. We aim to understand the difficulty of a task if performed by a human without the assistance of AI. To achieve this, we developed the task complexity classifier, which assesses task difficulty using Anderson and Krathwohl’s Taxonomy of Learning Objectives (opens in new tab). For our analysis, we have grouped the learning objectives into two categories: low complexity and high complexity. Any task more complicated than information lookup is classified as high complexity. Note that this would be very challenging to classify using traditional data science techniques.

Description of task complexity and 6 categories of the Anderson and Krathwohl’s Taxonomy of Learning Objectives
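
As a rough illustration of that grouping (not the project’s actual classifier), a mapping from the six taxonomy levels to the two complexity buckets could look like the following, where only simple information lookup counts as low complexity.

```python
# Illustrative assumption: collapse the six levels of Anderson and Krathwohl's
# taxonomy into the two buckets used in this analysis. Only information lookup
# ("Remember") maps to low complexity; everything else is high complexity.
TAXONOMY_TO_COMPLEXITY = {
    "Remember": "low",
    "Understand": "high",
    "Apply": "high",
    "Analyze": "high",
    "Evaluate": "high",
    "Create": "high",
}

def task_complexity(taxonomy_label: str) -> str:
    return TAXONOMY_TO_COMPLEXITY.get(taxonomy_label, "high")
```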

Comparing low versus high complexity tasks, most chat interactions were categorized as high complexity (78.9%), meaning that they were more complex than looking up information. Programming and scripting, marketing and sales, and creative and professional writing are topics in which users engage in higher complexity tasks (Figure 7) such as learning a skill, troubleshooting a problem, or writing an article.

Highest and lowest complexity topics based on percent of high complexity chats
Figure 7: Most and least complex topics based on percentage of high complexity tasks.

Travel and tourism and history and culture scored lowest in complexity, with users looking up information like flight times and the latest news updates.


When should you use chat instead of search? A 2024 Microsoft Research study, The Use of Generative Search Engines for Knowledge Work and Complex Tasks, suggests that people see the most value in technical, complex tasks such as web development and data analysis. Bing Search contained more lower-complexity queries focused on non-professional areas, like gaming and entertainment, travel and tourism, and fashion and beauty, while chat had a greater share of complex technical tasks (Figure 8).

Comparison of Bing Search and Copilot in Bing topics based on complexity and knowledge work. Copilot in Bing trends greater complexity and greater knowledge work than Bing Search.
Figure 8: Comparison of Bing Search and Copilot in Bing for anonymized sample data (May-June 2023)

Conclusion

LLMs have enabled a new era of high-quality human-AI interaction, and with it, the capability to analyze those same interactions with high fidelity, at scale, and in near real time. We can now obtain actionable insights from complex data that are not possible with traditional data science pattern-matching methods. LLM-generated classifications are pushing research in new directions that will ultimately improve user experience and satisfaction with chat and other user-AI interaction tools.

This analysis indicates that Copilot in Bing is enabling users to do more complex work, specifically in areas such as technology. In our next post, we will explore how Copilot in Bing is supporting professional knowledge work and how we can use these measures as indicators for retention and engagement.


FOOTNOTE: This research was conducted at the time the feature Copilot in Bing was available as part of the Bing service; since October 2024 Copilot in Bing has been deprecated in favor of the standalone Microsoft Copilot service.

References:

  1. Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2 (opens in new tab)


Read More