How IDIADA optimized its intelligent chatbot with Amazon Bedrock

This post is co-written with Xavier Vizcaino, Diego Martín Montoro, and Jordi Sánchez Ferrer from Applus+ Idiada.

In 2021, Applus+ IDIADA, a global partner to the automotive industry with over 30 years of experience supporting customers in product development activities through design, engineering, testing, and homologation services, established the Digital Solutions department. This strategic move aimed to drive innovation by using digital tools and processes. Since then, we have optimized data strategies, developed customized solutions for customers, and prepared for the technological revolution reshaping the industry.

AI now plays a pivotal role in the development and evolution of the automotive sector, in which Applus+ IDIADA operates. Within this landscape, we developed an intelligent chatbot, AIDA (Applus Idiada Digital Assistant), an Amazon Bedrock powered virtual assistant that serves as a versatile companion to IDIADA's workforce.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

With Amazon Bedrock, AIDA assists with a multitude of tasks, from addressing inquiries to tackling complex technical challenges spanning code, mathematics, and translation. Its capabilities are truly boundless.

With AIDA, we take another step towards our vision of providing global and integrated digital solutions that add value for our customers. Its internal deployment strengthens our leadership in developing data analysis, homologation, and vehicle engineering solutions. Additionally, in the medium term, IDIADA plans to offer AIDA as an integrable product for customers’ environments and develop “light” versions seamlessly integrable into existing systems.

In this post, we showcase the research process undertaken to develop a classifier for human interactions in this AI-based environment using Amazon Bedrock. The objective was to accurately identify the type of interaction received by the intelligent agent to route the request to the appropriate pipeline, providing a more specialized and efficient service.

The challenge: Optimize intelligent chatbot responses, allocate resources more effectively, and enhance the overall user experience

Built on a flexible and secure architecture, AIDA offers a versatile environment for integrating multiple data sources, including structured data from enterprise databases and unstructured data from internal sources like Amazon Simple Storage Service (Amazon S3). It boasts advanced capabilities like chat with data, advanced Retrieval Augmented Generation (RAG), and agents, enabling complex tasks such as reasoning, code execution, or API calls.

As AIDA’s interactions with humans proliferated, a pressing need emerged to establish a coherent system for categorizing these diverse exchanges.

Initially, users were making simple queries to AIDA, but over time, they started to request more specific and complex tasks. These included document translations, inquiries about IDIADA’s internal services, file uploads, and other specialized requests.

The main reason for this categorization was to develop distinct pipelines that could more effectively address various types of requests. By sorting interactions into categories, AIDA could be optimized to handle specific kinds of tasks more efficiently. This approach allows for tailored responses and processes for different types of user needs, whether it’s a simple question, a document translation, or a complex inquiry about IDIADA’s services.

The primary objective is to offer a more specialized service through the creation of dedicated pipelines for various contexts, such as conversation, document translation, and services to provide more accurate, relevant, and efficient responses to users’ increasingly diverse and specialized requests.

Solution overview

By categorizing the interactions into three main groups—conversation, services, and document translation—the system can better understand the user’s intent and respond accordingly. The Conversation class encompasses general inquiries and exchanges, the Services class covers requests for specific functionalities or support, and the Document_Translation class handles text translation needs.

The specialized pipelines, designed specifically for each use case, allow for a significant increase in efficiency and accuracy of AIDA’s responses. This is achieved in several ways:

  • Enhanced efficiency – By having dedicated pipelines for specific types of tasks, AIDA can process requests more quickly. Each pipeline is optimized for its particular use case, which reduces the computation time needed to generate an appropriate response.
  • Increased accuracy – The specialized pipelines are equipped with specific tools and knowledge for each type of task. This allows AIDA to provide more accurate and relevant responses, because it uses the most appropriate resources for each type of request.
  • Optimized resource allocation – By classifying interactions, AIDA can allocate computational resources more efficiently, directing the appropriate processing power to each type of task.
  • Improved response time – The combination of greater efficiency and optimized resource allocation results in faster response times for users.
  • Enhanced adaptability – This system allows AIDA to better adapt to different types of requests, from simple queries to complex tasks such as document translations or specialized inquiries about IDIADA services.

The research and development of this large language model (LLM) based classifier is an important step in the continuous improvement of the intelligent agent’s capabilities within the Applus IDIADA environment.

For this occasion, we use a set of 1,668 examples of pre-classified human interactions. These have been divided into 666 for training and 1,002 for testing. A 40/60 split has been applied, giving significant importance to the test set.
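For reference, the 40/60 split can be reproduced with scikit-learn; the following is a minimal sketch, assuming a pre-classified DataFrame with the sample and agent columns used later in this post (the stratification by class is our assumption, not stated in the original split):

from sklearn.model_selection import train_test_split

# df is a hypothetical DataFrame of 1,668 pre-classified interactions
df_train, df_test = train_test_split(
    df,
    test_size=0.6,            # 40% train / 60% test, as described above
    stratify=df['agent'],     # keep class proportions similar in both sets (assumption)
    random_state=42,
)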

The following table shows some examples.

Sample | Class
Can you make a summary of this text? “Legislation for the Australian Government’s …” | Conversation
No, only focus on this sentence: Braking technique to enable maximum brake application speed | Conversation
In a factory give me synonyms of a limiting resource of activities | Conversation
We need a translation of the file “Company_Bylaws.pdf” into English, could you handle it? | Document_Translation
Please translate the file “Product_Manual.xlsx” into English | Document_Translation
Could you convert the document “Data_Privacy_Policy.doc” into English, please? | Document_Translation
Register my username in the IDIADA’s human resources database | Services
Send a mail to random_user@mail.com to schedule a meeting for the next weekend | Services
Book an electric car charger for me at IDIADA | Services

We present three different classification approaches: two based on LLMs and one using a classic machine learning (ML) algorithm. The aim is to understand which approach is most suitable for addressing the presented challenge.

LLM-based classifier: Simple prompt

In this case, we developed an LLM-based classifier to categorize inputs into three classes: Conversation, Services, and Document_Translation. Instead of relying on predefined, rigid definitions, our approach follows the principle of understanding a set. This principle involves analyzing the common characteristics and patterns present in the examples or instances that belong to each class. By studying the shared traits of inputs within a class, we can derive an understanding of the class itself, without being constrained by preconceived notions.

It’s important to note that the learned definitions might differ from common expectations. For instance, the Conversation class encompasses not only typical conversational exchanges but also tasks like text summarization, which share similar linguistic and contextual traits with conversational inputs.

By following this data-driven approach, the classifier can accurately categorize new inputs based on their similarity to the learned characteristics of each class, capturing the nuances and diversity within each category.

The code consists of the following key components: libraries, a prompt, model invocation, and an output parser.

Libraries

The programming language used in this code is Python, complemented by the LangChain module, which is specifically designed to facilitate the integration and use of LLMs. This module provides a comprehensive set of tools and abstractions that streamline the process of incorporating and deploying these advanced AI models.

To take advantage of the power of these language models, we use Amazon Bedrock. The integration with Amazon Bedrock is achieved through the Boto3 Python module, which serves as an interface to AWS, enabling seamless interaction with Amazon Bedrock and the deployment of the classification model.

Prompt

The task is to assign one of three classes (Conversation, Services, or Document_Translation) to a given sentence, represented by question:

  • Conversation class – This class encompasses casual messages, summarization requests, general questions, affirmations, greetings, and similar types of text. It also includes requests for text translation, summarization, or explicit inquiries about the meaning of words or sentences in a specific language.
  • Services class – Texts belonging to this class consist of explicit requests for services such as room reservations, hotel bookings, dining services, cinema information, tourism-related inquiries, and similar service-oriented requests.
  • Document_Translation class – This class is characterized by requests for the translation of a document to a specific language. Unlike the Conversation class, these requests don’t involve summarization. Additionally, the name of the document to be translated and the target language are specified.

The prompt suggests a hierarchical approach to the classification process. First, the sentence should be evaluated to determine if it can be classified as a conversation. If the sentence doesn’t fit the Conversation class, one of the other two classes (Services or Document_Translation) should be assigned.

The priority for the Conversation class stems from the fact that 99% of the interactions are actually simple questions regarding various matters.

Model invocation

We use Anthropic’s Claude 3 Sonnet model for the natural language processing task. This LLM has a context window of 200,000 tokens, enabling it to handle different languages and return highly accurate answers. We use two key parameters:

  • max_tokens – This parameter limits the maximum number of tokens (words or subwords) that the language model can generate in its output (20 in the simple prompt version shown below, and 50 in the example-augmented version).
  • temperature – This parameter controls the randomness of the language model’s output. A temperature of 0.0 means that the model will produce the most likely output according to its training, without introducing randomness.

Output parser

Another important component is the output parser, which allows us to gather the desired information in JSON format. To achieve this, we use LangChain’s output_parsers module (specifically, the StructuredOutputParser class).

The following code illustrates a simple prompt approach:

import json

import boto3
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

# Amazon Bedrock Runtime client (same configuration as in the later snippets)
bedrock_runtime = boto3.client(service_name='bedrock-runtime', region_name='us-west-2')

def classify_interaction(question):
    # JSON output schema expected from the model
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt = f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation: text consists of casual messages, summarization requests, general questions, affirmations, greetings,
    and similar. Requests for text translation, text summarization or explicit text translation requests,
    questions about the meaning of words or sentences in a concrete language.
    Services: the text consists of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
    The length of the document is specified.
    Assign a class to the following sentence.
    {question}
    Try to understand the sentence as a Conversation one, if you can't, then assign one of the other classes.
    {format_instructions}
    """
    # Invoke Anthropic's Claude 3 Sonnet through Amazon Bedrock
    response = bedrock_runtime.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 20,
                    "temperature": 0,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )
    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    try:
        # Normalize stray single quotes so the JSON parser succeeds
        output_dict = output_parser.parse(texto.replace("'", '"'))['class']
    except Exception:
        # Fall back to the most frequent class if parsing fails
        output_dict = 'Conversation'
    return output_dict

LLM-based classifier: Example augmented inference

We use RAG techniques to enhance the model’s response capabilities. Instead of relying solely on compressed definitions, we provide the model with a quasi-definition by extension. Specifically, we present the model with a diverse set of examples for each class, allowing it to learn the inherent characteristics and patterns that define each category. For instance, in addition to a concise definition of the Conversation class, the model is exposed to various conversational inputs, enabling it to identify common traits such as informal language, open-ended questions, and back-and-forth exchanges. This example-driven approach complements the initial descriptions provided, allowing the model to capture the nuances and diversity within each class. By combining concise definitions with representative examples, the RAG technique helps the model develop a more comprehensive understanding of the classes, enhancing its ability to accurately categorize new inputs based on their inherent nature and characteristics.

The following code provides examples in JSON format for RAG:

{
    "Conversation":[
       "Could you give me examples of how to solve it?",
       "cool but anything short and sweet",
       "..."
    ],
    "Services":[
       "make a review of my investments in the eBull.com platform",
       "I need a room in IDIADA",
       "schedule a meeting with",
       "..."
    ],
    "Document_Translation":[
       "Translate the file into Catalan",
       "Could you translate the document I added earlier into Swedish?",
       "Translate the Guía_Rápida.doc file into Romanian",
       "..."
    ]
 }

The total number of examples provided for each class is as follows:

  • Conversation – 500 examples. This is the most common class, and only 500 samples are given to the model due to the vast amount of information, which could cause infrastructure overflow (very high delays, throttling, connection drops). This is a crucial point to note because it represents a significant bottleneck. Providing more examples to this approach could potentially improve performance, but the question remains: How many examples? Surely, a substantial amount would be required.
  • Services – 26 examples. This is the least common class, and in this case, all available training data has been used.
  • Document_Translation – 140 examples. Again, all available training data has been used for this class.

One of the key challenges with this approach is scalability. Although the model’s performance improves with more training examples, the computational demands quickly become overwhelming for our current infrastructure. The sheer volume of data required can lead to quota issues with Amazon Bedrock and unacceptably long response times. Rapid response times are essential for providing a satisfactory user experience, and this approach falls short in that regard.

In this case, we need to modify the code to embed all the examples. The following code shows the changes applied to the first version of the classifier. The prompt is modified to include all the examples in JSON format under the “Here you have some examples” section.

def classify_interaction(question, agent_examples):
    # Reuses the imports and bedrock_runtime client from the previous snippet
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()
    prompt = f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation: text consists of casual messages, summarization requests, general questions, affirmations, greetings,
    and similar. Requests for text translation, text summarization or explicit text translation requests,
    questions about the meaning of words or sentences in a concrete language.
    Services: the text consists of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
    The length of the document is specified.

    Here you have some examples:
    {agent_examples}

    Assign a class to the following sentence.
    {question}

    Try to understand the sentence as a Conversation one, if you can't, then assign one of the other classes.
    {format_instructions}
    """

    response = bedrock_runtime.invoke_model(
            modelId='anthropic.claude-3-sonnet-20240229-v1:0',
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 50,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )

    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    # Normalize stray single quotes so the JSON parser succeeds
    output_dict = output_parser.parse(texto.replace("'", '"'))['class']

    return output_dict

K-NN-based classifier: Amazon Titan Embeddings

In this case, we take a different approach by recognizing that despite the multitude of possible interactions, they often share similarities and repetitive patterns. Instead of treating each input as entirely unique, we can use a distance-based approach like k-nearest neighbors (k-NN) to assign a class based on the most similar examples surrounding the input. To make this work, we need to transform the textual interactions into a format that allows algebraic operations. This is where embeddings come into play. Embeddings are vector representations of text that capture semantic and contextual information. We can calculate the semantic similarity between different interactions by converting text into these vector representations, comparing the resulting vectors, and measuring their proximity in the embedding space.

To accommodate this approach, we need to modify the code accordingly:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)
df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Embed every interaction with Amazon Titan Text Embeddings
X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()

# 3-nearest-neighbors classifier on the embedding space
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We used the Amazon Titan Text Embeddings G1 model, which generates vectors of 1,536 dimensions. This model is trained to accept multiple languages while retaining the semantic meaning of the embedded phrases.

For the classifier, we employed a classic ML algorithm, k-NN, using the scikit-learn Python module. This method takes the number of neighbors as its main parameter, which we set to 3.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors (k) used in the k-NN algorithm. As the graph shows, the optimal value for k is 3, which yields the highest F1 score for the most prevalent class, Document_Translation. Although it’s not the absolute highest score for the Services class, Document_Translation is significantly more common, making k=3 the best overall choice to maximize performance across all classes.

[Figure: F1 score per class vs. number of neighbors (k) for Amazon Titan embeddings]
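The sweep behind this figure can be reproduced with a short loop over k; the following is a minimal sketch that reuses the X_train_emb, X_test_emb, y_train, and y_test variables from the previous snippet (the exact range of k values is our assumption):

import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

classes = ['Conversation', 'Document_Translation', 'Services']
for k in range(1, 16):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train_emb, y_train)
    y_pred = neigh.predict(X_test_emb)
    # Per-class F1 scores, in the order given by `classes`
    scores = f1_score(y_test, y_pred, labels=classes, average=None)
    print(k, dict(zip(classes, np.round(scores, 3))))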

K-NN-based classifier: Cohere’s multilingual embeddings model

In the previous section, we used the popular Amazon Titan Text Embeddings G1 model to generate text embeddings. However, other models might offer different advantages. In this section, we explore the use of Cohere’s multilingual model on Amazon Bedrock for generating embeddings. We chose the Cohere model due to its excellent capability in handling multiple languages without compromising the vectorization of phrases. As we will demonstrate, this model doesn’t introduce significant differences in the generated vectors compared to other models, making it more suitable for use in a multilingual environment like AIDA.

To use the Cohere model, we need to change the model_id:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)
df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Cohere's embeddings model limits the input size, so keep only the first 1,500 characters
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

# 11-nearest-neighbors classifier on the embedding space
neigh = KNeighborsClassifier(n_neighbors=11)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We use Cohere’s multilingual embeddings model to generate vectors with 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ k-NN, using the scikit-learn Python module. This method takes the number of neighbors as its main parameter, which we set to 11.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors used. As depicted, the optimal point is k=11, achieving the highest value for Document_Translation and the second-highest for Services. In this instance, the trade-off between Document_Translation and Services is favorable.

[Figure: F1 score per class vs. number of neighbors (k) for Cohere embeddings]

Amazon Titan Embeddings vs. Cohere’s multilingual embeddings model

In this section, we delve deeper into the embeddings generated by both models, aiming to understand their nature and consequently comprehend the results obtained. To achieve this, we have performed dimensionality reduction to visualize the vectors obtained in both cases in 2D.

Cohere’s multilingual embeddings model has a limitation on the size of the text it can vectorize, posing a significant constraint. Therefore, in the implementation showcased in the previous section, we truncated the interactions, keeping only the first 1,500 characters of each.
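The post doesn't show the projection code; the following is a minimal sketch using PCA as the dimensionality reduction technique (an assumption on our part), applied to the embeddings and labels computed in the previous snippets:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional embeddings down to 2D for visualization
coords = PCA(n_components=2).fit_transform(X_train_emb)

for label in set(y_train):
    idx = [i for i, y in enumerate(y_train) if y == label]
    plt.scatter(coords[idx, 0], coords[idx, 1], s=8, label=label)
plt.legend()
plt.title('2D projection of interaction embeddings')
plt.show()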

The following figure illustrates the vector spaces generated in each case.

[Figure: 2D projections of the vector spaces generated by each embeddings model]

As we can observe, the generated vector spaces are relatively similar, initially appearing to be analogous spaces with a rotation between one another. However, upon closer inspection, it becomes evident that the direction of maximum variance in the case of Cohere’s multilingual embeddings model is distinct (deducible from observing the relative position and shape of the different groups). This type of situation, where high class overlap is observed, presents an ideal case for applying algorithms such as k-NN.

As mentioned in the introduction, most human interactions with AI are very similar to each other within the same class. This would explain why k-NN-based models outperform LLM-based models.

SVM-based classifier: Amazon Titan Embeddings

In this scenario, it is likely that user interactions belonging to the three main categories (Conversation, Services, and Document_Translation) form distinct clusters or groups within the embedding space. Each category possesses particular linguistic and semantic characteristics that would be reflected in the geometric structure of the embedding vectors. The previous visualization of the embeddings space displayed only a 2D transformation of this space. This doesn’t imply that clusters couldn’t be highly separable in higher dimensions.

Classification algorithms like support vector machines (SVMs) are especially well-suited to use this implicit geometry of the data. SVMs seek to find the optimal hyperplane that separates the different groups or classes in the embedding space, maximizing the margin between them. This ability of SVMs to use the underlying geometric structure of the data makes them an intriguing option for this user interaction classification problem.

Furthermore, SVMs are a robust and efficient algorithm that can effectively handle high-dimensional datasets, such as text embeddings. This makes them particularly suitable for this scenario, where the embedding vectors of the user interactions are expected to have a high dimensionality.

The following code illustrates the implementation:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='eu-central-1'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())

# Grid search over kernel, C, and class_weight with 10-fold CV, scored by weighted F1
f1 = make_scorer(f1_score, average='weighted')
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Amazon Titan Text Embeddings G1. This model generates vectors of 1,536 dimensions, and is trained to accept several languages and to retain the semantic meaning of the phrases embedded.

To implement the classifier, we employed a classic ML algorithm, SVM, using the scikit-learn Python module. The SVM algorithm requires the tuning of several parameters to achieve optimal performance. To determine the best parameter values, we conducted a grid search with 10-fold cross-validation, using the F1 multi-class score as the evaluation metric. This systematic approach allowed us to identify the following set of parameters that yielded the highest performance for our classifier:

  • C – We set this parameter to 1. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It acts as a regularization parameter. A higher value of C (for example, 10) indicates a higher penalty for misclassification errors. This results in a more complex model that tries to fit the training data more closely. A higher C value can be beneficial when the classes in the data are well separated, because it allows the algorithm to create a more intricate decision boundary to accurately classify the samples. On the other hand, a C value of 1 indicates a reasonable balance between fitting the training set and the model’s generalization ability. This value might be appropriate when the data has a simple structure, and a more flexible model isn’t necessary to capture the underlying relationships. In our case, the selected C value of 1 suggests that the data has a relatively simple structure, and a balanced model with moderate complexity is sufficient for accurate classification.
  • class_weight – We set this parameter to None. This parameter adjusts the weights of each class during the training process. Setting class_weight to balanced automatically adjusts the weights inversely proportional to the class frequencies in the input data. This is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the others. In our case, the value of None for the class_weight parameter suggests that the minor classes don’t have much relevance or impact on the overall classification task. This choice implies that the implicit geometry or decision boundaries learned by the model might not be optimized for separating the different classes effectively.
  • kernel – We set this parameter to linear. This parameter specifies the type of kernel function to be used by the SVC algorithm. The linear kernel is a simple and efficient choice because it assumes that the decision boundary between classes can be represented by a linear hyperplane in the feature space. This value suggests that, in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.
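After the grid search in the previous snippet finishes, the selected combination can be read back directly from the fitted object; a short sketch:

from sklearn.metrics import classification_report

# Winning parameter combination and its cross-validated weighted F1 score
print(clf.best_params_)    # e.g. {'C': 1, 'class_weight': None, 'kernel': 'linear'}
print(round(clf.best_score_, 3))

# Held-out performance of the refitted best estimator
print(classification_report(y_test, clf.predict(X_test_emb)))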

SVM-based classifier: Cohere’s multilingual embeddings model

The implementation details of the classifier are presented in the following code:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Keep only the first 1,500 characters to respect the model's input size limit
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()

X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

# Grid search over kernel, C, and class_weight with 10-fold CV, scored by weighted F1
f1 = make_scorer(f1_score, average='weighted')

parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Cohere’s multilingual embeddings model, which generates vectors of 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ SVM, using the scikit-learn Python module. To obtain the optimal parameters, we performed a grid search with 10-fold cross-validation based on the multi-class F1 score, resulting in the following selected parameters (as detailed in the previous section):

  • C – We set this parameter to 1, which indicates a reasonable balance between fitting the training set and the model’s generalization ability. This setting suggests that the data has a simple structure and that a more flexible model might not be necessary to capture the underlying relationships.
  • class_weight – We set this parameter to None. A value of None suggests that the minor classes don’t have much relevance, which in turn implies that the implicit geometry might not be suitable for separating the different classes.
  • kernel – We set this parameter to linear. This value suggests that in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.

ANN-based classifier: Amazon Titan and Cohere’s multilingual embeddings model

Given the promising results obtained with SVMs, we decided to explore another geometry-based method by employing an Artificial Neural Network (ANN) approach.

In this case, we normalized the input vectors to take advantage of the benefits normalization brings when training neural networks. Normalizing the input data is a crucial step when working with ANNs, because it can help improve the model’s performance during training. We applied min/max scaling for normalization.

The use of an ANN-based approach provides the ability to capture complex non-linear relationships in the data, which might not be easily modeled using traditional linear methods like SVMs. The combination of the geometric insights and the normalization of inputs can potentially lead to improved predictive accuracy compared to the previous SVM results.

This approach consists of the following components:

  • Model definition – We define a sequential deep learning model using the Keras library from TensorFlow.
  • Model architecture – The model consists of three densely connected layers. The first layer has 16 neurons and uses the ReLU activation function. The second layer has 8 neurons and employs the ReLU activation function. The third layer has 3 neurons and uses the softmax activation function.
  • Model compilation – We compile the model using the categorical_crossentropy loss function, the Adam optimizer with a learning rate of 0.01, and the categorical_accuracy metric. We incorporate an EarlyStopping callback to stop the training if the categorical_accuracy metric doesn’t improve for 25 epochs.
  • Model training – We train the model for a maximum of 500 epochs using the training set and validate it on the test set. The batch size is set to 64. The performance metric used is the maximum classification accuracy (categorical_accuracy) obtained during the training.

We applied the same methodology, but using the embeddings generated by Cohere’s multilingual embeddings model after being normalized through min/max scaling. In both cases, we employed the same preprocessing steps:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2'
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate to 1,500 characters to respect the embeddings model's input size limit
df_train['sample'] = [s[:1500] for s in df_train['sample']]
df_test['sample'] = [s[:1500] for s in df_test['sample']]

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()

# One-hot encode the training labels for the softmax output layer
y_train_ohe = [[int(y == 'Conversation'), int(y == 'Document_Translation'), int(y == 'Services')] for y in y_train]
y_test = df_test['agent'].values.tolist()
y_test = [['Conversation', 'Document_Translation', 'Services'].index(y) for y in y_test]
X_test = df_test['sample'].values.tolist()

To help avoid ordinal assumptions, we employed a one-hot encoding representation for the output of the network. One-hot encoding doesn’t make any assumptions about the inherent order or hierarchy among the categories. This is particularly useful when the categorical variable doesn’t have a clear ordinal relationship, because the model can learn the relationships without being biased by any assumed ordering.
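The min/max scaling mentioned earlier isn’t shown in the post’s code, so the following is a minimal sketch that produces the X_train_emb_norm and X_test_emb_norm arrays used in the training code below, using scikit-learn’s MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training embeddings only, then apply it to both sets
scaler = MinMaxScaler()
X_train_emb_norm = scaler.fit_transform(X_train_emb)
X_test_emb_norm = scaler.transform(X_test_emb)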

The following code illustrates the implementation:

import threading

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

def train_model(X, y, n_hebras=10, reps=30, train_size=0.7, tipo_optimizacion="low"):
    # Train `reps` models split across `n_hebras` threads and keep the best one
    reps_por_hebra = int(reps / n_hebras)
    hebras = [0] * n_hebras
    results = [0] * reps
    models = [0] * reps

    for i in range(len(hebras)):
        hebras[i] = threading.Thread(target=eval_model_rep_times,
            args=(X, y, train_size, reps_por_hebra, i * reps_por_hebra, models, results))
        hebras[i].start()

    for i in range(len(hebras)):
        hebras[i].join()

    if tipo_optimizacion == "low":
        result = models[np.argmin(results)], min(results)
    else:
        result = models[np.argmax(results)], max(results)
    return result

def eval_model_rep_times(X, y, train_size, reps, index, models, results):
    # Each repetition uses a fresh random train/validation split
    for rep in range(reps):
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
        model, metric = create_and_fit_model(X_train, y_train, X_test, y_test)
        models[index + rep] = model
        results[index + rep] = metric

def create_and_fit_model(X_train, y_train, X_test, y_test):
    ### MODEL DEFINITION ###
    model = Sequential()
    model.add(Dense(16, input_shape=(len(X_train[0]),), activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['categorical_accuracy'])
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="categorical_accuracy", patience=25, mode='max')

    ### TRAINING ###
    history = model.fit(X_train,
              y_train,
              epochs=500,
              validation_data=(X_test, y_test),
              batch_size=64,
              callbacks=[early_stopping],
              verbose=0)

    metrica = max(history.history['categorical_accuracy'])

    # Always return the fitted model together with its best training accuracy
    return model, metrica

# Keep the model with the highest categorical accuracy across 20 runs on 5 threads
model, best_accuracy = train_model(X_train_emb_norm, y_train_ohe, 5, 20, tipo_optimizacion="high")
y_pred = [est.argmax() for est in model.predict(X_test_emb_norm)]

Results

We conducted a comparative analysis using the previously presented code and data. The models were assessed based on their F1 scores for the conversation, services, and document translation tasks, as well as their runtimes. The following table summarizes our results.

Model | Conversation F1 | Services F1 | Document_Translation F1 | Runtime (seconds)
LLM | 0.81 | 0.22 | 0.46 | 1.2
LLM with examples | 0.86 | 0.13 | 0.68 | 18
KNN – Amazon Titan Embeddings | 0.98 | 0.57 | 0.88 | 0.35
KNN – Cohere Embeddings | 0.96 | 0.72 | 0.72 | 0.35
SVM – Amazon Titan Embeddings | 0.98 | 0.69 | 0.82 | 0.3
SVM – Cohere Embeddings | 0.99 | 0.80 | 0.93 | 0.3
ANN – Amazon Titan Embeddings | 0.98 | 0.60 | 0.87 | 0.15
ANN – Cohere Embeddings | 0.99 | 0.77 | 0.96 | 0.15

As illustrated in the table, the SVM and ANN models using Cohere’s multilingual embeddings model demonstrated the strongest overall performance. The SVM with Cohere’s multilingual embeddings model achieved the highest F1 scores in two out of three tasks, reaching 0.99 for Conversation, 0.80 for Services, and 0.93 for Document_Translation. Similarly, the ANN with Cohere’s multilingual embeddings model also performed exceptionally well, with F1 scores of 0.99, 0.77, and 0.96 for the respective tasks.

In contrast, the LLM exhibited relatively lower F1 scores, particularly for the services (0.22) and document translation (0.46) tasks. However, the performance of the LLM improved when provided with examples, with the F1 score for document translation increasing from 0.46 to 0.68.

Regarding runtime, the k-NN, SVM, and ANN models demonstrated significantly faster inference times compared to the LLM. The k-NN and SVM models with both Amazon Titan and Cohere’s multilingual embeddings model had runtimes of approximately 0.3–0.35 seconds. The ANN models were even faster, with runtimes of approximately 0.15 seconds. In contrast, the LLM required approximately 1.2 seconds for inference, and the LLM with examples took around 18 seconds.

These results suggest that the SVM and ANN models using Cohere’s multilingual embeddings model offer the best balance of performance and efficiency for the given tasks. The superior F1 scores of these models, coupled with their faster runtimes, make them promising candidates for application. The potential benefits of providing examples to the LLM model are also noteworthy, because this approach can help improve its performance on specific tasks.

Conclusion

The optimization of AIDA, Applus IDIADA’s intelligent chatbot powered by Amazon Bedrock, has been a resounding success. By developing dedicated pipelines to handle different types of user interactions—from general conversations to specialized service requests and document translations—AIDA has significantly improved its efficiency, accuracy, and overall user experience. The innovative use of LLMs, embeddings, and advanced classification algorithms has allowed AIDA to adapt to the evolving needs of IDIADA’s workforce, providing a versatile and reliable virtual assistant. AIDA now handles over 1,000 interactions per day, with a 95% accuracy rate in routing requests to the appropriate pipeline and driving a 20% increase in their team’s productivity.

Looking ahead, IDIADA plans to offer AIDA as an integrated product for customer environments, further expanding the reach and impact of this transformative technology.

Amazon Bedrock offers a comprehensive approach to security, compliance, and responsible AI development that empowers IDIADA and other customers to harness the full potential of generative AI without compromising on safety and trust. As this advanced technology continues to rapidly evolve, Amazon Bedrock provides the transparent framework needed to build innovative applications that inspire confidence.

Unlock new growth opportunities by creating custom, secure AI models tailored to your organization’s unique needs. Take the first step in your generative AI transformation—connect with an AWS expert today to begin your journey.


About the Authors

Xavier Vizcaino is the head of the DataLab, in the Digital Solutions department of Applus IDIADA. DataLab is the unit focused on the development of solutions for generating value from the exploitation of data through artificial intelligence.

Diego Martín Montoro is an AI Expert and Machine Learning Engineer at Applus+ Idiada Datalab. With a Computer Science degree and a Master’s in Data Science, Diego has built his career in the field of artificial intelligence and machine learning. His experience includes roles as a Machine Learning Engineer at companies like AppliedIT and Applus+ IDIADA, where he has worked on developing advanced AI systems and anomaly detection solutions.

Jordi Sánchez Ferrer is the current Product Owner of the Datalab at Applus+ Idiada. A Computer Engineer with a Master’s degree in Data Science, Jordi’s trajectory includes roles as a Business Intelligence developer, Machine Learning engineer, and lead developer in Datalab. In his current role, Jordi combines his technical expertise with product management skills, leading strategic initiatives that align data science and AI projects with business objectives at Applus+ Idiada.

Daniel Colls is a professional with more than 25 years of experience who has lived through the digital transformation and the transition from the on-premises model to the cloud from different perspectives in the IT sector. For the past 3 years, as a Solutions Architect at AWS, he has made this experience available to his customers, helping them successfully implement or move their workloads to the cloud.

Read More

Accelerate IaC troubleshooting with Amazon Bedrock Agents

Troubleshooting infrastructure as code (IaC) errors often consumes valuable time and resources. Developers can spend multiple cycles searching for solutions across forums, troubleshooting repetitive issues, or trying to identify the root cause. These delays can lead to missed security errors or compliance violations, especially in complex, multi-account environments.

This post demonstrates how you can use Amazon Bedrock Agents to create an intelligent solution to streamline the resolution of Terraform and AWS CloudFormation code issues through context-aware troubleshooting. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents is a fully managed service that helps developers create AI agents that can break down complex tasks into steps and execute them using FMs and APIs to accomplish specific business objectives.

Our solution uses Amazon Bedrock Agents to analyze error messages and code context, generating detailed troubleshooting steps for IaC errors. In organizations with multi-account AWS environments, teams often maintain a centralized AWS environment for developers to deploy applications. This setup makes sure that AWS infrastructure deployments using IaC align with organizational security and compliance measures. For specific IaC errors related to these compliance measures, such as those involving service control policies (SCPs) or resource-based policies, our solution intelligently directs developers to contact appropriate teams like Security or Enablement. This targeted guidance maintains security protocols and makes sure that sensitive issues are handled by the right experts. The solution is flexible and can be adapted for similar use cases beyond these examples.

Although we focus on Terraform Cloud workspaces in this example, the same principles apply to GitLab CI/CD pipelines or other continuous integration and delivery (CI/CD) approaches executing IaC code. By automating initial error analysis and providing targeted solutions or guidance, you can improve operational efficiency and focus on solving complex infrastructure challenges within your organization’s compliance framework.

Solution overview

Before we dive into the deployment process, let’s walk through the key steps of the architecture as illustrated in the following figure.

[Figure: Architecture of the Terraform troubleshooting solution]

The workflow for the Terraform solution is as follows:

  1. Initial input through the Amazon Bedrock Agents chat console – The user begins by entering details about their Terraform error into the chat console for Amazon Bedrock Agents. This typically includes the Terraform Cloud workspace URL where the error occurred, and optionally, a Git repository URL and branch name if additional context is needed.
  2. Error retrieval and context gathering – The Amazon Bedrock agent forwards these details to an action group that invokes the first AWS Lambda function (see the following Lambda function code). This function invokes another Lambda function (see the following Lambda function code) which retrieves the latest error message from the specified Terraform Cloud workspace. If a Git repository URL is provided, it also retrieves relevant Terraform files from the repository. This contextual information is then sent back to the first Lambda function.
  3. Error analysis and response generation – The first Lambda function then constructs a detailed prompt that includes the error message, repository files (if available), and specific use case instructions. It then uses the Amazon Bedrock model to analyze the error and generate either troubleshooting steps or guidance to contact specific teams.
  4. Interaction and user guidance – The agent displays the generated response to the user. For most Terraform errors, this includes detailed troubleshooting steps. For specific cases related to organizational policies (for example, service control policies or resource-based policies), the response directs the user to contact the appropriate team, such as Security or Enablement.
  5. Continuous improvement – The solution can be continually updated with new specific use cases and organizational guidelines, making sure that the troubleshooting advice stays current with the organization’s evolving infrastructure and compliance requirements. For example:
    1. SCP or IAM policy violations – Guides developers when they encounter permission issues due to SCPs or strict AWS Identity and Access Management (IAM) boundaries, offering alternatives or escalation paths.
    2. VPC and networking restrictions – Flags non-compliant virtual private cloud (VPC) or subnet configurations (such as public subnets) and suggests security-compliant adjustments.
    3. Encryption requirements – Detects missing or incorrect encryption for Amazon Simple Storage Service (Amazon S3) or Amazon Elastic Block Store (Amazon EBS) resources and recommends the appropriate configurations to align with compliance standards.

The following diagram illustrates the step-by-step process of how the solution works.

[Figure: Step-by-step flow of the solution]

This solution streamlines the process of resolving Terraform errors, providing immediate, context-aware guidance to developers while making sure that sensitive or complex issues are directed to the appropriate teams. By using the capabilities of Amazon Bedrock Agents, it offers a scalable and intelligent approach to managing IaC challenges in large, multi-account AWS environments.

Prerequisites

To implement the solution, you need the following:

Create the Amazon Bedrock agent

To create and configure the Amazon Bedrock agent, complete the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose Create agent.
  3. Provide agent details, including agent name and description (optional).
  4. Grant the agent permissions to AWS services through the IAM service role. This gives your agent access to required services, such as Lambda.
  5. Select an FM from Amazon Bedrock (such as Anthropic’s Claude 3 Sonnet).
  6. For troubleshooting Terraform errors through Amazon Bedrock Agents, attach the following instruction to the agent. This instruction makes sure that the agent gathers the required input from the user and executes the action group to provide detailed troubleshooting steps.

“You are a terraform code error specialist. Greet the user and ask for terraform workspace url, branch name, code repository url. Once received, trigger troubleshooting action group. Provide the troubleshooting steps to the user.”
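If you prefer to script the agent creation rather than use the console, a minimal sketch with the boto3 bedrock-agent client could look like the following; the agent name and IAM role ARN are placeholders, and the console steps above remain the reference procedure:

import boto3

bedrock_agent = boto3.client('bedrock-agent')

response = bedrock_agent.create_agent(
    agentName='terraform-troubleshooter',  # placeholder name
    agentResourceRoleArn='arn:aws:iam::123456789012:role/BedrockAgentServiceRole',  # placeholder ARN
    foundationModel='anthropic.claude-3-sonnet-20240229-v1:0',
    instruction=(
        "You are a terraform code error specialist. Greet the user and ask for "
        "terraform workspace url, branch name, code repository url. Once received, "
        "trigger troubleshooting action group. Provide the troubleshooting steps to the user."
    ),
)
agent_id = response['agent']['agentId']

# After the action group is added (see the following sections), prepare the agent so the changes take effect
bedrock_agent.prepare_agent(agentId=agent_id)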

Configure the Lambda function for the action group

After you configure the initial agent and add the preceding instruction to the agent, you need to create two Lambda functions:

  • The first Lambda function will be added to the action group, which is invoked by the Amazon Bedrock agent, and will subsequently trigger the second Lambda function using the invoke method. Refer to the Lambda function code for more details. Make sure the LAMBDA_2_FUNCTION_NAME environment variable is set.
  • The second Lambda function will handle fetching the Terraform workspace error and the associated Terraform code from GitLab. Refer to the Lambda function code. Make sure that the TERRAFORM_API_URL, TERRAFORM_SECRET_NAME, and VCS_SECRET_NAME environment variables are set.

After the Terraform workspace error and code details are retrieved, these details will be passed back to the first Lambda function, which will use the Amazon Bedrock API with an FM to generate and provide the appropriate troubleshooting steps based on the error and code information.
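The linked Lambda function code contains the full implementation; the following is only a minimal sketch of how the first function might chain these calls, assuming the LAMBDA_2_FUNCTION_NAME environment variable described above. The agent event parsing, the error_message and repo_files field names, and the return shape are simplified assumptions, not the documented action group format.

import json
import os

import boto3

lambda_client = boto3.client('lambda')
bedrock_runtime = boto3.client('bedrock-runtime')

def lambda_handler(event, context):
    # Collect the parameters the agent elicited from the user (simplified parsing)
    params = {p['name']: p['value'] for p in event.get('parameters', [])}

    # Ask the second Lambda function for the latest Terraform error and repository files
    raw = lambda_client.invoke(
        FunctionName=os.environ['LAMBDA_2_FUNCTION_NAME'],
        InvocationType='RequestResponse',
        Payload=json.dumps(params).encode('utf-8'),
    )
    context_payload = json.loads(raw['Payload'].read())

    # Single-shot prompt: pass the error message and repository context in one request
    prompt = (
        "You are a Terraform troubleshooting assistant.\n"
        f"Error message:\n{context_payload.get('error_message', '')}\n\n"
        f"Relevant Terraform files:\n{context_payload.get('repo_files', '')}\n\n"
        "Provide detailed troubleshooting steps, or direct the user to the Security "
        "or Enablement team for SCP or resource-based policy issues."
    )
    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        }),
    )
    answer = json.loads(response['body'].read())['content'][0]['text']

    # Return the generated guidance to the agent (full response wrapper omitted for brevity)
    return {"troubleshooting_steps": answer}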

Add the action group to the Amazon Bedrock agent

Complete the following steps to add the action group to the Amazon Bedrock agent:

  1. Add an action group to the Amazon Bedrock agent.
  2. Assign a descriptive name (for example, troubleshooting) to the action group and provide a description. This helps clarify the purpose of the action group within the workflow.
  3. For Action group type, select Define with function details.

For more details, see Define function details for your agent’s action groups in Amazon Bedrock.

  4. For Action group invocation, choose the first Lambda function that you created previously.

This function runs the business logic required when an action is invoked. Make sure to choose the correct version of the first Lambda function. For more details on how to configure Lambda functions for action groups, see Configure Lambda functions to send information that an Amazon Bedrock agent elicits from the user.

  5. For Action group function 1, provide a name and description.
  6. Add the following parameters.

Name | Description | Type | Required
workspace_url | Terraform workspace URL | string | True
repo_url | Code repository URL | string | True
branch_name | Code repository branch name | string | True

Test the solution

The following example is of a Terraform error due to a service control policy. The troubleshooting steps provided would be aligned to address those specific constraints. The action group triggers the Lambda function, which follows structured single-shot prompting by passing the complete context—such as the error message and repository contents—in a single input to the Amazon Bedrock model to generate precise troubleshooting steps.

Example 1: The following screenshot shows an example of a Terraform error caused by an SCP limitation managed by the security team.

[Screenshot: Terraform error caused by an SCP limitation]

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

[Screenshot: Agent conversation with SCP-related troubleshooting steps]

Example 2: The following screenshot shows an example of a Terraform error due to a missing variable value.

[Screenshot: Terraform error due to a missing variable value]

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

[Screenshot: Agent conversation with troubleshooting steps for the missing variable]

Clean up

The services used in this demo can incur costs. Complete the following steps to clean up your resources:

  1. Delete the Lambda functions if they are no longer required.
  2. Delete the action group and Amazon Bedrock agent you created.

Conclusion

IaC offers flexibility for managing cloud environments, but troubleshooting code errors can be time-consuming, especially in environments with strict organizational guardrails. This post demonstrated how Amazon Bedrock Agents, combined with action groups and generative AI models, streamlines and accelerates the resolution of Terraform errors while maintaining compliance with environment security and operational guidelines.

Using the capabilities of Amazon Bedrock Agents, developers can receive context-aware troubleshooting steps tailored to environment-related issues such as SCP or IAM violations, VPC restrictions, and encryption policies. The solution provides specific guidance based on the error’s context and directs users to the appropriate teams for issues that require further escalation. This reduces the time spent on IaC errors, improves developer productivity, and maintains organizational compliance.

Are you ready to streamline your cloud deployment process with the generative AI of Amazon Bedrock? Start by exploring the Amazon Bedrock User Guide to see how it can facilitate your organization’s transition to the cloud. For specialized assistance, consider engaging with AWS Professional Services to maximize the efficiency and benefits of using Amazon Bedrock.


About the Authors

Akhil Raj Yallamelli is a Cloud Infrastructure Architect at AWS, specializing in architecting cloud infrastructure solutions for enhanced data security and cost efficiency. He is experienced in integrating technical solutions with business strategies to create scalable, reliable, and secure cloud environments. Akhil enjoys developing solutions focusing on customer business outcomes, incorporating generative AI (Gen AI) technologies to drive innovation and cloud enablement. He holds an MS degree in Computer Science. Outside of his professional work, Akhil enjoys watching and playing sports.

Ebbey Thomas is a Senior Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Read More

Derive generative AI powered insights from Alation Cloud Services using Amazon Q Business Custom Connector

This blog post is co-written with Gene Arnold from Alation.

To build a generative AI-based conversational application integrated with relevant data sources, an enterprise needs to invest time, money, and people. First, you would need to build connectors to the data sources. Next, you need to index this data to make it available for a Retrieval Augmented Generation (RAG) approach, where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that can index the content for semantic and vector search, build the infrastructure to retrieve data, rank the answers, and build a feature-rich web application. Additionally, you might need to hire and staff a large team to build, maintain, and manage such a system.

Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems. To do this, Amazon Q Business provides out-of-the-box native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well-written answers. A data source connector is a component of Amazon Q Business that helps to integrate and synchronize data from multiple repositories into one index. Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including ServiceNow, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more. For a full list of supported data source connectors, see Amazon Q Business connectors.

However, many organizations store relevant information in the form of unstructured data on company intranets or within file systems on corporate networks that are inaccessible to Amazon Q Business using its native data source connectors. You can now use the custom data source connector within Amazon Q Business to upload content to your index from a wider range of data sources.

Using an Amazon Q Business custom data source connector, you can gain insights into your organization’s third-party applications with the integration of generative AI and natural language processing. This post shows how to configure an Amazon Q Business custom connector and derive insights by creating a generative AI-powered conversation experience on AWS using Amazon Q Business, while using access control lists (ACLs) to restrict access to documents based on user permissions.

Alation is a data intelligence company serving more than 600 global enterprises, including 40% of the Fortune 100. Customers rely on Alation to realize the value of their data and AI initiatives. Headquartered in Redwood City, California, Alation is an AWS Specialization Partner and AWS Marketplace Seller with Data and Analytics Competency. Organizations trust Alation’s platform for self-service analytics, cloud transformation, data governance, and AI-ready data, fostering innovation at scale. In this post, we will showcase a sample of how Alation’s business policies can be integrated with an Amazon Q Business application using a custom data source connector.

Finding accurate answers from content in custom data sources using Amazon Q Business

After you integrate Amazon Q Business with data sources such as Alation, users can ask questions based on the descriptions and content of the indexed documents. For example:

  1. What are the top sections of the HR benefits policies?
  2. Who are the data stewards for my proprietary database sources?

Overview of a custom connector

A data source connector is a mechanism for integrating and synchronizing data from multiple repositories into one index. Amazon Q Business offers multiple prebuilt data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. However, if you have valuable data residing in repositories for which those prebuilt connectors aren’t available, you can use a custom connector.

When you connect Amazon Q Business to a data source and initiate the data synchronization process, Amazon Q Business crawls and adds documents from the data source to its index.

You would typically use an Amazon Q Business custom connector when you have a repository that Amazon Q Business doesn’t yet provide a data source connector for. With a custom connector, Amazon Q Business only provides metric information that you can use to monitor your data source sync jobs; you must create and run the crawler that determines which documents from your data source are indexed. A simple architectural representation of the steps involved is shown in the following figure.

Architecture Diagram

Solution overview

The solution shown here, which integrates Alation’s business policies, is for demonstration purposes only. We recommend running similar scripts only on your own data sources after consulting with the team that manages them, or be sure to follow the terms of service for the sources you’re trying to fetch data from. The steps involved for other custom data sources are very similar, except for the part where we connect to Alation and fetch data from it. To crawl and index contents in Alation, you configure an Amazon Q Business custom connector as a data source in your Amazon Q Business application.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Configure your Alation connection

In your Alation cloud account, create an OAuth2 client application that can be consumed from an Amazon Q Business application.

  1. In Alation, sign in as a user with administrator privileges, navigate to the settings page, and choose Authentication (https://[[your-domain]].alationcloud.com/admin/auth/).

Alation Admin Settings

  1. In the OAuth Client Applications section, choose Add.

Alation OAuth Client Applications

  1. Enter an easily identifiable application name, and choose Save.

Create OAuth Client Application

  1. Take note of the OAuth client application data—the Client ID and the Client Secret—created and choose Close.

OAuth Client ID

  1. As a security best practice, storing the client application data in Secrets Manager is recommended. On the AWS Management Console, navigate to AWS Secrets Manager and add a new secret. Key in the Client_Id and Client_Secret values copied from the previous step. (If you prefer to script this step, a boto3 sketch follows this procedure.)

AWS Secrets Manager - Create Secret1

  1. Provide a name and description for the secret and choose Next.

AWS Secrets Manager - Create Secret2

  1. Leave the defaults and choose Next.

AWS Secrets Manager - Create Secret3

  1. On the last page, choose Store.

AWS Secrets Manager - Create Secret4
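
If you prefer to script the Secrets Manager step rather than use the console, a minimal boto3 sketch might look like the following. The secret name alation_test matches the one read back later in this post; the credential values are placeholders.

import json

import boto3

secrets_manager_client = boto3.client("secretsmanager")

# Store the Alation OAuth client application credentials under the same
# secret name ("alation_test") that the synchronization notebook reads later.
secrets_manager_client.create_secret(
    Name="alation_test",
    Description="Alation OAuth client application credentials",
    SecretString=json.dumps(
        {
            "Client_Id": "<your-alation-client-id>",          # placeholder
            "Client_Secret": "<your-alation-client-secret>",  # placeholder
        }
    ),
)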

Create sample Alation policies

In our example, you would create three different sets of Alation policies for a fictional organization named Unicorn Rentals. Grouped as Workplace, HR, and Regulatory, each policy contains a rough two-page summary of crucial organizational items of interest. You can find details on how to create policies in the Alation documentation.

Alation Create Policies

On the Amazon Q Business side, let’s assume that we want to enforce the following access policies. Users and their access are set up via code illustrated in later sections.

# First name Last name Policies authorized for access
1 Alejandro Rosalez Workplace, HR, and Regulatory
2 Sofia Martinez Workplace and HR
3 Diego Ramirez Workplace and Regulatory

Create an Amazon Q Business application

  1. Sign in to the AWS Management Console and navigate to Amazon Q Business from the search bar at the top.
  1. On the Amazon Q Business console, choose Get Started.

Amazon Q Business Console

  1. On the Applications page, choose Create application.

Amazon Q Business List Applications

  1. In the first step of the Create application wizard, keep the default values. Additionally, you need to choose a list of users who require access to the Amazon Q Business application by including them through the IAM Identity Center settings.

Q Business Create Application1

  1. On the access management settings page, create and add users via AWS IAM Identity Center.

Q Business Create Application2

  1. Once all users are added, choose Create.

Q Business Create Application3

  1. After the application is created, take note of the Application ID value from the landing page.

Q Business Create Application4

  1. Next, choose an index type for the Amazon Q Business application. Choose the native retriever option.

Q Business Create Index1

Q Business Create Index2

  1. After the index is created, verify that the status has changed to Active. You can then take note of the Index ID.

Q Business Create Index3

  1. The next step is to add the custom data source.

Q Business Add Data Source1

  1. Search for Custom data source and choose the plus sign next to it.

Q Business Add Data Source2

  1. Provide a name and description for the custom data source.

Q Business Add Data Source2

  1. Once done, choose Add data source.

Q Business Add Data Source4

  1. After the data source is added and its status is Active, take note of the Data source ID.

Q Business Add Data Source4

Load policy data from Alation to Amazon Q Business using the custom connector

Now let’s load the Alation data into Amazon Q Business using the correct access permissions. The code examples that follow are also available on the accompanying GitHub code repository.

  1. With the connector ready, move over to the SageMaker Studio notebook and perform data synchronization operations by invoking Amazon Q Business APIs.
  2. To start, retrieve the Alation OAuth client application credentials stored in Secrets Manager.
    import json
    
    import boto3
    from botocore.exceptions import ClientError
    
    secrets_manager_client = boto3.client('secretsmanager')
    secret_name = "alation_test"
    
    try:
        get_secret_value_response = secrets_manager_client.get_secret_value(
            SecretId=secret_name
        )
        # Parse the JSON secret string rather than calling eval() on it
        secret = json.loads(get_secret_value_response['SecretString'])
    except ClientError as e:
        raise e

  1. Next, initiate the connection using the OAuth client application credentials from Alation.
    import requests
    from requests.auth import HTTPBasicAuth
    
    base_url = "https://[[your-domain]].alationcloud.com"
    token_url = "/oauth/v2/token/"
    introspect_url = "/oauth/v2/introspect/"
    jwks_url = "/oauth/v2/.well-known/jwks.json/"
    
    # Exchange the OAuth client credentials for an access token
    api_url = base_url + token_url
    data = {
            "grant_type": "client_credentials",
           }
    client_id = secret['Client_Id']
    client_secret = secret['Client_Secret']
    
    auth = HTTPBasicAuth(username=client_id, password=client_secret)
    response = requests.post(url=api_url, data=data, auth=auth)
    print(response.json())
    
    # Introspect the token to confirm it is valid before using it
    access_token = response.json().get('access_token','')
    api_url = base_url + introspect_url + "?verify_token=true"
    data = {
            "token": access_token,
           }
    response = requests.post(url=api_url, data=data, auth=auth)
    

  1. You then configure policy-type-level user access. This section can be customized based on how user access information is stored in your data sources. Here, we assume preset access based on the users’ email IDs.
    primary_principal_list = []
    workplace_policy_principals = []
    hr_policy_principals = []
    regulatory_policy_principals = []
    
    principal_user_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com', 'diego_ramirez@example.com']
    
    workplace_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com', 'diego_ramirez@example.com']
    hr_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com']
    regulatory_policy_email_ids = ['alejandro_rosalez@example.com', 'diego_ramirez@example.com']
    
    for workplace_policy_member in workplace_policy_email_ids:
        workplace_policy_members_dict = { 'user': { 'id': workplace_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
        workplace_policy_principals.append(workplace_policy_members_dict)
        if workplace_policy_member not in primary_principal_list:
            primary_principal_list.append(workplace_policy_member)
    
    for hr_policy_member in hr_policy_email_ids:
        hr_policy_members_dict = { 'user': { 'id': hr_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
        hr_policy_principals.append(hr_policy_members_dict)
        if hr_policy_member not in primary_principal_list:
            primary_principal_list.append(hr_policy_member)
            
    for regulatory_policy_member in regulatory_policy_email_ids:
        regulatory_policy_members_dict = { 'user': { 'id': regulatory_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE' }}
        regulatory_policy_principals.append(regulatory_policy_members_dict)
        if regulatory_policy_member not in primary_principal_list:
            primary_principal_list.append(regulatory_policy_member)

  1. You then pull individual policy details from Alation. This step can be repeated for all three policy types: Workplace, HR, and Regulatory.
    import json
    
    import requests
    
    # Repeat for each policy type by substituting Workplace, HR, or Regulatory in the search parameter
    url = "https://[[your-domain]].alationcloud.com/integration/v1/business_policies/?limit=200&skip=0&search=[[Workplace/HR/Regulatory]]&deleted=false"
    
    headers = {
        "accept": "application/json",
        "TOKEN": access_token
    }
    
    response = requests.get(url, headers=headers)
    policy_data = ""
    
    # cleanhtml is a small helper that strips HTML markup from Alation's rich-text fields
    for policy in json.loads(response.text):
        if policy["title"] is not None:
            policy_title = cleanhtml(policy["title"])
        else:
            policy_title = "None"
        if policy["description"] is not None:
            policy_description = cleanhtml(policy["description"])
        else:
            policy_description = "None"
        temp_data = policy_title + ":\n" + policy_description + "\n\n"
        policy_data += temp_data
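
The cleanhtml helper used in the preceding snippet is not shown in this post. A minimal sketch, assuming it only needs to strip HTML markup and collapse whitespace from Alation’s rich-text fields, could look like this:

import re

def cleanhtml(raw_html: str) -> str:
    # Remove HTML tags and collapse runs of whitespace into single spaces.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()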
    

  1. The next step is to define the Amazon Q Business application, index, and data source information that you created in the previous steps.
    qbusiness_client = boto3.client('qbusiness')
    application_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    index_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    data_source_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

  1. Now you explicitly create the users in Amazon Q Business. Individual user access to different policy type data sets is configured later.
    for principal in primary_principal_list:
        create_user_response = qbusiness_client.create_user(
            applicationId=application_id,
            userId=principal,
            userAliases=[
                {
                    'indexId': index_id,
                    'dataSourceId': data_source_id,
                    'userId': principal
                },
            ],
        )
    
    for principal in primary_principal_list:
        get_user_response = qbusiness_client.get_user(
            applicationId=application_id,
            userId=principal
        )
        for user_alias in get_user_response['userAliases']:
            if "dataSourceId" in user_alias:
                print(user_alias['userId'])

  1. For each policy type data set (Workplace, HR, and Regulatory), we execute the following three steps.
    1. Start an Amazon Q Business data source sync job.
      start_data_source_sync_job_response = qbusiness_client.start_data_source_sync_job(
          dataSourceId = data_source_id,
          indexId = index_id,
          applicationId = application_id
      )
      job_execution_id = start_data_source_sync_job_response['executionId']

    1. Encode and batch upload data with user access mapping.
      import hashlib
      
      # Derive a stable document ID from the policy content; repeat for each policy type
      policy_document_id = hashlib.shake_256(policy_data.encode('utf-8')).hexdigest(128)
      docs = [ {
          "id": policy_document_id,
          "content" : {
              'blob': policy_data.encode('utf-8')
          },
          "contentType": "PLAIN_TEXT",
          "title": "Unicorn Rentals – Workplace/HR/Regulatory Policy",
          "accessConfiguration" : { 'accessControls': [ { 'principals': [[xx]]_policy_principals } ] }
      } ]
      
      batch_put_document_response = qbusiness_client.batch_put_document(
          applicationId = application_id,
          indexId = index_id,
          dataSourceSyncId = job_execution_id,
          documents = docs,
      )

    1. Stop the data source sync job and wait for the data set to be indexed.
      import time
      
      # Stop the sync job, then poll until the document reaches a terminal status
      stop_data_source_sync_job_response = qbusiness_client.stop_data_source_sync_job(
          dataSourceId = data_source_id,
          indexId = index_id,
          applicationId = application_id
      )
      
      max_time = time.time() + 1*60*60
      found = False
      try:
          while time.time() < max_time and not found:
              list_documents_response = qbusiness_client.list_documents(
                  applicationId=application_id,
                  indexId=index_id
              )
              if list_documents_response:
                  for document in list_documents_response["documentDetailList"]:
                      if document["documentId"] == policy_document_id:
                          status = document["status"]
                          print(status)
                          if status in ("INDEXED", "FAILED", "DOCUMENT_FAILED_TO_INDEX", "UPDATED"):
                              found = True
              if not found:
                  time.sleep(10)
      except Exception:
          print("Exception when calling API")

  1. Go back to the Amazon Q Business console and see if the data uploads were successful.

Q Business Sync Status1

  1. Find and open the custom data source from the list of data sources.

Q Business Sync Status2

  1. Confirm that the ingested documents appear on the Sync history tab with a Completed status.

Q Business Sync Status3

  1. Also ensure the Last sync status for the custom data source connector is Completed.

Q Business Sync Status5

Run queries with the Amazon Q Business web experience

Now that the data synchronization is complete, you can start exploring insights from Amazon Q Business. With the newly created Amazon Q Business application, select the Web Application settings tab and navigate to the auto-created URL. This will open a new tab with a preview of the user interface and options that you can customize to fit your use case.

Q Business Web Experience

  1. Sign in as user Alejandro Rosales. As you might recall, Alejandro has access to all three policy type data sets (Workplace, HR, and Regulatory).
    1. Start by asking a question about HR policy, such as “Per the HR Payroll Policy of Unicorn Rentals, what are some additional voluntary deductions taken from employee paychecks?” Note how Amazon Q Business provides an answer and also shows where it pulled the answer from.

    Q Business Web Experience

    1. Next, ask a question about a Regulatory policy: “Per the PCI DSS compliance policy of Unicorn Rentals, how is the third-party service provider access to cardholder information protected?” The result includes the summarized answer on PCI DSS compliance and also shows sources where it gathered the data from.

    Q Business Web Experience

    1. Lastly, see how Amazon Q Business responds when asked a question about generic workplace policy: “What does Unicorn Rentals do to protect information of children under the age of 13?” In this case, the application returns the answer and marks it as a Workplace policy question.

    Q Business Web Experience

  1. Let’s next sign in as Sofia Martinez. Sofia has access to HR and Workplace policy types, but not to Regulatory policies.
    1. Start by asking a question about HR policy: “Per the HR Payroll Policy of Unicorn Rentals, list the additional voluntary deductions taken from employee paychecks.” Note how Amazon Q Business lists the deductions and cites the policy the answer is gathered from.

    Q Business Web Experience

    1. Next, ask a Regulatory policy question: “What are the record keeping requirements mentioned in the ECOA compliance policy of Unicorn Rentals?” Note how Amazon Q Business contextually answers the question, mentioning that Sofia does not have access to that data.

    Q Business Web Experience

  1. Finally, sign in as Diego Ramirez. Diego has access to Workplace and Regulatory policies but not to HR policies.
    1. Start by asking the same Regulatory policy question as before: “Per the PCI DSS compliance policy of Unicorn Rentals, how is third-party service provider access to cardholder information protected?” Because Diego has access to Regulatory policy data, the expected answer is generated.

    Q Business Web Experience

    1. Next, Diego asks a question about an HR policy: “Per the HR Compensation Policy of Unicorn Rentals, how is job pricing determined?” Note how Amazon Q Business contextually answers the question, mentioning that Diego does not have access to that data.

    Q Business Web Experience

Troubleshooting

If you’re unable to get answers to any of your questions and get the message “Sorry, I could not find relevant information to complete your request,” check to see if any of the following issues apply:

  • No permissions: The ACLs applied to your account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to ensure your ACLs are configured to access the data sources.
  • EmailID not matching UserID: In rare scenarios, a user might have a different email ID associated with the Amazon Q Business IAM Identity Center connection than is associated in the data source’s user profile. Make sure that the Amazon Q Business user profile is updated to recognize the email ID, using the update-user CLI command or the related API call (see the sketch after this section).
  • Data connector sync failed: The data connector fails to synchronize information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to help ensure that the synchronization is successful.
  • Empty or private data sources: Private or empty projects will not be crawled during the synchronization run.

If none of the above apply, open a support case to get the issue resolved.
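
For the email ID mismatch scenario above, a minimal boto3 sketch of updating a user alias might look like the following; the application, index, and data source IDs and the email addresses are placeholders.

import boto3

qbusiness_client = boto3.client("qbusiness")

# Placeholder IDs -- replace with your application, index, and data source IDs.
application_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
index_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
data_source_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Point the user's data source alias at the email ID used in the source system.
qbusiness_client.update_user(
    applicationId=application_id,
    userId="alejandro_rosalez@example.com",  # ID from IAM Identity Center
    userAliasesToUpdate=[
        {
            "indexId": index_id,
            "dataSourceId": data_source_id,
            "userId": "alejandro.rosalez@example.com",  # email ID in the data source
        }
    ],
)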

Clean up

To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Amazon Q Business custom connector data source and client application created in Alation and the Amazon Q Business application. Next, delete the Secrets Manager secret with Alation OAuth client application credential data. Also, delete the user management setup in IAM Identity Center and the SageMaker Studio domain.

Conclusion

In this post, we discussed how to configure the Amazon Q Business custom connector to crawl and index content from Alation as a sample. We showed how you can use Amazon Q Business generative AI-based search to help your business leaders and agents discover insights from your enterprise data.

To learn more about the Amazon Q Business custom connector, see the Amazon Q Business developer guide. Alation Data Catalog is available for purchase through AWS Marketplace; speak to your Alation account representative for custom purchase options. For any additional information, contact your Alation business partner.

AWS Partner Network Alation

Alation – AWS Partner Spotlight

Alation is an AWS Specialization Partner that has pioneered the modern data catalog and is making the leap into a full-service source for data intelligence. Alation is passionate about helping enterprises create thriving data cultures where anyone can find, understand, and trust data.

Contact Alation | Partner Overview | AWS Marketplace


About the Authors

Gene Arnold is a Product Architect with Alation’s Forward Deployed Engineering team. A curious learner with over 25 years of experience, Gene focuses on how to sharpen selling skills and constantly explores new product lines.

Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds eight AWS and seven other professional certifications. With over 21 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master’s of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.

Prateek Jain is a Sr. Solutions Architect with AWS, based out of Atlanta, Georgia. He is passionate about generative AI and helping customers build amazing solutions on AWS. In his free time, he enjoys spending time with family and playing tennis.

Read More

Mistral-Small-24B-Instruct-2501 is now available on SageMaker Jumpstart and Amazon Bedrock Marketplace

Today, we’re excited to announce that Mistral-Small-24B-Instruct-2501—a twenty-four billion parameter large language model (LLM) from Mistral AI that’s optimized for low latency text generation tasks—is available for customers through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that developers can use to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use Mistral-Small-24B-Instruct-2501.

Overview of Mistral Small 3 (2501)

Mistral Small 3 (2501), a latency-optimized 24B-parameter model released under Apache 2.0, maintains a balance between performance and computational efficiency. Mistral offers both the pretrained (Mistral-Small-24B-Base-2501) and instruction-tuned (Mistral-Small-24B-Instruct-2501) checkpoints of the model under Apache 2.0. Mistral Small 3 (2501) features a 32k-token context window. According to Mistral, the model demonstrates strong performance in code, math, general knowledge, and instruction following compared to its peers. Mistral Small 3 (2501) is designed for the 80% of generative AI tasks that require robust language and instruction following performance with very low latency. The instruction-tuning process is focused on improving the model’s ability to follow complex directions, maintain coherent conversations, and generate accurate, context-aware responses. The 2501 version follows previous iterations (Mistral-Small-2409 and Mistral-Small-2402) released in 2024, incorporating improvements in instruction following and reliability. Currently, the instruct version of this model, Mistral-Small-24B-Instruct-2501, is available for customers to deploy and use on SageMaker JumpStart and Amazon Bedrock Marketplace.

Optimized for conversational assistance

Mistral Small 3 (2501) excels in scenarios where quick, accurate responses are critical, such as virtual assistants where users expect immediate feedback and near real-time interactions. Mistral Small 3 (2501) can handle rapid function execution when used as part of automated or agentic workflows. The architecture is designed to typically respond in less than 100 milliseconds, according to Mistral, making it ideal for customer service automation, interactive assistance, live chat, and content moderation.

Performance metrics and benchmarks

According to Mistral, the instruction-tuned version of the model achieves over 81% accuracy on Massive Multitask Language Understanding (MMLU) with 150 tokens per second latency, making it currently the most efficient model in its category. In third-party evaluations conducted by Mistral, the model demonstrates competitive performance against larger models such as Llama 3.3 70B and Qwen 32B. Notably, Mistral claims that the model performs at the same level as Llama 3.3 70B instruct and is more than three times faster on the same hardware.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

You can now discover and deploy Mistral models in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and under your VPC controls, helping to support data security for enterprise security needs.

Prerequisites

To try Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart, you need the following prerequisites:

Amazon Bedrock Marketplace overview

To get started, in the AWS Management Console for Amazon Bedrock, select Model catalog in the Foundation models section of the navigation pane. Here, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

Deploy Mistral-Small-24B-Instruct-2501 in Amazon Bedrock Marketplace

To access Mistral-Small-24B-Instruct-2501 in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, select Model catalog under Foundation models in the navigation pane.

At the time of writing this post, you can use the InvokeModel API to invoke the model. It doesn’t support Converse APIs or other Amazon Bedrock tooling.

  1. Filter for Mistral as a provider and select the Mistral-Small-24B-Instruct-2501 model.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Mistral-Small-24B-Instruct-2501 in your applications.

  1. To begin using Mistral-Small-24B-Instruct-2501, choose Deploy.
  2. You will be prompted to configure the deployment details for Mistral-Small-24B-Instruct-2501. The model ID will be pre-populated.
    1. For Endpoint name, enter an endpoint name (up to 50 alphanumeric characters).
    2. For Number of instances, enter a number between 1 and 100.
    3. For Instance type, select your instance type. For optimal performance with Mistral-Small-24B-Instruct-2501, a GPU-based instance type such as ml.g6.12xlarge is recommended.
    4. Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.
  3. Choose Deploy to begin using the model.

When the deployment is complete, you can test Mistral-Small-24B-Instruct-2501 capabilities directly in the Amazon Bedrock playground.

  1. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters such as temperature and maximum length.

When using Mistral-Small-24B-Instruct-2501 with the Amazon Bedrock InvokeModel API and the Playground console, use the model’s chat template for optimal results. Mistral instruct models typically expect prompts of the form <s>[INST] content for inference [/INST].

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to get the endpoint Amazon Resource Name (ARN).
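
A minimal sketch of such a programmatic call might look like the following. The endpoint ARN is a placeholder, and the request body assumes the same messages-style schema used with the SageMaker predictor later in this post; check the model detail page for the exact format your deployment expects.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder: the ARN of the endpoint created by your Amazon Bedrock Marketplace deployment.
endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/mistral-small-24b-instruct-2501"

# Assumed request schema -- verify against the model detail page for your deployment.
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1000,
    "temperature": 0.1,
}

response = bedrock_runtime.invoke_model(
    modelId=endpoint_arn,
    body=json.dumps(payload),
    contentType="application/json",
    accept="application/json",
)
result = json.loads(response["body"].read())
print(result["choices"][0]["message"]["content"])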

Discover Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart

You can access Mistral-Small-24B-Instruct-2501 through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more information about how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.

  1. In the SageMaker Studio console, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
  2. Select HuggingFace.
  3. From the SageMaker JumpStart landing page, search for Mistral-Small-24B-Instruct-2501 using the search box.
  4. Select a model card to view details about the model such as license, data used to train, and how to use the model. Choose Deploy to deploy the model and create an endpoint.

Deploy Mistral-Small-24B-Instruct-2501 with the SageMaker SDK

Deployment starts when you choose Deploy. After deployment finishes, you will see that an endpoint is created. Test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

  1. To deploy using the SDK, start by selecting the Mistral-Small-24B-Instruct-2501 model, specified by the model_id with the value huggingface-llm-mistral-small-24b-instruct-2501. You can deploy the model on SageMaker using the following code.
    from sagemaker.jumpstart.model import JumpStartModel 
    
    accept_eula = True 
    
    model = JumpStartModel(model_id="huggingface-llm-mistral-small-24b-instruct-2501") 
    predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel, as sketched after this paragraph. The accept_eula value must be explicitly set to True to accept the end-user license agreement (EULA). See AWS service quotas for how to request a service quota increase.
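
For example, the following sketch overrides the default instance type and endpoint name; both values are illustrative.

from sagemaker.jumpstart.model import JumpStartModel

# Override the default instance type and endpoint name (illustrative values).
model = JumpStartModel(
    model_id="huggingface-llm-mistral-small-24b-instruct-2501",
    instance_type="ml.g6.12xlarge",
)
predictor = model.deploy(
    accept_eula=True,
    endpoint_name="mistral-small-24b-instruct-2501",
)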

  1. After the model is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
    prompt = "Hello!"
    payload = {
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "max_tokens": 4000,
        "temperature": 0.1,
        "top_p": 0.9,
    }
        
    response = predictor.predict(payload)
    print(response['choices'][0]['message']['content']) 

Retail math example

Here’s an example of how Mistral-Small-24B-Instruct-2501 can break down a common shopping scenario. In this case, you ask the model to calculate the final price of a shirt after applying multiple discounts—a situation many of us face while shopping. Notice how the model provides a clear, step-by-step solution to follow.

prompt = "A store is having a 20% off sale, and you have an additional 10% off coupon. If you buy a shirt that originally costs $50, how much will you pay?"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.1,
    "top_p": 0.9,
}
    
response = predictor.predict(payload)
print(response['choices'][0]['message']['content']) 

The following is the output:

First, we'll apply the 20% off sale discount to the original price of the shirt.

20% of $50 is calculated as:
0.20 * $50 = $10

So, the price after the 20% discount is:
$50 - $10 = $40

Next, we'll apply the additional 10% off coupon to the new price of $40.

10% of $40 is calculated as:
0.10 * $40 = $4

So, the price after the additional 10% discount is:
$40 - $4 = $36

Therefore, you will pay $36 for the shirt.

The response shows clear step-by-step reasoning without introducing incorrect information or hallucinated facts. Each mathematical step is explicitly shown, making it simple to verify the accuracy of the calculations.

Clean up

To avoid unwanted charges, complete the following steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, select Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Select the endpoint, and on the Actions menu, select Delete.
  4. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  5. Choose Delete to delete the endpoint.
  6. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor

After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mistral-Small-24B-Instruct-2501 in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repo.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services offered by AWS, including model offerings from top tier foundation model providers.

Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Read More

How Rocket Companies modernized their data science solution on AWS

This post was written with Dian Xu and Joel Hawkins of Rocket Companies.

Rocket Companies is a Detroit-based FinTech company with a mission to “Help Everyone Home”. With the current housing shortage and affordability concerns, Rocket simplifies the homeownership process through an intuitive and AI-driven experience. This comprehensive framework streamlines every step of the homeownership journey, empowering consumers to search, purchase, and manage home financing effortlessly. Rocket integrates home search, financing, and servicing in a single environment, providing a seamless and efficient experience.

The Rocket brand is a synonym for offering simple, fast, and trustworthy digital solutions for complex transactions. Rocket is dedicated to helping clients realize their dream of homeownership and financial freedom. Since its inception, Rocket has grown from a single mortgage lender to a network of businesses that creates new opportunities for its clients.

Rocket takes a complicated process and uses technology to make it simpler. Applying for a mortgage can be complex and time-consuming. That’s why we use advanced technology and data analytics to streamline every step of the homeownership experience, from application to closing. By analyzing a wide range of data points, we’re able to quickly and accurately assess the risk associated with a loan, enabling us to make more informed lending decisions and get our clients the financing they need.

Our goal at Rocket is to provide a personalized experience for both our current and prospective clients. Rocket’s diverse product offerings can be customized to meet specific client needs, while our team of skilled bankers must be matched with the best client opportunities that align with their skills and knowledge. Maintaining strong relationships with our large, loyal client base and hedge positions to cover financial obligations is key to our success. With the volume of business we do, even small improvements can have a significant impact.

In this post, we share how we modernized Rocket’s data science solution on AWS to increase the speed to delivery from eight weeks to under one hour, improve operational stability and support by reducing incident tickets by over 99% in 18 months, power 10 million automated data science and AI decisions made daily, and provide a seamless data science development experience.

Rocket’s legacy data science environment challenges

Rocket’s previous data science solution was built around Apache Spark and combined the use of a legacy version of the Hadoop environment and vendor-provided Data Science Experience development tools. The Hadoop environment was hosted on Amazon Elastic Compute Cloud (Amazon EC2) servers, managed in-house by Rocket’s technology team, while the data science experience infrastructure was hosted on premises. Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink.

Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. Apache Hive was used to provide a tabular interface to data stored in HDFS, and to integrate with Apache Spark SQL. Apache HBase was employed to offer real-time key-based access to data. Model training and scoring was performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which was part of the Hadoop implementation.

Despite the benefits of this architecture, Rocket faced challenges that limited its effectiveness:

  • Accessibility limitations: The data lake was stored in HDFS and only accessible from the Hadoop environment, hindering integration with other data sources. This also led to a backlog of data that needed to be ingested.
  • Steep learning curve for data scientists: Many of Rocket’s data scientists did not have experience with Spark, which had a more nuanced programming model compared to other popular ML solutions like scikit-learn. This created a challenge for data scientists to become productive.
  • Responsibility for maintenance and troubleshooting: Rocket’s DevOps/Technology team was responsible for all upgrades, scaling, and troubleshooting of the Hadoop cluster, which was installed on bare EC2 instances. This resulted in a backlog of issues with both vendors that remained unresolved.
  • Balancing development vs. production demands: Rocket had to manage work queues between development and production, which were always competing for the same resources.
  • Deployment challenges: Rocket sought to support more real-time and streaming inferencing use cases, but this was limited by the capabilities of MLeap for real-time models and Spark Streaming for streaming use cases, which were still experimental at that time.
  • Inadequate data security and DevOps support: The previous solution lacked robust security measures, and there was limited support for development and operations of the data science work.

Rocket’s legacy data science architecture is shown in the following diagram.

The diagram depicts the flow; the key components are detailed below:

  1. Data Ingestion: Data is ingested into the system using Attunity data ingestion in Spark SQL.
  2. Data Storage and Processing: All compute is done as Spark jobs inside of a Hadoop cluster using Apache Livy and Spark. Data is stored in HDFS and is accessed via Hive, which provides a tabular interface to the data and integrates with Spark SQL. HBase is employed to offer real-time key-based access to data.
  3. Model Development: Data exploration and model development are conducted using tools such as Jupyter or Apache Zeppelin notebooks, which communicate with the Spark server over a Kerberized Livy connection.
  4. Model Training and Scoring: Model training and scoring is performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which is part of the Hadoop implementation.

Rocket’s migration journey

At Rocket, we believe in the power of continuous improvement and constantly seek out new opportunities. One such opportunity is using data science solutions, but to do so, we must have a strong and flexible data science environment.

To address the legacy data science environment challenges, Rocket decided to migrate its ML workloads to the Amazon SageMaker AI suite. This would allow us to deliver more personalized experiences and understand our customers better. To promote the success of this migration, we collaborated with the AWS team to create automated and intelligent digital experiences that demonstrated Rocket’s understanding of its clients and kept them connected.

We implemented an AWS multi-account strategy, standing up Amazon SageMaker Studio in a build account using a network-isolated Amazon VPC. This allows us to separate development and production environments, while also improving our security stance.
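
As a general illustration of this pattern, a network-isolated SageMaker Studio domain can be stood up in VPC-only mode roughly as follows; the domain name, VPC, subnets, and execution role shown are placeholders, not Rocket's specific values.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholders -- substitute your own VPC, subnets, and execution role.
sagemaker_client.create_domain(
    DomainName="build-account-studio",
    AuthMode="IAM",
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    },
    # VpcOnly routes all Studio traffic through your VPC instead of the public internet.
    AppNetworkAccessType="VpcOnly",
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
)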

We moved our new work to SageMaker Studio and our legacy Hadoop workloads to Amazon EMR, connecting to the old Hadoop cluster using Livy and SageMaker notebooks to ease the transition. This gives us access to a wider range of tools and technologies, enabling us to choose the most appropriate ones for each problem we’re trying to solve.

In addition, we moved our data from HDFS to Amazon Simple Storage Service (Amazon S3), and now use Amazon Athena and AWS Lake Formation to provide proper access controls to production data. This makes it easier to access and analyze the data, and to integrate it with other systems. The team also provides secure interactive integration through Amazon Elastic Kubernetes Service (Amazon EKS), further improving the company’s security stance.

SageMaker AI has been instrumental in empowering our data science community with the flexibility to choose the most appropriate tools and technologies for each problem, resulting in faster development cycles and higher model accuracy. With SageMaker Studio, our data scientists can seamlessly develop, train, and deploy models without the need for additional infrastructure management.

As a result of this modernization effort, SageMaker AI enabled Rocket to scale our data science solution across Rocket Companies and integrate using a hub-and-spoke model. The ability of SageMaker AI to automatically provision and manage instances has allowed us to focus on our data science work rather than infrastructure management, increasing the number of models in production by five times and data scientists’ productivity by 80%.

Our data scientists are empowered to use the most appropriate technology for the problem at hand, and our security stance has improved. Rocket can now compartmentalize data and compute, as well as compartmentalize development and production. Additionally, we are able to provide model tracking and lineage using Amazon SageMaker Experiments and artifacts discoverable using the SageMaker model registry and Amazon SageMaker Feature Store. All the data science work has now been migrated onto SageMaker, and all the old Hadoop work has been migrated to Amazon EMR.

Overall, SageMaker AI has played a critical role in enabling Rocket’s modernization journey by building a more scalable and flexible ML framework, reducing operational burden, improving model accuracy, and accelerating deployment times.

The successful modernization allowed Rocket to overcome our previous limitations and better support our data science efforts. We were able to improve our security stance, make work more traceable and discoverable, and give our data scientists the flexibility to choose the most appropriate tools and technologies for each problem. This has helped us better serve our customers and drive business growth.

Rocket’s new data science solution architecture on AWS is shown in the following diagram.

The solution consists of the following components:

  1. Data ingestion: Data is ingested into the data account from on-premises and external sources.
  2. Data refinement: Raw data is refined into consumable layers (raw, processed, conformed, and analytical) using a combination of AWS Glue extract, transform, and load (ETL) jobs and EMR jobs.
  3. Data access: Refined data is registered in the data account’s AWS Glue Data Catalog and exposed to other accounts via Lake Formation. Analytic data is stored in Amazon Redshift. Lake Formation makes this data available to both the build and compute accounts. For the build account, access to production data is restricted to read-only.
  4. Development: Data science development is done using SageMaker Studio. Data engineering development is done using AWS Glue Studio. Both disciplines have access to Amazon EMR for Spark development. Data scientists have access to the entire SageMaker ecosystem in the build account.
  5. Deployment: SageMaker trained models developed in the build account are registered with an MLFlow instance. Code artifacts for both data science activities and data engineering activities are stored in Git. Deployment initiation is controlled as part of CI/CD.
  6. Workflows: We have a number of workflow triggers. For online scoring, we typically provide an external-facing endpoint using Amazon EKS with Istio. We have numerous jobs that are launched by AWS Lambda functions that in turn are triggered by timers or events. Processes that run may include AWS Glue ETL jobs, EMR jobs for additional data transformations or model training and scoring activities, or SageMaker pipelines and jobs performing training or scoring activities (a minimal sketch of such a Lambda trigger follows this list).
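
As a general illustration of the event-driven pattern described in the workflows step, a Lambda handler that starts a SageMaker pipeline execution might look like the following; the pipeline name and parameter are hypothetical.

import boto3

sagemaker_client = boto3.client("sagemaker")


def lambda_handler(event, context):
    # Kick off a (hypothetical) scoring pipeline when the timer or event fires.
    response = sagemaker_client.start_pipeline_execution(
        PipelineName="nightly-scoring-pipeline",  # illustrative name
        PipelineParameters=[
            {"Name": "ScoreDate", "Value": event.get("score_date", "latest")},
        ],
    )
    return {"pipelineExecutionArn": response["PipelineExecutionArn"]}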

Migration impact

We’ve come a long way in modernizing our infrastructure and workloads. We started our journey supporting six business channels and 26 models in production, with dozens in development. Deployment times stretched for months and required a team of three system engineers and four ML engineers to keep everything running smoothly. Despite the support of our internal DevOps team, our issue backlog with the vendor was an unenviable 200+.

Today, we are supporting nine organizations and over 20 business channels, with a whopping 210+ models in production and many more in development. Our average deployment time has gone from months to just weeks—sometimes even down to mere days! With just one part-time ML engineer for support, our average issue backlog with the vendor is practically non-existent. We now support over 120 data scientists, ML engineers, and analytical roles. Our framework mix has expanded to include 50% SparkML models and a diverse range of other ML frameworks, such as PyTorch and scikit-learn. These advancements have given our data science community the power and flexibility to tackle even more complex and challenging projects with ease.

The following table compares some of our metrics before and after migration.

. Before Migration After Migration
Speed to Delivery New data ingestion project took 4–8 weeks Data-driven ingestion takes under one hour
Operation Stability and Supportability Over a hundred incidents and tickets in 18 months Fewer incidents: one per 18 months
Data Science Data scientists spent 80% of their time waiting on their jobs to run Seamless data science development experience
Scalability Unable to scale Powers 10 million automated data science and AI decisions made daily

Lessons learned

Throughout the journey of modernizing our data science solution, we’ve learned valuable lessons that we believe could be of great help to other organizations who are planning to undertake similar endeavors.

First, we’ve come to realize that managed services can be a game changer in optimizing your data science operations.

The isolation of development into its own account while providing read-only access to production data is a highly effective way of enabling data scientists to experiment and iterate on their models without putting your production environment at risk. This is something that we’ve achieved through the combination of SageMaker AI and Lake Formation.

Another lesson we learned is the importance of training and onboarding for teams. This is particularly true for teams that are moving to a new environment like SageMaker AI. It’s crucial to understand the best practices of utilizing the resources and features of SageMaker AI, and to have a solid understanding of how to move from notebooks to jobs.

Lastly, we found that although Amazon EMR still requires some tuning and optimization, the administrative burden is much lighter compared to hosting directly on Amazon EC2. This makes Amazon EMR a more scalable and cost-effective solution for organizations who need to manage large data processing workloads.

Conclusion

This post provided an overview of the successful partnership between AWS and Rocket Companies. Through this collaboration, Rocket Companies was able to migrate many ML workloads and implement a scalable ML framework. Working with AWS, Rocket Companies remains committed to innovation and staying at the forefront of customer satisfaction.

Don’t let legacy systems hold back your organization’s potential. Discover how AWS can assist you in modernizing your data science solution and achieving remarkable results, similar to those achieved by Rocket Companies.


About the Authors

Dian Xu is the Senior Director of Engineering in Data at Rocket Companies, where she leads transformative initiatives to modernize enterprise data platforms and foster a collaborative, data-first culture. Under her leadership, Rocket’s data science, AI & ML platforms power billions of automated decisions annually, driving innovation and industry disruption. A passionate advocate for Gen AI and cloud technologies, Xu is also a sought-after speaker at global forums, inspiring the next generation of data professionals. Outside of work, she channels her love of rhythm into dancing, embracing styles from Bollywood to Bachata as a celebration of cultural diversity.

Joel Hawkins is a Principal Data Scientist at Rocket Companies, where he is responsible for the data science and MLOps platform. Joel has decades of experience developing sophisticated tooling and working with data at large scales. A driven innovator, he works hand in hand with data science teams to ensure that we have the latest technologies available to provide cutting edge solutions. In his spare time, he is an avid cyclist and has been known to dabble in vintage sports car restoration.

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services. He partners with North American FinTech companies like Rocket and other financial services organizations to drive cloud and AI strategy, accelerating AI adoption at scale. With deep expertise in AI & ML, Generative AI, and cloud-native architecture, he helps financial institutions unlock new revenue streams, optimize operations, and drive impactful business transformation. Sajjan collaborates closely with Rocket Companies to advance its mission of building an AI-fueled homeownership platform to Help Everyone Home. Outside of work, he enjoys traveling, spending time with his family, and is a proud father to his daughter.

Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.

Read More

AWS and DXC collaborate to deliver customizable, near real-time voice-to-voice translation capabilities for Amazon Connect

AWS and DXC collaborate to deliver customizable, near real-time voice-to-voice translation capabilities for Amazon Connect

Providing effective multilingual customer support in global businesses presents significant operational challenges. Through collaboration between AWS and DXC Technology, we’ve developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multi-lingual customer interactions.

In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.

Challenge: Serving customers in multiple languages

In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for the lower volume languages. Previously, DXC had explored several existing alternatives but found limitations in each approach – from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solution Architects collaborated to:

  • Define essential requirements for real-time translation
  • Establish latency and accuracy benchmarks
  • Create seamless integration paths with existing systems
  • Develop a phased implementation strategy
  • Prepare and test an initial proof of concept setup

Business impact

For DXC, this prototype served as an enabler, making it possible to maximize technical talent, transform operations, and improve costs through:

  • Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
  • Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
  • Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through a pay-per-use translation model
  • Similar experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in the customer’s preferred language

Solution overview

The Amazon Connect V2V translation prototype uses AWS advanced speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:

  • Speech recognition – The customer’s spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
  • Machine translation – Amazon Translate, the machine translation engine, translates the customer’s transcript into the agent’s preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine.
  • Bidirectional translation – The process is reversed for the agent’s response, translating their speech into the customer’s language and delivering the translated audio to the customer.
  • Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.

The prototype can be extended with other AWS AI services to further customize the translation capabilities. It’s open source and ready for customization to meet your specific needs.
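To make the translation hop concrete, the following minimal sketch takes a transcript segment (assumed to have already been produced by Amazon Transcribe) and passes it through Amazon Translate and Amazon Polly. The function name, language codes, and voice ID are illustrative; the sample project's actual streaming integration is more involved.

import boto3

translate = boto3.client("translate")
polly = boto3.client("polly")

def translate_and_synthesize(transcript, source_lang="es", target_lang="en", voice_id="Joanna"):
    """Translate a transcript segment and synthesize the translation to speech."""
    translation = translate.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )["TranslatedText"]

    # Returns an MP3 audio stream that the agent web application can play back
    speech = polly.synthesize_speech(
        Text=translation,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return translation, speech["AudioStream"].read()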

The following diagram illustrates the solution architecture.

The following screenshot illustrates a sample agent web application.

The user interface consists of three sections:

  • Contact Control Panel – A softphone client using Amazon Connect
  • Customer Controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
  • Agent controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice

Challenges when implementing near real-time voice translation

The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream is started. However, even with the shortest audio processing time, the user experience still doesn’t match that of a real conversation in which both parties speak the same language. This is because of the specific pattern of the customer only hearing the agent’s translated speech, and the agent only hearing the customer’s translated speech. The following diagram displays that pattern.

The example workflow consists of the following steps:

  1. The customer starts speaking in their own language, and speaks for 10 seconds.
  2. Because the agent only hears the customer’s translated speech, the agent first hears 10 seconds of silence.
  3. When the customer finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
  4. The customer’s translated speech is streamed to the agent. During that time, the customer hears silence.
  5. When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
  6. Because the customer only hears the agent’s translated speech, the customer hears 10 seconds of silence.
  7. When the agent finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
  8. The agent’s translated speech is streamed to the customer. During that time, the agent hears silence.

In this scenario, the customer hears a single block of 22–24 seconds of complete silence, from the moment they finished speaking until they hear the agent’s translated voice. This creates a suboptimal experience, because the customer might not be certain what is happening during these 22–24 seconds—for instance, if the agent was able to hear them, or if there was a technical issue.

Audio streaming add-ons

In a face-to-face conversation between two people who don’t speak the same language, a third person often acts as a translator or interpreter. An example workflow consists of the following steps:

  1. Person A speaks in their own language, which is heard by Person B and the translator.
  2. The translator translates what Person A said to Person B’s language. The translation is heard by Person B and Person A.

Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There’s no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).

To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.

The workflow consists of the following steps:

  1. The customer starts speaking in their own language, and speaks for 10 seconds.
  2. The agent hears the customer’s original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
  3. When the customer finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
  4. The customer’s translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
  5. When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
  6. The customer hears the agent’s original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
  7. When the agent finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
  8. The agent’s translated speech is then streamed to the customer. During that time, the agent hears their translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).

In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.

The audio streaming add-ons provide additional benefits, including:

  • Voice characteristics – In cases when the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can’t hear if the customer was talking slow or fast, if the customer was upset or calm, and so on. The translated and synthesized speech doesn’t carry over that information.
  • Quality assurance – In cases when call recording is enabled, only the customer’s original voice and the agent’s synthesized speech are recorded, because the translation and speech synthesis are done on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent’s original voice, the customer’s original voice, and their respective translated and synthesized speech, all in a single audio file.
  • Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that would improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure that your brand names, character names, model names, and other unique content are transcribed and translated to the desired result.

Get started with Amazon Connect V2V

Ready to transform your contact center’s communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multi-lingual communication solutions in your own contact center, through the following key steps:

  1. Clone the GitHub repository.
  2. Test different configurations for audio streaming add-ons.
  3. Review the sample project’s limitations in the README.
  4. Develop your implementation strategy:
    1. Implement robust security and compliance controls that meet your organization’s standards.
    2. Collaborate with your customer experience team to define your specific use case requirements.
    3. Balance between automation and the agent’s manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons).
    4. Use your preferred transcribe, translate, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
    5. Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.

Conclusion

The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!


About the Authors

Milos Cosic is a Principal Solutions Architect at AWS.

EJ Ferrell is a Senior Solutions Architect at AWS.

Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.

Read More

Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock

Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock

Generative AI is revolutionizing enterprise automation, enabling AI systems to understand context, make decisions, and act independently. Generative AI foundation models (FMs), with their ability to understand context and make decisions, are becoming powerful partners in solving sophisticated business problems. At AWS, we’re using the power of models in Amazon Bedrock to drive automation of complex processes that have traditionally been challenging to streamline.

In this post, we focus on one such complex workflow: document processing. This serves as an example of how generative AI can streamline operations that involve diverse data types and formats.

Challenges with document processing

Document processing often involves handling three main categories of documents:

  • Structured – For example, forms with fixed fields
  • Semi-structured – Documents that have a predictable set of information but might vary in layout or presentation
  • Unstructured – For example, paragraphs of text or notes

Traditionally, processing these varied document types has been a pain point for many organizations. Rule-based systems or specialized machine learning (ML) models often struggle with the variability of real-world documents, especially when dealing with semi-structured and unstructured data.

We demonstrate how generative AI along with external tool use offers a more flexible and adaptable solution to this challenge. Through a practical use case of processing a patient health package at a doctor’s office, you will see how this technology can extract and synthesize information from all three document types, potentially improving data accuracy and operational efficiency.

Solution overview

This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools and APIs to perform complex document analysis tasks.

The solution employs a strategic multi-model approach, optimizing for both performance and cost by selecting the most appropriate model for each task:

  • Anthropic’s Claude 3 Haiku – Serves as the workflow orchestrator due to its low latency and cost-effectiveness. This model’s strong reasoning and tool use abilities make it ideal for the following:

    • Coordinating the overall document processing pipeline

    • Making routing decisions for different document types

    • Invoking appropriate processing functions

    • Managing the workflow state

  • Anthropic’s Claude 3.5 Sonnet (v2) – Used for its advanced reasoning capabilities, notably strong visual processing abilities, particularly excelling at interpreting charts and graphs. Its key strengths include:

    • Interpreting complex document layouts and structure

    • Extracting text from tables and forms

    • Processing medical charts and handwritten notes

    • Converting unstructured visual information into structured data

Through the Amazon Bedrock Converse API’s standardized tool use (function calling) interface, these models can work together seamlessly to invoke document processing functions, call external APIs for data validation, trigger storage operations, and execute content transformation tasks. The API serves as the foundation for this intelligent workflow, providing a unified interface for model communication while maintaining conversation state throughout the processing pipeline. The API’s standardized approach to tool definition and function calling provides consistent interaction patterns across different processing stages. For more details on how tool use works, refer to The complete tool use workflow.

The solution incorporates Amazon Bedrock Guardrails to implement robust content filtering policies and sensitive information detection, making sure that personal health information (PHI) and personally identifiable information (PII) data is appropriately protected through automated detection and masking capabilities while maintaining industry standard compliance throughout the document processing workflow.
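For reference, a guardrail is attached to an Amazon Bedrock Converse API call through the guardrailConfig parameter. The following minimal sketch assumes you have already created a guardrail; the identifier, version, and model ID are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder guardrail identifier and version from your own account
guardrail_config = {
    "guardrailIdentifier": "abcd1234efgh",
    "guardrailVersion": "1",
    "trace": "enabled",
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the attached intake form."}]}],
    guardrailConfig=guardrail_config,  # masks PII/PHI according to the guardrail's filters
)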

Prerequisites

You need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

Use case and dataset

For our example use case, we examine a patient intake process at a healthcare institution. The workflow processes a patient health information package containing three distinct document types:

  • Structured document – A new patient intake form with standardized fields for personal information, medical history, and current symptoms. This form follows a consistent layout with clearly defined fields and check boxes, making it an ideal example of a structured document.
  • Semi-structured document – A health insurance card that contains essential coverage information. Although insurance cards generally contain similar information (policy number, group ID, coverage dates), they come from different providers with varying layouts and formats, showing the semi-structured nature of these documents.
  • Unstructured document – A handwritten doctor’s note from an initial consultation, containing free-form observations, preliminary diagnoses, and treatment recommendations. This represents the most challenging category of unstructured documents, where information isn’t confined to any predetermined format or structure.

The example document can be downloaded from the following GitHub repo.

This healthcare use case is particularly relevant because it encompasses common challenges in document processing: the need for high accuracy, compliance with healthcare data privacy requirements, and the ability to handle multiple document formats within a single workflow. The variety of documents in this patient package demonstrates how a modern intelligent document processing solution must be flexible enough to handle different levels of document structure while maintaining consistency and accuracy in data extraction.

The following diagram illustrates the solution workflow.

IDP flow using external tool calling

This self-orchestrated workflow demonstrates how modern generative AI solutions can balance capability, performance, and cost-effectiveness in transforming traditional document processing workflows in healthcare settings.

Deploy the solution

  1. Create an Amazon SageMaker domain. For instructions, see Use quick setup for Amazon SageMaker AI.
  2. Launch SageMaker Studio, then create and launch a JupyterLab space. For instructions, see Create a space.
  3. Create a guardrail. Focus on adding sensitive information filters that would mask PII or PHI.
  4. Clone the code from the GitHub repository:

    git clone https://github.com/aws-samples/anthropic-on-aws.git
  5. Change the directory to the root of the cloned repository:

    cd medical-idp
  6. Install dependencies:

    pip install -r requirements.txt
  7. Update setup.sh with the guardrail ID you created in Step 3. Then set the ENV variable:

    source setup.sh
  8. Finally, start the Streamlit application:

    streamlit run streamlit_app.py

Now you’re ready to explore the intelligent document processing workflow using Amazon Bedrock.

Technical implementation

The solution is built around the Amazon Bedrock Converse API and tool use framework, with Anthropic’s Claude 3 Haiku serving as the primary orchestrator. When a document is uploaded through the Streamlit interface, Haiku analyzes the request and determines the sequence of tools needed by consulting the tool definitions in ToolConfig. These definitions include tools for the following:

  • Document processing pipeline – Handles initial PDF processing and classification
  • Document notes processing – Extracts information from medical notes
  • New patient information processing – Processes patient intake forms
  • Insurance form processing – Handles insurance card information

The following code is an example tool definition for extracting consultation notes. Here, extract_consultation_notes represents the name of the function that the orchestration workflow will call, and document_paths defines the schema of the input parameter that will be passed to the function. The FM will contextually extract the information from the document and pass it to the method. A similar toolspec will be defined for each step. Refer to the GitHub repo for the full toolspec definition.

{
    "toolSpec": {
        "name": "extract_consultation_notes",
        "description": "Extract diagnostics information from a doctor's consultation notes. Along with the extraction include the full transcript in a <transcript> node",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "document_paths": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Paths to the files that were classified as DOC_NOTES"
                    }
                },
                "required": ["document_paths"]
            }
        }
    }
}

When a PDF document is uploaded through the Streamlit interface, it is temporarily stored and passed to the FileProcessor class along with the tool specification and a user prompt:

prompt = ("1. Extract 2. save and 3. summarize the information from the patient information package located at " + tmp_file + ". " +
                          "The package might contain various types of documents including insurance cards. Extract and save information from all documents provided. "
                          "Perform any preprocessing or classification of the file provided prior to the extraction." + 
                          "Set the enable_guardrails parameter to " + str(enable_guardrails) + ". " + 
                          "At the end, list all the tools that you had access to. Give an explantion on why each tool was used and if you are not using a tool, explain why it was not used as well" + 
                          "Think step by step.")
                processor.process_file(prompt=prompt, 
toolspecs=toolspecs,
...

The BedrockUtils class manages the conversation with Anthropic’s Claude 3 Haiku through the Amazon Bedrock Converse API. It maintains the conversation state and handles the tool use workflow:

# From bedrockutility.py
def invoke_bedrock(self, message_list, system_message=[], tool_list=[],
                  temperature=0, maxTokens=2048, guardrail_config=None):
    response = self.bedrock.converse(
        modelId=self.model_id,
        messages=message_list,
        system=system_message,
        inferenceConfig={
            "maxTokens": maxTokens,
            "temperature": temperature
        },
        # Attach the optional tool definitions and guardrail configuration when provided
        **({"toolConfig": {"tools": tool_list}} if tool_list else {}),
        **({"guardrailConfig": guardrail_config} if guardrail_config else {})
    )
    return response

When the processor receives a document, it initiates a conversation loop with Anthropic’s Claude 3 Haiku, which analyzes the document and determines which tools to use based on the content. The model acts as an intelligent orchestrator, making decisions about the following:

  • Which document processing tools to invoke
  • The sequence of processing steps
  • How to handle different document types within the same package
  • When to summarize and complete the processing

This orchestration is managed through a continuous conversation loop that processes tool requests and their results until the entire document package has been processed.
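The loop is not reproduced in full in this post; the following simplified sketch (with a hypothetical dispatch_tool helper) illustrates the general pattern of invoking the model, executing any requested tool, and returning the tool result to the model until it stops asking for tools.

def run_tool_use_loop(bedrock_utils, message_list, system_message, toolspecs):
    """Simplified tool use loop: call the model, run requested tools, return the final message."""
    while True:
        response = bedrock_utils.invoke_bedrock(
            message_list=message_list,
            system_message=system_message,
            tool_list=toolspecs,
        )
        output_message = response["output"]["message"]
        message_list.append(output_message)

        if response.get("stopReason") != "tool_use":
            # The model produced its final answer; no more tools requested
            return output_message

        # Execute each requested tool and send its result back to the model
        tool_results = []
        for block in output_message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = dispatch_tool(tool_use["name"], tool_use["input"])  # hypothetical dispatcher
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": result}],
                    }
                })
        message_list.append({"role": "user", "content": tool_results})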

The first key decision in the workflow is initiating the document classification process. Through the DocumentClassifier class, the solution uses Anthropic’s Claude 3.5 Sonnet to analyze and categorize each page of the uploaded document into three main types: intake forms, insurance cards, and doctor’s notes:

# from document_classifier.py
class DocumentClassifier:
    def __init__(self, file_handler):
        # Keep a reference to the file handler used to load documents as binary data
        self.file_handler = file_handler
        self.sonnet_3_5_bedrock_utils = BedrockUtils(
            model_id=ModelIDs.anthropic_claude_3_5_sonnet
        )
        
    def categorize_document(self, file_paths):
        # Convert documents to binary format for model processing
        binary_data_array = []
        for file_path in file_paths:
            binary_data, media_type = self.file_handler.get_binary_for_file(file_path)
            binary_data_array.append((binary_data[0], media_type))

        # Prepare message for classification
        message_content = [
            {"image": {"format": media_type, "source": {"bytes": data}}}
            for data, media_type in binary_data_array
        ]
        
        # Create classification request
        message_list = [{
            "role": 'user',
            "content": [
                *message_content,
                {"text": "What types of document is in this image?"}
            ]
        }]
        
        # Define system message for classification
        system_message = [{
            "text": '''You are a medical document processing agent. 
                      Categorize images as: INTAKE_FORM, INSURANCE_CARD, or DOC_NOTES'''
        }]
        
        # Get classification from model
        response = self.sonnet_3_5_bedrock_utils.invoke_bedrock(
            message_list=message_list,
            system_message=system_message
        )
        return [response['output']['message']]

Based on the classification results, the FM determines the next tool to be invoked. The tool’s description and input schema define exactly what information needs to be extracted. Following the previous example, let’s assume the next page to be processed is a consultation note. The workflow will invoke the extract_consultation_notes function. This function processes documents to extract detailed medical information. Like the classification process discussed earlier, it first converts the documents to binary format suitable for model processing. The key to accurate extraction lies in how the images and system message are combined:

def extract_info(self, file_paths):
    # Convert documents to binary data
    # This will follow the same pattern as in the classification function
    message_content = [
        {"image": {"format": media_type, "source": {"bytes": data}}}
        for data, media_type in binary_data_array
    ]

    message_list = [{
        "role": 'user',
        "content": [
            *message_content,  # Include the processed document images
            {"text": '''Extract all information from this file
                       If you find a visualization
                           - Provide a detailed description in natural language
                           - Use domain specific language for the description
                    '''}
        ]
    }]
    
    system_message = [{
        "text": '''You are a medical consultation agent with expertise in diagnosing and treating various health conditions.
                   You have a deep understanding of human anatomy, physiology, and medical knowledge across different specialties.
                   During the consultation, you review the patient's medical records, test results, and documentation provided.
                   You analyze this information objectively and make associations between the data and potential diagnoses.
                   Associate a confidence score with each piece of extracted information. This should reflect how confident the model is that the extracted value matches the requested entity.
        '''}
    ]
    
    response = self.bedrock_utils.invoke_bedrock(
        message_list=message_list,
        system_message=system_message
    )
    return [response['output']['message']]

The system message serves three crucial purposes:

  • Establish medical domain expertise for accurate interpretation.
  • Provide guidelines for handling different types of information (text and visualizations).
  • Provide a self-scored confidence. Although this is not an independent grading mechanism, the score is directionally indicative of how confident the model is in its own extraction.

Following the same pattern, the FM will use the other tools in the toolspec definition to save and summarize the results.

A unique advantage of using a multi-modal FM for the extraction task is its ability to have a deep understanding of the text it is extracting. For example, the following code is an abstract of the data schema we are requesting as input to the save_consultation_notes function. Refer to the code in constants.py for full definition. The model needs to not only extract a transcript, but also understand it to extract such structured data from an unstructured document. This significantly reduces the postprocessing efforts required for the data to be consumed by a downstream application.

"consultation": {
                            "type": "object",
                            "properties": {
                            "date": {"type": "string"},
                            "concern": {
                                "type": "object",
                                "properties": {
                                    "primaryComplaint": {
                                        "type": "string",
                                        "description": "Primary medical complaint of the patient. Only capture the medical condition. no timelines"
                                    },
                                    "duration": {"type": "number"},
                                    "durationUnit": {"type": "string", "enum": ["days", "weeks", "months", "years"]},
                                    "associatedSymptoms": {
                                        "type": "object",
                                        "additionalProperties": {
                                            "type": "boolean"
                                        },
                                        "description": "Key-value pairs of symptoms and their presence (true) or absence (false)"
                                    },
                                    "absentSymptoms": {
                                        "type": "array",
                                        "items": {"type": "string"}
                                    }
                                },
                                "required": ["primaryComplaint", "duration", "durationUnit"]
                            }

The documents contain a treasure trove of personally identifiable information (PII) and personal health information (PHI). To redact this information, you can pass enable_guardrails as true. This will use the guardrail you set up earlier as part of the information extraction process and mask information identified as PII or PHI.

processor.process_file(prompt=prompt,
                       enable_guardrails=True,
                       toolspecs=toolspecs,
                       …
)

Finally, cross-document validation is crucial for maintaining data accuracy and compliance in healthcare settings. Although the current implementation performs basic consistency checks through the summary prompt, organizations can extend the framework by implementing a dedicated validation tool that integrates with their specific business rules and compliance requirements. Such a tool could perform sophisticated validation logic like insurance policy verification, appointment date consistency checks, or any other domain-specific validation requirements, providing complete data integrity across the document package.
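As one possible extension (not part of the published sample), such a validation tool could be described to the orchestrator with a toolSpec like the following; the function name and input fields are hypothetical.

# Hypothetical toolSpec for a cross-document validation tool (not in the sample repo)
validation_toolspec = {
    "toolSpec": {
        "name": "validate_patient_package",
        "description": "Cross-check extracted intake, insurance, and consultation data for consistency, "
                       "for example matching patient names and verifying that appointment dates are plausible",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "intake_data": {"type": "object", "description": "Structured data from the intake form"},
                    "insurance_data": {"type": "object", "description": "Structured data from the insurance card"},
                    "consultation_data": {"type": "object", "description": "Structured data from the doctor's notes"},
                },
                "required": ["intake_data", "insurance_data"],
            }
        },
    }
}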

Future considerations

As Amazon Bedrock continues to evolve, several powerful features can be integrated into this document processing workflow to enhance its enterprise readiness, performance, and cost-efficiency. Let’s explore how these advanced capabilities can take this solution to the next level:

  • Inference profiles in Amazon Bedrock define a model and its associated Regions for routing invocation requests, enabling various tasks such as usage tracking, cost monitoring, and cross-Region inference. These profiles help users track metrics through Amazon CloudWatch logs, monitor costs with cost allocation tags, and increase throughput by distributing requests across multiple Regions.
  • Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. Instead of reprocessing the entire context for each document, the workflow can reuse cached prompts, which is particularly beneficial when using the same image across different tooling workflows. With support for multiple cache checkpoints, this feature can substantially reduce processing time and inference costs while maintaining the workflow’s intelligent orchestration capabilities.
  •  Intelligent prompt routing can dynamically select the most appropriate model for each task based on performance and cost requirements. Rather than explicitly assigning Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for document analysis, the workflow can use intelligent routing to automatically choose the optimal model within the Anthropic family for each request. This approach simplifies model management while providing cost-effective processing of different document types, from simple structured forms to complex handwritten notes, all through a single endpoint.

Conclusion

This intelligent document processing solution demonstrates the power of combining Amazon Bedrock FMs with tool use capabilities to create sophisticated, self-orchestrating workflows. By using Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for complex visual tasks, the solution effectively handles structured, semi-structured, and unstructured documents while maintaining high accuracy and compliance standards.

Key benefits of this approach include:

  • Reduced manual processing through intelligent automation
  • Improved accuracy through specialized model selection
  • Built-in compliance with guardrails for sensitive data
  • Flexible architecture that adapts to various document types
  • Cost-effective processing through strategic model usage

As organizations continue to digitize their operations, solutions like this showcase how generative AI can transform traditional document processing workflows. The combination of powerful FMs in Amazon Bedrock and the tool use framework provides a robust foundation for building intelligent, scalable document processing solutions across industries.

For more information about Amazon Bedrock and its capabilities, visit the Amazon Bedrock User Guide.


About the Author

Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.

Read More

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Large language models (LLMs) excel at generating human-like text but face a critical challenge: hallucination—producing responses that sound convincing but are factually incorrect. While these models are trained on vast amounts of generic data, they often lack the organization-specific context and up-to-date information needed for accurate responses in business settings. Retrieval Augmented Generation (RAG) techniques help address this by grounding LLMs in relevant data during inference, but these models can still generate non-deterministic outputs and occasionally fabricate information even when given accurate source material. For organizations deploying LLMs in production applications—particularly in critical domains such as healthcare, finance, or legal services—these residual hallucinations pose serious risks, potentially leading to misinformation, liability issues, and loss of user trust.

To address these challenges, we introduce a practical solution that combines the flexibility of LLMs with the reliability of drafted, curated, verified answers. Our solution uses two key Amazon Bedrock services: Amazon Bedrock Knowledge Bases, a fully managed service that you can use to store, search, and retrieve organization-specific information for use with LLMs; and Amazon Bedrock Agents, a fully managed service that you can use to build, test, and deploy AI assistants that can understand user requests, break them down into steps, and execute actions. Similar to how a customer service team maintains a bank of carefully crafted answers to frequently asked questions (FAQs), our solution first checks if a user’s question matches curated and verified responses before letting the LLM generate a new answer. This approach helps prevent hallucinations by using trusted information whenever possible, while still allowing the LLM to handle new or unique questions. By implementing this technique, organizations can improve response accuracy, reduce response times, and lower costs. Whether you’re new to AI development or an experienced practitioner, this post provides step-by-step guidance and code examples to help you build more reliable AI applications.

Solution overview

Our solution implements a verified semantic cache using the Amazon Bedrock Knowledge Bases Retrieve API to reduce hallucinations in LLM responses while simultaneously improving latency and reducing costs. This read-only semantic cache acts as an intelligent intermediary layer between the user and Amazon Bedrock Agents, storing curated and verified question-answer pairs.

When a user submits a query, the solution first evaluates its semantic similarity with existing verified questions in the knowledge base. For highly similar queries (greater than 80% match), the solution bypasses the LLM completely and returns the curated and verified answer directly. When partial matches (60–80% similarity) are found, the solution uses the verified answers as few-shot examples to guide the LLM’s response, significantly improving accuracy and consistency. For queries with low similarity (less than 60%) or no match, the solution falls back to standard LLM processing, making sure that user questions receive appropriate responses.

This approach offers several key benefits:

  • Reduced costs: By minimizing unnecessary LLM invocations for frequently answered questions, the solution significantly reduces operational costs at scale
  • Improved accuracy: Curated and verified answers minimize the possibility of hallucinations for known user queries, while few-shot prompting enhances accuracy for similar questions.
  • Lower latency: Direct retrieval of cached answers provides near-instantaneous responses for known queries, improving the overall user experience.

The semantic cache serves as a growing repository of trusted responses, continuously improving the solution’s reliability while maintaining efficiency in handling user queries.

Solution architecture

Solution diagram to describe which AWS services are used

The solution architecture in the preceding figure consists of the following components and workflow. Let’s assume that the question “What date will AWS re:invent 2024 occur?” is within the verified semantic cache, with the corresponding verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024.” Let’s walk through an example of how this solution would handle a user’s question.

1. Query processing:

a. User submits a question “When is re:Invent happening this year?”, which is received by the Invoke Agent function.

b. The function checks the semantic cache (Amazon Bedrock Knowledge Bases) using the Retrieve API.

c. Amazon Bedrock Knowledge Bases performs a semantic search and finds a similar question with an 85% similarity score.

2. Response paths: (Based on the 85% similarity score in step 1.c, our solution follows the strong match path)

a. Strong match (similarity score greater than 80%):

i. Invoke Agent function returns exactly the verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024” directly from the Amazon Bedrock knowledge base, providing a deterministic response.

ii. No LLM invocation needed, response in less than 1 second.

b. Partial match (similarity score 60–80%):

i. The Invoke Agent function invokes the Amazon Bedrock agent and provides the cached answer as a few-shot example for the agent through Amazon Bedrock Agents promptSessionAttributes.

ii. If the question was “What’s the schedule for AWS events in December?”, our solution would provide the verified re:Invent dates to guide the Amazon Bedrock agent’s response with additional context.

iii. Providing the Amazon Bedrock agent with a curated and verified example might help increase accuracy.

c. No match (similarity score less than 60%):

i. If the user’s question isn’t similar to any of the curated and verified questions in the cache, the Invoke Agent function invokes the Amazon Bedrock agent without providing it any additional context from cache.

ii. For example, if the question was “What hotels are near re:Invent?”, our solution would invoke the Amazon Bedrock agent directly, and the agent would use the tools at its disposal to formulate a response.

3. Offline knowledge management:

a. Verified question-answer pairs are stored in a verified Q&A Amazon S3 bucket (Amazon Simple Storage Service), and must be updated or reviewed periodically to make sure that the cache contains the most recent and accurate information.

b. The S3 bucket is periodically synchronized with the Amazon Bedrock knowledge base. This offline batch process makes sure that the semantic cache remains up-to-date without impacting real-time operations.
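A minimal sketch of that synchronization step, assuming you have the knowledge base and data source IDs at hand (the IDs below are placeholders), could look like this:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder IDs for the cache knowledge base and its S3 data source
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KBCACHE1234",
    dataSourceId="DSCACHE5678",
    description="Periodic sync of verified Q&A pairs from Amazon S3",
)
print(response["ingestionJob"]["status"])  # for example, STARTING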

Solution walkthrough

You need to meet the following prerequisites for the walkthrough:

Once you have the prerequisites in place, use the following steps to set up the solution in your AWS account.

Step 0: Set up the necessary infrastructure

Follow the “Getting started” instructions in the README of the Git repository to set up the infrastructure for this solution. All the following code samples are extracted from the Jupyter notebook in this repository.

Step 1: Set up two Amazon Bedrock knowledge bases

This step creates two Amazon Bedrock knowledge bases. The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services.

agent_knowledge_base = BedrockKnowledgeBase(
    kb_name=agent_knowledge_base_name,
    kb_description="Knowledge base used by Bedrock Agent",
    data_bucket_name=agent_bucket_name,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{agent_unique_id}-f'
)

cache_knowledge_base = BedrockKnowledgeBase(
    kb_name=cache_knowledge_base_name,
    kb_description="Verified cache for Bedrock Agent System",
    data_bucket_name=cache_bucket_name,
    chunking_strategy="NONE",  # We do not want to chunk our question-answer pairs
    suffix=f'{cache_unique_id}-f'
)

This establishes the foundation for your semantic caching solution, setting up the AWS resources to store the agent’s knowledge and verified cache entries.

Step 2: Populate the agent knowledge base and associate it with an Amazon Bedrock agent

For this walkthrough, you will create an LLM Amazon Bedrock agent specialized in answering questions about Amazon Bedrock. For this example, you will ingest Amazon Bedrock documentation in the form of the User Guide PDF into the Amazon Bedrock knowledge base. This will be the primary dataset. After ingesting the data, you create an agent with specific instructions:

agent_instruction = """You are the Amazon Bedrock Agent. You have access to a 
knowledge base with information about the Amazon Bedrock service on AWS. 
Use it to answer questions."""

agent_id = agents_handler.create_agent(
    agent_name,
    agent_description,
    agent_instruction,
    [agent_foundation_model],
    kb_arns=[agent_kb_arn] # Associate agent with our Agent knowledge base
)

This setup enables the Amazon Bedrock agent to use the ingested knowledge to provide responses about Amazon Bedrock services. To test it, you can ask a question that isn’t present in the agent’s knowledge base, making the LLM either refuse to answer or hallucinate.

invoke_agent("What are the dates for reinvent 2024?", session_id="test")
# Response: Unfortunately, the dates for the AWS re:Invent 2024 conference have not 
# been announced yet by Amazon. The re:Invent conference is typically held in late 
# November or early December each year, but the specific dates for 2024 are not 
# available at this time. AWS usually announces the dates for their upcoming 
# re:Invent event around 6-9 months in advance.

Step 3: Create a cache dataset with known question-answer pairs and populate the cache knowledge base

In this step, you create a raw dataset of verified question-answer pairs that aren’t present in the agent knowledge base. These curated and verified answers serve as our semantic cache to prevent hallucinations on known topics. Good candidates for inclusion in this cache are:

  1. Frequently asked questions (FAQs): Common queries that users often ask, which can be answered consistently and accurately.
  2. Critical questions requiring deterministic answers: Topics where precision is crucial, such as pricing information, service limits, or compliance details.
  3. Time-sensitive information: Recent updates, announcements, or temporary changes that might not be reflected in the main RAG knowledge base.

By carefully curating this cache with high-quality, verified answers to such questions, you can significantly improve the accuracy and reliability of your solution’s responses. For this walkthrough, use the following example pairs for the cache:

Q: 'What are the dates for reinvent 2024?'
A: 'The AWS re:Invent conference was held from December 2-6 in 2024.'

Q: 'What was the biggest new feature announcement for Bedrock Agents during reinvent 2024?'
A: 'During re:Invent 2024, one of the headline new feature announcements for Bedrock Agents was the custom orchestrator. This key feature allows users to implement their own orchestration strategies through AWS Lambda functions, providing granular control over task planning, completion, and verification while enabling real-time adjustments and reusability across multiple agents.'

You then format these pairs as individual text files with corresponding metadata JSON files, upload them to an S3 bucket, and ingest them into your cache knowledge base. This process makes sure that your semantic cache is populated with accurate, curated, and verified information that can be quickly retrieved to answer user queries or guide the agent’s responses.
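A rough sketch of that preparation step follows; the bucket name, file naming, and metadata attributes are illustrative, and it assumes the Amazon Bedrock Knowledge Bases convention of a sidecar <filename>.metadata.json file for per-document metadata.

import json
import boto3

s3 = boto3.client("s3")
cache_bucket_name = "my-verified-cache-bucket"  # placeholder bucket name

qa_pairs = [
    {
        "question": "What are the dates for reinvent 2024?",
        "answer": "The AWS re:Invent conference was held from December 2-6 in 2024.",
    },
]

for i, pair in enumerate(qa_pairs):
    doc_key = f"qa_{i}.txt"
    # The document body holds the verified question and answer
    body = f"Question: {pair['question']}\nAnswer: {pair['answer']}"
    s3.put_object(Bucket=cache_bucket_name, Key=doc_key, Body=body.encode("utf-8"))

    # Sidecar metadata file picked up during ingestion (illustrative attributes)
    metadata = {"metadataAttributes": {"source": "verified_cache", "question": pair["question"]}}
    s3.put_object(
        Bucket=cache_bucket_name,
        Key=f"{doc_key}.metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
    )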

Step 4: Implement the verified semantic cache logic

In this step, you implement the core logic of your verified semantic cache solution. You create a function that integrates the semantic cache with your Amazon Bedrock agent, enhancing its ability to provide accurate and consistent responses. The function does the following:

  1. Queries the cache knowledge base for similar entries to the user question.
  2. If a high similarity match is found (greater than 80%), it returns the cached answer directly.
  3. For partial matches (60–80%), it uses the cached answer as a few-shot example for the agent.
  4. For low similarity (less than 60%), it falls back to standard agent processing.

This simplified logic forms the core of the semantic caching solution, efficiently using curated and verified information to improve response accuracy and reduce unnecessary LLM invocations.
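The following condensed sketch shows the shape of that logic. The knowledge base ID, agent identifiers, and helper structure are placeholders; the full implementation is in the repository’s Jupyter notebook.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

CACHE_KB_ID = "KBCACHE1234"                          # placeholder cache knowledge base ID
AGENT_ID, AGENT_ALIAS_ID = "AGENT1234", "ALIAS1234"  # placeholder agent identifiers

def invoke_agent_with_verified_cache(question, session_id="demo"):
    # 1. Look up the most similar verified question-answer pair
    results = agent_runtime.retrieve(
        knowledgeBaseId=CACHE_KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
    )["retrievalResults"]
    score = results[0]["score"] if results else 0.0
    cached_answer = results[0]["content"]["text"] if results else ""

    # 2. Strong match: return the curated and verified answer without invoking the LLM
    if score > 0.8:
        return cached_answer

    # 3. Partial match: pass the verified answer as a few-shot example via session attributes
    session_state = {}
    if score > 0.6:
        session_state = {"promptSessionAttributes": {"verifiedExample": cached_answer}}

    # 4. Low or no match: fall back to the Amazon Bedrock agent
    response = agent_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=question,
        sessionState=session_state,
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )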

Step 5: Evaluate results and performance

This step demonstrates the effectiveness of the verified semantic cache solution by testing it with different scenarios and comparing the results and latency. You’ll use three test cases to showcase the solution’s behavior:

  1. Strong semantic match (greater than 80% similarity)
  2. Partial semantic match (60-80% similarity)
  3. No semantic match (less than 60% similarity)

Here are the results:

  1. Strong semantic match (greater than 80% similarity) provides the exact curated and verified answer in less than 1 second.
    %%time
    invoke_agent_with_verified_cache("What were some new features announced for Bedrock Agents during reinvent 2024?")
    
    # Output:
    # Cache semantic similarity log: Strong match with score 0.9176399
    # CPU times: user 20.7 ms, sys: 442 μs, total: 21.1 ms
    # Wall time: 440 ms
    
    # During re:Invent 2024, one of the headline new feature announcements for Bedrock 
    # Agents was the custom orchestrator. This key feature allows users to implement 
    # their own orchestration strategies through AWS Lambda functions, providing 
    # granular control over task planning, completion, and verification while enabling 
    # real-time adjustments and reusability across multiple agents.

  2. Partial semantic match (60–80% similarity) passes the verified answer to the LLM during the invocation. The Amazon Bedrock agent answers the question correctly using the cached answer even though the information is not present in the agent knowledge base.
    %%time
    invoke_agent_with_verified_cache("What are the newest features for Bedrock Agents?") 
    
    # Output:
    # Cache semantic similarity log: Partial match with score 0.6443664
    # CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
    # Wall time: 12.8 s
    
    # One of the newest and most significant features for Amazon Bedrock Agents 
    # announced during re:Invent 2024 was the custom orchestrator. This feature 
    # allows users to implement their own orchestration strategies through AWS 
    # Lambda functions, providing granular control over task planning, completion, 
    # and verification. It enables real-time adjustments and reusability across 
    # multiple agents, enhancing the flexibility and power of Bedrock Agents.

  3. No semantic match (less than 60% similarity) invokes the Amazon Bedrock agent as usual. For this query, the LLM will either refuse to provide the information because it’s not present in the agent’s knowledge base, or will hallucinate and provide a response that is plausible but incorrect.
    %%time
    invoke_agent_with_verified_cache("Tell me about a new feature for Amazon Bedrock Agents")
    
    # Output:
    # Cache semantic similarity log: No match with score 0.532105
    # CPU times: user 22.3 ms, sys: 579 μs, total: 22.9 ms
    # Wall time: 13.6 s
    
    # Amazon Bedrock is a service that provides secure and scalable compute capacity 
    # for running applications on AWS. As for new features for the Bedrock Agents 
    # component, I do not have any specific information on recent or upcoming new 
    # features. However, AWS services are frequently updated with new capabilities, 
    # so it's possible there could be new agent features released in the future to 
    # enhance security, scalability, or integration with other AWS services. Without 
    # being able to consult the Knowledge Base, I cannot provide details on any 
    # particular new Bedrock Agent features at this time.

These results demonstrate the effectiveness of the semantic caching solution:

  1. Strong matches provide near-instant, accurate, and deterministic responses without invoking an LLM.
  2. Partial matches guide the LLM agent to provide a more relevant or accurate answer.
  3. No matches fall back to standard LLM agent processing, maintaining flexibility.

The semantic cache significantly reduces latency for known questions and improves accuracy for similar queries, while still allowing the agent to handle unique questions when necessary.

Step 6: Resource clean up

Make sure that the Amazon Bedrock knowledge bases that you created, along with the underlying Amazon OpenSearch Serverless collections, are deleted to avoid incurring unnecessary costs.
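A minimal cleanup sketch (the resource IDs are placeholders; the repository notebook contains the full teardown) looks like the following:

import boto3

bedrock_agent = boto3.client("bedrock-agent")
aoss = boto3.client("opensearchserverless")

# Placeholder IDs for the resources created earlier in the walkthrough
for kb_id in ["KBAGENT1234", "KBCACHE1234"]:
    bedrock_agent.delete_knowledge_base(knowledgeBaseId=kb_id)

for collection_id in ["agent-collection-id", "cache-collection-id"]:
    aoss.delete_collection(id=collection_id)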

Production readiness considerations

Before deploying this solution in production, address these key considerations:

  1. Similarity threshold optimization: Experiment with different thresholds to balance cache hit rates and accuracy. This directly impacts the solution’s effectiveness in preventing hallucinations while maintaining relevance.
  2. Feedback loop implementation: Create a mechanism to continuously update the verified cache with new, accurate responses. This helps prevent cache staleness and maintains the solution’s integrity as a source of truth for the LLM.
  3. Cache management and update strategy: Regularly refresh the semantic cache with current, frequently asked questions to maintain relevance and improve hit rates. Implement a systematic process for reviewing, validating, and incorporating new entries to help ensure cache quality and alignment with evolving user needs.
  4. Ongoing tuning: Adjust similarity thresholds as your dataset evolves. Treat the semantic cache as a dynamic component, requiring continuous optimization for your specific use case.

Conclusion

This verified semantic cache approach offers a powerful solution to reduce hallucinations in LLM responses while improving latency and reducing costs. By using Amazon Bedrock Knowledge Bases, you can implement a solution that can efficiently serve curated and verified answers, guide LLM responses with few-shot examples, and gracefully fall back to full LLM processing when needed.


About the Authors

Dheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.

Chaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazon’s Worldwide Returns and ReCommerce organization. He specializes in building scalable machine learning infrastructure, distributed systems, and containerization technologies. His expertise lies in developing robust solutions that enhance monitoring, streamline inference processes, and strengthen audit capabilities to support and optimize Amazon’s global operations.

Rajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for reverse logistics and ReCommerce operations. He is deeply passionate about generative AI and consistently seeks opportunities to implement AI into solving complex customer challenges.

Karam Muppidi is a Senior Engineering Manager at Amazon Retail, where he leads data engineering, infrastructure and analytics for the Worldwide Returns and ReCommerce organization. He has extensive experience developing enterprise-scale data architectures and governance strategies using both proprietary and native AWS platforms, as well as third-party tools. Previously, Karam developed big-data analytics applications and SOX compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.

Read More

LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker

LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker

Fine-tuning a pre-trained large language model (LLM) allows users to customize the model to perform better on domain-specific tasks or align more closely with human preferences. Keeping the fine-tuned model accurate and effective in changing environments is a continuous process: the model must adapt to data distribution shift (concept drift) and avoid performance degradation over time. Continuous fine-tuning also enables models to integrate human feedback, address errors, and adapt to real-world applications. You can use supervised fine-tuning (SFT) and instruction tuning to train the LLM to perform better on specific tasks using human-annotated datasets and instructions. When you have user feedback on the model responses, you can also use reinforcement learning from human feedback (RLHF) to guide the LLM’s responses by rewarding outputs that align with human preferences.

Precise and responsible outputs from fine-tuned LLMs require significant effort from subject matter experts (SMEs). Manually annotating extensive training data for fine-tuning and collecting user feedback to align LLM responses with human preferences are both resource-heavy and time-intensive. The continuous fine-tuning process also requires orchestrating the multiple steps of data generation, LLM training, feedback collection, and preference alignment with scalability, resiliency, and resource efficiency. To address these challenges, we present an innovative continuous self-instruct fine-tuning framework that streamlines the LLM fine-tuning process of training data generation and annotation, model training and evaluation, human feedback collection, and alignment with human preference. This framework is designed as a compound AI system to drive the fine-tuning workflow for performance improvement, versatility, and reusability.

In this post, we introduce the continuous self-instruct fine-tuning framework and its pipeline, and present how to drive the continuous fine-tuning process for a question-answer task as a compound AI system. We use DSPy (Declarative Self-improving Python) to demonstrate the workflow of Retrieval Augmented Generation (RAG) optimization, LLM fine-tuning and evaluation, and human preference alignment for performance improvement.

Overview of the continuous self-instruct fine-tuning framework

The continuous self-instruct fine-tuning framework drives a workflow to customize the foundation model (FM) using human-labeled training samples and human feedback after model inference. This workflow runs on a continuous basis to be adaptive to a changing environment. The following diagram illustrates the workflow.

cont_ft_workflow

The workflow consists of the following steps:

  1. Self-instruct supervised fine-tuning – First, we use a human-labeled training dataset to adapt the FM to tasks in a specific domain. Instruction tuning is a popular approach in domain-specific LLM fine-tuning: it trains the FM to follow instructions for a specific task rather than simply generating the next tokens. To reduce the human effort needed for data labeling, annotation, and validation, we designed a self-instruct fine-tuning method in which the LLM synthetically generates training labels from a small volume of high-quality human-annotated samples. This process scales up the training dataset used for fine-tuning the FM into a custom LLM.
  2. Human preference alignment – After the model is deployed in the production environment, the process moves into the human-in-the-loop workflow, in which we collect user feedback, including satisfaction scores and comments on model responses. The human feedback data is not only used to measure model performance and hallucination, but is also used to further fine-tune the custom model from Step 1 through RLHF. Likewise, to address the scarcity of human feedback data, we use LLMs to generate AI grades and feedback that scale up the dataset for reinforcement learning from AI feedback (RLAIF). There are various techniques for preference alignment, including proximal policy optimization (PPO), direct preference optimization (DPO), odds ratio policy optimization (ORPO), group relative policy optimization (GRPO), and other algorithms, that can be used in this process.
  3. Evaluation and continuous learning – The model customization and preference alignment is not a one-time effort. We need to keep monitoring and evaluating the model performance, and restart the process in case of concept shift or model decay.

The overall workflow consists of multiple steps of synthetic data generation, LLM training, feedback collection, preference alignment, and evaluation that involves multiple components and multiple LLMs. In the next section, we discuss using a compound AI system to implement this framework to achieve high versatility and reusability.

Compound AI system and the DSPy framework

With the rise of generative AI, scientists and engineers face a much more complex scenario to develop and maintain AI solutions, compared to classic predictive AI. The paper The Shift from Models to Compound AI Systems highlights that state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models. Compound AI systems are systems that implement AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers, or external tools. The following diagram compares predictive AI to generative AI.

compoundai

The concept of a compound AI system enables data scientists and ML engineers to design sophisticated generative AI systems consisting of multiple models and components. You can use a module to incorporate prompt engineering and in-context learning to improve RAG performance, and also design a data architecture with tools to gather external data. You can also build an agentic architecture with multiple LLMs, fine-tune the model to achieve higher performance, and orchestrate the LLM access. Besides the efficiency in system design, the compound AI system also enables you to optimize complex generative AI systems, using a comprehensive evaluation module based on multiple metrics, benchmarking data, and even judgements from other LLMs. The optimization is on the holistic end-to-end solution, rather than on each component separately.

To efficiently build and optimize compound AI systems, we introduce DSPy, an open source Python framework for developers to build LLM applications using modular and declarative programming, whether you’re building simple classifiers, sophisticated RAG pipelines, or agentic workflows. It provides algorithms for optimizing LLMs’ prompts and weights, and automates the prompt tuning process, as opposed to the trial-and-error approach performed by humans. DSPy supports iteratively optimizing all prompts involved against defined metrics for the end-to-end compound AI solution.
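
As a flavor of this declarative style, the following is a minimal, hypothetical DSPy program (a simple ticket classifier that is not part of the solution in this post): a signature describes the task, and a module executes it.

import dspy

# Hypothetical, minimal DSPy program: a declarative signature plus a module.
# Assumes a language model has already been configured, for example with
# dspy.settings.configure(lm=<your Amazon Bedrock-backed LM>).
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket as billing, technical, or other."""
    ticket = dspy.InputField()
    category = dspy.OutputField(desc="one of: billing, technical, other")

classify = dspy.ChainOfThought(ClassifyTicket)
# prediction = classify(ticket="My invoice shows a duplicate charge.")
# print(prediction.category)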

The DSPy lifecycle, presented in the following diagram, consists of seven steps. It separates the flow of your program (modules) from the parameters (language model prompts and weights) of each step. These modules define the system behavior in a portable, declarative way. The first four steps cover the DSPy programming stage, including defining your task and its constraints, exploring a few examples, and using that to inform your initial pipeline design. When your system works reasonably well, you can run the DSPy evaluation stage (Steps 5 and 6) to collect an initial development set, define your DSPy metric, and use these to iterate on your system more systematically. Afterwards, DSPy introduces optimizers (compilers) in Step 7, which use language model-driven algorithms to tune LLM prompts and weights based on predefined evaluation metrics.

dspy_lifecycle

RAG pipeline with continuous fine-tuning in a compound AI system

In this post, we provide an example of a question-answer task, using a RAG pipeline along with the continuous self-instruct fine-tuning framework. We build this as a compound AI system and use DSPy to drive the RAG inference, prompt optimization, LLM fine-tuning, and performance evaluation. The overall workflow is shown in the following diagram.

CFT_pipeline

The flow starts from a standard RAG pipeline, followed by a few optimizations on the prompts and the RAG retriever. Then we generate a synthetic training dataset from the RAG knowledge base to fine-tune the generator LLM for performance improvement. Lastly, we use a separate LLM to generate feedback on the fine-tuned model responses, and use it to conduct preference alignment training with DPO and PPO. The question-answer outputs from each step are measured by the underlying LLM-as-a-judge evaluation module. In this way, we demonstrate the effectiveness of the compound AI system for continuous optimization of the pipeline through RAG optimization and the fine-tuning framework.

In the next sections, we demonstrate how to build this workflow, including the RAG pipeline, optimization, instruction fine-tuning, preference alignment, and model evaluation, into a compound AI system using an Amazon SageMaker notebook instance with the DSPy framework and LLMs on Amazon Bedrock. The code from this post and more examples are available in the GitHub repository.

Prerequisites

To create and run this compound AI system in your AWS account, complete the following prerequisites:

  1. Create an AWS account if you don’t already have one.
  2. Set up a SageMaker notebook instance.
  3. Open JupyterLab in this newly created instance.
  4. Clone the GitHub repository and follow the steps explained in the README.
  5. Navigate to the cloned repository and open the notebook folder.
  6. Enable access to models hosted on Amazon Bedrock. For this post, we enable Anthropic’s Claude 3 Sonnet, Mistral 7B, and Meta Llama 3 8B.

Dataset

For the question-answering task, we use the Contract Understanding Atticus Dataset (CUAD), an open legal contract review dataset with over 13,000 annotations, created with dozens of legal experts from The Atticus Project. The synthetic data generation notebook automatically downloads the CUAD_v1 ZIP file and places it in the required folder named cuad_data.

If you run into issues, you can alternatively download the dataset yourself by following the steps in the README file, store it in a folder within the SageMaker notebook instance, and use it to perform the steps in the next section.

Prepare question-answer pairs

The first step is to prepare question-answer pairs from the CUAD document by running synthetic data generation.

We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to synthetically generate question-answer pairs for querying the RAG pipeline in the compound AI system, to demonstrate the improved accuracy after RAG optimization and model fine-tuning. The generated datasets are in the format of question-answer pairs along with the context [context, question, answer] from the document. We use the question to query the RAG pipeline and the answer as ground truth to evaluate the inference accuracy. Additionally, the question-answer pairs are used as training samples for model fine-tuning. The following is a sample dataset triplet with context and a question-answer pair.

Context (snippet from the PDF file): THIS STRATEGIC ALLIANCE AGREEMENT (“Agreement”) is made and entered into as of November 6, 2016 (the “Effective Date”) by and between Dialog Semiconductor (UK) Ltd., a corporation organized under the laws of England and Wales, having its principal office at 100 Longwater Avenue, Green Park, Reading, RG2 6GP, United Kingdom (“DIALOG”) and Energous Corporation, a Delaware corporation, having its principal office at 3590 North First Street, Suite 210, San Jose, CA 95134 (“ENERGOUS”)

Question: What is the date of the contract?

Answer: November 6, 2016
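
The exact generation prompt lives in the repository’s synthetic data generation notebook. Purely as an illustration of the idea, a sketch of turning one context chunk into a question-answer pair with Claude 3 Sonnet through the Amazon Bedrock Converse API might look like the following; the prompt wording and the JSON parsing are illustrative assumptions, not the notebook’s actual code.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def generate_qa_pair(context: str) -> dict:
    # Ask Claude 3 Sonnet to produce one question-answer pair grounded in the given context
    prompt = (
        "Read the contract excerpt below and write one factual question about it, "
        "plus a short answer. Respond as JSON with keys 'question' and 'answer'.\n\n"
        f"Excerpt:\n{context}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    text = response["output"]["message"]["content"][0]["text"]
    qa = json.loads(text)  # assumes the model returned valid JSON
    return {"context": context, "question": qa["question"], "answer": qa["answer"]}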

Create a RAG pipeline

We implement a standard RAG pipeline with DSPy using the following components to create the vector database, set up context retrieval, and generate the answer:

  1. Configure DSPy to use LLMs on Amazon Bedrock as the RAG generator model:
import dspy

dsp_bedrock = dspy.Bedrock(region_name='us-west-2')
claude_sonnet_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
bedrock_sonnet = dspy.AWSAnthropic(aws_provider=dsp_bedrock,
                                   model=claude_sonnet_model_id,
                                   max_new_tokens=4096,
                                   max_tokens=4096)
  2. Process the dataset to generate logically coherent and syntactically readable chunks. The chunk size and overlap percentage can be determined empirically for your dataset. For more flexibility, you can split the dataset into multiple files and treat each file as one chunk.
  3. To set up the RAG retriever, we select ChromaDB as the vector store and use DSPy’s ChromadbRM module as the retriever model:
import boto3
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction
from dspy.retrieve.chromadb_rm import ChromadbRM

session = boto3.Session(region_name="us-west-2")
titan_embed_model_id = "amazon.titan-embed-text-v2:0"
bedrock_ef = AmazonBedrockEmbeddingFunction(session=session, 
                                            model_name=titan_embed_model_id)
collection_name = "contexts"
persist_dir = "cuad_db/"
rm = ChromadbRM(collection_name=collection_name,
                persist_directory=persist_dir,
                embedding_function=bedrock_ef,
                k=3) 
  4. Using these components, we orchestrate a DSPy RAG pipeline to clean the context, generate the answer, and use the LLM-as-a-judge to score the generated answer with respect to the ground truth (a brief usage sketch follows this code):
import unicodedata

class GenerateAnswer(dspy.Signature):
   """Answer questions with short factoid answers."""
   context = dspy.InputField(desc="may contain relevant facts")
   question = dspy.InputField()
   answer = dspy.OutputField(desc="often between 1 and 5 words")

class RAG(dspy.Module):
   def __init__(self, num_passages=3):
      super().__init__()
      # The collection name and persist directory must match the vector store created earlier
      self.retrieve = ChromadbRM("contexts", "./chroma", k=num_passages)
      self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

   def forward(self, question):
      # Retrieve once, normalize the passages, then generate the answer from that context
      passages = self.retrieve(question).passages
      context = [unicodedata.normalize("NFKD", passage) for passage in passages]
      prediction = self.generate_answer(context=context, question=question)
      return dspy.Prediction(context=context, answer=prediction.answer)
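
To run the pipeline end to end, DSPy also needs to know which language model to use. The following is a brief usage sketch, assuming the bedrock_sonnet model configured earlier and a populated "contexts" collection; the sample question comes from the CUAD excerpt shown previously.

# Usage sketch: point DSPy at the Bedrock-backed LM configured earlier,
# then ask a question against the indexed CUAD chunks.
dspy.settings.configure(lm=bedrock_sonnet)

rag = RAG(num_passages=3)
prediction = rag("What is the date of the contract?")
print(prediction.answer)   # expected to be close to "November 6, 2016"
print(prediction.context)  # the retrieved passages that grounded the answer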

RAG optimization with DSPy

The next step is to perform RAG optimization with DSPy. DSPy provides the Optimizer module, an algorithm that can tune the parameters of a DSPy program (the prompts and language model weights) to maximize the metrics you specify. It takes in a training set from which to bootstrap selected training examples, and a metric function that measures proximity to, or matches against, the ground truth. With these, we can compile the RAG pipeline module with a defined optimizer instance to conduct the optimization.

In this post, we use the DSPy Optimizer to learn how to generate prompts that improve the RAG response accuracy. Because our dataset is small (fewer than 100 examples), we select the BootstrapFewShot teleprompter to compile the RAG prompts and overall pipeline, using the synthetic dataset with ground truth and the LLM-as-a-judge metric function we defined in the previous sections:

from dspy.teleprompt import BootstrapFewShot

def validate_context_and_answer(example, pred, trace=None):
   answer_EM = dspy.evaluate.answer_exact_match(example, pred)
   answer_PM = dspy.evaluate.answer_passage_match(example, pred)
   answer_LLMJudge = factuality_metric(example, pred)
   return answer_LLMJudge or answer_EM or answer_PM

rag_lm = RAG()
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(rag_lm, trainset=trainset)

The context retrieval is crucial to the overall RAG accuracy. To evaluate the RAG optimization we’ve described, we create a retriever evaluation using the LLM-as-a-judge to understand how well the retriever pulls out the relevant chunks for the incoming user question. The LLM judge is defined in the RetrievalJudge class:

class RetrievalJudge(dspy.Signature):
   """Judge given the question to be answered, check if the groundtruth answer can be derived from the predicted context.  Answer either Retrieved[True] or Retrieved[False]"""
   context = dspy.InputField(desc="Context for the prediction")
   question = dspy.InputField(desc="Question to be answered")
   groundtruth_answer = dspy.InputField(desc="groundtruth answer for the question")
   retrieval_correctness = dspy.OutputField(desc="Can the groundtruth answer be derived from the predicted context?", prefix="Retrieved[True/False]:")

retrieval_judge = dspy.ChainOfThought(RetrievalJudge)

Then we define the metric to measure the retrieval by using the RetrievalJudge, and use the DSPy Evaluate module to generate the accuracy score for retrieval:

from dspy.evaluate import Evaluate

def retrieval_metric(example, pred, trace=None):
   retrieval = retrieval_judge(question=example.question,
                               groundtruth_answer=example.answer,
                               context=pred.context)
   llm_retriever_ans = bool("Retrieved[True]" in retrieval.retrieval_correctness
                            or '100% True' in retrieval.retrieval_correctness
                            or '100% retrieved correct' in retrieval.retrieval_correctness
                            or 'True.' in retrieval.retrieval_correctness)
   return llm_retriever_ans

evaluate_retrieval = Evaluate(devset=trainset, metric=retrieval_metric, num_threads=1)
rag_retrieval_score = evaluate_retrieval(compiled_rag)

Configure the continuous fine-tuning framework

After the RAG optimization, the compound AI system has the instruction tuning and preference alignment modules, driven by the continuous fine-tuning framework. This includes using the synthetically generated dataset to train the LLM to follow question-answer instructions through SFT, and generating AI feedback on RAG responses (from another LLM) for RLAIF with PPO and for preference alignment with DPO and ORPO. In this step, we use Parameter Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce compute requirements and accelerate the training process.
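
The fine-tuning code itself lives in the companion repository. Purely as an illustration of the PEFT setup described above, a LoRA configuration for Meta Llama 3 8B with the Hugging Face peft library might look like the following; the model ID and hyperparameters are placeholder assumptions, not the values used in the repository.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model ID; the repository's notebooks define the exact setup.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA keeps the base weights frozen and trains small low-rank adapter matrices,
# which cuts memory requirements and speeds up fine-tuning.
lora_config = LoraConfig(
    r=16,                      # adapter rank (assumed value)
    lora_alpha=32,             # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of the 8B parameters is trained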

At the time of writing, the DSPy Optimization module supports distillation of a prompt-based DSPy program into LLM weight updates using BootstrapFinetune, and does not yet support the fine-tuning methods we defined in the compound AI system. Therefore, we conducted the fine-tuning (instruction tuning and preference alignment) on a Meta Llama 3 8B model separately; refer to the following GitHub repository for more details. With the compound AI system design, we are able to take the fine-tuning results back into the DSPy pipeline, use the LLM-as-a-judge evaluation function to generate the accuracy scores, and benchmark with the standard and optimized RAG inferences. This demonstrates the flexibility and interoperability of the compound AI system, which allows us to seamlessly replace one module with an external component without requiring changes to the entire pipeline.

The following diagram illustrates the workflow.

FT-model evaluation

Define an evaluation approach with DSPy

DSPy provides an Evaluate module for evaluating the compound AI system output by using user-defined metrics. In this post, we use LLM-as-a-judge to evaluate the system output and create the corresponding metrics for benchmarking the accuracy of standard RAG, optimized RAG, and fine-tuned models. Complete the following steps:

  1. Load the dataset for evaluation using the Example data type. Examples are similar to Python dictionaries but come with added utilities and work together with dspy.Prediction return values. For example:
gt_answer = <ground truth of the answer>
pred_answer = <answer from RAG and/or fine-tuned model>
dspy_data = dspy.Example(gt_answer=gt_answer, pred_answer=pred_answer).with_inputs("gt_answer", "pred_answer")
  2. Define the LLM-as-a-judge class to adjudicate whether the predicted answer semantically matches the ground truth answer. For example, the following FactualityJudge_1 class provides a score between 0 and 1, where 0 means a complete mismatch and 1 means a perfect match.
class FactualityJudge_1(dspy.Signature):
   """Judge whether the predicted answer semantically matches the groundtruth answer. Provide a score between 0 and 1, where 0 means a complete mismatch and 1 means a perfect match. In the response, only present the score, DO NOT add any preambles."""
   groundtruth_answer = dspy.InputField(desc="groundtruth answer")
   predicted_answer = dspy.InputField(desc="predicted answer")
   factually_correct = dspy.OutputField(desc="Is the predicted answer factually correct and semantically similar to the groundtruth answer?")
  3. Define the evaluation metrics from the LLM judge, using DSPy metrics, to mark whether the predicted answer is correct. For example, the following function returns the accuracy score based on the output of FactualityJudge_1:
factualityJudge_1 = dspy.ChainOfThought(FactualityJudge_1)

def factuality_metric_1(example, pred, trace=None):
   # The dspy.Example built earlier carries both the ground truth and the predicted answer
   gt_answer = example.gt_answer
   pred_answer = example.pred_answer
   factual_metric = factualityJudge_1(groundtruth_answer=gt_answer,
                                      predicted_answer=pred_answer)
   llm_judge_ans = float(factual_metric.factually_correct)
   print(f"llm_judge_ans = {llm_judge_ans}")
   return llm_judge_ans

metric_LLM_1 = factuality_metric_1
  4. Use the dspy.Evaluate module to generate an accuracy score using the LLM-as-a-judge metric defined in the previous step:
evaluate_llm_judge = Evaluate(devset=[dspy_data], metric=metric_LLM_1, num_threads=1)

This evaluation process should be conducted on a continuous basis in the compound AI system driven by self-instruct fine-tuning, to make sure the overall performance remains stable despite the changes in the environment or the introduction of new data.

Benchmark RAG and LLM fine-tuning with DSPy

We benchmark the approaches presented in this post using the LLM-as-a-judge evaluation function defined in the previous section with the following settings.

The benchmarking covers five methods: standard RAG, optimized RAG, an LLM fine-tuned by instruction tuning, and LLMs fine-tuned by DPO and by ORPO using AI feedback (AIF). For each method, the LLM judge provides a decimal accuracy score between 0 and 1.

The standard RAG uses Amazon Titan Text Embeddings V2 as the embedding model and Anthropic’s Claude 3 Haiku or Claude 3 Sonnet as the generator model. The RAG compilation uses 32 question-answer pairs to optimize the prompts. The same dataset is used for inference. The fine-tuning by SFT, DPO, and ORPO is performed on the Meta Llama 3 8B FM, using training samples synthetically generated from the CUAD document.

The results are presented in the following tables and charts. The different methods demonstrate different levels of improvement. The improvement percentage is calculated as (accuracy of new method – accuracy of standard RAG) / (accuracy of standard RAG) * 100%. For example, the DSPy-optimized RAG with Claude 3 Haiku improves on the standard RAG by (0.6656 – 0.3969) / 0.3969 * 100% ≈ 67.7%.

The RAG pipeline optimized by DSPy improved accuracy and reduced hallucination.

Method                                        Accuracy by LLM Judge (0-1)   Improvement %
Standard RAG with Claude 3 Haiku              0.3969                        -
RAG with Claude 3 Haiku optimized by DSPy     0.6656                        67.70%
Standard RAG with Claude 3 Sonnet             0.3031                        -
RAG with Claude 3 Sonnet optimized by DSPy    0.6375                        110.33%

The custom LLM trained by SFT yielded higher accuracy than the standard RAG.

Method                                Accuracy by LLM Judge (0-1)   Improvement %
Standard RAG with Claude 3 Haiku      0.3969                        -
Standard RAG with Claude 3 Sonnet     0.3031                        -
SFT-tuned Meta Llama 3 8B             0.4813                        21.26% vs. Haiku RAG / 58.79% vs. Sonnet RAG

The custom LLM aligned with human and AI feedback (DPO and ORPO) further improved model performance. The fine-tuned small model (Meta Llama 3 8B) outperformed the standard RAG pipeline with the medium-size (Anthropic’s Claude 3 Haiku) and larger (Anthropic’s Claude 3 Sonnet) generator models, and was comparable with the prompt-optimized RAG that uses ground truth data.

Method                                Accuracy by LLM Judge (0-1)   Improvement %
Standard RAG with Claude 3 Haiku      0.3969                        -
Standard RAG with Claude 3 Sonnet     0.3031                        -
DPO-tuned Meta Llama 3 8B             0.6719                        69.29% vs. Haiku RAG / 121.68% vs. Sonnet RAG
ORPO-tuned Meta Llama 3 8B            0.6812                        71.63% vs. Haiku RAG / 124.74% vs. Sonnet RAG

The following charts compare the accuracy across all tested methods.

accuracy_bench_chart

The preceding results were generated from a small dataset (32 question-answer pairs). You can use a larger sample set with more question-answer pairs to conduct the benchmarking and compare your own results.

Clean up

Make sure to clean up the following resources to avoid incurring additional costs:

  1. Delete Amazon Simple Storage Service (Amazon S3) buckets created for data storage and resource sharing.
  2. Back up the Jupyter notebooks in the SageMaker notebook instance.
  3. Shut down and delete the SageMaker notebook instance.

Cost considerations

Consider the following costs from the solution deployed on AWS:

  • You will incur charges for LLM inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
  • You will incur charges for storing files in S3 buckets. For more details, refer to Amazon S3 pricing.
  • You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion

In this post, we presented the continuous self-instruct fine-tuning framework as a compound AI system implemented with the DSPy framework. The framework first generates a synthetic dataset from the domain knowledge base and documents for self-instruction, then drives model fine-tuning through SFT, and introduces the human-in-the-loop workflow to collect human and AI feedback on the model responses, which is used to further improve model performance by aligning with human preferences through reinforcement learning (RLHF/RLAIF).

We demonstrated the framework for a question-answer task with a RAG pipeline, which improved the end-to-end response accuracy. The workflow is implemented by the DSPy framework; the overall strategy is to use the dspy.Module to connect all the components (RAG pipeline, prompt optimization, LLMs fine-tuned by SFT and RLHF/RLAIF, performance evaluation) together into a compound AI system. Each module can be seamlessly maintained, updated, and replaced without affecting other components in the system. This robust and versatile system design strengthens control and trust through modular design, and increases flexibility and adaptability to changing environments and data sources.

You can implement this continuous fine-tuning framework for LLM performance improvement for your own business use cases, with a compound AI system that provides high flexibility and interoperability. For more details, follow the examples in our GitHub repository.


About the Authors

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Jose Cassio dos Santos Junior is a Senior Data Scientist member of the MLU team. He is responsible for Curriculum Development for Advanced Modules. As a previous Senior Data Scientist on the AWS LATAM Professional Services Data Science team, he has over 20 years of experience working as a software engineer and more than 10 years of teaching experience at colleges and as an instructor for Linux certification preparation and Microsoft Innovation Center bootcamps. As a business process management expert, he participated in BPO projects for more than 7 years. He holds a Master’s degree in Computer Engineering, a Bachelor’s degree in Physics, and a Bachelor’s degree in Business Administration, specialized in IT Quantitative Methods.

Read More

Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows

Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows

Organizations need efficient ways to access and analyze their enterprise data. Amazon Q Business addresses this need as a fully managed generative AI-powered assistant that helps you find information, generate content, and complete tasks using enterprise data. It provides immediate, relevant information while streamlining tasks and accelerating problem-solving.

Amazon FSx for Windows File Server is a fully managed Windows file system that provides high-performance file storage for Windows-based applications. You can use Amazon FSx to lift and shift your on-premises Windows file server workloads to the cloud, taking advantage of the scalability, durability, and cost-effectiveness of AWS while maintaining full compatibility with your existing Windows applications and tooling.

Amazon Q Business is designed to be secure and private, seamlessly integrating with your existing identity provider (IdP). It works directly with your identities, roles, and permission sets, making sure users can’t access data they are not authorized to. Additionally, Amazon Q Business seamlessly integrates with multiple enterprise data stores, including FSx for Windows File Server, enabling you to index documents from file server systems and perform tasks such as summarization, Q&A, or data analysis of large numbers of files effortlessly.

In this post, we demonstrate how to use the Amazon Q connector for FSx for Windows File Server, explore a practical use case, and provide step-by-step instructions to help you get started and gain insights out of your data stored in FSx for Windows File Server.

Overview of the Amazon Q data source connector

A data source connector is a mechanism for integrating and synchronizing data from multiple repositories, including Microsoft SharePoint, Salesforce, Amazon Simple Storage Service (Amazon S3) buckets, and even your internal FSx for Windows File Server, into a single index. Amazon Q Business offers multiple data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. For a list of supported connectors, see Supported connectors.

Supported document types

Amazon Q boasts impressive versatility, supporting a wide range of document types stored in various places in your environment, including Windows shares (FSx for Windows File Server). Amazon Q can ingest and understand formats ranging from plaintext, PDF, HTML, XML, and JSON to Microsoft formats like Excel, Word, and PowerPoint. This provides a comprehensive search experience for your enterprise users.

Secure access with supported authentication types

Security is job zero at AWS, and Amazon Q has been built keeping that in mind. It supports a variety of authentication types, seamlessly integrating with your existing identity management systems. Whether you use single sign-on (SSO) or a custom authentication solution, Amazon Q can adapt to your specific needs.

Fine-grained control with ACLs and identity crawling

For organizations with highly sensitive data, Amazon Q offers an extra layer of security. Amazon Q Business supports crawling access control lists (ACLs) for document security by default. When you connect an Amazon FSx (Windows) data source to Amazon Q Business, it crawls ACL information attached to a document (user and group information) from the directory service of the Amazon FSx instance.

Overview of solution

The following diagram shows a high-level architecture of how AWS Managed Active Directory users, through AWS IAM Identity Center, can access and interact with an Amazon Q Business application. This enables an authenticated user to securely and privately interact with the application and gain insights from the enterprise data stored in FSx for Windows File Server, using the Amazon Q Business web experience from their web browser.

In this post, we walk you through the process of integrating Amazon Q Business with FSx for Windows File Server to extract meaningful insights from your file system using natural language processing (NLP). This solution enables you to interact with your file system data using conversational AI, making information discovery more intuitive and efficient.

To set up your Amazon Q Business application, complete the following high-level steps:

  1. Create a new Amazon Q application.
  2. Select the retriever.
  3. Add a data source (FSx for Windows File Server).
  4. Synchronize your file system data.

Lastly, we demonstrate the application functionality by testing its access for two different users.

Prerequisites

To implement this solution, you should have an AWS account with administrative privileges.

Follow the instructions in the GitHub repository’s README file to provision the infrastructure required for exploring the Amazon Q connector for FSx for Windows File Server.

Create an Amazon Q Business application

Complete the following steps to create a new Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose Create application.

  1. For Application name, enter a name (for example, anycompany-filesystem-knowledgebase).
  2. For Access management method, select AWS IAM Identity Center.

If you completed the prerequisites, then IAM Identity Center is already enabled, and you should see the instance ARN listed.

  1. Under Quick start user, for Select user, choose your users.
  2. Leave Select subscription as Q Business Pro.
  3. For Application details, use the default values.
  4. Choose Create.

In the next step, you will select the data source to retrieve and index the data.

Select the retriever

In this step, you select the retriever to connect data sources to the application. There are two options: use a native retriever or use Amazon Kendra. For this example, we use a native retriever.

  1. On the application details page, under Q Recommendations, choose Data sources.

  1. Choose Select retriever.

  1. For Retrievers, select Native.
  2. For Index provisioning, select Enterprise.
  3. For Number of units, enter 1.
  4. Choose Confirm.

Add a data source

Complete the following steps to add a data source:

  1. On the application details page, choose Add data source.
  2. Search for Amazon FSx and choose the plus sign next to Amazon FSx (Windows).

  1. In the Name and description section, enter a name (for example, anycompany-filesystem-source) and an optional description.
  2. In the Source section, for Amazon FSx file system ID, choose the file system ID you created as a prerequisite.
  3. In the Authorization section, leave as default (ACLs are enabled for the connector).

  1. In the Authentication section, for AWS Secrets Manager secret, choose the AWS Secrets Manager secret that holds the active directory credentials to communicate with Amazon FSx to crawl the file system (QBusiness-fsx-creds).
  2. In the Configure VPC and security group section, provide the following information:
    • For Virtual Private Cloud (VPC), choose the virtual private cloud (VPC) created as a prerequisite (amazon-connector-for-win-fsx-blog-vpc).
    • For Subnets, choose the private subnets that hold the FSx for Windows File System and active directory instance.
    • For VPC security groups, choose your security group (<stack-name>-DefaultSecurityGroup).

  1. In the IAM role section, provide the following information:
    1. For IAM role, choose Create a new service role.
    2. For Role name, enter a name for the role.
  2. In the Sync scope section, provide the following information:
    1. For Maximum file size, use the default option of 50 MB.
    2. Under Regex patterns, you can add inclusion and exclusion patterns. For this post, we add the inclusion pattern for PDF file types, so the Amazon Q crawler will include PDF files.

  1. In the Sync mode section, select Full sync.

Full sync is preferable for the first sync; for subsequent runs, you can choose to sync only new, modified, or deleted content.

  1. In the Sync run schedule section, for Frequency, choose Run on demand.

You also have the option to run the sync on a recurring basis like hourly or daily.

  1. In the Tags section, you can optionally add tags.

  1. In the Field mappings section, use the default field mappings selected.

The Amazon Q connector offers seven fields. Modifying field mappings and adding custom fields will be available after you create the application and retriever. For more information on the field mappings, refer to Amazon FSx (Windows) data source connector field mappings.

  1. Choose Add data source.

Synchronize your file system data

When the data source is successfully created, a banner message appears. In the banner message (or on the data source details page), choose Sync now to sync your file system data.

You can monitor the status of the sync, which includes direct links to Amazon CloudWatch logs.

The sync can take a few minutes to a few hours to complete. Sync speeds are limited by factors such as remote repository throughput and throttling, network bandwidth, and the size of documents.

When the sync is complete, you can see stats on the scan, including the number of items scanned and the number that failed.
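
If you prefer to monitor syncs programmatically rather than in the console, a rough sketch with the boto3 qbusiness client might look like the following. The identifiers are placeholders, and the exact response fields you inspect may differ, so check the current SDK documentation before relying on this.

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholder identifiers; copy the real values from your application, index, and data source.
APPLICATION_ID = "your-application-id"
INDEX_ID = "your-index-id"
DATA_SOURCE_ID = "your-data-source-id"

# List recent sync jobs for the FSx data source and print each run's details.
response = qbusiness.list_data_source_sync_jobs(
    applicationId=APPLICATION_ID,
    indexId=INDEX_ID,
    dataSourceId=DATA_SOURCE_ID,
)
for job in response.get("history", []):
    # Each entry describes one sync run, including its status and document counts.
    print(job)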

For this post, we have two Active Directory groups, ml-engineers and security-engineers. Each group has one user (John Doe and Jane Smith, respectively), and each user has access to only one whitepaper based on their group (Choosing a generative AI service and the AWS Security Incident Response Guide, respectively). The following diagram illustrates this access.

Validate the Amazon Q application functionality

Now that you have completed the setup, you can validate the application functionality by testing the access controls. We test the access of two users, John Doe and Jane Smith, who are users of the ml-engineers group and security-engineers group, respectively. You can retrieve the user name and password for each user from Secrets Manager. The secret name for John Doe is jdoe, and for Jane Smith, it’s jsmith.

  1. On the application details page, in the Web experience settings section, choose the link for the deployed URL.

  1. Sign in as John Doe.

A successful login directs you to the Amazon Q Business chat interface. This window serves as the main workspace where users interact with the application, as shown in the following screenshot.

With the test configuration, John Doe has access to only one document: generative-ai-on-aws-how-to-choose.pdf. You can test the access controls by asking questions about this whitepaper through the chat interface. This restricted access demonstrates the effective implementation of document-level permissions.

  1. For our first question, we ask What are the key factors to consider when choosing a generative AI service?

The following screenshot shows the response.

  1. Next, we ask Does Amazon Bedrock provide an option to customize the model?

The response includes citations from Amazon Q with reference to the source data.

Testing confirms that John Doe successfully receives responses to questions about content from generative-ai-on-aws-how-to-choose.pdf. You can ask additional questions about generative AI services, such as:

  • What are the generative AI service offerings from AWS?
  • What is Amazon Q optimized for?
  • What are critical factors to consider when choosing an appropriate foundational model?

Next, we test access to the security incident response guide.

  1. We ask What are the four phases of the AWS security incident response process?

When asking questions about security topics from aws-security-incident-response-guide.pdf, the system returns no results. This behavior validates that document indexing respects the configured access permissions, and users can only access content they’re authorized to view.

  1. To validate access controls for the security-engineers user group, log in as Jane Smith.

You can test with questions about security incident response:

  • What are the key objectives of an AWS security incident response plan?
  • What are the four phases of the AWS security incident response process?
  • What are the recommended steps for containing and eradicating a security incident in AWS?
  • What types of data should be collected during an AWS security incident investigation?
  • What are the key considerations for recovering from an AWS security incident?

Troubleshooting

If you encounter issues during the setup or operation of your Amazon Q Business application with FSx for Windows File Server, refer to the detailed troubleshooting guide in the README file. The guide provides solutions for common configuration challenges and operational issues you might experience.

Clean up

To avoid ongoing charges, we recommend cleaning up the resources you created while following this guide. For step-by-step cleanup instructions, refer to the README file.

Conclusion

In this post, we provided an overview of the Amazon Q connector for FSx for Windows File Server and how you can use it for safe and seamless integration of generative AI assistance with your enterprise data source. By using Amazon Q in your organization, you can enable employees to be more data-driven, efficient, prepared, and productive. Lastly, we demonstrated how simple NLP search through Amazon Q Business enhances your ability to discover insights from your enterprise data more quickly and respond to your needs faster.

The Amazon Q Business application offers a compelling solution for organizations seeking to enhance their data-driven capabilities. By using its NLP and secure data source integration features, you can unlock the true value of your data and empower your teams to be more productive and efficient in their work.

To learn more about the Amazon Q connector for FSx for Windows File Server, refer to Connecting Amazon FSx (Windows) to Amazon Q Business.


About the Authors

Manjunath Arakere is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, based in Atlanta, Georgia. He partners with AWS customers to design and scale well-architected solutions, supporting their cloud migrations and modernization initiatives. With extensive experience in the field, Manjunath specializes in migration strategies, application modernization, serverless, and Generative AI (GenAI). He is passionate about helping organizations leverage the full potential of cloud computing to drive innovation and operational efficiency. Outside of work, Manjunath enjoys outdoor runs, tennis, volleyball, and challenging his son in PlayStation soccer games.

Imtranur Rahman is an experienced Senior Solutions Architect on the WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform. Imtranur specializes in containers, Dev/SecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Read More