Improving response quality for user queries is essential for AI-driven applications, especially those focused on user satisfaction. For example, an HR chat-based assistant should strictly follow company policies and respond in a specific tone. Deviations from these expectations can be corrected using feedback from users. This post demonstrates how Amazon Bedrock, combined with a user feedback dataset and few-shot prompting, can refine responses for higher user satisfaction. Using Amazon Titan Text Embeddings v2, we show a statistically significant improvement in response quality, making this approach a valuable tool for applications that need accurate and personalized responses.
Recent studies have highlighted the value of feedback and prompting in refining AI responses. Prompt Optimization with Human Feedback proposes a systematic approach to learning from user feedback, using it to iteratively fine-tune models for improved alignment and robustness. Similarly, Black-Box Prompt Optimization: Aligning Large Language Models without Model Training demonstrates how retrieval augmented chain-of-thought prompting enhances few-shot learning by integrating relevant context, enabling better reasoning and response quality. Building on these ideas, our work uses the Amazon Titan Text Embeddings v2 model to optimize responses using available user feedback and few-shot prompting, achieving statistically significant improvements in user satisfaction. Amazon Bedrock already provides an automatic prompt optimization feature to automatically adapt and optimize prompts without additional user input. In this post, we showcase how to use open source libraries for a more customized optimization based on user feedback and few-shot prompting.
We’ve developed a practical solution using Amazon Bedrock that automatically improves chat assistant responses based on user feedback. This solution uses embeddings and few-shot prompting. To demonstrate the effectiveness of the solution, we used a publicly available user feedback dataset. However, when applying this solution inside a company, you can use feedback data provided by your own users. On our test dataset, the solution shows a 3.67% increase in user satisfaction scores. The key steps include:
- Retrieve a publicly available user feedback dataset (for this example, Unified Feedback Dataset on Hugging Face).
- Create embeddings for the queries using Amazon Titan Text Embeddings v2 to capture semantically similar examples.
- Use similar queries as examples in a few-shot prompt to generate optimized prompts.
- Compare optimized prompts against direct large language model (LLM) calls.
- Validate the improvement in response quality using a paired sample t-test.
The following diagram is an overview of the system.
The key benefits of using Amazon Bedrock are:
- Zero infrastructure management – Deploy and scale without managing complex machine learning (ML) infrastructure
- Cost-effective – Pay only for what you use with the Amazon Bedrock pay-as-you-go pricing model
- Enterprise-grade security – Use AWS built-in security and compliance features
- Straightforward integration – Integrate seamlessly with existing applications and open source tools
- Multiple model options – Access various foundation models (FMs) for different use cases
The following sections dive deeper into these steps, providing code snippets from the notebook to illustrate the process.
Prerequisites
Prerequisites for implementation include an AWS account with Amazon Bedrock access, Python 3.8 or later, and configured AWS credentials.
Data collection
We downloaded a user feedback dataset from Hugging Face, llm-blender/Unified-Feedback. The dataset contains fields such as conv_A_user (the user query) and conv_A_rating (a binary rating, where 0 means the user doesn’t like the response and 1 means the user likes it). The following code retrieves the dataset and focuses on the fields needed for embedding generation and feedback analysis. It can be run in an Amazon SageMaker notebook or a Jupyter notebook that has access to Amazon Bedrock.
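The following is a minimal sketch of this step using the Hugging Face datasets library; the "all" subset name and the column handling are assumptions, so adapt them to the subset and fields you need.

```python
# Load the Unified-Feedback dataset and keep only the fields used downstream.
import pandas as pd
from datasets import load_dataset

# The "all" subset is an assumption; choose the subset that fits your use case.
dataset = load_dataset("llm-blender/Unified-Feedback", "all", split="train")

feedback_df = pd.DataFrame(dataset)[["conv_A_user", "conv_A_rating"]].dropna()
print(feedback_df.shape)
print(feedback_df.head())
```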
Data sampling and embedding generation
To manage the process effectively, we sampled 6,000 queries from the dataset. We used Amazon Titan Text Embeddings v2 to create embeddings for these queries, transforming text into high-dimensional representations that allow for similarity comparisons. See the following code:
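This sketch assumes a boto3 Bedrock Runtime client in us-east-1, the amazon.titan-embed-text-v2:0 model ID, and one request per query; production code would add batching, retries, and error handling.

```python
import json

import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

def get_embedding(text: str) -> np.ndarray:
    """Generate an Amazon Titan Text Embeddings v2 vector for a piece of text."""
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    response_body = json.loads(response["body"].read())
    return np.array(response_body["embedding"])

# Sample 6,000 queries and embed each one (the random seed is an assumption).
sampled_df = feedback_df.sample(n=6000, random_state=42).reset_index(drop=True)
sampled_df["embedding"] = sampled_df["conv_A_user"].apply(get_embedding)
```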
Few-shot prompting with similarity search
For this part, we took the following steps:
- Sample 100 queries from the dataset for testing. Sampling 100 queries helps us run multiple trials to validate our solution.
- Compute the cosine similarity (a measure of similarity between two non-zero vectors) between the embeddings of these test queries and the stored 6,000 embeddings.
- Select the top k queries most similar to each test query to serve as few-shot examples. We set k = 10 to balance computational efficiency and diversity of the examples.
See the following code:
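The sketch below reuses get_embedding and sampled_df from the previous step; the sampling seed and holding the test queries out of the stored set are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

K = 10  # number of few-shot examples per test query

# Sample 100 test queries and embed them (ideally held out from the stored 6,000).
test_df = feedback_df.sample(n=100, random_state=0).reset_index(drop=True)
test_df["embedding"] = test_df["conv_A_user"].apply(get_embedding)

# Stack the stored embeddings into a matrix for vectorized similarity computation.
stored_matrix = np.vstack(sampled_df["embedding"].to_numpy())

def get_top_k_similar(query_embedding: np.ndarray, k: int = K) -> pd.DataFrame:
    """Return the k stored queries most similar to the given query embedding."""
    scores = cosine_similarity(query_embedding.reshape(1, -1), stored_matrix)[0]
    top_idx = np.argsort(scores)[::-1][:k]
    return sampled_df.iloc[top_idx][["conv_A_user", "conv_A_rating"]]
```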
This code provides a few-shot context for each test query, using cosine similarity to retrieve the closest matches. These example queries and feedback serve as additional context to guide the prompt optimization. The following function generates the few-shot prompt:
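The prompt wording and the function name build_few_shot_prompt below are illustrative assumptions rather than the exact notebook implementation.

```python
import pandas as pd

def build_few_shot_prompt(user_query: str, similar_examples: pd.DataFrame) -> str:
    """Assemble a few-shot prompt from similar past queries and their user ratings."""
    example_blocks = []
    for _, row in similar_examples.iterrows():
        rating_text = "liked" if row["conv_A_rating"] == 1 else "disliked"
        example_blocks.append(
            f"Past query: {row['conv_A_user']}\n"
            f"Feedback: the user {rating_text} the response."
        )
    examples = "\n\n".join(example_blocks)
    return (
        "You craft prompts that lead to responses users rate positively.\n"
        "Here are similar past queries and how users rated the responses:\n\n"
        f"{examples}\n\n"
        f"New user query: {user_query}\n"
        "Write an improved prompt for answering this query."
    )
```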
The get_optimized_prompt function performs the following tasks:
- Combine the user query and similar examples into a few-shot prompt.
- Use the few-shot prompt in an LLM call to generate an optimized prompt.
- Make sure the output follows the expected format using Pydantic.
See the following code:
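This sketch assumes the Anthropic Messages API format on Amazon Bedrock, Pydantic v2, and a Claude 3.5 Haiku model ID that may vary by Region or require an inference profile.

```python
import json

import pandas as pd
from pydantic import BaseModel

# Assumed model ID; some Regions require an inference profile ID instead.
CLAUDE_MODEL_ID = "anthropic.claude-3-5-haiku-20241022-v1:0"

class OptimizedPrompt(BaseModel):
    optimized_prompt: str

def get_optimized_prompt(user_query: str, similar_examples: pd.DataFrame) -> str:
    """Generate an optimized prompt from the few-shot prompt and validate it with Pydantic."""
    few_shot_prompt = build_few_shot_prompt(user_query, similar_examples)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            # The model is instructed to return JSON only so Pydantic can parse it.
            "content": few_shot_prompt
            + '\nReturn JSON only, in the form {"optimized_prompt": "..."}.',
        }],
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_MODEL_ID, body=body)
    raw_text = json.loads(response["body"].read())["content"][0]["text"]
    return OptimizedPrompt.model_validate_json(raw_text).optimized_prompt
```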
The make_llm_call_with_optimized_prompt function uses the optimized prompt and the user query to call the LLM (Anthropic’s Claude 3.5 Haiku) and get the final response:
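In the sketch below, the optimized prompt is supplied as the system prompt; that design choice and the parameter values are assumptions.

```python
def make_llm_call_with_optimized_prompt(optimized_prompt: str, user_query: str) -> str:
    """Answer the user query with the optimized prompt supplied as the system prompt."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": optimized_prompt,
        "messages": [{"role": "user", "content": user_query}],
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_MODEL_ID, body=body)
    return json.loads(response["body"].read())["content"][0]["text"]
```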
Comparative evaluation of optimized and unoptimized prompts
To compare the optimized prompt with the baseline (in this case, the unoptimized prompt), we defined a function that returned a result without an optimized prompt for all the queries in the evaluation dataset:
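A minimal sketch of the baseline, assuming the same model and parameters as the optimized path:

```python
def make_llm_call_without_optimization(user_query: str) -> str:
    """Baseline: send the user query directly to the model with no optimized prompt."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": user_query}],
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_MODEL_ID, body=body)
    return json.loads(response["body"].read())["content"][0]["text"]

# Generate baseline responses for every query in the evaluation dataset.
unoptimized_responses = [
    make_llm_call_without_optimization(query) for query in test_df["conv_A_user"]
]
```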
The following function generates the query response using similarity search and intermediate optimized prompt generation for all the queries in the evaluation dataset:
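The following sketch chains the earlier helpers (similarity retrieval, prompt optimization, and the final call); the function name is an assumption.

```python
def generate_optimized_responses(eval_df: pd.DataFrame) -> list:
    """Retrieve similar examples, build an optimized prompt, and answer each query."""
    responses = []
    for _, row in eval_df.iterrows():
        similar = get_top_k_similar(row["embedding"], k=K)
        optimized_prompt = get_optimized_prompt(row["conv_A_user"], similar)
        responses.append(
            make_llm_call_with_optimized_prompt(optimized_prompt, row["conv_A_user"])
        )
    return responses

optimized_responses = generate_optimized_responses(test_df)
```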
This code compares responses generated with and without few-shot optimization, setting up the data for evaluation.
LLM as judge and evaluation of responses
To quantify response quality, we used an LLM as a judge to score the optimized and unoptimized responses for alignment with the user query. We used Pydantic here to make sure the output sticks to the desired pattern of 0 (LLM predicts the response won’t be liked by the user) or 1 (LLM predicts the response will be liked by the user):
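The judge schema and prompt below are illustrative; they assume Pydantic v2 and reuse the same Claude model as the judge.

```python
from pydantic import BaseModel, conint

class SatisfactionScore(BaseModel):
    # 0: the judge predicts the user won't like the response; 1: the user will like it.
    score: conint(ge=0, le=1)

def judge_response(user_query: str, response_text: str) -> int:
    """Ask the judge LLM whether the response would satisfy the user (0 or 1)."""
    judge_prompt = (
        "You are judging whether a response satisfies the user query.\n"
        f"Query: {user_query}\n"
        f"Response: {response_text}\n"
        'Return JSON only: {"score": 1} if the user is likely to like the response, '
        'otherwise {"score": 0}.'
    )
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 50,
        "messages": [{"role": "user", "content": judge_prompt}],
    })
    response = bedrock_runtime.invoke_model(modelId=CLAUDE_MODEL_ID, body=body)
    raw_text = json.loads(response["body"].read())["content"][0]["text"]
    return SatisfactionScore.model_validate_json(raw_text).score
```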
LLM-as-a-judge is a technique in which an LLM judges the quality of a piece of text, optionally using grounding examples. We use it here to score the responses generated from the optimized and unoptimized prompts. Amazon Bedrock launched an LLM-as-a-judge capability in December 2024 that can be used for such use cases. In the following function, the LLM acts as an evaluator, scoring responses based on their alignment with the user query and expected satisfaction across the full evaluation dataset:
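A minimal evaluation loop over the full evaluation set; summing the per-query scores is assumed from the description above.

```python
def evaluate_responses(eval_df, unoptimized_responses, optimized_responses):
    """Score both response sets with the LLM judge and return total satisfaction scores."""
    unopt_scores, opt_scores = [], []
    for query, unopt, opt in zip(
        eval_df["conv_A_user"], unoptimized_responses, optimized_responses
    ):
        unopt_scores.append(judge_response(query, unopt))
        opt_scores.append(judge_response(query, opt))
    return sum(unopt_scores), sum(opt_scores)
```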
In the following example, we repeated this process for 20 trials, capturing user satisfaction scores each time. The overall score for the dataset is the sum of the individual user satisfaction scores.
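How each trial is constructed is an assumption in the sketch below: it resamples the evaluation queries, regenerates both response sets, and records a mean satisfaction score per trial.

```python
N_TRIALS = 20
unoptimized_trial_scores, optimized_trial_scores = [], []

for trial in range(N_TRIALS):
    # One trial: sample evaluation queries, generate both response sets, and judge them.
    trial_df = feedback_df.sample(n=100, random_state=trial).reset_index(drop=True)
    trial_df["embedding"] = trial_df["conv_A_user"].apply(get_embedding)

    baseline_responses = [
        make_llm_call_without_optimization(q) for q in trial_df["conv_A_user"]
    ]
    optimized_trial_responses = generate_optimized_responses(trial_df)

    unopt_total, opt_total = evaluate_responses(
        trial_df, baseline_responses, optimized_trial_responses
    )
    # Record the mean satisfaction score for each trial.
    unoptimized_trial_scores.append(unopt_total / len(trial_df))
    optimized_trial_scores.append(opt_total / len(trial_df))
```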
Result analysis
The following line chart shows the performance improvement of the optimized solution over the unoptimized one. Green areas indicate positive improvements, whereas red areas show negative changes.
After gathering the results of the 20 trials, we saw that the mean satisfaction score from the unoptimized prompt was 0.8696, whereas the mean satisfaction score from the optimized prompt was 0.9063. Therefore, our method outperforms the baseline by 3.67%.
Finally, we ran a paired sample t-test to compare satisfaction scores from the optimized and unoptimized prompts. This statistical test validated whether prompt optimization significantly improved response quality. See the following code:
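A sketch of the paired sample t-test using SciPy on the per-trial scores collected above:

```python
from scipy import stats

# Paired sample t-test on the per-trial satisfaction scores.
t_stat, p_value = stats.ttest_rel(optimized_trial_scores, unoptimized_trial_scores)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.6f}")
```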
After running the t-test, we got a p-value of 0.000762, which is less than 0.05. Therefore, the performance boost of optimized prompts over unoptimized prompts is statistically significant.
Key takeaways
We learned the following key takeaways from this solution:
- Few-shot prompting improves query response – Using highly similar few-shot examples leads to significant improvements in response quality.
- Amazon Titan Text Embeddings enables contextual similarity – The model produces embeddings that facilitate effective similarity searches.
- Statistical validation confirms effectiveness – A p-value of 0.000762 indicates that our optimized approach meaningfully enhances user satisfaction.
- Improved business impact – This approach delivers measurable business value through improved AI assistant performance. The 3.67% increase in satisfaction scores translates to tangible outcomes: HR departments can expect fewer policy misinterpretations (reducing compliance risks), and customer service teams might see a significant reduction in escalated tickets. The solution’s ability to continuously learn from feedback creates a self-improving system that increases ROI over time without requiring specialized ML expertise or infrastructure investments.
Limitations
Although the system shows promise, its performance heavily depends on the availability and volume of user feedback, especially in closed-domain applications. In scenarios where only a handful of feedback examples are available, the model might struggle to generate meaningful optimizations or fail to capture the nuances of user preferences effectively. Additionally, the current implementation assumes that user feedback is reliable and representative of broader user needs, which might not always be the case.
Next steps
Future work could focus on expanding this system to support multilingual queries and responses, enabling broader applicability across diverse user bases. Incorporating Retrieval Augmented Generation (RAG) techniques could further enhance context handling and accuracy for complex queries. Additionally, exploring ways to address the limitations in low-feedback scenarios, such as synthetic feedback generation or transfer learning, could make the approach more robust and versatile.
Conclusion
In this post, we demonstrated the effectiveness of query optimization using Amazon Bedrock, few-shot prompting, and user feedback to significantly enhance response quality. By aligning responses with user-specific preferences, this approach reduces the need for expensive model fine-tuning, making it practical for real-world applications. Its flexibility makes it suitable for chat-based assistants across various domains, such as ecommerce, customer service, and hospitality, where high-quality, user-aligned responses are essential.
To learn more, refer to the following resources:
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
- Approaches to Few-Shot Learning
- Recent Advances in LLM Feedback Integration
- Frameworks for Query Optimization
About the Authors
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Parth Patwa is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Yingwei Yu is an Applied Science Manager at the Generative AI Innovation Center at Amazon Web Services.