In 2024, more than 140,000 people participated in Google and Kaggle’s Gen AI Intensive live course. Our course is returning this year, with updated content and a new Kag…
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and… (Apple Machine Learning Research)
Mistral-Small-24B-Instruct-2501 is now available on SageMaker Jumpstart and Amazon Bedrock Marketplace
Today, we’re excited to announce that Mistral-Small-24B-Instruct-2501, a 24-billion-parameter large language model (LLM) from Mistral AI that’s optimized for low-latency text generation tasks, is available for customers through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that developers can use to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use Mistral-Small-24B-Instruct-2501.
Overview of Mistral Small 3 (2501)
Mistral Small 3 (2501), a latency-optimized 24B-parameter model released under Apache 2.0, maintains a balance between performance and computational efficiency. Mistral offers both the pretrained (Mistral-Small-24B-Base-2501) and instruction-tuned (Mistral-Small-24B-Instruct-2501) checkpoints of the model. Mistral Small 3 (2501) features a 32k-token context window. According to Mistral, the model demonstrates strong performance in code, math, general knowledge, and instruction following compared to its peers. Mistral Small 3 (2501) is designed for the 80% of generative AI tasks that require robust language and instruction-following performance with very low latency. The instruction-tuning process is focused on improving the model’s ability to follow complex directions, maintain coherent conversations, and generate accurate, context-aware responses. The 2501 version follows previous iterations (Mistral-Small-2409 and Mistral-Small-2402) released in 2024, incorporating improvements in instruction following and reliability. Currently, the instruct version of this model, Mistral-Small-24B-Instruct-2501, is available for customers to deploy and use on SageMaker JumpStart and Amazon Bedrock Marketplace.
Optimized for conversational assistance
Mistral Small 3 (2501) excels in scenarios where quick, accurate responses are critical, such as virtual assistants where users expect immediate feedback and near real-time interactions. It can also handle rapid function execution when used as part of automated or agentic workflows. According to Mistral, the architecture is designed to typically respond in less than 100 milliseconds, making it ideal for customer service automation, interactive assistance, live chat, and content moderation.
Performance metrics and benchmarks
According to Mistral, the instruction-tuned version of the model achieves over 81% accuracy on Massive Multitask Language Understanding (MMLU) while generating 150 tokens per second, making it currently the most efficient model in its category. In third-party evaluations reported by Mistral, the model demonstrates competitive performance against larger models such as Llama 3.3 70B and Qwen 32B. Notably, Mistral claims that the model performs at the same level as Llama 3.3 70B instruct and is more than three times faster on the same hardware.
SageMaker JumpStart overview
SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.
You can now discover and deploy Mistral models in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, so you can integrate model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and under your VPC controls, helping to support data security for enterprise needs.
Prerequisites
To try Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart, you need the following prerequisites:
- An AWS account that will contain all your AWS resources.
- An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see Identity and Access Management for Amazon SageMaker.
- Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
- Access to accelerated instances (GPUs) for hosting the model.
Amazon Bedrock Marketplace overview
To get started, in the AWS Management Console for Amazon Bedrock, select Model catalog in the Foundation models section of the navigation pane. Here, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).
Deploy Mistral-Small-24B-Instruct-2501 in Amazon Bedrock Marketplace
To access Mistral-Small-24B-Instruct-2501 in Amazon Bedrock, complete the following steps:
- On the Amazon Bedrock console, select Model catalog under Foundation models in the navigation pane.
At the time of writing this post, you can use the InvokeModel API to invoke the model. It doesn’t support the Converse API or other Amazon Bedrock tooling.
- Filter for Mistral as a provider and select the Mistral-Small-24B-Instruct-2501 model.
The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.
The page also includes deployment options and licensing information to help you get started with Mistral-Small-24B-Instruct-2501 in your applications.
- To begin using Mistral-Small-24B-Instruct-2501, choose Deploy.
- You will be prompted to configure the deployment details for Mistral-Small-24B-Instruct-2501. The model ID will be pre-populated.
- For Endpoint name, enter an endpoint name (up to 50 alphanumeric characters).
- For Number of instances, enter a number between 1 and 100.
- For Instance type, select your instance type. For optimal performance with Mistral-Small-24B-Instruct-2501, a GPU-based instance type such as ml.g6.12xlarge is recommended.
- Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.
- Choose Deploy to begin using the model.
When the deployment is complete, you can test Mistral-Small-24B-Instruct-2501 capabilities directly in the Amazon Bedrock playground.
- Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters such as temperature and maximum length.
When using Mistral-Small-24B-Instruct-2501 with the Amazon Bedrock InvokeModel API and the playground console, use the Mistral instruct chat template for optimal results. For example: <s>[INST] content for inference [/INST].
This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.
You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to get the endpoint Amazon Resource Name (ARN).
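For illustration, the following hedged sketch shows what such a programmatic call might look like with the AWS SDK for Python (Boto3), assuming the Marketplace deployment’s endpoint ARN is passed as the model ID. The endpoint ARN and the request body schema are assumptions; check the model detail page for the exact payload format your deployment expects.

```python
# Hedged sketch: invoking a Bedrock Marketplace deployment with the InvokeModel API.
# The endpoint ARN and request body schema are assumptions, not values from this post.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

endpoint_arn = "arn:aws:sagemaker:us-west-2:111122223333:endpoint/my-mistral-small-endpoint"  # hypothetical

body = {
    "inputs": "<s>[INST] Summarize the benefits of low-latency LLMs in two sentences. [/INST]",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}

response = bedrock_runtime.invoke_model(
    modelId=endpoint_arn,  # Marketplace deployments are addressed by their endpoint ARN
    body=json.dumps(body),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read()))
```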
Discover Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart
You can access Mistral-Small-24B-Instruct-2501 through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more information about how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.
- In the SageMaker Studio console, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
- Select HuggingFace.
- From the SageMaker JumpStart landing page, search for Mistral-Small-24B-Instruct-2501 using the search box.
- Select a model card to view details about the model such as license, data used to train, and how to use the model. Choose Deploy to deploy the model and create an endpoint.
Deploy Mistral-Small-24B-Instruct-2501 with the SageMaker SDK
Deployment starts when you choose Deploy. After deployment finishes, you will see that an endpoint is created. Test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.
- To deploy using the SDK, start by selecting the Mistral-Small-24B-Instruct-2501 model, specified by the model_id with the value mistral-small-24B-instruct-2501. You can deploy the model on SageMaker using the following code.
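A minimal sketch of that deployment with the SageMaker Python SDK might look like the following; the exact code in the JumpStart notebook may differ.

```python
# Minimal deployment sketch using the SageMaker Python SDK and SageMaker JumpStart.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="mistral-small-24B-instruct-2501")

# accept_eula=True explicitly accepts the model's end-user license agreement (EULA)
predictor = model.deploy(accept_eula=True)
```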
This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The EULA value (accept_eula) must be explicitly set to True to accept the end-user license agreement (EULA). See AWS service quotas for how to request a service quota increase.
- After the model is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
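A hedged example of such a request follows; the payload shape assumes the common inputs/parameters text generation schema used by JumpStart LLM containers and may differ for this model.

```python
# Hedged inference sketch against the deployed endpoint; the payload shape is an assumption.
payload = {
    "inputs": "<s>[INST] What is the capital of France? [/INST]",
    "parameters": {"max_new_tokens": 128, "temperature": 0.2},
}

response = predictor.predict(payload)
print(response)
```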
Retail math example
Here’s an example of how Mistral-Small-24B-Instruct-2501 can break down a common shopping scenario. In this case, you ask the model to calculate the final price of a shirt after applying multiple discounts—a situation many of us face while shopping. Notice how the model provides a clear, step-by-step solution to follow.
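A hypothetical version of that request might look like the following; the shirt price and discount values are illustrative and not taken from the original notebook.

```python
# Hypothetical retail math prompt; the price and discount values are illustrative only.
payload = {
    "inputs": (
        "<s>[INST] A shirt costs $40. It is discounted by 25%, and an additional 10% is taken "
        "off the reduced price at checkout. What is the final price? Show each step. [/INST]"
    ),
    "parameters": {"max_new_tokens": 512, "temperature": 0.2},
}

response = predictor.predict(payload)
print(response)
```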
The following is the output:
The response shows clear step-by-step reasoning without introducing incorrect information or hallucinated facts. Each mathematical step is explicitly shown, making it simple to verify the accuracy of the calculations.
Clean up
To avoid unwanted charges, complete the following steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:
- On the Amazon Bedrock console, under Foundation models in the navigation pane, select Marketplace deployments.
- In the Managed deployments section, locate the endpoint you want to delete.
- Select the endpoint, and on the Actions menu, select Delete.
- Verify the endpoint details to make sure you’re deleting the correct deployment:
- Endpoint name
- Model name
- Endpoint status
- Choose Delete to delete the endpoint.
- In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.
Delete the SageMaker JumpStart predictor
After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. For more details, see Delete Endpoints and Resources.
Conclusion
In this post, we showed you how to get started with Mistral-Small-24B-Instruct-2501 in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repo.
About the Authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.
Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services offered by AWS, including model offerings from top tier foundation model providers.
Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
NVIDIA Earth-2 Features First Gen AI to Power Weather Super-Resolution for Continental US
To better prepare communities for extreme weather, forecasters first need to see exactly where it’ll land.
That’s why weather agencies and climate scientists around the world are harnessing NVIDIA CorrDiff, a generative AI weather model that enables kilometer-scale forecasts of wind, temperature, and precipitation type and amount. It’s part of the NVIDIA Earth-2 platform for simulating weather and climate conditions.
The paper behind CorrDiff was featured today in Communications Earth and Environment, part of the Nature portfolio of scientific journals. Available as an easy-to-deploy NVIDIA NIM microservice, the model is already being used by weather technology companies, researchers and government agencies to enhance their forecasts.
With the rising frequency of extreme weather events, fast, high-resolution predictions of weather phenomena could help mitigate risks to people, communities and economies by supporting risk assessment, evacuation planning, disaster management and the development of climate-resilient infrastructure.
Weather agencies and startups across the globe are adopting CorrDiff and other Earth-2 tools to improve the resolution and precision of forecasts for extreme weather phenomena, renewable energy management and agricultural planning.
High-Fidelity Forecasts on the Horizon
CorrDiff uses generative AI to sharpen the precision of coarse-resolution weather models — resolving atmospheric data from 25-kilometer scale down to 2 kilometers using diffusion modeling, the same kind of AI model architecture that powers today’s text-to-image generation services.
In addition to boosting image resolution, CorrDiff can also predict related variables that weren’t present in the input data — such as radar reflectivity, which is used as an indicator of rain location and intensity.
CorrDiff was trained on the Weather Research and Forecasting model’s numerical simulations to generate weather patterns at 12x higher resolution.
The initial CorrDiff model, announced at NVIDIA GTC 2024 and described in the Communications Earth and Environment paper, was optimized on Taiwan weather data in collaboration with its Central Weather Administration.
NVIDIA researchers and engineers next worked to efficiently scale the model to cover a larger section of the globe. The version released as an NVIDIA NIM microservice at Supercomputing 2024 covers the continental United States — trained on U.S. weather data, with sample datasets of real-world natural disasters including hurricanes, floods, winter storms, tornados and cold waves.
The optimized CorrDiff NIM microservice for U.S. data is 500x faster and 10,000x more energy-efficient than traditional high-resolution numerical weather prediction using CPUs.
The research team behind CorrDiff continues to advance the model’s capabilities, and has released additional generative AI diffusion models showing how the model could be enhanced to more robustly resolve small-scale details in different environments — and better capture rare or extreme weather events.
CorrDiff could also help with downwash prediction — when strong winds funnel down to street level, damaging buildings and affecting pedestrians — in urban areas.
Weather Agencies Put CorrDiff on the Map
Meteorological agencies and companies around the globe are tapping CorrDiff to accelerate predictions with applications in regional forecasting, renewable energy and disaster management.
Taiwan’s National Science and Technology Center for Disaster Reduction, for instance, has deployed CorrDiff to support disaster alerts in the region, enabling an estimated gigawatt-hour of energy savings due to the energy efficiency of CorrDiff running on the NVIDIA AI platform. CorrDiff predictions are embedded in the center’s disaster monitoring site, helping Taiwan forecasters better prepare for typhoons.
Discover Earth-2 at NVIDIA GTC
Learn more about AI applications using Earth-2 at NVIDIA GTC, the global AI conference taking place March 17-21 in San Jose, California. Relevant sessions include:
- Applying AI Weather Models With NVIDIA Earth-2: This training lab will show participants how to run global AI weather forecasting models.
- Earth to AI: This panel brings together industry leaders to explore how AI and climate science are transforming business strategies for a sustainable future.
- Enhancing Photovoltaic Power Predictions With High-Resolution Weather Forecasting from NVIDIA Earth-2: This session covers a project between NVIDIA, Peking University and power company GCL to use Earth-2 models to predict solar power generation output.
- Global Atmospheric Downscaling by Improving CorrDiff Process: In this poster, South Korean startup NoteSquare describes a project that modified and applied CorrDiff to regional weather data from the Korea Meteorological Administration.
- Transform Natural Catastrophe Risk Simulations With Advanced Computational Tools: Presenters from NVIDIA, Amazon Web Services and multinational insurance corporation AXA will share how AXA uses Earth-2 to simulate extreme weather.
How Rocket Companies modernized their data science solution on AWS
This post was written with Dian Xu and Joel Hawkins of Rocket Companies.
Rocket Companies is a Detroit-based FinTech company with a mission to “Help Everyone Home”. With the current housing shortage and affordability concerns, Rocket simplifies the homeownership process through an intuitive and AI-driven experience. This comprehensive framework streamlines every step of the homeownership journey, empowering consumers to search, purchase, and manage home financing effortlessly. Rocket integrates home search, financing, and servicing in a single environment, providing a seamless and efficient experience.
The Rocket brand is synonymous with simple, fast, and trustworthy digital solutions for complex transactions. Rocket is dedicated to helping clients realize their dream of homeownership and financial freedom. Since its inception, Rocket has grown from a single mortgage lender to a network of businesses that creates new opportunities for its clients.
Rocket takes a complicated process and uses technology to make it simpler. Applying for a mortgage can be complex and time-consuming. That’s why we use advanced technology and data analytics to streamline every step of the homeownership experience, from application to closing. By analyzing a wide range of data points, we’re able to quickly and accurately assess the risk associated with a loan, enabling us to make more informed lending decisions and get our clients the financing they need.
Our goal at Rocket is to provide a personalized experience for both our current and prospective clients. Rocket’s diverse product offerings can be customized to meet specific client needs, while our team of skilled bankers must be matched with the client opportunities that best align with their skills and knowledge. Maintaining strong relationships with our large, loyal client base and hedge positions to cover financial obligations is key to our success. With the volume of business we do, even small improvements can have a significant impact.
In this post, we share how we modernized Rocket’s data science solution on AWS to increase the speed to delivery from eight weeks to under one hour, improve operational stability and support by reducing incident tickets by over 99% in 18 months, power 10 million automated data science and AI decisions made daily, and provide a seamless data science development experience.
Rocket’s legacy data science environment challenges
Rocket’s previous data science solution was built around Apache Spark and combined the use of a legacy version of the Hadoop environment and vendor-provided Data Science Experience development tools. The Hadoop environment was hosted on Amazon Elastic Compute Cloud (Amazon EC2) servers, managed in-house by Rocket’s technology team, while the data science experience infrastructure was hosted on premises. Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink.
Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. Apache Hive was used to provide a tabular interface to data stored in HDFS, and to integrate with Apache Spark SQL. Apache HBase was employed to offer real-time key-based access to data. Model training and scoring was performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which was part of the Hadoop implementation.
Despite the benefits of this architecture, Rocket faced challenges that limited its effectiveness:
- Accessibility limitations: The data lake was stored in HDFS and only accessible from the Hadoop environment, hindering integration with other data sources. This also led to a backlog of data that needed to be ingested.
- Steep learning curve for data scientists: Many of Rocket’s data scientists did not have experience with Spark, which had a more nuanced programming model compared to other popular ML solutions like scikit-learn. This created a challenge for data scientists to become productive.
- Responsibility for maintenance and troubleshooting: Rocket’s DevOps/Technology team was responsible for all upgrades, scaling, and troubleshooting of the Hadoop cluster, which was installed on bare EC2 instances. This resulted in a backlog of issues with both vendors that remained unresolved.
- Balancing development vs. production demands: Rocket had to manage work queues between development and production, which were always competing for the same resources.
- Deployment challenges: Rocket sought to support more real-time and streaming inferencing use cases, but this was limited by the capabilities of MLeap for real-time models and Spark Streaming for streaming use cases, which were still experimental at that time.
- Inadequate data security and DevOps support: The previous solution lacked robust security measures, and there was limited support for development and operations of the data science work.
Rocket’s legacy data science architecture is shown in the following diagram.
The diagram depicts the flow; the key components are detailed below:
- Data Ingestion: Data is ingested into the system using Attunity data ingestion in Spark SQL.
- Data Storage and Processing: All compute is done as Spark jobs inside of a Hadoop cluster using Apache Livy and Spark. Data is stored in HDFS and is accessed via Hive, which provides a tabular interface to the data and integrates with Spark SQL. HBase is employed to offer real-time key-based access to data.
- Model Development: Data exploration and model development are conducted using tools such as Jupyter or Apache Zeppelin notebooks, which communicate with the Spark server over a Kerberized Livy connection.
- Model Training and Scoring: Model training and scoring is performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which is part of the Hadoop implementation.
Rocket’s migration journey
At Rocket, we believe in the power of continuous improvement and constantly seek out new opportunities. One such opportunity is using data science solutions, but to do so, we must have a strong and flexible data science environment.
To address the legacy data science environment challenges, Rocket decided to migrate its ML workloads to the Amazon SageMaker AI suite. This would allow us to deliver more personalized experiences and understand our customers better. To promote the success of this migration, we collaborated with the AWS team to create automated and intelligent digital experiences that demonstrated Rocket’s understanding of its clients and kept them connected.
We implemented an AWS multi-account strategy, standing up Amazon SageMaker Studio in a build account using a network-isolated Amazon VPC. This allows us to separate development and production environments, while also improving our security stance.
We moved our new work to SageMaker Studio and our legacy Hadoop workloads to Amazon EMR, connecting to the old Hadoop cluster using Livy and SageMaker notebooks to ease the transition. This gives us access to a wider range of tools and technologies, enabling us to choose the most appropriate ones for each problem we’re trying to solve.
In addition, we moved our data from HDFS to Amazon Simple Storage Service (Amazon S3), and now use Amazon Athena and AWS Lake Formation to provide proper access controls to production data. This makes it easier to access and analyze the data, and to integrate it with other systems. The team also provides secure interactive integration through Amazon Elastic Kubernetes Service (Amazon EKS), further improving the company’s security stance.
SageMaker AI has been instrumental in empowering our data science community with the flexibility to choose the most appropriate tools and technologies for each problem, resulting in faster development cycles and higher model accuracy. With SageMaker Studio, our data scientists can seamlessly develop, train, and deploy models without the need for additional infrastructure management.
As a result of this modernization effort, SageMaker AI enabled Rocket to scale our data science solution across Rocket Companies and integrate using a hub-and-spoke model. The ability of SageMaker AI to automatically provision and manage instances has allowed us to focus on our data science work rather than infrastructure management, increasing the number of models in production by five times and data scientists’ productivity by 80%.
Our data scientists are empowered to use the most appropriate technology for the problem at hand, and our security stance has improved. Rocket can now compartmentalize data and compute, as well as compartmentalize development and production. Additionally, we are able to provide model tracking and lineage using Amazon SageMaker Experiments and artifacts discoverable using the SageMaker model registry and Amazon SageMaker Feature Store. All the data science work has now been migrated onto SageMaker, and all the old Hadoop work has been migrated to Amazon EMR.
Overall, SageMaker AI has played a critical role in enabling Rocket’s modernization journey by building a more scalable and flexible ML framework, reducing operational burden, improving model accuracy, and accelerating deployment times.
The successful modernization allowed Rocket to overcome our previous limitations and better support our data science efforts. We were able to improve our security stance, make work more traceable and discoverable, and give our data scientists the flexibility to choose the most appropriate tools and technologies for each problem. This has helped us better serve our customers and drive business growth.
Rocket’s new data science solution architecture on AWS is shown in the following diagram.
The solution consists of the following components:
- Data ingestion: Data is ingested into the data account from on-premises and external sources.
- Data refinement: Raw data is refined into consumable layers (raw, processed, conformed, and analytical) using a combination of AWS Glue extract, transform, and load (ETL) jobs and EMR jobs.
- Data access: Refined data is registered in the data account’s AWS Glue Data Catalog and exposed to other accounts via Lake Formation. Analytic data is stored in Amazon Redshift. Lake Formation makes this data available to both the build and compute accounts. For the build account, access to production data is restricted to read-only.
- Development: Data science development is done using SageMaker Studio. Data engineering development is done using AWS Glue Studio. Both disciplines have access to Amazon EMR for Spark development. Data scientists have access to the entire SageMaker ecosystem in the build account.
- Deployment: SageMaker trained models developed in the build account are registered with an MLFlow instance. Code artifacts for both data science activities and data engineering activities are stored in Git. Deployment initiation is controlled as part of CI/CD.
- Workflows: We have a number of workflow triggers. For online scoring, we typically provide an external-facing endpoint using Amazon EKS with Istio. We have numerous jobs that are launched by AWS Lambda functions that in turn are triggered by timers or events. Processes that run may include AWS Glue ETL jobs, EMR jobs for additional data transformations or model training and scoring activities, or SageMaker pipelines and jobs performing training or scoring activities.
Migration impact
We’ve come a long way in modernizing our infrastructure and workloads. We started our journey supporting six business channels and 26 models in production, with dozens in development. Deployment times stretched for months and required a team of three system engineers and four ML engineers to keep everything running smoothly. Despite the support of our internal DevOps team, our issue backlog with the vendor was an unenviable 200+.
Today, we are supporting nine organizations and over 20 business channels, with a whopping 210+ models in production and many more in development. Our average deployment time has gone from months to just weeks—sometimes even down to mere days! With just one part-time ML engineer for support, our average issue backlog with the vendor is practically non-existent. We now support over 120 data scientists, ML engineers, and analytical roles. Our framework mix has expanded to include 50% SparkML models and a diverse range of other ML frameworks, such as PyTorch and scikit-learn. These advancements have given our data science community the power and flexibility to tackle even more complex and challenging projects with ease.
The following table compares some of our metrics before and after migration.
| Metric | Before Migration | After Migration |
| --- | --- | --- |
| Speed to Delivery | New data ingestion project took 4–8 weeks | Data-driven ingestion takes under one hour |
| Operation Stability and Supportability | Over a hundred incidents and tickets in 18 months | Fewer incidents: one per 18 months |
| Data Science | Data scientists spent 80% of their time waiting on their jobs to run | Seamless data science development experience |
| Scalability | Unable to scale | Powers 10 million automated data science and AI decisions made daily |
Lessons learned
Throughout the journey of modernizing our data science solution, we’ve learned valuable lessons that we believe could be of great help to other organizations who are planning to undertake similar endeavors.
First, we’ve come to realize that managed services can be a game changer in optimizing your data science operations.
The isolation of development into its own account while providing read-only access to production data is a highly effective way of enabling data scientists to experiment and iterate on their models without putting your production environment at risk. This is something that we’ve achieved through the combination of SageMaker AI and Lake Formation.
Another lesson we learned is the importance of training and onboarding for teams. This is particularly true for teams that are moving to a new environment like SageMaker AI. It’s crucial to understand the best practices of utilizing the resources and features of SageMaker AI, and to have a solid understanding of how to move from notebooks to jobs.
Lastly, we found that although Amazon EMR still requires some tuning and optimization, the administrative burden is much lighter compared to hosting directly on Amazon EC2. This makes Amazon EMR a more scalable and cost-effective solution for organizations who need to manage large data processing workloads.
Conclusion
This post provided an overview of the successful partnership between AWS and Rocket Companies. Through this collaboration, Rocket Companies was able to migrate many ML workloads and implement a scalable ML framework. Continuing to build with AWS, Rocket Companies remains committed to innovation and staying at the forefront of customer satisfaction.
Don’t let legacy systems hold back your organization’s potential. Discover how AWS can assist you in modernizing your data science solution and achieving remarkable results, similar to those achieved by Rocket Companies.
About the Authors
Dian Xu is the Senior Director of Engineering in Data at Rocket Companies, where she leads transformative initiatives to modernize enterprise data platforms and foster a collaborative, data-first culture. Under her leadership, Rocket’s data science, AI & ML platforms power billions of automated decisions annually, driving innovation and industry disruption. A passionate advocate for Gen AI and cloud technologies, Xu is also a sought-after speaker at global forums, inspiring the next generation of data professionals. Outside of work, she channels her love of rhythm into dancing, embracing styles from Bollywood to Bachata as a celebration of cultural diversity.
Joel Hawkins is a Principal Data Scientist at Rocket Companies, where he is responsible for the data science and MLOps platform. Joel has decades of experience developing sophisticated tooling and working with data at large scales. A driven innovator, he works hand in hand with data science teams to ensure that we have the latest technologies available to provide cutting edge solutions. In his spare time, he is an avid cyclist and has been known to dabble in vintage sports car restoration.
Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services. He partners with North American FinTech companies like Rocket and other financial services organizations to drive cloud and AI strategy, accelerating AI adoption at scale. With deep expertise in AI & ML, Generative AI, and cloud-native architecture, he helps financial institutions unlock new revenue streams, optimize operations, and drive impactful business transformation. Sajjan collaborates closely with Rocket Companies to advance its mission of building an AI-fueled homeownership platform to Help Everyone Home. Outside of work, he enjoys traveling, spending time with his family, and is a proud father to his daughter.
Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.
AWS and DXC collaborate to deliver customizable, near real-time voice-to-voice translation capabilities for Amazon Connect
Providing effective multilingual customer support in global businesses presents significant operational challenges. Through collaboration between AWS and DXC Technology, we’ve developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multi-lingual customer interactions.
In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.
Challenge: Serving customers in multiple languages
In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for the lower volume languages. Previously, DXC had explored several existing alternatives but found limitations in each approach – from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solution Architects collaborated to:
- Define essential requirements for real-time translation
- Establish latency and accuracy benchmarks
- Create seamless integration paths with existing systems
- Develop a phased implementation strategy
- Prepare and test an initial proof of concept setup
Business impact
For DXC, this prototype was used as an enabler, allowing technical talent maximization, operational transformation, and cost improvements through:
- Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
- Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
- Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through a pay-per-use translation model
- Similar experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in the customer’s preferred language
Solution overview
The Amazon Connect V2V translation prototype uses AWS advanced speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:
- Speech recognition – The customer’s spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
- Machine translation – Amazon Translate, the machine translation engine, translates the customer’s transcript into the agent’s preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine.
- Bidirectional translation – The process is reversed for the agent’s response, translating their speech into the customer’s language and delivering the translated audio to the customer.
- Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.
The prototype can be extended with other AWS AI services to further customize the translation capabilities. It’s open source and ready for customization to meet your specific needs.
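As a rough illustration of the translate-and-synthesize hop in that pipeline, the following minimal sketch assumes Amazon Transcribe has already produced a transcript string for the customer’s utterance; the language codes and Polly voice are assumptions, and the actual prototype handles streaming transcription and browser audio playback.

```python
# Minimal sketch of the translate -> text-to-speech step, assuming a transcript string
# has already been produced by Amazon Transcribe. Language codes and the Polly voice
# are illustrative assumptions, not values from the prototype.
import boto3

translate = boto3.client("translate")
polly = boto3.client("polly")

transcript = "Hola, tengo un problema con mi factura."  # assumed Transcribe output

translation = translate.translate_text(
    Text=transcript,
    SourceLanguageCode="es",   # customer's language (assumed)
    TargetLanguageCode="en",   # agent's language (assumed)
)

speech = polly.synthesize_speech(
    Text=translation["TranslatedText"],
    OutputFormat="mp3",
    VoiceId="Joanna",          # assumed English voice
)

with open("agent_playback.mp3", "wb") as audio_file:
    audio_file.write(speech["AudioStream"].read())
```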
The following diagram illustrates the solution architecture.
The following screenshot illustrates a sample agent web application.
The user interface consists of three sections:
- Contact Control Panel – A softphone client using Amazon Connect
- Customer Controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
- Agent controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice
Challenges when implementing near real-time voice translation
The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream is started. However, even with the shortest audio processing time, the user experience still doesn’t match the experience of a real conversation when both are speaking the same language. This is due to the specific pattern of the customer only hearing the agent’s translated speech, and the agent only hearing the customer’s translated speech. The following diagram displays that pattern.
The example workflow consists of the following steps:
- The customer starts speaking in their own language, and speaks for 10 seconds.
- Because the agent only hears the customer’s translated speech, the agent first hears 10 seconds of silence.
- When the customer finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
- The customer’s translated speech is streamed to the agent. During that time, the customer hears silence.
- When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
- Because the customer only hears the agent’s translated speech, the customer hears 10 seconds of silence.
- When the agent finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
- The agent’s translated speech is streamed to the customer. During that time, the agent hears silence.
In this scenario, the customer hears a single block of 22–24 seconds of complete silence, from the moment they finish speaking until they hear the agent’s translated voice. This creates a suboptimal experience, because the customer might not be certain what is happening during these 22–24 seconds, for instance, whether the agent was able to hear them or whether there was a technical issue.
Audio streaming add-ons
In a face-to-face conversation between two people who don’t speak the same language, they might have another person act as a translator or interpreter. An example workflow consists of the following steps:
- Person A speaks in their own language, which is heard by Person B and the translator.
- The translator translates what Person A said to Person B’s language. The translation is heard by Person B and Person A.
Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There’s no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).
To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.
The workflow consists of the following steps:
- The customer starts speaking in their own language, and speaks for 10 seconds.
- The agent hears the customer’s original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
- When the customer finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
- The customer’s translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
- When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
- The customer hears the agent’s original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
- When the agent finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
- The agent’s translated speech is then streamed to the customer. During that time, the agent hears their translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).
In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.
The audio streaming add-ons provide additional benefits, including:
- Voice characteristics – In cases when the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can’t hear if the customer was talking slow or fast, if the customer was upset or calm, and so on. The translated and synthesized speech doesn’t carry over that information.
- Quality assurance – In cases when call recording is enabled, only the customer’s original voice and the agent’s synthesized speech are recorded, because the translation and the synthesis are done on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent’s original voice, the customer’s original voice, and their respective translated and synthesized speech, all in a single audio file.
- Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that would improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure that your brand names, character names, model names, and other unique content are transcribed and translated to the desired result.
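To make that last point concrete, the following hedged sketch passes a custom terminology to Amazon Translate; the terminology name is hypothetical and would be created beforehand, and a Transcribe custom vocabulary would similarly be referenced by name when starting transcription.

```python
# Hedged sketch: applying an Amazon Translate custom terminology so brand and product
# names translate to the desired result. "contact-center-terms" is a hypothetical
# terminology created ahead of time.
import boto3

translate = boto3.client("translate")

result = translate.translate_text(
    Text="My AnyCompany RoadRunner router keeps rebooting.",
    SourceLanguageCode="en",
    TargetLanguageCode="es",
    TerminologyNames=["contact-center-terms"],  # hypothetical custom terminology
)
print(result["TranslatedText"])
```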
Get started with Amazon Connect V2V
Ready to transform your contact center’s communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multi-lingual communication solutions in your own contact center, through the following key steps:
- Clone the GitHub repository.
- Test different configurations for audio streaming add-ons.
- Review the sample project’s limitations in the README.
- Develop your implementation strategy:
- Implement robust security and compliance controls that meet your organization’s standards.
- Collaborate with your customer experience team to define your specific use case requirements.
- Balance between automation and the agent’s manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons).
- Use your preferred transcription, translation, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
- Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.
Conclusion
The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!
About the Authors
Milos Cosic is a Principal Solutions Architect at AWS.
EJ Ferrell is a Senior Solutions Architect at AWS.
Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.
Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock
Generative AI is revolutionizing enterprise automation, enabling AI systems to understand context, make decisions, and act independently. Generative AI foundation models (FMs), with their ability to understand context and make decisions, are becoming powerful partners in solving sophisticated business problems. At AWS, we’re using the power of models in Amazon Bedrock to drive automation of complex processes that have traditionally been challenging to streamline.
In this post, we focus on one such complex workflow: document processing. This serves as an example of how generative AI can streamline operations that involve diverse data types and formats.
Challenges with document processing
Document processing often involves handling three main categories of documents:
- Structured – For example, forms with fixed fields
- Semi-structured – Documents that have a predictable set of information but might vary in layout or presentation
- Unstructured – For example, paragraphs of text or notes
Traditionally, processing these varied document types has been a pain point for many organizations. Rule-based systems or specialized machine learning (ML) models often struggle with the variability of real-world documents, especially when dealing with semi-structured and unstructured data.
We demonstrate how generative AI, along with external tool use, offers a more flexible and adaptable solution to this challenge. Through a practical use case of processing a patient health package at a doctor’s office, you will see how this technology can extract and synthesize information from all three document types, potentially improving data accuracy and operational efficiency.
Solution overview
This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools and APIs to perform complex document analysis tasks.
The solution employs a strategic multi-model approach, optimizing for both performance and cost by selecting the most appropriate model for each task:
- Anthropic’s Claude 3 Haiku – Serves as the workflow orchestrator due to its low latency and cost-effectiveness. This model’s strong reasoning and tool use abilities make it ideal for the following:
  - Coordinating the overall document processing pipeline
  - Making routing decisions for different document types
  - Invoking appropriate processing functions
  - Managing the workflow state
- Anthropic’s Claude 3.5 Sonnet (v2) – Used for its advanced reasoning capabilities, notably strong visual processing abilities, particularly excelling at interpreting charts and graphs. Its key strengths include:
  - Interpreting complex document layouts and structure
  - Extracting text from tables and forms
  - Processing medical charts and handwritten notes
  - Converting unstructured visual information into structured data
Through the Amazon Bedrock Converse API’s standardized tool use (function calling) interface, these models can work together seamlessly to invoke document processing functions, call external APIs for data validation, trigger storage operations, and execute content transformation tasks. The API serves as the foundation for this intelligent workflow, providing a unified interface for model communication while maintaining conversation state throughout the processing pipeline. The API’s standardized approach to tool definition and function calling provides consistent interaction patterns across different processing stages. For more details on how tool use works, refer to The complete tool use workflow.
The solution incorporates Amazon Bedrock Guardrails to implement robust content filtering policies and sensitive information detection, making sure that personal health information (PHI) and personally identifiable information (PII) data is appropriately protected through automated detection and masking capabilities while maintaining industry standard compliance throughout the document processing workflow.
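As a sketch of how a guardrail can be attached to model calls once it has been created, the Converse API accepts a guardrailConfig that references the guardrail’s identifier and version; the identifiers and prompt below are placeholders, not values from this solution.

```python
# Hedged sketch: attaching an Amazon Bedrock guardrail to a Converse API call.
# The guardrail identifier/version and the prompt are placeholders.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize this patient intake form ..."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",  # placeholder
        "guardrailVersion": "1",                     # placeholder
    },
)
print(response["output"]["message"]["content"][0]["text"])
```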
Prerequisites
You need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.
- An AWS account with an AWS Identity and Access Management (IAM) role that has permissions to Amazon Bedrock and Amazon SageMaker Studio.
- Access to Anthropic’s Claude 3.5 Sonnet (v2) and Claude 3 Haiku models in Amazon Bedrock. For instructions, see Access Amazon Bedrock foundation models and CreateInferenceProfile.
- Access to create an Amazon Bedrock guardrail. For more information, see Create a guardrail.
Use case and dataset
For our example use case, we examine a patient intake process at a healthcare institution. The workflow processes a patient health information package containing three distinct document types:
- Structured document – A new patient intake form with standardized fields for personal information, medical history, and current symptoms. This form follows a consistent layout with clearly defined fields and check boxes, making it an ideal example of a structured document.
- Semi-structured document – A health insurance card that contains essential coverage information. Although insurance cards generally contain similar information (policy number, group ID, coverage dates), they come from different providers with varying layouts and formats, showing the semi-structured nature of these documents.
- Unstructured document – A handwritten doctor’s note from an initial consultation, containing free-form observations, preliminary diagnoses, and treatment recommendations. This represents the most challenging category of unstructured documents, where information isn’t confined to any predetermined format or structure.
The example document can be downloaded from the following GitHub repo.
This healthcare use case is particularly relevant because it encompasses common challenges in document processing: the need for high accuracy, compliance with healthcare data privacy requirements, and the ability to handle multiple document formats within a single workflow. The variety of documents in this patient package demonstrates how a modern intelligent document processing solution must be flexible enough to handle different levels of document structure while maintaining consistency and accuracy in data extraction.
The following diagram illustrates the solution workflow.
This self-orchestrated workflow demonstrates how modern generative AI solutions can balance capability, performance, and cost-effectiveness in transforming traditional document processing workflows in healthcare settings.
Deploy the solution
- Create an Amazon SageMaker domain. For instructions, see Use quick setup for Amazon SageMaker AI.
- Launch SageMaker Studio, then create and launch a JupyterLab space. For instructions, see Create a space.
- Create a guardrail. Focus on adding sensitive information filters that would mask PII or PHI.
- Clone the code from the GitHub repository:
git clone https://github.com/aws-samples/anthropic-on-aws.git
- Change the directory to the root of the cloned repository:
cd medical-idp
- Install dependencies:
pip install -r requirements.txt
- Update setup.sh with the guardrail ID you created in Step 3, then set the environment variables:
source setup.sh
- Finally, start the Streamlit application:
streamlit run streamlit_app.py
Now you’re ready to explore the intelligent document processing workflow using Amazon Bedrock.
Technical implementation
The solution is built around the Amazon Bedrock Converse API and tool use framework, with Anthropic’s Claude 3 Haiku serving as the primary orchestrator. When a document is uploaded through the Streamlit interface, Haiku analyzes the request and determines the sequence of tools needed by consulting the tool definitions in ToolConfig. These definitions include tools for the following:
- Document processing pipeline – Handles initial PDF processing and classification
- Document notes processing – Extracts information from medical notes
- New patient information processing – Processes patient intake forms
- Insurance form processing – Handles insurance card information
The following code is an example tool definition for extracting consultation notes. Here, extract_consultation_notes represents the name of the function that the orchestration workflow will call, and document_paths defines the schema of the input parameter that will be passed to the function. The FM contextually extracts the information from the document and passes it to the method. A similar toolspec is defined for each step; refer to the GitHub repo for the full toolspec definition.
{
"toolSpec": {
"name": "extract_consultation_notes",
"description": "Extract diagnostics information from a doctor's consultation notes. Along with the extraction include the full transcript in a <transcript> node",
"inputSchema": {
"json": {
"type": "object",
"properties": {
"document_paths": {
"type": "array",
"items": {"type": "string"},
"description": "Paths to the files that were classified as DOC_NOTES"
}
},
"required": ["document_paths"]
}
}
}
}
When a PDF document is uploaded through the Streamlit interface, it is temporarily stored and passed to the FileProcessor class along with the tool specification and a user prompt:
prompt = ("1. Extract 2. save and 3. summarize the information from the patient information package located at " + tmp_file + ". " +
"The package might contain various types of documents including insurance cards. Extract and save information from all documents provided. "
"Perform any preprocessing or classification of the file provided prior to the extraction." +
"Set the enable_guardrails parameter to " + str(enable_guardrails) + ". " +
"At the end, list all the tools that you had access to. Give an explantion on why each tool was used and if you are not using a tool, explain why it was not used as well" +
"Think step by step.")
processor.process_file(prompt=prompt,
toolspecs=toolspecs,
...
The BedrockUtils class manages the conversation with Anthropic’s Claude 3 Haiku through the Amazon Bedrock Converse API. It maintains the conversation state and handles the tool use workflow:
# From bedrockutility.py
def invoke_bedrock(self, message_list, system_message=[], tool_list=[],
temperature=0, maxTokens=2048, guardrail_config=None):
response = self.bedrock.converse(
modelId=self.model_id,
messages=message_list,
system=system_message,
inferenceConfig={
"maxTokens": maxTokens,
"temperature": temperature
},
**({"toolConfig": {"tools": tool_list}} if tool_list else {})
)
When the processor receives a document, it initiates a conversation loop with Anthropic’s Claude 3 Haiku, which analyzes the document and determines which tools to use based on the content. The model acts as an intelligent orchestrator, making decisions about the following:
- Which document processing tools to invoke
- The sequence of processing steps
- How to handle different document types within the same package
- When to summarize and complete the processing
This orchestration is managed through a continuous conversation loop that processes tool requests and their results until the entire document package has been processed.
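The repository implements this loop in the FileProcessor class; the following is a minimal sketch of the pattern only, assuming a hypothetical run_tool helper that dispatches to the processing functions described earlier:
# Minimal sketch of a Converse API tool use loop (run_tool is a hypothetical dispatcher)
import boto3

bedrock = boto3.client("bedrock-runtime")

def conversation_loop(model_id, messages, tool_config, run_tool):
    while True:
        response = bedrock.converse(
            modelId=model_id,
            messages=messages,
            toolConfig=tool_config
        )
        output_message = response["output"]["message"]
        messages.append(output_message)
        # Stop when the model no longer requests a tool
        if response["stopReason"] != "tool_use":
            return output_message
        # Run every requested tool and return the results to the model
        tool_results = []
        for block in output_message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = run_tool(tool_use["name"], tool_use["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": result}]
                    }
                })
        messages.append({"role": "user", "content": tool_results})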
The first key decision in the workflow is initiating the document classification process. Through the DocumentClassifier class, the solution uses Anthropic’s Claude 3.5 Sonnet to analyze and categorize each page of the uploaded document into three main types: intake forms, insurance cards, and doctor’s notes:
# from document_classifier.py
class DocumentClassifier:
def __init__(self, file_handler):
self.sonnet_3_5_bedrock_utils = BedrockUtils(
model_id=ModelIDs.anthropic_claude_3_5_sonnet
)
def categorize_document(self, file_paths):
# Convert documents to binary format for model processing
binary_data_array = []
for file_path in file_paths:
binary_data, media_type = self.file_handler.get_binary_for_file(file_path)
binary_data_array.append((binary_data[0], media_type))
# Prepare message for classification
message_content = [
{"image": {"format": media_type, "source": {"bytes": data}}}
for data, media_type in binary_data_array
]
# Create classification request
message_list = [{
"role": 'user',
"content": [
*message_content,
{"text": "What types of document is in this image?"}
]
}]
# Define system message for classification
system_message = [{
"text": '''You are a medical document processing agent.
Categorize images as: INTAKE_FORM, INSURANCE_CARD, or DOC_NOTES'''
}]
# Get classification from model
response = self.sonnet_3_5_bedrock_utils.invoke_bedrock(
message_list=message_list,
system_message=system_message
)
return [response['output']['message']]
Based on the classification results, the FM determines the next tool to be invoked. The tool’s description and input schema define exactly what information needs to be extracted. Following the previous example, let’s assume the next page to be processed is a consultation note. The workflow will invoke the extract_consultation_notes function. This function processes documents to extract detailed medical information. Like the classification process discussed earlier, it first converts the documents to binary format suitable for model processing. The key to accurate extraction lies in how the images and system message are combined:
def extract_info(self, file_paths):
# Convert documents to binary data
# This follows the same pattern as in the classification function
message_content = [
{"image": {"format": media_type, "source": {"bytes": data}}}
for data, media_type in binary_data_array
]
message_list = [{
"role": 'user',
"content": [
*message_content, # Include the processed document images
{"text": '''Extract all information from this file
If you find a visualization
- Provide a detailed description in natural language
- Use domain specific language for the description
'''}
]
}]
system_message = [{
"text": '''You are a medical consultation agent with expertise in diagnosing and treating various health conditions.
You have a deep understanding of human anatomy, physiology, and medical knowledge across different specialties.
During the consultation, you review the patient's medical records, test results, and documentation provided.
You analyze this information objectively and make associations between the data and potential diagnoses.
Associate a confidence score with each piece of extracted information. This should reflect how confident the model is that the extracted value matches the requested entity.
'''}
]
response = self.bedrock_utils.invoke_bedrock(
message_list=message_list,
system_message=system_message
)
return [response['output']['message']]
The system message serves three crucial purposes:
- Establish medical domain expertise for accurate interpretation.
- Provide guidelines for handling different types of information (text and visualizations).
- Provide a self-scored confidence. Although this is not an independent grading mechanism, the score is directionally indicative of how confident the model is in its own extraction.
Following the same pattern, the FM uses the other tools in the toolspec definition to save and summarize the results.
A unique advantage of using a multimodal FM for the extraction task is its deep understanding of the text it extracts. For example, the following code is an excerpt of the data schema we request as input to the save_consultation_notes function. Refer to the code in constants.py for the full definition. The model needs to not only extract a transcript, but also understand it well enough to produce such structured data from an unstructured document. This significantly reduces the postprocessing effort required for the data to be consumed by a downstream application.
"consultation": {
"type": "object",
"properties": {
"date": {"type": "string"},
"concern": {
"type": "object",
"properties": {
"primaryComplaint": {
"type": "string",
"description": "Primary medical complaint of the patient. Only capture the medical condition. no timelines"
},
"duration": {"type": "number"},
"durationUnit": {"type": "string", "enum": ["days", "weeks", "months", "years"]},
"associatedSymptoms": {
"type": "object",
"additionalProperties": {
"type": "boolean"
},
"description": "Key-value pairs of symptoms and their presence (true) or absence (false)"
},
"absentSymptoms": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["primaryComplaint", "duration", "durationUnit"]
}
The documents contain a treasure trove of personally identifiable information (PII) and personal health information (PHI). To redact this information, you can pass enable_guardrails as true. This uses the guardrail you set up earlier as part of the information extraction process and masks information identified as PII or PHI.
processor.process_file(prompt=prompt,
enable_guardrails=True,
toolspecs=toolspecs,
…
)
Finally, cross-document validation is crucial for maintaining data accuracy and compliance in healthcare settings. Although the current implementation performs basic consistency checks through the summary prompt, organizations can extend the framework by implementing a dedicated validation tool that integrates with their specific business rules and compliance requirements. Such a tool could perform sophisticated validation logic like insurance policy verification, appointment date consistency checks, or any other domain-specific validation requirements, providing complete data integrity across the document package.
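As an illustration, such a validation tool could be registered with its own tool specification alongside the extraction tools; the tool name and fields below are hypothetical and would need to be aligned with your own business rules:
{
    "toolSpec": {
        "name": "validate_patient_package",
        "description": "Cross-check extracted values across documents, for example that the patient name and date of birth on the intake form match the insurance card",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "intake_form_data": {
                        "type": "object",
                        "description": "Structured data extracted from the intake form"
                    },
                    "insurance_card_data": {
                        "type": "object",
                        "description": "Structured data extracted from the insurance card"
                    }
                },
                "required": ["intake_form_data", "insurance_card_data"]
            }
        }
    }
}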
Future considerations
As Amazon Bedrock continues to evolve, several powerful features can be integrated into this document processing workflow to enhance its enterprise readiness, performance, and cost-efficiency. Let’s explore how these advanced capabilities can take this solution to the next level:
- Inference profiles in Amazon Bedrock define a model and its associated Regions for routing invocation requests, enabling various tasks such as usage tracking, cost monitoring, and cross-Region inference. These profiles help users track metrics through Amazon CloudWatch logs, monitor costs with cost allocation tags, and increase throughput by distributing requests across multiple Regions.
- Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. Instead of reprocessing the entire context for each document, the workflow can reuse cached prompts, which is particularly beneficial when using the same image across different tooling workflows. With support for multiple cache checkpoints, this feature can substantially reduce processing time and inference costs while maintaining the workflow’s intelligent orchestration capabilities.
- Intelligent prompt routing can dynamically select the most appropriate model for each task based on performance and cost requirements. Rather than explicitly assigning Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for document analysis, the workflow can use intelligent routing to automatically choose the optimal model within the Anthropic family for each request. This approach simplifies model management while providing cost-effective processing of different document types, from simple structured forms to complex handwritten notes, all through a single endpoint.
Conclusion
This intelligent document processing solution demonstrates the power of combining Amazon Bedrock FMs with tool use capabilities to create sophisticated, self-orchestrating workflows. By using Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for complex visual tasks, the solution effectively handles structured, semi-structured, and unstructured documents while maintaining high accuracy and compliance standards.
Key benefits of this approach include:
- Reduced manual processing through intelligent automation
- Improved accuracy through specialized model selection
- Built-in compliance with guardrails for sensitive data
- Flexible architecture that adapts to various document types
- Cost-effective processing through strategic model usage
As organizations continue to digitize their operations, solutions like this showcase how generative AI can transform traditional document processing workflows. The combination of powerful FMs in Amazon Bedrock and the tool use framework provides a robust foundation for building intelligent, scalable document processing solutions across industries.
For more information about Amazon Bedrock and its capabilities, visit the Amazon Bedrock User Guide.
About the Author
Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.
Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases
Large language models (LLMs) excel at generating human-like text but face a critical challenge: hallucination—producing responses that sound convincing but are factually incorrect. While these models are trained on vast amounts of generic data, they often lack the organization-specific context and up-to-date information needed for accurate responses in business settings. Retrieval Augmented Generation (RAG) techniques help address this by grounding LLMs in relevant data during inference, but these models can still generate non-deterministic outputs and occasionally fabricate information even when given accurate source material. For organizations deploying LLMs in production applications—particularly in critical domains such as healthcare, finance, or legal services—these residual hallucinations pose serious risks, potentially leading to misinformation, liability issues, and loss of user trust.
To address these challenges, we introduce a practical solution that combines the flexibility of LLMs with the reliability of drafted, curated, verified answers. Our solution uses two key Amazon Bedrock services: Amazon Bedrock Knowledge Bases, a fully managed service that you can use to store, search, and retrieve organization-specific information for use with LLMs; and Amazon Bedrock Agents, a fully managed service that you can use to build, test, and deploy AI assistants that can understand user requests, break them down into steps, and execute actions. Similar to how a customer service team maintains a bank of carefully crafted answers to frequently asked questions (FAQs), our solution first checks if a user’s question matches curated and verified responses before letting the LLM generate a new answer. This approach helps prevent hallucinations by using trusted information whenever possible, while still allowing the LLM to handle new or unique questions. By implementing this technique, organizations can improve response accuracy, reduce response times, and lower costs. Whether you’re new to AI development or an experienced practitioner, this post provides step-by-step guidance and code examples to help you build more reliable AI applications.
Solution overview
Our solution implements a verified semantic cache using the Amazon Bedrock Knowledge Bases Retrieve API to reduce hallucinations in LLM responses while simultaneously improving latency and reducing costs. This read-only semantic cache acts as an intelligent intermediary layer between the user and Amazon Bedrock Agents, storing curated and verified question-answer pairs.
When a user submits a query, the solution first evaluates its semantic similarity with existing verified questions in the knowledge base. For highly similar queries (greater than 80% match), the solution bypasses the LLM completely and returns the curated and verified answer directly. When partial matches (60–80% similarity) are found, the solution uses the verified answers as few-shot examples to guide the LLM’s response, significantly improving accuracy and consistency. For queries with low similarity (less than 60%) or no match, the solution falls back to standard LLM processing, making sure that user questions receive appropriate responses.
This approach offers several key benefits:
- Reduced costs: By minimizing unnecessary LLM invocations for frequently answered questions, the solution significantly reduces operational costs at scale.
- Improved accuracy: Curated and verified answers minimize the possibility of hallucinations for known user queries, while few-shot prompting enhances accuracy for similar questions.
- Lower latency: Direct retrieval of cached answers provides near-instantaneous responses for known queries, improving the overall user experience.
The semantic cache serves as a growing repository of trusted responses, continuously improving the solution’s reliability while maintaining efficiency in handling user queries.
Solution architecture
The solution architecture in the preceding figure consists of the following components and workflow. Let’s assume that the question “What date will AWS re:invent 2024 occur?” is in the verified semantic cache, with the corresponding verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024.” Let’s walk through an example of how this solution would handle a user’s question.
1. Query processing:
a. User submits a question “When is re:Invent happening this year?”, which is received by the Invoke Agent function.
b. The function checks the semantic cache (Amazon Bedrock Knowledge Bases) using the Retrieve API.
c. Amazon Bedrock Knowledge Bases performs a semantic search and finds a similar question with an 85% similarity score.
2. Response paths: (Based on the 85% similarity score in step 1.c, our solution follows the strong match path)
a. Strong match (similarity score greater than 80%):
i. The Invoke Agent function returns the verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024” directly from the Amazon Bedrock knowledge base, providing a deterministic response.
ii. No LLM invocation is needed; the response is returned in less than 1 second.
b. Partial match (similarity score 60–80%):
i. The Invoke Agent function invokes the Amazon Bedrock agent and provides the cached answer as a few-shot example for the agent through Amazon Bedrock Agents promptSessionAttributes.
ii. If the question was “What’s the schedule for AWS events in December?”, our solution would provide the verified re:Invent dates to guide the Amazon Bedrock agent’s response with additional context.
iii. Providing the Amazon Bedrock agent with a curated and verified example might help increase accuracy.
c. No match (similarity score less than 60%):
i. If the user’s question isn’t similar to any of the curated and verified questions in the cache, the Invoke Agent function invokes the Amazon Bedrock agent without providing it any additional context from cache.
ii. For example, if the question was “What hotels are near re:Invent?”, our solution would invoke the Amazon Bedrock agent directly, and the agent would use the tools at its disposal to formulate a response.
3. Offline knowledge management:
a. Verified question-answer pairs are stored in an Amazon Simple Storage Service (Amazon S3) bucket, and must be updated or reviewed periodically to make sure that the cache contains the most recent and accurate information.
b. The S3 bucket is periodically synchronized with the Amazon Bedrock knowledge base. This offline batch process makes sure that the semantic cache remains up to date without impacting real-time operations (see the example after this list).
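For illustration, the synchronization can be triggered with a single Boto3 call that starts an ingestion job for the cache knowledge base; the identifiers are placeholders:
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Re-index the verified Q&A bucket into the cache knowledge base
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<cache-knowledge-base-id>",    # placeholder
    dataSourceId="<verified-qa-s3-data-source-id>"  # placeholder
)
print(response["ingestionJob"]["status"])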
Solution walkthrough
You need to meet the following prerequisites for the walkthrough:
- An AWS account
- Model access to Anthropic’s Claude Sonnet V1 and Amazon Titan Text Embedding V2
- AWS Command Line Interface (AWS CLI) installed and configured with the appropriate credentials
Once you have the prerequisites in place, use the following steps to set up the solution in your AWS account.
Step 0: Set up the necessary infrastructure
Follow the “Getting started” instructions in the README of the Git repository to set up the infrastructure for this solution. All the following code samples are extracted from the Jupyter notebook in this repository.
Step 1: Set up two Amazon Bedrock knowledge bases
This step creates two Amazon Bedrock knowledge bases. The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services.
This establishes the foundation for your semantic caching solution, setting up the AWS resources to store the agent’s knowledge and verified cache entries.
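The following is a minimal sketch (not the exact notebook code) of how one of the two knowledge bases can be created with Boto3, assuming the execution role, OpenSearch Serverless collection, and vector index already exist; those values are placeholders:
import boto3

bedrock_agent = boto3.client("bedrock-agent")

cache_kb = bedrock_agent.create_knowledge_base(
    name="verified-qa-cache",  # hypothetical name
    roleArn="<knowledge-base-execution-role-arn>",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:<region>::foundation-model/amazon.titan-embed-text-v2:0"
        }
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "<opensearch-serverless-collection-arn>",
            "vectorIndexName": "verified-qa-index",
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "metadata"
            }
        }
    }
)
print(cache_kb["knowledgeBase"]["knowledgeBaseId"])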
Step 2: Populate the agent knowledge base and associate it with an Amazon Bedrock agent
For this walkthrough, you create an Amazon Bedrock agent specialized in answering questions about Amazon Bedrock. You ingest Amazon Bedrock documentation, in the form of the User Guide PDF, into the agent knowledge base as the primary dataset. After ingesting the data, you create an agent with specific instructions:
This setup enables the Amazon Bedrock agent to use the ingested knowledge to provide responses about Amazon Bedrock services. To test it, you can ask a question that isn’t present in the agent’s knowledge base, making the LLM either refuse to answer or hallucinate.
Step 3: Create a cache dataset with known question-answer pairs and populate the cache knowledge base
In this step, you create a raw dataset of verified question-answer pairs that aren’t present in the agent knowledge base. These curated and verified answers serve as our semantic cache to prevent hallucinations on known topics. Good candidates for inclusion in this cache are:
- Frequently asked questions (FAQs): Common queries that users often ask, which can be answered consistently and accurately.
- Critical questions requiring deterministic answers: Topics where precision is crucial, such as pricing information, service limits, or compliance details.
- Time-sensitive information: Recent updates, announcements, or temporary changes that might not be reflected in the main RAG knowledge base.
By carefully curating this cache with high-quality, verified answers to such questions, you can significantly improve the accuracy and reliability of your solution’s responses. For this walkthrough, use the following example pairs for the cache:
Q: 'What are the dates for reinvent 2024?'
A: 'The AWS re:Invent conference was held from December 2-6 in 2024.'
Q: 'What was the biggest new feature announcement for Bedrock Agents during reinvent 2024?'
A: 'During re:Invent 2024, one of the headline new feature announcements for Bedrock Agents was the custom orchestrator. This key feature allows users to implement their own orchestration strategies through AWS Lambda functions, providing granular control over task planning, completion, and verification while enabling real-time adjustments and reusability across multiple agents.'
You then format these pairs as individual text files with corresponding metadata JSON files, upload them to an S3 bucket, and ingest them into your cache knowledge base. This process makes sure that your semantic cache is populated with accurate, curated, and verified information that can be quickly retrieved to answer user queries or guide the agent’s responses.
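One possible layout for these files, assuming the verified question is what gets indexed and the answer travels as chunk metadata (the bucket name, keys, and attribute name are placeholders, not the repository’s exact schema):
import json
import boto3

s3 = boto3.client("s3")
bucket = "<verified-qa-bucket>"  # placeholder

question = "What are the dates for reinvent 2024?"
answer = "The AWS re:Invent conference was held from December 2-6 in 2024."

# The text file holds the verified question, which is embedded and searched
s3.put_object(Bucket=bucket, Key="qa_pairs/reinvent_dates.txt", Body=question)

# The companion .metadata.json file carries the verified answer as metadata
metadata = {"metadataAttributes": {"answer": answer}}
s3.put_object(
    Bucket=bucket,
    Key="qa_pairs/reinvent_dates.txt.metadata.json",
    Body=json.dumps(metadata)
)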
Step 4: Implement the verified semantic cache logic
In this step, you implement the core logic of your verified semantic cache solution. You create a function that integrates the semantic cache with your Amazon Bedrock agent, enhancing its ability to provide accurate and consistent responses. The function does the following:
- Queries the cache knowledge base for similar entries to the user question.
- If a high similarity match is found (greater than 80%), it returns the cached answer directly.
- For partial matches (60–80%), it uses the cached answer as a few-shot example for the agent.
- For low similarity (less than 60%), it falls back to standard agent processing.
This simplified logic forms the core of the semantic caching solution, efficiently using curated and verified information to improve response accuracy and reduce unnecessary LLM invocations.
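The following is a minimal sketch of that routing logic, assuming the cache knowledge base indexes the verified question text and stores the answer as metadata (as in the earlier layout example); the identifiers, thresholds, and metadata key are placeholders:
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

CACHE_KB_ID = "<cache-knowledge-base-id>"  # placeholder
AGENT_ID = "<agent-id>"                    # placeholder
AGENT_ALIAS_ID = "<agent-alias-id>"        # placeholder

def answer_with_semantic_cache(question, session_id):
    # 1. Look up the most similar verified question in the cache knowledge base
    cache_hits = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=CACHE_KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}}
    )["retrievalResults"]

    score = cache_hits[0]["score"] if cache_hits else 0.0
    cached_answer = cache_hits[0]["metadata"].get("answer") if cache_hits else None  # assumed metadata key

    # 2. Strong match: return the verified answer without invoking the LLM
    if score > 0.80 and cached_answer:
        return cached_answer

    # 3. Partial match: pass the cached answer to the agent as a few-shot example
    session_state = {}
    if 0.60 <= score <= 0.80 and cached_answer:
        session_state = {"promptSessionAttributes": {"verifiedExample": cached_answer}}

    # 4. Otherwise fall back to the Amazon Bedrock agent
    completion = bedrock_agent_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=question,
        sessionState=session_state
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in completion["completion"]
        if "chunk" in event
    )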
Step 5: Evaluate results and performance
This step demonstrates the effectiveness of the verified semantic cache solution by testing it with different scenarios and comparing the results and latency. You’ll use three test cases to showcase the solution’s behavior:
- Strong semantic match (greater than 80% similarity)
- Partial semantic match (60-80% similarity)
- No semantic match (less than 60% similarity)
Here are the results:
- Strong semantic match (greater than 80% similarity) provides the exact curated and verified answer in less than 1 second.
- Partial semantic match (60–80% similarity) passes the verified answer to the LLM during the invocation. The Amazon Bedrock agent answers the question correctly using the cached answer even though the information is not present in the agent knowledge base.
- No semantic match (less than 60% similarity) invokes the Amazon Bedrock agent as usual. For this query, the LLM will either refuse to provide the information because it’s not present in the agent’s knowledge base, or will hallucinate and provide a response that is plausible but incorrect.
These results demonstrate the effectiveness of the semantic caching solution:
- Strong matches provide near-instant, accurate, and deterministic responses without invoking an LLM.
- Partial matches guide the LLM agent to provide a more relevant or accurate answer.
- No matches fall back to standard LLM agent processing, maintaining flexibility.
The semantic cache significantly reduces latency for known questions and improves accuracy for similar queries, while still allowing the agent to handle unique questions when necessary.
Step 6: Resource clean up
Make sure that the Amazon Bedrock knowledge bases that you created, along with the underlying Amazon OpenSearch Serverless collections, are deleted to avoid incurring unnecessary costs.
Production readiness considerations
Before deploying this solution in production, address these key considerations:
- Similarity threshold optimization: Experiment with different thresholds to balance cache hit rates and accuracy. This directly impacts the solution’s effectiveness in preventing hallucinations while maintaining relevance.
- Feedback loop implementation: Create a mechanism to continuously update the verified cache with new, accurate responses. This helps prevent cache staleness and maintains the solution’s integrity as a source of truth for the LLM.
- Cache management and update strategy: Regularly refresh the semantic cache with current, frequently asked questions to maintain relevance and improve hit rates. Implement a systematic process for reviewing, validating, and incorporating new entries to help ensure cache quality and alignment with evolving user needs.
- Ongoing tuning: Adjust similarity thresholds as your dataset evolves. Treat the semantic cache as a dynamic component, requiring continuous optimization for your specific use case.
Conclusion
This verified semantic cache approach offers a powerful solution to reduce hallucinations in LLM responses while improving latency and reducing costs. By using Amazon Bedrock Knowledge Bases, you can implement a solution that can efficiently serve curated and verified answers, guide LLM responses with few-shot examples, and gracefully fall back to full LLM processing when needed.
About the Authors
Dheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.
Chaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazon’s Worldwide Returns and ReCommerce organization. He specializes in building scalable machine learning infrastructure, distributed systems, and containerization technologies. His expertise lies in developing robust solutions that enhance monitoring, streamline inference processes, and strengthen audit capabilities to support and optimize Amazon’s global operations.
Rajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for reverse logistics and ReCommerce operations. He is deeply passionate about generative AI and consistently seeks opportunities to implement AI into solving complex customer challenges.
Karam Muppidi is a Senior Engineering Manager at Amazon Retail, where he leads data engineering, infrastructure and analytics for the Worldwide Returns and ReCommerce organization. He has extensive experience developing enterprise-scale data architectures and governance strategies using both proprietary and native AWS platforms, as well as third-party tools. Previously, Karam developed big-data analytics applications and SOX compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.
LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker
Fine-tuning a pre-trained large language model (LLM) allows users to customize the model to perform better on domain-specific tasks or align more closely with human preferences. It is a continuous process to keep the fine-tuned model accurate and effective in changing environments, to adapt to the data distribution shift (concept drift) and prevent performance degradation over time. Continuous fine-tuning also enables models to integrate human feedback, address errors, and tailor to real-world applications. You can use supervised fine-tuning (SFT) and instruction tuning to train the LLM to perform better on specific tasks using human-annotated datasets and instructions. When you have user feedback to the model responses, you can also use reinforcement learning from human feedback (RLHF) to guide the LLM’s response by rewarding the outputs that align with human preferences.
Precise and responsible outputs from fine-tuned LLMs require big efforts from subject matter experts (SMEs). The manual annotation of extensive training data for fine-tuning by human SMEs and collecting user feedback to align LLM responses with human preferences are both resource-heavy and time-intensive. Also, the continuous fine-tuning process requires orchestrating the multiple steps of data generation, LLM training, feedback collection, and preference alignments with scalability, resiliency, and resource efficiency. To address these challenges, we present an innovative continuous self-instruct fine-tuning framework that streamlines the LLM fine-tuning process of training data generation and annotation, model training and evaluation, human feedback collection, and alignment with human preference. This framework is designed as a compound AI system to drive the fine-tuning workflow for performance improvement, versatility, and reusability.
In this post, we introduce the continuous self-instruct fine-tuning framework and its pipeline, and present how to drive the continuous fine-tuning process for a question-answer task as a compound AI system. We use DSPy (Declarative Self-improving Python) to demonstrate the workflow of Retrieval Augmented Generation (RAG) optimization, LLM fine-tuning and evaluation, and human preference alignment for performance improvement.
Overview of the continuous self-instruct fine-tuning framework
The continuous self-instruct fine-tuning framework drives a workflow to customize the foundation model (FM) using human-labeled training samples and human feedback after model inference. This workflow runs on a continuous basis to be adaptive to a changing environment. The following diagram illustrates the workflow.
The workflow consists of the following steps:
- Self-instruct supervised fine-tuning – First, we use a human-labeled training dataset to adapt the FM to tasks in a specific domain. Instruction tuning is a popular approach in domain-specific LLM fine-tuning, which trains the FM to follow instructions for a specific task rather than generating the next texts. To address the challenges of the lack of human efforts for data labeling, annotation, and validation, we designed a self-instruct fine-tuning method to synthetically generate training labels by the LLM from a small volume of high-quality human-annotated samples. This process scales up the training dataset used for fine-tuning the FM into a custom LLM.
- Human preference alignment – After the model is deployed in the production environment, the process moves into the human-in-the-loop workflow, in which we collect user feedback including satisfaction scores and comments on model response. The human feedback data is not only used for model performance and hallucination measurement, but is also used to further fine-tune the custom model in Step 1 through RLHF. Likewise, to address the challenges of lack of human feedback data, we use LLMs to generate AI grades and feedback that scale up the dataset for reinforcement learning from AI feedback (RLAIF). There are various techniques of preference alignment, including proximal policy optimization (PPO), direct preference optimization (DPO), odds ratio policy optimization (ORPO), group relative policy optimization (GRPO), and other algorithms, that can be used in this process.
- Evaluation and continuous learning – The model customization and preference alignment is not a one-time effort. We need to keep monitoring and evaluating the model performance, and restart the process in case of concept shift or model decay.
The overall workflow consists of multiple steps of synthetic data generation, LLM training, feedback collection, preference alignment, and evaluation that involves multiple components and multiple LLMs. In the next section, we discuss using a compound AI system to implement this framework to achieve high versatility and reusability.
Compound AI system and the DSPy framework
With the rise of generative AI, scientists and engineers face a much more complex scenario to develop and maintain AI solutions, compared to classic predictive AI. The paper The Shift from Models to Compound AI Systems highlights that state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models. Compound AI systems are systems that implement AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers, or external tools. The following diagram compares predictive AI to generative AI.
The concept of a compound AI system enables data scientists and ML engineers to design sophisticated generative AI systems consisting of multiple models and components. You can use a module to incorporate prompt engineering and in-context learning to improve RAG performance, and also design a data architecture with tools to gather external data. You can also build an agentic architecture with multiple LLMs, fine-tune the model to achieve higher performance, and orchestrate the LLM access. Besides the efficiency in system design, the compound AI system also enables you to optimize complex generative AI systems, using a comprehensive evaluation module based on multiple metrics, benchmarking data, and even judgements from other LLMs. The optimization is on the holistic end-to-end solution, rather than on each component separately.
To efficiently build and optimize compound AI systems, we introduce DSPy, an open source Python framework for developers to build LLM applications using modular and declarative programming, whether you’re building simple classifiers, sophisticated RAG pipelines, or agentic workflows. It provides algorithms for optimizing LLMs’ prompts and weights, and automates the prompt tuning process, as opposed to the trial-and-error approach performed by humans. DSPy supports iteratively optimizing all prompts involved against defined metrics for the end-to-end compound AI solution.
The DSPy lifecycle is presented in the following diagram in seven steps. It separates the flow of your program (modules) from the parameters (language model prompts and weights) of each step. These modules define the system behavior in a portable, declarative way. The first four steps cover the DSPy programming stage, including defining your task and its constraints, exploring a few examples, and using that to inform your initial pipeline design. When your system works reasonably well, you can run the DSPy evaluation stage (Steps 5 and 6) to collect an initial development set, define your DSPy metric, and use these to iterate on your system more systematically. Afterwards, DSPy introduces new optimizers (compilers) in Step 7, with language model-driven algorithms to tune LLM prompts and weights, based on predefined evaluation metrics.
RAG pipeline with continuous fine-tuning in a compound AI system
In this post, we provide an example of a question-answer task, using a RAG pipeline along with the continuous self-instruct fine-tuning framework. We build this as a compound AI system and use DSPy to drive the RAG inference, prompt optimization, LLM fine-tuning, and performance evaluation. The overall workflow is shown in the following diagram.
The flow starts from a standard RAG pipeline, followed by a few optimizations on the prompts and the RAG retriever. Then we generate the synthetic training dataset from the RAG knowledge base to fine-tune the generator LLM used in RAG for performance improvement. Lastly, we use a separate LLM to generate feedback on the fine-tuned model responses, and use it to conduct the preference alignment training by DPO and PPO. The question-answer outputs from each step are measured by the underlying LLM-as-a-judge evaluation module. In this way, we demonstrate the effectiveness of the compound AI system for continuously optimizing the pipeline through RAG optimization and the fine-tuning framework.
In the next sections, we demonstrate how to build this workflow, including the RAG pipeline, optimization, instruction fine-tuning, preference alignment, and model evaluation, into a compound AI system using an Amazon SageMaker notebook instance with the DSPy framework and LLMs on Amazon Bedrock. The code from this post and more examples are available in the GitHub repository.
Prerequisites
To create and run this compound AI system in your AWS account, complete the following prerequisites:
- Create an AWS account if you don’t already have one.
- Set up a SageMaker notebook instance.
- Open JupyterLab in this newly created instance.
- Clone the GitHub repository and follow the steps explained in the README.
- Navigate to the cloned repository and open the notebook folder.
- Enable access to models hosted on Amazon Bedrock. For this post, we enable Anthropic’s Claude 3 Sonnet, Mistral 7B, and Meta Llama 8B.
Dataset
For the question-answering task, we use the Contract Understanding Atticus Dataset (CUAD), an open legal contract review dataset created with dozens of legal experts from The Atticus Project, which consists of over 13,000 annotations. The synthetic data generation notebook automatically downloads the CUAD_v1 ZIP file and places it in the required folder named cuad_data.
In case of any issues, you can alternatively download the dataset yourself by following the steps in the README file, store it in a folder within the SageMaker notebook instance, and use it to perform the steps in the next section.
Prepare question-answer pairs
The first step is to prepare question-answer pairs from the CUAD document by running synthetic data generation.
We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to synthetically generate question-answer pairs that are used to run inference through the RAG pipeline in the compound AI system, to demonstrate the improved accuracy after RAG optimization and model fine-tuning. The generated datasets are in the format of question-answer pairs along with the context [context, question, answer] from the document. We use the question to run inference through the RAG pipeline and use the answer as ground truth to evaluate the inference accuracy. Additionally, the question-answer pairs are used as training samples for the model fine-tuning. The following is a sample dataset triplet with context and a question-answer pair.
| Context (Snippet from PDF file) | Question | Answer |
| --- | --- | --- |
| THIS STRATEGIC ALLIANCE AGREEMENT (“Agreement”) is made and entered into as of November 6, 2016 (the “Effective Date”) by and between Dialog Semiconductor (UK) Ltd., a corporation organized under the laws of England and Wales, having its principal office at 100 Longwater Avenue, Green Park, Reading, RG2 6GP, United Kingdom (“DIALOG”) and Energous Corporation, a Delaware corporation, having its principal office at 3590 North First Street, Suite 210, San Jose, CA 95134 (“ENERGOUS”) | What is the date of the contract? | November 6, 2016 |
Create a RAG pipeline
We implement a standard RAG pipeline with DSPy using the following components to create the vector database, set up context retrieval, and generate the answer (a condensed sketch follows the list):
- Configure DSPy to use LLMs on Amazon Bedrock as the RAG generator model:
- Process the dataset to generate logical and syntactically readable chunks. The size and overlap percentage can be empirically determined based on the dataset. For more flexibility, you can generate multiple files from the dataset file and treat each file as a single chunk.
- To set up a RAG retriever, we select ChromaDB as a vector store, and use DSPy’s ChromadbRM module as the retriever model:
- Using these components, we orchestrate a DSPy RAG pipeline to clean the context, generate the answer, and use the LLM-as-a-judge to score the generated answer with respect to the ground truth:
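The notebook code isn’t reproduced here, so the following is a condensed sketch of what such a DSPy RAG module can look like; the model ID, collection name, and field names are assumptions, and the exact LM configuration call differs across DSPy versions:
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM

# Configure an Amazon Bedrock model as the RAG generator (model ID is an example;
# DSPy 2.5+ routes Bedrock models through LiteLLM, older versions use dedicated clients)
lm = dspy.LM("bedrock/anthropic.claude-3-haiku-20240307-v1:0")
dspy.configure(lm=lm)

# ChromaDB collection built from the chunked CUAD documents (names are placeholders)
retriever = ChromadbRM(
    collection_name="cuad_chunks",
    persist_directory="./chroma_db",
    k=5
)

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = retriever
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Pull the top-k chunks and generate an answer grounded in them
        context = [passage.long_text for passage in self.retrieve(question)]
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
The LLM-as-a-judge scoring mentioned in the last bullet is defined in the evaluation section later in this post.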
RAG optimization with DSPy
The next step is to perform RAG optimization with DSPy. DSPy provides the Optimizer module, an algorithm that can tune the parameters of a DSPy program (the prompts and language model weights) to maximize the metrics you specify. It takes in a training set to bootstrap selected training examples, and relies on a metric function that measures proximity to or matches against the ground truth. With these, we can compile the RAG pipeline module with a defined optimizer instance to conduct the optimization.
In this post, we use DSPy Optimizer to learn how to generate the prompt to improve the RAG response accuracy. Because our dataset is small (fewer than 100 examples), we select the BootstrapFewShot teleprompter to compile the RAG prompts and overall pipeline, and use the synthetic dataset with ground truth and the LLM-as-a-judge metric function we defined in the previous sections:
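A hedged sketch of this compilation step, assuming the RAG module from the earlier sketch and an LLM-as-a-judge metric named factuality_metric have already been defined:
from dspy.teleprompt import BootstrapFewShot

# Compile the RAG program against the synthetic training set and the judge metric
teleprompter = BootstrapFewShot(metric=factuality_metric, max_bootstrapped_demos=4)
optimized_rag = teleprompter.compile(RAG(), trainset=trainset)

# The compiled program is used like the original module
prediction = optimized_rag(question="What is the date of the contract?")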
The context retrieval is crucial to the overall RAG accuracy. To evaluate the RAG optimization we’ve described, we create a retriever evaluation by the LLM-as-a-judge to understand how well the retriever is able to pull out the relevant chunks for the incoming user question. The LLM judge is defined in the RetrievalJudge class:
Then we define the metric to measure the retrieval by using the RetrievalJudge, and use the DSPy Evaluate module to generate the accuracy score for retrieval:
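The RetrievalJudge code isn’t listed here; a minimal sketch of how such a judge and its metric can be wired up in DSPy follows, with the signature fields, devset, and yes/no convention as assumptions rather than the repository’s exact implementation:
import dspy
from dspy.evaluate import Evaluate

class RetrievalJudgeSignature(dspy.Signature):
    """Judge whether the retrieved context contains the information needed to answer the question."""
    question = dspy.InputField()
    context = dspy.InputField()
    relevant = dspy.OutputField(desc="yes or no")

class RetrievalJudge(dspy.Module):
    def __init__(self):
        super().__init__()
        self.judge = dspy.ChainOfThought(RetrievalJudgeSignature)

    def forward(self, question, context):
        return self.judge(question=question, context=context)

retrieval_judge = RetrievalJudge()

def retrieval_metric(example, prediction, trace=None):
    # Score 1 when the judge deems the retrieved context relevant, otherwise 0
    verdict = retrieval_judge(question=example.question, context=prediction.context)
    return int(verdict.relevant.strip().lower().startswith("yes"))

# devset is a hypothetical list of dspy.Example objects with question and answer fields
evaluate_retrieval = Evaluate(devset=devset, metric=retrieval_metric, display_progress=True)
retrieval_accuracy = evaluate_retrieval(RAG())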
Configure the continuous fine-tuning framework
After the RAG optimization, the compound AI system has the instruction tuning and preference alignment modules, driven by the continuous fine-tuning framework. This includes using the synthetically generated dataset to train the LLM to follow question-answer instructions through SFT, and generating AI feedback (from another LLM) on the RAG responses, which is used for RLAIF with PPO and for preference alignment with DPO and ORPO. In this step, we use Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce the compute requirements and accelerate the training process.
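As a brief illustration of the PEFT setup (the hyperparameters shown are assumptions; the repository’s training scripts define the actual values), a LoRA configuration for the Meta Llama 3 8B model can look like the following:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Requires access to the gated Meta Llama 3 weights on Hugging Face
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Low-rank adapters on the attention projections keep the trainable parameter count small
lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()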
At the time of writing, the DSPy Optimization module supports distillation of a prompt-based DSPy program into LLM weight updates using BootstrapFinetune, and does not yet support the fine-tuning methods we defined in the compound AI system. Therefore, we conducted the fine-tuning (instruction tuning and preference alignment) on a Meta Llama 3 8B model separately; refer to the following GitHub repository for more details. With the compound AI system design, we are able to take the fine-tuning results back into the DSPy pipeline, use the LLM-as-a-judge evaluation function to generate the accuracy scores, and benchmark with the standard and optimized RAG inferences. This demonstrates the flexibility and interoperability of the compound AI system, which allows us to seamlessly replace one module with an external component without requiring changes to the entire pipeline.
The following diagram illustrates the workflow.
Define an evaluation approach with DSPy
DSPy provides an Evaluate module for evaluating the compound AI system output by using user-defined metrics. In this post, we use LLM-as-a-judge to evaluate the system output and create the corresponding metrics for benchmarking the accuracy of standard RAG, optimized RAG, and fine-tuned models. Complete the following steps:
- Load the dataset for evaluation in the Example data type. Examples are similar to Python dictionaries but with added utilities such as the dspy.Prediction as a return value. For example:
- Define the LLM-as-a-judge class to adjudicate whether the predicted answer semantically matches the ground truth of the answer. For example, the following FactualityJudge_1 class provides a score between 0 and 1; 0 means a complete mismatch and 1 means a perfect match.
- Define the evaluation metrics from the LLM judge, using DSPy metrics, to mark whether the predicted answer is true or not. For example, the following function returns the accuracy score based on the output of FactualityJudge_1:
- Use the dspy.Evaluate module to generate an accuracy score using the LLM-as-a-judge metrics defined in the previous step (a condensed sketch follows this list):
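Putting these pieces together, a condensed sketch of the evaluation wiring might look like the following, assuming a factuality_metric built on FactualityJudge_1 as described above and a hypothetical eval_pairs list of (question, ground truth) tuples:
import dspy
from dspy.evaluate import Evaluate

# Wrap the synthetic question-answer pairs as DSPy Examples (field names are assumptions)
devset = [
    dspy.Example(question=q, answer=a).with_inputs("question")
    for q, a in eval_pairs
]

# Evaluate any program (standard RAG, optimized RAG, or a fine-tuned generator) with the judge metric
evaluator = Evaluate(devset=devset, metric=factuality_metric, num_threads=4, display_progress=True)
accuracy = evaluator(optimized_rag)
print(f"LLM-judge accuracy: {accuracy}")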
This evaluation process should be conducted on a continuous basis in the compound AI system driven by self-instruct fine-tuning, to make sure the overall performance remains stable despite the changes in the environment or the introduction of new data.
Benchmark RAG and LLM fine-tuning with DSPy
We benchmark the approaches presented in this post using the LLM-as-a-judge evaluation function defined in the previous section with the following settings.
The benchmarking covers five methods: standard RAG, optimized RAG, an LLM fine-tuned by instruction tuning, and LLMs fine-tuned by DPO and ORPO based on AI feedback (AIF). For each method, the LLM judge provides a decimal accuracy score in the range of 0 to 1.
The standard RAG uses Amazon Titan Text Embedding V2 for the embedding model, and Anthropic’s Claude 3 Haiku model for the generator model. The RAG compilation uses 32 question-answer pairs to optimize the prompts. The same dataset is used for inference. The fine-tuning by SFT, DPO, and ORPO is performed on the Meta Llama 3 8B FM, using training samples synthetically generated from the CUAD document.
The results are presented in the following tables and charts. The different methods demonstrate different levels of improvement. The improvement is calculated as a percentage: (accuracy of new method − accuracy of standard RAG) / (accuracy of standard RAG) × 100%. For example, the optimized RAG with Anthropic’s Claude 3 Haiku improves accuracy from 0.3969 to 0.6656, which is (0.6656 − 0.3969) / 0.3969 × 100% ≈ 67.7%.
The optimized RAG by DSPy improved the accuracy and reduced hallucinations.
| | Standard RAG with Claude 3 Haiku | RAG with Claude 3 Haiku optimized by DSPy | Improvement % |
| --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3969 | 0.6656 | 67.70% |

| | Standard RAG with Claude 3 Sonnet | RAG with Claude 3 Sonnet optimized by DSPy | Improvement % |
| --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3031 | 0.6375 | 110.33% |
The custom LLM trained by SFT yielded higher accuracy than the standard RAG.
| | Standard RAG with Claude 3 Haiku | SFT tuned Meta Llama 3 8B | Improvement % |
| --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3969 | 0.4813 | 21.26% |

| | Standard RAG with Claude 3 Sonnet | SFT tuned Meta Llama 3 8B | Improvement % |
| --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3031 | 0.4813 | 58.79% |
The custom LLM through preference alignment from human and AI feedback (DPO and ORPO) further improved the model performance. The fine-tuned small model (Meta Llama 3 8B) outperformed the standard RAG pipeline with the medium-size (Anthropic’s Claude 3 Haiku) and larger (Anthropic’s Claude 3 Sonnet) generator models, and was comparable to the prompt-optimized RAG that uses ground truth data.
| | Standard RAG with Claude 3 Haiku | DPO tuned Meta Llama 3 8B | Improvement % | ORPO tuned Meta Llama 3 8B | Improvement % |
| --- | --- | --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3969 | 0.6719 | 69.29% | 0.6812 | 71.63% |

| | Standard RAG with Claude 3 Sonnet | DPO tuned Meta Llama 3 8B | Improvement % | ORPO tuned Meta Llama 3 8B | Improvement % |
| --- | --- | --- | --- | --- | --- |
| Accuracy by LLM Judge (0-1) | 0.3031 | 0.6719 | 121.68% | 0.6812 | 124.74% |
The following charts compare the accuracy across all tested methods.
The preceding results were generated from a small dataset (32 question-answer pairs). You can use a larger sample set with more question-answer pairs to conduct the benchmarking and compare your own results.
Clean up
Make sure to clean up the following resources to avoid incurring additional costs:
- Delete Amazon Simple Storage Service (Amazon S3) buckets created for data storage and resource sharing.
- Back up the Jupyter notebooks in the SageMaker notebook instance.
- Shut down and delete the SageMaker notebook instance.
Cost considerations
Consider the following costs from the solution deployed on AWS:
- You will incur charges for LLM inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
- You will incur charges for storing files in S3 buckets. For more details, refer to Amazon S3 pricing.
- You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.
Conclusion
In this post, we presented the continuous self-instruct fine-tuning framework as a compound AI system implemented with the DSPy framework. The framework first generates a synthetic dataset from the domain knowledge base and documents for self-instruction, then drives model fine-tuning through SFT, and introduces the human-in-the-loop workflow to collect human and AI feedback on the model responses, which is then used to further improve the model performance by aligning with human preferences through reinforcement learning (RLHF/RLAIF).
We demonstrated the framework for a question-answer task with a RAG pipeline, which improved the end-to-end response accuracy. The workflow is implemented with the DSPy framework; the overall strategy is to use dspy.Module to connect all the components (RAG pipeline, prompt optimization, LLMs fine-tuned by SFT and RLHF/RLAIF, performance evaluation) into a compound AI system. Each module can be seamlessly maintained, updated, and replaced without affecting other components in the system. This robust and versatile system design strengthens control and trust through modular design, and increases flexibility and adaptability to changing environments and data sources.
You can implement this continuous fine-tuning framework for LLM performance improvement for your own business use cases, with a compound AI system that provides high flexibility and interoperability. For more details, follow the examples in our GitHub repository.
About the Authors
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.
Jose Cassio dos Santos Junior is a Senior Data Scientist member of the MLU team. He is responsible for Curriculum Development for Advanced Modules. As a previous Senior Data Scientist on the AWS LATAM Professional Services Data Science team, he has over 20 years of experience working as a software engineer and more than 10 years of teaching experience at colleges and as an instructor for Linux certification preparation and Microsoft Innovation Center bootcamps. As a business process management expert, he participated in BPO projects for more than 7 years. He holds a Master’s degree in Computer Engineering, a Bachelor’s degree in Physics, and a Bachelor’s degree in Business Administration, specialized in IT Quantitative Methods.
Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows
Organizations need efficient ways to access and analyze their enterprise data. Amazon Q Business addresses this need as a fully managed generative AI-powered assistant that helps you find information, generate content, and complete tasks using enterprise data. It provides immediate, relevant information while streamlining tasks and accelerating problem-solving.
Amazon FSx for Windows File Server is a fully managed Windows file system that provides high-performance file storage for Windows-based applications. You can use Amazon FSx to lift and shift your on-premises Windows file server workloads to the cloud, taking advantage of the scalability, durability, and cost-effectiveness of AWS while maintaining full compatibility with your existing Windows applications and tooling.
Amazon Q Business is designed to be secure and private, seamlessly integrating with your existing identity provider (IdP). It works directly with your identities, roles, and permission sets, making sure users can’t access data they are not authorized to. Additionally, Amazon Q Business seamlessly integrates with multiple enterprise data stores, including FSx for Windows File Server, enabling you to index documents from file server systems and perform tasks such as summarization, Q&A, or data analysis of large numbers of files effortlessly.
In this post, we demonstrate how to use the Amazon Q connector for FSx for Windows File Server, explore a practical use case, and provide step-by-step instructions to help you get started and gain insights out of your data stored in FSx for Windows File Server.
Overview of the Amazon Q data source connector
A data source connector is a mechanism for integrating and synchronizing data from multiple repositories, including Microsoft SharePoint, Salesforce, Amazon Simple Storage Service (Amazon S3) buckets, and even your internal FSx for Windows File Server into one container index. Amazon Q Business offers multiple data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. For a list of supported connectors, see Supported connectors.
Supported document types
Amazon Q supports a wide range of document types stored across your environment, including Windows shares (FSx for Windows File Server). It can ingest and understand formats ranging from plain text, PDF, HTML, XML, and JSON to Microsoft formats such as Excel, Word, and PowerPoint. This provides a comprehensive search experience for your enterprise users.
Secure access with supported authentication types
Security is job zero at AWS, and Amazon Q was built with that in mind. It supports a variety of authentication types and integrates seamlessly with your existing identity management systems. Whether you use single sign-on (SSO) or a custom authentication solution, Amazon Q can adapt to your specific needs.
Fine-grained control with ACLs and identity crawling
For organizations with highly sensitive data, Amazon Q offers an extra layer of security. Amazon Q Business supports crawling access control lists (ACLs) for document security by default. When you connect an Amazon FSx (Windows) data source to Amazon Q Business, it crawls ACL information attached to a document (user and group information) from the directory service of the Amazon FSx instance.
Overview of solution
The following diagram shows a high-level architecture of how AWS Managed Microsoft AD users, through AWS IAM Identity Center, can access and interact with an Amazon Q Business application. This enables an authenticated user to securely and privately interact with the application and gain insights from the enterprise data stored in FSx for Windows File Server, using the Amazon Q Business web experience from their web browser.
In this post, we walk you through the process of integrating Amazon Q Business with FSx for Windows File Server to extract meaningful insights from your file system using natural language processing (NLP). This solution enables you to interact with your file system data using conversational AI, making information discovery more intuitive and efficient.
To set up your Amazon Q Business application, complete the following high-level steps:
- Create a new Amazon Q application.
- Select the retriever.
- Add a data source (FSx for Windows File Server).
- Synchronize your file system data.
Lastly, we demonstrate the application functionality by testing its access for two different users.
Prerequisites
To implement this solution, you should have an AWS account with administrative privileges.
Follow the instructions in the GitHub repository’s README file to provision the infrastructure required for exploring the Amazon Q connector for FSx for Windows File Server.
Create an Amazon Q Business application
Complete the following steps to create a new Amazon Q Business application:
- On the Amazon Q Business console, choose Applications in the navigation pane.
- Choose Create application.
- For Application name, enter a name (for example, anycompany-filesystem-knowledgebase).
- For Access management method, select AWS IAM Identity Center.
If you completed the prerequisites, then IAM Identity Center is already enabled, and you should see the instance ARN listed.
- Under Quick start user, for Select user, choose your users.
- Leave Select subscription as Q Business Pro.
- For Application details, use the default values.
- Choose Create.
In the next step, you will select the data source to retrieve and index the data.
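If you prefer to script this step instead of using the console, the following is a minimal boto3 sketch of the same application setup. It assumes IAM Identity Center is already enabled (see Prerequisites); the instance ARN shown is a placeholder, and parameter names should be confirmed against the current CreateApplication API documentation.

```python
import boto3

# Sketch: programmatic equivalent of the console steps above.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

response = qbusiness.create_application(
    displayName="anycompany-filesystem-knowledgebase",
    # Placeholder ARN of your IAM Identity Center instance (see Prerequisites).
    identityCenterInstanceArn="arn:aws:sso:::instance/ssoins-EXAMPLE",
)

application_id = response["applicationId"]
print(f"Created application: {application_id}")
```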
Select the retriever
In this step, you select the retriever that connects data sources to the application. There are two options: a native retriever or an existing Amazon Kendra index. For this example, we use the native retriever.
- On the application details page, under Q Recommendations, choose Data sources.
- Choose Select retriever.
- For Retrievers, select Native.
- For Index provisioning, select Enterprise.
- For Number of units, enter 1.
- Choose Confirm.
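To automate the retriever setup shown in these steps, the following boto3 sketch creates an Enterprise index with one capacity unit and attaches a native retriever to it. The display names and application ID are placeholders, and the request shapes should be verified against the current qbusiness API reference.

```python
import boto3

# Sketch: create an Enterprise index (1 capacity unit) and a native retriever,
# mirroring the console choices above.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

APPLICATION_ID = "your-application-id"  # placeholder from the previous step

index = qbusiness.create_index(
    applicationId=APPLICATION_ID,
    displayName="anycompany-filesystem-index",   # illustrative name
    type="ENTERPRISE",
    capacityConfiguration={"units": 1},
)

retriever = qbusiness.create_retriever(
    applicationId=APPLICATION_ID,
    displayName="anycompany-native-retriever",   # illustrative name
    type="NATIVE_INDEX",
    configuration={"nativeIndexConfiguration": {"indexId": index["indexId"]}},
)

print("Index:", index["indexId"], "Retriever:", retriever["retrieverId"])
```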
Add a data source
Complete the following steps to add a data source:
- On the application details page, choose Add data source.
- Search for Amazon FSx and choose the plus sign next to Amazon FSx (Windows).
- In the Name and description section, enter a name (for example, anycompany-filesystem-source) and an optional description.
- In the Source section, for Amazon FSx file system ID, choose the file system ID you created as a prerequisite.
- In the Authorization section, leave as default (ACLs are enabled for the connector).
- In the Authentication section, for AWS Secrets Manager secret, choose the AWS Secrets Manager secret that holds the active directory credentials to communicate with Amazon FSx to crawl the file system (QBusiness-fsx-creds).
- In the Configure VPC and security group section, provide the following information:
- For Virtual Private Cloud (VPC), choose the virtual private cloud (VPC) created as a prerequisite (amazon-connector-for-win-fsx-blog-vpc).
- For Subnets, choose the private subnets that contain the FSx for Windows File Server file system and the Active Directory instance.
- For VPC security groups, choose your security group (<stack-name>-DefaultSecurityGroup).
- In the IAM role section, provide the following information:
- For IAM role, choose Create a new service role.
- For Role name, enter a name for the role.
- In the Sync scope section, provide the following information:
- For Maximum file size, use the default option of 50 MB.
- Under Regex patterns, you can add inclusion and exclusion patterns. For this post, we add the inclusion pattern for PDF file types, so the Amazon Q crawler will include PDF files.
- In the Sync mode section, select Full sync.
Full sync is preferable for the first sync; for subsequent runs, you can choose to sync only new or modified content.
- In the Sync run schedule section, for Frequency, choose Run on demand.
You also have the option to run the sync on a recurring basis like hourly or daily.
- In the Tags section, you can optionally add tags.
- In the Field mappings section, use the default field mappings selected.
The Amazon Q connector offers seven fields. Modifying field mappings and adding custom fields will be available after you create the application and retriever. For more information on the field mappings, refer to Amazon FSx (Windows) data source connector field mappings.
- Choose Add data source.
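The same data source can also be attached from code. The following sketch calls CreateDataSource with the connector-specific configuration loaded from a local JSON file (a hypothetical fsx-connector-config.json); the exact schema for the Amazon FSx (Windows) connector, covering the file system ID, Secrets Manager secret, sync scope, and regex patterns configured above, is defined in the connector documentation rather than reproduced here. The role ARN, subnet IDs, and security group ID are placeholders.

```python
import json
import boto3

# Sketch: attach the Amazon FSx (Windows) connector programmatically.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

APPLICATION_ID = "your-application-id"  # placeholder
INDEX_ID = "your-index-id"              # placeholder

# Hypothetical local file holding the connector-specific configuration document.
with open("fsx-connector-config.json") as f:
    connector_configuration = json.load(f)

data_source = qbusiness.create_data_source(
    applicationId=APPLICATION_ID,
    indexId=INDEX_ID,
    displayName="anycompany-filesystem-source",
    configuration=connector_configuration,
    roleArn="arn:aws:iam::111122223333:role/QBusinessFsxDataSourceRole",  # placeholder
    vpcConfiguration={
        "subnetIds": ["subnet-EXAMPLE1", "subnet-EXAMPLE2"],  # private subnets from the prerequisites
        "securityGroupIds": ["sg-EXAMPLE"],                   # connector security group
    },
)
print("Data source:", data_source["dataSourceId"])
```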
Synchronize your file system data
When the data source is successfully created, a banner message appears. In the banner message (or on the data source details page), choose Sync now to sync your file system data.
You can monitor the status of the sync, which includes direct links to Amazon CloudWatch logs.
The sync can take a few minutes to a few hours to complete. Sync speeds are limited by factors such as remote repository throughput and throttling, network bandwidth, and the size of documents.
When the sync is complete, you should see the scan statistics, which include the number of items scanned and the number that failed.
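If you'd rather trigger and monitor the sync from code instead of the console, the following sketch starts an on-demand sync job and polls its status. The history field and status values are based on the ListDataSourceSyncJobs API and should be confirmed against the current SDK; the IDs are placeholders.

```python
import time
import boto3

# Sketch: programmatic counterpart of choosing "Sync now" in the console.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

APPLICATION_ID = "your-application-id"  # placeholder
INDEX_ID = "your-index-id"              # placeholder
DATA_SOURCE_ID = "your-data-source-id"  # placeholder

qbusiness.start_data_source_sync_job(
    applicationId=APPLICATION_ID,
    indexId=INDEX_ID,
    dataSourceId=DATA_SOURCE_ID,
)

while True:
    # Assumes the most recent sync job is returned first in the history list.
    jobs = qbusiness.list_data_source_sync_jobs(
        applicationId=APPLICATION_ID,
        indexId=INDEX_ID,
        dataSourceId=DATA_SOURCE_ID,
    ).get("history", [])
    if jobs and jobs[0].get("status") in ("SUCCEEDED", "FAILED", "ABORTED"):
        print("Sync finished with status:", jobs[0]["status"])
        break
    time.sleep(60)  # syncs can take minutes to hours
```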
For this post, we have two Active Directory groups, ml-engineers and security-engineers. Each group has one user (John Doe and Jane Smith, respectively), and each user has access to only one whitepaper based on their group membership (Choosing a generative AI service and AWS Security Incident Response Guide, respectively). The following diagram illustrates this access.
Validate the Amazon Q application functionality
Now that you have completed the setup, you can validate the application functionality by testing the access controls. We test the access of two users, John Doe and Jane Smith, who are users of the ml-engineers group and security-engineers group, respectively. You can retrieve the user name and password for each user from Secrets Manager. The secret name for John Doe is jdoe, and for Jane Smith, it’s jsmith.
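You can also pull both test users' credentials programmatically. The following sketch reads the jdoe and jsmith secrets with the Secrets Manager API; it assumes the secret values are stored as JSON documents, which may differ in your deployment.

```python
import json
import boto3

# Sketch: retrieve the test users' credentials from AWS Secrets Manager.
secrets = boto3.client("secretsmanager", region_name="us-east-1")

for secret_name in ("jdoe", "jsmith"):
    value = secrets.get_secret_value(SecretId=secret_name)["SecretString"]
    # Assumed structure: a JSON document such as {"username": ..., "password": ...}
    credentials = json.loads(value)
    print(secret_name, "->", list(credentials.keys()))
```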
- On the application details page, in the Web experience settings section, choose the link for the deployed URL.
- Sign in as John Doe.
A successful login directs you to the Amazon Q Business chat interface. This window serves as the main workspace where users interact with the application, as shown in the following screenshot.
With the test configuration, John Doe has access to only one document: generative-ai-on-aws-how-to-choose.pdf. You can test the access controls by asking questions about this whitepaper through the chat interface. This restricted access demonstrates the effective implementation of document-level permissions.
- For our first question, we ask What are the key factors to consider when choosing a generative AI service?
The following screenshot shows the response.
- Next, we ask Does Amazon Bedrock provide an option to customize the model?
The response includes citations from Amazon Q with reference to the source data.
Testing confirms that John Doe successfully receives responses to questions about content from generative-ai-on-aws-how-to-choose.pdf. You can ask additional questions about generative AI services, such as:
- What are the generative AI service offerings from AWS?
- What is Amazon Q optimized for?
- What are critical factors to consider when choosing an appropriate foundational model?
Next, we test access to the security incident response guide.
- We ask What are the four phases of the AWS security incident response process?
When asking questions about security topics from aws-security-incident-response-guide.pdf, the system returns no results. This behavior validates that document indexing respects the configured access permissions, and users can only access content they’re authorized to view.
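You can reproduce this check outside the web experience with the ChatSync API. The sketch below asks the same security question programmatically; it assumes the call is made with credentials that resolve to a subscribed user (for IAM Identity Center applications this typically requires an identity-aware session, omitted here for brevity), and the application ID is a placeholder.

```python
import boto3

# Sketch: ask a question through the ChatSync API instead of the web experience.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

APPLICATION_ID = "your-application-id"  # placeholder

response = qbusiness.chat_sync(
    applicationId=APPLICATION_ID,
    userMessage="What are the four phases of the AWS security incident response process?",
)

# The answer and any citations returned for the calling user; results reflect
# that user's document-level permissions.
print(response.get("systemMessage"))
for attribution in response.get("sourceAttributions", []):
    print("Cited:", attribution.get("title"))
```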
- To validate access controls for the security-engineers user group, log in as Jane Smith.
You can test with questions about security incident response:
- What are the key objectives of an AWS security incident response plan?
- What are the four phases of the AWS security incident response process?
- What are the recommended steps for containing and eradicating a security incident in AWS?
- What types of data should be collected during an AWS security incident investigation?
- What are the key considerations for recovering from an AWS security incident?
Troubleshooting
If you encounter issues during the setup or operation of your Amazon Q Business application with FSx for Windows File Server, refer to the detailed troubleshooting guide in the README file. The guide provides solutions for common configuration challenges and operational issues you might experience.
Clean up
To avoid ongoing charges, we recommend cleaning up the resources you created while following this guide. For step-by-step cleanup instructions, refer to the README file.
Conclusion
In this post, we provided an overview of the Amazon Q FSx connector and how you can use it for safe and seamless integration of generative AI assistance with your enterprise data source. By using Amazon Q in your organization, you can enable employees to be more data-driven, efficient, prepared, and productive. Lastly, we demonstrated how simple NLP search through Amazon Q Business enhances your ability to discover insights from your enterprise data more quickly and respond to your needs faster.
The Amazon Q Business application offers a compelling solution for organizations seeking to enhance their data-driven capabilities. By using its NLP and secure data source integration features, you can unlock the true value of your data and empower your teams to be more productive and efficient in their work.
To learn more about the Amazon Q connector for FSx for Windows File Server, refer to Connecting Amazon FSx (Windows) to Amazon Q Business.
About the Authors
Manjunath Arakere is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, based in Atlanta, Georgia. He partners with AWS customers to design and scale well-architected solutions, supporting their cloud migrations and modernization initiatives. With extensive experience in the field, Manjunath specializes in migration strategies, application modernization, serverless, and Generative AI (GenAI). He is passionate about helping organizations leverage the full potential of cloud computing to drive innovation and operational efficiency. Outside of work, Manjunath enjoys outdoor runs, tennis, volleyball, and challenging his son in PlayStation soccer games.
Imtranur Rahman is a Senior Solutions Architect on the WWPS team with over 14 years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and drive broad adoption of Amazon’s cloud computing platform. He specializes in containers, DevSecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.