Multiagent AI for generating chain-of-thought training data

Multiagent AI for generating chain-of-thought training data


Multiagent AI for generating chain-of-thought training data

Using ensembles of agents to generate and refine interactions annotated with chains of thought improves performance on a battery of benchmarks by an average of 29%.

Conversational AI

July 31, 05:18 PMJuly 31, 05:18 PM

Chain-of-thought reasoning, in which a large language model (LLM) is asked not only to perform multistep actions but to explain its reasons for taking the steps it does, has been shown to improve LLMs reasoning capability. One promising application of chain-of-thought (CoT) reasoning is ensuring that LLMs adhere to responsible-AI policies.

Using CoT to optimize an LLM for policy adherence requires high-quality training data annotated with chains of thoughts. But hiring human annotators to generate such training data is expensive and time consuming.

Inspired by current work on incorporating artificial experts into the standard LLM training pipeline, researchers in Amazons Artificial General Intelligence organization have begun exploring the possibility of using ensembles of AI agents to generate high-quality CoT data. We report the results of our initial experiments in a paper we presented at this years meeting of the Association for Computational Linguistics (ACL).

Using two different LLMs and five different datasets, we compared models fine tuned on data created through our multiagent-deliberation approach to both baseline pretrained models and models fine tuned through supervised fine tuning on conventional data.

Our approach achieves an increase in average safety (in-domain, out-of-domain, and jailbreaks) of 96% relative to the baseline and 73% relative to the conventionally fine-tuned model, when using a non-safety trained model (Mixtral). The increases were 12% and 44%, respectively, on a safety-trained model (Qwen).

Multiagent deliberation

Our approach divides the task of generating policy-compliant chains of thought into three stages, each of which uses LLMs: intent decomposition, deliberation, and refinement.

During intent decomposition, an LLM receives the user query and identifies explicit and implicit user intents. These, together with the query, are then passed to another LLM, which generates an initial CoT.

Deliberation is an iterative process in which multiple LLMs (agents) expand the CoT in sequential fashion, factoring in a defined set of policies. Each agent is prompted to review and correct the version of the CoT it receives or to confirm that its good as is. This stage ends when an agent judges the CoT complete or when a predefined deliberation budget is exhausted.

Finally, in the refinement stage, an LLM takes the outputs of the deliberation stage and post-processes them to filter out redundant, deceptive, and policy-inconsistent thoughts.

A schematic of our multiagent-deliberation framework to generate safety-embedded CoTs.

Evaluation

Following prior work, we analyze the quality of the generated CoTs by measuring three fine-grained attributes: (1) relevance, (2) coherence, and (3) completeness. Each attribute is evaluated on a scale from 1 to 5, where 1 represents the lowest quality and 5 represents the highest. As test data, we use examples from several standard CoT benchmark datasets.

We also assess faithfulness along three dimensions: (1) faithfulness between policy and the generated CoT; (2) faithfulness between policy and the generated response; and (3) faithfulness between the generated CoT and the final response. We use an LLM fine tuned as an auto-grader to evaluate faithfulness on a scale from 1 to 5, where 1 indicates minimal faithfulness, and 5 indicates complete adherence.

As can be seen in the table below, using our framework provides quality improvements across all metrics, with an improvement of more than 10% in CoTs policy faithfulness.

 

Average auto-grader scores on the generated-CoT datasets (1-5 scale), including general-reasoning metrics to evaluate the quality of CoTs and faithfulness metrics to evaluate policy adherence.

<tbody><tr style=”mso-yfti-irow:1″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Metric</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>LLM_ZS</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>AIDSAFE</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>delta</b><br></td></tr><tr style=”mso-yfti-irow:2″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Relevance<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>4.66<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>4.68</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>0.43%<br></td></tr><tr style=”mso-yfti-irow:3″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Coherence<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext 
.5pt;padding:0in 5.4pt 0in 5.4pt”>4.93<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>4.96</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>0.61%<br></td></tr><tr style=”mso-yfti-irow:4″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Completeness<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>4.86<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>4.92</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>1.23%<br></td></tr><tr style=”mso-yfti-irow:5″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>CoTs faithfulness (policy)<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>3.85<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>4.27</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext 
.5pt;padding:0in 5.4pt 0in 5.4pt”>10.91%<br></td></tr><tr style=”mso-yfti-irow:6″><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Response faithfulness (policy)<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>4.85<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>4.91</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>1.24%<br></td></tr><tr style=”mso-yfti-irow:7;mso-yfti-lastrow:yes”><td colspan=”1″ rowspan=”1″ width=”264″ valign=”top” style=”width:2.75in;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Response faithfulness (CoT)<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>4.99<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>5</b><br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>0.20%<br></td></tr></tbody>

Fine tuning

We use several benchmarks to measure the performance improvements provided by our generated CoT data: Beavertails (for safety), WildChat, XSTest (for overrefusal, or erroneously flagging safe generations as unsafe), MMLU (for utility), and StrongREJECT (for jailbreak robustness).

We used two different LLMs in our tests, the widely used open-source models Qwen and Mixtral. The base versions of these models provide one baseline, and we add another baseline by fine-tuning these models with only the prompts and responses from the original dataset not the generated CoTs. Our method shows significant improvements over baseline, specifically on safety and jailbreak robustness, with some trade-offs on utility and overrefusal.

Below are the results of evaluation of the supervised fine-tuned (SFT) model. “Base” denotes the LLM without SFT, SFT_OG denotes the model SFTd on the original response data without any CoTs, and SFT_DB denotes the model SFTd on our generated CoTs and responses. (If the full table doesn’t fit on your browser, try scrolling right.)

LLM: Mixtral

<tbody><tr><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Eval</b></td><td colspan=”1″ rowspan=”1″><b>Dimension</b></td><td colspan=”1″ rowspan=”1″><b>Metric</b></td><td colspan=”1″ rowspan=”1″><b>Dataset</b></td><td colspan=”1″ rowspan=”1″><b>Base</b></td><td colspan=”1″ rowspan=”1″><b>SFT_OG</b></td><td colspan=”1″ rowspan=”1″><b>SFT_DB (ours)</b></td></tr><tr><td colspan=”1″ rowspan=”1″>Safety</td><td colspan=”1″ rowspan=”1″>Safe response</td><td colspan=”1″ rowspan=”1″>rate</td><td colspan=”1″ rowspan=”1″>Beavertails</td><td colspan=”1″ rowspan=”1″>76</td><td colspan=”1″ rowspan=”1″>79.57</td><td colspan=”1″ rowspan=”1″><b>96</b></td></tr><tr><td colspan=”1″ rowspan=”1″>WildChat</td><td colspan=”1″ rowspan=”1″></td><td colspan=”1″ rowspan=”1″></td><td colspan=”1″ rowspan=”1″></td><td colspan=”1″ rowspan=”1″>31</td><td colspan=”1″ rowspan=”1″>33.5</td><td colspan=”1″ rowspan=”1″><b>85.95</b></td></tr><tr><td colspan=”1″ rowspan=”1″>Overrefusal</td><td colspan=”1″ rowspan=”1″>1-Overrefuse</td><td colspan=”1″ rowspan=”1″>rate</td><td colspan=”1″ rowspan=”1″>XSTest</td><td colspan=”1″ rowspan=”1″><b>98.8</b></td><td colspan=”1″ rowspan=”1″>87.6</td><td colspan=”1″ rowspan=”1″>91.84</td></tr><tr><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt”>Utility</td><td colspan=”1″ rowspan=”1″>Answer</td><td colspan=”1″ rowspan=”1″>accuracy</td><td colspan=”1″ rowspan=”1″>MMLU</td><td colspan=”1″ rowspan=”1″><b>35.42</b></td><td colspan=”1″ rowspan=”1″>31.38</td><td colspan=”1″ rowspan=”1″>34.51</td></tr><tr><td colspan=”1″ rowspan=”1″>Jailbreak Robustness</td><td colspan=”1″ rowspan=”1″>Safe response</td><td colspan=”1″ rowspan=”1″>rate</td><td colspan=”1″ rowspan=”1″>StrongREJECT</td><td colspan=”1″ rowspan=”1″>51.09</td><td colspan=”1″ rowspan=”1″>67.01</td><td colspan=”1″ rowspan=”1″><b>94.04</b></td></tr></tbody>

LLM: Qwen

<tbody><tr><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Eval</b></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Dimension</b></td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Metric</b></td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Dataset</b></td><td colspan=”1″ rowspan=”1″ width=”53″ valign=”top” style=”width:.55in;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>Base</b></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>SFT_OG</b></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>SFT_DB (ours)</b></td></tr><tr style=”mso-yfti-irow:0;mso-yfti-firstrow:yes”><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Safety<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Safe response<br></td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>rate<br></td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Beavertails<br></td><td colspan=”1″ rowspan=”1″ width=”53″ valign=”top” style=”width:.55in;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>94.14<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>87.95<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>97</b><br></td></tr><tr style=”mso-yfti-irow:1″><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; 
border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt”>WildChat<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”> <br>-</td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”> <br>-</td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”> <br>-</td><td colspan=”1″ rowspan=”1″ width=”53″ valign=”top” style=”width:.55in;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>95.5<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>59.42<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>96.5</b><br></td></tr><tr style=”mso-yfti-irow:2″><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt”>Overrefusal<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>1-Overrefuse<br></td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>rate<br></td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>XSTest<br></td><td colspan=”1″ rowspan=”1″ 
width=”53″ valign=”top” style=”width:.55in;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>99.2</b><br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>98<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>93.6<br></td></tr><tr style=”mso-yfti-irow:3″><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt”>Utility<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Answer<br></td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>accuracy<br></td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>MMLU<br></td><td colspan=”1″ rowspan=”1″ width=”53″ valign=”top” style=”width:.55in;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>75.78</b><br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>55.73<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>60.52<br></td></tr><tr style=”mso-yfti-irow:4;mso-yfti-lastrow:yes”><td colspan=”1″ rowspan=”1″ width=”71″ valign=”top” style=”width:53.4pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid 
windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt”>Jailbreak Robustness<br></td><td colspan=”1″ rowspan=”1″ width=”88″ valign=”top” style=”width:66.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>Safe response<br></td><td colspan=”1″ rowspan=”1″ width=”70″ valign=”top” style=”width:52.5pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>rate<br></td><td colspan=”1″ rowspan=”1″ width=”104″ valign=”top” style=”width:77.9pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>StrongREJECT<br></td><td colspan=”1″ rowspan=”1″ width=”53″ valign=”top” style=”width:.55in;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>72.84<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.6pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”>59.48<br></td><td colspan=”1″ rowspan=”1″ width=”65″ valign=”top” style=”width:48.75pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt”><b>95.39</b><br></td></tr></tbody>

Acknowledgements: We would like to acknowledge our coauthors and collaborators, Kai-Wei Chang, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Aram Galstyan, Richard Zemel, and Rahul Gupta, for their contributions.

Research areas: Conversational AI

Tags: Association for Computational Linguistics (ACL), Responsible AI , Commonsense reasoning

Read More

Introducing AWS Batch Support for Amazon SageMaker Training jobs

Introducing AWS Batch Support for Amazon SageMaker Training jobs

Picture this: your machine learning (ML) team has a promising model to train and experiments to run for their generative AI project, but they’re waiting for GPU availability. The ML scientists spend time monitoring instance availability, coordinating with teammates over shared resources, and managing infrastructure allocation. Simultaneously, your infrastructure administrators spend significant time trying to maximize utilization and minimize idle instances that lead to cost-inefficiency.

This isn’t a unique story. We heard from customers that instead of managing their own infrastructure and job ordering, they wanted a way to queue, submit, and retry training jobs while using Amazon SageMaker AI to perform model training.

AWS Batch now seamlessly integrates with Amazon SageMaker Training jobs. This integration delivers intelligent job scheduling and automated resource management while preserving the fully managed SageMaker experience your teams are familiar with. ML scientists can now focus more on model development and less on infrastructure coordination. At the same time, your organization can optimize the usage of costly accelerated instances, increasing productivity and decreasing costs. The following example comes from Toyota Research Institute (TRI):

“With multiple variants of Large Behavior Models (LBMs) to train, we needed a sophisticated job scheduling system. AWS Batch’s priority queuing, combined with SageMaker AI Training Jobs, allowed our researchers to dynamically adjust their training pipelines—enabling them to prioritize critical model runs, balance demand across multiple teams, and efficiently utilize reserved capacity. The result was ideal for TRI: we maintained flexibility and speed while being responsible stewards of our resources.”
–Peter Richmond, Director of Information Engineering

In this post, we discuss the benefits of managing and prioritizing ML training jobs to use hardware efficiently for your business. We also walk you through how to get started using this new capability and share suggested best practices, including the use of SageMaker training plans.

Solution overview

AWS Batch is a fully managed service for developers and researchers to efficiently run batch computing workloads at different scales without the overhead of managing underlying infrastructure. AWS Batch dynamically provisions the optimal quantity and type of compute resources based on the volume and specific requirements of submitted batch jobs. The service automatically handles the heavy lifting of capacity planning, job scheduling, and resource allocation, so you can focus on your application logic rather than managing underlying infrastructure.

When you submit a job, AWS Batch evaluates the job’s resource requirements, queues it appropriately, and launches the necessary compute instances to run the job, scaling up during peak demand and scaling down to zero when no jobs are running. Beyond basic orchestration, AWS Batch includes intelligent features like automatic retry mechanisms that restart failed jobs based on configurable retry strategies, and fair share scheduling to manage equitable resource distribution among different users or projects by preventing a single entity from monopolizing compute resources. This can be especially useful if your organization has production workloads that should be prioritized. AWS Batch has been used by many customers with submit-now, run-later semantics for scheduling jobs and achieving high utilization of compute resources on Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and now SageMaker Training jobs.

AWS Batch for SageMaker Training jobs consists of the following key components that work together to deliver seamless batch processing:

  • Training jobs serve as blueprints that specify how jobs should run, including Docker container images, instance types, AWS Identity and Access Management (IAM) roles, and environment variables
  • Job queues act as holding areas where jobs wait to be executed, with configurable priority levels that determine execution order
  • Service environments define the underlying infrastructure maximum capacity

With these foundations, AWS Batch can retry for transient failures and provide comprehensive queue visualization, addressing critical pain points that have been challenging to address with ML workflows. The integration provides automatic retry for transient failures, bulk job submission, enabling scientists to focus on model improvements instead of infrastructure management.

To use an AWS Batch queue for SageMaker Training jobs, you must have a service environment and a job queue. The service environment represents the Amazon SageMaker AI capacity limits available to schedule, expressed through maximum number of instances. The job queue is the scheduler interface researchers interact with to submit jobs and interrogate job status. You can use the AWS Batch console, or AWS Command Line Interface (AWS CLI) to create these resources. In this example, we create a First-In-First-Out (FIFO) job queue and a service environment pool with a limit of five ml.g5.xlarge instances using the AWS Batch console. The following diagram illustrates the solution architecture.

Prerequisites

Before you deploy this solution, you must have an AWS account with permissions to create and manage AWS Batch resources. For this example, you can use these Sample IAM Permissions along with your SageMaker AI execution role.

Create a service environment

Complete the following steps to create the service environment you will associate with the training job queue:

  1. On the AWS Batch console, choose Environments in the navigation pane.
  2. Choose Create environment, then choose Service environment.

  1. Provide a name for your service environment (for this post, we name it ml-g5-xl-se).
  2. Specify the maximum number of compute instances that will be available to this environment for model training (for this post, we set it to 5). You can update the value for your capacity limit later as needed.
  3. Optionally, specify tags for your service environment.
  4. Create your service environment.

Create a job queue

Complete the following steps to create your job queue:

  1. On the AWS Batch console, choose Job queues in the navigation pane.
  2. Choose Create job queue.
  3. For Orchestration type, select SageMaker Training.

  1. Provide a name for your job queue (for this post, we name it my-sm-training-fifo-jq).
  2. For Connected service environment, choose the service environment you created.
  3. Leave the remaining settings as default and choose Create job queue.

You can explore fair-share queues by reading more about the scheduling policy parameter. Additionally, you can use job state limits to configure your job queue to take automatic action to unblock itself in the event that a user submitted jobs that are misconfigured or remain capacity constrained beyond a configurable period of time. These are workload-specific parameters that you can tune to help optimize your throughput and resource utilization.

Submit SageMaker Training jobs to AWS Batch from the SageMaker Python SDK

The newly added aws_batch module within the SageMaker Python SDK allows you to programmatically create and submit SageMaker Training jobs to an AWS Batch queue using Python. This includes helper classes to submit both Estimators and ModelTrainers. You can see an example of this in action by reviewing the sample Jupyter notebooks. The following code snippets summarize the key pieces.

Complete the basic setup steps to install a compatible version of the SageMaker Python SDK:

!pip install sagemaker

To use the job queue you configured earlier, you can refer to it by name. The Python SDK has built-in support for the integration within the TrainingQueue class:

from sagemaker.aws_batch.training_queue import TrainingQueue

JOB_QUEUE_NAME = 'my-sm-training-fifo-jq'
training_queue = TrainingQueue(JOB_QUEUE_NAME)

For this example, we focus on the simplest job that you can run, either a class that inherits from EstimatorBase or ModelTrainer, a hello world job. You can use a ModelTrainer or Estimator, such as PyTorch, instead of the placeholder:

from sagemaker.session import Session
from sagemaker import image_uris
session = Session()

image_uri = image_uris.retrieve(
    framework="pytorch",
    region=session.boto_session.region_name,
    version="2.5",
    instance_type=INSTANCE_TYPE,
    image_scope="training"
)
from sagemaker.estimator import Estimator

EXECUTION_ROLE = get_execution_role()
INSTANCE_TYPE = 'ml.g5.xlarge'
TRAINING_JOB_NAME = 'hello-world-simple-job'

estimator = Estimator(
    image_uri=image_uri,
    role=EXECUTION_ROLE,
    instance_count=1,
    instance_type=INSTANCE_TYPE,
    volume_size=1,
    base_job_name=TRAINING_JOB_NAME,
    container_entry_point=['echo', 'Hello', 'World'],
    max_run=300,
)

training_queued_job = training_queue.submit(training_job=estimator, inputs=None)

Submitting an estimator job is as straightforward as creating the estimator and then calling queue.submit. This particular estimator doesn’t require any data, but in general, data should be provided by specifying inputs. Alternatively, you can queue a ModelTrainer using AWS Batch by calling queue.submit, shown in the following code:

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

source_code = SourceCode(command="echo 'Hello World'")

model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=TRAINING_JOB_NAME,
    compute={"instance_type": INSTANCE_TYPE, "instance_count": 1},
    stopping_condition={"max_runtime_in_seconds": 300}
)

training_queued_job = training_queue.submit(training_job=model_trainer, inputs=None)

Monitor job status

In this section, we demonstrate two methods to monitor the job status.

Display the status of jobs using the Python SDK

The TrainingQueue can list jobs by status, and each job can be described individually for more details:

submitted_jobs = training_queue.list_jobs(status="SUBMITTED")
pending_jobs = training_queue.list_jobs(status="PENDING")
runnable_jobs = training_queue.list_jobs(status="RUNNABLE")
scheduled_jobs = training_queue.list_jobs(status="SCHEDULED")
starting_jobs = training_queue.list_jobs(status="STARTING")
running_jobs = training_queue.list_jobs(status="RUNNING")
completed_jobs = training_queue.list_jobs(status="SUCCEEDED")
failed_jobs = training_queue.list_jobs(status="FAILED")

all_jobs = submitted_jobs + pending_jobs + runnable_jobs + scheduled_jobs + starting_jobs + running_jobs + completed_jobs + failed_jobs

for job in all_jobs:
    job_status = job.describe().get("status", "")
    print(f"Job : {job.job_name} is {job_status}")

After a TrainingQueuedJob has reached the STARTING status, the logs can be printed from the underlying SageMaker AI training job:

import time

while True:
    job_status = training_queued_job.describe().get("status", "")

    if job_status in {"STARTING", "RUNNING", "SUCCEEDED", "FAILED"}:
        break

    print(f"Job : {training_queued_job.job_name} is {job_status}")
    time.sleep(5)

training_queued_job.get_estimator().logs()

Display the status of jobs on the AWS Batch console

The AWS Batch console also provides a convenient way to view the status of running and queued jobs. To get started, navigate to the overview dashboard, as shown in the following screenshot.

From there, you can choose on the number underneath the AWS Batch job state you’re interested in to see the jobs in your queue that are in the given state.

Choosing an individual job in the queue will bring you to the job details page.

You can also switch to the SageMaker Training job console for a given job by choosing the View in SageMaker link on the AWS Batch job details page. You will be redirected to the corresponding job details page on the SageMaker Training console.

Whether you use the AWS Batch console or a programmatic approach to inspecting the jobs in your queue, it is generally useful to know how AWS Batch job states map to SageMaker Training job states. To learn how that mapping is defined, refer to the Batch service job status overview page found within the Batch user guide.

Best practices

We recommend creating dedicated service environments for each job queue in a 1:1 ratio. FIFO queues deliver basic fire-and-forget semantics, whereas fair share scheduling queues provide more sophisticated scheduling, balancing utilization within a share identifier, share weights, and job priority. If you don’t need multiple shares but want to assign a priority on job submission, we recommend creating a fair share scheduling queue and using a single share within it for all submissions.

This integration works seamlessly with SageMaker Flexible Training Plans (FTP); simply set the TrainingPlanArn as part of the CreateTrainingJob JSON request, which is passed to AWS Batch. If the goal is for a single job queue to keep that FTP fully utilized, setting capacityLimits on the service environment to match the capacity allocated to the flexible training plan will allow the queue to maintain high utilization of all the capacity.

If the same FTP needs to be shared among many teams, each with a firm sub-allocation of capacity (for example, dividing a 20-instance FTP into 5 instances for a research team and 15 instances for a team serving production workloads), then we recommend creating two job queues and two service environments. The first job queue, research_queue, would be connected to the research_environment service environment with a capacityLimit set to 5 instances. The second job queue, production_queue, would be connected to a production_environment service environment with a capacity limit of 15. Both research and production team members would submit their requests using the same FTP.

Alternatively, if a strict partition isn’t necessary, both teams can share a single fair share scheduling job queue with separate share identifiers, which allows the queue to better utilize available capacity.

We recommend not using the SageMaker warm pool feature, because this can cause capacity to be idle.

Conclusion

In this post, we covered the new capability to use AWS Batch with SageMaker Training jobs and how to get started setting up your queues and submitting your jobs. This can help your organization schedule and prioritize jobs, freeing up time for your infrastructure admins and ML scientists. By implementing this functionality, your teams can focus on their workloads and not waste time managing and coordinating infrastructure. This capability is especially powerful using SageMaker training plans so that your organization can reserve capacity in the quantity you need, during the time you need it. By using AWS Batch with SageMaker AI, you can fully utilize the training plan for the most efficiency. We encourage you to try out this new capability so it can make a meaningful impact in your operations!


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

David Lindskog is a Senior Software Engineer at AWS Batch. David has worked across a broad spectrum of projects at Amazon, and specializes in designing and implementing complex, scalable distributed systems and APIs that solve challenging technical problems.

Mike Moore is a Software Development Manager at AWS Batch. He works in high performance computing, with a focus on the application of simulation to the analysis and design of spacecraft and robotic systems. Prior to joining AWS, Mike worked with NASA to build spacecraft simulators to certify SpaceX Dragon and CST-100’s ascent abort systems for crew flight readiness. He lives in Seattle with his wife and daughter, where they enjoy hiking, biking, and sailing.

Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. Utilizing his twenty years of experience, he helps accelerate tech transformation of automotive companies. In his free time, he enjoys playing video games and travel.

Michelle Goodstein is a Principal Engineer on AWS Batch. She focuses on scheduling improvements for AI/ML to drive utilization, efficiency, and cost optimization, as well as improved observability into job execution lifecycle and efficiency. She enjoys building innovative solutions to distributed systems problems spanning data, compute, and AI/ML.

Michael Oguike is a Product Manager for Amazon SageMaker AI. He is passionate about using technology and AI to solve real-world problems. At AWS, he helps customers across industries build, train, and deploy AI/ML models at scale. Outside of work, Michael enjoys exploring behavioral science and psychology through books and podcasts.

Angel Pizarro is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.

Tom Burggraf is the Head of Product for AWS Batch, where he champions innovative features that help research platform builders achieve unprecedented scale and operational efficiency. He specializes in identifying novel ways to evolve AWS Batch capabilities, particularly in democratizing high-performance computing for complex scientific and analytical workloads. Prior to AWS, he was a product leader in FinTech and served as a consultant for product organizations across multiple industries, bringing a wealth of cross-industry expertise to cloud computing challenges.

Read More

Structured outputs with Amazon Nova: A guide for builders

Structured outputs with Amazon Nova: A guide for builders

Developers building AI applications face a common challenge: converting unstructured data into structured formats. Structured output is critical for machine-to-machine communication use cases, because this enables downstream use cases to more effectively consume and process the generated outputs. Whether it’s extracting information from documents, creating assistants that fetch data from APIs, or developing agents that take actions, these tasks require foundation models to generate outputs in specific structured formats.

We launched constrained decoding to provide reliability when using tools for structured outputs. Now, tools can be used with Amazon Nova foundation models (FMs) to extract data based on complex schemas, reducing tool use errors by over 95%.

In this post, we explore how you can use Amazon Nova FMs for structured output use cases.

Techniques for implementing structured outputs

When addressing the requirements for structured outputs use cases, there are two common approaches for implementation. You can modify the system prompt or take advantage of tool calling. For example, in a customer support use case, you might want the model to output a JSON with its response to the user and the current sentiment. So, the system prompt would be modified to include the expected structure:

Make sure your final response is valid JSON that follows the below response schema: 

##Response schema
```json
{
   "response": "the response to the customer",
   "sentiment": "the current customer sentiment"
}```

The other option is to provide a tool configuration. Tool calling is the act of providing an API, code function, or schema (or structure) required by your end application to the model through the request schema with the Converse API. This is most used when building agentic applications but is also frequently used in structured output use cases because of the ability to define a set schema that the model should adhere to.

tool_config = {
    "tools": [ 
        {
            "toolSpec": {
                "name": "respondToUser",
                "description": "the formatted response to the customer",
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "response": {
                            "description": "the response to the customer",
                            "type": "string"
                        },
                        "sentiment": {
                            "description": "the current customer sentiment",
                            "type": "string"
                        }
                    },
                    "required": [
                        "response",
                        "sentiment"
                    ]
                }
            }
        }
    ]
}

Both approaches can be effective prompting techniques to influence the model output. However, the output is still non-deterministic and there is room for failure. In our work with customers to implement use cases such as agentic workflows and applications and structured extraction, we’ve observed that the accuracy of the model tends to decrease as the schema becomes more complex.

Structured output with Amazon Nova models

Based on these learnings, we have implemented constrained decoding in our system to help ensure high model reliability in the output generated and to allow the model to handle complex schemas with ease. Constrained decoding relies on a grammar to constrain the possible tokens a model can output at each step. This is differentiated from the prompting techniques historically used, because this changes the actual tokens a model can choose from when generating an output. For example, when closing a JSON object, the model would be constrained to just a } token to select. Constrained decoding is used every time a tool configuration is passed. Because tool use provides us a specific schema already, we can use that to generate a grammar dynamically, based on the schema desired by the developer. Constrained decoding prevents the model from generating invalid keys and enforces correct data types based on the defined schema.

Schema definition process

A key step in using structured outputs with Amazon Nova is to create a tool configuration. The tool configuration provides a standard interface to define the expected output schema. While the primary intent of a tool configuration is to provide external functionality to the model, this JSON interface is used in structured output use cases as well. This can be illustrated using a use case that extracts recipes from online content. To start the integration, we create a tool configuration representing the specific fields we want extracted from the invoices. When creating a tool configuration, it is important to be clear and concise because the property names and descriptions are what inform the model how the fields should be populated.

tool_config = {
   "tools": [
        {
            "toolSpec": {
                "name": "extract_recipe",
                "description": "Extract recipe for cooking instructions",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "recipe": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string",
                                        "description": "Name of the recipe"
                                    },
                                    "description": {
                                        "type": "string",
                                        "description": "Brief description of the dish"
                                    },
                                    "prep_time": {
                                        "type": "integer",
                                        "description": "Preparation time in minutes"
                                    },
                                    "cook_time": {
                                        "type": "integer",
                                        "description": "Cooking time in minutes"
                                    },
                                    "servings": {
                                        "type": "integer",
                                        "description": "Number of servings"
                                    },
                                    "difficulty": {
                                        "type": "string",
                                        "enum": [
                                            "easy",
                                            "medium",
                                            "hard"
                                        ],
                                        "description": "Difficulty level of the recipe"
                                    },
                                    "ingredients": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "name": {
                                                    "type": "string",
                                                    "description": "Name of ingredient"
                                                },
                                                "amount": {
                                                    "type": "number",
                                                    "description": "Quantity of ingredient"
                                                },
                                                "unit": {
                                                    "type": "string",
                                                    "description": "Unit of measurement"
                                                }
                                            },
                                            "required": [
                                                "name",
                                                "amount",
                                                "unit"
                                            ]
                                        }
                                    },
                                    "instructions": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Step-by-step cooking instructions"
                                        }
                                    },
                                    "tags": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Categories or labels for the recipe"
                                        }
                                    }
                                },
                                "required": [
                               ]
                            }
                        },
                        "required": [
                        ]
                    }
                }
            }
        }
    ]
}

After the tool configuration has been created, we can pass it to the Converse API along with the recipe, which is contained in the user prompt. A system prompt has historically been used in structured output use cases to guide the model on how to format its output; in this case, we can use it instead to convey the system role and persona.

import boto3

# Amazon Bedrock Runtime client; assumes AWS credentials and Region are already configured
client = boto3.client("bedrock-runtime")

model_response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    system=[{"text": "You are an expert recipe extractor that compiles recipe details from blog posts"}],
    messages=[{"role": "user", "content": content}],  # content holds the recipe text from the blog post
    inferenceConfig={"temperature": 0},
    toolConfig=tool_config
)

By using native tool use support with constrained decoding, we get a parsed tool call that follows valid syntax and the expected schema defined in the tool configuration.

{
    "toolUse": {
        "toolUseId": "tooluse_HDCl-Y8gRa6yWTU-eE97xg",
        "name": "extract_recipe",
        "input": {
            "recipe": {
                "name": "Piacenza Tortelli",
                "description": "Piacenza tortelli, also known as 'tortelli with the tail' due to their elongated shape, are a delicious fresh pasta, easy to make at home!",
                "prep_time": 60,
                "cook_time": 10,
                "servings": 4,
                "difficulty": "hard",
                "ingredients": [
                    {
                        "name": "Type 00 flour",
                        "amount": 2.3,
                        "unit": "cups"
                    },
                    {
                        "name": "Eggs",
                        "amount": 3,
                        "unit": ""
                    },
                    {
                        "name": "Fine salt",
                        "amount": 1,
                        "unit": "pinch"
                    },
                    {
                        "name": "Spinach",
                        "amount": 13.3,
                        "unit": "cups"
                    },
                    {
                        "name": "Cow's milk ricotta cheese",
                        "amount": 1.3,
                        "unit": "cups"
                    },
                    {
                        "name": "Parmigiano Reggiano PDO cheese",
                        "amount": 4.2,
                        "unit": "oz"
                    },
                    {
                        "name": "Fine salt",
                        "amount": 1,
                        "unit": "to taste"
                    },
                    {
                        "name": "Nutmeg",
                        "amount": 1,
                        "unit": "to taste"
                    },
                    {
                        "name": "Butter",
                        "amount": 80,
                        "unit": "g"
                    },
                    {
                        "name": "Sage",
                        "amount": 2,
                        "unit": "sprigs"
                    }
                ],
                "instructions": [
                    "Arrange the flour in a mound and pour the eggs into the center 1; add a pinch of salt and start working with a fork 2, then knead by hand 3.",
                    "You should obtain a smooth dough 4; wrap it in plastic wrap and let it rest for half an hour in a cool place.",
                    "Meanwhile, prepare the filling starting with the spinach: immerse them in boiling salted water 5 and blanch them for a few minutes until wilted 6.",
                    "Drain the spinach and transfer them to cold water 7, preferably with ice. Then squeeze them very well 8 and chop them finely with a knife 9.",
                    "Place the chopped spinach in a bowl, add the ricotta 10, salt, pepper, and nutmeg 11. Also add the grated Parmigiano Reggiano DOP 12.",
                    "Mix well until you get a homogeneous consistency 13.",
                    "At this point, take the dough that has now rested 14, take a portion of it keeping the rest covered. Lightly flatten the dough with a rolling pin 15.",
                    "Roll it out with a pasta machine 16; as you reduce the thickness, fold the dough over itself 17 and roll it out again 18.",
                    "You should get a very thin rectangle, about 0.04-0.08 inches thick 19. Cut 2 strips of dough by dividing the rectangle in half lengthwise 20, then cut out diamonds of 4 inches 21.",
                    "Fill the diamonds with the spinach filling 22 and close them. To do this, bring one of the two longer points inward 23, then fold the two side points towards the center 24.",
                    "Now close the tortello by pinching the dough in the center and moving gradually towards the outside 25. The movement is similar to the closure of culurgiones. Continue in this way until the dough and filling are finished 26; you will get about 40-45 pieces.",
                    "Place a pot full of salted water on the stove. Meanwhile, in a pan, pour the butter and sage 27. Turn on the heat and let it flavor.",
                    "Then cook the tortelli for 5-6 minutes 28, then drain them and toss them in the butter and sage sauce 29.",
                    "Plate and serve the Piacenza tortelli with plenty of grated Parmigiano Reggiano DOP 30!"
                ],
                "tags": [
                    "vegetarian",
                    "Italian"
                ]
            }
        }
    }
}

With constrained decoding, we can use a smaller model such as Amazon Nova Lite to reliably produce output that conforms to a large, complex JSON schema for use in our application. For image-based use cases with complex schemas, we recommend that you use Nova Pro or Nova Premier for the best performance.
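To consume the structured result in application code, you can read the toolUse block from the Converse API response. The following is a minimal sketch that assumes the response shape shown above; error handling is omitted for brevity.

# Extract the parsed recipe from the Converse API response (sketch).
# Assumes the model returned a toolUse block named "extract_recipe", as shown above.
recipe = None
for block in model_response["output"]["message"]["content"]:
    tool_use = block.get("toolUse")
    if tool_use and tool_use["name"] == "extract_recipe":
        recipe = tool_use["input"]["recipe"]
        break

if recipe:
    print(f"{recipe['name']}: {recipe['prep_time']} min prep, {recipe['cook_time']} min cook")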

Conclusion

By using structured output with Amazon Nova through tool calling, you can take advantage of the key benefits of constrained decoding and build a reliable system. We encourage you to try this out in your applications today. Learn more at the Amazon Nova User Guide. Get started building your AI applications with Amazon Nova in the Amazon Bedrock console.


About the authors

Jean Farmer is a Generative AI Solutions Architect on the Amazon Artificial General Intelligence (AGI) team, specializing in agentic applications. Based in Seattle, Washington, she works at the intersection of autonomous AI systems and practical business solutions, helping to shape the future of AGI at Amazon.

Mukund Birje is a Sr. Product Marketing Manager on the AIML team at AWS. In his current role he’s focused on driving adoption of Amazon Nova Foundation Models. He has over 10 years of experience in marketing and branding across a variety of industries. Outside of work you can find him hiking, reading, and trying out new restaurants. You can connect with him on LinkedIn.

Read More

AI agents unifying structured and unstructured data: Transforming support analytics and beyond with Amazon Q Plugins

As organizations seek to derive greater value from their AWS Support data, operational teams are looking for ways to transform raw support cases and health events into actionable insights. While traditional analytics tools can provide basic reporting capabilities, teams need more sophisticated solutions that can understand and process natural language queries about their operational data. Retrieval-Augmented Generation (RAG) architecture forms the foundation for optimizing large language model outputs by referencing authoritative knowledge bases outside of their training data before generating responses. This architecture uses the power of semantic search and information retrieval capabilities to enhance accuracy.

In our previous blog post, Derive meaningful and actionable operational insights from AWS Using Amazon Q Business, we introduced a RAG-based solution using Amazon Q Business. However, while this approach excels at semantic search, it can face challenges with precise numerical analysis and aggregations. In this post, we address these limitations by showing how you can enhance Amazon Q with custom plugins to improve support analytics and incident response. Our solution combines the strengths of RAG with structured data querying, enabling more accurate answers to analytical questions about your AWS Support data. Detailed deployment instructions for the solution in this post are available in our AWS Samples GitHub repository.

Understanding RAG’s design for analytics

RAG architecture excels at finding relevant information through context-based matching, enabling natural language interactions with external data. However, the fundamental architecture of vector similarity search, which prioritizes semantic relevance over analytical aggregation capabilities, can introduce significant inaccuracies. Let’s examine this limitation with a simple query:

Give me total count of support cases in Feb 2025

The following RAG-only response shows the system attempting to piece together information from various documents, reporting 190 cases through potentially hallucinated aggregations.

Based on the provided JSON files, there were 190 AWS support cases documented in 
February 2025, calculated from the following case counts: 17 cases from Feb 1-3, 
12 cases from Feb 4-5, 21 cases from Feb 7-8, 21 cases from Feb 8-10, 16 cases from 
Feb 10-11, 10 cases from Feb 11-12, 23 cases from Feb 19-20, 22 cases from Feb 20-21,
27 cases from Feb 23-24, and 21 cases from Feb 25-26.

However, when enhanced with structured data querying capabilities, the same query yields 958 support cases, the accurate count verified against the actual dataset:

Based on the query results, there were a total of 958 support cases in February 2025.

This limitation becomes even more apparent with moderately complex queries that require categorical analysis:

Give me OpenSearch Support cases, distribution based on severity, opened in Jan 2025?

The following is a RAG-only response showing limited or inaccurate categorical breakdown:

Based on the support cases data from January 2025, here is the severity distribution 
for OpenSearch cases: 

Urgent: One case was opened regarding cluster health degradation and production system 
downtime

Normal: Two cases were identified:
1. A case about OpenSearch RI Exchange
2. A case concerning blue/green deployment upgrade from version 1.3 to 2.17

The cases demonstrate varying levels of urgency, with one critical production issue 
and two normal severity cases requiring standard response times.

With structured data querying, we get precise severity distributions:

Based on the query results for OpenSearch Support cases opened in January 2025,
here's the distribution by severity:
1. Normal severity: 4 cases
2. Critical severity: 4 cases
3. High severity: 3 cases
4. Low severity: 2 cases
5. Urgent severity: 1 case

In total, there were 14 OpenSearch Support cases opened in January 2025. 
The distribution shows that Normal and Critical severity cases were the most common, 
followed by High severity cases. There were fewer Low and Urgent severity cases 
during this period.

The gap widens further with multi-dimensional analysis. Consider the following query, which expects a detailed analysis across account-service combinations and temporal patterns.

Query:

Find accounts with frequent high-priority support cases since January 2024, 
excluding limit increase requests. For each account and AWS service combination, 
show the total number of cases and how many different days had issues. 
Only include results where there were more than 3 cases, and sort results by the 
highest number of cases first.

Response (with structured data querying):

Based on the analysis of high-priority support cases since January 2024, excluding
limit increase requests, here are the key findings:

1. The account with the most frequent high-priority cases is 111111111111, with 
60 cases for the Connect service.
2. The top 5 account and service combinations with the highest number of cases are:
* Account 111111111111 (Connect): 60 cases
* Account 222222222222 (RDS Oracle): 52 cases
* Account 333333333333 (OpenSearch Service): 38 cases
* Account 444444444444 (Aurora PostgreSQL): 25 cases
* Account 555555555555 (EMR): 23 cases

The services that appear most frequently in the top results are:
* Aurora PostgreSQL
* OpenSearch Service
* RDS Oracle
* EMR
* DynamoDB

Interestingly, the number of total cases matches the number of unique days with 
cases for each account-service combination. This suggests that these accounts 
are experiencing issues on different days rather than multiple issues on the 
same day.

There are 89 account-service combinations that have more than 3 high-priority 
cases since January 2024.

Some accounts appear multiple times with different services, indicating they 
may be experiencing issues across various AWS offerings. For example:
* Account 222222222222 appears with RDS Oracle (52 cases), DynamoDB (18 cases), 
and Aurora PostgreSQL (6 cases)
* Account 111111111111 appears with Connect (60 cases) and other services

This information can be useful for identifying accounts that may need additional 
support or proactive engagement to address recurring issues. Would you like to 
focus on a specific aspect of this data or explore further details about 
particular accounts or services?

The following is a RAG-only response showing inability to process the same complex query:

Based on the provided data source, I cannot provide a complete analysis of 
high-priority support cases and their frequency patterns. While there are support 
cases mentioned in the data from January 2024, the data does not contain 
sufficient information about case priorities or enough cases per account 
to identify patterns meeting the threshold of more than 3 cases. 
Additionally, while some technical and customer service cases are shown, 
the data is fragmented across different dates without providing a comprehensive 
view of case volumes or frequencies per account and service.

These examples demonstrate how RAG alone struggles with analytical queries of increasing complexity, while our plugin-enhanced solution delivers precise, structured analysis at every level. When working with scenarios requiring precise numerical analysis, we can enhance RAG’s capabilities through structured approaches:

  • Aggregation and pattern analysis: When user prompts include aggregation queries (such as counts, totals, or distributions), they require exact numerical computation through structured querying to provide precise results. Vector similarity search alone cannot guarantee accurate numerical aggregations, making structured metadata querying essential for these analytical use cases.
  • Context and correlation analysis: External unstructured data requires thoughtful data engineering to extract and maintain structured metadata (such as creation dates, categories, severity levels, and service types). While RAG excels at finding semantically similar content, having well-defined metadata enables precise filtering and querying capabilities. For example, when analyzing system performance issues, structured metadata about incident timing, affected services, and their dependencies enables comprehensive impact analysis through exact querying rather than relying solely on semantic matching.
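To make the structured-metadata idea in the second point concrete, the following is a minimal sketch that flattens a collected support case document into a queryable record. The field names are assumptions based on the AWS Support API case format; adjust them to match your collector's output.

def extract_case_metadata(case: dict) -> dict:
    """Flatten a support case JSON document into a row of queryable metadata.
    Field names are illustrative assumptions; the real collector output may differ."""
    return {
        "case_id": case.get("caseId"),
        "created_time": case.get("timeCreated"),
        "severity": case.get("severityCode"),
        "service": case.get("serviceCode"),
        "subject": case.get("subject"),
        "status": case.get("status"),
    }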

Enhancing Q Support-Insights with agentic AI

Building on the Q Support-Insights (QSI) solution introduced in Derive meaningful and actionable operational insights from AWS Using Amazon Q Business, we’ll demonstrate how to enhance analytical capabilities through agentic AI by creating custom plugins. This enhancement preserves QSI’s base implementation while adding precise analytical processing through structured metadata querying.

QSI overview

The Amazon Q Support Insights (QSI) solution consists of two main components:

  1. Data collection pipeline
    • Support Collector module using AWS Lambda functions
    • Support data consisting of AWS Support cases, AWS Health events, and AWS Trusted Advisor checks
    • Amazon EventBridge for automated data collection. The data pipeline enables two synchronization mechanisms:
      • Real-time case updates: Processes AWS Support cases through event-based triggers (CreateCase, AddCommunicationToCase, ResolveCase, ReopenCase).
      • Historical and daily sync: Performs initial historical data sync and refreshes AWS Trusted Advisor data daily.
    • Stores data in JSON format in a centralized Amazon Simple Storage Service (Amazon S3) bucket
    • Supports multi-account data aggregation through AWS Organizations
  2. Amazon Q Business application environment
    • Amazon Q Business application deployment
    • Amazon S3 connector for data source integration
    • Web experience configuration for user interaction
    • Authentication through AWS IAM Identity Center

Enabling query aggregation with custom plugins for Amazon Q Business 

Custom plugins extend Amazon Q Business to combine semantic search with precise analytics capabilities. The custom plugin augments the base QSI solution in the following ways:

  • Augments QSI’s natural language processing with structured query capabilities
  • Converts analytical requests into precise Amazon Athena SQL using an Amazon Bedrock large language model (LLM)
  • Executes queries against structured metadata tables
  • Provides exact numerical results alongside semantic search responses

Metadata processing

To enable precise querying capabilities, we’ve implemented a robust metadata processing framework that structures and indexes support data:

  • Uses QSI’s existing data collection in Amazon S3
  • Extracts structured metadata from support cases into two queryable tables
  • Maintains critical fields such as dates, categories, and severity levels
  • Enables precise filtering and aggregation capabilities

Cross-source analysis

The solution enables comprehensive analysis across multiple data sources through intelligent correlation capabilities:

  • Correlates support cases with health events
  • Enables comprehensive operational assessment
  • Supports pattern detection across multiple data sources

Architecture overview

The following illustration shows how the enhanced Amazon Q Business solution integrates custom plugins with the existing QSI architecture. The diagram demonstrates the flow from user query through analytical processing to response generation, highlighting how structured data querying complements the RAG-based semantic search capabilities.

Architecture

When a user submits an analytical query like “Show me OpenSearch case distribution based on severity,” the following process occurs:

  1. The custom plugin detects the analytical intent and triggers a Lambda function
  2. The Lambda function uses an Amazon Bedrock LLM to transform the natural language query into precise Athena SQL
  3. The query executes against structured metadata in Athena
  4. Results are synthesized back into natural language and presented alongside traditional RAG responses

This approach enables precise analytics while maintaining the conversational interface users expect.
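The following is a minimal sketch of that flow, not the solution’s actual Lambda code: it asks an Amazon Bedrock model to translate the question into SQL for a hypothetical support_cases_metadata table and runs the query with Amazon Athena. The model ID, database, table, and results bucket are assumptions, and error handling is omitted.

import time
import boto3

bedrock = boto3.client("bedrock-runtime")
athena = boto3.client("athena")

def analytical_query(question: str) -> list:
    # 1. Ask an LLM to translate the natural language question into Athena SQL.
    prompt = (
        "Translate this question into a single Athena SQL query against the table "
        "support_cases_metadata(case_id, created_time, severity, service, status). "
        f"Return only SQL.\nQuestion: {question}"
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    sql = response["output"]["message"]["content"][0]["text"]

    # 2. Run the generated SQL in Athena and wait for the query to finish.
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # assumed bucket
    )
    query_id = execution["QueryExecutionId"]
    state = "RUNNING"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(1)
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

    # 3. Return the raw result rows; the plugin would summarize these in natural language.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]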

Implementation example

Here’s a sample interaction demonstrating the enhanced capabilities. Let’s start with a basic query to understand the overall OpenSearch case volume, using structured data querying:

Query:

Give me OpenSearch cases in Jan 2025

Response:

Based on the query results for OpenSearch cases in January 2025, here's a summary:
1. Total cases: 15
2. Severity breakdown:
Critical: 4 cases
High: 3 cases
Urgent: 1 case
Normal: 3 cases
Low: 2 cases
3. Status: All cases are resolved
4. Most common issues:
Cluster health problems (yellow status)
Blue/Green deployment issues
Performance concerns (high CPU utilization)
5. Services involved:
OpenSearch Service Managed Cluster: 13 cases
OpenSearch Service Ingestion: 2 cases

The cases span various topics including cluster health, deployments, performance, 
and configuration issues. Is there a specific aspect of these cases you'd like 
more information on?

Now, using RAG capabilities to analyze unstructured case content for architectural insights:

Query:

Out of these, what are the OpenSearch issues that are related to architecture 
deficiencies? 

Response:

Based on the available data, several architectural issues have been identified in OpenSearch deployments:

* The shard count per node exceeding recommended limits is a significant 
architectural concern, with validation errors occurring when scaling up 
node counts. The recommended limit is 1,000 shards per node and exceeding 
this requires careful architectural planning. 

* Master node scaling is another architectural challenge, where insufficient 
master node capacity can limit cluster expansion. This requires proper 
configuration of master node instance types according to the number of nodes 
and shards in the domain. 

* Resource utilization issues, particularly around JVM Memory Pressure and 
CPU utilization, indicate potential architectural shortcomings. 
The system architecture should be designed to maintain JVM Memory Pressure 
below 80% and CPU Utilization below 85% for optimal performance. 

* The system's handling of blue/green deployments shows architectural limitations, 
as these operations cause performance impacts and increased latency during data 
migration to new nodes. 

Continuing with RAG-based analysis of case details to evaluate implementation patterns:

Query:

Given the high shard count, JVM, or other data can we make inferences about whether 
they are using this correctly?

Response:

Based on the data discussed, there are strong indicators that the OpenSearch system 
is not being used optimally:

JVM Memory Management Issues:
1. Memory pressure consistently exceeded recommended thresholds (above 78% and 
eventually over 90%)
2. The 64GB initial heap size configuration proved inadequate for the workload
3. Circuit breaking exceptions and cluster blocks occurred due to memory issues

Shard Management Problems:
1. The system appears to have exceeded the recommended limit of 1,000 shards 
per node
2. Too many small shards created unnecessary system strain
Large shards complicated failure recovery processes

Resource Utilization:
1. Failed to maintain recommended thresholds:
2. JVM Memory Pressure should be below 80%
3. CPU Utilization should be below 85%
4. Continuous memory pressure issues

Failed snapshots
These indicators strongly suggest improper sizing and resource allocation for 
the workload, pointing to incorrect implementation or configuration of the 
OpenSearch system. Better capacity planning and adherence to best practices 
would be necessary for optimal performance.

Finally, here’s another example of structured data analysis with a complex query demonstrating multi-dimensional aggregation capabilities:

Query:

For high-severity support cases since January 1, 2024, show the number of 
total cases and after-hours cases for each account and service combination. 
Define after-hours as before 1 PM UTC, after 1 AM UTC, or times falling within 
the weekend. Only include results with more than 3 after-hours cases. Sort by 
the highest number of after-hours cases. Exclude limit increase requests.

Response:

Based on the analysis of high-severity support cases since January 1, 2024, 
here's a summary of the results, focusing on accounts and services with more 
than 3 after-hours cases:

1. The account with the highest number of after-hours cases (48) is 111111111111, 
for the Connect service. They have a total of 66 high-severity cases.
2. The second highest is account 333333333333 with 40 after-hours cases for the 
OpenSearch Service Managed Cluster, out of 49 total cases.
3. Account 222222222222 has 38 after-hours cases for RDS Oracle, out of 64 
total cases.
4. EMR service for account 555555555555 has 19 after-hours cases out of 23 
total cases.
5. Bedrock service for account 666666666666 has 17 after-hours cases out of 
24 total cases.

The results are sorted by the highest number of after-hours cases, and limit 
increase requests have been excluded as requested. After-hours cases are 
defined as those created before 1 PM UTC, after 1 AM UTC, or times falling 
within the weekend.

Would you like more details on a specific account or service, or do you need 
a further analysis of this data?

The response combines precise numerical analysis with semantic understanding, providing actionable insights for operational reviews.

Benefits and impact

This plugin architecture delivers several key improvements:

  • Precise analytics: Exact counts and distributions replace approximate semantic matching
  • Contextual analysis: Maintains analytical context across conversation threads
  • Architectural understanding: Better correlation of related issues through structured analysis

Deploy the Amazon Q Business application

The following is a simplified deployment process. For detailed instructions, see the Amazon Q Business application creation module.

Prerequisites

  1. AWS CloudShell is recommended because it comes pre-installed with the required libraries and tools. Alternatively, you can use a local machine with the AWS Command Line Interface (AWS CLI) installed and configured with valid credentials.
  2. Two S3 buckets:
    • Support data bucket for storing AWS Support case data
    • Resource bucket as temporary storage for Lambda resources for deployment (can be deleted after deployment)
  3. IAM Identity Center instance configured
  4. The solution needs AWS Support data collected using the Support Data Pipeline. You can deploy now and add data later, but functionality depends on data availability in your S3 bucket.
  5. Access to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock. See Add or remove access to Amazon Bedrock foundation models.
  6. The default database should exist in Athena. If it doesn’t, you can create it using the Athena query editor.

Deployment steps

You can use the following script to deploy the solution. No manual steps are needed; the script handles stack creation and configuration automatically.

# Clone the repository
git clone https://github.com/aws-samples/support-insights-with-amazon-q.git
cd support-insights-with-amazon-q/q_application
chmod +x deploy_q_stacks.sh
./deploy_q_stacks.sh

Clean up

To remove the resources, delete the S3 buckets and CloudFormation stacks. Delete the CloudFormation stacks in the following order:

  1. case-metadata-stack
  2. amazon-q-stack
  3. custom-plugin-stack

Note that deleting the stacks won’t delete the existing S3 buckets; you must delete the S3 buckets manually.
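If you prefer to script the cleanup, the following is a minimal sketch using boto3 and the stack names listed above; it assumes your credentials and Region are configured and that you empty and delete the S3 buckets separately.

import boto3

cloudformation = boto3.client("cloudformation")

# Delete the stacks in the documented order, waiting for each deletion to finish.
for stack_name in ["case-metadata-stack", "amazon-q-stack", "custom-plugin-stack"]:
    cloudformation.delete_stack(StackName=stack_name)
    cloudformation.get_waiter("stack_delete_complete").wait(StackName=stack_name)
    print(f"Deleted {stack_name}")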

Conclusion

By combining RAG’s semantic understanding with precise analytical capabilities through plugins, we’ve transformed Amazon Q Business into a powerful operational analytics platform. In the examples in this post, you can see how organizations can use this enhancement to derive more accurate and actionable insights from their AWS Support data, supporting better operational decision-making and proactive issue resolution. While demonstrated through support data analytics for operational improvements, these patterns apply across domains that combine structured and unstructured data sources.

Learn more

For questions and feedback, visit AWS re:Post or contact AWS Support.


About the authors

Chitresh Saxena is a Sr. AI/ML specialist TAM specializing in generative AI solutions and dedicated to helping customers successfully adopt AI/ML on AWS. He excels at understanding customer needs and provides technical guidance to build, launch, and scale AI solutions that solve complex business problems.

Kevin Morgan is a Sr. Enterprise Support Manager at AWS who helps customers accelerate their cloud adoption journey through hands-on leadership and technical guidance. As a member of the NextGen Developer Experience TFC, he specializes in Builder Experience, CloudOps and DevOps. Outside of work, Kevin enjoys being a Game Master for D&D and is a retro computing enthusiast.

Read More

Amazon Strands Agents SDK: A technical deep dive into agent architectures and observability

The Amazon Strands Agents SDK is an open source framework for building AI agents that emphasizes a model-driven approach. Instead of hardcoding complex task flows, Strands uses the reasoning abilities of modern large language models (LLMs) to handle planning and tool usage autonomously. Developers can create an agent with a prompt (defining the agent’s role or behavior) and a list of tools, and the LLM-powered agent will figure out how to chain its reasoning and invoke tools as needed. This dramatically simplifies agent development compared to traditional workflow-based frameworks.

In this post, we first introduce the Strands Agents SDK and its core features. Then we explore how it integrates with AWS environments for secure, scalable deployments, and how it provides rich observability for production use. Finally, we discuss practical use cases, and present a step-by-step example to illustrate Strands in action.

What is the Strands Agents SDK?

The Strands Agents SDK is an open source framework designed to simplify the creation of robust LLM-powered AI agents. Rather than requiring developers to handcraft complex workflows, Strands embraces a model-driven approach centered on three key components: a language model, a system prompt, and a set of tools. This architecture empowers the LLM to perform the crucial reasoning, autonomously deciding the optimal actions and when to use tools based on the current context and task. This model-driven design allows agents to be flexible, intelligent, and autonomous, while minimizing the boilerplate code typically needed to support multi-step or multi-agent interactions. Its effectiveness is already proven – Strands is actively used in production by multiple AWS teams for their AI agents, including Kiro, Amazon Q, and AWS Glue.

Key capabilities of the Strands Agents SDK

The Strands Agents SDK offers the following key capabilities:

  • Lightweight, flexible agent loop – Strands implements a simple yet extensible agent loop that drives the interaction. The LLM behind the agent iteratively reads the conversation (and context), plans an action, possibly calls a tool, and then incorporates the tool’s result before deciding the next step, until it reaches a final answer. This loop is fully customizable when needed, but works out of the box for most use cases.
  • Tool use and integration – Tools are external functions or APIs the agent can call (for example, calculators, web search, or database queries). Strands makes it straightforward to define tools in Python with a @tool decorator and supply them to agents. During development, the SDK supports hot-reloading, so you can modify or add tools and have them picked up automatically without restarting the agent. This accelerates iteration and testing. The SDK also comes with an optional library of pre-built tools (strands-agents-tools) for common functionalities like arithmetic, web requests, and more. Strands supports both the Model Context Protocol (MCP) and A2A (Agent-to-Agent). MCP is an open standard that gives agents access to thousands of external tools hosted on model servers, greatly expanding their capabilities without custom coding. A2A allows agents to call each other as tools – enabling powerful multi-agent collaboration and specialization with minimal overhead.
  • Model-agnostic and multi-model support – Strands is not tied to a single LLM provider. It works with models on Amazon Bedrock (for example, Anthropic’s Claude or other Amazon Bedrock models) by default, but also supports Anthropic’s API, providers such as Llama API and Ollama (including open-weight models), OpenAI, and others through a pluggable provider interface. For example, you can switch the agent’s model from Anthropic’s Claude hosted on Amazon Bedrock to a local Meta Llama 3 or OpenAI GPT-4 by changing the model provider in the code. With this flexibility, you can choose the model that best fits your needs or swap models in different deployments.
  • Scalability from prototypes to production – The same Strands agent code can run locally for quick testing and then be deployed to AWS for production use. The SDK is already used internally at AWS for agent-based features in services like Amazon Q (developer assistant), AWS Glue, and VPC Reachability Analyzer. It supports running agents in various environments – including Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, AWS Fargate, and Amazon Bedrock AgentCore – and isolating tool execution from the agent for security and reliability. Strands agents can run anywhere and integrate with cloud services, but because it’s open source, you can also run them on premises or in other clouds.
  • Advanced use cases – Although Strands excels with simple single-agent assistants, it also supports more complex agent systems. You can compose multi-agent applications where multiple agents collaborate or coordinate (for example, an agent that delegates subtasks to other specialist agents). The SDK supports patterns like agent hierarchies, agent networks, and even swarm-style cooperation (discussed more in the next section). It also allows building fully autonomous agents that loop on tasks without human input, enabling multi-step workflows where the agent’s chain-of-thought spans many tool calls or intermediate decisions. Additionally, features like streaming responses (token streaming) are supported for real-time agent interactions.
  • Open source and community contributions – The Strands Agents SDK is Apache-2.0 licensed and open to contributions. Several companies (Accenture, Anthropic, Meta, PwC, and others) have already contributed to its development. For example, Anthropic contributed integration for their API, and Meta added support for their Llama models. This community-driven approach means the tool and model ecosystem is growing beyond AWS. Developers are encouraged to join the project on GitHub to report issues, add new tools or model providers, and help expand the framework.

Strands provides a concise yet powerful way to build AI agents. With a few lines of Python, you define an agent’s role and its available tools, and the SDK handles the rest – from reasoning through a problem to invoking tools and producing answers. The next sections explore how Strands supports various agent architectures and how it makes these agents observable in production.
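For instance, defining a custom tool and handing it to an agent takes only a few lines. The tool and prompt in the following sketch are illustrative, not taken from the SDK’s own examples.

from strands import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

# Give the agent the custom tool alongside a simple persona.
agent = Agent(
    system_prompt="You are a concise writing assistant.",
    tools=[word_count],
)
print(agent("How many words are in the sentence 'Strands keeps agent code small'?"))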

Agent-architectural patterns supported by Strands

Strands supports multiple agent architecture patterns, scaling from a single self-contained agent up to complex networks of cooperating agents. In this section, we explain the key patterns and how the Strands SDK enables them.

Single-agent pattern

The simplest scenario is a single AI agent endowed with an LLM and (optionally) some tools, which interacts with a user or performs a job without delegating to other agents. In Strands, a single agent is represented by the Agent class – you initialize it with a model (or accept the default) and the tools it can use. The agent runs an internal event loop to decide how to answer each query: it might respond directly from the model’s knowledge, or invoke one of its tools, incorporate the result, potentially call more tools, and so on, continuing the loop until it produces a final answer.

In code, single-agent usage is straightforward. In the following example, we create a basic agent with a calculator tool and ask it a question:

from strands import Agent
from strands_tools import calculator 

# Create an agent that can use a calculator tool
agent = Agent(tools=[calculator])
result = agent("What is the square root of 1764?")
print(result)

In this example, the agent uses its LLM to interpret the question. It recognizes that it might need to calculate a square root, so it calls the provided calculator tool (a simple arithmetic function) to get the result, and then returns the answer. Strands handles the prompt formatting, calling the tool, and inserting the tool’s result back into the model’s context for the final answer. With a single agent pattern, reasoning and tool use happen within one agent process.

A single agent is suitable for many tasks, such as question answering, data retrieval, and simple assistants, and it keeps the logic self-contained. However, as tasks grow in complexity, you might reach the limits of what one agent (even with tools) can effectively manage (for example, when multiple areas of expertise or concurrent actions are needed). That’s where the multi-agent patterns come in.

Multi-agent networks (swarm or peer-to-peer agents)

Strands supports agent networks where multiple agents operate and communicate to solve problems collaboratively. In such a network, there is no single orchestrator; instead, agents interact peer-to-peer or in an open topology. This pattern is sometimes referred to as a swarm of agents working together. Each agent in the network might have a specialized role or perspective, and they share information to converge on a solution.

In a swarm-style network, communication patterns can vary. One common approach is mesh communication, where any agent can talk to any other agent freely. This is useful for brainstorming or consensus-building, where agents exchange ideas. Other communication schemes include using a shared memory or blackboard (a common repository where agents post and read information), or message-passing channels between specific agent pairs.

Strands provides tools to implement these networks. You can create multiple agent instances (each with its own prompt, persona, and tools) and connect them together. For example, you might have a research agent, creative agent, and critic agent connected in a mesh. The research agent provides factual data, the creative agent proposes ideas, and the critic agent spots flaws; together they iterate towards an answer. The Strands SDK also includes a built-in agent_graph tool to help manage such networks programmatically (so you can define agents and connections, then send messages into the network).
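The following is a rough sketch of that researcher-creative-critic idea wired together manually with plain Agent objects rather than the agent_graph tool; the prompts and the single round of exchange are illustrative assumptions.

from strands import Agent

# Three peer agents with different personas (prompts are illustrative).
researcher = Agent(system_prompt="You provide factual, well-sourced background information.")
creative = Agent(system_prompt="You propose creative ideas that build on the facts you are given.")
critic = Agent(system_prompt="You point out flaws, risks, and gaps in proposed ideas.")

task = "How could we reduce the time it takes to resolve support tickets?"

# One round of mesh-style exchange, coordinated by plain Python.
facts = researcher(task)
ideas = creative(f"Task: {task}\nKnown facts: {facts}")
review = critic(f"Task: {task}\nProposed ideas: {ideas}")
print(review)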

Communication and coordination in an agent swarm can be designed with different philosophies:

  • Collaborative swarms – Agents actively build on each other’s contributions and aim for consensus. This might be ideal for creative problem solving or research, where combining perspectives yields the best result.
  • Competitive swarms – Agents work in parallel on the task (perhaps with different methods or hypotheses) and might even critique each other’s results. This could be useful in scenarios like multiple agents trying independent strategies to see which is best.
  • Hybrid approaches – A mix of cooperation on some subtasks and independent exploration on others.

Strands doesn’t force a particular style; you can implement the message exchange logic as needed (the SDK leaves the content of messages and timing up to the developer or even the agents themselves). The new agent_graph utility simplifies setting up networks by specifying a topology (for example, fully connected mesh) and then letting you broadcast or direct messages to agents. Each agent can run on a separate thread or process, providing scalability. Multi-agent networks excel in complex problem domains where different skills or viewpoints are needed concurrently.

Supervisor-agent model (orchestrator with tool agents)

Another pattern Strands supports is the supervisor-agent model, also known as the orchestrator and specialists architecture or agents as tools pattern. In this design, one agent acts as a primary orchestrator (supervisor) that interfaces with the user or high-level task, and it delegates subtasks to one or more specialist agents. Each specialist is effectively an agent wrapped as a callable tool that the orchestrator can invoke for specific needs.

The manager agent decides which specialist agent is required for a given query and forwards the request, then integrates the results back into a final answer. For example, you might design an orchestrator agent that, when asked a complex question, can call a Research Assistant agent for factual lookup, or a Math Assistant agent for calculations, or a Travel Planner agent for itinerary tasks, depending on the query. Each assistant is an LLM-powered agent with its own system prompt and tools specialized to its domain.

Strands makes it straightforward to implement this. You can create specialized agents and expose them as Python tools using the @tool decorator, as illustrated in the following code:

from strands import Agent, tool
from strands_tools import retrieve, http_request

# System prompt for a specialized research agent
RESEARCH_ASSISTANT_PROMPT = """
You are a specialized research assistant. Focus on providing factual, well-sourced information for research questions.
Always cite sources in your answers.
"""

@tool
def research_assistant(query: str) -> str:
    """Tool that uses a specialized agent to answer research queries."""
    # Create a specialized agent for research tasks
    research_agent = Agent(
        system_prompt=RESEARCH_ASSISTANT_PROMPT,
        tools=[retrieve, http_request] # this agent can use web retrieval tools
    )
    return research_agent(query) # delegate the query to the research agent

In this example, we defined research_assistant as a tool. The tool spins up an agent with a specialized prompt and a couple of tools for web research (retrieve to fetch documents and http_request to call web APIs), queries that agent, and returns the result. We could similarly define other specialist agent tools, such as math_assistant or trip_planner_assistant, each with its own prompt and domain-specific toolset.

Now we can create the orchestrator agent that uses these specialist agents as its tools:

# Orchestrator agent that can delegate to specialized assistants
orchestrator_agent = Agent(
    tools=[research_assistant, math_assistant, trip_planner_assistant]
)
# When a user question comes in, the orchestrator can decide which agent to invoke
response = orchestrator_agent(
    "What are the latest NASA findings on Mars, and can you calculate the travel time to Mars at 20km/s?"
)
print(response)

When the orchestrator_agent receives the complex question, it uses the LLM’s reasoning (guided by its prompt, which we could customize as a general coordinator) to determine how to answer. It might decide this question has two parts – a research part (latest NASA findings) and a calculation part (travel time) – so it might call the research_assistant tool for the first part and the math_assistant tool for the second part. Each tool is a full agent that carries out its subtask (for example, the research assistant might use http_request to fetch data from a NASA API or knowledge base). The orchestrator then assembles the final answer. This hierarchy creates a clear delegation chain: the top-level agent offloads work to experts and then merges their outputs.

The benefits of the supervisor agent architecture include separation of concerns (each agent specializes, making the system straightforward to maintain) and modularity (you can add or remove specialist agents without rewriting the whole agent). It also mirrors human organizational structures – a manager coordinating specialists – which can be an intuitive way to scale up agent complexity. In Strands, this pattern is fully supported by treating agents as just another kind of tool. The orchestrator agent’s built-in logic (through the LLM) reads tool docstrings and decides when to use which specialist tool, especially if you provide it guidance like “Use the Research Assistant for any questions about scientific facts or current data” in its system prompt.

Hierarchical agent architectures

The hierarchical pattern is an extension of the supervisor-agent idea to multiple levels of delegation. Instead of a single orchestrator and a list of specialists, you can have layers of agents forming a hierarchy or tree structure. For example, at the top might be an executive agent handling the broad objective; it delegates high-level tasks to a few manager agents, each of whom further breaks down tasks among their worker agents. This is useful when problems are very complex or naturally tree-structured (project management, multi-stage workflows).

Strands’s agent graph concept generalizes this. In an agent graph, nodes are agents and edges define communication or supervisory links. A hierarchical topology is one of the supported graph patterns: a tree where each parent node directs its children. Information flows down the tree as tasks and up the tree as results or reports. The star topology (one central coordinator with many direct specialists) is actually a one-level hierarchy (the supervisor-agent model), whereas a true hierarchy might have multi-level oversight.

Using the Strands SDK, you can manually create such structures by assembling agents and coordinating their calls, but the simpler route is to use the agent_graph tool or similar utilities. As described earlier, you can programmatically define the nodes and edges of a graph (specifying each agent’s role and prompt) and the tool will instantiate that network. Then you can send messages into the graph (for instance, give a top-level instruction to the executive agent) and the system will propagate tasks downward and solutions upward according to the defined edges. Each agent in the hierarchy can maintain its own state and focus – for example, a mid-level manager agent might independently track the progress of its subtasks.

Hierarchical agent architectures are ideal when you need layered processing or decision-making. For instance, consider an autonomous software engineering agent: an executive agent decides overall what feature to implement and delegates to a design agent and a coding agent, and the coding agent might further delegate tasks to a code generation agent and a testing agent. Each level adds oversight and can catch errors or refine requirements from the level below. This reduces the cognitive load on each individual agent. Strands facilitates this by providing the mechanisms to link agents in tree or graph structures and handle message passing along those links. The built-in support for maintaining sessions and state for each agent and controlling information flow means you can fine-tune what each agent knows, providing clarity and preventing agents from interfering with each other’s context.
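The following is a compact sketch of two levels of delegation using the same agents-as-tools mechanism shown earlier; the roles and prompts are illustrative assumptions rather than a prescribed design.

from strands import Agent, tool

@tool
def coder(task: str) -> str:
    """Worker agent that writes code for a narrowly scoped task."""
    return Agent(system_prompt="You write clean, minimal Python code.")(task)

@tool
def tester(task: str) -> str:
    """Worker agent that reviews code and proposes test cases."""
    return Agent(system_prompt="You review code and propose unit tests.")(task)

@tool
def engineering_manager(task: str) -> str:
    """Mid-level agent that splits a feature into coding and testing subtasks."""
    return Agent(
        system_prompt="You split a feature request into coding and testing subtasks and delegate them.",
        tools=[coder, tester],
    )(task)

# Top-level executive agent delegates the broad objective downward.
executive = Agent(tools=[engineering_manager])
print(executive("Add a retry mechanism to our HTTP client and make sure it is tested."))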

Finally, Strands doesn’t lock you into a single pattern – you can combine them. For example, you could have a swarm of peer agents at one level of the hierarchy and a supervisor above them, or an orchestrator that consults a small swarm of brainstorming agents as one of its tools. The SDK’s flexible design (with agents as first-class objects and tools) lets you mix patterns to suit the problem.

Observability in Strands agents

In production, observability is crucial to understand and trust what your agents are doing. The Strands SDK was built with observability in mind, providing built-in instrumentation hooks, telemetry collection, and support for logging and metrics out of the box.

Instrumentation and traces

Strands can record agent trajectories – the sequence of steps (for example, model calls and tool calls) an agent takes for each request. It uses OpenTelemetry (OTEL) standards to emit this data, meaning you can plug it into OTEL-compatible monitoring backends (such as AWS X-Ray, Amazon CloudWatch, and Jaeger) to visualize and analyze agent behavior.

Each run of an agent can produce a trace, which consists of spans for each significant action. For example, when the agent calls the LLM, that’s a span, which can include metadata like the prompt, model parameters (for example, temperature or max tokens), and token usage counts. When the agent invokes a tool, that’s another span, recording which tool was called along with its input and output. By stitching these spans together, you get an end-to-end timeline of how the agent arrived at its answer.

This distributed tracing extends across components – for instance, if your agent is calling a remote microservice as a tool, the trace can propagate into that service (provided it also uses OTEL), giving you a cross-system view of a request. Such insight is invaluable for debugging agent reasoning, optimizing performance, and identifying failure points in complex multi-agent workflows.

Metrics tracking

Strands also tracks key metrics about agent operations. Metrics are aggregate measurements that help quantify performance and usage. You can capture metrics such as the number of times each tool was invoked (and perhaps success and failure rates of those calls), runtime of tool calls, how many turns or agent loops run per interaction, latency of model responses (time to first byte and time to complete), and token consumption (prompt tokens vs. completion tokens) per request. Strands can also surface system metrics (CPU, memory usage if relevant) and custom business metrics like how often users are happy with the agent’s answer (if feedback is provided).

By monitoring these metrics, developers and operations teams can make sure the agent is behaving efficiently and reliably. For example, a sudden spike in tool error rates or a jump in token usage per query might signal a regression that needs attention. Metrics can feed into dashboards or alerting systems to maintain the operational health of your AI agent service.

Logging

The SDK emits logs for important events – for example, the full prompt being sent to the model, the model’s raw response, decisions it made about which tool to use, and errors encountered. These logs are timestamped and can be configured at various verbosity levels (debug, info, error) similar to other applications. Logs are useful for deep debugging or audit trails, and because they might include sensitive or verbose information, Strands allows structuring or redacting logs as needed. In production, you might integrate Strands logs with standard logging infrastructure (CloudWatch Logs, ELK stack) for centralized analysis.
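If you route the SDK’s logs through standard Python logging, a minimal setup might look like the following; the "strands" logger name is an assumption used to illustrate the pattern, so check your SDK version’s documentation for the exact logger names.

import logging

# Send log records to stdout with timestamps and logger names.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Turn up verbosity for the agent framework's logger (name assumed for illustration).
logging.getLogger("strands").setLevel(logging.DEBUG)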

End-to-end observability

Strands encourages end-to-end observability. The SDK documentation outlines a framework where agent developers, data engineers, and product owners all consume telemetry. For example, developers use traces to diagnose why an agent made a certain decision (such as visualizing a trace to see the chain of tool calls leading to a wrong answer). Data engineering teams might aggregate telemetry in a data warehouse to analyze usage patterns or costs over time. AI researchers could use logs and traces to identify failure modes and fine-tune prompts or models (treating trace data as feedback to improve the agent).

Strands provides the raw capabilities (instrumentation points and OTEL integration) to enable this, but it’s up to the implementing team to set up collectors and dashboards. Best practices include standardizing on open formats (like OTEL) for interoperability, using collectors to route telemetry to multiple sinks (for operations or business intelligence), and filtering or sampling data to manage volume and privacy.

Observability is not an afterthought in Strands – it’s built into the agent loop so that when your agent is running in production, you can monitor its reasoning and actions. This focus on instrumentation sets Strands apart, especially compared to earlier agent frameworks where developers often had to add their own logging or tracing.

Enterprise readiness and deployment best practices

The Strands Agents SDK was designed with enterprise production use in mind, providing features and guidance to help agents run reliably, at scale, and securely in business-critical environments. This section discusses how Strands addresses key enterprise requirements that are essential for adopting agentic workflows in large organizations.

Scalability and performance

Strands can scale from quick prototypes to large-scale production deployments seamlessly. The same agent code running on a developer’s laptop can be deployed to the cloud without changes. Strands is built in a lightweight manner (a Python framework orchestrating API calls to LLMs and tools), so it introduces minimal overhead. Agents can run concurrently – each agent in a multi-agent configuration can execute on its own thread or process to utilize multiple cores or machines. This means an orchestrator agent can dispatch work to specialist agents running in parallel processes, achieving concurrency and faster throughput on multi-CPU systems. When integrated into distributed environments (like microservices or serverless architectures), you can run multiple agent instances behind a load balancer to handle high request volumes. The framework’s model-agnostic nature also means you can choose more powerful model endpoints for heavier workloads or distribute calls across different model providers to avoid bottlenecks.

In practice, performance is usually dominated by the LLM’s response time and the external API calls the agent makes; Strands pipelines these operations efficiently where possible (for example, by streaming responses when available so output starts arriving as soon as possible). Strands imposes no strict upper limit on the number of tools or steps an agent can handle, aside from compute resources and model limitations – making it suitable for complex, long-running tasks if needed.

Security and data protection

Enterprise applications demand strong security measures, and Strands provides mechanisms and recommendations to build secure agents. Because agents can execute code or call external APIs through tools, it’s critical to manage what tools are available to an agent. Strands supports fine-grained control over tool access – you decide which tools to include for each agent.

Sensitive data handling is also emphasized: AWS recommends implementing end-to-end encryption for sensitive data that agents handle, both at rest and in transit. For example, if an agent stores conversation history or intermediate results, those could be encrypted or kept in memory only. It’s also important to sanitize inputs and outputs; using Amazon Bedrock Guardrails or custom validation to filter the agent’s responses can prevent the agent from returning confidential information or disallowed content. Strands’s logging can be configured to omit or redact sensitive details, so audit logs don’t become a source of leakage.

Authentication and authorization should be layered in front of agent endpoints: when deploying using Amazon API Gateway or Lambda, you can use AWS Identity and Access Management (IAM) roles, Amazon Cognito, or OAuth tokens to make sure only authorized systems or users can invoke the agent. Within the agent, you might also enforce role-based logic – for example, certain tools only activate if the requesting user has appropriate permissions. For multi-agent systems, isolating each agent’s context (which Strands supports using sessions and separate prompts) can enforce the principle of least privilege (each sub-agent only knows what it needs to).

Threat modeling for AI agents is a new but important practice; AWS has published guidance like the MAESTRO framework for agentic AI threat modeling. This encourages teams to anticipate how an agent might be misused or attacked (for instance, prompt injection by malicious inputs, or attempts to make an agent reveal secure data) and to implement mitigations such as input validation, output filtering, and robust exception handling.

Strands gives you the building blocks for powerful autonomous agents, but enterprises should wrap those agents with the same rigor applied to any application – encryption, monitoring, access control, and testing against adversarial inputs – to run them safely and responsibly.

AWS service integration

As an AWS-originated project, the Strands Agents SDK integrates naturally with the AWS ecosystem. It works with Amazon Bedrock out of the box (for access to foundation models), which simplifies using high-quality, scalable models with enterprise security (when using Amazon Bedrock, data is not left unencrypted or sent to external third parties).

Beyond model hosting, Strands’s use of OTEL means you can pipe trace data into AWS X-Ray for distributed tracing visuals and send metrics to CloudWatch for real-time monitoring. For example, you could set up CloudWatch alarms on metrics like tool error rate or latency per agent call to alert operations teams of anomalies.
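
For instance – a sketch using boto3 with hypothetical metric and namespace names, which you would replace with whatever your OTEL-to-CloudWatch pipeline actually emits – such an alarm could be created like this:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical namespace, metric, and SNS topic; substitute the names your telemetry pipeline publishes
cloudwatch.put_metric_alarm(
    AlarmName="strands-agent-tool-error-rate",
    Namespace="StrandsAgents",
    MetricName="ToolErrorCount",
    Statistic="Sum",
    Period=300,                          # evaluate 5-minute windows
    EvaluationPeriods=1,
    Threshold=5,                         # alarm if more than 5 tool errors in a window
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-alerts"],
)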

Strands can also call AWS services as tools – either using APIs or specialized tools. Community-contributed tools for AWS are already available (for instance, tools that can run AWS SDK commands or query AWS resources). In one scenario, an agent could use an AWS SDK tool to automatically remediate an AWS infrastructure issue (making it a DevOps assistant), or use an Amazon DynamoDB tool to store and retrieve information as part of its workflow. In fact, one of the Strands example use cases demonstrates an agent storing weather data into DynamoDB, showing how seamlessly an agent can incorporate AWS data services. Because Strands is Python, it also works smoothly with Lambda for serverless deployments (more on that later in this post) and with AWS Step Functions if you need to embed an agent call as one step in a larger orchestrated business workflow. Enterprises can use existing AWS identity and networking features like virtual private cloud (VPC), IAM, or AWS Key Management Service (AWS KMS) encryption to further secure and isolate their agents when running in the cloud.
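
To make the AWS-services-as-tools idea concrete, here is a sketch of an agent-callable DynamoDB writer; it uses the @tool decorator shown later in this post, boto3, and a hypothetical WeatherReadings table whose partition key is city:

from decimal import Decimal

import boto3
from strands import Agent, tool

# Hypothetical table; it must already exist with "city" as its partition key
table = boto3.resource("dynamodb").Table("WeatherReadings")

@tool
def save_weather(city: str, temperature_c: float, conditions: str) -> str:
    """Store a weather observation for a city in DynamoDB.

    Args:
        city (str): Name of the city the reading is for.
        temperature_c (float): Temperature in degrees Celsius.
        conditions (str): Short description, for example "partly cloudy".
    Returns:
        str: Confirmation message the agent can relay to the user.
    """
    table.put_item(Item={
        "city": city,
        "temperature_c": Decimal(str(temperature_c)),  # DynamoDB numbers must be Decimal, not float
        "conditions": conditions,
    })
    return f"Saved weather for {city}."

weather_agent = Agent(
    system_prompt="You record and report weather observations when asked.",
    tools=[save_weather],
)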

Conversely, Strands remains cloud-agnostic enough that if needed, you can run it on premises or in other cloud environments – for example, using local models through Ollama or connecting to third-party APIs – giving flexibility for hybrid deployments.

Deployment best practices

There are several proven patterns for deploying Strands agents in production, and the SDK provides a deployment toolkit with reference implementations. Depending on use case, you might choose one of the following deployment methods:

  • Serverless (Lambda) – This is ideal for short-lived agent tasks or event-driven invocations. You can deploy an agent as a Lambda function, possibly using the Lambda function URL feature or API Gateway to trigger it using HTTPS. This offers scalability (Lambda will spawn concurrent executions as needed) and minimal operations overhead. It’s best for agents that complete within the Lambda runtime limit and don’t require long-lived state (though you can use external storage for state if needed). AWS provides examples for deploying Strands this way, which also include streaming responses from Lambda in real time. For interactive or streaming agents, you might prefer using WebSockets or an asynchronous pattern, because Lambda invocations are stateless. A minimal handler sketch follows this list.
  • Containers (AWS Fargate and Amazon ECS or Amazon EKS) – For long-running or stateful agent services, containerizing the agent logic is a common approach. You can host the agent loop in a container (for example, a microservice that listens for requests, invokes the agent, and returns results). Fargate (serverless containers) and Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS) are both supported in reference architectures. This approach is well-suited to streaming interactions (where an agent might keep a connection open to stream tokens) and high-concurrency scenarios. With containers, you can allocate more memory/CPU for larger models or use GPU-backed instances if running heavy local models. You can also horizontally scale the number of agent service instances and integrate with service meshes or mesh-like patterns if multiple agent services need to communicate.
  • Hybrid return-of-control pattern – In some enterprise scenarios, part of the tool execution is done on the client side or in a separate environment for security. Strands supports an architecture where the agent is hosted in one environment (for example, in AWS) but some tools are executed in a different environment (like on a user’s device or in a secured on-premises service). The return-of-control pattern lets the agent delegate certain tool calls back to the client application. For instance, a client application might register a local tool (for example, for accessing a local database or hardware device) with the agent. When the agent decides to use that tool, it returns a signal for the client to execute it and await the result. Meanwhile, other tools can be hosted in the cloud. This pattern gives maximum flexibility and can address data governance concerns (keeping some data processing local) while still benefiting from the agent’s reasoning capabilities in the cloud.
  • Monolithic vs. microservices – Strands agents can be deployed as a monolith (the agent loop and all tools in one process) or split into microservices (each tool as its own service that the agent calls through an API). Monolithic deployments are simpler and have less latency (function calls in memory), but splitting tools into separate services can improve fault isolation, allow independent scaling of expensive tools, and enable polyglot implementations (tools in other languages). A best practice is to start monolithic for simplicity, then refactor out critical tools as needed (for example, a data-intensive tool might be better as a separate service that can be scaled on its own). Strands’s observability will still capture cross-service calls if OTEL tracing is propagated, giving you a full picture of the distributed workflow.
  • Amazon Bedrock AgentCore – For production-grade deployment of Strands agents with built-in support for identity, memory, observability, and tool integration, AWS offers Amazon Bedrock AgentCore. This is a secure, serverless runtime designed specifically for running AI agents in real-world applications. You can wrap a Strands agent using the BedrockAgentCoreApp wrapper and deploy it through the AWS Command Line Interface (AWS CLI) or container workflows. Amazon Bedrock AgentCore supports long-running tasks (up to 8 hours), asynchronous tool execution, and tool interoperability using MCP, A2A, or API Gateway based services. It also includes secure identity features such as OAuth, Amazon Cognito, and IAM, as well as native observability with CloudWatch and OTEL. This approach is ideal for teams looking for a scalable, secure, and fully managed agent infrastructure that integrates seamlessly with existing AWS services. Amazon Bedrock AgentCore is currently available in public preview as of July 2025.
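
To make the serverless option above concrete, the following is a minimal handler sketch rather than the official deployment template: it assumes a simple JSON event with a prompt field and reuses built-in tools shown elsewhere in this post; a real deployment would add input validation, authentication, and streaming.

import json

from strands import Agent
from strands_tools import calculator, http_request

# Created at module load so warm Lambda invocations reuse the same agent
agent = Agent(
    system_prompt="You are a concise assistant with web and math tools.",
    tools=[calculator, http_request],
)

def lambda_handler(event, context):
    # Assumed event shape: {"prompt": "..."} sent directly or via an API Gateway proxy integration
    body = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    prompt = body.get("prompt", "")
    if not prompt:
        return {"statusCode": 400, "body": json.dumps({"error": "Missing 'prompt'"})}

    result = agent(prompt)
    return {"statusCode": 200, "body": json.dumps({"answer": str(result)})}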

When operating agents in production, it’s also recommended to implement robust error handling and monitoring. For example, you might wrap the agent invocation in a retry loop or fallback logic – if the agent fails or returns an incomplete result, your application can catch the exception and respond gracefully (perhaps returning a default answer or a message that it will follow up later). Define timeouts for tool calls and limit the number of reasoning loops to avoid “runaway” agents. Use CloudWatch or a similar monitoring system to collect metrics like latency, error counts, token usage, and cost per request, and set up alerts for anomalies. In an enterprise, operational excellence is as important as the agent’s accuracy – Strands gives you the hooks (telemetry, logs, config options) to achieve this, but it’s up to your DevOps team to wire them into your existing operations toolkit.
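
A lightweight version of that wrapping logic – a sketch built only on the standard library around the agent call pattern used throughout this post; per-tool timeouts and loop limits would still be configured separately – might look like this:

import time
from concurrent.futures import ThreadPoolExecutor

FALLBACK_ANSWER = "I couldn't complete that request right now; please try again later."

def invoke_with_guardrails(agent, prompt, timeout_s=60, max_attempts=2, backoff_s=2.0):
    """Call an agent with a wall-clock timeout, simple retries, and a graceful fallback answer."""
    for attempt in range(1, max_attempts + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(agent, prompt)
        try:
            return future.result(timeout=timeout_s)  # raises on agent error or timeout
        except Exception:
            # Log the failure here; a timed-out LLM call may still finish in the background
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)      # simple linear backoff between attempts
        finally:
            pool.shutdown(wait=False)
    return FALLBACK_ANSWER

You would then call invoke_with_guardrails(my_agent, user_prompt) wherever the agent was previously invoked directly.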

By following these enterprise deployment best practices – choosing the right architecture, enforcing security at multiple layers, and using the infrastructure of AWS – companies can confidently deploy Strands agents that meet their scalability, security, and compliance requirements while delivering advanced AI capabilities to end-users.

Practical usage example: Autonomous workflow in action

To illustrate how you might use the Strands SDK for an autonomous workflow, let’s walk through a scenario. Suppose we want to build an autonomous research assistant that can perform a multi-step task: the user asks a broad question, and the agent needs to gather information from the web, perform some analysis, and provide a well-formulated answer with citations. This involves decision-making (how to break down the task), tool usage (web search, reading documents, performing calculations), and synthesis of results.

Define tools

We need some tools for web research. Strands includes a retrieve tool (for searching and retrieving documents) and an http_request tool (for calling APIs or fetching URLs). If additional processing is needed, we could also include a Python execution tool or others.

Create the agent with a suitable prompt

We give our agent a system prompt that instructs it to be a diligent research assistant. For example: “You have access to web search and browsing. Always find factual information and cite sources in your answer. If calculations are needed, do them step by step.” This prompt sets the context so the LLM knows how to behave (defining the role and guidelines for the agent).

from strands import Agent
from strands_tools import calculator

# Orchestrator agent that can both research and calculate;
# research_assistant is assumed to be the web research tool or agent-as-tool defined for this workflow
smart_agent = Agent(
    system_prompt="You are an AI research assistant. You answer questions with facts and citations. You have tools for web research and math.",
    tools=[research_assistant, calculator]
)
query = "What are the latest NASA findings on Mars, and how long would it take a spacecraft to travel from Earth to Mars at 20 km/s?"
answer = smart_agent(query)
print(answer)

Agent reasoning and tool use

When a query comes in, the agent’s LLM will parse the question and decide on a plan. Thanks to Strands’s model-driven loop, the agent can internally reason something like: “The question asks for the latest NASA findings on Mars and a travel time calculation. Plan: (1) use web search tool to find recent NASA announcements about Mars, (2) summarize findings, (3) use calculator (or internal math) to compute travel time, (4) compose final answer.”

The agent will execute this plan step by step: call retrieve with a query like “latest NASA Mars findings,” get results (perhaps through http_request to read an article), extract key information, do the calculation for travel time (maybe using the calculator tool if available, or basic arithmetic), and finally synthesize everything into a coherent answer, citing the article source for the Mars information.

This happens in one agent() call, transparently. The user sees the final answer, but Strands orchestrated multiple tool calls and reasoning iterations to get there. The observability features described earlier make it possible to inspect this process after the fact (for example, we can see in the trace that the agent made two tool calls: one to retrieve for NASA info and one to calculator for computing travel time).

Compose the final answer

The agent combines the results of its tools and formulates the answer to return to the user. For instance:

## Latest NASA Findings on Mars (2023-2025)

### Atmospheric Discoveries
1. **First Visible-Light Auroras** (March 2024)
- NASA observed visible-light auroras on Mars after a solar flare and coronal mass ejection
- This event occurred near the peak of the current solar cycle

2. **MAVEN's "Sputtering" Discovery** (2024)
- After a decade of searching, NASA's MAVEN mission made the first direct observation of atmospheric "sputtering"
- This process helps explain how Mars lost its atmosphere over time

### Geological Discoveries
1. **Arsia Mons Volcano Observation** (2025)
- NASA's 2001 Mars Odyssey orbiter captured new images of Arsia Mons, one of Mars' largest volcanoes
- The images show the massive volcano peeking through morning cloud cover

2. **Perseverance Rover's New Explorations** (2025)
- Perseverance began exploring "Krokodillen," a region potentially containing some of Mars' oldest rocks
- The rover continues collecting samples for the future Mars Sample Return mission

3. **Dust Devil Captured in Perseverance Selfie** (May 2024)
- A Martian dust devil was captured in Perseverance's fifth selfie since landing

### Other Notable Findings
- Curiosity rover was photographed in motion by the Mars Reconnaissance Orbiter
- NASA's Europa Clipper spacecraft captured infrared images of Mars during a flyby
- Ongoing updates to the Mars Relay Network for communication between rovers and orbiters

## Earth to Mars Travel Time at 20 km/s
The distance between Earth and Mars varies significantly due to their elliptical orbits:
- **Minimum distance**: 54.6 million km (when planets are at their closest)
- **Maximum distance**: 401 million km (when planets are at their furthest)

At a constant speed of 20 km/s:
- **Minimum travel time**:
- 2,730,000 seconds
- 758.3 hours
- **~31.6 days**

- **Maximum travel time**:
- 20,050,000 seconds
- 5,569.4 hours
- **~232.1 days** (about 7.7 months)

This significant variation in travel time (from approximately 1 month to 7.7 months) demonstrates why launch windows to Mars are carefully planned to coincide with favorable orbital alignments that minimize travel time, fuel consumption, and mission duration.

This answer illustrates how the agent retrieved factual data, specifically the minimum and maximum distances between Earth and Mars, and used that information to calculate travel time at a constant speed of 20 km/s. It expressed the results in multiple formats – seconds, hours, and days – offering a realistic travel range from about 1 month to 7.7 months depending on planetary positions – all without the user needing to prompt those steps separately.

By using Strands for this workflow, we didn’t have to script the sequence “first do search, then do calculation” ourselves; the agent decided it intelligently. If the question had been different, the agent might have taken a different approach or called different tools, all based on its autonomous reasoning. By adjusting the provided tools and the system prompt, we can guide the agent’s behavior for various autonomous workflows (from writing code with a code-execution tool to analyzing datasets with a data analysis tool).

Error handling and reflection

In practice, autonomous agents might make mistakes or need to recover (perhaps the first web search wasn’t useful and it needs to try a different query). Strands agents have the ability to reflect and retry within the loop – for instance, if a tool’s result is unexpected or irrelevant, the agent can reformulate its approach. We can also program explicit guardrails or use the observability signals to intervene if something goes wrong (for example, if no useful answer after X loops, stop and respond with an apology or ask for clarification). The SDK provides callback hooks and the ability to integrate such logic, but by default, the LLM’s own iterative reasoning often suffices for moderate tasks.

Through this example, we see how Strands enables multi-step decision making in a single unified agent. It empowers the agent to act autonomously – deciding which steps to take and in what order – using the powerful reasoning of the LLM, all while giving developers simple control points (which tools to allow, what the agent’s role and prompt is, and the ability to analyze logs and traces afterward).

Strands vs. other agent frameworks

With the rapid rise of AI agents, a number of frameworks have emerged. LangChain is one of the most popular libraries for building LLM-driven applications and agents. In this section, we provide an overview of similarities and differences, and when to consider each.

Core concept and philosophy

Both Strands and LangChain enable the pattern of using LLMs to drive actions (the ReAct paradigm of reasoning and tool use). They let you define tools and functions and have the model call them to solve tasks. The big difference is in developer experience and emphasis. Strands preaches minimal orchestration: you supply the prompt and tools and trust the model to figure out the sequence. It consciously avoids requiring developers to write complex workflow code or state machines around the agent.

LangChain started with a focus on giving developers building blocks to construct chains or sequences of LLM calls and tools. In early LangChain, you might manually stitch together a series of steps or use a predefined agent template. Over time, LangChain added agentic behavior (like its AgentExecutor with an LLM planning actions), but it still offers more low-level control if you want it. LangChain’s latest versions introduced explicit workflow orchestration tools like LangGraph for DAG-based flows and a MultiAgent orchestration engine. This means LangChain can support very fine-grained, developer-defined workflows when needed (you can script each sub-agent and how they connect). Strands chooses to abstract that away by default – the workflow emerges from the model’s decisions, not the developer’s hardcoded logic.

In short: Strands is “LLM-first” (model as planner) whereas LangChain is “developer-first” (assemble your desired chain, though it can also delegate planning to the model if you use its agent abstractions).

Tool ecosystem and integration

Both frameworks recognize the importance of tools (also called skills or functions). LangChain provides a wide array of integrations – not only generic tools like math or web search, but connectors to databases, vector stores, and APIs, often through community-contributed wrappers.

Strands’s approach to tools embraces standards like MCP (the Model Context Protocol). By supporting MCP, Strands can access a large library of tools (for example, QA over documents, coding helpers, and cloud resource manipulators) without each tool needing a custom integration script in your code. LangChain doesn’t natively support MCP, but it has its own catalog of tools, and often you integrate by writing a Python function and adding it as a tool (similar to Strands’s @tool, which was likely inspired by patterns popularized by LangChain). LangChain has a wider array of built-in connectors (especially for things like data stores or model providers) given its maturity, whereas Strands is catching up quickly by using open protocols and community contributions (Anthropic and Meta have already added compatibility in Strands for their systems).

Strands can naturally integrate well with AWS services (for example, Amazon Bedrock models), whereas LangChain has historically been used a lot with OpenAI’s API, custom local models, and various third-party services. Both can be adapted to either use case, but the simplest mappings tend to be Amazon Bedrock with Strands and OpenAI with LangChain, even though Strands also works fine with OpenAI.

Multi-agent and structuring of agents

Strands comes with first-class support for multi-agent systems (you can make swarms, graphs, and hierarchical setups with built-ins like the agent_graph tool). LangChain has also moved into this space, providing things like MultiAgentManager/Executor and even experimental ecosystems like Microsoft’s AutoGen integration, where multiple agents (like a user agent and a system agent) converse. However, LangChain’s multi-agent patterns are not as unified under one concept; you often find examples or custom implementations (for instance, agents that critique each other or work in a loop).

More recently, LangChain introduced LangGraph and other features to let developers explicitly design multi-agent workflows with flows, shared memory, and parallelism. This gives a lot of power to the developer to orchestrate agents in LangChain, but requires you to plan those interactions. Strands provides more out-of-the-box constructs (like a ready-made way to turn agents into tools or create a mesh network) for common patterns, leaning on sensible defaults.

AWS also has a separate library called Agent Squad, which focuses on multi-agent orchestration (routing to specialist agents). AWS suggests using Agent Squad when you need many specialized sub-agents with strict context isolation, and using Strands when you want a single-agent-with-tools approach that leans on the LLM’s reasoning. That indicates Strands often works best with a single agent (or a contained hierarchy of agents) using tools, whereas if you wanted a large ensemble of disparate agents with their own models or knowledge, you might use a different orchestrator and possibly incorporate Strands agents within it. For most cases though, Strands is fully capable of multi-agent coordination.

Memory and context handling

Both frameworks allow for maintaining state between agent turns (conversation history, or more generally an agent’s memory). LangChain has numerous utilities for short-term memory (chat message history) and long-term memory (vector store backed memory). Strands provides sessions and state management too – you can choose where to store conversation history or other context (in-memory, persistent store) and how to truncate or retrieve it. The principles are similar, but LangChain has more pre-built variants (like summarizing memory or knowledge graphs as memory), reflecting its broader usage in conversational AI.

For a solution architect, if memory beyond a simple history is needed (like integrating a vector database for knowledge), LangChain offers pluggable classes for that; with Strands, you might implement a custom tool to query a vector database (or use MCP if a tool exists for it). Both can achieve the goal, but the approach differs (LangChain would treat it as a retriever in a chain, Strands would treat it as a tool the agent can call).
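
For illustration, here is a sketch of that Strands-side approach; the corpus, the toy embedding function, and the similarity math are deliberately simplistic stand-ins for a real embedding model and vector database such as OpenSearch or pgvector:

import numpy as np
from strands import Agent, tool

# Stand-in corpus; in practice documents and embeddings come from your own data and embedding model
DOCUMENTS = [
    "Strands agents call tools chosen by the model at run time.",
    "Amazon Bedrock provides managed access to foundation models.",
    "OpenTelemetry traces help debug multi-step agent behavior.",
]

def embed(text: str) -> np.ndarray:
    # Toy bag-of-characters embedding; replace with a real embedding model and vector store query
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

DOC_VECTORS = np.stack([embed(d) for d in DOCUMENTS])

@tool
def knowledge_search(query: str, top_k: int = 2) -> str:
    """Return the documents most similar to the query from the internal knowledge base.

    Args:
        query (str): Natural language search query.
        top_k (int): Number of documents to return. Defaults to 2.
    Returns:
        str: The best-matching documents, one per line.
    """
    scores = DOC_VECTORS @ embed(query)      # cosine similarity (vectors are unit-normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return "\n".join(DOCUMENTS[i] for i in best)

kb_agent = Agent(
    system_prompt="Answer questions using the knowledge_search tool and mention which documents you used.",
    tools=[knowledge_search],
)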

Observability and evaluation

Strands focuses on production observability with its OTEL integration and guidance on metrics. It treats telemetry as a first-class concern for live agents. LangChain, although it provides some logging and callbacks (and there are third-party solutions like Langfuse to instrument LangChain agents), doesn’t have a built-in distributed tracing or metrics system at the level Strands does. In practice, teams using LangChain have often needed to add their own logging or use wrappers to record agent steps. With Strands, much of that comes included or with a simple configuration. This makes Strands appealing for enterprise and production scenarios where monitoring is a must.

On the evaluation side, both Strands and LangChain acknowledge the need to evaluate agent outputs. LangChain has an evaluation module (for grading responses), and Strands offers evaluation techniques for agents (such as collecting traces for analysis or using feedback loops). The telemetry advantage leans in favor of Strands due to its built-in OTEL support.

Use case suitability

Choosing between Strands and LangChain will depend on the project’s priorities. If you want a quick, low-boilerplate way to get an agent working – especially if you plan to use AWS infrastructure or Amazon Bedrock models – Strands is a fantastic option. It offers a short path from “prompt plus tools” to a working agent, and it excels when you are comfortable letting a powerful model orchestrate the process (which, given the capabilities of today’s LLMs such as GPT-4-class models or Anthropic’s Claude, is often a reasonable choice). Strands is also a strong choice if observability and reliability in production are top concerns; it was designed by AWS with production in mind (the fact that internal AWS teams replaced months of custom agent tinkering with Strands speaks to that focus).

On the other hand, if your use case requires extensive custom control over each step, or involves a lot of integrations with external systems that LangChain already has modules for (for example, you need to interface with 10 different databases and proprietary APIs, and you want a library that has samples for all), LangChain’s ecosystem might get you there faster. LangChain also has a larger community at the moment, so finding examples and support for niche scenarios could be faster. Additionally, for academic or very experimental multi-agent research, LangChain’s flexibility to craft novel agent loops or logic might be useful. However, Strands is quickly evolving, and being open source, it might soon incorporate many of the needed integrations (and you can always use Strands in conjunction with other libraries – for example, call a LangChain chain as a Strands tool).

Performance

There isn’t a clear publicly documented performance benchmark between Strands and LangChain. Both primarily spend time waiting on LLM API calls or running the model, so raw performance differences likely come down to how efficiently they manage prompts and tools. Strands being lightweight might have less overhead in the agent loop, whereas LangChain’s flexibility can sometimes introduce extra layers (for instance, parsing outputs with regex or validators). In practice, both are fast enough and any latency is dominated by the model inference and any external API calls.

Conclusion

In summary, Strands and LangChain have overlapping goals but different design centers. Strands, coming from AWS, emphasizes using the power of state-of-the-art models with minimal fuss and making sure you can deploy and monitor these agents reliably. LangChain offers a broader toolkit for constructing AI reasoning pipelines, which can be molded into agent behaviors but also requires more decision-making from the developer on how the workflow should look. Many solution architects might even use them together or for different layers of a system (you don’t have to use one exclusively). Frameworks like Strands are simplifying agent development – as LLMs improve, developers can focus more on what they want the agent to achieve and less on how to micromanage the agent’s steps. As the ecosystem evolves, we can expect both Strands and LangChain (and others) to learn from each other and contribute to making AI agents more capable, robust, and simpler to build than ever before.

For more information about the Strands Agents SDK, refer to the following:


About the authors

Jin Tan Ruan is a Senior Generative AI Developer on the AWS Industries Prototyping and Customer Engineering (PACE) team. He specializes in building multi-agent and multimodal applications using foundation models, with a focus on real-world implementations of AGI-like capabilities. Jin brings a strong background in software development and holds nine AWS certifications, including the Machine Learning Specialty. At AWS, Jin designs and builds advanced agentic systems that combine reasoning, planning, memory, and tool use – using LLMs for autonomous decision-making and task orchestration. He is also deeply involved in fine-tuning and prompt engineering to optimize foundation models for enterprise-scale, domain-specific performance. Jin holds a Master’s in Computer Science and Software Engineering from Syracuse University. Outside of work, he enjoys exploring new places and diving into the world of horror films. Connect with Jin on LinkedIn.

Read More

Nominations Open for the 2025 PyTorch Contributor Awards

Nominations are now open for the 2025 PyTorch Contributor Awards! These awards shine a spotlight on the incredible individuals whose work and dedication are driving innovation, collaboration, and community-building within the PyTorch ecosystem.

Whether through code, documentation, mentoring, community leadership, or new ideas that push boundaries, contributors are at the heart of PyTorch’s success. Now is your chance to help us celebrate them.

Submit your nomination today.

Awards Ceremony

Winners will be honored at the PyTorch Conference in San Francisco, October 22–23, 2025.  Each winner will receive a complimentary ticket to attend the conference.

Who Should You Nominate?

Anyone making a meaningful impact in the PyTorch ecosystem! We welcome and encourage self-nominations, and nominations for contributors across all backgrounds, geographies, and roles including:

  • Open source developers

  • Documentation writers

  • Educators and content creators

  • Community advocates

  • Ecosystem builders

  • Bug reporters and fixers

  • Longtime contributors and rising newcomers

Award Categories

You’ll be asked to nominate someone for one of the following categories:

  • PyTorch Superhero – Excellence in all aspects of community contributions

  • PyTorchbearer – Excellence in long-term contributions across all modalities

  • PyTorch Pace-Setter – Excellence in high-level activity and contributions

  • PyTorch Newcomer – Excellence in new contributions

  • PyTorch Ambassador – Excellence in bringing new users to the community
    (Only approved PyTorch Ambassadors are eligible)

  • PyTorch Problem-Solver – Excellence in uncovering or resolving bugs

  • PyTorch Innovator – Excellence in innovative new features or approaches

  • PyTorch Trail Blazer – Excellence in documentation and knowledge sharing

  • PyTorch Rock-Turner – Excellence in submitting interesting issues or bugs

  • PyTorch Ecosystem Champion – Excellence in strengthening the broader ecosystem

How to Submit a Strong Nomination

Want your nominee to shine? Here’s how:

Be Specific

Describe what they did—not just that they were “great.” Examples matter.

Highlight the Impact

Did their work:

  • Improve performance or usability?

  • Reach new users or communities?

  • Help others adopt or learn PyTorch?

Provide Supporting Evidence

Include links to:

  • GitHub issues, PRs, or repos

  • Blog posts, talks, or tutorials

  • Event listings or documentation sprints

Sample Strong Nomination Statements

  • “Led a PyTorch documentation sprint, improving over 200 tutorials to support new users.”

  • “Resolved critical bugs impacting model stability in production deployments.”

  • “Ran workshops in underserved regions, expanding PyTorch’s reach to new users.”

  • “Mentored dozens of first-time contributors through successful PRs and onboarding.”

Celebrating All Forms of Contribution

We welcome nominations from all parts of the community—across genders, geographies, institutions, and contribution types. Contributions may include advocacy, education, bug hunting, outreach, translation, and more.

Questions? Reach out to us at: contributor-award@pytorch.org

Nominate now by visiting the PyTorch Contributor Awards page.  

Let’s recognize the people making PyTorch better for everyone.

Read More

Build dynamic web research agents with the Strands Agents SDK and Tavily

“Tavily is now available on AWS Marketplace and integrates natively with Amazon Bedrock AgentCore Gateway. This makes it even faster for developers and enterprises to embed real-time web intelligence into secure, AWS-powered agents.”

Partnership visualization between Tavily's search technology and AWS's Strands Agents Framework

As enterprises accelerate their AI adoption, the demand for agent frameworks that can autonomously gather, process, and synthesize information has increased. Traditional approaches to building AI agents often require extensive orchestration code, explicit state management, and rigid architectures that are difficult to maintain and scale.

Strands Agents simplifies agent development by addressing these challenges. It introduces a model-centric paradigm that shifts the complexity from hard-coded logic into the large language model (LLM) itself. This dramatically reduces development overhead while increasing agent flexibility—for example, minimizing the need to write explicit logic for each input or output type. By embedding logic directly into the model, agents can be significantly improved simply by swapping in more advanced models as they are released.

In this post, we show how to combine Strands Agents with Tavily’s purpose-built web intelligence API to create powerful research agents that excel at complex information-gathering tasks while maintaining the security and compliance standards required for enterprise deployment.

Strands Agents SDK: Model-centric agent framework

The Strands Agents SDK is an open source framework that revolutionizes AI agent development by embracing a model-driven approach. It offers a code-first, lightweight yet powerful framework for building agentic workflows. Instead of requiring complex orchestration code, the Strands Agents SDK helps developers create sophisticated agents through three primary components:

  • Models – Offers flexible integration with leading LLM providers, including Amazon Bedrock, Anthropic, Ollama, and LiteLLM, and provides an extensible interface for implementing custom model providers.
  • Tools – Allows agents to interact with external systems, access data, and manipulate their environment. Strands Agents offers more than 20 built-in tool capabilities, and helps developers create custom tools using simple Python function decorators.
  • Prompts – Supports natural language instructions that guide agent behavior and objectives.

Strands Agents offers an advanced and rich feature set. With the Strands Agents SDK, developers can build intelligent agents with minimal code while maintaining enterprise-grade capabilities:

  • Security and responsible AI – Provides seamless integration with guardrails for content filtering, personally identifiable information (PII) protection, and more
  • Streamlined agent development lifecycle – Helps developers run agents locally and build complex evaluation workflows that can be automated as part of your continuous integration and delivery (CI/CD) pipelines
  • Flexible deployment – Offers support for many deployment options, from dedicated servers to serverless
  • Observability – Supports OpenTelemetry standard for transmitting logs, metrics, and traces

Strands Agents abstracts away the complexity of building, orchestrating, and deploying intelligent agents, providing a natural language-based interaction and control coupled with dynamic output generation. The result is a more intuitive and powerful development experience.

Tavily: Secure, modular web intelligence for AI agents

Tavily is an API-first web intelligence layer designed specifically for LLM agents, powering real-time search, high-fidelity content extraction, and structured web crawling. Designed for developers building AI-based systems, Tavily is engineered for precision, speed, and modularity. It offers a seamless integration experience for agent frameworks like Strands Agents.

Tavily’s API is an enterprise-grade infrastructure layer trusted by leading AI companies. It combines robust capabilities with production-grade operational guarantees, such as:

  • SOC 2 Type II compliance – Supports best-in-class security and privacy posture
  • Zero data retention – No queries, payloads, or user data are stored, maintaining compliance with strict internal policies and regulatory frameworks
  • Plug-and-play with Amazon Bedrock and private LLMs – Supports hybrid cloud deployments, private language model use, and latency-sensitive inference stacks
  • Modular endpoints – Designed for agent-style interaction, Tavily provides purpose-built APIs for:
    • Search – Retrieve semantically ranked links and content snippets across the public web, filtered by domain, recency, or count
    • Extract – Pull raw content or cleaned markdown from known URLs for summarization, QA, or embedding
    • Crawl – Traverse websites recursively through links to simulate exploratory behavior and build site maps

Each endpoint is exposed as a standalone tool, meaning they can be quickly wrapped into your agent framework’s tool schema (such as OpenAI’s tool-calling, LangChain, Strands, or ReAct-based implementations).

Combining Strands Agents with the Tavily web infrastructure

By combining the flexibility of the Strands Agents SDK with Tavily’s real-time web intelligence capabilities, developers can build dynamic, LLM-powered agents that interact intelligently with the internet. These agents can reason over open-ended queries, make decisions based on natural language prompts, and autonomously gather, process, and deliver insights from the web.

This integration can be appropriate for a wide range of agent-based applications. For example:

  • Customer success agents that proactively retrieve the latest product documentation, policy updates, or external FAQs to resolve support issues faster
  • Internal employee assistants that answer workplace questions by pulling from both internal tools and publicly available information, reducing dependency on knowledge silos
  • Sales and revenue agents that surface timely company news and industry shifts to support account planning and outreach

Each use case benefits from the same foundation: a developer-friendly agent framework, composable web intelligence tools, and the decision-making power of LLMs.

To demonstrate how this comes together in practice, we explore a focused implementation: a research agent designed for autonomous, high-fidelity web investigation.

Research agent example

Many research agent implementations require extensive development effort and rely on deterministic logic or workflows with constrained inputs and outputs. By contrast, Strands enables developers to build highly dynamic agents through natural language. Strands agents use prompt engineering to dynamically generate varied output types and accept diverse natural language inputs seamlessly.

Combining Tavily with Strands unlocks a new class of agents purpose-built for deep, dynamic research. Unlike hardcoded research pipelines, this pairing helps developers accomplish the following:

  • Rapidly develop powerful research agents using Tavily’s endpoints (Search, Crawl, Extract) as tools within the Strands Agents framework, offering a developer-friendly interface
  • Offload complex decision-making relying on the LLM’s native capabilities
  • Inherit performance boosts automatically with every new generation of model (for example, Anthropic Claude’s on Amazon Bedrock or Amazon Nova), as the flexible agent architecture dynamically improves with minimal code changes
  • Combine the enterprise security infrastructure of Amazon Bedrock with Tavily’s zero data retention policies to create a highly secure environment for sensitive research tasks

With Strands Agents and Tavily’s capabilities combined, the agents excel in gathering industry intelligence and providing organizations with real-time insights into trends, competitor activities, and emerging opportunities. Agents can conduct comprehensive competitive analysis, scouring vast amounts of online data to identify strengths, weaknesses, and strategic positioning of industry players. In the realm of technical research, these agents can rapidly assimilate and synthesize complex information from multiple sources, which can help accelerate innovation and problem-solving processes. Additionally, such agents prove invaluable for regulatory compliance monitoring by continuously scanning and interpreting evolving legal landscapes to make sure organizations stay ahead of regulatory changes. The flexibility of the Strands Agents SDK allows for customization to specific industry needs—it’s equally effective for tasks ranging from customer service automation to sophisticated data analysis workflows.

Solution overview

To illustrate this combination, we created a deep researcher implementation (see the GitHub repo) that uses the agent loop capability at the core of the Strands Agents SDK to intelligently and autonomously choose from Tavily’s web intelligence capabilities. The following diagram illustrates this workflow.

Technical workflow showing how Strands Agent orchestrates Tavily search tools

We configured the Strands Agents SDK to use Anthropic’s Claude 4 Sonnet on Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a unified API. The following diagram illustrates the solution architecture.

Comprehensive architecture of Deep Researcher Agent, illustrating data flow between user, local tools, Tavily APIs, and AWS services

This research agent is composed of three primary components:

  • Large language model – Powers the agent to understand queries and generate responses
  • Tools – Helps the agent gather information from the internet using Tavily’s APIs, format the response, and save the output in Markdown format
  • System prompt – Guides the agent’s behavior, outlining how and when to use each tool to achieve its research objectives

In the following sections, we discuss the LLM and tools in more detail.

Large language model

The LLM influences the behavior of the agent as well as the quality of the generated response. We decided to use Anthropic’s Claude 4 Sonnet on Amazon Bedrock for its ability to plan and execute complex tasks, but you can use one of the other models supported by Amazon Bedrock or another model provider.

from strands import Agent
from strands.models import BedrockModel

bedrock_model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="us-east-1",
)
agent = Agent(model=bedrock_model)

Tools

Tools help extend agents’ capabilities and interact with external services such as Tavily. We implemented the following tools to enable our agent to perform deep research over the internet and provide a formatted output:

  • web_search – Search the web for relevant information
  • web_extract – Extract the full page content from a webpage
  • web_crawl – Crawl entire websites and scrape their content
  • format_research_response – Transform raw research content into clear, well-structured, and properly cited responses
  • write_markdown_file – Save the research output in Markdown format on the local file system

To define a tool with the Strands Agents SDK, you can simply wrap a Python function with the @tool decorator and provide a Python docstring with the tool description. Let’s explore an example of how we implemented the web_search tool using Tavily’s search endpoint.

The search endpoint lets agents discover relevant webpages based on a natural language query. Results include URLs, titles, content snippets, semantic scores, or even the full content of matched pages. You can fine-tune searches with parameters such as:

  • Max number of results – Limits the number of results to an upper bound
  • Time range filtering – Limits the results to content published within a specific time frame
  • Domain restrictions – Restricts results to specific domains

See the following code:

import os

from strands import tool
from tavily import TavilyClient

# format_search_results_for_agent is a formatting helper provided in the accompanying GitHub repository

@tool
def web_search(
    query: str, time_range: str | None = None, include_domains: list[str] | None = None
) -> str:
    """Perform a web search. Returns the search results as a string, with the title, url, and content of each result ranked by relevance.
    Args:
        query (str): The search query to be sent for the web search.
        time_range (str | None, optional): Limits results to content published within a specific timeframe.
            Valid values: 'd' (day - 24h), 'w' (week - 7d), 'm' (month - 30d), 'y' (year - 365d).
            Defaults to None.
        include_domains (list[str] | None, optional): A list of domains to restrict search results to.
            Only results from these domains will be returned. Defaults to None.
    Returns:
        formatted_results (str): The web search results
    """
    client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
    formatted_results = format_search_results_for_agent(
        client.search(
            query=query,
            max_results=10,
            time_range=time_range,
            include_domains=include_domains
        )
    )
    return formatted_results

LLMs rely heavily on the tool definition and description to determine how and when to use them. To improve tool accuracy, consider the following best practices:

  • Clearly explain when the tool should be used and its functionality
  • Use type hints in the function signature to describe the parameters, return types, and default values
  • Detail each parameter and provide examples of the accepted formats

Each Tavily endpoint can be exposed to a language model as a distinct tool, giving AI agents flexible, granular access to the web. By combining these tools, agents become dramatically more capable at tasks like research, summarization, competitive intelligence, and decision-making. You can find the other tools implementation in the GitHub repository.
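
As one more illustration of the same pattern, the following is a sketch of how a web_extract tool might be wrapped; it assumes the TavilyClient extract method and the same TAVILY_API_KEY environment variable used above, and the repository’s actual implementation may differ:

import os

from strands import tool
from tavily import TavilyClient

@tool
def web_extract(urls: list[str]) -> str:
    """Extract the full page content from one or more known URLs.

    Args:
        urls (list[str]): The URLs to fetch and extract content from.
    Returns:
        str: The extracted content of each URL, separated by a header line per URL.
    """
    client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
    response = client.extract(urls=urls)  # assumed Tavily extract endpoint wrapper
    sections = []
    for result in response.get("results", []):
        sections.append(f"# {result.get('url')}\n{result.get('raw_content', '')}")
    return "\n\n".join(sections)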

Strategic value proposition

AWS chose Tavily for the following benefits:

  • Shared vision – Tavily and AWS both serve the next generation of AI-based builders, with a strong emphasis on enterprise-readiness, security, and privacy
  • Marketplace integration – Tavily is available on AWS Marketplace, making integration and procurement seamless for enterprise customers
  • Go-to partner for web access – AWS chose Tavily as the premier tool for real-time search integration within the Strands Agents SDK, providing the best web access experience for agent developers
  • Amazon Bedrock – Amazon Bedrock is a fully managed, secure service that offers a choice of high-performing FMs from leading AI companies like Meta, Anthropic, AI21, and Amazon

Conclusion

The combination of the Strands Agents SDK and Tavily represents a significant advancement in enterprise-grade research agent development. This integration can help organizations build sophisticated, secure, and scalable AI agents while maintaining the highest standards of security and performance. To learn more, refer to the following resources:


About the authors

Akarsha Sehwag is a Generative AI Data Scientist on the Amazon Bedrock Agents GTM team. With over six years of expertise in AI/ML product development, she has built machine learning solutions across diverse customer segments.

Lorenzo Micheli is a Principal Delivery Consultant at AWS Professional Services, focused on helping Global Financial Services and Healthcare organizations navigate their cloud journey. He develops strategic roadmaps for generative AI adoption and cloud-native architectures that drive innovation while ensuring alignment with their business objectives and regulatory requirements.

Dean Sacoransky is a Forward Deployed Engineer at Tavily, specializing in applied AI. He helps enterprises and partners use Tavily’s web infrastructure technology to power and enhance their AI systems.

Lee Tzanani is Head of GTM and Partnerships at Tavily. She leads strategic collaborations with Tavily’s most valuable partners and works with enterprise and Fortune 500 customers to integrate real-time web search into production AI systems. Lee drives Tavily’s go-to-market efforts across the AI landscape, advancing its mission to onboard the next billion AI agents to the web.

Sofia Guzowski leads Partnerships and Community at Tavily, where she works with companies to integrate real-time web data into their AI products. She focuses on strategic collaborations, developer engagement, and bringing Tavily’s APIs to the broader AI landscape.

Read More

Wired for Action: Langflow Enables Local AI Agent Creation on NVIDIA RTX PCs

Interest in generative AI is continuing to grow, as new models include more capabilities. With the latest advancements, even enthusiasts without a developer background can dive right into tapping these models.

With popular applications like Langflow — a low-code, visual platform for designing custom AI workflows — AI enthusiasts can use simple, no-code user interfaces (UIs) to chain generative AI models. And with native integration for Ollama, users can now create local AI workflows and run them at no cost and with complete privacy, powered by NVIDIA GeForce RTX and RTX PRO GPUs.

Visual Workflows for Generative AI

Langflow offers an easy-to-use, canvas-style interface where components of generative AI models — like large language models (LLMs), tools, memory stores and control logic — can be connected through a simple drag-and-drop UI.

This allows complex AI workflows to be built and modified without manual scripting, easing the development of agents capable of decision-making and multistep actions. AI enthusiasts can iterate and build complex AI workflows without prior coding expertise.

Build complex AI workflows without prior coding expertise in Langflow.

Unlike apps limited to running a single-turn LLM query, Langflow can build advanced AI workflows that behave like intelligent collaborators, capable of analyzing files, retrieving knowledge, executing functions and responding contextually to dynamic inputs.

Langflow can run models from the cloud or locally — with full acceleration for RTX GPUs through Ollama. Running workflows locally provides multiple key benefits:

  • Data privacy: Inputs, files and prompts remain confined to the device.
  • Low costs and no API keys: As cloud application programming interface access is not required, there are no token restrictions, service subscriptions or costs associated with running the AI models.
  • Performance: RTX GPUs enable low-latency, high-throughput inference, even with long context windows.
  • Offline functionality: Local AI workflows are accessible without the internet.

Creating Local Agents With Langflow and Ollama

Getting started with Ollama within Langflow is simple. Built-in starters are available for use cases ranging from travel agents to purchase assistants. The default templates typically run in the cloud for testing, but they can be customized to run locally on RTX GPUs with Langflow.

Langflow provides a variety of built-in starters to test AI agents.

To build a local workflow:

  • Install the Langflow desktop app for Windows.
  • Install Ollama, then run Ollama and launch the preferred model (Llama 3.1 8B or Qwen3 4B recommended for users’ first workflow).
  • Run Langflow and select a starter.
  • Replace cloud endpoints with local Ollama runtime. For agentic workflows, set the language model to Custom, drag an Ollama node to the canvas and connect the agent node’s custom model to the Language Model output of the Ollama node.

Templates can be modified and expanded — such as by adding system commands, local file search or structured outputs — to meet advanced automation and assistant use cases.

Watch this step-by-step walkthrough from the Langflow team:

Get Started

Below are two sample projects to start exploring.

Create a personal travel itinerary agent: Input all travel requirements — including desired restaurant reservations, travelers’ dietary restrictions and more — to automatically find and arrange accommodations, transport, food and entertainment.

Expand Notion’s capabilities: Notion, an AI workspace application for organizing projects, can be expanded with AI models that automatically input meeting notes, update the status of projects based on Slack chats or email, and send out project or meeting summaries.

RTX Remix Adds Model Context Protocol, Unlocking Agent Mods

RTX Remix — an open-source platform that allows modders to enhance materials with generative AI tools and create stunning RTX remasters that feature full ray tracing and neural rendering technologies — is adding support for Model Context Protocol (MCP) with Langflow.

Langflow nodes with MCP give users a direct interface for working with RTX Remix — enabling modders to build modding assistants capable of intelligently interacting with Remix documentation and mod functions.

To help modders get started, NVIDIA’s Langflow Remix template includes:

  • A retrieval-augmented generation module with RTX Remix documentation.
  • Real-time access to Remix documentation for Q&A-style support.
  • An action module via MCP that supports direct function execution inside RTX Remix, including asset replacement, metadata updates and automated mod interactions.

Modding assistant agents built with this template can determine whether a query is informational or action-oriented. Based on context, agents dynamically respond with guidance or take the requested action. For example, a user might prompt the agent: “Swap this low-resolution texture with a higher-resolution version.” In response, the agent would check the asset’s metadata, locate an appropriate replacement and update the project using MCP functions — without requiring manual interaction.

Documentation and setup instructions for the Remix template are available in the RTX Remix developer guide.

Control RTX AI PCs With Project G-Assist in Langflow

NVIDIA Project G-Assist is an experimental, on-device AI assistant that runs locally on GeForce RTX PCs. It enables users to query system information (e.g. PC specs, CPU/GPU temperatures, utilization), adjust system settings and more — all through simple natural language prompts.

With the G-Assist component in Langflow, these capabilities can be built into custom agentic workflows. Users can prompt G-Assist to “get GPU temperatures” or “tune fan speeds” — and its response and actions will flow through their chain of components.

Beyond diagnostics and system control, G-Assist is extensible via its plug-in architecture, which allows users to add new commands tailored to their workflows. Community-built plug-ins can also be invoked directly from Langflow workflows.

To get started with the G-Assist component in Langflow, read the developer documentation.

Langflow is also a development tool for NVIDIA NeMo microservices, a modular platform for building and deploying AI workflows across on-premises or cloud Kubernetes environments.

With integrated support for Ollama and MCP, Langflow offers a practical no-code platform for building real-time AI workflows and agents that run fully offline and on device, all accelerated by NVIDIA GeForce RTX and RTX PRO GPUs.

Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NVIDIA NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, productivity apps and more on AI PCs and workstations. 

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter. Join NVIDIA’s Discord server to connect with community developers and AI enthusiasts for discussions on what’s possible with RTX AI.

Follow NVIDIA Workstation on LinkedIn and X

See notice regarding software product information.

Read More

Embark on Epic Adventures in August With a Dozen New Games Coming to GeForce NOW

August brings new levels of gaming excitement on GeForce NOW, with 2,300 titles now available to stream in the cloud.

Grab a controller and get ready for epic adventures — a dozen new games are coming to the cloud this month. Each week brings fresh titles for members to discover, stream and conquer.

Get ready to shrink down for big fun — early access for Grounded 2, announced as a surprise in June at the Xbox Games Showcase, will be available to stream on day one: tomorrow, Aug. 1.

Plus, finish off July with the eight games available to stream this week, alongside the latest update for the acclaimed open-world action role-playing game Genshin Impact.

The Park Is Calling

Grounded 2 on GeForce NOW
Oops, shrunken again.

Step into the wild, unpredictable world of Brookhollow Park — Grounded 2 is launching in early access in the cloud, and the backyard has never felt bigger or more alive.

Shrink to the size of an ant, navigate towering blades of grass, uncover hidden secrets and face off against bugs that can be friend or foe. Craft weapons and armor from scavenged materials, build creative bases and raise Buggie companions to help explore and survive. Whether playing solo or with friends, every day brings new challenges and surprises. Grounded 2’s quirky humor and inventive gameplay make every moment an adventure, from dodging predators to racing Buggies through underground tunnels.

Experience every detail rendered in stunning quality streaming from a GeForce RTX-powered gaming rig in the cloud. Jump into the game instantly — no downloads, installs or waiting. Whether on a low-powered laptop, Mac or mobile device, GeForce NOW brings the full power of high-end gaming hardware to members through the cloud.

Arriving in August

Genshin Impact V5.8 on GeForce NOW
Welcome to the Easybreeze Holiday Resort.

Genshin Impact Version 5.8, “Sunspray Summer Resort,” is now live in the cloud, letting players dive in instantly — no updates needed. Explore Natlan’s new Easybreeze Holiday Resort, filled with events, mini games and new exploration features. Meet Ineffa, the first Nod-Krai five-star electro polearm user, alongside returning favorites like Bennett and Venti, in a fresh Story Quest. The update includes limited-time rewards, an exclusive Claymore, new outfits and gameplay improvements. It also marks the start of the “Song of the Welkin Moon” storyline, expanding Natlan and Nod-Krai lore.

Catch the games ready to play today:

  • Grounded 2 (New release on Steam and Xbox, available on PC Game Pass, July 29)
  • Achilles: Survivor (New release on Steam, July 29)
  • Frosthaven (New release on Steam, July 31)
  • Dead Take (New release on Steam, July 31)
  • Farming Simulator 25 (New release on Xbox, available on PC Game Pass, Aug. 1)
  • High on Life (Xbox, available on PC Game Pass)
  • The King Is Watching (Steam)
  • Ultimate Chicken Horse (Xbox, available on PC Game Pass)

Catch the full list of games coming to the cloud in August:

  • Mafia: The Old Country (New release on Steam, Aug. 8)
  • Echoes of the End (New release on Steam, Aug. 12)
  • Warhammer 40,000: Dawn of War – Definitive Edition (New release on Steam, Aug. 14)
  • Supraworld (New release on Steam, Aug. 15)
  • VOID/BREAKER (New release on Steam, Aug. 20)
  • Aztecs: The Last Sun (New release on Steam, Aug. 28)
  • Among Us 3D (Steam)
  • Funko Fusion (Steam)
  • Field of Glory II: Medieval (Steam)
  • The Rogue Prince of Persia (New release on Ubisoft)
  • No Sleep for Kaname Date – From AI: THE SOMNIUM FILES (Steam)

Say Bye to July

In addition to the 20 games announced in July, an extra 15 joined over the month, including Achilles: Survivor, Frosthaven and Dead Take, newly added this week:

  • Abiotic Factor (Xbox, available on PC Game Pass)
  • Barony (Steam)
  • BitCraft Online (Steam)
  • He Is Coming (Xbox, available on PC Game Pass)
  • Humanity (Xbox, available on the Microsoft store)
  • Legion TD 2 (Epic Games Store)
  • PEAK (Steam)
  • RoboCop: Rogue City (New release on Xbox)
  • SteamWorld Dig (Steam)
  • SUPERVIVE (Steam)
  • Tony Hawk’s Pro Skater 1 + 2 (Steam, Epic Games Store and Xbox, available on the Microsoft Store)
  • WUCHANG: Fallen Feathers (Xbox, available on PC Game Pass)

Call of Duty: Modern Warfare II and Call of Duty: Modern Warfare III were moved from the main Call of Duty HQ launcher and are now standalone downloads. Both games will be available to stream on GeForce NOW in the coming days. Check out this knowledge base article for more information.

What are you planning to play this weekend? Let us know on X or in the comments below.

Automate the creation of handout notes using Amazon Bedrock Data Automation

Organizations across various sectors face significant challenges when converting meeting recordings or recorded presentations into structured documentation. Creating handouts from presentations requires considerable manual effort: reviewing recordings to identify slide transitions, transcribing spoken content, capturing and organizing screenshots, synchronizing visual elements with speaker notes, and formatting the final document. These challenges limit productivity and scalability, especially when dealing with multiple presentation recordings, conference sessions, training materials, and educational content.

In this post, we show how you can build an automated, serverless solution that transforms webinar recordings into comprehensive handouts using Amazon Bedrock Data Automation for video analysis. We walk you through using Amazon Bedrock Data Automation to transcribe the audio and detect slide changes, as well as using Amazon Bedrock foundation models (FMs) for transcription refinement, combined with custom AWS Lambda functions orchestrated by AWS Step Functions. Through implementation details, architectural patterns, and code examples, you will learn how to build a workflow that automates the handout creation process.

Amazon Bedrock Data Automation

Amazon Bedrock Data Automation uses generative AI to automate the transformation of multimodal data (such as images, videos and more) into a customizable structured format. Examples of structured formats include summaries of scenes in a video, unsafe or explicit content in text and images, or organized content based on advertisements or brands. The solution presented in this post uses Amazon Bedrock Data Automation to extract audio segments and different shots in videos.

Solution overview

Our solution uses a serverless architecture orchestrated by Step Functions to process presentation recordings into comprehensive handouts. The workflow consists of the following steps:

  1. The workflow begins when a video is uploaded to Amazon Simple Storage Service (Amazon S3), which triggers an event notification through Amazon EventBridge rules that initiates our video processing workflow in Step Functions.
  2. After the workflow is triggered, Amazon Bedrock Data Automation initiates a video transformation job to identify the different shots in the video; in our case, each shot corresponds to a change of slides. The workflow moves into a waiting state and checks the transformation job's progress. If the job is still in progress, the workflow returns to the waiting state. When the job is complete, the workflow continues, having extracted both the visual shots and the spoken content.
  3. These visual shots and spoken content feed into a synchronization step. In this Lambda function, we use the output of the Amazon Bedrock Data Automation job to match the spoken content to the corresponding shots based on their timestamps.
  4. After the function has matched the spoken content to the visual shots, the workflow moves into a parallel state. One branch of this state generates screenshots: an FFmpeg-enabled Lambda function creates an image for each identified video shot.
  5. The other branch refines the transcriptions. Amazon Bedrock processes and improves each raw transcription section through a Map state, which helps remove speech disfluencies and improve sentence structure.
  6. Lastly, after the screenshots and refined transcript are created, the workflow uses a Lambda function to create the handouts. We use the python-pptx library, which generates the final presentation with synchronized content. The final handouts are stored in Amazon S3 for distribution.

The following diagram illustrates this workflow.

AWS Step Functions workflow diagram for data automation process

If you want to try out this solution, we have created an AWS Cloud Development Kit (AWS CDK) stack available in the accompanying GitHub repo that you can deploy in your account. It deploys the Step Functions state machine to orchestrate the creation of handout notes from the presentation video recording. It also provides you with a sample video to test out the results.

To deploy and test the solution in your own account, follow the instructions in the GitHub repository’s README file. The following sections describe in more detail the technical implementation details of this solution.

Video upload and initial processing

The workflow begins with Amazon S3, which serves as the entry point for our video processing pipeline. When a video is uploaded to a dedicated S3 bucket, it triggers an event notification that, through EventBridge rules, initiates our Step Functions workflow.
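
The accompanying AWS CDK stack creates this wiring for you. Purely as an illustration, the pattern could be expressed in CDK (Python) roughly as follows; the construct IDs and the helper function name are hypothetical.

from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct

def wire_upload_trigger(scope: Construct, state_machine: sfn.IStateMachine) -> s3.Bucket:
    # Bucket that emits object-level events to EventBridge
    upload_bucket = s3.Bucket(scope, "VideoUploadBucket", event_bridge_enabled=True)

    # Rule that matches newly created objects in the upload bucket
    rule = events.Rule(
        scope,
        "VideoUploadedRule",
        event_pattern=events.EventPattern(
            source=["aws.s3"],
            detail_type=["Object Created"],
            detail={"bucket": {"name": [upload_bucket.bucket_name]}},
        ),
    )

    # Start the handout-generation state machine for each upload
    rule.add_target(targets.SfnStateMachine(state_machine))
    return upload_bucket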

Shot detection and transcription using Amazon Bedrock Data Automation

This step uses Amazon Bedrock Data Automation to detect slide transitions and create video transcriptions. To integrate this as part of the workflow, you must create an Amazon Bedrock Data Automation project. A project is a grouping of output configurations. Each project can contain standard output configurations as well as custom output blueprints for documents, images, video, and audio. The project has already been created as part of the AWS CDK stack. After you set up your project, you can process content using the InvokeDataAutomationAsync API. In our solution, we use the Step Functions service integration to execute this API call and start the asynchronous processing job. A job ID is returned for tracking the process.

The workflow must now check the status of the processing job before continuing with the handout creation process. This is done by polling Amazon Bedrock Data Automation for the job status using the GetDataAutomationStatus API on a regular basis. Using a combination of the Step Functions Wait and Choice states, we can ask the workflow to poll the API on a fixed interval. This not only gives you the ability to customize the interval depending on your needs, but it also helps you control the workflow costs, because every state transition is billed in Standard workflows, which this solution uses.

When the GetDataAutomationStatus API output shows as SUCCESS, the loop exits and the workflow continues to the next step, which will match transcripts to the visual shots.
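
For reference, the same pair of API calls looks roughly like the following when issued directly from Python with boto3. This is a sketch only: the solution calls the APIs through Step Functions service integrations, and the request fields, ARNs, and status values shown here are assumptions to be checked against the current API reference.

import time
import boto3

bda = boto3.client("bedrock-data-automation-runtime")

# Start the asynchronous video analysis job (request shape assumed; the ARNs
# below are placeholders for your project and profile)
job = bda.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://my-input-bucket/presentation.mp4"},
    outputConfiguration={"s3Uri": "s3://my-output-bucket/bda-results/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:<region>:<account>:data-automation-project/<id>"
    },
    dataAutomationProfileArn="arn:aws:bedrock:<region>:<account>:data-automation-profile/<name>",
)

# Poll on a fixed interval, mirroring the Wait/Choice loop in the workflow
while True:
    response = bda.get_data_automation_status(invocationArn=job["invocationArn"])
    status = response["status"]
    if status == "Success":  # terminal success value assumed
        break
    if status not in ("Created", "InProgress"):  # in-progress values assumed
        raise RuntimeError(f"Data Automation job ended with status {status}")
    time.sleep(30)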

Matching audio segments with corresponding shots

To create comprehensive handouts, you must establish a mapping between the visual shots and their corresponding audio segments. This mapping is crucial to make sure the final handouts accurately represent both the visual content and the spoken narrative of the presentation.

A shot represents a series of interrelated consecutive frames captured during the presentation, typically indicating a distinct visual state. In our presentation context, a shot corresponds to either a new slide or a significant slide animation that adds or modifies content.

An audio segment is a specific portion of an audio recording that contains uninterrupted spoken language, with minimal pauses or breaks. This segment captures a natural flow of speech. The Amazon Bedrock Data Automation output provides an audio_segments array, with each segment containing precise timing information such as the start and end time of each segment. This allows for accurate synchronization with the visual shots.

The synchronization between shots and audio segments is critical for creating accurate handouts that preserve the presentation’s narrative flow. To achieve this, we implement a Lambda function that manages the matching process in three steps:

  1. The function retrieves the processing results from Amazon S3, which contains both the visual shots and audio segments.
  2. It creates structured JSON arrays from these components, preparing them for the matching algorithm.
  3. It executes a matching algorithm that analyzes the different timestamps of the audio segments and the shots, and matches them based on these timestamps. This algorithm also considers timestamp overlaps between shots and audio segments.

For each shot, the function examines the audio segments and identifies those whose timestamps overlap with the shot's duration, making sure the relevant spoken content is associated with its corresponding slide in the final handouts. The function returns the matched results directly to the Step Functions workflow, where they serve as input for the parallel steps that follow: refining the transcribed content with Amazon Bedrock and generating the screenshots.
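
Purely as an illustration of this overlap test, a simplified version of the matching step could look like the following; the field names (start, end, text) are placeholders for the actual Amazon Bedrock Data Automation output schema, and the helper name is made up.

def match_segments_to_shots(shots, audio_segments):
    """Attach to each shot the audio segments whose time ranges overlap it."""
    matched = []
    for shot in shots:
        overlapping = [
            seg for seg in audio_segments
            if seg["start"] < shot["end"] and seg["end"] > shot["start"]
        ]
        matched.append({
            "shot": shot,
            # Concatenate the overlapping spoken content for this slide
            "transcript": " ".join(seg["text"] for seg in overlapping),
        })
    return matched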

Screenshot generation

After you get the timestamps of each shot and associated audio segment, you can capture the slides of the presentation to create comprehensive handouts. Each detected shot from Amazon Bedrock Data Automation represents a distinct visual state in the presentation—typically a new slide or significant content change. By generating screenshots at these precise moments, we make sure our handouts accurately represent the visual flow of the original presentation.

This is done with a Lambda function using the ffmpeg-python library. This library acts as a Python binding for the FFmpeg media framework, so you can run FFmpeg terminal commands using Python methods. In our case, we can extract frames from the video at specific timestamps identified by Amazon Bedrock Data Automation. The screenshots are stored in an S3 bucket to be used in creating the handouts, as described in the following code. To use ffmpeg-python in Lambda, we created a Lambda ZIP deployment containing the required dependencies to run the code. Instructions on how to create the ZIP file can be found in our GitHub repository.

The following code shows how a screenshot is taken using ffmpeg-python. You can view the full Lambda code on GitHub.

## Taking a screenshot at a specific timestamp
import ffmpeg
ffmpeg.input(video_path, ss=timestamp).output(screenshot_path, vframes=1).run()
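
The full Lambda code is available on GitHub; as a simplified, self-contained sketch, the per-shot loop and the upload to Amazon S3 could look like the following, where the helper name, bucket, and key prefix are placeholders.

import boto3
import ffmpeg

s3 = boto3.client("s3")

def capture_shot_screenshots(video_path, shot_timestamps, bucket, prefix):
    """Extract one frame per detected shot and upload it to Amazon S3."""
    keys = []
    for i, timestamp in enumerate(shot_timestamps):
        screenshot_path = f"/tmp/shot_{i:03d}.png"
        # Seek to the shot boundary and write a single frame
        ffmpeg.input(video_path, ss=timestamp).output(
            screenshot_path, vframes=1
        ).run(overwrite_output=True)
        key = f"{prefix}/shot_{i:03d}.png"
        s3.upload_file(screenshot_path, bucket, key)
        keys.append(key)
    return keys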

Transcript refinement with Amazon Bedrock

In parallel with the screenshot generation, we refine the transcript using a large language model (LLM). We do this to improve the quality of the transcript and filter out errors and speech disfluencies. This process uses an Amazon Bedrock model to enhance the quality of the matched transcription segments while maintaining content accuracy. We use a Lambda function that integrates with Amazon Bedrock through the Python Boto3 client, using a prompt to guide the model’s refinement process. The function can then process each transcript segment, instructing the model to do the following:

  • Fix typos and grammatical errors
  • Remove speech disfluencies (such as “uh” and “um”)
  • Maintain the original meaning and technical accuracy
  • Preserve the context of the presentation

In our solution, we used the following prompt with three example inputs and outputs:

prompt = '''This is the result of a transcription. 
I want you to look at this audio segment and fix the typos and mistakes present. 
Feel free to use the context of the rest of the transcript to refine (but don't leave out any info). 
Leave out parts where the speaker misspoke. 
Make sure to also remove words like "uh" or "um". 
Only make changes to the info or sentence structure when there are mistakes. 
Only give back the refined transcript as output, don't add anything else or any context or title. 
If there are no typos or mistakes, return the original object input. 
Do not explain why you have or have not made any changes; I just want the JSON object. 

These are examples: 
Input: <an example-input> 
Output: <an example-output>

Input: <an example-input> 
Output: <an example-output>

Input: <an example-input> 
Output: <an example-output>

Here is the object: ''' + text

The following is an example input and output:

Input: Yeah. Um, so let's talk a little bit about recovering from a ransomware attack, right?

Output: Yes, let's talk a little bit about recovering from a ransomware attack.
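
As a minimal sketch of the per-segment call, the Lambda function could send the prompt shown above to Amazon Bedrock as follows. The solution itself uses the InvokeModel API; this sketch uses the Converse API for brevity, and the model ID is a placeholder for any text model enabled in your account.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder model ID; substitute any text model enabled in your account
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def refine_segment(prompt: str) -> str:
    """Send the refinement prompt (built as shown above) to Amazon Bedrock
    and return the cleaned transcript text."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]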

To optimize processing speed while adhering to the maximum token limits of the Amazon Bedrock InvokeModel API, we use the Step Functions Map state. This enables parallel processing of multiple transcriptions, each corresponding to a separate video segment. Because these transcriptions must be handled individually, the Map state efficiently distributes the workload. It also reduces operational overhead by managing the integration: it takes an array as input, passes each element to the Lambda function, and automatically reconstructs the array upon completion. The Map state returns the refined transcript directly to the Step Functions workflow, maintaining the structure of the matched segments while providing cleaner, more professional text for the final handout generation.

Handout generation

The final step in our workflow involves creating the handouts using the python-pptx library. This step combines the refined transcripts with the generated screenshots to create a comprehensive presentation document.

The Lambda function processes the matched segments sequentially, creating a new slide for each screenshot while adding the corresponding refined transcript as speaker notes. The implementation uses a custom Lambda layer containing the python-pptx package. To enable this functionality in Lambda, we created a custom layer using Docker. By using Docker to create our layer, we make sure the dependencies are compiled in an environment that matches the Lambda runtime. You can find the instructions to create this layer and the layer itself in our GitHub repository.

The Lambda function implementation uses python-pptx to create the structured presentation. The following abridged handler shows the core logic; the segment data and screenshot paths come from the workflow input (field names abbreviated here), and the upload of the resulting file to Amazon S3 is omitted:

import os

from pptx import Presentation
from pptx.util import Emu

def lambda_handler(event, context):
    # Segment data produced by the earlier workflow steps (field names
    # abbreviated here); each screenshot has already been downloaded to /tmp
    transcription_segments = event["transcription_segments"]
    image_paths = event["image_paths"]
    tmp_dir = "/tmp"

    # Create a new presentation with standard 16:9 dimensions (EMU units)
    prs = Presentation()
    prs.slide_width = Emu(12192000)   # Standard presentation width
    prs.slide_height = Emu(6858000)   # Standard presentation height

    # Process each segment
    for i, image_path in enumerate(image_paths):
        # Add a new slide (layout 5 of the default template)
        slide = prs.slides.add_slide(prs.slide_layouts[5])

        # Add the screenshot as a full-slide image
        slide.shapes.add_picture(image_path, 0, 0, width=prs.slide_width)

        # Add the refined transcript as speaker notes
        notes_slide = slide.notes_slide
        transcription_text = transcription_segments[i].get('transcript', '')
        notes_slide.notes_text_frame.text = transcription_text

    # Save the presentation locally before uploading it to Amazon S3
    pptx_path = os.path.join(tmp_dir, "lecture_notes.pptx")
    prs.save(pptx_path)
The function processes segments sequentially, creating a presentation that combines visual shots with their corresponding audio segments, resulting in handouts ready for distribution.

The following screenshot shows an example of a generated slide with notes. The full deck has been added as a file in the GitHub repository.

Slide presentation showing an example output

Conclusion

In this post, we demonstrated how to build a serverless solution that automates the creation of handout notes from recorded slide presentations. By combining Amazon Bedrock Data Automation with custom Lambda functions, we’ve created a scalable pipeline that significantly reduces the manual effort required in creating handout materials. Our solution addresses several key challenges in content creation:

  • Automated detection of slide transitions, content changes, and accurate transcription of spoken content using the video modality capabilities of Amazon Bedrock Data Automation
  • Intelligent refinement of transcribed text using Amazon Bedrock
  • Synchronized visual and textual content with a custom matching algorithm
  • Handout generation using the ffmpeg-python and python-pptx libraries in Lambda

The serverless architecture, orchestrated by Step Functions, provides reliable execution while maintaining cost-efficiency. By using Python packages for FFmpeg and a Lambda layer for python-pptx, we’ve overcome technical limitations and created a robust solution that can handle various presentation formats and lengths. This solution can be extended and customized for different use cases, from educational institutions to corporate training programs. Certain steps such as the transcript refinement can also be improved, for instance by adding translation capabilities to account for diverse audiences.

To learn more about Amazon Bedrock Data Automation, refer to the Amazon Bedrock documentation.


About the authors

Laura Verghote is the GenAI Lead for PSI Europe at Amazon Web Services (AWS), driving generative AI adoption across public sector organizations. She partners with customers throughout Europe to accelerate their GenAI initiatives through technical expertise and strategic planning, bridging complex requirements with innovative AI solutions.

Elie Elmalem is a solutions architect at Amazon Web Services (AWS) and supports Education customers across the UK and EMEA. He works with customers to effectively use AWS services, providing architectural best practices, advice, and guidance. Outside of work, he enjoys spending time with family and friends and loves watching his favorite football team play.
