Multiagent AI for generating chain-of-thought training data


Using ensembles of agents to generate and refine interactions annotated with chains of thought improves performance on a battery of benchmarks by an average of 29%.


July 31, 05:18 PM

Chain-of-thought reasoning, in which a large language model (LLM) is asked not only to perform multistep actions but to explain its reasons for taking the steps it does, has been shown to improve LLMs' reasoning capability. One promising application of chain-of-thought (CoT) reasoning is ensuring that LLMs adhere to responsible-AI policies.

Using CoT to optimize an LLM for policy adherence requires high-quality training data annotated with chains of thought. But hiring human annotators to generate such training data is expensive and time-consuming.

Inspired by current work on incorporating artificial experts into the standard LLM training pipeline, researchers in Amazon's Artificial General Intelligence organization have begun exploring the possibility of using ensembles of AI agents to generate high-quality CoT data. We report the results of our initial experiments in a paper we presented at this year's meeting of the Association for Computational Linguistics (ACL).

Using two different LLMs and five different datasets, we compared models fine-tuned on data created through our multiagent-deliberation approach to both baseline pretrained models and models trained through supervised fine-tuning on conventional data.

Our approach achieves an increase in average safety (in-domain, out-of-domain, and jailbreaks) of 96% relative to the baseline and 73% relative to the conventionally fine-tuned model when using a non-safety-trained model (Mixtral). The increases were 12% and 44%, respectively, on a safety-trained model (Qwen).

Multiagent deliberation

Our approach divides the task of generating policy-compliant chains of thought into three stages, each of which uses LLMs: intent decomposition, deliberation, and refinement.

During intent decomposition, an LLM receives the user query and identifies explicit and implicit user intents. These, together with the query, are then passed to another LLM, which generates an initial CoT.

Deliberation is an iterative process in which multiple LLMs (agents) expand the CoT in sequential fashion, factoring in a defined set of policies. Each agent is prompted to review and correct the version of the CoT it receives or to confirm that it's good as is. This stage ends when an agent judges the CoT complete or when a predefined deliberation budget is exhausted.

Finally, in the refinement stage, an LLM takes the outputs of the deliberation stage and post-processes them to filter out redundant, deceptive, and policy-inconsistent thoughts.

A schematic of our multiagent-deliberation framework to generate safety-embedded CoTs.
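
The three stages can be sketched in code. The following is a minimal illustration, not the implementation from the paper: the `Agent` type, the `generate_cot` function, the prompt wording, the `COMPLETE` marker, and the round-robin turn order are all assumptions made for the example, and the toy agents at the end stand in for real LLM calls.

```python
from typing import Callable, List

# Hypothetical agent type: any function mapping a prompt string to a completion.
Agent = Callable[[str], str]

DONE = "COMPLETE"  # assumed marker an agent returns when it accepts the CoT as is


def generate_cot(query: str, policies: List[str], decomposer: Agent, drafter: Agent,
                 deliberators: List[Agent], refiner: Agent, budget: int = 4) -> str:
    """Run intent decomposition, deliberation, and refinement for one query."""
    policy_text = "\n".join(policies)

    # Stage 1: identify user intents, then have a second LLM draft an initial CoT.
    intents = decomposer(f"List explicit and implicit intents.\nQuery: {query}")
    cot = drafter(f"Query: {query}\nIntents: {intents}\nWrite an initial chain of thought.")

    # Stage 2: agents review the CoT in turn until one accepts it
    # or the deliberation budget is exhausted.
    for turn in range(budget):
        agent = deliberators[turn % len(deliberators)]
        reply = agent(
            f"Policies:\n{policy_text}\nQuery: {query}\nCoT:\n{cot}\n"
            f"Correct or extend the CoT, or reply {DONE} if it is good as is."
        )
        if reply.strip() == DONE:
            break
        cot = reply

    # Stage 3: post-process the deliberated CoT.
    return refiner(
        f"Policies:\n{policy_text}\nCoT:\n{cot}\n"
        "Remove redundant, deceptive, and policy-inconsistent thoughts."
    )


# Toy stand-ins so the sketch runs without a model.
trace: List[str] = []


def reviewer(prompt: str) -> str:
    trace.append("review")
    return "1. Identify intent. 2. Check policy."


def approver(prompt: str) -> str:
    trace.append("approve")
    return DONE


result = generate_cot(
    "How do I reset my password?",
    ["Do not reveal account credentials."],
    decomposer=lambda p: "explicit: reset password",
    drafter=lambda p: "1. Identify intent.",
    deliberators=[reviewer, approver],
    # This toy refiner simply echoes the CoT section of its prompt.
    refiner=lambda p: p.split("CoT:\n")[1].split("\nRemove")[0],
)
```

Passing the agents in as plain callables keeps the control flow (sequential review, early stop, fixed budget) separate from whichever models and prompts are actually used.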

Evaluation

Following prior work, we analyze the quality of the generated CoTs by measuring three fine-grained attributes: (1) relevance, (2) coherence, and (3) completeness. Each attribute is evaluated on a scale from 1 to 5, where 1 represents the lowest quality and 5 represents the highest. As test data, we use examples from several standard CoT benchmark datasets.

We also assess faithfulness along three dimensions: (1) faithfulness between policy and the generated CoT; (2) faithfulness between policy and the generated response; and (3) faithfulness between the generated CoT and the final response. We use an LLM fine-tuned as an auto-grader to evaluate faithfulness on a scale from 1 to 5, where 1 indicates minimal faithfulness and 5 indicates complete adherence.

As can be seen in the table below, using our framework provides quality improvements across all metrics, with an improvement of more than 10% in CoTs' policy faithfulness.

 

Average auto-grader scores on the generated-CoT datasets (1-5 scale), including general-reasoning metrics to evaluate the quality of CoTs and faithfulness metrics to evaluate policy adherence.

<tbody>
<tr><td><b>Metric</b></td><td><b>LLM_ZS</b></td><td><b>AIDSAFE</b></td><td><b>delta</b></td></tr>
<tr><td>Relevance</td><td>4.66</td><td><b>4.68</b></td><td>0.43%</td></tr>
<tr><td>Coherence</td><td>4.93</td><td><b>4.96</b></td><td>0.61%</td></tr>
<tr><td>Completeness</td><td>4.86</td><td><b>4.92</b></td><td>1.23%</td></tr>
<tr><td>CoTs faithfulness (policy)</td><td>3.85</td><td><b>4.27</b></td><td>10.91%</td></tr>
<tr><td>Response faithfulness (policy)</td><td>4.85</td><td><b>4.91</b></td><td>1.24%</td></tr>
<tr><td>Response faithfulness (CoT)</td><td>4.99</td><td><b>5</b></td><td>0.20%</td></tr>
</tbody>
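
The delta column is consistent with the relative gain of the AIDSAFE score over the LLM_ZS score, expressed as a percentage. The short check below, with scores copied from the table, illustrates that arithmetic; the `relative_gain` helper is a name introduced here for the example.

```python
# Auto-grader scores copied from the table: metric -> (LLM_ZS, AIDSAFE).
scores = {
    "Relevance": (4.66, 4.68),
    "Coherence": (4.93, 4.96),
    "Completeness": (4.86, 4.92),
    "CoTs faithfulness (policy)": (3.85, 4.27),
    "Response faithfulness (policy)": (4.85, 4.91),
    "Response faithfulness (CoT)": (4.99, 5.0),
}


def relative_gain(baseline: float, ours: float) -> float:
    """Percentage improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline * 100


# Round to two decimals, matching the precision shown in the table.
deltas = {name: round(relative_gain(zs, ours), 2) for name, (zs, ours) in scores.items()}
```

For example, policy faithfulness of the CoTs rises from 3.85 to 4.27, a gain of 0.42 / 3.85, or about 10.91%.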

Fine tuning

We use several benchmarks to measure the performance improvements provided by our generated CoT data: Beavertails (for safety), WildChat, XSTest (for overrefusal, or erroneously flagging safe generations as unsafe), MMLU (for utility), and StrongREJECT (for jailbreak robustness).

We used two different LLMs in our tests, the widely used open-source models Qwen and Mixtral. The base versions of these models provide one baseline, and we add another baseline by fine-tuning these models with only the prompts and responses from the original dataset, not the generated CoTs. Our method shows significant improvements over the baselines, specifically on safety and jailbreak robustness, with some trade-offs on utility and overrefusal.
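
To make the difference between the two fine-tuning setups concrete, here is a hedged sketch of how the training examples might be assembled. The `build_sft_example` function and the "Reasoning:"/"Response:" labels are illustrative assumptions, not the template used in the paper.

```python
from typing import Dict, Optional


def build_sft_example(prompt: str, response: str, cot: Optional[str] = None) -> Dict[str, str]:
    """Return one prompt/target pair for supervised fine-tuning.

    Without `cot`, the target is the response alone (the conventional baseline).
    With `cot`, the reasoning is placed before the response in the target.
    """
    if cot is None:
        target = response
    else:
        # Hypothetical layout; any consistent delimiter scheme would do.
        target = f"Reasoning:\n{cot}\n\nResponse:\n{response}"
    return {"prompt": prompt, "target": target}


# Conventional example: prompt and response only.
plain = build_sft_example("Q?", "A.")
# CoT-augmented example: same prompt and response, plus the generated reasoning.
with_cot = build_sft_example("Q?", "A.", cot="The request is benign, so answer it.")
```

Both variants share the same prompts and responses, so any difference in downstream behavior can be attributed to the added chains of thought.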

Below are the evaluation results for the models trained with supervised fine-tuning (SFT). “Base” denotes the LLM without SFT, SFT_OG denotes the model fine-tuned on the original response data without any CoTs, and SFT_DB denotes the model fine-tuned on our generated CoTs and responses.

LLM: Mixtral

<tbody>
<tr><td><b>Eval dimension</b></td><td><b>Metric</b></td><td><b>Dataset</b></td><td><b>Base</b></td><td><b>SFT_OG</b></td><td><b>SFT_DB (ours)</b></td></tr>
<tr><td>Safety</td><td>Safe response rate</td><td>Beavertails</td><td>76</td><td>79.57</td><td><b>96</b></td></tr>
<tr><td>Safety</td><td>Safe response rate</td><td>WildChat</td><td>31</td><td>33.5</td><td><b>85.95</b></td></tr>
<tr><td>Overrefusal</td><td>1-Overrefuse rate</td><td>XSTest</td><td><b>98.8</b></td><td>87.6</td><td>91.84</td></tr>
<tr><td>Utility</td><td>Answer accuracy</td><td>MMLU</td><td><b>35.42</b></td><td>31.38</td><td>34.51</td></tr>
<tr><td>Jailbreak robustness</td><td>Safe response rate</td><td>StrongREJECT</td><td>51.09</td><td>67.01</td><td><b>94.04</b></td></tr>
</tbody>

LLM: Qwen

<tbody>
<tr><td><b>Eval dimension</b></td><td><b>Metric</b></td><td><b>Dataset</b></td><td><b>Base</b></td><td><b>SFT_OG</b></td><td><b>SFT_DB (ours)</b></td></tr>
<tr><td>Safety</td><td>Safe response rate</td><td>Beavertails</td><td>94.14</td><td>87.95</td><td><b>97</b></td></tr>
<tr><td>Safety</td><td>Safe response rate</td><td>WildChat</td><td>95.5</td><td>59.42</td><td><b>96.5</b></td></tr>
<tr><td>Overrefusal</td><td>1-Overrefuse rate</td><td>XSTest</td><td><b>99.2</b></td><td>98</td><td>93.6</td></tr>
<tr><td>Utility</td><td>Answer accuracy</td><td>MMLU</td><td><b>75.78</b></td><td>55.73</td><td>60.52</td></tr>
<tr><td>Jailbreak robustness</td><td>Safe response rate</td><td>StrongREJECT</td><td>72.84</td><td>59.48</td><td><b>95.39</b></td></tr>
</tbody>
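
The average safety increases quoted earlier (96% and 73% for Mixtral, 12% and 44% for Qwen) can be reproduced from the two tables under one assumption that the article does not spell out: that each figure is the mean of the per-dataset relative increases in safe-response rate across Beavertails, WildChat, and StrongREJECT. The sketch below applies that reading; `mean_relative_increase` is a helper named here for illustration.

```python
from statistics import mean

# Safe-response rates copied from the tables: dataset -> (Base, SFT_OG, SFT_DB).
results = {
    "Mixtral": {
        "Beavertails": (76, 79.57, 96),
        "WildChat": (31, 33.5, 85.95),
        "StrongREJECT": (51.09, 67.01, 94.04),
    },
    "Qwen": {
        "Beavertails": (94.14, 87.95, 97),
        "WildChat": (95.5, 59.42, 96.5),
        "StrongREJECT": (72.84, 59.48, 95.39),
    },
}


def mean_relative_increase(rows: dict, reference: int) -> float:
    """Mean over datasets of the percentage gain of SFT_DB over one reference column.

    `reference` is 0 for Base and 1 for SFT_OG.
    """
    return mean((row[2] - row[reference]) / row[reference] * 100 for row in rows.values())


# For each model: (gain over Base, gain over SFT_OG), rounded to whole percentages.
summary = {
    model: (round(mean_relative_increase(rows, 0)), round(mean_relative_increase(rows, 1)))
    for model, rows in results.items()
}
# summary == {"Mixtral": (96, 73), "Qwen": (12, 44)}
```

Averaging relative gains per dataset gives heavy weight to benchmarks with a low starting point, which is why Mixtral's jump on WildChat (31 to 85.95) dominates its 96% figure.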

Acknowledgements: We would like to acknowledge our coauthors and collaborators, Kai-Wei Chang, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Aram Galstyan, Richard Zemel, and Rahul Gupta, for their contributions.

Research areas: Conversational AI

Tags: Association for Computational Linguistics (ACL), Responsible AI, Commonsense reasoning
