Amazon Bedrock Model Distillation is generally available, and it addresses the fundamental challenge many organizations face when deploying generative AI: how to maintain high performance while reducing costs and latency. This technique transfers knowledge from larger, more capable foundation models (FMs) that act as teachers to smaller, more efficient models (students), creating specialized models that excel at specific tasks. In this post, we highlight the advanced data augmentation techniques and performance improvements in Amazon Bedrock Model Distillation with Meta’s Llama model family.
Agent function calling represents a critical capability for modern AI applications, allowing models to interact with external tools, databases, and APIs by accurately determining when and how to invoke specific functions. Although larger models typically excel at identifying the appropriate functions to call and constructing proper parameters, they come with higher costs and latency. Amazon Bedrock Model Distillation now enables smaller models to achieve comparable function calling accuracy while delivering substantially faster response times and lower operational costs.
The value proposition is compelling: organizations can deploy AI agents that maintain high accuracy in tool selection and parameter construction while benefiting from the reduced footprint and increased throughput of smaller models. This advancement makes sophisticated agent architectures more accessible and economically viable across a broader range of applications and scales of deployment.
Prerequisites
For a successful implementation of Amazon Bedrock Model Distillation, you’ll need to meet several requirements. We recommend referring to the Submit a model distillation job in Amazon Bedrock in the official AWS documentation for the most up-to-date and comprehensive information.
Key requirements include:
- An active AWS account
- Selected teacher and student models enabled in your account (verify on the Model access page of the Amazon Bedrock console)
- An S3 bucket for storing input datasets and output artifacts
- Appropriate IAM permissions:
- Trust relationship allowing Amazon Bedrock to assume the role
- Permissions to access S3 for input/output data and invocation logs
- Permissions for model inference when using inference profiles
If you’re using historical invocation logs, confirm if model invocation logging is enabled in your Amazon Bedrock settings with S3 selected as the logging destination.
Preparing your data
Effective data preparation is crucial for successful distillation of agent function calling capabilities. Amazon Bedrock provides two primary methods for preparing your training data: uploading JSONL files to Amazon S3 or using historical invocation logs. Regardless of which method to choose, you’ll need to prepare proper formatting of tool specifications to enable successful agent function calling distillation.
Tool specification format requirements
For agent function calling distillation, Amazon Bedrock requires that tool specifications be provided as part of your training data. These specifications must be encoded as text within the system or user message of your input data. The example shown is using the Llama model family’s function calling format:
This approach lets the model learn how to interpret tool definitions and make appropriate function calls based on user queries. Afterwards, when calling inference on the distilled student model, we suggest keeping the prompt format consistent with the distillation input data. This provides optimal performance by maintaining the same structure the model was trained on.
Preparing data using Amazon S3 JSONL upload
When creating a JSONL file for distillation, each record must follow this structure:
Each record must include the schemaVersion
field with the value bedrock-conversation-2024
. The system field contains instructions for the model, including available tools. The messages field contains the conversation, with required user input and optional assistant responses.
Using historical invocation logs
Alternatively, you can use your historical model invocation logs on Amazon Bedrock for distillation. This approach uses actual production data from your application, capturing real-world function calling scenarios. To use this method:
- Enable invocation logging in your Amazon Bedrock account settings, selecting S3 as your logging destination.
- Add metadata to your model invocations using the
requestMetadata
field to categorize interactions. For example: - When creating your distillation job, specify filters to select relevant logs based on metadata:
Using historical invocation logs means that you can distill knowledge from your production workloads, allowing the model to learn from real user interactions and function calls.
Model distillation enhancements
Although the basic process for creating a model distillation job remains similar to what we described in our previous blog post, Amazon Bedrock Model Distillation introduces several enhancements with general availability that improve the experience, capabilities, and transparency of the service.
Expanded model support
With general availability, we have expanded the model options available for distillation. In addition to the models supported during preview, customers can now use:
- Nova Premier as a teacher model for Nova Pro/Lite/Micro models distillation
- Anthropic Claude Sonnet 3.5 v2 as a teacher model for Claude Haiku distillation
- Meta’s Llama 3.3 70B as teacher and 3.2 1B and 3B as student models for Meta model distillation
This broader selection allows customers to find the balance between performance and efficiency across different use cases. For the most current list of supported models, refer to the Amazon Bedrock documentation.
Advanced data synthesis technology
Amazon Bedrock applies proprietary data synthesis techniques during the distillation process for certain use cases. This science innovation automatically generates additional training examples that improve the student model’s ability to generate better response.
For agent function calling with Llama models specifically, the data augmentation methods help bridge the performance gap between teacher and student models compared to vanilla distillation (vanilla distillation means directly annotating input data with teacher response and run student training with supervised fine-tuning). This makes the student models’ performance much more comparable to the teacher after distillation while maintaining the cost and latency benefits of a smaller model.
Enhanced training visibility
Amazon Bedrock model distillation now provides better visibility into the training process through multiple enhancements:
- Synthetic data transparency – Model distillation now provides samples of the synthetically generated training data used to enhance model performance. For most model families, up to 50 sample prompts are exported (up to 25 for Anthropic models), giving you insight into how your model was trained, which can help support internal compliance requirements.
- Prompt insights reporting – A summarized report of prompts accepted for distillation is provided, along with detailed visibility into prompts that were rejected and the specific reason for rejection. This feedback mechanism helps you identify and fix problematic prompts to improve your distillation success rate.
These insights are stored in the output S3 bucket specified during job creation, giving you a clearer picture of the knowledge transfer process.
Improved job status reporting
Amazon Bedrock Model Distillation also offers enhanced training job status reporting to provide more detailed information about where your model distillation job stands in the process. Rather than brief status indicators such as “In Progress” or “Complete,” the system now provides more granular status updates, helping you better track the progress of the distillation job.
You can track these job status details in both the AWS Management Console and AWS SDK.
Performance improvements and benefits
Now that we’ve explored the feature enhancements in Amazon Bedrock Model Distillation, we examine the benefits these capabilities deliver, particularly for agent function calling use cases.
Evaluation metric
We use abstract syntax tree (AST) to evaluate the function calling performance. AST parses the generated function call and performs fine-grained evaluation on the correctness of the generated function name, parameter values, and data types with the following workflow:
- Function matching – Checks if the predicted function name is consistent with one of the possible answers
- Required parameter matching – Extracts the arguments from the AST and checks if each parameter can be found and exact matched in possible answers
- Parameter type and value matching – Checks if the predicted parameter values and types are correct
The process is illustrated in following diagram from Gorilla: Large Language Model Connected with Massive APIs.
Experiment results
To evaluate model distillation in the function call use case, we used the BFCL v2 dataset and filtered it to specific domains (entertainment, in this case) to match a typical use case of model customization. We also split the data into training and test sets and performed distillation on the training data while we ran evaluations on the test set. Both the training set and the test set contained around 200 examples. We assessed the performance of several models, including the teacher model (Llama 405B), the base student model (Llama 3B), a vanilla distillation version where Llama 405B is distilled into Llama 3B without data augmentation, and an advanced distillation version enhanced with proprietary data augmentation techniques.
The evaluation focused on simple and multiple categories defined in the BFCL V2 dataset. As shown in the following chart, there is a performance variance between the teacher and the base student model across both categories. Vanilla distillation significantly improved the base student model’s performance. In the simple category, performance increased from 0.478 to 0.783, representing a 63.8% relative improvement. In the multiple category, the score rose from 0.586 to 0.742, which is a 26.6% relative improvement. On average, vanilla distillation led to a 45.2% improvement across the two categories.
Applying data augmentation techniques provided further gains beyond vanilla distillation. In the simple category, performance improved from 0.783 to 0.826, and in the multiple category, from 0.742 to 0.828. On average, this resulted in a 5.8% relative improvement across both categories, calculated as the mean of the relative gains in each. These results highlight the effectiveness of both distillation and augmentation strategies in enhancing student model performance for function call tasks.
We show the latency and output speed comparison for different models in the following figure. The data is gathered from Artificial Analysis, a website that provides independent analysis of AI models and providers, on April 4, 2025. We find that there is a clear trend on latency and generation speed as different size Llama models evaluated. Notably, the Llama 3.1 8B model offers the highest output speed, making it the most efficient in terms of responsiveness and throughput. Similarly, Llama 3.2 3B performs well with a slightly higher latency but still maintains a solid output speed. On the other hand, Llama 3.1 70B and Llama 3.1 405B exhibit much higher latencies with significantly lower output speeds, indicating a substantial performance cost at higher model sizes. Compared to Llama 3.1 405B, Llama 3.2 3B provides 72% latency reduction and 140% output speed improvement. These results suggest that smaller models might be more suitable for applications where speed and responsiveness are critical.
In addition, we report the comparison of cost per 1M tokens for different Llama models. As shown in the following figure, it’s evident that smaller models (Llama 3.2 3B and Llama 3.1 8B) are significantly more cost-effective. As the model size increases (Llama 3.1 70B and Llama 3.1 405B), the pricing scales steeply. This dramatic increase underscores the trade-off between model complexity and operational cost.
Real-world agent applications require LLM models that can achieve a good balance between accuracy, speed, and cost. This result shows that using a distilled model for agent applications helps developers receive the speed and cost of smaller models while getting similar accuracy as a larger teacher model.
Conclusion
Amazon Bedrock Model Distillation is now generally available, offering organizations a practical pathway for deploying capable agent experiences without compromising on performance or cost-efficiency. As our performance evaluation demonstrates, distilled models for function calling can achieve accuracy comparable to models many times their size while delivering significantly faster inference and lower operational costs. This capability enables scalable deployment of AI agents that can accurately interact with external tools and systems across enterprise applications.
Start using Amazon Bedrock Model Distillation today through the AWS Management Console or API to transform your generative AI applications, including agentic use cases, with the balance of accuracy, speed, and cost efficiency. For implementation examples, check out our code samples in the amazon-bedrock-samples GitHub repository.
Appendix
BFCL V2 simple category
Definition: The simple category consists of tasks where the user is provided with a single function documentation (that is, one JSON function definition), and the model is expected to generate exactly one function call that matches the user’s request. This is the most basic and commonly encountered scenario, focusing on whether the model can correctly interpret a straightforward user query and map it to the only available function, filling in the required parameters as needed.
BFCL V2 multiple category
Definition: The multiple category presents the model with a user query and several (typically two to four) function documentations. The model must select the most appropriate function to call based on the user’s intent and context and then generate a single function call accordingly. This category evaluates the model’s ability to understand the user’s intent, distinguish between similar functions, and choose the best match from multiple options.
About the authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Yijun Tian is an Applied Scientist II at AWS Agentic AI, where he focuses on advancing fundamental research and applications in Large Language Models, Agents, and Generative AI. Prior to joining AWS, he obtained his Ph.D. in Computer Science from the University of Notre Dame.
Yawei Wang is an Applied Scientist at AWS Agentic AI, working at the forefront of generative AI technologies to build next-generation AI products within AWS. He also collaborates with AWS business partners to identify and develop machine learning solutions that address real-world industry challenges.
David Yan is a Senior Research Engineer at AWS Agentic AI, leading efforts in Agent Customization and Optimization. Prior to that, he was in AWS Bedrock, leading model distillation effort to help customers optimize LLM latency, cost and accuracy. His research interest includes AI agent, planning and prediction and inference optimization. Before joining AWS, David worked on planning and behavior prediction for autonomous driving in Waymo. Before that, he worked on nature language understanding for knowledge graph at Google. David received a M.S. in Electrical Engineering from Stanford University and a B.S. in Physics from Peking University.
Panpan Xu is a Principal Applied Scientist at AWS Agentic AI, leading a team working on Agent Customization and Optimization. Prior to that, she lead a team in AWS Bedrock working on research and development of inference optimization techniques for foundation models, covering modeling level techniques such as model distillation and sparsification to hardware-aware optimization. Her past research interest covers a broad range of topics including model interpretability, graph neural network, human-in-the-loop AI and interactive data visualization. Prior to joining AWS, she was a lead research scientist at Bosch Research and obtained her PhD in computer science from Hong Kong University of Science and Technology.
Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of generative AI to deliver innovative and customer-centric products. Shreeya holds a master’s degree from Duke University. Outside of work, she loves traveling, dancing, and singing.