New methods boost reasoning in small and large language models

The image shows a diagram illustrating the relationship between mathematical statements in natural language and formal language.

Artificial intelligence is advancing across a wide range of fields, with one of the most important developments being its growing capacity for reasoning. This capability could help AI become a reliable partner in critical domains like scientific research and healthcare.

To support this progress, we’ve identified three primary strategies to strengthen reasoning capabilities in both small and large language models: improve architectural design to boost performance in smaller models; incorporate mathematical reasoning techniques to increase reliability; and build stronger generalization capabilities to enable reasoning across a variety of fields.

Smarter reasoning in smaller models

While language models trained on broad world knowledge hold great potential, they lack the ability to learn continuously and refine their understanding. This limitation becomes especially pronounced in smaller models, where limited capacity makes strong reasoning even harder.

The problem stems from how current language models operate. They rely on fast, pattern recognition-based responses that break down in complex scenarios. In contrast, people use deliberate, step-by-step reasoning, test different approaches, and evaluate outcomes. To address this gap, we’re building methods to enable stronger reasoning in smaller systems.

rStar-Math is a method that uses Monte Carlo Tree Search (MCTS) to simulate deeper, more methodical reasoning in smaller models. It uses a three-step, self-improving cycle: 

  • Problem decomposition breaks down complex mathematical problems into manageable steps, creating a thorough and accurate course of reasoning.
  • Process preference model (PPM) trains small models to predict reward labels for each step, improving process-level supervision.
  • Iterative refinement applies a four-round, self-improvement cycle in which updated strategy models and PPMs guide MCTS to improve performance. 
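
To make the cycle concrete, here is a highly simplified sketch of process-level supervision guiding step selection. This is not the released rStar-Math code: generate_candidate_steps (the small policy model proposing next steps) and ppm_score (the process preference model) are assumed placeholder functions, and the search is shown as a greedy loop rather than full MCTS with Q-value backpropagation.

```python
# Minimal conceptual sketch of PPM-guided stepwise reasoning (not the official rStar-Math code).
# Assumed placeholder callables:
#   generate_candidate_steps(problem, partial_solution) -> list[str]   # policy SLM proposals
#   ppm_score(problem, partial_solution, step) -> float                # process preference model

def solve_stepwise(problem, generate_candidate_steps, ppm_score, max_steps=8):
    """Greedily decompose a math problem into steps, keeping the step the PPM prefers."""
    partial_solution = []
    for _ in range(max_steps):
        candidates = generate_candidate_steps(problem, partial_solution)
        if not candidates:
            break
        # Score every proposed step with the process preference model and keep the best one.
        best_step = max(candidates, key=lambda step: ppm_score(problem, partial_solution, step))
        partial_solution.append(best_step)
        if "final answer" in best_step.lower():
            break
    return partial_solution
```

In the full method, this selection happens inside Monte Carlo Tree Search, and the trajectories collected in each round are used to retrain both the policy model and the PPM for the next round.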

When tested on four small language models ranging from 1.5 billion to 7 billion parameters, rStar-Math achieved an average accuracy of 53% on the American Invitational Mathematics Examination (AIME)—performance that places it among the top 20% of high school competitors in the US.

Figure 1: A three-part diagram illustrating the rStar-Math framework. (a) Shows an MCTS-driven reasoning tree with Q-values and answer verification using PPM or Python; correct and incorrect steps are marked. (b) Depicts how Q-value filtering constructs per-step preference pairs from partial to full solutions. (c) Outlines four rounds of self-evolution, alternating between SLM and PPM improvements using terminal-guided and PPM-augmented MCTS.
Figure 1. The rStar-Math framework

Logic-RL is a reinforcement learning framework that strengthens logical reasoning through a practical system prompt and a structured reward function. By training models on logic puzzles, Logic-RL grants rewards only when both the reasoning process and the final answer meet strict formatting requirements. This prevents shortcuts and promotes analytical rigor.
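
The core idea can be sketched in a few lines. The reward function below is illustrative only: the <think>/<answer> tags and the exact reward values are assumptions rather than the paper's published configuration, but they show how rewarding format and answer together discourages shortcuts.

```python
import re

# Illustrative format-strict reward in the spirit of Logic-RL (tags and values are assumptions).
FORMAT_PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL)

def logic_rl_reward(completion: str, gold_answer: str) -> float:
    match = FORMAT_PATTERN.match(completion.strip())
    if match is None:
        return -1.0  # malformed output earns no credit, so the model cannot skip the reasoning
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    if not reasoning:
        return -1.0  # an answer without visible reasoning is penalized as well
    return 1.0 if answer == gold_answer.strip() else -0.5  # correct format alone is not enough
```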

Language models trained with Logic-RL demonstrate strong performance beyond logic puzzles, generalizing effectively to mathematical competition problems. On the AIME and AMC (American Mathematics Competitions) datasets, 7-billion-parameter models improved accuracy by 125% and 38%, respectively, compared with baseline models.

Building reliable mathematical reasoning 

Mathematics poses a unique challenge for language models, which often struggle to meet its precision and rigor using natural language. To address this, we’re creating formal and symbolic methods to enable language models to adopt structured mathematical tools. The goal is to convert language model outputs into code based on the fundamental rules of arithmetic, like 1 + 1 = 2, allowing us to systematically verify accuracy. 

LIPS (LLM-based Inequality Prover with Symbolic Reasoning) is a system that combines LLMs’ pattern recognition capabilities with symbolic reasoning. LIPS draws on the strategies that math competition participants use to distinguish between tasks best suited to symbolic solvers (e.g., scaling) and those better handled by language models (e.g., rewriting). On 161 Olympiad-level problems, LIPS achieved state-of-the-art results without additional training data.
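
Conceptually, each proof step is a generate-then-rank loop over candidate subgoals. The sketch below is not the published LIPS implementation; the tactic names and the symbolic_prover, llm_rewrite, and rank callables are assumed interfaces used only to illustrate the division of labor.

```python
# Conceptual sketch of LIPS-style tactic routing (assumed interfaces, not the published system).
SYMBOLIC_TACTICS = ["cauchy_schwarz", "am_gm", "scaling"]  # exact, verifiable transformations

def next_proof_goal(goal, symbolic_prover, llm_rewrite, rank):
    """Generate candidate subgoals with both engines, then filter and rank them."""
    candidates = []
    for tactic in SYMBOLIC_TACTICS:
        subgoal = symbolic_prover(goal, tactic)        # symbolic solver handles scaling-type steps
        if subgoal is not None:
            candidates.append((tactic, subgoal))
    candidates.append(("llm_rewrite", llm_rewrite(goal)))  # the LLM handles flexible rewriting
    return rank(goal, candidates)[0]                   # best-ranked subgoal becomes the next goal
```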

Figure 2: A three-part diagram showing the LIPS framework for inequality proof generation. On the left, a current inequality problem is transformed into new inequality subproblems via tactic generation using symbolic-based and LLM-generated rewriting methods. In the center, these new goals are filtered and ranked using LLM and symbolic methods. On the right, a ranked sequence of inequalities forms a complete proof, applying named tactics like Cauchy-Schwarz, AM-GM, and LLM simplification, ending with the original inequality verified.
Figure 2. An overview of LIPS

However, translating natural-language math problems into precise, machine-readable formats is a challenge. Our goal is to bridge the gap between the one-pass success rate, where the top-ranked generated result is correct, and the k-pass success rate, where at least one of the top k generated results is correct.

We developed a new framework using two evaluation methods. Symbolic equivalence checks whether outputs are logically identical, while semantic consistency uses embedding similarity to detect subtle differences missed by symbolic checks.
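
The two checks can be approximated with off-the-shelf tools. The sketch below uses SymPy for symbolic equivalence and cosine similarity over embeddings for semantic consistency; embed is an assumed embedding function and the 0.95 threshold is an arbitrary illustration, not a value from the framework.

```python
import numpy as np
import sympy as sp

def symbolically_equivalent(expr_a: str, expr_b: str) -> bool:
    """True if two formal outputs simplify to the same expression."""
    try:
        return sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b)) == 0
    except (sp.SympifyError, TypeError):
        return False  # unparseable output cannot be verified symbolically

def semantically_consistent(statement_a: str, statement_b: str, embed, threshold: float = 0.95) -> bool:
    """Embedding similarity catches subtle divergences that symbolic checks miss."""
    va, vb = np.asarray(embed(statement_a)), np.asarray(embed(statement_b))
    cosine = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return cosine >= threshold
```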

When we evaluated this approach on the MATH and miniF2F datasets, which include problems from various math competitions, it improved accuracy by up to 1.35 times over baseline methods.

Figure 3: A flowchart illustrating the autoformalization framework, in which a natural language math statement is converted into a formal language theorem.
Figure 3. An overview of the autoformalization framework

To address the shortage of high-quality training data, we developed a neuro-symbolic framework that automatically generates diverse, well-structured math problems. Symbolic solvers create the problems, while language models translate them into natural language. This approach not only broadens training resources but also supports more effective instruction and evaluation of mathematical reasoning in language models.
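
As a toy illustration of the loop shown in Figure 4, the sketch below mutates a symbolic rectangle-perimeter template, solves it exactly, and hands the verbalization to a language model. informalize stands in for that model and is an assumed interface; the production framework is considerably more general.

```python
import random
import sympy as sp

def generate_rectangle_problem(informalize):
    """Mutate a symbolic template, solve it exactly, then let an LLM phrase it in natural language."""
    w = sp.Symbol("w", positive=True)
    perimeter = random.choice([36, 48, 60])   # mutate parameters while preserving structure
    ratio = random.choice([2, 3])             # length = ratio * width
    equation = sp.Eq(2 * (w + ratio * w), perimeter)
    width = sp.solve(equation, w)[0]          # the symbolic solver guarantees a correct label
    question = informalize(
        f"A rectangle has perimeter {perimeter} and its length is {ratio} times its width. Find the width."
    )
    return {"question": question, "answer": str(width), "formal": equation}
```

Because the solver produces the answer before the language model ever sees the problem, the generated label stays trustworthy even if the verbalization varies.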

Figure 4: A flowchart illustrating the neuro-symbolic data generation framework. It begins with a natural language math problem about a sandbox's perimeter. This is formalized into symbolic assertions, then mutated while preserving structure. The formal problem is solved and informalized into a new natural language Q&A about a garden's dimensions. The process continues with further mutation to generate problems of varying difficulty—examples include an easy question about a rectangle’s width and a medium one involving expressions for area.
Figure 4. An overview of the neuro-symbolic data generation framework

Boosting generalization across domains 

A key indicator of advanced AI is its ability to generalize—to transfer reasoning skills across different domains. We found that training language models on math data significantly improved performance in coding, science, and other areas, revealing unexpected cross-domain benefits.

This discovery motivated us to develop Chain-of-Reasoning (CoR), an approach that unifies reasoning across natural language, code, and symbolic forms. CoR lets models blend these formats using natural language to frame context, code for precise calculations, and symbolic representations for abstraction. By adjusting prompts, CoR adapts both reasoning depth and paradigm diversity to match specific problem requirements. 
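
The paradigm mix can be steered entirely from the prompt. The scaffold below illustrates that idea only: the paradigm tags and depth control are assumptions, not the prompt format used by CoR itself.

```python
# Illustrative CoR-style prompt scaffold (tags and wording are assumptions, not the method's prompt).
def build_cor_prompt(problem: str, depth: str = "deep") -> str:
    return (
        "Solve the problem by reasoning across three paradigms.\n"
        f"Reasoning depth: {depth}\n"
        "<NL> Restate the problem and outline a plan in natural language. </NL>\n"
        "<SYMBOLIC> Express the key quantities and constraints formally. </SYMBOLIC>\n"
        "<CODE> Write Python that performs any exact computation the plan requires. </CODE>\n"
        "<NL> Combine the results into a final, justified answer. </NL>\n\n"
        f"Problem: {problem}"
    )
```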

Tests of CoR across five math datasets showed its ability to tackle both computational and proof-based problems, demonstrating strong general mathematical problem-solving skills.

Figure 5: Diagram illustrating three reasoning paradigms: (a) Single-paradigm reasoning, where all reasoning steps use the same medium (e.g., natural language, algorithms, or symbols); (b) Tool-integrated single-paradigm reasoning, where natural language drives reasoning, but code is used to solve specific sub-problems, with results reintegrated into the language-based reasoning; (c) CoR (multi-paradigm) reasoning framework, which enables reasoning across different paradigms with varying depths to handle diverse problem types, supported by examples.
Figure 5. CoR’s reasoning process under different types of methods

Current language models often rely on domain-specific solutions, limiting their flexibility across different types of problems. To move beyond this constraint, we developed Critical Plan Step Learning (CPL), an approach focused on high-level abstract planning that teaches models to identify key knowledge, break down problems, and make strategic decisions. 

The technique draws on how people solve problems, by breaking them down, identifying key information, and recalling relevant knowledge—strategies we want language models to learn. 

CPL combines two key components: plan-based MCTS, which searches multi-step solution paths and constructs planning trees, and step-APO, which learns preferences for strong intermediate steps while filtering out weak ones. This combination enhances reasoning and improves generalization across tasks, moving AI systems closer to the flexible thinking that characterizes human intelligence.
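
A simplified view of the second component: once MCTS has attached advantage estimates to candidate plan steps, step-level preference pairs can be assembled for training. The data layout and margin below are assumptions made for illustration; this is not the published implementation of step-APO.

```python
def build_step_preferences(plan_tree, margin: float = 0.1):
    """plan_tree: iterable of (state, [(candidate_step, advantage), ...]) tuples from plan-based MCTS."""
    pairs = []
    for state, candidates in plan_tree:
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        if len(ranked) < 2:
            continue  # need at least one alternative to form a preference pair
        best_step, best_adv = ranked[0]
        for other_step, other_adv in ranked[1:]:
            if best_adv - other_adv >= margin:   # keep only clearly advantageous steps
                pairs.append({
                    "state": state,
                    "chosen": best_step,
                    "rejected": other_step,
                    "advantage_gap": best_adv - other_adv,
                })
    return pairs
```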

Figure 6: Illustration of CPL. Left: Plans represent abstract thinking for problem-solving, which allows for better generalization, whereas task-specific solutions often limit it. Right: CPL searches within the action space on high-level abstract plans using MCTS and obtains advantage estimates for step-level preferences. CPL can then identify and learn critical steps that provide a distinct advantage over others.
Figure 6. Overview of the CPL framework

Looking ahead: Next steps in AI reasoning

From building reliable math solvers to unifying reasoning approaches, researchers are redefining how language models approach complex tasks. Their work sets the stage for more capable and versatile AI systems—applicable to education, science, healthcare, and beyond. Despite these advances, hallucinations and imprecise logic continue to pose risks in critical fields like medicine and scientific research, where accuracy is essential.

These challenges are driving the team’s exploration of additional tools and frameworks to improve language model reasoning. This includes AutoVerus for automated proof generation in Rust code, SAFE for addressing data scarcity in Rust formal verification, and Alchemy, which uses symbolic mutation to improve neural theorem proving.

Together, these technologies represent important progress toward building trustworthy, high-performing reasoning models and signal a broader shift toward addressing some of AI’s current limitations.

Read More

Hexagon Taps NVIDIA Robotics and AI Software to Build and Deploy AEON, a New Humanoid

As a global labor shortage leaves 50 million positions unfilled across industries like manufacturing and logistics, Hexagon — a global leader in measurement technologies — is developing humanoid robots that can lend a helping hand.

Industrial sectors depend on skilled workers to perform a variety of error-prone tasks, including operating high-precision scanners for reality capture — the process of capturing digital data to replicate the real world in simulation.

At the Hexagon LIVE Global conference, Hexagon’s robotics division today unveiled AEON — a new humanoid robot built in collaboration with NVIDIA that’s engineered to perform a wide range of industrial applications, from manipulation and asset inspection to reality capture and operator support. Hexagon plans to deploy AEON across automotive, transportation, aerospace, manufacturing, warehousing and logistics.

Future use cases for AEON include:

  • Reality capture, which involves automatic planning and then scanning of assets, industrial spaces and environments to generate 3D models. The captured data is then used for advanced visualization and collaboration in the Hexagon Digital Reality (HxDR) platform powering Hexagon Reality Cloud Studio (RCS).
  • Manipulation tasks, such as sorting and moving parts in various industrial and manufacturing settings.
  • Part inspection, which includes checking parts for defects or ensuring adherence to specifications.
  • Industrial operations, including highly dexterous technical tasks like machinery operations, teleoperation and scanning parts using high-end scanners.

“The age of general-purpose robotics has arrived, due to technological advances in simulation and physical AI,” said Deepu Talla, vice president of robotics and edge AI at NVIDIA. “Hexagon’s new AEON humanoid embodies the integration of NVIDIA’s three-computer robotics platform and is making a significant leap forward in addressing industry-critical challenges.”

Using NVIDIA’s Three Computers to Develop AEON 

To build AEON, Hexagon used NVIDIA’s three computers for developing and deploying physical AI systems. They include AI supercomputers to train and fine-tune powerful foundation models; the NVIDIA Omniverse platform, running on NVIDIA OVX servers, for testing and optimizing these models in simulation environments using real and physically based synthetic data; and NVIDIA IGX Thor robotic computers to run the models.

Hexagon is exploring using NVIDIA accelerated computing to post-train the NVIDIA Isaac GR00T N1.5 open foundation model to improve robot reasoning and policies, and tapping Isaac GR00T-Mimic to generate vast amounts of synthetic motion data from a few human demonstrations.

AEON learns many of its skills through simulations powered by the NVIDIA Isaac platform. Hexagon uses NVIDIA Isaac Sim, a reference robotic simulation application built on Omniverse, to simulate complex robot actions like navigation, locomotion and manipulation. These skills are then refined using reinforcement learning in NVIDIA Isaac Lab, an open-source framework for robot learning.

This simulation-first approach enabled Hexagon to fast-track its robotic development, allowing AEON to master core locomotion skills in just 2-3 weeks — rather than 5-6 months — before real-world deployment.

In addition, AEON taps into NVIDIA Jetson Orin onboard computers to autonomously move, navigate and perform its tasks in real time, enhancing its speed and accuracy while operating in complex and dynamic environments. Hexagon is also planning to upgrade AEON with NVIDIA IGX Thor to enable functional safety for collaborative operation.

“Our goal with AEON was to design an intelligent, autonomous humanoid that addresses the real-world challenges industrial leaders have shared with us over the past months,” said Arnaud Robert, president of Hexagon’s robotics division. “By leveraging NVIDIA’s full-stack robotics and simulation platforms, we were able to deliver a best-in-class humanoid that combines advanced mechatronics, multimodal sensor fusion and real-time AI.”

Data Comes to Life Through Reality Capture and Omniverse Integration 

AEON will be piloted in factories and warehouses to scan everything from small precision parts and automotive components to large assembly lines and storage areas.

Captured data comes to life in RCS, a platform that allows users to collaborate, visualize and share reality-capture data by tapping into HxDR and NVIDIA Omniverse running in the cloud. This removes the constraint of local infrastructure.

“Digital twins offer clear advantages, but adoption has been challenging in several industries,” said Lucas Heinzle, vice president of research and development at Hexagon’s robotics division. “AEON’s sophisticated sensor suite enables the integration of reality data capture with NVIDIA Omniverse, streamlining workflows for our customers and moving us closer to making digital twins a mainstream tool for collaboration and innovation.”

AEON’s Next Steps

By adopting the OpenUSD framework and developing on Omniverse, Hexagon can generate high-fidelity digital twins from scanned data — establishing a data flywheel to continuously train AEON.

This latest work with Hexagon is helping shape the future of physical AI — delivering scalable, efficient solutions to address the challenges faced by industries that depend on capturing real-world data.

Watch the Hexagon LIVE keynote, explore presentations and read more about AEON.

All imagery courtesy of Hexagon.

Read More

How Anomalo solves unstructured data quality issues to deliver trusted assets for AI with AWS

This post is co-written with Vicky Andonova and Jonathan Karon from Anomalo.

Generative AI has rapidly evolved from a novelty to a powerful driver of innovation. From summarizing complex legal documents to powering advanced chat-based assistants, AI capabilities are expanding at an increasing pace. While large language models (LLMs) continue to push new boundaries, quality data remains the deciding factor in achieving real-world impact.

A year ago, it seemed that the primary differentiator in generative AI applications would be who could afford to build or use the biggest model. But with recent breakthroughs in base model training costs (such as DeepSeek-R1) and continual price-performance improvements, powerful models are becoming a commodity. Success in generative AI is becoming less about building the right model and more about finding the right use case. As a result, the competitive edge is shifting toward data access and data quality.

In this environment, enterprises are poised to excel. They have a hidden goldmine of decades of unstructured text—everything from call transcripts and scanned reports to support tickets and social media logs. The challenge is how to use that data. Transforming unstructured files, maintaining compliance, and mitigating data quality issues all become critical hurdles when an organization moves from AI pilots to production deployments.

In this post, we explore how you can use Anomalo with Amazon Web Services (AWS) AI and machine learning (AI/ML) to profile, validate, and cleanse unstructured data collections to transform your data lake into a trusted source for production-ready AI initiatives, as shown in the following figure.

Overall Architecture

The challenge: Analyzing unstructured enterprise documents at scale

Despite the widespread adoption of AI, many enterprise AI projects fail due to poor data quality and inadequate controls. Gartner predicts that 30% of generative AI projects will be abandoned in 2025. Even the most data-driven organizations have focused primarily on using structured data, leaving unstructured content underutilized and unmonitored in data lakes or file systems. Yet, over 80% of enterprise data is unstructured (according to MIT Sloan School research), spanning everything from legal contracts and financial filings to social media posts.

For chief information officers (CIOs), chief technical officers (CTOs), and chief information security officers (CISOs), unstructured data represents both risk and opportunity. Before you can use unstructured content in generative AI applications, you must address the following critical hurdles:

  • Extraction – Optical character recognition (OCR), parsing, and metadata generation can be unreliable if not automated and validated. In addition, if extraction is inconsistent or incomplete, it can result in malformed data.
  • Compliance and security – Handling personally identifiable information (PII) or proprietary intellectual property (IP) demands rigorous governance, especially with the EU AI Act, Colorado AI Act, General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), and similar regulations. Sensitive information can be difficult to identify in unstructured text, leading to inadvertent mishandling of that information.
  • Data quality – Incomplete, deprecated, duplicative, off-topic, or poorly written data can pollute your generative AI models and Retrieval Augmented Generation (RAG) context, yielding hallucinated, out-of-date, inappropriate, or misleading outputs. Making sure that your data is high-quality helps mitigate these risks.
  • Scalability and cost – Training or fine-tuning models on noisy data increases compute costs by unnecessarily growing the training dataset (training compute costs tend to grow linearly with dataset size), and processing and storing low-quality data in a vector database for RAG wastes processing and storage capacity.

In short, generative AI initiatives often falter—not because the underlying model is insufficient, but because the existing data pipeline isn’t designed to process unstructured data and still meet high-volume, high-quality ingestion and compliance requirements. Many companies are in the early stages of addressing these hurdles and are facing these problems in their existing processes:

  • Manual and time-consuming – The analysis of vast collections of unstructured documents relies on manual review by employees, creating time-consuming processes that delay projects.
  • Error-prone – Human review is susceptible to mistakes and inconsistencies, leading to inadvertent exclusion of critical data and inclusion of incorrect data.
  • Resource-intensive – The manual document review process requires significant staff time that could be better spent on higher-value business activities. Budgets can’t support the level of staffing needed to vet enterprise document collections.

Although existing document analysis processes provide valuable insights, they aren’t efficient or accurate enough to meet modern business needs for timely decision-making. Organizations need a solution that can process large volumes of unstructured data and help maintain compliance with regulations while protecting sensitive information.

The solution: An enterprise-grade approach to unstructured data quality

Anomalo uses a highly secure, scalable stack provided by AWS that you can use to detect, isolate, and address data quality problems in unstructured data in minutes instead of weeks. This helps your data teams deliver high-value AI applications faster and with less risk. The architecture of Anomalo’s solution is shown in the following figure.

Solution Diagram

  1. Automated ingestion and metadata extraction – Anomalo automates OCR and text parsing for PDF files, PowerPoint presentations, and Word documents stored in Amazon Simple Storage Service (Amazon S3) using auto scaling Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Registry (Amazon ECR).
  2. Continuous data observability – Anomalo inspects each batch of extracted data, detecting anomalies such as truncated text, empty fields, and duplicates before the data reaches your models. In the process, it monitors the health of your unstructured pipeline, flagging surges in faulty documents or unusual data drift (for example, new file formats, an unexpected number of additions or deletions, or changes in document size). With this information reviewed and reported by Anomalo, your engineers can spend less time manually combing through logs and more time optimizing AI features, while CISOs gain visibility into data-related risks.
  3. Governance and compliance – Built-in issue detection and policy enforcement help mask or remove PII and abusive language. If a batch of scanned documents includes personal addresses or proprietary designs, it can be flagged for legal or security review—minimizing regulatory and reputational risk. You can use Anomalo to define custom issues and metadata to be extracted from documents to solve a broad range of governance and business needs.
  4. Scalable AI on AWS – Anomalo uses Amazon Bedrock to give enterprises a choice of flexible, scalable LLMs for analyzing document quality. Anomalo’s modern architecture can be deployed as software as a service (SaaS) or through an Amazon Virtual Private Cloud (Amazon VPC) connection to meet your security and operational needs.
  5. Trustworthy data for AI business applications – The validated data layer provided by Anomalo and AWS Glue helps make sure that only clean, approved content flows into your application.
  6. Supports your generative AI architecture – Whether you use fine-tuning or continued pre-training on an LLM to create a subject matter expert, store content in a vector database for RAG, or experiment with other generative AI architectures, by making sure that your data is clean and validated, you improve application output, preserve brand trust, and mitigate business risks.
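
To make the kinds of checks in steps 2 and 3 concrete, here is a minimal, self-contained sketch of document profiling: empty fields, likely truncation, duplicates, and simple PII patterns. This is not Anomalo's product API; the thresholds and regular expressions are illustrative assumptions.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def profile_document(doc_id: str, text: str, seen_hashes: set, min_chars: int = 200) -> dict:
    """Flag common unstructured data quality issues for one extracted document."""
    issues = []
    if not text or not text.strip():
        issues.append("empty_field")
    elif len(text.strip()) < min_chars:
        issues.append("possibly_truncated")     # illustrative threshold
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        issues.append("duplicate")
    seen_hashes.add(digest)
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        issues.append("possible_pii")           # route to legal or security review
    return {"doc_id": doc_id, "issues": issues, "passed": not issues}
```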

Impact

Using Anomalo and AWS AI/ML services for unstructured data provides these benefits:

  • Reduced operational burden – Anomalo’s off-the-shelf rules and evaluation engine save months of development time and ongoing maintenance, freeing time for designing new features instead of developing data quality rules.
  • Optimized costs – Training LLMs and ML models on low-quality data wastes precious GPU capacity, while vectorizing and storing that data for RAG increases overall operational costs, and both degrade application performance. Early data filtering cuts these hidden expenses.
  • Faster time to insights – Anomalo automatically classifies and labels unstructured text, giving data scientists rich data to spin up new generative prototypes or dashboards without time-consuming labeling prework.
  • Strengthened compliance and security – Identifying PII and adhering to data retention rules is built into the pipeline, supporting security policies and reducing the preparation needed for external audits.
  • Durable value – The generative AI landscape continues to rapidly evolve. Although LLM and application architecture investments may depreciate quickly, trustworthy and curated data is a sure bet that won’t be wasted.

Conclusion

Generative AI has the potential to deliver massive value: Gartner estimates a 15–20% revenue increase, 15% cost savings, and a 22% productivity improvement. To achieve these results, your applications must be built on a foundation of trusted, complete, and timely data. By delivering a user-friendly, enterprise-scale solution for structured and unstructured data quality monitoring, Anomalo helps you deliver more AI projects to production faster while meeting both your user and governance requirements.

Interested in learning more? Check out Anomalo’s unstructured data quality solution and request a demo or contact us for an in-depth discussion on how to begin or scale your generative AI journey.


About the authors

Vicky Andonova is the GM of Generative AI at Anomalo, the company reinventing enterprise data quality. As a founding team member, Vicky has spent the past six years pioneering Anomalo’s machine learning initiatives, transforming advanced AI models into actionable insights that empower enterprises to trust their data. Currently, she leads a team that not only brings innovative generative AI products to market but is also building a first-in-class data quality monitoring solution specifically designed for unstructured data. Previously, at Instacart, Vicky built the company’s experimentation platform and led company-wide initiatives to grocery delivery quality. She holds a BE from Columbia University.

Jonathan Karon leads Partner Innovation at Anomalo. He works closely with companies across the data ecosystem to integrate data quality monitoring in key tools and workflows, helping enterprises achieve high-functioning data practices and leverage novel technologies faster. Prior to Anomalo, Jonathan created Mobile App Observability, Data Intelligence, and DevSecOps products at New Relic, and was Head of Product at a generative AI sales and customer success startup. He holds a BA in Cognitive Science from Hampshire College and has worked with AI and data exploration technology throughout his career.

Mahesh Biradar is a Senior Solutions Architect at AWS with a history in the IT and services industry. He helps SMBs in the US meet their business goals with cloud technology. He holds a Bachelor of Engineering from VJTI and is based in New York City, US.

Emad Tawfik is a seasoned Senior Solutions Architect at Amazon Web Services, boasting more than a decade of experience. His specialization lies in the realm of Storage and Cloud solutions, where he excels in crafting cost-effective and scalable architectures for customers.

Read More

An innovative financial services leader finds the right AI solution: Robinhood and Amazon Nova

This post is cowritten with Renyu Chen and Dev Tagare from Robinhood.

Robinhood has been a pioneer and disruptor in the once staid world of online brokerages. Founded in 2013, the company transformed an industry better known for gatekeeping into an open platform accessible to all. Robinhood pioneered commission-free trading, and harnessed the power of technology and intuitive design to create a seamless and engaging experience for modern investors. To this day, the company continues to disrupt the financial services industry by launching groundbreaking product innovations on AWS.

Such innovations have made Robinhood one of the fastest growing brokerages in history, with more than 25 million customers worldwide and a global reputation as an innovator and technology leader. Fueled by its mission of “democratizing finance for all,” the company’s focus on accessibility, particularly for first-time investors, has kept Robinhood as one of the top finance apps on the Apple App Store for more than a decade and earned Robinhood accolades such as an award from Fast Company magazine as one of World’s 50 Most Innovative Companies. This annual ranking highlights companies that are reshaping industries and culture through innovation.

Robinhood’s Chief Executive Officer, Vlad Tenev, explains why this focus is important to Robinhood:

“Our belief is, the more we lower the barriers to entry, the more we level the playing field and allow people to invest their money at a younger age, the better off our economy will be and the better off society will be.”

Built to operate in the cloud, Robinhood uses AWS to power its online business, deliver and update its mobile trading app, securely store information and data, and perform business analytics. Robinhood recently used AI to improve customer experience and expand accessibility. For example, in 2025, the company will launch Robinhood Cortex, an AI investment tool that is designed to provide real-time insights to help users better navigate markets, identify potential opportunities, and stay up to date on the latest market-moving news. Cortex is an exciting step forward, providing a level of premium investment and market digests that has historically been reserved for institutional investors and wealthy individuals.

As Robinhood customers are able to do more on the platform, the company is working with AWS to explore new generative AI solutions such as Amazon Nova, a family of foundation models (FMs) that make generative AI development faster and more efficient, with exceptional price performance. These new solutions will help the company accommodate rapid expansion of customer requirements.

In this post, we share how Robinhood delivers democratized finance and real-time market insights using generative AI and Amazon Nova.

An AI/ML journey built on customer obsession

Robinhood, like all financial services firms, operates in a highly regulated environment. Historically, the industry was seen as slow-moving and wary of new technologies. Robinhood’s founders put technology at the forefront by initially building a no-frills, no-fee app that, by design, would make investing accessible to everyone, not just the very wealthy. As Robinhood grew, it attracted a wider variety of customers who need the speed, reliability, security, and low cost the platform offers, but who also want a richer set of services for different and novel use cases.

Robinhood listens closely to these active traders. As Renyu Chen, staff machine learning (ML) engineer at Robinhood, explains,

“We wanted to create a seamless journey for AI/ML applications to go from experimentation to Robinhood scale. We looked to the AWS team to help meet the AI/ML needs of our developers while providing advanced ML tooling to serve our most sophisticated ‘active trader’ customers. This would also require a plug-and-play approach that could adopt the latest generative AI technologies from open source, model providers, and home-grown platform tooling.”

Robinhood explored various generative AI solutions during 2023, concluding that the best way to get to Robinhood scale was with Amazon Bedrock, a fully managed service that helps users build generative AI models. Amazon Bedrock offers an extensive selection of FMs from various providers, and allows a high level of customization and security through a single API.

According to Robinhood’s Renyu Chen,

“For us, the security of our customers’ data comes first. Nothing is more important. With Amazon Bedrock, data stays under our control. When we query a model, the input and output never leave our virtual private cloud. When we fine-tune a foundation model, it is based on a private copy of that model. This means our customers’ data is not shared with model providers, and is not used to improve the base models.”

To meet the needs of Robinhood’s ever-growing base of power users, Robinhood is exploring Amazon Nova, estimating that the price per token using Amazon Nova can be up to 80% lower than other models they have tested, which would make it cost-effective to power new high-demand use cases such as a fraud investigation assistant, enhanced document processing, and AI-created content generation.

In addition, AWS generative AI solutions working through Amazon Nova can power new agentic workflows for Robinhood, in which autonomous AI agents can independently make decisions, adapt to changing situations, and execute actions.

“Robinhood offers its customers simplicity, speed, security, and cost savings. Working developer-to-developer with the Robinhood team and building together, we can design generative AI solutions that meet Robinhood’s priorities and customer-focused goals. For example, Amazon Nova models can be easily customized with Amazon Bedrock Model Distillation, which ‘distills’ knowledge from a larger, more capable ‘teacher’ model to a smaller, faster, and cost-efficient ‘student’ model. This solution can help Robinhood use models such as DeepSeek to explore exciting new use cases quickly, securely, and at a 75% lower cost than equivalent offerings from competitors.”

– Dushan Tharmal, Principal Product Manager, Amazon Artificial General Intelligence (AGI).

Amazon Nova: More services, greater value for Robinhood and its customers

Working with AWS on its ambitious AI journey, Robinhood is able to rapidly scale new services for customers without needing the costly structures, staff, and infrastructure found at traditional brokerages. With support from AWS, Robinhood is able to offer a richer customer experience while remaining true to its mission of simplicity, clarity, low cost, speed, security, and reliability.

“We see that Amazon Nova can be a great match for our mission. Amazon Nova offers the lowest latency responses at very low cost, and is accurate and lightning-fast across a wide range of interactive and high-volume Robinhood applications. And, consistent with Robinhood’s commitment to simplicity and low cost for its customers, using Amazon Nova models through Amazon Bedrock makes these large-scale tasks significantly easier, cheaper, and more cost-effective.”

– Dev Tagare, Robinhood’s head of AI.

Learn more about Amazon Nova and how it can deliver frontier intelligence and industry leading price-performance for your organization.


About the authors

Renyu Chen is a Staff AI Engineer at Robinhood Markets

Dev Tagare is the Head of AI at Robinhood Markets

Uchenna Egbe is a GenAI Solutions Architect at AWS FSI.

Trevor Spires is a GenAI Solutions Architect at AWS FinTech.

Read More

Build conversational interfaces for structured data using Amazon Bedrock Knowledge Bases

Organizations manage extensive structured data in databases and data warehouses. Large language models (LLMs) have transformed natural language processing (NLP), yet converting conversational queries into structured data analysis remains complex. Data analysts must translate business questions into SQL queries, creating workflow bottlenecks.

Amazon Bedrock Knowledge Bases enables direct natural language interactions with structured data sources. The system interprets database schemas and context, converting natural language questions into accurate queries while maintaining data reliability standards. You can chat with your structured data by setting up structured data ingestion from AWS Glue Data Catalog tables and Amazon Redshift clusters in a few steps, using the power of Amazon Bedrock Knowledge Bases structured data retrieval.

This post provides instructions to configure a structured data retrieval solution, with practical code examples and templates. It covers implementation samples and additional considerations, empowering you to quickly build and scale your conversational data interfaces. Through clear examples and proven methodologies, organizations can transform their data access capabilities and accelerate decision-making processes.

Solution overview

The solution demonstrates how to build a conversational application using Amazon Bedrock Knowledge Bases structured data retrieval. Developers often face challenges integrating structured data into generative AI applications. This includes difficulties training LLMs to convert natural language queries to SQL queries based on complex database schemas, as well as making sure appropriate data governance and security controls are in place. Amazon Bedrock Knowledge Bases alleviates these complexities by providing a managed natural language to SQL (NL2SQL) module. Amazon Bedrock Knowledge Bases offers an end-to-end managed workflow for you to build custom generative AI applications that can access and incorporate contextual information from a variety of structured and unstructured data sources. Using advanced NLP, Amazon Bedrock Knowledge Bases can transform natural language queries into SQL queries, so you can retrieve data directly from the source without the need to move or preprocess the data.

This solution includes Amazon Bedrock Knowledge Bases, Amazon Redshift, AWS Glue, and Amazon Simple Storage Service (Amazon S3). The solution architecture consists of two parts: a data ingestion pipeline, and a structured data retrieval application using Amazon Bedrock Knowledge Bases.

Amazon Bedrock Knowledge Bases structured data retrieval supports Amazon Redshift as the query engine and multiple data ingestion options. The data ingestion pipeline is a one-time setup. In this post, we discuss a common data ingestion use case using Amazon S3, AWS Glue, and Amazon Redshift.

You can configure Amazon Bedrock Knowledge Bases structured data retrieval to retrieve data from AWS Glue databases and S3 datasets. This setup uses automatic mounting of the Data Catalog in Amazon Redshift. With this ingestion option, you can seamlessly integrate existing S3 datasets and Data Catalog tables into your Retrieval Augmented Generation (RAG) applications with the access permissions configured through Lake Formation. The following diagram illustrates this pipeline.

Data flow from Dataset to Amazon S3, AWS Glue, and Amazon Redshift

The following screenshot shows the configuration options on the Amazon Bedrock console.

Storage metadata configuration interface with Redshift and Glue Data Catalog options

After the data ingestion is configured and the knowledge bases data source sync job is complete, users can ask natural language questions, and Amazon Bedrock Knowledge Bases will generate the SQL, execute the SQL against the query engine, and process it through the LLM to provide a user-friendly response. The following diagram illustrates a sample architecture of the structured data retrieval workflow.

Amazon Bedrock Knowledge Bases structured data retrieval conversational flow

The data retrieval workflow consists of the following steps:

  1. In a RAG application, the user can ask a natural language data analytics question through the chat interface, such as “What is the sales revenue for the Month of February 2025?”
  2. The natural language query is sent to Amazon Bedrock Knowledge Bases for data retrieval and processing.
  3. Amazon Bedrock Knowledge Bases generates a SQL query based on the underlying data schema configured during the knowledge base creation.
  4. The SQL query is executed against the query engine (Amazon Redshift) to retrieve data from a structured data store (AWS Glue tables). The query can include multiple joins and aggregation.
  5. The generated SQL response is sent to an LLM along with additional context to generate a response in natural language.
  6. The response is sent back to the user. The user can ask follow-up questions based on the retrieved response, such as “What is the product that generated highest revenue in this period?”

Amazon Bedrock Knowledge Bases structured data retrieval supports three different APIs to meet your data retrieval requirements:

  • Retrieval and response generation – The retrieval and response generation API, similar to the solution workflow we’ve discussed, generates a SQL query, retrieves data through the query engine, and processes it through the LLM to generate a natural language response
  • Retrieval only – The retrieval only API generates a SQL query, retrieves data through the query engine, and returns the data without processing it through an LLM
  • Generate SQL queries – The generate SQL query API returns the raw SQL query that was generated by Amazon Bedrock Knowledge Bases, which can be used for review and further processing by applications

The following screenshot shows the configuration options on the Amazon Bedrock console.

Amazon Bedrock Knowledge Bases retrieval API
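
For orientation, here is a minimal sketch of calling the retrieval and response generation API from Python with boto3. The knowledge base ID and model ARN are placeholders, and you should confirm the request shape against the current bedrock-agent-runtime documentation; the notebooks referenced in the next section are the authoritative implementation.

```python
import boto3

# Hedged sketch: retrieval and response generation against a structured knowledge base.
client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is the sales revenue for the month of February 2025?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",  # placeholder
        },
    },
)

print(response["output"]["text"])  # natural language answer generated from the SQL results
```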

Code resources and templates

The solution uses the following notebooks:

  • Data ingestion notebook – Structured-rag-s3-glue-ingestion includes the step-by-step guide to ingest an open dataset to Amazon S3, configure AWS Glue tables using crawlers, and set up the Amazon Redshift Serverless query engine.
  • Structured data retrieval notebook – Structured-rag-s3-glue-retrieval walks through the implementation steps and provides sample code for configuring Amazon Bedrock Knowledge Bases structured data retrieval using Amazon S3, AWS Glue, and the Amazon Redshift query engine.

For more details, refer to the GitHub repo.

Prerequisites

To implement the solution provided in this post, you must have an AWS account. Additionally, access to the required foundation models must be enabled in Amazon Bedrock.

Set up the data ingestion pipeline

To set up the data ingestion pipeline, we load the sample dataset in an S3 bucket and configure AWS Glue as data storage and a Redshift Serverless workgroup as the query engine. Complete the following steps in the data ingestion notebook:

  1. For data ingestion, download the following sample ecommerce dataset, convert it to a pandas data frame, and upload it to an S3 bucket using Amazon SageMaker Data Wrangler.
  2. Create an AWS Glue database and table using an AWS Glue crawler by crawling the source S3 bucket with the dataset. You can update this step to crawl your own S3 bucket or use your existing Data Catalog tables as storage metadata.
  3. Use the data ingestion notebook to create a Redshift Serverless namespace and workgroup in the default VPC. If you plan to use your own Redshift Serverless workgroup or Amazon Redshift provisioned cluster, you can skip this step.
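
If you prefer to script these steps outside the notebook, a hedged boto3 sketch of steps 2 and 3 looks roughly like the following. The crawler name, IAM role, bucket path, and workgroup names are placeholders, and the data ingestion notebook remains the authoritative reference.

```python
import boto3

glue = boto3.client("glue")
redshift_serverless = boto3.client("redshift-serverless")

# Step 2: catalog the S3 dataset with an AWS Glue crawler (names, role ARN, and bucket are placeholders).
glue.create_crawler(
    Name="ecommerce-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="ecommerce_db",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/ecommerce/"}]},
)
glue.start_crawler(Name="ecommerce-crawler")

# Step 3: create a Redshift Serverless namespace and workgroup to serve as the query engine.
redshift_serverless.create_namespace(namespaceName="structured-rag-ns")
redshift_serverless.create_workgroup(workgroupName="structured-rag-wg", namespaceName="structured-rag-ns")
```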

Set up the structured data retrieval solution

In this section, we detail the steps to set up the structured data retrieval component of the solution.

Amazon Bedrock Knowledge Bases supports multiple data access patterns, including AWS Identity and Access Management (IAM), AWS Secrets Manager, and database users. For this post, we demonstrate the setup option with IAM access. You can use IAM access with the Redshift Serverless workgroup configured as part of the ingestion workflow or an existing Redshift Serverless or provisioned cluster to complete these steps.

Complete the following steps in the structured data retrieval notebook:

  1. Create an execution role with the necessary policies for accessing data from Amazon Redshift, AWS Glue, and the S3 bucket.
  2. Invoke the CreateKnowledgeBase API to create the knowledge base with the execution role and knowledge base configurations. In the knowledge base configuration, the AWS Glue database and tables are used as storage metadata with Amazon Redshift as the query engine.
  3. After you create the knowledge base, you must complete additional steps to make sure the IAM execution role has the necessary permissions to execute the query in Amazon Redshift and retrieve data from AWS Glue. The notebook includes the necessary instructions to create and grant database access to the execution role, and grant AWS Lake Formation permissions.
  4. The ingestion job will sync the data store schema metadata about AWS Glue database and tables with the NL2SQL module. This schema metadata will be used while generating the SQL query during structured data retrieval.
  5. After the knowledge base sync job is complete, you can use the three data retrieval APIs – retrieve and generate response, retrieval only, and generate SQL query – to query and validate the structured data retrieval solution.

For more details, refer to Create a knowledge base by connecting to a structured data store.

Clean up

We have included cleanup instructions in both the data ingestion and structured data retrieval notebooks to clean up resources after the end-to-end solution is implemented and validated.

Conclusion

Amazon Bedrock Knowledge Bases simplifies data analysis by converting natural language questions into SQL queries, eliminating the need for specialized database expertise. The service integrates with Amazon Redshift, AWS Glue, and Amazon S3, allowing business analysts, data scientists, and operations teams to query data directly using conversation-like questions. It maintains data security through built-in governance controls and access permissions. Customers can deploy this managed service to enable users to analyze data using natural language questions, while maintaining data integrity and security standards.

To learn more, refer to Build a knowledge base by connecting to a structured data store and Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the authors

George Belsian is a Senior Cloud Application Architect at Amazon Web Services, helping organizations navigate the complexities of cloud adoption, AI integration, and data-driven innovation. By transforming legacy systems into cloud-based platforms and incorporating AI/ML capabilities, he helps businesses create new opportunities for growth, optimize their processes, and deliver scalable solutions.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Gopikrishnan Anilkumar is a Principal Technical Product Manager in AWS Agentic AI organization. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.

Read More

How Apollo Tyres is unlocking machine insights using agentic AI-powered Manufacturing Reasoner

This is a joint post co-authored with Harsh Vardhan, Global Head, Digital Innovation Hub, Apollo Tyres Ltd.

Apollo Tyres, headquartered in Gurgaon, India, is a prominent international tire manufacturer with production facilities in India and Europe. The company markets its products under its two global brands, Apollo and Vredestein, and its products are available in over 100 countries through a vast network of branded, exclusive, and multiproduct outlets. The product portfolio of the company includes the entire range of passenger car, SUV, MUV, light truck, truck-bus, two-wheeler, agriculture, industrial, specialty, bicycle, and off-the-road tires and retreading materials.

Apollo Tyres has started an ambitious digital transformation journey to streamline its entire business value process, including manufacturing. The company collaborated with Amazon Web Services (AWS) to implement a centralized data lake using AWS services. Additionally, Apollo Tyres enhanced its capabilities by unlocking insights from the data lake using generative AI powered by Amazon Bedrock across business values.

In this pursuit, they developed Manufacturing Reasoner, powered by Amazon Bedrock Agents, a custom solution that automates multistep tasks by seamlessly connecting with the company’s systems, APIs, and data sources. The solution has been developed, deployed, piloted, and scaled out to identify areas to improve, standardize, and benchmark the cycle time beyond the total effective equipment performance (TEEP) and overall equipment effectiveness (OEE) of highly automated curing presses. The curing machines are connected to the AWS Cloud through the industrial Internet of Things (IoT) and send real-time sensor, process, operational, event, and condition monitoring data to the cloud.

In this post, we share how Apollo Tyres used generative AI with Amazon Bedrock to harness the insights from their machine data in a natural language interaction mode to gain a comprehensive view of its manufacturing processes, enabling data-driven decision-making and optimizing operational efficiency.

The challenge: Reducing dry cycle time for highly automated curing presses and improving operational efficiency

Before the Manufacturing Reasoner solution, plant engineers were conducting manual analysis to identify bottlenecks and focus areas using an industrial IoT descriptive dashboard for the dry cycle time (DCT) of curing presses across all machines, SKUs, cure mediums, suppliers, machine types, subelements, sub-subelements, and more. The analysis and identification of these focus areas across curing presses, which span millions of parameters in real-time operations, used to consume anywhere from an average of 2 elapsed hours to approximately 7 hours per issue. Additionally, subelemental level analysis (that is, bottleneck analysis of subelemental and sub-subelemental activities) wasn’t possible using traditional root cause analysis (RCA) tools. The analysis required subject matter experts (SMEs) from various departments such as manufacturing, technology, industrial engineering, and others to come together and perform RCA. As the insights were not generated in real time, corrective actions were delayed.

Solution impact

With the agentic AI Manufacturing Reasoner, the goal was to empower their plant engineers to perform corrective actions on accelerated RCA insights to reduce curing DCT. This agentic AI solution and virtual experts (agents) help plant engineers interact with industrial IoT connected to big data in natural language (English) to retrieve relevant insights and provide insightful recommendations for resolving operational issues in DCT processes. The RCA agent offers detailed insights and self-diagnosis or recommendations, identifying which of the over 25 automated subelements or activities should be focused on across more than 250 automated curing presses, more than 140 stock-keeping units (SKUs), three types of curing mediums, and two types of machine suppliers. The goal is to achieve the best possible reduction in DCT across three plants. Through this innovation, plant engineers now have a thorough understanding of their manufacturing bottlenecks. This comprehensive view supports data-driven decision-making and enhances operational efficiency. They realized an approximate 88% reduction in effort in assisting RCA for DCT through self-diagnosis of bottleneck areas on streaming and real-time data. The generative AI assistant reduces the DCT RCA from up to 7 hours per issue to less than 10 minutes per issue. Overall, the targeted benefit is expected to save approximately 15 million Indian rupees (INR) per year just in the passenger car radial (PCR) division across their three manufacturing plants.

This virtual reasoner also offers real-time triggers to highlight continuous anomalous shifts in DCT for mistake-proofing or error prevention in line with the Poka-yoke approach, leading to appropriate preventative actions. The following are additional benefits offered by the Manufacturing Reasoner:

  • Observability of elemental-wise cycle time along with graphs and statistical process control (SPC) charts, press-to-press direct comparison on the real-time streaming data
  • On-demand RCA on streaming data, along with daily alerts to manufacturing SMEs

“Imagine a world where business associates make real-time, data-driven decisions, and AI collaborates with humans. Our transformative generative AI solution is designed, developed, and deployed to make this vision a reality. This in-house Manufacturing Reasoner, powered by generative AI, is not about replacing human intelligence; it is about amplifying it.”

– Harsh Vardhan, Global Head, Digital Innovation Hub, Apollo Tyres Ltd.

Solution overview

By using Amazon Bedrock features, Apollo Tyres implemented an advanced auto-diagnosis Manufacturing Reasoner designed to streamline RCA and enhance decision-making. This tool uses a generative AI–based machine root cause reasoner that facilitates accurate analysis through natural language queries, provides predictive insights, and references a reliable Amazon Redshift database for actionable data. The system enables proactive maintenance by predicting potential issues, optimizing cycle times, and reducing inefficiencies. Additionally, it supports staff with dynamic reporting and visualization capabilities, significantly improving overall productivity and operational efficiency.

The following diagram illustrates the multibranch workflow.

The following diagram illustrates the process flow.

To enable the workflow, Apollo Tyres followed these steps:

  1. Users ask their questions in natural language through the UI, which is a Chainlit application hosted on Amazon Elastic Compute Cloud (Amazon EC2).
  2. The question is picked up by the primary AI agent, which classifies its complexity and decides which agent to call for multistep reasoning with the help of different AWS services.
  3. Amazon Bedrock Agents uses Amazon Bedrock Knowledge Bases and the vector database capabilities of Amazon OpenSearch Service to extract relevant context for the request:
    1. Complex transformation engine agent – This agent works as an on-demand complex transformation engine for the given context and specific question.
    2. RCA agent – This agent for Amazon Bedrock constructs a multistep, multi–large language model (LLM) workflow to perform detailed automated RCA, which is particularly useful for complex diagnostic scenarios.
  4. The primary agent calls the explainer agent and visualization agent concurrently using multiple threads:
    1. Explainer agent – This agent for Amazon Bedrock uses Anthropic’s Claude Haiku model to generate explanations in two parts:
      1. Evidence – Provides a step-by-step logical explanation of the executed query or CTE.
      2. Conclusion – Offers a brief answer to the question, referencing Amazon Redshift records.
    2. Visualization agent – This agent for Amazon Bedrock generates Plotly chart code for creating visual charts using Anthropic’s Claude Sonnet model.
  5. The primary agent combines the outputs (records, explanation, chart code) from both agents and streams them to the application.
  6. The UI renders the result to the user by dynamically displaying the statistical plots and formatting the records in a table.
  7. Amazon Bedrock Guardrails helped set up tailored filters and response limits, making sure that interactions with machine data were not only secure but also relevant and compliant with established operational guidelines. The guardrails also helped prevent errors and inaccuracies by automatically verifying the validity of information, which was essential for accurately identifying the root causes of manufacturing problems.
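The orchestration logic itself is internal to Apollo Tyres' application, but the basic pattern of sending a natural language question to an Amazon Bedrock agent and streaming back its answer can be sketched with the Bedrock Agents runtime API. This is a minimal, illustrative sketch only; the agent ID, alias ID, and question are placeholders rather than values from the actual deployment.

import uuid

import boto3

# Placeholders: replace with the IDs of your own Amazon Bedrock agent
AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"

agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_manufacturing_reasoner(question: str) -> str:
    """Send a natural language question to the agent and collect the streamed answer."""
    response = agent_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),  # one session per conversation
        inputText=question,
    )
    # The agent's completion arrives as an event stream of text chunks
    answer = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            answer += chunk["bytes"].decode("utf-8")
    return answer

print(ask_manufacturing_reasoner(
    "Which curing press subelement is driving the DCT deviation today?"
))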

The following screenshot shows an example of the Manufacturing Reasoner response.

The following diagram shows an example of the Manufacturing Reasoner dynamic chart visualization.

“As we integrate this generative AI solution, built on Amazon Bedrock, to automate RCA into our plant curing machines, we’ve seen a profound transformation in how we diagnose issues and optimize operations,” says Vardhan. “The precision of generative AI–driven insights has enabled plant engineers to not only accelerate problem finding from an average of 2 hours per scenario to less than 10 minutes now but also refine focus areas to make improvements in cycle time (beyond TEEP). Real-time alerts notify process SMEs to act on bottlenecks immediately and advanced diagnosis features of the solution provide subelement-level information about what’s causing deviations.”

Lessons learned

Apollo Tyres learned the following takeaways from this journey:

  • Applying generative AI to streaming real-time industrial IoT data requires extensive research due to the unique nature of each use case. To develop an effective manufacturing reasoner for automated RCA scenarios, Apollo Tyres explored several strategies from the prototype to the proof-of-concept stages.
  • In the beginning, the solution faced significant delays in response times when using Amazon Bedrock, particularly when multiple agents were involved. The initial response times exceeded 1 minute for data retrieval and processing by all three agents. To address this issue, efforts were made to optimize performance. By carefully selecting appropriate LLMs and small language models (SLMs) and disabling unused workflows within the agent, the response time was successfully reduced to approximately 30–40 seconds. These optimizations played a crucial role in boosting the solution’s efficiency and responsiveness, leading to smoother operations and an enhanced user experience across the system.
  • While using LLMs to generate code for visualizing data through charts, Apollo Tyres faced challenges with extensive datasets. Initially, the generated code often contained inaccuracies or failed to handle large volumes of data correctly. To address this, they iterated multiple times on the code generation process, focusing on a dynamic approach that generates chart code capable of efficiently handling a data frame regardless of the number of records involved (an illustrative sketch of such record-count-aware chart code follows this list). Through this iterative refinement, they significantly improved the reliability and robustness of chart generation, making sure it could handle substantial datasets without compromising accuracy or performance.
  • Consistency issues were effectively resolved by making sure the correct data format is ingested into the Amazon data lake for the knowledge base, structured as follows:
{
  "Question": "<question in natural language>",
  "Query": "<Complex Transformation Engine scripts>",
  "Metadata": "<metadata>"
}
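As an illustration of the record-count-aware chart code described in the lessons above, a minimal sketch might downsample large result sets before plotting. The column names, threshold, and chart type here are hypothetical and not Apollo Tyres' actual schema or generated code.

import pandas as pd
import plotly.express as px

def build_cycle_time_chart(df: pd.DataFrame, max_points: int = 2000):
    # Downsample very large result sets so the chart stays responsive
    if len(df) > max_points:
        step = max(len(df) // max_points, 1)
        df = df.iloc[::step]
    # Hypothetical columns: timestamp, cycle_time_seconds, press_id
    return px.line(
        df,
        x="timestamp",
        y="cycle_time_seconds",
        color="press_id",
        title="Element-wise curing cycle time by press",
    )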

Next steps

The Apollo Tyres team is scaling the successful solution from tire curing to various areas across different locations, advancing towards the industry 5.0 goal. To achieve this, Amazon Bedrock will play a pivotal role in extending the multi-agentic Retrieval Augmented Generation (RAG) solution. This expansion involves using specialized agents, each dedicated to specific functionalities. By implementing agents with distinct roles, the team aims to enhance the solution’s capabilities across diverse operational domains.

Furthermore, the team is focused on benchmarking and optimizing the time required to deliver accurate responses to queries. This ongoing effort will streamline the process, enabling faster and more efficient decision-making and problem-solving across the extended solution. Apollo Tyres is also exploring generative AI using Amazon Bedrock for its other manufacturing and nonmanufacturing processes.

Conclusion

In summary, Apollo Tyres used generative AI through Amazon Bedrock and Amazon Bedrock Agents to transform raw machine data into actionable insights, achieving a holistic view of their manufacturing operations. This enabled more informed, data-driven decision-making and enhanced operational efficiency. By integrating generative AI–based manufacturing reasoners and RCA agents, they developed a machine cycle time diagnosis assistant capable of pinpointing focus areas across more than 25 subprocesses, more than 250 automated curing presses, more than 140 SKUs, three curing mediums, and two machine suppliers. This solution helped drive targeted improvements in DCT across three plants, with targeted annualized savings of approximately INR 15 million within the PCR segment alone and achieving an approximate 88% reduction in manual effort for root cause analysis.

“By embracing this agentic AI-driven approach, Apollo Tyres is redefining operational excellence—unlocking hidden capacity through advanced ‘asset sweating’ while enabling our plant engineers to communicate with machines in natural language. These bold, in-house AI initiatives are not just optimizing today’s performance but actively building the firm foundation for intelligent factories of the future driven by data and human-machine collaboration.”

– Harsh Vardhan.

To learn more about Amazon Bedrock and getting started, refer to Getting started with Amazon Bedrock. If you have feedback about this post, leave a comment in the comments section.


About the authors

Harsh Vardhan is a distinguished global leader in business-first, AI-first digital transformation with over two decades of industry experience. As the Global Head of the Digital Innovation Hub at Apollo Tyres Limited, he leads the industrialisation of AI-led digital manufacturing, Industry 4.0/5.0 excellence, and an enterprise-wide AI-first innovation culture. He is an A+ contributor in the field of advanced AI with an Arctic Code Vault badge, a Strategic Intelligence member at the World Economic Forum, and an executive member of the CII National Committee. He is an avid reader and loves to drive.

Gautam Kumar is a Solutions Architect at Amazon Web Services. He helps various enterprise customers design and architect innovative solutions on AWS. Outside work, he enjoys traveling and spending time with family.

Deepak Dixit is a Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprises architect scalable AI/ML workloads, implement Large Language Models (LLMs), and optimize cloud-native applications.

Read More

Extend your Amazon Q Business with PagerDuty Advance data accessor

Extend your Amazon Q Business with PagerDuty Advance data accessor

This blog post is co-written with Jacky Leybman from PagerDuty.

As organizations scale their digital operations, they face unprecedented challenges in managing and extracting value from their vast data ecosystems, particularly when it comes to data accessibility and quality. The complexity of modern IT operations demands solutions that can efficiently integrate, process, and deliver actionable insights.

In this post, we demonstrate how organizations can enhance their incident management capabilities by integrating PagerDuty Advance, an innovative set of agentic and generative AI capabilities that automate response workflows and provide real-time insights into operational health, with Amazon Q Business. We show how to configure PagerDuty Advance as a data accessor for Amazon Q indexes, so you can search and access enterprise knowledge across multiple systems during incident response. We also explore the key benefits of this integration, including improved search capabilities across connected platforms and enhanced data processing for faster incident resolution, supported by robust security features. The post includes a step-by-step implementation guide to help you set up this integration in your environment.

Understanding the components

PagerDuty, a leading digital operations management platform that helps organizations prevent and resolve business-impacting incidents, uses sophisticated ML and AI to automate response workflows and provide real-time insights into operational health. As the first incident management platform to integrate with Amazon Q Business, PagerDuty is an enterprise-grade incident management and operational intelligence solution that can be interacted with through your corporate communications tool, and can analyze data across multiple software as a service (SaaS) applications, breaking down data silos that typically hinder AI’s potential to drive operational resilience. PagerDuty Advance is a comprehensive suite of generative and agentic AI capabilities for the PagerDuty platform, purpose-built to elevate operational efficiency with less effort and faster, automated actions supported by intelligent context at every step of the way.

Amazon Q index for independent software vendors (ISVs) is a capability that lets ISVs seamlessly integrate their generative AI applications with customers' enterprise data and metadata through an Amazon Q index, so customers can search across their application data alongside other enterprise content. This integration capability makes sure that ISVs can offer their customers a unified search experience while customers maintain strict security, access controls, and ownership over their data.

When combined with the intelligent search and insight derivation capabilities of Amazon Q index, organizations gain a complete solution that transforms how they handle operational data. The integration enables a variety of use cases that enhance operational efficiency and incident management across the enterprise, as demonstrated by PagerDuty Advance.

The integration creates a relationship where the refined data indexing of Amazon Q can be combined with PagerDuty's real-time incident data, creating a unified view of operational intelligence. Through the Amazon Q Business data accessor capability, PagerDuty Advance can securely access and analyze data from over 100 different SaaS applications typically used by businesses, making previously siloed data actionable and valuable for incident prevention and resolution.

The following video shows this solution in action, as an agent uses PagerDuty Advance to identify an incident cause and request troubleshooting advice.

Q index on PagerDuty Advance Demo

Benefits for enterprises

Enterprises often struggle with incident resolution, spending precious time searching through multiple systems for answers. Imagine a scenario where your team receives a critical alert—with the integration of PagerDuty Advance and Amazon Q index, you can quickly access relevant runbooks for resolution steps from Confluence or identify potentially related GitHub commits that might have triggered the issue. This seamless integration transforms the incident management experience:

  • Improved search capabilities – Amazon Q index augments the generative AI Q&A experience by providing semantically relevant enterprise content across connected systems, resulting in contextually appropriate and actionable results. Teams can quickly locate information across Confluence, GitHub, and other integrated platforms, significantly reducing search time.
  • Enhanced data processing – The system continuously ingests and analyzes operational data, automatically correlating incidents and identifying patterns across connected systems. Through intelligent parsing of documentation and code repositories, it creates automatic links between incidents, relevant documentation, and GitHub changes while providing a unified view of operational data. This analysis converts siloed raw data into actionable insights, enabling automated suggestions for resolution and trend analysis for proactive improvements.
  • Cost optimization – Organizations can achieve significant cost savings through reduced mean time to resolution (MTTR) and optimized resource allocation. By having immediate access to runbooks, past resolution information, and related code changes, teams can resolve incidents faster and more efficiently. The integration streamlines workflows and automates routine tasks, resulting in decreased operational overhead. Teams can accomplish more with existing resources, leading to improved return on investment (ROI) on technology investments.
  • Security benefits – Security is paramount in the integrated solution, with Amazon Q index implementing robust identity-aware access controls. Enterprise index data remains securely stored within the enterprise environment, and the PagerDuty data accessor capability only retrieves relevant content through the Search Relevant Content API—a specialized API designed for enterprise applications to securely search and retrieve contextually relevant information across their data sources—providing secure and reliable data access. Through this identity awareness API, the system authenticates and validates each user’s permissions before returning search results or document access. This means users will only see information from documents they have explicit permissions to access—if a user doesn’t have access to specific Confluence pages or GitHub repositories, those results will be automatically filtered out from their search results. The system features complete end-to-end encryption to protect sensitive operational data, and the role-based access control integrates with your existing identity management systems. This zero-trust security approach maintains compliance with industry standards and provides organizations with granular control over their sensitive operational data, reducing the risk of unauthorized access to confidential information.

Jacky Leybman, Principal Product Manager at PagerDuty, says,

“Our PagerDuty customers have asked for a one-stop shop from identifying critical issues to driving resolution. The integration of Amazon Q index with PagerDuty Advance represents a significant milestone, enabling us to provide customers with comprehensive insights, including runbooks and historical information stored in the enterprise environment, and help them resolve issues efficiently, resulting in up to 30% faster MTTR on average. Working with AWS to implement this integration has been a remarkably smooth experience, and we’re already seeing strong interest from numerous enterprise customers eager to test these capabilities. We are very excited to see how customers leverage these capabilities.”

Solution overview

The Amazon Q Business data accessor, a secure interface component that bridges enterprise applications with an Amazon Q index, provides a simple and secure way for enterprises to allow PagerDuty Advance to access their Amazon Q index and return relevant answers to user queries.

The integration of PagerDuty Advance with Amazon Q index offers a robust incident management solution that uses enterprise data across multiple platforms. When a user requests information through Slack or Microsoft Teams, the PagerDuty Advance orchestrator processes the query, checking both the PagerDuty knowledge base for relevant incident data and the Amazon Q Business data accessor to search the Amazon Q index. The index can aggregate data from various enterprise systems like Slack, Salesforce, and Atlassian products using built-in Amazon Q Business connectors. The orchestrator uses generative AI to provide users with contextual, actionable insights directly within their communication platform. With this integration, teams can quickly access runbooks, ongoing issue details, and other critical information, enhancing incident response efficiency and reducing resolution times.
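Under the hood, a data accessor retrieves content through the Amazon Q Business SearchRelevantContent API. The following is a minimal, hedged sketch of what such a call looks like with boto3; the application ID, retriever ID, and query text are placeholders, PagerDuty Advance performs this retrieval on your behalf through cross-account access rather than you calling it directly, and the parameter and field names should be verified against the current Amazon Q Business API reference.

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholders: use your Amazon Q Business application and retriever IDs
response = qbusiness.search_relevant_content(
    applicationId="YOUR_Q_BUSINESS_APP_ID",
    contentSource={"retriever": {"retrieverId": "YOUR_RETRIEVER_ID"}},
    queryText="runbook for checkout-service latency incident",
    maxResults=5,
)

for item in response.get("relevantContent", []):
    # Each result carries a content snippet plus metadata such as the source document URI
    print(item.get("documentTitle"), item.get("documentUri"))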

The following diagram depicts the overall solution integrating PagerDuty Advance and Amazon Q index.

Prerequisites

Before enabling the Amazon Q index integration on PagerDuty Advance, you need to have the following components and requirements in place:

  • Amazon Q Business set up with AWS IAM Identity Center for user authentication
  • Access to PagerDuty Advance
  • A valid AWS account with appropriate service access

With the Amazon Q Business data accessor, PagerDuty Advance seamlessly integrates with your Amazon Q index. Simply complete the basic configuration steps on both the Amazon Q Business and PagerDuty consoles to get started. For more information on how to set up an Amazon Q Business application, see the Amazon Q Business Activation Day workshop.

Add PagerDuty Advance as a data accessor

After creating an Amazon Q Business application with IAM Identity Center, administrators can configure PagerDuty as a data accessor through the Amazon Q Business console. Complete the following steps:

  1. On the Amazon Q Business console, choose Data accessors in the navigation pane.
  2. Choose Add data accessor.
  3. Choose PagerDuty Advance as your data accessor.
  4. For Accessor name, enter a name for your data accessor.
  5. For Data source access, configure your level of access.
    • You can select specific data sources from your Amazon Q index to be available through the data accessor. This makes it possible to control which content is surfaced in the ISV environment. You can use Amazon Q Business pre-built connectors to synchronize content from various systems. For more information, refer to Supported connectors.
  6.  For User access, specify which users or groups can access the Amazon Q index through the data accessor.
    • This option enables you to configure granular permissions for data accessor accessibility and manage organizational access controls.

For more information about data access, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.

After you have added the data accessor, the Amazon Q Business console displays configuration details that you need to share with PagerDuty Advance to complete the setup. Note down this information for the next step. Also, you can always come back to retrieve these values on the data accessor’s details page.

Configure Amazon Q for PagerDuty Advance

After PagerDuty has been configured as a data accessor, administrators can enable Amazon Q Business assistance on PagerDuty Advance. The following steps describe how to do it:

  1. On your PagerDuty page, go to Account Settings, then choose PagerDuty Advance.
    Amazon Q configuration on PagerDuty Advance
  2. Turn on Enable Amazon Q Business.
    Enable Amazon Q Business on PagerDuty Advance
  3. Choose Edit configuration values and enter the values you copied when enabling the data accessor in the previous step.
    Data accessor configuration values

Your setup is now complete!

Now you can go to the communication tool where PagerDuty Advance is available and start asking questions. For example, on Slack, you can use /pd amazonq <user query>, as shown in the following screenshot.

PDAdvance Q index demo screen

Clean up

When you’re done using this solution in a given environment, clean up the resources you created.

  1. On your PagerDuty page, go to Account Settings, then choose PagerDuty Advance, and turn off Enable Amazon Q Business.
  2. On the Amazon Q Business console, delete the PagerDuty data accessor from the Data accessors page. Deleting this data accessor removes permissions and access to it for all users.
  3. Delete the Amazon Q Business application that you created as a prerequisite.
    • Navigate to the Amazon Q Business console.
    • Choose Applications on the left menu.
    • Select the application you created.
    • Choose Delete from under Actions to delete the application.

Deleting the Amazon Q Business application will remove the associated index and data source connectors, and prevent incurring additional costs.

Conclusion

The combination of PagerDuty Advance and Amazon Q index offers businesses an improved way to handle daily operations more effectively. By bringing together PagerDuty's enterprise-grade incident management solutions with the smart search features of Amazon Q index, companies can now securely get specific answers and find relevant information that was previously scattered across different systems, all while maintaining data ownership. This means faster problem-solving and better teamwork across the organization.

In this post, we explored how enterprises can use the integration between PagerDuty Advance and Amazon Q Business, allowing users to streamline their incident management processes and unlock valuable operational gains and insights. We demonstrated how organizations can set up this integration using an Amazon Q data accessor, so teams can access critical information across multiple systems securely and in a cost-effective manner.

Ready to level up your incident management and operational efficiency? Unlock the full potential of your enterprise’s operational intelligence today with the Amazon Q Business console, PagerDuty Advance documentation, and the integration implementation guide.


About the Authors

Jacky Leybman is a Principal Product Manager at PagerDuty, leading the development of PagerDuty Advance and AI Agents. With over 19 years of experience in technology and product management, Jacky specializes in leading Agile cross-functional teams to develop and launch innovative digital products. Based in Miami, Florida, Jacky brings extensive expertise in product strategy, team leadership, and artificial intelligence implementations.

Takeshi Kobayashi is a Senior AI/ML Solutions Architect within the Amazon Q Business team, responsible for developing advanced AI/ML solutions for enterprise customers. With over 14 years of experience at Amazon in AWS, AI/ML, and technology, Takeshi is dedicated to leveraging generative AI and AWS services to build innovative solutions that address customer needs. Based in Seattle, WA, Takeshi is passionate about pushing the boundaries of artificial intelligence and machine learning technologies.

Daniel Lopes is a Solutions Architect at AWS, where he partners with ISVs to architect solutions that align with their strategic objectives. He specializes in leveraging AWS services to help ISVs transform their product vision into reality, with particular expertise in event-driven architectures, serverless computing, and generative AI. Outside work, Daniel mentors his kids in video games and pop culture.

Read More

Innovate business logic by implementing return of control in Amazon Bedrock Agents

Innovate business logic by implementing return of control in Amazon Bedrock Agents

In the context of distributed systems and microservices architecture, orchestrating communication between diverse components presents significant challenges. However, with the launch of Amazon Bedrock Agents, the landscape is evolving, offering a simplified approach to agent creation and seamless integration of the return of control capability. In this post, we explore how Amazon Bedrock Agents revolutionizes agent creation and demonstrate the efficacy of the return of control capability in orchestrating complex interactions between multiple systems.

Amazon Bedrock Agents simplifies the creation, deployment, and management of agents in distributed systems. By using the power of AWS Lambda and AWS Step Functions, Amazon Bedrock Agents abstracts away the complexities of agent implementation, which means developers can focus on building robust and scalable applications without worrying about infrastructure management.

You can use agents in Amazon Bedrock in various scenarios where you need to handle the return of control to the user or the system. Use cases include conversational assistants, task automation, decision support systems, interactive tutorials and walkthroughs, and virtual assistants. In these use cases, the key aspect of the agents is their ability to handle the return of control to the user or the system. This allows for a more natural and responsive interaction, where the user feels in control of the process while still benefiting from the agent’s guidance and automation capabilities.

Solution overview

In this post, we demonstrate an automated personalized investment portfolio solution using Amazon Bedrock Agents. The solution calls a third-party API to fetch a user's current investment portfolio. These holdings are then analyzed using foundation models (FMs) available on Amazon Bedrock to produce recommendations in line with the inputs provided by the end user, showcasing a return of control capability integrated with Amazon Bedrock Agents.

This solution uses a combination of synchronous data retrieval and generative AI to provide tailored investment recommendations that align with users’ specific financial goals and risk tolerance. By incorporating machine learning (ML) and simulation techniques, the system can generate personalized portfolios and assess their potential performance, making sure the recommended solutions are optimized for individual needs.

With Amazon Bedrock Agents, the capability to return control to the application invoking the agent can handle external functions and business logic at the application level instead of using a Lambda function. This way, an application can manage external interactions and return the response while the agent continues its orchestration. This is illustrated in the following diagram.

Illustration Diagram
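To make this flow concrete, the following is a minimal sketch (not the post's sample application) of how the invoking application can watch for a returnControl event, run its own business logic, and hand the result back so the agent can continue. The agent IDs, action group, function name, and result payload are hypothetical, and the field names should be checked against the current Amazon Bedrock Agents runtime documentation.

import json
import uuid

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
session_id = str(uuid.uuid4())

# First turn: the agent decides it needs the application to fetch the portfolio
response = agent_runtime.invoke_agent(
    agentId="YOUR_AGENT_ID",
    agentAliasId="YOUR_AGENT_ALIAS_ID",
    sessionId=session_id,
    inputText="Recommend a portfolio for a medium-risk, long-term investor",
)

for event in response["completion"]:
    if "returnControl" in event:
        rc = event["returnControl"]
        # The application runs its own business logic here (for example, calling a third-party API)
        portfolio = {"stocks": 60, "bonds": 30, "cash": 10}  # hypothetical result

        # Second turn: return the result so the agent can continue its orchestration
        followup = agent_runtime.invoke_agent(
            agentId="YOUR_AGENT_ID",
            agentAliasId="YOUR_AGENT_ALIAS_ID",
            sessionId=session_id,
            sessionState={
                "invocationId": rc["invocationId"],
                "returnControlInvocationResults": [{
                    "functionResult": {
                        "actionGroup": "PortfolioActions",     # hypothetical action group
                        "function": "fetch_custom_portfolio",  # hypothetical function name
                        "responseBody": {"TEXT": {"body": json.dumps(portfolio)}},
                    }
                }],
            },
        )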

The option to return control is particularly useful in two main scenarios:

  1. Calling an API from an existing application rather than building a new Lambda function with the required authentication and networking configurations
  2. Handling tasks that might run longer than 15 minutes and can’t be accommodated through a Lambda function, instead requiring containers, virtual servers, or workflow orchestration tools such as AWS Step Functions

The following sample code uses Amazon Bedrock Agents with handling return of control in the code. With the Amazon Bedrock Agents feature, you can manage Amazon Bedrock Agents return of control in your backend services and simplify application integrations. To demonstrate this, we have the following four code snippets: external-bedrock-agent-api.py, streamlit-app-portfolio-recommender.py, Portfolio-Recommender-CFN-Template.yaml, and requirements.txt, along with detailed steps to replicate the scenario.

The external-bedrock-agent-api code implements a portfolio recommendation system using Amazon Bedrock Agents and Flask. Here’s a high-level overview of the functions used:

  • fetch_user_data: Processes user profile information such as risk tolerance or investment goals
  • generate_portfolios: Creates sample investment portfolios with different risk levels
  • fetch_custom_portfolio: Combines user data and portfolio generation
  • send_custom_portfolio_as_email: Sends portfolio recommendations by email using an Amazon Simple Email Service (Amazon SES) verified email identity
  • /sns-handler endpoint: This API endpoint receives POST requests with user investment preferences, processes the message containing user preference details, invokes the Amazon Bedrock agent to generate recommendations, and handles email communication of the recommendations

The streamlit-app-portfolio-recommender code is a Streamlit web application for investment portfolio recommendations. The code sets up the webpage with a title and configuration. The app collects several pieces of information through form elements:

  • Email address – Text input
  • Financial goal – Dropdown with options for retirement, wealth accumulation, and passive income
  • Risk tolerance – Dropdown with options for low, medium, and high
  • Investment horizon – Dropdown with options for short-term and long-term
  • Environmental, social, and governance (ESG) preference – Checkbox for environmental, social, and governance preferences
  • Email preference – Checkbox for receiving recommendations by email

The system operates through a portfolio generation function that sends POST requests to a local API endpoint. This function transforms user preferences into JSON data and delivers either an API response or an error message back to the user.

The process to display results begins when the user clicks the Submit button, which triggers the custom_portfolio function with their specific inputs. The system then displays the portfolio recommendation in a text area for successful executions, while immediately alerting users with an error message if any issues occur during the process.

Solution walkthrough

Follow these steps to set up the environment and test the application in the US East (N. Virginia) Region (us-east-1).

To enable Anthropic’s Claude model on Amazon Bedrock in your AWS account:

  1. On the Amazon Bedrock console, in the left navigation pane under Amazon Bedrock configurations, select Model access
  2. Select Claude 3 Sonnet, as shown in the following screenshot

  3. To create the Amazon Bedrock agents, related action groups, Amazon SageMaker AI domain, sample user profile, and JupyterLab space, launch the provided AWS CloudFormation stack and follow these steps:

  4. Select the checkbox to acknowledge that the template contains AWS Identity and Access Management (IAM) resources, as shown in the following screenshot

  5. Monitor AWS CloudFormation until it completes the resource creation process. You can verify the successful deployment by checking the Stack details output tab, which displays the AgentId and AgentAliasId values, as shown in the following screenshot.

You will receive an email address verification request from AWS in the US East (N. Virginia) Region. Select the link in the email to verify.

After creating your CloudFormation resources, follow these steps to access Amazon SageMaker Studio:

  1. On the Amazon SageMaker AI console, under Admin configurations in the left navigation pane, select Domains
  2. Select the bedrock-return-of-control-demo domain created by the CloudFormation template, as shown in the following screenshot

  3. Select the User profiles tab
  4. To open the SageMaker Studio environment, under User profiles, next to the sagemakeruser profile on the right, select Launch. From the dropdown menu, choose Studio, as shown in the following screenshot

You should now observe the SageMaker Studio home page. This environment is where you will execute Python scripts to set up your application.

To access the JupyterLab environment for this lab, follow these steps:

  1. On the SageMaker Studio console, in the left navigation pane under Applications, select JupyterLab
  2. You’ll find bedrock-agent-space that has been preprovisioned for this lab. Its Status should be Stopped. On the right side under Action, choose Run
  3. Within 30–40 seconds, the JupyterLab application status will change from Starting to Running

  4. When it's running, under Action, choose Open, as shown in the following screenshot

Three required files are copied under the /home/sagemaker-user/scripts directory: two Python files (external-bedrock-agent-api and streamlit-app-portfolio-recommender) and one requirements.txt file, as shown in the following screenshot. The JupyterLab application environment is under the default directory.

  1. In the File menu, select New. In the dropdown menu, select Terminal to open a new terminal window, as shown in the following screenshot.
  2. Go to the scripts directory where you have the required files in the terminal and enter:
    pip install -r requirements.txt

  3. Enter the following command on the terminal:
    python3 external-bedrock-agent-api.py

  4. Open a new terminal and go to the /home/sagemaker-user/scripts directory and enter:
    streamlit run streamlit-app-portfolio-recommender.py

  5. From the command execution in the terminal, note the port number (8501) and the Studio URL from the browser. The URL will be in the format https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/lab/tree/scripts
  6. To access the Streamlit app, modify the Studio URL, replacing everything after default/ (that is, lab/tree/scripts) with proxy/[PORT NUMBER]/. The modified Streamlit UI URL will look like this: https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/proxy/8501/
  7. Select all appropriate inputs for generating your custom portfolio recommendation. Choose whether you prefer to receive email notifications or inline recommendations through the application interface by checking the corresponding box. Then choose Submit. Provide the same email address that was verified earlier in this walkthrough.

The sample output and email response are shown in the following demo screenshot.

Cleanup

When you’re done, delete resources you no longer need to avoid ongoing costs. Follow these steps:

  1. Go to the SageMaker AI JupyterLab environment and stop the Amazon SageMaker Studio application or running instance
  2. Delete the resources created by deleting the CloudFormation stack.

The following screenshot demonstrates how to view and stop running instances in the SageMaker AI JupyterLab environment. For more information, refer to Delete a stack from the CloudFormation console.

Amazon Bedrock Agents return of control considerations

When implementing return of control, consider the following:

  • Return of control performance considerations – When implementing return of control, developers should focus on optimizing action execution times and response handling. Each action should be designed to complete within reasonable timeframes to maintain conversation flow. Consider implementing caching mechanisms for frequently accessed data and facilitate efficient state management between return of control cycles. The application should be designed to handle concurrent user sessions effectively while maintaining responsiveness.
  • Return of control limitations – Actions must be defined with clear input and output schemas. Each action should be atomic and focused on a specific task to maintain simplicity and reliability. Consider payload sizes for requests and responses because there might be size limitations. Actions execute sequentially, and the system needs to maintain conversation context throughout the interaction cycle.
  • Security recommendations – Security implementation requires proper authentication and authorization mechanisms for all actions, following the principle of least privilege when defining permissions. Input parameters must be validated before processing, with comprehensive error handling in place. Rate limiting and request validation should be implemented to prevent abuse, and sensitive data handling must comply with security requirements and include proper logging mechanisms for audit trails. Additionally, implement input filtering to prevent prompt injection attacks, configure response filters to protect sensitive information, and set up content scanning for both input and output. Deploy regex-based response filtering to help prevent personally identifiable information (PII) exposure and establish content moderation filters to block inappropriate content.
  • Monitoring and observability – Implement comprehensive logging for all action executions and responses. Monitor key metrics such as action execution times, success rates, and error rates. Set up alerts for abnormal patterns or failures. Use Amazon CloudWatch for monitoring system health and performance. Consider implementing tracing to track request flow through different components of your system. Regular review of metrics and logs helps identify potential issues and optimization opportunities.

Conclusion

In this post, we’ve demonstrated how Amazon Bedrock Agents simplifies agent creation and streamlines the orchestration of complex interactions between microservices using the return of control capability. By abstracting away infrastructure management and providing seamless integration with your application, Amazon Bedrock Agents empowers developers to build resilient and scalable applications with ease. As organizations embrace microservices architecture and distributed systems, tools such as Amazon Bedrock Agents play a pivotal role in accelerating innovation and driving digital transformation.

Resources

For the most current and specific information, refer to:


About the Authors


Vishwanatha Handadi is a Sr. Solutions Architect within the Global Financial Services vertical at Amazon Web Services (AWS), where he has worked for over 2 years, and has over 22 years of experience in the IT industry, primarily in data and analytics. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences. He is based out of Bangalore, India.


Mohammed Asadulla Baig is a Sr. Technical Account Manager with Amazon Web Services (AWS) Enterprise Support. Asad helps customers architect scalable, resilient, and secure solutions. With a keen eye for innovation and a passion for delivering customer success, Asad has established himself as a thought leader in the industry, helping enterprises navigate their cloud transformation journeys with confidence and ease.

Read More

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization

The field of large language models is shifting toward lower-precision computation. This shift necessitates a rethinking of scaling laws to account for the effects of quantization on resulting quantized model performance. In this work, we demonstrate that previous conclusions on the low-bit scaling laws can be significantly sharpened by better quantization scheme design and training improvements.

We propose ParetoQ, the first algorithm that unifies binary, ternary, and 2-to-4 bit quantization-aware training. ParetoQ demonstrates its robustness by yielding state-of-the-art (SOTA) models at all bit widths, surpassing prior works tailored for individual bit levels. We’ve released the MobileLLM low-bit model collection on Hugging Face, featuring models quantized with our ParetoQ method. The smallest model is an ultra-efficient 1-bit 125M variant, with just ~16MB equivalent storage size.

These SOTA points in the Pareto chart ensure that our scaling law comparisons are both reliable and consistent, as they derive from homogeneous settings. Our scaling laws reveal that binary quantization significantly compromises accuracy, while ternary, 2-bit, and 3-bit quantization are tied in performance, often surpassing 4-bit. 

ParetoQ is based on PyTorch models, including LLaMA and MobileLLM. We used a popular PyTorch library, Hugging Face Transformers, for the accuracy experiments. For the latency experiments, we used the low-bit quantization kernels on the CPU with ExecuTorch and compared their speed with that of 4-bit quantization. Additionally, we implemented state-of-the-art 2-bit GPU kernels, which showed up to a 4.14x speedup compared to FP16 and a 1.24x speedup over the Machete 4-bit kernel on TritonBench.

ParetoQ has been integrated into torchao [pull]. This integration enables users to leverage ParetoQ by specifying “paretoq” as the quantization method within torchao’s codebase. Once set, users can utilize torchao’s ParetoQ workflow, optimizing quantization parameters to balance accuracy and compression trade-offs and comparing different quantization bit-widths apples-to-apples using Pareto frontier analysis. This allows for the efficient deployment of models on edge devices without requiring manual tuning of quantization settings.

To obtain the ParetoQ-quantized models, simply navigate to the torchao/prototype/paretoq directory and execute the training script:

cd torchao/prototype/paretoq && bash 1_run_train.sh $w_bit

Here, $w_bit specifies the target weight bit-width for quantization.
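For example, assuming the script accepts the bit-width as a plain integer (as the variable name suggests), a 2-bit quantization-aware training run would be launched with:

cd torchao/prototype/paretoq && bash 1_run_train.sh 2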

ParetoQ code is available at: https://github.com/facebookresearch/ParetoQ

Paper link: https://arxiv.org/abs/2502.02631 

1 A Better QAT Scheduling Strategy for Extreme Low-Bit LLMs

1.1 Training Budget Allocation

Given a fixed training budget B_train = B_FPT + B_QAT, how should the budget be optimally allocated between full-precision training (B_FPT) and quantization-aware training/fine-tuning (B_QAT) to maximize the accuracy of the quantized model?

Figure 1: Optimal allocation between full-precision pretraining and QAT fine-tuning.

Finding-1 QAT finetuning consistently surpasses both PTQ with B_FPT = B_train and QAT from scratch with B_QAT = B_train. Optimal performance is nearly achieved by dedicating the majority of the training budget to full precision (FP) training and approximately 10% to QAT.

1.2 Fine-tuning Characteristics

Figure 2: Analysis of training token requirements for quantization-aware fine-tuning and training from scratch

Finding-2 While fine-tuning enhances performance across all bit-widths, even binary and ternary, the optimal fine-tuning effort inversely correlates with bit-width. For 3-bit and 4-bit weights, fine-tuning adjusts within a nearby grid to mitigate accuracy loss and requires fewer fine-tuning tokens. In contrast, binary and ternary weights break the grid, creating new semantic representations to maintain performance, which requires longer fine-tuning.

Figure 3: L1 norm difference between QAT-finetuned weights and full-precision initialization (||W_finetune −W_init||_l1 /||W_init||_l1).

2 A Hitchhiker’s Guide to Quantization Method Choices

In sub-4-bit quantization, the choice of function is highly sensitive and can drastically alter scaling law outcomes.

 

 

Figure 4: Impact of quantization grid choice across bit widths.

2.1.1 Range clipping

Compared to statistics-based quantization (e.g., min-max quantization), learnable scales, which optimize quantization ranges as network parameters and balance outlier suppression and precision, yield more stable and superior performance. As shown in Figure 4 (b)–(e), learnable policies consistently outperform stats-based methods across all bit widths.

2.1.2 Quantization grids

Level symmetry in quantization grids is vital for lower-bit quantization but often overlooked. Including “0” in even-level quantization (e.g., 2-bit, 3-bit, 4-bit) can cause imbalance. For instance, 2-bit quantization options like (-2, -1, 0, 1) limit positive representation to only one level, while (-1.5, -0.5, 0.5, 1.5) offers more balanced representation. We propose Stretched Elastic Quant (SEQ) to address this in lower-bit scenarios.

SEQ balances quantized levels and evenly divides the full-precision weight span, which is crucial for extremely low-bit quantization. The figures show SEQ’s advantage in ternary and 2-bit quantization, while LSQ with “0” slightly excels in the 3-bit and 4-bit cases.

Figure 5: Comparison of quantization methods across different bit-widths

2.2 Quantization Function

Based on our analysis, we combine the optimal quantization functions identified for each bit-width into one formula, denoted as ParetoQ. This includes Elastic Binarization [1] for 1-bit quantization, LSQ [2] for 3 and 4-bit quantization, and the proposed SEQ for 1.58 and 2-bit quantization.

Here, k equals 3 in the ternary case and 2^{N_bit} otherwise; n = 2^{N_bit−1} and p = 2^{N_bit−1} − 1. In the backward pass, the gradients to the weights and scaling factor can be easily calculated using a straight-through estimator.
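The combined formula appears as an image in the original post and is not reproduced here. For reference, the LSQ component [2] used for the 3-bit and 4-bit cases follows the familiar learnable-scale form (a standard sketch with the clipping bounds −n and p defined above, not the paper's full combined ParetoQ equation):

\hat{W} \;=\; \alpha \cdot \mathrm{clip}\!\left( \left\lfloor \frac{W}{\alpha} \right\rceil,\; -n,\; p \right)

where α is the learnable scaling factor, ⌊·⌉ denotes rounding, and gradients flow through the rounding via the straight-through estimator, as noted above.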

With ParetoQ, we present a robust comparison framework across five bit-widths (1-bit, 1.58-bit, 2-bit, 3-bit, 4-bit), each achieving state-of-the-art accuracy. This facilitates direct, apples-to-apples comparisons to identify the most effective bit-width selection.

3 Comparison with SoTA

3.1 Comparisons on 1.58-bit quantization

The figure below illustrates that ParetoQ consistently outperforms previous methods targeting ternary quantization-aware training, including Spectra [3] and 1-bit Era [4]. Given that a full-precision LLaMA-3 3B model achieves 69.9 accuracy, it’s remarkable that the ParetoQ ternary 3B-parameter model narrows the gap to just 4.1 points, while previous methods experience drops exceeding 11.7 points.

Figure 6: Ternary quantization accuracy averaged across six tasks: ARC-e, ARC-c, BoolQ, PIQA, HellaSwag, and WinoGrande. ParetoQ consistently outperforms all prior methods in ternary quantization-aware training.

3.2 Comparisons on 2-bit / 3-bit / 4-bit quantization

As evidenced by Figure 1, compared to previous state-of-the-art PTQ and QAT methods on 2, 3 or 4-bit quantization settings, our approach consistently resides on the Pareto front, with a particularly pronounced advantage in lower-bit quantization settings. These results confirm that our bit-accuracy trade-off conclusions are benchmarked against SoTA results across all bit settings, ensuring its reliability.

Figure 7: Accuracy comparison on 8 models. ParetoQ outperforms all state-of-the-art PTQ and QAT methods in 2, 3, and 4-bit settings.

4 Pareto Curve

4-bit quantization-aware training (QAT) achieves near-lossless compression in many scenarios. With ParetoQ, we are able to further improve the trade-off curve. Figure 8 (a) demonstrates that sub-4-bit quantization, including binary, ternary, 2-bit, and 3-bit, often surpasses 4-bit. Notably, 2-bit and ternary models reside on the Pareto frontier.

To evaluate potential speedup benefits beyond memory reduction, we utilize the High-Performance Low-Bit Operators for 2-bit quantization and compare the latency with 4-bit quantization. The curves in Figure 8 (c) demonstrate that, within our experimental range, 2-bit quantized models consistently outperform 4-bit models in terms of accuracy-speed performance, positioning 2-bit quantization as a superior choice for on-device applications where both latency and storage are critical.

Figure 8: (a) (b) In sub-4-bit regime, 1.58-bit, 2-bit, and 3-bit quantization outperform 4-bit in terms of the accuracy-model size trade-off. (c) Under hardware constraints, 2-bit quantization demonstrates superior accuracy-speed trade-offs compared to higher-bit schemes.

5 GPU Latency

We measured the latency of LLaMA 3.2 models (1B, 3B, 8B) on an H100 NVL GPU (94 GB memory). The W4A16 kernel used the Machete kernel from vLLM, while the W2A16 kernel was implemented based on the CUTLASS mixed-precision backbone kernel. All tests were performed on a single GPU with a context length of 2048 tokens. For kernel-level latency, we compared the 2-bit kernel to the 4-bit Machete kernel across three weight shapes: (4096 x 4096), (8192 x 8192), and (16384 x 16384) on TritonBench. For larger kernel sizes, the 2-bit kernel achieves an ~24% speedup compared to the 4-bit Machete kernel.

Conclusion

In this study, we propose ParetoQ, an advanced quantization framework that achieves state-of-the-art performance across all bit-width levels. This framework uniquely enables a direct, consistent comparison across different bit-widths, ensuring an equitable evaluation of performance metrics. Our empirical analysis indicates that quantization at 1.58-bit, 2-bit, and 3-bit offers a superior trade-off between accuracy and effective quantized model size compared to 4-bit, highlighting their potential for optimized model deployment.

Feel free to try running ParetoQ from torchao/prototype/paretoq, following the steps in that repo. If you have any questions, feel free to reach out to Zechun Liu <zechunliu@meta.com>, Changsheng Zhao <cszhao@meta.com>, and Andrew Or <andrewor@meta.com>.

References

[1] BiT: Robustly Binarized Multi-Distilled Transformer.

[2] Learned Step Size Quantization.

[3] Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models.

[4] The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits.

Read More

Deploy Qwen models with Amazon Bedrock Custom Model Import

Deploy Qwen models with Amazon Bedrock Custom Model Import

We’re excited to announce that Amazon Bedrock Custom Model Import now supports Qwen models. You can now import custom weights for Qwen2, Qwen2_VL, and Qwen2_5_VL architectures, including models like Qwen 2, 2.5 Coder, Qwen 2.5 VL, and QwQ 32B. You can bring your own customized Qwen models into Amazon Bedrock and deploy them in a fully managed, serverless environment—without having to manage infrastructure or model serving.

In this post, we cover how to deploy Qwen 2.5 models with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the AWS infrastructure at an effective cost.

Overview of Qwen models

Qwen 2 and 2.5 are families of large language models, available in a wide range of sizes and specialized variants to suit diverse needs:

  • General language models: Models ranging from 0.5B to 72B parameters, with both base and instruct versions for general-purpose tasks
  • Qwen 2.5-Coder: Specialized for code generation and completion
  • Qwen 2.5-Math: Focused on advanced mathematical reasoning
  • Qwen 2.5-VL (vision-language): Image and video processing capabilities, enabling multimodal applications

Overview of Amazon Bedrock Custom Model Import

Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing foundation models (FMs) through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage the underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Amazon Bedrock tools and features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. Amazon Bedrock Custom Model Import is generally available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.

Now, we’ll explore how you can use Qwen 2.5 models for two common use cases: as a coding assistant and for image understanding. Qwen2.5-Coder is a state-of-the-art code model, matching capabilities of proprietary models like GPT-4o. It supports over 90 programming languages and excels at code generation, debugging, and reasoning. Qwen 2.5-VL brings advanced multimodal capabilities. According to Qwen, Qwen 2.5-VL is not only proficient at recognizing objects such as flowers and animals, but also at analyzing charts, extracting text from images, interpreting document layouts, and processing long videos.

Prerequisites

Before importing the Qwen model with Amazon Bedrock Custom Model Import, make sure that you have the following in place:

  1. An active AWS account
  2. An Amazon Simple Storage Service (Amazon S3) bucket to store the Qwen model files
  3. Sufficient permissions to create Amazon Bedrock model import jobs
  4. Verified that your Region supports Amazon Bedrock Custom Model Import

Use case 1: Qwen coding assistant

In this example, we will demonstrate how to build a coding assistant using the Qwen2.5-Coder-7B-Instruct model.

  1. Go to Hugging Face, then search for and copy the model ID Qwen/Qwen2.5-Coder-7B-Instruct:

You will use Qwen/Qwen2.5-Coder-7B-Instruct for the rest of the walkthrough. We don’t demonstrate fine-tuning steps, but you can also fine-tune before importing.

  2. Use the following command to download a snapshot of the model locally. The huggingface_hub Python library provides a utility called snapshot_download for this:
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-Coder-7B-Instruct", 
                local_dir="./extractedmodel/")

Depending on your model size, this could take a few minutes. When completed, your Qwen Coder 7B model folder will contain the following files.

  • Configuration files: Including config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and vocab.json
  • Model files: Four safetensor files and model.safetensors.index.json
  • Documentation: LICENSE, README.md, and merges.txt

  3. Upload the model to Amazon S3, using boto3 or the command line:

aws s3 cp ./extractedmodel s3://yourbucket/path/ --recursive

  4. Start the model import job using the following API call:
import boto3

bedrock_client = boto3.client("bedrock")

response = bedrock_client.create_model_import_job(
    jobName="uniquejobname",
    importedModelName="uniquemodelname",
    roleArn="fullrolearn",
    modelDataSource={
        's3DataSource': {
            's3Uri': "s3://yourbucket/path/"
        }
    }
)

You can also do this using the AWS Management Console for Amazon Bedrock.

  1. In the Amazon Bedrock console, choose Imported models in the navigation pane.
  2. Choose Import a model.

  3. Enter the details, including a Model name, Import job name, and model S3 location.

  4. Create a new service role or use an existing service role. Then choose Import model.

  5. After you choose Import on the console, you should see the status as Importing while the model is being imported:

If you’re using your own role, make sure you add the required trust relationship, as described in Create a service role for model import.
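Whether you start the import from the API or the console, the job runs asynchronously. The following is a minimal sketch for checking its progress with the GetModelImportJob API, reusing the bedrock_client and response from the API call shown earlier (if you used the console, copy the job ARN from there instead); the polling interval is arbitrary.

import time

job_arn = response["jobArn"]  # or the job ARN copied from the console

while True:
    job = bedrock_client.get_model_import_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Import job status: {status}")
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)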

After your model is imported, wait for model inference to be ready, and then chat with the model on the playground or through the API. In the following example, we prompt the model to directly output Python code that lists items in an S3 bucket. Remember to use the right chat template to input prompts in the format required. For example, you can get the right chat template for any compatible model on Hugging Face using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# We don't call the model locally here; we only need the tokenizer.
# tokenizer.apply_chat_template() formats the conversation into the prompt string the model expects.
prompt = "Write sample boto3 python code to list files in a bucket stored in the variable `my_bucket`"
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
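Once the formatted prompt string (text) is ready, you can send it to the imported model with the Amazon Bedrock Runtime InvokeModel API. The following is a minimal sketch; the model ARN is a placeholder (see the note on ARNs below), and the request fields mirror the ones used for the vision example later in this post.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder: use the full ARN of your imported model
model_arn = "arn:aws:bedrock:<region>:<account-id>:imported-model/<model-id>"

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps({
        "prompt": text,
        "max_gen_len": 512,
        "temperature": 0.3,
        "top_p": 0.9,
    }),
    accept="application/json",
    contentType="application/json",
)

print(json.loads(response["body"].read()))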

Note that when using the invoke_model APIs, you must use the full Amazon Resource Name (ARN) for the imported model. You can find the Model ARN in the Amazon Bedrock console by navigating to the Imported models section and then viewing the Model details page, as shown in the following figure.

After the model is ready for inference, you can use the Chat playground on the Amazon Bedrock console or the APIs to invoke the model.

Use case 2: Qwen 2.5 VL image understanding

Qwen2.5-VL-* offers multimodal capabilities, combining vision and language understanding in a single model. This section demonstrates how to deploy Qwen2.5-VL using Amazon Bedrock Custom Model Import and test its image understanding capabilities.

Import Qwen2.5-VL-7B to Amazon Bedrock

Download the model from Hugging Face and upload it to Amazon S3:

import os

from huggingface_hub import snapshot_download

hf_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
local_directory = "extractedmodel"  # local folder to download the weights into

# Enable faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download model locally
snapshot_download(repo_id=hf_model_id, local_dir=f"./{local_directory}")

Next, import the model to Amazon Bedrock (either via Console or API):

import boto3

bedrock = boto3.client("bedrock")

# job_name, imported_model_name, role_arn, and s3_uri are placeholders for your own values
response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri
        }
    }
)

Test the vision capabilities

After the import is complete, test the model with an image input. The Qwen2.5-VL-* model requires proper formatting of multimodal inputs:

import json
import boto3
from transformers import AutoProcessor

# Bedrock Runtime client and the ARN of the imported model
client = boto3.client("bedrock-runtime")
model_id = "<your imported model ARN>"

def generate_vl(messages, image_base64, temperature=0.3, max_tokens=4096, top_p=0.9):
    # Format the multimodal prompt with the model's own chat template
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({
            'prompt': prompt,
            'temperature': temperature,
            'max_gen_len': max_tokens,
            'top_p': top_p,
            'images': [image_base64]
        }),
        accept='application/json',
        contentType='application/json'
    )

    return json.loads(response['body'].read().decode('utf-8'))

import base64

def image_to_base64(image_path):
    # Read the image file and return it as a base64-encoded string
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Using the model with an image
file_path = "cat_image.jpg"
base64_data = image_to_base64(file_path)

messages = [
    {
        "role": "user",
        "content": [
            {"image": base64_data},
            {"text": "Describe this image."}
        ]
    }
]

response = generate_vl(messages, base64_data)

# Print the response text, handling either response schema
print("Model Response:")
if 'choices' in response:
    print(response['choices'][0]['text'])
elif 'outputs' in response:
    print(response['outputs'][0]['text'])
else:
    print(response)

When provided with an example image of a cat (such as the following image), the model accurately describes key features such as the cat’s position, fur color, eye color, and general appearance. This demonstrates the Qwen2.5-VL-* model’s ability to process visual information and generate relevant text descriptions.

The model’s response:

This image features a close-up of a cat lying down on a soft, textured surface, likely a couch or a bed. The cat has a tabby coat with a mix of dark and light brown fur, and its eyes are a striking green with vertical pupils, giving it a captivating look. The cat's whiskers are prominent and extend outward from its face, adding to the detailed texture of the image. The background is softly blurred, suggesting a cozy indoor setting with some furniture and possibly a window letting in natural light. The overall atmosphere of the image is warm and serene, highlighting the cat's relaxed and content demeanor. 

Pricing

Amazon Bedrock Custom Model Import lets you bring your own model weights into Amazon Bedrock for supported architectures and serve them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn’t charge for the import itself. You are charged for inference based on two factors: the number of active model copies and their duration of activity. Billing occurs in 5-minute increments, starting from the first successful invocation of each model copy. The price per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The custom model units required for hosting depend on the model’s architecture, parameter count, and context length.

Amazon Bedrock automatically manages scaling based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales back up when needed, though this might involve cold-start latency of up to a minute. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy are determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
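To make the billing model concrete, here is a small sketch of the arithmetic. The per-minute rate below is made up purely for illustration, and the rounding to 5-minute increments reflects our reading of the billing description; consult the Amazon Bedrock pricing page for actual rates for your Region, model size, and compute unit version.

import math

def estimate_inference_cost(active_minutes, num_copies, price_per_copy_minute):
    # Billing accrues in 5-minute increments per active model copy,
    # starting from the first successful invocation of that copy
    billed_minutes = math.ceil(active_minutes / 5) * 5
    return billed_minutes * num_copies * price_per_copy_minute

# Hypothetical example: one copy active for 12 minutes at a made-up rate of
# $0.10 per copy-minute bills as 15 minutes, or $1.50
print(estimate_inference_cost(12, 1, 0.10))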

For more information, see Amazon Bedrock pricing.

Clean up

To avoid ongoing charges after completing the experiments:

  1. Delete your imported Qwen models from Amazon Bedrock Custom Model Import using the console or the API (see the sketch after this list).
  2. Optionally, delete the model files from your S3 bucket if you no longer need them.
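As a minimal sketch of the API route, assuming the imported_model_name, bucket_name, and s3_prefix variables from the earlier snippets, you can delete the imported model and, optionally, its artifacts in S3:

import boto3

bedrock = boto3.client("bedrock")

# Delete the imported model so no further inference charges accrue
bedrock.delete_imported_model(modelIdentifier=imported_model_name)

# Optionally remove the model artifacts from S3 if you no longer need them
s3 = boto3.resource("s3")
s3.Bucket(bucket_name).objects.filter(Prefix=s3_prefix).delete()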

Remember that while Amazon Bedrock Custom Model Import doesn’t charge for the import process itself, you are billed for model inference usage and storage.

Conclusion

Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like Qwen 2.5, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of Qwen 2.5’s advanced AI capabilities and Amazon Bedrock’s managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.

For more information, refer to the Amazon Bedrock User Guide.


About the Authors

Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in Product Management, Engineering, and Go-To-Market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing Generative AI technologies and driving real-world impact with Generative AI.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Dharinee Gupta is an Engineering Manager at AWS Bedrock, where she focuses on enabling customers to seamlessly utilize open source models through serverless solutions. Her team specializes in optimizing these models to deliver the best cost-performance balance for customers. Prior to her current role, she gained extensive experience in authentication and authorization systems at Amazon, developing secure access solutions for Amazon offerings. Dharinee is passionate about making advanced AI technologies accessible and efficient for AWS customers.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.
