Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

This post was co-written with Renato Nascimento, Felipe Viana, and Andre Von Zuben from Articul8.

Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructures that optimize large-scale model training, providing rapid iteration and efficient compute utilization with purpose-built infrastructure and automated cluster management.

In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.

What is SageMaker HyperPod?

SageMaker HyperPod is an advanced distributed training solution designed to accelerate scalable, reliable, and secure generative AI model development. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

  • Fault-tolerant compute clusters with automated faulty node replacement during model training
  • Efficient cluster utilization through observability and performance monitoring
  • Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?

Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. For instance, they found that general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)

The company’s proprietary ModelMesh™ technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.

Articul8’s ModelMesh™ supports:

  • LLMs for general tasks
  • Domain-specific models optimized for industry-specific applications
  • Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.

Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

The cost and time to train DSMs are critical to Articul8’s success in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

  • Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
  • Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 maintained stable and resilient training processes
  • Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.

Distributed training challenges and the role of SageMaker HyperPod

Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers both managed Slurm and Amazon EKS orchestration experiences that streamline cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

  • Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
  • Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
  • Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the example in the AWS-samples distributed training repo. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
  • Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.
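As a rough illustration of the Slurm commands mentioned in the list above, the following Python sketch assembles the argument lists for srun and sbatch. The --auto-resume=1 flag and the sbatch script name come from the text; train.py and the helper function names are hypothetical placeholders, and the sketch only builds the commands rather than running them.

```python
import shlex


def build_srun_command(training_cmd, auto_resume=True):
    """Return an srun argument list, optionally adding the auto-resume flag."""
    args = ["srun"]
    if auto_resume:
        # Flag named in the text; lets the job recover from the last checkpoint.
        args.append("--auto-resume=1")
    return args + list(training_cmd)


def build_sbatch_command(script_path):
    """Return the argument list that submits a batch script to Slurm."""
    return ["sbatch", script_path]


# train.py is a hypothetical training entry point, not from the original post.
srun_args = build_srun_command(["python", "train.py"])
sbatch_args = build_sbatch_command("1.distributed-training-llama2.sbatch")

print(shlex.join(srun_args))
print(shlex.join(sbatch_args))
```

On a cluster head node, either list could be passed to subprocess.run; keeping the commands as lists avoids shell-quoting problems.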

Solution overview

The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.

To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.

The following diagram illustrates the observability architecture.

SageMaker HyperPod Architecture (Slurm)

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.

For the setup for this post, we begin with the AWS published workshop for SageMaker HyperPod, and adjust it to suit our workload.

Prerequisites

The following two AWS CloudFormation templates address the prerequisites of the solution setup.

For SageMaker HyperPod

This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

  • VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
  • Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
  • Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
  • IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
  • Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.
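When adjusting the CIDR parameters of such a stack, a quick sanity check can help. The following minimal sketch uses Python's standard ipaddress module to confirm that the two blocks quoted above do not overlap; the variable names are illustrative only.

```python
import ipaddress

# CIDR blocks quoted in the prerequisites list above.
public = ipaddress.ip_network("10.0.0.0/16")
private = ipaddress.ip_network("10.1.0.0/16")

# Non-overlapping blocks are required for the two subnets to coexist in one VPC.
print(public.overlaps(private))
# Each /16 block spans 2**16 addresses.
print(public.num_addresses)
```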

For cluster observability

To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

  • Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
  • NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
  • EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
  • FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.

Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.

Cluster setup

The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

  • Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node has an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
  • Local storage – Each node has an 8 TB local NVMe volume attached for local storage.
  • Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
  • Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.

Results

During this project, Articul8 was able to confirm the expected performance of A100 GPUs with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near linear scaling with distributed training, achieving a 3.78 times reduction in time to train for Meta Llama-2 13B with 4x nodes. Having the flexibility to run multiple experiments without losing development time to infrastructure overhead was an important accomplishment for the Articul8 data science team.
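To make "near linear scaling" concrete, scaling efficiency can be computed as the observed speedup divided by the increase in node count. The sketch below applies that standard definition to the figures quoted above (3.78 times faster with four times the nodes); the function name is illustrative, and the baseline configuration is assumed to be the one the 4x figure is measured against.

```python
def scaling_efficiency(speedup, node_multiple):
    """Fraction of ideal (linear) speedup achieved when scaling out."""
    return speedup / node_multiple


# Figures quoted in the results: 3.78x faster training with 4x the nodes.
efficiency = scaling_efficiency(3.78, 4)
print(f"{efficiency:.1%}")
```

A value of 100% would mean perfectly linear scaling, so roughly 94.5% suggests that relatively little time is lost to communication and coordination overhead.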

Clean up

If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.

Conclusion

This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

  • DSMs significantly outperform general-purpose LLMs in critical domains
  • SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and A8-Energy DSMs
  • Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team to learn how you can use this service to accelerate your own training workloads.


About the Authors

Yashesh A. Shroff, PhD, is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.

Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.

Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.

Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.

Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

How VideoAmp uses Amazon Bedrock to power their media analytics interface

This post was co-written with Suzanne Willard and Makoto Uchida from VideoAmp.

In this post, we illustrate how VideoAmp, a media measurement company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to develop a prototype of the VideoAmp Natural Language (NL) Analytics Chatbot to uncover meaningful insights at scale within media analytics data using Amazon Bedrock. The AI-powered analytics solution involved the following components:

  • A natural language to SQL pipeline, with a conversational interface, that works with complex queries and media analytics data from VideoAmp
  • An automated testing and evaluation tool for the pipeline

VideoAmp background

VideoAmp is a tech-first measurement company that empowers media agencies, brands, and publishers to precisely measure and optimize TV, streaming, and digital media. With a comprehensive suite of measurement, planning, and optimization solutions, VideoAmp offers clients a clear, actionable view of audiences and attribution across environments, enabling them to make smarter media decisions that help them drive better business outcomes. VideoAmp has seen incredible adoption for its measurement and currency solutions with 880% YoY growth, 98% coverage of the TV publisher landscape, 11 agency groups, and more than 1,000 advertisers. VideoAmp is headquartered in Los Angeles and New York with offices across the United States. To learn more, visit www.videoamp.com.

VideoAmp’s AI journey

VideoAmp has embraced AI to enhance its measurement and optimization capabilities. The company has integrated machine learning (ML) algorithms into its infrastructure to analyze vast amounts of viewership data across traditional TV, streaming, and digital services. This AI-driven approach allows VideoAmp to provide more accurate audience insights, improve cross-environment measurement, and optimize advertising campaigns in real time. By using AI, VideoAmp has been able to offer advertisers and media owners more precise targeting, better attribution models, and increased return on investment for their advertising spend. The company’s AI journey has positioned it as a leader in the evolving landscape of data-driven advertising and media measurement.

To take their innovations a step further, VideoAmp is building a brand-new analytics solution powered by generative AI, which will provide their customers with accessible business insights. Their goal for a beta product is to create a conversational AI assistant powered by large language models (LLMs) that allows VideoAmp’s data analysts and non-technical users such as content researchers and publishers to perform data analytics using natural language queries.

Use case overview

VideoAmp is undergoing a transformative journey by integrating generative AI into its analytics. The company aims to revolutionize how customers, including publishers, media agencies, and brands, interact with and derive insights from VideoAmp’s vast repository of data through a conversational AI assistant interface.

Presently, analysis by data scientists and analysts is done manually, requires technical SQL knowledge, and can be time-consuming for complex and high-dimensional datasets. Acknowledging the necessity for streamlined and accessible processes, VideoAmp worked with the GenAIIC to develop an AI assistant capable of comprehending natural language queries, generating and executing SQL queries on VideoAmp’s data warehouse, and delivering natural language summaries of retrieved information. The assistant allows non-technical users to surface data-driven insights, and it reduces research and analysis time for both technical and non-technical users.

Key success criteria for the project included:

  • The ability to convert natural language questions into SQL statements, connect to VideoAmp’s provided database, execute statements on VideoAmp performance metrics data, and create a natural language summary of results
  • A UI to ask natural language questions and view assistant output, which includes generated SQL queries, reasoning for the SQL statements, retrieved data, and natural language data summaries
  • Conversational support for the user to iteratively refine and filter asked questions
  • Low latency and cost-effectiveness
  • An automated evaluation pipeline to assess the quality and accuracy of the assistant

The team overcame a few challenges during the development process:

  • Adapting LLMs to understand the domain aspects of VideoAmp’s dataset – The dataset included highly industry-specific fields and metrics, and required complex queries to effectively filter and analyze. The queries often involved multiple specialized metric calculations, filters selecting from over 30 values, and extensive grouping and ordering.
  • Developing an automated evaluation pipeline – The pipeline is able to correctly identify if generated outputs are equivalent to ground truth data, even if they have different column aliasing, ordering, and metric calculations.

Solution overview

The GenAIIC team worked with VideoAmp to create an AI assistant that used Anthropic’s Claude 3 LLMs through Amazon Bedrock. Amazon Bedrock was chosen for this project because it provides access to high-quality foundation models (FMs), including Anthropic’s Claude 3 series, through a unified API. This allowed the team to quickly integrate the most suitable models for different components of the solution, such as SQL generation and data summarization.

Additional features in Amazon Bedrock, including Amazon Bedrock Prompt Management, native support for Retrieval Augmented Generation (RAG) and structured data retrieval through Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and fine-tuning, enable VideoAmp to quickly expand the analytics solution and take it to production. Amazon Bedrock also offers robust security and adheres to compliance certifications, allowing VideoAmp to confidently expand their AI analytics solution while maintaining data privacy and adhering to industry standards.

The solution is connected to a data warehouse. It supports a variety of database connections, such as Snowflake, SingleStore, PostgreSQL, Excel and CSV files, and more. The following diagram illustrates the high-level workflow of the solution.

A diagram illustrating the high-level workflow of VideoAmp's Natural Language Analytics solution

The workflow consists of the following steps:

  1. The user navigates to the frontend application and asks a question in natural language.
  2. A Question Rewriter LLM component uses previous conversational context to augment the question with additional details if applicable. This allows follow-up questions and refinements to previous questions.
  3. A Text-to-SQL LLM component creates a SQL query that corresponds to the user question.
  4. The SQL query is executed in the data warehouse.
  5. A Data-to-Text LLM component summarizes the retrieved data for the user.

The rewritten question, generated SQL, reasoning, and retrieved data are returned at each step.
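The steps above can be sketched as a small orchestration function. This is a minimal illustration rather than VideoAmp's implementation: the four components are passed in as plain callables, and the lambdas at the bottom are stubs standing in for the LLM calls and the data warehouse. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class AssistantResult:
    """Everything the workflow returns to the frontend."""
    rewritten_question: str
    sql: str
    reasoning: str
    rows: List[tuple]
    summary: str


def answer_question(
    question: str,
    history: Sequence[str],
    rewrite: Callable[[str, Sequence[str]], str],
    text_to_sql: Callable[[str], Tuple[str, str]],
    run_sql: Callable[[str], List[tuple]],
    summarize: Callable[[str, List[tuple]], str],
) -> AssistantResult:
    rewritten = rewrite(question, history)    # step 2: Question Rewriter
    sql, reasoning = text_to_sql(rewritten)   # step 3: Text-to-SQL
    rows = run_sql(sql)                       # step 4: execute in the warehouse
    summary = summarize(rewritten, rows)      # step 5: Data-to-Text
    return AssistantResult(rewritten, sql, reasoning, rows, summary)


# Stub components so the sketch runs without an LLM or a database.
result = answer_question(
    "Same data for one year later",
    ["Top 10 shows for last week?"],
    rewrite=lambda q, h: f"{q} (context: {h[-1]})" if h else q,
    text_to_sql=lambda q: ("SELECT 1", "placeholder reasoning"),
    run_sql=lambda sql: [(1,)],
    summarize=lambda q, rows: f"{len(rows)} row(s) returned",
)
print(result.summary)
```

Passing the components in as callables keeps each stage independently testable and makes it straightforward to use a different model per stage.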

AI assistant workflow details

In this section, we discuss the components of the AI assistant workflow in more detail.

Rewriter

After the user asks the question, the current question and the previous questions the user asked in the current session are sent to the Question Rewriter component, which uses Anthropic’s Claude 3 Sonnet model. If deemed necessary, the LLM uses context from the previous questions to augment the current user question to make it a standalone question with context included. This enables multi-turn conversational support for the user, allowing for natural interactions with the assistant.

For example, if a user first asked, “For the week of 09/04/2023 – 09/10/2023, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”, followed by, “Can I have the same data for one year later”, the rewriter would rewrite the latter question as “For the week of 09/03/2024 – 09/09/2024, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”

Text-to-SQL

The rewritten user question is sent to the Text-to-SQL component, which also uses Anthropic’s Claude 3 Sonnet model. The Text-to-SQL component uses information about the database in its prompt to generate a SQL query corresponding to the user question. It also generates an explanation of the query.

The text-to-SQL prompt addressed several challenges, such as industry-specific language in user questions, complex metrics, and several rules and defaults for filtering. The prompt was developed through several iterations, based on feedback and guidance from the VideoAmp team, and manual and automated evaluation.

The prompt consisted of four overarching sections: context, SQL instructions, task, and examples. During the development phase, database schema and domain- or task-specific knowledge were found to be critical, so one major part of the prompt was designed to incorporate them in the context. To make this solution reusable and scalable, a modularized design of the prompt/input system was employed, making it generic enough to apply to other use cases and domains. The solution can support Q&A with multiple databases by dynamically switching the corresponding context with an orchestrator if needed.

The context section contains the following details:

  • Database schema
  • Sample categories for relevant data fields such as television networks to aid the LLM in understanding what fields to use for identifiers in the question
  • Industry term definitions
  • How to calculate different types of metrics or aggregations
  • Default values or fields to select if not specified
  • Other domain- or task-specific knowledge

The SQL instructions contain the following details:

  • Dynamic insertion of today’s date as a reference for terms, such as “last 3 quarters”
  • Instructions on usage of sub-queries
  • Instructions on when to retrieve additional informational columns not specified in the user question
  • Known SQL syntax and database errors to avoid and potential fixes

In the task section, the LLM is given a detailed step-by-step process to formulate SQL queries based on the context. A step-by-step process is required for the LLM to correctly think through and assimilate the required context and rules. Without the step-by-step process, the team found that the LLM wouldn’t adhere to all instructions provided in the previous sections.

In the examples section, the LLM is given several examples of user questions, corresponding SQL statements, and explanations.

In addition to iterating on the prompt content, different content organization patterns were tested because of the long context. The final prompt was organized with markdown and XML.
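A modular prompt of this shape could be assembled along the following lines. This is a sketch under stated assumptions: the post does not publish the actual prompt, so the XML tag names, the function names, and the sample content are all hypothetical. Only the four-section layout and the dynamic insertion of today's date come from the text.

```python
from datetime import date


def tag(name, body):
    """Wrap a prompt section in an XML-style tag (tag names are assumptions)."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_prompt(context, sql_instructions, task, examples, question, today):
    # Insert today's date so relative terms like "last 3 quarters" can be resolved.
    instructions = f"Today's date is {today.isoformat()}.\n{sql_instructions}"
    example_text = "\n\n".join(
        f"Question: {q}\nSQL: {sql}\nExplanation: {why}"
        for q, sql, why in examples
    )
    sections = [
        tag("context", context),
        tag("sql_instructions", instructions),
        tag("task", task),
        tag("examples", example_text),
        tag("question", question),
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    context="## Schema\nshows(name TEXT, viewers INTEGER)",
    sql_instructions="Use sub-queries only when needed.",
    task="1. Identify the fields. 2. Apply defaults. 3. Write the SQL.",
    examples=[("Top show?", "SELECT name FROM shows LIMIT 1", "Picks one row.")],
    question="Which show had the most viewers?",
    today=date(2024, 1, 15),
)
print(prompt)
```

Because the context is an ordinary argument, an orchestrator could swap in a different schema and set of definitions per database without touching the other sections.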

SQL execution

After the Text-to-SQL component outputs a query, the query is executed against VideoAmp’s data warehouse using database connector code. For this use case, only read queries for analytics are executed to protect the database from unexpected operations like updates or deletes. The credentials for the database are securely stored and accessed using AWS Secrets Manager and AWS Key Management Service (AWS KMS).
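One simple way to enforce a read-only policy in connector code is to check each generated statement before it runs. The sketch below is a conservative heuristic, not the connector described in the post: it uses an in-memory SQLite database as a stand-in for the data warehouse, and the table and function names are hypothetical. In practice, a guard like this would complement read-only database credentials rather than replace them.

```python
import sqlite3

READ_ONLY_PREFIXES = ("select", "with")


def is_read_only(sql):
    """Heuristic: accept a single statement that starts with SELECT or WITH."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # Reject anything that looks like multiple statements.
        return False
    return statement.lower().startswith(READ_ONLY_PREFIXES)


def run_read_only(conn, sql):
    """Execute the query only if it passes the read-only check."""
    if not is_read_only(sql):
        raise ValueError("only single read-only statements are allowed")
    return conn.execute(sql).fetchall()


# In-memory SQLite database standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (name TEXT, viewers INTEGER)")
conn.executemany("INSERT INTO shows VALUES (?, ?)", [("A", 5), ("B", 10)])

rows = run_read_only(conn, "SELECT name FROM shows ORDER BY viewers DESC;")
print(rows)
```

The check is deliberately strict: it also rejects harmless queries that contain a semicolon inside a string literal, which is usually an acceptable trade-off for generated SQL.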

Data-to-Text

The data retrieved by the SQL query is sent to the Data-to-Text component, along with the rewritten user question. The Data-to-Text component, which uses Anthropic’s Claude 3 Haiku model, produces a concise summary of the retrieved data and answers the user question.

The final outputs are displayed on the frontend application as shown in the following screenshots (protected data is hidden).

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

Evaluation framework workflow details

The GenAIIC team developed a sophisticated automated evaluation pipeline for VideoAmp’s NL Analytics Chatbot, which directly informed prompt optimization and solution improvements and was a critical component in providing high-quality results.

The evaluation framework comprises two categories:

  • SQL query evaluation – Generated SQL queries are evaluated for overall closeness to the ground truth SQL query. A key feature of the SQL evaluation component was the ability to account for column aliasing and ordering differences when comparing statements and determine equivalency.
  • Retrieved data evaluation – The retrieved data is compared to ground truth data to determine an exact match, after a few processing steps to account for column, formatting, and system differences.

The evaluation pipeline also produces detailed reports of the results and discrepancies between generated data and ground truth data.
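The retrieved-data comparison can be illustrated with a small sketch. This is not the GenAIIC pipeline, whose processing steps are not published; it is one possible heuristic, with hypothetical function names. It ignores column names entirely, normalizes formatting, reorders columns by their contents, and ignores row order, which is itself an assumption that would not suit ranking queries where order matters.

```python
def normalize_value(value):
    """Smooth over formatting differences between systems."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return round(float(value), 6)
    if isinstance(value, str):
        return value.strip().lower()
    return value


def canonical_table(rows):
    """Return rows with columns and rows in a content-based canonical order."""
    rows = [tuple(normalize_value(v) for v in row) for row in rows]
    if not rows:
        return []
    n_cols = len(rows[0])  # assumes every row has the same number of columns

    def column_key(i):
        # A column is identified by its sorted values, not by its alias.
        return sorted(repr(row[i]) for row in rows)

    order = sorted(range(n_cols), key=column_key)
    reordered = [tuple(row[i] for i in order) for row in rows]
    return sorted(reordered, key=repr)


def tables_match(generated, expected):
    return canonical_table(generated) == canonical_table(expected)


generated = [("NBC", 10.0), ("ABC", 5)]
expected = [(5.0, "abc"), (10, "nbc ")]
print(tables_match(generated, expected))
```

Two columns with identical contents are interchangeable under this scheme, so the heuristic can only confirm equivalence up to that ambiguity.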

Dataset

The dataset used for the prototype solution was hosted in a data warehouse and consisted of performance metrics data such as viewership, ratings, and rankings for television networks and programs. The field names were industry-specific, so a data dictionary was included in the text-to-SQL prompt as part of the schema. The credentials for the database are securely stored and accessed using Secrets Manager and AWS KMS.

Results

A set of test questions was evaluated by the GenAIIC and VideoAmp teams, focusing on three metrics:

  • Accuracy – Different accuracy metrics were analyzed, but exact matches between retrieved data and ground truth data were prioritized
  • Latency – LLM generation latency, excluding the time taken to query the database
  • Cost – Average cost per user question

Both the evaluation pipeline and human review reported high accuracy on the dataset, while costs and latencies remained low. Overall, the results were well-aligned with VideoAmp expectations. VideoAmp anticipates this solution will make it simple for users to handle complex data queries with confidence through intuitive natural language interactions, reducing the time to business insights.

Conclusion

In this post, we shared how the GenAIIC team worked with VideoAmp to build a prototype of the VideoAmp NL Analytics Chatbot, an end-to-end generative AI data analytics interface using Amazon Bedrock and Anthropic’s Claude 3 LLMs. The solution is equipped with a variety of state-of-the-art LLM-based techniques, such as question rewriting, text-to-SQL query generation, and summarization of data in natural language. It also includes an automated evaluation module for evaluating the correctness of generated SQL statements and retrieved data. The solution achieved high accuracy on VideoAmp’s evaluation samples. Users can interact with the solution through an intuitive AI assistant interface with conversational capabilities.

VideoAmp will soon be launching their new generative AI-powered analytics interface, which enables customers to analyze data and gain business insights through natural language conversation. Their successful work with the GenAIIC team will allow VideoAmp to use generative AI technology to swiftly deliver valuable insights for both technical and non-technical customers.

This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases. The GenAIIC is a group of science and strategy experts with comprehensive expertise spanning the generative AI journey, helping you prioritize use cases, build a roadmap, and move solutions into production. If you’re interested in working with the GenAIIC, reach out to them today.


About the authors

Suzanne Willard is the VP of Engineering at VideoAmp, where she founded and leads the GenAI program, establishing the strategic vision and execution roadmap. With over 20 years of experience, she is driving innovation in AI technologies, creating transformative solutions that align with business objectives and set the company apart in the market.

Makoto Uchida is a senior architect at VideoAmp in the AI domain, acting as area technical lead of AI portfolio, responsible for defining and driving AI product and technical strategy in the content and ads measurement platform PaaS product. Previously, he was a software engineering lead in generative and predictive AI Platform at a major hyperscaler public Cloud service. He has also engaged with multiple startups, laying the foundation of Data/ML/AI infrastructures.

Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.

Long Chen is a Sr. Applied Scientist at AWS Generative AI Innovation Center. He holds a Ph.D. in Applied Physics from University of Michigan – Ann Arbor. With more than a decade of experience for research and development, he works on innovative solutions in various domains using generative AI and other machine learning techniques, ensuring the success of AWS customers. His interest includes generative models, multi-modal systems and graph learning.

Amaran Asokkumar is a Deep Learning Architect at AWS, specializing in infrastructure, automation, and AI. He leads the design of GenAI-enabled solutions across industry segments. Amaran is passionate about all things AI and helping customers accelerate their GenAI exploration and transformation efforts.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.


Scaling up image segmentation across data and tasks


Novel architecture that fuses learnable queries and conditional queries improves a segmentation model’s ability to transfer across tasks.

Computer vision

June 12, 12:25 PM

The first draft of this blog post was generated by Amazon Nova Pro, based on detailed instructions from Amazon Science editors and multiple examples of prior posts.

In a paper we’re presenting at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we introduce a new approach to image segmentation that scales across diverse datasets and tasks. Traditional segmentation models, while effective on isolated tasks, often struggle as the number of new tasks or unfamiliar scenarios grows. Our proposed method, which uses a model we call a mixed-query transformer (MQ-Former), aims to enable joint training and evaluation across multiple tasks and datasets.

Scalable segmentation

Image segmentation is a computer vision task that involves partitioning an image into distinct regions or segments. Each segment corresponds to a different object or part of the scene. There are several types of segmentation tasks, including foreground/background segmentation (distinguishing objects at different distances), semantic segmentation (labeling each pixel as belonging to a particular object class), and instance segmentation (identifying each pixel as belonging to a particular instance of an object class).

An example of instance segmentation, given the text prompt “cardinal”.
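To make the difference between semantic and instance segmentation concrete, here is a minimal toy sketch in plain Python. The 3x4 grid, the instance ids, and the class names are all hypothetical values chosen for illustration; they do not come from the paper. The sketch only shows that two objects of the same class share one semantic label while keeping separate instance labels.

```python
# Toy 3x4 "image": each entry is a hypothetical instance id (0 = background).
instance_map = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 2],
]

# Hypothetical mapping from instance id to object class.
instance_to_class = {0: "background", 1: "bird", 2: "bird"}

# Semantic segmentation keeps only the class of each pixel.
semantic_map = [[instance_to_class[i] for i in row] for row in instance_map]


def pixels_per_label(label_map):
    """Count how many pixels carry each label in a 2D label map."""
    counts = {}
    for row in label_map:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
    return counts


semantic_counts = pixels_per_label(semantic_map)   # both birds merge into one label
instance_counts = pixels_per_label(instance_map)   # each bird keeps its own id
```

In the semantic map the two birds are indistinguishable, whereas the instance map still separates them, which is the extra information an instance segmentation model has to predict.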

Scalability means that a segmentation model can effectively improve with an increase in the size of its training dataset, in the diversity of the tasks it performs, or both. Most prior research has focused on one or the other: data diversity or task diversity. We address both at once.

A tale of two queries

In our paper, we show that one issue preventing effective scalability in segmentation models is the design of object queries. An object query is a way of representing a hypothesis about objects in a scene, a hypothesis that can be tested against images.

There are two main types of object queries. The first, which we refer to as learnable queries, are learned vectors that interact with image features and encode information about location and object class. Learnable queries tend to perform well on semantic segmentation, as they do not contain object-specific priors.

The second type of object query, which we refer to as a conditional query, is akin to two-stage object detection: region proposals are generated by a Transformer encoder, and then high-confidence proposals are fed into the Transformer decoder as queries to generate the final prediction. Conditional queries are closely aligned with the object classes and excel at object detection and instance segmentation on semantically well-defined objects.

Our approach is to combine both types of queries, which improves the model’s ability to transfer across tasks. Our MQ-Former model represents inputs using both learnable queries and conditional queries, and every layer of the decoder has a cross-attention mechanism, so that the processing of the learnable queries can factor in information from the conditional-query processing, and vice versa.

Architectural schematics for learnable queries, conditional queries, and mixed queries. Solid triangles represent instance segmentation ground truth, and striped triangles represent semantic-segmentation ground truth.
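The idea of mixing the two query types can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the MQ-Former implementation: the vectors, scores, and function names are hypothetical, conditional queries are approximated by keeping the top-k highest-scoring encoder proposals, and the exchange of information between query types is approximated by a single attention step in which every query attends over the full mixed set.

```python
import math


def top_k_proposals(proposals, scores, k):
    """Conditional queries: keep the k highest-scoring encoder proposals."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [proposals[i] for i in ranked[:k]]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """One scaled dot-product attention step; keys and values share the same vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        weights = softmax(logits)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy values; in a real model these would be learned or computed.
learnable_queries = [[1.0, 0.0], [0.0, 1.0]]            # fixed, learned vectors
encoder_proposals = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # region proposals
proposal_scores = [0.95, 0.30, 0.70]                     # proposal confidences

conditional_queries = top_k_proposals(encoder_proposals, proposal_scores, k=2)
mixed_queries = learnable_queries + conditional_queries

# Each updated query is a weighted mix of all queries, so learnable queries
# can draw on conditional ones and vice versa.
updated_queries = attend(mixed_queries, mixed_queries)
```

Concatenating the two sets before attention is one simple way to let each type of query see the other; the actual decoder design may differ.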

Leveraging synthetic data

Mixed queries aid scalability across segmentation tasks, but the other aspect of scalability in segmentation models is dataset size. One of the key challenges in scaling up segmentation models is the scarcity of high-quality, annotated data. To overcome this limitation, we propose leveraging synthetic data.

Examples of synthetic data. At left are two examples of synthetic masks, at right two examples of synthetic captions.

While segmentation data is scarce, object recognition data is plentiful. Object recognition datasets typically include bounding boxes, or rectangles that identify the image regions in which labeled objects can be found.

Asking a trained segmentation model to segment only the object within a bounding box significantly improves performance; we are thus able to use weaker segmentation models to convert object recognition datasets into segmentation datasets that can be used to train stronger segmentation models.
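A rough sketch of this box-prompted pseudo-labeling step follows, again in plain Python. The per-pixel scores stand in for the output of a weaker segmentation model, and the box is applied as a simple restriction on where mask pixels are allowed; both the data and the `box_prompted_mask` helper are hypothetical, and the post does not specify how the bounding box is actually supplied to the model.

```python
def box_prompted_mask(scores, box, threshold=0.5):
    """Threshold per-pixel scores, keeping only pixels inside the bounding box.

    scores: 2D list of foreground scores from a (hypothetical) weak segmenter.
    box: (x0, y0, x1, y1) with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    height, width = len(scores), len(scores[0])
    return [
        [
            1 if (y0 <= y < y1 and x0 <= x < x1 and scores[y][x] >= threshold) else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical 4x4 score map with a spurious high score at the top-left corner.
scores = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.7, 0.1],
    [0.1, 0.6, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.2],
]
box = (1, 1, 3, 3)  # bounding box taken from an object recognition dataset

pseudo_mask = box_prompted_mask(scores, box)
```

The false positive outside the box is discarded, which illustrates why a box constraint can make a weaker model's output clean enough to serve as training data for a stronger one.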

Bounding boxes can also focus automatic captioning models on regions of interest in an image, to provide the type of object classifications necessary to train semantic-segmentation and instance-segmentation models.

Experimental results

We evaluated our approach using 15 datasets covering a range of segmentation tasks and found that, with MQ-Former, scaling up both the volume of training data and the diversity of tasks consistently enhances the model’s segmentation capabilities.

For example, on the SeginW benchmark, which includes 25 datasets used for open-vocabulary in-the-wild segmentation evaluation, scaling the data and tasks from 100,000 samples to 600,000 boosted performance 16%, as measured by average precision of object masking. Incorporating synthetic data improved performance by another 14%, establishing a new state of the art.

Research areas: Computer vision

Tags: Image segmentation, Data representation


How AI is reshaping the future of healthcare and medical research

In November 2022, OpenAI’s ChatGPT kick-started a new era in AI. This was followed less than a half year later by the release of GPT-4. In the months leading up to GPT-4’s public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, The AI Revolution in Medicine, Revisited, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee. 

In this episode, Microsoft co-founder and Gates Foundation Chair Bill Gates and OpenAI research lead Sébastien Bubeck, formerly Microsoft’s VP of AI, join Lee to discuss how they’re seeing generative AI’s adoption in healthcare unfolding globally and the opportunities for further adoption, such as the development of proper benchmarks. Together, the three use insights drawn from unparalleled access to the continuing evolution of AI to explore the yet untapped potential of the technology to empower clinicians and patients alike and talk about the urgency to create AI-driven healthcare systems in underserved countries. They also reflect on the distinction between healthcare delivery and healthcare discovery and how the type and pace of change brought on by AI may differ for each.

Transcript

[MUSIC]     

[BOOK PASSAGE]  

PETER LEE: “In ‘The Little Black Bag,’ a classic science fiction story, a high-tech doctor’s kit of the future is accidentally transported back to the 1950s, into the shaky hands of a washed-up, alcoholic doctor. The ultimate medical tool, it redeems the doctor wielding it, allowing him to practice gratifyingly heroic medicine. … The tale ends badly for the doctor and his treacherous assistant, but it offered a picture of how advanced technology could transform medicine—powerful when it was written nearly 75 years ago and still so today. What would be the AI equivalent of that little black bag? At this moment when new capabilities are emerging, how do we imagine them into medicine?”

[END OF BOOK PASSAGE]    

[THEME MUSIC]    

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee.   

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?    

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here.  

[THEME MUSIC FADES]


The book passage I read at the top is from “Chapter 10: The Big Black Bag.” 

In imagining AI in medicine, Carey, Zak, and I included in our book two fictional accounts. In the first, a medical resident consults GPT-4 on her personal phone as the patient in front of her crashes. Within seconds, it offers an alternate response based on recent literature. In the second account, a 90-year-old woman with several chronic conditions is living independently and receiving near-constant medical support from an AI aide.   

In our conversations with the guests we’ve spoken to so far, we’ve caught a glimpse of these predicted futures, seeing how clinicians and patients are actually using AI today and how developers are leveraging the technology in the healthcare products and services they’re creating. In fact, that first fictional account isn’t so fictional after all, as most of the doctors in the real world actually appear to be using AI at least occasionally—and sometimes much more than occasionally—to help in their daily clinical work. And as for the second fictional account, which is more of a science fiction account, it seems we are indeed on the verge of a new way of delivering and receiving healthcare, though the future is still very much open. 

As we continue to examine the current state of AI in healthcare and its potential to transform the field, I’m pleased to welcome Bill Gates and Sébastien Bubeck.  

Bill may be best known as the co-founder of Microsoft, having created the company with his childhood friend Paul Allen in 1975. He’s now the founder of Breakthrough Energy, which aims to advance clean energy innovation, and TerraPower, a company developing groundbreaking nuclear energy and science technologies. He also chairs the world’s largest philanthropic organization, the Gates Foundation, and focuses on solving a variety of health challenges around the globe and here at home. 

Sébastien is a research lead at OpenAI. He was previously a distinguished scientist, vice president of AI, and a colleague of mine here at Microsoft, where his work included spearheading the development of the family of small language models known as Phi. While at Microsoft, he also coauthored the discussion-provoking 2023 paper “Sparks of Artificial General Intelligence,” which presented the results of early experiments with GPT-4 conducted by a small team from Microsoft Research.   

[TRANSITION MUSIC]  

Here’s my conversation with Bill Gates and Sébastien Bubeck. 

LEE: Bill, welcome. 

BILL GATES: Thank you. 

LEE: Seb … 

SÉBASTIEN BUBECK: Yeah. Hi, hi, Peter. Nice to be here. 

LEE: You know, one of the things that I’ve been doing just to get the conversation warmed up is to talk about origin stories, and what I mean about origin stories is, you know, what was the first contact that you had with large language models or the concept of generative AI that convinced you or made you think that something really important was happening? 

And so, Bill, I think I’ve heard the story about, you know, the time when the OpenAI folks—Sam Altman, Greg Brockman, and others—showed you something, but could we hear from you what those early encounters were like and what was going through your mind?  

GATES: Well, I’d been visiting OpenAI soon after it was created to see things like GPT-2 and to see the little arm they had that was trying to match human manipulation and, you know, looking at their games like Dota that they were trying to get as good as human play. And honestly, I didn’t think the language model stuff they were doing, even when they got to GPT-3, would show the ability to learn, you know, in the same sense that a human reads a biology book and is able to take that knowledge and access it not only to pass a test but also to create new medicines. 

And so my challenge to them was that if their LLM could get a five on the advanced placement biology test, then I would say, OK, it took biologic knowledge and encoded it in an accessible way and that I didn’t expect them to do that very quickly but it would be profound.  

And it was only about six months after I challenged them to do that, that an early version of GPT-4 they brought up to a dinner at my house, and in fact, it answered most of the questions that night very well. The one it got totally wrong, we were … because it was so good, we kept thinking, Oh, we must be wrong. It turned out it was a math weakness [LAUGHTER] that, you know, we later understood that that was an area of, weirdly, of incredible weakness of those early models. But, you know, that was when I realized, OK, the age of cheap intelligence was at its beginning. 

LEE: Yeah. So I guess it seems like you had something similar to me in that my first encounters, I actually harbored some skepticism. Is it fair to say you were skeptical before that? 

GATES: Well, the idea that we’ve figured out how to encode and access knowledge in this very deep sense without even understanding the nature of the encoding, … 

LEE: Right.  

GATES: … that is a bit weird.  

LEE: Yeah. 

GATES: We have an algorithm that creates the computation, but even say, OK, where is the president’s birthday stored in there? Where is this fact stored in there? The fact that even now when we’re playing around, getting a little bit more sense of it, it’s opaque to us what the semantic encoding is, it’s, kind of, amazing to me. I thought the invention of knowledge storage would be an explicit way of encoding knowledge, not an implicit statistical training. 

LEE: Yeah, yeah. All right. So, Seb, you know, on this same topic, you know, I got—as we say at Microsoft—I got pulled into the tent. [LAUGHS] 

BUBECK: Yes.  

LEE: Because this was a very secret project. And then, um, I had the opportunity to select a small number of researchers in MSR [Microsoft Research] to join and start investigating this thing seriously. And the first person I pulled in was you.

BUBECK: Yeah. 

LEE: And so what were your first encounters? Because I actually don’t remember what happened then. 

BUBECK: Oh, I remember it very well. [LAUGHS] My first encounter with GPT-4 was in a meeting with the two of you, actually. But my kind of first contact, the first moment where I realized that something was happening with generative AI, was before that. And I agree with Bill that I also wasn’t too impressed by GPT-3. 

I thought that it was kind of, you know, very naturally mimicking the web, sort of parroting what was written there in a nice way. Still in a way which seemed very impressive. But it wasn’t really intelligent in any way. But shortly after GPT-3, there was a model before GPT-4 that really shocked me, and this was the first image generation model, DALL-E 1.

So that was in 2021. And I will forever remember the press release of OpenAI where they had this prompt of an avocado chair and then you had this image of the avocado chair. [LAUGHTER] And what really shocked me is that clearly the model kind of “understood” what is a chair, what is an avocado, and was able to merge those concepts. 

So this was really, to me, the first moment where I saw some understanding in those models.  

LEE: So this was, just to get the timing right, that was before I pulled you into the tent. 

BUBECK: That was before. That was like a year before. 

LEE: Right.  

BUBECK: And now I will tell you how, you know, we went from that moment to the meeting with the two of you and GPT-4. 

So once I saw this kind of understanding, I thought, OK, fine. It understands concept, but it’s still not able to reason. It cannot—as, you know, Bill was saying—it cannot learn from your document. It cannot reason.  

So I set out to try to prove that. You know, this is what I was in the business of at the time, trying to prove things in mathematics. So I was trying to prove that basically autoregressive transformers could never reason. So I was trying to prove this. And after a year of work, I had something reasonable to show. And so I had the meeting with the two of you, and I had this example where I wanted to say, there is no way that an LLM is going to be able to do x.

And then as soon as I … I don’t know if you remember, Bill. But as soon as I said that, you said, oh, but wait a second. I had, you know, the OpenAI crew at my house recently, and they showed me a new model. Why don’t we ask this new model this question?  

LEE: Yeah.  

BUBECK: And we did, and it solved it on the spot. And that really, honestly, just changed my life. Like, you know, I had been working for a year trying to say that this was impossible. And just right there, it was shown to be possible.  

LEE: [LAUGHS] One of the very first things I got interested in—because I was really thinking a lot about healthcare—was healthcare and medicine. 

And I don’t know if the two of you remember, but I ended up doing a lot of tests. I ran through, you know, step one and step two of the US Medical Licensing Exam. Did a whole bunch of other things. I wrote this big report. It was, you know, I can’t remember … a couple hundred pages.  

And I needed to share this with someone. I didn’t … there weren’t too many people I could share it with. So I sent, I think, a copy to you, Bill. Sent a copy to you, Seb.  

I hardly slept for about a week putting that report together. And, yeah, and I kept working on it. But I was far from alone. I think everyone who was in the tent, so to speak, in those early days was going through something pretty similar. All right. So I think … of course, a lot of what I put in the report also ended up being examples that made it into the book. 

But the main purpose of this conversation isn’t to reminisce about [LAUGHS] or indulge in those reminiscences but to talk about what’s happening in healthcare and medicine. And, you know, as I said, we wrote this book. We did it very, very quickly. Seb, you helped. Bill, you know, you provided a review and some endorsements. 

But, you know, honestly, we didn’t know what we were talking about because no one had access to this thing. And so we just made a bunch of guesses. So really, the whole thing I wanted to probe with the two of you is, now with two years of experience out in the world, what, you know, what do we think is happening today? 

You know, is AI actually having an impact, positive or negative, on healthcare and medicine? And what do we now think is going to happen in the next two years, five years, or 10 years? And so I realize it’s a little bit too abstract to just ask it that way. So let me just try to narrow the discussion and guide us a little bit.  

Um, the kind of administrative and clerical work, paperwork, around healthcare—and we made a lot of guesses about that—that appears to be going well, but, you know, Bill, I know we’ve discussed that sometimes that you think there ought to be a lot more going on. Do you have a viewpoint on how AI is actually finding its way into reducing paperwork? 

GATES: Well, I’m stunned … I don’t think there should be a patient-doctor meeting where the AI is not sitting in and both transcribing, offering to help with the paperwork, and even making suggestions, although the doctor will be the one, you know, who makes the final decision about the diagnosis and whatever prescription gets done.  

It’s so helpful. You know, when that patient goes home and their, you know, son who wants to understand what happened has some questions, that AI should be available to continue that conversation. And the way you can improve that experience and streamline things and, you know, involve the people who advise you. I don’t understand why that’s not more adopted, because there you still have the human in the loop making that final decision. 

But even for, like, follow-up calls to make sure the patient did things, to understand if they have concerns and knowing when to escalate back to the doctor, the benefit is incredible. And, you know, that thing is ready for prime time. That paradigm is ready for prime time, in my view. 

LEE: Yeah, there are some good products, but it seems like the number one use right now—and we kind of got this from some of the previous guests in previous episodes—is the use of AI just to respond to emails from patients. [LAUGHTER] Does that make sense to you? 

BUBECK: Yeah. So maybe I want to second what Bill was saying but maybe take a step back first. You know, two years ago, like, the concept of clinical scribes, which is one of the things that we’re talking about right now, it would have sounded, in fact, it sounded two years ago, borderline dangerous. Because everybody was worried about hallucinations. What happened if you have this AI listening in and then it transcribes, you know, something wrong? 

Now, two years later, I think it’s mostly working. And in fact, it is not yet, you know, fully adopted. You’re right. But it is in production. It is used, you know, in many, many places. So this rate of progress is astounding because it wasn’t obvious that we would be able to overcome those obstacles of hallucination. It’s not to say that hallucinations are fully solved. In the case of the closed system, they are.  

Now, I think more generally what’s going on in the background is that there is something that we, that certainly I, underestimated, which is this management overhead. So I think the reason why this is not adopted everywhere is really a training and teaching aspect. People need to be taught, like, those systems, how to interact with them. 

And one example that I really like, a study that recently appeared where they tried to use ChatGPT for diagnosis and they were comparing doctors without and with ChatGPT. And the amazing thing … so this was a set of cases where the accuracy of the doctors alone was around 75%. ChatGPT alone was 90%. So that’s already kind of mind blowing. But then the kicker is that doctors with ChatGPT was 80%.

Intelligence alone is not enough. It’s also how it’s presented, how you interact with it. And ChatGPT, it’s an amazing tool. Obviously, I absolutely love it. But it’s not … you don’t want a doctor to have to type in, you know, prompts and use it that way. 

It should be, as Bill was saying, kind of running continuously in the background, sending you notifications. And you have to be really careful of the rate at which those notifications are being sent. Because if they are too frequent, then the doctor will learn to ignore them. So you have to … all of those things matter, in fact, at least as much as the level of intelligence of the machine. 

LEE: One of the things I think about, Bill, in that scenario that you described, doctors do some thinking about the patient when they write the note. So, you know, I’m always a little uncertain whether it’s actually … you know, you wouldn’t necessarily want to fully automate this, I don’t think. Or at least there needs to be some prompt to the doctor to make sure that the doctor puts some thought into what happened in the encounter with the patient. Does that make sense to you at all? 

GATES: At this stage, you know, I’d still put the onus on the doctor to write the conclusions and the summary and not delegate that. 

The tradeoffs you make a little bit are somewhat dependent on the situation you’re in. If you’re in Africa, where most people never meet a real doctor their entire life, the idea of being able to have some of this advice and diagnosis is extremely advantageous because you’re comparing it to nothing. 

So, yes, the doctor’s still going to have to do a lot of work, but just the quality of letting the patient and the people around them interact and ask questions and have things explained, that alone is such a quality improvement. It’s mind blowing.  

LEE: So since you mentioned, you know, Africa—and, of course, this touches on the mission and some of the priorities of the Gates Foundation and this idea of democratization of access to expert medical care—what’s the most interesting stuff going on right now? Are there people and organizations or technologies that are impressing you or that you’re tracking? 

GATES: Yeah. So the Gates Foundation has given out a lot of grants to people in Africa doing education, agriculture but more healthcare examples than anything. And the way these things start off, they often start out either being patient-centric in a narrow situation, like, OK, I’m a pregnant woman; talk to me. Or, I have infectious disease symptoms; talk to me. Or they’re connected to a health worker where they’re helping that worker get their job done. And we have lots of pilots out, you know, in both of those cases.  

The dream would be eventually to have the thing the patient consults be so broad that it’s like having a doctor available who understands the local things.  

LEE: Right.  

GATES: We’re not there yet. But over the next two or three years, you know, particularly given the worsening financial constraints against African health systems, where the withdrawal of money has been dramatic, you know, figuring out how to take this—what I sometimes call “free intelligence”—and build a quality health system around that, we will have to be more radical in low-income countries than any rich country is ever going to be.  

LEE: Also, there’s maybe a different regulatory environment, so some of those things maybe are easier? Because right now, I think the world hasn’t figured out how to and whether to regulate, let’s say, an AI that might give a medical diagnosis or write a prescription for a medication. 

BUBECK: Yeah. I think one issue with this, and it’s also slowing down the deployment of AI in healthcare more generally, is a lack of proper benchmark. Because, you know, you were mentioning the USMLE [United States Medical Licensing Examination], for example. That’s a great test to test human beings and their knowledge of healthcare and medicine. But it’s not a great test to give to an AI. 

It’s not asking the right questions. So finding what are the right questions to test whether an AI system is ready to give diagnosis in a constrained setting, that’s a very, very important direction, which to my surprise, is not yet accelerating at the rate that I was hoping for. 

LEE: OK, so that gives me an excuse to get more now into the core AI tech because something I’ve discussed with both of you is this issue of what are the right tests. And you both know the very first test I give to any new spin of an LLM is I present a patient, the results—a mythical patient—the results of my physical exam, my mythical physical exam. Maybe some results of some initial labs. And then I present or propose a differential diagnosis. And if you’re not in medicine, a differential diagnosis you can just think of as a prioritized list of the possible diagnoses that fit with all that data. And in that proposed differential, I always intentionally make two mistakes. 

I make a textbook technical error in one of the possible elements of the differential diagnosis, and I have an error of omission. And, you know, I just want to know, does the LLM understand what I’m talking about? And all the good ones out there do now. But then I want to know, can it spot the errors? And then most importantly, is it willing to tell me I’m wrong, that I’ve made a mistake?  

That last piece seems really hard for AI today. And so let me ask you first, Seb, because at the time of this taping, of course, there was a new spin of GPT-4o last week that became overly sycophantic. In other words, it was actually prone in that test of mine not only to not tell me I’m wrong, but it actually praised me for the creativity of my differential. [LAUGHTER] What’s up with that? 

BUBECK: Yeah, I guess it’s a testament to the fact that training those models is still more of an art than a science. So it’s a difficult job. Just to be clear with the audience, we have rolled back that [LAUGHS] version of GPT-4o, so now we don’t have the sycophant version out there. 

Yeah, no, it’s a really difficult question. It has to do … as you said, it’s very technical. It has to do with the post-training and how, like, where do you nudge the model? So, you know, there is this very classical by now technique called RLHF [reinforcement learning from human feedback], where you push the model in the direction of a certain reward model. So the reward model is just telling the model, you know, what behavior is good, what behavior is bad. 

But this reward model is itself an LLM, and, you know, Bill was saying at the very beginning of the conversation that we don’t really understand how those LLMs deal with concepts like, you know, where is the capital of France located? Things like that. It is the same thing for this reward model. We don’t know why it says that it prefers one output to another, and whether this is correlated with some sycophancy is, you know, something that we discovered basically just now. That if you push too hard in optimization on this reward model, you will get a sycophant model. 

So it’s kind of … what I’m trying to say is we became too good at what we were doing, and we ended up, in fact, in a trap of the reward model. 

LEE: I mean, you do want … it’s a difficult balance because you do want models to follow your desires and … 

BUBECK: It’s a very difficult, very difficult balance. 

LEE: So this brings up then the following question for me, which is the extent to which we think we’ll need to have specially trained models for things. So let me start with you, Bill. Do you have a point of view on whether we will need to, you know, quote-unquote take AI models to med school? Have them specially trained? Like, if you were going to deploy something to give medical care in underserved parts of the world, do we need to do something special to create those models? 

GATES: We certainly need to teach them the African languages and the unique dialects so that the multimedia interactions are very high quality. We certainly need to teach them the disease prevalence and unique disease patterns like, you know, neglected tropical diseases and malaria. So we need to gather a set of facts that somebody trying to go for a US customer base, you know, wouldn’t necessarily have that in there. 

Those two things are actually very straightforward because the additional training time is small. I’d say for the next few years, we’ll also need to do reinforcement learning about the context of being a doctor and how important certain behaviors are. Humans learn over the course of their life to some degree that, I’m in a different context and the way I behave in terms of being willing to criticize or be nice, you know, how important is it? Who’s here? What’s my relationship to them?  

Right now, these machines don’t have that broad social experience. And so if you know it’s going to be used for health things, a lot of reinforcement learning of the very best humans in that context would still be valuable. Eventually, the models will, having read all the literature of the world about good doctors, bad doctors, it’ll understand as soon as you say, “I want you to be a doctor diagnosing somebody.” All of the implicit reinforcement that fits that situation, you know, will be there.

LEE: Yeah.

GATES: And so I hope three years from now, we don’t have to do that reinforcement learning. But today, for any medical context, you would want a lot of data to reinforce tone, willingness to say things when, you know, there might be something significant at stake. 

LEE: Yeah. So, you know, something Bill said, kind of, reminds me of another thing that I think we missed, which is, the context also … and the specialization also pertains to different, I guess, what we still call “modes,” although I don’t know if the idea of multimodal is the same as it was two years ago. But, you know, what do you make of all of the hubbub around—in fact, within Microsoft Research, this is a big deal, but I think we’re far from alone—you know, medical images and vision, video, proteins and molecules, cell, you know, cellular data and so on. 

BUBECK: Yeah. OK. So there is a lot to say to everything … to the last, you know, couple of minutes. Maybe on the specialization aspect, you know, I think there is, hiding behind this, a really fundamental scientific question of whether eventually we have a singular AGI [artificial general intelligence] that kind of knows everything and you can just put, you know, explain your own context and it will just get it and understand everything. 

That’s one vision. I have to say, I don’t particularly believe in this vision. In fact, we humans are not like that at all. I think, hopefully, we are general intelligences, yet we have to specialize a lot. And, you know, I did myself a lot of RL, reinforcement learning, on mathematics. Like, that’s what I did, you know, spent a lot of time doing that. And I didn’t improve on other aspects. You know, in fact, I probably degraded in other aspects. [LAUGHTER] So it’s … I think it’s an important example to have in mind. 

LEE: I think I might disagree with you on that, though, because, like, doesn’t a model have to see both good science and bad science in order to be able to gain the ability to discern between the two? 

BUBECK: Yeah, no, that absolutely. I think there is value in seeing the generality, in having a very broad base. But then you, kind of, specialize on verticals. And this is where also, you know, open-weights model, which we haven’t talked about yet, are really important because they allow you to provide this broad base to everyone. And then you can specialize on top of it. 

LEE: So we have about three hours of stuff to talk about, but our time is actually running low.

BUBECK: Yes, yes, yes.  

LEE: So I think I want … there’s a more provocative question. It’s almost a silly question, but I need to ask it of the two of you, which is, is there a future, you know, where AI replaces doctors or replaces, you know, medical specialties that we have today? So what does the world look like, say, five years from now? 

GATES: Well, it’s important to distinguish healthcare discovery activity from healthcare delivery activity. We focused mostly on delivery. I think it’s very much within the realm of possibility that the AI is not only accelerating healthcare discovery but substituting for a lot of the roles of, you know, I’m an organic chemist, or I run various types of assays. I can see those, which are, you know, testable-output-type jobs but with still very high value, I can see, you know, some replacement in those areas before the doctor.  

The doctor, still understanding the human condition and long-term dialogues, you know, they’ve had a lifetime of reinforcement of that, particularly when you get into areas like mental health. So I wouldn’t say in five years, either people will choose to adopt it, but it will be profound that there’ll be this nearly free intelligence that can do follow-up, that can help you, you know, make sure you went through different possibilities. 

And so I’d say, yes, we’ll have doctors, but I’d say healthcare will be massively transformed in its quality and in efficiency by AI in that time period. 

LEE: Is there a comparison, useful comparison, say, between doctors and, say, programmers, computer programmers, or doctors and, I don’t know, lawyers? 

GATES: Programming is another one that has, kind of, a mathematical correctness to it, you know, and so the objective function that you’re trying to reinforce to, as soon as you can understand the state machines, you can have something that’s “checkable”; that’s correct. So I think programming, you know, which is weird to say, that the machine will beat us at most programming tasks before we let it take over roles that have deep empathy, you know, physical presence and social understanding in them. 

LEE: Yeah. By the way, you know, I fully expect in five years that AI will produce mathematical proofs that are checkable for validity, easily checkable, because they’ll be written in a proof-checking language like Lean or something but will be so complex that no human mathematician can understand them. I expect that to happen.  

I can imagine in some fields, like cellular biology, we could have the same situation in the future because the molecular pathways, the chemistry, biochemistry of human cells or living cells is as complex as any mathematics, and so it seems possible that we may be in a state where in wet lab, we see, Oh yeah, this actually works, but no one can understand why. 

BUBECK: Yeah, absolutely. I mean, I think I really agree with Bill’s distinction of the discovery and the delivery, and indeed, the discovery’s when you can check things, and at the end, there is an artifact that you can verify. You know, you can run the protocol in the wet lab and see [if you have] produced what you wanted. So I absolutely agree with that.  

And in fact, you know, we don’t have to talk five years from now. I don’t know if you know, but just recently, there was a paper that was published on a scientific discovery using o3-mini. So this is really amazing. And, you know, just very quickly, just so people know, it was about this statistical physics model, the frustrated Potts model, which has to do with coloring, and basically, the case of three colors, like, more than two colors was open for a long time, and o3 was able to reduce the case of three colors to two colors.  

LEE: Yeah. 

BUBECK: Which is just, like, astounding. And this is not … this is now. This is happening right now. So this is something that I personally didn’t expect it would happen so quickly, and it’s due to those reasoning models.  

Now, on the delivery side, I would add something more to it for the reason why doctors and, in fact, lawyers and coders will remain for a long time, and it’s because we still don’t understand how those models generalize. Like, at the end of the day, we are not able to tell you when they are confronted with a really new, novel situation, whether they will work or not. 

Nobody is able to give you that guarantee. And I think until we understand this generalization better, we’re not going to be willing to just let the system in the wild without human supervision. 

LEE: But don’t human doctors, human specialists … so, for example, a cardiologist sees a patient in a certain way that a nephrologist … 

BUBECK: Yeah.

LEE: … or an endocrinologist might not.

BUBECK: That’s right. But another cardiologist will understand and, kind of, expect a certain level of generalization from their peer. And this, we just don’t have it with AI models. Now, of course, you’re exactly right. That generalization is also hard for humans. Like, if you have a human trained for one task and you put them into another task, then you don’t … you often don’t know. But you have other examples. So if you have two humans that were trained on a task and you put them on another one, then you kind of expect that they will do the same on the other task. 

LEE: OK. You know, the podcast is focused on what’s happened over the last two years. But now, I’d like one provocative prediction about what you think the world of AI and medicine is going to be at some point in the future. You pick your timeframe. I don’t care if it’s two years or 20 years from now, but, you know, what do you think will be different about AI in medicine in that future than today? 

BUBECK: Yeah, I think the deployment is going to accelerate soon. Like, we’re really not missing very much. There is this enormous capability overhang. Like, even if progress completely stopped, with current systems, we can do a lot more than what we’re doing right now. So I think this will … this has to be realized, you know, sooner rather than later. 

And I think it’s probably dependent on these benchmarks and proper evaluation and tying this with regulation. So these are things that take time in human society and for good reason. But now we already are at two years; you know, give it another two years and it should be really …  

LEE: Will AI prescribe your medicines? Write your prescriptions? 

BUBECK: I think yes. I think yes. 

LEE: OK. Bill? 

GATES: Well, I think the next two years, we’ll have massive pilots, and so the amount of use of the AI, still in a copilot-type mode, you know, we should get millions of patient visits, you know, both in general medicine and in the mental health side, as well. And I think that’s going to build up both the data and the confidence to give the AI some additional autonomy. You know, are you going to let it talk to you at night when you’re panicked about your mental health with some ability to escalate?  

And, you know, I’ve gone so far as to tell politicians with national health systems that if they deploy AI appropriately, that the quality of care, the overload of the doctors, the improvement in the economics will be enough that their voters will be stunned because they just don’t expect this, and, you know, they could be reelected [LAUGHTER] just on this one thing of fixing what is a very overloaded and economically challenged health system in these rich countries. 

You know, my personal role is going to be to make sure that in the poorer countries, there isn’t some lag; in fact, in many cases, that we’ll be more aggressive because, you know, we’re comparing to having no access to doctors at all. And, you know, so I think whether it’s India or Africa, there’ll be lessons that are globally valuable because we need medical intelligence. And, you know, thank god AI is going to provide a lot of that. 

LEE: Well, on that optimistic note, I think that’s a good way to end. Bill, Seb, really appreciate all of this.  

I think the most fundamental prediction we made in the book is that AI would actually find its way into the practice of medicine, and I think that that at least has come true, maybe in different ways than we expected, but it’s come true, and I think it’ll only accelerate from here. So thanks again, both of you. 

[TRANSITION MUSIC] 

GATES: Yeah. Thanks, you guys. 

BUBECK: Thank you, Peter. Thanks, Bill. 

LEE: I just always feel such a sense of privilege to have a chance to interact and actually work with people like Bill and Sébastien.   

With Bill, I’m always amazed at how practically minded he is. He’s really thinking about the nuts and bolts of what AI might be able to do for people, and his thoughts about underserved parts of the world, the idea that we might actually be able to empower people with access to expert medical knowledge, I think is both inspiring and amazing.  

And then, Seb, Sébastien Bubeck, he’s just absolutely a brilliant mind. He has a really firm grip on the deep mathematics of artificial intelligence and brings that to bear in his research and development work. And where that mathematics takes him isn’t just into the nuts and bolts of algorithms but into philosophical questions about the nature of intelligence.  

One of the things that Sébastien brought up was the state of evaluation of AI systems. And indeed, he was fairly critical in our conversation. But of course, the world of AI research and development is just moving so fast, and indeed, since we recorded our conversation, OpenAI, in fact, released a new evaluation metric that is directly relevant to medical applications, and that is something called HealthBench. And Microsoft Research also released a new evaluation approach or process called ADeLe.  

HealthBench and ADeLe are examples of new approaches to evaluating AI models that are less about testing their knowledge and ability to pass multiple-choice exams and instead are evaluation approaches designed to assess how well AI models are able to complete tasks that actually arise every day in typical healthcare or biomedical research settings. These are examples of really important good work that speak to how well AI models work in the real world of healthcare and biomedical research and how well they can collaborate with human beings in those settings. 

You know, I asked Bill and Seb to make some predictions about the future. You know, my own answer, I expect that we’re going to be able to use AI to change how we diagnose patients, change how we decide treatment options.  

If you’re a doctor or a nurse and you encounter a patient, you’ll ask questions, do a physical exam, you know, call out for labs just like you do today, but then you’ll be able to engage with AI based on all of that data and just ask, you know, based on all the other people who have gone through the same experience, who have similar data, how were they diagnosed? How were they treated? What were their outcomes? And what does that mean for the patient I have right now? Some people call it the “patients like me” paradigm. And I think that’s going to become real because of AI within our lifetimes. That idea of really grounding the delivery in healthcare and medical practice through data and intelligence, I actually now don’t see any barriers to that future becoming real. 

[THEME MUSIC] 

I’d like to extend another big thank you to Bill and Sébastien for their time. And to our listeners, as always, it’s a pleasure to have you along for the ride. I hope you’ll join us for our remaining conversations, as well as a second coauthor roundtable with Carey and Zak.  

Until next time.  

[MUSIC FADES]

The post How AI is reshaping the future of healthcare and medical research appeared first on Microsoft Research.

Turn RTX ON With 40% Off Performance Day Passes

Level up GeForce NOW experiences this summer with 40% off Performance Day Passes. Enjoy 24 hours of premium cloud gaming with RTX ON, delivering low latency and shorter wait times.

The hot deal comes just in time for the cloud’s highly anticipated launch of Dune: Awakening — a multiplayer survival game on a massive scale set on the unforgiving sands of Arrakis.

It’s perfect to pair with the nine games available this week, including the Frosthaven demo announced at Steam Next Fest.

Try Before You Buy

Level up to the cloud, no commitment required. For a limited time, grab a Performance Day Pass at a price that’s less than an ice cream sundae and experience premium GeForce NOW gaming for 24 hours.

With RTX ON, enjoy shorter wait times and lower latency for supported games, all powered by the cloud. Dive into popular games with upgraded visuals and smoother gameplay than free users get, whether exploring vast open worlds or battling in fast-paced arenas.

Take the experience even further by applying the value of the Day Pass toward a six-month Performance membership during the limited-time summer sale. It’s the perfect way to try out premium cloud gaming before jumping into a longer-term membership.

Survive and Thrive

Join the fight for Arrakis.

Dune: Awakening, a multiplayer survival game on a massive scale from Funcom, is set on an ever-changing desert planet called Arrakis. Whether braving colossal sandworms, battling for spice or forging alliances, gamers can experience the spectacle of Arrakis with all the benefits of GeForce NOW.

Manage hydration, temperature and exposure while contending with deadly sandworms, sandstorms and rival factions. Blend skills-based third-person action combat — featuring ranged and melee weapons, gadgets and abilities — with deep crafting, base building and resource management. Explore and engage in large-scale player vs. player and player vs. environment battles while vying for control over territory and the precious spice.

The spice is flowing — and so is the power of the cloud. Stream it on GeForce NOW without waiting for lengthy downloads or worrying about hardware requirements. Dune: Awakening is available for members to stream from anywhere with the power of NVIDIA RTX for ultra-smooth gameplay and stunning visuals, even on low-powered devices.

Chill Out

Time to bundle up.

Experience the highly anticipated Frosthaven demo in the cloud during Steam Next Fest with GeForce NOW. For a limited time, dive into a preview of the game directly from the cloud — no high-end PC required.

Frosthaven — a dark fantasy tactical role-playing game from Snapshot Games and X-COM creator Julian Gollop — brings to life the board game of the same name. It features deep, turn-based combat, unique character classes, and single-player and online co-op modes.

Play the Frosthaven demo on virtually any device with GeForce NOW and experience the magic of gathering around a board game — now in the cloud. Enter the frozen north of Frosthaven, strategize with friends and dive into epic battles without the hassle of setup or cleanup. With GeForce NOW, game night is just a click away, wherever members are playing from.

Seize New Games

A new era of “Rainbow Six Siege” has begun.

Rainbow Six Siege X, the biggest evolution in the game’s history, is now available with free access for new players. It introduces a new 6v6 “Dual Front” game mode, where teams attack and defend simultaneously with respawns and new strategic objectives. R6 Siege X also brings new and improved gameplay features — such as modernized maps with enhanced visuals and lighting, new destructible environmental elements, advanced rappel, smoother movement, an audio overhaul and a communication wheel for precise strategic plays, as well as weapon inspections to showcase gamers’ favorite cosmetics.

Look for the following games available to stream in the cloud this week:

  • Frosthaven Demo (New release on Steam, June 9)
  • Dune: Awakening (New release on Steam, June 10)
  • MindsEye (New release on Steam, June 10)
  • Kingdom Two Crowns (New release on Xbox, available on PC Game Pass, June 11)
  • The Alters (New release on Steam and Xbox, available on PC Game Pass, June 13)
  • Lost in Random: The Eternal Die (New release on Steam and Xbox, June 13, available on PC Game Pass, June 17)
  • Firefighting Simulator – The Squad (Xbox, available on PC Game Pass)
  • JDM: Japanese Drift Master (Steam)
  • Hellslave (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

NVIDIA TensorRT Boosts Stable Diffusion 3.5 Performance on NVIDIA GeForce RTX and RTX PRO GPUs

Generative AI has reshaped how people create, imagine and interact with digital content.

As AI models continue to grow in capability and complexity, they require more VRAM, or video random access memory. The base Stable Diffusion 3.5 Large model, for example, uses over 18GB of VRAM — limiting the number of systems that can run it well.

By applying quantization to the model, noncritical layers can be removed or run with lower precision. NVIDIA GeForce RTX 40 Series and the Ada Lovelace generation of NVIDIA RTX PRO GPUs support FP8 quantization to help run these quantized models, and the latest-generation NVIDIA Blackwell GPUs also add support for FP4.

NVIDIA collaborated with Stability AI to quantize its latest model, Stable Diffusion (SD) 3.5 Large, to FP8 — reducing VRAM consumption by 40%. Further optimizations to SD3.5 Large and Medium with the NVIDIA TensorRT software development kit (SDK) double performance.

In addition, TensorRT has been reimagined for RTX AI PCs, combining its industry-leading performance with just-in-time (JIT), on-device engine building and an 8x smaller package size for seamless AI deployment to more than 100 million RTX AI PCs. TensorRT for RTX is now available as a standalone SDK for developers.

RTX-Accelerated AI

NVIDIA and Stability AI are boosting the performance and reducing the VRAM requirements of Stable Diffusion 3.5, one of the world’s most popular AI image models. With NVIDIA TensorRT acceleration and quantization, users can now generate and edit images faster and more efficiently on NVIDIA RTX GPUs.

Stable Diffusion 3.5 quantized FP8 (right) generates images in half the time with similar quality as FP16 (left). Prompt: A serene mountain lake at sunrise, crystal clear water reflecting snow-capped peaks, lush pine trees along the shore, soft morning mist, photorealistic, vibrant colors, high resolution.

To address the VRAM limitations of SD3.5 Large, the model was quantized with TensorRT to FP8, reducing the VRAM requirement by 40% to 11GB. This means five GeForce RTX 50 Series GPU models, rather than just one, can run the model from memory.
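As a rough check of the figures above, the arithmetic can be sketched in a few lines of Python. The 18 GB baseline is an assumption taken from the "over 18GB" figure quoted earlier, so the real footprint is slightly higher than what this prints.

```python
# Figures quoted in the text; the exact baseline is "over 18GB", so 18.0 is an assumption.
baseline_vram_gb = 18.0
reduction = 0.40  # FP8 quantization is reported to cut VRAM use by 40%

quantized_vram_gb = baseline_vram_gb * (1 - reduction)
print(f"Estimated FP8 footprint: {quantized_vram_gb:.1f} GB")
```

The result lands just under 11 GB, which is consistent with the rounded figure in the text.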

SD3.5 Large and Medium models were also optimized with TensorRT, an AI backend for taking full advantage of Tensor Cores. TensorRT optimizes a model’s weights and graph — the instructions on how to run a model — specifically for RTX GPUs.

FP8 TensorRT boosts SD3.5 Large performance by 2.3x vs. BF16 PyTorch, with 40% less memory use. For SD3.5 Medium, BF16 TensorRT delivers a 1.7x speedup.

Combined, FP8 TensorRT delivers a 2.3x performance boost on SD3.5 Large compared with running the original models in BF16 PyTorch, while using 40% less memory. And in SD3.5 Medium, BF16 TensorRT provides a 1.7x performance increase compared with BF16 PyTorch.

The optimized models are now available on Stability AI’s Hugging Face page.

NVIDIA and Stability AI are also collaborating to release SD3.5 as an NVIDIA NIM microservice, making it easier for creators and developers to access and deploy the model for a wide range of applications. The NIM microservice is expected to be released in July.

TensorRT for RTX SDK Released

Announced at Microsoft Build — and already available as part of the new Windows ML framework in preview — TensorRT for RTX is now available as a standalone SDK for developers.

Previously, developers needed to pre-generate and package TensorRT engines for each class of GPU — a process that would yield GPU-specific optimizations but required significant time.

With the new version of TensorRT, developers can create a generic TensorRT engine that’s optimized on device in seconds. This JIT compilation approach can be done in the background during installation or when they first use the feature.

The easy-to-integrate SDK is now 8x smaller and can be invoked through Windows ML — Microsoft’s new AI inference backend in Windows. Developers can download the new standalone SDK from the NVIDIA Developer page or test it in the Windows ML preview.

For more details, read this NVIDIA technical blog and this Microsoft Build recap.

Join NVIDIA at GTC Paris

At NVIDIA GTC Paris at VivaTech — Europe’s biggest startup and tech event — NVIDIA founder and CEO Jensen Huang yesterday delivered a keynote address on the latest breakthroughs in cloud AI infrastructure, agentic AI and physical AI. Watch a replay.

GTC Paris runs through Thursday, June 12, with hands-on demos and sessions led by industry leaders. Whether attending in person or joining online, there’s still plenty to explore at the event.

Each week, the RTX AI Garage blog series features community-driven AI innovations and content for those looking to learn more about NVIDIA NIM microservices and AI Blueprints, as well as building AI agents, creative workflows, digital humans, productivity apps and more on AI PCs and workstations. 

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X

See notice regarding software product information.

Adobe enhances developer productivity using Amazon Bedrock Knowledge Bases

Adobe Inc. excels in providing a comprehensive suite of creative tools that empower artists, designers, and developers across various digital disciplines. Their product landscape is the backbone of countless creative projects worldwide, ranging from web design and photo editing to vector graphics and video production.

Adobe’s internal developers use a vast array of wiki pages, software guidelines, and troubleshooting guides. Recognizing the challenge developers faced in efficiently finding the right information for troubleshooting, software upgrades, and more, Adobe’s Developer Platform team sought to build a centralized system. This led to the initiative Unified Support, designed to help thousands of the company’s internal developers get immediate answers to questions from a centralized place and reduce time and cost spent on developer support. For instance, a developer setting up a continuous integration and delivery (CI/CD) pipeline in a new AWS Region or running a pipeline on a dev branch can quickly access Adobe-specific guidelines and best practices through this centralized system.

The initial prototype for Adobe’s Unified Support provided valuable insights and confirmed the potential of the approach. This early phase highlighted key areas requiring further development to operate effectively at Adobe’s scale, including addressing scalability needs, simplifying resource onboarding, improving content synchronization mechanisms, and optimizing infrastructure efficiency. Building on these learnings, improving retrieval precision emerged as the next critical step.

To address these challenges, Adobe partnered with the AWS Generative AI Innovation Center, using Amazon Bedrock Knowledge Bases and the Vector Engine for Amazon OpenSearch Serverless. This solution dramatically improved their developer support system, resulting in a 20% increase in retrieval accuracy. Metadata filtering empowers developers to fine-tune their search, helping them surface more relevant answers across complex, multi-domain knowledge sources. This improvement not only enhanced the developer experience but also contributed to reduced support costs.

In this post, we discuss the details of this solution and how Adobe enhances their developer productivity.

Solution overview

Our project aimed to address two key objectives:

  • Document retrieval engine enhancement – We developed a robust system to improve search result accuracy for Adobe developers. This involved creating a pipeline for data ingestion, preprocessing, metadata extraction, and indexing in a vector database. We evaluated retrieval performance against Adobe’s ground truth data to produce high-quality, domain-specific results.
  • Scalable, automated deployment – To support Unified Support across Adobe, we designed a reusable blueprint for deployment. This solution accommodates large-scale data ingestion of various types and offers flexible configurations, including embedding model selection and chunk size adjustment.

Using Amazon Bedrock Knowledge Bases, we created a customized, fully managed solution that improved the retrieval effectiveness. Key achievements include a 20% increase in accuracy metrics for document retrieval, seamless document ingestion and change synchronization, and enhanced scalability to support thousands of Adobe developers. This solution provides a foundation for improved developer support and scalable deployment across Adobe’s teams. The following diagram illustrates the solution architecture.

Solution architecture diagram

Let’s take a closer look at our solution:

  • Amazon Bedrock Knowledge Bases index – The backbone of our system is Amazon Bedrock Knowledge Bases. Data is indexed through the following stages:
    • Data ingestion – We start by pulling data from Amazon Simple Storage Service (Amazon S3) buckets. This could be anything from resolutions to past issues or wiki pages.
    • Chunking – Amazon Bedrock Knowledge Bases breaks data down into smaller pieces, or chunks, defining the specific units of information that can be retrieved. This chunking process is configurable, allowing for optimization based on the specific needs of the business.
    • Vectorization – Each chunk is passed through an embedding model (in this case, Amazon Titan V2 on Amazon Bedrock), creating a 1,024-dimension numerical vector. This vector represents the semantic meaning of the chunk, allowing for similarity searches.
    • Storage – These vectors are stored in the Amazon OpenSearch Serverless vector database, creating a searchable repository of information.
  • Runtime – When a user poses a question, our system completes the following steps:
    • Query vectorization – With the Amazon Bedrock Knowledge Bases Retrieve API, the user’s question is automatically embedded using the same embedding model used for the chunks during data ingestion.
    • Similarity search and retrieval – The system retrieves the most relevant chunks in the vector database based on similarity scores to the query.
    • Ranking and presentation – The corresponding documents are ranked based on the semantic similarity of their most relevant chunks to the query, and the top-ranked information is presented to the user.
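The indexing and runtime stages described above can be sketched locally in plain Python. This is only an illustration under stated assumptions: the word-count `toy_embed` function stands in for the Amazon Titan embedding model, an in-memory list stands in for the OpenSearch Serverless vector store, and all function names are hypothetical. Amazon Bedrock Knowledge Bases performs these steps as a managed service.

```python
import math
from collections import Counter, defaultdict

def chunk_text(text, chunk_size=8):
    """Split text into fixed-size word chunks (a stand-in for configurable chunking)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def toy_embed(text):
    """Bag-of-words vector. NOT a real embedding model; for illustration only."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents):
    """Ingest, chunk, and vectorize documents into an in-memory index."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text):
            index.append((doc_id, chunk, toy_embed(chunk)))
    return index

def retrieve(index, query, top_k=3):
    """Embed the query the same way as the chunks and return the closest chunks."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), doc_id, chunk) for doc_id, chunk, vec in index]
    return sorted(scored, reverse=True)[:top_k]

def rank_documents(hits):
    """Rank documents by the score of their most relevant retrieved chunk."""
    best = defaultdict(float)
    for score, doc_id, _ in hits:
        best[doc_id] = max(best[doc_id], score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

documents = {
    "ci": "configure the ci cd pipeline in a new region",
    "wiki": "photo editing tips for vector graphics",
}
index = build_index(documents)
hits = retrieve(index, "ci cd pipeline", top_k=2)
print(rank_documents(hits))
```

Ranking each document by its best chunk, rather than by an average over chunks, mirrors the ranking step described in the list: a single highly relevant passage is enough to surface a document.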

Multi-tenancy through metadata filtering

As developers, we often find ourselves seeking help across various domains. Whether it’s tackling CI/CD issues, setting up project environments, or adopting new libraries, the landscape of developer challenges is vast and varied. Sometimes, our questions even span multiple domains, making it crucial to have a system for retrieving relevant information. Metadata filtering empowers developers to retrieve not just semantically relevant information, but a well-defined subset of that information based on specific criteria. This powerful tool enables you to apply filters to your retrievals, helping developers narrow the search results to a limited set of documents based on the filter, thereby improving the relevancy of the search.

To enable metadata-based filtering, each source data file in the S3 bucket needs to be accompanied by a corresponding metadata file. These metadata files use the same base name as the source file, with a .metadata.json suffix. Each metadata file includes relevant attributes (such as domain, year, or type) to support multi-tenancy and fine-grained filtering in OpenSearch Service. The following code shows what an example metadata file looks like:

{
  "metadataAttributes": {
    "domain": "project A",
    "year": 2016,
    "type": "wiki"
  }
}
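
As a minimal sketch, metadata files like this one could be generated alongside the source files before they are uploaded to Amazon S3. The helper names below are hypothetical, and the sketch assumes the metadata file name is the full source file name (including its extension) followed by .metadata.json.

```python
import json
from pathlib import Path


def metadata_path_for(source_file):
    """Path of the metadata file that sits next to a source file.

    Assumption: the full source file name is kept and '.metadata.json'
    is appended to it.
    """
    source = Path(source_file)
    return source.with_name(source.name + ".metadata.json")


def build_metadata(**attributes):
    """Wrap filterable attributes in the 'metadataAttributes' envelope."""
    return {"metadataAttributes": attributes}


def write_metadata(source_file, **attributes):
    """Write the metadata file next to the source file and return its path."""
    path = metadata_path_for(source_file)
    path.write_text(
        json.dumps(build_metadata(**attributes), indent=2), encoding="utf-8"
    )
    return path
```

For example, write_metadata("docs/setup-guide.md", domain="project A", year=2016, type="wiki") would produce docs/setup-guide.md.metadata.json with the same content as the example above.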

Retrieve API

The Retrieve API allows querying a knowledge base to retrieve relevant information. You can use it as follows:

  1. Send a POST request to /knowledgebases/knowledgeBaseId/retrieve.
  2. Include a JSON body with the following:
    1. retrievalQuery – Contains the text query.
    2. retrievalConfiguration – Specifies search parameters, such as number of results and filters.
    3. nextToken – For pagination (optional).

The following is an example request syntax:

POST /knowledgebases/knowledgeBaseId/retrieve HTTP/1.1
Content-type: application/json
{
   "nextToken": "string",
   "retrievalConfiguration": { 
      "vectorSearchConfiguration": { 
         "filter": { ... },
         "numberOfResults": number,
         "overrideSearchType": "string"
      }
   },
   "retrievalQuery": { 
      "text": "string"
   }
}
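
To make the filter field concrete, the following is a hedged sketch of how a request with a metadata filter might be assembled and sent through the AWS SDK for Python. The helper names (build_retrieve_request, retrieve_chunks) are hypothetical, and the sketch assumes simple equality filters on attributes such as domain. The client is passed in so the request-building logic can be exercised without AWS credentials.

```python
def build_retrieve_request(knowledge_base_id, query, number_of_results=5,
                           equals_filters=None):
    """Build keyword arguments for the Retrieve API.

    equals_filters is a dict of metadata attribute -> required value
    (hypothetical convenience; other filter operators are not covered).
    """
    vector_config = {"numberOfResults": number_of_results}
    conditions = [
        {"equals": {"key": key, "value": value}}
        for key, value in (equals_filters or {}).items()
    ]
    if len(conditions) == 1:
        # A single condition is passed as-is.
        vector_config["filter"] = conditions[0]
    elif len(conditions) > 1:
        # Several conditions must all hold, so they are combined with andAll.
        vector_config["filter"] = {"andAll": conditions}
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }


def retrieve_chunks(client, **request):
    """Call retrieve() on a bedrock-agent-runtime client.

    Returns (chunk text, similarity score) pairs from the response.
    """
    response = client.retrieve(**request)
    return [
        (result["content"]["text"], result.get("score"))
        for result in response["retrievalResults"]
    ]


request = build_retrieve_request(
    "YOUR-ID",
    "How do I fix a failing CI/CD pipeline?",
    number_of_results=4,
    equals_filters={"domain": "project A", "type": "wiki"},
)
```

With credentials configured, the client would typically be created with boto3.client("bedrock-agent-runtime") and passed to retrieve_chunks(client, **request).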

Additionally, you can set up the retriever with ease using the langchain-aws package:

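from langchain_aws import AmazonKnowledgeBasesRetriever

# Configure the retriever against an existing knowledge base.
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",  # replace with your knowledge base ID
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

# invoke() is the current entry point; get_relevant_documents() is deprecated in recent LangChain releases.
documents = retriever.invoke("What is the meaning of life?")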

This approach enables semantic querying of the knowledge base to retrieve relevant documents based on the provided query, simplifying the implementation of search.

Experimentation

To deliver the most accurate and efficient knowledge retrieval system, the Adobe and AWS teams put the solution to the test. The team conducted a series of rigorous experiments to fine-tune the system and find the optimal settings.

Before we dive into our findings, let’s discuss the metrics and evaluation process we used to measure success. We used the open source model evaluation framework Ragas to evaluate the retrieval system across two metrics: document relevance and mean reciprocal rank (MRR). Although Ragas comes with many metrics for evaluating model performance out of the box, we needed to implement these metrics by extending the Ragas framework with custom code.

  • Document relevance – Document relevance offers a qualitative approach to assessing retrieval accuracy. This metric uses a large language model (LLM) as an impartial judge to compare retrieved chunks against user queries. It evaluates how effectively the retrieved information addresses the developer’s question, producing a score from 1 to 10.
  • Mean reciprocal rank – On the quantitative side, we have the MRR metric. MRR evaluates how well a system ranks the first relevant item for a query. For each query, find the rank k of the highest-ranked relevant document. The score for that query is 1/k. MRR is the average of these 1/k scores over the entire set of queries. A higher score (closer to 1) signifies that the first relevant result is typically ranked high.
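
The MRR definition above can be written out in a few lines. This is a standalone sketch, not the custom Ragas metric used in the experiments; the function names are hypothetical, and it assumes that a query with no relevant document among its results contributes a score of 0.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/k, where k is the rank of the first relevant document.

    Assumption: a query with no relevant document retrieved scores 0.
    """
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not results:
        return 0.0
    total = sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in results)
    return total / len(results)


# Three hypothetical queries: first relevant hit at rank 1, at rank 3, and never.
results = [
    (["a", "b", "c"], {"a"}),
    (["a", "b", "c"], {"c"}),
    (["a", "b"], {"z"}),
]
mrr = mean_reciprocal_rank(results)  # (1 + 1/3 + 0) / 3
```

Because only the first relevant result counts for each query, MRR rewards systems that put a useful document at the very top of the list.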

These metrics provide complementary insights: document relevance offers a content-based assessment, and MRR provides a ranking-based evaluation. Together, they offer a comprehensive view of the retrieval system’s effectiveness in finding and prioritizing relevant information.

In our recent experiments, we explored various data chunking strategies to optimize the performance of retrieval. We tested several approaches, including fixed-size chunking as well as more advanced semantic chunking and hierarchical chunking.

Semantic chunking focuses on preserving the contextual relationships within the data by segmenting it based on semantic meaning. This approach aims to improve the relevance and coherence of retrieved results.

Hierarchical chunking organizes data into a hierarchical parent-child structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data.

For more information on how to set up different chunking strategies, refer to Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

We tested the following chunking methods with Amazon Bedrock Knowledge Bases:

  • Fixed-size short chunking – 400-token chunks with a 20% overlap (shown as the blue variant in the following figure)
  • Fixed-size long chunking – 1,000-token chunks with a 20% overlap
  • Hierarchical chunking – Parent chunks of 1,500 tokens and child chunks of 300 tokens, with a 60-token overlap
  • Semantic chunking – 400-token chunks with a 95% similarity percentile threshold
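
To show what a fixed-size strategy with percentage overlap means in practice, here is a simplified sketch. It operates on an already-tokenized list and is not the chunker used by Amazon Bedrock Knowledge Bases, which relies on its own tokenization and boundary handling; the function name and parameters are hypothetical.

```python
def fixed_size_chunks(tokens, chunk_size=400, overlap_pct=20):
    """Split a token list into fixed-size chunks that overlap by a percentage.

    With chunk_size=400 and overlap_pct=20, consecutive chunks share 80 tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in the range [0, 100)")
    overlap = chunk_size * overlap_pct // 100
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# A stand-in "document" of 1,000 tokens.
tokens = list(range(1000))
short_chunks = fixed_size_chunks(tokens, chunk_size=400, overlap_pct=20)
long_chunks = fixed_size_chunks(tokens, chunk_size=1000, overlap_pct=20)
```

For this 1,000-token input, the short setting produces three overlapping chunks, and the long setting keeps the whole document in a single chunk.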

For reference, a paragraph of approximately 1,000 characters typically translates to around 200 tokens. To assess performance, we measured document relevance and MRR across different context sizes, ranging from 1–5. This comparison aims to provide insights into the most effective chunking strategy for organizing and retrieving information for this use case.

The following figures illustrate the MRR and document relevance metrics, respectively.

Experiment results


As a result of these experiments, we found that MRR is a more sensitive metric for evaluating the impact of chunking strategies, particularly when varying the number of retrieved chunks (top-k from 1 to 5). Among the approaches tested, the fixed-size 400-token strategy—shown in blue—proved to be the simplest and most effective, consistently yielding the highest accuracy across different retrieval sizes.

Conclusion

In the journey to design Adobe’s developer Unified Support search and retrieval system, we’ve successfully harnessed the power of Amazon Bedrock Knowledge Bases to create a robust, scalable, and efficient solution. By configuring fixed-size chunking and using the Amazon Titan V2 embedding model, we achieved a 20% increase in accuracy metrics for document retrieval compared to Adobe’s existing solution, as measured by evaluations run on the customer’s testing system with the dataset they provided.

The integration of metadata filtering emerged as a game-changing feature, allowing for seamless navigation across diverse domains and enabling customized retrieval. This capability proved invaluable for Adobe, given the complexity and breadth of their information landscape.

Our comprehensive comparison of retrieval accuracy for different configurations of the Amazon Bedrock Knowledge Bases index has yielded valuable insights. The metrics we developed provide an objective framework for assessing the quality of retrieved context, which is crucial for applications demanding high-precision information retrieval.

As we look to the future, this customized, fully managed solution lays a solid foundation for continuous improvement in developer support at Adobe, offering enhanced scalability and a support infrastructure that can keep pace with evolving developer needs.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center. To learn more about Amazon Bedrock Knowledge Bases, see Retrieve data and generate AI responses with knowledge bases.


About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. With over a decade of experience in software development, he specializes in building AI-driven solutions, including AI agents. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Nay Doummar is an Engineering Manager on the Unified Support team at Adobe, where she’s been since 2012. Over the years, she has contributed to projects in infrastructure, CI/CD, identity management, containers, and AI. She started on the CloudOps team, which was responsible for migrating Adobe’s infrastructure to the AWS Cloud, marking the beginning of her long-term collaboration with AWS. In 2020, she helped build a support chatbot to simplify infrastructure-related assistance, sparking her passion for user support. In 2024, she joined a project to Unify Support for the Developer Platform, aiming to streamline support and boost productivity.

Varsha Chandan Bellara is a Software Development Engineer at Adobe, specializing in AI-driven solutions to boost developer productivity. She leads the development of an AI assistant for the Unified Support initiative, using Amazon Bedrock, implementing RAG to provide accurate, context-aware responses for technical support and issue resolution. With expertise in cloud-based technologies, Varsha combines her passion for containers and serverless architectures with advanced AI to create scalable, efficient solutions that streamline developer workflows.

Jan Michael Ong is a Senior Software Engineer at Adobe, where he supports the developer community and engineering teams through tooling and automation. Currently, he is part of the Developer Experience team at Adobe, working on AI projects and automation contributing to Adobe’s internal Developer Platform.

Justin Johns is a Deep Learning Architect at Amazon Web Services who is passionate about innovating with generative AI and delivering cutting-edge solutions for customers. With over 5 years of software development experience, he specializes in building cloud-based solutions powered by generative AI.

Gaurav Dhamija is a Principal Solutions Architect at Amazon Web Services, where he helps customers design and build scalable, reliable, and secure applications on AWS. He is passionate about developer experience, containers, and serverless technologies, and works closely with engineering teams to modernize application architectures. Gaurav also specializes in generative AI, using AWS generative AI services to drive innovation and enhance productivity across a wide range of use cases.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.
