How Infosys improved accessibility for event knowledge using Amazon Nova Pro, Amazon Bedrock, and AWS Elemental Media Services

This post is co-written with Saibal Samaddar, Tanushree Halder, and Lokesh Joshi from Infosys Consulting.

Critical insights and expertise are concentrated among thought leaders and experts across the globe, but language barriers often hinder how this knowledge is distributed and understood at the moments that matter most. Workshops, conferences, and training sessions serve as platforms for collaboration and knowledge sharing, and they work best when attendees can understand the information being conveyed in real time and in their preferred language.

Infosys, a leading global IT services and consulting organization, used its digital expertise to tackle this challenge by pioneering Infosys Event AI, an innovative AI-based event assistant. Infosys Event AI is designed to make knowledge universally accessible, making sure that valuable insights are not lost and can be efficiently used by individuals and organizations across diverse industries, both during the event and after it has concluded. Without such a system, knowledge sharing and utilization suffer, limiting the overall impact of events and workshops. By transforming ephemeral event content into a persistent and searchable knowledge asset, Infosys Event AI seeks to enhance knowledge utilization and impact.

Some of the challenges in capturing and accessing event knowledge include:

  • Knowledge from events and workshops is often lost due to inadequate capture methods, with traditional note-taking being incomplete and subjective.
  • Reviewing lengthy recordings to find specific information is time-consuming and inefficient, creating barriers to knowledge retention and sharing.
  • People who miss events face significant obstacles accessing the knowledge shared, impacting sectors like education, media, and public sector where information recall is crucial.

To address these challenges, Infosys partnered with Amazon Web Services (AWS) to develop Infosys Event AI and unlock the insights generated during events. In this post, we explain how Infosys built the Infosys Event AI solution using several AWS services, including:

  • Amazon Transcribe for real-time and post-event speech-to-text conversion
  • Amazon Bedrock, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Guardrails, with Amazon Nova Pro as the foundation model
  • AWS Elemental MediaConnect and AWS Elemental MediaLive for live stream ingestion and processing
  • Amazon S3, Amazon EventBridge, AWS Lambda, Amazon EC2, Amazon Cognito, Amazon CloudFront, and Amazon OpenSearch Serverless for storage, orchestration, hosting, authentication, content delivery, and retrieval

Solution Architecture

In this section, we present an overview of Event AI, highlighting its key features and workflow. Event AI delivers these core functionalities, as illustrated in the architecture diagram that follows:

  1. Seamless live stream acquisition from on-premises sources
  2. Real-time transcription processing for speech-to-text conversion
  3. Post-event processing and knowledge base indexing for structured information retrieval
  4. Automated generation of session summaries and key insights to enhance accessibility
  5. AI-powered chat-based assistant for interactive Q&A and efficient knowledge retrieval from the event session

Solution walkthrough

Next, we break down each functionality in detail. The services used in the solution are granted least-privilege permissions through AWS Identity and Access Management (IAM) policies for security purposes.

Seamless live stream acquisition

The solution begins with an IP-enabled camera capturing the live event feed, as shown in the following section of the architecture diagram. This stream is securely and reliably transported to the cloud using the Secure Reliable Transport (SRT) protocol through AWS Elemental MediaConnect. The ingested stream is then received and processed by AWS Elemental MediaLive, which encodes the video in real time and generates the necessary outputs.

The workflow follows these steps:

  1. Use an IP-enabled camera or ground encoder to convert non-IP streams into IP streams and transmit them through the SRT protocol to MediaConnect for live event ingestion.
  2. MediaConnect securely transmits the stream to MediaLive for processing (a sketch of the flow setup follows these steps).
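The post does not show how the MediaConnect flow is provisioned. As one possibility, the following is a minimal boto3 sketch of creating an SRT ingest flow; the flow name, port, and CIDR range are illustrative placeholders, not values from the Infosys deployment.

import boto3

mediaconnect = boto3.client("mediaconnect", region_name="us-east-1")

# Create a flow that listens for an SRT caller (the camera or ground encoder).
flow = mediaconnect.create_flow(
    Name="event-ai-ingest",  # hypothetical flow name
    Source={
        "Name": "camera-srt-source",
        "Protocol": "srt-listener",
        "IngestPort": 5000,
        "WhitelistCidr": "203.0.113.0/24",  # restrict to the venue's egress IP range
    },
)

# Hand this endpoint to the on-site encoder so it can start pushing the feed.
print(flow["Flow"]["Source"]["IngestIp"])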

Real-time transcription processing

To facilitate real-time accessibility, the system uses MediaLive to isolate audio from the live video stream. This audio-only stream is then forwarded to a real-time transcriber module. The real-time transcriber module, hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance, uses the Amazon Transcribe streaming API to generate transcriptions with minimal latency. These real-time transcriptions are subsequently delivered to an on-premises web client over secure WebSocket connections. The following screenshot shows a brief demo based on a fictitious scenario to illustrate Event AI’s real-time streaming capability.

This part of the solution follows these steps:

  1. MediaLive extracts the audio from the live stream and creates an audio-only stream, which it then sends to the real-time transcriber module running on an EC2 instance. MediaLive also extracts the audio-only output and stores it in an Amazon Simple Storage Service (Amazon S3) bucket, facilitating a subsequent postprocessing workflow.
  2. The real-time transcriber module receives the audio-only stream and uses the Amazon Transcribe streaming API to produce real-time transcriptions with low latency.
  3. The real-time transcriber module uses a secure WebSocket to transmit the transcribed text.
  4. The on-premises web client receives the transcribed text over a secure WebSocket connection delivered through Amazon CloudFront and displays it on the web client’s UI.

The following diagram shows the live-stream acquisition and real-time transcription flow; a sketch of the transcriber module follows.
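The following is a minimal sketch of such a transcriber module, assuming the open source amazon-transcribe streaming SDK for Python and a 16 kHz PCM audio feed. The audio_chunks() source and broadcast() fan-out are hypothetical stand-ins for the MediaLive audio input and the CloudFront-fronted WebSocket layer.

import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class CaptionHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Forward only finalized segments to the connected web clients.
        for result in transcript_event.transcript.results:
            if not result.is_partial:
                await broadcast(result.alternatives[0].transcript)  # hypothetical WebSocket fan-out


async def transcribe_live():
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def feed_audio():
        # audio_chunks() is a hypothetical async generator yielding PCM
        # frames decoded from the MediaLive audio-only output.
        async for chunk in audio_chunks():
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = CaptionHandler(stream.output_stream)
    await asyncio.gather(feed_audio(), handler.handle_events())

In a deployment like the one described, asyncio.run(transcribe_live()) would run for the duration of the session, with reconnection logic around it.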

Post-event processing and knowledge base indexing

After the event concludes, recorded media and transcriptions are securely stored in Amazon S3 for further analysis. A serverless, event-driven workflow using Amazon EventBridge and AWS Lambda automates the post-event processing. Amazon Transcribe processes the recorded content to generate the final transcripts, which are then indexed and stored in an Amazon Bedrock knowledge base for seamless retrieval. Additionally, Amazon Nova Pro enables multilingual translation of the transcripts, providing global accessibility when needed. With its quality and speed, Amazon Nova Pro is ideally suited for this global use case.

The workflow for this part of the process follows these steps:

  1. After the event concludes, MediaLive sends a channel stopped notification to EventBridge
  2. A Lambda function subscribed to the channel stopped event triggers post-event transcription using Amazon Transcribe
  3. The transcribed content is processed and stored in an S3 bucket
  4. (Optional) Amazon Nova Pro translates transcripts into multiple languages for broader accessibility using Amazon Bedrock
  5. Amazon Transcribe generates a transcription complete event and sends it to EventBridge
  6. A Lambda function, subscribed to the transcription complete event, triggers the synchronization process with Amazon Bedrock Knowledge Bases
  7. The knowledge is then indexed and stored in Amazon Bedrock knowledge base for efficient retrieval

These steps are shown in the following diagram; a sketch of the transcription-triggering Lambda function follows.
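As a concrete illustration of step 2, here is a hedged sketch of a Lambda function that starts the transcription job, assuming an EventBridge rule forwards the MediaLive channel stopped event. The event fields, bucket names, and key layout are assumptions for illustration, not the exact values used by Infosys.

import boto3

transcribe = boto3.client("transcribe")

def handler(event, context):
    # Assumed event shape: the MediaLive state-change detail carries the channel ARN.
    channel_id = event["detail"]["channel_arn"].split(":")[-1]
    transcribe.start_transcription_job(
        TranscriptionJobName=f"event-ai-{channel_id}",
        Media={"MediaFileUri": f"s3://event-ai-recordings/{channel_id}/audio.mp4"},  # hypothetical bucket/key
        MediaFormat="mp4",
        LanguageCode="en-US",
        OutputBucketName="event-ai-transcripts",  # hypothetical output bucket
    )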

Automated generation of session summaries and key insights

To enhance the user experience, the solution uses Amazon Bedrock to analyze the transcriptions and generate concise session summaries and key insights. These insights help users quickly understand the essence of the event without going through lengthy transcripts. The following screenshot shows Infosys Event AI’s summarization capability.

The workflow for this part of the solution follows these steps:

  1. Users authenticate to the web client portal using Amazon Cognito. Once authenticated, the user selects an option in the portal UI to view the summaries and key insights.
  2. The user request is delegated to the AI assistant module, where it fetches the complete transcript from the S3 bucket.
  3. The transcript is processed by Amazon Nova Pro on Amazon Bedrock, guided by Amazon Bedrock Guardrails. In line with responsible AI policies, this process generates summaries and key insights that are safeguarded for the user (a sketch of this call follows).
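A minimal sketch of the summarization call follows, assuming the Amazon Bedrock Converse API with Amazon Nova Pro. The model ID, guardrail identifiers, and prompt wording are illustrative placeholders rather than the exact values used by Infosys.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarize(transcript: str) -> str:
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # placeholder; some Regions require an inference profile ID
        messages=[{
            "role": "user",
            "content": [{"text": "Summarize this session transcript and list the key insights:\n\n" + transcript}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
        guardrailConfig={
            "guardrailIdentifier": "YOUR_GUARDRAIL_ID",  # hypothetical pre-created guardrail
            "guardrailVersion": "1",
        },
    )
    return response["output"]["message"]["content"][0]["text"]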

AI-powered chat-based assistant

A key feature of this architecture is an AI-powered chat assistant, which is used to interactively query the event knowledge base. The chat assistant is powered by Amazon Bedrock and retrieves information from the Amazon OpenSearch Serverless index, enabling seamless access to session insights.

The workflow for this part of the solution follows these steps:

  1. Authenticated users engage with the chat assistant using natural language to request specific event messaging details from the client web portal.
  2. The user prompt is directed to the AI assistant module for processing.
  3. The AI assistant module queries Amazon Bedrock Knowledge Bases for relevant answers.
  4. The retrieved content is processed by Amazon Nova Pro, guided by Amazon Bedrock Guardrails, to generate secure, grounded answers. The integration of Amazon Bedrock Guardrails promotes professional, respectful interactions by working to block undesirable and harmful content, in line with responsible AI policies (a sketch of the knowledge base query appears after the following diagram).

The following demo shows Event AI’s Q&A capability.

The steps for automated generation of insights and AI-chat assistant are shown in the following diagram.
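The following is a hedged sketch of that query path, assuming the RetrieveAndGenerate API of Amazon Bedrock Knowledge Bases backed by the OpenSearch Serverless index; the knowledge base ID and model ARN are placeholders.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def ask(question: str) -> str:
    response = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",  # hypothetical knowledge base ID
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
            },
        },
    )
    return response["output"]["text"]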

Results and Impact

Infosys Event AI was launched in February 2025 at a responsible AI conference in Bangalore, India, hosted by Infosys in partnership with the British High Commission.

  • Infosys Event AI was used by more than 800 conference attendees
  • It was used by around 230 people every minute during the event
  • The intelligent chat assistant was queried an average of 57 times every minute during the event
  • A total of more than 9,000 event session summaries were generated during the event

By using the solution, Infosys realized the following key benefits for its internal users and its customers:

  • Enhanced knowledge retention – During events, Infosys Event AI was accessible from both mobile and laptop devices, providing an immersive participation experience for both in-person and online attendees.
  • Improved accessibility – Session knowledge became quickly accessible after the event through transcripts, summaries, and the intelligent chat assistant. The event information is readily available for attendees and for those who couldn’t attend. Furthermore, Infosys Event AI aggregates the session information from previous events, creating a knowledge archival system for information retrieval.
  • Increased engagement – The interactive chat assistant provides deeper engagement during the event sessions, which means users can ask specific questions and receive immediate, contextually relevant answers.
  • Time efficiency – Quick access to summaries and chat responses saves time compared to reviewing full session recordings or manual notes when seeking specific information.

Impacting Multiple Industries

Infosys is positioned to accelerate the adoption of Infosys Event AI across diverse industries:

  • AI-powered meeting management for enterprises – Businesses can use the system for generating meeting minutes, creating training documentation from workshops, and facilitating knowledge sharing within teams. Summaries provide quick recaps of meetings for executives, and transcripts offer detailed records for compliance and reference.
  • Improved transparency and accessibility in the public sector – Parliamentary debates, public hearings, and government briefings are made accessible to the general public through transcripts and summaries, improving transparency and citizen engagement. The platform enables searchable archives of parliamentary proceedings for researchers, policymakers, and the public, creating accessible records for historical reference.
  • Accelerated learning and knowledge retention in the education sector – Students can effectively review lectures, seminars, and workshops through transcripts and summaries, reinforcing learning and improving knowledge retention. The chat assistant allows for interactive learning and clarification of doubts, acting as a virtual teaching assistant. This is particularly useful in online and hybrid learning environments.
  • Improved media reporting and efficiency in the media industry – Journalists can use Infosys Event AI to rapidly transcribe press conferences, speeches, and interviews, accelerating news cycles and improving reporting accuracy. Summaries provide quick overviews of events, enabling faster news dissemination. The chat assistant facilitates quick fact-checking (with source citation) and information retrieval from event recordings.
  • Improved accessibility and inclusivity across industries – Real-time transcription provides accessibility for hearing-challenged individuals. Multilingual translation of event transcripts allows attendees to participate even when sessions aren’t delivered in their native language. This promotes inclusivity and wider participation in events for the purposes of knowledge sharing.

Conclusion

In this post, we explored how Infosys developed Infosys Event AI to unlock the insights generated from events and conferences. Through its suite of features—including real-time transcription, intelligent summaries, and an interactive chat assistant—Infosys Event AI makes event knowledge accessible and provides an immersive engagement solution for the attendees, during and after the event.

Infosys is planning to offer the Infosys Event AI solution to its internal teams and global customers in two versions: as a multi-tenanted, software-as-a-service (SaaS) solution and as a single-deployment solution. Infosys is also adding capabilities such as an event catalogue, knowledge lake, and event archival system to make event information accessible beyond the scope of the current event. By using AWS managed services, Infosys has made Event AI a readily available, interactive, immersive, and valuable resource for students, journalists, policymakers, enterprises, and the public sector. As organizations and institutions increasingly rely on events for knowledge dissemination, collaboration, and public engagement, Event AI is well positioned to unlock the full potential of these events.

Stay updated with new Amazon AI features and releases to advance your AI journey on AWS.


About the Authors

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a cloud architect with over 24 years of experience designing and developing enterprise-scale, distributed software systems. He specializes in generative AI and machine learning, with a focus on data and feature engineering. He is an aspiring marathon runner, and his hobbies include hiking, bike riding, and spending time with his wife and two boys.

Maheshwaran G is a Specialist Solutions Architect working with media and entertainment companies in India, helping them accelerate growth in innovative ways using the power of cloud technologies. He is passionate about innovation and currently holds 8 USPTO and 8 IPO granted patents in diversified domains.

Saibal Samaddar is a senior principal consultant at Infosys Consulting and heads the AI Transformation Consulting (AIX) practice in India. He has over 18 years of business consulting experience, including 11 years at PwC and KPMG, helping organizations drive strategic transformation by harnessing digital and AI technologies. A visionary who can navigate complex transformations and make things happen, he has played a pivotal role in winning multiple new accounts for Infosys Consulting (IC).

Tanushree Halder is a principal consultant with Infosys Consulting and leads the CX and generative AI capability for AI Transformation Consulting (AIX). She has 11 years of experience working with clients on their transformation journeys and has travelled to over 10 countries to advise clients in BFSI, retail and logistics, hospitality, healthcare, and shared services.

Lokesh Joshi is a consultant at Infosys Consulting. He has worked with multiple clients to strategize and integrate AI-based solutions for workflow enhancements. He has over 4 years of experience in AI/ML, generative AI development, full-stack development, and cloud services. He specializes in machine learning and data science, with a focus on deep learning and NLP. A fitness enthusiast, his hobbies include programming, hiking, and traveling.

Read More

Making Brain Waves: AI Startup Speeds Disease Research With Lab in the Loop

About 15% of the world’s population — over a billion people — are affected by neurological disorders, from commonly known diseases like Alzheimer’s and Parkinson’s to hundreds of lesser-known, rare conditions.

BrainStorm Therapeutics, a San Diego-based startup, is accelerating the development of cures for these conditions using AI-powered computational drug discovery paired with lab experiments using organoids: tiny, 3D bundles of brain cells created from patient-derived stem cells. This hybrid, iterative method, where clinical data and AI models inform one another to accelerate drug development, is known as lab in the loop.

“The brain is the last frontier in modern biology,” said BrainStorm’s founder and CEO Robert Fremeau, who was previously a scientific director in neuroscience at Amgen and a faculty member at Duke University and the University of California, San Francisco. “By combining our organoid disease models with the power of generative AI, we now have the ability to start to unravel the underlying complex biology of disease networks.”

The company aims to lower the failure rate of drug candidates for brain diseases during clinical trials — currently over 93% — and identify therapeutics that can be applied to multiple diseases. Achieving these goals would make it faster and more economically viable to develop treatments for rare and common conditions.

“This alarmingly high clinical trial failure rate is mainly due to the inability of traditional preclinical models with rodents or 2D cells to predict human efficacy,” said Jun Yin, cofounder and chief technology officer at BrainStorm. “By integrating human-derived brain organoids with AI-driven analysis, we’re building a platform that better reflects the complexity of human neurobiology and improves the likelihood of clinical success.”

Fremeau and Yin believe that BrainStorm’s platform has the potential to accelerate development timelines, reduce research and development costs, and significantly increase the probability of bringing effective therapies to patients.

BrainStorm Therapeutics’ AI models, which run on NVIDIA GPUs in the cloud, were developed using the NVIDIA BioNeMo Framework, a set of programming tools, libraries and models for computational drug discovery. The company is a member of NVIDIA Inception, a global network of cutting-edge startups.

Clinical Trial in a Dish

BrainStorm Therapeutics uses AI models to develop gene maps of brain diseases, which they can use to identify promising targets for potential drugs and clinical biomarkers. Organoids allow them to screen thousands of drug molecules per day directly on human brain cells, enabling them to test the effectiveness of potential therapies before starting clinical trials.

“Brains have brain waves that can be picked up in a scan like an EEG, or electroencephalogram, which measures the electrical activity of neurons,” said Maya Gosztyla, the company’s cofounder and chief operating officer. “Our organoids also have spontaneous brain waves, allowing us to model the complex activity that you would see in the human brain in this much smaller system. We treat it like a clinical trial in a dish for studying brain diseases.”

BrainStorm Therapeutics is currently using patient-derived organoids for its work on drug discovery for Parkinson’s disease, a condition tied to the loss of neurons that produce dopamine, a neurotransmitter that helps with physical movement and cognition.

“In Parkinson’s disease, multiple genetic variants contribute to dysfunction across different cellular pathways, but they converge on a common outcome — the loss of dopamine neurons,” Fremeau said. “By using AI models to map and analyze the biological effects of these variants, we can discover disease-modifying treatments that have the potential to slow, halt or even reverse the progression of Parkinson’s.”

The BrainStorm team used single-cell sequencing data from brain organoids to fine-tune foundation models available through the BioNeMo Framework, including the Geneformer model for gene expression analysis. The organoids were derived from patients with mutations in the GBA1 gene, the most common genetic risk factor for Parkinson’s disease.

BrainStorm is also collaborating with the NVIDIA BioNeMo team to help optimize open-source access to the Geneformer model.

Accelerating Drug Discovery Research

With its proprietary platform, BrainStorm can mirror human brain biology and simulate how different treatments might work in a patient’s brain.

“This can be done thousands of times, much quicker and much cheaper than can be done in a wet lab — so we can narrow down therapeutic options very quickly,” Gosztyla said. “Then we can go in with organoids and test the subset of drugs the AI model thinks will be effective. Only after it gets through those steps will we actually test these drugs in humans.”

View of an organoid using Fluorescence Imaging Plate Reader, or FLIPR — a technique used to study the effect of compounds on cells during drug screening.

This technology led to the discovery that Donepezil, a drug prescribed for Alzheimer’s disease, could also be effective in treating Rett syndrome, a rare genetic neurodevelopmental disorder. Within nine months, the BrainStorm team was able to go from organoid screening to applying for a phase 2 clinical trial of the drug in Rett patients. This application was recently cleared by the U.S. Food and Drug Administration.

BrainStorm also plans to develop multimodal AI models that integrate data from cell sequencing, cell imaging, EEG scans and more.

“You need high-quality, multimodal input data to design the right drugs,” said Yin. “AI models trained on this data will help us understand disease better, find more effective drug candidates and, eventually, find prognostic biomarkers for specific patients that enable the delivery of precision medicine.”

The company’s next project is an initiative with the CURE5 Foundation to conduct the most comprehensive repurposed drug screen to date for CDKL5 Deficiency Disorder, another rare genetic neurodevelopmental disorder.

“Rare disease research is transforming from a high-risk niche to a dynamic frontier,” said Fremeau. “The integration of BrainStorm’s AI-powered organoid technology with NVIDIA accelerated computing resources and the NVIDIA BioNeMo platform is dramatically accelerating the pace of innovation while reducing the cost — so what once required a decade and billions of dollars can now be investigated with significantly leaner resources in a matter of months.”

Get started with NVIDIA BioNeMo for AI-accelerated drug discovery.

Read More

Chill Factor: NVIDIA Blackwell Platform Boosts Water Efficiency by Over 300x

Traditionally, data centers have relied on air cooling — where mechanical chillers circulate chilled air to absorb heat from servers, helping them maintain optimal conditions. But as AI models increase in size, and the use of AI reasoning models rises, maintaining those optimal conditions is not only getting harder and more expensive — but more energy-intensive.

While data centers once operated at 20 kW per rack, today’s hyperscale facilities can support over 135 kW per rack, making it an order of magnitude harder to dissipate the heat generated by high-density racks. To keep AI servers running at peak performance, a new approach is needed for efficiency and scalability.

One key solution is liquid cooling — by reducing dependence on chillers and enabling more efficient heat rejection, liquid cooling is driving the next generation of high-performance, energy-efficient AI infrastructure.

The NVIDIA GB200 NVL72 and the NVIDIA GB300 NVL72 are rack-scale, liquid-cooled systems designed to handle the demanding tasks of trillion-parameter large language model inference. Their architecture is also specifically optimized for test-time scaling accuracy and performance, making it an ideal choice for running AI reasoning models while efficiently managing energy costs and heat.

Liquid-cooled NVIDIA Blackwell compute tray.

Driving Unprecedented Water Efficiency and Cost Savings in AI Data Centers

Historically, cooling alone has accounted for up to 40% of a data center’s electricity consumption, making it one of the most significant areas where efficiency improvements can drive down both operational expenses and energy demands.

Liquid cooling helps mitigate costs and energy use by capturing heat directly at the source. Instead of relying on air as an intermediary, direct-to-chip liquid cooling transfers heat in a technology cooling system loop. That heat is then cycled through a coolant distribution unit via liquid-to-liquid heat exchanger, and ultimately transferred to a facility cooling loop. Because of the higher efficiency of this heat transfer, data centers and AI factories can operate effectively with warmer water temperatures — reducing or eliminating the need for mechanical chillers in a wide range of climates.

The NVIDIA GB200 NVL72 rack-scale, liquid-cooled system, built on the NVIDIA Blackwell platform, offers exceptional performance while balancing energy costs and heat. It packs unprecedented compute density into each server rack, delivering 40x higher revenue potential, 30x higher throughput, 25x more energy efficiency and 300x more water efficiency than traditional air-cooled architectures. Newer NVIDIA GB300 NVL72 systems built on the Blackwell Ultra platform boast a 50x higher revenue potential and 35x higher throughput with 30x more energy efficiency.

Data centers spend an estimated $1.9-2.8 million per megawatt (MW) per year, of which nearly $500,000 goes to cooling-related energy and water costs. By deploying the liquid-cooled GB200 NVL72 system, hyperscale data centers and AI factories can achieve up to 25x cost savings, amounting to over $4 million in annual savings for a 50 MW hyperscale data center.

For data center and AI factory operators, this means lower operational costs, enhanced energy efficiency metrics and a future-proof infrastructure that scales AI workloads efficiently — without the unsustainable water footprint of legacy cooling methods.

Moving Heat Outside the Data Center

As compute density rises and AI workloads drive unprecedented thermal loads, data centers and AI factories must rethink how they remove heat from their infrastructure. The traditional methods of heat rejection that supported predictable CPU-based scaling are no longer sufficient on their own. Today, there are multiple options for moving heat outside the facility, but four major categories dominate current and emerging deployments.

Key Cooling Methods in a Changing Landscape

  • Mechanical Chillers: Mechanical chillers use a vapor compression cycle to cool water, which is then circulated through the data center to absorb heat. These systems are typically air-cooled or water-cooled, with the latter often paired with cooling towers to reject heat. While chillers are reliable and effective across diverse climates, they are also highly energy-intensive. In AI-scale facilities where power consumption and sustainability are top priorities, reliance on chillers can significantly impact both operational costs and carbon footprint.
  • Evaporative Cooling: Evaporative cooling uses the evaporation of water to absorb and remove heat. This can be achieved through direct or indirect systems, or hybrid designs. These systems are much more energy-efficient than chillers but come with high water consumption. In large facilities, they can consume millions of gallons of water per megawatt annually. Their performance is also climate-dependent, making them less effective in humid or water-restricted regions.
  • Dry Coolers: Dry coolers remove heat by transferring it from a closed liquid loop to the ambient air using large finned coils, much like an automotive radiator. These systems don’t rely on water and are ideal for facilities aiming to reduce water usage or operate in dry climates. However, their effectiveness depends heavily on the temperature of the surrounding air. In warmer environments, they may struggle to keep up with high-density cooling demands unless paired with liquid-cooled IT systems that can tolerate higher operating temperatures.
  • Pumped Refrigerant Systems: Pumped refrigerant systems use liquid refrigerants to move heat from the data center to outdoor heat exchangers. Unlike chillers, these systems don’t rely on large compressors inside the facility and they operate without the use of water. This method offers a thermodynamically efficient, compact and scalable solution that works especially well for edge deployments and water-constrained environments. Proper refrigerant handling and monitoring are required, but the benefits in power and water savings are significant.

Each of these methods offers different advantages depending on factors like climate, rack density, facility design and sustainability goals. As liquid cooling becomes more common and servers are designed to operate with warmer water, the door opens to more efficient and environmentally friendly cooling strategies — reducing both energy and water use while enabling higher compute performance.

Optimizing Data Centers for AI Infrastructure

As AI workloads grow exponentially, operators are reimagining data center design with infrastructure built specifically for high-performance AI and energy efficiency. Whether they’re transforming their entire setup into dedicated AI factories or upgrading modular components, optimizing inference performance is crucial for managing costs and operational efficiency.

To get the best performance, high compute capacity GPUs aren’t enough — they need to be able to communicate with each other at lightning speed.

NVIDIA NVLink boosts communication, enabling GPUs to operate as a massive, tightly integrated processing unit for maximum performance with a full-rack power density of 120 kW. This tight, high-speed communication is crucial for today’s AI tasks, where every second saved on transferring data can mean more tokens per second and more efficient AI models.

Traditional air cooling struggles at these power levels. To keep up, data center air would need to be either cooled to below-freezing temperatures or flow at near-gale speeds to carry the heat away, making it increasingly impractical to cool dense racks with air alone.

At nearly 1,000x the density of air, liquid cooling excels at carrying heat away thanks to its superior heat capacitance and thermal conductivity. By efficiently transferring heat away from high-performance GPUs, liquid cooling reduces reliance on energy-intensive and noisy cooling fans, allowing more power to be allocated to computation rather than cooling overhead.

Liquid Cooling in Action

Innovators across the industry are leveraging liquid cooling to slash energy costs, improve density and drive AI efficiency.

Cloud service providers are also adopting cutting-edge cooling and power innovations. Next-generation AWS data centers, featuring jointly developed liquid cooling solutions, increase compute power by 12% while reducing energy consumption by up to 46% — all while maintaining water efficiency.

Cooling the AI Infrastructure of the Future

As AI continues to push the limits of computational scale, innovations in cooling will be essential to meeting the thermal management challenges of the post-Moore’s law era.

NVIDIA is leading this transformation through initiatives like the COOLERCHIPS program, a U.S. Department of Energy-backed effort to develop modular data centers with next-generation cooling systems that are projected to reduce costs by at least 5% and improve efficiency by 20% over traditional air-cooled designs.

Looking ahead, data centers must evolve not only to support AI’s growing demands but do so sustainably — maximizing energy and water efficiency while minimizing environmental impact. By embracing high-density architectures and advanced liquid cooling, the industry is paving the way for a more efficient AI-powered future.

Learn more about breakthrough solutions for data center energy and water efficiency presented at NVIDIA GTC 2025 and discover how accelerated computing is driving a more efficient future with NVIDIA Blackwell.

Read More

Keeping AI on the Planet: NVIDIA Technologies Make Every Day About Earth Day

Whether at sea, land or in the sky — even outer space — NVIDIA technology is helping research scientists and developers alike explore and understand oceans, wildlife, the climate and far out existential risks like asteroids.

These increasingly intelligent developments are helping to analyze environmental pollutants, damage to habitats and natural disaster risks at an accelerated pace. This, in turn, enables partnerships with local governments to take climate mitigation steps like pollution prevention and proactive planting.

Sailing the Seas of AI

Amphitrite, based in France, uses satellite data with AI to simulate and predict ocean currents and weather. Its AI models, driven by the NVIDIA AI and Earth-2 platforms, offer insights for positioning vessels to best harness the power of ocean currents. This helps determine when it’s best to travel, as well as the optimal course, reducing travel times, fuel consumption and carbon emissions. Amphitrite is a member of the NVIDIA Inception program for cutting-edge startups.

Watching Over Wildlife With AI

München, Germany-based OroraTech monitors animal poaching and wildfires with NVIDIA CUDA and Jetson. The NVIDIA Inception program member uses the EarthRanger platform to offer a wildfire detection and monitoring service that uses satellite imagery and AI to safeguard the environment and prevent poaching.

Keeping AI on the Weather

Weather agencies and climate scientists worldwide are using NVIDIA CorrDiff, a generative AI weather model enabling kilometer-scale forecasts of wind, temperature and precipitation type and amount. CorrDiff is part of the NVIDIA Earth-2 platform for simulating weather and climate conditions. It’s available as an easy-to-deploy NVIDIA NIM microservice.

In another climate effort, NVIDIA Research announced a new generative AI model, called StormCast, for reliable weather prediction at a scale larger than storms.

The model, outlined in a paper, can help with disaster and mitigation planning, saving lives.

Avoiding Mass Extinction Events

Researchers reported in Nature how a new method was able to spot 10-meter asteroids within the main asteroid belt located between Jupiter and Mars. Such space rocks can range from bus-sized to several Costco stores in width and can deliver destruction to cities. The method drew on views of these asteroids captured by NASA’s James Webb Space Telescope (JWST) in previous research and was enabled by NVIDIA accelerated computing.

Boosting Energy Efficiency With Liquid-Cooled Blackwell

NVIDIA GB200 NVL72 rack-scale, liquid-cooled systems, built on the Blackwell platform, offer exceptional performance while balancing energy costs and heat. They deliver 40x higher revenue potential, 30x higher throughput, 25x more energy efficiency and 300x more water efficiency than air-cooled architectures. NVIDIA GB300 NVL72 systems built on the Blackwell Ultra platform offer 50x higher revenue potential and 35x higher throughput with 30x more energy efficiency.

Learn more about NVIDIA Earth-2 and NVIDIA Blackwell.

Read More

Amazon Bedrock Prompt Optimization Drives LLM Applications Innovation for Yuewen Group

Yuewen Group is a global leader in online literature and IP operations. Through its overseas platform WebNovel, it has attracted about 260 million users in over 200 countries and regions, promoting Chinese web literature globally. The company also adapts quality web novels into films and animations for international markets, expanding the global influence of Chinese culture.

Today, we are excited to announce the availability of Prompt Optimization on Amazon Bedrock. With this capability, you can now optimize your prompts for several use cases with a single API call or a click of a button on the Amazon Bedrock console. In this blog post, we discuss how Prompt Optimization improves the performance of large language models (LLMs) for intelligent text processing tasks at Yuewen Group.

Evolution from Traditional NLP to LLM in Intelligent Text Processing

Yuewen Group leverages AI for intelligent analysis of extensive web novel texts. Initially relying on proprietary natural language processing (NLP) models, Yuewen Group faced challenges with prolonged development cycles and slow updates. To improve performance and efficiency, Yuewen Group transitioned to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.

Claude 3.5 Sonnet offers enhanced natural language understanding and generation capabilities, handling multiple tasks concurrently with improved context comprehension and generalization. Using Amazon Bedrock significantly reduced technical overhead and streamlined the development process.

However, Yuewen Group initially struggled to fully harness LLM’s potential due to limited experience in prompt engineering. In certain scenarios, the LLM’s performance fell short of traditional NLP models. For example, in the task of “character dialogue attribution”, traditional NLP models achieved around 80% accuracy, while LLMs with unoptimized prompts only reached around 70%. This discrepancy highlighted the need for strategic prompt optimization to enhance capabilities of LLMs in these specific use cases.

Challenges in Prompt Optimization

Manual prompt optimization can be challenging due to the following reasons:

Difficulty in Evaluation: Assessing the quality of a prompt and its consistency in eliciting desired responses from a language model is inherently complex. Prompt effectiveness is not only determined by the prompt quality, but also by its interaction with the specific language model, depending on its architecture and training data. This interplay requires substantial domain expertise to understand and navigate. In addition, evaluating LLM response quality for open-ended tasks often involves subjective and qualitative judgements, making it challenging to establish objective and quantitative optimization criteria.

Context Dependency: Prompt effectiveness is highly contingent on the specific contexts and use cases. A prompt that works well in one scenario may underperform in another, necessitating extensive customization and fine-tuning for different applications. Therefore, developing a universally applicable prompt optimization method that generalizes well across diverse tasks remains a significant challenge.

Scalability: As LLMs find applications in a growing number of use cases, the number of required prompts and the complexity of the language models continue to rise. This makes manual optimization increasingly time-consuming and labor-intensive. Crafting and iterating prompts for large-scale applications can quickly become impractical and inefficient. Meanwhile, as the number of potential prompt variations increases, the search space for optimal prompts grows exponentially, rendering manual exploration of all combinations infeasible, even for moderately complex prompts.

Given these challenges, automatic prompt optimization technology has garnered significant attention in the AI community. In particular, Bedrock Prompt Optimization offers two main advantages:

  • Efficiency: It saves considerable time and effort by automatically generating high-quality prompts suited for a variety of target LLMs supported on Bedrock, alleviating the need for tedious manual trial and error in model-specific prompt engineering.
  • Performance Enhancement: It notably improves AI performance by creating optimized prompts that enhance the output quality of language models across a wide range of tasks and tools.

These benefits not only streamline the development process, but also lead to more efficient and effective AI applications, positioning auto-prompting as a promising advancement in the field.

Introduction to Bedrock Prompt Optimization

Prompt Optimization on Amazon Bedrock is an AI-driven feature that automatically optimizes underdeveloped prompts for customers’ specific use cases, enhancing performance across different target LLMs and tasks. Prompt Optimization is seamlessly integrated into the Amazon Bedrock Playground and Prompt Management so you can easily create, evaluate, store, and use optimized prompts in your AI applications.


On the AWS Management Console for Prompt Management, users input their original prompt. The prompt can be a template with the required variables represented by placeholders (for example, {{document}}), or a full prompt with the actual text filled into the placeholders. After selecting a target LLM from the supported list, users can kick off the optimization process with a single click, and the optimized prompt is generated within seconds. The console then displays the Compare Variants tab, presenting the original and optimized prompts side by side for quick comparison. The optimized prompt often includes more explicit instructions on processing the input variables and generating the desired output format. Users can observe the enhancements made by Prompt Optimization to improve the prompt’s performance for their specific task.
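To illustrate the single-API-call path, the following is a minimal sketch using the OptimizePrompt action of the Amazon Bedrock Agents runtime. The target model ID and sample prompt are placeholders, and the event parsing follows the API reference at the time of writing; verify against the current documentation.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def optimize(prompt: str, target_model_id: str) -> str:
    response = client.optimize_prompt(
        input={"textPrompt": {"text": prompt}},
        targetModelId=target_model_id,
    )
    optimized = ""
    # The response streams analysis events followed by the rewritten prompt.
    for event in response["optimizedPrompt"]:
        if "optimizedPromptEvent" in event:
            optimized = event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"]
    return optimized

print(optimize("Identify the speaker of each line in {{dialogue}}.",
               "anthropic.claude-3-5-sonnet-20240620-v1:0"))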


Comprehensive evaluation was done on open-source datasets across tasks including classification, summarization, open-book QA/RAG, and agent/function calling, as well as on complex real-world customer use cases, and showed substantial improvements from the optimized prompts.

Under the hood, a Prompt Analyzer and a Prompt Rewriter are combined to optimize the original prompt. The Prompt Analyzer is a fine-tuned LLM that decomposes the prompt structure by extracting its key constituent elements, such as the task instruction, input context, and few-shot demonstrations. The extracted prompt components are then channeled to the Prompt Rewriter module, which employs a general LLM-based meta-prompting strategy to further improve the prompt signatures and restructure the prompt layout. As a result, the Prompt Rewriter produces a refined and enhanced version of the initial prompt tailored to the target LLM.

Results of Prompt Optimization

Using Bedrock Prompt Optimization, Yuewen Group achieved significant improvements across various intelligent text analysis tasks, including name extraction and multi-option reasoning use cases. Taking character dialogue attribution as an example, optimized prompts reached 90% accuracy, surpassing traditional NLP models by 10 percentage points in the customer’s experiments.

Using the power of foundation models, Prompt Optimization produces high-quality results with minimal manual prompt iteration. Most importantly, this feature enabled Yuewen Group to complete prompt engineering processes in a fraction of the time, greatly improving development efficiency.

Prompt Optimization Best Practices

Throughout our experience with Prompt Optimization, we’ve compiled several tips for better user experience:

  1. Use a clear and precise input prompt: Prompt Optimization benefits from clear intent and key expectations stated in your input prompt. A clear prompt structure, for example separating different prompt sections with new lines, also offers a better starting point for Prompt Optimization.
  2. Use English as the input language: We recommend using English as the input language for Prompt Optimization. Currently, prompts written largely in other languages might not yield the best results.
  3. Avoid overly long input prompts and examples: Excessively long prompts and few-shot examples significantly increase the difficulty of semantic understanding and challenge the output length limit of the rewriter. Also avoid packing multiple placeholders into a single sentence while stripping their surrounding context from the prompt body. For example, instead of “Answer the {{question}} by reading {{author}}’s {{paragraph}}”, assemble your prompt in a form such as “Paragraph:\n{{paragraph}}\nAuthor:\n{{author}}\nAnswer the following question:\n{{question}}”.
  4. Use in the early stages of prompt engineering: Prompt Optimization excels at quickly optimizing less-structured prompts (also known as “lazy prompts”) during the early stage of prompt engineering. The improvement is likely to be more significant for such prompts than for those already carefully curated by experts or prompt engineers.

Conclusion

Prompt Optimization on Amazon Bedrock has proven to be a game-changer for Yuewen Group in their intelligent text processing. By significantly improving the accuracy of tasks like character dialogue attribution and streamlining the prompt engineering process, Prompt Optimization has enabled Yuewen Group to fully harness the power of LLMs. This case study demonstrates the potential of Prompt Optimization to revolutionize LLM applications across industries, offering both time savings and performance improvements. As AI continues to evolve, tools like Prompt Optimization will play a crucial role in helping businesses maximize the benefits of LLM in their operations.

We encourage you to explore Prompt Optimization to improve the performance of your AI applications. To get started with Prompt Optimization, see the following resources:

  1. Amazon Bedrock Pricing page
  2. Amazon Bedrock user guide
  3. Amazon Bedrock API reference

About the Authors

Rui Wang is a senior solutions architect at AWS with extensive experience in game operations and development. As an enthusiastic Generative AI advocate, he enjoys exploring AI infrastructure and LLM application development. In his spare time, he loves eating hot pot.

Hao Huang is an Applied Scientist at the AWS Generative AI Innovation Center. His expertise lies in generative AI, computer vision, and trustworthy AI. Hao also contributes to the scientific community as a reviewer for leading AI conferences and journals, including CVPR, AAAI, and TMM.

Guang Yang, Ph.D. is a senior applied scientist with the Generative AI Innovation Centre at AWS. He has been with AWS for five years, leading several customer projects in the Greater China Region spanning industry verticals such as software, manufacturing, retail, AdTech, and finance. He has over 10 years of academic and industry experience in building and deploying ML and generative AI solutions for business problems.

Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.

Huong Nguyen is a Principal Product Manager at AWS. She is a product leader at Amazon Bedrock, with 18 years of experience building customer-centric and data-driven products. She is passionate about democratizing responsible machine learning and generative AI to enable customer experience and business innovation. Outside of work, she enjoys spending time with family and friends, listening to audiobooks, traveling, and gardening.

Read More

Build a location-aware agent using Amazon Bedrock Agents and Foursquare APIs

This post is co-written with Vikram Gundeti and Nate Folkert from Foursquare.

Personalization is key to creating memorable experiences. Whether it’s recommending the perfect movie or suggesting a new restaurant, tailoring suggestions to individual preferences can make all the difference. But when it comes to food and activities, there’s more to consider than just personal taste. Location and weather also play a crucial role in shaping our choices. Imagine planning a day out: on a sunny afternoon, a leisurely picnic in the park might be ideal, but if it’s pouring rain, a cozy indoor café would be much more appealing. The challenge, then, is to create an agent that can seamlessly integrate these factors—location, weather, and personal preferences—to provide truly personalized recommendations.

To tackle this challenge, we can combine Amazon Bedrock Agents and Foursquare APIs. In this post, we demonstrate how you can use a location-aware agent to bring personalized responses to your users.

Amazon Bedrock Agents

Amazon Bedrock is a fully managed service that makes it straightforward to build and scale generative AI applications. It provides access to a variety of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Luma, Meta, Mistral AI, Stability AI, and Amazon, all through a single API. This means you don’t need to manage infrastructure, because it’s serverless and integrates with familiar AWS services for security, privacy, and responsible AI. You can experiment with models, customize them with your data, and build applications without writing complex code.

Amazon Bedrock Agents is a feature within Amazon Bedrock that allows you to create autonomous AI agents. These agents can understand user requests, break them into steps, and complete tasks by connecting to your company’s APIs and data sources. For example, they can automate processes like processing insurance claims or managing inventory, making them efficient for business tasks. They handle prompt engineering, memory, and security automatically, so you can set them up quickly without managing infrastructure.

Foursquare Places APIs

Foursquare’s Places APIs deliver precise location intelligence for applications requiring contextual awareness. Built on top of the open source global Places dataset with 100 million points of interest spanning 1,500 categories, the Places APIs transform geographic coordinates into actionable business context.

The GeoTagging API accurately resolves GPS coordinates to a specific place with a high degree of precision, enabling applications to instantly determine if a user is at a local coffee shop, inside a Macy’s department store, or standing in Central Park. The Place Search & Data APIs transform how applications discover locations by providing nuanced filtering capabilities beyond simple proximity searches. Developers can filter places by specific categories (finding only restaurants, parks, or tourist attractions), apply attribute-based constraints (such as price range or special amenities), consider temporal factors (like current operating status), and balance distance with relevance for truly contextual results. Each place returned comes enriched with contextual attributes, including photos, reviews, quality ratings, and real-time popularity data.

When integrated with Amazon Bedrock Agents, Foursquare’s Places APIs enable the creation of applications that understand the complete context of a user’s location—resulting in experiences that are relevant, timely, and personalized.
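As a flavor of what such a tool call looks like, here is a hedged sketch of a place search against the Foursquare v3 Places Search endpoint; check the Foursquare documentation for the current host and auth scheme for Service API keys. The coordinates and query are illustrative.

import os
import requests

def find_parks(lat: float, lon: float):
    # Search for parks near the given coordinates that are currently open.
    response = requests.get(
        "https://api.foursquare.com/v3/places/search",
        headers={"Authorization": os.environ["FOURSQUARE_SERVICE_TOKEN"]},
        params={
            "ll": f"{lat},{lon}",
            "query": "park",
            "open_now": "true",
            "sort": "RELEVANCE",
            "limit": 5,
        },
    )
    response.raise_for_status()
    return [place["name"] for place in response.json()["results"]]

print(find_parks(40.7308, -73.9973))  # illustrative coordinates in Manhattan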

Solution overview

To demonstrate the power of adding location to Amazon Bedrock Agents, we created a simple architecture that pairs an Amazon Bedrock agent with the Foursquare Places APIs and a weather API. By combining these capabilities, we can create unique user experiences that are customized to the context of where the user is. The following diagram shows how we architected the agent.

In the solution workflow, the user interacts with the agent through a Streamlit web interface. The web application uses the application logic that invokes the Amazon Bedrock agent in the cloud. The agent knows about the location tools and weather tools even though these are hosted locally inside the application. When the tools are invoked by the agent, a return of control response is given to the application logic, which invokes the tool and provides the response from the tool in a second invocation of the agent. In addition to the tools, the agent has basic instructions on what type of personality it should have and what types of behaviors it should support.
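Here is a hedged sketch of that return-of-control loop using boto3’s InvokeAgent API. The agent IDs are placeholders, and call_local_tool() is a hypothetical helper that runs the Foursquare or weather tool locally and formats the result in the shape returnControlInvocationResults expects.

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime")
SESSION_ID = str(uuid.uuid4())

def chat(text: str) -> str:
    kwargs = {
        "agentId": "AGENT_ID",       # placeholder
        "agentAliasId": "ALIAS_ID",  # placeholder
        "sessionId": SESSION_ID,
        "inputText": text,
    }
    while True:
        answer, roc = "", None
        for event in client.invoke_agent(**kwargs)["completion"]:
            if "chunk" in event:
                answer += event["chunk"]["bytes"].decode()
            elif "returnControl" in event:
                roc = event["returnControl"]  # the agent asks the app to run a tool
        if roc is None:
            return answer
        # Run each requested tool locally, then re-invoke the agent with the results.
        results = [call_local_tool(i) for i in roc["invocationInputs"]]  # hypothetical helper
        kwargs = {
            "agentId": "AGENT_ID",
            "agentAliasId": "ALIAS_ID",
            "sessionId": SESSION_ID,
            "sessionState": {
                "invocationId": roc["invocationId"],
                "returnControlInvocationResults": results,
            },
        }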

Let’s explore an example of a brief interaction with the agent where we ask if there is a park nearby and a recommended restaurant near the park for takeout food.

The following screenshot shows the first interaction with an agent, locating a park nearby with the Foursquare APIs invoked by the agent.

In this example, you can see the agent sending intermediate events to the user informing them of the actions taking place (invoking the model, invoking a tool, thinking, and so on).

The following screenshot shows the list of restaurants recommended by the Foursquare APIs near the park.

In this example, the agent invokes the APIs based on the user input, and the Streamlit UI connects the output from Foursquare to a map.

In the following section, we detail how you can build the agent in your account and get started.

Prerequisites

To deploy this solution, you should have an AWS account with the necessary permissions.

You will also need a Foursquare Service API Key to allow your AI agent to access Foursquare API endpoints. If you do not already have one, follow the instructions in Foursquare Documentation – Manage Your Service API Keys to create one. You will need to log in to your Foursquare developer account, or create an account if you do not have one (a basic account is free and includes starter credit for your project). Be sure to copy the Service API key upon creation, because you will not be able to see it again.

Build the agent

The source code for the Foursquare agent is available as open source in the following GitHub repository. Complete the following steps to set up the agent in your local folder from the source:

  1. Clone the repository to a local folder.
  2. Set environment variables for your Foursquare API token:

export FOURSQUARE_SERVICE_TOKEN=<Foursquare API token>

  3. Set environment variables for your AWS credentials:

export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>

export AWS_SECRET_ACCESS_KEY=<SECRET_ACCESS_KEY>

  4. Install the requirements:

pip install -r requirements.txt

  5. Start the Streamlit UI:

streamlit run agent_ui.py

Best practices

When you’re creating an agent, we recommend starting with a test dataset. Think through the possible inputs and what are acceptable outputs. Use these sample conversations to test the agent whenever a change is made. In addition, Amazon Bedrock Agents allows you to configure guardrails to protect against malicious input or types of conversation that you would not want to use for your user experience. We recommend for any production use cases to couple your agent with appropriate guardrails. To learn more, see Amazon Bedrock Guardrails.

Clean up

When you’re done using the solution, remove any resources you created to avoid ongoing charges.

Conclusion

Agents provide a mechanism to automate work on behalf of your customers, whether through a chat interface or other inputs. By combining the automation that agents make possible with the location-aware APIs from Foursquare, you can create powerful UIs and experiences that will delight your customers with new levels of personalization. With Amazon Bedrock Agents, you can build a cloud-centered solution that lets you use powerful foundation models on Amazon Bedrock to drive these experiences.

Try out the solution for your own use case, and share your feedback in the comments.


About the authors

John Baker is a Principal SDE at AWS, where he works on Amazon Bedrock and specifically Amazon Bedrock Agents. He has been with Amazon for more than 10 years and has worked across AWS, Alexa, and Amazon.com. In his spare time, John enjoys skiing and other outdoor activities throughout the Pacific Northwest.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS Certifications, including the ML Specialty Certification.

Vikram Gundeti currently serves as the Chief Technology Officer (CTO) of Foursquare, where he leads the technical strategy, decision making, and research for the company’s Geospatial Platform. Before joining Foursquare, Vikram held the position of Principal Engineer at Amazon, where he made his mark as a founding engineer on the Amazon Alexa team.

Nate Folkert is a Senior Staff Engineer at Foursquare, where he’s been since spotting it trending nearby when checking in at a Soho coffee shop 14 years ago. He builds the server API for Swarm and helps out on special projects. Outside of work, he loves exploring the world (with Swarm, ofc, so is it really outside of work?) and is currently obsessed with finding all of the irl filming locations used in Apple TV’s Severance.

Read More

Build an automated generative AI solution evaluation pipeline with Amazon Nova

Large language models (LLMs) have become integral to numerous applications across industries, ranging from enhanced customer interactions to automated business processes. Deploying these models in real-world scenarios presents significant challenges, particularly in ensuring accuracy, fairness, relevance, and mitigating hallucinations. Thorough evaluation of the performance and outputs of these models is therefore critical to maintaining trust and safety.

Evaluation plays a central role in the generative AI application lifecycle, much like in traditional machine learning. Robust evaluation methodologies enable informed decision-making regarding the choice of models and prompts. However, evaluating LLMs is a complex and resource-intensive process given the free-form text output of LLMs. Methods such as human evaluation provide valuable insights but are costly and difficult to scale. Consequently, there is a demand for automated evaluation frameworks that are highly scalable and can be integrated into application development, much like unit and integration tests in software development.

In this post, we introduce an automated evaluation framework that addresses these challenges and is deployable on AWS. The solution can integrate multiple LLMs, use customized evaluation metrics, and enable businesses to continuously monitor model performance. We also provide LLM-as-a-judge evaluation metrics using the newly released Amazon Nova models, whose advanced capabilities and low latency make evaluations scalable. Additionally, we provide a user-friendly interface to enhance ease of use.

In the following sections, we discuss various approaches to evaluate LLMs. We then present a typical evaluation workflow, followed by our AWS-based solution that facilitates this process.

Evaluation methods

Prior to implementing evaluation processes for generative AI solutions, it’s crucial to establish clear metrics and criteria for assessment and gather an evaluation dataset.

The evaluation dataset should be representative of the actual real-world use case. It should consist of diverse samples and ideally contain ground truth values generated by experts. The size of the dataset will depend on the exact application and the cost of acquiring data; at a minimum, however, the dataset should span the relevant and diverse use cases. Developing an evaluation dataset can itself be an iterative task that is progressively enhanced by adding new samples and enriching the dataset with samples where the model performance is lacking. After the evaluation dataset is acquired, evaluation criteria can be defined.

The evaluation criteria can be broadly divided into three main areas:

  • Latency-based metrics – These include measurements such as response generation time or time to first token. The importance of each metric might vary depending on the specific application.
  • Cost – This refers to the expense associated with response generation.
  • Performance – Performance-based metrics are highly case-dependent. They might include measurements of accuracy, factual consistency of responses, or the ability to generate structured responses.

Generally, there is an inverse relationship between latency, cost, and performance. Depending on the use case, one factor might be more critical than the others. Having metrics for these categories across different models can help you make data-driven decisions to determine the optimum choice for your specific use case.

Although measuring latency and cost can be relatively straightforward, assessing performance requires a deep understanding of the use case and knowing what is crucial for success. Depending on the application, you might be interested in evaluating the factual accuracy of the model’s output (particularly if the output is based on specific facts or reference documents), or you might want to assess whether the model’s responses are consistently polite and helpful, or both.

To support these diverse scenarios, we have incorporated several evaluation metrics in our solution:

  • FMEval – The Foundation Model Evaluation (FMEval) library provided by AWS offers purpose-built evaluation models to provide metrics like toxicity in LLM output, accuracy, and semantic similarity between generated and reference text. The library can be used to evaluate LLMs across several tasks such as open-ended generation, text summarization, question answering, and classification.
  • Ragas – Ragas is an open source framework that provides metrics for evaluating Retrieval Augmented Generation (RAG) systems (systems that generate answers based on a provided context). Ragas can be used to evaluate the performance of an information retriever (the component that retrieves relevant information from a database) using metrics like context precision and recall. Ragas also provides metrics to evaluate the LLM generation from the provided context, such as answer faithfulness to the provided context and answer relevance to the original question.
  • LLMeter – LLMeter is a simple solution for latency and throughput testing of LLMs, such as those provided through Amazon Bedrock and OpenAI. It can be helpful for comparing models on latency-critical workloads.
  • LLM-as-a-judge metrics – Several challenges arise in defining performance metrics for free-form text generated by LLMs; for example, the same information might be expressed in different ways, and it’s difficult to clearly define metrics for characteristics like politeness. To tackle such evaluations, LLM-as-a-judge metrics have become popular. These evaluations use a judge LLM to score the output of another LLM based on predefined criteria (see the sketch after this list). We use the Amazon Nova model as the judge due to its advanced accuracy and performance.
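The following is a minimal sketch of an LLM-as-a-judge call through the Amazon Bedrock Converse API, with Amazon Nova Pro as the judge. The prompt, the 1–5 scoring scale, and the model ID are illustrative assumptions (some Regions require an inference profile ID instead); the solution lets you supply your own evaluation prompts.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative judge prompt; replace with criteria that fit your use case.
JUDGE_PROMPT = """You are an impartial judge. Score the candidate answer
against the reference answer for factual accuracy on a scale of 1-5.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Respond with only the integer score."""

def judge(question: str, reference: str, candidate: str) -> int:
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed judge model ID
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate)}],
        }],
        inferenceConfig={"temperature": 0.0},
    )
    return int(response["output"]["message"]["content"][0]["text"].strip())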

Evaluation workflow

Now that we know what metrics we care about, how do we go about evaluating our solution? A typical generative AI application development (proof of concept) process can be abstracted as follows:

  1. Builders use a few test examples and try out different prompts to see the performance and get a rough idea of the prompt template and model they want to start with (online evaluation).
  2. Builders test the first prompt template version with a selected LLM against a test dataset with ground truth for a list of evaluation metrics to check the performance (offline evaluation). Based on the evaluation results, they might need to modify the prompt template, fine-tune the model, or implement RAG to add additional context to improve performance.
  3. Builders implement the change and evaluate the updated solution against the dataset to validate improvements on the solution. Then they repeat the previous steps until the performance of the developed solution meets the business requirements.

The two key stages in the evaluation process are:

  • Online evaluation – This involves manually evaluating prompts based on a few examples for qualitative checks
  • Offline evaluation – This involves automated quantitative evaluation on an evaluation dataset

This process can add significant operational complications and effort from the builder team and operations team. To achieve this workflow, you need the following:

  • A side-by-side comparison tool for various LLMs
  • A prompt management service that can be used to save and version control prompts
  • A batch inference service that can invoke your selected LLM on a large number of examples
  • A batch evaluation service that can be used to evaluate the LLM response generated in the previous step

In the next section, we describe how we can create this workflow on AWS.

Solution overview

In this section, we present an automated generative AI evaluation solution that can be used to simplify the evaluation process. The architecture diagram of the solution is shown in the following figure.

This solution provides both online (real-time comparison) and offline (batch evaluation) evaluation options that fulfill different needs during the generative AI solution development lifecycle. Each component in this evaluation infrastructure can be developed using existing open source tools or AWS native services.

The architecture of the automated LLM evaluation pipeline focuses on modularity, flexibility, and scalability. The design philosophy makes sure that different components can be reused or adapted for other generative AI projects. The following is an overview of each component and its role in the solution:

  • UI – The UI provides a straightforward way to interact with the evaluation framework. Users can compare different LLMs with a side-by-side comparison. The UI provides latency, model outputs, and cost for each input query (online evaluation). The UI also helps you store and manage your different prompt templates backed by the Amazon Bedrock prompt management feature. These prompts can be referenced later for batch generation or production use. You can also launch batch generation and evaluation jobs through the UI. The UI service can be run locally in a Docker container or deployed to AWS Fargate.
  • Prompt management – The evaluation solution includes a key component for prompt management. Backed by Amazon Bedrock prompt management, you can save and retrieve your prompts using the UI.
  • LLM invocation pipeline – Using AWS Step Functions, this workflow automates the process of generating outputs from the LLM for a test dataset. It retrieves inputs from Amazon Simple Storage Service (Amazon S3), processes them, and stores the responses back to Amazon S3. This workflow supports batch processing, making it suitable for large-scale evaluations.
  • LLM evaluation pipeline – This workflow, also managed by Step Functions, evaluates the outputs generated by the LLM. At the time of writing, the solution supports metrics provided by the FMEval library, Ragas library, and custom LLM-as-a-judge metrics. It handles various evaluation methods, including direct metrics computation and LLM-guided evaluation. The results are stored in Amazon S3, ready for analysis.
  • Eval factory – A core service for conducting evaluations, the eval factory supports multiple evaluation techniques, including those that use other LLMs for reference-free scoring. It provides consistency in evaluation results by standardizing outputs into a single metric per evaluation. It can be difficult to find a one-size-fits-all solution when it comes to evaluation, so we provide you the flexibility to use your own script for evaluation. We also provide pre-built scripts and pipelines for some common tasks including classification, summarization, translation, and RAG. Especially for RAG, we have integrated popular open source libraries like Ragas.
  • Postprocessing and results store – After the pipeline results are generated, postprocessing can concatenate the results and potentially display the results in a results store that can provide a graphical view of the results. This part also handles updates to the prompt management system because each prompt template and LLM combination will have recorded evaluation results to help you select the right model and prompt template for the use case. Visualization of the results can be done on the UI or even with an Amazon Athena table if the prompt management system uses Amazon S3 as the data storage. This part can be done by using an AWS Lambda function, which can be triggered by an event sent after the new data has been saved to the Amazon S3 location for the prompt management system.

The evaluation solution can significantly enhance team productivity throughout the development lifecycle by reducing manual intervention and increasing automated processes. As new LLMs emerge, builders can compare the current production LLM with new models to determine if upgrading would improve the system’s performance. This ongoing evaluation process makes sure that the generative AI solution remains optimal and up-to-date.

Prerequisites

For scripts to set up the solution, refer to the GitHub repository. After the backend and the frontend are up and running, you can start the evaluation process.

To start, open the UI in your browser. The UI provides the ability to do both online and offline evaluations.

Online evaluation

To iteratively refine prompts, you can follow these steps:

  1. Choose the options menu (three lines) on the top left side of the page to set the AWS Region.
  2. After you choose the Region, the model lists will be prefilled with the available Amazon Bedrock models in that Region.
  3. You can choose two models for side-by-side comparison.
  4. You can select a prompt already stored in Amazon Bedrock prompt management from the dropdown menu. If selected, this will automatically fill the prompts.
  5. You can also create a new prompt by entering it in the text box. You can select generation configurations (temperature, top P, and so on) on the Generation Configuration tab. The prompt template can also use dynamic variables by entering variables in {{}} (for example, for additional context, add a variable like {{context}}). Then define the values of these variables on the Context tab.
  6. Choose Enter to start generation.
  7. This will invoke the two models and present the output in the text boxes below each model. You will also be provided with the latency and cost for each model.
  8. To save the prompt to Amazon Bedrock, choose Save.

Offline generation and evaluation

After you have made the model and prompt choice, you can run batch generation and evaluation over a larger dataset.

  1. To run batch generation, choose the model from the dropdown list.
  2. You can provide an Amazon Bedrock knowledge base ID if additional context is required for generation.
  3. You can also provide a prompt template ID. This prompt will be used for generation.
  4. Upload a dataset file. This file will be uploaded to the S3 bucket set in the sidebar. This file should be a pipe (|) separated CSV file. For more details on expected data file format, see the project’s GitHub README file.
  5. Choose Start Generation to start the job. This will trigger a Step Functions workflow that you can track by choosing the link in the pop-up.

Select model for Batch Generation

Invoking batch generation triggers a Step Functions workflow, which is shown in the following figure. The logic follows these steps:

  1. GetPrompts – This step retrieves a CSV file containing prompts from an S3 bucket. The contents of this file become the Step Functions workflow’s payload.
  2. convert_to_json – This step parses the CSV output and converts it into JSON format. This transformation enables the Step Functions workflow to use the Map state to process the invoke_llm flow concurrently.
  3. Map step – This is an iterative step that processes the JSON payload by invoking the invoke_llm Lambda function concurrently for each item in the payload. A concurrency limit is set, with a default value of 3. You can adjust this limit based on the capacity of your backend LLM service. Within each Map iteration, the invoke_llm Lambda function calls the backend LLM service to generate a response for a single question and its associated context (a sketch of this function follows the list).
  4. InvokeSummary – This step combines the output from each iteration of the Map step. It generates a JSON Lines result file containing the outputs, which is then stored in an S3 bucket for evaluation purposes.
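The following is a minimal sketch of what an invoke_llm handler might look like, assuming the backend LLM is called through the Amazon Bedrock Converse API. The event field names (question, context, model_id) and the default model ID are assumptions for illustration; see the GitHub repository for the actual implementation.

import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    # Each Map iteration passes one question and its optional context.
    question = event["question"]
    context_text = event.get("context", "")

    prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # Call the backend LLM through the Bedrock Converse API.
    response = bedrock.converse(
        modelId=event.get("model_id", "amazon.nova-pro-v1:0"),  # assumed default
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 1024},
    )

    answer = response["output"]["message"]["content"][0]["text"]
    return {"question": question, "answer": answer}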

When the batch generation is complete, you can trigger a batch evaluation pipeline with the selected metrics from the predefined metric list. You can also specify the location of an S3 file that contains already generated LLM outputs to perform batch evaluation.

Select model for Evaluation

Invoking batch evaluation triggers an Evaluate-LLM Step Functions workflow, which is shown in the following figure. The Evaluate-LLM Step Functions workflow is designed to comprehensively assess LLM performance using multiple evaluation frameworks:

  • LLMeter evaluation – Uses the AWS Labs LLMeter framework and focuses on endpoint performance metrics and benchmarking.
  • Ragas framework evaluation – Uses the Ragas framework to measure four critical quality metrics (a usage sketch follows this list):
    • Context precision – A metric that evaluates whether the ground truth relevant items present in the contexts (chunks retrieved from the vector database) are ranked highly. Its value ranges between 0–1, with higher values indicating better performance. A RAG system usually retrieves more than one chunk for a given query, and the chunks are returned in ranked order. A lower score is assigned when the highly ranked chunks contain more irrelevant information, which indicates poor information retrieval capability.
    • Context recall – A metric that measures the extent to which the retrieved context aligns with the ground truth. Its value ranges between 0–1, with higher values indicating better performance. The ground truth can contain several short and definitive claims. For example, the ground truth “Canberra is the capital city of Australia, and the city is located at the northern end of the Australian Capital Territory” has two claims: “Canberra is the capital city of Australia” and “Canberra city is located at the northern end of the Australian Capital Territory.” Each claim in the ground truth is analyzed to determine whether it can be attributed to the retrieved context or not. A higher value is assigned when more claims in the ground truth are attributable to the retrieved context.
    • Faithfulness – A metric that measures the factual consistency of the generated answer against the given context. Its value ranges between 0–1, with higher values indicating better performance. The answer can also contain several claims. A lower score is assigned to answers that contain a smaller number of claims that can be inferred from the given context.
    • Answer relevancy – A metric that assesses how pertinent the generated answer is to the given prompt. Its value ranges between 0–1, with higher values indicating better relevancy. A lower score is assigned to answers that are incomplete or contain redundant information.
  • LLM-as-a-judge evaluation – Uses LLM capabilities to compare and score outputs against expected answers, which provides qualitative assessment of response accuracy. The prompts used for the LLM-as-a-judge are for demonstration purposes; to serve your specific use case, provide your own evaluation prompts to make sure the LLM-as-a-judge meets the correct evaluation requirements.
  • FMEval evaluation – Uses the AWS open source FMEval library and analyzes key metrics, including toxicity measurement.
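As a reference for the Ragas portion, the following is a minimal sketch of computing the four metrics above with the open source Ragas library. The column names follow common Ragas conventions and the single sample is made up; exact APIs vary between Ragas versions, and these metrics also require a judge LLM configured per the Ragas documentation.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# One made-up sample; in practice, load your generated outputs from Amazon S3.
samples = {
    "question": ["What is the capital of Australia?"],
    "answer": ["Canberra is the capital city of Australia."],
    "contexts": [["Canberra is the capital of Australia, located in the ACT."]],
    "ground_truth": ["Canberra is the capital city of Australia."],
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)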

The architecture implements these evaluations as nested Step Functions workflows that execute concurrently, enabling efficient and comprehensive model assessment. This design also makes it straightforward to add new frameworks to the evaluation workflow.

Step Function workflow for Evaluation

Clean up

To delete local deployment for the frontend, run run.sh delete_local. If you need to delete the cloud deployment, run run.sh delete_cloud. For the backend, you can delete the AWS CloudFormation stack, llm-evaluation-stack. For resources that you can’t delete automatically, manually delete them on the AWS Management Console.

Conclusion

In this post, we explored the importance of evaluating LLMs in the context of generative AI applications, highlighting the challenges posed by issues like hallucinations and biases. We introduced a comprehensive solution using AWS services to automate the evaluation process, allowing for continuous monitoring and assessment of LLM performance. By using tools like the FMEval library, Ragas, LLMeter, and Step Functions, the solution provides flexibility and scalability, meeting the evolving needs of LLM consumers.

With this solution, businesses can confidently deploy LLMs, knowing they adhere to the necessary standards for accuracy, fairness, and relevance. We encourage you to explore the GitHub repository and start building your own automated LLM evaluation pipeline on AWS today. This setup can not only streamline your AI workflows but also make sure your models deliver the highest-quality outputs for your specific applications.


About the Authors

Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in artificial intelligence, he partners with clients to accelerate their GenAI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences, currently focusing on strength training.

Rafa Xu is a passionate Amazon Web Services (AWS) senior cloud architect focused on helping public sector customers design, build, and run infrastructure, applications, and services on AWS. With more than 10 years of experience working across multiple information technology disciplines, Rafa has spent the last five years focused on AWS Cloud infrastructure, serverless applications, and automation. More recently, Rafa has expanded his skillset to include generative AI, machine learning, big data, and the Internet of Things (IoT).

Dr. Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of large language models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Sam Edwards is a Solutions Architect at AWS based in Sydney and focused on media and entertainment. He is a Subject Matter Expert for the Amazon Bedrock and Amazon SageMaker AI services. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. In his spare time, he likes traveling and enjoying time with family.

Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker and Bedrock. He is a SageMaker and Bedrock Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.

Read More

Allie: A Human-Aligned Chess Bot

Play against Allie on lichess!

Introduction

In 1948, Alan Turing designed what might be the first chess-playing AI, a paper program that Turing himself acted as the computer for. Since then, chess has been a testbed for nearly every generation of AI advancement. After decades of improvement, today’s top chess engines like Stockfish and AlphaZero have far surpassed the capabilities of even the strongest human grandmasters.

However, most chess players are not grandmasters, and these state-of-the-art chess AIs have been described as playing more like aliens than fellow humans.

The core problem here is that strong AI systems are not human-aligned; they are unable to match the diversity of skill levels of human partners and unable to model human-like behaviors beyond piece movement. Understanding how to make AI systems that can effectively collaborate with and be overseen by humans is a key challenge in AI alignment. Chess provides an ideal testbed for trying out new ideas towards this goal – while modern chess engines far surpass human ability, they are completely incapable of playing in a human-like way or adapting to match their human opponents’ skill levels. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game.

What is Human-aligned Chess?

When we talk about “human-aligned” chess AI, what exactly do we mean? At its core, we want a system that is both humanlike, defined as making moves that feel natural to human players, as well as skill-calibrated, defined as capable of playing at a similar level against human opponents across the skill spectrum.

Our goal here is quite different from traditional chess engines like Stockfish or AlphaZero, which are optimized solely to play the strongest moves possible. While these engines achieve superhuman performance, their play can feel alien to humans. They may instantly make moves in complex positions where humans would need time to think, or continue playing in completely lost positions where humans would normally resign.

Building Allie

Allie's system design
Figure 1: (a) A game state is represented as the sequence of moves that produced it and some metadata. This sequence is input to a Transformer, which predicts the next move, the pondering time for this move, and a value assessment of the move. (b) At inference time, we employ Monte-Carlo Tree Search with the value predictions from the model. The number of rollouts \(N_\mathrm{sim}\) is chosen dynamically based on the predicted pondering time.

A Transformer model trained on transcripts of real games

While most prior deep learning approaches build models that input a board state, and output a distribution over possible moves, we instead approach chess like a language modeling task. We use a Transformer architecture that inputs a sequence of moves rather than a single board state. Just as large language models learn to generate human-like text by training on vast text corpora, we hypothesized that a similar architecture could learn human-like chess by training on human game records. We train our chess “language” model on transcripts of over 93M games encompassing a total of 6.6 billion moves, which were played on the chess website Lichess.

Conditioning on Elo score

In chess, Elo scores normally fall in the range of 500 (beginner players) to 3000 (top chess professionals). To calibrate the playing strength of Allie to different levels of players, we model gameplay under a conditional generation framework, where encodings of the Elo ratings of both players are prepended to the game sequence. Specifically, we prefix each game with soft control tokens, which interpolate between a weak token, representing 500 Elo, and a strong token, representing 3000 Elo.

For a player with Elo rating \(k\), we compute a soft token \(e_k\) by linearly interpolating between the weak and strong tokens:

$$e_k = \gamma e_\text{weak} + (1-\gamma) e_\text{strong}$$

where \(\gamma = \frac{3000-k}{2500}\). During training, we prefix each game with two soft tokens corresponding to the two players’ strengths.
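In code, the interpolation is a one-liner. The following PyTorch sketch assumes e_weak and e_strong are learned embedding vectors; tensor names are illustrative.

import torch

def soft_elo_token(k: float,
                   e_weak: torch.Tensor,
                   e_strong: torch.Tensor) -> torch.Tensor:
    # gamma = 1 at 500 Elo (all weak token), gamma = 0 at 3000 Elo (all strong).
    gamma = (3000.0 - k) / 2500.0
    return gamma * e_weak + (1.0 - gamma) * e_strong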

Learning objectives

On top of the base Transformer model, Allie has three prediction objectives:

  1. A policy head \(p_\theta\) that outputs a probability distribution over possible next moves
  2. A pondering-time head \(t_\theta\) that outputs the number of seconds a human player would take to come up with this move
  3. A value-assessment head \(v_\theta\) that outputs a scalar estimate of who is expected to win the game

All three heads are individually parametrized as linear layers applied to the final hidden state of the decoder. Given a dataset \(\mathcal{D}\) of chess games, each represented as a sequence of moves \(\mathbf{m}\), the human ponder times before each move \(\mathbf{t}\), and the game outcome \(v\), we train Allie to minimize the negative log-likelihood of next moves and the MSE of the time and value predictions:

$$\mathcal{L}(\theta) = \sum_{(\mathbf{m}, \mathbf{t}, v) \in \mathcal{D}} \sum_{1 \le i \le N} \left( -\log p_\theta(m_i \mid \mathbf{m}_{<i}) + \left(t_\theta(\mathbf{m}_{<i}) - t_i\right)^2 + \left(v_\theta(\mathbf{m}_{<i}) - v\right)^2 \right)\text{.}$$
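A direct PyTorch rendering of this objective for a single game might look like the following sketch; shapes and names are illustrative, not the paper’s implementation.

import torch
import torch.nn.functional as F

def allie_loss(policy_logits: torch.Tensor,  # (N, vocab) logits per move
               time_pred: torch.Tensor,      # (N,) predicted ponder seconds
               value_pred: torch.Tensor,     # (N,) predicted game outcome
               move_targets: torch.Tensor,   # (N,) indices of moves played
               time_targets: torch.Tensor,   # (N,) observed ponder seconds
               value_target: torch.Tensor    # scalar game outcome v
               ) -> torch.Tensor:
    # Next-move negative log-likelihood plus MSE on time and value heads.
    nll = F.cross_entropy(policy_logits, move_targets, reduction="sum")
    time_mse = F.mse_loss(time_pred, time_targets, reduction="sum")
    value_mse = F.mse_loss(value_pred,
                           value_target.expand_as(value_pred),
                           reduction="sum")
    return nll + time_mse + value_mse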

Adaptive Monte-Carlo Tree Search

At play time, traditional chess engines like AlphaZero use search algorithms such as Monte-Carlo Tree Search (MCTS) to anticipate many moves into the future, evaluating different possibilities for how the game might go. The search budget \(N_\mathrm{sim}\) is almost always fixed: the engine spends the same amount of compute on search regardless of whether the best next move is extremely obvious or pivotal to the outcome of the game.

This fixed budget doesn’t match human behavior; humans naturally spend more time analyzing critical or complex positions compared to simple ones. In Allie, we introduce a time-adaptive MCTS procedure that varies the amount of search based on Allie’s prediction of how long a human would think in each position. If Allie predicts a human would spend more time on a position, it performs more search iterations to better match human depth of analysis. To keep things simple, we just set \(N_\mathrm{sim}\) based on the model’s predicted pondering time.
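One simple way to realize such a mapping is shown below; the scaling constant and clamping bounds are illustrative assumptions, not the values used by Allie.

def adaptive_num_simulations(predicted_seconds: float,
                             sims_per_second: int = 100,
                             min_sims: int = 1,
                             max_sims: int = 1000) -> int:
    # More predicted human think time -> larger MCTS rollout budget.
    return max(min_sims, min(max_sims, int(predicted_seconds * sims_per_second)))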

How does Allie Play?

To evaluate whether Allie is human-aligned, we evaluate its performance both on an offline dataset and online against real human players.

Figure 2: Allie significantly outperforms previous state-of-the-art methods. Adaptive search enables matching human moves at expert levels.

In offline games, Allie achieves state-of-the-art move-matching accuracy (defined as the percentage of moves made that match real human moves). It also models how humans resign and ponder very well.

Figure 3: Allie’s time predictions are strongly correlated with ground-truth human time usage. In the figure, we show median and IQR of Allie’s think time for different amount of time spent by humans.
Figure 4: Allie learns to assign reliable value estimates to board states by observing game outcomes alone. We report the Pearson’s r correlation of value estimates by Allie and Stockfish with game outcomes.

Another main insight of our paper is that adaptive search enables remarkable skill calibration against players across the skill spectrum. Against players rated from 1100 to 2500 Elo, the adaptive-search variant of Allie has an average skill gap of only 49 Elo points. In other words, Allie (with adaptive search) wins about 50% of games against opponents ranging from beginner to expert level. Notably, none of the other methods (even the non-adaptive MCTS baseline) can match the strength of 2500 Elo players.

Table 1: Adaptive search enables remarkable skill calibration. Mean and maximum skill calibration errors are computed by binning human players into 200-Elo groups. We also report systems’ estimated performance against players at the lower and upper Elo ends of the skill spectrum.

Limitations and Future Work

Despite strong offline evaluation metrics and generally positive player feedback, Allie still exhibits occasional behaviors that feel non-humanlike. Players specifically noted Allie’s propensity for late-game blunders and its tendency to spend too much time pondering positions where there is only one reasonable move. These observations suggest there’s still room to improve our understanding of how humans allocate cognitive resources during chess play.

For future work, we identify several promising directions. First, our approach heavily relies on available human data, which is plentiful for fast time controls but more limited for classical chess with longer thinking time. Extending our approach to model human reasoning in slower games, where players make more accurate moves with deeper calculation, represents a significant challenge. With the recent interest in reasoning models that make use of test-time compute, we hope that our adaptive search technique can be applied to improving the efficiency of allocating a limited compute budget.

If you are interested in learning more about this work, please check out our ICLR paper, Human-Aligned Chess With a Bit of Search.

Read More

Apple Machine Learning Research at ICLR 2025

Apple researchers are advancing machine learning (ML) and AI through fundamental research that improves the world’s understanding of this technology and helps to redefine what is possible with it. To support the broader research community and help accelerate progress in this field, we share much of our research through publications, open source resources, and engagement at conferences.
This week, the Thirteenth International Conference on Learning Representations (ICLR) will be held in Singapore. ICLR brings together leading experts on deep learning and the application of representation…