August 2023 – Vedere AI

Use Amazon SageMaker Model Card sharing to improve model governance

As Artificial Intelligence (AI) and Machine Learning (ML) technologies have become mainstream, many enterprises have successfully built critical business applications powered by ML models at scale in production. However, because these ML models make critical business decisions, it’s important for enterprises to add proper guardrails throughout their ML lifecycle. Guardrails ensure that the security, privacy, and quality of the code, configuration, data, and model configuration used in the model lifecycle are versioned and preserved.

Implementing these guardrails is getting harder for enterprises because ML processes and activities are becoming more complex: they involve deeply interconnected processes that require contributions from multiple stakeholders and personas. In addition to data engineers and data scientists, operational processes have been introduced to automate and streamline the ML lifecycle. Additionally, the growing number of business stakeholders, and in some cases legal and compliance reviewers, need capabilities that add transparency for managing access control, activity tracking, and reporting across the ML lifecycle.

The framework that gives systematic visibility into ML model development, validation, and usage is called ML governance. During AWS re:Invent 2022, AWS introduced new ML governance tools for Amazon SageMaker, which simplify access control and enhance transparency over your ML projects. One of these ML governance tools is Amazon SageMaker Model Cards, which creates a single source of truth for model information by centralizing and standardizing documentation throughout the model lifecycle.

SageMaker model cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model, from design and building through training and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a fact sheet of the model that is important for model governance.

As you scale your models, projects, and teams, as a best practice we recommend that you adopt a multi-account strategy that provides project and team isolation for ML model development and deployment. For more information about improving governance of your ML models, refer to Improve governance of your machine learning models with Amazon SageMaker.

Architecture overview

The architecture is implemented as follows:

Data Science Account – Data Scientists conduct their experiments in SageMaker Studio and build an MLOps setup to deploy models to staging/production environments using SageMaker Projects.
ML Shared Services Account – The MLOps setup from the Data Science account will trigger continuous integration and continuous delivery (CI/CD) pipelines using AWS CodeCommit and AWS CodePipeline.
Dev Account – The CI/CD pipelines will further trigger ML pipelines in this account covering data preprocessing, model training, and postprocessing steps such as model evaluation and registration. The output of these pipelines will deploy the model to SageMaker endpoints to be consumed for inference purposes. Depending on your governance requirements, the Data Science and Dev accounts can be merged into a single AWS account.
Data Account – The ML pipelines running in the Dev Account will pull the data from this account.
Test and Prod Accounts – The CI/CD pipelines will continue the deployment after the Dev Account to set up SageMaker endpoint configuration in these accounts.
Security and Governance – Services like AWS Identity and Access Management (IAM), AWS IAM Identity Center, AWS CloudTrail, AWS Key Management Service (AWS KMS), Amazon CloudWatch, and AWS Security Hub will be used across these accounts as part of security and governance.

The following diagram illustrates this architecture.

For more information about setting up a scalable multi-account ML architecture, refer to MLOps foundation for enterprises with Amazon SageMaker.

Our customers need the capability to share model cards across accounts to improve visibility and governance of their models through information shared in the model card. Now, with cross-account model card sharing, customers can enjoy the benefits of a multi-account strategy while retaining access to the available model cards in their organization, so they can accelerate collaboration and ensure governance.

In this post, we show how to set up and access model cards across Model Development Lifecycle (MDLC) accounts using the new cross-account sharing feature of the model card. First, we will describe a scenario and architecture for setting up the cross-account sharing feature of the model card, and then dive deep into each component of how to set up and access shared model cards across accounts to improve visibility and model governance.

Solution overview

When building ML models, we recommend setting up a multi-account architecture to provide workload isolation, which improves security, reliability, and scalability. For this post, we assume that we are building and deploying a model for a Customer Churn use case. The architecture diagram that follows shows one of the recommended approaches – centralized model card – for managing a model card in a multi-account Machine Learning Model-Development Lifecycle (MDLC) architecture. However, you can also adopt another approach, a hub-and-spoke model card. In this post, we focus only on the centralized model card approach, but the same principles can be extended to a hub-and-spoke approach. The main difference is that each spoke account maintains its own version of the model card and has processes to aggregate and copy it to a centralized account.

The following diagram illustrates this architecture.

The architecture is implemented as follows:

The Lead Data Scientist is notified to solve the Customer Churn use case using ML, and they start the ML project by creating a model card for the Customer Churn V1 model in Draft status in the ML Shared Services Account.
Through automation, that model card is shared with the ML Dev Account.
The Data Scientist builds the model and starts to populate information via APIs into the model card based on their experimentation results, and the model card status is set to Pending Review.
Through automation, that model card is shared with the ML Test Account.
The ML Engineer (MLE) runs integration and validation tests in the ML Test Account, and the model in the central registry is marked Pending Approval.
The Model Approver reviews the model results with the supporting documentation provided in the central model card and approves the model card for production deployment.
Through automation, that model card is shared with the ML Prod Account in read-only mode.
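The staged sharing described in these steps can be sketched as a small lookup table. This is only an illustration: the status names assume SageMaker's model card status values (Draft, PendingReview, Approved), while `SHARING_STAGES` and `accounts_with_access` are hypothetical names that do not come from any AWS library.

```python
# Each stage: (model card status, account that gains access at that stage).
# The account labels come from the walkthrough above and are illustrative only.
SHARING_STAGES = [
    ("Draft", "ML Dev"),
    ("PendingReview", "ML Test"),
    ("Approved", "ML Prod"),
]


def accounts_with_access(status):
    """Return the accounts the central card has been shared with by `status`."""
    accounts = []
    for stage_status, account in SHARING_STAGES:
        accounts.append(account)
        if stage_status == status:
            return accounts
    raise ValueError(f"Unknown status: {status!r}")


print(accounts_with_access("PendingReview"))
```

In a real setup, the automation that reacts to a status change would create the corresponding AWS RAM resource share rather than consult a static table.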

Prerequisites

Before you get started, make sure you have the following prerequisites:

Two AWS accounts.
In both AWS accounts, an IAM federation role with administrator access to do the following:
- Create, edit, view, and delete model cards within Amazon SageMaker.
- Create, edit, view, and delete resource share within AWS RAM.

For more information, refer to Example IAM policies for AWS RAM.

Setting up model card sharing

The account where the model cards are created is the model card account. Users in the model card account share them with the shared accounts where they can be updated. Users in the model card account can share their model cards through AWS Resource Access Manager (AWS RAM). AWS RAM helps you share resources across AWS accounts.

In the following section, we show how to share model cards.

First, create a model card for a Customer Churn use case as previously described. On the Amazon SageMaker console, expand the Governance section and choose Model cards.

We create the model card in Draft status with the name Customer-Churn-Model-Card. For more information, refer to Create a model card. In this demonstration, you can leave the remainder of the fields blank and create the model card.

Alternatively, you can use the following AWS CLI command to create the model card:

aws sagemaker create-model-card --model-card-name Customer-Churn-Model-Card --content '{"model_overview": {"model_owner": "model-owner","problem_type": "Customer Churn Model"}}' --model-card-status Draft
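If you prefer to script this step, the same request could be assembled in Python. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in by the caller; the helper names `build_create_model_card_request` and `create_model_card` are hypothetical. Building the content with `json.dumps` avoids shell quoting problems with the nested JSON.

```python
import json


def build_create_model_card_request(name, owner, problem_type, status="Draft"):
    """Build keyword arguments for the SageMaker CreateModelCard call."""
    content = {
        "model_overview": {
            "model_owner": owner,
            "problem_type": problem_type,
        }
    }
    return {
        "ModelCardName": name,
        # The API expects the card content as a JSON string, not a dict.
        "Content": json.dumps(content),
        "ModelCardStatus": status,
    }


def create_model_card(sagemaker_client, request):
    """Send the request; sagemaker_client would be boto3.client("sagemaker")."""
    return sagemaker_client.create_model_card(**request)


request = build_create_model_card_request(
    "Customer-Churn-Model-Card", "model-owner", "Customer Churn Model"
)
print(request["Content"])
```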

Now, create the cross-account share using AWS RAM. In the AWS RAM console, select Create a resource share.

Enter a name for the resource share, for example “Customer-Churn-Model-Card-Share”. In the Resources – optional section, select the resource type as SageMaker Model Cards. The model card we created in the previous step will appear in the listing.

Select that model card and it will appear in the Selected resources section. Select that resource again as shown in the following steps and choose Next.

On the next page, you can select the Managed permissions. You can create custom permissions or use the default option “AWSRAMPermissionSageMakerModelCards” and select Next. For more information, refer to Managing permissions in AWS RAM.

On the next page, you can select Principals. Under Select principal type, choose AWS Account and enter the ID of the account with which you want to share the model card. Select Add and continue to the next page.

On the last page, review the information and select “Create resource share”. Alternatively, you can use the following AWS CLI command to create a resource share:

aws ram create-resource-share --name <Name of the resource share>

aws ram associate-resource-share --resource-share-arn <ARN of the resource share created by the previous command> --resource-arns <ARN of the Model Card>
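The two CLI calls above can also be combined into a single API request that names the resource and the consuming account up front. The sketch below assumes a boto3 AWS RAM client supplied by the caller; the helper names, the example ARN, and the account IDs are placeholders for illustration only.

```python
def build_resource_share_request(share_name, model_card_arn, consumer_account_id):
    """Build keyword arguments for the AWS RAM CreateResourceShare call."""
    return {
        "name": share_name,
        # Resources to share; here, a single model card.
        "resourceArns": [model_card_arn],
        # Principals that receive access; here, one AWS account ID.
        "principals": [consumer_account_id],
    }


def create_resource_share(ram_client, request):
    """Send the request; ram_client would be boto3.client("ram")."""
    return ram_client.create_resource_share(**request)


# Placeholder values, not real resources.
share_request = build_resource_share_request(
    "Customer-Churn-Model-Card-Share",
    "arn:aws:sagemaker:us-east-1:111122223333:model-card/Customer-Churn-Model-Card",
    "444455556666",
)
print(share_request)
```

Because no managed permission is specified here, the sketch relies on AWS RAM applying the default permission for the resource type.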

On the AWS RAM console, you see the attributes of the resource share. Make sure that Shared resources, Managed permissions, and Shared principals are in the “Associated” status.

After you use AWS RAM to create a resource share, the principals specified in the resource share can be granted access to the share’s resources.

If you turn on AWS RAM sharing with AWS Organizations, and the principals that you share with are in the same organization as the sharing account, those principals can receive access as soon as their account administrator grants them permissions.
If you don’t turn on AWS RAM sharing with Organizations, you can still share resources with individual AWS accounts that are in your organization. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.
You can also share with accounts outside of your organization if the resource type supports it. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.
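When an invitation is required, the administrator in the consuming account could accept it programmatically. This is a minimal sketch, assuming a boto3 AWS RAM client created in the consuming account; the helper names and the sample ARNs are hypothetical, and pagination of the invitation list is ignored for brevity.

```python
def pending_invitation_arns(invitations_response):
    """Extract ARNs of invitations that are still waiting to be accepted."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in invitations_response.get("resourceShareInvitations", [])
        if invitation.get("status") == "PENDING"
    ]


def accept_pending_invitations(ram_client):
    """Accept every pending invitation; ram_client would be boto3.client("ram")."""
    response = ram_client.get_resource_share_invitations()
    accepted = []
    for arn in pending_invitation_arns(response):
        ram_client.accept_resource_share_invitation(resourceShareInvitationArn=arn)
        accepted.append(arn)
    return accepted


# Sample response shape with placeholder ARNs, used only to show the filter.
sample_response = {
    "resourceShareInvitations": [
        {"resourceShareInvitationArn": "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/example-1", "status": "PENDING"},
        {"resourceShareInvitationArn": "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/example-2", "status": "ACCEPTED"},
    ]
}
print(pending_invitation_arns(sample_response))
```

In practice you would likely accept only invitations from a known sender account rather than every pending one.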

For more information about AWS RAM, refer to Terms and concepts for AWS RAM.

Accessing shared model cards

Now we can log in to the shared AWS account to access the model card. Make sure that you are accessing the AWS console using IAM permissions (IAM role) which allow access to AWS RAM.

With AWS RAM, you can view the resource shares to which you have been added, the shared resources that you can access, and the AWS accounts that have shared resources with you. You can also leave a resource share when you no longer require access to its shared resources.

To view the model card in the shared AWS account:

Navigate to the Shared with me: Shared resources page in the AWS RAM console.
Make sure that you are operating in the same AWS Region where the share was created.
The model card shared from the model card account will be available in the listing. If there is a long list of resources, then you can apply a filter to find specific shared resources. You can apply multiple filters to narrow your search.
The following information is available:
1. Resource ID – The ID of the resource. This is the name of the model card that we created earlier in the model card account.
2. Resource type – The type of resource.
3. Last share date – The date on which the resource was shared with you.
4. Resource shares – The number of resource shares in which the resource is included. Choose the value to view the resource shares.
5. Owner ID – The ID of the principal who owns the resource.

You can also access the model card using the AWS CLI. Make sure that the AWS CLI is configured with credentials whose IAM policy grants permissions to create, edit, and delete model cards within Amazon SageMaker. For more information, refer to Configure the AWS CLI.

You can use the following AWS IAM permissions policy as a template:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeModelCard",
                "sagemaker:UpdateModelCard",
                "sagemaker:CreateModelCardExportJob",
                "sagemaker:ListModelCardVersions",
                "sagemaker:DescribeModelCardExportJob"
            ],
            "Resource": [
                "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-0",
                "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-1/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::Amazon-S3-bucket-storing-the-pdf-of-the-model-card/model-card-name/*"
        }
    ]
}
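To avoid hand-editing the placeholders in the template, the policy document could be generated from a few parameters. The sketch below is one possible way to do that; the function name and all argument values are hypothetical, and it assumes a single model card referenced both by its own ARN and with a trailing wildcard.

```python
import json

MODEL_CARD_ACTIONS = [
    "sagemaker:DescribeModelCard",
    "sagemaker:UpdateModelCard",
    "sagemaker:CreateModelCardExportJob",
    "sagemaker:ListModelCardVersions",
    "sagemaker:DescribeModelCardExportJob",
]


def build_shared_model_card_policy(region, owner_account_id, card_name, export_bucket):
    """Fill in the template policy for one shared model card."""
    card_arn = f"arn:aws:sagemaker:{region}:{owner_account_id}:model-card/{card_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": MODEL_CARD_ACTIONS,
                "Resource": [card_arn, f"{card_arn}/*"],
            },
            {
                # Allows exported PDFs of the card to be written to the bucket.
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{export_bucket}/{card_name}/*",
            },
        ],
    }


# Placeholder Region, account ID, and bucket name.
policy = build_shared_model_card_policy(
    "us-east-1", "111122223333", "Customer-Churn-Model-Card", "example-export-bucket"
)
print(json.dumps(policy, indent=4))
```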

You can run the following AWS CLI command to access the details of the shared model card.

aws sagemaker describe-model-card --model-card-name <ARN of the model card>

Now you can make changes to this model card from this account.

aws sagemaker update-model-card --model-card-name <ARN of the Model Card> --content '{"model_overview": {"model_owner": "model-owner","problem_type": "Customer Churn Model"}}'
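An update supplies the complete card content rather than a patch, so a scripted update typically reads the current content, merges the change, and writes the result back. The following is a minimal sketch of that read-modify-write pattern, assuming a boto3 SageMaker client in the shared account; `merge_model_overview` and `update_shared_model_card` are hypothetical helper names.

```python
import json


def merge_model_overview(existing_content_json, **overview_updates):
    """Return card content (as a JSON string) with model_overview fields updated."""
    content = json.loads(existing_content_json) if existing_content_json else {}
    overview = content.setdefault("model_overview", {})
    overview.update(overview_updates)
    return json.dumps(content)


def update_shared_model_card(sagemaker_client, model_card_arn, **overview_updates):
    """Read the shared card, merge the changes, and write the full content back."""
    current = sagemaker_client.describe_model_card(ModelCardName=model_card_arn)
    new_content = merge_model_overview(current["Content"], **overview_updates)
    return sagemaker_client.update_model_card(
        ModelCardName=model_card_arn, Content=new_content
    )


merged = merge_model_overview(
    '{"model_overview": {"model_owner": "model-owner"}}',
    problem_type="Customer Churn Model",
)
print(merged)
```

Merging first keeps any fields that other teams already populated in the card.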

After you make changes, go back to the model card account to see the changes that we made in this shared account.

The problem type has been updated to “Customer Churn Model” which we had provided as part of the AWS CLI command input.

Clean up

You can now delete the model card you created. Make sure that you delete the AWS RAM resource share that you created to share the model card.

Conclusion

In this post, we provided an overview of multi-account architecture for scaling and governing your ML workloads securely and reliably. We discussed the architecture patterns for setting up model card sharing and illustrated how centralized model card sharing patterns work. Finally, we set up model card sharing across multiple accounts to improve visibility and governance in your model development lifecycle. We encourage you to try out the new model card sharing feature and let us know your feedback.

About the authors

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!

WeatherBench 2: A benchmark for the next generation of data-driven weather models

Posted by Stephan Rasp, Research Scientist, and Carla Bromberg, Program Lead, Google Research

In 1950, weather forecasting started its digital revolution when researchers used the first programmable, general-purpose computer ENIAC to solve mathematical equations describing how weather evolves. In the more than 70 years since, continuous advancements in computing power and improvements to the model formulations have led to steady gains in weather forecast skill: a 7-day forecast today is about as accurate as a 5-day forecast in 2000 and a 3-day forecast in 1980. While improving forecast accuracy at the pace of approximately one day per decade may not seem like a big deal, every day improved is important in far reaching use cases, such as for logistics planning, disaster management, agriculture and energy production. This “quiet” revolution has been tremendously valuable to society, saving lives and providing economic value across many sectors.

Now we are seeing the start of yet another revolution in weather forecasting, this time fueled by advances in machine learning (ML). Rather than hard-coding approximations of the physical equations, the idea is to have algorithms learn how weather evolves from looking at large volumes of past weather data. Early attempts at doing so go back to 2018 but the pace picked up considerably in the last two years when several large ML models demonstrated weather forecasting skill comparable to the best physics-based models. Google’s MetNet [1, 2], for instance, demonstrated state-of-the-art capabilities for forecasting regional weather one day ahead. For global prediction, Google DeepMind created GraphCast, a graph neural network to make 10 day predictions at a horizontal resolution of 25 km, competitive with the best physics-based models in many skill metrics.

Apart from potentially providing more accurate forecasts, one key advantage of such ML methods is that, once trained, they can create forecasts in a matter of minutes on inexpensive hardware. In contrast, traditional weather forecasts require large super-computers that run for hours every day. Clearly, ML represents a tremendous opportunity for the weather forecasting community. This has also been recognized by leading weather forecasting centers, such as the European Centre for Medium-Range Weather Forecasts’ (ECMWF) machine learning roadmap or the National Oceanic and Atmospheric Administration’s (NOAA) artificial intelligence strategy.

To ensure that ML models are trusted and optimized for the right goal, forecast evaluation is crucial. Evaluating weather forecasts isn’t straightforward, however, because weather is an incredibly multi-faceted problem. Different end-users are interested in different properties of forecasts: for example, renewable energy producers care about wind speeds and solar radiation, while crisis response teams are concerned about the track of a potential cyclone or an impending heat wave. In other words, there is no single metric to determine what a “good” weather forecast is, and the evaluation has to reflect the multi-faceted nature of weather and its downstream applications. Furthermore, differences in the exact evaluation setup (for example, which resolution and ground truth data are used) can make it difficult to compare models. Having a way to compare novel and established methods in a fair and reproducible manner is crucial to measure progress in the field.

To this end, we are announcing WeatherBench 2 (WB2), a benchmark for the next generation of data-driven, global weather models. WB2 is an update to the original benchmark published in 2020, which was based on initial, lower-resolution ML models. The goal of WB2 is to accelerate the progress of data-driven weather models by providing a trusted, reproducible framework for evaluating and comparing different methodologies. The official website contains scores from several state-of-the-art models (at the time of writing, these are Keisler (2022), an early graph neural network, Google DeepMind’s GraphCast and Huawei’s Pangu-Weather, a transformer-based ML model). In addition, forecasts from ECMWF’s high-resolution and ensemble forecasting systems are included, which represent some of the best traditional weather forecasting models.

Making evaluation easier

The key component of WB2 is an open-source evaluation framework that allows users to evaluate their forecasts in the same manner as other baselines. Weather forecast data at high-resolutions can be quite large, making even evaluation a computational challenge. For this reason, we built our evaluation code on Apache Beam, which allows users to split computations into smaller chunks and evaluate them in a distributed fashion, for example using DataFlow on Google Cloud. The code comes with a quick-start guide to help people get up to speed.

Additionally, we provide most of the ground-truth and baseline data on Google Cloud Storage in cloud-optimized Zarr format at different resolutions, for example, a comprehensive copy of the ERA5 dataset used to train most ML models. This is part of a larger Google effort to provide analysis-ready, cloud-optimized weather and climate datasets to the research community and beyond. Since downloading these data from the respective archives and converting them can be time-consuming and compute-intensive, we hope that this considerably lowers the entry barrier for the community.

Assessing forecast skill

Together with our collaborators from ECMWF, we defined a set of headline scores that best capture the quality of global weather forecasts. As the figure below shows, several of the ML-based forecasts have lower errors than the state-of-the-art physical models on deterministic metrics. This holds for a range of variables and regions, and underlines the competitiveness and promise of ML-based approaches.

This scorecard shows the skill of different models compared to ECMWF’s Integrated Forecasting System (IFS), one of the best physics-based weather forecasts, for several variables. IFS forecasts are evaluated against IFS analysis. All other models are evaluated against ERA5. The order of ML models reflects publication date.
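As an illustration of what a deterministic error metric can look like, here is a small sketch of a latitude-weighted root mean squared error in plain Python. It is not the WB2 implementation; the function names and the grid layout (one row of values per latitude) are assumptions made for this example. The cosine weighting reflects that grid cells near the poles cover less area on a regular latitude-longitude grid.

```python
import math


def latitude_weights(latitudes_deg):
    """Cosine-of-latitude weights, normalized to have a mean of 1."""
    raw = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_rmse(forecast, truth, latitudes_deg):
    """RMSE over a grid given as a list of rows, one row per latitude."""
    weights = latitude_weights(latitudes_deg)
    total = 0.0
    count = 0
    for weight, forecast_row, truth_row in zip(weights, forecast, truth):
        for f, t in zip(forecast_row, truth_row):
            total += weight * (f - t) ** 2
            count += 1
    return math.sqrt(total / count)


# Tiny made-up grid: 3 latitudes x 2 longitudes, forecast off by 2 everywhere.
lats = [-60.0, 0.0, 60.0]
truth_grid = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
forecast_grid = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(weighted_rmse(forecast_grid, truth_grid, lats))
```

With a uniform error the weighting has no effect, so the result stays at the size of that error; the weights matter once errors differ between the tropics and the poles.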

Toward reliable probabilistic forecasts

However, a single forecast often isn’t enough. Weather is inherently chaotic because of the butterfly effect. For this reason, operational weather centers now run ~50 slightly perturbed realizations of their model, called an ensemble, to estimate the forecast probability distribution across various scenarios. This is important, for example, if one wants to know the likelihood of extreme weather.
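To make the idea concrete, the simplest way to turn an ensemble into a probability is to count the fraction of members that exceed a threshold. The sketch below uses made-up temperature values for ten members; the function name and the numbers are purely illustrative and are not taken from any forecasting system.

```python
def exceedance_probability(ensemble_values, threshold):
    """Fraction of ensemble members whose value is above the threshold."""
    members = list(ensemble_values)
    if not members:
        raise ValueError("ensemble must contain at least one member")
    return sum(value > threshold for value in members) / len(members)


# Hypothetical 10-member ensemble of forecast temperatures (degrees Celsius).
ensemble_temps_c = [31.2, 34.8, 35.6, 33.9, 36.1, 32.4, 35.1, 34.0, 37.3, 33.2]
p_hot = exceedance_probability(ensemble_temps_c, 35.0)
print(p_hot)
```

Here four of the ten members exceed 35 degrees, so the estimated probability is 0.4. Operational systems refine such raw counts with calibration, which this sketch omits.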

Creating reliable probabilistic forecasts will be one of the next key challenges for global ML models. Regional ML models, such as Google’s MetNet, already estimate probabilities. To anticipate this next generation of global models, WB2 already provides probabilistic metrics and baselines, among them ECMWF’s IFS ensemble, to accelerate research in this direction.

As mentioned above, weather forecasting has many aspects, and while the headline metrics try to capture the most important aspects of forecast skill, they are by no means sufficient. One example is forecast realism. Currently, many ML forecast models tend to “hedge their bets” in the face of the intrinsic uncertainty of the atmosphere. In other words, they tend to predict smoothed out fields that give lower average error but do not represent a realistic, physically consistent state of the atmosphere. An example of this can be seen in the animation below. The two data-driven models, Pangu-Weather and GraphCast (bottom), predict the large-scale evolution of the atmosphere remarkably well. However, they also have less small-scale structure compared to the ground truth or the physical forecasting model IFS HRES (top). In WB2 we include a range of these case studies and also a spectral metric that quantifies such blurring.

Forecasts of a front passing through the continental United States initialized on January 3, 2020. Maps show temperature at a pressure level of 850 hPa (roughly equivalent to an altitude of 1.5km) and geopotential at a pressure level of 500 hPa (roughly 5.5 km) in contours. ERA5 is the corresponding ground-truth analysis, IFS HRES is ECMWF’s physics-based forecasting model.
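A toy calculation shows why smoothed forecasts can score better on average error. In the made-up one-dimensional example below, a sharp forecast that places a feature one grid point off is penalized twice (once for the miss, once for the false alarm), while a blurred forecast that spreads the feature over both candidate locations gets a lower mean squared error despite looking less realistic. All values and names are illustrative.

```python
def mean_squared_error(forecast, truth):
    """Average squared difference between two equally long sequences."""
    return sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth)


truth = [0.0, 0.0, 1.0, 0.0, 0.0]             # sharp feature at index 2
sharp_misplaced = [0.0, 0.0, 0.0, 1.0, 0.0]   # realistic shape, wrong place
blurred = [0.0, 0.0, 0.5, 0.5, 0.0]           # hedged between both places

sharp_error = mean_squared_error(sharp_misplaced, truth)
blurred_error = mean_squared_error(blurred, truth)
print(sharp_error, blurred_error)
```

The blurred forecast has the lower error here, which is why a metric that quantifies lost small-scale structure is a useful complement to average-error scores.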

Conclusion

WeatherBench 2 will continue to evolve alongside ML model development. The official website will be updated with the latest state-of-the-art models. (To submit a model, please follow these instructions.) We also invite the community to provide feedback and suggestions for improvements through issues and pull requests on the WB2 GitHub page.

Designing evaluation well and targeting the right metrics is crucial in order to make sure ML weather models benefit society as quickly as possible. WeatherBench 2 as it is now is just the starting point. We plan to extend it in the future to address key issues for the future of ML-based weather forecasting. Specifically, we would like to add station observations and better precipitation datasets. Furthermore, we will explore the inclusion of nowcasting and subseasonal-to-seasonal predictions to the benchmark.

We hope that WeatherBench 2 can aid researchers and end-users as weather forecasting continues to evolve.

Acknowledgements

WeatherBench 2 is the result of collaboration across many different teams at Google and external collaborators at ECMWF. From ECMWF, we would like to thank Matthew Chantry, Zied Ben Bouallegue and Peter Dueben. From Google, we would like to thank the core contributors to the project: Stephan Rasp, Stephan Hoyer, Peter Battaglia, Alex Merose, Ian Langmore, Tyler Russell, Alvaro Sanchez, Antonio Lobato, Laurence Chiu, Rob Carver, Vivian Yang, Shreya Agrawal, Thomas Turnbull, Jason Hickey, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. We also would like to thank Kunal Shah, Rahul Mahrsee, Aniket Rawat, and Satish Kumar. Thanks to John Anderson for sponsoring WeatherBench 2. Furthermore, we would like to thank Kaifeng Bi from the Pangu-Weather team and Ryan Keisler for their help in adding their models to WeatherBench 2.

Meet Five Generative AI Innovators in Africa and the Middle East

Entrepreneurs are cultivating generative AI from the west coast of Africa to the eastern edge of the Arabian Desert.

Gen AI is the latest of the big plans Kofi Genfi and Nii Osae have been hatching since they met 15 years ago in high school in Accra, Ghana’s capital that sits on the Gulf of Guinea.

“We watched this latest wave of AI coming for the last few years,” said Osae, a software engineer who discovered his passion for machine learning in college.

Nii Osae (left) and Kofi Genfi of startup Mazzuma.

So, late last year, they expanded Mazzuma — their mobile-payments startup that’s already processed more than $150 million in transactions — to include MazzumaGPT.

The large language model (LLM) was trained on two popular blockchain languages so it can help developers quickly draft smart contracts, a Web3 market that International Data Corp. projects could hit $19 billion next year.

Thousands of Hits

In its first month, 400 developers from 70 countries used the LLM that sports 175 billion parameters, a rough measure of a model’s size and strength.

It’s the latest success for the pair that in 2018 made Forbes’ list of 30 top entrepreneurs in Africa under 30.

“Given the high growth and large demographics, there are big opportunities in this region,” said Genfi, who started his first company, an Apple device reseller, when he was 19.

Osae nurtures that potential as founder and chair of the 100+ member AI Association of Ghana. “I think we’re on a trajectory to leapfrog progress elsewhere,” he said.

LLMs Speak Arabic

About two years ago and 6,000 miles to the northeast, another pair of entrepreneurs launched a generative AI business in the Persian Gulf emirate of Dubai, home of the Burj Khalifa, the world’s tallest building.

Yakov Livshits already had about a dozen active startups when AI researcher Eli Braginskiy, a friend with family ties, came to him with the idea for MetaDialog. The startup built the first LLM to support both Arabic and English, a 7-billion-parameter model trained on one of the world’s largest Arabic/English datasets.

“We call it Baby, because we’re proud of it, and we’re building a larger, 40-billion parameter model now,” said Braginskiy.

“Our Baby LLM is currently integrated in one of the biggest governments in the region, and we’re talking with three other governments interested in using it, too,” said Livshits.

With more than 3 million people in just 13 square miles, Dubai is a vibrant hub for the region.

“The way governments in the Middle East think about AI and advanced tech in general is very bold: they want to move fast, so we’re training custom models in different languages and will present them at the GITEX conference,” said Livshits, who lived in Russia, Israel and the U.S. before moving to Dubai.

In February, Saudi Arabia alone announced a $2.4 billion startup fund to help diversify the nation’s economy.

Corporations Want Custom LLMs

In Abu Dhabi, just a hundred miles down the coast, Hussein Al-Natsheh leads a team of engineers and data scientists at Beyond Limits training and fine tuning LLMs. One is already drafting documents for a large energy company and verifying they comply with its standards.

Beyond Limits also works on models for energy companies, utilities and other customers that will index and search corporate documents, draft marketing materials and more.

“Companies need their own LLMs trained on their own data which is confidential, so we have machines reading their data, not us,” said Al-Natsheh, a native of Amman, Jordan, who, prior to joining Beyond Limits, worked on Salma, one of the first Arabic speech assistants.

Drilling for Data

Now that data is the new oil, Beyond Limits is developing toolkits to extract it from unstructured files — corporate emails, PowerPoint and other sources — so it can help companies train custom LLMs approaching 70 billion parameters in size.

The toolkits can help address the lack of data samples from the many Arabic dialects. Indeed, a report from the UAE government on 100 top gen AI uses called for more work on Arabic, a language spoken by nearly half a billion people.

The good news is governments and large companies like G42, a regional cloud service company, are pouring resources into the problem. For example, Beyond Limits was able to create its regional headquarters in Dubai thanks to its last funding round, much of which came from G42.

A Big Boost from Inception

All three companies are members of NVIDIA Inception, a free program that helps startups working on cutting-edge technologies like generative AI.

As part of Inception, Beyond Limits had access to libraries in NVIDIA NeMo, a framework for building massive gen AI models, which cut training time in one case from five days to one.

“NVIDIA software makes our work much easier, and our clients trust NVIDIA technology,” said Al-Natsheh.

As part of Inception, Mazzuma got access to cloud GPU services to accelerate its experiments and introductions to potential investors.

“That really gave us a boost, and there’s a lot of assurance that comes from working with the best people and tools,” said Genfi.

Treating Partners Well

For its part, MetaDialog trained its Baby LLM on 440 NVIDIA A100 Tensor Core GPUs using a service operated by MosaicML, an Inception member recently acquired by Databricks.

“I’ve built many startups, and no company treats its partners as well as NVIDIA,” said Livshits.

At top: From left to right, Nii Osae, Hussein Al-Natsheh, Eli Braginskiy, Yakov Livshits and Kofi Genfi.

Morphobots for Mars: Caltech Develops All-Terrain Robot as Candidate for NASA Mission

Academics Mory Gharib and Alireza Ramezani in 2020 were spitballing a transforming robot that is now getting a shot at work that’s literally out of this world: NASA Mars Rover missions.

Caltech has unveiled its multi-talented robot that can fly, drive, walk and do eight permutations of motions through a combination of its skills. They call it the Multi-Modal Mobility Morphobot, or M4, which is enabled by the NVIDIA Jetson platform for edge AI and robotics.

“It grew in the number of functions that we wanted to do,” said Gharib, a professor of aeronautics and bioinspired engineering at Caltech. “When we proposed it to our design team, at first they all said, ‘no.’”

Caltech funded its initial research, and NASA and its Jet Propulsion Lab (JPL) funded its next phase and brought in Ramezani, an assistant professor of electrical and computer engineering at Northeastern University, as a faculty researcher at JPL last summer to develop it further.

Its M42 version is now under development at NASA as a Mars Rover candidate and has interest from the U.S. Department of Transportation, Gharib said.

“At NASA, we’re being tested right now for transforming while landing,” he said.

And since recently releasing a paper on it in Nature Communications, Gharib says he’s been inundated with proposals.

“We’re kind of dizzy about how it suddenly got so much attention,” he said. “Different organizations want to do different things and are coming to approach us.”

Firefighting, Search and Rescue Operations

The Caltech team behind the paper — Gharib and Ramezani, as well as Eric Sihite, a postdoctoral scholar research associate in aerospace at Caltech; Arash Kalantari, from JPL; and Reza Nemovi, a design engineer at CAST — said the M4 is designed for diverse mission requirements in search and rescue, among other areas.

For example, when it’s not feasible to roll or walk into areas — like fire zones — it can fly and do reconnaissance to assess situations using its cameras and sensors.

According to Gharib, multiple fire departments in the Los Angeles area have contacted him with interest in the M4.

“For first responders, this is huge because you need to land in a safe area and then drive into the situation,” he said.

Versatile Drone Deliveries to Get the Job Done

Caltech’s team also aims to solve complications with drone deliveries using the M4. Drone deliveries are the “low-hanging fruit” for this robot, said Gharib.

Traditional drones for deliveries are problematic because nobody wants drones landing near their home or business for safety reasons, he said. The M4 can land somewhere isolated from people and then drive to finish deliveries, making it a safer option, he added.

The M4 can also fly into areas that delivery trucks have a difficult time reaching or can’t serve at all.

“There are a lot of places where truck deliveries can’t go,” he said.

Right now, the M4 is capable of traveling as fast as 40 mph, and its battery can last up to 30 minutes on a charge. But the team is working to design larger drones with longer flight times, bigger payloads and increased travel distances.

The sky’s the limit.

Learn about NVIDIA Jetson Nano.

AI Frontiers: AI in India and beyond with Sriram Rajamani


Episode 146 | August 31, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Sriram Rajamani, Distinguished Scientist and Managing Director of Microsoft Research India. Rajamani talks about how the lab’s work is being influenced by today’s rapidly advancing AI. One example? The development of a conversational agent in India capable of providing information about governmental agricultural programs in farmers’ natural language, particularly significant in a country with more than 30 languages, including 22 government-recognized languages. It’s an application Microsoft CEO Satya Nadella described as the “mic drop moment” of his trip to the lab early this year.

Learn more:

AI4Bhārat
Organization homepage
MEGA: Multilingual Evaluation of Generative AI
Publication, May 2023
AI and Microsoft Research
Learn more about the breadth of AI research at Microsoft

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale AI models like GPT-4 is accelerating the advancement of AI. These models and the systems they power are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in large AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Sriram Rajamani, Managing Director of Microsoft Research India. For nearly 20 years, this lab has focused on interdisciplinary research, blending theory and practice and computer science with social science. Our researchers in India have made many contributions to advance AI in areas like causal reasoning, but the latest wave of powerful AI models has made a profound impact on all the lab’s work, including their approach to creating technologies for underserved communities.

[MUSIC FADES]

All right, so, Sriram, let’s dive right in. I think it’s fairly obvious for me to say at this point that ChatGPT—and generative AI more broadly—is a worldwide phenomenon. But what’s so striking to me about this is the way that so many people around the world can pick up the technology and use it in their context, in their own way. I was on a panel discussion a few weeks ago where I saw a comedian discover in real time that GPT-4 could write jokes that are actually funny. And shortly after that, I spoke to a student who was using ChatGPT to write an application to obtain a grazing permit for cattle. You know, the work of your lab is situated in its own unique societal context. So, what I really want to know and start with here today is, like, what’s the buzz been like for you in your part of the world around this new wave of AI?

SRIRAM RAJAMANI: Yeah. First of all, Ashley, you know, thank you for having this conversation with me. You’re absolutely right that our lab is situated in a very unique context on how this technology is going to play out in, you know, this part of the world, certainly. And you might remember, Ashley, a sort of a mic drop moment that happened for Satya [Nadella] when he visited India earlier this year, in January. So one of our researchers, Pratyush Kumar—he’s also co-founder of our partner organization called AI4Bhārat—he works also with the government on a project called Bhashini, which the government endeavors to bring conversational AI to the many Indian languages that are spoken in India. And what Pratyush did was he connected some of the AI4Bhārat translation models, language translation models, together with one of the GPT models to build a bot for a farmer to engage and ask questions about the government’s agricultural programs so the farmer could speak in their own language—you know, it could be Hindi—and what the AI4Bhārat models would do is to convert the Hindi speech into text and then translate it into English. And then he taught, you know, either fine-tuned or integrated with augmented generation … I don’t … I’m not … I don’t quite remember which one … it was one of those … where he made a GPT model customized to understand the agricultural program of the government. And he chained it together with this speech recognition and translation model. And the farmer could just now talk to the system, the AI system, in Hindi and ask, you know, are they eligible for their benefits and many details. And the, and the model had a sensible conversation with him, and Satya was just really amazed by that, and he calls … he called that as the mic drop moment of his trip in India, which I think is indicative of the speed at which this disruption is impacting very positively the various parts of the world, including the Indian subcontinent.
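
Editor's sketch: the chaining described above (speech recognition, then translation, then a customized language model) can be pictured as simple function composition. This is a minimal illustration in Python, not the system Pratyush Kumar built. Every function name and the toy phrase table are hypothetical stand-ins, and none of them are AI4Bhārat or GPT APIs.

```python
from typing import Callable

# All three stages below are hypothetical stand-ins for real models.

def transcribe_hindi(audio: bytes) -> str:
    """Stand-in for speech recognition: audio -> Hindi text."""
    return audio.decode("utf-8")  # pretend the audio bytes are already text

def translate_to_english(hindi_text: str) -> str:
    """Stand-in for a translation model, using a one-entry toy phrase table."""
    phrase_table = {
        "kya main yojana ke liye patra hoon?": "Am I eligible for the program?",
    }
    return phrase_table.get(hindi_text, hindi_text)

def answer_about_programs(question: str) -> str:
    """Stand-in for a language model customized on program documents."""
    if "eligible" in question.lower():
        return "PLACEHOLDER: eligibility details would come from program documents."
    return "Please ask about a specific program."

def chain(*stages: Callable) -> Callable:
    """Compose stages left to right so each output feeds the next stage."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

farmer_bot = chain(transcribe_hindi, translate_to_english, answer_about_programs)
reply = farmer_bot("kya main yojana ke liye patra hoon?".encode("utf-8"))
print(reply)
```

Keeping each stage behind a plain function boundary means any one model could be swapped out without touching the others, which is one plausible reason to chain models rather than build a single one.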

LLORENS: You referenced the many Indian languages written and spoken. Can you just bring, bring that to life for us? How many, how many languages are we talking about?

RAJAMANI: So, I think there are at least, you know, 30 or 40, you know, main, mainstream languages. I mean, the government recognizes 22. We call them as IN22. But I would think that there are about 30-plus languages that are spoken very, very broadly, each of them with, you know, several tens of millions, hundreds of millions of speakers. And then there is a long tail of maybe a hundred more languages which are spoken by people with … in, in smaller population counts. The real … they’re also very low-resource languages like Gondi and Idu Mishmi, which are just spoken by maybe just only a million speakers or even under a million speakers who probably … those languages probably don’t have enough data resources. So, India is an amazing testbed because of this huge diversity and distribution of languages in terms of the number of speakers, the amount of available data, and, and many of these tail languages have unique, you know, sociocultural nuances. So I think in that sense, there’s a really good testbed for, you know, how conversational AI can inclusively impact the entire world.

LLORENS: And, and what’s the … you mentioned tail languages. And so maybe we mean they’re low-resource languages like you also mentioned. What’s the gap like between what languages AI is accessible in today versus the full extent of all those languages that you just described, even just for, you know, for the Indian subcontinent?

RAJAMANI: So what is … what we’re seeing is that with IN22, the top languages, if you look at successive versions of the GPT models, for example, the performance is definitely improving. So if you just go from, you know, GPT-2 to GPT-3 to 3.5 to 4, right, you can sort of see that these models are increasingly getting capable. But still there is a gap between what these models are able to do and what custom models are able to do, particularly if you go towards languages in which there’s not enough training data. So, so people in our lab, you know, are doing very systematic work in this area. There is a benchmarking work that my colleagues are doing called MEGA, where there is systematic benchmark being done on various tasks on a matrix that consists of, you know, tasks on one axis and languages on another axis to just systematically, empirically study, you know, what these models are able to do. And also, we are able to build models to predict how much more data is needed in each of these languages in order for the performance to be comparable to, say, languages like English. What is the … what is the gap, and how much data is needed? The other thing is that it turns out that these models, they, they learn also from related languages. So if you want to improve the performance of a language, it turns out there are other languages in the world and in India that have similar characteristics, you know, syntactic and semantic characteristics, to the language that you’re thinking about. So we can also sort of recommend, you know, what distribution of data we should collect so that all the languages improve. So that’s the kind of work that we’re doing.

LLORENS: Yeah, it’s one of the most fascinating parts of all of this—how diversity in the training dataset improves, you know, across the board, like even the addition of code, for example, in addition to language, and now we’re even seeing even other modalities. And, you know, the, the wave of AI and the unprecedented capabilities we’re seeing has significant implications for just about all of computing research. In fact, those of us in and around the field are undergoing now a process that I call, you know, reimagining computing research. And, you know, that’s a somewhat artful way to put it. But beyond the technical journey, there’s an emotional journey happening across the research community and many other communities, as well. So what has that journey been like for you and the folks at the India lab?

RAJAMANI: Yeah, that’s a good question, Ashley. You know, our work in the lab spans four areas. You know, we do work in theory and algorithms. We do work in AI and machine learning. We do systems work, and we also have an area called “Technology and Empowerment.” It’s about making sure that technology benefits people. And so far, our conversation has been about the last area. But all these four areas have been affected in a big way using this disruption. Maybe, maybe I’ll just say a few more things about the empowerment area first and then move on to the other ones. If you look at our work in the empowerment area, Ashley, right, this lab has had a track record of doing work that makes technology inclusive not just from an academic perspective, but by also deploying the work via spun-off startups, many startups, that have taken projects in the lab and scaled them to the community. Examples are Digital Green, which is an agricultural extension; 99DOTS, which is a tuberculosis medication adherence system. Karya is a, is a platform for dignified digital labor to enable underprivileged users, rural users, to contribute data and get paid for it. You know, HAMS is a system that we have built to improve road safety. You know, we’ve built a system called BlendNet that enables rural connectivity. And almost all of these, we have spun them off into startups that are … that have been funded by, you know, venture capitalists, impact investors, and we have a vibrant community of these partners that are taking the work from the lab and deploying them in the community. So the second thing that is actually happening in this area is that, as you may have heard, India is playing a pivotal role in digital public infrastructure. Advances like the Aadhaar biometric authentication system; UPI, which is a payment system—they are pervasively deployed in India, and they reach, you know, several hundreds of millions of people. And in the case of Aadhaar, more than a billion people and so on. 
And the world is taking note. India is now head of the G20, and many countries now want to be inspired by India and build such a digital public infrastructure in their own countries, right. And so, so, so what you saw is the mic drop moment, right? That … it actually has been coming for a long time. There has been a lot of groundwork that has been laid by our lab, by our partners, you know, such as AI4Bhārat, the people that work on digital public goods to get the technical infrastructure and our know-how to a stage where we can really build technology that benefits people, right. So, so going forward, in addition to these two major advancements, which is the building of the partner and alumni ecosystem, the digital public good infrastructure, I think AI is going to be a third and extremely important pillar that is going to enable citizen-scale digital services to reach people who may only have spoken literacy and who might speak in their own native languages and the public services can be accessible to them.

LLORENS: So you mentioned AI4Bhārat, and I’d love for you to say a bit more about that organization and how researchers are coming together with collaborators across sectors to make some of these technology ideas real.

RAJAMANI: Yeah. So AI4Bhārat is a center in IIT Madras, which is an academic institution. It has multiple stakeholders, not just Microsoft Research, but our search technology center in India also collaborates with them. Nandan Nilekani is a prominent technologist and philanthropist. He’s behind a lot of India’s digital public infrastructure. He also, you know, funds that center significantly through his philanthropic efforts. And there are a lot of academics that have come together. And what the center does is data collection. I talked about the diversity of, you know, Indian languages. They collect various kinds of data. They also look at various applications. Like in the judicial system, in the Indian judicial system, they are thinking about, you know, how to transcribe, you know, judgments, enabling various kinds of technological applications in that context, and really actually thinking about how these kinds of AI advances can help right on top of digital public goods. So that’s actually the context in which they are working on.

LLORENS: Digital public goods. Can you, can you describe that? What, what do we mean in this context by digital public good?

RAJAMANI: So what we mean is if you look at Indian digital public infrastructure, right, that is, as I mentioned, that is Aadhaar, which is the identity system that has now enrolled more than 1.3 billion Indians. There is actually a payment infrastructure called UPI. There are new things that are coming up, like something that’s, that’s called Beckn. There’s something called ONDC that is poised to revolutionize how e-commerce is done. So these are all, you know, sort of protocols that through private-public partnership, right, government together with think tanks have developed, that are now deployed in a big way in India. And they are now pervasively impacting education, health, and agriculture. And every area of public life is now being impacted by these digital public infrastructures. And there is a huge potential for AI and AI-enabled systems to ride on top of this digital public infrastructure to really reach people.

LLORENS: You know, you talked about some of the, you know, the infrastructure considerations, and so what are the challenges in bringing, you know, digital technologies to, you know, to, to the Indian context? And, and you mentioned the G20 and other countries that are following the patterns. What are, what are some of the common challenges there?

RAJAMANI: So, I mean, there are many, many challenges. One of them is lack of access. You know, though India has made huge strides in lifting people out of poverty, people out there don’t have the same access to technology that you and I have. Another challenge is awareness. People just don’t know, you know, how technology can help them, right. You know, people hearing this podcast know about, you know, LinkedIn to get jobs. They know about, you know, Netflix or other streaming services to get entertainment. But there are many people out there that don’t even know that these things exist, right. So awareness is another issue. Affordability is another issue. So … many of the projects that I mentioned, what they do is actually they start not with the technology; they start with the users and their context and this situation, and what they’re trying to do and then map back. And technology is just really one of the pieces that these systems, that all of these systems that I mentioned, right … technology is just only one component. There’s a sociotechnical piece that deals with exactly these kinds of access and awareness and these kinds of issues.

LLORENS: And we’re, we’re kind of taking a walk right now through the work of the lab. And there are some other areas that you, you want to get into, but I want to come back to this … maybe this is a good segue into the emotional journey part of the question I asked a few minutes ago. As you get into some of the, you know, the deep technical work of the lab, what were some of the first impressions of the new technologies, and what were, what were some of the first things that, you know, you and your colleagues there and our colleagues, you know, felt, you know, in observing these new capabilities?

RAJAMANI: So I, I think Peter [Lee] mentioned this very eloquently as stages of grief. And me and my colleagues, I think, went through the same thing. I mean, the … there was … we went from, you know, disbelief, saying, “Oh, wow, this is just amazing. I can’t believe this is happening” to sort of understanding what this technology can do and, over time, understanding what its limitations are and what the opportunities are as a scientist and technologist and engineering organization to really push this forward and make use of it. So that’s, I think, the stages that we went through. Maybe I can be a little bit more specific. As I mentioned, the three other areas we work on are theory and algorithms, in machine learning, and in systems. And I can sort of see … say how my colleagues are evolving, you know, their own technical and research agendas in the, in the light of the disruption. If you take our work in theory, this lab has had a track record of, you know, cracking longstanding open problems. For example, problems like the Kadison-Singer conjecture that was open for many years, many decades, was actually solved by people from the lab. Our lab has incredible experts in arithmetic and circuit complexity. They came so close to resolving the VP versus VNP conjecture, which is the arithmetic analog of the P versus NP problem. So we have incredible people working on, working on theoretical computer science, and a lot of them are now shifting their attention to understanding these large language models, right. Instead of understanding just arithmetic circuits, you know, people like Neeraj Kayal and Ankit Garg are now thinking about mathematically what does it take to understand transformers, how do we understand … how might we evolve these models or training data so that these models improve even further in performance in their capabilities and so on. 
So that’s actually a journey that the theory people are going through, you know, bringing their brainpower to bear on understanding these models foundationally. Because as you know, currently our understanding of these foundation models is largely empirical. We don’t have a deep scientific understanding of them. So that’s the opportunity that the, that the theoreticians see in this space. If you look at our machine learning work, you know, that actually is going through a huge disruption. I remember now one of the things that we do in this lab is work on causal ML … Amit Sharma, together with Emre Kiciman and other colleagues working on causal machine learning. And I heard a very wonderful podcast that you hosted them some time ago. Maybe you can say a little bit about what, what you heard from them, and then I can pick up back and then connect that with the rest of the lab.

LLORENS: Sure. Well, it’s … you know, I think the, the common knowledge … there’s, there’s so many, there’s so many things about machine learning over the last few decades that have become kind of common knowledge and conventional wisdom. And one of those things is that, you know, correlation is not causation and that, you know, you know, learned models don’t, you know, generally don’t do causal reasoning. And so we, you know, we’ve had very specialized tools created to do the kind of causal reasoning that Amit and Emre do. And it was interesting. I asked them some of the same questions I’m asking you now, you know, about the journey and the initial skepticism. But it has been really interesting to see how they’re moving forward. They recently published a position paper on arXiv where they conducted some pretty compelling experiments, in some cases, showing something like, you know, causal reasoning, you know, being, being exhibited, or at least I’ll say convincing performance on causal reasoning tasks.

RAJAMANI: Yeah, absolutely.

LLORENS: Yeah, go ahead.

RAJAMANI: Yeah, yeah, yeah, absolutely. So, so, so, you know, I would say that their journey was that initially they realized that … of course, they build specialized causal reasoning tools like DoWhy, which they’ve been building for many years. And one of the things they realized was that, “Oh, some of the things that DoWhy can do with sophisticated causal reasoning these large language models were just able to do out of the box.” And that was sort of stunning for them, right. And so the question then becomes, you know, does specific vertical research in causal reasoning is even needed, right. So that’s actually the shock and the awe and the emotional journey that these people went through. But actually, after the initial shock faded, they realized that there is actually [a] “better together” story that is emerging in the sense that, you know, once you understand the details, what they realized was that natural language contains a lot of causal information. Like if you just look at the literature, the literature has many things like, you know, A causes B and if there is, if there is, you know, hot weather, then ice cream sales go up. You know, this information is present in the literature. So if you look at tools like DoWhy, what they do is that in order to provide causal machine learning, they need assumptions from the user on what the causal model is. They need assumptions about what the causal graph is, what is the user’s assumptions about which variables depend on which variables, right? And then … and, and, and what they’ve realized is that models like GPT-4 can now provide this information. Previously, only humans were able to provide this information. And … but in addition to that, right, tools like DoWhy are still needed to confirm or refute these assumptions, statistically, using data. 
So this division of labor between getting assumptions from either a human or from a large language model and then using the mathematics of DoWhy to confirm or refute the assumptions now is emerging as a real advance in the way we do causal reasoning, right? So I think that’s actually what I heard in your podcast, and that’s indicative of actually what the rest of my colleagues are going through. You know, moving from first thinking about, “Oh, GPT-4 is like a threat, you know, in the sense that it really obviates my research area” to understand, “Oh, no, no. It’s really a friend. It, it really helps me do, you know, some of the things that required primarily human intervention. And if I combine GPT or these large language models together with, you know, domain specific research, we can actually go after bigger problems that we didn’t even dare going after before.”
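
Editor's sketch: the division of labor described above (an assumption about the causal graph proposed by a person or a language model, then a statistical check against data) can be illustrated in a few lines of plain Python. This does not use DoWhy's API. The data is synthetic, the variable names are made up for the ice cream example, and the shuffle-based placebo check is only similar in spirit to the refutation tests that causal inference libraries provide.

```python
import random

def slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Step 1: the assumption, which a human or a language model might propose.
assumed_edge = ("temperature", "ice_cream_sales")

# Synthetic data in which the assumed edge really holds (true effect is 2.0).
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]

# Step 2: estimate the effect implied by the assumed edge.
estimate = slope(temperature, ice_cream_sales)

# Step 3: placebo check. Shuffling the treatment breaks any real link to the
# outcome, so the re-estimated effect should collapse toward zero.
placebo_treatment = temperature[:]
rng.shuffle(placebo_treatment)
placebo_estimate = slope(placebo_treatment, ice_cream_sales)

print(f"estimate={estimate:.2f}, placebo={placebo_estimate:.2f}")
```

If the placebo estimate stayed as large as the original estimate, that would be evidence against the assumed edge, whoever proposed it.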

LLORENS: Mmm. Let me, let me ask you … I’m going to, I’m going to pivot here in a moment, but did you … have you covered, you know, the areas of research in the lab that you wanted to walk through?

RAJAMANI: Yeah, yeah, there’s, there’s more. You know, thank you for reminding me. Even in the machine learning area, there is another work direction that we have called extreme classification, which is about building very, very … classifiers with a large number of labels, you know, hundreds of millions and billions of labels. And, you know, these people are also benefiting from large language encoders. You know, they have come up with clever ways of taking these language encoders that are built using self-supervised learning together with supervised signals from things like, you know, clicks and logs from search engines and so on to improve performance of classifiers. Another work that we’ve been doing is called DiskANN, or approximate nearest neighbor search. As you know, Ashley, in this era of deep learning, retrieval works by converting everything in the world, you know, be it a document, be it an image, you know, be it an audio or video file, everything into an embedding, and relevance … relevant retrieval is done by nearest neighbor search in a geometric space. And our lab has been doing … I mean, we have probably the most scalable vector index that has been built. And, and, and these people are positively impacted by these large language models because, you know, as you know, retrieval augmented generation is one of the most common design patterns in making these large language models work for applications. And so their work is becoming increasingly relevant, and they are being placed huge demands on, you know, pushing the scale and the functionality of the nearest neighbor retrieval API to do things like, oh, can I actually add predicates, can I add streaming queries, and so on. So they are just getting stretched with more demand, you know, for their work. 
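
Editor's sketch: retrieval by nearest neighbor search over embeddings, with an optional predicate, can be shown with a brute-force toy in Python. This is an exact search over three hand-written vectors, not DiskANN and not an approximate index. The document names, vectors, and metadata are all invented for illustration, and in a real system the vectors would come from an encoder model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy index: name -> (embedding, metadata). All values are made up.
index = {
    "doc_crop_insurance": ([0.9, 0.1, 0.0], {"lang": "hi"}),
    "doc_road_safety": ([0.1, 0.9, 0.1], {"lang": "en"}),
    "doc_farm_subsidy": ([0.8, 0.2, 0.1], {"lang": "en"}),
}

def search(query, k=2, predicate=lambda meta: True):
    """Return the k most similar document names whose metadata passes the predicate."""
    scored = [
        (cosine(query, vector), name)
        for name, (vector, meta) in index.items()
        if predicate(meta)
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

query = [1.0, 0.0, 0.0]
top = search(query)
english_only = search(query, predicate=lambda meta: meta["lang"] == "en")
print(top, english_only)
```

Brute force compares the query against every vector, which is why large collections need approximate indexes of the kind discussed in the conversation.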
You know, if you look at our systems work, which is the last area that I want to cover, you know, we have, we have been doing work on using GPUs and managing GPU resources for training as well as inference. And this area is also going through a lot of disruption. And prior to these large language models, these people were looking at relatively smaller models, you know, maybe not, you know, hundreds of billions to trillions of parameters. But, but, you know, maybe hundreds of millions and so on. And they invented several techniques to share a GPU cluster among training jobs. So the disruption that they had was all these models are so large that nobody is actually sharing clusters for them. But it turned out that some of the techniques that they invented to deal with, you know, migration of jobs and so on are now used for failure recovery in very, very large models. So it turns out that, you know, at the beginning it seems like, “Oh, my work is not relevant anymore,” but once you get into the details, you find that there are actually still many important problems. And the insights you have from solving problems for smaller models can now carry over to the larger ones. And one other area I would say is the area of, you know, programming. You know, I myself work in this area. We have been doing … combining machine learning together with program analysis to build a new generation of programming tools. And the disruption that I personally faced was that the custom models that I was building were no longer relevant; they’re, they’re not even needed. So that was a disruption. But actually, what me and my colleagues went through was that, “OK, that is true, but we can now go after problems that we didn’t dare to go before.” Like, for example, you know, we can now see that, you know, copilot and so on let you give recommendations in the context of the particular file that you are editing. 
But can we now edit an entire repository which might contain, you know, millions of files with hundreds of millions of lines of code? Can I just say, let’s take, for example, the whole of the Xbox code base or the Windows code base, and in the whole code base, I want to do this refactoring, or I want to, you know, migrate this package from … migrate this code base from using now this serialization package to that serialization package. Can we just do that, right? I think we wouldn’t even dare going after such a problem two years ago. But now with large language models, we are thinking, can we do that? And large language models cannot do this right now because, you know, whatever context size you have, you can’t have 100-million-line code as a context to a large language model. And so this requires, you know, combining program analysis with these techniques. That’s as an example. And actually, furthermore, there are, you know, many things that we are doing that are not quite affected by large language models. You know, for example, Ashley, you know about the HyWay project, where we’re thinking about technology to make hybrid work work better. And, you know, we are doing work on using GPUs and accelerators for, you know, database systems and so on. And we do networking work. We do a low-earth orbit satellite work for connectivity and so on. And those we are doubling down, you know, though, they have nothing to do with large language models because those are problems that are important. So, I think, you know, to summarize, I would say that, you know, most of us have gone through a journey from, you know, shock and awe to sort of somewhat of an insecurity, saying is my work even relevant, to sort of understanding, oh, these things are really aids for us. These are not threats for us. These are really aids, and we can use them to solve problems that we didn’t even dream of before. That’s the journey I think my colleagues have gone through.

LLORENS: I want to, I want to step into two of the concepts that you just laid out, maybe just to get into some of the intuitions as to what problem is being solved and how generative AI is sort of changing the way that those, those problems are solved. So the first one is extreme classification. I think, you know, a flagship use of generative AI and foundation models is, is Bing chat. And so I think this idea of, of internet search as a, as a, you know, as a, a home for, for these new technologies is, is in the popular imagination now. And I know that extreme classification seeks to solve some challenges related to search and information retrieval. But what is the challenge problem there? What, you know … how is extreme classification addressing that, and how is that, you know, being done differently now?

RAJAMANI: So as I mentioned, where my colleagues have already made a lot of progress is in combining language encoders with extreme classifiers to do retrieval. So there are these models called NLR. Like, for example, there’s a Turing NLR model, which is a large language model which does representation, right. It actually represents, you know, keywords, keyword phrases, documents, and so on in the encodings, you know, based on, you know, self-supervised learning. But it is a very important problem to combine the knowledge that these large language models have, you know, from understanding a text. We have to combine that with supervised signals that we have from click logs. Because we have search engine click logs, we know, you know, for example, when somebody searches for this information and we show these results, what users click on. That’s supervised signals, and we have that in huge amounts. And what our researchers have done is they have figured out how to combine these encoders together with the supervised signals from click logs in order to improve both the quality and cost of retrieval, right. And, Ashley, as you said, retrieval is an extremely important part of experiences like Bing chat and retrieval augmented generation is what prevents hallucination and grounds these large language models with appropriate information retrieved and presented so that the, the relevant results are grounded without hallucination, right. Now, the new challenge that this team is now facing is, OK, that’s so far so good as far as retrieval is concerned, right? But can we do similar things with generation, right? Can we now combine these NLG models, which are these generative models, together with supervised signals, so that even generation can actually be guided in this manner, improved in both performance, as well as accuracy. And that is an example of a challenging problem that the team is going after.

LLORENS: Now let’s do the same thing with programming, and maybe I’m going to engage you on a slightly higher level of abstraction than the deep work you’re doing. And then, we can, we can, we can get back down into the work. But one of the things … one of, one of the, one of the popular ideas about these new foundation models is that you can … effectively through interacting with them, you’re sort of programming them in natural language. How does that concept sit with you as someone who, you know, is an expert in programming languages? What do you, what do you think, what do you think when someone says, you know, sort of programming the, you know, the system in natural language?

RAJAMANI: Yeah, so I, I find it fascinating and, you know, for one, you know, can we … an important topic in programming language research has been always that can we get end users or, you know, people who are nonprogrammers to program. I think that has been a longstanding open problem. And if you look at the programming language community, right, the programming language community has been able to solve it only in, in narrow domains. You know, for example, Excel has Flash Fill, where, through examples, you know, people can program Excel macros and so on. But those are not as general as these kinds of, you know, LLM-based models, right. And, and it is for the whole community, not just me, right. It was stunning when users can just describe in natural language what program they want to write and these models emit in a Python or Java or C# code. But there is a gap between that capability and having programmers just program in natural language, right. Like, you know, the obvious one is … and I can sort of say, you know, write me Python code to do this or that, and it can generate Python code, and I could run it. And if that works, then that’s a happy path. But if it doesn’t work, what am I supposed to do if I don’t know Python? What am I supposed to do, right? I still have to now break that abstraction boundary of natural language and go down into Python and debug Python. So one of the opportunities that I see is then can we build representations that are also in natural language, but that sort of describe, you know, what the application the user is trying to build and enable nonprogrammers—could be lawyers, could be accountants, could be doctors—to engage with a system purely in natural language and the system should talk back to you, saying, “Oh, so far this is what I’ve understood. This is the kind of program that I am writing,” without the user having to break that natural language abstraction boundary and going and having to go and understand Python, right? 
I think this is a huge opportunity in programming languages to see whether … can we build, like, for example, right, Ashley, right, I’m a programmer, and one of the things I love about programming is that I can write code. I can run it, see what it produces, and if I don’t like the results, I can go change the code and rerun it. And that’s sort of the, you know, coding, evaluating … we call it the REPL loop, right. So that’s, that’s what a programmer faces, right. Can we now provide that to natural language programmers? And since … and I want to say, “Here’s a program I want to write,” and now I want to say, “Well, I want to run this program with this input.” And if it doesn’t work, I want to say, “Oh, this is something I don’t like. I want to change this code this way,” right. So can I now provide that kind of experience to natural language programming? I think that’s a huge opportunity if you managed to pull that off.

LLORENS: And now let’s, let’s maybe return to some of the more societally oriented, you know, topics that, that you were talking about at the top of the episode in the context of, of, of programming. Because being able to program in natural language, I think, really changes, you know, who can use the technologies, who can develop technologies, what a program … what a software development team can actually be, and who, who that, who that kind of a team can consist of. So can you paint a picture? You know, what, what, what kind of opportunities for, for, you know, software development does this open up when you can sort of program in natural languages, assuming we can make the AI compatible with your language, whatever that happens to be?

RAJAMANI: Yeah, I think there are a lot of opportunities, and maybe I’ll, I’ll, I’ll describe a few things that we’re already doing. My, my colleagues are working on a project called VeLLM, which is now a copilot assistant for societal-scale applications. And one application they are going after is education. So, you know, India, like many other countries, has made a lot of educational resources available to teachers in government schools and so on so that if a teacher wants to make a lesson plan, you know, there is enough information available for them to search, find out many videos that their colleagues have created from different parts of the country, and put them together to create a lesson plan for their class, right. But that is a very laborious process. I mean, you have information overload when you deal with it. So my colleagues are thinking about, can we now think about, in some sense, the teacher as a programmer and have the teacher talk to the VeLLM system saying, “Hey, and here is my lesson plan. Here is what I’m trying to put together in terms of what I want to teach. And I now want the AI system to collect the relevant resources that are relevant to my lesson plan and get them in my language, the language that my students speak. You know, how do I do that,” right? And all of the things that I mentioned, right, you have to now index all of the existing information using vector indices. You have to now [use] retrieval augmented generation to get the correct thing. You have to now deal with the trunk and tail languages because this teacher might be speaking in, in, in a language that is not English, right. And, and, and, and the teacher might get a response that they don’t like, right. And how do they now … but they are not a programmer, right? How are they going to deal with it, right? So that’s actually an example. 
If we, if we pull this off, right, and a teacher in rural India is able to access this information in their own language and create a lesson plan which contains the best resources throughout the country, right, we would have really achieved something.

LLORENS: Yeah, you know, it’s a, it’s a hugely compelling vision. And I’m really looking forward to seeing where you and, you know, our colleagues in Microsoft Research India Lab and MSR [Microsoft Research] more broadly, you know, take all these different directions.

[MUSIC PLAYS] So I really appreciate you spending this time with me today.

RAJAMANI: Thank you, Ashley. And I was very happy that I could share the work that my colleagues are doing here and, and bringing this to your audience. Thank you so much.

The post AI Frontiers: AI in India and beyond with Sriram Rajamani appeared first on Microsoft Research.

GeForce NOW Gets Wild, With ‘Party Animals’ Leading 24 New Games in September

Just like that, summer falls into September, and some of the most anticipated games of the year, like the Cyberpunk 2077: Phantom Liberty expansion, PAYDAY 3 and Party Animals, are dropping into the GeForce NOW library at launch this month.

They’re part of 24 new games hitting the cloud gaming service in September. And the next Game Pass title to join the cloud at launch is Sea of Stars, part of 13 new games this week.

Keep an eye on GFN Thursday to see the next Microsoft titles joining the cloud this month, including Quake II, Gears Tactics and Halo Infinite.

Plus, NVIDIA has worked with Google to give Chromebook owners a new offer that includes three free months of a GeForce NOW Priority membership. GeForce NOW cloud gaming goes perfectly together with Chromebooks, which provide up to 1,600p resolution and 120Hz+ displays.

Party Hard in the Cloud

Party Animals on GeForce NOW — *The cloud is about to get wild.*

Make painfully great memories with friends in Party Animals, a hilarious, physics-based party brawler from Recreate Games and Source Technology. Fight friends as adorable puppies, mischievous kittens, magical unicorns or other fuzzy creatures or terrorize them as fearsome sharks and ferocious dinosaurs to be the last one standing.

Battle it out by picking up an assortment of weapons to get an edge over others or punch, toss, jump, kick and headbutt others in the brawl. Bring the action across multiple game modes — each requires a different strategy to win.

Get fierce for party game night, whether playing with friends locally on the couch or across devices online. Party Animals joins the cloud at launch on Thursday, Sept. 21.

Work Hard, Play Hard

Chromebook offer for GeForce NOW membership — *Shiny new GeForce NOW offers for Chromebook owners.*

One of the best ways to stream games from GeForce NOW is with the new Cloud Gaming Chromebooks, which feature screens that display beautiful scenes at up to 1,600p and 120Hz+.

Chromebook gamers can jump right into 100+ free-to-play titles and over 1,600 hit games, like Baldur’s Gate 3, Remnant II, supported games from the Xbox PC Game Pass library and more. GeForce NOW Priority members can also explore the worlds of Cyberpunk 2077, Control and other titles with RTX ON, and Ultimate members can level up and access new NVIDIA technologies like DLSS 3.5 in upcoming games like Alan Wake 2 and Portal with RTX. Compete online with ultra-low latency and other features perfect for playing.

Starting today, Google and NVIDIA are offering all Chromebook owners three free months of a GeForce NOW Priority membership to get gamers started. And those interested in leveling up to an Ultimate membership, the highest-performing tier, are already able to get three free months of a GeForce NOW Ultimate membership with the purchase of a Cloud Gaming Chromebook. Find more details on how to redeem the offer in Google’s Keyword blog or on the Chromebooks Perks page.

New Games as Far as the Eye Can Sea

Sea of Stars on GeForce NOW — *Time to sea the stars from the cloud.*

New games come with each GFN Thursday, and this week’s batch includes Sea of Stars from Sabotage Studio. A retro-inspired role-playing game drawing from classics like Chrono Trigger, it features a vibrant world, a dynamic combat system and a story of cosmic proportions. Play as two Children of the Solstice, who combine the powers of the sun and moon to perform Eclipse Magic, the only force capable of fending off the monstrous creations of an evil alchemist known as the Fleshmancer. Sea of Stars is now available for members to stream from the cloud via Game Pass or Steam.

Check out the list of 13 new games joining this week:

Sea of Stars (New release on Xbox and Steam, Aug. 28)
Untamed Tactics (New release on Steam, Aug. 28)
Trine 5: A Clockwork Conspiracy (New release on Steam, Aug. 31)
Deep Rock Galactic (Xbox)
HITMAN – World of Assassination (Xbox)
King’s Bounty: The Legend (Epic Games Store)
Little Big Workshop (Epic Games Store)
NARAKA: BLADEPOINT (Xbox)
Portal: Prelude with RTX (Steam)
Railway Empire 2 (Epic Games Store)
Shadowrun Returns (Xbox)
Snake Pass (Epic Games Store)
STORY OF SEASONS: Friends of Mineral Town (Epic Games Store)

And here’s a peek at what September will look like:

Chants of Sennaar (New release on Steam, Sept. 5)
SYNCED (New release on Steam, Sept. 7)
Deceit 2 (New release on Steam, Sept. 14)
The Crew Motorfest (New release on Ubisoft, Sept. 14)
Ad Infinitum (New release on Steam, Sept. 14)
Party Animals (New release on Steam, Sept. 20)
Warhaven (New release on Steam, Sept. 20)
PAYDAY 3 (New release on Xbox, Steam and Epic Games Store, Sept. 21)
Cyberpunk 2077: Phantom Liberty (New release on Steam, Epic Games Store and GOG, Sept. 25)
Paleo Pines (New release on Steam, Sept. 26)
Infinity Strash: DRAGON QUEST The Adventure of Dai (New release on Steam, Sept. 28)
Wildmender (New release on Steam, Sept. 28)
Broforce (Steam)
Death in the Water 2 (Steam)
Deceive Inc. (Steam)
Devil May Cry 5 (Steam)
Don Duality (Steam)
Dust Fleet (Steam)
Kingdoms Reborn (Steam)
Mega City Police (Steam)
Necesse (Steam)
Saints Row (Steam)
Shadow Gambit: The Cursed Crew (Epic Games Store)
SPRAWL (Steam)
War for the Overworld (Steam)

This week’s Game On giveaway with SteelSeries includes Dying Light 2 and three-day Ultimate membership codes. It’s the last week of the giveaway, so check out the SteelSeries page for details on how to enter.

Amazing August

On top of the 32 games announced in August, an additional 36 joined the cloud last month across multiple stores:

Orwell: Keeping an Eye on You (Free game on Epic Games Store, Aug. 10)
Age of Empires: Definitive Edition (Xbox)
Age of Empires III: Definitive Edition (Xbox)
Age of Empires IV (Xbox)
Age of Wonders 4 (Epic Games Store)
Aliens: Dark Descent (Epic Games Store)
Amnesia: The Bunker (Epic Games Store)
Crusader Kings III (Xbox)
Darkest Dungeon II (Epic Games Store)
Dead Cells (Xbox)
Deathloop (Xbox)
DOOM 2016 (Steam)
DOOM Eternal (Steam)
F1 Manager 2023 (Epic Games Store)
Gears 5 (Xbox)
The Great War: Western Front (Epic Games Store)
Grounded (Xbox)
Kovaak’s FPS Trainer (Steam)
Mount & Blade II: Bannerlord (Xbox)
No Man’s Sky (Xbox)
The Outlast Trials (Epic Games Store)
Pentiment (Xbox)
Project Highrise (Epic Games Store)
Quake (Xbox, Steam and Epic Games Store)
Shadowrun: Dragonfall – Director’s Cut (Xbox)
Stellaris (Xbox)
Symphony of War: The Nephilim Saga (Epic Games Store)
System Shock (Epic Games Store)
The Texas Chain Saw Massacre (Xbox)
Trackmania (Steam)
Ultimate KovaaK’s Challenge
Valheim (Xbox)
Warhammer 40,000: Darktide (Xbox)
Wolfenstein II: The New Colossus (Xbox, Steam)
Wolfenstein: The New Order (Steam and Epic Games Store)
Wolfenstein: Youngblood (Xbox, Steam)

Before starting the weekend, we’ve got a question for you. Let us know the answer on Twitter or in the comments below.

Best animal in a video game. Go.

Bonus points for pictures…

— NVIDIA GeForce NOW (@NVIDIAGFN) August 30, 2023

Teaching with AI

We’re releasing a guide for teachers using ChatGPT in their classroom, including suggested prompts, an explanation of how ChatGPT works and its limitations, the efficacy of AI detectors, and bias. (OpenAI Blog)

PyTorch/XLA SPMD: Scale Up Model Training and Serving with Automatic Parallelization

Today, we are delighted to announce PyTorch/XLA SPMD: the integration of GSPMD into PyTorch with an easy to use API. PyTorch developers seeking superior performance and scale can train and serve the largest neural networks while maximizing utilization of AI accelerators, such as Google Cloud TPUs.

Introduction

GSPMD is an automatic parallelization system for ML workloads. The XLA compiler transforms the single device program into a partitioned one with proper collectives, based on the user provided sharding hints. This allows developers to write PyTorch programs as if they are on a single large device without any custom sharded computation and/or collective communication ops to scale models.

PyTorch/XLA SPMD allows PyTorch users to parallelize their ML workloads with GSPMD with less effort and with better performance. Some of the key highlights are:

Better developer experience. Everything happens with a few sharding annotations from the user, and PyTorch/XLA SPMD achieves comparable performance to the most efficient PyTorch sharding implementation (see the Examples and Results section below). PyTorch/XLA SPMD separates the task of programming an ML model from the challenge of parallelization. Its automated approach to model sharding frees up the user from implementing the sharded version of ops with proper collectives in place.
A single API that enables a large variety of parallelism algorithms (including data parallelism, fully sharded data parallelism, spatial partitioning, tensor parallelism, and pipeline parallelism, as well as combinations of these algorithms) for different ML workloads and model architectures.
Industry-leading performance in large model training. PyTorch/XLA SPMD brings the powerful XLA GSPMD to PyTorch, enabling users to harness the full power of Google Cloud TPUs.
Enabling PyTorch and JAX developers to take advantage of the same underlying XLA API to scale models.

Key Concepts

The key concepts behind the sharding annotation API are: 1) Mesh, 2) Partition Spec, and 3) mark_sharding API to express sharding intent using Mesh and Partition Spec. A more detailed design overview is available as a user guide here.

Mesh

For a given cluster of devices, a physical mesh is a representation of the interconnect topology.

We derive a logical mesh based on this topology to create sub-groups of devices which can be used for partitioning different axes of tensors in a model. We apply sharding annotations to map the program across the logical mesh; this automatically inserts communication collectives in the program graph to support functional correctness (see the figure below).

We abstract logical mesh with Mesh API. The axes of the logical Mesh can be named. Here is an example:

import numpy as np
import torch_xla.runtime as xr
from torch_xla.experimental.xla_sharding import Mesh

# Assuming you are running on a TPU host that has 8 devices attached
num_devices = xr.global_runtime_device_count()
# mesh shape will be (4,2) in this example
mesh_shape = (num_devices // 2, 2)
device_ids = np.array(range(num_devices))
# axis_names 'x' and 'y' are optional
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

mesh.get_logical_mesh()
>> array([[0, 1],
          [2, 3],
          [4, 5],
          [6, 7]])
mesh.shape()
>> OrderedDict([('x', 4), ('y', 2)])
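
To make the idea of a logical mesh and its device sub-groups more concrete, here is a minimal pure-Python sketch that does not depend on torch_xla. The helper names `build_logical_mesh` and `axis_groups` are hypothetical and only illustrate the row-major layout shown in the output above; the real Mesh class may order or group devices differently on a given topology.

```python
from collections import OrderedDict


def build_logical_mesh(device_ids, mesh_shape, axis_names):
    """Arrange a flat list of device ids into a row-major 2D logical mesh."""
    rows, cols = mesh_shape
    if rows * cols != len(device_ids):
        raise ValueError("mesh_shape must cover every device exactly once")
    grid = [list(device_ids[r * cols:(r + 1) * cols]) for r in range(rows)]
    axes = OrderedDict(zip(axis_names, mesh_shape))
    return grid, axes


def axis_groups(grid, axis):
    """Return the device sub-groups whose members differ only along `axis`.

    axis=0 walks down the first mesh axis (columns of the grid),
    axis=1 walks along the second mesh axis (rows of the grid).
    """
    if axis == 1:
        return [list(row) for row in grid]
    return [list(col) for col in zip(*grid)]


if __name__ == "__main__":
    grid, axes = build_logical_mesh(list(range(8)), (4, 2), ("x", "y"))
    print(grid)
    print(axes)
    print(axis_groups(grid, 0))  # groups that span the 'x' axis
```

Each sub-group is the set of devices that would have to communicate when a tensor dimension is sharded along that mesh axis.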

Partition Spec

partition_spec has the same rank as the input tensor. Each entry describes how the corresponding input tensor dimension is sharded across the device mesh (logically defined by mesh_shape). An entry is a device mesh dimension index, None, or a tuple of mesh dimension indices. An index can be an int, or a str if the corresponding mesh dimension is named. This specifies whether each input dimension is sharded (an index into mesh_shape) or replicated (None).

# Provide optional mesh axis names and use them in the partition spec
mesh = Mesh(device_ids, (4, 2), ('data', 'model'))
partition_spec = ('model', 'data')
xs.mark_sharding(input_tensor, mesh, partition_spec)

We support all three types of sharding described in the original GSPMD paper. For instance, one can specify partial replication like this:

# Provide optional mesh axis names and use them in the partition spec
mesh = Mesh(device_ids, (2, 2, 2), ('x', 'y', 'z'))

# evenly shard across x and z and replicate among y
partition_spec = ('x', 'z')  # equivalent to ('x', None, 'z')
xs.mark_sharding(input_tensor, mesh, partition_spec)
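
As a rough illustration of what a partition spec implies, the following sketch computes the per-device shard shape and the number of replicas for a given mesh. It is a simplified model, assuming dimensions divide evenly; `shard_shape` and `replication_factor` are hypothetical helpers and not part of the PyTorch/XLA API.

```python
def _axis_names(entry):
    """Normalize one partition_spec entry to a tuple of mesh axis names."""
    if entry is None:
        return ()
    return entry if isinstance(entry, tuple) else (entry,)


def shard_shape(tensor_shape, mesh_axes, partition_spec):
    """Per-device shard shape for an evenly divisible tensor.

    mesh_axes maps axis name -> size, e.g. {"data": 4, "model": 2}.
    partition_spec has one entry per tensor dimension: an axis name,
    a tuple of axis names, or None (replicated along that dimension).
    """
    if len(partition_spec) != len(tensor_shape):
        raise ValueError("partition_spec must have the same rank as the tensor")
    shape = []
    for dim, entry in zip(tensor_shape, partition_spec):
        ways = 1
        for name in _axis_names(entry):
            ways *= mesh_axes[name]
        if dim % ways != 0:
            raise ValueError(f"dimension {dim} is not divisible by {ways}")
        shape.append(dim // ways)
    return tuple(shape)


def replication_factor(mesh_axes, partition_spec):
    """Number of devices holding a copy of each shard (unused mesh axes)."""
    used = {name for entry in partition_spec for name in _axis_names(entry)}
    factor = 1
    for name, size in mesh_axes.items():
        if name not in used:
            factor *= size
    return factor


if __name__ == "__main__":
    mesh_axes = {"x": 2, "y": 2, "z": 2}
    print(shard_shape((8, 4), mesh_axes, ("x", "z")))
    print(replication_factor(mesh_axes, ("x", "z")))
```

In the partial replication case, the mesh axis that the spec never mentions ('y' here) is the one across which shards are copied.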

Simple Example With Sharding Annotation

Users can annotate native PyTorch tensors using the mark_sharding API (src). This takes a torch.Tensor as input and returns an XLAShardedTensor as output.

def mark_sharding(t: Union[torch.Tensor, XLAShardedTensor], mesh: Mesh, partition_spec: Tuple[Union[int, None]]) -> XLAShardedTensor

Invoking mark_sharding API takes a user defined logical mesh and partition_spec and generates a sharding annotation for the XLA compiler. The sharding specification is attached to the XLATensor, as well as the original input tensor. Here is a simple usage example from the [RFC], to illustrate how the sharding annotation API works:

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharded_tensor import XLAShardedTensor
from torch_xla.experimental.xla_sharding import Mesh

# Enable XLA SPMD execution mode.
xr.use_spmd()

# Device mesh, this and partition spec as well as the input tensor shape define the individual shard shape.
num_devices = xr.global_runtime_device_count()
mesh_shape = (2, num_devices // 2)  # 2x4 on v3-8, 2x2 on v4-8
device_ids = np.array(range(num_devices))
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

t = torch.randn(8, 4).to(xm.xla_device())

# Mesh partitioning, each device holds 1/8-th of the input
partition_spec = (0, 1)
m1_sharded = xs.mark_sharding(t, mesh, partition_spec)
assert isinstance(m1_sharded, XLAShardedTensor)
# Note that the sharding annotation is also applied in place to t
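
To see why each device ends up with one eighth of the 8x4 input on a 2x4 mesh, the sketch below maps every device to the index ranges it would hold. It assumes a row-major device layout and evenly divisible dimensions; `device_slices` is a hypothetical helper, and the actual placement chosen by the XLA compiler may differ.

```python
from itertools import product


def device_slices(tensor_shape, mesh_shape, partition_spec):
    """Map each device id to the (start, stop) range it holds per tensor dim.

    partition_spec entries are mesh axis indices (int) or None (replicated).
    Devices are assumed to be numbered row-major over the mesh.
    """
    result = {}
    mesh_coords = product(*(range(n) for n in mesh_shape))
    for device_id, coords in enumerate(mesh_coords):
        ranges = []
        for dim, axis in zip(tensor_shape, partition_spec):
            if axis is None:
                ranges.append((0, dim))  # whole dimension on every device
            else:
                size = dim // mesh_shape[axis]
                start = coords[axis] * size
                ranges.append((start, start + size))
        result[device_id] = tuple(ranges)
    return result


if __name__ == "__main__":
    for device, ranges in device_slices((8, 4), (2, 4), (0, 1)).items():
        print(device, ranges)
```

With partition_spec = (0, 1), tensor dimension 0 is split across mesh axis 0 (two ways) and dimension 1 across mesh axis 1 (four ways), so every shard has shape (4, 1).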

We can annotate different tensors in the PyTorch program to enable different parallelism techniques, as described in the comment below:

# Sharding annotate the linear layer weights.
model = SimpleLinear().to(xm.xla_device())
xs.mark_sharding(model.fc1.weight, mesh, partition_spec)

# Training loop
model.train()
for step, (data, target) in enumerate(loader):
  # Assumes `loader` returns data, target on XLA device
  optimizer.zero_grad()
  # Sharding annotate input data, we can shard any input
  # dimensions. Sharding the batch dimension enables 
  # data parallelism, sharding the feature dimension enables
  # spatial partitioning.
  xs.mark_sharding(data, mesh, partition_spec)
  output = model(data)
  loss = loss_fn(output, target)
  # Backpropagate before stepping the optimizer
  loss.backward()
  optimizer.step()
  xm.mark_step()

More complete unit test cases and integration test examples are available in the PyTorch/XLA repo.

Results

Performance

We measured the performance of PyTorch/XLA SPMD using a GPT-2 model (src) and compared it with user-mode FSDP.

Here, SPMD applies the same sharding scheme as the FSDP plot (i.e. 1D sharding). Users are expected to achieve better MFU results by exploring more advanced SPMD sharding schemes.

We use Model FLOPS Utilization (MFU) as a metric for comparison. MFU is “the ratio of the observed throughput relative to the theoretical maximum throughput of a system operating at peak FLOPs” (PaLM paper).

flops_per_step = 6 * global_batch_size * seq_len * num_params
model_flops_utilization = flops_per_step / step_time(s) / chip_count / flops_per_chip

This estimation assumes that the input dimensionality is much larger than the input sequence length (d_model ≫ seq_len). If this assumption is violated, the self-attention FLOPs become significant and this expression will underestimate the true MFU.
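
The two expressions above can be written as a small helper. This is only a sketch of the arithmetic; the numbers in the usage example are made-up placeholders, not measurements from this post, and `flops_per_chip` should be taken from the accelerator's published peak for the precision being used.

```python
def model_flops_utilization(global_batch_size, seq_len, num_params,
                            step_time_s, chip_count, flops_per_chip):
    """Estimate MFU from the 6 * tokens * parameters approximation.

    The estimate ignores self-attention FLOPs, so it is only reasonable
    when d_model is much larger than seq_len.
    """
    flops_per_step = 6 * global_batch_size * seq_len * num_params
    achieved_flops_per_chip = flops_per_step / step_time_s / chip_count
    return achieved_flops_per_chip / flops_per_chip


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the call shape.
    mfu = model_flops_utilization(
        global_batch_size=128,
        seq_len=1024,
        num_params=2e9,
        step_time_s=4.0,
        chip_count=8,
        flops_per_chip=275e12,  # placeholder peak FLOPs per chip
    )
    print(f"MFU: {mfu:.2%}")
```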

Scalability

One of the core benefits of SPMD is the flexible partitioning which can be used to save accelerator memory (HBM) usage and improve scalability. For scalability analysis, we present two studies: 1) we examine the peak HBM across 4 model sizes using Hugging Face transformers (GPT-2) as the base implementation; 2) we examine the peak HBM usage with spatial partitioning.

The above figure shows that the peak memory footprint of the unsharded 2B-parameter model is 26GB (red dashed line). Sharding model weights (model parallelism) reduces the peak memory footprint and thus enables training larger models on a given TPU pod slice. In these experiments, we achieved up to 39.75% MFU on a 4B-parameter model on Google Cloud TPU v4-16.

We also ran an input batch scalability test using spatial partitioning and a simple ResNet50 example (src) on Cloud TPU v4-8. Input batch is commonly sharded across the batch dimension for data parallelism (DDP, FSDP), but PyTorch/XLA SPMD enables input sharding across input feature dimensions for spatial sharding. As shown in the below figure, one can push the per-device batch size to 512 with spatial partitioning which is not possible with other data parallelism techniques.

The Road Forward for PyTorch/XLA SPMD

We are ecstatic about what’s ahead for PyTorch/XLA and invite the community to join us. SPMD is still experimental, and we continuously add new features to it. In future releases, we plan to address async dataloading, partially replicated sharding, and other improvements. We’d love to hear from you, answer your questions about PyTorch/XLA SPMD, and learn how you use SPMD.

Cheers!

The PyTorch/XLA Team at Google

Bringing generative AI in Search to more people around the world

We’re bringing generative AI capabilities in Search (SGE) to more people, making Search Labs available in India and Japan.

Deploy self-service question answering with the QnABot on AWS solution powered by Amazon Lex with Amazon Kendra and large language models

Powered by Amazon Lex, the QnABot on AWS solution is an open-source, multi-channel, multi-language conversational chatbot. QnABot allows you to quickly deploy self-service conversational AI into your contact center, websites, and social media channels, reducing costs, shortening hold times, and improving customer experience and brand sentiment. Customers now want to apply the power of large language models (LLMs) to further improve the customer experience with generative AI capabilities. This includes automatically generating accurate answers from existing company documents and knowledge bases, and making their self-service chatbots more conversational.

Our latest QnABot releases, v5.4.0+, can now use an LLM to disambiguate customer questions by taking conversational context into account, dynamically generating answers from relevant FAQs or Amazon Kendra search results and document passages. It also provides attribution and transparency by displaying links to the reference documents and context passages that were used by the LLM to construct the answers.

When you deploy QnABot, you can choose to automatically deploy a state-of-the-art open-source LLM model (Falcon-40B-instruct) on an Amazon SageMaker endpoint. The LLM landscape is constantly evolving—new models are released frequently and our customers want to experiment with different models and providers to see what works best for their use cases. This is why QnABot also integrates with any other LLM using an AWS Lambda function that you provide. To help you get started, we’ve also released a set of sample one-click deployable Lambda functions (plugins) to integrate QnABot with your choice of leading LLM providers, including our own Amazon Bedrock service and APIs from third-party providers, Anthropic and AI21.

In this post, we introduce the new Generative AI features for QnABot and walk through a tutorial to create, deploy, and customize QnABot to use these features. We also discuss some relevant use cases.

New Generative AI features

Using the LLM, QnABot now has two new important features, which we discuss in this section.

Generate answers to questions from Amazon Kendra search results or text passages

QnABot can now generate concise answers to questions from document extracts provided by an Amazon Kendra search, or text passages created or imported directly. This provides the following advantages:

The number of FAQs that you need to maintain and import into QnABot is reduced, because you can now synthesize concise answers on the fly from your existing documents.
Generated answers can be modified to create the best experience for the intended channel. For example, you can set the answers to be short, concise, and suitable for voice channel contact center bots, and website or text bots could potentially provide more detailed information.
Generated answers are fully compatible with QnABot’s multi-language support—users can interact in their chosen languages and receive generated answers in the same language.
Generated answers can include links to the reference documents and context passages used, to provide attribution and transparency on how the LLM constructed the answers.

For example, when asked “What is Amazon Lex?”, QnABot can retrieve relevant passages from an Amazon Kendra index (containing AWS documentation). QnABot then asks (prompts) the LLM to answer the question based on the context of the passages (which can also optionally be viewed in the web client). The following screenshot shows an example.
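
The general retrieve-then-prompt pattern described here can be sketched as follows. This is an illustrative assumption about how such a prompt might be assembled, not QnABot's actual prompt template; `build_answer_prompt` and the passage sources are hypothetical.

```python
def build_answer_prompt(question, passages):
    """Assemble a grounded question-answering prompt.

    passages is a list of (source, text) pairs, for example search results.
    Returns the prompt plus the list of sources, so the caller can show
    reference links alongside the generated answer.
    """
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(passages, start=1)
    )
    references = [source for source, _ in passages]
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return prompt, references


if __name__ == "__main__":
    prompt, refs = build_answer_prompt(
        "What is Amazon Lex?",
        [("doc-a", "Amazon Lex is an AWS service for building "
                   "conversational interfaces using voice and text.")],
    )
    print(prompt)
    print(refs)
```

Returning the sources separately keeps attribution independent of whatever text the model generates.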

Disambiguate follow-up questions that rely on preceding conversation context

Understanding the direction and context of an ever-evolving conversation is key to building natural, human-like conversational interfaces. User queries often require a bot to interpret requests based on conversation memory and context. Now QnABot will ask the LLM to generate a disambiguated question based on the conversation history. This can then be used as a search query to retrieve the FAQs, passages, or Amazon Kendra results to answer the user’s question. The following is an example chat history:

Human: What is Amazon Lex?
AI: "Amazon Lex is an AWS service for building conversational interfaces for applications using voice and text..."
Human: Can it integrate with my CRM?

QnABot uses the LLM to rewrite the follow-up question to make “it” unambiguous, for example, “Can Amazon Lex integrate with my CRM system?” This allows users to interact like they would in a human conversation, and QnABot generates clear search queries to find the relevant FAQs or document passages that have the information to answer the user’s question.
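
A minimal sketch of how a question-rewriting prompt could be built from the chat history is shown below. The wording of the instruction and the function name `build_disambiguation_prompt` are assumptions for illustration only; QnABot's real prompts are configurable and may look different.

```python
def build_disambiguation_prompt(history, follow_up):
    """Build a prompt asking an LLM to rewrite a follow-up question.

    history is a list of (speaker, text) turns in chronological order.
    The LLM's reply would then be used as the search query.
    """
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        "Given the chat history, rewrite the follow-up question as a "
        "standalone question that needs no prior context.\n\n"
        f"Chat history:\n{transcript}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


if __name__ == "__main__":
    history = [
        ("Human", "What is Amazon Lex?"),
        ("AI", "Amazon Lex is an AWS service for building conversational "
               "interfaces for applications using voice and text..."),
    ]
    print(build_disambiguation_prompt(history, "Can it integrate with my CRM?"))
```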

These new features make QnABot more conversational and provide the ability to dynamically generate responses based on a knowledge base. This is still an experimental feature with tremendous potential. We strongly encourage users to experiment to find the best LLM and corresponding prompts and model parameters to use. QnABot makes it straightforward to experiment!

Tutorial

Time to try it! Let’s deploy the latest QnABot (v5.4.0 or later) and enable the new Generative AI features. The high-level steps are as follows:

Create and populate an Amazon Kendra index.
Choose and deploy an LLM plugin (optional).
Deploy QnABot.
Configure QnABot for your Lambda plugin (if using a plugin).
Access the QnABot web client and start experimenting.
Customize behavior using QnABot settings.
Add curated Q&As and text passages to the knowledge base.

Create and populate an Amazon Kendra Index

Download and use the following AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and SageMaker. Deploying the stack takes about 30 minutes, followed by about 15 minutes to synchronize and ingest the data into the index.

When the Amazon Kendra index stack is successfully deployed, navigate to the stack’s Outputs tab and note the Index Id, which you will use later when deploying QnABot.

Alternatively, if you already have an Amazon Kendra index with your own content, you can use it instead with your own example questions for the tutorial.

Choose and deploy an LLM plugin (optional)

QnABot can deploy a built-in LLM (Falcon-40B-instruct on SageMaker) or use Lambda functions to call any other LLMs of your choice. In this section, we show you how to use the Lambda option with a pre-built sample Lambda function. Skip to the next step if you want to use the built-in LLM instead.

First, choose the plugin LLM you want to use. Review your options from the qnabot-on-aws-plugin-samples repository README. As of this writing, plugins are available for Amazon Bedrock (in preview), and for AI21 and Anthropic third-party APIs. We expect to add more sample plugins over time.

Deploy your chosen plugin by choosing Launch Stack in the Deploy a new Plugin stack section, which will deploy into the us-east-1 Region by default (to deploy in other Regions, see Build and Publish QnABot Plugins CloudFormation artifacts).

When the Plugin stack is successfully deployed, navigate to the stack’s Outputs tab (see the following screenshot) and inspect its contents, which you will use in the following steps to deploy and configure QnABot. Keep this tab open in your browser.

Deploy QnABot

Choose Launch Solution from the QnABot implementation guide to deploy the latest QnABot template via AWS CloudFormation. Provide the following parameters:

For DefaultKendraIndexId, use the Amazon Kendra Index ID (a GUID) you collected earlier
For EmbeddingsApi (see Semantic Search using Text Embeddings), choose one of the following:
- SAGEMAKER (the default built-in embeddings model)
- LAMBDA (to use the Amazon Bedrock embeddings API with the BEDROCK-EMBEDDINGS-AND-LLM Plugin)
  - For EmbeddingsLambdaArn, use the EmbeddingsLambdaArn output value from your BEDROCK-EMBEDDINGS-AND-LLM Plugin stack.
For LLMApi (see Query Disambiguation for Conversational Retrieval, and Generative Question Answering), choose one of the following:
- SAGEMAKER (the default built-in LLM model)
- LAMBDA (to use the LLM Plugin deployed earlier)
  - For LLMLambdaArn, use the LLMLambdaArn output value from your Plugin stack

For all other parameters, accept the defaults (see the implementation guide for parameter definitions), and proceed to launch the QnABot stack.
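If you prefer to script the deployment, the parameter choices above can be captured as a CloudFormation-style parameter list. The following is a sketch only: the `qnabot_stack_parameters` helper is hypothetical, it validates just the two options discussed in this post, and the index ID and Lambda ARN shown are placeholders.

```python
from typing import Dict, List, Optional


def qnabot_stack_parameters(
    kendra_index_id: str,
    embeddings_api: str = "SAGEMAKER",
    llm_api: str = "SAGEMAKER",
    embeddings_lambda_arn: Optional[str] = None,
    llm_lambda_arn: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build a ParameterKey/ParameterValue list for the QnABot stack."""
    allowed = {"SAGEMAKER", "LAMBDA"}  # only the options covered in this post
    if embeddings_api not in allowed or llm_api not in allowed:
        raise ValueError("expected SAGEMAKER or LAMBDA")

    params = {
        "DefaultKendraIndexId": kendra_index_id,
        "EmbeddingsApi": embeddings_api,
        "LLMApi": llm_api,
    }
    # The Lambda ARNs are only needed when the matching API is LAMBDA.
    if embeddings_api == "LAMBDA":
        if not embeddings_lambda_arn:
            raise ValueError("EmbeddingsLambdaArn is required for LAMBDA")
        params["EmbeddingsLambdaArn"] = embeddings_lambda_arn
    if llm_api == "LAMBDA":
        if not llm_lambda_arn:
            raise ValueError("LLMLambdaArn is required for LAMBDA")
        params["LLMLambdaArn"] = llm_lambda_arn

    return [{"ParameterKey": k, "ParameterValue": v} for k, v in params.items()]


# Placeholder values; substitute the outputs from your own stacks.
params = qnabot_stack_parameters(
    kendra_index_id="00000000-0000-0000-0000-000000000000",
    llm_api="LAMBDA",
    llm_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-llm-plugin",
)
for p in params:
    print(p["ParameterKey"], "=", p["ParameterValue"])
```

A list in this shape is what the CloudFormation `CreateStack` API accepts as its `Parameters` argument; all other template parameters keep their defaults when omitted.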

Configure QnABot for your Lambda plugin (if using a plugin)

If you deployed QnABot using a sample LLM Lambda plugin to access a different LLM, update the QnABot model parameters and prompt template settings as recommended for your chosen plugin. For more information, see Update QnABot Settings. If you used the SageMaker (built-in) LLM option, skip to the next step, because the settings are already configured for you.

Access the QnABot web client and start experimenting

On the AWS CloudFormation console, choose the Outputs tab of the QnABot CloudFormation stack and choose the ClientURL link. Alternatively, launch the client by choosing QnABot on AWS Client from the Content Designer tools menu.

Now, try to ask questions related to AWS services, for example:

What is Amazon Lex?
How does SageMaker scale up inference workloads?
Is Kendra a search service?

Then you can ask follow-up questions without specifying the previously mentioned services or context, for example:

Is it secure?
Does it scale?

Customize behavior using QnABot settings

You can customize many settings on the QnABot Content Designer Settings page—see README – LLM Settings for a full list of relevant settings. For example, try the following:

Set ENABLE_DEBUG_RESPONSES to TRUE, save the settings, and try the previous questions again. Now you will see additional debug output at the top of each response, showing you how the LLM generates the Amazon Kendra search query based on the chat history, how long the LLM inferences took to run, and more. For example:
```
[User Input: "Is it fast?", LLM generated query (1207 ms): "Does Amazon Kendra provide search results quickly?", Search string: "Is it fast? / Does Amazon Kendra provide search results quickly?"["LLM: LAMBDA"], Source: KENDRA RETRIEVE API
```
Set ENABLE_DEBUG_RESPONSES back to FALSE, set LLM_QA_SHOW_CONTEXT_TEXT and LLM_QA_SHOW_SOURCE_LINKS to FALSE, and try the examples again. Now the context and sources links are not shown, and the output contains only the LLM-generated response.
If you feel adventurous, experiment also with the LLM prompt template settings—LLM_GENERATE_QUERY_PROMPT_TEMPLATE and LLM_QA_PROMPT_TEMPLATE. Refer to README – LLM Settings to see how you can use placeholders for runtime values like chat history, context, user input, query, and more. Note that the default prompts can most likely be improved and customized to better suit your use cases, so don’t be afraid to experiment! If you break something, you can always revert to the default settings using the RESET TO DEFAULTS option on the settings page.

Add curated Q&As and text passages to the knowledge base

QnABot can, of course, continue to answer questions based on curated Q&As. It can also use the LLM to generate answers from text passages created or imported directly into QnABot, in addition to using an Amazon Kendra index.

QnABot attempts to find a good answer to the disambiguated user question in the following sequence:

QnA items
Text passage items
Amazon Kendra index
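The lookup order above can be sketched as a simple fallback chain. This is an illustrative model, not QnABot's implementation: the `find_answer` function, the `lookup` helper, and the toy data are all hypothetical, and each source simply returns `None` when it has no good answer.

```python
from typing import Callable, Dict, List, Optional, Tuple

Retriever = Callable[[str], Optional[str]]


def find_answer(question: str, sources: List[Tuple[str, Retriever]]) -> Tuple[str, str]:
    """Return (source_name, answer) from the first source with a good answer."""
    for name, retrieve in sources:
        result = retrieve(question)
        if result is not None:
            return name, result
    # Nothing matched: fall through to the Custom Don't Know (no_hits) behavior.
    return "no_hits", "You stumped me."


def lookup(table: Dict[str, str]) -> Retriever:
    """Toy retriever: exact match on the lowercased question."""
    return lambda q: table.get(q.lower())


# Toy stand-ins for the three sources, listed in the order they are tried.
sources = [
    ("QnA items", lookup({"what is qnabot?": "A curated answer."})),
    ("Text passage items", lookup({"where did humpty dumpty sit?": "An answer generated from a passage."})),
    ("Amazon Kendra index", lookup({"what is amazon lex?": "An answer generated from search results."})),
]

print(find_answer("What is Amazon Lex?", sources))
print(find_answer("What is the meaning of life?", sources))
```

Ordering the sources from most curated to least curated means a hand-written answer always wins over a generated one when both are available.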

Let’s try some examples.

On the QnABot Content Designer tools menu, choose Import, then load the two example packages:

TextPassages-NurseryRhymeExamples
blog-samples-final

QnABot can use text embeddings to provide semantic search capability (using QnABot’s built-in OpenSearch index as a vector store), which improves accuracy and reduces question tuning compared to standard OpenSearch keyword-based matching. To illustrate this, try questions like the following:

“Tell me about the Alexa device with the screen”
“Tell me about Amazon’s video streaming device?”

These should ideally match the sample QnA items you imported, even though the words used to ask the question are poor keyword matches (but good semantic matches) with the configured QnA items: Alexa.001 (What is an Amazon Echo Show) and FireTV.001 (What is an Amazon Fire TV).

Even if you are not (yet) using Amazon Kendra (and you should!), QnABot can also answer questions based on passages created or imported into Content Designer. The following questions (and follow-up questions) are all answered from an imported text passage item that contains the nursery rhyme 0.HumptyDumpty:

“Where did Humpty Dumpty sit before he fell?”
“What happened after he fell? Was he OK?”

When using embeddings, a good answer is an answer that returns a similarity score above the threshold defined by the corresponding threshold setting. See Semantic question matching, using Large Language Model Text Embeddings for more details on how to test and tune the threshold settings.
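As a rough illustration of threshold-based matching, the sketch below scores a query embedding against an item embedding and accepts the item only if the score is above a threshold. It assumes plain cosine similarity and a made-up threshold of 0.8; the scores QnABot gets from its OpenSearch index may be computed or scaled differently, and the vectors here are toy values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_good_answer(query_vec: Sequence[float], item_vec: Sequence[float], threshold: float) -> bool:
    """An item counts as a good answer only if its score is above the threshold."""
    return cosine_similarity(query_vec, item_vec) > threshold


THRESHOLD = 0.8  # hypothetical value; tune it against your own content

query = [0.9, 0.1]      # toy embedding of the user's question
close_item = [1.0, 0.0]  # toy embedding of a semantically similar item
far_item = [0.0, 1.0]    # toy embedding of an unrelated item

print(is_good_answer(query, close_item, THRESHOLD))
print(is_good_answer(query, far_item, THRESHOLD))
```

Raising the threshold reduces false matches at the cost of more unanswered questions; lowering it does the opposite.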

If there are no good answers, or if the LLM’s response matches the regular expression defined in LLM_QA_NO_HITS_REGEX, then QnABot invokes the configurable Custom Don’t Know (no_hits) behavior, which, by default, returns a message saying “You stumped me.”
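The regular expression check can be pictured as follows. This is a sketch under stated assumptions: the pattern below is a hypothetical example rather than the actual value of LLM_QA_NO_HITS_REGEX, and `final_response` is an illustrative helper, not a QnABot function.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; check your LLM_QA_NO_HITS_REGEX
# setting for the value your deployment actually uses.
NO_HITS_REGEX = r"Sorry, I don't know"
DONT_KNOW_MESSAGE = "You stumped me."


def final_response(llm_response: Optional[str], no_hits_regex: str = NO_HITS_REGEX) -> str:
    """Return the LLM's answer, or the no_hits message when there is none.

    A response of None stands for "no good answer was found"; a response
    matching the regex stands for "the LLM said it does not know".
    """
    if llm_response is None or re.search(no_hits_regex, llm_response):
        return DONT_KNOW_MESSAGE
    return llm_response


print(final_response("Amazon Lex is an AWS service for building conversational interfaces."))
print(final_response("Sorry, I don't know the answer to that."))
print(final_response(None))
```

For this to work, the question answering prompt needs to instruct the LLM to reply with a recognizable phrase when the context does not contain the answer, so that the regex has something reliable to match.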

Try some experiments by creating Q&As or text passage items in QnABot, as well as using an Amazon Kendra index for fallback generative answers. Experiment (using the TEST tab in the designer) to find the best values to use for the embedding threshold settings to get the behavior you want. It’s hard to get the perfect balance, but see if you can find a good enough balance that results in useful answers most of the time.

Clean up

You can, of course, leave QnABot running to experiment with it and show it to your colleagues! But it does incur some cost—see Plan your deployment – Cost for more details. To remove the resources and avoid costs, delete the following CloudFormation stacks:

QnABot stack
LLM Plugin stack (if applicable)
Amazon Kendra index stack

Use case examples

These new features make QnABot relevant for many customer use cases such as self-service customer service and support bots and automated web-based Q&A bots. We discuss two such use cases in this section.

Integrate with a contact center

QnABot’s automated question answering capabilities deliver effective self-service for inbound voice calls in contact centers, with compelling outcomes. For example, see how Kentucky Transportation Cabinet reduced call hold time and improved customer experience with self-service virtual agents using Amazon Connect and Amazon Lex. Integrating the new generative AI features strengthens this value proposition further by dynamically generating reliable answers from existing content such as documents, knowledge bases, and websites. This eliminates the need for bot designers to anticipate and manually curate responses to every possible question that a user might ask. To integrate QnABot with Amazon Connect, see Connecting QnABot on AWS to an Amazon Connect call center. To integrate with other contact centers, see how Amazon Chime SDK can be used to connect Amazon Lex voice bots with third-party contact centers via SIPREC and Build an AI-powered virtual agent for Genesys Cloud using QnABot and Amazon Lex.

The LLM-powered QnABot can also play a pivotal role as an automated real-time agent assistant. In this solution, QnABot passively listens to the conversation and uses the LLM to generate real-time suggestions for the human agents based on certain cues. It’s straightforward to set up and try—give it a go! This solution can be utilized with both Amazon Connect and other on-prem and cloud contact centers. For more information, see Live call analytics and agent assist for your contact center with Amazon language AI services.

Integrate with a website

Embedding QnABot in your websites and applications allows users to get automated assistance with natural dialogue. For more information, see Deploy a Web UI for your Chatbot. For curated Q&A content, use markdown syntax and UI buttons and incorporate links, images, videos, and other dynamic elements that inform and delight your users. Integrate the QnABot Amazon Lex web UI with Amazon Connect live chat to facilitate quick escalation to human agents when the automated assistant cannot fully address a user’s inquiry on its own.

The QnABot on AWS plugin samples repository

As shown in this post, QnABot v5.4.0+ not only offers built-in support for embeddings and LLM models hosted on SageMaker, but it also offers the ability to easily integrate with any other LLM by using Lambda functions. You can author your own custom Lambda functions or get started faster with one of the samples we have provided in our new qnabot-on-aws-plugin-samples repository.
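To give a feel for what a custom LLM plugin might look like, here is a minimal Lambda handler sketch. The event fields (`prompt`, `parameters`) and the `generated_text` response key are assumptions about the plugin contract that you should verify against the repository README, and `call_model` is a hypothetical stub standing in for a real model API call.

```python
import json
from typing import Any, Dict


def call_model(prompt: str, parameters: Dict[str, Any]) -> str:
    """Hypothetical stand-in for a real model call.

    It ignores the parameters and echoes the prompt so the sketch runs offline.
    """
    return f"(stub answer) {prompt}"


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Assumed contract: receive a prompt, return the generated text."""
    prompt = event["prompt"]
    parameters = event.get("parameters") or {}
    return {"generated_text": call_model(prompt, parameters)}


# Local smoke test with a hand-built event.
result = lambda_handler({"prompt": "What is Amazon Lex?", "parameters": {}}, None)
print(json.dumps(result))
```

Keeping the model call behind a single function makes it easy to swap in a different provider without touching the handler.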

This repository includes a ready-to-deploy plugin for Amazon Bedrock, which supports both embeddings and text generation requests. At the time of writing, Amazon Bedrock is available through private preview—you can request preview access. When Amazon Bedrock is generally available, we expect to integrate it directly with QnABot, but why wait? Apply for preview access and use our sample plugin to start experimenting!

Today’s LLM innovation cycle is driving a breakneck pace of new model releases, each aiming to surpass the last. This repository will expand to include additional QnABot plugin samples over time. As of this writing, we have support for two third-party model providers: Anthropic and AI21. We plan to add integrations for more LLMs, embeddings, and potentially common use case examples involving Lambda hooks and knowledge bases. These plugins are offered as-is without warranty, for your convenience—users are responsible for supporting and maintaining them once deployed.

We hope that the QnABot plugins repository will mature into a thriving open-source community project. Watch the qnabot-on-aws-plugin-samples GitHub repo to receive updates on new plugins and features, use the Issues forum to report problems or provide feedback, and contribute improvements via pull requests. Contributions are welcome!

Conclusion

In this post, we introduced the new generative AI features for QnABot and walked through a solution to create, deploy, and customize QnABot to use these features. We also discussed some relevant use cases. Automating repetitive inquiries frees up human workers and boosts productivity. Rich responses create engaging experiences. Deploying the LLM-powered QnABot can help you elevate the self-service experience for customers and employees.

Don’t miss this opportunity—get started today and revolutionize the user experience on your QnABot deployment!

About the authors

Clevester Teo is a Senior Partner Solutions Architect at AWS, focused on the Public Sector partner ecosystem. He enjoys building prototypes, staying active outdoors, and experiencing new cuisines. Clevester is passionate about experimenting with emerging technologies and helping AWS partners innovate and better serve public sector customers.

Windrich is a Solutions Architect at AWS who works with customers in industries such as finance and transport to help accelerate their cloud adoption journey. He is especially interested in serverless technologies and how customers can leverage them to bring value to their business. Outside of work, Windrich enjoys playing and watching sports, as well as exploring different cuisines around the world.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Copyright © 2019-2025 Vedere AI. All Rights Reserved.