Abstracts: October 23, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Today, I’m talking to Dr. Andy Gordon, a Partner Research Manager, and Dr. Carina Negreanu, a Senior Researcher, both at Microsoft Research. Doctors Gordon and Negreanu are co-editors of a paper called “Co-audit: Tools to help humans double-check AI-generated content,” and you can read a preprint of this paper now on arXiv. Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts!


ANDY GORDON: Great to be here.

CARINA NEGREANU: Likewise.

HUIZINGA: Let’s start with you, Andy. In a few sentences, describe the issue or problem your paper addresses and why people should care about it.

GORDON: Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models. Totally amazing. But it’s really important for everyone to remember that these AIs can make mistakes. For example, you ask when your favorite actor got married, and the model says the year but gets it wrong. Or you ask for some Python code, and it works on positive numbers, but occasionally you give it negative numbers and it goes wrong. Another example, you get a summary of some text. It’s great but unfortunately misses one of the important points. Or thinking about images, you ask for a portrait of a character from the AI and there’s some glitch, and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes. And we refer to that as “audit” in a sense of a systematic review. Coming to the paper, it’s about what we call co-audit. And that’s our term for any tool support that helps the human audit the AI output. And some examples of co-audit are tools that can help check for hallucinations, like when the actor’s date of birth is wrong, or to check Python code to find some errors or show how a summary has been constructed to help people find errors.

HUIZINGA: Carina, let’s talk to you. What related research does this paper build on, and how does your work add to it?

NEGREANU: So there was no direct work on the co-audit brand before us. We’re just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it or even like early examples of what we start thinking of co-audit. So as you’re probably aware, there has been a really great effort in the last years to assess the quality of generations by large language models across a multitude, really, of tasks. And currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work. And we hope that in the future, we can also use it to benchmark co-audit tools that we are going to produce in our wider community. But the idea of dealing with errors has been a key part of research on human-AI interaction for ages. And there have been some really cool guidelines that came out recently, especially from Amershi in 2019, on human-AI interactions that are concerned with this part of the world. And more recently, Glassman had a really cool paper about conversational frameworks for human-AI and communication and basically links these concepts to psychology. And in our work, as you can read in our paper, we are trying to basically frame co-audit within her framework, and we find that it’s a natural fit. But before we started defining formally co-audit and building this paper, our group has built co-audit tools in the code-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands they’re asking and also get good feedback back. We also built ColDeco, which is a spreadsheet tool for inspecting and verifying calculated columns without the user requiring to view the underlying code produced by the large language models. 
But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even tools that are like early debugging tools like FxD are very important here as we learn how people use these kinds of tools and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-auditing tools for LLM-generated content.

HUIZINGA: Well, Andy, how would you describe the research approach you used or your methodology for this paper, and how did it come about?

GORDON: Great question, Gretchen, and it was actually quite an unusual methodology for us. So as Carina says, we’ve been looking at co-audit in a very specific setting of spreadsheet computations, and we began to realize that co-audit was really important for any kind of AI-generated output, and we started to see other people doing research that was doing the same sort of thing we were doing but in different settings. So, for example, there was a paper, they were generating bits of Python and they were deliberately showing multiple pieces of code after they’d been generated to kind of nudge the human user to make a decision about which one was better. I mean that’s, it’s really important to get people to think about the outputs, and this was a nice trick. So we thought, look, this is actually quite an important problem, and MSR (Microsoft Research) should step up and sort of gather people. So we organized a workshop inside Microsoft in the spring and got folks together to share their perspectives on co-audit. And then since then, we’ve reflected on those discussions and tried to kind of pull them together in a more coherent sense than the sort of whiteboards and sticky notes that we produced back then. And so that’s produced this paper. I think one of the key things that we learned in that process that we hadn’t been thinking about before was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it’s the first part of what we call the prompt-response-audit loop. And this is related to what Carina was saying about Elena Glassman’s work about AI-human interaction. So the first step is you formulate a prompt. For example, you ask for Python code. That’s the first step. The second step is we wait for the response from the AI. And then the third step is that we need to inspect the response—that’s the audit part—decide if it meets our needs or if there is a mistake, and if that’s the case, we need to repeat again. 
So that’s this loop, the prompt-response-audit loop. And prompt engineering, they’re the tools and techniques that you use in that first step to create the prompt. So, for example, some tools will automatically include a data context in a prompt if you’re trying to create some Python to apply to a table in a spreadsheet or, or something like that. And then duly, co-audit, those are the tools and techniques we have to help the human audit the response in the third step of this loop. And that’s like these tools I’ve been mentioning that show maybe two or three candidates of code that’s to be used.

HUIZINGA: Carina, let’s move over to what kinds of things you came away with. Your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper.

NEGREANU: So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways a bit daunting, as it means you have to think of many things. And one thing that really came to light was that even though we can’t, you know, build something that fits everything, we can build a set of principles that we think are important. So really, we wrote our paper around those 10 principles that we have identified from the workshop and then are trying to promote them as things people should think about when they start going on the journey of building co-auditing tools. So one of the examples is that we really think that we should think about grounding outputs, so, for example, by citing reliable sources similar to what Bing Chat does today. We think that’s a really valuable, important principle that people should follow, and they should think about what that means in the concept of their co-auditing tool. In the case of Bing, it’s quite simple, as it’s like factual references, but maybe if it becomes referencing code, that becomes more tricky but still super interesting going forward. We also propose that co-auditing tools should have the capability to prioritize the user’s attention to the most likely errors, as we need to be mindful of the user’s cognitive efforts and have a positive cost benefit. Basically, if we overflood the users with different errors and flags, it might be too problematic, and the adoption might be quite difficult going forward. And finally, this is something that really comes to core to our research area in spreadsheets. It’s about thinking beyond text. So we know visuals are so important in how we explain things, in how we teach in schools, how we teach universities. So how do we include them in the co-auditing process going forward? 
I think that’s going to be a really interesting challenge, and we hope we’re going to see some interesting work in that space.

HUIZINGA: Yeah. Well, principles are one thing, Andy, but how does this paper contribute to real-world impact? We talked about that a bit at the beginning. Who benefits most from this tool?

GORDON: That is a great question, Gretchen, and actually that was a question that we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters when correctness really matters and when mistakes have bad consequences, so in terms of application area, that’s areas like maybe finance or technology development or medicine. But you asked particularly about who, and we think some people will benefit more from co-audit than others. And we found this really striking example, I guess it’s an anecdotal example that someone was posting on social media. A professor was teaching a class using generative AI tools for the first time to generate code, and he found some evidence that people who have low self-confidence with computers can be intimidated by generative AI. So he would find that some of the class were really confident users and they would ask it, you know, generate some Python to do such and such, and it would come back with code with, you know, a bunch of mistakes in it. And the confident users were happy just to swat that away; they were even quite a little arrogant about it, like this is a stupid computer, they were saying. But, Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated by this because it was very confidently just saying, oh look, all this code is going to work. And they kind of got a bit stuck, and some of them were scrolling around through this code, trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad that these able students who were just less confident were being intimidated and were making less good use of the, the generative AI. Now that is an example that’s an anecdote from social media from a reputable professor, but we looked into it and there’s peer-reviewed studies that show similar effect in the literature. 
So I’d say we need co-audit tools that will encourage these less confident users to question when the AI is mistaken rather than getting stuck, and I think otherwise they’re not going to see the benefits of the generative AI.

HUIZINGA: Well, Carina, sometimes I like to boil things down to a nugget or a beautiful takeaway. So if there’s one thing you want our listeners to take away from this work, this paper, what would it be?

NEGREANU: I think that what this study has taught us is that really we need significantly more research. So basically, a good co-auditing experience can really be the element that makes it or breaks it in how we incorporate LLMs safely into our day-to-day lives. But to make this happen, we need people from the field working towards the same goal. It’s really an interdisciplinary work, and I don’t think we can do it by isolating into groups as we’re currently researching now. So I would urge our listeners to think about how they could contribute in this space and reach out with feedback and questions to us. We are more than open to collaboration. Really, we are just starting this journey, and we’d love to see this area to become a research priority going forward in 2024.

HUIZINGA: Well, Andy, as an opportunity to give some specificity to Carina’s call for help, what potential pitfalls have you already identified that represent ongoing research challenges in this field? And what’s next on yours—and potentially others’—research agenda in this field?

GORDON: Well, one point, and I think Carina made this, that co-audit techniques will themselves never be perfect. I mean, we’re saying that language models are never going to be perfect. Mistakes will come through. But the co-audit techniques themselves won’t be perfect either. So sometimes a user who is using the tools will still miss some mistakes. So, for example, you know, at the workshop, we thought about security questions and co-audit tools themselves. And we were thinking, for instance, about maybe deliberate attacks on a generative AI. There’s various techniques that people are talking about at the moment where you might sort of poison the inputs that generative AI models pick up on. And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker. So that’s good. But on the other hand, you know, security always becomes an arms race. And so once, you know, if we did have a good tool that could detect those kinds of mistakes, the attackers then will start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. And on the other hand, you know, we’ll find that if co-audit tools are giving too many warnings, users will start to ignore them, and there’ll be a sort of under-reliance on co-audit tools. And of course, if we give too few, users will miss the mistakes. So an interesting balance needs to be struck. And also, we don’t expect there’s going to be one overarching co-audit experience, but we think there’ll be many different realizations. And so, as Carina says, we hope that common lessons can be learned, and that’s why we want to keep documenting this space in general and building a research community. So I echo what Carina was saying. If you’re listening and you think that what you’re working on is co-audit, do reach out.

HUIZINGA: Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper and this research, you can find a link at aka.ms/abstracts, or you can read the preprint on arXiv. See you next time on Abstracts!

The post Abstracts: October 23, 2023 appeared first on Microsoft Research.

Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker

Customers of every size and industry are innovating on AWS by infusing machine learning (ML) into their products and services. Recent developments in generative AI models have further accelerated the need for ML adoption across industries. However, implementing security, data privacy, and governance controls remains a key challenge for customers running ML workloads at scale. Addressing this challenge lays the foundation for mitigating risk and for the responsible use of ML-driven products. Although generative AI may need additional controls in place, such as removing toxicity and preventing jailbreaking and hallucinations, it shares the same foundational components for security and governance as traditional ML.

We hear from customers that they require specialized knowledge and investment of up to 12 months for building out their customized Amazon SageMaker ML platform implementation to ensure scalable, reliable, secure, and governed ML environments for their lines of business (LOBs) or ML teams. If you lack a framework for governing the ML lifecycle at scale, you may run into challenges such as team-level resource isolation, scaling experimentation resources, operationalizing ML workflows, scaling model governance, and managing security and compliance of ML workloads.

Governing the ML lifecycle at scale is a framework to help you build an ML platform with embedded security and governance controls based on industry best practices and enterprise standards. This framework addresses these challenges by providing prescriptive guidance through a modular approach that extends an AWS Control Tower multi-account AWS environment and the approach discussed in the post Setting up secure, well-governed machine learning environments on AWS.

It provides prescriptive guidance for the following ML platform functions:

  • Multi-account, security, and networking foundations – This function uses AWS Control Tower and well-architected principles for setting up and operating a multi-account environment, security, and networking services.
  • Data and governance foundations – This function uses a data mesh architecture for setting up and operating the data lake, central feature store, and data governance foundations to enable fine-grained data access.
  • ML platform shared and governance services – This function enables setting up and operating common services such as CI/CD, AWS Service Catalog for provisioning environments, and a central model registry for model promotion and lineage.
  • ML team environments – This function enables setting up and operating environments in which ML teams develop, test, and deploy models for their use cases with embedded security and governance controls.
  • ML platform observability – This function helps with troubleshooting and identifying the root cause of problems in ML models by centralizing logs and providing tools for log analysis and visualization. It also provides guidance for generating cost and usage reports for ML use cases.

Although this framework can provide benefits to all customers, it’s most beneficial for large, mature, regulated, or global enterprise customers that want to scale their ML strategies in a controlled, compliant, and coordinated way across the organization. It helps enable ML adoption while mitigating risks. This framework is useful for the following customers:

  • Large enterprise customers that have many LOBs or departments interested in using ML. This framework allows different teams to build and deploy ML models independently while providing central governance.
  • Enterprise customers with a moderate to high maturity in ML. They have already deployed some initial ML models and are looking to scale their ML efforts. This framework can help accelerate ML adoption across the organization. These companies also recognize the need for governance to manage things like access control, data usage, model performance, and unfair bias.
  • Companies in regulated industries such as financial services, healthcare, chemistry, and the private sector. These companies need strong governance and auditability for any ML models used in their business processes. Adopting this framework can help facilitate compliance while still allowing for local model development.
  • Global organizations that need to balance centralized and local control. This framework’s federated approach allows the central platform engineering team to set some high-level policies and standards, but also gives LOB teams flexibility to adapt based on local needs.

In the first part of this series, we walk through the reference architecture for setting up the ML platform. In a later post, we will provide prescriptive guidance for how to implement the various modules in the reference architecture in your organization.

The capabilities of the ML platform are grouped into four categories, as shown in the following figure. These capabilities form the foundation of the reference architecture discussed later in this post:

  • Build ML foundations
  • Scale ML operations
  • Observable ML
  • Secure ML

Solution overview

The framework for governing the ML lifecycle at scale enables organizations to embed security and governance controls throughout the ML lifecycle, which in turn helps them reduce risk and accelerate infusing ML into their products and services. The framework helps optimize the setup and governance of secure, scalable, and reliable ML environments that can scale to support an increasing number of models and projects. The framework enables the following features:

  • Account and infrastructure provisioning with organization policy compliant infrastructure resources
  • Self-service deployment of data science environments and end-to-end ML operations (MLOps) templates for ML use cases
  • LOB-level or team-level isolation of resources for security and privacy compliance
  • Governed access to production-grade data for experimentation and production-ready workflows
  • Management and governance for code repositories, code pipelines, deployed models, and data features
  • A model registry and feature store (local and central components) for improving governance
  • Security and governance controls for the end-to-end model development and deployment process

In this section, we provide an overview of prescriptive guidance to help you build this ML platform on AWS with embedded security and governance controls.

The functional architecture associated with the ML platform is shown in the following diagram. The architecture maps the different capabilities of the ML platform to AWS accounts.

The functional architecture with different capabilities is implemented using a number of AWS services, including AWS Organizations, SageMaker, AWS DevOps services, and a data lake. The reference architecture for the ML platform with various AWS services is shown in the following diagram.

This framework considers multiple personas and services to govern the ML lifecycle at scale. We recommend the following steps to organize your teams and services:

  1. Using AWS Control Tower and automation tooling, your cloud administrator sets up the multi-account foundations such as Organizations and AWS IAM Identity Center (successor to AWS Single Sign-On) and security and governance services such as AWS Key Management Service (AWS KMS) and Service Catalog. In addition, the administrator sets up a variety of organizational units (OUs) and initial accounts to support your ML and analytics workflows.
  2. Data lake administrators set up your data lake and data catalog, and set up the central feature store working with the ML platform admin.
  3. The ML platform admin provisions ML shared services such as AWS CodeCommit, AWS CodePipeline, Amazon Elastic Container Registry (Amazon ECR), a central model registry, SageMaker Model Cards, SageMaker Model Dashboard, and Service Catalog products for ML teams.
  4. The ML team lead federates via IAM Identity Center, uses Service Catalog products, and provisions resources in the ML team’s development environment.
  5. Data scientists from ML teams across different business units federate into their team’s development environment to build the model pipeline.
  6. Data scientists search and pull features from the central feature store catalog, build models through experiments, and select the best model for promotion.
  7. Data scientists create and share new features into the central feature store catalog for reuse.
  8. An ML engineer deploys the model pipeline into the ML team test environment using a shared services CI/CD process.
  9. After stakeholder validation, the ML model is deployed to the team’s production environment.
  10. Security and governance controls are embedded into every layer of this architecture using services such as AWS Security Hub, Amazon GuardDuty, Amazon Macie, and more.
  11. Security controls are centrally managed from the security tooling account using Security Hub.
  12. ML platform governance capabilities such as SageMaker Model Cards and SageMaker Model Dashboard are centrally managed from the governance services account.
  13. Amazon CloudWatch and AWS CloudTrail logs from each member account are made accessible centrally from an observability account using AWS native services.

Next, we dive deep into the modules of the reference architecture for this framework.

Reference architecture modules

The reference architecture comprises eight modules, each designed to solve a specific set of problems. Collectively, these modules address governance across various dimensions, such as infrastructure, data, model, and cost. Each module offers a distinct set of functions and interoperates with other modules to provide an integrated end-to-end ML platform with embedded security and governance controls. In this section, we present a short summary of each module’s capabilities.

Multi-account foundations

This module helps cloud administrators build an AWS Control Tower landing zone as a foundational framework. This includes building a multi-account structure, authentication and authorization via IAM Identity Center, a network hub-and-spoke design, centralized logging services, and new AWS member accounts with standardized security and governance baselines.

In addition, this module gives best practice guidance on OU and account structures that are appropriate for supporting your ML and analytics workflows. Cloud administrators will understand the purpose of the required accounts and OUs, how to deploy them, and key security and compliance services they should use to centrally govern their ML and analytics workloads.

A framework for vending new accounts is also covered, which uses automation for baselining new accounts when they are provisioned. By having an automated account provisioning process set up, cloud administrators can provide ML and analytics teams the accounts they need to perform their work more quickly, without sacrificing on a strong foundation for governance.

Data lake foundations

This module helps data lake admins set up a data lake to ingest data, curate datasets, and use the AWS Lake Formation governance model for managing fine-grained data access across accounts and users using a centralized data catalog, data access policies, and tag-based access controls. You can start small with one account for your data platform foundations for a proof of concept or a few small workloads. For medium-to-large-scale production workload implementation, we recommend adopting a multi-account strategy. In such a setting, LOBs can assume the role of data producers and data consumers using different AWS accounts, and the data lake governance is operated from a central shared AWS account. The data producer collects, processes, and stores data from their data domain, in addition to monitoring and ensuring the quality of their data assets. Data consumers consume the data from the data producer after the centralized catalog shares it using Lake Formation. The centralized catalog stores and manages the shared data catalog for the data producer accounts.
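The producer, consumer, and central catalog roles described above can be pictured with a small Python sketch. It is a toy model only: the class names, account names, tags, and the "consumer must hold a grant for every tag on the table" rule are all assumptions made for illustration, and none of it reflects the Lake Formation API or its actual permission evaluation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Table:
    """A dataset published by a data producer account (illustrative names only)."""
    name: str
    producer_account: str
    tags: frozenset  # set of (key, value) pairs attached to the table


@dataclass
class CentralCatalog:
    """Hypothetical stand-in for the catalog held in the central governance account."""
    tables: dict = field(default_factory=dict)
    grants: dict = field(default_factory=dict)  # consumer account -> set of (key, value)

    def register(self, table: Table) -> None:
        # A producer shares a dataset through the central catalog.
        self.tables[table.name] = table

    def grant(self, consumer_account: str, key: str, value: str) -> None:
        # A data governance admin grants a consumer access to one tag value.
        self.grants.setdefault(consumer_account, set()).add((key, value))

    def can_read(self, consumer_account: str, table_name: str) -> bool:
        # Simplified rule (an assumption): every tag on the table must be granted.
        table = self.tables[table_name]
        return table.tags <= self.grants.get(consumer_account, set())


catalog = CentralCatalog()
catalog.register(
    Table(
        name="orders",
        producer_account="sales-producer",
        tags=frozenset({("domain", "sales"), ("sensitivity", "internal")}),
    )
)

catalog.grant("ml-team-a", "domain", "sales")
before = catalog.can_read("ml-team-a", "orders")  # only one of two tags granted

catalog.grant("ml-team-a", "sensitivity", "internal")
after = catalog.can_read("ml-team-a", "orders")   # both tags granted

print(before, after)
```

The point of the sketch is the separation of duties: producers only register data, consumers only read, and access decisions live in one central place.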

ML platform services

This module helps the ML platform engineering team set up shared services that are used by the data science teams on their team accounts. The services include a Service Catalog portfolio with products for SageMaker domain deployment, SageMaker domain user profile deployment, and data science model templates for model building and deployment. This module also covers a centralized model registry, model cards, model dashboard, and the CI/CD pipelines used to orchestrate and automate model development and deployment workflows.

In addition, this module details how to implement the controls and governance required to enable persona-based self-service capabilities, allowing data science teams to independently deploy their required cloud infrastructure and ML templates.

ML use case development

This module helps LOBs and data scientists access their team’s SageMaker domain in a development environment and instantiate a model building template to develop their models. In this module, data scientists work on a dev account instance of the template to interact with the data available on the centralized data lake, reuse and share features from a central feature store, create and run ML experiments, build and test their ML workflows, and register their models to a dev account model registry in their development environments.

Capabilities such as experiment tracking, model explainability reports, data and model bias monitoring, and model registry are also implemented in the templates, allowing for rapid adaptation of the solutions to the data scientists’ developed models.

ML operations

This module helps LOBs and ML engineers work on their dev instances of the model deployment template. After the candidate model is registered and approved, they set up CI/CD pipelines and run ML workflows in the team’s test environment, which registers the model into the central model registry running in a platform shared services account. When a model is approved in the central model registry, this triggers a CI/CD pipeline to deploy the model into the team’s production environment.
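The promotion path in this module can be read as a linear sequence of stages. The following Python sketch encodes that reading; the stage names and helper functions are hypothetical labels chosen for illustration, not SageMaker Model Registry statuses or API calls.

```python
from enum import Enum


class Stage(Enum):
    """Illustrative stage names; these are not SageMaker status values."""
    DEV_REGISTERED = "registered in the dev account model registry"
    DEV_APPROVED = "approved in dev"
    TEST_RUN = "CI/CD pipeline and ML workflow run in the test environment"
    CENTRAL_REGISTERED = "registered in the central model registry"
    CENTRAL_APPROVED = "approved in the central model registry"
    PROD_DEPLOYED = "deployed to the team's production environment"


# Allowed transitions, in the order the module description gives them.
NEXT_STAGE = {
    Stage.DEV_REGISTERED: Stage.DEV_APPROVED,
    Stage.DEV_APPROVED: Stage.TEST_RUN,
    Stage.TEST_RUN: Stage.CENTRAL_REGISTERED,
    Stage.CENTRAL_REGISTERED: Stage.CENTRAL_APPROVED,
    Stage.CENTRAL_APPROVED: Stage.PROD_DEPLOYED,
}


def promote(stage: Stage) -> Stage:
    """Return the next stage, or raise if the model is already in production."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"no further promotion from: {stage.value}")
    return NEXT_STAGE[stage]


def promotion_path(start: Stage) -> list:
    """List every stage a model passes through from `start` to production."""
    path = [start]
    while path[-1] in NEXT_STAGE:
        path.append(NEXT_STAGE[path[-1]])
    return path


for step in promotion_path(Stage.DEV_REGISTERED):
    print(step.value)
```

Keeping the transitions in a single table makes the two approval gates (dev and central registry) explicit, which is where governance controls would attach.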

Centralized feature store

After the first models are deployed to production and multiple use cases start to share features created from the same data, a feature store becomes essential to ensure collaboration across use cases and reduce duplicate work. This module helps the ML platform engineering team set up a centralized feature store to provide storage and governance for ML features created by the ML use cases, enabling feature reuse across projects.

Logging and observability

This module helps LOBs and ML practitioners gain visibility into the state of ML workloads across ML environments through centralization of log activity such as CloudTrail, CloudWatch, VPC flow logs, and ML workload logs. Teams can filter, query, and visualize logs for analysis, which can help enhance security posture as well.

Cost and reporting

This module helps various stakeholders (cloud admin, platform admin, cloud business office) to generate reports and dashboards to break down costs at ML user, ML team, and ML product levels, and track usage such as number of users, instance types, and endpoints.
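As a rough illustration of this kind of breakdown, the following Python sketch aggregates a handful of made-up cost records at the user, team, and product levels. The records, field names, and figures are invented for the example; in practice the input would come from whatever cost and usage reporting the platform exposes.

```python
from collections import Counter, defaultdict

# Hypothetical cost records, one per billed resource (all values are made up).
RECORDS = [
    {"user": "alice", "team": "fraud", "product": "scoring",
     "instance_type": "ml.m5.xlarge", "cost": 12.5},
    {"user": "bob", "team": "fraud", "product": "scoring",
     "instance_type": "ml.m5.xlarge", "cost": 7.5},
    {"user": "carol", "team": "churn", "product": "retention",
     "instance_type": "ml.g4dn.xlarge", "cost": 20.0},
    {"user": "alice", "team": "churn", "product": "retention",
     "instance_type": "ml.m5.xlarge", "cost": 4.0},
]


def cost_by(records, level):
    """Sum cost per distinct value of `level` ('user', 'team', or 'product')."""
    totals = defaultdict(float)
    for record in records:
        totals[record[level]] += record["cost"]
    return dict(totals)


def usage_summary(records):
    """Count distinct users and how often each instance type appears."""
    return {
        "users": len({record["user"] for record in records}),
        "instance_types": dict(Counter(r["instance_type"] for r in records)),
    }


print(cost_by(RECORDS, "team"))
print(cost_by(RECORDS, "user"))
print(usage_summary(RECORDS))
```

The same grouping function serves all three reporting levels, which only works if every resource is consistently labeled with user, team, and product.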

Customers have asked us to provide guidance on how many accounts to create and how to structure those accounts. In the next section, we provide guidance on that account structure as a reference that you can modify to suit your needs according to your enterprise governance requirements.

Reference account structure

In this section, we discuss our recommendation for organizing your account structure. We share a baseline reference account structure; however, we recommend ML and data admins work closely with their cloud admin to customize this account structure based on their organization controls.

We recommend organizing accounts by OU for security, infrastructure, workloads, and deployments. Furthermore, within each OU, organize by non-production and production OU because the accounts and workloads deployed under them have different controls. Next, we briefly discuss those OUs.

Security OU

The accounts in this OU are managed by the organization’s cloud admin or security team for monitoring, identifying, protecting, detecting, and responding to security events.

Infrastructure OU

The accounts in this OU are managed by the organization’s cloud admin or network team for managing enterprise-level infrastructure shared resources and networks.

We recommend having the following accounts under the infrastructure OU:

  • Network – Set up a centralized networking infrastructure such as AWS Transit Gateway
  • Shared services – Set up centralized AD services and VPC endpoints

Workloads OU

The accounts in this OU are managed by the organization’s platform team admins. If you need different controls implemented for each platform team, you can nest other levels of OU for that purpose, such as an ML workloads OU, data workloads OU, and so on.

We recommend the following accounts under the workloads OU:

  • Team-level ML dev, test, and prod accounts – Set this up based on your workload isolation requirements
  • Data lake accounts – Partition accounts by your data domain
  • Central data governance account – Centralize your data access policies
  • Central feature store account – Centralize features for sharing across teams

Deployments OU

The accounts in this OU are managed by the organization’s platform team admins for deploying workloads and observability.

We recommend the following accounts under the deployments OU because the ML platform team can set up different sets of controls at this OU level to manage and govern deployments:

  • ML shared services accounts for test and prod – Hosts platform shared services CI/CD and model registry
  • ML observability accounts for test and prod – Hosts CloudWatch logs, CloudTrail logs, and other logs as needed

Next, we briefly discuss organization controls that need to be considered for embedding into member accounts for monitoring the infrastructure resources.

AWS environment controls

A control is a high-level rule that provides ongoing governance for your overall AWS environment. It’s expressed in plain language. In this framework, we use AWS Control Tower to implement the following controls that help you govern your resources and monitor compliance across groups of AWS accounts:

  • Preventive controls – A preventive control ensures that your accounts maintain compliance because it disallows actions that lead to policy violations. It is implemented using a service control policy (SCP). For example, you can set a preventive control that ensures that CloudTrail is not deleted or stopped in AWS accounts or Regions.
  • Detective controls – A detective control detects noncompliance of resources within your accounts, such as policy violations, and provides alerts through the dashboard. It is implemented using AWS Config rules. For example, you can create a detective control that detects whether public read access is enabled for the Amazon Simple Storage Service (Amazon S3) buckets in the log archive shared account.
  • Proactive controls – A proactive control scans your resources before they are provisioned and makes sure that they are compliant with that control. It is implemented using AWS CloudFormation hooks. Resources that aren’t compliant will not be provisioned. For example, you can set a proactive control that checks that direct internet access is not allowed for a SageMaker notebook instance.
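To make the preventive example more concrete, the following sketch builds the kind of SCP document that denies deleting or stopping a CloudTrail trail. It is a hand-written illustration, not the policy text that AWS Control Tower deploys; the function name and the `Sid` value are hypothetical.

```python
import json


def cloudtrail_protection_scp() -> dict:
    """Return an illustrative SCP that blocks tampering with CloudTrail."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",  # hypothetical identifier
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:StopLogging",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, the format in which SCPs are attached.
    print(json.dumps(cloudtrail_protection_scp(), indent=2))
```

Because an SCP sets the maximum permissions available in the accounts it is attached to, a deny statement like this one applies even to principals whose IAM policies would otherwise allow the action.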

Interactions between ML platform services, ML use cases, and ML operations

Different personas, such as the head of data science (lead data scientist), data scientist, and ML engineer, operate modules 2–6 as shown in the following diagram for different stages of ML platform services, ML use case development, and ML operations along with data lake foundations and the central feature store.

The following table summarizes the ops flow activity and setup flow steps for different personas. Once a persona initiates an ML activity as part of the ops flow, the services run as described in the setup flow steps.

Persona | Ops Flow Activity – Number | Ops Flow Activity – Description | Setup Flow Step – Number | Setup Flow Step – Description
Lead Data Scientist or ML Team Lead | 1 | Uses Service Catalog in the ML platform services account and deploys ML infrastructure, SageMaker projects, and the SageMaker model registry | 1-A | Sets up the dev, test, and prod environments for LOBs; sets up SageMaker Studio in the ML platform services account
Lead Data Scientist or ML Team Lead | 1 | | 1-B | Sets up SageMaker Studio with the required configuration
Data Scientist | 2 | Conducts and tracks ML experiments in SageMaker notebooks | 2-A | Uses data from Lake Formation; saves features in the central feature store
Data Scientist | 3 | Automates successful ML experiments with SageMaker projects and pipelines | 3-A | Initiates SageMaker pipelines (preprocess, train, evaluate) in the dev account; initiates the build CI/CD process with CodePipeline in the dev account
Data Scientist | 3 | | 3-B | After the SageMaker pipelines run, saves the model in the local (dev) model registry
Lead Data Scientist or ML Team Lead | 4 | Approves the model in the local (dev) model registry | 4-A | Writes the model metadata and model package from the local (dev) model registry to the central model registry
Lead Data Scientist or ML Team Lead | 5 | Approves the model in the central model registry | 5-A | Initiates the deployment CI/CD process to create SageMaker endpoints in the test environment
Lead Data Scientist or ML Team Lead | 5 | | 5-B | Writes the model information and metadata to the ML governance module (model card, model dashboard) in the ML platform services account from the local (dev) account
ML Engineer | 6 | Tests and monitors the SageMaker endpoint in the test environment after CI/CD | |
ML Engineer | 7 | Approves deployment for SageMaker endpoints in the prod environment | 7-A | Initiates the deployment CI/CD process to create SageMaker endpoints in the prod environment
ML Engineer | 8 | Tests and monitors the SageMaker endpoint in the prod environment after CI/CD | |

Personas and interactions with different modules of the ML platform

Each module caters to particular target personas within specific divisions that utilize the module most often, granting them primary access. Secondary access is then permitted to other divisions that require occasional use of the modules. The modules are tailored towards the needs of particular job roles or personas to optimize functionality.

We discuss the following teams:

  • Central cloud engineering – This team operates at the enterprise cloud level across all workloads for setting up common cloud infrastructure services, such as setting up enterprise-level networking, identity, permissions, and account management
  • Data platform engineering – This team manages enterprise data lakes, data collection, data curation, and data governance
  • ML platform engineering – This team operates at the ML platform level across LOBs to provide shared ML infrastructure services such as ML infrastructure provisioning, experiment tracking, model governance, deployment, and observability

The following table details which divisions have primary and secondary access for each module according to the module’s target personas.

Module Number | Module | Primary Access | Secondary Access | Target Personas | Number of Accounts
1 | Multi-account foundations | Central cloud engineering | Individual LOBs | Cloud admin; cloud engineers | Few
2 | Data lake foundations | Central cloud or data platform engineering | Individual LOBs | Data lake admin; data engineers | Multiple
3 | ML platform services | Central cloud or ML platform engineering | Individual LOBs | ML platform admin; ML team lead; ML engineers; ML governance lead | One
4 | ML use case development | Individual LOBs | Central cloud or ML platform engineering | Data scientists; data engineers; ML team lead; ML engineers | Multiple
5 | ML operations | Central cloud or ML engineering | Individual LOBs | ML engineers; ML team leads; data scientists | Multiple
6 | Centralized feature store | Central cloud or data engineering | Individual LOBs | Data engineer; data scientists | One
7 | Logging and observability | Central cloud engineering | Individual LOBs | Cloud admin; IT auditors | One
8 | Cost and reporting | Individual LOBs | Central platform engineering | LOB executives; ML managers | One

Conclusion

In this post, we introduced a framework for governing the ML lifecycle at scale that helps you implement well-architected ML workloads embedding security and governance controls. We discussed how this framework takes a holistic approach for building an ML platform considering data governance, model governance, and enterprise-level controls. We encourage you to experiment with the framework and concepts introduced in this post and share your feedback.


About the authors

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over three decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double master’s degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Maira Ladeira Tanke is a Senior Data Specialist at AWS. As a technical lead, she helps customers accelerate their achievement of business value through emerging technology and innovative solutions. Maira has been with AWS since January 2020. Prior to that, she worked as a data scientist in multiple industries focusing on achieving business value from data. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Ryan Lempka is a Senior Solutions Architect at Amazon Web Services, where he helps his customers work backwards from business objectives to develop solutions on AWS. He has deep experience in business strategy, IT systems management, and data science. Ryan is dedicated to being a lifelong learner, and enjoys challenging himself every day to learn something new.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

How Meesho built a generalized feed ranker using Amazon SageMaker inference

This is a guest post co-written by Rama Badrinath, Divay Jindal and Utkarsh Agrawal at Meesho.


Meesho is India’s fastest growing ecommerce company with a mission to democratize internet commerce for everyone and make it accessible to the next billion users of India. Meesho was founded in 2015 and today focuses on buyers and sellers across India. The Meesho marketplace provides micro, small, and medium businesses and individual entrepreneurs access to millions of customers, a selection from over 30 categories and more than 900 sub-categories, pan-India logistics, payment services, and customer support capabilities to efficiently run their businesses on the Meesho ecosystem.

As an ecommerce platform, Meesho aims to improve the user experience by offering personalized and relevant product recommendations. We wanted to create a generalized feed ranker that considers individual preferences and historical behavior to effectively display products in each user’s feed. Through this, we wanted to boost user engagement, conversion rates, and overall business growth by tailoring the shopping experience to each customer’s unique requirements and providing the best value for their money.

We used AWS machine learning (ML) services like Amazon SageMaker to develop a powerful generalized feed ranker (GFR). In this post, we discuss the key components of the GFR and how this ML-driven solution streamlined the ML lifecycle, ensuring efficient infra management, scalability, and reliability within the ecosystem.

Solution overview

To personalize users’ feeds, we analyzed extensive historical data, extracting insights into features that include browsing patterns and interests. These valuable features are used to construct ranking models. The GFR personalizes each user’s feed in real time, considering various factors like geography, prior shopping pattern, acquisition channels, and more. Several interaction-based features are also used to capture the affinity of the user towards an item, item category, or item properties like price, rating, or discount.

Several user-agnostic features and scores at item level are used as well. These include an item popularity score and item propensity to buy score. All these features go as input to the Learning to Rank (LTR) model that tries to emit the Probability of Click (PCTR) and Probability of Purchase (PCVR).

For diverse and relevant recommendations, the GFR sources candidate products from multiple channels, including exploit (known user preferences), explore (novel and potentially interesting products), popularity (trending items), and recent (latest additions).
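One simple way to combine candidates from several channels is a round-robin merge with de-duplication, sketched below. This is only an illustration of the idea. The post does not describe how the GFR actually blends its channels, and the function name, item IDs, and `limit` parameter are hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List


def merge_candidates(channels: Dict[str, List[str]], limit: int) -> List[str]:
    """Interleave candidates channel by channel, dropping duplicates."""
    if limit <= 0:
        return []
    merged: List[str] = []
    seen = set()
    # Take the 1st item of every channel, then the 2nd of every channel, etc.
    for tier in zip_longest(*channels.values()):
        for item in tier:
            if item is None or item in seen:
                continue  # channel exhausted, or item already sourced elsewhere
            seen.add(item)
            merged.append(item)
            if len(merged) >= limit:
                return merged
    return merged


if __name__ == "__main__":
    channels = {
        "exploit": ["a", "b", "c"],
        "explore": ["d", "a"],
        "popularity": ["b", "e"],
        "recent": ["f"],
    }
    print(merge_candidates(channels, limit=5))  # ['a', 'd', 'b', 'f', 'e']
```

Interleaving keeps any single channel from dominating the candidate set before the ranking model scores it; a weighted scheme would be an equally plausible choice.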

The following diagram illustrates the GFR architecture.

The architecture can be divided into two different components: model training and model deployment. In the following sections, we discuss each component and the AWS services used in more detail.

Model training

Meesho used Amazon EMR with Apache Spark to process hundreds of millions of data points, depending on the model’s complexity. One of the major challenges was to run distributed training at scale. We used Dask—a distributed data science computing framework that natively integrates with Python libraries—on Amazon EMR to scale out the training jobs across the cluster. The distributed training of the model helped cut down training time from days to hours and allowed us to schedule Spark jobs efficiently and cost-effectively. We used an offline feature store to maintain a historical record of all feature values that will be used for model training. Model artifacts from training are stored in Amazon Simple Storage Service (Amazon S3), providing convenient access and version management.

We used a time sampling strategy to create training, validation, and test datasets for model training. We kept track of various metrics to evaluate the performance of the model—the most important ones being area under the ROC curve and area under the precision recall curve. We also tracked calibration of the model to prevent overconfidence and underconfidence issues while predicting the probability scores.
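The sketch below illustrates two of the ideas in this paragraph: a time-based split and a calibration check. The row layout, the cut-off dates, and the use of expected calibration error (ECE) as the calibration measure are assumptions for this example; the post does not say which calibration metric was tracked.

```python
from datetime import date
from typing import List, Sequence, Tuple

# Hypothetical row layout: (event_date, predicted_probability, label)
Row = Tuple[date, float, int]


def time_split(rows: Sequence[Row], valid_start: date, test_start: date):
    """Split rows chronologically so later data never leaks into training."""
    train = [r for r in rows if r[0] < valid_start]
    valid = [r for r in rows if valid_start <= r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, valid, test


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between predicted and observed positive rates."""
    if not probs:
        return 0.0
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)   # average confidence in the bin
        rate = sum(y for _, y in b) / len(b)     # observed positive rate
        ece += len(b) / len(probs) * abs(mean_p - rate)
    return ece
```

An ECE near zero means predicted probabilities match observed rates; an overconfident model shows bins where the mean prediction sits well above the observed rate.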

Model deployment

Meesho used SageMaker inference endpoints with auto scaling enabled for deploying the trained model. SageMaker offered ease of deployment with support for various ML frameworks, allowing models to be served with low latency. Although AWS offers standard inference images suitable for most use cases, we built a custom inference image that caters specifically to our needs and pushed it to Amazon Elastic Container Registry (Amazon ECR).

We built an in-house A/B testing platform that facilitated live monitoring of A/B metrics, enabling us to make data-driven decisions promptly. We also used the A/B testing feature of SageMaker to deploy multiple production variants on an endpoint. Through A/B experiments, we observed an approximate 3.5% enhancement in the platform’s conversion rate and an increase in app open frequency of the users, highlighting the effectiveness of this approach.

We kept track of various drifts such as feature drift and prior drift multiple times a day after model deployment to prevent the model performance from deteriorating.
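As an illustration of what such checks can look like, the sketch below computes a population stability index (PSI) for one feature and a simple shift in the positive-label rate. Both statistics are assumptions chosen for this example; the post does not name the drift measures Meesho used, and the bin edges are hypothetical.

```python
import bisect
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population stability index of `actual` against `expected`.

    `edges` are sorted bin boundaries; values are bucketed with bisect.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def prior_shift(train_labels: Sequence[int], live_labels: Sequence[int]) -> float:
    """Absolute change in the positive-label rate between two samples."""
    rate = lambda ys: sum(ys) / len(ys)
    return abs(rate(live_labels) - rate(train_labels))
```

A PSI of zero means the two samples fall into the bins in identical proportions, and larger values indicate a bigger shift. Any alerting threshold would need to be tuned for the feature in question.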

We used AWS Lambda to set up various automations and triggers that are required during model retraining, endpoint updates, and monitoring processes.

The recommendation workflow after model deployment works as follows (as noted in the solution architecture diagram):

  1. The input requests with user context and interaction features are received at the application layer from Meesho’s mobile and web app.
  2. The application layer fetches additional features like historical data of the user from the online feature store and appends these to the input requests.
  3. The appended features are sent to the real-time endpoints for generating recommendations.
  4. The model predictions are sent back to the application layer.
  5. The application layer uses these predictions to personalize the user feeds on the mobile or web application.
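The five steps above can be sketched as a single function in the application layer. The feature store lookup and the model endpoint are passed in as callables, so the example stays self-contained; all names here are hypothetical. In production the scoring callable would wrap a call to the real-time SageMaker endpoint.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


def personalize_feed(
    user_id: str,
    request_features: Features,
    candidates: Sequence[str],
    fetch_online_features: Callable[[str], Features],
    score_candidates: Callable[[Features, Sequence[str]], List[float]],
) -> List[str]:
    """Return candidate item IDs ordered for this user's feed."""
    # Step 2: append stored user features to the request context.
    features = {**fetch_online_features(user_id), **request_features}
    # Steps 3 and 4: send features to the model, get one score per candidate.
    scores = score_candidates(features, candidates)
    if len(scores) != len(candidates):
        raise ValueError("expected one score per candidate")
    # Step 5: order the feed by descending score (ties keep input order).
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked]
```

Injecting the two dependencies also makes the flow easy to test with stubs before wiring it to the online feature store and the endpoint.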

Conclusion

Meesho successfully implemented a generalized feed ranker using SageMaker, which resulted in highly personalized product recommendations for each customer based on their preferences and historical behavior. This approach significantly improved user engagement and led to higher conversion rates, contributing to the company’s overall business growth. As a result of using AWS services, our ML lifecycle runtime was reduced significantly, from months to just weeks, leading to increased efficiency and productivity for our team.

With this advanced feed ranker, Meesho continues to deliver tailored shopping experiences, adding more value to its customers and fulfilling its mission to democratize ecommerce for everyone.

The team is grateful for the continuous support and guidance from Ravindra Yadav, Director of Data Science at Meesho, and Debdoot Mukherjee, Head of AI at Meesho, who played a key role in enabling this success.

To learn more about SageMaker, refer to the Amazon SageMaker Developer Guide.


About the Authors

Utkarsh Agrawal is currently working as a Senior Data Scientist at Meesho. He previously worked with Fractal Analytics and Trell on various domains, including recommender systems, time series, NLP, and more. He holds a master’s degree in Mathematics and Computing from Indian Institute of Technology Kharagpur (IIT), India.

Rama Badrinath is currently working as a Principal Data Scientist at Meesho. He previously worked with Microsoft and ShareChat on various domains, including recommender systems, image AI, NLP, and more. He holds a master’s degree in Machine Learning from Indian Institute of Science (IISc), India. He has also published papers in renowned conferences such as KDD and ECIR.

Divay Jindal is currently working as a Lead Data Scientist at Meesho. He previously worked with Bookmyshow on various domains, including recommender systems and dynamic pricing.

Venugopal Pai is a Solutions Architect at AWS. He lives in Bengaluru, India, and helps digital-native customers scale and optimize their applications on AWS.

Answering billions of reporting queries each day with low latency

Google Ads infrastructure runs on an internal data warehouse called Napa. Billions of reporting queries, which power critical dashboards used by advertising clients to measure campaign performance, run on tables stored in Napa. These tables contain records of ads performance that are keyed using particular customers and the campaign identifiers with which they are associated. Keys are tokens that are used both to associate an ads record with a particular client and campaign (e.g., customer_id, campaign_id) and for efficient retrieval. A record contains dozens of keys, so clients use reporting queries to specify keys needed to filter the data to understand ads performance (e.g., by region, device and metrics such as clicks, etc.). What makes this problem challenging is that the data is skewed since queries require varying levels of effort to be answered and have stringent latency expectations. Specifically, some queries require the use of millions of records while others are answered with just a few.

To this end, in “Progressive Partitioning for Parallelized Query Execution in Napa”, presented at VLDB 2023, we describe how the Napa data warehouse determines the amount of machine resources needed to answer reporting queries while meeting strict latency targets. We introduce a new progressive query partitioning algorithm that can parallelize query execution in the presence of complex data skews to perform consistently well in a matter of a few milliseconds. Finally, we demonstrate how Napa allows Google Ads infrastructure to serve billions of queries every day.

Query processing challenges

When a client inputs a reporting query, the main challenge is to determine how to parallelize the query effectively. Napa’s parallelization technique breaks up the query into even sections that are equally distributed across available machines, which then process these in parallel to significantly reduce query latency. This is done by estimating the number of records associated with a specified key, and assigning more or less equal amounts of work to machines. However, this estimation is not perfect since reviewing all records would require the same effort as answering the query. A machine that processes significantly more than others would result in run-time skews and poor performance. Each machine also needs to have sufficient work since needless parallelism leads to underutilized infrastructure. Finally, parallelization has to be a per query decision that must be executed near-perfectly billions of times, or the query may miss the stringent latency requirements.
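To see why the estimates matter, consider a toy version of the assignment problem: given an estimated record count per key, spread the keys over machines so that loads are as even as possible. The sketch below uses a greedy largest-first heuristic. It is not Napa's method, and the key names and counts are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def assign_keys(estimated_records: Dict[str, int], machines: int) -> List[Tuple[int, List[str]]]:
    """Greedily give each key (largest first) to the least-loaded machine.

    Returns one (estimated_load, keys) pair per machine.
    """
    if machines <= 0:
        raise ValueError("need at least one machine")
    # Heap entries are (load, machine_id, keys); machine_id breaks ties.
    heap: List[Tuple[int, int, List[str]]] = [(0, m, []) for m in range(machines)]
    heapq.heapify(heap)
    by_size = sorted(estimated_records.items(), key=lambda kv: kv[1], reverse=True)
    for key, count in by_size:
        load, m, keys = heapq.heappop(heap)
        keys.append(key)
        heapq.heappush(heap, (load + count, m, keys))
    return [(load, keys) for load, _, keys in sorted(heap, key=lambda t: t[1])]


if __name__ == "__main__":
    # Even workload: both machines end up with 60 estimated records.
    print(assign_keys({"c1": 50, "c2": 30, "c3": 20, "c4": 10, "c5": 10}, 2))
    # Skewed workload: one hot key leaves the machines badly unbalanced.
    print(assign_keys({"big": 1000, "s1": 10, "s2": 10}, 2))
```

The second call shows the skew problem: when one key dominates, no assignment of whole keys can balance the load, so the work under that key has to be split further.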

The reporting query example below extracts the records denoted by keys (i.e., customer_id and campaign_id) and then computes an aggregate (i.e., SUM(cost)) from an advertiser table. In this example the number of records is too large to process on a single machine, so Napa needs to use a subsequent key (e.g., adgroup_id) to further break up the collection of records so that equal distribution of work is achieved. It is important to note that at petabyte scale, the size of the data statistics needed for parallelization may be several terabytes. This means that the problem is not just about collecting enormous amounts of metadata, but also how it is managed.

        SELECT customer_id, campaign_id, SUM(cost)
        FROM advertiser_table
        WHERE customer_id IN (1, 7, ..., x)
          AND campaign_id IN (10, 20, ..., y)
        GROUP BY customer_id, campaign_id;


The query effort is determined by the keys included in the query. Keys belonging to clients with larger campaigns may touch millions of records since the data volume directly correlates with the size of the ads campaign. This disparity of matching records based on keys reflects the skewness in data, which makes query processing a challenging problem.

An effective solution minimizes the amount of metadata needed, focuses effort primarily on the skewed part of the key space to partition data efficiently, and works well within the allotted time. For example, if the query latency is a few hundred milliseconds, partitioning should take no longer than tens of milliseconds. Finally, a parallelization process should determine when it’s reached the best possible partitioning that considers query latency expectations. To this end, we have developed a progressive partitioning algorithm that we describe later in this article.

Managing the data deluge

Tables in Napa are constantly updated, so we use log-structured merge forests (LSM) to organize the deluge of table updates. An LSM is a forest of sorted data that is temporally organized with a B-tree index to support efficient key lookup queries. B-trees store summary information of the sub-trees in a hierarchical manner. Each B-tree node records the number of entries present in each subtree, which aids in the parallelization of queries. LSM allows us to decouple the process of updating the tables from the mechanics of query serving in the sense that live queries go against a different version of the data, which is atomically updated once the next batch of ingest (called delta) has been fully prepared for querying.
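The decoupling of updates from query serving can be pictured with a small in-memory sketch: each delta is a sorted run, readers take a snapshot of the current set of deltas, and a new delta becomes visible only through a single reference swap. The class, the pair layout, and the choice to sum values across deltas are all assumptions for illustration and say nothing about Napa's actual storage format.

```python
import bisect
from typing import Dict, List, Tuple

Delta = List[Tuple[int, int]]  # sorted (key, value) pairs; hypothetical layout


class VersionedTable:
    """Toy table whose readers always see a fully prepared version."""

    def __init__(self) -> None:
        self._deltas: Tuple[Delta, ...] = ()  # currently published version

    def snapshot(self) -> Tuple[Delta, ...]:
        # A query grabs one reference and uses it for its whole lifetime.
        return self._deltas

    def publish(self, updates: Dict[int, int]) -> None:
        # Prepare the delta completely (sorting stands in for index building),
        delta = sorted(updates.items())
        # then expose it with one reference swap to a new immutable tuple.
        self._deltas = self._deltas + (delta,)


def lookup(snapshot: Tuple[Delta, ...], key: int) -> int:
    """Sum the values recorded for `key` across all deltas in the snapshot."""
    total = 0
    for delta in snapshot:
        i = bisect.bisect_left(delta, (key,))  # binary search in the sorted run
        if i < len(delta) and delta[i][0] == key:
            total += delta[i][1]
    return total
```

A query that took its snapshot before a `publish` call keeps reading the older version, while queries that start afterwards see the new delta as well.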

The partitioning problem

The data partitioning problem in our context is that we have a massively large table that is represented as an LSM tree. In the figure below, Delta 1 and 2 each have their own B-tree, and together represent 70 records. Napa breaks the records into two pieces, and assigns each piece to a different machine. The problem becomes a partitioning problem of a forest of trees and requires a tree-traversal algorithm that can quickly split the trees into two equal parts.

To avoid visiting all the nodes of the tree, we introduce the concept of “good enough” partitioning. As we begin cutting and partitioning the tree into two parts, we maintain an estimate of how bad our current answer would be if we terminated the partitioning process at that instant. This is the yardstick of how close we are to the answer and is represented below by a total error margin of 40 (at this point of execution, each of the two pieces is expected to be between 15 and 35 records in size, so the uncertainty adds up to 40). Each subsequent traversal step reduces the error estimate. The process continues until the desired error margin is reached, at which point the two pieces are guaranteed to be more or less equal.

Progressive partitioning algorithm

Progressive partitioning encapsulates the notion of “good enough” in that it makes a series of moves to reduce the error estimate. The input is a set of B-trees and the goal is to cut the trees into pieces of more or less equal size. The algorithm traverses one of the trees (“drill down” in the figure), which results in a reduction of the error estimate. The algorithm is guided by statistics that are stored with each node of the tree so that it makes an informed set of moves at each step. The challenge here is to decide how to direct effort in the best possible way so that the error bound shrinks in the fewest possible steps. Progressive partitioning is well suited to our use case since the longer the algorithm runs, the more equal the pieces become. It also means that if the algorithm is stopped at any point, one still gets good partitioning, where the quality corresponds to the time spent.
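The following is a much-simplified sketch of the idea for a single tree and a two-way split. Each node stores only a record count, and the algorithm drills into the one child that straddles the midpoint until the remaining uncertainty is within a tolerance. It is not the algorithm from the paper, which works on a forest of B-trees; the `Node` class, the tree shape, and the stopping rule are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    count: int                                   # records stored under this node
    children: List["Node"] = field(default_factory=list)


def inner(*children: Node) -> Node:
    """Build an internal node whose count summarizes its children."""
    return Node(count=sum(c.count for c in children), children=list(children))


def progressive_split(root: Node, tolerance: int, max_steps: int = 64) -> Tuple[int, int, int]:
    """Return (left_size, error_bound, steps) for a two-way split of `root`."""
    target = root.count / 2
    before = 0      # records known to lie entirely left of `node`
    node = root     # subtree that still contains the ideal cut point
    steps = 0
    # The cut lies somewhere inside `node`, so node.count bounds the uncertainty.
    while node.count > tolerance and node.children and steps < max_steps:
        for child in node.children:
            if before + child.count > target:
                node = child            # drill down into the straddling child
                break
            before += child.count       # child lies fully on the left side
        steps += 1
    lo, hi = before, before + node.count
    # Cut at whichever boundary of `node` is closer to the midpoint.
    left = lo if target - lo <= hi - target else hi
    return left, node.count, steps


if __name__ == "__main__":
    tree = inner(inner(Node(12), Node(8), Node(10)), inner(Node(9), Node(16), Node(15)))
    print(progressive_split(tree, tolerance=50))  # (30, 40, 1)
    print(progressive_split(tree, tolerance=10))  # (39, 9, 2)
```

On this made-up 70-record tree, a loose tolerance stops after one step with a 30/40 split and an error bound of 40, while a tighter tolerance takes one more step and returns a 39/31 split with a bound of 9. The answer is usable whenever the loop stops and improves with each additional step.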

Prior work in this space uses a sampled table to drive the partitioning process, while the Napa approach uses a B-tree. As mentioned earlier, even just a sample from a petabyte table can be massive. A tree-based partitioning method can achieve partitioning much more efficiently than a sample-based approach, which does not use a tree organization of the sampled records. We compare progressive partitioning with an alternative approach, where sampling of the table at various resolutions (e.g., 1 record sample every 250 MB and so on) aids the partitioning of the query. Experimental results show the relative speedup from progressive partitioning for queries requiring varying numbers of machines. These results demonstrate that progressive partitioning is much faster than existing approaches and the speedup increases as the size of the query increases.

Conclusion

Napa’s progressive partitioning algorithm efficiently optimizes database queries, enabling Google Ads to serve client reporting queries billions of times each day. We note that tree traversal is a common technique that students in introductory computer science courses use, yet it also serves a critical use-case at Google. We hope that this article will inspire our readers, as it demonstrates how simple techniques and carefully designed data structures can be remarkably potent if used well. Check out the paper and a recent talk describing Napa to learn more.

Acknowledgements

This blog post describes a collaborative effort between Junichi Tatemura, Tao Zou, Jagan Sankaranarayanan, Yanlai Huang, Jim Chen, Yupu Zhang, Kevin Lai, Hao Zhang, Gokul Nath Babu Manoharan, Goetz Graefe, Divyakant Agrawal, Brad Adelberg, Shilpa Kolhar and Indrajit Roy.

For the World to See: Nonprofit Deploys GPU-Powered Simulators to Train Providers in Sight-Saving Surgery

GPU-powered surgical-simulation devices are helping train more than 2,000 doctors a year in lower-income countries to treat cataract blindness, the world’s leading cause of blindness, thanks to the nonprofit HelpMeSee.

While cataract surgery has a success rate of around 99%, many patients in low- and middle-income countries lack access to the common procedure due to a severe shortage of ophthalmologists. An estimated 90% of the 100 million people affected by cataract-related visual impairment or blindness are in these locations.

By training more healthcare providers — including those without a specialty in ophthalmology — to treat cataracts, HelpMeSee improves the quality of life for patients such as a mother of two young children in Bhiwandi, near Mumbai, India, who was blinded by cataracts in both eyes.

“After the surgery, her vision improved dramatically and she was able to take up a job, changing the course of her entire family,” said Dr. Chetan Ahiwalay, chief instructor and subject-matter expert for HelpMeSee in India. “She and her husband are now happily raising their kids and leading a healthy life. These are the things that keep us going as doctors.”

HelpMeSee’s simulator devices use NVIDIA RTX GPUs to render high-quality visuals, providing a more realistic training environment for doctors to hone their surgical skills. To further improve the trainee experience, NVIDIA experts are working with the HelpMeSee team to improve rendering performance, increase visual realism and augment the simulator with next-generation technologies such as real-time ray tracing and AI.

Tackling Treatable Blindness With Accessible Training

High-income countries have 18x more ophthalmologists per million residents than low-income countries. That coverage gap, which is far wider still in certain countries, makes it harder for those in thinly resourced areas to receive treatment for avoidable blindness.

HelpMeSee’s devices can train doctors on multiple eye procedures using immersive tools inspired by flight simulators used in aviation. The team trains doctors in countries including India, China, Madagascar, Mexico and the U.S., and rolls out multilingual training each year for new procedures.

The eye surgery simulator offers realistic 3D visuals, haptic feedback, performance scores and the opportunity to attempt a step of the procedure multiple times until the trainee achieves proficiency. Qualified instructors like Dr. Ahiwalay travel to rural and urban areas to deliver the training through structured courses — and help surgeons transition from the simulators to live surgeries.

During a training session, doctors learn to perform manual small-incision cataract surgery.

“We’re lowering the barrier for healthcare practitioners to learn these specific skills that can have a profound impact on patients,” said Dr. Bonnie An Henderson, CEO of HelpMeSee, which is based in New York. “Simulation-based training will improve surgical skills while keeping patients safe.”

Looking Ahead to AI, Advanced Rendering 

HelpMeSee works with Surgical Science, a supplier of medical virtual-reality simulators, based in Gothenburg, Sweden, to develop the 3D models and real-time rendering for its devices. Other collaborators — Strasbourg, France-based InSimo and Pune, India-based Harman Connected Services — develop the physics-based simulations and user interface, respectively. 

“Since there are many crucial visual cues during eye surgery, the simulation requires high fidelity,” said Sebastian Ullrich, senior manager of software development at Surgical Science, who has worked with HelpMeSee for years. “To render a realistic 3D representation of the human eye, we use custom shader materials with high-resolution textures to represent various anatomical components, mimic optical properties such as refraction, use order-independent transparency sorting and employ volume rendering.”

NVIDIA RTX GPUs support 3D volume rendering, stereoscopic rendering and depth sorting algorithms that provide a realistic visual experience for HelpMeSee’s trainees. Working with NVIDIA, the team is investigating AI models that could provide trainees with a real-time analysis of the practice procedure and offer recommendations for improvement.

Watch a demo of HelpMeSee’s cataract surgery training simulation.

Subscribe to NVIDIA healthcare news.


Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning


A new AI agent developed by NVIDIA Research that can teach robots complex skills has, for the first time, trained a robotic hand to perform rapid pen-spinning tricks as well as a human can.

The stunning prestidigitation, showcased in the video above, is one of nearly 30 tasks that robots have learned to expertly accomplish thanks to Eureka, which autonomously writes reward algorithms to train bots.

Eureka has also taught robots to open drawers and cabinets, toss and catch balls, and manipulate scissors, among other tasks.

The Eureka research, published today, includes a paper and the project’s AI algorithms, which developers can experiment with using NVIDIA Isaac Gym, a physics simulation reference application for reinforcement learning research. Isaac Gym is built on NVIDIA Omniverse, a development platform for building 3D tools and applications based on the OpenUSD framework. Eureka itself is powered by the GPT-4 large language model.

“Reinforcement learning has enabled impressive wins over the last decade, yet many challenges still exist, such as reward design, which remains a trial-and-error process,” said Anima Anandkumar, senior director of AI research at NVIDIA and an author of the Eureka paper. “Eureka is a first step toward developing new algorithms that integrate generative and reinforcement learning methods to solve hard tasks.”

AI Trains Robots

Eureka-generated reward programs — which enable trial-and-error learning for robots — outperform expert human-written ones on more than 80% of tasks, according to the paper. This leads to an average performance improvement of more than 50% for the bots.

Robot arm taught by Eureka to open a drawer.

The AI agent taps the GPT-4 LLM and generative AI to write software code that rewards robots for reinforcement learning. It doesn’t require task-specific prompting or predefined reward templates — and readily incorporates human feedback to modify its rewards for results more accurately aligned with a developer’s vision.

Using GPU-accelerated simulation in Isaac Gym, Eureka can quickly evaluate the quality of large batches of reward candidates for more efficient training.

Eureka then constructs a summary of the key stats from the training results and instructs the LLM to improve its generation of reward functions. In this way, the AI is self-improving. It’s taught all kinds of robots — quadruped, bipedal, quadrotor, dexterous hands, cobot arms and others — to accomplish all kinds of tasks.
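The generate, evaluate and refine cycle described above can be illustrated with a toy loop. This is a minimal sketch, not Eureka's implementation: `propose` stands in for the GPT-4 call, `evaluate` stands in for GPU-accelerated training in Isaac Gym, and the two-number `weights` tuple is a hypothetical placeholder for a generated reward program. Every name below is invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    weights: tuple               # hypothetical stand-in for a reward program
    score: float = float("-inf")


def propose(best, feedback, n, rng):
    """Stand-in for the LLM call: perturb the best candidate found so far."""
    step = feedback["step"]
    return [
        Candidate(tuple(w + rng.gauss(0.0, step) for w in best.weights))
        for _ in range(n)
    ]


def evaluate(candidate):
    """Stand-in for simulated training: a toy score that peaks at (1.0, -0.5)."""
    target = (1.0, -0.5)
    return -sum((w - t) ** 2 for w, t in zip(candidate.weights, target))


def search(iterations=5, batch_size=16, seed=0):
    rng = random.Random(seed)
    best = Candidate((0.0, 0.0))
    best.score = evaluate(best)
    feedback = {"step": 0.5}     # stand-in for the summary of training stats
    history = [best.score]
    for _ in range(iterations):
        # Evaluate a whole batch of candidates before picking a winner.
        candidates = propose(best, feedback, batch_size, rng)
        for candidate in candidates:
            candidate.score = evaluate(candidate)
        top = max(candidates, key=lambda c: c.score)
        improved = top.score > best.score
        if improved:
            best = top
        # Feed a summary of the round back into the next proposal step.
        feedback = {"step": feedback["step"] * (1.0 if improved else 0.5)}
        history.append(best.score)
    return best, history


best, history = search()
print(best.weights, history[-1])
```

Because the best candidate is replaced only when a new one scores higher, the best score never gets worse from one round to the next, which is the property that makes the loop self-improving in this toy setting.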

The research paper provides in-depth evaluations of 20 Eureka-trained tasks, based on open-source dexterity benchmarks that require robotic hands to demonstrate a wide range of complex manipulation skills.

The results from nine Isaac Gym environments are showcased in visualizations generated using NVIDIA Omniverse.

Humanoid robot learns a running gait via Eureka.

“Eureka is a unique combination of large language models and NVIDIA GPU-accelerated simulation technologies,” said Linxi “Jim” Fan, senior research scientist at NVIDIA, who’s one of the project’s contributors. “We believe that Eureka will enable dexterous robot control and provide a new way to produce physically realistic animations for artists.”

It’s breakthrough work bound to get developers’ minds spinning with possibilities, adding to recent NVIDIA Research advancements like Voyager, an AI agent built with GPT-4 that can autonomously play Minecraft.

NVIDIA Research comprises hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Learn more about Eureka and NVIDIA Research.


Announcing Rekognition Custom Moderation: Enhance accuracy of pre-trained Rekognition moderation models with your data


Companies increasingly rely on user-generated images and videos for engagement. From ecommerce platforms encouraging customers to share product images to social media companies promoting user-generated videos and images, using user content for engagement is a powerful strategy. However, it can be challenging to ensure that this user-generated content is consistent with your policies and fosters a safe online community for your users.

Many companies currently depend on human moderators or respond reactively to user complaints to manage inappropriate user-generated content. These approaches don’t scale to effectively moderate millions of images and videos at sufficient quality or speed, which leads to a poor user experience, high costs to achieve scale, or even potential harm to brand reputation.

In this post, we discuss how to use the Custom Moderation feature in Amazon Rekognition to enhance the accuracy of your pre-trained content moderation API.

Content moderation in Amazon Rekognition

Amazon Rekognition is a managed artificial intelligence (AI) service that offers pre-trained and customizable computer vision capabilities to extract information and insights from images and videos. One such capability is Amazon Rekognition Content Moderation, which detects inappropriate or unwanted content in images and videos. Amazon Rekognition uses a hierarchical taxonomy to label inappropriate or unwanted content with 10 top-level moderation categories (such as violence, explicit, alcohol, or drugs) and 35 second-level categories. Customers across industries such as ecommerce, social media, and gaming can use content moderation in Amazon Rekognition to protect their brand reputation and foster safe user communities.

When you use Amazon Rekognition for image and video moderation, human moderators need to review only a much smaller set of content, typically 1–5% of the total volume, that the content moderation model has already flagged. This enables companies to focus on more valuable activities and still achieve comprehensive moderation coverage at a fraction of their existing cost.

Introducing Amazon Rekognition Custom Moderation

You can now enhance the accuracy of the Rekognition moderation model for your business-specific data with the Custom Moderation feature. You can train a custom adapter with as few as 20 annotated images in less than 1 hour. These adapters extend the capabilities of the moderation model so that it detects the kinds of images used for training with higher accuracy. For this post, we use a sample dataset containing both safe images and images with alcoholic beverages (considered unsafe) to enhance the accuracy of the alcohol moderation label.

The unique ID of the trained adapter can be provided to the existing DetectModerationLabels API operation to process images using this adapter. Each adapter can only be used by the AWS account that was used for training the adapter, ensuring that the data used for training remains safe and secure in that AWS account. With the Custom Moderation feature, you can tailor the Rekognition pre-trained moderation model for improved performance on your specific moderation use case, without any machine learning (ML) expertise. You can continue to enjoy the benefits of a fully managed moderation service with a pay-per-use pricing model for Custom Moderation.

Solution overview

Training a custom moderation adapter involves five steps that you can complete using the AWS Management Console or the API:

  1. Create a project
  2. Upload the training data
  3. Assign ground truth labels to images
  4. Train the adapter
  5. Use the adapter
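For readers who prefer the API route, the five steps map roughly onto Rekognition API operations. The sketch below only builds request parameters as plain dictionaries, so it runs without AWS credentials. The operation and parameter names reflect our understanding of the Boto3 Rekognition client and should be checked against the current API reference before use. The function name, bucket, keys and angle-bracket placeholders are hypothetical.

```python
def adapter_workflow_requests(project_name, adapter_name, bucket, manifest_key,
                              output_prefix="adapters/"):
    """Return (operation_name, kwargs) pairs that mirror the five steps."""
    return [
        # Step 1: a project is the container for adapters.
        ("create_project", {
            "ProjectName": project_name,
            "Feature": "CONTENT_MODERATION",
            "AutoUpdate": "ENABLED",
        }),
        # Steps 2 and 3: a manifest file lists the images and their labels.
        ("create_dataset", {
            "ProjectArn": "<ProjectArn returned by create_project>",
            "DatasetType": "TRAIN",
            "DatasetSource": {
                "GroundTruthManifest": {
                    "S3Object": {"Bucket": bucket, "Name": manifest_key}
                }
            },
        }),
        # Step 4: training an adapter creates a new project version.
        ("create_project_version", {
            "ProjectArn": "<ProjectArn returned by create_project>",
            "VersionName": adapter_name,
            "OutputConfig": {"S3Bucket": bucket, "S3KeyPrefix": output_prefix},
        }),
        # Step 5: pass the adapter's ARN when moderating an image.
        ("detect_moderation_labels", {
            "Image": {"S3Object": {"Bucket": bucket, "Name": "<key>"}},
            "ProjectVersion": "<ProjectVersionArn returned by create_project_version>",
        }),
    ]


requests = adapter_workflow_requests(
    "my-moderation-project", "alcohol-adapter", "<bucket>", "train.manifest"
)
for operation, kwargs in requests:
    print(operation, sorted(kwargs))
```

In a real workflow, each placeholder ARN would be replaced with the value returned by the preceding call before the next request is sent.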


Let’s walk through these steps in more detail using the console.

Create a project

A project is a container to store your adapters. You can train multiple adapters within a project with different training datasets to assess which adapter performs best for your specific use case. To create your project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Moderation in the navigation pane.
  2. Choose Create project.


  3. For Project name, enter a name for your project.
  4. For Adapter name, enter a name for your adapter.
  5. Optionally, enter a description for your adapter.


Upload training data

You can begin with as few as 20 sample images to adapt the moderation model to detect fewer false positives (images that are appropriate for your business but are flagged by the model with a moderation label). To reduce false negatives (images that are inappropriate for your business but don’t get flagged with a moderation label), you are required to start with 50 sample images.

You can select from the following options to provide the image datasets for adapter training:

Complete the following steps:

  1. For this post, select Import images from S3 bucket and enter your S3 URI.


Like any ML training process, training a Custom Moderation adapter in Amazon Rekognition requires two separate datasets: one for training the adapter and another for evaluating the adapter. You can either upload a separate test dataset or choose to automatically split your training dataset for training and testing.

  2. For this post, select Autosplit.
  3. Select Enable auto-update to ensure that the system automatically retrains the adapter when a new version of the content moderation model is launched.
  4. Choose Create project.
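To make the idea of automatically splitting one dataset into training and test portions concrete, here is a minimal sketch of a deterministic shuffle-and-hold-out split. It illustrates the general technique only. The post doesn't say how Amazon Rekognition performs its split, so the 20% test fraction, the function name and the file names are all assumptions.

```python
import random


def autosplit(items, test_fraction=0.2, seed=0):
    """Shuffle deterministically, then hold out a fraction for evaluation."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one item on each side of the split.
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


images = [f"img_{i}.jpg" for i in range(50)]  # hypothetical file names
train, test = autosplit(images)
print(len(train), len(test))
```

Keeping the test images out of training is what lets the evaluation reflect how the adapter behaves on images it hasn't seen.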


Assign ground truth labels to images

If you uploaded unannotated images, you can use the Amazon Rekognition console to provide image labels as per the moderation taxonomy. In the following example, we train an adapter to detect hidden alcohol with higher accuracy, and label all such images with the label alcohol. Images not considered inappropriate can be labeled as Safe.


Train the adapter

After you label all the images, choose Start training to initiate the training process. Amazon Rekognition will use the uploaded image datasets to train an adapter model for enhanced accuracy on the specific type of images provided for training.

After the custom moderation adapter is trained, you can view all the adapter details (adapterID, test and training manifest files) in the Adapter performance section.

The Adapter performance section displays improvements in false positives and false negatives when compared to the pre-trained moderation model. The adapter we trained to enhance the detection of the alcohol label reduces the false negative rate in test images by 73%. In other words, the adapter now accurately predicts the alcohol moderation label for 73% more images compared to the pre-trained moderation model. However, no improvement is observed in false positives, as no false positive samples were used for training.
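As a worked illustration of how a relative reduction in false negative rate is computed, consider the sketch below. The counts are invented to land near the 73% figure quoted above. The post doesn't publish the underlying numbers, so treat these purely as an example of the arithmetic.

```python
def false_negative_rate(false_negatives, true_positives):
    """Share of truly unsafe images that the model failed to flag."""
    positives = false_negatives + true_positives
    if positives == 0:
        raise ValueError("no positive examples in the test set")
    return false_negatives / positives


def relative_reduction(baseline, improved):
    """Fractional drop from the baseline rate to the improved rate."""
    return (baseline - improved) / baseline


# Hypothetical test set with 100 images that contain alcohol.
baseline_fnr = false_negative_rate(false_negatives=30, true_positives=70)
adapter_fnr = false_negative_rate(false_negatives=8, true_positives=92)
reduction = relative_reduction(baseline_fnr, adapter_fnr)
print(f"{reduction:.0%}")  # prints 73%
```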


Use the adapter

You can perform inference using the newly trained adapter to achieve enhanced accuracy. To do this, call the Amazon Rekognition DetectModerationLabels API with an additional parameter, ProjectVersion, which is the unique AdapterID of the adapter. The following is a sample command using the AWS Command Line Interface (AWS CLI):

aws rekognition detect-moderation-labels \
--image 'S3Object={Bucket="<bucket>",Name="<key>"}' \
--project-version <ARN of the Adapter> \
--region us-east-1

The following is a sample code snippet using the Python Boto3 library:

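import boto3

# Create a Rekognition client; credentials and Region come from your AWS configuration.
client = boto3.client('rekognition')

# Pass the adapter's ARN as ProjectVersion to run inference with the adapter.
response = client.detect_moderation_labels(
    Image={
        "S3Object": {
            "Bucket": "<bucket>",
            "Name": "<key>"
        }
    },
    ProjectVersion="<ARN of the Adapter>"
)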
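Once the call returns, the moderation labels can be read from the response. The following sketch assumes the response carries a ModerationLabels list whose entries have Name, ParentName and Confidence fields, with an empty ParentName for top-level categories. The sample response, its values and the 60% threshold are made up so the snippet runs without calling AWS.

```python
def flagged_labels(response, min_confidence=60.0):
    """Return (top_level_category, label, confidence) for confident labels."""
    results = []
    for label in response.get("ModerationLabels", []):
        if label["Confidence"] >= min_confidence:
            # Top-level labels have an empty ParentName, so fall back to Name.
            category = label.get("ParentName") or label["Name"]
            results.append((category, label["Name"], label["Confidence"]))
    return results


# Hypothetical response for an image that contains an alcoholic beverage.
sample_response = {
    "ModerationLabels": [
        {"Name": "Alcohol", "ParentName": "", "Confidence": 98.5},
        {"Name": "Alcoholic Beverages", "ParentName": "Alcohol", "Confidence": 98.5},
    ]
}

for category, name, confidence in flagged_labels(sample_response):
    print(f"{category} / {name}: {confidence:.1f}")
```

An image with no labels above the threshold returns an empty list, which an application might treat as safe to publish without human review.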

Best practices for training

To maximize the performance of your adapter, the following best practices are recommended for training the adapter:

  • The sample image data should capture the representative errors that you want to improve the moderation model accuracy for
  • Instead of only bringing in error images for false positives and false negatives, you can also provide true positives and true negatives for improved performance
  • Supply as many annotated images as possible for training

Conclusion

In this post, we presented an in-depth overview of the new Amazon Rekognition Custom Moderation feature. Furthermore, we detailed the steps for performing training using the console, including best practices for optimal results. For additional information, visit the Amazon Rekognition console and explore the Custom Moderation feature.

Amazon Rekognition Custom Moderation is now generally available in all AWS Regions where Amazon Rekognition is available.

Learn more about content moderation on AWS. Take the first step towards streamlining your content moderation operations with AWS.


About the Authors

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Aakash Deep is a Software Development Engineering Manager based in Seattle. He enjoys working on computer vision, AI, and distributed systems. His mission is to enable customers to address complex problems and create value with Amazon Rekognition. Outside of work, he enjoys hiking and traveling.

Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for Content Moderation, Computer Vision, Natural Language Processing and Generative AI. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, media, advertising & marketing.


Next-Level Computing: NVIDIA and AMD Deliver Powerful Workstations to Accelerate AI, Rendering and Simulation


To enable professionals worldwide to build and run AI applications right from their desktops, NVIDIA and AMD are powering a new line of workstations equipped with NVIDIA RTX Ada Generation GPUs and AMD Ryzen Threadripper PRO 7000 WX-Series CPUs.

Bringing together the highest levels of AI computing, rendering and simulation capabilities, these new platforms enable professionals to efficiently tackle the most resource-intensive, large-scale AI workflows locally.

Bringing AI Innovation to the Desktop

Advanced AI tasks typically require data-center-level performance. Training a large language model with a trillion parameters, for example, takes thousands of GPUs running for weeks, though research is underway to reduce model size and enable model training on smaller systems while still maintaining high levels of AI model accuracy.

The new NVIDIA RTX GPU and AMD CPU-powered AI workstations provide the power and performance required to train such smaller models and fine-tune them locally, helping to offload data center and cloud resources for AI development tasks. The devices let users select single- or multi-GPU configurations as required for their workloads.

Smaller trained AI models also provide the opportunity to use workstations for local inferencing. RTX GPU and AMD CPU-powered workstations can be configured to run these smaller AI models for inference serving for small workgroups or departments.

With up to 48GB of memory in a single NVIDIA RTX GPU, these workstations offer a cost-effective way to reduce compute load on data centers. And when professionals do need to scale training and deployment from these workstations to data centers or the cloud, the NVIDIA AI Enterprise software platform enables seamless portability of workflows and toolchains.
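A rough back-of-the-envelope check shows which model sizes could fit in 48GB. The sketch below counts only the memory needed to hold model weights, at one gigabyte per billion bytes. Activations, caches and optimizer state need additional memory, so this is a lower bound rather than a sizing guide. The model sizes and precisions are hypothetical examples, not figures from NVIDIA or AMD.

```python
GPU_MEMORY_GB = 48  # single-GPU memory figure quoted in the text


def weights_gb(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def fits_on_one_gpu(num_parameters, bytes_per_parameter, memory_gb=GPU_MEMORY_GB):
    return weights_gb(num_parameters, bytes_per_parameter) <= memory_gb


# Hypothetical model sizes at 16-bit (2 bytes) and 4-bit (0.5 byte) precision.
for params, nbytes in [(7_000_000_000, 2), (70_000_000_000, 2), (70_000_000_000, 0.5)]:
    size = weights_gb(params, nbytes)
    print(f"{params / 1e9:.0f}B params at {nbytes} bytes each: {size:.0f} GB, "
          f"fits: {fits_on_one_gpu(params, nbytes)}")
```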

RTX GPU and AMD CPU-powered workstations also enable cutting-edge visual workflows. With accelerated computing power, the new workstations enable highly interactive content creation, industrial digitalization, and advanced simulation and design.

Unmatched Power, Performance and Flexibility

AMD Ryzen Threadripper PRO 7000 WX-Series processors provide the CPU platform for the next generation of demanding workloads. The processors deliver a significant increase in core count — up to 96 cores per CPU — and industry-leading maximum memory bandwidth in a single socket.

Combining them with the latest NVIDIA RTX Ada Generation GPUs brings unmatched power and performance in a workstation. The GPUs enable up to 2x the performance in ray tracing, AI processing, graphics rendering and computational tasks compared to the previous generation.

Ada Generation GPU options include the RTX 4000 SFF, RTX 4000, RTX 4500, RTX 5000 and RTX 6000. They’re built on the NVIDIA Ada Lovelace architecture and feature up to 142 third-generation RT Cores, 568 fourth-generation Tensor Cores and 18,176 latest-generation CUDA cores.

From architecture and manufacturing to media and entertainment and healthcare, professionals across industries will be able to use the new workstations to tackle challenging AI computing workloads — along with 3D rendering, product visualization, simulation and scientific computing tasks.

Availability

New workstations powered by NVIDIA RTX Ada Generation GPUs and the latest AMD Threadripper Pro processors will be available starting next month from BOXX and HP, with other system integrators offering them soon.


NVIDIA AI Now Available in Oracle Cloud Marketplace


Training generative AI models just got easier.

NVIDIA DGX Cloud AI supercomputing platform and NVIDIA AI Enterprise software are now available in Oracle Cloud Marketplace, making it possible for Oracle Cloud Infrastructure customers to access high-performance accelerated computing and software to run secure, stable and supported production AI in just a few clicks.

The addition — an industry first — brings new capabilities for end-to-end development and deployment on Oracle Cloud. Enterprises can get started from the Oracle Cloud Marketplace to train models on DGX Cloud, and then deploy their applications on OCI with NVIDIA AI Enterprise.

Oracle Cloud and NVIDIA Lift Industries Into Era of AI

Thousands of enterprises around the world rely on OCI to power the applications that drive their businesses. Its customers include leaders across industries such as healthcare, scientific research, financial services, telecommunications and more.

Oracle Cloud Marketplace is a catalog of solutions that offers customers flexible consumption models and simple billing. Its addition of DGX Cloud and NVIDIA AI Enterprise lets OCI customers use their existing cloud credits to integrate NVIDIA’s leading AI supercomputing platform and software into their development and deployment pipelines.

With DGX Cloud, OCI customers can train models for generative AI applications like intelligent chatbots, search, summarization and content generation.

The University at Albany, in upstate New York, recently launched its AI Plus initiative, which is integrating teaching and learning about AI across the university’s research and academic enterprise, in fields such as cybersecurity, weather prediction, health data analytics, drug discovery and next-generation semiconductor design. It will also foster collaborations across the humanities, social sciences, public policy and public health. The university is using DGX Cloud AI supercomputing instances on OCI as it builds out an on-premises supercomputer.

“We’re accelerating our mission to infuse AI into virtually every academic and research discipline,” said Thenkurussi (Kesh) Kesavadas, vice president for research and economic development at UAlbany. “We will drive advances in healthcare, security and economic competitiveness, while equipping students for roles in the evolving job market.”

NVIDIA AI Enterprise brings the software layer of the NVIDIA AI platform to OCI. It includes NVIDIA NeMo frameworks for building LLMs, NVIDIA RAPIDS for data science and NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server for supercharging production AI. NVIDIA software for cybersecurity, computer vision, speech AI and more is also included. Enterprise-grade support, security and stability ensure a smooth transition of AI projects from pilot to production.

NVIDIA DGX Cloud provides enterprises immediate access to an AI supercomputing platform and software hosted by their preferred cloud provider.

AI Supercomputing Platform Hosted by OCI

NVIDIA DGX Cloud provides enterprises immediate access to an AI supercomputing platform and software.

Hosted by OCI, DGX Cloud provides enterprises with access to multi-node training on NVIDIA GPUs, paired with NVIDIA AI software, for training advanced models for generative AI and other groundbreaking applications.

Each DGX Cloud instance consists of eight NVIDIA Tensor Core GPUs interconnected with network fabric, purpose-built for multi-node training. This high-performance computing architecture also includes industry-leading AI development software and offers direct access to NVIDIA AI expertise so businesses can train LLMs faster.

OCI customers access DGX Cloud using NVIDIA Base Command Platform, which gives developers access to an AI supercomputer through a web browser. By providing a single-pane view of the customer’s AI infrastructure, Base Command Platform simplifies the management of multinode clusters.

NVIDIA AI Enterprise software powers secure, stable and supported production AI and data science.

Software for Secure, Stable and Supported Production AI

NVIDIA AI Enterprise enables rapid development and deployment of AI and data science.

With NVIDIA AI Enterprise on Oracle Cloud Marketplace, enterprises can efficiently build an application once and deploy it on OCI and their on-prem infrastructure, making a multi- or hybrid-cloud strategy cost-effective and easy to adopt. NVIDIA AI Enterprise is also included in NVIDIA DGX Cloud, so the AI software runtime is consistent across the environments. That consistency lets customers streamline the transition from training on DGX Cloud to deploying their AI application into production with NVIDIA AI Enterprise on OCI.

Qualified customers can purchase NVIDIA AI Enterprise and NVIDIA DGX Cloud with their existing Oracle Universal Credits.

Visit NVIDIA AI Enterprise and NVIDIA DGX Cloud on the Oracle Cloud Marketplace to get started today.
