Q&A: Global challenges surrounding the deployment of AI

The AI Policy Forum (AIPF) is an initiative of the MIT Schwarzman College of Computing to move the global conversation about the impact of artificial intelligence from principles to practical policy implementation. Formed in late 2020, AIPF brings together leaders in government, business, and academia to develop approaches to address the societal challenges posed by the rapid advances and increasing applicability of AI.

The co-chairs of the AI Policy Forum are Aleksander Madry, the Cadence Design Systems Professor; Asu Ozdaglar, deputy dean of academics for the MIT Schwarzman College of Computing and head of the Department of Electrical Engineering and Computer Science; and Luis Videgaray, senior lecturer at MIT Sloan School of Management and director of MIT AI Policy for the World Project. Here, they discuss some of the key issues facing the AI policy landscape today and the challenges surrounding the deployment of AI. The three are co-organizers of the upcoming AI Policy Forum Summit on Sept. 28, which will further explore the issues discussed here.

Q: Can you talk about the ­ongoing work of the AI Policy Forum and the AI policy landscape generally?

Ozdaglar: There is no shortage of discussion about AI at different venues, but conversations are often high-level, focused on questions of ethics and principles, or on policy problems alone. The approach the AIPF takes to its work is to target specific questions with actionable policy solutions and engage with the stakeholders working directly in these areas. We work “behind the scenes” with smaller focus groups to tackle these challenges and aim to bring visibility to some potential solutions alongside the players working directly on them through larger gatherings.

Q: AI impacts many sectors, which makes us naturally worry about its trustworthiness. Are there any emerging best practices for development and deployment of trustworthy AI?

Madry: The most important thing to understand regarding deploying trustworthy AI is that AI technology isn’t some natural, preordained phenomenon. It is something built by people. People who are making certain design decisions.

We thus need to advance research that can guide these decisions as well as provide more desirable solutions. But we also need to be deliberate and think carefully about the incentives that drive these decisions. 

Now, these incentives stem largely from business considerations, but not exclusively so. That is, we should also recognize that proper laws and regulations, as well as thoughtful industry standards, have a big role to play here too.

Indeed, governments can put in place rules that prioritize the value of deploying AI while being keenly aware of the corresponding downsides, pitfalls, and impossibilities. The design of such rules will be an ongoing and evolving process as the technology continues to improve and change, and we need to adapt to socio-political realities as well.

Q: Perhaps one of the most rapidly evolving domains in AI deployment is in the financial sector. From a policy perspective, how should governments, regulators, and lawmakers make AI work best for consumers in finance?

Videgaray: The financial sector is seeing a number of trends that present policy challenges at the intersection of AI systems. For one, there is the issue of explainability. By law (in the U.S. and in many other countries), lenders need to provide explanations to customers when they take actions adverse to a customer’s interest, such as denying a loan. However, as financial services increasingly rely on automated systems and machine learning models, the capacity of banks to unpack the “black box” of machine learning to provide that level of mandated explanation becomes tenuous. So how should the finance industry and its regulators adapt to this advance in technology? Perhaps we need new standards and expectations, as well as tools to meet these legal requirements.

Meanwhile, economies of scale and data network effects are leading to a proliferation of AI outsourcing, and more broadly, AI-as-a-service is becoming increasingly common in the finance industry. In particular, we are seeing fintech companies provide the tools for underwriting to other financial institutions — be it large banks or small, local credit unions. What does this segmentation of the supply chain mean for the industry? Who is accountable for the potential problems in AI systems deployed through several layers of outsourcing? How can regulators adapt to guarantee their mandates of financial stability, fairness, and other societal standards?

Q: Social media is one of the most controversial sectors of the economy, resulting in many societal shifts and disruptions around the world. What policies or reforms might be needed to best ensure social media is a force for public good and not public harm?

Ozdaglar: The role of social media in society is of growing concern to many, but the nature of these concerns can vary quite a bit — with some seeing social media as not doing enough to prevent, for example, misinformation and extremism, and others seeing it as unduly silencing certain viewpoints. This lack of a unified view on what the problem is limits the capacity to enact any change. All of that is additionally coupled with the complexities of the legal framework in the U.S., spanning the First Amendment, Section 230 of the Communications Decency Act, and trade laws.

However, these difficulties in regulating social media do not mean that there is nothing to be done. Indeed, regulators have begun to tighten their control over social media companies, both in the United States and abroad, be it through antitrust procedures or other means. In particular, Ofcom in the U.K. and the European Union are already introducing new layers of oversight for platforms. Additionally, some have proposed taxes on online advertising to address the negative externalities caused by the current social media business model. So, the policy tools are there, if the political will and proper guidance exist to implement them.


Introducing self-service quota management and higher default service quotas for Amazon Textract

Today, we’re excited to announce self-service quota management support for Amazon Textract via the AWS Service Quotas console, and higher default service quotas in select AWS Regions.

Customers tell us they need quick turnaround times to process their requests for quota increases and visibility into their service quotas so they may continue to scale their Amazon Textract usage. With this launch, we’re improving Amazon Textract support for service quotas by enabling you to self-manage your service quotas via the Service Quotas console. In addition to viewing the default service quotas, you can now view your account’s applied custom quotas for a specific Region, view your historical utilization metrics per applied quota, set up alarms to notify when utilization approaches a threshold, and add tags to your quotas for easier organization. Additionally, we’re launching the Amazon Textract Service Quota Calculator, which will help you quickly estimate service quota requirements for your workload prior to submitting a quota increase request.

In this post, we discuss the updated default service quotas, the new service quota management capabilities, and the service quota calculator for Amazon Textract.

Increased default service quotas for Amazon Textract

Amazon Textract now has higher service quotas for several asynchronous and synchronous APIs in multiple major AWS Regions. The updated default service quotas are available for US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), and Europe (Ireland) Regions. The following table summarizes the before and after default quota numbers for each of these Regions for the respective synchronous and asynchronous APIs. You can refer to Amazon Textract endpoints and quotas to learn more about the current default quotas.

Synchronous operations: transactions per second per account for synchronous operations

  API                  Region                 Before  After
  AnalyzeDocument      US East (Ohio)              1     10
  AnalyzeDocument      Asia Pacific (Mumbai)       1      5
  AnalyzeDocument      Europe (Ireland)            1      5
  DetectDocumentText   US East (Ohio)              1     10
  DetectDocumentText   US East (N. Virginia)      10     25
  DetectDocumentText   US West (Oregon)           10     25
  DetectDocumentText   Asia Pacific (Mumbai)       1      5
  DetectDocumentText   Europe (Ireland)            1      5

Asynchronous operations: transactions per second per account for all Start (asynchronous) operations

  API                          Region                 Before  After
  StartDocumentAnalysis        US East (Ohio)              2     10
  StartDocumentAnalysis        Asia Pacific (Mumbai)       2      5
  StartDocumentAnalysis        Europe (Ireland)            2      5
  StartDocumentTextDetection   US East (Ohio)              1      5
  StartDocumentTextDetection   US East (N. Virginia)      10     15
  StartDocumentTextDetection   US West (Oregon)           10     15
  StartDocumentTextDetection   Asia Pacific (Mumbai)       1      5
  StartDocumentTextDetection   Europe (Ireland)            1      5

Asynchronous operations: transactions per second per account for all Get (asynchronous) operations

  API                        Region                 Before  After
  GetDocumentAnalysis        US East (Ohio)              5     10
  GetDocumentTextDetection   US East (Ohio)              5     10
  GetDocumentTextDetection   US East (N. Virginia)      10     25
  GetDocumentTextDetection   US West (Oregon)           10     25

Improved service quota support for Amazon Textract

Starting today, you can manage your Amazon Textract service quotas via the Service Quotas console. Requests may now be processed automatically, speeding up approval times. After a quota request for a specific Region is approved, the new quota is immediately available for scaling your Amazon Textract usage and is also visible on the Service Quotas console. You can see the default and applied quota values for your account in a given Region, and view the historical utilization metrics via an integrated Amazon CloudWatch graph. This enables you to make informed decisions about whether a quota increase is required to scale your workload. You can also use CloudWatch alarms to notify you whenever a specified quota reaches a predefined threshold, which can help you investigate issues with your applications or monitor spiky workloads. You can also add tags to your quotas for better administration and monitoring.
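If you prefer to work programmatically, the same operations are exposed through the Service Quotas API. The following sketch uses boto3 to list the Textract quotas in a Region and request an increase; the quota code shown is a placeholder (use the QuotaCode values returned by the list call), and the desired value is only an example.

import boto3

# Service Quotas client in the Region whose quotas you want to inspect
sq = boto3.client("service-quotas", region_name="us-east-2")

# List the quotas that apply to Amazon Textract in this Region
for quota in sq.list_service_quotas(ServiceCode="textract")["Quotas"]:
    print(quota["QuotaName"], quota["Value"])

# Request an increase for a specific quota.
# "L-EXAMPLE123" is a placeholder; use a QuotaCode returned above.
response = sq.request_service_quota_increase(
    ServiceCode="textract",
    QuotaCode="L-EXAMPLE123",
    DesiredValue=10.0,
)
print(response["RequestedQuota"]["Status"])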

The following sections discuss the features that are now available via the Service Quotas console for Amazon Textract.

Default and applied quotas

You can now have visibility into the AWS default quota value and applied quota value of a specific quota for Amazon Textract on the Service Quotas console. The default quota value is the default value of the quota in that specific Region, and the applied quota value is the currently applied value for that quota for the account in that Region.

Monitoring via CloudWatch graphs

The Service Quotas console also displays utilization against the total applied quota value. You can view the weekly, daily, and hourly trends in utilization of the applied quota through an integrated CloudWatch graph, right from the Service Quotas console, for a given quota. You can add this graph to a custom CloudWatch dashboard for better monitoring and reporting of service usage and overall utilization.

Amazon Textract Service Quota Console cloudwatch graph and alarms

We have also added the capability to set up CloudWatch alarms that notify you automatically whenever a specified quota reaches a configurable threshold. This helps you monitor the usage of Amazon Textract from your applications, analyze spiky workloads, make informed decisions about overall utilization, control costs, and make improvements to your application’s architecture.
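As a rough illustration of what such an alarm can look like when created outside the console, the sketch below uses boto3 and CloudWatch metric math over the AWS/Usage namespace to alarm when AnalyzeDocument call volume exceeds 80 percent of the applied quota. The alarm name is made up, and the dimension values are assumptions; verify them against the usage metrics CloudWatch actually reports for your account.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")

# Alarm when AnalyzeDocument usage exceeds 80% of the applied service quota.
cloudwatch.put_metric_alarm(
    AlarmName="textract-analyzedocument-quota-80pct",  # assumed name
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    Metrics=[
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Usage",
                    "MetricName": "CallCount",
                    "Dimensions": [
                        {"Name": "Service", "Value": "Textract"},
                        {"Name": "Type", "Value": "API"},
                        {"Name": "Resource", "Value": "AnalyzeDocument"},
                        {"Name": "Class", "Value": "None"},
                    ],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "pct_of_quota",
            # SERVICE_QUOTA() pulls the applied quota for the usage metric above
            "Expression": "(usage / SERVICE_QUOTA(usage)) * 100",
            "Label": "Percent of Textract quota used",
            "ReturnData": True,
        },
    ],
)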

Quota tagging

With quota tagging, you can now add tags to applied quotas to simplify administration. Tags help you identify and organize AWS resources. With quota tags, you can manage the applied service quotas for Amazon Textract along with other AWS service quotas, as part of your administration and governance practices. You can better manage and monitor quotas and quota utilization for different environments based on tags. For example, you can use production or development tags to logically separate and monitor dev environment and production environment quotas and quota utilization for accounts under AWS Organizations and unified reporting.
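For teams that manage tags programmatically, the Service Quotas API exposes tagging as well. The snippet below is a minimal sketch of the environment-based scheme described above; the quota ARN is a placeholder, and the tag keys are simply examples.

import boto3

sq = boto3.client("service-quotas")

# The ARN of an applied quota (placeholder shown here); you can find it in the
# output of get_service_quota or on the Service Quotas console.
quota_arn = "arn:aws:servicequotas:us-east-2:123456789012:textract/L-EXAMPLE123"

# Tag the applied quota so dev and production quotas can be tracked separately.
sq.tag_resource(
    ResourceARN=quota_arn,
    Tags=[
        {"Key": "environment", "Value": "production"},
        {"Key": "team", "Value": "document-processing"},
    ],
)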

Amazon Textract Service Quota Calculator

We’re introducing a new quota calculator on the Amazon Textract console. The quota calculator helps forecast service quota requirements based on answers to questions about your workload and usage of Amazon Textract. With calculations based on your usage patterns, such as number of documents and number of pages per document, it provides actionable recommendations in the form of a required quota value for the workload.

As shown in the following screenshot, the quota calculator is now accessible directly from the Amazon Textract console. You can also navigate to the Service Quotas console directly from the calculator, where you can manage the service quotas based on the calculated recommendations.

Amazon Textract Quota Calculator

Quota calculator for synchronous operations

To view the current quota values and recommended quota values for synchronous operations, you start by selecting Synchronous under Processing type. For example, if you’re interested in calculating the desired quota values for your workload that uses the DetectDocumentText API, you select the Synchronous processing type, and then choose Detect Document Text on the Use case type drop-down menu.

Amazon Quota Calculator sync calculation

After you specify your desired options, the quota calculator prompts for additional inputs, including the maximum number of documents you expect to process via the API per day or per hour. The corresponding number of documents to be processed, shown under View calculation, is calculated automatically based on your input. Because synchronous processing allows text detection and analysis of single-page documents, the number of pages per document defaults to 1. For multi-page documents, we recommend using asynchronous processing.

Amazon Quota Calculator sync input usage values

The output of this calculation is a current quota value applicable for that account in the current Region, and the recommended quota value, based on the quota type selected and the provided number of documents.
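The exact formula the calculator uses isn’t published, but the underlying arithmetic for a synchronous workload amounts to converting a peak document volume into transactions per second. A back-of-the-envelope sketch (an approximation, not the console’s implementation):

import math

def recommended_sync_tps(max_documents: int, window_hours: float) -> int:
    """Rough TPS estimate: peak document volume spread over the busiest window.

    One synchronous request processes one single-page document, so the required
    transactions per second is the document count divided by the window length
    in seconds, rounded up.
    """
    return math.ceil(max_documents / (window_hours * 3600))

# Example: 90,000 documents in a peak hour needs roughly 25 TPS.
print(recommended_sync_tps(90_000, window_hours=1))  # -> 25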

Amazon Quota Calculator sync calculation output

You can copy the recommended quota value within the calculator and use the Quota type (in this case, DetectDocumentText) deep link to navigate to the specific quota on the Service Quotas console to create a quota increase request.

Quota calculator for asynchronous operations

The way to view current quota values and recommended quota values for asynchronous operations is similar to that of the synchronous operations. Specify the use case type for your asynchronous operation usage, and answer a few questions relevant to your workload to view the current quotas and recommended quotas for all the asynchronous operations relevant to the use case.

For example, if you’re running asynchronous jobs using the StartDocumentTextDetection API and subsequently using the GetDocumentTextDetection API to get the results of the job in your workload, choose the Document Text Detection option as your use case. Because these two APIs are always used in conjunction with each other, the calculator provides recommendations for both. For asynchronous operations, there are limits on the total number of concurrent jobs that can run per account in a given Region, so the calculator also computes the recommended total number of concurrent asynchronous jobs for your workload.
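To make the Start/Get pairing concrete, here is a minimal boto3 sketch of that asynchronous flow; the bucket and document names are placeholders, and a production workload would typically use an Amazon SNS notification rather than polling.

import time
import boto3

textract = boto3.client("textract", region_name="us-east-2")

# Start an asynchronous text detection job on a multi-page document in S3.
# Bucket and key are placeholders.
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-example-bucket", "Name": "statement.pdf"}}
)
job_id = job["JobId"]

# Poll for completion (a real workload would subscribe to SNS instead).
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Print detected lines from the first page of results.
if result["JobStatus"] == "SUCCEEDED":
    for block in result["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])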

Amazon Quota Calculator Async calculation

In addition to the processing type and use case type, you need to provide specific values relevant to your workload:

  • The maximum number of documents you expect to process
  • A processing time frame value in hours, which is the approximate length of time over which you expect to process the documents
  • The maximum number of pages per document, because asynchronous operations allow processing multi-page documents

Amazon Quota Calculator Async calculation input usage values

Quota calculation for asynchronous operations generates recommended quota values for all the asynchronous APIs relevant to the selected use case. In our example, the quota values for the StartDocumentTextDetection API, GetDocumentTextDetection API, and number of concurrent text detection jobs are generated by the calculator, as shown in the following screenshot. You can then use the required quota value to request quota increases via the Service Quotas console using the corresponding deep links under Quota type.

Amazon Quota Calculator Async calculation output

It’s worth noting that all the quota-related information within the calculator is shown for the AWS Region currently selected in the AWS Management Console. To view the quota information for a different Region, change the Region from the top navigation bar of the console. Recommendations generated by the calculator are based on the current applied quota for that account in the current Region, the selected processing type (asynchronous or synchronous), and other information relevant to your workload. You can use these recommendations to submit quota increase requests via the Service Quotas console. Although most requests are processed automatically, some requests may need additional manual review prior to being approved.

Conclusion

In this post, we announced the updated default service quotas in select AWS Regions and the self-service quota management capabilities of Amazon Textract. We also announced the availability of a new quota calculator, available on the Amazon Textract console. You can start taking advantage of the new default service quotas, and use the Amazon Textract quota calculator to generate recommended quota values to quickly scale your workload. With the improved Service Quotas console for Amazon Textract, you can request quota increases, monitor quota utilization and service usage, and set up alarms. With the features announced in this post, you can now easily monitor your quota utilization, manage costs, and follow best practices to scale your Amazon Textract usage.

To learn more about the Amazon Textract service quota calculator and extended features for quota management, visit Quotas in Amazon Textract.


About the authors

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and data analytics. Anjan is part of the worldwide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Shashwat Sapre is a Senior Technical Product Manager with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers. In his spare time, he likes reading about new technologies, traveling, and exploring different cuisines.


Bridging communities: TensorFlow Federated (TFF) and OpenMined

Posted by Krzys Ostrowski (Research Scientist), Alex Ingerman (Product Manager), and Hardik Vala (Software Engineer)

Since the announcement of TensorFlow Federated (TFF) on this blog 3.5 years ago, a number of organizations have developed frameworks for Federated Learning (FL). While growing attention to privacy and investments in FL are a welcome trend, one challenge that arises is fragmentation of community and industry efforts, which leads to code duplication and reinvention. One way we can address this as a community is by investing in interoperability mechanisms that could enable our platforms and developers to work together and leverage each other’s strengths.

In this context, we’re excited to announce the collaboration between TFF and OpenMined – an OSS community dedicated to development of privacy-preserving technologies. OpenMined’s PySyft framework has attracted a vibrant community of hundreds of OSS contributors, and includes tools and APIs to facilitate containerized deployment and integrations with diverse data sources that complement the capabilities we offer in TFF.

OpenMined is joining the Special Interest Group (SIG) Federated (see the charter, forum, meeting notes, and the Discord server) that we’ve recently established to enable developers of TFF, together with a growing set of OSS and industry partners, to openly engage in conversations about how to jointly evolve the TFF ecosystem and grow the adoption of FL.

Introducing PySyTFF

To kick off the collaboration, we – the developers of TFF and OpenMined’s PySyft – decided to focus our initial efforts on building together a new platform, with the endearing name PySyTFF, that combines elements of TFF and PySyft to support what we believe will be an increasingly common scenario, illustrated below.

In this scenario, an owner of a sensitive dataset would like to invite researchers to experiment with training and evaluating ML models on their dataset to advance the current understanding of what model architectures, parameters, etc., work best, while protecting the data and adhering to policies that may govern its use. In practice, such scenarios often end up involving negotiating data usage contracts. On the one hand, these can be tedious to set up, and on the other hand, they largely rely on goodwill.

What we’d like instead is a platform that offers structural safeguards to limit the disclosure of sensitive information and ensure policy compliance by construction – this is our goal for PySyTFF.

As an aside, note that even though this blog post is about FL, we aren’t necessarily talking here about scenarios where data is physically siloed across physical locations – the data can also be hosted in a datacenter and logically siloed. More on this below.

Developer experience

The initial proof-of-concept implementation of PySyTFF offers an early glimpse of what the developer experience for the data scientist will look like. Note how we combine the advantages of both frameworks – e.g., TFF’s ability to define models in Keras, and PySyft’s access control mechanism and APIs for data access:


import syft as sy
import tensorflow as tf

# Log into the data provider's PySyft domain node.
domain = sy.login(email="sam@stargate.net", password="changethis", port=8081)

# The model architecture is defined in Keras, as usual in TFF.
model_fn = lambda: tf.keras.models.Sequential(...)

# Training parameters, including the privacy-related knobs.
params = {
    'rounds': 10,
    'no_clients': 3,
    'noise_multiplier': 0.05,
    'clients_per_round': 2,
    'train_data_id': domain.datasets[0]['images'].id_at_location.to_string(),
    'label_data_id': domain.datasets[0]['labels'].id_at_location.to_string()
}

# Ask the domain node to train the model; only the permitted outputs come back.
model, metrics = sy.tff.train_model(model_fn, params, domain, timeout=5000)

Here, the data scientist is logging into a PySyft domain node – an infrastructure component provisioned by or on behalf of the data provider – and gains a limited, access control-guarded ability to enumerate the available resources and perform actions on them. This includes obtaining references to datasets managed by the node and their metadata (but not their content) and issuing train_model calls, wherein the data scientist can supply a Keras model they wish to train, along with the various parameters that control the training process and affect the privacy guarantees of the computed result, such as the number of rounds or the amount of noise added to make the results of the model training more private. In return, the researcher may get computed outputs such as a set of evaluation metrics, or the trained model parameters.

Exactly which ranges of parameters supplied by the researcher are accepted by the platform, and which results the researcher can get, will in general depend on the policies defined by the data owner. These policies might, e.g., mandate the use of privacy-preserving algorithms and constrain the allowed privacy budget, which in turn may constrain parameters such as the number of training rounds, clients per round, or the noise multiplier. While at the current stage of development PySyTFF does not yet offer policy engine integration, this is an important part of the future development plans.

Under the hood

The domain node is a Docker-based environment that bundles together a web-based frontend you can securely log into, a mechanism for authenticating and authorizing users, and a set of internal services that includes database connectivity, as illustrated below.

The train_model call in the code snippet above, perhaps embedded in the data scientist’s Python colab notebook, is implemented as a network request, carrying a serialized representation of the TensorFlow code of the model to train, along with the training parameters, and the references to the PySyft datasets to use for training and evaluation.

Inside the domain node, the call is relayed to a PySyTFF service, a new component introduced to the PySyft ecosystem to orchestrate the training process. This involves interacting with PySyft’s data backend to obtain handles to shards of user data, calling TFF APIs to construct TFF computations to run, and passing the constructed TFF computations and data handles to an embedded instance of TFF runtime that loads the data using the supplied handles and runs the FL algorithms.
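The post doesn’t spell out the exact calls the PySyTFF service makes, but a plausible sketch of the “construct TFF computations to run” step, using public tff.learning APIs and a differentially private aggregator wired to the noise_multiplier and clients_per_round parameters above, might look like the following. The module paths vary across TFF releases, and the model definition, input_spec, and optimizer settings are placeholders rather than PySyTFF’s actual configuration.

import tensorflow as tf
import tensorflow_federated as tff

def tff_model_fn():
    # Wrap the data scientist's Keras model for use in TFF.
    # In PySyTFF the input_spec would be derived from the PySyft data handles.
    keras_model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
    ])
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=(
            tf.TensorSpec(shape=(None, 784), dtype=tf.float32),
            tf.TensorSpec(shape=(None,), dtype=tf.int32),
        ),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )

# Differentially private aggregation, parameterized like the params dict above.
dp_aggregator = tff.learning.dp_aggregator(
    noise_multiplier=0.05, clients_per_round=2
)

# A federated averaging process whose rounds the PySyTFF service would drive.
learning_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn=tff_model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    model_aggregator=dp_aggregator,
)

state = learning_process.initialize()
# Each round, the service would pass in the client datasets loaded via the handles:
# result = learning_process.next(state, client_datasets)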

FL on logically-siloed data

At this point, some of you may be wondering how exactly FL fits into the picture. After all, FL is mostly known as a technology that supports computations on data that’s distributed across a set of devices, or (in what’s called a cross-silo flavor of FL) a set of data centers owned by a group of institutions, yet here, we’re talking about a scenario where the data is already in the customer’s PySyft database.

To explain this, let’s pop up a level and consider the high level objective – to enable researchers to perform ML computations on sensitive data with platform-level, structural and formal privacy guarantees. In order to do so, the platform should ideally uphold formal privacy principles, such as data minimization (a guarantee on how the computation is executed and how sensitive data is handled), and anonymous aggregation (a guarantee on what is being computed and released).

Federated Learning is a great fit in this context because it structurally embodies these principles, and provides a framework for implementing algorithms that provably achieve user-level Differential Privacy (DP) – the current gold standard. The FL algorithms that enable us to achieve these guarantees can be used to process data in datacenter deployments, even in scenarios where – as is the case here with the PySyft database – all of that data resides in a single administrative domain.

To see this, just imagine that for each user in the database, we draw a virtual boundary around all their data, and think of it as a kind of virtual silo. We can treat such virtual silos of user data in the same way as how we treat “client” devices in a more traditional FL setting, and orchestrate FL algorithms to run across virtual silos as clients.

Thus, for example, when training an ML model, we’d repeatedly pick sets of users from the database, locally and independently train local model updates on their data – separately for each user, add clipping to each local update and noise for privacy, aggregate these local updates across users to produce an updated global model, and repeat this process for thousands of rounds until the ML model converges, as shown below.
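As a conceptual illustration only (not TFF code, and with made-up data structures), one round of this virtual-silo training loop could be sketched as follows; clip_norm and noise_multiplier stand in for the privacy parameters discussed earlier, and the toy linear model keeps the sketch short.

import numpy as np

def compute_local_update(global_model, user_data, lr=0.1):
    """One local gradient step for a linear model with squared loss (toy)."""
    x, y = user_data  # features (n, d) and targets (n,)
    grad = x.T @ (x @ global_model - y) / len(y)
    return -lr * grad

def train_one_round(global_model, user_silos, clients_per_round=2,
                    clip_norm=1.0, noise_multiplier=0.05, rng=None):
    """One round of federated averaging over virtual silos of user data."""
    rng = rng or np.random.default_rng()

    # Pick a subset of users (virtual silos) for this round.
    selected = rng.choice(len(user_silos), size=clients_per_round, replace=False)

    updates = []
    for idx in selected:
        # Train locally and independently on this user's data.
        local_update = compute_local_update(global_model, user_silos[idx])

        # Clip the update to bound any single user's influence.
        norm = np.linalg.norm(local_update)
        local_update = local_update * min(1.0, clip_norm / (norm + 1e-12))
        updates.append(local_update)

    # Aggregate across users and add Gaussian noise calibrated to the clip norm.
    aggregate = np.mean(updates, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / clients_per_round,
                       size=aggregate.shape)
    return global_model + aggregate + noise

# Toy usage: 5 virtual silos, 3-dimensional linear model, 100 rounds.
silos = [(np.random.randn(8, 3), np.random.randn(8)) for _ in range(5)]
model = np.zeros(3)
for _ in range(100):
    model = train_one_round(model, silos)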

Even though the data may be only logically partitioned, following this approach enables us to achieve the very same types of formal guarantees, including provable user-level differential privacy, as those cited above – and indeed, TFF enables us to leverage the same FL algorithm implementation – literally the same TFF code – as that which powers Google’s mobile/IoT production deployments.

Collaborate with us!

As noted earlier, the initial version of PySyTFF is still missing a number of components – and this, dear reader, is where you come in. If the vision laid out above excites you, we – the TFF and PySyft teams – would love to work with you to evolve this platform together. In addition to policy engine integration, we plan to augment PySyTFF with the ability to spawn distributed instances of the TFF runtime on cloud or compute clusters to power very compute-intensive workloads, to add a system of charging for the use of resources, and to extend the scope of PySyTFF to include classical types of cross-silo FL deployments, to name just a few.

There are a great many ways to go about this – from joining the TFF and PySyft teams’ collaborative efforts and directly helping us build and deploy this platform, to helping design and build generic components and APIs that can enable TFF and PySyft/PyGrid to interoperate.

Ready to get started? You can visit the SIG Federated forum and join the Discord server, or you can reach out directly – see the contact info in the SIG charter, and the engagement channels created by OpenMined’s PySyft team. We’re looking forward to hearing from you!

Acknowledgments

On behalf of the TFF team at Google, we’d like to thank our OpenMined partners Andrew Trask, Tudor Cebere, and Teo Milea for the productive collaboration leading up to this announcement.


Assessing AI system performance: thinking beyond models to deployment contexts

Figure 1: Performance assessment methods change across the development lifecycle for complex AI systems in ways that differ from general purpose AI. The emphasis shifts from rapid technical innovation that requires easy-to-calculate aggregate performance metrics at the beginning of the development process to metrics that reflect the performance of critical AI system attributes needed to underpin the user experience at the end.

AI systems are becoming increasingly complex as we move from visionary research to deployable technologies such as self-driving cars, clinical predictive models, and novel accessibility devices. Unlike singular AI models, it is more difficult to assess whether these more complex AI systems are performing consistently and as intended to realize human benefit. That difficulty stems from several characteristics of such systems:

    1. Real-world contexts in which the data might be noisy or different from the training data;
    2. Multiple AI components that interact with each other, creating unanticipated dependencies and behaviors;
    3. Human-AI feedback loops that come from repeated engagements between people and the AI system;
    4. Very large AI models (e.g., transformer models);
    5. AI models that interact with other parts of a system (e.g., user interface or heuristic algorithm).

How do we know when these more advanced systems are ‘good enough’ for their intended use? When assessing the performance of AI models, we often rely on aggregate performance metrics like percentage of accuracy. But this ignores the many, often human, elements that make up an AI system.

Our research on what it takes to build forward-looking, inclusive AI experiences has demonstrated that getting to ‘good enough’ requires multiple performance assessment approaches at different stages of the development lifecycle, based upon realistic data and key user needs (figure 1).

Shifting emphasis gradually from iterative adjustments in the AI models themselves toward approaches that improve the AI system as a whole has implications not only for how performance is assessed, but also for who should be involved in the performance assessment process. Engaging (and training) non-technical domain experts earlier (i.e., for choosing test data or defining experience metrics) and in a larger capacity throughout the development lifecycle can enhance the relevance, usability, and reliability of the AI system.


Performance assessment best practices from the PeopleLens

The PeopleLens (figure 2) is a new Microsoft technology designed to enable children who are born blind to experience social agency and build up the range of social attention skills needed to initiate and maintain social interactions. Running on smart glasses, it provides the wearer with continuous, real-time information about the people around them through spatial audio, helping them build up a dynamic map of the whereabouts of others. Its underlying technology is a complex AI system that uses several computer vision algorithms to calculate pose, identify registered people, and track those entities over time.

The PeopleLens offers a useful illustration of the wide range of performance assessment methods and people necessary to comprehensively gauge its efficacy.

Figure 2: The PeopleLens is a new research technology designed to help people who are blind or have low vision better understand their immediate social environments by locating and identifying people in the space dynamically in real-time.

Getting started: AI model or AI system performance?

Calculating aggregate performance metrics on open-source benchmarked datasets may demonstrate the capability of an individual AI model, but may be insufficient when applied to an entire AI system. It can be tempting to believe a single aggregate performance metric (such as accuracy) can be sufficient to validate multiple AI models individually. But the performance of two AI models in a system cannot be comprehensively measured by simple summation of each model’s aggregate performance metric.

We used two AI models to test the accuracy of the PeopleLens to locate and identify people: the first was a benchmarked, state-of-the-art pose model used to indicate the location of people in an image. The second was a novel facial recognition algorithm previously demonstrated to have greater than 90% accuracy. Despite strong historical performance of these two models, when applied to the PeopleLens, the AI system recognized only 10% of people from a realistic dataset in which people were not always facing the camera.
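A toy calculation (with invented numbers, not the PeopleLens measurements) illustrates why two individually strong components can still yield a weak system: the face recognizer’s reported accuracy only applies to the fraction of frames in which a detected person happens to be facing the camera.

# Toy numbers, purely illustrative; not PeopleLens data.
pose_detection_rate = 0.95        # person located in the frame
face_visible_rate = 0.15          # person happens to be facing the camera
face_recognition_accuracy = 0.90  # accuracy *given* a visible face

system_identification_rate = (
    pose_detection_rate * face_visible_rate * face_recognition_accuracy
)
print(f"{system_identification_rate:.0%}")  # roughly 13%, far below either model alone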

This finding illustrates that multi-algorithm systems are more than a sum of their parts, requiring specific performance assessment approaches.

Connecting to the human experience: Metric scorecards and realistic data 

Metrics scorecards, calculated on a realistic reference dataset, offer one way to connect to the human experience while the AI system is still undergoing significant technical iteration. A metrics scorecard can combine several metrics to measure aspects of the system that are most important to users.

We used ten metrics in the development of the PeopleLens. The two most valuable were time-to-first-identification, which measured how long it took from the moment a person appeared in a frame to the user hearing that person’s name, and the number of repeat false positives, which measured how often a false positive occurred in three or more frames in a row within the reference dataset.

The first metric captured the core value proposition for the user: having the social agency to be the first to say hello when someone approaches. The second was important because the AI system would self-correct single misidentifications, while repeated mistakes would lead to a poor user experience. This measured the ramifications of that accuracy throughout the system, rather than just on a per-frame basis.
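To make these two experience metrics concrete, here is a small illustrative sketch of how they could be computed from per-frame system logs; the log format (a list of per-frame records with timestamps, announced identities, and ground-truth identities) is entirely hypothetical.

from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float       # seconds
    detected_ids: set      # people the system announced in this frame
    ground_truth_ids: set  # people actually visible in this frame

def time_to_first_identification(frames, person_id):
    """Seconds between a person first being visible and first being announced."""
    first_visible = next(
        (f.timestamp for f in frames if person_id in f.ground_truth_ids), None
    )
    if first_visible is None:
        return None
    first_announced = next(
        (f.timestamp for f in frames
         if f.timestamp >= first_visible and person_id in f.detected_ids),
        None,
    )
    return None if first_announced is None else first_announced - first_visible

def repeat_false_positives(frames, min_run=3):
    """Count runs of at least min_run consecutive frames containing a false positive."""
    runs, current = 0, 0
    for f in frames:
        if f.detected_ids - f.ground_truth_ids:  # someone announced who isn't there
            current += 1
            if current == min_run:
                runs += 1
        else:
            current = 0
    return runs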

Beyond metrics: Using visualization tools to finetune the user experience

While metrics play a critical role in the development of AI systems, a wider range of tools is needed to finetune the intended user experience. It is essential for development teams to test on realistic datasets to understand how the AI system generates the actual user experience. This is especially important with complex systems, where multiple models, human-AI feedback loops, or unpredictable data (e.g., user-controlled data capture) can cause the AI system to respond unpredictably.

Visualization tools can enhance the top-down statistical tools of data scientists, helping domain experts contribute to system development. In the PeopleLens, we used custom-built visualization tools to compare side-by-side renditions of the experience with different model parameters (figure 3). We leveraged these visualizations to enable domain experts—in this case parents and teachers—to spot patterns of odd system behavior across the data.

Project Tokyo studio interface
Figure 3: Visualization tools helped the development team, including domain experts, in connecting the AI system to the user experience using realistic data. In this image, the top bar shows images taken from the wearable camera stream overlayed with the various model outcomes. The bottom bar shows the output of the world-state tracking algorithm on the left and the ground truth on the right. The panel in the middle shows model parameters that are being changed with the impact on the user experience being viewed in real time.

AI system performance in the context of the user experience

A user experience can only be as good as the underlying AI system. Testing the AI system in a realistic context, measuring things that matter to the users, is a critical stage before widespread deployment. We know, for example, that improving AI system performance does not necessarily correspond to improved performance of AI teams (reference).

We also know that human-to-AI feedback loops can make it difficult to measure an AI system’s performance. Essentially repeated interactions between AI system and user, these feedback loops can surface (and intensify) errors. They can also, through good intelligibility, be repaired by the user.

The PeopleLens system gave users feedback about people’s locations and faces. A missed identification (e.g., because the wearer is looking at a person’s chest rather than their face) can be resolved once the user responds to the feedback (e.g., by looking up). This example shows that we do not need to focus on missed identifications, as they are resolved by the human-AI feedback loop. However, users were very perplexed by the identification of people who were no longer present, so performance assessments needed to focus on these false positive misidentifications.

Several practices for assessing the performance of complex AI systems emerged from this work:

    1. Multiple performance assessment methods should be used in AI system development. In contrast to developing individual AI models, general aggregate performance metrics are a small component, relevant primarily in the earliest stages of development.
    2. Documenting AI system performance should include a range of approaches, from metrics scorecards to system performance metrics for a deployed user experience, to visualization tools.
    3. Domain experts play an important role in performance assessment, beginning early in the development lifecycle, yet they are often not prepared or trained for the depth of participation that is optimal in AI system development.
    4. Visualization tools are as important as metrics in creating and documenting an AI system for a particular intended use. It is critical that domain experts have access to these tools as key decision-makers in AI system deployment.

Bringing it all together 

For complex AI systems, performance assessment methods change across the development lifecycle in ways that differ from individual AI models. The emphasis shifts from rapid technical innovation, which requires easy-to-calculate aggregate metrics at the beginning of the development process, to performance metrics that reflect the critical AI system attributes making up the user experience toward the end of development. This shift helps every type of stakeholder precisely and collectively define what is ‘good enough’ to achieve the intended use.

It is useful for developers to remember performance assessment is not an end goal in itself; it is a process that defines how the system has reached its best state and whether that state is ready for deployment. The performance assessment process must include a broad range of stakeholders, including domain experts, who may need new tools to fulfill critical (sometimes unexpected) roles in the development and deployment of an AI system.
