Scaling vision transformers to 22 billion parameters

Large Language Models (LLMs) like PaLM or GPT-3 showed that scaling transformers to hundreds of billions of parameters improves performance and unlocks emergent abilities. The biggest dense models for image understanding, however, have reached only 4 billion parameters, despite research indicating that promising multimodal models like PaLI continue to benefit from scaling vision models alongside their language counterparts. Motivated by this, and the results from scaling LLMs, we decided to undertake the next step in the journey of scaling the Vision Transformer.

In “Scaling Vision Transformers to 22 Billion Parameters”, we introduce the biggest dense vision model, ViT-22B. It is 5.5x larger than the previous largest vision backbone, ViT-e, which has 4 billion parameters. To enable this scaling, ViT-22B incorporates ideas from scaling text models like PaLM, with improvements to both training stability (using QK normalization) and training efficiency (with a novel approach called asynchronous parallel linear operations). As a result of its modified architecture, efficient sharding recipe, and bespoke implementation, it could be trained on Cloud TPUs with high hardware utilization1. ViT-22B advances the state of the art on many vision tasks using frozen representations, or with full fine-tuning. Further, the model has also been successfully used in PaLM-E, which showed that a large model combining ViT-22B with a language model can significantly advance the state of the art in robotics tasks.

Architecture

Our work builds on many advances from LLMs, such as PaLM and GPT-3. Compared to the standard Vision Transformer architecture, we use parallel layers, an approach in which attention and MLP blocks are executed in parallel, instead of sequentially as in the standard Transformer. This approach was used in PaLM and reduced training time by 15%.

Secondly, ViT-22B omits biases in the QKV projections, part of the self-attention mechanism, and in the LayerNorms, which increases utilization by 3%. The diagram below shows the modified transformer architecture used in ViT-22B:

ViT-22B transformer encoder architecture uses parallel feed-forward layers, omits biases in QKV and LayerNorm layers and normalizes Query and Key projections.
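To make these changes concrete, here is a minimal PyTorch-style sketch of a ViT-22B-like encoder block (our own illustration, not the actual JAX/TPU implementation; dimensions and module names are placeholders): the attention and MLP branches read the same normalized input and their outputs are summed, the QKV projection and LayerNorms carry no bias terms, and queries and keys are normalized before the attention logits are computed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelViTBlock(nn.Module):
    """Illustrative ViT-22B-style block: parallel attention/MLP branches,
    no biases in the QKV projection or LayerNorms, and QK normalization."""

    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        # bias=False on LayerNorm requires PyTorch >= 2.1; otherwise drop the flag.
        self.norm = nn.LayerNorm(dim, bias=False)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # no bias in the QKV projection
        self.attn_out = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # QK normalization: LayerNorm applied per head to queries and keys.
        self.q_norm = nn.LayerNorm(self.head_dim, bias=False)
        self.k_norm = nn.LayerNorm(self.head_dim, bias=False)

    def forward(self, x):                                # x: (batch, tokens, dim)
        h = self.norm(x)
        b, n, d = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, n, self.heads, self.head_dim))
        k = self.k_norm(k.view(b, n, self.heads, self.head_dim))
        v = v.view(b, n, self.heads, self.head_dim)
        attn = torch.einsum("bnhd,bmhd->bhnm", q, k) / self.head_dim ** 0.5
        attn = F.softmax(attn, dim=-1)
        attn_out = torch.einsum("bhnm,bmhd->bnhd", attn, v).reshape(b, n, d)
        # Parallel layers: the attention and MLP branches both read h and are summed with x.
        return x + self.attn_out(attn_out) + self.mlp(h)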

Models at this scale necessitate “sharding” — distributing the model parameters across different compute devices. Alongside this, we also shard the activations (the intermediate representations of an input). Even something as simple as a matrix multiplication necessitates extra care, as both the input and the matrix itself are distributed across devices. We develop an approach called asynchronous parallel linear operations, whereby communications of activations and weights between devices occur at the same time as computations in the matrix multiply unit (the part of the TPU holding the vast majority of the computational capacity). This asynchronous approach minimizes the time waiting on incoming communication, thus increasing device efficiency. The animation below shows an example computation and communication pattern for a matrix multiplication.

Asynchronous parallel linear operation. The goal is to compute the matrix multiplication y = Ax, but both the matrix A and the activation x are distributed across different devices. Here we illustrate how this can be done while overlapping communication and computation across devices. The matrix A is column-sharded across the devices, each holding a contiguous slice, with each block represented as Aij. More details are in the paper.
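As a rough illustration of the underlying decomposition (a serial NumPy sketch under simplifying assumptions, not the TPU implementation), a column-sharded matrix multiply reduces to per-device partial products whose sum is an all-reduce; the asynchronous recipe additionally overlaps that cross-device communication with the local multiplies, which a serial loop cannot show.

import numpy as np

# Sketch only: y = A @ x with A column-sharded across `num_devices`. Each
# "device" holds one column block A_j and the matching activation slice x_j,
# computes a partial product, and the partials are summed (an all-reduce in
# a real system). ViT-22B's recipe overlaps that communication with compute.
num_devices, d_in, d_out = 4, 16, 8
A = np.random.randn(d_out, d_in)
x = np.random.randn(d_in)

A_blocks = np.split(A, num_devices, axis=1)   # column shards A_0 .. A_3
x_blocks = np.split(x, num_devices)           # matching activation shards

partials = [A_j @ x_j for A_j, x_j in zip(A_blocks, x_blocks)]  # local compute per device
y = np.sum(partials, axis=0)                  # "all-reduce" of the partial results

assert np.allclose(y, A @ x)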

At first, the new model scale resulted in severe training instabilities. The normalization approach of Gilmer et al. (2023, upcoming) resolved these issues, enabling smooth and stable model training; this is illustrated below with example training progressions.

The effect of normalizing the queries and keys (QK normalization) in the self-attention layer on the training dynamics. Without QK normalization (red) gradients become unstable and the training loss diverges.

Results

Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.

To illustrate the richness of the learned representation, we train a text model to produce text representations that are aligned with ViT-22B’s image representations (using LiT-tuning). Below we show several results for out-of-distribution images generated by Parti and Imagen:

Examples of image+text understanding for ViT-22B paired with a text model. The graph shows normalized probability distribution for each description of an image.
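For readers unfamiliar with LiT-tuning, the following is a hedged sketch of the general recipe (function and variable names are ours, not from the paper): the vision tower stays frozen and only the text tower is trained, using a symmetric contrastive loss that pulls matching image-text pairs together.

import torch
import torch.nn.functional as F

def lit_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Sketch of a LiT-style objective: image embeddings come from a frozen
    vision tower (e.g. ViT-22B), text embeddings from a trainable text tower,
    and a symmetric contrastive loss aligns matching pairs."""
    image_embeds = F.normalize(image_embeds, dim=-1)   # frozen features, unit norm
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(logits.shape[0])            # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2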

Human object recognition alignment

To find out how aligned ViT-22B classification decisions are with human classification decisions, we evaluated ViT-22B fine-tuned with different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available via the model-vs-human toolbox. This toolbox measures three key metrics: How well do models cope with distortions (accuracy)? How different are human and model accuracies (accuracy difference)? Finally, how similar are human and model error patterns (error consistency)? While not all fine-tuning resolutions perform equally well, ViT-22B variants are state of the art for all three metrics. Furthermore, the ViT-22B models also have the highest shape bias ever recorded in vision models. This means that they mostly use object shape, rather than object texture, to inform classification decisions — a strategy known from human perception (which has a shape bias of 96%). Standard models (e.g., ResNet-50, which has a ~20–30% shape bias) often classify images like the cat with elephant texture below according to the texture (elephant); models with a high shape bias tend to focus on the shape instead (cat). While there are still many important differences between human and model perception, ViT-22B shows increased similarities to human visual object recognition.

Cat or elephant? Car or clock? Bird or bicycle? Example images with the shape of one object and the texture of a different object, used to measure shape/texture bias.
Shape bias evaluation (higher = more shape-biased). Many vision models have a low shape / high texture bias, whereas ViT-22B models fine-tuned on ImageNet (red, green, blue; trained on 4B images as indicated by the brackets after model names, unless trained on ImageNet only) have the highest shape bias recorded in an ML model to date, bringing them closer to a human-like shape bias.
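For reference, shape bias is computed from cue-conflict images on which a model picks either the shape class or the texture class; the tiny sketch below follows that standard definition (our own simplification, not the model-vs-human toolbox code):

def shape_bias(shape_decisions, texture_decisions):
    """Fraction of shape-or-texture decisions that followed the object's shape.
    Inputs are counts over cue-conflict images (e.g. cat shape + elephant texture)."""
    return shape_decisions / (shape_decisions + texture_decisions)

# Example: 96 shape-based vs. 4 texture-based decisions -> shape bias of 0.96,
# roughly the human level quoted above.
print(shape_bias(96, 4))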

Out-of-distribution performance

Measuring performance on OOD datasets helps assess generalization. In this experiment we construct label-maps (mappings of labels between datasets) from JFT to ImageNet and also from ImageNet to different out-of-distribution datasets like ObjectNet (results after pre-training on this data shown in the left curve below). Then the models are fully fine-tuned on ImageNet.

We observe that scaling Vision Transformers increases OOD performance: even though ImageNet accuracy saturates, we see a significant increase on ObjectNet from ViT-e to ViT-22B (shown by the three orange dots in the upper right below).

Even though ImageNet accuracy saturates, we see a significant increase in performance on ObjectNet from ViT-e/14 to ViT-22B.

Linear probe

A linear probe is a technique where a single linear layer is trained on top of a frozen model. Compared to full fine-tuning, this is much cheaper to train and easier to set up. We observed that the linear-probe performance of ViT-22B approaches that of state-of-the-art full fine-tuning of smaller models using high-resolution images (training with higher resolution is generally much more expensive, but for many tasks it yields better results). Here are results of a linear probe trained on the ImageNet dataset and evaluated on the ImageNet validation set and other OOD ImageNet datasets.
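A minimal sketch of what linear probing looks like in code (PyTorch-style, with placeholder names such as frozen_backbone and train_loader; not the exact setup used in the paper):

import torch
import torch.nn as nn

def train_linear_probe(frozen_backbone, train_loader, feat_dim, num_classes, epochs=10):
    """Train a single linear layer on top of frozen features.
    frozen_backbone, train_loader, feat_dim, and num_classes are placeholders."""
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    frozen_backbone.eval()                      # backbone weights never change
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():               # no gradients through the backbone
                feats = frozen_backbone(images)
            loss = nn.functional.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe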

Linear probe results trained on ImageNet, evaluated on ImageNet-ReaL, ImageNet-v2, ObjectNet, ImageNet-R, and ImageNet-A datasets. High-resolution fine-tuned ViT-e/14 provided as a reference.

Distillation

The knowledge of the bigger model can be transferred to a smaller model using the distillation method. This is helpful as big models are slower and more expensive to use. We found that ViT-22B knowledge can be transferred to smaller models like ViT-B/16 and ViT-L/16, achieving a new state of the art on ImageNet for those model sizes.

Model      Approach (dataset)                                   ImageNet1k Accuracy
ViT-B/16   Transformers for Image Recognition at Scale (JFT)    84.2
ViT-B/16   Scaling Vision Transformers (JFT)                    86.6
ViT-B/16   DeiT III: Revenge of the ViT (INet21k)               86.7
ViT-B/16   Distilled from ViT-22B (JFT)                         88.6
ViT-L/16   Transformers for Image Recognition at Scale (JFT)    87.1
ViT-L/16   Scaling Vision Transformers (JFT)                    88.5
ViT-L/16   DeiT III: Revenge of the ViT (INet21k)               87.7
ViT-L/16   Distilled from ViT-22B (JFT)                         89.6
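The distillation objective behind these numbers can be sketched as a standard soft-target loss (a generic, hedged example; the exact recipe and hyperparameters used for ViT-22B may differ):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: the student matches the softened
    teacher distribution while also fitting the ground-truth labels.
    Temperature T and weight alpha are illustrative, not the paper's values."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce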

Fairness and bias

ML models can be susceptible to unintended unfair biases, such as picking up spurious correlations (measured using demographic parity) or having performance gaps across subgroups. We show that scaling up the size helps in mitigating such issues.

First, scale offers a more favorable tradeoff frontier — performance improves with scale even when the model is post-processed after training to control its level of demographic parity below a prescribed, tolerable level. Importantly, this holds not only when performance is measured in terms of accuracy, but also other metrics, such as calibration, which is a statistical measure of the truthfulness of the model’s estimated probabilities. Second, classification of all subgroups tends to improve with scale as demonstrated below. Third, ViT-22B reduces the performance gap across subgroups.
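For concreteness, the demographic parity level being controlled can be summarized as the gap in positive-prediction rates between subgroups; the sketch below is our own simplification for binary groups, not the evaluation code used in the paper:

import numpy as np

def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive-prediction rates between two subgroups.
    predictions: binary classifier outputs; groups: subgroup labels (0 or 1)."""
    predictions, groups = np.asarray(predictions), np.asarray(groups)
    rate_0 = predictions[groups == 0].mean()
    rate_1 = predictions[groups == 1].mean()
    return abs(rate_0 - rate_1)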

Top: Accuracy for each subgroup in CelebA before debiasing. Bottom: The y-axis shows the absolute difference in performance across the two specific subgroups highlighted in this example: females and males. ViT-22B has a small gap in performance compared to smaller ViT architectures.

Conclusions

We have presented ViT-22B, currently the largest vision transformer model at 22 billion parameters. With small but critical changes to the original architecture, we achieved excellent hardware utilization and training stability, yielding a model that advances the state of the art on several benchmarks. Great performance can be achieved using the frozen model to produce embeddings and then training thin layers on top. Our evaluations further show that ViT-22B shows increased similarities to human visual perception when it comes to shape and texture bias, and offers benefits in fairness and robustness, when compared to existing models.

Acknowledgements

This is a joint work of Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, and Neil Houlsby.

We would like to thank Jasper Uijlings, Jeremy Cohen, Arushi Goel, Radu Soricut, Xingyi Zhou, Lluis Castrejon, Adam Paszke, Joelle Barral, Federico Lebron, Blake Hechtman, and Peter Hawkins. Their expertise and unwavering support played a crucial role in the completion of this paper. We also acknowledge the collaboration and dedication of the talented researchers and engineers at Google Research.


1Note: ViT-22B has 54.9% model FLOPs utilization (MFU), while PaLM reported 46.2% MFU and we measured 44.0% MFU for ViT-e on the same hardware.

Reduce call hold time and improve customer experience with self-service virtual agents using Amazon Connect and Amazon Lex

This post was co-written with Tony Momenpour and Drew Clark from KYTC.

Government departments and businesses operate contact centers to connect with their communities, enabling citizens and customers to call to make appointments, request services, and sometimes just ask a question. When there are more calls than agents can answer, callers get placed on hold with a message such as the following: “We are experiencing higher than usual call volumes. Your call is very important to us, please stay on the line and your call will be answered in the order it was received.”

Unless the hold music is particularly good, callers don’t typically enjoy having to wait—it wastes time and money. Some contact centers play automated messages to encourage the caller to leave a voicemail, visit the website, or call back later. These options are unsatisfying to callers who just want to ask an agent a question to get an answer quickly.

One solution is to have enough trained agents available to take all the calls right away, even during times of unusually high call volumes. This would eliminate hold times and ensure that callers receive fast responses. The key to making this approach practical is to augment human agents with scalable, AI-powered virtual agents that can address callers’ needs for at least some of the incoming calls. When a virtual agent successfully addresses a caller’s enquiry, the result is a happy caller, lower average hold times for all callers, and lower costs. Gartner’s Customer Service and Support Leader poll estimates that live channels such as phone and live chat cost an average of $8.01 per contact, while self-service channels cost about $0.10 per contact—a virtual agent can potentially save $7.91 (98%) for every call it successfully handles.

A virtual agent doesn’t have to handle every call, and it probably shouldn’t try—some portion of calls are likely served best with a human touch, so a good virtual agent should know its own limitations, and quickly transfer the caller to a human agent when needed.

In this post, we share how the Kentucky Transportation Cabinet’s (KYTC) Department of Vehicle Regulations (DVR) reduced call hold time and improved customer experience with self-service virtual agents using Amazon Connect and Amazon Lex.

KYTC DVR’s challenges

The KYTC DVR supports, assists and provides information related to vehicle registration, driver licenses, and commercial vehicle credentials to nearly 5 million constituents.

“In a recent survey conducted with Kentucky citizens, more than 50% actually wanted help without speaking to someone,” says Drew Clark, Business Analyst and Project Manager at KYTC.

There were several challenges the KYTC team faced that made it necessary for them to replace the existing system with Amazon Connect and Amazon Lex. The lack of flexibility in the existing customer service system prevented them from providing their customers the best user experience and from innovating further by introducing features like the ability to handle redundant queries via chat. Also, the introduction of federal REAL ID requirements in 2019 resulted in increased call volumes from drivers with questions. Call volumes increased further in 2020 when the COVID-19 pandemic struck and driver licensing regional offices closed. Callers experienced an average handle time of 5 minutes or longer—an undesirable situation for both the callers and the DVR contact center professionals. In addition, there was an over-reliance on the callback feature, resulting in a below par customer experience.

Solution overview

To tackle these challenges, the KYTC team reviewed several contact center solutions and collaborated with the AWS ProServe team to implement a cloud-based contact center and a virtual agent named Max. Currently, customers can interact with the contact center via voice and chat channels. The contact center is powered by Amazon Connect, and Max, the virtual agent, is powered by Amazon Lex and the AWS QnABot solution.

Amazon Connect directs some incoming calls to the virtual agent (Max) by identifying the caller number. Max uses natural language processing (NLP) to find the best answer to a caller’s question from the DVR’s knowledge base of questions and answers, and responds to the caller using a natural and human-like synthesized voice (powered by Amazon Polly), supplemented when appropriate with an SMS text message containing links to webpages that provide relevant detailed information. With Amazon Lex, the department was able to automate tasks like providing information on REAL IDs, and renewing driver’s licenses or vehicle registrations. If the caller can’t find the desired answer, the call is transferred to a live agent.
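To give a flavor of the kind of integration involved, here is a hedged sketch that sends a caller utterance to an Amazon Lex V2 bot using the runtime API (the bot ID, alias, locale, and question are placeholders; in the actual solution, Amazon Connect invokes the bot as part of the contact flow rather than through code like this):

import boto3

# All identifiers below are placeholders, not KYTC's real bot configuration.
lex = boto3.client("lexv2-runtime")

response = lex.recognize_text(
    botId="EXAMPLEBOTID",
    botAliasId="EXAMPLEALIAS",
    localeId="en_US",
    sessionId="caller-session-123",
    text="How do I renew my driver's license?",
)

for message in response.get("messages", []):
    print(message["content"])   # the answer the virtual agent would speak back via Amazon Polly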

The KYTC DVR reports that with the new system, they can handle the same or greater call volumes at a lower operational cost than the previous system. The call handling time has been reduced by 33%. They consistently see 90% of the QnABot traffic routing through the self-service option on the website. The QnABot is now handling close to 35% of the incoming phone calls without the need for human intervention, during regular business hours and after hours as well! In addition, agent training time was reduced to 2 weeks from 4 weeks due to Amazon Connect’s intuitive design and ease of use. Not only did DVR improve the customer and agent experience, but they also avoided high up-front costs and reduced their overall operational cost.

Amazon Lex and the AWS QnABot

Amazon Lex is an AWS service for creating conversational interfaces. You can use Amazon Lex to build capable self-service virtual agents for your contact center to automate a wide variety of caller experiences, such as claims, quotes, payments, purchases, appointments, and more.

The AWS QnABot is an open-source solution that uses Amazon Lex along with other AWS services to automate question answering use cases.

QnABot allows you to quickly deploy a conversational AI virtual agent into your contact centers, websites, and messaging channels, with no coding experience required. You configure curated answers to frequently asked questions using an integrated content management system that supports rich text and rich voice responses optimized for each channel. You can expand the solution’s knowledge base to include searching existing documents and webpage content using Amazon Kendra. QnABot uses Amazon Translate to support user interaction in many languages.

Integrated user feedback and monitoring provide visibility into customer queries, concerns, and sentiment. This enables you to tune and enrich your content, effectively teaching your virtual agent so it gets smarter all the time.

Conclusion

The KYTC DVR contact center has achieved impressive customer experience and cost-efficiency improvements by deploying an Amazon Connect cloud-based contact center, along with a virtual agent built with Amazon Lex and the open-source AWS QnABot solution.

Curious to see if you can benefit from the same approaches that worked for the KYTC DVR? Check out these short demo videos:

Try Amazon Lex or the QnABot for yourself in your own AWS account. You can follow the steps in the implementation guide for automated deployment, or explore the AWS QnABot workshop.

We’d love to hear from you. Let us know what you think in the comments section.


About the Authors

Tony Momenpour is a systems consultant within the Kentucky Transportation Cabinet. He has worked for the Commonwealth of Kentucky for 19 years in various roles.  His focus is to assist the Commonwealth with being able to provide its citizens a great customer service experience.

Drew Clark is a business analyst/project manager for the Kentucky Transportation Cabinet’s Office of Information Technology. He is focusing on system architecture, application platforms, and modernization for the cabinet. He has been with the Transportation Cabinet since 2016 working in various IT roles.

Rajiv Sharma is a Domain Lead – Contact Center in the AWS Data and Machine Learning team. Rajiv works with our customers to deliver engagements using Amazon Connect and Amazon Lex.

Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Build end-to-end document processing pipelines with Amazon Textract IDP CDK Constructs

Intelligent document processing (IDP) with AWS helps automate information extraction from documents of different types and formats, quickly and with high accuracy, without the need for machine learning (ML) skills. Faster information extraction with high accuracy can help you make quality business decisions on time, while reducing overall costs. For more information, refer to Intelligent document processing with AWS AI services: Part 1.

However, complexity arises when implementing real-world scenarios. Documents are often sent out of order, or they may be sent as a combined package with multiple form types. Orchestration pipelines need to be created to introduce business logic, and also account for different processing techniques depending on the type of form inputted. These challenges are only magnified as teams deal with large document volumes.

In this post, we demonstrate how to solve these challenges using Amazon Textract IDP CDK Constructs, a set of pre-built IDP constructs, to accelerate the development of real-world document processing pipelines. For our use case, we process an Acord insurance document to enable straight-through processing, but you can extend this solution to any use case, which we discuss later in the post.

Acord document processing at scale

Straight-through processing (STP) is a term used in the financial industry to describe the automation of a transaction from start to finish without the need for manual intervention. The insurance industry uses STP to streamline the underwriting and claims process. This involves the automatic extraction of data from insurance documents such as applications, policy documents, and claims forms. Implementing STP can be challenging due to the large amount of data and the variety of document formats involved. Insurance documents are inherently varied. Traditionally, this process involves manually reviewing each document and entering the data into a system, which is time-consuming and prone to errors. This manual approach is not only inefficient but can also lead to errors that can have a significant impact on the underwriting and claims process. This is where IDP on AWS comes in.

To achieve a more efficient and accurate workflow, insurance companies can integrate IDP on AWS into the underwriting and claims process. With Amazon Textract and Amazon Comprehend, insurers can read handwriting and different form formats, making it easier to extract information from various types of insurance documents. By implementing IDP on AWS into the process, STP becomes easier to achieve, reducing the need for manual intervention and speeding up the overall process.

This pipeline allows insurance carriers to easily and efficiently process their commercial insurance transactions, reducing the need for manual intervention and improving the overall customer experience. We demonstrate how to use Amazon Textract and Amazon Comprehend to automatically extract data from commercial insurance documents, such as Acord 140, Acord 125, Affidavit of Home Ownership, and Acord 126, and analyze the extracted data to facilitate the underwriting process. These services can help insurance carriers improve the accuracy and speed of their STP processes, ultimately providing a better experience for their customers.

Solution overview

The solution is built using the AWS Cloud Development Kit (AWS CDK), and consists of Amazon Comprehend for document classification, Amazon Textract for document extraction, Amazon DynamoDB for storage, AWS Lambda for application logic, and AWS Step Functions for workflow pipeline orchestration.

The pipeline consists of the following phases:

  1. Split the document packages and classify each form type using Amazon Comprehend.
  2. Run the processing pipelines for each form type or page of form with the appropriate Amazon Textract API (Signature Detection, Table Extraction, Forms Extraction, or Queries).
  3. Perform postprocessing of the Amazon Textract output into machine-readable format.

The following screenshot of the Step Functions workflow illustrates the pipeline.

Prerequisites

To get started with the solution, ensure you have the following:

  • AWS CDK version 2 installed
  • Docker installed and running on your machine
  • Appropriate access to Step Functions, DynamoDB, Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Textract, and Amazon Comprehend

Clone the GitHub repo

Start by cloning the GitHub repository:

git clone https://github.com/aws-samples/aws-textract-e2e-processing.git

Create an Amazon Comprehend classification endpoint

We first need to provide an Amazon Comprehend classification endpoint.

For this post, the endpoint detects the following document classes (ensure naming is consistent):

  • acord125
  • acord126
  • acord140
  • property_affidavit

You can create one by using the comprehend_acord_dataset.csv sample dataset in the GitHub repository. To train and create a custom classification endpoint using the sample dataset provided, follow the instructions in Train custom classifiers. If you would like to use your own PDF files, refer to the first workflow in the post Intelligently split multi-form document packages with Amazon Textract and Amazon Comprehend.

After training your classifier and creating an endpoint, you should have an Amazon Comprehend custom classification endpoint ARN that looks like the following code:

arn:aws:comprehend:<REGION>:<ACCOUNT_ID>:document-classifier-endpoint/<CLASSIFIER_NAME>

Navigate to docsplitter/document_split_workflow.py and modify lines 27–28, which contain comprehend_classifier_endpoint. Enter your endpoint ARN in line 28.
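Optionally, you can sanity-check the endpoint from Python before deploying the stack (a hedged sketch; the ARN and sample text below are placeholders):

import boto3

comprehend = boto3.client("comprehend")

# Placeholder -- use the endpoint ARN created for your custom classifier.
endpoint_arn = "arn:aws:comprehend:<REGION>:<ACCOUNT_ID>:document-classifier-endpoint/<CLASSIFIER_NAME>"

response = comprehend.classify_document(
    Text="ACORD 125 COMMERCIAL INSURANCE APPLICATION ...",  # sample page text
    EndpointArn=endpoint_arn,
)

for cls in response["Classes"]:
    print(cls["Name"], round(cls["Score"], 3))   # e.g. acord125 0.98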

Install dependencies

Now you install the project dependencies:

python -m pip install -r requirements.txt

Initialize the account and Region for the AWS CDK. This will create the Amazon Simple Storage Service (Amazon S3) buckets and roles for the AWS CDK tool to store artifacts and be able to deploy infrastructure. See the following code:

cdk bootstrap

Deploy the AWS CDK stack

When the Amazon Comprehend classifier and document configuration table are ready, deploy the stack using the following code:

cdk deploy DocumentSplitterWorkflow --outputs-file document_splitter_outputs.json --require-approval never

Upload the document

Verify that the stack is fully deployed.

Then in the terminal window, run the aws s3 cp command to upload the document to the DocumentUploadLocation for the DocumentSplitterWorkflow:

aws s3 cp sample-doc.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`DocumentSplitterWorkflow-DocumentUploadLocation`].Value' --output text)

We have created a sample 12-page document package that contains the Acord 125, Acord 126, Acord 140, and Property Affidavit forms. The following images show a 1-page excerpt from each document.

All data in the forms is synthetic, and the Acord standard forms are the property of the Acord Corporation, and are used here for demonstration only.

Run the Step Functions workflow

Now open the Step Functions workflow. You can get the Step Functions workflow link from the document_splitter_outputs.json file, the Step Functions console, or by using the following command:

aws cloudformation list-exports --query 'Exports[?Name==`DocumentSplitterWorkflow-StepFunctionFlowLink`].Value' --output text

Depending on the size of the document package, the workflow time will vary. The sample document should take 1–2 minutes to process. The following diagram illustrates the Step Functions workflow.

When your job is complete, navigate to the input and output code. From here you will see the machine-readable CSV files for each of the respective forms.

To download these files, open getfiles.py. Set files to be the list outputted by the state machine run. You can run this function by running python3 getfiles.py. This will generate the csvfiles_<TIMESTAMP> folder, as shown in the following screenshot.

Congratulations, you have now implemented an end-to-end processing workflow for a commercial insurance application.

Extend the solution for any type of form

In this post, we demonstrated how we could use the Amazon Textract IDP CDK Constructs for a commercial insurance use case. However, you can extend these constructs for any form type. To do this, we first retrain our Amazon Comprehend classifier to account for the new form type, and adjust the code as we did earlier.

For each of the form types you trained, you must specify its queries and textract_features in the generate_csv.py file. This customizes each form type’s processing pipeline by using the appropriate Amazon Textract API.

Queries is a list of queries. For example, “What is the primary email address?” on page 2 of the sample document. For more information, see Queries.

textract_features is a list of the Amazon Textract features you want to extract from the document. It can be TABLES, FORMS, QUERIES, or SIGNATURES. For more information, see FeatureTypes.
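For reference, the underlying Amazon Textract request for a page that uses queries and signature detection looks roughly like the following (a standalone, hedged sketch; the IDP CDK Constructs assemble this call for you, and the bucket, file, and query values are placeholders):

import boto3

textract = boto3.client("textract")

response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-example-bucket", "Name": "property_affidavit_page.png"}},
    FeatureTypes=["QUERIES", "SIGNATURES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is your name?", "Alias": "PROP_AFF_OWNER"},
            {"Text": "What is the property's address?", "Alias": "PROP_AFF_ADDR"},
        ]
    },
)

# Query answers come back as QUERY_RESULT blocks in the response.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block.get("Text"))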

Navigate to generate_csv.py. Each document type needs its classification, queries, and textract_features configured by creating CSVRow instances.

For our example, we have four document types: acord125, acord126, acord140, and property_affidavit. In the following, we want to use the FORMS and TABLES features on the Acord documents, and the QUERIES and SIGNATURES features for the property affidavit.

from typing import List

# CSVRow is defined in generate_csv.py in the repository.

def get_csv_rows():
    # acord125
    acord125_queries: List[List[str]] = list()
    acord_125_features: List[str] = ["FORMS", "TABLES"]
    acord125_row = CSVRow("acord125",
                          acord125_queries,
                          acord_125_features)
    # acord126
    acord126_queries: List[List[str]] = list()
    acord126_features: List[str] = ["FORMS", "TABLES"]
    acord126_row = CSVRow("acord126",
                          acord126_queries,
                          acord126_features)
    # acord140
    acord140_queries: List[List[str]] = list()
    acord140_features: List[str] = ["FORMS", "TABLES"]
    acord140_row = CSVRow("acord140",
                          acord140_queries,
                          acord140_features)
    # property_affidavit
    property_affidavit_queries: List[List[str]] = [
        ["PROP_AFF_OWNER", "What is your name?"],
        ["PROP_AFF_ADDR", "What is the property's address?"],
        ["PROP_AFF_DATE_EXEC_ON", "When was this executed on?"],
        ["PROP_AFF_DATE_SWORN", "When was this subscribed and sworn to?"],
        ["PROP_AFF_NOTARY", "Who is the notary public?"],
    ]
    property_affidavit_features: List[str] = ["SIGNATURES", "QUERIES"]
    property_affidavit_row = CSVRow("property_affidavit",
                                    property_affidavit_queries,
                                    property_affidavit_features)
    # Return one row per document type (assumed; see the repository for the full file).
    return [acord125_row, acord126_row, acord140_row, property_affidavit_row]

Refer to the GitHub repository for how this was done for the sample commercial insurance documents.

Clean up

To remove the solution, run the cdk destroy command. You will then be prompted to confirm the deletion of the workflow. Deleting the workflow will delete all the generated resources.

Conclusion

In this post, we demonstrated how you can get started with Amazon Textract IDP CDK Constructs by implementing a straight-through processing scenario for a set of commercial Acord forms. We also demonstrated how you can extend the solution to any form type with simple configuration changes. We encourage you to try the solution with your respective documents. Please raise a pull request to the GitHub repo for any feature requests you may have. To learn more about IDP on AWS, refer to our documentation.


About the Authors

Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).

Aditi Rajnish is a second-year software engineering student at the University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.

Enzo Staton is a Solutions Architect with a passion for working with companies to increase their cloud knowledge. He works closely as a trusted advisor and industry specialist with customers around the country.

Are Model Explanations Useful in Practice? Rethinking How to Support Human-ML Interactions.

Figure 1. This blog post discusses the effectiveness of black-box model explanations in aiding end users to make decisions. We observe that explanations do not in fact help with concrete applications such as fraud detection and paper matching for peer review. Our work further motivates novel directions for developing and evaluating tools to support human-ML interactions.

Model explanations have been touted as crucial information to facilitate human-ML interactions in many real-world applications where end users make decisions informed by ML predictions. For example, explanations are thought to assist model developers in identifying when models rely on spurious artifacts and to aid domain experts in determining whether to follow a model’s prediction. However, while numerous explainable AI (XAI) methods have been developed, XAI has yet to deliver on this promise. XAI methods are typically optimized for diverse but narrow technical objectives disconnected from their claimed use cases. To connect methods to concrete use cases, we argued in our Communications of the ACM paper [1] for researchers to rigorously evaluate how well proposed methods can help real users in their real-world applications.

Towards bridging this gap, our group has since completed two collaborative projects where we worked with domain experts in e-commerce fraud detection and paper matching for peer review. Through these efforts, we’ve gleaned the following two insights:

  1. Existing XAI methods are not useful for decision-making. Presenting humans with popular, general-purpose XAI methods does not improve their performance on real-world use cases that motivated the development of these methods. Our negative findings align with those of contemporaneous works.
  2. Rigorous, real-world evaluation is important but hard. These findings were obtained through user studies that were time-consuming to conduct. 

We believe that each of these insights motivates a corresponding research direction to support human-ML interactions better moving forward. First, beyond methods that attempt to explain the ML model itself, we should consider a wider range of approaches that present relevant task-specific information to human decision-makers; we refer to these approaches as human-centered ML (HCML) methods [10]. Second, we need to create new workflows to evaluate proposed HCML methods that are both low-cost and informative of real-world performance.

In this post, we first outline our workflow for evaluating XAI methods.  We then describe how we instantiated this workflow in two domains: fraud detection and peer review paper matching. Finally, we describe the two aforementioned insights from these efforts; we hope these takeaways will motivate the community to rethink how HCML methods are developed and evaluated.

How do you rigorously evaluate explanation methods?

In our CACM paper [1], we introduced a use-case-grounded workflow to evaluate explanation methods in practice—this means showing that they are ‘useful,’ i.e., that they can actually improve human-ML interactions in the real-world applications that they are motivated by. This workflow contrasts with evaluation workflows of XAI methods in prior work, which relied on researcher-defined proxy metrics that may or may not be relevant to any downstream task. Our proposed three-step workflow is based on the general scientific method:

Step 1: Define a concrete use case. To do this, researchers may need to work closely with domain experts to define a task that reflects the practical use case of interest.

Step 2: Select explanation methods for evaluation. While selected methods might be comprised of popular XAI methods, the appropriate set of methods is to a large extent application-specific and should also include relevant non-explanation baselines.

Step 3: Evaluate explanation methods against baselines. While researchers should ultimately evaluate selected methods through a user study with real-world users, researchers may want to first conduct cheaper, noisier forms of evaluation to narrow down the set of methods in consideration (Figure 2). 

Figure 2. Evaluation is a key component of our proposed use-case-grounded workflow and consists of four stages ranging from cheaper, lower-signal evaluations to more expensive, task-specific user studies. The stages of evaluation are adapted from Doshi-Velez and Kim (2017); we introduce an additional stage, use-case-grounded algorithmic evaluations, in a recent NeurIPS 2022 paper [2].

Instantiating the workflow in practice

We collaborated with experts from two domains (fraud detection and peer review paper matching) to instantiate this use-case-grounded workflow and evaluate existing XAI methods:

Figure 3. Example of the user interface used by fraud analysts in our experiment (populated with sample data for illustrative purposes). (a) Basic interface components, including the model score (shown in the top left), buttons to approve or decline the transactions, and transaction details. (b) A component of the interface that presents the explanations of the model score.

Domain 1: Fraud detection [3]. We partnered with researchers at Feedzai, a financial start-up, to assess whether providing model explanations improved the ability of fraud analysts to detect fraudulent e-commerce transactions. Given that we had access to real-world data (i.e., historical e-commerce transactions for which we had ground truth answers of whether the transaction was fraudulent) and real users (i.e., fraud analysts), we directly conducted a user study in this context. An example of the interface shown to analysts is in Figure 3. We compared analysts’ average performance when shown different explanations to a baseline setting where they were only provided the model prediction. We ultimately found that none of the popular XAI methods we evaluated (LIME, SHAP, and Tree Interpreter) resulted in any improvement in the analysts’ decisions compared to the baseline setting (Figure 5, left). Evaluating these methods with real users additionally posed many logistical challenges because fraud analysts took time from their regular day-to-day work to periodically participate in our study. 

Figure 4. Peer review paper matching is an example of a document matching application. For each submitted paper, the matching model pre-screens a list of candidate reviewers via affinity scores (solid arrows). Meta-reviewers, typically under a time constraint, then select the best match to the submitted paper among the pre-screened reviewer (box with a solid line). We study whether providing additional assistive information, namely highlighting potentially relevant information in the candidate documents, can help the meta-reviewers make better decisions (dotted arrows and boxes). 

Domain 2: Peer review paper matching [4]. We collaborated with Professor Nihar Shah (CMU), an expert in peer review, to investigate what information could help meta-reviewers of a conference better match submitted papers to suitable reviewers. Learning from our prior experience, we first conducted a user study using proxy tasks and users, which we designed together with Professor Shah as shown in Figure 4. In this proxy setting, we found that providing explanations from popular XAI methods in fact led users to be more confident—the majority of participants shown highlights from XAI methods believed the highlighted information was helpful—yet they made statistically worse decisions (Figure 5, right)!

Figure 5. We evaluated popular XAI methods in two domains: e-commerce fraud (left), where we conducted a user study with a real use case and users, and peer review paper matching (right), where we conducted a user study with a proxy task and users that we designed with a domain expert. Although we find that explanations from popular XAI methods do not outperform baselines of only providing the model prediction (and often result in statistically worse performance), we are optimistic about the potential of task-specific methods. In particular, our proposed method in the peer review paper matching task outperformed both the model-score-only baseline and existing general-purpose methods.

How can we better support human-ML interactions?

Through these collaborations, we identified two important directions for future work, which we describe in more detail along with our initial efforts in each direction.

We need to develop methods for specific use cases. Our results suggest that explanations from popular, general-purpose XAI methods can both hurt decision-making while making users overconfident. These findings have also been observed in multiple contemporaneous works (e.g., [7,8,9]). Researchers, instead, need to consider developing human-centered ML (HCML) methods [10] tailored for each downstream use case. HCML methods are any approach that provides information about the particular use case and context that can inform human decisions.

Figure 6. Examples of highlighted information from different methods in our peer review matching proxy task. Highlights for “Key Parts” (second row) provide the “ground truth”, i.e., they indicate the information relevant to the query summary (first row), all of which ideally should be visibly highlighted by the methods that follow. Existing methods like SHAP (third row) and BERTSum (fourth row) fail to fully highlight all key parts. Critically, they fail to visibly highlight the key part about “river levels rising” (yellow highlights in Key Parts), the unique information that distinguishes the ground truth from other candidate articles, which can directly impact the participant’s performance. On the other hand, our task-specific method (bottom row) visibly highlights all key parts.

Our contributions: In the peer review matching setting, we proposed an HCML method designed in tandem with a domain expert [4]. Notably, our method is not a model explanation approach, as it highlights information in the input data, specifically sentences and phrases that are similar in the submitted paper and the reviewer profile. Figure 6 compares the text highlighted using our method to the text highlighted using existing methods. Our method outperformed both a baseline where there was no explanation and the model explanation condition (Figure 5, right). Based on these positive results, we plan to move evaluations of our proposed method to more realistic peer review settings. Further, we performed an exploratory study to better understand how people interact with information provided by HCML methods as a first step towards coming up with a more systematic approach to devise task-specific HCML methods [5].

We need more efficient evaluation pipelines. While user studies conducted in a real-world use case and with real users are the ideal way to evaluate HCML methods, it is a time- and resource-consuming process. We highlight the need for more cost-effective evaluations that can be utilized to narrow down candidate HCML methods and still implicate the downstream use case. One option is to work with domain experts to design a proxy task as we did in the peer review setting, but even these studies require careful consideration of the generalizability to the real-world use case. 

Our contributions. We introduced an algorithmic evaluation called simulated user evaluation (SimEvals) [2]. Instead of conducting studies on proxy tasks, researchers can train SimEvals, which are ML models that serve as human proxies. SimEvals more faithfully reflect aspects of real-world evaluation because their training and evaluation data are instantiated on the same data and task considered in the real-world studies. To train SimEvals, the researcher first needs to generate a dataset of observation-label pairs. The observation corresponds to the information that would be presented in a user study (and critically includes the HCML method), while the label is the ground truth for the use case of interest. For example, in the fraud detection setting, the observation would consist of both the e-commerce transaction and the ML model score shown in Figure 3(a), along with the explanation shown in Figure 3(b). The ground truth label is whether or not the transaction was fraudulent. SimEvals are trained to predict a label given an observation, and their test set accuracies can be interpreted as a measure of whether the information contained in the observation is predictive for the use case.
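As a rough illustration of this idea, the following hedged sketch trains a simple scikit-learn model as the human proxy (placeholder function and argument names; not the released SimEvals code):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def simeval_accuracy(features, model_scores, explanations, labels):
    """SimEval-style evaluation sketch: each observation bundles what a study
    participant would see (raw features, the model score, an explanation
    vector, all assumed to be NumPy arrays), and the label is the use-case
    ground truth (e.g. fraud / not fraud). The proxy learner's test accuracy
    signals how informative the observation, including the explanation, is."""
    observations = np.hstack([features, np.asarray(model_scores).reshape(-1, 1), explanations])
    X_train, X_test, y_train, y_test = train_test_split(
        observations, labels, test_size=0.3, random_state=0)
    proxy = GradientBoostingClassifier().fit(X_train, y_train)
    return proxy.score(X_test, y_test)   # higher = observation more predictive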

We not only evaluated SimEvals on a variety of proxy tasks but also tested SimEvals in practice by working with Feedzai, where we found results that corroborate the negative findings from the user study [6]. Although SimEvals should not replace user studies because SimEvals are not designed to mimic human decision-making, these results suggest that SimEvals could be used initially to identify more promising explanations (Figure 7).

Figure 7. An overview of how simulated user studies (SimEvals) can help a researcher select which explanation methods to evaluate given their specific use case. (Left) When conducting user studies, researchers often only evaluate a small number of explanation methods due to resource constraints and select popular methods as candidate explanations to evaluate, with little justification about why each choice may be helpful for the downstream use case. (Right) We propose using SimEvals, which are use-case-grounded, algorithmic evaluations, to efficiently screen explanations before running a user study. In this example, the researcher runs a SimEval on each of the four candidate explanation methods and then uses the results of the SimEvals to select two promising explanation methods where the algorithmic agent has high accuracy for their human subject study.

Conclusion

In summary, our recent efforts motivate two ways the community should rethink how to support human-ML interactions: (1) we need to replace general-purpose XAI techniques with HCML methods tailored to specific use cases, and (2) creating intermediate evaluation procedures that can help narrow down the HCML methods to evaluate in more costly settings. 

For more information about the various papers mentioned in this blog post, see the links below:

[1] Chen, V., Li, J., Kim, J. S., Plumb, G., & Talwalkar, A. Interpretable Machine Learning. Communications of the ACM, 2022. (link)

[2] Chen, V., Johnson, N., Topin, N., Plumb, G., & Talwalkar, A. Use-case-grounded simulations for explanation evaluation. NeurIPS, 2022. (link)

[3] Amarasinghe, K., Rodolfa, K. T., Jesus, S., Chen, V., Balayan, V., Saleiro, P., Bizzaro, P., Talwalkar, A. & Ghani, R. (2022). On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods. arXiv. (link)

[4] Kim, J. S., Chen, V., Pruthi, D., Shah, N., Talwalkar, A. Assisting Human Decisions in Document Matching. arXiv. (link)

[5] Chen, V., Liao, Q. V., Vaughan, J. W., & Bansal, G. (2023). Understanding the Role of Human Intuition on Reliance in Human-AI Decision-Making with Explanations. arXiv. (link)

[6] Martin, A., Chen, V., Jesus, S., Saleiro, P. A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies. arXiv. (link)

[7] Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T. & Weld, D. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. CHI, 2021. (link)

[8] Adebayo, J., Muelly, M., Abelson, H., & Kim, B. Post hoc explanations may be ineffective for detecting unknown spurious correlation. ICLR, 2022. (link)

[9] Zhang, Y., Liao, Q. V., & Bellamy, R. K. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. FAccT, 2020. (link)

[10] Chancellor, S. (2023). Toward Practices for Human-Centered Machine Learning. Communications of the ACM, 66(3), 78-85. (link)

Acknowledgments

We would like to thank Kasun Amarasinghe, Jeremy Cohen, Nari Johnson, Joon Sik Kim, Q. Vera Liao, and Junhong Shen for helpful feedback and suggestions on earlier versions of the blog post. Thank you also to Emma Kallina for her help with designing the main figure!

FaceLit: Neural 3D Relightable Faces

We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model in the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of… (Apple Machine Learning Research)

Naturalistic Head Motion Generation From Speech

Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of… (Apple Machine Learning Research)

On the Role of Lip Articulation in Visual Speech Perception

*= Equal Contribution
Generating realistic lip motion from audio to simulate speech production is critical for driving natural character animation. Previous research has shown that traditional metrics used to optimize and assess models for generating lip motion from speech are not a good indicator of subjective opinion of animation quality. Devising metrics that align with subjective opinion first requires understanding what impacts human perception of quality. In this work, we focus on the degree of articulation and run a series of experiments to study how articulation strength impacts human… (Apple Machine Learning Research)

Feedback Effect in User Interaction with Intelligent Assistants: Delayed Engagement, Adaption and Drop-out

With the growing popularity of intelligent assistants (IAs), evaluating IA quality becomes an increasingly active field of research. This paper identifies and quantifies the feedback effect, a novel component in IA-user interactions: how the capabilities and limitations of the IA influence user behavior over time. First, we demonstrate that unhelpful responses from the IA cause users to delay or reduce subsequent interactions in the short term via an observational study. Next, we expand the time horizon to examine behavior changes and show that as users discover the limitations of the IA’s… (Apple Machine Learning Research)

Snapper provides machine learning-assisted labeling for pixel-perfect image object detection

Bounding box annotation is a time-consuming and tedious task that requires annotators to create boxes that tightly fit an object’s boundaries, ensuring that every edge of the annotated object is enclosed in the annotation. In practice, creating annotations that are precise and well-aligned to object edges is a laborious process.

In this post, we introduce a new interactive tool called Snapper, powered by a machine learning (ML) model that reduces the effort required of annotators. The Snapper tool automatically adjusts noisy annotations, reducing the time required to annotate data at a high-quality level.

Overview of Snapper

Snapper is an interactive and intelligent system that automatically “snaps” object annotations to image-based objects in real time. With Snapper, annotators place bounding box annotations by drawing boxes, and then see immediate and automatic adjustments to their bounding box to better fit the bounded object.

The Snapper system is composed of two subsystems. The first subsystem is a front-end ReactJS component that intercepts annotation-related mouse events and handles the rendering of the model’s predictions. We integrate this front end with our Amazon SageMaker Ground Truth annotation UI. The second subsystem consists of the model backend, which receives requests from the front-end client, routes the requests to an ML model to generate adjusted bounding box coordinates, and sends the data back to the client.

ML model optimized for annotators

A tremendous number of high-performing object detection models have been proposed by the computer vision community in recent years. However, these state-of-the-art models are typically optimized for unguided object detection. To facilitate Snapper’s “snapping” functionality for adjusting users’ annotations, the input to our model is an initial bounding box, provided by the annotator, which serves as a marker for the presence of an object. Furthermore, because the system is not targeted at any particular object class, Snapper’s adjustment model should be object-agnostic so that it performs well across a range of object classes.

In general, these requirements diverge substantially from the use cases of typical ML object detection models. We note that the traditional object detection problem is formulated as “detect the object center, then regress the dimensions.” This is counterintuitive, because accurate predictions of bounding box edges rely crucially on first finding an accurate box center, and then trying to establish scalar distances to edges. Moreover, it doesn’t provide good confidence estimates that focus on the uncertainties of the edge locations, because only the classifier score is available for use.

To give our Snapper model the ability to adjust users’ annotations, we design and implement an ML model custom designed for bounding box adjustment. As input, the model takes an image and a corresponding bounding box annotation. The model extracts features from the image using a convolutional neural network. Following feature extraction, directional spatial pooling is applied to each dimension to aggregate the information needed to identify an appropriate edge location.

We formulate location prediction for bounding boxes as a classification problem over different locations. While seeing the whole object, we ask the machine to reason about the presence or absence of an edge directly at each pixel’s location as a classification task. This improves accuracy, as the reasoning for each edge uses image features from the immediate local neighborhood. Moreover, the scheme decouples the reasoning between different edges, which prevents unambiguous edge locations from being affected by the uncertain ones. Additionally, it provides us with edge-wise intuitive confidence estimates, as our model considers each edge of the object independently (like human annotators would) and provides an interpretable distribution (or uncertainty estimate) for each edge’s location. This allows us to highlight less confident edges for more efficient and precise human review.
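The following is a hedged sketch of what such a per-edge classification head could look like (our own illustration with placeholder dimensions, not the actual Snapper model): features are pooled along one spatial axis per direction, and each edge gets an interpretable distribution over candidate pixel locations.

import torch
import torch.nn as nn

class EdgeLocationHead(nn.Module):
    """Sketch of per-edge location classification: directional pooling gives a
    1D feature sequence per axis, and each edge is predicted as a distribution
    over pixel positions along that axis."""

    def __init__(self, channels):
        super().__init__()
        self.horizontal_edges = nn.Conv1d(channels, 2, kernel_size=1)  # top, bottom
        self.vertical_edges = nn.Conv1d(channels, 2, kernel_size=1)    # left, right

    def forward(self, feats):                 # feats: (batch, C, H, W)
        rows = feats.mean(dim=3)              # pool over width  -> (batch, C, H)
        cols = feats.mean(dim=2)              # pool over height -> (batch, C, W)
        # Per-position logits for each edge; softmax yields an interpretable
        # distribution (uncertainty estimate) over candidate edge locations.
        row_probs = self.horizontal_edges(rows).softmax(dim=-1)   # (batch, 2, H)
        col_probs = self.vertical_edges(cols).softmax(dim=-1)     # (batch, 2, W)
        return row_probs, col_probs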

Benchmarking and evaluating the Snapper tool

In practice, we find that the Snapper tool streamlines the bounding box annotation task and is very intuitive for users to pick up. We also conducted a quantitative analysis of Snapper to characterize the tool objectively. We evaluated Snapper’s adjustment model using evaluation measures standard for object detection models, with two metrics to examine validity: Intersection over Union (IoU), and edge and corner deviance. IoU measures the alignment between two annotations by dividing the annotations’ area of overlap by the annotations’ area of union, yielding a metric that ranges from 0–1. Edge deviance and corner deviance are calculated as the fraction of edges and corners whose deviation from the ground truth falls within a given number of pixels.

To evaluate Snapper, we dynamically generated noisy annotation data by randomly adjusting the COCO ground truth bounding box coordinates with jitter. Our procedure for adding jitter first shifts the center of the bounding box by up to 10% of the corresponding bounding box dimension on each axis and then rescales the dimensions of the bounding box by a randomly sampled ratio between 0.9–1.1. Here, we apply these metrics to the validation set from the official MS-COCO dataset used for training. We specifically calculate the fraction of bounding boxes with IoU exceeding 90% alongside the fraction of edge deviations and corner deviations that deviate less than one or three pixels from the corresponding ground truth. The following table summarizes our findings.

As shown in the preceding table, Snapper’s adjustment model significantly improved the two sources of noisy data across each of the three metrics. With an emphasis on high precision annotations, we observe that applying Snapper to the jittered MS COCO dataset increases the fraction of bounding boxes with IoU exceeding 90% by upwards of 40%.
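For readers who want to reproduce this style of analysis, here is a small self-contained sketch of the two ingredients described above, IoU and the jitter procedure (our own simplified version, not the exact evaluation code):

import random

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def jitter(box, shift=0.1, scale=(0.9, 1.1)):
    """Jitter as described above: shift the box center by up to 10% of each
    dimension on each axis, then rescale each dimension by a factor in [0.9, 1.1]."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2 + random.uniform(-shift, shift) * w
    cy = (y1 + y2) / 2 + random.uniform(-shift, shift) * h
    w *= random.uniform(*scale)
    h *= random.uniform(*scale)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)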

Conclusion

In this post, we introduced a new ML-powered annotation tool called Snapper. Snapper consists of a SageMaker model backend as well as a front-end component that we integrate into the Ground Truth labeling UI. We evaluated Snapper on simulated noisy bounding box annotations and found that it can successfully refine imperfect bounding boxes. The use of Snapper in labeling tasks can significantly reduce cost and increase accuracy.

To learn more, visit Amazon SageMaker Data Labeling and schedule a consultation today.


About the authors

Jonathan Buck is a Software Engineer at Amazon Web Services working at the intersection of machine learning and distributed systems. His work involves productionizing machine learning models and developing novel software applications powered by machine learning to put the latest capabilities in the hands of customers.

Alex Williams is an applied scientist in the human-in-the-loop science team at AWS AI where he conducts interactive systems research at the intersection of human-computer interaction (HCI) and machine learning. Before joining Amazon, he was a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, where he co-directed the People, Agents, Interactions, and Systems (PAIRS) research laboratory. He has also held research positions at Microsoft Research, Mozilla Research, and the University of Oxford. He regularly publishes his work at premier venues.

Min Bai is an applied scientist at AWS, with a current specialization in 2D / 3D computer vision, with a focus on the fields of autonomous driving and user-friendly AI tools. When not at work, he enjoys exploring nature, especially off the beaten track.

Kumar Chellapilla is a General Manager and Director at Amazon Web Services and leads the development of ML/AI Services such as human-in-loop systems, AI DevOps, Geospatial ML, and ADAS/Autonomous Vehicle development. Prior to AWS, Kumar was a Director of Engineering at Uber ATG and Lyft Level 5 and led teams using machine learning to develop self-driving capabilities such as perception and mapping. He also worked on applying machine learning techniques to improve search, recommendations, and advertising products at LinkedIn, Twitter, Bing, and Microsoft Research.

Patrick Haffner is a Principal Applied Scientist with the AWS Sagemaker Ground Truth team. He has been working on human-in-the-loop optimization since 1995, when he applied the LeNet Convolutional Neural Network to check recognition. He is interested in holistic approaches where ML algorithms and labeling UIs are optimized together to minimize the labeling cost.

Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems and adversarial machine learning. He has a PhD in computer science from Cornell University. He is an ACM Fellow and IEEE Fellow.
