Does Your Medical Image Classifier Know What It Doesn’t Know?

Deep machine learning (ML) systems have achieved considerable success in medical image analysis in recent years. One major contributing factor is access to abundant labeled datasets, which are used to train highly effective supervised deep learning models. However, in the real world, these models may encounter samples exhibiting rare conditions that are individually too infrequent to support per-condition classification. Because such conditions follow a long-tail distribution, they can nevertheless be collectively common and represent a significant portion of cases — e.g., in a recent deep learning dermatology study, hundreds of rare conditions accounted for around 20% of cases encountered by the model at test time.

To prevent models from generating erroneous outputs on rare samples at test time, there remains a considerable need for deep learning systems with the ability to recognize when a sample does not belong to any condition they can identify. Detecting previously unseen conditions can be framed as an out-of-distribution (OOD) detection task. By successfully identifying OOD samples, preventive measures can be taken, like abstaining from prediction or deferring to a human expert.

Traditional computer vision OOD detection benchmarks work to detect dataset distribution shifts. For example, a model may be trained on CIFAR images but be presented with street view house numbers (SVHN) as OOD samples, two datasets with very different semantic meanings. Other benchmarks seek to detect slight differences in semantic information, e.g., between images of a truck and a pickup truck, or two different skin conditions. The semantic distribution shifts in such near-OOD detection problems are more subtle in comparison to dataset distribution shifts, and thus, are harder to detect.

In “Does Your Dermatology Classifier Know What it Doesn’t Know? Detecting the Long-Tail of Unseen Conditions”, published in Medical Image Analysis, we tackle this near-OOD detection task in the application of dermatology image classification. We propose a novel hierarchical outlier detection (HOD) loss, which leverages existing fine-grained labels of rare conditions from the long tail and modifies the loss function to group unseen conditions and improve identification of these near OOD categories. Coupled with various representation learning methods and the diverse ensemble strategy, this approach enables us to achieve better performance for detecting OOD inputs.

The Near-OOD Dermatology Dataset
We curated a near-OOD dermatology dataset that includes 26 inlier conditions, each of which is represented by at least 100 samples, and 199 rare conditions considered to be outliers. Outlier conditions can have as few as one sample per condition. The separation criterion between inlier and outlier conditions can be specified by the user. Here the cutoff sample size between inlier and outlier was 100, consistent with our previous study. The outliers are further split into training, validation, and test sets that are intentionally mutually exclusive to mimic real-world scenarios, where rare conditions seen at test time may not have been seen in training.

Long-tail distribution of different dermatological conditions in our dataset. The 26 inlier conditions, each with at least 100 samples (blue), and the remaining 199 rare outlier conditions (orange). Outlier conditions can have as few as one sample per condition.
                        Train set            Validation set       Test set
                        Inlier    Outlier    Inlier    Outlier    Inlier    Outlier
Number of classes       26        68         26        66         26        65
Number of samples       8854      1111       1251      1082       1192      937
Inlier and outlier conditions in our benchmark dataset and detailed dataset split statistics. The outliers are further split into mutually exclusive train, validation, and test sets.

Hierarchical Outlier Detection Loss
We propose using “known outlier” samples during training to aid detection of “unknown outlier” samples at test time. Our novel hierarchical outlier detection (HOD) loss performs a fine-grained classification of individual classes for all inlier and outlier classes and, in parallel, a coarse-grained binary classification of inliers vs. outliers in a hierarchical setup (see the figure below). Our experiments confirmed that HOD is more effective than performing a coarse-grained classification followed by a fine-grained classification, as the latter can create a bottleneck that degrades the performance of the fine-grained classifier.
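To make the hierarchical setup concrete, below is a minimal PyTorch-style sketch of what an HOD-style objective can look like: a fine-grained cross-entropy over all inlier and training-outlier classes, plus a coarse inlier-vs.-outlier term obtained by summing probabilities within each branch. The coarse-term weight and the toy batch are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def hod_loss(logits, fine_labels, n_inlier, coarse_weight=0.1):
    """Sketch of a hierarchical outlier detection (HOD) style objective.

    logits:        [batch, n_inlier + n_outlier] fine-grained logits from the encoder head.
    fine_labels:   [batch] fine-grained class indices (inlier classes first, then training outliers).
    n_inlier:      number of inlier classes.
    coarse_weight: weight of the coarse inlier-vs-outlier term (an assumed value, not the paper's).
    """
    # Fine-grained term: standard cross-entropy over all inlier + training-outlier classes.
    fine_term = F.cross_entropy(logits, fine_labels)

    # Coarse-grained term: sum the probabilities within each branch, then score inlier vs. outlier.
    probs = logits.softmax(dim=-1)
    p_outlier = probs[:, n_inlier:].sum(dim=-1)          # also usable as the OOD score
    is_outlier = (fine_labels >= n_inlier).float()
    coarse_term = F.binary_cross_entropy(p_outlier.clamp(1e-6, 1 - 1e-6), is_outlier)

    return fine_term + coarse_weight * coarse_term, p_outlier

# Toy batch: 4 samples, 26 inlier + 68 training-outlier classes as in the dataset above.
logits = torch.randn(4, 26 + 68)
labels = torch.tensor([3, 30, 10, 90])
loss, ood_score = hod_loss(logits, labels, n_inlier=26)
```

The coarse outlier probability returned here corresponds to the OOD score described next.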

We use the sum of the predictive probabilities of the outlier classes as the OOD score. As the primary OOD detection metric we use the area under the receiver operating characteristic (AUROC) curve, which ranges between 0 and 1 and measures the separability between inliers and outliers. A perfect OOD detector, which separates all inliers from outliers, is assigned an AUROC score of 1. A popular baseline method, called the reject bucket, separates each inlier individually from the outliers, which are grouped into a single dedicated abstention class. In addition to a fine-grained classification for each individual inlier and outlier class, the HOD loss–based approach separates the inliers collectively from the outliers with a coarse-grained prediction loss, resulting in better generalization. We demonstrate that our HOD loss–based approach outperforms other baseline methods that leverage outlier data during training, achieving an AUROC score of 79.4% on the benchmark, a significant improvement over the reject bucket's 75.6%.
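As a small illustration of the evaluation (with invented numbers standing in for real model outputs), the OOD score and AUROC can be computed as follows:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy predictive probabilities for 4 samples over 3 inlier + 2 outlier classes.
# In practice these would come from the trained classifier's softmax output.
probs = np.array([
    [0.70, 0.20, 0.05, 0.03, 0.02],   # confident inlier
    [0.10, 0.15, 0.05, 0.40, 0.30],   # mostly outlier mass
    [0.50, 0.30, 0.10, 0.05, 0.05],   # inlier-ish
    [0.05, 0.05, 0.10, 0.45, 0.35],   # outlier-ish
])
n_inlier_classes = 3

# OOD score: sum of the predictive probabilities of the outlier classes.
ood_score = probs[:, n_inlier_classes:].sum(axis=1)

# Ground-truth labels: 1 = outlier, 0 = inlier.
is_outlier = np.array([0, 1, 0, 1])

print("AUROC:", roc_auc_score(is_outlier, ood_score))   # 1.0 for this toy example
```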

Our model architecture and the HOD loss. The encoder (green) represents the wide ResNet 101×3 model pre-trained with different representation learning models (ImageNet, BiT, SimCLR, and MICLe; see below). The output of the encoder is sent to the HOD loss where fine-grained and coarse-grained predictions for inliers (blue) and outliers (orange) are obtained. The coarse predictions are obtained by summing over the fine-grained probabilities as indicated in the figure. The OOD score is defined as the sum of the probabilities of outlier classes.

Representation Learning and the Diverse Ensemble Strategy
We also investigate how different types of representation learning help OOD detection in conjunction with HOD, pre-training the encoder with ImageNet, BiT-L, SimCLR, and MICLe. We observe that including the HOD loss improves OOD performance compared to the reject bucket baseline for all four representation learning methods.

Representation learning method     OOD detection metric (AUROC %)
                                   With reject bucket      With HOD loss
ImageNet                           74.7                    77.0
BiT-L                              75.6                    79.4
SimCLR                             75.2                    77.2
MICLe                              76.7                    78.8
OOD detection performance for different representation learning models with reject bucket and with HOD loss.

Another orthogonal approach for improving OOD detection performance and accuracy is the deep ensemble, which aggregates outputs from multiple independently trained models to provide a final prediction. We build upon the deep ensemble, but instead of using a fixed architecture with a fixed pre-training, we combine different representation learning methods (ImageNet, BiT-L, SimCLR, and MICLe) and different objective functions (HOD and reject bucket). We call this the diverse ensemble strategy, which we demonstrate outperforms the deep ensemble in both OOD performance and inlier accuracy.
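As a rough sketch of the combination step (the exact aggregation used in the paper may differ), OOD scores from members with different pre-training and objectives can simply be averaged before thresholding:

```python
import numpy as np

# Hypothetical per-member OOD scores for 5 test samples, one row per ensemble member
# (e.g., BiT-L + HOD, SimCLR + HOD, MICLe + reject bucket, ...).
member_scores = np.array([
    [0.05, 0.81, 0.12, 0.64, 0.09],
    [0.10, 0.77, 0.20, 0.58, 0.15],
    [0.02, 0.90, 0.08, 0.71, 0.11],
])

# Diverse ensemble: average the OOD scores across members, then threshold or rank as usual.
ensemble_ood_score = member_scores.mean(axis=0)
flagged_as_ood = ensemble_ood_score > 0.5
print(ensemble_ood_score, flagged_as_ood)
```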

Downstream Clinical Trust Analysis
While we mainly focus on improving the performance of OOD detection, the ultimate goal for our dermatology model is to have high accuracy in predicting both inlier and outlier conditions. We go beyond traditional performance metrics and introduce a “penalty” matrix that jointly evaluates inlier and outlier predictions for model trust analysis to approximate downstream impact. For a fixed confidence threshold, we count the following types of mistakes: (i) incorrect inlier predictions (i.e., mistaking inlier condition A for inlier condition B); (ii) incorrect abstention of inliers (i.e., abstaining from making a prediction for an inlier); and (iii) incorrect prediction of outliers as one of the inlier classes.

To account for the asymmetrical consequences of the different types of mistakes, penalties can be 0, 0.5, or 1. Both incorrect inlier and outlier-as-inlier predictions can potentially erode user trust in the model and were penalized with a score of 1. Incorrect abstention of an inlier as an outlier was penalized with a score of 0.5, indicating that potential model users should seek additional guidance given the model-expressed uncertainty or abstention. For correct decisions no cost is incurred, indicated by a score of 0.

                       Action of the Model
Ground truth           Prediction as Inlier                              Abstain
Inlier                 0 (correct) / 1 (incorrect, mistakes              0.5 (incorrect, abstains
                       that may erode trust)                             inliers)
Outlier                1 (incorrect, mistakes that may erode trust)      0 (correct)
The penalty matrix is designed to capture the potential impact of different types of model errors.

Because real-world scenarios are more complex and contain a variety of unknown variables, the numbers used here are simplifications that enable qualitative approximations of the downstream impact of outlier detection models on user trust, which we refer to as “cost”. We use the penalty matrix to estimate a downstream cost on the test set and compare our method against the baseline, thereby making a stronger case for its effectiveness in real-world scenarios. As shown in the plot below, our proposed solution incurs a much lower estimated cost than the baseline over all possible operating points.
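The following toy sketch shows how such a cost estimate can be computed from the penalty matrix at a fixed operating point; the arrays and the threshold are invented for illustration.

```python
import numpy as np

# Per-sample model outputs at a fixed operating point (toy values).
ood_score = np.array([0.1, 0.7, 0.2, 0.9, 0.3])       # summed outlier probability
inlier_pred = np.array([2, 0, 5, 1, 3])               # predicted inlier class if not abstaining
true_label = np.array([2, -1, 4, -1, 3])              # inlier class index, or -1 for an outlier
threshold = 0.5                                        # abstain when the OOD score exceeds this

abstain = ood_score > threshold
is_outlier = true_label == -1

cost = np.zeros(len(true_label))
cost[~abstain & ~is_outlier & (inlier_pred != true_label)] = 1.0   # wrong inlier class
cost[~abstain & is_outlier] = 1.0                                  # outlier predicted as an inlier
cost[abstain & ~is_outlier] = 0.5                                  # incorrect abstention on an inlier
# Correct inlier predictions and correct abstentions on outliers cost 0.

print("Estimated downstream cost:", cost.mean())
```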

Trust analysis comparing our proposed method to the baseline (reject bucket) for a range of outlier recall rates, indicated by 𝛕. We show that our method reduces downstream estimated cost, potentially reflecting improved downstream impact.

Conclusion
In real-world deployment, medical ML models may encounter conditions that were not seen in training, and it’s important that they accurately identify when they do not know a specific condition. Detecting those OOD inputs is an important step to improving safety. We develop an HOD loss that leverages outlier data during training, and combine it with pre-trained representation learning models and a diverse ensemble to further boost performance, significantly outperforming the baseline approach on our new dermatology benchmark dataset. We believe that our approach, aligned with our AI Principles, can aid successful translation of ML algorithms into real-world scenarios. Although we have primarily focused on OOD detection for dermatology, most of our contributions are fairly generic and can be easily incorporated into OOD detection for other applications.

Acknowledgements
We would like to thank Shekoofeh Azizi, Aaron Loh, Vivek Natarajan, Basil Mustafa, Nick Pawlowski, Jan Freyberg, Yuan Liu, Zach Beaver, Nam Vo, Peggy Bui, Samantha Winter, Patricia MacWilliams, Greg S. Corrado, Umesh Telang, Yun Liu, Taylan Cemgil, Alan Karthikesalingam, Balaji Lakshminarayanan, and Jim Winkens for their contributions. We would also like to thank Tom Small for creating the post animation.

Read More

How Clearly accurately predicts fraudulent orders using Amazon Fraud Detector

This post was cowritten by Ziv Pollak, Machine Learning Team Lead, and Sarvi Loloei, Machine Learning Engineer at Clearly. The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

A pioneer in online shopping, Clearly launched their first site in 2000. Since then, we’ve grown to become one of the biggest online eyewear retailers in the world, providing customers across Canada, the US, Australia, and New Zealand with glasses, sunglasses, contact lenses, and other eye health products. Through its mission to eliminate poor vision, Clearly strives to make eyewear affordable and accessible for everyone. Creating an optimized fraud detection platform is a key part of this wider vision.

Identifying online fraud is one of the biggest challenges every online retail organization has—hundreds of thousands of dollars are lost due to fraud every year. Product costs, shipping costs, and labor costs for handling fraudulent orders further increase the impact of fraud. Easy and fast fraud evaluation is also critical for maintaining high customer satisfaction rates. Transactions shouldn’t be delayed due to lengthy fraud investigation cycles.

In this post, we share how Clearly built an automated and orchestrated forecasting pipeline using AWS Step Functions, and used Amazon Fraud Detector to train a machine learning (ML) model that can identify online fraudulent transactions and bring them to the attention of the billing operations team. This solution also collects metrics and logs, provides auditing, and is invoked automatically.

With AWS services, Clearly deployed a serverless, well-architected solution in just a few weeks.

The challenge: Predicting fraud quickly and accurately

Clearly’s existing solution was based on flagging transactions using hard-coded rules that weren’t updated frequently enough to capture new fraud patterns. Once flagged, the transaction was manually reviewed by a member of the billing operations team.

This existing process had major drawbacks:

  • Inflexible and inaccurate – The hard-coded rules to identify fraud transactions were difficult to update, meaning the team couldn’t respond quickly to emerging fraud trends. The rules were unable to accurately identify many suspicious transactions.
  • Operationally intensive – The process couldn’t scale to high sales volume events (like Black Friday), requiring the team to implement workarounds or accept higher fraud rates. Moreover, the high level of human involvement added significant cost to the product delivery process.
  • Delayed orders – The order fulfillment timeline was delayed by manual fraud reviews, leading to unhappy customers.

Although our existing fraud identification process was a good starting point, it was neither accurate enough nor fast enough to meet the order fulfillment efficiencies that Clearly desired.

Another major challenge we faced was the lack of a tenured ML team—all members had been with the company less than a year when the project kicked off.

Overview of solution: Amazon Fraud Detector

Amazon Fraud Detector is a fully managed service that uses ML to deliver highly accurate fraud detection and requires no ML expertise. All we had to do was upload our data and follow a few straightforward steps. Amazon Fraud Detector automatically examined the data, identified meaningful patterns, and produced a fraud identification model capable of making predictions on new transactions.
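For readers curious what requesting a prediction looks like in code, here is a hedged boto3 sketch of calling Amazon Fraud Detector for a single transaction; the detector, event type, entity, and variable names are hypothetical and not Clearly's actual configuration.

```python
import boto3
from datetime import datetime, timezone

client = boto3.client("frauddetector")

# Hypothetical detector and event type names; replace with your own resources.
response = client.get_event_prediction(
    detectorId="transaction_fraud_detector",
    eventId="order-12345",
    eventTypeName="online_order",
    eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    entities=[{"entityType": "customer", "entityId": "customer-987"}],
    eventVariables={
        "order_amount": "129.99",
        "billing_country": "CA",
        "payment_method": "credit_card",
    },
)

# Model risk scores and the outcomes matched by the detector's rules.
print(response["modelScores"])
print(response["ruleResults"])
```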

The following diagram illustrates our pipeline:

To operationalize the flow, we applied the following workflow:

  1. Amazon EventBridge calls the orchestration pipeline hourly to review all pending transactions.
  2. Step Functions helps manage the orchestration pipeline.
  3. An AWS Lambda function calls Amazon Athena APIs to retrieve and prepare the training data, stored on Amazon Simple Storage Service (Amazon S3).
  4. An orchestrated pipeline of Lambda functions trains an Amazon Fraud Detector model and saves the model performance metrics to an S3 bucket.
  5. Amazon Simple Notification Service (Amazon SNS) notifies users when a problem occurs during the fraud detection process or when the process completes successfully.
  6. Business analysts build dashboards on Amazon QuickSight, which queries the fraud data from Amazon S3 using Athena, as we describe later in this post.

We chose to use Amazon Fraud Detector for a few reasons:

  • The service taps into years of expertise that Amazon has fighting fraud. This gave us a lot of confidence in the service’s capabilities.
  • The ease of use and implementation allowed us to quickly confirm we have the dataset we need to produce accurate results.
  • Because the Clearly ML team was less than 1 year old, a fully managed service allowed us to deliver this project without needing deep technical ML skills and knowledge.

Results

Writing the prediction results into our existing data lake allows us to use QuickSight to build metrics and dashboards for senior leadership. This enables them to understand and use these results when making decisions on the next steps to meet our monthly marketing targets.

We were able to present the forecast results on two levels, starting with overall business performance and then drilling down into the performance of each line of business (contacts and glasses).

Our dashboard includes the following information:

  • Fraud per day per different lines of business
  • Revenue loss due to fraud transactions
  • Location of fraud transactions (identifying fraud hot spots)
  • Fraud transactions impact by different coupon codes, which allows us to monitor for problematic coupon codes and take further actions to reduce the risk
  • Fraud per hour, which allows us to plan and manage the billing operation team and make sure we have resources available to handle transaction volume when needed

Conclusions

Effective and accurate prediction of customer fraud is one of the biggest challenges in ML for retail today, and having a good understanding of our customers and their behavior is vital to Clearly’s success. Amazon Fraud Detector provided a fully managed ML solution to easily create an accurate and reliable fraud prediction system with minimal overhead. Amazon Fraud Detector predictions have a high degree of accuracy and are simple to generate.

“With leading ecommerce tools like Virtual Try On, combined with our unparalleled customer service, we strive to help everyone see clearly in an affordable and effortless manner—which means constantly looking for ways to innovate, improve, and streamline processes,” said Dr. Ziv Pollak, Machine Learning Team Leader. “Online fraud detection is one of the biggest challenges in machine learning in retail today. In just a few weeks, Amazon Fraud Detector helped us accurately and reliably identify fraud with a very high level of accuracy, and save thousands of dollars.”


About the Author

Dr. Ziv Pollak is an experienced technical leader who transforms the way organizations use machine learning to increase revenue, reduce costs, improve customer service, and ensure business success. He is currently leading the Machine Learning team at Clearly.

Sarvi Loloei is an Associate Machine Learning Engineer at Clearly. Using AWS tools, she evaluates model effectiveness to drive business growth, increase revenue, and optimize productivity.

Read More

Where did that sound come from?

The human brain is finely tuned not only to recognize particular sounds, but also to determine which direction they came from. By comparing differences in sounds that reach the right and left ear, the brain can estimate the location of a barking dog, wailing fire engine, or approaching car.

MIT neuroscientists have now developed a computer model that can also perform that complex task. The model, which consists of several convolutional neural networks, not only performs the task as well as humans do, it also struggles in the same ways that humans do.

“We now have a model that can actually localize sounds in the real world,” says Josh McDermott, an associate professor of brain and cognitive sciences and a member of MIT’s McGovern Institute for Brain Research. “And when we treated the model like a human experimental participant and simulated this large set of experiments that people had tested humans on in the past, what we found over and over again is that the model recapitulates the results that you see in humans.”

Findings from the new study also suggest that humans’ ability to perceive location is adapted to the specific challenges of our environment, says McDermott, who is also a member of MIT’s Center for Brains, Minds, and Machines.

McDermott is the senior author of the paper, which appears today in Nature Human Behaviour. The paper’s lead author is MIT graduate student Andrew Francl.

Modeling localization

When we hear a sound such as a train whistle, the sound waves reach our right and left ears at slightly different times and intensities, depending on what direction the sound is coming from. Parts of the midbrain are specialized to compare these slight differences to help estimate what direction the sound came from, a task also known as localization.

This task becomes markedly more difficult under real-world conditions — where the environment produces echoes and many sounds are heard at once.

Scientists have long sought to build computer models that can perform the same kind of calculations that the brain uses to localize sounds. These models sometimes work well in idealized settings with no background noise, but never in real-world environments, with their noises and echoes.

To develop a more sophisticated model of localization, the MIT team turned to convolutional neural networks. This kind of computer modeling has been used extensively to model the human visual system, and more recently, McDermott and other scientists have begun applying it to audition as well.

Convolutional neural networks can be designed with many different architectures, so to help them find the ones that would work best for localization, the MIT team used a supercomputer that allowed them to train and test about 1,500 different models. That search identified 10 that seemed the best-suited for localization, which the researchers further trained and used for all of their subsequent studies.

To train the models, the researchers created a virtual world in which they could control the size of the room and the reflection properties of its walls. All of the sounds fed to the models originated from somewhere in one of these virtual rooms. The set of more than 400 training sounds included human voices, animal sounds, machine sounds such as car engines, and natural sounds such as thunder.

The researchers also ensured the model started with the same information provided by human ears. The outer ear, or pinna, has many folds that reflect sound, altering the frequencies that enter the ear, and these reflections vary depending on where the sound comes from. The researchers simulated this effect by running each sound through a specialized mathematical function before it went into the computer model.
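In signal-processing terms, this specialized function is a direction-dependent filter. The following is a minimal sketch of the idea, assuming you already have a mono sound and a pair of (here, made-up) head-related impulse responses for a given direction:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(sound, hrir_left, hrir_right):
    """Filter a mono sound with direction-specific impulse responses for each ear."""
    left = fftconvolve(sound, hrir_left, mode="full")
    right = fftconvolve(sound, hrir_right, mode="full")
    return np.stack([left, right])

# Toy example: white noise and invented impulse responses (real HRIRs come from measurements).
rng = np.random.default_rng(0)
sound = rng.standard_normal(16000)                        # one second at 16 kHz
hrir_left = rng.standard_normal(256) * np.exp(-np.arange(256) / 64)
hrir_right = np.roll(hrir_left, 8)                        # crude interaural delay of 8 samples
binaural = spatialize(sound, hrir_left, hrir_right)
print(binaural.shape)                                     # (2, 16255)
```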

“This allows us to give the model the same kind of information that a person would have,” Francl says.

After training the models, the researchers tested them in a real-world environment. They placed a mannequin with microphones in its ears in an actual room and played sounds from different directions, then fed those recordings into the models. The models performed very similarly to humans when asked to localize these sounds.

“Although the model was trained in a virtual world, when we evaluated it, it could localize sounds in the real world,” Francl says.

Similar patterns

The researchers then subjected the models to a series of tests that scientists have used in the past to study humans’ localization abilities.

In addition to analyzing the difference in arrival time at the right and left ears, the human brain also bases its location judgments on differences in the intensity of sound that reaches each ear. Previous studies have shown that the success of both of these strategies varies depending on the frequency of the incoming sound. In the new study, the MIT team found that the models showed this same pattern of sensitivity to frequency.
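As a rough illustration of these two cues (not the study's analysis code), the interaural time difference can be estimated from the peak of the cross-correlation between the two ear signals, and the interaural level difference from the ratio of their energies:

```python
import numpy as np

def interaural_cues(left, right, sample_rate):
    """Estimate the interaural time difference (ITD) and level difference (ILD)."""
    # ITD: lag (in seconds) that best aligns the two ear signals.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    itd = lag / sample_rate

    # ILD: energy ratio between the ears, in decibels.
    ild = 10 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, ild

# Toy binaural signal: the right ear gets the sound 10 samples later and slightly quieter.
sr = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)            # a quarter second of noise at 16 kHz
left = src
right = 0.8 * np.roll(src, 10)
print(interaural_cues(left, right, sr))    # lag of -10 samples (left ear leads), ILD of about +1.9 dB
```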

“The model seems to use timing and level differences between the two ears in the same way that people do, in a way that’s frequency-dependent,” McDermott says.

The researchers also showed that when they made localization tasks more difficult, by adding multiple sound sources played at the same time, the computer models’ performance declined in a way that closely mimicked human failure patterns under the same circumstances.

“As you add more and more sources, you get a specific pattern of decline in humans’ ability to accurately judge the number of sources present, and their ability to localize those sources,” Francl says. “Humans seem to be limited to localizing about three sources at once, and when we ran the same test on the model, we saw a really similar pattern of behavior.”

Because the researchers used a virtual world to train their models, they were also able to explore what happens when their model learned to localize in different types of unnatural conditions. The researchers trained one set of models in a virtual world with no echoes, and another in a world where there was never more than one sound heard at a time. In a third, the models were only exposed to sounds with narrow frequency ranges, instead of naturally occurring sounds.

When the models trained in these unnatural worlds were evaluated on the same battery of behavioral tests, the models deviated from human behavior, and the ways in which they failed varied depending on the type of environment they had been trained in. These results support the idea that the localization abilities of the human brain are adapted to the environments in which humans evolved, the researchers say.

The researchers are now applying this type of modeling to other aspects of audition, such as pitch perception and speech recognition, and believe it could also be used to understand other cognitive phenomena, such as the limits on what a person can pay attention to or remember, McDermott says.

The research was funded by the National Science Foundation and the National Institute on Deafness and Other Communication Disorders.

Read More

Aligning Language Models to Follow Instructions

We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language models on our API.

InstructGPT is better than GPT-3 at following English instructions.
Like GPT-3, InstructGPT can respond to tasks defined implicitly via a prompt, without an explicit instruction.
InstructGPT can give wrong or misleading outputs when the instruction assumes a premise that is not true.
When given a sensitive prompt or instruction, InstructGPT is less likely to produce biased or toxic outputs than GPT-3.
Since InstructGPT is trained to follow instructions, it can be susceptible to misuse.

GPT-3 models aren’t trained to follow user instructions. Our InstructGPT models (highlighted) generate much more helpful outputs in response to user instructions.

The OpenAI API is powered by GPT-3 language models which can be coaxed to perform natural language tasks using carefully engineered text prompts. But these models can also generate outputs that are untruthful, toxic, or reflect harmful sentiments. This is in part because GPT-3 is trained to predict the next word on a large dataset of Internet text, rather than to safely perform the language task that the user wants. In other words, these models aren’t aligned with their users.

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[1] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

The resulting InstructGPT models are much better at following instructions than GPT-3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite the InstructGPT model having more than 100x fewer parameters. At the same time, we show that we don’t have to compromise on GPT-3’s capabilities, as measured by our model’s performance on academic NLP evaluations.

These InstructGPT models, which have been in beta on the API for more than a year, are now the default language models accessible on our API.[2] We believe that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability, and we will continue to push in this direction.

This is the first time our alignment research, which we’ve been pursuing for several years, has been applied to our product. Our work is also related to recent research that fine-tunes language models to follow instructions using academic NLP datasets, notably FLAN and T0. A key motivation for our work is to increase helpfulness and truthfulness while mitigating the harms and biases of language models. Some of our previous research in this direction found that we can reduce harmful outputs by fine-tuning on a small curated dataset of human demonstrations. Other research has focused on filtering the pre-training dataset, safety-specific control tokens, or steering model generations. We are exploring these ideas and others in our ongoing alignment research.

Results

We first evaluate how well outputs from InstructGPT follow user instructions, by having labelers compare its outputs to those from GPT-3. We find that InstructGPT models are significantly preferred on prompts submitted to both the InstructGPT and GPT-3 models on the API. This holds true when we add a prefix to the GPT-3 prompt so that it enters an “instruction-following mode.”

Quality ratings of model outputs on a 1–7 scale (y-axis), for various model sizes (x-axis), on prompts submitted to InstructGPT models on our API. InstructGPT outputs are given much higher scores by our labelers than outputs from GPT-3 with a few-shot prompt and without, as well as models fine-tuned with supervised learning. We find similar results for prompts submitted to GPT-3 models on the API.

To measure the safety of our models, we primarily use a suite of existing metrics on publicly available datasets. Compared to GPT-3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA) and is less toxic (according to RealToxicityPrompts). We also conduct human evaluations on our API prompt distribution, and find that InstructGPT makes up facts (“hallucinates”) less often, and generates more appropriate outputs.[3]

Dataset                                         GPT      Supervised Fine-Tuning    InstructGPT
RealToxicity                                    0.233    0.199                     0.196
TruthfulQA                                      0.224    0.206                     0.413
API Dataset: Hallucinations                     0.414    0.078                     0.172
API Dataset: Customer Assistant Appropriate     0.811    0.880                     0.902

Evaluating InstructGPT for toxicity, truthfulness, and appropriateness. Lower scores are better for toxicity and hallucinations, and higher scores are better for TruthfulQA and appropriateness. Hallucinations and appropriateness are measured on our API prompt distribution. Results are combined across model sizes.

Finally, we find that InstructGPT outputs are preferred to those from FLAN and T0 on our customer distribution. This indicates that the data used to train FLAN and T0, mostly academic NLP tasks, is not fully representative of how deployed language models are used in practice.

Methods

To train InstructGPT models, our core technique is reinforcement learning from human feedback (RLHF), a method we helped pioneer in our earlier alignment research. This technique uses human preferences as a reward signal to fine-tune our models, which is important as the safety and alignment problems we are aiming to solve are complex and subjective, and aren’t fully captured by simple automatic metrics.

We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT-3 policy to maximize this reward using the PPO algorithm.
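To give a feel for the reward-modeling step, here is a minimal PyTorch-style sketch of the pairwise objective, training the RM to score the labeler-preferred output higher than the rejected one. The tiny stand-in model and the random features are placeholders, not OpenAI's implementation.

```python
import torch
import torch.nn as nn

# Placeholder reward model: maps a fixed-size representation of (prompt, completion) to a scalar.
# In the real setup this is a language model with a scalar head, not a small MLP.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Toy batch: features for the labeler-preferred and the rejected completion of each comparison.
preferred = torch.randn(8, 16)
rejected = torch.randn(8, 16)

r_preferred = reward_model(preferred).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Pairwise loss: -log sigmoid(r_preferred - r_rejected), i.e., push preferred rewards higher.
loss = -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()
loss.backward()
optimizer.step()
```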

One way of thinking about this process is that it “unlocks” capabilities that GPT-3 already had, but were difficult to elicit through prompt engineering alone: this is because our training procedure has a limited ability to teach the model new capabilities relative to what is learned during pretraining, since it uses less than 2% of the compute and data relative to model pretraining.

A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[4] This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.
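A rough sketch of that mixing step with placeholder quantities (the RL term, the mixing coefficient, and the toy tensors are illustrative, not the values used for InstructGPT):

```python
import torch
import torch.nn.functional as F

# Placeholder quantities; in practice these come from the PPO rollout batch and a batch of
# the original GPT-3 pretraining text scored by the current policy.
ppo_loss = torch.tensor(0.42, requires_grad=True)                       # stand-in for the RL objective

vocab_size = 50257                                                      # GPT-style vocabulary (illustrative)
pretrain_logits = torch.randn(4, 10, vocab_size, requires_grad=True)    # [batch, seq, vocab]
pretrain_tokens = torch.randint(0, vocab_size, (4, 10))

# Ordinary language-modeling cross-entropy on the mixed-in pretraining tokens.
lm_loss = F.cross_entropy(pretrain_logits.reshape(-1, vocab_size), pretrain_tokens.reshape(-1))

ptx_coef = 10.0   # weight on the pretraining term; a placeholder, not the paper's coefficient
total_loss = ppo_loss + ptx_coef * lm_loss
total_loss.backward()
```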

Generalizing to broader preferences

Our procedure aligns our models’ behavior with the preferences of our labelers, who directly produce the data used to train our models, and us researchers, who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations. It is also influenced by our customers and the preferences implicit in our API policies. We selected labelers who performed well on a screening test for aptitude in identifying and responding to sensitive prompts. However, these different sources of influence on the data do not guarantee our models are aligned to the preferences of any broader group.

We conducted two experiments to investigate this. First, we evaluate GPT-3 and InstructGPT using held-out labelers[5] who did not produce any of the training data, and found that these labelers prefer outputs from the InstructGPT models at about the same rate as our training labelers. Second, we train reward models on data from a subset of our labelers, and find that they generalize well to predicting the preferences of a different subset of labelers. This suggests that our models haven’t solely overfit to the preferences of our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.

Limitations

Despite making significant progress, our InstructGPT models are far from fully aligned or fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. But the safety of a machine learning system depends not only on the behavior of the underlying models, but also on how these models are deployed. To support the safety of our API, we will continue to review potential applications before they go live, provide content filters for detecting unsafe completions, and monitor for misuse.

A byproduct of training our models to follow user instructions is that they may become more susceptible to misuse if instructed to produce unsafe outputs. Solving this requires our models to refuse certain instructions; doing this reliably is an important open research problem that we are excited to tackle.

Further, in many cases aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, the preferences of that group should be weighted more heavily. Right now, InstructGPT is trained to follow instructions in English; thus, it is biased towards the cultural values of English-speaking people. We are conducting research into understanding the differences and disagreements between labelers’ preferences so we can condition our models on the values of more specific populations. More generally, aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately we must establish responsible, inclusive processes for making these decisions.

Next steps

This is the first application of our alignment research to our product. Our results show that these techniques are effective at significantly improving the alignment of general-purpose AI systems with human intentions. However, this is just the beginning: we will keep pushing these techniques to improve the alignment of our current and future models towards language tools that are safe and helpful to humans.

If you’re interested in these research directions, we’re hiring!


References
  1. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S. and Amodei, D., 2017. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741.
  2. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D.M., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P., 2020. Learning to summarize with human feedback. arXiv preprint arXiv:2009.01325.
  3. Wu, J., Ouyang, L., Ziegler, D.M., Stiennon, N., Lowe, R., Leike, J. and Christiano, P., 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  4. Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M. and Le, Q.V., 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  5. Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M., 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  6. Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
  7. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E. and Brynjolfsson, E., 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  8. Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V. and Irving, G., 2021. Alignment of Language Agents. arXiv preprint arXiv:2103.14659.
  9. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A. and Kenton, Z., 2021. Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359.
  10. Tamkin, A., Brundage, M., Clark, J. and Ganguli, D., 2021. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv preprint arXiv:2102.02503.
  11. Solaiman, I. and Dennison, C., 2021. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv preprint arXiv:2106.10328.
  12. Ngo, H., Raterink, C., Araújo, J.G., Zhang, I., Chen, C., Morisot, A. and Frosst, N., 2021. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790.
  13. Xu, J., Ju, D., Li, M., Boureau, Y.L., Weston, J. and Dinan, E., 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.
  14. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C. and Socher, R., 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  15. Krause, B., Gotmare, A.D., McCann, B., Keskar, N.S., Joty, S., Socher, R. and Rajani, N.F., 2020. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.
  16. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J. and Liu, R., 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  17. Lin, S., Hilton, J. and Evans, O., 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  18. Gehman, S., Gururangan, S., Sap, M., Choi, Y. and Smith, N.A., 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  19. Rudinger, R., Naradowsky, J., Leonard, B. and Van Durme, B., 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  20. Nangia, N., Vania, C., Bhalerao, R. and Bowman, S.R., 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.


Acknowledgments

We’d like to thank our paper co-authors: Long Ouyang, Jeff Wu, Roger Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, and Paul Christiano, along with everyone who provided feedback on the paper and blog post. We’d also like to thank the Comms team for their guidance and assistance, including Steve Dowling, Hannah Wong, Elie Georges, Alper Ercetin, Jared Salzano, Allan Diego, and Justin Jay Wang. Finally, we’d like to thank our labelers, without whom this project would not have been possible.


Footnotes

  1. We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personally identifiable information from all prompts before adding them to the training set. ↩︎

  2. The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication. ↩︎

  3. We also measure several other dimensions of potentially harmful outputs on our API distribution: whether the outputs contain sexual or violent content, denigrate a protected class, or encourage abuse. We find that InstructGPT doesn’t improve significantly over GPT-3 on these metrics; the incidence rate is equally low for both models. ↩︎

  4. We found this approach more effective than simply increasing the KL coefficient. ↩︎

  5. These labelers are sourced from Scale AI and Upwork, similarly to our training labelers, but do not undergo a screening test. ↩︎


Nearly 80 Percent of Financial Firms Use AI to Improve Services, Reduce Fraud

From the largest firms trading on Wall Street to banks providing customers with fraud protection to fintechs recommending best-fit products to consumers, AI is driving innovation across the financial services industry.

New research from NVIDIA found that 78 percent of financial services professionals state that their company uses accelerated computing to deliver AI-enabled applications through machine learning, deep learning or high performance computing.

The survey results, detailed in NVIDIA’s “State of AI in Financial Services” report, are based on responses from over 500 C-suite executives, developers, data scientists, engineers and IT teams working in financial services.

AI Prevents Fraud, Boosts Investments

With more than 70 billion real-time payment transactions processed globally in 2020, financial institutions need robust systems to prevent fraud and reduce costs. Accordingly, fraud detection involving payments and transactions was the top AI use case across all respondents at 31 percent, followed by conversational AI at 28 percent and algorithmic trading at 27 percent.

There was a dramatic increase in the percentage of financial institutions investing in AI use cases year-over-year. AI for underwriting increased fourfold, from 3 percent penetration in 2021 to 12 percent this year. Conversational AI jumped from 8 to 28 percent year-over-year, a 3.5x rise.

Meanwhile, AI-enabled applications for fraud detection, know your customer (KYC) and anti-money laundering (AML) all experienced growth of at least 300 percent in the latest survey. Nine of 13 use cases are now utilized by over 15 percent of financial services firms, whereas none of the use cases exceeded that penetration mark in last year’s report.

Future investment plans remain steady for top AI cases, with enterprise investment priorities for the next six to 12 months marked in green.

Top Current AI Use Cases in Financial Services (Ranked by Industry Sector)
Green highlighted text signifies top AI use cases for investment in the next six to 12 months.

Overcoming AI Challenges

Financial services professionals highlighted the main benefits of AI in yielding more accurate models, creating a competitive advantage and improving customer experience. Overall, 47 percent said that AI enables more accurate models for applications such as fraud detection, risk calculation and product recommendations.

However, there are challenges in achieving a company’s AI goals. Only 16 percent of survey respondents agreed that their company is spending the right amount of money on AI, and 37 percent believed “lack of budget” is the primary challenge in achieving their AI goals. Additional obstacles included too few data scientists, lack of data, and explainability, with a third of respondents listing each option.

Financial institutions such as Munich Re, Scotiabank and Wells Fargo have developed explainable AI models to explain lending decisions and construct diversified portfolios.

Biggest Challenges in Achieving Your Company’s AI Goals (by Role)

Cybersecurity, data sovereignty, data gravity and the option to deploy on-prem, in the cloud or using hybrid cloud are areas of focus for financial services companies as they consider where to host their AI infrastructure. These preferences are inferred from responses about where companies run most of their AI projects, with over three-quarters of the market operating on either on-prem or hybrid instances.

Where Financial Services Companies Run Their AI Workloads

Executives Believe AI Is Key to Business Success

Over half of C-suite respondents agreed that AI is important to their company’s future success. The top total responses to the question “How does your company plan to invest in AI technologies in the future?” were:

  1. Hiring more AI experts (43 percent)
  2. Identifying additional AI use cases (36 percent)
  3. Engaging third-party partners to accelerate AI adoption (36 percent)
  4. Spending more on infrastructure (36 percent)
  5. Providing AI training to staff (32 percent)

However, only 23 percent overall of those surveyed believed their company has the capability and knowledge to move an AI project from research to production. This indicates the need for an end-to-end platform to develop, deploy and manage AI in enterprise applications.

Read the full “State of AI in Financial Services 2022” report to learn more.

Explore NVIDIA’s AI solutions and enterprise-level AI platforms driving the future of financial services.

The post Nearly 80 Percent of Financial Firms Use AI to Improve Services, Reduce Fraud appeared first on The Official NVIDIA Blog.

Read More

Let Me Upgrade You: GeForce NOW Adds Resolution Upscaling and More This GFN Thursday

GeForce NOW is taking cloud gaming to new heights.

This GFN Thursday delivers an upgraded streaming experience as part of an update that is now available to all members. It includes new resolution upscaling options to make members’ gaming experiences sharper, plus the ability to customize streaming settings in session.

The GeForce NOW app is fully releasing on select LG TVs, following a successful beta. To celebrate the launch, for a limited time, those who purchase a qualifying LG TV will also receive a six-month Priority membership to kickstart their cloud gaming experience.

Additionally, this week brings five games to the GeForce NOW library.

Upscale Your Gaming Experience

The newest GeForce NOW update delivers new resolution upscaling options — including an AI-powered option for members with select NVIDIA GPUs.

Resolution upscaling update on GeForce NOW
New year, new options for January’s GeForce NOW update.

This feature, now available to all members with the 2.0.37 update, gives gamers with network bandwidth limitations or higher-resolution displays sharper graphics that match the native resolution of their monitor or laptop.

Resolution upscaling works by applying sharpening effects that reduce visible blurriness while streaming. It can be applied to any game and enabled via the GeForce NOW settings in native PC and Mac apps.

Three upscaling modes are now available. Standard is enabled by default and has minimal impact on system performance. Enhanced provides a higher quality upscale, but may cause some latency depending on your system specifications. AI Enhanced, available to members playing on PC with select NVIDIA GPUs and SHIELD TVs, leverages a trained neural network model along with image sharpening for a more natural look. These new options can be adjusted mid-session.

Learn more about the new resolution upscaling options.

Stream Your Way With Custom Settings

The upgrade brings some additional benefits to members.

Custom streaming quality settings on the PC and Mac apps have been a popular way for members to take control of their streams — including bit rate, VSync and now the new upscaling modes. The update now enables members to adjust some of the streaming quality settings in session using the GeForce NOW in-game overlay. Bring up the overlay by pressing Ctrl+G, then go to Settings > Gameplay to access these settings while streaming.

The update also comes with an improved web-streaming experience on play.geforcenow.com by automatically assigning the ideal streaming resolution for devices that are unable to decode at high streaming bitrates. Finally, there’s also a fix for launching directly into games from desktop shortcuts.

LG Has Got Game With the GeForce NOW App

LG Electronics, the first TV manufacturer to release the GeForce NOW app in beta, is now bringing cloud gaming to several LG TVs at full force.

Owners of LG 2021 4K TV models including OLED, QNED, NanoCell and UHD TVs can now download the fully launched GeForce NOW app in the LG Content Store. The experience requires a gamepad and gives gamers instant access to nearly 35 free-to-play games, like Apex Legends and Destiny 2, as well as more than 800 PC titles from popular digital stores like Steam, Epic Games Store, Ubisoft Connect and Origin.

The GeForce NOW app on LG OLED TVs delivers responsive gameplay and gorgeous, high-quality graphics at 1080p and 60 frames per second. On these select LG TVs, with nothing more than a gamepad, you can enjoy stunning ray-traced graphics and AI technologies with NVIDIA RTX ON. Learn more about support for the app for LG TVs on the system requirements page under LG TV.

GeForce NOW app for LG
Get your game on directly through an LG TV with a six-month GeForce NOW Priority membership.

In celebration of the app’s full launch and the expansion of devices supported by GeForce NOW, qualifying LG purchases from Feb. 1 to March 27 in the United States come bundled with a sweet six-month Priority membership to the service.

Priority members experience legendary GeForce PC gaming across all of their devices, as well as benefits including priority access to gaming servers, extended session lengths and RTX ON for cinematic-quality in-game graphics.

To collect a free six-month Priority membership, purchase a qualifying LG TV and submit a claim. ​Upon claim approval, you’ll receive a GeForce NOW promo code via email. Create an NVIDIA account for free or sign in to your existing GeForce NOW account to redeem the gifted membership.

This offer is available to those who purchase applicable 2021 model LG 4K TVs in select markets during the promotional period. Current GeForce NOW promotional members are not eligible for this offer. Availability and deadline to claim free membership varies by market. Consult LG’s official country website, starting Feb. 1, for full details. Terms and conditions apply.

It’s Playtime

Mortal Online 2 on GeForce NOW
Explore a massive open world and choose your own path in Mortal Online 2.

Start your weekend with the following five titles coming to the cloud this week:

While you kick off your weekend with gaming fun, we’ve got a question for you this week:

The post Let Me Upgrade You: GeForce NOW Adds Resolution Upscaling and More This GFN Thursday appeared first on The Official NVIDIA Blog.

Read More

Demystifying machine-learning systems

Neural networks are sometimes called black boxes because, despite the fact that they can outperform humans on certain tasks, even the researchers who design them often don’t understand how or why they work so well. But if a neural network is used outside the lab, perhaps to classify medical images that could help diagnose heart conditions, knowing how the model works helps researchers predict how it will behave in practice.

MIT researchers have now developed a method that sheds some light on the inner workings of black box neural networks. Modeled off the human brain, neural networks are arranged into layers of interconnected nodes, or “neurons,” that process data. The new system can automatically produce descriptions of those individual neurons, generated in English or another natural language.

For instance, in a neural network trained to recognize animals in images, their method might describe a certain neuron as detecting ears of foxes. Their scalable technique is able to generate more accurate and specific descriptions for individual neurons than other methods.

In a new paper, the team shows that this method can be used to audit a neural network to determine what it has learned, or even edit a network by identifying and then switching off unhelpful or incorrect neurons.

“We wanted to create a method where a machine-learning practitioner can give this system their model and it will tell them everything it knows about that model, from the perspective of the model’s neurons, in language. This helps you answer the basic question, ‘Is there something my model knows about that I would not have expected it to know?’” says Evan Hernandez, a graduate student in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

Co-authors include Sarah Schwettmann, a postdoc in CSAIL; David Bau, a recent CSAIL graduate who is an incoming assistant professor of computer science at Northeastern University; Teona Bagashvili, a former visiting student in CSAIL; Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and senior author Jacob Andreas, the X Consortium Assistant Professor in CSAIL. The research will be presented at the International Conference on Learning Representations.

Automatically generated descriptions

Most existing techniques that help machine-learning practitioners understand how a model works either describe the entire neural network or require researchers to identify concepts they think individual neurons could be focusing on.

The system Hernandez and his collaborators developed, dubbed MILAN (mutual-information guided linguistic annotation of neurons), improves upon these methods because it does not require a list of concepts in advance and can automatically generate natural language descriptions of all the neurons in a network. This is especially important because one neural network can contain hundreds of thousands of individual neurons.

MILAN produces descriptions of neurons in neural networks trained for computer vision tasks like object recognition and image synthesis. To describe a given neuron, the system first inspects that neuron’s behavior on thousands of images to find the set of image regions in which the neuron is most active. Next, it selects a natural language description for each neuron to maximize a quantity called pointwise mutual information between the image regions and descriptions. This encourages descriptions that capture each neuron’s distinctive role within the larger network.
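As a toy illustration of that selection rule (with invented probability tables standing in for MILAN's learned models), the chosen description is the one whose probability rises the most when conditioned on the neuron's top-activating regions:

```python
import math

# Hypothetical candidate descriptions with (a) their probability under a captioning-style model
# conditioned on the neuron's top-activating image regions, and (b) their prior probability under
# a plain language model. All numbers are invented for illustration.
candidates = {
    "dog":                        {"p_given_regions": 0.20, "p_prior": 0.050},
    "left side of ears on dogs":  {"p_given_regions": 0.08, "p_prior": 0.002},
    "animal":                     {"p_given_regions": 0.25, "p_prior": 0.120},
}

def pmi(p_given_regions, p_prior):
    # Pointwise mutual information: log p(description | regions) - log p(description).
    return math.log(p_given_regions) - math.log(p_prior)

best = max(candidates, key=lambda d: pmi(candidates[d]["p_given_regions"],
                                         candidates[d]["p_prior"]))
print(best)   # the specific description wins despite its lower conditional probability
```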

“In a neural network that is trained to classify images, there are going to be tons of different neurons that detect dogs. But there are lots of different types of dogs and lots of different parts of dogs. So even though ‘dog’ might be an accurate description of a lot of these neurons, it is not very informative. We want descriptions that are very specific to what that neuron is doing. This isn’t just dogs; this is the left side of ears on German shepherds,” says Hernandez.

The team compared MILAN to other models and found that it generated richer and more accurate descriptions, but the researchers were more interested in seeing how it could assist in answering specific questions about computer vision models.      

Analyzing, auditing, and editing neural networks

First, they used MILAN to analyze which neurons are most important in a neural network. They generated descriptions for every neuron and sorted them based on the words in the descriptions. They slowly removed neurons from the network to see how its accuracy changed, and found that neurons that had two very different words in their descriptions (vases and fossils, for instance) were less important to the network.
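A hedged sketch of that kind of ablation in PyTorch, zeroing out one unit's activations with a forward hook before re-measuring accuracy; the toy model and the neuron index are placeholders:

```python
import torch
import torch.nn as nn

# Toy classifier standing in for the networks analyzed in the paper.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
layer_to_edit = model[1]        # ablate units in the hidden ReLU layer
neuron_index = 7                # placeholder: the neuron flagged by its description

def ablate_neuron(module, inputs, output):
    output = output.clone()
    output[:, neuron_index] = 0.0   # zero out the chosen unit's activation
    return output

handle = layer_to_edit.register_forward_hook(ablate_neuron)

x = torch.randn(16, 32)
with torch.no_grad():
    preds_ablated = model(x).argmax(dim=-1)   # accuracy would be re-measured here

handle.remove()                                # restore the original network
```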

They also used MILAN to audit models to see if they learned something unexpected. The researchers took image classification models that were trained on datasets in which human faces were blurred out, ran MILAN, and counted how many neurons were nonetheless sensitive to human faces.

“Blurring the faces in this way does reduce the number of neurons that are sensitive to faces, but far from eliminates them. As a matter of fact, we hypothesize that some of these face neurons are very sensitive to specific demographic groups, which is quite surprising. These models have never seen a human face before, and yet all kinds of facial processing happens inside them,” Hernandez says.

In a third experiment, the team used MILAN to edit a neural network by finding and removing neurons that were detecting bad correlations in the data, which led to a 5 percent increase in the network’s accuracy on inputs exhibiting the problematic correlation.

While the researchers were impressed by how well MILAN performed in these three applications, the model sometimes gives descriptions that are still too vague, or it will make an incorrect guess when it doesn’t know the concept it is supposed to identify.

They are planning to address these limitations in future work. They also want to continue enhancing the richness of the descriptions MILAN is able to generate. They hope to apply MILAN to other types of neural networks and use it to describe what groups of neurons do, since neurons work together to produce an output.

“This is an approach to interpretability that starts from the bottom up. The goal is to generate open-ended, compositional descriptions of function with natural language. We want to tap into the expressive power of human language to generate descriptions that are a lot more natural and rich for what neurons do. Being able to generalize this approach to different types of models is what I am most excited about,” says Schwettmann.

“The ultimate test of any technique for explainable AI is whether it can help researchers and users make better decisions about when and how to deploy AI systems,” says Andreas. “We’re still a long way off from being able to do that in a general way. But I’m optimistic that MILAN — and the use of language as an explanatory tool more broadly — will be a useful part of the toolbox.”

This work was funded, in part, by the MIT-IBM Watson AI Lab and the SystemsThatLearn@CSAIL initiative.

Read More