Lessons Learned on Language Model Safety and Misuse


The deployment of powerful AI systems has enriched our understanding of safety and misuse far more than would have been possible through research alone. Notably:

  • API-based language model misuse often comes in different forms than we feared most.
  • We have identified limitations in existing language model evaluations that we are addressing with novel benchmarks and classifiers.
  • Basic safety research offers significant benefits for the commercial utility of AI systems.

Here, we describe our latest thinking in the hope of helping other AI developers address safety and misuse of deployed models.


Over the past two years, we’ve learned a lot about how language models can be used and abused—insights we couldn’t have gained without the experience of real-world deployment. In June 2020, we began giving developers and researchers access to the OpenAI API, an interface for accessing and building applications on top of new AI models developed by OpenAI. Deploying GPT-3, Codex, and other models in a way that reduces risks of harm has posed various technical and policy challenges.

Overview of Our Model Deployment Approach

Large language models are now capable of performing a very wide range of tasks, often out of the box. Their risk profiles, potential applications, and wider effects on society remain poorly understood. As a result, our deployment approach emphasizes continuous iteration, and makes use of the following strategies aimed at maximizing the benefits of deployment while reducing associated risks:

  • Pre-deployment risk analysis, leveraging a growing set of safety evaluations and red teaming tools (e.g., we checked our InstructGPT for any safety degradations using the evaluations discussed below)
  • Starting with a small user base (e.g., both GPT-3 and our InstructGPT series began as private betas)
  • Studying the results of pilots of novel use cases (e.g., exploring the conditions under which we could safely enable longform content generation, working with a small number of customers)
  • Implementing processes that help keep a pulse on usage (e.g., review of use cases, token quotas, and rate limits)
  • Conducting detailed retrospective reviews (e.g., of safety incidents and major deployments)


Note that this diagram is intended to visually convey the need for feedback loops in the continuous process of model development and deployment and the fact that safety must be integrated at each stage. It is not intended to convey a complete or ideal picture of our or any other organization’s process.

There is no silver bullet for responsible deployment, so we try to learn about and address our models’ limitations, and potential avenues for misuse, at every stage of development and deployment. This approach allows us to learn as much as we can about safety and policy issues at small scale and incorporate those insights prior to launching larger-scale deployments.



While not exhaustive, the areas where we’ve invested so far span each stage of model development and deployment.[1] Since each stage of intervention has limitations, a holistic approach is necessary.

There are areas where we could have done more and where we still have room for improvement. For example, when we first worked on GPT-3, we viewed it as an internal research artifact rather than a production system and were not as aggressive in filtering out toxic training data as we might have otherwise been. We have invested more in researching and removing such material for subsequent models. We have taken longer to address some instances of misuse in cases where we did not have clear policies on the subject, and have gotten better at iterating on those policies. And we continue to iterate towards a package of safety requirements that is maximally effective in addressing risks, while also being clearly communicated to developers and minimizing excessive friction.

Still, we believe that our approach has enabled us to measure and reduce various types of harms from language model use compared to a more hands-off approach, while at the same time enabling a wide range of scholarly, artistic, and commercial applications of our models.[2]

The Many Shapes and Sizes of Language Model Misuse

OpenAI has been active in researching the risks of AI misuse since our early work on the malicious use of AI in 2018 and on GPT-2 in 2019, and we have paid particular attention to AI systems empowering influence operations. We have worked with external experts to develop proofs of concept and promoted careful analysis of such risks by third parties. We remain committed to addressing risks associated with language model-enabled influence operations and recently co-organized a workshop on the subject.[3]

Yet we have detected and stopped hundreds of actors attempting to misuse GPT-3 for a much wider range of purposes than producing disinformation for influence operations, including in ways that we either didn’t anticipate or which we anticipated but didn’t expect to be so prevalent.[4] Our use case guidelines, content guidelines, and internal detection and response infrastructure were initially oriented towards risks that we anticipated based on internal and external research, such as generation of misleading political content with GPT-3 or generation of malware with Codex. Our detection and response efforts have evolved over time in response to real cases of misuse encountered “in the wild” that didn’t feature as prominently as influence operations in our initial risk assessments. Examples include spam promotions for dubious medical products and roleplaying of racist fantasies.

To support the study of language model misuse and its mitigation, we are actively exploring opportunities to share statistics on safety incidents this year, in order to concretize these discussions.

The Difficulty of Risk and Impact Measurement

Many aspects of language models’ risks and impacts remain hard to measure and therefore hard to monitor, minimize, and disclose in an accountable way. We have made active use of existing academic benchmarks for language model evaluation and are eager to continue building on external work, but we have also found that existing benchmark datasets are often not reflective of the safety and misuse risks we see in practice.[5]

Such limitations reflect the fact that academic datasets are seldom created for the explicit purpose of informing production use of language models, and do not benefit from the experience gained from deploying such models at scale. As a result, we’ve been developing new evaluation datasets and frameworks for measuring the safety of our models, which we plan to release soon. Specifically, we have developed new evaluation metrics for measuring toxicity in model outputs and have also developed in-house classifiers for detecting content that violates our content policy, such as erotic content, hate speech, violence, harassment, and self-harm. Both of these in turn have also been leveraged for improving our pre-training data[6]—specifically, by using the classifiers to filter out content and the evaluation metrics to measure the effects of dataset interventions.
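To illustrate the general pattern (not OpenAI’s actual tooling), here is a minimal sketch of how a policy classifier and a toxicity metric might be combined to filter a pre-training corpus and quantify the effect of that intervention; violates_policy and toxicity_score are hypothetical stand-ins for the in-house models described above.

# Hypothetical sketch: filter a corpus with a policy classifier and measure
# the effect of that intervention with a toxicity metric. Both callables are
# placeholders, not OpenAI's actual classifiers or metrics.
def filter_and_measure(documents, violates_policy, toxicity_score):
    kept = [doc for doc in documents if not violates_policy(doc)]
    mean_before = sum(map(toxicity_score, documents)) / max(len(documents), 1)
    mean_after = sum(map(toxicity_score, kept)) / max(len(kept), 1)
    return kept, {"mean_toxicity_before": mean_before, "mean_toxicity_after": mean_after}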

Reliably classifying individual model outputs along various dimensions is difficult, and measuring their social impact at the scale of the OpenAI API is even harder. We have conducted several internal studies in order to build an institutional muscle for such measurement, but these have often raised more questions than answers.

We are particularly interested in better understanding the economic impact of our models and the distribution of those impacts. We have good reason to believe that the labor market impacts from the deployment of current models may already be significant in absolute terms, and that they will grow as the capabilities and reach of our models grow. We have learned of a variety of local effects to date, including massive productivity improvements on existing tasks performed by individuals, such as copywriting and summarization (sometimes contributing to job displacement and creation), as well as cases where the API unlocked new applications that were previously infeasible, such as synthesis of large-scale qualitative feedback. But we lack a good understanding of the net effects.

We believe that it is important for those developing and deploying powerful AI technologies to address both the positive and negative effects of their work head-on. We discuss some steps in that direction in the concluding section of this post.

The Relationship Between the Safety and Utility of AI Systems

In our Charter, published in 2018, we say that we “are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions.” We then published a detailed analysis of competitive AI development, and we have closely followed subsequent research. At the same time, deploying AI systems via the OpenAI API has also deepened our understanding of the synergies between safety and utility.

For example, developers overwhelmingly prefer our InstructGPT models—which are fine-tuned to follow user intentions[7]—over the base GPT-3 models. Notably, however, the InstructGPT models were not originally motivated by commercial considerations, but rather were aimed at making progress on long-term alignment problems. In practical terms, this means that customers, perhaps not surprisingly, much prefer models that stay on task and understand the user’s intent, and models that are less likely to produce outputs that are harmful or incorrect.[8] Other fundamental research, such as our work on leveraging information retrieved from the Internet in order to answer questions more truthfully, also has potential to improve the commercial utility of AI systems.[9]

These synergies will not always occur. For example, more powerful systems will often take more time to evaluate and align effectively, foreclosing immediate opportunities for profit. And a user’s utility and that of society may not be aligned due to negative externalities—consider fully automated copywriting, which can be beneficial for content creators but bad for the information ecosystem as a whole.

It is encouraging to see cases of strong synergy between safety and utility, but we are committed to investing in safety and policy research even when they trade off with commercial utility.



Ways to Get Involved

Each of the lessons above raises new questions of its own. What kinds of safety incidents might we still be failing to detect and anticipate? How can we better measure risks and impacts? How can we continue to improve both the safety and utility of our models, and navigate tradeoffs between these two when they do arise?

We are actively discussing many of these issues with other companies deploying language models. But we also know that no organization or set of organizations has all the answers, and we would like to highlight several ways that readers can get more involved in understanding and shaping our deployment of state-of-the-art AI systems.

First, gaining first-hand experience interacting with state-of-the-art AI systems is invaluable for understanding their capabilities and implications. We recently ended the API waitlist after building more confidence in our ability to effectively detect and respond to misuse. Individuals in supported countries and territories can quickly get access to the OpenAI API by signing up here.

Second, researchers working on topics of particular interest to us such as bias and misuse, and who would benefit from financial support, can apply for subsidized API credits using this form. External research is vital for informing both our understanding of these multifaceted systems, as well as wider public understanding.

Finally, today we are publishing a research agenda exploring the labor market impacts associated with our Codex family of models, and a call for external collaborators on carrying out this research. We are excited to work with independent researchers to study the effects of our technologies in order to inform appropriate policy interventions, and to eventually expand our thinking from code generation to other modalities.

If you’re interested in working to responsibly deploy cutting-edge AI technologies, apply to work at OpenAI!


Acknowledgments

Thanks to Lilian Weng, Rosie Campbell, Anna Makanju, Bob McGrew, Hannah Wong, Ryan Lowe, Steve Dowling, Mira Murati, Sam Altman, Greg Brockman, Ilya Sutskever, Percy Liang, Peter Welinder, Ethan Perez, Ellie Evans, Helen Ngo, Helen Toner, Justin Jay Wang, Jack Clark, Rishi Bommasani, Girish Sastry, Sarah Shoker, Matt Knight, Bianca Martin, Bob Rotsted, Lama Ahmad, Toki Sherbakov, and others for providing feedback on this post and related work.


Footnotes

  1. This post is based on our approach to deploying language models through an API, and as such the lessons and mitigations described are most relevant to those also pursuing API-based deployment. However, we also expect some of the discussion to be relevant to those building first-party applications using language models and those considering the open source release of language models. ↩︎

  2. This post is intended to explain and share learnings from our approach, rather than to suggest that all actors should necessarily adopt the same approach, or that the same approach is applicable to all possible AI systems. There are benefits and costs associated with different deployment approaches, different models will benefit more or less from study prior to deployment, and in some cases it can be valuable for distinct deployment paths to be pursued by different actors. ↩︎

  3. More details on this workshop will be included in the forthcoming publication based on it. ↩︎

  4. The mitigations that we emphasize in response to misuse have also evolved. For example, we initially focused on long form text generation as a threat vector, given prior cases of influence operations that involved people manually writing long form misleading content. Given that emphasis, we set maximum output lengths for generated text. Based on a pilot study of long form generation, however, we saw that output restrictions had little effect on policy violations—we’ve come to believe instead that short-form content amplifying or increasing engagement on misleading content could be the greater risk. ↩︎

  5. Examples of limitations in existing datasets, from the perspective of practitioners seeking a holistic assessment of the safety of real language model outputs, include the following: an overly narrow focus (e.g., just measuring occupational gender bias), an overly broad focus (e.g., measuring everything under the umbrella of “toxicity”), a tendency to abstract away the specifics of use and context, a failure to measure the generative dimension of language model use (e.g., using a multiple-choice format), prompts that differ stylistically from those typically used in real language model use cases, not capturing dimensions of safety that are important in practice (e.g., an output following or ignoring a safety-motivated constraint in the instruction), or not capturing types of outputs we have found to be correlated with misuse (e.g., erotic content). ↩︎

  6. While our efforts are specifically oriented towards addressing limitations in existing benchmarks and in our own models, we also acknowledge that there are limitations to the methods we use such as classifier-based data filtration. For instance, operationally defining the content areas we aim to detect via filtration is challenging and filtration itself can introduce harmful biases. Additionally, the labeling of toxic data is a critical component of this work and ensuring the mental health of these labelers is an industry-wide challenge. ↩︎

  7. The relevant “user” of our API may be a developer building an application or an end user interacting with such an application, depending on context. There are deep questions about the values our aligned models reflect, and we hope to build a more nuanced understanding of how to balance the values of a wide range of possible users and competing objectives when aligning language models to be more helpful, more truthful, and less harmful. ↩︎

  8. More aligned models also have practical advantages, such as reducing the need for “prompt engineering” (providing examples of the desired behavior to steer the model in the right direction) and saving space in the model’s context window for other purposes. ↩︎

  9. Beyond research, we have found that other safety-motivated interventions sometimes have unexpected benefits to customers. For example, rate limits intended to curb spam or misleading content also help customers to control expenses. ↩︎


Economic Impacts Research at OpenAI

Core to our mission of ensuring that artificial general intelligence benefits all of humanity is understanding the economic impacts that our models have on individuals and society as a whole. Developing tools to rigorously measure the economic impacts of our models is essential to making smarter development and deployment decisions and critical to informing public policy options that maximize human prosperity and minimize the risk of economic harms from AI. Our ability to generate high quality evidence to inform these decisions will be greatly enhanced by developing a range of productive research partnerships, and we firmly believe that AI developers need to support external researchers undertaking this work, rather than exclusively conducting research in-house.

Under this premise, we are publishing our first public research agenda on these topics, which describes our preliminary priorities for research on the economic impacts of code generation models. Today, we are excited to complement this research agenda with concrete action to facilitate improved measurement of the economic impacts of our models: we are launching a call for expressions of interest from researchers interested in evaluating the economic impact of Codex, our AI system that translates natural language to code. If you are a PhD-level researcher (including current doctoral students) interested in collaborating on this research, we encourage you to fill out the expression of interest form.


Importance of Studying Economic Impacts

As an AI research and deployment company, OpenAI recognizes that our decisions around AI system design and deployment can influence economic impacts and the distribution of economic benefits from advances in AI. Despite remarkable technological progress over the past several decades, gains in economic prosperity have not been widely distributed. In the US, trends in both income and wealth inequality over the last forty years demonstrate a worrying pace of economic divergence and uneven access to opportunity.[1] While recent evidence suggests that there is little immediate risk of widespread technological unemployment due to AI,[2] it is clear that the labor market impacts of increasingly advanced AI will vary widely across different types of workers. Unemployment shocks, even if transitory, have been shown to have widespread negative effects on individual wellbeing,[3] and increasing economic inequality may amplify societal cleavages.[4]

We are eager to support and conduct research that has the potential to impact decision-making on three axes:

  1. AI deployment policies
  2. AI system design decisions
  3. Public policy, by providing evidence that policymakers can draw on.

While we don’t anticipate that the current capabilities of Codex could threaten large-scale economic disruption, future capabilities of code generation and other large language model applications could. We need to engage in research about the economic impact of our models today in order to be positioned to assess the safety of developing and releasing more capable systems in the future. Codex provides a tractable opportunity to establish the foundation for this research going forward.

External Research Collaborators

As an external research collaborator, you would be connected (via OpenAI) to firms that are currently using Codex models or that plan to in the future. You would have the opportunity to work with OpenAI and these firms to implement research projects focused on empirically measuring the impact of Codex on outcomes like worker and firm productivity, labor demand, and skill development. Where necessary and when possible, OpenAI would help facilitate data access to enable impactful research and would provide academic access to Codex and future models. OpenAI will also provide research management resources to external researchers, and researchers would have the freedom to publish their results independently or as co-authors with collaborators at OpenAI. Finally, we intend to facilitate discussions between external researchers, AI developers, AI-adopting firms, and workers in various industries that have been affected by advances in AI in an effort to widen the range of perspectives that can shape the path of AI development and deployment.

If you are a researcher considering submitting an expression of interest, please fill out this form. Additionally, consider emailing us your questions at econ@openai.com to learn more about our goals for economic impacts research and how you can be involved.

If you are a company or user of Codex models and want to learn how you can contribute to this work moving forward, please fill out this form.

Submission Process

If you would like to submit an expression of interest to be a Research Collaborator, please use this form.


We are currently seeking submissions from PhD-level researchers, including current doctoral students. When evaluating expressions of interest, we will assess your background and experience, clarity of motivation to collaborate with OpenAI, and both the clarity and decision-relevance of your research interests related to the economic impact of Codex.


We are in the process of connecting researchers with firms that are best equipped to support particular research interests. If you’re interested in learning more about how your organization can support or sponsor research on economic impacts of AI systems, please contact us here.

Additional Information

If you have any questions about the submission forms or the call for expressions of interest, please contact us at econ@openai.com.


Acknowledgments

Thanks to Steven Adler, Lama Ahmad, Stephanie Bell, Miles Brundage, Katya Klinova, Gretchen Krueger, Jade Leung, Anna Makanju, Katie Mayer, Richard Ngo, Cullen O’Keefe, Girish Sastry, Sarah Shoker, and Natalie Staudacher for feedback on drafts of this document. Thanks to Michelle Alexopoulos, Sarah Bana, Alex Bartik, Erik Brynjolfsson, Tim de Stefano, Avi Goldfarb, Marlène Koffi, Mina Lee, Zanele Munyikwa, Mark Muro, Frank Nagle, Maria del Rio-Chanona, Daniel Rock, Anna Salomons, and Ben Weidmann for helpful discussions on potential avenues for research on the economic impacts of code generation models.


References
  1. Chetty, Raj, et al. “The fading American dream: Trends in absolute income mobility since 1940.” Science 356.6336 (2017): 398-406.; Saez, Emmanuel, and Gabriel Zucman. “The rise of income and wealth inequality in America: Evidence from distributional macroeconomic accounts.” Journal of Economic Perspectives 34.4 (2020): 3-26.

  2. Autor, David, David Mindell, and Elisabeth Reynolds. “The Work of the Future: Building Better Jobs in an Age of Intelligent Machines.” Boston: MIT (2020). https://workofthefuture.mit.edu/wp-content/uploads/2021/01/2020-Final-Report4.pdf

  3. Brand, Jennie E. “The far-reaching impact of job loss and unemployment.” Annual review of sociology 41 (2015): 359-375.

  4. Van de Werfhorst, Herman G., and Wiemer Salverda. “Consequences of economic inequality: Introduction to a special issue.” Research in Social Stratification and Mobility 30.4 (2012): 377-387.



TRILLsson: Small, Universal Speech Representations for Paralinguistic Tasks

In recent years, we have seen dramatic improvements on lexical tasks such as automatic speech recognition (ASR). However, machine systems still struggle to understand paralinguistic aspects — such as tone, emotion, whether a speaker is wearing a mask, etc. Understanding these aspects represents one of the remaining difficult problems in machine hearing. In addition, state-of-the-art results often come from ultra-large models trained on private data, making them impractical to run on mobile devices or to release publicly.

In “Universal Paralinguistic Speech Representations Using Self-Supervised Conformers”, to appear in ICASSP 2022, we introduce CAP12, the 12th layer of a 600M parameter model trained on the YT-U training dataset using self-supervision. We demonstrate that the CAP12 model outperforms nearly all previous results in our paralinguistic benchmark, sometimes by large margins, even though previous results are often task-specific. In “TRILLsson: Distilled Universal Paralinguistic Speech Representations”, we introduce the small, performant, publicly available TRILLsson models and demonstrate how we reduced the size of the high-performing CAP12 model by 6x–100x while maintaining 90–96% of the performance. To create TRILLsson, we apply knowledge distillation on appropriately sized audio chunks and use different architecture types to train smaller, faster networks that are small enough to run on mobile devices.

1M-Hour Dataset to Train Ultra-Large Self-Supervised Models
We leverage the YT-U training dataset to train the ultra-large, self-supervised CAP12 model. The YT-U dataset is a highly varied, 900K+ hour dataset that contains audio spanning a wide range of topics, background conditions, and speaker acoustic properties.

Video categories by length (outer) and number (inner), demonstrating the variety in the YT-U dataset (figure from BigSSL)

We then modify a Wav2Vec 2.0 self-supervised training paradigm, which can solve tasks using raw data without labels, and combine it with ultra-large Conformer models. Because self-training doesn’t require labels, we can take full advantage of YT-U by scaling up our models to some of the largest model sizes ever trained, including 600M, 1B, and 8B parameters.

NOSS: A Benchmark for Paralinguistic Tasks
We demonstrate that an intermediate representation of one of the previous models contains a state-of-the-art representation for paralinguistic speech. We call the 600M parameter Conformer model without relative attention Conformer Applied to Paralinguistics (CAP). We exhaustively search through all intermediate representations of six ultra-large models and find that layer 12 (CAP12) outperforms previous representations by significant margins.

To measure the quality of the roughly 300 candidate paralinguistic speech representations, we evaluate on an expanded version of the NOn-Semantic Speech (NOSS) benchmark, which is a collection of well-studied paralinguistic speech tasks, such as speech emotion recognition, language identification, and speaker identification. These tasks focus on paralinguistic aspects of speech, which require evaluating speech features on the order of 1 second or longer, rather than lexical features, which require features on the order of 100 ms or shorter. We then add to the benchmark a mask-wearing task introduced at Interspeech 2020, a fake speech detection task (ASVSpoof 2019), a task to detect the level of dysarthria from Project Euphonia, and an additional speech emotion recognition task (IEMOCAP). By expanding the benchmark and increasing the diversity of the tasks, we empirically demonstrate that CAP12 is even more generally useful than previous representations.

Simple linear models on time-averaged CAP12 representations even outperform complex, task-specific models on five out of eight paralinguistic tasks. This is surprising because the comparable models sometimes use additional modalities (e.g., vision and speech, or text and speech) as well. Furthermore, CAP12 is exceptionally good at emotion recognition tasks. CAP12 embeddings also outperform all other embeddings on every remaining task, with a single exception: one embedding from a supervised network does better on the dysarthria detection task.

Model | Voxceleb∗ | Voxforge | Speech Commands | ASVSpoof2019∗∗ | Euphonia# | CREMA-D | IEMOCAP
Prev SoTA | — | 95.4 | 97.9 | 5.11 | 45.9 | 74.0 | 67.6+
TRILL | 12.6 | 84.5 | 77.6 | 74.6 | 48.1 | 65.7 | 54.3
ASR Embedding | 5.2 | 98.9 | 96.1 | 11.2 | 54.5 | 71.8 | 65.4
Wav2Vec2 layer 6†† | 17.9 | 98.5 | 95.0 | 6.7 | 48.2 | 77.4 | 65.8
CAP12 | 51.0 | 99.7 | 97.0 | 2.5 | 51.5 | 88.2 | 75.0

Test performance on the NOSS Benchmark and extended tasks. “Prev SoTA” indicates the previous best-performing state-of-the-art model, which has arbitrary complexity; all other rows are linear models on time-averaged input. ∗ Filtered according to YouTube’s privacy guidelines. ∗∗ Uses equal error rate [20]. # The only non-public dataset; we exclude it from aggregate scores. Some previous state-of-the-art models used both audio and visual features. + The previous state-of-the-art model performed cross-validation; for our evaluation, we hold out two specific speakers as a test set. †† Wav2Vec 2.0 model from HuggingFace; the best overall layer was layer 6.
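To make the linear-probe protocol behind the table concrete, here is a minimal sketch using scikit-learn; embed_frames is a placeholder for a CAP12-style model that maps an audio clip to a (num_frames, embedding_dim) array, and the benchmark tasks would supply the clips and labels.

import numpy as np
from sklearn.linear_model import LogisticRegression

def time_averaged_embedding(clip, embed_frames):
    # embed_frames: placeholder for a CAP12-style frontend returning an array
    # of shape (num_frames, embedding_dim) for one clip.
    return embed_frames(clip).mean(axis=0)

def linear_probe_accuracy(train_clips, y_train, test_clips, y_test, embed_frames):
    X_train = np.stack([time_averaged_embedding(c, embed_frames) for c in train_clips])
    X_test = np.stack([time_averaged_embedding(c, embed_frames) for c in test_clips])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)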

TRILLsson: Small, High Quality, Publicly Available Models
Similar to FRILL, our next step was to make an on-device, publicly available version of CAP12. This involved using knowledge distillation to train smaller, faster, mobile-friendly architectures. We experimented with EfficientNet, Audio Spectrogram Transformer (AST), and ResNet. These model types are very different, and cover both fixed-length and arbitrary-length inputs. EfficientNet comes from a neural architecture search over vision models to find simultaneously performant and efficient model structures. AST models are transformers adapted to audio inputs. ResNet is a standard architecture that has shown good performance across many different tasks.

We trained models that performed on average 90-96% as well as CAP12, despite being 1%-15% the size and trained using only 6% the data. Interestingly, we found that different architecture types performed better at different sizes. ResNet models performed best at the low end, EfficientNet in the middle, and AST models at the larger end.

Aggregate embedding performance vs. model size for various student model architectures and sizes. We demonstrate that ResNet architectures perform best for small sizes, EfficientNetV2 performs best in the midsize model range, up to the largest model size tested, after which the larger AST models are best.

We perform knowledge distillation with the goal of matching a student, with a fixed-size input, to the output of a teacher, with a variable-size input, for which there are two methods of generating student targets: global matching and local matching. Global matching produces distillation targets by generating CAP12 embeddings for an entire audio clip, and then requires that a student match the target from just a small segment of audio (e.g., 2 seconds). Local matching requires that the student network match the average CAP12 embedding just over the smaller portion of the audio that the student sees. In our work, we focused on local matching.

Two types of generating distillation targets for sequences. Left: Global matching uses the average CAP12 embedding over the whole clip for the target for each local chunk. Right: Local matching uses CAP12 embeddings averaged just over local clips as the distillation target.
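A small NumPy sketch of the two target-generation schemes, under the assumption that a clip has already been split into fixed-length windows and that teacher_embed is a placeholder standing in for CAP12:

import numpy as np

def distillation_targets(windows, teacher_embed, chunk_size=4, local=True):
    # Returns one target per student chunk of `chunk_size` windows.
    # local=False: every chunk matches the average teacher embedding of the
    #              whole clip (global matching).
    # local=True:  every chunk matches the teacher embedding averaged only
    #              over the windows that chunk actually sees (local matching).
    teacher_embs = [teacher_embed(w) for w in windows]
    clip_target = np.mean(teacher_embs, axis=0)
    targets = []
    for start in range(0, len(windows), chunk_size):
        chunk_embs = teacher_embs[start:start + chunk_size]
        targets.append(np.mean(chunk_embs, axis=0) if local else clip_target)
    return targets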

Observation of Bimodality and Future Directions
Paralinguistic information shows an unexpected bimodal distribution. For the CAP model that operates on 500 ms input segments, and two of the full-input Conformer models, intermediate representations gradually increase in paralinguistic information, then decrease, then increase again, and finally lose this information towards the output layer. Surprisingly, this pattern is also seen when exploring the intermediate representations of networks trained on retinal images.

500 ms inputs to CAP show a relatively pronounced bimodal distribution of paralinguistic information across layers.
Two of the conformer models with full inputs show a bimodal distribution of paralinguistic information across layers.

We hope that smaller, faster models for paralinguistic speech unlock new applications in speech recognition, text-to-speech generation, and understanding user intent. We also expect that smaller models will be more easily interpretable, which will allow researchers to understand what aspects of speech are important for paralinguistics. Finally, we hope that our open-sourced speech representations are used by the community to improve paralinguistic speech tasks and user understanding in private or small datasets.

Acknowledgements
I’d like to thank my co-authors Aren Jansen, Wei Han, Daniel Park, Yu Zhang, and Subhashini Venugopalan for their hard work and creativity on this project. I’d also like to thank the members of the large collaboration for the BigSSL work, without which these projects would not be possible. The team includes James Qin, Anmol Gulati, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu.


3 Questions: Fotini Christia on racial equity and data science

Fotini Christia is the Ford International Professor in the Social Sciences in the Department of Political Science, associate director of the Institute for Data, Systems, and Society (IDSS), and director of the Sociotechnical Systems Research Center (SSRC). Her research interests include issues of conflict and cooperation in the Muslim world, and she has conducted fieldwork in Afghanistan, Bosnia, Iran, the Palestinian Territories, Syria, and Yemen. She has co-organized the IDSS Research Initiative on Combatting Systemic Racism (ICSR), which works to bridge the social sciences, data science, and computation by bringing researchers from these disciplines together to address systemic racism across housing, health care, policing, education, employment, and other sectors of society.

Q: What is the IDSS/ICSR approach to systemic racism research?

A: The Research Initiative on Combatting Systemic Racism (ICSR) aims to seed and coordinate cross-disciplinary research to identify and overcome racially discriminatory processes and outcomes across a range of U.S. institutions and policy domains.

Building off the extensive social science literature on systemic racism, the focus of this research initiative is to use big data to develop and harness computational tools that can help effect structural and normative change toward racial equity.

The initiative aims to create a visible presence at MIT for cutting-edge computational research with a racial equity lens, across societal domains that will attract and train students and scholars.

The steering committee for this research initiative is composed of underrepresented minority faculty members from across MIT’s five schools and the MIT Schwarzman College of Computing. Members will serve as close advisors to the initiative as well as share the findings of our work beyond MIT’s campus. MIT Chancellor Melissa Nobles heads this committee.

Q: What role can data science play in helping to effect change toward racial equity?

A: Existing work has shown racial discrimination in the job market, in the criminal justice system, as well as in education, health care, and access to housing, among other places. It has also underlined how algorithms could further entrench such bias — be it in training data or in the people who build them. Data science tools can help not only identify, but also propose fixes for, racially inequitable outcomes that result from implicit or explicit biases in governing institutional practices in the public and private sectors, and more recently from the use of AI and algorithmic methods in decision-making.

To that effect, this initiative will produce research that explores and collects the relevant big data across domains, while paying attention to the ways such data are collected, and focus on improving and developing data-driven computational tools to address racial disparities in structures and institutions that have reproduced racially discriminatory outcomes in American society.

The strong correlation between race, class, educational attainment, and various attitudes and behaviors in the American context can make it extremely difficult to rule out the influence of confounding factors. Thus, a key motivation for our research initiative is to highlight the importance of causal analysis using computational methods, and focus on understanding the opportunities of big data and algorithmic decision-making to address racial inequities and promote racial justice — beyond de-biasing algorithms. The intent is to also codify methodologies on equity-informed research practices and produce tools that are clear on the quantifiable expected social costs and benefits, as well as on the downstream effects on systemic racism more broadly.

Q: What are some ways that the ICSR might conduct or follow up on research seeking real-world impact or policy change?

A: This type of research has ethical and societal considerations at its core, especially as they pertain to historically disadvantaged groups in the U.S., and will be coordinated with and communicated to local stakeholders to drive relevant policy decisions. This initiative intends to establish connections to URM [underrepresented minority] researchers and students at underrepresented universities and to directly collaborate with them on these research efforts. To that effect, we are leveraging existing programs such as the MIT Summer Research Program (MSRP).

To ensure that our research targets the right problems bringing a racial equity lens with an interest to effect policy change, we will also connect with community organizations in minority neighborhoods who often bear the brunt of the direct and indirect effects of systemic racism, as well as with local government offices that work to address inequity in service provision in these communities. Our intent is to directly engage IDSS students with these organizations to help develop and test algorithmic tools for racial equity.


Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly

At the 2021 AWS re:Invent conference in Las Vegas, we demoed Read For Me at the AWS Builders Fair—a website that helps the visually impaired hear documents.


Adaptive technology and accessibility features are often expensive, if they’re available at all. Audio books help the visually impaired read. Audio description makes movies accessible. But what do you do when the content isn’t already digitized?

This post focuses on the AWS AI services Amazon Textract and Amazon Polly, which empower those with impaired vision. Read For Me was co-developed by Jack Marchetti, who is visually impaired.

Solution overview

Through an event-driven, serverless architecture and a combination of multiple AI services, we can create natural-sounding audio files in multiple languages from a picture of a document, or any image with text. For example, a letter from the IRS, a holiday card from family, or even the opening titles to a film.

The following Reference Architecture, published in the AWS Architecture Center, shows the workflow of a user taking a picture with their phone and playing an MP3 of the content found within that document.

The workflow includes the following steps:

  1. Static content (HTML, CSS, JavaScript) is hosted on AWS Amplify.
  2. Temporary access is granted for anonymous users to backend services via an Amazon Cognito identity pool.
  3. The image files are stored in Amazon Simple Storage Service (Amazon S3).
  4. A user makes a POST request through Amazon API Gateway to the audio service, which proxies to an express AWS Step Functions workflow.
  5. The Step Functions workflow includes the following steps:
    1. Amazon Textract extracts text from the image.
    2. Amazon Comprehend detects the language of the text.
    3. If the target language differs from the detected language, Amazon Translate translates to the target language.
    4. Amazon Polly creates an audio file as output using the text.
  6. The AWS Step Functions workflow creates an audio file as output and stores it in Amazon S3 in MP3 format.
  7. A pre-signed URL with the location of the audio file stored in Amazon S3 is sent back to the user’s browser through API Gateway. The user’s mobile device plays the audio file using the pre-signed URL.
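As a rough illustration of steps 6 and 7, the final Lambda function might store the MP3 and return a pre-signed URL along these lines; the bucket name, key, and expiry are placeholders rather than the project’s actual configuration.

import boto3

s3 = boto3.client('s3')

def store_and_presign(audio_bytes, bucket='read-for-me-output', key='output.mp3'):
    # Store the synthesized MP3, then hand the browser a short-lived URL.
    s3.put_object(Bucket=bucket, Key=key, Body=audio_bytes, ContentType='audio/mpeg')
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=3600,  # URL validity in seconds
    )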

In the following sections, we discuss why we chose the specific services, architecture patterns, and service features for this solution.

AWS AI services

Several AI services are wired together to power Read For Me:

  • Amazon Textract identifies the text in the uploaded picture.
  • Amazon Comprehend determines the language.
  • If the user chooses a different spoken language than the language in the picture, we translate it using Amazon Translate.
  • Amazon Polly creates the MP3 file. We take advantage of the Amazon Polly neural engine, which creates a more natural, lifelike audio recording.

One of the main benefits of using these AI services is the ease of adoption with little or no core machine learning experience required. The services expose APIs that clients can invoke using SDKs made available in multiple programming languages, such as Python and Java.

With Read For Me, we wrote the underlying AWS Lambda functions in Python.

AWS SDK for Python (Boto3)

The AWS SDK for Python (Boto3) makes interacting with AWS services simple. For example, the following lines of Python code return the text found in the image or document you provide:

import boto3

client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': 'bucket-name',
            'Name': 's3-key'
        }
    }
)
# do something with the response

All Python code is run within individual Lambda functions. There are no servers to provision and no infrastructure to maintain.
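Following the same pattern, here is a hedged sketch of the remaining service calls (language detection, optional translation, and speech synthesis); the voice ID and language codes are placeholders, and the actual Lambda functions in Read For Me may be organized differently.

import boto3

comprehend = boto3.client('comprehend')
translate = boto3.client('translate')
polly = boto3.client('polly')

def text_to_speech(text, target_language='en', voice_id='Joanna'):
    # Detect the source language of the extracted text.
    source_language = comprehend.detect_dominant_language(Text=text)['Languages'][0]['LanguageCode']
    # Translate only when the target language differs from the detected one.
    if source_language != target_language:
        text = translate.translate_text(
            Text=text,
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,
        )['TranslatedText']
    # Synthesize natural-sounding audio with the Polly neural engine.
    response = polly.synthesize_speech(
        Text=text, OutputFormat='mp3', VoiceId=voice_id, Engine='neural'
    )
    return response['AudioStream'].read()  # raw MP3 bytes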

Architecture patterns

In this section, we discuss the different architecture patterns used in the solution.

Serverless

We implemented a serverless architecture for two main reasons: speed to build and cost. With no underlying hardware to maintain or infrastructure to deploy, we focused entirely on the business logic code and nothing else. This allowed us to get a functioning prototype up and running in a matter of days. If users aren’t actively uploading pictures and listening to recordings, nothing is running, and therefore nothing is incurring costs outside of storage. An S3 lifecycle management rule deletes uploaded images and MP3 files after 1 day, so storage costs are low.
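One way such a rule could be defined with Boto3 (a sketch; the bucket name is a placeholder, and the project may instead configure this through the console or infrastructure as code):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='read-for-me-uploads',  # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-after-one-day',
            'Filter': {'Prefix': ''},   # apply to every object in the bucket
            'Status': 'Enabled',
            'Expiration': {'Days': 1},  # delete uploaded images and MP3s after 1 day
        }]
    },
)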

Synchronous workflow

When you’re building serverless workflows, it’s important to understand when a synchronous call makes more sense for the architecture and user experience than an asynchronous process. With Read For Me, we initially went down the asynchronous path and planned on using WebSockets to bi-directionally communicate with the front end. Our workflow would include a step to find the connection ID associated with the Step Functions workflow and, upon completion, alert the front end. For more information about this process, refer to From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

We ultimately chose not to do this and instead used express Step Functions workflows, which are synchronous. Users understand that processing an image won’t be instant, but also know it won’t take 30 seconds or a minute. We were in a space where a few seconds was satisfactory to the end user and didn’t need the benefit of WebSockets. This simplified the workflow overall.
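With an express workflow, the API layer can invoke the state machine synchronously and return the result in the same request. A minimal sketch, assuming a placeholder state machine ARN:

import json
import boto3

sfn = boto3.client('stepfunctions')

def run_workflow(payload, state_machine_arn):
    # start_sync_execution is only supported for express workflows; it blocks
    # until the workflow finishes and returns its output directly.
    result = sfn.start_sync_execution(
        stateMachineArn=state_machine_arn,  # e.g., the Read For Me state machine ARN
        input=json.dumps(payload),
    )
    return json.loads(result['output'])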

Express Step Functions workflow

The ability to break out your code into smaller, isolated functions allows for fine-grained control, easier maintenance, and the ability to scale more accurately. For instance, if we determined that the Lambda function that triggered Amazon Polly to create the audio file was running slower than the function that determined the language, we could vertically scale that function, adding more memory, without having to do so for the others. Similarly, you limit the blast radius of what your Lambda function can do or access when you limit its scope and reach.

One of the benefits of orchestrating your workflow with Step Functions is the ability to introduce decision flow logic without having to write any code.

Our Step Functions workflow isn’t complex. It’s linear until the translation step. If we don’t need to call a translation Lambda function, that’s less cost to us, and a faster experience for the user. We can use the visual designer on the Step Functions console to find the specific key in the input payload and, if it’s present, call one function over the other using JSONPath. For example, our payload includes a key called translate:

{
    "extracted_text": "hello world",
    "target_language": "es",
    "source_language": "en",
    "translate": true
}

Within the Step Functions visual designer, we find the translate key, and set up rules to match.
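For instance, an upstream step might set that key so the Choice state can branch on it; this is a sketch, not the project’s actual function:

def prepare_payload(extracted_text, source_language, target_language):
    # The Choice state inspects $.translate and skips the translation Lambda
    # function entirely when no translation is needed.
    return {
        'extracted_text': extracted_text,
        'source_language': source_language,
        'target_language': target_language,
        'translate': source_language != target_language,
    }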

Headless architecture

Amplify hosts the front-end code. The front end is written in React and the source code is checked into AWS CodeCommit. Amplify solves a few problems for users trying to deploy and manage static websites. If you were doing this manually (using an S3 bucket set up for static website hosting and fronting that with Amazon CloudFront), you’d have to expire the cache yourself each time you did deployments. You’d also have to write up your own CI/CD pipeline. Amplify handles this for you.

This allows for a headless architecture, where front-end code is decoupled from the backend and each layer can be managed and scaled independently of the other.

Analyze ID

In the preceding section, we discussed the architecture patterns for processing the uploaded picture and creating an MP3 file from it. Having a document read back to you is a great first step, but what if you only want to know something specific without having the whole thing read back to you? For instance, you need to fill out a form online and provide your state ID or passport number, or perhaps its expiration date. You then have to take a picture of your ID and, while having it read back to you, wait for that specific part. Alternatively, you could use Analyze ID.

Analyze ID is a feature of Amazon Textract that enables you to query documents. Read For Me contains a drop-down menu where you can specifically ask for the expiration date, date of issue, or document number. You can use the same workflow to create an MP3 file that provides an answer to your specific question.

You can demo the Analyze ID feature at readforme.io/analyze.

Additional Polly Features

  • Read For Me offers multiple neural voices utilizing different languages and dialects. Note that there are several other voices you can choose from, which we did not implement. When a new voice is available, an update to the front-end code and a Lambda function is all it takes to take advantage of it.
  • The Polly service also offers other options that we have yet to include in Read For Me, such as adjusting the speed of the voices and using speech marks.

Conclusion

In this post, we discussed how to use numerous AWS services, including AI and serverless, to aid the visually impaired. You can learn more about the Read For Me project and use it by visiting readforme.io. You can also find Amazon Textract examples on the GitHub repo. To learn more about Analyze ID, check out Announcing support for extracting data from identity documents using Amazon Textract.

The source code for this project will be open-sourced and added to AWS’s public GitHub soon.


About the Authors

Jack Marchetti is a Senior Solutions Architect at AWS. With a background in software engineering, Jack is primarily focused on helping customers implement serverless, event-driven architectures. He built his first distributed, cloud-based application in 2013 after attending the second AWS re:Invent conference and has been hooked ever since. Prior to AWS, Jack spent the bulk of his career in the ad agency space building experiences for some of the largest brands in the world. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He is also a screenwriter and director with a primary focus on Christmas movies and horror. View Jack’s filmography at his IMDb page.

Alak Eswaradass is a Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She has a Master’s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud native services and machine learning. Outside of work, Swagat enjoys travel, reading and meditating.


GFN Thursday Marches Forward With 21 Games Coming to GeForce NOW This Month

A new month means a whole new set of games coming to GeForce NOW.

Members can look forward to 21 titles joining the GeForce NOW library in March, including day-and-date releases like Shadow Warrior 3, with support for NVIDIA DLSS.

Bring a Katana to a Gunfight

Shoot, slash and slide into Shadow Warrior 3, new this week for GeForce NOW members.

The latest entry from Devolver Digital and Flying Wild Hog is a seamless blend of fast-paced gunplay, razor-sharp melee combat and a new, free-running movement system. Embark on an improbable mission as fallen corporate shogun Lo Wang across nearly all of your devices, including Chromebooks and Macs.

With support for NVIDIA DLSS, RTX 3080 members enjoy the game’s fast-paced action at even faster speeds and cutting-edge performance. Stream every air dash, wall run and katana slice at up to 1440p resolution and 120 frames per second on PCs and native 1440p or 1600p at 120 FPS on Macs for eight awesome hours at a time.

March Madness, But Make It Video Games

Reach for the stars and explore a new planet by streaming ELEX II.

Another month packed full of great gaming kicks off with eight games ready to stream this week, the first of the 21 total titles coming to the cloud in March:

  • ELEX II (New release on Steam)
  • FAR: Changing Tides (New release on Steam)
  • Shadow Warrior 3 (New release on Steam)
  • AWAY: The Survival Series (Epic Games Store)
  • Labyrinthine Dreams (Steam)
  • Sins of a Solar Empire: Rebellion (Steam)
  • TROUBLESHOOTER: Abandoned Children (Steam)
  • The Vanishing of Ethan Carter (Epic Games Store)

Also coming in March:

  • Buccaneers! (New release on Steam, March 7)
  • Ironsmith Medieval Simulator (New release on Steam, March 9)
  • Distant Worlds 2 (New release on Steam, March 10)
  • Monster Energy Supercross – The Official Videogame 5 (New release on Steam, March 17)
  • The Settlers (New release on Ubisoft Connect, March 17)
  • Syberia: The World Before (New release on Steam and Epic Games Store, March 18)
  • Lumote: The Mastermote Chronicles (New release on Steam, March 24)
  • Turbo Sloths (New release on Steam, March 30)
  • Blood West (Steam)
  • Bus Driver Simulator (Steam)
  • Conan Chop Chop (Steam)
  • Dread Hunger (Steam)
  • Fury Unleashed (Steam)
  • Hundred Days – Winemaking Simulator (Steam)
  • The Legend of Heroes: Trails of Cold Steel II (Steam)
  • Martha is Dead (Steam and Epic Games Store)
  • Power to the People (Steam)
  • Project Zomboid (Steam)
  • Rugby 22 (Steam)

Get Your Fill of Games From February

On top of the 30 titles announced in February, a few extra found their way to the GeForce NOW library last month.

We also announced that Two Worlds Epic Edition would be coming to GeForce NOW. At this time, the title is no longer coming to the service.

What are you planning to play this weekend? Could it be something new, or is your back catalog calling? Let us know on Twitter.

