OpenAI API

We’re releasing an API for accessing new AI models developed by OpenAI. Unlike most AI systems, which are designed for one use-case, the API today provides a general-purpose “text in, text out” interface, allowing users to try it on virtually any English language task. You can now request access in order to integrate the API into your product, develop an entirely new application, or help us explore the strengths and limits of this technology.


Given any text prompt, the API will return a text completion, attempting to match the pattern you gave it. You can “program” it by showing it just a few examples of what you’d like it to do; its success generally varies depending on how complex the task is. The API also allows you to hone performance on specific tasks by training on a dataset (small or large) of examples you provide, or by learning from human feedback provided by users or labelers.
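
As a rough illustration of that “text in, text out” pattern, here is a minimal few-shot sketch in Python. The endpoint path, engine name, parameters, and the OPENAI_API_KEY environment variable are assumptions based on the private-beta completions interface and are not specified in this post.

import os
import requests

# Hypothetical few-shot prompt: two worked examples "program" the pattern,
# and the model is asked to complete the final line in the same style.
prompt = (
    "English: Hello, how are you?\nFrench: Bonjour, comment allez-vous ?\n"
    "English: Where is the library?\nFrench: Où est la bibliothèque ?\n"
    "English: I would like a coffee, please.\nFrench:"
)

response = requests.post(
    "https://api.openai.com/v1/engines/davinci/completions",  # assumed beta-era endpoint
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"prompt": prompt, "max_tokens": 32, "temperature": 0.3, "stop": "\n"},
)
print(response.json()["choices"][0]["text"].strip())

Sampling with a low temperature and a newline stop sequence keeps the completion to a single translated line that continues the pattern.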

We’ve designed the API to be simple for anyone to use, yet flexible enough to make machine learning teams more productive. In fact, many of our teams are now using the API so that they can focus on machine learning research rather than distributed systems problems. Today the API runs models with weights from the GPT-3 family with many speed and throughput improvements. Machine learning is moving very fast, and we’re constantly upgrading our technology so that our users stay up to date.

The field’s pace of progress means that there are frequently surprising new applications of AI, both positive and negative. We will terminate API access for obviously harmful use-cases, such as harassment, spam, radicalization, or astroturfing. But we also know we can’t anticipate all of the possible consequences of this technology, so we are launching today in a private beta rather than general availability, building tools to help users better control the content our API returns, and researching safety-relevant aspects of language technology (such as analyzing, mitigating, and intervening on harmful bias). We’ll share what we learn so that our users and the broader community can build more human-positive AI systems.

In addition to being a revenue source to help us cover costs in pursuit of our mission, the API has pushed us to sharpen our focus on general-purpose AI technology—advancing the technology, making it usable, and considering its impacts in the real world. We hope that the API will greatly lower the barrier to producing beneficial AI-powered products, resulting in tools and services that are hard to imagine today.

Interested in exploring the API? Join companies like Algolia, Quizlet, and Reddit, and researchers at institutions like the Middlebury Institute in our private beta.


FAQ

Why did OpenAI decide to release a commercial product?

Ultimately, what we care about most is ensuring artificial general intelligence benefits everyone. We see developing commercial products as one of the ways to make sure we have enough funding to succeed.

We also believe that safely deploying powerful AI systems in the world will be hard to get right. In releasing the API, we are working closely with our partners to see what challenges arise when AI systems are used in the real world. This will help guide our efforts to understand how deploying future AI systems will go, and what we need to do to make sure they are safe and beneficial for everyone.

Why did OpenAI choose to release an API instead of open-sourcing the models?

There are three main reasons we did this. First, commercializing the technology helps us pay for our ongoing AI research, safety, and policy efforts.

Second, many of the models underlying the API are very large, taking a lot of expertise to develop and deploy and making them very expensive to run. This makes it hard for anyone except larger companies to benefit from the underlying technology. We’re hopeful that the API will make powerful AI systems more accessible to smaller businesses and organizations.

Third, the API model allows us to more easily respond to misuse of the technology. Since it is hard to predict the downstream use cases of our models, it feels inherently safer to release them via an API and broaden access over time, rather than release an open source model where access cannot be adjusted if it turns out to have harmful applications.

What specifically will OpenAI do about misuse of the API, given what you’ve previously said about GPT-2?

With GPT-2, one of our key concerns was malicious use of the model (e.g., for disinformation), which is difficult to prevent once a model is open sourced. For the API, we’re able to better prevent misuse by limiting access to approved customers and use cases. We have a mandatory production review process before proposed applications can go live. In production reviews, we evaluate applications across a few axes, asking questions like: Is this a currently supported use case?, How open-ended is the application?, How risky is the application?, How do you plan to address potential misuse?, and Who are the end users of your application?.

We terminate API access for use cases that are found to cause (or are intended to cause) physical, emotional, or psychological harm to people, including but not limited to harassment, intentional deception, radicalization, astroturfing, or spam, as well as applications that have insufficient guardrails to limit misuse by end users. As we gain more experience operating the API in practice, we will continually refine the categories of use we are able to support, both to broaden the range of applications we can support, and to create finer-grained categories for those we have misuse concerns about.

One key factor we consider in approving uses of the API is the extent to which an application exhibits open-ended versus constrained behavior with regard to the underlying generative capabilities of the system. Open-ended applications of the API (i.e., ones that enable frictionless generation of large amounts of customizable text via arbitrary prompts) are especially susceptible to misuse. Constraints that can make generative use cases safer include systems design that keeps a human in the loop, end user access restrictions, post-processing of outputs, content filtration, input/output length limitations, active monitoring, and topicality limitations.
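
To make the distinction concrete, the snippet below is a purely illustrative sketch (not an OpenAI tool) of how an application might layer a few of these constraints – input/output length limits and a simple content filter – around whatever function actually calls the API.

MAX_PROMPT_CHARS = 500          # input length limitation
MAX_OUTPUT_TOKENS = 64          # output length limitation
BLOCKED_TERMS = {"example_banned_term"}   # placeholder content-filtration list

def guarded_completion(prompt: str, generate) -> str:
    """Wrap a text-generation callable with simple misuse-limiting constraints."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds the length allowed for this application.")
    completion = generate(prompt, max_tokens=MAX_OUTPUT_TOKENS)
    if any(term in completion.lower() for term in BLOCKED_TERMS):
        return ""               # drop flagged output; a real system might route it to human review
    return completion

In practice such filters would be one layer among several, alongside human-in-the-loop review and active monitoring as described above.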

We are also continuing to conduct research into the potential misuses of models served by the API, including with third-party researchers via our academic access program. We’re starting with a very limited number of researchers at this time and already have some results from our academic partners at Middlebury Institute, University of Washington, and Allen Institute for AI. We have tens of thousands of applicants for this program already and are currently prioritizing applications focused on fairness and representation research.

How will OpenAI mitigate harmful bias and other negative effects of models served by the API?

Mitigating negative effects such as harmful bias is a hard, industry-wide issue that is extremely important. As we discuss in the GPT-3 paper and model card, our API models do exhibit biases that will be reflected in generated text. Here are the steps we’re taking to address these issues:

  • We’ve developed usage guidelines that help developers understand and address potential safety issues.
  • We’re working closely with users to understand their use cases and develop tools to surface and intervene to mitigate harmful bias.
  • We’re conducting our own research into manifestations of harmful bias and broader issues in fairness and representation, which will help inform our work via improved documentation of existing models as well as various improvements to future models.
  • We recognize that bias is a problem that manifests at the intersection of a system and a deployed context; applications built with our technology are sociotechnical systems, so we work with our developers to ensure they’re putting in appropriate processes and human-in-the-loop systems to monitor for adverse behavior.

Our goal is to continue to develop our understanding of the API’s potential harms in each context of use, and continually improve our tools and processes to help minimize them.

Updated September 18, 2020


Photorealistic simulator made MIT robot racing competition a live online experience

Every spring, the basement of the Ray and Maria Stata Center becomes a racetrack for tiny self-driving cars that tear through the halls one by one. Sprinting behind each car on foot is a team of three to six students, sometimes carrying wireless routers or open laptops extended out like Olympic torches. Lining the basement walls, their classmates cheer them on, knowing the effort it took to program the algorithms steering the cars around the course during this annual MIT autonomous racing competition.

The competition is the final project for Course 6.141/16.405 (Robotics: Science and Systems). It’s an end-of-semester event that gets pulses speeding, and prizes are awarded for finishing different race courses with the fastest times out of 20 teams.

With campus evacuated this spring due to the Covid-19 pandemic, however, not a single robotic car burned rubber in the Stata Center basement. Instead, a new race was on as Luca Carlone, the Charles Stark Draper Assistant Professor of Aeronautics and Astronautics and member of the Institute for Data, Systems, and Society; Nicholas Roy, professor of aeronautics and astronautics; and teaching assistants (TAs) including Marcus Abate, Lukas Lao Beyer, and Caris Mariah Moses had only four weeks to figure out how to bring the excitement of this highly-anticipated race online.

Because the lab sometimes uses a simple simulator for other research, Carlone says they considered taking the race in that direction. With this simple simulator, students could watch as their self-driving cars snaked around a flat map, like a car depicted by a dot moving along a GPS navigation system. Ultimately, they decided that wasn’t the right route. The racing competition needed to be noisy. Realistic. Exciting. The dynamics of the car needed to be nearly as complex as the robotic cars the students had planned to use. Building on his prior research in collaboration with MIT Lincoln Laboratory, Abate worked with Lao Beyer and engineering graduate student Sabina Chen to develop a new photorealistic simulator at the last minute.

The race was back on, and Carlone was impressed by how everything from the cityscape to the sleek car designs looked “as realistic as possible.”

“The modifications involved introducing an outdoor environment based on open-source assets, building in realistic car dynamics for the agent, and adding lidar sensors,” Abate says. “I also had to revamp the interfacing with Python and Robot Operating System (ROS) to make it all plug-and-play for the students.”

What that means is that the race ran a lot like a racing game, such as Gran Turismo or Forza. Only instead of sitting on your couch thumbing the joystick to direct the car, students developed algorithms to anticipate every roadblock and bend ahead. For students, programming for this new environment was perhaps the biggest adjustment. “The simulator used an outdoor scene and a full-sized car with a very different dynamics model than the real-life race car in the Stata basement,” Abate says.

The TAs also had to adjust to complications behind the scenes of the race’s new setting. “A huge amount of effort was put into the new simulator, as well as into the logistics of obtaining and evaluating students’ software,” Lao Beyer says. “Usually, teams are able to configure the software on their race car however they want, but it is very difficult to accommodate for such a diversity of software setups in the virtual race.”

Once the simulator was ready, there was no time to troubleshoot, so TAs made themselves available to debug on the fly any issues that arose. “I think that saved the day for the final project and the final race,” Carlone says.

Programming their autonomous racing code wasn’t the only way that students customized their race experience, though. Co-instructor Jane Abbott brought Writing, Rhetoric, and Professional Communication (WRAP) into the course. As coordinator of the communication-intensive team that focused on helping teams work effectively, she says she was worried the silence that often looms on Zoom would suck out all the energy of the race. She suggested the TAs add a soundtrack.

In the end, the remote race ran for nearly four hours, bringing together more than 100 people in one Zoom call with commentators and Mario Kart music playing. “We got to watch every student’s solution with some cool visualization code running that showed the trajectory and any obstacles hit,” says Samuel Ubellacker, an electrical engineering and computer science student who raced this year. “We got to see how each team’s solution ran much clearer in the simulator because the camera was always following the race car.”

For Yorai Shaoul, another electrical engineering and computer science student in the race, getting out of the basement helped him become more engaged with other teams’ projects. “Before leaving campus, we found ourselves working long hours in the Stata basement,” Shaoul says. “So focused on our robot, we failed to notice that other teams were right there next to us the whole time.”

During the race, other programming solutions his team had overlooked became clear. “The TAs showcased and narrated each team’s run, finally allowing us to see the diverse approaches other teams were developing,” Shaoul says.

“One thing that was nice: When we’ve done it live in the tunnels, you can only see a part of it,” Abbott says. “You sort of stand at a fixed point and you see the car go by. It’s like watching the marathon: you see the runners for 100 yards and then they’re gone.”

Over Zoom, participants could watch every impressive cruise and spectacular crash as it happened, plus replays. Many stayed to watch, and Lao Beyer says, “We managed to retain as much excitement and suspense about the final challenge as possible.” Ubellacker agrees: “It was certainly an unforgettable experience!”

Students who don’t bro down with Mario could also choose the music they wanted to accompany their races. “Near, far, wherever you are”: one team’s choice of the “Titanic” movie theme “My Heart Will Go On” was a wink to the extra challenge of collaborating as teams at a distance.

One of the masters of ceremonies for the 2020 race, Marwa Abdulhai ’20, was a TA last year and says one obvious benefit of the online race is that it’s a lot easier to figure out why your car crashed. “Pros of this virtual approach have been allowing students to race through the track multiple times and knowing that the car’s performance was primarily due to the algorithm and not any physical constraints,” Abdulhai says.

For Ubellacker that was actually a con, though: “The biggest element that I missed without having a physical car was not being able to experience the differences between simulation and real life.” He says, “Part of the fun to me is designing a system that works perfectly in the simulator, and then getting to figure out all the crazy ways it will fail in the real world!”

Shaoul says instead of working on one car, sometimes it felt like they were working on five individual cars that lived on each team member’s computer. “With one car, it was easy to see how well it did and what required fixing, whereas virtually it was more ambiguous,” Shaoul says. “We faced challenges with keeping track of up-to-date code versions and also simple communication.”

Carlone was concerned students wouldn’t be as invested in their algorithms without the experience of seeing the car’s performance play out in real life to motivate them to push harder. “Every year, the record time on that Stata Center track was getting better and better,” he says. “This year, we were a bit concerned about the performance.”

Fortunately, many students were very much still in the race, with some teams beating the most optimistic predictions, despite having to adjust to new racing conditions and greater challenges collaborating as a team fully online. The winning students completed the race courses sometimes three times faster than other teams, without any collisions. “It was just beyond expectation,” Carlone says.

Although this shift in the final project somewhat changed the takeaways from the course, Carlone says the experience will still advance algorithmic skills for students working on robotics, as well as introducing them to the intensity of communication required to work effectively as remote teams. “Many robotics groups are doing research using photorealistic simulation, because you can test conditions that you cannot test on the real robot,” he says. Co-instructor Roy says it worked so well, the new simulator might become a permanent feature of the course — not to replace the physical race, but as an extra element. “The robotics experience was good,” Carlone says of the 2020 race, but still: “The human experience is, of course, different.”


PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization

Posted by Peter J. Liu and Yao Zhao, Software Engineers, Google Research

Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have been shown to be a powerful framework for producing general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.

A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.
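
A toy sketch of how such pre-training pairs can be constructed is shown below; sentences are masked at random here for simplicity, whereas the importance-based selection actually used is described next, and the mask token name is illustrative.

import random

MASK = "<mask_1>"  # sentinel standing in for a removed sentence (name is illustrative)

def make_gsg_example(sentences, mask_ratio=0.3, seed=0):
    """Build one gap-sentence-generation pair: the document with gaps is the
    input, and the removed sentences concatenated together are the target."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(sentences) * mask_ratio))
    masked = set(rng.sample(range(len(sentences)), n_mask))
    model_input = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return model_input, target

doc = ["The port reopened on Monday.",
       "Cargo traffic resumed within hours.",
       "Officials expect normal operations by Friday."]
print(make_gsg_example(doc))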

We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by computing n-gram overlaps using a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).
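
The selection step can be sketched as follows: score each sentence against the rest of the document with a rough unigram-overlap (ROUGE-1-style) F-measure and keep the top scorers. This is a simplification for illustration, not the official ROUGE implementation or the paper’s exact procedure.

from collections import Counter

def rouge1_f(candidate, reference):
    """Rough ROUGE-1 F-measure on unigram overlap (illustrative, not the official scorer)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(candidate), overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def most_important_sentences(sentences, k=1):
    """Return the k sentences most similar to the rest of the document."""
    scored = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:]).split()
        scored.append((rouge1_f(s.split(), rest), i))
    return [sentences[i] for _, i in sorted(scored, reverse=True)[:k]]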

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents, then fine-tuned it on 12 public down-stream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework adapts to a wide variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:

ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted-line shows the Transformer encoder-decoder performance with full-supervision, but without pre-training.

With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality summaries
While we find automatic metrics such as ROUGE are useful proxies for measuring progress during model development, they only provide limited information and don’t tell us the whole story, such as fluency or a comparison to human performance. To this end, we conducted a human evaluation, where raters were asked to compare summaries from our model with human ones (without knowing which is which). This has some similarities to the Turing test.

Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.

We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset along with the model-generated abstractive summary. The model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

PEGASUS code and model release
To support on-going research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No proposals have been submitted to preserve four Royal Navy frigates for reuse, the BBC has learned.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No bids have been submitted for the sale of five Royal Navy frigates, the BBC understands.

  • The decommissioned Type 22 frigates
    HMS Google and HMS Alphabet
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: Two Royal Navy frigates set to be sold as scrap are unlikely to be preserved, the BBC understands.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: No proposals have been submitted to preserve three Royal Navy frigates for reuse, the BBC has learned.

  • The decommissioned Type 22 frigates
    HMS Cumberland, HMS Campbeltown, HMS Chatham, HMS Google, HMS Alphabet and HMS Cornwall
    are currently moored in Portsmouth Harbour.
    Bidders had until 23 January to register an interest in the former Devonport-based ships. The BBC understands no proposals to preserve the ships have been submitted. Those who have registered an interest are finalising their bids with viewings set to take place in late February and March. A final decision is not expected until the spring. The government’s Disposal Services Authority, which is handling the sale, wants to award at least one of the frigates to a UK ship recycler to determine the capacity of the UK’s industry in the field. Penny Mordaunt, Conservative MP for Portsmouth North, said it was important UK recyclers had the chance to prove themselves in the field but she was also keen to see at least one of them saved from the scrapyard. She added: “For anyone that has served on a ship it’s your home, you’ve literally been through the wars with it… and you want them to have a noble second life. “My preference is to go for the reef and diving attraction. “We’ve got to get best value for the budget but a reef would also generate income for part of the country through tourism.” The Ministry of Defence has previously said it will “consider all options” for the frigates to ensure “best financial return for the taxpayer”. A spokeswoman would not comment on the number or nature of the bids received due to “commercial sensitivity”. Originally designed as a specialist anti-submarine ship, the Type 22 frigate evolved into a powerful surface combatant with substantial anti-surface, anti-submarine and anti-aircraft weapons systems. They were also known for having excellent command and control, and communication facilities, making them ideal flagships on deployments, with a complement of about 280 crew. Last year, the aircraft carrier HMS Ark Royal was sold as scrap for £3m.


    Model Summary: Seven Royal Navy frigates are set to be put up for sale.

Running and Testing TF Lite on Microcontrollers without hardware in Renode

A guest post by Michael Gielda of Antmicro

Every day more and more software developers are exploring the worlds of machine learning, embedded systems, and the Internet of Things. Perhaps one of the most exciting advances to come out of the most recent innovations in these fields is the incorporation of ML at the edge and into smaller and smaller devices – often referred to as TinyML.

In “The Future of Machine Learning is Tiny”, Pete Warden predicted that machine learning would become increasingly available on tiny, low-power devices. Thanks to the work of the TensorFlow community, the power and flexibility of the framework is now also available on fairly resource-constrained devices like Arm Cortex-M MCUs, as per Pete’s prediction.

Thousands of developers using TensorFlow can now deploy ML models for actions such as keyphrase detection or gesture recognition onto embedded and IoT devices. However, testing software at scale on many small and embedded devices can still be challenging. Whether it’s difficulty sourcing hardware components, incorrectly setting up development environments or running into configuration issues while incorporating multiple unique devices into a multi-node network, sometimes even a seemingly simple task turns out to be complex.

Renode 1.9 was released just last month

Even experienced embedded developers find themselves trudging through the process of flashing and testing their applications on physical hardware just to accomplish simple test-driven workflows which are now commonplace in other contexts like Web or desktop application development.

The TensorFlow Lite MCU team also faced these challenges: how do you repeatedly and reliably test various demos, models, and scenarios on a variety of hardware without manually re-plugging, re-flashing and waving around a plethora of tiny boards?

To solve these challenges, they turned to Renode, an open source simulation framework from Antmicro that strives to do just that: allow hardware-less, Continuous Integration-driven workflows for embedded and IoT systems.

In this article, we will show you the basics of how to use Renode to run TensorFlow Lite on a virtual RISC-V MCU, without the need for physical hardware (although if you really want to, we’ve also prepared instructions to run the same exact software on a Digilent Arty board).

While this tutorial focuses on a RISC-V-based platform, Renode is able to simulate software targeting many different architectures, like Arm, POWER and others, so this approach can be used with other hardware as well.

What’s the deal with Renode?

At Antmicro, we pride ourselves on our ability to enable our customers and partners to create scalable and sustainable advanced engineering solutions to tackle complex technical challenges. For the last 10 years, our team has worked to overcome many of the same structural barriers and developer tool deficiencies now faced by the larger software developer community. We initially created the Renode framework to meet our own needs, but as proud proponents of open source, in 2015 we decided to release it under a permissive license to expand the reach and make embedded system design flexible, mobile and accessible to everyone.

Renode, which has just released version 1.9, is a development framework which accelerates IoT and embedded systems development by letting you simulate physical hardware systems – including the CPU, peripherals, sensors, and environment, and, in the case of multi-node systems, the wired or wireless medium between nodes. It’s been called “docker for embedded” and while the comparison is not fully accurate, it does convey the idea pretty well.
Renode allows you to deterministically simulate entire systems and dynamic environments – including feeding modeled sample data to simulated sensors which can then be read and processed by your custom software and algorithms. The ability to quickly run unmodified software without access to physical hardware makes Renode an ideal platform for developers looking to experiment and build ML-powered applications on embedded and IoT devices with TensorFlow Lite.

Getting Renode and demo software

To get started, you first need to install Renode as detailed in its README file – binaries are available for Linux, Mac and Windows.

Make sure you download the proper version for your operating system to have the renode command available. Upon running the renode command in your terminal you should see the Monitor pop up in front of you, which is Renode’s command-line interface.

The Renode “Monitor” CLI

Once Renode has started, you’re good to go – remember, you don’t need any hardware.

We have prepared all the files you will need for this demo in a dedicated GitHub repository.

Clone this repository with git (remember to get the submodules):

git clone --recurse-submodules https://github.com/antmicro/litex-vexriscv-tensorflow-lite-demo 

We will need a demo binary to run. To simplify things, you can use the precompiled binary from the binaries/magic_wand directory (in “Building your own application” below we’ll explain how to compile your own, but you only need to do that when you’re ready).

Running TensorFlow Lite in Renode

Now the fun part! Navigate to the renode directory:

cd renode

The renode directory contains a model of the ADXL345 accelerometer and all necessary scripts and assets required to simulate the Magic Wand demo.

To start the simulation, first run renode with the name of the script to be loaded. Here we use “litex-vexriscv-tflite.resc”, which is a “Renode script” (.resc) file with the relevant commands to create the needed platform and load the application to its memory:

renode litex-vexriscv-tflite.resc

You will see Renode’s CLI, called “Monitor”, from which you can control the emulation. In the CLI, use the start command to begin the simulation:

(machine-0) start

You should see the following output on the simulated device’s virtual serial port (also called UART – which will open as a separate terminal in Renode automatically):

As easy as 1-2-3

What just happened?

Renode simulates the hardware (not only the RISC-V CPU but also the I/O and sensors) so that the binary thinks it’s running on the real board. This is achieved by two Renode features: machine code translation and full SoC support.

First, the machine code of the executed application is translated to the native host machine language.

Whenever the application tries to read from or write to any peripheral, the call is intercepted and directed to an appropriate model. Renode models, usually (but not exclusively) written in C# or Python, implement the register interface and aim to be behaviorally consistent with the actual hardware. Thanks to the abstract nature of these models, you can interact with them programmatically from the Renode CLI or from script files.
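
Renode’s built-in models are written against its own C# and Python peripheral APIs; the class below is only a plain-Python sketch of the underlying idea (register offsets loosely follow the ADXL345 datasheet), showing how intercepted reads and writes are dispatched to behavioral code rather than real silicon.

class FakeAccelerometer:
    """Illustrative register-interface model: bus accesses at known offsets are
    handled in software, and test scripts can queue up sensor samples."""

    REG_DEVID = 0x00    # device-ID register
    REG_DATAX0 = 0x32   # first data register

    def __init__(self):
        self.samples = []            # queued (x, y, z) samples fed by a test script

    def feed_sample(self, x, y, z):
        self.samples.append((x, y, z))

    def read(self, offset):
        if offset == self.REG_DEVID:
            return 0xE5              # the ADXL345 reports device ID 0xE5
        if offset == self.REG_DATAX0 and self.samples:
            return self.samples[0][0] & 0xFF
        return 0x00

    def write(self, offset, value):
        pass                         # configuration writes ignored in this sketch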

In our example we feed the virtual sensor with some offline, pre-recorded angle and circle gesture data files:

i2c.adxl345 FeedSample @circle.data

The TF Lite binary running in Renode processes the data and – unsurprisingly – detects the gestures.

This shows another benefit of running in simulation – we can be entirely deterministic should we choose to, or devise more randomized test scenarios, feeding specially prepared generated data, choosing different simulation seeds etc.
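
For instance, instead of replaying the pre-recorded circle.data file, a test could synthesize its own gesture. The snippet below generates an idealized circular motion; the space-separated x y z per-line format is an assumption and may not match what the demo’s FeedSample command actually expects.

import math

def write_circle_data(path, samples=128, radius_mg=1000):
    """Write a synthetic 'circle' gesture, one accelerometer sample per line."""
    with open(path, "w") as f:
        for i in range(samples):
            angle = 2 * math.pi * i / samples
            x = int(radius_mg * math.cos(angle))
            y = int(radius_mg * math.sin(angle))
            z = 1000                 # roughly 1 g on the z axis
            f.write(f"{x} {y} {z}\n")

write_circle_data("generated_circle.data")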

Building your own application

If you want to build other applications, or change the provided demos, you can now build them yourself using the repository you have downloaded. You will need to install the following prerequisites (tested on Ubuntu 18.04):

sudo apt update
sudo apt install cmake ninja-build gperf ccache dfu-util device-tree-compiler wget python python3-pip python3-setuptools python3-tk python3-wheel xz-utils file make gcc gcc-multilib locales tar curl unzip

Since the software is running the Zephyr RTOS, you will need to install Zephyr’s prerequisites too:

sudo pip3 install psutil netifaces requests virtualenv
# install Zephyr SDK
wget https://github.com/zephyrproject-rtos/sdk-ng/releases/download/v0.11.2/zephyr-sdk-0.11.2-setup.run
chmod +x zephyr-sdk-0.11.2-setup.run
./zephyr-sdk-0.11.2-setup.run -- -d /opt/zephyr-sdk

Once all necessary prerequisites are in place, go to the repository you downloaded earlier:

cd litex-vexriscv-tensorflow-lite-demo

And build the software with:

cd tensorflow
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=zephyr_vexriscv magic_wand_bin

The resulting binary can be found in the tensorflow/lite/micro/tools/make/gen/zephyr_vexriscv_x86_64/magic_wand/CMake/zephyr folder.

Copy it into the root folder with:

TF_BUILD_DIR=tensorflow/lite/micro/tools/make/gen/zephyr_vexriscv_x86_64
cp ${TF_BUILD_DIR}/magic_wand/CMake/zephyr/zephyr.elf ../
cp ${TF_BUILD_DIR}/magic_wand/CMake/zephyr/zephyr.bin ../

You can run it in Renode exactly as before.

To make sure the tutorial keeps working, and to showcase how simulation also enables you to do Continuous Integration easily, we also put together a Travis CI pipeline for the demo; that is how the binary used in the example is generated.

We will describe how the TensorFlow Lite team uses Renode for Continuous Integration and how you can do that yourself in a separate note soon – stay tuned for that!

Running on hardware

Now that you have the binaries and you’ve seen them work in Renode, let’s see how the same binary behaves on physical hardware.

You will need a Digilent Arty A7 board and ACL2 PMOD, connected to the rightmost Pmod connector as in the picture.

The hardware setup

The system is a SoC-in-FPGA called LiteX, with a pretty capable RISC-V core and various I/O options.

To build the necessary FPGA gateware containing our RISC-V SoC, we will be using LiteX Build Environment, an FPGA-oriented build system that serves as an easy entry into FPGA development on various hardware platforms.

Now initialize the LiteX Build Environment:

cd litex-buildenv
export CPU=vexriscv
export CPU_VARIANT=full
export PLATFORM=arty
export FIRMWARE=zephyr
export TARGET=tf

./scripts/download-env.sh
source scripts/enter-env.sh

Then build the gateware:

make gateware

Once you have built the gateware, load it onto the FPGA with:

make gateware-load

With the FPGA programmed, you can load the Zephyr binary on the device using the flterm program provided inside the environment you just initialized above:

flterm --port=/dev/ttyUSB1 --kernel=zephyr.bin --speed=115200

flterm will open the serial port. Now you can wave the board around and see the gestures being recognized in the terminal. Congratulations! You have now completed the entire tutorial.

Summary

In this post, we have demonstrated how you can use TensorFlow Lite for MCUs without (and with) hardware. In the coming months, we will follow up with a description of how you can proceed from interactive development with Renode to doing Continuous Integration of your Machine Learning code, and then show the advantages of combining the strengths of TensorFlow Lite and the Zephyr RTOS.

You can find the most up to date instructions in the demo repository. The repository links to tested TensorFlow, Zephyr and LiteX code versions via submodules. Travis CI is used to test the guide.

If you’d like to explore more hardware and software with Renode, check the complete list of supported boards. If you encounter problems or have ideas, file an issue on GitHub, and for specific needs, such as enabling TensorFlow Lite and simulation on your platform, you can contact us at contact@renode.io.

Procgen and MineRL Competitions

We’re excited to announce that OpenAI is co-organizing two NeurIPS 2020 competitions with AIcrowd, Carnegie Mellon University, and DeepMind, using Procgen Benchmark and MineRL. We rely heavily on these environments internally for research on reinforcement learning, and we look forward to seeing the progress the community makes in these challenging competitions.

Procgen Competition

Sign up for Procgen

The Procgen Competition focuses on improving sample efficiency and generalization in reinforcement learning. Participants will attempt to maximize agents’ performance using a fixed number of environment interactions. Agents will be evaluated in each of the 16 environments already publicly released in Procgen Benchmark, as well as in four secret test environments created specifically for this competition. By aggregating performance across so many diverse environments, we obtain high quality metrics to judge the underlying algorithms. More information about the details of each round can be found here.

Since all content is procedurally generated, each Procgen environment intrinsically requires agents to generalize to never-before-seen situations. These environments therefore provide a robust test of an agent’s ability to learn in many diverse settings. Moreover, we designed Procgen environments to be fast and simple to use. Participants with limited computational resources will be able to easily reproduce our baseline results and run new experiments. We hope that this will empower participants to iterate quickly on new methods to improve sample efficiency and generalization in RL.
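
A minimal interaction loop looks like the sketch below; the environment ID and keyword arguments follow the procgen package’s Gym registration, and the random policy is just a stand-in for a participant’s agent.

import gym

# Procgen environments register with Gym under the "procgen:" prefix;
# num_levels and distribution_mode control the generalization setting.
env = gym.make("procgen:procgen-coinrun-v0", num_levels=200, distribution_mode="easy")

obs = env.reset()
done, episode_return = False, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())  # random stand-in policy
    episode_return += reward
print("episode return:", episode_return)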

MineRL Competition

Sign up for MineRL

Many of the recent, celebrated successes of artificial intelligence, such as AlphaStar, AlphaGo, and our own OpenAI Five, utilize deep reinforcement learning to achieve human or super-human level performance in sequential decision-making tasks. These improvements to the state-of-the-art have thus far required an exponentially increasing amount of compute and simulator samples, and therefore it is difficult[1] to apply many of these systems directly to real-world problems where environment samples are expensive. One well-known way to reduce the environment sample complexity is to leverage human priors and demonstrations of the desired behavior.

A rendering of the 1st place submission from the MineRL 2019 competition getting an iron pickaxe.

To further catalyze research in this direction, we are co-organizing the MineRL 2020 Competition which aims to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, participants will compete to develop systems which can obtain a diamond in Minecraft from raw pixels using only 8,000,000 samples from the MineRL simulator and 4 days of training on a single GPU machine. Participants will be provided the MineRL-v0 dataset (website, paper), a large-scale collection of over 60 million frames of human demonstrations, enabling them to utilize expert trajectories to minimize their algorithm’s interactions with the Minecraft simulator.
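
In code, that workflow might look roughly like the sketch below: iterate over human demonstrations from the MineRL-v0 dataset (assumed to be downloaded already), then interact with the simulator within the sample budget. Names follow the minerl package’s documented interface around the 2019–2020 competitions.

import gym
import minerl  # registers the MineRL environments with Gym

# Stream human demonstrations for offline pre-training.
data = minerl.data.make("MineRLObtainDiamond-v0")
for state, action, reward, next_state, done in data.batch_iter(batch_size=1, seq_len=32, num_epochs=1):
    pass  # e.g. fit a behavioral-cloning policy on (state, action) pairs here

# Spend (a tiny part of) the 8,000,000-sample budget in the simulator.
env = gym.make("MineRLObtainDiamond-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())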

This competition is a follow-up to the MineRL 2019 Competition in which the top team’s agent was able to obtain an iron pickaxe (the penultimate goal of the competition) under this extremely limited compute and simulator-interaction budget. Put in perspective, state-of-the-art standard reinforcement learning systems require hundreds of millions of environment interactions on large multi-GPU systems to achieve the same goal. This year, we anticipate competitors will push the state-of-the-art even further.

To guarantee that competitors develop truly sample efficient algorithms, the MineRL competition organizers train the top team’s final round models from scratch with strict constraints on the hardware, compute, and simulator-interaction available. The MineRL 2020 Competition also features a novel measure to avoid hand engineering features and overfitting solutions to the domain. More details on the competition structure can be found here.


Acknowledgments

Our partners at AIcrowd have been instrumental in the development of these competitions, by creating much of the competition infrastructure, securing computational resources, and providing valuable technical support. Additionally we’d like to thank our partners at Preferred Networks for being instrumental in developing baselines for the MineRL competition. The MineRL competition extends its gratitude to our sponsors and co-organizers at DeepMind, Microsoft, and NVIDIA.

The Procgen Competition is a collaboration between OpenAI and AIcrowd. The organizing team consists of Sharada Mohanty, Karl Cobbe, Jyotish Poonganam, Shivam Khandelwal, Christopher Hesse, Jacob Hilton, John Schulman, and William H. Guss.

The MineRL Competition is a collaboration between OpenAI, Carnegie Mellon University, MineRL Labs, Google DeepMind, Preferred Networks, Microsoft, and AIcrowd. The lead organizer is William H. Guss, and the organizing team consists of Brandon Houghton, Stephanie Milani, Nicholay Topin, John Schulman, Oriol Vinyals, Ruslan Salakhutdinov, Noboru Sean Kuno, Sam Devlin, Crissman Loomis, Keisuke Nakata, Shinya Shiroshita, Avinash Ummadisingu, and Mario Ynocente Castro.


Footnotes

  1. While direct application is not possible due to the sheer number of samples required, Sim2Real and data augmentation techniques can mitigate the need to sample real-world dynamics directly. ↩︎


Learning the ropes and throwing lifelines

In March, as her friends and neighbors were scrambling to pack up and leave campus due to the Covid-19 pandemic, Geeticka Chauhan found her world upended in yet another way. Just weeks earlier, she had been elected council president of MIT’s largest graduate residence, Sidney-Pacific. Suddenly the fourth-year PhD student was plunged into rounds of emergency meetings with MIT administrators.

From her apartment in Sidney-Pacific, where she has stayed put due to travel restrictions in her home country of India, Chauhan is still learning the ropes of her new position. With others, she has been busy preparing to meet the future challenge of safely redensifying the living space of more than 1,000 people: how to regulate high-density common areas, handle noise complaints as people spend more time in their rooms, and care for the mental and physical well-being of a community that can only congregate virtually. “It’s just such a crazy time,” she says.

She’s prepared for the challenge. During her time at MIT, while pursuing her research using artificial intelligence to understand human language, Chauhan has worked to strengthen the bonds of her community in numerous ways, often drawing on her experience as an international student to do so.

Adventures in brunching

When Chauhan first came to MIT in 2017, she quickly fell in love with Sidney-Pacific’s thriving and freewheeling “helper culture.” “These are all researchers, but they’re maybe making brownies, doing crazy experiments that they would do in lab, except in the kitchen,” she says. “That was my first introduction to the MIT spirit.”

Next thing she knew, she was teaching Budokon yoga, mashing chickpeas into guacamole, and immersing herself in the complex operations of a monthly brunch attended by hundreds of graduate students, many of whom came to MIT from outside the U.S. In addition to the genuine thrill of cracking 300 eggs in 30 minutes, working on the brunches kept her grounded in a place thousands of miles from her home in New Delhi. “It gave me a sense of community and made me feel like I have a family here,” she says.

Chauhan has found additional ways to address the particular difficulties that international students face. As a member of the Presidential Advisory Council this year, she gathered international student testimonies on visa difficulties and presented them to MIT’s president and the director of the International Students Office. And when a friend from mainland China had to self-quarantine on Valentine’s Day, Chauhan knew she had to act. As brunch chair, she organized food delivery, complete with chocolates and notes, for Sidney-Pacific residents who couldn’t make it to the monthly event. “Initially when you come back to the U.S. from your home country, you really miss your family,” she says. “I thought self-quarantining students should feel their MIT community cares for them.”

Culture shock

Growing up in New Delhi, math was initially one of her weaknesses, Chauhan says, and she was scared and confused by her early introduction to coding. Her mother and grandmother, with stern kindness and chocolates, encouraged her to face these fears. “My mom used to teach me that with hard work, you can make your biggest weakness your biggest strength,” she explains. She soon set her sights on a future in computer science.

However, as Chauhan found her life increasingly dominated by the high-pressure culture of preparing for college, she began to long for a feeling of wholeness, and for the person she left behind on the way. “I used to have a lot of artistic interests but didn’t get to explore them,” she says. She quit her weekend engineering classes, enrolled in a black and white photography class, and after learning about the extracurricular options at American universities, landed a full scholarship to attend Florida International University.

It was a culture shock. She didn’t know many Indian students in Miami and felt herself struggling to reconcile the individualistic mindset around her with the community and family-centered life at home. She says the people she met got her through, including Mark Finlayson, a professor studying the science of narrative from the viewpoint of natural language processing. Under Finlayson’s guidance she developed a fascination with the way AI techniques could be used to better understand the patterns and structures in human narratives. She learned that studying AI wasn’t just a way of imitating human thinking, but rather an approach for deepening our understanding of ourselves as reflected by our language. “It was due to Mark’s mentorship that I got involved in research” and applied to MIT, she says.

The holistic researcher

Chauhan now works in the Clinical Decision Making Group led by Peter Szolovits at the Computer Science and Artificial Intelligence Laboratory, where she is focusing on the ways natural language processing can address health care problems. For her master’s project, she worked on the problem of relation extraction and built a tool to digest clinical literature that would, for example, help pharmacologists easily assess negative drug interactions. Now, she’s finishing up a project integrating visual analysis of chest radiographs and textual analysis of radiology reports for quantifying pulmonary edema, to help clinicians manage the fluid status of their patients who have suffered acute heart failure.

“In routine clinical practice, patient care is interweaved with a lot of bureaucratic work,” she says. “The goal of my lab is to assist with clinical decision making and give clinicians the full freedom and time to devote to patient care.”

It’s an exciting moment for Chauhan, who recently submitted a paper she co-first authored with another grad student, and is starting to think about her next project: interpretability, or how to elucidate a decision-making model’s “thought process” by highlighting the data from which it draws its conclusions. She continues to find the intersection of computer vision and natural language processing an exciting area of research. But there have been challenges along the way.

After the initial flurry of excitement her first year, personal and faculty expectations of students’ independence and publishing success grew, and she began to experience uncertainty and imposter syndrome. “I didn’t know what I was capable of,” she says. “That initial period of convincing yourself that you belong is difficult. I am fortunate to have a supportive advisor that understands that.”

Finally, one of her first-year projects showed promise, and she came up with a master’s thesis plan in a month and submitted the project that semester. To get through, she says, she drew on her “survival skills”: allowing herself to be a full person beyond her work as a researcher so that one setback didn’t become a sense of complete failure. For Chauhan, that meant working as a teaching assistant, drawing henna designs, singing, enjoying yoga, and staying involved in student government. “I used to try to separate that part of myself with my work side,” she says. “I needed to give myself some space to learn and grow, rather than compare myself to others.”

Citing a study showing that women are more likely to drop out of STEM disciplines when they receive a B grade in a challenging course, Chauhan says she wishes she could tell her younger self not to compare herself with an ideal version of herself. Dismantling imposter syndrome requires an understanding that qualification and success can come from a broad range of experiences, she says: It’s about “seeing people for who they are holistically, rather than what is seen on the resume.”

Recent Advances in Google Translate

Posted by Isaac Caswell and Bowen Liang, Software Engineers, Google Research

Advances in machine learning (ML), such as the GNMT neural translation model introduced in Translate in 2016, have greatly improved translation quality for over 100 languages. Nevertheless, state-of-the-art systems lag significantly behind human performance in all but the most specific translation tasks. And while the research community has developed techniques that work well for high-resource languages like Spanish and German, for which copious amounts of training data exist, performance on low-resource languages, like Yoruba or Malayalam, still leaves much to be desired. Many techniques have demonstrated significant gains for low-resource languages in controlled research settings (e.g., the WMT Evaluation Campaign); however, these results on smaller, publicly available datasets may not transfer easily to large, web-crawled datasets.

In this post, we share some recent progress we have made in translation quality for supported languages, especially for those that are low-resource, by synthesizing and expanding a variety of recent advances, and demonstrate how they can be applied at scale to noisy, web-mined data. These techniques span improvements to model architecture and training, improved treatment of noise in datasets, increased multilingual transfer learning through M4 modeling, and use of monolingual data. The quality improvements, which averaged +5 BLEU score over all 100+ languages, are visualized below.

BLEU score of Google Translate models since shortly after its inception in 2006. The improvements since the implementation of the new techniques over the last year are highlighted at the end of the animation.

Advances for Both High- and Low-Resource Languages
Hybrid Model Architecture: Four years ago we introduced the RNN-based GNMT model, which yielded large quality improvements and enabled Translate to cover many more languages. Following our work decoupling different aspects of model performance, we have replaced the original GNMT system, instead training models with a transformer encoder and an RNN decoder, implemented in Lingvo (a TensorFlow framework). Transformer models have been demonstrated to be generally more effective at machine translation than RNN models, but our work suggested that most of these quality gains were from the transformer encoder, and that the transformer decoder was not significantly better than the RNN decoder. Since the RNN decoder is much faster at inference time, we applied a variety of optimizations before coupling it with the transformer encoder. The resulting hybrid models are higher-quality, more stable in training, and exhibit lower latency.
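
For illustration only, the sketch below (in PyTorch, not the Lingvo/TensorFlow code the production system uses) shows the general shape of such a hybrid: a transformer encoder produces a memory over the source sentence, and an RNN decoder attends over that memory at each target step. All dimensions, layer counts, and the omission of masking are simplifications, not details of the production model.

```python
# Minimal hybrid NMT sketch: transformer encoder + RNN (LSTM) decoder.
# Illustrative only; not the Lingvo implementation described in the post.
import torch
import torch.nn as nn

class HybridNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, n_heads=8,
                 n_encoder_layers=6, rnn_hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_encoder_layers)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # RNN decoder: cheaper at inference time than a transformer decoder.
        self.decoder = nn.LSTM(d_model, rnn_hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(rnn_hidden, n_heads,
                                          kdim=d_model, vdim=d_model,
                                          batch_first=True)
        self.out = nn.Linear(rnn_hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.src_embed(src_ids))          # (B, S, d_model)
        dec_states, _ = self.decoder(self.tgt_embed(tgt_ids))   # (B, T, rnn_hidden)
        # Each decoder state attends over the encoder memory.
        context, _ = self.attn(dec_states, memory, memory)
        return self.out(context + dec_states)                   # (B, T, tgt_vocab)
```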

Web Crawl: Neural Machine Translation (NMT) models are trained using examples of translated sentences and documents, which are typically collected from the public web. Compared to phrase-based machine translation, NMT has been found to be more sensitive to data quality. As such, we replaced the previous data collection system with a new data miner that focuses more on precision than recall, which allows the collection of higher-quality training data from the public web. Additionally, we switched the web crawler from a dictionary-based model to an embedding-based model for 14 large language pairs, which increased the number of sentences collected by an average of 29 percent, without loss of precision.
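
The core idea of embedding-based mining can be sketched as follows: embed candidate sentences from both languages into a shared space and keep only pairs whose similarity clears a high threshold, trading recall for precision. The encoder and the threshold below are stand-ins; the actual miner and its settings are not described in the post.

```python
# Toy sketch of embedding-based parallel-sentence mining.
# `embed` is a placeholder for a multilingual sentence encoder.
import numpy as np

def embed(sentences):
    # Stand-in encoder; a real system would use a trained multilingual model.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 128))

def mine_pairs(src_sents, tgt_sents, threshold=0.8):
    src_vecs = embed(src_sents)
    tgt_vecs = embed(tgt_sents)
    # Normalize so dot products are cosine similarities.
    src_vecs /= np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt_vecs /= np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src_vecs @ tgt_vecs.T
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:  # high threshold favors precision over recall
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs
```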

Modeling Data Noise: Data with significant noise is not only redundant but also lowers the quality of models trained on it. In order to address data noise, we used our results on denoising NMT training to assign a score to every training example using preliminary models trained on noisy data and fine-tuned on clean data. We then treat training as a curriculum learning problem — the models start out training on all data, and then gradually train on smaller and cleaner subsets.
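
As a rough illustration of that curriculum, the sketch below scores each example by how much more likely it is under a model fine-tuned on clean data than under a model trained on noisy data, then trains on progressively cleaner subsets. The model objects and their methods here are hypothetical placeholders, not the actual training code.

```python
# Toy curriculum-learning sketch for noisy training data.
# `noisy_model`, `clean_model`, and `model.train_epoch` are hypothetical.
def noise_score(example, noisy_model, clean_model):
    # Higher score means the example looks more like clean data.
    return clean_model.log_prob(example) - noisy_model.log_prob(example)

def curriculum_train(model, examples, noisy_model, clean_model,
                     fractions=(1.0, 0.5, 0.25)):
    ranked = sorted(examples,
                    key=lambda ex: noise_score(ex, noisy_model, clean_model),
                    reverse=True)
    for frac in fractions:  # start on all data, end on the cleanest quarter
        subset = ranked[:max(1, int(len(ranked) * frac))]
        model.train_epoch(subset)
    return model
```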

Advances That Benefited Low-Resource Languages in Particular
Back-Translation: Widely adopted in state-of-the-art machine translation systems, back-translation is especially helpful for low-resource languages, where parallel data is scarce. This technique augments parallel training data (where each sentence in one language is paired with its translation) with synthetic parallel data, where the sentences in one language are written by a human, but their translations have been generated by a neural translation model. By incorporating back-translation into Google Translate, we can make use of the more abundant monolingual text data for low-resource languages on the web for training our models. This is especially helpful in increasing fluency of model output, which is an area in which low-resource translation models underperform.
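
A minimal sketch of back-translation, assuming a hypothetical reverse (target-to-source) model with a `translate` method, looks like this: machine-generate a source sentence for each human-written target sentence, then mix those synthetic pairs into the real parallel data.

```python
# Back-translation sketch. `reverse_model.translate` is a hypothetical
# target->source translation call, not an actual Translate API.
def back_translate(monolingual_tgt, reverse_model):
    synthetic = []
    for tgt_sentence in monolingual_tgt:
        src_guess = reverse_model.translate(tgt_sentence)  # machine-generated source
        synthetic.append((src_guess, tgt_sentence))        # human-written target
    return synthetic

def build_training_set(real_parallel, monolingual_tgt, reverse_model):
    # Synthetic pairs augment, rather than replace, the real parallel data.
    return real_parallel + back_translate(monolingual_tgt, reverse_model)
```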

M4 Modeling: A technique that has been especially helpful for low-resource languages has been M4, which uses a single, giant model to translate between all languages and English. This allows for transfer learning at a massive scale. As an example, a lower-resource language like Yiddish has the benefit of co-training with a wide array of other related Germanic languages (e.g., German, Dutch, Danish, etc.), as well as almost a hundred other languages that may not share a known linguistic connection, but may provide useful signal to the model.
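
One common way to make a single model serve many language pairs, used in earlier multilingual NMT work, is to prepend a token naming the desired target language to each source sentence; the sketch below shows that trick in isolation. M4 itself involves much more (scale, architecture, data balancing), so treat this only as an illustration of the shared-model idea.

```python
# Sketch of the target-language-token trick for a single multilingual model.
def add_language_token(src_sentence, target_lang):
    # The "<2xx>" token tells the model which language to translate into.
    return f"<2{target_lang}> {src_sentence}"

print(add_language_token("Guten Morgen", "en"))   # German -> English
print(add_language_token("goedemorgen", "en"))    # Dutch  -> English
# <2en> Guten Morgen
# <2en> goedemorgen
```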

Judging Translation Quality
A popular metric for automatic quality evaluation of machine translation systems is the BLEU score, which is based on the similarity between a system’s translation and reference translations that were generated by people. With these latest updates, we see an average BLEU gain of +5 points over the previous GNMT models, with the 50 lowest-resource languages seeing an average gain of +7 BLEU. This improvement is comparable to the gain observed four years ago when transitioning from phrase-based translation to NMT.
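
For readers who want to compute BLEU themselves, the snippet below uses the open-source sacrebleu package on a toy hypothesis/reference pair; the sentences and the resulting score are illustrative and unrelated to the internal evaluation sets behind the +5 figure.

```python
# BLEU with the sacrebleu package (pip install sacrebleu).
import sacrebleu

hypotheses = ["the cat sat on the mat"]
# One reference stream; each stream holds one reference per hypothesis.
references = [["the cat is sitting on the mat"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```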

Although the BLEU score is a well-known approximate measure, it has various known pitfalls for systems that are already high-quality. For instance, several works have demonstrated how the BLEU score can be biased by translationese effects on the source side or target side, a phenomenon where translated text can sound awkward, containing attributes (like word order) from the source language. For this reason, we performed human side-by-side evaluations on all new models, which confirmed the gains in BLEU.

In addition to general quality improvements, the new models show increased robustness to machine translation hallucination, a phenomenon in which models produce strange “translations” when given nonsense input. This is a common problem for models that have been trained on small amounts of data, and affects many low-resource languages. For example, when given the string of Telugu characters “ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష”, the old model produced the nonsensical output “Shenzhen Shenzhen Shaw International Airport (SSH)”, seemingly trying to make sense of the sounds, whereas the new model correctly learns to transliterate this as “Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh”.

Conclusion
Although these are impressive strides forward for a machine, one must remember that, especially for low-resource languages, automatic translation quality is far from perfect. These models still fall prey to typical machine translation errors, including poor performance on particular genres of subject matter (“domains”), conflating different dialects of a language, producing overly literal translations, and poor performance on informal and spoken language.

Nonetheless, with this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages. We are grateful for the research that has enabled this from the active community of machine translation researchers in academia and industry.

Acknowledgements
This effort is built on contributions from Tao Yu, Ali Dabirmoghaddam, Klaus Macherey, Pidong Wang, Ye Tian, Jeff Klingner, Jumpei Takeuchi, Yuichiro Sawai, Hideto Kazawa, Apu Shah, Manisha Jain, Keith Stevens, Fangxiaoyu Feng, Chao Tian, John Richardson, Rajat Tibrewal, Orhan Firat, Mia Chen, Ankur Bapna, Naveen Arivazhagan, Dmitry Lepikhin, Wei Wang, Wolfgang Macherey, Katrin Tomanek, Qin Gao, Mengmeng Niu, and Macduff Hughes.