GFN Thursday: Rolling in the Deep (Silver) with Major ‘Metro Exodus’ and ‘Iron Harvest’ Updates

GFN Thursday reaches a fever pitch this week as we take a deeper look at two major updates coming to GeForce NOW from Deep Silver in the weeks ahead.

Catching Even More Rays

Metro Exodus was one of the first RTX games added to GeForce NOW. It’s still one of the most-played RTX games on the service. Back in February, developer 4A Games shared news of an Enhanced Edition coming to PC that takes advantage of a new Fully Ray-Traced Lighting Pipeline.

Today, we can share that it’s coming to GeForce NOW, day-and-date, on May 6.

The PC Enhanced Edition features significant updates to Metro Exodus’ real-time ray tracing implementations. Players will see improvements to the groundbreaking Ray-Traced Global Illumination (RTGI) featured in the base game, as well as new updates for the Ray-Traced Emissive Lighting techniques pioneered in “The Two Colonels” expansion.

The PC Enhanced Edition also includes additional ray-tracing features, like Advanced Ray-Traced Reflections, and support for NVIDIA DLSS 2.0 on NVIDIA hardware — including GeForce NOW.

The list of RTX features coming to the PC Enhanced Edition is massive:

  • Fully ray-traced lighting throughout — every light source is now ray traced
  • Next-gen ray tracing and denoising
  • Next-gen temporal reconstruction technology
  • Per-pixel ray-traced global illumination
  • Ray-traced emissive surfaces with area shadows
  • Infinite number of ray-traced light bounces
  • Atmosphere and transparent surfaces receiving ray-traced bounced lighting
  • Full ray-traced lighting model support, with color bleeding, for every light source
  • Advanced ray-traced reflections
  • DX12 Ultimate support, including DXR 1.1 and variable rate shading
  • GPU FP16 support and thousands of optimized shaders
  • Support for DLSS 2.0
  • Addition of FOV (field of view) slider to main game options

In short, the game is going to look even more amazing. And, starting next week, members who own Metro Exodus will have access to the update on GeForce NOW. But remember, to access the enhanced visuals you’ll need to be a Founder or Priority member.

Don’t own Metro Exodus yet? Head to Steam or the Epic Games Store and get ahead of the game.

Metro Exodus PC Enhanced Edition includes updated real-time ray tracing across both the base game and the expansion.
Metro Exodus PC Enhanced Edition includes updated real-time ray tracing that GeForce NOW Founders and Priority members can experience across nearly all of their devices.

A Strategic Move

Iron Harvest, the classic real-time strategy game with an epic single-player campaign, multiplayer and co-op, set in the alternate reality of 1920+, is getting new DLC on May 27. Dubbed “Operation Eagle,” the update brings a new faction to the game’s alternate version of World War I: the USA.

You’ll guide this new faction through seven new single-player missions, while learning how to use the new aircraft units available to all of the game’s playable factions, including Polania, Saxony and Rusviet.

“Operation Eagle” also adds new multiplayer maps that RTS fans will love, and the new USA campaign can be played cooperatively with friends.

Iron Harvest’s “Operation Eagle” DLC will be available on GeForce NOW day-and-date. You can learn more about the update here.

Don’t Take Our Word for It

The team at Deep Silver was gracious enough to answer a few questions we had about these great updates.

Q: We’re suckers for beautifully ray-traced PC games. On a scale from 1 to OMG, how great does Metro Exodus PC Enhanced Edition look?

A: We’re quietly confident that the Metro Exodus PC Enhanced Edition will register at the OMG end of the scale, but you don’t need to take our word for it – Digital Foundry declared that “Metro Exodus’ PC Enhanced Edition’s Global Illumination produces without a doubt, the best lighting I’ve ever witnessed in a video game.”

Q: What does it mean for the team to leverage GeForce NOW to bring these new real-time ray-tracing updates in Metro Exodus PC Enhanced Edition to gamers across their devices?

A: We believe hardware-accelerated ray-tracing GPUs are the future, but right now the number of players with ray-tracing-capable GPUs is a small, albeit growing, percentage of the total PC audience. GeForce NOW will give those players yet to upgrade their gaming hardware a glimpse into the future.

Q: How does “Operation Eagle” build on the story in Iron Harvest? We’re excited to try this new faction.

A: The American Union of Usonia stayed out of the Great War and became an economic and military powerhouse, unnoticed by Europe’s old elites. Relying heavily on mighty “Diesel Birds,” the Usonia faction brings more variety to the Iron Harvest battlefields. Additional new buildings and new units for all factions will enhance the Iron Harvest roster to give players even more options to find the perfect attack and defence strategy.

Q: How do you see GeForce NOW expanding the audience of gamers who can play Metro Exodus and Iron Harvest?

A: We’re committed to bringing the Metro Exodus experience to as many platforms as we can without compromising on the quality of the experience; GeForce NOW puts our state-of-the-art ray-traced version of Metro Exodus into the hands of gamers regardless of their own hardware setup.

Q: Is there anything else you’d want to share with your fans who are streaming Metro Exodus and Iron Harvest from the cloud?

A: Watch out for the jump scares. You have been warned.

There’s probably a jump scare coming here, right? GeForce NOW members can find out on May 6.

GFN Thursday

In addition to rolling with a pair of Deep Silver announcements this week, members get their regular dose of GFN Thursday goodness. Read more about that and other updates this week here.

Getting excited for more ray-traced goodness in Metro Exodus? Can’t wait to get your hands on “Operation Eagle”? Let us know on Twitter or in the comments below.

The post GFN Thursday: Rolling in the Deep (Silver) with Major ‘Metro Exodus’ and ‘Iron Harvest’ Updates appeared first on The Official NVIDIA Blog.

Read More

When artists and machine intelligence come together

Throughout history, from photography to video to hypertext, artists have pushed the expressive limits of new technologies, and artificial intelligence is no exception. At I/O 2019, Google Research and Google Arts & Culture launched the Artists + Machine Intelligence Grants, providing a range of support and technical mentorship to six artists from around the globe following an open call for proposals. The inaugural grant program sought to expand the field of artists working with Machine Learning (ML) and, through supporting pioneering artists, creatively push at the boundaries of generative ML and natural language processing. 

Today, we are publishing the outcomes of the grants. The projects draw from many disciplines, including rap and hip hop, screenwriting, early cinema, phonetics, Spanish language poetry, and Indian pre-modern sound. What they all have in common is an ability to challenge our assumptions about AI’s creative potential.

a graffiti-style visualization of the artwork

Learn more about the Hip Hop Poetry Bot

Hip Hop Poetry Bot by Alex Fefegha  

Can AI rap? Alex explores speech generation trained on rap and hip hop lyrics by Black artists. For the moment it exists as a proof of concept, as building the experiment in full requires a large, public dataset of rap and hip hop lyrics on which an algorithm can be trained, and such a public archive doesn’t currently exist.  The project is therefore launching with an invitation from Alex to rap and hip hop artists to become creative collaborators and contribute their lyrics to create a new, public dataset of lyrics by Black artists. 

A woman, partly smiling, in an industrial-style room

Read more about Neural Swamp

Neural Swamp by Martine Syms 

Martine uses video and performance to examine representations of blackness across generations, geographies, mediums, and traditions. For this residency, Martine developed Neural Swamp, a play staged across five screens, starring five entities who talk and sing alongside and over each other. Two of the five voices are trained on Martine’s voice and generated using machine learning speech models. The project will premiere at The Philadelphia Museum of Art and Fondazione Sandretto Re Rebaudengo in Fall 2021.

A dashboard with toggles for changing the letters in a sentence

The Nonsense Laboratory by Allison Parrish  

Allison invites you to adjust, poke at, mangle, curate and compress words with a series of playful tools in her Nonsense Laboratory. Powered by a bespoke code library and machine learning model developed by Allison, the tools let you mix and respell words, sequence mouth movements to create new words, rewrite a text so that the words feel different in your mouth, or go on a journey through a field of nonsense.

A collage of images, in the style of old cinema film

Let Me Dream Again by Anna Ridler 

Anna uses machine learning to try to recreate lost films from fragments of early Hollywood and European cinema that still exist. The outcome? An endlessly evolving, algorithmically generated film and soundtrack. The film will continually play, never repeating itself, over a period of one month. 

A woman in a desert holding a staff

Read more about Knots of Code

Knots of Code by Paola Torres Núñez del Prado

Paola studies the history of quipus, a pre-Columbian notation system based on the tying of knots in ropes, as part of a new research project, Knots of Code. The project’s first work is a Spanish-language poetry album from Paola and AIELSON, an artificial intelligence system that composes and recites poetry inspired by quipus, emulating the voice of the late Peruvian poet J.E. Eielson.

An empty stage with bells hanging on wires

Read more about Dhvāni

Dhvāni by Budhaditya Chattopadhyay 

Budhaditya brings a lifelong interest in the materiality, phenomenology, political-cultural associations, and sociability of sound to Dhvāni, a responsive sound installation comprising 51 temple bells and conducted with the help of machine learning. An early iteration of Dhvāni was installed at the EXPERIMENTA Arts & Sciences Biennale 2020 in Grenoble, France.

Explore the artworks at g.co/artistsmeetai or on the free Google Arts & Culture app for iOS and Android.

Read More

Model-Based RL for Decentralized Multi-agent Navigation

Posted by Rose E. Wang, Student Researcher and Aleksandra Faust, Staff Research Scientist, Google Research

As robots become more ubiquitous in day-to-day life, the complexity of their interactions with each other and with the environment grows. In a controlled environment, such as a lab, multiple robots can coordinate their actions and efforts through a centralized planner that facilitates communication between individual agents. And while much research has been done to address reliable sensor-informed goal navigation, in many real-world applications aligning goals across independent robotic agents must be done without a centralized planner, which poses non-trivial challenges.

An example of such a challenging decentralized task is the rendezvous task, in which multiple agents must agree upon a time and place at which they can meet, without explicitly communicating with one another. This goal alignment task plays an important role in real-world multiagent and human-robot settings, e.g., performing object handovers or determining goals on the fly. Solving the decentralized rendezvous task in this situation depends not just on the obstacles in the environment, but also on the policies and dynamics of each agent. Addressing potential miscoordination and dealing with noisy sensor data depends on the agents’ ability to model the motions of other agents as well as their own, and to adapt to diverging intentions while using limited information.

An example of two independently controlled robots separated by obstacles that share the objective of meeting each other. How should they move in order to meet? Example trajectories are illustrated in red and blue arrows for each robot. Each robot makes an independent decision of where to go based on their own observations.

In “Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous”, presented at CoRL 2020, we propose a holistic approach to address the challenges of the decentralized rendezvous task, which we call hierarchical predictive planning (HPP). This is a decentralized, model-based reinforcement learning (RL) system that enables agents to align their goals on the fly in the real world. We evaluate HPP in a mixture of real-world and simulated environments and compare it to several learning-based planning and centralized baselines. In those evaluations, we show that HPP is able to more effectively predict and align trajectories, avoid miscoordinations, and directly transfer to the real world without additional fine-tuning.

Putting Together Prediction, Planning and Control
Akin to a standard navigation pipeline, our learning-based system consists of three modules: prediction, planning, and control. Each agent employs the prediction model to learn agent motion and to predict the future positions of itself (the ego-agent) and others based on its own observations (e.g., from LiDAR and team position information) of other agents’ behaviors and navigation patterns. So, each agent learns two prediction models, one for its own motion and one for the other agent. These motion predictors constitute the prediction module, and are used by each agent’s planning module.

The output of the prediction module — the estimate of where each agent, both the ego-agent and the other agents, is most likely to be given the ego-agent’s own sensor observations — is useful information for the planning module, which evaluates different goal locations and maintains a belief distribution over where the team should converge. The belief distribution is periodically updated using evaluations provided by the prediction model. An agent samples from this belief distribution to update the goal to which it should navigate.

The selected goal is passed to the agent’s control module, which is equipped with a pre-trained, imperfect navigation policy that can navigate to a given location in the obstacle-laden environment. The control policy then determines what action the robot should execute.

This process of observing other agents, updating belief distributions and navigating to an updated goal repeats until agents have successfully rendezvoused. While the hierarchical planning and control setup are not unusual, our work closes the loop between the control and planning for decentralized multiagent systems by use of the sensor-informed prediction module.
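
The three modules interact in a simple sense-predict-plan-act loop. The sketch below is a minimal, hypothetical rendering of that loop for a single agent; the prediction models, scoring function, belief object, and control policy are passed in as placeholder interfaces for illustration and are not the authors’ code.

```python
# Minimal sketch of one agent's HPP step (hypothetical interfaces, not the
# authors' code). The prediction, planning, and control modules are passed in
# as duck-typed objects so the loop itself stays self-contained.

def hpp_agent_step(observation, predictors, belief, goal_candidates,
                   score_goal, control_policy):
    # 1. Prediction: estimate future poses of the ego-agent and its teammates
    #    from the ego-agent's own observations only.
    predicted_poses = {agent_id: model.predict(observation)
                       for agent_id, model in predictors.items()}

    # 2. Planning: score candidate rendezvous goals by how close the predicted
    #    trajectories end up, update the belief distribution, and sample a goal.
    scores = [score_goal(goal, predicted_poses) for goal in goal_candidates]
    belief = belief.update(goal_candidates, scores)
    goal = belief.sample()

    # 3. Control: the pretrained (possibly imperfect) navigation policy turns
    #    the sampled goal into a low-level action.
    action = control_policy.act(observation, goal)
    return action, belief
```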

Training the Prediction Models
HPP trains motion predictors in simulation, assuming that each agent is controlled by a hidden, perhaps suboptimal, control policy capable of avoiding obstacles. The key difficulty lies in training prediction models without access to other agents’ sensor observations and control policies.

The predictors are trained via self-supervision. To collect the training data, we randomly place all the agents and obstacles in an environment, and each agent is given a random goal (unknown to other agents). As the agents move toward their respective goals, each agent records the experience — its sensor observations and the poses of all agents (itself and other agents). Next, from the recorded experience, the agent learns a separate predictor for each agent in the team including itself (target agent). The training dataset consists of ego-agent initial sensor observations, target agent’s pose and goal, labeled with future ego-observations and target agent poses. The goal and labels are inferred from the recorded experience.

As a result, the predictors learn temporal causality of the present and future ego-agent’s observations and target agent’s poses, conditioned on the target agent’s assumed goals — in other words the models predict where each agent will be in the future based on the present. The predictor training is done only with the information available to agents at the runtime, and in environments independent from the deployment environments.
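
To make the self-supervised setup concrete, the sketch below shows how one (input, label) pair might be assembled from a recorded episode. The field names and the helper are illustrative assumptions, not taken from the paper or its code.

```python
# Sketch of assembling self-supervised training examples from recorded
# experience (field names are illustrative, not from the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class PredictorExample:
    ego_observation: np.ndarray         # e.g., LiDAR scan at time t
    target_pose: np.ndarray             # target agent's pose at time t
    target_goal: np.ndarray             # goal inferred from the recorded trajectory
    future_ego_observation: np.ndarray  # label: ego observation at t + k
    future_target_pose: np.ndarray      # label: target agent's pose at t + k

def make_examples(episode, horizon):
    """Turn one recorded episode into (input, label) pairs for one target agent."""
    examples = []
    for t in range(len(episode) - horizon):
        examples.append(PredictorExample(
            ego_observation=episode[t]["ego_obs"],
            target_pose=episode[t]["target_pose"],
            target_goal=episode[-1]["target_pose"],  # final pose stands in for the goal
            future_ego_observation=episode[t + horizon]["ego_obs"],
            future_target_pose=episode[t + horizon]["target_pose"],
        ))
    return examples
```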

The training environment for the prediction models. The environment is filled with randomly placed obstacles. All agents (left in blue, upper right in red) are given the same random goal (center in green) and move toward it with their own control modules.

Selecting Goals for Alignment
A model-based RL planner for each agent uses the learned predictors in the deployment environments to guide the agents towards the rendezvous point. The planner takes into account what it believes the other agents would do when also completing the rendezvous task.

HPP illustration. Each robot independently considers several potential rendezvous points, and evaluates each point based on how close it believes the agents can get.

To perform this reasoning, each agent independently samples a series of potential goals and selects the goal that it believes it would be the most likely to succeed. This process effectively simulates a centralized planner for fictitious agents by using the prediction models to predict trajectories of those agents moving to a fixed goal. Conditioned on a proposed goal, the algorithm predicts the poses of the agents in the future, which are generated from sequential rollouts of the prediction models. Each goal is then evaluated by scoring the anticipated system state using the task reward, which favors goals that bring agents closer together. We use the cross-entropy method (CEM) to convert these goal evaluations into belief updates over potential rendezvous points. Finally, the agent’s planner selects a goal for itself from this new belief distribution and passes this goal to the agent’s control module.
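
A minimal sketch of this CEM-style goal selection is shown below. The rollout function, which stands in for sequential rollouts of the learned prediction models, is a hypothetical placeholder, and the 2D Gaussian belief is a simplification of the approach described above.

```python
import numpy as np

def select_goal_cem(rollout_fn, num_iters=3, num_samples=64, elite_frac=0.2):
    """Minimal cross-entropy-method sketch for choosing a rendezvous goal.

    rollout_fn(goal) stands in for sequential rollouts of the learned
    prediction models and returns the predicted final (x, y) poses of all
    agents, shaped (num_agents, 2). The reward favors goals whose rollouts
    bring the agents close together.
    """
    mean, std = np.zeros(2), np.ones(2) * 5.0       # belief over 2D goal locations
    n_elite = max(1, int(num_samples * elite_frac))
    for _ in range(num_iters):
        goals = np.random.randn(num_samples, 2) * std + mean
        scores = []
        for g in goals:
            final_poses = rollout_fn(g)             # (num_agents, 2)
            spread = np.linalg.norm(final_poses - final_poses.mean(axis=0))
            scores.append(-spread)                  # closer together = higher score
        elites = goals[np.argsort(scores)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean    # selected goal handed to the control module
```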

A simple illustration of the goal evaluation. At the end of a simulated trajectory, the agents (red, left, and blue, right) are either far (top) or close (bottom) to each other. The goal in the bottom image is better than the goal on top because agents end up closer to each other.

Results
We compare HPP against several baselines — MADDPG (learning-based), RRT (planning) with CEM, and centralized baselines that use heuristics for selecting the agent’s rendezvous point — in a mixture of real-world and simulated environments.

Evaluation environments, each of which is independent of the training environment for the agent’s control policy and prediction modules.

There are two main takeaways from our results. One is that HPP enables agents to predict and align trajectories, avoiding miscoordinations. For example:

The second takeaway is that HPP transfers directly into the real world without additional training. For example:

Conclusion
This work presents HPP, a model-based RL approach for decentralized multiagent coordination. Agents first learn to predict where they and their teammates are going to be from their own sensors and decide and navigate to a common goal. Our experiments demonstrate the method generalizes to new environments and handles miscoordination while making no assumptions about the dynamics of other agents. This may be of interest to the larger multiagent research community as a real-world example of a decentralized task using noisy sensors and imperfect controllers, to the motion planning community as an example of a learning-based planning system that closes the loop between the planner and controller, and to the RL community as an example of model-based RL as feedback in a hierarchical, self-supervised prediction setting.

Acknowledgements
This research was done by Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust with special thanks to Michael Everett, Oscar Ramirez and Igor Mordatch for the insightful discussions.

Read More

Perceiving with Confidence: How AI Improves Radar Perception for Autonomous Vehicles

Editor’s note: This is the latest post in our NVIDIA DRIVE Labs series, which takes an engineering-focused look at individual autonomous vehicle challenges and how NVIDIA DRIVE addresses them. Catch up on all of our automotive posts, here.

Autonomous vehicles don’t just need to detect the moving traffic that surrounds them — they must also be able to tell what isn’t in motion.

At first glance, camera-based perception may seem sufficient to make these determinations. However, low lighting, inclement weather or conditions where objects are heavily occluded can affect cameras’ vision. This means diverse and redundant sensors, such as radar, must also be capable of performing this task. However, additional radar sensors that leverage only traditional processing may not be enough.

In this DRIVE Labs video, we show how AI can address the shortcomings of traditional radar signal processing in distinguishing moving and stationary objects to bolster autonomous vehicle perception.

Traditional radar processing bounces radar signals off of objects in the environment and analyzes the strength and density of reflections that come back. If a sufficiently strong and dense cluster of reflections comes back, classical radar processing can determine this is likely some kind of large object. If that cluster also happens to be moving over time, then that object is probably a car.

While this approach can work well for inferring a moving vehicle, the same may not be true for a stationary one. In this case, the object produces a dense cluster of reflections, but doesn’t move. According to classical radar processing, this means the object could be a railing, a broken down car, a highway overpass or some other object. The approach often has no way of distinguishing which.
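
The toy sketch below illustrates this limitation; it is not NVIDIA’s pipeline. Reflections are clustered by density, and a cluster is flagged as moving or stationary from its radial velocities, but a stationary cluster alone cannot tell you whether it is a railing, an overpass or a parked car.

```python
# Toy illustration of classical radar clustering and its ambiguity for
# stationary objects (not NVIDIA's processing stack).
import numpy as np
from sklearn.cluster import DBSCAN

def classify_clusters(points_xy, radial_velocity, eps=1.5, min_samples=5, v_thresh=0.5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xy)
    clusters = []
    for label in set(labels) - {-1}:          # -1 marks noise points
        mask = labels == label
        moving = np.abs(radial_velocity[mask]).mean() > v_thresh
        clusters.append({
            "center": points_xy[mask].mean(axis=0),
            "num_reflections": int(mask.sum()),
            # A moving, dense cluster is probably a vehicle; a stationary one
            # could be a railing, an overpass, or a parked car (ambiguous).
            "hypothesis": "likely vehicle" if moving else "ambiguous static object",
        })
    return clusters
```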

Introducing Radar DNN

One way to overcome the limitations of this approach is with AI in the form of a deep neural network (DNN).

Specifically, we trained a DNN to detect moving and stationary objects, as well as accurately distinguish between different types of stationary obstacles, using data from radar sensors.

Training the DNN first required overcoming radar data sparsity problems. Since radar reflections can be quite sparse, it’s practically infeasible for humans to visually identify and label vehicles from radar data alone.

Figure 1. Example of propagating bounding box labels for cars from the lidar data domain into the radar data domain.

Lidar, however, can create a 3D image of surrounding objects using laser pulses. Thus, ground truth data for the DNN was created by propagating bounding box labels from the corresponding lidar dataset onto the radar data as shown in Figure 1. In this way, the ability of a human labeler to visually identify and label cars from lidar data is effectively transferred into the radar domain.
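
As a rough illustration of this label propagation idea (not the production tooling), the sketch below assigns lidar-derived box labels to radar points with a simple points-in-box test, assuming both sensors are already calibrated into a common top-down frame and simplifying to axis-aligned 2D boxes.

```python
import numpy as np

def propagate_labels(radar_points_xy, lidar_boxes):
    """Assign lidar-derived bounding-box labels to radar reflections.

    Assumes both sensors share a common top-down frame and uses axis-aligned
    2D boxes for simplicity; real label propagation would handle oriented 3D
    boxes and timing alignment between sensors.
    """
    labels = np.full(len(radar_points_xy), -1, dtype=int)   # -1 = unlabeled
    for box_id, (x_min, y_min, x_max, y_max) in enumerate(lidar_boxes):
        inside = (
            (radar_points_xy[:, 0] >= x_min) & (radar_points_xy[:, 0] <= x_max) &
            (radar_points_xy[:, 1] >= y_min) & (radar_points_xy[:, 1] <= y_max)
        )
        labels[inside] = box_id
    return labels
```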

Moreover, through this process, the radar DNN not only learns to detect cars, but also their 3D shape, dimensions and orientation, which classical methods cannot easily do.

With this additional information, the radar DNN is able to distinguish between different types of obstacles — even if they’re stationary — increase confidence of true positive detections, and reduce false positive detections.

The higher-confidence 3D perception results from the radar DNN in turn enable AV prediction, planning and control software to make better driving decisions, particularly in challenging scenarios. For radar, classically difficult problems, such as accurate shape and orientation estimation and detecting stationary vehicles, including vehicles under highway overpasses, become feasible with far fewer failures.

The radar DNN output is integrated smoothly with classical radar processing. Together, these two components form the basis of our radar obstacle perception software stack.

This stack is designed to both offer full redundancy to camera-based obstacle perception and enable radar-only input to planning and control, as well as enable fusion with camera- or lidar-perception software.

With such comprehensive radar perception capabilities, autonomous vehicles can perceive their surroundings with confidence.

To learn more about the software functionality we’re building, check out the rest of our DRIVE Labs series.

The post Perceiving with Confidence: How AI Improves Radar Perception for Autonomous Vehicles appeared first on The Official NVIDIA Blog.

Read More

Intelligent governance of document processing pipelines for regulated industries

Processing large documents like PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare information like doctor-patient visits and bills of health, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.

Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.

This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.

A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.

For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.

Data lineage

Data lineage is the part of data governance that refers to the practice of providing GPS services for data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it’s headed from this point on.

It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.

Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.

It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.
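
As a concrete illustration, a single lineage record might look like the following. The field names mirror the Document Lineage table described later in this post; the values are made up for the example.

```python
# Illustrative lineage record (field names follow the Document Lineage table
# described later in this post; all values are placeholders).
lineage_record = {
    "DocumentId": "doc-7f3a2c",
    "SourceBucketName": "nlp-raw",
    "SourceFileName": "loan-application-2021-04.pdf",
    "TargetBucketName": "nlp-textract",
    "TargetFileName": "loan-application-2021-04.json",
    "ServiceArn": "arn:aws:lambda:us-east-1:123456789012:function:AsyncProcessor",
    "Timestamp": "2021-04-29T14:05:31Z",
}
```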

To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.

Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.

The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.

This design uses the following AWS services (you can also follow along in the GitHub repo):

  • Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text, and can do so in multiple languages.
  • Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
  • Amazon DynamoDB Streams – A change data capture (CDC) service. It captures an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
  • Amazon Elasticsearch Service (Amazon ES) – A fully managed service that makes it possible for you to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
  • AWS Lambda – A serverless compute service that runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon Simple Notification Service (Amazon SNS) – An AWS managed service for application-to-application communications, with a pub/sub model enabling high-throughput, low-latency message relaying.
  • Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service that stores your documents and allows for central management with fine-tuned access controls.
  • Amazon Textract – A fully managed ML service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond OCR to identify, understand, and extract data from forms and tables.

Architecture

The overall design is grouped into five segments:

  • Metadata services module
  • Ingestion module
  • OCR module
  • NLP module
  • Analytics module

All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.

The following diagram illustrates the physical design.

Metadata services

This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.

At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.

In addition, you can extend this design to include additional data governance components such as data quality.

The tables are defined as follows.

Document Registry
  • Purpose: Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table.
  • DynamoDB stream enabled: Yes
  • Data governance component: Catalog
  • Sample use: Provides the ability to quickly look up and understand the document source and context metadata.

Document Ownership
  • Purpose: Covers the responsibility model of the data governance framework, in which each document acquired by the pipeline has a defined owner.
  • DynamoDB stream enabled: No
  • Data governance component: Ownership
  • Sample use: Provides notification services and can be extended to manage data quality controls.

Document Lineage
  • Purpose: Keeps track of all data movements. It provides detailed lineage information that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN of the AWS service that processed the document, and a timestamp.
  • DynamoDB stream enabled: No
  • Data governance component: Lineage
  • Sample use: A simple PartiQL query against this table, based on the document ID, returns a list of all the steps the original document has taken. The query output can include the document ID, original document name, timestamp, source S3 bucket, source file name, destination S3 bucket, and destination file name.

Pipeline Operations
  • Purpose: Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order.
  • DynamoDB stream enabled: Yes
  • Data governance component: Operation
  • Sample use: An operational query on a document ID to determine where in the pipeline the current document processing is.
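
To make that lineage lookup concrete, here is a minimal sketch of the kind of PartiQL query described above, issued through boto3. The table and attribute names are assumptions based on this post rather than the accompanying repository.

```python
# Sketch of a PartiQL lookup against the Document Lineage table (table and
# attribute names are assumptions based on this post, not the repo).
import boto3

dynamodb = boto3.client("dynamodb")

def get_lineage(document_id):
    response = dynamodb.execute_statement(
        Statement='SELECT * FROM "DocumentLineage" WHERE DocumentId = ?',
        Parameters=[{"S": document_id}],
    )
    # Each item records one hop: source/destination bucket, file names,
    # the ARN of the service that processed the document, and a timestamp.
    return sorted(response["Items"], key=lambda item: item["Timestamp"]["S"])
```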

DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.

In addition, DynamoDB Streams provides developer teams with an efficient way of connecting your application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or alert when certain unexpected problems arise while processing some documents).

The Lambda functions provide microservices API call capabilities for the document pipeline to self-register its movements and actions undertaken by the pipeline code:

  • Document Arrival Register API – Registers the incoming document’s metadata and location within the Document Registry table
  • Document Lineage API – Registers the lineage information within the Document Lineage table
  • Pipeline Operations API – Provides up-to-date information on the state of the pipeline

The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.

Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
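
The following is a minimal sketch of how a pipeline component might publish an event to the FIFO SNS topic with boto3. The topic ARN, message shape, and attribute names are assumptions for illustration.

```python
# Minimal sketch of publishing a pipeline event to the FIFO SNS topic
# (topic name and message shape are assumptions, not taken from the repo).
import boto3, json, uuid

sns = boto3.client("sns")

def publish_pipeline_event(topic_arn, document_id, event_type, payload):
    sns.publish(
        TopicArn=topic_arn,
        Message=json.dumps({"documentId": document_id,
                            "eventType": event_type,
                            "payload": payload}),
        # Grouping by document ID keeps events for one document in order;
        # the deduplication ID prevents accidental double-processing.
        MessageGroupId=document_id,
        MessageDeduplicationId=str(uuid.uuid4()),
        MessageAttributes={
            "eventType": {"DataType": "String", "StringValue": event_type},
        },
    )
```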

The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.

The following diagram shows the architecture of our metadata services.

Ingestion module

The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).

The ingestion module follows a four-step process (as shown in the following diagram):

  1. A document is uploaded to the NLP/Raw S3 bucket.
  2. The Document Registrar Lambda function is invoked, which calls the metadata services API to register the document and receive a unique ID. This ID is added to the document as a tag, and the metadata is registered within the DynamoDB table Document Registry.
  3. After the document metadata is registered with Metadata Services, the DynamoDB Document Registration stream is invoked to start the Document Classification Lambda function. This function examines the metadata registered on the document and determines if the downstream OCR segment should be invoked on this document. The result of this examination is written back to the metadata services.
  4. The metadata registration of the previous step invokes a DynamoDB Pipeline Operations Stream, which invokes the Document Extension Detector Lambda function. This function examines the incoming file formats and separates the image files from the PDF documents.

All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
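
To ground the first two steps, here is a sketch of what a Document Registrar-style Lambda handler could look like. The table name, tag key, and item schema are assumptions based on the description above, not the actual repository code.

```python
# Sketch of a Document Registrar-style Lambda handler (structure, table name,
# and item schema are assumptions based on this post, not the repo).
import boto3, uuid, datetime

s3 = boto3.client("s3")
registry = boto3.resource("dynamodb").Table("DocumentRegistry")

def handler(event, context):
    for record in event["Records"]:              # S3 put events
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        document_id = str(uuid.uuid4())

        # Tag the object so every downstream stage can trace it back.
        s3.put_object_tagging(
            Bucket=bucket, Key=key,
            Tagging={"TagSet": [{"Key": "documentId", "Value": document_id}]},
        )

        # Register the document one time in the Document Registry table;
        # its DynamoDB stream then triggers the classification step.
        registry.put_item(Item={
            "DocumentId": document_id,
            "SourceBucket": bucket,
            "SourceKey": key,
            "RegisteredAt": datetime.datetime.utcnow().isoformat() + "Z",
        })
```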

OCR module

This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.

The OCR module consists of the following process, as illustrated in the architecture diagram:

  1. Image files are uploaded to the NLP/image S3 bucket and the Sync Processor Lambda function is invoked. The function synchronously points Amazon Textract to the S3 location of the image file, and waits for a response.
  2. Amazon Textract transforms the format to text and deposits the text output in the NLP/Textract S3 bucket. This step concludes OCR processing of the image file types.
  3. PDF files are placed within the NLP/PDF S3 bucket. This bucket invokes the Async Processor Lambda function. This function feeds the document to Amazon Textract for asynchronous processing and registers the state of this step with the metadata services.
  4. When the Amazon Textract document analysis is complete, an SNS message is sent to a specified SNS topic, notifying downstream consumers of the job completion. In this implementation, an SQS queue captures that message.
  5. The SQS queue message is the event that triggers the Result Processor Lambda function.
  6. The function extracts the results of document analysis from Amazon Textract and formats it according to the type of text it analyzed (forms, tables, and raw text).
  7. The results are pushed to the NLP/Textract S3 bucket, page by page for every type of text, and as a complete JSON response.

All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
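
The sketch below shows the two Amazon Textract paths described above: the synchronous call used for images and the asynchronous call used for PDFs, which notifies an SNS topic on completion. Bucket names, the topic ARN, and the role ARN are placeholders.

```python
# Sketch of the two Textract paths: synchronous for images, asynchronous
# (with an SNS completion notification) for PDFs. All names are placeholders.
import boto3

textract = boto3.client("textract")

def ocr_image_sync(bucket, key):
    # Images are processed synchronously; the response is returned directly.
    return textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

def ocr_pdf_async(bucket, key, sns_topic_arn, role_arn):
    # PDFs go through the asynchronous API; Textract publishes a job-complete
    # message to the SNS topic, which the Result Processor function consumes.
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```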

NLP module

This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).

Once key phrases are extracted, indexing them in an analytical tool lets you find the documents that contain them quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing the results of this module.

We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.

Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):

  1. The incoming text file uploaded to the NLP/Textract S3 bucket invokes the Sync Comprehend Processor Lambda function.
  2. The function feeds the incoming file to Amazon Comprehend for processing.
  3. The results from Amazon Comprehend, in JSON format, are deposited in the NLP/JSON S3 bucket.
  4. The results from Amazon Comprehend are sent to Amazon ES, the service we incorporate as our document search engine.

All steps are being registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
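
For illustration, a processor function might call Amazon Comprehend like the sketch below, before writing the results to the NLP/JSON bucket and indexing them in Amazon ES. Chunking, error handling, and the indexing step itself are omitted.

```python
# Sketch of the synchronous Comprehend calls made by the processor function.
import boto3

comprehend = boto3.client("comprehend")

def extract_insights(text, language_code="en"):
    # The synchronous Comprehend APIs have a per-request size limit, so longer
    # documents would need to be split into chunks; we simply truncate here.
    snippet = text[:4500]
    key_phrases = comprehend.detect_key_phrases(Text=snippet, LanguageCode=language_code)
    entities = comprehend.detect_entities(Text=snippet, LanguageCode=language_code)
    return {
        "keyPhrases": [kp["Text"] for kp in key_phrases["KeyPhrases"]],
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
    }
```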

Analytics module

This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:

  1. The output from Amazon Comprehend, in JSON format, is fed to Amazon Neptune. Neptune allows end users to discover relationships across documents. This is an example of a downstream analytics application that is not implemented in this post.
  2. The end users have access to the original document in four formats (CSV, JSON, original, text), and can search key phrases using Amazon ES. They can identify relationships using Neptune. A JSON version of the document is available in the NLP/JSON S3 bucket. The original document is available in the NLP/Raw S3 bucket.
  3. Full lineage can be obtained from the Document Lineage table in DynamoDB.

The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database Service (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
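
As one example of a consumption pattern, the sketch below runs a key-phrase search against the Amazon ES domain. The index name, field names, and endpoint are placeholders, and a real deployment would sign the request (for example with SigV4) or use fine-grained access control.

```python
# Illustrative key-phrase search against the Amazon ES domain (index name,
# endpoint, and auth are placeholders; a real call would be signed).
import requests

def search_documents(es_endpoint, phrase):
    query = {
        "query": {"match": {"keyPhrases": phrase}},
        "_source": ["documentId", "keyPhrases"],
    }
    response = requests.get(f"{es_endpoint}/documents/_search", json=query)
    response.raise_for_status()
    hits = response.json()["hits"]["hits"]
    # Each hit carries the document ID, which links back to the original file
    # in the NLP/Raw bucket and its full lineage in DynamoDB.
    return [hit["_source"] for hit in hits]
```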

Conclusion

In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository to store documents one time but process multiple times. A data governance framework as illustrated in this design provides you with necessary guardrails to ensure documents are governed in a standard fashion across the organization, while providing lines of business with autonomy to decide your NLP and OCR models and choice of tooling.

The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.


About the Authors

  David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.

 

 

Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.

 

Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.

Read More

Announcing the AWS DeepComposer Chartbusters challenges 2021 season launch

We’re back with two new challenges for the AWS DeepComposer Chartbusters 2021 season! Chartbusters is a global challenge in which developers use AWS DeepComposer to create original compositions and compete in monthly challenges to showcase their machine learning (ML) and generative artificial intelligence (AI) skills. Regardless of your background in music or ML, one of the two new challenges will be right for you.

You can choose between two different challenges this season. In the basic challenge, Melody-Go-Round, you can use any of the generative AI models available in the AWS DeepComposer Music studio to create new compositions. In the advanced challenge, Melody Harvest, you train a custom generative AI model with your own dataset using Amazon SageMaker. In this challenge, you can dive deeper into the mechanics of data preparation, model training, and evaluation to teach a model to play your favorite style of music.

The 2021 season runs through October 31, 2021. Winners of each challenge are selected on the last day of each month, and we’ll feature the winners in an AWS Machine Learning Blog post. Monthly winners of the Melody Harvest challenge will also win a ticket to AWS re:Invent 2021. To participate, go to the AWS DeepComposer console and choose the Chartbusters challenge that’s right for you in the navigation pane.

Compete in the Melody-Go-Round challenge

You can compete in the AWS DeepComposer Chartbusters Melody-Go-Round challenge in just a few simple steps:

  1. In the AWS DeepComposer Music studio, record a track, import a track, or pick any of the available input tracks.
  2. Get creative and explore different combinations of available models. You can also explore advanced parameters under each model.
  3. Use the Edit melody feature to add or remove notes, or change the note duration and pitch. When finished, choose Apply changes. You can iterate by adjusting the advanced parameters and choosing Enhance again. Repeat these steps until you’re satisfied with the generated music. You can also download the melody and import it into a digital audio workstation like GarageBand and further indulge your creativity.
  4. When your melody is complete, go to the submission form and choose an existing composition or import a post-processed audio track. Choose Melody-Go-Round for the competition type, register or sign in to SoundCloud, and choose Submit.
For more information on judging criteria, visit the AWS DeepComposer Melody-Go-Round page.

Compete in the Melody Harvest challenge

  1. Explore our GitHub pages for Generative Adversarial Networks (GANs), Autoregressive Convolutional Neural Networks (AR-CNNs), and Transformers. Then train your own model and start composing your music.
  2. You can upload the generated MIDI file to a digital audio workstation like GarageBand and further improve it.
  3. When your melody is complete, go to the submission form, choose Melody Harvest for the competition type, import a postprocessed audio track, and add the link to your GitHub repository. Make sure your GitHub repository has your notebook and your model’s checkpoint files.

For more information on datasets and judging criteria, visit the AWS DeepComposer Melody Harvest page.

Conclusion

Congratulations! You have successfully submitted your composition to the AWS DeepComposer Chartbusters challenge. Now you can invite your friends and family to listen to your creation on SoundCloud, vote for their favorite, and join the fun by participating in the competition.

Although you don’t need a physical keyboard to compete, we’re offering the AWS DeepComposer keyboard at a special price of $69.00 (30% off) for a limited time on Amazon.com to improve your music generation experience. The pricing includes the keyboard and 3 months of the AWS DeepComposer free trial. To learn more about the different generative AI techniques supported by AWS DeepComposer, check out the learning capsules available on the AWS DeepComposer console.


About the Authors

Maryam Rezapoor is a Senior Product Manager with AWS AI Devices team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

 

 

 Chris Whittam is a Senior Product Manager on the AWS AI Devices team helping developers get hands on (literally) with machine learning.

Read More

Holistic Video Scene Understanding with ViP-DeepLab

Posted by Siyuan Qiao, Student Researcher and Liang-Chieh Chen, Research Scientist, Google Research

People are able to retrieve the visual information about 3D environments from a picture quite easily — we can identify objects, determine instance sizes, and reconstruct 3D scene layout, all using the limited signals contained in 2D images. This ability is commonly known as the inverse projection problem, which refers to reconstructing the ambiguous mapping from the retinal images to the sources of retinal stimulation. Real-world computer vision applications, such as autonomous driving, heavily rely on these capabilities to localize and identify 3D objects, which require vision models to infer the spatial location, semantic class, and instance label for each 3D point projected to the 2D images. The ability to reconstruct the 3D world from images can be decomposed into two disjoint computer vision tasks: monocular depth estimation (predicting depth from a single image) and video panoptic segmentation (the unification of instance segmentation and semantic segmentation, in the video domain). However, research has generally considered each task separately. Tackling these tasks jointly with a unified computer vision model could result in easier deployment and greater efficiency by sharing computation among multiple tasks.

Driven by the potential value of a model that predicts depth and video panoptic segmentation at the same time, we present “ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation”, accepted to CVPR 2021. In this work, we propose a new task, depth-aware video panoptic segmentation, that aims to simultaneously tackle monocular depth estimation and video panoptic segmentation. For the new task, we present two derived datasets accompanied by a new evaluation metric called depth-aware video panoptic quality (DVPQ). This new metric includes the metrics for depth estimation and video panoptic segmentation, requiring a vision model to simultaneously tackle the two sub-tasks. To this end, we extend Panoptic-DeepLab by adding network branches for depth and video predictions to create ViP-DeepLab, a unified model that jointly performs video panoptic segmentation and monocular depth estimation for each pixel on the image plane, and achieves state-of-the-art performance on several academic datasets for the sub-tasks. This video demonstrates the new task and shows the results of ViP-DeepLab.

Depth-aware video panoptic segmentation results obtained by ViP-DeepLab. Top-left: Video frames used as input. Top-right: Video panoptic segmentation results. Bottom-left: Estimated depth. Bottom-right: Reconstructed 3D points. Each object instance has a unique and temporally consistent label, e.g., pedestrian_1, pedestrian_2, etc. Input images are from the Cityscapes dataset.

Overview
While Panoptic-DeepLab is able to output semantic segmentation, center prediction, and center regression for a single frame, it lacks the capability of depth estimation and temporally consistent instance ID prediction for multiple frames. However, ViP-DeepLab accomplishes this by performing additional predictions from two consecutive frames as input. The first additional output is depth estimation for the first frame, for which it assigns an estimated depth to each pixel. In addition, ViP-DeepLab also performs center regression for two consecutive frames for only the object centers that appear in the first frame. This process is called center offset prediction, and allows ViP-DeepLab to group all the pixels in the two frames to the same object that appears in the first frame. New instances emerge if they are not grouped to the previously detected instances. This process continues for every two consecutive frames (with one overlapping frame) in a video sequence, stitching panoptic predictions together to form predictions with temporally consistent instance IDs. That is, it stitches together where objects are and how they move in a video scene with time.

Outputs of ViP-DeepLab for video panoptic segmentation. Two consecutive frames are concatenated as input. The semantic segmentation output associates each pixel with its semantic classes, while the instance segmentation outputs identify the pixels from two frames associated with an individual object in the first frame. Input images are from the Cityscapes dataset.
Visualization of stitching video panoptic predictions. ViP-DeepLab propagates IDs based on mask intersection-over-union between region pairs. It is capable of tracking objects with large movements, e.g., the cyclist in the image.
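
The following toy sketch shows the flavor of the IoU-based ID stitching mentioned in the caption above, applied between two overlapping predictions; it is a simplification for illustration, not the ViP-DeepLab implementation.

```python
import numpy as np

def propagate_ids(prev_masks, curr_masks, iou_threshold=0.5):
    """Toy version of IoU-based ID stitching between two overlapping predictions.

    prev_masks / curr_masks map instance IDs to boolean masks for the shared
    frame; a current instance inherits the previous ID with the highest mask
    IoU, otherwise it is treated as a newly appearing instance.
    """
    next_new_id = max(prev_masks, default=0) + 1
    assignment = {}
    for curr_id, curr_mask in curr_masks.items():
        best_id, best_iou = None, iou_threshold
        for prev_id, prev_mask in prev_masks.items():
            inter = np.logical_and(curr_mask, prev_mask).sum()
            union = np.logical_or(curr_mask, prev_mask).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_id, best_iou = prev_id, iou
        if best_id is None:                     # no good match: new object
            best_id, next_new_id = next_new_id, next_new_id + 1
        assignment[curr_id] = best_id
    return assignment
```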

Neural Network Design
Building on top of Panoptic-DeepLab, ViP-DeepLab additionally contains two prediction branches: (1) a depth prediction branch, and (2) a next-frame instance branch. Specifically, the depth prediction head is a simple design that predicts depth regression for every pixel, while the next-frame instance branch predicts the center offsets for the pixels in the second frame with respect to the centers in the first frame.
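
A minimal PyTorch-style sketch of these two extra heads, sitting on top of a shared decoder feature map, is shown below. The channel sizes and layer shapes are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the two additional ViP-DeepLab heads described above
# (channel sizes are illustrative, not taken from the paper).
import torch
import torch.nn as nn

class ExtraViPHeads(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        # Per-pixel depth regression for the first frame.
        self.depth_head = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
        )
        # Per-pixel (dx, dy) offsets from second-frame pixels to the object
        # centers detected in the first frame.
        self.next_frame_offset_head = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=1),
        )

    def forward(self, features):
        return self.depth_head(features), self.next_frame_offset_head(features)
```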

Results
We have tested ViP-DeepLab on multiple popular benchmarks, including Cityscapes-VPS, KITTI Depth Prediction, and KITTI Multi-Object Tracking and Segmentation (MOTS).

Specifically, ViP-DeepLab achieves state-of-the-art (SOTA) results, significantly outperforming previous methods by 5.1% video panoptic quality (VPQ) on the Cityscapes-VPS test set.

Method          VPQ (All)          VPQ (Things)       VPQ (Stuff)
VPSNet          57.4%              45.8%              64.8%
ViP-DeepLab     62.5% (+5.1%)      50.2% (+4.4%)      70.3% (+5.5%)
VPQ comparison on the Cityscapes-VPS test set.

ViP-DeepLab ranks 1st on the KITTI depth prediction benchmark, improving over previous methods by 0.65 SILog (the smaller the better).

Method          SILog      sqErrorRel      absErrorRel      iRMSE
PWA             11.45      2.30            9.05             12.32
ViP-DeepLab     10.80      2.19            8.94             11.77
Monocular depth estimation comparison on the KITTI Depth Prediction benchmark. Note that for the depth estimation metrics, smaller values indicate better performance. While the differences may appear small, the top-performing method on this benchmark usually has a gap in SILog smaller than 0.1.

Additionally, ViP-DeepLab was 1st on KITTI MOTS pedestrians and 3rd on KITTI MOTS cars ranked by the sMOTSA metric, and is now 3rd for both pedestrians and cars ranked by the newer HOTA metric.

Class           Method          HOTA
Car             PointTrack      62.0%
Car             ViP-DeepLab     76.4% (+14.4%)
Pedestrian      PointTrack      54.4%
Pedestrian      ViP-DeepLab     64.3% (+9.9%)
Performance comparison on KITTI Multi-Object Tracking and Segmentation.

Finally, we also present two new datasets for the new task, depth-aware video panoptic segmentation, and test ViP-DeepLab on them. We hope our ViP-DeepLab results on these two new datasets will serve as a strong baseline for the community to compare against. The results are shown below.

Dataset             DVPQ (All)      DVPQ (Things)      DVPQ (Stuff)
Cityscapes-DVPS     55.1%           43.3%              63.6%
SemKITTI-DVPS       45.6%           36.6%              52.2%
ViP-DeepLab performance for the task of depth-aware video panoptic segmentation on the two new datasets.

Conclusion
With a simple architecture, ViP-DeepLab achieves state-of-the-art performance on video panoptic segmentation, monocular depth estimation, and multi-object tracking and segmentation. We hope that along with MaX-DeepLab, which proposes an efficient dual-path transformer module that allows for end-to-end image panoptic segmentation, ViP-DeepLab is useful to the community and furthers research into a more holistic understanding of scenes in the real world.

Acknowledgements
We would like to thank the support and valuable discussions with Yukun Zhu, Hartwig Adam, and Alan Yuille (co-authors of ViP-DeepLab), as well as Maxwell Collins, and the Mobile Vision team.

Read More