A one-up on motion capture

From “Star Wars” to “Happy Feet,” many beloved films contain scenes that were made possible by motion capture technology, which records the movement of objects or people through video. Further, applications for this tracking, which involve complicated interactions between physics, geometry, and perception, extend beyond Hollywood to the military, sports training, medical fields, and computer vision and robotics, allowing engineers to understand and simulate action happening within real-world environments.

As this can be a complex and costly process — often requiring markers placed on objects or people and recording the action sequence — researchers are working to shift the burden to neural networks, which could acquire this data from a simple video and reproduce it in a model. Work in physics simulations and rendering shows promise to make this more widely used, since it can characterize realistic, continuous, dynamic motion from images and transform back and forth between a 2D render and a 3D scene in the world. However, to do so, current techniques require precise knowledge of the environmental conditions where the action is taking place, and the choice of renderer, both of which are often unavailable.

Now, a team of researchers from MIT and IBM has developed a trained neural network pipeline that avoids this issue, with the ability to infer the state of the environment and the actions happening, the physical characteristics of the object or person of interest (system), and its control parameters. When tested, the technique can outperform other methods in simulations of four physical systems of rigid and deformable bodies, which illustrate different types of dynamics and interactions, under various environmental conditions. Further, the methodology allows for imitation learning — predicting and reproducing the trajectory of a real-world, flying quadrotor from a video.

“The high-level research problem this paper deals with is how to reconstruct a digital twin from a video of a dynamic system,” says Tao Du PhD ’21, a postdoc in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a member of the research team. In order to do this, Du says, “we need to ignore the rendering variances from the video clips and try to grasp the core information about the dynamic system or the dynamic motion.”

Du’s co-authors include lead author Pingchuan Ma, a graduate student in EECS and a member of CSAIL; Josh Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of CSAIL; Wojciech Matusik, professor of electrical engineering and computer science and CSAIL member; and MIT-IBM Watson AI Lab principal research staff member Chuang Gan. This work was presented this week at the International Conference on Learning Representations.

While capturing videos of characters, robots, or dynamic systems to infer dynamic movement makes this information more accessible, it also brings a new challenge. “The images or videos [and how they are rendered] depend largely on the lighting conditions, on the background info, on the texture information, on the material information of your environment, and these are not necessarily measurable in a real-world scenario,” says Du. Without this rendering configuration information or knowledge of which renderer is used, it’s presently difficult to glean dynamic information and predict behavior of the subject of the video. Even if the renderer is known, current neural network approaches still require large sets of training data. However, with their new approach, this can become a moot point. “If you take a video of a leopard running in the morning and in the evening, of course, you’ll get visually different video clips because the lighting conditions are quite different. But what you really care about is the dynamic motion: the joint angles of the leopard — not if they look light or dark,” Du says.

In order to take rendering domains and image differences out of the issue, the team developed a pipeline system containing a neural network, dubbed the “rendering invariant state-prediction” (RISP) network. RISP transforms differences in images (pixels) to differences in states of the system — i.e., the environment of action — making their method generalizable and agnostic to rendering configurations. RISP is trained using random rendering parameters and states, which are fed into a differentiable renderer, a type of renderer that measures the sensitivity of pixels with respect to rendering configurations, e.g., lighting or material colors. This generates a set of varied images and video from known ground-truth parameters, which will later allow RISP to reverse that process, predicting the environment state from the input video. The team additionally minimized RISP’s rendering gradients, so that its predictions were less sensitive to changes in rendering configurations, allowing it to learn to forget about visual appearances and focus on learning dynamical states. This is made possible by a differentiable renderer.

The method then uses two similar pipelines, run in parallel. One is for the source domain, with known variables. Here, system parameters and actions are entered into a differentiable simulation. The generated simulation’s states are combined with different rendering configurations into a differentiable renderer to generate images, which are fed into RISP. RISP then outputs predictions about the environmental states. At the same time, a similar target-domain pipeline is run with unknown variables. The RISP network in this pipeline is fed the resulting output images, generating a predicted state. When the predicted states from the source and target domains are compared, a new loss is produced; this difference is used to adjust and optimize some of the parameters in the source domain pipeline. This process can then be iterated on, further reducing the loss between the pipelines.
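
The loop described above can be written down compactly. The following is a minimal, hypothetical PyTorch sketch, not the authors’ code: the simulator, renderer, and RISP network are tiny stand-ins, and only the structure of the loss and the update mirrors the pipeline described in this article.

import torch
import torch.nn as nn

# Stand-in RISP network: maps rendered images to a low-dimensional state estimate.
risp = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU(), nn.Linear(128, 6))

def simulate(params, actions):
    """Stand-in differentiable simulator: maps system parameters and actions to states."""
    return torch.tanh(params + actions)

def render(states, config):
    """Stand-in differentiable renderer: maps states (under a rendering config) to images."""
    return states.mean() + torch.zeros(1, 3, 64, 64)

# Source-domain quantities to optimize; the target video has unknown parameters.
params = torch.zeros(6, requires_grad=True)
actions = torch.zeros(6, requires_grad=True)
optimizer = torch.optim.Adam([params, actions], lr=1e-2)

target_video = torch.rand(1, 3, 64, 64)   # frames from the target-domain video
render_config = torch.rand(4)             # arbitrary source-domain rendering configuration

for step in range(100):
    optimizer.zero_grad()
    states = simulate(params, actions)            # differentiable simulation
    images = render(states, render_config)        # differentiable rendering
    pred_source = risp(images)                    # RISP state prediction, source domain
    pred_target = risp(target_video)              # RISP state prediction, target domain
    loss = ((pred_source - pred_target) ** 2).mean()
    loss.backward()                               # gradients flow back through renderer and simulator
    optimizer.step()                              # adjust source-domain parameters and actions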

To determine the success of their method, the team tested it in four simulated systems: a quadrotor (a flying rigid body that doesn’t have any physical contact), a cube (a rigid body that interacts with its environment, like a die), an articulated hand, and a rod (deformable body that can move like a snake). The tasks included estimating the state of a system from an image, identifying the system parameters and action control signals from a video, and discovering the control signals from a target image that direct the system to the desired state. Additionally, they created baselines and an oracle, comparing the novel RISP process in these systems to similar methods that, for example, lack the rendering gradient loss, don’t train a neural network with any loss, or lack the RISP neural network altogether. The team also looked at how the gradient loss impacted the state prediction model’s performance over time. Finally, the researchers deployed their RISP system to infer the motion of a real-world quadrotor, which has complex dynamics, from video. They compared the performance to other techniques that lacked a loss function and used pixel differences, or one that included manual tuning of a renderer’s configuration.

In nearly all of the experiments, the RISP procedure outperformed similar or state-of-the-art methods, imitating or reproducing the desired parameters or motion, and proving to be a data-efficient and generalizable competitor to current motion capture approaches.

For this work, the researchers made two important assumptions: that information about the camera, such as its position and settings, is known, and that the geometry and physics governing the object or person being tracked are also known. Future work is planned to relax these assumptions.

“I think the biggest problem we’re solving here is to reconstruct the information in one domain to another, without very expensive equipment,” says Ma. Such an approach should be “useful for [applications such as the] metaverse, which aims to reconstruct the physical world in a virtual environment,” adds Gan. “It is basically an everyday, available solution, that’s neat and simple, to cross domain reconstruction or the inverse dynamics problem,” says Ma.

This research was supported, in part, by the MIT-IBM Watson AI Lab, Nexplore, DARPA Machine Common Sense program, Office of Naval Research (ONR), ONR MURI, and Mitsubishi Electric.

Read More

Extracting Skill-Centric State Abstractions from Value Functions

Advances in reinforcement learning (RL) for robotics have enabled robotic agents to perform increasingly complex tasks in challenging environments. Recent results show that robots can learn to fold clothes, dexterously manipulate a Rubik’s Cube, sort objects by color, navigate complex environments and walk on difficult, uneven terrain. But “short-horizon” tasks such as these, which require very little long-term planning and provide immediate failure feedback, are relatively easy to train compared to many tasks that may confront a robot in a real-world setting. Unfortunately, scaling such short-horizon skills to the abstract, long horizons of real-world tasks is difficult. For example, how would one train a robot capable of picking up objects to rearrange a room?

Hierarchical reinforcement learning (HRL), a popular way of solving this problem, has achieved some success in a variety of long-horizon RL tasks. HRL aims to solve such problems by reasoning over a bank of low-level skills, thus providing an abstraction for actions. However, the high-level planning problem can be further simplified by abstracting both states and actions. For example, consider a tabletop rearrangement task, where a robot is tasked with interacting with objects on a desk. Using recent advances in RL, imitation learning, and unsupervised skill discovery, it is possible to obtain a set of primitive manipulation skills such as opening or closing drawers, picking or placing objects, etc. However, even for the simple task of putting a block into the drawer, chaining these skills together is not straightforward. This may be attributed to a combination of (i) challenges with planning and reasoning over long horizons, and (ii) dealing with high dimensional observations while parsing the semantics and affordances of the scene, i.e., where and when the skill can be used.

In “Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning”, presented at ICLR 2022, we address the task of learning suitable state and action abstractions for long-range problems. We posit that a minimal, but complete, representation for a higher-level policy in HRL must depend on the capabilities of the skills available to it. We present a simple mechanism to obtain such a representation using skill value functions and show that such an approach improves long-horizon performance in both model-based and model-free RL and enables better zero-shot generalization.

Our method, VFS, can compose low-level primitives (left) to learn complex long-horizon behaviors (right).

Building a Value Function Space
The key insight motivating this work is that the abstract representation of actions and states is readily available from trained policies via their value functions. The notion of “value” in RL is intrinsically linked to affordances, in that the value of a state for a given skill reflects the probability of receiving a reward for successfully executing the skill. For any skill, its value function captures two key properties: 1) the preconditions and affordances of the scene, i.e., where and when the skill can be used, and 2) the outcome, which indicates whether the skill executed successfully when it was used.

Given a decision process with a finite set of k skills trained with sparse outcome rewards and their corresponding value functions, we construct an embedding space by stacking these skill value functions. This gives us an abstract representation that maps a state to a k-dimensional vector, which we call the Value Function Space, or VFS for short. This representation captures functional information about the exhaustive set of interactions that the agent can have with the environment, and is thus a suitable state abstraction for downstream tasks.
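
As a minimal illustrative sketch (with toy, hand-written value functions standing in for the trained skill value functions), the construction itself is just a per-skill evaluation stacked into a vector:

import numpy as np

# Hypothetical per-skill value functions V_i(s); in practice these come from skills
# trained with sparse outcome rewards (e.g., Q-functions evaluated at the current state).
def make_vfs(value_fns):
    """Return a function mapping a state to its k-dimensional VFS embedding."""
    def embed(state):
        # Stack the k skill values: this vector is the Value Function Space representation.
        return np.array([v(state) for v in value_fns])
    return embed

# Toy example with k = 3 skills whose "values" are simple functions of the state.
value_fns = [
    lambda s: float(s["gripper_empty"]),    # e.g., "pick object" is afforded
    lambda s: float(s["drawer_open"]),      # e.g., "place in drawer" is afforded
    lambda s: float(not s["drawer_open"]),  # e.g., "open drawer" is afforded
]
vfs = make_vfs(value_fns)
print(vfs({"gripper_empty": True, "drawer_open": False}))  # -> [1. 0. 1.]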

Consider a toy example of the tabletop rearrangement setup discussed earlier, with the task of placing the blue object in the drawer. There are eight elementary actions in this environment. The bar plot on the right shows the values of each skill at any given time, and the graph at the bottom shows the evolution of these values over the course of the task.

Value functions corresponding to each skill (top-right; aggregated in bottom) capture functional information about the scene (top-left) and aid decision-making.

At the beginning, the values corresponding to the “Place on Counter” skill are high since the objects are already on the counter; likewise, the values corresponding to “Close Drawer” are high. Through the trajectory, when the robot picks up the blue cube, the corresponding skill value peaks. Similarly, the values corresponding to placing the objects in the drawer increase when the drawer is open and peak when the blue cube is placed inside it. All the functional information required to affect each transition and predict its outcome (success or failure) is captured by the VFS representation, and in principle, allows a high-level agent to reason over all the skills and chain them together — resulting in an effective representation of the observations.

Additionally, since VFS learns a skill-centric representation of the scene, it is robust to exogenous factors of variation, such as background distractors and appearances of task-irrelevant components of the scene. All configurations shown below are functionally equivalent — an open drawer with the blue cube in it, a red cube on the countertop, and an empty gripper — and can be interacted with identically, despite apparent differences.

The learned VFS representation can ignore task-irrelevant factors such as arm pose, distractor objects (green cube) and background appearance (brown desk).

Robotic Manipulation with VFS
This approach enables VFS to plan out complex robotic manipulation tasks. Take, for example, a model-based reinforcement learning (MBRL) algorithm that uses a one-step predictive model of the transition dynamics in value function space and randomly samples candidate skill sequences, selecting and executing the best one in a manner similar to model-predictive control. Given a set of primitive pushing skills of the form “move Object A near Object B” and a high-level rearrangement task, we find that VFS can use MBRL to reliably find skill sequences that solve the high-level task.
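
A sketch of that random-shooting planner follows, assuming a learned one-step dynamics model in VFS and a task reward defined directly on VFS embeddings; both are hypothetical stand-ins here, not the models used in the paper.

import numpy as np

rng = np.random.default_rng(0)
K = 8             # number of low-level skills
HORIZON = 4       # planning horizon (skills per candidate sequence)
N_CANDIDATES = 256

def dynamics(z, skill):
    """Hypothetical one-step predictive model in VFS (stand-in for a learned model)."""
    return np.clip(z + 0.1 * (np.eye(K)[skill] - z), 0.0, 1.0)

def reward(z, goal_z):
    """Task reward in VFS: negative distance to the goal embedding."""
    return -np.linalg.norm(z - goal_z)

def plan(z0, goal_z):
    """Sample random skill sequences, roll them out with the model, return the best first skill."""
    best_score, best_seq = -np.inf, None
    for _ in range(N_CANDIDATES):
        seq = rng.integers(K, size=HORIZON)
        z, score = z0, 0.0
        for skill in seq:
            z = dynamics(z, skill)
            score += reward(z, goal_z)
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq[0]   # execute only the first skill, then replan (MPC-style)

first_skill = plan(np.zeros(K), np.ones(K))   # index of the skill to execute next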

A rollout of VFS performing a tabletop rearrangement task using a robotic arm. VFS can reason over a sequence of low-level primitives to achieve the desired goal configuration.

To better understand the attributes of the environment captured by VFS, we sample the VFS-encoded observations from a large number of independent trajectories in the robotic manipulation task and project them into two dimensions using the t-SNE technique, which is useful for visualizing clusters in high-dimensional data. These t-SNE embeddings reveal interesting patterns identified and modeled by VFS. Looking at some of these clusters closely, we find that VFS can successfully capture information about the contents (objects) in the scene and affordances (e.g., a sponge can be manipulated when held by the robot’s gripper), while ignoring distractors like the relative positions of the objects on the table and the pose of the robotic arm. While these factors are certainly important to solve the task, the low-level primitives available to the robot abstract them away and hence make them functionally irrelevant to the high-level controller.
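
The projection step itself is standard; a short sketch with scikit-learn, assuming vfs_embeddings holds a batch of VFS-encoded observations collected from rollouts (the array here is a random placeholder):

import numpy as np
from sklearn.manifold import TSNE

# Hypothetical batch of VFS-encoded observations (n observations, k skill values each).
vfs_embeddings = np.random.rand(1000, 8)

# Project the k-dimensional embeddings to 2D to visualize emergent clusters.
points_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(vfs_embeddings)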

Visualizing the 2D t-SNE projections of VFS embeddings show emergent clustering of equivalent configurations of the environment while ignoring task-irrelevant factors like arm pose.

Conclusions and Connections to Future Work
Value function spaces are representations built on value functions of underlying skills, enabling long-horizon reasoning and planning over skills. VFS is a compact representation that captures the affordances of the scene and task-relevant information while robustly ignoring distractors. Empirical experiments reveal that such a representation improves planning for model-based and model-free methods and enables zero-shot generalization. Going forward, this representation has the promise to continue improving along with the field of multitask reinforcement learning. The interpretability of VFS further enables integration into fields such as safe planning and grounding language models.

Acknowledgements
We thank our co-authors Sergey Levine, Ted Xiao, Alex Toshev, Peng Xu and Yao Lu for their contributions to the paper and feedback on this blog post. We also thank Tom Small for creating the informative visualizations used in this blog post.

Read More

Designing Societally Beneficial Reinforcement Learning Systems

Deep reinforcement learning (DRL) is transitioning from a research field focused on game playing to a technology with real-world applications. Notable examples include DeepMind’s work on controlling a nuclear reactor or on improving YouTube video compression, or Tesla attempting to use a method inspired by MuZero for autonomous vehicle behavior planning. But the exciting potential for real-world applications of RL should also come with a healthy dose of caution: for example, RL policies are well known to be vulnerable to exploitation, and methods for safe and robust policy development are an active area of research.

Alongside the emergence of powerful RL systems in the real world, the public and researchers are expressing an increased appetite for fair, aligned, and safe machine learning systems. The focus of these research efforts to date has been to account for shortcomings of datasets or supervised learning practices that can harm individuals. However, the unique ability of RL systems to leverage temporal feedback in learning complicates the types of risks and safety concerns that can arise.

This post expands on our recent whitepaper and research paper, where we aim to illustrate the different modalities harms can take when augmented with the temporal axis of RL. To combat these novel societal risks, we also propose a new kind of documentation for dynamic machine learning systems that aims to assess and monitor these risks both before and after deployment.


Engineers use artificial intelligence to capture the complexity of breaking waves

Waves break once they swell to a critical height, before cresting and crashing into a spray of droplets and bubbles. These waves can be as large as a surfer’s point break and as small as a gentle ripple rolling to shore. For decades, the dynamics of how and when a wave breaks have been too complex to predict.

Now, MIT engineers have found a new way to model how waves break. The team used machine learning along with data from wave-tank experiments to tweak equations that have traditionally been used to predict wave behavior. Engineers typically rely on such equations to help them design resilient offshore platforms and structures. But until now, the equations have not been able to capture the complexity of breaking waves.

The updated model made more accurate predictions of how and when waves break, the researchers found. For instance, the model estimated a wave’s steepness just before breaking, and its energy and frequency after breaking, more accurately than the conventional wave equations.

Their results, published today in the journal Nature Communications, will help scientists understand how a breaking wave affects the water around it. Knowing precisely how these waves interact can help hone the design of offshore structures. It can also improve predictions for how the ocean interacts with the atmosphere. Having better estimates of how waves break can help scientists predict, for instance, how much carbon dioxide and other atmospheric gases the ocean can absorb.

“Wave breaking is what puts air into the ocean,” says study author Themis Sapsis, an associate professor of mechanical and ocean engineering and an affiliate of the Institute for Data, Systems, and Society at MIT. “It may sound like a detail, but if you multiply its effect over the area of the entire ocean, wave breaking starts becoming fundamentally important to climate prediction.”

The study’s co-authors include lead author and MIT postdoc Debbie Eeltink, Hubert Branger and Christopher Luneau of Aix-Marseille University, Amin Chabchoub of Kyoto University, Jerome Kasparian of the University of Geneva, and T.S. van den Bremer of Delft University of Technology.

Learning tank

To predict the dynamics of a breaking wave, scientists typically take one of two approaches: They either attempt to precisely simulate the wave at the scale of individual molecules of water and air, or they run experiments to try and characterize waves with actual measurements. The first approach is computationally expensive and difficult to simulate even over a small area; the second requires a huge amount of time to run enough experiments to yield statistically significant results.

The MIT team instead borrowed pieces from both approaches to develop a more efficient and accurate model using machine learning. The researchers started with a set of equations that is considered the standard description of wave behavior. They aimed to improve the model by “training” it on data of breaking waves from actual experiments.

“We had a simple model that doesn’t capture wave breaking, and then we had the truth, meaning experiments that involve wave breaking,” Eeltink explains. “Then we wanted to use machine learning to learn the difference between the two.”

The researchers obtained wave breaking data by running experiments in a 40-meter-long tank. The tank was fitted at one end with a paddle which the team used to initiate each wave. The team set the paddle to produce a breaking wave in the middle of the tank. Gauges along the length of the tank measured the water’s height as waves propagated down the tank.

“It takes a lot of time to run these experiments,” Eeltink says. “Between each experiment you have to wait for the water to completely calm down before you launch the next experiment, otherwise they influence each other.”

Safe harbor

In all, the team ran about 250 experiments, the data from which they used to train a type of machine-learning algorithm known as a neural network. Specifically, the algorithm is trained to compare the real waves in experiments with the predicted waves in the simple model, and based on any differences between the two, the algorithm tunes the model to fit reality.
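
Conceptually, that training step fits a correction on top of the simple model. The sketch below is a generic stand-in, not the authors’ architecture: the data arrays are hypothetical placeholders, and a small scikit-learn network learns the discrepancy between the simple model’s predictions and the tank measurements.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical arrays: the simple model's predicted wave quantities per experiment,
# and the corresponding quantities measured by the tank gauges.
model_prediction = np.random.rand(250, 4)     # simple-model output per experiment
tank_measurement = np.random.rand(250, 4)     # gauge measurements per experiment

# Train a small network to predict the discrepancy between model and experiment.
correction = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
correction.fit(model_prediction, tank_measurement - model_prediction)

# Corrected prediction = simple model + learned correction.
corrected = model_prediction + correction.predict(model_prediction)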

After training the algorithm on their experimental data, the team introduced the model to entirely new data — in this case, measurements from two independent experiments, each run at separate wave tanks with different dimensions. In these tests, they found the updated model made more accurate predictions than the simple, untrained model, for instance making better estimates of a breaking wave’s steepness.

The new model also captured an essential property of breaking waves known as the “downshift,” in which the frequency of a wave is shifted to a lower value. The speed of a wave depends on its frequency. For ocean waves, lower frequencies move faster than higher frequencies. Therefore, after the downshift, the wave will move faster. The new model predicts the change in frequency, before and after each breaking wave, which could be especially relevant in preparing for coastal storms.

“When you want to forecast when high waves of a swell would reach a harbor, and you want to leave the harbor before those waves arrive, then if you get the wave frequency wrong, then the speed at which the waves are approaching is wrong,” Eeltink says.

The team’s updated wave model is in the form of an open-source code that others could potentially use, for instance in climate simulations of the ocean’s potential to absorb carbon dioxide and other atmospheric gases. The code can also be worked into simulated tests of offshore platforms and coastal structures.

“The number one purpose of this model is to predict what a wave will do,” Sapsis says. “If you don’t model wave breaking right, it would have tremendous implications for how structures behave. With this, you could simulate waves to help design structures better, more efficiently, and without huge safety factors.”

This research is supported, in part, by the Swiss National Science Foundation, and by the U.S. Office of Naval Research.

Read More

Amazon Rekognition introduces Streaming Video Events to provide real-time alerts on live video streams

Today, AWS announced the general availability of Amazon Rekognition Streaming Video Events, a fully managed service for camera manufacturers and service providers that uses machine learning (ML) to detect objects such as people, pets, and packages in live video streams from connected cameras. The service sends a notification as soon as a desired object is detected in the live video stream.

With these event notifications, service providers can send timely and actionable smart alerts to their users, such as “Pet detected in the backyard”; enable home automation experiences, such as turning on garage lights when a person is detected; build custom in-app experiences, such as a smart search to find specific video events of packages without scrolling through hours of footage; or integrate these alerts with Echo devices for Alexa announcements, such as “A package was detected at the front door” when the doorbell detects a delivery person dropping off a package, all while keeping cost and latency low.

This post describes how camera manufacturers and security service providers can use Amazon Rekognition Streaming Video Events on live video streams to deliver actionable smart alerts to their users in real time.

Amazon Rekognition Streaming Video Events

Many camera manufacturers and security service providers offer home security solutions that include camera doorbells, indoor cameras, outdoor cameras, and value-added notification services to help their users understand what is happening on their property. Cameras with built-in motion detectors are placed at entry or exit points of the home to notify users of any activity in real time, such as “Motion detected in the backyard.” However, motion detectors are noisy: they can be set off by innocuous events like wind and rain, creating notification fatigue and resulting in clunky home automation setups. Building the right user experience for smart alerts, search, or even browsing video clips requires ML and automation that is hard to get right and can be expensive.

Amazon Rekognition Streaming Video Events lowers the costs of value-added video analytics by providing a low-cost, low-latency, fully managed ML service that can detect objects (such as people, pets, and packages) in real time on video streams from connected cameras. The service starts analyzing the video clip only when a motion event is triggered by the camera. When the desired object is detected, it sends a notification that includes the objects detected, bounding box coordinates, a zoomed-in image of the objects detected, and the timestamp. The Amazon Rekognition pre-trained APIs provide high accuracy even across varying lighting conditions, camera angles, and resolutions.

Customer success stories

Customers like Abode Systems and 3xLOGIC are using Amazon Rekognition Streaming Video Events to send relevant alerts to their users and minimize false alarms.

Abode Systems (Abode) offers homeowners a comprehensive suite of do-it-yourself home security solutions that can be set up in minutes and enables homeowners to keep their family and property safe. Since the company’s launch in 2015, in-camera motion detection sensors have played an essential part in Abode’s solution, enabling customers to receive notifications and monitor their homes from anywhere. Abode recognized that to offer its customers the best video stream smart notification experience, they needed highly accurate yet inexpensive and scalable streaming computer vision solutions that can detect objects and events of interest in real time. After weighing alternatives, Abode chose to pilot Amazon Rekognition Streaming Video Events. Within a matter of weeks, Abode was able to deploy a serverless, well-architected solution integrating tens of thousands of cameras. To learn more about Abode’s case study, see Abode uses Amazon Rekognition Streaming Video Events to provide real-time notifications to their smart home customers.

“We are always focused on making technology choices that provide value to our customers and enable rapid growth while keeping costs low. With Amazon Rekognition Streaming Video Events, we could launch person, pet, and package detection at a fraction of the cost of developing everything ourselves. Our smart home customers are notified in real time when Amazon Rekognition detects an object or activity of interest. This helps us filter out the noise and focus on what’s important to our customers – quality notifications.

“For us it was a no-brainer, we didn’t want to create and maintain a custom computer vision service. We turned to the experts on the Amazon Rekognition team. Amazon Rekognition Streaming Video Events APIs are accurate, scalable, and easy to incorporate into our systems. The integration powers our smart notification features, so instead of a customer receiving 100 notifications a day, every time the motion sensor is triggered, they receive just two or three smart notifications when there is an event of interest present in the video stream.”

– Scott Beck, Chief Technology Officer at Abode Systems.

3xLOGIC is a leader in commercial electronic security systems. They provide commercial security systems and managed video monitoring for businesses, hospitals, schools, and government agencies. Managed video monitoring is a critical component of a comprehensive security strategy for 3xLOGIC’s customers. With more than 50,000 active cameras in the field, video monitoring teams face a daily challenge of dealing with false alarms coming from in-camera motion detection sensors. These false notifications pose a challenge for operators because they must treat every notification as if it were an event of interest. 3xLOGIC wanted to improve their managed video monitoring product VIGIL CLOUD with intelligent video analytics and provide monitoring center operators with real-time smart notifications. To do this, 3xLOGIC used Amazon Rekognition Streaming Video Events. The service enables 3xLOGIC to analyze live video streams from connected cameras to detect the presence of individuals and filter out the noise from false notifications. To learn more about 3xLOGIC’s case study, see 3xLOGIC uses Amazon Rekognition Streaming Video Events to provide intelligent video analytics on live video streams to monitoring agents.

“Simply relying on motion detection sensors triggers several alarms that are not a security or safety risk when there is a lot of activity in a scene. By utilizing machine learning to filter out the vast majority of events, such as animals, shadows, moving vegetation, and more, we can dramatically reduce the workload of the security operators and improve their efficiency.”

– Ola Edman, Senior Director Global Video Development at 3xLOGIC.

“With over 50,000 active cameras in the field, many without the advanced analytics of newer and more expensive camera models, 3xLOGIC takes on the challenge of false alarms every day. Building, training, testing, and maintaining computer vision models is resource-intensive and has a huge learning curve. With Amazon Rekognition Streaming Video Events, we simply call the API and surface the results to our users. It has been very easy to use and the accuracy is impressive.”

– Charlie Erickson, CTO at 3xLOGIC.

How it works

Amazon Rekognition Streaming Video Events works with Amazon Kinesis Video Streams to detect objects from live video streams. This enables camera manufacturers and service providers to minimize false alerts from camera motion events by sending real-time notifications only when a desired object (such as a person, pet, or package) is detected in the video frame. The Amazon Rekognition streaming video APIs enable service providers to accurately alert on objects that are relevant for their customers, adjust the duration of the video to process per motion event, and even define specific areas within the frame that need to be analyzed.

Amazon Rekognition helps service providers protect their user data by automatically encrypting the data at rest using AWS Key Management Service (KMS) and in transit using the industry-standard Transport Layer Security (TLS) protocol.

Here’s how camera manufacturers and service providers can incorporate video analysis on live video streams:

  1. Integrate Kinesis Video Streams with Amazon Rekognition – Kinesis Video Streams allows camera manufacturers and service providers to easily and securely stream live video from devices such as video doorbells and indoor and outdoor cameras to AWS. It integrates seamlessly with new or existing Kinesis video streams to facilitate live video stream analysis.
  2. Specify video duration – Amazon Rekognition Streaming Video Events allows service providers to control how much video they need to process per motion event. They can specify the length of the video clips to be between 1–120 seconds (the default is 10 seconds). When motion is detected, Amazon Rekognition starts analyzing video from the relevant Kinesis video stream for the specified duration. This provides camera manufacturers and service providers with the flexibility to better manage their ML inference costs.
  3. Choose relevant objects – Amazon Rekognition Streaming Video Events provides the capability to choose one or more objects for detection in live video streams. This minimizes false alerts from camera motion events by sending notifications only when desired objects are detected in the video frame.
  4. Let Amazon Rekognition know where to send the notifications – Service providers can specify their Amazon Simple Notification Service (Amazon SNS) destination to send event notifications. When Amazon Rekognition starts processing the video stream, it sends a notification as soon as a desired object is detected. This notification includes the object detected, the bounding box, the timestamp, and a link to the specified Amazon Simple Storage Service (Amazon S3) bucket with the zoomed-in image of the object detected. They can then use this notification to send smart alerts to their users. (A minimal API sketch of this setup follows the list.)
  5. Send motion detection trigger notifications – Whenever a connected camera detects motion, the service provider sends a trigger to Amazon Rekognition to start processing the video streams. Amazon Rekognition processes the applicable Kinesis video stream for the specific objects for the defined duration. When the desired object is detected, Amazon Rekognition sends a notification to their private SNS topic.
  6. Integrate with Alexa or other voice assistants (optional) – Service providers can integrate these notifications with Alexa Smart Home skills to enable Alexa announcements for their users. Whenever Amazon Rekognition Streaming Video Events sends them a notification, they can send these notifications to Alexa to provide audio announcements from Echo devices, such as “Package detected at the front door.”
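
As a rough illustration of steps 1–4, the following boto3 sketch creates a stream processor attached to an existing Kinesis video stream. The names, ARNs, and thresholds are placeholders, and the exact parameter shapes should be confirmed against the developer guide; the per-event duration (step 2) is supplied when the processor is started on a motion trigger.

import boto3

rekognition = boto3.client("rekognition")

# One-time setup: attach a stream processor to an existing Kinesis video stream (step 1),
# choose the labels to detect (step 3), and tell Rekognition where to send results (step 4).
rekognition.create_stream_processor(
    Name="front-door-processor",  # placeholder processor name
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:region:account:stream/front-door/123"}},
    Output={"S3Destination": {"Bucket": "example-zoomed-in-images", "KeyPrefix": "front-door/"}},
    Settings={"ConnectedHome": {"Labels": ["PERSON", "PET", "PACKAGE"], "MinConfidence": 80}},
    NotificationChannel={"SNSTopicArn": "arn:aws:sns:region:account:rekognition-events"},
    RoleArn="arn:aws:iam::account:role/RekognitionStreamProcessorRole",
)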

To learn more, see Amazon Rekognition Streaming Video Events developer guide.

The following diagram illustrates Abode’s architecture with Amazon Rekognition Streaming Video Events.

The following diagram illustrates 3xLOGIC’s architecture with Amazon Rekognition Streaming Video Events.

Amazon Rekognition Streaming Video Events is generally available to AWS customers in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Mumbai) Regions, with availability in additional Regions in the coming months.

Conclusion

AWS customers such as Abode and 3xLOGIC are using Amazon Rekognition Streaming Video Events to innovate and add intelligent video analytics to their security solutions and modernize their offerings without having to invest in new hardware or develop and maintain custom computer vision analytics.

To get started with Amazon Rekognition Streaming Video Events, visit Amazon Rekognition Streaming Video Events.


About the Author

Prathyusha Cheruku is an AI/ML Computer Vision Principal Product Manager at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

Read More

3xLOGIC uses Amazon Rekognition Streaming Video Events to provide intelligent video analytics on live video streams to monitoring agents

3xLOGIC is a leader in commercial electronic security systems. They provide commercial security systems and managed video monitoring for businesses, hospitals, schools, and government agencies. Managed video monitoring is a critical component of a comprehensive security strategy for 3xLOGIC’s customers. With more than 50,000 active cameras in the field, video monitoring teams face a daily challenge of dealing with false alarms coming from in-camera motion detection sensors. These false notifications pose a challenge for operators because they must treat every notification as if it were an event of interest. This means that the operator must tap into the live video stream and potentially send personnel to the location for further investigation.

3xLOGIC wanted to improve their managed video monitoring product VIGIL CLOUD with intelligent video analytics and provide monitoring center operators with real-time smart notifications. To do this, 3xLOGIC used Amazon Rekognition Streaming Video Events, a low-latency, low-cost, scalable, managed computer vision service from AWS. The service enables 3xLOGIC to analyze live video streams from connected cameras to detect the presence of people and filter out the noise from false notifications. When a person is detected, the service sends a notification, which includes the object detected, a zoomed-in image of the object, bounding boxes, and timestamps, to monitoring center operators for further review.

“Simply relying on motion detection sensors triggers several alarms that are not a security or safety risk when there is a lot of activity in a scene. By utilizing machine learning to filter out the vast majority of events, such as animals, shadows, moving vegetation, and more, we can dramatically reduce the workload of the security operators and improve their efficiency.”

– Ola Edman, Senior Director Global Video Development at 3xLOGIC.

Video analytics with Amazon Rekognition Streaming Video Events

The challenge for managed video monitoring operators is that the more false notifications they receive, the more they get desensitized to the noise and the more likely they are to miss a critical notification. Providers like 3xLOGIC want agents to respond to notifications with the same urgency on the last alarm of their shift as they did on the first. The best way for that to happen is to simply filter out the noise from in-camera motion detection events.

3xLOGIC worked with AWS to develop and launch a multi-location pilot program that showed a significant decrease in false alarms. The following diagram illustrates 3xLOGIC’s integration with Amazon Rekognition Streaming Video Events.

When a 3xLOGIC camera detects motion, it starts streaming video to Amazon Kinesis Video Streams and calls an API to trigger Amazon Rekognition to start analyzing the video stream. When Amazon Rekognition detects a person in the video stream, it sends an event to Amazon Simple Notification Service (Amazon SNS), which notifies a video monitoring agent of the event. Amazon Rekognition provides out-of-the-box notifications, which include zoomed-in images of the people, bounding boxes, labels, and timestamps of the event. Monitoring agents use these notifications in concert with live camera views to evaluate the event and take appropriate action. To learn more about Amazon Rekognition Streaming Video Events, refer to the Amazon Rekognition Developer guide.
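
Downstream of Amazon Rekognition, the SNS notification can be consumed by, for example, an AWS Lambda function that surfaces alerts to monitoring agents. The sketch below is hypothetical: the label field names are illustrative only, and the exact notification schema is documented in the developer guide.

import json

def lambda_handler(event, context):
    """Hypothetical Lambda handler: fan Rekognition label notifications out to monitoring agents."""
    alerts = []
    for record in event["Records"]:                      # standard SNS-to-Lambda event shape
        message = json.loads(record["Sns"]["Message"])   # Rekognition notification payload (JSON)
        # Field names below are illustrative; see the Amazon Rekognition developer guide
        # for the exact notification schema.
        for label in message.get("labels", []):
            alerts.append({
                "label": label.get("name"),              # e.g., "PERSON"
                "image": label.get("frameImageUri"),     # link to the zoomed-in image in S3
                "timestamp": label.get("timestamp"),     # when the label appeared in the stream
            })
    # Hand the structured alerts to the monitoring console or notification service here.
    return {"alerts": alerts}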

“With over 50,000 active cameras in the field, many without the advanced analytics of newer and more expensive camera models, 3xLOGIC takes on the challenge of false alarms every day. Building, training, testing, and maintaining computer vision models is resource-intensive and has a huge learning curve. With Amazon Rekognition Streaming Video Events, we simply call the API and surface the results to our users. It has been very easy to use and the accuracy is impressive.”

– Charlie Erickson, CTO at 3xLOGIC Products and Solutions.

Conclusion

The managed video monitoring market requires an in-depth understanding of the variety of security risks that firms face. It also requires that you keep up with the latest technology, regulations, and best practices. By partnering with AWS, providers like 3xLOGIC are innovating and adding intelligent video analytics to their security solutions and modernizing their offerings without having to invest in new hardware or develop and maintain custom computer vision analytics.

To get started with Amazon Rekognition Streaming Video Events, visit Amazon Rekognition Streaming Video Events.


About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing Frisbee with his dog Max.

Prathyusha Cheruku is a Principal Product Manager for AI/ML Computer Vision at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

David Robo is a Principal WW GTM Specialist for AI/ML Computer Vision at Amazon Web Services. In this role, David works with customers and partners throughout the world who are building innovative video-based devices, products, and services. Outside of work, David has a passion for the outdoors and carving lines on waves and snow.

Read More

Abode uses Amazon Rekognition Streaming Video Events to provide real-time notifications to their smart home customers

Abode Systems (Abode) offers homeowners a comprehensive suite of do-it-yourself home security solutions that can be set up in minutes and enables homeowners to keep their family and property safe. Since the company’s launch in 2015, in-camera motion detection sensors have played an essential part in Abode’s solution, enabling customers to receive notifications and monitor their homes from anywhere. The challenge with in-camera-based motion detection is that a large percentage (up to 90%) of notifications are triggered from insignificant events like wind, rain, or passing cars. Abode wanted to overcome this challenge and provide their customers with highly accurate smart notifications.

Abode has been an AWS user since 2015, taking advantage of multiple AWS services for storage, compute, database, IoT, and video streaming for its solutions. Abode reached out to AWS to understand how they could use AWS computer vision services to build smart notifications into their home security solution for their customers. After evaluating their options, Abode chose to use Amazon Rekognition Streaming Video Events, a low-cost, low-latency, fully managed AI service that can detect objects such as people, pets, and packages in real time on video streams from connected cameras.

“We are always focused on making technology choices that provide value to our customers and enable rapid growth while keeping costs low. With Amazon Rekognition Streaming Video Events, we could launch person, pet, and package detection at a fraction of the cost of developing everything ourselves.”

– Scott Beck, Chief Technology Officer at Abode Systems.

Smart notifications for the connected home market segment

Abode recognized that to offer its customers the best video stream smart notification experience, they needed highly accurate yet inexpensive and scalable streaming computer vision solutions that can detect objects and events of interest in real time. After weighing alternatives, Abode leaned on their relationship with AWS to pilot Amazon Rekognition Streaming Video Events. Within a matter of weeks, Abode was able to deploy a serverless, well-architected solution integrating tens of thousands of cameras.

“Every time a camera detects motion, we stream video to Amazon Kinesis Video Streams and trigger Amazon Rekognition Streaming Video Events APIs to detect if there truly was a person, pet, or package in the stream,” Beck says. “Our smart home customers are notified in real time when Amazon Rekognition detects an object or activity of interest. This helps us filter out the noise and focus on what’s important to our customers – quality notifications.”

Amazon Rekognition Streaming Video Events

Amazon Rekognition Streaming Video Events detects objects and events in video streams and returns the labels detected, bounding box coordinates, zoomed-in images of the object detected, and timestamps. With this service, companies like Abode can deliver timely and actionable smart notifications only when a desired label such as a person, pet, or package is detected in the video frame. For more information, refer to the Amazon Rekognition Streaming Video Events Developer Guide.

“For us it was a no-brainer, we didn’t want to create and maintain a custom computer vision service,” Beck says. “We turned to the experts on the Amazon Rekognition team. Amazon Rekognition Streaming Video Events APIs are accurate, scalable, and easy to incorporate into our systems. The integration powers our smart notification features, so instead of a customer receiving 100 notifications a day, every time the motion sensor is triggered, they receive just two or three smart notifications when there is an event of interest present in the video stream.”

Solution overview

Abode’s goal was to improve accuracy and usefulness of camera-based motion detection notifications to their customers by providing highly accurate label detection using their existing camera technology. This meant that Abode’s customers wouldn’t have to buy additional hardware to take advantage of new features, and Abode wouldn’t have to develop and maintain a bespoke solution. The following diagram illustrates Abode’s integration with Amazon Rekognition Streaming Video Events.

The solution consists of the following steps:

  1. Integrate Amazon Kinesis Video Streams with Amazon Rekognition – Abode was already using Amazon Kinesis Video Streams to easily stream live video from devices such as video doorbells and indoor and outdoor cameras to AWS. They simply integrated Kinesis Video Streams with Amazon Rekognition to facilitate live video stream analysis.
  2. Specify video duration – With Amazon Rekognition, Abode can control how much video needs to be processed per motion event. Amazon Rekognition allows you to specify the length of the video clips to be between 0–120 seconds (the default is 10 seconds) per motion event. When motion is detected, Amazon Rekognition starts analyzing video from the relevant Kinesis video stream for the specific duration. This allows Abode the flexibility to better manage their machine learning (ML) inference costs.
  3. Choose relevant labels – With Amazon Rekognition, customers like Abode can choose one or more labels for detection in live video streams. This minimizes false alerts from camera motion events by sending notifications only when desired objects are detected in the video frame. Abode opted for person, pet, and package detection.
  4. Let Amazon Rekognition know where to send the notifications – When Amazon Rekognition starts processing the video stream, it sends a notification as soon as a desired object is detected to the Amazon Simple Notification Service (Amazon SNS) destination configured by Abode. This notification includes the object detected, the bounding box, the timestamp, and a link to Abode’s specified Amazon Simple Storage Service (Amazon S3) bucket with the zoomed-in image of the object detected. Abode then uses this information to send relevant smart alerts to the homeowner, such as “A package has been detected at 12:53pm” or “A pet detected in the backyard.”
  5. Send motion detection trigger notifications – Whenever the smart camera detects motion, Abode sends a trigger to Amazon Rekognition to start processing the video streams (sketched after this list). Amazon Rekognition processes the applicable Kinesis video stream for the specific objects and the duration defined. When the desired object is detected, Amazon Rekognition sends a notification to Abode’s private SNS topic.
  6. Integrate with Alexa or other voice assistants (optional) – Abode also integrated these notifications with Alexa Smart Home skills to enable Alexa announcements for their users. Whenever they receive a notification from Amazon Rekognition Streaming Video Events, Abode sends these notifications to Alexa to provide audio announcements from Echo devices, such as “Package detected at the front door.”
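
Step 5, the motion trigger, corresponds to starting the stream processor for a bounded analysis window. A minimal boto3 sketch follows; the processor name, timestamp handling, and duration are placeholders to verify against the developer guide.

import time

import boto3

rekognition = boto3.client("rekognition")

def on_motion_detected(processor_name="front-door-processor", duration_seconds=10):
    """Called whenever the camera reports motion: analyze the live stream for a bounded window."""
    rekognition.start_stream_processor(
        Name=processor_name,
        # Producer timestamp in milliseconds, approximating the moment motion was detected.
        StartSelector={"KVSStreamStartSelector": {"ProducerTimestamp": int(time.time() * 1000)}},
        StopSelector={"MaxDurationInSeconds": duration_seconds},
    )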

Conclusion

The connected home security market segment is dynamic and evolving, driven by consumers’ increased need for security, convenience, and entertainment. AWS customers like Abode are innovating and adding new ML capabilities to their smart home security solutions for their consumers. The proliferation of camera and streaming video technology is just beginning, and managed computer vision services like Amazon Rekognition Streaming Video Events are paving the way for new smart video streaming capabilities in the home automation market.

To learn more, check out Amazon Rekognition Streaming Video Events and its developer guide.


About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing Frisbee with his dog Max.

Prathyusha Cheruku is a Principal Product Manager for AI/ML Computer Vision at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

David Robo is a Principal WW GTM Specialist for AI/ML Computer Vision at Amazon Web Services. In this role, David works with customers and partners throughout the world who are building innovative video-based devices, products, and services. Outside of work, David has a passion for the outdoors and carving lines on waves and snow.

Read More

Pandas user-defined functions are now available in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can select and query data with just a few clicks, quickly transform data with over 300 built-in data transformations, and understand your data with built-in visualizations without writing any code.

Additionally, you can create custom transforms unique to your requirements. Custom transforms allow you to write custom transformations using PySpark, Pandas, or SQL.

Data Wrangler now supports a custom Pandas user-defined function (UDF) transform that can process large datasets efficiently. You can choose from two custom Pandas UDF modes: Pandas and Python. Both modes provide an efficient solution to process datasets, and the mode you choose depends on your preference.

In this post, we demonstrate how to use the new Pandas UDF transform in either mode.

Solution overview

At the time of this writing, you can import datasets into Data Wrangler from Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Databricks, and Snowflake. For this post, we use Amazon S3 to store the 2014 Amazon reviews dataset.

The data has a column called reviewText containing user-generated text. The text also contains several stop words, which are common words that don’t provide much information, such as “a,” “an,” and “the.” Removal of stop words is a common preprocessing step in natural language processing (NLP) pipelines. We can create a custom function to remove the stop words from the reviews.

Create a custom Pandas UDF transform

Let’s walk through the process of creating two Data Wrangler custom Pandas UDF transforms using Pandas and Python modes.

  1. Download the Digital Music reviews dataset and upload it to Amazon S3.
  2. Open Amazon SageMaker Studio and create a new Data Wrangler flow.
  3. Under Import data, choose Amazon S3 and navigate to the dataset location.
  4. For File type, choose jsonl.

A preview of the data should be displayed in the table.

  5. Choose Import to proceed.
  6. After your data is imported, choose the plus sign next to Data types and choose Add transform.
  7. Choose Custom transform.
  8. On the drop-down menu, choose Python (User-Defined Function).

Now we create our custom transform to remove stop words.

  9. Specify your input column, output column, return type, and mode.

The following example uses Pandas mode. This means the function should accept and return a Pandas series of the same length. You can think of a Pandas series as a column in a table or a chunk of the column. This is the most performant Pandas UDF mode because Pandas can vectorize operations across batches of values as opposed to one at a time. The pd.Series type hints are required in Pandas mode.

import pandas as pd
from sklearn.feature_extraction import text

# Input: the quick brown fox jumped over the lazy dog
# Output: quick brown fox jumped lazy dog
def remove_stopwords(series: pd.Series) -> pd.Series:
  """Removes stop words from the given string."""
  
  # Replace nulls with empty strings and lowercase to match stop words case
  series = series.fillna("").str.lower()
  tokens = series.str.split()
  
  # Remove stop words from each entry of series
  tokens = tokens.apply(lambda t: [token for token in t 
                                   if token not in text.ENGLISH_STOP_WORDS])
  
  # Joins the filtered tokens by spaces
  return tokens.str.join(" ")

If you prefer to use pure Python as opposed to the Pandas API, Python mode allows you to specify a pure Python function that accepts a single argument and returns a single value. The following example is equivalent to the preceding Pandas code in terms of output. Type hints are not required in Python mode.

from sklearn.feature_extraction import text

def remove_stopwords(value: str) -> str:
  if not value:
    return ""
  
  tokens = value.lower().split()
  tokens = [token for token in tokens 
            if token not in text.ENGLISH_STOP_WORDS]
  return " ".join(tokens)

  10. Choose Add to add your custom transform.

Conclusion

Data Wrangler has over 300 built-in transforms, and you can also add custom transformations unique to your requirements. In this post, we demonstrated how to process datasets with Data Wrangler’s new custom Pandas UDF transform, using both Pandas and Python modes. You can use either mode based on your preference. To learn more about Data Wrangler, refer to Create and Use a Data Wrangler Flow.


About the Authors

Ben Harris is a software engineer with experience designing, deploying, and maintaining scalable data pipelines and machine learning solutions across a variety of domains. Ben has built systems for data collection and labeling, image and text classification, sequence-to-sequence modeling, embedding, and clustering, among others.

Haider Naqvi is a Solutions Architect at AWS. He has extensive Software Development and Enterprise Architecture experience. He focuses on enabling customers to achieve business outcomes with AWS. He is based out of New York.

Vishal Srivastava is a Technical Account Manager at AWS. With a background in Software Development and Analytics, he primarily works with financial services sector and digital native business customers and supports their cloud journey. In his free time, he loves to travel with his family.

Read More