Introducing Omnimattes: A New Approach to Matte Generation using Layered Neural Rendering

Introducing Omnimattes: A New Approach to Matte Generation using Layered Neural Rendering

Posted by Forrester Cole, Software Engineer and Tali Dekel, Research Scientist

Image and video editing operations often rely on accurate mattes — images that define a separation between foreground and background. While recent computer vision techniques can produce high-quality mattes for natural images and videos, allowing real-world applications such as generating synthetic depth-of-field, editing and synthesising images, or removing backgrounds from images, one fundamental piece is missing: the various scene effects that the subject may generate, like shadows, reflections, or smoke, are typically overlooked.

In “Omnimatte: Associating Objects and Their Effects in Video”, presented at CVPR 2021, we describe a new approach to matte generation that leverages layered neural rendering to separate a video into layers called omnimattes that include not only the subjects but also all of the effects related to them in the scene. Whereas a typical state-of-the-art segmentation model extracts masks for the subjects in a scene, for example, a person and a dog, the method proposed here can isolate and extract additional details associated with the subjects, such as shadows cast on the ground.

A state-of-the-art segmentation network (e.g., MaskRCNN) takes an input video (left) and produces plausible masks for people and animals (middle), but misses their associated effects. Our method produces mattes that include not only the subjects, but their shadows as well (right; individual channels for person and dog visualized as blue and green).

Also unlike segmentation masks, omnimattes can capture partially-transparent, soft effects such as reflections, splashes, or tire smoke. Like conventional mattes, omnimattes are RGBA images that can be manipulated using widely-available image or video editing tools, and can be used wherever conventional mattes are used, for example, to insert text into a video underneath a smoke trail.

Layered Decomposition of Video
To generate omnimattes, we split the input video into a set of layers: one for each moving subject, and one additional layer for stationary background objects. In the example below, there is one layer for the person, one for the dog, and one for the background. When merged together using conventional alpha blending, these layers reproduce the input video.

Besides reproducing the video, the decomposition must capture the correct effects in each layer. For example, if the person’s shadow appears in the dog’s layer, the merged layers would still reproduce the input video, but inserting an additional element between the person and dog would produce an obvious error. The challenge is to find a decomposition where each subject’s layer captures only that subject’s effects, producing a true omnimatte.

Our solution is to apply our previously developed layered neural rendering approach to train a convolutional neural network (CNN) to map the subject’s segmentation mask and a background noise image into an omnimatte. Due to their structure, CNNs are naturally inclined to learn correlations between image effects, and the stronger the correlation between the effects, the easier for the CNN to learn. In the above video, for example, the spatial relationships between the person and their shadow, and the dog and its shadow, remain similar as they walk from right to left. The relationships change more (hence, the correlations are weaker) between the person and the dog’s shadow, or the dog and the person’s shadow. The CNN learns the stronger correlations first, leading to the correct decomposition.

The omnimatte system is shown in detail below. In a preprocess, the user chooses the subjects and specifies a layer for each. A segmentation mask for each subject is extracted using an off-the-shelf segmentation network, such as MaskRCNN, and camera transformations relative to the background are found using standard camera stabilization tools. A random noise image is defined in the background reference frame and sampled using the camera transformations to produce per-frame noise images. The noise images provide image features that are random but consistently track the background over time, providing a natural input for the CNN to learn to reconstruct the background colors.

The rendering CNN takes as input the segmentation mask and the per-frame noise images and produces the RGB color images and alpha maps, which capture the transparency of each layer. These outputs are merged using conventional alpha-blending to produce the output frame. The CNN is trained from scratch to reconstruct the input frames by finding and associating the effects not captured in a mask (e.g., shadows, reflections or smoke) with the given foreground layer, and to ensure the subject’s alpha roughly includes the segmentation mask. To make sure the foreground layers only capture the foreground elements and none of the stationary background, a sparsity loss is also applied on the foreground alpha.

A new rendering network is trained for each video. Because the network is only required to reconstruct the single input video, it is able to capture fine structures and fast motion in addition to separating the effects of each subject, as seen below. In the walking example, the omnimatte includes the shadow cast on the slats of the park bench. In the tennis example, the thin shadow and even the tennis ball are captured. In the soccer example, the shadow of the player and the ball are decomposed into their proper layers (with a slight error when the player’s foot is occluded by the ball).

This basic model already works well, but one can improve the results by augmenting the input of the CNN with additional buffers such as optical flow or texture coordinates.

Once the omnimattes are generated, how can they be used? As shown above, we can remove objects, simply by removing their layer from the composition. We can also duplicate objects, by repeating their layer in the composition. In the example below, the video has been “unwrapped” into a panorama, and the horse duplicated several times to produce a stroboscopic photograph effect. Note that the shadow that the horse casts on the ground and onto the obstacle is correctly captured.

A more subtle, but powerful application is to retime the subjects. Manipulation of time is widely used in film, but usually requires separate shots for each subject and a controlled filming environment. A decomposition into omnimattes makes retiming effects possible for everyday videos using only post-processing, simply by independently changing the playback rate of each layer. Since the omnimattes are standard RGBA images, this retiming edit can be done using conventional video editing software.

The video below is decomposed into three layers, one for each child. The children’s initial, unsynchronized jumps are aligned by simply adjusting the playback rate of their layers, producing realistic retiming for the splashes and reflections in the water.

In the original video (left), each child jumps at a different time. After editing (right), everyone jumps together.

It’s important to consider that any novel technique for manipulating images should be developed and applied responsibly, as it could be misused to produce fake or misleading information. Our technique was developed in accordance with our AI Principles and only allows rearrangement of content already present in the video, but even simple rearrangement can significantly alter the effect of a video, as shown in these examples. Researchers should be aware of these risks.

Future Work
There are a number of exciting directions to improve the quality of the omnimattes. On a practical level, this system currently only supports backgrounds that can be modeled as panoramas, where the position of the camera is fixed. When the camera position moves, the panorama model cannot accurately capture the entire background, and some background elements may clutter the foreground layers (sometimes visible in the above figures). Handling fully general camera motion, such as walking through a room or down a street, would require a 3D background model. Reconstruction of 3D scenes in the presence of moving objects and effects is still a difficult research challenge, but one that has seen promising recent progress.

On a theoretical level, the ability of CNNs to learn correlations is powerful, but still somewhat mysterious, and does not always lead to the expected layer decomposition. While our system allows for manual editing when the automatic result is imperfect, a better solution would be to fully understand the capabilities and limitations of CNNs to learn image correlations. Such an understanding could lead to improved denoising, inpainting, and many other video editing applications besides layer decomposition.

Erika Lu, from the University of Oxford, developed the omnimatte system during two internships at Google, in collaboration with Google researchers Forrester Cole, Tali Dekel, Michael Rubinstein, William T. Freeman and David Salesin, and University of Oxford researchers Weidi Xie and Andrew Zisserman.

Thank you to the friends and families of the authors who agreed to appear in the example videos. The “horse jump low”, “lucia”, and “tennis” videos are from the DAVIS 2016 dataset. The soccer video is used by permission from Online Soccer Skills. The car drift video was licensed from Shutterstock.

Read More

Facebook Head of Privacy Research Liz Keneski discusses new research award opportunity

Last year, we offered a first-ever research award opportunity for academics centered around experiences and expectations with digital privacy. We received great responses from the global community, and we’re pleased to continue this initiative again with the launch of a new request for proposals in the same space.

View RFP

To learn more about this new RFP, we reached out to Liz Keneski, Head of Privacy Research at Facebook. In this Q&A, Keneski discusses her team, the impact of collaborating with academia, and what this year’s RFP is about.

Q: What is your role at Facebook, and what does your team do?

Liz Keneski: I have the privilege to lead and support the Privacy Research team. We study privacy from a number of lenses, including understanding people’s privacy needs across topics like content controls, accessing and downloading their data, and privacy settings and data permissions. Because Facebook’s global user base is so diverse, we must consider differences in privacy needs across cultures and populations in our research.

Q: What was last year’s RFP about? What’s the goal of this year’s RFP?

LK: Both this year’s and last year’s RFP are about expanding research on two vital topics for the advancement of privacy science: privacy measurement and inclusive privacy. In order to continue to improve our privacy features, we need fine-grain survey instruments to best understand how privacy expectations and attitudes change over time in response to the changes we make. Further, to be able to build privacy controls and transparency inclusively for people all over the world, we need to deeply understand the unique privacy needs of different populations.

Q: How does this RFP fit into the bigger picture for privacy work at Facebook? Why is it important to collaborate with academia?

LK: We’re aiming to grow the body of knowledge in privacy science as a whole. This means supporting high-quality privacy research — both within and outside Facebook — that we can all learn from as a broader privacy community. Academics are often able to dive into certain topics or areas deeply over the course of many years, and we want to apply that expertise to the way we build privacy products for people.

Q: Where can people stay updated and learn more?

LK: Visit the Security and Privacy research page to learn more and find current publications and team member profiles. Our Research Awards page features our open research award opportunities. To receive email notifications about our new research awards and proposal deadlines, subscribe to our email newsletter.

The post Facebook Head of Privacy Research Liz Keneski discusses new research award opportunity appeared first on Facebook Research.

Read More

How Computational Graphs are Constructed in PyTorch

How Computational Graphs are Constructed in PyTorch

In the previous post we went over the theoretical foundations of automatic differentiation and reviewed the implementation in PyTorch. In this post, we will be showing the parts of PyTorch involved in creating the graph and executing it. In order to understand the following contents, please read @ezyang’s wonderful blog post about PyTorch internals.

Autograd components

First of all, let’s look at where the different components of autograd live:

tools/autograd: Here we can find the definition of the derivatives as we saw in the previous post derivatives.yaml, several python scripts and a folder called templates. These scripts and the templates are used at building time to generate the C++ code for the derivatives as specified in the yaml file. Also, the scripts here generate wrappers for the regular ATen functions so that the computational graph can be constructed.

torch/autograd: This folder is where the autograd components that can be used directly from python are located. In we find the actual definition of torch.autograd.Function, a class used by users to write their own differentiable functions in python as per the documentation. holds components for functionally computing the jacobian vector product, hessian, and other gradient related computations of a given function.
The rest of the files have additional components such as gradient checkers, anomaly detection, and the autograd profiler.

torch/csrc/autograd: This is where the graph creation and execution-related code lives.
All this code is written in C++, since it is a critical part that is required to be extremely performant. Here we have several files that implement the engine, metadata storage, and all the needed components. Alongside this, we have several files whose names start with python_, and their main responsibility is to allow python objects to be used in the autograd engine.

Graph Creation

Previously, we described the creation of a computational graph. Now, we will see how PyTorch creates these graphs with references to the actual codebase.

Figure 1: Example of an augmented computational graph

It all starts when in our python code, where we request a tensor to require the gradient.

>>> x = torch.tensor([0.5, 0.75], requires_grad=True)

When the required_grad flag is set in tensor creation, c10 will allocate an AutogradMeta object that is used to hold the graph information.

void TensorImpl::set_requires_grad(bool requires_grad) {
  if (!autograd_meta_)
    autograd_meta_ = impl::GetAutogradMetaFactory()->make();
    autograd_meta_->set_requires_grad(requires_grad, this);

The AutogradMeta object is defined in torch/csrc/autograd/variable.h as follows:

struct TORCH_API AutogradMeta : public c10::AutogradMetaInterface {
  std::string name_;

  Variable grad_;
  std::shared_ptr<Node> grad_fn_;
  std::weak_ptr<Node> grad_accumulator_;
  // other fields and methods

The most important fields in this structure are the computed gradient in grad_ and a pointer to the function grad_fn that will be called by the engine to produce the actual gradient. Also, there is a gradient accumulator object that is used to add together all the different gradients where this tensor is involved as we will see in the graph execution.

Graphs, Nodes and Edges.

Now, when we call a differentiable function that takes this tensor as an argument, the associated metadata will be populated. Let’s suppose that we call a regular torch function that is implemented in ATen. Let it be the multiplication as in our previous blog post example. The resulting tensor has a field called grad_fn that is essentially a pointer to the function that will be used to compute the gradient of that operation.

>>> x = torch.tensor([0.5, 0.75], requires_grad=True)
>>> v = x[0] * x[1]
>>> v
tensor(0.3750, grad_fn=<MulBackward0>)

Here we see that the tensors’ grad_fn has a MulBackward0 value. This function is the same that was written in the derivatives.yaml file, and its C++ code was generated automatically by all the scripts in tools/autograd. It’s auto-generated source code can be seen in torch/csrc/autograd/generated/Functions.cpp.

variable_list MulBackward0::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);

  IndexRangeGenerator gen;
  auto self_ix = gen.range(1);
  auto other_ix = gen.range(1);
  variable_list grad_inputs(gen.size());
  auto& grad = grads[0];
  auto self = self_.unpack();
  auto other = other_.unpack();
  bool any_grad_defined = any_variable_defined(grads);
  if (should_compute_output({ other_ix })) {
    auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, self, other_scalar_type)) : Tensor();
    copy_range(grad_inputs, other_ix, grad_result);
  if (should_compute_output({ self_ix })) {
    auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, other, self_scalar_type)) : Tensor();
    copy_range(grad_inputs, self_ix, grad_result);
  return grad_inputs;

The grad_fn objects inherit from the TraceableFunction class, a descendant of Node with just a property set to enable tracing for debugging and optimization purposes. A graph by definition has nodes and edges, so these functions are indeed the nodes of the computational graph that are linked together by using Edge objects to enable the graph traversal later on.

The Node definition can be found in the torch/csrc/autograd/function.h file.

struct TORCH_API Node : std::enable_shared_from_this<Node> {
 /// Evaluates the function on the given inputs and returns the result of the
  /// function call.
  variable_list operator()(variable_list&& inputs) {

  /// Performs the `Node`'s actual operation.
  virtual variable_list apply(variable_list&& inputs) = 0;
  edge_list next_edges_;

Essentially we see that it has an override of the operator () that performs the call to the actual function, and a pure virtual function called apply. The automatically generated functions override this apply method as we saw in the MulBackward0 example above. Finally, the node also has a list of edges to enable graph connectivity.

The Edge object is used to link Nodes together and its implementation is straightforward.

struct Edge {
  /// The function this `Edge` points to.
  std::shared_ptr<Node> function;
  /// The identifier of a particular input to the function.
  uint32_t input_nr;

It only requires a function pointer (the actual grad_fn objects that the edges link together), and an input number that acts as an id for the edge.

Linking nodes together

When we invoke the product operation of two tensors, we enter into the realm of autogenerated code. All the scripts that we saw in tools/autograd fill a series of templates that wrap the differentiable functions in ATen. These functions have code to construct the backward graph during the forward pass.

The script is in charge of writing all this wrapping code. This script is called from the tools/autograd/ during the pytorch build process and it will output the automatically generated function wrappers to torch/csrc/autograd/generated/.

Let’s take a look at how the tensor multiplication generated function looks like. The code has been simplified, but it can be found in the torch/csrc/autograd/generated/VariableType_4.cpp file when compiling pytorch from source.

at::Tensor mul_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, const at::Tensor & other) {
  auto _any_requires_grad = compute_requires_grad( self, other );
  std::shared_ptr<MulBackward0> grad_fn;
  if (_any_requires_grad) {
    // Creates the link to the actual grad_fn and links the graph for backward traversal
    grad_fn = std::shared_ptr<MulBackward0>(new MulBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
  // Does the actual function call to ATen
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::mul(ks & c10::after_autograd_keyset, self_, other_);

  auto result = std::move(_tmp);
    if (grad_fn) {
       // Connects the result to the graph
      set_history(flatten_tensor_args( result ), grad_fn);
  return result;

Let’s walk through the most important lines of this code.
First of all, the grad_fn object is created with: ` grad_fn = std::shared_ptr(new MulBackward0(), deleteNode);`.

After the grad_fn object is created, the edges used to link the nodes together are created by using the grad_fn->set_next_edges(collect_next_edges( self, other )); calls.

struct MakeNextFunctionList : IterArgs<MakeNextFunctionList> {
  edge_list next_edges;
  using IterArgs<MakeNextFunctionList>::operator();
  void operator()(const Variable& variable) {
    if (variable.defined()) {
    } else {
  void operator()(const c10::optional<Variable>& variable) {
    if (variable.has_value() && variable->defined()) {
    } else {

template <typename... Variables>
edge_list collect_next_edges(Variables&&... variables) {
  detail::MakeNextFunctionList make;
  return std::move(make.next_edges);

Given an input variable (it’s just a regular tensor), collect_next_edges
will create an Edge object by calling impl::gradient_edge

 Edge gradient_edge(const Variable& self) {
    // If grad_fn is null (as is the case for a leaf node), we instead
    // interpret the gradient function to be a gradient accumulator, which will
    // accumulate its inputs into the grad property of the variable. These
    // nodes get suppressed in some situations, see "suppress gradient
    // accumulation" below. Note that only variables which have `requires_grad =
    // True` can have gradient accumulators.
    if (const auto& gradient = self.grad_fn()) {
      return Edge(gradient, self.output_nr());
    } else {
      return Edge(grad_accumulator(self), 0);

To understand how edges work, let’s assume that an early executed function produced two output tensors, both with their grad_fn set, each tensor also has an output_nr property with the order in which they were returned. When creating the edges for the current grad_fn, an Edge object per input variable will be created. The edges will point to the variable’s grad_fn and will also track the output_nr to establish ids used when traversing the graph. In the case that the input variables are “leaf”, i.e. they were not produced by any differentiable function, they don’t have a grad_fn attribute set. A special function called a gradient accumulator is set by default as seen in the above code snippet.

After the edges are created, the grad_fn graph Node object that is being currently created will hold them using the set_next_edges function. This is what connects grad_fns together, producing the computational graph.

 void set_next_edges(edge_list&& next_edges) {
    next_edges_ = std::move(next_edges);
    for(const auto& next_edge : next_edges_) {

Now, the forward pass of the function will execute, and after the execution set_history will connect the output tensors to the grad_fn Node.

inline void set_history(
    at::Tensor& variable,
    const std::shared_ptr<Node>& grad_fn) {
  if (variable.defined()) {
    // If the codegen triggers this, you most likely want to add your newly added function
    // to the DONT_REQUIRE_DERIVATIVE list in tools/autograd/
    auto output_nr =
    impl::set_gradient_edge(variable, {grad_fn, output_nr});
  } else {

set_history calls set_gradient_edge, which just copies the grad_fn and the output_nr to the AutogradMeta object that the tensor has.

 void set_gradient_edge(const Variable& self, Edge edge) {
    auto* meta = materialize_autograd_meta(self);
    meta->grad_fn_ = std::move(edge.function);
    meta->output_nr_ = edge.input_nr;
    // For views, make sure this new grad_fn_ is not overwritten unless it is necessary
    // in the VariableHooks::grad_fn below.
    // This logic is only relevant for custom autograd Functions for which multiple
    // operations can happen on a given Tensor before its gradient edge is set when
    // exiting the custom Function.
    auto diff_view_meta = get_view_autograd_meta(self);
    if (diff_view_meta && diff_view_meta->has_bw_view()) {

This tensor now will be the input to another function and the above steps will be all repeated. Check the animation below to see how the graph is created.

Figure 2: Animation that shows the graph creation

Registering Python Functions in the graph

We have seen how autograd creates the graph for the functions included in ATen. However, when we define our differentiable functions in Python, they are also included in the graph!

An autograd python defined function looks like the following:

class Exp(torch.autograd.Function):
     def forward(ctx, i):
         result = i.exp()
         return result

     def backward(ctx, grad_output):
         result, = ctx.saved_tensors
         return grad_output * result

# Call the function
Exp.apply(torch.tensor(0.5, requires_grad=True))
# Outputs: tensor(1.6487, grad_fn=<ExpBackward>)

In the above snippet autograd detected our python function when creating the graph. All of this is possible thanks to the Function class. Let’s take a look at what happens when we call apply.

apply is defined in the torch._C._FunctionBase class, but this class is not present in the python source. _FunctionBase is defined in C++ by using the python C API to hook C functions together into a single python class. We are looking for a function named THPFunction_apply.

PyObject *THPFunction_apply(PyObject *cls, PyObject *inputs)
  // Generates the graph node
  THPObjectPtr backward_cls(PyObject_GetAttrString(cls, "_backward_cls"));
  if (!backward_cls) return nullptr;
  THPObjectPtr ctx_obj(PyObject_CallFunctionObjArgs(backward_cls, nullptr));
  if (!ctx_obj) return nullptr;
  THPFunction* ctx = (THPFunction*)ctx_obj.get();

  auto cdata = std::shared_ptr<PyNode>(new PyNode(std::move(ctx_obj)), deleteNode);
  ctx->cdata = cdata;

  // Prepare inputs and allocate context (grad fn)
  // Unpack inputs will collect the edges
  auto info_pair = unpack_input<false>(inputs);
  UnpackedInput& unpacked_input = info_pair.first;
  InputFlags& input_info = info_pair.second;

   // Initialize backward function (and ctx)
  bool is_executable = input_info.is_executable;
  ctx->needs_input_grad = input_info.needs_input_grad.release();
  ctx->is_variable_input = std::move(input_info.is_variable_input);

  // Prepend ctx to input_tuple, in preparation for static method call
  auto num_args = PyTuple_GET_SIZE(inputs);
  THPObjectPtr ctx_input_tuple(PyTuple_New(num_args + 1));
  if (!ctx_input_tuple) return nullptr;
  PyTuple_SET_ITEM(ctx_input_tuple.get(), 0, (PyObject*)ctx);
  for (int i = 0; i < num_args; ++i) {
    PyObject *arg = PyTuple_GET_ITEM(unpacked_input.input_tuple.get(), i);
    PyTuple_SET_ITEM(ctx_input_tuple.get(), i + 1, arg);

  // Call forward
  THPObjectPtr tensor_outputs;
    AutoGradMode grad_mode(false);
    THPObjectPtr forward_fn(PyObject_GetAttrString(cls, "forward"));
    if (!forward_fn) return nullptr;
    tensor_outputs = PyObject_CallObject(forward_fn, ctx_input_tuple);
    if (!tensor_outputs) return nullptr;

  // Here is where the outputs gets the tensors tracked
  return process_outputs(cls, cdata, ctx, unpacked_input, inputs, std::move(tensor_outputs),
                         is_executable, node);

Although this code is hard to read at first due to all the python API calls, it essentially does the same thing as the auto-generated forward functions that we saw for ATen:

Create a grad_fn object.
Collect the edges to link the current grad_fn with the input tensors one.
Execute the function forward.
Assign the created grad_fn to the output tensors metadata.

The grad_fn object is created in:

  // Generates the graph node
  THPObjectPtr backward_cls(PyObject_GetAttrString(cls, "_backward_cls"));
  if (!backward_cls) return nullptr;
  THPObjectPtr ctx_obj(PyObject_CallFunctionObjArgs(backward_cls, nullptr));
  if (!ctx_obj) return nullptr;
  THPFunction* ctx = (THPFunction*)ctx_obj.get();

  auto cdata = std::shared_ptr<PyNode>(new PyNode(std::move(ctx_obj)), deleteNode);
  ctx->cdata = cdata;

Basically, it asks the python API to get a pointer to the Python object that can execute the user-written function. Then it wraps it into a PyNode object that is a specialized Node object that calls the python interpreter with the provided python function when apply is executed during the forward pass. Note that in the code cdata is the actual Node object that is part of the graph. ctx is the object that is passed to the python forward/backward functions and it is used to store autograd related information by both, the user’s function and PyTorch.

As in the regular C++ functions we also call collect_next_edges to track the inputs grad_fn objects, but this is done in unpack_input:

template<bool enforce_variables>
std::pair<UnpackedInput, InputFlags> unpack_input(PyObject *args) {
  flags.next_edges = (flags.is_executable ? collect_next_edges(unpacked.input_vars) : edge_list());
  return std::make_pair(std::move(unpacked), std::move(flags));

After this, the edges are assigned to the grad_fn by just doing cdata->set_next_edges(std::move(input_info.next_edges)); and the forward function is called through the python interpreter C API.

Once the output tensors are returned from the forward pass, they are processed and converted to variables inside the process_outputs function.

PyObject* process_outputs(PyObject *op_obj, const std::shared_ptr<PyNode>& cdata,
                          THPFunction* grad_fn, const UnpackedInput& unpacked,
                          PyObject *inputs, THPObjectPtr&& raw_output, bool is_executable,
                          torch::jit::Node* node) {
  _wrap_outputs(cdata, grad_fn, unpacked.input_vars, raw_output, outputs, is_executable);
  _trace_post_record(node, op_obj, unpacked.input_vars, outputs, is_inplace, unpack_output);
  if (is_executable) {
    _save_variables(cdata, grad_fn);
  } ...
  return outputs.release();

Here, _wrap_outputs is in charge of setting the forward outputs grad_fn to the newly created one. For this, it calls another _wrap_outputs function defined in a different file, so the process here gets a little confusing.

static void _wrap_outputs(const std::shared_ptr<PyNode>& cdata, THPFunction *self,
    const variable_list &input_vars, PyObject *raw_output, PyObject *outputs, bool is_executable)
  auto cdata_if_executable = is_executable ? cdata : nullptr;

  // Wrap only the tensor outputs.
  // This calls csrc/autograd/custom_function.cpp
  auto wrapped_outputs = _wrap_outputs(input_vars, non_differentiable, dirty_inputs, raw_output_vars, cdata_if_executable);

The called _wrap_outputs is the one in charge of setting the autograd metadata in the output tensors:

std::vector<c10::optional<Variable>> _wrap_outputs(const variable_list &input_vars,
  const std::unordered_set<at::TensorImpl*> &non_differentiable,
  const std::unordered_set<at::TensorImpl*> &dirty_inputs,
  const at::ArrayRef<c10::optional<Variable>> raw_outputs,
  const std::shared_ptr<Node> &cdata) {

  std::unordered_set<at::TensorImpl*> inputs;
  // Sets the grad_fn and output_nr of an output Variable.
  auto set_history = [&](Variable& var, uint32_t output_nr, bool is_input, bool is_modified,
                         bool is_differentiable) {
    // Lots of checks
    if (!is_differentiable) {
    } else if (is_input) {
      // An input has been returned, but it wasn't modified. Return it as a view
      // so that we can attach a new grad_fn to the Variable.
      // Run in no_grad mode to mimic the behavior of the forward.
        AutoGradMode grad_mode(false);
        var = var.view_as(var);
      impl::set_gradient_edge(var, {cdata, output_nr});
    } else if (cdata) {
      impl::set_gradient_edge(var, {cdata, output_nr});

And this is where set_gradient_edge was called and this is how a user-written python function gets included in the computational graph with its associated backward function!

Closing remarks

This blog post is intended to be a code overview on how PyTorch constructs the actual computational graphs that we discussed in the previous post. The next entry will deal with how the autograd engine executes these graphs.

Read More

Recreating Natural Voices for People with Speech Impairments

Recreating Natural Voices for People with Speech Impairments

Posted by Ye Jia, Software Engineer and Julie Cattiau, Product Manager, Google Research

Update — 2021/09/07: Added an additional sound clip used to train the model.

On June 2nd, 2021, Major League Baseball in the United States celebrated Lou Gehrig Day, commemorating both the day in 1925 that Lou Gehrig became the Yankees’ starting first baseman, and the day in 1941 that he passed away from amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig’s disease) at the age of 37. ALS is a progressive neurodegenerative disease that affects motor neurons, which connect the brain with the muscles throughout the body, and govern muscle control and voluntary movements. When voluntary muscle control is affected, people may lose their ability to speak, eat, move and breathe.

In honor of Lou Gehrig, former NFL player and ALS advocate Steve Gleason, who lost his ability to speak due to ALS, recited Gehrig’s famous “Luckiest Man” speech at the June 2nd event using a recreation of his voice generated by a machine learning (ML) model. Gleason’s voice recreation was developed in collaboration with Google’s Project Euphonia, which aims to empower people who have impaired speaking ability due to ALS to better communicate using their own voices.

Steve Gleason, who lost his voice to ALS, worked with Google’s Project Euphonia to generate a speech in his own voice in honor of Lou Gehrig. A portion of Gleason’s speech was broadcast in ballparks across the country during the 4th inning on June 2nd, 2021.

Today we describe PnG NAT, the model adopted by Project Euphonia to recreate Steve Gleason’s voice. PnG NAT is a new text-to-speech synthesis (TTS) model that merges two state-of-the-art technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. It demonstrates significantly better quality and fluency than previous technologies, and represents a promising approach that can be extended to a wider array of users.

Recreating a Voice
Non-Attentive Tacotron (NAT) is the successor to Tacotron 2, a sequence-to-sequence neural TTS model proposed in 2017. Tacotron 2 used an attention module to connect the input text sequence and the output speech spectrogram frame sequence, so that the model knows which part of the text to pay attention to when generating each time step of the synthesized speech spectrogram. Tacotron 2 was the first TTS model that was able to synthesize speech that sounds as natural as a person speaking. However, with extensive experimentation we discovered that there is a small probability that the model can suffer from robustness issues — such as babbling, repeating, or skipping part of the text — due to the inherent flexibility of the attention mechanism.

NAT improves upon Tacotron 2 by replacing the attention module with a duration-based upsampler, which predicts a duration for each input phoneme and upsamples the encoded phoneme representation so that the output length corresponds to the length of the predicted speech spectrogram. Such a change both resolves the robustness issue, and improves the naturalness of the synthesized speech. This approach also enables precise control of the speech duration for each phoneme of the input text while still maintaining highly natural synthesis quality. Because recordings of people with ALS often exhibit disfluent speech, this ability to exert per-phoneme control is key for achieving the fluency of the recreated voice.

Non-Attentive Tacotron (NAT) model.

While NAT addresses the robustness issue and enables precise duration control in neural TTS, we build upon it to further improve the natural language understanding of the TTS input. For this, we apply PnG BERT, which uses an approach similar to BERT, but is specifically designed for TTS. It is pre-trained with self-supervision on both the phoneme representation and the grapheme representation of the same content from a large text corpus, and then is used as the encoder of the TTS model. This results in a significant improvement of the prosody and pronunciation of the synthesized speech, especially in difficult cases.

Take, for example, the following audio, which was synthesized from a regular NAT model that takes only phonemes as input:

In comparison, the audio synthesized from PnG NAT on the same input text includes an additional pause that makes the meaning more clear.

The input text to both models is, “To cancel the payment, press one; or to continue, two.” Notice the different pause lengths before the ending “two” in the two versions. The word “two” in the version output by the regular NAT model could be confused for “too”. Because “too” and “two” have identical pronunciation (and thus the same phoneme representation), the regular NAT model does not understand which of the two is appropriate, and assumes it to be the word that more frequently follows a comma, “too”. In contrast, the PnG NAT model can more easily tell the difference, because it takes graphemes in addition to phonemes as input, and thus makes more appropriate pause.

The PnG NAT model integrates the pre-trained PnG BERT model as the encoder to the NAT model. The hidden representations output from the encoder are used by NAT to predict the duration of each phoneme, and are then upsampled to match the length of the audio spectrogram, as outlined above. In the final step, a non-attentive decoder converts the upsampled hidden representations into audio speech spectrograms, which are finally converted into audio waveforms by a neural vocoder.

PnG BERT and the pre-training objectives. Yellow boxes represent phonemes, and pink boxes represent graphemes.
PnG NAT: PnG BERT replaces the original encoder in the NAT model. The random masking for the Masked Language Model (MLM) pre-training is removed.

To recreate Steve Gleason’s voice, we first trained a PnG NAT model with recordings from 31 professional speakers, and then fine-tuned it with 30 minutes of Gleason’s recordings. Because these latter recordings were made after he was diagnosed with ALS, they exhibit signs of slurring. The fine tuned model was able to synthesize speech that sounds very similar to these recordings. However, because the symptoms of ALS were already present in Gleason’s speech, they exhibited some similar disfluencies.

To mitigate this, we leveraged the phoneme duration control of NAT as well as the model trained with professional speakers. We first predicted the durations of each phoneme for both a professional speaker and for Gleason, and then used the geometric mean of the two durations for each phoneme to guide the NAT output. As a result, the model is able to speak in Gleason’s voice, but more fluently than in the original recordings.

Here is the full version of the synthesized Lou Gehrig speech in Gleason’s voice:

As a comparison, following is one of Gleason’s recordings that was used to train the model:

Besides recreating voices for people with ALS, PnG NAT is also powering voices for a variety of customers through Google Cloud Custom Voice.

Project Euphonia
Of the millions of people around the world who have neurologic conditions that may impact their speech, such as ALS, cerebral palsy or Down syndrome, many may find it difficult to be understood, which can make face-to-face communication challenging. Using voice-activated technologies can be frustrating too, as they don’t always work reliably. Project Euphonia is a Google Research initiative focused on helping people with impaired speech be better understood. The team is researching ways to improve speech recognition for individuals with speech impairments (see recent blog post and segment in TODAY show), as well as customized text-to-speech technology (see Age of AI documentary featuring former NFL player Tim Shaw).

Many people across Google Research, Google Cloud and Consumer Apps, and Google Accessibility teams contributed to this project and the event, including Michael Brenner, Bob MacDonald, Heiga Zen, Yu Zhang, Jonathan Shen, Isaac Elias‎, Yonghui Wu, Anne Keck, Danielle Notaro, Kevin Hogan, Zack Kaplan, KR Liu, Kyndra Price, Zoe Ortiz.

Read More

3D Pose Detection with MediaPipe BlazePose GHUM and TensorFlow.js

3D Pose Detection with MediaPipe BlazePose GHUM and TensorFlow.js

Posted by Ivan Grishchenko, Valentin Bazarevsky, Eduard Gabriel Bazavan, Na Li, Jason Mayes, Google

Pose detection is an important step in understanding more about the human body in videos and images. Our existing models have supported 2D pose estimation for some time, which many of you may have already tried.

Today, we are launching our first 3D model in TF.js pose-detection API. 3D pose estimation opens up new design opportunities for applications such as fitness, medical, motion capture and beyond – in many of these areas we’ve seen a growing interest from the TensorFlow.js community. A great example of this is 3D motion capture to drive a character animation in the browser.

3D motion capture with BlazePose GHUM

3D motion capture with BlazePose GHUM by Richard Yee

(used with permission, live demo available at

This community demo uses multiple models powered by MediaPipe and TensorFlow.js (namely FaceMesh, BlazePose and HandPose). Even better, no app install is needed as you just need to visit a webpage to enjoy the experience. So with that in mind, let’s learn more and see this new model in action!

BlazePose live demo
Try out the live demo!


The pose-detection API provides two runtimes for BlazePose GHUM, namely MediaPipe runtime and TensorFlow.js runtime.

To install the API and runtime library, you can either use the <script> tag in your html file or use NPM.

Through script tag:

<script src=""></script>
<!-- Include below scripts if you want to use TF.js runtime. -->
<script src=""></script>
<script src=""></script>
<script src=""></script>

<!-- Optional: Include below scripts if you want to use MediaPipe runtime. -->
<script src=""></script>

Through NPM:

yarn add @tensorflow-models/pose-detection

# Run below commands if you want to use TF.js runtime.
yarn add @tensorflow/tfjs-core @tensorflow/tfjs-converter
yarn add @tensorflow/tfjs-backend-webgl

# Run below commands if you want to use MediaPipe runtime.
yarn add @mediapipe/pose

To reference the API in your JS code, it depends on how you installed the library.

If installed through script tag, you can reference the library through the global namespace poseDetection.

If installed through NPM, you need to import the libraries first:

import * as poseDetection from '@tensorflow-models/pose-detection';
// Uncomment the line below if you want to use TF.js runtime.
// import '@tensorflow/tfjs-backend-webgl';
// Uncomment the line below if you want to use MediaPipe runtime.
// import '@mediapipe/pose';

Try it yourself!

First, you need to create a detector:

const model = poseDetection.SupportedModels.BlazePose;
const detectorConfig = {
runtime: 'mediapipe', // or 'tfjs'
modelType: 'full'
detector = await poseDetection.createDetector(model, detectorConfig);

Choose a modelType that fits your application needs, there are three options for you to choose from: lite, full, and heavy. From lite to heavy, the accuracy increases while the inference speed decreases. Please try our live demo to compare different configurations.

Once you have a detector, you can pass in a video stream to detect poses:

const video = document.getElementById('video');
const poses = await detector.estimatePoses(video);

How to use the output? poses represent an array of detected pose predictions in the image frame. For each pose, it contains keypoints and keypoints3D. The keypoints are the same as the 2D model we launched before, it is an array of 33 keypoint objects, each object has x, y in pixel units.

keypoints3D is an additional array with 33 keypoint objects, each object has x, y, z. The x, y, z are in meter units. The person is modeled as if they were in a 2m x 2m x 2m cubic space. The range for each axis goes from -1 to 1 (therefore 2m total delta). The origin of this 3D space is the hip center (0, 0, 0). From the origin, z is positive if moving closer to the camera, and negative if moving away from the camera. See below output snippet for example:

score: 0.8,
keypoints: [
{x: 230, y: 220, score: 0.9, name: "nose"},
{x: 212, y: 190, score: 0.8, name: "left_eye"},
keypoints3D: [
{x: 0.5, y: 0.9, z: 0.06 score: 0.9, name: "nose"},

You can refer to our ReadMe for more details about the API.

As you begin to play and develop with BlazePose GHUM, we would appreciate your feedback and contributions. If you make something using this model, tag it with #MadeWithTFJS on social media so we can find your work, as we would love to see what you create.

Model deep dive

The key challenge to build the 3D part of our pose model was obtaining realistic, in-the-wild 3D data. In contrast to 2D, which can be obtained via human annotation, accurate manual 3D annotation becomes a uniquely challenging task. It requires either a lab setup or specialised hardware with depth sensors for 3D scans – which introduce additional challenges to preserve a good level of human and environment diversity in the dataset. Another alternative, which many researchers choose – to build a completely synthetic dataset, which introduces yet another challenge of domain adaptation to real-world pictures.

Our approach is based on a statistical 3D human body model called GHUM, which is built using a large corpus of human shapes and motions. To obtain 3D human body pose ground truth, we fitted the GHUM model to our existing 2D pose dataset and extended it with a real world 3D keypoint coordinates in metric space. During the fitting process the shape and the pose variables of GHUM were optimized such that the reconstructed model aligns with the image evidence. This includes 2D keypoint and silhouette semantic segmentation alignment as well as shape and pose regularization terms. For more details see related work on 3D pose and shape inference (HUND, THUNDR).

Sample GHUM fitting for input image
Sample GHUM fitting for an input image. From left to right: original image, 3D GHUM reconstruction (different viewpoint) and blended result projected on top of the original image.

Due to the nature of 3D to 2D projection, multiple points in 3D can have the same projection in 2D (i.e. with the same X and Y but different Z). So the fitting can result in several realistic 3D body poses for the given 2D annotation. To minimize this ambiguity, in addition to a 2D body pose, we asked annotators to provide depth order between pose skeleton edges where they are certain (check the figure below). This task proved to be an easy one (compared to a real depth annotation) showing high consistency between annotators (98% on cross-validation) and helped to reduce the depth ordering errors for the fitted GHUM reconstructions from 25% to 3%.

Depth order annotation: the wider edge corner denotes the corner closer to the camera (e.g. the person’s right shoulder is closer to camera than left shoulder on both examples)
“Depth order” annotation: the wider edge corner denotes the corner closer to the camera (e.g. the person’s right shoulder is closer to camera than left shoulder on both examples)

BlazePose GHUM utilizes a two-step detector-tracker approach where the tracker operates on a cropped human image. Thus the model is trained to predict 3D body pose in relative coordinates of a metric space with origin in the subject’s hips center.

MediaPipe vs. TF.js runtime

There are some pros and cons of using each runtime. As shown in the performance table below, the MediaPipe runtime provides faster inference speed on desktop, laptop and android phones. The TF.js runtime provides faster inference speed on iPhones and iPads. The TF.js runtime is also about 1 MB smaller than the MediaPipe runtime.

MacBook Pro 15” 2019. 

Intel core i9. 

AMD Radeon Pro Vega 20 Graphics.


iPhone 11


Pixel 5



Intel i9-10900K. Nvidia GTX 1070 GPU.


MediaPipe Runtime

With WASM & GPU Accel.

75 | 67 | 34

9 | 6 | N/A                   

25 | 21 | 8

150 | 130 | 97

TFJS Runtime

With WebGL backend.

52 | 40 | 24

 43 | 32 | 22

14 | 10 | 4

42 | 35 | 29

Inference speed of BlazePose GHUM across different devices and runtimes. The first number in each cell is for the lite model, and the second number is for the full model, the third number is for the heavy model.


We would like to acknowledge our colleagues, who participated in creating BlazePose GHUM 3D: Andrei Zanfir, Cristian Sminchisescu, Tyler Zhu, the other contributors to MediaPipe: Chuo-Ling Chang, Michael Hays, Ming Guang Yong, Matthias Grundmann, along with those involved with the TensorFlow.js pose-detection API: Ahmed Sabie and Ping Yu, and of course the community who are making amazing work with these models: Richard Yee.

Read More

Facebook Fellow Spotlight: Breaking barriers for women and girls in rural India

Each year, PhD students from around the world apply for the Facebook Fellowship, a program designed to encourage and support doctoral students engaged in innovative and relevant research in areas related to computer science and engineering.

As a continuation of our Fellowship spotlight series, we’re highlighting 2020 Facebook Fellow in Social and Economic Policy Gauthami Penakalapati.

Gauthami is a PhD candidate in the Energy and Resources Group at UC Berkeley, advised by Dr. Isha Ray. Her research explores how social connection and gendered norms impact adolescent girls’ well-being in India. In particular, she is interested in using insights from gender and feminist theory to develop meaningful programs and metrics to improve women’s and girls’ empowerment in global development programs. She has extensive international research experience working with UNICEF and Innovations for Poverty Action, and she has worked in Ghana, India, Kenya, Rwanda, and Zambia. She received her bachelor’s degree in biology from Georgia Institute of Technology and her Master of Public Health from Rollins School of Public Health, Emory University.

Before Gauthami pursued her PhD, her career was instrumental in helping her think beyond traditional public health questions. “I was motivated to understand the ‘whys’ of circumstances,” she says. “For example, why are we as researchers only just realizing the importance of incorporating data from women and girls in global development programs?” This led her to question how colonial history of public health and data in India has contributed to the continued marginalization of women and girls. “My fields of study are interdisciplinary. My research involves the intersection of global health, climate adaptation, and human-computer interaction. I’m constantly pulling from decolonial and feminist theories as well as science and technology studies because my work is simultaneously scientific and social,” Gauthami says.

Gauthami takes a holistic, activist-oriented approach to her research, engaging directly with adolescent girls within their communities in Northern India. “We talk about gender equity, but how does the concept incorporate empowerment and agency when we know this varies from individual to individual, community to community?” Gauthami says. Since a real-world approach to gender equity is tied intimately to local norms, Gauthami focuses on researching local approaches to gender equity so she can propose meaningful solutions. Though it requires a long-term commitment in one particular area, Gauthami feels her approach has a positive impact on the communities she works with.

“In India, binary gender norms are strong,” she says. “And the same gendered norms often dictate access to resources and technologies. When considering access to mobile devices with internet capabilities, there is a digital divide between men and women — 71 percent of men in India own a mobile phone, while only 38 percent of women do.” According to Gauthami, these statistics underpin a larger problem: In many rural Indian communities, women and girls who own or use mobile phones are viewed with distrust and suspicion. In her current research, Gauthami explores this phenomenon more deeply by investigating how reliable access and usage of smartphones supports girls’ individual agency and aspirations.

Because Gauthami has not been able to travel to India during the pandemic, she has spent her time collecting data for a systematic review on adolescent empowerment programs and has focused on publishing articles and reworking lectures. However, she looks forward to being able to return to her field sites in Northern India because she prefers to connect directly with the communities she works with, with the ultimate goal of creating and implementing programs and technologies that empower adolescent girls. She is also working on a creative project that elevates the stories of people in India through photography and prose.

To learn more about Gauthami Penakalapati and her research, visit her Fellowship page and her website.

The post Facebook Fellow Spotlight: Breaking barriers for women and girls in rural India appeared first on Facebook Research.

Read More

These researchers are bringing AI to farmers

These researchers are bringing AI to farmers

“Farmers feed the entire world — so how might we support them to be resilient and build sustainable systems that also support global food security?” It’s a question that Diana Akrong found herself asking last year. Diana is a UX researcher based in Accra, Ghana, and the founding member of Google’s Accra UX team.

Across the world, her manager Dr. Courtney Heldreth, was equally interested in answering this question. Courtney is a social psychologist and a staff UX researcher based in Seattle, and both women work as part of Google’s People + Artificial Intelligence Research (PAIR) group. “Looking back on history, we can see how the industrial revolution played a significant role in creating global inequality,” she says. “It set most of Western Europe onto a path of economic dominance that was then followed by both military and political dominance.” Courtney and Diana teamed up on an exploratory effort focused on how AI can help better the lives of small, local farming communities in the Global South. They and their team want to understand what farmers need, their practices, value systems, what their social lives are like — and make sure that Google products reflect these dynamics.

One result of their work is a recently published research paper. The paper — written alongside their colleagues Dr. Jess Holbrook at Google and Dr. Norman Makoto Su of Indiana University and published in the ACM Interactions trade journal — dives into why we need farmer-centered AI research, and what it could mean not just for farmers, but for everyone they feed. I recently took some time to learn more about their work.

How would you explain your job to someone who isn’t in tech?

Courtney: I would say I’m a researcher trying to understand underserved and historically marginalized users’ lives and needs so we can create products that work better for them. 

Diana: I’m a researcher who looks at how people interact with technology. My superpower is my curiosity and it’s my mission to understand and advocate for user needs, explore business opportunities and share knowledge.

What’s something on your mind right now? 

Diana: Because of COVID-19, there’s the threat of a major food crisis in India and elsewhere. We’re wondering how we can work with small farms as well as local consumers, policymakers, agricultural workers, agribusiness owners and NGOs to solve this problem.

Agriculture is very close to my heart, personally. Prior to joining Google, I spent a lot of time learning from smallholder farmers across my country and helping design concepts to address their needs. 

“Farmers feed the entire world — so how might we support them to be resilient and build sustainable systems that also support global food security?” Diana Akrong
UX researcher, Google

Courtney: I’ve been thinking about how AI can be seen as this magical, heroic thing, but there are also many risks to using it in places where there aren’t laws to protect people. When I think about Google’s AI Principles — be socially beneficial, be accountable to people, avoid reinforcing bias, prioritize safety — those things define what projects I want to work on. It’s also why my colleague Tabitha Yong and I developed a set of best practices for designing more equitable AI products.

Can you tell me more about your paper, “What Does AI Mean for Smallholder Farmers? A Proposal for Farmer-Centered AI Research,” recently published in ACM Interactions

Courtney: The impact and failures of AI are often very western and U.S.-centric. We’re trying to think about how to make this more fair and inclusive for communities with different needs around the globe. For example, in our farmer-centered AI research, we know that most existing AI solutions are designed for large farms in the developed world. However, many farmers in the Global South live and work in rural areas, which trail behind urban areas in terms of connectivity and digital adoption. By focusing on the daily realities of these farmers, we can better understand different perspectives, especially those of people who don’t live in the U.S. and Europe, so that Google’s products work for everyone, everywhere.

  • Courtney is on the left of a giant billboard at the CGIAR Platform for Big Data in Agriculture summit in 2019 and Diana is on the right side of the same sign.

    In 2019, Courtney and Diana led a workshop at the CGIAR Platform for Big Data in Agriculture summit; Courtney also participated in a panel discussion. In 2020, Diana spoke at a virtual CGIAR panel on human-centered design. 

  • Diana on the left and Courtney on the right are smiling and hugging each other.

    Diana (left) and Courtney (right) are dedicated to building inclusive AI for farmers with small rural businesses in Africa and Asia. Diana is based in Accra, Ghana and Courtney is based in Seattle, Washington in the U.S. 

  • Courtney is pictured on a research trip to India, looking out of a latticed window with a beautiful traditional Indian dress and earrings.

    Courtney pictured here during a research trip to India.

  • Diana is smiling and wearing an apron for a team building exercise.

    Diana is all smiles at a team building event.

Why did you want to work at Google?

Diana: I see Google as home to teams with diverse experiences and skills who work collaboratively to tackle complex, important issues that change real people’s lives. I’ve thrived here because I get to work on projects I care about and play a critical role in growing the UX community here in Ghana.

Courtney: I chose Google because we work on the world’s hardest problems. Googlers are  fearless and the reach of Google’s products and services is unprecedented. As someone who comes from an underrepresented group, I never thought I would work here. To be here at this moment is so important to me, my community and my family. When I look at issues I care about the most — marginalized and underrepresented communities — the work we do plays a critical role in preventing algorithmic bias, bridging the digital divide and lessening these inequalities. 

How have you seen your research help real people? 

Courtney: In 2018, we worked with Titi Akinsanmi, Google’s Policy and Government Relations Lead for West and Francophone Africa, and PAIR Co-lead and Principal Research Scientist Fernanda Viegas on the report for AI in Nigeria. Since then, the Ministry of Technology and Science reached out to Google to help form a strategy around AI. We’ve seen government bodies in sub-Saharan Africa use this paper as a roadmap to develop their own responsible AI policies.

How should aspiring AI thinkers and future technologists prepare for a career in this field?

Diana: My main advice? Start with people and their needs. A digital solution or AI may not be necessary to solve every problem. The PAIR Guidebook is a great reference for best practices and examples for designing with AI.

Read More

Announcing PyTorch Developer Day 2021

Announcing PyTorch Developer Day 2021

We are excited to announce PyTorch Developer Day (#PTD2), taking place virtually from December 1 & 2, 2021. Developer Day is designed for developers and users to discuss core technical developments, ideas, and roadmaps.

Event Details

Technical Talks Live Stream – December 1, 2021

Join us for technical talks on a variety of topics, including updates to the core framework, new tools and libraries to support development across a variety of domains, responsible AI and industry use cases. All talks will take place on December 1 and will be live streamed on PyTorch channels.

Stay up to date by following us on our social channels: Twitter, Facebook, or LinkedIn.

Poster Exhibition & Networking – December 2, 2021

On the second day, we’ll be hosting an online poster exhibition on Gather.Town. There will be opportunities to meet the authors and learn more about their PyTorch projects as well as network with the community. This poster and networking event is limited to people composed of PyTorch maintainers and contributors, long-time stakeholders and experts in areas relevant to PyTorch’s future. Conversations from the networking event will strongly shape the future of PyTorch. As such, invitations are required to attend the networking event.

Apply for an invitation to the networking event by clicking here.

Call for Content Now Open

Submit your poster abstracts today! Please send us the title and brief summary of your project, tools and libraries that could benefit PyTorch researchers in academia and industry, application developers, and ML engineers for consideration. The focus must be on academic papers, machine learning research, or open-source projects related to PyTorch development, Responsible AI or Mobile. Please no sales pitches. Deadline for submission is September 24, 2021.

You can submit your poster abstract during your application & registration process here.

Visit the event website for more information and we look forward to having you at PyTorch Developer Day. For any questions about the event, contact

Read More

Inside the DPU: Talk Describes an Engine Powering Data Center Networks

Inside the DPU: Talk Describes an Engine Powering Data Center Networks

The tech world this week gets its first look under the hood of the NVIDIA BlueField data processing unit. The chip invented the category of the DPU last year, and it’s already being embraced by cloud services, supercomputers and many OEMs and software partners.

Idan Burstein, a principal architect leading our Israel-based BlueField design team, will describe the DPU’s architecture at Hot Chips, an annual conference that draws many of the world’s top microprocessor designers.

The talk will unveil a silicon engine for accelerating modern data centers. It’s an array of hardware accelerators and general-purpose Arm cores that speed networking, security and storage jobs.

Those jobs include virtualizing data center hardware while securing and smoothing the flow of network traffic. It’s work that involves accelerating in hardware a growing alphabet soup of tasks fundamental to running a data center, such as:

  • IPsec, TLS, AES-GCM, RegEx and Public Key Acceleration for security
  • NVMe-oF, RAID and GPUDirect Storage for storage
  • RDMA, RoCE, SR-IOV, VXLAN, VirtIO and GPUDirect RDMA for networking, and
  • Offloads for video streaming and time-sensitive communications

These workloads are growing faster than Moore’s law and already consume a third of server CPU cycles. DPUs pack purpose-built hardware to run these jobs more efficiently, making more CPU cores available for data center applications.

DPUs deliver virtualization and advanced security without compromising bare-metal performance. Their uses span the gamut from cloud computing and media streaming to storage, edge processing and high performance computing.

NVIDIA CEO Jensen Huang describes DPUs as “one of the three major pillars of computing going forward … The CPU is for general-purpose computing, the GPU is for accelerated computing and the DPU, which moves data around the data center, does data processing.”

A Full Plug-and-Play Stack

The good news for users is they don’t have to master the silicon details that may fascinate processor architects at Hot Chips. They can simply plug their existing software into familiar high-level software interfaces to harness the DPU’s power.

Those APIs are bundled into the DPU’s software stack called NVIDIA DOCA. It includes drivers, libraries, tools, documentation, example applications and a runtime environment for provisioning, deploying and orchestrating services on thousands of DPUs across the data center.

We’ve already received requests for early access to DOCA from hundreds of organizations, including several of the world’s industry leaders.

DOCA DPU software stack
DOCA provides a software platform for rapid development of networking, storage and security applications on the DPU.

DPUs Deliver for Data Centers, Clouds

The architecture described at Hot Chips is moving into several of the world’s largest clouds as well as a TOP500 supercomputer and integrated with next-generation firewalls. It will soon be available in systems from several top OEMs supported with software from more than a dozen other partners.

Today, multiple cloud service providers around the world are using or preparing to deploy BlueField DPUs to provision compute instances securely.

BlueField Powers Supercomputers, Firewalls

The University of Cambridge tapped into the DPU’s efficiencies to debut in June the fastest academic system in the U.K., a supercomputer that hit No. 3 on the Green500 list of the world’s most energy-efficient systems.

It’s the world’s first cloud-native supercomputer, letting researchers share virtual resources with privacy and security while not compromising performance.

With the VM-Series Next-Generation Firewall from Palo Alto Networks, every data center can now access the DPU’s security capabilities. The VM-Series NGFW can be accelerated with BlueField-2 to inspect network flows that were previously impossible or impractical to track.

The DPU will soon be available in systems from ASUS, Atos, Dell Technologies, Fujitsu, GIGABYTE, H3C, Inspur, Quanta/QCT and Supermicro, several of which announced plans at Computex in May.

More than a dozen software partners will support the NVIDIA BlueField DPUs, including:

  • VMware, with Project Monterey, which introduces DPUs to the more than 300,000 organizations that rely on VMware for its speed, resilience and security.
  • Red Hat, with an upcoming developer’s kit for Red Hat Enterprise Linux and Red Hat OpenShift, used by 95 percent of the Fortune 500.
  • Canonical, in Ubuntu Linux, the most popular operating system among public clouds.
  • Check Point Software Technologies, in products used by more than 100,000 organizations worldwide to prevent cyberattacks.

Other partners include Cloudflare, DDN, Excelero, F5, Fortinet, Guardicore, Juniper Networks, NetApp, Vast Data and WekaIO.

The support is broad because the opportunity is big.

“Every single networking chip in the world will be a smart networking chip … And that’s what the DPU is. It’s a data center on a chip,” said Collette Kress, NVIDIA’s CFO, in a May earnings call, predicting every server will someday sport a DPU.

DPU-Powered Networks on the Horizon

Market watchers at Dell’Oro Group forecast the number of smart networking ports shipped will nearly double from 4.4 million in 2020 to 7.4 million by 2025.

Gearing up for that growth, NVIDIA announced at GTC its roadmap for the next two generations of DPUs.

The BlueField-3, sampling next year, will drive networks up to 400 Gbit/second and pack the muscle of 300 x86 cores. The BlueField-4 will deliver an order of magnitude more performance with the addition of NVIDIA AI computing technologies.

What’s clear from the market momentum and this week’s Hot Chips talk is just as it has in AI, NVIDIA is now setting the pace in accelerated networking.

The post Inside the DPU: Talk Describes an Engine Powering Data Center Networks appeared first on The Official NVIDIA Blog.

Read More

Make History This GFN Thursday: ‘HUMANKIND’ Arrives on GeForce NOW

Make History This GFN Thursday: ‘HUMANKIND’ Arrives on GeForce NOW

This GFN Thursday brings in the highly anticipated magnum opus from SEGA and Amplitude Studios, HUMANKIND, as well as exciting rewards to redeem for members playing Eternal Return.

There’s also updates on the newest Fortnite Season 7 game mode, “Impostors,” streaming on GeForce NOW.

Plus, there are nine games in total coming to the cloud this week.

The Future is in Your Hands

It’s time to make history. The exciting new turn-based historical strategy game HUMANKIND released this week and is streaming on GeForce NOW.

In HUMANKIND, you’ll be rewriting the entire narrative of human history and combining cultures to create a civilization as unique as you are. Combine up to 60 historical cultures as you lead your people from the Ancient to the Modern Age. From humble origins as a Neolithic tribe, transition to the Ancient Era as the Babylonians, become the Classical era Mayans, the Medieval Umayyads, the Early Modern-era British, and so on. Create a custom leader from these backgrounds to pave the way to the future.

Players will encounter historical events and make impactful moral decisions to develop the world as they see fit. Explore the natural wonders, discover scientific breakthroughs and make remarkable creations to leave your mark on the world. Master tactical turn-based battles and command your assembled armies to victory against strangers and friends in multiplayer matches of up to eight players. For every discovery, every battle and every deed, players gain fame — and the player with the most fame wins the game.

An awesome extra, unlock unique characters based on popular content creators, like GeForce NOW streamer BurkeBlack, by watching their HUMANKIND streams for unique drops.

Create a civilization that’s as unique as you are and become the most famous leader in history.

Gamers have been eagerly anticipating the release of HUMANKIND, and members will be able to experience this awesome new PC game when streaming on low-powered PCs, Macs, Chromebooks, SHIELD TVs or Android and iOS mobile devices with the power of GeForce NOW.

“GeForce NOW can invite even more players to experience the HUMANKIND journey,” said Romain de Waubert, studio head and chief creative officer at Amplitude Studios. “The service quickly and easily brings gamers into HUMANKIND with beautiful PC graphics on nearly any device.”

Tell your story your way. Play HUMANKIND this week on GeForce NOW and determine where you’ll take history.

Reap the Rewards

Playing games on GeForce NOW is great, and so is getting rewarded for playing.

Eternal Return on GeForce NOW
Members can enjoy awesome skin and emote rewards in Eternal Return.

The GeForce NOW rewards program is always looking to give members access to awesome rewards. This week brings a custom skin and custom emote for Eternal Return.

Getting rewarded for streaming games on the cloud is easy. Members should make sure to check the box for Rewards in the GeForce NOW account portal and opt in to receive newsletters for future updates and upcoming reward spoils.

Impostors Infiltrate Fortnite

Chapter 2 Season 7 of Fortnite also delivered a thrilling, new game mode. Members can play Fortnite “Impostors,” which recently was released on August 17.

Play in matches between four to 10 players of Agents versus Impostors on a brand new map – The Bridge. Agents win by completing minigame assignments to fill their progress bar or revealing all Impostors hiding among the team by calling discussions and voting out suspicious behavior.

While keeping their identity a secret, up to two Impostors will seek to eliminate enough Agents to overtake The Bridge. They can hide their status by completing assignments, which will benefit the progress of the Agent team, and have sneaky sabotage abilities to create chaos.

Whether playing as an Agent or as an Impostor, this game is set to be a great time. Stream it today on GeForce NOW.

It’s Game Time

RiMS Racing on GeForce NOW
Ride the world’s most powerful motorbikes in RiMS Racing this week on GeForce NOW.

As always, GFN Thursday means new games coming to the cloud every week. Members can look forward to being able to stream these nine titles joining the GeForce NOW library:

With all of these new games, it’s always a good time to play. Speaking of time, we’ve got a question about your favorite games:

past, present, or future

what’s your favorite time period to play in?

🌩 NVIDIA GeForce NOW (@NVIDIAGFN) August 18, 2021

Let us know on Twitter or in the comments below.

The post Make History This GFN Thursday: ‘HUMANKIND’ Arrives on GeForce NOW appeared first on The Official NVIDIA Blog.

Read More