Large Motion Frame Interpolation

Frame interpolation is the process of synthesizing in-between images from a given set of images. The technique is often used for temporal up-sampling to increase the refresh rate of videos or to create slow motion effects. Nowadays, with digital cameras and smartphones, we often take several photos within a few seconds to capture the best picture. Interpolating between these “near-duplicate” photos can lead to engaging videos that reveal scene motion, often delivering an even more pleasing sense of the moment than the original photos.

Frame interpolation between consecutive video frames, which often have small motion, has been studied extensively. Unlike videos, however, the temporal spacing between near-duplicate photos can be several seconds, with commensurately large in-between motion, which is a major failing point of existing frame interpolation methods. Recent methods attempt to handle large motion by training on datasets with extreme motion, albeit with limited effectiveness on smaller motions.

In “FILM: Frame Interpolation for Large Motion”, published at ECCV 2022, we present a method to create high quality slow-motion videos from near-duplicate photos. FILM is a new neural network architecture that achieves state-of-the-art results in large motion, while also handling smaller motions well.

FILM interpolating between two near-duplicate photos to create a slow motion video.

FILM Model Overview

The FILM model takes two images as input and outputs a middle image. At inference time, we recursively invoke the model to output in-between images. FILM has three components: (1) A feature extractor that summarizes each input image with deep multi-scale (pyramid) features; (2) a bi-directional motion estimator that computes pixel-wise motion (i.e., flows) at each pyramid level; and (3) a fusion module that outputs the final interpolated image. We train FILM on regular video frame triplets, with the middle frame serving as the ground-truth for supervision.
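
To make the recursive inference scheme concrete, the sketch below shows one way the midpoint network could be applied repeatedly. It is a minimal illustration only, assuming a film_model(frame_a, frame_b) callable that returns the predicted middle frame; it is not the official FILM API.

# Hypothetical sketch of recursive mid-frame inference (not the official API).
def interpolate_recursively(frame_a, frame_b, film_model, depth):
    if depth == 0:
        return [frame_a, frame_b]
    mid = film_model(frame_a, frame_b)                                # predict the middle frame
    left = interpolate_recursively(frame_a, mid, film_model, depth - 1)
    right = interpolate_recursively(mid, frame_b, film_model, depth - 1)
    return left[:-1] + right                                          # drop the duplicated midpoint

With depth=3, this sketch yields a nine-frame sequence from a single photo pair, which can then be played back as a short slow-motion clip.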

A standard feature pyramid extraction on two input images. Features are processed at each level by a series of convolutions, which are then downsampled to half the spatial resolution and passed as input to the deeper level.

Scale-Agnostic Feature Extraction

Large motion is typically handled with hierarchical motion estimation using multi-resolution feature pyramids (shown above). However, this method struggles with small and fast-moving objects because they can disappear at the deepest pyramid levels. In addition, there are far fewer available pixels to derive supervision at the deepest level.

To overcome these limitations, we adopt a feature extractor that shares weights across scales to create a “scale-agnostic” feature pyramid. This feature extractor (1) allows the use of a shared motion estimator across pyramid levels (next section) by equating large motion at shallow levels with small motion at deeper levels, and (2) creates a compact network with fewer weights.

Specifically, given two input images, we first create an image pyramid by successively downsampling each image. Next, we use a shared U-Net convolutional encoder to extract a smaller feature pyramid from each image pyramid level (columns in the figure below). As the third and final step, we construct a scale-agnostic feature pyramid by horizontally concatenating features from different convolution layers that have the same spatial dimensions. Note that from the third level onwards, the feature stack is constructed with the same set of shared convolution weights (shown in the same color). This ensures that all features are similar, which allows us to continue to share weights in the subsequent motion estimator. The figure below depicts this process using four pyramid levels, but in practice, we use seven.
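
As a rough illustration of this construction (not the paper's code), the sketch below assumes three hypothetical helpers: downsample(image) halves the spatial resolution, encode(image) runs the shared encoder and returns its per-level features from fine to coarse (each level again half the resolution of the previous one), and concat(feature_list) concatenates along the channel axis.

# Hypothetical sketch of building the scale-agnostic feature pyramid.
def scale_agnostic_pyramid(image, num_levels, downsample, encode, concat):
    # 1) Image pyramid: successively downsample the input.
    image_pyramid = [image]
    for _ in range(num_levels - 1):
        image_pyramid.append(downsample(image_pyramid[-1]))

    # 2) One small feature pyramid ("column") per image-pyramid level,
    #    extracted with the same shared encoder weights.
    columns = [encode(img) for img in image_pyramid]

    # 3) Concatenate features that share the same spatial resolution:
    #    pyramid level l collects column i's features at encoder depth l - i.
    pyramid = []
    for level in range(num_levels):
        same_resolution = [
            columns[i][level - i]
            for i in range(level + 1)
            if level - i < len(columns[i])
        ]
        pyramid.append(concat(same_resolution))
    return pyramid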

Bi-directional Flow Estimation

After feature extraction, FILM performs pyramid-based residual flow estimation to compute the flows from the yet-to-be-predicted middle image to the two inputs. The flow estimation is done once for each input image, starting from the deepest level, using a stack of convolutions. We estimate the flow at a given level by adding a residual correction to the upsampled estimate from the next deeper level. This correction takes as input: (1) the features of the first input at that level, and (2) the features of the second input after warping them with the upsampled estimate. The same convolution weights are shared across all levels, except for the two finest levels.
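
The coarse-to-fine recursion can be outlined as follows. This is an illustrative sketch only, with predict_residual (the shared convolution stack), warp (backward warping by a flow field), and upsample treated as assumed helpers, and the pyramids ordered from fine to coarse.

# Hypothetical sketch of pyramid-based residual flow estimation for one direction.
def estimate_flow(pyramid_a, pyramid_b, predict_residual, warp, upsample):
    flow = None
    # Iterate from the deepest (coarsest) level to the finest.
    for feat_a, feat_b in zip(reversed(pyramid_a), reversed(pyramid_b)):
        if flow is None:
            flow = predict_residual(feat_a, feat_b)           # initial estimate at the deepest level
        else:
            flow = upsample(flow) * 2.0                       # rescale the flow to the finer level
            warped_b = warp(feat_b, flow)                     # align the second image's features
            flow = flow + predict_residual(feat_a, warped_b)  # add the residual correction
    return flow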

Shared weights allow the interpretation of small motions at deeper levels to be the same as large motions at shallow levels, boosting the number of pixels available for large motion supervision. Additionally, shared weights not only enable the training of powerful models that may reach a higher peak signal-to-noise ratio (PSNR), but are also needed to enable models to fit into GPU memory for practical applications.

The impact of weight sharing on image quality. Left: no sharing, Right: sharing. For this ablation we used a smaller version of our model (called FILM-med in the paper) because the full model without weight sharing would diverge as the regularization benefit of weight sharing was lost.

Fusion and Frame Generation

Once the bi-directional flows are estimated, we warp the two feature pyramids into alignment. We obtain a concatenated feature pyramid by stacking, at each pyramid level, the two aligned feature maps, the bi-directional flows and the input images. Finally, a U-Net decoder synthesizes the interpolated output image from the aligned and stacked feature pyramid.

FILM Architecture. FEATURE EXTRACTION: we extract scale-agnostic features. The features with matching colors are extracted using shared weights. FLOW ESTIMATION: we compute bi-directional flows using shared weights across the deeper pyramid levels and warp the features into alignment. FUSION: A U-Net decoder outputs the final interpolated frame.

Loss Functions

During training, we supervise FILM with a combination of three losses. First, we use the absolute L1 difference between the predicted and ground-truth frames to capture the motion between input images. However, this loss alone produces blurry images. Second, we use a perceptual loss to improve image fidelity, minimizing the L1 difference between the ImageNet pre-trained VGG-19 features extracted from the predicted and ground-truth frames. Third, we use a Style loss that minimizes the L2 difference between the Gram matrices of those VGG-19 features. The Style loss enables the network to produce sharp images and realistic inpaintings of large pre-occluded regions. Finally, the three losses are combined with weights empirically selected so that each loss contributes equally to the total loss.
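
A compact sketch of the combined loss is shown below. It is illustrative only: it assumes TensorFlow tensors in [batch, height, width, channels] layout and a vgg_features(x) helper (not shown) that returns a list of ImageNet pre-trained VGG-19 feature maps, and the weights w1, w2, w3 are placeholders rather than the values used in the paper.

import tensorflow as tf

def gram_matrix(features):
    shape = tf.shape(features)
    b, h, w, c = shape[0], shape[1], shape[2], shape[3]
    flat = tf.reshape(features, [b, h * w, c])
    denom = tf.cast(h * w * c, tf.float32)
    return tf.matmul(flat, flat, transpose_a=True) / denom    # [b, c, c] Gram matrix

# Hedged sketch of the three-term training loss described above.
def film_loss(pred, target, vgg_features, w1=1.0, w2=1.0, w3=1.0):
    l1 = tf.reduce_mean(tf.abs(pred - target))                                     # pixel L1
    feats_p, feats_t = vgg_features(pred), vgg_features(target)
    perceptual = tf.add_n([tf.reduce_mean(tf.abs(p - t))                           # VGG feature L1
                           for p, t in zip(feats_p, feats_t)])
    style = tf.add_n([tf.reduce_mean(tf.square(gram_matrix(p) - gram_matrix(t)))   # Gram matrix L2
                      for p, t in zip(feats_p, feats_t)])
    return w1 * l1 + w2 * perceptual + w3 * style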

Shown below, the combined loss greatly improves sharpness and image fidelity compared to training FILM with the L1 loss alone or with the L1 plus VGG losses. The combined loss maintains the sharpness of the tree leaves.

FILM’s combined loss functions. L1 loss (left), L1 plus VGG loss (middle), and Style loss (right), showing significant sharpness improvements (green box).

Image and Video Results

We evaluate FILM on an internal near-duplicate photos dataset that exhibits large scene motion. Additionally, we compare FILM to recent frame interpolation methods: SoftSplat and ABME. FILM performs favorably when interpolating across large motion. Even in the presence of motion as large as 100 pixels, FILM generates sharp images consistent with the inputs.

Frame interpolation with SoftSplat (left), ABME (middle) and FILM (right) showing favorable image quality and temporal consistency.
Large motion interpolation. Top: 64x slow motion video. Bottom (left to right): The two input images blended, SoftSplat interpolation, ABME interpolation, and FILM interpolation. FILM captures the dog’s face while maintaining the background details.

Conclusion

We introduce FILM, a large motion frame interpolation neural network. At its core, FILM adopts a scale-agnostic feature pyramid that shares weights across scales, which allows us to build a “scale-agnostic” bi-directional motion estimator that learns from frames with normal motion and generalizes well to frames with large motion. To handle wide disocclusions caused by large scene motion, we supervise FILM by matching the Gram matrix of ImageNet pre-trained VGG-19 features, which results in realistic inpainting and crisp images. FILM performs favorably on large motion, while also handling small and medium motions well, and generates temporally smooth high quality videos.

Try It Out Yourself

You can try out FILM on your photos using the source code, which is now publicly available.

Acknowledgements

We would like to thank Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless for their contributions. We thank Marc Comino Trinidad for his contributions on the scale-agnostic feature extractor, Orly Liba and Charles Herrmann for feedback on the text, Jamie Aspinall for the imagery in the paper, Dominik Kaeser, Yael Pritch, Michael Nechyba, William T. Freeman, David Salesin, Catherine Wah, and Ira Kemelmacher-Shlizerman for support. Thanks to Tom Small for creating the animated diagram in this post.

Read More

Reduce cost and development time with Amazon SageMaker Pipelines local mode

Creating robust and reusable machine learning (ML) pipelines can be a complex and time-consuming process. Developers usually test their processing and training scripts locally, but the pipelines themselves are typically tested in the cloud. Creating and running a full pipeline during experimentation adds unwanted overhead and cost to the development lifecycle. In this post, we detail how you can use Amazon SageMaker Pipelines local mode to run ML pipelines locally to reduce both pipeline development and run time while reducing cost. After the pipeline has been fully tested locally, you can easily rerun it with Amazon SageMaker managed resources with just a few lines of code changes.

Overview of the ML lifecycle

One of the main drivers of new innovations and applications in ML is the availability and amount of data along with cheaper compute options. In several domains, ML has proven capable of solving problems previously unsolvable with classical big data and analytical techniques, and the demand for data science and ML practitioners is increasing steadily. At a very high level, the ML lifecycle consists of many different parts, but the building of an ML model usually consists of the following general steps:

  1. Data cleansing and preparation (feature engineering)
  2. Model training and tuning
  3. Model evaluation
  4. Model deployment (or batch transform)

In the data preparation step, data is loaded, massaged, and transformed into the type of inputs, or features, the ML model expects. Writing the scripts to transform the data is typically an iterative process, where fast feedback loops are important to speed up development. It’s normally not necessary to use the full dataset when testing feature engineering scripts, which is why you can use the local mode feature of SageMaker Processing. This allows you to run locally and update the code iteratively, using a smaller dataset. When the final code is ready, it’s submitted to the remote processing job, which uses the complete dataset and runs on SageMaker managed instances.

The development process is similar to the data preparation step for both model training and model evaluation steps. Data scientists use the local mode feature of SageMaker Training to iterate quickly with smaller datasets locally, before using all the data in a SageMaker managed cluster of ML-optimized instances. This speeds up the development process and eliminates the cost of running ML instances managed by SageMaker while experimenting.

As an organization’s ML maturity increases, you can use Amazon SageMaker Pipelines to create ML pipelines that stitch together these steps, creating more complex ML workflows that process, train, and evaluate ML models. SageMaker Pipelines is a fully managed service for automating the different steps of the ML workflow, including data loading, data transformation, model training and tuning, and model deployment. Until recently, you could develop and test your scripts locally but had to test your ML pipelines in the cloud. This made iterating on the flow and form of ML pipelines a slow and costly process. Now, with the added local mode feature of SageMaker Pipelines, you can iterate and test your ML pipelines similarly to how you test and iterate on your processing and training scripts. You can run and test your pipelines on your local machine, using a small subset of data to validate the pipeline syntax and functionalities.

SageMaker Pipelines

SageMaker Pipelines provides a fully automated way to run simple or complex ML workflows. With SageMaker Pipelines, you can create ML workflows with an easy-to-use Python SDK, and then visualize and manage your workflow using Amazon SageMaker Studio. Your data science teams can be more efficient and scale faster by storing and reusing the workflow steps you create in SageMaker Pipelines. You can also use pre-built templates that automate the infrastructure and repository creation to build, test, register, and deploy models within your ML environment. These templates are automatically available to your organization, and are provisioned using AWS Service Catalog products.

SageMaker Pipelines brings continuous integration and continuous deployment (CI/CD) practices to ML, such as maintaining parity between development and production environments, version control, on-demand testing, and end-to-end automation, which helps you scale ML throughout your organization. DevOps practitioners know that some of the main benefits of using CI/CD techniques include an increase in productivity via reusable components and an increase in quality through automated testing, which leads to faster ROI for your business objectives. These benefits are now available to MLOps practitioners by using SageMaker Pipelines to automate the training, testing, and deployment of ML models. With local mode, you can now iterate much more quickly while developing scripts for use in a pipeline. Note that local pipeline instances can’t be viewed or run within the Studio IDE; however, additional viewing options for local pipelines will be available soon.

The SageMaker SDK provides a general purpose local mode configuration that allows developers to run and test supported processors and estimators in their local environment. You can use local mode training with multiple AWS-supported framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) as well as images you supply yourself.

SageMaker Pipelines, which builds a Directed Acyclic Graph (DAG) of orchestrated workflow steps, supports many activities that are part of the ML lifecycle. In local mode, the following steps are supported:

  • Processing job steps – A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation
  • Training job steps – An iterative process that teaches a model to make predictions by presenting examples from a training dataset
  • Hyperparameter tuning jobs – An automated way to evaluate and select the hyperparameters that produce the most accurate model
  • Conditional run steps – A step that provides a conditional run of branches in a pipeline
  • Model step – Using CreateModel arguments, this step can create a model for use in transform steps or later deployment as an endpoint
  • Transform job steps – A batch transform job that generates predictions from large datasets, and runs inference when a persistent endpoint isn’t needed
  • Fail steps – A step that stops a pipeline run and marks the run as failed

Solution overview

Our solution demonstrates the essential steps to create and run SageMaker Pipelines in local mode, which means using local CPU, RAM, and disk resources to load and run the workflow steps. Your local environment could be running on a laptop, using popular IDEs like VSCode or PyCharm, or it could be hosted by SageMaker using classic notebook instances.

Local mode allows data scientists to stitch together steps, which can include processing, training, and evaluation jobs, and run the entire workflow locally. When you’re done testing locally, you can rerun the pipeline in a SageMaker managed environment by replacing the LocalPipelineSession object with PipelineSession, which brings consistency to the ML lifecycle.

For this notebook sample, we use a standard publicly available dataset, the UCI Machine Learning Abalone Dataset. The goal is to train an ML model to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.

All of the code required to run this notebook sample is available on GitHub in the amazon-sagemaker-examples repository. In this notebook sample, each pipeline workflow step is created independently and then wired together to create the pipeline. We create the following steps:

  • Processing step (feature engineering)
  • Training step (model training)
  • Processing step (model evaluation)
  • Condition step (model accuracy)
  • Create model step (model)
  • Transform step (batch transform)
  • Register model step (model package)
  • Fail step (run failed)

The following diagram illustrates our pipeline.

Prerequisites

To follow along in this post, you need the following:

After these prerequisites are in place, you can run the sample notebook as described in the following sections.

Build your pipeline

In this notebook sample, we use SageMaker Script Mode for most of the ML processes, which means that we provide the actual Python code (scripts) to perform the activity and pass a reference to this code. Script Mode provides great flexibility to control the behavior of SageMaker processing and training jobs by allowing you to customize your code while still taking advantage of SageMaker pre-built containers like XGBoost or Scikit-Learn. The custom code is written to a Python script file using cells that begin with the magic command %%writefile, like the following:

%%writefile code/evaluation.py

The primary enabler of local mode is the LocalPipelineSession object, which is instantiated from the Python SDK. The following code segments show how to create a SageMaker pipeline in local mode. Although you can configure a local data path for many of the local pipeline steps, Amazon S3 is the default location to store the data output by the transformation. The new LocalPipelineSession object is passed to the Python SDK in many of the SageMaker workflow API calls described in this post. Notice that you can use the local_pipeline_session variable to retrieve references to the S3 default bucket and the current Region name.

from sagemaker.workflow.pipeline_context import LocalPipelineSession

# Create a `LocalPipelineSession` object so that each 
# pipeline step will run locally
# To run this pipeline in the cloud, you must change 
# the `LocalPipelineSession()` to `PipelineSession()`
local_pipeline_session = LocalPipelineSession()
region = local_pipeline_session.boto_region_name

default_bucket = local_pipeline_session.default_bucket()
prefix = "sagemaker-pipelines-local-mode-example"

Before we create the individual pipeline steps, we set some parameters used by the pipeline. Some of these parameters are string literals, whereas others are created as special enumerated types provided by the SDK. The enumerated typing ensures that valid settings are provided to the pipeline, such as this one, which is passed to the ConditionLessThanOrEqualTo condition further down:

from sagemaker.workflow.parameters import ParameterFloat

mse_threshold = ParameterFloat(name="MseThreshold", default_value=7.0)

To create a data processing step, which is used here to perform feature engineering, we use the SKLearnProcessor to load and transform the dataset. We pass the local_pipeline_session variable to the class constructor, which instructs the workflow step to run in local mode:

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=local_pipeline_session,
)

Next, we create our first actual pipeline step, a ProcessingStep object, as imported from the SageMaker SDK. The processor arguments are returned from a call to the SKLearnProcessor run() method. This workflow step is combined with other steps towards the end of the notebook to indicate the order of operation within the pipeline.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)

Next, we provide code to establish a training step by first instantiating a standard estimator using the SageMaker SDK. We pass the same local_pipeline_session variable to the estimator, named xgb_train, as the sagemaker_session argument. Because we want to train an XGBoost model, we must generate a valid image URI by specifying the following parameters, including the framework and several version parameters:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

model_path = f"s3://{default_bucket}/{prefix}/model"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1",
    py_version="py3",
    instance_type=instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    entry_point="code/abalone.py",
    instance_type=instance_type,
    instance_count=training_instance_count,
    output_path=model_path,
    role=role,
    sagemaker_session=local_pipeline_session,
)

We can optionally call additional estimator methods, for example set_hyperparameters(), to provide hyperparameter settings for the training job. Now that we have an estimator configured, we’re ready to create the actual training step. Once again, we import the TrainingStep class from the SageMaker SDK library:
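
The train_args passed to the TrainingStep below are not shown in this post. In the companion notebook, they typically come from calling fit() on the estimator under the local pipeline session, wiring in the processing step's outputs as training channels. The following is a hedged sketch of that call, not an exact copy of the sample code.

# Hypothetical sketch: produce `train_args` by calling fit() on the estimator.
train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)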

from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args)

Next, we build another processing step to perform model evaluation. This is done by creating a ScriptProcessor instance and passing the local_pipeline_session object as a parameter:

from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="script-abalone-eval",
    role=role,
    sagemaker_session=local_pipeline_session,
)
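
The evaluation ProcessingStep itself (step_eval) and the evaluation_report property file referenced later in the condition step are not shown in the post. A hedged sketch of how they are typically wired up, following the conventions of the sample notebook, looks like this:

from sagemaker.workflow.properties import PropertyFile

# Hypothetical sketch: expose the metrics written by code/evaluation.py as a
# PropertyFile so the condition step can read them with JsonGet.
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="code/evaluation.py",
)

step_eval = ProcessingStep(
    name="AbaloneEval",
    step_args=eval_args,
    property_files=[evaluation_report],
)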

To enable deployment of the trained model, either to a SageMaker real-time endpoint or to a batch transform, we need to create a Model object by passing the model artifacts, the proper image URI, and optionally our custom inference code. We then pass this Model object to a ModelStep, which is added to the local pipeline. See the following code:

from sagemaker.model import Model

model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir="code",
    entry_point="inference.py",
    role=role,
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(name="AbaloneCreateModel", 
    step_args=model.create(instance_type=instance_type)
)

Next, we create a batch transform step where we submit a set of feature vectors and perform inference. We first need to create a Transformer object and pass the local_pipeline_session parameter to it. Then we create a TransformStep, passing the required arguments, and add this to the pipeline definition:

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type=instance_type,
    instance_count=transform_instance_count,
    output_path=f"s3://{default_bucket}/{prefix}/transform",
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.steps import TransformStep

transform_args = transformer.transform(transform_data, content_type="text/csv")

step_transform = TransformStep(name="AbaloneTransform", step_args=transform_args)

Finally, we want to add a branch condition to the workflow so that we only run batch transform if the results of model evaluation meet our criteria. We can indicate this conditional by adding a ConditionStep with a particular condition type, like ConditionLessThanOrEqualTo. We then enumerate the steps for the two branches, essentially defining the if/else or true/false branches of the pipeline. The if_steps provided in the ConditionStep (step_create_model, step_transform) are run whenever the condition evaluates to True.
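
The step_fail referenced in the else branch is also not shown in this post. A hedged sketch of a FailStep, along the lines of the sample notebook, might look like the following.

from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

# Hypothetical sketch: stop the pipeline run and mark it as failed when the
# model's MSE does not meet the threshold.
step_fail = FailStep(
    name="AbaloneMSEFail",
    error_message=Join(on=" ", values=["Execution failed due to MSE >", mse_threshold]),
)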

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",),
    right=mse_threshold,
)

step_cond = ConditionStep(
    name="AbaloneMSECond",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_transform],
    else_steps=[step_fail],
)

The following diagram illustrates this conditional branch and the associated if/else steps. Only one branch is run, based on the outcome of the model evaluation step as compared in the condition step.

Now that we have all our steps defined and the underlying class instances created, we can combine them into a pipeline. We provide some parameters and, crucially, define the order of operation by simply listing the steps in the desired order. Note that the TransformStep isn’t shown here because it’s the target of the condition step and was provided as a step argument to the ConditionStep earlier.

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = f"LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=local_pipeline_session,
)

To run the pipeline, you must call two methods: pipeline.upsert(), which uploads the pipeline to the underlying service, and pipeline.start(), which starts running the pipeline. You can use various other methods to interrogate the run status, list the pipeline steps, and more. Because we used the local mode pipeline session, these steps are all run locally on your processor. The cell output beneath the start method shows the output from the pipeline:

pipeline.upsert(role_arn=role)
execution = pipeline.start()

You should see a message at the bottom of the cell output similar to the following:

Pipeline execution d8c3e172-089e-4e7a-ad6d-6d76caf987b7 SUCCEEDED
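
The execution object returned by pipeline.start() also exposes helper methods for inspecting the run. The snippet below is an illustrative example and assumes the local execution object implements the same describe() and list_steps() interface as a cloud pipeline execution.

# Hypothetical example of inspecting the pipeline execution.
print(execution.describe())
print(execution.list_steps())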

Revert to managed resources

After we’ve confirmed that the pipeline runs without errors and we’re satisfied with the flow and form of the pipeline, we can recreate the pipeline but with SageMaker managed resources and rerun it. The only change required is to use the PipelineSession object instead of LocalPipelineSession:

from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.pipeline_context import PipelineSession

local_pipeline_session = LocalPipelineSession()
pipeline_session = PipelineSession()

This tells the service to run each step that references this session object on SageMaker managed resources. Because the change is so small, we illustrate only the required code change in the following code cell; the same substitution of pipeline_session for local_pipeline_session must be made in every cell that currently uses the local_pipeline_session object.

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=pipeline_session,  # non-local session
)

After the local session object has been replaced everywhere, we recreate the pipeline and run it with SageMaker managed resources:

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = f"LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=pipeline_session, # non-local session
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()

Clean up

If you want to keep the Studio environment tidy, you can use the following methods to delete the SageMaker pipeline and the model. The full code can be found in the sample notebook.

import boto3

# delete models (delete_models is a helper function defined in the sample notebook)
sm_client = boto3.client("sagemaker")
model_prefix = "AbaloneCreateModel"
delete_models(sm_client, model_prefix)

# delete managed pipeline (delete_sagemaker_pipeline is also defined in the sample notebook)
pipeline_to_delete = 'SM-Managed-Pipeline'
delete_sagemaker_pipeline(sm_client, pipeline_to_delete)

Conclusion

Until recently, you could use the local mode feature of SageMaker Processing and SageMaker Training to iterate on your processing and training scripts locally, before running them on all the data with SageMaker managed resources. With the new local mode feature of SageMaker Pipelines, ML practitioners can now apply the same method when iterating on their ML pipelines, stitching the different ML workflows together. When the pipeline is ready for production, running it with SageMaker managed resources requires just a few lines of code changes. This shortens pipeline run time during development, leading to faster development cycles, while reducing the cost of SageMaker managed resources.

To learn more, visit Amazon SageMaker Pipelines or Use SageMaker Pipelines to Run Your Jobs Locally.


About the authors

Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to make the most of it. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers. Paul likes to help customers expand their machine learning initiatives to solve real-world problems.

Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.

Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit worked in early-stage AI startups followed by some time consulting in various roles in AI research, MLOps, and technical leadership.

Read More

Research Focus: Week of September 26, 2022

Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Clifford neural layers for PDE modeling

Johannes Brandstetter, Rianne van den Berg, Max Welling, and Jayesh K. Gupta

Partial differential equations (PDEs) are widely used to describe the simulation of physical processes as scalar and vector fields interacting and coevolving over time.

Recent research has focused on using neural surrogates to accelerate such simulations. However, current methods do not explicitly model relationships between fields and their correlated internal components, such as the scalar temperature field and the vector wind velocity field in weather simulations.  

In this work, we view the time evolution of such correlated fields through the lens of multivector fields, which consist of scalar, vector, and higher-order components, such as bivectors. An example of a bivector is the cross product of two vectors in 3-D, which is the plane segment spanned by these two vectors and is often represented as a vector itself. But the cross product flips sign under reflection, which a vector does not; that is what makes it a bivector.

The algebraic properties of multivectors, such as multiplication and addition, can be described by Clifford algebras, leading to Clifford neural layers such as Clifford convolutions and Clifford Fourier transforms. These layers are universally applicable and will find direct use in the areas of fluid dynamics, weather forecasting, and the modeling of physical systems in general. 


InAs-Al Hybrid Devices Passing the Topological Gap Protocol

Dr. Chetan Nayak 

While the promise of quantum is great, we are still in the early days of what is possible. Today’s quantum computers enable interesting research; however, innovators find themselves limited by the inadequate scale of these systems and are eager to do more. Microsoft is taking a more challenging, but ultimately more promising, approach to scaled quantum computing. We are engineering a full-stack quantum machine powered by topological qubits. We theorize that this type of qubit will be inherently more stable than qubits produced with existing methods without sacrificing size or speed. Earlier this year we had a major scientific breakthrough that cleared a significant hurdle: we discovered that we could produce the topological superconducting phase and its concomitant Majorana zero modes. In essence, we now have the building block for a topological qubit and quantum at scale. Learn more about this discovery from Dr. Chetan Nayak, Distinguished Engineer of Quantum at Microsoft, who just released a preprint to arXiv and met with leaders in the quantum computing industry.


DaVinci Image and Video Enhancement Toolkit 

Huan Yang, Jianlong Fu 

To help users enhance low-quality videos in real time, on edge equipment and with limited computing power, researchers from Microsoft Research Asia launched a set of intelligent image/video enhancement tools called DaVinci. This toolkit aims to address the pain points of existing video enhancement and restoration tools, take full advantage of AI technology, and lower the barrier for users to process video footage.

The targeted features of the DaVinci toolkit include low-level image and video enhancement tasks, real-time image and video filters, and visual quality enhancement such as super-resolution, denoising, and video frame interpolation. The backend technology of the DaVinci toolkit is supported by industry-leading large-scale, low-level vision pre-training technology, supplemented by training on a large amount of data. To maximize the robustness of the model, researchers used four million publicly available images and videos, with contents covering landscapes, buildings, people, and so on. To ensure an adequate amount of training data and rich data types, researchers synthesized data with various degradations, so that the entire model training could cover more actual user application scenarios. The DaVinci toolkit has been packaged and released on GitHub.


RetrieverTTS: Modeling decomposed factors for text-based speech insertion 

Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo 

In the post-pandemic world, a large portion of meetings, conferences, and trainings has moved online, leading to a sharp increase in demand for audio and video creation. However, making zero-mistake audio/video recordings is time consuming, with re-recordings required to fix even the smallest mistakes. Therefore, a simple technique for making partial corrections to recordings is urgently needed. Researchers from Microsoft Research Asia’s Intelligent Multimedia Group and the Azure Speech Team developed a text-based speech editing system to address this problem. The system supports cutting, copying, and pasting operations, as well as inserting synthesized speech segments based on text transcriptions.


AI4Science expands with new research lab in Berlin

Deep learning is set to transform the natural sciences, dramatically improving our ability to model and predict natural phenomena over widely varying scales of space and time. In fact, this may be the dawn of a new paradigm of scientific discovery. To help make this new paradigm a reality, Microsoft Research recently established AI4Science, a new global organization that will bring together experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines.

This month, the AI4Science organization expanded with a new presence in Berlin, to be led by Dr. Frank Noé. As a “bridge professor,” Noé specializes in work at the interfaces between the fields of mathematics and computer science as well as physics, biology, chemistry and pharmacy. For his pioneering work in developing innovative computational methods in biophysics, he has received awards and funding from the American Chemical Society and the European Research Council, among others.

In the video below, Noé discusses the new lab in Berlin, and the potential for machine learning to advance the natural sciences, with Chris Bishop, Microsoft Technical Fellow and director, AI4Science.

The post Research Focus: Week of September 26, 2022 appeared first on Microsoft Research.

Read More

Searidge Technologies Offers a Safety Net for Airports

Planes taxiing for long periods due to ground traffic — or circling the airport while awaiting clearance to land — don’t just make travelers impatient. They burn fuel unnecessarily, harming the environment and adding to airlines’ costs.

Searidge Technologies, based in Ottawa, Canada, has created AI-powered software to help the aviation industry avoid such issues, increasing efficiency and enhancing safety for airports.

Its Digital Tower and Apron solutions, powered by NVIDIA GPUs, use vision AI to manage traffic control for airports and alert users to safety concerns in real time. Searidge enables airports to handle 15-30% more aircraft per hour and reduce the number of tarmac incidents.

The company’s tech is used across the world, including at London’s Heathrow Airport, Fort Lauderdale-Hollywood International Airport in Florida and Dubai International Airport, to name a few.

In June, Searidge’s Digital Apron and Tower Management System (DATMS) went operational at Hong Kong International Airport as part of an initial phase of the Airport Authority Hong Kong’s large-scale expansion plan, which will bring machine learning to a new, integrated airport operations center.

In addition, Searidge provides the Civil Aviation Department of Hong Kong’s air-traffic control systems with next-generation safety enhancements using its vision AI software.

The deployment in Hong Kong is the industry’s largest digital platform for tower and apron management — and the first collaboration between an airport and an air-navigation service provider for a single digital platform.

Searidge is a member of NVIDIA Metropolis, a partner program focused on bringing to market a new generation of vision AI applications that make the world’s most important spaces and operations safer and more efficient.

Digital Tools for Airports

The early 2000s saw massive growth and restructuring of airports — and with this came increased use of digital tools in the aviation industry.

Founded in 2006, Searidge was one of the first companies to bring machine learning to video processing in the aviation space, according to Pat Urbanek, the company’s vice president of business development for Asia Pacific and the Middle East.

“Video processing software for air-traffic control didn’t exist before,” Urbanek said. “It’s taken a decade to become mainstream — but now, intelligent video and machine learning have been brought into airport operations, enabling new levels of automation in air-traffic control and airside operations to enhance safety and efficiency.”

DATMS’s underlying machine learning platform, called Aimee, enables traffic-lighting automation based on data from radars and 4K-resolution video cameras. Aimee is trained to detect aircraft and vehicles. And DATMS is programmed based on the complex roadway rules that determine how buses and other vehicles should operate on service roads across taxiways.

After analyzing video data, the AI-enabled system activates or deactivates airports’ traffic lights in real time, based on when it’s appropriate for passenger buses and other vehicles to move. The status of each traffic light and additional details can also be visualized on end-user screens in airport traffic control rooms.

“What size is an aircraft? Does it have enough space to turn on the runway? Is it going too fast? All of this information and more is sent out over the Searidge Platform and displayed on screen based on user preference,” said Marco Rueckert, vice president of technology at Searidge.

Image courtesy of Searidge Technologies

The same underlying technology is applied to provide enhanced safety alerts for aircraft departure and arrival. In real time, DATMS alerts air traffic controllers of safety-standard breaches — taking into consideration clearances for aircraft to enter a runway, takeoff or land.

Speedups With NVIDIA GPUs

Searidge uses NVIDIA GPUs to optimize inference throughput across its deployments at airports around the globe. To train its AI models, Searidge uses an NVIDIA DGX A100 system.

“The NVIDIA platform allowed us to really bring down the hardware footprint and costs from the customer’s perspective,” Rueckert said. “It provides the scalability factor, so we can easily add more cameras with increasing resolution, which ultimately helps us solve more problems and address more customer needs.”

The company is also exploring the integration of voice data — based on communication between pilots and air-traffic controllers — within its machine learning platform to further enhance airport operations.

Searidge’s Digital Tower and Apron solutions can be customized for the unique challenges that come with varying airport layouts and traffic patterns.

“Of course, having aircraft land on time and letting passengers make their connections increases business and efficiency, but our technology has an environmental impact as well,” Urbanek said. “It can prevent burning of huge amounts of fuel — in the air or at the gate — by providing enhanced efficiency and safety for taxiing, takeoff and landing.”

Watch the latest GTC keynote by NVIDIA founder and CEO Jensen Huang to discover how vision AI and other groundbreaking technologies are shaping the world:

Feature video courtesy of Dubai Airports.

The post Searidge Technologies Offers a Safety Net for Airports appeared first on NVIDIA Blog.

Read More

Creator EposVox Shares Streaming Lessons, Successes This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. In the coming weeks, we’ll be deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

TwitchCon — the world’s top gathering of live streamers — kicks off Friday with the new line of GeForce RTX 40 Series GPUs bringing incredible new technology — from AV1 to AI — to elevate live streams for aspiring and professional Twitch creators alike.

In addition, creator and educator EposVox is in the NVIDIA Studio to discuss his influences, inspiration and advice for getting the most out of live streams.

Plus, join the #From2Dto3D challenge this month by sharing a 2D piece of art next to a 3D rendition of it for a chance to be featured on the NVIDIA Studio social media channels. Be sure to tag #From2Dto3D to enter.

AV1 and Done

Releasing on Oct. 12, the new GeForce RTX 40 Series GPUs feature the eighth-generation NVIDIA video encoder, NVENC for short, now with support for AV1 encoding. For creators like EposVox, the new AV1 encoder will deliver 40% increased efficiency, unlocking higher resolutions and crisper image quality.

The GeForce RTX 40 Series delivers higher video quality with AV1.

NVIDIA has collaborated with OBS Studio to add AV1 support to its next software release, expected later this month. In addition, Discord is enabling AV1 end to end for the first time later this year. GeForce RTX 40 Series owners will be able to stream with crisp, clear image quality at 1440p and even 4K resolution at 60 frames per second.

GeForce RTX 40 Series GPUs also feature dual encoders. This allows creators to capture up to 8K60. And when it’s time to cut a VOD of live streams, the dual encoders work in tandem, dividing work automatically, which slashes export times nearly in half. Blackmagic Design’s DaVinci Resolve, the popular Voukoder plugin for Adobe Premiere Pro, and Jianying — the top video editing app in China — are all enabling dual encoder through encode presets. Expect dual encoder availability for these apps in October.


The GeForce RTX 40 Series GPUs also give game streamers an unprecedented gen-to-gen frame-rate boost in PC games alongside the new NVIDIA DLSS 3 technology, which accelerates performance by up to 4x. This will unlock richer, more immersive ray-traced experiences to share via live streams, such as in Cyberpunk 2077 and Portal with RTX.

Virtual Live Streams Come to Life

VTube Studio is a leading app for virtual streamers (VTubers) that makes it easy and fun to bring digital avatars to life on a live stream.

Seamlessly control avatars with AI by using a webcam and GeForce RTX GPU in VTube Studio.

VTube Studio is adding support this month for the NVIDIA Broadcast AR SDK, allowing users to seamlessly control their avatars with AI by using a regular webcam and a GeForce RTX GPU.

Objectively Blissful Streaming

OBS doesn’t stand for objectively blissful streaming, but it should.

OBS Studio is free, open-source software for video recording and live streaming. It’s one of EposVox’s essential apps, as he said it “allows me to produce my content at a rapid pace that’s constantly evolving.”

The software now features native integration of the AI-powered NVIDIA Broadcast effects, including Virtual Background, Noise Removal and Room Echo Removal.

In addition to adding AV1 support for GeForce RTX 40 Series GPUs later this month, the recent OBS 28.0 release added support for high-efficiency video coding (HEVC or H.265), improving video compression rates by 15% across a wide range of NVIDIA GPUs. It also now includes support for high-dynamic range (HDR), offering a greater range of bright and dark colors, which brings stunning vibrance and dramatic improvements in visual quality.

Broadcast for All

The SDKs that power NVIDIA Broadcast are available to developers, enabling native AI feature support in devices from Logitech, Corsair and Elgato, as well as advanced workflows in OBS and Notch software.

Features released last month at NVIDIA GTC include new and updated AI-powered effects.


Virtual Background now includes temporal information, so random objects in the background will no longer create distractions by flashing in and out. This will be available in the next major version of OBS Studio.


Face Expression Estimation allows apps to accurately track facial expressions for face meshes, even with the simplest of webcams. It’s hugely beneficial to VTubers and can be found in the next version of VTube Studio.


Eye Contact allows podcasters to appear as if they’re looking directly at the camera — highly useful for when the user is reading a script or looking away to engage with viewers in the chat window.

It’s EposVox’s World, We’re All Just Living in It

Adam Taylor, who goes by the stage name EposVox or “The Stream Professor,” runs a YouTube channel focused on tech education for content creators and streamers.

He’s been making videos since before YouTube even existed.

“DailyMotion, Google Video, does anyone remember MetaCafe? X-Fire?” said EposVox.

He maintains a strong passion for educational content, which stemmed from his desire to learn video editing workflows as a young man, when he lacked the wealth of knowledge and resources available today.

“I immediately ran into constant walls of information that were kept behind closed doors when it came to deeper video topics, audio setups and more,” the artist said. “It was really frustrating — there was nothing and no one, aside from a decade or two of DOOM9 forums and outdated broadcast books, that had ever heard of a USB port to help guide me.”

While content creation and live streaming, especially with software like OBS Studio and XSplit, are EposVox’s primary focuses, he also aspires to make technology more fun and easy to use.

“The GPU acceleration in 3D and video apps, and now all the AI innovations that are coming to new generations, are incredible — I’m not sure I’d be able to create on the level that I do, nor at the speed I do, without NVIDIA GPUs.”

When searching for content inspiration, EposVox deploys a proactive approach — he’s all about asking questions. “Whether it’s trying to figure out how to do some overkill new setup for myself, breaking down neat effects I see elsewhere, or just asking which point in the process might cause friction for a viewer — I ask questions, figure out the best way to answer those questions, and deliver them to viewers,” he said.

EposVox stressed the importance of experimenting with multiple creative applications, noting that “every tool I can add to my tool chest enhances my creativity by giving me more options or ways to create, and more experiences with new processes for creating things.” This is especially true for the use of AI in his creative workflows, he added.

“What I love about AI art generation right now is the fact that I can just type any idea that comes to mind, in plain text language, and see it come to life,” he said. “I may not get exactly what I was expecting, I may have to continue refining my language and ideas to approach the representation I’m after — but knocking down the barrier between idea conception and seeing some form of that idea in front of me, I cannot overstate the impact that is created here.”

For an optimal live-streaming setup, EposVox recommends a PC equipped with a GeForce RTX GPU. His GeForce RTX 3090 desktop GPU, he said, can handle the rigors of the entire creative process and remain fast even when he’s constantly switching between computationally complex creative applications.

The artist said, “These days, I use GPU-accelerated NVENC encoding for capturing, exporting videos and live streaming.”

EposVox can’t wait for his GeForce RTX 4090 GPU upgrade, primarily to take advantage of the new dual encoders, noting, “I’ll probably end up saving a few hours a day since less time waiting on renders and uploads means I can move from project to project much quicker, rather than having to walk away and work on other things. I’ll be able to focus so much more.”

When asked for parting advice, EposVox didn’t hesitate: “If you commit to a creative vision for a project, but the entity you’re making it for — the company, agency, person or whomever — takes the project in a completely different direction, find some way to still bring your vision to life,” he said. “You’ll be so much better off — in terms of how you feel and the experience gained — if you can still bring that to life.”

YouTuber and live streamer EposVox, aka “The Stream Professor.”

For more tips on live streaming and video exports, check out EposVox’s YouTube channel.

And for step-by-step tutorials for all creative fields — created by industry-leading artists and community showcases — check out the NVIDIA Studio YouTube channel.

Finally, join the #From2Dto3D challenge by posting on Instagram, Twitter or Facebook.

Get updates directly in your inbox by subscribing to the NVIDIA Studio newsletter.

The post Creator EposVox Shares Streaming Lessons, Successes This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Read More

Learning on the edge

Microcontrollers, miniature computers that can run simple commands, are the basis for billions of connected devices, from internet-of-things (IoT) devices to sensors in automobiles. But cheap, low-power microcontrollers have extremely limited memory and no operating system, making it challenging to train artificial intelligence models on “edge devices” that work independently from central computing resources.

Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. For instance, training a model on a smart keyboard could enable the keyboard to continually learn from the user’s writing. However, the training process requires so much memory that it is typically done using powerful computers at a data center, before the model is deployed on a device. This is more costly and raises privacy issues since user data must be sent to a central server.

To address this problem, researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory. Other training solutions designed for connected devices can use more than 500 megabytes of memory, greatly exceeding the 256-kilobyte capacity of most microcontrollers (there are 1,024 kilobytes in one megabyte).

The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes.

This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.

“Our study enables IoT devices to not only perform inference but also continuously update the AI models to newly collected data, paving the way for lifelong on-device learning. The low resource utilization makes deep learning more accessible and can have a broader reach, especially for low-power edge devices,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing this innovation.

Joining Han on the paper are co-lead authors and EECS PhD students Ji Lin and Ligeng Zhu, as well as MIT postdocs Wei-Ming Chen and Wei-Chen Wang, and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.

Lightweight training

A common type of machine-learning model is known as a neural network. Loosely based on the human brain, these models contain layers of interconnected nodes, or neurons, that process data to complete a task, such as recognizing people in photos. The model must be trained first, which involves showing it millions of examples so it can learn the task. As it learns, the model increases or decreases the strength of the connections between neurons, which are known as weights.

The model may undergo hundreds of updates as it learns, and the intermediate activations must be stored during each round. In a neural network, activations are the intermediate results each layer produces as data passes through the network, and they must be kept in memory so the weight updates can be computed. Because there may be millions of weights and activations, training a model requires much more memory than running a pre-trained model, Han explains.
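A back-of-the-envelope sketch in Python, with made-up layer sizes rather than figures from the paper, shows why the stored activations, not the weights, are what dominate training memory:

    # Illustrative only: hypothetical layer sizes, not numbers from the paper.
    # Every layer's output must be kept for backpropagation during training.
    def conv_memory_kb(c_in, c_out, kernel, out_h, out_w, bytes_per_value=4):
        """Return (weight_kb, activation_kb) for one convolutional layer."""
        weights = c_in * c_out * kernel * kernel
        activations = c_out * out_h * out_w        # one output feature map
        return weights * bytes_per_value / 1024, activations * bytes_per_value / 1024

    # Hypothetical three-layer vision model on a 128x128 input.
    layers = [(3, 16, 3, 128, 128), (16, 32, 3, 64, 64), (32, 64, 3, 32, 32)]
    weight_kb = sum(conv_memory_kb(*spec)[0] for spec in layers)
    act_kb = sum(conv_memory_kb(*spec)[1] for spec in layers)
    print(f"weights: {weight_kb:.0f} KB, activations kept for training: {act_kb:.0f} KB")

Even in this toy example the activations need roughly twenty times the memory of the weights, whereas inference can discard each layer's output as soon as the next layer has consumed it.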

Han and his collaborators employed two algorithmic solutions to make the training process more efficient and less memory-intensive. The first, known as sparse update, uses an algorithm that identifies the most important weights to update at each round of training. The algorithm freezes weights one at a time and stops once the accuracy dips to a set threshold. The remaining weights are updated, while the activations corresponding to the frozen weights don’t need to be stored in memory.

“Updating the whole model is very expensive because there are a lot of activations, so people tend to update only the last layer, but as you can imagine, this hurts the accuracy. For our method, we selectively update those important weights and make sure the accuracy is fully preserved,” Han says.
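As a rough illustration of the idea, and not the paper's selection algorithm or its microcontroller runtime, freezing most parameters in plain PyTorch and handing the optimizer only a hand-picked "important" subset might look like this (the researchers' engine additionally avoids storing the activations tied to frozen layers, which this simple sketch does not fully capture):

    # Illustration of sparse update: freeze most parameters, train a chosen few.
    # The trainable set here is hand-picked; the paper selects it automatically.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 2),  # e.g., "person" vs. "no person"
    )

    # Hypothetical importance ranking: update only the second conv layer and the classifier.
    trainable = {"2.weight", "2.bias", "6.weight", "6.bias"}
    for name, param in model.named_parameters():
        param.requires_grad = name in trainable

    # The optimizer only tracks the unfrozen parameters; no weight gradients
    # are computed for the frozen layers.
    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=0.01
    )

    x = torch.randn(1, 3, 64, 64)    # dummy input image
    loss = model(x).sum()            # stand-in for a real training loss
    loss.backward()
    optimizer.step()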

Their second solution is quantized training, which simplifies the weights, typically stored as 32-bit numbers. An algorithm rounds the weights so they occupy only eight bits, a process known as quantization, which cuts the amount of memory needed for both training and inference. Inference is the process of applying a model to a dataset and generating a prediction. The algorithm then applies a technique called quantization-aware scaling (QAS), which acts like a multiplier to adjust the ratio between the weights and their gradients, avoiding the drop in accuracy that quantized training can otherwise cause.
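A minimal sketch of the basic quantize/dequantize round trip described above, using a simplified per-tensor scheme of my own rather than the released implementation, and omitting the QAS gradient rescaling step:

    # Illustrative per-tensor 8-bit quantization: map float32 weights to int8
    # plus a single scale factor, then recover an approximation of the originals.
    import numpy as np

    def quantize_int8(w):
        """Return (int8 weights, float scale) for a float32 weight array."""
        max_abs = float(np.abs(w).max())
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = (np.random.randn(256) * 0.05).astype(np.float32)   # pretend layer weights
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)

    print("memory:", w.nbytes, "bytes as float32 ->", q.nbytes, "bytes as int8")
    print("max rounding error:", float(np.abs(w - w_hat).max()))

Storing eight bits per weight instead of 32 cuts weight memory by a factor of four; the rounding error introduced is what QAS is designed to compensate for during training.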

The researchers developed a system, called a tiny training engine, that can run these algorithmic innovations on a simple microcontroller that lacks an operating system. This system changes the order of steps in the training process so more work is completed in the compilation stage, before the model is deployed on the edge device.

“We push a lot of the computation, such as auto-differentiation and graph optimization, to compile time. We also aggressively prune the redundant operators to support sparse updates. Once at runtime, we have much less workload to do on the device,” Han explains.
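Conceptually, and this is my own illustration rather than the tiny training engine's actual code, fixing the network and the sparse-update plan ahead of time means the backward-pass schedule can be worked out at compile time and redundant operators pruned, leaving the device a short, static list of steps to execute:

    # Conceptual sketch: decide at compile time which backward operators the
    # sparse-update plan needs, and prune the rest before deployment.
    LAYERS = ["conv1", "conv2", "conv3", "classifier"]
    TRAINABLE = {"conv3", "classifier"}      # hypothetical sparse-update plan

    def compile_backward_schedule(layers, trainable):
        """Emit only the backward operators the update plan actually needs."""
        earliest = min(layers.index(name) for name in trainable)
        schedule = []
        for i in range(len(layers) - 1, earliest - 1, -1):
            name = layers[i]
            if i > earliest:
                schedule.append(("grad_input", name))    # pass the error backward
            if name in trainable:
                schedule.append(("grad_weight", name))   # frozen layers: pruned
        return schedule

    print(compile_backward_schedule(LAYERS, TRAINABLE))
    # [('grad_input', 'classifier'), ('grad_weight', 'classifier'), ('grad_weight', 'conv3')]

Nothing before the earliest trainable layer appears in the schedule at all, which is one way the runtime workload on the device shrinks.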

A successful speedup

Their optimized framework required only 157 kilobytes of memory to train a machine-learning model on a microcontroller, whereas other techniques designed for lightweight training would still need between 300 and 600 megabytes.

They tested their framework by training a computer vision model to detect people in images. After only 10 minutes of training, it learned to complete the task successfully. Their method was able to train a model more than 20 times faster than other approaches.

Now that they have demonstrated the success of these techniques for computer vision models, the researchers want to apply them to language models and different types of data, such as time-series data. At the same time, they want to use what they’ve learned to shrink the size of larger models without sacrificing accuracy, which could help reduce the carbon footprint of training large-scale machine-learning models.

“AI model adaptation/training on a device, especially on embedded controllers, is an open challenge. This research from MIT has not only successfully demonstrated the capabilities, but also opened up new possibilities for privacy-preserving device personalization in real-time,” says Nilesh Jain, a principal engineer at Intel who was not involved with this work. “Innovations in the publication have broader applicability and will ignite new systems-algorithm co-design research.”

“On-device learning is the next major advance we are working toward for the connected intelligent edge. Professor Song Han’s group has shown great progress in demonstrating the effectiveness of edge devices for training,” adds Jilei Hou, vice president and head of AI research at Qualcomm. “Qualcomm has awarded his team an Innovation Fellowship for further innovation and advancement in this area.”

This work is funded by the National Science Foundation, the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, Amazon, Intel, Qualcomm, Ford Motor Company, and Google.


Read More

MBW: Multi-view Bootstrapping in the Wild

Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for… (Apple Machine Learning Research)