AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. While the challenges for this technology are centered around noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems — this is called audiovisual ASR (AV-ASR).

Although lip motion can provide strong signals for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in videos in the wild (e.g., due to egocentric viewpoints, face coverings, and low resolution) and therefore, a new emerging area of research is unconstrained AV-ASR (e.g., AVATAR), which investigates the contribution of entire visual frames, and not just the mouth region.

Building audiovisual datasets for training AV-ASR models, however, is challenging. Datasets such as How2 and VisSpeech have been created from instructional videos online, but they are small in size. In contrast, the models themselves are typically large and consist of both visual and audio encoders, and so they tend to overfit on these small datasets. Nonetheless, there have been a number of recently released large-scale audio-only models that are heavily optimized via large-scale training on massive audio-only data obtained from audio books, such as LibriLight and LibriSpeech. These models contain billions of parameters, are readily available, and show strong generalization across domains.

With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR”, we present a simple method for augmenting existing large-scale audio-only models with visual information, at the same time performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information into large language models for vision-text tasks) using lightweight trainable adaptors that can be trained on a small amount of weakly labeled video data with minimum additional training time and parameters. We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).

Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ, in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). The visual context can provide helpful clues for robust speech recognition especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only mistake “clove” to “loaf” in the generated transcript).

Injecting vision using lightweight modules

Our goal is to add visual understanding capabilities to an existing audio-only ASR model while maintaining its generalization performance to various domains (both AV and audio-only domains).

To achieve this, we augment an existing state-of-the-art ASR model (Best-RQ) with the following two components: (i) linear visual projector and (ii) lightweight adapters. The former projects visual features in the audio token embedding space. This process allows the model to properly connect separately pre-trained visual feature and audio input token representations. The latter then minimally modifies the model to add understanding of multimodal inputs from videos. We then train these additional modules on unlabeled web videos from the HowTo100M dataset, along with the outputs of an ASR model as pseudo ground truth, while keeping the rest of the Best-RQ model frozen. Such lightweight modules enable data-efficiency and strong generalization of performance.

We evaluated our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.

Curriculum learning for vision injection

After the initial evaluation, we discovered empirically that with a naïve single round of joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy that decouples these two factors — domain adaptation and visual feature integration — and trains the network in a sequential manner. In the first phase, the adapter parameters are optimized without feeding visual tokens at all. Once the adapters are trained, we add the visual tokens and train the visual projection layers alone in the second phase while the trained adapters are kept frozen.

The first stage focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector must simply learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase just once, as an iterative application of alternating phases leads to performance degradation.

Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer encoder-decoder model, and a frozen CLIP encoder (frozen layers shown in gray with a lock symbol), in conjunction with two lightweight trainable modules – (i) visual projection layer (orange) and bottleneck adapters (blue) to enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all the other parts are kept frozen.

The plots below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.

Effects of curriculum learning. Red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up until 4 visual tokens, at which point it saturates.

Results in zero-shot AV-ASR

We compare AVFormer to BEST-RQ, the audio version of our model, and AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech and Ego4D. AVFormer outperforms AVATAR and BEST-RQ on all, even outperforming both AVATAR and BEST-RQ when they are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ, this involves training 600M parameters, while AVFormer only trains 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). Moreover, we also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.

Comparison to state-of-the-art methods for zero-shot performance across different AV-ASR datasets. We also show performances on LibriSpeech which is audio-only. Results are reported as WER % (lower is better). AVATAR and BEST-RQ are finetuned end-to-end (all parameters) on HowTo100M whereas AVFormer works effectively even with 5% of the dataset thanks to the small set of finetuned parameters.

Conclusion

We introduce AVFormer, a lightweight method for adapting existing, frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly allows both domain transfer and visual input mixing in the same, parameter efficient model.

Acknowledgements

This research was conducted by Paul Hongsuck Seo, Arsha Nagrani and Cordelia Schmid.

Read More

Robustness in Multimodal Learning under Train-Test Modality Mismatch

Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the type of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness short-comings of these approaches and propose two intervention techniques leading…Apple Machine Learning Research

Growing and Serving Large Open-domain Knowledge Graphs

*= Equal Contributors
Applications of large open-domain knowledge graphs (KGs) to real-world problems pose many unique challenges. In this paper, we present extensions to Saga our platform for continuous construction and serving of knowledge at scale. In particular, we describe a pipeline for training knowledge graph embeddings that powers key capabilities such as fact ranking, fact verification, a related entities service, and support for entity linking. We then describe how our platform, including graph embeddings, can be leveraged to create a Semantic Annotation service that links…Apple Machine Learning Research

Unconstrained Channel Pruning

Modern neural networks are growing not only in size and complexity but also in inference time. One of the most effective compression techniques — channel pruning — combats this trend by removing channels from convolutional weights to reduce resource consumption. However, removing channels is non-trivial for multi-branch segments of a model, which can introduce extra memory copies at inference time. These copies incur increase latency — so much so, that the pruned model is even slower than the original, unpruned model. As a workaround, existing pruning works constrain certain channels to be…Apple Machine Learning Research

Implement a multi-object tracking solution on a custom dataset with Amazon SageMaker

Implement a multi-object tracking solution on a custom dataset with Amazon SageMaker

The demand for multi-object tracking (MOT) in video analysis has increased significantly in many industries, such as live sports, manufacturing, and traffic monitoring. For example, in live sports, MOT can track soccer players in real time to analyze physical performance such as real-time speed and moving distance.

Since its introduction in 2021, ByteTrack remains to be one of best performing methods on various benchmark datasets, among the latest model developments in MOT application. In ByteTrack, the author proposed a simple, effective, and generic data association method (referred to as BYTE) for detection box and tracklet matching. Rather than only keep the high score detection boxes, it also keeps the low score detection boxes, which can help recover unmatched tracklets with these low score detection boxes when occlusion, motion blur, or size changing occurs. The BYTE association strategy can also be used in other Re-ID based trackers, such as FairMOT. The experiments showed improvements compared to the vanilla tracker algorithms. For example, FairMOT achieved an improvement of 1.3% on MOTA (FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking), which is one of the main metrics in the MOT task when applying BYTE in data association.

In the post Train and deploy a FairMOT model with Amazon SageMaker, we demonstrated how to train and deploy a FairMOT model with Amazon SageMaker on the MOT challenge datasets. When applying a MOT solution in real-world cases, you need to train or fine-tune a MOT model on a custom dataset. With Amazon SageMaker Ground Truth, you can effectively create labels on your own video dataset.

Following on the previous post, we have added the following contributions and modifications:

  • Generate labels for a custom video dataset using Ground Truth
  • Preprocess the Ground Truth generated label to be compatible with ByteTrack and other MOT solutions
  • Train the ByteTrack algorithm with a SageMaker training job (with the option to extend a pre-built container)
  • Deploy the trained model with various deployment options, including asynchronous inference

We also provide the code sample on GitHub, which uses SageMaker for labeling, building, training, and inference.

SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker provides several built-in algorithms and container images that you can use to accelerate training and deployment of ML models. Additionally, custom algorithms such as ByteTrack can also be supported via custom-built Docker container images. For more information about deciding on the right level of engagement with containers, refer to Using Docker containers with SageMaker.

SageMaker provides plenty of options for model deployment, such as real-time inference, serverless inference, and asynchronous inference. In this post, we show how to deploy a tracking model with different deployment options, so that you can choose the suitable deployment method in your own use case.

Overview of solution

Our solution consists of the following high-level steps:

  1. Label the dataset for tracking, with a bounding box on each object (for example, pedestrian, car, and so on). Set up the resources for ML code development and execution.
  2. Train a ByteTrack model and tune hyperparameters on a custom dataset.
  3. Deploy the trained ByteTrack model with different deployment options depending on your use case: real-time processing, asynchronous, or batch prediction.

The following diagram illustrates the architecture in each step.
overview_flow

Prerequisites

Before getting started, complete the following prerequisites:

  1. Create an AWS account or use an existing AWS account.
  2. We recommend running the source code in the us-east-1 Region.
  3. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge for single GPU training, or ml.p3.16xlarge) for the distributed training job. Other types of GPU instances are also supported, with various performance differences.
  4. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for inference endpoint.
  5. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for running batch prediction with processing jobs.

If this is your first time running SageMaker services on the aforementioned instance types, you may have to request a quota increase for the required instances.

Set up your resources

After you complete all the prerequisites, you’re ready to deploy the solution.

  1. Create a SageMaker notebook instance. For this task, we recommend using the ml.t3.medium instance type. While running the code, we use docker build to extend the SageMaker training image with the ByteTrack code (the docker build command will be run locally within the notebook instance environment). Therefore, we recommend increasing the volume size to 100 GB (default volume size to 5 GB) from the advanced configuration options. For your AWS Identity and Access Management (IAM) role, choose an existing role or create a new role, and attach the AmazonS3FullAccess, AmazonSNSFullAccess, AmazonSageMakerFullAccess, and AmazonElasticContainerRegistryPublicFullAccess policies to the role.
  2. Clone the GitHub repo to the /home/ec2-user/SageMaker folder on the notebook instance you created.
  3. Create a new Amazon Simple Storage Service (Amazon S3) bucket or use an existing bucket.

Label the dataset

In the data-preparation.ipynb notebook, we download an MOT16 test video file and split the video file into small video files with 200 frames. Then we upload those video files to the S3 bucket as the data source for labeling.

To label the dataset for the MOT task, refer to Getting started. When the labeling job is complete, we can access the following annotation directory at the job output location in the S3 bucket.

The manifests directory should contain an output folder if we finished labeling all the files. We can see the file output.manifest in the output folder. This manifest file contains information about the video and video tracking labels that you can use later to train and test a model.

Train a ByteTrack model and tune hyperparameters on the custom dataset

To train your ByteTrack model, we use the bytetrack-training.ipynb notebook. The notebook consists of the following steps:

  1. Initialize the SageMaker setting.
  2. Perform data preprocessing.
  3. Build and push the container image.
  4. Define a training job.
  5. Launch the training job.
  6. Tune hyperparameters.

Especially in data preprocessing, we need to convert the labeled dataset with the Ground Truth output format to the MOT17 format dataset, and convert the MOT17 format dataset to a MSCOCO format dataset (as shown in the following figure) so that we can train a YOLOX model on the custom dataset. Because we keep both the MOT format dataset and MSCOCO format dataset, you can train other MOT algorithms without separating detection and tracking on the MOT format dataset. You can easily change the detector to other algorithms such as YOLO7 to use your existing object detection algorithm.

Deploy the trained ByteTrack model

After we train the YOLOX model, we deploy the trained model for inference. SageMaker provides several options for model deployment, such as real-time inference, asynchronous inference, serverless inference, and batch inference. In our post, we use the sample code for real-time inference, asynchronous inference, and batch inference. You can choose the suitable code from these options based on your own business requirements.

Because SageMaker batch transform requires the data to be partitioned and stored on Amazon S3 as input and the invocations are sent to the inference endpoints concurrently, it doesn’t meet the requirements in object tracking tasks where the targets need to be sent in a sequential manner. Therefore, we don’t use the SageMaker batch transform jobs to run the batch inference. In this example, we use SageMaker processing jobs to do batch inference.

The following table summarizes the configuration for our inference jobs.

Inference Type Payload Processing Time Auto Scaling
Real-time Up to 6 MB Up to 1 minute Minimum instance count is 1 or higher
Asynchronous Up to 1 GB Up to 15 minutes Minimum instance count can be zero
Batch (with processing job) No limit No limit Not supported

Deploy a real-time inference endpoint

To deploy a real-time inference endpoint, we can run the bytetrack-inference-yolox.ipynb notebook. We separate ByteTrack inference into object detection and tracking. In the inference endpoint, we only run the YOLOX model for object detection. In the notebook, we create a tracking object, receive the result of object detection from the inference endpoint, and update trackers.

We use SageMaker PyTorchModel SDK to create and deploy a ByteTrack model as follows:

from sagemaker.pytorch.model import PyTorchModel
 
pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="sagemaker-serving/code",
    entry_point="inference.py",
    framework_version="1.7.1",
    py_version="py3",
)
 
endpoint_name =<endpint name>
pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name=endpoint_name
)

After we deploy the model to an endpoint successfully, we can invoke the inference endpoint with the following code snippet:

with open(f"datasets/frame_{frame_id}.png", "rb") as f:
    payload = f.read()

response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/x-image", Body=payload
)
outputs = json.loads(response["Body"].read().decode())

We run the tracking task on the client side after accepting the detection result from the endpoint (see the following code). By drawing the tracking results in each frame and saving as a tracking video, you can confirm the tracking result on the tracking video.

aspect_ratio_thresh = 1.6
min_box_area = 10
tracker = BYTETracker(
        frame_rate=30,
        track_thresh=0.5,
        track_buffer=30,
        mot20=False,
        match_thresh=0.8
    )

online_targets = tracker.update(torch.as_tensor(outputs[0]), [height, width], (800, 1440))
online_tlwhs = []
online_ids = []
online_scores = []
for t in online_targets:
    tlwh = t.tlwh
    tid = t.track_id
    vertical = tlwh[2] / tlwh[3] > aspect_ratio_thresh
    if tlwh[2] * tlwh[3] > min_box_area and not vertical:
        online_tlwhs.append(tlwh)
        online_ids.append(tid)
        online_scores.append(t.score)
        results.append(
            f"{frame_id},{tid},{tlwh[0]:.2f},{tlwh[1]:.2f},{tlwh[2]:.2f},{tlwh[3]:.2f},{t.score:.2f},-1,-1,-1n"
        )
online_im = plot_tracking(
    frame, online_tlwhs, online_ids, frame_id=frame_id + 1, fps=1. / timer.average_time
)

Deploy an asynchronous inference endpoint

SageMaker asynchronous inference is the ideal option for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near-real-time latency requirements. For MOT tasks, it’s common that a video file is beyond 6 MB, which is the payload limit of a real-time endpoint. Therefore, we deploy an asynchronous inference endpoint. Refer to Asynchronous inference for more details of how to deploy an asynchronous endpoint. We can reuse the model created for the real-time endpoint; for this post, we put a tracking process into the inference script so that we can get the final tracking result directly for the input video.

To use scripts related to ByteTrack on the endpoint, we need to put the tracking script and model into the same folder and compress the folder as the model.tar.gz file, and then upload it to the S3 bucket for model creation. The following diagram shows the structure of model.tar.gz.

We need to explicitly set the request size, response size, and response timeout as the environment variables, as shown in the following code. The name of the environment variable varies depending on the framework. For more details, refer to Create an Asynchronous Inference Endpoint.

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    entry_point="inference.py",
    framework_version="1.7.1",
    sagemaker_session=sm_session,
    py_version="py3",
    env={
        'TS_MAX_REQUEST_SIZE': '1000000000', #default max request size is 6 Mb for torchserve, need to update it to support the 1GB input payload
        'TS_MAX_RESPONSE_SIZE': '1000000000',
        'TS_DEFAULT_RESPONSE_TIMEOUT': '900' # max timeout is 15mins (900 seconds)
    }
)

pytorch_model.create(
    instance_type="ml.p3.2xlarge",
)

When invoking the asynchronous endpoint, instead of sending the payload in the request, we send the Amazon S3 URL of the input video. When the model inference finishes processing the video, the results will be saved on the S3 output path. We can configure Amazon Simple Notification Service (Amazon SNS) topics so that when the results are ready, we can receive an SNS message as a notification.

Run batch inference with SageMaker processing

For video files bigger than 1 GB, we use a SageMaker processing job to do batch inference. We define a custom Docker container to run a SageMaker processing job (see the following code). We draw the tracking result on the input video. You can find the result video in the S3 bucket defined by s3_output.

from sagemaker.processing import ProcessingInput, ProcessingOutput
script_processor.run(
    code='./container-batch-inference/predict.py',
    inputs=[
        ProcessingInput(source=s3_input, destination="/opt/ml/processing/input"),
        ProcessingInput(source=s3_model_uri, destination="/opt/ml/processing/model"),
    ], 
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output', destination=s3_output),
    ]
)

Clean up

To avoid unnecessary costs, delete the resources you created as part of this solution, including the inference endpoint.

Conclusion

This post demonstrated how to implement a multi-object tracking solution on a custom dataset using one of the state-of-the-art algorithms on SageMaker. We also demonstrated three deployment options on SageMaker so that you can choose the optimal option for your own business scenario. If the use case requires low latency and needs a model to be deployed on an edge device, you can deploy the MOT solution at the edge with AWS Panorama.

For more information, refer to Multi Object Tracking using YOLOX + BYTE-TRACK and data analysis.


About the Authors

Gordon Wang, is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices cross many industries. He is passionate about computer vision, NLP, Generative AI and MLOps. In his spare time, he loves running and hiking.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Guang Yang, is a Senior applied scientist at the Amazon ML Solutions Lab where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Read More

Retrieval-augmented visual-language pre-training

Retrieval-augmented visual-language pre-training

Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world’s knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.

In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that is able to process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.

To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.

We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.

Memory construction from multimodal knowledge corpora

Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously at every thousand training steps.

Scaling memory using compression

A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens times k for self-attention. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into an arbitrary length. This lets us retrieve top-k memory entries for k as large as a hundred.

The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.

We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.

Large-scale pre-training on image-text pairs

To train the REVEAL model, we begin with the large-scale corpus, collected from the public Web with three billion image alt-text caption pairs, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input with the objective of generating the rest of the text as output. The training goal is to condition the prefix and autoregressively generate the remaining text sequence.

To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo–ground-truth knowledge to give the pre-training a reasonable head start.

We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.

To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.

REVEAL workflow

The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.

Overall workflow of REVEAL.

Results

We evaluate REVEAL on knowledge-based visual question answering tasks using OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate a fixed knowledge or the works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.

Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.

We also evaluate REVEAL on the image captioning benchmarks using MSCOCO and NoCaps dataset. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on MSCOCO caption and NoCaps datasets are shown below.

Image Captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.

Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.

REVEAL can use knowledge from different sources to correctly answer the question.

Conclusion

We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image caption tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.

Acknowledgements

This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.

Read More

A New Age: ‘Age of Empires’ Series Joins GeForce NOW, Part of 20 Games Coming in June

A New Age: ‘Age of Empires’ Series Joins GeForce NOW, Part of 20 Games Coming in June

The season of hot sun and longer days is here, so stay inside this summer with 20 games joining GeForce NOW in June. Or stream across devices by the pool, from grandma’s house or in the car — whichever way, GeForce NOW has you covered.

Titles from the Age of Empires series are the next Xbox games to roll out to GeForce NOW, giving members plenty to do this summer, especially with over 1,600 games part of the GeForce NOW library.

Expand Your Empire

Age of Empires on GeForce NOW
From the Stone Age to the cloud.

NVIDIA released the first Xbox games to the cloud last month as part of its ongoing partnership with Microsoft. Now it’s the first to bring a smash hit Xbox series to the cloud with Ensemble Studios’ Age of Empires titles.

Since the first release in 1997, Age of Empires has established itself as one of the longest-running real-time strategy (RTS) series in existence. The critically acclaimed RTS series puts players in control of an entire empire with the goal of expanding and evolving to become a flourishing civilization.

All four of the franchise’s latest Steam versions will join GeForce NOW later this month: Age of Empires: Definitive Edition, Age of Empires II: Definitive Edition, Age of Empires III: Definitive Edition and Age of Empires IV: Anniversary Edition. Each title will also support new content and updates, like upcoming seasons or the recently released “Return of Rome” expansion for Age of Empires II: Definitive Edition.

Members will be able to rule from PC, Mac, Chromebooks and more when the Definitive Editions of these games join the GeForce NOW library later this month. Upgrade to Priority membership to skip the waiting lines and experience extended gaming sessions for those long campaigns. Or go for an Ultimate membership to conquer enemies at up to 4K resolution and up to eight-hour sessions.

Game on the Go

Now available in Europe, the Logitech G Cloud gaming handheld supports GeForce NOW, giving gamers a way to stream their PC library from the cloud. It features a seven-inch, full-HD touchscreen with a 60Hz refresh rate and precision controls to stream over 1,600 games in the GeForce NOW library.

Pick up the device and get one month of Priority membership for free to celebrate the launch, from now until Thursday, June 22.

“Look at You, Hacker…”

System Shock on GeForce NOW
Welcome to Citadel Station.

Stay cool as a cucumber in the remake of System Shock, the hit game from Nightdive Studios. This game has everything — first-person shooter, role-playing and action-adventure. Fight to survive in the depths of space after a hacker who awakens from a six-month coma finds the space station overrun with hostile mutants and rogue AI. Explore, use hacker skills and unravel the mysteries of the space station streaming System Shock in the cloud.

In addition, members can look for the following two games this week:

  • System Shock (New release on Steam)
  • Killer Frequency (New release on Steam, June 1)

And here’s what the rest of June looks like:

  • Amnesia: The Bunker (New release on Steam, June 6)
  • Harmony: The Fall of Reverie (New release on Steam, June 8)
  • Dordogne (New release on Steam, June 13)
  • Aliens: Dark Descent (New release on Steam, June 20)
  • Trepang2 (New release on Steam, June 21)
  • Layers of Fear (New release on Steam)
  • Park Beyond (New release on Steam)
  • Tom Clancy’s Rainbow Six Extraction (New release on Steam)
  • Age of Empires: Definitive Edition (Steam)
  • Age of Empires II: Definitive Edition (Steam)
  • Age of Empires III: Definitive Edition (Steam)
  • Age of Empires IV: Anniversary Edition (Steam)
  • Derail Valley (Steam)
  • I Am Fish (Steam)
  • Golf Gang (Steam)
  • Contraband Police (Steam)
  • Bloons TD 6 (Steam)
  • Darkest Dungeon (Steam)
  • Darkest Dungeon II (Steam)

Much Ado About May

In addition to the 16 games announced in May, six extra joined the GeForce NOW library:

Conquerer’s Blade didn’t make it in May, so stay tuned to GFN Thursday for any updates.

Finally, before moving forward into the weekend, lets take things back with our question of the week. Let us know your answer on Twitter or in the comments below.

Read More

Digital Renaissance: NVIDIA Neuralangelo Research Reconstructs 3D Scenes

Digital Renaissance: NVIDIA Neuralangelo Research Reconstructs 3D Scenes

Neuralangelo, a new AI model by NVIDIA Research for 3D reconstruction using neural networks, turns 2D video clips into detailed 3D structures — generating lifelike virtual replicas of buildings, sculptures and other real-world objects.

Like Michelangelo sculpting stunning, life-like visions from blocks of marble, Neuralangelo generates 3D structures with intricate details and textures. Creative professionals can then import these 3D objects into design applications, editing them further for use in art, video game development, robotics and industrial digital twins.

Neuralangelo’s ability to translate the textures of complex materials — including roof shingles, panes of glass and smooth marble — from 2D videos to 3D assets significantly surpasses prior methods. The high fidelity makes its 3D reconstructions easier for developers and creative professionals to rapidly create usable virtual objects for their projects using footage captured by smartphones.

“The 3D reconstruction capabilities Neuralangelo offers will be a huge benefit to creators, helping them recreate the real world in the digital world,” said Ming-Yu Liu, senior director of research and co-author on the paper. “This tool will eventually enable developers to import detailed objects — whether small statues or massive buildings — into virtual environments for video games or industrial digital twins.”

In a demo, NVIDIA researchers showcased how the model could recreate objects as iconic as Michelangelo’s David and as commonplace as a flatbed truck. Neuralangelo can also reconstruct building interiors and exteriors — demonstrated with a detailed 3D model of the park at NVIDIA’s Bay Area campus.

Neural Rendering Model Sees in 3D

Prior AI models to reconstruct 3D scenes have struggled to accurately capture repetitive texture patterns, homogenous colors and strong color variations. Neuralangelo adopts instant neural graphics primitives, the technology behind NVIDIA Instant NeRF, to help capture these finer details.

Using a 2D video of an object or scene filmed from various angles, the model selects several frames that capture different viewpoints — like an artist considering a subject from multiple sides to get a sense of depth, size and shape.

Once it’s determined the camera position of each frame, Neuralangelo’s AI creates a rough 3D representation of the scene, like a sculptor starting to chisel the subject’s shape.

The model then optimizes the render to sharpen the details, just as a sculptor painstakingly hews stone to mimic the texture of fabric or a human figure.

The final result is a 3D object or large-scale scene that can be used in virtual reality applications, digital twins or robotics development.

Find NVIDIA Research at CVPR, June 18-22

Neuralangelo is one of nearly 30 projects by NVIDIA Research to be presented at the Conference on Computer Vision and Pattern Recognition (CVPR), taking place June 18-22 in Vancouver. The papers span topics including pose estimation, 3D reconstruction and video generation.

One of these projects, DiffCollage, is a diffusion method that creates large-scale content — including long landscape orientation, 360-degree panorama and looped-motion images. When fed a training dataset of images with a standard aspect ratio, DiffCollage treats these smaller images as sections of a larger visual — like pieces of a collage. This enables diffusion models to generate cohesive-looking large content without being trained on images of the same scale.

sunset beach landscape generated by DiffCollage

The technique can also transform text prompts into video sequences, demonstrated using a pretrained diffusion model that captures human motion:



Learn more about NVIDIA Research at CVPR.

Read More