Research Focus: Week of December 18, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams

Deep Neural Networks (DNNs) are essentially stacked transformation functions (layers) that generate progressively more complex features/encodings. This makes them universal approximators and underlies their unprecedented success in complex tasks. This inferential effectiveness comes at the cost of increased computational complexity, making DNNs hard to scale for operational efficiency in AI applications, especially when running on resource-constrained hardware. 

In a recent paper, NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams, researchers from Microsoft and their collaborators propose a new framework to address this problem. NASerEx leverages neural architecture search (NAS) with a novel saliency-constrained search space and exit decision metric to learn suitable early exit structures that augment deep neural models for scalable, efficient inference on big image streams. Optimized exit-augmented models, with the power of smart adaptive inference, run ~2.5x faster with ~4x lower aggregate effective FLOPs and no significant accuracy loss.
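The core idea behind early exits can be illustrated with a short PyTorch sketch (a generic illustration, not the NASerEx implementation, whose exit structures and decision metric are learned via NAS): attach an auxiliary classifier to an intermediate layer and, at inference time, return its prediction whenever its softmax confidence clears a threshold, so easy inputs skip the remaining layers.

```python
# Minimal early-exit-augmented CNN sketch in PyTorch (architecture, threshold,
# and layer sizes are illustrative; NASerEx learns these choices via AutoML).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, num_classes=10, exit_threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.exit_threshold = exit_threshold

    def forward(self, x):
        h = self.block1(x)
        early_logits = self.exit1(h)
        if not self.training:
            confidence = F.softmax(early_logits, dim=1).max(dim=1).values
            if bool((confidence > self.exit_threshold).all()):
                return early_logits           # "easy" inputs exit here, saving FLOPs
        return self.head(self.block2(h))      # "hard" inputs run the full network

model = EarlyExitNet().eval()
with torch.no_grad():
    print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```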

InsightPilot: An LLM-Empowered Automated Data Exploration System

Effective data exploration requires in-depth knowledge of the dataset and the user's intent, as well as expertise in data analysis techniques. Lacking familiarity with any of these can create obstacles that make the process time-consuming and overwhelming.

In a recent paper, InsightPilot: An LLM-Empowered Automated Data Exploration System, researchers from Microsoft address this issue. InsightPilot is a large language model (LLM)-based, automated system designed to simplify data exploration. It features a set of carefully designed analysis actions that streamline the process. Given a natural language question, InsightPilot collaborates with the LLM to issue a sequence of analysis actions, explore the data, and generate insights. The authors demonstrate the effectiveness of InsightPilot in a user study and a case study, showing how it can help users gain valuable insights from their datasets.
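The paper's system is not reproduced here, but the control loop it describes (an LLM repeatedly choosing the next analysis action over a dataset and accumulating the resulting insights) can be sketched roughly with pandas; the action set and the choose_next_action stub below are hypothetical stand-ins for InsightPilot's analysis actions and LLM calls.

```python
# Rough sketch of an LLM-driven data-exploration loop in the spirit of InsightPilot.
# The ACTIONS table and choose_next_action() are hypothetical illustrations,
# not the paper's actual interface.
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "sales":  [120, 90, 200, 160],
    "year":   [2022, 2023, 2022, 2023],
})

ACTIONS = {
    "summarize":    lambda d: d.describe(include="all"),
    "group_sales":  lambda d: d.groupby("region")["sales"].sum(),
    "yearly_trend": lambda d: d.groupby("year")["sales"].mean(),
}

def choose_next_action(question, history):
    """Placeholder for the LLM call that picks the next analysis action."""
    remaining = [name for name in ACTIONS if name not in history]
    return remaining[0] if remaining else None

question = "How are sales developing across regions?"
history, insights = [], []
while (action := choose_next_action(question, history)) is not None:
    result = ACTIONS[action](df)          # execute the chosen analysis action
    history.append(action)
    insights.append((action, result))     # the real system turns these into prose insights

for name, result in insights:
    print(f"--- {name} ---\n{result}\n")
```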


Boosting Cloud Efficiency: Harnessing Data-Driven Decision-Making and Optimization Techniques

Microsoft’s cloud system serves as the backbone for the daily operations of hundreds of thousands of organizations, driving productivity and collaboration. The foundational infrastructure demands both high reliability and efficiency. In a new blog post, Microsoft’s Systems Innovation team explores some recent innovations to continually enhance hyper-scale cloud capacity efficiency, delivering substantial operational cost savings for customers.

Systems Innovation is a collaboration between Microsoft 365, Microsoft Research and Azure. The group focuses on leveraging its shared, deep understanding of workloads, combining algorithmic research with AI/machine learning techniques and hardware innovation to improve operational reliability and efficiency.


NeurIPS Large Language Model Efficiency Challenge

Large language models (LLMs) trained on large bodies of text can solve tasks with few supervised examples. These few-shot models have shown state-of-the-art success across natural language processing (NLP) tasks, language translation, standardized exams, and coding challenges, as well as in subjective domains such as chatbots. All of these domains involve bootstrapping a single LLM, referred to as a foundation model, with examples of specific knowledge from the associated task.

The process of updating a model with limited domain-specific data is known as fine-tuning. However, the costs of accessing, fine-tuning and querying foundation models to perform new tasks can be large.

To help democratize access to language models, Microsoft and other industry leaders were pleased to sponsor the NeurIPS Large Language Model Efficiency Challenge, which addressed three major issues:

  1. A lack of transparency around model training methods, which leaves most models difficult to reproduce.
  2. The absence of a standard benchmark for evaluating these models side by side.
  3. Insufficient access to dedicated hardware, which prevents widespread availability and use of these models.

The challenge to the community was to adapt a foundation model to specific tasks by fine-tuning on a single GPU, either a 4090 or an A100 (40GB), within a 24-hour (1-day) time frame, while maintaining high accuracy on the desired tasks.
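Parameter-efficient fine-tuning is one common way to fit within such a budget. Below is a minimal, generic LoRA sketch using the Hugging Face transformers, peft and datasets libraries; the base model, toy dataset and hyperparameters are illustrative placeholders, not a challenge recipe.

```python
# Generic single-GPU LoRA fine-tuning sketch (illustrative only).
# Assumes the transformers, peft and datasets packages are installed.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"  # stand-in; challenge entries used much larger open foundation models
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the frozen base model with low-rank adapters so only a small fraction of
# parameters is trained, which is what makes a tight single-GPU budget feasible.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Tiny toy dataset; the challenge used real task-specific data.
texts = ["Question: What is 2 + 2? Answer: 4.",
         "Question: What is the capital of France? Answer: Paris."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4, report_to=[]),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```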

Each submission was evaluated for accuracy and computational performance tradeoffs at commodity hardware scales. Insights and lessons were distilled into a set of well-documented steps and easy-to-follow tutorials. The machine learning community will have documentation on how to achieve the same performance as the winning entries, serving as a starting point for building their own LLM solutions.

Cool Robots of 2023: Meet the Autonomous Movers and Shakers

Outside the glare of the klieg lights that ChatGPT commanded this year, a troupe of autonomous machines nudged the frontiers of robotics forward.

Here are six that showed special prowess — swimming, diving, gripping, seeing, strolling and flying through 2023.

A Media Darling at CES

Ella — a smart stroller from startup Glüxkind Technologies, of Vancouver, Canada — kicked off the year when it was named an honoree in the CES 2023 Innovation Awards.

The canny carriage uses computer vision running on the NVIDIA Jetson edge AI platform to follow parents. Its AI-powered abilities, like smart braking and a rock-my-baby mode, captured the attention of media outlets like Good Morning America and The Times of London, and earned its husband-and-wife cofounders an interview on the NVIDIA AI Podcast.

A member of NVIDIA Inception, a free program for cutting-edge startups, Glüxkind was one of seven companies with NVIDIA-powered products recognized at the Las Vegas event in January. They included:

  • John Deere for its fully autonomous tractor,
  • AGRIST for its robot that automatically harvests bell peppers,
  • Inception member Skydio for its drone that can fly at a set distance and height without manual intervention,
  • Neubility, another Inception member, for its self-driving delivery robot,
  • Seoul Robotics, a partner in the NVIDIA Metropolis vision AI ecosystem, for its Level 5 Control Tower that can turn standard vehicles into self-driving cars, and
  • WHILL for its one-person vehicle that automatically guides a user inside places like airports or hospitals.

Dexterous Food Packer

Inception startup Soft Robotics, of Bedford, Mass., introduced its mGripAI system to an $8 trillion food industry hungry for automation. It combines 3D vision and AI to grasp delicate items such as chicken wings, attracting investors that include Tyson Foods and Johnsonville.

Soft Robotics uses the NVIDIA Omniverse platform and NVIDIA Isaac Sim robotics simulator to create 3D renderings of chicken parts on conveyor belts or in bins. With help from AI and the ray-tracing capabilities of NVIDIA RTX technology, these simulations help the robot gripper handle as many as 100 picks per minute, even under glare or changing light conditions.

“We’re all in on Omniverse and Isaac Sim, and that’s been working great for us,” said David Weatherwax, senior director of software engineering at Soft Robotics.

A Keen Eye in the Factory

In a very different example of industrial digitalization, leading electronics manufacturer Quanta is inspecting the quality of its products using the TM25S, an AI-enabled robot from its subsidiary, Techman Robot.

Using Omniverse, Techman built a digital twin of the inspection robot — as well as the product to be inspected — in Isaac Sim. Programming the robot in simulation reduced time spent on the task by over 70%, compared to programming manually on the real robot.

Then, with powerful optimization tools in Isaac Sim, Techman explored a massive number of program options in parallel on NVIDIA GPUs. The end result, shown in the video below, was an efficient solution that reduced the cycle time of each inspection by 20%.

Sailing the Seas for Data Science

For its part, Saildrone, an Inception startup in Alameda, Calif., created uncrewed watercraft that can cost-effectively gather data for science, fisheries, weather forecasting and more. NVIDIA Jetson modules process data streams from their sensors, some with help from NVIDIA Metropolis vision AI software such as NVIDIA DeepStream, a development kit for intelligent video analytics.

The video below shows how three of its smart sailboats are helping evaluate ocean health around the Hawaiian Islands.

Destination: Mars

The next stop for one autonomous vehicle may be the red planet.

Caltech’s Multi-Modal Mobility Morphobot, or M4, can configure itself to walk, fly or drive at speeds up to 40 mph (video below). An M42 version is now under development at NASA as a Mars rover candidate and has attracted interest for other uses like reconnaissance in fire zones.

Since releasing a paper on it in Nature Communications, the team has been inundated with proposals for the shape-shifting drone built on the NVIDIA Jetson platform.

Delivery Drone Flies High

The year ended on a high note with San Francisco-based Zipline announcing that its delivery drones have flown more than 55 million miles and made more than 800,000 deliveries since the company's start in 2011. Zipline now completes one delivery every 70 seconds, globally.

That’s a major milestone for the Inception startup, the field it’s helping pioneer and the customers who can receive everything from pizza to vitamins 7x faster than by truck.

Zipline’s latest drone uses two Jetson Orin NX modules. It can carry eight pounds of cargo for 10 miles at up to 70 mph, delivering packages in single-digit minutes while reducing carbon emissions by 97% compared with gasoline-based delivery vehicles.

Machines That Inspire, Amuse

Individual makers designed two autonomous vehicles this year worth special mentions.

Goran Vuksic with his AI-powered droid

Kabilan KB, a robotics developer and student in Coimbatore, India, built an autonomous wheelchair using Jetson to run computer vision models that find and navigate a path to a user’s desired destination. The undergrad at the Karunya Institute of Technology and Sciences aspires to one day launch a robotics startup.

Finally, an engineering manager in Copenhagen who’s a self-described Star Wars fanatic designed an AI-powered droid based on an NVIDIA Jetson Orin Nano Developer Kit. Goran Vuksic shared his step-by-step technical guide, so others can build their own sci-fi companions.

More than 6,500 companies and 1.2 million developers — as well as a community of makers and enthusiasts — use the NVIDIA Jetson and Isaac platforms for edge AI and robotics.

To get a look at where autonomous machines will go next, see what’s coming at CES in 2024.

Thomson Reuters Taps Generative AI to Power Legal Offerings

Thomson Reuters, the global content and technology company, is transforming the legal industry with generative AI.

In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Thomson Reuters Chief Product Officer David Wong about its potential — and implications.

Many of Thomson Reuters' offerings for the legal industry either address an information retrieval problem or help generate written content.

It has an AI-driven digital solution that enables law practitioners to search laws and cases intelligently across different jurisdictions. It also provides AI-powered tools that are set to be integrated with commonly used products like Microsoft 365 to automate the time-consuming processes of drafting and analyzing legal documents.

These technologies increase the productivity of legal professionals, enabling them to focus their time on higher-value work. According to Wong, ultimately these tools also have the potential to help deliver better access to justice.

To address ethical concerns, the company has created publicly available AI development guidelines, as well as privacy and data protection policies. And it’s participating in the drafting of ethical guidelines for the industries it serves.

There’s still a wide range of reactions surrounding AI use in the legal field, from optimism about its potential to fears of job replacement. But Wong underscored that no matter what the outlook, “it is very likely that professionals that use AI are going to replace professionals that don’t use AI.”

Looking ahead, Thomson Reuters aims to further integrate generative AI, as well as retrieval-augmented generation techniques, into its flagship research products to help lawyers synthesize, read and respond to complicated technical and legal questions. Recently, Thomson Reuters acquired Casetext, which developed the first AI legal assistant, CoCounsel.

In 2024, Thomson Reuters is building on this with the launch of an AI assistant that will serve as the interface across Thomson Reuters products with generative AI capabilities, including those in other fields such as tax and accounting.

Into the Omniverse: Foundry Nuke’s OpenUSD Enhancements Ring in a 3D Renaissance

Editor’s note: This post is part of Into the Omniverse, a series focused on how artists, developers and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse.

3D designers and creators are embracing Universal Scene Description, aka OpenUSD, to transform their workflows.

Creative software company Foundry’s latest release of Nuke, a powerful compositing tool for visual effects (VFX), is bringing increased support for OpenUSD, a framework that provides a unified and extensible ecosystem for describing, composing, simulating and collaborating within 3D worlds.

With advanced compositing and improved interoperability capabilities, artists are showcasing the immense potential of Nuke and OpenUSD for visual storytelling.

Bringing 3D Visions to Life With Nuke and OpenUSD

YouTuber Jacob Zirkle is one such 3D artist.

Inspired by his 10th watch-through of the Star Wars films, Zirkle wanted to create a sci-fi ship of his own. He first combined computer graphics elements in Blender and Unreal Engine before using USD to bring the scene into Nuke for compositing.

Zirkle’s ship, built using Blender, Nuke, Unreal Engine and USD Composer.

OpenUSD was the glue that held his workflow together.

“Usually, I have to deal with multiple, varying file types in my VFX pipeline, and as soon as something gets updated, it can be a real pain to apply the change across the board,” Zirkle said. “But because I was using the same OpenUSD file for all of my programs, I could save the file once, and changes get automatically propagated through the pipeline — saving me a ton of time.”
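The mechanism Zirkle is describing comes from USD composition: applications reference a shared asset layer rather than copying it, so an edit saved to the asset file shows up wherever the referencing scene is next opened. A minimal sketch with the open-source pxr Python API follows; the file and prim names are illustrative.

```python
# Minimal OpenUSD composition sketch using the pxr Python API (names illustrative).
from pxr import Usd, UsdGeom

# Author a shared asset layer that several tools could open and edit.
asset_stage = Usd.Stage.CreateNew("ship_asset.usda")
ship = UsdGeom.Xform.Define(asset_stage, "/Ship")
UsdGeom.Cube.Define(asset_stage, "/Ship/Hull")
asset_stage.SetDefaultPrim(ship.GetPrim())
asset_stage.GetRootLayer().Save()

# A scene stage references the asset instead of duplicating it, so a change
# saved to ship_asset.usda propagates to every scene that references it.
scene_stage = Usd.Stage.CreateNew("shot_010.usda")
ship_ref = scene_stage.DefinePrim("/World/Ship")
ship_ref.GetReferences().AddReference("./ship_asset.usda")
scene_stage.GetRootLayer().Save()

print(scene_stage.ExportToString())  # composed result, with /World/Ship/Hull pulled in by reference
```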

Edward McEvenue, an associate creative director at NVIDIA, is using OpenUSD and Nuke to create his short film with the working title: “Dare to Dream.”

Through the project, McEvenue hopes to visualize aspects of automated manufacturing. He uses Autodesk 3ds Max and SideFX Houdini for 3D scene creation, Chaos V-Ray for rendering arbitrary output variables and extended dynamic range sequences, and Nuke for compositing elements for final renders.

OpenUSD helps streamline data transfer between applications, speeding the iteration process. “Nuke’s USD capabilities allow me to seamlessly transition 3D assets between digital content-creation apps, providing a powerful tool for achieving advanced compositing techniques,” he said.

Other NVIDIA creatives have integrated OpenUSD and Nuke into their 3D workflows. A team of 10 artists developed a fully OpenUSD-based pipeline and custom tooling on NVIDIA Omniverse — a development platform for building OpenUSD-based tools and applications — to bring to life the “Da Vinci Workshop,” a project to inspire greater OpenUSD use among pipeline developers.

The artists also used Adobe Substance Painter, Autodesk 3ds Max, Autodesk Maya, DaVinci Resolve, SideFX Houdini, Pixologic ZBrush and Omniverse USD Composer. OpenUSD served as the backbone of the team's internal pipeline, offering the flexibility needed to collaborate across applications with ease.

The “Da Vinci Workshop” OpenUSD dataset is now available on the Omniverse launcher — free for developers and artists.

Foundry Nuke representatives, Omniverse community members and the NVIDIA creative team recently joined a livestream to discuss their 3D workflows and the impact of OpenUSD. Learn more by watching the replay:

Powering Digital Workflows With OpenUSD

The 15.0 and 14.1 updates to Nuke bring significant workflow enhancements to those working with OpenUSD.

The updated GeoMerge node now offers four new modes: Merge Layers, Duplicate Prims, Flatten Layers and Flatten to Single Layer. These give users greater control over geometry and OpenUSD layers, allowing for quick merging of complex structures, the duplication of workflows and more effective layer management.

The OpenUSD-based 3D system introduced in Nuke 14.0 enables users to handle large, intricate scenes with greater ease. And the new Scene Graph Popup feature in Nuke 15.0 lets users easily filter through 3D scene data, reducing the time and energy spent searching for specific assets.

In addition, the main 3D scene graph now includes a search and filter feature, simplifying workspace navigation.

Foundry is also embracing OpenUSD across its other products, including the latest updates to Katana 7.0, which boost pipeline efficiency by integrating USD-native workflows already aligned with Nuke’s 3D system architecture.

Get Plugged In to the World of OpenUSD 

NVIDIA and Foundry are both members of the Alliance for OpenUSD (AOUSD), an organization dedicated to an open-source future using the powerful framework. To learn more, explore the AOUSD forum and check out these resources on OpenUSD.

Share your Nuke and Omniverse work as part of the latest community #WinterArtChallenge. Use the hashtag for a chance to be featured on the @NVIDIAStudio and @NVIDIAOmniverse social channels.

Get started with NVIDIA Omniverse by downloading the free standard license, access OpenUSD resources, and learn how Omniverse Enterprise can connect your team. Stay up to date on Instagram, Medium and Twitter. For more, join the Omniverse community on the forums, Discord server, Twitch and YouTube channels.

VideoPoet: A large language model for zero-shot video generation

A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.

To explore the application of language models in video generation, we introduce VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video generation models are almost exclusively diffusion-based (for one example, see Imagen Video). On the other hand, LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities, including language, code, and audio (e.g., AudioPaLM). In contrast to alternative models in this space, our approach seamlessly integrates many video generation capabilities within a single LLM, rather than relying on separately trained components that specialize on each task.

Overview

The diagram below illustrates VideoPoet’s capabilities. Input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in a video representing the depth and optical flow, which represent the motion, and paints contents on top to produce the text-guided style.

An overview of VideoPoet, capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

Language models as video generators

One key advantage of using LLMs for training is that one can reuse many of the scalable efficiency improvements that have been introduced in existing LLM training infrastructure. However, LLMs operate on discrete tokens, which can make video generation challenging. Fortunately, there exist video and audio tokenizers, which serve to encode video and audio clips as sequences of discrete tokens (i.e., integer indices), and which can also be converted back into the original representation.

VideoPoet trains an autoregressive language model to learn across video, image, audio, and text modalities through the use of multiple tokenizers (MAGVIT V2 for video and image and SoundStream for audio). Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.
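Structurally, generation follows a tokenize, predict, detokenize path. The sketch below mirrors that flow with hypothetical placeholder classes standing in for the MAGVIT V2/SoundStream tokenizers and the trained LLM; the vocabulary size, special tokens and token counts are all illustrative.

```python
# Structural sketch of the tokenize -> autoregressive LM -> detokenize path.
# VideoTokenizer and TokenLM are hypothetical placeholders, not real components.
import numpy as np

VOCAB, BOS, EOS, TASK_IMAGE_TO_VIDEO = 8192, 0, 1, 2   # toy vocabulary and special tokens

class VideoTokenizer:                       # placeholder for a MAGVIT V2-style tokenizer
    def encode(self, frames):               # frames -> discrete token ids
        return list(np.random.randint(3, VOCAB, size=16 * len(frames)))
    def decode(self, tokens):               # token ids -> frames
        return np.zeros((len(tokens) // 16, 128, 128, 3))

class TokenLM:                              # placeholder for the autoregressive LLM
    def generate(self, prefix, num_tokens):
        return list(np.random.randint(3, VOCAB, size=num_tokens))

tokenizer, lm = VideoTokenizer(), TokenLM()

# A task token plus boundary-delimited modalities form the conditioning prefix,
# mirroring the task design described above.
text_tokens = [101, 102, 103]                                # stand-in for a tokenized text prompt
image_tokens = tokenizer.encode(np.zeros((1, 128, 128, 3)))  # conditioning image -> tokens
prefix = [TASK_IMAGE_TO_VIDEO, BOS] + text_tokens + [EOS] + [BOS] + image_tokens + [EOS]

new_tokens = lm.generate(prefix, num_tokens=16 * 8)          # predict 8 frames' worth of tokens
frames = tokenizer.decode(new_tokens)                        # back to a viewable clip
print(frames.shape)                                          # (8, 128, 128, 3)
```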

A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks. Modalities are converted to and from tokens using tokenizer encoder and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.

Examples generated by VideoPoet

Some examples generated by our model are shown below.

Videos generated by VideoPoet from various text prompts. For specific text prompts refer to the website.

For text-to-video, video outputs are variable length and can apply a range of motions and styles depending on the text content. To ensure responsible practices, we reference artworks and styles in the public domain, e.g., Van Gogh's "Starry Night".

Text prompts used for the examples (generated videos omitted here): "A Raccoon dancing in Times Square"; "A horse galloping through Van Gogh's 'Starry Night'"; "Two pandas playing cards"; "A large blob of exploding splashing rainbow paint, with an apple emerging, 8k".

For image-to-video, VideoPoet can take the input image and animate it with a prompt.

An example of image-to-video with text prompts to guide the motion. Each video is paired with an image to its left. Left: “A ship navigating the rough seas, thunderstorm and lightning, animated oil on canvas”. Middle: “Flying through a nebula with many twinkling stars”. Right: “A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day”. Reference: Wikimedia Commons, public domain**.

For video stylization, we predict the optical flow and depth information before feeding it into VideoPoet along with some additional input text.

Examples of video stylization on top of VideoPoet text-to-video generated videos with text prompts, depth, and optical flow used as conditioning. The left video in each pair is the input video, the right is the stylized output. Left: “Wombat wearing sunglasses holding a beach ball on a sunny beach.” Middle: “Teddy bears ice skating on a crystal clear frozen lake.” Right: “A metal lion roaring in the light of a forge.”

VideoPoet is also capable of generating audio. Here we first generate 2-second clips from the model and then try to predict the audio without any text guidance. This enables generation of video and audio from a single model.

An example of video-to-audio, generating audio from a video example without any text input.

By default, the VideoPoet model generates videos in portrait orientation to tailor its output towards short-form content. To showcase its capabilities, we have produced a brief movie composed of many short clips generated by VideoPoet. For the script, we asked Bard to write a short story about a traveling raccoon with a scene-by-scene breakdown and a list of accompanying prompts. We then generated video clips for each prompt, and stitched together all resulting clips to produce the final video below.

When we developed VideoPoet, we noticed some nice properties of the model’s capabilities, which we highlight below.

Long video

We are able to generate longer videos simply by conditioning on the last 1 second of video and predicting the next 1 second. By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.
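The chaining loop itself is simple; in the sketch below, generate_next_second is a hypothetical stand-in for the model call, and the tokens-per-second count is illustrative.

```python
# Sketch of extending a video by repeatedly conditioning on the last second of
# tokens and predicting the next second. generate_next_second() is a placeholder.
import numpy as np

TOKENS_PER_SECOND = 64

def generate_next_second(context_tokens):
    """Placeholder for the model predicting one more second of video tokens."""
    return list(np.random.randint(0, 8192, size=TOKENS_PER_SECOND))

video_tokens = generate_next_second([])          # initial 1-second clip
for _ in range(9):                               # chain to roughly 10 seconds
    last_second = video_tokens[-TOKENS_PER_SECOND:]
    video_tokens += generate_next_second(last_second)

print(len(video_tokens) // TOKENS_PER_SECOND, "seconds of video tokens")
```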

Here are two examples of VideoPoet generating long video from text input:

Text prompts for the long-video examples (generated videos omitted here): "An astronaut starts dancing on Mars. Colorful fireworks then explode in the background." and "FPV footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces."

It is also possible to interactively edit existing video clips generated by VideoPoet. If we supply an input video, we can change the motion of objects to perform different actions. The object manipulation can be centered at the first frame or the middle frames, which allows for a high degree of editing control.

For example, we can randomly generate some clips from the input video and select the desired next clip.

An input video on the left is used as conditioning to generate four choices given the initial prompt: “Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass”. For the first three outputs we show what would happen for unprompted motions. For the last video in the list below, we add to the prompt, “powering up with smoke in the background” to guide the action.

Image to video control

Similarly, we can apply motion to an input image to edit its contents towards the desired state, conditioned on a text prompt.

Animating a painting with different prompts. Left: “A woman turning to look at the camera.” Right: “A woman yawning.” **

Camera motion

We can also accurately control camera movements by appending the type of desired camera motion to the text prompt. As an example, we generated an image with our model using the prompt, "Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river". The examples below append the given text suffix to apply the desired motion.

Prompts from left to right: “Zoom out”, “Dolly zoom”, “Pan left”, “Arc shot”, “Crane shot”, “FPV drone shot”.
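Since the control signal is just a text suffix, assembling these prompts amounts to string concatenation; a trivial sketch (the exact joining punctuation is an assumption):

```python
# Appending camera-motion suffixes to the base image prompt described above.
base_prompt = ("Adventure game concept art of a sunrise over a snowy mountain "
               "by a crystal clear river")
camera_motions = ["Zoom out", "Dolly zoom", "Pan left",
                  "Arc shot", "Crane shot", "FPV drone shot"]
prompts = [f"{base_prompt}. {motion}" for motion in camera_motions]
for p in prompts:
    print(p)
```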

Evaluation results

We evaluate VideoPoet on text-to-video generation with a variety of benchmarks to compare the results to other approaches. To ensure a neutral evaluation, we ran all models on a wide variation of prompts without cherry-picking examples and asked people to rate their preferences. The figure below highlights the percentage of the time VideoPoet was chosen as the preferred option in green for the following questions.

Text fidelity

User preference ratings for text fidelity, i.e., what percentage of videos are preferred in terms of accurately following a prompt.

Motion interestingness

User preference ratings for motion interestingness, i.e., what percentage of videos are preferred in terms of producing interesting motion.

Based on the above, on average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model, compared with 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion, compared with 11–21% for other models.

Conclusion

Through VideoPoet, we have demonstrated LLMs' highly competitive video generation quality across a wide variety of tasks, especially in producing interesting and high-quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support "any-to-any" generation, e.g., extending to text-to-audio, audio-to-video, and video captioning, among many others.

To view more examples in original quality, see the website demo.

Acknowledgements

This research has been supported by a large body of contributors, including Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, and Lu Jiang.

We give special thanks to Alex Siegman and Victor Gomes for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, Alonso Martinez for graphic design, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

**

(a) The Storm on the Sea of Galilee, by Rembrandt 1633, public domain.

(b) Pillars of Creation, by NASA 2014, public domain.

(c) Wanderer above the Sea of Fog, by Caspar David Friedrich, 1818, public domain

(d) Mona Lisa, by Leonardo Da Vinci, 1503, public domain.
