‘The Elder Scrolls Online’ Joins GeForce NOW for Game’s 10th Anniversary

Rain or shine, a new month means new games. GeForce NOW kicks off April with nearly 20 new games, seven of which are available to play this week.

GFN Thursday celebrates the 10th anniversary of ZeniMax Online Studios’ The Elder Scrolls Online by bringing the award-winning online role-playing game (RPG) to the cloud this week.

Plus, the GeForce NOW Ultimate membership comes to gamers in Japan for the first time, with new GeForce RTX 4080 SuperPODs online today.

The Rising Sun Goes Ultimate

Japan and GeForce NOW
Get ready to drift into the cloud.

GeForce NOW is rolling out the green carpet to gamers in Japan, expanding next-generation cloud gaming worldwide. The Ultimate membership tier is now available to gamers in the region, delivering up to 4K gaming at up to 120 frames per second, all at ultra-low latency — even on devices without the latest hardware.

Gamers in Japan can now play triple-A titles from some of the world’s largest publishers in the cloud. Capcom’s Street Fighter 6 and Resident Evil Village will come to GeForce NOW at a later date for members to stream at the highest performance.

GeForce NOW will operate in Japan alongside GeForce NOW Alliance partner and telecommunications company KDDI, which currently offers its customers access to GeForce RTX 3080-powered servers, in addition to its mobile benefits. Plus, new GFNA partners in other regions will be announced this year — stay tuned to GFN Thursdays for details.

A Decade of Adventure

Elder Scrolls Online on GeForce NOW
The cloud is slay-ing.

Discover Tamriel from the comfort of almost anywhere with GeForce NOW. Explore the Elder Scrolls universe solo or alongside thousands of other players in The Elder Scrolls Online as it joins the cloud this week for members.

For a decade, Elder Scrolls Online has cultivated a vibrant community of millions of players and a legacy of exciting stories, characters and adventures. Players have explored Morrowind, Summerset, Skyrim and more, thanks to regular updates and chapter releases. The title’s anniversary celebrations kick off in Amsterdam this week, and fans worldwide can join in by streaming the game from the cloud.

Set during Tamriel’s Second Era, a millennium before The Elder Scrolls V: Skyrim, The Elder Scrolls Online has players exploring a massive, ever-growing world. Together they can encounter memorable quests, challenging dungeons, player vs. player battles and more. Gamers can play their way by customizing their characters, looting and crafting new gear, and unlocking and developing their abilities.

Experience the epic RPG with an Ultimate membership and venture forth in the cloud with friends, tapping eight-hour gaming sessions and exclusive access to servers. Ultimate members can effortlessly explore the awe-inspiring fantasy world with the ability to stream at up to 4K and 120 fps, or experience the game at ultrawide resolutions on supported devices.

April Showers Bring New Games

MEGAMAN X DiVE OFFLINE on GeForce NOW
X marks the cloud.

Dive into a new adventure with Mega Man X DiVE Offline from Capcom. It’s the offline, reimagined version of Mega Man X, featuring the franchise’s classic action, over 100 characters from the original series and an all-new story with hundreds of stages to play. Strengthen characters and weapons with a variety of power-ups — then test them out in the side-scrolling action.

Catch it alongside other new games joining the cloud this week:

  • ARK: Survival Ascended (New release on Xbox, available on PC Game Pass, April 1)
  • Thief (New release on Epic Games Store, free from April 4-11)
  • Sons of Valhalla (New release on Steam, April 5)
  • Elder Scrolls Online (Steam and Epic Games Store)
  • MEGA MAN X DiVE Offline (Steam)
  • SUPERHOT: MIND CONTROL DELETE (Xbox, available on PC Game Pass)
  • Turbo Golf Racing 1.0 (Xbox, available on PC Game Pass)

And members can look for the following throughout the rest of the month:

  • Dead Island 2 (New release on Steam, April 22)
  • Phantom Fury (New release on Steam, April 23)
  • Oddsparks: An Automation Adventure (New release on Steam, April 24)
  • 9-Bit Armies: A Bit Too Far (Steam)
  • Backpack Battles (Steam)
  • Dragon’s Dogma 2 Character Creator & Storage (Steam)
  • Evil West (Xbox, available on PC Game Pass)
  • Islands of Insight (Steam)
  • Lightyear Frontier (Steam and Xbox, available on PC Game Pass)
  • Manor Lords (New release on Steam and Xbox, available on PC Game Pass)
  • Metaball (Steam)
  • Tortuga – A Pirate’s Tale (Steam)

Making the Most of March

In addition to the 30 games announced last month, six more joined the GeForce NOW library:

  • Zoria: Age of Shattering (New release on Steam, March 7)
  • Deus Ex: Mankind Divided (New release on Epic Games Store, free, March 14)
  • Dragon’s Dogma 2 (New release on Steam, March 21)
  • Diablo IV (Xbox, available on PC Game Pass)
  • Granblue Fantasy: Relink (Steam)
  • Space Engineers (Xbox, available on PC Game Pass)

Some titles didn’t make it in March. Crown Wars: The Black Prince and Breachway have delayed their launch dates to later this year, and Portal: Revolution will join GeForce NOW in the future. Stay tuned to GFN Thursday for updates.

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Accelerating MoE model inference with Locality-Aware Kernel Design

1.0 Summary

We show that by implementing column-major scheduling to improve data locality, we can accelerate the core Triton GEMM (General Matrix-Matrix Multiply) kernel for MoEs (Mixture of Experts) up to 4x on A100, and up to 4.4x on H100 Nvidia GPUs. This post demonstrates several different work decomposition and scheduling algorithms for MoE GEMMs and shows, at the hardware level, why column-major scheduling produces the highest speedup.

Repo and code available at: https://github.com/pytorch-labs/applied-ai/tree/main/triton/.

Figure 1A. Optimized Fused MoE GEMM Kernel TFLOPs on A100 for varying Batch Sizes M

Figure 1B. Optimized Fused MoE GEMM Kernel TFLOPs on H100 for varying Batch Sizes M

2.0 Background

OpenAI’s Triton is a hardware-agnostic language and compiler that, as our prior blog post has shown, can be used to accelerate quantization workflows. We also showed that, in terms of kernel development, many of the same learnings and performance analysis tools from CUDA can be leveraged to provide insight into how Triton kernels work under the hood, and to guide subsequent measures to speed up these kernels in latency-sensitive environments. As Triton becomes increasingly adopted in production settings, it is important that developers understand the common tips and tricks for developing performant kernels, as well as how broadly these methods apply across different architectures and workflows. Thus, this post explores how we optimized the Triton kernel developed by vLLM for the popular Mixture of Experts (MoE) Mixtral model using classical techniques, and how these techniques can be implemented in Triton to achieve performance gains.

Mixtral 8x7B is a sparse Mixture of Experts Language Model. Unlike the classical dense transformer architecture, each transformer block houses 8 MLP layers where each MLP is an ‘expert’. As a token flows through, a router network selects which 2 of the 8 experts should process that token and the results are then combined. The selected experts for the same token vary at each layer. As a result, while Mixtral 8x7B has a total of 47B params, during inference only 13B params are active.

The MoE GEMM (General Matrix-Matrix Multiply) kernel receives a stacked weight matrix containing all the experts, and must subsequently route each token to the TopK (2 for Mixtral) experts by utilizing a mapping array produced by the resultant scores of the router network. In this post, we provide methods to efficiently parallelize this computation during inference time, specifically during autoregression (or decoding stages).
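
To make the routing step concrete, here is a minimal PyTorch sketch (with hypothetical shapes and names) of how the router's scores produce a TopK expert assignment and a token-to-expert mapping that a fused MoE GEMM kernel can consume; the actual vLLM implementation differs in its details.

```python
import torch

# Hypothetical shapes: num_tokens tokens, 8 experts, TopK = 2 (as in Mixtral).
num_tokens, num_experts, top_k = 4, 8, 2
router_logits = torch.randn(num_tokens, num_experts)  # output of the router network

# Select the TopK experts per token and normalize their routing weights.
topk_weights, topk_ids = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

# Build a flat mapping: for each (token, k) pair, which expert's weight slice to use.
# A fused MoE GEMM kernel can iterate over this mapping so that each block of work
# multiplies the token's activations with the weights of its assigned expert.
token_ids = torch.arange(num_tokens).repeat_interleave(top_k)  # [t0, t0, t1, t1, ...]
expert_ids = topk_ids.reshape(-1)                              # expert chosen per entry

for t, e, w in zip(token_ids.tolist(), expert_ids.tolist(),
                   topk_weights.reshape(-1).tolist()):
    print(f"token {t} -> expert {e} (weight {w:.2f})")
```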

3.0 Work Decomposition – SplitK

We have previously shown that for the matrix problem sizes found in LLM inference, specifically in the context of W4A16 quantized inference, GEMM kernels can be accelerated by applying a SplitK work decomposition. Thus, we started our MoE acceleration research by implementing SplitK in the vLLM MoE Kernel, which produced speedups of approximately 18-20% over the Data Parallel approach.

This result shows that the SplitK optimization can be used as a part of a more formulaic approach to improving/developing Triton kernels in inference settings. To build intuition about these different work decompositions, let’s consider a simple example for the multiplication of two 4×4 matrices and SplitK=2.

In the data parallel GEMM kernel shown below, the computation for a single block of the output matrix will be handled by 1 threadblock, TB0.

Figure 2. Data Parallel GEMM

In contrast, in the SplitK kernel, the work required to compute 1 block in the output matrix is “split” or shared amongst 2 thread blocks, TB0 and TB1. This provides better load balancing and increased parallelism.

Figure 3. SplitK GEMM

The key idea is that we’ve increased our parallelism from MN to MN*SplitK. This approach does incur some costs, such as adding inter-threadblock communication via atomic operations. However, these costs are minimal compared to the savings in other constrained GPU resources like shared memory and registers. Most importantly, the SplitK strategy provides superior load-balancing characteristics for skinny matrices, which are the common matrix profile in MoE inference during the decoding stage.
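
To make the decomposition concrete, the following is a minimal PyTorch sketch of the 4×4 example with SplitK=2. It only illustrates the arithmetic of splitting the K reduction, not the actual Triton kernel, which combines partial results with tl.atomic_add.

```python
import torch

# 4x4 example from Figures 2 and 3, with SplitK = 2 (a minimal sketch of the idea).
M = N = K = 4
split_k = 2
A = torch.randn(M, K)
B = torch.randn(K, N)

# Data-parallel view: each output tile is owned by one thread block that reduces
# over the full K dimension.
C_data_parallel = A @ B

# SplitK view: the work for each output tile is shared by split_k thread blocks,
# each reducing over a slice of K. In the Triton kernel the partial results are
# combined with tl.atomic_add on the output; here a plain sum stands in for it.
C_split_k = torch.zeros(M, N)
k_chunk = K // split_k
for s in range(split_k):  # conceptually, these iterations run as independent blocks
    k0, k1 = s * k_chunk, (s + 1) * k_chunk
    C_split_k += A[:, k0:k1] @ B[k0:k1, :]

# Both decompositions produce the same result; SplitK simply launches split_k times
# as many blocks, which improves load balancing for skinny shapes.
assert torch.allclose(C_data_parallel, C_split_k, atol=1e-5)
```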

4.0 GEMM Hardware Scheduling – Column Major

To improve upon the ~20% speedup with SplitK we focused our investigation on the logic that controls the hardware scheduling of the GEMM in Triton Kernels. Our profiling of the vLLM MoE kernel showed a low L2 cache hit rate, thus we investigated three scheduling options – column-major, row-major and grouped launch. Due to some intrinsic properties of MoE models, such as large expert matrices, and having to dynamically load TopK (2 for Mixtral) matrices during the duration of the kernel, cache reuse/hit rate becomes a bottleneck that this optimization will target.

For background, in our previous blog, we touched on the concept of “tile swizzling”, a method to achieve a greater L2 cache hit rate. This concept relates to how the software schedules the GEMM onto the SMs of a GPU. In Triton, this schedule is determined by the pid_m and pid_n calculations. Our key insight is that for skinny matrix multiplications, a column-major ordering ensures optimal reuse of the columns of the weight matrix B. To illustrate this, let’s take a look at a snippet of what a column-major computation of pid_m and pid_n would look like:

Figure 4. Column Major ordering in PyTorch
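
Figure 4 is shown as an image; the following pure-Python sketch reproduces the kind of index calculation it illustrates. The tile sizes here are hypothetical, and the exact kernel code lives in the linked repo, where the same arithmetic is written with tl.program_id and tl.cdiv.

```python
# A minimal sketch of column-major tile scheduling; tile sizes are hypothetical.
def cdiv(a, b):
    return (a + b - 1) // b

M, N = 3, 256              # skinny output: few rows, many columns
BLOCK_M, BLOCK_N = 1, 128
grid_m, grid_n = cdiv(M, BLOCK_M), cdiv(N, BLOCK_N)

for pid in range(grid_m * grid_n):
    pid_m = pid % grid_m   # walk down the M dimension first...
    pid_n = pid // grid_m  # ...then advance to the next column of output tiles
    print(f"pid {pid}: output block C({pid_m}, {pid_n})")

# Prints C(0,0), C(1,0), C(2,0), C(0,1), C(1,1), C(2,1): consecutive program IDs
# compute tiles that reuse the same column of the weight matrix B.
```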

From above, we note that with this mapping, we schedule the GEMM such that we calculate the output blocks of C in the following order: C(0, 0), C(1, 0), C(2, 0),… etc. To understand the implications we provide the following illustration:

Figure 5. Cache Reuse Pattern for a Column-Major GEMM Schedule (panels: activation matrix A, weight matrix B, L1/L2 cache, and output matrix C)

In the above simplified view of a column-major schedule, let’s assume, for a GEMM with a skinny activation matrix A, that the entire matrix can fit in the GPU cache, which is a reasonable assumption for the type of problem sizes we encounter in MoE inference. This allows for maximal reuse of the columns of the weight matrix B, because each column of B can be reused for the corresponding output tile calculations C(0,0), C(1,0) and C(2,0). Consider instead a row-major schedule: C(0,0), C(0,1), C(0,2), and so on. We would have to evict the column of B and issue multiple load instructions to DRAM to calculate the same number of output blocks.

An important design consideration when optimizing kernels is a memory access pattern that results in the least amount of global load instructions. This optimal memory access pattern is achieved with the column-major schedule. The results below showcase the performance of the three schedules we investigated:

Figure 6. Comparison of GEMM Schedules on A100 for varying Batch Sizes M

The column-major schedule provides up to a 4x speedup over the other patterns, and as we’ll show in the next section, provides an optimal memory access pattern due to greatly improved data locality.

5.0 Nsight Compute Analysis – Throughput and Memory Access Pattern

For performance analysis, we focus on the M = 2 case for the H100. A similar study can be done for the A100, as many of the same observations carry over. We note the following salient results that showcase the impact of our optimizations.

Figure 7. H100 Memory Throughput Chart for M = 2. Note the very large increase in cache hit rates: L1 cache hit rate (+2696%) and L2 cache hit rate (+254%).

Figure 8. H100 Memory Instruction Statistics M = 2. Note the 49% reduction in global memory loads.

These statistics show that our optimizations had the intended effect, which can be seen in the reduced cache misses, reduced memory accesses and the resultant 2.7x speedup. More concretely, the trace shows us a 2.54x increase in L2 hit rate (Figure 7), and a ~50% reduction in DRAM accesses (Figure 8).

These improvements ultimately yield the reduced latency, with the optimized kernel being 2.7x faster for bs=2 and 4.4x for bs=512.

6.0 Future Work

Our kernel was tested in FP16, which showcases the numerics and performance of column-major scheduling for MoE, but most production models use BFloat16. We encountered a limitation in Triton, in that tl.atomic_add does not support BFloat16, and we hit launch latency concerns that would require CUDA graph support before column-major scheduling could be used in production. In initial testing, this translated to a 70% end-to-end speedup, but we encountered some expert mapping inconsistencies in an end-to-end environment that are not reflected in the test environment, so further work is needed to fully realize these speedups.

For future work, we intend to move this into a CUDA kernel which will ensure full BFloat16 support and reduced launch latency relative to Triton, and potentially resolve the expert routing inconsistency. We’ve also previously published work on enabling GPTQ W4A16 with Triton GEMM kernels, so natural follow-on work would include fusing dequantization into this kernel to allow for a GPTQ quantized inference path.

7.0 Reproducibility

We have open-sourced the Triton kernel code along with an easy-to-run performance benchmark for readers interested in comparing or verifying the performance on their own GPUs.

Acknowledgements

We want to thank Daniel Han, Raghu Ganti, Mudhakar Srivatsa, Bert Maher, Gregory Chanan, Eli Uriegas, and Geeta Chauhan for their review of the presented material and Woo Suk from the vLLM team as we built on his implementation of the Fused MoE kernel.

Read More

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP — a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training… (Apple Machine Learning Research)

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio provides all the tools you need to take your models from data preparation to experimentation to production while boosting your productivity.

Amazon SageMaker Canvas is a powerful no-code ML tool designed for business and data teams to generate accurate predictions without writing code or having extensive ML experience. With its intuitive visual interface, SageMaker Canvas simplifies the process of loading, cleansing, and transforming datasets, and building ML models, making it accessible to a broader audience.

However, as your ML needs evolve, or if you require more advanced customization and control, you may want to transition from a no-code environment to a code-first approach. This is where the seamless integration between SageMaker Canvas and SageMaker Studio comes into play.

In this post, we present a solution for the following types of users:

  • Non-ML experts such as business analysts, data engineers, or developers, who are domain experts and are interested in low-code no-code (LCNC) tools to guide them in preparing data for ML and building ML models. This persona typically is only a SageMaker Canvas user and often relies on ML experts in their organization to review and approve their work.
  • ML experts who are interested in how LCNC tools can accelerate parts of the ML lifecycle (such as data prep), but are also likely to take a high-code approach to certain parts of the ML lifecycle (such as model building). This persona is typically a SageMaker Studio user who might also be a SageMaker Canvas user. ML experts also often play a role in reviewing and approving the work of non-ML experts for production use cases.

The utility of the solutions proposed in this post is twofold. Firstly, by demonstrating how you can share models across SageMaker Canvas and SageMaker Studio, non-ML and ML experts can collaborate across their preferred environments, which might be a no-code environment (SageMaker Canvas) for non-experts and a high-code environment (SageMaker Studio) for experts. Secondly, by demonstrating how to share a model from SageMaker Canvas to SageMaker Studio, we show how ML experts who want to pivot from an LCNC approach for development to a high-code approach for production can do so across SageMaker environments. The solution outlined in this post is for users of the new SageMaker Studio. For users of SageMaker Studio Classic, see Collaborate with data scientists for how you can seamlessly transition between SageMaker Canvas and SageMaker Studio Classic.

Solution overview

To seamlessly transition between no-code and code-first ML with SageMaker Canvas and SageMaker Studio, we have outlined two options. You can choose the option based on your requirements. In some cases, you might decide to use both options in parallel.

  • Option 1: SageMaker Model Registry – A SageMaker Canvas user registers their model in the Amazon SageMaker Model Registry, invoking a governance workflow for ML experts to review model details and metrics, then approve or reject it, after which the user can deploy the approved model from SageMaker Canvas. This option is an automated sharing process providing you with built-in governance and approval tracking. You can view the model metrics; however, there is limited visibility on the model code and architecture. The following diagram illustrates the architecture.

Option 1: SageMaker Model Registry

  • Option 2: Notebook export – In this option, the SageMaker Canvas user exports the full notebook from SageMaker Canvas to Amazon Simple Storage Service (Amazon S3), then shares it with ML experts to import into SageMaker Studio, enabling complete visibility and customization of the model code and logic before the ML expert deploys the enhanced model. In this option, there is complete visibility of the model code and architecture with the ability for the ML expert to customize and enhance the model in SageMaker Studio. However, this option demands a manual export and import of the model notebook into the IDE. The following diagram illustrates this architecture.

Option 2: Notebook export

The following phases describe the steps for collaboration:

  • Share – The SageMaker Canvas user registers the model from SageMaker Canvas or downloads the notebook from SageMaker Canvas
  • Review – The SageMaker Studio user accesses the model through the model registry to review and run the exported notebook through JupyterLab to validate the model
  • Approval – The SageMaker Studio user approves the model from the model registry
  • Deploy – The SageMaker Studio user can deploy the model from JupyterLab, or the SageMaker Canvas user can deploy the model from SageMaker Canvas

Let’s look at the two options (model registry and notebook export) within each step in detail.

Prerequisites

Before you dive into the solution, make sure you have signed up for and created an AWS account. Then you need to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites. You can skip this step if you already have your own version of SageMaker Studio running.

Complete the prerequisites for setting up SageMaker Canvas and create the model of your choice for your use case.

Share the model

The SageMaker Canvas user shares the model with the SageMaker Studio user by either registering it in SageMaker Model Registry, which triggers a governance workflow, or by downloading the full notebook from SageMaker Canvas and providing it to the SageMaker Studio user.

SageMaker Model Registry

To deploy using SageMaker Model Registry, complete the following steps:

  1. After a model is created in SageMaker Canvas, choose the options menu (three vertical dots) and choose Add to Model Registry.
    add to model registry
  2. Enter a name for the model group.
  3. Choose Add.
    model group name

You can now see the model is registered.
model registered

You can also see the model is pending approval.
pending approval

SageMaker notebook export

To deploy using a SageMaker notebook, complete the following steps:

  1. On the options menu, choose View Notebook.
    view notebook
  2. Choose Copy S3 URI.
    s3 uri

You can now share the S3 URI with the SageMaker Studio user.

Review the model

The SageMaker Studio user accesses the shared model through the model registry to review its details and metrics, or they can import the exported notebook into SageMaker Studio and use Jupyter notebooks to thoroughly validate the model’s code, logic, and performance.

SageMaker Model Registry

To use the model registry, complete the following steps:

  1. On the SageMaker Studio console, choose Models in the navigation pane.
  2. Choose Registered models.
  3. Choose your model.
    model registry

You can review the model details and see that the status is pending.
status pending

You can also review the different metrics to check on the model performance.
review metrics

You can view the model metrics; however, there is limited visibility on the model code and architecture. If you want complete visibility of the model code and architecture with the ability to customize and enhance the model, use the notebook export option.

SageMaker notebook export

To use the notebook export option as the SageMaker Studio user, complete the following steps.

  1. Launch SageMaker Studio and choose JupyterLab under Applications.
  2. Open the JupyterLab space. If you don’t have a JupyterLab space, you can create one.
    jupyter lab
  3. Open a terminal and run the following command to copy the notebook from Amazon S3 to SageMaker Studio (the account number in the following example is changed to awsaccountnumber):
    sagemaker-user@default:~$ aws s3 cp s3://sagemaker-us-east-1-awsaccountnumber/Canvas/default-20240130t161835/Training/output/Canvas1707947728560/sagemaker-automl-candidates/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb ./canvas.ipynb

    terminal

  4. After the notebook is downloaded, you can open the notebook and run the notebook to evaluate further.

candidate trials

Approve the model

After a comprehensive review, the SageMaker Studio user can make an informed decision to either approve or reject the model in the model registry based on their assessment of its quality, accuracy, and suitability for the intended use case.

For users who registered their model via the Canvas UI, follow the steps below to approve the model. For users who exported the model notebook from the Canvas UI, you may register and approve the model using SageMaker Model Registry; however, these steps are not required.

SageMaker Model Registry

As the SageMaker Studio user, when you’re comfortable with the model, you can update the status to approved. Approval happens only in SageMaker Model Registry. Complete the following steps:

  1. In SageMaker Studio, navigate to the version of the model.
  2. On the options menu, choose Update status and Approved.
    status update
  3. Enter an optional comment and choose Save and update.
    update model status

Now you can see the model is approved.
approved

Deploy the model

Once the model is ready to deploy (it has received necessary reviews and approvals), users have two options. For users who took the model registry approach, they can deploy from either SageMaker Studio or from SageMaker Canvas. For users who took the model notebook export approach, they can deploy from SageMaker Studio. Both deployment options are detailed below.

Deploy via SageMaker Studio

The SageMaker Studio user can deploy the model from the JupyterLab space.
model deployment

After the model is deployed, you can navigate to the SageMaker console, choose Endpoints under Inference in the navigation pane, and view the model.
endpoints

Deploy via SageMaker Canvas

Alternatively, if the deployment is handled by the SageMaker Canvas user, you can deploy the model from SageMaker Canvas.

canvas deploy

After the model is deployed, you can navigate to the Endpoints page on the SageMaker console to view the model.
deployed endpoints

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the SageMaker Studio notebook using the following commands:

predictor.delete_model()

predictor.delete_endpoint()
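
If the notebook session that created the predictor object is no longer open, the following minimal sketch (with a placeholder endpoint name) shows how a SageMaker Studio user could attach to the endpoint by name before deleting it; it assumes the SageMaker Python SDK is installed and configured.

```python
from sagemaker.predictor import Predictor

# Placeholder endpoint name; replace with the endpoint shown on the SageMaker console.
predictor = Predictor(endpoint_name="canvas-model-endpoint")

predictor.delete_model()     # removes the SageMaker model backing the endpoint
predictor.delete_endpoint()  # deletes the endpoint and stops inference charges
```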

Conclusion

Previously, you could only share models to SageMaker Canvas (or view shared SageMaker Canvas models) in SageMaker Studio Classic. In this post, we showed how to share models built in SageMaker Canvas with SageMaker Studio so that different teams can collaborate and you can pivot from a no-code to a high-code deployment path. By either using SageMaker Model Registry or exporting notebooks, ML experts and non-experts can collaborate, review, and enhance models across these platforms, enabling a smooth workflow from data preparation to production deployment.

For more information about collaborating on models using SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customer guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan works for AWS as an AI/ ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focusses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive business to new ways of working through innovation, incubation, and democratization.

Claire O’Brien Rajkumar is a Sr. Product Manager on the Amazon SageMaker team focused on SageMaker Canvas, the SageMaker low-code no-code workspace for ML and generative AI. SageMaker Canvas helps democratize ML and generative AI by lowering barriers to adoption for those new to ML and accelerating workflows for advanced practitioners.

Read More

Research Focus: Week of April 1, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

In the same way that tools can help people complete tasks beyond their innate abilities, tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a surprisingly understudied question is how accurately an LLM uses tools for which it has been trained.

In a recent paper: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, researchers from Microsoft find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate of 30% to 60%, which is too unreliable for practical use. They propose a biologically inspired method for tool-augmented LLMs – simulated trial and error (STE) – that orchestrates three key mechanisms: trial and error, imagination, and memory. STE simulates plausible scenarios for using a tool, then the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration. Experiments on ToolBench show STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings.

Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

The latest LLMs have surpassed the performance of older language models on several tasks and benchmarks, sometimes approaching or even exceeding human performance. Yet, it is not always clear whether this is due to the increased capabilities of these models, or other effects, such as artifacts in datasets, test dataset contamination, and the lack of datasets that measure the true capabilities of these models.

As a result, research to comprehend LLM capabilities and limitations has surged of late. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. In a recent paper: MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, researchers from Microsoft aim to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, Gemini, Gemma and Llama2) by comparing them on the same set of multilingual datasets. Their benchmark comprises 22 datasets covering 81 languages including several low-resource African languages. They also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.


Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is a process that creates text descriptions for audio recordings. Unlike closed captioning, which transcribes speech, AAC aims to describe all sounds in the audio (e.g., a muffled rumble with people talking in the background while a siren blares in the distance). Typical AAC systems require expensive curated data of audio-text pairs, which often results in a shortage of suitable data, impeding model training.

In this paper: Training Audio Captioning Models without Audio, researchers from Microsoft and Carnegie Mellon University propose a new paradigm for training AAC systems, using text descriptions alone, thereby eliminating the requirement for paired audio and text descriptions. Their approach leverages CLAP, a contrastive learning model that uses audio and text encoders to create a shared vector representation between audio and text. For instance, the text “siren blaring” and its corresponding audio recording would share the same vector. The model is trained on text captions: a GPT language decoder generates captions conditioned on the pretrained CLAP text encoder and a mapping network. During inference, audio input is first converted to its vector using the pretrained CLAP audio encoder and then a text caption is generated.

The researchers find that the proposed text-only framework competes well with top-tier models trained on both text and audio, proving that efficient text-to-audio conversion is possible. They also demonstrate the ability to incorporate various writing styles, such as humorous ones, which is beneficial for tailoring caption generation to specific fields. Finally, they highlight that enriching training with LLM-generated text leads to improved performance and has potential for increasing vocabulary diversity.

The post Research Focus: Week of April 1, 2024 appeared first on Microsoft Research.

Read More

Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless

The rise of contextual and semantic search has made product search straightforward for ecommerce and retail consumers. Search engines and recommendation systems powered by generative AI can improve the product search experience exponentially by understanding natural language queries and returning more accurate results. This enhances the overall user experience, helping customers find exactly what they’re looking for.

Amazon OpenSearch Service now supports the cosine similarity metric for k-NN indexes. Cosine similarity measures the cosine of the angle between two vectors, where a smaller cosine angle denotes a higher similarity between the vectors. With cosine similarity, you can measure the orientation between two vectors, which makes it a good choice for some specific semantic search applications.
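
As a quick illustration of the metric, the following sketch computes cosine similarity for two short, made-up vectors (real Amazon Titan Multimodal embeddings default to 1,024 dimensions):

```python
import numpy as np

# Two hypothetical embedding vectors with made-up values.
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.65, 0.05])

# Cosine similarity: dot product of the vectors divided by the product of their norms.
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine_similarity:.4f}")  # closer to 1.0 means more similar
```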

In this post, we show how to build a contextual text and image search engine for product recommendations using the Amazon Titan Multimodal Embeddings model, available in Amazon Bedrock, with Amazon OpenSearch Serverless.

A multimodal embeddings model is designed to learn joint representations of different modalities like text, images, and audio. By training on large-scale datasets containing images and their corresponding captions, a multimodal embeddings model learns to embed images and texts into a shared latent space. The following is a high-level overview of how it works conceptually:

  • Separate encoders – These models have separate encoders for each modality—a text encoder for text (for example, BERT or RoBERTa), an image encoder for images (for example, a CNN), and an audio encoder for audio (for example, models like Wav2Vec). Each encoder generates embeddings capturing the semantic features of its respective modality.
  • Modality fusion – The embeddings from the uni-modal encoders are combined using additional neural network layers. The goal is to learn interactions and correlations between the modalities. Common fusion approaches include concatenation, element-wise operations, pooling, and attention mechanisms.
  • Shared representation space – The fusion layers help project the individual modalities into a shared representation space. By training on multimodal datasets, the model learns a common embedding space where embeddings from each modality that represent the same underlying semantic content are closer together.
  • Downstream tasks – The joint multimodal embeddings generated can then be used for various downstream tasks like multimodal retrieval, classification, or translation. The model uses correlations across modalities to improve performance on these tasks compared to individual modal embeddings. The key advantage is the ability to understand interactions and semantics between modalities like text, images, and audio through joint modeling.

Solution overview

The solution provides an implementation for building a large language model (LLM) powered search engine prototype to retrieve and recommend products based on text or image queries. We detail the steps to use an Amazon Titan Multimodal Embeddings model to encode images and text into embeddings, ingest embeddings into an OpenSearch Service index, and query the index using the OpenSearch Service k-nearest neighbors (k-NN) functionality.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This foundation model (FM) generates embeddings of the product images used in this post. With Amazon Titan Multimodal Embeddings, you can generate embeddings for your content and store them in a vector database. When an end-user submits any combination of text and image as a search query, the model generates embeddings for the search query and matches them to the stored embeddings to provide relevant search and recommendations results to end-users. You can further customize the model to enhance its understanding of your unique content and provide more meaningful results using image-text pairs for fine-tuning. By default, the model generates vectors (embeddings) of 1,024 dimensions, and is accessed via Amazon Bedrock. You can also generate smaller dimensions to optimize for speed and performance
  • Amazon OpenSearch Serverless – It is an on-demand serverless configuration for OpenSearch Service. We use Amazon OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Multimodal Embeddings model. An index created in the Amazon OpenSearch Serverless collection serves as the vector store for our Retrieval Augmented Generation (RAG) solution.
  • Amazon SageMaker Studio – It is an integrated development environment (IDE) for machine learning (ML). ML practitioners can perform all ML development steps—from preparing your data to building, training, and deploying ML models.

The solution design consists of two parts: data indexing and contextual search. During data indexing, you process the product images to generate embeddings for these images and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the contextual search phase, a search query (text or image) from the user is converted into embeddings and a similarity search is run on the vector database to find the similar product images based on similarity search. You then display the top similar results. All the code for this post is available in the GitHub repo.

The following diagram illustrates the solution architecture.

The following are the solution workflow steps:

  1. Download the product description text and images from the public Amazon Simple Storage Service (Amazon S3) bucket.
  2. Review and prepare the dataset.
  3. Generate embeddings for the product images using the Amazon Titan Multimodal Embeddings model (amazon.titan-embed-image-v1), as sketched after this list. If you have a huge number of images and descriptions, you can optionally use batch inference for Amazon Bedrock.
  4. Store embeddings into the Amazon OpenSearch Serverless as the search engine.
  5. Finally, fetch the user query in natural language, convert it into embeddings using the Amazon Titan Multimodal Embeddings model, and perform a k-NN search to get the relevant search results.
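
As a reference for step 3, the following is a minimal sketch of calling the Amazon Titan Multimodal Embeddings model through the Amazon Bedrock runtime. The request fields follow the model's documented format, but the image path and helper name are placeholders rather than code from the accompanying notebook.

```python
import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def get_multimodal_embedding(image_path=None, text=None):
    """Hypothetical helper: return a Titan Multimodal embedding for an image, text, or both."""
    body = {}
    if image_path:
        with open(image_path, "rb") as f:
            body["inputImage"] = base64.b64encode(f.read()).decode("utf-8")
    if text:
        body["inputText"] = text

    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

# Placeholder usage; the notebook generates one embedding per product image.
# embedding = get_multimodal_embedding("images/example-product.jpg", "drinkware glass")
```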

We use SageMaker Studio (not shown in the diagram) as the IDE to develop the solution.

These steps are discussed in detail in the following sections. We also include screenshots and details of the output.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An AWS account and familiarity with FMs, Amazon Bedrock, Amazon SageMaker, and OpenSearch Service.
  • The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot.

If the model is not available, enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. In your AWS account, open the SageMaker console and choose Studio in the navigation pane.
  2. Choose your domain and user profile, then choose Open Studio.

Your domain and user profile name may be different.

  3. Choose System terminal under Utilities and files.
  4. Run the following command to clone the GitHub repo to the SageMaker Studio instance:
git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  5. Navigate to the multimodal/Titan/titan-multimodal-embeddings/amazon-bedrock-multimodal-oss-searchengine-e2e folder.
  6. Open the titan_mm_embed_search_blog.ipynb notebook.

Run the solution

Open the file titan_mm_embed_search_blog.ipynb and use the Data Science Python 3 kernel. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook performs the following steps:

  1. Install the packages and libraries required for this solution.
  2. Load the publicly available Amazon Berkeley Objects Dataset and metadata in a pandas data frame.

The dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalogue images. For this post, you only use the item images and item names in US English. You use approximately 1,600 products.

  3. Generate embeddings for the item images with the Amazon Titan Multimodal Embeddings model, using the get_titan_multomodal_embedding() function. For the sake of abstraction, we have defined all important functions used in this notebook in the utils.py file.

Next, you create and set up an Amazon OpenSearch Serverless vector store (collection and index).

  4. Before you create the new vector search collection and index, you must first create three associated OpenSearch Service policies: the encryption security policy, network security policy, and data access policy.

  5. Finally, ingest the image embedding into the vector index.

Now you can perform a real-time multimodal search.

Run a contextual search

In this section, we show the results of contextual search based on a text or image query.

First, let’s perform an image search based on text input. In the following example, we use the text input “drinkware glass” and send it to the search engine to find similar items.
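
Under the hood, a query like this embeds the text and runs a k-NN search against the vector index. The following is a minimal sketch using the opensearch-py client; the collection endpoint, index name (products), and vector field name (image_vector) are assumptions, since the notebook in the repo defines its own values.

```python
import json

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

# Placeholder OpenSearch Serverless collection endpoint and region.
host = "your-collection-id.us-east-1.aoss.amazonaws.com"
region = "us-east-1"
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), region, "aoss")

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Embed the text query with Titan Multimodal Embeddings (text-only input this time).
bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=json.dumps({"inputText": "drinkware glass"}),
)
query_embedding = json.loads(response["body"].read())["embedding"]

# k-NN search over the stored product-image embeddings.
search_body = {
    "size": 5,
    "query": {"knn": {"image_vector": {"vector": query_embedding, "k": 5}}},
}
results = client.search(index="products", body=search_body)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("item_name"))
```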

The following screenshot shows the results.

Now let’s look at the results based on a simple image. The input image gets converted into vector embeddings and, based on the similarity search, the model returns the result.

You can use any image, but for the following example, we use a random image from the dataset based on item ID (for example, item_id = “B07JCDQWM6”), and then send this image to the search engine to find similar items.

The following screenshot shows the results.

Clean up

To avoid incurring future charges, delete the resources used in this solution. You can do this by running the cleanup section of the notebook.

Conclusion

This post presented a walkthrough of using the Amazon Titan Multimodal Embeddings model in Amazon Bedrock to build powerful contextual search applications. In particular, we demonstrated an example of a product listing search application. We saw how the embeddings model enables efficient and accurate discovery of information from images and textual data, thereby enhancing the user experience while searching for the relevant items.

Amazon Titan Multimodal Embeddings helps you power more accurate and contextually relevant multimodal search, recommendation, and personalization experiences for end-users. For example, a stock photography company with hundreds of millions of images can use the model to power its search functionality, so users can search for images using a phrase, image, or a combination of image and text.

The Amazon Titan Multimodal Embeddings model in Amazon Bedrock is now available in the US East (N. Virginia) and US West (Oregon) AWS Regions. To learn more, refer to Amazon Titan Image Generator, Multimodal Embeddings, and Text models are now available in Amazon Bedrock, the Amazon Titan product page, and the Amazon Bedrock User Guide. To get started with Amazon Titan Multimodal Embeddings in Amazon Bedrock, visit the Amazon Bedrock console.

Start building with the Amazon Titan Multimodal Embeddings model in Amazon Bedrock today.


About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Read More

A New Lens: Dotlumen CEO Cornel Amariei on Assistive Technology for the Visually Impaired

Dotlumen is illuminating a new technology to help people with visual impairments navigate the world.

In this episode of NVIDIA’s AI Podcast, recorded live at the NVIDIA GTC global AI conference, host Noah Kravitz spoke with the Romanian startup’s founder and CEO, Cornel Amariei, about developing its flagship Dotlumen Glasses.

Equipped with sensors and powered by AI, the glasses compute a safely walkable path for visually impaired individuals and offer haptic — or tactile — feedback on how to proceed via corresponding vibrations. Amariei further discusses the process and challenges of developing assistive technology and its potential for enhancing accessibility.

Dotlumen is a member of the NVIDIA Inception program for cutting-edge startups.

Stay tuned for more episodes recorded live from GTC.

Time Stamps

0:52: Background on the glasses

4:28: User experience of the glasses

7:29: How the glasses sense the physical world and compute a walkable path

18:07: The hardest part of the development process

22:20: Expected release and availability of the glasses

25:57: Dotlumen’s technical breakthrough moments

30:19: Other assistive technologies to look out for

You Might Also Like…

Personalized Health: Viome’s Guru Banavar Discusses Startup’s AI-Driven Approach – Ep. 216

Viome Chief Technology Officer Guru Banavar discusses how the startup’s innovations in AI and genomics advance personalized health and wellness.

Cardiac Clarity: Dr. Keith Channon Talks Revolutionizing Heart Health With AI – Ep. 212

Here’s some news to still beating hearts: AI is helping bring some clarity to cardiology. Caristo Diagnostics has developed an AI-powered solution for detecting coronary inflammation in cardiac CT scans.

Matice Founder Jessica Whited on Harnessing Regenerative Species for Medical Breakthroughs – Ep. 198

Scientists at Matice Biosciences are using AI to study the regeneration of tissues in animals known as super-regenerators, such as salamanders and planarians. The goal of the research is to develop new treatments that will help humans heal from injuries without scarring.

How GluxKind Created Ella, the AI-Powered Smart Stroller – Ep. 193

Imagine a stroller that can drive itself, help users up hills, brake on slopes and provide alerts of potential hazards. That’s what GlüxKind has done with Ella, an award-winning smart stroller that uses the NVIDIA Jetson edge AI and robotics platform to power its AI features.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Coming Up ACEs: Decoding the AI Technology That’s Enhancing Games With Realistic Digital Humans

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and which showcases new hardware, software, tools and accelerations for RTX PC users.

Digital characters are leveling up.

Non-playable characters often play a crucial role in video game storytelling, but since they’re usually designed with a fixed purpose, they can get repetitive and boring — especially in vast worlds where there are thousands.

Thanks in part to incredible advances in visual computing like ray tracing and DLSS, video games are more immersive and realistic than ever, making dry encounters with NPCs especially jarring.

Earlier this year, production microservices for the NVIDIA Avatar Cloud Engine launched, giving game developers and digital creators an ace up their sleeve when it comes to making lifelike NPCs. ACE microservices allow developers to integrate state-of-the-art generative AI models into digital avatars in games and applications. With ACE microservices, NPCs can dynamically interact and converse with players in-game and in real time.

Leading game developers, studios and startups are already incorporating ACE into their titles, bringing new levels of personality and engagement to NPCs and digital humans.

Bring Avatars to Life With NVIDIA ACE

The process of creating NPCs starts with providing them a backstory and purpose, which helps guide the narrative and ensures contextually relevant dialogue. Then, ACE subcomponents work together to build avatar interactivity and enhance responsiveness.

NPCs tap up to four AI models to hear, process, generate dialogue and respond.

The player’s voice first goes into NVIDIA Riva, a technology that builds fully customizable, real-time conversational AI pipelines and turns chatbots into engaging and expressive assistants using GPU-accelerated multilingual speech and translation microservices.

With ACE, Riva’s automatic speech recognition (ASR) feature processes what was said and uses AI to deliver a highly accurate transcription in real time. Explore a Riva-powered demo of speech-to-text in a dozen languages.

The transcription then goes into an LLM — such as Google’s Gemma, Meta’s Llama 2 or Mistral — and taps Riva’s neural machine translation to generate a natural language text response. Next, Riva’s Text-to-Speech functionality generates an audio response.

Finally, NVIDIA Audio2Face (A2F) generates facial expressions that can be synced to dialogue in many languages. With the microservice, digital avatars can display dynamic, realistic emotions streamed live or baked in during post-processing.

The AI network automatically animates face, eyes, mouth, tongue and head motions to match the selected emotional range and level of intensity. And A2F can automatically infer emotion directly from an audio clip.

Each step happens in real time to ensure fluid dialogue between the player and the character. And the tools are customizable, giving developers the flexibility to build the types of characters they need for immersive storytelling or worldbuilding.

Born to Roll

At GDC and GTC, developers and platform partners showcased demos leveraging NVIDIA ACE microservices — from interactive NPCs in gaming to powerful digital human nurses.

Ubisoft is exploring new types of interactive gameplay with dynamic NPCs. NEO NPCs, the product of its latest research and development project, are designed to interact in real time with players, their environment and other characters, opening up new possibilities for dynamic and emergent storytelling.

The capabilities of these NEO NPCs were showcased through demos, each focused on different aspects of NPC behaviors, including environmental and contextual awareness; real-time reactions and animations; and conversation memory, collaboration and strategic decision-making. Combined, the demos spotlighted the technology’s potential to push the boundaries of game design and immersion.

Using Inworld AI technology, Ubisoft’s narrative team created two NEO NPCs, Bloom and Iron, each with their own background story, knowledge base and unique conversational style. Inworld technology also provided the NEO NPCs with intrinsic knowledge of their surroundings, as well as interactive responses powered by Inworld’s LLM. NVIDIA A2F provided facial animations and lip syncing for the two NPCs in real time.

Inworld and NVIDIA set GDC abuzz with a new technology demo called Covert Protocol, which showcased NVIDIA ACE technologies and the Inworld Engine. In the demo, players controlled a private detective who completed objectives based on the outcome of conversations with NPCs on the scene. Covert Protocol unlocked social simulation game mechanics with AI-powered digital characters that acted as bearers of crucial information, presented challenges and catalyzed key narrative developments. This enhanced level of AI-driven interactivity and player agency is set to open up new possibilities for emergent, player-specific gameplay.

Built on Unreal Engine 5, Covert Protocol uses the Inworld Engine and NVIDIA ACE, including NVIDIA Riva ASR and A2F, to augment Inworld’s speech and animation pipelines.

In the latest version of the NVIDIA Kairos tech demo built in collaboration with Convai, which was shown at CES, Riva ASR and A2F were used to significantly improve NPC interactivity. Convai’s new framework allowed the NPCs to converse among themselves and gave them awareness of objects, enabling them to pick up and deliver items to desired areas. Furthermore, NPCs gained the ability to lead players to objectives and traverse worlds.

Digital Characters in the Real World

The technology used to create NPCs is also being used to animate avatars and digital humans. Going beyond gaming, task-specific generative AI is moving into healthcare, customer service and more.

NVIDIA collaborated with Hippocratic AI at GTC to extend its healthcare agent solution, showcasing the potential of a generative AI healthcare agent avatar. More work is underway to develop a super-low-latency inference platform to power real-time use cases.

“Our digital assistants provide helpful, timely and accurate information to patients worldwide,” said Munjal Shah, cofounder and CEO of Hippocratic AI. “NVIDIA ACE technologies bring them to life with cutting-edge visuals and realistic animations that help better connect to patients.”

Internal testing of Hippocratic’s initial AI healthcare agents is focused on chronic care management, wellness coaching, health risk assessments, social determinants of health surveys, pre-operative outreach and post-discharge follow-up.

UneeQ is an autonomous digital human platform focused on AI-powered avatars for customer service and interactive applications. UneeQ integrated the NVIDIA A2F microservice into its platform and combined it with its Synanim ML synthetic animation technology to create highly realistic avatars for enhanced customer experiences and engagement.

“UneeQ combines NVIDIA animation AI with our own Synanim ML synthetic animation technology to deliver real-time digital human interactions that are emotionally responsive and deliver dynamic experiences powered by conversational AI,” said Danny Tomsett, founder and CEO at UneeQ.

AI in Gaming

ACE is one of the many NVIDIA AI technologies that bring games to the next level.

  • NVIDIA DLSS is a breakthrough graphics technology that uses AI to increase frame rates and improve image quality on GeForce RTX GPUs.
  • NVIDIA RTX Remix enables modders to easily capture game assets, automatically enhance materials with generative AI tools and quickly create stunning RTX remasters with full ray tracing and DLSS.
  • NVIDIA Freestyle, accessed through the new NVIDIA app beta, lets users personalize the visual aesthetics of more than 1,200 games through real-time post-processing filters, with features like RTX HDR, RTX Dynamic Vibrance and more.
  • The NVIDIA Broadcast app transforms any room into a home studio, giving livestreams AI-enhanced voice and video tools, including noise and echo removal, virtual background and AI green screen, auto-frame, video noise removal and eye contact.

Experience the latest and greatest in AI-powered experiences with NVIDIA RTX PCs and workstations, and make sense of what’s new, and what’s next, with AI Decoded.

Get weekly updates directly in your inbox by subscribing to the AI Decoded newsletter.

Read More