NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry

NVIDIA A100 Tensor Core GPUs running on Supermicro servers have captured leading results for inference in the latest STAC-ML Markets benchmark, a key technology performance gauge for the financial services industry.

The results show NVIDIA demonstrating unrivaled throughput — serving up thousands of inferences per second on the most demanding models — and top latency on the latest STAC-ML inference standard.

The results are closely followed by financial institutions, three-quarters of which rely on machine learning, deep learning or high performance computing, according to a recent survey.

NVIDIA A100: Top Latency Results

The STAC-ML inference benchmark is designed to measure the latency of long short-term memory (LSTM) model inference — the time from receiving new input data until the model output is computed. LSTM is a key model architecture used to analyze financial time-series data such as asset prices.

The benchmark includes three LSTM models of increasing complexity. NVIDIA A100 GPUs, running in a Supermicro Ultra SuperServer, demonstrated low latencies in the 99th percentile.
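The STAC-ML harness itself is not public, but the quantity being measured can be illustrated with a minimal sketch: time individual LSTM inferences end to end and report the median, 99th-percentile and maximum latencies. The model shape, window length and iteration count below are arbitrary placeholders, not the benchmark's LSTM definitions.

import statistics
import time

import torch

# Placeholder model: not one of the STAC-ML LSTM definitions.
model = torch.nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
model.eval()

window = torch.randn(1, 100, 32)  # one window of 100 timesteps, 32 features
latencies_us = []

with torch.no_grad():
    for _ in range(10_000):
        start = time.perf_counter()
        output, _ = model(window)  # latency: new input in, model output computed
        latencies_us.append((time.perf_counter() - start) * 1e6)

latencies_us.sort()
print(f"median: {statistics.median(latencies_us):.1f} us")
print(f"p99:    {latencies_us[int(0.99 * len(latencies_us))]:.1f} us")
print(f"max:    {latencies_us[-1]:.1f} us")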

Accelerated Computing for STAC-ML and STAC-A2, STAC-A3 Benchmarks

Considering the A100’s performance on STAC-ML inference — alongside its record-setting performance in the STAC-A2 benchmark for option price discovery and the STAC-A3 benchmark for model backtesting — offers a glimpse of how NVIDIA AI computing can accelerate the full pipeline of a modern trading environment.

It also shows A100 GPUs deliver leading performance and workload versatility for financial institutions.

Predictable Performance for Consistent Low Latency

Predictable performance is crucial for low-latency environments in finance, as extreme outliers can cause substantial losses during fast market moves. 

Notably, there were no large outliers in NVIDIA’s latency: the maximum latency was no more than 2.3x the median latency across all LSTMs and all tested numbers of model instances, up to 32 concurrent instances.1

NVIDIA is the first to submit performance results for what’s known as the Tacana Suite of the benchmark. Tacana is for inference performed on a sliding window, where a new timestep is added and the oldest removed for each inference operation. This is helpful for high-frequency trading, where inference needs to be performed on every market data update.

A second suite, Sumaco, performs inference on an entirely new set of data, which reflects the use case where an event prompts inference based on recent history.

Leading Throughput in Benchmark Results

NVIDIA also submitted a throughput-optimized configuration on the same hardware for the Sumaco Suite in FP16 precision.2

On the least complex LSTM in the benchmark, A100 GPUs on Supermicro servers helped serve up more than 1.7 million inferences per second.3

For the most complex LSTM, these systems handled as many as 12,800 inferences per second.4

NVIDIA A100: Performance and Versatility 

NVIDIA GPUs offer multiple advantages that lower the total cost of ownership for electronic trading stacks.

For one, NVIDIA AI provides a single platform for training and inference. Whether developing, backtesting or deploying an AI model, NVIDIA AI delivers leading performance — and developers don’t need to learn different programming languages and frameworks for research and trading.

Moreover, the NVIDIA CUDA programming model enables development, optimization and deployment of applications across GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers.

Efficiencies for Reduced Operating Expenses

The financial services industry stands to benefit from not only data throughput advances but also improved operational efficiencies.

Reduced energy and square footage usage for systems in data centers can make a big difference in operating expenses. That’s especially pressing as IT organizations make the case for budgetary outlays to cover new high-performance systems.

On the most demanding LSTM model, NVIDIA A100 exceeded 17,700 inferences per second per kilowatt while consuming 722 watts (consistent with the roughly 12,800 inferences per second reported above), offering leading energy efficiency.5

The benchmark results confirm that NVIDIA GPUs are unrivaled in terms of throughput and energy efficiency for workloads like backtesting and simulation.

Learn about NVIDIA delivering smarter, more secure financial services.

[1] SUT ID NVDA221118b, max of STAC-ML.Markets.Inf.T.LSTM_A.2.LAT.v1

[2] SUT ID NVDA221118a

[3] STAC-ML.Markets.Inf.S.LSTM_A.4.TPUT.v1

[4] STAC-ML.Markets.Inf.S.LSTM_C.[1,2,4].TPUT.v1

[5] SUT ID NVDA221118a, STAC-ML.Markets.Inf.S.LSTM_C.[1,2,4].ENERG_EFF.v1

Read More

Survey Reveals Financial Industry’s Top 4 AI Priorities for 2023

For several years, NVIDIA has been working with some of the world’s leading financial institutions to develop and execute a wide range of rapidly evolving AI strategies. For the past three years, we’ve asked them to tell us collectively what’s on the top of their minds.

Sometimes the results are just what we thought they’d be, and other times they’re truly surprising. In this year’s survey, conducted in a time of continued macroeconomic uncertainty, the results were a little of both.

From banking and fintech institutions to insurance and asset management firms, the goals remain the same — find ways to more accurately manage risk, enhance efficiencies to reduce operating costs, and improve experiences for clients and customers. By digging in deeper, we were able to learn which areas of AI are of most interest as well as a bit more.

Below are the top four findings we gleaned from our “State of AI in Financial Services: 2023 Trends” survey taken by nearly 500 global financial services professionals.

Hybrid Cloud Is Coming on Strong

Financial services firms, like other enterprises, are looking to optimize spending for AI training and inference — with the knowledge that sensitive data can’t be migrated to the cloud. To do so cost-effectively, they’re moving many of their compute-intensive workloads to the hybrid cloud.

This year’s survey found that nearly half of respondents’ firms are moving to the hybrid cloud to optimize AI performance and reduce costs. Recent announcements from leading cloud service providers and platforms reinforce this shift and make data portability, MLOps management and software standardization across cloud and on-prem instances a strategic imperative for cost and efficiency.

Large Language Models Top the List of AI Use Cases 

The survey results, focused on companies based in the Americas and Europe, with a sample size of over 200, found the top AI use cases to be natural language processing and large language models (26%), recommender systems and next-best action (23%), portfolio optimization (23%) and fraud detection (22%). Emerging workloads for the metaverse, synthetic data generation and virtual worlds were also common.

Banks, trading firms and hedge funds are adopting these technologies to create personalized customer experiences. For example, Deutsche Bank recently announced a multi-year innovation partnership with NVIDIA to embed AI into financial services across use cases, including intelligent avatars, speech AI, fraud detection and risk management, to slash total cost of ownership by up to 80%. The bank plans to use NVIDIA Omniverse to build a 3D virtual avatar to help employees navigate internal systems and respond to HR-related questions.

Banks Seeing More Potential for AI to Grow Revenue

The survey found that AI is having a quantifiable impact on financial institutions. Nearly half of survey takers said that AI will help increase annual revenue for their organization by at least 10%. More than a third noted that AI will also help decrease annual costs by at least 10%.

Financial services professionals highlighted how AI has enhanced business operations — particularly improving customer experience (46%), creating operational efficiencies (35%) and reducing total cost of ownership (20%).

For example, computer vision and natural language processing are helping automate financial document analysis and claims processing, saving companies time, expenses and resources. AI also helps prevent fraud by enhancing anti-money laundering and know-your-customer processes, while recommenders create personalized digital experiences for a firm’s customers or clients.

The Biggest Obstacle: Recruiting and Retaining AI Talent 

But there are challenges to achieving AI goals in the enterprise. Recruiting and retaining AI experts is the single biggest obstacle, a problem reported by 36% of survey takers. Another 28% of respondents cited inadequate technology to enable AI innovation.

Insufficient data for model training and accuracy is another pressing issue, noted by 26% of financial services professionals. This could be addressed through the use of generative AI to produce accurate synthetic financial data for training AI models.

Executive Support for AI at New High

Despite the challenges, the future for AI in FSI is getting brighter. Increasing executive buy-in for AI is a new theme in the survey results. Some 64% of those surveyed noted that “my executive leadership team values and believes in AI,” compared with 36% a year ago. In addition, 58% said that “AI is important to my company’s future success,” up from 39% a year ago.

Financial institutions plan to continue building out enterprise AI in the future. This will include scaling up and scaling out AI infrastructure, including hardware, software and services.

Empowering data scientists, quants and developers while minimizing bottlenecks requires a sophisticated, full stack AI platform. Executives have seen the ROI of deploying AI-enabled applications. In 2023, these leaders will focus on scaling AI across the enterprise, hiring more data scientists and investing in accelerated computing technology to support training and deployment of AI applications.

Download the “State of AI in Financial Services: 2023 Trends” report for in-depth results and insights.

Watch on-demand sessions from NVIDIA GTC featuring industry leaders from Capital One, Deutsche Bank, U.S. Bank and Ubiquant. And learn more about delivering smarter, more secure financial services and the AI-powered bank.

Read More

Deprecation of CUDA 11.6 and Python 3.7 Support

For the upcoming PyTorch 2.0 feature release (target March 2023), we will target CUDA 11.7 as the stable version and CUDA 11.8 as the experimental version, with Python >=3.8, <=3.11.

If you are still using or depending on CUDA 11.6 or Python 3.7 builds, we strongly recommend moving to at least CUDA 11.7 and Python 3.8, as these will be the minimum versions required for PyTorch 2.0.

Please note that as of Feb 1, 2023, CUDA 11.6 and Python 3.7 are no longer included in the nightlies.
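If you are unsure which builds you have installed, a quick check of the Python, PyTorch, CUDA and cuDNN versions (assuming PyTorch is already installed) looks like this:

import sys

import torch

print("Python :", sys.version.split()[0])          # should be >= 3.8 for PyTorch 2.0
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)              # should be 11.7 or 11.8 for PyTorch 2.0
print("cuDNN  :", torch.backends.cudnn.version())
print("GPU OK :", torch.cuda.is_available())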

Please refer to the Release Compatibility Matrix for PyTorch releases:

PyTorch Version | Python | Stable CUDA | Experimental CUDA
2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84
1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96
1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44

As of 2/1/2023

For more information on PyTorch releases, updated compatibility matrix and release policies, please see (and bookmark) Readme.

Read More

How to decide between Amazon Rekognition image and video API for video moderation

Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps consumers make decisions on everything from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, and reviews more than ever, and 78% of them said that brands are responsible for moderating such content. And 40% said that they would disengage with a brand after a single exposure to toxic content.

Amazon Rekognition has two sets of APIs that help you moderate images or videos to keep digital communities safe and engaged.

One approach to moderate videos is to model video data as a sample of image frames and use image content moderation models to process the frames individually. This approach allows the reuse of image-based models. Some customers have asked if they could use this approach to moderate videos by sampling image frames and sending them to the Amazon Rekognition image moderation API. They are curious about how this solution compares with the Amazon Rekognition video moderation API.

We recommend using the Amazon Rekognition video moderation API to moderate video content. It’s designed and optimized for video moderation, offering better performance and lower costs. However, there are specific use cases where the image API solution is optimal.

This post compares the two video moderation solutions in terms of accuracy, cost, performance, and architecture complexity to help you choose the best solution for your use case.

Moderate videos using the video moderation API

The Amazon Rekognition video content moderation API is the standard solution used to detect inappropriate or unwanted content in videos. It performs as an asynchronous operation on video content stored in an Amazon Simple Storage Service (Amazon S3) bucket. The analysis results are returned as an array of moderation labels along with a confidence score and timestamp indicating when the label was detected.

The video content moderation API uses the same machine learning (ML) model as the image moderation API. The output is filtered to reduce noisy false positive results. The workflow is optimized for latency by parallelizing operations like decoding, frame extraction, and inference.

The following diagram shows the logical steps of how to use the Amazon Rekognition video moderation API to moderate videos.

Rekognition Content Moderation Video API diagram

The steps are as follows:

  1. Upload videos to an S3 bucket.
  2. Call the video moderation API in an AWS Lambda function (or a customized script on premises) with the video file location as a parameter. The API manages the heavy lifting of video decoding, sampling, and inference. You can either implement heartbeat logic to check the moderation job status until it completes, or use Amazon Simple Notification Service (Amazon SNS) to implement an event-driven pattern. Refer to the Jupyter notebook for detailed examples of the video moderation API; a minimal sketch also follows this list.
  3. Store the moderation result as a file in an S3 bucket or database.
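The following is a minimal sketch of step 2, assuming placeholder bucket and key names; it uses the boto3 StartContentModeration and GetContentModeration calls with simple polling in place of Amazon SNS, and reads only the first page of results.

import time

import boto3

rekognition = boto3.client("rekognition")

def moderate_video(bucket: str, video_key: str, min_confidence: float = 50.0):
    """Start an asynchronous video moderation job and poll until it finishes."""
    job = rekognition.start_content_moderation(
        Video={"S3Object": {"Bucket": bucket, "Name": video_key}},
        MinConfidence=min_confidence,
    )
    job_id = job["JobId"]

    # Heartbeat polling; an SNS notification could replace this loop in production.
    while True:
        result = rekognition.get_content_moderation(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(10)

    # Each label carries a timestamp (in milliseconds) and a confidence score.
    # For long videos, follow NextToken to page through all labels.
    return result.get("ModerationLabels", [])

# Example usage with placeholder names:
# labels = moderate_video("my-bucket", "videos/sample.mp4")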

Moderate videos using the image moderation API

Instead of using the video content moderation API, some customers choose to independently sample frames from videos and detect inappropriate content by sending the images to the Amazon Rekognition DetectModerationLabels API. Image results are returned in real time with labels for inappropriate or offensive content, along with a confidence score.

The following diagram shows the logical steps of the image API solution.

Rekognition Content Moderation Video Image Sampling Diagram
The steps are as follows:

1. Use a customized application or script as an orchestrator, starting by loading the video onto the local file system.
2. Decode the video.
3. Sample image frames from the video at a chosen interval, such as two frames per second. Then iterate through all the images to:

3.a. Send each image frame to the image moderation API.
3.b. Store the moderation results in a file or database (a minimal sketch follows this list).
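As an illustration only (not the exact orchestrator used in our tests), the sketch below decodes a local video with OpenCV, samples frames at an assumed two frames per second and sends each frame to the DetectModerationLabels API; the file path and sampling rate are placeholders.

import boto3
import cv2  # pip install opencv-python

rekognition = boto3.client("rekognition")

def moderate_frames(video_path: str, frames_per_second: float = 2.0, min_confidence: float = 50.0):
    """Decode a local video, sample frames, and moderate each frame individually."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / frames_per_second)), 1)

    results = []
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % step == 0:
            # Encode the frame as JPEG bytes for the image moderation API.
            encoded, buffer = cv2.imencode(".jpg", frame)
            if encoded:
                response = rekognition.detect_moderation_labels(
                    Image={"Bytes": buffer.tobytes()},
                    MinConfidence=min_confidence,
                )
                timestamp_ms = capture.get(cv2.CAP_PROP_POS_MSEC)
                results.append((timestamp_ms, response["ModerationLabels"]))
        frame_index += 1

    capture.release()
    return results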

Compare this with the video API solution, which requires only a lightweight Lambda function to orchestrate API calls. The image sampling solution is CPU intensive and requires more compute resources. You can host the application using AWS services such as Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or Amazon Elastic Compute Cloud (Amazon EC2).

Evaluation dataset

To evaluate both solutions, we use a sample dataset consisting of 200 short-form videos. The videos range from 10 seconds to 45 minutes. 60% of the videos are less than 2 minutes long. This sample dataset is used to test the performance, cost, and accuracy metrics for both solutions. The results compare the Amazon Rekognition image API sampling solution to the video API solution.

To test the image API solution, we use open-source libraries (ffmpeg and OpenCV) to sample images at a rate of two frames per second (one frame every 500 milliseconds). This rate mimics the sampling frequency used by the video content moderation API. Each image is sent to the image content moderation API to generate labels.

To test the video sampling solution, we send the videos directly to the video content moderation API to generate labels.

Results summary

We focus on the following key results:

  • Accuracy – Both solutions offer similar accuracy (false positive and false negative percentages) using the same sampling frequency of two frames per second
  • Cost – The image API sampling solution is more expensive than the video API solution using the same sampling frequency of two frames per second
    • The image API sampling solution cost can be reduced by sampling fewer frames per second
  • Performance – On average, the video API has a 425% faster processing time than the image API solution for the sample dataset
    • The image API solution performs better in situations with a high frame sample interval and on videos less than 90 seconds
  • Architecture complexity – The video API solution has a low architecture complexity, whereas the image API sampling solution has a medium architecture complexity

Accuracy

We tested both solutions using the sample set and the same sampling frequency of two frames per second. The results demonstrated that both solutions provide a similar false positive and true positive ratio. This result is expected because under the hood, Amazon Rekognition uses the same ML model for both the video and image moderation APIs.

To learn more about metrics for evaluating content moderation, refer to Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services.

Cost

The cost analysis demonstrates that the image API solution is more expensive than the video API solution if you use the same sampling frequency of two frames per second. The image API solution can be more cost effective if you reduce the number of frames sampled per second.

The two primary factors that affect the cost of a content moderation solution are the Amazon Rekognition API costs and the compute costs. The default pricing is $0.10 per minute for the video content moderation API and $0.001 per image for the image content moderation API. At a rate of two frames per second, a 60-second video produces 120 frames, so the video API costs $0.10 to moderate a 60-second video, whereas the image API costs $0.12 (120 frames x $0.001).

The price calculation is based on the official price in Region us-east-1 at the time of writing this post. For more information, refer to Amazon Rekognition pricing.

The cost analysis looks at the total cost to generate content moderation labels for the 200 videos in the sample set. The calculations are based on us-east-1 pricing. If you’re using another Region, modify the parameters with the pricing for that Region. The 200 videos contain 4271.39 minutes of content and generate 512,567 image frames at a sampling rate of two frames per second.

This comparison doesn’t consider other costs, such as Amazon S3 storage. We use Lambda as an example to calculate the AWS compute cost. Compute costs take into account the number of requests to Lambda and AWS Step Functions to run the analysis. The Lambda memory/CPU setting is estimated based on the Amazon EC2 specifications. This cost estimate uses a 4 GB, 2-second Lambda request per image API call. Lambda functions have a maximum invocation timeout limit of 15 minutes. For longer videos, you may need to implement iteration logic using Step Functions to reduce the number of frames processed per Lambda call. The actual Lambda settings and cost patterns may differ depending on your requirements. It’s recommended to test the solution end to end for a more accurate cost estimate.
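As a rough illustration of this arithmetic (the Rekognition prices are the us-east-1 list prices above; the Lambda memory, duration and GB-second rate are assumptions rather than measured values), the following sketch estimates both options for a given amount of video:

def estimate_costs(video_minutes: float,
                   frames_per_second: float = 2.0,
                   video_api_price_per_minute: float = 0.10,    # us-east-1 list price
                   image_api_price_per_image: float = 0.001,    # us-east-1 list price
                   lambda_gb: float = 4.0,                      # assumed Lambda memory
                   lambda_seconds_per_frame: float = 2.0,       # assumed duration per call
                   lambda_price_per_gb_second: float = 0.0000166667):
    """Rough cost comparison of the video API vs. the frame-sampling image API."""
    frames = video_minutes * 60 * frames_per_second
    video_api_cost = video_minutes * video_api_price_per_minute
    image_api_cost = frames * image_api_price_per_image
    compute_cost = frames * lambda_gb * lambda_seconds_per_frame * lambda_price_per_gb_second
    return {
        "video_api_total": round(video_api_cost, 2),
        "image_api_total": round(image_api_cost + compute_cost, 2),
    }

# For the 4,271.39-minute sample set at two frames per second, the Rekognition
# components come to about $427.14 (video API) and $512.57 (image API), matching
# the table below; the Lambda term here is only a partial approximation and
# excludes Step Functions and per-request charges.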

The following table summarizes the costs.

Type | Amazon Rekognition Costs | Compute Costs | Total Cost
Video API Solution | $427.14 | $0 (Free tier) | $427.14
Image API Solution: Two frames per second | $512.57 | $164.23 | $676.80
Image API Solution: One frame per second | $256.28 | $82.12 | $338.40

Performance

On average, the video API solution has a four times faster processing time than the image API solution. The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds.

This analysis measures performance as the average processing time in seconds per video. It looks at the total and average time to generate content moderation labels for the 200 videos in the sample set. The processing time is measured from the video upload to the result output and includes each step in the image sampling and video API process.

The video API solution has an average processing time of 35.2 seconds per video for the sample set. This is compared to the image API solution with an average processing time of 156.24 seconds per video for the sample set. On average, the video API performs four times faster than the image API solution. The following table summarizes these findings.

Type | Average Processing Time (All Videos) | Average Processing Time (Videos Under 1.5 Minutes)
Video API Solution | 35.2 seconds | 24.05 seconds
Image API Solution: Two frames per second | 156.24 seconds | 8.45 seconds
Difference | 425% | -185%

The image API performs better than the video API when the video is shorter than 90 seconds. This is because the video API queues moderation jobs, which adds lead time. The image API can also perform better if you use a lower sampling frequency: increasing the frame interval to over 5 seconds can decrease processing time by 6–10 times. Note, however, that longer intervals increase the risk of missing inappropriate content that appears between sampled frames.

Architecture complexity

The video API solution has a low architecture complexity. You can set up a serverless pipeline or run a script to retrieve content moderation results. Amazon Rekognition manages the heavy computing and inference. The application orchestrating the Amazon Rekognition APIs can be hosted on a light machine.

The image API solution has a medium architecture complexity. The application logic has to orchestrate additional steps to store videos on the local drive, run image processing to capture frames, and call the image API. The server hosting the application requires higher compute capacity to support the local image processing. For the evaluation, we launched an EC2 instance with 4 vCPUs and 8 GB of RAM to support two parallel threads. Higher compute requirements may lead to additional operational overhead.

Optimal use cases for the image API solution

The image API solution is ideal for three specific use cases when processing videos.

The first is real-time video streaming. You can capture image frames from a live video stream and send the images to the image moderation API.

The second use case is content moderation with a low frame sampling rate requirement. The image API solution is more cost-effective and performant if you sample frames at a low frequency. It’s important to note that there will be a trade-off between cost and accuracy. Sampling frames at a lower rate may increase the risk of missing frames with inappropriate content.

The third use case is for the early detection of inappropriate content in video. The image API solution is flexible and allows you to stop processing and flag the video early on, saving cost and time.

Conclusion

The video moderation API is ideal for most video moderation use cases. It’s more cost effective and performant than the image API solution when you sample frames at a frequency such as two frames per second. Additionally, it has a low architectural complexity and reduced operational overhead requirements.

The following table summarizes our findings to help you maximize the use of the Amazon Rekognition image and video APIs for your specific video moderation use cases. Although these results are averages achieved during testing and by some of our customers, they should give you ideas to balance the use of each API.

Criteria | Video API Solution | Image API Solution
Accuracy | Same accuracy | Same accuracy
Cost | Lower cost using the default image sampling interval | Lower cost if you reduce the number of frames sampled per second (sacrificing accuracy)
Performance | Faster for videos longer than 90 seconds | Faster for videos shorter than 90 seconds
Architecture Complexity | Low complexity | Medium complexity

Amazon Rekognition content moderation can not only help your business keep customers safe and engaged, but also contribute to your ongoing efforts to maximize the return on your content moderation investment. Learn more about Content Moderation on AWS and our Content Moderation ML use cases.


About the authors

Lana Zhang is a Sr. Solutions Architect on the AWS WWSO AI Services team, with expertise in AI and ML for content moderation and computer vision. She is passionate about promoting AWS AI services and helping customers transform their business solutions.

Brigit Brown is a Solutions Architect at Amazon Web Services. Brigit is passionate about helping customers find innovative solutions to complex business challenges using machine learning and artificial intelligence. Her core areas of depth are natural language processing and content moderation.

Read More

The Flan Collection: Advancing open source methods for instruction tuning

Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often ones they hadn’t seen before. This ability to reason about new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remains unreleased to the broader research community.

In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly available collection of tasks, templates, and methods for instruction tuning to advance the community’s ability to analyze and improve instruction-tuning methods. This collection was first used in Flan-T5 and Flan-PaLM, the latter of which achieved significant improvements over PaLM. We show that training a model on this collection yields improved performance over comparable public collections on all tested evaluation benchmarks, e.g., a 3%+ improvement on the 57 tasks in the Massive Multitask Language Understanding (MMLU) evaluation suite and an 8% improvement on BigBench Hard (BBH). Analysis suggests the improvements stem both from the larger and more diverse set of tasks and from applying a set of simple training and data augmentation techniques that are cheap and easy to implement: mixing zero-shot, few-shot, and chain of thought prompts at training, enriching tasks with input inversion, and balancing task mixtures. Together, these methods enable the resulting language models to reason more competently over arbitrary tasks, even those for which they haven’t seen any fine-tuning examples. We hope making these findings and resources publicly available will accelerate research into more powerful and general-purpose language models.

Public instruction tuning data collections

Since 2020, several instruction tuning task collections have been released in rapid succession, shown in the timeline below. Recent research has yet to coalesce around a unified set of techniques, with different sets of tasks, model sizes, and input formats all represented. This new collection, referred to below as “Flan 2022”, combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.

A timeline of public instruction tuning collections, including: UnifiedQA, CrossFit, Natural Instructions, FLAN, P3/T0, MetaICL, ExT5, Super-Natural Instructions, mT0, Unnatural Instructions, Self-Instruct, and OPT-IML Bench. The table describes the release date, the task collection name, the model name, the base model(s) that were finetuned with this collection, the model size, whether the resulting model is Public (green) or Not Public (red), whether they train with zero-shot prompts (“ZS”), few-shot prompts (“FS”), chain-of-thought prompts (“CoT”) together (“+”) or separately (“/”), the number of tasks from this collection in Flan 2022, the total number of examples, and some notable methods, related to the collections, used in these works. Note that the number of tasks and examples vary under different assumptions and so are approximations. Counts for each are reported using task definitions from the respective works.

In addition to scaling to more instructive training tasks, The Flan Collection combines training with different types of input-output specifications, including just instructions (zero-shot prompting), instructions with examples of the task (few-shot prompting), and instructions that ask for an explanation with the answer (chain of thought prompting). Except for InstructGPT, which leverages a collection of proprietary data, Flan 2022 is the first work to publicly demonstrate the strong benefits of mixing these prompting settings together during training. Instead of a trade-off between the various settings, mixing prompting settings during training improves all prompting settings at inference time, as shown below for both tasks held-in and held-out from the set of fine-tuning tasks.

Training jointly with zero-shot and few-shot prompt templates improves performance on both held-in and held-out tasks. The stars indicate the peak performance in each setting. Red lines denote the zero-shot prompted evaluation, lilac denotes few-shot prompted evaluation.
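To make the three prompting settings concrete, here is a purely illustrative sketch; the task, wording and templates are invented for this example and are not the Flan Collection's actual templates.

# Illustrative only: hypothetical templates, not the actual Flan Collection ones.
instance = {
    "instruction": "Classify the sentiment of the review as positive or negative.",
    "input": "The battery dies within an hour.",
    "target": "negative",
}

few_shot_exemplar = "Review: The screen is gorgeous and bright.\nSentiment: positive\n\n"

def zero_shot(ex):
    return f"{ex['instruction']}\n\nReview: {ex['input']}\nSentiment:", ex["target"]

def few_shot(ex):
    return f"{ex['instruction']}\n\n{few_shot_exemplar}Review: {ex['input']}\nSentiment:", ex["target"]

def chain_of_thought(ex):
    prompt = f"{ex['instruction']} Explain your reasoning, then answer.\n\nReview: {ex['input']}\n"
    target = "The review complains about battery life, so the sentiment is negative."
    return prompt, target

# A Flan-2022-style mixture draws training examples from all three renderings
# rather than committing to a single prompting format.
training_examples = [render(instance) for render in (zero_shot, few_shot, chain_of_thought)]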

Evaluating instruction tuning methods

To understand the overall effects of swapping one instruction tuning collection for another, we fine-tune equivalently-sized T5 models on popular public instruction-tuning collections, including Flan 2021, T0++, and Super-Natural Instructions. Each model is then evaluated on a set of tasks that are already included in each of the instruction tuning collections, a set of five chain-of-thought tasks, and then a set of 57 diverse tasks from the MMLU benchmark, both with zero-shot and few-shot prompts. In each case, the new Flan 2022 model, Flan-T5, outperforms these prior works, demonstrating a more powerful general-purpose NLP reasoner.

Comparing public instruction tuning collections on held-in, chain-of-thought, and held-out evaluation suites, such as BigBench Hard and MMLU. All models except OPT-IML-Max (175B) are trained by us, using T5-XL with 3B parameters. Green text indicates improvement over the next best comparable T5-XL (3B) model.

Single task fine-tuning

In applied settings, practitioners usually deploy NLP models fine-tuned specifically for one target task, where training data is already available. We examine this setting to understand how Flan-T5 compares to T5 models as a starting point for applied practitioners. Three settings are compared: fine-tuning T5 directly on the target task, using Flan-T5 without further fine-tuning on the target task, and fine-tuning Flan-T5 on the target task. For both held-in and held-out tasks, fine-tuning Flan-T5 offers an improvement over fine-tuning T5 directly. In some instances, usually where training data is limited for a target task, Flan-T5 without further fine-tuning outperforms T5 with direct fine-tuning.

Flan-T5 outperforms T5 on single-task fine-tuning. We compare single-task fine-tuned T5 (blue bars), single-task fine-tuned Flan-T5 (red), and Flan-T5 without any further fine-tuning (beige).
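As a small sketch of using Flan-T5 as a starting point, the snippet below loads a publicly released Hugging Face checkpoint and runs it zero-shot; the prompt is an arbitrary example, and task-specific fine-tuning would follow the usual Transformers training workflow.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a released Flan-T5 checkpoint; larger variants such as
# "google/flan-t5-xl" follow the same pattern.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Zero-shot use without any further fine-tuning.
prompt = "Answer the question: What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))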

An additional benefit of using Flan-T5 as a starting point is that training is significantly faster and cheaper, converging more quickly than T5 fine-tuning, and usually peaking at higher accuracies. This suggests less task-specific training data may be necessary to achieve similar or better results on a particular task.

Flan-T5 converges faster than T5 on single-task fine-tuning, for each of five held-out tasks from Flan fine-tuning. Flan-T5’s learning curve is indicated with the solid lines, and T5’s learning curve with the dashed line. All tasks are held-out during Flan finetuning.

There are significant energy efficiency benefits for the NLP community in adopting instruction-tuned models like Flan-T5 for single-task fine-tuning, rather than conventional non-instruction-tuned models. While pre-training and instruction fine-tuning are financially and computationally expensive, they are a one-time cost, usually amortized over millions of subsequent fine-tuning runs, which for the most prominent models can become more costly in aggregate. Instruction-tuned models offer a promising way to significantly reduce the number of fine-tuning steps needed to achieve the same or better performance.

Conclusion

The new Flan instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting method outperforms Flan, P3, and Super-Natural Instructions on held-in, chain of thought, MMLU, and BBH benchmarks by 3–17% across zero-shot and few-shot variants. Results suggest this new collection serves as a more performant starting point for researchers and practitioners interested in both generalizing to new instructions and fine-tuning on a single new task.

Acknowledgements

It was a privilege to work with Jason Wei, Barret Zoph, Le Hou, Hyung Won Chung, Tu Vu, Albert Webson, Denny Zhou, and Quoc V Le on this project.

Read More

Introducing ChatGPT Plus

We’re launching a pilot subscription plan for ChatGPT, a conversational AI that can chat with you, answer follow-up questions, and challenge incorrect assumptions.

The new subscription plan, ChatGPT Plus, will be available for $20/month, and subscribers will receive a number of benefits:

  • General access to ChatGPT, even during peak times
  • Faster response times
  • Priority access to new features and improvements

ChatGPT Plus is available to customers in the United States, and we will begin the process of inviting people from our waitlist over the coming weeks. We plan to expand access and support to additional countries and regions soon.

We love our free users and will continue to offer free access to ChatGPT. By offering this subscription pricing, we will be able to help support free access availability to as many people as possible.

Learning from the research preview

We launched ChatGPT as a research preview so we could learn more about the system’s strengths and weaknesses and gather user feedback to help us improve upon its limitations. Since then, millions of people have given us feedback, we’ve made several important updates, and we’ve seen users find value across a range of professional use cases, including drafting and editing content, brainstorming ideas, programming help, and learning new topics.

Our plans for the future

We plan to refine and expand this offering based on your feedback and needs. We’ll also soon be launching the ChatGPT API waitlist, and we are actively exploring options for lower-cost plans, business plans, and data packs for more availability.

OpenAI

Scaling distributed training with AWS Trainium and Amazon EKS

Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and DeepSpeed, it’s common for training jobs to require hundreds of accelerator devices for several weeks or months at a time.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium—a purpose-built machine learning (ML) accelerator optimized to provide a high-performance, cost-effective, and massively scalable platform for training deep learning models in the cloud. Trn1 instances are available in a number of sizes (see the following table), with up to 16 Trainium accelerators per instance.

Instance Size | Trainium Accelerators | Accelerator Memory (GB) | vCPUs | Instance Memory (GiB) | Network Bandwidth (Gbps)
trn1.2xlarge | 1 | 32 | 8 | 32 | Up to 12.5
trn1.32xlarge | 16 | 512 | 128 | 512 | 800
trn1n.32xlarge (coming soon) | 16 | 512 | 128 | 512 | 1600

Trn1 instances can either be deployed as standalone instances for smaller training jobs, or in highly scalable ultraclusters that support distributed training across tens of thousands of Trainium accelerators. All Trn1 instances support the standalone configuration, whereas Trn1 ultraclusters require trn1.32xlarge or trn1n.32xlarge instances. In an ultracluster, multiple Trn1 instances are co-located in a given AWS Availability Zone and are connected with high-speed, low-latency, Elastic Fabric Adapter (EFA) networking that provides 800 Gbps of nonblocking network bandwidth per instance for collective compute operations. The trn1n.32xlarge instance type, launching in early 2023, will increase this bandwidth to 1600 Gbps per instance.

Many enterprise customers choose to deploy their deep learning workloads using Kubernetes—the de facto standard for container orchestration in the cloud. AWS customers often deploy these workloads using Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS is a managed Kubernetes service that simplifies the creation, configuration, lifecycle, and monitoring of Kubernetes clusters while still offering the full flexibility of upstream Kubernetes.

Today, we are excited to announce official support for distributed training jobs using Amazon EKS and EC2 Trn1 instances. With this announcement, you can now easily run large-scale containerized training jobs within Amazon EKS while taking full advantage of the price-performance, scalability, and ease of use offered by Trn1 instances.

Along with this announcement, we are also publishing a detailed tutorial that guides you through the steps required to run a multi-instance distributed training job (BERT phase 1 pre-training) using Amazon EKS and Trn1 instances. In this post, you will learn about the solution architecture and review several key steps from the tutorial. Refer to the official tutorial repository for the complete end-to-end workflow.

To follow along, broad familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EKS is assumed, and basic familiarity with deep learning and PyTorch would be helpful.

Solution architecture

The following diagram illustrates the solution architecture.

The solution consists of the following main components:

  • An EKS cluster
  • An EKS node group consisting of trn1.32xlarge instances
  • The AWS Neuron SDK
  • EKS plugins for Neuron and EFA
  • An Amazon Elastic Container Registry (Amazon ECR) repository
  • A training container image
  • An Amazon FSx for Lustre file system
  • A Volcano batch scheduler and etcd server
  • The TorchX universal job launcher
  • The TorchX DDP module for Trainium

At the heart of the solution is an EKS cluster that provides you with core Kubernetes management functionality via an EKS service endpoint. One of the benefits of Amazon EKS is that the service actively monitors and scales the control plane based on load, which ensures high performance for large workloads such as distributed training. Inside the EKS cluster is a node group consisting of two or more trn1.32xlarge Trainium-based instances residing in the same Availability Zone.

The Neuron SDK is the software stack that provides the driver, compiler, runtime, framework integration (for example, PyTorch Neuron), and user tools that allow you to access the benefits of the Trainium accelerators. The Neuron device driver runs directly on the EKS nodes (Trn1 instances) and provides access to the Trainium chips from within the training containers that are launched on the nodes. Neuron and EFA plugins are installed within the EKS cluster to provide access to the Trainium chips and EFA networking devices required for distributed training.

An ECR repository is used to store the training container images. These images contain the Neuron SDK (excluding the Neuron driver, which runs directly on the Trn1 instances), PyTorch training script, and required dependencies. When a training job is launched on the EKS cluster, the container images are first pulled from Amazon ECR onto the EKS nodes, and the PyTorch worker containers are then instantiated from the images.

Shared storage is provided using a high-performance FSx for Lustre file system that exists in the same Availability Zone as the trn1.32xlarge instances. Creation and attachment of the FSx for Lustre file system to the EKS cluster is mediated by the Amazon FSx for Lustre CSI driver. In this solution, the shared storage is used to store the training dataset and any logs or artifacts created during the training process.

The solution uses the TorchX universal job launcher to launch distributed training jobs within Amazon EKS. TorchX has two important dependencies: the Volcano batch scheduler and the etcd server. Volcano handles the scheduling and queuing of training jobs, while the etcd server is a key-value store used by TorchElastic for synchronization and peer discovery during job startup.

When a training job is launched using TorchX, the launch command uses the provided TorchX distributed DDP module for Trainium to configure the overall training job and then run the appropriate torchrun commands on each of the PyTorch worker pods. When a job is running, it can be monitored using standard Kubernetes tools (such as kubectl) or via standard ML toolsets such as TensorBoard.

Solution overview

Let’s look at the important steps of this solution. Throughout this overview, we refer to the Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS tutorial on GitHub.

Create an EKS cluster

To get started with distributed training jobs in Amazon EKS with Trn1 instances, you first create an EKS cluster as outlined in the tutorial on GitHub. Cluster creation can be achieved using standard tools such as eksctl and AWS CloudFormation.

Create an EKS node group

Next, we need to create an EKS node group containing two or more trn1.32xlarge instances in a supported Region. In the tutorial, AWS CloudFormation is used to create a Trainium-specific EC2 launch template, which ensures that the Trn1 instances are launched with an appropriate Amazon Machine Image (AMI) and the correct EFA network configuration needed to support distributed training. The AMI also includes the Neuron device driver that provides support for the Trainium accelerator chips. With the eksctl Amazon EKS management tool, you can easily create a Trainium node group using a basic YAML manifest that references the newly created launch template. For example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-trn1-cluster
  region: us-west-2
  version: "1.23"

iam:
  withOIDC: true

availabilityZones: ["us-west-xx","us-west-yy"]

managedNodeGroups:
  - name: trn1-ng1
    launchTemplate:
      id: TRN1_LAUNCH_TEMPLATE_ID
    minSize: 2
    desiredCapacity: 2
    maxSize: 2
    availabilityZones: ["us-west-xx"]
    privateNetworking: true
    efaEnabled: true

In the preceding manifest, several attributes are configured to allow for the use of Trn1 instances in the EKS cluster. First, metadata.region is set to one of the Regions that supports Trn1 instances (currently us-east-1 and us-west-2). Next, for availabilityZones, Amazon EKS requires that two Availability Zones be specified. One of these Availability Zones must support the use of Trn1 instances, while the other can be chosen at random. The tutorial shows how to determine which Availability Zones will allow for Trn1 instances within your AWS account. The same Trn1-supporting Availability Zone must also be specified using the availabilityZones attribute associated with the EKS node group. efaEnabled is set to true to configure the nodes with the appropriate EFA network configuration that is required for distributed training. Lastly, the launchTemplate.id attribute associated with the node group points to the EC2 launch template created via AWS CloudFormation in an earlier step.

Assuming that you have already applied the CloudFormation template and installed the eksctl management tool, you can create a Trainium-capable EKS node group by running the following code:

> eksctl create nodegroup -f TEMPLATE.yaml

Install Kubernetes plugins for Trainium and EFA devices

With the node group in place, the next step is to install Kubernetes plugins that provide support for the Trainium accelerators (via the Neuron plugin) and the EFA devices (via the EFA plugin). These plugins can easily be installed on the cluster using the standard kubectl management tool as shown in the tutorial.

To use the TorchX universal PyTorch launcher to launch distributed training jobs, two prerequisites are required: the Volcano batch scheduler, and the etcd server. Much like the Neuron and EFA plugins, we can use the kubectl tool to install Volcano and the etcd server on the EKS cluster.

Attach shared storage to the EKS cluster

In the tutorial, FSx for Lustre is used to provide a high-performance shared file system that can be accessed by the various EKS worker pods. This shared storage is used to host the training dataset, as well as any artifacts and logs created during the training process. The tutorial describes how to create and attach the shared storage to the cluster using the Amazon FSx for Lustre CSI driver.

Create a training container image

Next, we need to create a training container image that includes the PyTorch training script along with any dependencies. An example Dockerfile is included in the tutorial, which incorporates the BERT pre-training script along with its software dependencies. The Dockerfile is used to build the training container image, and the image is then pushed to an ECR repository from which the PyTorch workers are able to pull the image when a training job is launched on the cluster.

Set up the training data

Before launching a training job, the training data is first copied to the shared storage volume on FSx for Lustre. The tutorial outlines how to create a temporary Kubernetes pod that has access to the shared storage volume, and shows how to log in to the pod in order to download and extract the training dataset using standard Linux shell commands.

With the various infrastructure and software prerequisites in place, we can now focus on the Trainium aspects of the solution.

Precompile your model

The Neuron SDK supports PyTorch through an integration layer called PyTorch Neuron. By default, PyTorch Neuron operates with just-in-time compilation, where the various neural network compute graphs within a training job are compiled as they are encountered during the training process. For larger models, it can be more convenient to use the provided neuron_parallel_compile tool to precompile and cache the various compute graphs in advance so as to avoid graph compilation at training time. Before launching the training job on the EKS cluster, the tutorial shows how to first launch a precompilation job via TorchX using the neuron_parallel_compile tool. Upon completion of the precompilation job, the Neuron compiler will have identified and compiled all of the neural network compute graphs, and cached them to the shared storage volume for later use during the actual BERT pre-training job.

Launch the distributed training job

With precompilation complete, TorchX is then used to launch a 64-worker distributed training job across two trn1.32xlarge instances, with 32 workers per instance. We use 32 workers per instance because each trn1.32xlarge instance contains 16 Trainium accelerators, with each accelerator providing 2 NeuronCores. Each NeuronCore can be accessed as a unique PyTorch XLA device in the training script. An example TorchX launch command from the tutorial looks like the following code:

    torchx run \
    -s kubernetes --workspace="file:///$PWD/docker" \
    -cfg queue=test,image_repo=$ECR_REPO \
    lib/trn1_dist_ddp.py:generateAppDef \
    --name berttrain \
    --script_args "--batch_size 16 --grad_accum_usteps 32 \
        --data_dir /data/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128 \
        --output_dir /data/output" \
    --nnodes 2 \
    --nproc_per_node 32 \
    --image $ECR_REPO:bert_pretrain \
    --script dp_bert_large_hf_pretrain_hdf5.py \
    --bf16 True \
    --cacheset bert-large

The various command line arguments in the preceding TorchX command are described in detail in the tutorial. However, the following arguments are most important in configuring the training job:

  • -cfg queue=test – Specifies the Volcano queue to be used for the training job
  • -cfg image_repo – Specifies the ECR repository to be used for the TorchX container images
  • --script_args – Specifies any arguments that should be passed to the PyTorch training script
  • --nnodes and --nproc_per_node – The number of instances and workers per instance to use for the training job
  • --script – The name of the PyTorch training script to launch within the training container
  • --image – The path to the training container image in Amazon ECR
  • --bf16 – Whether or not to enable the BF16 data type
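Each of the 64 worker processes ultimately runs the training script against one NeuronCore exposed as a PyTorch XLA device. A minimal, hedged sketch of that per-worker pattern (not the tutorial's BERT pre-training script) looks like the following:

import torch
import torch_xla.core.xla_model as xm  # provided by the torch-xla / torch-neuronx packages

def train_worker(model: torch.nn.Module, loader, epochs: int = 1):
    """Per-worker training loop; each worker sees one NeuronCore as an XLA device."""
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            # All-reduces gradients across workers, then applies the optimizer step.
            xm.optimizer_step(optimizer)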

Monitor the training job

After the training job has been launched, there are various ways in which the job can be monitored. The tutorial shows how to monitor basic training script metrics on the command line using kubectl, how to visually monitor training script progress in TensorBoard (see the following screenshot), and how to monitor Trainium accelerator utilization using the neuron-top tool from the Neuron SDK.

Clean up or reuse the environment

When the training job is complete, the cluster can then be reused or reconfigured for additional training jobs. For example, the EKS node group can quickly be scaled up using the eksctl command in order to support training jobs that require additional Trn1 instances. Similarly, the provided Dockerfile and TorchX launch commands can easily be modified to support additional deep learning models and distributed training topologies.

If the cluster is no longer required, the tutorial also includes all steps required to remove the EKS infrastructure and related resources.

Conclusion

In this post, we explored how Trn1 instances and Amazon EKS provide a managed platform for high-performance, cost-effective, and massively scalable distributed training of deep learning models. We also shared a comprehensive tutorial showing how to run a real-world multi-instance distributed training job in Amazon EKS using Trn1 instances, and highlighted several of the key steps and components in the solution. This tutorial content can easily be adapted for other models and workloads, and provides you with a foundational solution for distributed training of deep learning models in AWS.

To learn more about how to get started with Trainium-powered Trn1 instances, refer to the Neuron documentation.


About the Authors

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.

Read More

Meet the Omnivore: Architectural Researcher Lights Up Omniverse Scenes With ‘SunPath’ Extension

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds.

Things are a lot sunnier these days for designers looking to visualize their projects in NVIDIA Omniverse, a platform for creating and operating metaverse applications.

Pingfan Wu

Pingfan Wu, a senior architectural researcher at the Hunan Architectural Design Institute (HNADI) Group in south-central China, developed an Omniverse extension that makes controlling the sun and its effects on scenes within the platform more intuitive and precise.

Wu has won multiple awards for his work with Omniverse extensions — core building blocks that let anyone create and extend the functions of Omniverse using the popular Python or C++ programming languages.

The “SunPath” extension lets users easily add, manipulate and update a sun-path diagram within the Omniverse viewport.

This enables designers to visualize how the sun will impact a site or building at different times of day throughout the year, which is critical for the architecture, engineering, construction and operations (AECO) industry.

“To achieve digitization in the AECO industry, the first task is to build a bridge for data interoperability between design tools, and the only platform that can fill this role is Omniverse,” Wu said. “Based on the Universal Scene Description framework, Omniverse enables designers to collaborate using different software.”

And the excellent rendering power of Omniverse makes the sunlight look very realistic, Wu added.

Award-Winning Omniverse Extension

The extension shined brightly in NVIDIA’s inaugural #ExtendOmniverse contest last fall, winning the “scene modifier or manipulator tools” category.

The competition invited developers to use the Omniverse Code app to create their own Omniverse extensions for a chance to win an NVIDIA RTX GPU.

Wu decided to join, as he saw the potential for Omniverse’s real-time, collaborative, AI-powered tools to let designers “focus more on ideas and design rather than on rendering and struggling to connect models from different software,” he said.

Many design tools that come with a skylight feature lack realistic shadows, the researcher had noticed. His “SunPath” extension — built in just over a month — solves this problem, as designers can import scenes to Omniverse and create accurate sun studies quickly and easily.

They can even use “SunPath” to perform energy-consumption analyses to make design results more efficient, Wu added.

More Wins With Omniverse

Participating in the #ExtendOmniverse contest inspired Wu to develop further applications of Omniverse for AECO, he said.

He led his team at HNADI, which includes eight members who are experts in both architecture and technology, to create a platform that enables users to customize AECO-specific digital twins. Dubbed HNADI-AECKIT, the platform extends an Omniverse Connector to Rhino, a 3D computer graphics and computer-aided design software.

It won two awards at last year’s AEC Hackathon in China: first prize overall and in the “Best Development of the Omniverse” track.

“The core technical advantage of HNADI-AECKIT is that it opens up a complete linkage between Rhino and Omniverse,” Wu said. “Any output from Rhino can be quickly converted into a high-fidelity, information-visible, interactive, customizable digital-twin scene in Omniverse.”

Learn more about HNADI-AECKIT:

Join In on the Creation

Creators and developers across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Discover how to build an Omniverse extension in less than 10 minutes.

For a deeper dive into developing on Omniverse, attend these sessions at GTC, a global conference for the era of AI and the metaverse, running March 20-23.

Find additional documentation and tutorials in the Omniverse Resource Center, which details how developers like Wu can build custom USD-based applications and extensions for the platform.

To discover more free tools, training and a community for developers, join the NVIDIA Developer Program.

Follow NVIDIA Omniverse on Instagram, Medium, Twitter and YouTube for additional resources and inspiration. Check out the Omniverse forums, and join our Discord server and Twitch channel to chat with the community.

Read More