Combating Corruption With Data: Cleanlab and Berkeley Research Group on Using AI-Powered Investigative Analytics  

Talk about scrubbing data. Curtis Northcutt, cofounder and CEO of Cleanlab, and Steven Gawthorpe, senior data scientist at Berkeley Research Group, speak about Cleanlab’s groundbreaking approach to data curation with Noah Kravitz, host of NVIDIA’s AI Podcast, in an episode recorded live at the NVIDIA GTC global AI conference. The startup’s tools enhance data reliability and trustworthiness through sophisticated error identification and correction algorithms. Northcutt and Gawthorpe provide insights into how AI-powered data analytics can help combat economic crimes and corruption and discuss the intersection of AI, data science and ethical governance in fostering a more just society.

Cleanlab is a member of the NVIDIA Inception program for cutting-edge startups.

Stay tuned for more episodes recorded live from GTC.

Time Stamps

1:05: Northcutt on Cleanlab’s inception and mission
2:41: What Cleanlab offers its customers
4:24: The human element in Cleanlab’s data verification
8:57: Gawthorpe on the core functions, aims of the Berkeley Research Group
10:42: Gawthorpe’s approach to data collection and analysis in fraud investigations
16:38: Cleanlab’s one-click solution for generating machine learning models
18:30: The evolution of machine learning and its impact on data analytics
20:07: Future directions in data-driven crimefighting

You Might Also Like…

The Case for Generative AI in the Legal Field – Ep. 210

Thomson Reuters, the global content and technology company, is transforming the legal industry with generative AI. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Thomson Reuters’ Chief Product Officer David Wong about its potential — and implications.

Making Machines Mindful: NYU Professor Talks Responsible AI – Ep. 205

Artificial intelligence is now a household term. Responsible AI is hot on its heels. Julia Stoyanovich, associate professor of computer science and engineering at NYU and director of the university’s Center for Responsible AI, wants to make the terms “AI” and “responsible AI” synonymous.

MLCommons’ David Kanter, NVIDIA’s Daniel Galvez on Publicly Accessible Datasets – Ep. 167

On this episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with David Kanter, founder and executive director of MLCommons, and NVIDIA senior AI developer technology engineer Daniel Galvez, about the democratization of access to speech technology and how MLCommons is helping advance the research and development of machine learning for everyone.

Metaspectral’s Migel Tissera on AI-Based Data Management – Ep. 155

Migel Tissera, cofounder and CTO of Metaspectral, speaks with NVIDIA AI Podcast host Noah Kravitz about how Metaspectral’s technologies help space explorers make quicker and better use of the massive amounts of image data they collect out in the cosmos.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

The Building Blocks of AI: Decoding the Role and Significance of Foundation Models

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and which showcases new hardware, software, tools and accelerations for RTX PC users.

Skyscrapers start with strong foundations. The same goes for apps powered by AI.

A foundation model is an AI neural network trained on immense amounts of raw data, generally with unsupervised learning.

It’s a type of artificial intelligence model trained to understand and generate human-like language. Imagine giving a computer a huge library of books to read and learn from, so it can understand the context and meaning behind words and sentences, just like a human does.

A foundation model’s deep knowledge base and ability to communicate in natural language make it useful for a broad range of applications, including text generation and summarization, copilot production and computer code analysis, image and video creation, and audio transcription and speech synthesis.

ChatGPT, one of the most notable generative AI applications, is a chatbot built with OpenAI’s GPT foundation model. Now in its fourth version, GPT-4 is a large multimodal model that can ingest text or images and generate text or image responses.

Online apps built on foundation models typically access the models from a data center. But many of these models, and the applications they power, can now run locally on PCs and workstations with NVIDIA GeForce and NVIDIA RTX GPUs.

Foundation Model Uses

Foundation models can perform a variety of functions, including:

  • Language processing: understanding and generating text
  • Code generation: analyzing and debugging computer code in many programming languages
  • Visual processing: analyzing and generating images
  • Speech: generating text to speech and transcribing speech to text

They can be used as is or with further refinement. Rather than training an entirely new AI model for each generative AI application — a costly and time-consuming endeavor — users commonly fine-tune foundation models for specialized use cases.

Pretrained foundation models are remarkably capable, thanks to prompts and data-retrieval techniques like retrieval-augmented generation, or RAG. Foundation models also excel at transfer learning, which means they can be trained to perform a second task related to their original purpose.

For example, a general-purpose large language model (LLM) designed to converse with humans can be further trained to act as a customer service chatbot capable of answering inquiries using a corporate knowledge base.
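
To make the idea concrete, here is a minimal, library-free sketch of the RAG pattern mentioned above. The knowledge base, scoring function and prompt wording are hypothetical stand-ins for a real retriever and foundation model; it only illustrates the flow of retrieving context and assembling a prompt.

# A toy knowledge base; in practice this would be a corporate document store.
knowledge_base = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Support chat is available weekdays from 9 a.m. to 5 p.m.",
    "Warranty repairs require the original proof of purchase.",
]

def retrieve(question, documents, top_k=2):
    # Toy relevance score: number of words shared with the question.
    def score(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:top_k]

def build_prompt(question, documents):
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# The assembled prompt would then be sent to a foundation model for generation.
print(build_prompt("How long do I have to return an item?", knowledge_base))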

Enterprises across industries are fine-tuning foundation models to get the best performance from their AI applications.

Types of Foundation Models

More than 100 foundation models are in use — a number that continues to grow. LLMs and image generators are the two most popular types of foundation models. And many of them are free for anyone to try — on any hardware — in the NVIDIA API Catalog.

LLMs are models that understand natural language and can respond to queries. Google’s Gemma is one example; it excels at text comprehension, transformation and code generation. When asked about the astronomer Cornelius Gemma, it shared that his “contributions to celestial navigation and astronomy significantly impacted scientific progress.” It also provided information on his key achievements, legacy and other facts.

Extending the Gemma collaboration, Google’s CodeGemma, accelerated with NVIDIA TensorRT-LLM on RTX GPUs, brings powerful yet lightweight coding capabilities to the community. CodeGemma models are available as 7B and 2B pretrained variants that specialize in code completion and code generation tasks.

MistralAI’s Mistral LLM can follow instructions, complete requests and generate creative text. In fact, it helped brainstorm the headline for this blog, including the requirement that it use a variation of the series’ name “AI Decoded,” and it assisted in writing the definition of a foundation model.

Hello, world, indeed.

Meta’s Llama 2 is a cutting-edge LLM that generates text and code in response to prompts.

Mistral and Llama 2 are available in the NVIDIA ChatRTX tech demo, running on RTX PCs and workstations. ChatRTX lets users personalize these foundation models by connecting them to personal content — such as documents, doctors’ notes and other data — through RAG. It’s accelerated by TensorRT-LLM for quick, contextually relevant answers. And because it runs locally, results are fast and secure.

Image generators like StabilityAI’s Stable Diffusion XL and SDXL Turbo let users generate images and stunning, realistic visuals. StabilityAI’s video generator, Stable Video Diffusion, uses a generative diffusion model to synthesize video sequences with a single image as a conditioning frame.

Multimodal foundation models can simultaneously process more than one type of data — such as text and images — to generate more sophisticated outputs.

A multimodal model that works with both text and images could let users upload an image and ask questions about it. These types of models are quickly working their way into real-world applications like customer service, where they can serve as faster, more user-friendly versions of traditional manuals.

Kosmos 2 is Microsoft’s groundbreaking multimodal model designed to understand and reason about visual elements in images.

Think Globally, Run AI Models Locally 

GeForce RTX and NVIDIA RTX GPUs can run foundation models locally.

The results are fast and secure. Rather than relying on cloud-based services, users can harness apps like ChatRTX to process sensitive data on their local PC without sharing the data with a third party or needing an internet connection.

Users can choose from a rapidly growing catalog of open foundation models to download and run on their own hardware. This lowers costs compared with using cloud-based apps and APIs, and it eliminates latency and network connectivity issues.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

NVIDIA Joins $110 Million Partnership to Help Universities Teach AI Skills

The Biden Administration has announced a new $110 million AI partnership between Japan and the United States that includes an initiative to fund research through a collaboration between the University of Washington and the University of Tsukuba.

NVIDIA is committing $25 million in a collaboration with Amazon that aims to bring the latest technologies to the University of Washington, in Seattle, and the University of Tsukuba, which is northeast of Tokyo.

Universities around the world are preparing students with crucial AI skills by providing access to high performance computing resources.

“This collaboration between the University of Washington, University of Tsukuba, Amazon, and NVIDIA will help provide the research and workforce training for our regions’ tech sectors to keep up with the profound impacts AI is having across every sector of our economy,” said Jay Inslee, governor of Washington State.

Creating AI Opportunities for Students

NVIDIA has been investing in universities for decades, providing computing resources, advanced training curricula, donations and other support to give students and professors access to high performance computing (HPC) for groundbreaking research results.

NVIDIA founder and CEO Jensen Huang and his wife, Lori Huang, donated $50 million to their alma mater Oregon State University — where they met and earned engineering degrees —  to help build one of the world’s fastest supercomputers in a facility bearing their names. This computing center will help students research, develop and apply AI across Oregon State’s top-ranked programs in agriculture, computer sciences, climate science, forestry, oceanography, robotics, water resources, materials sciences and more.

The University of Florida recently unveiled Malachowsky Hall, made possible by a $50 million donation from NVIDIA co-founder Chris Malachowsky. This new building, along with a previous donation of an AI supercomputer, is enabling the University of Florida to offer world-class AI training and research opportunities.

Strengthening US-Japan AI Research Collaboration

The U.S.-Japan HPC alliance will advance AI research and development and support the two nations’ global leadership in cutting-edge technology.

The initiative between the University of Washington and the University of Tsukuba will support research in critical areas where AI can drive impactful change, such as robotics, healthcare, climate change and atmospheric science.

In addition to the university partnership,  NVIDIA recently announced a collaboration with Japan’s National Institute of Advanced Industrial Science and Technology (AIST) on AI and quantum technology.

Addressing Worldwide AI Talent Shortage

Demand for key AI skills is creating a talent shortage worldwide. Some experts calculate there has been a fivefold increase in demand for these skills as a percentage of total U.S. jobs. Universities around the world are looking for ways to prepare students with new skills for the workforce, and corporate-university partnerships are a key tool to help bridge the gap.

At GTC 2024, NVIDIA unveiled new professional certifications in generative AI to help the next generation of developers gain technical credibility in this important domain.

Learn more about NVIDIA generative AI courses here and here.

Read More

Knowledge Bases for Amazon Bedrock now supports custom prompts for the RetrieveAndGenerate API and configuration of the maximum number of retrieved results

With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant, context-specific, and accurate responses without retraining the FMs.

In this post, we discuss two new features of Knowledge Bases for Amazon Bedrock specific to the RetrieveAndGenerate API: configuring the maximum number of results and creating custom prompts with a knowledge base prompt template. You can now choose these as query options alongside the search type.

Overview and benefits of new features

The maximum number of results option gives you control over the number of search results to be retrieved from the vector store and passed to the FM for generating the answer. This lets you customize the amount of background information provided for generation, giving more context for complex questions or less for simpler ones. You can fetch up to 100 results. This option helps improve the likelihood of relevant context, thereby improving the accuracy of the generated response and reducing hallucinations.

The custom knowledge base prompt template allows you to replace the default prompt template with your own to customize the prompt that’s sent to the model for response generation. This allows you to customize the tone, output format, and behavior of the FM when it responds to a user’s question. With this option, you can fine-tune terminology to better match your industry or domain (such as healthcare or legal). Additionally, you can add custom instructions and examples tailored to your specific workflows.

In the following sections, we explain how you can use these features with either the AWS Management Console or SDK.

Prerequisites

To follow along with these examples, you need to have an existing knowledge base. For instructions to create one, see Create a knowledge base.

Configure the maximum number of results using the console

To use the maximum number of results option using the console, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge bases in the left navigation pane.
  2. Select the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the configuration icon.
  5. Choose Sync data source before you start testing your knowledge base.
  6. Under Configurations, for Search Type, select a search type based on your use case.

For this post, we use hybrid search because it combines semantic and text search to provide greater accuracy. To learn more about hybrid search, see Knowledge Bases for Amazon Bedrock now supports hybrid search.

  7. Expand Maximum number of source chunks and set your maximum number of results.

To demonstrate the value of the new feature, we show examples of how you can increase the accuracy of the generated response. We used Amazon’s 10-K document for 2023 as the source data for creating the knowledge base. We use the following query for experimentation: “In what year did Amazon’s annual revenue increase from $245B to $434B?”

The correct response for this query is “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022,” based on the documents in the knowledge base. We used Claude v2 as the FM to generate the final response based on the contextual information retrieved from the knowledge base. Claude 3 Sonnet and Claude 3 Haiku are also supported as the generation FMs.

We then ran the query with different retrieval configurations to compare the results. We used the same input query (“In what year did Amazon’s annual revenue increase from $245B to $434B?”) and set the maximum number of results to 5.

As shown in the following screenshot, the generated response was “Sorry, I am unable to assist you with this request.”

Next, we set the maximum number of results to 12 and asked the same question. The generated response was “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022.”

As shown in this example, we are able to retrieve the correct answer based on the number of retrieved results. If you want to learn more about the source attribution that constitutes the final output, choose Show source details to validate the generated answer based on the knowledge base.

Customize a knowledge base prompt template using the console

You can also customize the default prompt with your own prompt based on the use case. To do so on the console, complete the following steps:

  1. Repeat the steps in the previous section to start testing your knowledge base.
  2. Enable Generate responses.
  3. Select the model of your choice for response generation.

We use the Claude v2 model as an example in this post. The Claude 3 Sonnet and Claude 3 Haiku models are also available for generation.

  4. Choose Apply to proceed.

After you choose the model, a new section called Knowledge base prompt template appears under Configurations.

  5. Choose Edit to start customizing the prompt.
  6. Adjust the prompt template to customize how you want to use the retrieved results and generate content.

For this post, we gave a few examples for creating a “Financial Advisor AI system” using Amazon financial reports with custom prompts. For best practices on prompt engineering, refer to Prompt engineering guidelines.

We now customize the default prompt template in several different ways, and observe the responses.

Let’s first try a query with the default prompt. We ask “What was the Amazon’s revenue in 2019 and 2021?” The following shows our results.

From the output, we find that it’s generating the free-form response based on the retrieved knowledge. The citations are also listed for reference.

Let’s say we want to give extra instructions on how to format the generated response, like standardizing it as JSON. We can add these instructions as a separate step after retrieving the information, as part of the prompt template:

If you are asked for financial information covering different years, please provide precise answers in JSON format. Use the year as the key and the concise answer as the value. For example: {year:answer}

The final response has the required structure.

By customizing the prompt, you can also change the language of the generated response. In the following example, we instruct the model to provide an answer in Spanish.

After removing $output_format_instructions$ from the default prompt, the citation from the generated response is removed.

In the following sections, we explain how you can use these features with the SDK.

Configure the maximum number of results using the SDK

To change the maximum number of results with the SDK, use the following syntax. For this example, the query is “In what year did Amazon’s annual revenue increase from $245B to $434B?” The correct response is “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022.”

import boto3

# Client for the Agents for Amazon Bedrock runtime, which exposes the RetrieveAndGenerate API.
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

def retrieveAndGenerate(query, kbId, numberOfResults, model_id, region_id):
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    return bedrock_agent_runtime.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': "SEMANTIC",  # optional
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        },
    )

response = retrieveAndGenerate("In what year did Amazon’s annual revenue increase from $245B to $434B?", 
"<knowledge base id>", numberOfResults, model_id, region_id)['output']['text']

The ‘numberOfResults’ option under ‘retrievalConfiguration’ allows you to select the number of results you want to retrieve. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
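
As a sketch of how those pieces might be inspected, the following reuses the retrieveAndGenerate helper above and walks the citations returned by the API. The knowledge base ID and model/region values are placeholders, and the exact response fields should be checked against the API documentation.

full_response = retrieveAndGenerate(
    "In what year did Amazon's annual revenue increase from $245B to $434B?",
    "<knowledge base id>", 12, "<model_id>", "<region_id>")

# The generated answer.
print(full_response['output']['text'])

# Each citation pairs part of the generated answer with the retrieved chunks
# (and their source locations) that support it.
for citation in full_response.get('citations', []):
    for reference in citation.get('retrievedReferences', []):
        print(reference['content']['text'][:200])  # retrieved text chunk
        print(reference.get('location', {}))       # e.g., the S3 source URI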

The following are the results for different values of the ‘numberOfResults’ parameter. First, we set numberOfResults = 5.

Then we set numberOfResults = 12.

Customize the knowledge base prompt template using the SDK

To customize the prompt using the SDK, we use the following query with different prompt templates. For this example, the query is “What was the Amazon’s revenue in 2019 and 2021?”

The following is the default prompt template:

"""You are a question answering agent. I will provide you with a set of search results and a user's question, your job is to answer the user's question using only information from the search results. If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question. Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.
Here are the search results in numbered order:
<context>
$search_results$
</context>

Here is the user's question:
<question>
$query$
</question>

$output_format_instructions$

Assistant:
"""

The following is the customized prompt template:

"""Human: You are a question answering agent. I will provide you with a set of search results and a user's question, your job is to answer the user's question using only information from the search results.If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question.Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.

Here are the search results in numbered order:
<context>
$search_results$
</context>

Here is the user's question:
<question>
$query$
</question>

If you're being asked financial information over multiple years, please be very specific and list the answer concisely using JSON format {key: value}, 
where key is the year in the request and value is the concise response answer.
Assistant:
"""

def retrieveAndGenerate(query, kbId, numberOfResults, promptTemplate, model_id, region_id):
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    return bedrock_agent_runtime.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kbId,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': "SEMANTIC",  # optional
                    }
                },
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': promptTemplate
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        },
    )

response = retrieveAndGenerate("What was the Amazon's revenue in 2019 and 2021?”", 
                               "<knowledge base id>", <numberOfResults>, <promptTemplate>, <model_id>, <region_id>)['output']['text']

With the default prompt template, we get the following response:

If you want to provide additional instructions around the output format of the response generation, like standardizing the response in a specific format (like JSON), you can customize the existing prompt by providing more guidance. With our custom prompt template, we get the following response.

The ‘promptTemplate’ option in ‘generationConfiguration’ allows you to customize the prompt for better control over answer generation.

Conclusion

In this post, we introduced two new features in Knowledge Bases for Amazon Bedrock: adjusting the maximum number of search results and customizing the default prompt template for the RetrieveAndGenerate API. We demonstrated how to configure these features on the console and via SDK to improve performance and accuracy of the generated response. Increasing the maximum results provides more comprehensive information, whereas customizing the prompt template allows you to fine-tune instructions for the foundation model to better align with specific use cases. These enhancements offer greater flexibility and control, enabling you to deliver tailored experiences for RAG-based applications.

For additional resources to start implementing in your AWS environment, refer to the following:


About the authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Read More

Faster Dynamically Quantized Inference with XNNPack

Posted by Alan Kelly, Software Engineer

We are excited to announce that XNNPack’s Fully Connected and Convolution 2D operators now support dynamic range quantization. XNNPack is TensorFlow Lite’s CPU backend; CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite, so improving CPU inference performance is a top priority. By adding support for dynamic range quantization to the Fully Connected and Convolution operators, we quadrupled inference performance in TensorFlow Lite’s XNNPack backend compared to the single-precision baseline. This means that more AI-powered features may be deployed to older and lower-tier devices.

Previously, XNNPack offered users a choice between full integer quantization, where the weights and activations are stored as signed 8-bit integers, and half-precision (fp16) or single-precision (fp32) floating-point inference. In this article, we demonstrate the benefits of dynamic range quantization.

Dynamic Range Quantization

Dynamically quantized models are similar to fully-quantized models in that the weights for the Fully Connected and Convolution operators are quantized to 8-bit integers during model conversion. All other tensors are not quantized; they remain float32 tensors. During model inference, the floating-point layer activations are converted to 8-bit integers before being passed to the Fully Connected and Convolution operators. The quantization parameters (the zero point and scale) for each row of the activation tensor are calculated dynamically based on the observed range of activations. This maximizes the accuracy of the quantization process, as the activations make full use of the 8 quantized bits. In fully-quantized models, these parameters are fixed during model conversion, based on the range of the activation values observed using a representative dataset. The second difference between full quantization and dynamic range quantization is that the output of the Fully Connected and Convolution operators is in 32-bit floating-point format, as opposed to 8-bit integer for fully-quantized operators. With dynamic range quantization, we get most of the performance gains of full quantization, yet with higher overall accuracy.
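
As a rough numerical illustration of per-row dynamic quantization (an asymmetric int8 scheme; XNNPack’s actual implementation may differ in its details), consider the following sketch:

import numpy as np

def quantize_rows(activations):
    """Quantize each row of a float32 matrix to signed int8 with its own
    dynamically computed scale and zero point."""
    quantized = np.empty_like(activations, dtype=np.int8)
    params = []
    for i, row in enumerate(activations):
        lo, hi = float(row.min()), float(row.max())
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        zero_point = int(round(-128 - lo / scale))
        quantized[i] = np.clip(np.round(row / scale) + zero_point, -128, 127)
        params.append((scale, zero_point))  # dequantize: (q - zero_point) * scale
    return quantized, params

activations = np.random.randn(4, 8).astype(np.float32)
q, q_params = quantize_rows(activations)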

Traditionally, the inference of such models was done using TensorFlow Lite’s native operators. Now dynamically quantized models can benefit from XNNPack’s highly optimized per-architecture implementations of the Fully Connected and Convolution 2D operators. These operators are optimized for all architectures supported by XNNPack (ARM, ARM64, x86 SSE/AVX/AVX512 and WebAssembly), including the latest Armv9 processors such as the Pixel 8’s Tensor G3 CPU or the OnePlus 11’s Snapdragon 8 Gen 2 CPU.

How can you use it?

Two steps are required to use dynamic range quantization. You must first convert your model from TensorFlow with support for dynamic range quantization. Existing models already converted using dynamic range quantization do not need to be reconverted. Dynamic range quantization can be enabled during model conversion by enabling the converter.optimizations = [tf.lite.Optimize.DEFAULT] converter flag. Unlike full integer quantization, no representative dataset is required and unsupported operators do not prevent conversion from succeeding. Dynamic range quantization is therefore far more accessible to non-expert users than full integer quantization.
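
For example, a conversion script enabling dynamic range quantization might look like the following; the SavedModel path is a placeholder.

import tensorflow as tf

# Convert a model (any SavedModel works the same way) to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Enabling the default optimizations turns on dynamic range quantization:
# supported weights are stored as 8-bit integers, activations stay float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)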

From TensorFlow 2.17, dynamically quantized XNNPack inference will be enabled by default in prebuilt binaries. If you want to use it sooner, the nightly TensorFlow builds may be used.

Mixed Precision Inference

In our previous article we presented the impressive performance gains from using half precision inference. Half-precision and dynamic range quantization may now be combined within XNNPack to get the best possible on-device CPU inference performance on devices which have hardware fp16 support (most phones on sale today do). The Fully Connected and Convolution 2D operators can output fp16 data instead of fp32. The Pixel 3, released in 2018, was the first Pixel model with fp16 support. fp16 uses half as many bits to store a floating-point value compared to fp32, meaning that the relative accuracy of each value is reduced due to the significantly shorter mantissa (10 vs 23 bits). Not all models support fp16 inference, but if a model supports it, the computational cost of vectorized floating-point operators can be reduced by half as the CPU can process twice as much data per instruction. Dynamically quantized models with compute-intensive floating point operators, such as Batch Matrix Multiply and Softmax, can benefit from fp16 inference as well.

Performance Improvements

Below, we present benchmarks on four public models covering common computer vision tasks:

  1. EfficientNetV2 – image classification and feature extraction
  2. Inception-v3 – image classification
  3. Deeplab-v3 – semantic segmentation
  4. Stable Diffusion – image generation (diffusion model)

Each model was converted three times where possible: full float, full 8 bit signed integer quantization and dynamic range quantization. Stable Diffusion’s diffusion model could not be converted using full integer quantization due to unsupported operators. The speed-up versus the original float32 model using TFLite’s kernels is shown below.

  • FP32 refers to the baseline float32 model.
  • QS8 refers to full signed 8 bit integer quantization using XNNPack.
  • QD8-F32 refers to dynamically quantized 8 bit integers with fp32 activations using XNNPack.
  • QD8-F16 refers to dynamically quantized 8 bit integers with fp16 activations using XNNPack.

Graph showing speed-up versus float32 on Pixel 8

The speed-up versus TFLite’s dynamically quantized Fully Connected and Convolution operators is shown below. By simply using the latest version of TensorFlow Lite, you can benefit from these speed-ups.

Graph showing speed-up versus TFLite DQ on Pixel 8

We can clearly see that dynamic range quantization is competitive with, and in some cases can exceed, the performance of full integer quantization. Stable Diffusion’s diffusion model runs up to 6.2 times faster than the original float32 model! This is a game changer for on-device performance.

We would expect full integer quantization to be faster than dynamic range quantization, as all operations are calculated using integer arithmetic. Furthermore, dynamic range quantization has the additional overhead of converting the floating-point activations to quantized 8-bit integers. Surprisingly, in two of the three models tested, this is not true. Profiling the models using TFLite’s profiler solves the mystery: these models are slower due to a combination of quantization artifacts (quantized arithmetic is more efficient when the ratio of the input and output scales falls within a certain range) and missing op support in XNNPack. The quantization parameters are determined during model conversion based on a provided representative dataset, and since the ratio of the scales falls outside the optimal range, a less optimal path must be taken and performance suffers.

Finally, to demonstrate that model precision is mostly preserved when using mixed precision inference, we compare the image generated by the dynamically quantized Stable Diffusion model using fp32 activations with the image generated using fp16 activations, seeding the random number generator with the same number, to verify that using fp16 activations for the diffusion model does not impact the quality of the generated image.

Image generated using fp16 inference (left) and fp32 inference (right)

Both generated images are indeed lovely cats, corresponding to the given prompt. The images are indistinguishable, which is a very strong indication that the diffusion model is suited to fp16 inference. Of course, as for all neural networks, any quantization strategy should be validated using a large validation dataset, and not just a single test.

Conclusions

Full integer quantization is hard: converting models is difficult and error prone, and accuracy is not guaranteed. The representative dataset must be truly representative to minimize quantization errors. Dynamic range quantization offers a compromise between full integer quantization and fp32 inference: the models are similar in size to fully-quantized models, and the performance gains are often similar and sometimes exceed those of fully-quantized models. Using Stable Diffusion, we showed that dynamic range quantization and fp16 inference can be combined, giving game-changing performance improvements. XNNPack’s dynamic range quantization is now powering Gemini, Google Meet, and Chrome OS audio denoising and will launch in many other products this year. This same technology is now available to our open source users, so try it for yourself using the models linked above and following the instructions in the “How can you use it?” section!

Acknowledgements

We would like to thank Frank Barchard and Quentin Khan for contributions towards dynamic range quantization inference in TensorFlow Lite and XNNPack.

Read More

Broadcasting Breakthroughs: NVIDIA Holoscan for Media, Available Now, Transforms Live Media With Easy AI Integration

Whether delivering live sports programming, streaming services, network broadcasts or content on social platforms, media companies face a daunting landscape.

Viewers are increasingly opting for interactive and personalized content. Virtual reality and augmented reality continue their drive into the mainstream. New video compression standards are challenging traditional computing infrastructure. And AI is having an impact across the board.

In a situation this dynamic, media companies will benefit most from AI-enabled media solutions that flexibly align with their changing development and delivery needs.

NVIDIA Holoscan for Media, available now, is a software-defined platform that enables developers to easily build live media applications, supercharge them with AI and then deploy them across media platforms.

A New Approach to Media Application Development 

Holoscan for Media offers a new approach to development in live media. It simplifies application development by providing an internet protocol (IP)-based, cloud-native architecture that isn’t constrained by dedicated hardware, environments or locations. Instead, it integrates open-source and ubiquitous technologies and streamlines application delivery to customers, all while optimizing costs.

Traditional application development for the live media market relies on dedicated hardware. Because software is tied to that hardware, developers are constrained when it comes to innovating or upgrading applications.

Each deployment type, whether on premises or in the cloud, requires its own build, making development costly and inefficient. Beyond designing an application’s user interface and core functionalities, developers have to build out additional infrastructure services, further eating into research and development budgets.

The most significant challenge is incorporating AI, due to the complexity of building an AI software stack. This prevents many applications in pilot programs from moving to production.

Holoscan for Media eases the integration of AI into application development due to its underlying architecture, which enables software-defined video to be deployed on the same software stack as AI applications, including generative AI-based tools. This benefits vendors and research and development departments looking to incorporate AI apps into live video.

Since the platform is cloud-native, the same architecture can run independent of location, whether in the cloud, on premises or at the edge. Additionally, it’s not tied to a specific device, field-programmable gate array or appliance.

The Holoscan for Media architecture includes services like authentication, logging and security, as well as features that help broadcasters migrate to IP-based technologies, including the SMPTE ST 2110 transport protocol, the precision time protocol for timing and synchronization, and the NMOS controller and registry for dynamic device management.

A Growing Ecosystem of Partners

Beamr, Comprimato, Lawo, Media.Monks, Pebble, RED Digital Cinema, Sony Corporation and Telestream are among the early adopters already transforming live media with Holoscan for Media.

“We use Holoscan for Media as the core infrastructure for our broadcast and media workflow, granting us powerful scale to deliver interest-based content across a wide range of channels and platforms,” said Lewis Smithingham, senior vice president of innovation special operations at Media.Monks, a provider of software-defined production workflows.

“By compartmentalizing applications and making them interoperable, Holoscan for Media allows for easy adoption of new innovations from many different companies in one platform,” said Jeff Goodman, vice president of product management at RED Digital Cinema, a manufacturer of professional digital cinema cameras. “It takes much of the integration complexity out of the equation and will significantly increase the pace of innovation. We are very excited to be a part of it.”

“We believe NVIDIA Holoscan for Media is one of the paths forward to enabling the development of next-generation products and services for the industry, allowing the scaling of GPU power as needed,” said Masakazu Murata, senior general manager of media solutions business at Sony Corporation. “Our M2L-X software switcher prototype running on Holoscan for Media demonstrates how customers can run Sony’s solutions on GPU clusters.”

“Telestream is committed to transforming the media landscape, enhancing efficiency and content experiences without sacrificing quality or user-friendliness,” said Charlie Dunn, senior vice president and general manager at Telestream, a provider of digital media software and solutions. “We’ve seamlessly integrated the Holoscan for Media platform into our INSPECT IP video monitoring solution to achieve a clear and efficient avenue for ST 2110 compliance.”

Experience Holoscan for Media at NAB Show

These partners will showcase how they’re using NVIDIA Holoscan for Media at NAB Show, an event for the broadcast, media and entertainment industry, taking place April 13-17 in Las Vegas.

Explore development on Holoscan for Media and discover applications running on the platform at the Dell Technologies booth. Learn more about NVIDIA’s presence at NAB Show, including details on sessions and demos on generative AI, software-defined broadcast and immersive graphics.

Apply for access to NVIDIA Holoscan for Media today.

Read More

Start Up Your Engines: NVIDIA and Google Cloud Collaborate to Accelerate AI Development

NVIDIA and Google Cloud have announced a new collaboration to help startups around the world accelerate the creation of generative AI applications and services.

The announcement, made today at Google Cloud Next ‘24 in Las Vegas, brings together the NVIDIA Inception program for startups and the Google for Startups Cloud Program to widen access to cloud credits, go-to-market support and technical expertise to help startups deliver value to customers faster.

Qualified members of NVIDIA Inception, a global program supporting more than 18,000 startups, will have an accelerated path to using Google Cloud infrastructure with access to Google Cloud credits, offering up to $350,000 for those focused on AI.

Google for Startups Cloud Program members can join NVIDIA Inception and gain access to technological expertise, NVIDIA Deep Learning Institute course credits, NVIDIA hardware and software, and more. Eligible members of the Google for Startups Cloud Program also can participate in NVIDIA Inception Capital Connect, a platform that gives startups exposure to venture capital firms interested in the space.

High-growth emerging software makers in both programs can also gain fast-tracked onboarding to Google Cloud Marketplace, as well as co-marketing and product acceleration support.

This collaboration is the latest in a series of announcements the two companies have made to help ease the costs and barriers associated with developing generative AI applications for enterprises of all sizes. Startups in particular are constrained by the high costs associated with AI investments.

It Takes a Full-Stack AI Platform

In February, Google DeepMind unveiled Gemma, a family of state-of-the-art open models. NVIDIA, in collaboration with Google, recently launched optimizations across all NVIDIA AI platforms for Gemma, helping to reduce customer costs and speed up innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create Google DeepMind’s most capable model yet, Gemini — with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs.

NVIDIA NIM microservices, part of the NVIDIA AI Enterprise software platform, together with Google Kubernetes Engine (GKE) provide a streamlined path for developing AI-powered apps and deploying optimized AI models into production. Built on inference engines including NVIDIA Triton Inference Server and TensorRT-LLM, NIM supports a wide range of leading AI models and delivers seamless, scalable AI inferencing to accelerate generative AI deployment in enterprises.

The Gemma family of models, including Gemma 7B, RecurrentGemma and CodeGemma, are available from the NVIDIA API catalog for users to try from a browser, prototype with the API endpoints and self-host with NIM.

Google Cloud has made it easier to deploy the NVIDIA NeMo framework across its platform via GKE and Google Cloud HPC Toolkit. This enables developers to automate and scale the training and serving of generative AI models, allowing them to rapidly deploy turnkey environments through customizable blueprints that jump-start the development process.

NVIDIA NeMo, part of NVIDIA AI Enterprise, is also available in Google Cloud Marketplace, providing customers another way to easily access NeMo and other frameworks to accelerate AI development.

Further widening the availability of NVIDIA-accelerated generative AI computing, Google Cloud also announced that A3 Mega will be generally available next month. The instances are an expansion of its A3 virtual machine family, powered by NVIDIA H100 Tensor Core GPUs, and will double the GPU-to-GPU network bandwidth of A3 VMs.

Google Cloud’s new Confidential VMs on A3 will also include support for confidential computing to help customers protect the confidentiality and integrity of their sensitive data and secure applications and AI workloads during training and inference — with no code changes while accessing H100 GPU acceleration. These GPU-powered Confidential VMs will be available in Preview this year.

Next Up: NVIDIA Blackwell-Based GPUs

NVIDIA’s newest GPUs based on the NVIDIA Blackwell platform will be coming to Google Cloud early next year in two variations: the NVIDIA HGX B200 and the NVIDIA GB200 NVL72.

The HGX B200 is designed for the most demanding AI, data analytics and high performance computing workloads, while the GB200 NVL72 is designed for next-frontier, massive-scale, trillion-parameter model training and real-time inferencing.

The NVIDIA GB200 NVL72 connects 36 Grace Blackwell Superchips, each with two NVIDIA Blackwell GPUs combined with an NVIDIA Grace CPU over a 900GB/s chip-to-chip interconnect, supporting up to 72 Blackwell GPUs in one NVIDIA NVLink domain and 130TB/s of bandwidth. It overcomes communication bottlenecks and acts as a single GPU, delivering 30x faster real-time LLM inference and 4x faster training compared to the prior generation.

NVIDIA GB200 NVL72 is a multi-node rack-scale system that will be combined with Google Cloud’s fourth generation of advanced liquid-cooling systems.

NVIDIA announced last month that NVIDIA DGX Cloud, an AI platform for enterprise developers that’s optimized for the demands of generative AI, is generally available on A3 VMs powered by H100 GPUs. DGX Cloud with GB200 NVL72 will also be available on Google Cloud in 2025.

Read More