First Class: NVIDIA Introduces Generative AI Professional Certification

NVIDIA is offering a new professional certification in generative AI to enable developers to establish technical credibility in this important domain.

Generative AI is revolutionizing industries worldwide, yet there’s a critical skills gap and need to uplevel employees to more fully harness the technology.

Available for the first time from NVIDIA, this new professional certification enables developers, career professionals, and others to validate and showcase their generative AI skills and expertise. Our new professional certification program introduces two associate-level generative AI certifications, focusing on proficiency in large language models and multimodal workflow skills.

“Generative AI has moved to center stage as governments, industries and organizations everywhere look to harness its transformative capabilities,” NVIDIA founder and CEO Jensen Huang recently said.

The certification will become available starting at GTC, where in-person attendees can also access recommended training to prepare for a certification exam.

“Organizations in every industry need to increase their expertise in this transformative technology,” said Greg Estes, VP of developer programs at NVIDIA. “Our goals are to assist in upskilling workforces, sharpen the skills of qualified professionals, and enable individuals to demonstrate their proficiency in order to gain a competitive advantage in the job market.”

See AI’s Future. Learn How to Use It.  

GTC 2024 — running March 18-21 in San Jose, Calif. — is the first in-person GTC event in five years, and more than 300,000 people are expected to register to attend in person or virtually.  There will be 900 sessions and more than 300 exhibitors showcasing how organizations are deploying NVIDIA platforms to achieve industry breakthroughs.

Attendees can choose from 20 full-day, hands-on technical workshops, with many sessions available virtually in EMEA and APAC time zones. Also, sign up for the GTC Conference + Training package for more than 40 complimentary onsite training labs.

Sign up for GTC. Learn more about the generative AI course here and here.

Improving LLM understanding of structured data and exploring advanced prompting methods

This research paper was presented at the 17th ACM International Conference on Web Search and Data Mining (WSDM 2024), the premier conference on web-inspired research on search and data mining.

In today’s data-driven landscape, tables are indispensable for organizing and presenting information, particularly text. They streamline repetitive content, enhance data manageability, enable easier data analysis, and improve machine processing capabilities. Meanwhile, large language models (LLMs) are advancing in their ability to tackle challenges associated with natural language, but the degree to which they understand tables included in their prompts remains an open question. Our research aims to explore this question and improve how LLMs use and work with table-based data.

Our paper, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” presented at WSDM 2024, investigates what kinds of prompts most effectively enable LLMs to understand tables; how well LLMs inherently detect structured data; and how LLMs’ existing knowledge can be harnessed to improve this understanding. We also analyze the complex trade-off among multiple combinations of input designs and overall performance.

To address these questions, we propose a new benchmark called Structural Understanding Capabilities (SUC), shown in Figure 1 (a), which focuses on specific tasks to assess LLMs’ ability to understand structured data in tables and compare different types of prompts. We conducted a series of experiments using different prompt designs. Our findings, detailed in the paper, evaluate how each design enhances LLMs’ ability to work with tables. 

Image (a) is a flowchart with three columns showing the stages, capabilities, and tasks covered by the SUC (Structural Understanding Capabilities) benchmark. The leftmost column, labeled 'Stages', lists two stages: 'Partition & Parsing' in blue and 'Search & Retrieval' in pink. Each stage is associated with 'Capabilities' in the middle column. 'Partition & Parsing' includes 'Structural Description Detection', 'Format Understanding', and 'Hierarchy Detection'. 'Search & Retrieval' includes 'Grounding/Locating' and 'Operation Reasoning'. These capabilities correspond to 'Tasks' in the third column. For 'Partition & Parsing', the tasks are 'Table Partition', 'Table Size Detection', and 'Hierarchy Detection'. For 'Search & Retrieval', the tasks are 'Cell Lookup & Reverse Lookup' and 'Column & Row Retrieval'.

To the right of these columns, image (b), labeled 'Input Designs', covers the input designs for the SUC evaluation. It is connected to 'Partition Mark', 'Serialization', 'Role Prompting', 'Order Permutation', and 'Format Explanation'. These are further linked to types of 'Markup Languages' shown in green boxes: 'HTML', 'XML', 'Markdown', and more, indicated by ellipses.
Figure 1. The SUC benchmark and prompt designs for evaluation.

Insights and findings using the SUC benchmark

Based on humans’ perception of tables, we developed tasks to evaluate how LLMs understand them. We conducted evaluations on GPT-3.5 and GPT-4 and discovered that the results depended on certain input factors, such as table format, content order, and partition marks. The results, detailed in Tables 1 and 2, reveal some notable and unexpected findings:

  • Delimiter-separated formats (e.g., CSV, TSV) underperformed compared with HTML by 6.76 percent.
  • Using HTML and few-shot learning consistently improved performance. The effectiveness of other approaches, such as format explanation, role prompting, order change, and partition marks, varied depending on task difficulty and the required capacity.
  • Despite the simplicity of the benchmark tasks, the highest overall accuracy across seven tasks is only 65.43 percent. This underscores the need for LLMs to have better awareness of table structures and highlights areas for further improvement in table serialization.
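
To make the format comparison above concrete, here is a minimal sketch of how one small table might be serialized as CSV, Markdown, and HTML before being placed in a prompt. The helper names (`to_csv`, `to_markdown`, `to_html`) and the sample rows are illustrative assumptions, not the serialization functions used in the paper.

```python
import csv
import io
from html import escape

# Hypothetical sample table: a header row followed by data rows.
HEADER = ["Year", "Team", "Pos"]
ROWS = [["1983", "Team A", "29th"], ["1989", "Team B", "7th"]]


def to_csv(header, rows):
    """Delimiter-separated serialization: one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(header, rows):
    """Markdown pipe table with a separator line under the header."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """HTML serialization with explicit row and cell tags."""
    head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


if __name__ == "__main__":
    for serialize in (to_csv, to_markdown, to_html):
        print(serialize(HEADER, ROWS), end="\n\n")
```

The HTML form marks every row and cell boundary explicitly, whereas the CSV form relies on delimiters and line breaks alone. This sketch only shows that difference in the serialized text; it does not measure model accuracy.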

Our exploration suggests that:

  • LLMs have a basic understanding of table structures but are far from perfect, even in straightforward tasks like detecting the number of columns and rows.
  • Choosing the right combination of input designs can significantly enhance LLMs’ understanding of structured data.

Our findings revealed significant performance gaps in downstream tasks, attributed to the different combinations of serialization functions and input options. These gaps remained even with GPT-4, underscoring the effectiveness of our benchmark approach.

This table compares the accuracy (Acc) of GPT-4 with that of previous models across different tasks. The tasks are Table Partition, Cell Lookup, Reverse Lookup, Column Retrieval, Row Retrieval, Size Detection, and Merged Cell Detection. The data formats compared are NL + Sep, Markdown, JSON, XML, and HTML. GPT-4 shows improved accuracy across nearly all tasks and formats compared to its predecessors, with notably high accuracy in the HTML format for the Table Partition and Merged Cell Detection tasks.
Table 1. SUC benchmark evaluations on table formats.
This table compares accuracy (Acc) and changes in accuracy (Δ) for different input designs using GPT-4 on various tasks. The tasks are Table Partition, Cell Lookup, Reverse Lookup, Column Retrieval, Row Retrieval, Size Detection, and Merged Cell Detection. The input designs tested are Markup Language HTML with and without components such as format explanation, partition mark, role prompting, and change order, as well as without 1-shot learning. The last row shows the performance of GPT-4 with Language HTML. The table displays positive and negative percentage changes for each task, highlighting the impact of each input design modification on the model's accuracy.
Table 2. Ablation study of input designs using the SUC benchmark.

Improved performance with self-augmented prompting

Based on these benchmark evaluations, we investigated how LLMs’ existing knowledge could be used to enhance their understanding of structured data. To do this, we introduced self-augmentation, a model-agnostic technique that improves structural prompting—enabling LLMs to identify key values and ranges by tapping into their own internal knowledge. This technique simplifies and optimizes how LLMs utilize their existing knowledge base to improve their understanding of structured content, allowing them to generate intermediate structural insights. This process is shown in Figure 2, with the results detailed in Table 3.

The diagram shows the self-augmented prompting workflow, which involves an initial table, an intermediate output, and a final output. On the left, there's a table with the title 'Antoine Salamin' and columns labeled 'Year', 'Team', 'Driver', 'Races', and 'Pos'. Two rows are visible with the years 1983 and 1989, team name starting with 'Swit...', driver name starting with 'Antoine...', and positions '29th' and '7th' highlighted in the last visible row. Below the table is a box labeled 'Table & Other info' and an arrow pointing right labeled '1st request' with the text 'Identify critical values and ranges of the table'.

In the center, a green box with rounded corners titled 'Intermediate Output' contains text summarizing the table's content, mentioning Antoine Salamin's results from 1983 to 1989, the number of races, podiums, and points range. There's an arrow looping back to the first box with 'LLM' written above it, indicating a feedback loop for further processing.

On the right, a blue box with rounded corners titled 'Final Output' contains a narrative description saying 'In 1989, Antoine Salamin drove a Porsche 962C for the Swiss Team Salamin, powered by a Porsche turbo Flat-6 engine. He competed in two races, achieving one podium and 17 points, finishing 7th overall.' An arrow labeled '2nd request' points from the '1st request' to the 'Intermediate Output' and another from there to the 'Final Output', indicating the sequence of processing requests.
Figure 2. Self-augmented prompting.
This table compares accuracy (Acc) and BLEU scores for different input choices on several downstream datasets: TabFact, HybridQA, SQA, Feverous, and ToTTo. The input choices include 1-shot and self-augmented prompting (SA) approaches with various modifications, such as without table size, partition mark, format explanation, role prompting, critical values and ranges identification, and structural information description. Each row shows the impact of these modifications on the model's performance, with accuracy percentages for the datasets and BLEU-1 to BLEU-4 scores for the ToTTo dataset.
Table 3. Evaluation of downstream tasks. “SA” refers to self-augmented prompting.
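
The two-request flow in Figure 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `llm` is a hypothetical callable that takes a prompt string and returns text, the wording of the first request is taken from Figure 2, and the layout of the second prompt is our own guess rather than the prompt template used in the paper.

```python
from typing import Callable


def self_augmented_answer(llm: Callable[[str], str], table: str, question: str) -> str:
    """Two-request flow: elicit structural insights first, then answer with them."""
    # 1st request: ask the model to describe key values and ranges in the table.
    first_prompt = f"{table}\n\nIdentify critical values and ranges of the table."
    insights = llm(first_prompt)

    # 2nd request: send the table again, now with the intermediate output attached.
    # (Hypothetical layout; the paper's exact template is not reproduced here.)
    second_prompt = (
        f"{table}\n\n"
        f"Structural notes: {insights}\n\n"
        f"{question}"
    )
    return llm(second_prompt)


if __name__ == "__main__":
    # Stand-in model that simply reports how long each prompt was.
    def echo_llm(prompt: str) -> str:
        return f"[model saw {len(prompt)} characters]"

    print(self_augmented_answer(echo_llm, "Year,Pos\n1989,7th", "Describe the table."))
```

Because the function only depends on a prompt-in, text-out callable, the same flow could be wrapped around any model client, which is in keeping with the model-agnostic framing of the technique.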

Looking forward

Our study sets a key benchmark in expanding the capabilities of LLMs to better understand structured table data, moving beyond conventional natural language processing tasks. We suggest future research should prioritize the integration of structural information to improve performance with various structured data types. Additionally, we propose exploring LLMs’ ability to use external tools or agents for improved handling of structured data, opening new avenues for application.

The post Improving LLM understanding of structured data and exploring advanced prompting methods appeared first on Microsoft Research.

Don’t Pass This Up: Day Passes Now Available on GeForce NOW

Gamers can now seize the day with Day Passes, available to purchase for 24-hour continuous access to powerful cloud gaming with all the benefits of a GeForce NOW Ultimate or Priority membership — no commitment required.

Publisher Cygames brings its next triple-A title to the cloud. Granblue Fantasy: Relink leads eight new games joining the GeForce NOW library this week.

Plus, an update for GeForce NOW Windows and macOS adds support for G-SYNC in the cloud. By pairing it with new NVIDIA Reflex support for 60 and 120 frames per second streaming options, Ultimate members can experience ultra-low-latency streaming that’s nearly indistinguishable from using a local PC.

Seize the Day

Day Passes offer access to 24 hours of GeForce RTX-powered cloud gaming. Users can get all the benefits of Ultimate and Priority memberships for a day without committing to longer-term monthly memberships, and choose how and when they access the cloud.

Day Pass Matrix on GeForce NOW
Play for a day.

Ultimate Day Pass users can stream at either 4K 120 fps, up to 240 fps, or with ultrawide resolutions. Plus, they can get all the same benefits as gamers using NVIDIA GeForce RTX 40 Series GPUs, with access to NVIDIA DLSS 3 and NVIDIA Reflex technologies for the smoothest gameplay and lowest latency, even on underpowered devices. Both Ultimate and Priority Day Pass users can turn RTX ON in supported games for immersive, cinematic gameplay.

The Ultimate Day Pass is available for $7.99 and the Priority Day Pass for $3.99. Twenty-four hours of continuous play begins at purchase. Day Passes are available in limited quantities each day, so grab one before the opportunity passes.

Head in the Clouds

Granblue Fantasy: Relink on GeForce NOW
Going on a grand adventure.

Cygames, known for developing the popular online game Granblue Fantasy, brings its full-fledged action role-playing game to GeForce NOW. Granblue Fantasy: Relink is now available for fans to stream across devices.

Set in the same universe as the web browser and mobile version of the title, Granblue Fantasy: Relink is an ARPG that features many of the beloved characters from the franchise in an all-new original story. Step into the shoes of a captain leading a Skyfaring crew, alongside a scrappy dragon named Vyrn and a mysterious girl named Lyria, as they navigate the Sky Realm, a world of islands drifting in the clouds.

Slash, shoot and hex treacherous foes with up to three other gaming buddies. GeForce NOW Priority and Ultimate members can become Skyfarers in the cloud with longer game sessions and faster access to GeForce RTX-class servers.

Spring Into New Games

Undisputed on GeForce NOW
Pull no punches.

Step into the ring in Undisputed, an authentic boxing game from Steel City Interactive. Featuring bone-jarring action and more licensed boxers than ever, Undisputed, currently in early access, gives members unprecedented control to master every inch of the ring.

It’s available to stream from the cloud this week, along with the following games:

  • The Thaumaturge (New release on Steam, Mar. 4)
  • Classified: France ’44 (New release on Steam, Mar. 5)
  • Expeditions: A MudRunner Game (New release on Steam, Mar. 5)
  • Winter Survival (New release on Steam, Mar. 6)
  • Taxi Life: A City Driving Simulator (New release on Steam, Mar. 7)
  • Zoria: Age of Shattering (New release on Steam, Mar. 7)
  • Granblue Fantasy: Relink (Steam)
  • Undisputed (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies

Chris Bishop at Research Forum

Research advances are driving real-world impact faster than ever. Recent developments in AI are reshaping the way people live, work, and think. In the latest episode of Microsoft Research Forum, we explore how AI is transforming health care and the natural sciences, the intersection of AI and society, and the continuing evolution of foundational AI technologies.

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon.

Keynote: The Revolution in Scientific Discovery

Chris Bishop, Technical Fellow and Director, Microsoft Research AI4Science 

As in our debut event on January 30, this edition of Research Forum began with a keynote address by a leader from Microsoft Research. Chris Bishop shared some exciting real-world progress being made by his team toward modelling and predicting natural phenomena.

Chris Bishop: “In my view, the most important use case of AI will be to scientific discovery. And the reason I believe this is that it’s our understanding of the natural world obtained through scientific discovery, together with its application in the form of technology that has really transformed the human species.”

Panel discussion: Transforming the natural sciences with AI

Bonnie Kruft, Partner Deputy Director, Microsoft Research AI4Science (Host)
Rianne van den Berg, Principal Research Manager, Microsoft Research AI4Science 
Tian Xie, Principal Research Manager, Microsoft Research AI4Science 
Tristan Naumann, Principal Researcher, Microsoft Research Health Futures 
Kristen Severson, Senior Researcher, Microsoft Research New England 
Alex Lu, Senior Researcher, Microsoft Research New England

In a discussion hosted by Bonnie Kruft, Microsoft researchers presented their latest advancements in the fields of foundation models, drug discovery, material design, and machine learning. Panelists highlighted deep learning’s growing impact on the natural sciences.

Tristan Naumann: “Much of the data we have in healthcare is not nicely structured in a clean and easy to use way. And so, one of the things that’s really incredible about some of these recent advances in generative AI, specifically large language models (and) multimodal models, is really this opportunity to have a tool for universal structuring and unlocking some of that data quickly and efficiently, really opens up a lot of new opportunities.” 

Tian Xie: “Similar (to) the field of health and in biology, machine learning is really beginning to interrupt some of the traditional pipelines that happened in materials discovery.”

Kristen Severson: “We have a lot of knowledge about diseases and how they manifest and we don’t want to leave that information on the table when we train a machine learning model. So, there’s not an interest in using solely black box approaches, but instead (in) using what’s already known.”

Alex Lu: “If you look at what particularly differentiates biology and I suspect by extension a lot of other scientific disciplines, the whole point is to try to discover something new. So, by definition, what that new thing is is not going to be captured in your original distribution of data.” 

Rianne van den Berg: “One particular class of generative models that I’m very excited about and that’s becoming increasingly popular is that of diffusion models and score-based generative models. These models have been super successful already, for instance in high resolution image generation and video, and they’re also very naturally suited to target scientific discovery.” 

Lightning talk: What’s new in AutoGen? 

Chi Wang, Principal Researcher, Microsoft Research AI Frontiers 

Chi Wang presented the latest updates on AutoGen – the multi-agent framework for next generation AI applications. The discussion covered milestones achieved, community feedback, exciting new features, and the research and related challenges on the road ahead. He also announced a recent milestone. 

Chi Wang: “Our initial multiagent experiment on the challenging GAIA benchmark turned out to achieve the number one accuracy in the leaderboard in all three levels. That shows the power of AutoGen in solving complex tasks and big potential.”

Lightning talk: The metacognitive demands and opportunities of generative AI

Lev Tankelevitch, Senior Behavioral Science Researcher, Microsoft Research Cambridge (UK)

Lev Tankelevitch explored how metacognition—the psychological capacity to monitor and regulate one’s thoughts and behaviors—provides a valuable lens for understanding and addressing the usability challenges of generative AI systems. This includes prompting, assessing and relying on outputs, and workflow optimization, which require a high degree of metacognitive monitoring and control.

Lev Tankelevitch: “We believe that a metacognitive perspective can help us analyze, measure, and evaluate the usability challenges of generative AI, and it can help us design generative AI systems that can augment human agency and workflows.”

Lightning talk: Getting modular with language models: Building and reusing a library of experts for task generalization

Alessandro Sordoni, Principal Researcher, Microsoft Research Montreal

Alessandro Sordoni discussed recent research on building and re-using large collections of expert language models to improve zero-shot and few-shot generalization to unseen tasks.

Alessandro Sordoni: “Looking forward, I believe that an exciting direction would be to push this to fully decentralized training and continual improvement of language models in the sense that users can train their experts, then share them in the platform and the model gets better.” 

Lightning talk: GigaPath: Real-World Pathology Foundation Model

Naoto Usuyama, Principal Researcher, Microsoft Research Health Futures

Naoto Usuyama presented GigaPath, a novel approach for training large vision transformers for gigapixel pathology images, utilizing a diverse, real-world cancer patient dataset, with the goal of laying a foundation for cancer pathology AI.

Naoto Usuyama: “This project (GigaPath) is not possible without many, many collaborators, and we are just scratching the surface. So, I’m very excited, and I really hope we can unlock the full potential of real-world patient data and advanced AI for cancer care and research.”

Lightning talk: Generative AI and plural governance: Mitigating challenges and surfacing opportunities

Madeleine Daepp, Senior Researcher, Microsoft Research Redmond
Vanessa Gathecha, Applied Researcher and Policy Analyst, Baraza Media Lab

This talk featured two expert speakers. Madeleine Daepp discussed the potential impacts and challenges of generative AI in a year with over 70 major global elections. Vanessa Gathecha, a 2024 Microsoft AI and Society fellow, discussed her work on disinformation in Kenya and Sub-Saharan Africa.

Madeleine Daepp: “The disruption of our digital public sphere is an all-of-society problem that requires an all-of-society response. The AI and Society fellows program is helping to build much needed connections across places, across academic disciplines, and across societal sectors to help us understand the problem and work toward an impactful response.” 

The post Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies appeared first on Microsoft Research.

Croissant: a metadata format for ML-ready datasets

Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.

ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.

There are general purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test and validation sets.

Today, we’re introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn’t change how the actual data is represented (e.g., image or text file formats) — it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML relevant metadata, data resources, data organization, and default ML semantics.
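
For readers unfamiliar with schema.org, the sketch below builds a plain schema.org `Dataset` description in Python and prints it as JSON-LD. It uses only core schema.org properties; the dataset name, URLs, and file are hypothetical, and the ML-specific layers that Croissant adds on top are deliberately not shown, since their exact fields are defined in the Croissant specification rather than here.

```python
import json

# Hypothetical dataset description using only core schema.org vocabulary.
# Croissant's ML-specific layers (resources, organization, ML semantics)
# are NOT represented in this sketch.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example-images",
    "description": "A hypothetical image classification dataset.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/example-images",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/datasets/example-images.zip",
            "encodingFormat": "application/zip",
        }
    ],
}


def to_jsonld(metadata: dict) -> str:
    """Serialize the description as pretty-printed JSON-LD text."""
    return json.dumps(metadata, indent=2)


if __name__ == "__main__":
    print(to_jsonld(dataset_metadata))
```

Note that the description points at the data files rather than embedding or converting them, which matches the statement that the format describes and organizes data without changing how the data itself is represented.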

In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.

Croissant

This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.

Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.

Why a shared format for ML data?

The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.

The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.

Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets, while only requiring a minimal effort, thanks to the available creation tools and support from ML data platforms.

What can Croissant do today?

The Croissant ecosystem: Users can Search for Croissant datasets, download them from major repositories, and easily load them into their favorite ML frameworks. They can create, inspect and modify Croissant metadata using the Croissant editor.

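Today, users can find Croissant datasets at:

  • Dataset Search, which lets users search for Croissant datasets across the Web.
  • Kaggle, Hugging Face, and OpenML, which support the Croissant format for the datasets they host.

With a Croissant dataset, it is possible to:

  • Load the data into popular ML frameworks, including TensorFlow, PyTorch, and JAX, using the TensorFlow Datasets (TFDS) package.
  • Inspect and modify the metadata using the Croissant editor.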

To publish a Croissant dataset, users can:

  • Use the Croissant editor UI (github) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill important metadata fields such as RAI properties.
  • Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
  • Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face and OpenML, and automatically generate Croissant metadata.

Future direction

We are excited about Croissant’s potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and embed Croissant metadata in dataset Web pages so that they can be made discoverable by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.

We encourage the community to join us in contributing to the effort.

Acknowledgements

Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, King’s College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.

Research Focus: Week of March 4, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Generative Kaleidoscopic Networks

Neural networks are deep learning models that can be trained to learn complex patterns and relationships within data. In a recent paper: Generative Kaleidoscopic Networks, researchers from Microsoft detail how they discovered an “over-generalization” phenomenon, which indicates that the neural networks tend to learn many-to-one mappings. They then use this phenomenon to introduce a new paradigm of generative modeling by creating a dataset kaleidoscope, dubbed ‘Generative Kaleidoscopic Networks.’ The researchers are exploring theoretical explanations, experiments on multimodal data, and conditional generation using the Generative Kaleidoscopic Networks.

MNIST Kaleidoscope: Manifold learning is performed on the MNIST images with a multilayer perceptron model. We start with an input noise vector sampled from a uniform distribution and run the kaleidoscopic sampling algorithm. The transitions between images demonstrate a kaleidoscopic effect until the samples eventually find a stable minimum and converge to a digit.
MNIST Kaleidoscope: Manifold learning is done on the MNIST data images with a Multilayer Perceptron model. We start with input noise vector sampled from a Uniform distribution and run the kaleidoscopic sampling algorithm. The transitioning between images demonstrate a kaleidoscopic effect until eventually the samples found a stable minima and converge at a digit.

Text Diffusion with Reinforced Conditioning

Diffusion models are a type of machine learning model that have shown exceptional ability to generate high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they offer potential for achieving better non-autoregressive sequence generation—which simultaneously predicts all elements of a sequence, rather than predicting the next element in a sequence.

However, existing text diffusion models have yet to fulfill this potential, due to challenges in handling the discreteness of language. In a recent paper: Text Diffusion with Reinforced Conditioning, researchers from Microsoft and external colleagues uncover two significant limitations in text diffusion models: degradation of self-conditioning during training and misalignment between training and sampling. In response, the researchers propose a novel model called TREC, which empowers text diffusion models with reinforced conditioning, mitigating the degradation by directly motivating quality improvements from self-conditions with reward signals. In the paper, which was presented at the 2024 Association for the Advancement of Artificial Intelligence conference (AAAI), they further propose time-aware variance scaling to address the misalignment issue.

Extensive experiments demonstrate the competitiveness of TREC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.


PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem

Temporal action abstractions, along with belief state representations, are powerful knowledge sharing mechanisms for sequential decision making. In a recent paper, PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem, researchers from Microsoft and the University of Maryland propose a novel connection between the seemingly distant realms of training large language models (LLMs) and inducing temporal action abstractions for continuous control domains such as robotics. The researchers introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with a subtle but critical component of LLM training pipelines, input tokenization via byte pair encoding (BPE), to learn powerful variable-timespan action abstractions. They empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the performance of both multitask imitation learning and few-shot imitation learning on unseen tasks.
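To make the combination of quantization and BPE more concrete, here is a toy sketch, not the PRISE implementation. It assumes one-dimensional actions and uses uniform binning as a stand-in for the learned quantizer described in the paper; the function names `quantize` and `bpe_merge_once` are hypothetical. A single BPE step then merges the most frequent pair of adjacent action codes into one longer token, which plays the role of a multi-step skill.

```python
from collections import Counter


def quantize(actions, low=-1.0, high=1.0, bins=4):
    """Map each continuous action to an integer code by uniform binning.

    Assumption: uniform bins replace the learned quantizer for illustration.
    """
    width = (high - low) / bins
    codes = []
    for a in actions:
        idx = int((a - low) / width)
        codes.append(min(max(idx, 0), bins - 1))  # clamp to a valid bin
    return codes


def bpe_merge_once(tokens):
    """Merge the most frequent adjacent pair of tokens into a single token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    best = pairs.most_common(1)[0][0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(best)  # the pair becomes one variable-length token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best


actions = [-0.9, 0.8, -0.9, 0.8, 0.1, -0.9, 0.8]  # made-up 1-D action trace
codes = quantize(actions)
tokens, skill = bpe_merge_once(codes)
```

Repeating the merge step builds progressively longer tokens, which is how BPE yields abstractions that span a variable number of time steps.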

The post Research Focus: Week of March 4, 2024 appeared first on Microsoft Research.

Read More

Efficiently fine-tune the ESM-2 protein language model with Amazon SageMaker

In this post, we demonstrate how to efficiently fine-tune a state-of-the-art protein language model (pLM) to predict protein subcellular localization using Amazon SageMaker.

Proteins are the molecular machines of the body, responsible for everything from moving your muscles to responding to infections. Despite this variety, all proteins are made of repeating chains of molecules called amino acids. The human genome encodes 20 standard amino acids, each with a slightly different chemical structure. These can be represented by letters of the alphabet, which then allows us to analyze and explore proteins as a text string. The enormous possible number of protein sequences and structures is what gives proteins their wide variety of uses.

The structure of an amino acid chain

Proteins also play a key role in drug development, as potential targets but also as therapeutics. As shown in the following table, many of the top-selling drugs in 2022 were either proteins (especially antibodies) or other molecules like mRNA translated into proteins in the body. Because of this, many life science researchers need to answer questions about proteins faster, cheaper, and more accurately.

Name | Manufacturer | 2022 Global Sales ($ billions USD) | Indications
--- | --- | --- | ---
Comirnaty | Pfizer/BioNTech | $40.8 | COVID-19
Spikevax | Moderna | $21.8 | COVID-19
Humira | AbbVie | $21.6 | Arthritis, Crohn’s disease, and others
Keytruda | Merck | $21.0 | Various cancers

Data source: Urquhart, L. Top companies and drugs by sales in 2022. Nature Reviews Drug Discovery 22, 260–260 (2023).

Because we can represent proteins as sequences of characters, we can analyze them using techniques originally developed for written language. This includes large language models (LLMs) pretrained on huge datasets, which can then be adapted for specific tasks, like text summarization or chatbots. Similarly, pLMs are pre-trained on large protein sequence databases using unlabeled, self-supervised learning. We can adapt them to predict things like the 3D structure of a protein or how it may interact with other molecules. Researchers have even used pLMs to design novel proteins from scratch. These tools don’t replace human scientific expertise, but they have the potential to speed up pre-clinical development and trial design.

One challenge with these models is their size. Both LLMs and pLMs have grown by orders of magnitude in the past few years, as illustrated in the following figure. This means that it can take a long time to train them to sufficient accuracy. It also means that you need to use hardware, especially GPUs, with large amounts of memory to store the model parameters.

Protein language models, like other large language models, have steadily increased in size for several years

Long training times plus large instances equal high cost, which can put this work out of reach for many researchers. For example, in 2023, a research team described training a 100 billion-parameter pLM on 768 A100 GPUs for 164 days! Fortunately, in many cases we can save time and resources by adapting an existing pLM to our specific task. This technique is called fine-tuning, and it also allows us to borrow advanced tools from other types of language modeling.

Solution overview

The specific problem we address in this post is subcellular localization: Given a protein sequence, can we build a model that predicts whether the protein lives on the outside (cell membrane) or inside of a cell? This is an important piece of information that can help us understand the protein’s function and whether it would make a good drug target.

We start by downloading a public dataset using Amazon SageMaker Studio. Then we use SageMaker to fine-tune the ESM-2 protein language model using an efficient training method. Finally, we deploy the model as a real-time inference endpoint and use it to test some known proteins. The following diagram illustrates this workflow.

AWS architecture for fine tuning ESM

In the following sections, we go through the steps to prepare your training data, create a training script, and run a SageMaker training job. All of the code featured in this post is available on GitHub.

Prepare the training data

We use part of the DeepLoc-2 dataset, which contains several thousand SwissProt proteins with experimentally determined locations. We filter for high-quality sequences between 100 and 512 amino acids long:

import pandas as pd

df = pd.read_csv(
    "https://services.healthtech.dtu.dk/services/DeepLoc-2.0/data/Swissprot_Train_Validation_dataset.csv"
).drop(["Unnamed: 0", "Partition"], axis=1)
df["Membrane"] = df["Membrane"].astype("int32")

# Filter for sequences between 100 and 512 amino acids
df = df[df["Sequence"].str.len().between(100, 512)]

# Remove unnecessary features
df = df[["Sequence", "Kingdom", "Membrane"]]

Next, we tokenize the sequences and split them into training and evaluation sets:

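import os

from datasets import Dataset
from transformers import AutoTokenizer

# Hold out 20% of the records for evaluation
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, shuffle=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")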

def preprocess_data(examples, max_length=512):
    text = examples["Sequence"]
    encoding = tokenizer(text, truncation=True, max_length=max_length)
    encoding["labels"] = examples["Membrane"]
    return encoding

encoded_dataset = dataset.map(
    preprocess_data,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=dataset["train"].column_names,
)

encoded_dataset.set_format("torch")

Finally, we upload the processed training and evaluation data to Amazon Simple Storage Service (Amazon S3):

train_s3_uri = S3_PATH + "/data/train"
test_s3_uri = S3_PATH + "/data/test"

encoded_dataset["train"].save_to_disk(train_s3_uri)
encoded_dataset["test"].save_to_disk(test_s3_uri)

Create a training script

SageMaker script mode allows you to run your custom training code in optimized machine learning (ML) framework containers managed by AWS. For this example, we adapt an existing script for text classification from Hugging Face. This allows us to try several methods for improving the efficiency of our training job.

Method 1: Weighted training class

Like many biological datasets, the DeepLoc data is unevenly distributed, meaning there isn’t an equal number of membrane and non-membrane proteins. We could resample our data and discard records from the majority class. However, this would reduce the total training data and potentially hurt our accuracy. Instead, we calculate the class weights during the training job and use them to adjust the loss.

In our training script, we subclass the Trainer class from transformers with a WeightedTrainer class that takes class weights into account when calculating cross-entropy loss. This helps prevent bias in our model:

import torch
from transformers import Trainer


class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        # One weight per class, used to rescale the loss
        self.class_weights = class_weights
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross-entropy: errors on rarer classes count for more
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor(self.class_weights, device=model.device)
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
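The post does not show how the class weights are computed. As a minimal sketch, one common choice is the "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count). The helper name `balanced_class_weights` and the toy labels below are assumptions for illustration, not taken from the training script.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, ordered by class ID.

    Rarer classes receive larger weights: n_samples / (n_classes * count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]


# Toy example: six non-membrane (0) and two membrane (1) proteins
labels = [0] * 6 + [1] * 2
weights = balanced_class_weights(labels)
```

A list like `weights` could then be passed as the `class_weights` argument of a trainer such as the one above.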

Method 2: Gradient accumulation

Gradient accumulation is a training technique that allows models to simulate training on larger batch sizes. Typically, the batch size (the number of samples used to calculate the gradient in one training step) is limited by the GPU memory capacity. With gradient accumulation, the model calculates gradients on smaller batches first. Then, instead of updating the model weights right away, the gradients get accumulated over multiple small batches. When the number of accumulated samples reaches the target larger batch size, the optimization step is performed to update the model. This lets models train with effectively bigger batches without exceeding the GPU memory limit.

However, extra computation is needed for the smaller batch forward and backward passes. Increased batch sizes via gradient accumulation can slow down training, especially if too many accumulation steps are used. The aim is to maximize GPU usage but avoid excessive slowdowns from too many extra gradient computation steps.
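The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below assumes a one-parameter linear model and made-up data, and uses plain Python instead of a deep learning framework. It shows that averaging the gradients of four equal-sized micro-batches gives the same gradient as one pass over the full batch.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs for illustration
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

# Gradient from a single large batch
full_grad = mse_grad(w, data)

# The same gradient, accumulated over four micro-batches of two samples
accumulation_steps = 4
micro_batch_size = len(data) // accumulation_steps
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated += mse_grad(w, micro_batch) / accumulation_steps
# Only at this point would the optimizer update w
```

With the hyperparameters used later in this post (a per-device batch size of 8 and 4 accumulation steps), the effective batch size per device is 8 × 4 = 32.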

Method 3: Gradient checkpointing

Gradient checkpointing is a technique that reduces the memory needed during training while keeping the computational time reasonable. Large neural networks take up a lot of memory because they have to store all the intermediate values from the forward pass in order to calculate the gradients during the backward pass. This can cause memory issues. One solution is to not store these intermediate values, but then they have to be recalculated during the backward pass, which takes a lot of time.

Gradient checkpointing provides a balanced approach. It saves only some of the intermediate values, called checkpoints, and recalculates the others as needed. Therefore, it uses less memory than storing everything, but also less computation than recalculating everything. By strategically selecting which activations to checkpoint, gradient checkpointing enables large neural networks to be trained with manageable memory usage and computation time. This important technique makes it feasible to train very large models that would otherwise run into memory limitations.
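The trade-off can be illustrated with a toy chain of scalar layers. This is a sketch under simplifying assumptions (each layer is tanh(w * x), and a checkpoint is taken every fourth layer), not how a deep learning framework implements the technique. Both functions return the same gradient, but the checkpointed version holds fewer activations in memory at any one time.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x):
    """Backpropagate while keeping every layer input in memory."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, inp)
    return grad, len(inputs)


def grad_checkpointed(weights, x, every=4):
    """Keep only every `every`-th input; recompute the rest per segment."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        inputs, h = [], checkpoints[start]
        for w in segment:  # recompute this segment's activations
            inputs.append(h)
            h = layer(w, h)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(segment), reversed(inputs)):
            grad *= layer_grad(w, inp)
    return grad, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, every=4)
```

In this example the checkpointed run holds at most 8 values (4 checkpoints plus 4 recomputed inputs) instead of 16, at the cost of running the forward computation a second time.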

In our training script, we turn on gradient accumulation and checkpointing by adding the necessary parameters to the TrainingArguments object:

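from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",  # assumed path: SageMaker's model directory
    gradient_accumulation_steps=4,  # update weights every 4 micro-batches
    gradient_checkpointing=True,  # recompute activations to save memory
)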

Method 4: Low-Rank Adaptation of LLMs

Large language models like ESM-2 can contain billions of parameters that are expensive to train and run. Researchers developed a training method called Low-Rank Adaptation (LoRA) to make fine-tuning these huge models more efficient.

The key idea behind LoRA is that when fine-tuning a model for a specific task, you don’t need to update all the original parameters. Instead, LoRA adds new smaller matrices to the model that transform the inputs and outputs. Only these smaller matrices are updated during fine-tuning, which is much faster and uses less memory. The original model parameters stay frozen.

After fine-tuning with LoRA, you can merge the small adapted matrices back into the original model. Or you can keep them separate if you want to quickly fine-tune the model for other tasks without forgetting previous ones. Overall, LoRA allows LLMs to be efficiently adapted to new tasks at a fraction of the usual cost.
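The following sketch illustrates the idea in plain Python, independent of any library. The dimensions, random values, and helper names are assumptions chosen for illustration. A frozen weight matrix W is augmented with two small trainable matrices, B and A, and merging them back into W produces the same output as keeping them separate.

```python
import random


def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in the two low-rank matrices B and A."""
    return r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 6, 2, 4
scale = alpha / r

# Frozen pretrained weights (random stand-ins here)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors. B normally starts at zero; the values below
# stand in for a state after some training.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.1 * (i + j) for j in range(r)] for i in range(d_out)]

x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]

# Adapter kept separate: y = W x + scale * B (A x)
low_rank = matvec(B, matvec(A, x))
y_adapter = [b + scale * u for b, u in zip(matvec(W, x), low_rank)]

# Adapter merged into the base weights: W' = W + scale * B A
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(d_in)]
            for i in range(d_out)]
y_merged = matvec(W_merged, x)

# Parameter counts for an illustrative 1280 x 1280 layer with r = 8
full_params = 1280 * 1280
lora_params = lora_param_count(1280, 1280, 8)
```

For that illustrative layer size, the low-rank factors hold 20,480 trainable values compared with 1,638,400 in the full matrix, which is where the memory and time savings come from.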

In our training script, we configure LoRA using the PEFT library from Hugging Face:

from peft import get_peft_model, LoraConfig, TaskType
import torch
from transformers import EsmForSequenceClassification

model = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    torch_dtype=torch.bfloat16,
    num_labels=2,
)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    bias="none",
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "query",
        "key",
        "value",
        "EsmSelfOutput.dense",
        "EsmIntermediate.dense",
        "EsmOutput.dense",
        "EsmContactPredictionHead.regression",
        "EsmClassificationHead.dense",
        "EsmClassificationHead.out_proj",
    ]
)

model = get_peft_model(model, peft_config)

Submit a SageMaker training job

After you have defined your training script, you can configure and submit a SageMaker training job. First, specify the hyperparameters:

hyperparameters = {
    "model_id": "facebook/esm2_t33_650M_UR50D",
    "epochs": 1,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "use_gradient_checkpointing": True,
    "lora": True,
}

Next, define what metrics to capture from the training logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]*)"},
    {
        "Name": "max_gpu_mem",
        "Regex": "Max GPU memory use during training: ([0-9.e-]*) MB",
    },
    {"Name": "train_loss", "Regex": "'loss': ([0-9.e-]*)"},
    {
        "Name": "train_samples_per_second",
        "Regex": "'train_samples_per_second': ([0-9.e-]*)",
    },
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.e-]*)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.e-]*)"},
]
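To see what these patterns capture, you can try them locally with Python's `re` module before submitting a job. This is only an approximation: the log line below is a hypothetical example of Trainer-style output, and the regex engine SageMaker uses may behave differently from Python's in edge cases.

```python
import re

# A subset of the metric definitions from above
metric_definitions = [
    {"Name": "train_loss", "Regex": "'loss': ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.e-]*)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.e-]*)"},
]

# Hypothetical log line, for illustration only
log_line = "{'eval_loss': 0.2513, 'eval_accuracy': 0.91, 'epoch': 1.0}"

captured = {}
for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # The first capture group holds the metric value
        captured[definition["Name"]] = float(match.group(1))
```

Because the train_loss pattern starts with a quote character, it does not match inside 'eval_loss', so only the two evaluation metrics are captured from this line.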

Finally, define a Hugging Face estimator and submit it for training on an ml.g5.2xlarge instance type. This is a cost-effective instance type that is widely available in many AWS Regions:

from sagemaker.experiments.run import Run
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput

hf_estimator = HuggingFace(
    base_job_name="esm-2-membrane-ft",
    entry_point="lora-train.py",
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    output_path=f"{S3_PATH}/output",
    role=sagemaker_execution_role,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    checkpoint_local_path="/opt/ml/checkpoints",
    sagemaker_session=sagemaker_session,
    keep_alive_period_in_seconds=3600,
    tags=[{"Key": "project", "Value": "esm-fine-tuning"}],
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hf_estimator.fit(
        {
            "train": TrainingInput(s3_data=train_s3_uri),
            "test": TrainingInput(s3_data=test_s3_uri),
        }
    )

The following table compares the different training methods we discussed and their effect on the runtime, accuracy, and GPU memory requirements of our job.

Configuration | Billable Time (min) | Evaluation Accuracy | Max GPU Memory Usage (GB)
--- | --- | --- | ---
Base Model | 28 | 0.91 | 22.6
Base + GA | 21 | 0.90 | 17.8
Base + GC | 29 | 0.91 | 10.2
Base + LoRA | 23 | 0.90 | 18.6

All of the methods produced models with high evaluation accuracy. Using LoRA and gradient accumulation (GA) decreased the runtime (and cost) by 18% and 25%, respectively. Using gradient checkpointing (GC) decreased the maximum GPU memory usage by 55%. Depending on your constraints (cost, time, hardware), one of these approaches may make more sense than another.

Each of these methods performs well by itself, but what happens when we use them in combination? The following table summarizes the results.

Configuration | Billable Time (min) | Evaluation Accuracy | Max GPU Memory Usage (GB)
--- | --- | --- | ---
All methods | 12 | 0.80 | 3.3

In this case, we see a 12% reduction in accuracy. However, we’ve reduced the runtime by 57% and GPU memory use by 85%! This is a massive decrease that allows us to train on a wide range of cost-effective instance types.
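These percentages are relative reductions against the base model row in the first results table. The short calculation below reproduces them from the table values; the helper name `pct_reduction` is ours.

```python
def pct_reduction(baseline, value):
    """Relative reduction from baseline, rounded to a whole percent."""
    return round(100 * (baseline - value) / baseline)


# Values from the two results tables
base = {"time_min": 28, "accuracy": 0.91, "gpu_mem_gb": 22.6}
combined = {"time_min": 12, "accuracy": 0.80, "gpu_mem_gb": 3.3}

reductions = {key: pct_reduction(base[key], combined[key]) for key in base}
```

Note that the accuracy figure is a relative drop (0.91 to 0.80), which corresponds to 11 percentage points in absolute terms.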

Clean up

If you’re following along in your own AWS account, delete any real-time inference endpoints and data you created to avoid further charges.

predictor.delete_endpoint()

bucket = boto_session.resource("s3").Bucket(S3_BUCKET)
bucket.objects.filter(Prefix=S3_PREFIX).delete()

Conclusion

In this post, we demonstrated how to efficiently fine-tune protein language models like ESM-2 for a scientifically relevant task. For more information about using the Transformers and PEFT libraries to train pLMs, check out the posts Deep Learning With Proteins and ESMBind (ESMB): Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction on the Hugging Face blog. You can also find more examples of using machine learning to predict protein properties in the Awesome Protein Analysis on AWS GitHub repository.


About the Author

Brian Loyal is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 17 years’ experience in biotechnology and machine learning, and is passionate about helping customers solve genomic and proteomic challenges. In his spare time, he enjoys cooking and eating with his friends and family.

Read More

Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso

As visual generative AI matures from research to the enterprise domain, businesses are seeking responsible ways to integrate the technology into their products.

Bria, a startup based in Tel Aviv, is responding with an open platform for visual generative AI that emphasizes model transparency alongside fair attribution and copyright protections. Currently offering models that convert text prompts to images or transform existing images, the company will this year add text-to-video and image-to-video AI.

“Creating generative AI models requires time and expertise,” said Yair Adato, co-founder and CEO of Bria. “We do the heavy lifting so product teams can adopt our models to achieve a technical edge and go to market quickly, without investing as many resources.”

Advertising agencies and retailers can use Bria’s tools to quickly generate visuals for marketing campaigns. And creative studios can adopt the models to develop stock imagery or edit visuals. Dozens of enterprise clients have integrated the startup’s pretrained models or use its application programming interfaces.

Bria develops its models with the NVIDIA NeMo framework, which is available on NGC, NVIDIA’s hub for accelerated software. The company uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput, low-latency image generation. It’s also adopting NVIDIA Picasso, a foundry for visual generative AI models, to run inference.

“We were looking for a framework to train our models efficiently — one that would minimize compute cost while scaling AI training to more quickly reach model convergence,” said Misha Feinstein, vice president of research and development at Bria. “NeMo features optimization techniques that allow us to maximize the GPUs’ performance during both training and inference.”

Creative Solutions to Creative Challenges

Bria, founded in 2020, offers flexible options for enterprises adopting visual generative AI. By adopting Bria’s platform, its customers can gain a competitive edge by creating visual content at scale while retaining control of their data and technology. Developers can access its pretrained models through APIs or by directly licensing the source code and model weights for further fine-tuning.

“We want to build a company where we respect privacy, content ownership, data ownership and copyright,” said Adato. “To create a healthy, sustainable industry, it’s important to incentivize individuals to keep creating and innovating.”

Adato likens Bria’s attribution program to a music streaming service that pays artists each time one of their songs is played. It’s required for all customers who use Bria’s models — even if they further train and fine-tune the model on their own.

Using licensed datasets provides additional benefits: the Bria team doesn’t need to spend time cleaning the data or sorting out inappropriate content and misinformation.

A Growing Suite of NVIDIA-Accelerated Models

Bria offers two versions of its text-to-image model. One is latency-optimized to rapidly accomplish tasks like image background generation. The other offers higher image resolution. Additional foundation models enable super-resolution, object removal, object generation, inpainting and outpainting.

The company is working to continuously increase the resolution of its generated images, further reduce latency and develop domain-specific models for industries such as ecommerce and stock imagery. Inference is accelerated by the NVIDIA Triton Inference Server software and the NVIDIA TensorRT software development kit.

“We’re running on NVIDIA frameworks, hardware and software,” said Feinstein. “NVIDIA experts have helped us optimize these tools for our needs — we would probably run much slower without their help.”

To keep up with the latest hardware and networking infrastructure, Bria uses cloud computing resources: NVIDIA H100 Tensor Core GPUs for AI training and a variety of NVIDIA Tensor Core GPUs for inference.

Bria is a member of NVIDIA Inception, a program that provides startups with technological support and AI platform guidance. Visit Bria in the Inception Pavilion at NVIDIA GTC, running March 18-21 in San Jose and online.

To train optimized text-to-image models, check out the NeMo Multimodal user guide and GitHub repository. NeMo Multimodal is also available as part of the NeMo container on NGC.

Read More