Automate the process to change image backgrounds using Amazon Bedrock and AWS Step Functions

Many customers, including those in creative advertising, media and entertainment, ecommerce, and fashion, often need to change the background in a large number of images. Typically, this involves manually editing each image with photo software. This can take a lot of effort, especially for large batches of images. However, Amazon Bedrock and AWS Step Functions make it straightforward to automate this process at scale.

Amazon Bedrock offers the generative AI foundation model Amazon Titan Image Generator G1, which can automatically change the background of an image using a technique called outpainting. Step Functions allows you to create an automated workflow that seamlessly connects with Amazon Bedrock and other AWS services. Together, Amazon Bedrock and Step Functions streamline the entire process of automatically changing backgrounds across multiple images.

This post introduces a solution that simplifies the process of changing backgrounds in multiple images. By harnessing the capabilities of generative AI with Amazon Bedrock and the Titan Image Generator G1 model, combined with Step Functions, this solution efficiently generates images with the desired background. This post provides insight into the inner workings of the solution and helps you understand the design choices so you can build your own custom solution.

See the GitHub repository for detailed instructions on deploying this solution.

Solution overview

Let’s look at how the solution works at a high level before diving deeper into specific elements and the AWS services used. The following diagram provides a simplified view of the solution architecture and highlights the key elements.

Solution Architecture

The workflow consists of the following steps:

  1. A user uploads multiple images into an Amazon Simple Storage Service (Amazon S3) bucket via a Streamlit web application.
  2. The Streamlit web application calls an Amazon API Gateway REST API endpoint integrated with the Amazon Rekognition DetectLabels API, which detects labels for each image.
  3. Upon submission, the Streamlit web application updates an Amazon DynamoDB table with image details.
  4. The DynamoDB update triggers an AWS Lambda function, which starts a Step Functions workflow.
  5. The Step Functions workflow runs the following steps for each image:
    5.1 Constructs a request payload for the Amazon Bedrock InvokeModel API.
    5.2 Invokes the Amazon Bedrock InvokeModel API action.
    5.3 Parses an image from the response and saves it to an S3 location.
    5.4 Updates the image status in a DynamoDB table.
  6. The Step Functions workflow invokes a Lambda function to generate a status report.
  7. The workflow sends an email using Amazon Simple Notification Service (Amazon SNS).

As shown in the following screenshot, the Streamlit web application allows you to upload images and enter a text prompt for the desired background, a negative prompt, and the outpainting mode for image generation. You can also view the labels associated with each uploaded image and remove any that you don’t want to keep in the final generated images.

Streamlit Web Application

In this example, the prompt for the background is “London city background.” The automation process generates new images based on the original uploaded images with London as the background.

Generated Images

Streamlit web application and image uploads

A Streamlit web application serves as the frontend for this solution. To protect the application from unauthorized access, it integrates with an Amazon Cognito user pool. API Gateway uses an Amazon Cognito authorizer to authenticate requests. The web application completes the following steps:

  1. For each selected image, it retrieves labels via Amazon Rekognition using an API Gateway REST API endpoint.
  2. Upon submission, the application uploads images to an S3 bucket.
  3. The application updates a DynamoDB table with relevant parameters, image names, and associated labels for each image using another API Gateway REST API endpoint.
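The following is a minimal sketch of these three steps from the Streamlit frontend. The API Gateway paths (/labels and /images), request shapes, and Cognito token handling are illustrative assumptions; the actual wiring comes from the deployed solution in the GitHub repository.

# Minimal sketch of the Streamlit frontend flow; endpoint paths, payload fields, and the
# Cognito token handling are illustrative assumptions, not the solution's actual API.
import base64
import datetime
import boto3
import requests
import streamlit as st

API_BASE = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"  # hypothetical API Gateway stage URL
IMAGE_BUCKET = "<Image Bucket>"

s3 = boto3.client("s3")
auth_header = {"Authorization": st.session_state.get("id_token", "")}  # Cognito-issued token (assumed)

files = st.file_uploader("Upload images", type=["png", "jpg"], accept_multiple_files=True)
prompt = st.text_input("Background prompt", "london city background")
negative_prompt = st.text_input("Negative prompt", "low quality, low resolution")
mode = st.selectbox("Outpainting mode", ["DEFAULT", "PRECISE"])

if files and st.button("Submit"):
    prefix = "image-files/" + datetime.datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    images = []
    for f in files:
        # Step 1: detect labels through the API Gateway endpoint backed by Amazon Rekognition.
        resp = requests.post(f"{API_BASE}/labels",
                             json={"Image": base64.b64encode(f.getvalue()).decode()},
                             headers=auth_header)
        images.append({"ImageName": f.name, "Labels": ", ".join(resp.json().get("Labels", []))})
        # Step 2: upload the original image to the S3 input prefix.
        s3.put_object(Bucket=IMAGE_BUCKET, Key=f"{prefix}/{f.name}", Body=f.getvalue())
    # Step 3: record the batch in DynamoDB through a second API Gateway endpoint.
    requests.post(f"{API_BASE}/images",
                  json={"S3Bucket": IMAGE_BUCKET, "InputS3Prefix": prefix, "Prompt": prompt,
                        "NegativePrompt": negative_prompt, "Mode": mode, "Images": images},
                  headers=auth_header)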

Image processing workflow

When the DynamoDB table is updated, DynamoDB Streams triggers a Lambda function to start a new Step Functions workflow. The following is a sample request for the workflow:

{
  "Id": "621fa85a-38bb-4d98-a656-93bbbcf5477f",
  "S3Bucket": "<Image Bucket>",
  "InputS3Prefix": "image-files/<year>/<month>/<day>/<timestamp>",
  "OutputS3Prefix": "generated-image-files/<year>/<month>/<day>/<timestamp>",
  "StatusS3Prefix": "status-report-files/<year>/<month>/<day>/<timestamp>",
  "Prompt": "london city background",
  "NegativePrompt": "low quality, low resolution",
  "Mode": "PRECISE",
  "Images": [
    {
      "ImageName": "bus.png",
      "Labels": "Bus, Person"
    },
    {
      "ImageName": "cop.png",
      "Labels": "Person, Adult, Male, Man, Helmet, Jacket"
    },
    {
      "ImageName": "iguana-2.png",
      "Labels": "Lizard”
    },
    {
      "ImageName": "dog.png",
      "Labels": "Dog"
    }
  ]
}
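The Lambda function that consumes the DynamoDB stream and starts the workflow can be as small as the following sketch; the environment variable name and the choice to act only on INSERT events are assumptions for illustration.

# Minimal sketch of the Lambda function triggered by DynamoDB Streams.
import json
import os
import boto3
from boto3.dynamodb.types import TypeDeserializer

sfn = boto3.client("stepfunctions")
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only newly submitted batches start a workflow
        # Convert the DynamoDB stream image into a plain Python dictionary.
        item = {k: deserializer.deserialize(v) for k, v in record["dynamodb"]["NewImage"].items()}
        # Start the Step Functions workflow with a request payload like the sample above.
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # hypothetical environment variable
            name=str(item["Id"]),
            input=json.dumps(item, default=str),
        )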

The Step Functions workflow subsequently performs the following three steps:

  1. Replace the background for all images.
  2. Generate a status report.
  3. Send an email via Amazon SNS.

The following screenshot illustrates the Step Functions workflow.

AWS Step Functions Workflow

Let’s look at each step in more detail.

Replace background for all images

Step Functions uses a Distributed Map state to process each image in a parallel child workflow. The Distributed Map supports high-concurrency processing, and each child workflow has its own run history, separate from that of the parent workflow.

Step Functions uses the optimized InvokeModel API action for Amazon Bedrock. The API accepts requests and responses up to 25 MB. However, Step Functions has a 256 KB limit on state payload input and output. To support larger images, the solution uses an S3 bucket that the InvokeModel API reads input data from and writes results to. The following is the configuration for the InvokeModel API for Amazon Bedrock integration:

{
    "ModelId": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-image-generator-v1",
    "ContentType": "application/json",
    "Input": {
        "S3Uri": "s3://<Image Bucket>/image-files/<year>/<month>/<day>/<timestamp>/<Image name>.json"
    },
    "Output": {
        "S3Uri": "s3://<Image Bucket>/generated-image-files/<year>/<month>/<day>/<timestamp>/<Image name>.json"
    }
}

The Input S3Uri parameter specifies the source location to retrieve the input data. The Output S3Uri parameter specifies the destination to write the API response.

A Lambda function saves the request payload as a JSON file in the specified Input S3Uri location. The InvokeModel API uses this input payload to generate images with the specified background:

{
    "taskType": "OUTPAINTING",
    "outPaintingParams": {
        "text": "london city background",
        "negativeText": "low quality, low resolution",        
        "image": "<base64-encoded string>",                         
        "maskPrompt": "Bus",                      
        "maskImage": "base64-encoded string",                             
        "outPaintingMode": "DEFAULT | PRECISE"                 
    },                                                 
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "quality": "premium",
        "height": 1024,
        "width": 1024,
        "cfgScale": 8.0
    }
}
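A sketch of the Lambda step that assembles this payload and writes it to the Input S3Uri might look like the following. The event field names mirror the workflow request shown earlier; the object key naming is an assumption.

# Sketch of the Lambda step that builds the Titan Image Generator request and writes it to the
# Input S3Uri; event field names mirror the workflow request above, the rest is illustrative.
import base64
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["S3Bucket"]
    image_key = f"{event['InputS3Prefix']}/{event['ImageName']}"
    # Read the uploaded image and base64-encode it for the outpainting request.
    image_bytes = s3.get_object(Bucket=bucket, Key=image_key)["Body"].read()
    payload = {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": event["Prompt"],
            "negativeText": event["NegativePrompt"],
            "image": base64.b64encode(image_bytes).decode("utf-8"),
            "maskPrompt": event["Labels"],     # labels to retain in the generated image
            "outPaintingMode": event["Mode"],  # DEFAULT or PRECISE
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "quality": "premium",
            "height": 1024,
            "width": 1024,
            "cfgScale": 8.0,
        },
    }
    # Save the payload as JSON; the InvokeModel state reads it through the Input S3Uri.
    request_key = f"{image_key}.json"
    s3.put_object(Bucket=bucket, Key=request_key, Body=json.dumps(payload))
    return {"InputS3Uri": f"s3://{bucket}/{request_key}"}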

The Titan Image Generator G1 model supports the following parameters for image generation:

  • taskType – The task type to run; OUTPAINTING replaces the background of the image.
  • text – A text prompt to define the background.
  • negativeText – A text prompt to define what not to include in the image.
  • maskPrompt – A text prompt that defines the mask. It corresponds to labels that you want to retain in the final generated images.
  • maskImage – The JPEG or PNG image encoded in base64.
  • outPaintingMode – Specifies whether to allow modification of the pixels inside the mask or not. DEFAULT allows modification of the image inside the mask in order to keep it consistent with the reconstructed background. PRECISE prevents modification of the image inside the mask.
  • numberOfImages – The number of images to generate.
  • quality – The quality of the generated images: standard or premium.
  • cfgScale – Specifies how strongly the generated image should adhere to the prompt.
  • height – The height of the image in pixels.
  • width – The width of the image in pixels.

The Amazon Bedrock InvokeModel API writes a response containing an encoded image to the Output S3Uri location. Another Lambda function parses the image from the response, decodes it from base64, and saves the image file in the following location: s3://<Image Bucket>/generated-image-files/<year>/<month>/<day>/<timestamp>/.
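The parsing step can be sketched as follows, assuming the object written to the Output S3Uri is the raw Titan Image Generator response, which returns generated images as base64 strings in an images list.

# Sketch of the Lambda step that decodes the generated image; the shape of the stored response
# (a raw Titan Image Generator response with an "images" list) is an assumption.
import base64
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["S3Bucket"]
    response_key = f"{event['OutputS3Prefix']}/{event['ImageName']}.json"
    # Read the response that the InvokeModel state wrote to the Output S3Uri.
    response = json.loads(s3.get_object(Bucket=bucket, Key=response_key)["Body"].read())
    image_bytes = base64.b64decode(response["images"][0])
    # Save the decoded image alongside the JSON response under the generated-image-files prefix.
    image_key = f"{event['OutputS3Prefix']}/{event['ImageName']}"
    s3.put_object(Bucket=bucket, Key=image_key, Body=image_bytes, ContentType="image/png")
    return {"GeneratedImageS3Uri": f"s3://{bucket}/{image_key}"}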

Finally, a child workflow updates a DynamoDB table with image generation status, marking it as either Succeeded or Failed, and including details such as ImageName, Cause, Error, and Status.

Generate a status report

After the image generation process, a Lambda function retrieves the status details from DynamoDB. It dynamically compiles these details into a comprehensive status report in JSON format and saves it as a JSON file in the following location: s3://<Image Bucket>/status-report-files/<year>/<month>/<day>/<timestamp>/. The ITOps team can integrate this report with their existing notification system to track whether image processing completed successfully. For business users, you can expand this further to generate a report in CSV format.
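A sketch of the report-generation Lambda follows. The table name is hypothetical and the partition key and attribute names assume the Id and Status fields described above.

# Sketch of the status-report Lambda; the table name is hypothetical and the attribute names
# follow the Id/Status/ImageName fields described above.
import json
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def handler(event, context):
    table = dynamodb.Table("ImageProcessingStatus")  # hypothetical table name
    # Collect the per-image status items written by the child workflows for this batch.
    items = table.query(KeyConditionExpression=Key("Id").eq(event["Id"]))["Items"]
    report = {
        "Id": event["Id"],
        "Total": len(items),
        "Succeeded": sum(1 for i in items if i.get("Status") == "Succeeded"),
        "Failed": sum(1 for i in items if i.get("Status") == "Failed"),
        "Images": items,
    }
    report_key = f"{event['StatusS3Prefix']}/status-report.json"
    s3.put_object(Bucket=event["S3Bucket"], Key=report_key, Body=json.dumps(report, default=str))
    return {"StatusReportS3Uri": f"s3://{event['S3Bucket']}/{report_key}"}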

Send an email via Amazon SNS

Step Functions invokes an Amazon SNS API action to send an email. The email contains details including the S3 locations of the status report and the final image files. The following is a sample notification email.

Notification Email

Conclusion

In this post, we provided an overview of a sample solution demonstrating the automation of changing image backgrounds at scale using Amazon Bedrock and Step Functions. We also explained each element of the solution in detail. By using the Step Functions optimized integration with Amazon Bedrock, Distributed Map, and the Titan Image Generator G1 model, the solution efficiently replaces the backgrounds of images in parallel, enhancing productivity and scalability.

To deploy the solution, refer to the instructions in the GitHub repository.

Resources

To learn more about Amazon Bedrock, the Titan Image Generator G1 model, and using Amazon Bedrock with Step Functions, see the Amazon Bedrock documentation and the AWS Step Functions Developer Guide.


About the Author

Chetan Makvana is a Senior Solutions Architect with Amazon Web Services. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture and implementing strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest in generative AI, serverless, and DevOps. Outside of work, he enjoys watching shows, traveling, and music.

Read More

Social learning: Collaborative learning with large language models

Large language models (LLMs) have significantly improved the state of the art for solving tasks specified using natural language, often reaching performance close to that of people. As these models increasingly enable assistive agents, it could be beneficial for them to learn effectively from each other, much like people do in social settings, which would allow LLM-based agents to improve each other’s performance.

Bandura and Walters described the concept of social learning in 1977, outlining different models of observational learning used by people. One common method of learning from others is through verbal instruction (e.g., from a teacher) that describes how to engage in a particular behavior. Alternatively, learning can happen through a live model, by mimicking a live example of the behavior.

Given the success of LLMs mimicking human communication, in our paper “Social Learning: Towards Collaborative Learning with Large Language Models”, we investigate whether LLMs are able to learn from each other using social learning. To this end, we outline a framework for social learning in which LLMs share knowledge with each other in a privacy-aware manner using natural language. We evaluate the effectiveness of our framework on various datasets, and propose quantitative methods that measure privacy in this setting. In contrast to previous approaches to collaborative learning, such as common federated learning approaches that often rely on gradients, in our framework, agents teach each other purely using natural language.

Social learning for LLMs

To extend social learning to language models, we consider the scenario where a student LLM should learn to solve a task from multiple teacher entities that already know that task. In our paper, we evaluate the student’s performance on a variety of tasks, such as spam detection in short text messages (SMS), solving grade school math problems, and answering questions based on a given text.

A visualization of the social learning process: A teacher model provides instructions or few-shot examples to a student model without sharing its private data.

Language models have shown a remarkable capacity to perform tasks given only a handful of examples, a process called few-shot learning. With this in mind, we provide human-labeled examples of a task that enable the teacher model to teach it to a student. One of the main use cases of social learning arises when these examples cannot be directly shared with the student, for example due to privacy concerns.

To illustrate this, let’s look at a hypothetical example for a spam detection task. A teacher model is located on a device where some users volunteer to mark incoming messages they receive as either “spam” or “not spam”. This is useful data that could help train a student model to differentiate between spam and not spam, but sharing personal messages with other users is a breach of privacy and should be avoided. To prevent this, a social learning process can transfer the knowledge from the teacher model to the student so it learns what spam messages look like without needing to share the user’s personal text messages.

We investigate the effectiveness of this social learning approach by analogy with the established human social learning theory that we discussed above. In these experiments, we use PaLM 2-S models for both the teacher and the student.

A systems view of social learning: At training time, multiple teachers teach the student. At inference time, the student uses what it learned from the teachers.

Synthetic examples

As a counterpart to the live teaching model described for traditional social learning, we propose a learning method where the teachers generate new synthetic examples for the task and share them with the student. This is motivated by the idea that one can create a new example that is sufficiently different from the original one, but is just as educational. Indeed, we observe that our generated examples are sufficiently different from the real ones to preserve privacy while still enabling performance comparable to that achieved using the original examples.
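In outline, the sharing protocol fits in a few lines. In the sketch below, generate is a placeholder for a call to any instruction-following LLM (the paper uses PaLM 2-S), not a real API, and the prompt wording is invented for illustration.

# Illustrative sketch of teaching with synthetic examples; `generate` is a placeholder for an LLM
# call, not a real API, and the prompt wording is invented for illustration.
def generate(prompt: str) -> str:
    """Placeholder for a call to a teacher or student language model."""
    raise NotImplementedError

def teacher_synthesizes(private_examples, n=8):
    # The teacher sees the private, human-labeled examples and writes new, sufficiently
    # different examples that teach the same task without copying the originals.
    prompt = "Here are labeled examples of a task:\n"
    prompt += "\n".join(f"Input: {x} -> Label: {y}" for x, y in private_examples)
    prompt += f"\nWrite {n} new, different examples of the same task in the same format."
    return generate(prompt)

def student_answers(synthetic_examples: str, query: str) -> str:
    # The student is prompted few-shot with the synthetic examples only, never the private data.
    return generate(f"{synthetic_examples}\nInput: {query} -> Label:")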

The 8 generated examples perform as well as the original data for several tasks (see our paper).

We evaluate the efficacy of learning through synthetic examples on our task suite. Especially when the number of examples is high enough, e.g., n = 16, we observe no statistically significant difference between sharing original data and teaching with synthesized data via social learning for the majority of tasks, indicating that the privacy improvement does not have to come at the cost of model quality.

Generating 16 instead of just 8 examples further reduces the performance gap relative to the original examples.

The one exception is spam detection, for which teaching with synthesized data yields lower accuracy. This may be because the training procedure of current models makes them biased to only generate non-spam examples. In the paper, we additionally look into aggregation methods for selecting good subsets of examples to use.

Synthetic instruction

Given the success of language models in following instructions, the verbal instruction model can also be naturally adapted to language models by having the teachers generate an instruction for the task. Our experiments show that providing such a generated instruction effectively improves performance over zero-shot prompting, reaching accuracies comparable to few-shot prompting with original examples. However, we did find that the teacher model may fail on certain tasks to provide a good instruction, for example due to a complicated formatting requirement of the output.

For Lambada, GSM8k, and Random Insertion, providing synthetic examples performs better than providing generated instructions, whereas in the other tasks generated instruction obtains a higher accuracy. This observation suggests that the choice of the teaching model depends on the task at hand, similar to how the most effective method for teaching people varies by task.

Depending on the task, generating instructions can work better than generating new examples.

Memorization of the private examples

We want teachers in social learning to teach the student without revealing specifics from the original data. To quantify how prone this process is to leaking information, we used Secret Sharer, a popular method for quantifying to what extent a model memorizes its training data, and adapted it to the social learning setting. We picked this method since it had previously been used for evaluating memorization in federated learning.

To apply the Secret Sharer method to social learning, we design “canary” data points such that we can concretely measure how much the training process memorized them. These data points are included in the datasets used by teachers to generate new examples. After the social learning process completes, we can then measure how much more confident the student is in the secret data points the teacher used, compared to similar ones that were not shared even with the teachers.

In our analysis, discussed in detail in the paper, we use canary examples that include names and codes. Our results show that the student is only slightly more confident in the canaries the teacher used. In contrast, when the original data points are directly shared with the student, the confidence in the included canaries is much higher than in the held-out set. This supports the conclusion that the teacher does indeed use its data to teach without simply copying it over.

Conclusion and next steps

We introduced a framework for social learning that allows language models with access to private data to transfer knowledge through textual communication while maintaining the privacy of that data. In this framework, we identified sharing examples and sharing instructions as basic models and evaluated them on multiple tasks. Furthermore, we adapted the Secret Sharer metric to our framework, proposing a metric for measuring data leakage.

As next steps, we are looking for ways of improving the teaching process, for example by adding feedback loops and iteration. Furthermore, we want to investigate using social learning for modalities other than text.

Acknowledgements

We would like to acknowledge and thank Matt Sharifi, Sian Gooding, Lukas Zilka, and Blaise Aguera y Arcas, who are all co-authors on the paper. Furthermore, we would like to thank Victor Cărbune, Zachary Garrett, Tautvydas Misiunas, Sofia Neata and John Platt for their feedback, which greatly improved the paper. We’d also like to thank Tom Small for creating the animated figure.

Read More

First Class: NVIDIA Introduces Generative AI Professional Certification

NVIDIA is offering a new professional certification in generative AI to enable developers to establish technical credibility in this important domain.

Generative AI is revolutionizing industries worldwide, yet there’s a critical skills gap and need to uplevel employees to more fully harness the technology.

Available for the first time from NVIDIA, this new professional certification enables developers, career professionals, and others to validate and showcase their generative AI skills and expertise. Our new professional certification program introduces two associate-level generative AI certifications, focusing on proficiency in large language models and multimodal workflow skills.

“Generative AI has moved to center stage as governments, industries and organizations everywhere look to harness its transformative capabilities,” NVIDIA founder and CEO Jensen Huang recently said.

The certification will become available starting at GTC, where in-person attendees can also access recommended training to prepare for a certification exam.

“Organizations in every industry need to increase their expertise in this transformative technology,” said Greg Estes, VP of developer programs at NVIDIA. “Our goals are to assist in upskilling workforces, sharpen the skills of qualified professionals, and enable individuals to demonstrate their proficiency in order to gain a competitive advantage in the job market.”

See AI’s Future. Learn How to Use It.  

GTC 2024 — running March 18-21 in San Jose, Calif. — is the first in-person GTC event in five years, and more than 300,000 people are expected to register to attend in person or virtually.  There will be 900 sessions and more than 300 exhibitors showcasing how organizations are deploying NVIDIA platforms to achieve industry breakthroughs.

Attendees can choose from 20 full-day, hands-on technical workshops, with many sessions available virtually in EMEA and APAC time zones. Also, sign up for the GTC Conference + Training package for more than 40 complimentary onsite training labs.

Sign up for GTC. Learn more about the generative AI course here and here.

Read More

Improving LLM understanding of structured data and exploring advanced prompting methods

This research paper was presented at the 17th ACM International Conference on Web Search and Data Mining (WSDM 2024), the premier conference on web-inspired research on search and data mining.

In today’s data-driven landscape, tables are indispensable for organizing and presenting information, particularly text. They streamline repetitive content, enhance data manageability, enable easier data analysis, and improve machine processing capabilities. Meanwhile, large language models (LLMs) are advancing in their ability to tackle challenges associated with natural language, but the degree to which they understand tables included in their prompts remains an open question. Our research aims to explore this question and improve how LLMs use and work with table-based data.

Our paper, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” presented at WSDM 2024, investigates what kinds of prompts most effectively enable LLMs to understand tables; how much LLMs inherently detect structured data; and how LLMs’ existing knowledge can be harnessed to improve this understanding. We also analyze the complex trade-off among multiple combinations of input designs and overall performance.

To address these questions, we propose a new benchmark called Structural Understanding Capabilities (SUC), shown in Figure 1 (a), which focuses on specific tasks to assess LLMs’ ability to understand structured data in tables and compare different types of prompts. We conducted a series of experiments using different prompt designs. Our findings, detailed in the paper, evaluate how each design enhances LLMs’ ability to work with tables. 

Figure 1 (a) depicts the SUC benchmark as a flowchart with two stages. The Partition & Parsing stage covers the capabilities of structural description detection, format understanding, and hierarchy detection, which map to the tasks of table partition, table size detection, and hierarchy detection. The Search & Retrieval stage covers grounding/locating and operation reasoning, which map to the tasks of cell lookup & reverse lookup and column & row retrieval. Figure 1 (b) shows the input designs evaluated (partition mark, serialization, role prompting, order permutation, and format explanation), combined with markup languages such as HTML, XML, and Markdown.
Figure 1. The SUC benchmark and prompt designs for evaluation.

Insights and findings using the SUC benchmark

Based on humans’ perception of tables, we developed tasks to evaluate how LLMs understand them. We conducted evaluations on GPT-3.5 and GPT-4 and discovered that the results depended on certain input factors, such as table format, content order, and partition marks. The results, detailed in Tables 1 and 2, reveal some notable and unexpected findings:

  • Delimiter-separated formats (e.g., CSV, TSV) underperformed HTML by 6.76 percent.
  • Using HTML and few-shot learning consistently improved performance. The effectiveness of other approaches, such as format explanation, role prompting, order change, and partition marks, varied depending on task difficulty and the required capacity.
  • Despite the simplicity of the benchmark tasks, the highest overall accuracy across seven tasks is only 65.43 percent. This underscores the need for LLMs to have better awareness of table structures and highlights areas for further improvement in table serialization.

Our exploration suggests that:

  • LLMs have a basic understanding of table structures but are far from perfect, even in straightforward tasks like detecting the number of columns and rows.
  • Choosing the right combination of input designs can significantly enhance LLMs’ understanding of structured data.

Our findings revealed significant performance gaps in downstream tasks, attributed to the different combinations of serialization functions and input options. These gaps remained even with GPT-4, underscoring the effectiveness of our benchmark approach.
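As a concrete illustration of what a serialization choice means in practice, the same small table can be rendered in several of the benchmarked formats before being placed in a prompt. The snippet below uses pandas purely as an example; the table content and prompt wording are not taken from the paper.

# Example of serializing one table into several of the benchmarked formats; the table content
# and prompt wording are illustrative only.
import pandas as pd

df = pd.DataFrame({"Year": [1983, 1989], "Races": [2, 2], "Pos": ["29th", "7th"]})

csv_table = df.to_csv(index=False)            # delimiter-separated format
markdown_table = df.to_markdown(index=False)  # requires the tabulate package
html_table = df.to_html(index=False)          # the format that scored highest on the SUC tasks

prompt = (
    "Here is a table in HTML format:\n"
    f"{html_table}\n"
    "How many rows and columns does the table have?"
)
print(prompt)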

Table 1 compares the accuracy (Acc) of GPT-4 against previous models across table formats (NL + Sep, Markdown, JSON, XML, and HTML) on the tasks of table partition, cell lookup, reverse lookup, column retrieval, row retrieval, size detection, and merged cell detection. GPT-4 improves accuracy across nearly all tasks and formats, with notably high accuracy for HTML on the table partition and merged cell detection tasks.
Table 1. SUC benchmark evaluations on table formats.
Table 2 reports accuracy (Acc) and accuracy changes (Δ) for GPT-4 on the same tasks when individual input-design components are removed from the HTML-based prompt: format explanation, partition mark, role prompting, change order, and 1-shot learning. The positive and negative changes highlight the impact of each input-design modification on accuracy.
Table 2. Ablation study of input designs using the SUC benchmark.

Improved performance with self-augmented prompting

Based on these benchmark evaluations, we investigated how LLMs’ existing knowledge could be used to enhance their understanding of structured data. To do this, we introduced self-augmentation, a model-agnostic technique that improves structural prompting—enabling LLMs to identify key values and ranges by tapping into their own internal knowledge. This technique simplifies and optimizes how LLMs utilize their existing knowledge base to improve their understanding of structured content, allowing them to generate intermediate structural insights. This process is shown in Figure 2, with the results detailed in Table 3.
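In outline, self-augmented prompting is a two-request pattern. The sketch below uses a placeholder call_llm function and invented prompt wording to show the flow rather than reproduce the paper's exact prompts.

# Outline of self-augmented prompting; `call_llm` is a placeholder, and the prompts are invented
# to show the two-request flow rather than the paper's wording.
def call_llm(prompt: str) -> str:
    """Placeholder for a call to GPT-3.5, GPT-4, or another LLM."""
    raise NotImplementedError

def self_augmented_answer(serialized_table: str, question: str) -> str:
    # First request: ask the model to surface structural insights from its own reading of the table.
    intermediate = call_llm(
        f"{serialized_table}\nIdentify the critical values and ranges of the table."
    )
    # Second request: feed that intermediate output back in as extra context for the actual task.
    return call_llm(f"{serialized_table}\nKey facts: {intermediate}\n{question}")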

Figure 2 illustrates the self-augmented prompting workflow with a table of Antoine Salamin’s racing results (columns Year, Team, Driver, Races, and Pos). A first request asks the LLM to identify the critical values and ranges of the table, producing an intermediate output that summarizes the results from 1983 to 1989, including the number of races, podiums, and points range. A second request combines the table with this intermediate output to produce the final output, a narrative description of the 1989 season.
Figure 2. Self-augmented prompting.
Table 3 compares accuracy on the TabFact, HybridQA, SQA, and Feverous question-answering datasets and BLEU-1 to BLEU-4 scores on ToTTo for 1-shot prompting and self-augmented prompting (SA) variants, including ablations that remove table size, partition mark, format explanation, role prompting, critical value and range identification, and structural information description.
Table 3. Evaluation of downstream tasks. “SA” refers to self-augmented prompting.

Looking forward

Our study sets a key benchmark in expanding the capabilities of LLMs to better understand structured table data, moving beyond conventional natural language processing tasks. We suggest future research should prioritize the integration of structural information to improve performance with various structured data types. Additionally, we propose exploring LLMs’ ability to use external tools or agents for improved handling of structured data, opening new avenues for application.

The post Improving LLM understanding of structured data and exploring advanced prompting methods appeared first on Microsoft Research.

Read More

Don’t Pass This Up: Day Passes Now Available on GeForce NOW

Gamers can now seize the day with Day Passes, available to purchase for 24-hour continuous access to powerful cloud gaming with all the benefits of a GeForce NOW Ultimate or Priority membership — no commitment required.

Publisher Cygames brings its next triple-A title to the cloud. Granblue Fantasy: Relink leads eight new games joining the GeForce NOW library this week.

Plus, an update for GeForce NOW Windows and macOS adds support for G-SYNC in the cloud. By pairing it with new NVIDIA Reflex support for 60 and 120 frames per second streaming options, Ultimate members can experience ultra-low-latency streaming that’s nearly indistinguishable from using a local PC.

Seize the Day

Day Passes offer access to 24 hours of GeForce RTX-powered cloud gaming. Users can get all the benefits of Ultimate and Priority memberships for a day without committing to longer-term monthly memberships, and choose how and when they access the cloud.

Day Pass Matrix on GeForce NOW
Play for a day.

Ultimate Day Pass users can stream at either 4K 120 fps, up to 240 fps, or with ultrawide resolutions. Plus, they can get all the same benefits as gamers using NVIDIA GeForce RTX 40 Series GPUs, with access to NVIDIA DLSS 3 and NVIDIA Reflex technologies for the smoothest gameplay and lowest latency, even on underpowered devices. Both Ultimate and Priority Day Pass users can turn RTX ON in supported games for immersive, cinematic gameplay.

The Ultimate Day Pass is available for $7.99 and the Priority Day Pass for $3.99. Twenty-four hours of continuous play begins at purchase. Day Passes are available in limited quantities each day, so grab one before the opportunity passes.

Head in the Clouds

Granblue Fantasy: Relink on GeForce NOW
Going on a grand adventure.

Cygames, known for developing the popular online game Granblue Fantasy, brings its full-fledged action role-playing game to GeForce NOW. Granblue Fantasy: Relink is now available for fans to stream across devices.

Set in the same universe as the web browser and mobile version of the title, Granblue Fantasy: Relink is an ARPG that features many of the beloved characters from the franchise in an all-new original story. Step into the shoes of a captain leading a Skyfaring crew, alongside a scrappy dragon named Vyrn and a mysterious girl named Lyria, as they navigate the Sky Realm, a world of islands drifting in the clouds.

Slash, shoot and hex treacherous foes with up to three other gaming buddies. GeForce NOW Priority and Ultimate members can become Skyfarers in the cloud with longer game sessions and faster access to GeForce RTX-class servers.

Spring Into New Games

Undisputed on GeForce NOW
Pull no punches.

Step into the ring in Undisputed, an authentic boxing game from Steel City Interactive. Featuring bone-jarring action and more licensed boxers than ever, Undisputed, currently in early access, gives members unprecedented control to master every inch of the ring.

It’s available to stream from the cloud this week, along with the following games:

  • The Thaumaturge (New release on Steam, Mar. 4)
  • Classified: France ‘44 (New release on Steam, Mar. 5)
  • Expeditions: A MudRunner Game (New release on Steam, Mar. 5)
  • Winter Survival (New release on Steam, Mar. 6)
  • Taxi Life: A City Driving Simulator (New release on Steam, Mar. 7)
  • Zoria: Age of Shattering (New release on Steam, Mar. 7)
  • Granblue Fantasy: Relink (Steam)
  • Undisputed (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies

Chris Bishop at Research Forum

Research advances are driving real-world impact faster than ever. Recent developments in AI are reshaping the way people live, work, and think. In the latest episode of Microsoft Research Forum, we explore how AI is transforming health care and the natural sciences, the intersection of AI and society, and the continuing evolution of foundational AI technologies. 

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon.

Keynote: The Revolution in Scientific Discovery

Chris Bishop, Technical Fellow and Director, Microsoft Research AI4Science 

As in our debut event on January 30, this edition of Research Forum began with a keynote address by a leader from Microsoft Research. Chris Bishop shared some exciting real-world progress being made by his team toward modelling and predicting natural phenomena.

Chris Bishop: “In my view, the most important use case of AI will be to scientific discovery. And the reason I believe this is that it’s our understanding of the natural world obtained through scientific discovery, together with its application in the form of technology that has really transformed the human species.”

Panel discussion: Transforming the natural sciences with AI

Bonnie Kruft, Partner Deputy Director, Microsoft Research AI4Science (Host)
Rianne van den Berg, Principal Research Manager, Microsoft Research AI4Science 
Tian Xie, Principal Research Manager, Microsoft Research AI4Science 
Tristan Naumann, Principal Researcher, Microsoft Research Health Futures 
Kristen Severson, Senior Researcher, Microsoft Research New England 
Alex Lu, Senior Researcher, Microsoft Research New England

In a discussion hosted by Bonnie Kruft, Microsoft researchers presented their latest advancements in the fields of foundation models, drug discovery, material design, and machine learning. Panelists highlighted deep learning’s growing impact on the natural sciences.

Tristan Naumann: “Much of the data we have in healthcare is not nicely structured in a clean and easy to use way. And so, one of the things that’s really incredible about some of these recent advances in generative AI, specifically large language models (and) multimodal models, is really this opportunity to have a tool for universal structuring and unlocking some of that data quickly and efficiently, really opens up a lot of new opportunities.” 

Tian Xie: “Similar (to) the field of health and in biology, machine learning is really beginning to interrupt some of the traditional pipelines that happened in materials discovery.”

Kristen Severson: “We have a lot of knowledge about diseases and how they manifest and we don’t want to leave that information on the table when we train a machine learning model. So, there’s not an interest in using solely black box approaches, but instead (in) using what’s already known.”

Alex Lu: “If you look at what particularly differentiates biology and I suspect by extension a lot of other scientific disciplines, the whole point is to try to discover something new. So, by definition, what that new thing is is not going to be captured in your original distribution of data.” 

Rianne van den Berg: “One particular class of generative models that I’m very excited about and that’s becoming increasingly popular is that of diffusion models and score-based generative models. These models have been super successful already, for instance in high resolution image generation and video, and they’re also very naturally suited to target scientific discovery.” 

Lightning talk: What’s new in AutoGen? 

Chi Wang, Principal Researcher, Microsoft Research AI Frontiers 

Chi Wang presented the latest updates on AutoGen – the multi-agent framework for next generation AI applications. The discussion covered milestones achieved, community feedback, exciting new features, and the research and related challenges on the road ahead. He also announced a recent milestone. 

Chi Wang: “Our initial multiagent experiment on the challenging GAIA benchmark turned out to achieve the number one accuracy in the leaderboard in all three levels. That shows the power of AutoGen in solving complex tasks and big potential.”

Lightning talk: The metacognitive demands and opportunities of generative AI

Lev Tankelevitch, Senior Behavioral Science Researcher, Microsoft Research Cambridge (UK)

Lev Tankelevitch explored how metacognition—the psychological capacity to monitor and regulate one’s thoughts and behaviors—provides a valuable lens for understanding and addressing the usability challenges of generative AI systems. This includes prompting, assessing and relying on outputs, and workflow optimization, which require a high degree of metacognitive monitoring and control.

Lev Tankelevitch: “We believe that a metacognitive perspective can help us analyze, measure, and evaluate the usability challenges of generative AI, and it can help us design generative AI systems that can augment human agency and workflows.”

Lightning talk: Getting modular with language models: Building and reusing a library of experts for task generalization

Alessandro Sordoni, Principal Researcher, Microsoft Research Montreal

Alessandro Sordoni discussed recent research on building and re-using large collections of expert language models to improve zero-shot and few-shot generalization to unseen tasks.

Alessandro Sordoni: “Looking forward, I believe that an exciting direction would be to push this to fully decentralized training and continual improvement of language models in the sense that users can train their experts, then share them in the platform and the model gets better.” 

Lightning talk: GigaPath: Real-World Pathology Foundation Model

Naoto Usuyama, Principal Researcher, Microsoft Research Health Futures

Naoto Usuyama presented GigaPath, a novel approach for training large vision transformers for gigapixel pathology images, utilizing a diverse, real-world cancer patient dataset, with the goal of laying a foundation for cancer pathology AI.

Naoto Usuyama: “This project (GigaPath) is not possible without many, many collaborators, and we are just scratching the surface. So, I’m very excited, and I really hope we can unlock the full potential of real-world patient data and advanced AI for cancer care and research.”

Lightning talk: Generative AI and plural governance: Mitigating challenges and surfacing opportunities

Madeleine Daepp, Senior Researcher, Microsoft Research Redmond
Vanessa Gathecha, Applied Researcher and Policy Analyst, Baraza Media Lab

This talk featured two expert speakers. Madeleine Daepp discussed the potential impacts and challenges of generative AI in a year with over 70 major global elections. Vanessa Gathecha, a 2024 Microsoft AI and Society fellow, discussed her work on disinformation in Kenya and Sub-Saharan Africa.

Madeleine Daepp: “The disruption of our digital public sphere is an all-of-society problem that requires an all-of-society response. The AI and Society fellows program is helping to build much needed connections across places, across academic disciplines, and across societal sectors to help us understand the problem and work toward an impactful response.” 

The post Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies appeared first on Microsoft Research.

Read More

Croissant: a metadata format for ML-ready datasets

Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.

ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.

There are general purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test and validation sets.

Today, we’re introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn’t change how the actual data is represented (e.g., image or text file formats) — it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML relevant metadata, data resources, data organization, and default ML semantics.
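To make the layering concrete, the following is a schematic Python dictionary in the shape of a Croissant-style description. The schema.org discovery fields are standard; the ML-specific keys (distribution entries, recordSet, field) are approximations for illustration, and the normative vocabulary is defined by the Croissant 1.0 specification.

# Schematic sketch of a Croissant-style dataset description expressed as a Python dict; the
# ML-specific keys are approximations for illustration, not the normative Croissant vocabulary.
import json

croissant_like_metadata = {
    "@context": "https://schema.org/",   # Croissant extends this context with its own terms
    "@type": "Dataset",                  # standard schema.org discovery metadata
    "name": "my_dataset",
    "description": "Example dataset described for ML consumption.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [                    # data resources: the files that make up the dataset
        {"@type": "FileObject", "name": "data.csv", "encodingFormat": "text/csv"}
    ],
    "recordSet": [                       # ML layer: how records and fields are organized
        {
            "name": "examples",
            "field": [
                {"name": "text", "dataType": "Text", "source": "data.csv#text"},
                {"name": "label", "dataType": "Text", "source": "data.csv#label"},
            ],
        }
    ],
}

print(json.dumps(croissant_like_metadata, indent=2))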

In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.

Croissant

This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.

Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.

Why a shared format for ML data?

The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.

The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.

Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets, while only requiring a minimal effort, thanks to the available creation tools and support from ML data platforms.

What can Croissant do today?

The Croissant ecosystem: Users can Search for Croissant datasets, download them from major repositories, and easily load them into their favorite ML frameworks. They can create, inspect and modify Croissant metadata using the Croissant editor.

Today, users can find Croissant datasets through the Dataset Search tool and at the repositories that support the format, including Kaggle, Hugging Face, and OpenML.

With a Croissant dataset, it is possible to load the data into popular ML frameworks, including TensorFlow, PyTorch, and JAX, using the TensorFlow Datasets (TFDS) package.

To publish a Croissant dataset, users can:

  • Use the Croissant editor UI (github) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill important metadata fields such as RAI properties.
  • Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
  • Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face, and OpenML, and automatically generate Croissant metadata.

Future direction

We are excited about Croissant’s potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and embed Croissant metadata in dataset Web pages so that they can be made discoverable by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.

We encourage the community to join us in contributing to the effort.

Acknowledgements

Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, Kings College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.

Read More

Research Focus: Week of March 4, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Generative Kaleidoscopic Networks

Neural networks are deep learning models that can be trained to learn complex patterns and relationships within data. In a recent paper: Generative Kaleidoscopic Networks, researchers from Microsoft detail how they discovered an “over-generalization” phenomenon, which indicates that the neural networks tend to learn many-to-one mappings. They then use this phenomenon to introduce a new paradigm of generative modeling by creating a dataset kaleidoscope, dubbed ‘Generative Kaleidoscopic Networks.’ The researchers are exploring theoretical explanations, experiments on multimodal data, and conditional generation using the Generative Kaleidoscopic Networks.

MNIST Kaleidoscope: Manifold learning is done on the MNIST data images with a Multilayer Perceptron model. We start with an input noise vector sampled from a uniform distribution and run the kaleidoscopic sampling algorithm. The transitions between images demonstrate a kaleidoscopic effect until the samples eventually find a stable minimum and converge to a digit.

Text Diffusion with Reinforced Conditioning

Diffusion models are a type of machine learning model that have shown exceptional ability to generate high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they offer potential for achieving better non-autoregressive sequence generation—which simultaneously predicts all elements of a sequence, rather than predicting the next element in a sequence.

However, existing text diffusion models have yet to fulfill this potential, due to challenges in handling the discreteness of language. In a recent paper: Text Diffusion with Reinforced Conditioning, researchers from Microsoft and external colleagues uncover two significant limitations in text diffusion models: degradation of self-conditioning during training and misalignment between training and sampling. In response, the researchers propose a novel model called TREC, which empowers text diffusion models with reinforced conditioning, mitigating the degradation by directly motivating quality improvements from self-conditions with reward signals. In the paper, which was presented at the 2024 Association for the Advancement of Artificial Intelligence conference (AAAI), they further propose time-aware variance scaling to address the misalignment issue.

Extensive experiments demonstrate the competitiveness of TREC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.


PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem

Temporal action abstractions, along with belief state representations, are powerful knowledge sharing mechanisms for sequential decision making. In a recent paper, PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem, researchers from Microsoft and University of Maryland propose a novel connection between the seemingly distant realms of training large language models (LLMs) and inducing temporal action abstractions for continuous control domains such as robotics. The researchers introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with a subtle but critical component of LLM training pipelines — input tokenization via byte pair encoding (BPE) – to learn powerful variable-timespan action abstractions. They empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the performance of both multitask imitation learning and few-shot imitation learning on unseen tasks.

The post Research Focus: Week of March 4, 2024 appeared first on Microsoft Research.

Read More