Automate caption creation and search for images at enterprise scale using generative AI and Amazon Kendra

Automate caption creation and search for images at enterprise scale using generative AI and Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they are looking for, even when it’s scattered across multiple locations and content repositories within your organization.

Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and text from various data sources. In this post, we focus on extending the document support in Amazon Kendra to make images searchable by their displayed content. Images can often be searched using supplemented metadata such as keywords. However, it takes a lot of manual effort to add detailed metadata to potentially thousands of images. Generative AI (GenAI) can be helpful in generating the metadata automatically. By generating textual captions, the GenAI caption predictions offer descriptive metadata for images. The Amazon Kendra index can then be enriched with the generated metadata during document ingestion to enable searching the images without any manual effort.

As an example, a GenAI model can be used to generate a textual description for the following image as “a dog laying on the ground under an umbrella” during document ingestion of the image.

Image of a dog laying under an umbrella as an example of what can be searched in this solution

An object recognition model can still detect keywords such as “dog” and “umbrella,” but a GenAI model offers deeper understanding of what is represented in the image by identifying that the dog lies under the umbrella. This helps us build more refined searches in the image search process. The textual description is added as metadata to an Amazon Kendra search index via an automated custom document enrichment (CDE). Users searching for terms like “dog” or “umbrella” will then be able to find the image, as shown in the following screenshot.

Image of Kendra search tool

In this post, we show how to use CDE in Amazon Kendra using a GenAI model deployed on Amazon SageMaker. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account. It allows users to quickly and easily find the images they need without having to manually tag or categorize them. This solution can also be customized and scaled to meet the needs of different applications and industries.

Image captioning with GenAI

Image description with GenAI involves using ML algorithms to generate textual descriptions of images. The process is also known as image captioning, and operates at the intersection of computer vision and natural language processing (NLP). It has applications in areas where data is multi-modal such as ecommerce, where data contains text in the form of metadata as well as images, or in healthcare, where data could contain MRIs or CT scans along with doctor’s notes and diagnoses, to name a few use cases.

GenAI models learn to recognize objects and features within the images, and then generate descriptions of those objects and features in natural language. The state-of-the-art models use an encoder-decoder architecture, where the image information is encoded in the intermediate layers of the neural network and decoded into textual descriptions. These can be considered as two distinct stages: feature extraction from images and textual caption generation. In the feature extraction stage (encoder), the GenAI model processes the image to extract relevant visual features, such as object shapes, colors, and textures. In the caption generation stage (decoder), the model generates a natural language description of the image based on the extracted visual features.

GenAI models are typically trained on vast amounts of data, which make them suitable for various tasks without additional training. Adapting to custom datasets and new domains is also easily achievable through few-shot learning. Pre-training methods allow multi-modal applications to be easily trained using state-of-the-art language and image models. These pre-training methods also allow you to mix and match the vision model and language model that best fits your data.

The quality of the generated image descriptions depends on the quality and size of the training data, the architecture of the GenAI model, and the quality of the feature extraction and caption generation algorithms. Although image description with GenAI is an active area of research, it shows very good results in a wide range of applications, such as image search, visual storytelling, and accessibility for people with visual impairments.

Use cases

GenAI image captioning is useful in the following use cases:

  • Ecommerce – A common industry use case where images and text occur together is retail. Ecommerce in particular stores vast amounts of data as product images along with textual descriptions. The textual description or metadata is important to ensure that the best products are displayed to the user based on the search queries. Moreover, with the trend of ecommerce sites obtaining data from 3P vendors, the product descriptions are often incomplete, amounting to numerous manual hours and huge overhead resulting from tagging the right information in the metadata columns. GenAI-based image captioning is particularly useful for automating this laborious process. Fine-tuning the model on custom fashion data such as fashion images along with text describing the attributes of fashion products can be used to generate metadata that then improves a user’s search experience.
  • Marketing – Another use case of image search is digital asset management. Marketing firms store vast amounts of digital data that needs to be centralized, easily searchable, and scalable enabled by data catalogs. A centralized data lake with informative data catalogs would reduce duplication efforts and enable wider sharing of creative content and consistency between teams. For graphic design platforms popularly used for enabling social media content generation, or presentations in corporate settings, a faster search could result in an improved user experience by rendering the correct search results for the images that users want to look for and enabling users to search using natural language queries.
  • Manufacturing – The manufacturing industry stores a lot of image data like architecture blueprints of components, buildings, hardware, and equipment. The ability to search through such data enables product teams to easily recreate designs from a starting point that already exists and eliminates a lot of design overhead, thereby speeding up the process of design generation.
  • Healthcare – Doctors and medical researchers can catalog and search through MRIs and CT scans, specimen samples, images of the ailment such as rashes and deformities, along with doctor’s notes, diagnoses, and clinical trials details.
  • Metaverse or augmented reality – Advertising a product is about creating a story that users can imagine and relate to. With AI-powered tools and analytics, it has become easier than ever to build not just one story but customized stories to appear to end-users’ unique tastes and sensibilities. This is where image-to-text models can be a game changer. Visual storytelling can assist in creating characters, adapting them to different styles, and captioning them. It can also be used to power stimulating experiences in the metaverse or augmented reality and immersive content including video games. Image search enables developers, designers, and teams to search their content using natural language queries, which can maintain consistency of content between various teams.
  • Accessibility of digital content for blind and low vision – This is primarily enabled by assistive technologies such as screenreaders, Braille systems that allow touch reading and writing, and special keyboards for navigating websites and applications across the internet. Images, however, need to be delivered as textual content that can then be communicated as speech. Image captioning using GenAI algorithms is a crucial piece for redesigning the internet and making it more inclusive by providing everyone a chance to access, understand, and interact with online content.

Model details and model fine-tuning for custom datasets

In this solution, we take advantage of the vit-gpt2-image-captioning model available from Hugging Face, which is licensed under Apache 2.0 without performing any further fine-tuning. Vit is a foundational model for image data, and GPT-2 is a foundational model for language. The multi-modal combination of the two offers the capability of image captioning. Hugging Face hosts state-of-the-art image captioning models, which can be deployed in AWS in a few clicks and offer simple-to-deploy inference endpoints. Although we can use this pre-trained model directly, we can also customize the model to fit domain-specific datasets, more data types such as video or spatial data, and unique use cases. There are several GenAI models where some models perform best with certain datasets, or your team might already be using vision and language models. This solution offers the flexibility of choosing the best-performing vision and language model as the image captioning model through straightforward replacement of the model we have used.

For customization of the models to unique industry applications, open-source models available on AWS through Hugging Face offer several possibilities. A pre-trained model can be tested for the unique dataset or trained on samples of the labeled data to fine-tune it. Novel research methods also allow any combination of vision and language models to be combined efficiently and trained on your dataset. This newly trained model can then be deployed in SageMaker for the image captioning described in this solution.

An example of a customized image search is Enterprise Resource Planning (ERP). In ERP, image data collected from different stages of logistics or supply chain management could include tax receipts, vendor orders, payslips, and more, which need to be automatically categorized for the purview of different teams within the organization. Another example is to use medical scans and doctor diagnoses to predict new medical images for automatic classification. The vision model extracts features from the MRI, CT, or X-ray images and the text model captions it with the medical diagnoses.

Solution overview

The following diagram shows the architecture for image search with GenAI and Amazon Kendra.

Architecture of proposed solution

We ingest images from Amazon Simple Storage Service (Amazon S3) into Amazon Kendra. During ingestion to Amazon Kendra, the GenAI model hosted on SageMaker is invoked to generate an image description. Additionally, text visible in an image is extracted by Amazon Textract. The image description and the extracted text are stored as metadata and made available to the Amazon Kendra search index. After ingestion, images can be searched via the Amazon Kendra search console, API, or SDK.

We use the advanced operations of CDE in Amazon Kendra to call the GenAI model and Amazon Textract during the image ingestion step. However, we can use CDE for a wider range of use cases. With CDE, you can create, modify, or delete document attributes and content when you ingest your documents into Amazon Kendra. This means you can manipulate and ingest your data as needed. This can be achieved by invoking pre- and post-extraction AWS Lambda functions during ingestion, which allows for data enrichment or modification. For example, we can use Amazon Medical Comprehend when ingesting medical textual data to add ML-generated insights to the search metadata.

You can use our solution to search images through Amazon Kendra by following these steps:

  1. Upload images to an image repository like an S3 bucket.
  2. The image repository is then indexed by Amazon Kendra, which is a search engine that can be used to search for structured and unstructured data. During indexing, the GenAI model as well as Amazon Textract are invoked to generate the image metadata. You can trigger the indexing manually or on a predefined schedule.
  3. You can then search for images using natural language queries, such as “Find images of red roses” or “Show me pictures of dogs playing in the park,” through the Amazon Kendra console, SDK, or API. These queries are processed by Amazon Kendra, which uses ML algorithms to understand the meaning behind the queries and retrieve relevant images from the indexed repository.
  4. The search results are presented to you, along with their corresponding textual descriptions, allowing you to quickly and easily find the images you are looking for.

Prerequisites

You must have the following prerequisites:

  • An AWS account
  • Permissions to provision and invoke the following services via AWS CloudFormation: Amazon S3, Amazon Kendra, Lambda, and Amazon Textract.

Cost estimate

The cost of deploying this solution as a proof of concept is projected in the following table. This is the reason we use Amazon Kendra with the Developer Edition, which is not recommended for production workloads, but provides a low-cost option for developers. We assume that the search functionality of Amazon Kendra is used for 20 working days for 3 hours each day, and therefore calculate associated costs for 60 monthly active hours.

Service Time Consumed Cost Estimate per Month
Amazon S3 Storage of 10 GB with data transfer 2.30 USD
Amazon Kendra Developer Edition with 60 hours/month 67.90 USD
Amazon Textract 100% detect document text on 10,000 images 15.00 USD
Amazon SageMaker Real-time inference with ml.g4dn.xlarge for one model deployed on one endpoint for 3 hours every day for 20 days 44.00 USD
. . 129.2 USD

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys the following resources:

  • A Lambda function that downloads the image captioning model from Hugging Face hub and subsequently builds the model assets
  • A Lambda function that populates the inference code and zipped model artifacts to a destination S3 bucket
  • An S3 bucket for storing the zipped model artifacts and inference code
  • An S3 bucket for storing the uploaded images and Amazon Kendra documents
  • An Amazon Kendra index for searching through the generated image captions
  • A SageMaker real-time inference endpoint for deploying the Hugging Face image
  • captioning model
  • A Lambda function that is triggered while enriching the Amazon Kendra index on demand. It invokes Amazon Textract and a SageMaker real-time inference endpoint.

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access

Management (IAM) roles and policies, a VPC along with subnets, a security group, and an internet gateway in which the custom resource Lambda function is run.

Complete the following steps to provision your resources:

  1. Choose Launch stack to launch the CloudFormation template in the us-east-1 Region:
  2. Choose Next.
  3. On the Specify stack details page, leave the template URL and S3 URI of the parameters file at their defaults, then choose Next.
  4. Continue to choose Next on the subsequent pages.
  5. Choose Create stack to deploy the stack.

Monitor the status of the stack. When the status shows as CREATE_COMPLETE, the deployment is complete.

Ingest and search example images

Complete the following steps to ingest and search your images:

  1. On the Amazon S3 console, create a folder called images in the kendra-image-search-stack-imagecaptions S3 bucket in the us-east-1 Region.
  2. Upload the following images to the images folder.

Image of a beach to test with the kendra image search using automated text captioningImage of a dog celebrating a birthday to test with the kendra image search using automated text captioningImage of a dog under an umbrella to test with the kendra image search using automated text captioningImage of a tablet, notebook and coffee on a desk to test with the kendra image search using automated text captioning

  1. Navigate to the Amazon Kendra console in us-east-1 Region.
  2. In the navigation pane, choose Indexes, then choose your index (kendra-index).
  3. Choose Data sources, then choose generated_image_captions.
  4. Choose Sync now.

Wait for the synchronization to be complete before continuing to the next steps.

  1. In the navigation pane, choose Indexes, then choose kendra-index.
  2. Navigate to the search console.
  3. Try the following queries individually or combined: “dog,” “umbrella,” and “newsletter,” and find out which images are ranked high by Amazon Kendra.

Feel free to test your own queries that fit the uploaded images.

Clean up

To deprovisioning all the resources, complete the following step

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack kendra-genai-image-search and choose Delete.

Wait until the stack status changes to DELETE_COMPLETE.

Conclusion

In this post, we saw how Amazon Kendra and GenAI can be combined to automate the creation of meaningful metadata for images. State-of-the-art GenAI models are extremely useful for generating text captions describing the content of an image. This has several industry use cases, ranging from healthcare and life sciences, retail and ecommerce, digital asset platforms, and media. Image captioning is also crucial for building a more inclusive digital world and redesigning the internet, metaverse, and immersive technologies to cater to the needs of visually challenged sections of society.

Image search enabled through captions enables digital content to be easily searchable without manual effort for these applications, and removes duplication efforts. The CloudFormation template we provided makes it straightforward to deploy this solution to enable image search using Amazon Kendra. A simple architecture of images stored in Amazon S3 and GenAI to create textual descriptions of the images can be used with CDE in Amazon Kendra to power this solution.

This is only one application of GenAI with Amazon Kendra. To dive deeper into how to build GenAI applications with Amazon Kendra, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models. For building and scaling GenAI applications, we recommend checking out Amazon Bedrock.


About the Authors

Charalampos Grouzakis is a Data Scientist within AWS Professional Services. He has over 11 years of experience in developing and leading data science, machine learning, and big data initiatives. Currently he is helping enterprise customers modernizing their AI/ML workloads within the cloud using industry best practices. Prior to joining AWS, he was consulting customers in various industries such as Automotive, Manufacturing, Telecommunications, Media & Entertainment, Retail and Financial Services. He is passionate about enabling customers to accelerate their AI/ML journey in the cloud and to drive tangible business outcomes.


Bharathi Srinivasan
is a Data Scientist at AWS Professional Services where she loves to build cool things on Sagemaker. She is passionate about driving business value from machine learning applications, with a focus on ethical AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.

Tanvi Singhal is a Data Scientist within AWS Professional Services. Her skills and areas of expertise include data science, machine learning, and big data. She supports customers in developing Machine learning models and MLops solutions within the cloud. Prior to joining AWS, she was also a consultant in various industries such as Transportation Networking, Retail and Financial Services. She is passionate about enabling customers on their data/AI journey to the cloud.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra ,Generative AI and NLP. He has around 10 years of experience in building Data & AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answers questions about his career and professional journey. Outside of work he enjoys making portraits of family & friends, and loves creating artworks.

Read More

Research Focus: Week of July 31, 2023

Research Focus: Week of July 31, 2023

Microsoft Research Focus 21 | Week of July 31, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Anonymous Tokens with Stronger Metadata Bit Hiding from Algebraic MACs

Protecting the web from malicious activities such as bots or DoS attacks is an important goal. Researchers and practitioners have identified different approaches to balance user experience and security. For example, anonymous tokens allow an issuer to ensure that a user has been vetted while also protecting the user’s privacy. However, in some cases, the issuance or absence of a token can inform an adversary about the strategies used to distinguish honest users from bots or attackers.

In a recent paper: Anonymous Tokens with Stronger Metadata Bit Hiding from Algebraic MACs, researchers from Microsoft show how they designed an anonymous token protocol between a client and an issuer (also a verifier) that enables the issuer to support its fraud detection mechanisms while preserving users’ privacy.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

NEW RESEARCH

Survival Instinct in Offline Reinforcement Learning

On many benchmark datasets, offline reinforcement learning (RL) can produce well-performing and safe policies, even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.

In a new paper: Survival Instinct in Offline Reinforcement Learning, researchers from the University of Washington and Microsoft demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. This work shows that pessimism endows the agent with a “survival instinct” – an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. The researchers argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating new ones. This research suggests a new paradigm for RL, whereby an agent is “nudged” to learn a desirable behavior with imperfect reward but purposely biased data coverage.


NEW RESEARCH

Nimble: Rollback Protection for Confidential Cloud Services

Cloud providers today offer confidential computing services in which virtual machines (VMs) support trusted execution environments (TEEs), that isolate a customer’s code from other code (including the hypervisor). TEEs offer security properties such as memory confidentiality and execution integrity, even if the provider is compromised. However, TEEs provide volatile state storage, not persistent state storage. So, if a TEE crashes or is maliciously restarted, its data can be lost.

A common way that TEEs today avoid such data loss is to persist an encrypted version of their data in a fault-tolerant cloud storage system such as Azure Table Storage or Cosmos DB. While authenticated encryption ensures that unauthorized parties cannot see the sensitive data or change its contents, encryption does not prevent a compromised provider from returning encryptions of old data. This is known as a “rollback attack,” in which an attacker can return an application running in a TEE to a previous state, potentially one that is vulnerable to attacks or that causes the application to perform incorrect actions.  

In a recent paper, Nimble: Rollback Protection for Confidential Cloud Services, researchers from Microsoft and academic colleagues introduce Nimble, a cloud service that helps applications running in TEE detect rollback attacks.


NEW RESEARCH

Improving machine learning force fields for molecular dynamics simulations with fine-grained force metrics

Machine learning force fields (MLFFs) provide a cost-effective alternative to ab initio molecular dynamics (MD) simulations – a computational method used in theoretical chemistry and materials science to simulate the behavior of molecules and materials at the atomic level. While they typically produce only small errors on the test set, MLFFs inherently encounter generalization and robustness issues during MD simulations.

In a recent paper: Improving machine learning force fields for molecular dynamics simulations with fine-grained force metrics, researchers from Microsoft propose alleviating those issues using global force metrics and fine-grained metrics from element and conformation aspects to systematically measure MLFFs for every atom and every conformation of molecules. Such force metrics can directly examine MLFFs without running costly MD simulations, reducing the computational cost of MLFF evaluation.

The researchers show that an accurate force prediction by MLFFs for all kinds of atom types and all possible conformations plays a crucial role in their usefulness in MD simulations. In addition, they designed continued learning and fine-tuning approaches to improve the performance of MLFFs.


NEW RESEARCH

Project Rumi: Multimodal paralinguistic prompting for LLMs

Large language models (LLMs) are algorithms that process and generate natural language, which can be used to create powerful new productivity tools. However, LLMs may not fully reflect the context and nuances of a conversation. Their performance depends in part on the quality and specificity of the user’s input, or prompt. User input data is a lexical entry, which lacks paralinguistic information (intonation, gestures, facial expressions, etc.) that may convey a speaker’s intentions. This can lead to misinterpretation, misunderstanding, or inappropriate responses from the LLM.

Conveying unspoken meaning and intention is an essential component in the next generation of AI interaction. To improve the quality of the underlying communication, researchers from Microsoft are developing a system called Project Rumi, which incorporates paralinguistic input into prompt-based interactions with LLMs. This system leverages separately trained vision and audio-based models to detect and analyze non-verbal cues extracted from data streams, assessing sentiment from cognitive and physiological data in real time. This multimodal, muti-step architecture integrates with all pretrained text-based LLMs to provide additional information on the user’s sentiment and intention that is not captured by text-based models.

The post Research Focus: Week of July 31, 2023 appeared first on Microsoft Research.

Read More

Exploring summarization options for Healthcare with Amazon SageMaker

Exploring summarization options for Healthcare with Amazon SageMaker

In today’s rapidly evolving healthcare landscape, doctors are faced with vast amounts of clinical data from various sources, such as caregiver notes, electronic health records, and imaging reports. This wealth of information, while essential for patient care, can also be overwhelming and time-consuming for medical professionals to sift through and analyze. Efficiently summarizing and extracting insights from this data is crucial for better patient care and decision-making. Summarized patient information can be useful to a number of downstream processes like data aggregation, effectively coding patients, or grouping patients with similar diagnoses for review.

Artificial intelligence (AI) and machine learning (ML) models have shown great promise in addressing these challenges. Models can be trained to analyze and interpret large volumes of text data, effectively condensing information into concise summaries. By automating the summarization process, doctors can quickly gain access to relevant information, allowing them to focus on patient care and make more informed decisions. See the following case study to learn more about a real-world use case.

Amazon SageMaker, a fully managed ML service, provides an ideal platform for hosting and implementing various AI/ML-based summarization models and approaches. In this post, we explore different options for implementing summarization techniques on SageMaker, including using Amazon SageMaker JumpStart foundation models, fine-tuning pre-trained models from Hugging Face, and building custom summarization models. We also discuss the pros and cons of each approach, enabling healthcare professionals to choose the most suitable solution for generating concise and accurate summaries of complex clinical data.

Two important terms to know before we begin: pre-trained and fine-tuning. A pre-trained or foundation model is one that has been built and trained on a large corpus of data, typically for general language knowledge. Fine-tuning is the process by which a pre-trained model is given another more domain-specific dataset in order to enhance its performance on a specific task. In a healthcare setting, this would mean giving the model some data including phrases and terminology pertaining specifically to patient care.

Build custom summarization models on SageMaker

Though the most high-effort approach, some organizations might prefer to build custom summarization models on SageMaker from scratch. This approach requires more in-depth knowledge of AI/ML models and may involve creating a model architecture from scratch or adapting existing models to suit specific needs. Building custom models can offer greater flexibility and control over the summarization process, but also requires more time and resources compared to approaches that start from pre-trained models. It’s essential to weigh the benefits and drawbacks of this option carefully before proceeding, because it may not be suitable for all use cases.

SageMaker JumpStart foundation models

A great option for implementing summarization on SageMaker is using JumpStart foundation models. These models, developed by leading AI research organizations, offer a range of pre-trained language models optimized for various tasks, including text summarization. SageMaker JumpStart provides two types of foundation models: proprietary models and open-source models. SageMaker JumpStart also provides HIPAA eligibility, making it useful for healthcare workloads. It is ultimately up to the customer to ensure compliance, so be sure to take the appropriate steps. See Architecting for HIPAA Security and Compliance on Amazon Web Services for more details.

Proprietary foundation models

Proprietary models, such as Jurassic models from AI21 and the Cohere Generate model from Cohere, can be discovered through SageMaker JumpStart on the AWS Management Console and are currently under preview. Utilizing proprietary models for summarization is ideal when you don’t need to fine-tune your model on custom data. This offers an easy-to-use, out-of-the-box solution that can meet your summarization requirements with minimal configuration. By using the capabilities of these pre-trained models, you can save time and resources that would otherwise be spent on training and fine-tuning a custom model. Furthermore, proprietary models typically come with user-friendly APIs and SDKs, streamlining the integration process with your existing systems and applications. If your summarization needs can be met by pre-trained proprietary models without requiring specific customization or fine-tuning, they offer a convenient, cost-effective, and efficient solution for your text summarization tasks. Because these models are not trained specifically for healthcare use cases, quality can’t be guaranteed for medical language out of the box without fine-tuning.

Jurassic-2 Grande Instruct is a large language model (LLM) by AI21 Labs, optimized for natural language instructions and applicable to various language tasks. It offers an easy-to-use API and Python SDK, balancing quality and affordability. Popular uses include generating marketing copy, powering chatbots, and text summarization.

On the SageMaker console, navigate to SageMaker JumpStart, find the AI21 Jurassic-2 Grande Instruct model, and choose Try out model.

If you want to deploy the model to a SageMaker endpoint that you manage, you can follow the steps in this sample notebook, which shows you how to deploy Jurassic-2 Large using SageMaker.

Open-source foundation models

Open-source models include FLAN T5, Bloom, and GPT-2 models that can be discovered through SageMaker JumpStart in the Amazon SageMaker Studio UI, SageMaker JumpStart on the SageMaker console, and SageMaker JumpStart APIs. These models can be fine-tuned and deployed to endpoints under your AWS account, giving you full ownership of model weights and script codes.

Flan-T5 XL is a powerful and versatile model designed for a wide range of language tasks. By fine-tuning the model with your domain-specific data, you can optimize its performance for your particular use case, such as text summarization or any other NLP task. For details on how to fine-tune Flan-T5 XL using the SageMaker Studio UI, refer to Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart.

Fine-tuning pre-trained models with Hugging Face on SageMaker

One of the most popular options for implementing summarization on SageMaker is fine-tuning pre-trained models using the Hugging Face Transformers library. Hugging Face provides a wide range of pre-trained transformer models specifically designed for various natural language processing (NLP) tasks, including text summarization. With the Hugging Face Transformers library, you can easily fine-tune these pre-trained models on your domain-specific data using SageMaker. This approach has several advantages, such as faster training times, better performance on specific domains, and easier model packaging and deployment using built-in SageMaker tools and services. If you’re unable to find a suitable model in SageMaker JumpStart, you can choose any model offered by Hugging Face and fine-tune it using SageMaker.

To start working with a model to learn about the capabilities of ML, all you need to do is open SageMaker Studio, find a pre-trained model you want to use in the Hugging Face Model Hub, and choose SageMaker as your deployment method. Hugging Face will give you the code to copy, paste, and run in your notebook. It’s as easy as that! No ML engineering experience required.

The Hugging Face Transformers library enables builders to operate on the pre-trained models and do advanced tasks like fine-tuning, which we explore in the following sections.

Provision resources

Before we can begin, we need to provision a notebook. For instructions, refer to Steps 1 and 2 in Build and Train a Machine Learning Model Locally. For this example, we used the settings shown in the following screenshot.

We also need to create an Amazon Simple Storage Service (Amazon S3) bucket to store the training data and training artifacts. For instructions, refer to Creating a bucket.

Prepare the dataset

To fine-tune our model to have better domain knowledge, we need to get data suitable for the task. When training for an enterprise use case, you’ll need to go through a number of data engineering tasks to prepare your own data to be ready for training. Those tasks are outside the scope of this post. For this example, we’ve generated some synthetic data to emulate nursing notes and stored it in Amazon S3. Storing our data in Amazon S3 enables us to architect our workloads for HIPAA compliance. We start by getting those notes and loading them on the instance where our notebook is running:

from datasets import load_dataset
dataset = load_dataset("csv", data_files={
    "train": "s3://" + bucket_name + train_data_path,
    "validation": "s3://" + bucket_name + test_data_path
})

The notes are composed of a column containing the full entry, note, and a column containing a shortened version exemplifying what our desired output should be, summary. The purpose of using this dataset is to improve our model’s biological and medical vocabulary so that it’s more attuned to summarizing in a healthcare context, called domain fine-tuning, and show our model how to structure its summarized output. In some summarization cases, we may want to create an abstract out of an article or a one-line synopsis of a review, but in this case, we’re trying to get our model to output an abbreviated version of the symptoms and actions taken for a patient so far.

Load the model

The model we use as our foundation is a version of Google’s Pegasus, made available in the Hugging Face Hub, called pegasus-xsum. It’s already pre-trained for summarization, so our fine-tuning process can focus on extending its domain knowledge. Modifying the task our model runs is a different type of fine-tuning not covered in this post. The Transformer library supplies us with a class to load the model definition from our model_checkpoint: google/pegasus-xsum. This will load the model from the hub and instantiate it in our notebook so we can use it later on. Because pegasus-xsum is a sequence-to-sequence model, we want to use the Seq2Seq type of the AutoModel class:

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Now that we have our model, it’s time to put our attention to the other components that will enable us to run our training loop.

Create a tokenizer

The first of these components is the tokenizer. Tokenization is the process by which words from the input data are transformed into numerical representations that our model can understand. Again, the Transformer library provides a class for us to load a tokenizer definition from the same checkpoint we used to instantiate the model:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With this tokenizer object, we can create a preprocessing function and map it onto our dataset to give us tokens ready to be fed into the model. Finally, we format the tokenized output and remove the columns containing our original text, because the model will not be able to interpret them. Now we’re left with a tokenized input ready to be fed into the model. See the following code:

tokenized_datasets = dataset.map(preprocess_function, batched=True)

tokenized_datasets.set_format("torch")

tokenized_datasets = tokenized_datasets.remove_columns(
    dataset["train"].column_names
)

Create a data collator and optimizer

With our data tokenized and our model instantiated, we’re almost ready to run a training loop. The next components we want to create are the data collator and the optimizer. The data collator is another class provided by Hugging Face through the Transformers library, which we use to create batches of our tokenized data for training. We can easily build this using the tokenizer and model objects we already have just by finding the corresponding class type we’ve used previously for our model (Seq2Seq) for the collator class. The optimizer’s function is to maintain the training state and update the parameters based on our training loss as we work through the loop. To create an optimizer, we can import the optim package from the torch module, where a number of optimization algorithms are available. Some common ones you may have encountered before are Stochastic Gradient Descent and Adam, the latter of the which is applied in our example. Adam’s constructor takes in the model parameters and the parameterized learning rate for the given training run. See the following code:

from transformers import DataCollatorForSeq2Seq
from torch.optim import Adam

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
optimizer = Adam(model.parameters(), lr=learning_rate)

Build the accelerator and scheduler

The last steps before we can begin training are to build the accelerator and the learning rate scheduler. The accelerator comes from a different library (we’ve been primarily using Transformers) produced by Hugging Face, aptly named Accelerate, and will abstract away logic required to manage devices during training (using multiple GPUs for example). For the final component, we revisit the ever-useful Transformers library to implement our learning rate scheduler. By specifying the scheduler type, the total number of training steps in our loop, and the previously created optimizer, the get_scheduler function returns an object that enables us to adjust our initial learning rate throughout the training process:

from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()
model, optimizer = accelerator.prepare(
    model, optimizer
)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

Configure a training job

We’re now fully set up for training! Let’s set up a training job, starting by instantiating the training_args using the Transformers library and choosing parameter values. We can pass these, along with our other prepared components and dataset, directly to the trainer and start training, as shown in the following code. Depending on the size of your dataset and chosen parameters, this may take a significant amount of time.

from transformers import Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output/",
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    logging_dir="output/",
    load_best_model_at_end=True,
    disable_tqdm=True,
    logging_first_step=True,
    logging_steps=1,
    save_strategy="epoch",
    predict_with_generate=True
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    optimizers=(optimizer, lr_scheduler)
)
    
trainer.train()

To operationalize this code, we can package it as an entry point file and call it through a SageMaker training job. This allows us to separate the logic we just built away from the training call and allows SageMaker to run training on a separate instance.

Package the model for inference

After training has been run, the model object is ready to be used for inference. As a best practice, let’s save our work for future use. We need to create our model artifacts, zip them together, and upload our tarball to Amazon S3 for storage. To prepare our model for zipping, we need to unwrap the now fine-tuned model, then save the model binary and associated config files. We also need to save our tokenizer to the same directory that we saved our model artifacts to so it is available when we use the model for inference. Our model_dir folder should now look something like the following code:

config.json		pytorch_model.bin	tokenizer_config.json
generation_config.json	special_tokens_map.json		tokenizer.json

All that’s left is to run a tar command to zip up our directory and upload the tar.gz file to Amazon S3:

unwrapped_model = accelerator.unwrap_model(trainer.model)

unwrapped_model.save_pretrained('model_dir', save_function=accelerator.save)

tokenizer.save_pretrained('model_dir')

!cd model_dir/ && tar -czvf model.tar.gz *
!mv model_dir/model.tar.gz ./

with open("model.tar.gz", "rb") as f:
    s3.upload_fileobj(f, bucket_name, artifact_path + "model/model.tar.gz")

Our newly fine-tuned model is now ready and available to be used for inference.

Perform inference

To use this model artifact for inference, open a new file and use the following code, modifying the model_data parameter to fit your artifact save location in Amazon S3. The HuggingFaceModel constructor will rebuild our model from the checkpoint we saved to model.tar.gz, which we can then deploy for inference using the deploy method. Deploying the endpoint will take a few minutes.

from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role

role = get_execution_role()

huggingface_model = HuggingFaceModel(
   model_data=”s3://{bucket_name}/{artifact_path}/model/model.tar.gz”,
   role=role,
   transformers_version=”4.26”,
   pytorch_version=”1.13”,
   py_version=”py39”
)

predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type=”ml.m5.xlarge”
)

After the endpoint is deployed, we can use the predictor we’ve created to test it. Pass the predict method a data payload and run the cell, and you’ll get the response from your fine-tuned model:

data = {
    "inputs": "Text to summarize”
}
predictor.predict(data)

Results

To see the benefit of fine-tuning a model, let’s do a quick test. The following table includes a prompt and the results of passing that prompt to the model before and after fine-tuning.

Prompt Response with No Fine-Tuning Response with Fine-Tuning
Summarize the symptoms that the patient is experiencing. Patient is a 45 year old male with complaints of substernal chest pain radiating to the left arm. Pain is sudden onset while he was doing yard work, associated with mild shortness of breath and diaphoresis. On arrival patient’s heart rate was 120, respiratory rate 24, blood pressure 170/95. 12 lead electrocardiogram done on arrival to the emergency department and three sublingual nitroglycerin administered without relief of chest pain. Electrocardiogram shows ST elevation in anterior leads demonstrating acute anterior myocardial infarction. We have contacted cardiac catheterization lab and prepping for cardiac catheterization by cardiologist. We present a case of acute myocardial infarction. Chest pain, anterior MI, PCI.

As you can see, our fine-tuned model uses health terminology differently, and we’ve been able to change the structure of the response to fit our purposes. Note that results are dependent on your dataset and the design choices made during training. Your version of the model could offer very different results.

Clean up

When you’re finished with your SageMaker notebook, be sure to shut it down to avoid costs from long-running resources. Note that shutting down the instance will cause you to lose any data stored in the instance’s ephemeral memory, so you should save all your work to persistent storage before cleanup. You will also need to go to the Endpoints page on the SageMaker console and delete any endpoints deployed for inference. To remove all artifacts, you also need to go to the Amazon S3 console to delete files uploaded to your bucket.

Conclusion

In this post, we explored various options for implementing text summarization techniques on SageMaker to help healthcare professionals efficiently process and extract insights from vast amounts of clinical data. We discussed using SageMaker Jumpstart foundation models, fine-tuning pre-trained models from Hugging Face, and building custom summarization models. Each approach has its own advantages and drawbacks, catering to different needs and requirements.

Building custom summarization models on SageMaker allows for lots of flexibility and control but requires more time and resources than using pre-trained models. SageMaker Jumpstart foundation models provide an easy-to-use and cost-effective solution for organizations that don’t require specific customization or fine-tuning, as well as some options for simplified fine-tuning. Fine-tuning pre-trained models from Hugging Face offers faster training times, better domain-specific performance, and seamless integration with SageMaker tools and services across a broad catalog of models, but it requires some implementation effort. At the time of writing this post, Amazon has announced another option, Amazon Bedrock, which will offer summarization capabilities in an even more managed environment.

By understanding the pros and cons of each approach, healthcare professionals and organizations can make informed decisions on the most suitable solution for generating concise and accurate summaries of complex clinical data. Ultimately, using AI/ML-based summarization models on SageMaker can significantly enhance patient care and decision-making by enabling medical professionals to quickly access relevant information and focus on providing quality care.

Resources

For the full script discussed in this post and some sample data, refer to the GitHub repo. For more information on how to run ML workloads on AWS, see the following resources:


About the authors

Cody Collins is a New York based Solutions Architect at Amazon Web Services. He works with ISV customers to build industry leading solutions in the cloud. He has successfully delivered complex projects for diverse industries, optimizing efficiency and scalability. In his spare time, he enjoys reading, traveling, and training jiu jitsu.

Ameer Hakme is an AWS Solutions Architect residing in Pennsylvania. His professional focus involves collaborating with Independent software vendors throughout the Northeast, guiding them in designing and constructing scalable, state-of-the-art platforms on the AWS Cloud.

Read More

Unlocking creativity: How generative AI and Amazon SageMaker help businesses produce ad creatives for marketing campaigns with AWS

Unlocking creativity: How generative AI and Amazon SageMaker help businesses produce ad creatives for marketing campaigns with AWS

Advertising agencies can use generative AI and text-to-image foundation models to create innovative ad creatives and content. In this post, we demonstrate how you can generate new images from existing base images using Amazon SageMaker, a fully managed service to build, train, and deploy ML models for at scale. With this solution, businesses large and small can develop new ad creatives much faster and at lower cost than ever before. This allows you to develop new custom ad creative content for your business at low cost and at a rapid pace.

Solution overview

Consider the following scenario: a global automotive company needs new marketing material generated for their new car design being released and hires a creative agency that is known for providing advertising solutions for clients with strong brand equity. The car manufacturer is looking for low-cost ad creatives that display the model in diverse locations, colors, views, and perspectives while maintaining the brand identity of the car manufacturer. With the power of state-of-the-art techniques, the creative agency can support their customer by using generative AI models within their secure AWS environment.

The solution is developed with Generative AI and Text-to-Image models in Amazon SageMaker. SageMaker is a fully managed machine learning (ML) service that that makes it straightforward to build, train, and deploy ML models for any use case with fully managed infrastructure, tools, and workflows. Stable Diffusion is a text-to-image foundation model from Stability AI that powers the image generation process. Diffusers are pre-trained models that use Stable Diffusion to use an existing image to generate new images based on a prompt. Combining Stable Diffusion with Diffusers like ControlNet can take existing brand-specific content and develop stunning versions of it. Key benefits of developing the solution within AWS along with Amazon SageMaker are:

  • Privacy – Storing the data in Amazon Simple Storage Service (Amazon S3) and using SageMaker to host models allows you to adhere to security best practices within your AWS account while not exposing assets publicly.
  • Scalability – The Stable Diffusion model, when deployed as a SageMaker endpoint, brings scalability by allowing you to configure instance sizes and number of instances. SageMaker endpoints also have auto scaling features and are highly available.
  • Flexibility – When creating and deploying endpoints, SageMaker provides the flexibility to choose GPU instance types. Also, instances behind SageMaker endpoints can be changed with minimum effort as business needs change. AWS has also developed hardware and chips using AWS Inferentia2 for high performance at the lowest cost for generative AI inference.
  • Rapid innovation – Generative AI is a rapidly evolving domain with new approaches, and models are being constantly developed and released. Amazon SageMaker JumpStart regularly onboards new models along with foundation models.
  • End-to-end integration – AWS allows you to integrate the creative process with any AWS service and develop an end-to-end process using fine-grained access control through AWS Identity and Access Management (IAM), notification through Amazon Simple Notification Service (Amazon SNS), and postprocessing with the event-driven compute service AWS Lambda.
  • Distribution – When the new creatives are generated, AWS allows distributing the content across global channels in multiple Regions using Amazon CloudFront.

For this post, we use the following GitHub sample, which uses Amazon SageMaker Studio with foundation models (Stable Diffusion), prompts, computer vision techniques, and a SageMaker endpoint to generate new images from existing images. The following diagram illustrates the solution architecture.

The workflow contains the following steps:

  1. We store the existing content (images, brand styles, and so on) securely in S3 buckets.
  2. Within SageMaker Studio notebooks, the original image data is transformed to images using computer vision techniques, which preserves the shape of the product (the car model), removes color and background, and generates monotone intermediate images.
  3. The intermediate image acts as a control image for Stable Diffusion with ControlNet.
  4. We deploy a SageMaker endpoint with the Stable Diffusion text-to-image foundation model from SageMaker Jumpstart and ControlNet on a preferred GPU-based instance size.
  5. Prompts describing new backgrounds and car colors along with the intermediate monotone image are used to invoke the SageMaker endpoint, yielding new images.
  6. New images are stored in S3 buckets as they’re generated.

Deploy ControlNet on SageMaker endpoints

To deploy the model to SageMaker endpoints, we must create a compressed file for each individual technique model artifact along with the Stable Diffusion weights, inference script, and NVIDIA Triton config file.

In the following code, we download the model weights for the different ControlNet techniques and Stable Diffusion 1.5 to the local directory as tar.gz files:

if ids =="runwayml/stable-diffusion-v1-5":
    snapshot_download(ids, local_dir=str(model_tar_dir), local_dir_use_symlinks=False,ignore_patterns=unwanted_files_sd)

elif ids =="lllyasviel/sd-controlnet-canny":
    snapshot_download(ids, local_dir=str(model_tar_dir), local_dir_use_symlinks=False)  

To create the model pipeline, we define an inference.py script that SageMaker real-time endpoints will use to load and host the Stable Diffusion and ControlNet tar.gz files. The following is a snippet from inference.py that shows how the models are loaded and how the Canny technique is called:

controlnet = ControlNetModel.from_pretrained(
        f"{model_dir}/{control_net}",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
        f"{model_dir}/sd-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32)

# Define technique function for Canny 
image = cv2.Canny(image, low_threshold, high_threshold)

We deploy the SageMaker endpoint with the required instance size (GPU type) from the model URI:

huggingface_model = HuggingFaceModel(
        model_data=model_s3_uri,  # path to your trained sagemaker model
        role=role, # iam role with permissions to create an Endpoint  
        py_version="py39", # python version of the DLC  
        image_uri=image_uri,
)

# Deploy model as SageMaker Endpoint
predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.p3.2xlarge",
)

Generate new images

Now that the endpoint is deployed on SageMaker endpoints, we can pass in our prompts and the original image we want to use as our baseline.

To define the prompt, we create a positive prompt, p_p, for what we’re looking for in the new image, and the negative prompt, n_p, for what is to be avoided:

p_p="metal orange colored car, complete car, colour photo, outdoors in a pleasant landscape, realistic, high quality"

n_p="cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, blurry, bad anatomy, bad proportions"

Finally, we invoke our endpoint with the prompt and source image to generate our new image:

request={"prompt":p_p, 
        "negative_prompt":n_p, 
        "image_uri":'s3://<bucker>/sportscar.jpeg', #existing content
        "scale": 0.5,
        "steps":20, 
        "low_threshold":100, 
        "high_threshold":200, 
        "seed": 123, 
        "output":"output"}
response=predictor.predict(request)

Different ControlNet techniques

In this section, we compare the different ControlNet techniques and their effect on the resulting image. We use the following original image to generate new content using Stable Diffusion with Control-net in Amazon SageMaker.

The following table shows how the technique output dictates what, from the original image, to focus on.

Technique Name Technique Type Technique Output Prompt Stable Diffusion with ControlNet
canny A monochrome image with white edges on a black background. metal orange colored car, complete car, colour photo, outdoors in a pleasant landscape, realistic, high quality
depth A grayscale image with black representing deep areas and white representing shallow areas. metal red colored car, complete car, colour photo, outdoors in pleasant landscape on beach, realistic, high quality
hed A monochrome image with white soft edges on a black background. metal white colored car, complete car, colour photo, in a city, at night, realistic, high quality
scribble A hand-drawn monochrome image with white outlines on a black background. metal blue colored car, similar to original car, complete car, colour photo, outdoors, breath-taking view, realistic, high quality, different viewpoint

Clean up

After you generate new ad creatives with generative AI, clean up any resources that won’t be used. Delete the data in Amazon S3 and stop any SageMaker Studio notebook instances to not incur any further charges. If you used SageMaker JumpStart to deploy Stable Diffusion as a SageMaker real-time endpoint, delete the endpoint either through the SageMaker console or SageMaker Studio.

Conclusion

In this post, we used foundation models on SageMaker to create new content images from existing images stored in Amazon S3. With these techniques, marketing, advertisement, and other creative agencies can use generative AI tools to augment their ad creatives process. To dive deeper into the solution and code shown in this demo, check out the GitHub repo.

Also, refer to Amazon Bedrock for use cases on generative AI, foundation models, and text-to-image models.


About the Authors

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Sandeep Verma is a Sr. Prototyping Architect with AWS. He enjoys diving deep into customer challenges and building prototypes for customers to accelerate innovation. He has a background in AI/ML, founder of New Knowledge, and generally passionate about tech. In his free time, he loves traveling and skiing with his family.

Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching about herbs, teas, superfoods, and how to incorporate them into his daily diet.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers using machine learning to solve their business challenges using the AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at edge, therefore, she has created her own lab with self-driving kit and prototype manufacturing production line, where she spend lot of her free time.

Read More

Cuddly 3D Creature Comes to Life in Father-Son Collaboration This Week ‘In the NVIDIA Studio’

Cuddly 3D Creature Comes to Life in Father-Son Collaboration This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows.

Principal NVIDIA artist and 3D expert Michael Johnson creates highly detailed art that’s both technically impressive and emotionally resonant. It’s evident in his latest piece, Father-Son Collaboration, which draws on inspiration from the vivid imagination of his son and is highlighted this week In the NVIDIA Studio.

“I love how art can bring joy and great memories to others — great work makes me feel special to be a human and an artist,” said Johnson. “Art can flip people’s perspectives and make them feel something completely different.”

Young minds inspire generations of artists.

“The story behind this piece is that I simply wanted to inspire my son and teach him how things can be perceived — how people can be inspired by others’ art,” said Johnson, who could tell that his son — a doodler himself — often considered his own artwork not good enough.

“I wanted to show him what I saw in his art and how it inspired me,” Johnson said.

Through this project, Johnson also aimed to demonstrate the NVIDIA Studio-powered workflows of art studios and concept artists across the world.

This creature is living its best life.

NVIDIA RTX GPU technology plays a pivotal role in accelerating Johnson’s creativity. “As an artist, I care about quick feedback and stability,” he said. “My NVIDIA A6000 RTX graphics card speeds up the rendering process so I can quickly iterate.”

For Father-Son Collaboration, Johnson first opened Autodesk Maya to model the creature’s basic 3D shapes. His GPU-accelerated viewport enabled fast, interactive 3D modeling.

 

Next, he imported models into ZBrush for further sculpting, freestyling and details. “After I had my final sculpt down, I took the model into Rizom-Lab IV software to lay out the UVs,” Johnson said. UV mapping is the process of projecting a 3D model’s surface to a 2D image for texture mapping. It makes the model easier to texture and shade later in the creative workflow.

 

Johnson then used Adobe Substance 3D Painter to apply standard and custom textures and shaders on the character.

“Substance 3D Painter is really great because it displays the final look of the textures without bringing it into an external renderer,” said Johnson.

His GPU unlocked RTX-accelerated light and ambient occlusion baking, optimizing assets in mere seconds.

 

With the textures complete, Johnson imported his models back into Autodesk Maya for hair, grooming, lighting and rendering. For the hair and fur, the artist used XGen, Autodesk Maya’s built-in instancing tool. Autodesk Maya also offers third-party support of GPU-accelerated renderers such as Chaos V-Ray, OTOY OctaneRender and Maxon Redshift.

“Redshift is great — and having a great GPU makes renders really quick,” Johnson added. Redshift’s RTX-accelerated final-frame rendering with AI-powered OptiX denoising exported files with plenty of time to spare.

Johnson put the final touches on Father-Son Collaboration in Adobe Photoshop. With access to over 30 GPU-accelerated features, such as blur gallery, object selection, perspective warp and more, he applied the background and added minor touch-ups to complete the piece.

 

The joy, awe and wonderment he’d hoped to invoke in his son came to fruition when Johnson finally shared the piece.

From a son’s concept to a father’s creation.

“Art is one of the rare things in life that really has no end goal — as it’s really about the process, rather than the result,” Johnson said. “Every day, you learn something new, grow and see things in different ways.”

Principal NVIDIA artist and 3D expert Michael Johnson.

Check out Johnson’s portfolio on Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Learn about the latest with OpenUSD and Omniverse at SIGGRAPH, running August 6-10. Take advantage of showfloor experiences like hands-on labs, special events and demo booths — and don’t miss NVIDIA founder and CEO Jensen Huang’s keynote address on Tuesday, Aug. 8, at 8 a.m. PT. 

Read More

NVIDIA Helps Forge Forum to Set OpenUSD Standard for 3D Worlds

NVIDIA Helps Forge Forum to Set OpenUSD Standard for 3D Worlds

NVIDIA joined Pixar, Adobe, Apple and Autodesk today to found the Alliance for OpenUSD, a major leap toward unlocking the next era of 3D graphics, design and simulation.

The group will standardize and extend OpenUSD, the open-source Universal Scene Description framework that’s the foundation of interoperable 3D applications and projects ranging from visual effects to industrial digital twins.

Several leading companies in the 3D ecosystem already signed on as the alliance’s first general members — Cesium, Epic Games, Foundry, Hexagon, IKEA, SideFX and Unity.

Standardizing OpenUSD will accelerate its adoption, creating a foundational technology that will help today’s 2D internet evolve into a 3D web. Many companies are already working with NVIDIA to pioneer this future.

From Skyscrapers to Sports Cars

OpenUSD is the foundation of NVIDIA Omniverse, a development platform for connecting and building 3D tools and applications. Omniverse is helping companies like Heavy.AI, Kroger and Siemens build and test physically accurate simulations of factories, retail locations, skyscrapers, sports cars and more.

For IKEA, OpenUSD represents “a nonproprietary standard format to author and store 3D content to connect our value chain even closer, and develop home furnishing solutions to a lower price,” Martin Enthed, an innovation manager at IKEA, said in a press release the alliance issued today.

“By joining the alliance, we’re demonstrating our dedication to the advantages that OpenUSD provides our clients when linking with cloud-based platforms, including Nexus, Hexagon’s manufacturing platform, HxDR, Hexagon’s digital reality platform, and NVIDIA Omniverse to build innovative solutions in their industries,” said Burkhard Boeckem, CTO of Hexagon.

The Origins of OpenUSD

Pixar started work on USD in 2012 as a 3D foundation for its feature films, offering interoperability across data and workflows. The company made this powerful, multifaceted technology open source four years later, so anyone can use OpenUSD and contribute to its development.

Image from the Pixar film "Coco" that used USD
A breakdown of a scene from Pixar’s “Coco” contrasted with the final image. USD was instrumental in creating the film’s complex world. © Disney/Pixar

OpenUSD supports the requirements of building virtual worlds — like geometry, cameras, lights and materials. It also includes features necessary for scaling to large, complex datasets, and it’s tremendously extensible, enabling the technology to be adapted to workflows beyond visual effects.

OpenUSD enables real-time collaboration.
Diagram of OpenUSD that demonstrates it’s power as a technology for large scale, industrial workflows.

One unique capability of OpenUSD is its layering system, which lets users collaborate in real time without stepping on each other’s toes. For example, one artist can model a scene while others create the lighting for it.

Forging a Shared Standard

As its first priority, the alliance will develop a specification that describes the core functionality of OpenUSD. That’ll provide a recipe tool builders can implement, encouraging adoption of the open standard across the widest possible array of use cases.

The alliance will operate as part of the Joint Development Foundation (JDF), a branch of the Linux Foundation. The JDF provides a path to turn written specifications into industry standards suitable for adoption by globally respected groups like the International Organization for Standardization, or the ISO.

From OpenUSD to Omniverse

NVIDIA has a deep commitment to OpenUSD and working with ecosystem partners to accelerate the framework’s evolution and adoption across industries.

At last year’s SIGGRAPH, NVIDIA detailed a multiyear roadmap of contributions it’s making to enable OpenUSD use in architecture, engineering, manufacturing and more. An update on these plans will be presented by NVIDIA as part of the alliance at this year’s conference on computer graphics.

Help Build the 3D Future

Collaboration is key to the alliance and evolution of OpenUSD.

To get involved or learn more, attend NVIDIA’s keynote, OpenUSD day, hands-on labs and other showfloor activities at SIGGRAPH, running Aug. 6-10.

The Alliance for OpenUSD also will host a keynote panel session at the Academy Software Foundation’s Open Source Days 2023.

For a deeper dive on OpenUSD:

Read More

AMD’s Journey to Openness and Performance

AMD has gained progress in building a robust software stack that supports an open ecosystem of models, libraries, frameworks, and tools. With proven platforms gaining momentum, there is significance of a leadership software stack and an optimized ecosystem for achieving application performance. PyTorch is a key part of AMD’s AI journey, and AMD’s Victor Peng, AMD President and Soumith Chintala, founder of PyTorch discussed the latest progress at the DC & AI Keynote on June 12.

Building a Powerful SW Stack with ROCm

Victor introduced ROCm, AMD’s SW stack for Instinct Data Center GPUs. It offers a comprehensive set of open-source libraries, runtime, compilers, and tools for developing, running, and fine-tuning AI models. The fifth generation ROCm incorporates optimizations for AI and high-performance computing workloads, including tailored kernels for low-latency memory systems, support for new data types, and integration with OpenAI Triton. With tools for porting AI software to AMD Instinct platforms, ROCm ensures quality and robustness, tested extensively and compliant with PyTorch and TensorFlow frameworks.

Collaboration with PyTorch

To shed light on the partnership between AMD and PyTorch, Victor invited Soumith Chintala, the founder of PyTorch, to discuss the advancements and integration between the two. PyTorch, the industry’s most famous AI framework, boasts a vibrant developer community and is known for its continuous innovation and incorporation of cutting-edge research.

To highlight the AMD and PyTorch partnership, Victor hosted a discussion with Soumith Chintala, the founder of PyTorch. PyTorch, renowned for its innovation and community, is the industry’s leading AI framework. The latest version, PyTorch 2.0, integrates with hardware-agnostic software compilers like OpenAI Triton, enabling efficient training and deployment of AI models. With optimized techniques, PyTorch 2.0 enhances productivity and offers remarkable speed improvements. The collaboration between AMD and the PyTorch Foundation ensures seamless utilization of AMD GPUs, expanding AI accelerator accessibility worldwide and paving the way for future optimizations and broader hardware support.

Empowering the Developer Community

The partnership between AMD and PyTorch benefits the developer community by democratizing access to AI accelerators. Support for AMD GPUs in PyTorch allows developers to train and deploy models across various platforms, including CPUs like EPYC and Ryzen, GPUs like Instinct and Radeon, and embedded devices like Versal SoCs. By ensuring immediate compatibility of new models on AMD platforms, the collaboration streamlines the development process and empowers developers to leverage the full potential of AMD’s hardware. This increased accessibility and flexibility enable developers worldwide to push the boundaries of AI innovation.

Hugging Face and AI Model Innovation

Victor praised Hugging Face as the leading force behind open-source AI model innovation, empowering generative AI with transformative transformers. AMD’s optimized software enables a high-performing development stack, supporting groundbreaking AI advancements for customers and developers through scalable real-world deployments.

Conclusion

At the DC & AI Keynote, AMD demonstrated its dedication to openness, performance, and collaboration. The ROCm SW stack, PyTorch integration, and support for Hugging Face exemplify AMD’s commitment to empowering developers and researchers to achieve AI breakthroughs. By offering accessible, high-performing solutions, AMD fuels the future of AI as a leading GPU platform integrated with PyTorch.

To listen to the full keynote visit the AMD Youtube channel

To listen to Soumith Chintala’s section of the keynote

Read More