September 2023 – Page 15

‘Arteana’s Art Squad’ Assembles — Indie Showrunner Rafi Nizam Creates High-End Children’s Show on a Budget

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving into new GeForce RTX 40 Series GPU features, technologies and resources and how they dramatically accelerate content creation.

Rafi Nizam is an award-winning independent animator, director, character designer and more. He’s developed feature films at Sony Pictures, children’s series and comedies at BBC and global transmedia content at NBCUniversal.

He’s also the creator of Arteana’s Art Squad — a computer graphics animated series featuring vibrant characters who use the power of art to solve the world’s problems. They come together in the Junior School art classroom, where each brings unique artistic talents, knowledge and perspective on art history, art therapy and in art-making.

Aimed at children, the series seeks to inspire viewers by portraying the characters’ artistic journeys and the power of creative expression. Their adventures are meant to spark a sense of empathy by exploring the universal themes of self-doubt, social dynamics, success and failure. Underscoring the power of imagination and creative thinking is a common throughline.

Nizam’s creative insight and unique perspective are the subjects of this week’s In the NVIDIA Studio installment.

The artist recently participated in the ASUS ProArt Masters’ Talks sessions program, where he demonstrated how ASUS ProArt solutions, including the NVIDIA Studio-validated ProArt Studiobook Pro 16 OLED laptop with a GeForce RTX 3060 GPU and the Scan 3XS RTX Studio workstation with NVIDIA RTX A6000 graphic cards, helped produce a high-end animated series on an indie budget.

Meet Arteana’s Art Squad

Meet Arteana, leader of the Art Squad, who possesses a keen interest in historical artists and art movements.

Rivette demonstrates diverse art techniques and is always looking for new ways to express her creativity.

Rivette has a keen eye for detail and a passion for experimenting with different mediums.

ThreeDee, seen here playing the drums, is a kind and compassionate character who uses art therapy as a means of promoting well-being and healing and to uncover the underlying worries that plague the squad.

ThreeDee is passionate about the transformative power of creativity and believes in making space for everyone to explore their emotions and express themselves through art.

Then there’s Figgi, whose spontaneous performance art inspires others to redefine boundaries and embrace individuality.

Figgi, the youngest member of the Art Squad.

Rounding out the squad is PuttPupp — a lovable and playful character made of putty erasers — who serves as the class pet.

PuttPupp is incredibly expressive and animated with a boundless energy that putts everyone in a good mood.

Art Squad, Assemble

Nizam — matching the demeanor and spirit of his work — is honest. He’s not an expert at 3D modeling, nor is he a visual effects artist, and he’s not the most adept at production pipelines. However, he does love to draw.

His focus has always been on characters, storytelling and world-building. He built Arteana’s Art Squad independently while working in NVIDIA Omniverse, a platform for building and connecting 3D tools and apps.

“Speaking as a storyteller first and a non-technical indie creator second, I find Omniverse to be the most user-friendly and versatile way to connect the 3D apps in my workflows I’ve come to rely on, and enjoy this way of working from concept to final pixel.” — Rafi Nizam

“As a showrunner, embarking on making a CG animated show without a crew is kind of daunting, but I’m using NVIDIA Omniverse to discover ways to overcome my limitations in this space,” Nizam said.

Nizam began by modeling each squad member and building production assets in Adobe Substance 3D Modeler using VR. He also utilized the VR app Gravity Sketch to create models for the different objects required in each set or scene.

“Designing 3D character models in VR makes pre-production and look dev possible for an artist like me,“ he said.

Nizam imported his character into Autodesk Maya for the rigging process — creating a skeleton for the 3D model so that it can move.

His RTX GPU delivered AI-powered, accelerated denoising with the default Autodesk Arnold renderer, resulting in highly interactive and photorealistic renders.

Nizam then moved to Adobe Substance 3D Painter to create textures and materials, applying them to production assets. NVIDIA RTX-accelerated light and ambient occlusion baking optimized assets in mere seconds.

Immaculate textures built in Adobe Substance Painter.

Next, Nizam deployed Unreal Engine to record motion captures via a Perception Neuron suit, creating scenes and camera sequences in real time. NVIDIA DLSS technology increased the interactivity of the viewport by using AI to upscale frames rendered at lower resolution, while retaining high-fidelity detail.

Motion capture with Noitom Perception Neuron suits in Unreal Engine allowed Nizam to be more spontaneous with his work.

“Motion capture fosters experimentation and spontaneous collaboration with performers capturing an abundance of movement, a luxury often untenable for indie projects,” said Nizam.

NVIDIA Omniverse’s spatial computing capabilities took Nizam’s creative workflow to the next level. The Omniverse USD Composer’s native VR support enables artists to interactively assemble, light and navigate scenes in real time, individually or collaboratively, in fully ray-traced VR.

Here, Nizam adjusted scene lighting and approved the overall layout in VR. He then moved to desktop to polish and refine the 3D sequences, reviewing final shots before exporting the completed project.

Rendering in NVIDIA Omniverse USD Composer.

Final Renders, Final Thoughts

Nizam is a big proponent of Omniverse, OpenUSD and its ability to streamline 3D content creation.

“Less time and effort, more productivity, cost savings and simpler real-time workflows — I use Omniverse daily for these reasons,” he said.

The Omniverse platform has at its foundation OpenUSD, an open and extensible framework for describing, composing, simulating and collaborating within 3D worlds. OpenUSD unlocks Omniverse’s potential by enabling movement between 3D apps — artists can transition all individual assets to their desired format with a single click.

“All apps were in sync and updated on the fly while I assembled it, thanks to Omniverse being the backbone of my CG creative and production process,” Nizam said.

“I rely on Omniverse Nucleus and Cache as the USD infrastructure for my production pipeline, allowing for seamless collaboration and facilitating cross-application workflows,” Nizam said. “Additionally, I utilize various software connectors, which help bridge different apps and streamline the creative process.”

Rafi Nizam is the director and creator of ‘Arteana’s Art Squad.’

Check out Nizam on Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Get started with NVIDIA Omniverse by downloading the free standard license or learn how Omniverse Enterprise can connect your team. Developers can get started with Omniverse resources. Stay up to date on the platform by subscribing to the newsletter and follow NVIDIA Omniverse on Instagram, Medium and Twitter.

For more, join the Omniverse community and check out the Omniverse forums, Discord server, Twitch and YouTube channels.

Join us for OpenAI’s first developer conference on November 6 in San Francisco

Developer registration for in-person attendance will open in the coming weeks and developers everywhere will be able to livestream the keynote.OpenAI Blog

Graphcore Joins the PyTorch Foundation as a General Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Graphcore has joined as a general member.

Graphcore is a UK-based company that specializes in designing and manufacturing AI accelerators, hardware and software specifically tailored for artificial intelligence and machine learning workloads.

“We’re thrilled that PyTorch is the leading framework for development on the Graphcore platform,” said Executive Director of the PyTorch Foundation Ibrahim Haddad. “Graphcore has played an important role in the hardware and open source space, and we look forward to their continued contributions to PyTorch.”

Graphcore has contributed to the PyTorch ecosystem by developing integrations to run on their IPU hardware. These integrations enable researchers and practitioners to use their preferred frameworks while taking advantage of Graphcore’s specialized hardware.

“At Graphcore we’re truly aligned with PyTorch’s objective of reducing the barrier of entry to AI practitioners. By supporting a native PyTorch software environment for IPUs we are giving developers access to new underlying hardware, designed from the ground up for AI, to help unlock new AI techniques to improve efficiency or performance and to drive breakthroughs in AI research and applications, with the same user-friendly PyTorch framework they know and expect. We look forward to contributing to and growing the global AI community as an active member of the PyTorch Foundation and are proud to be the first general member.” Anthony Barbier, Software Frameworks Lead at Graphcore.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Graphcore

Graphcore compute systems are accelerating the AI revolution. Powered by the groundbreaking Intelligence Processing Unit (IPU), Graphcore delivers leading-edge AI performance with unprecedented efficiency. IPUs are used around the world by organisations building their intelligent compute capabilities, including AI-centric startups, large multinational corporations and both public and private research institutions. Graphcore is backed by some of the world’s leading investors and has attracted more than $700m of funding. The company is based in Bristol, UK, with offices across Europe, Asia and North America.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.

Questions, shrugs and what comes next: A quarter century of change

Since Google was founded, we’ve worked to answer hard questions, help people get answers to theirs, and move technology forward for the world.Read More

Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart

Content moderation plays a pivotal role in maintaining online safety and upholding the values and standards of websites and social media platforms. Its significance is underscored by the protection it provides users from exposure to inappropriate content, safeguarding their well-being in digital spaces. For example, in the advertising industry, content moderation serves to shield brands from unfavorable associations, thereby contributing to brand elevation and revenue growth. Advertisers prioritize their brand’s alignment with appropriate content to uphold their reputation and avert negative publicity. Content moderation also assumes critical importance in the finance and healthcare sectors, where it serves multiple functions. It plays an important role in identifying and safeguarding sensitive personal identifiable and health information (PII, PHI). By adhering to internal standards and practices and complying with external regulations, content moderation enhances digital security for users. This way, it prevents the inadvertent sharing of confidential data on public platforms, ensuring the preservation of user privacy and data security.

In this post, we introduce a novel method to perform content moderation on image data with multi-modal pre-training and a large language model (LLM). With multi-modal pre-training, we can directly query the image content based on a set of questions of interest and the model will be able to answer these questions. This enables users to chat with the image to confirm if it contains any inappropriate content that violates the organization’s policies. We use the powerful generating capability of LLMs to generate the final decision including safe/unsafe labels and category type. In addition, by designing a prompt, we can make an LLM generate the defined output format, such as JSON format. The designed prompt template allows the LLM to determine if the image violates the moderation policy, identify the category of violation, explain why, and provide the output in a structured JSON format.

We use BLIP-2 as the multi-modal pre-training method. BLIP-2 is one of the state-of-the-art models in multi-modal pre-training and outperforms most of the existing methods in visual question answering, image captioning, and image text retrieval. For our LLM, we use Llama 2, the next generation open-source LLM, which outperforms existing open-source language models on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. The following diagram illustrates the solution components.

Challenges in content moderation

Traditional content moderation methods, such as human-based moderation, can’t keep up with the growing volume of user-generated content (UGC). As the volume of UGC increases, human moderators can become overwhelmed and struggle to moderate content effectively. This results in a poor user experience, high moderation costs, and brand risk. Human-based moderation is also prone to errors, which can result in inconsistent moderation and biased decisions. To address these challenges, content moderation powered by machine learning (ML) has emerged as a solution. ML algorithms can analyze large volumes of UGC and identify content that violates the organization’s policies. ML models can be trained to recognize patterns and identify problematic content, such as hate speech, spam, and inappropriate material. According to the study Protect your users, brand, and budget with AI-powered content moderation, ML-powered content moderation can help organizations reclaim up to 95% of the time their teams spend moderating content manually. This allows organizations to focus their resources on more strategic tasks, such as community building and content creation. ML-powered content moderation can also reduce moderation costs because it’s more efficient than human-based moderation.

Despite the advantages of ML-powered content moderation, it still has further improvement space. The effectiveness of ML algorithms heavily relies on the quality of the data they are trained on. When models are trained using biased or incomplete data, they can make erroneous moderation decisions, exposing organizations to brand risks and potential legal liabilities. The adoption of ML-based approaches for content moderation brings several challenges that necessitate careful consideration. These challenges include:

Acquiring labeled data – This can be a costly process, especially for complex content moderation tasks that require training labelers. This cost can make it challenging to gather large enough datasets to train a supervised ML model with ease. Additionally, the accuracy of the model heavily relies on the quality of the training data, and biased or incomplete data can result in inaccurate moderation decisions, leading to brand risk and legal liabilities.
Model generalization – This is critical to adopting ML-based approaches. A model trained on one dataset may not generalize well to another dataset, particularly if the datasets have different distributions. Therefore, it is essential to ensure that the model is trained on a diverse and representative dataset to ensure it generalizes well to new data.
Operational efficiency – This is another challenge when using conventional ML-based approaches for content moderation. Constantly adding new labels and retraining the model when new classes are added can be time-consuming and costly. Additionally, it is essential to ensure that the model is regularly updated to keep up with changes in the content being moderated.
Explainability – End users may perceive the platform as biased or unjust if content gets flagged or removed without justification, resulting in a poor user experience. Similarly, the absence of clear explanations can render the content moderation process inefficient, time-consuming, and costly for moderators.
Adversarial nature – The adversarial nature of image-based content moderation presents a unique challenge to conventional ML-based approaches. Bad actors can attempt to evade content moderation mechanisms by altering the content in various ways, such as using synonyms of images or embedding their actual content within a larger body of non-offending content. This requires constant monitoring and updating of the model to detect and respond to such adversarial tactics.

Multi-modal reasoning with BLIP-2

Multi-modality ML models refer to models that can handle and integrate data from multiple sources or modalities, such as images, text, audio, video, and other forms of structured or unstructured data. One of the popular multi-modality models is the visual-language models such as BLIP-2, which combines computer vision and natural language processing (NLP) to understand and generate both visual and textual information. These models enable computers to interpret the meaning of images and text in a way that mimics human understanding. Vision-language models can tackle a variety of tasks, including image captioning, image text retrieval, visual question answering, and more. For example, an image captioning model can generate a natural language description of an image, and an image text retrieval model can search for images based on a text query. Visual question answering models can respond to natural language questions about images, and multi-modal chatbots can use visual and textual inputs to generate responses. In terms of content moderation, you can use this capability to query against a list of questions.

BLIP-2 contains three parts. The first component is a frozen image encoder, ViT-L/14 from CLIP, which takes image data as input. The second component is a frozen LLM, FlanT5, which outputs text. The third component is a trainable module called Q-Former, a lightweight transformer that connects the frozen image encoder with the frozen LLM. Q-Former employs learnable query vectors to extract visual features from the frozen image encoder and feeds the most useful visual feature to the LLM to output the desired text.

The pre-training process involves two stages. In the first stage, vision-language representation learning is performed to teach Q-Former to learn the most relevant visual representation for the text. In the second stage, vision-to-language generative learning is performed by connecting the output of Q-Former to a frozen LLM and training Q-Former to output visual representations that can be interpreted by the LLM.

BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods. The model also demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions. The following illustration is modified from the original research paper.

Solution overview

The following diagram illustrates the solution architecture.

In the following sections, we demonstrate how to deploy BLIP-2 to an Amazon SageMaker endpoint, and use BLIP-2 and an LLM for content moderation.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Create a standalone AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. For the BLIP-2 model, we use an ml.g5.2xlarge SageMaker hosting instance. For the Llama 2 13B model, we use an ml.g5.12xlarge SageMaker hosting instance.

Deploy BLIP-2 to a SageMaker endpoint

You can host an LLM on SageMaker using the Large Model Inference (LMI) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. With the help of the SageMaker LMI container, the BLIP-2 model can be easily implemented with the Hugging Face library and hosted on SageMaker. You can run blip2-sagemaker.ipynb for this step.

To prepare the Docker image and model file, you need to retrieve the Docker image of DJLServing, package the inference script and configuration files as a model.tar.gz file, and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. You can refer to the inference script and configuration file for more details.

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=sess.boto_session.region_name, version="0.22.1"
)
! tar czvf model.tar.gz blip2/
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

When the Docker image and inference related files are ready, you create the model, the configuration for the endpoint, and the endpoint:

from sagemaker.utils import name_from_base
blip_model_version = "blip2-flan-t5-xl"
model_name = name_from_base(blip_model_version)
model = Model(
    image_uri=inference_image_uri,
    model_data=s3_code_artifact,
    role=role,
    name=model_name,
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=model_name
)

When the endpoint status becomes in service, you can invoke the endpoint for image captioning and the instructed zero-shot vision-to-language generation task. For the image captioning task, you only need to pass an image to the endpoint:

import base64
import json
from PIL import Image

smr_client = boto3.client("sagemaker-runtime")

def encode_image(img_file):
    with open(img_file, "rb") as image_file:
        img_str = base64.b64encode(image_file.read())
        base64_string = img_str.decode("latin1")
    return base64_string

def run_inference(endpoint_name, inputs):
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name, Body=json.dumps(inputs)
    )
    print(response["Body"].read())

test_image = "carcrash-ai.jpeg"
base64_string = encode_image(test_image)
inputs = {"image": base64_string}
run_inference(endpoint_name, inputs)

For the instructed zero-shot vision-to-language generation task, in addition to the input image, you need to define the question as a prompt:

base64_string = encode_image(test_image)
inputs = {"prompt": "Question: what happened in this photo? Answer:", "image": base64_string}
run_inference(endpoint_name, inputs)

Use BLIP-2 and LLM for content moderation

In this stage, you can make queries on the given image and retrieve hidden information. With the LLM, you organize the queries and retrieve information to generate the JSON format result. You can roughly split this task into the following two sub-tasks:

Extract information from the image with the BLIP-2 model.
Generate the final result and explanation with the LLM.

Extract information from the image with the BLIP-2 model

To retrieve enough useful hidden information from the given image, you need to define queries. Because each query will invoke the endpoint once, many queries will lead to longer processing time. Therefore, we suggest making queries high quality and cover all policies but also without duplicated. In our sample code, we define the queries as follows:

check_list = [
"Does this photo contain complete naked person?",
"Does this photo contain topless person?",
"Does this photo contain weapon?",
"Does this photo contain contact information?",
"Does this photo contain a smoker?",
"Does this photo contain blood?",
"Are there persons fighting in this photo?",
"Does this photo contain harassment words?"
]

With the preceding queries, invoke the endpoint of BLIP-2 to retrieve the information with the following code:

test_image = "./surf_swimwear.png"
raw_image = Image.open(test_image).convert('RGB')

base64_string = encode_image(test_image)
conversations = """"""
for question in check_list:
    inputs = {"prompt": f"Question: {question}? Answer:", "image": base64_string}
    response = run_inference(endpoint_name, inputs)
    conversations += f"""
Question: {question}
Answer: {response}.
"""

In addition to the information retrieved by queries, you can get information with the image captioning task by invoking the endpoint without the prompt field in the payload:

inputs = {"image": base64_string}
response = smr_client.invoke_endpoint(
EndpointName=endpoint_name, Body=json.dumps(inputs)
)
image_caption = response["Body"].read().decode('utf-8')

You can combine the contents of queries and answers with the image caption and use this retrieved information for the downstream task, described in the next section below.

Generate the final result and explanation with the LLM

Large language models (LLMs) such as Llama 2 can generate high-quality results with the right prompt template. Using Amazon SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models. With just a few clicks in SageMaker Studio, you can now discover and deploy Llama 2.

The final results rely on a LLM with a specific prompt template. Such prompt consists of: the moderation policy based on categories of inappropriate or offensive moderation categories; image information extracted from BLIP-2; the question template to the LLM asking if the photo contains unsafe content and requesting its category and reason if unsafe; and instructions to output the results in JSON format. The designed prompt template allows the LLM to determine if the image violates the moderation policy, identify the category of violation, explain why, and provide the output in a structured JSON format.

The core source code is as follows:

prompt_template = f"""
The following is our company's content moderation policy, based on the moderation policy, we gather image information from the user uploaded image. Please answer the question with json format. 
        
###### moderation policy ######
{moderation_policy}
        
###### Image information ######
{conversations}
        
###### Question ######
Based on the following Moderation policy and QA, tell me if the photo containes unsafe content, also give its category and reason if it's unsafe. Please anwser the question with the following format and only put explanation into the reason field:  
"""

prompt_template += """
{
    "flag": "xxx",
    "category": "xxx",
    "reason": "the reason is ..."
}
"""

dialog = [
    {"role": "user", "content": prompt_template}
]

You can customize the prompt based on your own use case. Refer to the notebook for more details. When the prompt is ready, you can invoke the LLM endpoint to generate results:

endpoint_name = "jumpstart-dft-meta-textgeneration-llama-2-70b-f"

def query_endpoint(payload):
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response
    
payload = {
    "inputs": [dialog], 
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.5}
}
result = query_endpoint(payload)[0]

Part of the generated output is as follows:

> Assistant:  {
    "flag": "unsafe",
    "category": "Suggestive",
    "reason": "The photo contains a topless person, which is considered suggestive content."
}

Explanation:
The photo contains a topless person, which violates the moderation policy's rule number 2, which states that suggestive content includes "Female Swimwear Or Underwear, Male Swimwear Or Underwear, Partial Nudity, Barechested Male, Revealing Clothes and Sexual Situations." Therefore, the photo is considered unsafe and falls under the category of Suggestive.

Occasionally, Llama 2 attaches additional explanation besides the answer from the assistant. You could use the parsing code to extract JSON data from the original generated results:

answer = result['generation']['content'].split('}')[0]+'}'
json.loads(answer)

Advantages of generative approaches

The preceding sections showed how to implement the core part of model inference. In this section, we cover various aspects of generative approaches, including comparisons with conventional approaches and perspectives.

The following table compares each approach.

.	Generative Approach	Classification Approach
Acquiring labeled data	Pre-trained model on a large number of images, zero-shot inference	Requires data from all types of categories
Model generalization	Pre-trained model with various types of images	Requires a large volume of content moderation related data to improve model generalization
Operational efficiency	Zero-shot capabilities	Requires training the model for recognizing different patterns, and retraining when labels are added
Explainability	Reasoning as the text output, great user experience	Hard to achieve reasoning, hard to explain and interpret
Adversarial nature	Robust	High frequency retraining

Potential use cases of multi-modal reasoning beyond content moderation

The BLIP-2 models can be applied to fit multiple purposes with or without fine-tuning, which includes the following:

Image captioning – This asks the model to generate a text description for the image’s visual content. As illustrated in the following example image (left), we can have “a man is standing on the beach with a surfboard” as the image description.
Visual question answering – As the example image in the middle shows, we can ask “Is it commercial related content” and we have “yes” as the answer. In addition, BLIP-2 supports the multi-round conversation and outputs the following question: “Why do you think so?” Based on the visual cue and LLM capabilities, BLIP-2 outputs “it’s a sign for amazon.”
Image text retrieval – Given the question as “Text on the image”, we can extract the image text “it’s monday but keep smiling” as demonstrated in the image on the right.

The following images show examples to demonstrate the zero-shot image-to-text capability of visual knowledge reasoning.

As we can see from various examples above, multi-modality models open up new opportunities for solving complex problems that traditional single-modality models would struggle to address.

Clean up

To avoid incurring future charges, delete the resources created as part of this post. You can do this by following the instructions in the notebook cleanup section, or delete the created endpoints via the SageMaker console and resources stored in the S3 bucket.

Conclusion

In this post, we discussed the importance of content moderation in the digital world and highlighted its challenges. We proposed a new method to help improve content moderation with image data and perform question answering against the images to automatically extract useful information. We also provided further discussion on the advantages of using a generative AI-based approach compared to the traditional classification-based approach. Lastly, we illustrated the potential use cases of visual-language models beyond content moderation.

We encourage you to learn more by exploring SageMaker and building a solution using the multi-modality solution provided in this post and a dataset relevant to your business.

About the Authors

Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices cross many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

How Carrier predicts HVAC faults using AWS Glue and Amazon SageMaker

In their own words, “In 1902, Willis Carrier solved one of mankind’s most elusive challenges of controlling the indoor environment through modern air conditioning. Today, Carrier products create comfortable environments, safeguard the global food supply, and enable safe transport of vital medical supplies under exacting conditions.”

At Carrier, the foundation of our success is making products our customers can trust to keep them comfortable and safe year-round. High reliability and low equipment downtime are increasingly important as extreme temperatures become more common due to climate change. We have historically relied on threshold-based systems that alert us to abnormal equipment behavior, using parameters defined by our engineering team. Although such systems are effective, they are intended to identify and diagnose equipment issues rather than predict them. Predicting faults before they occur allows our HVAC dealers to proactively address issues and improve the customer experience.

In order to improve our equipment reliability, we partnered with the Amazon Machine Learning Solutions Lab to develop a custom machine learning (ML) model capable of predicting equipment issues prior to failure. Our teams developed a framework for processing over 50 TB of historical sensor data and predicting faults with 91% precision. We can now notify dealers of impending equipment failure, so that they can schedule inspections and minimize unit downtime. The solution framework is scalable as more equipment is installed and can be reused for a variety of downstream modeling tasks.

In this post, we show how the Carrier and AWS teams applied ML to predict faults across large fleets of equipment using a single model. We first highlight how we use AWS Glue for highly parallel data processing. We then discuss how Amazon SageMaker helps us with feature engineering and building a scalable supervised deep learning model.

Overview of use case, goals, and risks

The main goal of this project is to reduce downtime by predicting impending equipment failures and notifying dealers. This allows dealers to schedule maintenance proactively and provide exceptional customer service. We faced three primary challenges when working on this solution:

Data scalability – Data processing and feature extraction needs to scale across large growing historical sensor data
Model scalability – The modeling approach needs to be capable of scaling across over 10,000 units
Model precision – Low false positive rates are needed to avoid unnecessary maintenance inspections

Scalability, both from a data and modeling perspective, is a key requirement for this solution. We have over 50 TB of historical equipment data and expect this data to grow quickly as more HVAC units are connected to the cloud. Data processing and model inference need to scale as our data grows. In order for our modeling approach to scale across over 10,000 units, we need a model that can learn from a fleet of equipment rather than relying on anomalous readings for a single unit. This will allow for generalization across units and reduce the cost of inference by hosting a single model.

The other concern for this use case is triggering false alarms. This means that a dealer or technician will go on-site to inspect the customer’s equipment and find everything to be operating appropriately. The solution requires a high precision model to ensure that when a dealer is alerted, the equipment is likely to fail. This helps earn the trust of dealers, technicians, and homeowners alike, and reduces the costs associated with unnecessary on-site inspections.

We partnered with the AI/ML experts at the Amazon ML Solutions Lab for a 14-week development effort. In the end, our solution includes two primary components. The first is a data processing module built with AWS Glue that summarizes equipment behavior and reduces the size of our training data for efficient downstream processing. The second is a model training interface managed through SageMaker, which allows us to train, tune, and evaluate our model before it is deployed to a production endpoint.

Data processing

Each HVAC unit we install generates data from 90 different sensors with readings for RPMs, temperature, and pressures throughout the system. This amounts to roughly 8 million data points generated per unit per day, with tens of thousands of units installed. As more HVAC systems are connected to the cloud, we anticipate the volume of data to grow quickly, making it critical for us to manage its size and complexity for use in downstream tasks. The length of the sensor data history also presents a modeling challenge. A unit may start displaying signs of impending failure months before a fault is actually triggered. This creates a significant lag between the predictive signal and the actual failure. A method for compressing the length of the input data becomes critical for ML modeling.

To address the size and complexity of the sensor data, we compress it into cycle features as shown in Figure 1. This dramatically reduces the size of data while capturing features that characterize the equipment’s behavior.

Figure 1: Sample of HVAC sensor data

AWS Glue is a serverless data integration service for processing large quantities of data at scale. AWS Glue allowed us to easily run parallel data preprocessing and feature extraction. We used AWS Glue to detect cycles and summarize unit behavior using key features identified by our engineering team. This dramatically reduced the size of our dataset from over 8 million data points per day per unit down to roughly 1,200. Crucially, this approach preserves predictive information about unit behavior with a much smaller data footprint.

The output of the AWS Glue job is a summary of unit behavior for each cycle. We then use an Amazon SageMaker Processing job to calculate features across cycles and label our data. We formulate the ML problem as a binary classification task with a goal of predicting equipment faults in the next 60 days. This allows our dealer network to address potential equipment failures in a timely manner. It’s important to note that not all units fail within 60 days. A unit experiencing slow performance degradation could take more time to fail. We address this during the model evaluation step. We focused our modeling on summertime because those months are when most HVAC systems in the US are in consistent operation and under more extreme conditions.

Modeling

Transformer architectures have become the state-of-the-art approach for handling temporal data. They can use long sequences of historical data at each time step without suffering from vanishing gradients. The input to our model at a given point in time is composed of the features for the previous 128 equipment cycles, which is roughly one week of unit operation. This is processed by a three-layer encoder whose output is averaged and fed into a multi-layered perceptron (MLP) classifier. The MLP classifier is composed of three linear layers with ReLU activation functions and a final layer with LogSoftMax activation. We use weighted negative log-likelihood loss with a different weight on the positive class for our loss function. This biases our model towards high precision and avoids costly false alarms. It also incorporates our business objectives directly into the model training process. Figure 2 illustrates the transformer architecture.

Figure 2: Temporal transformer architecture

Training

One challenge when training this temporal learning model is data imbalance. Some units have a longer operational history than others and therefore have more cycles in our dataset. Because they are overrepresented in the dataset, these units will have more influence on our model. We solve this by randomly sampling 100 cycles in a unit’s history where we assess the probability of a failure at that time. This ensures that each unit is equally represented during the training process. While removing the imbalanced data problem, this approach has the added benefit of replicating a batch processing approach that will be used in production. This sampling approach was applied to the training, validation, and test sets.

Training was performed using a GPU-accelerated instance on SageMaker. Monitoring the loss shows that it achieves the best results after 180 training epochs as show in Figure 3. Figure 4 shows that the area under the ROC curve for the resulting temporal classification model is 81%.

Figure 3: Training loss over epochs

Figure 4: ROC-AUC for 60-day lockout

Evaluation

While our model is trained at the cycle level, evaluation needs to take place at the unit level. In this way, one unit with multiple true positive detections is still only counted as a single true positive at the unit level. To do this, we analyze the overlap between the predicted outcomes and the 60-day window preceding a fault. This is illustrated in the following figure, which shows four cases of predicting outcomes:

True negative – All the prediction results are negative (purple) (Figure 5)
False positive – The positive predictions are false alarms (Figure 6)
False negative – Although the predictions are all negative, the actual labels could be positive (green) (Figure 7)
True positive – Some of the predictions could be negative (green), and at least one prediction is positive (yellow) (Figure 8)

Figure 5.1: True negative case	Figure 5.2: False positive case
Figure 5.3: False negative case	Figure 5.4: True positive case

After training, we use the evaluation set to tune the threshold for sending an alert. Setting the model confidence threshold at 0.99 yields a precision of roughly 81%. This falls short of our initial 90% criterion for success. However, we found that a good portion of units failed just outside the 60-day evaluation window. This makes sense, because a unit may actively display faulty behavior but take longer than 60 days to fail. To handle this, we defined a metric called effective precision, which is a combination of the true positive precision (81%) with the added precision of lockouts that occurred in the 30 days beyond our target 60-day window.

For an HVAC dealer, what is most important is that an onsite inspection helps prevent future HVAC issues for the customer. Using this model, we estimate that 81.2% of the time the inspection will prevent a lockout from occurring in the next 60 days. Additionally, 10.4% of the time the lockout would have occurred in within 90 days of inspection. The remaining 8.4% will be a false alarm. The effective precision of the trained model is 91.6%.

Conclusion

In this post, we showed how our team used AWS Glue and SageMaker to create a scalable supervised learning solution for predictive maintenance. Our model is capable of capturing trends across long-term histories of sensor data and accurately detecting hundreds of equipment failures weeks in advance. Predicting faults in advance will reduce curb-to-curb time, allowing our dealers to provide more timely technical assistance and improving the overall customer experience. The impacts of this approach will grow over time as more cloud-connected HVAC units are installed every year.

Our next step is to integrate these insights into the upcoming release of Carrier’s Connected Dealer Portal. The portal combines these predictive alerts with other insights we derive from our AWS-based data lake in order to give our dealers more clarity into equipment health across their entire client base. We will continue to improve our model by integrating data from additional sources and extracting more advanced features from our sensor data. The methods employed in this project provide a strong foundation for our team to start answering other key questions that can help us reduce warranty claims and improve equipment efficiency in the field.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab. To learn more about the services used in this project, refer to the AWS Glue Developer Guide and the Amazon SageMaker Developer Guide.

About the Authors

Ravi Patankar is a technical leader for IoT related analytics at Carrier’s Residential HVAC Unit. He formulates analytics problems related to diagnostics and prognostics and provides direction for ML/deep learning-based analytics solutions and architecture.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has ten years of experience in machine learning, deep learning and time-series analysis and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Yingwei Yu is an Applied Scientist at AWS Generative AI Innovation Center. He has experience working with several organizations across industries on various proof-of-concepts in machine learning, including NLP, time-series analysis, and generative AI technologies. Yingwei received his PhD in computer science from Texas A&M University.

Yanxiang Yu is an Applied Scientist at Amazon Web Services, working on the Generative AI Innovation Center. With over 8 years of experience building AI and machine learning models for industrial applications, he specializes in generative AI, computer vision, and time series modeling. His work focuses on finding innovative ways to apply advanced generative techniques to real-world problems.

Diego Socolinsky is a Senior Applied Science Manager with the AWS Generative AI Innovation Center, where he leads the delivery team for the Eastern US and Latin America regions. He has over twenty years of experience in machine learning and computer vision, and holds a PhD degree in mathematics from The Johns Hopkins University.

Kexin Ding is a fifth-year Ph.D. candidate in computer science at UNC-Charlotte. Her research focuses on applying deep learning methods for analyzing multi-modal data, including medical image and genomics sequencing data.

Optimize deployment cost of Amazon SageMaker JumpStart foundation models with Amazon SageMaker asynchronous endpoints

The success of generative AI applications across a wide range of industries has attracted the attention and interest of companies worldwide who are looking to reproduce and surpass the achievements of competitors or solve new and exciting use cases. These customers are looking into foundation models, such as TII Falcon, Stable Diffusion XL, or OpenAI’s GPT-3.5, as the engines that power the generative AI innovation.

Foundation models are a class of generative AI models that are capable of understanding and generating human-like content, thanks to the vast amounts of unstructured data they have been trained on. These models have revolutionized various computer vision (CV) and natural language processing (NLP) tasks, including image generation, translation, and question answering. They serve as the building blocks for many AI applications and have become a crucial component in the development of advanced intelligent systems.

However, the deployment of foundation models can come with significant challenges, particularly in terms of cost and resource requirements. These models are known for their size, often ranging from hundreds of millions to billions of parameters. Their large size demands extensive computational resources, including powerful hardware and significant memory capacity. In fact, deploying foundation models usually requires at least one (often more) GPUs to handle the computational load efficiently. For example, the TII Falcon-40B Instruct model requires at least an ml.g5.12xlarge instance to be loaded into memory successfully, but performs best with bigger instances. As a result, the return on investment (ROI) of deploying and maintaining these models can be too low to prove business value, especially during development cycles or for spiky workloads. This is due to the running costs of having GPU-powered instances for long sessions, potentially 24/7.

Earlier this year, we announced Amazon Bedrock, a serverless API to access foundation models from Amazon and our generative AI partners. Although it’s currently in Private Preview, its serverless API allows you to use foundation models from Amazon, Anthropic, Stability AI, and AI21, without having to deploy any endpoints yourself. However, open-source models from communities such as Hugging Face have been growing a lot, and not every one of them has been made available through Amazon Bedrock.

In this post, we target these situations and solve the problem of risking high costs by deploying large foundation models to Amazon SageMaker asynchronous endpoints from Amazon SageMaker JumpStart. This can help cut costs of the architecture, allowing the endpoint to run only when requests are in the queue and for a short time-to-live, while scaling down to zero when no requests are waiting to be serviced. This sounds great for a lot of use cases; however, an endpoint that has scaled down to zero will introduce a cold start time before being able to serve inferences.

Solution overview

The following diagram illustrates our solution architecture.

The architecture we deploy is very straightforward:

The user interface is a notebook, which can be replaced by a web UI built on Streamlit or similar technology. In our case, the notebook is an Amazon SageMaker Studio notebook, running on an ml.m5.large instance with the PyTorch 2.0 Python 3.10 CPU kernel.
The notebook queries the endpoint in three ways: the SageMaker Python SDK, the AWS SDK for Python (Boto3), and LangChain.
The endpoint is running asynchronously on SageMaker, and on the endpoint, we deploy the Falcon-40B Instruct model. It’s currently the state of the art in terms of instruct models and available in SageMaker JumpStart. A single API call allows us to deploy the model on the endpoint.

What is SageMaker asynchronous inference

SageMaker asynchronous inference is one of the four deployment options in SageMaker, together with real-time endpoints, batch inference, and serverless inference. To learn more about the different deployment options, refer to Deploy models for Inference.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making this option ideal for requests with large payload sizes up to 1 GB, long processing times, and near-real-time latency requirements. However, the main advantage that it provides when dealing with large foundation models, especially during a proof of concept (POC) or during development, is the capability to configure asynchronous inference to scale in to an instance count of zero when there are no requests to process, thereby saving costs. For more information about SageMaker asynchronous inference, refer to Asynchronous inference. The following diagram illustrates this architecture.

To deploy an asynchronous inference endpoint, you need to create an AsyncInferenceConfig object. If you create AsyncInferenceConfig without specifying its arguments, the default S3OutputPath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-outputs/{UNIQUE-JOB-NAME} and S3FailurePath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-failures/{UNIQUE-JOB-NAME}.

What is SageMaker JumpStart

Our model comes from SageMaker JumpStart, a feature of SageMaker that accelerates the machine learning (ML) journey by offering pre-trained models, solution templates, and example notebooks. It provides access to a wide range of pre-trained models for different problem types, allowing you to start your ML tasks with a solid foundation. SageMaker JumpStart also offers solution templates for common use cases and example notebooks for learning. With SageMaker JumpStart, you can reduce the time and effort required to start your ML projects with one-click solution launches and comprehensive resources for practical ML experience.

The following screenshot shows an example of just some of the models available on the SageMaker JumpStart UI.

Deploy the model

Our first step is to deploy the model to SageMaker. To do that, we can use the UI for SageMaker JumpStart or the SageMaker Python SDK, which provides an API that we can use to deploy the model to the asynchronous endpoint:

%%time
from sagemaker.jumpstart.model import JumpStartModel, AsyncInferenceConfig
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model_id, model_version = "huggingface-llm-falcon-40b-instruct-bf16", "*"
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy(
    initial_instance_count=0,
    instance_type="ml.g5.12xlarge",
    async_inference_config=AsyncInferenceConfig()
)

This call can take approximately10 minutes to complete. During this time, the endpoint is spun up, the container together with the model artifacts are downloaded to the endpoint, the model configuration is loaded from SageMaker JumpStart, then the asynchronous endpoint is exposed via a DNS endpoint. To make sure that our endpoint can scale down to zero, we need to configure auto scaling on the asynchronous endpoint using Application Auto Scaling. You need to first register your endpoint variant with Application Auto Scaling, define a scaling policy, and then apply the scaling policy. In this configuration, we use a custom metric using CustomizedMetricSpecification, called ApproximateBacklogSizePerInstance, as shown in the following code. For a detailed list of Amazon CloudWatch metrics available with your asynchronous inference endpoint, refer to Monitoring with CloudWatch.

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/" + my_model.endpoint_name + "/variant/" + "AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0, # Miminum number of instances we want to scale down to - scale down to 0 to stop incurring in costs
    MaxCapacity=1, # Maximum number of instances we want to scale up to - scale up to 1 max is good enough for dev
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # The target value for the metric. - here the metric is - SageMakerVariantInvocationsPerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": my_model.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,  # The amount of time, in seconds, after a scale in activity completes before another scale in activity can start.
        "ScaleOutCooldown": 300,  # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If the value is true, scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    },
)

You can verify that this policy has been set successfully by navigating to the SageMaker console, choosing Endpoints under Inference in the navigation pane, and looking for the endpoint we just deployed.

Invoke the asynchronous endpoint

To invoke the endpoint, you need to place the request payload in Amazon Simple Storage Service (Amazon S3) and provide a pointer to this payload as a part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon Simple Notification Service (Amazon SNS).

SageMaker Python SDK

After deployment is complete, it will return an AsyncPredictor object. To perform asynchronous inference, you need to upload data to Amazon S3 and use the predict_async() method with the S3 URI as the input. It will return an AsyncInferenceResponse object, and you can check the result using the get_response() method.

Alternatively, if you would like to check for a result periodically and return it upon generation, use the predict() method. We use this second method in the following code:

import time

# Invoking the asynchronous endpoint with the SageMaker Python SDK
def query_endpoint(payload):
    """Query endpoint and print the response"""
    response = predictor.predict_async(
        data=payload,
        input_path="s3://{}/{}".format(bucket, prefix),
    )
    while True:
        try:
            response = response.get_result()
            break
        except:
            print("Inference is not ready ...")
            time.sleep(5)
    print(f"33[1m Input:33[0m {payload['inputs']}")
    print(f"33[1m Output:33[0m {response[0]['generated_text']}")
    
query_endpoint(payload)

Boto3

Let’s now explore the invoke_endpoint_async method from Boto3’s sagemaker-runtime client. It enables developers to asynchronously invoke a SageMaker endpoint, providing a token for progress tracking and retrieval of the response later. Boto3 doesn’t offer a way to wait for the asynchronous inference to be completed like the SageMaker Python SDK’s get_result() operation. Therefore, we take advantage of the fact that Boto3 will store the inference output in Amazon S3 in the response["OutputLocation"]. We can use the following function to wait for the inference file to be written to Amazon S3:

import json
import time
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

# Wait until the prediction is generated
def wait_inference_file(bucket, prefix):
    while True:
        try:
            response = s3_client.get_object(Bucket=bucket, Key=prefix)
            break
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print("Waiting for file to be generated...")
                time.sleep(5)
                next
            else:
                raise
        except Exception as e:
            print(e.__dict__)
            raise
    return response

With this function, we can now query the endpoint:

# Invoking the asynchronous endpoint with the Boto3 SDK
import boto3

sagemaker_client = boto3.client("sagemaker-runtime")

# Query the endpoint function
def query_endpoint_boto3(payload):
    """Query endpoint and print the response"""
    response = sagemaker_client.invoke_endpoint_async(
        EndpointName=my_model.endpoint_name,
        InputLocation="s3://{}/{}".format(bucket, prefix),
        ContentType="application/json",
        Accept="application/json"
    )
    output_url = response["OutputLocation"]
    output_prefix = "/".join(output_url.split("/")[3:])
    # Read the bytes of the file from S3 in output_url with Boto3
    output = wait_inference_file(bucket, output_prefix)
    output = json.loads(output['Body'].read())[0]['generated_text']
    # Emit output
    print(f"33[1m Input:33[0m {payload['inputs']}")
    print(f"33[1m Output:33[0m {output}")

query_endpoint_boto3(payload)

LangChain

LangChain is an open-source framework launched in October 2022 by Harrison Chase. It simplifies the development of applications using large language models (LLMs) by providing integrations with various systems and data sources. LangChain allows for document analysis, summarization, chatbot creation, code analysis, and more. It has gained popularity, with contributions from hundreds of developers and significant funding from venture firms. LangChain enables the connection of LLMs with external sources, making it possible to create dynamic, data-responsive applications. It offers libraries, APIs, and documentation to streamline the development process.

LangChain provides libraries and examples for using SageMaker endpoints with its framework, making it easier to use ML models hosted on SageMaker as the “brain” of the chain. To learn more about how LangChain integrates with SageMaker, refer to the SageMaker Endpoint in the LangChain documentation.

One of the limits of the current implementation of LangChain is that it doesn’t support asynchronous endpoints natively. To use an asynchronous endpoint to LangChain, we have to define a new class, SagemakerAsyncEndpoint, that extends the SagemakerEndpoint class already available in LangChain. Additionally, we provide the following information:

The S3 bucket and prefix where asynchronous inference will store the inputs (and outputs)
A maximum number of seconds to wait before timing out
An updated _call() function to query the endpoint with invoke_endpoint_async() instead of invoke_endpoint()
A way to wake up the asynchronous endpoint if it’s in cold start (scaled down to zero)

To review the newly created SagemakerAsyncEndpoint, you can check out the sagemaker_async_endpoint.py file available on GitHub.

from typing import Dict
from langchain import PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import LLMChain
from sagemaker_async_endpoint import SagemakerAsyncEndpoint

class ContentHandler(LLMContentHandler):
    content_type:str = "application/json"
    accepts:str = "application/json"
    len_prompt:int = 0

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 100, "do_sample": False, "repetition_penalty": 1.1}})
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        response_json = output.read()
        res = json.loads(response_json)
        ans = res[0]['generated_text']
        return ans

chain = LLMChain(
    llm=SagemakerAsyncEndpoint(
        input_bucket=bucket,
        input_prefix=prefix,
        endpoint_name=my_model.endpoint_name,
        region_name=sagemaker.Session().boto_region_name,
        content_handler=ContentHandler(),
    ),
    prompt=PromptTemplate(
        input_variables=["query"],
        template="{query}",
    ),
)

print(chain.run(payload['inputs']))

Clean up

When you’re done testing the generation of inferences from the endpoint, remember to delete the endpoint to avoid incurring in extra charges:

predictor.delete_endpoint()

Conclusion

When deploying large foundation models like TII Falcon, optimizing cost is crucial. These models require powerful hardware and substantial memory capacity, leading to high infrastructure costs. SageMaker asynchronous inference, a deployment option that processes requests asynchronously, reduces expenses by scaling the instance count to zero when there are no pending requests. In this post, we demonstrated how to deploy large SageMaker JumpStart foundation models to SageMaker asynchronous endpoints. We provided code examples using the SageMaker Python SDK, Boto3, and LangChain to illustrate different methods for invoking asynchronous endpoints and retrieving results. These techniques enable developers and researchers to optimize costs while using the capabilities of foundation models for advanced language understanding systems.

To learn more about asynchronous inference and SageMaker JumpStart, check out the following posts:

About the author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Rethinking trust in direct messages in the AI era

This blog post is a part of a series exploring our research in privacy, security, and cryptography. For the previous post, see https://www.microsoft.com/en-us/research/blog/research-trends-in-privacy-security-and-cryptography. While AI has the potential to massively increase productivity, this power can be used equally well for malicious purposes, for example, to automate the creation of sophisticated scam messages. In this post, we explore threats AI can pose for online communication ecosystems and outline a high-level approach to mitigating these threats.

Communication in the age of AI

Concerns regarding the influence of AI on the integrity of online communication are increasingly shared by policymakers, AI researchers, business leaders, and other individuals. These concerns are well-founded, as benign AI chatbots can be easily repurposed to impersonate people, help spread misinformation, and sway both public opinion and personal beliefs. So-called “spear phishing” attacks, which are personalized to the target, have proved devastatingly effective. This is particularly true if victims are not using multifactor authentication, meaning an attacker who steals their login credentials with a phishing mail could access authentic services with those credentials. This opportunity has not been missed by organized cybercrime; AI-powered tools marketed to scammers and fraudsters are already emerging. This is disturbing, because democratic systems, business integrity, and interpersonal relationships all hinge on credible and effective communication—a process that has notably migrated to the digital sphere.

As we enter a world where people increasingly interact with artificial agents, it is critical to acknowledge that these challenges from generative AI are not merely hypothetical. In the context of our product offerings at Microsoft, they materialize as genuine threats that we are actively addressing. We are beginning to witness the impact of AI in generating highly specific types of text (emails, reports, scripts, code) in a personalized, automated, and scalable manner. In the workplace, AI-powered tools are expected to bring about a huge increase in productivity, allowing people to focus on the more creative parts of their work rather than tedious, repetitive details. In addition, AI-powered tools can improve productivity and communication for people with disabilities or among people who do not speak the same language.

In this blog post, we focus on the challenge of establishing trust and accountability in direct communication (between two people), such as email, direct messages on social media platforms, SMS, and even phone calls. In all these scenarios, messaging commonly takes place between individuals who share little or no prior context or connection, yet those messages may carry information of high importance. Some examples include emails discussing job prospects, new connections from mutual friends, and unsolicited but important phone calls. The communication may be initiated on behalf of an organization or an individual, but in either case we encounter the same problem: if the message proves to be misleading, malicious, or otherwise inappropriate, holding anyone accountable for it is impractical, may require difficult and slow legal procedures, and does not extend across different communication platforms.

As the scale of these activities increases, there is also a growing need for a flexible cross-platform accountability mechanism that allows both the message sender and receiver to explicitly declare the nature of their communication. Concretely, the sender should be able to declare accountability for their message and the receiver should be able to hold the sender accountable if the message is inappropriate.

Elements of accountability

The problems outlined above are not exactly new, but recent advances in AI have made them more urgent. Over the past several years, the tech community, alongside media organizations and others, have investigated ways to distinguish whether text or images are created by AI; for example, C2PA is a type of watermarking technology, and one possible solution among others. With AI-powered tools increasingly being used in the workplace, Microsoft believes that it will take a combination of approaches to provide the highest value and most transparency to users.

Focusing on accountability is one such approach. We can start by listing some properties we expect of any workable solution:

People and organizations need to be able to declare accountability for the messages they send.
Receivers need to be able to hold the senders accountable if the message is inappropriate or malicious, to protect future potential victims.
There must exist an incentive for the sender to declare accountability.
The mechanism should only solve the accountability problem and nothing else. It must not have unintended side effects, such as a loss of privacy for honest participants.
Receivers should not be required to register with any service.
The accountability mechanism must be compatible with the plurality of methods people use to communicate today.

One way to build an accountability mechanism is to use a reputation system that verifies real-world identities, connecting our digital interactions to a tangible and ultimately accountable organization or human identity. Online reputation has now become an asset that organizations and individuals have a vested interest in preserving. It creates an incentive for honest and trustworthy behavior, which ultimately contributes to a safer and more reliable digital environment for everyone.

Reputation system for online accountability

Consider what an online communication user experience could be like with an integrated reputation system. In this solution, a message sender could declare their accountability by binding their message to their account in the reputation system in the form of a cryptographic reputation tag. Conversely, the receiver uses the tag to verify the sender’s reputation and can use it to report the sender if the message is inappropriate, reducing the sender’s reputation. It is the sender’s responsibility to judge whether the receiver will perceive the message as inappropriate.

Messages with an attached reputation tag are called reputed messages, whereas those without an associated reputation are called generic messages. Reputed messages would typically make the most sense in one-to-one communication that the sender intends for a particular recipient, or one-to-many communication to a few recipients. For example, a proposal to discuss a business deal, a wedding invitation email, a payment reminder SMS from a company’s billing department, or a work email discussing a joint project might be sent as reputed messages. Generic messages would typically not be intended for a particular receiver. For example, emails sent to a mailing list (many receivers) or non-personalized advertisements (large scale) should be sent as generic.

The different components and workflows of our accountability mechanism are depicted, at a high level, in Figure 1.

system diagram — Figure 1: An accountability mechanism design, showing both the account creation and message sending/reporting workflows.

Taking a concrete example, think of a situation where you receive an email from your bank asking you to verify the security settings for your account. You know that phishing emails often target such scenarios, so your first reaction is to ignore the message. However, in this case your email client has noted the valid reputation tag and automatically moved the email to a reputed messages folder. It shows the sender’s reputation, high, next to the message. Instead of deleting the unsolicited and slightly suspicious email, you decide to check whether the link in the email truly leads you to your bank’s website. You are now convinced this is a legitimate message and proceed with the recommendations to review your security settings.

As another example, suppose you work in your company’s billing department. You find something wrong with a customer’s billing information and decide to send them an email to get more information. Since this is an important matter, you hope to maximize the chance of them seeing your message by attaching the billing department’s reputation tag to it. The customer sees the email go in the reputed messages folder, notices the sender’s high reputation, and responds to it with appropriate urgency.

As a third example, imagine that you receive an unsolicited phone call from someone who claims to be your distant relative and wants to discuss a family reunion they are organizing. They ask you questions about your family, making you slightly uneasy. Right before calling you, they sent you a reputation tag via SMS encoding their reputation and the context of their call. You verify that the tag is valid, but that their reputation is medium. You decide to end the call and report them using the tag they shared, as you felt that their call asking for such sensitive information was inappropriate.

These examples highlight that this single system can be used across many different modes of communication, from emails to social media messages to phone calls, fostering trust and safety across the entire landscape of direct communication methods in use today.

Call to action

In this blog post we have attempted to outline a solution to an already existing problem that is exacerbated by modern AI. Capturing the core of this problem is not easy, and many of the previously proposed solutions have unintended consequences that make them unworkable. For example, we explained why approaches that attempt to limit the use of AI are unlikely to succeed.

The solutions are not easy either. The messaging ecosystem is vastly complex and any solution requiring fundamental changes to that are unlikely to be acceptable. Usability is a key concern as well: if the system is only designed to communicate risk, we may want to avoid inadvertently communicating safety, much like the presence of padlock symbols as a sign of HTTPS have caused confusion and underestimation of risk for web browser users (opens in new tab).

Is there a comprehensive identity framework that would connect real-world identities to digital identities? This connection to a unique real-world identity is crucial, as otherwise anyone could simply create as many distinct reputation accounts as they need for any nefarious purpose.

For organizations, the situation is easier, because countries and states tend to hold public records that establish their existence and “identity.” For individuals, platforms like Reddit, TripAdvisor, and Stack Overflow have built reputation systems for their internal use, but without a foundational layer that confirms unique human identities these cannot be used to solve our problem, just as Facebook’s “real name” policy and X Premium (formerly Twitter Blue) have been insufficient to prevent the creation and use of fake accounts. Still, this is not an impossible problem to solve: LinkedIn is already partnering with CLEAR (opens in new tab) to bind government ID verification to a verification marker in user profiles, and with Microsoft Entra Verified ID (opens in new tab) to verify employment status. Worldcoin (opens in new tab) is building a cryptocurrency with each wallet being linked to a unique real-world person through biometrics, and Apple recently announced Optic ID (opens in new tab) for biometric authentication through their Vision Pro headset.

Whenever we talk about identities—especially real-world identities—we need to talk about privacy. People use different digital identities and communication methods in different communities, and these identities need to be kept separate. Trusting a reputation system with such sensitive information requires careful consideration. Our preliminary research suggests that techniques from modern cryptography can be used to provide strong security and privacy guarantees so that the reputation system learns or reveals nothing unnecessary and cannot be used in unintended ways.

What about the governance of the reputation system? In an extreme case, a single centralized party hosts the system while providing cryptographic transparency guarantees of correct operation. In another extreme, we should explore whether a purely decentralized implementation can be feasible. There are also options between these two extremes; for example, multiple smaller reputation systems hosted by different companies and organizations.

These open questions present an opportunity and a responsibility for the research community. At Microsoft Research, we are diligently working on aspects of this problem in partnership with our research on privacy-preserving verifiable information and identity, secure hardware, transparency systems, and media provenance. We invite the rest of the research community to join in by either following the path we outlined here or suggesting better alternatives. This is the start of a broad exploration that calls for a profound commitment and contribution from all of us.

The post Rethinking trust in direct messages in the AI era appeared first on Microsoft Research.

Making automated visual-inspection systems practical

Benchmarking framework that includes a product-agnostic public dataset, guidelines for model selection, and an evaluation approach helps bridge the gap between research and real-world implementation.Read More

The Halo Effect: AI Deep Dives Into Coral Reef Conservation

With coral reefs in rapid decline across the globe, researchers from the University of Hawaii at Mānoa have pioneered an AI-based surveying tool that monitors reef health from the sky.

Using deep learning models and high-resolution satellite imagery powered by NVIDIA GPUs, the researchers have developed a new method for spotting and tracking coral reef halos — distinctive rings of barren sand encircling reefs.

The study, recently published in the Remote Sensing of Environment journal, could unlock real-time coral reef monitoring and turn the tide on global conservation.

“Coral reef halos are a potential proxy for ecosystem health,” said Amelia Meier, a postdoctoral fellow at the University of Hawaii and co-author of the study. “Visible from space, these halo patterns give scientists and conservationists a unique opportunity to observe vast and distant areas. With AI, we can regularly assess halo presence and size in near real time to determine ecosystem well-being.”

Sea-ing Clearly: Illuminating Reef Health

Previously attributed solely to fish grazing, reef halos can also indicate a healthy predator-prey ecosystem, according to researchers’ recent discoveries. While some herbivorous fish graze algae or seagrass near the protective reef perimeter, hunters dig around the seafloor for burrowed invertebrates, laying bare the surrounding sand.

These dynamics indicate the area hosts a healthy food buffet for sustaining a diverse population of ocean dwellers. When the halo changes shape, it signals an imbalance in the marine food web and could indicate an unhealthy reef environment.

In Hot Water

While making up less than 1% of the ocean, coral reefs offer habitat, food and nursery grounds for over 1 million aquatic species. There’s also huge commercial value — about $375 billion annually in commercial and subsistence fishing, tourism and coastal storm protection, and providing antiviral compounds for drug discovery research.

However, reef health is threatened by overfishing, nutrient contamination and ocean acidification. Intensifying climate change — along with the resulting thermal stress from a warming ocean — also increases coral bleaching and infectious disease.

Over half of the world’s coral reefs are already lost or badly damaged, and scientists predict that by 2050 all reefs will face threats, with many in critical danger.

Charting New Horizons With AI

Spotting changes in reef halos is key to global conservation efforts. However, tracking these changes is labor- and time-intensive, limiting the number of surveys that researchers can perform every year. Access to reefs in remote locations also poses challenges.

The researchers created an AI tool that identifies and measures reef halos from global satellites, giving conservationists an opportunity to proactively address reef degradation.

Using Planet SkySat images, they developed ‌a dual-model framework employing two types of convolutional neural networks (CNNs). Relying on computer vision methods for image segmentation, they trained a Mask R-CNN model that detects the edges of the reef and halo, pixel by pixel. A U-Net model trained to differentiate between the coral reef and halo then classifies and predicts the areas of both.

An overview of the study regions (A), an example of a SkySat satellite image containing halos (B) and a zoomed-in subset of halos (C).

The team used TensorFlow, Keras and PyTorch libraries for training and testing thousands of annotations on the coral reef models.

To handle the task’s large compute requirements, the CNNs operate on an NVIDIA RTX A6000 GPU, boosted by a cuDNN-accelerated PyTorch framework. The researchers received the A6000 GPU as participants in the NVIDIA Academic Hardware Grant Program.

The AI tool quickly identifies and measures around 300 halos across 100 square kilometers in about two minutes. The same task takes a human annotator roughly 10 hours. The model also reaches about 90% accuracy depending on location and can navigate various and complicated halo patterns.

“Our study marks the first instance of training AI on reef halo patterns, as opposed to more common AI datasets of images, such as those of cats and dogs,” Meier said. “Processing thousands of images can take a lot of time, but using the NVIDIA GPU sped up the process significantly.”

One challenge is that image resolution can be a limiting factor in the model’s accuracy. Course-scale imagery with low resolutions makes it difficult to spot ‌reef and halo boundaries and creates less accurate predictions.

Shoring Up Environmental Monitoring

“Our long-term goal is to transform our findings into a robust monitoring tool for assessing changes in halo size and to draw correlations to the population dynamics of predators and herbivores in the area,” Meier said.

With this new approach, the researchers are exploring the relationship between species composition, reef health, and halo presence and size. Currently, they’re looking into the association between sharks and halos. If their hypothesized predator-prey-halo interaction proves true, the team anticipates estimating shark abundance from space.

Meet Arteana’s Art Squad

Art Squad, Assemble

Final Renders, Final Thoughts

About Graphcore

About PyTorch Foundation

About The Linux Foundation

Challenges in content moderation

Multi-modal reasoning with BLIP-2

Solution overview

Prerequisites

Deploy BLIP-2 to a SageMaker endpoint

Use BLIP-2 and LLM for content moderation

Extract information from the image with the BLIP-2 model

Generate the final result and explanation with the LLM

Advantages of generative approaches

Potential use cases of multi-modal reasoning beyond content moderation

Clean up

Conclusion

About the Authors

Overview of use case, goals, and risks

Data processing

Modeling

Training

Evaluation

Conclusion

About the Authors

Solution overview

What is SageMaker asynchronous inference

What is SageMaker JumpStart

Deploy the model

Invoke the asynchronous endpoint

SageMaker Python SDK

Boto3

LangChain

Clean up

Conclusion

About the author

Communication in the age of AI

Elements of accountability

Reputation system for online accountability

AI Frontiers: Models and Systems with Ece Kamar

Call to action

Sea-ing Clearly: Illuminating Reef Health

In Hot Water

Charting New Horizons With AI

Shoring Up Environmental Monitoring

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.