Putting the AI in Retail: Survey Reveals Latest Trends Driving Technological Advancements in the Industry

The retail industry is in the midst of a major technology transformation, fueled by the rise in AI.

With the highest potential for AI and analytics among all industries, the retail and consumer packaged goods (CPG) sectors are poised to harness the power of AI to enhance operational efficiency, elevate customer and employee experiences and drive growth. As such, it’s crucial to stay ahead of the curve by anticipating the trends set to change the retail game.

NVIDIA’s first annual “State of AI in Retail and CPG” survey — conducted among industry professionals — provides insights into the state of AI adoption in retail, its impact on revenue and costs and the emerging trends shaping the future of the industry.

With more than 400 respondents globally, including C-suite leaders and other executives, general managers and individual contributors, the survey consisted of questions covering a range of AI topics, top use cases, biggest challenges, infrastructure investment initiatives and deployment models.

Improving Operational Efficiencies Is a Top Priority

To stay ahead in a highly dynamic market, retailers are actively considering how AI can help them meet evolving customer preferences, address labor shortages and drive sustainability efforts. AI has already proved to be a game-changer for retailers, with 69% reporting an increase in annual revenue attributed to AI adoption. Additionally, 72% of retailers using AI experienced a decrease in operating costs.

AI is enhancing operational efficiency, elevating customer experiences and driving growth. The top five current AI use cases were as follows:

  1. Personalized customer recommendations
  2. Store analytics and insights
  3. Loss prevention and asset protection
  4. Augmented reality experiences
  5. Automated marketing content generation

While retailers are actively implementing AI, there are still areas they plan to explore:

  • Further investing in AI infrastructure to overcome challenges related to inadequate technology and a lack of AI talent
  • Exploring the potential of the metaverse for consumer engagement and operational efficiency
  • Leveraging AI in brick-and-mortar stores to provide convenience and personalized customer experiences
  • Transforming customer experiences using generative AI
  • Ensuring data privacy and protection in generative AI adoption

Generative AI Is Changing Customer Experiences

Generative AI prominently emerged in several of the top AI use cases for retail. These use cases included multimodal shopping advisors for personalized product recommendations; adaptive advertising, promotions and pricing; product tagging and cataloging; identification of similar and complementary products; and deployment of brand avatars for automated customer service.

Retailers recognized the transformative potential of generative AI, with 86% expressing a desire to use it to enhance customer experiences. Respondents acknowledged that incorporating AI into business practices and solutions could revolutionize customer engagement, optimize marketing strategies and streamline operational processes.

Staying Ahead With an Omnichannel Approach

To stay competitive, the survey indicated the importance of an omnichannel approach that integrates numerous online and offline channels to provide consumers with a consistent experience.

The results showed that ecommerce was the most used channel, with 79% of retailers actively participating. Mobile applications also gained traction, with over half of retailers using them to bridge the gap between digital and physical shopping experiences.

Despite the rise in digital shopping, 30% of respondents say physical stores have the biggest revenue growth opportunity (ranked second behind ecommerce) and remain the channel with the most AI use cases for retailers. Given the emphasis on intelligent stores and their central role in the omnichannel experience, use cases such as store analytics and loss prevention will continue to be critical investments.

Investing in AI Infrastructure

While AI adoption is still in its early stages, retailers are committed to increasing their AI infrastructure investments. Over 60% of respondents plan to boost their AI investments in the next 18 months. This commitment reflects the industry’s recognition of the technology’s potential to enhance operational efficiency, reduce costs, elevate customer experiences and drive growth.

Download the “State of AI in Retail and CPG: 2024 Trends” report for in-depth results and insights.

Explore NVIDIA’s AI solutions and enterprise-level AI platforms for retail at www.nvidia.com/retail.

Read More

Deploy a Slack gateway for Amazon Q, your business expert

Amazon Q is a new generative AI-powered application that helps users get work done. Amazon Q can become your tailored business expert and let you discover content, brainstorm ideas, or create summaries using your company’s data safely and securely. You can use Amazon Q to have conversations, solve problems, generate content, gain insights, and take action by connecting to your company’s information repositories, code, data, and enterprise systems. For more information, see Introducing Amazon Q, a new generative AI-powered assistant (preview).

In this post, we show you how to bring Amazon Q, your business expert, to users in Slack.

You’ll be able to converse with Amazon Q using Slack direct messages (DMs) to ask questions and get answers based on company data, get help creating new content such as email drafts, summarize attached files, and perform tasks.

You can also invite Amazon Q to participate in your team channels. In a channel, users can ask it questions in a new message, or tag it in an existing thread at any point, to provide additional data points, resolve a debate, or summarize the conversation and capture the next steps.

Solution overview

Amazon Q is amazingly powerful. Check out the following demo—seeing is believing!

In the demo, our Amazon Q application is populated with a set of AWS whitepapers. You can populate your own Amazon Q business expert application with your own company’s documents and knowledge base articles, so it will be able to answer your questions!

Everything you need is provided as open source in our GitHub repo.

In this post, we walk you through the process to deploy Amazon Q in your AWS account and add it to your Slack workspace. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things it can do:

  • Respond to messages – In DMs, it responds to all messages. In channels, it responds only to @mentions and responds in a conversation thread.
  • Render answers containing markdown – This includes headings, lists, bold, italics, tables, and more.
  • Track sentiment – It provides thumbs up and thumbs down buttons to track user sentiment.
  • Provide source attribution – It provides references and hyperlinks to sources used by Amazon Q.
  • Understand conversation context – It tracks the conversation and responds based on the context.
  • Stay aware of multiple users – When it’s tagged in a thread, it knows who said what, and when, so it can contribute in context and accurately summarize the thread when asked.
  • Process attached files – It can process up to five attached files for document question answering, summaries, and more.
  • Start new conversations – You can reset and start new conversations in DM channels by using /new_conversation.
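
The deployed gateway (provided in the GitHub repo) implements all of these features for you, but if you are curious how the pieces fit together, the following is a minimal, illustrative sketch of the core flow: receive a Slack message event, forward the text to Amazon Q, and post the answer back into the thread. It assumes the Slack Bolt for Python framework and the boto3 qbusiness chat_sync API, and it omits the signature verification, markdown rendering, attachment handling, and conversation caching that the real solution provides.

import os
import boto3
from slack_bolt import App

# Slack credentials come from the same secrets you store in AWS Secrets Manager
app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

q_client = boto3.client("qbusiness")
Q_APP_ID = os.environ["AMAZON_Q_APP_ID"]


@app.event("message")
def handle_dm(event, say):
    """Forward a Slack DM to Amazon Q and reply in the same thread."""
    user_text = event.get("text", "")
    response = q_client.chat_sync(          # assumption: qbusiness ChatSync API
        applicationId=Q_APP_ID,
        userId=event["user"],               # or map the Slack user email to a Q user ID
        userMessage=user_text,
    )
    say(text=response["systemMessage"], thread_ts=event.get("ts"))


if __name__ == "__main__":
    app.start(port=3000)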

Slack example

In the following sections, we show how to deploy the project to your own AWS account and Slack workspace, and start experimenting!

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need to have an existing, working Amazon Q business expert application. If you haven’t set one up yet, see Creating an Amazon Q application.

Lastly, you need a Slack account and access to create and publish apps to your Slack organization. If you don’t have one, see if your company can create a Slack sandbox organization for you to experiment, or go to slack.com to create a free Slack account and workspace.

Deploy the solution resources

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose one of the following Launch Stack buttons for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Region Launch Stack
N. Virginia (us-east-1)
Oregon (us-west-2)
  3. For Stack name, enter a name for your app (for example, AMAZON-Q-SLACK-GATEWAY).
  4. For AmazonQAppId, enter your existing Amazon Q application ID (for example, 80xxxxx9-7xx3-4xx0-bxx4-5baxxxxx2af5). You can copy it from the Amazon Q console.
  5. For AmazonQRegion, choose the Region where you created your Amazon Q application (us-east-1 or us-west-2).
  6. For AmazonQUserId, enter an Amazon Q user ID email address (leave blank to use a Slack user email as the user ID).
  7. For ContextDaysToLive, enter the length of time to keep conversation metadata cached in Amazon DynamoDB (you can leave this as the default).

When your CloudFormation stack status is CREATE_COMPLETE, choose the Outputs tab, and keep it open—you’ll need it in later steps.
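
If you prefer to grab the outputs programmatically rather than keeping the console tab open, a quick way (shown here with boto3, which the stack itself does not require) is to describe the stack and print its outputs:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # use the Region you deployed to
stack = cfn.describe_stacks(StackName="AMAZON-Q-SLACK-GATEWAY")["Stacks"][0]

# Prints every output, including SlackAppManifest and SlackSecretConsoleUrl
for output in stack.get("Outputs", []):
    print(f"{output['OutputKey']}: {output['OutputValue']}")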

Create your app

Now you can create your app in Slack. Complete the following steps:

  1. Create a Slack app in https://api.slack.com/apps from the generated manifest—copy and paste from the stack output: SlackAppManifest.
  2. Choose App Home in the navigation pane and scroll down to the section Show Tabs.
  3. Enable Messages Tab.
  4. Select Allow users to send Slash commands and messages from the messages tab.

This step is required to enable users to send messages to your app.

Slack enable messages

Add your app in your workspace

Now you can add your app in your workspace. This is required to generate the bot user OAuth token value that is needed in the next step.

  1. Go to OAuth & Permissions (in https://api.slack.com) and choose Install to Workspace to generate the OAuth token.
  2. In Slack, go to your workspace.
  3. Choose your workspace name, Settings & administration, and Manage apps.
  4. Choose your newly created app.
  5. In the right pane, choose Open in App Directory.
  6. Choose Open in Slack.

Configure Slack secrets in AWS Secrets Manager

Let’s configure your Slack secrets in order to verify the signature of each request and post on behalf of your Amazon Q bot.

In this example, we are not enabling Slack token rotation. You can enable it for a production app by implementing rotation via AWS Secrets Manager. Create an issue (or, better yet, a pull request) in the GitHub repo if you want this feature added to a future version.

Complete the following steps to configure a secret in Secrets Manager:

  1. On the AWS CloudFormation console, navigate to your stack Outputs tab and choose the link for SlackSecretConsoleUrl to be redirected to the Secrets Manager console.
  2. Choose Retrieve secret value.
  3. Choose Edit.
  4. Replace the values of SlackSigningSecret and SlackBotUserOAuthToken using the values in the Slack application configuration under Basic Information and OAuth & Permissions.

Be careful you don’t accidentally copy Client Secret instead of Signing Secret.
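
If you would rather update the secret from code than from the console, the same edit can be made with the Secrets Manager API. The sketch below uses a placeholder secret name; in practice, use the secret that the SlackSecretConsoleUrl output points to, and keep the two JSON keys exactly as shown so the gateway can find them.

import json
import boto3

secrets = boto3.client("secretsmanager")

secrets.put_secret_value(
    SecretId="your-slack-gateway-secret-name",  # placeholder: use the secret from the stack outputs
    SecretString=json.dumps(
        {
            "SlackSigningSecret": "<value from Slack Basic Information>",
            "SlackBotUserOAuthToken": "<value from Slack OAuth & Permissions>",
        }
    ),
)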

Edit secrets

Start using Amazon Q

Complete the following steps to start using Amazon Q in Slack:

  1. Open your Slack workspace.
  2. Under Apps, Manage, add your new Amazon Q app.
  3. Optionally, add your Amazon Q app to team channels.
  4. In the app DM channel, enter Hello.

Say hello

You have now deployed a powerful new AI assistant into your sandbox Slack environment.

Play with it, try all the features discussed in this post, and copy the things you saw in the demo video. Most importantly, you can ask about topics related to the documents that you have ingested into your own Amazon Q business expert application. But don’t stop there. You can find additional ways to make it useful, and when you do, let us know by posting a comment.

Once you are convinced how useful it is, talk to your Slack admins (and show them this post) and work with them to deploy it in your company’s Slack workspaces. Your fellow employees will thank you!

Clean up

When you’re finished experimenting with this solution, delete your app in Slack (https://api.slack.com/apps) and clean up your AWS resources by opening the AWS CloudFormation console and deleting the AMAZON-Q-SLACK-GATEWAY stack that you deployed. This deletes the resources that you created by deploying the solution.

Conclusions

The sample Amazon Q Slack application discussed in this post is provided as open source—you can use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Explore the code, choose Watch in the GitHub repo to be notified of new releases, and check back for the latest updates. We’d also love to hear your suggestions for improvements and features.

For more information on Amazon Q, refer to What is Amazon Q (For Business Use)?


About the Authors

Gary Benattar is a Senior Software Development Manager in AWS HR. Gary started at Amazon in 2012 as an intern, focusing on building scalable, real-time outlier detection systems. He worked in Seattle and Luxembourg and is now based in Tel Aviv, Israel, where he dedicates his time to building software to revolutionize the future of Human Resources. He co-founded a startup, Zengo, with a focus on making digital wallets secure through multi-party computation. He received his MSc in Software Engineering from Sorbonne University in Paris.


Bob Strahan

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Read More

NVIDIA and Loss Prevention Retail Council Introduce AI Solution to Address Organized Retail Crime

NVIDIA and the Loss Prevention Research Council (LPRC) are collaborating with several AI companies to showcase a real-time solution for combating and preventing organized retail crime (ORC).

The integrated offering provides advance notifications of suspicious behavior inside and outside stores so that authorities can intervene early.

The LPRC includes asset-protection executives from more than 85 major retail chains, with hundreds of thousands of stores worldwide, as well as law enforcement, consumer packaged goods companies and technology solutions partners. It’s focused on collaborating with the retail industry to reduce shrink — the loss of products for reasons other than sales — and increase safety and security at stores and shopping malls.

Flash mobs and smash-and-grab thefts are a growing concern, costing retailers billions of dollars in lost revenue and causing safety concerns among customers and employees. Crime syndicates have committed brazen, large-scale thefts, often selling stolen merchandise on the black market.

A National Retail Federation survey found that shrink accounted for $112 billion in losses in 2022, with an estimated two-thirds due to theft.

Increasingly, this involves violence. According to the survey, 67% of respondents said they were seeing more violence and aggression associated with organized-crime theft than a year ago.

The AI-based solution, which helps retailers get a jump on often-evasive, fast-moving organized crime groups, uses technology from several leading AI firms that have built their high-performance AI applications on the NVIDIA Metropolis application framework and microservices.

The solution includes product recognition and tracking, as well as anomaly detection, from AiFi, vehicle license plate and model recognition from BriefCam, and physical security management from SureView to provide advance and real-time notifications to retailer command centers.

The three are among over 500 software companies and startups that have developed retail, safety and security AI applications on NVIDIA Metropolis software development kits for vision AI — and that have been certified as NVIDIA Metropolis partners.

“The proposed AI-based ORC solution combines LPRC’s deep expertise in loss prevention from over 23 years of collaboration with asset protection executives with NVIDIA’s deep AI expertise,” said Read Hayes, who leads the LPRC and is a University of Florida research scientist and criminologist. “We believe this type of cross-industry collaboration will help retailers fight back against organized retail crime.”

Developing Integrated AI for Securing Stores 

AiFi, based in Silicon Valley, develops computer vision solutions, including autonomous retail capabilities built on the NVIDIA Metropolis application framework. Its solution detects anomalies in shopper behavior, tracks items removed from shelves and notifies retailers if shoppers bypass checkout lanes.

BriefCam, based in Newton, Mass., provides deep learning-based video analytics technology for insightful decision-making. Enabling the forensic search, alerting on and visualization of objects in video, the BriefCam Platform includes integrated license plate recognition and cross-camera object tracking, alongside other capabilities that support effective asset protection and real-time response to theft attempts.

SureView, based in Tampa, Fla., offers a software platform for managing multiple security systems with a single view. The company’s physical security management system receives signals from the AiFi and BriefCam applications, helping teams coordinate a quick and consistent response and providing notifications to store security operations and law enforcement based on the retailer’s business rules.

For more information about AI solutions for mitigating organized retail crime, connect with the NVIDIA team at NRF: Retail’s Big Show, the world’s largest retail expo, taking place Jan. 14-16 at the Javits Convention Center in New York.

Attend the Big Ideas session on Organized Retail Crime on Jan. 14 at 2 p.m. ET, moderated by the LPRC, to discover how Kroger and Jacksons Food are using AI in their stores to tackle crime.

The ORC solution will be showcased at NRF — visit NVIDIA experts in Lenovo’s booth (3665) and Dell’s booth (4957) to learn more about it from NVIDIA’s software partners.

Read More

Accelerate AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe, saving up to 75% on inference costs

Multi-model endpoints (MMEs) are a powerful feature of Amazon SageMaker designed to simplify the deployment and operation of machine learning (ML) models. With MMEs, you can host multiple models on a single serving container and serve all the models behind a single endpoint. The SageMaker platform automatically manages the loading and unloading of models and scales resources based on traffic patterns, reducing the operational burden of managing a large quantity of models. This feature is particularly beneficial for deep learning and generative AI models that require accelerated compute. The cost savings achieved through resource sharing and simplified model management make SageMaker MMEs an excellent choice for hosting models at scale on AWS.

Recently, generative AI applications have captured widespread attention and imagination. Customers want to deploy generative AI models on GPUs but at the same time are conscious of costs. SageMaker MMEs support GPU instances and are a great option for these types of applications. Today, we are excited to announce TorchServe support for SageMaker MMEs. This new model server support gives you all the benefits of MMEs while still using the serving stack that TorchServe customers are most familiar with. In this post, we demonstrate how to host generative AI models, such as Stable Diffusion and Segment Anything Model, on SageMaker MMEs using TorchServe and build a language-guided editing solution that can help artists and content creators develop and iterate their artwork faster.

Solution overview

Language-guided editing is a common cross-industry generative AI use case. It can help artists and content creators work more efficiently to meet content demand by automating repetitive tasks, optimizing campaigns, and providing a hyper-personalized experience for the end customer. Businesses can benefit from increased content output, cost savings, improved personalization, and enhanced customer experience. In this post, we demonstrate how you can build language-assisted editing features using MME TorchServe that allow you to erase any unwanted object from an image and modify or replace any object in an image by supplying a text instruction.

The user experience flow for each use case is as follows:

  • To remove an unwanted object, select the object in the image to highlight it. This action sends the pixel coordinates and the original image to a generative AI model, which generates a segmentation mask for the object. After confirming the correct object selection, you can send the original and mask images to a second model for removal. The detailed illustration of this user flow is demonstrated below.

Step 1: Select an object (“dog”) from the image (a dog on a bench, with the mouse pointer clicking the dog)
Step 2: Confirm the correct object is highlighted (the dog on the bench highlighted)
Step 3: Erase the object from the image (the bench without the dog)
  • To modify or replace an object, select and highlight the desired object, following the same process as described above. Once you confirm the correct object selection, you can modify the object by supplying the original image, the mask, and a text prompt. The model will then change the highlighted object based on the provided instructions. A detailed illustration of this second user flow is as follows.

Step 1: Select an object (“vase”) from the image (a vase with a cactus, with the mouse pointer)
Step 2: Confirm the correct object is highlighted (the vase highlighted)
Step 3: Provide a text prompt (“futuristic vase”) to modify the object (a rounded vase with a cactus)

To power this solution, we use three generative AI models: Segment Anything Model (SAM), Large Mask Inpainting Model (LaMa), and Stable Diffusion Inpaint (SD). Here is how these models are utilized in the user experience workflow:

Flow diagrams: one for removing an unwanted object and one for modifying or replacing an object

  1. Segment Anything Model (SAM) is used to generate a segment mask of the object of interest. Developed by Meta Research, SAM is an open-source model that can segment any object in an image. This model has been trained on a massive dataset known as SA-1B, which comprises over 11 million images and 1.1 billion segmentation masks. For more information on SAM, refer to their website and research paper.
  2. LaMa is used to remove any undesired objects from an image. LaMa is a Generative Adversarial Network (GAN) model that specializes in filling in missing parts of images using irregular masks. The model architecture incorporates image-wide global context and a single-step architecture that uses Fourier convolutions, enabling it to achieve state-of-the-art results at a faster speed. For more details on LaMa, visit their website and research paper.
  3. SD 2 inpaint model from Stability AI is used to modify or replace objects in an image. This model allows us to edit the object in the mask area by providing a text prompt. The inpaint model is based on the text-to-image SD model, which can create high-quality images with a simple text prompt. It provides additional arguments such as original and mask images, allowing for quick modification and restoration of existing content. To learn more about Stable Diffusion models on AWS, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.

All three models are hosted on SageMaker MMEs, which reduces the operational burden of managing multiple endpoints. In addition, using MMEs eliminates concerns about certain models being underutilized because resources are shared. You can observe the benefit of improved instance saturation, which ultimately leads to cost savings. The following architecture diagram illustrates how all three models are served using SageMaker MMEs with TorchServe.

flow diagram

We have published the code to implement this solution architecture in our GitHub repository. To follow along with the rest of the post, use the notebook file. It is recommended to run this example on a SageMaker notebook instance using the conda_python3 (Python 3.10.10) kernel.

Extend the TorchServe container

The first step is to prepare the model hosting container. SageMaker provides a managed PyTorch Deep Learning Container (DLC) that you can retrieve using the following code snippet:

# Use SageMaker PyTorch DLC as base image
baseimage = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    py_version="py310",
    image_scope="inference",
    version="2.0.0",
    instance_type="ml.g5.2xlarge",
)
print(baseimage)

Because the models require resources and additional packages that are not on the base PyTorch DLC, you need to build a Docker image. This image is then uploaded to Amazon Elastic Container Registry (Amazon ECR) so it can be accessed directly from SageMaker. The custom installed libraries are listed in the Docker file:

ARG BASE_IMAGE

FROM $BASE_IMAGE

#Install any additional libraries
RUN pip install segment-anything-py==1.0
RUN pip install opencv-python-headless==4.7.0.68
RUN pip install matplotlib==3.6.3
RUN pip install diffusers
RUN pip install tqdm
RUN pip install easydict
RUN pip install scikit-image
RUN pip install xformers
RUN pip install tensorflow
RUN pip install joblib
RUN pip install matplotlib
RUN pip install albumentations==0.5.2
RUN pip install hydra-core==1.1.0
RUN pip install pytorch-lightning
RUN pip install tabulate
RUN pip install kornia==0.5.0
RUN pip install webdataset
RUN pip install omegaconf==2.1.2
RUN pip install transformers==4.28.1
RUN pip install accelerate
RUN pip install ftfy

Run the shell command file to build the custom image locally and push it to Amazon ECR:

%%capture build_output

reponame = "torchserve-mme-demo"
versiontag = "genai-0.1"

# Build our own docker image
!cd workspace/docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}

Prepare the model artifacts

The main difference for the new MMEs with TorchServe support is how you prepare your model artifacts. The code repo provides a skeleton folder for each model (models folder) to house the required files for TorchServe. We follow the same four-step process to prepare each model .tar file. The following code is an example of the skeleton folder for the SD model:

workspace
|--sd
   |-- custom_handler.py
   |-- model-config.yaml

The first step is to download the pre-trained model checkpoints in the models folder:

import diffusers
import torch
import transformers

pipeline = diffusers.StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
)

sd_dir = "workspace/sd/model"
pipeline.save_pretrained(sd_dir)

The next step is to define a custom_handler.py file. This is required to define the behavior of the model when it receives a request, such as loading the model, preprocessing the input, and postprocessing the output. The handle method is the main entry point for requests, and it accepts a request object and returns a response object. It loads the pre-trained model checkpoints and applies the preprocess and postprocess methods to the input and output data. The following code snippet illustrates a simple structure of the custom_handler.py file. For more detail, refer to the TorchServe handler API.

def initialize(self, ctx: Context):
    # Load the pre-trained model checkpoints when the worker starts
    ...

def preprocess(self, data):
    # Decode and prepare the incoming request payload
    ...

def inference(self, data):
    # Run the model on the preprocessed input
    ...

def handle(self, data, context):
    requests = self.preprocess(data)
    responses = self.inference(requests)

    return responses

The last required file for TorchServe is model-config.yaml. The file defines the configuration of the model server, such as number of workers and batch size. The configuration is at a per-model level, and an example config file is shown in the following code. For a complete list of parameters, refer to the GitHub repo.

minWorkers: 1
maxWorkers: 1
batchSize: 1
maxBatchDelay: 200
responseTimeout: 300

The final step is to package all the model artifacts into a single .tar.gz file using the torch-model-archiver module:

!torch-model-archiver --model-name sd --version 1.0 --handler workspace/sd/custom_handler.py --extra-files workspace/sd/model --config-file workspace/sd/model-config.yaml --archive-format no-archive
!cd sd && tar cvzf sd.tar.gz .

Create the multi-model endpoint

The steps to create a SageMaker MME are the same as before. In this particular example, you spin up an endpoint using the SageMaker SDK. Start by defining an Amazon Simple Storage Service (Amazon S3) location and the hosting container. This S3 location is where SageMaker will dynamically load the models from, based on invocation patterns. The hosting container is the custom container you built and pushed to Amazon ECR in the earlier step. See the following code:

# This is where our MME will read models from on S3.
multi_model_s3uri = output_path

Then you define a MultiDataModel that captures attributes such as the model location, hosting container, and access permissions:

print(multi_model_s3uri)
model = Model(
    model_data=f"{multi_model_s3uri}/sam.tar.gz",
    image_uri=container,
    role=role,
    sagemaker_session=smsess,
    env={"TF_ENABLE_ONEDNN_OPTS": "0"},
)

mme = MultiDataModel(
    name="torchserve-mme-genai-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    model_data_prefix=multi_model_s3uri,
    model=model,
    sagemaker_session=smsess,
)
print(mme)

The deploy() function creates an endpoint configuration and hosts the endpoint:

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In the example we provided, we also show how you can list models and dynamically add new models using the SDK. The add_model() function copies your local model .tar files into the MME S3 location:

# Only sam.tar.gz visible!
list(mme.list_models())

models = ["sd/sd.tar.gz", "lama/lama.tar.gz"]
for model in models:
    mme.add_model(model_data_source=model)

Invoke the models

Now that we have all three models hosted on an MME, we can invoke each model in sequence to build our language-assisted editing features. To invoke each model, provide a target_model parameter in the predictor.predict() function. The model name is just the name of the model .tar file we uploaded. The following is an example code snippet for the SAM model that takes in a pixel coordinate, a point label, and dilate kernel size, and generates a segmentation mask of the object in the pixel location:

img_file = "workspace/test_data/sample1.png"
img_bytes = None

with Image.open(img_file) as f:
    img_bytes = encode_image(f)

gen_args = json.dumps(dict(point_coords=[750, 500], point_labels=1, dilate_kernel_size=15))

payload = json.dumps({"image": img_bytes, "gen_args": gen_args}).encode("utf-8")

response = predictor.predict(data=payload, target_model="/sam.tar.gz")
encoded_masks_string = json.loads(response.decode("utf-8"))["generated_image"]
base64_bytes_masks = base64.b64decode(encoded_masks_string)

with Image.open(io.BytesIO(base64_bytes_masks)) as f:
    generated_image_rgb = f.convert("RGB")
    generated_image_rgb.show()

To remove an unwanted object from an image, take the segmentation mask generated from SAM and feed that into the LaMa model with the original image. The following images show an example.

Sample image: a dog on a bench
Segmentation mask from SAM: a white mask of the dog on a black background
Erase the dog using LaMa: just the bench

To modify or replace any object in an image with a text prompt, take the segmentation mask from SAM and feed it into the SD model with the original image and text prompt, as shown in the following example.

Sample image: a dog on a bench
Segmentation mask from SAM: a white mask of the dog on a black background
Replace using the SD model with the text prompt “a hamster on a bench”: a hamster on a bench
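
For reference, invoking the SD inpainting model follows the same pattern as the SAM call shown earlier: encode the inputs, pick the target model, and decode the returned image. The exact payload keys and response fields are defined by the custom handlers in the GitHub repo; the mask and prompt field names below mirror the SAM example and are assumptions for illustration only.

import base64
import io
import json

from PIL import Image

# Reuse the original image bytes (img_bytes) and the SAM mask from the previous step
mask_bytes = encode_image(generated_image_rgb)  # assumption: the handler accepts a base64-encoded mask

# Assumed gen_args fields; check the SD custom handler in the repo for the exact schema
gen_args = json.dumps(dict(prompt="a hamster on a bench"))

payload = json.dumps(
    {"image": img_bytes, "mask_image": mask_bytes, "gen_args": gen_args}
).encode("utf-8")

response = predictor.predict(data=payload, target_model="/sd.tar.gz")
encoded_image_string = json.loads(response.decode("utf-8"))["generated_image"]

with Image.open(io.BytesIO(base64.b64decode(encoded_image_string))) as f:
    f.convert("RGB").show()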

Cost savings

The benefits of SageMaker MMEs increase based on the scale of model consolidation. The following table shows the GPU memory usage of the three models in this post. They are deployed on one g5.2xlarge instance by using one SageMaker MME.

Model                       GPU Memory (MiB)
Segment Anything Model      3,362
Stable Diffusion Inpaint    3,910
LaMa                        852

You can see cost savings when hosting the three models with one endpoint, and for use cases with hundreds or thousands of models, the savings are much greater.

For example, consider 100 Stable Diffusion models. Each model (requiring roughly 4 GiB of GPU memory) could be served by its own ml.g5.2xlarge endpoint, costing $1.52 per instance hour in the US East (N. Virginia) Region. Providing all 100 models through their own endpoints would cost $218,880 per month. With a SageMaker MME, a single endpoint using ml.g5.2xlarge instances can host four models simultaneously. This reduces production inference costs by 75% to only $54,720 per month. The following table summarizes the differences between single-model and multi-model endpoints for this example. Given an endpoint configuration with sufficient memory for your target models, the steady-state invocation latency after all models have been loaded will be similar to that of a single-model endpoint.
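
The arithmetic behind these figures is straightforward (assuming a 30-day month and the two instances per endpoint shown in the table that follows):

price_per_instance_hour = 1.52          # ml.g5.2xlarge, US East (N. Virginia)
instances_per_endpoint = 2
hours_per_month = 24 * 30

endpoint_cost = price_per_instance_hour * instances_per_endpoint * hours_per_month  # $2,188.80

single_model_total = endpoint_cost * 100   # 100 endpoints, one model each -> $218,880
mme_total = endpoint_cost * 25             # 25 endpoints, four models each -> $54,720

print(single_model_total, mme_total, 1 - mme_total / single_model_total)  # 218880.0 54720.0 0.75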

                                    Single-model endpoint    Multi-model endpoint
Total endpoint price per month      $218,880                 $54,720
Endpoint instance type              ml.g5.2xlarge            ml.g5.2xlarge
CPU memory capacity (GiB)           32                       32
GPU memory capacity (GiB)           24                       24
Endpoint price per hour             $1.52                    $1.52
Number of instances per endpoint    2                        2
Endpoints needed for 100 models     100                      25

Clean up

After you are done, please follow the instructions in the cleanup section of the notebook to delete the resources provisioned in this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.

Conclusion

This post demonstrates the language-assisted editing capabilities made possible through the use of generative AI models hosted on SageMaker MMEs with TorchServe. The example we shared illustrates how we can use resource sharing and simplified model management with SageMaker MMEs while still utilizing TorchServe as our model serving stack. We utilized three deep learning foundation models: SAM, SD 2 Inpainting, and LaMa. These models enable us to build powerful capabilities, such as erasing any unwanted object from an image and modifying or replacing any object in an image by supplying a text instruction. These features can help artists and content creators work more efficiently and meet their content demands by automating repetitive tasks, optimizing campaigns, and providing a hyper-personalized experience. We invite you to explore the example provided in this post and build your own UI experience using TorchServe on a SageMaker MME.

To get started, see Supported algorithms, frameworks, and instances for multi-model endpoints using GPU backed instances.


About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.
Li Ning

Li Ning is a senior software engineer at AWS with a specialization in building large-scale AI solutions. As a tech lead for TorchServe, a project jointly developed by AWS and Meta, her passion lies in leveraging PyTorch and AWS SageMaker to help customers embrace AI for the greater good. Outside of her professional endeavors, Li enjoys swimming, traveling, following the latest advancements in technology, and spending quality time with her family.

Ankith Gunapal is an AI Partner Engineer at Meta (PyTorch). He is passionate about model optimization and model serving, with experience ranging from RTL verification, embedded software, and computer vision to PyTorch. He holds a Master’s in Data Science and a Master’s in Telecommunications. Outside of work, Ankith is also an electronic dance music producer.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Subhash Talluri is a Lead AI/ML solutions architect of the Telecom Industry business unit at Amazon Web Services. He’s been leading development of innovative AI/ML solutions for Telecom customers and partners worldwide. He brings interdisciplinary expertise in engineering and computer science to help build scalable, secure, and compliant AI/ML solutions via cloud-optimized architectures on AWS.

Read More

Responsible AI at Google Research: User Experience Team

Google’s Responsible AI User Experience (Responsible AI UX) team is a product-minded team embedded within Google Research. This unique positioning requires us to apply responsible AI development practices to our user-centered user experience (UX) design process. In this post, we describe the importance of UX design and responsible AI in product development, and share a few examples of how our team’s capabilities and cross-functional collaborations have led to responsible development across Google.

First, the UX part. We are a multi-disciplinary team of product design experts: designers, engineers, researchers, and strategists who manage the user-centered UX design process from early-phase ideation and problem framing to later-phase user-interface (UI) design, prototyping and refinement. We believe that effective product development occurs when there is clear alignment between significant unmet user needs and a product’s primary value proposition, and that this alignment is reliably achieved via a thorough user-centered UX design process.

And second, recognizing generative AI’s (GenAI) potential to significantly impact society, we embrace our role as the primary user advocate as we continue to evolve our UX design process to meet the unique challenges AI poses, maximizing the benefits and minimizing the risks. As we navigate through each stage of an AI-powered product design process, we place a heightened emphasis on the ethical, societal, and long-term impact of our decisions. We contribute to the ongoing development of comprehensive safety and inclusivity protocols that define design and deployment guardrails around key issues like content curation, security, privacy, model capabilities, model access, equitability, and fairness that help mitigate GenAI risks.

Responsible AI UX is constantly evolving its user-centered product design process to meet the needs of a GenAI-powered product landscape with greater sensitivity to the needs of users and society and an emphasis on ethical, societal, and long-term impact.

Responsibility in product design is also reflected in the user and societal problems we choose to address and the programs we resource. Thus, we encourage the prioritization of user problems with significant scale and severity to help maximize the positive impact of GenAI technology.

Communication across teams and disciplines is essential to responsible product design. The seamless flow of information and insight from user research teams to product design and engineering teams, and vice versa, is essential to good product development. One of our team’s core objectives is to ensure the practical application of deep user-insight into AI-powered product design decisions at Google by bridging the communication gap between the vast technological expertise of our engineers and the user/societal expertise of our academics, research scientists, and user-centered design research experts. We’ve built a multidisciplinary team with expertise in these areas, deepening our empathy for the communication needs of our audience, and enabling us to better interface between our user & society experts and our technical experts. We create frameworks, guidebooks, prototypes, cheatsheets, and multimedia tools to help bring insights to life for the right people at the right time.

Facilitating responsible GenAI prototyping and development

During collaborations between Responsible AI UX, the People + AI Research (PAIR) initiative and Labs, we identified that prototyping can afford a creative opportunity to engage with large language models (LLM), and is often the first step in GenAI product development. To address the need to introduce LLMs into the prototyping process, we explored a range of different prompting designs. Then, we went out into the field, employing various external, first-person UX design research methodologies to draw out insight and gain empathy for the user’s perspective. Through user/designer co-creation sessions, iteration, and prototyping, we were able to bring internal stakeholders, product managers, engineers, writers, sales, and marketing teams along to ensure that the user point of view was well understood and to reinforce alignment across teams.

The result of this work was MakerSuite, a generative AI platform launched at Google I/O 2023 that enables people, even those without any ML experience, to prototype creatively using LLMs. The team’s first-hand experience with users and understanding of the challenges they face allowed us to incorporate our AI Principles into the MakerSuite product design. Product features like safety filters, for example, enable users to manage outcomes, leading to easier and more responsible product development with MakerSuite.

Because of our close collaboration with product teams, we were able to adapt text-only prototyping to support multimodal interaction with Google AI Studio, an evolution of MakerSuite. Now, Google AI Studio enables developers and non-developers alike to seamlessly leverage Google’s latest Gemini model to merge multiple modality inputs, like text and image, in product explorations. Facilitating product development in this way provides us with the opportunity to better use AI to identify appropriateness of outcomes and unlocks opportunities for developers and non-developers to play with AI sandboxes. Together with our partners, we continue to actively push this effort in the products we support.

Google AI Studio enables developers and non-developers to leverage Google Cloud infrastructure and merge multiple modality inputs in their product explorations.

Equitable speech recognition

Multiple external studies, as well as Google’s own research, have identified an unfortunate deficiency in the ability of current speech recognition technology to understand Black speakers on average, relative to White speakers. As multimodal AI tools begin to rely more heavily on speech prompts, this problem will grow and continue to alienate users. To address this problem, the Responsible AI UX team is partnering with world-renowned linguists and scientists at Howard University, a prominent HBCU, to build a high quality African-American English dataset to improve the design of our speech technology products to make them more accessible. Called Project Elevate Black Voices, this effort will allow Howard University to share the dataset with those looking to improve speech technology while establishing a framework for responsible data collection, ensuring the data benefits Black communities. Howard University will retain the ownership and licensing of the dataset and serve as stewards for its responsible use. At Google, we’re providing funding support and collaborating closely with our partners at Howard University to ensure the success of this program.

Equitable computer vision

The Gender Shades project highlighted that computer vision systems struggle to detect people with darker skin tones, and performed particularly poorly for women with darker skin tones. This is largely due to the fact that the datasets used to train these models were not inclusive to a wide range of skin tones. To address this limitation, the Responsible AI UX team has been partnering with sociologist Dr. Ellis Monk to release the Monk Skin Tone Scale (MST), a skin tone scale designed to be more inclusive of the spectrum of skin tones around the world. It provides a tool to assess the inclusivity of datasets and model performance across an inclusive range of skin tones, resulting in features and products that work better for everyone.

We have integrated MST into a range of Google products, such as Search, Google Photos, and others. We also open sourced MST, published our research, described our annotation practices, and shared an example dataset to encourage others to easily integrate it into their products. The Responsible AI UX team continues to collaborate with Dr. Monk, utilizing the MST across multiple product applications and continuing to do international research to ensure that it is globally inclusive.

Consulting & guidance

As teams across Google continue to develop products that leverage the capabilities of GenAI models, our team recognizes that the challenges they face are varied and that market competition is significant. To support teams, we develop actionable assets to facilitate a more streamlined and responsible product design process that considers available resources. We act as a product-focused design consultancy, identifying ways to scale services, share expertise, and apply our design principles more broadly. Our goal is to help all product teams at Google connect significant unmet user needs with technology benefits via great responsible product design.

One way we have been doing this is with the creation of the People + AI Guidebook, an evolving summative resource of many of the responsible design lessons we’ve learned and recommendations we’ve made for internal and external stakeholders. With its forthcoming, rolling updates focusing specifically on how to best design and consider user needs with GenAI, we hope that our internal teams, external stakeholders, and larger community will have useful and actionable guidance at the most critical milestones in the product development journey.

The People + AI Guidebook has six chapters, designed to cover different aspects of the product life cycle.

If you are interested in reading more about Responsible AI UX and how we are specifically thinking about designing responsibly with Generative AI, please check out this Q&A piece.

Acknowledgements

Shout out to the Responsible AI UX team members: Aaron Donsbach, Alejandra Molina, Courtney Heldreth, Diana Akrong, Ellis Monk, Femi Olanubi, Hope Neveux, Kafayat Abdul, Key Lee, Mahima Pushkarna, Sally Limb, Sarah Post, Sures Kumar Thoddu Srinivasan, Tesh Goyal, Ursula Lauriston, and Zion Mengesha. Special thanks to Michelle Cohn for her contributions to this work.

Read More

Create a document lake using large-scale text extraction from documents with Amazon Textract

AWS customers in healthcare, financial services, the public sector, and other industries store billions of documents as images or PDFs in Amazon Simple Storage Service (Amazon S3). However, they’re unable to gain insights such as using the information locked in the documents for large language models (LLMs) or search until they extract the text, forms, tables, and other structured data. With AWS intelligent document processing (IDP) using AI services such as Amazon Textract, you can take advantage of industry-leading machine learning (ML) technology to quickly and accurately process data from PDFs or document images (TIFF, JPEG, PNG). After the text is extracted from the documents, you can use it to fine-tune a foundation model, summarize the data using a foundation model, or send it to a database.

In this post, we focus on processing a large collection of documents into raw text files and storing them in Amazon S3. We provide you with two different solutions for this use case. The first allows you to run a Python script from any server or instance including a Jupyter notebook; this is the quickest way to get started. The second approach is a turnkey deployment of various infrastructure components using AWS Cloud Development Kit (AWS CDK) constructs. The AWS CDK construct provides a resilient and flexible framework to process your documents and build an end-to-end IDP pipeline. Through the use of the AWS CDK, you can extend its functionality to include redaction, store the output in Amazon OpenSearch, or add a custom AWS Lambda function with your own business logic.

Both of these solutions allow you to quickly process many millions of pages. Before running either of these solutions at scale, we recommend testing with a subset of your documents to make sure the results meet your expectations. In the following sections, we first describe the script solution, followed by the AWS CDK construct solution.

Solution 1: Use a Python script

This solution processes documents for raw text through Amazon Textract as quickly as the service will allow with the expectation that if there is a failure in the script, the process will pick up from where it left off. The solution utilizes three different services: Amazon S3, Amazon DynamoDB, and Amazon Textract.

The following diagram illustrates the sequence of events within the script. When the script ends, a completion status along with the time taken will be returned to the SageMaker Studio console.

diagram

We have packaged this solution in a .ipynb script and .py script. You can use any of the deployable solutions as per your requirements.

Prerequisites

To run this script from a Jupyter notebook, the AWS Identity and Access Management (IAM) role assigned to the notebook must have permissions that allow it to interact with DynamoDB, Amazon S3, and Amazon Textract. The general guidance is to provide least-privilege permissions for each of these services to your AmazonSageMaker-ExecutionRole role. To learn more, refer to Get started with AWS managed policies and move toward least-privilege permissions.

Alternatively, you can run this script from other environments such as an Amazon Elastic Compute Cloud (Amazon EC2) instance or a container that you manage, provided that Python, Pip3, and the AWS SDK for Python (Boto3) are installed. Again, the same IAM policies need to be applied to allow the script to interact with the various managed services.

Walkthrough

To implement this solution, you first need to clone the repository GitHub.

You need to set the following variables in the script before you can run it:

  • tracking_table – This is the name of the DynamoDB table that will be created.
  • input_bucket – This is your source location in Amazon S3 that contains the documents that you want to send to Amazon Textract for text detection. For this variable, provide the name of the bucket, such as mybucket.
  • output_bucket – This is the location in Amazon S3 where you want Amazon Textract to write the results. For this variable, provide the name of the bucket, such as myoutputbucket.
  • _input_prefix (optional) – If you want to select certain files from within a folder in your S3 bucket, you can specify this folder name as the input prefix. Otherwise, leave the default as empty to select all.

The script is as follows:

_tracking_table = "Table_Name_for_storing_s3ObjectNames"
_input_bucket = "your_files_are_here"
_output_bucket = "Amazon Textract_writes_JSON_containing_raw_text_to_here"

The following DynamoDB table schema gets created when the script is run:

Table              Table_Name_for_storing_s3ObjectNames
Partition Key       objectName (String)
                    bucketName (String)
                    createdDate (Decimal)
                    outputbucketName (String)
                    txJobId (String)

When the script is run for the first time, it will check to see if the DynamoDB table exists and will automatically create it if needed. After the table is created, we need to populate it with a list of document object references from Amazon S3 that we want to process. The script by design will enumerate over objects in the specified input_bucket and automatically populate our table with their names when run. It takes approximately 10 minutes to enumerate over 100,000 documents and populate those names into the DynamoDB table from the script. If you have millions of objects in a bucket, you could instead use the Amazon S3 inventory feature, which generates a CSV file of object names, populate the DynamoDB table from that list with your own script in advance, and skip the automatic population by commenting out the function called fetchAllObjectsInBucketandStoreName. To learn more, refer to Configuring Amazon S3 Inventory.
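
If you go the inventory route, pre-populating the table yourself only requires the two fields described later in this section (objectName and bucketName). The following is a minimal sketch, assuming a local CSV file with one object key per line and the table and bucket names used in this example:

import csv
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Table_Name_for_storing_s3ObjectNames")

input_bucket = "your_files_are_here"

# Assumes a CSV derived from the S3 inventory report with the object key in the first column
with open("inventory.csv", newline="") as f:
    with table.batch_writer() as batch:
        for row in csv.reader(f):
            batch.put_item(Item={"objectName": row[0], "bucketName": input_bucket})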

As mentioned earlier, there is both a notebook version and a Python script version. The notebook is the most straightforward way to get started; simply run each cell from start to finish.

If you decide to run the Python script from a CLI, it is recommended that you use a terminal multiplexer such as tmux. This is to prevent the script from stopping should your SSH session finish. For example: tmux new -d 'python3 textractFeeder.py'.

The following is the script’s entry point; from here you can comment out methods not needed:

"""Main entry point into script --- Start Here"""
if __name__ == "__main__":    
    now = time.perf_counter()
    print("started")

The following fields are set when the script is populating the DynamoDB table:

  • objectName – The name of the document located in Amazon S3 that will be sent to Amazon Textract
  • bucketName – The bucket where the document object is stored

These two fields must be populated if you decide to use a CSV file from the S3 inventory report and skip the auto populating that happens within the script.

Now that the table is created and populated with the document object references, the script is ready to start calling the Amazon Textract StartDocumentTextDetection API. Amazon Textract, similar to other managed services, has a default limit on the rate of API calls, known as transactions per second (TPS). If required, you can request a quota increase from the Amazon Textract console. The code is designed to use multiple threads concurrently when calling Amazon Textract to maximize the throughput with the service. You can change this within the code by modifying the threadCountforTextractAPICall variable. By default, this is set to 20 threads. The script will initially read 200 rows from the DynamoDB table and store these in an in-memory list that is wrapped with a class for thread safety. Each caller thread is then started and runs within its own swim lane. Basically, the Amazon Textract caller thread will retrieve an item from the in-memory list that contains our object reference. It will then call the asynchronous start_document_text_detection API and wait for the acknowledgement with the job ID. The job ID is then updated back to the DynamoDB row for that object, and the thread will repeat by retrieving the next item from the list.

The following is the main orchestration code script:

while len(results) > 0:
    for record in results:  # put these records into our thread-safe list
        fileList.append(record)
    """create our threads for processing Amazon Textract"""
    for i in range(threadCountforTextractAPICall):
        threadsforTextractAPI = threading.Thread(
            name="Thread - " + str(i),
            target=procestTextractFunction,
            args=(fileList,))
        threadsforTextractAPI.start()

The caller threads continue until there are no more items in the list, at which point they each stop. When all the threads operating in their swim lanes have stopped, the next 200 rows are retrieved from DynamoDB and a new set of 20 threads is started; the whole process repeats until every row without a job ID has been retrieved from DynamoDB and updated. Should the script crash due to an unexpected problem, it can be run again from the orchestrate() method, which makes sure the threads continue processing rows with empty job IDs. Note that when rerunning the orchestrate() method after the script has stopped, a few documents may be sent to Amazon Textract again; this number will be equal to or less than the number of threads that were running at the time of the crash.
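
When rerunning from orchestrate(), the rows that still need work are simply those without a job ID. One way to fetch such a batch is a filtered scan, shown here as a rough sketch; the actual script may page through the table differently:

from boto3.dynamodb.conditions import Attr

def next_batch(table, batch_size=200):
    """Return rows that don't have a Textract job ID yet."""
    response = table.scan(
        FilterExpression=Attr("txJobId").not_exists() | Attr("txJobId").eq(""),
        Limit=batch_size,  # DynamoDB applies Limit before filtering, so fewer items may come back
    )
    return response["Items"]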

When there are no more rows with a blank job ID in the DynamoDB table, the script stops. By default, the JSON output from Amazon Textract for all the objects is written to the output_bucket under the textract_output folder. Each subfolder within textract_output is named with the job ID that was stored in the DynamoDB table for that object. Within the job ID folder you will find the JSON output, numerically named starting at 1 and potentially spanning additional JSON files labeled 2, 3, and so on. Spanning JSON files is the result of dense or multi-page documents, where the amount of extracted content exceeds the Amazon Textract default JSON size of 1,000 blocks. Refer to Block for more information on blocks. These JSON files contain all the Amazon Textract metadata, including the text extracted from the documents.
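
Downstream, you can reassemble the extracted text for a single document by reading the numbered JSON parts under its job ID folder. The following is a rough sketch that assumes the default output_bucket/textract_output layout described above:

import json
import boto3

s3 = boto3.client("s3")

def get_text_for_job(output_bucket, job_id):
    """Concatenate the LINE blocks from every numbered JSON part of one Textract job."""
    prefix = f"textract_output/{job_id}/"
    parts = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=output_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            if name.isdigit():                 # parts are named 1, 2, 3, ...
                parts.append((int(name), obj["Key"]))
    lines = []
    for _, key in sorted(parts):               # numeric order, so 2 comes before 10
        body = json.loads(s3.get_object(Bucket=output_bucket, Key=key)["Body"].read())
        lines += [b["Text"] for b in body.get("Blocks", []) if b["BlockType"] == "LINE"]
    return "\n".join(lines)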

You can find the Python code notebook version and script for this solution in GitHub.

Clean up

When the Python script is complete, you can save costs by shutting down or stopping the Amazon SageMaker Studio notebook or container that you spun up.

Now on to our second solution for documents at scale.

Solution 2: Use a serverless AWS CDK construct

This solution uses AWS Step Functions and Lambda functions to orchestrate the IDP pipeline. We use the IDP AWS CDK constructs, which make it straightforward to work with Amazon Textract at scale. Additionally, we use a Step Functions distributed map to iterate over all the files in the S3 bucket and initiate processing. The first Lambda function determines how many pages your document has. This enables the pipeline to automatically use either the synchronous (for single-page documents) or asynchronous (for multi-page documents) API. When the asynchronous API is used, an additional Lambda function combines all the JSON files that Amazon Textract produces for your pages into one JSON file, making it straightforward for your downstream applications to work with the information.
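
To illustrate just the routing decision (this is not the construct's actual code), a page-counting Lambda could look roughly like the following sketch, assuming pypdf is packaged with the function and that the event carries hypothetical bucket and key fields:

import boto3
from pypdf import PdfReader

s3 = boto3.client("s3")

def handler(event, context):
    """Choose the synchronous or asynchronous Textract API based on page count."""
    bucket, key = event["bucket"], event["key"]
    if not key.lower().endswith(".pdf"):
        return {"bucket": bucket, "key": key, "route": "sync"}  # single-image documents
    s3.download_file(bucket, key, "/tmp/doc.pdf")
    pages = len(PdfReader("/tmp/doc.pdf").pages)
    return {"bucket": bucket, "key": key, "pages": pages,
            "route": "sync" if pages == 1 else "async"}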

This solution also contains two additional Lambda functions. The first parses the text from the JSON and saves it as a text file in Amazon S3. The second analyzes the JSON and stores metrics about the workload.

The following diagram illustrates the Step Functions workflow.

Diagram

Prerequisites

This code base uses the AWS CDK and requires Docker. You can deploy this from an AWS Cloud9 instance, which has the AWS CDK and Docker already set up.

Walkthrough

To implement this solution, you first need to clone the repository.

After you clone the repository, install the dependencies:

pip install -r requirements.txt

Then use the following code to deploy the AWS CDK stack:

cdk bootstrap
cdk deploy --parameters SourceBucket=<Source Bucket> SourcePrefix=<Source Prefix>

You must provide both the source bucket and source prefix (the location of the files you want to process) for this solution.
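
For example, to process everything under a given prefix (the bucket and prefix names here are placeholders; depending on your AWS CDK version, you may need to repeat the --parameters flag for each value):

cdk deploy --parameters SourceBucket=my-idp-source-bucket --parameters SourcePrefix=incoming/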

When the deployment is complete, navigate to the Step Functions console, where you should see the state machine ServerlessIDPArchivePipeline.

Diagram

Open the state machine details page and on the Executions tab, choose Start execution.

Diagram

Choose Start execution again to run the state machine.

Diagram

After you start the state machine, you can monitor the pipeline by looking at the map run. You will see an Item processing status section like the following screenshot, which tracks which items succeeded and which failed. The map run continues until all documents have been processed.

Diagram

With this solution, you should be able to process millions of files in your AWS account without having to determine which files to send to which API or worrying about corrupt files failing your pipeline. Through the Step Functions console, you can watch and monitor your files in real time.

Clean up

After your pipeline is finished running, to clean up, you can go back into your project and enter the following command:

cdk destroy

This will delete any services that were deployed for this project.

Conclusion

In this post, we presented a solution that makes it straightforward to convert your document images and PDFs to text files. This is a key prerequisite to using your documents for generative AI and search. To learn more about using text to train or fine-tune your foundation models, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart. To use with search, refer to Implement smart document search index with Amazon Textract and Amazon OpenSearch. To learn more about advanced document processing capabilities offered by AWS AI services, refer to Guidance for Intelligent Document Processing on AWS.


About the Authors

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

David Girling is a senior AI/ML solutions architect with over twenty years of experience in designing, leading and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate and utilize these highly capable services with their data for their use cases.

Amgen to Build Generative AI Models for Novel Human Data Insights and Drug Discovery

Generative AI is transforming drug research and development, enabling new discoveries faster than ever — and Amgen, one of the world’s leading biotechnology companies, is tapping the technology to power its research.

Amgen will build AI models trained to analyze one of the world’s largest human datasets on an NVIDIA DGX SuperPOD, a full-stack data center platform that will be installed at the Reykjavik, Iceland, headquarters of deCODE genetics, an Amgen subsidiary. The system will be named Freyja in honor of the powerful, life-giving Norse goddess associated with the ability to predict the future.

Freyja will be used to build a human diversity atlas for drug target and disease-specific biomarker discovery, providing vital diagnostics for monitoring disease progression and regression. The system will also help develop AI-driven precision medicine models, potentially enabling individualized therapies for patients with serious diseases.

Amgen plans to integrate the DGX SuperPOD, which will feature 31 NVIDIA DGX H100 nodes totaling 248 H100 Tensor Core GPUs, to train state-of-the-art AI models in days rather than months, enabling researchers to more efficiently analyze and learn from data in their search for novel health and therapeutics insights.

“For more than a decade, Amgen has been preparing for this hinge moment we are seeing in the industry, powered by the union of technology and biotechnology,” said David M. Reese, executive vice president and chief technology officer at Amgen. “We look forward to combining the breadth and maturity of our world-class human data capabilities at Amgen with NVIDIA’s technologies.”

The goal of deCODE founder and CEO Kári Stefánsson in starting the company was to understand human disease by looking at the diversity of the human genome. He predicted in a recent Amgen podcast that within the next 10 years, doctors will routinely use genetics to explore uncommon diseases in patients.

“This SuperPOD has the potential to accelerate our research by training models more quickly and helping us generate questions we might not have otherwise thought to ask,” said Stefánsson.

Putting the Tech in Biotechnology

Since its founding in 1996, deCODE has curated more than 200 petabytes of de-identified human data from nearly 3 million individuals.

The company started by collecting de-identified data from Icelanders, who have a rich heritage in genealogies that stretch back for centuries. This population-scale data from research volunteers provides unique insights into human diversity as it applies to disease.

deCODE has also helped sequence more than half a million human genomes from volunteers in the UK Biobank.

But drawing insights from this much data requires powerful AI systems.

By integrating powerful new technology, Amgen has an opportunity to accelerate the discovery and development of life-changing medicines. In March 2023, NVIDIA announced that Amgen became one of the first companies to employ NVIDIA BioNeMo, which researchers have used to build generative AI models to accelerate drug discovery and development. Amgen researchers have also been accessing BioNeMo via NVIDIA DGX Cloud, an AI supercomputing service.

“Models trained in BioNeMo can advance drug discovery on multiple fronts,” said Marti Head, executive director of computational and data sciences at Amgen. “In addition to helping develop drugs that are more effective, they can also help avoid unwanted effects like immune responses, and new biologics can be made in volume.”

By adopting DGX SuperPOD, Amgen is poised to gain unprecedented data insights with the potential to change the pace and scope of drug discovery.

“The fusion of advanced AI, groundbreaking developments in biology and molecular engineering and vast quantities of human data are not just reshaping how we discover and develop new medicines — they’re redefining medicine,” Reese said.

Learn about NVIDIA’s AI platform for healthcare and life sciences.

NVIDIA Generative AI Is Opening the Next Era of Drug Discovery and Design

In perhaps the healthcare industry’s most dramatic transformation since the advent of computing, digital biology and generative AI are helping to reinvent drug discovery, surgery, medical imaging and wearable devices.

NVIDIA has been preparing for this moment for over a decade, building deep domain expertise, creating the NVIDIA Clara healthcare-specific computing platform and expanding its work with a rich ecosystem of partners. Healthcare customers and partners already consume well over a billion dollars in NVIDIA GPU computing each year — directly and indirectly through cloud partners.

In the $250 billion field of drug discovery, these efforts are meeting an inflection point: R&D teams can now represent drugs inside a computer.

By harnessing emerging generative AI tools, drug discovery teams observe foundational building blocks of molecular sequence, structure, function and meaning — allowing them to generate or design novel molecules likely to possess desired properties. With these capabilities, researchers can curate a more precise field of drug candidates to investigate, reducing the need for expensive, time-consuming physical experiments.

Accelerating this shift is NVIDIA BioNeMo, a generative AI platform that provides services to develop, customize and deploy foundation models for drug discovery.

Used by pharmaceutical, techbio and software companies, BioNeMo offers a new class of computational methods for drug research and development, enabling scientists to integrate generative AI to reduce experiments and, in some cases, replace them altogether.

In addition to developing, optimizing and hosting AI models through BioNeMo, NVIDIA has boosted the computer-aided drug discovery ecosystem with investments in innovative techbio companies — such as biopharmaceutical company Recursion, which is offering one of its foundation models for BioNeMo users, and biotech company Terray Therapeutics, which is using BioNeMo for AI model development.

BioNeMo Brings Precision to AI-Accelerated Drug Discovery 

BioNeMo features a growing collection of pretrained biomolecular AI models for protein structure prediction, protein sequence generation, molecular optimization, generative chemistry, docking prediction and more. It also enables computer-aided drug discovery companies to make their models available to a broad audience through easy-to-access APIs for inference and customization.

Drug discovery teams use BioNeMo to invent or customize generative AI models with proprietary data — and drug discovery software companies, techbios and large pharmas are integrating BioNeMo cloud APIs, which will be released in beta this month, into platforms that deliver computer-aided drug discovery workflows.

The cloud APIs will now include foundation models from three sources: models invented by NVIDIA, such as the MolMIM generative chemistry model for small molecule generation; open-source models pioneered by global research teams, curated and optimized by NVIDIA, such as the OpenFold protein prediction AI; and proprietary models developed by NVIDIA partners, such as Recursion’s Phenom-Beta for embedding cellular microscopy images.

MolMIM generates small molecules while giving users finer control over the AI generation process — identifying new molecules that possess desired properties and follow constraints specified by users. For example, researchers could direct the model to generate molecules that have similar structures and properties to a given reference molecule.

Phenomenal AI for Pharma: Recursion Brings Phenom-Beta Model to BioNeMo

Recursion is the first hosting partner offering an AI model through BioNeMo cloud APIs: Phenom-Beta, a vision transformer model that extracts biologically meaningful features from cellular microscopy images.

This capability can provide researchers with insights about cell function and help them learn how cells respond to drug candidates or genetic engineering.

Phenom-Beta performed well on image reconstruction tasks, a training metric to evaluate model performance. Read the NeurIPS workshop paper to learn more.

Phenom-Beta was trained on Recursion’s publicly available RxRx3 dataset of biological images using the company’s BioHive-1 supercomputer, based on the NVIDIA DGX SuperPOD reference architecture.

To further its foundation model development, Recursion is expanding its supercomputer with more than 500 NVIDIA H100 Tensor Core GPUs. This will boost its computational capacity by 4x to create what’s expected to be the most powerful supercomputer owned and operated by any biopharma company.

How Companies Are Adopting NVIDIA BioNeMo

A growing group of scientists, biotech and pharma companies, and AI software vendors are using NVIDIA BioNeMo to support biology, chemistry and genomics research.

Biotech leader Terray Therapeutics is integrating BioNeMo cloud APIs into its development of a generalized, multi-target structural binding model. The company also uses NVIDIA DGX Cloud to train chemistry foundation models to power generative AI for small molecule design.

Protein engineering and molecular design companies Innophore and Insilico Medicine are bringing BioNeMo into their computational drug discovery applications. Innophore is integrating BioNeMo cloud APIs into its Catalophore platform for protein design and drug discovery. And Insilico, a premier member of the NVIDIA Inception program for startups, has adopted BioNeMo in its generative AI pipeline for early drug discovery.

Biotech software company OneAngstrom and systems integrator Deloitte are using BioNeMo cloud APIs to build AI solutions for their clients.

OneAngstrom is integrating BioNeMo cloud APIs into its SAMSON platform for molecular design used by academics, biotechs and pharmas. Deloitte is transforming scientific research by integrating BioNeMo on NVIDIA DGX Cloud with the Quartz Atlas AI platform. This combination provides biopharma researchers with unparalleled data connectivity and cutting-edge generative AI, propelling them into a new era of accelerated drug discovery.

Learn more about NVIDIA BioNeMo and subscribe to NVIDIA healthcare news.

NVIDIA Reveals Gaming, Creating, Generative AI, Robotics Innovations at CES

The AI revolution returned to where it started this week, putting powerful new tools into the hands of gamers and content creators.

Generative AI models that will bring lifelike characters to games and applications and new GPUs for gamers and creators were among the highlights of a news-packed address Monday ahead of this week’s CES trade show in Las Vegas.

“Today, NVIDIA is at the center of the latest technology transformation: generative AI,” said Jeff Fisher, senior vice president for GeForce at NVIDIA, who was joined by leaders across the company to introduce products and partnerships across gaming, content creation, and robotics.

A Launching Pad for Generative AI

As AI shifts into the mainstream, Fisher said NVIDIA’s RTX GPUs, with more than 100 million units shipped, are pivotal in the burgeoning field of generative AI, exemplified by innovations like ChatGPT and Stable Diffusion.

In October, NVIDIA released the TensorRT-LLM library for Windows, accelerating large language models, or LLMs, like Llama 2 and Mistral up to 5x on RTX PCs.

And with our new Chat with RTX playground, releasing later this month, enthusiasts can connect an RTX-accelerated LLM to their own data, from locally stored documents to YouTube videos, using retrieval-augmented generation, or RAG, a technique for enhancing the accuracy and reliability of generative AI models.

Fisher also introduced TensorRT acceleration for Stable Diffusion XL and SDXL Turbo in the popular Automatic1111 text-to-image app, providing up to a 60% boost in performance.

NVIDIA Avatar Cloud Engine (ACE) Microservices Debut With Generative AI Models for Digital Avatars

NVIDIA ACE is a technology platform that brings digital avatars to life with generative AI. ACE AI models are designed to run in the cloud or locally on the PC.

In an ACE demo featuring Convai’s new technologies, NVIDIA’s Senior Product Manager Seth Schneider showed how it works.

 

First, a player’s voice input is passed to NVIDIA’s automatic speech recognition model, which translates speech to text. Then, the text is put into an LLM to generate the character’s response.

After that, the text response is vocalized using a text-to-speech model, which is passed to an animation model to create a realistic lip sync. Finally, the dynamic character is rendered into the game scene.

At CES, NVIDIA is announcing ACE Production Microservices for NVIDIA Audio2Face and NVIDIA Riva Automatic Speech Recognition. Available now, each model can be incorporated by developers individually into their pipelines.

NVIDIA is also announcing that game and interactive avatar developers are pioneering ways ACE and generative AI technologies can be used to transform interactions between players and non-playable characters in games and applications. Developers embracing ACE include Convai, Charisma.AI, Inworld, miHoYo, NetEase Games, Ourpalm, Tencent, Ubisoft and UneeQ.

Getty Images Releases Generative AI by iStock and AI Image Generation Tools Powered by NVIDIA Picasso

Generative AI empowers designers and marketers to create concept imagery, social media content and more. Today, iStock by Getty Images is releasing a genAI service built on NVIDIA Picasso, an AI foundry for visual design, Fisher announced.

The iStock service allows anyone to create 4K imagery from text using an AI model trained on Getty Images’ extensive catalog of licensed, commercially safe creative content. New editing application programming interfaces that give customers powerful control over their generated images are also coming soon.

The generative AI service is available today at istock.com, with advanced editing features releasing via API.

NVIDIA Introduces GeForce RTX 40 SUPER Series

Fisher announced a new series of GeForce RTX 40 SUPER GPUs with more gaming and generative AI performance.

Fisher said that the GeForce RTX 4080 SUPER can power fully ray-traced games at 4K. It’s 1.4x faster than the RTX 3080 Ti without frame gen in the most graphically intensive games. With 836 AI TOPS, NVIDIA DLSS Frame Generation delivers an extra performance boost, making the RTX 4080 SUPER twice as fast as an RTX 3080 Ti.

Creators can generate video with Stable Video Diffusion 1.5x faster and images with Stable Diffusion XL 1.7x faster. The RTX 4080 SUPER features more cores and faster memory, giving it a performance edge at a great new price of $999. It will be available starting Jan. 31.

Next up is the RTX 4070 Ti SUPER. NVIDIA has added more cores and increased the frame buffer to 16GB and the memory bus to 256 bits. It’s 1.6x faster than a 3070 Ti and 2.5x faster with DLSS 3, Fisher said. The RTX 4070 Ti SUPER will be available starting Jan. 24 for $799.

Fisher also introduced the RTX 4070 SUPER. NVIDIA has added 20% more cores, making it faster than the RTX 3090 while using a fraction of the power. And with DLSS 3, it’s 1.5x faster in the most demanding games. It will be available for $599 starting Jan. 17.

NVIDIA RTX Remix Open Beta Launches This Month

There are over 10 billion game mods downloaded each year. With RTX Remix, modders can remaster classic games with full ray tracing, DLSS, NVIDIA Reflex and generative AI texture tools that transform low-resolution textures into 4K, physically accurate materials. The RTX Remix app will be released in open beta on Jan. 22.

RTX Remix has already delivered stunning remasters in NVIDIA’s Portal with RTX and the modder-made Portal: Prelude RTX. Now, Orbifold Studios is using RTX Remix to develop Half-Life 2 RTX: An RTX Remix Project, a community remaster of one of the highest-rated games of all time.

Check out this new Half-Life 2 RTX gameplay trailer:

 

Twitch and NVIDIA to Release Multi-Encode Livestreaming

Twitch is one of the most popular platforms for content creators, with over 7 million streamers going live each month to 35 million daily viewers. Fisher explained that these viewers are on all kinds of devices and internet services.

Yet many Twitch streamers are limited to broadcasting at a single resolution and quality level. As a result, they must broadcast at lower quality to reach more viewers.

To address this, Twitch, OBS and NVIDIA announced Enhanced Broadcasting, supported by all RTX GPUs. This new feature allows streamers to transmit up to three concurrent streams to Twitch at different resolutions and quality so each viewer gets the optimal experience.

Beta signups start today and will go live later this month. Twitch will also experiment with 4K and AV1 on the GeForce RTX 40 Series GPUs to deliver even better quality and higher resolution streaming.

‘New Wave’ of AI-Ready RTX Laptops

RTX is the fastest-growing laptop platform, having grown 5x in the last four years, with over 50 million devices in the hands of gamers and creators across the globe.

More’s coming. Fisher announced “a new wave” of RTX laptops launching from every major manufacturer. “Thanks to powerful RT and Tensor Cores, every RTX laptop is AI-ready for the best gaming and AI experiences,” Fisher said.

With an installed base of 100 million GPUs and 500 RTX games and apps, GeForce RTX is the world’s largest platform for gamers, creators and, now, generative AI.

Activision and Blizzard Games Embrace RTX

More than 500 games and apps now take advantage of NVIDIA RTX technology, NVIDIA’s Senior Consumer Marketing Manager Kristina Bartz said, including Alan Wake 2, which won three awards at this year’s Game Awards.


It’s a list that keeps growing with 14 new RTX titles announced at CES.

Horizon Forbidden West, the critically acclaimed sequel to Horizon Zero Dawn, will come to PC early this year with the Burning Shores expansion, accelerated by DLSS 3.

Pax Dei is a social sandbox massively multiplayer online game inspired by the legends of the medieval era. Developed by Mainframe Industries with veterans from CCP Games, Blizzard and Remedy Entertainment, Pax Dei will launch in early access on PC with AI-accelerated DLSS 3 this spring.

Last summer, Diablo IV launched with DLSS 3 and immediately became Blizzard’s fastest-selling game. RTX ray tracing will now be coming to Diablo IV in March.


Day Passes and G-SYNC Technology Coming to GeForce NOW

NVIDIA’s partnership with Activision also extends to the cloud with GeForce NOW, Bartz said. In November, NVIDIA welcomed the first Activision and Blizzard game, Call of Duty: Modern Warfare 3. Diablo IV and Overwatch 2 are coming soon.

GeForce NOW will get Day Pass membership options starting in February. Priority and Ultimate Day Passes will give gamers a full day of gaming with the fastest access to servers and all the same benefits as members, including NVIDIA DLSS 3.5 and NVIDIA Reflex for Ultimate Day Pass purchasers.

NVIDIA also announced that Cloud G-SYNC technology is coming to GeForce NOW. The technology varies the display refresh rate to match the frame rate on G-SYNC monitors, giving members the smoothest, tear-free gaming experience from the cloud.

Generative AI Powers Smarter Robots With NVIDIA Isaac


Closing out the special address, NVIDIA Vice President of Robotics and Edge Computing Deepu Talla shared how the infusion of generative AI into robotics is speeding up the ability to bring robots from proof of concept to real-world deployment.

Talla gave a peek into the growing use of generative AI in the NVIDIA robotics ecosystem, where robotics innovators like Boston Dynamics and Collaborative Robots are changing the landscape of human-robot interaction.

NVIDIA Drives AI Forward With Automotive Innovation on Display at CES

Amid explosive interest in generative AI, the auto industry is racing to embrace the power of AI across a range of critical activities, from vehicle design, engineering and manufacturing, to marketing and sales.

The adoption of generative AI — along with the growing importance of software-defined computing — will continue to transform the automotive market in 2024.

NVIDIA today announced that Li Auto, a pioneer in extended-range electric vehicles (EVs), has selected the NVIDIA DRIVE Thor centralized car computer to power its next-generation fleets. Also, EV makers GWM (Great Wall Motor), ZEEKR and Xiaomi have adopted the NVIDIA DRIVE Orin platform to power their intelligent automated-driving systems.

In addition, a powerful lineup of technology is on display from NVIDIA’s automotive partners on the CES trade show floor in Las Vegas.

  • Mercedes-Benz is kicking off CES with a press conference to announce a range of exciting software-driven features and the latest developments in the Mercedes-Benz MB.OS story, each showcased across a lineup of cars, including the Concept CLA Class, which uses NVIDIA DRIVE Orin for the automated driving domain.

    Mercedes-Benz is also using digital twins for production with help from NVIDIA Omniverse, a platform for developing applications to design, collaborate, plan and operate manufacturing and assembly facilities. (West Hall – 4941)

  • Luminar will host a fireside chat with NVIDIA on Jan. 9 at 2 p.m. PT to discuss the state of the art of sensor processing and ongoing collaborations between the companies. In addition, Luminar will showcase the work it’s doing with NVIDIA partners Volvo Cars, Polestar, Plus and Kodiak. (West Hall – 5917 and West Plaza – WP10)
  • Ansys is demonstrating how it leverages NVIDIA Omniverse to accelerate autonomous vehicle development. Ansys AVxcelerate Sensors will be accessible within NVIDIA DRIVE Sim. (West Hall – 6500)
  • Cerence is introducing CaLLM, an automotive-specific large language model that serves as the foundation for the company’s next-gen in-car computing platform, running on NVIDIA DRIVE. (West Hall – 6627)
  • Cipia is showcasing its embedded software version of Cabin Sense, which includes both driver and occupancy monitoring and is expected to go into serial production this year. NVIDIA DRIVE is the first platform on which Cabin Sense will run commercially. (North Hall – 11022)
  • Kodiak is exhibiting an autonomous truck, which relies on NVIDIA GPUs for high-performance compute to process the enormous quantities of data it collects from its cameras, radar and lidar sensors. (West Plaza – WP10, with Luminar)
  • Lenovo is displaying its vehicle computing roadmap, featuring new products based on NVIDIA DRIVE Thor, including: Lenovo XH1, a central compute unit for advanced driver-assistance systems and smart cockpit; Lenovo AH1, a level 2++ ADAS domain controller unit; and Lenovo AD1, a level 4 autonomous driving domain controller unit. (Estiatorio Milos, Venetian Hotel)
  • Pebble, a recreational vehicle startup, is presenting its flagship product Pebble Flow, the electric semi-autonomous travel trailer powered by NVIDIA DRIVE Orin, with production starting before the end of 2024. (West Hall – 7023)
  • Polestar is showcasing Polestar 3, which is powered by the NVIDIA DRIVE Orin central core computer. (West Hall – 5917 with Luminar and Central Plaza – CP1 with Google)
  • Zoox is showcasing the latest generation of its purpose-built robotaxi, which leverages NVIDIA technology, and is offering CES attendees the opportunity to join its early-bird waitlist for its autonomous ride-hailing service. (West Hall – 7228)

Explore to Win

Visit select NVIDIA partner booths for a chance to win GTC 2024 conference passes with hotel accommodations.

Event Lineup

Check out NVIDIA’s CES event page for a summary of all of the company’s automotive-related events. Learn about NVIDIA’s other announcements at CES by viewing the company’s special address on demand.
