August 2023 – Page 13

DENZA Collaborates With WPP to Build and Deploy Advanced Car Configurators on NVIDIA Omniverse Cloud

DENZA, the luxury EV brand joint venture between BYD and Mercedes-Benz, has collaborated with marketing and communications giant WPP and NVIDIA Omniverse Cloud to build and deploy its next generation of car configurators, NVIDIA founder and CEO Jensen Huang announced at SIGGRAPH.

WPP is using Omniverse Cloud — a platform for developing, deploying and managing industrial digitalization applications — to help unify the automaker’s highly complex design and marketing pipeline.

Omniverse Cloud enables WPP to build a single, physically accurate, real-time digital twin of the DENZA N7 model by integrating full-fidelity design data from the EV maker’s preferred computer-aided design tools via Universal Scene Description, or OpenUSD.

OpenUSD is a 3D framework that enables interoperability between software tools and data types for the building of virtual worlds.

The implementation of a new unified asset pipeline breaks down proprietary data silos, fostering enhanced data accessibility and facilitating collaborative, iterative reviews for the organization’s large design teams and stakeholders. It enables WPP to work on launch campaigns earlier in the design process, making iterations faster and less costly.

Unifying Asset Pipelines With Omniverse Cloud

Using Omniverse Cloud, WPP’s teams can connect their own pipeline of OpenUSD-enabled design and content creation tools such as Autodesk Maya and Adobe Substance 3D Painter to develop a new configurator for the DENZA N7. With a unified asset pipeline in Omniverse, WPP’s teams of artists can iterate and edit in real time a path-traced view of the full engineering dataset of the DENZA N7 — ensuring the virtual car accurately represents the physical car.

Traditional car configurators require hundreds of thousands of images to be prerendered to represent all possible options and variants. OpenUSD makes it possible for WPP to create a digital twin of the car that includes all possible variants in one single asset. No prerendered images are required.
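To see why prerendering scales so poorly, it helps to count configurations. The following sketch multiplies a set of option counts to estimate how many images a prerendered configurator would need. All of the numbers are hypothetical and chosen only for illustration; they are not DENZA's actual catalog.

```python
from math import prod

# Hypothetical option counts per category; illustrative only, not DENZA data.
options = {
    "exterior_color": 8,
    "wheel_design": 5,
    "interior_trim": 6,
    "roof_type": 2,
}
camera_angles = 36  # assumed number of prerendered viewpoints per variant
environments = 4    # assumed number of backdrops per viewpoint

# Every combination of options is a distinct variant of the car.
variants = prod(options.values())
# Each variant must be rendered from every angle in every environment.
images = variants * camera_angles * environments

print(f"{variants} variants -> {images} prerendered images")
```

Adding one more option category multiplies the total again, which is the growth a single real-time asset avoids.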

In parallel, WPP’s environmental artists create fully interactive, live 3D virtual sets. These can start with a scan of a real-world environment, such as those WPP captures with their robot dog, or tap into generative AI tools from providers such as Shutterstock to instantly generate 360-degree HDRi backgrounds to maximize opportunity for personalization.

Shutterstock is using NVIDIA Picasso — a foundry for building generative AI visual models — to develop a variety of generative AI services to accelerate 3D workflows. At SIGGRAPH, Shutterstock announced the first offering of these new services – 360 HDRi – to create photorealistic HDR environment maps to relight a scene. With this feature, artists can rapidly create custom environments that fit their needs.

One-Click Publish to GDN

Once the 3D experience is complete, with just one click, WPP can publish it to Graphics Delivery Network (GDN), part of NVIDIA Omniverse Cloud. GDN is a network of data centers built to serve real-time, high-fidelity 3D content to nearly any web device, enabling interactive experiences in the dealer showroom as well as on consumers’ mobile devices.

This eliminates the tedious process of manually packaging, deploying, hosting and managing the experience themselves. If updates are needed, just like with the initial deployment, WPP can publish them with a single click.

Learn more about Omniverse Cloud and GDN.

Host the Spark UI on Amazon SageMaker Studio

Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.

You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.

Alternatively, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option allows you to select several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training.

Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).

All these options allow you to generate and store Spark event logs to analyze them through the web-based user interface commonly named the Spark UI, which runs a Spark History Server to monitor the progress of Spark applications, track resource usage, and debug errors.

In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.

Solution overview

The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:

Accessing logs generated by SageMaker Processing Spark jobs
Accessing logs generated by AWS Glue Spark applications
Accessing logs generated by self-managed Spark clusters and Amazon EMR

A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli enables managing Spark History Server without leaving SageMaker Studio.

The solution consists of shell scripts that perform the following actions:

Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
Install the sm-spark-cli for a user profile or shared space

Install the Spark UI manually in a SageMaker Studio domain

To host Spark UI on SageMaker Studio, complete the following steps:

Choose System terminal from the SageMaker Studio launcher.

Run the following commands in the system terminal:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>

The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.

For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.

The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic (%spark, %sql) commands to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.
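As a rough illustration, the following sketch builds the JSON body for a %%configure notebook cell that turns on Spark event logging for an AWS Glue interactive session. It assumes the session accepts the Glue job parameters --enable-spark-ui and --spark-event-logs-path; the helper name glue_spark_ui_config and the spark-logs/ prefix are hypothetical.

```python
import json


def glue_spark_ui_config(event_logs_s3_uri: str) -> str:
    """Return a JSON body for a %%configure cell that enables Spark event logs."""
    if not event_logs_s3_uri.startswith("s3://"):
        raise ValueError("event log location must be an s3:// URI")
    settings = {
        # Assumed Glue job parameters for the Spark UI; check them against the Glue docs.
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_s3_uri,
    }
    return json.dumps(settings, indent=2)


# Hypothetical bucket and prefix, used only for illustration.
cell_body = glue_spark_ui_config("s3://DOC-EXAMPLE-BUCKET/spark-logs/")
print("%%configure")
print(cell_body)
```

The printed text can be pasted into the first cell of the notebook, before the Spark session starts. The same S3 location is then passed to sm-spark-cli start.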

For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.

Refer to the AWS documentation for additional information:

For SageMaker Processing, refer to PySparkProcessor
For AWS Glue Interactive Sessions, refer to Configuring the Spark UI (console)
For Amazon EMR, refer to Configure an output location

You can choose the generated URL to access the Spark UI.

The following screenshot shows an example of the Spark UI.

You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio System terminal.

You can also stop the Spark History Server when needed.

Automate the Spark UI installation for users in a SageMaker Studio domain

As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.

From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

LCC_CONTENT=`openssl base64 -A -in install-history-server.sh`

aws sagemaker create-studio-lifecycle-config \
	--studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
	--studio-lifecycle-config-content $LCC_CONTENT \
	--studio-lifecycle-config-app-type JupyterServer \
	--query 'StudioLifecycleConfigArn'

aws sagemaker update-domain \
	--region {YOUR_AWS_REGION} \
	--domain-id {YOUR_STUDIO_DOMAIN_ID} \
	--default-user-settings \
	'{
	"JupyterServerAppSettings": {
	"DefaultResourceSpec": {
	"LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
	"InstanceType": "system"
	},
	"LifecycleConfigArns": [
	"arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
	]
	}}'

After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
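If you prefer to prepare the inputs for these commands in Python, the following sketch shows one way to do it. It base64-encodes a script on a single line (the same result openssl base64 -A produces) and builds the JSON document passed to --default-user-settings. The script text, Region, and account ID are placeholders, and both helper names are hypothetical.

```python
import base64
import json


def encode_lifecycle_script(script_text: str) -> str:
    """Base64-encode a script as one unbroken line, like `openssl base64 -A`."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def default_user_settings(lifecycle_config_arn: str) -> str:
    """Build the JSON string for the --default-user-settings argument."""
    settings = {
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "LifecycleConfigArn": lifecycle_config_arn,
                "InstanceType": "system",
            },
            "LifecycleConfigArns": [lifecycle_config_arn],
        }
    }
    return json.dumps(settings)


# Placeholder script and ARN, for illustration only.
content = encode_lifecycle_script("#!/bin/bash\necho 'install spark ui'\n")
arn = (
    "arn:aws:sagemaker:us-east-1:111122223333:"
    "studio-lifecycle-config/install-spark-ui-on-jupyterserver"
)
print(content)
print(default_user_settings(arn))
```

Building the JSON with json.dumps avoids quoting mistakes that are easy to make when the document is typed by hand inside a shell string.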

Clean up

In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.

Manually uninstall the Spark UI

To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:

Choose System terminal in the SageMaker Studio launcher.

Run the following commands in the system terminal:

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

chmod +x uninstall-history-server.sh
./uninstall-history-server.sh

Uninstall the Spark UI automatically for all SageMaker Studio user profiles

To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.

On the domain details page, navigate to the Environment tab.
Select the lifecycle configuration for the Spark UI on SageMaker Studio.
Choose Detach.

Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.

Conclusion

In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.

All the code shown as part of this post is available in the GitHub repository.

About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Deploy thousands of model ensembles with Amazon SageMaker multi-model endpoints on GPU to minimize your hosting costs

Artificial intelligence (AI) adoption is accelerating across industries and use cases. Recent scientific breakthroughs in deep learning (DL), large language models (LLMs), and generative AI are allowing customers to use advanced state-of-the-art solutions with almost human-like performance. These complex models often require hardware acceleration because it enables not only faster training but also faster inference when using deep neural networks in real-time applications. GPUs’ large number of parallel processing cores makes them well-suited for these DL tasks.

However, in addition to model invocation, these DL applications often entail preprocessing or postprocessing in an inference pipeline. For example, input images for an object detection use case might need to be resized or cropped before being served to a computer vision model, and text inputs might need to be tokenized before being used in an LLM. NVIDIA Triton is an open-source inference server that enables users to define such inference pipelines as an ensemble of models in the form of a Directed Acyclic Graph (DAG). It is designed to run models at scale on both CPU and GPU. Amazon SageMaker supports deploying Triton seamlessly, allowing you to use Triton’s features while also benefiting from SageMaker capabilities: a managed, secured environment with MLOps tools integration, automatic scaling of hosted models, and more.

To help customers maximize savings, AWS has continuously innovated not only in pricing options and proactive cost-optimization services, but also in cost-saving features such as multi-model endpoints (MMEs). MMEs are a cost-effective solution for deploying a large number of models using the same fleet of resources and a shared serving container to host all of your models. Instead of using multiple single-model endpoints, you can reduce your hosting costs by deploying multiple models while paying only for a single inference environment. Additionally, MMEs reduce deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

In this post, we show how to run multiple deep learning ensemble models on a GPU instance with a SageMaker MME. To follow along with this example, you can find the code on the public SageMaker examples repository.

How SageMaker MMEs with GPU work

With MMEs, a single container hosts multiple models. SageMaker controls the lifecycle of models hosted on the MME by loading and unloading them into the container’s memory. Instead of downloading all the models to the endpoint instance, SageMaker dynamically loads and caches the models as they are invoked.

When an invocation request for a particular model is made, SageMaker does the following:

It first routes the request to the endpoint instance.
If the model has not been loaded, it downloads the model artifact from Amazon Simple Storage Service (Amazon S3) to that instance’s Amazon Elastic Block Store (Amazon EBS) volume.
It loads the model to the container’s memory on the GPU-accelerated compute instance. If the model is already loaded in the container’s memory, invocation is faster because no further steps are needed.

When an additional model needs to be loaded and the instance’s memory utilization is high, SageMaker unloads unused models from that instance’s container to ensure that there is enough memory. These unloaded models remain on the instance’s EBS volume so that they can be loaded into the container’s memory later, which removes the need to download them again from the S3 bucket. However, if the instance’s storage volume reaches its capacity, SageMaker deletes the unused models from the storage volume. In cases where the MME receives many invocation requests, and additional instances (or an auto-scaling policy) are in place, SageMaker routes some requests to other instances in the inference cluster to accommodate the high traffic.
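The loading behavior described above can be pictured with a small simulation. The following sketch is a toy model of a single instance, not SageMaker's implementation: the class name, the slot-based capacities, and the eviction order (least recently used in memory, oldest download on disk) are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyMultiModelHost:
    """Toy model of one MME instance with a memory tier and a disk tier."""

    def __init__(self, memory_slots: int, disk_slots: int):
        if disk_slots <= memory_slots:
            raise ValueError("this toy assumes disk holds more models than memory")
        self.memory_slots = memory_slots
        self.disk_slots = disk_slots
        self.memory = OrderedDict()  # loaded models, least recently used first
        self.disk = OrderedDict()    # downloaded artifacts, oldest download first
        self.downloads = 0

    def _make_room_on_disk(self) -> None:
        # Delete artifacts that are not currently loaded until a slot is free.
        while len(self.disk) >= self.disk_slots:
            victim = next(name for name in self.disk if name not in self.memory)
            del self.disk[victim]

    def invoke(self, model: str) -> str:
        if model in self.memory:
            self.memory.move_to_end(model)  # mark as recently used
            return "warm"
        if model in self.disk:
            outcome = "reloaded from disk"
        else:
            self._make_room_on_disk()
            self.disk[model] = None
            self.downloads += 1
            outcome = "downloaded from S3"
        # Unload the least recently used models; their artifacts stay on disk.
        while len(self.memory) >= self.memory_slots:
            self.memory.popitem(last=False)
        self.memory[model] = None
        return outcome


host = ToyMultiModelHost(memory_slots=2, disk_slots=3)
outcomes = [host.invoke(name) for name in "abacbda"]
for name, outcome in zip("abacbda", outcomes):
    print(name, "->", outcome)
```

In this run, model b is unloaded from memory and later reloaded from disk without a new download, while model a is eventually deleted from disk and has to be downloaded again.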

This not only provides a cost saving mechanism, but also enables you to dynamically deploy new models and deprecate old ones. To add a new model, you upload it to the S3 bucket the MME is configured to use and invoke it. To delete a model, stop sending requests and delete it from the S3 bucket. Adding models or deleting them from an MME doesn’t require updating the endpoint itself!

Triton ensembles

The Triton model ensemble represents a pipeline that consists of one or more models, preprocessing and postprocessing logic, and the connection of input and output tensors between them. A single inference request to an ensemble triggers the run of the entire pipeline as a series of steps using the ensemble scheduler. The scheduler collects the output tensors in each step and provides them as input tensors for other steps according to the specification. From the outside, the ensemble is still viewed as a single model.
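To make the tensor routing concrete, here is a minimal sketch of the idea in plain Python. It is not Triton code: run_ensemble and the two stand-in models are hypothetical, and the steps simply run in list order, whereas the real scheduler runs a step once its inputs are available.

```python
def run_ensemble(steps, request_tensors):
    """Run steps in order, routing tensors by name through a shared pool."""
    pool = dict(request_tensors)  # ensemble-level tensor name -> value
    for step in steps:
        # input_map: model input name -> ensemble tensor name
        model_inputs = {name: pool[src] for name, src in step["input_map"].items()}
        model_outputs = step["model"](model_inputs)
        # output_map: model output name -> ensemble tensor name
        for name, dst in step["output_map"].items():
            pool[dst] = model_outputs[name]
    return pool


# Stand-in "models": plain functions, not real Triton backends.
def preprocess(inputs):
    return {"TOKENS": inputs["TEXT"].lower().split()}


def count_tokens(inputs):
    return {"COUNT": len(inputs["TOKENS"])}


steps = [
    {
        "model": preprocess,
        "input_map": {"TEXT": "INPUT"},
        "output_map": {"TOKENS": "tokens"},
    },
    {
        "model": count_tokens,
        "input_map": {"TOKENS": "tokens"},
        "output_map": {"COUNT": "OUTPUT"},
    },
]

result = run_ensemble(steps, {"INPUT": "Hello Triton ensembles"})
print(result["OUTPUT"])
```

The caller only supplies INPUT and reads OUTPUT; the intermediate tokens tensor never leaves the pipeline, which is why the ensemble looks like one model from the outside.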

Triton server architecture includes a model repository: a file system-based repository of the models that Triton will make available for inferencing. Triton can access models from one or more locally accessible file paths or from remote locations like Amazon S3.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. A minimal model configuration must specify the platform or backend (like PyTorch or TensorFlow), the max_batch_size property, and the input and output tensors of the model.

Triton on SageMaker

SageMaker enables model deployment using Triton server with custom code. This functionality is available through the SageMaker managed Triton Inference Server Containers. These containers support common machine learning (ML) frameworks (like TensorFlow, ONNX, and PyTorch, as well as custom model formats) and useful environment variables that let you optimize performance on SageMaker. Using SageMaker Deep Learning Containers (DLC) images is recommended because they’re maintained and regularly updated with security patches.

Solution walkthrough

For this post, we deploy two different types of ensembles on a GPU instance, using Triton and a single SageMaker endpoint.

The first ensemble consists of two models: a DALI model for image preprocessing and a TensorFlow Inception v3 model for actual inference. The pipeline ensemble takes encoded images as an input, which have to be decoded, resized to 299×299 resolution, and normalized. This preprocessing is handled by the DALI model. DALI is an open-source library for common image and speech preprocessing tasks such as decoding and data augmentation. Inception v3 is an image recognition model that consists of symmetric and asymmetric convolutions, average and max pooling, and fully connected layers (and is therefore well suited to GPU usage).

The second ensemble transforms raw natural language sentences into embeddings and consists of three models. First, a preprocessing model (implemented in Python) tokenizes the input text. Then we use a pre-trained BERT (uncased) model from the Hugging Face Model Hub to extract token embeddings. BERT is an English language model that was trained using a masked language modeling (MLM) objective. Finally, we apply a postprocessing model where the raw token embeddings from the previous step are combined into sentence embeddings.

After we configure Triton to use these ensembles, we show how to configure and run the SageMaker MME.

Finally, we provide an example of each ensemble invocation, as can be seen in the following diagram:

Ensemble 1 – Invoke the endpoint with an image, specifying DALI-Inception as the target ensemble
Ensemble 2 – Invoke the same endpoint, this time with text input and requesting the preprocess-BERT-postprocess ensemble

Set up the environment

First, we set up the needed environment. This includes updating AWS libraries (like Boto3 and the SageMaker SDK) and installing the dependencies required to package our ensembles and run inferences using Triton. We also use the SageMaker SDK default execution role. We use this role to enable SageMaker to access Amazon S3 (where our model artifacts are stored) and the container registry (where the NVIDIA Triton image is pulled from). See the following code:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import nvidia.dali as dali
import nvidia.dali.types as types

# SageMaker variables
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
role = get_execution_role()

# Other Variables
instance_type = "ml.g4dn.4xlarge"
sm_model_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_config_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

Prepare ensembles

In this next step, we prepare the two ensembles: the TensorFlow (TF) Inception with DALI preprocessing and BERT with Python preprocessing and postprocessing.

This entails downloading the pre-trained models, providing the Triton configuration files, and packaging the artifacts to be stored in Amazon S3 before deploying.

Prepare the TF and DALI ensemble

First, we prepare the directories for storing our models and configurations: for the TF Inception (inception_graphdef), for DALI preprocessing (dali), and for the ensemble (ensemble_dali_inception). Because Triton supports model versioning, we also add the model version to the directory path (denoted as 1 because we only have one version). To learn more about the Triton version policy, refer to Version Policy. Next, we download the Inception v3 model, extract it, and copy it to the inception_graphdef model directory. See the following code:

!mkdir -p model_repository/inception_graphdef/1
!mkdir -p model_repository/dali/1
!mkdir -p model_repository/ensemble_dali_inception/1

!wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

!(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
!mv /tmp/inception_v3_2016_08_28_frozen.pb model_repository/inception_graphdef/1/model.graphdef

Now, we configure Triton to use our ensemble pipeline. In a config.pbtxt file, we specify the input and output tensor shapes and types, and the steps the Triton scheduler needs to take (DALI preprocessing and the Inception model for image classification):

%%writefile model_repository/ensemble_dali_inception/config.pbtxt
name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "InceptionV3/Predictions/Softmax"
        value: "OUTPUT"
      }
    }
  ]
}

Next, we configure each of the models. First, the model config for DALI backend:

%%writefile model_repository/dali/config.pbtxt
name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
parameters: [
  {
    key: "num_threads"
    value: { string_value: "12" }
  }
]

Next, the model configuration for TensorFlow Inception v3 we downloaded earlier:

%%writefile model_repository/inception_graphdef/config.pbtxt
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
instance_group [
    {
      kind: KIND_GPU
    }
]

Because this is a classification model, we also need to copy the Inception model labels to the inception_graphdef directory in the model repository. These labels include 1,000 class labels from the ImageNet dataset.

!aws s3 cp s3://sagemaker-sample-files/datasets/labels/inception_labels.txt model_repository/inception_graphdef/inception_labels.txt

Next, we define the DALI pipeline that handles our preprocessing and serialize it to a file. The preprocessing includes reading the image (using CPU), decoding (accelerated using GPU), and resizing and normalizing the image.

@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    """Create a pipeline that decodes, resizes, and normalizes encoded images."""
    # Encoded image bytes are fed into the pipeline under the name DALI_INPUT_0
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT_0")
    # "mixed" takes CPU input and produces decoded RGB images on the GPU
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)  # resize to 299x299
    images = dali.fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="HWC",
        crop=(299, 299),  # crop window matches the resized image
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],  # per-channel mean on the 0-255 scale
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],  # per-channel standard deviation on the 0-255 scale
    )
    return images

pipe().serialize(filename="model_repository/dali/1/model.dali")
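The normalization step subtracts a per-channel mean and divides by a per-channel standard deviation. The following sketch applies that arithmetic to single pixels in plain Python, using the same mean and std values as the pipeline; the helper normalize_pixel is hypothetical and only illustrates the formula.

```python
# Same per-channel constants as the DALI pipeline, on the 0-255 scale.
MEAN = [0.485 * 255, 0.456 * 255, 0.406 * 255]
STD = [0.229 * 255, 0.224 * 255, 0.225 * 255]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 255]."""
    return [(value - mean) / std for value, mean, std in zip(rgb, MEAN, STD)]


# A pixel equal to the channel means maps to zero in every channel.
mean_pixel = normalize_pixel(MEAN)
# A pure white pixel maps to a positive value in every channel.
white = normalize_pixel([255, 255, 255])

print(mean_pixel)
print(white)
```

Because the constants are scaled by 255, the pipeline can consume the decoded 8-bit pixel values directly without first dividing them by 255.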

Finally, we package the artifacts together and upload them as a single object to Amazon S3:

!tar -cvzf model_tf_dali.tar.gz -C model_repository .
model_uri = sagemaker_session.upload_data(
    path="model_tf_dali.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)
print("S3 model uri: {}".format(model_uri))
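The -C model_repository . arguments make tar store paths relative to the repository folder, so each model directory sits at the root of the archive. The following sketch reproduces that layout with Python's tarfile module on a throwaway directory; the file contents are placeholders and the helper package_repository is hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def package_repository(repo_dir: Path, archive_path: Path) -> list:
    """Create a gzip tarball whose member names are relative to repo_dir."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(repo_dir.rglob("*")):
            # arcname drops the repository folder itself from the stored path.
            tar.add(path, arcname=path.relative_to(repo_dir).as_posix(), recursive=False)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in repository with placeholder files.
    repo = Path(tmp) / "model_repository"
    (repo / "dali" / "1").mkdir(parents=True)
    (repo / "dali" / "config.pbtxt").write_text('name: "dali"\n')
    (repo / "dali" / "1" / "model.dali").write_bytes(b"placeholder")
    members = package_repository(repo, Path(tmp) / "model_tf_dali.tar.gz")

print(members)
```

Listing the archive members before uploading is a quick way to confirm that no extra parent folder was included by mistake.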

Prepare the TensorRT and Python ensemble

For this example, we use a pre-trained model from the transformers library.

You can find all models (preprocess and postprocess, along with config.pbtxt files) in the folder ensemble_hf. Our file system structure will include four directories (three for the individual model steps and one for the ensemble) as well as their respective versions:


ensemble_hf
├── bert-trt
│   ├── model.pt
│   └── config.pbtxt
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocess
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── preprocess
    ├── 1
    │   └── model.py
    └── config.pbtxt

In the workspace folder, we provide two scripts: one to convert the model into ONNX format (onnx_exporter.py) and one to compile it with TensorRT (generate_model_trt.sh).

Triton natively supports the TensorRT runtime, which enables you to easily deploy a TensorRT engine, thereby optimizing for a selected GPU architecture.

To make sure we use the TensorRT version and dependencies that are compatible with the ones in our Triton container, we compile the model using the corresponding version of NVIDIA’s PyTorch container image:

model_id = "sentence-transformers/all-MiniLM-L6-v2"
! docker run --gpus=all --rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash generate_model_trt.sh $model_id

We then copy the model artifacts to the directory we created earlier and add a version to the path:

! mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*

We use a Conda pack to generate a Conda environment that the Triton Python backend will use in preprocessing and postprocessing:

!bash conda_dependencies.sh
!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/
!rm processing_env.tar.gz

Finally, we upload the model artifacts to Amazon S3:

!tar -C ensemble_hf/ -czf model_trt_python.tar.gz .
model_uri = sagemaker_session.upload_data(
    path="model_trt_python.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)

print("S3 model uri: {}".format(model_uri))

Run ensembles on a SageMaker MME GPU instance

Now that our ensemble artifacts are stored in Amazon S3, we can configure and launch the SageMaker MME.

We start by retrieving the URI of the Triton DLC image hosted in the container registry for our Region (this is the image the TensorRT model was compiled for):

account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}
region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError(f"Unsupported Region: {region}")
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

Next, we create the model in SageMaker. In the create_model request, we describe the container to use and the location of the model artifacts, and we use the Mode parameter to specify that this is a multi-model endpoint.

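# models_s3_location is not defined earlier in this walkthrough; this line assumes
# it is the S3 prefix that holds both archives uploaded above
models_s3_location = model_uri.rsplit("/", 1)[0] + "/"

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": models_s3_location,
    "Mode": "MultiModel",
}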

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

To host our ensembles, we create an endpoint configuration with the create_endpoint_config API call, and then create an endpoint with the create_endpoint API. SageMaker then deploys all the containers that you defined for the model in the hosting environment.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Although in this example we are setting a single instance to host our model, SageMaker MMEs fully support setting an auto scaling policy. For more information on this feature, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.
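
As a rough illustration of what such a policy could look like, the following minimal sketch builds the arguments for the two Application Auto Scaling calls (register_scalable_target and put_scaling_policy). It is not the configuration used in this post: the endpoint name, capacity limits, target value, cooldowns, and the choice of the predefined SageMakerVariantInvocationsPerInstance metric are all assumptions made for the example.

```python
def autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                         min_capacity=1, max_capacity=4,
                         target_invocations_per_instance=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy.

    All default values here are illustrative assumptions, not recommendations.
    """
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register_kwargs = {
        **common,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy_kwargs = {
        **common,
        "PolicyName": f"{endpoint_name}-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds, assumed
            "ScaleOutCooldown": 60,   # seconds, assumed
        },
    }
    return register_kwargs, policy_kwargs


# "my-mme-endpoint" is a placeholder endpoint name.
register_kwargs, policy_kwargs = autoscaling_requests("my-mme-endpoint")

# With AWS credentials configured, the requests would be applied like this:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**register_kwargs)
#   aas.put_scaling_policy(**policy_kwargs)
```

Building the arguments separately from the API calls keeps the policy easy to review before anything is applied to a live endpoint.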

Create request payloads and invoke the MME for each model

After our real-time MME is deployed, it’s time to invoke our endpoint with each of the model ensembles we used.

First, we create a payload for the DALI-Inception ensemble. We use the shiba_inu_dog.jpg image from the SageMaker public dataset of pet images. We load the image as an encoded array of bytes to use in the DALI backend (to learn more, see Image Decoder examples).

sample_img_fname = "shiba_inu_dog.jpg"

import numpy as np

s3_client = boto3.client("s3")
s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", sample_img_fname
)

def load_image(img_path):
    """
    Loads image as an encoded array of bytes.
    This is a typical approach you want to use in DALI backend
    """
    with open(img_path, "rb") as f:
        img = f.read()
        return np.array(list(img)).astype(np.uint8)
    
rv = load_image(sample_img_fname)
print(f"Shape of image {rv.shape}")

rv2 = np.expand_dims(rv, 0)
print(f"Shape of expanded image array {rv2.shape}")

payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": rv2.shape,
            "datatype": "UINT8",
            "data": rv2.tolist(),
        }
    ]
}

With our encoded image and payload ready, we invoke the endpoint.

Note that we specify our target ensemble to be the model_tf_dali.tar.gz artifact. The TargetModel parameter is what differentiates MMEs from single-model endpoints and enables us to direct the request to the right model.

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="model_tf_dali.tar.gz",
)

The response includes metadata about the invocation (such as model name and version) and the actual inference response in the data part of the output object. In this example, we get an array of 1,001 values, where each value is the probability of the class the image belongs to (1,000 classes and 1 extra for others).
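
To turn that array into a prediction, the response body can be decoded and ranked. The following is a minimal sketch that assumes the body is JSON in the layout of Triton's HTTP inference protocol, with the class scores in the data field of the first entry of outputs. The helper name top_k_classes, the output name OUTPUT, and the four-class body are made up so the example runs without an endpoint.

```python
import json


def top_k_classes(response_body, k=5):
    """Return the k (class_index, score) pairs with the highest scores."""
    result = json.loads(response_body)
    scores = result["outputs"][0]["data"]  # flat list of per-class scores
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Synthetic stand-in for response["Body"].read(): 4 classes instead of 1,001.
fake_body = json.dumps({
    "model_name": "example",
    "model_version": "1",
    "outputs": [
        {"name": "OUTPUT", "datatype": "FP32", "shape": [1, 4],
         "data": [0.05, 0.7, 0.05, 0.2]}
    ],
})

top2 = top_k_classes(fake_body, k=2)
print(top2)
```

With a live endpoint, the same function would be called on response["Body"].read().
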

Next, we invoke our MME again, but this time target the second ensemble. Here the data is just two simple text sentences:

text_inputs = ["Sentence 1", "Sentence 2"]

To simplify communication with Triton, the Triton project provides several client libraries. We use the Python HTTP client library to prepare the payload in our request:

import tritonclient.http as http_client

text_inputs = ["Sentence 1", "Sentence 2"]
inputs = []
inputs.append(http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES"))
batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]
input0_real = np.array(batch_request, dtype=np.object_)
inputs[0].set_data_from_numpy(input0_real, binary_data=True)
outputs = []
outputs.append(http_client.InferRequestedOutput("finaloutput"))
request_body, header_length = http_client.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

Now we are ready to invoke the endpoint—this time, the target model is the model_trt_python.tar.gz ensemble:

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="model_trt_python.tar.gz"
)

The response is the sentence embeddings that can be used in a variety of natural language processing (NLP) applications.
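
Because the request used Triton's binary+json format, the response body is also split into a JSON header and a binary section. The sketch below assumes the response ContentType carries the header size after the same json-header-size= prefix used in the request; the helper name json_header_size is hypothetical. The commented lines show how the result could then be decoded with the client library's parse_response_body.

```python
PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def json_header_size(content_type):
    """Extract the JSON header length from a binary+json content type."""
    if not content_type.startswith(PREFIX):
        raise ValueError(f"Unexpected content type: {content_type}")
    return int(content_type[len(PREFIX):])


# Synthetic content type so the example runs without an endpoint.
header_size = json_header_size(PREFIX + "128")
print(header_size)

# With a live response (requires tritonclient and the deployed endpoint):
#   result = http_client.InferenceServerClient.parse_response_body(
#       response["Body"].read(),
#       header_length=json_header_size(response["ContentType"]),
#   )
#   embeddings = result.as_numpy("finaloutput")
```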

Clean up

Lastly, we clean up and delete the endpoint, endpoint configuration, and model:

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

Conclusion

In this post, we showed how to configure, deploy, and invoke a SageMaker MME with Triton ensembles on a GPU-accelerated instance. We hosted two ensembles on a single real-time inference environment, which reduced our cost by 50% (for a g4dn.4xlarge instance, which represents over $13,000 in yearly savings). Although this example used only two pipelines, SageMaker MMEs can support thousands of model ensembles, making it an extraordinary cost savings mechanism. Furthermore, you can use SageMaker MMEs’ dynamic ability to load (and unload) models to minimize the operational overhead of managing model deployments in production.
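
The arithmetic behind these figures can be checked with a few lines. This is a back-of-the-envelope sketch that assumes the baseline is one always-on instance per ensemble (two endpoints) and that the MME replaces them with a single instance; the hourly rate is not taken from a price list but implied by the stated yearly savings.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours, ignoring leap years
yearly_savings_usd = 13_000        # lower bound stated in the post

# Consolidating two always-on endpoints into one removes one instance-year.
endpoints_before, endpoints_after = 2, 1
saving_fraction = 1 - endpoints_after / endpoints_before

# Hourly instance price implied by the yearly savings (a lower bound).
implied_hourly_rate = yearly_savings_usd / HOURS_PER_YEAR

print(f"{saving_fraction:.0%} saved, implied rate >= {implied_hourly_rate:.2f} USD/hour")
```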

About the authors

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, backpacking, and backpropagating.

Eliuth Triana Isaza is a Developer Relations Manager on the NVIDIA-AWS team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.

NVIDIA H100 Tensor Core GPU Used on New Microsoft Azure Virtual Machine Series Now Generally Available

Microsoft Azure users can now turn to the latest NVIDIA accelerated computing technology to train and deploy their generative AI applications.

Available today, the Microsoft Azure ND H100 v5 VMs, which use NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking, enable scaling generative AI, high performance computing (HPC) and other applications with a click from a browser.

Available to customers across the U.S., the new instance arrives as developers and researchers are using large language models (LLMs) and accelerated computing to uncover new consumer and business use cases.

The NVIDIA H100 GPU delivers supercomputing-class performance through architectural innovations, including fourth-generation Tensor Cores, a new Transformer Engine for accelerating LLMs and the latest NVLink technology that lets GPUs talk to each other at 900GB/sec.

The inclusion of NVIDIA Quantum-2 CX7 InfiniBand with 3,200 Gbps cross-node bandwidth ensures seamless performance across the GPUs at massive scale, matching the capabilities of top-performing supercomputers globally.

Scaling With v5 VMs

ND H100 v5 VMs are ideal for training and running inference for increasingly complex LLMs and computer vision models. These neural networks drive the most demanding and compute-intensive generative AI applications, including question answering, code generation, audio, video and image generation, speech recognition and more.

The ND H100 v5 VMs achieve up to 2x speedup in LLMs like the BLOOM 175B model for inference versus previous generation instances, demonstrating their potential to further optimize AI applications.

NVIDIA and Azure

NVIDIA H100 Tensor Core GPUs on Azure provide enterprises the performance, versatility and scale to supercharge their AI training and inference workloads. The combination streamlines the development and deployment of production AI with the NVIDIA AI Enterprise software suite integrated with Azure Machine Learning for MLOps, and delivers record-setting AI performance in industry-standard MLPerf benchmarks.

In addition, by connecting the NVIDIA Omniverse platform to Azure, NVIDIA and Microsoft are providing hundreds of millions of Microsoft enterprise users with access to powerful industrial digitalization and AI supercomputing resources.

Learn more about new Azure v5 instances powered by NVIDIA H100 GPUs.

AWS performs fine-tuning on a Large Language Model (LLM) to classify toxic speech for a large gaming company

The video gaming industry has an estimated user base of over 3 billion worldwide¹. It consists of a massive number of players virtually interacting with each other every single day. Unfortunately, as in the real world, not all players communicate appropriately and respectfully. In an effort to create and maintain a socially responsible gaming environment, AWS Professional Services was asked to build a mechanism that detects inappropriate language (toxic speech) within online gaming player interactions. The overall business outcome was to improve the organization’s operations by automating an existing manual process and to improve user experience by increasing speed and quality in detecting inappropriate interactions between players, ultimately promoting a cleaner and healthier gaming environment.

The customer ask was to create an English language detector that classifies voice and text excerpts into their own custom defined toxic language categories. They wanted to first determine if the given language excerpt is toxic, and then classify the excerpt in a specific customer-defined category of toxicity such as profanity or abusive language.

AWS ProServe solved this use case through a joint effort between the Generative AI Innovation Center (GAIIC) and the ProServe ML Delivery Team (MLDT). The AWS GAIIC is a group within AWS ProServe that pairs customers with experts to develop generative AI solutions for a wide range of business use cases using proof of concept (PoC) builds. AWS ProServe MLDT then takes the PoC through production by scaling, hardening, and integrating the solution for the customer.

This customer use case will be showcased in two separate posts. This post (Part 1) serves as a deep dive into the scientific methodology. It will explain the thought process and experimentation behind the solution, including the model training and development process. Part 2 will delve into the productionized solution, explaining the design decisions, data flow, and illustration of the model training and deployment architecture.

This post covers the following topics:

The challenges AWS ProServe had to solve for this use case
Historical context about large language models (LLMs) and why this technology is a perfect fit for this use case
AWS GAIIC’s PoC and AWS ProServe MLDT’s solution from a data science and machine learning (ML) perspective

Data challenge

The main challenge AWS ProServe faced with training a toxic language classifier was obtaining enough labeled data from the customer to train an accurate model from scratch. AWS received about 100 samples of labeled data from the customer, which is a lot less than the 1,000 samples recommended for fine-tuning an LLM in the data science community.

As an added inherent challenge, natural language processing (NLP) classifiers are historically known to be very costly to train and require a large set of vocabulary, known as a corpus, to produce accurate predictions. A rigorous and effective NLP solution, if provided sufficient amounts of labeled data, would be to train a custom language model using the customer’s labeled data. The model would be trained solely with the players’ game vocabulary, making it tailored to the language observed in the games. The customer had both cost and time constraints that made this solution unviable. AWS ProServe was forced to find a solution to train an accurate language toxicity classifier with a relatively small labeled dataset. The solution lay in what’s known as transfer learning.

The idea behind transfer learning is to use the knowledge of a pre-trained model and apply it to a different but relatively similar problem. For example, if an image classifier was trained to predict if an image contains a cat, you could use the knowledge that the model gained during its training to recognize other animals like tigers. For this language use case, AWS ProServe needed to find a previously trained language classifier that was trained to detect toxic language and fine-tune it using the customer’s labeled data.

The solution was to find and fine-tune an LLM to classify toxic language. LLMs are neural networks with a massive number of parameters, typically on the order of billions, that are trained on unlabeled data. Before going into the AWS solution, the following section provides an overview of the history of LLMs and their historical use cases.

Tapping into the power of LLMs

LLMs have recently become the focal point for businesses looking for new applications of ML, ever since ChatGPT captured the public mindshare by being the fastest growing consumer application in history², reaching 100 million active users by January 2023, just 2 months after its release. However, LLMs are not a new technology in the ML space. They have been used extensively to perform NLP tasks such as analyzing sentiment, summarizing corpuses, extracting keywords, translating speech, and classifying text.

Due to the sequential nature of text, recurrent neural networks (RNNs) had been the state of the art for NLP modeling. Specifically, the encoder-decoder network architecture was formulated because it created an RNN structure capable of taking an input of arbitrary length and generating an output of arbitrary length. This was ideal for NLP tasks like translation where an output phrase of one language could be predicted from an input phrase of another language, typically with differing numbers of words between the input and output. The Transformer architecture³ (Vaswani, 2017) was a breakthrough improvement on the encoder-decoder; it introduced the concept of self-attention, which allowed the model to focus its attention on different words on the input and output phrases. In a typical encoder-decoder, each word is interpreted by the model in an identical fashion. As the model sequentially processes each word in an input phrase, the semantic information at the beginning may be lost by the end of the phrase. The self-attention mechanism changed this by adding an attention layer to both the encoder and decoder block, so that the model could put different weightings on certain words from the input phrase when generating a certain word in the output phrase. Thus the basis of the transformer model was born.
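
The weighting described above can be made concrete with the scaled dot-product attention formula from the Transformer paper, softmax(QKᵀ / √d_k)V. The following is a minimal pure-Python sketch: the token vectors are made-up numbers, and the learned projection matrices that produce queries, keys, and values in a real Transformer are omitted, so the same vectors are used for all three.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) where weights[i][j] is how much
    query i attends to key j."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors, with larger weights on the tokens most similar to the query.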

The transformer architecture was the foundation for two of the most well-known and popular LLMs in use today, the Bidirectional Encoder Representations from Transformers (BERT)⁵ (Devlin, 2018) and the Generative Pretrained Transformer (GPT)⁴ (Radford, 2018). Later versions of the GPT model, namely GPT3 and GPT4, are the engine that powers the ChatGPT application. The final piece of the recipe that makes LLMs so powerful is the ability to distill information from vast text corpuses without extensive labeling or preprocessing via a process called ULMFiT. This method has a pre-training phase where general text can be gathered and the model is trained on the task of predicting the next word based on previous words; the benefit here is that any input text used for training comes inherently prelabeled based on the order of the text. LLMs are truly capable of learning from internet-scale data. For example, the original BERT model was pre-trained on the BookCorpus and entire English Wikipedia text datasets.
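
To illustrate why such text comes "inherently prelabeled", the following minimal sketch turns a raw sentence into (context, next word) training pairs. The function name, the whitespace tokenization, and the context size are assumptions made for the example; real LLMs use subword tokenizers and much longer contexts.

```python
def next_word_examples(text, context_size=3):
    """Build (context, target) pairs: each word is the label for the words before it."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]
        examples.append((context, words[i]))
    return examples


pairs = next_word_examples("players chat with each other every day")
for context, target in pairs:
    print(context, "->", target)
```

No human annotation is needed: the label for every example is simply the word that follows in the original text.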

This new modeling paradigm has given rise to two new concepts: foundation models (FMs) and Generative AI. As opposed to training a model from scratch with task-specific data, which is the usual case for classical supervised learning, LLMs are pre-trained to extract general knowledge from a broad text dataset before being adapted to specific tasks or domains with a much smaller dataset (typically on the order of hundreds of samples). The new ML workflow now starts with a pre-trained model dubbed a foundation model. It’s important to build on the right foundation, and there are an increasing number of options, such as the new Amazon Titan FMs, to be released by AWS as part of Amazon Bedrock. These new models are also considered generative because their outputs are human interpretable and in the same data type as the input data. While past ML models were descriptive, such as classifying images of cats vs. dogs, LLMs are generative because their output is the next set of words based on input words. That allows them to power interactive applications such as ChatGPT that can be expressive in the content they generate.

Hugging Face has partnered with AWS to democratize FMs and make them easy to access and build with. Hugging Face has created a Transformers API that unifies more than 50 different transformer architectures on different ML frameworks, including access to pre-trained model weights in their Model Hub, which has grown to over 200,000 models as of writing this post. In the next sections, we explore the proof of concept, the solution, and the FMs that were tested and chosen as the basis for solving this toxic speech classification use case for the customer.

AWS GAIIC proof of concept

AWS GAIIC chose to experiment with LLM foundation models with the BERT architecture to fine-tune a toxic language classifier. A total of three models from Hugging Face’s model hub were tested:

bertweet-base
bertweet-base-offensive
bertweet-base-hate

All three model architectures are based on the BERTweet architecture. BERTweet is trained based on the RoBERTa pre-training procedure. The RoBERTa pre-training procedure is an outcome of a replication study of BERT pre-training that evaluated the effects of hyperparameter tuning and training set size to improve the recipe for training BERT models⁶(Liu 2019). The experiment sought to find a pre-training method that improved the performance results of BERT without changing the underlying architecture. The conclusion of the study found that the following pre-training modifications substantially improved the performance of BERT:

Training the model with bigger batches over more data
Removing the next sentence prediction objective
Training on longer sequences
Dynamically changing the masking pattern applied to the training data
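
The last modification, dynamic masking, can be sketched as follows. Instead of fixing the masked positions once during preprocessing, a new set of positions is sampled each time a sequence is seen. This is a simplified illustration, not the RoBERTa implementation: it uses whitespace tokens, always substitutes a mask token (the real procedure sometimes keeps or randomly replaces the chosen token), and the function name and seeding scheme are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"


def dynamic_mask(tokens, epoch, mask_prob=0.15, base_seed=0):
    """Return a copy of tokens with a freshly sampled set of positions masked."""
    rng = random.Random(base_seed + epoch)  # new random stream for each epoch
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK_TOKEN if i in positions else tok
            for i, tok in enumerate(tokens)]


tokens = "gg that was a great round see you all in the next match tonight".split()
for epoch in range(3):
    print(epoch, " ".join(dynamic_mask(tokens, epoch)))
```

Because the positions are re-sampled per epoch, the model is generally asked to predict different tokens of the same sentence over the course of training.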

The bertweet-base model uses the preceding pre-training procedure from the RoBERTa study to pre-train the original BERT architecture using 850 million English tweets. It is the first public large-scale language model pre-trained for English tweets.

Pre-trained FMs using tweets were thought to fit the use case for two main theoretical reasons:

The length of a tweet is very similar to the length of an inappropriate or toxic phrase found in online game chats
Tweets come from a population with a large variety of different users, similar to that of the population found in gaming platforms

AWS decided to first fine-tune BERTweet with the customer’s labeled data to get a baseline. The team then fine-tuned two other FMs, bertweet-base-offensive and bertweet-base-hate, which were further pre-trained on more relevant toxic tweets, to achieve potentially higher accuracy. The bertweet-base-offensive model uses the base BERTweet FM and is further pre-trained on 14,100 annotated tweets that were deemed offensive⁷ (Zampieri 2019). The bertweet-base-hate model also uses the base BERTweet FM but is further pre-trained on 19,600 tweets that were deemed hate speech⁸ (Basile 2019).

To further enhance the performance of the PoC model, AWS GAIIC made two design decisions:

Created a two-stage prediction flow where the first model acts as a binary classifier that classifies whether a piece of text is toxic or not toxic. The second model is a fine-grained model that classifies text based on the customer’s defined toxic types. Only if the first model predicts the text as toxic does it get passed to the second model.
Augmented the training data and added a subset of a third-party-labeled toxic text dataset from a public Kaggle competition (Jigsaw Toxicity) to the original 100 samples received from the customer. They mapped the Jigsaw labels to the associated customer-defined toxicity labels and did an 80% split as training data and 20% split as test data to validate the model.
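
The two-stage prediction flow described above can be sketched as a small gating function. The two classifiers are passed in as plain callables, and the keyword-based stubs stand in for the fine-tuned models only so that the example runs; the function names and the stub logic are hypothetical, and the category label is one of the examples mentioned earlier in this post.

```python
from typing import Callable


def two_stage_predict(text: str,
                      is_toxic: Callable[[str], bool],
                      toxic_type: Callable[[str], str]) -> str:
    """Run the fine-grained model only when the binary model flags the text."""
    if not is_toxic(text):
        return "non-toxic"
    return toxic_type(text)


# Stubs standing in for the two fine-tuned models (illustrative only).
def stub_binary(text: str) -> bool:
    return "idiot" in text.lower()


def stub_fine_grained(text: str) -> str:
    return "abusive language"


print(two_stage_predict("good game everyone", stub_binary, stub_fine_grained))
print(two_stage_predict("you idiot", stub_binary, stub_fine_grained))
```

The gate means the second model never sees text the first model considers clean, which is also why both models must be monitored and retrained separately.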

AWS GAIIC used Amazon SageMaker notebooks to run their fine-tuning experiments and found that the bertweet-base-offensive model achieved the best scores on the validation set. The following table summarizes the observed metric scores.

Model	Precision	Recall	F1	AUC
Binary	.92	.90	.91	.92
Fine-grained	.81	.80	.81	.89

From this point, GAIIC handed off the PoC to the AWS ProServe ML Delivery Team to productionize the PoC.

AWS ProServe ML Delivery Team solution

To productionize the model architecture, the AWS ProServe ML Delivery Team (MLDT) was asked by the customer to create a solution that is scalable and easy to maintain. There were a few maintenance challenges of a two-stage model approach:

The models would require double the amount of model monitoring, which makes retraining timing inconsistent. There may be times that one model will have to be retrained more often than the other.
Increased costs of running two models as opposed to one.
The speed of inference slows because inference goes through two models.

To address these challenges, AWS ProServe MLDT had to figure out how to turn the two-stage model architecture into a single model architecture while still being able to maintain the accuracy of the two-stage architecture.

The solution was to first ask the customer for more training data, then to fine-tune the bertweet-base-offensive model on all the labels, including non-toxic samples, into one model. The idea was that fine-tuning one model with more data would produce results similar to those of a two-stage model architecture fine-tuned on less data. To fine-tune the one-stage model, AWS ProServe MLDT updated the pre-trained model’s multi-label classification head to include one extra node to represent the non-toxic class.
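
As a small illustration of the extra node, the label set for the single model is the customer's toxic categories plus one non-toxic class. The category names below are hypothetical placeholders (the customer's actual categories are not listed in this post), and the commented call shows one way the mapping could be passed to Hugging Face Transformers.

```python
# Hypothetical toxic categories; the real customer-defined list is not public.
toxic_categories = ["profanity", "abusive language"]

# One extra class so a single model can also predict "not toxic".
labels = ["non-toxic"] + toxic_categories
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
num_labels = len(labels)

print(num_labels, label2id)

# Possible use when loading the model (requires transformers and a download):
#   model = AutoModelForSequenceClassification.from_pretrained(
#       model_checkpoint,
#       num_labels=num_labels,
#       id2label=id2label,
#       label2id=label2id,
#       ignore_mismatched_sizes=True,  # checkpoint head has a different size
#   )
```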

The following is a code sample of how you would fine-tune a pre-trained model from the Hugging Face model hub using their transformers platform and alter the model’s multi-label classification head to predict the desired number of classes. AWS ProServe MLDT used this blueprint as its basis for fine-tuning. It assumes that you have your train data and validation data ready and in the correct input format.

First, Python modules are imported as well as the desired pre-trained model from the Hugging Face model hub:

# Imports.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the tokenizer that matches the pretrained model from the model hub.
model_checkpoint = "cardiffnlp/bertweet-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The pre-trained model then gets loaded and prepped for fine-tuning. This is the step where the number of toxic categories and all model parameters get defined:

# Load pretrained model into a sequence classifier to be fine-tuned and define the number of classes you want to classify in the num_labels parameter.

model = AutoModelForSequenceClassification.from_pretrained(
            model_checkpoint,
            num_labels=[number of classes]
        )

# Set your training parameter arguments. The below are some key parameters that AWS ProServe MLDT tuned.
# Replace each [enter input] placeholder with your own value.
training_args = TrainingArguments(
        output_dir=[enter input],  # directory where checkpoints are written
        num_train_epochs=[enter input],
        per_device_train_batch_size=[enter input],
        per_device_eval_batch_size=[enter input],
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        learning_rate=[enter input],
        load_best_model_at_end=True,
        metric_for_best_model=[enter input],
        optim=[enter input],
    )

Model fine-tuning starts by passing the training and validation datasets to the Trainer:

# Pad each batch dynamically to the length of its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Finetune the model from the model_checkpoint, tokenizer, and training_args defined assuming train and validation datasets are correctly preprocessed.
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=[enter input],
        eval_dataset=[enter input],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

# Finetune model command.
trainer.train()
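
The metric_for_best_model argument only has an effect if the Trainer is given a function that computes that metric during evaluation. The following is a minimal sketch of such a function; the post does not state which metric was used for model selection, so accuracy and macro-averaged F1 here are assumptions, and the metric names are hypothetical.

```python
def compute_metrics(eval_pred):
    """Compute accuracy and macro F1 from (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]

    f1_scores = []
    for c in sorted(set(labels) | set(preds)):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Toy logits and labels for a two-class problem (made-up numbers).
toy_logits = [[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]]
toy_labels = [0, 1, 1, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

It would be passed to the Trainer as compute_metrics=compute_metrics, with metric_for_best_model set to one of the returned keys, for example "macro_f1".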

AWS ProServe MLDT received approximately 5,000 more labeled data samples, 3,000 being non-toxic and 2,000 being toxic, and fine-tuned all three bertweet-base models, combining all labels into one model. They used this data in addition to the 5,000 samples from the PoC to fine-tune new one-stage models using the same 80% train set, 20% test set method. The following table shows that the performance scores were comparable to those of the two-stage model.

Model	Precision	Recall	F1	AUC
bertweet-base (1-Stage)	.76	.72	.74	.83
bertweet-base-hate (1-Stage)	.85	.82	.84	.87
bertweet-base-offensive (1-Stage)	.88	.83	.86	.89
bertweet-base-offensive (2-Stage)	.91	.90	.90	.92

The one-stage model approach delivered the cost and maintenance improvements while decreasing precision by only 3 percentage points (from .91 to .88 for bertweet-base-offensive). After weighing the trade-offs, the customer opted for AWS ProServe MLDT to productionize the one-stage model.

By fine-tuning one model with more labeled data, AWS ProServe MLDT was able to deliver a solution that met the customer’s threshold for model accuracy, as well as deliver on their ask for ease of maintenance, while lowering cost and increasing robustness.

Conclusion

A large gaming customer was looking for a way to detect toxic language within their communication channels to promote a socially responsible gaming environment. AWS GAIIC created a PoC of a toxic language detector by fine-tuning an LLM to detect toxic language. AWS ProServe MLDT then updated the model training flow from a two-stage approach to a one-stage approach and productionized the LLM for the customer to be used at scale.

In this post, AWS demonstrates the effectiveness and practicality of fine-tuning an LLM to solve this customer use case, shares context on the history of foundation models and LLMs, and introduces the workflow between the AWS Generative AI Innovation Center and the AWS ProServe ML Delivery Team. In the next post in this series, we will dive deeper into how AWS ProServe MLDT productionized the resulting one-stage model using SageMaker.

If you are interested in working with AWS to build a Generative AI solution, please reach out to the GAIIC. They will assess your use case, build out a Generative-AI-based proof of concept, and have options to extend collaboration with AWS to implement the resulting PoC into production.

References

Gamer Demographics: Facts and Stats About the Most Popular Hobby in the World
ChatGPT sets record for fastest-growing user base – analyst note
Vaswani et al., “Attention is All You Need”
Radford et al., “Improving Language Understanding by Generative Pre-Training”
Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”
Yinhan Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
Marcos Zampieri et al., “SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)”
Valerio Basile et al., “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”

About the authors

James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.

Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and GenAI solutions on AWS.

Safa Tinaztepe is a full-stack data scientist with AWS Professional Services. He has a BS in computer science from Emory University and has interests in MLOps, distributed systems, and web3.

INT8 Quantization for x86 CPU in PyTorch

Overview

INT8 quantization is a powerful technique for speeding up deep learning inference on x86 CPU platforms. By reducing the precision of the model’s weights and activations from 32-bit floating-point (FP32) to 8-bit integer (INT8), INT8 quantization can significantly improve the inference speed and reduce memory requirements without sacrificing accuracy.
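
To make the FP32-to-INT8 mapping concrete, the following is a minimal pure-Python sketch of affine (scale and zero-point) quantization of a handful of values. It illustrates the general idea only and is not how PyTorch implements its kernels; the signed range of -128 to 127 and the sample values are assumptions.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Choose scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an 8-bit integer, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an 8-bit integer back to an approximate float."""
    return (q - zero_point) * scale


values = [-1.0, -0.25, 0.0, 0.5, 1.5]  # made-up FP32 activations
scale, zero_point = quant_params(min(values), max(values))
quantized = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in quantized]
print(quantized)
print([round(r, 4) for r in restored])
```

For values inside the calibrated range, the round-trip error is at most half of one quantization step (scale / 2), which is why a well-chosen range is important for preserving accuracy.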

In this blog, we will discuss the recent progress on INT8 quantization for x86 CPU in PyTorch, focusing on the new x86 quantization backend. We will also briefly look at the new quantization path with PyTorch 2.0 Export (PT2E) and TorchInductor.

X86 Quantization Backend

The current recommended way of quantization in PyTorch is FX graph mode quantization. Before PyTorch 2.0, the default quantization backend (a.k.a. QEngine) on x86 CPUs was FBGEMM, which leveraged the FBGEMM performance library to achieve the performance speedup. In the PyTorch 2.0 release, a new quantization backend called X86 was introduced to replace FBGEMM. The x86 quantization backend offers improved INT8 inference performance when compared to the original FBGEMM backend by leveraging the strengths of both FBGEMM and the Intel® oneAPI Deep Neural Network Library (oneDNN) kernel libraries.

Performance Benefit from X86 Backend

To measure the performance benefits of the new X86 backend, we ran INT8 inference on 69 popular deep learning models (shown in Figures 1-3 below) using 4th Gen Intel® Xeon® Scalable processors. The results showed a 2.97X geomean performance speedup compared to FP32 inference performance, while the speedup was 1.43X with the FBGEMM backend. The charts below show the per-model performance speedup comparing the x86 backend and the FBGEMM backend.

Figure 1: Models with less than 2x performance boost with x86 backend¹

Figure 2: Models with 2x-4x performance boost with x86 backend¹

Figure 3: Models with larger than 4x performance boost with x86 backend¹
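
For readers unfamiliar with the geomean figure quoted above, the geometric mean is the n-th root of the product of the per-model speedups, which keeps a few very large speedups from dominating the summary. The following minimal sketch uses made-up speedup values, not the benchmark data.

```python
from statistics import geometric_mean

# Hypothetical per-model speedups over FP32 (not the measured results).
speedups = [1.5, 2.0, 4.0, 6.0]

geo = geometric_mean(speedups)
arith = sum(speedups) / len(speedups)
print(f"geomean = {geo:.2f}x, arithmetic mean = {arith:.2f}x")
```

The geometric mean is never larger than the arithmetic mean, so it is the more conservative way to summarize speedup ratios across many models.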

Usage of x86 Backend

By default in PyTorch 2.0, users on x86 platforms use the x86 quantization backend, and their PyTorch programs remain unchanged when using the default backend. Alternatively, users can specify x86 as the quantization backend explicitly.

Below is an example code snippet of PyTorch static post-training quantization with the x86 quantization backend.

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping()
# Or explicitly specify the qengine
# qengine = 'x86'
# torch.backends.quantized.engine = qengine
# qconfig_mapping = get_default_qconfig_mapping(qengine)

# MyModel is a placeholder for your own FP32 torch.nn.Module
model_fp32 = MyModel().eval()
x = torch.randn((1, 3, 224, 224), dtype=torch.float)
x = x.to(memory_format=torch.channels_last)

# Insert observers according to qconfig and backend config
prepared_model = prepare_fx(model_fp32, qconfig_mapping, example_inputs=x)

# Calibration code not shown

# Convert to quantized model
quantized_model = convert_fx(prepared_model)
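
The calibration step omitted above feeds representative inputs through the prepared model so that the inserted observers can record value ranges. As a rough, framework-free illustration of what such a range is used for, the sketch below tracks a running min/max and turns it into an unsigned 8-bit scale and zero point. It is a simplified min/max scheme written for this explanation, not PyTorch's observer implementation, and all names and data in it are hypothetical.

```python
class RangeTracker:
    """Records the smallest and largest value seen during calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def qparams(self, qmin=0, qmax=255):
        # Keep zero inside the range so it maps to an exact integer.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
        zero_point = int(round(qmin - lo / scale))
        return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer in [qmin, qmax], clamping out-of-range values."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


tracker = RangeTracker()
for batch in ([-1.0, 0.25, 0.5], [0.0, 1.5, 3.0]):  # hypothetical calibration data
    tracker.observe(batch)

scale, zero_point = tracker.qparams()
q = quantize(1.5, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, which is why calibration data should resemble the inputs seen at inference time.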

Technical Details of x86 Backend

We devised heuristic dispatching rules according to the performance numbers from the models we benchmarked to decide whether to invoke oneDNN or FBGEMM performance library to execute the convolution or matrix multiplication operations. The rules are a combination of operation kinds, shapes, CPU architecture information, etc. Detailed logic is available here. For more design and technical discussion, please refer to the Request for Comments.
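
The actual rules live in the PyTorch source referenced above. Purely to illustrate the shape of such a heuristic, the sketch below picks a library from the operation kind, a shape property, and a CPU capability flag. Every rule, threshold, and name in it is a hypothetical placeholder chosen for this example, not the logic PyTorch ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpInfo:
    kind: str              # "conv" or "linear"
    groups: int = 1        # convolution groups (1 for linear)
    out_channels: int = 0  # output channels of the op


def choose_library(op: OpInfo, cpu_has_vnni: bool) -> str:
    """Return "onednn" or "fbgemm" for one op, using placeholder rules."""
    if op.kind == "linear":
        # Placeholder rule: route all matrix multiplications to one library.
        return "fbgemm"
    if op.kind == "conv":
        if not cpu_has_vnni:
            # Placeholder rule: fall back when the CPU lacks the instruction set.
            return "fbgemm"
        if op.groups > 1 and op.groups == op.out_channels:
            # Placeholder rule: treat depthwise convolutions as a special case.
            return "fbgemm"
        return "onednn"
    raise ValueError(f"unsupported op kind: {op.kind}")


print(choose_library(OpInfo("conv", groups=1, out_channels=64), cpu_has_vnni=True))
print(choose_library(OpInfo("linear", out_channels=1000), cpu_has_vnni=True))
```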

Next Steps With a New Quantization Path: PyTorch 2.0 Export

Although still far from finalized, a new quantization path, PyTorch 2.0 Export (PT2E), is in early design and PoC stage. The new approach is slated to replace the FX quantization path in the future. It is built upon the capabilities of TorchDynamo Export, a feature introduced in the PyTorch 2.0 release for FX graph capturing. This graph is then quantized and lowered to different backends. TorchInductor, the new DL compiler of PyTorch, has shown promising results in terms of FP32 inference speedup on x86 CPU. We are working actively to enable it as one of the quantization backends of PT2E. We believe the new path will lead to further improvements in INT8 inference performance due to more flexibility of fusion at different levels.

Conclusion

The x86 backend introduced in the PyTorch 2.0 release has demonstrated a remarkable improvement in INT8 inference speed on x86 CPU platforms. In our benchmarks it delivered a 2.97X geomean speedup over FP32 inference, compared with 1.43X for the original FBGEMM backend, while maintaining backward compatibility. This enhancement can benefit end users with minimal or no modifications to their programs. Furthermore, a new quantization path, PT2E, is currently in development and is expected to provide even more possibilities in the future.

Acknowledgement

Special thanks to Nikita Shulga, Vasiliy Kuznetsov, Supriya Rao, and Jongsoo Park. Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

Configuration

¹ AWS EC2 r7iz.metal-16xl instance (Intel(R) Xeon(R) Gold 6455B, 32-core/64-thread, Turbo Boost On, Hyper-Threading On, Memory: 8x64GB, Storage: 192GB); OS: Ubuntu 22.04.1 LTS; Kernel: 5.15.0-1028-aws; Batch Size: 1; Core per Instance: 4; PyTorch 2.0 RC3; TorchVision 0.15.0+cpu; tested by Intel on 3/7/2023. May not reflect all publicly available security updates.

KDD 2023: Graph neural networks’ new frontiers

Conference general chair and Amazon Scholar Yizhou Sun on modeling long-range dependencies, improving efficiency, and new causal models.

Optimize data preparation with new features in AWS SageMaker Data Wrangler

Data preparation is a critical step in any data-driven project, and having the right tools can greatly enhance operational efficiency. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

In this post, we explore the latest features of SageMaker Data Wrangler that are specifically designed to improve the operational experience. We delve into the support of Amazon Simple Storage Service (Amazon S3) manifest files, inference artifacts in an interactive data flow, and the seamless integration with JSON (JavaScript Object Notation) format for inference, highlighting how these enhancements make data preparation easier and more efficient.

Introducing new features

In this section, we discuss SageMaker Data Wrangler’s new features for optimal data preparation.

S3 manifest file support with SageMaker Autopilot for ML inference

SageMaker Data Wrangler enables a unified data preparation and model training experience with Amazon SageMaker Autopilot in just a few clicks. You can use SageMaker Autopilot to automatically train, tune, and deploy models on the data that you’ve transformed in your data flow.

This experience is now further simplified with S3 manifest file support. An S3 manifest file is a text file that lists the objects (files) stored in an S3 bucket. If your exported dataset in SageMaker Data Wrangler is large and split into multiple part files in Amazon S3, SageMaker Data Wrangler now automatically creates a manifest file in Amazon S3 that represents all of these data files. The generated manifest file can then be used with the SageMaker Autopilot UI in SageMaker Data Wrangler to pick up all the partitioned data for training.

Before this feature launch, when using SageMaker Autopilot models trained on prepared data from SageMaker Data Wrangler, you could only choose one data file, which might not represent the entire dataset, especially if the dataset is very large. With this new manifest file experience, you’re not limited to a subset of your dataset. You can build an ML model with SageMaker Autopilot representing all your data using the manifest file and use that for your ML inference and production deployment. This feature enhances operational efficiency by simplifying training ML models with SageMaker Autopilot and streamlining data processing workflows.
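
To make the manifest idea concrete, the sketch below builds a manifest for a set of part files in the prefix-plus-relative-keys layout that SageMaker accepts for S3 `ManifestFile` data sources. SageMaker Data Wrangler generates this file for you, so the code is only an illustration; the bucket, prefix, and file names are hypothetical.

```python
import json


def build_manifest(s3_prefix, part_keys):
    """Return manifest JSON: a prefix entry followed by keys relative to it."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must be an s3:// URI")
    prefix = s3_prefix if s3_prefix.endswith("/") else s3_prefix + "/"
    return json.dumps([{"prefix": prefix}, *part_keys], indent=2)


# Hypothetical export location and part files.
manifest = build_manifest(
    "s3://example-bucket/data-wrangler-export",
    ["part-00000.csv", "part-00001.csv", "part-00002.csv"],
)
print(manifest)
```

Because the manifest lists every part file, a training job that reads it sees the whole exported dataset, not one arbitrarily chosen file.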

Added support for inference flow in generated artifacts

Customers want to take the data transformations they’ve applied to their model training data, such as one-hot encoding, PCA, and imputing missing values, and apply those data transformations to real-time inference or batch inference in production. To do so, you must have a SageMaker Data Wrangler inference artifact, which is consumed by a SageMaker model.

Previously, inference artifacts could only be generated from the UI when exporting to SageMaker Autopilot training or exporting an inference pipeline notebook. This didn’t provide flexibility if you wanted to take your SageMaker Data Wrangler flows outside of the Amazon SageMaker Studio environment. Now, you can generate an inference artifact for any compatible flow file through a SageMaker Data Wrangler processing job. This enables programmatic, end-to-end MLOps with SageMaker Data Wrangler flows for code-first MLOps personas, as well as an intuitive, no-code path to get an inference artifact by creating a job from the UI.

Streamlining data preparation

JSON has become a widely adopted format for data exchange in modern data ecosystems. SageMaker Data Wrangler’s integration with JSON format allows you to seamlessly handle JSON data for transformation and cleaning. By providing native support for JSON, SageMaker Data Wrangler simplifies the process of working with structured and semi-structured data, enabling you to extract valuable insights and prepare data efficiently. SageMaker Data Wrangler now supports JSON format for both batch and real-time inference endpoint deployment.

Solution overview

For our use case, we use the sample Amazon customer reviews dataset to show how SageMaker Data Wrangler can simplify the operational effort to build a new ML model using SageMaker Autopilot. The Amazon customer reviews dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014.

On a high level, we use SageMaker Data Wrangler to manage this large dataset and perform the following actions:

Develop an ML model in SageMaker Autopilot using all of the dataset, not just a sample.
Build a real-time inference pipeline with the inference artifact generated by SageMaker Data Wrangler, and use JSON formatting for input and output.

S3 manifest file support with SageMaker Autopilot

When creating a SageMaker Autopilot experiment using SageMaker Data Wrangler, you could previously only specify a single CSV or Parquet file. Now you can also use an S3 manifest file, allowing you to use large amounts of data for SageMaker Autopilot experiments. SageMaker Data Wrangler will automatically partition input data files into several smaller files and generate a manifest that can be used in a SageMaker Autopilot experiment to pull in all the data from the interactive session, not just a small sample.

Complete the following steps:

Import the Amazon customer review data from a CSV file into SageMaker Data Wrangler. Make sure to disable sampling when importing the data.
Specify the transformations that normalize the data. For this example, remove symbols and transform everything into lowercase using SageMaker Data Wrangler’s built-in transformations.
Choose Train model to start training.

To train a model with SageMaker Autopilot, SageMaker automatically exports data to an S3 bucket. For large datasets like this one, it will automatically break up the file into smaller files and generate a manifest that includes the location of the smaller files.

First, select your input data.

Earlier, SageMaker Data Wrangler didn’t have an option to generate a manifest file to use with SageMaker Autopilot. Today, with the release of manifest file support, SageMaker Data Wrangler will automatically export a manifest file to Amazon S3, pre-fill the S3 location of the SageMaker Autopilot training with the manifest file S3 location, and toggle the manifest file option to Yes. No work is necessary to generate or use the manifest file.

Configure your experiment by selecting the target for the model to predict.
Next, select a training method. In this case, we select Auto and let SageMaker Autopilot decide the best training method based on the dataset size.

Specify the deployment settings.
Finally, review the job configuration and submit the SageMaker Autopilot experiment for training. When SageMaker Autopilot completes the experiment, you can view the training results and explore the best model.

Thanks to support for manifest files, you can use your entire dataset for the SageMaker Autopilot experiment, not just a subset of your data.

For more information on using SageMaker Autopilot with SageMaker Data Wrangler, see Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot.

Generate inference artifacts from SageMaker Processing jobs

Now, let’s look at how we can generate inference artifacts through both the SageMaker Data Wrangler UI and SageMaker Data Wrangler notebooks.

SageMaker Data Wrangler UI

For our use case, we want to process our data through the UI and then use the resulting data to train and deploy a model through the SageMaker console. Complete the following steps:

Open the data flow you created in the preceding section.
Choose the plus sign next to the last transform, choose Add destination, and choose Amazon S3. This will be where the processed data will be stored.
Choose Create job.
Select Generate inference artifacts in the Inference parameters section to generate an inference artifact.
For Inference artifact name, enter the name of your inference artifact (with .tar.gz as the file extension).
For Inference output node, enter the destination node corresponding to the transforms applied to your training data.
Choose Configure job.
Under Job configuration, enter a path for Flow file S3 location. A folder called data_wrangler_flows will be created under this location, and the inference artifact will be uploaded to this folder. To change the upload location, set a different S3 location.
Leave the defaults for all other options and choose Create to create the processing job.

The processing job will create a tarball (.tar.gz) containing a modified data flow file with a newly added inference section that allows you to use it for inference. You need the S3 uniform resource identifier (URI) of the inference artifact to provide the artifact to a SageMaker model when deploying your inference solution. The URI will be in the form {Flow file S3 location}/data_wrangler_flows/{inference artifact name}.tar.gz.
If you didn’t note these values earlier, you can choose the link to the processing job to find the relevant details. In our example, the URI is s3://sagemaker-us-east-1-43257985977/data_wrangler_flows/example-2023-05-30T12-20-18.tar.gz.
Copy the value of Processing image; we need this URI when creating our model, too.
We can now use this URI to create a SageMaker model on the SageMaker console, which we can later deploy to an endpoint or batch transform job.
Under Model settings, enter a model name and specify your IAM role.
For Container input options, select Provide model artifacts and inference image location.
For Location of inference code image, enter the processing image URI.
For Location of model artifacts, enter the inference artifact URI.
Additionally, if your data has a target column that will be predicted by a trained ML model, specify the name of that column under Environment variables, with INFERENCE_TARGET_COLUMN_NAME as Key and the column name as Value.
Finish creating your model by choosing Create model.

We now have a model that we can deploy to an endpoint or batch transform job.
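
The console steps above map onto the parameters of the SageMaker `CreateModel` API. The sketch below assembles the artifact URI from the pattern given earlier and builds a request dictionary that could be passed to boto3's `create_model`. The bucket, role ARN, image URI, and names are hypothetical placeholders, and the call itself is left commented out so the snippet runs without AWS credentials.

```python
def inference_artifact_uri(flow_s3_location, artifact_name):
    """Build {Flow file S3 location}/data_wrangler_flows/{name}.tar.gz."""
    name = artifact_name if artifact_name.endswith(".tar.gz") else artifact_name + ".tar.gz"
    return f"{flow_s3_location.rstrip('/')}/data_wrangler_flows/{name}"


def create_model_request(model_name, role_arn, image_uri, artifact_uri, target_column=None):
    """Assemble keyword arguments for the SageMaker CreateModel API."""
    container = {"Image": image_uri, "ModelDataUrl": artifact_uri}
    if target_column:
        # Tells the inference container which column the model predicts.
        container["Environment"] = {"INFERENCE_TARGET_COLUMN_NAME": target_column}
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": container,
    }


# All values below are hypothetical placeholders.
artifact_uri = inference_artifact_uri("s3://example-bucket/flows/", "example-artifact")
request = create_model_request(
    model_name="example-data-wrangler-model",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    image_uri="<processing image URI copied from the processing job>",
    artifact_uri=artifact_uri,
    target_column="star_rating",
)
print(request)

# With credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("sagemaker").create_model(**request)
```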

SageMaker Data Wrangler notebooks

For a code-first approach to generate the inference artifact from a processing job, we can find the example code by choosing Export to on the node menu and choosing either Amazon S3, SageMaker Pipelines, or SageMaker Inference Pipeline. We choose SageMaker Inference Pipeline in this example.

In this notebook, there is a section titled Create Processor (this is identical in the SageMaker Pipelines notebook, but in the Amazon S3 notebook, the equivalent code will be under the Job Configurations section). At the bottom of this section is a configuration for our inference artifact called inference_params. It contains the same information that we saw in the UI, namely the inference artifact name and the inference output node. These values will be prepopulated but can be modified. There is additionally a parameter called use_inference_params, which needs to be set to True to use this configuration in the processing job.

Further down is a section titled Define Pipeline Steps, where the inference_params configuration is appended to a list of job arguments and passed into the definition for a SageMaker Data Wrangler processing step. In the Amazon S3 notebook, job_arguments is defined immediately after the Job Configurations section.
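
As a rough picture of that wiring, the sketch below mirrors the flow just described: a configuration dictionary, a `use_inference_params` switch, and a job-arguments list that receives the serialized configuration. The dictionary keys, the command-line argument format, and the example values are assumptions made for illustration; the exported notebook is the authority on the exact form.

```python
import json

# Assumed key names and hypothetical values; the notebook prepopulates its own.
inference_params = {
    "inference_artifact_name": "example-inference-artifact.tar.gz",
    "inference_output_node": "example-output-node-id",
}
use_inference_params = True


def build_job_arguments(base_arguments, params, enabled):
    """Return a copy of the job arguments, adding inference params if enabled."""
    arguments = list(base_arguments)
    if enabled:
        # Assumed argument format: a flag followed by the JSON-encoded config.
        arguments.append(f"--inference-params '{json.dumps(params)}'")
    return arguments


base = ["--output-config '{}'"]  # hypothetical existing job argument
job_arguments = build_job_arguments(base, inference_params, use_inference_params)
print(job_arguments)
```

Keeping the switch separate from the configuration makes it easy to run the same processing job with or without producing an inference artifact.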

With these simple configurations, the processing job created by this notebook will generate an inference artifact in the same S3 location as our flow file (defined earlier in our notebook). We can programmatically determine this S3 location and use this artifact to create a SageMaker model using the SageMaker Python SDK, which is demonstrated in the SageMaker Inference Pipeline notebook.

The same approach can be applied to any Python code that creates a SageMaker Data Wrangler processing job.

JSON file format support for input and output during inference

It’s common for websites and applications to use JSON as the request and response format for APIs so that the information is easy to parse in different programming languages.

Previously, after you had a trained model, you could only interact with it via CSV as an input format in a SageMaker Data Wrangler inference pipeline. Today, you can use JSON as an input and output format, providing more flexibility when interacting with SageMaker Data Wrangler inference containers.

To get started with using JSON for input and output in the inference pipeline notebook, complete the following steps:

Define a payload.

For each payload, the model is expecting a key named instances. The value is a list of objects, each being its own data point. The objects require a key called features, and the values should be the features of a single data point that are intended to be submitted to the model. Multiple data points can be submitted in a single request, up to a total size of 6 MB per request.

See the following code:

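import json

# The opening parenthesis must sit on the same line as json.dumps;
# otherwise the function itself is assigned and never called.
sample_record_payload = json.dumps(
	{
		"instances": [
			{"features": ["This is the best", "I'd use this product twice a day every day if I could. it's the best ever"]}
		]
	}
)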

Specify the ContentType as application/json.
Provide data to the model and receive inference in JSON format.

See Common Data Formats for Inference for sample input and output JSON examples.
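
Putting the three steps together, a request might be prepared as follows. The sketch builds a multi-record payload in the `instances`/`features` shape described above, checks it against the 6 MB request limit, and collects the arguments for the SageMaker runtime `invoke_endpoint` call. The endpoint name is hypothetical, the limit is taken here as 6 * 1024 * 1024 bytes (an assumption about how the 6 MB is counted), and the actual call is commented out so the snippet runs offline.

```python
import json

MAX_REQUEST_BYTES = 6 * 1024 * 1024  # assumed interpretation of the 6 MB limit


def build_payload(records):
    """Serialize records as {"instances": [{"features": [...]}, ...]}."""
    body = json.dumps({"instances": [{"features": list(r)} for r in records]})
    size = len(body.encode("utf-8"))
    if size > MAX_REQUEST_BYTES:
        raise ValueError(f"payload is {size} bytes, above the request limit")
    return body


records = [
    ["This is the best", "I'd use this product twice a day every day if I could. it's the best ever"],
    ["Not for me", "Stopped working after a week"],  # hypothetical second review
]
invoke_args = {
    "EndpointName": "example-data-wrangler-endpoint",  # hypothetical name
    "ContentType": "application/json",
    "Accept": "application/json",
    "Body": build_payload(records),
}
print(invoke_args["Body"])

# With credentials and a deployed endpoint:
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_args)
# predictions = json.loads(response["Body"].read())
```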

Clean up

When you are finished using SageMaker Data Wrangler, we recommend that you shut down the instance it runs on to avoid incurring additional charges. For instructions on how to shut down the SageMaker Data Wrangler app and associated instance, see Shut Down Data Wrangler.

Conclusion

SageMaker Data Wrangler’s new features, including support for S3 manifest files, inference capabilities, and JSON format integration, transform the operational experience of data preparation. These enhancements streamline data import, automate data transformations, and simplify working with JSON data. With these features, you can enhance your operational efficiency, reduce manual effort, and extract valuable insights from your data with ease. Embrace the power of SageMaker Data Wrangler’s new features and unlock the full potential of your data preparation workflows.

To get started with SageMaker Data Wrangler, check out the latest information on the SageMaker Data Wrangler product page.

About the authors

Munish Dabra is a Principal Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML and Observability. He has a strong background in designing and building scalable distributed systems. He enjoys helping customers innovate and transform their business in AWS. LinkedIn: /mdabra

Patrick Lin is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is committed to making Amazon SageMaker Data Wrangler the number one data preparation tool for productionized ML workflows. Outside of work, you can find him reading, listening to music, having conversations with friends, and serving at his church.

Index your Alfresco content using the new Amazon Kendra Alfresco connector

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to index and search across several structured and unstructured repositories.

Alfresco Content Services provides open, flexible, highly scalable enterprise content management (ECM) capabilities with the added benefits of a content services platform, making content accessible wherever and however you work through easy integrations with the business applications you use every day. Many organizations use the Alfresco content management platform to store their content. One of the key requirements for enterprise customers using Alfresco is the ability to easily and securely find accurate information across all the stored documents.

We are excited to announce that you can now use the new Amazon Kendra Alfresco connector to search documents stored in your Alfresco repositories and sites. In this post, we show how to use the new connector to retrieve documents stored in Alfresco for indexing purposes and securely use the Amazon Kendra intelligent search function. In addition, the ML-powered intelligent search can accurately find information from unstructured documents with natural language narrative content, for which keyword search is not very effective.

What’s new in the Amazon Kendra Alfresco connector

The Amazon Kendra Alfresco connector offers support for the following:

Basic and OAuth2 authentication mechanisms for the Alfresco On-Premises (On-Prem) platform
Basic and OAuth2 authentication mechanisms for the Alfresco PaaS platform
Aspect-based crawling of Alfresco repository documents

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repositories and sites. The solution in this post demonstrates the following:

Retrieval of documents and comments from Alfresco private sites and public sites
Retrieval of documents and comments from Alfresco repositories using Amazon Kendra-specific aspects
Authentication against Alfresco On-Prem and PaaS platforms using Basic and OAuth2 mechanisms, respectively
The Amazon Kendra search capability with access control across sites and repositories

If you are going to use only one of the platforms, you can still follow this post to build the example solution; just ignore the steps corresponding to the platform that you are not using.

The following is a summary of the steps to build the example solution:

Upload documents to the three Alfresco sites and the repository folder. Make sure the uploaded documents are unique across sites and repository folders.
For the two private sites and repository, use document-level Alfresco permission management to set access permissions. For the public site, you don’t need to set up permissions at the document level. Note that permissions information is retrieved by the Amazon Kendra Alfresco connector and used for access control by the Amazon Kendra search function.
For the two private sites and repository, create a new Amazon Kendra index (you use the same index for both the private sites and the repository). For the public site, create a new Amazon Kendra index.
For the On-Prem private site, create an Amazon Kendra Alfresco data source using Basic authentication, within the Amazon Kendra index for private sites.
For the On-Prem repository documents with Amazon Kendra-specific aspects, create a data source using Basic authentication, within the Amazon Kendra index for private sites.
For the PaaS private site, create a data source using Basic authentication, within the Amazon Kendra index for private sites.
For the PaaS public site, create a data source using OAuth2 authentication, within the Amazon Kendra index for public sites.
Perform a sync for each data source.
Run a test query in the Amazon Kendra index meant for private sites and the repository using access control.
Run a test query in the Amazon Kendra index meant for public sites without access control.
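
The last two steps differ only in whether the query carries a user context. The sketch below builds keyword arguments for the Amazon Kendra `Query` API in both forms. The index IDs, query text, and user identifier are hypothetical, and which `UserContext` field an index expects depends on its access control settings; `UserId` is used here as an assumption. The boto3 call is commented out so the snippet runs without AWS access.

```python
def build_query_request(index_id, query_text, user_id=None):
    """Assemble Query arguments, adding a user context when one is given."""
    request = {"IndexId": index_id, "QueryText": query_text}
    if user_id is not None:
        # Results are then limited to documents this user may access.
        request["UserContext"] = {"UserId": user_id}
    return request


# Hypothetical identifiers for the two indexes in this walkthrough.
private_request = build_query_request(
    "example-private-index-id",
    "What is the sustainability pillar?",
    user_id="dev.user1@example.com",
)
public_request = build_query_request("example-public-index-id", "AWS Batch user guide")
print(private_request)
print(public_request)

# import boto3
# response = boto3.client("kendra").query(**private_request)
```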

Prerequisites

You need an AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies. You need to have a basic knowledge of AWS and how to navigate the AWS Management Console.

For the Alfresco On-Prem platform, complete the following steps:

Create a private site or use an existing site.
Create a repository folder or use an existing repository folder.
Get the repository URL.
Get Basic authentication credentials (user ID and password).
Make sure authentication users are part of the ALFRESCO_ADMINISTRATORS group.
Get the public X509 certificate in .pem format and save it locally.

For the Alfresco PaaS platform, complete the following steps:

Create a private site or use an existing site.
Create a public site or use an existing site.
Get the repository URL.
Get Basic authentication credentials (user ID and password).
Get OAuth2 credentials (client ID, client secret, and token URL).
Confirm that authentication users are part of the ALFRESCO_ADMINISTRATORS group.

Step 1: Upload example documents

Each uploaded document must contain 5 MB or less of text. For more information, see Amazon Kendra Service Quotas. You can upload example documents or use existing documents within each site.

As shown in the following screenshot, we have uploaded four documents to the Alfresco On-Prem private site.

We have uploaded three documents to the Alfresco PaaS private site.

We have uploaded five documents to the Alfresco PaaS public site.

We have uploaded two documents to the Alfresco On-Prem repository.

Assign the aspect awskendra:indexControl to one or more documents in the repository folder.

Step 2: Configure Alfresco permissions

Use the Alfresco Permissions Management feature to give access rights to example users for viewing uploaded documents. It is assumed that you have some example Alfresco user names, with email addresses, that can be used for setting permissions at the document level in private sites. These users are not used for crawling the sites.

In the following example for the On-Prem private site, we have provided users My Dev User1 and My Dev User2 with site-consumer access to the example document. Repeat the same procedure for the other uploaded documents.

In the following example for the PaaS private site, we have provided user Kendra User3 with site-consumer access to the example document. Repeat the same procedure for the other uploaded documents.

For the Alfresco repository documents, we have provided user My Dev User1 with consumer access to the example document.

The following table lists the site or repository names, document names, and permissions.

Platform	Site or Repository Name	Document Name	User IDs
On-Prem	MyAlfrescoSite	ChannelMarketingBudget.xlsx	My Manager User3
On-Prem	MyAlfrescoSite	wellarchitected-sustainability-pillar.pdf	My Dev User1, My Dev User2
On-Prem	MyAlfrescoSite	WorkDocs.docx	My Dev User1, My Dev User2, My Manager User3
On-Prem	MyAlfrescoSite	WorldPopulation.csv	My Dev User1, My Dev User2, My Manager User3
PaaS	MyAlfrescoCloudSite2	DDoS_White_Paper.pdf	Kendra User3
PaaS	MyAlfrescoCloudSite2	wellarchitected-framework.pdf	Kendra User3
PaaS	MyAlfrescoCloudSite2	ML_Training.pptx	Kendra User1
PaaS	MyAlfrescoCloudPublicSite	batch_user.pdf	Everyone
PaaS	MyAlfrescoCloudPublicSite	Amazon Simple Storage Service – User Guide.pdf	Everyone
PaaS	MyAlfrescoCloudPublicSite	AWS Batch – User Guide.pdf	Everyone
PaaS	MyAlfrescoCloudPublicSite	Amazon Detective.docx	Everyone
PaaS	MyAlfrescoCloudPublicSite	Pricing.xlsx	Everyone
On-Prem	Repo: MyAlfrescoRepoFolder1	Polly-dg.pdf (aspect awskendra:indexControl)	My Dev User1
On-Prem	Repo: MyAlfrescoRepoFolder1	Transcribe-api.pdf (aspect awskendra:indexControl)	My Dev User1
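
To see what the table implies for search results, the sketch below encodes the rows destined for the private index and filters documents by user. This is conceptually what the Amazon Kendra search function does with the permissions retrieved by the connector, but the code is a plain-Python model for illustration, not Amazon Kendra code.

```python
# Document name -> users with access, copied from the private-site and
# repository rows of the table above.
PRIVATE_ACL = {
    "ChannelMarketingBudget.xlsx": {"My Manager User3"},
    "wellarchitected-sustainability-pillar.pdf": {"My Dev User1", "My Dev User2"},
    "WorkDocs.docx": {"My Dev User1", "My Dev User2", "My Manager User3"},
    "WorldPopulation.csv": {"My Dev User1", "My Dev User2", "My Manager User3"},
    "DDoS_White_Paper.pdf": {"Kendra User3"},
    "wellarchitected-framework.pdf": {"Kendra User3"},
    "ML_Training.pptx": {"Kendra User1"},
    "Polly-dg.pdf": {"My Dev User1"},
    "Transcribe-api.pdf": {"My Dev User1"},
}


def visible_documents(user, acl=PRIVATE_ACL):
    """Return the set of documents the given user is allowed to see."""
    return {doc for doc, users in acl.items() if user in users}


for user in ("My Dev User1", "My Manager User3", "Kendra User3"):
    print(user, "->", sorted(visible_documents(user)))
```

A query run as My Manager User3, for example, can surface the marketing budget spreadsheet, while the same query run as My Dev User1 cannot.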

Step 3: Set up Amazon Kendra indexes

You can create a new Amazon Kendra index or use an existing index for indexing documents hosted in Alfresco private sites. To create a new index, complete the following steps:

On the Amazon Kendra console, create an index called Alfresco-Private.
Create a new IAM role, then choose Next.
For Access Control, choose Yes.
For Token Type, choose JSON.
Keep the user name and group as default.
Choose None for user group expansion because we are assuming no integration with AWS IAM Identity Center (successor to AWS Single Sign-On).
Choose Next.
Choose Developer Edition for this example solution.
Choose Create to create a new index.

The following screenshot shows the Alfresco-Private index after it has been created.

You can verify the access control configuration on the User access control tab.

Repeat these steps to create a second index called Alfresco-Public.
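
For readers who prefer the API, the console choices above correspond roughly to the following Amazon Kendra `CreateIndex` parameters. The sketch only builds the request dictionaries; the role ARN is a hypothetical placeholder, the attribute field names `username` and `groups` are assumed to match the console defaults, and the boto3 call is commented out so nothing is created.

```python
def create_index_request(name, role_arn, access_control=True):
    """Assemble CreateIndex arguments for a Developer Edition index."""
    request = {
        "Name": name,
        "Edition": "DEVELOPER_EDITION",
        "RoleArn": role_arn,
    }
    if access_control:
        # Token-based access control using a JSON token.
        request["UserContextPolicy"] = "USER_TOKEN"
        request["UserTokenConfigurations"] = [
            {
                "JsonTokenTypeConfiguration": {
                    "UserNameAttributeField": "username",  # assumed default
                    "GroupAttributeField": "groups",       # assumed default
                }
            }
        ]
    return request


role = "arn:aws:iam::111122223333:role/ExampleKendraIndexRole"  # hypothetical
private_index = create_index_request("Alfresco-Private", role)
public_index = create_index_request("Alfresco-Public", role)
print(private_index)

# import boto3
# boto3.client("kendra").create_index(**private_index)
```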

Step 4: Create a data source for the On-Prem private site

To create a data source for the On-Prem private site, complete the following steps:

On the Amazon Kendra console, navigate to the Alfresco-Private index.
Choose Data sources in the navigation pane.
Choose Add data source.

Choose Add connector for the Alfresco connector.

For Data source name, enter Alfresco-OnPrem-Private.
Optionally, add a description.
Keep the remaining settings as default and choose Next.

To connect to the Alfresco On-Prem site, the connector needs access to the public certificate corresponding to the On-Prem server. This was one of the prerequisites.

Use a different browser tab to upload the .pem file to an Amazon Simple Storage Service (Amazon S3) bucket in your account.

You use this S3 bucket name in the next steps.

Return to the data source creation page.
For Source, select Alfresco server.
For Alfresco repository URL, enter the repository URL (created as a prerequisite).
For Alfresco user application URL, enter the same value as the repository URL.
For SSL certificate location, choose Browse S3 and choose the S3 bucket where you uploaded the .pem file.
For Authentication, select Basic authentication.
For AWS Secrets Manager secret, choose Create and add new secret.

A pop-up window opens to create an AWS Secrets Manager secret.

Enter a name for your secret, user name, and password, then choose Save.

For Virtual Private Cloud (VPC), choose No VPC.
Turn the identity crawler on.
For IAM role, choose Create a new IAM role.
Choose Next.

You can configure the data source to synchronize contents from one or more Alfresco sites. For this post, we sync from the On-Prem private site.

For Content to sync, select Single Alfresco site sync and choose MyAlfrescoSite.
Select Include comments to retrieve comments in addition to documents.
For Sync mode, select Full sync.
For Frequency, choose Run on demand (or a different frequency option as needed).
Choose Next.

Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults), then choose Next.

On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 5: Create a data source for the On-Prem repository documents with Amazon Kendra-specific aspects

Similarly to the previous steps, create a data source for the On-Prem repository documents with Amazon Kendra-specific aspects:

On the Amazon Kendra console, navigate to the Alfresco-Private index.
Choose Data sources in the navigation pane.
Choose Add data source.
Choose Add connector for the Alfresco connector.
For Data source name, enter Alfresco-OnPrem-Aspects.
Optionally, add a description.
Keep the remaining settings as default and choose Next.
For Source, select Alfresco server.
For Alfresco repository URL, enter the repository URL (created as a prerequisite).
For Alfresco user application URL, enter the same value as the repository URL.
For SSL certificate location, choose Browse S3 and choose the S3 bucket where you uploaded the .pem file.
For Authentication, select Basic authentication.
For AWS Secrets Manager secret, choose the secret you created earlier.
For Virtual Private Cloud (VPC), choose No VPC.
Turn the identity crawler off.
For IAM role, choose Create a new IAM role.
Choose Next.

For this scope, the connector retrieves only those On-Prem server repository documents that have been assigned an aspect called awskendra:indexControl.

For Content to sync, select Alfresco aspects sync.
For Sync mode, select Full sync.
For Frequency, choose Run on demand (or a different frequency option as needed).
Choose Next.
Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults), then choose Next.
On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 6: Create a data source for the PaaS private site

Follow steps similar to those in the previous sections to create a data source for the PaaS private site:

On the Amazon Kendra console, navigate to the Alfresco-Private index.
Choose Data sources in the navigation pane.
Choose Add data source.
Choose Add connector for the Alfresco connector.
For Data source name, enter Alfresco-Cloud-Private.
Optionally, add a description.
Keep the remaining settings as default and choose Next.
For Source, select Alfresco cloud.
For Alfresco repository URL, enter the repository URL (created as a prerequisite).
For Alfresco user application URL, enter the same value as the repository URL.
For Authentication, select Basic authentication.
For AWS Secrets Manager secret, choose Create and add new secret.
Enter a name for your secret, user name, and password, then choose Save.
For Virtual Private Cloud (VPC), choose No VPC.
Turn the identity crawler off.
For IAM role, choose Create a new IAM role.
Choose Next.

We can configure the data source to synchronize contents from one or more Alfresco sites. For this post, we configure the data source to sync from the PaaS private site MyAlfrescoCloudSite2.

For Content to sync, select Single Alfresco site sync and choose MyAlfrescoCloudSite2.
Select Include comments.
For Sync mode, select Full sync.
For Frequency, choose Run on demand (or a different frequency option as needed).
Choose Next.
Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults) and choose Next.
On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 7: Create a data source for the PaaS public site

We follow similar steps as before to create a data source for the PaaS public site:

On the Amazon Kendra console, navigate to the Alfresco-Public index.
Choose Data sources in the navigation pane.
Choose Add data source.
Choose Add connector for the Alfresco connector.
For Data source name, enter Alfresco-Cloud-Public.
Optionally, add a description.
Keep the remaining settings as default and choose Next.
For Source, select Alfresco cloud.
For Alfresco repository URL, enter the repository URL (created as a prerequisite).
For Alfresco user application URL, enter the same value as the repository URL.
For Authentication, select OAuth2.0 authentication.
For AWS Secrets Manager secret, choose Create and add new secret.
Enter a name for your secret, client ID, client secret, and token URL, then choose Save.
For Virtual Private Cloud (VPC), choose No VPC.
Turn the identity crawler off.
For IAM role, choose Create a new IAM role.
Choose Next.

We configure this data source to sync from the PaaS public site MyAlfrescoCloudPublicSite.

For Content to sync, select Single Alfresco site sync and choose MyAlfrescoCloudPublicSite.
Optionally, select Include comments.
For Sync mode, select Full sync.
For Frequency, choose Run on demand (or a different frequency option as needed).
Choose Next.
Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults) and choose Next.
On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 8: Perform a sync for each data source

Navigate to each of the data sources and choose Sync now. Complete only one synchronization at a time.

Wait for synchronization to be complete for all data sources. When each synchronization is complete for a data source, you see the status as shown in the following screenshot.

You can also view Amazon CloudWatch logs for a specific sync under Sync run history.
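
The same one-at-a-time rule can be scripted. The following is a minimal sketch, assuming you pass in a boto3 Amazon Kendra client and a list of (index ID, data source ID) pairs; the helper name and the polling interval are illustrative and not part of the original walkthrough.

```python
import time

# Statuses after which a sync job will not change any further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def sync_one_at_a_time(kendra, targets, poll_seconds=60):
    """Run each data source sync to completion before starting the next.

    kendra  -- a Kendra client, for example boto3.client("kendra")
    targets -- iterable of (index_id, data_source_id) pairs (hypothetical IDs)
    """
    final_status = {}
    for index_id, data_source_id in targets:
        started = kendra.start_data_source_sync_job(
            Id=data_source_id, IndexId=index_id
        )
        execution_id = started["ExecutionId"]
        while True:
            history = kendra.list_data_source_sync_jobs(
                Id=data_source_id, IndexId=index_id
            )["History"]
            job = next(
                (j for j in history if j["ExecutionId"] == execution_id), None
            )
            # Treat a job that is not listed yet as still running.
            status = job["Status"] if job else "SYNCING"
            if status in TERMINAL_STATUSES:
                break
            time.sleep(poll_seconds)
        final_status[data_source_id] = status
    return final_status
```

The function returns the final status per data source, so a FAILED or INCOMPLETE run can be inspected in the CloudWatch logs before you continue to the test queries.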

Step 9: Run a test query in the private index using access control

Now it’s time to test the solution. We first run a query in the private index using access control:

On the Amazon Kendra console, navigate to the Alfresco-Private index and choose Search indexed content.

Enter a query in the search field.

As shown in the following screenshot, Amazon Kendra didn’t return any results.

Choose Apply token.
Enter the email address corresponding to the My Dev User1 user and choose Apply.

Note that Amazon Kendra access control works based on the email address associated with an Alfresco user name.

Run the search again.

The search returns a list of documents (containing wellarchitected-sustainability-pillar.pdf in the following example) based on the access control setup.

If you run the same query again and provide an email address that doesn’t have access to either of these documents, you should not see these documents in the results list.

Enter another query to search in the documents based on the aspect awskendra:indexControl.
Choose Apply token, enter the email address corresponding to My Dev User1 user, and choose Apply.
Rerun the query.
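
Outside the console, the same check can be sketched with the Amazon Kendra Query API. This example assumes a boto3 Kendra client and that passing the user's email address as UserContext.UserId corresponds to the Apply token step above; the function name and arguments are illustrative.

```python
def search_titles(kendra, index_id, query_text, user_email=None):
    """Return the titles of documents matching query_text.

    kendra     -- a Kendra client, for example boto3.client("kendra")
    user_email -- email address tied to the Alfresco user; when omitted,
                  the query runs without any user context
    """
    params = {"IndexId": index_id, "QueryText": query_text}
    if user_email:
        # Results are filtered by the access control lists for this user.
        params["UserContext"] = {"UserId": user_email}
    response = kendra.query(**params)
    return [
        item.get("DocumentTitle", {}).get("Text", "")
        for item in response.get("ResultItems", [])
    ]
```

Calling the function once without user_email and once with the email address for My Dev User1 should reproduce the empty and non-empty result lists described in this step.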

Step 10: Run a test query in the public index without access control

Similarly, we can test our solution by running queries in the public index without access control:

On the Amazon Kendra console, navigate to the Alfresco-Public index and choose Search indexed content.
Run a search query.

Because this example Alfresco public site has not been set up with any access control, we don’t use an access token.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. Delete newly added Alfresco data sources within the indexes. If you created new Amazon Kendra indexes while testing this solution, delete them as well.

Conclusion

With the new Alfresco connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Alfresco, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.

About the Authors

Arun Anand is a Senior Solutions Architect at Amazon Web Services based in the Houston area. He has 25+ years of experience in designing and developing enterprise applications. He works with partners in the Energy & Utilities segment, providing architectural and best practice recommendations for new and existing solutions.

Rajnish Shaw is a Senior Solutions Architect at Amazon Web Services, with a background as a Product Developer and Architect. Rajnish is passionate about helping customers build applications on the cloud. Outside of work, Rajnish enjoys spending time with family and friends, and traveling.

Yuanhua Wang is a software engineer at AWS with more than 15 years of experience in the technology industry. His interests are software architecture and build tools on cloud computing.

Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce apps with AI/ML

This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI.

This is the second post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker. In Part 1, we show how the Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their Salesforce data securely using SageMaker and use its tools to build, train, and deploy models to endpoints hosted on SageMaker. The endpoints are then registered to the Salesforce Data Cloud to activate predictions in Salesforce.

In this post, we expand on this topic to demonstrate how to use Einstein Studio for product recommendations. You can use this integration for traditional models as well as large language models (LLMs).

Solution overview

In this post, we demonstrate how to create a predictive model in SageMaker to recommend the next best product to your customers by using historical data such as customer demographics, marketing engagements, and purchase history from Salesforce Data Cloud.

We use the following sample dataset. To use this dataset in your Data Cloud, refer to Create Amazon S3 Data Stream in Data Cloud.

The following attributes are needed to create the model:

Club Member – If the customer is a club member
Campaign – The campaign the customer is a part of
State – The state or province the customer resides in
Month – The month of purchase
Case Count – The number of cases raised by the customer
Case Type Return – Whether the customer returned any product within the last year
Case Type Shipment Damaged – Whether the customer had any shipments damaged in the last year
Engagement Score – The level of engagement the customer has (response to mailing campaigns, logins to the online store, and so on)
Tenure – The tenure of the customer relationship with the company
Clicks – The average number of clicks the customer has made within a week prior to purchase
Pages Visited – The average number of pages the customer has visited within a week prior to purchase
Product Purchased – The actual product purchased
Id – The ID of the record
DateTime – The timestamp of the dataset

The product recommendation model is built and deployed on SageMaker and is trained using data in the Salesforce Data Cloud. The following steps give an overview of how to use the new capabilities launched in SageMaker for Salesforce to enable the overall integration:

Set up the Amazon SageMaker Studio domain and OAuth between Salesforce and the AWS accounts.
Use the newly launched capability of the Amazon SageMaker Data Wrangler connector for Salesforce Data Cloud to prepare the data in SageMaker without copying the data from Salesforce Data Cloud.
Train a recommendation model in SageMaker Studio using training data that was prepared using SageMaker Data Wrangler.
Package the SageMaker Data Wrangler container and the trained recommendation model container in an inference pipeline so the inference request can use the same data preparation steps you created to preprocess the training data. The real-time inference call data is first passed to the SageMaker Data Wrangler container in the inference pipeline, where it is preprocessed and passed to the trained model for product recommendation. For more information about this process, refer to New — Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler. Although we use a specific algorithm to train the model in our example, you can use any algorithm that you find appropriate for your use case.
Use the newly launched SageMaker provided project template for Salesforce Data Cloud integration to streamline implementing the preceding steps by providing the following templates:
1. An example notebook showcasing data preparation, building, training, and registering the model.
2. The SageMaker provided project template for Salesforce Data Cloud integration, which automates creating a SageMaker endpoint hosting the inference pipeline model. When a version of the model in the Amazon SageMaker Model Registry is approved, the endpoint is exposed as an API with Amazon API Gateway using a custom Salesforce JSON Web Token (JWT) authorizer. API Gateway is required to allow Salesforce Data Cloud to make predictions against the SageMaker endpoint using a JWT token that Salesforce creates and passes with the request when making predictions from Salesforce. JWT can be used as a part of OpenID Connect (OIDC) and OAuth 2.0 frameworks to restrict client access to your APIs.
After you create the API, we recommend registering the model endpoint in Salesforce Einstein Studio. For instructions, refer to Bring Your Own AI Models to Salesforce with Einstein Studio.
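
To make the JWT authorizer step more concrete, the following sketch shows the general shape of such an authorizer in Python. It is not the code that the project template deploys: it only decodes the token's claims without verifying the signature (a real authorizer must verify it), and the function names are hypothetical. The returned dictionary follows the standard API Gateway Lambda authorizer output format.

```python
import base64
import json


def decode_jwt_claims(token):
    """Return the claims in a JWT payload WITHOUT verifying the signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    # JWT segments are base64url encoded with the padding stripped.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def authorizer_response(principal_id, effect, method_arn):
    """Build the policy document that API Gateway expects from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,  # "Allow" or "Deny"
                    "Resource": method_arn,
                }
            ],
        },
    }
```

An authorizer built this way would return an Allow policy only after the signature and claims such as the issuer check out, and a Deny policy otherwise.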

The following diagram illustrates the solution architecture.

Create a SageMaker Studio domain

First, create a SageMaker Studio domain. For instructions, refer to Onboard to Amazon SageMaker Domain. Note the domain ID and the execution role that is created for your user profile. You add permissions to this role in subsequent steps.

The following screenshot shows the domain we created for this post.

The following screenshot shows the example user profile for this post.

Set up the Salesforce connected app

Next, we create a Salesforce connected app to enable the OAuth flow from SageMaker Studio to Salesforce Data Cloud. Complete the following steps:

Log in to Salesforce and navigate to Setup.
Search for App Manager and create a new connected app.
Provide the following inputs:
1. For Connected App Name, enter a name.
2. For API Name, leave as default (it’s automatically populated).
3. For Contact Email, enter your contact email address.
4. Select Enable OAuth Settings.
5. For Callback URL, enter https://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/lab, and provide the domain ID that you captured while creating the SageMaker domain and the Region of your SageMaker domain.
Under Selected OAuth Scopes, move the following from Available OAuth Scopes to Selected OAuth Scopes and choose Save:
1. Manage user data via APIs (api)
2. Perform requests at any time (refresh_token, offline_access)
3. Perform ANSI SQL queries on Salesforce Data Cloud data (Data Cloud_query_api)
4. Manage Salesforce Customer Data Platform profile data (Data Cloud_profile_api)
5. Access the identity URL service (id, profile, email, address, phone)
6. Access unique user identifiers (openid)

For more information about creating a connected app, refer to Create a Connected App.
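
Because the callback URL must match your domain exactly, it can help to build it from the two values rather than typing it by hand. A small sketch, where the function name is illustrative and the URL pattern is the one given in the Callback URL step:

```python
def studio_callback_url(domain_id, region):
    """Build the SageMaker Studio callback URL for the connected app."""
    if not domain_id or not region:
        raise ValueError("both domain_id and region are required")
    return f"https://{domain_id}.studio.{region}.sagemaker.aws/jupyter/default/lab"
```

For example, studio_callback_url("d-abc123", "us-east-1") uses a made-up domain ID; substitute the ID you noted when creating the domain.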

Return to the connected app and navigate to Consumer Key and Secret.
Choose Manage Consumer Details.
Copy the key and secret.

You may be asked to log in to your Salesforce org as part of the two-factor authentication here.

Navigate back to the Manage Connected Apps page.
Open the connected app you created and choose Manage.
Choose Edit Policies and change IP Relaxation to Relax IP restrictions, then save your settings.

Configure SageMaker permissions and lifecycle rules

In this section, we walk through the steps to configure SageMaker permissions and lifecycle management rules.

Create a secret in AWS Secrets Manager

Enable OAuth integration with Salesforce Data Cloud by storing credentials from your Salesforce connected app in AWS Secrets Manager:

On the Secrets Manager console, choose Store a new secret.
Select Other type of secret.

Create your secret with the following key-value pairs:

{
  "identity_provider": "SALESFORCE",
  "authorization_url": "https://login.salesforce.com/services/oauth2/authorize",
  "token_url": "https://login.salesforce.com/services/oauth2/token",
  "client_id": "<YOUR_CONSUMER_KEY>",
  "client_secret": "<YOUR_CONSUMER_SECRET>",
  "issue_url": "<YOUR_SALESFORCE_ORG_URL>"
}

Add a tag with the key sagemaker:partner and your choice of value.
Save the secret and note the ARN of the secret.
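
If you prefer to script this step, the following sketch creates the same secret with a boto3 Secrets Manager client. The key names are copied from the key-value pairs above; the function names, the secret name you pass in, and the default tag value are assumptions for illustration.

```python
import json


def build_secret_payload(consumer_key, consumer_secret, org_url):
    """Assemble the key-value pairs expected for the Salesforce OAuth secret."""
    return {
        "identity_provider": "SALESFORCE",
        "authorization_url": "https://login.salesforce.com/services/oauth2/authorize",
        "token_url": "https://login.salesforce.com/services/oauth2/token",
        "client_id": consumer_key,
        "client_secret": consumer_secret,
        "issue_url": org_url,
    }


def store_secret(secrets_client, name, payload, tag_value="salesforce"):
    """Create the secret with the sagemaker:partner tag and return its ARN.

    secrets_client -- for example boto3.client("secretsmanager")
    tag_value      -- any value; only the tag key matters for the IAM condition
    """
    response = secrets_client.create_secret(
        Name=name,
        SecretString=json.dumps(payload),
        Tags=[{"Key": "sagemaker:partner", "Value": tag_value}],
    )
    return response["ARN"]
```

The returned ARN is the value you later provide in the lifecycle configuration script.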

Configure a SageMaker lifecycle rule

The SageMaker Studio domain execution role will require AWS Identity and Access Management (IAM) permissions to access the secret created in the previous step. For more information, refer to Creating roles and attaching policies (console).

On the IAM console, attach the following policies to their respective roles (these roles will be used by the SageMaker project for deployment):
1. Add the policy AmazonSageMakerPartnerServiceCatalogProductsCloudFormationServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsCloudformationRole.
2. Add the policy AmazonSageMakerPartnerServiceCatalogProductsApiGatewayServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsApiGatewayRole.
3. Add the policy AmazonSageMakerPartnerServiceCatalogProductsLambdaServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsLambdaRole.
On the IAM console, navigate to the SageMaker domain execution role.
Choose Add permissions and select Create an inline policy.

Enter the following policy in the JSON policy editor:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:*:*:secret:*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "aws:ResourceTag/sagemaker:partner": "*"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:UpdateSecret"
      ],
      "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*"
    }
  ]
}
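
As an alternative to the console editor, the same inline policy can be attached with the IAM PutRolePolicy API. A minimal sketch, assuming a boto3 IAM client; the function names, role name, and policy name are placeholders you would replace with your own.

```python
import json


def partner_secret_policy():
    """Return the inline policy from this section as a Python dictionary."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:PutSecretValue",
                ],
                "Resource": "arn:aws:secretsmanager:*:*:secret:*",
                # Only secrets carrying the sagemaker:partner tag are readable.
                "Condition": {
                    "ForAnyValue:StringLike": {
                        "aws:ResourceTag/sagemaker:partner": "*"
                    }
                },
            },
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:UpdateSecret"],
                "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*",
            },
        ],
    }


def attach_inline_policy(iam, role_name, policy_name):
    """Attach the policy to the domain execution role.

    iam -- for example boto3.client("iam")
    """
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(partner_secret_policy()),
    )
```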

SageMaker Studio lifecycle configuration provides shell scripts that run when a notebook is created or started. The lifecycle configuration will be used to retrieve the secret and import it to the SageMaker runtime.

On the SageMaker console, choose Lifecycle configurations in the navigation pane.
Choose Create configuration.
Leave the default selection Jupyter Server App and choose Next.
Give the configuration a name.

Enter the following script in the editor, providing the ARN for the secret you created earlier:

#!/bin/bash
set -eux

cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
{
"secret_arn": "<YOUR_SECRETS_ARN>"
}
EOL

Choose Submit to save the lifecycle configuration.
Choose Domains in the navigation pane and open your domain.
On the Environment tab, choose Attach to attach your lifecycle configuration.
Choose the lifecycle configuration you created and choose Attach to domain.
Choose Set as default.

If you are a returning user to SageMaker Studio, in order to ensure Salesforce Data Cloud is enabled, upgrade to the latest Jupyter and SageMaker Data Wrangler kernels.

This completes the setup to enable data access from Salesforce Data Cloud to SageMaker Studio to build AI and machine learning (ML) models.

Create a SageMaker project

To start using the solution, first create a project using Amazon SageMaker Projects. Complete the following steps:

In SageMaker Studio, under Deployments in the navigation pane, choose Projects.
Choose Create project.
Choose the project template called Model deployment for Salesforce.
Choose Select project template.
Enter a name and optional description for your project.
Enter a model group name.
Enter the name of the Secrets Manager secret that you created earlier.
Choose Create project.

The project may take 1–2 minutes to initiate.

You can see two new repositories. The first one is for sample notebooks that you can use as is or customize to prepare, train, create, and register models in the SageMaker Model Registry. The second repository is for automating the model deployment, which includes exposing the SageMaker endpoint as an API.

Choose clone repo for both repositories.

For this post, we use the product recommendation example, which can be found in the sagemaker-<YOUR-PROJECT-NAME>-p-<YOUR-PROJECT-ID>-example-nb/product-recommendation directory that you just cloned. Before we run the product-recommendation.ipynb notebook, let’s do some data preparation to create the training data using SageMaker Data Wrangler.

Prepare data with SageMaker Data Wrangler

Complete the following steps:

In SageMaker Studio, on the File menu, choose New and Data Wrangler flow.
After you create the data flow, choose (right-click) the tab and choose Rename to rename the file.
Choose Import data.
Choose Create connection.
Choose Salesforce Data Cloud.
For Name, enter salesforce-data-cloud-sagemaker-connection.
For Salesforce org URL, enter your Salesforce org URL.
Choose Save + Connect.
In the Data Explorer view, select and preview the tables from the Salesforce Data Cloud to create and run the query to extract the required dataset.

Your query will look like the following. Use the table name that you chose when uploading the data to Salesforce Data Cloud.

SELECT product_purchased__c, club_member__c, campaign__c, state__c, month__c,
       case_count__c, case_type_return__c, case_type_shipment_damaged__c,
       pages_visited__c, engagement_score__c, tenure__c, clicks__c, id__c
FROM Training_Dataset_for_Sagemaker__dll

Choose Create dataset.

Creating the dataset may take some time.

In the data flow view, you can now see a new node added to the visual graph.

For more information on how you can use SageMaker Data Wrangler to create Data Quality and Insights Reports, refer to Get Insights On Data and Data Quality.

SageMaker Data Wrangler offers over 300 built-in transformations. In this step, we use some of these transformations to prepare the dataset for an ML model. For detailed instructions on how to implement these transformations, refer to Transform Data.

Use the Manage columns step with the Drop column transform to drop the column id__c.
Use the Handle missing step with the Drop missing transform to drop rows with missing values for various features. We apply this transformation on all columns.

Use a custom transform step to create categorical values for the state__c, case_count__c, and tenure__c features. Use the following code for this transformation:

from pyspark.sql.functions import when

# States kept as their own category; every other state is grouped under "Other".
States_List = ['Washington', 'Massachusetts', 'California', 'Minnesota', 'Vermont', 'Colorado', 'Arizona']

# Cast the flag and month columns to strings so they are treated as categorical.
# withColumn returns a new DataFrame, so the result must be assigned back to df.
df = df.withColumn("club_member__c", df.club_member__c.cast('string'))
df = df.withColumn("month__c", df.month__c.cast('string'))
df = df.withColumn("case_type_return__c", df.case_type_return__c.cast('string'))
df = df.withColumn("case_type_shipment_damaged__c", df.case_type_shipment_damaged__c.cast('string'))

df = df.withColumn('state__c', when(df.state__c.isin(States_List), df.state__c).otherwise("Other"))

# Bucket the case count into three categories.
df = df.withColumn('case_count__c', when(df.case_count__c == 0, "No Cases").otherwise(when(df.case_count__c <= 2, "1 to 2 Cases").otherwise("Greater than 2 Cases")))

# Bucket tenure (compared as whole years) into five categories.
df = df.withColumn('tenure__c', when(df.tenure__c < 1, "Less than 1 Year").otherwise(when(df.tenure__c == 1, "1 to 2 Years").otherwise(when(df.tenure__c == 2, "2 to 3 Years").otherwise(when(df.tenure__c == 3, "3 to 4 Years").otherwise("Greater Than 4 Years")))))
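
To sanity-check the bucketing rules before running them in SageMaker Data Wrangler, you can mirror them in plain Python. The following sketch covers the state and case count rules only (tenure follows the same pattern); the function names are illustrative, and missing values are ignored because rows with missing values were dropped in the previous step.

```python
# States that keep their own category; all others are grouped under "Other".
KEPT_STATES = {
    "Washington", "Massachusetts", "California", "Minnesota",
    "Vermont", "Colorado", "Arizona",
}


def bucket_state(state):
    """Keep the listed states and map every other state to "Other"."""
    return state if state in KEPT_STATES else "Other"


def bucket_case_count(case_count):
    """Map a numeric case count to one of three categories."""
    if case_count == 0:
        return "No Cases"
    if case_count <= 2:
        return "1 to 2 Cases"
    return "Greater than 2 Cases"
```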

Use the Process numeric step with the Scale values transform and choose Standard scaler to scale the clicks__c, engagement_score__c, and pages_visited__c features.
Use the Encode categorical step with the One-hot encode transform to convert categorical variables to numeric for the case_type_return__c, case_type_shipment_damaged__c, month__c, club_member__c, and campaign__c features (all features except clicks__c, engagement_score__c, pages_visited__c, and product_purchased__c).
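
For readers unfamiliar with these two transforms, the following plain-Python sketch illustrates what they compute on a single column. It is a conceptual illustration only, not how SageMaker Data Wrangler implements them; in particular, the use of the population standard deviation and the alphabetical category order are assumptions.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center values on their mean and divide by their standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows
```

After scaling, a numeric column has a mean of zero, which keeps features such as clicks and pages visited on a comparable scale. After one-hot encoding, each category becomes its own 0/1 column.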

Model building, training, and deployment

To build, train, and deploy the model, complete the following steps:

Return to the SageMaker project, open the product-recommendation.ipynb notebook, and run a processing job to preprocess the data using the SageMaker Data Wrangler configuration you created.
Follow the steps in the notebook to train a model and register it to the SageMaker Model Registry.
Make sure to update the model group name to match the model group name that you used while creating the SageMaker project.

To locate the model group name, open the SageMaker project that you created earlier and navigate to the Settings tab.

Similarly, the flow file referenced in the notebook must match the flow file name that you created earlier.

For this post, we used product-recommendation as the model group name, so we update the notebook with product-recommendation as the model group name.

After the notebook is run, the trained model is registered in the Model Registry. To learn more about the Model Registry, refer to Register and Deploy Models with Model Registry.

Select the model version you created and update the status of it to Approved.
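
The approval can also be done with the SageMaker API instead of the Studio UI. A minimal sketch, assuming a boto3 SageMaker client and that the most recently created version in the model group is the one you want to approve; the function name is illustrative.

```python
def approve_latest_model(sagemaker_client, model_group_name):
    """Approve the newest model version in a model group and return its ARN.

    sagemaker_client -- for example boto3.client("sagemaker")
    model_group_name -- for example "product-recommendation"
    """
    packages = sagemaker_client.list_model_packages(
        ModelPackageGroupName=model_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["ModelPackageSummaryList"]
    if not packages:
        raise ValueError(f"no model versions found in {model_group_name}")
    model_package_arn = packages[0]["ModelPackageArn"]
    sagemaker_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
    )
    return model_package_arn
```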

Now that you have approved the registered model, the SageMaker Salesforce project deploy step will provision and trigger AWS CodePipeline.

CodePipeline has steps to build and deploy a SageMaker endpoint for inference containing the SageMaker Data Wrangler preprocessing steps and the trained model. The endpoint will be exposed to Salesforce Data Cloud as an API through API Gateway. The following screenshot shows the pipeline prefixed with Sagemaker-salesforce-product-recommendation-xxxxx. We also show you the endpoints and API that get created by the SageMaker project for Salesforce.

If you would like, you can take a look at the CodePipeline deploy step, which uses AWS CloudFormation scripts to create a SageMaker endpoint and an API Gateway API with a custom JWT authorizer.

When pipeline deployment is complete, you can find the SageMaker endpoint on the SageMaker console.

You can explore the API Gateway created by the project template on the API Gateway console.

Choose the link to find the API Gateway URL.

You can find the details of the JWT authorizer by choosing Authorizers on the API Gateway console. You can also go to the AWS Lambda console to review the code of the Lambda function created by the project template.

To discover the schema to be used while invoking the API from Einstein Studio, choose Information in the navigation pane of the Model Registry. You will see an Amazon Simple Storage Service (Amazon S3) link to a metadata file. Copy and paste the link into a new browser tab URL.

Let’s look at the file without downloading it. On the file details page, choose the Object actions menu and choose Query with S3 Select.

Choose Run SQL query and take note of the API Gateway URL and schema because you will need this information when registering with Einstein Studio. If you don’t see an APIGWURL key, either the model wasn’t approved, deployment is still in progress, or deployment failed.
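
If you download the metadata file instead, a short script can check for the APIGWURL key. This sketch assumes the file is JSON and makes no assumption about where in the document the key appears, so it searches recursively; the function names are illustrative.

```python
import json


def find_key(node, key):
    """Return the first value stored under key anywhere in parsed JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def api_gateway_url(metadata_text):
    """Return the APIGWURL value, or None if the key is absent."""
    return find_key(json.loads(metadata_text), "APIGWURL")
```

A result of None corresponds to the cases listed above: the model wasn't approved, deployment is still in progress, or deployment failed.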

Use the Salesforce Einstein Studio API for predictions

Salesforce Einstein Studio is a new, centralized experience in Salesforce Data Cloud that data science and engineering teams can use to easily access their traditional models and LLMs used in generative AI. Next, we register the model in Salesforce Einstein Studio using the API URL and the client_id that you stored in Secrets Manager earlier, so that you can use the model inferences in Salesforce. For instructions, refer to Bring Your Own AI Models to Salesforce with Einstein Studio.

Clean up

To delete all the resources created by the SageMaker project, on the project page, choose the Action menu and choose Delete.

To delete the resources (API Gateway and SageMaker endpoint) created by CodePipeline, navigate to the AWS CloudFormation console and delete the stack that was created.

Conclusion

In this post, we explained how you can build and train ML models in SageMaker Studio using data that is hosted on the Salesforce Data Cloud. SageMaker Data Wrangler imports and prepares the data through the newly launched Salesforce Data Cloud JDBC connector, and the SageMaker provided project template for Salesforce Data Cloud integration enables you to deploy the model, create the endpoint, and secure an API for a registered model. You then use the API to make predictions in Salesforce Einstein Studio for your business use cases.

Although we used the example of product recommendation to showcase the steps for implementing the end-to-end integration, you can use the SageMaker project template for Salesforce to create an endpoint and API for any SageMaker traditional model or LLM that is registered in the SageMaker Model Registry. We look forward to seeing what you build in SageMaker using data from Salesforce Data Cloud to empower your Salesforce applications with SageMaker hosted ML models!

This post is a continuation of the series regarding Salesforce Data Cloud and SageMaker integration. For a high-level overview and to learn more about the business impact you can make with this integration approach, refer to Part 1.

About the authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City. Follow him on LinkedIn at https://www.linkedin.com/in/darylmartis.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Dharmendra Kumar Rai (DK Rai) is a Sr. Data Architect, Data Lake & AI/ML, serving strategic customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML and analytics space. DK has many years of experience in building data-intensive solutions across a range of industry verticals, including high-tech, FinTech, insurance, and consumer-facing applications.

Marc Karp is an ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.