Optimize data preparation with new features in AWS SageMaker Data Wrangler

Data preparation is a critical step in any data-driven project, and having the right tools can greatly enhance operational efficiency. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

In this post, we explore the latest features of SageMaker Data Wrangler that are specifically designed to improve the operational experience. We delve into support for Amazon Simple Storage Service (Amazon S3) manifest files, inference artifacts in an interactive data flow, and seamless integration with the JSON (JavaScript Object Notation) format for inference, highlighting how these enhancements make data preparation easier and more efficient.

Introducing new features

In this section, we discuss the new SageMaker Data Wrangler features for data preparation.

S3 manifest file support with SageMaker Autopilot for ML inference

SageMaker Data Wrangler enables a unified data preparation and model training experience with Amazon SageMaker Autopilot in just a few clicks. You can use SageMaker Autopilot to automatically train, tune, and deploy models on the data that you’ve transformed in your data flow.

This experience is now further simplified with S3 manifest file support. An S3 manifest file is a text file that lists the objects (files) stored in an S3 bucket. If your exported dataset in SageMaker Data Wrangler is large and split into multiple part files in Amazon S3, SageMaker Data Wrangler now automatically creates a manifest file in Amazon S3 that references all of these data files. You can then use the generated manifest file with the SageMaker Autopilot UI in SageMaker Data Wrangler to pick up all the partitioned data for training.

Before this feature launch, when using SageMaker Autopilot models trained on prepared data from SageMaker Data Wrangler, you could only choose one data file, which might not represent the entire dataset, especially if the dataset is very large. With this new manifest file experience, you’re not limited to a subset of your dataset. You can build an ML model with SageMaker Autopilot representing all your data using the manifest file and use that for your ML inference and production deployment. This feature enhances operational efficiency by simplifying training ML models with SageMaker Autopilot and streamlining data processing workflows.
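To make the manifest idea concrete, the following is a minimal sketch of the general SageMaker ManifestFile shape: a JSON array whose first element is a common S3 prefix and whose remaining elements are object keys relative to that prefix. SageMaker Data Wrangler writes this file for you; the bucket name, prefix, and part file names here are hypothetical, and the exact file that SageMaker Data Wrangler generates is not shown in this post.

```python
import json


def build_manifest(prefix, part_keys):
    """Return a SageMaker-style S3 manifest as a JSON string.

    The first entry holds the common S3 prefix; the remaining entries are
    object keys relative to that prefix.
    """
    if not prefix.startswith("s3://"):
        raise ValueError("prefix must be an S3 URI")
    if not prefix.endswith("/"):
        prefix += "/"
    return json.dumps([{"prefix": prefix}, *part_keys], indent=2)


# Hypothetical export location and part files, for illustration only.
manifest = build_manifest(
    "s3://example-bucket/export/reviews",
    ["part-00000.csv", "part-00001.csv", "part-00002.csv"],
)
print(manifest)
```

Because the manifest lists every part file, a training job that reads it sees the whole dataset rather than a single part.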

Added support for inference flow in generated artifacts

Customers want to take the data transformations they’ve applied to their model training data, such as one-hot encoding, PCA, and missing-value imputation, and apply those data transformations to real-time inference or batch inference in production. To do so, you must have a SageMaker Data Wrangler inference artifact, which is consumed by a SageMaker model.

Previously, inference artifacts could only be generated from the UI when exporting to SageMaker Autopilot training or exporting an inference pipeline notebook. This didn’t provide flexibility if you wanted to take your SageMaker Data Wrangler flows outside of the Amazon SageMaker Studio environment. Now, you can generate an inference artifact for any compatible flow file through a SageMaker Data Wrangler processing job. This enables programmatic, end-to-end MLOps with SageMaker Data Wrangler flows for code-first MLOps personas, as well as an intuitive, no-code path to get an inference artifact by creating a job from the UI.

Streamlining data preparation

JSON has become a widely adopted format for data exchange in modern data ecosystems. SageMaker Data Wrangler’s integration with JSON format allows you to seamlessly handle JSON data for transformation and cleaning. By providing native support for JSON, SageMaker Data Wrangler simplifies the process of working with structured and semi-structured data, enabling you to extract valuable insights and prepare data efficiently. SageMaker Data Wrangler now supports JSON format for both batch and real-time inference endpoint deployment.

Solution overview

For our use case, we use the sample Amazon customer reviews dataset to show how SageMaker Data Wrangler can simplify the operational effort to build a new ML model using SageMaker Autopilot. The Amazon customer reviews dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014.

On a high level, we use SageMaker Data Wrangler to manage this large dataset and perform the following actions:

  1. Develop an ML model in SageMaker Autopilot using all of the dataset, not just a sample.
  2. Build a real-time inference pipeline with the inference artifact generated by SageMaker Data Wrangler, and use JSON formatting for input and output.

S3 manifest file support with SageMaker Autopilot

When creating a SageMaker Autopilot experiment using SageMaker Data Wrangler, you could previously only specify a single CSV or Parquet file. Now you can also use an S3 manifest file, allowing you to use large amounts of data for SageMaker Autopilot experiments. SageMaker Data Wrangler will automatically partition input data files into several smaller files and generate a manifest that can be used in a SageMaker Autopilot experiment to pull in all the data from the interactive session, not just a small sample.

Complete the following steps:

  1. Import the Amazon customer review data from a CSV file into SageMaker Data Wrangler. Make sure to disable sampling when importing the data.
  2. Specify the transformations that normalize the data. For this example, remove symbols and transform everything into lowercase using SageMaker Data Wrangler’s built-in transformations.
  3. Choose Train model to start training.

Data Flow - Train Model

To train a model with SageMaker Autopilot, SageMaker automatically exports data to an S3 bucket. For large datasets like this one, it will automatically break up the file into smaller files and generate a manifest that includes the location of the smaller files.

Data Flow - Autopilot

  1. First, select your input data.

Earlier, SageMaker Data Wrangler didn’t have an option to generate a manifest file to use with SageMaker Autopilot. Today, with the release of manifest file support, SageMaker Data Wrangler will automatically export a manifest file to Amazon S3, pre-fill the S3 location of the SageMaker Autopilot training with the manifest file S3 location, and toggle the manifest file option to Yes. No work is necessary to generate or use the manifest file.

Autopilot Experiment

  1. Configure your experiment by selecting the target for the model to predict.
  2. Next, select a training method. In this case, we select Auto and let SageMaker Autopilot decide the best training method based on the dataset size.

Create an Autopilot Experiment

  1. Specify the deployment settings.
  2. Finally, review the job configuration and submit the SageMaker Autopilot experiment for training. When SageMaker Autopilot completes the experiment, you can view the training results and explore the best model.

Autopilot Experiment - Complete

Thanks to support for manifest files, you can use your entire dataset for the SageMaker Autopilot experiment, not just a subset of your data.

For more information on using SageMaker Autopilot with SageMaker Data Wrangler, see Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot.

Generate inference artifacts from SageMaker Processing jobs

Now, let’s look at how we can generate inference artifacts through both the SageMaker Data Wrangler UI and SageMaker Data Wrangler notebooks.

SageMaker Data Wrangler UI

For our use case, we want to process our data through the UI and then use the resulting data to train and deploy a model through the SageMaker console. Complete the following steps:

  1. Open the data flow you created in the preceding section.
  2. Choose the plus sign next to the last transform, choose Add destination, and choose Amazon S3. This will be where the processed data will be stored.
    Data Flow - S3 Destination
  3. Choose Create job.
    Data Flow - S3 Destination
  4. Select Generate inference artifacts in the Inference parameters section to generate an inference artifact.
  5. For Inference artifact name, enter the name of your inference artifact (with .tar.gz as the file extension).
  6. For Inference output node, enter the destination node corresponding to the transforms applied to your training data.
  7. Choose Configure job.
    Choose Configure Job
  8. Under Job configuration, enter a path for Flow file S3 location. A folder called data_wrangler_flows will be created under this location, and the inference artifact will be uploaded to this folder. To change the upload location, set a different S3 location.
  9. Leave the defaults for all other options and choose Create to create the processing job.
    Processing Job
    The processing job will create a tarball (.tar.gz) containing a modified data flow file with a newly added inference section that allows you to use it for inference. You need the S3 uniform resource identifier (URI) of the inference artifact to provide the artifact to a SageMaker model when deploying your inference solution. The URI will be in the form {Flow file S3 location}/data_wrangler_flows/{inference artifact name}.tar.gz.
  10. If you didn’t note these values earlier, you can choose the link to the processing job to find the relevant details. In our example, the URI is s3://sagemaker-us-east-1-43257985977/data_wrangler_flows/example-2023-05-30T12-20-18.tar.gz.
    Processing Job - Complete
  11. Copy the value of Processing image; we need this URI when creating our model, too.
    Processing Job - S3 URI
  12. We can now use this URI to create a SageMaker model on the SageMaker console, which we can later deploy to an endpoint or batch transform job.
    SageMaker - Create Model
  13. Under Model settings, enter a model name and specify your IAM role.
  14. For Container input options, select Provide model artifacts and inference image location.
    Create Model
  15. For Location of inference code image, enter the processing image URI.
  16. For Location of model artifacts, enter the inference artifact URI.
  17. Additionally, if your data has a target column that will be predicted by a trained ML model, specify the name of that column under Environment variables, with INFERENCE_TARGET_COLUMN_NAME as Key and the column name as Value.
    Location of Model Artifacts and Image
  18. Finish creating your model by choosing Create model.
    Create Model

We now have a model that we can deploy to an endpoint or batch transform job.
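The console steps above can also be sketched in code. The following assembles the inference artifact URI in the form described earlier and builds the arguments for the SageMaker CreateModel API (boto3 `create_model`). This is a sketch under assumptions, not the post's own code: the model name, role ARN, bucket, artifact name, and target column are placeholders, and the image URI must be replaced with the Processing image value copied from your processing job.

```python
def inference_artifact_uri(flow_s3_location, artifact_name):
    """Build {Flow file S3 location}/data_wrangler_flows/{artifact name}.tar.gz."""
    name = artifact_name if artifact_name.endswith(".tar.gz") else artifact_name + ".tar.gz"
    return f"{flow_s3_location.rstrip('/')}/data_wrangler_flows/{name}"


def create_model_request(model_name, role_arn, image_uri, artifact_uri, target_column=None):
    """Assemble keyword arguments for the SageMaker CreateModel API."""
    container = {"Image": image_uri, "ModelDataUrl": artifact_uri}
    if target_column:
        # Only needed when the data has a target column that a trained model predicts.
        container["Environment"] = {"INFERENCE_TARGET_COLUMN_NAME": target_column}
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": container,
    }


# All values below are placeholders for illustration.
request = create_model_request(
    model_name="data-wrangler-inference-example",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    image_uri="<processing-image-uri>",
    artifact_uri=inference_artifact_uri("s3://example-bucket/flows", "example-artifact"),
    target_column="example_target_column",
)
print(request)

# To create the model (requires boto3 and AWS credentials):
# import boto3
# boto3.client("sagemaker").create_model(**request)
```

Keeping the request assembly separate from the API call makes it easy to review the values before anything is created in your account.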

SageMaker Data Wrangler notebooks

For a code-first approach to generate the inference artifact from a processing job, we can find the example code by choosing Export to on the node menu and choosing either Amazon S3, SageMaker Pipelines, or SageMaker Inference Pipeline. We choose SageMaker Inference Pipeline in this example.

SageMaker Inference Pipeline

In this notebook, there is a section titled Create Processor (this is identical in the SageMaker Pipelines notebook, but in the Amazon S3 notebook, the equivalent code will be under the Job Configurations section). At the bottom of this section is a configuration for our inference artifact called inference_params. It contains the same information that we saw in the UI, namely the inference artifact name and the inference output node. These values will be prepopulated but can be modified. There is additionally a parameter called use_inference_params, which needs to be set to True to use this configuration in the processing job.

Inference Config

Further down is a section titled Define Pipeline Steps, where the inference_params configuration is appended to a list of job arguments and passed into the definition for a SageMaker Data Wrangler processing step. In the Amazon S3 notebook, job_arguments is defined immediately after the Job Configurations section.
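The pattern described above, a configuration dictionary that is appended to the job arguments only when a flag is set, can be sketched as follows. This is not the generated notebook code: the dictionary keys, the placeholder values, and the command-line flag name are all assumptions made for illustration, so rely on the prepopulated values in your exported notebook rather than on these names.

```python
import json

# Hypothetical key names and values; the exported notebook prepopulates the real ones.
inference_params = {
    "inference_artifact_name": "example-artifact.tar.gz",
    "inference_output_node": "example-output-node-id",
}
use_inference_params = True


def build_job_arguments(base_arguments, params, enabled):
    """Return a new argument list, appending the serialized config when enabled."""
    arguments = list(base_arguments)
    if enabled:
        # The flag name below is an assumption, not taken from the notebook.
        arguments.append(f"--example-inference-params '{json.dumps(params)}'")
    return arguments


job_arguments = build_job_arguments([], inference_params, use_inference_params)
print(job_arguments)
```

Setting use_inference_params to False leaves the argument list unchanged, so the same code path can run the processing job with or without producing an inference artifact.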

Create SageMaker Pipeline

With these simple configurations, the processing job created by this notebook will generate an inference artifact in the same S3 location as our flow file (defined earlier in our notebook). We can programmatically determine this S3 location and use this artifact to create a SageMaker model using the SageMaker Python SDK, which is demonstrated in the SageMaker Inference Pipeline notebook.

The same approach can be applied to any Python code that creates a SageMaker Data Wrangler processing job.

JSON file format support for input and output during inference

Websites and applications commonly use JSON for API requests and responses because it is easy to parse in most programming languages.

Previously, after you had a trained model, you could only interact with it via CSV as an input format in a SageMaker Data Wrangler inference pipeline. Today, you can use JSON as an input and output format, providing more flexibility when interacting with SageMaker Data Wrangler inference containers.

To get started with using JSON for input and output in the inference pipeline notebook, complete the following steps:

  1. Define a payload.

For each payload, the model is expecting a key named instances. The value is a list of objects, each being its own data point. The objects require a key called features, and the values should be the features of a single data point that are intended to be submitted to the model. Multiple data points can be submitted in a single request, up to a total size of 6 MB per request.

See the following code:

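import json

# Each entry in "instances" is one data point; "features" holds its feature values.
sample_record_payload = json.dumps(
    {
        "instances": [
            {
                "features": [
                    "This is the best",
                    "I'd use this product twice a day every day if I could. it's the best ever",
                ]
            }
        ]
    }
)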
  2. Specify the ContentType as application/json.
  3. Provide data to the model and receive inference in JSON format.
    Inference Request

See Common Data Formats for Inference for sample input and output JSON examples.
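The three steps above can be sketched end to end with the SageMaker Runtime InvokeEndpoint API (boto3 `invoke_endpoint`). This is a sketch under assumptions: the endpoint name is a placeholder, the 6 MB limit is interpreted here as 6 * 1024 * 1024 bytes, and the shape of the JSON response is not specified in this post, so the code only decodes it.

```python
import json

# Assumption: the 6 MB request limit is interpreted as 6 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 6 * 1024 * 1024


def build_payload(data_points):
    """Wrap each data point's feature list in the instances/features structure."""
    body = json.dumps({"instances": [{"features": list(f)} for f in data_points]})
    if len(body.encode("utf-8")) > MAX_REQUEST_BYTES:
        raise ValueError("payload exceeds the 6 MB request limit")
    return body


def invoke(runtime_client, endpoint_name, body):
    """Send a JSON request to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=body,
    )
    return json.loads(response["Body"].read())


payload = build_payload([
    ["This is the best", "I'd use this product twice a day every day if I could. it's the best ever"],
])
print(payload)

# To send the request (requires boto3, AWS credentials, and a deployed endpoint):
# import boto3
# result = invoke(boto3.client("sagemaker-runtime"), "example-endpoint", payload)
```

Checking the size before sending avoids a round trip for requests that would be rejected for exceeding the limit.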

Clean up

When you are finished using SageMaker Data Wrangler, we recommend that you shut down the instance it runs on to avoid incurring additional charges. For instructions on how to shut down the SageMaker Data Wrangler app and associated instance, see Shut Down Data Wrangler.

Conclusion

SageMaker Data Wrangler’s new features, including support for S3 manifest files, inference capabilities, and JSON format integration, transform the operational experience of data preparation. These enhancements streamline data import, automate data transformations, and simplify working with JSON data. With these features, you can enhance your operational efficiency, reduce manual effort, and extract valuable insights from your data with ease. Embrace the power of SageMaker Data Wrangler’s new features and unlock the full potential of your data preparation workflows.

To get started with SageMaker Data Wrangler, check out the latest information on the SageMaker Data Wrangler product page.


About the authors

Munish Dabra is a Principal Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML and Observability. He has a strong background in designing and building scalable distributed systems. He enjoys helping customers innovate and transform their business in AWS. LinkedIn: /mdabra

Patrick Lin is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is committed to making Amazon SageMaker Data Wrangler the number one data preparation tool for productionized ML workflows. Outside of work, you can find him reading, listening to music, having conversations with friends, and serving at his church.

Index your Alfresco content using the new Amazon Kendra Alfresco connector

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to index and search across several structured and unstructured repositories.

Alfresco Content Services provides open, flexible, highly scalable enterprise content management (ECM) capabilities with the added benefits of a content services platform, making content accessible wherever and however you work through easy integrations with the business applications you use every day. Many organizations use the Alfresco content management platform to store their content. One of the key requirements for enterprise customers using Alfresco is the ability to easily and securely find accurate information across all the stored documents.

We are excited to announce that you can now use the new Amazon Kendra Alfresco connector to search documents stored in your Alfresco repositories and sites. In this post, we show how to use the new connector to retrieve documents stored in Alfresco for indexing purposes and securely use the Amazon Kendra intelligent search function. In addition, the ML-powered intelligent search can accurately find information from unstructured documents with natural language narrative content, for which keyword search is not very effective.

What’s new in the Amazon Kendra Alfresco connector

The Amazon Kendra Alfresco connector offers support for the following:

  • Basic and OAuth2 authentication mechanisms for the Alfresco On-Premises (On-Prem) platform
  • Basic and OAuth2 authentication mechanisms for the Alfresco PaaS platform
  • Aspect-based crawling of Alfresco repository documents

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repositories and sites. The solution in this post demonstrates the following:

  • Retrieval of documents and comments from Alfresco private sites and public sites
  • Retrieval of documents and comments from Alfresco repositories using Amazon Kendra-specific aspects
  • Authentication against Alfresco On-Prem and PaaS platforms using Basic and OAuth2 mechanisms, respectively
  • The Amazon Kendra search capability with access control across sites and repositories

If you are going to use only one of the platforms, you can still follow this post to build the example solution; just ignore the steps corresponding to the platform that you are not using.

The following is a summary of the steps to build the example solution:

  1. Upload documents to the three Alfresco sites and the repository folder. Make sure the uploaded documents are unique across sites and repository folders.
  2. For the two private sites and repository, use document-level Alfresco permission management to set access permissions. For the public site, you don’t need to set up permissions at the document level. Note that permissions information is retrieved by the Amazon Kendra Alfresco connector and used for access control by the Amazon Kendra search function.
  3. For the two private sites and repository, create a new Amazon Kendra index (you use the same index for both the private sites and the repository). For the public site, create a new Amazon Kendra index.
  4. For the On-Prem private site, create an Amazon Kendra Alfresco data source using Basic authentication, within the Amazon Kendra index for private sites.
  5. For the On-Prem repository documents with Amazon Kendra-specific aspects, create a data source using Basic authentication, within the Amazon Kendra index for private sites.
  6. For the PaaS private site, create a data source using Basic authentication, within the Amazon Kendra index for private sites.
  7. For the PaaS public site, create a data source using OAuth2 authentication, within the Amazon Kendra index for public sites.
  8. Perform a sync for each data source.
  9. Run a test query in the Amazon Kendra index meant for private sites and the repository using access control.
  10. Run a test query in the Amazon Kendra index meant for public sites without access control.

Prerequisites

You need an AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies. You need to have a basic knowledge of AWS and how to navigate the AWS Management Console.

For the Alfresco On-Prem platform, complete the following steps:

  1. Create a private site or use an existing site.
  2. Create a repository folder or use an existing repository folder.
  3. Get the repository URL.
  4. Get Basic authentication credentials (user ID and password).
  5. Make sure authentication users are part of the ALFRESCO_ADMINISTRATORS group.
  6. Get the public X509 certificate in .pem format and save it locally.

For the Alfresco PaaS platform, complete the following steps:

  1. Create a private site or use an existing site.
  2. Create a public site or use an existing site.
  3. Get the repository URL.
  4. Get Basic authentication credentials (user ID and password).
  5. Get OAuth2 credentials (client ID, client secret, and token URL).
  6. Confirm that authentication users are part of the ALFRESCO_ADMINISTRATORS group.

Step 1: Upload example documents

Each uploaded document must contain 5 MB or less of text. For more information, see Amazon Kendra Service Quotas. You can upload example documents or use existing documents within each site.

As shown in the following screenshot, we have uploaded four documents to the Alfresco On-Prem private site.

We have uploaded three documents to the Alfresco PaaS private site.

We have uploaded five documents to the Alfresco PaaS public site.

We have uploaded two documents to the Alfresco On-Prem repository.

Assign the aspect awskendra:indexControl to one or more documents in the repository folder.

Step 2: Configure Alfresco permissions

Use the Alfresco Permissions Management feature to give access rights to example users for viewing uploaded documents. It is assumed that you have some example Alfresco user names, with email addresses, that can be used for setting permissions at the document level in private sites. These users are not used for crawling the sites.

In the following example for the On-Prem private site, we have provided users My Dev User1 and My Dev User2 with site-consumer access to the example document. Repeat the same procedure for the other uploaded documents.

In the following example for the PaaS private site, we have provided user Kendra User3 with site-consumer access to the example document. Repeat the same procedure for the other uploaded documents.

For the Alfresco repository documents, we have provided user My Dev User1 with consumer access to the example document.

The following table lists the site or repository names, document names, and permissions.

Platform | Site or Repository Name | Document Name | User IDs
On-Prem | MyAlfrescoSite | ChannelMarketingBudget.xlsx | My Manager User3
On-Prem | MyAlfrescoSite | wellarchitected-sustainability-pillar.pdf | My Dev User1, My Dev User2
On-Prem | MyAlfrescoSite | WorkDocs.docx | My Dev User1, My Dev User2, My Manager User3
On-Prem | MyAlfrescoSite | WorldPopulation.csv | My Dev User1, My Dev User2, My Manager User3
PaaS | MyAlfrescoCloudSite2 | DDoS_White_Paper.pdf | Kendra User3
PaaS | MyAlfrescoCloudSite2 | wellarchitected-framework.pdf | Kendra User3
PaaS | MyAlfrescoCloudSite2 | ML_Training.pptx | Kendra User1
PaaS | MyAlfrescoCloudPublicSite | batch_user.pdf | Everyone
PaaS | MyAlfrescoCloudPublicSite | Amazon Simple Storage Service – User Guide.pdf | Everyone
PaaS | MyAlfrescoCloudPublicSite | AWS Batch – User Guide.pdf | Everyone
PaaS | MyAlfrescoCloudPublicSite | Amazon Detective.docx | Everyone
PaaS | MyAlfrescoCloudPublicSite | Pricing.xlsx | Everyone
On-Prem | Repo: MyAlfrescoRepoFolder1 | Polly-dg.pdf (aspect awskendra:indexControl) | My Dev User1
On-Prem | Repo: MyAlfrescoRepoFolder1 | Transcribe-api.pdf (aspect awskendra:indexControl) | My Dev User1

Step 3: Set up Amazon Kendra indexes

You can create a new Amazon Kendra index or use an existing index for indexing documents hosted in Alfresco private sites. To create a new index, complete the following steps:

  1. On the Amazon Kendra console, create an index called Alfresco-Private.
  2. Create a new IAM role, then choose Next.
  3. For Access Control, choose Yes.
  4. For Token Type, choose JSON.
  5. Keep the user name and group as default.
  6. Choose None for user group expansion because we are assuming no integration with AWS IAM Identity Center (successor to AWS Single Sign-On).
  7. Choose Next.
  8. Choose Developer Edition for this example solution.
  9. Choose Create to create a new index.

The following screenshot shows the Alfresco-Private index after it has been created.

  1. You can verify the access control configuration on the User access control tab.

  1. Repeat these steps to create a second index called Alfresco-Public.
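If you prefer to script index creation, the console choices above map roughly onto the Amazon Kendra CreateIndex API (boto3 `create_index`). The following is a sketch under assumptions: the role ARN is a placeholder for a role you have already created, and the attribute field names "username" and "groups" are assumed to match the console defaults that the steps tell you to keep.

```python
def index_request(name, role_arn, access_control=True):
    """Assemble keyword arguments for the Amazon Kendra CreateIndex API."""
    request = {"Name": name, "Edition": "DEVELOPER_EDITION", "RoleArn": role_arn}
    if access_control:
        # Token-based access control with a JSON token type.
        request["UserContextPolicy"] = "USER_TOKEN"
        request["UserTokenConfigurations"] = [
            {
                "JsonTokenTypeConfiguration": {
                    # Assumed to match the console defaults for user name and group.
                    "UserNameAttributeField": "username",
                    "GroupAttributeField": "groups",
                }
            }
        ]
    return request


# Placeholder role ARN, for illustration only.
ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleKendraIndexRole"
requests_by_index = {
    name: index_request(name, ROLE_ARN) for name in ("Alfresco-Private", "Alfresco-Public")
}
print(requests_by_index["Alfresco-Private"])

# To create the indexes (requires boto3 and AWS credentials):
# import boto3
# kendra = boto3.client("kendra")
# for request in requests_by_index.values():
#     kendra.create_index(**request)
```

No user group expansion setting is included, which corresponds to choosing None in the console steps.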

Step 4: Create a data source for the On-Prem private site

To create a data source for the On-Prem private site, complete the following steps:

  1. On the Amazon Kendra console, navigate to the Alfresco-Private index.
  2. Choose Data sources in the navigation pane.
  3. Choose Add data source.

  1. Choose Add connector for the Alfresco connector.

  1. For Data source name, enter Alfresco-OnPrem-Private.
  2. Optionally, add a description.
  3. Keep the remaining settings as default and choose Next.

To connect to the Alfresco On-Prem site, the connector needs access to the public certificate corresponding to the On-Prem server. This was one of the prerequisites.

  1. Use a different browser tab to upload the .pem file to an Amazon Simple Storage Service (Amazon S3) bucket in your account.

You use this S3 bucket name in the next steps.

  1. Return to the data source creation page.
  2. For Source, select Alfresco server.
  3. For Alfresco repository URL, enter the repository URL (created as a prerequisite).
  4. For Alfresco user application URL, enter the same value as the repository URL.
  5. For SSL certificate location, choose Browse S3 and choose the S3 bucket where you uploaded the .pem file.
  6. For Authentication, select Basic authentication.
  7. For AWS Secrets Manager secret, choose Create and add new secret.

A pop-up window opens to create an AWS Secrets Manager secret.

  1. Enter a name for your secret, user name, and password, then choose Save.

  1. For Virtual Private Cloud (VPC), choose No VPC.
  2. Turn the identity crawler on.
  3. For IAM role, choose Create a new IAM role.
  4. Choose Next.

You can configure the data source to synchronize contents from one or more Alfresco sites. For this post, we sync from the On-Prem private site.

  1. For Content to sync, select Single Alfresco site sync and choose MyAlfrescoSite.
  2. Select Include comments to retrieve comments in addition to documents.
  3. For Sync mode, select Full sync.
  4. For Frequency, choose Run on demand (or a different frequency option as needed).
  5. Choose Next.

  1. Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults), then choose Next.

  1. On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 5: Create a data source for the On-Prem repository documents with Amazon Kendra-specific aspects

Similarly to the previous steps, create a data source for the On-Prem repository documents with Amazon Kendra-specific aspects:

  1. On the Amazon Kendra console, navigate to the Alfresco-Private index.
  2. Choose Data sources in the navigation pane.
  3. Choose Add data source.
  4. Choose Add connector for the Alfresco connector.
  5. For Data source name, enter Alfresco-OnPrem-Aspects.
  6. Optionally, add a description.
  7. Keep the remaining settings as default and choose Next.
  8. For Source, select Alfresco server.
  9. For Alfresco repository URL, enter the repository URL (created as a prerequisite).
  10. For Alfresco user application URL, enter the same value as the repository URL.
  11. For SSL certificate location, choose Browse S3 and choose the S3 bucket where you uploaded the .pem file.
  12. For Authentication, select Basic authentication.
  13. For AWS Secrets Manager secret, choose the secret you created earlier.
  14. For Virtual Private Cloud (VPC), choose No VPC.
  15. Turn the identity crawler off.
  16. For IAM role, choose Create a new IAM role.
  17. Choose Next.

For this scope, the connector retrieves only those On-Prem server repository documents that have been assigned an aspect called awskendra:indexControl.

  1. For Content to sync, select Alfresco aspects sync.
  2. For Sync mode, select Full sync.
  3. For Frequency, choose Run on demand (or a different frequency option as needed).
  4. Choose Next.
  5. Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults), then choose Next.
  6. On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 6: Create a data source for the PaaS private site

Follow steps similar to those in the previous sections to create a data source for the PaaS private site:

  1. On the Amazon Kendra console, navigate to the Alfresco-Private index.
  2. Choose Data sources in the navigation pane.
  3. Choose Add data source.
  4. Choose Add connector for the Alfresco connector.
  5. For Data source name, enter Alfresco-Cloud-Private.
  6. Optionally, add a description.
  7. Keep the remaining settings as default and choose Next.
  8. For Source, select Alfresco cloud.
  9. For Alfresco repository URL, enter the repository URL (created as a prerequisite).
  10. For Alfresco user application URL, enter the same value as the repository URL.
  11. For Authentication, select Basic authentication.
  12. For AWS Secrets Manager secret, choose Create and add new secret.
  13. Enter a name for your secret, user name, and password, then choose Save.
  14. For Virtual Private Cloud (VPC), choose No VPC.
  15. Turn the identity crawler off.
  16. For IAM role, choose Create a new IAM role.
  17. Choose Next.

We can configure the data source to synchronize contents from one or more Alfresco sites. For this post, we configure the data source to sync from the PaaS private site MyAlfrescoCloudSite2.

  1. For Content to sync, select Single Alfresco site sync and choose MyAlfrescoCloudSite2.
  2. Select Include comments.
  3. For Sync mode, select Full sync.
  4. For Frequency, choose Run on demand (or a different frequency option as needed).
  5. Choose Next.
  6. Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults) and choose Next.
  7. On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 7: Create a data source for the PaaS public site

We follow steps similar to the previous ones to create a data source for the PaaS public site:

  1. On the Amazon Kendra console, navigate to the Alfresco-Public index.
  2. Choose Data sources in the navigation pane.
  3. Choose Add data source.
  4. Choose Add connector for the Alfresco connector.
  5. For Data source name, enter Alfresco-Cloud-Public.
  6. Optionally, add a description.
  7. Keep the remaining settings as default and choose Next.
  8. For Source, select Alfresco cloud.
  9. For Alfresco repository URL, enter the repository URL (created as a prerequisite).
  10. For Alfresco user application URL, enter the same value as the repository URL.
  11. For Authentication, select OAuth2.0 authentication.
  12. For AWS Secrets Manager secret, choose Create and add new secret.
  13. Enter a name for your secret, client ID, client secret, and token URL, then choose Save.
  14. For Virtual Private Cloud (VPC), choose No VPC.
  15. Turn the identity crawler off.
  16. For IAM role, choose Create a new IAM role.
  17. Choose Next.

We configure this data source to sync to the PaaS public site MyAlfrescoCloudPublicSite.

  1. For Content to sync, select Single Alfresco site sync and choose MyAlfrescoCloudPublicSite.
  2. Optionally, select Include comments.
  3. For Sync mode, select Full sync.
  4. For Frequency, choose Run on demand (or a different frequency option as needed).
  5. Choose Next.
  6. Map the Alfresco document fields to the Amazon Kendra index fields (you can keep the defaults) and choose Next.
  7. On the Review and Create page, verify all the information, then choose Add data source.

After the data source has been created, the data source page is displayed as shown in the following screenshot.

Step 8: Perform a sync for each data source

Navigate to each of the data sources and choose Sync now. Complete only one synchronization at a time.

Wait for synchronization to be complete for all data sources. When each synchronization is complete for a data source, you see the status as shown in the following screenshot.

You can also view Amazon CloudWatch logs for a specific sync under Sync run history.
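
The same sync can be started and monitored programmatically. The following is a minimal sketch, assuming a boto3 Amazon Kendra client is passed in as kendra; the function name, polling interval, and placeholder IDs are illustrative and not part of the original walkthrough.

```python
import time

# Job states that mean the sync has not finished yet.
IN_PROGRESS = ("SYNCING", "SYNCING_INDEXING", "STOPPING")


def sync_data_source(kendra, index_id, data_source_id, poll_seconds=30, max_polls=120):
    """Start one sync job and poll its status until it finishes.

    `kendra` is assumed to be a boto3 Kendra client, for example
    boto3.client("kendra").
    """
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]

    for _ in range(max_polls):
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # Find the job we started among the reported sync runs.
        job = next((j for j in history if j.get("ExecutionId") == execution_id), None)
        if job is not None and job["Status"] not in IN_PROGRESS:
            return job["Status"]
        time.sleep(poll_seconds)

    raise TimeoutError(f"Sync job {execution_id} did not finish in time")
```

Calling the function once per data source, one after the other, matches the guidance to complete only one synchronization at a time.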

Step 9: Run a test query in the private index using access control

Now it’s time to test the solution. We first run a query in the private index using access control:

  1. On the Amazon Kendra console, navigate to the Alfresco-Private index and choose Search indexed content.

  1. Enter a query in the search field.

As shown in the following screenshot, Amazon Kendra didn’t return any results.

  1. Choose Apply token.
  2. Enter the email address corresponding to the My Dev User1 user and choose Apply.

Note that Amazon Kendra access control works based on the email address associated with an Alfresco user name.

  1. Run the search again.

The search results in a document list (containing wellarchitected-sustainability-pillar.pdf in the following example) based on the access control setup.

If you run the same query again and provide an email address that doesn’t have access to either of these documents, you should not see these documents in the results list.

  1. Enter another query to search in the documents based on the aspect awskendra:indexControl.
  2. Choose Apply token, enter the email address corresponding to My Dev User1 user, and choose Apply.
  3. Rerun the query.
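
The console test above can be approximated in code. This sketch assumes a boto3 Amazon Kendra client and assumes that the email address applied as a token in the console corresponds to the UserId field of the query's UserContext; the function name and the example values are hypothetical.

```python
def search_as_user(kendra, index_id, query_text, user_email=None):
    """Run a Kendra query, optionally filtered by a user's access rights.

    `kendra` is assumed to be a boto3 Kendra client. When `user_email`
    is omitted, no user context is sent with the query.
    """
    request = {"IndexId": index_id, "QueryText": query_text}
    if user_email:
        # Assumption: the email address identifies the user for access control.
        request["UserContext"] = {"UserId": user_email}

    response = kendra.query(**request)
    # Return only the document titles of the matching results.
    return [
        item.get("DocumentTitle", {}).get("Text", "")
        for item in response.get("ResultItems", [])
    ]
```

Running the same query with and without user_email should reproduce the difference seen in the console: documents protected by access control appear only for a user who is allowed to read them.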

Step 10: Run a test query in the public index without access control

Similarly, we can test our solution by running queries in the public index without access control:

  1. On the Amazon Kendra console, navigate to the Alfresco-Public index and choose Search indexed content.
  2. Run a search query.

Because this example Alfresco public site has not been set up with any access control, we don’t use an access token.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. Delete newly added Alfresco data sources within the indexes. If you created new Amazon Kendra indexes while testing this solution, delete them as well.

Conclusion

With the new Alfresco connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Alfresco, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the Authors

Arun Anand is a Senior Solutions Architect at Amazon Web Services based in the Houston area. He has 25+ years of experience in designing and developing enterprise applications. He works with partners in the Energy & Utilities segment, providing architectural and best practice recommendations for new and existing solutions.

Rajnish Shaw is a Senior Solutions Architect at Amazon Web Services, with a background as a Product Developer and Architect. Rajnish is passionate about helping customers build applications on the cloud. Outside of work Rajnish enjoys spending time with family and friends, and traveling.

Yuanhua Wang is a software engineer at AWS with more than 15 years of experience in the technology industry. His interests include software architecture and building tools for cloud computing.

Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce apps with AI/ML

This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI.

This is the second post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker. In Part 1, we show how the Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their Salesforce data securely using SageMaker and use its tools to build, train, and deploy models to endpoints hosted on SageMaker. The endpoints are then registered to the Salesforce Data Cloud to activate predictions in Salesforce.

In this post, we expand on this topic to demonstrate how to use Einstein Studio for product recommendations. You can use this integration for traditional models as well as large language models (LLMs).

Solution overview

In this post, we demonstrate how to create a predictive model in SageMaker to recommend the next best product to your customers by using historical data such as customer demographics, marketing engagements, and purchase history from Salesforce Data Cloud.

We use the following sample dataset. To use this dataset in your Data Cloud, refer to Create Amazon S3 Data Stream in Data Cloud.

The following attributes are needed to create the model:

  • Club Member – If the customer is a club member
  • Campaign – The campaign the customer is a part of
  • State – The state or province the customer resides in
  • Month – The month of purchase
  • Case Count – The number of cases raised by the customer
  • Case Type Return – Whether the customer returned any product within the last year
  • Case Type Shipment Damaged – Whether the customer had any shipments damaged in the last year
  • Engagement Score – The level of engagement the customer has (response to mailing campaigns, logins to the online store, and so on)
  • Tenure – The tenure of the customer relationship with the company
  • Clicks – The average number of clicks the customer has made within a week prior to purchase
  • Pages Visited – The average number of pages the customer has visited within a week prior to purchase
  • Product Purchased – The actual product purchased
  • Id – The ID of the record
  • DateTime – The timestamp of the dataset

The product recommendation model is built and deployed on SageMaker and is trained using data in the Salesforce Data Cloud. The following steps give an overview of how to use the new capabilities launched in SageMaker for Salesforce to enable the overall integration:

  1. Set up the Amazon SageMaker Studio domain and OAuth between Salesforce and the AWS accounts.
  2. Use the newly launched capability of the Amazon SageMaker Data Wrangler connector for Salesforce Data Cloud to prepare the data in SageMaker without copying the data from Salesforce Data Cloud.
  3. Train a recommendation model in SageMaker Studio using training data that was prepared using SageMaker Data Wrangler.
  4. Package the SageMaker Data Wrangler container and the trained recommendation model container in an inference pipeline so the inference request can use the same data preparation steps you created to preprocess the training data. The real-time inference call data is first passed to the SageMaker Data Wrangler container in the inference pipeline, where it is preprocessed and passed to the trained model for product recommendation. For more information about this process, refer to New — Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler. Although we use a specific algorithm to train the model in our example, you can use any algorithm that you find appropriate for your use case.
  5. Use the newly launched SageMaker provided project template for Salesforce Data Cloud integration to streamline implementing the preceding steps by providing the following templates:
    1. An example notebook showcasing data preparation, building, training, and registering the model.
    2. The SageMaker provided project template for Salesforce Data Cloud integration, which automates creating a SageMaker endpoint hosting the inference pipeline model. When a version of the model in the Amazon SageMaker Model Registry is approved, the endpoint is exposed as an API with Amazon API Gateway using a custom Salesforce JSON Web Token (JWT) authorizer. API Gateway is required to allow Salesforce Data Cloud to make predictions against the SageMaker endpoint using a JWT token that Salesforce creates and passes with the request when making predictions from Salesforce. JWT can be used as a part of OpenID Connect (OIDC) and OAuth 2.0 frameworks to restrict client access to your APIs.
  6. After you create the API, we recommend registering the model endpoint in Salesforce Einstein Studio. For instructions, refer to Bring Your Own AI Models to Salesforce with Einstein Studio.
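
To make the role of the JWT in the steps above more concrete, the following sketch shows how the claims inside a JWT can be inspected. It is illustrative only: the function name is hypothetical, it is not the authorizer code deployed by the project template, and it does not verify the token's signature, which a real authorizer must do before trusting any claim.

```python
import base64
import json


def decode_jwt_claims(token):
    """Return the claims from a JWT payload WITHOUT verifying the signature.

    A JWT has three base64url-encoded parts separated by dots:
    header.payload.signature
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("A JWT must have exactly three dot-separated parts")

    payload = parts[1]
    # base64url encoding in JWTs omits padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

Inspecting claims this way is useful for debugging which issuer and audience a token carries; access decisions should rely only on a token whose signature has been validated.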

The following diagram illustrates the solution architecture.

Create a SageMaker Studio domain

First, create a SageMaker Studio domain. For instructions, refer to Onboard to Amazon SageMaker Domain. Note the domain ID and the execution role that is created; your user profile uses this role, and you add permissions to it in subsequent steps.

The following screenshot shows the domain we created for this post.

The following screenshot shows the example user profile for this post.

Set up the Salesforce connected app

Next, we create a Salesforce connected app to enable the OAuth flow from SageMaker Studio to Salesforce Data Cloud. Complete the following steps:

  1. Log in to Salesforce and navigate to Setup.
  2. Search for App Manager and create a new connected app.
  3. Provide the following inputs:
    1. For Connected App Name, enter a name.
    2. For API Name, leave as default (it’s automatically populated).
    3. For Contact Email, enter your contact email address.
    4. Select Enable OAuth Settings.
    5. For Callback URL, enter https://<domain-id>.studio.<region>.sagemaker.aws/jupyter/default/lab, and provide the domain ID that you captured while creating the SageMaker domain and the Region of your SageMaker domain.
  4. Under Selected OAuth Scopes, move the following from Available OAuth Scopes to Selected OAuth Scopes and choose Save:
    1. Manage user data via APIs (api)
    2. Perform requests at any time (refresh_token, offline_access)
    3. Perform ANSI SQL queries on Salesforce Data Cloud data (cdp_query_api)
    4. Manage Salesforce Customer Data Platform profile data (cdp_profile_api)
    5. Access the identity URL service (id, profile, email, address, phone)
    6. Access unique user identifiers (openid)

For more information about creating a connected app, refer to Create a Connected App.

  1. Return to the connected app and navigate to Consumer Key and Secret.
  2. Choose Manage Consumer Details.
  3. Copy the key and secret.

You may be asked to log in to your Salesforce org as part of the two-factor authentication here.

  1. Navigate back to the Manage Connected Apps page.
  2. Open the connected app you created and choose Manage.
  3. Choose Edit Policies and change IP Relaxation to Relax IP restrictions, then save your settings.

Configure SageMaker permissions and lifecycle rules

In this section, we walk through the steps to configure SageMaker permissions and lifecycle management rules.

Create a secret in AWS Secrets Manager

Enable OAuth integration with Salesforce Data Cloud by storing credentials from your Salesforce connected app in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Store a new secret.
  2. Select Other type of secret.
  3. Create your secret with the following key-value pairs:
    {
    "identity_provider": "SALESFORCE",
    "authorization_url": "https://login.salesforce.com/services/oauth2/authorize",
    "token_url": "https://login.salesforce.com/services/oauth2/token",
    "client_id": "<YOUR_CONSUMER_KEY>",
    "client_secret": "<YOUR_CONSUMER_SECRET>",
    "issue_url": "<YOUR_SALESFORCE_ORG_URL>"
    }

  4. Add a tag with the key sagemaker:partner and your choice of value.
  5. Save the secret and note the ARN of the secret.
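
The secret can also be created with the AWS SDK. The following is a minimal sketch, assuming a boto3 Secrets Manager client is passed in; the function names, the secret name, and the tag value are illustrative, while the key names and the sagemaker:partner tag key come from the steps above.

```python
import json


def build_salesforce_secret(consumer_key, consumer_secret, org_url):
    """Build the secret string with the key-value pairs listed in the steps above."""
    return json.dumps(
        {
            "identity_provider": "SALESFORCE",
            "authorization_url": "https://login.salesforce.com/services/oauth2/authorize",
            "token_url": "https://login.salesforce.com/services/oauth2/token",
            "client_id": consumer_key,
            "client_secret": consumer_secret,
            "issue_url": org_url,
        }
    )


def store_secret(secrets_client, name, secret_string, tag_value="salesforce"):
    """Create the secret with the sagemaker:partner tag and return its ARN.

    `secrets_client` is assumed to be boto3.client("secretsmanager").
    The tag value is a free choice; "salesforce" is only an example.
    """
    response = secrets_client.create_secret(
        Name=name,
        SecretString=secret_string,
        Tags=[{"Key": "sagemaker:partner", "Value": tag_value}],
    )
    return response["ARN"]
```

The returned ARN is the value that the lifecycle configuration needs later.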

Configure a SageMaker lifecycle rule

The SageMaker Studio domain execution role will require AWS Identity and Access Management (IAM) permissions to access the secret created in the previous step. For more information, refer to Creating roles and attaching policies (console).

  1. On the IAM console, attach the following policies to their respective roles (these roles will be used by the SageMaker project for deployment):
    1. Add the policy AmazonSageMakerPartnerServiceCatalogProductsCloudFormationServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsCloudformationRole.
    2. Add the policy AmazonSageMakerPartnerServiceCatalogProductsApiGatewayServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsApiGatewayRole.
    3. Add the policy AmazonSageMakerPartnerServiceCatalogProductsLambdaServiceRolePolicy to the service role AmazonSageMakerServiceCatalogProductsLambdaRole.
  2. On the IAM console, navigate to the SageMaker domain execution role.
  3. Choose Add permissions and select Create an inline policy.
  4. Enter the following policy in the JSON policy editor:
    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "secretsmanager:GetSecretValue",
    "secretsmanager:PutSecretValue"
    ],
    "Resource": "arn:aws:secretsmanager:*:*:secret:*",
    "Condition": {
    "ForAnyValue:StringLike": {
    "aws:ResourceTag/sagemaker:partner": "*"
    }
    }
    },
    {
    "Effect": "Allow",
    "Action": [
    "secretsmanager:UpdateSecret"
    ],
    "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*"
    }
    ]
    }
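
The inline policy can also be attached with the AWS SDK instead of the console. This is a sketch under the assumption that a boto3 IAM client is passed in; the function names and the default policy name are hypothetical, and the policy document mirrors the JSON shown in the preceding step.

```python
import json


def build_secret_access_policy():
    """Return the inline policy from the preceding step as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:PutSecretValue",
                ],
                "Resource": "arn:aws:secretsmanager:*:*:secret:*",
                # Only secrets tagged with sagemaker:partner are covered.
                "Condition": {
                    "ForAnyValue:StringLike": {"aws:ResourceTag/sagemaker:partner": "*"}
                },
            },
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:UpdateSecret"],
                "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*",
            },
        ],
    }


def attach_inline_policy(iam, role_name, policy_name="SalesforceSecretAccess"):
    """Attach the policy inline to the SageMaker domain execution role.

    `iam` is assumed to be boto3.client("iam"); the policy name is an example.
    """
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_secret_access_policy()),
    )
    return policy_name
```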

SageMaker Studio lifecycle configuration provides shell scripts that run when a notebook is created or started. The lifecycle configuration will be used to retrieve the secret and import it to the SageMaker runtime.

  1. On the SageMaker console, choose Lifecycle configurations in the navigation pane.
  2. Choose Create configuration.
  3. Leave the default selection Jupyter Server App and choose Next.
  4. Give the configuration a name.
  5. Enter the following script in the editor, providing the ARN for the secret you created earlier:
    #!/bin/bash
    set -eux
    
    cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
    {
    "secret_arn": "<YOUR_SECRETS_ARN>"
    }
    EOL

  1. Choose Submit to save the lifecycle configuration.
  2. Choose Domains in the navigation pane and open your domain.
  3. On the Environment tab, choose Attach to attach your lifecycle configuration.
  4. Choose the lifecycle configuration you created and choose Attach to domain.
  5. Choose Set as default.
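
For teams that script their setup, the lifecycle configuration can also be created through the SageMaker API. The following is a sketch, assuming a boto3 SageMaker client is passed in; the function names are illustrative, and attaching the configuration to the domain still has to be done separately, as in the console steps above.

```python
import base64

# The same script as in the console steps, with a placeholder for the ARN.
# Double braces are literal braces after str.format().
SCRIPT_TEMPLATE = """#!/bin/bash
set -eux

cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
{{
"secret_arn": "{secret_arn}"
}}
EOL
"""


def render_lifecycle_script(secret_arn):
    """Fill the secret ARN into the lifecycle script."""
    return SCRIPT_TEMPLATE.format(secret_arn=secret_arn)


def create_lifecycle_config(sagemaker_client, name, secret_arn):
    """Create a Jupyter Server lifecycle configuration and return its ARN.

    `sagemaker_client` is assumed to be boto3.client("sagemaker").
    The API expects the script content to be base64 encoded.
    """
    script = render_lifecycle_script(secret_arn)
    content = base64.b64encode(script.encode("utf-8")).decode("ascii")
    response = sagemaker_client.create_studio_lifecycle_config(
        StudioLifecycleConfigName=name,
        StudioLifecycleConfigContent=content,
        StudioLifecycleConfigAppType="JupyterServer",
    )
    return response["StudioLifecycleConfigArn"]
```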

If you are a returning SageMaker Studio user, upgrade to the latest Jupyter and SageMaker Data Wrangler kernels to ensure that Salesforce Data Cloud is enabled.

This completes the setup to enable data access from Salesforce Data Cloud to SageMaker Studio to build AI and machine learning (ML) models.

Create a SageMaker project

To start using the solution, first create a project using Amazon SageMaker Projects. Complete the following steps:

  1. In SageMaker Studio, under Deployments in the navigation pane, choose Projects.
  2. Choose Create project.
  3. Choose the project template called Model deployment for Salesforce.
  4. Choose Select project template.
  5. Enter a name and optional description for your project.
  6. Enter a model group name.
  7. Enter the name of the Secrets Manager secret that you created earlier.
  8. Choose Create project.

The project may take 1–2 minutes to initiate.

You can see two new repositories. The first one is for sample notebooks that you can use as is or customize to prepare, train, create, and register models in the SageMaker Model Registry. The second repository is for automating the model deployment, which includes exposing the SageMaker endpoint as an API.

  1. Choose clone repo for both repositories.

For this post, we use the product recommendation example, which can be found in the sagemaker-<YOUR-PROJECT-NAME>-p-<YOUR-PROJECT-ID>-example-nb/product-recommendation directory that you just cloned. Before we run the product-recommendation.ipynb notebook, let’s do some data preparation to create the training data using SageMaker Data Wrangler.

Prepare data with SageMaker Data Wrangler

Complete the following steps:

  1. In SageMaker Studio, on the File menu, choose New and Data Wrangler flow.
  2. After you create the data flow, choose (right-click) the tab and choose Rename to rename the file.
  3. Choose Import data.
  4. Choose Create connection.
  5. Choose Salesforce Data Cloud.
  6. For Name, enter salesforce-data-cloud-sagemaker-connection.
  7. For Salesforce org URL, enter your Salesforce org URL.
  8. Choose Save + Connect.
  9. In the Data Explorer view, select and preview the tables from the Salesforce Data Cloud to create and run the query to extract the required dataset.
  10. Your query will look similar to the following. Use the table name that you chose when uploading data to Salesforce Data Cloud.
    SELECT product_purchased__c, club_member__c, campaign__c, state__c, month__c,
           case_count__c, case_type_return__c, case_type_shipment_damaged__c,
           pages_visited__c, engagement_score__c, tenure__c, clicks__c, id__c
    FROM Training_Dataset_for_Sagemaker__dll

  11. Choose Create dataset.

Creating the dataset may take some time.

In the data flow view, you can now see a new node added to the visual graph.

For more information on how you can use SageMaker Data Wrangler to create Data Quality and Insights Reports, refer to Get Insights On Data and Data Quality.

SageMaker Data Wrangler offers over 300 built-in transformations. In this step, we use some of these transformations to prepare the dataset for an ML model. For detailed instructions on how to implement these transformations, refer to Transform Data.

  1. Use the Manage columns step with the Drop column transform to drop the column id__c.
  2. Use the Handle missing step with the Drop missing transform to drop rows with missing values for various features. We apply this transformation on all columns.
  3. Use a custom transform step to create categorical values for the state__c, case_count__c, and tenure__c features. Use the following code for this transformation:
    from pyspark.sql.functions import when

    # States that keep their own category; all other states are grouped as "Other"
    States_List = ['Washington', 'Massachusetts', 'California', 'Minnesota', 'Vermont', 'Colorado', 'Arizona']

    # Cast flag-like columns to strings so they are treated as categorical.
    # withColumn returns a new DataFrame, so the result must be assigned back to df.
    df = df.withColumn("club_member__c", df.club_member__c.cast('string'))
    df = df.withColumn("month__c", df.month__c.cast('string'))
    df = df.withColumn("case_type_return__c", df.case_type_return__c.cast('string'))
    df = df.withColumn("case_type_shipment_damaged__c", df.case_type_shipment_damaged__c.cast('string'))

    df = df.withColumn('state__c', when(df.state__c.isin(States_List), df.state__c).otherwise("Other"))

    # Bucket the number of cases into three categories
    df = df.withColumn('case_count__c', when(df.case_count__c == 0, "No Cases").when(df.case_count__c <= 2, "1 to 2 Cases").otherwise("Greater than 2 Cases"))

    # Bucket the tenure (in years) into five categories
    df = df.withColumn('tenure__c', when(df.tenure__c < 1, "Less than 1 Year").when(df.tenure__c == 1, "1 to 2 Years").when(df.tenure__c == 2, "2 to 3 Years").when(df.tenure__c == 3, "3 to 4 Years").otherwise("Greater than 4 Years"))

  4. Use the Process numeric step with the Scale values transform and choose Standard scaler to scale the clicks__c, engagement_score__c, and pages_visited__c features.
  5. Use the Encode categorical step with the One-hot encode transform to convert categorical variables to numeric for the case_type_return__c, case_type_shipment_damaged__c, month__c, club_member__c, and campaign__c features (all features except clicks__c, engagement_score__c, pages_visited__c, and product_purchased__c).
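
To illustrate what the last two transforms do to the data, here is a small pure-Python sketch of standard scaling and one-hot encoding. It is not SageMaker Data Wrangler's implementation: the function names are hypothetical, and the sketch uses the population standard deviation, which may differ from the setting the built-in transform uses.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Example with made-up values for a numeric and a categorical column.
scaled_clicks = standard_scale([1.0, 2.0, 3.0])
campaign_categories, campaign_rows = one_hot(["email", "social", "email"])
```

After scaling, numeric features such as clicks and pages visited are on a comparable scale, and each categorical feature is replaced by one indicator column per category.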

Model building, training, and deployment

To build, train, and deploy the model, complete the following steps:

  1. Return to the SageMaker project, open the product-recommendation.ipynb notebook, and run a processing job to preprocess the data using the SageMaker Data Wrangler configuration you created.
  2. Follow the steps in the notebook to train a model and register it to the SageMaker Model Registry.
  3. Make sure to update the model group name to match with the model group name that you used while creating the SageMaker project.

To locate the model group name, open the SageMaker project that you created earlier and navigate to the Settings tab.

Similarly, the flow file referenced in the notebook must match with the flow file name that you created earlier.

  1. For this post, we used product-recommendation as the model group name, so we update the notebook to use product-recommendation as the model group name.

After the notebook is run, the trained model is registered in the Model Registry. To learn more about the Model Registry, refer to Register and Deploy Models with Model Registry.

  1. Select the model version you created and update the status of it to Approved.

Now that you have approved the registered model, the SageMaker Salesforce project deploy step will provision and trigger AWS CodePipeline.
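
Approval can also be scripted, for example as part of a review workflow. The following sketch assumes a boto3 SageMaker client is passed in and approves the most recently created model version in a model group; the function name is illustrative.

```python
def approve_latest_model(sagemaker_client, model_group_name):
    """Set the newest model version in a model group to Approved.

    `sagemaker_client` is assumed to be boto3.client("sagemaker").
    Returns the ARN of the approved model package.
    """
    packages = sagemaker_client.list_model_packages(
        ModelPackageGroupName=model_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["ModelPackageSummaryList"]
    if not packages:
        raise ValueError(f"No model versions found in group {model_group_name}")

    model_package_arn = packages[0]["ModelPackageArn"]
    sagemaker_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
    )
    return model_package_arn
```

For this post, the model group name passed in would be product-recommendation.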

CodePipeline has steps to build and deploy a SageMaker endpoint for inference containing the SageMaker Data Wrangler preprocessing steps and the trained model. The endpoint will be exposed to Salesforce Data Cloud as an API through API Gateway. The following screenshot shows the pipeline prefixed with Sagemaker-salesforce-product-recommendation-xxxxx. We also show you the endpoints and API that get created by the SageMaker project for Salesforce.

If you would like, you can take a look at the CodePipeline deploy step, which uses AWS CloudFormation scripts to create a SageMaker endpoint and an API Gateway API with a custom JWT authorizer.

When pipeline deployment is complete, you can find the SageMaker endpoint on the SageMaker console.

You can explore the API Gateway created by the project template on the API Gateway console.

Choose the link to find the API Gateway URL.

You can find the details of the JWT authorizer by choosing Authorizers on the API Gateway console. You can also go to the AWS Lambda console to review the code of the Lambda function created by the project template.

To discover the schema to be used while invoking the API from Einstein Studio, choose Information in the navigation pane of the Model Registry. You will see an Amazon Simple Storage Service (Amazon S3) link to a metadata file. Copy and paste the link into a new browser tab URL.

Let’s look at the file without downloading it. On the file details page, choose the Object actions menu and choose Query with S3 Select.

Choose Run SQL query and take note of the API Gateway URL and schema because you will need this information when registering with Einstein Studio. If you don’t see an APIGWURL key, either the model wasn’t approved, deployment is still in progress, or deployment failed.
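
If you prefer to read the metadata file in code, the following sketch shows the idea. It assumes the file is a JSON document with the APIGWURL key at the top level; the exact layout of the file is not described in this post, so treat the structure and the function name as assumptions.

```python
import json


def read_deployment_metadata(metadata_text):
    """Parse the metadata file content and return (API Gateway URL, full metadata).

    Assumes a JSON object with a top-level "APIGWURL" key.
    """
    metadata = json.loads(metadata_text)
    url = metadata.get("APIGWURL")
    if not url:
        # Mirrors the note above: the key is missing until deployment succeeds.
        raise KeyError(
            "APIGWURL not found: the model may not be approved, "
            "or deployment is still in progress or has failed"
        )
    return url, metadata
```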

Use the Salesforce Einstein Studio API for predictions

Salesforce Einstein Studio is a new, centralized experience in Salesforce Data Cloud that data science and engineering teams can use to access their traditional models and the LLMs used in generative AI. Next, register the model in Salesforce Einstein Studio by providing the API URL and the client_id that you stored in Secrets Manager earlier. You can then use the model's inferences in Einstein Studio. For instructions, refer to Bring Your Own AI Models to Salesforce with Einstein Studio.

Clean up

To delete all the resources created by the SageMaker project, on the project page, choose the Action menu and choose Delete.

To delete the resources (API Gateway and SageMaker endpoint) created by CodePipeline, navigate to the AWS CloudFormation console and delete the stack that was created.
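
The two cleanup steps can be sketched with the AWS SDK as follows. The sketch assumes boto3 SageMaker and CloudFormation clients are passed in and that you already know the names of the stacks that CodePipeline created; the function name is illustrative.

```python
def clean_up(sagemaker_client, cloudformation_client, project_name, stack_names):
    """Delete the SageMaker project and the CloudFormation stacks it deployed.

    The clients are assumed to be boto3.client("sagemaker") and
    boto3.client("cloudformation"). Stack deletion is asynchronous, so the
    resources may still exist for a short time after this function returns.
    """
    deleted = []
    sagemaker_client.delete_project(ProjectName=project_name)
    deleted.append(("project", project_name))
    for stack_name in stack_names:
        cloudformation_client.delete_stack(StackName=stack_name)
        deleted.append(("stack", stack_name))
    return deleted
```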

Conclusion

In this post, we explained how you can build and train ML models in SageMaker Studio, using SageMaker Data Wrangler to import and prepare data hosted in Salesforce Data Cloud. The solution relies on two newly launched capabilities: the Salesforce Data Cloud JDBC connector in SageMaker Data Wrangler and the SageMaker provided project template for Salesforce Data Cloud integration. The project template enables you to deploy the model, create the endpoint, and secure an API for a registered model. You then use the API to make predictions in Salesforce Einstein Studio for your business use cases.

Although we used the example of product recommendation to showcase the steps for implementing the end-to-end integration, you can use the SageMaker project template for Salesforce to create an endpoint and API for any SageMaker traditional model or LLM that is registered in the SageMaker Model Registry. We look forward to seeing what you build in SageMaker using data from Salesforce Data Cloud and how you empower your Salesforce applications with SageMaker hosted ML models!

This post is a continuation of the series regarding Salesforce Data Cloud and SageMaker integration. For a high-level overview and to learn more about the business impact you can make with this integration approach, refer to Part 1.


About the authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City. Follow him on https://www.linkedin.com/in/darylmartis.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Dharmendra Kumar Rai (DK Rai) is a Sr. Data Architect, Data Lake & AI/ML, serving strategic customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML and analytics space. DK has many years of experience in building data-intensive solutions across a range of industry verticals, including high-tech, FinTech, insurance, and consumer-facing applications.

Marc Karp is an ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Bring your own AI using Amazon SageMaker with Salesforce Data Cloud

This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI.

We’re excited to announce Amazon SageMaker and Salesforce Data Cloud integration. With this capability, businesses can access their Salesforce data securely with a zero-copy approach using SageMaker and use SageMaker tools to build, train, and deploy AI models. The inference endpoints are connected with Data Cloud to drive predictions in real time. As a result, businesses can accelerate time to market while maintaining data integrity and security, and reduce the operational burden of moving data from one location to another.

Introducing Einstein Studio on Data Cloud

Data Cloud is a data platform that provides businesses with real-time updates of their customer data from any touch point. With Einstein Studio, a gateway to AI tools on the data platform, admins and data scientists can effortlessly create models with a few clicks or using code. Einstein Studio’s bring your own model (BYOM) experience provides the capability to connect custom or generative AI models from external platforms such as SageMaker to Data Cloud. Custom models can be trained using data from Salesforce Data Cloud accessed through the Amazon SageMaker Data Wrangler connector. Businesses can act on their predictions by seamlessly integrating custom models into Salesforce workflows, leading to improved efficiency, decision-making, and personalized experiences.

Benefits of the SageMaker and Data Cloud Einstein Studio integration

Here’s how using SageMaker with Einstein Studio in Salesforce Data Cloud can help businesses:

  • It provides the ability to connect custom and generative AI models to Einstein Studio for various use cases, such as lead conversion, case classification, and sentiment analysis.
  • It eliminates tedious, costly, and error-prone ETL (extract, transform, and load) jobs. The zero-copy approach to data reduces the overhead to manage data copies, reduces storage costs, and improves efficiencies.
  • It provides access to highly curated, harmonized, and real-time data across Customer 360. This leads to expert models that deliver more intelligent predictions and business insights.
  • It simplifies the consumption of results from business processes and drives value without latency. For example, you can use automated workflows that can adapt in an instant based on new data.
  • It facilitates the operationalization of SageMaker models and inferences in Salesforce.

The following is an example of how to operationalize a SageMaker model using Salesforce Flow.

SageMaker integration

SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.

To streamline the SageMaker and Salesforce Data Cloud integration, we are introducing two new capabilities in SageMaker:

  • The SageMaker Data Wrangler Salesforce Data Cloud connector – With the newly launched SageMaker Data Wrangler Salesforce Data Cloud connector, admins can preconfigure connections to Salesforce to enable data analysts and data scientists to quickly access Salesforce data in real time and create features for ML. This will enable users to access Salesforce Data Cloud securely using OAuth. You can interactively visualize, analyze, and transform data using the power of Spark without writing any code using the low-code visual data preparation features of SageMaker Data Wrangler. You can also scale to process large datasets with SageMaker Processing jobs, train ML models automatically using Amazon SageMaker Autopilot, and integrate with a SageMaker inference pipeline to deploy the same data flow to production with the inference endpoint to process data in real time or in batch for inference.

  • The SageMaker Projects template for Salesforce – We launched a SageMaker Projects template for Salesforce that you can use to deploy endpoints for traditional and large language models (LLMs) and expose SageMaker endpoints as an API automatically. SageMaker Projects provides a straightforward way to set up and standardize the development environment for data scientists and ML engineers to build and deploy ML models on SageMaker.

Partner Quote

“The partnership between Salesforce and AWS SageMaker will empower customers to leverage the power of AI (both generative and non-generative models) across their Salesforce data sources, workflows and applications to deliver personalized experiences and power new content generation, summarization, and question-answer type experiences. By combining the best of both worlds, we are creating a new paradigm for data-driven innovation and customer success underpinned by AI.”

-Kaushal Kurapati, Salesforce Senior Vice President of Product, AI and Search

Solution overview

The BYOM integration solution provides customers with a native Salesforce Data Cloud connector in SageMaker Data Wrangler. The SageMaker Data Wrangler connector allows you to securely access Salesforce Data Cloud objects. Once users are authenticated, they can perform data exploration, preparation, and feature engineering tasks needed for model development and inference through the SageMaker Data Wrangler interactive visual interface. Data scientists can work within Amazon SageMaker Studio notebooks to develop custom models, which can be traditional models or LLMs, and make them available for deployment by registering the model in the SageMaker Model Registry. When a model is approved for production in the registry, SageMaker Projects will automate the deployment of an invocation API that can be configured as a target in Salesforce Einstein Studio and integrated with Salesforce Customer 360 applications. The following diagram illustrates this architecture.

Conclusion

In this post, we shared the SageMaker and Salesforce Einstein Studio BYOM integration, where you can use data in Salesforce Data Cloud to build and train traditional models and LLMs in SageMaker. You can use SageMaker Data Wrangler to prepare data from Salesforce Data Cloud with zero copy. We also provided an automated solution to deploy the SageMaker endpoints as an API using a SageMaker Projects template for Salesforce.

AWS and Salesforce are excited to partner together to deliver this experience to our joint customers to help them drive business processes using the power of ML and artificial intelligence.

To learn more about the Salesforce BYOM integration, refer to Bring your own AI models with Einstein Studio. For a detailed implementation using a product recommendations example use case, refer to Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce Apps with AI/ML.


About the Authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Maninder (Mani) Kaur is the AI/ML Specialist lead for Strategic ISVs at AWS. With her customer-first approach, Mani helps strategic customers shape their AI/ML strategy, fuel innovation, and accelerate their AI/ML journey. Mani is a firm believer of ethical and responsible AI, and strives to ensure that her customers’ AI solutions align with these principles.


NVIDIA CEO Jensen Huang Returns to SIGGRAPH


One pandemic and one generative AI revolution later, NVIDIA founder and CEO Jensen Huang returns to the SIGGRAPH stage next week to deliver a live keynote at the world’s largest professional graphics conference.

The address, slated for Tuesday, Aug. 8, at 8 a.m. PT in Los Angeles, will feature an exclusive look at some of NVIDIA’s newest breakthroughs, including award-winning research, OpenUSD developments and the latest AI-powered solutions for content creation.

NVIDIA founder and CEO Jensen Huang.

Huang’s address comes after NVIDIA joined forces last week with Pixar, Adobe, Apple and Autodesk to found the Alliance for OpenUSD, a major leap toward unlocking the next era of interoperability in 3D graphics, design and simulation.

The group will standardize and extend OpenUSD, the open-source Universal Scene Description framework that’s the foundation of interoperable 3D applications and projects ranging from visual effects to industrial digital twins.

Huang will also offer a perspective on what’s been a raucous year for AI, with wildly popular new generative AI applications — including ChatGPT and Midjourney — providing a taste of what’s to come as developers worldwide get to work.

Throughout the conference, NVIDIA will participate in sessions on immersive visualization, 3D interoperability and AI-mediated video conferencing, and will present 20 research papers. Attendees will also get the opportunity to join hands-on labs.

Join SIGGRAPH to witness the evolution of AI and visual computing. Watch the keynote on this page.

 

Image source: Ron Diering, via Flickr, some rights reserved.


Enhancing AWS intelligent document processing with generative AI


Data classification, extraction, and analysis can be challenging for organizations that deal with volumes of documents. Traditional document processing solutions are manual, expensive, error prone, and difficult to scale. AWS intelligent document processing (IDP), with AI services such as Amazon Textract, allows you to take advantage of industry-leading machine learning (ML) technology to quickly and accurately process data from any scanned document or image. Generative artificial intelligence (generative AI) complements Amazon Textract to further automate document processing workflows. Features such as normalizing key fields and summarizing input data support faster cycles for managing document process workflows, while reducing the potential for errors.

Generative AI is driven by large ML models called foundation models (FMs). FMs are transforming the way you can solve traditionally complex document processing workloads. In addition to existing capabilities, businesses need to summarize specific categories of information, including debit and credit data from documents such as financial reports and bank statements. FMs make it easier to generate such insights from the extracted data. To optimize time spent in human review and to improve employee productivity, mistakes such as missing digits in phone numbers, missing documents, or addresses without street numbers can be flagged in an automated way. In the current scenario, you need to dedicate resources to accomplish such tasks using human review and complex scripts. This approach is tedious and expensive. FMs can help complete these tasks faster, with fewer resources, and transform varying input formats into a standard template that can be processed further. At AWS, we offer services such as Amazon Bedrock, the easiest way to build and scale generative AI applications with FMs. Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can find the model that best suits your requirements. We also offer Amazon SageMaker JumpStart, which allows ML practitioners to choose from a broad selection of open-source FMs. ML practitioners can deploy FMs to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

Ricoh offers workplace solutions and digital transformation services designed to help customers manage and optimize information flow across their businesses. Ashok Shenoy, VP of Portfolio Solution Development, says, “We are adding generative AI to our IDP solutions to help our customers get their work done faster and more accurately by utilizing new capabilities such as Q&A, summarization, and standardized outputs. AWS allows us to take advantage of generative AI while keeping each of our customers’ data separate and secure.”

In this post, we share how to enhance your IDP solution on AWS with generative AI.

Improving the IDP pipeline

In this section, we review how the traditional IDP pipeline can be augmented by FMs and walk through an example use case using Amazon Textract with FMs.

AWS IDP consists of three stages: classification, extraction, and enrichment. For more details about each stage, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. In the classification stage, FMs can now classify documents without any additional training. This means that documents can be categorized even if the model hasn’t seen similar examples before. FMs in the extraction stage normalize date fields and verify addresses and phone numbers, while ensuring consistent formatting. FMs in the enrichment stage allow inference, logical reasoning, and summarization. When you use FMs in each IDP stage, your workflow will be more streamlined and performance will improve. The following diagram illustrates the IDP pipeline with generative AI.

Intelligent Document Processing Pipeline with Generative AI

Extraction stage of the IDP pipeline

When FMs can’t directly process documents in their native formats (such as PDF, JPEG, TIFF, and other image files) as input, a mechanism to convert documents to text is needed. To extract the text from the document before sending it to the FMs, you can use Amazon Textract. With Amazon Textract, you can extract lines and words and pass them to downstream FMs. The following architecture uses Amazon Textract for accurate text extraction from any type of document before sending it to FMs for further processing.

Textract Ingests document data to the Foundation Models
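
As a minimal sketch of this hand-off, the following Python snippet collects the LINE blocks from a Textract-style response and joins them into a single prompt for a downstream FM. The helper names (lines_from_textract, build_prompt), the instruction text, and the hand-written sample response are assumptions made for illustration; they are not taken from the sample application.

```python
def lines_from_textract(response):
    """Return the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def build_prompt(lines, instruction="Summarize the following document:"):
    """Join the extracted lines into one prompt string for a downstream FM."""
    return instruction + "\n\n" + "\n".join(lines)


# Hand-written stand-in for a Textract response (not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Patient Name: John Doe"},
        {"BlockType": "WORD", "Text": "Patient"},
        {"BlockType": "LINE", "Text": "Admit Date: 07-Sep-2020"},
    ]
}

extracted_lines = lines_from_textract(sample_response)
prompt = build_prompt(extracted_lines)
print(prompt)
```

WORD and PAGE blocks are skipped here because the LINE blocks already carry the same text at a granularity that suits a summarization prompt.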

Typically, documents consist of structured and semi-structured information. Amazon Textract can be used to extract raw text and data from tables and forms. The relationship between the data in tables and forms plays a vital role in automating business processes. Certain types of information may not be processed by FMs. As a result, we can choose to either store this information in a downstream store or send it to FMs. The following figure is an example of how Amazon Textract can extract structured and semi-structured information from a document, in addition to lines of text that need to be processed by FMs.

Using AWS serverless services to summarize with FMs

The IDP pipeline we illustrated earlier can be seamlessly automated using AWS serverless services. Highly unstructured documents are common in big enterprises. These documents can span from Securities and Exchange Commission (SEC) documents in the banking industry to coverage documents in the health insurance industry. With the evolution of generative AI at AWS, people in these industries are looking for ways to get a summary from those documents in an automated and cost-effective manner. Serverless services help provide the mechanism to build a solution for IDP quickly. Services such as AWS Lambda, AWS Step Functions, and Amazon EventBridge can help build the document processing pipeline with integration of FMs, as shown in the following diagram.

End-to-end document processing with Amazon Textract and Generative AI

The sample application used in the preceding architecture is driven by events. An event is defined as a change in state that has recently occurred. For example, when an object gets uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, Amazon S3 emits an Object Created event. This event notification from Amazon S3 can trigger a Lambda function or a Step Functions workflow. This type of architecture is called an event-driven architecture. In this post, our sample application uses an event-driven architecture to process a sample medical discharge document and summarize the details of the document. The flow works as follows:

  1. When a document is uploaded to an S3 bucket, Amazon S3 emits an Object Created event.
  2. The EventBridge default event bus propagates the event to Step Functions based on an EventBridge rule.
  3. The state machine workflow processes the document, beginning with Amazon Textract.
  4. A Lambda function transforms the analyzed data for the next step.
  5. The state machine uses direct AWS SDK integration to invoke a SageMaker endpoint, which hosts the FM.
  6. A summary S3 destination bucket receives the summary response gathered from the FM.
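
Steps 1 and 2 depend on an EventBridge rule that selects the Object Created events for the input bucket. The following Python sketch builds such an event pattern and checks it against a hand-written event. The bucket name and object key are hypothetical, and the matches function is only a local stand-in that covers the small subset of pattern syntax used here; it is not how EventBridge itself evaluates rules. The sketch also assumes that EventBridge notifications are turned on for the bucket.

```python
import json


def object_created_pattern(bucket_name):
    """Event pattern selecting S3 Object Created events for one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


def matches(pattern, event):
    """Local matcher for exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical bucket name and a trimmed, hand-written event.
pattern = object_created_pattern("my-idp-input-bucket")
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "my-idp-input-bucket"},
        "object": {"key": "discharge-summary.pdf"},
    },
}

print(json.dumps(pattern, indent=2))
print(matches(pattern, sample_event))
```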

We used the sample application with a Flan-T5 Hugging Face model to summarize the following sample patient discharge summary using the Step Functions workflow.

patient discharge summary

The Step Functions workflow uses AWS SDK integration to call the Amazon Textract AnalyzeDocument and SageMaker runtime InvokeEndpoint APIs, as shown in the following figure.

workflow
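
To make the SDK integration concrete, here is a rough sketch of what a state machine definition with these two calls could look like, written as a Python dictionary in Amazon States Language form. It is not the definition used by the sample application: the state names, the Lambda function name, the input and result paths, and the choice of FORMS as the only feature type are all assumptions, and the final step that writes the summary to the destination bucket is omitted.

```python
import json


def build_definition(endpoint_name, transform_function_name):
    """Sketch of a Textract -> Lambda -> SageMaker endpoint workflow."""
    return {
        "StartAt": "AnalyzeDocument",
        "States": {
            "AnalyzeDocument": {
                "Type": "Task",
                # AWS SDK integration: Amazon Textract AnalyzeDocument
                "Resource": "arn:aws:states:::aws-sdk:textract:analyzeDocument",
                "Parameters": {
                    "Document": {
                        "S3Object": {
                            # Paths assume the EventBridge S3 event is the input.
                            "Bucket.$": "$.detail.bucket.name",
                            "Name.$": "$.detail.object.key",
                        }
                    },
                    "FeatureTypes": ["FORMS"],
                },
                "ResultPath": "$.textract",
                "Next": "TransformForModel",
            },
            "TransformForModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": transform_function_name,  # hypothetical
                    "Payload.$": "$",
                },
                # Assumes the function returns {"prompt": "<string>"}.
                "ResultSelector": {"prompt.$": "$.Payload.prompt"},
                "Next": "InvokeModel",
            },
            "InvokeModel": {
                "Type": "Task",
                # AWS SDK integration: SageMaker runtime InvokeEndpoint
                "Resource": "arn:aws:states:::aws-sdk:sagemakerruntime:invokeEndpoint",
                "Parameters": {
                    "EndpointName": endpoint_name,
                    "ContentType": "application/json",
                    "Body.$": "$.prompt",
                },
                "End": True,
            },
        },
    }


def dangling_transitions(definition):
    """Return Next targets that do not name a state in the definition."""
    states = definition["States"]
    return [s["Next"] for s in states.values() if "Next" in s and s["Next"] not in states]


definition = build_definition("my-fm-endpoint", "my-transform-function")
print(json.dumps(definition, indent=2))
```

The small dangling_transitions helper is only a local sanity check on the dictionary; deploying and validating a real state machine is a separate step.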

This workflow results in a summary JSON object that is stored in a destination bucket. The JSON object looks as follows:

{
  "summary": [
    "John Doe is a 35-year old male who has been experiencing stomach problems for two months. He has been taking antibiotics for the last two weeks, but has not been able to eat much. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has also noticed a change in his stool color, which is now darker. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of fatigue, and has been unable to work for the last two weeks. He has also been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help."
  ],
  "forms": [
    {
      "key": "Ph: ",
      "value": "(888)-(999)-(0000) "
    },
    {
      "key": "Fax: ",
      "value": "(888)-(999)-(1111) "
    },
    {
      "key": "Patient Name: ",
      "value": "John Doe "
    },
    {
      "key": "Patient ID: ",
      "value": "NARH-36640 "
    },
    {
      "key": "Gender: ",
      "value": "Male "
    },
    {
      "key": "Attending Physician: ",
      "value": "Mateo Jackson, PhD "
    },
    {
      "key": "Admit Date: ",
      "value": "07-Sep-2020 "
    },
    {
      "key": "Discharge Date: ",
      "value": "08-Sep-2020 "
    },
    {
      "key": "Discharge Disposition: ",
      "value": "Home with Support Services "
    },
    {
      "key": "Pre-existing / Developed Conditions Impacting Hospital Stay: ",
      "value": "35 yo M c/o stomach problems since 2 months. Patient reports epigastric abdominal pain non- radiating. Pain is described as gnawing and burning, intermittent lasting 1-2 hours, and gotten progressively worse. Antacids used to alleviate pain but not anymore; nothing exacerbates pain. Pain unrelated to daytime or to meals. Patient denies constipation or diarrhea. Patient denies blood in stool but have noticed them darker. Patient also reports nausea. Denies recent illness or fever. He also reports fatigue for 2 weeks and bloating after eating. ROS: Negative except for above findings Meds: Motrin once/week. Tums previously. PMHx: Back pain and muscle spasms. No Hx of surgery. NKDA. FHx: Uncle has a bleeding ulcer. Social Hx: Smokes since 15 yo, 1/2-1 PPD. No recent EtOH use. Denies illicit drug use. Works on high elevation construction. Fast food diet. Exercises 3-4 times/week but stopped 2 weeks ago. "
    },
    {
      "key": "Summary: ",
      "value": "some activity restrictions suggested, full course of antibiotics, check back with physican in case of relapse, strict diet "
    }
  ]
 }
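
Because the keys and values in the forms array keep the trailing colons and spaces from the document, a consumer will usually want to tidy them before use. The following Python sketch shows one way to do that, using an abbreviated copy of the object above. The helper name forms_to_dict and the date format string are assumptions for illustration.

```python
import json
from datetime import datetime


def forms_to_dict(summary_doc):
    """Map each form key (without its trailing ': ') to its stripped value."""
    fields = {}
    for item in summary_doc.get("forms", []):
        key = item["key"].strip().rstrip(":").strip()
        fields[key] = item["value"].strip()
    return fields


# Abbreviated copy of the summary object shown above.
raw = """
{
  "summary": ["John Doe is a 35-year old male who has been experiencing stomach problems for two months."],
  "forms": [
    {"key": "Patient Name: ", "value": "John Doe "},
    {"key": "Patient ID: ", "value": "NARH-36640 "},
    {"key": "Admit Date: ", "value": "07-Sep-2020 "}
  ]
}
"""

fields = forms_to_dict(json.loads(raw))
# Assumes dates follow the DD-Mon-YYYY layout seen in the sample.
admit_date = datetime.strptime(fields["Admit Date"], "%d-%b-%Y").date()

print(fields["Patient Name"])
print(admit_date.isoformat())
```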

Generating these summaries using IDP with serverless implementation at scale helps organizations get meaningful, concise, and presentable data in a cost-effective way. Step Functions doesn’t limit the method of processing documents to one document at a time. Its distributed map feature can summarize large numbers of documents on a schedule.

The sample application uses a Flan-T5 Hugging Face model; however, you can use an FM endpoint of your choice. Training and running the model is out of scope for the sample application. Follow the instructions in the GitHub repository to deploy the sample application. The preceding architecture provides guidance on how you can orchestrate an IDP workflow using Step Functions. Refer to the IDP Generative AI workshop for detailed instructions on how to build an application with AWS AI services and FMs.

Set up the solution

Follow the steps in the README file to set up the solution architecture (except for the SageMaker endpoints). After you have your own SageMaker endpoint available, you can pass the endpoint name as a parameter to the template.

Clean up

To save costs, delete the resources you deployed as part of the tutorial:

  1. Follow the steps in the cleanup section of the README file.
  2. Delete any content from your S3 bucket and then delete the bucket through the Amazon S3 console.
  3. Delete any SageMaker endpoints you may have created through the SageMaker console.

Conclusion

Generative AI is changing how you can process documents with IDP to derive insights. AWS AI services such as Amazon Textract along with AWS FMs can help accurately process any type of document. For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.


About the Authors

Sonali Sahu is leading intelligent document processing with the AI/ML services team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Ashish Lal is a Senior Product Marketing Manager who leads product marketing for AI services at AWS. He has 9 years of marketing experience and has led the product marketing effort for intelligent document processing. He got his Master’s in Business Administration at the University of Washington.

Mrunal Daftari is an Enterprise Senior Solutions Architect at Amazon Web Services. He is based in Boston, MA. He is a cloud enthusiast and very passionate about finding solutions for customers that are simple and address their business outcomes. He loves working with cloud technologies, providing simple, scalable solutions that drive positive business outcomes, shaping cloud adoption strategy, designing innovative solutions, and driving operational excellence.

Dhiraj Mahapatro is a Principal Serverless Specialist Solutions Architect at AWS. He specializes in helping enterprise financial services adopt serverless and event-driven architectures to modernize their applications and accelerate their pace of innovation. Recently, he has been working on bringing container workloads and practical usage of generative AI closer to serverless and EDA for financial services industry customers.

Jacob Hauskens is a Principal AI Specialist with over 15 years of strategic business development and partnerships experience. For the past 7 years, he has led the creation and implementation of go-to-market strategies for new AI-powered B2B services. Recently, he has been helping ISVs grow their revenue by adding generative AI to intelligent document processing workflows.


Multimodal medical AI


Medicine is an inherently multimodal discipline. When providing care, clinicians routinely interpret data from a wide range of modalities including medical images, clinical notes, lab tests, electronic health records, genomics, and more. Over the last decade or so, AI systems have achieved expert-level performance on specific tasks within specific modalities — some AI systems processing CT scans, while others analyzing high magnification pathology slides, and still others hunting for rare genetic variations. The inputs to these systems tend to be complex data such as images, and they typically provide structured outputs, whether in the form of discrete grades or dense image segmentation masks. In parallel, the capacities and capabilities of large language models (LLMs) have become so advanced that they have demonstrated comprehension and expertise in medical knowledge by both interpreting and responding in plain language. But how do we bring these capabilities together to build medical AI systems that can leverage information from all these sources?

In today’s blog post, we outline a spectrum of approaches to bringing multimodal capabilities to LLMs and share some exciting results on the tractability of building multimodal medical LLMs, as described in three recent research papers. The papers, in turn, outline how to introduce de novo modalities to an LLM, how to graft a state-of-the-art medical imaging foundation model onto a conversational LLM, and first steps towards building a truly generalist multimodal medical AI system. If successfully matured, multimodal medical LLMs might serve as the basis of new assistive technologies spanning professional medicine, medical research, and consumer applications. As with our prior work, we emphasize the need for careful evaluation of these technologies in collaboration with the medical community and healthcare ecosystem.

A spectrum of approaches

Several methods for building multimodal LLMs have been proposed in recent months [1, 2, 3], and no doubt new methods will continue to emerge for some time. For the purpose of understanding the opportunities to bring new modalities to medical AI systems, we’ll consider three broadly defined approaches: tool use, model grafting, and generalist systems.

The spectrum of approaches to building multimodal LLMs ranges from having the LLM use existing tools or models, to leveraging domain-specific components with an adapter, to joint modeling of a multimodal model.

Tool use

In the tool use approach, one central medical LLM outsources analysis of data in various modalities to a set of software subsystems independently optimized for those tasks: the tools. The common mnemonic example of tool use is teaching an LLM to use a calculator rather than do arithmetic on its own. In the medical space, a medical LLM faced with a chest X-ray could forward that image to a radiology AI system and integrate that response. This could be accomplished via application programming interfaces (APIs) offered by subsystems, or more fancifully, two medical AI systems with different specializations engaging in a conversation.

This approach has some important benefits. It allows maximum flexibility and independence between subsystems, enabling health systems to mix and match products between tech providers based on validated performance characteristics of subsystems. Moreover, human-readable communication channels between subsystems maximize auditability and debuggability. That said, getting the communication right between independent subsystems can be tricky, narrowing the information transfer, or exposing a risk of miscommunication and information loss.

Model grafting

A more integrated approach would be to take a neural network specialized for each relevant domain, and adapt it to plug directly into the LLM — grafting the visual model onto the core reasoning agent. In contrast to tool use where the specific tool(s) used are determined by the LLM, in model grafting the researchers may choose to use, refine, or develop specific models during development. In two recent papers from Google Research, we show that this is in fact feasible. Neural LLMs typically process text by first mapping words into a vector embedding space. Both papers build on the idea of mapping data from a new modality into the input word embedding space already familiar to the LLM. The first paper, “Multimodal LLMs for health grounded in individual-specific data”, shows that asthma risk prediction in the UK Biobank can be improved if we first train a neural network classifier to interpret spirograms (a modality used to assess breathing ability) and then adapt the output of that network to serve as input into the LLM.

The second paper, “ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders”, takes this same tack, but applies it to full-scale image encoder models in radiology. Starting with a foundation model for understanding chest X-rays, already shown to be a good basis for building a variety of classifiers in this modality, this paper describes training a lightweight medical information adapter that re-expresses the top layer output of the foundation model as a series of tokens in the LLM’s input embedding space. Despite fine-tuning neither the visual encoder nor the language model, the resulting system displays capabilities it wasn’t trained for, including semantic search and visual question answering.

Our approach to grafting a model works by training a medical information adapter that maps the output of an existing or refined image encoder into an LLM-understandable form.
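
To give a feel for the shape of the idea, the following pure-Python sketch maps a frozen encoder's feature vector to a short sequence of vectors with the same width as the LLM's word embeddings, then places them in front of the text embeddings. This is only an illustration under simplifying assumptions: the adapter is a single untrained linear projection with random weights, the dimensions are tiny, and the function names are hypothetical. The adapters described in the papers are trained and may be structured quite differently.

```python
import random


def make_adapter(encoder_dim, num_tokens, llm_dim, seed=0):
    """Create random (untrained) weights for a linear adapter."""
    rng = random.Random(seed)
    out_dim = num_tokens * llm_dim
    weights = [[rng.gauss(0.0, 0.02) for _ in range(encoder_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def adapt(features, adapter, num_tokens, llm_dim):
    """Project encoder features, then split the result into soft tokens."""
    weights, bias = adapter
    flat = [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]
    return [flat[i * llm_dim:(i + 1) * llm_dim] for i in range(num_tokens)]


def build_llm_input(image_tokens, text_embeddings):
    """Prepend the adapted image tokens to the embedded text prompt."""
    return image_tokens + text_embeddings


# Toy sizes chosen for readability, not taken from any paper.
ENCODER_DIM, NUM_TOKENS, LLM_DIM = 8, 3, 4

adapter = make_adapter(ENCODER_DIM, NUM_TOKENS, LLM_DIM)
encoder_output = [0.5] * ENCODER_DIM          # stand-in for frozen encoder features
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]  # stand-in for 5 embedded words

image_tokens = adapt(encoder_output, adapter, NUM_TOKENS, LLM_DIM)
llm_input = build_llm_input(image_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))
```

In this arrangement only the adapter weights would be updated during training, which is why the approach needs comparatively modest compute.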

Model grafting has a number of advantages. It uses relatively modest computational resources to train the adapter layers but allows the LLM to build on existing highly-optimized and validated models in each data domain. The modularization of the problem into encoder, adapter, and LLM components can also facilitate testing and debugging of individual software components when developing and deploying such a system. The corresponding disadvantages are that the communication between the specialist encoder and the LLM is no longer human readable (being a series of high dimensional vectors), and the grafting procedure requires building a new adapter for not just every domain-specific encoder, but also every revision of each of those encoders.

Generalist systems

The most radical approach to multimodal medical AI is to build one integrated, fully generalist system natively capable of absorbing information from all sources. In our third paper in this area, “Towards Generalist Biomedical AI”, rather than having separate encoders and adapters for each data modality, we build on PaLM-E, a recently published multimodal model that is itself a combination of a single LLM (PaLM) and a single vision encoder (ViT). In this setup, text and tabular data modalities are covered by the LLM text encoder, but now all other data are treated as an image and fed to the vision encoder.

Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same model weights.

We specialize PaLM-E to the medical domain by fine-tuning the complete set of model parameters on medical datasets described in the paper. The resulting generalist medical AI system is a multimodal version of Med-PaLM that we call Med-PaLM M. The flexible multimodal sequence-to-sequence architecture allows us to interleave various types of multimodal biomedical information in a single interaction. To the best of our knowledge, it is the first demonstration of a single unified model that can interpret multimodal biomedical data and handle a diverse range of tasks using the same set of model weights across all tasks (detailed evaluations in the paper).

This generalist-system approach to multimodality is both the most ambitious and simultaneously most elegant of the approaches we describe. In principle, this direct approach maximizes flexibility and information transfer between modalities. With no APIs to maintain compatibility across and no proliferation of adapter layers, the generalist approach has arguably the simplest design. But that same elegance is also the source of some of its disadvantages. Computational costs are often higher, and with a unitary vision encoder serving a wide range of modalities, domain specialization or system debuggability could suffer.

The reality of multimodal medical AI

To make the most of AI in medicine, we’ll need to combine the strength of expert systems trained with predictive AI with the flexibility made possible through generative AI. Which approach (or combination of approaches) will be most useful in the field depends on a multitude of as-yet unassessed factors. Is the flexibility and simplicity of a generalist model more valuable than the modularity of model grafting or tool use? Which approach gives the highest quality results for a specific real-world use case? Is the preferred approach different for supporting medical research or medical education vs. augmenting medical practice? Answering these questions will require ongoing rigorous empirical research and continued direct collaboration with healthcare providers, medical institutions, government entities, and healthcare industry partners broadly. We look forward to finding the answers together.


Meet the Maker: Developer Taps NVIDIA Jetson as Force Behind AI-Powered Pit Droid


Goran Vuksic is the brain behind a project to build a real-world pit droid, a type of Star Wars bot that repairs and maintains podracers which zoom across the much-loved film series.

The edge AI Jedi used an NVIDIA Jetson Orin Nano Developer Kit as the brain of the droid itself. The devkit enables the bot, which is a little less than four feet tall and has a simple webcam for eyes, to identify and move its head toward objects.

Vuksic — originally from Croatia and now based in Malmö, Sweden — recently traveled with the pit droid across Belgium and the Netherlands to several tech conferences. He presented to hundreds of people on computer vision and AI, using the droid as an engaging real-world demo.

The pit droid’s first look at the world.

A self-described Star Wars fanatic, he’s upgrading the droid’s capabilities in his free time, when not engrossed in his work as an engineering manager at a Copenhagen-based company. He’s also co-founder and chief technology officer of syntheticAIdata, a member of the NVIDIA Inception program for cutting-edge startups.

The company, which creates vision AI models with cost-effective synthetic data, uses a connector to the NVIDIA Omniverse platform for building and operating 3D tools and applications.

About the Maker

Named a Jetson AI Specialist by NVIDIA and an AI “Most Valuable Professional” by Microsoft, Vuksic got started with artificial intelligence and IT about a decade ago when working for a startup that classified tattoos with vision AI.

Since then, he’s worked as an engineering and technical manager, among other roles, developing IT strategies and solutions for various companies.

Robotics has always interested him, as he was a huge sci-fi fan growing up.

“Watching Star Wars and other films, I imagined how robots might be able to see and do stuff in the real world,” said Vuksic, also a member of the NVIDIA Developer Program.

Now, he’s enabling just that with the pit droid project powered by the NVIDIA Jetson platform, which the developer has used since the launch of its first product nearly a decade ago.

Vuksic reads to the pit droid.

Apart from tinkering with computers and bots, Vuksic enjoys playing the bass guitar in a band with his friends.

His Inspiration

Vuksic built the pit droid for both fun and educational purposes.

As a frequent speaker at tech conferences, he takes the pit droid on stage to engage with his audience, demonstrate how it works and inspire others to build something similar, he said.

Vuksic, his startup co-founder Sherry List and the pit droid present at the Techorama conference in Antwerp, Belgium.

“We live in a connected world — all the things around us are exchanging data and becoming more and more automated,” he added. “I think this is super exciting, and we’ll likely have even more robots to help humans with tasks.”

Using the NVIDIA Jetson platform, Vuksic is at the forefront of robotics innovation, along with an ecosystem of developers using edge AI.

His Jetson Project

Vuksic’s pit droid project, which took him four months, began with 3D printing its body parts and putting them all together.

He then equipped the bot with the Jetson Orin Nano Developer Kit as the brain in its head, which can move in all directions thanks to two motors.

Vuksic places an NVIDIA Jetson Orin Nano Developer Kit in the pit droid’s head.

The Jetson Orin Nano enables real-time processing of the camera feed. “It’s truly, truly amazing to have this processing power in such a small box that fits in the droid’s head,” said Vuksic.

He also uses Microsoft Azure to process the data in the cloud for object-detection training.

“My favorite part of the project was definitely connecting it to the Jetson Orin Nano, which made it easy to run the AI and make the droid move according to what it sees,” said Vuksic, who wrote a step-by-step technical guide to building the bot, so others can try it themselves.
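
The idea of moving the head "according to what it sees" can be sketched in a few lines. The following is only an illustration, not Vuksic's code: the `Box` type, the `head_step` function, the field-of-view values, the gain and the deadband are all hypothetical, and the sketch assumes an object detector has already produced a bounding box for the current camera frame.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical detection box in pixel coordinates (not from the original project)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def head_step(box, frame_w, frame_h,
              fov_h_deg=60.0, fov_v_deg=40.0,
              gain=0.5, deadband_deg=2.0):
    """Return (pan, tilt) adjustments in degrees that nudge the head toward the box.

    The field of view, gain and deadband are assumed values for illustration.
    """
    # Centre of the detected object in pixels.
    cx = (box.x_min + box.x_max) / 2.0
    cy = (box.y_min + box.y_max) / 2.0

    # Offset from the image centre as a fraction of the frame, in [-0.5, 0.5].
    off_x = cx / frame_w - 0.5
    off_y = cy / frame_h - 0.5

    # Linear approximation of the angular error for each of the two motors.
    err_pan = off_x * fov_h_deg
    err_tilt = -off_y * fov_v_deg  # image y grows downward; positive tilt looks up

    # Ignore small errors so the head does not jitter around the target.
    pan = gain * err_pan if abs(err_pan) > deadband_deg else 0.0
    tilt = gain * err_tilt if abs(err_tilt) > deadband_deg else 0.0
    return pan, tilt


if __name__ == "__main__":
    # An object on the right edge of a 640x480 frame, vertically centred.
    print(head_step(Box(480, 200, 640, 280), 640, 480))
```

Calling `head_step` once per frame and sending the two values to the pan and tilt motors gives a simple proportional controller. A real build would also clamp the angles to the motors' mechanical limits.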

“The most challenging part was traveling with the droid — there was a bit of explanation necessary when I was passing security and opened my bag which contained the robot in parts,” the developer mused. “I said, ‘This is just my big toy!’”

Learn more about the NVIDIA Jetson platform.

Read More

How to Build Generative AI Applications and 3D Virtual Worlds

To grow and succeed, organizations must continuously focus on technical skills development, especially in rapidly advancing areas of technology, such as generative AI and the creation of 3D virtual worlds.  

NVIDIA Training, which equips teams with skills for the age of AI, high performance computing and industrial digitalization, has released new courses that cover these technologies. The program has already equipped hundreds of thousands of students, developers, researchers and data scientists with critical technical skills.  

With its latest courses, NVIDIA Training is enabling organizations to fully harness the power of generative AI and virtual worlds, which are transforming the business landscape. 

Get Started Building Generative AI Applications     

Generative AI is revolutionizing the ways organizations work. It enables users to quickly generate new content based on a variety of inputs, including text, images, sounds, animation, 3D models and other data types.  

New NVIDIA Training courses on gen AI include:         

  • Generative AI Explained: Generative models are accelerating application development for many use cases, including question-answering, summarization and textual entailment, as well as 2D and 3D image and audio creation. In this two-hour course, Bryan Catanzaro, vice president of applied deep learning research at NVIDIA, provides an overview of gen AI’s major developments, where it stands now and what it could be capable of in the future. He’ll discuss technical details and popular generative AI applications, as well as how businesses can responsibly use the technology.
  • Generative AI With Diffusion Models: Thanks to improvements in computing power and scientific theory, generative AI is more accessible than ever. Get started with gen AI application development in this hands-on course, where students will learn how to build a text-to-image generative AI application using the latest techniques. Generate images with diffusion models and refine the output with various optimizations. Build a denoising diffusion model from the U-Net architecture to add context embeddings for greater user control.
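
For readers new to diffusion models, the sketch below shows the forward (noising) step that a denoising model is trained to undo. It is not taken from the course material: it assumes the widely used DDPM-style formulation with a linear noise schedule, and the function names and schedule values are illustrative choices.

```python
import math
import random


def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the start and end values are assumed defaults."""
    if steps == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): how much signal survives up to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(x0, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise.

    Returns the noised sample and the noise that was added, which is
    the target a denoising network learns to predict.
    """
    a = abar[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    abar = alpha_bars(linear_betas(1000))
    xt, noise = add_noise([1.0, -1.0, 0.5], 500, abar, random.Random(0))
    print(xt)
```

In a full model, a network such as a U-Net takes `xt` and the step `t` (plus any context embedding) and is trained to predict `noise`. This sketch covers only the data-corruption side.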

To see a complete list of courses on generative AI and large language models, check out these NVIDIA Training Learning Paths. 

Building Digital 3D Worlds

Advancements in digital world-building are transforming media and entertainment, architecture, engineering, construction and operations, factory planning and avatar creation, among other industries.

Immersive 3D environments elevate user engagement and enable innovative solutions to real-world problems. NVIDIA Omniverse, a platform for connecting and developing 3D tools and applications, lets technical artists, designers and engineers quickly assemble complex and physically accurate simulations and 3D scenes in real time, while seamlessly collaborating with team members.

New NVIDIA Training courses on this topic include:

  • Essentials of USD in NVIDIA Omniverse: Universal Scene Description, or OpenUSD, is transforming 3D workflows across industries. It’s an open standard enabling 3D artists and developers to connect, compose and simulate in the metaverse. Students will learn what makes OpenUSD unique for designing 3D worlds. The training covers data modeling using primitive nodes, attributes and relationships, as well as custom schemas and composition for scene assembly and collaboration.
  • Developing Omniverse Kit Applications: Learn how to use the NVIDIA Omniverse Kit development framework to build applications, custom extensions and microservices. Applications may comprise many extensions working in concert to address specific 3D workflows, like industrial digitalization and factory planning. Students will use Omniverse reference applications, like Omniverse USD Composer and USD Presenter, to kickstart their own application development.
  • Bootstrapping Computer Vision Models With Synthetic Data: Learn how to use NVIDIA Omniverse Replicator, a core Omniverse extension, to accelerate the development of computer vision models. Generate accurate, photorealistic, physics-conforming synthetic data to ease the expensive, time-consuming task of labeling real-world data. Omniverse Replicator accelerates AI development at scale and reduces time to production.

To see a complete list of courses on graphics and simulation, check out these NVIDIA Training Learning Paths.

Wide Portfolio of Courses 

NVIDIA Training offers courses and resources to help individuals and organizations develop expertise in using NVIDIA technologies to fuel innovation. In addition to those above, a wide range of courses and workshops covering AI, deep learning, accelerated computing, data science, networking and infrastructure are available to explore in the training catalog. 

At the SIGGRAPH conference session “Reimagine Your Curriculum With OpenUSD and NVIDIA Omniverse,” Laura Scholl, senior content developer on the Omniverse team at NVIDIA, will discuss how to incorporate OpenUSD and Omniverse into an educational setting using teaching kits, programs for educators and other resources available from NVIDIA.  

Learn about the latest advances in generative AI, graphics and more by joining NVIDIA at SIGGRAPH. NVIDIA founder and CEO Jensen Huang will deliver a keynote address on Tuesday, Aug. 8, at 8 a.m. PT.

Read More