Bring your own AI using Amazon SageMaker with Salesforce Data Cloud


This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI.

We’re excited to announce the Amazon SageMaker and Salesforce Data Cloud integration. With this capability, businesses can access their Salesforce data securely with a zero-copy approach using SageMaker and use SageMaker tools to build, train, and deploy AI models. The inference endpoints are connected with Data Cloud to drive predictions in real time. As a result, businesses can accelerate time to market while maintaining data integrity and security, and reduce the operational burden of moving data from one location to another.

Introducing Einstein Studio on Data Cloud

Data Cloud is a data platform that provides businesses with real-time updates of their customer data from any touch point. With Einstein Studio, a gateway to AI tools on the data platform, admins and data scientists can effortlessly create models with a few clicks or using code. Einstein Studio’s bring your own model (BYOM) experience provides the capability to connect custom or generative AI models from external platforms such as SageMaker to Data Cloud. Custom models can be trained using data from Salesforce Data Cloud accessed through the Amazon SageMaker Data Wrangler connector. Businesses can act on their predictions by seamlessly integrating custom models into Salesforce workflows, leading to improved efficiency, decision-making, and personalized experiences.

Benefits of the SageMaker and Data Cloud Einstein Studio integration

Here’s how using SageMaker with Einstein Studio in Salesforce Data Cloud can help businesses:

  • It provides the ability to connect custom and generative AI models to Einstein Studio for various use cases, such as lead conversion, case classification, and sentiment analysis.
  • It eliminates tedious, costly, and error-prone ETL (extract, transform, and load) jobs. The zero-copy approach to data reduces the overhead to manage data copies, reduces storage costs, and improves efficiencies.
  • It provides access to highly curated, harmonized, and real-time data across Customer 360. This leads to expert models that deliver more intelligent predictions and business insights.
  • It simplifies the consumption of results from business processes and drives value without latency. For example, you can use automated workflows that can adapt in an instant based on new data.
  • It facilitates the operationalization of SageMaker models and inferences in Salesforce.

The following is an example of how to operationalize a SageMaker model using Salesforce Flow.

SageMaker integration

SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.

To streamline the SageMaker and Salesforce Data Cloud integration, we are introducing two new capabilities in SageMaker:

  • The SageMaker Data Wrangler Salesforce Data Cloud connector – With the newly launched SageMaker Data Wrangler Salesforce Data Cloud connector, admins can preconfigure connections to Salesforce to enable data analysts and data scientists to quickly access Salesforce data in real time and create features for ML. This enables users to access Salesforce Data Cloud securely using OAuth. You can interactively visualize, analyze, and transform data using the power of Spark, without writing any code, through the low-code visual data preparation features of SageMaker Data Wrangler. You can also scale to process large datasets with SageMaker Processing jobs, train ML models automatically using Amazon SageMaker Autopilot, and integrate with a SageMaker inference pipeline to deploy the same data flow to production, where the inference endpoint processes data in real time or in batch.

  • The SageMaker Projects template for Salesforce – We launched a SageMaker Projects template for Salesforce that you can use to deploy endpoints for traditional and large language models (LLMs) and expose SageMaker endpoints as an API automatically. SageMaker Projects provides a straightforward way to set up and standardize the development environment for data scientists and ML engineers to build and deploy ML models on SageMaker.

Partner Quote

“The partnership between Salesforce and AWS SageMaker will empower customers to leverage the power of AI (both generative and non-generative models) across their Salesforce data sources, workflows and applications to deliver personalized experiences and power new content generation, summarization, and question-answer type experiences. By combining the best of both worlds, we are creating a new paradigm for data-driven innovation and customer success underpinned by AI.”

-Kaushal Kurapati, Salesforce Senior Vice President of Product, AI and Search

Solution overview

The BYOM integration solution provides customers with a native Salesforce Data Cloud connector in SageMaker Data Wrangler. The SageMaker Data Wrangler connector allows you to securely access Salesforce Data Cloud objects. Once users are authenticated, they can perform data exploration, preparation, and feature engineering tasks needed for model development and inference through the SageMaker Data Wrangler interactive visual interface. Data scientists can work within Amazon SageMaker Studio notebooks to develop custom models, which can be traditional or LLMs, and make them available for deployment by registering the model in the SageMaker Model Registry. When a model is approved for production in the registry, SageMaker Projects will automate the deployment of an invocation API that can be configured as a target in Salesforce Einstein Studio and integrated with Salesforce Customer 360 applications. The following diagram illustrates this architecture.
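
To make the Model Registry step concrete, the following is a minimal boto3 sketch of registering and approving a model version; the model package group name, container image, and artifact location are placeholders rather than values from the SageMaker Projects template for Salesforce, and approving the version is what triggers the automated deployment.

import boto3

sm_client = boto3.client("sagemaker")

# The group name, container image, and artifact location are placeholders.
model_package_group = "salesforce-byom-models"

response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group,
    ModelPackageDescription="Custom model trained on Salesforce Data Cloud data",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<inference-image>:latest",
                "ModelDataUrl": "s3://<bucket>/models/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)

# Approving the model version is what triggers the automated deployment pipeline.
sm_client.update_model_package(
    ModelPackageArn=response["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)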

Conclusion

In this post, we shared the SageMaker and Salesforce Einstein Studio BYOM integration, where you can use data in Salesforce Data Cloud to build and train traditional models and LLMs in SageMaker. You can use SageMaker Data Wrangler to prepare data from Salesforce Data Cloud with zero copy. We also provided an automated solution to deploy the SageMaker endpoints as an API using a SageMaker Projects template for Salesforce.

AWS and Salesforce are excited to partner together to deliver this experience to our joint customers to help them drive business processes using the power of ML and artificial intelligence.

To learn more about the Salesforce BYOM integration, refer to Bring your own AI models with Einstein Studio. For a detailed implementation using product recommendations example use case, refer to Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce Apps with AI/ML.


About the Authors

Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Maninder (Mani) Kaur is the AI/ML Specialist lead for Strategic ISVs at AWS. With her customer-first approach, Mani helps strategic customers shape their AI/ML strategy, fuel innovation, and accelerate their AI/ML journey. Mani is a firm believer of ethical and responsible AI, and strives to ensure that her customers’ AI solutions align with these principles.

Read More

NVIDIA CEO Jensen Huang Returns to SIGGRAPH


One pandemic and one generative AI revolution later, NVIDIA founder and CEO Jensen Huang returns to the SIGGRAPH stage next week to deliver a live keynote at the world’s largest professional graphics conference.

The address, slated for Tuesday, Aug. 8, at 8 a.m. PT in Los Angeles, will feature an exclusive look at some of NVIDIA’s newest breakthroughs, including award-winning research, OpenUSD developments and the latest AI-powered solutions for content creation.

NVIDIA founder and CEO Jensen Huang.

Huang’s address comes after NVIDIA joined forces last week with Pixar, Adobe, Apple and Autodesk to found the Alliance for OpenUSD, a major leap toward unlocking the next era of interoperability in 3D graphics, design and simulation.

The group will standardize and extend OpenUSD, the open-source Universal Scene Description framework that’s the foundation of interoperable 3D applications and projects ranging from visual effects to industrial digital twins.

Huang will also offer a perspective on what’s been a raucous year for AI, with wildly popular new generative AI applications — including ChatGPT and Midjourney — providing a taste of what’s to come as developers worldwide get to work.

Throughout the conference, NVIDIA will participate in sessions on immersive visualization, 3D interoperability and AI-mediated video conferencing, and will present 20 research papers. Attendees will also get the opportunity to join hands-on labs.

Join SIGGRAPH to witness the evolution of AI and visual computing. Watch the keynote on this page.

 

Image source: Ron Diering, via Flickr, some rights reserved.

Read More

Enhancing AWS intelligent document processing with generative AI


Data classification, extraction, and analysis can be challenging for organizations that deal with large volumes of documents. Traditional document processing solutions are manual, expensive, error-prone, and difficult to scale. AWS intelligent document processing (IDP), with AI services such as Amazon Textract, allows you to take advantage of industry-leading machine learning (ML) technology to quickly and accurately process data from any scanned document or image. Generative artificial intelligence (generative AI) complements Amazon Textract to further automate document processing workflows. Features such as normalizing key fields and summarizing input data support faster cycles for managing document processing workflows, while reducing the potential for errors.

Generative AI is driven by large ML models called foundation models (FMs). FMs are transforming the way you can solve traditionally complex document processing workloads. In addition to existing capabilities, businesses need to summarize specific categories of information, including debit and credit data from documents such as financial reports and bank statements. FMs make it easier to generate such insights from the extracted data. To optimize time spent in human review and to improve employee productivity, mistakes such as missing digits in phone numbers, missing documents, or addresses without street numbers can be flagged in an automated way. In the current scenario, you need to dedicate resources to accomplish such tasks using human review and complex scripts. This approach is tedious and expensive. FMs can help complete these tasks faster, with fewer resources, and transform varying input formats into a standard template that can be processed further. At AWS, we offer services such as Amazon Bedrock, the easiest way to build and scale generative AI applications with FMs. Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can find the model that best suits your requirements. We also offer Amazon SageMaker JumpStart, which allows ML practitioners to choose from a broad selection of open-source FMs. ML practitioners can deploy FMs to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

Ricoh offers workplace solutions and digital transformation services designed to help customers manage and optimize information flow across their businesses. Ashok Shenoy, VP of Portfolio Solution Development, says, “We are adding generative AI to our IDP solutions to help our customers get their work done faster and more accurately by utilizing new capabilities such as Q&A, summarization, and standardized outputs. AWS allows us to take advantage of generative AI while keeping each of our customers’ data separate and secure.”

In this post, we share how to enhance your IDP solution on AWS with generative AI.

Improving the IDP pipeline

In this section, we review how the traditional IDP pipeline can be augmented by FMs and walk through an example use case using Amazon Textract with FMs.

AWS IDP comprises three stages: classification, extraction, and enrichment. For more details about each stage, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. In the classification stage, FMs can now classify documents without any additional training. This means that documents can be categorized even if the model hasn’t seen similar examples before. FMs in the extraction stage normalize date fields and verify addresses and phone numbers, while ensuring consistent formatting. FMs in the enrichment stage allow inference, logical reasoning, and summarization. When you use FMs in each IDP stage, your workflow will be more streamlined and performance will improve. The following diagram illustrates the IDP pipeline with generative AI.

Intelligent Document Processing Pipeline with Generative AI

Extraction stage of the IDP pipeline

Because FMs can’t directly process documents in their native formats (such as PDF, JPEG, PNG, and TIFF) as input, a mechanism to convert documents to text is needed. To extract the text from the document before sending it to the FMs, you can use Amazon Textract. With Amazon Textract, you can extract lines and words and pass them to downstream FMs. The following architecture uses Amazon Textract for accurate text extraction from any type of document before sending it to FMs for further processing.

Textract Ingests document data to the Foundation Models
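
To make that step concrete, the following is a minimal boto3 sketch (the bucket and document names are hypothetical) that extracts the lines of text from a document so they can be passed to an FM:

import boto3

textract = boto3.client("textract")

# The bucket and document name are placeholders.
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-document-bucket", "Name": "discharge-summary.png"}}
)

# Keep the LINE blocks; this plain text is what gets passed to the FM downstream.
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
document_text = "\n".join(lines)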

Typically, documents contain both structured and semi-structured information. Amazon Textract can be used to extract raw text and data from tables and forms. The relationship between the data in tables and forms plays a vital role in automating business processes. Certain types of information may not be processed by FMs. As a result, we can choose to either store this information in a downstream store or send it to FMs. The following figure is an example of how Amazon Textract can extract structured and semi-structured information from a document, in addition to lines of text that need to be processed by FMs.

Using AWS serverless services to summarize with FMs

The IDP pipeline we illustrated earlier can be seamlessly automated using AWS serverless services. Highly unstructured documents are common in big enterprises. These documents can span from Securities and Exchange Commission (SEC) documents in the banking industry to coverage documents in the health insurance industry. With the evolution of generative AI at AWS, people in these industries are looking for ways to get a summary from those documents in an automated and cost-effective manner. Serverless services help provide the mechanism to build a solution for IDP quickly. Services such as AWS Lambda, AWS Step Functions, and Amazon EventBridge can help build the document processing pipeline with integration of FMs, as shown in the following diagram.

End-to-end document processing with Amazon Textract and Generative AI

The sample application used in the preceding architecture is driven by events. An event is defined as a change in state that has recently occurred. For example, when an object gets uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, Amazon S3 emits an Object Created event. This event notification from Amazon S3 can trigger a Lambda function or a Step Functions workflow. This type of architecture is termed an event-driven architecture. In this post, our sample application uses an event-driven architecture to process a sample medical discharge document and summarize the details of the document. The flow works as follows:

  1. When a document is uploaded to an S3 bucket, Amazon S3 triggers an Object Created event.
  2. The EventBridge default event bus propagates the event to Step Functions based on an EventBridge rule.
  3. The state machine workflow processes the document, beginning with Amazon Textract.
  4. A Lambda function transforms the analyzed data for the next step.
  5. The state machine invokes a SageMaker endpoint, which hosts the FM using direct AWS SDK integration.
  6. A summary S3 destination bucket receives the summary response gathered from the FM.

We used the sample application with a Flan-T5 Hugging Face model to summarize the following sample patient discharge summary using the Step Functions workflow.

patient discharge summary

The Step Functions workflow uses AWS SDK integration to call the Amazon Textract AnalyzeDocument and SageMaker runtime InvokeEndpoint APIs, as shown in the following figure.

workflow
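
Outside of the Step Functions workflow, the same two calls can be sketched with boto3; the bucket, document, and endpoint names as well as the payload shape are assumptions for illustration and depend on the FM container you deploy:

import json

import boto3

textract = boto3.client("textract")
runtime = boto3.client("sagemaker-runtime")

# Extract lines and form data from the uploaded document (bucket and key are placeholders).
analysis = textract.analyze_document(
    Document={"S3Object": {"Bucket": "input-bucket", "Name": "discharge-summary.png"}},
    FeatureTypes=["FORMS"],
)
text = " ".join(block["Text"] for block in analysis["Blocks"] if block["BlockType"] == "LINE")

# Ask the FM hosted on a SageMaker endpoint for a summary; the endpoint name and
# payload shape are assumptions and depend on the model container you deploy.
response = runtime.invoke_endpoint(
    EndpointName="flan-t5-summarization-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarize the following document: " + text}),
)
summary = json.loads(response["Body"].read())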

This workflow results in a summary JSON object that is stored in a destination bucket. The JSON object looks as follows:

{
  "summary": [
    "John Doe is a 35-year old male who has been experiencing stomach problems for two months. He has been taking antibiotics for the last two weeks, but has not been able to eat much. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has also noticed a change in his stool color, which is now darker. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of fatigue, and has been unable to work for the last two weeks. He has also been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help. He has been experiencing a lot of abdominal pain, bloating, and fatigue. He has been taking antacids for the last two weeks, but they no longer help."
  ],
  "forms": [
    {
      "key": "Ph: ",
      "value": "(888)-(999)-(0000) "
    },
    {
      "key": "Fax: ",
      "value": "(888)-(999)-(1111) "
    },
    {
      "key": "Patient Name: ",
      "value": "John Doe "
    },
    {
      "key": "Patient ID: ",
      "value": "NARH-36640 "
    },
    {
      "key": "Gender: ",
      "value": "Male "
    },
    {
      "key": "Attending Physician: ",
      "value": "Mateo Jackson, PhD "
    },
    {
      "key": "Admit Date: ",
      "value": "07-Sep-2020 "
    },
    {
      "key": "Discharge Date: ",
      "value": "08-Sep-2020 "
    },
    {
      "key": "Discharge Disposition: ",
      "value": "Home with Support Services "
    },
    {
      "key": "Pre-existing / Developed Conditions Impacting Hospital Stay: ",
      "value": "35 yo M c/o stomach problems since 2 months. Patient reports epigastric abdominal pain non- radiating. Pain is described as gnawing and burning, intermittent lasting 1-2 hours, and gotten progressively worse. Antacids used to alleviate pain but not anymore; nothing exacerbates pain. Pain unrelated to daytime or to meals. Patient denies constipation or diarrhea. Patient denies blood in stool but have noticed them darker. Patient also reports nausea. Denies recent illness or fever. He also reports fatigue for 2 weeks and bloating after eating. ROS: Negative except for above findings Meds: Motrin once/week. Tums previously. PMHx: Back pain and muscle spasms. No Hx of surgery. NKDA. FHx: Uncle has a bleeding ulcer. Social Hx: Smokes since 15 yo, 1/2-1 PPD. No recent EtOH use. Denies illicit drug use. Works on high elevation construction. Fast food diet. Exercises 3-4 times/week but stopped 2 weeks ago. "
    },
    {
      "key": "Summary: ",
      "value": "some activity restrictions suggested, full course of antibiotics, check back with physican in case of relapse, strict diet "
    }
  ]
 }

Generating these summaries at scale using IDP with a serverless implementation helps organizations get meaningful, concise, and presentable data in a cost-effective way. Step Functions doesn’t limit the method of processing documents to one document at a time. Its distributed map feature can summarize large numbers of documents on a schedule.

The sample application uses a Flan-T5 Hugging Face model; however, you can use an FM endpoint of your choice. Training and running the model is out of scope of the sample application. Follow the instructions in the GitHub repository to deploy a sample application. The preceding architecture provides guidance on how you can orchestrate an IDP workflow using Step Functions. Refer to the IDP Generative AI workshop for detailed instructions on how to build an application with AWS AI services and FMs.

Set up the solution

Follow the steps in the README file to set up the solution architecture (except for the SageMaker endpoints). After you have your own SageMaker endpoint available, you can pass the endpoint name as a parameter to the template.
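
If you don’t yet have an endpoint, one option is to deploy an FM from SageMaker JumpStart with the SageMaker Python SDK; the following sketch assumes a Flan-T5 model ID available in JumpStart and default deployment settings:

from sagemaker.jumpstart.model import JumpStartModel

# The model ID is an assumption for illustration; browse SageMaker JumpStart for
# the exact model and version you want to use.
model = JumpStartModel(model_id="huggingface-text2text-flan-t5-xl")
predictor = model.deploy()

# Pass this endpoint name as the parameter to the solution template.
print(predictor.endpoint_name)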

Clean up

To save costs, delete the resources you deployed as part of the tutorial:

  1. Follow the steps in the cleanup section of the README file.
  2. Delete any content from your S3 bucket and then delete the bucket through the Amazon S3 console.
  3. Delete any SageMaker endpoints you may have created through the SageMaker console.

Conclusion

Generative AI is changing how you can process documents with IDP to derive insights. AWS AI services such as Amazon Textract along with AWS FMs can help accurately process any type of documents. For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.


About the Authors

Sonali Sahu is leading intelligent document processing with the AI/ML services team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Ashish Lal is a Senior Product Marketing Manager who leads product marketing for AI services at AWS. He has 9 years of marketing experience and has led the product marketing effort for Intelligent document processing. He got his Master’s in Business Administration at the University of Washington.

Mrunal Daftari is an Enterprise Senior Solutions Architect at Amazon Web Services. He is based in Boston, MA. He is a cloud enthusiast and very passionate about finding solutions for customers that are simple and address their business outcomes. He loves working with cloud technologies, providing simple, scalable solutions that drive positive business outcomes, cloud adoption strategy, and design innovative solutions and drive operational excellence.

Dhiraj Mahapatro is a Principal Serverless Specialist Solutions Architect at AWS. He specializes in helping enterprise financial services adopt serverless and event-driven architectures to modernize their applications and accelerate their pace of innovation. Recently, he has been working on bringing container workloads and practical usage of generative AI closer to serverless and EDA for financial services industry customers.

Jacob Hauskens is a Principal AI Specialist with over 15 years of strategic business development and partnerships experience. For the past 7 years, he has led the creation and implementation of go-to-market strategies for new AI-powered B2B services. Recently, he has been helping ISVs grow their revenue by adding generative AI to intelligent document processing workflows.

Read More

Multimodal medical AI


Medicine is an inherently multimodal discipline. When providing care, clinicians routinely interpret data from a wide range of modalities including medical images, clinical notes, lab tests, electronic health records, genomics, and more. Over the last decade or so, AI systems have achieved expert-level performance on specific tasks within specific modalities — with some AI systems processing CT scans, others analyzing high-magnification pathology slides, and still others hunting for rare genetic variations. The inputs to these systems tend to be complex data such as images, and they typically provide structured outputs, whether in the form of discrete grades or dense image segmentation masks. In parallel, the capacities and capabilities of large language models (LLMs) have become so advanced that they have demonstrated comprehension and expertise in medical knowledge by both interpreting and responding in plain language. But how do we bring these capabilities together to build medical AI systems that can leverage information from all these sources?

In today’s blog post, we outline a spectrum of approaches to bringing multimodal capabilities to LLMs and share some exciting results on the tractability of building multimodal medical LLMs, as described in three recent research papers. The papers, in turn, outline how to introduce de novo modalities to an LLM, how to graft a state-of-the-art medical imaging foundation model onto a conversational LLM, and first steps towards building a truly generalist multimodal medical AI system. If successfully matured, multimodal medical LLMs might serve as the basis of new assistive technologies spanning professional medicine, medical research, and consumer applications. As with our prior work, we emphasize the need for careful evaluation of these technologies in collaboration with the medical community and healthcare ecosystem.

A spectrum of approaches

Several methods for building multimodal LLMs have been proposed in recent months [1, 2, 3], and no doubt new methods will continue to emerge for some time. For the purpose of understanding the opportunities to bring new modalities to medical AI systems, we’ll consider three broadly defined approaches: tool use, model grafting, and generalist systems.

The spectrum of approaches to building multimodal LLMs ranges from having the LLM use existing tools or models, to leveraging domain-specific components with an adapter, to joint modeling of a multimodal model.

Tool use

In the tool use approach, one central medical LLM outsources analysis of data in various modalities to a set of software subsystems independently optimized for those tasks: the tools. The common mnemonic example of tool use is teaching an LLM to use a calculator rather than do arithmetic on its own. In the medical space, a medical LLM faced with a chest X-ray could forward that image to a radiology AI system and integrate that response. This could be accomplished via application programming interfaces (APIs) offered by subsystems, or more fancifully, two medical AI systems with different specializations engaging in a conversation.

This approach has some important benefits. It allows maximum flexibility and independence between subsystems, enabling health systems to mix and match products between tech providers based on validated performance characteristics of subsystems. Moreover, human-readable communication channels between subsystems maximize auditability and debuggability. That said, getting the communication right between independent subsystems can be tricky, narrowing the information transfer, or exposing a risk of miscommunication and information loss.

Model grafting

A more integrated approach would be to take a neural network specialized for each relevant domain, and adapt it to plug directly into the LLM — grafting the visual model onto the core reasoning agent. In contrast to tool use where the specific tool(s) used are determined by the LLM, in model grafting the researchers may choose to use, refine, or develop specific models during development. In two recent papers from Google Research, we show that this is in fact feasible. Neural LLMs typically process text by first mapping words into a vector embedding space. Both papers build on the idea of mapping data from a new modality into the input word embedding space already familiar to the LLM. The first paper, “Multimodal LLMs for health grounded in individual-specific data”, shows that asthma risk prediction in the UK Biobank can be improved if we first train a neural network classifier to interpret spirograms (a modality used to assess breathing ability) and then adapt the output of that network to serve as input into the LLM.

The second paper, “ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders”, takes this same tack, but applies it to full-scale image encoder models in radiology. Starting with a foundation model for understanding chest X-rays, already shown to be a good basis for building a variety of classifiers in this modality, this paper describes training a lightweight medical information adapter that re-expresses the top layer output of the foundation model as a series of tokens in the LLM’s input embeddings space. Despite fine-tuning neither the visual encoder nor the language model, the resulting system displays capabilities it wasn’t trained for, including semantic search and visual question answering.

Our approach to grafting a model works by training a medical information adapter that maps the output of an existing or refined image encoder into an LLM-understandable form.
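
As a rough, illustrative sketch only (not the adapter architecture used in the papers), such a medical information adapter can be thought of as a small trainable network that projects frozen encoder features into a short sequence of soft tokens in the LLM’s embedding space; the dimensions below are placeholders.

import torch
import torch.nn as nn

class MedicalInformationAdapter(nn.Module):
    """Illustrative sketch only: project frozen-encoder features into a short
    sequence of soft tokens in the LLM's word-embedding space. Dimensions and
    architecture are placeholders, not those of the published models."""

    def __init__(self, encoder_dim=1024, llm_embed_dim=4096, num_tokens=32):
        super().__init__()
        self.num_tokens = num_tokens
        self.llm_embed_dim = llm_embed_dim
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_embed_dim),
            nn.GELU(),
            nn.Linear(llm_embed_dim, num_tokens * llm_embed_dim),
        )

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # encoder_features: (batch, encoder_dim) from the domain-specific encoder
        tokens = self.proj(encoder_features)
        # (batch, num_tokens, llm_embed_dim): consumed alongside the ordinary text embeddings
        return tokens.view(-1, self.num_tokens, self.llm_embed_dim)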

Model grafting has a number of advantages. It uses relatively modest computational resources to train the adapter layers but allows the LLM to build on existing highly-optimized and validated models in each data domain. The modularization of the problem into encoder, adapter, and LLM components can also facilitate testing and debugging of individual software components when developing and deploying such a system. The corresponding disadvantages are that the communication between the specialist encoder and the LLM is no longer human readable (being a series of high dimensional vectors), and the grafting procedure requires building a new adapter for not just every domain-specific encoder, but also every revision of each of those encoders.

Generalist systems

The most radical approach to multimodal medical AI is to build one integrated, fully generalist system natively capable of absorbing information from all sources. In our third paper in this area, “Towards Generalist Biomedical AI”, rather than having separate encoders and adapters for each data modality, we build on PaLM-E, a recently published multimodal model that is itself a combination of a single LLM (PaLM) and a single vision encoder (ViT). In this setup, text and tabular data modalities are covered by the LLM text encoder, but now all other data are treated as an image and fed to the vision encoder.

Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same model weights.

We specialize PaLM-E to the medical domain by fine-tuning the complete set of model parameters on medical datasets described in the paper. The resulting generalist medical AI system is a multimodal version of Med-PaLM that we call Med-PaLM M. The flexible multimodal sequence-to-sequence architecture allows us to interleave various types of multimodal biomedical information in a single interaction. To the best of our knowledge, it is the first demonstration of a single unified model that can interpret multimodal biomedical data and handle a diverse range of tasks using the same set of model weights across all tasks (detailed evaluations in the paper).

This generalist-system approach to multimodality is both the most ambitious and simultaneously most elegant of the approaches we describe. In principle, this direct approach maximizes flexibility and information transfer between modalities. With no APIs to maintain compatibility across and no proliferation of adapter layers, the generalist approach has arguably the simplest design. But that same elegance is also the source of some of its disadvantages. Computational costs are often higher, and with a unitary vision encoder serving a wide range of modalities, domain specialization or system debuggability could suffer.

The reality of multimodal medical AI

To make the most of AI in medicine, we’ll need to combine the strength of expert systems trained with predictive AI with the flexibility made possible through generative AI. Which approach (or combination of approaches) will be most useful in the field depends on a multitude of as-yet unassessed factors. Is the flexibility and simplicity of a generalist model more valuable than the modularity of model grafting or tool use? Which approach gives the highest quality results for a specific real-world use case? Is the preferred approach different for supporting medical research or medical education vs. augmenting medical practice? Answering these questions will require ongoing rigorous empirical research and continued direct collaboration with healthcare providers, medical institutions, government entities, and healthcare industry partners broadly. We look forward to finding the answers together.

Read More

Meet the Maker: Developer Taps NVIDIA Jetson as Force Behind AI-Powered Pit Droid


Goran Vuksic is the brain behind a project to build a real-world pit droid, a type of Star Wars bot that repairs and maintains podracers which zoom across the much-loved film series.

The edge AI Jedi used an NVIDIA Jetson Orin Nano Developer Kit as the brain of the droid itself. The devkit enables the bot, which is a little less than four feet tall and has a simple webcam for eyes, to identify and move its head toward objects.

Vuksic — originally from Croatia and now based in Malmö, Sweden — recently traveled with the pit droid across Belgium and the Netherlands to several tech conferences. He presented to hundreds of people on computer vision and AI, using the droid as an engaging real-world demo.

The pit droid’s first look at the world.

A self-described Star Wars fanatic, he’s upgrading the droid’s capabilities in his free time, when not engrossed in his work as an engineering manager at a Copenhagen-based company. He’s also co-founder and chief technology officer of syntheticAIdata, a member of the NVIDIA Inception program for cutting-edge startups.

The company, which creates vision AI models with cost-effective synthetic data, uses a connector to the NVIDIA Omniverse platform for building and operating 3D tools and applications.

About the Maker

Named a Jetson AI Specialist by NVIDIA and an AI “Most Valuable Professional” by Microsoft, Vuksic got started with artificial intelligence and IT about a decade ago when working for a startup that classified tattoos with vision AI.

Since then, he’s worked as an engineering and technical manager, among other roles, developing IT strategies and solutions for various companies.

Robotics has always interested him, as he was a huge sci-fi fan growing up.

“Watching Star Wars and other films, I imagined how robots might be able to see and do stuff in the real world,” said Vuksic, also a member of the NVIDIA Developer Program.

Now, he’s enabling just that with the pit droid project powered by the NVIDIA Jetson platform, which the developer has used since the launch of its first product nearly a decade ago.

Vuksic reads to the pit droid.

Apart from tinkering with computers and bots, Vuksic enjoys playing the bass guitar in a band with his friends.

His Inspiration

Vuksic built the pit droid for both fun and educational purposes.

As a frequent speaker at tech conferences, he takes the pit droid on stage to engage with his audience, demonstrate how it works and inspire others to build something similar, he said.

Vuksic, his startup co-founder Sherry List and the pit droid present at the Techorama conference in Antwerp, Belgium.

“We live in a connected world — all the things around us are exchanging data and becoming more and more automated,” he added. “I think this is super exciting, and we’ll likely have even more robots to help humans with tasks.”

Using the NVIDIA Jetson platform, Vuksic is at the forefront of robotics innovation, along with an ecosystem of developers using edge AI.

His Jetson Project

Vuksic’s pit droid project, which took him four months, began with 3D printing its body parts and putting them all together.

He then equipped the bot with the Jetson Orin Nano Developer Kit as the brain in its head, which can move in all directions thanks to two motors.

Vuksic places an NVIDIA Jetson Orin Nano Developer Kit in the pit droid’s head.

The Jetson Orin Nano enables real-time processing of the camera feed. “It’s truly, truly amazing to have this processing power in such a small box that fits in the droid’s head,” said Vuksic.

He also uses Microsoft Azure to process the data in the cloud for object-detection training.

“My favorite part of the project was definitely connecting it to the Jetson Orin Nano, which made it easy to run the AI and make the droid move according to what it sees,” said Vuksic, who wrote a step-by-step technical guide to building the bot, so others can try it themselves.

“The most challenging part was traveling with the droid — there was a bit of explanation necessary when I was passing security and opened my bag which contained the robot in parts,” the developer mused. “I said, ‘This is just my big toy!’”

Learn more about the NVIDIA Jetson platform.

Read More

How to Build Generative AI Applications and 3D Virtual Worlds


To grow and succeed, organizations must continuously focus on technical skills development, especially in rapidly advancing areas of technology, such as generative AI and the creation of 3D virtual worlds.  

NVIDIA Training, which equips teams with skills for the age of AI, high performance computing and industrial digitalization, has released new courses that cover these technologies. The program has already equipped hundreds of thousands of students, developers, researchers and data scientists with critical technical skills.  

With its latest courses, NVIDIA Training is enabling organizations to fully harness the power of generative AI and virtual worlds, which are transforming the business landscape. 

Get Started Building Generative AI Applications     

Generative AI is revolutionizing the ways organizations work. It enables users to quickly generate new content based on a variety of inputs, including text, images, sounds, animation, 3D models and other data types.  

New NVIDIA Training courses on gen AI include:         

  • Generative AI Explained — Generative models are accelerating application development for many use cases, including question-answering, summarization, textual entailment, and 2D and 3D image and audio creation. In this two-hour course, Bryan Catanzaro, vice president of applied deep learning research at NVIDIA, provides an overview of gen AI’s major developments, where it stands now and what it could be capable of in the future. He’ll discuss technical details and popular generative AI applications, as well as how businesses can responsibly use the technology.
  • Generative AI With Diffusion Models — Thanks to improvements in computing power and scientific theory, generative AI is more accessible than ever. Get started with gen AI application development with this hands-on course where students will learn how to build a text-to-image generative AI application using the latest techniques. Generate images with diffusion models and refine the output with various optimizations. Build a denoising diffusion model from the U-Net architecture to add context embeddings for greater user control. 

To see a complete list of courses on generative AI and large language models, check out these NVIDIA Training Learning Paths. 

Building Digital 3D Worlds

Advancements in digital world-building are transforming media and entertainment, architecture, engineering, construction and operations, factory planning and avatar creation, among other industries.

Immersive 3D environments elevate user engagement and enable innovative solutions to real-world problems. NVIDIA Omniverse, a platform for connecting and developing 3D tools and applications, lets technical artists, designers and engineers quickly assemble complex and physically accurate simulations and 3D scenes in real time, while seamlessly collaborating with team members.

New NVIDIA Training courses on this topic include:

  • Essentials of USD in NVIDIA Omniverse — Universal Scene Description, or OpenUSD, is transforming 3D workflows across industries. It’s an open standard enabling 3D artists and developers to connect, compose and simulate in the metaverse. Students will learn what makes OpenUSD unique for designing 3D worlds. The training covers data modeling using primitive nodes, attributes and relationships, as well as custom schemas and composition for scene assembly and collaboration.
  • Developing Omniverse Kit Applications — Learn how to use the NVIDIA Omniverse Kit development framework to build applications, custom extensions and microservices. Applications may comprise many extensions working in concert to address specific 3D workflows, like industrial digitalization and factory planning. Students will use Omniverse reference applications, like Omniverse USD Composer and USD Presenter, to kickstart their own application development.
  • Bootstrapping Computer Vision Models With Synthetic Data — Learn how to use NVIDIA Omniverse Replicator, a core Omniverse extension, to accelerate the development of computer vision models. Generate accurate, photorealistic, physics-conforming synthetic data to ease the expensive, time-consuming task of labeling real-world data. Omniverse Replicator accelerates AI development at scale and reduces time to production.

To see a complete list of courses on graphics and simulation, check out these NVIDIA Training Learning Paths.

Wide Portfolio of Courses 

NVIDIA Training offers courses and resources to help individuals and organizations develop expertise in using NVIDIA technologies to fuel innovation. In addition to those above, a wide range of courses and workshops covering AI, deep learning, accelerated computing, data science, networking and infrastructure are available to explore in the training catalog. 

At the SIGGRAPH conference session “Reimagine Your Curriculum With OpenUSD and NVIDIA Omniverse,” Laura Scholl, senior content developer on the Omniverse team at NVIDIA, will discuss how to incorporate OpenUSD and Omniverse into an educational setting using teaching kits, programs for educators and other resources available from NVIDIA.  

Learn about the latest advances in generative AI, graphics and more by joining NVIDIA at SIGGRAPH. NVIDIA founder and CEO Jensen Huang will deliver a keynote address on Tuesday, Aug. 8, at 8 a.m. PT.

Read More

Scale training and inference of thousands of ML models with Amazon SageMaker


As machine learning (ML) becomes increasingly prevalent in a wide range of industries, organizations are finding the need to train and serve large numbers of ML models to meet the diverse needs of their customers. For software as a service (SaaS) providers in particular, the ability to train and serve thousands of models efficiently and cost-effectively is crucial for staying competitive in a rapidly evolving market.

Training and serving thousands of models requires a robust and scalable infrastructure, which is where Amazon SageMaker can help. SageMaker is a fully managed platform that enables developers and data scientists to build, train, and deploy ML models quickly, while also offering the cost-saving benefits of using the AWS Cloud infrastructure.

In this post, we explore how you can use SageMaker features, including Amazon SageMaker Processing, SageMaker training jobs, and SageMaker multi-model endpoints (MMEs), to train and serve thousands of models in a cost-effective way. To get started with the described solution, you can refer to the accompanying notebook on GitHub.

Use case: Energy forecasting

For this post, we assume the role of an ISV company that helps their customers become more sustainable by tracking their energy consumption and providing forecasts. Our company has 1,000 customers who want to better understand their energy usage and make informed decisions about how to reduce their environmental impact. To do this, we use a synthetic dataset and train an ML model based on Prophet for each customer to make energy consumption forecasts. With SageMaker, we can efficiently train and serve these 1,000 models, providing our customers with accurate and actionable insights into their energy usage.

There are three features in the generated dataset:

  • customer_id – This is an integer identifier for each customer, ranging from 0–999.
  • timestamp – This is a date/time value that indicates the time at which the energy consumption was measured. The timestamps are randomly generated between the start and end dates specified in the code.
  • consumption – This is a float value that indicates the energy consumption, measured in some arbitrary unit. The consumption values are randomly generated between 0–1,000 with sinusoidal seasonality.

Solution overview

To efficiently train and serve thousands of ML models, we can use the following SageMaker features:

  • SageMaker Processing – SageMaker Processing is a fully managed data preparation service that enables you to perform data processing and model evaluation tasks on your input data. You can use SageMaker Processing to transform raw data into the format needed for training and inference, as well as to run batch and online evaluations of your models.
  • SageMaker training jobs – You can use SageMaker training jobs to train models on a variety of algorithms and input data types, and specify the compute resources needed for training.
  • SageMaker MMEs – Multi-model endpoints enable you to host multiple models on a single endpoint, which makes it easy to serve predictions from multiple models using a single API. SageMaker MMEs can save time and resources by reducing the number of endpoints needed to serve predictions from multiple models. MMEs support hosting of both CPU- and GPU-backed models. Note that in our scenario, we use 1,000 models, but this is not a limitation of the service itself.

The following diagram illustrates the solution architecture.

architecture that displays the described process

The workflow includes the following steps:

  1. We use SageMaker Processing to preprocess data and create a single CSV file per customer and store it in Amazon Simple Storage Service (Amazon S3), as sketched after this list.
  2. The SageMaker training job is configured to read the output of the SageMaker Processing job and distribute it in a round-robin fashion to the training instances. Note that this can also be achieved with Amazon SageMaker Pipelines.
  3. The model artifacts are stored in Amazon S3 by the training job, and are served directly from the SageMaker MME.
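
The following is a minimal sketch of step 1 using an SKLearnProcessor; the preprocessing script name, bucket names, and paths are placeholders, and preprocess.py is assumed to write one folder per customer, each containing a data.csv file.

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Assumes a SageMaker execution role is available in the environment.
role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# preprocess.py is a hypothetical script that splits the raw dataset into one
# folder per customer, each with a data.csv file, under /opt/ml/processing/output.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw-energy-data",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/customer_data")],
)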

Scale training to thousands of models

Scaling the training of thousands of models is possible via the distribution parameter of the TrainingInput class in the SageMaker Python SDK, which allows you to specify how data is distributed across multiple training instances for a training job. There are two options for the distribution parameter: FullyReplicated and ShardedByS3Key. The ShardedByS3Key option means that the training data is sharded by S3 object key, with each training instance receiving a unique subset of the data, avoiding duplication. After the data is copied by SageMaker to the training containers, we can read the folder and files structure to train a unique model per customer file. The following is an example code snippet:

# Assume that the training data is in an S3 bucket already, pass the parent folder
s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data='s3://my-bucket/customer_data',
    distribution='ShardedByS3Key'
)

# Create a SageMaker estimator and set the training input
estimator = sagemaker.estimator.Estimator(...)
estimator.fit(inputs=s3_input_train)

Every SageMaker training job stores the model saved in the /opt/ml/model folder of the training container before archiving it in a model.tar.gz file, and then uploads it to Amazon S3 upon training job completion. Power users can also automate this process with SageMaker Pipelines. When storing multiple models via the same training job, SageMaker creates a single model.tar.gz file containing all the trained models. This would then mean that, in order to serve the model, we would need to unpack the archive first. To avoid this, we use checkpoints to save the state of individual models. SageMaker provides the functionality to copy checkpoints created during the training job to Amazon S3. Here, the checkpoints need to be saved in a pre-specified location, with the default being /opt/ml/checkpoints. These checkpoints can be used to resume training at a later moment or as a model to deploy on an endpoint. For a high-level summary of how the SageMaker training platform manages storage paths for training datasets, model artifacts, checkpoints, and outputs between AWS Cloud storage and training jobs in SageMaker, refer to Amazon SageMaker Training Storage Folders for Training Datasets, Checkpoints, Model Artifacts, and Outputs.
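
Configuring this behavior comes down to two estimator parameters. The following sketch shows only those settings; the training image URI, instance configuration, and bucket are placeholders.

import sagemaker
from sagemaker.estimator import Estimator

# Only the checkpoint settings are the point here; the image URI, instance
# configuration, and bucket are placeholders.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role=sagemaker.get_execution_role(),
    instance_count=2,
    instance_type="ml.m5.xlarge",
    checkpoint_s3_uri="s3://my-bucket/scaling-thousand-models/checkpoints",
    checkpoint_local_path="/opt/ml/checkpoints",  # default path, shown for clarity
)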

The following code from the train.py script containing the training logic trains a fictitious model per customer and packages each one into its own archive:

import os
import tarfile

import boto3
import pandas as pd

[ ... argument parsing ... ]

# Train one model per customer folder created by the processing step
for customer in os.listdir(args.input_path):

    # Read the customer's data locally within the training job
    df = pd.read_csv(os.path.join(args.input_path, customer, 'data.csv'))

    # Define and train the model (MyModel and model_to_json stand in for the
    # actual model, such as Prophet, and its serialization helper)
    model = MyModel()
    model.fit(df)

    # Save the model to the output directory parsed from the job arguments
    with open(os.path.join(output_dir, 'model.json'), 'w') as fout:
        fout.write(model_to_json(model))

    # Create the {customer}.tar.gz archive containing the model and the training script
    with tarfile.open(os.path.join(output_dir, f'{customer}.tar.gz'), "w:gz") as tar:
        tar.add(os.path.join(output_dir, 'model.json'), "model.json")
        tar.add(os.path.join(args.code_dir, "training.py"), "training.py")

Scale inference to thousands of models with SageMaker MMEs

SageMaker MMEs allow you to serve multiple models at the same time by creating an endpoint configuration that includes a list of all the models to serve, and then creating an endpoint using that endpoint configuration. There is no need to re-deploy the endpoint every time you add a new model because the endpoint will automatically serve all models stored in the specified S3 paths. This is achieved with Multi Model Server (MMS), an open-source framework for serving ML models that can be installed in containers to provide the front end that fulfills the requirements for the new MME container APIs. In addition, you can use other model servers including TorchServe and Triton. MMS can be installed in your custom container via the SageMaker Inference Toolkit. To learn more about how to configure your Dockerfile to include MMS and use it to serve your models, refer to Build Your Own Container for SageMaker Multi-Model Endpoints.

The following code snippet shows how to create an MME using the SageMaker Python SDK:

from sagemaker.multidatamodel import MultiDataModel

# Create the MultiDataModel definition
multimodel = MultiDataModel(
    name='customer-models',
    model_data_prefix=f's3://{bucket}/scaling-thousand-models/models',
    model=your_model,
)

# Deploy on a real-time endpoint
predictor = multimodel.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.xlarge',
)

When the MME is live, we can invoke it to generate predictions. Invocations can be done in any AWS SDK as well as with the SageMaker Python SDK, as shown in the following code snippet:

predictor.predict(
    data='{"period": 7}',             # the payload, in this case JSON
    target_model=f'{customer}.tar.gz'  # the archive name for the chosen customer's model
)

When calling a model, the model is initially loaded from Amazon S3 on the instance, which can result in a cold start when calling a new model. Frequently used models are cached in memory and on disk to provide low-latency inference.

Conclusion

SageMaker is a powerful and cost-effective platform for training and serving thousands of ML models. Its features, including SageMaker Processing, training jobs, and MMEs, enable organizations to efficiently train and serve thousands of models at scale, while also benefiting from the cost-saving advantages of using the AWS Cloud infrastructure. To learn more about how to use SageMaker for training and serving thousands of models, refer to Process data, Train a Model with Amazon SageMaker and Host multiple models in one container behind one endpoint.


About the Authors

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He likes to work on machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and playing squash.

Read More

Accelerate business outcomes with 70% performance improvements to data processing, training, and inference with Amazon SageMaker Canvas


Amazon SageMaker Canvas is a visual interface that enables business analysts to generate accurate machine learning (ML) predictions on their own, without requiring any ML experience or having to write a single line of code. SageMaker Canvas’s intuitive user interface lets business analysts browse and access disparate data sources in the cloud or on premises, prepare and explore the data, build and train ML models, and generate accurate predictions within a single workspace.

SageMaker Canvas allows analysts to use different data workloads to achieve the desired business outcomes with high accuracy and performance. The compute, storage, and memory requirements to generate accurate predictions are abstracted from the end user, enabling them to focus on the business problem to be solved. Earlier this year, we announced performance optimizations based on customer feedback to deliver faster and more accurate model training with SageMaker Canvas.

In this post, we show how SageMaker Canvas can now process data, train models, and generate predictions with increased speed and efficiency for different dataset sizes.

Prerequisites

If you would like to follow along, complete the following prerequisites:

  1. Have an AWS account.
  2. Set up SageMaker Canvas. For instructions, refer to Prerequisites for setting up Amazon SageMaker Canvas.
  3. Download the following two datasets to your local computer. The first is the NYC Yellow Taxi Trip dataset; the second is the eCommerce behavior data about retail events related to products and users.

Both datasets come under the Attribution 4.0 International (CC BY 4.0) license and are free to share and adapt.

Data processing improvements

With underlying performance optimizations, the time to import data into SageMaker Canvas has improved by over 70%. You can now import datasets of up to 2 GB in approximately 50 seconds and up to 5 GB in approximately 65 seconds.

After importing data, business analysts typically validate it to ensure there are no issues within the dataset. Example validation checks include verifying that columns contain the correct data type, that value ranges are in line with expectations, and that values are unique where applicable.

Data validation is now faster. In our tests, all validations took 50 seconds for the taxi dataset, which exceeds 5 GB in size, a 10-times improvement in speed.

Model training improvements

The performance optimizations related to ML model training in SageMaker Canvas now enable you to train models without running into out-of-memory failures.

The following screenshot shows the results of a successful build run using a large dataset, including the impact of the total_amount feature on the target variable.

Inference improvements

Finally, SageMaker Canvas inference improvements achieved a 3.5 times reduction in memory consumption for larger datasets in our internal testing.

Conclusion

In this post, we saw various improvements with SageMaker Canvas in importing, validation, training, and inference. We saw a 70% improvement in its ability to import large datasets, a 10 times improvement in data validation speed, and a 3.5 times reduction in memory consumption. These improvements allow you to work better with large datasets and reduce the time it takes to build ML models with SageMaker Canvas.

We encourage you to experience the improvements yourself. We welcome your feedback as we continuously work on performance optimizations to improve the user experience.


About the authors

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.

Tim Song is a Software Development Engineer at AWS SageMaker. With 10+ years of experience as a software developer, consultant, and tech leader, he has a demonstrated ability to deliver scalable and reliable products and solve complex problems. In his spare time, he enjoys nature, outdoor running, and hiking.

Hariharan Suresh is a Senior Solutions Architect at AWS. He is passionate about databases, machine learning, and designing innovative solutions. Prior to joining AWS, Hariharan was a product architect, core banking implementation specialist, and developer, and worked with BFSI organizations for over 11 years. Outside of technology, he enjoys paragliding and cycling.

Maia Haile is a Solutions Architect at Amazon Web Services based in the Washington, D.C. area. In that role, she helps public sector customers achieve their mission objectives with well-architected solutions on AWS. She has 5 years of experience spanning nonprofit healthcare, media and entertainment, and retail. Her passion is using artificial intelligence (AI) and machine learning (ML) to help public sector customers achieve their business and technical goals.

Read More

Build and train computer vision models to detect car positions in images using Amazon SageMaker and Amazon Rekognition

Computer vision (CV) is one of the most common applications of machine learning (ML) and deep learning. Use cases range from self-driving cars and content moderation on social media platforms to cancer detection and automated defect detection. Amazon Rekognition is a fully managed service that can perform CV tasks like object detection, video segment detection, content moderation, and more to extract insights from data without the need of any prior ML experience. In some cases, a more custom solution might be needed along with the service to solve a very specific problem.

In this post, we address areas where CV can be applied to use cases where the pose of objects, their position, and orientation is important. One such use case would be customer-facing mobile applications where an image upload is required. It might be for compliance reasons or to provide a consistent user experience and improve engagement. For example, on online shopping platforms, the angle at which products are shown in images has an effect on the rate of buying this product. One such case is to detect the position of a car. We demonstrate how you can combine well-known ML solutions with postprocessing to address this problem on the AWS Cloud.

We use deep learning models to solve this problem. Training ML algorithms for pose estimation requires a lot of expertise and custom training data. Both requirements are hard and costly to obtain. Therefore, we present two options: one that doesn’t require any ML expertise and uses Amazon Rekognition, and another that uses Amazon SageMaker to train and deploy a custom ML model. In the first option, we use Amazon Rekognition to detect the wheels of the car. We then infer the car orientation from the wheel positions using a rule-based system. In the second option, we detect the wheels and other car parts using the Detectron model. These are again used to infer the car position with rule-based code. The second option requires ML experience but is also more customizable. It can be used for further postprocessing on the image, for example, to crop out the whole car. Both of the options can be trained on publicly available datasets. Finally, we show how you can integrate this car pose detection solution into your existing web application using services like Amazon API Gateway and AWS Amplify.

Solution overview

The following diagram illustrates the solution architecture.

The solution consists of a mock web application in Amplify where a user can upload an image and invoke either the Amazon Rekognition model or the custom Detectron model to detect the position of the car. For each option, we host an AWS Lambda function behind an API Gateway that is exposed to our mock application. We configured our Lambda function to run with either the Detectron model trained in SageMaker or Amazon Rekognition.
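
To illustrate the SageMaker-backed option, the following is a minimal sketch of what such a Lambda handler could look like. The payload shape is an assumption for illustration, not the repository's exact contract, and the endpoint name matches the endpoint deployed later in this post:

import json

import boto3

smr_client = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Extract the base64-encoded image from the request body (assumed payload shape)
    img_b64 = json.loads(event["body"])["image"].split(",")[-1]

    # Forward the image to the SageMaker real-time endpoint hosting the Detectron model
    response = smr_client.invoke_endpoint(
        EndpointName="detectron-endpoint",
        ContentType="application/json",
        Body=json.dumps({"image": img_b64}),
    )
    predictions = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(predictions)}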

Prerequisites

For this walkthrough, you should have the following prerequisites:

Create a serverless app using Amazon Rekognition

Our first option demonstrates how you can detect car orientations in images using Amazon Rekognition. The idea is to use Amazon Rekognition to detect the location of the car and its wheels and then do postprocessing to derive the orientation of the car from this information. The whole solution is deployed using Lambda, as shown in the GitHub repository. This folder contains two main files: a Dockerfile that defines the Docker image that will run in our Lambda function, and the app.py file, which will be the main entry point of the Lambda function:

import base64
import json
from io import BytesIO

import boto3

def lambda_handler(event, context):
    # The request body carries the uploaded image as a base64-encoded data URL
    body_bytes = json.loads(event["body"])["image"].split(",")[-1]
    body_bytes = base64.b64decode(body_bytes)

    # Detect objects (car, wheels, and so on) with Amazon Rekognition
    rek = boto3.client('rekognition')
    response = rek.detect_labels(Image={'Bytes': body_bytes}, MinConfidence=80)

    # label_image (defined in app.py) derives the car angle from the detected wheels
    angle, img = label_image(img_string=body_bytes, response=response)

    # Encode the annotated image back into a base64 data URL for the HTTP response
    buffered = BytesIO()
    img.save(buffered, format="JPEG")
    img_str = "data:image/jpeg;base64," + base64.b64encode(buffered.getvalue()).decode('utf-8')

The Lambda function expects an event that contains a header and body, where the body is the image to be labeled as a base64-encoded object. Given the image, the Amazon Rekognition detect_labels function is invoked from the Lambda function using Boto3. The function returns one or more labels for each object in the image and bounding box details for all of the detected object labels as part of the response, along with other information like the confidence of the assigned label, the ancestor labels of the detected label, possible aliases for the label, and the categories the detected label belongs to. Based on the labels returned by Amazon Rekognition, we run the label_image function, which calculates the car angle from the detected wheels as follows:

import math
from itertools import combinations

import numpy as np

n_wheels = len(wheel_instances)

# Center point of each detected wheel bounding box (_extract_bb_coords is a helper in app.py)
wheel_centers = [np.array(_extract_bb_coords(wheel, img)).mean(axis=0)
                 for wheel in wheel_instances]

# Vectors between every pair of wheel centers, sorted by length
wheel_center_comb = list(combinations(wheel_centers, 2))
vecs = [(k, pair[0] - pair[1]) for k, pair in enumerate(wheel_center_comb)]
vecs = sorted(vecs, key=lambda vec: np.linalg.norm(vec[1]))

# Use the second-shortest vector when three wheels are detected, otherwise the shortest
vec_rel = vecs[1] if n_wheels == 3 else vecs[0]
angle = math.degrees(math.atan(vec_rel[1][1] / vec_rel[1][0]))

wheel_centers_rel = [tuple(wheel.tolist())
                     for wheel in wheel_center_comb[vec_rel[0]]]

Note that the application requires that only one car is present in the image and returns an error if that’s not the case. However, the postprocessing can be adapted to provide more granular orientation descriptions, cover several cars, or calculate the orientation of more complex objects.
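
For illustration, the following is a minimal sketch of how the wheel instances could be collected from the detect_labels response before the angle computation. The label name "Wheel" and the error handling are assumptions, not the repository's exact code:

# Collect the bounding box instances of all detected wheels (label name is an assumption)
wheel_instances = [
    instance
    for label in response["Labels"]
    if label["Name"] == "Wheel"
    for instance in label["Instances"]
]

# The application supports a single car, so at least two wheels must be visible
if len(wheel_instances) < 2:
    raise ValueError("Expected a single car with at least two visible wheels")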

Improve wheel detection

To further improve the accuracy of the wheel detection, you can use Amazon Rekognition Custom Labels. Similar to fine-tuning with SageMaker to train and deploy a custom ML model, you can bring your own labeled data so that Amazon Rekognition can produce a custom image analysis model for you in just a few hours. With Rekognition Custom Labels, you only need a small set of training images that are specific to your use case, in this case car images at specific angles, because it builds on the existing capabilities of Amazon Rekognition, which is trained on tens of millions of images across many categories. Rekognition Custom Labels can be integrated with only a few clicks and small adaptations to the Lambda function we use for the standard Amazon Rekognition solution.
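
As a sketch of that adaptation, the detect_labels call in the Lambda function could be replaced with detect_custom_labels once a Custom Labels model has been trained and started. The project version ARN below is a placeholder:

rek = boto3.client('rekognition')

# ProjectVersionArn is a placeholder for your trained Rekognition Custom Labels model
response = rek.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:<region>:<account-id>:project/<name>/version/<version>/<id>",
    Image={'Bytes': body_bytes},
    MinConfidence=80,
)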

Train a model using a SageMaker training job

In our second option, we train a custom deep learning model on SageMaker. We use the Detectron2 framework for the segmentation of car parts. These segments are then used to infer the position of the car.

The Detectron2 framework is a library that provides state-of-the-art detection and segmentation algorithms. Detectron2 provides a variety of Mask R-CNN models that were trained on the well-known COCO (Common Objects in Context) dataset. To build our car object detection model, we use transfer learning to fine-tune a pretrained Mask R-CNN model on the car parts segmentation dataset. This dataset allows us to train a model that can detect wheels as well as other car parts. This additional information can be further used in the car angle computations relative to the image.

The dataset contains annotated data of car parts to be used for object detection and semantic segmentation tasks: approximately 500 images of sedans, pickups, and sport utility vehicles (SUVs), taken in multiple views (front, back, and side views). Each image is annotated with 18 instance masks and bounding boxes representing the different parts of a car, like wheels, mirrors, lights, and front and back glass. We modified the base annotations of the wheels so that each wheel is considered an individual object, instead of considering all the visible wheels in the image as one object.
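
The training script itself is not reproduced in full in this post. As a minimal fine-tuning sketch, assuming the dataset is registered in COCO format under a hypothetical name and paths, and with illustrative hyperparameters, the core of such a script could look like the following:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the car parts dataset (name and paths are assumptions for illustration)
register_coco_instances("car_parts_train", {}, "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("car_parts_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 18  # the 18 annotated car part classes
cfg.SOLVER.IMS_PER_BATCH = 2          # illustrative values for the ~500-image dataset
cfg.SOLVER.MAX_ITER = 3000
cfg.OUTPUT_DIR = "/opt/ml/model"      # SageMaker uploads this directory to Amazon S3

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()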

We use Amazon Simple Storage Service (Amazon S3) to store the dataset used for training the Detectron model along with the trained model artifacts. Moreover, the Docker container that runs in the Lambda function is stored in Amazon Elastic Container Registry (Amazon ECR). The Docker container in the Lambda function is needed to include the required libraries and dependencies for running the code. We could alternatively use Lambda layers, but they're limited to an unzipped deployment package size quota of 250 MB, and a maximum of five layers can be added to a Lambda function.

Our solution is built on SageMaker: we extend prebuilt SageMaker Docker containers for PyTorch to run our custom PyTorch training code. Next, we use the SageMaker Python SDK to wrap the training image into a SageMaker PyTorch estimator, as shown in the following code snippets:

from sagemaker.estimator import Estimator

# Wrap the custom Detectron2 training image in a SageMaker estimator
d2_estimator = Estimator(
    image_uri=training_image_uri,
    role=role,
    sagemaker_session=sm_session,
    instance_count=1,
    instance_type=training_instance,
    output_path=f"s3://{session_bucket}/{prefix_model}",
    base_job_name="detectron2",
)

# Start the training job with the training and validation channels
d2_estimator.fit(
    {
        "training": training_channel,
        "validation": validation_channel,
    },
    wait=True,
)

Finally, we start the training job by calling the fit() function on the created PyTorch estimator. When the training is finished, the trained model artifact is stored in the session bucket in Amazon S3 to be used for the inference pipeline.

Deploy the model using SageMaker and inference pipelines

We also use SageMaker to host the inference endpoint that runs our custom Detectron model. The full infrastructure used to deploy our solution is provisioned using the AWS CDK. We can host our custom model through a SageMaker real-time endpoint by calling deploy on the PyTorch estimator. This is the second time we extend a prebuilt SageMaker PyTorch container to include PyTorch Detectron. We use it to run the inference script and host our trained PyTorch model as follows:

import sagemaker
from sagemaker.pytorch import PyTorchModel

# Package the trained Detectron2 artifact together with the custom inference script
model = PyTorchModel(
    name="d2-sku110k-model",
    model_data=d2_estimator.model_data,
    role=role,
    sagemaker_session=sm_session,
    entry_point="predict.py",
    source_dir="src",
    image_uri=serve_image_uri,
    framework_version="1.6.0",
)

# Host the model on a real-time endpoint backed by a GPU instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name="detectron-endpoint",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    wait=True,
)

Note that we used an ml.g4dn.xlarge GPU instance for deployment because it's the smallest GPU instance available and sufficient for this demo. Two components need to be configured in our inference script: model loading and model serving. The model_fn() function loads the trained model, which is included in the hosted Docker container and can also be found in Amazon S3, and returns a model object that can be used for model serving as follows:

from pathlib import Path

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

def model_fn(model_dir: str) -> DefaultPredictor:
    # Locate the trained weights (.pth file) in the model directory
    for p_file in Path(model_dir).iterdir():
        if p_file.suffix == ".pth":
            path_model = p_file

    # Build a Detectron2 predictor from the default config plus the trained weights
    cfg = get_cfg()
    cfg.MODEL.WEIGHTS = str(path_model)

    return DefaultPredictor(cfg)

The function predict_fn() performs the prediction and returns the result. Besides using our trained model, we use a pretrained version of the Mask R-CNN model trained on the COCO dataset to extract the main car in the image. This is an extra postprocessing step to deal with images where more than one car exists. See the following code:

from typing import Mapping

import numpy as np
from detectron2.engine import DefaultPredictor

def predict_fn(input_img: np.ndarray, predictor: DefaultPredictor) -> Mapping:
    # Isolate the main car with a pretrained COCO Mask R-CNN
    # (_get_pretraind_model and get_main_car_mask are helpers in the inference script)
    pretrained_predictor = _get_pretraind_model()
    car_mask = get_main_car_mask(pretrained_predictor, input_img)

    # Run the fine-tuned Detectron2 model to detect the individual car parts
    outputs = predictor(input_img)
    fmt_out = {
        "image_height": input_img.shape[0],
        "image_width": input_img.shape[1],
        "pred_boxes": outputs["instances"].pred_boxes.tensor.tolist(),
        "scores": outputs["instances"].scores.tolist(),
        "pred_classes": outputs["instances"].pred_classes.tolist(),
        "car_mask": car_mask.tolist()
    }
    return fmt_out

Similar to the Amazon Rekognition solution, the bounding boxes predicted for the wheel class are filtered from the detection outputs and supplied to the postprocessing module to assess the car position.

Finally, we also improved the postprocessing for the Detectron solution, which uses the segments of different car parts to infer the car position. For example, whenever a front bumper is detected but no back bumper, it is assumed that we have a front view of the car, and the corresponding angle is calculated.
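
The following hypothetical helper illustrates that rule-based idea. The part class names and rules are simplified for illustration and do not reproduce the repository's actual postprocessing code:

def infer_view(detected_parts: set) -> str:
    # Guess the car view from the set of detected part class names (simplified rules)
    if "front_bumper" in detected_parts and "back_bumper" not in detected_parts:
        return "front"
    if "back_bumper" in detected_parts and "front_bumper" not in detected_parts:
        return "back"
    return "side"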

Connect your solution to the web application

The steps to connect the model endpoints to Amplify are as follows:

  • Clone the application repository that the AWS CDK stack created, named car-angle-detection-website-repo. Make sure you are looking for it in the Region you used for deployment.
  • Copy the API Gateway endpoints for each of the deployed Lambda functions into the index.html file in the preceding repository (there are placeholders where the endpoint needs to be placed). The following code is an example of what this section of the .html file looks like:
<td align="center" colspan="2">
  <select id="endpoint">
    <option value="https://ey82aaj8ch.execute-api.eu-central-1.amazonaws.com/prod/">
      Amazon Rekognition</option>
    <option value="https://nhq6q88xjg.execute-api.eu-central-1.amazonaws.com/prod/">
      Amazon SageMaker Detectron</option>
  </select>
  <input class="btn" type="file" id="ImageBrowse" />
  <input class="btn btn-primary" type="submit" value="Upload">
</td>
  • Save the HTML file and push the code change to the remote main branch.

This will update the HTML file in the deployment. The application is now ready to use.

  • Navigate to the Amplify console and locate the project you created.

The application URL will be visible after the deployment is complete.

  • Navigate to the URL and have fun with the UI.

Conclusion

Congratulations! We have deployed a complete serverless architecture that uses Amazon Rekognition and also provides an option to bring your own custom model, with this example available on GitHub. If you don't have ML expertise on your team or enough custom data to train a model, you could select the option that uses Amazon Rekognition. If you want more control over your model, would like to customize it further, and have enough data, you can choose the SageMaker solution. If you have a team of data scientists, they might also want to enhance the models further and pick a more custom and flexible option. You can put the Lambda function and the API Gateway behind your web application using either of the two options. You can also use this approach for a different use case for which you might want to adapt the code.

The advantage of this serverless architecture is that the building blocks are completely exchangeable. The opportunities are almost limitless. So, get started today!

As always, AWS welcomes feedback. Please submit any comments or questions.


About the Authors

Michael Wallner is a Senior Consultant Data & AI with AWS Professional Services and is passionate about enabling customers on their journey to become data-driven and AWSome in the AWS Cloud. On top of that, he likes thinking big with customers to innovate and invent new ideas for them.

Aamna Najmi is a Data Scientist with AWS Professional Services. She is passionate about helping customers innovate with Big Data and Artificial Intelligence technologies to tap business value and insights from data. She has experience in working on data platform and AI/ML projects in the healthcare and life sciences vertical. In her spare time, she enjoys gardening and traveling to new places.

David Sauerwein is a Senior Data Scientist at AWS Professional Services, where he enables customers on their AI/ML journey on the AWS cloud. David focuses on digital twins, forecasting and quantum computation. He has a PhD in theoretical physics from the University of Innsbruck, Austria. He was also a doctoral and post-doctoral researcher at the Max-Planck-Institute for Quantum Optics in Germany. In his free time he loves to read, ski and spend time with his family.

Srikrishna Chaitanya Konduru is a Senior Data Scientist with AWS Professional Services. He supports customers in prototyping and operationalizing their ML applications on AWS. Srikrishna focuses on computer vision and NLP. He also leads ML platform design and use case identification initiatives for customers across diverse industry verticals. Srikrishna has an M.Sc in Biomedical Engineering from RWTH Aachen University, Germany, with a focus on medical imaging.

Ahmed Mansour is a Data Scientist at AWS Professional Services. He provides technical support to customers through their AI/ML journey on the AWS Cloud. Ahmed focuses on applications of NLP to the protein domain along with RL. He has a PhD in Engineering from the Technical University of Munich, Germany. In his free time he loves to go to the gym and play with his kids.

Read More