HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), a premier academic conference for computer vision.

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot, for help?

AI in the real world

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges. 

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it’s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.

The promise and challenge of interactive AI “copilots”

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In terms of AI development, there’s a saying that “data is king.” This challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots’ capabilities.

A new multimodal interactive dataset

As a first step in this direction, we are excited to share our paper, “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World,” presented at ICCV 2023. HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer’s first-person video feed in real time and offers verbal instruction.

Figure 1: HoloAssist features a two-person interactive assistive task-completion setting. A task performer wearing a mixed-reality headset completes a task while an instructor watches the first-person video feed and provides instructions; eight modalities are captured: RGB, depth, head pose, 3D hand pose, eye gaze, IMU, audio, and text transcription.

HoloAssist contains a large collection of data, comprising 166 hours of recordings involving 222 diverse participants. These participants form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we’ve captured seven different raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. Finally, the eighth modality is an augmentation with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.

Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.
Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity (ranging from 25 to 180); on the right, the total session length in minutes (ranging from 47 to 1,390). The 20 tasks cover objects such as a GoPro, Nintendo Switch, DSLR, portable printer, computer, Nespresso machine, standalone printer, big coffee machine, IKEA furniture (stool, utility cart, tray table, nightstand), NavVis laser scanner, ATV motorcycle, wheel belt, and circuit breaker.
Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.
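
To make the annotation structure concrete, here is a minimal, hypothetical sketch of how a single annotated action segment might be represented in Python; the field names are illustrative only and are not the dataset’s actual schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionSegment:
    # Hypothetical fields; consult the HoloAssist release for the real annotation format.
    start_time: float                       # segment start, in seconds
    end_time: float                         # segment end, in seconds
    verb: str                               # e.g., "insert"
    noun: str                               # e.g., "toner cartridge"
    is_mistake: bool                        # "mistake" vs. "correct" attribute
    intervention_type: Optional[str] = None # label for the instructor's spoken intervention, if any
    instructor_utterance: Optional[str] = None  # transcribed speech tied to this segment

# Example usage with made-up values.
segment = ActionSegment(
    start_time=12.4,
    end_time=17.9,
    verb="insert",
    noun="toner cartridge",
    is_mistake=True,
    intervention_type="correcting a mistake",
    instructor_utterance="Flip the cartridge around before sliding it in.",
)
print(segment.is_mistake)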

Towards proactive AI assistants

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive, able to provide precisely timed instructions grounded in the environment, in contrast with current “chat-based” AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representation.

Table 1: Comparison of nine related datasets and simulation platforms, covering each dataset’s setting, whether it is collaborative and interactive, whether it is instructional and procedural, and its hours of video. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets.

Finally, we evaluated the dataset’s performance on action classification and anticipation tasks, providing empirical results that shed light on the role of different modalities in various tasks. With this dataset, we introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.

Looking forward

This work represents an initial step in broader research that explores how intelligent agents can collaborate with humans in real-world tasks. We’re excited to share this work and our dataset with the community and anticipate numerous future directions, such as annotating object poses, investigating object-centric models of affordance and manipulation in AI assistance, and AI-assisted planning and state tracking, among others. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research endeavors focused on building powerful AI assistants for real-world everyday tasks. You can access the HoloAssist dataset and code on GitHub.

Contributors

Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Marc Pollefeys

The post HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world appeared first on Microsoft Research.

Read More

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

photos of PhD students Jennifer Scurrell and Alejandro Cuevas along with Senior Researcher Dr. Madeleine Daepp for the Microsoft Research podcast

Every year, interns from academic institutions around the world apply and grow their knowledge as members of the research community at Microsoft. In this Microsoft Research Podcast series, these students join their internship supervisors to share their experience working alongside some of the leading researchers in their respective fields. 

In this episode, PhD students Jennifer Scurrell and Alejandro Cuevas talk to Senior Researcher Dr. Madeleine Daepp. They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers they admire over coffee to the teamwork they say helped make it possible for them to succeed in the fast-paced environment of industry, and the impact they hope to have with their work.

The post Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas appeared first on Microsoft Research.

Read More

Brains of the Operation: Atlas Meditech Maps Future of Surgery With AI, Digital Twins

Just as athletes train for a game or actors rehearse for a performance, surgeons prepare ahead of an operation.

Now, Atlas Meditech is letting brain surgeons experience a new level of realism in their pre-surgery preparation with AI and physically accurate simulations.

Atlas Meditech, a brain-surgery intelligence platform, is adopting tools — including the MONAI medical imaging framework and NVIDIA Omniverse 3D development platform — to build AI-powered decision support and high-fidelity surgery rehearsal platforms. Its mission: improving surgical outcomes and patient safety.

“The Atlas provides a collection of multimedia tools for brain surgeons, allowing them to mentally rehearse an operation the night before a real surgery,” said Dr. Aaron Cohen-Gadol, founder of Atlas Meditech and its nonprofit counterpart, Neurosurgical Atlas. “With accelerated computing and digital twins, we want to transform this mental rehearsal into a highly realistic rehearsal in simulation.”

Neurosurgical Atlas offers case studies, surgical videos and 3D models of the brain to more than a million online users. Dr. Cohen-Gadol, also a professor of neurological surgery at Indiana University School of Medicine, estimates that more than 90% of brain surgery training programs in the U.S. — as well as tens of thousands of neurosurgeons in other countries — use the Atlas as a key resource during residency and early in their surgery careers.

Atlas Meditech’s Pathfinder software is integrating AI algorithms that can suggest safe surgical pathways for experts to navigate through the brain to reach a lesion.

And with NVIDIA Omniverse, a platform for connecting and building custom 3D pipelines and metaverse applications, the team aims to create custom virtual representations of individual patients’ brains for surgery rehearsal.

Custom 3D Models of Human Brains

A key benefit of Atlas Meditech’s advanced simulations — either onscreen or in immersive virtual reality — is the ability to customize the simulations, so that surgeons can practice on a virtual brain that matches the patient’s brain in size, shape and lesion position.

“Every patient’s anatomy is a little different,” said Dr. Cohen-Gadol. “What we can do now with physics and advanced graphics is create a patient-specific model of the brain and work with it to see and virtually operate on a tumor. The accuracy of the physical properties helps to recreate the experience we have in the real world during an operation.”

To create digital twins of patients’ brains, the Atlas Pathfinder tool has adopted MONAI Label, which can support radiologists by automatically annotating MRI and CT scans to segment normal structures and tumors.

“MONAI Label is the gateway to any healthcare project because it provides us with the opportunity to segment critical structures and protect them,” said Dr. Cohen-Gadol. “For the Atlas, we’re training MONAI Label to act as the eyes of the surgeon, highlighting what is a normal vessel and what’s a tumor in an individual patient’s scan.”

With a segmented view of a patient’s brain, Atlas Pathfinder can adjust its 3D brain model to morph to the patient’s specific anatomy, capturing how the tumor deforms the normal structure of their brain tissue.
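
As a rough illustration of the kind of segmentation step described above, here is a minimal sketch using the open-source MONAI library. The file path, network configuration, and label convention are assumptions for demonstration only; this is not Atlas Pathfinder’s actual pipeline, and a real system would load trained weights rather than a randomly initialized network.

import torch
from monai.networks.nets import UNet
from monai.inferers import sliding_window_inference
from monai.transforms import Compose, LoadImage, EnsureChannelFirst, ScaleIntensity

# Load and normalize a single scan volume (path is a placeholder).
preprocess = Compose([
    LoadImage(image_only=True),
    EnsureChannelFirst(),
    ScaleIntensity(),
])
volume = preprocess("patient_scan.nii.gz").unsqueeze(0)  # add a batch dimension

# A 3D U-Net with random weights; a real pipeline would load a trained checkpoint
# for brain structures and tumor tissue.
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,          # assumption: background vs. tumor
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
)
model.eval()

with torch.no_grad():
    logits = sliding_window_inference(volume, roi_size=(96, 96, 96),
                                      sw_batch_size=1, predictor=model)
segmentation = logits.argmax(dim=1)  # per-voxel class labels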

Based on the visualization — which radiologists and surgeons can modify to improve the precision — Atlas Pathfinder suggests the safest surgical approaches to access and remove a tumor without harming other parts of the brain. Each approach links out to the Atlas website, which includes a written tutorial of the operative plan.

“AI-powered decision support can make a big difference in navigating a highly complex 3D structure where every millimeter is critical,” Dr. Cohen-Gadol said.

Realistic Rehearsal Environments for Practicing Surgeons 

Atlas Meditech is using NVIDIA Omniverse to develop a virtual operating room that can immerse surgeons into a realistic environment to rehearse upcoming procedures. In the simulation, surgeons can modify how the patient and equipment are positioned.

Using a VR headset, surgeons will be able to work within this virtual environment, going step by step through the procedure and receiving feedback on how closely they are adhering to the target pathway to reach the tumor. AI algorithms can be used to predict how brain tissue would shift as a surgeon uses medical instruments during the operation, and apply that estimated shift to the simulated brain.

“The power to enable surgeons to enter a virtual, 3D space, cut a piece of the skull and rehearse the operation with a simulated brain that has very similar physical properties to the patient would be tremendous,” said Dr. Cohen-Gadol.

To better simulate the brain’s physical properties, the team adopted NVIDIA PhysX, an advanced real-time physics simulation engine that’s part of NVIDIA Omniverse. Using haptic devices, they were able to experiment with adding haptic feedback to the virtual environment, mimicking the feeling of working with brain tissue.

Envisioning AI, Robotics in the Future of Surgery Training

Dr. Cohen-Gadol believes that in the coming years AI models will be able to further enhance surgery by providing additional insights during a procedure. Examples include warning surgeons about critical brain structures that are adjacent to the area they’re working in, tracking medical instruments during surgery, and providing a guide to next steps in the surgery.

Atlas Meditech plans to explore the NVIDIA Holoscan platform for streaming AI applications to power these real-time, intraoperative insights. Applying AI analysis to a surgeon’s actions during a procedure can provide the surgeon with useful feedback to improve their technique.

In addition to being used for surgeons to rehearse operations, Dr. Cohen-Gadol says that digital twins of the brain and of the operating room could help train intelligent medical instruments such as microscope robots using Isaac Sim, a robotics simulation application developed on Omniverse.

View Dr. Cohen-Gadol’s presentation at NVIDIA GTC.

Subscribe to NVIDIA healthcare news.

Read More

Fall in Line for October With Nearly 60 New Games, Including Latest Game Pass Titles to Join the Cloud

October brings more than falling leaves and pumpkin spice lattes for GeForce NOW members. Get ready for nearly 60 new games to stream, including Forza Motorsport and 16 more PC Game Pass titles.

Assassin’s Creed Mirage leads 29 new games hitting the GeForce NOW library this week. In addition, catch a challenge to earn in-game rewards for World of Warships players.

Leap Into the Cloud

Assassin's Creed Mirage on GeForce NOW
Nothing is true. Everything is permitted … in the cloud.

It’s not an illusion — Ubisoft’s Assassin’s Creed Mirage launches in the cloud this week. Mirage was created as an homage to the first Assassin’s Creed games and pays tribute to the series’ well-loved roots.

Play as Basim Ibn Is’haq, a 17-year-old street thief who joins the Hidden Ones, a powerful proto-Assassin order, and learns to become a master assassin. Stalk the streets of a bustling and historically accurate ninth-century Baghdad — the perfect urban setting to seamlessly parkour across rooftops, scale tall towers and flee guards while uncovering a conspiracy that threatens the city and Basim’s future destiny.

Take a Leap of Faith into a GeForce NOW Ultimate membership and explore this new open world at up to 4K resolution and 120 frames per second. Ultimate members get exclusive access to GeForce RTX 4080 servers in the cloud, making it the easiest upgrade around.

No Tricks, Only Treats

Don’t be spooked — GeForce NOW has plenty of treats for members this month. More PC Game Pass games are coming soon to the cloud, including Forza Motorsport from Turn 10 Studios and Xbox Game Studios and the Dishonored series from Arkane and Bethesda.

Catch some action (with a little stealth, magic and combat mixed in) with the Dishonored franchise. Dive into a struggle of power and revenge that revolves around the assassination of the Empress of the Isles. Members can follow the whole story starting with the original Dishonored game, up through the latest entry, Dishonored: Death of an Outsider, when the series launches in the cloud this month.

Jump into all the action with an Ultimate or Priority account today, for higher performance and faster access to stream over 1,700 games.

Check out the spooktacular list for October:

  • Star Trek: Infinite (New release on Steam, Oct. 12)
  • Lords of the Fallen (New release on Steam and Epic Games Store, Oct. 13)
  • Wizard with a Gun (New release on Steam, Oct. 17)
  • Alaskan Road Truckers (New release on Steam and Epic Games Store, Oct. 18)
  • Hellboy: Web of Wyrd (New release on Steam, Oct. 18)
  • HOT WHEELS UNLEASHED 2 – Turbocharged (New release on Steam, Oct. 19)
  • Laika Aged Through Blood (New release on Steam, Oct. 19)
  • Cities: Skylines II (New release on Steam, Xbox and available on PC Game Pass, Oct. 24)
  • Ripout (New release on Steam, Oct. 24)
  • War Hospital (New release on Steam, Oct. 26)
  • Alan Wake 2 (New release on Epic Games Store, Oct. 26)
  • Headbangers: Rhythm Royale (New release on Steam, Xbox and available on PC Game Pass, Oct. 31)
  • Jusant (New release on Steam, Xbox and available on PC Game Pass, Oct. 31)
  • Bad North (Xbox, available on Microsoft Store)
  • Daymare 1994: Sandcastle (Steam)
  • For The King (Xbox, available on Microsoft Store)
  • Forza Motorsport (Steam, Xbox and available on PC Game Pass)
  • Heretic’s Fork (Steam)
  • Moonbreaker (Steam)
  • Metro Simulator 2 (Steam)
  • Narita Boy (Xbox, available on Microsoft Store)
  • Sifu (Xbox, available on Microsoft Store)
  • StalCraft (Steam)
  • Star Renegades (Xbox, available on Microsoft Store)
  • Streets of Rogue (Xbox, available on Microsoft Store)
  • Supraland (Xbox, available on Microsoft Store)
  • The Surge (Xbox, available on Microsoft Store)
  • Tiny Football (Steam)
  • Vampire Survivors (Steam and Xbox, available on PC Game Pass)
  • VEILED EXPERTS (Steam)
  • Yes, Your Grace (Xbox, available on Microsoft Store)

Come Sail Away

A new challenge awaits on the open sea.

World of Warships is launching a new in-game event this week exclusive to GeForce NOW members. From Oct. 5-9, those streaming the game on GeForce NOW will be prompted to complete a special in-game challenge chain, only available from the cloud, to earn economic reward containers and one-day GeForce NOW Priority trials. Aspiring admirals can learn more about these challenges on the World of Warships blog and social channels.

Those new to World of Warships can activate the invite code “GEFORCENOW” in the game starting today to claim exclusive rewards, including a seven-day Premium World of Warships account, 300 doubloons, credits and economic boosters. Once 15 battles are completed, players can choose one of the following tech tree ships to speed up game progress: Japanese destroyer Isokaze, American cruiser Phoenix, German battleship Moltke or British aircraft carrier Hermes.

Age Of Empires II on GeForce NOW
Conquer the cloud.

The leaves may be falling, but new games are always coming to the cloud. Dive into the action now with 29 new games this week:

  • Battle Shapers (New release on Steam, Oct. 3)
  • Disgaea 7: Vows of the Virtueless (New release on Steam, Oct. 3)
  • Station to Station (New release on Steam, Oct. 3)
  • The Lamplighter’s League (New release on Steam, Xbox and available on PC Game Pass, Oct. 3)
  • Thief Simulator 2 (New release on Steam, Oct. 4)
  • Heads Will Roll: Reforged (New release on Steam, Oct. 4)
  • Assassin’s Creed Mirage (New release on Ubisoft, Oct. 5)
  • Age of Empires II: Definitive Edition (Xbox, available on PC Game Pass)
  • Arcade Paradise (Xbox, available on PC Game Pass)
  • The Ascent (Xbox, available on Microsoft Store)
  • Citizen Sleeper (Xbox, available on PC Game Pass)
  • Dicey Dungeons (Xbox, available on PC Game Pass)
  • Godlike Burger (Epic Games Store)
  • Greedfall (Xbox, available on Microsoft Store)
  • Hypnospace Outlaw (Xbox, available on PC Game Pass)
  • Killer Frequency (Xbox, available on Microsoft Store)
  • Lonely Mountains: Downhill (Xbox, available on PC Game Pass)
  • Metro 2033 Redux (Xbox, available on Microsoft Store)
  • Metro: Last Light Redux (Xbox, available on Microsoft Store)
  • MudRunner (Xbox, available on Microsoft Store)
  • Potion Craft: Alchemist Simulator (Xbox, available on PC Game Pass)
  • Shadow Gambit: The Cursed Crew (Epic Games Store)
  • Slayers X: Terminal Aftermath: Vengance of the Slayer (Xbox, available on PC Game Pass)
  • Soccer Story (Xbox, available on PC Game Pass)
  • SOMA (Xbox, available on PC Game Pass)
  • Space Hulk: Tactics (Xbox, available on Microsoft Store)
  • SpiderHeck (Xbox, available on PC Game Pass)
  • SUPERHOT: MIND CONTROL DELETE (Xbox, available on Microsoft Store)
  • Surviving Mars (Xbox, available on Microsoft Store)

Surprises in September

On top of the 24 games announced in September, an additional 45 joined the cloud last month:

  • Void Crew (New release on Steam, Sept. 7)
  • Tavernacle! (New release on Steam, Sept. 11)
  • Gunbrella (New release on Steam, Sept. 13)
  • HumanitZ (New release on Steam, Sept. 18)
  • These Doomed Isles (New release on Steam, Sept. 25)
  • Overpass 2 (New release on Steam, Sept. 28)
  • 911 Operator (Epic Games Store)
  • A Plague Tale: Requiem (Xbox)
  • Amnesia: The Bunker (Xbox, available on PC Game Pass)
  • Airborne Kingdom (Epic Games Store)
  • Atomic Heart (Xbox)
  • BlazBlue: Cross Tag Battle (Xbox, available on PC Game Pass)
  • Bramble: The Mountain King (Xbox, available on PC Game Pass)
  • Call of the Wild: The Angler (Xbox)
  • Chained Echoes (Xbox, available on PC Game Pass)
  • Danganronpa V3: Killing Harmony (Xbox)
  • Descenders (Xbox, available on PC Game Pass)
  • Doom Eternal (Xbox, available on PC Game Pass)
  • Dordogne (Xbox, available on PC Game Pass)
  • Eastern Exorcist (Xbox, available on PC Game Pass)
  • Figment 2: Creed Valley (Xbox, available on PC Game Pass)
  • Hardspace: Shipbreaker (Xbox)
  • Insurgency: Sandstorm (Xbox)
  • I Am Fish (Xbox)
  • Last Call BBS (Xbox)
  • The Legend of Tianding (Xbox, available on PC Game Pass)
  • The Matchless Kungfu (Steam)
  • Mechwarrior 5: Mercenaries (Xbox, available on PC Game Pass)
  • Monster Sanctuary (Xbox)
  • Opus Magnum (Xbox)
  • Pizza Possum (New release on Steam, Sept. 28)
  • A Plague Tale: Innocence (Xbox)
  • Quake II (Steam, Epic Games Store and Xbox, available on PC Game Pass)
  • Remnant II (Epic Games Store)
  • Road 96 (Xbox)
  • Shadowrun: Hong Kong – Extended Edition (Xbox)
  • SnowRunner (Xbox)
  • Soulstice (New release on Epic Games Store, free on Sept. 28)
  • Space Hulk: Deathwing – Enhanced Edition (Xbox)
  • Spacelines from the Far Out (Xbox)
  • Superhot (Xbox)
  • Totally Reliable Delivery Service (Xbox, available on PC Game Pass)
  • Vampyr (Xbox)
  • Warhammer 40,000: Battlesector (Xbox, available on PC Game Pass)
  • Yooka-Laylee and the Impossible Lair (Xbox)

Halo Infinite and Kingdoms Reborn didn’t make it in September. Stay tuned to GFN Thursday for more updates.

What are you planning to play this weekend? Let us know on Twitter or in the comments below.

Read More

A Mine-Blowing Breakthrough: Open-Ended AI Agent Voyager Autonomously Plays ‘Minecraft’

For NVIDIA Senior AI Scientist Jim Fan, the video game Minecraft served as the “perfect primordial soup” for his research on open-ended AI agents.

In the latest AI Podcast episode, host Noah Kravitz spoke with Fan on using large language models to create AI agents — specifically to create Voyager, an AI bot built with GPT-4 that can autonomously play Minecraft.

AI agents are models that “can proactively take actions and then perceive the world, see the consequences of its actions, and then improve itself,” Fan said. Many current AI agents are programmed to achieve specific objectives, such as beating a game as quickly as possible or answering a question. They can work autonomously toward a particular output but lack a broader decision-making agency.

Fan wondered if it was possible to have a “truly open-ended agent that can be prompted by arbitrary natural language to do open-ended, even creative things.”

But he needed a flexible playground in which to test that possibility.

“And that’s why we found Minecraft to be almost a perfect primordial soup for open-ended agents to emerge, because it sets up the environment so well,” he said. Minecraft at its core, after all, doesn’t set a specific key objective for players other than to survive and freely explore the open world.

That became the springboard for Fan’s project, MineDojo, which eventually led to the creation of the AI bot Voyager.

“Voyager leverages the power of GPT-4 to write code in JavaScript to execute in the game,” Fan explained. “GPT-4 then looks at the output, and if there’s an error from JavaScript or some feedback from the environment, GPT-4 does a self-reflection and tries to debug the code.”

The bot learns from its mistakes and stores the correctly implemented programs in a skill library for future use, allowing for “lifelong learning.”

In-game, Voyager can autonomously explore for hours, adapting its decisions based on its environment and developing skills to combat monsters and find food when needed.

“We see all these behaviors come from the Voyager setup, the skill library and also the coding mechanism,” Fan explained. “We did not preprogram any of these behaviors.”
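
As a rough, hypothetical illustration of that loop (not NVIDIA’s actual implementation), the sketch below shows the write-execute-reflect pattern with a skill library; llm_write_code, run_in_game, and llm_reflect are placeholder stand-ins for calls to GPT-4 and to the game environment.

# Hypothetical sketch of a Voyager-style agent loop; helpers are placeholders,
# not a real game or LLM API.
skill_library = {}  # task name -> verified program

def llm_write_code(task, feedback=None):
    # Placeholder: ask the LLM to (re)write a JavaScript program for the task.
    return f"// program for: {task} (feedback: {feedback})"

def run_in_game(program):
    # Placeholder: execute the program in the environment and report the outcome.
    return {"success": True, "error": None, "observation": "task completed"}

def llm_reflect(program, result):
    # Placeholder: ask the LLM to critique the failed program.
    return f"error was: {result['error']}"

def attempt_task(task, max_iterations=4):
    feedback = None
    for _ in range(max_iterations):
        program = llm_write_code(task, feedback)
        result = run_in_game(program)
        if result["success"]:
            skill_library[task] = program        # store the verified skill for reuse
            return program
        feedback = llm_reflect(program, result)  # self-reflection on the failure
    return None

attempt_task("craft a wooden pickaxe")
print(skill_library)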

He then spoke more generally about the rise and trajectory of LLMs. He foresees strong applications in software, gaming and robotics and increasingly pressing conversations surrounding AI safety.

Fan encourages those looking to get involved and work with LLMs to “just do something,” whether that means using online resources or experimenting with beginner-friendly, CPU-based AI models.

You Might Also Like

Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.

Overjet’s Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.

Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch talks about tackling the challenges of the immune system with a machine learning and data science mindset.

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.

 

Read More

Simplify medical image classification using Amazon SageMaker Canvas

Analyzing medical images plays a crucial role in diagnosing and treating diseases. The ability to automate this process using machine learning (ML) techniques allows healthcare professionals to more quickly diagnose certain cancers, coronary diseases, and ophthalmologic conditions. However, one of the key challenges faced by clinicians and researchers in this field is the time-consuming and complex nature of building ML models for image classification. Traditional methods require coding expertise and extensive knowledge of ML algorithms, which can be a barrier for many healthcare professionals.

To address this gap, we used Amazon SageMaker Canvas, a visual tool that allows medical clinicians to build and deploy ML models without coding or specialized knowledge. This user-friendly approach eliminates the steep learning curve associated with ML, which frees up clinicians to focus on their patients.

Amazon SageMaker Canvas provides a drag-and-drop interface for creating ML models. Clinicians can select the data they want to use, specify the desired output, and then watch as the service automatically builds and trains the model. Once trained, the model can be used to generate predictions.

This approach is ideal for medical clinicians who want to use ML to improve their diagnosis and treatment decisions. With Amazon SageMaker Canvas, they can use the power of ML to help their patients, without needing to be an ML expert.

Medical image classification directly impacts patient outcomes and healthcare efficiency. Timely and accurate classification of medical images allows for early detection of diseases that aides in effective treatment planning and monitoring. Moreover, the democratization of ML through accessible interfaces like Amazon SageMaker Canvas, enables a broader range of healthcare professionals, including those without extensive technical backgrounds, to contribute to the field of medical image analysis. This inclusive approach fosters collaboration and knowledge sharing and ultimately leads to advancements in healthcare research and improved patient care.

In this post, we’ll explore the capabilities of Amazon SageMaker Canvas in classifying medical images, discuss its benefits, and highlight real-world use cases that demonstrate its impact on medical diagnostics.

Use case

Skin cancer is a serious and potentially deadly disease, and the earlier it is detected, the better the chance for successful treatment. Statistically, skin cancer (for example, basal and squamous cell carcinomas) is one of the most common cancer types and leads to hundreds of thousands of deaths worldwide each year. It manifests itself through the abnormal growth of skin cells.

However, early diagnosis drastically increases the chances of recovery. Moreover, it may render surgical, radiographic, or chemotherapeutic therapies unnecessary or lessen their overall usage, helping to reduce healthcare costs.

The process of diagnosing skin cancer starts with a procedure called a dermoscopy[1], which inspects the general shape, size, and color characteristics of skin lesions. Suspected lesions then undergo further sampling and histological tests for confirmation of the cancer cell type. Doctors use multiple methods to detect skin cancer, starting with visual detection. The American Center for the Study of Dermatology developed a guide for the possible shape of melanoma, which is called ABCD (asymmetry, border, color, diameter) and is used by doctors for initial screening of the disease. If a suspected skin lesion is found, then the doctor takes a biopsy of the visible lesion on the skin and examines it microscopically for a benign or malignant diagnosis and the type of skin cancer. Computer vision models can play a valuable role in helping to identify suspicious moles or lesions, which enables earlier and more accurate diagnosis.

Creating a cancer detection model is a multi-step process, as outlined below:

  1. Gather a large dataset of images from healthy skin and skin with various types of cancerous or precancerous lesions. This dataset needs to be carefully curated to ensure accuracy and consistency.
  2. Use computer vision techniques to preprocess the images and extract relevant features to differentiate between healthy and cancerous skin.
  3. Train an ML model on the preprocessed images, using a supervised learning approach to teach the model to distinguish between different skin types.
  4. Evaluate the performance of the model using a variety of metrics, such as precision and recall, to ensure that it accurately identifies cancerous skin and minimizes false positives (see the short worked example after this list).
  5. Integrate the model into a user-friendly tool that could be used by dermatologists and other healthcare professionals to aid in the detection and diagnosis of skin cancer.
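
As a brief, self-contained illustration of the metrics mentioned in step 4, the following uses scikit-learn on made-up labels; it is not part of the Canvas workflow itself, which computes these metrics for you.

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up ground-truth and predicted labels (1 = cancerous, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Precision: of the lesions flagged as cancerous, how many truly are.
# Recall: of the truly cancerous lesions, how many were flagged.
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 0.75
print("f1:       ", f1_score(y_true, y_pred))         # 0.75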

Overall, the process of developing a skin cancer detection model from scratch typically requires significant resources and expertise. This is where Amazon SageMaker Canvas can help simplify the time and effort for steps 2 – 5.

Solution overview

To demonstrate the creation of a skin cancer computer vision model without writing any code, we use a dermatoscopy skin cancer image dataset published by Harvard Dataverse. We use the dataset, which can be found at HAM10000 and consists of 10,015 dermatoscopic images, to build a skin cancer classification model that predicts skin cancer classes. A few key points about the dataset:

  • The dataset serves as a training set for academic ML purposes.
  • It includes a representative collection of all important diagnostic categories in the realm of pigmented lesions.
  • A few categories in the dataset are: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)
  • More than 50% of the lesions in the dataset are confirmed through histopathology (histo).
  • The ground truth for the rest of the cases is determined through follow-up examination (follow_up), expert consensus (consensus), or confirmation by in vivo confocal microscopy (confocal).
  • The dataset includes lesions with multiple images, which can be tracked using the lesion_id column within the HAM10000_metadata file.

We showcase how to simplify image classification for multiple skin cancer categories without writing any code using Amazon SageMaker Canvas. Given an image of a skin lesion, SageMaker Canvas image classification automatically classifies an image into benign or possible cancer.
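
Before moving on, here is a minimal sketch (assuming the standard HAM10000_metadata.csv columns image_id and dx, and placeholder local paths) of how the downloaded images could be sorted into the per-class folders used later when uploading the training data to Amazon S3:

import shutil
from pathlib import Path
import pandas as pd

# Placeholder paths; adjust to wherever the dataset was downloaded.
metadata = pd.read_csv("HAM10000_metadata.csv")
source_dir = Path("HAM10000_images")
target_dir = Path("training-data")

# Copy up to 100 images per diagnosis class (akiec, bcc, bkl, df, mel, nv, vasc)
# into a folder named after the class, matching the S3 layout described in the walkthrough.
for dx, group in metadata.groupby("dx"):
    class_dir = target_dir / dx
    class_dir.mkdir(parents=True, exist_ok=True)
    for image_id in group["image_id"].head(100):
        shutil.copy(source_dir / f"{image_id}.jpg", class_dir / f"{image_id}.jpg")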

Prerequisites

  • Access to an AWS account with permissions to create the resources described in the steps section.
  • An AWS Identity and Access Management (AWS IAM) user with full permissions to use Amazon SageMaker.

Walkthrough

  1. Set-up SageMaker domain
    1. Create an Amazon SageMaker domain using steps outlined here.
    2. Download the HAM10000 dataset.
  2. Set-up datasets
    1. Create an Amazon Simple Storage Service (Amazon S3) bucket with a unique name, which is image-classification-<ACCOUNT_ID> where ACCOUNT_ID is your unique AWS AccountNumber.

      Creating bucket

      Figure 1 Creating bucket

    2. In this bucket create two folders: training-data and test-data.

      Creating folders

      Figure 2 Create folders

    3. Under training-data, create seven folders for each of the skin cancer categories identified in the dataset: akiec, bcc, bkl, df, mel, nv, and vasc.

      Folder View

      Figure 3 Folder View

    4. The dataset includes lesions with multiple images, which can be tracked using the lesion_id column within the HAM10000_metadata file. Using the lesion_id column, copy the corresponding images into the appropriate class folder (for example, you may start with 100 images for each classification).

      List Objects to Import (Sample Images)

      Figure 4 Listing Objects to import (Sample Images)

  3. Use Amazon SageMaker Canvas
    1. Go to the Amazon SageMaker service in the console and select Canvas from the list. Once you are on the Canvas page, select the Open Canvas button.

      Navigate to SageMaker Canvas

      Figure 5 Navigate to Canvas

    2. Once you are on the Canvas page, select My models and then choose New Model on the right of your screen.

      Model Creation

      Figure 6 Creation of Model

    3. A new pop-up window opens, where we enter image_classify as the model’s name and select Image analysis under Problem type.
  4. Import the dataset
    1. On the next page, select Create dataset, name the dataset image_classify in the pop-up box, and select the Create button.

      Dataset creation

      Figure 7 Creating dataset

    2. On the next page, change the Data Source to Amazon S3. You can also directly upload the images (i.e., Local upload).

      Import dataset from S3 buckets

      Figure 8 Import Dataset from S3 buckets

    3. When you select Amazon S3, you’ll get the list of buckets present in your account. Select the parent bucket that holds the dataset in subfolders (e.g., image-classify-2023) and select the Import data button. This allows Amazon SageMaker Canvas to quickly label the images based on the folder names.
    4. Once the dataset is successfully imported, you’ll see the value in the Status column change from Processing to Ready.
    5. Now select your dataset by choosing Select dataset at the bottom of your page.
  5. Build your model
    1. On the Build page, you should see your data imported and labelled as per the folder name in Amazon S3.

      Labelling of Amazon S3 data

      Figure 9 Labelling of Amazon S3 data

    2. Select the Quick build button (i.e., the red-highlighted content in the following image) and you’ll see two options for building the model: Quick build and Standard build. As the name suggests, the Quick build option favors speed over accuracy and takes around 15 to 30 minutes to build the model. The Standard build prioritizes accuracy over speed, with model building taking from 45 minutes to 4 hours to complete. Standard build runs experiments using different combinations of hyperparameters, generates many models in the backend (using SageMaker Autopilot functionality), and then picks the best model.
    3. Select Standard build to start building the model. It takes around 2–5 hours to complete.

      Standard build

      Figure 10 Doing Standard build

    4. Once model build is complete, you can see an estimated accuracy as shown in Figure 11.

      Model Prediction

      Figure 11 Model prediction

    5. If you select the Scoring tab, it should provide you insights into the model accuracy. You can also select the Advanced metrics button on the Scoring tab to view the precision, recall, and F1 score (a balanced measure of accuracy that takes class balance into account).
    6. The advanced metrics that Amazon SageMaker Canvas shows you depend on whether your model performs numeric, categorical, image, text, or time series forecasting predictions on your data. In this case, we believe recall is more important than precision, because missing a cancer is far more dangerous than a false positive. Categorical prediction, such as 2-category or 3-category prediction, refers to the mathematical concept of classification. The advanced metric recall is the fraction of true positives (TP) out of all actual positives (TP + false negatives). It measures the proportion of positive instances that were correctly predicted as positive by the model. Refer to A deep dive into Amazon SageMaker Canvas advanced metrics for more detail on the advanced metrics.
      Advanced metrics

      Figure 12 Advanced metrics

      This completes the model creation step in Amazon SageMaker Canvas.

  6. Test your model
    1. You can now choose the Predict button, which takes you to the Predict page, where you can upload your own images through Single prediction or Batch prediction. Choose the option you prefer and select Import to upload your image and test the model.

      Test your images

      Figure 13 Test your own images

    2. Let’s start with a single image prediction. Make sure you are on the Single prediction tab and choose Import image. This takes you to a dialog box where you can choose to upload your image from Amazon S3 or do a Local upload. In our case, we select Amazon S3, browse to the directory that holds the test images, and select any image. Then select Import data.

      Single image prediction

      Figure 14 Single Image Prediction

    3. Once selected, you should see the screen say Generating prediction results. You should have your results in a few minutes, as shown below.
    4. Now let’s try a batch prediction. Select Batch prediction under Run predictions, select the Import new dataset button, name it BatchPrediction, and choose the Create button.

      Single Image prediction results

      Figure 15 Single image prediction results

    5. On the next window, make sure you have selected Amazon S3 upload and browse to the directory where we have our test set and select the Import data button.

      Batch image prediction

      Figure 16 Batch Image Prediction

    6. Once the images are in Ready status, select the radio button for the created dataset and choose Generate predictions. You should see the status of the batch prediction change to Generating predictions. Wait a few minutes for the results.
    7. Once the status is Ready, choose the dataset name, which takes you to a page showing the detailed predictions for all our images.

      Batch prediction results

      Figure 17 Batch image prediction results

    8. Another important feature of batch prediction is the ability to verify the results and download the predictions as a .zip or .csv file for further use or sharing.

      Download prediction

      Figure 18 Download prediction

With this, you have successfully created a model, trained it, and tested its predictions with Amazon SageMaker Canvas.

Cleaning up

Choose Log out in the left navigation pane to log out of the Amazon SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources.

Citation

[1]Fraiwan M, Faouri E. On the Automatic Detection and Classification of Skin Cancer Using Deep Transfer Learning. Sensors (Basel). 2022 Jun 30;22(13):4963. doi: 10.3390/s22134963. PMID: 35808463; PMCID: PMC9269808.

Conclusion

In this post, we showed you how medical image analysis using ML techniques can expedite the diagnosis of skin cancer, and how this approach applies to diagnosing other diseases. However, building ML models for image classification is often complex and time-consuming, requiring coding expertise and ML knowledge. Amazon SageMaker Canvas addresses this challenge by providing a visual interface that eliminates the need for coding or specialized ML skills. This empowers healthcare professionals to use ML without a steep learning curve, allowing them to focus on patient care.

The traditional process of developing a cancer detection model is cumbersome and time-consuming. It involves gathering a curated dataset, preprocessing images, training an ML model, evaluating its performance, and integrating it into a user-friendly tool for healthcare professionals. Amazon SageMaker Canvas simplifies the steps from preprocessing to integration, which reduces the time and effort required to build a skin cancer detection model.

In this post, we delved into the powerful capabilities of Amazon SageMaker Canvas in classifying medical images, shedding light on its benefits and presenting real-world use cases that showcase its profound impact on medical diagnostics. One such compelling use case we explored was skin cancer detection and how early diagnosis often significantly enhances treatment outcomes and reduces healthcare costs.

It is important to acknowledge that the accuracy of the model can vary depending on factors such as the size of the training dataset and the specific type of model employed. These variables play a role in determining the performance and reliability of the classification results.

Amazon SageMaker Canvas can serve as an invaluable tool that assists healthcare professionals in diagnosing diseases with greater accuracy and efficiency. However, it is vital to note that it isn’t intended to replace the expertise and judgment of healthcare professionals. Rather, it empowers them by augmenting their capabilities and enabling more precise and expedient diagnoses. The human element remains essential in the decision-making process, and the collaboration between healthcare professionals and artificial intelligence (AI) tools, including Amazon SageMaker Canvas, is pivotal in providing optimal patient care.


About the authors

 Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Jake Wen is a Solutions Architect at AWS, driven by a passion for Machine Learning, Natural Language Processing, and Deep Learning. He assists Enterprise customers in achieving modernization and scalable deployment in the Cloud. Beyond the tech world, Jake finds delight in skateboarding, hiking, and piloting air drones.

Sonu Kumar Singh is an AWS Solutions Architect, with a specialization in analytics domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact. At AWS his intention is to help customers extract value out of AWS’s 200+ cloud services and empower them in their cloud journey.

Dariush Azimi is a Solution Architect at AWS, with specialization in Machine Learning, Natural Language Processing (NLP), and microservices architecture with Kubernetes. His mission is to empower organizations to harness the full potential of their data through comprehensive end-to-end solutions encompassing data storage, accessibility, analysis, and predictive capabilities.

Read More

Create an HCLS document summarization application with Falcon using Amazon SageMaker JumpStart

Healthcare and life sciences (HCLS) customers are adopting generative AI as a tool to get more from their data. Use cases include document summarization to help readers focus on key points of a document and transforming unstructured text into standardized formats to highlight important attributes. With unique data formats and strict regulatory requirements, customers are looking for choices to select the most performant and cost-effective model, as well as the ability to perform necessary customization (fine-tuning) to fit their business use case. In this post, we walk you through deploying a Falcon large language model (LLM) using Amazon SageMaker JumpStart and using the model to summarize long documents with LangChain and Python.

Solution overview

Amazon SageMaker is built on Amazon’s two decades of experience developing real-world ML applications, including product recommendations, personalization, intelligent shopping, robotics, and voice-assisted devices. SageMaker is a HIPAA-eligible managed service that provides tools that enable data scientists, ML engineers, and business analysts to innovate with ML. Within SageMaker is Amazon SageMaker Studio, an integrated development environment (IDE) purpose-built for collaborative ML workflows, which, in turn, contains a wide variety of quickstart solutions and pre-trained ML models in an integrated hub called SageMaker JumpStart. With SageMaker JumpStart, you can use pre-trained models, such as the Falcon LLM, with pre-built sample notebooks and SDK support to experiment with and deploy these powerful transformer models. You can use SageMaker Studio and SageMaker JumpStart to deploy and query your own generative model in your AWS account.

You can also ensure that the inference payload data doesn’t leave your VPC. You can provision models as single-tenant endpoints and deploy them with network isolation. Furthermore, you can curate and manage the selected set of models that satisfy your own security requirements by using the private model hub capability within SageMaker JumpStart and storing the approved models in there. SageMaker is in scope for HIPAA BAA, SOC123, and HITRUST CSF.

The Falcon LLM is a large language model, trained by researchers at Technology Innovation Institute (TII) on over 1 trillion tokens using AWS. Falcon has many different variations, with its two main variants, Falcon 40B and Falcon 7B, comprising 40 billion and 7 billion parameters, respectively, and fine-tuned versions trained for specific tasks, such as following instructions. Falcon performs well on a variety of tasks, including text summarization, sentiment analysis, question answering, and conversing. This post provides a walkthrough that you can follow to deploy the Falcon LLM into your AWS account, using a managed notebook instance through SageMaker JumpStart to experiment with text summarization.

The SageMaker JumpStart model hub includes complete notebooks to deploy and query each model. As of this writing, there are six versions of Falcon available in the SageMaker JumpStart model hub: Falcon 40B Instruct BF16, Falcon 40B BF16, Falcon 180B BF16, Falcon 180B Chat BF16, Falcon 7B Instruct BF16, and Falcon 7B BF16. This post uses the Falcon 7B Instruct model.

In the following sections, we show how to get started with document summarization by deploying Falcon 7B on SageMaker JumpStart.

Prerequisites

For this tutorial, you’ll need an AWS account with a SageMaker domain. If you don’t already have a SageMaker domain, refer to Onboard to Amazon SageMaker Domain to create one.

Deploy Falcon 7B using SageMaker JumpStart

To deploy your model, complete the following steps:

  1. Navigate to your SageMaker Studio environment from the SageMaker console.
  2. Within the IDE, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
  3. Deploy the Falcon 7B Instruct model to an endpoint for inference.

Choosing Falcon-7B-Instruct from SageMaker JumpStart

This will open the model card for the Falcon 7B Instruct BF16 model. On this page, you can find the Deploy or Train options as well as links to open the sample notebooks in SageMaker Studio. This post will use the sample notebook from SageMaker JumpStart to deploy the model.

  1. Choose Open notebook.

SageMaker JumpStart Model Deployment Page

  1. Run the first four cells of the notebook to deploy the Falcon 7B Instruct endpoint.

You can see your deployed JumpStart models on the Launched JumpStart assets page.

  1. In the navigation pane, under SageMaker JumpStart, choose Launched JumpStart assets.
  2. Choose the Model endpoints tab to view the status of your endpoint.

SageMaker JumpStart Launched Model Page

With the Falcon LLM endpoint deployed, you are ready to query the model.
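
If you prefer a programmatic route instead of the console flow above, the following is a minimal sketch using the SageMaker Python SDK’s JumpStart support. The model ID and instance type shown are assumptions; verify them against the Falcon 7B Instruct model card before use.

from sagemaker.jumpstart.model import JumpStartModel

# Model ID and instance type are assumptions; check the Falcon 7B Instruct
# model card in SageMaker JumpStart for the current values.
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# Send a quick test prompt to the deployed endpoint.
print(predictor.predict({"inputs": "Write a one-sentence summary of what Amazon SageMaker does."}))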

Run your first query

To run a query, complete the following steps:

  1. On the File menu, choose New, then Notebook to open a new notebook.

You can also download the completed notebook here.

Create SageMaker Studio notebook

  1. Select the image, kernel, and instance type when prompted. For this post, we choose the Data Science 3.0 image, Python 3 kernel, and ml.t3.medium instance.

Setting SageMaker Studio Notebook Kernel

  1. Import the Boto3 and JSON modules by entering the following two lines into the first cell:
import json
import boto3
  1. Press Shift + Enter to run the cell.
  2. Next, you can define a function that will call your endpoint. This function takes a dictionary payload and uses it to invoke the SageMaker runtime client. Then it deserializes the response and prints the input and generated text.
# Escape sequences for formatting the printed output (newline, ANSI bold on/off).
newline, bold, unbold = '\n', '\033[1m', '\033[0m'
endpoint_name = 'ENDPOINT_NAME'  # replace with the name of your deployed endpoint

def query_endpoint(payload):
    # Invoke the SageMaker endpoint with a JSON payload.
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=json.dumps(payload).encode('utf-8'))
    # Deserialize the response and print the prompt alongside the generated text.
    model_predictions = json.loads(response['Body'].read())
    generated_text = model_predictions[0]['generated_text']
    print(
        f"Input Text: {payload['inputs']}{newline}"
        f"Generated Text: {bold}{generated_text}{unbold}{newline}")

The payload includes the prompt as inputs, together with the inference parameters that will be passed to the model.

  1. You can use these parameters with the prompt to tune the output of the model for your use case:
payload = {
    "inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "parameters":{
        "max_new_tokens": 50,       # maximum number of tokens to generate
        "return_full_text": False,  # return only the completion, not the prompt
        "do_sample": True,          # sample rather than greedy-decode
        "top_k":10                  # sample from the 10 most likely tokens
        }
}
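
With the payload defined, you can run your first query by passing it to the helper defined earlier (this assumes endpoint_name has already been set to the name of your deployed endpoint):

query_endpoint(payload)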

Query with a summarization prompt

This post uses a sample research paper to demonstrate summarization. The example text file concerns automatic text summarization in the biomedical literature. Complete the following steps:

  1. Download the PDF and copy the text into a file named document.txt.
  2. In SageMaker Studio, choose the upload icon and upload the file to your SageMaker Studio instance.

Uploading File to SageMaker Studio

Out of the box, the Falcon LLM provides support for text summarization.

  1. Let’s create a function that uses prompt engineering techniques to summarize document.txt:
def summarize(text_to_summarize):
    summarization_prompt = """Process the following text and then perform the instructions that follow:

{text_to_summarize}

Provide a short summary of the preceding text.

Summary:"""
    payload = {
        "inputs": summarization_prompt,
        "parameters":{
            "max_new_tokens": 150,
            "return_full_text": False,
            "do_sample": True,
            "top_k":10
            }
    }
    response = query_endpoint(payload)
    print(response)
    
with open("document.txt") as f:
    text_to_summarize = f.read()

summarize(text_to_summarize)

You’ll notice that for longer documents, an error appears: Falcon, like all other LLMs, has a limit on the number of tokens that can be passed as input. We can get around this limit using LangChain’s enhanced summarization capabilities, which allow a much larger input to be passed to the LLM.

Import and run a summarization chain

LangChain is an open-source software library that allows developers and data scientists to quickly build, tune, and deploy custom generative applications without managing complex ML interactions. It abstracts many of the common use cases for generative AI language models into just a few lines of code, and its support for AWS services includes SageMaker endpoints.

LangChain provides an accessible interface to LLMs. Its features include tools for prompt templating and prompt chaining. These chains can be used to summarize text documents that are longer than what the language model supports in a single call. You can use a map-reduce strategy to summarize a long document by breaking it into manageable chunks, summarizing each chunk, and combining the partial summaries (and summarizing again, if needed).
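
To make the map-reduce idea concrete before handing it off to LangChain, the following sketch shows the strategy by hand, reusing the query_endpoint helper defined earlier (as modified to return the generated text). It is illustrative only: the chunk size is arbitrary, the prompts are simplified, and it assumes the combined partial summaries fit within the model's input limit, which is exactly the bookkeeping the LangChain chain below handles for you.

def map_reduce_summarize(text, chunk_size=2000):
    # Map step: split the document into chunks and summarize each one
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_summaries = [
        query_endpoint({
            "inputs": f"Write a concise summary of this text:\n\n{chunk}\n\nConcise summary:",
            "parameters": {"max_new_tokens": 100, "return_full_text": False},
        })
        for chunk in chunks
    ]
    # Reduce step: combine the partial summaries into a final summary
    combined = "\n".join(partial_summaries)
    return query_endpoint({
        "inputs": f"Combine the following summaries into a final summary:\n\n{combined}\n\nFinal summary:",
        "parameters": {"max_new_tokens": 150, "return_full_text": False},
    })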

  1. Let’s install LangChain to begin:
%pip install langchain
  2. Import the relevant modules and break down the long document into chunks:
import langchain
from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size = 500,
                    chunk_overlap  = 20,
                    separators = [" "],
                    length_function = len
                )
input_documents = text_splitter.create_documents([text_to_summarize])
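
Before moving on, it can be worth confirming how the splitter divided the document; this quick check is optional and not part of the original walkthrough.

# Optional check: how many chunks were produced, and what the first one looks like
print(f"Number of chunks: {len(input_documents)}")
print(input_documents[0].page_content[:200])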
  3. To make LangChain work effectively with Falcon, you need to define a content handler class that translates between LangChain and the JSON input and output the endpoint expects:
class ContentHandlerTextSummarization(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        # Package the prompt and any inference parameters into the JSON body Falcon expects
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Parse the endpoint response and return only the text after the "summary:" marker
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json[0]['generated_text']
        return generated_text.split("summary:")[-1]
    
content_handler = ContentHandlerTextSummarization()
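
As a quick sanity check, you can exercise the handler locally before wiring it into LangChain. The sample response below is fabricated purely to show the expected shape of the endpoint output; real responses come from the deployed Falcon endpoint.

import io
import json

# transform_input should produce the same JSON body used earlier with invoke_endpoint
print(content_handler.transform_input("Summarize giraffes.", {"parameters": {"max_new_tokens": 50}}))

# transform_output expects a streaming body, so simulate one with an in-memory sample response
sample_body = io.BytesIO(json.dumps([{"generated_text": "Concise summary: Giraffes are tall."}]).encode("utf-8"))
print(content_handler.transform_output(sample_body))  # -> " Giraffes are tall."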
  4. You can define custom prompts as PromptTemplate objects, the main vehicle for prompting with LangChain, for the map-reduce summarization approach. This is an optional step, because default map and combine prompts are provided if the corresponding parameters in the call to load the summarization chain (load_summarize_chain) are left undefined.
map_prompt = """Write a concise summary of this text in a few complete sentences:

{text}

Concise summary:"""

map_prompt_template = PromptTemplate(
                        template=map_prompt, 
                        input_variables=["text"]
                      )


combine_prompt = """Combine all of the following summaries and generate a final summary of them in a few complete sentences:

{text}

Final summary:"""

combine_prompt_template = PromptTemplate(
                            template=combine_prompt, 
                            input_variables=["text"]
                          )      
  5. LangChain supports LLMs hosted on SageMaker inference endpoints, so instead of using the AWS SDK for Python (Boto3) directly, you can initialize the connection through LangChain for greater accessibility:
summary_model = SagemakerEndpoint(
                    endpoint_name = endpoint_name,
                    region_name = "us-east-1",  # use the Region where your endpoint is deployed
                    model_kwargs = {},
                    content_handler = content_handler
                )
  6. Finally, you can load in a summarization chain and run a summary on the input documents using the following code:
summary_chain = load_summarize_chain(llm=summary_model,
                                     chain_type="map_reduce", 
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     verbose=True
                                    ) 
summary = summary_chain({"input_documents": input_documents, 'token_max': 700}, return_only_outputs=True)
print(summary["output_text"])   

Because the verbose parameter is set to True, you’ll see all of the intermediate outputs of the map-reduce approach. This is useful for following the sequence of events to arrive at a final summary. With this map-reduce approach, you can effectively summarize documents much longer than is normally allowed by the model’s maximum input token limit.

Clean up

After you’ve finished using the inference endpoint, it’s important to delete it to avoid incurring unnecessary costs. You can do so with the following lines of code:

# Use the SageMaker control-plane client (not the runtime client) to delete the endpoint
client = boto3.client('sagemaker')
client.delete_endpoint(EndpointName=endpoint_name)
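
Deleting the endpoint stops the charges for the hosting instance. If you also want to remove the endpoint configuration and model objects that SageMaker JumpStart created alongside it, you can delete those too. The names below are assumptions (they often match the endpoint name for JumpStart deployments), so confirm them on the SageMaker console before running this.

# Optional: also remove the endpoint configuration and model created during deployment.
# The names are assumed to match the endpoint name; verify them in the SageMaker console first.
sm_client = boto3.client('sagemaker')
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=endpoint_name)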

Using other foundation models in SageMaker JumpStart

Utilizing other foundation models available in SageMaker JumpStart for document summarization requires minimal overhead to set up and deploy. LLMs occasionally vary in the structure of their input and output formats, so as new models and pre-made solutions are added to SageMaker JumpStart, you may have to make the following code changes, depending on the task implementation:

  • If you are performing summarization via the summarize() method (the method without using LangChain), you may have to change the JSON structure of the payload parameter, as well as the handling of the response variable in the query_endpoint() function
  • If you are performing summarization via LangChain’s load_summarize_chain() method, you may have to modify the ContentHandlerTextSummarization class, specifically the transform_input() and transform_output() functions, to correctly handle the payload that the LLM expects and the output the LLM returns

Foundation models vary not only in factors such as inference speed and quality, but also in their input and output formats. Refer to the LLM’s relevant information page for the expected input and output.
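
As an illustration, suppose a different JumpStart model returned its completions under a generated_texts list rather than Falcon's structure. That response shape is hypothetical, but it shows that only the content handler needs to change; the model's documentation is the source of truth for the real format.

import json

from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandlerAlternateModel(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        # Same request shape as before: the prompt plus any inference parameters
        return json.dumps({"inputs": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Hypothetical response shape: {"generated_texts": ["..."]} instead of Falcon's format
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_texts"][0]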

Conclusion

The Falcon 7B Instruct model is available on the SageMaker JumpStart model hub and performs well on a number of use cases. This post demonstrated how you can deploy your own Falcon LLM endpoint into your environment using SageMaker JumpStart and run your first experiments from SageMaker Studio, allowing you to rapidly prototype your models and seamlessly transition to a production environment. With Falcon and LangChain, you can effectively summarize long-form healthcare and life sciences documents at scale.

For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building document summarization proofs of concept for your healthcare and life science-oriented GenAI applications using the method outlined in this post. When Amazon Bedrock is generally available, we will publish a follow-up post showing how you can implement document summarization using Amazon Bedrock and LangChain.


About the Authors

John Kitaoka is a Solutions Architect at Amazon Web Services. John helps customers design and optimize AI/ML workloads on AWS to help them achieve their business goals.

Josh Famestad is a Solutions Architect at Amazon Web Services. Josh works with public sector customers to build and execute cloud based approaches to deliver on business priorities.

Read More

Automate prior authorization using CRD with CDS Hooks and AWS HealthLake

Automate prior authorization using CRD with CDS Hooks and AWS HealthLake

Prior authorization is a crucial process in healthcare that involves the approval of medical treatments or procedures before they are carried out. This process is necessary to ensure that patients receive the right care and that healthcare providers are following the correct procedures. However, prior authorization can be a time-consuming and complex process that requires a lot of paperwork and communication between healthcare providers, insurance companies, and patients.

The prior authorization process for electronic health records (EHRs) consists of five steps:

  1. Determine whether prior authorization is required.
  2. Gather information necessary to support the prior authorization request.
  3. Submit the request for prior authorization.
  4. Monitor the prior authorization request for resolution.
  5. If needed, supplement the prior authorization request with additional required information (and resume at Step 4).

The Da Vinci Burden Reduction project has rearranged these steps for prior authorization into three interrelated implementation guides that are focused on reducing the clinician and payer burden:

  1. Coverage Requirements Discovery (CRD) – This provides decision support to providers at the time they’re ordering diagnostics, specifying treatments, making referrals, scheduling appointments, and so on.
  2. Documentation Templates and Rules (DTR) – This allows providers to download smart questionnaires and rules, such as Clinical Quality Language (CQL), and provides a SMART on FHIR app or EHR app that runs the questionnaires and rules to gather information relevant to a performed or planned service. Running the questionnaires and rules may also be performed by an application that is part of the provider’s EHR.
  3. Prior Authorization Support (PAS) – This allows provider systems to send (and payer systems to receive) prior authorization requests using FHIR, while still meeting regulatory mandates to have X12 278 used, where required, to transport the prior authorization, potentially simplifying processing for either exchange partner (or both).

In this post, we focus on the CRD implementation guide to determine prior authorization requirements and explain how CDS (Clinical Decision Support) Hooks uses AWS HealthLake to determine if prior authorization is required or not.

Solution overview

CRD is a protocol within the electronic prior authorization workflow that facilitates calls between EHRs and the payers using CDS services. When utilized, it provides information on coverage requirements to providers while patient care decisions are in progress. This enables provider staff to make more informed decisions and meet the requirements of their patient’s insurance coverage. Interaction between providers and payers is done seamlessly using CDS Hooks.

CDS Hooks is a Health Level Seven International (HL7) specification. CDS Hooks provides a way to embed additional, near-real-time functionality within a clinician’s workflow of an EHR. With CDS Hooks, eligibility practices like prior authorization can be properly optimized, along with other pre-certification requirements like the physician’s network participation. This function assists providers in making informed decisions by providing them with information on their patient’s condition, treatment options, and the forms that must be completed to facilitate their care. The strategic use of CDS Hooks allows clinicians to quickly develop more patient-centered care plans and assist the prior authorization process by disclosing critical administrative and clinical requirements. For more information on CDS Hooks and its specification, refer to the CDS Hooks website.

The following diagram illustrates how the CRD workflow is automated using HealthLake.

The workflow steps are as follows:

  1. A provider staff member logs into the EHR system to open the patient chart.
  2. The EHR system validates user credentials and invokes the patient-view hook to retrieve patient condition information.
  3. Amazon API Gateway invokes the Patient View Hooks AWS Lambda function.
  4. The Lambda function validates and retrieves the patient ID from the request and gets the patient condition information from HealthLake.
  5. After reviewing the patient condition, the user invokes the order-select hook to retrieve coverage requirements information for the respective drug.
  6. API Gateway invokes the Coverage Requirements Hooks Lambda function.
  7. The Lambda function retrieves claims information for the patient, runs CQL rules based on the medication submitted and claims information retrieved from HealthLake, and determines whether prior authorization is required.

The solution is available in the Determine Coverage Requirements Discovery using CDS Hooks with AWS HealthLake GitHub repo.
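
To make the hook interaction concrete, the following is a minimal, hypothetical sketch of a patient-view handler in Python. It is not the code from the repository: the get_conditions_from_healthlake helper stands in for the FHIR query the actual Lambda function performs against the HealthLake data store, and only the card structure follows the CDS Hooks specification.

import json

def get_conditions_from_healthlake(patient_id):
    # Placeholder for the FHIR search the real Lambda function performs against
    # the HealthLake data store (for example, GET /Condition?patient=<patient_id>)
    return [{"code": {"text": "Acute sinusitis"}}]

def lambda_handler(event, context):
    # The patient-view hook request carries the patient ID in its context
    request = json.loads(event["body"])
    patient_id = request["context"]["patientId"]

    conditions = get_conditions_from_healthlake(patient_id)
    summaries = ", ".join(c["code"]["text"] for c in conditions)

    # CDS Hooks responses are returned as "cards" rendered in the clinician's workflow
    cards = {
        "cards": [{
            "summary": f"Active conditions: {summaries}",
            "indicator": "info",
            "source": {"label": "Coverage Requirements Discovery demo"},
        }]
    }
    return {"statusCode": 200, "body": json.dumps(cards)}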

Prerequisites

This post assumes familiarity with the following services:

  • AWS HealthLake
  • AWS Lambda
  • Amazon API Gateway
  • AWS Serverless Application Model (AWS SAM)

Deploy the application using the AWS SAM CLI

You can deploy the template using the AWS Management Console or the AWS SAM CLI. To use the CLI, complete the following steps:

  1. Install the AWS SAM CLI.
  2. Download the sample code from the AWS samples repository to your local system:
git clone https://github.com/aws-samples/aws-crd-hooks-with-awshealthlake-api
cd aws-crd-hooks-with-awshealthlake-api/
  3. Build the application using AWS SAM:
sam build
  4. Deploy the application using the guided process:
sam deploy --guided
# Replace MY_VALUE with proper resource names
Configuring SAM deploy

======================

Looking for config file [samconfig.toml] : Not found

Setting default arguments for 'sam deploy'

     =========================================

     Stack Name [sam-app]: aws-cds-hooks-with-healthlake

     AWS Region [us-east-1]: us-east-2

     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy

     Confirm changes before deploy [y/N]:

     #SAM needs permission to be able to create roles to connect to the resources in your template

     Allow SAM CLI IAM role creation [Y/n]:

     #Preserves the state of previously provisioned resources when an operation fails

     Disable rollback [y/N]:

     cdsDemoServicesFunction has no authentication. Is this okay? [y/N]: y

     cqlQueryFunction has no authentication. Is this okay? [y/N]: y

     cqlQueryOrderFunction has no authentication. Is this okay? [y/N]: y

     Save arguments to configuration file [Y/n]: y

     SAM configuration file [samconfig.toml]:

     SAM configuration environment [default]:

The deployment may take 30 minutes or more while AWS creates a HealthLake data store and related resources in your AWS account. AWS SAM may time out and return you to your command line. This timeout only stops AWS SAM from showing you the deployment progress; it doesn’t stop the deployment itself. If you see a timeout, go to the AWS CloudFormation console and verify the CloudFormation stack deployment status. When the CloudFormation stack deployment is complete, integrate CDS Hooks with your clinical workflow.
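
If you prefer to check progress from a script or notebook instead of the console, a quick status call with the AWS SDK for Python (Boto3) does the same thing; the stack name below matches the one entered during sam deploy.

import boto3

# Check the status of the CloudFormation stack that AWS SAM created
cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="aws-cds-hooks-with-healthlake")["Stacks"][0]
print(stack["StackStatus"])  # for example, CREATE_IN_PROGRESS or CREATE_COMPLETE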

Determine coverage requirements for prior authorization

The solution has two hooks, patient-view and order-select, to determine whether prior authorization is required based on prior authorization rules from the payer. CQL is used to evaluate the prior authorization rules.

CDS Hooks can be integrated with any EHR that supports CDS Hooks. Alternatively, if you don’t have an EHR available for testing, you can use the publicly available sandbox as described in the GitHub repo. Note that the CDS Hooks sandbox is used solely for testing purposes.

After your hooks are integrated with the EHR, when a user navigates to the clinical workflow, the patient-view hook runs for the configured patient. Note that the patient ID from the clinical workflow must exist in HealthLake. The cards returned from the API indicate that the patient has a sinus infection health condition and the doctor may need to order a prescription.

You can navigate to the RX View tab to order a prescription. Acting as the doctor, choose the appropriate medication and enter other details as shown in the following screenshot.

The order-select hook is returned with the prior authorization eligibility card.

The next step is to submit a prior authorization using the SMART app or other mechanisms available to the provider.

Clean up

If you no longer need the AWS resources that you created by running this example, you can remove them by deleting the CloudFormation stack that you deployed:

sam delete --stack-name <<your-stack-name>>

Conclusion

In this post, we showed how HealthLake with CDS Hooks can help reduce the burden on providers and improve the member experience by determining coverage requirements for prior authorization as part of the prescription order clinical workflow. CDS Hooks along with HealthLake can help providers at the time they’re ordering diagnostics, specifying treatments, making referrals, and scheduling appointments.

If you are interested in implementing coverage requirements discovery on AWS using this solution or want to learn more about implementing prior authorization on AWS, you can contact an AWS Representative.


About the Authors

Manish Patel is a Global Partner Solutions Architect supporting Healthcare and Life Sciences at AWS. He has more than 20 years of experience building solutions for Medicare, Medicaid, Payers, Providers and Life Sciences customers. He drives go-to-market strategies along with partners to accelerate solution development in areas such as Electronic Health Records, Medical Imaging, multi-modal data solutions and Generative AI. He is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides advice on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions.

Read More

Scalable spherical CNNs for scientific applications

Scalable spherical CNNs for scientific applications

Typical deep learning models for computer vision, like convolutional neural networks (CNNs) and vision transformers (ViT), process signals assuming planar (flat) spaces. For example, digital images are represented as a grid of pixels on a plane. However, this type of data makes up only a fraction of the data we encounter in scientific applications. Variables sampled from the Earth’s atmosphere, like temperature and humidity, are naturally represented on the sphere. Some kinds of cosmological data and panoramic photos are also spherical signals, and are better treated as such.

Using methods designed for planar images to process spherical signals is problematic for a couple of reasons. First, there is a sampling problem, i.e., there is no way of defining uniform grids on the sphere, which are needed for planar CNNs and ViTs, without heavy distortion.

When projecting the sphere into a plane, the patch represented by the red circle is heavily distorted near the poles. This sampling problem hurts the accuracy of conventional CNNs and ViTs on spherical inputs.
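
To get a feel for the size of this distortion, consider the cells of an equirectangular (latitude-longitude) grid: their area on the sphere shrinks roughly with the cosine of latitude, so cells near the poles cover a tiny fraction of the area of cells at the equator. The short calculation below is only an illustration of the sampling problem, not code from the paper.

import numpy as np

# Cell area on the unit sphere is approximately cos(latitude) * d_lat * d_lon,
# so equirectangular cells shrink dramatically toward the poles.
n_lat, n_lon = 180, 360
d_lat, d_lon = np.pi / n_lat, 2 * np.pi / n_lon
lats = -np.pi / 2 + (np.arange(n_lat) + 0.5) * d_lat  # cell-center latitudes

cell_areas = np.cos(lats) * d_lat * d_lon
print(f"Equator cell is ~{cell_areas[n_lat // 2] / cell_areas[0]:.0f}x larger than a polar cell")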

Second, signals and local patterns on the sphere are often complicated by rotations, so models need a way to address that. We would like equivariance to 3D rotations, which ensures that learned features follow the rotations of the input. This leads to better utilization of the model parameters and allows training with less data. Equivariance to 3D rotations is also useful in most settings where inputs don’t have a preferred orientation, such as 3D shapes and molecules.

Drone racing with panoramic cameras. Here the sharp turns result in large 3D rotations of the spherical image. We would like our models to be robust to such rotations. Source: https://www.youtube.com/watch?v=_J7qXbbXY80 (licensed under CC BY)
In the atmosphere, it is common to see similar patterns appearing at different positions and orientations. We would like our models to share parameters to recognize these patterns.

With the above challenges in mind, in “Scaling Spherical CNNs”, presented at ICML 2023, we introduce an open-source library in JAX for deep learning on spherical surfaces. We demonstrate how applications of this library match or surpass state-of-the-art performance on weather forecasting and molecular property prediction benchmarks, tasks that are typically addressed with transformers and graph neural networks.

Background on spherical CNNs

Spherical CNNs solve both the problems of sampling and of robustness to rotation by leveraging spherical convolution and cross-correlation operations, which are typically computed via generalized Fourier transforms. For planar surfaces, however, convolution with small filters is faster, because it can be performed on regular grids without using Fourier transforms. The higher computational cost for spherical inputs has so far restricted the application of spherical CNNs to small models and low-resolution datasets.
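
As a loose analogy for why Fourier transforms enter the picture, recall the convolution theorem on the circle, the 1D counterpart of the sphere: circular convolution becomes pointwise multiplication of Fourier coefficients. Spherical convolutions generalize this using spherical harmonic transforms; the snippet below only illustrates the 1D case and is not taken from the library.

import jax.numpy as jnp

# Convolution theorem on the circle: convolve by multiplying Fourier coefficients
# and transforming back. Spherical CNNs rely on the analogous generalized
# (spherical harmonic) transforms instead of the 1D FFT.
def circular_conv(f, g):
    return jnp.real(jnp.fft.ifft(jnp.fft.fft(f) * jnp.fft.fft(g)))

signal = jnp.sin(jnp.linspace(0.0, 2.0 * jnp.pi, 64, endpoint=False))
kernel = jnp.zeros(64).at[:5].set(1.0 / 5.0)  # simple moving-average filter
smoothed = circular_conv(signal, kernel)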

Our contributions

We have implemented the spherical convolutions from spin-weighted spherical CNNs in JAX with a focus on speed, and have enabled distributed training over a large number of TPUs using data parallelism. We also introduced a new phase collapse activation, a spectral batch normalization layer, and a new residual block, which improve accuracy and efficiency and allow training more accurate models up to 100x larger than before. We apply these new models to molecular property regression and weather forecasting.

We scale spherical CNNs by up to two orders of magnitude in terms of feature sizes and model capacity compared to the literature (Cohen ’18, Esteves ’18, Esteves ’20, and Cobb ’21). VGG-19 is included as a conventional CNN reference. Our largest model for weather forecasting has 256 x 256 x 78 inputs and outputs, and runs 96 convolutional layers during training with a lowest internal resolution of 128 x 128 x 256.

Molecular property regression

Predicting properties of molecules has applications in drug discovery, where the goal is to quickly screen numerous molecules in search of those with desirable properties. Similar models may also be relevant in the design of drugs targeting the interaction between proteins. Current methods in computational or experimental quantum chemistry are expensive, which motivates the use of machine learning.

Molecules can be represented by a set of atoms and their positions in 3D space; rotations of the molecule change the positions but not the molecular properties. This motivates the application of spherical CNNs because of their rotation equivariance. However, molecules are not defined as signals on the sphere so the first step is to map them to a set of spherical functions. We do so by leveraging physics-based interactions between the atoms of the molecule.

Each atom is represented by a set of spherical signals accumulating physical interactions with other atoms of each type (shown in the three panels on the right). For example, the oxygen atom (O; top panel) has a channel for oxygen (indicated by the sphere labeled “O” on the left) and hydrogen (“H”, right). The accumulated Coulomb forces on the oxygen atom with respect to the two hydrogen atoms is indicated by the red shaded regions on the bottom of the sphere labeled “H”. Because the oxygen atom contributes no forces to itself, the “O” sphere is uniform. We include extra channels for the Van der Waals forces.

Spherical CNNs are applied to each atom’s features, and results are later combined to produce the property predictions. This results in state-of-the-art performance on most properties, as typically evaluated on the QM9 benchmark:

Error comparison against the state-of-the-art on 12 properties of QM9 (see the dataset paper for details). We show TorchMD-Net and PaiNN results, normalizing TorchMD-Net errors to 1.0 (lower is better). Our model, shown in green, outperforms the baselines in most targets.

Weather forecasting

Accurate climate forecasts serve as invaluable tools for providing timely warnings of extreme weather events, enabling effective water resource management, and guiding informed infrastructure planning. In a world increasingly threatened by climate disasters, there is an urgency to deliver forecasts much faster and more accurately over a longer time horizon than general circulation models. Forecasting models will also be important for predicting the safety and effectiveness of efforts intended to combat climate change, such as climate interventions. The current state-of-the-art uses costly numerical models based on fluid dynamics and thermodynamics, which tend to drift after a few days.

Given these challenges, there is an urgency for machine learning researchers to address climate forecasting problems, as data-driven techniques have the potential of both reducing the computational cost and improving long range accuracy. Spherical CNNs are suitable for this task since atmospheric data is natively presented on the sphere. They can also efficiently handle repeating patterns at different positions and orientations that are common in such data.

We apply our models to several weather forecasting benchmarks and outperform or match neural weather models based on conventional CNNs (specifically, 1, 2, and 3). Below we show results in a test setting where the model takes a number of atmospheric variables as input and predicts their values six hours ahead. The model is then iteratively applied on its own predictions to produce longer forecasts. During training, the model predicts up to three days ahead, and is evaluated up to five days. Keisler proposed a graph neural network for this task, but we show that spherical CNNs can match the GNN accuracy in the same setting.
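
The iterative rollout itself is simple to express. The sketch below assumes a hypothetical model_apply function that maps the current atmospheric state to its prediction six hours ahead, and repeatedly feeds each prediction back in; it is a schematic of the evaluation procedure described above, not the paper's code.

def rollout(model_apply, initial_state, num_steps=20):
    # Each step advances the forecast by 6 hours, so 20 steps cover 5 days (120 h)
    states = [initial_state]
    for _ in range(num_steps):
        states.append(model_apply(states[-1]))
    return states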

Iterative weather forecasting up to five days (120h) ahead with spherical CNNs. The animations show the specific humidity forecast at a given pressure and its error.
Wind speed and temperature forecasts with spherical CNNs.

Additional resources

Our JAX library for efficient spherical CNNs is now available. We have shown applications to molecular property regression and weather forecasting, and we believe the library will be helpful in other scientific applications, as well as in computer vision and 3D vision.

Weather forecasting is an active area of research at Google with the goal of building more accurate and robust models — like Graphcast, a recent ML-based mid-range forecasting model — and to build tools that enable further advancement across the research community, such as the recently released WeatherBench 2.

Acknowledgements

This work was done in collaboration with Jean-Jacques Slotine, and is based on previous collaborations with Kostas Daniilidis and Christine Allen-Blanchette. We thank Stephan Hoyer, Stephan Rasp, and Ignacio Lopez-Gomez for helping with data processing and evaluation, and Fei Sha, Vivian Yang, Anudhyan Boral, Leonardo Zepeda-Núñez, and Avram Hershko for suggestions and discussions. We are thankful to Michael Riley and Corinna Cortes for supporting and encouraging this project.

Read More