AI’s New Frontier: From Daydreams to Digital Deeds

Imagine a world where you can whisper your digital wishes into your device, and poof, it happens.

That world may be coming sooner than you think. But if you’re worried about AI doing your thinking for you, you might be waiting for a while.

In a fireside chat Wednesday at NVIDIA GTC, the global AI conference, Kanjun Qiu, CEO of Imbue, and Bryan Catanzaro, VP of applied deep learning research at NVIDIA, challenged many of the clichés that have long dominated conversations about AI.

Launched in October 2022, Imbue made headlines with its Series B funding round last year, raising over $200 million at a $1 billion valuation.

Bridging the Gap Between ‘Idea and Execution’

The discussion not only highlighted Imbue’s approach to building practical AI agents that can automate menial, unrewarding work, but also painted a vivid picture of what the next chapter in AI innovation might hold.

“Our lives are full of so much friction … every single person’s vision can come to life,” Qiu said. “The barrier between idea and execution can be much smaller.”

Catanzaro’s reflections on the practical difficulties of using AI for simple tasks, such as his own challenges trying to get his digital assistant to help him find his next meeting, underscored the current limitations in human-AI interaction.

It turns out that figuring out where and when to go to a meeting, while easy for a human assistant, isn’t easy to automate.

“We tend to underestimate the things that we do naturally and overestimate the things that require reasoning,” Catanzaro observed. “One of the things humans deal with well is ambiguity.”

This set the stage for a broader discussion of the need for AI to move beyond mere code generation and become a dynamic, intuitive interface between humans and computers.

Qiu said the idea that AI can be a magical assistant, one that knows everything about you, “isn’t necessarily the right paradigm.”

That’s because delegation is hard.

“When I’m delegating something, even to a human, I have to think a lot about ‘okay, how can I package this up so that the person will do the right thing?’”

Instead, the better model might be one where you can tell your computer to do anything you want, so you’re “telling your computer to do stuff and the agent is a middle layer,” she said.

Such agents will need to be able to interact with people — something often described as “reasoning,” the two observed — and communicate with computers — or “code.”

A Vision for Empowerment Through Technology

Qiu and Catanzaro — who often completed each other’s sentences during the 45-minute conversation — compared AI’s potential to democratize software creation to the Industrial Revolution’s impact on manufacturing.

The parts needed for a steam engine, for example, once took years to create. Now they can be ordered off the shelf for a small sum.

Both speakers emphasized the importance of creating intuitive interfaces that allow individuals from nontechnical backgrounds to engage with computers more effectively, fostering a more inclusive digital landscape.

That means going beyond coding, which is done in text-heavy environments such as an Integrated Development Environment, and even beyond text-based chats.

“The interface to agents, a lot of them today, is like a chat interface. It’s not a very good interface, in a lot of ways, very restrictive. And so there are much better ways of working with these systems,” Qiu said.

The Future of Personal Computing

Qiu and Catanzaro discussed the role that virtual worlds will play in this, and how they could serve as interfaces for human-technology interaction.

“I think it’s pretty clear that AI is going to help build virtual worlds,” said Catanzaro. “I think the maybe more controversial part is virtual worlds are going to be necessary for humans to interact with AI.”

People have an almost primal fear of being displaced, Catanzaro said, but what’s much more likely is that our capabilities will be amplified as the technology fades into the background.

Catanzaro compared it to the adoption of electricity. A century ago, people talked a lot about electricity. Now that it’s ubiquitous, it’s no longer the focus of broader conversations, even as it makes our day-to-day lives better.

“I think of it as really being able to [help us] control information environments … once we have control over information environments, we’ll feel a lot more empowered,” Qiu said. “Every single person’s vision can come to life.”

Here Be Dragons: ‘Dragon’s Dogma 2’ Comes to GeForce NOW

Arise for a new adventure with Dragon’s Dogma 2, leading two new titles joining the GeForce NOW library this week.

Set Forth, Arisen

Fulfill a forgotten destiny in “Dragon’s Dogma 2” from Capcom.

Time to go on a grand adventure, Arisen!

Dragon’s Dogma 2, the long-awaited sequel to Capcom’s legendary action role-playing game, streams this week on GeForce NOW.

The game challenges players to choose their own experience, including their Arisen’s appearance, vocation, party, approaches to different situations and more. Wield swords, bows and magick across an immersive fantasy world full of life and battle. But players won’t be alone. Recruit Pawns — mysterious otherworldly beings — to aid in battle and work with other players’ Pawns to fight the diverse monsters inhabiting the ever-changing lands.

Upgrade to a GeForce NOW Ultimate membership to stream Dragon’s Dogma 2 from NVIDIA GeForce RTX 4080 servers in the cloud for the highest performance, even on low-powered devices. Ultimate members also get exclusive access to servers to get right into gaming without waiting for any downloads.

New Games, New Challenges

No holding back.

Battlefield 2042: Season 7 Turning Point is here. Do whatever it takes to battle for Earth’s most valuable resource — water — in a Chilean desert. Deploy on a new map, Haven, focused on suburban combat, and revisit a fan-favorite front: Stadium. Gear up with new hardware like the SCZ-3 SMG or the Predator SRAW, and jump into a battle for ultimate power.

Then, look forward to the following list of games this week:

  • Alone in the Dark (New release on Steam, March 20)
  • Dragon’s Dogma 2 (New release on Steam, March 21)

What are you planning to play this weekend? Let us know on X or in the comments below.

Instant Latte: NVIDIA Gen AI Research Brews 3D Shapes in Under a Second

NVIDIA researchers have pumped a double shot of acceleration into their latest text-to-3D generative AI model, dubbed LATTE3D.

Like a virtual 3D printer, LATTE3D turns text prompts into 3D representations of objects and animals within a second.

Crafted in a popular format used for standard rendering applications, the generated shapes can be easily served up in virtual environments for developing video games, ad campaigns, design projects or virtual training grounds for robotics.

“A year ago, it took an hour for AI models to generate 3D visuals of this quality — and the current state of the art is now around 10 to 12 seconds,” said Sanja Fidler, vice president of AI research at NVIDIA, whose Toronto-based AI lab team developed LATTE3D. “We can now produce results an order of magnitude faster, putting near-real-time text-to-3D generation within reach for creators across industries.”

This advancement means that LATTE3D can produce 3D shapes near instantly when running inference on a single GPU, such as the NVIDIA RTX A6000, which was used for the NVIDIA Research demo.

Ideate, Generate, Iterate: Shortening the Cycle

Instead of starting a design from scratch or combing through a 3D asset library, a creator could use LATTE3D to generate detailed objects as quickly as ideas pop into their head.

The model generates a few different 3D shape options based on each text prompt, giving a creator options. Selected objects can be optimized for higher quality within a few minutes. Then, users can export the shape into graphics software applications or platforms such as NVIDIA Omniverse, which enables Universal Scene Description (OpenUSD)-based 3D workflows and applications.
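
The loop might look something like the following hypothetical sketch, shown only to illustrate the generate, pick, refine and export steps; every function name here is an assumed placeholder, not a published LATTE3D or Omniverse API.

```python
# Hypothetical sketch of the ideate-generate-iterate loop described above.
# All functions are stand-ins; none correspond to a real LATTE3D API.

def generate_candidates(prompt: str, num_options: int = 4) -> list:
    """Stand-in for a near-instant text-to-3D call returning several rough shapes."""
    raise NotImplementedError

def refine(shape, minutes: float = 2.0):
    """Stand-in for the optional few-minute optimization pass that improves quality."""
    raise NotImplementedError

def export_to_usd(shape, out_path: str) -> str:
    """Stand-in for exporting to OpenUSD so the asset works in USD-based tools such as Omniverse."""
    raise NotImplementedError

def ideate(prompt: str) -> str:
    options = generate_candidates(prompt)        # several rough drafts in about a second
    chosen = options[0]                          # in practice, the creator picks one
    chosen = refine(chosen)                      # optional quality pass
    return export_to_usd(chosen, "asset.usd")    # hand off to a downstream 3D workflow
```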

While the researchers trained LATTE3D on two specific datasets — animals and everyday objects — developers could use the same model architecture to train the AI on other data types.

If trained on a dataset of 3D plants, for example, a version of LATTE3D could help a landscape designer quickly fill out a garden rendering with trees, flowering bushes and succulents while brainstorming with a client. If trained on household objects, the model could generate items to fill in 3D simulations of homes, which developers could use to train personal assistant robots before they’re tested and deployed in the real world.

LATTE3D was trained using NVIDIA A100 Tensor Core GPUs. In addition to 3D shapes, the model was trained on diverse text prompts generated using ChatGPT to improve the model’s ability to handle the various phrases a user might come up with to describe a particular 3D object — for example, understanding that prompts featuring various canine species should all generate doglike shapes.
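
A toy sketch of that prompt-diversification idea follows; the real training prompts came from ChatGPT, so this template-based version is only an illustration and the phrase lists are made up.

```python
# Toy illustration of prompt diversification: many phrasings should map to the same
# family of shapes. The real prompts were generated with ChatGPT; this only sketches the idea.
from itertools import product

styles = ["a", "a realistic", "a cartoon-style", "a low-poly"]
canines = ["dog", "golden retriever", "dachshund", "husky", "border collie"]

dog_prompts = [f"{style} {canine}" for style, canine in product(styles, canines)]
# Each prompt above would be paired with dog-like target shapes during training,
# so the model learns that different canine phrasings share one geometry family.
print(len(dog_prompts), dog_prompts[:3])
```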

NVIDIA Research comprises hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Researchers shared work at NVIDIA GTC this week that advances the state of the art for training diffusion models. Read more on the NVIDIA Technical Blog, and see the full list of NVIDIA Research sessions at GTC, running in San Jose, Calif., and online through March 21.

For the latest NVIDIA AI news, watch the replay of NVIDIA founder and CEO Jensen Huang’s keynote address at GTC.

‘You Transformed the World,’ NVIDIA CEO Tells Researchers Behind Landmark AI Paper

Of GTC’s 900+ sessions, the most wildly popular was a conversation hosted by NVIDIA founder and CEO Jensen Huang with seven of the authors of the legendary research paper that introduced the aptly named transformer — a neural network architecture that went on to change the deep learning landscape and enable today’s era of generative AI.

“Everything that we’re enjoying today can be traced back to that moment,” Huang said to a packed room with hundreds of attendees, who heard him speak with the authors of “Attention Is All You Need.”

Sharing the stage for the first time, the research luminaries reflected on the factors that led to their original paper, which has been cited more than 100,000 times since it was first published and presented at the NeurIPS AI conference. They also discussed their latest projects and offered insights into future directions for the field of generative AI.

While they started as Google researchers, the collaborators are now spread across the industry, most as founders of their own AI companies.

“We have a whole industry that is grateful for the work that you guys did,” Huang said.

From L to R: Lukasz Kaiser, Noam Shazeer, Aidan Gomez, Jensen Huang, Llion Jones, Jakob Uszkoreit, Ashish Vaswani and Illia Polosukhin.

Origins of the Transformer Model

The research team initially sought to overcome the limitations of recurrent neural networks, or RNNs, which were then the state of the art for processing language data.

Noam Shazeer, cofounder and CEO of Character.AI, compared RNNs to the steam engine and transformers to the improved efficiency of internal combustion.

“We could have done the industrial revolution on the steam engine, but it would just have been a pain,” he said. “Things went way, way better with internal combustion.”

“Now we’re just waiting for the fusion,” quipped Illia Polosukhin, cofounder of blockchain company NEAR Protocol.

The paper’s title came from a realization that attention mechanisms — an element of neural networks that enable them to determine the relationship between different parts of input data — were the most critical component of their model’s performance.
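
For readers who want the mechanism in concrete terms, here is a minimal NumPy sketch of scaled dot-product attention, the operation the title refers to. It is an illustrative implementation, not the authors’ code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d). Returns attended values and attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

# Self-attention over 4 tokens with 8-dimensional embeddings (Q = K = V).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))   # each row sums to 1: how strongly each token attends to the others
```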

“We had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better,” said Llion Jones, cofounder and chief technology officer at Sakana AI.

Having a name as general as “transformers” spoke to the team’s ambitions to build AI models that could process and transform every data type — including text, images, audio, tensors and biological data.

“That North Star, it was there on day zero, and so it’s been really exciting and gratifying to watch that come to fruition,” said Aidan Gomez, cofounder and CEO of Cohere. “We’re actually seeing it happen now.”

Packed house at the San Jose Convention Center.

Envisioning the Road Ahead 

Adaptive computation, where a model adjusts how much computing power is used based on the complexity of a given problem, is a key factor the researchers see improving in future AI models.

“It’s really about spending the right amount of effort and ultimately energy on a given problem,” said Jakob Uszkoreit, cofounder and CEO of biological software company Inceptive. “You don’t want to spend too much on a problem that’s easy or too little on a problem that’s hard.”

A math problem like two plus two, for example, shouldn’t be run through a trillion-parameter transformer model — it should run on a basic calculator, the group agreed.
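
A toy sketch of that routing idea, with a placeholder standing in for the expensive model call: queries a trivial solver can handle never reach the large model.

```python
# Adaptive-computation sketch: answer cheap queries cheaply, reserve the big model for the rest.
# `call_large_model` is a placeholder, not a real API.
import ast
import operator

_SAFE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
             ast.Mult: operator.mul, ast.Div: operator.truediv}

def try_simple_arithmetic(query: str):
    """Return a numeric result if the query is plain arithmetic, else None."""
    try:
        def ev(n):
            if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
                return n.value
            if isinstance(n, ast.BinOp) and type(n.op) in _SAFE_OPS:
                return _SAFE_OPS[type(n.op)](ev(n.left), ev(n.right))
            raise ValueError
        return ev(ast.parse(query, mode="eval").body)
    except (SyntaxError, ValueError):
        return None

def call_large_model(query: str) -> str:
    raise NotImplementedError("stand-in for a trillion-parameter transformer call")

def answer(query: str) -> str:
    cheap = try_simple_arithmetic(query)
    if cheap is not None:
        return str(cheap)          # "2 + 2" never touches the big model
    return call_large_model(query) # everything harder goes to the expensive path
```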

They’re also looking forward to the next generation of AI models.

“I think the world needs something better than the transformer,” said Gomez. “I think all of us here hope it gets succeeded by something that will carry us to a new plateau of performance.”

“You don’t want to miss these next 10 years,” Huang said. “Unbelievable new capabilities will be invented.”

The conversation concluded with Huang presenting each researcher with a framed cover plate of the NVIDIA DGX-1 AI supercomputer, signed with the message, “You transformed the world.”

Jensen presents lead author Ashish Vaswani with a signed DGX-1 cover.

There’s still time to catch the session replay by registering for a virtual GTC pass — it’s free.

To discover the latest in generative AI, watch Huang’s GTC keynote address.

Abstracts: March 21, 2024

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Researcher Chang Liu joins host Gretchen Huizinga to discuss “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” In the paper, Liu and his coauthors present M-OFDFT, a variation of orbital-free density functional theory (OFDFT). M-OFDFT leverages deep learning to help identify molecular properties in a way that minimizes the tradeoff between accuracy and efficiency, work with the potential to benefit areas such as drug discovery and materials discovery.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Today, I’m talking to Dr. Chang Liu, a senior researcher from Microsoft Research AI4Science. Dr. Liu is coauthor of a paper called “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” Chang Liu, thanks for joining us on Abstracts.


CHANG LIU: Thank you. Thank you for this opportunity to share our work. 

HUIZINGA: So in a few sentences, tell us about the issue or problem your paper addresses and why people should care about this research. 

LIU: Sure. Since this is an AI4Science work, let’s start from this perspective. About science, people always want to understand the properties of matters, such as why some substances can cure disease and why some materials are heavy or conductive. For a very long period of time, these properties can only be studied by observation and experiments, and the outcome will just look like magic to us. If we can understand the underlying mechanism and calculate these properties on our computer, then we can do the magic ourselves, and it can, hence, accelerate industries like medicine development and material discovery. Our work aims to develop a method that handles the most fundamental part of such property calculation and with better accuracy and efficiency. If you zoom into the problem, properties of matters are determined by the properties of molecules that constitute the matter. For example, the energy of a molecule is an important property. It determines which structure it mostly takes, and the structure indicates whether it can bind to a disease-related biomolecule. You may know that molecules consist of atoms, and atoms consist of nuclei and electrons, so properties of a molecule are the result of the interaction among the nuclei and the electrons in the molecule. The nuclei can be treated as classical particles, but electrons exhibit significant quantum effect. You can imagine this like electrons move so fast that they appear like cloud or mist spreading over the space. To calculate the properties of the molecule, you need to first solve the electronic structure—that is, how the electrons spread over this space. This is governed by an equation that is hard to solve. The target of our research is hence to develop a method that solves the electronic structure more accurately and more efficiently so that properties of molecules can be calculated in a higher level of accuracy and efficiency that leads to better ways to solve the industrial problems. 

HUIZINGA: Well, most research owes a debt to work that went before but also moves the science forward. So how does your approach build on and/or differ from related research in this field? 

LIU: Yes, there are indeed quite a few methods that can solve the electronic structure, but they show a harsh tradeoff between accuracy and efficiency. Currently, density functional theory, often called DFT, achieves a preferred balance for most cases and is perhaps the most popular choice. But DFT still requires a considerable cost for large molecular systems. It has a cubic cost scaling. We hope to develop a method that scales with a milder cost increase. We noted an alternative type of method called orbital-free DFT, or called OFDFT, which has a lower order of cost scaling. But existing OFDFT methods cannot achieve satisfying accuracy on molecules. So our work leverages deep learning to achieve an accurate OFDFT method. The method can achieve the same level of accuracy as conventional DFT; meanwhile, it inherits the cost scaling of OFDFT, hence is more efficient than the conventional DFT. 

HUIZINGA: OK, so we’re moving acronyms from DFT to OFDFT, and you’ve got an acronym that goes M-OFDFT. What does that stand for? 

LIU: The M represents molecules, since it is especially hard for classical or existing OFDFT to achieve a good accuracy on molecules. So our development tackles that challenge. 

HUIZINGA: Great. And I’m eager to hear about your methodology and your findings. So let’s go there. Tell us a bit about how you conducted this research and what your methodology was. 

LIU: Yeah. Regarding methodology, let me delve a bit into some details. We follow the formulation of OFDFT, which solves the electronic structure by optimizing the electron density, where the optimization objective is to minimize the electronic energy. The challenge in OFDFT is, part of the electronic energy, specifically the kinetic energy, is hard to calculate accurately, especially for molecular systems. Existing computation formulas are based on approximate physical models, but the approximation accuracy is not satisfying. Our method uses a deep learning model to calculate the kinetic energy. We train the model on labeled data, and by its powerful learning ability, the model can give a more accurate result. This is the general idea, but there are many technical challenges. For example, since the model is used as an optimization objective, it needs to capture the overall landscape of the function. The model cannot recover the landscape if only one labeled data point is provided. For this, we made a theoretical analysis of the data generation method and found a way to generate multiple labeled data points for each molecular structure. Moreover, we can also calculate a gradient label for each data point, which provides the slope information on the landscape. Another challenge is that the kinetic energy has a strong non-local effect, meaning that the model needs to account for the interaction between any pair of spots in space. This incurs a significant cost if using the conventional way to represent density—that is, using a grid. For this challenge, we choose to expand the density function on a set of basis functions and use the expansion coefficients to represent the density. The benefit is that it greatly reduces the representation dimension, which in turn reduces the cost for non-local calculation. These two examples are also the differences from other deep learning OFDFT works. There are more technical designs, and you may check them in the paper.
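
As a rough illustration of the setup Liu describes, here is a toy sketch, with made-up names and shapes rather than the paper’s implementation, of a network that maps density expansion coefficients to a kinetic energy and is supervised on both the energy label and its gradient label.

```python
# Toy sketch only: a model predicts kinetic energy from density expansion coefficients
# and is trained on both energy labels and gradient ("slope") labels. Illustrative, not M-OFDFT.
import torch
import torch.nn as nn

class KineticEnergyModel(nn.Module):
    def __init__(self, n_coeffs: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coeffs, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coeffs: torch.Tensor) -> torch.Tensor:
        return self.net(coeffs).squeeze(-1)

def training_step(model, coeffs, energy_label, grad_label, optimizer):
    coeffs = coeffs.clone().requires_grad_(True)
    energy = model(coeffs)
    # Gradient of the predicted energy w.r.t. the density coefficients,
    # matched against the precomputed gradient label.
    grad_pred = torch.autograd.grad(energy.sum(), coeffs, create_graph=True)[0]
    loss = ((energy - energy_label) ** 2).mean() + ((grad_pred - grad_label) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```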

HUIZINGA: So talk about your findings. After you completed and analyzed what you did, what were your major takeaways or findings? 

LIU: Yeah, let’s dive into the details, into the empirical findings. We find that our deep learning OFDFT, abbreviated as M-OFDFT, is much more accurate than existing OFDFT methods with tens to hundreds times lower error and achieves the same level of accuracy as the conventional DFT. 

HUIZINGA: Wow … 

LIU: On the other hand, the speed is indeed improved over conventional DFT. For example, on a protein molecule with more than 700 atoms, our method achieves nearly 30 times speedup. The empirical cost scaling is lower than quadratic and is one order less than that of conventional DFT. So the speed advantage would be more significant on larger molecules. I’d also like to mention an interesting observation. Since our method is based on deep learning, a natural question is, how accurate would the method be if applied to much larger molecules than those used for training the deep learning model? This is the generalization challenge and is one of the major challenges of deep learning method for molecular science applications. We investigated this question in our method and found that the error increases slower than linearly with molecular size. Although this is not perfect since the error is still increasing, but it is better than using the same model to predict the property directly, which shows an error that increases faster than linearly. This somehow shows the benefits of leveraging the OFDFT framework for using a deep learning method to solve molecular tasks. 

HUIZINGA: Well, let’s talk about real-world impact for a second. You’ve got this research going on in the lab, so to speak. How does it impact real-life situations? Who does this work help the most and how? 

LIU: Since our method achieves the same level of accuracy as conventional DFT but runs faster, it could accelerate molecular property calculation and molecular dynamic simulation especially for large molecules; hence, it has the potential to accelerate solving problems such as medicine development and material discovery. Our method also shows that AI techniques can create new opportunities for other electronic structure formulations, which could inspire more methods to break the long-standing tradeoff between accuracy and efficiency in this field. 

HUIZINGA: So if there was one thing you wanted our listeners to take away, just one little nugget from your research, what would that be? 

LIU: If only for one thing, that would be we develop the method that solves molecular properties more accurately and efficiently than the current portfolio of available methods. 

HUIZINGA: So finally, Chang, what are the big unanswered questions and unsolved problems that remain in this field, and what’s next on your research agenda? 

LIU: Yeah, sure. There indeed remains problems and challenges. One remaining challenge mentioned above is the generalization to molecules much larger than those in training. Although the OFDFT method is better than directly predicting properties, there is still room to improve. One possibility is to consider the success of large language models by including more abundant data and more diverse data in training and using a large model to digest all the data. This can be costly, but it may give us a surprise. And another way we may consider is to incorporate mathematical structures of the learning target functional into the model, such as convexity, lower and upper bounds, and some invariance. And such structures could regularize the model when applied to larger systems than it has seen during training. So we have actually incorporated some such structures into the model, for example, the geometric invariance, but other mathematical properties are nontrivial to incorporate. We made some discussions in the paper, and we’ll engage working on that direction in the future. The ultimate goal underlying this technical development is to build a computational method that is fast and accurate universally so that we can simulate the molecular world of any kind. 

[MUSIC PLAYS] 

HUIZINGA: Well, Chang Liu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts. You can also read it on arXiv, or you can check out the March 2024 issue of Nature Computational Science. See you next time on Abstracts.

[MUSIC FADES]

Modeling Extremely Large Images with xT


As computer vision researchers, we believe that every pixel can tell a story. However, there seems to be a writer’s block settling into the field when it comes to dealing with large images. Large images are no longer rare—the cameras we carry in our pockets and those orbiting our planet snap pictures so big and detailed that they stretch our current best models and hardware to their breaking points when handling them. Generally, we face a quadratic increase in memory usage as a function of image size.
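
A back-of-the-envelope sketch of that growth, assuming ViT-style 16-pixel patches and a full fp32 self-attention matrix; the numbers are indicative, not measurements of any particular model.

```python
# Why naive handling of large images blows up: patch tokens grow with image area,
# and full self-attention stores one score per token pair (quadratic in token count).
def attention_matrix_entries(side_px: int, patch: int = 16) -> int:
    tokens = (side_px // patch) ** 2          # ViT-style patch tokens
    return tokens * tokens                    # one attention score per token pair

for side in (224, 1024, 4096, 16384):
    entries = attention_matrix_entries(side)
    gib = entries * 4 / 2**30                 # fp32 scores, per head per layer
    print(f"{side:>6}px -> {entries:>16,} scores (~{gib:,.1f} GiB)")
```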

Today, we make one of two sub-optimal choices when handling large images: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. We take another look at these approaches and introduce $x$T, a new framework to model large images end-to-end on contemporary GPUs while effectively aggregating global context with local details.


Architecture for the $x$T framework.


TiC-CLIP: Continual Training of CLIP Models

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate… (Apple Machine Learning Research)

Computer-aided diagnosis for lung cancer screening

Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans.

The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability.

At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential.

To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADS V1.1 and Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists.

Developing an interface to communicate model results

Integrating ML models into radiologist workflows involves understanding the nuances and goals of their tasks to meaningfully support them. In the case of lung cancer screening, hospitals follow various country-specific guidelines that are regularly updated. For example, in the US, Lung-RADS V1.1 assigns an alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions.

Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system.

Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.

The assistive lung cancer screening system comprises 13 models and has a high-level architecture similar to the end-to-end system used in prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. The system was deployed on Google Cloud using a Google Kubernetes Engine (GKE) that pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in DICOM stores.

Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store.
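
A hypothetical orchestration sketch of the staged flow described above (segment, assess, localize, rate); the function names, types and case-level aggregation rule are assumptions for illustration, not Google’s implementation or a real Cloud API.

```python
# Hypothetical pipeline sketch: the stage functions stand in for the system's 13 coordinated models.
from dataclasses import dataclass

RATINGS = ("no suspicion", "probably benign", "suspicious", "highly suspicious")

@dataclass
class RegionFinding:
    region: tuple   # location of the candidate lesion in the CT volume (placeholder type)
    rating: str     # one of RATINGS

def screen_case(ct_volume):
    lungs = segment_lungs(ct_volume)                      # stage 1: isolate lung tissue
    overall = assess_overall_suspicion(lungs)             # stage 2: case-level assessment
    regions = locate_suspicious_regions(lungs, top_k=3)   # stage 3: top candidate regions
    findings = [RegionFinding(r, rate_region(lungs, r, overall)) for r in regions]
    # Report the most severe regional rating as the case rating (assumed aggregation rule).
    case_rating = max((f.rating for f in findings), key=RATINGS.index, default=RATINGS[0])
    return case_rating, findings

# Placeholder stage functions.
def segment_lungs(ct_volume): ...
def assess_overall_suspicion(lungs): ...
def locate_suspicious_regions(lungs, top_k: int): ...
def rate_region(lungs, region, overall) -> str: ...
```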

Reader studies

To evaluate the system’s utility in improving clinical performance, we conducted two reader studies (i.e., experiments designed to assess clinical performance comparing expert performance with and without the aid of a technology) with 12 radiologists using pre-existing, de-identified CT scans. We presented 627 challenging cases to 6 US-based and 6 Japan-based radiologists. In the experimental setup, readers were divided into two groups that read each case twice, with and without assistance from the model. Readers were asked to apply scoring guidelines they typically use in their clinical practice and report their overall suspicion of cancer for each case. We then compared the results of the reader’s responses to measure the impact of the model on their workflow and decisions. The score and suspicion level were judged against the actual cancer outcomes of the individuals to measure sensitivity, specificity, and area under the ROC curve (AUC) values. These were compared with and without assistance.

A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.
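
To make the with-versus-without comparison concrete, here is a small sketch that computes reader specificity from flagged-versus-outcome records; the records below are made-up placeholders, not study data.

```python
# Toy specificity comparison for a reader with and without model assistance.
# Records are (flagged_actionable, cancer_positive); the values below are invented.
def specificity(records):
    negatives = [flagged for flagged, positive in records if not positive]
    return sum(1 for flagged in negatives if not flagged) / len(negatives)

# Each reader reads the same cases twice (unassisted, then assisted, with a washout period).
unassisted = [(True, False), (False, False), (True, True), (False, False), (True, False)]
assisted   = [(False, False), (False, False), (True, True), (False, False), (True, False)]

delta = specificity(assisted) - specificity(unassisted)
print(f"specificity change with assistance: {delta:+.2%}")
```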

The ability to conduct these studies using the same interface highlights its generalizability to completely different cancer scoring systems, and the generalization of the model and assistive capability to different patient populations. Our study results demonstrated that when radiologists used the system in their clinical evaluation, they had an increased ability to correctly identify lung images without actionable lung cancer findings (i.e., specificity) by an absolute 5–7% compared to when they didn’t use the assistive system. This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening.

Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer positive individuals remained the same.

Translating this into real-world impact through partnership

The system results demonstrate the potential for fewer follow-up visits, reduced anxiety, as well as lower overall costs for lung cancer screening. In an effort to translate this research into real-world clinical impact, we are working with DeepHealth, a leading AI-powered health informatics provider, and Apollo Radiology International, a leading provider of radiology services in India, to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open-sourcing code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field.

Acknowledgements

Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.
