Detecting novel systemic biomarkers in external eye photos

Detecting novel systemic biomarkers in external eye photos

Last year we presented results demonstrating that a deep learning system (DLS) can be trained to analyze external eye photos and predict a person’s diabetic retinal disease status and elevated glycated hemoglobin (or HbA1c, a biomarker that indicates the three-month average level of blood glucose). It was previously unknown that external eye photos contained signals for these conditions. This exciting finding suggested the potential to reduce the need for specialized equipment since such photos can be captured using smartphones and other consumer devices. Encouraged by these findings, we set out to discover what other biomarkers can be found in this imaging modality.

In “A deep learning model for novel systemic biomarkers in photos of the external eye: a retrospective study”, published in Lancet Digital Health, we show that a number of systemic biomarkers spanning several organ systems (e.g., kidney, blood, liver) can be predicted from external eye photos with an accuracy surpassing that of a baseline logistic regression model that uses only clinicodemographic variables, such as age and years with diabetes. The comparison with a clinicodemographic baseline is useful because risk for some diseases could also be assessed using a simple questionnaire, and we seek to understand if the model interpreting images is doing better. This work is in the early stages, but it has the potential to increase access to disease detection and monitoring through new non-invasive care pathways.

A model generating predictions for an external eye photo.

Model development and evaluation

To develop our model, we worked with partners at EyePACS and the Los Angeles County Department of Health Services to create a retrospective de-identified dataset of external eye photos and measurements in the form of laboratory tests and vital signs (e.g., blood pressure). We filtered down to 31 lab tests and vitals that were more commonly available in this dataset and then trained a multi-task DLS with a classification “head” for each lab and vital to predict abnormalities in these measurements.

Importantly, evaluating the performance of many abnormalities in parallel can be problematic because of a higher chance of finding a spurious and erroneous result (i.e., due to the multiple comparisons problem). To mitigate this, we first evaluated the model on a portion of our development dataset. Then, we narrowed the list down to the nine most promising prediction tasks and evaluated the model on our test datasets while correcting for multiple comparisons. Specifically, these nine tasks, their associated anatomy, and their significance for associated diseases are listed in the table below.

Prediction task       Organ system       Significance for associated diseases      
Albumin < 3.5 g/dL       Liver/Kidney       Indication of hypoalbuminemia, which can be due to decreased production of albumin from liver disease or increased loss of albumin from kidney disease.      
AST > 36.0 U/L       Liver      

Indication of liver disease (i.e., damage to the liver or biliary obstruction), commonly caused by viral infections, alcohol use, and obesity.

Calcium < 8.6 mg/dL       Bone / Mineral       Indication of hypocalcemia, which is most commonly caused by vitamin D deficiency or parathyroid disorders.      
eGFR < 60.0 mL/min/1.73 m2       Kidney      

Indication of chronic kidney disease, most commonly due to diabetes and high blood pressure.

Hgb < 11.0 g/dL       Blood count       Indication of anemia which may be due to blood loss, chronic medical conditions, or poor diet.      
Platelet < 150.0 103/µL       Blood count      

Indication of thrombocytopenia, which can be due to decreased production of platelets from bone marrow disorders, such as leukemia or lymphoma, or increased destruction of platelets due to autoimmune disease or medication side effects.

TSH > 4.0 mU/L       Thyroid       Indication of hypothyroidism, which affects metabolism and can be caused by many different conditions.      
Urine albumin/creatinine ratio (ACR) ≥ 300.0 mg/g       Kidney      

Indication of chronic kidney disease, most commonly due to diabetes and high blood pressure.

WBC < 4.0 103/µL       Blood count       Indication of leukopenia which can affect the body’s ability to fight infection.      

Key results

As in our previous work, we compared our external eye model to a baseline model (a logistic regression model taking clinicodemographic variables as input) by computing the area under the receiver operator curve (AUC). The AUC ranges from 0 to 100%, with 50% indicating random performance and higher values indicating better performance. For all but one of the nine prediction tasks, our model statistically outperformed the baseline model. In terms of absolute performance, the model’s AUCs ranged from 62% to 88%. While these levels of accuracy are likely insufficient for diagnostic applications, it is in line with other initial screening tools, like mammography and pre-screening for diabetes, used to help identify individuals who may benefit from additional testing. And as a non-invasive accessible modality, taking photographs of the external eye may offer the potential to help screen and triage patients for confirmatory blood tests or other clinical follow-up.

Results on the EyePACS test set, showing AUC performance of our DLS compared to a baseline model. The variable “n” refers to the total number of datapoints, and “N” refers to the number of positives. Error bars show 95% confidence intervals computed using the DeLong method. Indicates that the target was pre-specified as secondary analysis; all others were pre-specified as primary analysis.

The external eye photos used in both this and the prior study were collected using table top cameras that include a head rest for patient stabilization and produce high quality images with good lighting. Since image quality may be worse in other settings, we wanted to explore to what extent the DLS model is robust to quality changes, starting with image resolution. Specifically, we scaled the images in the dataset down to a range of sizes, and measured performance of the DLS when retrained to handle the downsampled images.

Below we show a selection of the results of this experiment (see the paper for more complete results). These results demonstrate that the DLS is fairly robust and, in most cases, outperforms the baseline model even if the images are scaled down to 150×150 pixels. This pixel count is under 0.1 megapixels, much smaller than the typical smartphone camera.

Effect of input image resolution. Top: Sample images scaled to different sizes for this experiment. Bottom: Comparison of the performance of the DLS (red) trained and evaluated on different image sizes and the baseline model (blue). Shaded regions show 95% confidence intervals computed using the DeLong method.

Conclusion and future directions

Our previous research demonstrated the promise of the external eye modality. In this work, we performed a more exhaustive search to identify the possible systemic biomarkers that can be predicted from these photos. Though these results are promising, many steps remain to determine whether technology like this can help patients in the real world. In particular, as we mention above, the imagery in our studies were collected using large tabletop cameras in a setting that controlled factors such as lighting and head positioning. Furthermore, the datasets used in this work consist primarily of patients with diabetes and did not have sufficient representation of a number of important subgroups – more focused data collection for DLS refinement and evaluation on a more general population and across subgroups will be needed before considering clinical use.

We are excited to explore how these models generalize to smartphone imagery given the potential reach and scale that this enables for the technology. To this end, we are continuing to work with our co-authors at partner institutions like Chang Gung Memorial Hospital in Taiwan, Aravind Eye Hospital in India, and EyePACS in the United States to collect datasets of imagery captured on smartphones. Our early results are promising and we look forward to sharing more in the future.


This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and cross functional contributors. Key contributors to this project include: Boris Babenko, Ilana Traynis, Christina Chen, Preeti Singh, Akib Uddin, Jorge Cuadros, Lauren P. Daskivich, April Y. Maa, Ramasamy Kim, Eugene Yu-Chuan Kang, Yossi Matias, Greg S. Corrado, Lily Peng, Dale R. Webster, Christopher Semturs, Jonathan Krause, Avinash V Varadarajan, Naama Hammel and Yun Liu. We also thank Dave Steiner, Yuan Liu, and Michael Howell for their feedback on the manuscript; Amit Talreja for reviewing code for the paper; Elvia Figueroa and the Los Angeles County Department of Health Services Teleretinal Diabetic Retinopathy Screening program staff for data collection and program support; Andrea Limon and Nikhil Kookkiri for EyePACS data collection and support; Dr. Charles Demosthenes for extracting the data and Peter Kuzmak for getting images for the VA data. Last but not least, a special thanks to Tom Small for the animation used in this blog post.

Read More

Visual language maps for robot navigation

Visual language maps for robot navigation

People are excellent navigators of the physical world, due in part to their remarkable ability to build cognitive maps that form the basis of spatial memory — from localizing landmarks at varying ontological levels (like a book on a shelf in the living room) to determining whether a layout permits navigation from point A to point B. Building robots that are proficient at navigation requires an interconnected understanding of (a) vision and natural language (to associate landmarks or follow instructions), and (b) spatial reasoning (to connect a map representing an environment to the true spatial distribution of objects). While there have been many recent advances in training joint visual-language models on Internet-scale data, figuring out how to best connect them to a spatial representation of the physical world that can be used by robots remains an open research question.

To explore this, we collaborated with researchers at the University of Freiburg and Nuremberg to develop Visual Language Maps (VLMaps), a map representation that directly fuses pre-trained visual-language embeddings into a 3D reconstruction of the environment. VLMaps, which is set to appear at ICRA 2023, is a simple approach that allows robots to (1) index visual landmarks in the map using natural language descriptions, (2) employ Code as Policies to navigate to spatial goals, such as “go in between the sofa and TV” or “move three meters to the right of the chair”, and (3) generate open-vocabulary obstacle maps — allowing multiple robots with different morphologies (mobile manipulators vs. drones, for example) to use the same VLMap for path planning. VLMaps can be used out-of-the-box without additional labeled data or model fine-tuning, and outperforms other zero-shot methods by over 17% on challenging object-goal and spatial-goal navigation tasks in Habitat and Matterport3D. We are also releasing the code used for our experiments along with an interactive simulated robot demo.

VLMaps can be built by fusing pre-trained visual-language embeddings into a 3D reconstruction of the environment. At runtime, a robot can query the VLMap to locate visual landmarks given natural language descriptions, or to build open-vocabulary obstacle maps for path planning.

Classic 3D maps with a modern multimodal twist

VLMaps combines the geometric structure of classic 3D reconstructions with the expression of modern visual-language models pre-trained on Internet-scale data. As the robot moves around, VLMaps uses a pre-trained visual-language model to compute dense per-pixel embeddings from posed RGB camera views, and integrates them into a large map-sized 3D tensor aligned with an existing 3D reconstruction of the physical world. This representation allows the system to localize landmarks given their natural language descriptions (such as “a book on a shelf in the living room”) by comparing their text embeddings to all locations in the tensor and finding the closest match. Querying these target locations can be used directly as goal coordinates for language-conditioned navigation, as primitive API function calls for Code as Policies to process spatial goals (e.g., code-writing models interpret “in between” as arithmetic between two locations), or to sequence multiple navigation goals for long-horizon instructions.

# move first to the left side of the counter, then move between the sink and the oven, then move back and forth to the sofa and the table twice.
robot.move_in_between('sink', 'oven')
pos1 = robot.get_pos('sofa')
pos2 = robot.get_pos('table')
for i in range(2):
# move 2 meters north of the laptop, then move 3 meters rightward.

VLMaps can be used to return the map coordinates of landmarks given natural language descriptions, which can be wrapped as a primitive API function call for Code as Policies to sequence multiple goals long-horizon navigation instructions.


We evaluate VLMaps on challenging zero-shot object-goal and spatial-goal navigation tasks in Habitat and Matterport3D, without additional training or fine-tuning. The robot is asked to navigate to four subgoals sequentially specified in natural language. We observe that VLMaps significantly outperforms strong baselines (including CoW and LM-Nav) by up to 17% due to its improved visuo-lingual grounding.

Tasks    Number of subgoals in a row       Independent
   1 2 3 4   
LM-Nav    26 4 1 1       26   
CoW    42 15 7 3       36   
CLIP MAP    33 8 2 0       30   
VLMaps (ours)      59 34 22 15       59   
GT Map    91 78 71 67       85   

The VLMaps-approach performs favorably over alternative open-vocabulary baselines on multi-object navigation (success rate [%]) and specifically excels on longer-horizon tasks with multiple sub-goals.

A key advantage of VLMaps is its ability to understand spatial goals, such as “go in between the sofa and TV” or “move three meters to the right of the chair”. Experiments for long-horizon spatial-goal navigation show an improvement by up to 29%. To gain more insights into the regions in the map that are activated for different language queries, we visualize the heatmaps for the object type “chair”.

The improved vision and language grounding capabilities of VLMaps, which contains significantly fewer false positives than competing approaches, enable it to navigate zero-shot to landmarks using language descriptions.

Open-vocabulary obstacle maps

A single VLMap of the same environment can also be used to build open-vocabulary obstacle maps for path planning. This is done by taking the union of binary-thresholded detection maps over a list of landmark categories that the robot can or cannot traverse (such as “tables”, “chairs”, “walls”, etc.). This is useful since robots with different morphologies may move around in the same environment differently. For example, “tables” are obstacles for a large mobile robot, but may be traversable for a drone. We observe that using VLMaps to create multiple robot-specific obstacle maps improves navigation efficiency by up to 4% (measured in terms of task success rates weighted by path length) over using a single shared obstacle map for each robot. See the paper for more details.

Experiments with a mobile robot (LoCoBot) and drone in AI2THOR simulated environments. Left: Top-down view of an environment. Middle columns: Agents’ observations during navigation. Right: Obstacle maps generated for different embodiments with corresponding navigation paths.


VLMaps takes an initial step towards grounding pre-trained visual-language information onto spatial map representations that can be used by robots for navigation. Experiments in simulated and real environments show that VLMaps can enable language-using robots to (i) index landmarks (or spatial locations relative to them) given their natural language descriptions, and (ii) generate open-vocabulary obstacle maps for path planning. Extending VLMaps to handle more dynamic environments (e.g., with moving people) is an interesting avenue for future work.

Open-source release

We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional videos and code to benchmark agents in simulation.


We would like to thank the co-authors of this research: Chenguang Huang and Wolfram Burgard.

Read More

Vid2Seq: a pretrained visual language model for describing multi-event videos

Vid2Seq: a pretrained visual language model for describing multi-event videos

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which consists of describing short videos with a single sentence.

Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations — for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.

In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model pre-trained on millions of narrated videos improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

For a general overview of how we achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.

Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).


We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.


This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.

Read More

Responsible AI at Google Research: The Impact Lab

Responsible AI at Google Research: The Impact Lab

Globalized technology has the potential to create large-scale societal impact, and having a grounded research approach rooted in existing international human and civil rights standards is a critical component to assuring responsible and ethical AI development and deployment. The Impact Lab team, part of Google’s Responsible AI Team, employs a range of interdisciplinary methodologies to ensure critical and rich analysis of the potential implications of technology development. The team’s mission is to examine socioeconomic and human rights impacts of AI, publish foundational research, and incubate novel mitigations enabling machine learning (ML) practitioners to advance global equity. We study and develop scalable, rigorous, and evidence-based solutions using data analysis, human rights, and participatory frameworks.

The uniqueness of the Impact Lab’s goals is its multidisciplinary approach and the diversity of experience, including both applied and academic research. Our aim is to expand the epistemic lens of Responsible AI to center the voices of historically marginalized communities and to overcome the practice of ungrounded analysis of impacts by offering a research-based approach to understand how differing perspectives and experiences should impact the development of technology.

What we do

In response to the accelerating complexity of ML and the increased coupling between large-scale ML and people, our team critically examines traditional assumptions of how technology impacts society to deepen our understanding of this interplay. We collaborate with academic scholars in the areas of social science and philosophy of technology and publish foundational research focusing on how ML can be helpful and useful. We also offer research support to some of our organization’s most challenging efforts, including the 1,000 Languages Initiative and ongoing work in the testing and evaluation of language and generative models. Our work gives weight to Google’s AI Principles.

To that end, we:

  • Conduct foundational and exploratory research towards the goal of creating scalable socio-technical solutions
  • Create datasets and research-based frameworks to evaluate ML systems
  • Define, identify, and assess negative societal impacts of AI
  • Create responsible solutions to data collection used to build large models
  • Develop novel methodologies and approaches that support responsible deployment of ML models and systems to ensure safety, fairness, robustness, and user accountability
  • Translate external community and expert feedback into empirical insights to better understand user needs and impacts
  • Seek equitable collaboration and strive for mutually beneficial partnerships

We strive not only to reimagine existing frameworks for assessing the adverse impact of AI to answer ambitious research questions, but also to promote the importance of this work.

Current research efforts

Understanding social problems

Our motivation for providing rigorous analytical tools and approaches is to ensure that social-technical impact and fairness is well understood in relation to cultural and historical nuances. This is quite important, as it helps develop the incentive and ability to better understand communities who experience the greatest burden and demonstrates the value of rigorous and focused analysis. Our goals are to proactively partner with external thought leaders in this problem space, reframe our existing mental models when assessing potential harms and impacts, and avoid relying on unfounded assumptions and stereotypes in ML technologies. We collaborate with researchers at Stanford, University of California Berkeley, University of Edinburgh, Mozilla Foundation, University of Michigan, Naval Postgraduate School, Data & Society, EPFL, Australian National University, and McGill University.

We examine systemic social issues and generate useful artifacts for responsible AI development.


We examine systemic social issues and generate useful artifacts for responsible AI development.
We examine systemic social issues and generate useful artifacts for responsible AI development.


Centering underrepresented voices

We also developed the Equitable AI Research Roundtable (EARR), a novel community-based research coalition created to establish ongoing partnerships with external nonprofit and research organization leaders who are equity experts in the fields of education, law, social justice, AI ethics, and economic development. These partnerships offer the opportunity to engage with multi-disciplinary experts on complex research questions related to how we center and understand equity using lessons from other domains. Our partners include PolicyLink; The Education Trust – West; Notley; Partnership on AI; Othering and Belonging Institute at UC Berkeley; The Michelson Institute for Intellectual Property, HBCU IP Futures Collaborative at Emory University; Center for Information Technology Research in the Interest of Society (CITRIS) at the Banatao Institute; and the Charles A. Dana Center at the University of Texas, Austin. The goals of the EARR program are to: (1) center knowledge about the experiences of historically marginalized or underrepresented groups, (2) qualitatively understand and identify potential approaches for studying social harms and their analogies within the context of technology, and (3) expand the lens of expertise and relevant knowledge as it relates to our work on responsible and safe approaches to AI development.

Through semi-structured workshops and discussions, EARR has provided critical perspectives and feedback on how to conceptualize equity and vulnerability as they relate to AI technology. We have partnered with EARR contributors on a range of topics from generative AI, algorithmic decision making, transparency, and explainability, with outputs ranging from adversarial queries to frameworks and case studies. Certainly the process of translating research insights across disciplines into technical solutions is not always easy but this research has been a rewarding partnership. We present our initial evaluation of this engagement in this paper.

EARR: Components of the ML development life cycle in which multidisciplinary knowledge is key for mitigating human biases.

Grounding in civil and human rights values

In partnership with our Civil and Human Rights Program, our research and analysis process is grounded in internationally recognized human rights frameworks and standards including the Universal Declaration of Human Rights and the UN Guiding Principles on Business and Human Rights. Utilizing civil and human rights frameworks as a starting point allows for a context-specific approach to research  that takes into account how a technology will be deployed and its community impacts. Most importantly, a rights-based approach to research enables us to prioritize conceptual and applied methods that emphasize the importance of understanding the most vulnerable users and the most salient harms to better inform day-to-day decision making, product design and long-term strategies.

Ongoing work

Social context to aid in dataset development and evaluation

We seek to employ an approach to dataset curation, model development and evaluation that is rooted in equity and that avoids expeditious but potentially risky approaches, such as utilizing incomplete data or not considering the historical and social cultural factors related to a dataset. Responsible data collection and analysis requires an additional level of careful consideration of the context in which the data are created. For example, one may see differences in outcomes across demographic variables that will be used to build models and should question the structural and system-level factors at play as some variables could ultimately be a reflection of historical, social and political factors. By using proxy data, such as race or ethnicity, gender, or zip code, we are systematically merging together the lived experiences of an entire group of diverse people and using it to train models that can recreate and maintain harmful and inaccurate character profiles of entire populations. Critical data analysis also requires a careful understanding that correlations or relationships between variables do not imply causation; the association we witness is often caused by additional multiple variables.

Relationship between social context and model outcomes

Building on this expanded and nuanced social understanding of data and dataset construction, we also approach the problem of anticipating or ameliorating the impact of ML models once they have been deployed for use in the real world. There are myriad ways in which the use of ML in various contexts — from education to health care — has exacerbated existing inequity because the developers and decision-making users of these systems lacked the relevant social understanding, historical context, and did not involve relevant stakeholders. This is a research challenge for the field of ML in general and one that is central to our team.

Globally responsible AI centering community experts

Our team also recognizes the saliency of understanding the socio-technical context globally. In line with Google’s mission to “organize the world’s information and make it universally accessible and useful”, our team is engaging in research partnerships globally. For example, we are collaborating with The Natural Language Processing team and the Human Centered team in the Makerere Artificial Intelligence Lab in Uganda to research cultural and language nuances as they relate to language model development.


We continue to address the impacts of ML models deployed in the real world by conducting further socio-technical research and engaging external experts who are also part of the communities that are historically and globally disenfranchised. The Impact Lab is excited to offer an approach that contributes to the development of solutions for applied problems through the utilization of social-science, evaluation, and human rights epistemologies.


We would like to thank each member of the Impact Lab team — Jamila Smith-Loud, Andrew Smart, Jalon Hall, Darlene Neal, Amber Ebinama, and Qazi Mamunur Rashid — for all the hard work they do to ensure that ML is more responsible to its users and society across communities and around the world.

Read More

Learning from deep learning: a case study of feature discovery and validation in pathology

Learning from deep learning: a case study of feature discovery and validation in pathology

When a patient is diagnosed with cancer, one of the most important steps is examination of the tumor under a microscope by pathologists to determine the cancer stage and to characterize the tumor. This information is central to understanding clinical prognosis (i.e., likely patient outcomes) and for determining the most appropriate treatment, such as undergoing surgery alone versus surgery plus chemotherapy. Developing machine learning (ML) tools in pathology to assist with the microscopic review represents a compelling research area with many potential applications.

Previous studies have shown that ML can accurately identify and classify tumors in pathology images and can even predict patient prognosis using known pathology features, such as the degree to which gland appearances deviate from normal. While these efforts focus on using ML to detect or quantify known features, alternative approaches offer the potential to identify novel features. The discovery of new features could in turn further improve cancer prognostication and treatment decisions for patients by extracting information that isn’t yet considered in current workflows.

Today, we’d like to share progress we’ve made over the past few years towards identifying novel features for colorectal cancer in collaboration with teams at the Medical University of Graz in Austria and the University of Milano-Bicocca (UNIMIB) in Italy. Below, we will cover several stages of the work: (1) training a model to predict prognosis from pathology images without specifying the features to use, so that it can learn what features are important; (2) probing that prognostic model using explainability techniques; and (3) identifying a novel feature and validating its association with patient prognosis. We describe this feature and evaluate its use by pathologists in our recently published paper, “Pathologist validation of a machine-learned feature for colon cancer risk stratification”. To our knowledge, this is the first demonstration that medical experts can learn new prognostic features from machine learning, a promising start for the future of this “learning from deep learning” paradigm.

Training a prognostic model to learn what features are important

One potential approach to identifying novel features is to train ML models to directly predict patient outcomes using only the images and the paired outcome data. This is in contrast to training models to predict “intermediate” human-annotated labels for known pathologic features and then using those features to predict outcomes.

Initial work by our team showed the feasibility of training models to directly predict prognosis for a variety of cancer types using the publicly available TCGA dataset. It was especially exciting to see that for some cancer types, the model’s predictions were prognostic after controlling for available pathologic and clinical features. Together with collaborators from the Medical University of Graz and the Biobank Graz, we subsequently extended this work using a large de-identified colorectal cancer cohort. Interpreting these model predictions became an intriguing next step, but common interpretability techniques were challenging to apply in this context and did not provide clear insights.

Interpreting the model-learned features

To probe the features used by the prognostic model, we used a second model (trained to identify image similarity) to cluster cropped patches of the large pathology images. We then used the prognostic model to compute the average ML-predicted risk score for each cluster.

One cluster stood out for its high average risk score (associated with poor prognosis) and its distinct visual appearance. Pathologists described the images as involving high grade tumor (i.e., least-resembling normal tissue) in close proximity to adipose (fat) tissue, leading us to dub this cluster the “tumor adipose feature” (TAF); see next figure for detailed examples of this feature. Further analysis showed that the relative quantity of TAF was itself highly and independently prognostic.

A prognostic ML model was developed to predict patient survival directly from unannotated giga-pixel pathology images. A second image similarity model was used to cluster cropped patches of pathology images. The prognostic model was used to compute the average model-predicted risk score for each cluster. One cluster, dubbed the “tumor adipose feature” (TAF) stood out in terms of its high average risk score (associated with poor survival) and distinct visual appearance. Pathologists learned to identify TAF and pathologist scoring for TAF was shown to be prognostic.
Left: H&E pathology slide with an overlaid heatmap indicating locations of the tumor adipose feature (TAF). Regions highlighted in red/orange are considered to be more likely TAF by the image similarity model, compared to regions highlighted in green/blue or regions not highlighted at all. Right: Representative collection of TAF patches across multiple cases.

Validating that the model-learned feature can be used by pathologists

These studies provided a compelling example of the potential for ML models to predict patient outcomes and a methodological approach for obtaining insights into model predictions. However, there remained the intriguing questions of whether pathologists could learn and score the feature identified by the model while maintaining demonstrable prognostic value.

In our most recent paper, we collaborated with pathologists from the UNIMIB to investigate these questions. Using example images of TAF from the previous publication to learn and understand this feature of interest, UNIMIB pathologists developed scoring guidelines for TAF. If TAF was not seen, the case was scored as “absent”, and if TAF was observed, then “unifocal”, “multifocal”, and “widespread” categories were used to indicate the relative quantity. Our study showed that pathologists could reproducibly identify the ML-derived TAF and that their scoring for TAF provided statistically significant prognostic value on an independent retrospective dataset. To our knowledge, this is the first demonstration of pathologists learning to identify and score a specific pathology feature originally identified by an ML-based approach.

Putting things in context: learning from deep learning as a paradigm

Our work is an example of people “learning from deep learning”. In traditional ML, models learn from hand-engineered features informed by existing domain knowledge. More recently, in the deep learning era, a combination of large-scale model architectures, compute, and datasets has enabled learning directly from raw data, but this is often at the expense of human interpretability. Our work couples the use of deep learning to predict patient outcomes with interpretability methods, to extract new knowledge that could be applied by pathologists. We see this process as a natural next step in the evolution of applying ML to problems in medicine and science, moving from the use of ML to distill existing human knowledge to people using ML as a tool for knowledge discovery.

Traditional ML focused on engineering features from raw data using existing human knowledge. Deep learning enables models to learn features directly from raw data at the expense of human interpretability. Coupling deep learning with interpretability methods provides an avenue for expanding the frontiers of scientific knowledge by learning from deep learning.


This work would not have been possible without the efforts of coauthors Vincenzo L’Imperio, Markus Plass, Heimo Muller, Nicolò’ Tamini, Luca Gianotti, Nicola Zucchini, Robert Reihs, Greg S. Corrado, Dale R. Webster, Lily H. Peng, Po-Hsuan Cameron Chen, Marialuisa Lavitrano, David F. Steiner, Kurt Zatloukal, Fabio Pagni. We also appreciate the support from Verily Life Sciences and the Google Health Pathology teams – in particular Timo Kohlberger, Yunnan Cai, Hongwu Wang, Kunal Nagpal, Craig Mermel, Trissia Brown, Isabelle Flament-Auvigne, and Angela Lin. We also appreciate manuscript feedback from Akinori Mitani, Rory Sayres, and Michael Howell, and illustration help from Abi Jones. This work would also not have been possible without the support of Christian Guelly, Andreas Holzinger, Robert Reihs, Farah Nader, the Biobank Graz, the efforts of the slide digitization team at the Medical University Graz, the participation of the pathologists who reviewed and annotated cases during model development, and the technicians of the UNIMIB team.

Read More

PaLM-E: An embodied multimodal language model

PaLM-E: An embodied multimodal language model

Recent years have seen tremendous advances across machine learning domains, from models that can explain jokes or answer visual questions in a variety of languages to those that can produce images based on text descriptions. Such innovations have been possible due to the increase in availability of large scale datasets along with novel advances that enable the training of models on these data. While scaling of robotics models has seen some success, it is outpaced by other domains due to a lack of datasets available on a scale comparable to large text corpora or image datasets.

Today we introduce PaLM-E, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with PaLM, a powerful large language model, and “embodied” it (the “E” in PaLM-E), by complementing it with sensor data from the robotic agent. This is the key difference from prior efforts to bring large language models to robotics — rather than relying on only textual input, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.


PaLM-E is a generalist model competent with robotics, vision, and language tasks. It can control robots, answer visual questions, and write text – and quantitatively excels at all three relative to state-of-the-art models.


An embodied  language model, and also a visual-language generalist

On the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally-capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, like quoting poetry, solving math equations or generating code.

PaLM-E combines our most recent large language model, PaLM, together with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language OK-VQA benchmark, without task-specific fine-tuning, and while retaining essentially the same general language performance as PaLM-540B.

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.

Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back to the input, the language model can iteratively generate a longer and longer text.

The inputs to PaLM-E are text and other modalities — images, robot states, scene embeddings, etc. — in an arbitrary order, which we call “multimodal sentences”. For example, an input might look like, “What happened between <img_1> and <img_2>?”, where <img_1> and <img_2> are two images. The output is text generated auto-regressively by PaLM-E, which could be an answer to a question, or a sequence of decisions in text form.

PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and/or images) and addresses tasks through multimodal language modeling.

The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles “words” (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.

We initialize PaLM-E for training with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, a.k.a. ViT). All parameters of the model can be updated during training.

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

Positive transfer of knowledge from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains.

Results show that PaLM-E can address a large set of robotics, vision and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Further, the visual-language data actually significantly improves the performance of the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.


We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy to translate text into low-level robot actions.

In the first example below, a person asks a mobile robot to bring a bag of chips to them. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to grab a green block. Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond the training data of that robot.

PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a chip bag. PaLM-E shows robustness against adversarial disturbances, such as putting the chip bag back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.

In the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as “sort the blocks by colors into corners,” on a different type of robot. It directly looks at the images and produces a sequence of shorter textually-represented actions — e.g., “Push the blue cube to the bottom right corner,” “Push the blue triangle there too.” — long-horizon tasks that were out of scope for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup.

PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very high number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to also solve these tasks, but it also leverages visual and language knowledge transfer in order to more effectively do so.

PaLM-E produces plans for a task and motion planning environment.

As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Further, this result is reached with a generalist model, without fine-tuning specifically on only that task.

PaLM-E exhibits capabilities like visual chain-of-thought reasoning in which the model breaks down its answering process in smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images although being trained on only single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms CC-by-2.0 and was posted to Flickr by kowarski. The image of Kobe Bryant is in the Public Domain. The other images were taken by us.


PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain. There are additional topics investigated in further detail in the paper, such as how to leverage neural scene representations with PaLM-E and also the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.

PaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler to other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.


This work was done in collaboration across several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussiant. Danny is a PhD student advised by Marc Toussaint at TU Berlin. We also would like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.

Read More