How to migrate from BoostedTrees Estimators to TensorFlow Decision Forests

Posted by Mathieu Guillame-Bert and Josh Gordon for the TensorFlow team

Decision forest models like random forests and gradient boosted trees are often the most effective tools available for working with tabular data. They provide many advantages over neural networks, including being easier to configure, and faster to train. Using trees greatly reduces the amount of code required to prepare your dataset, as they natively handle numeric, categorical, and missing features. And they often give good results out-of-the-box, with interpretable properties.

Although we usually think of TensorFlow as a library to train neural networks, a popular use case at Google is to use TensorFlow to create decision forests.

An animation of a decision tree classifying data.

This article provides a migration guide if you were previously creating tree-based models using tf.estimator.BoostedTrees, which was introduced in 2019. The Estimator API took care of much of the complexity of working with models in production, including distributed training and serialization. However, it is no longer recommended for new code.

If you are starting a new project, we recommend that you use TensorFlow Decision Forests (TF-DF). This library provides state-of-the-art algorithms for training, serving and interpreting decision forest models, with many benefits over the previous approach, notably regarding quality, speed, and ease of use.

To start, here are equivalent examples using the Estimator API and TF-DF to create a boosted tree model.

Previously, this is how you would train a gradient boosted trees model with tf.estimator.BoostedTrees (no longer recommended):

import tensorflow as tf

# Dataset generators
def make_dataset_fn(dataset_path):
  def make_dataset():
    data = ... # read dataset
    return tf.data.Dataset.from_tensor_slices(...data...).repeat(10).batch(64)
  return make_dataset

# List the possible values for the feature "f_2".
f_2_dictionary = ["NA", "red", "blue", "green"]

# The feature columns define the input features of the model.
feature_columns = [
  tf.feature_column.numeric_column("f_1"),
  tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list("f_2",
      f_2_dictionary,
      # A special value "missing" is used to represent missing values.
      default_value=0)
  ),
]

# Configure the estimator.
estimator = tf.estimator.BoostedTreesClassifier(
  n_trees=1000,
  feature_columns=feature_columns,
  n_classes=3,
  # Rule of thumb proposed in the BoostedTreesClassifier documentation.
  n_batches_per_layer=max(2, int(len(train_df) / 2 / FLAGS.batch_size)),
)

# Stop the training if the validation loss stops decreasing.
early_stopping_hook = tf.estimator.experimental.stop_if_no_decrease_hook(
  estimator,
  metric_name="loss",
  max_steps_without_decrease=100,
  min_steps=50)

tf.estimator.train_and_evaluate(
  estimator,
  train_spec=tf.estimator.TrainSpec(
    make_dataset_fn(train_path),
    hooks=[
      # Early stopping needs a CheckpointSaverHook.
      tf.train.CheckpointSaverHook(
        checkpoint_dir=input_config.raw.temp_dir, save_steps=500),
      early_stopping_hook,
    ]),
  eval_spec=tf.estimator.EvalSpec(make_dataset_fn(valid_path)))

How to train the same model using TensorFlow Decision Forests

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Load the datasets.
# This code is similar to the estimator example.
def make_dataset(dataset_path):
  data = ... # read dataset
  return tf.data.Dataset.from_tensor_slices(...data...).batch(64)

train_dataset = make_dataset(train_path)
valid_dataset = make_dataset(valid_path)

# List the input features of the model.
features = [
  tfdf.keras.FeatureUsage("f_1", tfdf.keras.FeatureSemantic.NUMERICAL),
  tfdf.keras.FeatureUsage("f_2", tfdf.keras.FeatureSemantic.CATEGORICAL),
]

model = tfdf.keras.GradientBoostedTreesModel(
  task=tfdf.keras.Task.CLASSIFICATION,
  num_trees=1000,
  features=features,
  exclude_non_specified_features=True)

model.fit(train_dataset, validation_data=valid_dataset)

# Export the model to a SavedModel.
model.save("project/model")

Remarks

  • While not explicit in this example, early stopping is automatically enabled and configured.
  • The dictionary of the “f_2” feature is automatically built and optimized (e.g. rare values are merged into an out-of-vocabulary item).
  • The number of classes (3 in this example) is automatically determined from the dataset.
  • The batch size (64 in this example) has no impact on the model training. Larger values are often preferable, as they make reading the dataset more efficient.

TF-DF is all about ease of use, and the previous example can be further simplified and improved, as shown next.

How to train a TensorFlow Decision Forests model (recommended solution)

import tensorflow_decision_forests as tfdf
import pandas as pd

# Pandas datasets can be used easily with pd_dataframe_to_tf_dataset.
train_df = pd.read_csv("project/train.csv")

# Convert the Pandas dataframe into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")

model = tfdf.keras.GradientBoostedTreesModel(num_trees=1000)
model.fit(train_ds)

Remarks

  • We did not specify the semantics (e.g. numerical or categorical) of the features. In this case, the semantics are automatically inferred.
  • We also didn’t list which input features to use. In this case, all the columns (except the label) are used. The list and semantics of the input features are visible in the training logs, or with the model inspector API (see the sketch after this list).
  • We did not specify any validation dataset. Each algorithm can optionally extract a validation dataset from the training examples, as best suits the algorithm. For example, by default, GradientBoostedTreesModel uses 10% of the training data for validation if no validation dataset is provided.
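
For example, a trained model can be queried for the input features it used, their inferred semantics, and its self-evaluation. The minimal sketch below uses the TF-DF inspector API on the model trained above:

# Inspect the trained model.
inspector = model.make_inspector()
print(inspector.features())              # Input features and their inferred semantics.
print(inspector.evaluation())            # Self-reported evaluation (e.g. on the internal validation data).
print(inspector.variable_importances())  # Which features matter most.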

Now, let’s look at a couple differences between the Estimator API and TF-DF.

Differences between the Estimator API and TF-DF

Type of algorithms

TF-DF is a collection of decision forest algorithms. This includes (but is not limited to) the Gradient Boosted Trees available with the Estimator API. Notably, TF-DF also supports Random Forests (great for noisy datasets) and a CART implementation (great for model interpretation).

In addition, for each of those algorithms, TF-DF includes many variations found in the literature and validated experimentally [1, 2, 3].

Exact vs approximate splits

The TF1 GBT Estimator uses an approximate tree learning algorithm. Informally, the Estimator builds trees by considering only a random subset of examples and a random subset of the conditions at each step.

By default, TF-DF uses an exact tree training algorithm. Informally, TF-DF considers all the training examples and all the possible splits at each step. This is a more common and often better-performing solution.

While sometimes faster on very large datasets (>10B examples x features), the Estimator's approximations are often less accurate, as more trees need to be grown to reach the same quality. On small datasets (<100M examples x features), the form of approximate training implemented in the Estimator can even be slower than exact training.

TF-DF also supports various types of approximate tree training. The recommended approach is to use exact training and, optionally, to test approximate training on large datasets.

Inference

The Estimator runs model inference using the top-down tree routing algorithm. TF-DF uses an extension of the QuickScorer algorithm.

While both algorithms return exactly the same predictions, the top-down algorithm is less efficient because of excessive branching and cache misses. On the same model, TF-DF inference is generally 10x faster.

For latency-critical applications, TF-DF offers a C++ API. It typically provides ~1µs/example/core inference time, often a 50x-1000x speed-up over TF SavedModel inference (especially on small batches).

Multi-head models

The Estimator supports multi-head models (models that output multiple predictions). TF-DF does not currently support multi-head models directly; however, using the Keras Functional API, multiple TF-DF models trained in parallel can be assembled into a multi-head model, as sketched below.
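
A minimal sketch of that assembly, assuming two TF-DF models (model_1 and model_2, hypothetical names) already trained on the two input features used earlier; the names and input shapes are illustrative, not a definitive recipe:

import tensorflow as tf

# Symbolic inputs matching the feature names consumed by both trained models.
inputs = {
  "f_1": tf.keras.Input(shape=(1,), name="f_1", dtype=tf.float32),
  "f_2": tf.keras.Input(shape=(1,), name="f_2", dtype=tf.string),
}

# Each TF-DF model is a Keras model, so it can be called on symbolic inputs.
head_1 = model_1(inputs)
head_2 = model_2(inputs)

# A single functional model that returns both predictions.
multi_head_model = tf.keras.Model(inputs=inputs, outputs=[head_1, head_2])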

Learning more

You can learn more about TensorFlow Decision Forests by visiting the website. If you’re new to this library, the beginner example is a good place to start. Experienced TensorFlow users can visit this guide for important details about the difference between using decision forests and neural networks in TensorFlow, including how to configure your training pipeline, and tips on Dataset I/O. You can also see Migrate from Estimator to Keras APIs for more info on migrating from Estimators to Keras in general.

Read More

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive results across a wide range of tasks. GPT-3 first showed that large language models (LLMs) can be used for few-shot learning and can achieve impressive results without large-scale task-specific data collection or model parameter updating. More recent LLMs, such as GLaM, LaMDA, Gopher, and Megatron-Turing NLG, achieved state-of-the-art few-shot results on many tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources. Yet much work remains in understanding the capabilities that emerge with few-shot learning as we push the limits of model scale.

Last year Google Research announced our vision for Pathways, a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators. In “PaLM: Scaling Language Modeling with Pathways”, we introduce the Pathways Language Model (PaLM), a 540-billion parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled us to efficiently train a single model across multiple TPU v4 Pods. We evaluated PaLM on hundreds of language understanding and generation tasks, and found that it achieves state-of-the-art few-shot performance across most tasks, by significant margins in many cases.

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities.

Training a 540-Billion Parameter Language Model with Pathways
PaLM demonstrates the first large-scale use of the Pathways system to scale training to 6144 chips, the largest TPU-based system configuration used for training to date. The training is scaled using data parallelism at the Pod level across two Cloud TPU v4 Pods, while using standard data and model parallelism within each Pod. This is a significant increase in scale compared to most previous LLMs, which were either trained on a single TPU v3 Pod (e.g., GLaM, LaMDA), used pipeline parallelism to scale to 2240 A100 GPUs across GPU clusters (Megatron-Turing NLG) or used multiple TPU v3 Pods (Gopher) with a maximum scale of 4096 TPU v3 chips.

PaLM achieves a training efficiency of 57.8% hardware FLOPs utilization, the highest yet achieved for LLMs at this scale. This is due to a combination of the parallelism strategy and a reformulation of the Transformer block that allows for attention and feedforward layers to be computed in parallel, enabling speedups from TPU compiler optimizations.

PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code. We also created a “lossless” vocabulary that preserves all whitespace (especially important for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.

Breakthrough Capabilities on Language, Reasoning, and Code Tasks
PaLM shows breakthrough capabilities on numerous very difficult tasks. We highlight a few examples for language understanding and generation, reasoning, and code-related tasks below.

Language Understanding and Generation
We evaluated PaLM on 29 widely used English natural language processing (NLP) tasks. PaLM 540B surpassed the few-shot performance of prior large models, such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA, on 28 of the 29 tasks, which span question answering (open-domain closed-book variant), cloze and sentence completion, Winograd-style tasks, in-context reading comprehension, common-sense reasoning, SuperGLUE, and natural language inference.

PaLM 540B performance improvement over prior state-of-the-art (SOTA) results on 29 English-based NLP tasks.

In addition to English NLP tasks, PaLM also shows strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.

We also probe emerging and future capabilities of PaLM on the Beyond the Imitation Game Benchmark (BIG-bench), a recently released suite of more than 150 new language modeling tasks, and find that PaLM achieves breakthrough performance. We compare the performance of PaLM to Gopher and Chinchilla, averaged across a common subset of 58 of these tasks. Interestingly, we note that PaLM’s performance as a function of scale follows a log-linear behavior similar to prior models, suggesting that performance improvements from scale have not yet plateaued. PaLM 540B 5-shot also does better than the average performance of people asked to solve the same tasks.

Scaling behavior of PaLM on a subset of 58 BIG-bench tasks. 

PaLM demonstrates impressive natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish cause and effect, understand conceptual combinations in appropriate contexts, and even guess the movie from an emoji.

Examples that showcase PaLM 540B 1-shot performance on BIG-bench tasks: labeling cause and effect, conceptual understanding, guessing movies from emoji, and finding synonyms and counterfactuals.

Reasoning
By combining model scale with chain-of-thought prompting, PaLM shows breakthrough capabilities on reasoning tasks that require multi-step arithmetic or common-sense reasoning. Prior LLMs, like Gopher, saw less benefit from model scale in improving performance.

Standard prompting versus chain-of-thought prompting for an example grade-school math problem. Chain-of-thought prompting decomposes the prompt for a multi-step reasoning problem into intermediate steps (highlighted in yellow), similar to how a person would approach it.
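
For illustration, a hand-written chain-of-thought exemplar in this style might look like the following. This is an illustrative prompt format only, not an actual prompt or output from PaLM:

# An illustrative few-shot chain-of-thought prompt. The worked intermediate
# steps in the exemplar answer are what distinguish chain-of-thought prompting
# from standard few-shot prompting.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: <a new grade-school math question>\n"
    "A:"
)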

We observed strong performance from PaLM 540B combined with chain-of-thought prompting on three arithmetic datasets and two commonsense reasoning datasets. For example, with 8-shot prompting, PaLM solves 58% of the problems in GSM8K, a benchmark of thousands of challenging grade school level math questions, outperforming the prior top score of 55% achieved by fine-tuning the GPT-3 175B model with a training set of 7500 problems and combining it with an external calculator and verifier.

This new score is especially interesting, as it approaches the 60% average of problems solved by 9-12 year olds, who are the target audience for the question set. We suspect that separate encoding of digits in the PaLM vocabulary helps enable these performance improvements.

Remarkably, PaLM can even generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For example, it can provide high quality explanations for novel jokes not found on the web.

PaLM explains an original joke with two-shot prompts.

Code Generation
LLMs have also been shown [1, 2, 3, 4] to generalize well to coding tasks, such as writing code given a natural language description (text-to-code), translating code from one language to another, and fixing compilation errors (code-to-code).

PaLM 540B shows strong performance across coding tasks and natural language tasks in a single model, even though it has only 5% code in the pre-training dataset. Its few-shot performance is especially remarkable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code for training. This result reinforces earlier findings that larger models can be more sample efficient than smaller models because they better transfer learning from other programming languages and natural language data.

Examples of a fine-tuned PaLM 540B model on text-to-code tasks, such as GSM8K-Python and HumanEval, and code-to-code tasks, such as Transcoder.

We also see a further increase in performance by fine-tuning PaLM on a Python-only code dataset, which we refer to as PaLM-Coder. For an example code repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B demonstrates impressive performance, achieving a compile rate of 82.1%, which outperforms the prior 71.7% state of the art. This opens up opportunities for fixing more complex errors that arise during software development.

An example from the DeepFix Code Repair task. The fine-tuned PaLM-Coder 540B fixes compilation errors (left, in red) to a version of code that compiles (right).

Ethical Considerations
Recent research has highlighted various potential risks associated with LLMs trained on web text. It is crucial to analyze and document such potential undesirable risks through transparent artifacts such as model cards and datasheets, which also include information on intended use and testing. To this end, our paper provides a datasheet, model card and Responsible AI benchmark results, and it reports thorough analyses of the dataset and model outputs for biases and risks. While the analysis helps outline some potential risks of the model, domain- and task-specific analysis is essential to truly calibrate, contextualize, and mitigate possible harms. Further understanding of risks and benefits of these models is a topic of ongoing research, together with developing scalable solutions that can put guardrails against malicious uses of language models.

Conclusion and Future Work
PaLM demonstrates the scaling capability of the Pathways system to thousands of accelerator chips across two TPU v4 Pods by training a 540-billion parameter model efficiently with a well-studied, well-established recipe of a dense decoder-only Transformer model. Pushing the limits of model scale enables breakthrough few-shot performance of PaLM across a variety of natural language processing, reasoning, and code tasks.

PaLM paves the way for even more capable models by combining the scaling capabilities with novel architectural choices and training schemes, and brings us closer to the Pathways vision:

“Enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency.”

Acknowledgements
PaLM is the result of a large, collaborative effort by many teams within Google Research and across Alphabet. We’d like to thank the entire PaLM team for their contributions: Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Mishra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, and Jason Wei. PaLM builds on top of work by many, many teams at Google and we would especially like to recognize the T5X team, the Pathways infrastructure team, the JAX team, the Flaxformer team, the XLA team, the Plaque team, the Borg team, and the Datacenter networking infrastructure team. We’d like to thank our co-authors on this blog post, Alexander Spiridonov and Maysam Moussalem, as well as Josh Newlan and Tom Small for the images and animations in this blog post. Finally, we would like to thank our advisors for the project: Noah Fiedel, Slav Petrov, Jeff Dean, Douglas Eck, and Kathy Meier-Hellstern.

Read More

Meet the Omnivore: Videographer Makes Digital Walls, Virtual Homes Pop With NVIDIA Omniverse

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds.

Pekka Varis’s artistry has come a long way from his early days as a self-styled “punk activist” who spray painted during the “old school days of hip hop in Finland.”

Pekka Varis

More than 30 years later, he’s embracing the future of graphics as a videographer who uses NVIDIA Omniverse, a physically accurate 3D design collaboration and simulation platform, to create vibrant virtual worlds.

His freelance video production company Catchline translates complex technology services into attention-snagging, easily digestible video messages or commercials.

A recent project was an animated ad for Artio, an app that brings a global digital art gallery to individual smart devices.

In the commercial, a digital family searches for the perfect piece of art to enliven their new home, visualizing options on the walls in real time. Plus, the parents unpack boxes filled with detailed items, a child plays with block toys, the dog shakes its head and wags its tail — all in a physically accurate way, thanks to Omniverse.

“NVIDIA Omniverse has given me more artistic power and a totally new level of freedom,” Varis said. “It’s a trusty playground where I can bring in quality content and focus on creating designs, visuals, variations — all that fun stuff.”

Vivifying the Videography

The Helsinki-based videographer takes viewers through his creative process in a three-part tutorial that details his recent commercial project, in which Artio helps the digital family make their blank walls pop.

Watch the first installment of the tutorial series on demand.

In subsequent episodes, Varis covers his animation process with Reallusion Character Creator, as well as his steps for lighting and color correction in Omniverse.

His multi-app workflow — bringing characters to life with Reallusion software and perfecting the lighting with Lightmap HDR Light Studio — is made possible by Omniverse Connectors, plugins that connect third-party design apps with Omniverse, and NVIDIA Studio drivers.

He estimates a 90 percent savings in time and cost using Omniverse. “With Omniverse Connectors to Reallusion software like Character Creator and iClone, I can make commercial-quality characters quickly and easily, which I wouldn’t have even dreamed about before,” he said.

“Natasha,” made by Varis using Reallusion Character Creator and Omniverse Create.

His favorite Omniverse apps, he added, are Omniverse Create — a toolkit built on Pixar’s Universal Scene Description format that accelerates advanced scene composition — and Omniverse Machinima, with which he animates and manipulates characters inside of virtual worlds.

“I see myself using Omniverse with every single work commission I get moving forward,” Varis said. “And I hope my video tutorials inspire viewers to try using these superb tools to create their own artwork.”

When not creating videos, Varis makes music as a drummer who loves metal, rock and hip hop — and spends time with his six-year-old daughter, his source of inspiration.

Join in on the Creation

Inspired by Varis? Remix and animate characters from Squad, Mount & Blade II: Bannerlord and MechWarrior 5 using the Omniverse Machinima app for the #MadeInMachinima contest, running through June 27, for a chance to win the latest NVIDIA Studio laptop.

Learn more about Omniverse by watching GTC sessions on demand — featuring visionaries from the Omniverse team, Adobe, Autodesk, Epic Games, Pixar, Unity and Walt Disney Studios.

Creators and developers can download NVIDIA Omniverse for free and get started with step-by-step tutorials on the Omniverse YouTube channel. Follow Omniverse on Instagram, Twitter and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server to chat with the community.


Read More

Dan Huttenlocher ponders our human future in an age of artificial intelligence

What does it mean to be human in an age where artificial intelligence agents make decisions that shape human actions? That’s a deep question with no easy answers, and it’s been on the mind of Dan Huttenlocher SM ’84, PhD ’88, dean of the MIT Schwarzman College of Computing, for the past few years.

“Advances in AI are going to happen, but the destination that we get to with those advances is up to us, and it is far from certain,” says Huttenlocher, who is also the Henry Ellis Warren Professor in the Department of Electrical Engineering and Computer Science.

Along with former Google CEO Eric Schmidt and elder statesman Henry Kissinger, Huttenlocher recently explored some of the quandaries posed by the rise of AI, in the book, “The Age of AI: And Our Human Future.” For Huttenlocher and his co-authors, “Our belief is that, to get there, we need much more informed dialogue and much more multilateral dialogue. Our hope is that the book will get people interested in doing that from a broad range of places,” he says.

Now, with nearly two and a half years as the college dean, Huttenlocher doesn’t just talk the talk when it comes to interdisciplinarity. He is leading the college as it incorporates computer science into all fields of study at MIT while teaching students to use formidable tools like artificial intelligence ethically and responsibly.

That mission is being accomplished, in part, through two campus-wide initiatives that Huttenlocher is especially excited about: the Common Ground for Computing Education and Social and Ethical Responsibilities of Computing (SERC). SERC is complemented by numerous research and scholarly activities, such as AI for Health Care Equity and the Research Initiative for Combatting Systemic Racism. The Common Ground supports the development of cross-disciplinary courses that integrate computing into other fields of study, while the SERC initiative provides tools that help researchers, educators, and students understand how to conceptualize issues about the impacts of computing early in the research process.

“When I was a grad student, you worked on computer vision assuming that it was going to be a research problem for the rest of your lifetime,” he says. “Now, research problems have practical applications almost overnight in computing-related disciplines. The social impacts and ethical implications around computing are things that need to be considered from the very beginning, not after the fact.”

Budding interest in a nascent field

A deep thinker from an early age, Huttenlocher began pondering questions at the intersection of human intelligence and computing when he was a teenager.

With a mind for math, the Chicago native learned how to program before he entered high school, which was a rare thing in the 1970s. His parents, both academics who studied aspects of the human mind, influenced the path he would follow. His father was a neurologist at the University of Chicago Medical School who studied brain development, while his mother was a professor of cognitive psychology at the same institution.

Huttenlocher pursued a joint major in computer science and cognitive psychology as an undergraduate at the University of Michigan in an effort to bring those two disciplines together. When it came time to apply to graduate school, he found the perfect fit for his dual interests in the nascent field of AI, and enrolled at MIT.

While earning his master’s degree and PhD (in 1984 and 1988, respectively), he researched speech recognition, object recognition, and computer vision. He became fascinated by how machines can directly perceive the world around them. Huttenlocher was also drawn in by the entrepreneurial activity that was then ramping up around Cambridge. He spent summers interning at Silicon Valley startups and small tech companies in the Boston area, which piqued his interest in industry.

“I grew up in an academic household and had a healthy skepticism of following in my parents’ footsteps. So when I graduated, I wasn’t quite sure if I wanted an academic path or not. And to be honest, I’ve been a little bit ambivalent about it ever since. For better or worse, I’ve often ended up doing both at the same time,” he says.

Big problems, direct impact

Huttenlocher joined the computer science faculty at Cornell University in 1988 and also took a position at the Xerox Palo Alto Research Center (PARC), where he had interned as a graduate student. He taught computer science courses and worked on academic research projects when Cornell was in session, and spent his summers at Xerox PARC, as well as one day a week consulting remotely. (Long before Zoom, remote connectivity was “still pretty sketchy” in those days, he says.)

“I’ve long wanted to pair the deeper, bigger problems that we tend to try to make progress on in academia with a more direct and immediate impact on people, so spending time at Xerox PARC and at Cornell was a good way to do that,” he says.

Early in his research career, Huttenlocher took a more algorithmic approach to solving computer vision problems, rather than taking the generic optimization approaches that were more common at the time. Some of the techniques he and his collaborators developed, such as using a graph-based representation of an image, are still being used more than 20 years later.

Later, he and his colleagues conducted some of the first studies on how communities come together on social networks. In those pre-Facebook days, they studied LiveJournal, a social networking site that was popular in the early 2000s. Their work revealed that a person’s tendency to join an online community is not only influenced by the number of friends they have in that community, but also by how those friends are connected to one another.

In addition to research, Huttenlocher was passionate about bridging gaps between disciplines. He was named dean of the interdisciplinary Faculty of Computing and Information Science at Cornell in 2009. Three years later, he took his bridge-building skills to New York City when he became the founding dean of Cornell Tech, a new graduate school being established on Roosevelt Island.

That role was a tremendous challenge but also an extraordinary opportunity to create a campus that combined academia in computing-related disciplines with the growing tech community in New York, he says.

In a way, the role prepared him well to be founding dean of the MIT Schwarzman College of Computing, whose launch represented the most significant structural change to the Institute since the early 1950s.

“I think this place is very special. MIT has its own culture. It is a distinctive place in the positive sense of the word ‘distinctive.’ People are insanely curious here and very collaborative when it comes to solving problems. Just the opportunity to help build something new at MIT, something that will be important for the Institute but also for the country and the world, is amazing,” he says.

Making connections

While Huttenlocher was overseeing the creation of Cornell Tech, he was also forging connections around New York City. Before the Roosevelt Island campus was built, the school rented space in Google’s Eighth Avenue building, which is how he met then-Google CEO Eric Schmidt. The two enjoyed talking about (and sometimes arguing about) the promises and perils of artificial intelligence. At the same time, Schmidt was discussing AI with Henry Kissinger, whom he had befriended at a conference. By happenstance, the three got together and started talking about AI, which led to an article in The Atlantic and, eventually, the book.

“What we realized when we started talking about these questions is that the broader historical and philosophical context for an AI age is not something that has been looked at very much. When people are looking at social and ethical issues around computing, it is usually focused on the current problem, which is vital, but we think this broader framing is also important,” he says.

And when it comes to questions about AI, Huttenlocher feels a sense of urgency.

Advancements are happening so rapidly that there is immense pressure for educational institutions to keep up. Academic courses need to have computing woven through them as part of their intellectual fabric, especially as AI continues to become a larger part of everyday life, he says. This underscores the important work the college is doing, and the challenge it faces moving forward.

For Huttenlocher, who has found himself working in the center of a veritable Venn diagram of disciplines since his days as an undergraduate, it is a challenge he has fully embraced.

“It should not just be computer scientists or engineers looking at these problems. But it should not just be social scientists or humanists looking at them either,” he says. “We really need to bring different groups together.”

Read More

Two Experiments in Peer Review: Posting Preprints and Citation Bias

There is increasing interest in computer science and elsewhere to understand and improve peer review (see here for an overview). With this motivation, we conducted two experiments regarding peer review which we summarize in this blog post.

Pros and Cons of Posting Preprints Online

Motivation. Whether authors should post preprints online before double-blind peer review is a widely debated issue of policy, as well as a matter of authors' personal choice. This choice is especially challenging for authors who perceive that they may be at a disadvantage in the review process if their identity is revealed. By posting their preprints online they stand to gain publicity, but may lose the benefits of double-blind review. We substantiate this debate quantitatively by addressing two research questions.

Research questions. We study the following research questions:

  • (Q1) What fraction of reviewers deliberately search for their assigned paper on the Internet? 
  • (Q2) For preprints posted online, what is the relation between the papers’ visibility and the ranking of the authors’ affiliations?

Methods. We conduct survey-based experiments in the ICML 2021 and EC 2021 conferences. 

  • (Q1) We conduct an anonymized survey, where we ask each reviewer whether they deliberately searched online for any of their assigned papers.
  • (Q2) We consider the papers submitted to ICML or EC which were also available online before review. We survey relevant reviewers to assess the visibility of these papers: we ask them whether they have seen these papers online outside of the reviewing context. Finally, we compute a correlation between papers’ visibility and associated authors’ affiliations’ ranking.

Key findings. We report two main insights:

  • (Q1) More than a third of the respondents self-report searching online for their assigned paper in both ICML and EC.
  • (Q2) We find a weak positive correlation: preprints from better-ranked affiliations enjoy a higher visibility. In ICML, the correlation coefficient is 0.06 and is statistically significant; in EC, the correlation coefficient is 0.05 and is not statistically significant. In particular, papers associated with the top-10-ranked affiliations had a visibility of about 11% in ICML and 22% in EC, whereas the remaining papers had a visibility of 7% and 18% respectively.

Implications. Conference organizers looking to design blinding policies and authors looking to post preprints online can use our findings to gauge the tradeoffs involved.

Details. https://arxiv.org/pdf/2203.17259.pdf

Citation Bias

Motivation.  Many anecdotes suggest that including citations to the works of potential reviewers is a good (albeit unethical) way to increase the acceptance chances of a manuscript. 

Research question. Does the citation of a reviewer’s work in a submission cause the reviewer to be positively biased towards the submission, that is, cause a shift in reviewer’s evaluation that goes beyond the genuine change in the submission’s scientific merit?

Methods. We pair cited and uncited reviewers for each submitted paper and then carefully analyze the differences in their scores. Our analysis accounts for the many confounding factors that may exist. By pairing reviewers, we alleviate the confounding factor of “paper quality” as both cited and uncited reviewers review the same paper. We also control for confounders related to reviewer identities by accommodating various associated aspects such as the reviewers’ expertise and preferences in reviewing papers. Finally, we analyze reviews of uncited reviewers to exclude cases in which a reviewer genuinely decreases their evaluation of a paper because it fails to cite their own relevant past work. 
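
To make the paired setup concrete, here is a minimal sketch of a paired sign-flip (permutation) test on placeholder scores. It only illustrates the idea of comparing cited and uncited reviewers of the same submission; it is not the authors' actual analysis, which additionally controls for the confounders described above.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder paired scores: for each submission, one cited and one uncited reviewer.
cited_scores = np.array([4, 5, 3, 4, 5, 2, 4])
uncited_scores = np.array([4, 4, 3, 3, 5, 2, 3])

# Positive differences mean the cited reviewer scored the same paper higher.
diffs = cited_scores - uncited_scores
observed = diffs.mean()

# Sign-flip test: under the null of no citation bias, the "cited"/"uncited"
# labels are exchangeable within each pair.
n_perm = 10_000
signs = rng.choice([-1, 1], size=(n_perm, len(diffs)))
null_means = (signs * diffs).mean(axis=1)
p_value = (np.abs(null_means) >= abs(observed)).mean()

print(f"mean paired difference = {observed:.2f}, p = {p_value:.3f}")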

Key findings. Our findings suggest that citation bias exists, and papers enjoy higher scores from a cited reviewer. Due to this bias, the expected increase in a cited reviewer’s score is 0.16 (on a 6 point scale) in ICML and 0.23 (on a 5 point scale) in EC. For reference, a one-point increase of a score by a single reviewer improves the position of a submission by 11% on average.

Implications. We detect and quantify the strength of citation bias in peer review, informing stakeholders of the presence of the bias. Our work also raises an important open problem of mitigating citation bias.

Details. https://arxiv.org/pdf/2203.17239.pdf

Acknowledgements

The post is based on joint works with Ivan Stelmakh, Ryan Liu,  Xinwei Shen, Marina Meila, Shuchi Chawla, Federico Echenique, and Nihar B. Shah.

Read More

Generating new molecules with graph grammar

Chemical engineers and materials scientists are constantly looking for the next revolutionary material, chemical, and drug. The rise of machine-learning approaches is expediting the discovery process, which could otherwise take years. “Ideally, the goal is to train a machine-learning model on a few existing chemical samples and then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all these components, you can build new molecules with optimal properties, and you also know how to synthesize them. That’s the overall vision that people in that space want to achieve.”

However, current techniques, mainly deep learning, require extensive datasets for training models, and many class-specific chemical datasets contain a handful of example compounds, limiting their ability to generalize and generate physical molecules that could be created in the real world.

Now, a new paper from researchers at MIT and IBM tackles this problem using a generative graph model to build new synthesizable molecules within the same chemical class as their training data. To do this, they treat the formation of atoms and chemical bonds as a graph and develop a graph grammar — a linguistics analogy of systems and structures for word ordering — that contains a sequence of rules for building molecules, such as monomers and polymers. Using the grammar and production rules that were inferred from the training set, the model can not only reverse engineer its examples, but can create new compounds in a systematic and data-efficient way. “We basically built a language for creating molecules,” says Matusik. “This grammar essentially is the generative model.”

Matusik’s co-authors include MIT graduate students Minghao Guo, who is the lead author, and Beichen Li as well as Veronika Thost, Payal Das, and Jie Chen, research staff members with IBM Research. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they’ve called data-efficient graph grammar (DEG), will be presented at the International Conference on Learning Representations.

“We want to use this grammar representation for monomer and polymer generation, because this grammar is explainable and expressive,” says Guo. “With only a small number of production rules, we can generate many kinds of structures.”

A molecular structure can be thought of as a symbolic representation in a graph — a string of atoms (nodes) joined together by chemical bonds (edges). In this method, the researchers allow the model to take the chemical structure and collapse a substructure of the molecule down to one node; this may be two atoms connected by a bond, a short sequence of bonded atoms, or a ring of atoms. This is done repeatedly, creating the production rules as it goes, until a single node remains. The rules and grammar then could be applied in the reverse order to recreate the training set from scratch or combined in different combinations to produce new molecules of the same chemical class.
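
As a toy illustration of that bottom-up idea (not the authors' DEG implementation), one can repeatedly contract a bonded pair of atoms into a single node and record each contracted substructure as a production rule. The molecule, node labels, and selection policy below are all placeholder choices:

import networkx as nx

def infer_rules(molecule: nx.Graph):
    """Collapse the graph one bond at a time, recording a toy 'rule' per step."""
    g = molecule.copy()
    rules = []
    while g.number_of_nodes() > 1:
        u, v = next(iter(g.edges()))  # Pick an arbitrary bonded pair (toy policy).
        rules.append((g.nodes[u]["atom"], g.nodes[v]["atom"]))
        g = nx.contracted_nodes(g, u, v, self_loops=False)
    return rules

# Toy molecule: a C-C-O chain represented as a labeled graph.
mol = nx.Graph()
mol.add_node(0, atom="C")
mol.add_node(1, atom="C")
mol.add_node(2, atom="O")
mol.add_edges_from([(0, 1), (1, 2)])

print(infer_rules(mol))  # e.g. [('C', 'C'), ('C', 'O')]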

“Existing graph generation methods would produce one node or one edge sequentially at a time, but we are looking at higher-level structures and, specifically, exploiting chemistry knowledge, so that we don’t treat the individual atoms and bonds as the unit. This simplifies the generation process and also makes it more data-efficient to learn,” says Chen.

Further, the researchers optimized the technique so that the bottom-up grammar was relatively simple and straightforward, such that it fabricated molecules that could be made.

“If we switch the order of applying these production rules, we would get another molecule; what’s more, we can enumerate all the possibilities and generate tons of them,” says Chen. “Some of these molecules are valid and some of them not, so the learning of the grammar itself is actually to figure out a minimal collection of production rules, such that the percentage of molecules that can actually be synthesized is maximized.” While the researchers concentrated on three training sets of less than 33 samples each — acrylates, chain extenders, and isocyanates — they note that the process could be applied to any chemical class.

To see how their method performed, the researchers tested DEG against other state-of-the-art models and techniques, looking at percentages of chemically valid and unique molecules, diversity of those created, success rate of retrosynthesis, and percentage of molecules belonging to the training data’s monomer class.

“We clearly show that, for the synthesizability and membership, our algorithm outperforms all the existing methods by a very large margin, while it’s comparable for some other widely-used metrics,” says Guo. Further, “what is amazing about our algorithm is that we only need about 0.15 percent of the original dataset to achieve very similar results compared to state-of-the-art approaches that train on tens of thousands of samples. Our algorithm can specifically handle the problem of data sparsity.”

In the immediate future, the team plans to address scaling up this grammar learning process to be able to generate large graphs, as well as produce and identify chemicals with desired properties.

Down the road, the researchers see many applications for the DEG method, as it’s adaptable beyond generating new chemical structures, the team points out. A graph is a very flexible representation, and many entities can be symbolized in this form — robots, vehicles, buildings, and electronic circuits, for example. “Essentially, our goal is to build up our grammar, so that our graphic representation can be widely used across many different domains,” says Guo, as “DEG can automate the design of novel entities and structures,” says Chen.

This research was supported, in part, by the MIT-IBM Watson AI Lab and Evonik.

Read More

Whitepaper: Machine Learning Best Practices in Healthcare and Life Sciences

For customers looking to implement a GxP-compliant environment on AWS for artificial intelligence (AI) and machine learning (ML) systems, we have released a new whitepaper: Machine Learning Best Practices in Healthcare and Life Sciences.

This whitepaper provides an overview of security and good ML compliance practices and guidance on building GxP-regulated AI/ML systems using AWS services. We cover the points raised by the FDA discussion paper and Good Machine Learning Practices (GMLP) while also drawing from AWS resources: the whitepaper GxP Systems on AWS and the Machine Learning Lens from the AWS Well-Architected Framework. The whitepaper was developed based on our experience with and feedback from AWS pharmaceutical and medical device customers, as well as AWS partners, who are currently using AWS services to develop ML models.

Healthcare and life sciences (HCLS) customers are adopting AWS AI and ML services faster than ever before, but they also face the following regulatory challenges during implementation:

  • Building a secure infrastructure that complies with stringent regulatory processes for working on the public cloud and aligning to the FDA framework for AI and ML.
  • Supporting AI/ML-enabled solutions for GxP workloads covering the following:
    • Reproducibility
    • Traceability
    • Data integrity
  • Monitoring ML models with respect to various changes to parameters and data.
  • Handling model uncertainty and confidence calibration.

In our whitepaper, you learn about the following topics:

  • How AWS approaches ML in a regulated environment and provides guidance on Good Machine Learning Practices using AWS services.
  • Our organizational approach to security and compliance that supports GxP requirements as part of the shared responsibility model.
  • How to reproduce the workflow steps, track model and dataset lineage, and establish model governance and traceability.
  • How to monitor and maintain data integrity and quality checks to detect drifts in data and model quality.
  • Security and compliance best practices for managing AI/ML models on AWS.
  • Various AWS services for managing ML models in a regulated environment.

AWS is dedicated to helping you successfully use AWS services in regulated life science environments to accelerate your research, development, and delivery of the next generation of medical, health, and wellness solutions.

Contact us with questions about using AWS services for AI/ML in GxP systems. To learn more about compliance in the cloud, visit AWS Compliance.


About the Authors

Susant Mallick is an Industry specialist and digital evangelist in AWS’ Global Healthcare and Life-Sciences practice. He has more than 20 years of experience in the Life Science industry working with biopharmaceutical and medical device companies across North America, APAC and EMEA regions. He has built many Digital Health Platform and Patient Engagement solutions using Mobile App, AI/ML, IoT and other technologies for customers in various Therapeutic Areas. He holds a B.Tech degree in Electrical Engineering and an MBA in Finance. His thought leadership and industry expertise have earned many accolades in Pharma industry forums.

Sai Sharanya Nalla is a Sr. Data Scientist at AWS Professional Services. She works with customers to develop and implement AI/ ML and HPC solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, taking long walks, and engaging in outreach activities.

Read More

Introducing CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Automatic translation of speech from one language to speech in another language, called speech-to-speech translation (S2ST), is important for breaking down the communication barriers between people speaking different languages. Conventionally, automatic S2ST systems are built with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, so that the system overall is text-centric. Recently, work on S2ST that doesn’t rely on intermediate text representation is emerging, such as end-to-end direct S2ST (e.g., Translatotron) and cascade S2ST based on learned discrete representations of speech (e.g., Tjandra et al.). While early versions of such direct S2ST systems obtained lower translation quality compared to cascade S2ST models, they are gaining traction as they have the potential both to reduce translation latency and compounding errors, and to better preserve paralinguistic and non-linguistic information from the original speech, such as voice, emotion, tone, etc. However, such models usually have to be trained on datasets with paired S2ST data, but the public availability of such corpora is extremely limited.

To foster research on such a new generation of S2ST, we introduce a Common Voice-based Speech-to-Speech translation corpus, or CVSS, which includes sentence-level speech-to-speech translation pairs from 21 languages into English. Unlike existing public corpora, CVSS can be directly used for training such direct S2ST models without any extra processing. In “CVSS Corpus and Massively Multilingual Speech-to-Speech Translation”, we describe the dataset design and development, and demonstrate the effectiveness of the corpus through training of baseline direct and cascade S2ST models and showing performance of a direct S2ST model that approaches that of a cascade S2ST model.

Building CVSS
CVSS is directly derived from the CoVoST 2 speech-to-text (ST) translation corpus, which is further derived from the Common Voice speech corpus. Common Voice is a massively multilingual transcribed speech corpus designed for ASR in which the speech is collected by contributors reading text content from Wikipedia and other text corpora. CoVoST 2 further provides professional text translation for the original transcript from 21 languages into English and from English into 15 languages. CVSS builds on these efforts by providing sentence-level parallel speech-to-speech translation pairs from 21 languages into English (shown in the table below).

To facilitate research with different focuses, two versions of the translation speech in English are provided in CVSS, both synthesized using state-of-the-art TTS systems, with each version providing unique value that doesn’t exist in other public S2ST corpora:

  • CVSS-C: All the translation speech is in a single canonical speaker’s voice. Despite being synthetic, the speech is highly natural, clean, and consistent in speaking style. These properties ease the modeling of the target speech and enable trained models to produce high quality translation speech suitable for general user-facing applications where speech quality is of higher importance than accurately reproducing the speakers’ voices.
  • CVSS-T: The translation speech captures the voice from the corresponding source speech. Each S2ST pair has a similar voice on the two sides, despite being in different languages. Because of this, the dataset is suitable for building models where accurate voice preservation is desired, such as for movie dubbing.

Together with the source speech, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively.

Source Language   Code   Source speech (X)   CVSS-C target speech (En)   CVSS-T target speech (En)
French            fr     309.3               200.3                       222.3
German            de     226.5               137.0                       151.2
Catalan           ca     174.8               112.1                       120.9
Spanish           es     157.6               94.3                        100.2
Italian           it     73.9                46.5                        49.2
Persian           fa     58.8                29.9                        34.5
Russian           ru     38.7                26.9                        27.4
Chinese           zh     26.5                20.5                        22.1
Portuguese        pt     20.0                10.4                        11.8
Dutch             nl     11.2                7.3                         7.7
Estonian          et     9.0                 7.3                         7.1
Mongolian         mn     8.4                 5.1                         5.7
Turkish           tr     7.9                 5.4                         5.7
Arabic            ar     5.8                 2.7                         3.1
Latvian           lv     4.9                 2.6                         3.1
Swedish           sv     4.3                 2.3                         2.8
Welsh             cy     3.6                 1.9                         2.0
Tamil             ta     3.1                 1.7                         2.0
Indonesian        id     3.0                 1.6                         1.7
Japanese          ja     3.0                 1.7                         1.8
Slovenian         sl     2.9                 1.6                         1.9
Total                    1,153.2             719.1                       784.2
Amount of source and target speech of each X-En pair in CVSS (hours).

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (on numbers, currencies, acronyms, etc., see data samples below, e.g., where “100%” is normalized as “one hundred percent” or “King George II” is normalized as “king george the second”), which can benefit both model training as well as standardizing the evaluation.

CVSS is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and it can be freely downloaded online.

Data Samples

Example 1:
  Source audio (French): [audio clip]
  Source transcript (French): Le genre musical de la chanson est entièrement le disco.
  CVSS-C translation audio (English): [audio clip]
  CVSS-T translation audio (English): [audio clip]
  Translation text (English): The musical genre of the song is 100% Disco.
  Normalized translation text (English): the musical genre of the song is one hundred percent disco

Example 2:
  Source audio (Chinese): [audio clip]
  Source transcript (Chinese): 弗雷德里克王子，英国王室成员，为乔治二世之孙，乔治三世之幼弟。
  CVSS-C translation audio (English): [audio clip]
  CVSS-T translation audio (English): [audio clip]
  Translation text (English): Prince Frederick, member of British Royal Family, Grandson of King George II, brother of King George III.
  Normalized translation text (English): prince frederick member of british royal family grandson of king george the second brother of king george the third

Baseline Models
On each version of CVSS, we trained a baseline cascade S2ST model as well as two baseline direct S2ST models and compared their performance. These baselines can be used for comparison in future research.

Cascade S2ST: To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU on all 21 language pairs (detailed in the paper) when trained on the corpus without using extra data. This ST model is connected to the same TTS models used for constructing CVSS to compose very strong cascade S2ST baselines (ST → TTS).

Direct S2ST: We built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch with CVSS, the translation quality from Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when both use pre-training, the gap decreases to only 0.7 BLEU on ASR-transcribed translation. These results verify the effectiveness of using CVSS to train direct S2ST models.

Translation quality of baseline direct and cascade S2ST models built on CVSS-C, measured by BLEU on ASR transcription from speech translation. The pre-training was done on CoVoST 2 without other extra data sets.
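
The ASR-BLEU protocol used above can be summarized in a few lines: transcribe the system's translated speech with an English ASR model, then score the transcripts against the normalized reference translations. Below is a minimal sketch with placeholder strings, assuming the ASR step happens upstream and using sacrebleu for scoring:

import sacrebleu

# Placeholder data: ASR transcripts of the translated speech, and the
# normalized reference translations from CVSS.
hypotheses = ["the musical genre of the song is one hundred percent disco"]
references = [["the musical genre of the song is one hundred percent disco"]]

# Corpus-level BLEU between the ASR transcripts and the references.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)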

Conclusion
We have released two versions of multilingual-to-English S2ST datasets, CVSS-C and CVSS-T, each with about 1.9K hours of sentence-level parallel S2ST pairs, covering 21 source languages. The translation speech in CVSS-C is in a single canonical speaker’s voice, while in CVSS-T it is in voices transferred from the source speech. Each of these datasets provides unique value not found in other public S2ST corpora.

We built baseline multilingual direct S2ST models and cascade S2ST models on both datasets, which can be used for comparison in future works. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU when trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, with only a 0.7 BLEU difference on ASR-transcribed translation when pre-training is used. We hope this work helps accelerate research on direct S2ST.

Acknowledgments
We acknowledge the volunteer contributors and the organizers of the Common Voice and LibriVox projects for their contribution and collection of recordings, the creators of Common Voice, CoVoST, CoVoST 2, Librispeech and LibriTTS corpora for their previous work. The direct contributors to the CVSS corpus and the paper include Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen. We also thank Ankur Bapna, Yiling Huang, Jason Pelecanos, Colin Cherry, Alexis Conneau, Yonghui Wu, Hadar Shemtov and Françoise Beaufays for helpful discussions and support.

Read More