Quantization Aware Training with TensorFlow Model Optimization Toolkit – Performance with Accuracy

Posted by the TensorFlow Model Optimization team

We are excited to release the Quantization Aware Training (QAT) API as part of the TensorFlow Model Optimization Toolkit. QAT enables you to train and deploy models with the performance and size benefits of quantization, while retaining close to their original accuracy. This work is part of our roadmap to support the development of smaller and faster ML models. For more background, you can see previous posts on post-training quantization, float16 quantization and sparsity.

Quantization is lossy

Quantization is the process of transforming an ML model into an equivalent representation that uses parameters and computations at a lower precision. This improves the model’s execution performance and efficiency. For example, TensorFlow Lite 8-bit integer quantization yields models that are up to 4x smaller, 1.5x-4x faster in computation, and consume less power on CPUs. Additionally, it allows model execution on specialized neural accelerators, such as the Edge TPU in Coral, which often support only a restricted set of data types.

However, the process of going from higher to lower precision is lossy in nature. As seen in the image below, quantization squeezes a small range of floating-point values into a fixed number of information buckets.

A small range of float32 values mapped to int8 is a lossy conversion, since int8 has only 255 information channels.

This leads to information loss. The parameters (or weights) of a model can now take only a small set of values, and the minute differences between them are lost. For example, all values in the range [2.0, 2.3] may now be represented by a single bucket. This is similar to the rounding error introduced when fractional values are represented as integers.

There are other sources of loss as well. When these lossy values are used in the many multiply-add computations of a model, the errors accumulate. Further, int8 multiplications accumulate into int32 sums, which need to be rescaled back to int8 values for the next computation, introducing yet more computational error.
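To make the round trip concrete, here is a minimal NumPy sketch (our own illustration, not toolkit code) of an affine quantize/dequantize conversion of the kind described above; note how nearby values collapse into the same bucket once the tensor's full range is wide:

import numpy as np

def quantize_dequantize(x, num_bits=8):
    # Affine-quantize a float array to signed integers and back,
    # returning the reconstruction and the error it introduces.
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1  # -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)  # integer buckets
    x_hat = (q - zero_point) * scale  # back to float
    return x_hat, x - x_hat

x = np.array([-100.0, 2.0, 2.15, 2.3, 100.0], dtype=np.float32)
x_hat, err = quantize_dequantize(x)
print(x_hat)  # with a ~0.78 scale, 2.0, 2.15, and 2.3 all land in one bucket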

Quantization Aware Training

The core idea is that QAT simulates low-precision, inference-time computation in the forward pass of the training process. This work builds on the original innovations by Skirmantas Kligys in the Google Mobile Vision team. It introduces the quantization error as noise during training and as part of the overall loss, which the optimization algorithm tries to minimize. As a result, the model learns parameters that are more robust to quantization.
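As a minimal sketch of this idea (our own illustration, not the toolkit's internals), the snippet below fakes quantization in the forward pass with a fixed grid and uses a straight-through estimator so that gradients flow past the non-differentiable rounding:

import tensorflow as tf

SCALE = 0.1  # grid spacing; real QAT derives this from observed tensor ranges

@tf.custom_gradient
def fake_quantize(x):
    # Forward pass: snap each value to the low-precision grid, so the
    # quantization error shows up in activations and, ultimately, the loss.
    y = tf.round(x / SCALE) * SCALE

    def grad(dy):
        # Straight-through estimator: treat rounding as the identity,
        # so gradients pass through unchanged and training can proceed.
        return dy

    return y, grad

x = tf.constant([0.03, 0.26, 0.41])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = fake_quantize(x)
print(y.numpy())                    # ~[0.  0.3 0.4]
print(tape.gradient(y, x).numpy())  # [1. 1. 1.] -- gradient passes through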

If training is not an option, please check out post-training quantization, which works as part of TensorFlow Lite model conversion. QAT is also useful for researchers and hardware designers who may want to experiment with various quantization strategies (beyond what is supported by TensorFlow Lite) and / or simulate how quantization affects accuracy for different hardware backends.

QAT-trained models have comparable accuracy to floating-point

[Table: accuracy of QAT models versus floating-point baselines and post-training quantization]

In the table above, the QAT models were trained with the default TensorFlow Lite configuration, and their accuracy is contrasted with that of the floating-point baseline and post-training quantized models.

Emulating low-precision computation

The training graph itself operates in floating point (e.g., float32), but it has to emulate low-precision, fixed-point computation (e.g., int8 in the case of TensorFlow Lite). To do so, we insert special operations into the graph (tensorflow::ops::FakeQuantWithMinMaxVars) that convert the floating-point tensors into low-precision values and then convert the low-precision values back into floating point. This ensures that the losses from quantization are introduced into the tensor, and, since each value in the resulting floating-point tensor now maps 1:1 to a low-precision value, any further computation with similarly mapped tensors introduces no additional loss and mimics low-precision computation exactly.
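This op is exposed in Python as tf.quantization.fake_quant_with_min_max_vars; the small sketch below (the values are our own) shows the float-to-low-precision-to-float round trip it performs over a given range:

import tensorflow as tf

x = tf.constant([-1.2, 0.0, 0.37, 2.5])

# Emulate 8-bit quantization over the range [-1.0, 2.0]: values are snapped
# to one of the representable levels and returned as float32; inputs outside
# the range are clipped.
y = tf.quantization.fake_quant_with_min_max_vars(
    x, min=tf.constant(-1.0), max=tf.constant(2.0), num_bits=8)
print(y.numpy())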

Placing the quantization emulation operations

The quantization emulation operations need to be placed in the training graph consistently with the way the quantized graph will be computed. This means that, for models trained with our API to execute in TensorFlow Lite, we needed to follow the TensorFlow Lite quantization spec precisely.

[Diagram: a Conv layer followed by ReLU6, with ‘wt quant’ ops on the weights and ‘act quant’ ops on the activations.] The ‘wt quant’ and ‘act quant’ ops introduce losses in the forward pass of the model to simulate actual quantization loss during inference. Note that there is no quantization operation between Conv and ReLU6; this is because ReLUs get fused with the preceding op in TensorFlow Lite.

The API, built upon the Keras layers and model abstractions, hides the complexities mentioned above, so that you can quantize your entire model with a few lines of code.

Logging computation statistics

Aside from emulating the reduced precision computation, the API is also responsible for recording the necessary statistics to quantize the trained model. As an example, this allows you to take a model trained with the API and convert it to a quantized integer-only TensorFlow Lite model.

How to use the API with only a few lines of code

The QAT API provides a simple and highly flexible way to quantize your TensorFlow Keras model. It makes it easy to train an entire model, or only parts of it, with “quantization awareness” and then export it for deployment with TensorFlow Lite.

Quantize the entire Keras model

import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
...
])
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)

# Continue with training as usual.
quantized_model.compile(...)
quantized_model.fit(...)

Quantize part(s) of a Keras model

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, ReLU
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

model = tf.keras.Sequential([
...
# Only annotated layers will be quantized.
quantize_annotate_layer(Conv2D()),
quantize_annotate_layer(ReLU()),
Dense(),
...
])

# Quantize the model.
quantized_model = tfmot.quantization.keras.quantize_apply(model)

By default, our API is configured to work with the quantized execution support available in TensorFlow Lite. A detailed Colab with an end-to-end training example is located here.
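As a sketch of that deployment path (this follows the standard TensorFlow Lite converter flow; quantized_model is the model produced by quantize_model above), the trained model can then be converted to a quantized TensorFlow Lite model:

import tensorflow as tf

# Convert the QAT-trained Keras model, using the quantization statistics
# recorded during training.
converter = tf.lite.TFLiteConverter.from_keras_model(quantized_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('quantized_model.tflite', 'wb') as f:
    f.write(tflite_model)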

The API is quite flexible and capable of handling far more complicated use cases. For example, it allows you to control quantization precisely within a layer, create custom quantization algorithms, and handle any custom layers that you may have written.
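As one hedged illustration of that flexibility, the sketch below subclasses the toolkit's QuantizeConfig extension point to quantize a Dense layer's kernel and activation to 4 bits (the bit width and class name are our own choices; consult the toolkit's comprehensive guide for the authoritative interface):

import tensorflow_model_optimization as tfmot

quantizers = tfmot.quantization.keras.quantizers

class Dense4BitQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    # Quantize a Dense layer's kernel and activation to 4 bits.

    def get_weights_and_quantizers(self, layer):
        return [(layer.kernel, quantizers.LastValueQuantizer(
            num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        return [(layer.activation, quantizers.MovingAverageQuantizer(
            num_bits=4, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []  # outputs are covered by the activation quantizer

    def get_config(self):
        return {}

Such a config would then be attached via quantize_annotate_layer(Dense(...), quantize_config=Dense4BitQuantizeConfig()) and applied under tfmot.quantization.keras.quantize_scope({'Dense4BitQuantizeConfig': Dense4BitQuantizeConfig}) when calling quantize_apply.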

To learn more about how to use the API, please try this Colab. These sections of the Colab provide examples of how users can experiment with different quantization algorithms using the API. You can also check out this recent talk from the TensorFlow Developer Summit.

We are very excited to see how the QAT API further enables TensorFlow users to push the boundaries of efficient execution in their TensorFlow Lite-powered products as well as how it opens the door to researching new quantization algorithms and further developing new hardware platforms with different levels of precision.

If you want to learn more, check out this video from the TensorFlow DevSummit which introduces the Model Optimization Toolkit and explains QAT.

Acknowledgements

Thanks to Pulkit Bhuwalka, Alan Chiao, Suharsh Sivakumar, Raziel Alvarez, Feng Liu, Lawrence Chan, Skirmantas Kligys, Yunlu Li, Khanh LeViet, Billy Lambert, Mark Daoust, Tim Davis, Sarah Sirajuddin, and François Chollet.

Read More

Learning about artificial intelligence: A hub of MIT resources for K-12 students

In light of the recent events surrounding Covid-19, learning for grades K-12 looks very different than it did a month ago. Parents and educators may be feeling overwhelmed about turning their homes into classrooms. 

With that in mind, a team led by Media Lab Associate Professor Cynthia Breazeal has launched aieducation.mit.edu to share a variety of online activities for K-12 students to learn about artificial intelligence, with a focus on how to design and use it responsibly. Learning resources provided on this website can help to address the needs of the millions of children, parents, and educators worldwide who are staying at home due to school closures caused by Covid-19, and are looking for free educational activities that support project-based STEM learning in an exciting and innovative area. 

The website is a collaboration between the Media Lab, MIT Stephen A. Schwarzman College of Computing, and MIT Open Learning, serving as a hub to highlight diverse work by faculty, staff, and students across the MIT community at the intersection of AI, learning, and education. 

“MIT is the birthplace of Constructionism under Seymour Papert. MIT has revolutionized how children learn computational thinking with hugely successful platforms such as Scratch and App Inventor. Now, we are bringing this rich tradition and deep expertise to how children learn about AI through project-based learning that dovetails technical concepts with ethical design and responsible use,” says Breazeal. 

The website will serve as a hub for MIT’s latest work in innovating learning and education in the era of AI. In addition to highlighting research, it also features up-to-date project-based activities, learning units, child-friendly software tools, digital interactives, and other supporting materials, highlighting a variety of MIT-developed educational research and collaborative outreach efforts across and beyond MIT. The site is intended for use by students, parents, teachers, and lifelong learners alike, with resources for children and adults at all learning levels, and with varying levels of comfort with technology, for a range of artificial intelligence topics. The team has also gathered a variety of external resources to explore, such as Teachable Machine by Google, a browser-based platform that lets users train classifiers for their own image-recognition algorithms in a user-friendly way.

In the spirit of “mens et manus” — the MIT motto, meaning “mind and hand” — the vision of technology for learning at MIT is about empowering and inspiring learners of all ages in the pursuit of creative endeavors. The activities highlighted on the new website are designed in the tradition of constructionism: learning through project-based experiences in which learners build and share their work. The approach is also inspired by the idea of computational action, where children can design AI-enabled technologies to help others in their community.

“MIT has been a world leader in AI since the 1960s,” says MIT professor of computer science and engineering Hal Abelson, who has long been involved in MIT’s AI research and educational technology. “MIT’s approach to making machines intelligent has always been strongly linked with our work in K-12 education. That work is aimed at empowering young people through computational ideas that help them understand the world and computational actions that empower them to improve life for themselves and their communities.”

Research in computer science education and AI education highlights the importance of having a mix of plugged and unplugged learning approaches. Unplugged activities include kinesthetic or discussion-based activities developed to introduce children to concepts in AI and its societal impact without using a computer. Unplugged approaches to learning AI are found to be especially helpful for young children. Moreover, these approaches can also be accessible to learning environments (classrooms and homes) that have limited access to technology. 

As computers continue to automate more and more routine tasks, inequity of education remains a key barrier to future opportunities, where success depends increasingly on intellect, creativity, social skills, and having specific skills and knowledge. This accelerating change raises the critical question of how to best prepare students, from children to lifelong learners, to be successful and to flourish in the era of AI.

It is important to help prepare a diverse and inclusive citizenry to be responsible designers and conscientious users of AI. In that spirit, the activities on aieducation.mit.edu range from hands-on programming to paper prototyping, to Socratic seminars, and even creative writing about speculative fiction. The learning units and project-based activities are designed to be accessible to a wide audience with different backgrounds and comfort levels with technology. A number of these activities leverage learning about AI as a way to connect to the arts, humanities, and social sciences, too, offering a holistic view of how AI intersects with different interests and endeavors. 

The rising ubiquity of AI affects us all, but today a disproportionately small slice of the population has the skills or power to decide how AI is designed or implemented; worrying consequences have been seen in algorithmic bias and perpetuation of unjust systems. Democratizing AI through education, starting in K-12, will help to make it more accessible and diverse at all levels, ultimately helping to create a more inclusive, fair, and equitable future.

Read More

Computational thinking class enables students to engage in Covid-19 response

When an introductory computational science class, which is open to the general public, was repurposed to study the Covid-19 pandemic this spring, the instructors saw student registration rise from 20 students to nearly 300.

Introduction to Computational Thinking (6.S083/18.S190), which applies data science, artificial intelligence, and mathematical models using the Julia programming language developed at MIT, was introduced in the fall as a pilot half-semester class. It was launched as part of the MIT Stephen A. Schwarzman College of Computing’s computational thinking program and spearheaded by Department of Mathematics Professor Alan Edelman and Visiting Professor David P. Sanders. They very quickly were able to fast-track the curriculum to focus on applications to Covid-19 responses; students were equally fast in jumping on board.

“Everyone at MIT wants to contribute,” says Edelman. “While we at the Julia Lab are doing research in building tools for scientists, Dave and I thought it would be valuable to teach the students about some of the fundamentals related to computation for drug development, disease models, and such.” 

The course is offered through MIT’s Department of Electrical Engineering and Computer Science and the Department of Mathematics. “This course opens a trove of opportunities to use computation to better understand and contain the Covid-19 pandemic,” says MIT Computer Science and Artificial Intelligence Laboratory Director Daniela Rus.

The fall version of the class had a maximum enrollment of 20 students, but the spring class has ballooned to nearly 300 students in one weekend, almost all from MIT. “We’ve had a tremendous response,” Edelman says. “This definitely stressed the MIT sign-up systems in ways that I could not have imagined.”

Sophomore Shinjini Ghosh, majoring in computer science and linguistics, says she was initially drawn to the class to learn Julia, “but also to develop the skills to do further computational modeling and conduct research on the spread and possible control of Covid-19.”

“There’s been a lot of misinformation about the epidemiology and statistical modeling of the coronavirus,” adds sophomore Raj Movva, a computer science and biology major. “I think this class will help clarify some details, and give us a taste of how one might actually make predictions about the course of a pandemic.” 

Edelman says that he has always dreamed of an interdisciplinary modern class that would combine the machine learning and AI of a “data-driven” world, the modern software and systems possibilities that Julia allows, and the physical models, differential equations, and  scientific machine learning of the “physical world.” 

He calls this class “a natural outgrowth of Julia Lab’s research, and that of the general cooperative open-source Julia community.” For years, this online community has collaborated to create tools that speed up the drug approval process, aid in scientific machine learning and differential equations, and predict infectious disease transmission. “The lectures are open to the world, following the great MIT tradition of MIT open courses,” says Edelman.

So when MIT turned to virtual learning to de-densify campus, the transition to an online, remotely taught version of the class was not too difficult for Edelman and Sanders.

“Even though we have run open remote learning courses before, it’s never the same as being able to see the live audience in front of you,” says Edelman. “However, MIT students ask such great questions in the Zoom chat, so that it remains as intellectually invigorating as ever.”

Sanders, a Marcos Moshinsky research fellow currently on leave as a professor at the National University of Mexico, is working on techniques for accelerating global optimization. Involved with the Julia Lab since 2014, Sanders has worked with Edelman on various teaching, research, and outreach projects related to Julia, and his YouTube tutorials have reached over 100,000 views. “His videos have often been referred to as the best way to learn the Julia language,” says Edelman.

Edelman will also be enlisting some help from Philip, his family’s Corgi who until recently had been a frequent wanderer of MIT’s halls and classrooms. “Philip is a well-known Julia expert whose image has been classified many times by Julia’s AI Systems,” says Edelman. “Students are always happy when Philip participates in the online classes.”

Read More

Researching from home: Science stays social, even at a distance

With all but a skeleton crew staying home from each lab to minimize the spread of Covid-19, scores of Picower Institute researchers are immersing themselves in the considerable amount of scientific work that can be done away from the bench. With piles of data to analyze; plenty of manuscripts to write; new skills to acquire; and fresh ideas to conceive, share, and refine for the future, neuroscientists have full plates, even when they are away from their, well, plates. They are proving that science can remain social, even if socially distant.

Ever since the mandatory ramp-down of on-campus research took hold March 20, for example, teams of researchers in the lab of Troy Littleton, the Menicon Professor of Neuroscience, have sharpened their focus on two data-analysis projects that are every bit as essential to their science as acquiring the data in the lab in the first place. Research scientist Yulia Akbergenova and graduate student Karen Cunningham are poring over a huge amount of imaging data showing how the strength of connections between neurons, or synapses, matures and how that depends on the molecular components at the site. Another team, made up of Picower postdoc Suresh Jetti and graduate students Andres Crane and Nicole Aponte-Santiago, is analyzing another large dataset, this time of gene transcription, to learn what distinguishes two subclasses of motor neurons that form synapses of characteristically different strength.

Work is similarly continuing among researchers in the lab of Elly Nedivi, the William R. (1964) and Linda R. Young Professor of Neuroscience. Since heading home, Senior Research Support Associate Kendyll Burnell has been looking at microscope images tracking how inhibitory interneurons innervate the visual cortex of mice throughout their development. By studying the maturation of inhibition, the lab hopes to improve understanding of the role of inhibitory circuitry in the experience-dependent changes, or plasticity, and development of the visual cortex, she says. As she’s worked, her poodle Soma (named for the central body structure of a neuron) has been by her side.

Despite extra time with comforts of home, though, it’s clear that nobody wanted this current mode of socially distant science. For every lab, it’s tremendously disruptive and costly. But labs are finding many ways to make progress nonetheless.

“Although we are certainly hurting because our lab work is at a standstill, the Miller lab is fortunate to have a large library of multiple-electrode neurophysiological data,” says Picower Professor Earl Miller. “The datasets are very rich. As our hypotheses and analytical tools develop, we can keep going back to old data to ask new questions. We are taking advantage of the wet lab downtime to analyze data and write papers. We have three under review and are writing at least three more right now.”

Miller is inviting new collaborations regardless of the physical impediment of social distancing. A recent lab meeting held via the videoconferencing app Zoom included MIT Department of Brain and Cognitive Sciences Associate Professor Ila Fiete and her graduate student, Mikail Khona. The Miller lab has begun studying how neural rhythms move around the cortex and what that means for brain function. Khona presented models of how timing relationships affect those waves. While this kind of an interaction between labs of the Picower Institute and the McGovern Institute for Brain Research would normally have taken place in person in MIT’s Building 46, neither lab let the pandemic get in the way.

Similarly, the lab of Li-Huei Tsai, Picower Professor and director of the Picower Institute, has teamed up with that of Manolis Kellis, professor in the MIT Computer Science and Artificial Intelligence Laboratory. They’re forming several small squads of experimenters and computational experts to launch analyses of gene expression and other data to illuminate the fate of individual cell types like interneurons or microglia in the context of the Alzheimer’s disease-afflicted brain. Other teams are focusing on analyses of questions such as how pathology varies in brain samples carrying different degrees of genetic risk factors. These analyses will prove useful for stages all along the scientific process, Tsai says, from forming new hypotheses to wrapping up papers that are well underway.

Remote collaboration and communication are proving crucial to researchers in other ways, too, showing that online interactions, though distant, can be quite personally fulfilling.

Nicholas DiNapoli, a research engineer in the lab of Associate Professor Kwanghun Chung, is making the best of time away from the bench by learning about the lab’s computational pipeline for processing the enormous amounts of imaging data it generates. He’s also taking advantage of a new program within the lab in which Senior Computer Scientist Lee Kamentsky is teaching Python computer programming principles to anyone in the lab who wants to learn. The training occurs via Zoom two days a week.

As part of a crowded calendar of Zoom meetings, or “Zeetings” as the lab has begun to call them, Newton Professor Mriganka Sur says he makes sure to have one-to-one meetings with everyone in the lab. The team also has organized into small subgroups around different themes of the lab’s research.

The lab has also maintained its cohesion by banding together informally, creating novel work and social experiences.

Graduate student Ning Leow, for example, used Zoom to create a co-working session in which participants kept a video connection open for hours at a time, just to be in each other’s virtual presence while they worked. Among a group of Sur lab friends, she read a paper related to her thesis and did a substantial amount of data analysis. She also advised a colleague on an analysis technique via the connection.

“I’ve got to say that it worked out really well for me personally because I managed to get whatever I wanted to complete on my list done,” she says, “and there was also a sense of healthy accountability along with the sense of community.”

Whether in person or via an officially imposed distance, science is social. In that spirit, graduate student K. Guadalupe “Lupe” Cruz organized a collaborative art event via Zoom for female scientists in brain and cognitive sciences at MIT. She took a photo of Rosalind Franklin, the scientist whose work was essential for resolving the structure of DNA, and divided it into nine squares to distribute to the event attendees. Without knowing the full picture, everyone drew just their section, talking all the while about how the strange circumstances of Covid-19 have changed their lives. At the end, they stitched their squares together to reconstruct the image.

Examples abound of how Picower scientists, though mostly separate and apart, are still coming together to advance their research and to maintain the fabric of their shared experiences.

Read More

Upcoming changes to TensorFlow.js

Posted by Yannick Assogba, Software Engineer, Google Research

As TensorFlow.js is used more and more in production environments, our team recognizes the need for the community to be able to produce small, production optimized bundles for browsers that use TensorFlow.js. We have been laying out the groundwork for this and want to share our upcoming plans with you.

One primary goal for upcoming releases of TensorFlow.js is to make the library more modular and more tree-shakeable, while preserving ease of use for beginners. To that end, we are planning two major version releases to move us in that direction; spreading the work across two major versions lets us maintain semver as we make breaking changes.

TensorFlow.js 2.0

In TensorFlow.js 2.x, the only breaking change will be moving the CPU and WebGL backends from tfjs-core into their own NPM packages (tfjs-backend-cpu and tfjs-backend-webgl, respectively). While today these are included by default, we want to make tfjs-core as modular and lean as possible.

What does this mean for me as a user?

If you are using the union package (i.e. @tensorflow/tfjs), you should see no impact to your code. If you are using @tensorflow/tfjs-core directly, you will need to import a package for each backend you want to use.

What benefit do I get?

If you are using @tensorflow/tfjs-core directly, you will now have the option of omitting any backend you do not want to use in your application. For example, if you only want the WebGL backend, you will be able to get modest savings by dropping the CPU backend. You will also be able to lazily load the CPU backend as a fallback if your build tooling/app supports that.

TensorFlow.js 3.0

In this release, we will have fully modularized all our ops and kernels (backend specific implementations of the math behind an operation). This will allow tree shakers in bundlers like WebPack, Rollup, and Closure Compiler to do better dead-code elimination and produce smaller bundles.

We will move to a dynamic gradient and kernel registration scheme as well as provide tooling to aid in creating custom bundles that only contain kernels for a given model or TensorFlow.js program.

We will also start shipping ES2017 bundles by default. Users who need to deploy to browsers that only support earlier versions can transpile down to their desired target.

What does this mean for me as a user?

If you are using the union package (i.e. @tensorflow/tfjs), we anticipate the changes will be minimal. In order to support ease of use in getting started with tfjs, we want the default use of the union package to remain close to what it is today.

For users who want smaller production oriented bundles, you will need to change your code to take advantage of ES2015 modules to import only the ops (and other functionality) you would like to end up in your bundle.

In addition, we will provide command-line tooling to enable builds that only load and register the kernels used by the models/programs you are deploying.

What benefit do I get?

Production-oriented users will be able to opt into writing code that results in smaller, more optimized builds. Other users will still be able to use the union package pretty much as is, but will not get the advantage of the smallest builds possible.

Dynamic gradient and kernel registration will make it easier to implement custom kernels and gradients for researchers and other advanced users.

FAQ

When will this be ready?

We plan to release TensorFlow.js 2.0 this month. We do not yet have a release date for TensorFlow.js 3.0 because of the magnitude of the change. Since we need to touch almost every file in tfjs-core, we are also taking the opportunity to clean up technical debt where we can.

Should I upgrade to TensorFlow.js 2.x or just wait for 3.x?

We recommend that you upgrade to TensorFlow.js 2.x if you are actively developing a TensorFlow.js project. It should be a relatively painless upgrade, and any future bug fixes will be on this release train. We do not yet have a release date for TensorFlow.js 3.x.

How do I migrate my app to 2.x or 3.x? Will there be a tutorial to follow?

As we release these versions, we will publish full release notes with instructions on how to upgrade. Separately, with the launch of 3.x, we will publish a guide on making production builds.

How much will I have to change my code to get smaller builds?

We’ll have more details as we get closer to the release of 3.x, but at a high level, we want to take advantage of the ES2015 module system to let you control what code gets into your bundle.

In general, you will need to do things like import {max, div, mul, depthToSpace} from '@tensorflow/tfjs' (rather than import * as tf from '@tensorflow/tfjs') in order for our tooling to determine which kernels to register from the backends you have selected for deployment. We are even working on making the chaining API on the Tensor class opt-in when targeting production builds.

Will this make TensorFlow.js harder to use?

We do not want to raise the barrier to entry for using TensorFlow.js, so we are designing this in a way that only production-oriented users need to do extra work to get optimized builds. For end users developing applications with the union package (@tensorflow/tfjs), from either a hosted script or from NPM in concert with our collection of pre-trained models, we expect no changes as a result of these updates.

Read More

Making robots better co-workers

Nima Keivan wants to help Amazon’s robots and fulfillment center workers collaborate more effectively, so that robots can perform the more mundane tasks while humans focus on higher-value jobs.

Read More

Towards understanding glasses with graph neural networks

Under a microscope, a pane of window glass doesn’t look like a collection of orderly molecules, as a crystal would, but rather a jumble with no discernible structure. Glass is made by starting with a glowing mixture of high-temperature melted sand and minerals. Once cooled, its viscosity (a measure of the friction in the fluid) increases a trillion-fold, and it becomes a solid, resisting tension from stretching or pulling. Yet the molecules in the glass remain in a seemingly disordered state, much like the original molten liquid, almost as though the disordered liquid state had been flash-frozen in place. The glass transition, then, first appears to be a dramatic arrest in the movement of the glass molecules. Whether this process corresponds to a structural phase transition (as in water freezing, or the superconducting transition) is a major open question in the field. Understanding the nature of the dynamics of glass is fundamental to understanding how the atomic-scale properties define the visible features of many solid materials.

Read More

Accelerating data-driven discoveries

As technologies like single-cell genomic sequencing, enhanced biomedical imaging, and medical “internet of things” devices proliferate, key discoveries about human health are increasingly found within vast troves of complex life science and health data.

But drawing meaningful conclusions from that data is a difficult problem that can involve piecing together different data types and manipulating huge data sets in response to varying scientific inquiries. The problem is as much about computer science as it is about other areas of science. That’s where Paradigm4 comes in.

The company, founded by Marilyn Matz SM ’80 and Turing Award winner and MIT Professor Michael Stonebraker, helps pharmaceutical companies, research institutes, and biotech companies turn data into insights.

It accomplishes this with a computational database management system that’s built from the ground up to host the diverse, multifaceted data at the frontiers of life science research. That includes data from sources like national biobanks, clinical trials, the medical internet of things, human cell atlases, medical images, environmental factors, and multi-omics, a field that includes the study of genomes, microbiomes, metabolomes, and more.

On top of the system’s unique architecture, the company has also built data preparation, metadata management, and analytics tools to help users find the important patterns and correlations lurking within all those numbers.

In many instances, customers are exploring data sets the founders say are too large and complex to be represented effectively by traditional database management systems.

“We’re keen to enable scientists and data scientists to do things they couldn’t do before by making it easier for them to deal with large-scale computation and machine-learning on diverse data,” Matz says. “We’re helping scientists and bioinformaticists with collaborative, reproducible research to ask and answer hard questions faster.”

A new paradigm

Stonebraker has been a pioneer in the field of database management systems for decades. He has started nine companies, and his innovations have set standards for the way modern systems allow people to organize and access large data sets.

Much of Stonebraker’s career has focused on relational databases, which organize data into columns and rows. But in the mid-2000s, Stonebraker realized that a lot of the data being generated would be better stored not in rows or columns but in multidimensional arrays.

For example, satellites break the Earth’s surface into large squares, and GPS systems track a person’s movement through those squares over time. That operation involves vertical, horizontal, and time measurements that aren’t easily grouped or otherwise manipulated for analysis in relational database systems.

Stonebraker recalls his scientific colleagues complaining that available database management systems were too slow to work with complex scientific datasets in fields like genomics, where researchers study the relationships between population-scale multi-omics data, phenotypic data, and medical records.

“[Relational database systems] scan either horizontally or vertically, but not both,” Stonebraker explains. “So you need a system that does both, and that requires a storage manager down at the bottom of the system which is capable of moving both horizontally and vertically through a very big array. That’s what Paradigm4 does.”

In 2008, Stonebraker began developing a database management system at MIT that stored data in multidimensional arrays. He confirmed the approach offered major efficiency advantages, allowing analytical tools based on linear algebra, including many forms of machine learning and statistical data processing, to be applied to huge datasets in new ways.

Stonebraker decided to spin the project into a company in 2010, when he partnered with Matz, a successful entrepreneur who co-founded Cognex Corporation, a large industrial machine-vision company that went public in 1989. The founders and their team, including Alex Poliakov BS ’07, went to work building out key features of the system, including its distributed architecture that allows the system to run on low-cost servers, and its ability to automatically clean and organize data in useful ways for users.

The founders describe their database management system as a computational engine for scientific data, and they’ve named it SciDB. On top of SciDB, they developed an analytics platform, called the REVEAL discovery engine, based on users’ daily research activities and aspirations.

“If you’re a scientist or data scientist, Paradigm’s REVEAL and SciDB products take care of all the data wrangling and computational ‘plumbing and wiring,’ so you don’t have to worry about accessing data, moving data, or setting up parallel distributed computing,” Matz says. “Your data is science-ready. Just ask your scientific question and the platform orchestrates all of the data management and computation for you.”

SciDB is designed to be used by both scientists and developers, so users can interact with the system through graphical user interfaces or by leveraging statistical and programming languages like R and Python.

“It’s been very important to sell solutions, not building blocks,” Matz says. “A big part of our success in the life sciences with top pharmas and biotechs and research institutes is bringing them our REVEAL suite of application-specific solutions to problems. We’re not handing them an analytical platform that’s a set of Spark LEGO blocks; we’re giving them solutions that handle the data they deal with daily, and solutions that use their vocabulary and answer the questions they want to work on.”

Accelerating discovery

Today Paradigm4’s customers include some of the biggest pharmaceutical and biotech companies in the world as well as research labs at the National Institutes of Health, Stanford University, and elsewhere.

Customers can integrate genomic sequencing data, biometric measurements, data on environmental factors, and more into their inquiries to enable new discoveries across a range of life science fields.

Matz says SciDB did 1 billion linear regressions in less than an hour in a recent benchmark, and that it can scale well beyond that, which could speed up discoveries and lower costs for researchers who have traditionally had to extract their data from files and then rely on less efficient cloud-computing-based methods to apply algorithms at scale.

“If researchers can run complex analytics in minutes that used to take days, that dramatically changes the number of hard questions you can ask and answer,” Matz says. “That is a force-multiplier that will transform research daily.”

Beyond life sciences, Paradigm4’s system holds promise for any industry dealing with multifaceted data, including earth sciences, where Matz says a NASA climatologist is already using the system, and industrial IoT, where data scientists consider large amounts of diverse data to understand complex manufacturing systems. Matz says the company will focus more on those industries next year.

In the life sciences, however, the founders believe they already have a revolutionary product that’s enabling a new world of discoveries. Down the line, they see SciDB and REVEAL contributing to national and worldwide health research that will allow doctors to provide the most informed, personalized care imaginable.

“The query that every doctor wants to run is, when you come into his or her office and display a set of symptoms, the doctor asks, ‘Who in this national database has genetics that look like mine, symptoms that look like mine, lifestyle exposures that look like mine? And what was their diagnosis? What was their treatment? And what was their morbidity?’” Stonebraker explains. “This is cross-correlating you with everybody else to do very personalized medicine, and I think this is within our grasp.”

Read More

Robots Learning to Move like Animals

Quadruped robot learning locomotion skills by imitating a dog.

Whether it’s a dog chasing after a ball, or a monkey swinging through the
trees, animals can effortlessly perform an incredibly rich repertoire of agile
locomotion skills. But designing controllers that enable legged robots to
replicate these agile behaviors can be a very challenging task. The superior
agility seen in animals, as compared to robots, might lead one to wonder: can
we create more agile robotic controllers with less effort by directly imitating
animals?

In this work, we present a framework for learning robotic locomotion skills by
imitating animals. Given a reference motion clip recorded from an animal (e.g.
a dog), our framework uses reinforcement learning to train a control policy
that enables a robot to imitate the motion in the real world. Then, by simply
providing the system with different reference motions, we are able to train a
quadruped robot to perform a diverse set of agile behaviors, ranging from fast
walking gaits to dynamic hops and turns. The policies are trained primarily in
simulation, and then transferred to the real world using a latent space
adaptation technique, which is able to efficiently adapt a policy using only a
few minutes of data from the real robot.

Q&A: Markus Buehler on setting coronavirus and AI-inspired proteins to music

The proteins that make up all living things are alive with music. Just ask Markus Buehler: The musician and MIT professor develops artificial intelligence models to design new proteins, sometimes by translating them into sound. His goal is to create new biological materials for sustainable, non-toxic applications. In a project with the MIT-IBM Watson AI Lab, Buehler is searching for a protein to extend the shelf-life of perishable food. In a new study in Extreme Mechanics Letters, he and his colleagues offer a promising candidate: a silk protein made by honeybees for use in hive building. 

In another recent study, in APL Bioengineering, he went a step further and used AI to discover an entirely new protein. As both studies went to print, the Covid-19 outbreak was surging in the United States, and Buehler turned his attention to the spike protein of SARS-CoV-2, the appendage that makes the novel coronavirus so contagious. He and his colleagues are trying to unpack its vibrational properties through molecular-based sound spectra, which could hold one key to stopping the virus. Buehler recently sat down to discuss the art and science of his work.

Q: Your work focuses on the alpha helix proteins found in skin and hair. What makes these proteins so intriguing?

A: Proteins are the bricks and mortar that make up our cells, organs, and body. Alpha helix proteins are especially important. Their spring-like structure gives them elasticity and resilience, which is why skin, hair, feathers, hooves, and even cell membranes are so durable. But they’re not just tough mechanically, they have built-in antimicrobial properties. With IBM, we’re trying to harness this biochemical trait to create a protein coating that can slow the spoilage of quick-to-rot foods like strawberries.

Q: How did you enlist AI to produce this silk protein?

A: We trained a deep learning model on the Protein Data Bank, which contains the amino acid sequences and three-dimensional shapes of about 120,000 proteins. We then fed the model a snippet of an amino acid chain for honeybee silk and asked it to predict the protein’s shape, atom-by-atom. We validated our work by synthesizing the protein for the first time in a lab — a first step toward developing a thin antimicrobial, structurally-durable coating that can be applied to food. My colleague, Benedetto Marelli, specializes in this part of the process. We also used the platform to predict the structure of proteins that don’t yet exist in nature. That’s how we designed our entirely new protein in the APL Bioengineering study. 

Q: How does your model improve on other protein prediction methods? 

A: We use end-to-end prediction. The model builds the protein’s structure directly from its sequence, translating amino acid patterns into three-dimensional geometries. It’s like translating a set of IKEA instructions into a built bookshelf, minus the frustration. Through this approach, the model effectively learns how to build a protein from the protein itself, via the language of its amino acids. Remarkably, our method can accurately predict protein structure without a template. It outperforms other folding methods and is significantly faster than physics-based modeling. Because the Protein Data Bank is limited to proteins found in nature, we needed a way to visualize new structures to make new proteins from scratch.

Q: How could the model be used to design an actual protein?

A: We can build atom-by-atom models for sequences found in nature that haven’t yet been studied, as we did in the APL Bioengineering study using a different method. We can visualize the protein’s structure and use other computational methods to assess its function by analyzing its stability and the other proteins it binds to in cells. Our model could be used in drug design or to interfere with protein-mediated biochemical pathways in infectious disease.

Q: What’s the benefit of translating proteins into sound?

A: Our brains are great at processing sound! In one sweep, our ears pick up all of its hierarchical features: pitch, timbre, volume, melody, rhythm, and chords. We would need a high-powered microscope to see the equivalent detail in an image, and we could never see it all at once. Sound is such an elegant way to access the information stored in a protein. 

Typically, sound is made from vibrating a material, like a guitar string, and music is made by arranging sounds in hierarchical patterns. With AI we can combine these concepts, and use molecular vibrations and neural networks to construct new musical forms. We’ve been working on methods to turn protein structures into audible representations, and translate these representations into new materials. 

Q: What can the sonification of SARS-CoV-2’s “spike” protein tell us?

A: Its protein spike contains three protein chains folded into an intriguing pattern. These structures are too small for the eye to see, but they can be heard. We represented the physical protein structure, with its entangled chains, as interwoven melodies that form a multi-layered composition. The spike protein’s amino acid sequence, its secondary structure patterns, and its intricate three-dimensional folds are all featured. The resulting piece is a form of counterpoint music, in which notes are played against notes. Like a symphony, the musical patterns reflect the protein’s intersecting geometry realized by materializing its DNA code.

Q: What did you learn?

A: The virus has an uncanny ability to deceive and exploit the host for its own multiplication. Its genome hijacks the host cell’s protein manufacturing machinery, and forces it to replicate the viral genome and produce viral proteins to make new viruses. As you listen, you may be surprised by the pleasant, even relaxing, tone of the music. But it tricks our ear in the same way the virus tricks our cells. It’s an invader disguised as a friendly visitor. Through music, we can see the SARS-CoV-2 spike from a new angle, and appreciate the urgent need to learn the language of proteins.  

Q: Can any of this address Covid-19, and the virus that causes it?

A: In the longer term, yes. Translating proteins into sound gives scientists another tool to understand and design proteins. Even a small mutation can limit or enhance the pathogenic power of SARS-CoV-2. Through sonification, we can also compare the biochemical processes of its spike protein with previous coronaviruses, like SARS or MERS. 

In the music we created, we analyzed the vibrational structure of the spike protein that infects the host. Understanding these vibrational patterns is critical for drug design and much more. Vibrations may change as temperatures warm, for example, and they may also tell us why the SARS-CoV-2 spike gravitates toward human cells more than other viruses. We’re exploring these questions in current, ongoing research with my graduate students. 

We might also use a compositional approach to design drugs to attack the virus. We could search for a new protein that matches the melody and rhythm of an antibody capable of binding to the spike protein, interfering with its ability to infect.

Q: How can music aid protein design?

A: You can think of music as an algorithmic reflection of structure. Bach’s Goldberg Variations, for example, are a brilliant realization of counterpoint, a principle we’ve also found in proteins. We can now hear this concept as nature composed it, and compare it to ideas in our imagination, or use AI to speak the language of protein design and let it imagine new structures. We believe that the analysis of sound and music can help us understand the material world better. Artistic expression is, after all, just a model of the world within us and around us.  

Co-authors of the study in Extreme Mechanics Letters are: Zhao Qin, Hui Sun, Eugene Lim and Benedetto Marelli at MIT; and Lingfei Wu, Siyu Huo, Tengfei Ma and Pin-Yu Chen at IBM Research. Co-author of the study in APL Bioengineering is Chi-Hua Yu. Buehler’s sonification work is supported by MIT’s Center for Art, Science and Technology (CAST) and the Mellon Foundation. 

Read More