In machine learning, synthetic data can offer real performance improvements

Teaching a machine to recognize human actions has many potential applications, such as automatically detecting workers who fall at a construction site or enabling a smart home robot to interpret a user’s gestures.

To do this, researchers train machine-learning models using vast datasets of video clips that show humans performing actions. However, not only is it expensive and laborious to gather and label millions or billions of videos, but the clips often contain sensitive information, like people’s faces or license plate numbers. Using these videos might also violate copyright or data protection laws. And this assumes the video data are publicly available in the first place — many datasets are owned by companies and aren’t free to use.

So, researchers are turning to synthetic datasets. These are made by a computer that uses 3D models of scenes, objects, and humans to quickly produce many varying clips of specific actions — without the potential copyright issues or ethical concerns that come with real data.

But are synthetic data as “good” as real data? How well does a model trained with these data perform when it’s asked to classify real human actions? A team of researchers at MIT, the MIT-IBM Watson AI Lab, and Boston University sought to answer this question. They built a synthetic dataset of 150,000 video clips that captured a wide range of human actions, which they used to train machine-learning models. Then they showed these models six datasets of real-world videos to see how well they could learn to recognize actions in those clips.

The researchers found that the synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.

This work could help researchers use synthetic datasets in such a way that models achieve higher accuracy on real-world tasks. It could also help scientists identify which machine-learning applications could be best-suited for training with synthetic data, in an effort to mitigate some of the ethical, privacy, and copyright concerns of using real datasets.

“The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining. There is a cost in creating an action in synthetic data, but once that is done, then you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That is the beauty of synthetic data,” says Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab, and co-author of a paper detailing this research.

The paper is authored by lead author Yo-whan “John” Kim ’22; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.   

Building a synthetic dataset

The researchers began by compiling a new dataset using three publicly available datasets of synthetic video clips that captured human actions. Their dataset, called Synthetic Action Pre-training and Transfer (SynAPT), contained 150 action categories, with 1,000 video clips per category.

They selected as many action categories as possible, such as people waving or falling on the floor, depending on the availability of clips that contained clean video data.

Once the dataset was prepared, they used it to pretrain three machine-learning models to recognize the actions. Pretraining involves training a model for one task to give it a head-start for learning other tasks. Inspired by the way people learn — we reuse old knowledge when we learn something new — the pretrained model can use the parameters it has already learned to help it learn a new task with a new dataset faster and more effectively.
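
To make the pretrain-then-transfer recipe concrete, here is a minimal PyTorch sketch. The video backbone, class counts, and hyperparameters are illustrative assumptions, not the team's exact setup.

```python
# Hypothetical pretrain-then-transfer sketch; not the paper's actual code.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# 1) Pretrain a video backbone on synthetic clips (150 action categories).
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 150)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def run_epoch(loader):
    """One pass over a loader yielding (clips, labels); clips are (B, 3, T, H, W)."""
    for clips, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(clips), labels)
        loss.backward()
        optimizer.step()

# ... run_epoch(synthetic_loader) for several epochs ...

# 2) Transfer: keep the pretrained backbone, swap the classifier head for the
#    real-world dataset's classes, and fine-tune on real clips.
num_real_classes = 51  # illustrative downstream class count
model.fc = nn.Linear(model.fc.in_features, num_real_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# ... run_epoch(real_loader) ...
```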

They tested the pretrained models using six datasets of real video clips, each capturing classes of actions that were different from those in the training data.

The researchers were surprised to see that all three models pretrained on synthetic data outperformed models trained with real video clips on four of the six datasets. Their accuracy was highest for datasets that contained video clips with “low scene-object bias.”

Low scene-object bias means that the model cannot recognize the action by looking at the background or other objects in the scene — it must focus on the action itself. For example, if the model is tasked with classifying diving poses in video clips of people diving into a swimming pool, it cannot identify a pose by looking at the water or the tiles on the wall. It must focus on the person’s motion and position to classify the action.

“In videos with low scene-object bias, the temporal dynamics of the actions is more important than the appearance of the objects or the background, and that seems to be well-captured with synthetic data,” Feris says.

“High scene-object bias can actually act as an obstacle. The model might misclassify an action by looking at an object, not the action itself. It can confuse the model,” Kim explains.

Boosting performance

Building off these results, the researchers want to include more action classes and additional synthetic video platforms in future work, eventually creating a catalog of models that have been pretrained using synthetic data, says co-author Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab.

“We want to build models which have very similar performance or even better performance than the existing models in the literature, but without being bound by any of those biases or security concerns,” he adds.

They also want to combine their work with research that seeks to generate more accurate and realistic synthetic videos, which could boost the performance of the models, says SouYoung Jin, a co-author and CSAIL postdoc. She is also interested in exploring how models might learn differently when they are trained with synthetic data.

“We use synthetic datasets to prevent privacy issues or contextual or social bias, but what does the model actually learn? Does it learn something that is unbiased?” she says.

Now that they have demonstrated this potential use of synthetic videos, they hope other researchers will build upon their work.

“Despite there being a lower cost to obtaining well-annotated synthetic data, currently we do not have a dataset with the scale to rival the biggest annotated datasets with real videos. By discussing the different costs and concerns with real videos, and showing the efficacy of synthetic data, we hope to motivate efforts in this direction,” adds co-author Samarth Mishra, a graduate student at Boston University (BU).

Additional co-authors include Hilde Kuehne, professor of computer science at Goethe University in Germany and an affiliated professor at the MIT-IBM Watson AI Lab; Leonid Karlinsky, research staff member at the MIT-IBM Watson AI Lab; Venkatesh Saligrama, professor in the Department of Electrical and Computer Engineering at BU; and Kate Saenko, associate professor in the Department of Computer Science at BU and a consulting professor at the MIT-IBM Watson AI Lab.

This research was supported by the Defense Advanced Research Projects Agency LwLL, as well as the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside.

Read More

Study urges caution when comparing neural networks to the brain

Neural networks, a type of computing system loosely modeled on the organization of the human brain, form the basis of many artificial intelligence systems for applications such as speech recognition, computer vision, and medical image analysis.

In the field of neuroscience, researchers often use neural networks to try to model the same kind of tasks that the brain performs, in hopes that the models could suggest new hypotheses regarding how the brain itself performs those tasks. However, a group of researchers at MIT is urging that more caution should be taken when interpreting these models.

In an analysis of more than 11,000 neural networks that were trained to simulate the function of grid cells — key components of the brain’s navigation system — the researchers found that neural networks only produced grid-cell-like activity when they were given very specific constraints that are not found in biological systems.

“What this suggests is that in order to obtain a result with grid cells, the researchers training the models needed to bake in those results with specific, biologically implausible implementation choices,” says Rylan Schaeffer, a former senior research associate at MIT.

Without those constraints, the MIT team found that very few neural networks generated grid-cell-like activity, suggesting that these models do not necessarily generate useful predictions of how the brain works.

Schaeffer, who is now a graduate student in computer science at Stanford University, is the lead author of the new study, which will be presented at the 2022 Conference on Neural Information Processing Systems this month. Ila Fiete, a professor of brain and cognitive sciences and a member of MIT’s McGovern Institute for Brain Research, is the senior author of the paper. Mikail Khona, an MIT graduate student in physics, is also an author.

Modeling grid cells

Neural networks, which researchers have been using for decades to perform a variety of computational tasks, consist of thousands or millions of processing units connected to each other. Each node has connections of varying strengths to other nodes in the network. As the network analyzes huge amounts of data, the strengths of those connections change as the network learns to perform the desired task.

In this study, the researchers focused on neural networks that have been developed to mimic the function of the brain’s grid cells, which are found in the entorhinal cortex of the mammalian brain. Together with place cells, found in the hippocampus, grid cells form a brain circuit that helps animals know where they are and how to navigate to a different location.

Place cells have been shown to fire whenever an animal is in a specific location, and each place cell may respond to more than one location. Grid cells, on the other hand, work very differently. As an animal moves through a space such as a room, grid cells fire only when the animal is at one of the vertices of a triangular lattice. Different groups of grid cells create lattices of slightly different dimensions, which overlap each other. This allows grid cells to encode a large number of unique positions using a relatively small number of cells.
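
The triangular-lattice firing described above is often idealized as a sum of three cosine gratings whose wave vectors are 60 degrees apart. The short sketch below of that textbook rate map is purely illustrative and is not the study's code.

```python
# Textbook idealized grid-cell rate map: a sum of three cosine gratings whose
# wave vectors are 60 degrees apart, peaking on a triangular lattice.
# Illustrative only; not the study's code.
import numpy as np

def grid_rate_map(size=100, spacing=0.3, phase=(0.0, 0.0), extent=1.0):
    """Firing rate over a square arena of side `extent` (arbitrary units)."""
    xs = np.linspace(0.0, extent, size)
    X, Y = np.meshgrid(xs, xs)
    k = 4 * np.pi / (np.sqrt(3) * spacing)   # wave number set by grid spacing
    rate = np.zeros_like(X)
    for angle in np.deg2rad([0, 60, 120]):
        kx, ky = k * np.cos(angle), k * np.sin(angle)
        rate += np.cos(kx * (X - phase[0]) + ky * (Y - phase[1]))
    return np.maximum(rate, 0.0)             # rectify to non-negative rates

# Overlapping modules with slightly different spacings jointly pin down position.
maps = [grid_rate_map(spacing=s) for s in (0.25, 0.32, 0.41)]
```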

This type of location encoding also makes it possible to predict an animal’s next location based on a given starting point and a velocity. In several recent studies, researchers have trained neural networks to perform this same task, which is known as path integration.

To train a neural network to perform this task, researchers feed it a starting point and a velocity that varies over time. The model essentially mimics the activity of an animal roaming through a space, and calculates updated positions as it moves. As the model performs the task, the activity patterns of different units within the network can be measured. Each unit’s activity can be represented as a firing pattern, similar to the firing patterns of neurons in the brain.
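
As a concrete illustration of that training setup, here is a minimal PyTorch sketch of a recurrent network learning to path integrate from a start position and a velocity sequence. The architecture and readout are generic stand-ins, not the specific choices analyzed in the study.

```python
# Generic path-integration setup: an RNN receives a start position and a
# velocity sequence and is trained to report position over time. Illustrative
# stand-in, not the biologically constrained models analyzed in the study.
import torch
import torch.nn as nn

class PathIntegrator(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.init = nn.Linear(2, hidden)        # encode the starting (x, y)
        self.rnn = nn.RNN(2, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 2)     # predict (x, y) at each step

    def forward(self, start_pos, velocities):   # (B, 2), (B, T, 2)
        h0 = torch.tanh(self.init(start_pos)).unsqueeze(0)
        states, _ = self.rnn(velocities, h0)    # unit activities, (B, T, hidden)
        return self.readout(states), states

model = PathIntegrator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random-walk trajectories: the true position is the cumulative sum of velocity.
B, T = 32, 100
start = torch.zeros(B, 2)
vel = 0.05 * torch.randn(B, T, 2)
true_pos = start.unsqueeze(1) + torch.cumsum(vel, dim=1)

opt.zero_grad()
pred_pos, unit_activity = model(start, vel)
loss = nn.functional.mse_loss(pred_pos, true_pos)
loss.backward()
opt.step()
# `unit_activity` is what one would inspect for grid-cell-like firing patterns.
```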

In several previous studies, researchers have reported that their models produced units with activity patterns that closely mimic the firing patterns of grid cells. These studies concluded that grid-cell-like representations would naturally emerge in any neural network trained to perform the path integration task.

However, the MIT researchers found very different results. In an analysis of more than 11,000 neural networks that they trained on path integration, they found that while nearly 90 percent of them learned the task successfully, only about 10 percent of those networks generated activity patterns that could be classified as grid-cell-like. That figure includes networks in which even a single unit achieved a high grid score.

The earlier studies were more likely to generate grid-cell-like activity only because of the constraints that researchers build into those models, according to the MIT team.

“Earlier studies have presented this story that if you train networks to path integrate, you’re going to get grid cells. What we found is that instead, you have to make this long sequence of choices of parameters, which we know are inconsistent with the biology, and then in a small sliver of those parameters, you will get the desired result,” Schaeffer says.

More biological models

One of the constraints found in earlier studies is that the researchers required the model to convert velocity into a unique position, reported by one network unit that corresponds to a place cell. For this to happen, the researchers also required that each place cell correspond to only one location, which is not how biological place cells work: Studies have shown that place cells in the hippocampus can respond to up to 20 different locations, not just one.

When the MIT team adjusted the models so that place cells were more like biological place cells, the models were still able to perform the path integration task, but they no longer produced grid-cell-like activity. Grid-cell-like activity also disappeared when the researchers instructed the models to generate different types of location output, such as location on a grid with X and Y axes, or location as a distance and angle relative to a home point.

“If the only thing that you ask this network to do is path integrate, and you impose a set of very specific, not physiological requirements on the readout unit, then it’s possible to obtain grid cells,” Fiete says. “But if you relax any of these aspects of this readout unit, that strongly degrades the ability of the network to produce grid cells. In fact, usually they don’t, even though they still solve the path integration task.”

Therefore, if the researchers hadn’t already known of the existence of grid cells, and guided the model to produce them, it would be very unlikely for them to appear as a natural consequence of the model training.

The researchers say that their findings suggest that more caution is warranted when interpreting neural network models of the brain.

“When you use deep learning models, they can be a powerful tool, but one has to be very circumspect in interpreting them and in determining whether they are truly making de novo predictions, or even shedding light on what it is that the brain is optimizing,” Fiete says.

Kenneth Harris, a professor of quantitative neuroscience at University College London, says he hopes the new study will encourage neuroscientists to be more careful when stating what can be shown by analogies between neural networks and the brain.

“Neural networks can be a useful source of predictions. If you want to learn how the brain solves a computation, you can train a network to perform it, then test the hypothesis that the brain works the same way. Whether the hypothesis is confirmed or not, you will learn something,” says Harris, who was not involved in the study. “This paper shows that ‘postdiction’ is less powerful: Neural networks have many parameters, so getting them to replicate an existing result is not as surprising.”

When using these models to make predictions about how the brain works, it’s important to take into account realistic, known biological constraints when building the models, the MIT researchers say. They are now working on models of grid cells that they hope will generate more accurate predictions of how grid cells in the brain work.

“Deep learning models will give us insight about the brain, but only after you inject a lot of biological knowledge into the model,” Khona says. “If you use the correct constraints, then the models can give you a brain-like solution.”

The research was funded by the Office of Naval Research, the National Science Foundation, the Simons Foundation through the Simons Collaboration on the Global Brain, and the Howard Hughes Medical Institute through the Faculty Scholars Program. Mikail Khona was supported by the MathWorks Science Fellowship.

Read More

Using sound to model the world

Imagine the booming chords from a pipe organ echoing through the cavernous sanctuary of a massive, stone cathedral.

The sound a cathedral-goer will hear is affected by many factors, including the location of the organ, where the listener is standing, whether any columns, pews, or other obstacles stand between them, what the walls are made of, the locations of windows or doorways, etc. Hearing a sound can help someone envision their environment.

Researchers at MIT and the MIT-IBM Watson AI Lab are exploring the use of spatial acoustic information to help machines better envision their environments, too. They developed a machine-learning model that can capture how any sound in a room will propagate through the space, enabling the model to simulate what a listener would hear at different locations.

By accurately modeling the acoustics of a scene, the system can learn the underlying 3D geometry of a room from sound recordings. The researchers can use the acoustic information their system captures to build accurate visual renderings of a room, similarly to how humans use sound when estimating the properties of their physical environment.

In addition to its potential applications in virtual and augmented reality, this technique could help artificial-intelligence agents develop better understandings of the world around them. For instance, by modeling the acoustic properties of the sound in its environment, an underwater exploration robot could sense things that are farther away than it could with vision alone, says Yilun Du, a grad student in the Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model.

“Most researchers have only focused on modeling vision so far. But as humans, we have multimodal perception. Not only is vision important, sound is also important. I think this work opens up an exciting research direction on better utilizing sound to model the world,” Du says.

Joining Du on the paper are lead author Andrew Luo, a grad student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Science at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT’s Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Sound and vision

In computer vision research, a type of machine-learning model called an implicit neural representation model has been used to generate smooth, continuous reconstructions of 3D scenes from images. These models utilize neural networks, which contain layers of interconnected nodes, or neurons, that process data to complete a task.

The MIT researchers employed the same type of model to capture how sound travels continuously through a scene.

But they found that vision models benefit from a property known as photometric consistency, which does not apply to sound. If one looks at the same object from two different locations, the object looks roughly the same. With sound, however, change locations and what one hears could be completely different, due to obstacles, distance, and so on. This makes predicting audio very difficult.

The researchers overcame this problem by incorporating two properties of acoustics into their model: the reciprocal nature of sound and the influence of local geometric features.

Sound is reciprocal, which means that if the source of a sound and a listener swap positions, what the person hears is unchanged. Additionally, what one hears in a particular area is heavily influenced by local features, such as an obstacle between the listener and the source of the sound.

To incorporate these two factors into their model, called a neural acoustic field (NAF), they augment the neural network with a grid that captures objects and architectural features in the scene, like doorways or walls. The model randomly samples points on that grid to learn the features at specific locations.
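
A rough sketch of the ingredients just described: a learnable grid of local features sampled at a queried location, emitter and listener coordinates feeding an MLP, and a symmetrization step for reciprocity. The shapes, sizes, and exact reciprocity trick are illustrative assumptions, not the released NAF implementation.

```python
# Rough sketch of a neural acoustic field: a learnable 2D feature grid sampled
# at a queried location, plus emitter/listener coordinates, feeding an MLP
# that predicts impulse-response spectrum bins. Sizes and the reciprocity
# trick are illustrative assumptions, not the released NAF code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralAcousticField(nn.Module):
    def __init__(self, grid_res=32, feat_dim=16, out_bins=256):
        super().__init__()
        # Local geometric features stored on a learnable grid over the room.
        self.grid = nn.Parameter(torch.randn(1, feat_dim, grid_res, grid_res))
        self.mlp = nn.Sequential(
            nn.Linear(2 + 2 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_bins),           # impulse-response spectrum bins
        )

    def local_features(self, pos):              # pos in [-1, 1]^2, shape (B, 2)
        coords = pos.view(-1, 1, 1, 2)          # grid_sample expects (B, H, W, 2)
        feats = F.grid_sample(self.grid.expand(pos.shape[0], -1, -1, -1),
                              coords, align_corners=True)
        return feats.view(pos.shape[0], -1)     # (B, feat_dim)

    def forward(self, emitter, listener):       # both (B, 2)
        x = torch.cat([emitter, listener, self.local_features(listener)], dim=-1)
        # Acoustic reciprocity: swapping emitter and listener should give the
        # same response, so average the two orderings.
        x_swapped = torch.cat([listener, emitter,
                               self.local_features(emitter)], dim=-1)
        return 0.5 * (self.mlp(x) + self.mlp(x_swapped))

naf = NeuralAcousticField()
response = naf(torch.rand(4, 2) * 2 - 1, torch.rand(4, 2) * 2 - 1)  # (4, 256)
```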

“If you imagine standing near a doorway, what most strongly affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found this information enables better generalization than a simple fully connected network,” Luo says.

From predicting sounds to visualizing scenes

Researchers can feed the NAF visual information about a scene and a few spectrograms that show what a piece of audio would sound like when the emitter and listener are located at target locations around the room. Then the model predicts what that audio would sound like if the listener moves to any point in the scene.

The NAF outputs an impulse response, which captures how a sound should change as it propagates through the scene. The researchers then apply this impulse response to different sounds to hear how those sounds should change as a person walks through a room.

For instance, if a song is playing from a speaker in the center of a room, their model would show how that sound gets louder as a person approaches the speaker and then becomes muffled as they walk out into an adjacent hallway.
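
In signal-processing terms, applying an impulse response means convolving a dry source signal with that response. Below is a small self-contained example using a toy, hand-built response rather than one predicted by the model.

```python
# Applying an impulse response: convolve a dry source signal with the response
# to get what a listener at that location would hear. The response here is a
# hand-built toy, not a model prediction.
import numpy as np
from scipy.signal import fftconvolve

sample_rate = 16_000
t = np.arange(0, 1.0, 1 / sample_rate)
dry_signal = np.sin(2 * np.pi * 440 * t)                  # a plain 440 Hz tone

# Toy impulse response: a direct arrival plus a few decaying echoes.
impulse_response = np.zeros(sample_rate // 2)
for delay_s, gain in [(0.0, 1.0), (0.05, 0.4), (0.12, 0.2), (0.30, 0.05)]:
    impulse_response[int(delay_s * sample_rate)] = gain

heard = fftconvolve(dry_signal, impulse_response, mode="full")
```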

When the researchers compared their technique to other methods that model acoustic information, it generated more accurate sound models in every case. And because it learned local geometric information, their model was able to generalize to new locations in a scene much better than other methods.

Moreover, they found that applying the acoustic information their model learns to a computer vision model can lead to a better visual reconstruction of the scene.

“When you only have a sparse set of views, using these acoustic features enables you to capture boundaries more sharply, for instance. And maybe this is because to accurately render the acoustics of a scene, you have to capture the underlying 3D geometry of that scene,” Du says.

The researchers plan to continue enhancing the model so it can generalize to brand new scenes. They also want to apply this technique to more complex impulse responses and larger scenes, such as entire buildings or even a town or city.

“This new technique might open up new opportunities to create a multimodal immersive experience in the metaverse application,” adds Gan.

“My group has done a lot of work on using machine-learning methods to accelerate acoustic simulation or model the acoustics of real-world scenes. This paper by Chuang Gan and his co-authors is clearly a major step forward in this direction,” says Dinesh Manocha, the Paul Chrisman Iribe Professor of Computer Science and Electrical and Computer Engineering at the University of Maryland, who was not involved with this work. “In particular, this paper introduces a nice implicit representation that can capture how sound can propagate in real-world scenes by modeling it using a linear time-invariant system. This work can have many applications in AR/VR as well as real-world scene understanding.”

This work is supported, in part, by the MIT-IBM Watson AI Lab and the Tianqiao and Chrissy Chen Institute.

Read More

Machine learning facilitates “turbulence tracking” in fusion reactors

Fusion, which promises practically unlimited, carbon-free energy using the same processes that power the sun, is at the heart of a worldwide research effort that could help mitigate climate change.

A multidisciplinary team of researchers is now bringing tools and insights from machine learning to aid this effort. Scientists from MIT and elsewhere have used computer-vision models to identify and track turbulent structures that appear under the conditions needed to facilitate fusion reactions.

Monitoring the formation and movements of these structures, called filaments or “blobs,” is important for understanding the heat and particle flows exiting from the reacting fuel, which ultimately determines the engineering requirements for the reactor walls to withstand those flows. However, scientists typically study blobs using averaging techniques, which trade details of individual structures in favor of aggregate statistics. To track individual blobs, researchers must mark them manually in video data.

The researchers built a synthetic video dataset of plasma turbulence to make this process more effective and efficient. They used it to train four computer vision models, each of which identifies and tracks blobs. They trained the models to pinpoint blobs in the same ways that humans would.

When the researchers tested the trained models using real video clips, the models could identify blobs with high accuracy — more than 80 percent in some cases. The models were also able to effectively estimate the size of blobs and the speeds at which they moved.

Because millions of video frames are captured during just one fusion experiment, using machine-learning models to track blobs could give scientists much more detailed information.

“Before, we could get a macroscopic picture of what these structures are doing on average. Now, we have a microscope and the computational power to analyze one event at a time. If we take a step back, what this reveals is the power available from these machine-learning techniques, and ways to use these computational resources to make progress,” says Theodore Golfinopoulos, a research scientist at the MIT Plasma Science and Fusion Center and co-author of a paper detailing these approaches.

His fellow co-authors include lead author Woonghee “Harry” Han, a physics PhD candidate; senior author Iddo Drori, a visiting professor in the Computer Science and Artificial Intelligence Laboratory (CSAIL), faculty associate professor at Boston University, and adjunct at Columbia University; as well as others from the MIT Plasma Science and Fusion Center, the MIT Department of Civil and Environmental Engineering, and the Swiss Federal Institute of Technology at Lausanne in Switzerland. The research appears today in Nature Scientific Reports.

Heating things up

For more than 70 years, scientists have sought to use controlled thermonuclear fusion reactions to develop an energy source. To reach the conditions necessary for a fusion reaction, fuel must be heated to temperatures above 100 million degrees Celsius. (The core of the sun is about 15 million degrees Celsius.)

A common method for containing this super-hot fuel, called plasma, is to use a tokamak. These devices utilize extremely powerful magnetic fields to hold the plasma in place and control the interaction between the exhaust heat from the plasma and the reactor walls.

However, blobs appear like filaments falling out of the plasma at the very edge, between the plasma and the reactor walls. These random, turbulent structures affect how energy flows between the plasma and the reactor.

“Knowing what the blobs are doing strongly constrains the engineering performance that your tokamak power plant needs at the edge,” adds Golfinopoulos.

Researchers use a unique imaging technique to capture video of the plasma’s turbulent edge during experiments. An experimental campaign may last months; a typical day will produce about 30 seconds of data, corresponding to roughly 60 million video frames, with thousands of blobs appearing each second. This makes it impossible to track all blobs manually, so researchers rely on average sampling techniques that only provide broad characteristics of blob size, speed, and frequency.

“On the other hand, machine learning provides a solution to this by blob-by-blob tracking for every frame, not just average quantities. This gives us much more knowledge about what is happening at the boundary of the plasma,” Han says.

He and his co-authors took four well-established computer vision models, which are commonly used for applications like autonomous driving, and trained them to tackle this problem.

Simulating blobs

To train these models, they created a vast dataset of synthetic video clips that captured the blobs’ random and unpredictable nature.

“Sometimes they change direction or speed, sometimes multiple blobs merge, or they split apart. These kinds of events were not considered before with traditional approaches, but we could freely simulate those behaviors in the synthetic data,” Han says.

Creating synthetic data also allowed them to label each blob, which made the training process more effective, Drori adds.

Using these synthetic data, they trained the models to draw boundaries around blobs, teaching them to closely mimic what a human scientist would draw.
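
A toy sketch of that pipeline: render frames containing Gaussian "blobs" whose masks are known by construction, then fit a small segmentation network to those free labels. The renderer and the network are stand-ins, not the four computer-vision models the team actually used.

```python
# Toy version of the synthetic-data pipeline: render frames with Gaussian
# "blobs" whose masks come for free, then fit a small segmentation network.
# The renderer and network are stand-ins, not the paper's four models.
import numpy as np
import torch
import torch.nn as nn

def synth_frame(size=64, n_blobs=3, rng=np.random):
    frame = rng.normal(0.0, 0.05, (size, size))            # background noise
    mask = np.zeros((size, size), dtype=np.float32)
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_blobs):
        cx, cy = rng.uniform(5, size - 5, 2)
        sigma = rng.uniform(1.5, 4.0)
        blob = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
        frame += blob
        mask = np.maximum(mask, (blob > 0.5).astype(np.float32))  # free label
    return frame.astype(np.float32), mask

net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),                         # per-pixel blob logit
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

frames, masks = zip(*(synth_frame() for _ in range(8)))
x = torch.from_numpy(np.stack(frames)).unsqueeze(1)         # (B, 1, H, W)
y = torch.from_numpy(np.stack(masks)).unsqueeze(1)
opt.zero_grad()
loss = nn.functional.binary_cross_entropy_with_logits(net(x), y)
loss.backward()
opt.step()
```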

Then they tested the models using real video data from experiments. First, they measured how closely the boundaries the models drew matched up with actual blob contours.

But they also wanted to see if the models predicted objects that humans would identify. They asked three human experts to pinpoint the centers of blobs in video frames and checked to see if the models predicted blobs in those same locations.

The models were able to draw accurate blob boundaries, overlapping with brightness contours that are considered ground truth, about 80 percent of the time. The models’ evaluations were similar to those of the human experts, and the models successfully predicted the theory-defined regime of the blobs, in agreement with results from a traditional method.
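
Two simple metrics capture the spirit of those checks: mask overlap against a brightness-contour ground truth, and whether human-marked centers fall inside the predicted regions. This is a hypothetical sketch, not the paper's evaluation code.

```python
# Two simple checks in the spirit of the evaluation described above.
# Hypothetical sketch; not the paper's evaluation code.
import numpy as np

def overlap_score(pred_mask, truth_mask):
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred_mask, truth_mask).sum()
    union = np.logical_or(pred_mask, truth_mask).sum()
    return inter / union if union else 1.0

def center_hit_rate(pred_mask, expert_centers):
    """Fraction of expert-marked (row, col) centers that land inside the mask."""
    hits = sum(bool(pred_mask[r, c]) for r, c in expert_centers)
    return hits / len(expert_centers)
```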

Now that they have shown the success of using synthetic data and computer vision models for tracking blobs, the researchers plan to apply these techniques to other problems in fusion research, such as estimating particle transport at the boundary of a plasma, Han says.

They also made the dataset and models publicly available, and look forward to seeing how other research groups apply these tools to study the dynamics of blobs, says Drori.

“Prior to this, there was a barrier to entry that mostly the only people working on this problem were plasma physicists, who had the datasets and were using their methods. There is a huge machine-learning and computer-vision community. One goal of this work is to encourage participation in fusion research from the broader machine-learning community toward the broader goal of helping solve the critical problem of climate change,” he adds.

This research is supported, in part, by the U.S. Department of Energy and the Swiss National Science Foundation.

Read More

3 Questions: How AI image generators could help robots

AI image generators, which create fantastical sights at the intersection of dreams and reality, bubble up on every corner of the web. Their entertainment value is demonstrated by an ever-expanding treasure trove of whimsical and random images serving as indirect portals to the brains of human designers. A simple text prompt yields a nearly instantaneous image, satisfying our primitive brains, which are hardwired for instant gratification. 

Although seemingly nascent, the field of AI-generated art can be traced back as far as the 1960s with early attempts using symbolic rule-based approaches to make technical images. While the progression of models that untangle and parse words has gained increasing sophistication, the explosion of generative art has sparked debate around copyright, disinformation, and biases, all mired in hype and controversy. Yilun Du, a PhD student in the Department of Electrical Engineering and Computer Science and affiliate of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), recently developed a new method that makes models like DALL-E 2 more creative and have better scene understanding. Here, Du describes how these models work, whether this technical infrastructure can be applied to other domains, and how we draw the line between AI and human creativity. 

Q: AI-generated images use something called “stable diffusion” models to turn words into astounding images in just a few moments. But for every image used, there’s usually a human behind it. So what’s the line between AI and human creativity? How do these models really work? 

A: Imagine all of the images you could get on Google Search and their associated patterns. This is the diet these models are fed on. They’re trained on all of these images and their captions to generate images similar to the billions of images it has seen on the internet.

Let’s say a model has seen a lot of dog photos. It’s trained so that when it gets a similar text input prompt like “dog,” it’s able to generate a photo that looks very similar to the many dog pictures already seen. Now, more methodologically, how this all works dates back to a very old class of models called “energy-based models,” originating in the ’70s or ’80s.

In energy-based models, an energy landscape over images is constructed, which is used to simulate the physical dissipation to generate images. When you drop a dot of ink into water and it dissipates, for example, at the end, you just get this uniform texture. But if you try to reverse this process of dissipation, you gradually get the original ink dot in the water again. Or let’s say you have this very intricate block tower, and if you hit it with a ball, it collapses into a pile of blocks. This pile of blocks is then very disordered, and there’s not really much structure to it. To resuscitate the tower, you can try to reverse this folding process to generate your original pile of blocks.

The way these generative models generate images is very similar: you start from random noise, and you basically learn to simulate the reverse of the process that takes an image to noise, iteratively refining the image to make it more and more realistic. 
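
A minimal sketch of that noise-to-image reversal, written in the style of a standard diffusion sampler; the `denoiser` below is a placeholder for a trained noise-prediction network, and the noise schedule is illustrative.

```python
# Toy reverse-diffusion sampler: start from pure noise and iteratively remove
# it. `denoiser` stands in for a trained noise-prediction network; the
# schedule and shapes are illustrative.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoiser(x, t):
    # Placeholder for a trained model that predicts the noise in x at step t.
    return torch.zeros_like(x)

def sample(shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                            # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t)
        # Estimate the slightly less noisy image for step t-1.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject noise
    return x

image = sample()
```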

In terms of what’s the line between AI and human creativity, you can say that these models are really trained on the creativity of people. The internet has all types of paintings and images that people have already created in the past. These models are trained to recapitulate and generate the images that have been on the internet. As a result, these models are more like crystallizations of what people have spent creativity on for hundreds of years. 

At the same time, because these models are trained on what humans have designed, they can generate very similar pieces of art to what humans have done in the past. They can find patterns in art that people have made, but it’s much harder for these models to actually generate creative photos on their own. 

If you try to enter a prompt like “abstract art” or “unique art” or the like, it doesn’t really understand the creativity aspect of human art. The models are, rather, recapitulating what people have done in the past, so to speak, as opposed to generating fundamentally new and creative art.

Since these models are trained on vast swaths of images from the internet, a lot of these images are likely copyrighted. You don’t exactly know what the model is retrieving when it’s generating new images, so there’s a big question of how you can even determine if the model is using copyrighted images. If the model depends, in some sense, on some copyrighted images, are then those new images copyrighted? That’s another question to address. 

Q: Do you believe images generated by diffusion models encode some sort of understanding about natural or physical worlds, either dynamically or geometrically? Are there efforts toward “teaching” image generators the basics of the universe that babies learn so early on? 

A: Do they understand, in code, some grasp of natural and physical worlds? I think definitely. If you ask a model to generate a stable configuration of blocks, it definitely generates a block configuration that’s stable. If you tell it, generate an unstable configuration of blocks, it does look very unstable. Or if you say “a tree next to a lake,” it’s roughly able to generate that. 

In a sense, it seems like these models have captured a large aspect of common sense. But the issue that keeps us, still, very far away from truly understanding the natural and physical world is that when you try to generate infrequent combinations of words that you or I can very easily imagine in our minds, these models cannot.

For example, if you say, “put a fork on top of a plate,” that happens all the time. If you ask the model to generate this, it easily can. If you say, “put a plate on top of a fork,” again, it’s very easy for us to imagine what this would look like. But if you put this into any of these large models, you’ll never get a plate on top of a fork. You instead get a fork on top of a plate, since the models are learning to recapitulate all the images it’s been trained on. It can’t really generalize that well to combinations of words it hasn’t seen. 

A fairly well-known example is an astronaut riding a horse, which the model can do with ease. But if you say a horse riding an astronaut, it still generates a person riding a horse. It seems like these models are capturing a lot of correlations in the datasets they’re trained on, but they’re not actually capturing the underlying causal mechanisms of the world.

Another example that’s commonly used is a very complicated text description, like one object to the right of another one, a third object in front, and a fourth one flying. The model is really only able to satisfy maybe one or two of the objects. This could be partially because of the training data, as it’s rare to have very complicated captions. But it could also suggest that these models aren’t very structured. You can imagine that if you get very complicated natural language prompts, there’s no manner in which the model can accurately represent all the component details.

Q: You recently came up with a new method that uses multiple models to create more complex images with better understanding for generative art. Are there potential applications of this framework outside of image or text domains? 

A: We were really inspired by one of the limitations of these models. When you give these models very complicated scene descriptions, they aren’t actually able to correctly generate images that match them. 

One thought is, since it’s a single model with a fixed computational graph, meaning you can only use a fixed amount of computation to generate an image, if you get an extremely complicated prompt, there’s no way you can use more computational power to generate that image.

If I gave a human a description of a scene that was, say, 100 lines long versus a scene that’s one line long, a human artist can spend much longer on the former. These models don’t really have the sensibility to do this. We propose, then, that given very complicated prompts, you can actually compose many different independent models together and have each individual model represent a portion of the scene you want to describe.

We find that this enables our model to generate more complicated scenes, or those that more accurately generate different aspects of the scene together. In addition, this approach can be generally applied across a variety of different domains. While image generation is likely the most currently successful application, generative models have actually been seeing all types of applications in a variety of domains. You can use them to generate different diverse robot behaviors, synthesize 3D shapes, enable better scene understanding, or design new materials. You could potentially compose multiple desired factors to generate the exact material you need for a particular application.
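
One way to picture the composition, in the spirit of composing diffusion or energy-based models, is to combine noise estimates from several prompt-conditioned models at every denoising step so the sample satisfies all the prompts together. The combination rule below is one common choice, shown only as a hypothetical sketch.

```python
# Combining several prompt-conditioned denoisers at each sampling step, so the
# result satisfies all prompts jointly. The combination rule and the denoiser
# dictionary are illustrative assumptions.
import torch

def composed_denoiser(denoisers, x, t, weight=1.0):
    """Sum each prompt's contribution relative to an unconditional baseline."""
    eps_uncond = denoisers["unconditional"](x, t)
    eps = eps_uncond.clone()
    for name, d in denoisers.items():
        if name == "unconditional":
            continue
        eps = eps + weight * (d(x, t) - eps_uncond)   # add this prompt's direction
    return eps

# Usage: plug `composed_denoiser` into a sampling loop like the one sketched
# earlier, with one denoiser per scene component, e.g.
# {"unconditional": ..., "a tree": ..., "next to a lake": ...}.
```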

One thing we’ve been very interested in is robotics. In the same way that you can generate different images, you can also generate different robot trajectories (the path and schedule), and by composing different models together, you are able to generate trajectories with different combinations of skills. If you have natural language specifications of jumping versus avoiding an obstacle, you could compose these models together and then generate robot trajectories that can both jump and avoid an obstacle.

In a similar manner, if we want to design proteins, we can specify different functions or aspects — in an analogous manner to how we use language to specify the content of the images — with language-like descriptions, such as the type or functionality of the protein. We could then compose these together to generate new proteins that can potentially satisfy all of these given functions. 

We’ve also explored using diffusion models on 3D shape generation, where you can use this approach to generate and design 3D assets. Normally, 3D asset design is a very complicated and laborious process. By composing different models together, it becomes much easier to generate shapes such as, “I want a 3D shape with four legs, with this style and height,” potentially automating portions of 3D asset design. 

Read More

Deep learning with light

Ask a smart home device for the weather forecast, and it takes several seconds for the device to respond. One reason this latency occurs is because connected devices don’t have enough memory or power to store and run the enormous machine-learning models needed for the device to understand what a user is asking of it. The model is stored in a data center that may be hundreds of miles away, where the answer is computed and sent to the device.

MIT researchers have created a new method for computing directly on these devices, which drastically reduces this latency. Their technique shifts the memory-intensive steps of running a machine-learning model to a central server where components of the model are encoded onto light waves.

The waves are transmitted to a connected device using fiber optics, which enables tons of data to be sent lightning-fast through a network. The receiver then employs a simple optical device that rapidly performs computations using the parts of a model carried by those light waves.

This technique leads to more than a hundredfold improvement in energy efficiency when compared to other methods. It could also improve security, since a user’s data do not need to be transferred to a central location for computation.

This method could enable a self-driving car to make decisions in real-time while using just a tiny percentage of the energy currently required by power-hungry computers. It could also allow a user to have a latency-free conversation with their smart home device, be used for live video processing over cellular networks, or even enable high-speed image classification on a spacecraft millions of miles from Earth.

“Every time you want to run a neural network, you have to run the program, and how fast you can run the program depends on how fast you can pipe the program in from memory. Our pipe is massive — it corresponds to sending a full feature-length movie over the internet every millisecond or so. That is how fast data comes into our system. And it can compute as fast as that,” says senior author Dirk Englund, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and member of the MIT Research Laboratory of Electronics.

Joining Englund on the paper are lead author and EECS grad student Alexander Sludds; EECS grad student Saumil Bandyopadhyay; research scientist Ryan Hamerly; and others from MIT, the MIT Lincoln Laboratory, and Nokia Corporation. The research is published today in Science.

Lightening the load

Neural networks are machine-learning models that use layers of connected nodes, or neurons, to recognize patterns in datasets and perform tasks, like classifying images or recognizing speech. But these models can contain billions of weight parameters, which are numeric values that transform input data as they are processed. These weights must be stored in memory. At the same time, the data transformation process involves billions of algebraic computations, which require a great deal of power to perform.

The process of fetching data (the weights of the neural network, in this case) from memory and moving them to the parts of a computer that do the actual computation is one of the biggest limiting factors to speed and energy efficiency, says Sludds.

“So our thought was, why don’t we take all that heavy lifting — the process of fetching billions of weights from memory — move it away from the edge device and put it someplace where we have abundant access to power and memory, which gives us the ability to fetch those weights quickly?” he says.

The neural network architecture they developed, Netcast, involves storing weights in a central server that is connected to a novel piece of hardware called a smart transceiver. This smart transceiver, a thumb-sized chip that can receive and transmit data, uses technology known as silicon photonics to fetch trillions of weights from memory each second.

It receives weights as electrical signals and imprints them onto light waves. Since the weight data are encoded as bits (1s and 0s), the transceiver converts them by switching lasers; a laser is turned on for a 1 and off for a 0. It combines these light waves and then periodically transfers them through a fiber optic network, so a client device doesn’t need to query the server to receive them.

“Optics is great because there are many ways to carry data within optics. For instance, you can put data on different colors of light, and that enables a much higher data throughput and greater bandwidth than with electronics,” explains Bandyopadhyay.

Trillions per second

Once the light waves arrive at the client device, a simple optical component known as a broadband “Mach-Zehnder” modulator uses them to perform super-fast, analog computation. This involves encoding input data from the device, such as sensor information, onto the weights. Then it sends each individual wavelength to a receiver that detects the light and measures the result of the computation.
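A simplified numerical picture of that multiply-and-detect step, purely illustrative rather than a simulation of the optics, is sketched below: each incoming pulse carries one weight, the modulator scales it by the locally held input value, and the detector’s time-integrated output is the weighted sum.

```python
import numpy as np

# Simplified numerical picture, not a simulation of the optics: each incoming
# pulse carries one weight w_i, the client's modulator scales that pulse by the
# locally held input x_i, and the photodetector integrates the result over
# time, so the accumulated signal is ~ sum(w_i * x_i): one multiply-accumulate
# per pulse, done in the analog domain. (Signs, noise, and detector physics
# are glossed over here.)
rng = np.random.default_rng(0)
w = rng.normal(size=1024)            # weights streamed in on light pulses
x = rng.normal(size=1024)            # locally held inputs, e.g. sensor data

modulated = w * x                    # modulator: per-pulse analog product
detected = modulated.sum()           # detector: time-integrated output

assert np.isclose(detected, w @ x)   # matches the digital dot product
print(detected)
```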

The researchers devised a way to use this modulator to do trillions of multiplications per second, which vastly increases the speed of computation on the device while using only a tiny amount of power.   

“In order to make something faster, you need to make it more energy efficient. But there is a trade-off. We’ve built a system that can operate with about a milliwatt of power but still do trillions of multiplications per second. In terms of both speed and energy efficiency, that is a gain of orders of magnitude,” Sludds says.
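A quick back-of-the-envelope calculation shows what those two quoted figures imply when taken together: roughly one femtojoule of energy per multiplication.

```python
# Back-of-the-envelope check of the quoted figures: about a milliwatt of power
# while performing about a trillion multiplications per second works out to
# roughly one femtojoule of energy per multiplication.
power_watts = 1e-3            # ~1 mW, as quoted
mults_per_second = 1e12       # ~1 trillion multiplications per second
energy_per_mult = power_watts / mults_per_second
print(f"{energy_per_mult:.0e} J per multiplication")   # ~1e-15 J, i.e. ~1 fJ
```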

They tested this architecture by sending weights over an 86-kilometer fiber that connects their lab to MIT Lincoln Laboratory. Netcast enabled machine learning with high accuracy — 98.7 percent for image classification and 98.8 percent for digit recognition — at rapid speeds.

“We had to do some calibration, but I was surprised by how little work we had to do to achieve such high accuracy out of the box. We were able to get commercially relevant accuracy,” adds Hamerly.

Moving forward, the researchers want to iterate on the smart transceiver chip to achieve even better performance. They also want to miniaturize the receiver, which is currently the size of a shoe box, down to the size of a single chip so it could fit onto a smart device like a cell phone.

“Using photonics and light as a platform for computing is a really exciting area of research with potentially huge implications on the speed and efficiency of our information technology landscape,” says Euan Allen, a Royal Academy of Engineering Research Fellow at the University of Bath, who was not involved with this work. “The work of Sludds et al. is an exciting step toward seeing real-world implementations of such devices, introducing a new and practical edge-computing scheme whilst also exploring some of the fundamental limitations of computation at very low (single-photon) light levels.”

The research is funded, in part, by NTT Research, the National Science Foundation, the Air Force Office of Scientific Research, the Air Force Research Laboratory, and the Army Research Office.


Read More

The science of strength: How data analytics is transforming college basketball

In the 1990s, if you suggested that the corner three-pointer was the best shot in basketball, you might have been laughed out of the gym.

The game was still dominated largely by a fleet of seven-foot centers, most of whom couldn’t shoot from more than a few feet out from the basket. Even the game’s best player, Michael Jordan, was a mid-range specialist who averaged under two three-point attempts per game for his career.

Fast forward to today, and the best players average around a dozen long-ball attempts per game — typically favoring shots from the corner.

What’s changed? Analytics.

“When I first started in the profession, 10 to 12 years ago, data analytics was almost nonexistent in training rooms,” says Adam Petway, the director of strength and conditioning for men’s basketball at the University of Louisville. “Today, we have force platform technology, we have velocity-based training, we have GPS tracking during games and in training, all to get a more objective analysis to help our athletes. So it’s grown exponentially.”

Petway, who previously worked on the coaching staffs of the NBA’s Philadelphia 76ers and Washington Wizards, holds a bachelor’s degree in sports science, an MBA with an emphasis in sport management, and a doctorate in sports science. Recently, he extended his education through MIT Professional Education’s Applied Data Science Program (ADSP).

“The impetus behind enrolling in ADSP was primarily a curiosity to learn and a desire to get better,” Petway says. “In my time in pro and college sports, we’ve had whole departments dedicated to data science, so I know it’s a skill set I’ll need in the future.”

Applying new skills

Petway took classes in a live online format. Although he was the only strength and conditioning coach in his cohort — learning alongside lawyers, professors, and business executives — he says that the focus on data gave all of his classmates a common language of sorts.

“In many people’s minds, the worlds of data science and NCAA strength and conditioning training may not cross. We are finding that there are many other professional and industry sectors that can benefit from data science and analytics, which explains why we are seeing an ever-growing range of professionals from around the globe enroll in our Applied Data Science Program,” says Bhaskar Pant, executive director of MIT Professional Education. “It’s exciting to hear how change-makers like Adam are using the knowledge they gained from the program to tackle their most pressing challenges using data science tools.”

“Having access to such high-level practitioners within data science was something that I found very, very helpful,” Petway says. “The chance to interact with my classmates, and the chance to interact in small groups with the professionals and the professors, was unbelievable. When you’re writing code in Python you might mess up a semicolon and a comma, and get 200 characters into the code and realize that it’s not going to work. So the ability to stop and ask questions, and really get into the material with a cohort of peers from different industries, that was really helpful.”

Petway points to his newfound abilities to code in Python, and to run data through artificial intelligence programs that utilize unsupervised learning techniques, as major takeaways from his experience. Sports teams produce a wealth of data, he notes, but coaches need to be able to process that information in ways that lead to actionable insights.

“Now I’m able to create decision trees, do visualization with data, and run a principal component analysis,” Petway says. “So instead of relying on third-party companies to come in and tell me what to do, I can take all of that data and disseminate the results myself, which not only saves me time, but it saves a lot of money.”
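As an illustration of the kind of analysis Petway describes (made-up numbers and the scikit-learn library, not his actual data or code), a principal component analysis of athlete monitoring metrics might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative only: made-up numbers, not Petway's data or code. Each row is
# one athlete-session; the columns are monitoring metrics: jump height (cm),
# peak force (N), distance covered (m), and average speed (m/s).
rng = np.random.default_rng(7)
sessions = rng.normal(
    loc=[45.0, 2200.0, 4800.0, 1.9],
    scale=[5.0, 200.0, 400.0, 0.2],
    size=(60, 4),
)

# Standardize the metrics, then project onto the first two principal components.
z = (sessions - sessions.mean(axis=0)) / sessions.std(axis=0)
pca = PCA(n_components=2)
scores = pca.fit_transform(z)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("component scores for the first session:", scores[0])
```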

In addition to giving him new capabilities in his coaching role, the skills were crucial to the research for a paper that Petway and a team of several other authors published in the International Journal of Strength and Conditioning this year. “The data came from my PhD program around five years ago,” Petway notes. “I had the data already, but I couldn’t properly visualize it and analyze it until I took the MIT Professional Education course.”

“MIT’s motto is ‘mens et manus’ (‘mind and hand’), which translates to experience-based learning. As such, there was great thought put into how the Applied Data Science Program is structured. The expectation is that every participant not only gains foundational skills, but also learns how to apply that knowledge in real-world scenarios. We are thrilled to see learning from our course applied to top-level college basketball,” says Munther Dahleh, director of the Institute for Data, Systems, and Society, the William A. Coolidge Professor of Electrical Engineering and Computer Science at MIT, and one of the instructors of ADSP.

Data’s growing role in sports

Analytics are pushing the field of strength and conditioning far beyond the days when trainers would simply tell players to do a certain number of reps in the weight room, Petway says. Wearable devices help to track how much ground athletes cover during practice, as well as their average speed. Data from a force platform helps Petway to analyze the force with which basketball players jump (and land), and even to determine how much force an athlete is generating from each leg. Using a tool called a linear position transducer, Petway can measure how fast athletes are moving a prescribed load during weight-lifting exercises.

“Instead of telling someone to do 90 percent of their squat max, we’re telling them to squat 200 kilos, and to move it at a rate above one meter per second,” says Petway. “So it’s more power- and velocity-driven than your traditional weight training.”
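A minimal sketch of that velocity calculation, using hypothetical transducer samples, checks whether a lift clears the 1-meter-per-second target:

```python
import numpy as np

# Minimal sketch with hypothetical samples (coarsely sampled here for
# readability; a real linear position transducer records far faster): bar
# position during the concentric phase of a squat, sampled at 10 Hz.
sample_rate_hz = 10
position_m = np.array([0.00, 0.09, 0.21, 0.34, 0.48, 0.60])

dt = 1.0 / sample_rate_hz
velocity = np.diff(position_m) / dt                   # per-interval velocity
mean_velocity = (position_m[-1] - position_m[0]) / (dt * (len(position_m) - 1))

print(f"peak velocity: {velocity.max():.2f} m/s")
print(f"mean concentric velocity: {mean_velocity:.2f} m/s")
print("meets the 1 m/s target:", mean_velocity > 1.0)
```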

The goal is not only to improve athletes’ performance, Petway says, but also to create training programs that minimize the chance of injury. Sometimes, that means deviating from well-worn sports clichés about “giving 110 percent” or “leaving it all on the court.”

“There’s a misconception that doing more is always better,” Petway says. “One of my mentors would always say, ‘Sometimes you have to have the courage to do less.’ The most important thing for our athletes is being available for competition. We can use data analytics now to forecast the early onset of fatigue. If we see that their power output in the weight room is decreasing, we may need to intervene with rest before things get worse. It’s about using information to make more objective decisions.”
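The idea of flagging early fatigue from a dip in weight-room power output can be illustrated with a simple trend check; the numbers and the rule below are hypothetical, not the monitoring system Petway actually uses.

```python
import numpy as np

# Toy illustration with made-up numbers, not Petway's monitoring system:
# flag a possible fatigue trend when recent weight-room power output
# slides downward from session to session.
weekly_peak_power_w = np.array([4500, 4550, 4480, 4400, 4310, 4220])   # watts

recent = weekly_peak_power_w[-4:]
slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]   # watts per session
threshold = -0.01 * weekly_peak_power_w[0]                 # ~1% of baseline

if slope < threshold:
    print(f"declining trend ({slope:.0f} W per session); consider extra rest")
else:
    print("power output looks stable")
```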

The ability to create visuals from data, Petway says, has greatly enhanced his ability to communicate with athletes and other coaches about what he’s seeing in the numbers. “It’s a really powerful tool, being able to take a bunch of data points and show that things are trending up or down, along with the intervention we’re going to need to make based on what the data suggests,” he says.

Ultimately, Petway notes, coaches are primarily interested in just one data point: wins and losses. But as more sports professionals see that data science can lead to more wins, he says, analytics will continue to gain a foothold in the industry. “If you can show that preparing a certain way leads to a higher likelihood that the team will win, that really speaks coaches’ language,” he says. “They just want to see results. And if data science can help deliver those results, they’re going to be bought in.”


Read More