Exploring emerging topics in artificial intelligence policy

Members of the public sector, private sector, and academia convened for the second AI Policy Forum Symposium last month to explore critical directions and questions posed by artificial intelligence in our economies and societies.

The virtual event, hosted by the AI Policy Forum (AIPF) — an undertaking by the MIT Schwarzman College of Computing to bridge high-level principles of AI policy with the practices and trade-offs of governing — brought together an array of distinguished panelists to delve into four cross-cutting topics: law, auditing, health care, and mobility.

In the last year, there have been substantial changes in the regulatory and policy landscape around AI in several countries — most notably in Europe with the development of the European Union Artificial Intelligence Act, the first attempt by a major regulator to propose a law on artificial intelligence. In the United States, the National AI Initiative Act of 2020, which became law in January 2021, is providing a coordinated program across the federal government to accelerate AI research and application for economic prosperity and security gains. Finally, China recently advanced several new regulations of its own.

Each of these developments represents a different approach to legislating AI, but what makes a good AI law? And when should AI legislation be based on binding rules with penalties versus establishing voluntary guidelines?

Jonathan Zittrain, professor of international law at Harvard Law School and director of the Berkman Klein Center for Internet and Society, says the self-regulatory approach taken during the expansion of the internet had its limitations, with companies struggling to balance their interests with those of their industry and the public.

“One lesson might be that actually having representative government take an active role early on is a good idea,” he says. “It’s just that they’re challenged by the fact that there appears to be two phases in this environment of regulation. One, too early to tell, and two, too late to do anything about it. In AI I think a lot of people would say we’re still in the ‘too early to tell’ stage but given that there’s no middle zone before it’s too late, it might still call for some regulation.”

A theme that came up repeatedly throughout the first panel on AI laws — a conversation moderated by Dan Huttenlocher, dean of the MIT Schwarzman College of Computing and chair of the AI Policy Forum — was the notion of trust. “If you told me the truth consistently, I would say you are an honest person. If AI could provide something similar, something that I can say is consistent and is the same, then I would say it’s trusted AI,” says Bitange Ndemo, professor of entrepreneurship at the University of Nairobi and the former permanent secretary of Kenya’s Ministry of Information and Communication.

Eva Kaili, vice president of the European Parliament, adds that “In Europe, whenever you use something, like any medication, you know that it has been checked. You know you can trust it. You know the controls are there. We have to achieve the same with AI.” Kaili further stresses that building trust in AI systems will not only lead people to use more applications safely, but will also benefit AI itself, as greater amounts of data will be generated as a result.

The rapidly increasing applicability of AI across fields has prompted the need to address both the opportunities and challenges of emerging technologies and the impact they have on social and ethical issues such as privacy, fairness, bias, transparency, and accountability. In health care, for example, new techniques in machine learning have shown enormous promise for improving quality and efficiency, but questions of equity, data access and privacy, safety and reliability, and immunology and global health surveillance remain at large.

MIT’s Marzyeh Ghassemi, an assistant professor in the Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, and David Sontag, an associate professor of electrical engineering and computer science, collaborated with Ziad Obermeyer, an associate professor of health policy and management at the University of California Berkeley School of Public Health, to organize AIPF Health Wide Reach, a series of sessions to discuss issues of data sharing and privacy in clinical AI. The organizers assembled experts devoted to AI, policy, and health from around the world with the goal of understanding what can be done to decrease barriers to access to high-quality health data to advance more innovative, robust, and inclusive research results while being respectful of patient privacy.

Over the course of the series, members of the group presented on a topic of expertise and were tasked with proposing concrete policy approaches to the challenge discussed. Drawing on these wide-ranging conversations, participants unveiled their findings during the symposium, covering nonprofit and government success stories and limited access models; upside demonstrations; legal frameworks, regulation, and funding; technical approaches to privacy; and infrastructure and data sharing. The group then discussed some of their recommendations that are summarized in a report that will be released soon.

One of the findings calls for making more data available for research use. Recommendations that stem from this finding include updating regulations to promote data sharing and make it easier to use safe harbors, such as the one the Health Insurance Portability and Accountability Act (HIPAA) provides for de-identification, as well as expanding funding for private health institutions to curate datasets, among others. Another finding, to remove barriers to data for researchers, supports a recommendation to decrease obstacles to research and development on federally created health data. “If this is data that should be accessible because it’s funded by some federal entity, we should easily establish the steps that are going to be part of gaining access to that so that it’s a more inclusive and equitable set of research opportunities for all,” says Ghassemi. The group also recommends taking a careful look at the ethical principles that govern data sharing. While there are already many principles proposed around this, Ghassemi says that “obviously you can’t satisfy all levers or buttons at once, but we think that this is a trade-off that’s very important to think through intelligently.”

In addition to law and health care, other facets of AI policy explored during the event included auditing and monitoring AI systems at scale, and the role AI plays in mobility and the range of technical, business, and policy challenges for autonomous vehicles in particular.

The AI Policy Forum Symposium was an effort to bring together communities of practice with the shared aim of designing the next chapter of AI. In his closing remarks, Aleksander Madry, the Cadence Design Systems Professor of Computing at MIT and faculty co-lead of the AI Policy Forum, emphasized the importance of collaboration and the need for different communities to communicate with each other in order to truly make an impact in the AI policy space.

“The dream here is that we all can meet together — researchers, industry, policymakers, and other stakeholders — and really talk to each other, understand each other’s concerns, and think together about solutions,” Madry said. “This is the mission of the AI Policy Forum and this is what we want to enable.”

Taking the guesswork out of dental care with artificial intelligence

When you picture a hospital radiologist, you might think of a specialist who sits in a dark room and spends hours poring over X-rays to make diagnoses. Contrast that with your dentist, who in addition to interpreting X-rays must also perform surgery, manage staff, communicate with patients, and run their business. When dentists analyze X-rays, they do so in bright rooms and on computers that aren’t specialized for radiology, often with the patient sitting right next to them.

Is it any wonder, then, that dentists given the same X-ray might propose different treatments?

“Dentists are doing a great job given all the things they have to deal with,” says Wardah Inam SM ’13, PhD ’16.

Inam is the co-founder of Overjet, a company using artificial intelligence to analyze and annotate X-rays for dentists and insurance providers. Overjet seeks to take the subjectivity out of X-ray interpretations to improve patient care.

“It’s about moving toward more precision medicine, where we have the right treatments at the right time,” says Inam, who co-founded the company with Alexander Jelicich ’13. “That’s where technology can help. Once we quantify the disease, we can make it very easy to recommend the right treatment.”

Overjet has been cleared by the Food and Drug Administration to detect and outline cavities and to quantify bone levels to aid in the diagnosis of periodontal disease, a common but preventable gum infection that causes the jawbone and other tissues supporting the teeth to deteriorate.

In addition to helping dentists detect and treat diseases, Overjet’s software is also designed to help dentists show patients the problems they’re seeing and explain why they’re recommending certain treatments.

The company has already analyzed tens of millions of X-rays, is used by dental practices nationwide, and is currently working with insurance companies that represent more than 75 million patients in the U.S. Inam is hoping the data Overjet is analyzing can be used to further streamline operations while improving care for patients.

“Our mission at Overjet is to improve oral health by creating a future that is clinically precise, efficient, and patient-centric,” says Inam.

It’s been a whirlwind journey for Inam, who knew nothing about the dental industry until a bad experience piqued her interest in 2018.

Getting to the root of the problem

Inam came to MIT in 2010, first for her master’s and then her PhD in electrical engineering and computer science, and says she caught the bug for entrepreneurship early on.

“For me, MIT was a sandbox where you could learn different things and find out what you like and what you don’t like,” Inam says. “Plus, if you are curious about a problem, you can really dive into it.”

While taking entrepreneurship classes at the Sloan School of Management, Inam eventually started a number of new ventures with classmates.

“I didn’t know I wanted to start a company when I came to MIT,” Inam says. “I knew I wanted to solve important problems. I went through this journey of deciding between academia and industry, but I like to see things happen faster and I like to make an impact in my lifetime, and that’s what drew me to entrepreneurship.”

During her postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL), Inam and a group of researchers applied machine learning to wireless signals to create biomedical sensors that could track a person’s movements, detect falls, and monitor respiratory rate.

She didn’t get interested in dentistry until after leaving MIT, when she changed dentists and received an entirely new treatment plan. Confused by the change, she asked for her X-rays and asked other dentists to have a look, only to receive still another variation in diagnosis and treatment recommendations.

At that point, Inam decided to dive into dentistry for herself, reading books on the subject, watching YouTube videos, and eventually interviewing dentists. Before she knew it, she was spending more time learning about dentistry than she was at her job.

The same week Inam quit her job, she learned about MIT’s Hacking Medicine competition and decided to participate. That’s where she started building her team and getting connections. Overjet’s first funding came from the Media Lab-affiliated investment group the E14 Fund.

“The E14 Fund wrote the first check, and I don’t think we would’ve existed if it wasn’t for them taking a chance on us,” she says.

Inam learned that a big reason for variation in treatment recommendations among dentists is the sheer number of potential treatment options for each disease. A cavity, for instance, can be treated with a filling, a crown, a root canal, a bridge, and more.

When it comes to periodontal disease, dentists must make millimeter-level assessments to determine disease severity and progression. The extent and progression of the disease determines the best treatment.

“I felt technology could play a big role in not only enhancing the diagnosis but also to communicate with the patients more effectively so they understand and don’t have to go through the confusing process I did of wondering who’s right,” Inam says.

Overjet began as a tool to help insurance companies streamline dental claims before the company began integrating its tool directly into dentists’ offices. Every day, some of the largest dental organizations nationwide are using Overjet, including Guardian Insurance, Delta Dental, Dental Care Alliance, and Jefferson Dental and Orthodontics.

Today, as a dental X-ray is imported into a computer, Overjet’s software analyzes and annotates the images automatically. By the time the image appears on the computer screen, it has information on the type of X-ray taken, how a tooth may be impacted, the exact level of bone loss with color overlays, the location and severity of cavities, and more.

The analysis gives dentists more information to talk to patients about treatment options.

“Now the dentist or hygienist just has to synthesize that information, and they use the software to communicate with you,” Inam says. “So, they’ll show you the X-rays with Overjet’s annotations and say, ‘You have 4 millimeters of bone loss, it’s in red, that’s higher than the 3 millimeters you had last time you came, so I’m recommending this treatment.’”

Overjet also incorporates historical information about each patient, tracking bone loss on every tooth and helping dentists detect cases where disease is progressing more quickly.

“We’ve seen cases where a cancer patient with dry mouth goes from nothing to something extremely bad in six months between visits, so those patients should probably come to the dentist more often,” Inam says. “It’s all about using data to change how we practice care, think about plans, and offer services to different types of patients.”

The operating system of dentistry

Overjet’s FDA clearances account for two highly prevalent diseases. They also put the company in a position to conduct industry-level analysis and help dental practices compare themselves to peers.

“We use the same tech to help practices understand clinical performance and improve operations,” Inam says. “We can look at every patient at every practice and identify how practices can use the software to improve the care they’re providing.”

Moving forward, Inam sees Overjet playing an integral role in virtually every aspect of dental operations.

“These radiographs have been digitized for a while, but they’ve never been utilized because the computers couldn’t read them,” Inam says. “Overjet is turning unstructured data into data that we can analyze. Right now, we’re building the basic infrastructure. Eventually we want to grow the platform to improve any service the practice can provide, basically becoming the operating system of the practice to help providers do their job more effectively.”

Robots play with play dough

The inner child in many of us feels an overwhelming sense of joy when stumbling across a pile of the fluorescent, rubbery mixture of water, salt, and flour that put goo on the map: play dough. (Even if this happens rarely in adulthood.)

While manipulating play dough is fun and easy for 2-year-olds, the shapeless sludge is hard for robots to handle. Machines have become increasingly reliable with rigid objects, but manipulating soft, deformable objects comes with a laundry list of technical challenges. Most importantly, as with most flexible structures, moving one part is likely to affect everything else.

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University recently let robots try their hand at playing with the modeling compound, but not for nostalgia’s sake. Their new system learns directly from visual inputs to let a robot with a two-fingered gripper see, simulate, and shape doughy objects. “RoboCraft” could reliably plan a robot’s behavior to pinch and release play dough to make various letters, including ones it had never seen. With just 10 minutes of data, the two-finger gripper rivaled human counterparts that teleoperated the machine — performing on par, and at times even better, on the tested tasks.

“Modeling and manipulating objects with high degrees of freedom are essential capabilities for robots to learn how to enable complex industrial and household interaction tasks, like stuffing dumplings, rolling sushi, and making pottery,” says Yunzhu Li, CSAIL PhD student and author on a new paper about RoboCraft. “While there’s been recent advances in manipulating clothes and ropes, we found that objects with high plasticity, like dough or plasticine — despite ubiquity in those household and industrial settings — was a largely underexplored territory. With RoboCraft, we learn the dynamics models directly from high-dimensional sensory data, which offers a promising data-driven avenue for us to perform effective planning.” 

With undefined, smooth material, the whole structure needs to be accounted for before you can do any type of efficient and effective modeling and planning. By turning the images into graphs of little particles, coupled with algorithms, RoboCraft, using a graph neural network as the dynamics model, makes more accurate predictions about the material’s change of shapes. 

Typically, researchers have used complex physics simulators to model and understand force and dynamics being applied to objects, but RoboCraft simply uses visual data. The inner workings of the system rely on three parts to shape soft material into, say, an “R.”

The first part — perception — is all about learning to “see.” It uses cameras to collect raw, visual sensor data from the environment, which are then turned into little clouds of particles to represent the shapes. A graph-based neural network then uses said particle data to learn to “simulate” the object’s dynamics, or how it moves. Then, algorithms help plan the robot’s behavior so it learns to “shape” a blob of dough, armed with the training data from the many pinches. While the letters are a bit loose, they’re indubitably representative. 
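The three stages described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the function names (`build_particle_graph`, `toy_dynamics`, `shape_error`, `plan_pinch`) are hypothetical, the particles are 2D points, and the hand-written `toy_dynamics` rule stands in for the learned graph neural network, which the article does not specify in detail.

```python
import math

def build_particle_graph(particles, radius):
    """'See': connect every pair of particles closer than `radius`."""
    edges = []
    for i in range(len(particles)):
        for j in range(i + 1, len(particles)):
            if math.dist(particles[i], particles[j]) < radius:
                edges.append((i, j))
    return edges

def toy_dynamics(particles, pinch_x, strength=0.5):
    """'Simulate': hand-written stand-in for a learned dynamics model.

    A pinch at x = pinch_x pulls particles within 1 unit toward it.
    """
    moved = []
    for x, y in particles:
        if abs(x - pinch_x) < 1.0:
            x = x + strength * (pinch_x - x)
        moved.append((x, y))
    return moved

def shape_error(particles, target):
    """Mean distance from each particle to its nearest target point."""
    total = sum(min(math.dist(p, t) for t in target) for p in particles)
    return total / len(particles)

def plan_pinch(particles, target, candidates):
    """'Shape': pick the pinch whose predicted outcome is closest to the target."""
    return min(
        candidates,
        key=lambda a: shape_error(toy_dynamics(particles, a), target),
    )

particles = [(0.2, 0.0), (0.5, 0.0), (3.0, 0.0)]
target = [(1.0, 0.0), (3.0, 0.0)]
edges = build_particle_graph(particles, radius=1.0)
best = plan_pinch(particles, target, candidates=[0.0, 1.0, 3.0])
```

Here the planner simply tries each candidate pinch in the stand-in model and keeps the one that brings the particles nearest the target shape. A real system would replace `toy_dynamics` with a model trained on observed pinches and would search over a much richer set of actions.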

Besides cutesy shapes, the team is (actually) working on making dumplings from dough and a prepared filling. Right now, with just a two-finger gripper, it’s a big ask. RoboCraft would need additional tools, such as a rolling pin, a stamp, and a mold (a baker needs multiple tools to cook; so do robots).

Further in the future, the scientists envision using RoboCraft for assistance with household tasks and chores, which could be of particular help to the elderly or those with limited mobility. To accomplish this, given the many obstructions that could arise, a much more adaptive representation of the dough or item would be needed, as well as exploration into what class of models might be suitable to capture the underlying structural systems.

“RoboCraft essentially demonstrates that this predictive model can be learned in very data-efficient ways to plan motion. In the long run, we are thinking about using various tools to manipulate materials,” says Li. “If you think about dumpling or dough making, just one gripper wouldn’t be able to solve it. Helping the model understand and accomplish longer-horizon planning tasks, such as, how the dough will deform given the current tool, movements and actions, is a next step for future work.” 

Li wrote the paper alongside Haochen Shi, Stanford master’s student; Huazhe Xu, Stanford postdoc; Zhiao Huang, PhD student at the University of California at San Diego; and Jiajun Wu, assistant professor at Stanford. They will present the research at the Robotics: Science and Systems conference in New York City. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI), the Samsung Global Research Outreach (GRO) Program, the Toyota Research Institute (TRI), and Amazon, Autodesk, Salesforce, and Bosch.

Researchers release open-source photorealistic simulator for autonomous driving

Hyper-realistic virtual worlds have been heralded as the best driving schools for autonomous vehicles (AVs), since they’ve proven fruitful test beds for safely trying out dangerous driving scenarios. Tesla, Waymo, and other self-driving companies all rely heavily on data to enable expensive and proprietary photorealistic simulators, since nuanced I-almost-crashed data is rarely easy or desirable to gather and recreate.

To that end, scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) created “VISTA 2.0,” a data-driven simulation engine where vehicles can learn to drive in the real world and recover from near-crash scenarios. What’s more, all of the code is being open-sourced to the public. 

“Today, only companies have software like the type of simulation environments and capabilities of VISTA 2.0, and this software is proprietary. With this release, the research community will have access to a powerful new tool for accelerating the research and development of adaptive robust control for autonomous driving,” says MIT Professor and CSAIL Director Daniela Rus, senior author on a paper about the research. 

VISTA 2.0 builds off of the team’s previous model, VISTA, and it’s fundamentally different from existing AV simulators since it’s data-driven — meaning it was built and photorealistically rendered from real-world data — thereby enabling direct transfer to reality. While the initial iteration supported only single car lane-following with one camera sensor, achieving high-fidelity data-driven simulation required rethinking the foundations of how different sensors and behavioral interactions can be synthesized. 

Enter VISTA 2.0: a data-driven system that can simulate complex sensor types and massively interactive scenarios and intersections at scale. With much less data than previous models, the team was able to train autonomous vehicles that could be substantially more robust than those trained on large amounts of real-world data. 

“This is a massive jump in capabilities of data-driven simulation for autonomous vehicles, as well as the increase of scale and ability to handle greater driving complexity,” says Alexander Amini, CSAIL PhD student and co-lead author on two new papers, together with fellow PhD student Tsun-Hsuan Wang. “VISTA 2.0 demonstrates the ability to simulate sensor data far beyond 2D RGB cameras, but also extremely high dimensional 3D lidars with millions of points, irregularly timed event-based cameras, and even interactive and dynamic scenarios with other vehicles as well.” 

The team was able to scale the complexity of the interactive driving tasks for things like overtaking, following, and negotiating, including multiagent scenarios in highly photorealistic environments. 

Training AI models for autonomous vehicles involves hard-to-secure fodder of different varieties of edge cases and strange, dangerous scenarios, because most of our data (thankfully) is just run-of-the-mill, day-to-day driving. Logically, we can’t just crash into other cars just to teach a neural network how to not crash into other cars.

Recently, there’s been a shift away from more classic, human-designed simulation environments to those built up from real-world data. The latter have immense photorealism, but the former can easily model virtual cameras and lidars. With this paradigm shift, a key question has emerged: Can the richness and complexity of all of the sensors that autonomous vehicles need, such as lidar and event-based cameras that are more sparse, accurately be synthesized? 

Lidar sensor data is much harder to interpret in a data-driven world — you’re effectively trying to generate brand-new 3D point clouds with millions of points, only from sparse views of the world. To synthesize 3D lidar point clouds, the team used the data that the car collected, projected it into a 3D space coming from the lidar data, and then let a new virtual vehicle drive around locally from where that original vehicle was. Finally, they projected all of that sensory information back into the frame of view of this new virtual vehicle, with the help of neural networks. 
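The geometric core of that reprojection step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: points and poses are 2D (a top-down view with a heading angle) rather than full 3D, the function names `to_world`, `to_local`, and `reproject` are hypothetical, and the neural-network step mentioned above is omitted entirely. Only the rigid-body change of viewpoint is shown.

```python
import math

def to_world(points, pose):
    """Map points from a vehicle's local frame into the world frame.

    pose = (x, y, yaw): vehicle position and heading (radians) in the world.
    Local x points forward, local y points left.
    """
    px, py, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return [(px + c * x - s * y, py + s * x + c * y) for x, y in points]

def to_local(points, pose):
    """Inverse of to_world: express world points in a vehicle's local frame."""
    px, py, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    local = []
    for x, y in points:
        dx, dy = x - px, y - py
        local.append((c * dx + s * dy, -s * dx + c * dy))
    return local

def reproject(points, recorded_pose, virtual_pose):
    """Re-express points seen by the recorded vehicle from a virtual vehicle's view."""
    return to_local(to_world(points, recorded_pose), virtual_pose)

# A point 10 m straight ahead of the recorded vehicle...
scan = [(10.0, 0.0)]
# ...as seen by a virtual vehicle shifted 1 m to the left.
shifted = reproject(scan, recorded_pose=(0.0, 0.0, 0.0), virtual_pose=(0.0, 1.0, 0.0))
```

In this example the point ends up 10 m ahead and 1 m to the right of the virtual vehicle. Filling in the gaps that appear when the viewpoint moves is the harder part of the problem, and it is not addressed by this sketch.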

Together with the simulation of event-based cameras, which operate at speeds greater than thousands of events per second, the simulator was capable of not only simulating this multimodal information, but also doing so all in real time — making it possible to train neural nets offline, but also test online on the car in augmented reality setups for safe evaluations. “The question of if multisensor simulation at this scale of complexity and photorealism was possible in the realm of data-driven simulation was very much an open question,” says Amini. 

With that, the driving school becomes a party. In the simulation, you can move around, have different types of controllers, simulate different types of events, create interactive scenarios, and just drop in brand new vehicles that weren’t even in the original data. The team tested for lane following, lane turning, car following, and more dicey scenarios like static and dynamic overtaking (seeing obstacles and moving around so you don’t collide). With multi-agent support, both real and simulated agents interact, and new agents can be dropped into the scene and controlled any which way.

Taking their full-scale car out into the “wild” — a.k.a. Devens, Massachusetts — the team saw immediate transferability of results, with both failures and successes. They were also able to demonstrate the bodacious, magic word of self-driving car models: “robust.” They showed that AVs, trained entirely in VISTA 2.0, were so robust in the real world that they could handle that elusive tail of challenging failures.

Now, one guardrail humans rely on that can’t yet be simulated is human emotion. It’s the friendly wave, nod, or blinker switch of acknowledgement, the kinds of nuance the team wants to implement in future work.

“The central algorithm of this research is how we can take a dataset and build a completely synthetic world for learning and autonomy,” says Amini. “It’s a platform that I believe one day could extend in many different axes across robotics. Not just autonomous driving, but many areas that rely on vision and complex behaviors. We’re excited to release VISTA 2.0 to help enable the community to collect their own datasets and convert them into virtual worlds where they can directly simulate their own virtual autonomous vehicles, drive around these virtual terrains, train autonomous vehicles in these worlds, and then can directly transfer them to full-sized, real self-driving cars.” 

Amini and Wang wrote the paper alongside Zhijian Liu, MIT CSAIL PhD student; Igor Gilitschenski, assistant professor in computer science at the University of Toronto; Wilko Schwarting, AI research scientist and MIT CSAIL PhD ’20; Song Han, associate professor at MIT’s Department of Electrical Engineering and Computer Science; Sertac Karaman, associate professor of aeronautics and astronautics at MIT; and Daniela Rus, MIT professor and CSAIL director. The researchers presented the work at the IEEE International Conference on Robotics and Automation (ICRA) in Philadelphia. 

This work was supported by the National Science Foundation and Toyota Research Institute. The team acknowledges the support of NVIDIA with the donation of the Drive AGX Pegasus.

Seeing the whole from some of the parts

Upon looking at photographs and drawing on their past experiences, humans can often perceive depth in pictures that are, themselves, perfectly flat. However, getting computers to do the same thing has proved quite challenging.

The problem is difficult for several reasons, one being that information is inevitably lost when a scene that takes place in three dimensions is reduced to a two-dimensional (2D) representation. There are some well-established strategies for recovering 3D information from multiple 2D images, but they each have some limitations. A new approach called “virtual correspondence,” which was developed by researchers at MIT and other institutions, can get around some of these shortcomings and succeed in cases where conventional methodology falters.

The standard approach, called “structure from motion,” is modeled on a key aspect of human vision. Because our eyes are separated from each other, they each offer slightly different views of an object. A triangle can be formed whose sides consist of the line segment connecting the two eyes, plus the line segments connecting each eye to a common point on the object in question. Knowing the angles in the triangle and the distance between the eyes, it’s possible to determine the distance to that point using elementary geometry — although the human visual system, of course, can make rough judgments about distance without having to go through arduous trigonometric calculations. This same basic idea — of triangulation or parallax views — has been exploited by astronomers for centuries to calculate the distance to faraway stars.
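For readers who want to see the elementary geometry spelled out, here is a minimal Python sketch using the law of sines. The function name `distance_from_baseline` and the example numbers are illustrative assumptions, not part of any system described in this article.

```python
import math

def distance_from_baseline(baseline, angle_left, angle_right):
    """Perpendicular distance from the baseline to the observed point.

    baseline: distance between the two viewpoints (for example, the eyes).
    angle_left, angle_right: interior angles of the triangle (radians) at
        each viewpoint, measured between the baseline and the line of sight.
    """
    apex = math.pi - angle_left - angle_right  # angle at the observed point
    if apex <= 0:
        raise ValueError("lines of sight do not converge")
    # Law of sines: the side from the left viewpoint to the point is
    # opposite angle_right, and the baseline is opposite the apex angle.
    left_side = baseline * math.sin(angle_right) / math.sin(apex)
    # Height of the triangle above the baseline.
    return left_side * math.sin(angle_left)

# Viewpoints 2 units apart, each line of sight at 45 degrees to the baseline.
d = distance_from_baseline(2.0, math.pi / 4, math.pi / 4)
```

In this symmetric case the point sits 1 unit from the baseline. With viewpoints 6.5 cm apart and both angles at 89 degrees, the same function gives roughly 1.86 m, which shows how small angular differences encode large distances.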

Triangulation is a key element of structure from motion. Suppose you have two pictures of an object — a sculpted figure of a rabbit, for instance — one taken from the left side of the figure and the other from the right. The first step would be to find points or pixels on the rabbit’s surface that both images share. A researcher could go from there to determine the “poses” of the two cameras — the positions where the photos were taken from and the direction each camera was facing. Knowing the distance between the cameras and the way they were oriented, one could then triangulate to work out the distance to a selected point on the rabbit. And if enough common points are identified, it might be possible to obtain a detailed sense of the object’s (or “rabbit’s”) overall shape.

Considerable progress has been made with this technique, comments Wei-Chiu Ma, a PhD student in MIT’s Department of Electrical Engineering and Computer Science (EECS), “and people are now matching pixels with greater and greater accuracy. So long as we can observe the same point, or points, across different images, we can use existing algorithms to determine the relative positions between cameras.” But the approach only works if the two images have a large overlap. If the input images have very different viewpoints — and hence contain few, if any, points in common — he adds, “the system may fail.”

During summer 2020, Ma came up with a novel way of doing things that could greatly expand the reach of structure from motion. MIT was closed at the time due to the pandemic, and Ma was home in Taiwan, relaxing on the couch. While looking at the palm of his hand and his fingertips in particular, it occurred to him that he could clearly picture his fingernails, even though they were not visible to him.

That was the inspiration for the notion of virtual correspondence, which Ma has subsequently pursued with his advisor, Antonio Torralba, an EECS professor and investigator at the Computer Science and Artificial Intelligence Laboratory, along with Anqi Joyce Yang and Raquel Urtasun of the University of Toronto and Shenlong Wang of the University of Illinois. “We want to incorporate human knowledge and reasoning into our existing 3D algorithms,” Ma says, the same reasoning that enabled him to look at his fingertips and conjure up fingernails on the other side — the side he could not see.

Structure from motion works when two images have points in common, because that means a triangle can always be drawn connecting the cameras to the common point, and depth information can be gleaned from that triangle. Virtual correspondence offers a way to carry things further. Suppose, once again, that one photo is taken from the left side of a rabbit and another photo is taken from the right side. The first photo might reveal a spot on the rabbit’s left leg. But since light travels in a straight line, one could use general knowledge of the rabbit’s anatomy to know where a light ray going from the camera to the leg would emerge on the rabbit’s other side. That point may be visible in the other image (taken from the right-hand side) and, if so, it could be used via triangulation to compute distances in the third dimension.

Virtual correspondence, in other words, allows one to take a point from the first image on the rabbit’s left flank and connect it with a point on the rabbit’s unseen right flank. “The advantage here is that you don’t need overlapping images to proceed,” Ma notes. “By looking through the object and coming out the other end, this technique provides points in common to work with that weren’t initially available.” And in that way, the constraints imposed on the conventional method can be circumvented.
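The idea of following a ray through the object can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the team's method: it replaces the learned shape prior with a plain sphere, and the function name `ray_sphere_hits` is hypothetical. The ray from the first camera enters the sphere at the visible point and leaves it at the hidden point that a second camera on the far side might see.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def ray_sphere_hits(
    origin: Vec3, direction: Vec3, center: Vec3, radius: float
) -> Optional[Tuple[Vec3, Vec3]]:
    """Return (entry, exit) points where a camera ray crosses a sphere.

    The sphere is a toy stand-in for prior knowledge of an object's shape.
    Returns None if the ray misses the sphere or starts inside or past it.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = tuple(x / norm for x in direction)
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(x * y for x, y in zip(oc, unit))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # the ray misses the object entirely
    root = math.sqrt(disc)
    t_in, t_out = -b - root, -b + root
    if t_in < 0:
        return None  # the camera is inside or beyond the object
    entry = tuple(o + t_in * u for o, u in zip(origin, unit))
    exit_ = tuple(o + t_out * u for o, u in zip(origin, unit))
    return entry, exit_


# Hypothetical camera on the left looking straight through a unit sphere.
hits = ray_sphere_hits((-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
visible_point, hidden_point = hits
```

Here `hidden_point` plays the role of the virtual correspondence: a point the first camera never sees, but one that lies on the same ray as the point it does see.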

One might inquire as to how much prior knowledge is needed for this to work, because if you had to know the shape of everything in the image from the outset, no calculations would be required. The trick that Ma and his colleagues employ is to use certain familiar objects in an image — such as the human form — to serve as a kind of “anchor,” and they’ve devised methods for using our knowledge of the human shape to help pin down the camera poses and, in some cases, infer depth within the image. In addition, Ma explains, “the prior knowledge and common sense that is built into our algorithms is first captured and encoded by neural networks.”

The team’s ultimate goal is far more ambitious, Ma says. “We want to make computers that can understand the three-dimensional world just like humans do.” That objective is still far from realization, he acknowledges. “But to go beyond where we are today, and build a system that acts like humans, we need a more challenging setting. In other words, we need to develop computers that can not only interpret still images but can also understand short video clips and eventually full-length movies.”

A scene in the film “Good Will Hunting” demonstrates what he has in mind. The audience sees Matt Damon and Robin Williams from behind, sitting on a bench that overlooks a pond in Boston’s Public Garden. The next shot, taken from the opposite side, offers frontal (though fully clothed) views of Damon and Williams with an entirely different background. Everyone watching the movie immediately knows they’re watching the same two people, even though the two shots have nothing in common. Computers can’t make that conceptual leap yet, but Ma and his colleagues are working hard to make these machines more adept and — at least when it comes to vision — more like us.

The team’s work will be presented next week at the Conference on Computer Vision and Pattern Recognition.


Artificial neural networks model face processing in autism

Many of us easily recognize emotions expressed in others’ faces. A smile may mean happiness, while a frown may indicate anger. Autistic people often have a more difficult time with this task. It’s unclear why. But new research, published June 15 in The Journal of Neuroscience, sheds light on the inner workings of the brain to suggest an answer. And it does so using a tool that opens new pathways to modeling the computation in our heads: artificial intelligence.

Researchers have primarily suggested two brain areas where the differences might lie. A region on the side of the primate (including human) brain called the inferior temporal (IT) cortex contributes to facial recognition. Meanwhile, a deeper region called the amygdala receives input from the IT cortex and other sources and helps process emotions.

Kohitij Kar, a research scientist in the lab of MIT Professor James DiCarlo, hoped to zero in on the answer. (DiCarlo, the Peter de Florez Professor in the Department of Brain and Cognitive Sciences, is also a member of the McGovern Institute for Brain Research and director of MIT’s Quest for Intelligence.)

Kar began by looking at data provided by two other researchers: Shuo Wang at Washington University in St. Louis and Ralph Adolphs at Caltech. In one experiment, they showed images of faces to autistic adults and to neurotypical controls. The images had been generated by software to vary on a spectrum from fearful to happy, and the participants judged, quickly, whether the faces depicted happiness. Compared with controls, autistic adults required higher levels of happiness in the faces to report them as happy.

Modeling the brain

Kar trained an artificial neural network, a complex mathematical function inspired by the brain’s architecture, to perform the same task. The network contained layers of units that roughly resemble biological neurons that process visual information. These layers process information as it passes from an input image to a final judgment indicating the probability that the face is happy. Kar found that the network’s behavior more closely matched the neurotypical controls than it did the autistic adults.

The network also served two more interesting functions. First, Kar could dissect it. He stripped off layers and retested its performance, measuring the difference between how well it matched controls and how well it matched autistic adults. This difference was greatest when the output was based on the last network layer. Previous work has shown that this layer in some ways mimics the IT cortex, which sits near the end of the primate brain’s ventral visual processing pipeline. Kar’s results implicate the IT cortex in differentiating neurotypical controls from autistic adults.

The other function is that the network can be used to select images that might be more efficient in autism diagnoses. If the difference between how closely the network matches neurotypical controls versus autistic adults is greater when judging one set of images versus another set of images, the first set could be used in the clinic to detect autistic behavioral traits. “These are promising results,” Kar says. Better models of the brain will come along, “but oftentimes in the clinic, we don’t need to wait for the absolute best product.”

Next, Kar evaluated the role of the amygdala. Again, he used data from Wang and colleagues. They had used electrodes to record the activity of neurons in the amygdala of people undergoing surgery for epilepsy as they performed the face task. The team found that they could predict a person’s judgment based on these neurons’ activity. Kar reanalyzed the data, this time controlling for the ability of the IT-cortex-like network layer to predict whether a face truly was happy. Now, the amygdala provided very little information of its own. Kar concludes that the IT cortex is the driving force behind the amygdala’s role in judging facial emotion.

Noisy networks

Finally, Kar trained separate neural networks to match the judgments of neurotypical controls and autistic adults. He looked at the strengths or “weights” of the connections between the final layers and the decision nodes. The weights in the network matching autistic adults, both the positive or “excitatory” and negative or “inhibitory” weights, were weaker than in the network matching neurotypical controls. This suggests that sensory neural connections in autistic adults might be noisy or inefficient.

To further test the noise hypothesis, which is popular in the field, Kar added various levels of fluctuation to the activity of the final layer in the network modeling autistic adults. Within a certain range, added noise greatly increased the similarity between its performance and that of the autistic adults. Adding noise to the control network did much less to improve its similarity to the control participants. This further suggests that sensory perception in autistic people may be the result of a so-called “noisy” brain.
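A noise-injection test of this kind might be set up roughly as follows. This is only a sketch under stated assumptions: the final layer is reduced to a short list of numbers, the decision node is a single logistic readout, the noise is Gaussian, and similarity is measured with a Pearson correlation. None of these choices, nor the names `p_happy`, `pearson`, and `similarity_by_noise`, come from the study itself.

```python
import math
import random
from typing import Dict, Optional, Sequence


def p_happy(
    activity: Sequence[float],
    weights: Sequence[float],
    bias: float,
    noise_sd: float = 0.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Logistic readout of final-layer activity, with optional Gaussian noise."""
    if rng is None:
        rng = random.Random(0)
    noisy = [a + rng.gauss(0.0, noise_sd) for a in activity]
    z = sum(w * a for w, a in zip(weights, noisy)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def similarity_by_noise(
    stimuli: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    human_rates: Sequence[float],
    noise_levels: Sequence[float],
    trials: int = 200,
    seed: int = 0,
) -> Dict[float, float]:
    """Correlate the model's mean 'happy' response with human response rates
    at each noise level. All inputs are hypothetical placeholders."""
    rng = random.Random(seed)
    result = {}
    for sd in noise_levels:
        model_rates = [
            sum(p_happy(s, weights, bias, sd, rng) for _ in range(trials)) / trials
            for s in stimuli
        ]
        result[sd] = pearson(model_rates, human_rates)
    return result
```

Sweeping `noise_levels` and comparing the resulting similarities for two separately trained readouts would mirror, in miniature, the comparison between the autism-matched and control-matched networks.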

Computational power

Looking forward, Kar sees several uses for computational models of visual processing. They can be further prodded, providing hypotheses that researchers might test in animal models. “I think facial emotion recognition is just the tip of the iceberg,” Kar says. They can also be used to select or even generate diagnostic content. Artificial intelligence could be used to generate content like movies and educational materials that optimally engages autistic children and adults. One might even tweak facial and other relevant pixels in what autistic people see in augmented reality goggles, work that Kar plans to pursue in the future.

Ultimately, Kar says, the work helps to validate the usefulness of computational models, especially image-processing neural networks. They formalize hypotheses and make them testable. Does one model or another better match behavioral data? “Even if these models are very far off from brains, they are falsifiable, rather than people just making up stories,” he says. “To me, that’s a more powerful version of science.”


Engineers build LEGO-like artificial intelligence chip

Imagine a more sustainable future, where cellphones, smartwatches, and other wearable devices don’t have to be shelved or discarded for a newer model. Instead, they could be upgraded with the latest sensors and processors that would snap onto a device’s internal chip — like LEGO bricks incorporated into an existing build. Such reconfigurable chipware could keep devices up to date while reducing our electronic waste. 

Now MIT engineers have taken a step toward that modular vision with a LEGO-like design for a stackable, reconfigurable artificial intelligence chip.

The design comprises alternating layers of sensing and processing elements, along with light-emitting diodes (LEDs) that allow for the chip’s layers to communicate optically. Other modular chip designs employ conventional wiring to relay signals between layers. Such intricate connections are difficult if not impossible to sever and rewire, making such stackable designs not reconfigurable.

The MIT design uses light, rather than physical wires, to transmit information through the chip. The chip can therefore be reconfigured, with layers that can be swapped out or stacked on, for instance to add new sensors or updated processors.

“You can add as many computing layers and sensors as you want, such as for light, pressure, and even smell,” says MIT postdoc Jihoon Kang. “We call this a LEGO-like reconfigurable AI chip because it has unlimited expandability depending on the combination of layers.”

The researchers are eager to apply the design to edge computing devices — self-sufficient sensors and other electronics that work independently from any central or distributed resources such as supercomputers or cloud-based computing.

“As we enter the era of the internet of things based on sensor networks, demand for multifunctioning edge-computing devices will expand dramatically,” says Jeehwan Kim, associate professor of mechanical engineering at MIT. “Our proposed hardware architecture will provide high versatility of edge computing in the future.”

The team’s results are published today in Nature Electronics. In addition to Kim and Kang, MIT authors include co-first authors Chanyeol Choi, Hyunseok Kim, and Min-Kyu Song, and contributing authors Hanwool Yeon, Celesta Chang, Jun Min Suh, Jiho Shin, Kuangye Lu, Bo-In Park, Yeongin Kim, Han Eol Lee, Doyoon Lee, Subeen Pang, Sang-Hoon Bae, Hun S. Kum, and Peng Lin, along with collaborators from Harvard University, Tsinghua University, Zhejiang University, and elsewhere.

Lighting the way

The team’s design is currently configured to carry out basic image-recognition tasks. It does so via a layering of image sensors, LEDs, and processors made from artificial synapses — arrays of memory resistors, or “memristors,” that the team previously developed, which together function as a physical neural network, or “brain-on-a-chip.” Each array can be trained to process and classify signals directly on a chip, without the need for external software or an Internet connection.

In their new chip design, the researchers paired image sensors with artificial synapse arrays, each of which they trained to recognize certain letters — in this case, M, I, and T. While a conventional approach would be to relay a sensor’s signals to a processor via physical wires, the team instead fabricated an optical system between each sensor and artificial synapse array to enable communication between the layers, without requiring a physical connection. 

“Other chips are physically wired through metal, which makes them hard to rewire and redesign, so you’d need to make a new chip if you wanted to add any new function,” says MIT postdoc Hyunseok Kim. “We replaced that physical wire connection with an optical communication system, which gives us the freedom to stack and add chips the way we want.”

The team’s optical communication system consists of paired photodetectors and LEDs, each patterned with tiny pixels. Photodetectors constitute an image sensor for receiving data, and LEDs transmit data to the next layer. As a signal (for instance an image of a letter) reaches the image sensor, the image’s light pattern encodes a certain configuration of LED pixels, which in turn stimulates another layer of photodetectors, along with an artificial synapse array, which classifies the signal based on the pattern and strength of the incoming LED light.

Stacking up

The team fabricated a single chip, with a computing core measuring about 4 square millimeters, or about the size of a piece of confetti. The chip is stacked with three image recognition “blocks,” each comprising an image sensor, optical communication layer, and artificial synapse array for classifying one of three letters, M, I, or T. They then shone a pixellated image of random letters onto the chip and measured the electrical current that each neural network array produced in response. (The larger the current, the larger the chance that the image is indeed the letter that the particular array is trained to recognize.)
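The readout rule in the parenthetical above amounts to picking the array with the largest current. A minimal sketch, assuming the three measured currents are available as plain numbers (the values and the function name `classify_by_current` are hypothetical, and the units are arbitrary):

```python
from typing import Dict


def classify_by_current(currents: Dict[str, float]) -> str:
    """Return the letter whose trained array produced the largest current."""
    if not currents:
        raise ValueError("no currents were measured")
    return max(currents, key=currents.get)


# Hypothetical currents from the M, I, and T arrays for one projected image.
measured = {"M": 0.2, "I": 1.5, "T": 0.9}
predicted_letter = classify_by_current(measured)
```

The sketch says nothing about how the currents arise; on the chip that work is done physically by the memristor arrays.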

The team found that the chip correctly classified clear images of each letter, but it was less able to distinguish between blurry images, for instance between I and T. However, the researchers were able to quickly swap out the chip’s processing layer for a better “denoising” processor, and found the chip then accurately identified the images.

“We showed stackability, replaceability, and the ability to insert a new function into the chip,” notes MIT postdoc Min-Kyu Song.

The researchers plan to add more sensing and processing capabilities to the chip, and they envision the applications to be boundless.

“We can add layers to a cellphone’s camera so it could recognize more complex images, or make these into healthcare monitors that can be embedded in wearable electronic skin,” offers Choi, who along with Kim previously developed a “smart” skin for monitoring vital signs.

Another idea, he adds, is for modular chips, built into electronics, that consumers can choose to build up with the latest sensor and processor “bricks.”

“We can make a general chip platform, and each layer could be sold separately like a video game,” Jeehwan Kim says. “We could make different types of neural networks, like for image or voice recognition, and let the customer choose what they want, and add to an existing chip like a LEGO.”

This research was supported, in part, by the Ministry of Trade, Industry, and Energy (MOTIE) from South Korea; the Korea Institute of Science and Technology (KIST); and the Samsung Global Research Outreach Program.


Student-powered machine learning

From their early days at MIT, and even before, Emma Liu ’22, MNG ’22, Yo-whan “John” Kim ’22, MNG ’22, and Clemente Ocejo ’21, MNG ’22 knew they wanted to perform computational research and explore artificial intelligence and machine learning. “Since high school, I’ve been into deep learning and was involved in projects,” says Kim, who participated in a Research Science Institute (RSI) summer program at MIT and Harvard University and went on to work on action recognition in videos using Microsoft’s Kinect.

As students in the Department of Electrical Engineering and Computer Science who recently graduated from the Master of Engineering (MEng) Thesis Program, Liu, Kim, and Ocejo have developed the skills to help guide application-focused projects. Working with the MIT-IBM Watson AI Lab, they have improved text classification with limited labeled data and designed machine-learning models for better long-term forecasting for product purchases. For Kim, “it was a very smooth transition and … a great opportunity for me to continue working in the field of deep learning and computer vision in the MIT-IBM Watson AI Lab.”

Modeling video

Collaborating with researchers from academia and industry, Kim designed, trained, and tested a deep learning model for recognizing actions across domains, in this case video. His team specifically targeted the use of synthetic data from generated videos for training and ran prediction and inference tasks on real data, which is composed of different action classes. They wanted to see how pre-training models on synthetic videos, particularly simulated or game-engine-generated footage of human or humanoid actions, stacked up against real data: publicly available videos scraped from the internet.

The reason for this research, Kim says, is that real videos can have issues, including representation bias, copyright, and ethical or personal sensitivity. Videos of a car hitting people, for example, would be difficult to collect, and footage may show people’s faces, real addresses, or license plates without consent. Kim is running experiments with 2D, 2.5D, and 3D video models, with the goal of creating a domain-specific or even a large, general synthetic video dataset that can be used for some transfer domains where data are lacking. For instance, for applications to the construction industry, this could include running its action recognition on a building site. “I didn’t expect synthetically generated videos to perform on par with real videos,” he says. “I think that opens up a lot of different roles [for the work] in the future.”

Despite a rocky start to the project gathering and generating data and running many models, Kim says he wouldn’t have done it any other way. “It was amazing how the lab members encouraged me: ‘It’s OK. You’ll have all the experiments and the fun part coming. Don’t stress too much.’” It was this structure that helped Kim take ownership of the work. “At the end, they gave me so much support and amazing ideas that help me carry out this project.”

Data labeling

Data scarcity was also a theme of Emma Liu’s work. “The overarching problem is that there’s all this data out there in the world, and for a lot of machine learning problems, you need that data to be labeled,” says Liu, “but then you have all this unlabeled data that’s available that you’re not really leveraging.”

Liu, with direction from her MIT and IBM group, worked to put that data to use, training semi-supervised text classification models (and combining aspects of them) to add pseudo labels to the unlabeled data, based on predictions and probabilities about which categories each piece of previously unlabeled data fits into. “Then the problem is that there’s been prior work that’s shown that you can’t always trust the probabilities; specifically, neural networks have been shown to be overconfident a lot of the time,” Liu points out.

Liu and her team addressed this by evaluating the accuracy and uncertainty of the models and recalibrating them to improve her self-training framework. The self-training and calibration step allowed her to have better confidence in the predictions. This pseudo-labeled data, she says, could then be added to the pool of real data, expanding the dataset; this process could be repeated in a series of iterations.
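One round of calibrated pseudo-labeling can be sketched as below. The article does not say which calibration method Liu used; this example swaps in temperature scaling, a common choice, purely for illustration. The confidence threshold and the names `softmax` and `pseudo_label` are likewise assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax with a temperature; values above 1 soften overconfident outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(
    unlabeled_logits: Sequence[Sequence[float]],
    temperature: float,
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return (example index, predicted class, confidence) for every example
    whose calibrated confidence reaches the threshold."""
    selected = []
    for index, logits in enumerate(unlabeled_logits):
        probs = softmax(logits, temperature)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label, probs[label]))
    return selected


# Hypothetical classifier outputs for two unlabeled texts and two categories.
logits = [[4.0, 0.0], [0.2, 0.0]]
confident = pseudo_label(logits, temperature=1.0, threshold=0.9)
```

In a full self-training loop, the selected examples would be added to the labeled pool, the model retrained, and the step repeated. Raising the temperature makes the filter stricter, which is the point of calibrating before trusting the probabilities.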

For Liu, her biggest takeaway wasn’t the product, but the process. “I learned a lot about being an independent researcher,” she says. As an undergraduate, Liu worked with IBM to develop machine learning methods to repurpose drugs already on the market and honed her decision-making ability. After collaborating with academic and industry researchers to acquire skills to ask pointed questions, seek out experts, digest and present scientific papers for relevant content, and test ideas, Liu and her cohort of MEng students working with the MIT-IBM Watson AI Lab felt they had confidence in their knowledge, freedom, and flexibility to dictate their own research’s direction. Taking on this key role, Liu says, “I feel like I had ownership over my project.”

Demand forecasting

After his time at MIT and with the MIT-IBM Watson AI Lab, Clemente Ocejo also came away with a sense of mastery, having built a strong foundation in AI techniques and timeseries methods beginning with his MIT Undergraduate Research Opportunities Program (UROP), where he met his MEng advisor. “You really have to be proactive in decision-making,” says Ocejo, “vocalizing it [your choices] as the researcher and letting people know that this is what you’re doing.”

Ocejo used his background in traditional timeseries methods for a collaboration with the lab, applying deep learning to better forecast product demand in the medical field. Here, he designed, wrote, and trained a transformer, a specific machine learning model, which is typically used in natural-language processing and has the ability to learn very long-term dependencies. Ocejo and his team compared target forecast demands between months, learning dynamic connections and attention weights between product sales within a product family. They looked at identifier features, concerning the price and amount, as well as account features about who is purchasing the items or services.

“One product does not necessarily impact the prediction made for another product in the moment of prediction. It just impacts the parameters during training that lead to that prediction,” says Ocejo. “Instead, we wanted to make it have a little more of a direct impact, so we added this layer that makes this connection and learns attention between all of the products in our dataset.”
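The kind of cross-product layer Ocejo describes can be illustrated with standard scaled dot-product attention. This is a generic sketch rather than his architecture: it assumes each product is already summarized as a small embedding vector and, for brevity, uses the embeddings directly as queries, keys, and values with no learned projections. The name `attention_across_products` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_across_products(
    embeddings: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Let every product attend to every product in the set.

    Returns (mixed, weights): mixed[i] is product i's embedding blended with
    the others, and weights[i][j] is how much product i attends to product j.
    """
    dim = len(embeddings[0])
    mixed, weights = [], []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = _softmax(scores)
        weights.append(row)
        mixed.append(
            [sum(w * value[j] for w, value in zip(row, embeddings)) for j in range(dim)]
        )
    return mixed, weights


# Hypothetical embeddings for two products in one family.
mixed, weights = attention_across_products([[2.0, 0.0], [0.0, 2.0]])
```

Because each row of `weights` sums to one, a product's forecast input becomes a weighted blend of its own features and those of related products, giving one product a direct influence on another at prediction time.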

In the long run, over a one-year prediction, the MIT-IBM Watson AI Lab group was able to outperform the current model; more impressively, it did so in the short run (close to a fiscal quarter). Ocejo attributes this to the dynamic of his interdisciplinary team. “A lot of the people in my group were not necessarily very experienced in the deep learning aspect of things, but they had a lot of experience in the supply chain management, operations research, and optimization side, which is something that I don’t have that much experience in,” says Ocejo. “They were giving a lot of good high-level feedback of what to tackle next and … and knowing what the field of industry wanted to see or was looking to improve, so it was very helpful in streamlining my focus.”

For this work, a deluge of data didn’t make the difference for Ocejo and his team, but rather its structure and presentation. Oftentimes, large deep learning models require millions and millions of data points in order to make meaningful inferences; however, the MIT-IBM Watson AI Lab group demonstrated that outcomes and technique improvements can be application-specific. “It just shows that these models can learn something useful, in the right setting, with the right architecture, without needing an excess amount of data,” says Ocejo. “And then with an excess amount of data, it’ll only get better.”


Collin Stultz named co-director and MIT lead of the Harvard-MIT Program in Health Sciences and Technology

Collin M. Stultz, the Nina T. and Robert H. Rubin Professor in Medical Engineering and Science at MIT, has been named co-director of the Harvard-MIT Program in Health Sciences and Technology (HST), and associate director of MIT’s Institute for Medical Engineering and Science (IMES), effective June 1. IMES is HST’s home at MIT.

Stultz is a professor of electrical engineering and computer science at MIT, a core faculty member in IMES, a member of the HST faculty, and a practicing cardiologist at Massachusetts General Hospital (MGH). He is also a member of the Research Laboratory of Electronics, and an associate member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Anantha P. Chandrakasan, dean of the MIT School of Engineering and Vannevar Bush Professor of Electrical Engineering and Computer Science, praised the appointment, saying “Professor Stultz’s remarkable leadership, commitment to teaching excellence, and unwavering devotion to pursuing advancements in human health, will undoubtedly help to reinforce and bolster the missions of both IMES and HST.”

Stultz is succeeding Emery N. Brown, who was first to serve as HST’s co-director at MIT, following the establishment of IMES in 2012. (Wolfram Goessling is the co-director of HST at Harvard University.) Brown, the Edward Hood Taplin Professor of Medical Engineering and of Computational Neuroscience at MIT, will now be focusing on the establishment of a new joint center between MIT and MGH that will use the study of anesthesia to design novel approaches to controlling brain states, with a goal of improving anesthesia and intensive care management.

“It was a pleasure and honor for me to shepherd HST for the last 10 years,” Brown says. “I am certain that Collin will be a phenomenal co-director. He is a highly accomplished scientist, a master clinician, and a committed educator.”

George Q. Daley, dean of Harvard Medical School and an HST alumnus, says, “I am thrilled that HST’s new co-director will be a Harvard Medical School alumnus who completed clinical training and practice at our affiliated hospitals. Dr. Stultz’s remarkable expertise in computer science and AI will engender positive change as we reinvigorate this historic Harvard-MIT collaboration and redefine the scope of what it means to be a physician-scientist in the 21st century.”

Elazer R. Edelman, the Edward J. Poitras Professor in Medical Engineering and Science and the director of IMES, also an HST alumnus, lauded the appointment, saying, “We are so excited by the future, using the incredible vision of Professor Stultz, his legacy of accomplishment, his commitment to mentorship, and his innate ability to meld excellence in science and medicine, engineering, and physiology to propel us forward. Everything Professor Stultz has done predicates him and HST for success.”

Goessling says he looks forward to working with Stultz in his new role. “I have known Collin since our residency days at Brigham and Women’s Hospital where we cared for patients together. I am truly excited to work collaboratively and synergistically with him to now take care of our students together, to innovate our education programs and continue the legacy of success for HST.”

Stultz earned his BA magna cum laude in mathematics and philosophy from Harvard University in 1988; a PhD in biophysics from Harvard in 1997; and an MD magna cum laude from Harvard Medical School, also in 1997. Stultz then went on to complete an internship and residency in internal medicine, followed by a fellowship in cardiovascular medicine, at the Brigham and Women’s Hospital before joining the faculty at MIT in 2004.

Stultz once said that his research focus at MIT is twofold: “the study of small things you can’t see with the naked eye, and the study of big things that you can,” and his scientific contributions have similarly spanned a wide range of length scales. As a graduate student in the laboratory of Martin Karplus — winner of the 2013 Nobel Prize in Chemistry — Stultz helped to develop computational methods for designing ligands to flexible protein targets. As a junior faculty member at MIT, his group leveraged computational biophysics and experimental biochemistry to model disordered proteins that play important roles in human disease. More recently, his research has focused on the development and application of machine learning methods that enable health care providers to gain insight into patient-specific physiology, using clinical data that are routinely obtained in both clinical and ambulatory settings. 

Stultz is a member of the American Society for Biochemistry and Molecular Biology and the Federation of American Societies for Experimental Biology, and a fellow of the American Institute for Medical and Biomedical Engineering. He is a past recipient of an Irving M. London teaching award, a National Science Foundation CAREER Award, and a Burroughs Wellcome Fund Career Award in the Biomedical Sciences, and he is a recent Phi Beta Kappa visiting scholar.

“Following in the footsteps of a scholar as renowned as Emery Brown is daunting; however, I am extraordinarily optimistic about what HMS, HST, and MIT can accomplish in the years to come,” Stultz says. “I look forward to working with Elazer, Anantha, Wolfram, and the leadership at HMS to advance the educational mission of HST on the HMS campus, and throughout the MIT ecosystem.”


Hallucinating to better text translation

As babies, we babble and imitate our way to learning languages. We don’t start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas.

Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.

This is the basis of a new machine learning model, called VALHALLA, by researchers from MIT, IBM, and the University of California at San Diego, in which a trained neural network sees a source sentence in one language, hallucinates an image of what it looks like, and then uses both to translate into a target language. The team found that their method demonstrates improved accuracy of machine translation over text-only translation. Further, it provided an additional boost for cases with long sentences, under-resourced languages, and instances where part of the source sentence is inaccessible to the machine translator.

As a core task within the AI field of natural language processing (NLP), machine translation is an “eminently practical technology that’s being used by millions of people every day,” says study co-author Yoon Kim, assistant professor in MIT’s Department of Electrical Engineering and Computer Science with affiliations in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT-IBM Watson AI Lab. With recent, significant advances in deep learning, “there’s been an interesting development in how one might use non-text information (for example, images, audio, or other grounding information) to tackle practical tasks involving language,” says Kim, because “when humans are performing language processing tasks, we’re doing so within a grounded, situated world.” The pairing of hallucinated images and text during inference, the team postulated, imitates that process, providing context for improved performance over current state-of-the-art techniques, which utilize text-only data.

This research will be presented at the IEEE / CVF Computer Vision and Pattern Recognition Conference this month. Kim’s co-authors are UC San Diego graduate student Yi Li and Professor Nuno Vasconcelos, along with research staff members Rameswar Panda, Chun-fu “Richard” Chen, Rogerio Feris, and IBM Director David Cox of IBM Research and the MIT-IBM Watson AI Lab.

Learning to hallucinate from images

When we learn new languages and learn to translate, we’re often given examples and practice before venturing out on our own. The same is true for machine-translation systems; however, if images are used during training, these AI methods also require visual aids at test time, which limits their applicability, says Panda.

“In real-world scenarios, you might not have an image with respect to the source sentence. So, our motivation was basically: Instead of using an external image during inference as input, can we use visual hallucination — the ability to imagine visual scenes — to improve machine translation systems?” says Panda.

To do this, the team used an encoder-decoder architecture with two transformers, a type of neural network model that is suited to sequence-dependent data, like language, and that can pay attention to the key words and semantics of a sentence. One transformer generates a visual hallucination, and the other performs multimodal translation using outputs from the first transformer.
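The two-stage flow described above can be pictured with a minimal Python sketch. It is an illustration only, not the researchers' code: the function names and the toy stand-ins for the two transformers are hypothetical, and real models would be trained neural networks rather than arithmetic rules.

```python
from typing import Callable, List

Tokens = List[int]


def translate_with_hallucination(
    source_tokens: Tokens,
    hallucinate: Callable[[Tokens], Tokens],
    translate: Callable[[Tokens, Tokens], Tokens],
) -> Tokens:
    """Run the two-stage flow: text -> visual tokens -> translation."""
    # Stage 1: the first model imagines a discrete image representation.
    visual_tokens = hallucinate(source_tokens)
    # Stage 2: the second model translates using both text and visual tokens.
    return translate(source_tokens, visual_tokens)


# Toy stand-ins (hypothetical; they only make the data flow visible).
def toy_hallucinate(source_tokens: Tokens) -> Tokens:
    # Map each text token to an id in a pretend visual codebook of size 8.
    return [token % 8 for token in source_tokens]


def toy_translate(source_tokens: Tokens, visual_tokens: Tokens) -> Tokens:
    # Combine both inputs so the output depends on text and visual tokens.
    return [s + 100 * v for s, v in zip(source_tokens, visual_tokens)]


output = translate_with_hallucination([3, 12, 9], toy_hallucinate, toy_translate)
```

Because no external image appears anywhere in the call, the same function shape applies at test time, when only the source sentence is available.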

During training, there are two streams of translation: a source sentence and a ground-truth image that is paired with it, and the same source sentence that is visually hallucinated to make a text-image pair. First the ground-truth image and sentence are tokenized into representations that can be handled by transformers; for the case of the sentence, each word is a token. The source sentence is tokenized again, but this time passed through the visual hallucination transformer, outputting a hallucination, a discrete image representation of the sentence. The researchers incorporated an autoregression that compares the ground-truth and hallucinated representations for congruency — e.g., homonyms: a reference to an animal “bat” isn’t hallucinated as a baseball bat. The hallucination transformer then uses the difference between them to optimize its predictions and visual output, making sure the context is consistent.
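As a rough illustration of what tokenizing a sentence and an image can mean, the sketch below maps words to vocabulary ids and maps image patches to the index of their nearest entry in a small codebook. The nearest-codebook scheme, the tiny vocabulary, and all names here are assumptions made for the example; the article does not specify how the team discretizes images.

```python
from typing import Dict, List, Sequence


def tokenize_sentence(sentence: str, vocab: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Turn each whitespace-separated word into one token id."""
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]


def quantize_patches(
    patches: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Replace each patch vector with the index of its nearest codebook entry."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(patch, codebook[k]))
        for patch in patches
    ]


# Hypothetical vocabulary and codebook, chosen only for illustration.
vocab = {"<unk>": 0, "a": 1, "red": 2, "car": 3}
codebook = [[0.0, 0.0], [1.0, 1.0]]

text_tokens = tokenize_sentence("A red car", vocab)
image_tokens = quantize_patches([[0.1, 0.2], [0.9, 0.8]], codebook)
```

Both outputs are plain lists of integers, which is what lets a transformer treat words and image pieces as the same kind of sequence element.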

The two sets of tokens are then simultaneously passed through the multimodal translation transformer, each containing the sentence representation and either the hallucinated or ground-truth image. The tokenized text translation outputs are compared with the goal of being similar to each other and to the target sentence in another language. Any differences are then relayed back to the translation transformer for further optimization.
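One way to picture how these comparisons could be combined into a single training signal is sketched below for one output position. This is a simplified reading of the description above, not the published objective: the use of cross-entropy and KL divergence, the weighting parameters, and every function name are assumptions.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Penalty for assigning low probability to the correct token."""
    return -math.log(probs[target_index])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """How far distribution q is from distribution p (zero when identical)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(
    probs_with_true_image: Sequence[float],
    probs_with_hallucination: Sequence[float],
    target_index: int,
    visual_probs: Sequence[float],
    true_visual_index: int,
    consistency_weight: float = 1.0,    # hypothetical weighting
    hallucination_weight: float = 1.0,  # hypothetical weighting
) -> float:
    # Both translation streams should predict the target-language token.
    translation = cross_entropy(probs_with_true_image, target_index) + cross_entropy(
        probs_with_hallucination, target_index
    )
    # The two streams should also agree with each other.
    consistency = kl_divergence(probs_with_true_image, probs_with_hallucination)
    # The hallucinated visual token should match the ground-truth image token.
    hallucination = cross_entropy(visual_probs, true_visual_index)
    return translation + consistency_weight * consistency + hallucination_weight * hallucination


loss = combined_loss([0.5, 0.5], [0.5, 0.5], 0, [0.25, 0.75], 1)
```

When the two streams produce identical predictions, the consistency term vanishes and only the translation and hallucination penalties remain.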

For testing, the ground-truth image stream drops off, since images likely wouldn’t be available in everyday scenarios.

“To the best of our knowledge, we haven’t seen any work which actually uses a hallucination transformer jointly with a multimodal translation system to improve machine translation performance,” says Panda.

Visualizing the target text

To test their method, the team put VALHALLA up against other state-of-the-art multimodal and text-only translation methods. They used public benchmark datasets containing ground-truth images with source sentences, and a dataset for translating text-only news articles. The researchers measured its performance over 13 tasks, ranging from translation between well-resourced languages (like English, German, and French) to under-resourced pairs (like English to Romanian) and non-English pairs (like Spanish to French). The group also tested varying transformer model sizes, how accuracy changes with sentence length, and translation under limited textual context, where portions of the text were hidden from the machine translators.
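The limited-context setting can be imagined with a short Python sketch that hides a fraction of the source words before translation. The random selection, the `[MASK]` placeholder, and the function name are assumptions for illustration; the article says only that portions of the text were hidden.

```python
import random
from typing import List


def mask_tokens(
    tokens: List[str],
    mask_ratio: float,
    seed: int = 0,
    mask_token: str = "[MASK]",  # hypothetical placeholder symbol
) -> List[str]:
    """Hide a fixed fraction of tokens, chosen at random but reproducibly."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    num_to_mask = int(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [mask_token if i in masked_positions else token for i, token in enumerate(tokens)]


sentence = "there is a red car in front of the house".split()
masked = mask_tokens(sentence, mask_ratio=0.5)
```

A translation system evaluated on `masked` instead of `sentence` has to rely on whatever context remains, which is the situation in which a hallucinated image might fill in missing information.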

The team observed significant improvements over text-only translation methods, as well as better data efficiency, and found that smaller models performed better than the larger base model. As sentences became longer, VALHALLA’s advantage over other methods grew, which the researchers attributed to the addition of more ambiguous words. In cases where part of the sentence was masked, VALHALLA could recover and translate the original text, which the team found surprising.

Further unexpected findings arose: “Where there weren’t as many training [image and] text pairs, [like for under-resourced languages], improvements were more significant, which indicates that grounding in images helps in low-data regimes,” says Kim. “Another thing that was quite surprising to me was this improved performance, even on types of text that aren’t necessarily easily connectable to images. For example, maybe it’s not so surprising if this helps in translating visually salient sentences, like the ‘there is a red car in front of the house.’ [However], even in text-only [news article] domains, the approach was able to improve upon text-only systems.”

While VALHALLA performs well, the researchers note that it does have limitations: it requires pairs of sentences to be annotated with an image, which could make training data more expensive to obtain. It also performs better in its ground domain than on the text-only news articles. Moreover, Kim and Panda note, a technique like VALHALLA is still a black box that rests on the assumption that hallucinated images provide helpful information, and the team plans to investigate what and how the model is learning in order to validate their methods.

In the future, the team plans to explore other means of improving translation. “Here, we only focus on images, but there are other types of multimodal information (for example, speech, video or touch, or other sensory modalities),” says Panda. “We believe such multimodal grounding can lead to even more efficient machine translation models, potentially benefiting translation across many low-resource languages spoken in the world.”

This research was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.
