Ban, an Amazon Visiting Academic, won for a paper she co-authored with Duke University professor Cynthia Rudin.
Reducing unnecessary clarification questions from voice agents
New approach improves F1 score of clarification questions by 81%.
“Ambient intelligence” will accelerate advances in general AI
Alexa’s chief scientist on how customer-obsessed science is accelerating general intelligence.
Fiddler.ai CEO on the emerging category of explainable AI
Krishna Gade, the founder of this Alexa Fund portfolio company, answers three questions about ‘responsible AI’.
Perfecting pitch perception
New research from MIT neuroscientists suggests that natural soundscapes have shaped our sense of hearing, optimizing it for the kinds of sounds we most often encounter.
In a study reported Dec. 14 in the journal Nature Communications, researchers led by McGovern Institute for Brain Research associate investigator Josh McDermott used computational modeling to explore factors that influence how humans hear pitch. Their model’s pitch perception closely resembled that of humans — but only when it was trained using music, voices, or other naturalistic sounds.
Humans’ ability to recognize pitch — essentially, the rate at which a sound repeats — gives melody to music and nuance to spoken language. Although this is arguably the best-studied aspect of human hearing, researchers are still debating which factors determine the properties of pitch perception, and why it is more acute for some types of sounds than others. McDermott, who is also an associate professor in MIT’s Department of Brain and Cognitive Sciences, and an Investigator with the Center for Brains, Minds, and Machines (CBMM) at MIT, is particularly interested in understanding how our nervous system perceives pitch because cochlear implants, which send electrical signals about sound to the brain in people with profound deafness, don’t replicate this aspect of human hearing very well.
“Cochlear implants can do a pretty good job of helping people understand speech, especially if they’re in a quiet environment. But they really don’t reproduce the percept of pitch very well,” says Mark Saddler, a graduate student and CBMM researcher who co-led the project and an inaugural graduate fellow of the K. Lisa Yang Integrative Computational Neuroscience Center. “One of the reasons it’s important to understand the detailed basis of pitch perception in people with normal hearing is to try to get better insights into how we would reproduce that artificially in a prosthesis.”
Artificial hearing
Pitch perception begins in the cochlea, the snail-shaped structure in the inner ear where vibrations from sounds are transformed into electrical signals and relayed to the brain via the auditory nerve. The cochlea’s structure and function help determine how and what we hear. And although it hasn’t been possible to test this idea experimentally, McDermott’s team suspected our “auditory diet” might shape our hearing as well.
To explore how both our ears and our environment influence pitch perception, McDermott, Saddler, and Research Assistant Ray Gonzalez built a computer model called a deep neural network. Neural networks are a type of machine learning model widely used in automatic speech recognition and other artificial intelligence applications. Although the structure of an artificial neural network coarsely resembles the connectivity of neurons in the brain, the models used in engineering applications don’t actually hear the same way humans do — so the team developed a new model to reproduce human pitch perception. Their approach combined an artificial neural network with an existing model of the mammalian ear, uniting the power of machine learning with insights from biology. “These new machine-learning models are really the first that can be trained to do complex auditory tasks and actually do them well, at human levels of performance,” Saddler explains.
The researchers trained the neural network to estimate pitch by asking it to identify the repetition rate of sounds in a training set. This gave them the flexibility to change the parameters under which pitch perception developed. They could manipulate the types of sound they presented to the model, as well as the properties of the ear that processed those sounds before passing them on to the neural network.
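To make "repetition rate" concrete, here is a minimal sketch of a classical way to estimate it: pick the time lag at which a sound best matches a shifted copy of itself (autocorrelation). This is not the researchers' deep neural network or their ear model. The function names, the 8 kHz sample rate, and the 80 to 500 Hz search range are assumptions chosen purely for illustration.

```python
import math


def synth_tone(f0_hz, sample_rate=8000, duration_s=0.1, harmonics=3):
    """Toy periodic sound: the first few harmonics of f0_hz, summed."""
    n = int(sample_rate * duration_s)
    return [
        sum(
            math.sin(2 * math.pi * f0_hz * h * t / sample_rate) / h
            for h in range(1, harmonics + 1)
        )
        for t in range(n)
    ]


def estimate_f0(signal, sample_rate=8000, f_min=80.0, f_max=500.0):
    """Return the repetition rate (Hz) whose lag maximizes the autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Similarity between the signal and itself shifted by `lag` samples.
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


estimated = estimate_f0(synth_tone(200.0))
```

A hand-built estimator like this works on clean, strictly periodic tones. The study instead lets a network learn its own strategy from training sounds, which is what allowed the team to vary the training conditions.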
When the model was trained using sounds that are important to humans, like speech and music, it learned to estimate pitch much as humans do. “We very nicely replicated many characteristics of human perception … suggesting that it’s using similar cues from the sounds and the cochlear representation to do the task,” Saddler says.
But when the model was trained using more artificial sounds or in the absence of any background noise, its behavior was very different. For example, Saddler says, “If you optimize for this idealized world where there’s never any competing sources of noise, you can learn a pitch strategy that seems to be very different from that of humans, which suggests that perhaps the human pitch system was really optimized to deal with cases where sometimes noise is obscuring parts of the sound.”
The team also found the timing of nerve signals initiated in the cochlea to be critical to pitch perception. In a healthy cochlea, McDermott explains, nerve cells fire precisely in time with the sound vibrations that reach the inner ear. When the researchers skewed this relationship in their model, so that the timing of nerve signals was less tightly correlated to vibrations produced by incoming sounds, pitch perception deviated from normal human hearing.
McDermott says it will be important to take this into account as researchers work to develop better cochlear implants. “It does very much suggest that for cochlear implants to produce normal pitch perception, there needs to be a way to reproduce the fine-grained timing information in the auditory nerve,” he says. “Right now, they don’t do that, and there are technical challenges to making that happen — but the modeling results really pretty clearly suggest that’s what you’ve got to do.”
ASRU: Integrating speech recognition and language understanding
Amazon’s Jimmy Kunzmann on how “signal-to-interpretation” models improve availability, performance.
Characters for good, created by artificial intelligence
As it becomes easier to create hyper-realistic digital characters using artificial intelligence, much of the conversation around these tools has centered on misleading and potentially dangerous deepfake content. But the technology can also be used for positive purposes — to revive Albert Einstein to teach a physics class, talk through a career change with your older self, or anonymize people while preserving facial communication.
To encourage the technology’s positive possibilities, MIT Media Lab researchers and their collaborators at the University of California at Santa Barbara and Osaka University have compiled an open-source, easy-to-use character generation pipeline that combines AI models for facial gestures, voice, and motion and can be used to create a variety of audio and video outputs.
The pipeline also marks the resulting output with a traceable, as well as human-readable, watermark to distinguish it from authentic video content and to show how it was generated — an addition to help prevent its malicious use.
By making this pipeline easily available, the researchers hope to inspire teachers, students, and health-care workers to explore how such tools can help them in their respective fields. If more students, educators, health-care workers, and therapists have a chance to build and use these characters, the results could improve health and well-being and contribute to personalized education, the researchers write in Nature Machine Intelligence.
“It will be a strange world indeed when AIs and humans begin to share identities. This paper does an incredible job of thought leadership, mapping out the space of what is possible with AI-generated characters in domains ranging from education to health to close relationships, while giving a tangible roadmap on how to avoid the ethical challenges around privacy and misrepresentation,” says Jeremy Bailenson, founding director of the Stanford Virtual Human Interaction Lab, who was not associated with the study.
Although the world mostly knows the technology from deepfakes, “we see its potential as a tool for creative expression,” says the paper’s first author Pat Pataranutaporn, a PhD student in professor of media technology Pattie Maes’ Fluid Interfaces research group.
Other authors on the paper include Maes; Fluid Interfaces master’s student Valdemar Danry and PhD student Joanne Leong; Media Lab Research Scientist Dan Novy; Osaka University Assistant Professor Parinya Punpongsanon; and University of California at Santa Barbara Assistant Professor Misha Sra.
Deeper truths and deeper learning
Generative adversarial networks, or GANs, a combination of two neural networks that compete against each other, have made it easier to create photorealistic images, clone voices, and animate faces. Pataranutaporn, with Danry, first explored their possibilities in a project called Machinoia, where he generated multiple alternative representations of himself (as a child, as an old man, as female) to have a self-dialogue of life choices from different perspectives. The unusual deepfaking experience made him aware of his “journey as a person,” he says. “It was deep truth: to uncover something about yourself you’ve never thought of before, using your own data on your own self.”
Self-exploration is only one of the positive applications of AI-generated characters, the researchers say. Experiments show, for instance, that these characters can make students more enthusiastic about learning and improve cognitive task performance. The technology offers a way for instruction to be “personalized to your interest, your idols, your context, and can be changed over time,” Pataranutaporn explains, as a complement to traditional instruction.
For instance, the MIT researchers used their pipeline to create a synthetic version of Johann Sebastian Bach, which had a live conversation with renowned cellist Yo-Yo Ma in Media Lab Professor Tod Machover’s musical interfaces class, to the delight of both the students and Ma.
Other applications might include characters who help deliver therapy, to alleviate a growing shortage of mental health professionals and reach the estimated 44 percent of Americans with mental health issues who never receive counseling, or AI-generated content that delivers exposure therapy to people with social anxiety. In a related use case, the technology can be used to anonymize faces in video while preserving facial expressions and emotions, which may be useful for sessions where people want to share personally sensitive information such as health and trauma experiences, or for whistleblowers and witness accounts.
But there are also more artistic and playful use cases. In this fall’s Experiments in Deepfakes class, led by Maes and research affiliate Roy Shilkrot, students used the technology to animate the figures in a historical Chinese painting and to create a dating “breakup simulator,” among other projects.
Legal and ethical challenges
Many of the applications of AI-generated characters raise legal and ethical issues that must be discussed as the technology evolves, the researchers note in their paper. For instance, how will we decide who has the right to digitally recreate a historical character? Who is legally liable if an AI clone of a famous person promotes harmful behavior online? And is there any danger that we will prefer interacting with synthetic characters over humans?
“One of our goals with this research is to raise awareness about what is possible, ask questions and start public conversations about how this technology can be used ethically for societal benefit. What technical, legal, policy and educational actions can we take to promote positive use cases while reducing the possibility for harm?” states Maes.
By sharing the technology widely, while clearly labeling it as synthesized, Pataranutaporn says, “we hope to stimulate more creative and positive use cases, while also educating people about the technology’s potential benefits and harms.”
Q&A: Cathy Wu on developing algorithms to safely integrate robots into our world
Cathy Wu is the Gilbert W. Winslow Assistant Professor of Civil and Environmental Engineering and a member of the MIT Institute for Data, Systems, and Society. As an undergraduate, Wu won MIT’s toughest robotics competition, and as a graduate student took the University of California at Berkeley’s first-ever course on deep reinforcement learning. Now back at MIT, she’s working to improve the flow of robots in Amazon warehouses under the Science Hub, a new collaboration between the tech giant and the MIT Schwarzman College of Computing. Outside of the lab and classroom, Wu can be found running, drawing, pouring lattes at home, and watching YouTube videos on math and infrastructure via 3Blue1Brown and Practical Engineering. She recently took a break from all of that to talk about her work.
Q: What put you on the path to robotics and self-driving cars?
A: My parents always wanted a doctor in the family. However, I’m bad at following instructions and became the wrong kind of doctor! Inspired by my physics and computer science classes in high school, I decided to study engineering. I wanted to help as many people as a medical doctor could.
At MIT, I looked for applications in energy, education, and agriculture, but the self-driving car was the first to grab me. It has yet to let go! Ninety-four percent of serious car crashes are caused by human error and could potentially be prevented by self-driving cars. Autonomous vehicles could also ease traffic congestion, save energy, and improve mobility.
I first learned about self-driving cars from Seth Teller during his guest lecture for the course Mobile Autonomous Systems Lab (MASLAB), in which MIT undergraduates compete to build the best fully functioning robot from scratch. Our ball-fetching bot, Putzputz, won first place. From there, I took more classes in machine learning, computer vision, and transportation, and joined Teller’s lab. I also competed in several mobility-related hackathons, including one sponsored by Hubway, now known as Bluebikes.
Q: You’ve explored ways to help humans and autonomous vehicles interact more smoothly. What makes this problem so hard?
A: Both systems are highly complex, and our classical modeling tools are woefully insufficient. Integrating autonomous vehicles into our existing mobility systems is a huge undertaking. For example, we don’t know whether autonomous vehicles will cut energy use by 40 percent, or double it. We need more powerful tools to cut through the uncertainty. My PhD thesis at Berkeley tried to do this. I developed scalable optimization methods in the areas of robot control, state estimation, and system design. These methods could help decision-makers anticipate future scenarios and design better systems to accommodate both humans and robots.
Q: How is deep reinforcement learning, which combines deep learning and reinforcement learning algorithms, changing robotics?
A: I took John Schulman and Pieter Abbeel’s reinforcement learning class at Berkeley in 2015 shortly after DeepMind published their breakthrough paper in Nature. They had trained an agent via deep learning and reinforcement learning to play “Space Invaders” and a suite of Atari games at superhuman levels. That created quite some buzz. A year later, I started to incorporate reinforcement learning into problems involving mixed traffic systems, in which only some cars are automated. I realized that classical control techniques couldn’t handle the complex nonlinear control problems I was formulating.
Deep RL is now mainstream but it’s by no means pervasive in robotics, which still relies heavily on classical model-based control and planning methods. Deep learning continues to be important for processing raw sensor data like camera images and radio waves, and reinforcement learning is gradually being incorporated. I see traffic systems as gigantic multi-robot systems. I’m excited for an upcoming collaboration with Utah’s Department of Transportation to apply reinforcement learning to coordinate cars with traffic signals, reducing congestion and thus carbon emissions.
Q: You’ve talked about the MIT course, 6.003 (Signals and Systems), and its impact on you. What about it spoke to you?
A: The mindset. That problems that look messy can be analyzed with common, and sometimes simple, tools. Signals are transformed by systems in various ways, but what do these abstract terms mean, anyway? A mechanical system can take a signal like gears turning at some speed and transform it into a lever turning at another speed. A digital system can take binary digits and turn them into other binary digits or a string of letters or an image. Financial systems can take news and transform it via millions of trading decisions into stock prices. People take in signals every day through advertisements, job offers, gossip, and so on, and translate them into actions that in turn influence society and other people. This humble class on signals and systems linked mechanical, digital, and societal systems and showed me how foundational tools can cut through the noise.
Q: In your project with Amazon you’re training warehouse robots to pick up, sort, and deliver goods. What are the technical challenges?
A: This project involves assigning robots to a given task and routing them there. [Professor] Cynthia Barnhart’s team is focused on task assignment, and mine, on path planning. Both problems are considered combinatorial optimization problems because the solution involves a combination of choices. As the number of tasks and robots increases, the number of possible solutions grows exponentially. It’s called the curse of dimensionality. Both problems are what we call NP-hard; there may not be an efficient algorithm to solve them. Our goal is to devise a shortcut.
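To give a feel for how fast the solution space grows, the sketch below counts the ways to pair robots with tasks. It assumes a simplified one-to-one setting (each robot gets exactly one task), which the interview does not specify. Under that assumption the count is n factorial, which grows even faster than an exponential. The function name is hypothetical.

```python
import math


def count_assignments(num_robots):
    """Number of one-to-one robot-to-task assignments (assumes as many tasks as robots)."""
    return math.factorial(num_robots)


for n in (5, 10, 20):
    print(n, count_assignments(n))
```

Even at 20 robots, checking every assignment one by one is already impractical, which is why exhaustive search does not scale to warehouses with hundreds of robots.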
Routing a single robot for a single task isn’t difficult. It’s like using Google Maps to find the shortest path home. It can be solved efficiently with several algorithms, including Dijkstra’s. But warehouses resemble small cities with hundreds of robots. When traffic jams occur, customers can’t get their packages as quickly. Our goal is to develop algorithms that find the most efficient paths for all of the robots.
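For reference, the single-robot case mentioned above can be sketched with Dijkstra's algorithm in a few lines. The warehouse graph, its node names, and its travel costs are invented for illustration. This sketch covers only one robot on a graph with nonnegative edge weights, not the multi-robot coordination problem the project targets.

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path costs from `source` in a dict-of-dicts graph with nonnegative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical warehouse layout: edge weights are travel costs between locations.
warehouse = {
    "dock": {"aisle_a": 2, "aisle_b": 5},
    "aisle_a": {"aisle_b": 1, "shelf": 7},
    "aisle_b": {"shelf": 3},
    "shelf": {},
}

costs = dijkstra(warehouse, "dock")
```

Here the cheapest route from the dock to the shelf runs through both aisles. With many robots sharing the same aisles, independently computed shortest paths can collide, and that is where the problem becomes hard.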
Q: Are there other applications?
A: Yes. The algorithms we test in Amazon warehouses might one day help to ease congestion in real cities. Other potential applications include controlling planes on runways, swarms of drones in the air, and even characters in video games. These algorithms could also be used for other robotic planning tasks like scheduling and routing.
Q: AI is evolving rapidly. Where do you hope to see the big breakthroughs coming?
A: I’d like to see deep learning and deep RL used to solve societal problems involving mobility, infrastructure, social media, health care, and education. Deep RL now has a toehold in robotics and industrial applications like chip design, but we still need to be careful in applying it to systems with humans in the loop. Ultimately, we want to design systems for people. Currently, we simply don’t have the right tools.
Q: What worries you most about AI taking on more and more specialized tasks?
A: AI has the potential for tremendous good, but it could also help to accelerate the widening gap between the haves and the have-nots. Our political and regulatory systems could help to integrate AI into society and minimize job losses and income inequality, but I worry that they’re not equipped yet to handle the firehose of AI.
Q: What’s the last great book you read?
A: “How to Avoid a Climate Disaster,” by Bill Gates. I absolutely loved the way that Gates was able to take an overwhelmingly complex topic and distill it down into words that everyone can understand. His optimism inspires me to keep pushing on applications of AI and robotics to help avoid a climate disaster.
Nonsense can make sense to machine-learning models
For all that neural networks can accomplish, we still don’t really understand how they operate. Sure, we can program them to learn, but making sense of a machine’s decision-making process remains much like a fancy puzzle with a dizzying, complex pattern where plenty of integral pieces have yet to be fitted.
If a model was trying to classify an image of said puzzle, for example, it could encounter well-known but annoying adversarial attacks, or even more run-of-the-mill data or processing issues. But a new, more subtle type of failure recently identified by MIT scientists is another cause for concern: “overinterpretation,” where algorithms make confident predictions based on details that don’t make sense to humans, like random patterns or image borders.
This could be particularly worrisome for high-stakes environments, like split-second decisions for self-driving cars, and medical diagnostics for diseases that need more immediate attention. Autonomous vehicles in particular rely heavily on systems that can accurately understand surroundings and then make quick, safe decisions. In the study, a network used specific backgrounds, edges, or particular patterns of the sky to classify traffic lights and street signs, irrespective of what else was in the image.
The team found that neural networks trained on popular datasets like CIFAR-10 and ImageNet suffered from overinterpretation. Models trained on CIFAR-10, for example, made confident predictions even when 95 percent of each input image was missing and the remainder was senseless to humans.
“Overinterpretation is a dataset problem that’s caused by these nonsensical signals in datasets. Not only are these high-confidence images unrecognizable, but they contain less than 10 percent of the original image in unimportant areas, such as borders. We found that these images were meaningless to humans, yet models can still classify them with high confidence,” says Brandon Carter, MIT Computer Science and Artificial Intelligence Laboratory PhD student and lead author on a paper about the research.
Deep-image classifiers are widely used. In addition to medical diagnosis and boosting autonomous vehicle technology, there are use cases in security, gaming, and even an app that tells you if something is or isn’t a hot dog, because sometimes we need reassurance. The technology in question works by processing individual pixels from tons of pre-labeled images for the network to “learn.”
Image classification is hard, because machine-learning models have the ability to latch onto these nonsensical subtle signals. Then, when image classifiers are trained on datasets such as ImageNet, they can make seemingly reliable predictions based on those signals.
Although these nonsensical signals can lead to model fragility in the real world, the signals are actually valid in the datasets, meaning overinterpretation can’t be diagnosed using typical evaluation methods that measure accuracy on those datasets.
To find the rationale for the model’s prediction on a particular input, the methods in the present study start with the full image and repeatedly ask what can be removed from it. Essentially, they keep covering up parts of the image until what is left is the smallest piece on which the model still makes a confident decision.
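The covering-up procedure can be illustrated with a greedy sketch. This is a simplified stand-in, not the paper's exact algorithm. The five-pixel "image", the toy confidence function, the 0.9 threshold, and every name below are assumptions made for illustration.

```python
def smallest_confident_subset(pixels, confidence, threshold=0.9, mask_value=0.0):
    """Greedily mask pixels while confidence stays at or above `threshold`.

    Returns the sorted indices of the pixels left unmasked.
    """
    current = list(pixels)
    kept = set(range(len(pixels)))
    while kept:
        # Find the remaining pixel whose masking hurts confidence the least.
        best_index, best_conf = None, float("-inf")
        for i in kept:
            trial = list(current)
            trial[i] = mask_value
            c = confidence(trial)
            if c > best_conf:
                best_index, best_conf = i, c
        if best_conf < threshold:
            break  # masking anything more would make the model unsure
        current[best_index] = mask_value
        kept.remove(best_index)
    return sorted(kept)


def toy_confidence(pixels):
    """Hypothetical 'model' that only looks at the two border pixels."""
    return 0.5 * pixels[0] + 0.5 * pixels[-1]


image = [1.0, 0.3, 0.8, 0.2, 1.0]
rationale = smallest_confident_subset(image, toy_confidence)
```

In this toy case the surviving subset is just the two border pixels. The model is fully confident while ignoring everything in the middle of the image, which is the kind of behavior the researchers call overinterpretation.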
To that end, it could also be possible to use these methods as a type of validation criteria. For example, if you have an autonomously driving car that uses a trained machine-learning method for recognizing stop signs, you could test that method by identifying the smallest input subset that constitutes a stop sign. If that consists of a tree branch, a particular time of day, or something that’s not a stop sign, you could be concerned that the car might come to a stop at a place it’s not supposed to.
While it may seem that the model is the likely culprit here, the datasets are more likely to blame. “There’s the question of how we can modify the datasets in a way that would enable models to be trained to more closely mimic how a human would think about classifying images and therefore, hopefully, generalize better in these real-world scenarios, like autonomous driving and medical diagnosis, so that the models don’t have this nonsensical behavior,” says Carter.
This may mean creating datasets in more controlled environments. Currently, it’s just pictures that are extracted from public domains that are then classified. But if you want to do object identification, for example, it might be necessary to train models with objects with an uninformative background.
This work was supported by Schmidt Futures and the National Institutes of Health. Carter wrote the paper alongside Siddhartha Jain and Jonas Mueller, scientists at Amazon, and MIT Professor David Gifford. They are presenting the work at the 2021 Conference on Neural Information Processing Systems.
Quantum scientists and academics discuss quantum computing
Watch as the panel talks about everything from what got them interested in quantum research to where they see the field headed in the future.