Why it’s a problem that pulse oximeters don’t work as well on patients of color

Pulse oximetry is a noninvasive test that measures the oxygen saturation level in a patient’s blood, and it has become an important tool for monitoring many patients, including those with Covid-19. But new research links faulty pulse oximeter readings to racial disparities in health outcomes, potentially including higher rates of death and complications such as organ dysfunction in patients with darker skin.

It is well known that non-white intensive care unit (ICU) patients receive less-accurate readings of their oxygen levels using pulse oximeters — the common devices clamped on patients’ fingers. Now, a paper co-authored by MIT scientists reveals that inaccurate pulse oximeter readings can lead to critically ill patients of color receiving less supplemental oxygen during ICU stays.

The paper, “Assessment of Racial and Ethnic Differences in Oxygen Supplementation Among Patients in the Intensive Care Unit,” published in JAMA Internal Medicine, examined whether differences in supplemental oxygen administration among patients of different races and ethnicities were associated with discrepancies in pulse oximeter performance.

The findings showed that inaccurate readings of Asian, Black, and Hispanic patients resulted in them receiving less supplemental oxygen than white patients. These results provide insight into how health technologies such as the pulse oximeter contribute to racial and ethnic disparities in care, according to the researchers.

The study’s senior author, Leo Anthony Celi, clinical research director and principal research scientist at the MIT Laboratory for Computational Physiology, and a principal research scientist at the MIT Institute for Medical Engineering and Science (IMES), says the challenge is that health care technology is routinely designed around the majority population.

“Medical devices are typically developed in rich countries with white, fit individuals as test subjects,” he explains. “Drugs are evaluated through clinical trials that disproportionately enroll white individuals. Genomics data overwhelmingly come from individuals of European descent.”

“It is therefore not surprising that we observe disparities in outcomes across demographics, with poorer outcomes among those who were not included in the design of health care,” Celi adds.

While pulse oximeters are widely used because they are easy to use, the most accurate way to measure blood oxygen saturation (SaO2) is to take a sample of the patient’s arterial blood. Falsely normal pulse oximetry (SpO2) readings can mask hidden hypoxemia. Elevated bilirubin in the bloodstream and the use of certain ICU medications called vasopressors can also throw off pulse oximetry readings.

The study included more than 3,000 participants, of whom 2,667 were white, 207 Black, 112 Hispanic, and 83 Asian, drawing on data from the Medical Information Mart for Intensive Care version 4, or MIMIC-IV dataset. This dataset comprises more than 50,000 patients admitted to the ICU at Beth Israel Deaconess Medical Center, and includes both pulse oximeter readings and oxygen saturation levels detected in blood samples. MIMIC-IV also includes rates of administration of supplemental oxygen.

When the researchers compared SpO2 levels taken by pulse oximeter to oxygen saturation measured in blood samples, they found that Black, Hispanic, and Asian patients had higher SpO2 readings than white patients for a given blood oxygen saturation level. Arterial blood gas analysis can take anywhere from several minutes to an hour to return results, so clinicians typically make decisions based on pulse oximetry readings, unaware of their suboptimal performance in certain patient demographics.
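
As a rough illustration of this kind of comparison (and not the study’s actual analysis code), the short Python sketch below tabulates, per group, the SpO2-minus-SaO2 bias and the rate of “hidden hypoxemia,” a reassuring oximeter reading paired with a low arterial measurement. The column names and the 92 percent and 88 percent thresholds are assumptions drawn from common practice in this literature, not values taken from the paper.

```python
import pandas as pd

# Hypothetical paired measurements: each row is a pulse oximeter reading (SpO2)
# with a near-simultaneous arterial blood gas (SaO2). Column names are assumptions.
readings = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "Hispanic", "Asian"],
    "spo2": [95, 97, 96, 94, 97, 95],   # pulse oximeter, percent
    "sao2": [94, 96, 87, 93, 86, 89],   # arterial blood gas, percent
})

# "Hidden hypoxemia": the oximeter looks reassuring (>= 92 percent) while the
# blood gas shows true hypoxemia (< 88 percent). Thresholds are illustrative.
readings["hidden_hypoxemia"] = (readings["spo2"] >= 92) & (readings["sao2"] < 88)
readings["bias"] = readings["spo2"] - readings["sao2"]

summary = readings.groupby("race").agg(
    hidden_hypoxemia_rate=("hidden_hypoxemia", "mean"),
    mean_spo2_minus_sao2=("bias", "mean"),
)
print(summary)
```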

Eric Gottlieb, the study’s lead author, a nephrologist, a lecturer at MIT, and a Harvard Medical School fellow at Brigham and Women’s Hospital, called for more research to be done, in order to better understand “how pulse oximeter performance disparities lead to worse outcomes; possible differences in ventilation management, fluid resuscitation, triaging decisions, and other aspects of care should be explored. We then need to redesign these devices and properly evaluate them to ensure that they perform equally well for all patients.”

Celi emphasizes that understanding biases that exist within real-world data is crucial in order to better develop algorithms and artificial intelligence to assist clinicians with decision-making. “Before we invest more money on developing artificial intelligence for health care using electronic health records, we have to identify all the drivers of outcome disparities, including those that arise from the use of suboptimally designed technology,” he argues. “Otherwise, we risk perpetuating and magnifying health inequities with AI.”

Celi described the project and research as a testament to the value of data sharing that is the core of the MIMIC project. “No one team has the expertise and perspective to understand all the biases that exist in real-world data to prevent AI from perpetuating health inequities,” he says. “The database we analyzed for this project has more than 30,000 credentialed users consisting of teams that include data scientists, clinicians, and social scientists.”

The many researchers working on this topic together form a community that shares and performs quality checks on code and queries, promotes reproducibility of the results, and crowdsources the curation of the data, Celi says. “There is harm when health data is not shared,” he says. “Limiting data access means limiting the perspectives with which data is analyzed and interpreted. We’ve seen numerous examples of model mis-specifications and flawed assumptions leading to models that ultimately harm patients.”


Using artificial intelligence to control digital manufacturing

Scientists and engineers are constantly developing new materials with unique properties that can be used for 3D printing, but figuring out how to print with these materials can be a complex, costly conundrum.

Often, an expert operator must use manual trial-and-error — possibly making thousands of prints — to determine ideal parameters that consistently print a new material effectively. These parameters include printing speed and how much material the printer deposits.

MIT researchers have now used artificial intelligence to streamline this procedure. They developed a machine-learning system that uses computer vision to watch the manufacturing process and then correct errors in how it handles the material in real time.

They used simulations to teach a neural network how to adjust printing parameters to minimize error, and then applied that controller to a real 3D printer. Their system printed objects more accurately than all the other 3D printing controllers they compared it to.

The work avoids the prohibitively expensive process of printing thousands or millions of real objects to train the neural network. And it could enable engineers to more easily incorporate novel materials into their prints, which could help them develop objects with special electrical or chemical properties. It could also help technicians make adjustments to the printing process on-the-fly if material or environmental conditions change unexpectedly.

“This project is really the first demonstration of building a manufacturing system that uses machine learning to learn a complex control policy,” says senior author Wojciech Matusik, professor of electrical engineering and computer science at MIT who leads the Computational Design and Fabrication Group (CDFG) within the Computer Science and Artificial Intelligence Laboratory (CSAIL). “If you have manufacturing machines that are more intelligent, they can adapt to the changing environment in the workplace in real-time, to improve the yields or the accuracy of the system. You can squeeze more out of the machine.”

The co-lead authors on the research are Mike Foshey, a mechanical engineer and project manager in the CDFG, and Michal Piovarci, a postdoc at the Institute of Science and Technology in Austria. MIT co-authors include Jie Xu, a graduate student in electrical engineering and computer science, and Timothy Erps, a former technical associate with the CDFG.

Picking parameters

Determining the ideal parameters of a digital manufacturing process can be one of the most expensive parts of the process because so much trial-and-error is required. And once a technician finds a combination that works well, those parameters are only ideal for one specific situation. She has little data on how the material will behave in other environments, on different hardware, or if a new batch exhibits different properties.

Using a machine-learning system is fraught with challenges, too. First, the researchers needed to measure what was happening on the printer in real time.

To do this, they developed a machine-vision system using two cameras aimed at the nozzle of the 3D printer. The system shines light at material as it is deposited and, based on how much light passes through, calculates the material’s thickness.
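
The article doesn’t specify the optical model behind that calculation. One simple possibility, sketched below under a Beer-Lambert attenuation assumption, converts the fraction of transmitted light at each pixel into an estimated thickness; the attenuation coefficient and the intensity values are placeholders, not numbers from the study.

```python
import numpy as np

def thickness_from_transmission(transmitted, incident, attenuation_coeff):
    """Estimate material thickness from how much light passes through.

    Assumes a simple Beer-Lambert attenuation model, I = I0 * exp(-mu * t);
    the real vision system may use a calibrated lookup table instead.
    """
    transmitted = np.asarray(transmitted, dtype=float)
    ratio = np.clip(transmitted / incident, 1e-6, 1.0)  # avoid log(0)
    return -np.log(ratio) / attenuation_coeff

# Per-pixel intensities from the camera behind the deposited material (illustrative).
incident_intensity = 255.0
pixels = [240.0, 180.0, 120.0, 60.0]
print(thickness_from_transmission(pixels, incident_intensity, attenuation_coeff=2.5))
```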

“You can think of the vision system as a set of eyes watching the process in real-time,” Foshey says.

The controller would then process images it receives from the vision system and, based on any error it sees, adjust the feed rate and the direction of the printer.

But training a neural network-based controller to understand this manufacturing process is data-intensive, and would require making millions of prints. So, the researchers built a simulator instead.

Successful simulation

To train their controller, they used a process known as reinforcement learning in which the model learns through trial-and-error with a reward. The model was tasked with selecting printing parameters that would create a certain object in a simulated environment. After being shown the expected output, the model was rewarded when the parameters it chose minimized the error between its print and the expected outcome.

In this case, an “error” means the model either dispensed too much material, placing it in areas that should have been left open, or did not dispense enough, leaving open spots that should be filled in. As the model performed more simulated prints, it updated its control policy to maximize the reward, becoming more and more accurate.

However, the real world is messier than a simulation. In practice, conditions typically change due to slight variations or noise in the printing process. So the researchers created a numerical model that approximates noise from the 3D printer. They used this model to add noise to the simulation, which led to more realistic results.
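
As a toy stand-in for that setup (the actual work trains a neural-network controller with reinforcement learning over a far richer printing simulator), the sketch below models deposition along a one-dimensional path, injects noise to mimic printer variability, and searches printing parameters for the combination that minimizes over- and under-deposition. The deposition model, noise scale, and grid search are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones(50)            # desired material height along a 1-D print path

def simulate_print(feed_rate, speed, noise_scale=0.05):
    """Toy deposition model: deposited height ~ feed_rate / speed, plus noise
    that mimics variability of the real printer (the noise model is the key
    ingredient for transferring a simulation-trained policy to hardware)."""
    deposited = (feed_rate / speed) * np.ones_like(target)
    deposited += rng.normal(0.0, noise_scale, size=target.shape)
    return deposited

def reward(deposited):
    over = np.clip(deposited - target, 0, None).sum()    # material where it shouldn't be
    under = np.clip(target - deposited, 0, None).sum()   # gaps that should be filled
    return -(over + under)

# Crude search over printing parameters; the paper trains a neural-network
# controller with reinforcement learning instead of this grid search.
best = max(
    ((f, s, np.mean([reward(simulate_print(f, s)) for _ in range(20)]))
     for f in np.linspace(0.5, 2.0, 16)
     for s in np.linspace(0.5, 2.0, 16)),
    key=lambda x: x[2],
)
print(f"best feed rate {best[0]:.2f}, speed {best[1]:.2f}, avg reward {best[2]:.3f}")
```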

“The interesting thing we found was that, by implementing this noise model, we were able to transfer the control policy that was purely trained in simulation onto hardware without training with any physical experimentation,” Foshey says. “We didn’t need to do any fine-tuning on the actual equipment afterwards.”

When they tested the controller, it printed objects more accurately than any other control method they evaluated. It performed especially well at infill printing, which is printing the interior of an object. Some other controllers deposited so much material that the printed object bulged up, but the researchers’ controller adjusted the printing path so the object stayed level.

Their control policy can even learn how materials spread after being deposited and adjust parameters accordingly.

“We were also able to design control policies that could control for different types of materials on-the-fly. So if you had a manufacturing process out in the field and you wanted to change the material, you wouldn’t have to revalidate the manufacturing process. You could just load the new material and the controller would automatically adjust,” Foshey says.

Now that they have shown the effectiveness of this technique for 3D printing, the researchers want to develop controllers for other manufacturing processes. They’d also like to see how the approach can be modified for scenarios where there are multiple layers of material, or multiple materials being printed at once. In addition, their approach assumed each material has a fixed viscosity (“syrupiness”), but a future iteration could use AI to recognize and adjust for viscosity in real-time.

Additional co-authors on this work include Vahid Babaei, who leads the Artificial Intelligence Aided Design and Manufacturing Group at the Max Planck Institute; Piotr Didyk, associate professor at the University of Lugano in Switzerland; Szymon Rusinkiewicz, the David M. Siegel ’83 Professor of computer science at Princeton University; and Bernd Bickel, professor at the Institute of Science and Technology in Austria.

The work was supported, in part, by the FWF Lise-Meitner program, a European Research Council starting grant, and the U.S. National Science Foundation.


New hardware offers faster computation for artificial intelligence, with much less energy

As scientists push the boundaries of machine learning, the amount of time, energy, and money required to train increasingly complex neural network models is skyrocketing. A new area of artificial intelligence called analog deep learning promises faster computation with a fraction of the energy usage.

Programmable resistors are the key building blocks in analog deep learning, just like transistors are the core elements for digital processors. By repeating arrays of programmable resistors in complex layers, researchers can create a network of analog artificial “neurons” and “synapses” that execute computations just like a digital neural network. This network can then be trained to achieve complex AI tasks like image recognition and natural language processing.
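
To make the analogy concrete, the short sketch below (an idealized numerical model, not the researchers’ hardware) shows how a resistor array carries out a neural-network layer’s matrix-vector multiply: weights are stored as conductances, inputs arrive as voltages, and each output line sums the resulting currents according to Ohm’s and Kirchhoff’s laws. The conductance and voltage values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Weights of one neural-network layer, stored as conductances (siemens) at
# each crossing of the resistor array (values are illustrative).
conductance = rng.uniform(1e-6, 1e-5, size=(4, 8))   # 4 output rows x 8 input columns

# Input activations applied as voltages on the columns.
voltages = rng.uniform(0.0, 0.2, size=8)

# Each output row collects the sum of currents I = G * V from every column:
# Kirchhoff's current law performs the entire matrix-vector product at once,
# in place, with no data shuttled between memory and a processor.
currents = conductance @ voltages
print(currents)
```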

A multidisciplinary team of MIT researchers set out to push the speed limits of a type of human-made analog synapse that they had previously developed. They utilized a practical inorganic material in the fabrication process that enables their devices to run 1 million times faster than previous versions, which is also about 1 million times faster than the synapses in the human brain.

Moreover, this inorganic material also makes the resistor extremely energy-efficient. Unlike materials used in the earlier version of their device, the new material is compatible with silicon fabrication techniques. This change has enabled fabricating devices at the nanometer scale and could pave the way for integration into commercial computing hardware for deep-learning applications.

“With that key insight, and the very powerful nanofabrication techniques we have at MIT.nano, we have been able to put these pieces together and demonstrate that these devices are intrinsically very fast and operate with reasonable voltages,” says senior author Jesús A. del Alamo, the Donner Professor in MIT’s Department of Electrical Engineering and Computer Science (EECS). “This work has really put these devices at a point where they now look really promising for future applications.”

“The working mechanism of the device is electrochemical insertion of the smallest ion, the proton, into an insulating oxide to modulate its electronic conductivity. Because we are working with very thin devices, we could accelerate the motion of this ion by using a strong electric field, and push these ionic devices to the nanosecond operation regime,” explains senior author Bilge Yildiz, the Breene M. Kerr Professor in the departments of Nuclear Science and Engineering and Materials Science and Engineering.

“The action potential in biological cells rises and falls with a timescale of milliseconds, since the voltage difference of about 0.1 volt is constrained by the stability of water,” says senior author Ju Li, the Battelle Energy Alliance Professor of Nuclear Science and Engineering and professor of materials science and engineering, “Here we apply up to 10 volts across a special solid glass film of nanoscale thickness that conducts protons, without permanently damaging it. And the stronger the field, the faster the ionic devices.”

These programmable resistors vastly increase the speed at which a neural network is trained, while drastically reducing the cost and energy to perform that training. This could help scientists develop deep learning models much more quickly, which could then be applied in uses like self-driving cars, fraud detection, or medical image analysis.

“Once you have an analog processor, you will no longer be training networks everyone else is working on. You will be training networks with unprecedented complexities that no one else can afford to, and therefore vastly outperform them all. In other words, this is not a faster car, this is a spacecraft,” adds lead author and MIT postdoc Murat Onen.

Co-authors include Frances M. Ross, the Ellen Swallow Richards Professor in the Department of Materials Science and Engineering; postdocs Nicolas Emond and Baoming Wang; and Difei Zhang, an EECS graduate student. The research is published today in Science.

Accelerating deep learning

Analog deep learning is faster and more energy-efficient than its digital counterpart for two main reasons. First, computation is performed in memory, so enormous loads of data are not transferred back and forth from memory to a processor. Second, analog processors conduct operations in parallel: if the matrix size expands, an analog processor doesn’t need more time to complete new operations, because all computation occurs simultaneously.

The key element of MIT’s new analog processor technology is known as a protonic programmable resistor. These resistors, which are measured in nanometers (one nanometer is one billionth of a meter), are arranged in an array, like a chess board.

In the human brain, learning happens due to the strengthening and weakening of connections between neurons, called synapses. Deep neural networks have long adopted this strategy, where the network weights are programmed through training algorithms. In the case of this new processor, increasing and decreasing the electrical conductance of protonic resistors enables analog machine learning.

The conductance is controlled by the movement of protons. To increase the conductance, more protons are pushed into a channel in the resistor, while to decrease conductance protons are taken out. This is accomplished using an electrolyte (similar to that of a battery) that conducts protons but blocks electrons.
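
One simplified way to picture the programming step is sketched below: during training, each resistor’s conductance is nudged up (protons pushed in) or down (protons pulled out) by discrete pulses. The pulse size, conductance limits, and sign-based update rule are invented for illustration and are not the device’s actual programming scheme.

```python
import numpy as np

def apply_pulses(conductance, gradient, pulse_step=1e-7,
                 g_min=1e-6, g_max=1e-5):
    """Nudge each programmable resistor toward lower training error.

    A negative gradient means the weight should grow, so protons are pushed
    into that resistor's channel (conductance up); a positive gradient pulls
    them out (conductance down). Step size and limits are illustrative.
    """
    pulses = -np.sign(gradient)                   # +1 push protons in, -1 pull out
    updated = conductance + pulses * pulse_step
    return np.clip(updated, g_min, g_max)

g = np.full((4, 8), 5e-6)                         # current device conductances
grad = np.random.default_rng(2).normal(size=(4, 8))
g = apply_pulses(g, grad)
print(g.mean())
```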

To develop a super-fast and highly energy efficient programmable protonic resistor, the researchers looked to different materials for the electrolyte. While other devices used organic compounds, Onen focused on inorganic phosphosilicate glass (PSG).

PSG is basically silicon dioxide, which is the powdery desiccant material found in tiny bags that come in the box with new furniture to remove moisture. It is studied as a proton conductor under humidified conditions for fuel cells. It is also the most well-known oxide used in silicon processing. To make PSG, a tiny bit of phosphorus is added to the silicon to give it special characteristics for proton conduction.

Onen hypothesized that an optimized PSG could have a high proton conductivity at room temperature without the need for water, which would make it an ideal solid electrolyte for this application. He was right.

Surprising speed

PSG enables ultrafast proton movement because it contains a multitude of nanometer-sized pores whose surfaces provide paths for proton diffusion. It can also withstand very strong, pulsed electric fields. This is critical, Onen explains, because applying more voltage to the device enables protons to move at blinding speeds.

“The speed certainly was surprising. Normally, we would not apply such extreme fields across devices, in order to not turn them into ash. But instead, protons ended up shuttling at immense speeds across the device stack, specifically a million times faster compared to what we had before. And this movement doesn’t damage anything, thanks to the small size and low mass of protons. It is almost like teleporting,” he says.

“The nanosecond timescale means we are close to the ballistic or even quantum tunneling regime for the proton, under such an extreme field,” adds Li.

Because the protons don’t damage the material, the resistor can run for millions of cycles without breaking down. This new electrolyte enabled a programmable protonic resistor that is a million times faster than their previous device and can operate effectively at room temperature, which is important for incorporating it into computing hardware.

Thanks to the insulating properties of PSG, almost no electric current passes through the material as protons move. This makes the device extremely energy efficient, Onen adds.

Now that they have demonstrated the effectiveness of these programmable resistors, the researchers plan to reengineer them for high-volume manufacturing, says del Alamo. Then they can study the properties of resistor arrays and scale them up so they can be embedded into systems.

At the same time, they plan to study the materials to remove bottlenecks that limit the voltage that is required to efficiently transfer the protons to, through, and from the electrolyte.

“Another exciting direction that these ionic devices can enable is energy-efficient hardware to emulate the neural circuits and synaptic plasticity rules that are deduced in neuroscience, beyond analog deep neural networks. We have already started such a collaboration with neuroscience, supported by the MIT Quest for Intelligence,” adds Yildiz.

“The collaboration that we have is going to be essential to innovate in the future. The path forward is still going to be very challenging, but at the same time it is very exciting,” del Alamo says.

“Intercalation reactions such as those found in lithium-ion batteries have been explored extensively for memory devices. This work demonstrates that proton-based memory devices deliver impressive and surprising switching speed and endurance,” says William Chueh, associate professor of materials science and engineering at Stanford University, who was not involved with this research. “It lays the foundation for a new class of memory devices for powering deep learning algorithms.”

“This work demonstrates a significant breakthrough in biologically inspired resistive-memory devices. These all-solid-state protonic devices are based on exquisite atomic-scale control of protons, similar to biological synapses but at orders of magnitude faster rates,” says Elizabeth Dickey, the Teddy & Wilton Hawkins Distinguished Professor and head of the Department of Materials Science and Engineering at Carnegie Mellon University, who was not involved with this work. “I commend the interdisciplinary MIT team for this exciting development, which will enable future-generation computational devices.”

This research is funded, in part, by the MIT-IBM Watson AI Lab.


Explained: How to tell if artificial intelligence is working the way we want it to

About a decade ago, deep-learning models started achieving superhuman results on all sorts of tasks, from beating world-champion board game players to outperforming doctors at diagnosing breast cancer.

These powerful deep-learning models are usually based on artificial neural networks, which were first proposed in the 1940s and have become a popular type of machine learning. A computer learns to process data using layers of interconnected nodes, or neurons, that mimic the human brain. 

As the field of machine learning has grown, artificial neural networks have grown along with it.

Deep-learning models are now often composed of millions or billions of interconnected nodes in many layers that are trained to perform detection or classification tasks using vast amounts of data. But because the models are so enormously complex, even the researchers who design them don’t fully understand how they work. This makes it hard to know whether they are working correctly.

For instance, maybe a model designed to help physicians diagnose patients correctly predicted that a skin lesion was cancerous, but it did so by focusing on an unrelated mark that happens to frequently occur when there is cancerous tissue in a photo, rather than on the cancerous tissue itself. This is known as a spurious correlation. The model gets the prediction right, but it does so for the wrong reason. In a real clinical setting where the mark does not appear on cancer-positive images, it could result in missed diagnoses.

With so much uncertainty swirling around these so-called “black-box” models, how can one unravel what’s going on inside the box?

This puzzle has led to a new and rapidly growing area of study in which researchers develop and test explanation methods (also called interpretability methods) that seek to shed some light on how black-box machine-learning models make predictions.

What are explanation methods?

At their most basic level, explanation methods are either global or local. A local explanation method focuses on explaining how the model made one specific prediction, while global explanations seek to describe the overall behavior of an entire model. This is often done by developing a separate, simpler (and hopefully understandable) model that mimics the larger, black-box model.

But because deep learning models work in fundamentally complex and nonlinear ways, developing an effective global explanation model is particularly challenging. This has led researchers to turn much of their recent focus onto local explanation methods instead, explains Yilun Zhou, a graduate student in the Interactive Robotics Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL) who studies models, algorithms, and evaluations in interpretable machine learning.

The most popular types of local explanation methods fall into three broad categories.

The first and most widely used type of explanation method is known as feature attribution. Feature attribution methods show which features were most important when the model made a specific decision.

Features are the input variables that are fed to a machine-learning model and used in its prediction. When the data are tabular, features are drawn from the columns in a dataset (they are transformed using a variety of techniques so the model can process the raw data). For image-processing tasks, on the other hand, every pixel in an image is a feature. If a model predicts that an X-ray image shows cancer, for instance, the feature attribution method would highlight the pixels in that specific X-ray that were most important for the model’s prediction.

Essentially, feature attribution methods show what the model pays the most attention to when it makes a prediction.
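
As a concrete, simplified illustration (not any specific published attribution method), the sketch below scores each feature of a single tabular input by perturbing it slightly and measuring how much a black-box model’s predicted probability moves. The synthetic dataset and the logistic-regression “black box” are stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small "black-box" model on synthetic tabular data.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def attribute(model, x, eps=1e-2):
    """Score each feature by how much a small perturbation to it changes the
    model's predicted probability for this one input (a simple local,
    perturbation-based attribution)."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    scores = np.zeros_like(x)
    for i in range(len(x)):
        x_perturbed = x.copy()
        x_perturbed[i] += eps
        scores[i] = (model.predict_proba(x_perturbed.reshape(1, -1))[0, 1] - base) / eps
    return scores

x = X[0]
print("attribution per feature:", np.round(attribute(model, x), 3))
```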

“Using this feature attribution explanation, you can check to see whether a spurious correlation is a concern. For instance, it will show if the pixels in a watermark are highlighted or if the pixels in an actual tumor are highlighted,” says Zhou.

A second type of explanation method is known as a counterfactual explanation. Given an input and a model’s prediction, these methods show how to change that input so it falls into another class. For instance, if a machine-learning model predicts that a borrower would be denied a loan, the counterfactual explanation shows what factors need to change so her loan application is accepted. Perhaps her credit score or income, both features used in the model’s prediction, need to be higher for her to be approved.
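
For a concrete sense of the idea, here is a toy counterfactual search (not a published method): it greedily nudges an applicant’s features toward the approval class until a simple logistic-regression model flips its decision. The two features, the synthetic loan data, the step size, and the greedy strategy are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy loan data: features are [credit_score, income]; label 1 = approved.
X = rng.normal(size=(400, 2)) * [50, 15] + [650, 60]
y = ((X[:, 0] - 650) / 50 + (X[:, 1] - 60) / 15 + rng.normal(size=400) > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

def counterfactual(model, x, step=1.0, max_iters=500):
    """Greedily nudge one feature at a time toward the approval class until
    the prediction flips; returns the modified input."""
    x_cf = x.copy()
    for _ in range(max_iters):
        if model.predict(x_cf.reshape(1, -1))[0] == 1:
            return x_cf
        # Try a small increase to each feature and keep the most helpful one.
        probs = []
        for i in range(len(x_cf)):
            trial = x_cf.copy()
            trial[i] += step
            probs.append(model.predict_proba(trial.reshape(1, -1))[0, 1])
        x_cf[int(np.argmax(probs))] += step
    return x_cf

denied = X[model.predict(X) == 0][0]
flipped = counterfactual(model, denied)
print("original:", np.round(denied, 1), "-> counterfactual:", np.round(flipped, 1))
```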

“The good thing about this explanation method is it tells you exactly how you need to change the input to flip the decision, which could have practical usage. For someone who is applying for a mortgage and didn’t get it, this explanation would tell them what they need to do to achieve their desired outcome,” he says.

The third category of explanation methods is known as the sample importance explanation. Unlike the others, this method requires access to the data that were used to train the model.

A sample importance explanation will show which training sample a model relied on most when it made a specific prediction; ideally, this is the most similar sample to the input data. This type of explanation is particularly useful if one observes a seemingly irrational prediction. There may have been a data entry error that affected a particular sample that was used to train the model. With this knowledge, one could fix that sample and retrain the model to improve its accuracy.
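
A very crude proxy for this idea, shown below, simply returns the training example nearest to the query input; real sample-importance methods, such as influence functions, estimate how removing or altering a training point would change the prediction. The dataset, model, and nearest-neighbor shortcut are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X_train, y_train = make_classification(n_samples=300, n_features=5, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Index the training set so we can look up the closest example to any input.
nn = NearestNeighbors(n_neighbors=1).fit(X_train)

x_query = X_train[42] + 0.05      # a new input near a known training point
dist, idx = nn.kneighbors(x_query.reshape(1, -1))
i = int(idx[0, 0])
print(f"prediction: {model.predict(x_query.reshape(1, -1))[0]}")
print(f"most similar training sample: index {i}, label {y_train[i]}, distance {dist[0, 0]:.3f}")
# If that training sample turns out to contain a data-entry error, fixing it
# and retraining is the remedy the article describes.
```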

How are explanation methods used?

One motivation for developing these explanations is to perform quality assurance and debug the model. With more understanding of how features impact a model’s decision, for instance, one could identify that a model is working incorrectly and intervene to fix the problem, or toss the model out and start over.

Another, more recent, area of research is exploring the use of machine-learning models to discover scientific patterns that humans haven’t uncovered before. For instance, a cancer-diagnosing model that outperforms clinicians could be faulty, or it could actually be picking up on hidden patterns in an X-ray image that represent an early pathological pathway for cancer, patterns that were either unknown to human doctors or thought to be irrelevant, Zhou says.

It’s still very early days for that area of research, however.

Words of warning

While explanation methods can sometimes be useful for machine-learning practitioners when they are trying to catch bugs in their models or understand the inner workings of a system, end users should proceed with caution when trying to use them in practice, says Marzyeh Ghassemi, an assistant professor and head of the Healthy ML Group in CSAIL.

As machine learning has been adopted in more disciplines, from health care to education, explanation methods are being used to help decision makers better understand a model’s predictions so they know when to trust the model and use its guidance in practice. But Ghassemi warns against using these methods in that way.

“We have found that explanations make people, both experts and nonexperts, overconfident in the ability or the advice of a specific recommendation system. I think it is very important for humans not to turn off that internal circuitry asking, ‘let me question the advice that I am given,’” she says.

Scientists know explanations make people over-confident based on other recent work, she adds, citing some recent studies by Microsoft researchers.

Far from a silver bullet, explanation methods have their share of problems. For one, Ghassemi’s recent research has shown that explanation methods can perpetuate biases and lead to worse outcomes for people from disadvantaged groups.

Another pitfall of explanation methods is that it is often impossible to tell if the explanation method is correct in the first place. One would need to compare the explanations to the actual model, but since the user doesn’t know how the model works, this is circular logic, Zhou says.

He and other researchers are working on improving explanation methods so they are more faithful to the actual model’s predictions, but Zhou cautions that even the best explanation should be taken with a grain of salt.

“In addition, people generally perceive these models to be human-like decision makers, and we are prone to overgeneralization. We need to calm people down and hold them back to really make sure that the generalized model understanding they build from these local explanations are balanced,” he adds.

Zhou’s most recent research seeks to do just that.

What’s next for machine-learning explanation methods?

Rather than focusing on providing explanations, Ghassemi argues that more work needs to be done by the research community to study how information is presented to decision makers so they understand it, and more regulation needs to be put in place to ensure machine-learning models are used responsibly in practice. Better explanation methods alone aren’t the answer.

“I have been excited to see that there is a lot more recognition, even in industry, that we can’t just take this information and make a pretty dashboard and assume people will perform better with that. You need to have measurable improvements in action, and I’m hoping that leads to real guidelines about improving the way we display information in these deeply technical fields, like medicine,” she says.

And in addition to new work focused on improving explanations, Zhou expects to see more research related to explanation methods for specific use cases, such as model debugging, scientific discovery, fairness auditing, and safety assurance. By identifying fine-grained characteristics of explanation methods and the requirements of different use cases, researchers could establish a theory that would match explanations with specific scenarios, which could help overcome some of the pitfalls that come from using them in real-world scenarios.


A technique to improve both fairness and accuracy in artificial intelligence

For workers who use machine-learning models to help them make decisions, knowing when to trust a model’s predictions is not always an easy task, especially since these models are often so complex that their inner workings remain a mystery.

Users sometimes employ a technique, known as selective regression, in which the model estimates its confidence level for each prediction and will reject predictions when its confidence is too low. Then a human can examine those cases, gather additional information, and make a decision about each one manually.

But while selective regression has been shown to improve the overall performance of a model, researchers at MIT and the MIT-IBM Watson AI Lab have discovered that the technique can have the opposite effect for underrepresented groups of people in a dataset. As the model’s confidence increases with selective regression, its chance of making the right prediction also increases, but this does not always happen for all subgroups.

For instance, a model suggesting loan approvals might make fewer errors on average, but it may actually make more wrong predictions for Black or female applicants. One reason this can occur is that the model’s confidence measure is trained using overrepresented groups and may not be accurate for underrepresented groups.

Once they had identified this problem, the MIT researchers developed two algorithms that can remedy the issue. Using real-world datasets, they show that the algorithms reduce performance disparities that had affected marginalized subgroups.

“Ultimately, this is about being more intelligent about which samples you hand off to a human to deal with. Rather than just minimizing some broad error rate for the model, we want to make sure the error rate across groups is taken into account in a smart way,” says senior MIT author Greg Wornell, the Sumitomo Professor in Engineering in the Department of Electrical Engineering and Computer Science (EECS) who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics (RLE) and is a member of the MIT-IBM Watson AI Lab.

Joining Wornell on the paper are co-lead authors Abhin Shah, an EECS graduate student, and Yuheng Bu, a postdoc in RLE; as well as Joshua Ka-Wing Lee SM ’17, ScD ’21 and Subhro Das, Rameswar Panda, and Prasanna Sattigeri, research staff members at the MIT-IBM Watson AI Lab. The paper will be presented this month at the International Conference on Machine Learning.

To predict or not to predict

Regression is a technique that estimates the relationship between a dependent variable and independent variables. In machine learning, regression analysis is commonly used for prediction tasks, such as predicting the price of a home given its features (number of bedrooms, square footage, etc.). With selective regression, the machine-learning model can make one of two choices for each input: it can make a prediction or abstain from a prediction if it doesn’t have enough confidence in its decision.

When the model abstains, it reduces the fraction of samples it is making predictions on, which is known as coverage. By only making predictions on inputs that it is highly confident about, the overall performance of the model should improve. But this can also amplify biases that exist in a dataset, which occur when the model does not have sufficient data from certain subgroups. This can lead to errors or bad predictions for underrepresented individuals.
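
The sketch below is a minimal, synthetic illustration of that failure mode (it is not the paper’s method): predictions are ranked by a confidence score that happens to be informative for the majority group but not the minority group, and per-group error is tracked as coverage shrinks. Under the paper’s monotonic selective risk criterion, every group’s error should fall as coverage falls; here, only the majority group’s does. The group sizes, noise levels, and confidence construction are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
group = rng.choice(["A", "B"], size=n, p=[0.9, 0.1])     # B is underrepresented

# Synthetic regression task where the model is noisier on group B.
y_true = rng.normal(size=n)
noise = np.where(group == "A", 0.2, 0.8)
y_pred = y_true + rng.normal(scale=noise)

# A confidence score that is informative for the majority group but
# uninformative for the minority group (the kind of miscalibration at issue).
confidence = np.where(
    group == "A",
    -np.abs(y_pred - y_true),                       # tracks error for A
    -np.abs(rng.normal(scale=0.8, size=n)),         # pure noise for B
)

for coverage in [1.0, 0.8, 0.6, 0.4]:
    keep = confidence >= np.quantile(confidence, 1 - coverage)
    err = (y_pred - y_true) ** 2
    overall = err[keep].mean()
    by_group = {g: err[keep & (group == g)].mean() for g in ["A", "B"]}
    print(f"coverage {coverage:.1f}: overall MSE {overall:.3f}, "
          f"A {by_group['A']:.3f}, B {by_group['B']:.3f}")
```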

The MIT researchers aimed to ensure that, as the overall error rate for the model improves with selective regression, the performance for every subgroup also improves. They call this monotonic selective risk.

“It was challenging to come up with the right notion of fairness for this particular problem. But by enforcing this criteria, monotonic selective risk, we can make sure the model performance is actually getting better across all subgroups when you reduce the coverage,” says Shah.

Focus on fairness

The team developed two neural network algorithms that impose this fairness criterion to solve the problem.

One algorithm guarantees that the features the model uses to make predictions contain all information about the sensitive attributes in the dataset, such as race and sex, that is relevant to the target variable of interest. Sensitive attributes are features that may not be used for decisions, often due to laws or organizational policies. The second algorithm employs a calibration technique to ensure the model makes the same prediction for an input, regardless of whether any sensitive attributes are added to that input.

The researchers tested these algorithms by applying them to real-world datasets that could be used in high-stakes decision making. One, an insurance dataset, is used to predict total annual medical expenses charged to patients using demographic statistics; another, a crime dataset, is used to predict the number of violent crimes in communities using socioeconomic information. Both datasets contain sensitive attributes for individuals.

When they implemented their algorithms on top of a standard machine-learning method for selective regression, they were able to reduce disparities by achieving lower error rates for the minority subgroups in each dataset. Moreover, this was accomplished without significantly impacting the overall error rate.

“We see that if we don’t impose certain constraints, in cases where the model is really confident, it could actually be making more errors, which could be very costly in some applications, like health care. So if we reverse the trend and make it more intuitive, we will catch a lot of these errors. A major goal of this work is to avoid errors going silently undetected,” Sattigeri says.

The researchers plan to apply their solutions to other applications, such as predicting house prices, student GPA, or loan interest rate, to see if the algorithms need to be calibrated for those tasks, says Shah. They also want to explore techniques that use less sensitive information during the model training process to avoid privacy issues.

And they hope to improve the confidence estimates in selective regression to prevent situations where the model’s confidence is low, but its prediction is correct. This could reduce the workload on humans and further streamline the decision-making process, Sattigeri says.

This research was funded, in part, by the MIT-IBM Watson AI Lab and its member companies Boston Scientific, Samsung, and Wells Fargo, and by the National Science Foundation.


Teaching AI to ask clinical questions

Physicians often query a patient’s electronic health record for information that helps them make treatment decisions, but the cumbersome nature of these records hampers the process. Research has shown that even when a doctor has been trained to use an electronic health record (EHR), finding an answer to just one question can take, on average, more than eight minutes.

The more time physicians must spend navigating an oftentimes clunky EHR interface, the less time they have to interact with patients and provide treatment.

Researchers have begun developing machine-learning models that can streamline the process by automatically finding information physicians need in an EHR. However, training effective models requires huge datasets of relevant medical questions, which are often hard to come by due to privacy restrictions. Existing models struggle to generate authentic questions — those that would be asked by a human doctor — and are often unable to successfully find correct answers.

To overcome this data shortage, researchers at MIT partnered with medical experts to study the questions physicians ask when reviewing EHRs. Then, they built a publicly available dataset of more than 2,000 clinically relevant questions written by these medical experts.

When they used their dataset to train a machine-learning model to generate clinical questions, they found that the model asked high-quality and authentic questions, as compared to real questions from medical experts, more than 60 percent of the time.

With this dataset, they plan to generate vast numbers of authentic medical questions and then use those questions to train a machine-learning model which would help doctors find sought-after information in a patient’s record more efficiently.

“Two thousand questions may sound like a lot, but when you look at machine-learning models being trained nowadays, they have so much data, maybe billions of data points. When you train machine-learning models to work in health care settings, you have to be really creative because there is such a lack of data,” says lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

The senior author is Peter Szolovits, a professor in the Department of Electrical Engineering and Computer Science (EECS) who heads the Clinical Decision-Making Group in CSAIL and is also a member of the MIT-IBM Watson AI Lab. The research paper, a collaboration between co-authors at MIT, the MIT-IBM Watson AI Lab, IBM Research, and the doctors and medical experts who helped create questions and participated in the study, will be presented at the annual conference of the North American Chapter of the Association for Computational Linguistics.

“Realistic data is critical for training models that are relevant to the task yet difficult to find or create,” Szolovits says. “The value of this work is in carefully collecting questions asked by clinicians about patient cases, from which we are able to develop methods that use these data and general language models to ask further plausible questions.”

Data deficiency

The few large datasets of clinical questions the researchers were able to find had a host of issues, Lehman explains. Some were composed of medical questions asked by patients on web forums, which are a far cry from physician questions. Other datasets contained questions produced from templates, so they are mostly identical in structure, making many questions unrealistic.

“Collecting high-quality data is really important for doing machine-learning tasks, especially in a health care context, and we’ve shown that it can be done,” Lehman says.

To build their dataset, the MIT researchers worked with practicing physicians and medical students in their last year of training. They gave these medical experts more than 100 EHR discharge summaries and told them to read through a summary and ask any questions they might have. The researchers didn’t put any restrictions on question types or structures in an effort to gather natural questions. They also asked the medical experts to identify the “trigger text” in the EHR that led them to ask each question.

For instance, a medical expert might read a note in the EHR that says a patient’s past medical history is significant for prostate cancer and hypothyroidism. The trigger text “prostate cancer” could lead the expert to ask questions like “date of diagnosis?” or “any interventions done?”

They found that most questions focused on symptoms, treatments, or the patient’s test results. While these findings weren’t unexpected, quantifying the number of questions about each broad topic will help them build an effective dataset for use in a real, clinical setting, says Lehman.

Once they had compiled their dataset of questions and accompanying trigger text, they used it to train machine-learning models to ask new questions based on the trigger text.

Then the medical experts determined whether those questions were “good” using four metrics: understandability (Does the question make sense to a human physician?), triviality (Is the question too easily answerable from the trigger text?), medical relevance (Does it make sense to ask this question based on the context?), and relevancy to the trigger (Is the trigger related to the question?).

Cause for concern

The researchers found that when a model was given trigger text, it was able to generate a good question 63 percent of the time, whereas a human physician would ask a good question 80 percent of the time.

They also trained models to recover answers to clinical questions using the publicly available datasets they had found at the outset of this project. Then they tested these trained models to see if they could find answers to “good” questions asked by human medical experts.

The models were only able to recover about 25 percent of answers to physician-generated questions.

“That result is really concerning. What people thought were good-performing models were, in practice, just awful because the evaluation questions they were testing on were not good to begin with,” Lehman says.

The team is now applying this work toward their initial goal: building a model that can automatically answer physicians’ questions in an EHR. For the next step, they will use their dataset to train a machine-learning model that can automatically generate thousands or millions of good clinical questions, which can then be used to train a new model for automatic question answering.

While there is still much work to do before that model could be a reality, Lehman is encouraged by the strong initial results the team demonstrated with this dataset.

This research was supported, in part, by the MIT-IBM Watson AI Lab. Additional co-authors include Leo Anthony Celi of the MIT Institute for Medical Engineering and Science; Preethi Raghavan and Jennifer J. Liang of the MIT-IBM Watson AI Lab; Dana Moukheiber of the University of Buffalo; Vladislav Lialin and Anna Rumshisky of the University of Massachusetts at Lowell; Katelyn Legaspi, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, and Pia Gabrielle I. Alfonso of the University of the Philippines; Anne Janelle R. Sy and Patricia Therese S. Pile of the University of the East Ramon Magsaysay Memorial Medical Center; Marianne Taliño of the Ateneo de Manila University School of Medicine and Public Health; and Byron C. Wallace of Northeastern University.


Artificial intelligence model finds potential drug molecules a thousand times faster

The entirety of the known universe is teeming with an infinite number of molecules. But what fraction of these molecules have potential drug-like traits that can be used to develop life-saving drug treatments? Millions? Billions? Trillions? The answer: novemdecillion, or 10^60. This gargantuan number prolongs the drug development process for fast-spreading diseases like Covid-19 because it is far beyond what existing drug design models can compute. To put it into perspective, the Milky Way has about 100 thousand million, or 10^11, stars.

In a paper that will be presented at the International Conference on Machine Learning (ICML), MIT researchers developed a geometric deep-learning model called EquiBind that is 1,200 times faster than one of the fastest existing computational molecular docking models, QuickVina2-W, in successfully binding drug-like molecules to proteins. EquiBind is based on its predecessor, EquiDock, which specializes in binding two proteins using a technique developed by the late Octavian-Eugen Ganea, a recent MIT Computer Science and Artificial Intelligence Laboratory and Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic) postdoc, who also co-authored the EquiBind paper.

Before drug development can even take place, drug researchers must find promising drug-like molecules that can bind or “dock” properly onto certain protein targets in a process known as drug discovery. After successfully docking to the protein, the binding drug, also known as the ligand, can stop a protein from functioning. If this happens to an essential protein of a bacterium, it can kill the bacterium, conferring protection to the human body.

However, the process of drug discovery can be costly both financially and computationally, with billions of dollars poured into the process and over a decade of development and testing before final approval from the Food and Drug Administration. What’s more, 90 percent of all drugs fail once they are tested in humans due to having no effects or too many side effects. One of the ways drug companies recoup the costs of these failures is by raising the prices of the drugs that are successful.

The current computational process for finding promising drug candidate molecules goes like this: most state-of-the-art computational models rely upon heavy candidate sampling coupled with methods like scoring, ranking, and fine-tuning to get the best “fit” between the ligand and the protein. 

Hannes Stärk, a first-year graduate student at the MIT Department of Electrical Engineering and Computer Science and lead author of the paper, likens typical ligand-to-protein binding methodologies to “trying to fit a key into a lock with a lot of keyholes.” Typical models time-consumingly score each “fit” before choosing the best one. In contrast, EquiBind directly predicts the precise key location in a single step without prior knowledge of the protein’s target pocket, which is known as “blind docking.”

Unlike most models that require several attempts to find a favorable position for the ligand in the protein, EquiBind already has built-in geometric reasoning that helps the model learn the underlying physics of molecules and successfully generalize to make better predictions when encountering new, unseen data.

The release of these findings quickly attracted the attention of industry professionals, including Pat Walters, the chief data officer for Relay Therapeutics. Walters suggested that the team try their model on an already existing drug and protein used for lung cancer, leukemia, and gastrointestinal tumors. Whereas most of the traditional docking methods failed to successfully bind the ligands that worked on those proteins, EquiBind succeeded.

“EquiBind provides a unique solution to the docking problem that incorporates both pose prediction and binding site identification,” Walters says. “This approach, which leverages information from thousands of publicly available crystal structures, has the potential to impact the field in new ways.”

“We were amazed that while all other methods got it completely wrong or only got one correct, EquiBind was able to put it into the correct pocket, so we were very happy to see the results for this,” Stärk says.

While EquiBind has received a great deal of feedback from industry professionals that has helped the team consider practical uses for the computational model, Stärk hopes to find different perspectives at the upcoming ICML in July.

“The feedback I’m most looking forward to is suggestions on how to further improve the model,” he says. “I want to discuss with those researchers … to tell them what I think can be the next steps and encourage them to go ahead and use the model for their own papers and for their own methods … we’ve had many researchers already reaching out and asking if we think the model could be useful for their problem.”

This work was funded, in part, by the Pharmaceutical Discovery and Synthesis consortium; the Jameel Clinic; the DTRA Discovery of Medical Countermeasures Against New and Emerging threats program; the DARPA Accelerated Molecular Discovery program; the MIT-Takeda Fellowship; and the NSF Expeditions grant Collaborative Research: Understanding the World Through Code.

This work is dedicated to the memory of Octavian-Eugen Ganea, who made crucial contributions to geometric machine learning research and generously mentored many students — a brilliant scholar with a humble soul.


Smart textiles sense how their users are moving

Using a novel fabrication process, MIT researchers have produced smart textiles that snugly conform to the body so they can sense the wearer’s posture and motions.

By incorporating a special type of plastic yarn and using heat to slightly melt it — a process called thermoforming — the researchers were able to greatly improve the precision of pressure sensors woven into multilayered knit textiles, which they call 3DKnITS.

They used this process to create a “smart” shoe and mat, and then built a hardware and software system to measure and interpret data from the pressure sensors in real time. The machine-learning system predicted motions and yoga poses performed by an individual standing on the smart textile mat with about 99 percent accuracy.

Their fabrication process, which takes advantage of digital knitting technology, enables rapid prototyping and can be easily scaled up for large-scale manufacturing, says Irmandy Wicaksono, a research assistant in the MIT Media Lab and lead author of a paper presenting 3DKnITS.

The technique could have many applications, especially in health care and rehabilitation. For example, it could be used to produce smart shoes that track the gait of someone who is learning to walk again after an injury, or socks that monitor pressure on a diabetic patient’s foot to prevent the formation of ulcers.

“With digital knitting, you have this freedom to design your own patterns and also integrate sensors within the structure itself, so it becomes seamless and comfortable, and you can develop it based on the shape of your body,” Wicaksono says.

He wrote the paper with MIT undergraduate students Peter G. Hwang, Samir Droubi, and Allison N. Serio through the Undergraduate Research Opportunities Program; Franny Xi Wu, a recent graduate of Wellesley College; Wei Yan, assistant professor at Nanyang Technological University; and senior author Joseph A. Paradiso, the Alexander W. Dreyfoos Professor and director of the Responsive Environments group within the Media Lab. The research will be presented at the IEEE Engineering in Medicine and Biology Society Conference.

“Some of the early pioneering work on smart fabrics happened at the Media Lab in the late ’90s. The materials, embeddable electronics, and fabrication machines have advanced enormously since then,” Paradiso says. “It’s a great time to see our research returning to this area, for example through projects like Irmandy’s — they point at an exciting future where sensing and functions diffuse more fluidly into materials and open up enormous possibilities.”

Knitting know-how

To produce a smart textile, the researchers use a digital knitting machine that weaves together layers of fabric with rows of standard and functional yarn. The multilayer knit textile is composed of two layers of conductive yarn knit sandwiched around a piezoresistive knit, which changes its resistance when squeezed. Following a pattern, the machine stitches this functional yarn throughout the textile in horizontal and vertical rows. Where the functional fibers intersect, they create a pressure sensor, Wicaksono explains.

But yarn is soft and pliable, so the layers shift and rub against each other when the wearer moves. This generates noise and causes variability that makes the pressure sensors much less accurate.

Wicaksono came up with a solution to this problem while working in a knitting factory in Shenzhen, China, where he spent a month learning to program and maintain digital knitting machines. He watched workers making sneakers using thermoplastic yarns that would start to melt when heated above 70 degrees Celsius, which slightly hardens the textile so it can hold a precise shape.

He decided to try incorporating melting fibers and thermoforming into the smart textile fabrication process.

“The thermoforming really solves the noise issue because it hardens the multilayer textile into one layer by essentially squeezing and melting the whole fabric together, which improves the accuracy. That thermoforming also allows us to create 3D forms, like a sock or shoe, that actually fit the precise size and shape of the user,” he says.

Once he perfected the fabrication process, Wicaksono needed a system to accurately process pressure sensor data. Since the fabric is knit as a grid, he crafted a wireless circuit that scans through rows and columns on the textile and measures the resistance at each point. He designed this circuit to overcome artifacts caused by “ghosting” ambiguities, which occur when the user exerts pressure on two or more separate points simultaneously.
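
As a rough illustration of that scanning idea, the sketch below (in Python) reads out a resistive grid one row at a time; the grid size and the driver functions are hypothetical stand-ins for the sensing hardware, not the circuit built for 3DKnITS.

```python
import numpy as np

N_ROWS, N_COLS = 32, 32  # hypothetical grid size, not the paper's sensor count

def read_pressure_frame(select_row, read_column_adc):
    """Scan the textile grid row by row and return an N_ROWS x N_COLS array
    of raw readings (a higher value corresponds to more pressure)."""
    frame = np.zeros((N_ROWS, N_COLS))
    for r in range(N_ROWS):
        # Driving one row at a time while the others are held at ground is a
        # common way to suppress "ghost" readings when several points are
        # pressed simultaneously.
        select_row(r)
        for c in range(N_COLS):
            frame[r, c] = read_column_adc(c)
    return frame

# Stand-in drivers for demonstration only; the real versions would talk to
# the multiplexers and ADC on the sensing circuit.
frame = read_pressure_frame(select_row=lambda r: None,
                            read_column_adc=lambda c: 0.0)
```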

Inspired by deep-learning techniques for image classification, Wicaksono devised a system that displays pressure sensor data as a heat map. Those images are fed to a machine-learning model, which is trained to detect the posture, pose, or motion of the user based on the heat map image.
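
A minimal sketch of that pipeline, assuming a small convolutional network in PyTorch: a pressure frame is treated as a one-channel image and mapped to a pose label. The layer sizes, grid dimensions, and class count below are illustrative assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

N_ROWS, N_COLS, N_CLASSES = 32, 32, 7  # hypothetical grid size and pose count

class PoseClassifier(nn.Module):
    """Treats a pressure frame as a 1-channel 'heat map' image and predicts a pose."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (N_ROWS // 4) * (N_COLS // 4), N_CLASSES),
        )

    def forward(self, x):  # x has shape (batch, 1, N_ROWS, N_COLS)
        return self.net(x)

# Usage: normalize a raw frame to [0, 1], add batch/channel dims, and classify.
frame = torch.rand(1, 1, N_ROWS, N_COLS)  # stand-in for real, normalized sensor data
model = PoseClassifier()
predicted_pose = model(frame).argmax(dim=1)
```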

Analyzing activities

Once the model was trained, it could classify the user’s activity on the smart mat (walking, running, doing push-ups, etc.) with 99.6 percent accuracy and could recognize seven yoga poses with 98.7 percent accuracy.

They also used a circular knitting machine to create a form-fitted smart textile shoe with 96 pressure-sensing points spread across the entire 3D textile. They used the shoe to measure the pressure exerted on different parts of the foot when the wearer kicked a soccer ball.

The high accuracy of 3DKnITS could make them useful for applications in prosthetics, where precision is essential. A smart textile liner could measure the pressure a prosthetic limb places on the socket, enabling a prosthetist to easily see how well the device fits, Wicaksono says.

He and his colleagues are also exploring more creative applications. In collaboration with a sound designer and a contemporary dancer, they developed a smart textile carpet that drives musical notes and soundscapes based on the dancer’s steps, to explore the bidirectional relationship between music and choreography. This research was recently presented at the ACM Creativity and Cognition Conference.

“I’ve learned that interdisciplinary collaboration can create some really unique applications,” he says.

Now that the researchers have demonstrated the success of their fabrication technique, Wicaksono plans to refine the circuit and machine-learning model. Currently, the model must be calibrated to each individual before it can classify actions, which is a time-consuming process. Removing that calibration step would make 3DKnITS easier to use. The researchers also want to conduct tests on smart shoes outside the lab to see how environmental conditions like temperature and humidity affect sensor accuracy.

“It’s always amazing to see technology advance in ways that are so meaningful. It is incredible to think that the clothing we wear, an arm sleeve or a sock, can be created in ways that its three-dimensional structure can be used for sensing,” says Eric Berkson, assistant professor of orthopaedic surgery at Harvard Medical School and sports medicine orthopaedic surgeon at Massachusetts General Hospital, who was not involved in this research. “In the medical field, and in orthopedic sports medicine specifically, this technology provides the ability to better detect and classify motion and to recognize force distribution patterns in real-world (out of the laboratory) situations. This is the type of thinking that will enhance injury prevention and detection techniques and help evaluate and direct rehabilitation.”

This research was supported, in part, by the MIT Media Lab Consortium.

Startup lets doctors classify skin conditions with the snap of a picture

At the age of 22, when Susan Conover wanted to get a strange-looking mole checked out, she was told it would take three months to see a dermatologist. When the mole was finally removed and biopsied, doctors determined it was cancerous. At the time, no one could be sure the cancer hadn’t spread to other parts of her body — the difference between stage 2 and stage 3 or 4 melanoma.

Thankfully, the mole ended up being confined to one spot. But the experience launched Conover into the world of skin diseases and dermatology. After exploring those topics and possible technological solutions in MIT’s System Design and Management graduate program, Conover founded Piction Health.

Piction Health began as a mobile app that used artificial intelligence to recognize melanoma from images. Over time, however, Conover realized that other skin conditions make up the vast majority of cases physicians and dermatologists see. Today, Conover and her co-founder Pranav Kuber focus on helping physicians identify and manage the most common skin conditions — including rashes like eczema, acne, and shingles — and plan to partner with a company to help diagnose skin cancers down the line.

“All these other conditions are the ones that are often referred to dermatology, and dermatologists become frustrated because they’d prefer to be spending time on skin cancer cases or other conditions that need their help,” Conover says. “We realized we needed to pivot away from skin cancer in order to help skin cancer patients see the dermatologist faster.”

After primary care physicians take a photo of a patient’s skin condition, Piction’s app shows images of similar skin presentations. Piction also helps physicians differentiate among the conditions they most suspect, so they can make better care decisions for the patient.

Conover says Piction can reduce the time it takes physicians to evaluate a case by around 30 percent. It can also help physicians refer a patient to a dermatologist more quickly for special cases they’re not confident in managing. More broadly, Conover is focused on helping health organizations reduce costs related to unnecessary revisits, ineffective prescriptions, and unnecessary referrals.

So far, more than 50 physicians have used Piction’s product, and the company has established partnerships with several organizations, including a well-known defense organization that recently had two employees diagnosed with late-stage melanoma after they couldn’t see a dermatologist right away.

“A lot of people don’t realize that it’s really hard to see a dermatologist — it can take three to six months — and with the pandemic it’s never been a worse time to try to see a dermatologist,” Conover says.

Shocked into action

At the time of Conover’s melanoma diagnosis, she had recently earned a bachelor’s degree in mechanical engineering from the University of Texas at Austin. But she didn’t do a deep dive into dermatology until she needed a thesis topic for her master’s at MIT.

“It was just a really scary experience,” Conover says of her melanoma. “I consider myself very lucky because I learned at MIT that there’s a huge number of people with skin problems every year, two-thirds of those people go into primary care to get help, and about half of those cases are misdiagnosed because these providers don’t have as much training in dermatology.”

Conover first began exploring the idea of starting a company to diagnose melanoma during the Nuts and Bolts of Founding New Ventures course offered over MIT’s Independent Activities Period in 2015. She also went through the IDEAS Social Innovation Challenge and the MIT $100K Entrepreneurship Competition while building her system. After graduation, she spent a year at MIT as a Catalyst Fellow in the MIT linQ program, where she worked in the lab of Martha Gray, the J.W. Kieckhefer Professor of Health Sciences and Technology and a member of MIT’s Institute for Medical Engineering and Science (IMES).

Through MIT’s Venture Mentoring Service, Conover also went through the I-Corps program, where she continued to speak with stakeholders. Through those conversations, she learned that skin rashes like psoriasis, eczema, and rosacea account for the vast majority of skin problems seen by primary care physicians.

Meanwhile, although public health campaigns have focused on the importance of protection from the sun, public knowledge about conditions like shingles, which affects up to 1 percent of Americans each year, is severely lacking.

Although training a machine-learning model to recognize a wide variety of conditions would be more difficult than training a model to recognize melanoma alone, Conover’s small team decided that was the best path forward.

“We decided it’s better to just jump to making the full product, even though it sounded scary and huge: a product that identifies all different rashes across multiple body parts and skin tones and age groups,” Conover says.

The leap required Piction to establish data partnerships with hundreds of dermatologists in countries around the world during the pandemic. Conover says Piction now has the world’s largest dataset of rashes, containing over 1 million photos taken by dermatologists in 18 countries.

“We focused on getting photos of different skin tones, as many skin tones are underrepresented even in medical literature and teaching,” Conover says. “Providers don’t always learn how all the different skin tones can present conditions, so our representative database is a substantial statement about our commitment to health equity.”

Conover says Piction’s image database helps doctors evaluate conditions more accurately in primary care. After a provider has determined the most likely condition, Piction presents physicians with information on treatment options for each condition.

“This front-line primary care environment is the ideal place for our innovation because they care for patients with skin conditions every day,” Conover says.

Helping doctors at scale

Conover is constantly reminded of the need for her system from family and friends, who have taken to sending her pictures of their skin condition for advice. Recently, Conover’s friend developed shingles, a disease that can advance quickly and can cause blindness if it spreads to certain locations on the body. A doctor misdiagnosed the shingles on her forehead as a spider bite and prescribed the wrong medication. The shingles got worse and caused ear and scalp pain before the friend went to the emergency room and received the proper treatment.

“It was one of those moments where we thought, ‘If only physicians had the right tools,’” Conover says. “The PCP jumped to what she thought the problem was but didn’t build the full list of potential conditions and narrow from there.”

Piction will be launching several additional pilots this year. Down the line, Conover wants to add capabilities to identify and evaluate wounds and infectious diseases that are more common in other parts of the world, like leprosy. By partnering with nonprofit groups, the company also hopes to bring its solution to doctors in low-resource settings.

“This has potential to become a full diagnostic tool in the future,” Conover says. “I just don’t want anyone to feel the way I felt when I had my first diagnosis, and I want other people like me to be able to get the care they need at the right time and move on with their lives.”

Building explainability into the components of machine-learning models

Explanation methods that help users understand and trust machine-learning models often describe how much certain features used in the model contribute to its prediction. For example, if a model predicts a patient’s risk of developing cardiac disease, a physician might want to know how strongly the patient’s heart rate data influences that prediction.

But if those features are so complex or convoluted that the user can’t understand them, does the explanation method do any good?

MIT researchers are striving to improve the interpretability of features so decision makers will be more comfortable using the outputs of machine-learning models. Drawing on years of field work, they developed a taxonomy to help developers craft features that will be easier for their target audience to understand.

“We found that out in the real world, even though we were using state-of-the-art ways of explaining machine-learning models, there is still a lot of confusion stemming from the features, not from the model itself,” says Alexandra Zytek, an electrical engineering and computer science PhD student and lead author of a paper introducing the taxonomy.

To build the taxonomy, the researchers defined properties that make features interpretable for five types of users, from artificial intelligence experts to the people affected by a machine-learning model’s prediction. They also offer instructions for how model creators can transform features into formats that will be easier for a layperson to comprehend.

They hope their work will inspire model builders to consider using interpretable features from the beginning of the development process, rather than trying to work backward and focus on explainability after the fact.

MIT co-authors include Dongyu Liu, a postdoc; visiting professor Laure Berti-Équille, research director at IRD; and senior author Kalyan Veeramachaneni, principal research scientist in the Laboratory for Information and Decision Systems (LIDS) and leader of the Data to AI group. They are joined by Ignacio Arnaldo, a principal data scientist at Corelight. The research is published in the June edition of the Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining’s peer-reviewed Explorations Newsletter.

Real-world lessons

Features are input variables that are fed to machine-learning models; they are usually drawn from the columns in a dataset. Data scientists typically select and handcraft features for the model, and they mainly focus on ensuring features are developed to improve model accuracy, not on whether a decision maker can understand them, Veeramachaneni explains.

For several years, he and his team have worked with decision makers to identify machine-learning usability challenges. These domain experts, most of whom lack machine-learning knowledge, often don’t trust models because they don’t understand the features that influence predictions.

For one project, they partnered with clinicians in a hospital ICU who used machine learning to predict the risk that a patient will experience complications after cardiac surgery. Some features were presented as aggregated values, like the trend of a patient’s heart rate over time. While features coded this way were “model ready” (the model could process the data), clinicians didn’t understand how they were computed. They would rather see how these aggregated features relate to the original values, so they could identify anomalies in a patient’s heart rate, Liu says.
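
The sketch below, with made-up numbers and pandas, illustrates the gap the clinicians described: a single “model ready” trend feature is easy for a model to consume, but showing it next to the raw readings it summarizes is what lets a clinician spot an anomalous value.

```python
import pandas as pd

# Illustrative heart-rate samples over 25 minutes (not real patient data).
hr = pd.DataFrame({
    "time_min": [0, 5, 10, 15, 20, 25],
    "heart_rate": [82, 85, 91, 98, 104, 111],
})

# Model-ready aggregate: net change in heart rate over the window.
hr_trend = int(hr["heart_rate"].iloc[-1] - hr["heart_rate"].iloc[0])

# Interpretable presentation: the raw series alongside the aggregate, so a
# suspicious single reading is not hidden inside the summary value.
print(hr.to_string(index=False))
print(f"Heart-rate trend over window: {hr_trend:+d} bpm")
```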

By contrast, a group of learning scientists preferred features that were aggregated. Instead of having a feature like “number of posts a student made on discussion forums,” they would rather have related features grouped together and labeled with terms they understood, like “participation.”

“With interpretability, one size doesn’t fit all. When you go from area to area, there are different needs. And interpretability itself has many levels,” Veeramachaneni says.

The idea that one size doesn’t fit all is key to the researchers’ taxonomy. They define properties that can make features more or less interpretable for different decision makers and outline which properties are likely most important to specific users.

For instance, machine-learning developers might focus on having features that are compatible with the model and predictive, meaning they are expected to improve the model’s performance.

On the other hand, decision makers with no machine-learning experience might be better served by features that are human-worded, meaning they are described in a way that is natural for users, and understandable, meaning they refer to real-world metrics users can reason about.

“The taxonomy says, if you are making interpretable features, to what level are they interpretable? You may not need all levels, depending on the type of domain experts you are working with,” Zytek says.

Putting interpretability first

The researchers also outline feature engineering techniques a developer can employ to make features more interpretable for a specific audience.

Feature engineering is a process in which data scientists transform data into a format machine-learning models can process, using techniques like aggregating data or normalizing values. Most models also can’t process categorical data unless it is converted to a numerical code. These transformations are often nearly impossible for laypeople to unpack.
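
A minimal sketch of these transformations, using pandas on a hypothetical table (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [2, 9, 15, 34, 67],
    "admission_type": ["emergency", "elective", "emergency", "urgent", "elective"],
})

# Normalize a numeric column to zero mean and unit variance.
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Convert a categorical column into numeric indicator columns (one-hot encoding).
df = pd.get_dummies(df, columns=["admission_type"])

# Columns like age_z and admission_type_elective are easy for a model to use
# but hard for a layperson to read back.
print(df.head())
```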

Creating interpretable features might involve undoing some of that encoding, Zytek says. For instance, a common feature engineering technique organizes ages into spans that each contain the same number of years. To make these features more interpretable, one could instead group age ranges using human terms, like infant, toddler, child, and teen. Or, rather than using a transformed feature like average pulse rate, an interpretable feature might simply be the actual pulse rate data, Liu adds.
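
And a sketch of that “undoing” step on the same hypothetical ages, re-expressing the numeric column in human terms with pandas.cut (the bin edges and labels are illustrative, not prescribed by the paper):

```python
import pandas as pd

ages = pd.Series([2, 9, 15, 34, 67], name="age")

# Replace numeric ages with human-readable groups.
age_group = pd.cut(
    ages,
    bins=[0, 3, 5, 12, 19, 120],
    labels=["infant", "toddler", "child", "teen", "adult"],
)
print(pd.concat([ages, age_group.rename("age_group")], axis=1))
```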

“In a lot of domains, the tradeoff between interpretable features and model accuracy is actually very small. When we were working with child welfare screeners, for example, we retrained the model using only features that met our definitions for interpretability, and the performance decrease was almost negligible,” Zytek says.

Building off this work, the researchers are developing a system that enables a model developer to handle complicated feature transformations in a more efficient manner, to create human-centered explanations for machine-learning models. This new system will also convert algorithms designed to explain model-ready datasets into formats that can be understood by decision makers.
