Amid excitement over everything from ChatGPT to Nobel Prize-winning chemistry research that uses artificial intelligence to predict protein structures, academics Arvind Narayanan and Sayash Kapoor have established a reputation for throwing cold water on at least some of the claims about how AI will transform people's lives.
Narayanan, a professor of computer science at Princeton University, went viral in 2019 for a talk he gave at MIT on how to recognize overblown declarations about AI's capabilities. Now he's teamed up with his Ph.D. student Kapoor to publish the book "AI Snake Oil," which came out Sept. 24. While the book encourages skepticism about many AI claims, the pair consider themselves cautiously optimistic about the future of the technology -- including as it pertains to health care.
"It's easy to look at all the flaws and misuses of chatbots and conclude that the world has gone mad for being so gaga about a technology that is so failure prone," they write in the book. "But that conclusion would be too simplistic."
STAT spoke with Kapoor about times when AI has not lived up to its hype and its future potential in health care. This conversation has been lightly edited for brevity and clarity.
How do you view AI's use specifically in the health care sector? Are you more skeptical of the claims that are made there compared to other areas that you talk about in the book?
I think one of the main takeaways of the book, and the overarching theme, is that AI is an umbrella term that describes completely disconnected technologies. Some types of AI have made extremely rapid progress in the last decade, most notably generative AI -- public-facing applications like ChatGPT and other chatbots, as well as text-to-image models like Stable Diffusion, Midjourney, and DALL-E. But in medicine, we've also seen applications like AlphaFold, which is now being used to predict protein structures. I think those applications are likely to have equal, if not more, impact when it comes to the health care domain.
When it comes to text-based generative AI models, we are seeing companies building technology for health care -- things like Abridge, which transcribes clinical conversations into notes. In health care, I do think a lot of the positive impact of AI will come from generative AI.
On the other hand, we also talk about predictive AI in the book, which refers to AI that is used to make decisions about individuals based on predictions about their future. In many cases, a large amount of snake oil is concentrated in the predictive AI sector.
So one example of a predictive AI tool is the Impact Pro algorithm by Optum, which was used in hospitals across the U.S. to predict which patients were most likely to need the most health care going forward. The algorithm was used to prioritize people -- based on its predictions, hospitals selected who should receive more health care and who should be placed on the priority list.
Now, in 2019, Ziad Obermeyer and others ran an algorithmic bias study of the Optum tool and found that the algorithm actually had a lot of racial disparities. It was much more likely to recognize white patients as higher risk and recommend they get access to this better health care than Black patients. And the reason was that the algorithm was actually predicting which patients would spend the most. It was predicting the cost of health care, not who was most at risk or who needed the health care the most. Because less money has historically been spent on Black patients with the same level of need, the algorithm systematically underestimated how much care they needed.
What this also shows us is that when it comes to predictive AI, there are a lot of subtle issues that can go wrong in ways that are essentially silent. There is no easy way to diagnose this type of failure unless you have access to data from a large number of people. And so in the chapter on predictive AI, we go over a number of these reasons [why predictive AI can fail]. For Optum, it was the choice of the target variable -- what the algorithm was predicting.
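To make the target-variable problem concrete, here is a minimal sketch -- synthetic data and illustrative numbers, not Optum's actual model -- of how training a model to predict cost as a proxy for health need can reproduce this kind of disparity:

```python
# Synthetic illustration of the target-variable problem (not Optum's actual model).
# Assumption: at equal health need, one group historically generates lower spending,
# so a model trained to predict cost will underestimate that group's need.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
need = rng.normal(size=n)              # true (unobserved) health need
group = rng.integers(0, 2, size=n)     # 0 = group A, 1 = group B

prior_cost = need - 0.5 * group + rng.normal(scale=0.5, size=n)
future_cost = need - 0.5 * group + rng.normal(scale=0.5, size=n)

# Train on the proxy target (future cost), using utilization-style features only.
X = prior_cost.reshape(-1, 1)
risk_score = LinearRegression().fit(X, future_cost).predict(X)

# Patients above the threshold are prioritized for extra care.
threshold = np.quantile(risk_score, 0.9)
for g in (0, 1):
    m = group == g
    flagged = risk_score[m] >= threshold
    print(f"group {g}: flagged {flagged.mean():.1%}, "
          f"mean true need among flagged {need[m][flagged].mean():+.2f}")
```

In this toy setup, the group that historically spends less is both flagged less often and, when flagged, sicker on average -- essentially the pattern Obermeyer and colleagues documented.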
But in other cases, we also point out that there are systematic reasons why interventions are hard to make based on predictions. One example we share is from a group of researchers in the 1990s who tried to build an algorithm for when a patient comes to the hospital with symptoms of pneumonia, to predict whether they should be admitted overnight or, if they were a low-risk patient, whether they should be released immediately. And if a patient had asthma, the algorithm would recommend they be released immediately much more often.
The reason this happened was that, under the status quo, when a patient came in with pneumonia and also had asthma, health care workers would obviously recognize that this patient needed more care and attention, and so they would send them to the ICU. Because of that, asthmatic patients actually had a lower risk of developing serious complications when they came in with symptoms of pneumonia -- but that was precisely because they were sent to the ICU.
And so had the doctors in the 1990s adopted this algorithm, they would have discharged asthmatic patients who came in with symptoms of pneumonia without admitting them, and that would have been disastrous.
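A minimal sketch of this feedback loop -- synthetic data, not the original 1990s study -- shows how a model fit to observed outcomes can learn that a genuine risk factor looks "protective" simply because the at-risk patients already received extra care:

```python
# Synthetic illustration of the pneumonia/asthma pitfall (not the original study).
# Assumptions: asthma raises the underlying risk of complications, but under the
# status quo asthmatic patients are sent to the ICU, and ICU care sharply lowers it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000
asthma = rng.integers(0, 2, size=n)

untreated_risk = 0.05 + 0.15 * asthma                              # risk with no extra care
observed_risk = np.where(asthma == 1, untreated_risk * 0.2, untreated_risk)
complication = rng.random(n) < observed_risk                       # outcomes under current practice

# A model fit to these observed outcomes learns a negative coefficient for asthma:
# asthma appears to lower risk, only because asthmatic patients were treated aggressively.
model = LogisticRegression().fit(asthma.reshape(-1, 1), complication)
print("asthma coefficient:", round(model.coef_[0][0], 2))          # prints a negative number
```

Acting on that prediction -- discharging the asthmatic patients -- would remove the very ICU care that made their observed risk low in the first place.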
How do you set standards for these tools?
For example, a recent FDA clarification of a rule basically said that AI tools used in medicine count as medical devices as well. So that's a positive piece of news.
I do think, however, that we currently lack a lot of the standards when it comes to developing predictive AI. One example is that, unlike regular medical technology, AI systems are sensitive to the data distribution of the setting in which they're deployed. So it's not enough to have a one-size-fits-all tool that is developed once and then used in hospitals over time and across the country. What we really need are domain-specific interventions -- a tool that's fine-tuned to work well within a specific hospital system, or even a specific hospital.
I think that's important because, over time and from one geographic location to another, the patterns of health care and of disease change, as do minor things like what types of devices are in use within a particular hospital. And unlike traditional medical devices, machine-learning algorithms are really sensitive to these small changes.
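A small sketch of this sensitivity, using made-up data in which the recorded signals that track the outcome differ between two hospitals, shows how a model built at one site can lose most of its accuracy at another and recover once it is refit locally:

```python
# Synthetic illustration of site-to-site distribution shift (purely illustrative).
# Assumption: which recorded features actually track the outcome differs by hospital
# (different devices, workflows, documentation practices).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def make_site(n, weights, seed):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = rng.random(n) < 1 / (1 + np.exp(-(X @ weights - 1.0)))
    return X, y

X_a, y_a = make_site(20_000, np.array([2.0, 0.1]), seed=0)   # feature 0 predictive at hospital A
X_b, y_b = make_site(20_000, np.array([0.1, 2.0]), seed=1)   # feature 1 predictive at hospital B

model_a = LogisticRegression().fit(X_a, y_a)
print("built at A, tested at A:", round(roc_auc_score(y_a, model_a.predict_proba(X_a)[:, 1]), 2))
print("built at A, tested at B:", round(roc_auc_score(y_b, model_a.predict_proba(X_b)[:, 1]), 2))

model_b = LogisticRegression().fit(X_b, y_b)                  # refit on local data
print("refit at B, tested at B:", round(roc_auc_score(y_b, model_b.predict_proba(X_b)[:, 1]), 2))
```

The numbers are contrived, but the pattern -- strong on the development site's data, much weaker elsewhere, restored by local retraining -- is what site-specific fine-tuning is meant to guard against.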
This is something we've also seen play out in real-world examples. One example is Epic's sepsis prediction algorithm, which folks at STAT, including Casey Ross, reported on very extensively a couple of years ago. This was an algorithm that Epic sold as a one-size-fits-all sepsis prediction tool, and it was deployed in 2016. It wasn't until 2021, when a set of experts from the University of Michigan looked at the algorithm's results, that they found it did not work as well as the company had claimed. And then it took Epic another year to change how the algorithm was deployed, so that every hospital had to modify it or adapt it, or train the model on its own data.
Going forward, this is the type of insight that can help inform practices for improving predictive AI if we are using it in medical settings. The other thing, of course, is that once we start treating medical AI systems as medical devices or as medical interventions, we also need to evaluate them just as we evaluate other medical interventions. So prospective studies that evaluate how well these tools work should be complemented with post-deployment assessments of medical AI. Both of these things are necessary in order to get to a place where AI can actually be useful for prediction in clinically relevant settings.
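As a hypothetical illustration of what post-deployment assessment can look like in practice (synthetic data, simplified setup), a hospital could periodically re-score a deployed model on recent patients once their outcomes are known and flag when performance slips below an agreed floor:

```python
# Hypothetical post-deployment monitoring loop (synthetic data, simplified).
# The model is built once, then audited on each new batch of patients whose
# outcomes are now known; drops below an agreed AUC floor trigger a review.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def patient_batch(n, drift):
    """Synthetic batch; `drift` weakens the feature-outcome link over time."""
    X = rng.normal(size=(n, 2))
    y = rng.random(n) < 1 / (1 + np.exp(-((1.5 - drift) * X[:, 0] - 1.0)))
    return X, y

X0, y0 = patient_batch(10_000, drift=0.0)      # data used to build the model
model = LogisticRegression().fit(X0, y0)

AUC_FLOOR = 0.70
for month, drift in enumerate([0.0, 0.3, 0.6, 0.9, 1.2]):
    X, y = patient_batch(5_000, drift)
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    flag = "  <- below floor, review/retrain" if auc < AUC_FLOOR else ""
    print(f"month {month}: AUC {auc:.2f}{flag}")
```

Real monitoring would also track calibration and subgroup performance, but the basic idea is the same: the evaluation doesn't stop at deployment.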
Where do you see AI in health care going in the next five or 10 years?
So I think generative AI will continue to see widespread adoption. We are in a period where people are experimenting with possible uses of AI, and I do think a lot of these uses might not have matured enough to be deployed in the real world. My main worry is that we end up in a world where we rush to adopt generative AI across medical uses without proper evaluation schemes.
That said, I am broadly optimistic about the slightly longer term. We are already seeing early signs of generative AI being adopted to increase efficiency. On the one hand, for example, by summarizing doctors' notes or helping transcribe them, and on the other, by pushing the frontiers of what's possible, for example through semi-automated drug discovery.
I think both of those areas will continue to see a lot more adoption, and I think a lot of it will have many positive effects, so long as we develop ways to evaluate these models and not fool ourselves into thinking that the models work better than they do. As someone recently put it to me, language models and generative AI always work better at first glance than they do in the longer run, and they're always more impressive in a demo than they are in the real world.