In early May, Salt Lake City-based Recursion Pharmaceuticals announced that it was releasing an open-source, 300-gigabyte dataset of genetically altered cell images to accelerate the use of artificial intelligence in drug discovery; the RxRx1 dataset was made publicly available on June 27.
After years of watching major players in the fields of both technology and pharmaceuticals keep their cards close to their chests, it was somewhat surprising to see a company at the intersection of both industries show even a small percentage of its hand (those 300 gigabytes make up less than half a percent of Recursion's weekly output).
However, as advanced technologies such as machine learning and deep learning are increasingly incorporated into medicine, it is crucial that data-sharing becomes the norm, rather than the exception, according to Ron Alfa, MD, PhD, senior vice president of translational discovery at Recursion. In fact, in just the few weeks since Recursion's announcement, Google, Penn Medicine and St. Jude Children's Research Hospital have all released open-source datasets or tools for analyzing them.
"We're moving to a point where — and we're not quite there yet — there are lots of brilliant people in this space using software and machine learning for drug discovery, but the challenges for lots of this work are questions of what is the data, how do we get it and does it even exist — and if not, how do we generate it?" Dr. Alfa said. "So we're seeing, increasingly, as new companies crop up in the space, they face the challenge of having to spend time building that dataset."
Here, Dr. Alfa discusses why data-sharing is necessary for progress to be made in drug discovery, and how biotech companies can make the shift from viewing datasets as their intellectual property to understanding that the real IP is how they actually use that data.
Editor's note: Responses have been lightly edited for clarity and length.
Question: Why did Recursion choose to release the dataset for free?
Dr. Ron Alfa: This was the right time for the dataset because we're at a point where people are starting to appreciate that the big data that is available and accessible is most likely not adequate for answering the questions we need to answer to move drugs to the clinic. Questions are starting to revolve around how we generate the appropriate data and which companies are actually working to generate it. We've been generating this massive dataset for five years now, and we think it's an important part of the conversation to release some of the data and give the industry a sense of the type of data that's going to be crucial for drug discovery.
The big challenges in machine learning and biology, very broadly, are the fact that biology is really messy, and the issue of reproducibility. When you look at public genomics datasets, how reproducible are they? From site to site, are the signals we're detecting with machine learning representative of the biology we think they represent, or just the noise of experimental variability? So we wanted to help the community begin to deal with this problem.
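One common diagnostic for the experimental-variability problem Dr. Alfa describes is to test whether a model can predict which batch or site a sample came from using only its measured features; if it can, a downstream model may be learning technical noise rather than biology. The sketch below illustrates the idea on synthetic data with scikit-learn. It is a generic, hypothetical example, not Recursion's pipeline or the RxRx1 benchmark.

```python
# Batch-effect sanity check on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_batch, n_features, n_batches = 100, 50, 4

# Each batch gets its own systematic offset, a stand-in for
# site-to-site experimental variability.
batch_offsets = rng.normal(0.0, 1.0, size=(n_batches, n_features))
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_batch, n_features)) + batch_offsets[b]
    for b in range(n_batches)
])
y_batch = np.repeat(np.arange(n_batches), n_per_batch)

# Chance accuracy is 1/n_batches = 0.25; scores far above chance mean
# the features encode batch identity, i.e. technical noise that a
# downstream "biology" model could latch onto instead of real signal.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y_batch, cv=5)
print(f"batch-prediction accuracy: {scores.mean():.2f} (chance = 0.25)")
```

On data like this, the classifier predicts batch identity almost perfectly, which is exactly the failure mode to screen for before trusting a model's "biological" signal.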
Drug discovery takes a really long time and it's really expensive, and the truth of the matter is most things fail in the clinic; that's part of the reason why it's so expensive. The question Recursion is asking, at the highest level, is this: every industry has benefited from the use of technology, specifically machine learning and artificial intelligence methods, so can we apply those same techniques to drug discovery and biology to improve outcomes and reduce costs for the industry? To answer that, we're building massive datasets to decode biology. Most companies use datasets that are publicly available, but we're using automation and wet labs to really control the biology and be able to generate these large datasets.
Q: The tech world has traditionally been very private about its intellectual property; are those attitudes changing now that more tech companies are moving into healthcare?
RA: As we enter an era where it's increasingly obvious that machine learning and large datasets are going to be incredibly impactful in drug discovery, are companies starting to rethink their practices for data sharing? We're seeing that that's true, because at the end of the day what we're appreciating is that the more data you're able to deploy a machine learning approach on, the better the model you'll end up with.
Still, the industry is driven by information. We're always going to have to protect something because it's incredibly expensive to do drug discovery, research and development and clinical studies in humans. We need to ensure that that process is protected and continues to earn money, but also think about ways that information can be exchanged in order to accelerate how we develop these algorithms and the tech that's going to allow the industry to be more efficient and bring more drugs to the clinic.
Q: What is the motivator behind those shifting attitudes? Is it just a matter of creating the best possible technology or is it more of an ethical push to improve outcomes?
RA: From our standpoint, the company was founded on this idea that this is the 21st century and there is an incredible amount of tech out there, with machine learning able to do just about everything and new drugs approved every year, yet there's an incredible unmet need for patients. What can we do from a technology standpoint to accelerate our pace in the industry to meet those needs for patients?
To achieve the goals that we want to achieve — that is, decoding biology and bringing new drugs to the clinic — we need to generate data, build it at scale and build it in a certain way. We have to think very carefully about what kinds of data we need and how to generate it.
From a moral, ethical standpoint, we should definitely be using all the technology that we can to develop new drugs. Three years ago, there were a lot of questions around whether we should use machine learning and algorithms; the answer is yes, we should absolutely use machine learning and technology to find new drugs, to help people.