AI Transforming Real-World Clinical Development
By Deborah Borfitz
February 21, 2023 | Artificial intelligence (AI) and machine learning (ML) solutions bringing tangible improvements to clinical trial conduct took center stage at the recent Summit for Clinical Ops Executives (SCOPE), where a succession of companies—including Pfizer, AbbVie, and IQVIA—took to the podium to share their use cases and road to success. “This is not science fiction,” says West Barnes, senior director, product, in IQVIA’s applied science data center. “Disruption is coming to the clinical research industry much faster than anybody expected it.”
Pfizer now has an AI-powered clinical data management system, with about 600 users, which has been deployed across more than 100 clinical studies. “We started with ROI [return on investment] as our end goal,” focusing on quick wins that reduced clinical trial cycle time, says Prasanna Rao, the company’s global head of AI and ML and formerly a Watson solution architect at IBM.
Under the leadership of Chief Scientific Officer Thomas Hudson, M.D., AbbVie embraces the idea that sharing its deep learning capabilities is a “moral imperative,” says Brian Martin, head of AI in R&D information research. Martin stresses the importance of starting with individual use cases, building those competencies into the workflow, and measuring ROI with business rather than data science metrics.
“We’re at another inflection point in human history,” says Barnes, referencing the popularity of the deep learning model DALL-E, which is winning competitions in the art community, and ChatGPT, which can write college application essays. And the “avalanche of data coming at [clinical ops] people”—from wearables, electronic clinical outcome assessments, eConsent, and electronic medical records as well as electronic data capture (EDC) and labs—is likely now beyond the realm of human capacity to effectively and efficiently manage, adds Wendy Morahan, senior director, product, for clinical data analytics at IQVIA.
Smart Approaches
Pfizer’s Rao presents a trio of “smart” approaches to managing medical coding, data query, and protocol deviations. To speed up the workflow for medical coders who would otherwise have to consult MedDRA and WHO dictionaries for answers, AI has been used to generate a top-five list of suggested adverse event (AE) terms and drug names, he says.
The medical coding ML models were pretrained by a vendor partner on PubMed text, enabling “machine guessing” with a remarkably high level of accuracy, Rao continues. That cut in half the elapsed time (an average of 466 minutes) between when a code is captured and when it is recorded by the clinical ops team, and it helped enable record-breaking compression of total cycle time from first subject/first visit to submission on Pfizer’s COVID vaccine study.
When used to code the AE term “neck spasm” found in the clinical data, for example, the machine made three guesses, he shares. “Cervical spasm” was the rank 1 guess, with a 90% confidence score (“what the human ultimately approved”); “neck cramp” was rank 2, at 88%; and “muscle spasm” was rank 3, at 82%. “We just let the model learn from lots and lots of actual data in PubMed.”
The smart medical coding system can understand other variables as well, says Rao, including clinical abbreviations such as L4 and L5 (vertebrae in the lumbar spine where a bulging disc might occur) and the type of cancer a person might have (e.g., invasive lobular breast carcinoma). It is also very good at catching typos, recognizing “hypertwinsion” as a misspelling of hypertension with 96% confidence, such that humans could be taken out of the loop entirely.
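Pfizer has not published the model’s internals, but the general shape of a top-ranked suggestion list with confidence scores can be sketched in a few lines of Python. In the hypothetical example below, fuzzy string similarity stands in for the vendor’s PubMed-pretrained model, and a tiny term list stands in for the MedDRA dictionary; it also shows how a typo such as “hypertwinsion” still surfaces the intended term.

```python
from difflib import SequenceMatcher

# Placeholder term list; a real system would use the full MedDRA dictionary
# and a model pretrained on biomedical text rather than string similarity.
PREFERRED_TERMS = [
    "cervical spasm", "neck cramp", "muscle spasm",
    "hypertension", "hypotension", "migraine",
]

def suggest_terms(verbatim: str, top_k: int = 3):
    """Return the top-k candidate terms with rough confidence scores."""
    verbatim = verbatim.lower().strip()
    scored = [
        (term, SequenceMatcher(None, verbatim, term).ratio())
        for term in PREFERRED_TERMS
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# A misspelled verbatim entry still surfaces the intended term.
for term, score in suggest_terms("hypertwinsion"):
    print(f"{term}: {score:.0%}")
```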
“We do about 120,000 codes per year,” Rao says. “All of last year we did an average of 1,000 coding terms through the smart medical code, and we have proven easily that this will save a lot of time between data capture and when the coding happens.”
For data managers, Pfizer’s flagship AI product is a smart data query system that detects data discrepancies in the EDC system and can send query texts back to sites—but it also has human-in-the-loop feedback to retrain the underlying algorithms when they make a mistake and improve their future performance, says Rao. For unstructured data, the company is still piloting deep learning language models, including the generative pre-trained transformer (GPT), to test their ability to read the information and construct sentences.
The smart data query system can pick up on terms that can be flagged as a discrepancy, for example. These might be anything from signs and symptoms to a diagnosis or disease names, Rao says. “We do closed-loop feedback and retraining for terms that were missed and manage both the false-positives and the false-negatives.” Data integrity is improved by keeping much of the focus on “critical data queries,” such as an AE and concomitant medication that are not expected to co-occur.
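The article does not describe the query system’s internals. As a hypothetical sketch of the closed-loop pattern, the snippet below flags an adverse event and concomitant medication that are not expected to co-occur, drafts query text for the site, and logs the reviewer’s verdict so false positives and false negatives can feed retraining; the lookup table and feedback store are invented placeholders.

```python
from dataclasses import dataclass, field

# Invented placeholder: AE/medication pairs not expected to co-occur.
UNEXPECTED_PAIRS = {("malaria", "acetaminophen")}

@dataclass
class SmartQuerySketch:
    feedback_log: list = field(default_factory=list)

    def check(self, adverse_event: str, con_med: str):
        """Flag a potential discrepancy and draft query text for the site."""
        if (adverse_event.lower(), con_med.lower()) in UNEXPECTED_PAIRS:
            return (f"Please confirm: {con_med} is recorded as treatment "
                    f"for the adverse event '{adverse_event}'.")
        return None

    def record_feedback(self, adverse_event, con_med, flagged, reviewer_agrees):
        """Log reviewer verdicts so false positives/negatives can drive retraining."""
        self.feedback_log.append({
            "ae": adverse_event, "med": con_med,
            "flagged": flagged, "reviewer_agrees": reviewer_agrees,
        })

engine = SmartQuerySketch()
query = engine.check("malaria", "acetaminophen")
if query:
    print(query)
    # The data manager still decides whether the machine got it right.
    engine.record_feedback("malaria", "acetaminophen", flagged=True, reviewer_agrees=True)
```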
The smart protocol deviation tool was generated from the smart data query system, he explains. For the vaccine study where immunosuppressants were prohibited, for example, the model had to learn various medication names and classes to appropriately flag the right agents as a protocol deviation.
Both pattern recognition and clinical inference are being applied here, he notes. “The model has been trained on lots and lots of [specific] patterns... [and] drug dictionaries and openFDA [drug product] labels.” Machines can learn to recognize that someone taking Tylenol for an AE called malaria is not logical, he cites as an example, but can’t be programmed around the literally hundreds of thousands of generic medications that might treat multiple conditions. “That’s where the clinical inference comes in.”
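As a hedged illustration of the pattern-recognition half of that approach, the sketch below maps reported medications to drug classes through a stand-in dictionary (a production system would draw on full drug dictionaries and openFDA labels) and flags any class the protocol prohibits; the clinical-inference layer Rao describes is not reproduced here.

```python
# Stand-in for drug dictionaries and openFDA label data.
DRUG_CLASSES = {
    "tacrolimus": "immunosuppressant",
    "cyclosporine": "immunosuppressant",
    "acetaminophen": "analgesic",
}

# Classes prohibited by the (hypothetical) protocol, e.g., for a vaccine study.
PROHIBITED_CLASSES = {"immunosuppressant"}

def flag_protocol_deviations(reported_meds):
    """Return medications whose drug class the protocol prohibits."""
    deviations = []
    for med in reported_meds:
        drug_class = DRUG_CLASSES.get(med.lower())
        if drug_class in PROHIBITED_CLASSES:
            deviations.append((med, drug_class))
    return deviations

print(flag_protocol_deviations(["Tacrolimus", "Acetaminophen"]))
# [('Tacrolimus', 'immunosuppressant')]
```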
Measuring Success
The journey with AI always begins with ensuring the availability of the quality and quantity of data needed to train the models, says Rao. For the design of the models themselves, Pfizer relies on external help from many different vendors.
“All of this is based upon the real investment we make in terms of our subject matter experts... [who] produce the ground truth” and later use the models, he says. “The models are learning from human behavior.”
Such models need to be embedded seamlessly in operations and be easy to use, he stresses, pointing to the three ranked choices offered by the medical coding system that staff have come to trust. Previously, getting the work done was a 10-step process involving a search through the hierarchical structure of the MedDRA dictionary.
The smart query system was introduced as small, proof-of-concept projects and ROI was measured based on the time compression, says Rao. The projects for smart medical coding and protocol deviation were outgrowths of success with the initial data management solution.
Achieving an ROI required disrupting established processes to get people to trust machines and models to give them recommendations, says Rao. Picking use cases that reduce cycle time, however incrementally, has helped build excitement for more widespread adoption of AI solutions across every pocket of the R&D organization—more than 25 AI use cases have been undertaken, including for compound drug discovery and submission-related quality control checks. It’s important to keep implementations under a year, he advises, to maintain everyone’s interest.
At Pfizer, projects are being deployed across the portfolio and delivery is focused on automation with a minimal number of clicks, he says. Change management has been particularly important. While users will reject a model that will take away their jobs, they will embrace one that instead allows them to focus on the core work of ensuring data integrity and eliminates redundant manual tasks. Exposing confidence scores has been an important way of gaining the trust of users, he later adds.
‘Game-Changing’ Toolset
AI tools are likewise applied across all of R&D at AbbVie, Martin reports. What’s new here is experimentation with natural language generation, which is expected to be a game-changing solution once it moves to the mainstream.
Martin introduces a conceptual framework for operational clinical AI language tools (FOCAL) and makes the case for why, at least in the clinical operations space, “GPT is extremely important but... ChatGPT is not.” All six components of FOCAL (identification, transformation, generation, summarization, validation/valuation, and orchestration tools) either have been or are being developed and are or will be used at AbbVie, and can be deployed in any application space where users indicate those capabilities are needed.
Martin’s presentation is on generating first-draft clinical study reports (CSRs), part of the long protocol-to-publication process that involves a lot of carryover language. “We are probably less than six months away from being able to automatically generate a CSR for a study ... in minutes internally,” he says, and it represents potentially huge savings since first drafts of those reports are currently contracted out.
The identification tool, Cortellis Search, was the first component built, says Martin. It allows users, simply by entering some search terms, to quickly find specific phrases in a document tied to their question and insert them into a CSR. Additional functions include the ability to transcribe PDFs as well as videos, making them searchable.
Another tool, the Structured Content Retrieval and Authoring Platform (SCRAP), will bring in tables from a laboratory information management system or clinical data platform in a pre-structured, pre-templated form—and then “refresh that data as data in the underlying system comes in without having to reenter it into the [CSR],” Martin continues. It also allows users to take a template of a document and just pull in sections specific to a therapeutic area or type of context. “No one has to go find that document, find that section, highlight it, copy it, paste it, [and] drag it over; they can put in the study number, select the section, hit a button, ... [and SCRAP] pulls it out of the repository.”
The transformation tool, one of the biggest components of FOCAL, offers machine translation into foreign languages, and on biomedical literature it outperforms the industry-leading platforms Google Translate and Microsoft Bing in accuracy, says Martin. Behind it is a framework giving it the memory to support validation, so that “once a sentence gets sent out and validated in its translation [by a human], that sentence never needs to be sent out again, nor does any sentence that is framed as an exact match to that.”
Stored, validated translations can be carried through to a document and entire documents can be translated, he adds. The current model can interpret 140 different languages to varying levels of accuracy and precision. Language depth was made possible by the “interlingua approach” involving the creation of a shared representation of a sentence's meaning that transcends the language in which it is written. “It allows me to train Spanish to English, German to English, Portuguese to English and then do Spanish to Portuguese without having to have Spanish-Portuguese pairs of documents.”
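The validation memory Martin describes, in which a human-approved translation is reused for any exact-match sentence, can be illustrated with a simple cache placed in front of a machine-translation call. In this hypothetical sketch, machine_translate is a placeholder for whatever translation backend is in use.

```python
class TranslationMemory:
    """Reuse human-validated translations for exact-match source sentences."""

    def __init__(self, machine_translate):
        self.machine_translate = machine_translate  # placeholder MT backend
        self.validated = {}  # (source_lang, target_lang, sentence) -> translation

    def translate(self, sentence, source_lang, target_lang):
        key = (source_lang, target_lang, sentence.strip())
        if key in self.validated:
            return self.validated[key], "memory"  # never re-sent for translation
        return self.machine_translate(sentence, source_lang, target_lang), "machine"

    def validate(self, sentence, source_lang, target_lang, approved):
        """Store a human-approved translation so the sentence never goes out again."""
        self.validated[(source_lang, target_lang, sentence.strip())] = approved

tm = TranslationMemory(lambda s, src, tgt: f"<machine translation of '{s}'>")
tm.validate("Se administró la dosis.", "es", "en", "The dose was administered.")
print(tm.translate("Se administró la dosis.", "es", "en"))  # served from memory
```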
When changing tense, as is necessary when moving text from a protocol to a CSR, the tool knows where the corrections are needed, offers the proposed future tense, and responds to search/replace commands by users on a phrase-by-phrase basis, enthuses Martin.
For the generation component of FOCAL, Martin’s team started with a pre-built, open-source GPT-J-6B model (available through Hugging Face) with six billion parameters, far smaller than the 175 billion parameters of the popular GPT-3 text-generating model. “Yet when we train it on CSR text, we can use that custom language model to auto-complete a phrase... [and] it is contextually aware of the section of the document that it is working in,” as reflected in the output text.
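AbbVie’s fine-tuning setup is not detailed in the talk. A minimal sketch of loading the public GPT-J-6B checkpoint from the Hugging Face hub and generating a completion for a CSR-style prompt might look like the following; the prompt is invented, and the CSR fine-tuning step is assumed to have happened separately.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The public EleutherAI checkpoint; a CSR-fine-tuned variant would be
# loaded the same way from a local path.
MODEL_ID = "EleutherAI/gpt-j-6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Invented CSR-style prompt for the auto-complete use case.
prompt = "The primary objective of this study was to evaluate"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=40, do_sample=True, top_p=0.9, temperature=0.7
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```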
Summarization, which overlaps with generation, can both extract and abstract text, he explains. With extractive summarization, the tool uses the open-source SBERT sentence transformer to pull key passages out of a document and then provide attribution back to that source document. SBERT is trained on CSR paragraphs and comparable article paragraphs, allowing the team to build a “maximum relevance” model for identifying potential key paragraphs for extraction.
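A stripped-down version of that extractive step, using the open-source sentence-transformers library to embed paragraphs and rank them by relevance, could look like this. The relevance criterion here (similarity to the mean paragraph embedding) and the general-purpose checkpoint are simplifications of the CSR-trained “maximum relevance” model Martin describes.

```python
from sentence_transformers import SentenceTransformer, util

# General-purpose SBERT checkpoint; AbbVie's model is trained on CSR and
# publication paragraphs, which this sketch does not reproduce.
model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_key_paragraphs(paragraphs, top_k=3):
    """Rank paragraphs by similarity to the document centroid, keeping each
    paragraph's index so attribution back to the source document is preserved."""
    embeddings = model.encode(paragraphs, convert_to_tensor=True)
    centroid = embeddings.mean(dim=0, keepdim=True)
    scores = util.cos_sim(centroid, embeddings)[0]
    ranked = sorted(enumerate(scores.tolist()), key=lambda p: p[1], reverse=True)
    return [(idx, paragraphs[idx], score) for idx, score in ranked[:top_k]]
```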
For abstraction, the team instead generated a custom GPT model using the CSR and its comparable publication, says Martin. Then, when fed the title and first sentence, the model generates a summary of the CSR.
This is not something ChatGPT can do well, as Martin demonstrates with an abstracted summarization it prepared on the ACHIEVE II migraine treatment study. While “effectively perfect for inclusion,” the summary document was riddled with factual inaccuracies—a phenomenon he likens to asking the model to name the first woman president of the United States and it confidently responding Hillary Clinton with details on her inauguration. The problem with ChatGPT models is that they are mathematically and technologically wired only to generate the “middle-of-the-bell-curve normal for something” rather than novel insights or anything unique.
User Focused
Happily, none of the output from FOCAL needs to be validated, “because in the end the process has never changed,” says Martin. “It is still a human being who is using the tools to augment their process... [and] accelerate what they’re doing.” Validation is a people and not a technology process in this scenario, he notes.
Similarly, key performance indicators and ROI can be framed in terms of the value to the user—e.g., time savings of 10 minutes per day every day for a year, he adds. Rather than worrying about AUC and F1 score—metrics for data scientists—the emphasis is on the measures that matter to the business.
The concept of orchestration, or how to make the tools work as an ecosystem, is focused on how to connect the components in a way that is easiest for end users, Martin says. AbbVie’s internally developed PACER integration platform is a lightweight, event-driven messaging system built with the human in the loop. The orchestration approach supports tasks that are part of an automatable process that requires human interaction and can handily be deployed across systems and locations.
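PACER’s internals are not described in the presentation. Purely to illustrate the pattern of an event-driven message flow with a human-in-the-loop step, a toy orchestrator might look like the following; the event names, handlers, and study identifier are invented.

```python
from collections import defaultdict

class ToyOrchestrator:
    """Toy event bus: services subscribe to events; some steps wait on a human."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, payload):
        for handler in self.handlers[event]:
            handler(payload)

bus = ToyOrchestrator()

# Hypothetical services chained through events.
bus.subscribe("section.requested",
              lambda p: bus.publish("section.drafted", {**p, "draft": "..."}))
bus.subscribe("section.drafted",
              lambda p: print(f"Draft of '{p['section']}' ready; awaiting human review."))

bus.publish("section.requested", {"study": "ABC-123", "section": "safety summary"})
```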
The most sensible way to give users access to all the tools was to provide them in Microsoft Word since it is the “platform of choice for using these [CSR] documents,” says Martin. So, he and his team built a generative AI assistant as an add-in to contextually combine all the services and make them available in Word with a click.
“You can build 100 of these [add-ins] and customize them to specific services that a user might need for a task,” Martin adds. “But you still just have one translation service on the backend that all of those different variations of it can bring to bear. That’s the power of this orchestration ecosystem approach.”
The technology behind most of these models is going to be released in the public domain over the next 18 months, says Martin. “There is no strategic value in our models in generating CSR language... that can’t also benefit every other pharma company and every other company out there. The CSR itself is the strategic asset.”
Connected Intelligence
Garnering value from data—including archived and highly unstandardized information on trials—was the theme of the IQVIA presentation on “connected intelligence,” which is all about generating insights and pushing them into the workflow so end users can take the next best action, as Morahan explains it. The company primarily offers augmented rather than artificial intelligence tools, in which humans interact with a machine model.
IQVIA has taken giant steps forward with what is “effectively the ChatGPT of enrollment strategy design,” Barnes says. The production-grade system can take in protocol information and sponsor constraints around cost, risk, and time to generate in “moments” a complete enrollment design inclusive of benchmarks and competitive protocols that need to be considered, recommended countries and sites, expected enrollment rate, milestones, timelines, and risk factors.
The system generates three scenarios representing the fastest, cheapest, and balanced approach, he continues, and can remodel around any subsequent constraints if the initial output isn’t satisfactory (e.g., unwanted country, wrong sites, or too aggressive).
In one presented example a protocol search tool is used to find, in its top-50 results, between 70% and 90% of the studies a human-based search would turn up—plus “maybe some you would have missed being that the data is so expansive,” says Barnes. Another use case for the site selection process showcases models that identify qualified sites and investigators and rank them into three tiers. “We have found over the years that these tier 1 sites enroll 40% faster than the non-tier 1 sites.”
IQVIA’s augmented intelligence tools can not only predict enrollment rates but also quantify the associated risk to the study in terms of executing to plan, he continues. “For all this to work there has to be data in the system and that by itself is a big problem.”
Morahan explains how IQVIA has solved the problem of getting data into the system with a trio of use cases for risk-based quality management (RBQM), embedding intelligence into the document review process (“query GPT”), and a query recommendation engine. The first step, she says, is consolidating data so there is enough of it to run machine learning models that can detect discrepancies and compose query text. At the end of the day, it is still the data manager who is making the decision about whether the machine made the right recommendation, she stresses.
RBQM capabilities include augmented site risk analytics, based on data from deviations, patients (e.g., AEs, EDC assessment data, labs), and clinical operations (e.g., CRAs in the field), to help keep the human focused on where their skills are needed, says Morahan. Deviations related to the investigational product (IP) are the focus of a compliance model that mines free text and sorts the identified deviations into categories such as loss of IP, storage issues, under-dosing, and over-dosing.
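The compliance model’s architecture is not specified. One lightweight way to sketch the idea of sorting free-text deviation descriptions into those categories is zero-shot classification with an off-the-shelf natural language inference model, as below; the category labels come from the talk, while the model choice and example text are assumptions.

```python
from transformers import pipeline

# Off-the-shelf zero-shot classifier; not the model IQVIA uses.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

CATEGORIES = ["loss of investigational product", "storage issue",
              "under-dosing", "over-dosing"]

# Invented free-text deviation description.
text = "Site reported the kit was kept at room temperature over the weekend."
result = classifier(text, candidate_labels=CATEGORIES)
print(result["labels"][0], round(result["scores"][0], 2))
```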
A deceptively simple use case for labs and vitals outliers combines five different algorithms to create a confidence ranking for the analysis. “A ranking of 5 means all five algorithms have identified a data point as an outlier, [while] a ranking of 1 or 2 means only one or two of the algorithms have [so] identified that data point... [and] human critical thinking skills are needed,” Morahan says.
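Morahan does not name the five algorithms. As a hedged sketch of the voting idea, the snippet below runs five common outlier detectors over a series of lab values and counts, for each data point, how many of them raise a flag; the detectors, thresholds, and sample values are illustrative choices, not IQVIA’s.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def outlier_votes(values):
    """Count, per data point, how many of five detectors flag it (0 to 5)."""
    vals = np.asarray(values, dtype=float)
    x = vals.reshape(-1, 1)
    votes = np.zeros(len(vals), dtype=int)

    # 1. Classic z-score
    votes += (np.abs((vals - vals.mean()) / vals.std()) > 3).astype(int)
    # 2. Modified z-score based on the median absolute deviation
    mad = 1.4826 * np.median(np.abs(vals - np.median(vals)))
    votes += (np.abs(vals - np.median(vals)) / mad > 3.5).astype(int)
    # 3. Interquartile-range rule
    q1, q3 = np.percentile(vals, [25, 75])
    iqr = q3 - q1
    votes += ((vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)).astype(int)
    # 4. Isolation Forest
    votes += (IsolationForest(random_state=0).fit_predict(x) == -1).astype(int)
    # 5. Local Outlier Factor
    votes += (LocalOutlierFactor(n_neighbors=5).fit_predict(x) == -1).astype(int)
    return votes

# Illustrative lab values: high vote counts mean broad agreement among the
# algorithms; counts of 1 or 2 are where human critical thinking is needed.
print(outlier_votes([98.6, 98.7, 99.1, 98.5, 98.9, 104.8, 98.4, 98.8, 99.0, 98.6]))
```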
For unstructured data in the document review realm, IQVIA has a standalone system that strings a series of algorithms together into a workflow and is completely agnostic to the type of electronic Trial Master File (eTMF) being used, says Morahan. The system does quality checks on scanned documents and pulls out document metadata that can also be stored for later retrieval and querying. It also recommends the eTMF model category the document should be stored in (e.g., zone and artifact), and the machine continuously learns from the adjustments users make before pushing the send button.
“This model has been in use at IQVIA for a couple of years, and we’ve trained the model on over 10 million documents,” Morahan says. “We pump about 14,000 documents a day through it... and the model is now classifying documents at around a 99% accuracy rate.” This has resulted in a 75% reduction in manual time and effort in the document processing workflow.
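The classifier itself is not described. As a much-simplified, hypothetical sketch, a text model over extracted document text can predict a filing category, and the corrections users make before hitting send can be appended to the training set for periodic refitting; the labels and training snippets below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented snippets standing in for text extracted from scanned documents.
train_texts = [
    "signed informed consent form for subject 001",
    "laboratory certification and reference ranges",
    "monitoring visit report for site 12",
    "signature page of the protocol amendment",
]
train_labels = ["consent form", "lab certification",
                "monitoring visit report", "protocol amendment"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["site monitoring visit report, March 2023"])[0])

# Continuous learning: corrections users make before hitting send are appended
# to the training data and the model is periodically refit.
train_texts.append("pharmacy temperature log for investigational product")
train_labels.append("IP storage record")
model.fit(train_texts, train_labels)
```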