Data plays a critical role in the discovery, development, and sustained delivery of new and existing medicines to patients. Stakeholders must make a variety of scientific, regulatory, project, clinical, and supply chain decisions to move a therapy through this full lifecycle, and data is what informs those decisions. Stakeholders must therefore carefully consider how to implement systems that make all of this data, in its various forms, dispositions, and locations, available for effective decision making. The implementation of an enterprise data strategy must account for a variety of factors.
A futuristic data-driven experience in the pharmaceutical industry might look a little like this:
No paper, no writing implements, and few keyboards. Communication is made through radio frequency via RFID scanners and receivers.
Scientists and operations staff wear augmented reality interfaces with visual and natural language processing-based audio support, on top of the usual PPE.
Instruments and laboratory equipment are equipped with IoT adapters that enable data streaming. When scientists approach an instrument, their RFID detector opens the acquisition console on their augmented reality heads-up display (AR HUD), allowing for automatic initiation of experiments.
Data from acquisition is streamed via 5G (maybe 6G!) to the cloud.
ML-based analysis and the scientist’s interpretation are streamed to decision support interfaces.
How do we get closer to this seamlessly data-driven future?
Pharmaceutical Product Lifecycle Data Strategy
When stakeholders first embark on strategy development, it’s best to start with identifying the sources of data as a function of the unit operations across the lifecycle. In the pharmaceutical industry this lifecycle starts in Research and Development. The goal of pharmaceutical research (also known as discovery) is to identify molecular entities that are suitable for clinical investigation. Today, automation and digital measurement outputs must be carefully accounted for in data strategy development.
Moreover, consider the degree to which software systems support the completion of specific unit operations. Rather than rely on scientists to document or transcribe information into human-readable documents or presentations, today’s stakeholders can leverage the ability to automate the streaming of formatted output from such instrumental or robotic data sources and rely on software-based decision support systems. The resultant systems afford data-driven decision making (as opposed to the document-driven paradigm of the past). When confronting the challenge of implementing a practical data strategy, the following considerations are essential:
- Business process analysis: the steps to discover, develop, and supply new therapies to patients
- Data-generating activities: the unit operations across the lifecycle that generate data
- Data governance model: the systems and procedures needed to glean the most benefit from data-generating activities
- Digital knowledge transfer between unit operations: as a project moves through these unit operations, how does knowledge flow when the composition of the team changes? For example, how does knowledge transfer from discovery to development upon candidate nomination? Assuring successful and efficient knowledge transfer is essential to keep the trains running on time.
Here I will discuss the typical set of dependent stages for discovery and development project teams. Each step produces a variety of data that informs project team decision-making and should be carefully evaluated for successful implementation.
Target Selection and Assay Validation
For each disease of interest, project teams must focus on one (or more) biological targets whose functional modulation mitigates or eliminates disease progression. The process of target selection starts with gaining a fundamental understanding of the biochemical mechanism of disease or infection. Data from systems biology experimentation (genetics, proteomics, etc.) is carefully evaluated to select the most suitable biological target. To efficiently discover the most suitable molecular entity for target functional activity modulation, stakeholders must develop a series of in-vitro and in-vivo biological assays. Consequently, a series of assay engineering experiments is performed to assure that functional activity modulation can be detected reliably. Data from such experiments allow a series of assay protocols to be implemented. Moreover, assay validation experiments generate data that allow stakeholders to assure that results from such assays can be used to rank-order molecules by their likely disease-mitigating effect.
Molecular Entity Screening
Upon completion of assay validation, discovery project teams seek to identify a series of molecules across a range of molecular modalities which exhibit target function modulation (e.g., inhibition or agonism). In many cases, millions of samples (or test articles) of molecules are subjected to initial screening to determine whether they show a minimal level of activity. The data generated must be carefully evaluated using a variety of chemically and statistically aware decision support software interfaces. Additionally, a variety of so-called counter-screens are employed (and their datasets evaluated) to assure that any lead series molecules interact specifically with the disease target while avoiding unintentional interaction with the large variety of biological systems present in the human body. Ideally, the results of evaluating datasets produced by screening and counter-screening assays will reveal a number of lead series molecules which can be further optimized with increasingly realistic (and costly) in-vitro and in-vivo assay models.
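As a minimal sketch of how screening and counter-screening datasets might be combined, the Python below keeps molecules that pass a primary activity threshold and stay quiet in the counter-screens. The record fields, the percent-inhibition scale, and both thresholds are hypothetical and chosen purely for illustration; a real campaign would define its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenResult:
    molecule_id: str            # hypothetical identifier
    target_inhibition: float    # percent inhibition in the primary screen
    counter_inhibition: float   # highest percent inhibition across counter-screens


def select_leads(results, min_target=50.0, max_counter=20.0):
    """Return IDs of molecules that are active on target and inactive
    in counter-screens, ordered from most to least active."""
    hits = [
        r for r in results
        if r.target_inhibition >= min_target
        and r.counter_inhibition <= max_counter
    ]
    hits.sort(key=lambda r: r.target_inhibition, reverse=True)
    return [r.molecule_id for r in hits]


# Illustrative values only; not real screening data.
results = [
    ScreenResult("M-001", 82.0, 5.0),
    ScreenResult("M-002", 91.0, 64.0),  # potent but non-selective
    ScreenResult("M-003", 12.0, 3.0),   # inactive on target
    ScreenResult("M-004", 67.0, 18.0),
]
leads = select_leads(results)
print(leads)
```

Keeping the thresholds as parameters rather than constants lets the same triage step be reused as assays become more realistic and the acceptance criteria tighten.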
Lead Optimization
Upon completion of lead series identification, discovery project teams undertake the iterative lead optimization cycle, wherein the team conducts a series of molecular optimization experiments via a design-make-test (DMT) iterative cycle. Each step of this cycle generates data that establishes quantitative (molecular) structure-activity relationships. The goal is to gain a clear understanding of the correlation between molecular identity and corresponding test article assay performance. For cycles to produce key insights, project teams must confirm that the substances produced for assays are of the correct identity and sufficient purity. Medicinal chemists producing test articles subject them to spectroscopic (usually NMR) and chromatographic analysis to confirm identity and purity.
When considering the use of data during lead optimization, the source and format of molecular design and preparation information (i.e., synthetic route and purification method), identity and purity measurements, and bioassay protocols and quantitative activity modulation measurements must all be presented in the appropriate decision support interface. These data are used to establish correlation models between identity/composition attributes and therapeutic attributes. Appropriate data governance models must be instituted to assure that these correlations are accurate.
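One way to picture this is a single record per test article that carries the structure, the identity and purity results, and the assay readout together, so that only confirmed, sufficiently pure material feeds a structure-activity table. The sketch below assumes a hypothetical record layout, a 95% purity cut-off, and IC50 values in nanomolar; the SMILES strings are placeholders, not real lead compounds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleRecord:
    article_id: str           # hypothetical test article identifier
    smiles: str               # molecular structure (placeholder values below)
    identity_confirmed: bool  # e.g., from NMR analysis
    purity_pct: float         # e.g., from chromatographic analysis
    ic50_nm: float            # bioassay result in nanomolar


def sar_table(articles, min_purity=95.0):
    """Build (id, structure, pIC50) rows from articles that pass the
    identity and purity gate, most potent first."""
    rows = []
    for a in articles:
        if a.identity_confirmed and a.purity_pct >= min_purity:
            # pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
            rows.append((a.article_id, a.smiles, 9.0 - math.log10(a.ic50_nm)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Illustrative values only.
articles = [
    ArticleRecord("A-1", "CCO", True, 98.5, 100.0),
    ArticleRecord("A-2", "c1ccccc1", True, 88.0, 5.0),   # fails purity gate
    ArticleRecord("A-3", "CC(=O)O", False, 99.0, 2.0),   # identity not confirmed
    ArticleRecord("A-4", "CCN", True, 99.0, 10.0),
]
rows = sar_table(articles)
for article_id, smiles, pic50 in rows:
    print(article_id, smiles, f"{pic50:.2f}")
```

Excluding A-2 and A-3 here is the point of the gate: an apparently potent result from impure or misidentified material would otherwise distort the correlation.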
Clinical Candidate Selection
As performance testing during DMT cycles achieves important performance and selectivity milestones, project teams must ultimately select from a variety of substances to promote candidates for clinical investigation. A variety of factors influence the outcome of candidate nomination activities. The goal is to maximize the probability of mitigating or eliminating disease progression while minimizing adverse events. Project teams subject clinical candidates to a set of confirmatory assays to predict the overall likelihood of successful clinical outcomes.
Since such clinical trials (i.e., testing with human patient populations) of therapies are regulated by global health authorities (FDA, EMA, etc.), confirmatory studies must adhere to current good laboratory practices (cGLP). The procedures executed and the data generated during these experiments must also adhere to a firm’s cGLP policies. Therefore, data governance should also extend to overall quality assurance and regulatory policies during this important phase of the lifecycle.
Finally, a key factor also considered when promoting a specific candidate is projected manufacturing cost. All other things being equal, project teams will nominate the candidate that can be made for the lowest cost.
Clinical Testing and CMC Development
Upon acceptance of a nominated candidate for drug development, project teams (usually with a new composition of stakeholders) must design and implement a clinical strategy to assess the viability of a new therapy. The knowledge obtained during discovery unit operations will be leveraged to prepare this clinical strategy. In addition, such clinical investigation must follow current good clinical practices (cGCP) and current good manufacturing practices (cGMP). Experiments to design and implement processes to produce, formulate, and test clinical trial materials must follow such cGxP policies to mitigate risk. Adherence to cGxP policy also helps to assure approval of clinical testing applications (e.g., investigational new drug applications, as mandated in the US by the Code of Federal Regulations, specifically 21 CFR Part 312).
Therefore, a firm’s data governance strategy should account for these policies, which add to the considerations already in place for discovery.
Commercialization and Supply Chain
Many have summarized the costs and risks associated with the clinical development of new therapies. The process can be long and relatively expensive. If such clinical studies confirm the therapeutic benefits of an investigational new drug (IND), and health authorities approve the commercialization of this new drug via a new drug application (NDA), project teams are faced with the task of transferring the various clinical production, formulation, and testing processes to their commercial supply chain teams.
An effective digital tech transfer process is essential to an efficient supply of these new therapies to patients. Additionally, ongoing testing within the commercial supply chain is essential to assure quality and mitigate risk. As with the DMT cycles in discovery, data governance for the batch records and test result records generated during production must be carefully devised.
Pharma Data Utilization
Categories
A number of important data utilization categories must be considered across this lifecycle when establishing a data governance model.
- Contemporaneous Decision Support: offering stakeholders just-in-time access to the data from every experiment or process result. These interfaces support both experiment-specific decisions (e.g., for an experiment in discovery: is this material of the appropriate identity and composition to subject to a bioassay?) and release decisions (e.g., for a collection of test results from different experiments: is the material appropriate for a downstream process operation?).
- Comparative Analysis: reference-type comparisons. Being able to compare data across a variety of experiments, samples, and other variables is essential to glean various insights (longitudinal and serial trends, standard-to-sample differences, etc.).
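A release decision of the kind described above can be sketched as a check of collected test results against specification limits. The attribute names, limits, and result values below are hypothetical; actual specifications would come from a firm's own quality system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spec:
    attribute: str
    low: Optional[float] = None   # lower limit, if any
    high: Optional[float] = None  # upper limit, if any


def release_decision(results, specs):
    """Compare a dict of test results with a list of specs.
    Returns (ok, failures); a missing result counts as a failure."""
    failures = []
    for spec in specs:
        value = results.get(spec.attribute)
        if value is None:
            failures.append(f"{spec.attribute}: no result")
        elif spec.low is not None and value < spec.low:
            failures.append(f"{spec.attribute}: {value} below {spec.low}")
        elif spec.high is not None and value > spec.high:
            failures.append(f"{spec.attribute}: {value} above {spec.high}")
    return (not failures, failures)


# Illustrative limits and results only.
specs = [
    Spec("purity_pct", low=98.0),
    Spec("water_pct", high=0.5),
    Spec("assay_pct", low=98.0, high=102.0),
]
batch_results = {"purity_pct": 99.1, "water_pct": 0.7, "assay_pct": 100.2}
ok, failures = release_decision(batch_results, specs)
print(ok, failures)
```

Returning the list of failures alongside the yes/no answer gives the decision support interface something to display, rather than only a verdict.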
Attributes
Implementation of systems based on these use cases requires the following attributes to be carefully considered:
- Lowering Data Access Utilization Barriers: in addition to the types of decision support use cases identified above, establishing data governance that minimizes access barriers while maintaining data security is essential to overall success. Stakeholders must consider:
  - The distribution of data sources (data entry/capture systems, data-generating instruments and equipment) and just-in-time, network-based access to stream data from the source to accessible repositories
  - The degree to which datasets are tagged to enable facile queries (scientifically relevant tags, including chemical structure queries and spectral feature-type queries)
- Support for Format and Technique Agnostic Tools: based on the volume and variety of data formats from the collection of data-producing sources, data governance must include subsystems which serve as conversion and normalization microservices. This allows data consumers to rely on access to datasets without having to depend on format- and technique-specific access and presentation interfaces.
- Access Points and Consumer Demographics Accounting: when implementing such access points, stakeholders should carefully consider how data consumers will interact with data upon execution of queries. Relevant priorities should include:
  - The type of software interface (browser or federated user interfaces) and the structure and types of queries (SQL and REST API-based, form-based, text/natural language-based)
  - How results are displayed/presented
  - Decisions made with such query output
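The conversion and normalization idea can be illustrated with a small dispatcher that maps each source format to a converter producing one common record shape. This is a sketch under stated assumptions: the two "vendor" formats, their column and key names, and the common schema are all invented for illustration and do not correspond to any real instrument export.

```python
import csv
import io
import json


def from_csv_export(text):
    """Hypothetical vendor A: CSV with 'Sample' and 'Purity (%)' columns."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"sample_id": r["Sample"], "attribute": "purity",
         "value": float(r["Purity (%)"]), "unit": "percent"}
        for r in rows
    ]


def from_json_export(text):
    """Hypothetical vendor B: JSON reporting purity as a fraction (0 to 1)."""
    payload = json.loads(text)
    return [
        {"sample_id": m["sampleId"], "attribute": "purity",
         "value": round(m["purityFraction"] * 100.0, 2), "unit": "percent"}
        for m in payload["measurements"]
    ]


# Registry of known formats; new converters are added here.
NORMALIZERS = {"vendor_a_csv": from_csv_export, "vendor_b_json": from_json_export}


def normalize(source_format, text):
    """Convert a raw export into the common record shape."""
    try:
        converter = NORMALIZERS[source_format]
    except KeyError:
        raise ValueError(f"unsupported format: {source_format}") from None
    return converter(text)


csv_text = "Sample,Purity (%)\nS-1,99.2\nS-2,97.5\n"
json_text = json.dumps({"measurements": [{"sampleId": "S-3", "purityFraction": 0.987}]})
records = normalize("vendor_a_csv", csv_text) + normalize("vendor_b_json", json_text)
for record in records:
    print(record)
```

Because every converter emits the same keys and units, downstream query and presentation code never needs to know which instrument or file format a value came from.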
Predictive Models Based on Machine Learning
Finally, the most recent addition to most data governance strategies is the consideration of systems that produce query outputs to serve as training sets for predictive models. As data is produced by an enterprise-wide collection of data sources, and each dataset is given context (sample, experiment, and acquisition method), firms can utilize the systems above to produce such training sets. This can be accomplished by adding REST API-based access to data utilization systems and including a JSON-based conversion utility, which allows facile incorporation into various ML systems, both in the cloud and on premises.
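A JSON-based conversion utility of this kind might look like the sketch below, which turns contextualized query results into feature/label records that an ML pipeline could ingest. The dataset layout, the field names (`context`, `results`), and the chosen feature and label keys are assumptions made for illustration; the REST layer that would serve this output is omitted.

```python
import json


def to_training_json(datasets, feature_keys, label_key):
    """Serialize datasets into a JSON array of training records.
    Datasets missing any requested feature or the label are skipped."""
    records = []
    for d in datasets:
        results = d["results"]
        if any(k not in results for k in (*feature_keys, label_key)):
            continue  # incomplete dataset: not usable for training
        records.append({
            "context": d["context"],  # sample, experiment, acquisition method
            "features": [results[k] for k in feature_keys],
            "label": results[label_key],
        })
    return json.dumps(records, indent=2)


# Illustrative query output only; values are not real measurements.
query_output = [
    {"context": {"sample_id": "S-1", "experiment_id": "E-10", "method": "HPLC-UV"},
     "results": {"logp": 2.1, "mol_weight": 312.4, "pic50": 7.2}},
    {"context": {"sample_id": "S-2", "experiment_id": "E-11", "method": "HPLC-UV"},
     "results": {"logp": 3.4, "mol_weight": 298.3}},  # no label yet
]
training_json = to_training_json(query_output, ["logp", "mol_weight"], "pic50")
print(training_json)
```

Carrying the context block through to each training record preserves traceability from a model's training data back to the originating sample and experiment.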
Conclusion
Stakeholders are challenged to implement strategies to support effective data usage across the pharmaceutical lifecycle. While this challenge can appear daunting, designing a data governance model using the considerations described above can lead to the discovery of better medicines, more efficient progression of R&D projects, and ultimately the implementation of predictive models for the next generation of vital, life-saving therapies. Industry stakeholders should also consider practical access, format, and presentation requirements when developing such data governance, with particular emphasis on implementing data conversion, normalization, and presentation microservices.