Essential Datasets for Tracking Drug Development from Discovery to Market
The FDA breaks down the drug development process into 5 main steps:
- Discovery and Development
- Preclinical Research
- Clinical Research
- FDA Review
- FDA Post-Market Safety Monitoring
Here, I’ll break down some of the most useful datasets for learning more about what companies and researchers are doing at different stages of the drug development and delivery pipeline. I consider many of these datasets primary sources, and I use them when conducting research or building new data products to keep tabs on the industry.
Discovery and Development and Preclinical Research
This can be the most opaque area, because we have the fewest regulatory requirements for data disclosure and companies can be quite secretive about early programs. To keep tabs on researchers and early indications, we can use academic literature, citation databases, and intellectual property records.
📚 PubMed and PubMed Central
Comprehensive database of biomedical literature
Author/Host
National Library of Medicine (NLM)
Key Features
- Over 37 million biomedical literature citations
- Full-text, peer-reviewed articles and links to publisher sites
- Advanced search capabilities and API
Notes and Uses
Search PubMed to find out what research groups are working on and to broadly survey the landscape in different fields.
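As a rough sketch of what a programmatic PubMed search can look like, the snippet below builds a request for NCBI's E-utilities ESearch endpoint and parses the JSON it returns. The search term and the sample response are made up for illustration, and the code does not make a network call; fetching the URL is left to whatever HTTP client you prefer.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch(payload: str) -> tuple[int, list[str]]:
    """Return (total hit count, list of PMIDs) from an ESearch JSON response."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


# Made-up response in the ESearch JSON shape, so the sketch runs offline.
sample = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'

# Illustrative query: a drug name in the title/abstract, published in 2024.
url = build_esearch_url("semaglutide[Title/Abstract] AND 2024[dp]")
total, pmids = parse_esearch(sample)
print(url)
print(total, pmids)
```

The PMIDs that come back can then be passed to other E-utilities calls to pull abstracts or author lists.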
🔍 OpenAlex
Open database of scholarly works, authors, and institutions
Author/Host
OurResearch.org
Key Features
- Linked open data
- API access
- Comprehensive metadata
Notes and Uses
OpenAlex's knowledge graph allows us to enhance our PubMed searches by looking for networks of researchers, drugs, or device targets. It also helps us quickly gain context and identify key players or areas in a field.
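To give a feel for identifying key players, here is a minimal sketch that builds an OpenAlex works search URL and counts how many returned works list each institution. The response below is a made-up fragment that follows the `results` / `authorships` / `institutions` shape of the works endpoint; the search phrase and institution names are placeholders.

```python
from collections import Counter
from urllib.parse import urlencode

WORKS_URL = "https://api.openalex.org/works"


def build_works_url(query: str, per_page: int = 25) -> str:
    """Build a full-text search URL against the OpenAlex works endpoint."""
    return f"{WORKS_URL}?{urlencode({'search': query, 'per-page': per_page})}"


def count_institutions(works_response: dict) -> Counter:
    """Count how many works each institution appears on (once per work)."""
    counts = Counter()
    for work in works_response.get("results", []):
        seen = set()
        for authorship in work.get("authorships", []):
            for institution in authorship.get("institutions", []):
                name = institution.get("display_name")
                if name:
                    seen.add(name)
        counts.update(seen)
    return counts


# Made-up response fragment so the sketch runs without a network call.
sample = {
    "results": [
        {"authorships": [
            {"institutions": [{"display_name": "Example University"}]},
            {"institutions": [{"display_name": "Example University"}]},
        ]},
        {"authorships": [
            {"institutions": [{"display_name": "Example University"},
                              {"display_name": "Sample Institute"}]},
        ]},
    ]
}

print(build_works_url("GLP-1 receptor agonist"))
print(count_institutions(sample).most_common())
```

Counting each institution once per work keeps a paper with many co-authors from the same lab from dominating the ranking.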
📄 USPTO Patent Data Portal
Official source for U.S. Patent and Trademark Data
Author/Host
US Patent and Trademark Office (USPTO)
Key Features
- Bulk data downloads
- API access
- Historical and current patent information
Notes and Uses
USPTO data is the authoritative source for intellectual property information in the US. We search USPTO data to map the competitive landscape of an industry or to see what products a company is working on.
📄 PatentsView Curated Datasets
A data curation project from the Office of the Chief Economist of USPTO.
Author/Host
USPTO Office of Chief Economist and Collaborators
Key Features
- Curated patent datasets
- Bulk data downloads
- API access
- Historical and current patent information in an easily consumable format
Notes and Uses
USPTO data is extensive, detailed, and can be challenging to work with. PatentsView creates summarized and otherwise de-duplicated datasets that often contain exactly the information we need in a consumable format. One particularly useful feature is the applicant and assignee name disambiguation.
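A small sketch of why disambiguation matters: once every patent row carries a stable assignee identifier, counting a company's patents becomes a simple group-by. The column names below (`patent_id`, `assignee_id`, `disambig_assignee_organization`) are my assumption about the layout of the disambiguated assignee bulk table, so check them against the current PatentsView data dictionary. The rows are invented.

```python
import csv
import io
from collections import defaultdict

# Invented rows in an assumed tab-separated layout.
rows = [
    ("patent_id", "assignee_id", "disambig_assignee_organization"),
    ("10000001", "a1", "Example Pharma Inc."),
    ("10000002", "a1", "Example Pharma Inc."),
    ("10000003", "b2", "Sample Biotech LLC"),
]
sample_tsv = "\n".join("\t".join(row) for row in rows)


def patents_per_assignee(tsv_text: str) -> dict[str, int]:
    """Count distinct patents for each disambiguated assignee organization."""
    patents = defaultdict(set)
    names = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        patents[row["assignee_id"]].add(row["patent_id"])
        names[row["assignee_id"]] = row["disambig_assignee_organization"]
    return {names[aid]: len(ids) for aid, ids in patents.items()}


print(patents_per_assignee(sample_tsv))
```

Grouping on the identifier rather than the name string is the point: spelling variants of one company collapse into a single bucket.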
Clinical Research
Clinical trials are vital to the drug and biologic approval process because they demonstrate safety and effectiveness. Since the stakes are higher in clinical research, there are more regulations governing how studies are conducted and how transparently they are reported. Clinical trials are the most important data source for tracking regulated products as they move through the pipeline. Trials are typically numbered Phase 1 through Phase 4: Phase 1 is a safety and dosage study in a small group of 20-100 healthy volunteers or people with the disease/condition, and Phase 4 is a post-marketing safety and efficacy study in several thousand volunteers who have the disease/condition [1].
📄 ClinicalTrials.gov
Database of clinical research studies from around the world along with metadata, trial information and results.
Author/Host
National Library of Medicine
Key Features
- Intervention and conditions
- Trial sponsors
- Trial phase
- Important dates for results
Notes and Uses
Searching for active trials or trials that will be ending and reporting results soon. It is important to note that while sponsors are required to provide information, the FDA and NLM do not check all submissions for accuracy or ensure that data is supplied in a timely manner. It is the sponsor's responsibility to provide information, and they can edit or change it at any time.
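Here is a hedged sketch of how a search for trials ending soon might look against the ClinicalTrials.gov v2 API. It builds a studies query and then filters a response by primary completion date. The condition, date window, and sample studies (with placeholder NCT numbers) are invented, and no request is actually sent.

```python
from datetime import date
from urllib.parse import urlencode

STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_studies_url(condition: str, status: str = "RECRUITING", page_size: int = 50) -> str:
    """Build a v2 studies query filtered by condition and overall status."""
    params = {"query.cond": condition, "filter.overallStatus": status, "pageSize": page_size}
    return f"{STUDIES_URL}?{urlencode(params)}"


def parse_ct_date(text: str) -> date:
    """Dates may be YYYY-MM or YYYY-MM-DD; treat month-only as the 1st."""
    parts = [int(piece) for piece in text.split("-")]
    while len(parts) < 3:
        parts.append(1)
    return date(*parts)


def completing_between(response: dict, start: date, end: date) -> list[tuple[str, str]]:
    """Return (NCT ID, primary completion date) pairs inside the window."""
    hits = []
    for study in response.get("studies", []):
        protocol = study.get("protocolSection", {})
        nct_id = protocol.get("identificationModule", {}).get("nctId")
        when = protocol.get("statusModule", {}).get("primaryCompletionDateStruct", {}).get("date")
        if nct_id and when and start <= parse_ct_date(when) <= end:
            hits.append((nct_id, when))
    return hits


# Invented response fragment with placeholder NCT numbers.
sample = {"studies": [
    {"protocolSection": {"identificationModule": {"nctId": "NCT00000001"},
                         "statusModule": {"primaryCompletionDateStruct": {"date": "2025-06"}}}},
    {"protocolSection": {"identificationModule": {"nctId": "NCT00000002"},
                         "statusModule": {"primaryCompletionDateStruct": {"date": "2026-01-15"}}}},
]}

print(build_studies_url("epilepsy"))
print(completing_between(sample, date(2025, 1, 1), date(2025, 12, 31)))
```

Because sponsors can revise these dates at any time, it is worth re-running a query like this regularly rather than caching the answer.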
📄 Aggregate Analysis of Clinical Trials (AACT)
AACT is a pre-processed PostgreSQL database of all the data in ClinicalTrials.gov, updated daily.
Author/Host
Clinical Trials Transformation Initiative (CTTI)
Key Features
- Relational database version of ClinicalTrials.gov
- Direct access to PostgreSQL
- Data downloads
- Curated project data
Notes and Uses
Easily write SQL queries to build your own datasets in bulk. Rather than chaining API calls or parsing raw records from ClinicalTrials.gov, you can create whatever data you like using SQL.
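To illustrate the kind of query this enables, the sketch below joins studies to their lead sponsors and filters by primary completion date. The table and column names (`ctgov.studies`, `ctgov.sponsors`, `lead_or_collaborator`) reflect my understanding of the AACT schema and should be checked against its data dictionary. So that the example runs anywhere, it executes the query against a tiny in-memory SQLite stand-in with invented rows instead of the real PostgreSQL database.

```python
import sqlite3

# The query as it might be sent to AACT (schema names are assumptions).
QUERY = """
SELECT s.nct_id, s.phase, s.primary_completion_date, sp.name AS lead_sponsor
FROM ctgov.studies AS s
JOIN ctgov.sponsors AS sp ON sp.nct_id = s.nct_id
WHERE sp.lead_or_collaborator = 'lead'
  AND s.primary_completion_date BETWEEN '2025-01-01' AND '2025-12-31'
ORDER BY s.primary_completion_date
"""

# Stand-in database: attach an in-memory schema named ctgov and add fake rows.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ctgov")
conn.execute("CREATE TABLE ctgov.studies (nct_id TEXT, phase TEXT, primary_completion_date TEXT)")
conn.execute("CREATE TABLE ctgov.sponsors (nct_id TEXT, name TEXT, lead_or_collaborator TEXT)")
conn.executemany("INSERT INTO ctgov.studies VALUES (?, ?, ?)", [
    ("NCT00000001", "PHASE3", "2025-06-30"),
    ("NCT00000002", "PHASE2", "2026-02-28"),
])
conn.executemany("INSERT INTO ctgov.sponsors VALUES (?, ?, ?)", [
    ("NCT00000001", "Example Pharma", "lead"),
    ("NCT00000001", "Example University", "collaborator"),
    ("NCT00000002", "Sample Biotech", "lead"),
])

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Against the real database you would swap the SQLite connection for a PostgreSQL client and your AACT credentials; the query text itself should carry over largely unchanged.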
FDA Review
During FDA review, most data is confidential and embargoed. However, once drugs, devices, or biologics are approved, FDA publishes a lot of useful information across many different databases and APIs. Below are some of the data sources we use most frequently to track companies and learn more about commercial activities and opportunities.
📄 Drugs@FDA
A searchable database of FDA-approved drug products for human use
Author/Host
FDA's Center for Drug Evaluation and Research (CDER)
Key Features
- Drug names
- Ingredients
- Sponsors
- Marketing material
Notes and Uses
Searching for all the products associated with a given active ingredient.
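One way to run this kind of lookup programmatically is through the openFDA `drug/drugsfda` endpoint, which exposes Drugs@FDA records as JSON. The sketch below builds a search on the active ingredient name and flattens a response into (application, sponsor, brand) rows. The ingredient, sponsors, and application numbers in the sample are invented, and no request is sent.

```python
from urllib.parse import urlencode

DRUGSFDA_URL = "https://api.fda.gov/drug/drugsfda.json"


def build_drugsfda_url(ingredient: str, limit: int = 100) -> str:
    """Search Drugs@FDA records by active ingredient name."""
    search = f'products.active_ingredients.name:"{ingredient.upper()}"'
    return DRUGSFDA_URL + "?" + urlencode({"search": search, "limit": limit})


def products_for_ingredient(response: dict, ingredient: str) -> list[tuple[str, str, str]]:
    """Return sorted (application number, sponsor, brand name) rows."""
    wanted = ingredient.upper()
    found = set()
    for application in response.get("results", []):
        for product in application.get("products", []):
            names = {item.get("name", "").upper()
                     for item in product.get("active_ingredients", [])}
            if wanted in names:
                found.add((application.get("application_number", ""),
                           application.get("sponsor_name", ""),
                           product.get("brand_name", "")))
    return sorted(found)


# Invented response fragment in the drugsfda result shape.
sample = {"results": [
    {"application_number": "NDA000001", "sponsor_name": "EXAMPLE PHARMA",
     "products": [{"brand_name": "EXAMPLEDRUG",
                   "active_ingredients": [{"name": "EXAMPLINE"}]}]},
    {"application_number": "ANDA000002", "sponsor_name": "SAMPLE GENERICS",
     "products": [{"brand_name": "EXAMPLINE",
                   "active_ingredients": [{"name": "EXAMPLINE"}]},
                  {"brand_name": "OTHERDRUG",
                   "active_ingredients": [{"name": "OTHERINE"}]}]},
]}

print(build_drugsfda_url("exampline"))
print(products_for_ingredient(sample, "exampline"))
```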
📄 The Orange Book
A searchable database of all FDA approved drugs, therapeutic equivalents (generics) and associated intellectual property and exclusivity information.
Author/Host
FDA's Center for Drug Evaluation and Research (CDER)
Key Features
- Drug names and ingredients
- Generic drugs
- Patents and FD&C exclusivity designations
- Estimated exclusivity dates
Notes and Uses
Useful for linking intellectual property information to companies and drugs. We maintain a custom Orange Book database that allows us to answer questions like: "When does the exclusivity protection end for <drug>?" or "How many drugs approved in 2000 have a generic in 2024?"
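As a simplified illustration of the first question (not our production database), the sketch below joins two Orange Book data files and reports the latest exclusivity date per trade name. I am assuming the downloadable files are tilde-delimited, share `Appl_Type`, `Appl_No`, and `Product_No` as a join key, and write dates like `Mar 15, 2027`; the products sample is trimmed to a few columns, and all rows and codes are invented.

```python
import csv
import io
from datetime import date, datetime


def read_tilde(text: str) -> list[dict]:
    """Parse a tilde-delimited file with a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="~"))


def product_key(row: dict) -> tuple[str, str, str]:
    """Assumed join key shared by the products and exclusivity files."""
    return row["Appl_Type"], row["Appl_No"], row["Product_No"]


def latest_exclusivity(products_text: str, exclusivity_text: str) -> dict[str, date]:
    """Map each trade name to its latest listed exclusivity expiration."""
    trade_names = {product_key(r): r["Trade_Name"] for r in read_tilde(products_text)}
    latest = {}
    for row in read_tilde(exclusivity_text):
        name = trade_names.get(product_key(row))
        if name is None:
            continue  # exclusivity row with no matching product in our sample
        # %b parsing assumes English month abbreviations (default C locale).
        expires = datetime.strptime(row["Exclusivity_Date"], "%b %d, %Y").date()
        if name not in latest or expires > latest[name]:
            latest[name] = expires
    return latest


# Invented sample rows.
products_text = (
    "Ingredient~Trade_Name~Applicant~Appl_Type~Appl_No~Product_No\n"
    "EXAMPLINE~EXAMPLEDRUG~EXAMPLE PHARMA~N~000001~001"
)
exclusivity_text = (
    "Appl_Type~Appl_No~Product_No~Exclusivity_Code~Exclusivity_Date\n"
    "N~000001~001~NCE~Mar 15, 2027\n"
    "N~000001~001~NP~Jan 5, 2026\n"
    "N~999999~001~NCE~Jan 1, 2030"
)

print(latest_exclusivity(products_text, exclusivity_text))
```

A fuller answer would also fold in patent expiration dates, since regulatory exclusivity and patent protection run on separate clocks.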
📄 The Purple Book
A disappointing counterpart to the Orange Book for FDA-licensed biologics
Author/Host
FDA's Center for Biologics Evaluation and Research (CBER)
Key Features
- List of biologics and sponsors
- Dose and administration details
- Exclusivity and expiration
Notes and Uses
This resource is much less comprehensive than the Orange Book. For one, patent listing requirements for biologics are far narrower than those for small-molecule drugs, so we don't have a great idea of the exclusivity period or the strategies companies will use to protect IP. This is still the most reliable listing of FDA-licensed biologics, and we use it just like the Orange Book.
📄 Devices@FDA
Searchable catalog of FDA-regulated medical devices.
Author/Host
FDA's Center for Devices and Radiological Health (CDRH)
Key Features
- Device summaries
- Device manufacturer
- Approval date
- User instructions
Notes and Uses
We are using this database to research recently approved electroencephalograph devices.
📄 FDA Open Data Portal
Easy access to APIs and bulk data downloads from FDA
Author/Host
FDA Office of Health Informatics
Key Features
- API access to key datasets
- Unified portal for FDA data
- Catalog of new custom datasets and features
Notes and Uses
Great way to get easy access to many of the available FDA databases as well as curated datasets and data exploration tools. A central hub to access data from across FDA.
📄 DailyMed Drug Labels
Database containing labeling data submitted to FDA by companies.
Author/Host
National Library of Medicine
Key Features
- Drug/Biologics and Medical Device labels
- Some unapproved food and supplement labels
- Prescribing information
- API
Notes and Uses
Linking a drug to its approved uses without parsing the PDFs in Drugs@FDA.
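As a sketch of that drug-to-indication link, the snippet below uses the openFDA `drug/label` endpoint rather than DailyMed's own web services, since openFDA serves label sections as JSON fields. That substitution is my choice for the example. The generic name, brand name, and label text in the sample are invented, and no request is sent.

```python
from urllib.parse import urlencode

LABEL_URL = "https://api.fda.gov/drug/label.json"


def build_label_url(generic_name: str, limit: int = 5) -> str:
    """Search structured product labels by generic drug name."""
    search = f'openfda.generic_name:"{generic_name}"'
    return LABEL_URL + "?" + urlencode({"search": search, "limit": limit})


def indications_by_brand(response: dict) -> dict[str, str]:
    """Map each brand name to the text of its Indications and Usage section."""
    uses = {}
    for label in response.get("results", []):
        text = " ".join(label.get("indications_and_usage", []))
        for brand in label.get("openfda", {}).get("brand_name", []):
            uses[brand] = text
    return uses


# Invented response fragment in the drug label result shape.
sample = {"results": [
    {"indications_and_usage": ["1 INDICATIONS AND USAGE",
                               "Exampledrug is indicated for an example condition."],
     "openfda": {"brand_name": ["Exampledrug"], "generic_name": ["exampline"]}},
]}

print(build_label_url("exampline"))
print(indications_by_brand(sample))
```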
FDA Post-Market Safety Monitoring
After drugs are marketed and in use, we use safety monitoring data from FDA and pricing/use data from Medicare and Medicaid. Due to extensive public-private partnerships, complicated deals and rebates, and generally insufficient transparency, payments and drug prices are much trickier to track. We will provide more information about how we track, estimate, and monitor pricing in the future.
📚 FDA's Adverse Event Reporting System (FAERS)
Quarterly releases of adverse event reports as a downloadable file
Author/Host
FDA's Center for Drug Evaluation and Research (CDER)
Key Features
- Downloadable files
- Coverage from January 2004 to present
- Uniform coding for adverse events and medication errors via MedDRA terminology
Notes and Uses
FAERS is a surveillance system to monitor for problems with approved drugs and therapeutic products. This system is important for tracking side effects or other problems with approved products that may require further investigation or a recall.
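To show how the quarterly files can be put to work, here is a minimal sketch that counts reported reactions for one drug by joining the drug and reaction files on the report identifier. I am assuming the ASCII release uses `$` as the delimiter with `primaryid`, `drugname`, and `pt` (MedDRA preferred term) columns; the rows are invented and heavily trimmed.

```python
import csv
import io
from collections import Counter


def read_faers(text: str) -> list[dict]:
    """Parse a $-delimited FAERS file with a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="$", quoting=csv.QUOTE_NONE))


def reaction_counts(drug_text: str, reac_text: str, drug_name: str) -> Counter:
    """Count reaction terms across reports that mention the given drug."""
    wanted = drug_name.upper()
    report_ids = {row["primaryid"] for row in read_faers(drug_text)
                  if row["drugname"].upper() == wanted}
    counts = Counter()
    for row in read_faers(reac_text):
        if row["primaryid"] in report_ids:
            counts[row["pt"]] += 1
    return counts


# Invented sample rows.
drug_text = (
    "primaryid$caseid$drug_seq$role_cod$drugname\n"
    "1001$1$1$PS$EXAMPLEDRUG\n"
    "1002$2$1$PS$OTHERDRUG\n"
    "1003$3$1$SS$exampledrug"
)
reac_text = (
    "primaryid$caseid$pt\n"
    "1001$1$Nausea\n"
    "1001$1$Headache\n"
    "1002$2$Nausea\n"
    "1003$3$Nausea"
)

print(reaction_counts(drug_text, reac_text, "Exampledrug").most_common())
```

Counts like these are report tallies, not incidence rates: reports can be duplicated, and a report does not establish that the drug caused the event.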
📄 Medicare Part B and D
Various datasets related to medical insurance and drug coverage for the Medicare program
Author/Host
Centers for Medicare & Medicaid Services (CMS)
Key Features
- Drug pricing
- Drug usage
- Device pricing
- Inpatient and outpatient
Notes and Uses
Medicare is a federal health insurance program for people 65 or older and those with certain disabilities or conditions. Because the program is funded by the US government, some data on payments and service usage is public. Medicare is divided into Parts A through D: Parts A and B are managed by the government, while Parts C and D are run by private organizations approved by the government. We most often use data from Part D (drug coverage) and Part B (medical insurance).
📄 Medicaid
Open data portal for Medicaid and the Children's Health Insurance Program (CHIP)
Author/Host
Centers for Medicare & Medicaid Services (CMS)
Key Features
- Drug pricing
- Enrollment
- Payments
- Eligibility
Notes and Uses
Medicaid is a joint federal-state program that helps cover medical costs for people with limited income. The federal government sets rules about Medicaid, then states administer and run their own programs.
[1] See the FDA’s description of clinical trial phases.