Essential Datasets for Tracking Drug Development from Discovery to Market

The FDA breaks down the drug development process into 5 main steps:

  1. Discovery and Development
  2. Preclinical Research
  3. Clinical Research
  4. FDA Review
  5. FDA Post-Market Safety Monitoring

Here, I’ll break down some of the most useful datasets for learning what companies and researchers are doing at different stages of the drug development and delivery pipeline. I consider many of these datasets to be primary sources, and I use them when conducting research or building new data products to keep tabs on the industry.

Discovery and Development and Preclinical Research

These early stages can be the most opaque, because they carry the fewest regulatory requirements for data disclosure and companies can be quite secretive about early programs. To keep tabs on researchers and early indications, we can use academic literature, citation databases, and intellectual property records.

📚 PubMed and PubMed Central

Comprehensive database of biomedical literature

Author/Host

National Library of Medicine (NLM)

Key Features

  • Over 37 million biomedical literature citations
  • Full text peer reviewed articles and links to publisher sites
  • Advanced search capabilities and API

Notes and Uses

Search PubMed to find out what research groups are doing and to broadly survey the landscape in different fields.
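For scripted searches, one option is the `esearch` endpoint of NCBI's E-utilities. The sketch below only builds the request URL and does not send it. The helper name `build_pubmed_search_url` and the example query are illustrative assumptions, not part of any workflow described here.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint hosted by NCBI.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Return an esearch URL that asks PubMed for matching PMIDs as JSON."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # PubMed query syntax, including field tags
        "retmode": "json",      # JSON instead of the default XML
        "retmax": max_results,  # cap on the number of PMIDs returned
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query: abstracts mentioning a drug class.
url = build_pubmed_search_url("GLP-1 receptor agonist[Title/Abstract]", max_results=5)
print(url)
```

Fetching this URL should return JSON whose `esearchresult.idlist` holds the matching PMIDs, which can then be passed to other E-utilities calls for article details.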

🔍 OpenAlex

Open database of scholarly works, authors, and institutions

Author/Host

OurResearch.org

Key Features

  • Linked open data
  • API access
  • Comprehensive metadata

Notes and Uses

OpenAlex's knowledge graph allows us to enhance our PubMed searches by looking for networks of researchers, drugs, or device targets. It also helps us quickly gain context and identify key players or areas in a field.
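As a rough illustration of the "networks of researchers" idea, the sketch below counts co-author pairs across a set of works. It assumes each record follows the shape of an OpenAlex work object (an `authorships` list whose entries carry an `author.display_name`), such as those returned by `https://api.openalex.org/works`. The sample records and author names are made up.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(works: list[dict]) -> Counter:
    """Count how often each pair of authors appears on the same work."""
    pairs = Counter()
    for work in works:
        # De-duplicate and sort names so each pair has one canonical order.
        names = sorted(
            {a["author"]["display_name"] for a in work.get("authorships", [])}
        )
        # Every unordered pair of authors on a work is one collaboration edge.
        pairs.update(combinations(names, 2))
    return pairs


# Made-up records that mimic the shape of OpenAlex work objects.
sample_works = [
    {"authorships": [{"author": {"display_name": "A. Rivera"}},
                     {"author": {"display_name": "B. Chen"}}]},
    {"authorships": [{"author": {"display_name": "B. Chen"}},
                     {"author": {"display_name": "A. Rivera"}},
                     {"author": {"display_name": "C. Okafor"}}]},
]

edges = coauthor_pairs(sample_works)
print(edges.most_common(1))
```

The pairs with the highest counts point to the tightest collaborations. In practice, keying on the author `id` rather than the display name would avoid merging different people who share a name.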

📄 USPTO Patent Data Portal

Official source for U.S. Patent and Trademark Data

Author/Host

US Patent and Trademark Office (USPTO)

Key Features

  • Bulk data downloads
  • API access
  • Historical and current patent information

Notes and Uses

USPTO data is the authoritative source for intellectual property information in the US. We search it to map the competitive landscape of an industry or to see what products a company is working on.

📄 PatentsView Curated Datasets

A data curation project from the Office of the Chief Economist of USPTO.

Author/Host

USPTO Office of Chief Economist and Collaborators

Key Features

  • Curated patent datasets
  • Bulk data downloads
  • API access
  • Historical and current patent information in an easily consumable format

Notes and Uses

USPTO data is extensive, detailed, and can be challenging to work with. PatentsView creates summarized and otherwise de-duplicated datasets that often contain exactly the information we need in a consumable format. One particularly useful feature is the applicant and assignee name disambiguation.

Clinical Research

Clinical trials are vital to the drug and biologic approval process because they demonstrate safety and effectiveness. Because the stakes are higher in clinical research, there are more regulations around how research is conducted and how transparently it is reported. Clinical trials are the most important data source for keeping tabs on regulated products moving through the pipeline. Trials are typically numbered Phase 1 through 4. Phase 1 is a safety and dosage study in a small group of 20-100 healthy volunteers or people with the disease/condition, and Phase 4 is a post-approval safety and efficacy study in several thousand volunteers who have the disease/condition.¹

📄 ClinicalTrials.gov

Database of clinical research studies from around the world along with metadata, trial information and results.

Author/Host

National Library of Medicine

Key Features

  • Interventions and conditions
  • Trial sponsors
  • Trial phase
  • Important dates for results

Notes and Uses

Use it to search for active trials or trials that will end and report results soon. Note that while sponsors are required to provide information, the FDA and NLM do not check all submissions for accuracy or ensure that data is supplied in a timely manner. It is the sponsor's responsibility to provide information, and they can edit or change it at any time.

📄 Aggregate Analysis of Clinical Trials (AACT)

AACT is a pre-processed PostgreSQL database of all the data in ClinicalTrials.gov, updated daily.

Author/Host

Clinical Trials Transformation Initiative (CTTI)

Key Features

  • Relational database version of ClinicalTrials.gov
  • Direct access to PostgreSQL
  • Data downloads
  • Curated project data

Notes and Uses

Easily design direct SQL queries to create your own datasets in bulk. Rather than chaining API calls or parsing XML from ClinicalTrials.gov, you can build whatever dataset you need with SQL.
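To give a feel for this, here is a minimal sketch of the kind of query we mean: Phase 3 trials with a primary completion date in a given year, joined to their lead sponsor. The table and column names (`ctgov.studies`, `ctgov.sponsors`, `lead_or_collaborator`) and the `PHASE3` label are assumptions about the AACT schema and should be checked against its data dictionary. So that the example runs without credentials, it executes against an in-memory SQLite stand-in filled with made-up rows rather than the real PostgreSQL database.

```python
import sqlite3

# Query written against AACT-style names (assumed schema, see the lead-in).
QUERY = """
SELECT s.nct_id, s.brief_title, s.primary_completion_date, sp.name AS lead_sponsor
FROM ctgov.studies AS s
JOIN ctgov.sponsors AS sp
  ON sp.nct_id = s.nct_id AND sp.lead_or_collaborator = 'lead'
WHERE s.phase = 'PHASE3'
  AND s.primary_completion_date BETWEEN '2025-01-01' AND '2025-12-31'
ORDER BY s.primary_completion_date
"""

# SQLite stand-in: attach a second in-memory database named "ctgov"
# so the schema-qualified table names in QUERY resolve.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ctgov")
conn.executescript("""
CREATE TABLE ctgov.studies (
    nct_id TEXT, brief_title TEXT, phase TEXT, primary_completion_date TEXT);
CREATE TABLE ctgov.sponsors (
    nct_id TEXT, name TEXT, lead_or_collaborator TEXT);
INSERT INTO ctgov.studies VALUES
    ('DEMO0001', 'Made-up trial A', 'PHASE3', '2025-09-30'),
    ('DEMO0002', 'Made-up trial B', 'PHASE3', '2025-03-15'),
    ('DEMO0003', 'Made-up trial C', 'PHASE1', '2025-06-01');
INSERT INTO ctgov.sponsors VALUES
    ('DEMO0001', 'Example Pharma', 'lead'),
    ('DEMO0002', 'Sample Bio', 'lead'),
    ('DEMO0002', 'Example University', 'collaborator'),
    ('DEMO0003', 'Example Pharma', 'lead');
""")

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Against the real database, the same query text would be sent through a PostgreSQL client instead of `sqlite3`, and the stand-in tables would not be needed.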

FDA Review

During FDA review, most data is confidential and embargoed. However, once drugs, devices, or biologics are approved, FDA provides a lot of useful information across many databases and APIs. Below are some of the data sources we use most frequently to track companies and learn more about commercial activities and opportunities.

📄 Drugs@FDA

A searchable database of all FDA-approved products for human use

Author/Host

FDA's Center for Drug Evaluation and Research (CDER)

Key Features

  • Drug names
  • Ingredients
  • Sponsors
  • Marketing material

Notes and Uses

Searching for all the products associated with a given active ingredient.

📄 The Orange Book

A searchable database of all FDA-approved drugs, therapeutic equivalents (generics), and associated intellectual property and exclusivity information.

Author/Host

FDA's Center for Drug Evaluation and Research (CDER)

Key Features

  • Drug names and ingredients
  • Generic drugs
  • Patents and FD&C exclusivity designations
  • Estimated exclusivity dates

Notes and Uses

Useful for linking intellectual property information to companies and drugs. We have a custom Orange Book database that allows us to answer questions like: "When does the exclusivity protection end for <drug>?" or "How many drugs approved in 2000 have a generic in 2024?"
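As a rough sketch of the first question, the example below joins a products table to an exclusivity table and returns the latest exclusivity date for a trade name. It assumes the tilde-delimited layout of the Orange Book data files and uses only a subset of their columns. The rows, drug names, and application numbers are made up, and this is not our production database.

```python
import csv
import io
from datetime import datetime

# Made-up rows in the tilde-delimited layout assumed for the data files.
PRODUCTS_TXT = """Trade_Name~Appl_Type~Appl_No~Product_No
EXAMPLEDRUG~N~999999~001
OTHERDRUG~N~888888~001
"""
EXCLUSIVITY_TXT = """Appl_Type~Appl_No~Product_No~Exclusivity_Code~Exclusivity_Date
N~999999~001~NCE~Mar 1, 2027
N~999999~001~ODE~Jun 15, 2030
N~888888~001~NCE~Jan 10, 2026
"""


def read_tilde(text: str) -> list[dict]:
    """Parse tilde-delimited text with a header row into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="~"))


def last_exclusivity_date(trade_name, products, exclusivity):
    """Return the latest exclusivity date for a trade name, or None."""
    # A product is identified by application type, number, and product number.
    keys = {
        (p["Appl_Type"], p["Appl_No"], p["Product_No"])
        for p in products
        if p["Trade_Name"] == trade_name
    }
    dates = [
        datetime.strptime(e["Exclusivity_Date"], "%b %d, %Y").date()
        for e in exclusivity
        if (e["Appl_Type"], e["Appl_No"], e["Product_No"]) in keys
    ]
    return max(dates) if dates else None


products = read_tilde(PRODUCTS_TXT)
exclusivity = read_tilde(EXCLUSIVITY_TXT)
print(last_exclusivity_date("EXAMPLEDRUG", products, exclusivity))
```

Exclusivity and patent protection are listed separately, so a fuller answer would also take the latest patent expiration for the same product keys into account.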

📄 The Purple Book

A disappointing version of the Orange Book for CBER-regulated biologics

Author/Host

FDA's Center for Biologics Evaluation and Research (CBER)

Key Features

  • List of biologics and sponsors
  • Dose and administration details
  • Exclusivity and expiration

Notes and Uses

This resource is much less comprehensive than the Orange Book. For one, there are no patent listing requirements for biologics, so we don't have a great idea of the exclusivity period or the strategies companies will use to protect IP. It is still the most reliable listing of CBER-regulated biologics, and we use it just like the Orange Book.

📄 Devices@FDA

Searchable catalog of FDA regulated medical devices.

Author/Host

FDA's Center for Devices and Radiological Health (CDRH)

Key Features

  • Device summaries
  • Device manufacturer
  • Approval date
  • User instructions

Notes and Uses

We are using this database to research recently approved electroencephalograph devices.

📄 FDA Open Data Portal

Easy access to APIs and bulk data downloads from FDA

Author/Host

FDA Office of Health Informatics

Key Features

  • API access to key datasets
  • Unified portal for FDA data
  • Catalog of new custom datasets and features

Notes and Uses

A central hub for data from across FDA, and a great way to get easy access to many of the available FDA databases as well as curated datasets and data exploration tools.
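For a sense of what API access looks like, here is a small sketch that builds a query URL for the openFDA API at `api.fda.gov`. It assumes the `drug/drugsfda` endpoint and the `openfda.generic_name` search field. The helper name `build_openfda_url` and the drug used in the query are illustrative, and the request itself is not sent.

```python
from urllib.parse import urlencode

OPENFDA_BASE = "https://api.fda.gov"


def build_openfda_url(endpoint: str, search: str, limit: int = 10) -> str:
    """Build an openFDA query URL for an endpoint such as 'drug/drugsfda'."""
    query = urlencode({
        "search": search,  # field:value expression in openFDA search syntax
        "limit": limit,    # maximum number of records to return
    })
    return f"{OPENFDA_BASE}/{endpoint}.json?{query}"


# Illustrative query: approval records whose generic name matches "metformin".
url = build_openfda_url("drug/drugsfda", 'openfda.generic_name:"metformin"', limit=5)
print(url)
```

The same helper works for other endpoints by changing the `endpoint` and `search` arguments, which is much of the appeal of having one portal in front of many datasets.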

📄 DailyMed Drug Labels

Database containing labeling data submitted to FDA by companies.

Author/Host

National Library of Medicine

Key Features

  • Drug/Biologics and Medical Device labels
  • Some unapproved food and supplement labels
  • Prescribing information
  • API

Notes and Uses

Linking a drug to its approved uses without parsing the PDFs in Drugs@FDA.

FDA Post-Market Safety Monitoring

After drugs are marketed and in use, we use safety monitoring data from FDA and pricing and usage data from Medicare and Medicaid. Due to extensive public-private partnerships, complicated deals and rebates, and generally insufficient transparency, payments and drug prices are much harder to track. We will provide more information about how we track, estimate, and monitor pricing in the future.

📚 FDA's Adverse Event Reporting System (FAERS)

Quarterly releases of adverse event reports as a downloadable file

Author/Host

FDA's Center for Drug Evaluation and Research (CDER)

Key Features

  • Downloadable files
  • Coverage from January 2004 to present
  • Uniform coding for adverse events and medication errors via MedDRA terminology

Notes and Uses

FAERS is a surveillance system to monitor for problems with approved drugs and therapeutic products. This system is important for tracking side effects or other problems with approved products that may require further investigation or a recall.
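A minimal sketch of working with the quarterly files: find every report that mentions a drug, then tally the reaction terms on those reports. It assumes the `$`-delimited ASCII layout, with a drug file and a reaction file linked by `primaryid`, and shows only a subset of the columns. All rows and drug names below are made up.

```python
import csv
import io
from collections import Counter

# Made-up rows in the '$'-delimited layout assumed for the quarterly files.
DRUG_TXT = """primaryid$caseid$drug_seq$role_cod$drugname
1001$501$1$PS$EXAMPLEDRUG
1002$502$1$PS$EXAMPLEDRUG
1002$502$2$C$OTHERDRUG
1003$503$1$PS$OTHERDRUG
"""
REAC_TXT = """primaryid$caseid$pt
1001$501$Nausea
1002$502$Nausea
1002$502$Headache
1003$503$Rash
"""


def read_dollar(text: str) -> list[dict]:
    """Parse '$'-delimited text with a header row into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="$"))


def reaction_counts(drug_name: str, drug_rows, reac_rows) -> Counter:
    """Count reaction terms (pt) on reports that mention the given drug."""
    wanted = drug_name.upper()
    report_ids = {
        d["primaryid"] for d in drug_rows if d["drugname"].upper() == wanted
    }
    return Counter(r["pt"] for r in reac_rows if r["primaryid"] in report_ids)


counts = reaction_counts("ExampleDrug", read_dollar(DRUG_TXT), read_dollar(REAC_TXT))
print(counts.most_common())
```

These are counts of reports, not rates of occurrence, so they are best treated as signals for further investigation rather than as evidence that a drug caused an event.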

📄 Medicare Part B and D

Various datasets related to medical insurance and drug coverage for the Medicare program

Author/Host

Centers for Medicare & Medicaid Services (CMS)

Key Features

  • Drug pricing
  • Drug usage
  • Device pricing
  • Inpatient and outpatient

Notes and Uses

Medicare is a federal health insurance program for people 65 or older and those with certain disabilities or conditions. Because the program is funded by the US government, some data on payments and service usage is available. Medicare is divided into Parts A through D: Parts A and B are managed by the government, while Parts C and D are managed by private organizations approved by the government. We most often use data from Part D (drug coverage) and Part B (medical insurance).

📄 Medicaid

Open data portal for Medicaid and the Children's Health Insurance Program (CHIP)

Author/Host

Centers for Medicare & Medicaid Services (CMS)

Key Features

  • Drug pricing
  • Enrollment
  • Payments
  • Eligibility

Notes and Uses

Medicaid is a joint federal-state program that helps cover medical costs for people with limited income. The federal government sets rules about Medicaid, then states administer and run their own programs.