About

Methodology & Transparency

Every data point has a source. Every pipeline is documented.

Most biotech intelligence products are black boxes. Data appears in flashy graphics, but without explanation of where it came from, how it was cleaned, or what might be wrong with it. RxDataLab documents every step of our pipeline, from source to display. Every data point traces back to its original filing.

Data Sources

We pull exclusively from primary regulatory sources: filings and registrations that are made because they are legally required.

  • ClinicalTrials.gov — The NLM registry of clinical research. 450K+ trials, updated via the AACT database (Aggregate Analysis of ClinicalTrials.gov).

  • SEC EDGAR — Electronic filings with the Securities and Exchange Commission. We parse Form 4 (insider trading disclosures), 8-K and 6-K (material event notices), Schedules 13D and 13G (beneficial ownership and institutional fund positions), 13F (institutional holdings), Form D (private capital raises), and 10-K and 10-Q financial statements. Synced with the SEC multiple times daily.

  • FDA Orange Book — The FDA’s list of approved drug products with patent and exclusivity data. Updated monthly, typically around the 25th. We track historical changes to surface what is new each cycle.

  • USPTO Patent Records — Patent filings linked to drug programs and companies, used to contextualize pipeline and exclusivity data.

  • Financial market data — Financial data linked to company records for context alongside regulatory activity.
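The Orange Book change tracking mentioned above can be pictured as a diff between two monthly snapshots. The following is a minimal sketch under stated assumptions, not a description of our production pipeline: the `Listing` record, its three fields, and the sample values are hypothetical and much simpler than the real Orange Book files.

```python
from typing import List, NamedTuple, Tuple


class Listing(NamedTuple):
    # Hypothetical, simplified record; real Orange Book files carry more fields.
    application_number: str
    product_number: str
    patent_number: str


def diff_snapshots(
    previous: List[Listing], current: List[Listing]
) -> Tuple[List[Listing], List[Listing]]:
    """Return (added, removed) listings between two monthly snapshots."""
    prev_set, curr_set = set(previous), set(current)
    added = sorted(curr_set - prev_set)      # present now, absent last cycle
    removed = sorted(prev_set - curr_set)    # present last cycle, absent now
    return added, removed


# Made-up sample data for illustration only.
january = [
    Listing("N000001", "001", "1111111"),
    Listing("N000002", "001", "2222222"),
]
february = [
    Listing("N000001", "001", "1111111"),
    Listing("N000003", "001", "3333333"),
]

added, removed = diff_snapshots(january, february)
```

Treating each row as an immutable tuple keeps the comparison a plain set difference; a real implementation would also need to decide which fields count as identity and which count as a change to an existing listing.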

Linking Companies Across Databases

Raw data is not unique. All the data vendors are selling similar data, even if most won’t tell you that. The challenge that RxDataLab addresses is linking the data and surfacing it in a useful way. For example, ClinicalTrials.gov is naturally organized by study, and FDA sources are identified by NDA/ANDA number or by drug name and ingredient. Both have a free-text “sponsor” field identifying the company responsible for the trial or compound, but sponsor names vary wildly and change over time, which makes it difficult to answer a question like “show me all of Merck’s trials”.

The SEC, on the other hand, is organized around reporting entities (companies and individuals) and maps each entity to a unique identifier called a CIK (Central Index Key). CIKs are unique and relatively persistent. They do not change when a company changes its name, and they persist from venture funding through an IPO as long as the company is filing with the agency.¹
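Because CIKs show up in several spellings (a bare integer, a zero-padded string, or with a "CIK" prefix), a join key needs one canonical form. EDGAR's JSON endpoints use the ten-digit zero-padded form, so the sketch below normalizes to that. The function name `normalize_cik` and the sample number are assumptions for illustration, not part of any SEC library.

```python
from typing import Union


def normalize_cik(raw: Union[int, str]) -> str:
    """Return a CIK as a ten-digit, zero-padded string."""
    text = str(raw).strip()
    if text.upper().startswith("CIK"):
        text = text[3:]                      # drop an optional "CIK" prefix
    if not (text.isascii() and text.isdigit()) or len(text) > 10:
        raise ValueError(f"not a valid CIK: {raw!r}")
    return text.zfill(10)


# 1234567 is a made-up number used only to show the three input spellings.
same_entity = {
    normalize_cik(1234567),
    normalize_cik("0001234567"),
    normalize_cik("CIK0001234567"),
}
```

All three spellings collapse to a single key, which is the property a cross-database join depends on.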

In public markets, we identify companies by stock ticker, and sometimes by more exotic (but reliable) identifiers like CUSIPs or FIGI IDs.

For RxDataLab’s customers and our internal research, a company-first orientation is the most useful. Our first organizing principle is therefore the company, and all records from clinical trial and FDA databases are harmonized to a canonical company name.

[Diagram: RxDataLab.com entity model. Node labels: Company (RxID), Bio, Hedge Fund, Clinical Trials, SEC Filings, FDA Filings, Market Data, Portfolio Holdings, Ownership Filings, CIK / CUSIP.]

Our harmonization pipeline uses a two-stage approach: deterministic rule-based matching first (name normalization, legal suffix stripping, ticker lookups against SEC company data), followed by LLM-assisted disambiguation for cases that can’t be resolved deterministically. Every LLM-generated match is validated programmatically against official SEC records or manually verified before it enters production. There is always human validation for any edge cases or cases with lower confidence scores. We documented this process briefly in How I Use LLMs for Data Harmonization.
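The deterministic first stage can be sketched roughly as follows. This is an illustrative approximation, not our production code: the suffix list, the `CANONICAL` alias table, and the function names are all assumptions made for the example, and a real alias table would be far larger.

```python
import re
from typing import Optional

# Illustrative suffix list; a production list would be longer.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc",
}
TRAILING_JUNK = LEGAL_SUFFIXES | {"&", "and"}

# Hypothetical alias table mapping normalized names to a canonical record.
CANONICAL = {
    "merck": "Merck & Co., Inc.",
    "merck sharp & dohme": "Merck & Co., Inc.",
}


def normalize_sponsor(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9& ]+", " ", name.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in TRAILING_JUNK:
        tokens.pop()
    return " ".join(tokens)


def match_sponsor(name: str) -> Optional[str]:
    """Return the canonical company, or None to escalate to the next stage."""
    return CANONICAL.get(normalize_sponsor(name))
```

A `None` result is the signal that a name could not be resolved deterministically and should move on to the disambiguation and validation steps described above.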

The result is a maintained mapping of 500+ public companies with canonical company records linking the various databases, enabling us to quickly query across the regulatory landscape.

SEC Filing Analysis for Biotech

8-K filing topics for JPM Healthcare week, 2024 — classified by our signal detection pipeline into seven categories. From [The Biggest Disclosure Day in Biotech](/research/jpm-healthcare-8k-analysis/).

We track every SEC filing for the companies in our database, and we parse a subset of these filings into structured, normalized database tables. This preserves data lineage and enables relational queries across filing types. In particular, our tracking of beneficial ownership and insider activity in biotech is unparalleled. Here is a subset of the forms we track more closely, and why:

Form 4 (insider activity): A required filing for company insiders and owners of more than 10% of a company’s stock. We track both the filers and the companies, which lets us monitor and aggregate insider activity by person, company, or time period.

Schedule 13 (beneficial ownership): Mandated filings for owners of more than 5% of a company’s stock. We parse this information and use it across our database to track institutional investor activity.

8-K signal detection: We use a custom ML pipeline to parse and classify 8-K filings into categories (clinical trial events, FDA actions, financing events, licensing deals, strategic review). This approach allows us to continually scan and surface biotech signals across the market in an unbiased way. This can be especially useful for less positive news that companies don’t want to advertise via press release. The data pipeline also allows you to track specific competitors or therapeutic areas so you don’t miss a thing. You can see an example on our free public Biotech Strategic Signals Dashboard.
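To make the idea of filing classification concrete, here is a deliberately simple keyword baseline over the five categories named above. It is a stand-in for illustration only and is not the ML pipeline we run; the keyword lists, category keys, and the `classify` function are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword lists, chosen only to illustrate the categories.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "clinical_trial_event": ("topline", "primary endpoint", "enrollment"),
    "fda_action": ("complete response letter", "pdufa", "clinical hold"),
    "financing_event": ("public offering", "private placement"),
    "licensing_deal": ("license agreement", "upfront payment"),
    "strategic_review": ("strategic alternatives", "reduction in force"),
}


def classify(text: str) -> List[str]:
    """Return matching categories, strongest keyword count first."""
    lowered = text.lower()
    scores = {
        category: sum(keyword in lowered for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    hits = [category for category, score in scores.items() if score > 0]
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(hits, key=lambda category: (-scores[category], category))


labels = classify(
    "The Company is exploring strategic alternatives and announced "
    "a reduction in force."
)
```

A baseline like this is easy to audit but brittle against paraphrase, which is one reason a learned classifier is the more practical choice at scale.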

Note: RxDataLab builds and maintains its own document parsers and data pipelines. We do not use third-party data vendors for SEC data, which means we control provenance end-to-end. Every filing displayed on the platform links to the original SEC document.

Trial Classification with Ontologies

ClinicalTrials.gov tags studies with MeSH (Medical Subject Headings) terms. MeSH is the NLM’s controlled vocabulary for indexing biomedical literature. For disease conditions, MeSH works reasonably well, allowing us to segment trials for Duchenne muscular dystrophy and then zoom out to trials covering all muscular dystrophies. For drug interventions, MeSH struggles to capture the context needed to look at the broad classes of interventions or mechanisms of action that we are interested in, like monoclonal antibodies or BTK inhibitors.

To address this, RxDataLab uses two domain-specific ontologies as the backbone of our competitive landscape features. MONDO (Monarch Disease Ontology) provides a unified disease hierarchy that maps well onto the MeSH-classified trials and allows cross-referencing to classification systems like ICD, SNOMED CT, and Orphanet. For interventions, we use NCIt (NCI Thesaurus), which provides mechanism-of-action drug classes as first-class concepts, allowing users to search for PARP inhibitors, ADCs, bispecifics, KRAS inhibitors, and more.
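The "zoom out" behavior that a disease hierarchy enables comes down to walking parent links. The sketch below uses a tiny hand-written hierarchy with plain-text labels; it does not use real MONDO identifiers, and the `PARENTS` map, trial IDs, and function names are hypothetical.

```python
from typing import Dict, List, Set

# Toy hierarchy (child -> parents). Labels are illustrative, not MONDO IDs.
PARENTS: Dict[str, List[str]] = {
    "Duchenne muscular dystrophy": ["muscular dystrophy"],
    "Becker muscular dystrophy": ["muscular dystrophy"],
    "muscular dystrophy": ["neuromuscular disease"],
    "neuromuscular disease": [],
}


def ancestors(term: str) -> Set[str]:
    """Collect every ancestor of a term by following parent links."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def trials_under(concept: str, trial_conditions: Dict[str, str]) -> List[str]:
    """Return trials whose condition is the concept or one of its descendants."""
    return sorted(
        trial_id
        for trial_id, condition in trial_conditions.items()
        if condition == concept or concept in ancestors(condition)
    )


# Made-up trial IDs for illustration.
trials = {
    "TRIAL-1": "Duchenne muscular dystrophy",
    "TRIAL-2": "Becker muscular dystrophy",
    "TRIAL-3": "neuromuscular disease",
}
muscular_dystrophy_trials = trials_under("muscular dystrophy", trials)
```

Querying the broader concept returns both the Duchenne and Becker trials while excluding the trial tagged only at the more general level, which is the behavior described for segmenting and zooming out.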

Ontologies provide the infrastructure to join across completely different datasets. For example, MONDO’s ICD-10 cross-references allow us to connect trial landscape data to CMS claims, providing real-world patient context. NCIt carries FDA UNII codes that link trial interventions to Orange Book patent and exclusivity records. It also carries cross-references to other vocabularies such as RxNorm, along with UMLS CUI codes that bridge to SNOMED CT and EHR data.

Scope and Limitations

What we track: Companies with active registrations on ClinicalTrials.gov and mandated US regulatory filings. If a program has not reached the clinical registry, it is not in our database. We are not a news aggregator or retail catalyst tracker.

  • Filing compliance gaps: Not everyone files appropriately. For example, many companies raise private capital without filing, despite legal requirements. We cover this in Building a Form D Database.
  • Parser Accuracy and Quality Control: Our parsers and classifiers are thoroughly tested, but errors can occur. We use heuristics and automated anomaly detection to catch issues before they surface, and users can flag anything that slips through. Every data point links to its original source so discrepancies can be verified directly.
  • Not real-time: Our aggregators and scanners run on an hourly or daily cadence. We are not a trading platform. Our goal is to provide reliable information about the business of biotech in a useful way.

Further Reading

For technical detail on specific areas of the pipeline:


  1. CIKs persist through name changes and most corporate restructurings. Acquisitions and spin-offs can result in new CIK assignments; we track these transitions as part of our entity resolution process.


Custom Tools

We build tools like this for research teams

If you have a specific monitoring need, like a therapeutic area, a fund, or a pipeline segment, and want it packaged into something your team can actually use, reach out. We work with BD, strategy, and investor teams who need custom data infrastructure built on primary sources.
