Methodology & Transparency

Most biotech intelligence products are black boxes. Data appears in flashy graphics with no explanation of where it came from, how it was cleaned, or what its limitations are. RxDataLab documents every step of our pipeline, from source to display. Every data point traces back to its original filing.

Key Takeaways

Data comes from primary regulatory sources such as ClinicalTrials.gov, SEC EDGAR, FDA filings, USPTO records, and more. No third-party data vendors.
Every data point is traceable back to the original source. Data lineage is preserved end-to-end.

500+ public biotech companies are harmonized across all databases using a multi-stage entity resolution pipeline.
Trial and intervention classification uses biomedical ontologies such as MONDO (disease hierarchy), NCIt (mechanism-of-action drug classes), and RxNorm. This enables joins across RxDataLab databases and to external datasets such as CMS claims and EHR data.
RxDataLab is not a news aggregator or retail catalyst tracker. Coverage is limited to companies with active clinical registrations and mandated US regulatory filings.

For the reasoning behind these choices, see our data philosophy.

Data Sources

We pull exclusively from primary regulatory sources: filings and registrations that are made because they are legally required.

ClinicalTrials.gov: The NLM registry of clinical research. 450K+ trials, updated via the AACT database (Aggregate Analysis of ClinicalTrials.gov).
SEC Filings: Electronic filings with the Securities and Exchange Commission. We parse Form 4 (insider trading disclosures), 8-K and 6-K (material event notices), Schedules 13D and 13G (beneficial ownership and institutional fund positions), 13F (institutional holdings), Form D (private capital raises), and 10-K and 10-Q financial statements. Synced with the SEC multiple times daily.
FDA Orange and Purple Books: The FDA’s lists of approved drug and biologic products with patent and exclusivity data. We track historical changes to surface what is new each cycle.
USPTO Patent Records: Patent filings linked to drug programs and companies, used to contextualize pipeline and exclusivity data.
Financial market data: Market data linked to company records, for context alongside regulatory activity.
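
The change tracking described for the Orange and Purple Books can be pictured as a set difference between two snapshots. The sketch below is only an illustration of that idea, not a description of RxDataLab's parsers: the record key (application number, product number, patent number) and every identifier in it are made up.

```python
from typing import Dict, Set, Tuple

# Hypothetical record key: (application number, product number, patent number).
Key = Tuple[str, str, str]


def diff_snapshots(previous: Set[Key], current: Set[Key]) -> Dict[str, Set[Key]]:
    """Compare two snapshots and report which records were added or removed."""
    return {
        "added": current - previous,
        "removed": previous - current,
    }


# Toy snapshots with made-up identifiers.
last_cycle = {("N000001", "001", "P-111"), ("N000002", "001", "P-222")}
this_cycle = {("N000001", "001", "P-111"), ("N000003", "001", "P-333")}

changes = diff_snapshots(last_cycle, this_cycle)
print("added:", sorted(changes["added"]))
print("removed:", sorted(changes["removed"]))
```

Keeping each cycle's snapshot, rather than only the latest one, is what makes this kind of comparison possible after the fact.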

Linking Companies Across Databases

Raw data is not unique. Most data vendors sell similar data, even if few will tell you that. The challenge that RxDataLab addresses is linking the data and surfacing relevant information in a useful way. For example, ClinicalTrials.gov is organized by study, while FDA sources are keyed by NDA/ANDA number or by drug name and ingredient. Both include a free-text “sponsor” field identifying the company responsible for the trial or compound, but sponsor names vary widely and change over time. That makes it hard to answer a question like “show me all of Merck’s trials” (even for an LLM).

The SEC is organized around reporting entities (companies and individuals) and maps each entity to a unique identifier called a CIK (Central Index Key). CIKs are unique and relatively persistent. They do not change when a company changes its name, and they persist from venture funding through an IPO as long as the company is filing with the agency¹.
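
As a small illustration of working with CIKs: they show up with or without leading zeros, and EDGAR's public submissions endpoint expects the 10-digit zero-padded form. The helper names below are hypothetical and the CIK value is a placeholder, not a real company.

```python
def normalize_cik(raw) -> str:
    """Return a CIK as a 10-digit, zero-padded string."""
    text = str(raw).strip()
    if text.upper().startswith("CIK"):
        text = text[3:]
    if not (text.isascii() and text.isdigit()) or len(text) > 10:
        raise ValueError(f"not a valid CIK: {raw!r}")
    return text.zfill(10)


def submissions_url(raw) -> str:
    """Build the EDGAR submissions URL for a filer."""
    return f"https://data.sec.gov/submissions/CIK{normalize_cik(raw)}.json"


# Placeholder CIK for illustration only.
print(normalize_cik(1234567))
print(submissions_url("CIK0001234567"))
```

Normalizing to one form before storing or joining avoids treating "1234567" and "0001234567" as different entities.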

In public markets, we identify companies by stock ticker, and sometimes by identifiers such as CUSIPs or FIGIs.

For RxDataLab’s customers and our internal research, a company-first orientation is helpful, so our first organizing principle is the company. All records from clinical trials and FDA databases are harmonized to a canonical company ID.

Our harmonization pipeline uses a two-stage approach: deterministic rule-based matching first (name normalization, legal suffix stripping, ticker lookups against SEC company data), followed by LLM-assisted disambiguation for cases that can’t be resolved deterministically. Every LLM-generated match is validated programmatically against official SEC records or manually verified before it enters production. There is always human validation for any edge cases or cases with lower confidence scores. We documented a version of this process in How I Use LLMs for Data Harmonization.
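
To make the deterministic first stage concrete, here is a minimal sketch of name normalization and legal suffix stripping followed by an exact lookup. It is an assumption-laden illustration, not our production rules: the suffix list, the function names, the company "Acme Therapeutics", and the ID "company-001" are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical, incomplete list of legal suffixes to strip from the end of a name.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc", "sa", "ag", "nv", "gmbh",
}
# A dangling "and" is left behind when "& Co." is stripped, so drop it too.
TRAILING = LEGAL_SUFFIXES | {"and"}


def normalize_name(name: str) -> str:
    """Lowercase, remove punctuation, and strip trailing legal suffixes."""
    text = name.lower().replace("&", " and ")
    text = text.replace(".", "")              # "S.A." -> "sa"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # other punctuation -> space
    tokens = text.split()
    while tokens and tokens[-1] in TRAILING:
        tokens.pop()
    return " ".join(tokens)


def resolve(sponsor: str, index: Dict[str, str]) -> Optional[str]:
    """Return a canonical company ID, or None if no deterministic match exists."""
    return index.get(normalize_name(sponsor))


# Toy index keyed by normalized name.
index = {"acme therapeutics": "company-001"}

for sponsor in ["Acme Therapeutics, Inc.", "ACME THERAPEUTICS INC", "Acme Tx Holdings Ltd."]:
    print(sponsor, "->", resolve(sponsor, index))
```

Names that come back as None are the ones that would move on to the second, LLM-assisted stage and its validation checks.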

The result is a maintained mapping of 500+ public companies with canonical company records linking the various databases. Every data product and export uses these identifiers — the same company resolves correctly whether it appears as a ClinicalTrials.gov sponsor, an SEC filer, or an Orange Book applicant.

SEC Filing Analysis for Biotech

8-K filing topics for JPM Healthcare week, 2024, classified by our signal detection pipeline into seven categories. From [The Biggest Disclosure Day in Biotech](/research/jpm-healthcare-8k-analysis/).

We track every SEC filing for the companies in our database, and we parse a subset of these filings into structured, normalized database tables. This preserves data lineage and enables relational queries across filing types. In particular, our tracking of beneficial ownership and insider activity in biotech is unparalleled. Here is a subset of the forms we track more closely, and why:

Form 4 (insider activity): A required filing for company insiders and >10% owners. We track both the filers and the companies, so activity can be aggregated by filer or by company.

Schedule 13 (beneficial ownership): Mandated filings for >5% owners. We parse this information and use it across our database to track institutional investor activity.

8-K signal detection: We use a custom ML pipeline to parse and classify 8-K filings into categories, including clinical trial events, FDA actions, financing events, licensing deals, and strategic reviews. This approach allows us to continually scan and surface biotech signals across the market in an unbiased way. It is especially useful for less positive news that companies don’t want to advertise via press release. The classified events are available as a structured data export or live on the public Biotech Strategic Signals Dashboard.
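
For readers who want a feel for what classifying a filing into these categories means, the sketch below uses plain keyword matching. This is deliberately a much simpler stand-in, not the ML pipeline described above; the keyword lists and category labels are hypothetical.

```python
from typing import Dict, List

# Hypothetical keyword rules; a real classifier would be far more robust.
KEYWORDS: Dict[str, List[str]] = {
    "clinical_trial_event": ["topline", "phase 1", "phase 2", "phase 3"],
    "fda_action": ["complete response letter", "pdufa", "clinical hold"],
    "financing_event": ["private placement", "underwriting agreement", "at-the-market"],
    "licensing_deal": ["license agreement", "collaboration agreement"],
    "strategic_review": ["strategic alternatives", "reduction in force"],
}


def classify(text: str) -> List[str]:
    """Return every category whose keywords appear in the filing text."""
    lowered = text.lower()
    return [
        category
        for category, terms in KEYWORDS.items()
        if any(term in lowered for term in terms)
    ]


# Toy filing excerpts.
sample_a = "The Company will explore strategic alternatives and a reduction in force."
sample_b = "Following topline Phase 3 results, the Company entered into an underwriting agreement."

print(classify(sample_a))
print(classify(sample_b))
```

A filing can land in more than one category, which is why the function returns a list rather than a single label.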

Note: RxDataLab builds and maintains its own document parsers and data pipelines. We do not use third-party data vendors for SEC data, which means we control provenance end-to-end. Every filing displayed on the platform links to the original SEC document.

Trial Classification with Ontologies

ClinicalTrials.gov tags studies with MeSH (Medical Subject Headings) terms. MeSH is the NLM’s controlled vocabulary for indexing biomedical literature. For disease conditions, MeSH works reasonably well, allowing us to segment trials for Duchenne muscular dystrophy and then zoom out to trials covering all muscular dystrophies. For drug interventions, MeSH struggles to capture the context needed to look at the broad classes of interventions or mechanisms of action that we are interested in, like monoclonal antibodies or BTK inhibitors.

To address this, RxDataLab uses two domain-specific ontologies as the backbone of our competitive landscape features. MONDO (Monarch Disease Ontology) provides a unified disease hierarchy that maps well onto the MeSH-classified trials and allows cross-referencing to classification systems like ICD, SNOMED CT, and Orphanet. For interventions, we use NCIt (NCI Thesaurus), which provides mechanism-of-action drug classes as first-class concepts, allowing users to search for PARP inhibitors, ADCs, bispecifics, KRAS inhibitors, and more.
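
The "zoom out" behavior that a disease hierarchy enables can be sketched with a toy parent map. The hierarchy below is a simplified stand-in using disease names rather than real MONDO identifiers, and the trial IDs are placeholders.

```python
from typing import Dict, List, Set

# Toy child -> parents map (simplified; not the actual MONDO structure).
PARENTS: Dict[str, List[str]] = {
    "Duchenne muscular dystrophy": ["muscular dystrophy"],
    "Becker muscular dystrophy": ["muscular dystrophy"],
    "muscular dystrophy": ["neuromuscular disease"],
    "neuromuscular disease": [],
}


def ancestors(term: str) -> Set[str]:
    """Collect every ancestor of a term by walking up the parent map."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def trials_under(term: str, trial_terms: Dict[str, str]) -> List[str]:
    """Return trials tagged with the term itself or any of its descendants."""
    return sorted(
        trial
        for trial, disease in trial_terms.items()
        if disease == term or term in ancestors(disease)
    )


# Placeholder trial IDs mapped to their tagged disease.
trial_terms = {
    "TRIAL-A": "Duchenne muscular dystrophy",
    "TRIAL-B": "Becker muscular dystrophy",
    "TRIAL-C": "neuromuscular disease",
}

print(trials_under("Duchenne muscular dystrophy", trial_terms))
print(trials_under("muscular dystrophy", trial_terms))
```

The same walk works for intervention classes, for example rolling individual drugs up to a shared mechanism-of-action class.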

Ontologies provide the infrastructure to join across completely different datasets. For example, MONDO’s ICD-10 cross-references allow us to connect trial landscape data to CMS claims, providing real-world patient context. NCIt carries FDA UNII codes that link trial interventions to Orange Book patent and exclusivity records. It also carries cross-references to other vocabularies such as RxNorm, and UMLS CUI codes that bridge to SNOMED CT and EHR data.
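
A cross-reference join of this kind can be illustrated with three small lookup tables. Everything in the sketch is a placeholder: the disease IDs, the ICD-10 style codes, the trial IDs, and the patient counts are invented purely to show the shape of the join.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Placeholder disease -> ICD-10 cross-references.
XREFS: Dict[str, List[str]] = {
    "DISEASE:1": ["ICD10:AAA"],
    "DISEASE:2": ["ICD10:BBB", "ICD10:CCC"],
}
# Placeholder (trial ID, disease ID) pairs.
TRIALS: List[Tuple[str, str]] = [
    ("TRIAL-A", "DISEASE:1"),
    ("TRIAL-B", "DISEASE:2"),
    ("TRIAL-C", "DISEASE:2"),
]
# Invented patient counts keyed by diagnosis code.
CLAIMS: Dict[str, int] = {"ICD10:AAA": 1200, "ICD10:BBB": 300, "ICD10:CCC": 50}


def landscape(trials, xrefs, claims) -> Dict[str, Dict[str, int]]:
    """Join trial counts to claims-based patient counts via cross-references."""
    trial_counts: Dict[str, int] = defaultdict(int)
    for _trial_id, disease in trials:
        trial_counts[disease] += 1
    return {
        disease: {
            "trials": count,
            "patients": sum(claims.get(code, 0) for code in xrefs.get(disease, [])),
        }
        for disease, count in trial_counts.items()
    }


summary = landscape(TRIALS, XREFS, CLAIMS)
print(summary)
```

The ontology supplies the middle table; without it, the trial data and the claims data have no shared key.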

Scope and Limitations

What we track: Companies with active registrations on ClinicalTrials.gov and mandated US regulatory filings. If a program has not reached the clinical registry, it is not in our database. We are not a news aggregator or retail catalyst tracker.

Filing compliance gaps: Not everyone files appropriately. For example, many companies raise private capital without filing, despite legal requirements. We cover this in Building a Form D Database.
Parser accuracy and quality control: Our parsers and classifiers are thoroughly tested, but errors can occur. We use heuristics and automated anomaly detection to catch issues before they surface, and users can flag anything that slips through. Every data point links to its original source so discrepancies can be verified directly.
Not real-time: Our aggregators and scanners run on an hourly or daily cadence. We are not a trading platform. Our goal is to provide reliable information about the business of biotech in a useful way.