About
Methodology & Transparency
Every data point has a source. Every pipeline is documented.
Most biotech intelligence products are black boxes. Data appears without explanation of where it came from, how it was cleaned, or what might be wrong with it. We take a different approach.
At RxDataLab we document our data harmonization processes from source to display. Users should be able to trust, but verify, the provenance of any data point.
Data Sources
We pull exclusively from primary regulatory sources. These are filings and registrations that are made because they are legally required, not because a company has a good PR team.
ClinicalTrials.gov: The NLM registry of clinical research, covering 450K+ trials. We load it through the AACT database (Aggregate Analysis of ClinicalTrials.gov), a daily-refreshed PostgreSQL mirror built for bulk analysis. Every trial has a unique NCT ID assigned by NIH.
SEC EDGAR: Electronic filings with the Securities and Exchange Commission. We parse Form 4 (insider trading disclosures), 8-K and 6-K (material event notices), Schedule 13 (beneficial ownership and institutional fund positions), Form D (private capital raises), and 10-K and 10-Q financial statements. Synced 3x daily via the SEC Submissions API.
FDA Orange Book: The FDA’s list of approved drug products with patent and exclusivity data. Updated monthly, typically around the 25th. We track historical changes to surface what is new each cycle.
Financial market data: Daily price and volume data for tracked public companies, linked to their regulatory filings and clinical programs.
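As an illustration of the EDGAR sync step, the SEC Submissions API serves one JSON document per filer, keyed by a CIK zero-padded to ten digits, and lists recent filings as parallel arrays. The sketch below assumes that layout and makes no network call; the function names, the form subset, and the sample payload are illustrative, not the production pipeline.

```python
from __future__ import annotations

from typing import Iterator

# Illustrative subset of form types; not the platform's actual list.
TRACKED_FORMS = {"4", "8-K", "6-K", "D", "10-K", "10-Q"}


def submissions_url(cik: int | str) -> str:
    """Build the per-filer URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def recent_filings(payload: dict, forms: set[str] = TRACKED_FORMS) -> Iterator[dict]:
    """Pivot the parallel arrays under filings.recent into one dict per filing."""
    recent = payload["filings"]["recent"]
    columns = zip(recent["accessionNumber"], recent["form"], recent["filingDate"])
    for accession, form, filed in columns:
        if form in forms:
            yield {"accession": accession, "form": form, "filed": filed}


# Made-up payload trimmed to the fields used here (placeholder CIK and dates).
sample = {
    "cik": "1234567",
    "filings": {
        "recent": {
            "accessionNumber": [
                "0001234567-24-000010",
                "0001234567-24-000009",
                "0001234567-24-000008",
            ],
            "form": ["8-K", "S-8", "4"],
            "filingDate": ["2024-03-04", "2024-02-20", "2024-02-15"],
        }
    },
}

rows = list(recent_filings(sample))
```

A real client would also need to send a descriptive User-Agent header and respect the SEC's rate limits.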
The Linking Problem
Anyone working in this field knows that this data is not unique: every major data vendor pulls the same type of data. The challenge RxDataLab addresses is linking the data and surfacing it in a useful way. ClinicalTrials.gov and the various FDA data sources identify companies by free-text sponsor names, which can be inconsistent within one dataset, let alone across them. For example, Merck runs trials filed under Merck Sharp & Dohme LLC, Acceleron Pharma, and Prometheus Biosciences, all separate subsidiaries that appear as distinct entities in the registry. Medtronic appears under a dozen business unit names. Free-text names are difficult to match reliably.
The SEC’s EDGAR database, on the other hand, has a unique identifier called a CIK (Central Index Key). CIKs are unique and relatively persistent. They do not change when a company changes its name, and a company keeps its CIK whether it is private or public, as long as it files with the agency [1]. In public markets, we identify companies by stock ticker, and sometimes by more exotic (but reliable) identifiers such as CUSIPs or FIGI IDs.
Our harmonization pipeline uses a two-stage approach: deterministic rule-based matching first (name normalization, legal suffix stripping, ticker lookups against SEC company data), followed by LLM-assisted disambiguation for cases that can’t be resolved deterministically. Every LLM-generated match is validated programmatically against official SEC records before it enters production. Edge cases and matches with lower confidence scores always get human validation. We documented this process briefly in How I Use LLMs for Data Harmonization.
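The deterministic first stage can be sketched roughly as follows. This is a minimal illustration, assuming a lookup table keyed by normalized name; the suffix list, the function names, the company, and the CIK value are all hypothetical, and the LLM-assisted second stage is not shown.

```python
import re
import unicodedata

# Hypothetical, non-exhaustive list of legal suffixes to strip.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc", "lp", "sa", "ag", "nv", "gmbh", "bv",
}


def normalize_name(raw: str) -> str:
    """Lowercase, fold accents, drop punctuation, strip trailing legal suffixes."""
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = text.split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(sponsors, sec_index):
    """Stage one: exact lookup on the normalized name.

    Returns (matched, unresolved). Unresolved names would go on to the
    second stage and, where needed, human review.
    """
    matched, unresolved = {}, []
    for sponsor in sponsors:
        cik = sec_index.get(normalize_name(sponsor))
        if cik is None:
            unresolved.append(sponsor)
        else:
            matched[sponsor] = cik
    return matched, unresolved


# Fictional company and placeholder CIK, for illustration only.
sec_index = {normalize_name("Example Therapeutics, Inc."): 9999999}
sponsors = [
    "EXAMPLE THERAPEUTICS INC",
    "Example Therapeutics, Inc.",
    "Example Thera Oncology Unit",
]
matched, unresolved = resolve(sponsors, sec_index)
```

Exact lookup on a normalized key only catches spelling and suffix variants. Subsidiary names such as those in the Merck example would fall through to the unresolved list, which is the case the second stage exists for.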
The result is a maintained mapping of 500+ public companies with canonical company records linking the various databases, enabling us to quickly query across the regulatory landscape.
SEC Filing Parsing
We track every SEC filing for the companies in our database, and we parse a subset of these filings into structured, normalized database tables. This preserves data lineage and enables relational queries across filing types.
Form 4 (insider trading): A required filing for company insiders and >10% owners. We track both the filers and the companies, so insider activity can be watched and aggregated by either.
Schedule 13 (beneficial ownership): Mandated filings for >5% owners. We parse and track this information and use it across our database to track institutional investor activity.
8-K signal detection: We use a custom ML pipeline to parse and classify 8-K filings into categories (clinical trial events, FDA actions, financing events, licensing deals, strategic review). This approach is deterministic and auditable, so the same filing always produces the same classification.
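To illustrate the kind of structured parsing described for Form 4, the sketch below turns an ownership XML document into flat transaction rows and nets the shares acquired and disposed. It assumes the element layout of the SEC ownership XML format; the sample document is made up and heavily trimmed, and the helper names are hypothetical rather than the platform's parser.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed document following the SEC ownership XML layout.
SAMPLE_FORM4 = """
<ownershipDocument>
  <issuer>
    <issuerCik>0001234567</issuerCik>
    <issuerTradingSymbol>EXMP</issuerTradingSymbol>
  </issuer>
  <reportingOwner>
    <reportingOwnerId>
      <rptOwnerCik>0007654321</rptOwnerCik>
      <rptOwnerName>Doe Jane</rptOwnerName>
    </reportingOwnerId>
  </reportingOwner>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-03-01</value></transactionDate>
      <transactionCoding><transactionCode>P</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>1000</value></transactionShares>
        <transactionAcquiredDisposedCode><value>A</value></transactionAcquiredDisposedCode>
      </transactionAmounts>
    </nonDerivativeTransaction>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-03-05</value></transactionDate>
      <transactionCoding><transactionCode>S</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>250</value></transactionShares>
        <transactionAcquiredDisposedCode><value>D</value></transactionAcquiredDisposedCode>
      </transactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
</ownershipDocument>
""".strip()


def _text(node, path):
    """Return the stripped text at path, or None if the element is missing."""
    found = node.find(path)
    return found.text.strip() if found is not None and found.text else None


def parse_form4(xml_text):
    """Flatten non-derivative transactions into one row per transaction."""
    root = ET.fromstring(xml_text)
    issuer_cik = _text(root, "issuer/issuerCik")
    # Only the first reporting owner is read here; filings can list several.
    owner_cik = _text(root, "reportingOwner/reportingOwnerId/rptOwnerCik")
    rows = []
    for txn in root.iterfind("nonDerivativeTable/nonDerivativeTransaction"):
        shares = _text(txn, "transactionAmounts/transactionShares/value")
        rows.append({
            "issuer_cik": issuer_cik,
            "owner_cik": owner_cik,
            "date": _text(txn, "transactionDate/value"),
            "code": _text(txn, "transactionCoding/transactionCode"),
            "shares": float(shares) if shares else None,
            "acquired_disposed": _text(
                txn, "transactionAmounts/transactionAcquiredDisposedCode/value"
            ),
        })
    return rows


def net_shares(rows):
    """Sum shares, counting acquisitions (A) as positive and disposals as negative."""
    return sum(
        (r["shares"] or 0.0) * (1 if r["acquired_disposed"] == "A" else -1)
        for r in rows
    )


form4_rows = parse_form4(SAMPLE_FORM4)
```

Keeping the issuer and owner CIKs on every row is what makes later aggregation by filer or by company a simple group-by.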
RxDataLab has a custom document scraper and parser. We do not use any data vendors for SEC data. Every filing displayed on the platform links to the original SEC document by accession number.
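As a rough sketch of linking back by accession number, the function below builds the URL of a filing's index page, assuming EDGAR's usual archive path layout (company CIK without leading zeros, then the accession number without dashes). The function name and the sample values are hypothetical.

```python
import re

ACCESSION_PATTERN = re.compile(r"\d{10}-\d{2}-\d{6}")


def filing_index_url(cik, accession):
    """Build the EDGAR index-page URL for one filing.

    `cik` is the company's CIK. The first ten digits of the accession
    number identify whoever submitted the filing, which may be a filing
    agent rather than the company, so the CIK is passed separately.
    """
    if not ACCESSION_PATTERN.fullmatch(accession):
        raise ValueError(f"unexpected accession number format: {accession!r}")
    folder = accession.replace("-", "")
    return (
        f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/"
        f"{folder}/{accession}-index.htm"
    )


# Placeholder CIK and accession number, for illustration only.
url = filing_index_url("0001234567", "0001234567-24-000010")
```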
Scope and Limitations
What we track: Companies with active registrations on ClinicalTrials.gov and mandated US regulatory filings. If a program has not reached the clinical registry, it is not in our database. We are not a news aggregator or retail catalyst tracker.
- Filing compliance gaps: Not everyone files appropriately. For example, many companies raise private capital without filing, despite legal requirements. We cover this in Building a Form D Database.
- Not real-time: Our aggregators and scanners run on an hourly or daily cadence. We target longer horizons.
Further Reading
For technical detail on specific areas of the pipeline:
- How I Use LLMs for Data Harmonization — the entity resolution pipeline, hybrid approach, and quality control framework
- Mapping Competitive Biotech Landscapes with Embeddings — clinical trial semantic analysis and competitive mapping
- Building a Form D Database — private capital data, its uses, and its limitations
[1] Of course, there are caveats: a company may have different CIKs at different stages, but we take care of that on the backend.
Work With RxDataLab
Questions about data coverage, methodology, or custom data needs.