- RxDataLab

Building an Intelligent SEC 8-K Classification System for Biotech Investors

The Problem: Signal vs. Noise in SEC Filings

For biotech investors, SEC 8-K filings are critical for tracking material events. Companies must generally file an 8-K within four business days of a significant event, which makes the filings a near-real-time pulse of the industry. There’s a catch, however: Item 8.01, “Other Events”.

Item 8.01 is the SEC’s catch-all category. It can contain anything from groundbreaking clinical trial results and FDA approvals to routine press releases about conference attendance. For investors, this creates a signal-to-noise problem:

High-value 8.01s: FDA NDA acceptance, pivotal trial readouts, regulatory designations
Low-value 8.01s: Executive conference attendance, general corporate updates
No way to distinguish them without reading every filing

Our platform tracks ~100 biotech companies. At scale, manually triaging hundreds of Item 8.01 filings becomes unsustainable.

The Solution: Keyword-Based Classification Without LLMs

We implemented a lightweight classification system that categorizes Item 8.01 filings into three high-signal categories:

Classification Categories

Clinical (31 keywords): “clinical trial”, “trial results”, “phase 1/2/3”, “efficacy”, “topline results”, “data readout”, “ASCO”, “ESMO”
FDA (40 keywords): “fda approval”, “nda submission”, “bla accepted”, “breakthrough designation”, “regulatory approval”, “priority review”
Licensing (17 keywords): “licensing agreement”, “collaboration agreement”, “milestone payment”, “exclusive license”

Why not use an LLM?

Speed: Keyword matching takes milliseconds vs. seconds for LLM inference
Cost: Zero per-classification cost vs. API fees
Determinism: Same input → same output, always
Transparency: Easy to debug and expand keyword lists
Local-first: All processing happens on already-downloaded HTML

Technical Implementation

Architecture

SEC Filing Download (rate-limited, 10 req/sec max)
  ↓
Item Number Extraction (regex: item\s+(\d+\.\d+))
  ↓
Item 8.01 Detection
  ↓
Keyword Matching (case-insensitive, counts per category)
  ↓
Classification Storage (clinical/fda/licensing/null)

Database Schema

CREATE TABLE sec_filings (
  -- ... other fields ...
  item_numbers TEXT,                 -- JSON array: ["8.01", "9.01"]
  item_801_type TEXT,                -- 'clinical', 'fda', 'licensing', or NULL
  item_801_classified_at TIMESTAMP,  -- Allows re-classification
  parsed_status INTEGER              -- 0=unparsed, 1=success, 2=failed
);

The timestamp-based approach is key: it allows us to re-run classification when we update keywords without re-downloading from SEC.

Core Classification Logic

import "strings"

func ClassifyItem801(htmlContent string) string {
  // Lowercase once so that matching is case-insensitive.
  content := strings.ToLower(htmlContent)

  clinicalCount := countKeywordMatches(content, clinicalKeywords)
  fdaCount := countKeywordMatches(content, fdaKeywords)
  licensingCount := countKeywordMatches(content, licensingKeywords)

  // Return the category with the most matches. Each comparison is
  // strict, so a tie goes to the category checked first.
  maxCount := 0
  category := ""

  if clinicalCount > maxCount {
    maxCount = clinicalCount
    category = "clinical"
  }
  if fdaCount > maxCount {
    maxCount = fdaCount
    category = "fda"
  }
  if licensingCount > maxCount {
    category = "licensing"
  }

  return category // Empty string if no matches
}

Design decisions:

Simple string matching over regex for performance
Vote-based system - category with most keyword matches wins
Empty string rather than NULL for unclassified filings, to distinguish “processed but no match” from “not yet processed”

Display Enhancement

Classified 8.01s get priority badges:

Clinical/FDA 8.01: Green badge, Priority 1 (same as financial results)
Licensing 8.01: Blue badge, Priority 2 (same as material agreements)
Generic 8.01: Purple badge, Priority 3 (low priority)

This surfaces high-value events without burying legitimate “Other Events” filings.
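The badge rules can be expressed as a single lookup. This is a sketch only: the Badge type and badgeFor function are hypothetical names, and the color strings stand in for whatever the UI actually uses.

```go
package main

// Badge describes how a filing is rendered in the feed (hypothetical type).
type Badge struct {
	Color    string
	Priority int // 1 is the highest priority
}

// badgeFor maps an Item 8.01 classification to its badge. Anything that is
// not clinical, fda, or licensing falls back to the generic purple badge.
func badgeFor(item801Type string) Badge {
	switch item801Type {
	case "clinical", "fda":
		return Badge{Color: "green", Priority: 1}
	case "licensing":
		return Badge{Color: "blue", Priority: 2}
	default:
		return Badge{Color: "purple", Priority: 3}
	}
}
```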

Key Technical Considerations

Rate Limiting (Critical!)

The SEC’s rate limit is 10 requests/second. We implement:

100ms minimum delay between requests (SEC requirement)
1 second delay for background jobs (server-friendly)
Batch processing (10-50 filings at a time)

Idempotency

Accession numbers are unique across all filings (format: submitter’s CIK + two-digit year + sequence number)
Database constraints prevent duplicate insertions
Re-running scraper is safe and picks up only new filings

Parsing Robustness

8-K HTML format varies widely:

Regex approach: (?i)item\s+(\d+\.\d+) catches most variations
Deduplication: Multiple mentions of same item → single entry
Item 9.01 exclusion: Always present (exhibits), never informative
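The three rules above fit in one short function. This is a sketch with a hypothetical name, extractItemNumbers, and it assumes Item 9.01 is dropped at parse time; the exclusion could equally happen at display time.

```go
package main

import (
	"regexp"
	"sort"
)

// itemPattern matches headings such as "Item 8.01" or "ITEM  2.02".
// The dot is escaped so that it matches only a literal period.
var itemPattern = regexp.MustCompile(`(?i)item\s+(\d+\.\d+)`)

// extractItemNumbers returns the distinct item numbers found in html,
// sorted as strings, with Item 9.01 (exhibits) left out.
func extractItemNumbers(html string) []string {
	seen := make(map[string]bool)
	for _, m := range itemPattern.FindAllStringSubmatch(html, -1) {
		if m[1] == "9.01" {
			continue // assumption: exhibits are excluded here, at parse time
		}
		seen[m[1]] = true
	}
	items := make([]string, 0, len(seen))
	for item := range seen {
		items = append(items, item)
	}
	sort.Strings(items)
	return items
}
```

One limitation to keep in mind: \s does not match an HTML entity such as &nbsp;, so filings that write “Item&nbsp;8.01” would need the entities decoded before matching.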

Background Processing

Two independent jobs:

Filing scraper: Runs every 4 hours, fetches new 8-Ks from SEC
Item parser: Runs every 1 hour, processes unparsed filings (10 per batch)

Both jobs are mutex-protected to prevent concurrent runs.
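A sketch of what the mutex protection could look like, assuming an overlapping run should be skipped rather than queued (the text above does not say which). Job, NewJob, and TryRun are hypothetical names; sync.Mutex.TryLock requires Go 1.18 or later.

```go
package main

import "sync"

// Job wraps a periodic task so that overlapping runs are skipped.
type Job struct {
	mu  sync.Mutex
	run func()
}

// NewJob returns a Job that executes run when triggered.
func NewJob(run func()) *Job {
	return &Job{run: run}
}

// TryRun executes the task unless a previous run is still in progress.
// It reports whether the task actually ran.
func (j *Job) TryRun() bool {
	if !j.mu.TryLock() {
		return false // another run holds the lock; skip this trigger
	}
	defer j.mu.Unlock()
	j.run()
	return true
}
```

Each job gets its own Job value, so the scraper and the parser never block each other; only a second run of the same job is skipped.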

Re-classification Strategy

The timestamp-based system allows:

Automatic classification of new filings as they’re parsed
Manual re-classification via admin button (processes local HTML, no SEC calls)
Keyword evolution - update keywords, re-classify all historical filings in seconds

Query logic: WHERE item_801_classified_at IS NULL OR item_801_classified_at < ?

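Pass time.Now() as the cutoff to re-classify everything, or time.Now().Add(-24*time.Hour) to re-classify only filings whose last classification is more than 24 hours old. Filings that have never been classified are always included.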

Performance Characteristics

Initial classification: ~100ms per filing (includes HTML fetch from SEC)
Re-classification: ~1ms per filing (local keyword matching only)
Storage: ~10KB per filing average (full HTML stored for future re-classification)

For 100 companies with ~500 8-Ks/year total:

Daily processing: ~1-2 filings, <1 second
Full backfill: 500 filings × 100ms = 50 seconds of processing, plus 500 seconds of rate-limit delay (1 second per filing), for ~9 minutes total

Real-World Impact

Before: All Item 8.01 filings looked identical - purple “Other Events” badges
After: High-signal filings jump out with green/blue badges

Example: A company announces an NDA submission. Previously, the filing was buried as “Other Events”. Now it gets a green “FDA Action 8.01” badge at Priority 1 and is sorted to the top of the feed.

Future Enhancements

Expanded keyword lists - The lists are intentionally verbose and easy to expand
Combo detection - Multiple categories matching could indicate major events
Confidence scores - Weight keywords by specificity (e.g., “bla submission” > “fda meeting”)
Negative keywords - Exclude false positives
Phase 2: LLM enhancement - Use keywords for pre-filtering, LLM for nuanced classification of uncertain cases

Key Takeaways

Start simple: Keyword matching solves 80% of the problem at 0.1% the complexity
Local-first: Store raw data, enable fast iteration without re-downloading
Respect rate limits: SEC will block you if you’re aggressive
Timestamp everything: Enables auditing and re-processing
Make keywords visible: Non-technical users can expand keyword lists themselves

The entire feature took ~4 hours to implement and has zero ongoing cost. Sometimes the simplest solution is the right one.
