Building an Intelligent SEC 8-K Classification System for Biotech Investors
The Problem: Signal vs. Noise in SEC Filings
For a biotech investor, SEC 8-K filings are critical for tracking material events. Companies must file an 8-K within four business days of a significant event, which makes the filings a near-real-time pulse of the industry. However, there’s a catch: Item 8.01 - “Other Events”.
Item 8.01 is the SEC’s catch-all category. It can contain anything from groundbreaking clinical trial results and FDA approvals to routine press releases about conference attendance. For investors, this creates a signal-to-noise problem:
- High-value 8.01s: FDA NDA acceptance, pivotal trial readouts, regulatory designations
- Low-value 8.01s: Executive conference attendance, general corporate updates
- No way to distinguish them without reading every filing
Our platform tracks ~100 biotech companies. At scale, manually triaging hundreds of Item 8.01 filings becomes unsustainable.
The Solution: Keyword-Based Classification Without LLMs
We implemented a lightweight classification system that categorizes Item 8.01 filings into three high-signal categories:
Classification Categories
- Clinical (31 keywords): “clinical trial”, “trial results”, “phase 1/2/3”, “efficacy”, “topline results”, “data readout”, “ASCO”, “ESMO”
- FDA (40 keywords): “fda approval”, “nda submission”, “bla accepted”, “breakthrough designation”, “regulatory approval”, “priority review”
- Licensing (17 keywords): “licensing agreement”, “collaboration agreement”, “milestone payment”, “exclusive license”
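As a rough illustration, the lists can live in plain Go slices. The sketch below holds only the example keywords quoted above, not the full 31/40/17 entries. Storing everything in lowercase and expanding “phase 1/2/3” into three separate entries are assumptions made for the example.

```go
package main

// Abbreviated keyword lists: only the examples quoted in the post.
// Entries are lowercase on the assumption that matching runs against
// lowercased filing text.
var clinicalKeywords = []string{
	"clinical trial", "trial results",
	"phase 1", "phase 2", "phase 3", // assumed expansion of "phase 1/2/3"
	"efficacy", "topline results", "data readout",
	"asco", "esmo",
}

var fdaKeywords = []string{
	"fda approval", "nda submission", "bla accepted",
	"breakthrough designation", "regulatory approval", "priority review",
}

var licensingKeywords = []string{
	"licensing agreement", "collaboration agreement",
	"milestone payment", "exclusive license",
}
```

Short entries such as "asco" match as substrings of unrelated words when plain substring matching is used, which is worth keeping in mind when expanding the lists.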
Why not use an LLM?
- Speed: Keyword matching takes milliseconds vs. seconds for LLM inference
- Cost: Zero per-classification cost vs. API fees
- Determinism: Same input → same output, always
- Transparency: Easy to debug and expand keyword lists
- Local-first: All processing happens on already-downloaded HTML
Technical Implementation
Architecture
SEC Filing Download (rate-limited, 10 req/sec max)
↓
Item Number Extraction (regex: item\s+(\d+\.\d+))
↓
Item 8.01 Detection
↓
Keyword Matching (case-insensitive, counts per category)
↓
Classification Storage (clinical/fda/licensing/null)
Database Schema
CREATE TABLE sec_filings (
    -- … other fields …
    item_numbers TEXT,                -- JSON array: ["8.01", "9.01"]
    item_801_type TEXT,               -- 'clinical', 'fda', 'licensing', or NULL
    item_801_classified_at TIMESTAMP, -- Allows re-classification
    parsed_status INTEGER             -- 0=unparsed, 1=success, 2=failed
);
The timestamp-based approach is key: it allows us to re-run classification when we update keywords without re-downloading from SEC.
Core Classification Logic
func ClassifyItem801(htmlContent string) string {
	// Lowercase once so keyword matching is case-insensitive.
	content := strings.ToLower(htmlContent)

	clinicalCount := countKeywordMatches(content, clinicalKeywords)
	fdaCount := countKeywordMatches(content, fdaKeywords)
	licensingCount := countKeywordMatches(content, licensingKeywords)

	// Return the category with the most matches. The strict > means ties
	// go to the category checked first (clinical, then FDA, then licensing).
	maxCount := 0
	category := ""
	if clinicalCount > maxCount {
		maxCount = clinicalCount
		category = "clinical"
	}
	if fdaCount > maxCount {
		maxCount = fdaCount
		category = "fda"
	}
	if licensingCount > maxCount {
		category = "licensing"
	}
	return category // Empty string if no matches
}
Design decisions:
- Simple string matching over regex for performance
- Vote-based system - category with most keyword matches wins
- Empty string = unclassified, not NULL, to distinguish “processed but no match” from “not yet processed”
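The post names countKeywordMatches but does not show its body. One plausible version, offered as a sketch and not as the actual implementation, counts every occurrence of every keyword with strings.Count. Whether the real helper counts occurrences or distinct keywords is not stated, so that choice is an assumption.

```go
package main

import "strings"

// countKeywordMatches is a hypothetical body for the helper named in the
// post. It sums the non-overlapping occurrences of each keyword in content.
// Both content and keywords are expected to be lowercase already.
func countKeywordMatches(content string, keywords []string) int {
	total := 0
	for _, kw := range keywords {
		if kw == "" {
			// strings.Count returns len(content)+1 for an empty
			// substring, which would swamp the vote.
			continue
		}
		total += strings.Count(content, kw)
	}
	return total
}
```

Counting occurrences means a filing that repeats one keyword many times can outvote a filing that mentions several different keywords once each; counting distinct keywords would be the alternative if that turns out to matter.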
Display Enhancement
Classified 8.01s get priority badges:
- Clinical/FDA 8.01: Green badge, Priority 1 (same as financial results)
- Licensing 8.01: Blue badge, Priority 2 (same as material agreements)
- Generic 8.01: Purple badge, Priority 3 (low priority)
This surfaces high-value events without burying legitimate “Other Events” filings.
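The badge rules above reduce to a small lookup. The sketch below shows one way to express it; the badge type and the badgeFor801 name are hypothetical and only the colors and priorities come from the list above.

```go
package main

// badge is a hypothetical display descriptor for a classified 8.01 filing.
type badge struct {
	Color    string
	Priority int // 1 is the highest priority
}

// badgeFor801 maps a stored item_801_type value to its badge.
// Anything that is not a known category, including the empty string,
// falls through to the generic purple badge.
func badgeFor801(category string) badge {
	switch category {
	case "clinical", "fda":
		return badge{Color: "green", Priority: 1}
	case "licensing":
		return badge{Color: "blue", Priority: 2}
	default:
		return badge{Color: "purple", Priority: 3}
	}
}
```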
Key Technical Considerations
Rate Limiting (Critical!)
The SEC’s rate limit is 10 requests/second. We implement:
- 100ms minimum delay between requests (SEC requirement)
- 1 second delay for background jobs (server-friendly)
- Batch processing (10-50 filings at a time)
Idempotency
- Accession numbers are unique across all filings (format: filer ID + two-digit year + sequence number)
- Database constraints prevent duplicate insertions
- Re-running scraper is safe and picks up only new filings
Parsing Robustness
8-K HTML format varies widely:
- Regex approach: (?i)item\s+(\d+\.\d+) catches most variations (the dot is escaped so it matches only a literal period)
- Deduplication: Multiple mentions of same item → single entry
- Item 9.01 exclusion: Present on nearly every 8-K (Financial Statements and Exhibits), so it carries no signal
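The three parsing rules above fit in one short function. The sketch below uses the same pattern with the dot escaped so it matches a literal period. The extractItemNumbers name is hypothetical, and dropping 9.01 at extraction time is an assumption; it could equally be filtered at display time.

```go
package main

import "regexp"

// itemPattern matches "Item 8.01", "ITEM 8.01", "item  2.02", and so on,
// capturing the item number.
var itemPattern = regexp.MustCompile(`(?i)item\s+(\d+\.\d+)`)

// extractItemNumbers returns each distinct item number in order of first
// appearance, skipping 9.01 (Financial Statements and Exhibits).
func extractItemNumbers(text string) []string {
	seen := make(map[string]bool)
	var items []string
	for _, m := range itemPattern.FindAllStringSubmatch(text, -1) {
		num := m[1]
		if num == "9.01" || seen[num] {
			continue
		}
		seen[num] = true
		items = append(items, num)
	}
	return items
}
```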
Background Processing
Two independent jobs:
- Filing scraper: Runs every 4 hours, fetches new 8-Ks from SEC
- Item parser: Runs every 1 hour, processes unparsed filings (10 per batch)
Both jobs are mutex-protected to prevent concurrent runs.
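One way to get this protection is a per-job mutex that is tried, not waited on, so an overlapping trigger is skipped instead of queued. This is a sketch under that assumption; the job type is hypothetical, and sync.Mutex.TryLock requires Go 1.18 or later.

```go
package main

import "sync"

// job wraps a background task so that only one run is active at a time.
type job struct {
	mu  sync.Mutex
	run func()
}

// TryRun executes the task unless a previous run is still in progress.
// It reports whether the task actually ran.
func (j *job) TryRun() bool {
	if !j.mu.TryLock() {
		return false // previous run still active: skip this trigger
	}
	defer j.mu.Unlock()
	j.run()
	return true
}
```

Each of the two jobs would own its own job value, so the scraper and the parser never block each other, only overlapping runs of themselves.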
Re-classification Strategy
The timestamp-based system allows:
- Automatic classification of new filings as they’re parsed
- Manual re-classification via admin button (processes local HTML, no SEC calls)
- Keyword evolution - update keywords, re-classify all historical filings in seconds
Query logic: WHERE item_801_classified_at IS NULL OR item_801_classified_at < ?
Pass time.Now() as the cutoff to re-classify everything, or time.Now().Add(-24*time.Hour) to re-classify only filings that were never classified or were last classified more than 24 hours ago.
Performance Characteristics
- Initial classification: ~100ms per filing (includes HTML fetch from SEC)
- Re-classification: ~1ms per filing (local keyword matching only)
- Storage: ~10KB per filing average (full HTML stored for future re-classification)
For 100 companies with ~500 8-Ks/year total:
- Daily processing: ~1-2 filings, <1 second
- Full backfill: 500 filings × 100ms = 50 seconds of processing, plus 500 filings × 1 second background delay = 500 seconds of rate limiting, for 550 seconds (~9 minutes) total
Real-World Impact
Before: All Item 8.01 filings looked identical - purple “Other Events” badges
After: High-signal filings jump out with green/blue badges
Example: A company announces an NDA submission. Previously, the filing was buried as “Other Events”. Now it gets a green “FDA Action 8.01” badge at Priority 1 and sorts to the top of the feed.
Future Enhancements
- Expanded keyword lists - The lists are intentionally verbose and easy to expand
- Combo detection - Multiple categories matching could indicate major events
- Confidence scores - Weight keywords by specificity (e.g., “bla submission” > “fda meeting”)
- Negative keywords - Exclude false positives
- Phase 2: LLM enhancement - Use keywords for pre-filtering, LLM for nuanced classification of uncertain cases
Key Takeaways
- Start simple: Keyword matching solves 80% of the problem at 0.1% of the complexity
- Local-first: Store raw data, enable fast iteration without re-downloading
- Respect rate limits: SEC will block you if you’re aggressive
- Timestamp everything: Enables auditing and re-processing
- Make keywords visible: Non-technical users can expand keyword lists themselves
The entire feature took ~4 hours to implement and has zero ongoing cost. Sometimes the simplest solution is the right one.
