Methodology
This page describes how filings get from EDGAR into secwatch.observer. The short version: poll SEC filings, fetch the source documents and exhibits, clean the text, run structured extraction, store the result in SQLite, and publish both human-readable and machine-readable surfaces.
1. SEC ingestion
A poller watches the SEC's current 8-K feed and inserts new accession numbers into a SQLite filings table. The same system maintains company metadata from SEC reference data so filings can be connected to ticker pages and company-level feeds.
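As a rough illustration of the insert step, the sketch below records accession numbers idempotently, so polling the same feed twice does not create duplicate rows. The table layout, the column names, and the `record_new_filings` helper are assumptions made for this example, not the production schema.

```python
import sqlite3

# Hypothetical minimal table; the real schema is not described on this page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE filings ("
    " accession TEXT PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'pending')"
)

def record_new_filings(conn, accessions):
    """Insert unseen accession numbers as pending rows; return how many were new."""
    before = conn.total_changes
    conn.executemany(
        # OR IGNORE skips accession numbers that are already stored.
        "INSERT OR IGNORE INTO filings (accession) VALUES (?)",
        [(a,) for a in accessions],
    )
    conn.commit()
    return conn.total_changes - before
```

Using the accession number as the primary key lets the database enforce uniqueness, so the poller does not need its own "seen" set.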
Companies are keyed by their SEC Central Index Key (CIK), the stable identifier for an issuer. Ticker symbols are treated as display aliases on top of the CIK, so multi-class issuers such as Alphabet (GOOG and GOOGL) resolve to the same company and a search for either ticker surfaces the same filings.
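One way to model tickers as aliases over a CIK is a separate alias table, sketched below. The table names, column names, and the `resolve_cik` helper are hypothetical; only the idea of keying on the CIK comes from the description above.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: names are illustrative, not the production layout.
SCHEMA = """
CREATE TABLE companies (
    cik  INTEGER PRIMARY KEY,   -- SEC Central Index Key
    name TEXT NOT NULL
);
CREATE TABLE tickers (
    symbol TEXT PRIMARY KEY,    -- display alias
    cik    INTEGER NOT NULL REFERENCES companies(cik)
);
"""

def resolve_cik(conn: sqlite3.Connection, symbol: str) -> Optional[int]:
    """Return the CIK a ticker symbol points at, or None if it is unknown."""
    row = conn.execute(
        "SELECT cik FROM tickers WHERE symbol = ?", (symbol.upper(),)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO companies VALUES (?, ?)", (1652044, "Alphabet Inc."))
conn.executemany(
    "INSERT INTO tickers VALUES (?, ?)",
    [("GOOG", 1652044), ("GOOGL", 1652044)],
)
```

With this layout, `resolve_cik(conn, "GOOG")` and `resolve_cik(conn, "GOOGL")` return the same CIK, so either search lands on the same set of filings.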
Requests to sec.gov go through a shared rate limiter set below the SEC's published request-rate ceiling. The request User-Agent includes a real contact email, and the pipeline is designed to fail closed rather than hammer the source when SEC responses slow down or fail.
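A minimal sketch of such a limiter follows. It spaces requests by a fixed minimum interval. The class name, the 5 requests per second setting, and the placeholder User-Agent are assumptions for illustration; the SEC's fair-access guidance has stated a ceiling of 10 requests per second, and the production limit is not given here.

```python
import time

class MinIntervalLimiter:
    """Allow at most `max_per_second` calls per second by spacing them out."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request slot; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the following slot relative to when this request may start.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay

# Assumed values: 5 req/s and a placeholder contact address.
limiter = MinIntervalLimiter(max_per_second=5)
HEADERS = {"User-Agent": "example-app admin@example.com"}
```

Each outbound request would call `limiter.wait()` first. This sketch is single-threaded; a limiter shared across threads or processes would also need a lock or an external coordinator.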
2. Fetching and cleaning
For each pending row, the processor follows the EDGAR index page to the primary 8-K document and relevant exhibits, especially 99.x press releases and supplemental materials. Material agreement exhibits can also be pulled when they are the most relevant filing attachment.
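The sketch below shows how the index page URL can be derived from a CIK and accession number, and one possible way to order the documents listed there. The EDGAR URL layout is the public one; the `select_documents` helper and its fixed priority order (primary 8-K, then EX-99.x, then EX-10.x) are simplifying assumptions, not the production selection logic.

```python
import re

def index_url(cik: int, accession: str) -> str:
    """Build the EDGAR filing index URL for a dashed accession number."""
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik}/{accession.replace('-', '')}/{accession}-index.htm"
    )

# Hypothetical priority: primary document, press releases, then agreements.
_PRIORITY = [
    re.compile(r"^8-K(/A)?$"),
    re.compile(r"^EX-99(\.\d+)?$"),
    re.compile(r"^EX-10(\.\d+)?$"),
]

def select_documents(docs):
    """Given (doc_type, filename) rows from an index page, return filenames in fetch order."""
    picked = []
    for pattern in _PRIORITY:
        picked += [name for doc_type, name in docs if pattern.match(doc_type)]
    return picked
```

Document types that match none of the patterns, such as images, are left out of the fetch list.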
The raw HTML is converted into cleaner text by stripping iXBRL blocks, tag runs, duplicated boilerplate, repeated lines, and excess whitespace. This step is deterministic and keeps the downstream model focused on the filing content rather than presentation noise.
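A simplified version of this cleaning pass might look like the following. The regular expressions and the `clean_filing_html` name are assumptions for illustration; a production cleaner would likely use a real HTML parser and more careful boilerplate rules.

```python
import html
import re

_IX_HEADER = re.compile(r"<ix:header\b.*?</ix:header>", re.DOTALL | re.IGNORECASE)
_BLOCK_END = re.compile(r"</(p|div|tr|li|h[1-6])\s*>|<br\s*/?>", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")
_SPACES = re.compile(r"[ \t\u00a0]+")

def clean_filing_html(raw_html: str) -> str:
    """Deterministically reduce filing HTML to de-duplicated text lines."""
    text = _IX_HEADER.sub(" ", raw_html)   # drop the hidden iXBRL header block
    text = _BLOCK_END.sub("\n", text)      # block boundaries become line breaks
    text = _TAG.sub(" ", text)             # remaining tags become spaces
    text = html.unescape(text)             # decode entities such as &nbsp;
    seen, lines = set(), []
    for raw in text.splitlines():
        line = _SPACES.sub(" ", raw).strip()
        if not line or line in seen:       # skip blank and repeated lines
            continue
        seen.add(line)
        lines.append(line)
    return "\n".join(lines)
```

Because the function uses no randomness or external state, the same input always produces the same output. Dropping every repeated line is deliberately aggressive in this sketch and could remove legitimately repeated table cells.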
3. Structured LLM summary
The cleaned filing text and item codes are sent to an LLM with a strict structured-output prompt. The response is parsed against a schema and stored with model metadata so future prompt or model changes can be audited.
The production model currently runs through Ollama Cloud. The exact model name is stored with each generated summary so model changes can be traced later.
The summary layer currently includes:
- a one-line headline,
- concise bullet points,
- an event type such as earnings, M&A, leadership, debt, litigation, cyber, dividend, regulatory, or other material,
- sentiment,
- a materiality score,
- confidence metadata where applicable.
The LLM is the opinionated part of the system. The surrounding pipeline is built to keep that output source-linked, structured, inspectable, and easy to revise when quality checks show drift.
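To make "parsed against a schema" concrete, here is a minimal validation sketch. The field names, the label spellings, and the `parse_summary` function are assumptions for this example; the production schema is not published on this page.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed label spellings, based on the categories listed above.
EVENT_TYPES = {
    "earnings", "m&a", "leadership", "debt", "litigation",
    "cyber", "dividend", "regulatory", "other material", "other",
}
SENTIMENTS = {"positive", "neutral", "negative"}

@dataclass(frozen=True)
class Summary:
    headline: str
    bullets: tuple
    event_type: str
    sentiment: str
    materiality_score: Optional[float]
    model: str  # stored with the summary so model changes can be traced

def parse_summary(raw: str, model: str) -> Summary:
    """Validate a model response; raise ValueError if it does not fit the schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("summary must be a JSON object")
    headline, bullets = data.get("headline"), data.get("bullets")
    if not isinstance(headline, str) or not headline.strip():
        raise ValueError("headline must be a non-empty string")
    if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
        raise ValueError("bullets must be a list of strings")
    if data.get("event_type") not in EVENT_TYPES:
        raise ValueError("unknown event_type")
    if data.get("sentiment") not in SENTIMENTS:
        raise ValueError("unknown sentiment")
    score = data.get("materiality_score")
    if score is not None:
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError("materiality_score must be a number or null")
        if not 0.0 <= score <= 1.0:
            raise ValueError("materiality_score must be between 0.0 and 1.0")
    return Summary(headline.strip(), tuple(bullets), data["event_type"],
                   data["sentiment"], score, model)
```

Rejecting a response outright, rather than repairing it, keeps bad output from reaching the stored summaries.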
4. Enrichment and fact extraction
Some event types get additional treatment. Earnings filings can be enriched with reported metrics from SEC XBRL and consensus context where available. The executive movement radar runs a focused extraction layer over Item 5.02 filings to identify appointments, departures, and role changes with source-linked evidence.
The product direction is to keep adding focused radars where SEC filings contain repeatable, high-value facts. Each radar should produce structured facts, source links, and public surfaces that can flow into filing detail pages, ticker hubs, daily digests, RSS feeds, and future workflow tools.
5. Quality checks
secwatch treats summaries as useful but fallible. The system tracks quality snapshots, warning rates, failure modes, and extraction health so regressions are visible. High-value fact layers, such as the executive movement radar, are designed around source-linked evidence rather than unsupported claims.
Important limitation: a filing page can help you decide what to read first, but it should not be the final authority for legal, trading, or compliance decisions. The EDGAR document remains the source of truth.
Trust and evaluation
The public surfaces are designed around five explicit questions: what happened, who it happened to, when it was filed or took effect, which SEC source proves it, and how confident the system is.
Taxonomy
- event_type buckets include earnings, M&A, leadership, debt, litigation, cyber, dividend, regulatory, other material, and other.
- sentiment is from the issuer or event perspective: positive, neutral, or negative.
- materiality_score ranges from 0.0 for routine disclosures to 1.0 for events likely to materially change attention or risk. This is the raw model estimate, kept as provenance.
- calibrated_materiality_score is a deterministic adjustment of the raw score. Known boilerplate (for example, a routine Item 5.07 annual-meeting vote result when nothing else on the filing is material) is capped so that identical routine events score consistently. The feed, filters, ranking, and displayed materiality all use this value. It only ever lowers the raw score, and it is null when the raw score is null.
- confidence describes extraction confidence, not investment certainty.
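The calibration rule for materiality can be sketched as a small pure function. The cap value of 0.2, the `ROUTINE_ITEM_CAPS` table, and the test of "nothing else is material" as "every item code is in the cap table" are all assumptions for illustration; only the two invariants (never raise the score, null stays null) come from the description above.

```python
from typing import Optional

# Hypothetical cap table: item code -> maximum score for routine disclosures.
ROUTINE_ITEM_CAPS = {"5.07": 0.2}

def calibrate_materiality(raw: Optional[float], item_codes) -> Optional[float]:
    """Return the raw score, capped when the filing contains only routine items."""
    if raw is None:
        return None  # null in, null out
    # Cap only when every item on the filing is known boilerplate.
    if item_codes and all(code in ROUTINE_ITEM_CAPS for code in item_codes):
        cap = max(ROUTINE_ITEM_CAPS[code] for code in item_codes)
        return min(raw, cap)  # min() guarantees the score is never raised
    return raw
```

For example, a filing with only Item 5.07 and a raw score of 0.7 would come out at 0.2 under these assumed caps, while the same score on a filing that also carries Item 2.02 would pass through unchanged.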
Model metadata
Each filing stores the model name, generation timestamp, schema-shaped summary, source URLs, and canonical accession number. Filing pages expose these fields in JSON, Markdown, and plain-text alternates (and in the in-page JSON-LD for crawlers).
Correction policy
Summaries are currently published as machine-generated and uncorrected unless a later review says otherwise. The JSON event payload includes review and correction fields so future corrections can be made visible without changing the route contract.
Known limits
- Extraction quality depends on the filing and exhibit text available from EDGAR.
- Some older filings may have less complete scoring or fact metadata than newly processed filings.
- Source-grounded claim/evidence pairs are public only when evidence is located; ungrounded facts should not power citation-grade public modules.
- secwatch focuses on Form 8-K today and does not claim comprehensive coverage of all SEC disclosures.
6. Publishing surfaces
Once a filing reaches status='ready', it becomes available across several surfaces:
- the live React feed on the homepage,
- a crawlable filing detail page,
- ticker and item-code pages,
- RSS feeds,
- JSON feeds for retrieval use cases,
- daily digest pages when selected by materiality,
- radar pages when specialized extraction applies.
The live homepage receives new ready filings over Server-Sent Events (SSE), which suits a one-way live stream of filings.
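For reference, an SSE message is a few `field: value` lines followed by a blank line. The helper below formats one such message. The `sse_event` name, the `filing` event name, and the payload shape are assumptions for illustration, not the site's actual wire format.

```python
import json
from typing import Optional

def sse_event(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Format one Server-Sent Events message as text."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")   # lets clients resume via Last-Event-ID
    lines.append(f"event: {event}")
    # Compact json.dumps output contains no raw newlines, so one data line suffices.
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"      # a blank line terminates the message
```

A server would write these strings to a response with the `text/event-stream` content type, and the browser's `EventSource` API would deliver each one to the page.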
The live feed frontend is a Vite + React + TypeScript app. The crawlable pages are server-rendered FastAPI/Jinja pages. That split is intentional: the homepage can be interactive, while filing, ticker, digest, and radar pages remain easy for search engines, LLM retrieval systems, and readers without JavaScript.
7. Analytics and privacy
secwatch uses lightweight first-party analytics to understand product health: pageviews, route families, source clicks, digest clicks, radar engagement, server request classes, and high-level crawler/LLM traffic. The goal is to understand whether the site is useful, not to build an ad profile.
The request analytics intentionally avoid storing raw IP addresses or raw User-Agent strings. Bot and browser traffic is classified into coarse buckets so the site can distinguish human workflow signals from crawler and LLM indexing behavior.
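A coarse classifier of this kind can be sketched as follows. The bucket names and the patterns are assumptions chosen for illustration; the point is that only the bucket label is returned for storage, never the raw User-Agent string.

```python
import re
from typing import Optional

# Hypothetical buckets, checked in order; the first match wins.
_BUCKETS = [
    ("llm_crawler", re.compile(r"gptbot|claudebot|perplexitybot", re.I)),
    ("search_crawler", re.compile(r"googlebot|bingbot|duckduckbot", re.I)),
    ("other_bot", re.compile(r"bot|crawler|spider|curl|python-requests", re.I)),
    ("browser", re.compile(r"mozilla/", re.I)),
]

def classify_user_agent(ua: Optional[str]) -> str:
    """Map a raw User-Agent to a coarse bucket; the raw string is not kept."""
    if not ua:
        return "unknown"
    for bucket, pattern in _BUCKETS:
        if pattern.search(ua):
            return bucket
    return "unknown"
```

Order matters here: many crawlers include `Mozilla/` in their User-Agent, so the bot patterns are checked before the browser pattern.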
What this is not
- Investment advice. secwatch summarizes and organizes public filings; it does not recommend trades.
- Real-time execution infrastructure. Latency is intended for awareness and research, not trading automation.
- A replacement for EDGAR. Every important claim should be checked against the original filing.
- Comprehensive across all SEC forms. The product is focused on 8-Ks today, with other filing types possible later.
Contact
If you find a bug or have feedback, email hello@secwatch.observer.