secwatch / observer

Methodology

This page describes how filings get from EDGAR into secwatch.observer. The short version: poll SEC filings, fetch the source documents and exhibits, clean the text, run structured extraction, store the result in SQLite, and publish both human-readable and machine-readable surfaces.

1. SEC ingestion

A poller watches the SEC's current 8-K feed and inserts new accession numbers into a SQLite filings table. The same system maintains company metadata from SEC reference data so filings can be connected to ticker pages and company-level feeds.
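As a rough illustration of the insert step, the sketch below records newly seen accession numbers idempotently, so polling the feed twice does not create duplicate rows. The column names (accession, cik, status) and the 'pending' default are assumptions made for the example, not the production schema.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal schema; the real filings table likely has more columns.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS filings (
            accession TEXT PRIMARY KEY,
            cik       TEXT NOT NULL,
            status    TEXT NOT NULL DEFAULT 'pending'
        )
        """
    )


def insert_new_filings(conn: sqlite3.Connection, entries) -> int:
    """Insert (accession, cik) pairs not seen before; return how many were new."""
    before = conn.total_changes
    # INSERT OR IGNORE skips rows whose accession number already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO filings (accession, cik) VALUES (?, ?)",
        entries,
    )
    conn.commit()
    return conn.total_changes - before
```

Using the accession number as the primary key lets the database enforce deduplication, so the poller itself can stay stateless between runs.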

Companies are keyed by their SEC Central Index Key (CIK), the stable identifier for an issuer. Ticker symbols are treated as display aliases on top of the CIK, so multi-class issuers such as Alphabet (GOOG and GOOGL) resolve to the same company and a search for either ticker surfaces the same filings.
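One way to picture the CIK-plus-aliases model is a small in-memory index, sketched below. The class and function names are hypothetical, and the company, tickers, and CIK used in the usage note are placeholders rather than real reference data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def normalize_cik(cik) -> str:
    """Zero-pad a CIK to a fixed 10-digit string so lookups are consistent."""
    return str(int(cik)).zfill(10)


@dataclass
class Company:
    cik: str
    name: str
    tickers: List[str] = field(default_factory=list)


class CompanyIndex:
    """Companies keyed by CIK, with tickers stored only as aliases."""

    def __init__(self) -> None:
        self._by_cik: Dict[str, Company] = {}
        self._cik_by_ticker: Dict[str, str] = {}

    def add(self, company: Company) -> None:
        cik = normalize_cik(company.cik)
        self._by_cik[cik] = company
        for ticker in company.tickers:
            # Every share class points back at the same CIK.
            self._cik_by_ticker[ticker.upper()] = cik

    def resolve(self, ticker: str) -> Optional[Company]:
        cik = self._cik_by_ticker.get(ticker.strip().upper())
        return self._by_cik.get(cik) if cik else None
```

With a placeholder issuer added as Company(cik="1234567", name="Example Holdings", tickers=["EXA", "EXB"]), resolving either ticker returns the same Company object, which mirrors the GOOG and GOOGL case described above.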

Requests to sec.gov go through a shared rate limiter set below the SEC's published ceiling. The request User-Agent includes a real contact email, and the pipeline is designed to fail closed rather than hammer the source when SEC responses slow down or fail.
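A shared limiter of this kind can be approximated with a minimum-interval scheduler, sketched below. The class name, the injectable clock and sleep parameters, and the example budget of 5 requests per second are all assumptions for illustration; the sketch does not cover the fail-closed behavior.

```python
import threading
import time


class RateLimiter:
    """Space calls at least 1/max_per_second apart across all threads."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def acquire(self) -> float:
        """Wait for the next free slot and return the seconds waited."""
        with self._lock:
            now = self._clock()
            wait = max(0.0, self._next_allowed - now)
            # Reserve the slot after this one before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self._interval
        if wait > 0:
            self._sleep(wait)
        return wait
```

Each worker would call acquire() immediately before issuing a request. Reserving the slot under the lock but sleeping outside it keeps concurrent callers from serializing on the sleep itself.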

2. Fetching and cleaning

For each pending row, the processor follows the EDGAR index page to the primary 8-K document and relevant exhibits, especially 99.x press releases and supplemental materials. Material agreement exhibits can also be pulled when they are the most relevant filing attachment.
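The document selection can be sketched as a filter over the document types listed on a filing index. The Doc tuple and the rule of falling back to EX-10 agreements only when no 99.x exhibit exists are assumptions for this example; the production heuristics for "most relevant attachment" may differ.

```python
import re
from typing import List, NamedTuple


class Doc(NamedTuple):
    doc_type: str  # e.g. "8-K", "EX-99.1", "EX-10.1"
    url: str


_PRESS_RELEASE = re.compile(r"^EX-99(\.\d+)?$", re.IGNORECASE)
_AGREEMENT = re.compile(r"^EX-10(\.\d+)?$", re.IGNORECASE)


def select_documents(docs: List[Doc]) -> List[Doc]:
    """Pick the primary 8-K plus the exhibits worth summarizing."""
    primary = [d for d in docs if d.doc_type.upper() in ("8-K", "8-K/A")]
    press = [d for d in docs if _PRESS_RELEASE.match(d.doc_type)]
    agreements = [d for d in docs if _AGREEMENT.match(d.doc_type)]
    # Assumed rule: prefer 99.x exhibits, use agreements only as a fallback.
    return primary[:1] + (press or agreements)
```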

The raw HTML is converted into cleaner text by stripping iXBRL blocks, tag runs, duplicated boilerplate, repeated lines, and excess whitespace. This step is deterministic and keeps the downstream model focused on the filing content rather than presentation noise.
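A deterministic cleaner along these lines can be built from a few regular expressions. The sketch below is a simplified stand-in, not the production cleaner: it drops ix:header blocks, scripts, and tags, decodes entities, collapses whitespace, and removes every exact repeat of an earlier line, which is a cruder rule than real boilerplate detection.

```python
import html
import re

_IX_HEADER = re.compile(r"<ix:header\b.*?</ix:header>", re.IGNORECASE | re.DOTALL)
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
_BLOCK_END = re.compile(r"<br\s*/?>|</(?:p|div|tr|li|h[1-6])>", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")
_SPACES = re.compile(r"[ \t\u00a0]+")


def clean_filing_html(raw: str) -> str:
    text = _IX_HEADER.sub(" ", raw)       # hidden iXBRL metadata
    text = _SCRIPT_STYLE.sub(" ", text)   # scripts and stylesheets
    text = _BLOCK_END.sub("\n", text)     # keep block boundaries as newlines
    text = _TAG.sub(" ", text)            # remaining tag runs
    text = html.unescape(text)            # &nbsp;, &amp;, ...

    lines, seen = [], set()
    for line in text.splitlines():
        line = _SPACES.sub(" ", line).strip()
        if not line or line in seen:      # drop blanks and repeated lines
            continue
        seen.add(line)
        lines.append(line)
    return "\n".join(lines)
```

Because every step is a pure string transformation, the same input always yields the same output, which makes cleaned text easy to cache and diff.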

3. Structured LLM summary

The cleaned filing text and item codes are sent to an LLM with a strict structured-output prompt. The response is parsed against a schema and stored with model metadata so future prompt or model changes can be audited.
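The parse-and-store step might look roughly like the sketch below, which validates a JSON response against a tiny hand-written schema and attaches model metadata. The field names (headline, summary, item_codes) and the StoredSummary type are hypothetical; the actual schema is not shown on this page.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"headline": str, "summary": str, "item_codes": list}


@dataclass(frozen=True)
class StoredSummary:
    accession: str
    model: str
    generated_at: str
    payload: dict


def parse_summary(raw_response: str, accession: str, model: str) -> StoredSummary:
    """Reject anything that is not exactly the expected JSON shape."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not a {expected.__name__}")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    generated_at = datetime.now(timezone.utc).isoformat()
    return StoredSummary(accession, model, generated_at, data)
```

Failing loudly on malformed output, rather than storing a partial summary, keeps schema drift visible when the prompt or model changes.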

The production model currently runs through Ollama Cloud, and its exact name is recorded with each generated summary so model changes can be traced later.

The summary layer currently includes:

The LLM is the opinionated part of the system. The surrounding pipeline is built to keep that output source-linked, structured, inspectable, and easy to revise when quality checks show drift.

4. Enrichment and fact extraction

Some event types get additional treatment. Earnings filings can be enriched with reported metrics from SEC XBRL and consensus context where available. The executive movement radar runs a focused extraction layer over Item 5.02 filings to identify appointments, departures, and role changes with source-linked evidence.

The product direction is to keep adding focused radars where SEC filings contain repeatable, high-value facts. Each radar should produce structured facts, source links, and public surfaces that can flow into filing detail pages, ticker hubs, daily digests, RSS feeds, and future workflow tools.

5. Quality checks

secwatch treats summaries as useful but fallible. The system tracks quality snapshots, warning rates, failure modes, and extraction health so regressions are visible. High-value fact layers, such as executive movement radar, are designed around source-linked evidence rather than unsupported claims.

Important limitation: a filing page can help you decide what to read first, but it should not be the final authority for legal, trading, or compliance decisions. The EDGAR document remains the source of truth.

Trust and evaluation

The public surfaces are designed around five explicit questions: what happened, who it happened to, when it was filed or took effect, which SEC source proves it, and how confident the system is.

Taxonomy

Model metadata

Each filing stores the model name, generation timestamp, schema-shaped summary, source URLs, and canonical accession number. Filing pages expose these fields in JSON, Markdown, and plain-text alternates (and in the in-page JSON-LD for crawlers).
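To make the alternates concrete, the sketch below renders one stored record as JSON and as Markdown. The dictionary keys and the layout of the Markdown output are assumptions for illustration and do not describe the exact public format.

```python
import json

# Hypothetical key names for the stored fields listed above.
FIELDS = ("accession", "model", "generated_at", "summary", "source_urls")


def to_json(filing: dict) -> str:
    """Machine-readable alternate: only the documented fields, stable order."""
    return json.dumps({key: filing[key] for key in FIELDS}, indent=2)


def to_markdown(filing: dict) -> str:
    """Human-readable alternate built from the same record."""
    lines = [
        f"# Filing {filing['accession']}",
        "",
        filing["summary"]["headline"],
        "",
        f"- Model: {filing['model']}",
        f"- Generated: {filing['generated_at']}",
        "- Sources:",
    ]
    lines += [f"  - {url}" for url in filing["source_urls"]]
    return "\n".join(lines)
```

Rendering every alternate from one record keeps the formats from drifting apart.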

Correction policy

Summaries are currently published as machine-generated and uncorrected unless a later review says otherwise. The JSON event payload includes review and correction fields so future corrections can be made visible without changing the route contract.
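One possible shape for such a payload is sketched below: the review and corrections keys are always present, so a later correction changes values but not structure. The key names and status strings are hypothetical.

```python
import copy


def new_event_payload(accession: str, summary: dict) -> dict:
    """Fresh payload: machine generated, no review, no corrections yet."""
    return {
        "accession": accession,
        "summary": summary,
        "review": {"status": "machine_generated", "reviewed_at": None},
        "corrections": [],
    }


def apply_correction(payload: dict, field: str, value, note: str, corrected_at: str) -> dict:
    """Return a corrected copy that keeps the previous value on record."""
    updated = copy.deepcopy(payload)
    updated["corrections"].append(
        {
            "field": field,
            "previous": updated["summary"].get(field),
            "corrected": value,
            "note": note,
            "corrected_at": corrected_at,
        }
    )
    updated["summary"][field] = value
    updated["review"] = {"status": "corrected", "reviewed_at": corrected_at}
    return updated
```

Clients that ignore the review and corrections keys keep working, while clients that read them can show what changed and when.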

Known limits

6. Publishing surfaces

Once a filing reaches status='ready', it becomes available across several surfaces:

The live homepage receives new ready filings over Server-Sent Events (SSE), which is a simple fit for a one-way live filing stream.
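On the wire, an SSE message is a few "field: value" lines followed by a blank line. The helper below is a minimal sketch of that framing; the function name and the "filing" event name used in the usage note are assumptions, not the site's actual event contract.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events message."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Compact json.dumps output contains no raw newlines, so one data line suffices.
    lines.append("data: " + json.dumps(data, separators=(",", ":")))
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"
```

For example, format_sse({"accession": "x"}, event="filing", event_id="1") yields an id line, an event line, and a data line carrying {"accession":"x"}, followed by the terminating blank line. In the browser, an EventSource listener for the same event name would receive the JSON string as the message data.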

The live feed frontend is a Vite + React + TypeScript app, while the crawlable pages are server-rendered with FastAPI and Jinja. That split is intentional: the homepage can be interactive, while filing, ticker, digest, and radar pages remain easy to read for search engines, LLM retrieval systems, and readers without JavaScript.

7. Analytics and privacy

secwatch uses lightweight first-party analytics to understand product health: pageviews, route families, source clicks, digest clicks, radar engagement, server request classes, and high-level crawler/LLM traffic. The goal is to understand whether the site is useful, not to build an ad profile.

The request analytics intentionally avoid storing raw IP addresses or raw User-Agent strings. Bot and browser traffic is classified into coarse buckets so the site can distinguish human workflow signals from crawler and LLM indexing behavior.
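Coarse classification of this kind can be as simple as substring matching on the User-Agent before it is discarded. In the sketch below, the bucket names and token lists are illustrative assumptions, and only the resulting bucket is counted, never the raw string.

```python
from collections import Counter
from typing import Optional

# Illustrative, deliberately short token lists (matched case-insensitively).
LLM_TOKENS = ("gptbot", "claudebot", "perplexitybot")
SEARCH_TOKENS = ("googlebot", "bingbot", "duckduckbot")
GENERIC_BOT_TOKENS = ("bot", "crawler", "spider")


def classify_user_agent(user_agent: Optional[str]) -> str:
    """Map a raw User-Agent to a coarse bucket."""
    if not user_agent:
        return "unknown"
    ua = user_agent.lower()
    if any(token in ua for token in LLM_TOKENS):
        return "llm_crawler"
    if any(token in ua for token in SEARCH_TOKENS):
        return "search_crawler"
    if any(token in ua for token in GENERIC_BOT_TOKENS):
        return "other_bot"
    if "mozilla" in ua:
        return "browser"
    return "unknown"


def record_request(counts: Counter, route_family: str, user_agent: Optional[str]) -> None:
    """Count the request by route family and bucket; the raw string is not kept."""
    counts[(route_family, classify_user_agent(user_agent))] += 1
```

The order of checks matters: specific crawler tokens are tested before the generic ones, and the browser check comes last because many crawlers also include "Mozilla" in their User-Agent.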

What this is not

Contact

If you find a bug or have feedback, email hello@secwatch.observer.