secwatch / observer

Methodology

This page describes how filings get from EDGAR onto the homepage. The whole pipeline is roughly 2,500 lines of Python and a SQLite file. There is no microservice diagram because there are no microservices.

1. The poller

Every minute during market hours (every 30 minutes overnight; everything in ET so cron survives DST), a single async process fetches the SEC's "latest 8-K" atom feed at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=8-K&output=atom. New accession numbers get inserted into a SQLite filings table with status='pending' via INSERT OR IGNORE. The same poller hydrates the companies table from SEC's company_tickers.json on a 7-day refresh.
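The dedupe is exactly what INSERT OR IGNORE sounds like: the accession number is the primary key, so a re-polled filing is a no-op. A minimal sketch — the real filings table surely has more columns; this schema is illustrative only:

```python
import sqlite3

def record_new_filings(conn: sqlite3.Connection, accessions: list[str]) -> int:
    """Insert unseen accession numbers as pending; re-seen ones are no-ops."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS filings (
               accession TEXT PRIMARY KEY,
               status    TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    cur = conn.executemany(
        "INSERT OR IGNORE INTO filings (accession, status) VALUES (?, 'pending')",
        [(a,) for a in accessions],
    )
    conn.commit()
    return cur.rowcount  # counts rows actually inserted, not rows ignored
```

Returning the inserted count makes the poller's logs cheap: a poll cycle that inserts zero rows did no new work.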

Everything that hits sec.gov goes through an 8-requests-per-second sliding-window limiter shared across the process. The SEC's published limit is 10/s; we leave headroom for retries. The User-Agent header includes a real contact email, because the SEC returns 403 to requests without one.
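A sliding-window gate like this is a few lines of asyncio. The sketch below shows one way to do it; the class and parameter names are mine, not the repo's:

```python
import asyncio
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` acquisitions per `window` seconds, process-wide."""

    def __init__(self, limit: int = 8, window: float = 1.0):
        self.limit = limit
        self.window = window
        self._stamps: deque[float] = deque()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Evict timestamps that have aged out of the window.
                while self._stamps and now - self._stamps[0] >= self.window:
                    self._stamps.popleft()
                if len(self._stamps) < self.limit:
                    self._stamps.append(now)
                    return
                # Sleep until the oldest stamp expires, then re-check.
                await asyncio.sleep(self.window - (now - self._stamps[0]))
```

Every coroutine that touches sec.gov awaits `acquire()` first; because the limiter is one object in one process, the 8/s budget is global by construction.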

2. The fetcher + cleaner

For each pending row, a separate processor follows the EDGAR index page to the primary 8-K document and any EX-99.x press releases or supplemental tables (or EX-2.x / EX-10.x material-agreement exhibits when no EX-99 is present). The body and chosen exhibits are fetched in parallel through the same 8 req/s gate.
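The exhibit preference reduces to a short filter over the index entries. A hypothetical sketch, assuming each entry carries an EDGAR document-type string (the dict shape is mine):

```python
def pick_exhibits(entries: list[dict]) -> list[dict]:
    """Prefer EX-99.x press releases; fall back to EX-2.x / EX-10.x when absent."""
    def has_type(entry: dict, prefix: str) -> bool:
        return entry.get("type", "").upper().startswith(prefix)

    press_releases = [e for e in entries if has_type(e, "EX-99")]
    if press_releases:
        return press_releases
    # No EX-99: material agreements are the next-best summary source.
    return [e for e in entries if has_type(e, "EX-2") or has_type(e, "EX-10")]
```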

Raw HTML goes through a deterministic cleaning pipeline that strips iXBRL header blocks, XBRL tag runs, exhibit-routing boilerplate, repeated lines, and excess whitespace. Typical reduction is 30–50% of the input character count — meaningful when the LLM input budget is finite and earnings filings can run to 60 KB of body + exhibits combined.
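The flavor of the cleaning pass can be sketched in a dozen lines. This is a deliberately crude stand-in — the real pipeline is HTML-aware rather than regex-over-markup — but it shows the three moves: drop inline-XBRL tag runs, drop consecutive duplicate lines, collapse whitespace:

```python
import re

def clean_filing_text(text: str) -> str:
    """Simplified cleaning pass; the real pipeline parses HTML properly."""
    # Remove inline-XBRL (ix:*, xbrli:*, etc.) and link tag runs.
    text = re.sub(r"</?(?:ix|xbrli|xbrldi|link|xlink)[^>]*>", "", text)
    # Crudely strip any remaining markup for this sketch.
    text = re.sub(r"<[^>]+>", " ", text)
    out: list[str] = []
    prev = None
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line or line == prev:  # skip blanks and repeated lines
            continue
        out.append(line)
        prev = line
    return "\n".join(out)
```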

3. The LLM

The cleaned text and the extracted item codes go to a single LLM call with a strict-JSON response format. Today I run deepseek-v4-flash:cloud on Ollama Cloud — fast, cheap, good enough at structured extraction. The prompt asks for a one-line headline, exactly three bullets, an event_type from a fixed taxonomy, the tickers mentioned, and a confidence rating. The response is parsed against a Pydantic schema. If it fails to parse, the prompt is reissued once with a stricter "your previous response was not valid JSON" header. A second failure marks the row failed.

The LLM is the only opinionated step in the pipeline; everything else is deterministic. The prompt and model identifiers are stored in each row's summary_model column so any future reprompt or model swap can be audited.

4. Earnings enrichment

When a filing's classified event type is earnings, the processor follows up with two best-effort calls: one pulls the reported figure and its period-end from the filing's XBRL facts, and one asks Finnhub for the consensus estimate for that period.

The match between XBRL period-end and Finnhub period is fuzzy by a small tolerance to absorb fiscal-year offsets. If both numbers land in the same period, the card shows beat / miss / in-line with a percentage delta. If only the reported number lands, the card shows just that number. If neither lands (foreign issuer, ticker not in Finnhub, period outside the 60-day window), nothing is shown rather than something potentially wrong.

5. Delivery

Once a row reaches status='ready', an in-process broadcaster publishes a Server-Sent Event to every connected browser tab. The frontend is a Vite + React + TypeScript single-page app served as static files by Caddy. SSE is fine for this; WebSockets would be overkill.
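An in-process broadcaster for SSE is little more than a set of per-client queues. A sketch of the fan-out (names are illustrative; the SSE wire format shown is the standard `data: ...\n\n` framing):

```python
import asyncio
import json

class Broadcaster:
    """Fan one event out to every connected SSE client via per-client queues."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def publish(self, payload: dict) -> None:
        # SSE wire format: "data: <json>" followed by a blank line.
        frame = f"data: {json.dumps(payload)}\n\n"
        for q in self._subscribers:
            q.put_nowait(frame)
```

Each SSE endpoint handler subscribes on connect, yields frames as they arrive, and unsubscribes in a finally block when the tab disconnects.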

The read API and SSE stream are served by FastAPI from a small pool of read-only SQLite connections (PRAGMA query_only=ON). The poller and processor each open their own write-capable connection. SQLite is in WAL mode — multiple readers + one writer is a fine fit.
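The reader/writer split is enforced at connection-open time, not by discipline. A minimal sketch of the two connection factories (the busy_timeout value is an assumption):

```python
import sqlite3

def open_reader(path: str) -> sqlite3.Connection:
    """Read-only connection for the API/SSE side; any write raises immediately."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA query_only=ON")
    return conn

def open_writer(path: str) -> sqlite3.Connection:
    """Write-capable connection for the poller/processor; enables WAL."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")  # wait out the other writer, ms
    return conn
```

With WAL on, readers never block the writer and vice versa, which is exactly the multiple-readers-one-writer shape described above.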

What this is not

Source

All of the above lives in github.com/rotoole1230/edgar-streaming. The conventions file at .claude/skills/edgar-stream-conventions/SKILL.md is the most concise statement of the project's design constraints.