Methodology
This page describes how filings get from EDGAR onto the homepage. The whole pipeline is roughly 2,500 lines of Python and a SQLite file. There is no microservice diagram because there are no microservices.
1. The poller
Every minute during market hours (every 30 minutes overnight; everything in ET so cron survives DST), a single async process fetches the SEC's "latest 8-K" atom feed at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=8-K&output=atom. New accession numbers get inserted into a SQLite filings table with status='pending' via INSERT OR IGNORE. The same poller hydrates the companies table from SEC's company_tickers.json on a 7-day refresh.
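The dedup step is the whole reason re-polling the same feed is safe: a primary key on the accession number plus INSERT OR IGNORE makes inserts idempotent. A minimal sketch (table and column names are illustrative, not the repo's exact schema):

```python
import sqlite3

def upsert_pending(conn: sqlite3.Connection, accession_numbers: list) -> int:
    """Insert newly seen accession numbers as status='pending'.
    Duplicates hit the primary key and are silently ignored, so the
    return value is the count of genuinely new filings this poll."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO filings (accession_no, status) VALUES (?, 'pending')",
        [(a,) for a in accession_numbers],
    )
    conn.commit()
    return conn.total_changes - before

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (accession_no TEXT PRIMARY KEY, status TEXT)")
print(upsert_pending(conn, ["0001-23-000001", "0001-23-000002"]))  # 2 new rows
print(upsert_pending(conn, ["0001-23-000002", "0001-23-000003"]))  # 1 new row
```

Because the second call only counts the one unseen accession number, the poller can re-read the same feed every minute without double-processing anything.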
Everything that hits sec.gov goes through an 8-requests-per-second sliding-window limiter shared across the process. The SEC's published limit is 10/s; we leave headroom for retries. The User-Agent header includes a real contact email, because the SEC returns 403 to requests without one.
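A sliding-window limiter of this shape can be sketched in a few lines of asyncio. This is an illustrative reconstruction, not the repo's implementation; class and parameter names are mine:

```python
import asyncio
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `rate` acquisitions in any `per`-second window,
    shared by every coroutine in the process."""

    def __init__(self, rate: int = 8, per: float = 1.0):
        self.rate, self.per = rate, per
        self._stamps: deque = deque()   # monotonic timestamps of recent acquisitions
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                while self._stamps and now - self._stamps[0] >= self.per:
                    self._stamps.popleft()
                if len(self._stamps) < self.rate:
                    self._stamps.append(now)
                    return
                # Window full: sleep until the oldest stamp expires.
                await asyncio.sleep(self.per - (now - self._stamps[0]))

async def demo() -> float:
    limiter = SlidingWindowLimiter(rate=8, per=1.0)
    start = time.monotonic()
    for _ in range(16):        # 16 acquisitions at 8/s should take about 1 second
        await limiter.acquire()
    return time.monotonic() - start

print(f"{asyncio.run(demo()):.2f}s")
```

Holding the lock while sleeping is deliberate: it serializes waiters so a burst of coroutines can't all observe a free slot at once.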
2. The fetcher + cleaner
For each pending row, a separate processor follows the EDGAR index page to the primary 8-K document and any EX-99.x press releases or supplemental tables (or EX-2.x / EX-10.x material-agreement exhibits when no EX-99 is present). The body and chosen exhibits fetch in parallel through the same 8 req/s gate.
Raw HTML goes through a deterministic cleaning pipeline that strips iXBRL header blocks, XBRL tag runs, exhibit-routing boilerplate, repeated lines, and excess whitespace. Typical reduction is 30–50% of the input character count — meaningful when the LLM input budget is finite and earnings filings can run to 60 KB of body + exhibits combined.
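The real rule set lives in the repo; as a flavor of what "deterministic cleaning" means here, two of the simpler passes (repeated-line removal and whitespace collapse) might look like this. The sample input is invented:

```python
import re

def dedupe_lines(text: str) -> str:
    """Drop consecutive duplicate non-blank lines (repeated footers, page labels).
    Blank lines pass through; the whitespace pass handles those."""
    out, prev = [], object()
    for line in text.splitlines():
        if line.strip() and line == prev:
            continue
        out.append(line)
        prev = line
    return "\n".join(out)

def collapse_whitespace(text: str) -> str:
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    return re.sub(r"\n{3,}", "\n\n", text)  # 2+ blank lines -> one blank line

raw = "Item 2.02 Results\n\n\n\nPage 1\nPage 1\nRevenue   was   $1.2B"
print(collapse_whitespace(dedupe_lines(raw)))
```

Both passes are pure string-to-string functions, which is what keeps the cleaning stage replayable: the same raw HTML always yields the same cleaned text.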
3. The LLM
The cleaned text and the extracted item codes go to a single LLM call with a strict-JSON response format. Today I run deepseek-v4-flash:cloud on Ollama Cloud — fast, cheap, good enough at structured extraction. The prompt asks for a one-line headline, exactly three bullets, an event_type from a fixed taxonomy, the tickers mentioned, and a confidence rating. The response is parsed against a Pydantic schema. If it fails to parse, the prompt is reissued once with a stricter "your previous response was not valid JSON" header. A second failure marks the row as failed.
The LLM is the only opinionated step in the pipeline. Everything else is deterministic. Each row records the prompt and model identifiers in its summary_model column, so any future reprompt or model swap can be audited.
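The parse-once-retry-once logic is simple enough to show. The real pipeline validates against a Pydantic schema; this is a dependency-free stand-in with the same control flow, where the LLM call is abstracted as a plain callable and the field checks mirror the prompt's requirements:

```python
import json
from typing import Callable, Optional

RETRY_HEADER = "your previous response was not valid JSON\n\n"

def validate(raw: str) -> Optional[dict]:
    """Stand-in for the Pydantic schema: parse, then shape-check the fields
    the prompt demands (headline, exactly three bullets, event_type, etc.)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for key in ("headline", "event_type", "tickers", "confidence"):
        if key not in obj:
            return None
    bullets = obj.get("bullets")
    if not (isinstance(bullets, list) and len(bullets) == 3):
        return None  # the prompt asks for exactly three bullets
    return obj

def summarize(call_llm: Callable[[str], str], prompt: str) -> Optional[dict]:
    """One call, one stricter retry; None means the row is marked failed."""
    for attempt in (prompt, RETRY_HEADER + prompt):
        parsed = validate(call_llm(attempt))
        if parsed is not None:
            return parsed
    return None
```

Keeping the retry inside one function (rather than re-queueing the row) means a filing is never half-summarized: it either produces a fully valid object or lands in the failed state.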
4. Earnings enrichment
When a filing's classified event type is earnings, the processor follows up with two best-effort calls:
- SEC XBRL companyconcept at data.sec.gov/api/xbrl/companyconcept/CIK.../us-gaap/EarningsPerShareDiluted.json (and the Revenues synonyms) for the reported figure. Goes through the same 8 req/s limiter.
- Finnhub free tier at /stock/earnings?symbol=... for the analyst consensus estimate. Capped at 58 calls/minute by an internal limiter so we stay under their free-tier ceiling.
The match between XBRL period-end and Finnhub period is fuzzy by a small tolerance to absorb fiscal-year offsets. If both numbers land in the same period, the card shows beat / miss / in-line with a percentage delta. If only reported lands, the card shows just the number. If neither lands (foreign issuer, ticker not in Finnhub, period outside the 60-day window), nothing is shown rather than something potentially wrong.
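The period-matching and verdict logic reduces to a few comparisons. A sketch under assumed parameters — the 10-day tolerance and 1% in-line threshold are illustrative placeholders, not the repo's actual values:

```python
from datetime import date, timedelta
from typing import Optional

TOLERANCE = timedelta(days=10)  # illustrative; absorbs fiscal-calendar offsets

def same_period(xbrl_end: date, finnhub_end: date,
                tol: timedelta = TOLERANCE) -> bool:
    """Fuzzy match: the two sources rarely agree on the exact period-end date."""
    return abs(xbrl_end - finnhub_end) <= tol

def verdict(reported: Optional[float], estimate: Optional[float]) -> Optional[str]:
    """Beat / miss / in-line with a percentage delta.
    None means show nothing rather than something potentially wrong."""
    if reported is None:
        return None
    if estimate is None:
        return f"reported {reported}"        # only the reported figure landed
    delta = (reported - estimate) / abs(estimate) * 100
    if abs(delta) < 1.0:                     # illustrative in-line threshold
        return f"in-line ({delta:+.1f}%)"
    return f"{'beat' if delta > 0 else 'miss'} ({delta:+.1f}%)"
```

The asymmetry is the point: every branch degrades toward showing less, so a missing Finnhub ticker or an out-of-window period produces a quieter card, never a wrong one.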
5. Delivery
Once a row reaches status='ready', an in-process broadcaster publishes a Server-Sent Event to every connected browser tab. The frontend is a Vite + React + TypeScript single-page app served as static files by Caddy. SSE is fine for this; WebSockets would be overkill.
The read API and SSE stream are served by FastAPI from a small pool of read-only SQLite connections (PRAGMA query_only=ON). The poller and processor each open their own write-capable connection. SQLite is in WAL mode — multiple readers + one writer is a fine fit.
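The read-only pool is worth showing because PRAGMA query_only=ON is what enforces the reader/writer split at the connection level, not just by convention. A minimal sketch (pool size and table schema are illustrative):

```python
import os
import queue
import sqlite3
import tempfile

def make_read_pool(path: str, size: int = 4) -> queue.Queue:
    """Pool of read-only SQLite connections for the API and SSE handlers.
    The poller and processor open their own write-capable connections."""
    pool: queue.Queue = queue.Queue()
    for _ in range(size):
        conn = sqlite3.connect(path, check_same_thread=False)
        conn.execute("PRAGMA query_only=ON")  # this handle now rejects all writes
        pool.put(conn)
    return pool

# Writer side: WAL mode allows these readers to coexist with one writer.
path = os.path.join(tempfile.mkdtemp(), "filings.db")
writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE filings (accession_no TEXT PRIMARY KEY, status TEXT)")
writer.commit()

pool = make_read_pool(path, size=2)
reader = pool.get()
print(reader.execute("SELECT COUNT(*) FROM filings").fetchone()[0])
pool.put(reader)
```

Any INSERT or UPDATE attempted through a pooled connection raises sqlite3.OperationalError, so a bug in a read handler cannot corrupt the write path.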
What this is not
- Real-time enough for trading. The 30-second median latency from EDGAR to your browser is good for context, not for execution.
- A replacement for reading the filing. The headline is a pointer, not a verdict. Every card links to the EDGAR source.
- Comprehensive across SEC forms. 8-K only, today. Form 4 (insider trades), 10-K/Q, and S-1 are out of scope for now.
Source
All of the above lives in github.com/rotoole1230/edgar-streaming. The conventions file at .claude/skills/edgar-stream-conventions/SKILL.md is the most concise statement of the project's design constraints.