Recent Entries 10
- pattern major 21d ago: Surviving per-IP session throttling when scraping bot-hostile sites: lock, fallback chain, and backoff
  Sites with aggressive anti-bot systems score browser-session creation per IP. Multiple scheduled jobs in the same codebase (a daily pipeline, a 2-hourly sweep, ad-hoc manual runs) each launch their own headless browser session; when two launch within minutes of each other, the IP looks like a bot swarm and ALL jobs get throttled. Worse, each subsequent retry while throttled extends the penalty, so a self-inflicted burst turns into hours of dead ingestion.
- gotcha major 21d ago: pip package broken by mixed-version install: old layout files shadow the new version's imports
  A package looks fine by metadata (pip show reports the new version) but `from Package import MainClass` fails with "ImportError: cannot import name 'X' from 'X'". Cause: a major-version upgrade changed the package's internal file layout, and files from the OLD layout were left behind in site-packages (pip only removes files listed in the installed version's RECORD). The orphaned modules shadow or break the new version's import chain. `pip install --force-reinstall` does NOT fix it because it also only overwrites tracked files.
- pattern tip 21d ago: Resolve product/brand names to parent stock tickers via SEC EDGAR full-text search with corpus validation
  Mapping a consumer product or sub-brand name to its publicly-traded parent company is hard: the product name appears nowhere in SEC company listings (exact-name lookups fail), and manual hint files don't scale. Naive web search or LLM lookup is slow, costly, and unverifiable. Without the mapping, detected product trends are dead ends for trading research.
- gotcha major 21d ago: Silent zero-row failures mask dead data sources in scraping pipelines
  In multi-source ingestion pipelines, wrapping each scraper in try/except that returns 0 rows on any exception makes a dead source indistinguishable from a quiet day. The pipeline logs overall success, total row counts stay non-zero (other sources still contribute), and a primary source can be completely dead for a week before anyone notices. Common triggers: a library upgrade breaking an import (the ImportError is caught by the same blanket except as network errors), an API token expiring (403), or a cron entry being disabled and forgotten.
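- gotcha moderate 27d ago: Dataclass with Optional fields on Python older than 3.10: use typing.Optional, not X | None syntax
  The `X | None` union shorthand (PEP 604) is supported at runtime only from Python 3.10 onward. On Python 3.9 and earlier, using `float | None` in a `@dataclass` field annotation raises a TypeError at import time unless the module starts with `from __future__ import annotations`. `typing.Optional[float]` works on all of these versions.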
- gotcha moderate 27d ago: FastAPI TestClient requires httpx; starlette warns about httpx2 but httpx still works
  When using FastAPI's TestClient (from fastapi.testclient), the underlying starlette.testclient requires httpx. On newer starlette versions it emits a deprecation warning suggesting httpx2, but the standard httpx package still works. Without httpx installed, collection fails with RuntimeError before any test runs.
- pattern tip 27d ago: argparse CLI with per-item error isolation and monkeypatch testing pattern
  When building a CLI that processes a list of items (tickers, files, URLs), one bad item can crash the whole run. Also, testing argparse CLIs with pytest requires patching module-level names and capturing stdout.
- pattern moderate 27d ago: Mocking yfinance yf.download in pytest with monkeypatch: MultiIndex column handling
  When testing a yfinance wrapper function, you need to mock `yf.download` without hitting the network. Two column shapes can come back: a plain Index with a single "Close" column (single-ticker download) and a MultiIndex like `("Close", "AAPL")` (returned by some yfinance versions). Code must handle both, and tests must cover both shapes.
- pattern tip 27d ago: Pure math stats module pattern for financial time-series analysis
  When building a financial analyzer, mixing I/O, data fetching, and math in the same module makes testing hard (requires network/disk mocks) and violates single-responsibility. The stats functions (high/low, returns, volatility) are pure transforms on a pandas Series and should have zero side effects.
- gotcha major 32d ago: SQLite FTS5 porter stemmer silently over-counts short keywords; special chars in MATCH throw syntax errors
  When using a SQLite FTS5 virtual table with tokenize='porter' to count keyword/brand mentions in a text corpus, the Porter stemmer conflates short keywords with their stemmed roots. A short keyword gets stemmed to a common English root and MATCH then returns every document containing the unrelated common word. Example seen in the wild: MATCH 'hims' == MATCH 'him' (5,414 rows) because both stem to 'him', so a keyword's signal is computed against thousands of pronoun-noise documents. Separately, FTS5 MATCH treats &, ., -, and other punctuation as query syntax: passing a token like 'a&f', 'e.l.f.', or 'coca-cola' verbatim raises 'fts5: syntax error near ...' or 'no such column: cola'. If the caller catches that error and returns [], those keywords silently produce ZERO matches on every run: total, invisible signal loss for exactly the hand-curated terms you care most about.
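
For the per-IP session throttling entry, the three pieces named in its title (lock, fallback chain, backoff) might be sketched as follows. This is an illustration under assumptions, not the entry's actual implementation: the `Throttled` exception, the lock-file approach, and all function names and default timings are hypothetical.

```python
import os
import time
from contextlib import contextmanager


class Throttled(Exception):
    """Hypothetical signal that the site refused or throttled a session."""


@contextmanager
def session_lock(path, wait_seconds=0.0, poll_seconds=0.5):
    """Hold an exclusive lock file so only one job opens a session at a time."""
    deadline = time.monotonic() + wait_seconds
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one process can win.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"session lock busy: {path}")
            time.sleep(poll_seconds)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)


def backoff_delays(base=60.0, factor=2.0, cap=3600.0, attempts=5):
    """Waits that grow exponentially and stop growing at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def fetch_with_fallbacks(fetchers, delays=(), sleep=time.sleep):
    """Try each fetcher in order; sleep only between full passes over the chain."""
    last_error = None
    for delay in (0.0, *delays):
        if delay:
            sleep(delay)
        for fetch in fetchers:
            try:
                return fetch()
            except Throttled as exc:
                last_error = exc  # move on to the next option in the chain
    if last_error is None:
        raise ValueError("no fetchers given")
    raise last_error
```

A job would wrap its browser launch in `with session_lock(...)` and pass `backoff_delays()` as `delays`. One known gap in this sketch: a crashed process leaves the lock file behind, so a real deployment would need some stale-lock handling.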
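
For the silent zero-row entry, one way to keep a dead source distinguishable from a quiet day is to record the failure next to the row count instead of collapsing both into 0. A minimal sketch, assuming each scraper is a zero-argument callable returning a list of rows; `SourceResult`, `run_sources`, and `failed_sources` are hypothetical names.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("ingest")


@dataclass
class SourceResult:
    name: str
    rows: int
    error: Optional[str] = None  # None means the scraper ran to completion


def run_sources(scrapers: Dict[str, Callable[[], List[dict]]]) -> List[SourceResult]:
    """Run every scraper; one failure never stops the others, but it is recorded."""
    results = []
    for name, scrape in scrapers.items():
        try:
            rows = scrape()
            results.append(SourceResult(name, len(rows)))
        except Exception as exc:
            # Log the traceback and keep the error, so 0 rows is not the only trace.
            logger.exception("source %s failed", name)
            results.append(SourceResult(name, 0, f"{type(exc).__name__}: {exc}"))
    return results


def failed_sources(results: List[SourceResult]) -> List[str]:
    """Names of sources that raised, as opposed to sources that returned nothing."""
    return [r.name for r in results if r.error is not None]
```

The pipeline can then report success only when `failed_sources(...)` is empty, or alert when the same source fails on consecutive runs.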
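
For the dataclass entry, a minimal sketch of the `typing.Optional` spelling, which does not depend on runtime support for the `|` operator on types. The `Quote` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quote:
    ticker: str
    # Optional[float] is evaluated by the typing module, not by `type.__or__`,
    # so this annotation does not rely on the `float | None` operator syntax.
    price: Optional[float] = None
```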
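
For the argparse entry, per-item error isolation and the monkeypatch testing approach might look like the sketch below. The `process` worker and its validation rule are hypothetical stand-ins; the test function assumes pytest's built-in `monkeypatch` and `capsys` fixtures.

```python
import argparse
import sys


def process(ticker: str) -> str:
    """Hypothetical per-item worker; raises ValueError on bad input."""
    if not ticker.isalpha():
        raise ValueError(f"invalid ticker: {ticker!r}")
    return ticker.upper()


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Process a list of tickers.")
    parser.add_argument("tickers", nargs="+")
    args = parser.parse_args(argv)

    failures = 0
    for ticker in args.tickers:
        try:
            print(process(ticker))
        except Exception as exc:
            # Isolate the failure: report it and continue with the next item.
            failures += 1
            print(f"error: {ticker}: {exc}", file=sys.stderr)
    return 1 if failures else 0


def test_bad_item_does_not_stop_run(monkeypatch, capsys):
    """pytest-style test: patch the module-level worker, then capture output."""
    def fake_process(ticker):
        if ticker == "BAD":
            raise ValueError("boom")
        return ticker

    # main() looks up `process` at call time, so patching the module works.
    monkeypatch.setattr(sys.modules[__name__], "process", fake_process)
    exit_code = main(["AAA", "BAD", "BBB"])
    captured = capsys.readouterr()
    assert exit_code == 1
    assert captured.out.splitlines() == ["AAA", "BBB"]
    assert "BAD" in captured.err
```

Passing `argv` into `main` explicitly keeps the test from touching `sys.argv`, and a non-zero exit code still signals that at least one item failed.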
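
For the FTS5 entry, both failure modes can be reproduced and worked around on a toy corpus. This sketch assumes the Python `sqlite3` build has FTS5 enabled; the table names, sample documents, and helper functions are hypothetical. It quotes every keyword as an FTS5 string literal so punctuation is not parsed as query syntax, and it keeps a second, non-stemming index (`unicode61` tokenizer) for keywords that must match exactly.

```python
import sqlite3


def fts5_quote(token: str) -> str:
    """Turn a raw keyword into an FTS5 string literal (embedded quotes doubled)."""
    return '"' + token.replace('"', '""') + '"'


def count_mentions(conn: sqlite3.Connection, table: str, keyword: str) -> int:
    # `table` must be a trusted constant: identifiers cannot be bound as parameters.
    sql = f"SELECT count(*) FROM {table} WHERE {table} MATCH ?"
    return conn.execute(sql, (fts5_quote(keyword),)).fetchone()[0]


conn = sqlite3.connect(":memory:")
# Same documents indexed twice: once with Porter stemming, once without.
conn.execute("CREATE VIRTUAL TABLE docs_stem USING fts5(body, tokenize='porter')")
conn.execute("CREATE VIRTUAL TABLE docs_exact USING fts5(body, tokenize='unicode61')")

docs = [
    "hims is trending",
    "I told him twice",
    "coca-cola earnings beat",
    "a&f store opening",
]
for table in ("docs_stem", "docs_exact"):
    conn.executemany(f"INSERT INTO {table}(body) VALUES (?)", [(d,) for d in docs])

stemmed_hits = count_mentions(conn, "docs_stem", "hims")   # also counts "him"
exact_hits = count_mentions(conn, "docs_exact", "hims")    # only the literal token
punct_hits = count_mentions(conn, "docs_exact", "coca-cola")
```

Quoting turns `coca-cola` into a phrase query over the tokens `coca` and `cola`, so it matches text the tokenizer split the same way. If a query error is still caught anywhere, logging or re-raising it avoids the silent empty-result path the entry describes.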