markcrawl
v3
Python CLI and library for site crawling, markdown extraction, screenshot capture, and page classification. v3 closes three unresolved v2 feedback items: (1) bug fe6f3c39, binary missing flags: the root cause was a stale local install, since PyPI 0.9.3+ already had the flags; v0.10.0, a CI smoke job, and README upgrade docs prevent regression; (2) perf c69402e7: the actual bug was an uncapped backoff that doubled from 1s (the reported "fixed 60s" was misremembered), now replaced with tenacity full-jitter backoff, 2s→30s, 5 attempts; (3) works_well a179496d: designlens consumers can stop using the Playwright fallback after `pip install --upgrade markcrawl`. v3 also adds ERROR-severity terminal-failure logs with URL extraction, before_sleep retry observability, and a CI workflow that catches source-vs-PyPI drift.
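Because the missing-flags report traced back to a stale local install, checking the installed version is a cheap first step before filing a bug. The sketch below is an illustration only: it assumes the installed distribution is named `markcrawl` (as in the upgrade command above), the helper names are hypothetical, and it compares only the leading numeric release components of the version string.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 9, 3)  # first PyPI release stated above to include the flags


def parse_version(text):
    """Return the leading numeric release components, e.g. '0.10.0' -> (0, 10, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at a suffix such as 'rc1'
    return tuple(parts)


def is_stale(installed, minimum=MIN_VERSION):
    """True when the installed version sorts below the minimum release."""
    return parse_version(installed) < minimum


def check_install(dist_name="markcrawl"):
    """Describe the local install; dist_name is an assumption."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        return f"{dist_name} is not installed"
    if is_stale(installed):
        return f"{dist_name} {installed} is stale; run: pip install --upgrade {dist_name}"
    return f"{dist_name} {installed} is recent enough"


if __name__ == "__main__":
    print(check_install())
```

A plain tuple comparison is enough here because the only question is whether the install predates 0.9.3; it is not a full version-specifier parser.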
| Capability | What it does |
|---|---|
| crawl_site | Crawl URLs starting from --base or --seed-file, with sitemap support, content extraction, and configurable rate limiting |
| extract_markdown | Convert HTML to clean markdown via markdownify; output as pages.jsonl with one row per crawled page |
| classify_pages | Apply page-type classification heuristics (article/listing/product/etc.) |
| screenshot_capture | Capture screenshots via Playwright at a configurable viewport (--screenshot, --screenshot-viewport WxH, --screenshot-selector, --screenshot-format png/jpeg, --screenshot-wait-ms, --no-screenshot-full-page). Available since PyPI 0.9.x; v3 documents the upgrade path so consumers can stop falling back to calling Playwright directly. |
| tunable_retry_backoff | v3 NEW. tenacity-backed retry: 5 attempts max, full-jitter exponential 2s→30s, Retry-After header honored via _ExpoJitterRetryAfter wait_base subclass, before_sleep logger emits structured per-retry observability ([retry] attempt=N status=... url=... sleep=Xs elapsed=Ys), terminal-failure ERROR with URL for operator diagnostics. |
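
The retry schedule in the last row can be approximated with the standard library alone. This is a minimal sketch, not markcrawl's implementation: it assumes "2s→30s" means a 2-second base and a 30-second cap, treats full jitter as a uniform draw between zero and the capped exponential ceiling, lets a numeric `Retry-After` value override the computed delay, and guesses at the set of retryable status codes. The table does not say how `_ExpoJitterRetryAfter` combines the header with the jittered delay, and all function names below are hypothetical.

```python
import random
import time

BASE_S = 2.0       # assumed base delay
CAP_S = 30.0       # assumed maximum delay
MAX_ATTEMPTS = 5
RETRYABLE = {429, 500, 502, 503, 504}  # assumption: not stated in the source


def backoff_ceiling(attempt, base=BASE_S, cap=CAP_S):
    """Upper bound of the sleep after failed attempt `attempt` (1-based)."""
    return min(cap, base * 2 ** (attempt - 1))


def next_sleep(attempt, retry_after=None, rng=random):
    """Seconds to wait before the next attempt.

    Assumption: a numeric Retry-After header overrides the jittered delay.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return rng.uniform(0.0, backoff_ceiling(attempt))


def fetch_with_retry(fetch, url, sleep=time.sleep, rng=random):
    """Call fetch(url) up to MAX_ATTEMPTS times.

    `fetch` is a hypothetical callable returning (status, retry_after_header).
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after = fetch(url)
        if status not in RETRYABLE:
            return status
        if attempt == MAX_ATTEMPTS:
            break
        delay = next_sleep(attempt, retry_after, rng)
        # Loosely mirrors the per-retry log shape described in the table.
        print(f"[retry] attempt={attempt} status={status} url={url} sleep={delay:.1f}s")
        sleep(delay)
    raise RuntimeError(f"giving up after {MAX_ATTEMPTS} attempts: {url}")
```

Under these assumptions five attempts give four sleeps with ceilings of 2, 4, 8, and 16 seconds, so the 30-second cap would only bind if more attempts were allowed. Drawing from the full range below each ceiling spreads out clients that failed at the same moment, which is the usual reason to prefer full jitter over a fixed doubling schedule.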
Usage guide
Detailed usage docs for markcrawl are being published. Request access and we'll get you started directly.
Early access is free while we're in beta.