markcrawl

Python CLI and library for site crawling, markdown extraction, screenshot capture, and page classification.

v3 closes three unresolved v2 feedback items:

(1) bug fe6f3c39, binary missing flags. The root cause was a stale local install; PyPI 0.9.3+ already had the flags. v0.10.0, a CI smoke job, and README upgrade docs now guard against regression.
(2) perf c69402e7. Retry backoff doubled from 1s with no cap. That was the actual bug; the reported 'fixed 60s' delay was misremembered. It is replaced with tenacity full-jitter backoff, 2s→30s, 5 attempts.
(3) works_well a179496d. designlens consumers can stop using the Playwright fallback after `pip install --upgrade markcrawl`.

v3 also adds ERROR-severity terminal-failure logs with URL extraction, before_sleep retry observability, and a CI workflow that catches source-vs-PyPI drift.
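To illustrate why the perf c69402e7 change matters, here is a minimal sketch comparing the two delay schedules. It is not markcrawl's code: the function names are hypothetical, and it assumes "2s→30s" means a 2s base with a 30s cap on the exponential ceiling, with the actual delay drawn uniformly between 0 and that ceiling (the usual meaning of full jitter).

```python
import random


def uncapped_doubling_delay(attempt: int, first: float = 1.0) -> float:
    """Old schedule as described: 1s, 2s, 4s, ... with no upper bound."""
    return first * 2 ** (attempt - 1)


def jitter_ceiling(attempt: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Upper bound for the new schedule (assumed: 2s base, 30s cap)."""
    return min(cap, base * 2 ** (attempt - 1))


def full_jitter_delay(attempt: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Full jitter: a uniform random delay between 0 and the capped ceiling."""
    return random.uniform(0.0, jitter_ceiling(attempt, base, cap))


if __name__ == "__main__":
    for attempt in range(1, 6):
        print(
            f"attempt={attempt} "
            f"uncapped={uncapped_doubling_delay(attempt):.0f}s "
            f"jitter_ceiling={jitter_ceiling(attempt):.0f}s "
            f"sampled={full_jitter_delay(attempt):.1f}s"
        )
```

Without a cap and an attempt limit, the doubling schedule grows without bound (a tenth attempt would wait 512s), while the capped ceiling never exceeds 30s. The random draw also spreads out retries from concurrent workers so they do not all hit the server at once.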

Capabilities

Capability	What it does
crawl_site	Crawl URLs starting from --base or --seed-file, with sitemap support, content extraction, and configurable rate limiting
extract_markdown	Convert HTML to clean markdown via markdownify; output as pages.jsonl with one row per crawled page
classify_pages	Apply page-type classification heuristics (article/listing/product/etc.)
screenshot_capture	Capture screenshots via Playwright at a configurable viewport (--screenshot, --screenshot-viewport WxH, --screenshot-selector, --screenshot-format png/jpeg, --screenshot-wait-ms, --no-screenshot-full-page). Available since PyPI 0.9.x; v3 documents the upgrade path so consumers can stop falling back to calling Playwright directly.
tunable_retry_backoff	v3 NEW. tenacity-backed retry: 5 attempts max, full-jitter exponential 2s→30s, Retry-After header honored via _ExpoJitterRetryAfter wait_base subclass, before_sleep logger emits structured per-retry observability ([retry] attempt=N status=... url=... sleep=Xs elapsed=Ys), terminal-failure ERROR with URL for operator diagnostics.
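
The tunable_retry_backoff row above can be sketched as a plain retry loop. This is a stdlib-only illustration, not markcrawl's implementation, which is tenacity-backed. Everything named here is hypothetical: `fetch_with_retry`, `next_sleep`, the `fetch` callable returning `(status, retry_after_seconds)`, the set of retryable status codes, and the wording of the final ERROR message. It also assumes Retry-After is honored by treating the server's value as a floor on the jittered delay. Only the per-retry log format is taken from the table.

```python
import logging
import random
import time
from typing import Callable, Optional, Tuple

log = logging.getLogger("retry_sketch")

# Assumption: which statuses count as retryable is not stated in the source.
RETRYABLE = {429, 500, 502, 503, 504}


def next_sleep(attempt: int, retry_after: Optional[float],
               base: float = 2.0, cap: float = 30.0) -> float:
    """Full-jitter delay, raised to the Retry-After value when one is given."""
    jittered = random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
    if retry_after is not None:
        return max(jittered, retry_after)  # assumed policy: server value is a floor
    return jittered


def fetch_with_retry(url: str,
                     fetch: Callable[[str], Tuple[int, Optional[float]]],
                     sleep: Callable[[float], None] = time.sleep,
                     max_attempts: int = 5) -> int:
    """Call fetch(url) up to max_attempts times; return the first non-retryable status."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        status, retry_after = fetch(url)
        if status not in RETRYABLE:
            return status
        if attempt == max_attempts:
            # Terminal failure: log at ERROR with the URL so operators can find it.
            log.error("[retry] giving up url=%s status=%s attempts=%d",
                      url, status, attempt)
            raise RuntimeError(f"retries exhausted for {url} (last status {status})")
        delay = next_sleep(attempt, retry_after)
        # Per-retry line in the format given in the capability table.
        log.warning("[retry] attempt=%d status=%s url=%s sleep=%.1fs elapsed=%.1fs",
                    attempt, status, url, delay, time.monotonic() - start)
        sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, for example `fetch_with_retry(url, fake_fetch, sleep=recorded.append)`.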

Usage guide

Detailed usage docs for markcrawl are being published. Request access and we'll get you started directly.

Ready to try markcrawl?

Early access is free while we're in beta.