Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-12-08 06:09:48 +00:00

Commit graph

Author

SHA1

Message

Date

Ilya Kreymer

705bc0cd9f

Async Fetch Refactor (#880)

- separate out reading stream response while browser is waiting (not
really async) from actual async loading, this is not handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling request after headers
- refactor direct fetch of non-html pages: load headers and handle
loading body, adding page async, allowing worker to continue loading
browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading

2025-09-10 12:05:21 -07:00
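
The "load headers and body separately ... allowing for cancelling request after headers" item in the commit above can be illustrated with a minimal sketch. This is not the crawler's code: `fetchHeadersThenBody`, `FetchLike`, `HeaderCheckResult`, and the `wantBody` callback are hypothetical names chosen for illustration, and the sketch assumes a runtime with the standard `fetch`, `Response`, and `AbortController` globals (for example Node 18+).

```typescript
// Hypothetical sketch; none of these names come from browsertrix-crawler.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<Response>;

interface HeaderCheckResult {
  status: number;
  contentType: string;
  // null when the request was cancelled after the headers were inspected
  body: Uint8Array | null;
}

async function fetchHeadersThenBody(
  url: string,
  wantBody: (contentType: string) => boolean,
  fetchImpl: FetchLike = fetch,
): Promise<HeaderCheckResult> {
  const controller = new AbortController();

  // fetch() resolves once response headers are available;
  // the body has not been consumed at this point.
  const resp = await fetchImpl(url, { signal: controller.signal });
  const contentType = resp.headers.get("content-type") ?? "";

  if (!wantBody(contentType)) {
    // Decide from headers alone, then cancel instead of reading the body.
    controller.abort();
    return { status: resp.status, contentType, body: null };
  }

  // Second step: read the body only when the headers say it is wanted.
  const body = new Uint8Array(await resp.arrayBuffer());
  return { status: resp.status, contentType, body };
}
```

A caller could pass a predicate such as `(ct) => !ct.startsWith("text/html")` to read bodies only for non-HTML responses. Injecting `fetchImpl` is a design choice made here so the sketch can be exercised without network access.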

Ilya Kreymer

4495532606

Always download PDF + non HTML page cleanup + enterprise policy cleanup (#629)

Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but response is
actually download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function,
applied for all pages (while post-load delay only applied to HTML pages)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

2024-06-26 09:16:24 -07:00

2 commits