Stowage/browsertrix-crawler - Remotebranch.eu

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	fcfb6b7ae3	fix tests	2025-09-12 09:03:19 -07:00
Ilya Kreymer	e72b34318d	Add WARC-Protocol header (#715) - add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42 - also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-05-19 18:59:52 -07:00
Ilya Kreymer	5c00bca2b4	tests: use old.webrecorder.net for testing (#710) replace webrecorder.net -> old.webrecorder.net to fix tests relying on old website for now	2024-10-31 13:24:58 -04:00
Ilya Kreymer	85a07aff18	Streaming in-place WACZ creation + CDXJ indexing (#673) Fixes #674. This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the Linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, only the CDXJ is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-08-29 13:21:20 -07:00
Tessa Walsh	fd98033268	Loosen selectors for login fields in automated profile creation (#638) Fixes #637. - Username will match if name attribute is one of: user, username, email - Password will match if type is password and name attribute is one of: pass, password. This loosens the rules sufficiently to solve the issue with the URL in the linked issue without requiring users to pass custom CSS selectors at this point. It looks like we were also using XPath methods like contains whereas puppeteer expects CSS selectors, hence the syntax change.	2024-07-11 15:55:06 -07:00
Ilya Kreymer	894681e5fc	Bump version to 1.2.0 Beta + make draft release for each commit (#582) Generate draft release from main and *-release branches to simplify release process --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-05-22 15:45:48 -07:00
Ilya Kreymer	9f18a49c0a	Better tracking of failed requests + logging context exclude (#485) - add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included) - enable 'recorderNetwork' logging for debugging CDP network - create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext recorder: Track failed requests and include in pageinfo records with status code 0 - cleanup cdp handler methods - intercept requestWillBeSent to track requests that started (but may not complete) - fix shouldSkip() still working if no url is provided (e.g. check only headers) - set status to 0 for async fetch failures - remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called - pageinfo: include page requests that failed with status code 0, also include 'error' status if available. - ensure page is closed on failure - ensure pageinfo still written even if nothing else is crawled for a page - track cached responses, add to debug logging (can also add to pageinfo later if needed) tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0 bump to 1.0.0-beta.7	2024-03-07 11:35:53 -05:00
Ilya Kreymer	5a47cc4b41	warc: add Network.resourceType (https://chromedevtools.github.io/devt … (#481) Add resourceType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention fixes #451	2024-03-04 18:10:45 -08:00
Ilya Kreymer	51660cdcc4	pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471) Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.	2024-02-21 16:02:25 -08:00
Ilya Kreymer	a512e92886	Include resource type + mime type in page resources list (#468) The `:pageinfo:<url>` record now includes the mime type + resource type (from Chrome) along with status code for each resource, for better filtering / comparison.	2024-02-19 19:11:48 -08:00
Ilya Kreymer	e8f2073a7e	Update Browser Image (#466) - Update to Brave browser (1.62.165) - Update page resource test to reflect latest Brave behavior	2024-02-17 22:40:12 -08:00
Ilya Kreymer	96f3c407b1	Page Resources: Include Cached Resources (#465) Ensure cached resources (that are not written to WARC) are still included in the `url:pageinfo:...` records. This will make it easier to track which resources are actually loaded from a given page. Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about include cached resources	2024-02-16 14:36:32 -08:00
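
The login-field matching rules described in commit fd98033268 can be pictured as plain CSS selector lists. The sketch below is illustrative only and is not taken from the repository: the helper `nameSelector` and the constants `usernameSelector` and `passwordSelector` are hypothetical names, and it assumes the `name` attribute is matched exactly, which the commit message does not state.

```typescript
// Hypothetical helper (not from browsertrix-crawler): builds a CSS selector
// list matching <input> elements whose name attribute equals one of `names`.
// `extra` is an optional attribute filter applied to every alternative.
function nameSelector(names: string[], extra: string = ""): string {
  return names.map((n) => `input${extra}[name='${n}']`).join(", ");
}

// Username: name attribute is one of user, username, email
// (exact match is an assumption).
const usernameSelector: string = nameSelector(["user", "username", "email"]);

// Password: type is password AND name attribute is one of pass, password.
const passwordSelector: string = nameSelector(
  ["pass", "password"],
  "[type='password']",
);

console.log(usernameSelector);
console.log(passwordSelector);
```

Selector strings of this shape could be passed to any API that accepts CSS selectors, such as `document.querySelector` in a page context.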