Commit graph

464 commits

Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ), while logs and pages are still written and saved state can be
generated (per the --saveState options)
- adds a test to ensure only the 'logs' and 'pages' dirs are generated with --dryRun
- screenshot and text extraction are skipped altogether in dryRun mode, and a
warning is printed that storage- and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
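
Below is a minimal sketch of the dry-run gating described in the commit above; the parameter and directory names are illustrative, not the crawler's actual API.

```typescript
// Hypothetical sketch: only create archive outputs when not in dry-run mode.
interface CrawlParams {
  dryRun: boolean;
  collDir: string;
}

function outputDirs(params: CrawlParams): string[] {
  // logs and pages are always written
  const dirs = ["logs", "pages"];
  if (params.dryRun) {
    console.warn("--dryRun: storage and archiving-related options may be ignored");
  } else {
    // archive data (WARCs, WACZ, CDXJ) is only written in a normal crawl
    dirs.push("archive", "indexes");
  }
  return dirs.map((d) => `${params.collDir}/${d}`);
}
```
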
benoit74
32435bfac7
Consider disk usage of collDir instead of default /crawls (#586)
Fix #585 

Changes:
- compute disk usage based on the crawler's `collDir` property instead of
always computing it on the `/crawls` directory
2024-06-07 10:13:15 -07:00
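
A sketch of the idea behind this fix: measure disk usage of the crawler's `collDir` rather than a hard-coded `/crawls`. The `df`-parsing helper below is an assumption for illustration, not the crawler's own implementation.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: return the Use% of the filesystem containing collDir.
function getDiskUsagePercent(collDir: string): number {
  const out = execSync(`df ${collDir}`, { encoding: "utf-8" });
  const lastLine = out.trim().split("\n").pop() || "";
  const usePercent = lastLine.split(/\s+/)[4] || "0%"; // e.g. "42%"
  return parseInt(usePercent.replace("%", ""), 10);
}
```
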
Ilya Kreymer
1bd94d93a1
cleanup dockerfile + fix test (#595)
- remove obsolete line from Dockerfile
- fix PDF test to use a webrecorder-hosted PDF
2024-06-06 12:14:44 -07:00
Vinzenz Sinapius
068ee79288
Add group policies, limit browser access to container filesystem (#579)
Add some default policy settings to disable unneeded Brave features.
Helps a bit with #463, but Brave unfortunately doesn't provide all
mentioned settings as policy options.

The most important changes are in
`config/policies/lockdown-profilebrowser.json`, which limits access to the
container filesystem, especially during interactive profile browser
creation.
2024-06-05 12:46:49 -07:00
Ilya Kreymer
757e838832
base image version bump to brave 1.66.115 (#592) 2024-06-04 13:35:13 -07:00
Ilya Kreymer
a7d279cfbd
Load non-HTML resources directly whenever possible (#583)
Optimize the direct loading of non-HTML pages. Currently, the behavior
is:
- make a HEAD request first
- make a direct fetch request only if the HEAD request returns non-HTML and a 200 status
- only use the fetch response if it is non-HTML, 200, and doesn't set any cookies

This changes the behavior to:
- get cookies from the browser for the page URL
- make a direct fetch request with those cookies, if provided
- only use the fetch response if it is non-HTML and 200
Also:
- ensures pageinfo is properly set with a timestamp for direct fetch
- removes obsolete Agent handling that is no longer used by the default
(fetch)

If the fetch request results in HTML, the response is aborted and browser
loading is used instead.
2024-05-24 14:51:51 -07:00
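
A rough sketch of the revised direct-fetch flow (cookies first, fetch, abort on HTML), assuming a Puppeteer `Page` and Node's global `fetch`; the helper name and WARC-writing step are placeholders.

```typescript
import type { Page } from "puppeteer-core";

// Returns true if the resource was fetched directly, false if the browser should load it.
async function tryDirectFetch(page: Page, url: string): Promise<boolean> {
  // get cookies from the browser for the page URL and pass them along
  const cookies = await page.cookies(url);
  const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join("; ");

  const resp = await fetch(url, {
    headers: cookieHeader ? { cookie: cookieHeader } : {},
  });

  const isHTML = (resp.headers.get("content-type") || "").includes("text/html");
  if (resp.status !== 200 || isHTML) {
    // abort the direct fetch and fall back to loading the page in the browser
    await resp.body?.cancel();
    return false;
  }
  // ...stream resp.body to the WARC writer and record pageinfo with a timestamp...
  return true;
}
```
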
Ilya Kreymer
089d901b9b
Always add warcinfo records to all WARCs (#556)
Fixes #553 

Includes `warcinfo` records at the beginning of new WARCs, as well as in
the combined WARC.
Also makes the warcinfo record WARC/1.1 to match the rest of the WARC
records.
2024-05-22 15:47:05 -07:00
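
A sketch of what such a record might look like, assuming warcio.js's `WARCRecord.createWARCInfo()` helper; the field values are illustrative.

```typescript
import { WARCRecord } from "warcio";

// Illustrative: a WARC/1.1 warcinfo record placed at the start of each new WARC.
function createWarcInfoRecord(filename: string) {
  const info = {
    software: "Browsertrix-Crawler (version string here)",
    format: "WARC File Format 1.1",
  };
  return WARCRecord.createWARCInfo({ filename, warcVersion: "WARC/1.1" }, info);
}
```
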
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Ilya Kreymer
6c15bb3f00 version: bump to 1.1.3 2024-05-21 16:37:03 -07:00
Tessa Walsh
1fcd3b7d6b
Fix failOnFailedLimit and add tests (#580)
Fixes #575

- Adds a missing await to fetching the number of failed pages from Redis
- Fixes a typo in the fatal logging message
- Adds a test to ensure that the crawl fails with exit code 17 if
--failOnInvalidStatus and --failOnFailedLimit 1 are set with a URL that
will 404
2024-05-21 16:35:43 -07:00
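
The missing-await bug is easy to show in miniature: without `await`, the comparison sees a Promise rather than a number. The Redis client shape and key name below are assumptions for illustration.

```typescript
// Illustrative sketch of the fix, not the crawler's actual code.
async function checkFailedLimit(
  redis: { llen(key: string): Promise<number> },
  failOnFailedLimit: number,
): Promise<void> {
  // before the fix: `const numFailed = redis.llen(...)` (a Promise, not a number)
  const numFailed = await redis.llen("crawl:failed");
  if (failOnFailedLimit && numFailed >= failOnFailedLimit) {
    throw new Error(`Failed page limit of ${failOnFailedLimit} reached, failing crawl`);
  }
}
```
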
Ilya Kreymer
27226255ee
Sitemap Parsing Fixes (#578)
Additional fixes for sitemaps:
- Fix parsing sitemaps that have data wrapped in CDATA fields, fixes
part of https://github.com/webrecorder/browsertrix/issues/1750
- Fix parsing where .gz sitemaps have a content-encoding header and are
actually not gzipped
- Ensure an error in gzip parsing doesn't break the crawl, only fails sitemap
parsing.
2024-05-21 14:24:17 -07:00
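
A small sketch of the CDATA normalization idea (not the crawler's actual sitemap parser): strip a CDATA wrapper from a `<loc>` value before treating it as a URL.

```typescript
// Illustrative: tolerate <loc><![CDATA[...]]></loc> as well as plain text.
function parseLoc(rawText: string): string {
  const text = rawText.trim();
  const cdata = text.match(/^<!\[CDATA\[(.*)\]\]>$/s);
  return (cdata ? cdata[1] : text).trim();
}

// parseLoc("<![CDATA[ https://example.com/page ]]>") === "https://example.com/page"
```
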
Ilya Kreymer
6b04a39f2f
save state: export pending list as array of json strings + fix importing save state to support pending (#576)
The save state export accidentally exported the pending data as an
object instead of a list of JSON strings (as it is stored in Redis),
while import was expecting a list of JSON strings. The getPendingList()
function parses the JSON but was then re-encoding it for writeStats(),
which was likely a mistake.
This PR fixes things:
- support loading pending state as both an array of objects and an array of
JSON strings for backwards compatibility
- save state as an array of JSON strings
- remove JSON decoding and encoding in getPendingList() and writeStats()

Fixes #568
2024-05-21 10:58:35 -07:00
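
A sketch of the serialization shape this settles on, with hypothetical helper names: pending entries stay as the JSON strings Redis already holds, and import tolerates the older object form.

```typescript
type PendingEntry = Record<string, unknown>;

// Export: entries are already JSON strings in Redis, so pass them through untouched.
function exportPending(pendingJsonStrings: string[]): string[] {
  return pendingJsonStrings;
}

// Import: accept both an array of objects (older saves) and an array of JSON strings.
function importPending(saved: Array<string | PendingEntry>): string[] {
  return saved.map((e) => (typeof e === "string" ? e : JSON.stringify(e)));
}
```
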
Ed Summers
2ef116d667
Mention command line options when restarting (#577)
It's probably worth reminding people that the command line options need
to be passed in again since the crawl state doesn't include them.

Refs #568
2024-05-21 10:57:50 -07:00
Ilya Kreymer
1735c3d8e2
headers: better filtering and encoding (#573)
Ensure headers are processed via internal checks before being passed to
`new Headers`, to ensure validity:
- filter out HTTP/2-style pseudo-headers (starting with ':')
- check if header values are non-ASCII, and if so, encode them with
`encodeURI`

fixes #569 + prep for latest version of the base image, which contains
pseudo-headers (replaces #546)
2024-05-15 11:06:34 -07:00
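
A minimal sketch of this kind of sanitization, assuming a plain header map; the helper name is illustrative.

```typescript
// Drop HTTP/2 pseudo-headers and encode non-ASCII values before building Headers.
function sanitizeHeaders(headers: Record<string, string>): Record<string, string> {
  const result: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    if (name.startsWith(":")) {
      continue; // pseudo-headers such as :authority, :path are not valid here
    }
    result[name] = /^[\x00-\x7F]*$/.test(value) ? value : encodeURI(value);
  }
  return result;
}

const headers = new Headers(sanitizeHeaders({ ":path": "/", "x-note": "café" }));
```
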
Tessa Walsh
8318039ae3
Fix regressions with failOnFailedSeed option (#572)
Fixes #563 

This PR makes a few changes to fix a regression in behavior around
`failOnFailedSeed` for the 1.x releases:

- Fail with exit code 1, not 17, when pages are unreachable due to DNS
not resolving or other network errors if the page is a seed and
`failOnFailedSeed` is set
- Extend tests, add a test to ensure the crawl succeeds on a 404 seed status
code if `failOnInvalidStatus` isn't set
2024-05-15 11:02:33 -07:00
Ilya Kreymer
10f6414f2f
PDF loading status code fix (#571)
When loading a PDF as a page, the browser returns a 'false positive'
net::ERR_ABORTED even though the PDF is loaded.
- this is already handled, but the status code was still being cleared;
ensure the status code is not reset to 0 on response
- ensure page status and MIME type are also recorded if this failure is
ignored (in shouldIgnoreAbort)
- tests: add test for PDF capture

fixes #570
2024-05-14 15:26:06 -07:00
Ilya Kreymer
c71274d841
add STORE_REGION env var to be able to specify region (#565)
Defaults to us-east-1 for MinIO compatibility.
Fixes #515
2024-05-12 12:42:04 -04:00
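
A sketch of how the region might be picked up, assuming the MinIO JS client; the endpoint and credential env var names are placeholders, not confirmed by the commit.

```typescript
import { Client } from "minio";

// Illustrative: honor STORE_REGION, defaulting to us-east-1 for MinIO compatibility.
const client = new Client({
  endPoint: "s3.example.com",                    // placeholder endpoint
  accessKey: process.env.STORE_ACCESS_KEY || "", // env var names are assumptions
  secretKey: process.env.STORE_SECRET_KEY || "",
  region: process.env.STORE_REGION || "us-east-1",
});
```
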
Ilya Kreymer
d2fbe7344f
Skip Checking Empty Frame + eval timeout (#564)
Don't run frame.evaluate() on an empty frame; also add a timeout to
frame.evaluate() just in case.
2024-05-09 11:05:33 +02:00
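
A sketch of that guard, assuming a Puppeteer `Frame`; the timeout helper here is illustrative rather than the crawler's own utility.

```typescript
import type { Frame } from "puppeteer-core";

// Skip evaluate() on empty frames and bound the call with a timeout.
async function safeFrameEvaluate<T>(frame: Frame, fn: () => T, timeoutMs = 5000) {
  if (!frame.url() || frame.url() === "about:blank") {
    return null; // nothing useful to evaluate in an empty frame
  }
  return Promise.race([
    frame.evaluate(fn),
    new Promise<null>((resolve) => setTimeout(() => resolve(null), timeoutMs)),
  ]);
}
```
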
Ilya Kreymer
bd5368cbca version: bump to 1.1.2 2024-05-07 13:46:05 +02:00
Ilya Kreymer
ddc3e104db
improved handling of requests from workers: (#562)
On sites with regular workers, requests from workers were being skipped
as there was no match for the worker frameId.

Add a recorder.hasFrame() check so the frameId matches not just service-worker
frameIds but also other frame IDs already tracked in the frameIdToExecId
map.
2024-05-06 11:04:31 -04:00
Ilya Kreymer
22b2136eb9
profiles: ensure initial page.load() is awaited (#561)
refactor to create a startLoad() method and await it, follow-up to #559
2024-05-02 17:55:22 +02:00
Ilya Kreymer
a61206fd73
profiles: ensure all page.goto() promises have at least catch block or are awaited (#559)
In particular, an API call to /navigate starts a page load but doesn't wait for
it to finish, since the user can choose to close the profile browser
at any time. This ensures that user operations don't crash the browser if
page.goto() is interrupted/fails (browser closed, profile saved, etc.) while a page is still loading.

bump to 1.1.1
2024-04-25 09:34:57 +02:00
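
A minimal sketch of the pattern, assuming a Puppeteer `Page`; the logging details are illustrative.

```typescript
import type { Page } from "puppeteer-core";

// Start navigation without blocking on it, but always attach a catch so an
// interrupted page.goto() (browser closed, profile saved, etc.) can't crash the process.
function startLoad(page: Page, url: string): Promise<void> {
  return page
    .goto(url, { waitUntil: "load" })
    .then(() => undefined)
    .catch((e) => {
      console.warn("navigation interrupted", e);
    });
}
```
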
Ilya Kreymer
15d2b09757
warcinfo: fix version to 1.1 to avoid confusion (part of #553) (#557)
Ensure warcinfo record is also WARC/1.1
2024-04-18 21:52:24 -07:00
Ilya Kreymer
dece69c233 version: bump to 1.1.0! 2024-04-18 17:45:57 -07:00
Ilya Kreymer
0201fef559 docs: fix typo 2024-04-18 17:19:13 -07:00
Ilya Kreymer
51d82598e7
Support site-specific wait via browsertrix-behaviors (#555)
The 0.6.0 release of Browsertrix Behaviors /
webrecorder/browsertrix-behaviors#70 introduces support for site-specific behaviors implementing an `awaitPageLoad()` function, which allows waiting for specific resources on page load.
- This PR just adds a call to this function directly after page load.
- Factors this out into an `awaitPageLoad()` method used in both the crawler and replaycrawler to support the same wait in QA mode
- This is to support a custom loading wait time for Instagram (and other sites in the future)
2024-04-18 17:16:57 -07:00
Tessa Walsh
75b617dc94
Add crawler QA docs (#551)
Fixes #550

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-18 16:18:22 -04:00
Ilya Kreymer
cf053d0f4d
replay counts: don't filter out URLs with __wb_method (#552)
This helps reduce the resource count disparity between replay and crawl counts, since
non-GET crawl requests (URLs with __wb_method) are no longer filtered out.
2024-04-18 15:23:34 -04:00
Mattia
ea7b2bbefc
allow minio to connect to other regions (#543)
This should address the issue of connecting to buckets stored outside
us-east-1
(https://github.com/webrecorder/browsertrix-crawler/issues/515) while
the switch from Minio client to AWS SDK is being worked on
(https://github.com/webrecorder/browsertrix-crawler/issues/479)

Co-authored-by: Mattia <m@ttia.it>
2024-04-17 08:55:33 -07:00
Tessa Walsh
efebc331ee
Set mime type for html pages (#545)
Fixes #544 

As long as the response has a content-type header, we should use it to
set the MIME type for the page.
2024-04-15 14:04:30 -07:00
Ilya Kreymer
f6edec0b95
Fix for --rolloverSize for individual WARCs in 1.x (#542)
Fixes #533 

Fixes rollover in WARCWriter, separate from the combined WARC rollover size:
- check rolloverSize and close the previous WARC when the size is exceeded
- add timestamp to resource WARC filenames to support rollover, e.g.
screenshots-{ts}.warc.gz
- use append mode for all write streams, just in case
- tests: add test for rollover of individual WARCs with a 500K size limit
- tests: update screenshot tests to account for WARCs now being named
screenshots-{ts}.warc.gz instead of just screenshots.warc.gz
2024-04-15 13:43:08 -07:00
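
A sketch of the per-WARC rollover check and timestamped naming described above; the class and method names are illustrative.

```typescript
import fs from "node:fs";

class ResourceWarcNamer {
  constructor(
    private prefix: string,        // e.g. "screenshots"
    private rolloverSize: number,  // bytes, e.g. 500_000 in the test
  ) {}

  // e.g. screenshots-20240415T134300000.warc.gz
  nextFilename(): string {
    const ts = new Date().toISOString().replace(/[-:.]/g, "").replace("Z", "");
    return `${this.prefix}-${ts}.warc.gz`;
  }

  // true if the current WARC should be closed and a new one started
  shouldRollover(currentFile: string): boolean {
    try {
      return fs.statSync(currentFile).size >= this.rolloverSize;
    } catch {
      return false;
    }
  }
}
```
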
Ilya Kreymer
16671cb610
qa: filter out non-html pages (#541)
Fixes #540 

Also ensure the MIME type is set on the page for non-HTML pages when loaded
through the browser; it was already being set for the direct fetch path.
2024-04-12 16:21:50 -07:00
Ilya Kreymer
8d4e9ca2dc
Better logging of all queue WARCWriter operations (#536)
WARCWriter operations result in a write promise being put on a queue
and handled one-at-a-time. This change wraps that promise in an async function that awaits the actual
write and logs any rejections.
- If additional log details are provided, successful writes are also
logged for now, including success logging for resource records (text,
screenshot, pageinfo)
- screenshot / text / pageinfo use the appropriate log context for the resource for better log filtering
2024-04-12 14:31:07 -07:00
Tessa Walsh
05acad1789
Remove no longer needed invalid Brave update URLs (#539) 2024-04-12 16:13:34 -04:00
Ilya Kreymer
e15f0c95d9
Adblock support (#534)
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA-mode
RWP embed to get more accurate screenshots.
Fetches adblock.gz directly from RWP (though it could also be fetched
separately from EasyList).
Updates to 1.1.0-beta.5
2024-04-12 09:47:32 -07:00
Ilya Kreymer
b5f3238c29
Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535)
Cherry-picked from the use-js-wacz branch, now implementing separate
writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and the
new `--copy-page-files` flag.

Depends on py-wacz 0.5.0 (via webrecorder/py-wacz#43)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-11 13:55:52 -07:00
Ilya Kreymer
c247189474
qa/replay crawl loading improvements (#526)
- use frame.load() to load RWP frame directly instead of waiting for
navigation messages
- retry loading RWP if replay frame is missing
- support --postLoadDelay in replay crawl
- support --include / --exclude options in the replay crawler, allowing
pages to be included in or excluded from QA via regex
- improve --qaDebugImageDiff debug image saving, saving images to the same
dir using ${counter}-${workerid}-${pageid}-{crawl,replay,vdiff}.png for
better sorting
- when running a QA crawl, check and use QA_ARGS instead of CRAWL_ARGS if
provided
- ensure empty-string text from a page is treated differently from an error (undefined)
- ensure info.warc.gz is closed in closeFiles()

misc:
- fix typo in --postLoadDelay check!
- enable 'startEarly' mode for behaviors (autofetch, autoplay)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-04 13:05:24 -07:00
Ilya Kreymer
98f64458d8
ensure all warcwriter write operations go through a queue. (#528)
Currently, only the recorder's WARCWriter writes records through a
queue, resulting in other WARCs potentially suffering from concurrent
write attempts. This fixes that by:
- adding the concurrent queue to WARCWriter itself
- all writeRecord, writeRecordPair, writeNewResourceRecord calls are
first added to the PQueue, which ensures writes happen in order and
one-at-a-time
- flush() also ensures queue is empty/idle
- should avoid any issues with concurrent writes to any WARC
2024-04-04 09:36:16 -07:00
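
A condensed sketch of the queued-writer pattern described above, using p-queue; the class and internals here are illustrative, not the crawler's actual WARCWriter.

```typescript
import PQueue from "p-queue";

class QueuedWarcWriter {
  private queue = new PQueue({ concurrency: 1 });

  // every write goes onto the queue, so records are appended in order, one at a time
  async writeRecord(serialized: Uint8Array): Promise<void> {
    await this.queue.add(() => this.appendToStream(serialized));
  }

  // flush waits until the queue is empty and idle before the WARC is closed
  async flush(): Promise<void> {
    await this.queue.onIdle();
  }

  private async appendToStream(_data: Uint8Array): Promise<void> {
    // ...append to the underlying writable stream here...
  }
}
```
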
Ilya Kreymer
db613aa4ff
Revert "Make /app world-readable to better support non-root usage" (#529)
Reverts webrecorder/browsertrix-crawler#523
The chmod operation is a bit slow, and in testing the CI failures don't appear to be related to chmod :/
2024-04-03 19:48:37 -07:00
Ilya Kreymer
97b95fdf18
merge V1.0.4 change -> main: (#527)
refactor handling of max size for HTML/JS/CSS (copy of #525)
- due to a typo (and lack of type-checking!), matchFetchSize was incorrectly
passed in instead of maxFetchSize, resulting in text/CSS/JS larger than
5MB (instead of 25MB) not being properly streamed back to the browser
- add type checking to AsyncFetcherOptions to avoid this in the future
- refactor to avoid checking size altogether for 'essential resources'
(HTML document, JS and CSS); instead, always fetch them fully and
continue in the browser. Only apply rewriting if <25MB.
fixes #522
2024-04-03 17:38:50 -07:00
Vinzenz Sinapius
23fda685d9
Make /app world-readable to better support non-root usage (#523)
Possible fix for failing tests with non-root deployment.
2024-04-03 15:22:12 -07:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
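
A sketch of the resolution rule described above; the helper name is illustrative.

```typescript
import path from "node:path";

// Non-absolute --filename values are resolved under /crawls/profiles.
function resolveProfilePath(filename: string): string {
  return path.isAbsolute(filename)
    ? filename
    : path.resolve("/crawls/profiles", filename);
}

// resolveProfilePath("myprofile.tar.gz") === "/crawls/profiles/myprofile.tar.gz"
// resolveProfilePath("/tmp/profile.tar.gz") === "/tmp/profile.tar.gz"
```
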
Ilya Kreymer
5152169916 bump version to 1.1.0-beta.3 2024-03-28 17:19:40 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards; fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
2024-03-28 17:17:29 -07:00
Ilya Kreymer
ea098b6daf
avoid cloudflare detection of puppeteer when using browser profiles: (#518)
- filter out 'other' / no-URL targets from puppeteer attachment
- disable '--disable-site-isolation-trials' for profiles
- workaround for #446 with profiles
- also fixes `pageExtraDelay` not working for non-200 responses - may be
useful for detecting captcha-blocked pages
- connect VNC right away instead of waiting for the page to fully finish
loading, hopefully resulting in faster profile start-up time.
2024-03-28 10:21:31 -07:00
Ilya Kreymer
0d973d67e3
upgrade puppeteer-core to 22.6.1 (#516)
Using the latest puppeteer-core to keep up with the latest browsers; mostly
minor syntax changes.

Due to a change in puppeteer hiding the executionContextId, we need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function.
2024-03-27 09:26:51 -07:00
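
A sketch of the frameId-to-executionContextId bookkeeping, based on simplified shapes of the CDP `Runtime.executionContextCreated` / `executionContextDestroyed` events; the handler names are illustrative.

```typescript
const frameIdToExecId = new Map<string, number>();

// Runtime.executionContextCreated: remember which context belongs to which frame.
function onExecutionContextCreated(params: {
  context: { id: number; auxData?: { frameId?: string } };
}): void {
  const frameId = params.context.auxData?.frameId;
  if (frameId) {
    frameIdToExecId.set(frameId, params.context.id);
  }
}

// Runtime.executionContextDestroyed: forget mappings for the destroyed context.
function onExecutionContextDestroyed(executionContextId: number): void {
  for (const [frameId, execId] of frameIdToExecId.entries()) {
    if (execId === executionContextId) {
      frameIdToExecId.delete(frameId);
    }
  }
}

// evaluateWithCLI() can then look up the executionContextId for a given frameId.
```
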
Ilya Kreymer
0ad10a8dee
Unify WARC writing + CDXJ indexing into single class (#507)
Previously, there was the main WARCWriter as well as a utility
WARCResourceWriter that was used for screenshots, text, and pageinfo and
only generated resource records. This separate WARC writing path did not
generate CDX, but used appendFile() to append new WARC records to an
existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is done through a single WARCWriter, which uses a writable stream to append records and can also generate CDX on the fly. This change is a
prerequisite to the js-wacz conversion (#484), since all WARCs need to
have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-26 14:54:27 -07:00
Ilya Kreymer
01c4139aa7
Fixes from 1.0.3 release -> main (#517)
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow-up to
https://github.com/webrecorder/browsertrix-crawler/issues/496
- support parsing sitemap URLs that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add test for both)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)

fixes redirected seed (from #476) being counted against the page limit: #509
- subtract extraSeeds when computing the limit
- don't include redirect seeds in the seen list when serializing
- tests: adjust saved-state-test to also check total pages when the crawl is
done

fixes #508
2024-03-26 14:50:36 -07:00
Vinzenz Sinapius
6b6cb4137a
Use RFC2606 invalid domain names (#514)
`invalid.dev` can potentially be registered and used. `.invalid` is
guaranteed to never be valid. See also:
https://www.rfc-editor.org/rfc/rfc2606.html
2024-03-26 14:09:04 -07:00
Ilya Kreymer
ecbc1d8ddd quickfix: fix typo, remove duplicate declaration! 2024-03-22 21:51:50 -07:00