Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	151115d46c	Fix Pending Request causing timeout (#636) Don't wait for requests that have not been intercepted (`intercepting` is not set) and are not loaded asynchronously (`asyncLoading` is not set) in awaitPageResources() when page is done. Occasionally, it seems some pending requests that only get added via `Network.requestWillBeSent` but never receive a finished/failed message may persist in the pending request list, and will now be discarded. (Large requests that have a streaming response body will have either `intercepting` or `asyncLoading` set and will not be affected)	2024-07-09 11:02:41 -07:00
Ilya Kreymer	bfe42ad31e	Improved handling of pages that redirect back to the same page. (#635 ) In the case of a page https://example.com/ which results in a redirect chain: 307 https://example.com/ -> 307 https://auth.example.com/ -> 200 https://example.com/ - Includes status in dupe checks, ensures that `307 https://example.com/` and `200 https://example.com/` are both recorded to WARC - When setting page timestamp, update the timestamp to the lower status code if above 300, eg. first setting to `307 https://example.com/` and then to `200 https://example.com/` Fixes #634	2024-07-08 10:51:37 -07:00
Ilya Kreymer	320c041235	version: bump to 1.2.3	2024-07-08 10:50:51 -07:00
Ilya Kreymer	302b119908	Dependency Update / 1.2.2 (#633 ) Dependency Updates: - Bump Brave to 1.67.123 - Update puppeteer-core to latest, fixes possible crash when loading current browser with old profiles - Tests: simplifies extra hops test to avoid complex pages that could lead to timeout	2024-07-03 12:55:14 -07:00
Ilya Kreymer	a3396adba2	tests: reduce logging (#596 ) remove logging of crawl logs by default for clearer output from tests, only log in case of error.	2024-06-26 13:05:13 -07:00
Ilya Kreymer	4495532606	Always download PDF + non HTML page cleanup + enterprise policy cleanup (#629 ) Adds enterprise policy to always download PDF and sets download dir to /dev/null Moves policies to chromium.json and brave.json for clarity Further cleanup of non-HTML loading path: - sets downloadResponse when page load is aborted but response is actually download - sets firstResponse when first response finishes, but page doesn't fully load - logs that non-HTML pages skip all post-crawl behaviors in one place - move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-26 09:16:24 -07:00
Ilya Kreymer	6a9ca3df54	Don't filter saving redirect if no response body. (#628 ) It's possible for a redirect, especially a browser-generated one to have headers and no body (eg. Brave removing tracking url query). Don't filter these redirects out from being written to WARC, just set payload to empty buffer. fixes #627 where Brave-generated redirect response was not stored.	2024-06-25 15:48:22 -07:00
Ilya Kreymer	2ab58c0ea3	Remove DISPLAY env var from image (#625) To avoid a strange chromium bug: https://issues.chromium.org/issues/40209037 which causes WebGL to fail in headless mode if DISPLAY is set. Instead, just set DISPLAY directly for Xvfb, x11vnc and pass in `--display=` to browser if running in headful mode.	2024-06-25 13:53:43 -07:00
Ilya Kreymer	92ad800fe4	browser policies: disable restoring any tabs on startup + set new tab URL to about:blank (#626 ) addresses memory issues with profiles as they accumulate tabs! fixes webrecorder/browsertrix#1880	2024-06-25 13:38:52 -07:00
Ilya Kreymer	e65bf21135	version: bump to 1.2.1	2024-06-25 13:28:59 -07:00
Ilya Kreymer	8af8b3c19a	1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624 )	2024-06-21 16:34:06 -07:00
Ilya Kreymer	65a86352fd	Updated rewriting for YouTube + dependency update (#623 ) - update to wabac.js 2.19.0 to use new html rewriting support in wabac.js 2.19.0 - update to browsertrix-behaviors to 0.6.1 to fix instagram behavior - bump to 1.2.0-beta.3	2024-06-21 15:03:53 -07:00
Ilya Kreymer	de10ba9f15	version: bump to 1.2.0-beta.2	2024-06-20 20:11:35 -07:00
Ilya Kreymer	ea114c6083	bump brave to 1.67.119 (#620 )	2024-06-20 20:10:46 -07:00
Ilya Kreymer	9847af7765	disable socat by default (#622 ) - crawling: add '--debugAccessBrowser' flag to enable connecting via 9222, only run socat then - profiles: only run socat in headless mode	2024-06-20 20:10:25 -07:00
Ilya Kreymer	3c26996f93	add yarn.lock to Docker to ensure consistent builds! (#621 )	2024-06-20 18:54:05 -07:00
Ilya Kreymer	febf4b7532	logging: log error message when a seed fails to be created (#619) for example, due to bad include/exclude regex, fixes #598	2024-06-20 18:41:57 -07:00
Ilya Kreymer	3339374092	http auth support per seed (supersedes #566 ): (#616 ) - parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config) - add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders() - tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI) - docs: add HTTP Auth to YAML config section --------- Co-authored-by: Ed Summers <ehs@pobox.com>	2024-06-20 16:35:30 -07:00
Ilya Kreymer	6329b19a20	clearer scope check (#615 ) split isInScope into a protected sync getScope() used for link extraction (no need for async as we know seed is already set) and which returns url / isOOS count. and a simpler, public async isInScope() which just returns a bool, but also ensures the seed exists.	2024-06-18 16:11:48 -07:00
Ilya Kreymer	ac722cc856	adjust browser viewport to avoid cutting off bottom of page (#614 ) - subtract the browser ui height from default viewport computed from screen dimensions - hard-code height to 81px for now - fixes #613, bottom of page being cut-off as viewport height was too big	2024-06-14 15:25:59 -07:00
Ilya Kreymer	ff481855d5	add EXPOSE for ports used inside container (#612 ) documents fixed internal ports used in browsertrix, via EXPOSE cmd, addresses #558	2024-06-14 15:19:35 -07:00
Ilya Kreymer	64080e7f67	merge 1.1.4 -> 1.2.0 beta.1 (#611 )	2024-06-13 23:24:54 -07:00
Ilya Kreymer	f504effa51	Merge branch 'main' into release/1.1.4 bump to 1.2.0-beta.1	2024-06-13 19:28:25 -07:00
Ilya Kreymer	9094a8355f	Fix header newline escape (#609 ) - Ensure newline escaping happens consistently, even for 'excluded' headers which get a `x-orig-` prefix but are still added - Ensure excluded headers in list path are still added with `x-orig-` prefix. - fixes #607	2024-06-13 19:13:12 -07:00
Ilya Kreymer	f85727954a	add undici for 1.1.4 release, to fix #606 (#608 )	2024-06-13 18:46:05 -07:00
Ilya Kreymer	53d437570e	dependency: update RWP to 2.0.1 (#610 ) for QA, use ReplayWeb.page 2.0.1 by default	2024-06-13 18:43:58 -07:00
Ilya Kreymer	8f8326eaf5	Fix synching extraSeeds state with multiple crawler instances (#605 ) Fixes #604 Ensures that extra seeds are propagated to all crawler instances. Adds a new redis hashmap key to store the extraSeed mappings url->extraSeeds index, to ensure the extra seeds are added in the same order on other instances, even if encountered in different order. Add a new redis lua primitive 'addnewseed' which combines several operations: check if extra seed already exists and returning existing index, add new seed to extraSeed list, also add to regular URL seed list.	2024-06-13 17:18:06 -07:00
Tessa Walsh	9b6efdf6ba	Change some logged errors to warns (#600 ) Fixes #599 This PR modifies the following logging messages from `error` to `warn`, to avoid alarming users for fairly expected behavior: - Behavior timeout - Page worker timeout - Streaming fetch error (will already trigger another page error message) --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-06-13 15:42:27 -04:00
Ilya Kreymer	833551bd77	recorder: add missing shouldSkip() to responseReceived callback (#602) Fixes #601, fixes issue with extra wait on PDF pages, where browser seems to be waiting for a chrome-extension:// URL. These should already have been skipped, but the check was missed here.	2024-06-13 12:13:14 -07:00
Ilya Kreymer	1e7f8361fe	tests: fix blockrules tests (#603) The blockrules tests assumed that YouTube serves videos with the `video/mp4` mime type. However, YouTube now also serves them with the mime type `application/vnd.yt-ump`. Both mime types are now checked to verify that videos are present.	2024-06-13 12:12:46 -07:00
Ilya Kreymer	f6c4bf9935	bump version to 1.1.4	2024-06-13 10:31:57 -07:00
Ilya Kreymer	e2b4cc1844	proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589 ) fixes #587 The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they were hardcoded to obsolete values in the Dockerfile. Proxy settings can now be set, in order of precedence via: - --proxyServer cli flag - PROXY_SERVER env var - PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server only (for backwards compatibility with 0.12.x) The --proxyServer / PROXY_SERVER settings are passed to the browser via the --proxy-server flag. AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying. Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth (supported in Brave, but not Chrome!) --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-10 13:11:00 -07:00
Ilya Kreymer	b83d1c58da	add --dryRun flag and mode (#594 ) - if set, runs the crawl but doesn't store any archive data (WARCS, WACZ, CDXJ) while logs and pages are still written, and saved state can be generated (per the --saveState options). - adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun - screenshot, text extraction are skipped altogether in dryRun mode, warning is printed that storage and archiving-related options may be ignored - fixes #593	2024-06-07 10:34:19 -07:00
benoit74	32435bfac7	Consider disk usage of collDir instead of default /crawls (#586 ) Fix #585 Changes: - compute disk usage based on crawler `collDir` property instead of always computing it on `/crawls` directory	2024-06-07 10:13:15 -07:00
Ilya Kreymer	1bd94d93a1	cleanup dockerfile + fix test (#595 ) - remove obsolete line from Dockerfile - fix pdf test to webrecorder-hosted pdf	2024-06-06 12:14:44 -07:00
Vinzenz Sinapius	068ee79288	Add group policies, limit browser access to container filesystem (#579 ) Add some default policy settings to disable unneeded Brave features. Helps a bit with #463, but Brave unfortunately doesn't provide all mentioned settings as policy options. Most important changes are in `config/policies/lockdown-profilebrowser.json` it limits access to the container filesystem especially during interactive profile browser creation.	2024-06-05 12:46:49 -07:00
Ilya Kreymer	757e838832	base image version bump to brave 1.66.115 (#592 )	2024-06-04 13:35:13 -07:00
Ilya Kreymer	a7d279cfbd	Load non-HTML resources directly whenever possible (#583 ) Optimize the direct loading of non-HTML pages. Currently, the behavior is: - make a HEAD request first - make a direct fetch request only if HEAD request is a non-HTML and 200 - only use fetch request if non-HTML and 200 and doesn't set any cookies This changes the behavior to: - get cookies from browser for page URL - make a direct fetch request with cookies, if provided - only use fetch request if non-HTML and 200 Also: - ensures pageinfo is properly set with timestamp for direct fetch. - remove obsolete Agent handling that is no longer used in default (fetch) If fetch request results in HTML, the response is aborted and browser loading is used.	2024-05-24 14:51:51 -07:00
Ilya Kreymer	089d901b9b	Always add warcinfo records to all WARCs (#556 ) Fixes #553 Includes `warcinfo` records at the beginning of new WARCs, as well as the combined WARC. Makes the warcinfo record also WARC/1.1 to match the rest of the WARC records.	2024-05-22 15:47:05 -07:00
Ilya Kreymer	894681e5fc	Bump version to 1.2.0 Beta + make draft release for each commit (#582 ) Generate draft release from main and *-release branches to simplify release process --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-05-22 15:45:48 -07:00
Ilya Kreymer	6c15bb3f00	version: bump to 1.1.3	2024-05-21 16:37:03 -07:00
Tessa Walsh	1fcd3b7d6b	Fix failOnFailedLimit and add tests (#580 ) Fixes #575 - Adds a missing await to fetching the number of failed pages from Redis - Fixes a typo in the fatal logging message - Adds a test to ensure that the crawl fails with exit code 17 if --failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that will 404	2024-05-21 16:35:43 -07:00
Ilya Kreymer	27226255ee	Sitemap Parsing Fixes (#578) Additional fixes for sitemaps: - Fix parsing sitemaps that have data wrapped in CDATA fields, fixes part of https://github.com/webrecorder/browsertrix/issues/1750 - Fix parsing where .gz sitemaps have content-encoding set and are not actually gzipped - Ensure an error in gzip parsing doesn't break the crawl, and only fails sitemap parsing.	2024-05-21 14:24:17 -07:00
Ilya Kreymer	6b04a39f2f	save state: export pending list as array of json strings + fix importing save state to support pending (#576 ) The save state export accidentally exported the pending data as an object, instead of a list of JSON strings, as it is stored in Redis, while import was expecting list of json strings. The getPendingList() function parses the json, but then was re-encoding it for writeStats(). This was likely a mistake. This PR fixes things: - support loading pending state as both array of objects and array of json strings for backwards compatibility - save state as array of json strings - remove json decoding and encoding in getPendingList() and writeStats() Fixes #568	2024-05-21 10:58:35 -07:00
Ed Summers	2ef116d667	Mention command line options when restarting (#577 ) It's probably worth reminding people that the command line options need to be passed in again since the crawl state doesn't include them. Refs #568	2024-05-21 10:57:50 -07:00
Ilya Kreymer	1735c3d8e2	headers: better filtering and encoding (#573 ) Ensure headers are processed via internal checks before attempting to pass to `new Headers` to ensure validity: - filter out http/2 style pseudoheaders (starting with ':') - check if header values are non-ascii, and if so, encode with `encodeURI` fixes #569 + prep for latest version of base image which contain pseudo-headers (replaces #546)	2024-05-15 11:06:34 -07:00
Tessa Walsh	8318039ae3	Fix regressions with `failOnFailedSeed` option (#572) Fixes #563 This PR makes a few changes to fix a regression in behavior around `failOnFailedSeed` for the 1.x releases: - Fail with exit code 1, not 17, when pages are unreachable due to DNS not resolving or other network errors if the page is a seed and `failOnFailedSeed` is set - Extend tests, add test to ensure crawl succeeds on 404 seed status code if `failOnInvalidStatus` isn't set	2024-05-15 11:02:33 -07:00
Ilya Kreymer	10f6414f2f	PDF loading status code fix (#571 ) when loading a PDF as a page, the browser returns a 'false positive' net::ERR_ABORTED even though the PDF is loaded. - this is already handled, but status code was still being cleared, ensure status code is not reset to 0 on response - ensure page status and mime are also recorded if this failure is ignored (in shouldIgnoreAbort) - tests: add test for PDF capture fixes #570	2024-05-14 15:26:06 -07:00
Ilya Kreymer	c71274d841	add STORE_REGION env var to be able to specify region (#565 ) defaults to us-east-1 for minio compatibility fixes #515	2024-05-12 12:42:04 -04:00
Ilya Kreymer	d2fbe7344f	Skip Checking Empty Frame + eval timeout (#564 ) Don't run frame.evaluate() on an empty frame, also add a timeout just in case to frame.evaluate().	2024-05-09 11:05:33 +02:00
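
The proxy precedence described in commit e2b4cc1844 (`--proxyServer` flag, then `PROXY_SERVER`, then `PROXY_HOST` + `PROXY_PORT` as an HTTP-only proxy) could be sketched roughly as below. This is an illustration only, not the crawler's actual code: `resolveProxy` and `ProxyEnv` are hypothetical names, and the `http://host:port` form used for the legacy variables is an assumption based on the commit saying they set an HTTP proxy server only.

```typescript
// Hypothetical sketch of the precedence described in e2b4cc1844.
// None of these names come from the browsertrix-crawler source.
interface ProxyEnv {
  PROXY_SERVER?: string;
  PROXY_HOST?: string;
  PROXY_PORT?: string;
}

function resolveProxy(
  cliProxyServer: string | undefined,
  env: ProxyEnv,
): string | undefined {
  // 1. The --proxyServer CLI flag takes priority.
  if (cliProxyServer) {
    return cliProxyServer;
  }
  // 2. Next, the PROXY_SERVER env var.
  if (env.PROXY_SERVER) {
    return env.PROXY_SERVER;
  }
  // 3. Finally, the legacy PROXY_HOST + PROXY_PORT pair,
  //    assumed here to be formatted as an http:// URL.
  if (env.PROXY_HOST && env.PROXY_PORT) {
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }
  // No proxy configured.
  return undefined;
}
```

In a real process, `env` would likely be `process.env`; whatever value is resolved would then be handed to the browser via its `--proxy-server` flag, as the commit message describes.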


396 commits
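
The header checks described in commit 1735c3d8e2 (drop HTTP/2-style pseudo-headers that start with `:`, and apply `encodeURI` to values containing non-ASCII characters before building a `Headers` object) might look something like the following. This is a minimal sketch under assumptions: `sanitizeHeaders` is a hypothetical name, and the plain-object input and output shapes are chosen for illustration rather than taken from the crawler.

```typescript
// Hypothetical sketch of the filtering described in 1735c3d8e2;
// not the crawler's actual implementation.
function sanitizeHeaders(
  raw: Record<string, string>,
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const [name, value] of Object.entries(raw)) {
    // Skip HTTP/2-style pseudo-headers such as ":status" or ":path".
    if (name.startsWith(":")) {
      continue;
    }
    // Encode only values that contain characters outside the ASCII range;
    // plain ASCII values are passed through unchanged.
    const hasNonAscii = /[^\x00-\x7F]/.test(value);
    result[name] = hasNonAscii ? encodeURI(value) : value;
  }
  return result;
}
```

The returned object should then be safe to pass to `new Headers(...)`, which would otherwise reject pseudo-header names and non-Latin-1 values.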