- if extraHops is set, the crawler should visit pages beyond maxDepth
- currently, pages are marked out of scope at the depth limit even if
extraHops is set
- adjust isInScope and isAtMaxDepth to account for extraHops (see the
sketch after this list)
- tests: update extra hops test to test extraHops beyond depth
- fixes #693
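A minimal sketch of the adjusted depth check, assuming depth, maxDepth, and extraHops are tracked per URL as described above; names are illustrative, not the crawler's exact implementation:

```ts
// Sketch: a page is only at the max depth once it exceeds
// maxDepth *plus* any configured extraHops
function isAtMaxDepth(depth: number, maxDepth: number, extraHops: number): boolean {
  return depth >= maxDepth + extraHops;
}
```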
- add additional catch() block
- wrap page.title() in timedRun() to catch/log an exception if it fails
(see the sketch below)
- log error when getting cookies fails
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
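A hedged sketch of wrapping page.title() so a hung or failing call is caught and logged instead of crashing the worker; timedRun() here is a stand-in for the crawler's own helper, not its actual implementation:

```ts
import { Page } from "puppeteer-core";

// stand-in helper: race a promise against a timeout
async function timedRun<T>(
  promise: Promise<T>,
  secs: number,
  msg: string,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(msg);
      resolve(undefined);
    }, secs * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function safeTitle(page: Page): Promise<string> {
  try {
    return (await timedRun(page.title(), 30, "Timed out getting page title")) ?? "";
  } catch (e) {
    console.warn("Failed to get page title", e);
    return "";
  }
}
```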
Replace abort with cancel, which is the recommended way to cancel the
response, and avoids a possible exception due to encoding. (Probably a
Node bug, reported in nodejs/undici#3616.)
fixes #687
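A sketch of the abort -> cancel change, assuming a standard fetch() Response: cancelling the body stream discards the response without triggering the encoding-related exception that AbortController.abort() could raise:

```ts
// discard a response we don't need, per the fix above
async function discardResponse(resp: Response): Promise<void> {
  if (resp.body) {
    await resp.body.cancel();
  }
}
```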
Differentiate between expected/predictable interrupts due to limits (exit
code 11) and unexpected interrupts due to a browser crash (now exit code
10).
fixes #683
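A hedged sketch of the exit-code split; codes 10 and 11 come from the description above, while the enum and helper names are hypothetical:

```ts
enum InterruptReason {
  Limit,        // expected: a size/time/page limit was reached
  BrowserCrash, // unexpected: the browser itself crashed
}

function interruptExitCode(reason: InterruptReason): number {
  return reason === InterruptReason.Limit ? 11 : 10;
}
```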
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions via the `CrawlerArgs` type (see the sketch after
this list).
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
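A hypothetical sketch of the typed CLI parsing with yargs; `CrawlerArgs` here is a tiny illustrative subset, and (as noted above) validation still happens without typing due to yargs limitations:

```ts
import yargs from "yargs";

// illustrative subset of the crawler's options
type CrawlerArgs = {
  seeds: string[];
  maxDepth: number;
  extraHops: number;
};

function parseArgs(argv: string[]): CrawlerArgs {
  return yargs(argv)
    .option("seeds", { type: "array", default: [] as string[] })
    .option("maxDepth", { type: "number", default: -1 })
    .option("extraHops", { type: "number", default: 0 })
    .parseSync() as unknown as CrawlerArgs;
}
```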
- use the existing headersTimeout option in undici to limit time to headers
fetch to 30 seconds, rejecting the direct fetch if the timeout is reached
(see the sketch after this list)
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
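A sketch using undici's documented headersTimeout option: fail the direct fetch if headers don't arrive within 30 seconds, while the body download is governed by the (longer) page timeout. Names other than the undici API are illustrative:

```ts
import { Agent, fetch } from "undici";

// dispatcher limiting time-to-headers to 30s
const directFetchDispatcher = new Agent({ headersTimeout: 30_000 });

async function directFetch(url: string, authHeaders: Record<string, string>) {
  // throws on headers timeout or network failure; callers catch and log
  return fetch(url, { headers: authHeaders, dispatcher: directFetchDispatcher });
}
```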
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
- use '--timeout' value for direct fetch timeout, instead of fixed 30
seconds
- don't consider 'document' an essential resource regardless of mime
type, as any top-level URL is a document
- don't count non-200 responses as non-essential even if missing
content-type
fixes #676
Fixes #674
This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to a 'warc-cdx' directory, then merged using the Linux 'sort' command (see the sketch below), and compressed to ZipNum if >50K entries (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously, the WARC was written once, then read again (for CDXJ
indexing), read again (for adding to the new WACZ ZIP), written to disk
(into the new WACZ ZIP), and read again (if uploading to a remote
endpoint). Now, WARCs are written once, along with the per-WARC CDXJ;
only the CDXJ is reread, sorted, and merged on disk, and all data is read
once to either generate the WACZ on disk or upload it to a remote endpoint.
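A sketch of the on-disk CDXJ merge described above, assuming one .cdxj file per WARC in a 'warc-cdx' directory; paths and shell invocation are illustrative:

```ts
import { execFileSync } from "node:child_process";

function mergeCDXJ(cdxDir: string, outPath: string): void {
  // CDXJ lines begin with "<surt> <timestamp>", so a plain lexicographic
  // sort over all per-WARC files yields the merged, sorted index
  execFileSync("sh", ["-c", `sort ${cdxDir}/*.cdxj > ${outPath}`]);
}
```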
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` and also
passing an `--sshProxyPrivateKeyFile <private key file>` param and an
optional `--sshProxyKnownHostsFile <public host key file>` param (see the
sketch after this list). The key files are expected to be mounted as
volumes into the crawler.
- Same arguments are also available for create-login-profile
- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.
- Docs are updated to include a new 'Crawling with Proxies' page in the user guide
- Tests are updated to include crawling through an SSH proxy running locally.
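A hypothetical sketch of turning the ssh:// proxy config into an autossh invocation; the exact flags and the local SOCKS port are assumptions, not the crawler's actual command line:

```ts
function buildAutosshArgs(
  proxyUrl: URL,           // parsed from --proxyServer ssh://user@host[:port]
  privateKeyFile: string,  // from --sshProxyPrivateKeyFile
  localSocksPort = 9722,   // assumed local port
): string[] {
  return [
    "-M", "0",                           // disable autossh's monitor port
    "-N",                                // no remote command, tunnel only
    "-D", `localhost:${localSocksPort}`, // local SOCKS5 listener
    "-i", privateKeyFile,
    "-p", proxyUrl.port || "22",
    `${proxyUrl.username || "root"}@${proxyUrl.hostname}`,
  ];
}
```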
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
- Debian distro now requires the use of virtual environments to avoid
interfering with dependencies installed by official apt packages
- removes tldextract update now that pywb is no longer in use
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
Fixes #666
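An illustrative sketch of dropping Chromium's --enable-automation default so navigator.webdriver stays false without a page-script override; the executablePath is an assumption:

```ts
import puppeteer from "puppeteer-core";

async function launchBrowser() {
  return puppeteer.launch({
    executablePath: "/usr/bin/brave-browser", // assumed Brave path
    ignoreDefaultArgs: ["--enable-automation"],
  });
}
```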
Fixes two issues with QA replay text extraction:
- ensures empty string text from QA replay is treated as an empty string,
instead of undefined
- avoids a divide by zero when both original and replay text strings are
empty
Ensures the match is 1.0 if both crawl and QA replay text are empty strings
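A minimal sketch of the guarded match computation; the Levenshtein-based score is an assumption, not necessarily the crawler's exact metric:

```ts
// classic dynamic-programming edit distance
function levDist(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array<number>(b.length + 1).fill(0);
    row[0] = i;
    return row;
  });
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return d[a.length][b.length];
}

function textMatch(crawlText: string, replayText: string): number {
  // both empty: perfect match, and avoids dividing by zero below
  if (!crawlText.length && !replayText.length) return 1.0;
  const maxLen = Math.max(crawlText.length, replayText.length);
  return 1.0 - levDist(crawlText, replayText) / maxLen;
}
```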
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Ensure URLs from redirects are valid: it's possible that a redirect is a
'synthetic' redirect created by the browser for http->https enforcement,
which may include an invalid URL, e.g. http://<invalid url> ->
https://<invalid url>.
Prevent trying to record this invalid URL.
fixes #654
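A sketch of validating a redirect target before recording it; a synthetic http->https redirect may carry an invalid URL that `new URL()` rejects:

```ts
function isRecordableUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false; // e.g. a browser-synthesized redirect to an invalid URL
  }
}
```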
- logging: log behavior options that are enabled on startup, after seeds
- redis: launch local redis only if --redisStoreUrl starts with
redis://localhost or redis://127.0.0.1 (see the sketch below)
- interrupt: check that the crawler is not 'done' before exiting with exit
code 13; if already done, exit with 0
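An illustrative check matching the redis rule above; the function name is an assumption:

```ts
// only launch the bundled local Redis when --redisStoreUrl is local
function shouldLaunchLocalRedis(redisStoreUrl: string): boolean {
  return (
    redisStoreUrl.startsWith("redis://localhost") ||
    redisStoreUrl.startsWith("redis://127.0.0.1")
  );
}
```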
Refactors handling of 206 responses:
- If a 206 response is encountered and it's actually the full range,
convert it to 200 and rewrite the range and content-range headers to
x-range and x-orig-range. This supports rewriting of 206 responses for
DASH manifests.
- If a partial 206 response starts with `0-`, do a full async fetch
separately.
- If a partial 206 response does not start with `0-`, just ignore it (very
likely a duplicate picked up when handling the `0-` response).
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
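A hedged sketch of the full-range 206 -> 200 conversion; which original header lands under which x- name is an assumption based on the list above:

```ts
function normalizeFullRange206(status: number, headers: Headers): number {
  const contentRange = headers.get("content-range"); // e.g. "bytes 0-1023/1024"
  const m = contentRange?.match(/^bytes (\d+)-(\d+)\/(\d+)$/);
  if (status !== 206 || !m) {
    return status;
  }
  const [, start, end, total] = m;
  if (start === "0" && Number(end) + 1 === Number(total)) {
    // the "partial" response is actually the full payload: store as 200,
    // preserving the originals under x- names
    const origRange = headers.get("range");
    if (origRange) headers.set("x-range", origRange);
    headers.set("x-orig-range", contentRange!);
    headers.delete("range");
    headers.delete("content-range");
    return 200;
  }
  return status;
}
```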
- instead, exclude sitemap-discovered page URLs from being counted toward
extra hops rules, e.g. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links from
pages that are in scope.
- bump version to 1.2.4
Fixes #637
- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password
This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.
It looks like we were also using XPath methods like contains() whereas
Puppeteer expects CSS selectors, hence the syntax change (see the sketch
below).
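A sketch of the loosened CSS selectors; the attribute lists follow the description above, while the constant names are illustrative:

```ts
// match username fields by name attribute: user, username, or email
const USERNAME_SELECTOR = ["user", "username", "email"]
  .map((name) => `input[name="${name}"]`)
  .join(",");

// match password fields by type plus name attribute: pass or password
const PASSWORD_SELECTOR = ["pass", "password"]
  .map((name) => `input[type="password"][name="${name}"]`)
  .join(",");

// usage (in a Puppeteer context): await page.$(USERNAME_SELECTOR)
```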
Don't wait for requests that have not been intercepted (`intercepting` is
not set) and are not loaded asynchronously (`asyncLoading` is not set) in
awaitPageResources() when the page is done. Occasionally, some pending
requests that only get added via `Network.requestWillBeSent` but never
receive a finished/failed message may persist in the pending request list;
these will now be discarded.
(Large requests that have a streaming response body will have either
`intercepting` or `asyncLoading` set and will not be affected.)
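A hedged sketch of the filtering described above; the PendingRequest shape and field names are illustrative:

```ts
type PendingRequest = {
  url: string;
  intercepting?: boolean;  // request body is being intercepted
  asyncLoading?: boolean;  // streaming response body loading asynchronously
};

function shouldAwait(req: PendingRequest, pageDone: boolean): boolean {
  if (!pageDone) return true;
  // once the page is done, discard requests that were never intercepted
  // and are not loading asynchronously (e.g. stale Network.requestWillBeSent
  // entries that never got a finished/failed message)
  return Boolean(req.intercepting || req.asyncLoading);
}
```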
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
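A sketch of the managed-policy change; AlwaysOpenPdfExternally and DownloadDirectory are standard Chromium policy names, while the target path is an assumption (the actual files are chromium.json / brave.json):

```ts
import { writeFileSync } from "node:fs";

const policy = {
  AlwaysOpenPdfExternally: true,  // download PDFs rather than opening the viewer
  DownloadDirectory: "/dev/null", // discard anything that does download
};

writeFileSync(
  "/etc/chromium/policies/managed/chromium.json",
  JSON.stringify(policy, null, 2),
);
```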
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but the response is
actually a download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages)
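A minimal sketch of the split: an extra delay applied to every page, separate from the HTML-only post-load delay; the parameter name is an assumption:

```ts
async function awaitPageExtraDelay(pageExtraDelaySecs: number): Promise<void> {
  if (pageExtraDelaySecs > 0) {
    await new Promise((resolve) => setTimeout(resolve, pageExtraDelaySecs * 1000));
  }
}
```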
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
It's possible for a redirect, especially a browser-generated one, to have
headers and no body (e.g. Brave removing a tracking URL query). Don't
filter these redirects out from being written to the WARC; just set the
payload to an empty buffer.
fixes #627, where a Brave-generated redirect response was not stored.
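A sketch of keeping header-only redirect records, substituting an empty payload instead of dropping the record; shapes and names are illustrative:

```ts
function payloadForRecord(body: Uint8Array | null, status: number): Uint8Array {
  if (!body && status >= 300 && status < 400) {
    return new Uint8Array(0); // browser-generated redirect with headers only
  }
  return body ?? new Uint8Array(0);
}
```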
To avoid a strange Chromium bug
(https://issues.chromium.org/issues/40209037) which causes WebGL to fail
in headless mode if DISPLAY is set: instead, just set DISPLAY directly
for Xvfb and x11vnc, and pass `--display=` to the browser if running in
headful mode.