Commit graph

127 commits

Author SHA1 Message Date
Ilya Kreymer
fd49041f63
flow behaviors: add scrolling into view (#892)
Some page elements don't respond correctly when they are not in view,
so add setEnsureElementIsInTheViewport() to the click, doubleclick,
hover and change step locators.
2025-10-07 08:17:56 -07:00
Ilya Kreymer
8ca7756d1b
tests: remove example.com from tests (#885)
also use local http-server for behavior tests
2025-09-19 23:21:47 -07:00
Ilya Kreymer
a2742df328
seed urls list: check for quoted URLs and remove quotes (#883)
- check for urls that are wrapped in quotes, eg. 'https://example.com/'
or "https://example.com/" and trim and remove the quotes before adding seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
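
The quote handling described in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the crawler's actual code; `stripQuotes` is a hypothetical helper name, and it assumes that only one pair of matching quotes wrapping the whole line should be removed.

```typescript
// Hypothetical helper (not from the crawler source): trims a seed line and
// removes one pair of matching single or double quotes wrapped around it.
function stripQuotes(line: string): string {
  const trimmed = line.trim();
  const first = trimmed.charAt(0);
  const last = trimmed.charAt(trimmed.length - 1);
  const isQuote = first === "'" || first === '"';
  if (trimmed.length >= 2 && isQuote && first === last) {
    // Drop the surrounding quotes, then trim any whitespace inside them
    return trimmed.slice(1, -1).trim();
  }
  return trimmed;
}
```

Lines without surrounding quotes pass through unchanged apart from trimming, so the helper could be applied to every seed line.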
2025-09-12 13:34:41 -07:00
Ilya Kreymer
705bc0cd9f
Async Fetch Refactor (#880)
- separate out reading stream response while browser is waiting (not
really async) from actual async loading, this is now handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling request after headers
- refactor direct fetch of non-html pages: load headers and handle
loading body, adding page async, allowing worker to continue loading
browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
2025-09-10 12:05:21 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
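
The PAC-script approach mentioned above could look roughly like the sketch below. The `ProxyConfig` shape, the `generatePac` name, and the assumption that each proxy value is already a PAC directive string are all hypothetical; the real YAML schema and generator may differ.

```typescript
// Hypothetical config shape. The 'matchHosts' and 'proxies' names come from
// the commit message; the value formats are assumptions for illustration.
interface ProxyConfig {
  matchHosts: Record<string, string>; // host regex -> proxy name
  proxies: Record<string, string>; // proxy name -> PAC directive, e.g. "PROXY host:port"
}

// Builds the text of a PAC script that tests the host against each regex
// in order and falls back to a direct connection.
function generatePac(config: ProxyConfig): string {
  const rules = Object.entries(config.matchHosts).map(([pattern, name]) => {
    const directive = config.proxies[name];
    if (directive === undefined) {
      throw new Error(`unknown proxy name: ${name}`);
    }
    // JSON.stringify embeds both strings as valid JavaScript literals
    return `  if (new RegExp(${JSON.stringify(pattern)}).test(host)) return ${JSON.stringify(directive)};`;
  });
  return [
    "function FindProxyForURL(url, host) {",
    ...rules,
    '  return "DIRECT";',
    "}",
  ].join("\n");
}
```

The generated text defines the standard `FindProxyForURL(url, host)` entry point that browsers call when a PAC script is configured.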

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Ilya Kreymer
eb374fa835
base: bump to brave 1.80.113 (#857)
version: bump to 1.7.0-beta.0
tests: update deprecated command to work with latest minio
2025-06-30 19:55:38 -07:00
Ilya Kreymer
e72b34318d
Add WARC-Protocol header (#715)
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-19 18:59:52 -07:00
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, fail immediately with invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
Ilya Kreymer
fc59d04231
Deps update 1.6.1 (#826)
2025-05-02 00:43:37 -07:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
Ilya Kreymer
2b56455e8b
stuck page handling: when attempting to restart browser, add more retries (#808)
fixes issue mentioned in:
https://github.com/webrecorder/browsertrix-crawler/issues/791#issuecomment-2734342186
2025-04-01 16:56:01 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation, can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784 
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
Tessa Walsh
5fedde6eee
Fail crawl with fatal message if custom behavior isn't loaded (#799)
Fixes #797 

The crawler will now exit with a fatal log message and exit code 17 if:

- A Git repository specified with `--customBehaviors` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehaviors` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehaviors`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)

Tests have also been added accordingly.
2025-03-31 17:35:30 -07:00
Tessa Walsh
8f581a587c
Validate Autoclick selector, fail crawl if invalid (#800)
Fixes #798 

Also modifies the existing test for link selector validation to check
for exit code 17 when link selectors fail validation.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-03-30 13:48:41 -07:00
Ilya Kreymer
323b654c54
tests: update qa test to use awp site
2025-03-21 13:06:53 -07:00
Tessa Walsh
e402ddc202
Strip credentials from proxy address in crawl logs (#778)
Fixes https://github.com/webrecorder/security/issues/14
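
One way to strip credentials from a proxy address before logging is sketched below. It is an illustration using the standard `URL` class; `stripProxyCredentials` is a hypothetical name and not necessarily how the crawler implements the fix.

```typescript
// Hypothetical helper: returns the proxy address with any username and
// password removed, so that it is safe to write to logs.
function stripProxyCredentials(proxyUrl: string): string {
  const url = new URL(proxyUrl);
  url.username = "";
  url.password = "";
  return url.href;
}
```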
2025-02-26 15:23:38 -05:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
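
For scripts that wrap the crawler, the exit codes listed above can be collected in a lookup table. The numeric values are taken from this commit message; the `ExitCodes` object and `describeExit` helper are hypothetical names for illustration.

```typescript
// Exit codes as listed in the commit message above
const ExitCodes = {
  BrowserCrashed: 10,
  SignalInterrupted: 11,
  FailedLimit: 12,
  SignalInterruptedForce: 13,
  SizeLimit: 14,
  TimeLimit: 15,
  DiskUtilization: 16,
} as const;

type InterruptReason = keyof typeof ExitCodes;

// Hypothetical helper: maps a process exit code back to its interrupt reason
function describeExit(code: number): InterruptReason | "Unknown" {
  const names = Object.keys(ExitCodes) as InterruptReason[];
  for (const name of names) {
    if (ExitCodes[name] === code) {
      return name;
    }
  }
  return "Unknown";
}
```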

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`, this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- warning is logged when a retry happens, error only when all retries
are exhausted.
- back to one failure list, urls added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on previous run.
- ensure working with --failOnFailedStatus, if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132
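
The score formula quoted above can be written out directly. The formula itself is from the commit message; the `MAX_DEPTH` value and the `pageScore` name are assumptions chosen for illustration.

```typescript
// Assumed value for illustration; the crawler's actual constant may differ
const MAX_DEPTH = 1000000;

// score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)
// Lower scores are crawled first, so a higher score means lower priority.
function pageScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}
```

With these values, a page on its first retry scores higher (is crawled later) than the same page reached via one extra hop, and each further retry raises the score again.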

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
Ilya Kreymer
0ca27e4fa1
QA fix: ensure replay iframe has actually been updated after goto call! (#756)
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
2025-02-06 10:41:38 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
fe6199eebd
pages redis: include 'depth', 'seed' and 'favIconUrl' in page data added to redis (#749)
follow-up to #747
2025-01-30 11:18:59 -08:00
Ilya Kreymer
457d07aea4
if uploading wacz files, compute waczfile name on load to be able to … (#748)
…store filename along with page data:

- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-29 18:15:28 -08:00
Ilya Kreymer
a00866bbab
Apply exclusions to redirects (#745)
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
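
A minimal sketch of the redirect check, assuming exclusions are held as a list of regular expressions. The function name and signature are hypothetical, and the real crawler applies this decision while handling the initial page response.

```typescript
// Hypothetical helper: returns true when the URL a page redirected to
// matches any exclusion rule, meaning the page load should be blocked.
// Assumes the regexes do not use the stateful "g" or "y" flags.
function shouldBlockRedirect(finalUrl: string, exclusions: RegExp[]): boolean {
  return exclusions.some((rx) => rx.test(finalUrl));
}
```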
2025-01-28 11:28:23 -08:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in case multiple retries
may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
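
Detecting the hashtag-only case from the last item can be sketched with the standard `URL` class. `isHashOnlyChange` is a hypothetical name, and this is only one possible way to make the comparison.

```typescript
// Hypothetical helper: true when two URLs are identical except for their
// fragment (the part after '#'), which is the case that needs a full reload.
function isHashOnlyChange(prevUrl: string, nextUrl: string): boolean {
  const prev = new URL(prevUrl);
  const next = new URL(nextUrl);
  const hashDiffers = prev.hash !== next.hash;
  // Compare everything except the fragment
  prev.hash = "";
  next.hash = "";
  return hashDiffers && prev.href === next.href;
}
```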
2025-01-25 22:55:49 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively.
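
Reading the 'branch' and 'path' query args from a repo URL might look like the sketch below. The function and interface names are hypothetical, and the exact URL form the crawler accepts for git repos is not shown in this commit message.

```typescript
// Hypothetical result shape for illustration
interface GitBehaviorSource {
  repoUrl: string; // URL with the query string removed
  branch: string | null; // value of ?branch=, if given
  path: string | null; // value of ?path=, if given
}

// Splits the 'branch' and 'path' query args off a repository URL
function parseGitBehaviorUrl(input: string): GitBehaviorSource {
  const url = new URL(input);
  const branch = url.searchParams.get("branch");
  const path = url.searchParams.get("path");
  url.search = "";
  return { repoUrl: url.href, branch, path };
}
```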

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support array of selectors via --selectLinks property in the
form [css selector]->[property] or [css selector]->@[attribute].
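
A sketch of parsing one `--selectLinks` value in the form described above. The `LinkSelector` shape and `parseLinkSelector` name are hypothetical, and falling back to the `href` property when no `->` is present is an assumption made for this example.

```typescript
// Hypothetical parsed form of one selector entry
interface LinkSelector {
  selector: string; // CSS selector
  extract: string; // property or attribute name to read
  isAttribute: boolean; // true for the "->@attribute" form
}

function parseLinkSelector(value: string): LinkSelector {
  const idx = value.lastIndexOf("->");
  if (idx === -1) {
    // Assumption for this sketch: default to reading the href property
    return { selector: value.trim(), extract: "href", isAttribute: false };
  }
  const selector = value.slice(0, idx).trim();
  let extract = value.slice(idx + 2).trim();
  const isAttribute = extract.charAt(0) === "@";
  if (isAttribute) {
    extract = extract.slice(1);
  }
  return { selector, extract, isAttribute };
}
```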
2024-11-08 11:04:41 -05:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
5c00bca2b4
tests: use old.webrecorder.net for testing (#710)
replace webrecorder.net -> old.webrecorder.net to fix tests relying on
old website for now
2024-10-31 13:24:58 -04:00
Ilya Kreymer
157ac34d8c
fix typo in QA exclude check, which resulted in all URLs being excluded (#697)
- ensure exclusions now work as expected in replay mode
- add test for using --exclude with replay
2024-10-07 17:25:36 -07:00
Ilya Kreymer
d497a424fc
tests: disable blockrules youtube tests in CI (#698)
due to youtube being blocked, disable test involving youtube embeds when
running in CI for now
2024-10-04 17:37:13 -07:00
Ilya Kreymer
728f00219a
ensure extraHops also apply to maxDepth (#694)
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is
set
- adjust isInScope and isAtMaxDepth to account for extraHops
- tests: update extra hops test to test extraHops beyond depth
- fixes #693
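
One plausible reading of the adjusted depth check is sketched below: a page only counts as being at the limit once both the depth limit and the extra hops allowance are used up. The standalone function and its parameter names are assumptions; the crawler's actual isAtMaxDepth may be structured differently.

```typescript
// Sketch under stated assumptions: extra hops remain available beyond
// maxDepth until extraHops reaches maxExtraHops.
function isAtMaxDepth(
  depth: number,
  extraHops: number,
  maxDepth: number,
  maxExtraHops: number,
): boolean {
  return depth >= maxDepth && extraHops >= maxExtraHops;
}
```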
2024-09-30 15:46:34 -07:00
Tessa Walsh
607fc84c7d
Include depth in pages JSONL files (#691)
Fixes #690
2024-09-27 10:01:20 -04:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
802a416c7e
Additional direct fetch improvements (#678)
- use existing headersTimeout in undici to limit time to headers fetch
to 30 seconds, reject direct fetch if timeout is reached
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
2024-09-05 13:28:49 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` config and
also passing an `--sshProxyPrivateKeyFile <private key file>` file param
and an optional `--sshProxyKnownHostsFile <public host key file>` file
param. The key files are expected to be mounted as volumes into the
crawler.

- Same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide

- Tests are updated to include crawling through an SSH proxy running locally.
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
Ilya Kreymer
4fb9577d4f
don't disable extraHops when using sitemaps: (#639)
- instead, exclude sitemap-discovered page URLs from being counted toward extra hops rules, eg. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links for pages that are in scope.
- bump version to 1.2.4
2024-07-11 19:48:43 -07:00
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath methods like contains whereas
puppeteer expects CSS selectors, hence the syntax change.
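
The loosened rules above could be expressed as CSS selectors along these lines. This is a sketch only: the exact selectors used by the crawler are not shown here, and exact matching on the name attribute is an assumption.

```typescript
// Names taken from the lists above
const USERNAME_NAMES = ["user", "username", "email"];
const PASSWORD_NAMES = ["pass", "password"];

// Builds a selector list matching inputs whose name is one of the usernames
function usernameSelector(): string {
  return USERNAME_NAMES.map((name) => `input[name='${name}']`).join(", ");
}

// Builds a selector list matching password-type inputs with a known name
function passwordSelector(): string {
  return PASSWORD_NAMES.map(
    (name) => `input[type='password'][name='${name}']`,
  ).join(", ");
}
```

Either string can be passed to any API that accepts a CSS selector list, since a comma-separated list matches an element satisfying any one of its parts.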
2024-07-11 15:55:06 -07:00
Ilya Kreymer
302b119908
Dependency Update / 1.2.2 (#633)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
2024-07-03 12:55:14 -07:00
Ilya Kreymer
a3396adba2
tests: reduce logging (#596)
remove logging of crawl logs by default for clearer output from tests, only log in case of error.
2024-06-26 13:05:13 -07:00
Ilya Kreymer
4495532606
Always download PDF + non HTML page cleanup + enterprise policy cleanup (#629)
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but response is
actually a download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-26 09:16:24 -07:00
Ilya Kreymer
6a9ca3df54
Don't filter saving redirect if no response body. (#628)
It's possible for a redirect, especially a browser-generated one to have
headers and no body (eg. Brave removing tracking url query). Don't
filter these redirects out from being written to WARC, just set payload to empty
buffer.

fixes #627 where Brave-generated redirect response was not stored.
2024-06-25 15:48:22 -07:00
Ilya Kreymer
65a86352fd
Updated rewriting for YouTube + dependency update (#623)
- update to wabac.js 2.19.0 to use its new HTML rewriting support
- update browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
2024-06-21 15:03:53 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section
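
The first two items can be sketched as follows: split the credentials out of the seed URL, then build a Basic `Authorization` header value from them. `extractBasicAuth` and `SeedAuth` are hypothetical names; the crawler stores the credentials in the seed's 'auth' field and applies the header via setExtraHTTPHeaders(), which is not reproduced here.

```typescript
// Hypothetical result shape for illustration
interface SeedAuth {
  url: string; // seed URL with credentials removed
  authHeader: string | null; // value for the 'Authorization' header, if any
}

function extractBasicAuth(seedUrl: string): SeedAuth {
  const url = new URL(seedUrl);
  if (!url.username && !url.password) {
    return { url: url.href, authHeader: null };
  }
  // URL keeps credentials percent-encoded, so decode before encoding as base64
  const user = decodeURIComponent(url.username);
  const pass = decodeURIComponent(url.password);
  url.username = "";
  url.password = "";
  // Note: btoa only accepts Latin-1 characters; this sketch assumes ASCII credentials
  return { url: url.href, authHeader: "Basic " + btoa(`${user}:${pass}`) };
}
```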

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
Ilya Kreymer
f504effa51
Merge branch 'main' into release/1.1.4
bump to 1.2.0-beta.1
2024-06-13 19:28:25 -07:00
Ilya Kreymer
53d437570e
dependency: update RWP to 2.0.1 (#610)
for QA, use ReplayWeb.page 2.0.1 by default
2024-06-13 18:43:58 -07:00
Ilya Kreymer
8f8326eaf5
Fix synching extraSeeds state with multiple crawler instances (#605)
Fixes #604 

Ensures that extra seeds are propagated to all crawler instances.
Adds a new redis hashmap key to store the extraSeed mappings
url->extraSeeds index, to ensure the extra seeds are added in the same
order on other instances, even if encountered in different order.
Add a new redis lua primitive 'addnewseed' which combines several
operations: check if extra seed already exists and return existing
index, add new seed to extraSeed list, also add to regular URL seed
list.
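
The combined check-and-add behavior described above can be modelled in memory as below. This is not the Lua script itself, just a sketch of the semantics; the class and field names are hypothetical, and using the position in the extra seed list as the index is an assumption.

```typescript
// In-memory model of the 'addnewseed' semantics (illustration only)
class ExtraSeedStore {
  private indexByUrl = new Map<string, number>(); // url -> extra seed index
  readonly extraSeeds: string[] = [];
  readonly seedUrls: string[] = [];

  // Returns the existing index if the URL was already added; otherwise
  // appends it to both lists and returns the new index.
  addNewSeed(url: string): number {
    const existing = this.indexByUrl.get(url);
    if (existing !== undefined) {
      return existing;
    }
    const index = this.extraSeeds.length;
    this.extraSeeds.push(url);
    this.seedUrls.push(url);
    this.indexByUrl.set(url, index);
    return index;
  }
}
```

In the model, repeated calls with the same URL always return the same index. The real primitive runs as a single Redis script, so the check and the add happen as one step across crawler instances.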
2024-06-13 17:18:06 -07:00