Commit graph

120 commits

Ilya Kreymer
8595bcebc1 add new logger.interrupt(), which interrupts and exits the crawl without marking it failed, unlike logger.fatal()
replace some logger.fatal() calls with logger.interrupt() to allow retries instead of immediate failure, especially
when external inputs (profile, behaviors) cannot be downloaded
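A minimal sketch of the distinction, assuming a hypothetical call site (downloadProfile and the option names are illustrative, not the crawler's actual code):

```ts
try {
  await downloadProfile(params.profile); // hypothetical helper for --profile
} catch (e) {
  // logger.fatal() would mark the crawl failed immediately;
  // logger.interrupt() exits the crawl so a restart can retry it
  logger.interrupt("Unable to download profile", { url: params.profile });
}
```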
2025-11-25 07:58:30 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow failing on content check from the main behavior
- update to behaviors 0.9.6 to support the 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if the profile cannot be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, defaulting to 1 (in case of long-running
requests)
- further fix for #913
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
aponb
b50ef1230f
feat: add extraChromeArgs support for passing custom Chrome flags (#877)
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.

This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.
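A minimal sketch of the pass-through under assumed names (defaultChromeArgs and params.extraChromeArgs are illustrative):

```ts
// user-supplied flags are appended after the crawler's built-in launch args
const launchArgs: string[] = [
  ...defaultChromeArgs(),            // built-in flags (name assumed)
  ...(params.extraChromeArgs ?? []), // e.g. "--disable-gpu", passed through as-is
];
```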

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-11-11 12:03:30 -08:00
Ilya Kreymer
3935526240
add --saveProfile option to save profile after successful crawl (#903)
- if --saveProfile is specified, attempt to save the profile to the same target
as --profile
- if --saveProfile <target> is given, save to that target
- save profile on finalExit if the browser has launched
- supports local file paths and storage-relative paths with '@' (same as
--profile)
- also clear cache in the first worker to match regular profile creation

fixes #898
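A sketch of the target resolution described above (param names assumed):

```ts
// bare --saveProfile reuses the --profile target; --saveProfile <target> overrides it
const saveTarget =
  typeof params.saveProfile === "string" && params.saveProfile.length
    ? params.saveProfile
    : params.profile;
```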
2025-10-29 19:57:25 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
Move the JS dialog handler so it is no longer autoclick-only; dismiss all JS
dialogs (alert(), prompt()) to avoid blocking the page.
fixes #891
2025-10-08 14:57:52 -07:00
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
a2742df328
seed urls list: check for quoted URLs and remove quotes (#883)
- check for URLs that are wrapped in quotes, eg. 'https://example.com/'
or "https://example.com/", and trim and remove the quotes before adding the seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists; only log once that there are duplicates or the page limit is reached
- fix for #882
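A minimal sketch of the quote-stripping (function name illustrative):

```ts
// trim whitespace, then strip one matching pair of surrounding quotes
function cleanSeedUrl(url: string): string {
  url = url.trim();
  const m = url.match(/^(['"])(.*)\1$/);
  return m ? m[2] : url;
}
```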
2025-09-12 13:34:41 -07:00
Ilya Kreymer
705bc0cd9f
Async Fetch Refactor (#880)
- separate out reading a stream response while the browser is waiting (not
really async) from actual async loading; this is now handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, with fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling the request after headers
- refactor direct fetch of non-HTML pages: load headers and handle
loading the body, adding the page asynchronously, allowing the worker to continue
loading browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
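A hedged sketch of the unified fallback order (fetchViaBrowserNetwork is an assumed name for the browser-networking path):

```ts
// try the browser's network stack for regular GETs, fall back to node fetch()
async function asyncFetch(url: string): Promise<Response> {
  try {
    return await fetchViaBrowserNetwork(url); // assumed helper
  } catch {
    return await fetch(url);
  }
}
```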
2025-09-10 12:05:21 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching (sketched below) and running the browser with the specified
proxy PAC script served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836
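A minimal sketch of the PAC generation, assuming the config maps host regexes to named proxies (the shapes and the naive regex embedding are illustrative; the real implementation also handles SOCKS directives and auth):

```ts
interface ProxyConfig {
  matchHosts: Record<string, string>; // host regex -> proxy name
  proxies: Record<string, string>;    // proxy name -> host:port
}

// emit a FindProxyForURL() that tries each regex in order, else goes direct
function generatePAC({ matchHosts, proxies }: ProxyConfig): string {
  const rules = Object.entries(matchHosts).map(
    ([regex, name]) =>
      `  if (/${regex}/.test(host)) { return "PROXY ${proxies[name]}"; }`,
  );
  return [
    "function FindProxyForURL(url, host) {",
    ...rules,
    '  return "DIRECT";',
    "}",
  ].join("\n");
}
```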

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Ilya Kreymer
18fe5a9676
behavior logging: remove last line dupe check for behavior logs (#874)
Repeated log messages shouldn't be skipped, as deduplication is unexpected
behavior for user-defined behaviors.
2025-07-30 16:20:14 -07:00
Ilya Kreymer
0652a3fb1d
quickfix: WACZ upload retry support: (#871)
- if an upload fails and the crawler restarts on error,
exit with 'interrupt' to allow for automatic restart (eg. in the Browsertrix
app)
- otherwise, a failed upload would exit the crawl with no WACZ, resulting
in overall crawl failure
2025-07-29 15:41:22 -07:00
sua yoo
bc4d649307
Capitalization fix for log messages (#870)
Capitalizes "URL" in log messages.
2025-07-24 23:52:12 -07:00
Ilya Kreymer
1a4341bfbc
url queueing: log skipped URLs as errors if depth === 0 (#868)
- will ensure seeds from the URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-23 10:05:40 -07:00
Ilya Kreymer
549d655173
Support option to fail crawl on content check (#861)
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediate failure from a
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-08 13:08:52 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step, in order to be able to asynchronously fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Ilya Kreymer
52235ab21e
tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842)
allows customizing the tmp directory via the TMPDIR env var
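A one-line illustration of the change:

```ts
import os from "node:os";

// resolves to TMPDIR (or the platform equivalent) instead of a hardcoded "/tmp"
const tmpDir = os.tmpdir();
```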
2025-05-28 12:48:06 -07:00
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, failing immediately on an invalid ISO 639-1
language code
- ignore the --lang value when using a profile; print a warning that the
profile language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
Ilya Kreymer
e39d5a31eb
support pause interrupt: (#825)
- add new interrupt reason / exit code
- add isCrawlPaused() which checks redis <id>:paused key
- exit gracefully, upload WACZ file when paused

fixes #824
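A hedged sketch of the pause check (the key name is from the description above; the redis client and stored value are assumptions):

```ts
// the crawl is considered paused while the <id>:paused key is set
async function isCrawlPaused(): Promise<boolean> {
  return (await redis.get(`${crawlId}:paused`)) !== null;
}
```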
2025-05-05 10:10:08 -07:00
Ilya Kreymer
13e9648398
state: add trimqueue() redis command to trim queue / seen list (#821)
useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
2025-04-29 18:18:04 -07:00
Ilya Kreymer
f2dac05577
regression fix: start redis if needed before attempting to init state! (#819)
bump to 1.6.0-beta.1
2025-04-09 21:37:46 +02:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
aponb
6898bcf7ae
useSHA1 Parameter for generating SHA1 record hashes (#532) (#812)
By using the useSHA1 flag, the payload digest in records will use SHA-1
with Base32 encoding instead of the default SHA-256.
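A hedged sketch of the digest switch (the base32-encode helper and the SHA-256 encoding shown are assumptions, not necessarily what the crawler uses):

```ts
import { createHash } from "node:crypto";
import base32Encode from "base32-encode"; // assumed helper package

// WARC payload digest: SHA-1/Base32 when useSHA1 is set, else SHA-256
function payloadDigest(payload: Uint8Array, useSHA1 = false): string {
  return useSHA1
    ? "sha1:" + base32Encode(createHash("sha1").update(payload).digest(), "RFC4648")
    : "sha256:" + createHash("sha256").update(payload).digest("hex");
}
```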

Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
2025-04-02 17:10:50 -07:00
Ilya Kreymer
bf6fbe8776
Remove extra console.log statements (#811)
- remove one added in screencaster
- also remove others that are outside the logging system
- bump to 1.5.10
2025-04-02 09:25:11 -07:00
Ilya Kreymer
fd41b32100
saved state tweaks: (#809)
- if the saved state filename is somehow duplicated, don't re-add it to the
array, to avoid deletion (fixes edge case in #791)
- also avoid double interpolation of the filename
2025-04-01 18:59:04 -07:00
Emma Segal-Grossman
41b968baac
Dynamically adjust reported aspect ratio based on GEOMETRY (#794)
Closes #793 
Related to #733

Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:26:12 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set the crawl id from the collection, not the other way around, to ensure a unique
redis keyspace for different collections
- by default, set the crawl id to a unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation; can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
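A sketch of the default described above (param names assumed):

```ts
import os from "node:os";

// default crawl id: unique per host + collection, i.e. '@hostname-@id'
const crawlId = params.crawlId || `${os.hostname()}-${params.collection}`;
```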
2025-04-01 13:40:03 -07:00
Ilya Kreymer
e751929a7a
Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803)
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
2025-03-31 12:02:25 -07:00
Ilya Kreymer
9a7ac9bef1
Fix using cached WACZ filename if already set ahead of time. (#783)
- if the <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating the wacz yet; use the same condition for
both
- bump version to 1.5.8
- follow-up to #748, fixes #747
2025-02-28 17:58:56 -08:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
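Collected for reference (the names and codes are exactly those listed above; the crawler's own enum may be organized differently):

```ts
enum InterruptReason {
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
}
```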

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
Ilya Kreymer
846f0355f6
Improved handling of browser stuck / crashed (#763)
- only attempt to close the browser if it has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
2025-02-10 10:16:25 -08:00
Ilya Kreymer
5807c320bf
remove fatal() on new window error + stats fix (#762)
logging (#752): ensure failed pages are included in totals
fatal rework: remove fatal() when failing to open a new window; throw instead to ensure the crawl is properly interrupted.
bump to 1.5.2
2025-02-09 15:26:36 -08:00
Ilya Kreymer
a5050a25d7
Readd health check on retry (#759)
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
2025-02-06 20:13:20 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with the `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`; this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- a warning is logged when a retry happens, an error only when all retries
are exhausted.
- back to one failure list; urls are added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on the previous run.
- ensure this works with --failOnFailedStatus; if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132
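The score formula above, as a function (MAX_DEPTH is the crawler's depth sentinel; the value here is only illustrative):

```ts
const MAX_DEPTH = 1_000_000; // illustrative, not the crawler's actual constant

// retries sort after extraHops, and each additional retry sorts later still
function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}
```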

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
Ilya Kreymer
5c9d808651
exit code cleanup (#753)
- use consistent enums for exit codes
- add disk space check on startup and add OutOfSpace exit code (3)
- preparation for #584
2025-02-06 17:54:51 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
95a631188d
hang protection: wrap remaining evaluate() calls to avoid rare hangs (#750)
wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun(), to avoid rare cases
where they never return (eg. if the page crashes during the evaluate)
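A hedged sketch of the wrapping (timedRun is named above; its signature here is assumed):

```ts
// guard an evaluate() that may never return if the page crashes mid-call
const title = await timedRun(
  page.evaluate(() => document.title),
  PAGE_OP_TIMEOUT_SECS, // assumed timeout constant, in seconds
  "page.evaluate() timed out",
);
```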
2025-01-30 17:39:20 -08:00
Ilya Kreymer
fe6199eebd
pages redis: include 'depth', 'seed' and 'favIconUrl' in page data added to redis (#749)
follow-up to #747
2025-01-30 11:18:59 -08:00
Ilya Kreymer
457d07aea4
if uploading wacz files, compute wacz filename on load to be able to store filename along with page data (#748):

- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747
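A hedged sketch of the get-or-set on the redis key named above (the interpolation helper is an assumption):

```ts
// reuse the WACZ filename if one was already computed for this crawl instance
const key = `${crawlId}:nextWacz`;
let waczFilename = await redis.get(key);
if (!waczFilename) {
  waczFilename = interpolateFilename(waczNameTemplate); // assumed helper
  await redis.set(key, waczFilename);
}
```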

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-29 18:15:28 -08:00
Ilya Kreymer
a00866bbab
Apply exclusions to redirects (#745)
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
2025-01-28 11:28:23 -08:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in case multiple retries
are needed.
- redirect: if the page url is /path/ -> /path, don't add it as an extra seed
- proxy: don't use the global dispatcher; pass the dispatcher explicitly when
using a proxy, as the proxy may interfere with local network requests
- final exit flag: if the crawl is done and also interrupted, ensure the WACZ is
still written/uploaded by setting final exit to true
- hashtag-only change forces reload: if loading a page with the same URL but a
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
2025-01-25 22:55:49 -08:00
Ilya Kreymer
5d9c62e264
Retry Failed Pages + Ignore Hashtags in Redirect Check (#739)
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
2025-01-16 15:51:35 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but does not
enable it by default
- Adds support for a new exposed function `__bx_addSet`, which allows the
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates; only used if the link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler (sketched below) to reject onbeforeunload page
navigations while a behavior is running (page not finished), but accept them
when the page is finished, to allow navigating away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as an alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings, instead of a hard
exit, for forward compatibility with new behavior types in the future

Fixes #728, also #216, #665, #31
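A hedged sketch of the dialog policy (puppeteer-style API; pageFinished is the worker-state flag described above):

```ts
page.on("dialog", async (dialog) => {
  if (dialog.type() === "beforeunload") {
    // block navigating away while behaviors are still running
    await (pageFinished ? dialog.accept() : dialog.dismiss());
  } else {
    // alert()/confirm()/prompt(): dismiss so they can't block the page
    await dialog.dismiss();
  }
});
```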
2025-01-16 09:38:11 -08:00
Ilya Kreymer
d923e11436
separate fetch api for autofetch behavior + additional improvements on partial responses: (#736)
Chromium now interrupts fetch() if abort() is called or the page is
navigated, so autofetch behavior using the native fetch() is less than
ideal. This PR adds support for the __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from the browser's regular fetch().
- __bx_fetch() starts a fetch but does not return content to the browser,
doesn't need abort(), and is unaffected by page navigation, yet will still try
to use the browser network stack when possible, making it more efficient for
background fetching.
- if the network-stack fetch fails, fall back to regular node fetch() in the
crawler.
Additional improvements for interrupted fetches:
- don't store truncated media responses, even for 200s
- avoid duplicate async fetching if the response is already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where an interrupted fetch resulted in an empty response
2024-12-31 13:52:12 -08:00
Francesco Servida
07e5ceb4c2
Implemented option for full-page screenshot after behaviors have run (#656)
- new `fullPageFinal` screenshot option, which takes a full-page screenshot after behaviors run, or before moving on to the next page if behaviors are skipped.

Related to #486

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-23 21:26:55 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support an array of selectors via the --selectLinks property, in the
form [css selector]->[property] or [css selector]->@[attribute].
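A minimal sketch of parsing that spec format (the parser and the default property are illustrative):

```ts
// "a[href]->href" -> { selector: "a[href]", extract: "href", isAttribute: false }
// "img->@src"     -> { selector: "img", extract: "src", isAttribute: true }
function parseSelectLinks(spec: string) {
  const [selector, extract = "href"] = spec.split("->"); // default assumed
  return extract.startsWith("@")
    ? { selector, extract: extract.slice(1), isAttribute: true }
    : { selector, extract, isAttribute: false };
}
```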
2024-11-08 11:04:41 -05:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic (sketched below):
* ensure 206 responses (or any response) greater than 25M are streamed
* responses between 5M and 25M are read into memory if text/css/js, as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error-code responses may lack a content-length but are otherwise small
- likely fix for issues in #706
- if too many range requests for the same URL are being made, try
skipping/failing right away to reduce load
- assume the main browser context is used not just for service workers;
always enable
- check for false-positive 'net-aborted' errors that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
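A hedged sketch of the streaming decision (thresholds from the list above; the mime matching is simplified):

```ts
const MB = 1024 * 1024;

// decide whether to stream a response to disk or read it into memory
function shouldStream(size: number | undefined, mime: string, status: number): boolean {
  if (size === undefined) {
    return status >= 200 && status < 300; // unknown size: stream only on 2xx
  }
  if (size > 25 * MB) {
    return true; // large responses (incl. 206) are always streamed
  }
  if (size > 5 * MB) {
    // mid-size: buffer only if it may need rewriting (text/css/js)
    return !/html|css|javascript|text/.test(mime);
  }
  return false; // <5M: read into memory
}
```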
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
Ilya Kreymer
652cf9cfa6
link extraction promise cleanup: (#701)
- catch frame.evaluate() directly and log errors there to avoid any
possibility of exception being propagated before wrapping in timedRun()
- also add clearTimeout() to timedRun()
- possibly fixes openzim/zimit#376
2024-10-11 00:11:24 -07:00