- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support suffix `range: -x` requests used in latest RWP/wabac.js (a parsing sketch follows below)
- if saved state filename is somehow duplicated, don't re-add it to the array, to avoid deletion (fixes edge case in #791)
- also avoid double interpolation of filename
Closes #793
Related to #733
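As a minimal sketch of the suffix range handling mentioned above (illustrative only, not the actual ReplayServer code): a header like `range: bytes=-500` asks for the last 500 bytes of a resource.

```ts
// Sketch: parse "bytes=start-end" and suffix "bytes=-x" ranges.
// Not the actual ReplayServer implementation.
function parseRange(header: string, totalLength: number): { start: number; end: number } | null {
  const m = header.match(/^bytes=(\d*)-(\d*)$/);
  if (!m) {
    return null;
  }
  const [, startStr, endStr] = m;
  if (startStr === "" && endStr !== "") {
    // suffix range: "-x" means the last x bytes
    const suffixLen = Math.min(parseInt(endStr, 10), totalLength);
    return { start: totalLength - suffixLen, end: totalLength - 1 };
  }
  const start = parseInt(startStr, 10);
  const end = endStr === "" ? totalLength - 1 : parseInt(endStr, 10);
  return Number.isFinite(start) && start <= end ? { start, end } : null;
}

// eg. parseRange("bytes=-500", 10000) -> { start: 9500, end: 9999 }
```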
Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
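A sketch of the idea, assuming GEOMETRY has the form "WIDTHxHEIGHT" (eg. "1280x720"); the fallback value here is an assumption, not the crawler's actual default.

```ts
// Sketch only: derive the reported aspect ratio from the GEOMETRY env var.
const geometry = process.env.GEOMETRY || "1360x1020"; // fallback is an assumption
const [width, height] = geometry.split("x").map((v) => parseInt(v, 10));
const aspectRatio = width / height;

// The screencast stylesheet can then use the same ratio, eg.
// `aspect-ratio: ${width} / ${height}` on the thumbnail container.
console.log(`screencast aspect ratio: ${aspectRatio.toFixed(3)}`);
```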
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Follow-up to #712
Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520
- Ignore `.git` directory of git repositories when recursively walking a cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (the previous limit of 2 was overly restrictive for Git repositories); see the sketch after this list
- Log names of custom behavior scripts (filenames or URLs) as info messages in the `behavior` context
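A sketch of the collection walk under these constraints, assuming behaviors are plain `.js` files; the helper name is illustrative, not the crawler's actual code.

```ts
import fs from "node:fs";
import path from "node:path";

// Sketch: collect behavior files from a cloned repo, skipping ".git"
// and stopping past MAX_DEPTH (5).
const MAX_DEPTH = 5;

function collectBehaviorFiles(dir: string, depth = 0, found: string[] = []): string[] {
  if (depth > MAX_DEPTH) {
    return found;
  }
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.isDirectory()) {
      if (entry.name === ".git") {
        continue; // ignore git internals when walking a cloned repository
      }
      collectBehaviorFiles(path.join(dir, entry.name), depth + 1, found);
    } else if (entry.name.endsWith(".js")) {
      found.push(path.join(dir, entry.name));
    }
  }
  return found;
}
```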
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- set crawl id from collection, not the other way around, to ensure a unique redis keyspace for different collections
- by default, set crawl id to a unique value based on host and collection, eg. '@hostname-@id' (see the sketch after this list)
- don't include '@id' in collection interpolation, which can only use hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
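A minimal sketch of the interpolation described above, assuming a timestamp placeholder spelled '@ts'; the helper name and exact placeholder handling are illustrative, not the crawler's implementation.

```ts
import os from "node:os";

// Sketch: collection interpolation may use hostname or timestamp, but not
// @id; the crawl id itself may additionally use @id (the collection).
function interpolate(template: string, allowId: boolean, collection = ""): string {
  const ts = new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
  let result = template.replace("@hostname", os.hostname()).replace("@ts", ts);
  if (allowId) {
    result = result.replace("@id", collection);
  }
  return result;
}

const collection = interpolate("mycoll-@ts", false);

// crawl id defaults to a unique host + collection value, eg. '@hostname-@id'
const crawlId = interpolate("@hostname-@id", true, collection);
```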
Fixes #797
The crawler will now exit with a fatal log message and exit code 17 if:
- A Git repository specified with `--customBehavior` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehavior` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehavior`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)
Tests have also been added accordingly.
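A Jest-style sketch of the kind of test added here, assuming the crawler can be invoked as a local node process (the real tests may run it differently): a nonexistent `--customBehavior` path should produce a fatal exit with code 17.

```ts
import { spawnSync } from "node:child_process";

// Sketch: the entrypoint and URL below are assumptions for illustration.
test("crawler exits with code 17 when custom behaviors cannot be collected", () => {
  const result = spawnSync("node", [
    "dist/main.js",
    "--url", "https://example.com/",
    "--customBehavior", "/path/does/not/exist",
  ]);
  // fatal exit with code 17, per the behavior described above
  expect(result.status).toBe(17);
});
```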
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
Fixes #798
Also modifies the existing test for link selector validation to check for exit code 17 when link selectors fail validation.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for both
- bump version to 1.5.8
- fix follow-up to #748, fix #747
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way
- set retries back to 3, was set high by mistake
- if will restart, throw exception to restart crawler
- otherwise, attempt to kill browser process that is stalled (appears to
work in testing)
- follow-up to #766
Quick follow-up to #584 to make sure the enum is used everywhere in profile editing mode:
- profile browser exits with ExitCodes.SignalInterrupted in response to signal
- use ExitCodes.Success or GenericError for other exit codes
Fix #584
- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
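For reference, a sketch of an ExitCodes-style enum with the values listed above; the Success and GenericError values, and any members not named in this changelog, are assumptions.

```ts
// Sketch: exit codes as listed in this changelog; names/values not listed
// here are assumptions, not the crawler's actual enum.
enum ExitCodes {
  Success = 0, // assumed
  GenericError = 1, // assumed
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
  // 17 is used elsewhere in this changelog for fatal errors, eg. failed
  // custom behavior collection or invalid link selectors
}
```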
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- only attempt to close browser if the browser has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
logging (#752): ensure failed pages are included in totals
fatal rework: remove fatal() when failing to open a new window, throw instead to ensure crawl is properly interrupted.
bump to 1.5.2
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
- follow up to #743
- page retries are simply added back to the same queue with `retry`
param incremented and a higher scope, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)`, which ensures that retries have lower priority than extraHops, and additional retries even lower priority (higher score); see the sketch after this list.
- warning is logged when a retry happens, error only when all retries
are exhausted.
- back to one failure list, urls added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is higher than on the previous run.
- ensure this works with --failOnFailedStatus: if provided, invalid status codes (>= 400) are retried along with page load failures
- fixes #132
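A small sketch of the score calculation above; the MAX_DEPTH constant here is a stand-in for the crawler's own value.

```ts
// Priority score for queued URLs as described above: retries sort after
// extraHops, and each additional retry sorts even later (higher score =
// lower priority). MAX_DEPTH value is an assumption for illustration.
const MAX_DEPTH = 1_000_000;

function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

// eg. a first retry of a depth-1 page scores higher (lower priority) than
// an extraHops page that has not been retried:
queueScore(1, 0, 1) > queueScore(1, 1, 0); // true
```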
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun() to avoid rare cases
where they do not return (eg. if page crashes during the evaluate)
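A minimal sketch of what a timedRun-style wrapper can look like (not the crawler's actual helper), racing the evaluate against a timeout so a crashed page cannot hang the worker:

```ts
// Sketch: resolve to undefined if the wrapped promise does not settle in time.
async function timedRun<T>(promise: Promise<T>, seconds: number, msg: string): Promise<T | undefined> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(`timed out: ${msg}`);
      resolve(undefined);
    }, seconds * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// usage, assuming a puppeteer Page:
// const title = await timedRun(page.evaluate(() => document.title), 20, "evaluate title");
```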
…store filename along with page data:
- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747
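A sketch of the `<crawlid>:nextWacz` handling described above, using an ioredis-style client; `generateWaczFilename()` is a hypothetical helper and anything beyond the key name shown above is assumed.

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379/0");

async function getNextWaczFilename(crawlId: string): Promise<string> {
  const key = `${crawlId}:nextWacz`;
  // set only if not already set, so every crawler instance reuses the
  // same filename; otherwise use the existing value
  await redis.setnx(key, generateWaczFilename());
  return (await redis.get(key))!;
}

async function clearWaczFilename(crawlId: string): Promise<void> {
  // after the WACZ is uploaded, clear the key so a new filename is set
  await redis.del(`${crawlId}:nextWacz`);
}

// hypothetical filename generator for the sketch
function generateWaczFilename(): string {
  return `crawl-${new Date().toISOString().replace(/[:.]/g, "-")}.wacz`;
}
```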
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
- retries: for failed pages, set retry to 5 in case multiple retries may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
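A sketch of the hashtag-only reload check described in the last item above (not the crawler's exact logic), assuming a puppeteer Page:

```ts
import type { Page } from "puppeteer-core";

// Sketch: if the next URL differs from the current one only by its
// "#fragment", goto() may not re-run the page, so force a full reload.
function differsOnlyByHash(currentUrl: string, nextUrl: string): boolean {
  const a = new URL(currentUrl);
  const b = new URL(nextUrl);
  return a.href !== b.href && a.href.split("#")[0] === b.href.split("#")[0];
}

async function loadPage(page: Page, nextUrl: string) {
  const needsReload = differsOnlyByHash(page.url(), nextUrl);
  await page.goto(nextUrl);
  if (needsReload) {
    await page.reload(); // full reload, eg. #A -> #B on the same document
  }
}
```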
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not enabled by default
- Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if the link has an href (see the sketch after this list)
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations while behaviors are running (page not finished), but accept them when the page is finished, to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings instead of causing a hard exit, for forward compatibility with new behavior types in the future
Fixes #728, also #216, #665, #31
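A minimal sketch of how an `__bx_addSet`-style function can be exposed with puppeteer's exposeFunction; the crawler persists this state per crawl so duplicates are avoided, while an in-memory Set keeps the sketch short.

```ts
import type { Page } from "puppeteer-core";

// Sketch: returns true only the first time a value is seen, so the
// autoclick behavior can skip link hrefs it has already clicked.
const clickedLinks = new Set<string>();

async function exposeAddSet(page: Page) {
  await page.exposeFunction("__bx_addSet", (value: string): boolean => {
    if (clickedLinks.has(value)) {
      return false;
    }
    clickedLinks.add(value);
    return true;
  });
}
```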
Chromium now interrupts fetch() if abort() is called or the page is navigated, so autofetch behavior using native fetch() is less than ideal. This PR adds support for the __bx_fetch() command for autofetch behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately from the browser's regular fetch().
- __bx_fetch() starts a fetch but does not return content to the browser; it doesn't need abort(), is unaffected by page navigation, and will still try to use the browser network stack when possible, making it more efficient for background fetching.
- if the network stack fetch fails, fall back to regular node fetch() in the crawler.
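A simplified sketch of the fallback shape described above; `fetchViaBrowserNetwork()` is a hypothetical stand-in for the browser-network-stack path, which is not shown here.

```ts
// Stand-in for the browser-side fetch path; in this sketch it just fails
// so the fallback runs.
async function fetchViaBrowserNetwork(url: string): Promise<void> {
  throw new Error(`browser-side fetch not available for ${url} in this sketch`);
}

async function backgroundFetch(url: string): Promise<void> {
  try {
    // prefer the browser's network stack when possible (more efficient,
    // and the response is never returned to the page)
    await fetchViaBrowserNetwork(url);
  } catch (e) {
    // if the browser-side fetch fails, fall back to node's own fetch()
    // in the crawler and consume the body so it is fully fetched
    const resp = await fetch(url);
    await resp.arrayBuffer();
  }
}
```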
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where an interrupted fetch resulted in an empty response