Commit graph

476 commits

Author SHA1 Message Date
Ilya Kreymer
a5050a25d7
Readd health check on retry (#759)
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
2025-02-06 20:13:20 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with the `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`, this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- warning is logged when a retry happens, error only when all retries
are exhausted.
- back to one failure list, urls added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on the previous run.
- ensure working with --failOnFailedStatus, if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
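The score formula from #757 can be sketched as follows. This is a minimal illustration, not the crawler's code: the `MAX_DEPTH` value and the `queueScore` name are assumptions, and a lower score is taken to mean the page is processed earlier.

```typescript
// Assumed placeholder; the crawler's real MAX_DEPTH constant may differ.
const MAX_DEPTH = 1_000_000;

// Score as described in #757: lower score is dequeued first.
function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

const regular = queueScore(3, 0, 0);     // 3
const extraHop = queueScore(3, 1, 0);    // 1000003
const firstRetry = queueScore(3, 0, 1);  // 2000003
const secondRetry = queueScore(3, 0, 2); // 4000003

// Each comparison holds: page < extra hop < first retry < second retry.
console.log(regular < extraHop, extraHop < firstRetry, firstRetry < secondRetry);
```

With one extra hop, a retried page always sorts after both regular and extra-hop pages, and each additional retry pushes it further back.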
Ilya Kreymer
5c9d808651
exit code cleanup (#753)
- use consistent enums for exit codes
- add disk space check on startup and add OutOfSpace exit code (3)
- preparation for #584
2025-02-06 17:54:51 -08:00
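A possible shape for the exit codes named in this log is sketched below. The numeric values come from the commit messages (3 from #753, 10 and 11 from #686, listed further down); only the `OutOfSpace` name appears in the source, so `BrowserCrashed`, `InterruptedByLimit` and `describeExit` are hypothetical names.

```typescript
// Values taken from the commit messages; names other than OutOfSpace are assumed.
const ExitCode = {
  OutOfSpace: 3,
  BrowserCrashed: 10,
  InterruptedByLimit: 11,
} as const;

// Map a numeric exit code to a short human-readable reason.
function describeExit(code: number): string {
  switch (code) {
    case ExitCode.OutOfSpace:
      return "not enough disk space";
    case ExitCode.BrowserCrashed:
      return "unexpected browser exit";
    case ExitCode.InterruptedByLimit:
      return "interrupted because a limit was reached";
    default:
      return `other exit code ${code}`;
  }
}

console.log(describeExit(10)); // unexpected browser exit
```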
Ilya Kreymer
b435afeb4b version: bump to 1.5.1 2025-02-06 11:40:31 -08:00
Ilya Kreymer
0ca27e4fa1
QA fix: ensure replay iframe has actually been updated after goto call! (#756)
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
2025-02-06 10:41:38 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
f379da19be version: bump to 1.5.0! 2025-01-31 21:57:18 -08:00
Ilya Kreymer
95a631188d
hang protection: wrap remaining evaluate() calls to avoid rare hangs (#750)
wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun() to avoid rare cases
where they do not return (eg. if page crashes during the evaluate)
2025-01-30 17:39:20 -08:00
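The idea behind wrapping evaluate() calls in timedRun() (#750) can be sketched with `Promise.race`. This is an assumed shape only: the signature, the `null` return on timeout and the warning text are illustrative and not taken from the crawler.

```typescript
// Hypothetical helper named after the timedRun() mentioned in #750.
// Resolves with the promise's value, or with null if it takes too long.
async function timedRun<T>(
  promise: Promise<T>,
  seconds: number,
  message: string,
): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      console.warn(message);
      resolve(null);
    }, seconds * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Always clear the timer so it cannot fire after the race is decided.
    clearTimeout(timer);
  }
}

// A promise that never settles stands in for an evaluate() on a crashed page.
const neverSettles = new Promise<string>(() => {});
timedRun(neverSettles, 0.05, "evaluate() timed out").then((result) => {
  console.log(result); // null
});
```

The caller then only has to handle a `null` result instead of waiting forever on a call that never returns.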
Ilya Kreymer
1da49258c4 version: bump to 1.5.0-beta.4 2025-01-30 14:32:30 -08:00
Ilya Kreymer
fe6199eebd
pages redis: include 'depth', 'seed' and 'favIconUrl' in page data added to redis (#749)
follow-up to #747
2025-01-30 11:18:59 -08:00
Ilya Kreymer
457d07aea4
if uploading wacz files, compute waczfile name on load to be able to … (#748)
…store filename along with page data:

- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-29 18:15:28 -08:00
Ilya Kreymer
a00866bbab
Apply exclusions to redirects (#745)
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
2025-01-28 11:28:23 -08:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in case multiple retries
may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
2025-01-25 22:55:49 -08:00
Ilya Kreymer
5d9c62e264
Retry Failed Pages + Ignore Hashtags in Redirect Check (#739)
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
2025-01-16 15:51:35 -08:00
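The hashtag handling from #739 amounts to comparing URLs with the fragment removed. A minimal sketch using the standard `URL` class follows; the function names are hypothetical and the crawler's actual check may differ.

```typescript
// Return the URL without its fragment ("#..." part).
function stripHash(url: string): string {
  const parsed = new URL(url);
  parsed.hash = "";
  return parsed.href;
}

// Treat the load as a redirect only if the URLs differ beyond the fragment.
function isRedirect(seedUrl: string, loadedUrl: string): boolean {
  return stripHash(seedUrl) !== stripHash(loadedUrl);
}

console.log(isRedirect("https://example.com/#A", "https://example.com/#B")); // false
console.log(isRedirect("https://example.com/", "https://example.org/"));     // true
```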
Ilya Kreymer
bc4a95883d
clear out core dumps to avoid using up volume space: (#740)
- add 'ulimit -c' to startup script
- delete any './core' files that exist in working dir just in case
- fixes #738
2025-01-16 15:50:59 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
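The dialog handling described in #729 can be sketched as below. `DialogLike` mirrors the `type()`, `accept()` and `dismiss()` methods of a Puppeteer dialog; `handleDialog`, its return value, and accepting all other dialog types are assumptions made for this illustration.

```typescript
// Structural stand-in for the subset of a Puppeteer dialog used here.
interface DialogLike {
  type(): string;
  accept(): Promise<void>;
  dismiss(): Promise<void>;
}

// While behaviors are still running (page not finished), refuse to leave the
// page; once the page is finished, allow the navigation.
async function handleDialog(
  dialog: DialogLike,
  pageFinished: boolean,
): Promise<"accepted" | "dismissed"> {
  if (dialog.type() === "beforeunload" && !pageFinished) {
    await dialog.dismiss();
    return "dismissed";
  }
  // Assumption: every other dialog is simply accepted.
  await dialog.accept();
  return "accepted";
}

// Hypothetical wiring; `workerState` is an assumed name:
// page.on("dialog", (dialog) => handleDialog(dialog, workerState.pageFinished));
```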
Ilya Kreymer
871490758a
Dependency Update for 1.4.2 (#737) 2025-01-06 12:06:40 -08:00
Ilya Kreymer
d923e11436
separate fetch api for autofetch behavior + additional improvements on partial responses: (#736)
Chromium now interrupts fetch() if abort() is called or page is
navigated, so autofetch behavior using native fetch() is less than
ideal. This PR adds support for __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from browser's regular fetch()
- __bx_fetch() starts a fetch, but does not return content to browser,
doesn't need abort(), unaffected by page navigation, but will still try
to use browser network stack when possible, making it more efficient for
background fetching.
- if network stack fetch fails, fallback to regular node fetch() in the
crawler.
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where fetch was interrupted, resulting in an empty response
2024-12-31 13:52:12 -08:00
Ilya Kreymer
fb8ed18f82
package: pin @novnc/novnc to 1.4.0 to prevent accidental upgrades (#727)
- novnc 1.5.0 is not compatible with current configuration
- fixes #726
- bump to 1.4.1
2024-11-25 18:42:56 -08:00
Ilya Kreymer
9af34f9a1d version: bump to 1.4.0 2024-11-25 00:36:43 -08:00
Ilya Kreymer
6bfa7d5766
Dependency Update (#725)
- update yarn packages
- update RWP to 2.2.4
- update base image to brave 1.73.91
- fix typing issue
- bump to 1.4.0-beta.1
2024-11-24 01:22:50 -08:00
Francesco Servida
07e5ceb4c2
Implemented option for FullPage screenshot after the behaviours have run (#656)
- new `fullPageFinal` screenshot option, which will take a full page screenshot after behaviors are run, or before moving onto next page if behaviors are skipped.

Related to #486

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-23 21:26:55 -08:00
Ilya Kreymer
214eb6ca8f
support removing range from query (via wabac.js 2.20.6): (#724)
- fix for archiving facebook video, to match
webrecorder/archiveweb.page#272
- permissions: auto enable permissions to avoid a possible modal (for both
profiles and crawling)
- deps: update to latest wabac.js + warcio.js
2024-11-22 10:31:12 -08:00
Ilya Kreymer
0b9cd71c5a
Ensure partial responses are not written (#721)
various fixes for streaming, especially related to range requests
- follow up to #709
- fix: prefer streaming current response via takeStream, not only when
size is unknown
- don't serialize async responses prematurely
- don't serialize 206 responses if there is size mismatch
2024-11-13 23:28:37 -08:00
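One way to detect the 206 size mismatch mentioned in #721 is to compare the body length with the span promised by the standard `Content-Range` header. This is an assumed approach for illustration; the log does not say which values the crawler compares, and both function names are hypothetical.

```typescript
// Number of bytes promised by a "bytes start-end/total" Content-Range header,
// or null if the header is missing or in another form.
function expectedRangeLength(contentRange: string | null): number | null {
  if (!contentRange) return null;
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(contentRange.trim());
  if (!match) return null;
  const start = Number(match[1]);
  const end = Number(match[2]);
  return end >= start ? end - start + 1 : null;
}

// A partial response is only considered complete if the sizes agree.
function isCompletePartialResponse(
  contentRange: string | null,
  bodyLength: number,
): boolean {
  const expected = expectedRangeLength(contentRange);
  return expected !== null && expected === bodyLength;
}

console.log(isCompletePartialResponse("bytes 0-499/1234", 500)); // true
console.log(isCompletePartialResponse("bytes 0-499/1234", 300)); // false
```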
Ilya Kreymer
f56d6505c1
fix indexing of cookie header: (#714)
- add fields option for adding req.http:cookie and referrer entries to
the cdxj
- update to warcio 2.4.0 to support this functionality
2024-11-13 23:13:40 -08:00
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
Ilya Kreymer
ea05307528
add disable-lazy-loading flag, should fix #699 (#720) 2024-11-11 21:55:09 -08:00
Ilya Kreymer
c8e2e43d4d
Dependency Update (#718)
- bump browsertrix-behaviors to 0.6.5
- bump browsertrix-base-image to 1.71.123
- bump puppeteer-core to 23.7.1
2024-11-10 19:34:38 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support array of selectors via --selectLinks property in the
form [css selector]->[property] or [css selector]->@[attribute].
2024-11-08 11:04:41 -05:00
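The `[css selector]->[property]` and `[css selector]->@[attribute]` forms from #689 could be parsed roughly as follows. The `LinkSelector` shape, the function name and the fallback to `href` when no `->` is present are assumptions for this sketch.

```typescript
interface LinkSelector {
  selector: string;     // CSS selector matching the elements
  extract: string;      // property or attribute name to read
  isAttribute: boolean; // true for the "@attribute" form
}

function parseLinkSelector(value: string): LinkSelector {
  const idx = value.lastIndexOf("->");
  if (idx < 0) {
    // Assumption: with no "->" part, read the `href` property.
    return { selector: value.trim(), extract: "href", isAttribute: false };
  }
  const selector = value.slice(0, idx).trim();
  const target = value.slice(idx + 2).trim();
  const isAttribute = target.startsWith("@");
  return {
    selector,
    extract: isAttribute ? target.slice(1) : target,
    isAttribute,
  };
}

console.log(parseLinkSelector("a[href]->href"));
console.log(parseLinkSelector("[data-link]->@data-link"));
```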
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic:
* ensure 206 responses (or any response) greater than 25M are streamed
* responses between 5M and 25M are read into memory if text/css/js as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small
- likely fix for issues in #706
- if too many range requests for same URL are being made, try
skipping/failing right away to reduce load
- assume main browser context is used not just for service workers,
always enable
- check false positive 'net-aborted' error that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
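The 'should stream' rules listed in #709 can be written as a small decision function. This is a sketch under stated assumptions: "M" is read as decimal megabytes, "text/css/js" is read as `text/*` plus JavaScript content types, mid-sized responses of other types are assumed to be streamed, and all names are hypothetical.

```typescript
// Thresholds from the commit message; decimal megabytes are an assumption.
const SMALL_LIMIT = 5_000_000;
const LARGE_LIMIT = 25_000_000;

// Assumed reading of "text/css/js": content types that may be rewritten.
function mayBeRewritten(mime: string): boolean {
  return mime.startsWith("text/") || mime.includes("javascript");
}

// size is null when the response size is unknown.
function shouldStream(status: number, size: number | null, mime: string): boolean {
  if (size === null) {
    // Unknown size: stream 2xx responses, read anything else into memory.
    return status >= 200 && status < 300;
  }
  if (size > LARGE_LIMIT) return true;  // large responses are always streamed
  if (size < SMALL_LIMIT) return false; // small responses are read into memory
  return !mayBeRewritten(mime);         // mid-sized: keep rewritable types in memory
}

console.log(shouldStream(206, 30_000_000, "video/mp4")); // true
console.log(shouldStream(200, 10_000_000, "text/css"));  // false
console.log(shouldStream(404, null, "text/html"));       // false
```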
Ilya Kreymer
5c00bca2b4
tests: use old.webrecorder.net for testing (#710)
replace webrecorder.net -> old.webrecorder.net to fix tests relying on
old website for now
2024-10-31 13:24:58 -04:00
Ilya Kreymer
181d9b824c
deps: update to latest wabac (#708)
bump version to 1.3.4
2024-10-26 11:02:32 -07:00
Ilya Kreymer
0d39ea3590
dep: update to wabac.js 2.20 (#704)
Update imports for new TS-based wabac.js
2024-10-16 21:02:04 -07:00
Ilya Kreymer
a45b85dd74 version: bump to 1.3.3 2024-10-11 00:12:23 -07:00
Ilya Kreymer
652cf9cfa6
link extraction promise cleanup: (#701)
- catch frame.evaluate() directly and log errors there to avoid any
possibility of exception being propagated before wrapping in timedRun()
- also add clearTimeout() to timedRun()
- possibly fixes openzim/zimit#376
2024-10-11 00:11:24 -07:00
Ilya Kreymer
157ac34d8c
fix typo in QA exclude check, which resulted in all URLs being excluded (#697)
- ensure exclusions now work as expected in replay mode
- add test for using --exclude with replay
2024-10-07 17:25:36 -07:00
Ilya Kreymer
282c47ad66
bump puppeteer core to 23.5.1 (#700)
includes possible improvements for detecting crashes with wrong stack
trace (see: puppeteer/puppeteer#13056)
2024-10-07 16:39:48 -07:00
Tessa Walsh
e05d50d637
Add documentation for crawl collections (#695)
Fixes #675
2024-10-05 11:51:32 -07:00
Ilya Kreymer
d497a424fc
tests: disable blockrules youtube tests in CI (#698)
due to youtube being blocked, disable test involving youtube embeds when
running in CI for now
2024-10-04 17:37:13 -07:00
Ilya Kreymer
356b3f8d10 bump to 1.3.2 2024-09-30 15:51:13 -07:00
Ilya Kreymer
728f00219a
ensure extraHops also apply to maxDepth (#694)
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is
set
- adjust isInScope and isAtMaxDepth to account for extraHops
- tests: update extra hops test to test extraHops beyond depth
- fixes #693
2024-09-30 15:46:34 -07:00
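The extraHops adjustment from #694 can be illustrated with a depth check that also looks at the extra hops already used. The `ScopeLimits` shape and the exact condition are assumptions; the crawler's isInScope and isAtMaxDepth may be implemented differently.

```typescript
interface ScopeLimits {
  maxDepth: number;     // configured depth limit
  maxExtraHops: number; // configured number of extra hops
}

// A page is at the limit only when both the depth limit and the
// extra-hop allowance are used up.
function isAtMaxDepth(depth: number, extraHops: number, limits: ScopeLimits): boolean {
  return depth >= limits.maxDepth && extraHops >= limits.maxExtraHops;
}

const limits: ScopeLimits = { maxDepth: 2, maxExtraHops: 1 };
console.log(isAtMaxDepth(2, 0, limits)); // false: one extra hop is still allowed
console.log(isAtMaxDepth(2, 1, limits)); // true
```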
Ilya Kreymer
9f310907f0 version: bump to 1.3.1 2024-09-27 14:30:56 -04:00
Ilya Kreymer
a56e13d2ff
Additional exception safety (#692)
- add additional catch() block
- wrap page.title() in timedRun() to catch/log exception if this fails
- log error in getting cookies
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
2024-09-27 14:30:25 -04:00
Tessa Walsh
607fc84c7d
Include depth in pages JSONL files (#691)
Fixes #690
2024-09-27 10:01:20 -04:00
Ilya Kreymer
6b4ba5b430
direct fetch: when cancelling due to redirect, read full body (#688)
to avoid possible exception due to encoding. (Probably a node bug,
reported in nodejs/undici#3616)
Replace abort with cancel, which is the recommended way to cancel the
response.

fixes #687
2024-09-17 10:29:23 -07:00
Ilya Kreymer
da442573b8 version: bump to 1.3.0 2024-09-12 09:22:22 -07:00
Ilya Kreymer
eb50fdffde
exit codes: exit with error code 10 if interrupt is caused by unexpected browser exit (#686)
Differentiate from expected/predictable interrupts due to limits (exit
code 11) and unexpected interrupt due to browser crash (now exit code
10)
fixes #683
2024-09-12 09:10:23 -07:00
Ilya Kreymer
fdb76f2c88
update current crawl size in redis on each healthcheck call (#685)
- allows Browsertrix app to adjust size, if needed, more frequently
- run checkLimits() before starting crawl, in case out of space
2024-09-10 08:28:07 -07:00
Ilya Kreymer
b42548373d
eslint: add strict await checking: (#684)
- require await / void / catch for promises
- don't allow unnecessary await
2024-09-06 16:24:18 -07:00