- original implementation did not actually wait for sitemap parsing to
complete before queuing new URLs, resulting in a concurrency resource leak
- refactor to await completion of the sitemap parser, replacing the
pending list with a counter
- also, don't parse the sitemap if crawling a single page with no extra hops!
- fixes issues in #928
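A minimal sketch of the counter-based approach (hypothetical names; the actual parser code differs):

```typescript
// Hypothetical sketch: each URL from the sitemap increments a counter,
// and completion only resolves once parsing is done and the counter
// has drained back to zero, replacing the old pending-URL list.
class SitemapQueue {
  private pending = 0;
  private resolveDrained: (() => void) | null = null;

  async queueUrl(url: string, enqueue: (url: string) => Promise<void>) {
    this.pending++;
    try {
      await enqueue(url);
    } finally {
      this.pending--;
      if (this.pending === 0 && this.resolveDrained) {
        this.resolveDrained();
      }
    }
  }

  // awaited before the sitemap parse is considered complete
  awaitDrained(): Promise<void> {
    if (this.pending === 0) {
      return Promise.resolve();
    }
    return new Promise((resolve) => (this.resolveDrained = resolve));
  }
}
```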
- if a page is stuck in a window.alert / window.prompt loop, showing 10
or more consecutive dialogs (unrelated to unloading), call Page.crash()
to more quickly move on to the next page, as not much else can be done.
- add exception handling in dialog accept/dismiss to avoid crawler crash
- fixes #926
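Roughly, the guard looks like the following sketch (hypothetical shape; resetting the counter on navigation is omitted):

```typescript
import { Page, Dialog } from "puppeteer-core";

const MAX_DIALOGS = 10;
let consecutiveDialogs = 0; // reset elsewhere on successful navigation

async function onDialog(page: Page, dialog: Dialog) {
  // beforeunload dialogs are handled separately and not counted
  if (dialog.type() !== "beforeunload") {
    consecutiveDialogs++;
  }
  try {
    await dialog.dismiss();
  } catch (e) {
    // dialog may already be gone: ignore instead of crashing the crawler
  }
  if (consecutiveDialogs >= MAX_DIALOGS) {
    // stuck in an alert/prompt loop: crash the page to move on quickly
    const cdp = await page.createCDPSession();
    await cdp.send("Page.crash").catch(() => {});
  }
}
```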
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, e.g. it
is a dupe, as the stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(were being deduped on write, but were still fetched unnecessarily)
Fixes #920
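A hypothetical sketch of the cancellation change:

```typescript
// Each fetch gets its own AbortController, and doCancel() calls
// abort() rather than body.cancel(), which also covers the case where
// the body stream was never consumed (e.g. the record was a dupe).
class AsyncFetcher {
  private abort = new AbortController();

  async fetch(url: string): Promise<Response> {
    return await fetch(url, { signal: this.abort.signal });
  }

  doCancel() {
    // aborts the request and releases any unconsumed body stream
    this.abort.abort();
  }
}
```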
- Downloads profile, custom behavior, and seed list to `/downloads`
directory in the crawl
- Seed File: downloaded into /downloads; never refetched if it already
exists on subsequent crawl restarts.
- Custom Behaviors (Git): downloaded into a directory, then moved to
/downloads/behaviors/<dir name>; if it already exists, a failure to
download will reuse the existing directory
- Custom Behaviors (File): downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>; if it already exists, a failure to
download will reuse the existing file.
- Profile: uses the `/profile` directory to contain the browser profile
- Profile: downloaded to a temp file, then placed into
/downloads/profile.tar.gz. If the download fails but the file already
exists, the existing /profile directory is used
- Also fixes #897
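The download-then-move pattern above, as a minimal sketch (hypothetical helper, not the crawler's actual code):

```typescript
import fs from "node:fs/promises";

// Download to a temp path, move into place under /downloads, and on
// failure fall back to a copy left by a previous crawl restart.
async function downloadOrReuse(url: string, tmpPath: string, destPath: string) {
  try {
    const resp = await fetch(url);
    if (!resp.ok) throw new Error(`status ${resp.status}`);
    await fs.writeFile(tmpPath, Buffer.from(await resp.arrayBuffer()));
    await fs.rename(tmpPath, destPath);
  } catch (e) {
    // reuse the existing file if present; rejects (ENOENT) otherwise
    await fs.access(destPath);
  }
  return destPath;
}
```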
Fixes #631
- Adds --robots flag which enables checking robots.txt for each host, for each page, before the page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots.txt entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
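For example, with the robots-parser library (the Redis caching and fetch batching around it are omitted here):

```typescript
import robotsParser from "robots-parser";

const robotsUrl = "https://example.com/robots.txt";
const resp = await fetch(robotsUrl);
// non-200 responses are treated as an empty robots.txt, i.e. allow all
const body = resp.ok ? await resp.text() : "";

const robots = robotsParser(robotsUrl, body);
// the crawler checks the configured agent in addition to '*'
const allowed = !robots.isDisallowed(
  "https://example.com/page",
  "Browsertrix/1.x",
);
```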
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- clear size to 0 immediately after the WACZ is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher queue: stop queuing async requests if the recorder is marked
for stopping
- allow failing the crawl on a content check from the main behavior
- update to behaviors 0.9.6 to support the 'captcha_found' content check
for TikTok
- allow throwing from timedRun
- call fatal() if the profile cannot be extracted
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long
running requests)
- further fix for #913
- avoid accidental logging
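A simplified sketch of a timedRun-style helper with opt-in rethrow (the crawler's actual utility differs):

```typescript
async function timedRun<T>(
  promise: Promise<T>,
  timeoutSec: number,
  allowThrow = false,
): Promise<T | undefined> {
  let timer!: NodeJS.Timeout;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("timeout reached")),
      timeoutSec * 1000,
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } catch (e) {
    if (allowThrow) {
      throw e; // caller can now handle, e.g. fatal() on profile extraction
    }
    return undefined;
  } finally {
    clearTimeout(timer);
  }
}
```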
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.
This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.
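A hypothetical sketch of how such an option can be declared with yargs (the crawler's actual arg parsing differs):

```typescript
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("extraChromeArgs", {
    type: "array",
    string: true,
    default: [] as string[],
    describe: "additional Chrome flags to pass to the browser at launch",
  })
  .parseSync();

const defaultArgs: string[] = []; // crawler's built-in Chrome flags
// extra args are appended unchanged, so any Chrome flag works
const launchArgs = [...defaultArgs, ...argv.extraChromeArgs];
```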
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
If the SingletonLock (and SingletonPort, SingletonSocket) files somehow
made it into the profile, the browser will refuse to start. This ensures
they are cleared.
(Could also do this before saving the profile, but this catches it for
any existing profiles.)
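A hypothetical helper showing the cleanup (file names as listed in the commit above):

```typescript
import fs from "node:fs/promises";
import path from "node:path";

// remove Chrome's singleton files from the profile dir before launch
async function clearSingletonFiles(profileDir: string) {
  for (const name of ["SingletonLock", "SingletonPort", "SingletonSocket"]) {
    // force: true ignores missing files; other errors are ignored too
    await fs.rm(path.join(profileDir, name), { force: true }).catch(() => {});
  }
}
```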
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative path with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation
fixes #898
- log when profile download starts
- ensure there is a timeout for each profile download attempt (60 secs)
- retry up to 2 more times if the initial profile download times out
- fail the crawl if the profile cannot be downloaded successfully after
3 attempts
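A hypothetical sketch of this retry policy:

```typescript
// 60s timeout per attempt, up to 3 attempts total, then fail the crawl
async function downloadProfile(url: string): Promise<Response> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      console.log(`profile download starting (attempt ${attempt})`);
      return await fetch(url, { signal: AbortSignal.timeout(60_000) });
    } catch (e) {
      // timed out or failed; retry unless this was the last attempt
    }
  }
  throw new Error("profile could not be downloaded, failing crawl");
}
```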
bump to 1.8.2
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just those from the page
- catch and ignore exceptions when clearing the cache
- logging: log when proxy init is happening on all paths, in case of an
error in the proxy connection
Some page elements don't quite respond correctly if the element is not
in view, so setEnsureElementIsInTheViewport() should be added to the
click, doubleclick, hover, and change step locators.
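An illustration of the Puppeteer Locator setting being enabled (the actual change applies it inside the replay step runner; selector and URL are placeholders):

```typescript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com/");

await page
  .locator("button#submit") // hypothetical selector
  .setEnsureElementIsInTheViewport(true) // scroll into view first
  .click();

await browser.close();
```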
- check for URLs that are wrapped in quotes, e.g. 'https://example.com/'
or "https://example.com/", and trim and remove the quotes before adding
the seed (see the sketch after this list)
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists; only log once when there are duplicates or the page limit is reached
- fix for #882
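A hypothetical helper showing the quote trimming:

```typescript
// strip one matching pair of surrounding quotes, then trim again
// before adding the seed
function trimQuotedUrl(url: string): string {
  url = url.trim();
  if (
    (url.startsWith('"') && url.endsWith('"')) ||
    (url.startsWith("'") && url.endsWith("'"))
  ) {
    url = url.slice(1, -1).trim();
  }
  return url;
}

// trimQuotedUrl("'https://example.com/'") === "https://example.com/"
```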
- separate out reading the stream response while the browser is waiting
(not really async) from actual async loading; this is not handled via
fetchResponseBody()
- unify async fetch: first try browser networking for regular GET, then
fall back to regular fetch()
- load headers and body separately in async fetch, allowing the request
to be cancelled after headers
- refactor direct fetch of non-HTML pages: load headers, then handle
loading the body and adding the page asynchronously, allowing the worker
to continue loading browser-based pages (should allow more
parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
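A hypothetical sketch of the header/body split (dupe check and WARC writing stand in for the real record handling):

```typescript
import { createHash } from "node:crypto";

// fetch headers first, decide whether to continue (e.g. dupe check),
// then stream the body while computing the payload digest for the WARC
async function fetchHeadersThenBody(
  url: string,
  isDupe: (url: string) => boolean,
): Promise<string | null> {
  const abort = new AbortController();
  const resp = await fetch(url, { signal: abort.signal });

  if (isDupe(url)) {
    // cancel after headers: the body stream is never consumed
    abort.abort();
    return null;
  }

  const digest = createHash("sha256");
  // Node 18+: the web ReadableStream body is async iterable
  for await (const chunk of resp.body!) {
    digest.update(chunk);
    // ... stream chunk into the WARC record ...
  }
  return digest.digest("hex");
}
```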
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does the
regex matching, and running the browser with that PAC script served by
an internal HTTP server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
Fixes #836
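A hypothetical example of the config shape described above (names and URLs are illustrative; see the proxies doc section for the canonical example):

```yaml
# hosts are matched by regex against the 'matchHosts' keys; each value
# names an entry in the 'proxies' section
matchHosts:
  ".*\\.example\\.com": proxy-a
  "": proxy-default   # fallback for all other hosts

proxies:
  proxy-a: socks5://user:pass@proxy-a.internal:1080
  proxy-default: http://proxy.internal:8080
```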
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if the upload fails and the crawler restarts on error, exit with
'interrupt' to allow for automatic restart (e.g. in the Browsertrix app)
- otherwise, a failed upload will exit the crawl with no WACZ, resulting
in overall crawl failure
- will ensure seeds from the URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>