- hashes stored in separate crawl-specific entries, h:<crawlid>
- wacz files stored in crawl-specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when a crawl is complete, crawls added to 'allcrawls' set
- store filename and crawlId in related.requires list entries for each wacz
Fixes #631
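A rough sketch of this Redis key layout, using ioredis; the value type chosen for each key (hash vs. list vs. set) and the helper names are assumptions, not taken from the implementation:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

// record a payload hash seen during a crawl in its crawl-specific entry: h:<crawlid>
export async function addHash(crawlId: string, hash: string, waczFile: string) {
  await redis.hset(`h:${crawlId}`, hash, waczFile);
}

// record a finished WACZ file in the crawl-specific list: c:<crawlid>:wacz
export async function addWacz(crawlId: string, filename: string) {
  await redis.rpush(`c:${crawlId}:wacz`, filename);
}

// on crawl completion: commit the crawl's hashes to 'alldupes'
// and add the crawl id to the 'allcrawls' set
export async function commitCrawl(crawlId: string) {
  const hashes = await redis.hgetall(`h:${crawlId}`);
  if (Object.keys(hashes).length) {
    await redis.hset("alldupes", hashes);
  }
  await redis.sadd("allcrawls", crawlId);
}
```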
- Adds --robots flag which enables checking robots.txt for each host, before each page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots.txt entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
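A minimal sketch of the allow check using robots-parser (the library named above); the in-process Map cache stands in for the crawler's Redis-backed LRU, and the helper names are illustrative:

```ts
import robotsParser from "robots-parser";

const cache = new Map<string, string>();
const MAX_BODY = 100_000; // robots.txt bodies capped at ~100K

async function fetchRobotsBody(robotsUrl: string): Promise<string> {
  const cached = cache.get(robotsUrl);
  if (cached !== undefined) {
    return cached;
  }
  let body = "";
  try {
    const resp = await fetch(robotsUrl, { signal: AbortSignal.timeout(10000) });
    // non-200 responses are treated as an empty robots.txt
    if (resp.ok) {
      body = (await resp.text()).slice(0, MAX_BODY);
    }
  } catch (e) {
    // fetch errors also fall back to an empty robots.txt
  }
  cache.set(robotsUrl, body);
  return body;
}

export async function isPageAllowed(pageUrl: string, agent = "Browsertrix/1.x") {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const body = await fetchRobotsBody(robotsUrl);
  if (!body) {
    return true; // empty robots.txt == allow all
  }
  const robots = robotsParser(robotsUrl, body);
  // robots-parser falls back to the '*' group if the agent has no explicit rules
  return robots.isAllowed(pageUrl, agent) !== false;
}
```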
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- allow failing on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative paths with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation
fixes #898
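An illustrative sketch of how the --saveProfile target might be resolved under these rules; the option shapes and helper name are hypothetical, not the crawler's actual API:

```ts
type SaveTarget =
  | { kind: "local"; path: string }
  | { kind: "storage"; relPath: string };

export function resolveSaveProfileTarget(
  saveProfile: string | boolean | undefined,
  profile: string | undefined,
): SaveTarget | null {
  // --saveProfile with no value: reuse the --profile target, if any
  const target =
    typeof saveProfile === "string" && saveProfile
      ? saveProfile
      : saveProfile
        ? profile
        : undefined;

  if (!target) {
    return null;
  }
  // paths starting with '@' are storage-relative, same convention as --profile
  if (target.startsWith("@")) {
    return { kind: "storage", relPath: target.slice(1) };
  }
  return { kind: "local", path: target };
}
```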
- log when profile download starts
- ensure there is a timeout for the profile download attempt (60 secs)
- retry 2 more times if the initial profile download times out
- fail the crawl after 3 failed attempts, if the profile cannot be downloaded
successfully
bump to 1.8.2
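A sketch of the retry/timeout flow above; downloadProfile and fatal are simplified stand-ins for the crawler's own helpers, and the wiring is illustrative only:

```ts
import fs from "node:fs/promises";

const PROFILE_DOWNLOAD_TIMEOUT_SECS = 60;
const MAX_ATTEMPTS = 3;

// stand-in for the crawler's actual profile download
async function downloadProfile(url: string, destPath: string): Promise<void> {
  const resp = await fetch(url);
  if (!resp.ok) {
    throw new Error(`bad status: ${resp.status}`);
  }
  await fs.writeFile(destPath, Buffer.from(await resp.arrayBuffer()));
}

// stand-in for the crawler's fatal(): log and exit with a failure code
function fatal(msg: string): never {
  console.error(msg);
  process.exit(1);
}

function withTimeout<T>(p: Promise<T>, secs: number): Promise<T> {
  let timer: NodeJS.Timeout;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), secs * 1000);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer!));
}

export async function downloadProfileWithRetries(url: string, destPath: string) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    console.log(`profile download starting (attempt ${attempt}/${MAX_ATTEMPTS})`);
    try {
      await withTimeout(downloadProfile(url, destPath), PROFILE_DOWNLOAD_TIMEOUT_SECS);
      return;
    } catch (e) {
      console.warn(`profile download failed: ${e}`);
    }
  }
  // after all attempts have failed, fail the crawl
  fatal("profile could not be downloaded");
}
```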
- check for URLs that are wrapped in quotes, e.g. 'https://example.com/'
or "https://example.com/", and trim and remove the quotes before adding the seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
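A minimal sketch of the quote-trimming step; the helper name is illustrative:

```ts
export function trimSeedUrl(seed: string): string {
  let url = seed.trim();
  // handle seeds like 'https://example.com/' or "https://example.com/"
  if (
    (url.startsWith('"') && url.endsWith('"')) ||
    (url.startsWith("'") && url.endsWith("'"))
  ) {
    url = url.slice(1, -1).trim();
  }
  return url;
}

// trimSeedUrl(`'https://example.com/'`) === "https://example.com/"
```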
- Adds support for a YAML-based config for multiple proxies, containing a
'matchHosts' section (regexes) and a 'proxies' declaration, allowing
any number of hosts to be matched to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also supports matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
Fixes #836
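A sketch of what such a config and the generated PAC script might look like; the field names mirror the 'matchHosts' and 'proxies' sections described above, but the exact schema and PAC details are assumptions:

```ts
import yaml from "js-yaml";

// example config in the matchHosts/proxies shape described above
const configYaml = `
proxies:
  eu-proxy: "socks5://user:pass@proxy1.example.com:9000"
  us-proxy: "http://proxy2.example.com:3128"

matchHosts:
  ".*[.]example[.]org$": eu-proxy
  ".*[.]example[.]net$": us-proxy
`;

type ProxyConfig = {
  proxies: Record<string, string>;
  matchHosts: Record<string, string>;
};

// build a PAC script that regex-matches the host against each matchHosts entry
function generatePAC(config: ProxyConfig): string {
  const rules = Object.entries(config.matchHosts)
    .map(([regex, name]) => {
      const proxyUrl = new URL(config.proxies[name]);
      const scheme = proxyUrl.protocol === "socks5:" ? "SOCKS5" : "PROXY";
      return `  if (/${regex}/.test(host)) { return "${scheme} ${proxyUrl.host}"; }`;
    })
    .join("\n");

  return `function FindProxyForURL(url, host) {\n${rules}\n  return "DIRECT";\n}`;
}

console.log(generatePAC(yaml.load(configYaml) as ProxyConfig));
```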
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- will ensure seeds from a URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- add --failOnContentCheck to quickly fail the crawl if a content check in a
behavior fails
- expose __bx_contentCheckFailed to cause an immediate failure from a
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that the crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860
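A rough sketch of the crawler-side wiring: exposing __bx_contentCheckFailed to the page and recording a fail reason (Page is puppeteer-core's type; the state tracking here is a stand-in for the crawler's Redis state):

```ts
import { Page } from "puppeteer-core";

// stand-ins for the crawler's persistent state (kept in Redis in practice)
export let failReason: string | null = null;
export let crawlFailed = false;

export async function exposeContentCheck(page: Page, failOnContentCheck: boolean) {
  await page.exposeFunction("__bx_contentCheckFailed", (reason: string) => {
    // behaviors call this from awaitPageLoad() when a check (eg. captcha_found) fails
    if (failOnContentCheck) {
      failReason = reason;
      crawlFailed = true;
    }
  });
}
```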
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- validate --lang values, fail immediately on an invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
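A minimal sketch of the validation, here using the iso-639-1 npm package to check the code; the crawler may validate against its own list:

```ts
import ISO6391 from "iso-639-1";

export function validateLang(lang: string | undefined, usingProfile: boolean): string | undefined {
  if (!lang) {
    return undefined;
  }
  if (!ISO6391.validate(lang.toLowerCase())) {
    // fail immediately on an invalid code
    throw new Error(`Invalid ISO 639-1 language code: ${lang}`);
  }
  if (usingProfile) {
    console.warn("--lang ignored: the profile's language takes precedence");
    return undefined;
  }
  return lang.toLowerCase();
}
```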
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of Facebook and Instagram (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
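A simplified sketch of the .json detection for --customBehaviors entries; the dispatch shape and function name are illustrative, not the actual loader:

```ts
import fs from "node:fs/promises";

export async function loadCustomBehavior(path: string) {
  if (path.endsWith(".json")) {
    // flow behavior: a JSON specification of steps, run by the behaviors runtime
    return { type: "flow", spec: JSON.parse(await fs.readFile(path, "utf-8")) };
  }
  // regular custom behavior: JavaScript source injected into the page
  return { type: "script", source: await fs.readFile(path, "utf-8") };
}
```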
Follow-up to #368
This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
Fixes #804
- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug logs from the behavior / behaviorScript contexts and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.
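A sketch of the filtering when --logBehaviorsToRedis is set (ioredis; the Redis list key and the function shape are assumptions):

```ts
import Redis from "ioredis";

const redis = new Redis();

type BehaviorLogContext = "behavior" | "behaviorScript" | "behaviorScriptCustom";

export async function logBehaviorMessage(
  context: BehaviorLogContext,
  level: "debug" | "info" | "warn" | "error",
  message: string,
  details: Record<string, unknown> = {},
) {
  const entry = JSON.stringify({ timestamp: new Date().toISOString(), context, level, message, details });
  console.log(entry);

  // all site-specific (behaviorScriptCustom) logs go to Redis;
  // other behavior contexts only when not debug-level
  if (context === "behaviorScriptCustom" || level !== "debug") {
    await redis.rpush("behaviorLogs", entry);
  }
}
```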
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for both
bump version to 1.5.8
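A sketch of the <uid>:nextWacz reuse described above (ioredis; the generated filename format is just an example):

```ts
import Redis from "ioredis";

const redis = new Redis();

export async function getNextWaczFilename(uid: string): Promise<string> {
  const key = `${uid}:nextWacz`;
  // if a filename was already reserved for this worker, reuse it
  const existing = await redis.get(key);
  if (existing) {
    return existing;
  }
  const filename = `crawl-${Date.now()}-${uid}.wacz`;
  await redis.set(key, filename);
  return filename;
}
```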
- fix follow-up to #748, fix #747
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way.
- only attempt to close the browser if the browser has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
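A sketch of the guarded shutdown: skip close() entirely after a crash, and bound close() with a timeout so shutdown cannot hang (Browser is puppeteer-core's type; the timeout value here is an assumption):

```ts
import { Browser } from "puppeteer-core";

export async function closeBrowser(browser: Browser, browserCrashed: boolean) {
  if (browserCrashed) {
    // a crashed browser cannot be closed cleanly; the healthchecker reports failure instead
    return;
  }
  // don't let browser.close() hang shutdown indefinitely
  const timeout = new Promise<void>((resolve) => setTimeout(resolve, 5000));
  await Promise.race([browser.close(), timeout]);
}
```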
logging (#752): ensure failed count is included in totals
fatal rework: remove fatal() when failing to open a new window, throw instead to ensure the crawl is properly interrupted.
bump to 1.5.2