Currently requires a locally built Brave base image named:
webrecorder/browsertrix-browser-base:brave-test-latest
brave-ad-blocking-disabled-profile.tar.gz may not be working correctly
and may need to be replaced, as it is not yet possible to modify the
select elements in brave://settings via create-login-profile's
interactive mode.
* ad blocking via request interception, extending the block rules system with new AdBlockRules (see the sketch after this list)
* Load the list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts, added as JSON at image build time
* Enabled via --blockAds, with a custom message set via --adBlockMessage
* new test to check for ad blocking
* Add test-crawls dir to .gitignore and .dockerignore
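A minimal sketch of the interception approach, assuming puppeteer-core; the blocklist entry and executable path below are placeholders, and the actual AdBlockRules implementation may differ:

```js
import puppeteer from "puppeteer-core";

// Hosts to block; in practice loaded from the JSON generated at image
// build time from the StevenBlack hosts list. One placeholder entry here.
const blockedHosts = new Set(["ads.example.com"]);

const browser = await puppeteer.launch({
  executablePath: "/usr/bin/chromium-browser", // path is an assumption
});
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on("request", (request) => {
  const { hostname } = new URL(request.url());
  if (blockedHosts.has(hostname)) {
    request.abort(); // blocked request
  } else {
    request.continue();
  }
});
```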
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* Run in Docker as User
This follows a similar pattern to pywb, running as the user that owns
the crawls directory.
bump version to 0.7.0-beta.6
Closes #170
* interrupts: simplify interrupt behavior (see the sketch after this list):
- SIGTERM/SIGINT behave the same way, triggering a graceful shutdown after page load
improvements for remote state / parallel crawlers (for browsertrix-cloud):
- SIGUSR1 before SIGINT/SIGTERM ensures data is saved and marks the crawler as done - for use when gracefully stopping a crawl
- SIGUSR2 before SIGINT/SIGTERM ensures data is saved but does not mark the crawler as done - for use when scaling down a single crawler
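A rough sketch of these semantics; the helper names are stand-ins, not the crawler's actual functions:

```js
// Stand-ins for the crawler's real persistence/finalization steps.
async function saveState() { /* persist queue + pending pages */ }
async function markCrawlDone() { /* record crawl as done */ }

let finalize = false; // set by SIGUSR1, cleared by SIGUSR2

process.on("SIGUSR1", () => { finalize = true; });  // save data, mark done
process.on("SIGUSR2", () => { finalize = false; }); // save data, don't mark done

for (const sig of ["SIGINT", "SIGTERM"]) {
  process.on(sig, async () => {
    await saveState();
    if (finalize) {
      await markCrawlDone();
    }
    process.exit(0);
  });
}
```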
* scope check: check the scope of each URL retrieved from the queue (in case scoping rules changed); URLs matching a seed are automatically in scope (see the sketch after this list)
- netIdleWait better defaults: if not set, defaults to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in the default behaviors as well
- restart: if the crawl is already done, don't attempt to crawl further; if 'waitOnDone' is set, wait for a signal before exiting
- bump to puppeteer-core 17.1.2
- bump to 0.7.0-beta.4
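An illustrative re-check on dequeue; the queue, seeds, and rules shown are stand-ins for the crawler's actual state objects:

```js
// Stand-in for the crawler's scoping rules, which may have changed since
// the URL was queued (e.g. on a restart with new options).
function isInScope(url, rules) {
  return rules.some((rx) => rx.test(url));
}

// Re-check scope on dequeue; URLs exactly matching a seed are always in scope.
function nextUrl(queue, seeds, rules) {
  while (queue.length) {
    const url = queue.shift();
    if (seeds.includes(url) || isInScope(url, rules)) {
      return url;
    }
  }
  return null;
}
```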
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting to the browser via devtools on port 9222 while crawling
* new window: use CDP instead of window.open (see the sketch after this list)
* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page
* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156
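A sketch of opening the window via a raw CDP call with puppeteer-core; the executable path is a placeholder:

```js
import puppeteer from "puppeteer-core";

const browser = await puppeteer.launch({
  executablePath: "/usr/bin/chromium-browser", // path is an assumption
});

// Ask the browser target directly for a new window, rather than calling
// window.open() from an existing page.
const cdp = await browser.target().createCDPSession();
const { targetId } = await cdp.send("Target.createTarget", {
  url: "about:blank",
  newWindow: true,
});
console.log("opened new window with target", targetId);
```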
* browser repair (see the sketch below):
- when using window-concurrency, attempt to repair / relaunch the browser if CDP errors occur
- mark pages as failed and don't reuse them if page or CDP errors occur
- screencaster: if screencasting, clear previous targets when repairing the browser
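A rough sketch of the repair flow; all names here are stand-ins, not the crawler's actual API:

```js
// Stand-in for closing the old browser (and any screencast targets
// attached to it) and launching a fresh one.
async function relaunchBrowser(crawler) {
  await crawler.screencaster?.clearTargets();
  await crawler.browser.close();
  crawler.browser = await crawler.launchBrowser();
}

// Run one unit of work; on a CDP error, mark the page so its window is
// not reused, then attempt a repair.
async function runTask(crawler, page, task) {
  try {
    return await task(page);
  } catch (e) {
    page.__failed = true; // don't return this window to the reuse pool
    await relaunchBrowser(crawler);
    throw e;
  }
}
```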
* bump version to 0.7.0-beta.3
improved logging of pywb + redis:
- if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log
- otherwise, ignore the output (don't print to stdout, as that's too confusing)
- print a message if wb-manager fails, likely due to an existing collection
waitUntil: default to just 'load' to avoid a potential infinite loop; the separate --netIdleWait option can configure the idle wait
dependency: update to latest puppeteer-core (16.1.0)
- add --netIdleWait option, defaulting to 10 seconds - necessary for some sites that start fetching immediately after page load
- add openssl.conf to allow pywb to avoid the 'unsafe legacy renegotiation disabled' error from openssl
- update to browsertrix-behaviors 0.3.2
- update the current URL for screencasting before the page load starts
bump to 0.7.0-beta.2
* update base image
- switch to browsertrix-base-image:101 with chrome/chromium 101, which includes additional fonts and ubuntu 22.04 as the base
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes
* debugging support for shared redis state (see the sketch after this list):
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now
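An illustrative pause loop, assuming ioredis; the exact redis key holding the debug state is an assumption:

```js
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");
const crawlId = process.env.CRAWL_ID || "test-crawl";
const sleep = (secs) => new Promise((res) => setTimeout(res, secs * 1000));

// Hold the crawler while the shared state is manually set to "debug";
// clearing the value via external redis lets the crawl resume.
while ((await redis.get(`${crawlId}:state`)) === "debug") {
  await sleep(5);
}
```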
bump to 0.7.0-beta.0
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has Browsertrix Crawler wait when finished
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` so the job controller knows when the crawl is finished (see the sketch after this list)
- set the redis state to `failing` in case of an exception, and to `failed` in case of three or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (e.g. save the WACZ) before exiting
- also write the WACZ if exiting because the size limit was exceeded, but not due to other interruptions
- change sleep() to be in seconds
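A sketch of the status updates, using the `<crawl id>:status` key described above and assuming ioredis:

```js
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");
const crawlId = process.env.CRAWL_ID || "test-crawl";

const setStatus = (value) => redis.set(`${crawlId}:status`, value);

await setStatus("running");  // normal operation
// on an unhandled exception:
await setStatus("failing");
// after repeated failed exits within 60 seconds:
await setStatus("failed");
// when the crawl completes:
await setStatus("done");
```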
* misc fixes:
- crawlstate.finished() -> isFinished(): return true if >0 pages have been crawled and none are left in the queue
- don't fail crawl if isFinished() is true
- don't keep looping waiting for pending URLs to finish if an abort request was received
* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed
* bump to 0.6.0
- Add optional health check via `--healthCheckPort`. If set, runs a server on the designated port that returns 200 if the health check succeeds (number of consecutive failed page loads < 2 * number of workers), or 503 if it fails. Useful for k8s health checks (see the sketch after this list).
- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state is optionally saved) when the size limit is exceeded.
- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state is optionally saved) when the total running time is exceeded.
- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.
- S3 Storage refactor, simplify, don't add additional paths by default.
- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.
- wacz save: re-enable WACZ validation after save.
- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.
- bump to 0.6.0-beta.1
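A minimal sketch of the health check server semantics; variable names are illustrative, and the real server is wired into the crawler's failure tracking:

```js
import http from "node:http";

const numWorkers = Number(process.env.NUM_WORKERS || 1);
let consecutiveFailedLoads = 0; // incremented/reset by the crawl loop

http
  .createServer((req, res) => {
    const healthy = consecutiveFailedLoads < 2 * numWorkers;
    res.writeHead(healthy ? 200 : 503);
    res.end(healthy ? "OK" : "unhealthy");
  })
  .listen(Number(process.env.HEALTH_CHECK_PORT || 8080));
```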
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support a filename passed to /createProfileJS as part of the JSON POST body
- profile api: support /ping to keep the profile browser running (see the sketch after this list), and --shutdownWait to add an auto-shutdown timeout (extendable via ping)
- profile api: add /target to retrieve the target and /navigate to navigate by URL.
* bump to 0.6.0-beta.0
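An illustrative keep-alive loop against /ping; the host and port are assumptions:

```js
// Host/port of the profile browser's API; placeholders here.
const profileApi = "http://localhost:9223";

// Ping periodically so the --shutdownWait timeout keeps being extended.
setInterval(async () => {
  try {
    const res = await fetch(`${profileApi}/ping`);
    console.log("profile browser still up:", res.status);
  } catch (e) {
    console.warn("ping failed:", e.message);
  }
}, 30_000);
```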
- update pywb to 2.6.7, fix possible error in CDX indexing via --generateCDX
- update wacz to 0.4.6, ensure the WACZ file is closed, and add better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
* non-html page load improvements: fix for #129
- don't include the cookie check when ruling out direct fetch, as it may be too speculative
- as suggested in #129, when loading non-HTML, only wait for DOM load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle PDFs separately)
bump to 0.5.0-beta.8
* cloudflare wait improvements (#110 fix)
- set navigator.webdriver to false to help with the cloudflare wait (see the sketch after this list)
- add checkCF(), which detects the cloudflare ddos page and waits 5 seconds until the original page is loaded
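A sketch of the navigator.webdriver override, using puppeteer's evaluateOnNewDocument so it runs before any page script:

```js
async function hideWebdriver(page) {
  // Runs before any page script, so detection code sees webdriver === false.
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, "webdriver", { get: () => false });
  });
}
```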
* chrome args refactor:
- move to utils/browser
- disable LazyFrameLoading to fix occasional issues with page.goto() never finishing (see the sketch after this list)
- add userAgent option
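An illustrative subset of the shared flags; the real chromeArgs() helper in utils/browser includes many more:

```js
// Returns launch flags shared by crawling and profile creation.
function chromeArgs(userAgent = null) {
  return [
    "--disable-features=LazyFrameLoading", // avoid page.goto() hangs
    ...(userAgent ? [`--user-agent=${userAgent}`] : []),
  ];
}
```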
* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy
* fix setting custom userAgent (#90) (see the sketch after this list)
- fix typo that resulted in an error
- ensure userAgent is applied separately from emulatedDevice
- add getDefaultUA() browser util
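A sketch of applying the UA independently of device emulation; argument names are illustrative:

```js
// Device emulation may set its own UA, so apply the custom one afterwards
// to ensure it always wins.
async function applyUserAgent(page, emulateDevice, userAgent) {
  if (emulateDevice) {
    await page.emulate(emulateDevice);
  }
  if (userAgent) {
    await page.setUserAgent(userAgent);
  }
}
```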
* error abort check: fix possible regression with req.failure() returning null; also move to a separate function and wrap in an exception handler
* bump version to 0.5.0-beta.6
- direct check fix: only do the direct check if HEAD returns a 200 status code (see the sketch after this list)
- if the direct load results in a non-200 status code, still load in the browser
- error reporting: detect whether net::ERR_ABORTED is actually caused by the loading of a PDF or other downloaded binary, and not an actual page load error
- state: tweak error logging message
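An illustrative version of the direct-fetch gate, using Node's global fetch; the real check lives in the crawler's page-load path:

```js
// Only bypass the browser if the HEAD request comes back 200; on any
// other status (or a network error), fall back to a normal page load.
async function canDirectFetch(url) {
  try {
    const resp = await fetch(url, { method: "HEAD" });
    return resp.status === 200;
  } catch {
    return false;
  }
}
```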
* save state: if --saveState is set to always, incrementally save state every --saveStateInterval seconds, keeping the last --saveStateHistory save states in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states (see the sketch after this list)
display save state status on startup
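A sketch of the incremental save-state loop; writeSaveState is a stand-in for serializing state into /crawls:

```js
// Stand-in for serializing the current queue/pending state to a file
// in the /crawls directory, returning its name.
async function writeSaveState() { return `state-${Date.now()}.yaml`; }

const saveStateInterval = 300; // seconds; default: every 5 minutes
const saveStateHistory = 5;    // keep the last 5 snapshots
const savedStates = [];

setInterval(async () => {
  savedStates.push(await writeSaveState());
  while (savedStates.length > saveStateHistory) {
    savedStates.shift(); // oldest snapshots are dropped from the history
  }
}, saveStateInterval * 1000);
```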
page write fixes: add missing await
fix for #113
* update README