- split browser into NewContextBrowser and PersistentContextBrowser (old behavior)
- NewContextBrowser() closes and recreates the context when pages are closed
- worker uses 'storageState' to get state, set again on new page for that worker
* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288
* Fix full page screenshot (#296)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadstate (0 to 4) as well as other per-page-state properties on defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set to loadState to 2 'full-page-load'.
- if page.goto() times out, if no domcontentloaded either, fail immediately. if domcontentloaded reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-pageload', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- workers ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary
* State / Worker Refactor (#252)
* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer neede without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes#150)
- override console.error to avoid logging ioredis errors (fixes#244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages
* more worker queue refactor:
- remove p-queue
- initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1
* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)
* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: addition logging for screencaster, new window, etc...
- remove unused packages
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Ensure page is included in all logging details
- Update logging messages to be a single string, with variables added in the details
- Always wait for all pending wait requests to finish (unless counter <0)
- Don't set puppeteer-cluster timeout (prep for removing puppeeteer-cluster)
- Add behaviorTimeout to running behaviors in crawler, in addition to in behaviors themselves.
- Add logging for behavior start, finish and timeout
- Move writeStats() logging to beginning of each page as well as at the end, to avoid confusion about pending pages.
- For events from frames, use frameUrl along with current page
- deps: bump browsertrix-behaviors to 0.4.2
- version: bump to 0.8.1
* behaviors: don't run behaviors in iframes that are about:blank or are from an ad-host (even if ad-blocking is not disabled), fixes#210
* logging: log behavior wait start and success, in addition to error, with url in details
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failuire: log with error, not fatal
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
* profiles: use vnc for automatic profile creation (fixes#194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer
* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected
* x11vnc cleanup: just pass password via cmdline to simplify setup
* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive
* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting via devtools via crawling on port 9222
* update base image
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes
* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now
bump to 0.7.0-beta.0
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check
- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.
- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exists (and state optionally saved) when total running time is exceeded.
- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.
- S3 Storage refactor, simplify, don't add additional paths by default.
- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.
- wacz save: reenable wacz validation after save.
- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.
- bump to 0.6.0-beta.1
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfieJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.
* bump to 0.6.0-beta.0
* cloudlfare wait improvements (#110 fix)
- set navigator.webdriver to false to help with cloudflare wait
- add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded
* chrome args refactor:
- move to utils/browser
- add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing
- add userAgent option
* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy
* fix setting custom userAgent (#90)
- fix typo that resulted in error
- ensure userAgent is applied separate from emulatedDevice
- add getDefaultUA() browser util
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
* optimization: don't intercept requests if no blockRules set
* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages
* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)
* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile
* screencasting: convert newContext to window instead of page by default, instead of just warning about it
* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2
* seeds: add trim() to seed URLs
* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically
* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles
* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes#25
* update CHANGES and README with new features
* bump version to 0.4.1