Commit graph

23 commits

Author SHA1 Message Date
Ilya Kreymer
0e415e1d7a add default args 2023-04-25 18:57:06 -07:00
Ilya Kreymer
7c8200b7d1 worker: attempt to recover if firstPage is gone 2023-04-25 14:52:24 -07:00
Ilya Kreymer
5c1a33d596 optimizations:
- add partial custom Route-like impl for blockrules / adblockrules
- fix evaluateWithCLI for puppeteer / playwright, avoid creating cdp if applying to top-frame/already exists
2023-04-25 10:17:16 -07:00
Ilya Kreymer
810856fcf9 add --browserdriver for testing different drivers
add testing with puppeteer again!
comment out BlockRules / AdBlockRules for now
2023-04-24 23:40:11 -07:00
Ilya Kreymer
96a3aa837b possible work for #298
- split browser into NewContextBrowser and PersistentContextBrowser (old behavior)
- NewContextBrowser() closes and recreates the context when pages are closed
- worker uses 'storageState' to get state, set again on new page for that worker
2023-04-24 16:32:03 -07:00
Ilya Kreymer
d4e222fab2
merge regression fixes from 0.9.1: full page screenshot + allow service workers if no profile used (#297)
* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288

* Fix full page screenshot (#296)
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 10:26:56 -07:00
Ilya Kreymer
02fb137b2c
Catch loading issues (#255)
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadstate (0 to 4) as well as other per-page-state properties on defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set to loadState to 2 'full-page-load'. 
- if page.goto() times out, if no domcontentloaded either, fail immediately. if domcontentloaded reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-pageload', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- workers ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
2023-03-20 18:31:37 -07:00
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer neede without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: addition logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
Ilya Kreymer
5da379cb5f
Logging and Behavior Tweaks (#229)
- Ensure page is included in all logging details
- Update logging messages to be a single string, with variables added in the details
- Always wait for all pending wait requests to finish (unless counter <0)
- Don't set puppeteer-cluster timeout (prep for removing puppeeteer-cluster)
- Add behaviorTimeout to running behaviors in crawler, in addition to in behaviors themselves.
- Add logging for behavior start, finish and timeout
- Move writeStats() logging to beginning of each page as well as at the end, to avoid confusion about pending pages.
- For events from frames, use frameUrl along with current page
- deps: bump browsertrix-behaviors to 0.4.2
- version: bump to 0.8.1
2023-02-23 18:50:22 -08:00
Ilya Kreymer
38a9dbdaae
behaviors: don't run behaviors in iframes that are about:blank or are… (#211)
* behaviors: don't run behaviors in iframes that are about:blank or are from an ad-host (even if ad-blocking is not disabled), fixes #210

* logging: log behavior wait start and success, in addition to error, with url in details
2023-01-23 16:47:33 -08:00
Ilya Kreymer
a767721f5e
crawl state: add getPendingList() to return pending state from either… (#205)
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failuire: log with error, not fatal
2023-01-23 10:43:12 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
Ilya Kreymer
5ee05985b1
Use VNC for headful profile creation (#197)
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer

* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected

* x11vnc cleanup: just pass password via cmdline to simplify setup

* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive

* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
2023-01-09 23:56:53 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
e22d95e2f0
Logging and browser improvements: (#158)
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting via devtools via crawling on port 9222
2022-08-21 00:30:25 -07:00
Ilya Kreymer
0a309af740
Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144)
* update base image 
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes

* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now

bump to 0.7.0-beta.0
2022-06-30 19:24:26 -07:00
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exists (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
Ilya Kreymer
500ed1f9a1
Profile Creation Improvements (#136)
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfieJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.

* bump to 0.6.0-beta.0
2022-05-05 14:27:17 -05:00
Ilya Kreymer
5e5efda437
Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128)
* cloudlfare wait improvements (#110 fix)
- set navigator.webdriver to false to help with cloudflare wait
- add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded

* chrome args refactor:
- move to utils/browser
- add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing
- add userAgent option

* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy

* fix setting custom userAgent (#90)
- fix typo that resulted in error
- ensure userAgent is applied separate from emulatedDevice
- add getDefaultUA() browser util
2022-03-18 10:32:59 -07:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
761ce7067b
behaviors update (#105)
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
2022-02-20 22:22:19 -08:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00