Commit graph

13 commits

Ilya Kreymer
3c7c7bfbc4
optimize shutdown: if the redis connection is gone after an interrupt signal was received, assume the crawler is being terminated and exit quickly (#292)
don't attempt to reconnect to redis (assume crawler is also being shut down)
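
A minimal sketch of the quick-exit idea, assuming ioredis and a simple interrupted flag (names are illustrative, not the crawler's actual code):

```js
import Redis from "ioredis";

let interrupted = false;
process.on("SIGINT", () => { interrupted = true; });
process.on("SIGTERM", () => { interrupted = true; });

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379/0");

redis.on("error", () => {
  // normally ioredis would keep retrying; after an interrupt, a lost
  // connection means the whole deployment is being torn down, so exit
  if (interrupted) {
    console.error("redis connection lost after interrupt, exiting");
    process.exit(0);
  }
});
```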
2023-04-24 09:50:49 -07:00
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use (sketch below)

* add logger to create-login-profile.js
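
A minimal sketch of the singleton pattern described here (hypothetical; field names follow the JSON Lines format from the logging commit further down):

```js
class Logger {
  logAsJSON(message, details, context, logLevel) {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      logLevel,
      context,
      message,
      details,
    }));
  }

  info(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "info");
  }

  error(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "error");
  }
}

// a single shared instance: every module that imports `logger` logs
// through the same object, so no logger instance needs to be passed around
export const logger = new Logger();
```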
2023-03-17 14:24:44 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp (see sketch after this list)
- fix evaluateWithCLI to call CDP command directly
- create workers directly in WorkerPool - await not necessary
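
The persistent-context and CDP-per-page changes, as a minimal standalone Playwright sketch (paths and options are illustrative):

```js
import { chromium } from "playwright";

// a persistent context keeps the profile directory on disk, which is
// what makes saved browser profiles work
const context = await chromium.launchPersistentContext("/profile-dir", {
  headless: true,
});

const page = await context.newPage();

// one CDP session per page; the screencaster reuses this session
// instead of creating its own
const cdp = await context.newCDPSession(page);
await cdp.send("Page.enable");

await page.goto("https://example.com/");
await cdp.detach();
await page.close();
```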

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score (fixes #150; see sketch below)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
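
A minimal sketch of the sorted-set queue (illustrative key names; `redis`, `url`, `depth`, and `extraHops` are assumed to be in scope):

```js
// enqueue: the score (depth + extraHops) orders the queue, shallowest first
await redis.zadd("crawl:queue", depth + extraHops, JSON.stringify({ url, depth }));

// dequeue: atomically pop the lowest-scored entry
const [member] = await redis.zpopmin("crawl:queue");
const next = member ? JSON.parse(member) : null;
```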

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each running in its own loop to process pages until there are no pending or queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes (sketch below)
- Modify tests to use webrecorder.net to avoid timeouts
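
A minimal sketch of capturing child-process output into the logger (hypothetical helper; `logger` is a JSON Lines singleton like the one sketched above):

```js
import { spawn } from "node:child_process";

function runAndLogOutput(cmd, args) {
  return new Promise((resolve) => {
    const proc = spawn(cmd, args);
    // forward both streams into the structured log, tagged by command
    proc.stdout.on("data", (data) => logger.info(data.toString(), {}, cmd));
    proc.stderr.on("data", (data) => logger.error(data.toString(), {}, cmd));
    proc.on("close", (code) => resolve(code));
  });
}

await runAndLogOutput("wb-manager", ["init", "mycoll"]);
```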
2023-01-19 14:17:27 -05:00
Tessa Walsh
f35d495103
Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const
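
A minimal sketch of how the comma-separated option could map to captures (hypothetical helper; puppeteer's page.screenshot() is the only real API used):

```js
export const screenshotTypes = ["view", "fullPage", "thumbnail"];

async function takeScreenshots(page, optionValue) {
  for (const type of optionValue.split(",")) {
    if (!screenshotTypes.includes(type)) {
      continue; // ignore unknown types
    }
    // "fullPage" captures the whole page; "view" and "thumbnail"
    // capture the viewport ("thumbnail" would be downscaled afterward)
    const buf = await page.screenshot({ fullPage: type === "fullPage" });
    await storeScreenshot(buf, type); // hypothetical storage helper
  }
}
```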

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
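
The conversion in miniature: with `"type": "module"` in package.json, CommonJS requires become imports, and node 18's built-in fetch replaces node-fetch (illustrative before/after):

```js
// before (CommonJS):
// const fetch = require("node-fetch");
// const { Crawler } = require("./crawler");

// after (ESM on node 18):
import { Crawler } from "./crawler.js"; // ESM needs explicit file extensions
const res = await fetch("https://example.com/"); // built-in fetch, no node-fetch
```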
2022-11-15 18:30:27 -08:00
Ilya Kreymer
65933c6b12
Interrupt Handling Fixes (#167)
* interrupts: simplify interrupt behavior:
- SIGTERM/SIGINT behave same way, trigger a graceful shutdown after page load

improvements to remote state / parallel crawlers (for browsertrix-cloud):
- SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl
- SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler
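
A minimal sketch of the layered signal handling (the flags and gracefulShutdown() helper are hypothetical):

```js
let saveStateOnExit = false;
let markCrawlDone = false;

// sent by the orchestrator before SIGINT/SIGTERM:
process.on("SIGUSR1", () => { saveStateOnExit = true; markCrawlDone = true; });  // graceful stop
process.on("SIGUSR2", () => { saveStateOnExit = true; });                        // scaling down one crawler

// SIGINT and SIGTERM behave the same way: graceful shutdown after page load
process.on("SIGINT", () => gracefulShutdown(saveStateOnExit, markCrawlDone));
process.on("SIGTERM", () => gracefulShutdown(saveStateOnExit, markCrawlDone));
```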

* scope check: check scope of URL retrieved from queue (in case scoping rules changed); urls matching a seed are automatically in scope!
2022-09-20 17:09:52 -07:00
Ilya Kreymer
cf90304fa7
0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished (see sketch below)
- set redis state to `failing` in case of exception, set to `failed` in case of 3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (e.g. save WACZ) before exiting.
- also write WACZ if exiting due to exceeded size limit, but not due to other interruptions
- change sleep() to be in seconds
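
A minimal sketch of the shared status field (illustrative key layout; `redis` and `crawlId` assumed in scope):

```js
// status is one of: "running", "failing", "failed", "done"
async function setStatus(redis, crawlId, status) {
  await redis.set(`${crawlId}:status`, status);
}

// the job controller reads the same key to decide when the crawl is over
const status = await redis.get(`${crawlId}:status`);
```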

* misc fixes:
- crawlstate.finished() -> isFinished() - returns true if >0 pages crawled and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if an abort request was received

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
2022-06-17 11:58:44 -07:00
Ilya Kreymer
761ce7067b
behaviors update (#105)
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early (see sketch below)
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
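
A sketch of what evaluateWithCLI() can look like: CDP's Runtime.evaluate accepts includeCommandLineAPI, which exposes getEventListeners() and the rest of the devtools command-line API to the evaluated expression (the wrapper itself is illustrative; session setup omitted):

```js
async function evaluateWithCLI(cdp, expression) {
  const { result, exceptionDetails } = await cdp.send("Runtime.evaluate", {
    expression,
    includeCommandLineAPI: true, // makes getEventListeners() etc. available
    awaitPromise: true,
    returnByValue: true,
  });
  if (exceptionDetails) {
    throw new Error("evaluation failed: " + exceptionDetails.text);
  }
  return result.value;
}
```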
2022-02-20 22:22:19 -08:00
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved: defaults to partial (only when interrupted), also supports always and never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state

* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done (see sketch below)
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats
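
A minimal sketch of the atomic queue -> pending move (illustrative key names; the point is that pop and add happen inside one lua script, so no url can be lost between the two steps):

```js
const popToPending = `
local url = redis.call('rpop', KEYS[1])
if url then
  redis.call('sadd', KEYS[2], url)
end
return url
`;

// ioredis: eval(script, numKeys, ...keys); returns null when the queue is empty
const url = await redis.eval(popToPending, 2, "crawl:queue", "crawl:pending");
```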

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible
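
A minimal sketch of the crawlId defaulting and CRAWL_ARGS handling (hypothetical yargs wiring; whether CRAWL_ARGS replaces or extends the command line is an assumption here):

```js
import os from "node:os";
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

// assumption: CRAWL_ARGS, if set, supplies the command line
const args = process.env.CRAWL_ARGS
  ? process.env.CRAWL_ARGS.split(/\s+/)
  : hideBin(process.argv);

const argv = yargs(args)
  .option("crawlId", {
    type: "string",
    // CRAWL_ID env var, falling back to the hostname (the container id under docker)
    default: process.env.CRAWL_ID || os.hostname(),
  })
  .option("saveState", {
    choices: ["never", "partial", "always"], // 'choices' constrains the value
    default: "partial",
  })
  .parseSync();
```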

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
* state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
Ilya Kreymer
8f740d4e24 support custom crawl directory with --cwd flag, default to /crawls
update README
2020-11-02 15:28:19 +00:00
Ilya Kreymer
a875aa90d3 Dockerfile: switch to cmd 'crawl' instead of entrypoint, to support running 'pywb' also
update README with docker-compose and docker run examples, update commandline example
default output to './crawls' subdirectory
2020-11-01 21:35:00 -08:00
Ilya Kreymer
91b8994a08 refactor crawler and default driver:
- add extensible defaultDriver, wrap crawling functionality in Crawler class
- support headless/non-headless, custom driver
- support custom collection name for pywb, generate-cdx option
- autoplay: add slight delay for splash loading
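
A minimal sketch of the extensible driver idea: the default driver is just a module exporting one async function, and a custom driver file can replace it (the signature illustrates the pattern, not guaranteed exact):

```js
// defaultDriver.js - CommonJS, matching the codebase at the time of this commit
module.exports = async ({ data, page, crawler }) => {
  // default behavior: load the page and let the Crawler class handle
  // link extraction, capture, and behaviors
  await crawler.loadPage(page, data);
};
```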
2020-11-01 19:53:47 -08:00