Commit graph

8 commits

Author SHA1 Message Date
Tessa Walsh
3864c76090
Add option to log errors to redis (#279) 2023-04-11 11:32:52 -04:00
Tessa Walsh
62fe4b4a99
Add options to filter logs by --logLevel and --context (#271)
* Add .DS_Store to gitignore

* Add --logLevel and --context filtering options

* Add log filtering test
2023-04-01 10:07:59 -07:00
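
The `--logLevel` and `--context` filters added in this commit can be pictured with a small sketch. The function name `shouldLog`, the entry field names, and the rule that an empty filter list lets everything through are assumptions made for illustration, not the crawler's actual code.

```javascript
// Hypothetical filter: keeps an entry only if it matches every non-empty filter list.
// Field names (logLevel, context) are assumed, not taken from the crawler source.
function shouldLog(entry, { logLevel = [], context = [] } = {}) {
  // An empty list means "no filtering on this field".
  if (logLevel.length > 0 && !logLevel.includes(entry.logLevel)) {
    return false;
  }
  if (context.length > 0 && !context.includes(entry.context)) {
    return false;
  }
  return true;
}

const sampleEntries = [
  { logLevel: "debug", context: "worker", message: "No jobs available" },
  { logLevel: "error", context: "worker", message: "Page load failed" },
  { logLevel: "info", context: "general", message: "Crawl started" },
];

// Keep only error-level entries from the 'worker' context.
const workerErrors = sampleEntries.filter((entry) =>
  shouldLog(entry, { logLevel: ["error"], context: ["worker"] })
);
```

Treating each option as a list allows several levels or contexts to be selected at once; whether the real flags accept lists is not stated in the commit message.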
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
Tessa Walsh
1bee46b321
Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)
* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck

* deps: bump puppeteer-core to 17.1.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-08 18:31:19 -08:00
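
The 5-second link extraction timeout mentioned in this commit can be approximated with `Promise.race`. This is a minimal sketch under stated assumptions: the helper name `withTimeout` and the choice to resolve with a fallback value instead of rejecting are hypothetical, and the crawler may implement its timeout differently.

```javascript
// Hypothetical helper: resolves with `fallback` if `promise` has not settled within `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; the timer is cleared either way so that
  // it does not keep the process alive after a fast result.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: `extractLinks` is an assumed async function, not a real API.
// const links = await withTimeout(extractLinks(frame), 5000, []);
```

Resolving with an empty fallback lets the crawl move on to the next page instead of failing the whole page when one frame is slow.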
Ilya Kreymer
ac5a720362
logging: serialize regex as string to avoid empty '{}' when logging scoping rules, fixes #234 (#235) 2023-03-02 11:39:37 -08:00
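
For context on the fix above: `JSON.stringify()` serializes a `RegExp` as `{}` because a regex has no enumerable own properties. A replacer function that converts regexes to strings avoids this. The sketch below illustrates the general technique; the names `regexReplacer` and `scopeRule` are hypothetical and the crawler's own fix may look different.

```javascript
// Replacer for JSON.stringify(): turns RegExp values into their string form.
function regexReplacer(key, value) {
  return value instanceof RegExp ? value.toString() : value;
}

// Hypothetical scoping rule, for illustration only.
const scopeRule = { include: /^https:\/\/example\.com\// };

// Without the replacer the pattern is lost.
const lossy = JSON.stringify(scopeRule);

// With the replacer the pattern survives as a string.
const readable = JSON.stringify(scopeRule, regexReplacer);
```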
Ilya Kreymer
63717c4b04
Crawl log (#231)
* logging:
- write most of the crawl log to '{coll}/logs/crawl-{iso-timestamp}.log', part of #230
- ensure log filename consists of numeric timestamp only
- close log before wacz file is generated to allow storing log in wacz
- close log after writing stats
- add logs/ directory to wacz with new py-wacz
- deps: bump to py-wacz 0.4.8 to support logs in wacz
2023-02-24 18:31:08 -08:00
Ilya Kreymer
a767721f5e
crawl state: add getPendingList() to return pending state from either… (#205)
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failure: log with error, not fatal
2023-01-23 10:43:12 -08:00
Tessa Walsh
0192d05f4c
Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
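
The Logger described in this commit (methods for info, error, warn, debug, and fatal; context, timestamp, and details fields; JSON Lines output) could look roughly like the sketch below. The constructor argument, the `logAsJSON` method name, the exact field names, and the default context are assumptions for illustration. The sketch also does not exit the process on `fatal`, since the commit message does not say what the real logger does there.

```javascript
// Hypothetical JSON Lines logger; not the crawler's actual implementation.
class Logger {
  // `write` receives one serialized line per log entry (assumed to default to stdout).
  constructor(write = (line) => process.stdout.write(line + "\n")) {
    this.write = write;
  }

  // Builds one entry and emits it as a single line of JSON.
  logAsJSON(message, details = {}, context = "general", logLevel = "info") {
    const entry = {
      timestamp: new Date().toISOString(),
      logLevel,
      context,
      message,
      details,
    };
    this.write(JSON.stringify(entry));
    return entry;
  }

  info(message, details, context) {
    return this.logAsJSON(message, details, context, "info");
  }

  error(message, details, context) {
    return this.logAsJSON(message, details, context, "error");
  }

  warn(message, details, context) {
    return this.logAsJSON(message, details, context, "warn");
  }

  debug(message, details, context) {
    return this.logAsJSON(message, details, context, "debug");
  }

  fatal(message, details, context) {
    return this.logAsJSON(message, details, context, "fatal");
  }
}
```

Because every entry is one self-contained JSON object per line, the output can be filtered line by line with ordinary tools without parsing the whole log.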