Commit graph

127 commits

Ilya Kreymer
1e7f8361fe
tests: fix blockrules tests (#603)
The blockrules tests assumed that YouTube serves videos with the `video/mp4`
MIME type. However, YouTube now also serves them with the MIME type
`application/vnd.yt-ump`. Both MIME types are now checked to verify videos are present.
2024-06-13 12:12:46 -07:00
Ilya Kreymer
e2b4cc1844
proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589)
fixes #587 

The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.

Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)

The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)

---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-10 13:11:00 -07:00
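The precedence order described in that commit can be sketched as a small pure function. This is an illustrative sketch only — `resolveProxyServer` and `ProxyEnv` are hypothetical names, not the crawler's actual internals:

```typescript
type ProxyEnv = { PROXY_SERVER?: string; PROXY_HOST?: string; PROXY_PORT?: string };

// Resolve the proxy setting in the documented order of precedence:
// --proxyServer CLI flag > PROXY_SERVER env var > legacy PROXY_HOST/PROXY_PORT.
function resolveProxyServer(
  cliProxyServer: string | undefined,
  env: ProxyEnv,
): string | undefined {
  // 1. --proxyServer CLI flag wins
  if (cliProxyServer) return cliProxyServer;
  // 2. PROXY_SERVER env var
  if (env.PROXY_SERVER) return env.PROXY_SERVER;
  // 3. legacy PROXY_HOST (+ optional PROXY_PORT), HTTP-only, kept for 0.12.x compatibility
  if (env.PROXY_HOST) {
    return `http://${env.PROXY_HOST}${env.PROXY_PORT ? ":" + env.PROXY_PORT : ""}`;
  }
  return undefined;
}
```

The resolved value would then be handed to the browser via `--proxy-server`, as the commit notes.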
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ), while logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshots and text extraction are skipped altogether in dryRun mode, and a
warning is printed that storage and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
benoit74
32435bfac7
Consider disk usage of collDir instead of default /crawls (#586)
Fix #585 

Changes:
- compute disk usage based on crawler `collDir` property instead of
always computing it on `/crawls` directory
2024-06-07 10:13:15 -07:00
Ilya Kreymer
1bd94d93a1
cleanup dockerfile + fix test (#595)
- remove obsolete line from Dockerfile
- fix pdf test to use a webrecorder-hosted pdf
2024-06-06 12:14:44 -07:00
Ilya Kreymer
089d901b9b
Always add warcinfo records to all WARCs (#556)
Fixes #553 

Includes `warcinfo` records at the beginning of new WARCs, as well as
the combined WARC.
Makes the warcinfo record also WARC/1.1 to match the rest of the WARC
records.
2024-05-22 15:47:05 -07:00
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Tessa Walsh
1fcd3b7d6b
Fix failOnFailedLimit and add tests (#580)
Fixes #575

- Adds a missing await to fetching the number of failed pages from Redis
- Fixes a typo in the fatal logging message
- Adds a test to ensure that the crawl fails with exit code 17 if
--failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that
will 404
2024-05-21 16:35:43 -07:00
Tessa Walsh
8318039ae3
Fix regressions with failOnFailedSeed option (#572)
Fixes #563 

This PR makes a few changes to fix a regression in behavior around
`failOnFailedSeed` for the 1.x releases:

- Fail with exit code 1, not 17, when pages are unreachable due to DNS
not resolving or other network errors if the page is a seed and
`failOnFailedSeed` is set
- Extend tests, add test to ensure crawl succeeds on 404 seed status
code if `failOnInvalidStatus` isn't set
2024-05-15 11:02:33 -07:00
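The exit-code rules from these two fixes (#580 and #572) can be summarized in one decision function. A hypothetical sketch — the function name and option shape are illustrative, not the crawler's real interface:

```typescript
// Decide the crawl exit code per the behavior described in #572 and #580:
// an unreachable seed with failOnFailedSeed exits 1 (general failure, not 17),
// while hitting --failOnFailedLimit produces the fatal exit code 17.
function crawlExitCode(opts: {
  seedUnreachable?: boolean;   // DNS not resolving / other network error on a seed
  failOnFailedSeed?: boolean;
  failedPages?: number;
  failOnFailedLimit?: number;  // e.g. --failOnFailedLimit 1
}): number {
  if (opts.seedUnreachable && opts.failOnFailedSeed) return 1;
  if (opts.failOnFailedLimit && (opts.failedPages ?? 0) >= opts.failOnFailedLimit) {
    return 17; // fatal
  }
  return 0;
}
```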
Ilya Kreymer
10f6414f2f
PDF loading status code fix (#571)
When loading a PDF as a page, the browser returns a 'false positive'
net::ERR_ABORTED even though the PDF is loaded.
- this is already handled, but the status code was still being cleared;
ensure status code is not reset to 0 on response
- ensure page status and mime are also recorded if this failure is
ignored (in shouldIgnoreAbort)
- tests: add test for PDF capture

fixes #570
2024-05-14 15:26:06 -07:00
Ilya Kreymer
15d2b09757
warcinfo: fix version to 1.1 to avoid confusion (part of #553) (#557)
Ensure warcinfo record is also WARC/1.1
2024-04-18 21:52:24 -07:00
Ilya Kreymer
f6edec0b95
Fix for --rolloverSize for individual WARCs in 1.x (#542)
Fixes #533 

Fixes rollover in WARCWriter, separate from combined WARC rollover size:
- check rolloverSize and close previous WARCs when size exceeds
- add timestamp to resource WARC filenames to support rollover, eg.
screenshots-{ts}.warc.gz
- use append mode for all write streams, just in case
- tests: add test for rollover of individual WARCs with 500K size limit
- tests: update screenshot tests to account for WARCs now being named
screenshots-{ts}.warc.gz instead of just screenshots.warc.gz
2024-04-15 13:43:08 -07:00
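The per-WARC rollover and timestamped resource filenames described above can be sketched minimally. These are illustrative helpers, not the WARCWriter's actual methods:

```typescript
// Rollover check: close the current WARC once it exceeds --rolloverSize.
function shouldRollover(currentSizeBytes: number, rolloverSize: number): boolean {
  return currentSizeBytes >= rolloverSize;
}

// Resource WARCs now carry a timestamp so rollover can produce a fresh file,
// e.g. screenshots-{ts}.warc.gz instead of screenshots.warc.gz.
function timestampedResourceName(kind: string, ts: string, gzip = true): string {
  return `${kind}-${ts}.warc${gzip ? ".gz" : ""}`;
}
```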
Ilya Kreymer
b5f3238c29
Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535)
Cherry-picked from the use-js-wacz branch, now implementing separate
writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and
new `--copy-page-files` flag.

Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-11 13:55:52 -07:00
Ilya Kreymer
01c4139aa7
Fixes from 1.0.3 release -> main (#517)
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to
https://github.com/webrecorder/browsertrix-crawler/issues/496
- support parsing sitemap urls that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add test for both)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)

fixes redirected seed (from #476) being counted against page limit: #509
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is
done

fixes #508
2024-03-26 14:50:36 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` date filters, to only include URLs between the given
dates, also filtering nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, interrupt
further parsing when at limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
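The `fromDate` / `toDate` filtering described above amounts to a per-URL date-window check against a sitemap entry's `<lastmod>`. A minimal sketch, with an assumed helper name — not the parser's real API:

```typescript
// Include a sitemap URL only if its lastmod date falls inside the
// [fromDate, toDate] window; URLs without a lastmod are kept.
function inDateRange(lastmod: string | undefined, fromDate?: Date, toDate?: Date): boolean {
  if (!lastmod) return true; // no lastmod: nothing to filter on, keep the URL
  const d = new Date(lastmod);
  if (fromDate && d < fromDate) return false;
  if (toDate && d > toDate) return false;
  return true;
}
```

The same check would apply when deciding whether to descend into a nested sitemap list.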
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
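The 'finished' list described above is a straightforward set difference: everything seen, minus what failed and what is still queued. A sketch with an illustrative function name, assuming in-memory sets rather than the crawler's Redis-backed state:

```typescript
// Compute the serialized 'finished' list: seen URLs that neither failed
// nor remain in the queue.
function computeFinished(
  seen: Set<string>,
  failed: Set<string>,
  queued: Set<string>,
): string[] {
  return [...seen].filter((url) => !failed.has(url) && !queued.has(url));
}
```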
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add the resource type value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as a `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a512e92886
Include resource type + mime type in page resources list (#468)
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
2024-02-19 19:11:48 -08:00
Ilya Kreymer
e8f2073a7e
Update Browser Image (#466)
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
2024-02-17 22:40:12 -08:00
Ilya Kreymer
96f3c407b1
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still
included in the `url:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
include cached resources
2024-02-16 14:36:32 -08:00
Ilya Kreymer
703835a7dd
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
2023-12-13 15:14:53 -05:00
Ilya Kreymer
3323262852
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
  - add timeout to final fetch tasks to avoid never hanging on finish
- fix adding `WARC-Truncated` header, which needs to be set after the stream is
finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
2023-12-07 23:02:55 -08:00
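The filename template quoted in that commit can be expressed as a small helper. Illustrative only — the argument order and name are assumptions, with `$ts` passed in explicitly (the crawler substitutes it at file-creation time):

```typescript
// Build a new WARC filename from the documented template:
// ${prefix}-${crawlId}-$ts-${workerid}.warc[.gz]
function newWarcName(
  prefix: string,
  crawlId: string,
  ts: string,
  workerid: number,
  gzip: boolean,
): string {
  return `${prefix}-${crawlId}-${ts}-${workerid}.warc${gzip ? ".gz" : ""}`;
}
```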
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00
Ilya Kreymer
2aeda56d40
improved text extraction: (addresses #403) (#404)
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` as equivalent to
`--text to-pages`.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-31 23:05:30 -07:00
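The backwards-compatible `--text` handling (this commit's comma-separated values, plus the `--text false` compatibility noted in #423 above) can be sketched as a parser. A hypothetical helper, not the crawler's actual argParser code:

```typescript
// Parse the --text option: bare/true maps to the legacy ["to-pages"] behavior,
// false disables text extraction, anything else is a comma-separated list
// such as "to-warc,to-pages,final-to-warc".
function parseTextOption(value?: string | boolean): string[] {
  if (value === undefined || value === "" || value === true || value === "true") {
    return ["to-pages"]; // legacy --text / --text true
  }
  if (value === false || value === "false") {
    return []; // legacy --text false: extraction off
  }
  return String(value).split(",").map((s) => s.trim());
}
```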
Ilya Kreymer
8c92901889
load saved state fixes + redis tests (#415)
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
2023-10-23 09:36:10 -07:00
Ilya Kreymer
14c8221d46
tests: disable ad-block tests: seeing inconsistent ci behavior, though tests pass on local brave (#407) 2023-10-09 09:41:50 -07:00
Ilya Kreymer
f453dbfb56
Switch to Brave Base Image (#400)
* switch to brave:
- switch base browser to brave base image 1.58.135
- tests: add extra delay for blocking tests
- bump to 0.12.0-beta.0
2023-10-02 14:30:44 -07:00
Tessa Walsh
7e03dc076f
Set new logic for invalid seeds (#395)
Allow for some seeds to be invalid unless failOnFailedSeed is set

Fail crawl if no valid seeds are provided

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-29 13:02:52 -04:00
Ilya Kreymer
e5b0c4ec1b
optimize link extraction: (fixes #376) (#380)
* optimize link extraction: (fixes #376)
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each pages' handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
2023-09-15 10:12:08 -07:00
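The QueueState outcomes and the combined limit/dupe check described above can be sketched in memory. Illustrative only — the real implementation performs this check atomically in a single Redis call:

```typescript
// Possible outcomes when adding an extracted link to the queue.
enum QueueState {
  Added,
  LimitHit,
  DupeUrl,
}

// Check duplicate and max-URL limit together before queueing, mirroring the
// atomic addqueue check (here with a plain in-memory Set for illustration).
function addToQueue(seen: Set<string>, url: string, maxUrls: number): QueueState {
  if (seen.has(url)) return QueueState.DupeUrl;
  if (seen.size >= maxUrls) return QueueState.LimitHit;
  seen.add(url);
  return QueueState.Added;
}
```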
benoit74
947d15725b
Enhance file stats test to detect file modification (#382)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-15 12:34:56 -04:00
benoit74
d72443ced3
Add option to output stats file live, i.e. after each page crawled (#374)
* Add option to output stats file live, i.e. after each page crawled

* Always output stat files after each page crawled (+ test)

* Fix inversion between expected and test value
2023-09-14 15:16:19 -07:00
Anish Lakhwara
1c486ea1f3
Capture Favicon (#362)
- get favicon from CDP debug page, if available, log warning if not
- store in favIconUrl in pages.jsonl
- test: add test for favIcon and additional multi-page crawls
2023-09-10 11:29:35 -07:00
Ilya Kreymer
5ba6c33bff
args parsing: fix parseRx() for inclusions/exclusions to deal with non-string types (fixes #352) (#353)
treat non-regexes as strings and pass to RegExp constructor
tests: add additional scope parsing tests for different types passed in as exclusions
update yargs
bump to 0.10.4
2023-08-13 15:08:36 -07:00
Amani
442f4486d3
feat: Add custom behavior injection (#285)
* support loading custom behaviors from a specified directory via --customBehaviors
* call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1)
* tests: add tests for multiple custom behaviors

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-06 13:09:48 -07:00
Tessa Walsh
254da95a44
Fix disk utilization computation errors (#338)
* Check size of /crawls by default to fix disk utilization check

* Refactor calculating percentage used and add unit tests

* add tests using df output for with disk usage above and below
threshold

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-05 21:58:28 -07:00
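The percentage-used computation tested against `df` output in that commit can be sketched as follows. The parsing is illustrative (assumes a single-filesystem `df`-style table), not the crawler's actual code:

```typescript
// Compute percent disk used from df-style output:
// Filesystem  1B-blocks  Used  Available  Use%  Mounted on
function percentUsedFromDf(dfOutput: string): number {
  const lines = dfOutput.trim().split("\n");
  const cols = lines[lines.length - 1].split(/\s+/);
  const used = Number(cols[2]);
  const avail = Number(cols[3]);
  return Math.round((used / (used + avail)) * 100);
}
```

A threshold check (e.g. refuse to crawl above 90% used) would then compare this value against the configured limit.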
Ilya Kreymer
392c8bba0f
allow adding --include with pre-existing --scopeType values (besides custom) (fixes #318) (#319)
remove warning when --scopeType and --include used together
tests: update tests to reflect new semantics of --include + --scopeType
2023-05-23 09:43:11 -07:00
Ilya Kreymer
71b618fe94
Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298) 
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-26 15:41:35 -07:00
Tessa Walsh
b303af02ef
Add --title and --description CLI args to write metadata into datapackage.json (#276)
Multi-word values including spaces must be enclosed in double quotes.

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-04 10:46:03 -04:00
Tessa Walsh
62fe4b4a99
Add options to filter logs by --logLevel and --context (#271)
* Add .DS_Store to gitignore

* Add --logLevel and --context filtering options

* Add log filtering test
2023-04-01 10:07:59 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- run workers directly via WorkerPool - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
Tessa Walsh
0cf6219d80
Fix --overwrite CLI flag (#220)
* Delete collection if --overwrite before wb-manager init

* Add tests
2023-02-02 21:02:47 -08:00
Tessa Walsh
c0b0d5b87f
Serialize Redis pending pages as JSON objects (#212)
* Add redis:// prefix to test --redisStoreUrl

* Serialize pending pages as JSON objects
2023-01-23 16:44:03 -08:00
Tessa Walsh
1a066dbd7b
Add RedisCrawlState test (#208) 2023-01-23 10:16:22 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00