Generate a `urn:pageinfo:<page url>` record for each page, containing a list
of its resources and their status codes, to aid in future diffing/comparison.
- Adds POST / non-GET request canonicalization from warcio
- Adds `writeSingleRecord` to WARCWriter
Fixes #457
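A minimal sketch of how such a pageinfo record can be built with warcio.js (the payload shape and the surrounding `writeSingleRecord` call are assumptions based on the description above, not the crawler's exact code):

```ts
import { WARCRecord, WARCSerializer } from "warcio";

type PageResource = { url: string; status: number; mime?: string };

async function createPageInfoRecord(pageUrl: string, resources: PageResource[]) {
  const body = new TextEncoder().encode(JSON.stringify({ url: pageUrl, urls: resources }));

  // payload is provided as an async iterable of chunks
  async function* content() {
    yield body;
  }

  const record = await WARCRecord.create(
    {
      url: `urn:pageinfo:${pageUrl}`,
      type: "resource",
      warcHeaders: { "Content-Type": "application/json" },
    },
    content(),
  );

  // in the crawler, this record would then be handed to the new
  // WARCWriter.writeSingleRecord() rather than serialized directly
  return await WARCSerializer.serialize(record, { gzip: true });
}
```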
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if it fails to compile, log the exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
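A rough sketch of the first-page compile check (how the behavior script text and page handle are obtained is assumed; only the evaluate-then-fatal-exit pattern comes from the description above):

```ts
import { Page } from "puppeteer-core";

async function checkBehaviorCompiles(page: Page, behaviorScript: string, filename: string) {
  try {
    // evaluating the behavior text once: a syntax error in the custom
    // behavior will throw here, before any page is actually crawled with it
    await page.evaluate(behaviorScript);
  } catch (e) {
    // the crawler logs this via logger.fatal(), which exits with
    // the fatal exit code (17)
    console.error("Custom behavior failed to compile", { filename, error: e });
    process.exit(17);
  }
}
```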
Support for rollover size and custom WARC prefix templates:
- re-enable --rolloverSize (default: 1GB) to control when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with the current timestamp
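A sketch of the template expansion (function shape illustrative; only the template string and the `$ts` substitution come from the description above):

```ts
function newWarcFilename(prefix: string, crawlId: string, workerid: number, gzip: boolean): string {
  const template = `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
  // "$ts" stays literal in the template and is replaced when the file is opened
  const ts = new Date().toISOString().replace(/[^\d]/g, "");
  return template.replace("$ts", ts);
}

// e.g. newWarcFilename("mycrawl", "id1", 0, true)
//   -> "mycrawl-id1-20231104120000000-0.warc.gz"
```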
Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k (sketched after this list)
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding the `WARC-Truncated` header: it must be set after the stream is
finished to determine whether it has been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
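The chunked reading mentioned above, sketched with the underlying CDP calls (Fetch.takeResponseBodyAsStream + IO.read); the 64k default is from this change, everything else is illustrative:

```ts
import { CDPSession } from "puppeteer-core";

const TAKE_STREAM_BUFF_SIZE = 64 * 1024;

async function* takeStreamIter(cdp: CDPSession, requestId: string) {
  const { stream } = await cdp.send("Fetch.takeResponseBodyAsStream", { requestId });

  while (true) {
    // passing an explicit size keeps chunks at a fixed, bounded size,
    // which matters for long / never-ending responses such as live streams
    const { data, base64Encoded, eof } = await cdp.send("IO.read", {
      handle: stream,
      size: TAKE_STREAM_BUFF_SIZE,
    });

    yield base64Encoded ? Buffer.from(data, "base64") : Buffer.from(data);

    if (eof) {
      break;
    }
  }
}
```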
Ensure the final pending wait also has a timeout, set to max page
timeout x num workers.
Could also be set higher, but there needs to be some timeout, e.g. in case of
downloading a live stream that never terminates.
Fixes #348 in the 0.12.x line.
Also bumps version to 0.12.3
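A minimal sketch of the bounded final wait (names and structure assumed; only the max page timeout x num workers cap is from the description):

```ts
async function awaitPendingWithTimeout(pending: Promise<void>[], maxPageTimeoutSecs: number, numWorkers: number) {
  const timeoutSecs = maxPageTimeoutSecs * numWorkers;

  await Promise.race([
    Promise.allSettled(pending),
    // give up after the cap so a never-terminating download can't hang shutdown
    new Promise<void>((resolve) => setTimeout(resolve, timeoutSecs * 1000)),
  ]);
}
```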
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
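A sketch of what this looks like in TypeScript (the actual list of contexts in the crawler is longer; abbreviated here):

```ts
export const LOG_CONTEXT_TYPES = ["general", "worker", "recorder", "state", "behavior"] as const;

export type LogContext = (typeof LOG_CONTEXT_TYPES)[number];

// formatErr: convert an unknown caught value (most likely an Error)
// into a plain dict suitable for structured logging
export function formatErr(e: unknown): Record<string, unknown> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack || "" };
  }
  return { message: String(e) };
}
```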
Due to an optimization, the numPending() call assumed that queueSize() would
be called to update the cached queue size. However, in the current worker
code, this is not the case. Remove caching of the queue size and just check
the queue size in numPending(), to ensure the pending list is always processed.
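A rough sketch of the change (Redis key names and the surrounding logic are illustrative, not the crawler's actual state code):

```ts
import { Redis } from "ioredis";

class RedisCrawlState {
  constructor(private redis: Redis, private key: string) {}

  // previously this cached the result and numPending() read the cache,
  // which stayed stale if the worker loop never called queueSize()
  async queueSize(): Promise<number> {
    return await this.redis.zcard(`${this.key}:q`);
  }

  async numPending(): Promise<number> {
    const pending = await this.redis.hlen(`${this.key}:p`);
    // check the live queue size here so the pending list is always processed
    if (pending > 0 && (await this.queueSize()) === 0) {
      // ... handle / requeue stale pending entries (elided) ...
    }
    return pending;
  }
}
```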
When calling directFetchCapture and aborting the response via an
exception, throw `new Error("response-filtered-out")` so that it can be
ignored. This exception is only used for direct capture and should not be
logged as an error: it is rethrown and handled in the calling function to
indicate that direct fetch is skipped.
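A sketch of the pattern (function bodies and the HTML check are illustrative; the sentinel message is the one quoted above):

```ts
async function directFetchCapture(url: string): Promise<void> {
  const resp = await fetch(url);

  if (resp.headers.get("content-type")?.startsWith("text/html")) {
    // abort: this response should be loaded via the browser instead
    throw new Error("response-filtered-out");
  }

  // ... stream resp.body into the WARC writer (elided) ...
}

async function tryDirectFetch(url: string): Promise<boolean> {
  try {
    await directFetchCapture(url);
    return true;
  } catch (e) {
    if (e instanceof Error && e.message === "response-filtered-out") {
      // not an error: direct fetch is simply skipped for this URL
      return false;
    }
    throw e;
  }
}
```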
Previously, responses >2MB were streamed to disk and an empty response returned to the browser,
to avoid holding large responses in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content being loaded.
This PR sets different limits:
- HTML, as well as JS necessary for the page to load: 25MB
- All other content: 5MB
Also includes some more type fixing
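A sketch of how the two limits might be expressed (constant names and the resource-type grouping are assumptions):

```ts
const MAX_BROWSER_DEFAULT_FETCH_SIZE = 5_000_000;  // 5MB for most content
const MAX_BROWSER_TEXT_FETCH_SIZE = 25_000_000;    // 25MB for HTML / JS needed to load the page

function fetchSizeLimit(resourceType: string): number {
  return ["document", "script"].includes(resourceType)
    ? MAX_BROWSER_TEXT_FETCH_SIZE
    : MAX_BROWSER_DEFAULT_FETCH_SIZE;
}
```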
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, and dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome DevTools Protocol (CDP). This allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343
Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() (sketched after this list)
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream,
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
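A simplified sketch of the Fetch.requestPaused flow referenced above (the real Recorder also streams large bodies, applies rewriting rules, matches service-worker frames, and writes the WARC records, all elided here):

```ts
import { CDPSession, Protocol } from "puppeteer-core";

async function startCapture(cdp: CDPSession) {
  await cdp.send("Fetch.enable", {
    patterns: [{ urlPattern: "*", requestStage: "Response" }],
  });

  cdp.on("Fetch.requestPaused", async (params: Protocol.Fetch.RequestPausedEvent) => {
    const { requestId, responseStatusCode, responseHeaders } = params;

    // pull the response body from the browser
    const { body, base64Encoded } = await cdp.send("Fetch.getResponseBody", { requestId });
    const payload = base64Encoded ? Buffer.from(body, "base64") : Buffer.from(body);

    // ... write request/response WARC records via warcio.js and optionally
    // apply wabac.js rewriting rules to the payload before fulfilling (elided) ...

    // resume the page with the (possibly rewritten) body
    await cdp.send("Fetch.fulfillRequest", {
      requestId,
      responseCode: responseStatusCode || 200,
      responseHeaders,
      body: payload.toString("base64"),
    });
  });
}
```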
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove excluded URLs
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing it for every excluded URL
- tests: update saved-state test to be more resilient to delays
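A sketch of the zscan-based filtering (the Redis key name and the glob heuristic shown here are illustrative):

```ts
import { Redis } from "ioredis";

async function filterQueue(redis: Redis, qkey: string, exclusion: string | RegExp) {
  // non-regex exclusions (plain strings with no special chars) can use a
  // server-side MATCH glob; regexes fall back to scanning everything
  const isPlain = typeof exclusion === "string" && !/[\\^$.|?*+()[\]{}]/.test(exclusion);
  const match = isPlain ? `*${exclusion}*` : "*";

  let cursor = "0";
  do {
    // ZSCAN with the default COUNT, no ordered scan needed
    const [next, members] = await redis.zscan(qkey, cursor, "MATCH", match);
    cursor = next;

    // zscan returns [member, score, member, score, ...]
    for (let i = 0; i < members.length; i += 2) {
      const data = members[i];
      const excluded =
        typeof exclusion === "string" ? data.includes(exclusion) : exclusion.test(data);
      if (excluded) {
        await redis.zrem(qkey, data);
      }
    }
  } while (cursor !== "0");
}
```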
args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334
bump to 0.12.1
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple-value arguments specified as either one string or
multiple strings, using an array type + coerce function.
This also allows the `choices` option to be used to validate the values,
when needed.
With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!
Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.
Also updated README with new text extraction options and bumped version
to 0.12.0
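A sketch of the array + coerce setup for `--text`, following the pattern from the linked yargs comment (option wiring approximate):

```ts
import yargs from "yargs";

const textChoices = ["to-pages", "to-warc", "final-to-warc"];

// split comma-separated values so one string or repeated flags both work
const coerceList = (values: string[]) =>
  values.flatMap((v) => v.split(",")).filter((v) => v.length);

const argv = yargs(process.argv.slice(2))
  .option("text", {
    describe: "Extract text: one or more of " + textChoices.join(", "),
    type: "array",
    coerce: coerceList,
    choices: textChoices,
  })
  .parseSync();

// argv.text is the same for all three invocations shown above
```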
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options, e.g. `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` as equivalent to
`--text to-pages`.
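A sketch of the text-to-WARC flow (the writer interface stands in for the refactored WARC resource writer; only the `urn:text:` / `urn:textFinal:` URLs and the "only if different" rule come from the description above):

```ts
interface ResourceWriter {
  writeResourceRecord(url: string, contentType: string, payload: Uint8Array): Promise<void>;
}

async function writeTextToWarc(writer: ResourceWriter, pageUrl: string, text: string, finalText?: string) {
  const enc = new TextEncoder();

  // text extracted right after the page loads
  await writer.writeResourceRecord(`urn:text:${pageUrl}`, "text/plain", enc.encode(text));

  // final text (e.g. after behaviors run), written only if it differs
  if (finalText && finalText !== text) {
    await writer.writeResourceRecord(`urn:textFinal:${pageUrl}`, "text/plain", enc.encode(finalText));
  }
}
```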
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
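A small sketch of the backwards-compatible read (field names assumed):

```ts
interface SaveState {
  // new-style states store a count; old-style states stored an array of finished pages
  done: number | unknown[];
}

function numDone(state: SaveState): number {
  return Array.isArray(state.done) ? state.done.length : state.done;
}
```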
Part of work for webrecorder/browsertrix-cloud#1216:
- support adding/removing exclusions dynamically via a Redis message
list
- add processMessage() which checks <uid>:msg list for any messages
- handle addExclusion / removeExclusion messages to add / remove
exclusions for each seed
- also add filterQueue(), which filters the queue asynchronously, one URL
at a time, when a new exclusion is added
- don't set start / end time in redis
- rename setEndTimeAndExit to setStatusAndExit
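A sketch of processMessage() (the `<uid>:msg` list and the addExclusion / removeExclusion message types are from the description; the message shape and seed interface are assumed):

```ts
import { Redis } from "ioredis";

type ExclusionMessage =
  | { type: "addExclusion"; regex: string }
  | { type: "removeExclusion"; regex: string };

interface Seed {
  addExclusion(regex: string): void;
  removeExclusion(regex: string): void;
}

async function processMessage(redis: Redis, uid: string, seeds: Seed[]) {
  // drain any pending messages from the <uid>:msg list
  while (true) {
    const raw = await redis.lpop(`${uid}:msg`);
    if (!raw) {
      break;
    }

    const msg = JSON.parse(raw) as ExclusionMessage;

    for (const seed of seeds) {
      if (msg.type === "addExclusion") {
        seed.addExclusion(msg.regex);
        // a newly added exclusion also triggers an async filterQueue() pass
      } else {
        seed.removeExclusion(msg.regex);
      }
    }
  }
}
```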
add 'fast cancel' option:
- add isCrawlCanceled() to state, which checks redis canceled key
- on interrupt, if canceled, immediately exit with status 0
- on fatal, exit with code 0 if restartsOnError is set
- no longer keeping track of start/end time in crawler itself
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
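A sketch of the fast-cancel check (Redis key and exit flow assumed; only the "if canceled, immediately exit with status 0" behavior is from the description):

```ts
import { Redis } from "ioredis";

async function isCrawlCanceled(redis: Redis, crawlId: string): Promise<boolean> {
  return (await redis.get(`${crawlId}:canceled`)) === "1";
}

async function onInterrupt(redis: Redis, crawlId: string) {
  if (await isCrawlCanceled(redis, crawlId)) {
    // fast cancel: skip post-crawl work and exit cleanly right away
    process.exit(0);
  }
  // otherwise: normal graceful interrupt (finish WARCs, save state, ...)
}
```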
* Store crawler start and end times in Redis lists
* end time tweaks:
- set end time for logger.fatal()
- set missing start time in setEndTime()
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Allow for some seeds to be invalid unless failOnFailedSeed is set
Fail crawl if no valid seeds are provided
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
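A sketch of the intended seed handling (validation and error text are illustrative):

```ts
function filterValidSeeds(seedUrls: string[], failOnFailedSeed: boolean): string[] {
  const valid: string[] = [];

  for (const url of seedUrls) {
    try {
      new URL(url);
      valid.push(url);
    } catch (e) {
      if (failOnFailedSeed) {
        throw new Error(`Invalid seed: ${url}`);
      }
      // otherwise just skip the invalid seed and keep going
    }
  }

  if (!valid.length) {
    throw new Error("No valid seeds provided, failing crawl");
  }

  return valid;
}
```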
- run behaviors: check if behaviors object exists before trying to run behaviors to avoid failure message
- skip behaviors if frame no longer attached / has empty URL
* error handling fixes:
- listen to correct event for page crashes, 'error' instead of 'crash', may fix #371, #351
- more removal of duplicate logging for status-related errors, e.g. if page crashed, don't log worker exception
- detect browser 'disconnected' event, interrupt crawl (but allow post-crawl tasks, such as waiting for pending requests, to run), set browser to null to avoid trying to use it again.
worker:
- bump new page timeout to 20
- if loading page from new domain, always use new page
logger:
- log timestamp first for better sorting
* optimize link extraction (fixes #376):
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url (see the sketch after this list)
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also to run() behaviors
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
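The QueueState / per-page addLink wiring referenced above, as a rough sketch (the atomic Redis call is stubbed out; in the crawler it is a single call that checks the max-URL limit and the seen set before adding):

```ts
import { Page } from "puppeteer-core";

enum QueueState {
  ADDED = 0,
  LIMIT_HIT = 1,
  DUPE_URL = 2,
}

async function addToQueueAtomic(url: string): Promise<QueueState> {
  // stub: one Redis call that atomically checks limit + seen set, then adds
  return QueueState.ADDED;
}

async function setupLinkExtraction(page: Page) {
  // one exposed function per page: the in-browser side dedups URLs and calls
  // addLink one URL at a time instead of returning the whole list at once
  await page.exposeFunction("addLink", async (url: string) => {
    switch (await addToQueueAtomic(url)) {
      case QueueState.ADDED:
        return;
      case QueueState.LIMIT_HIT:
        // page limit reached: stop queueing further URLs
        return;
      case QueueState.DUPE_URL:
        // already seen, ignore
        return;
    }
  });
}
```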
* Add option to output stats file live, i.e. after each page crawled
* Always output stat files after each page crawled (+ test)
* Fix inversion between expected and test value
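A sketch of the per-page stats output (field names assumed; the stats file path comes from the crawler's stats file option):

```ts
import fs from "node:fs/promises";

interface CrawlStats {
  crawled: number;
  total: number;
  pending: number;
  failed: number;
}

async function writeStats(statsFilename: string, stats: CrawlStats) {
  // rewritten after every crawled page so external tools can poll progress
  await fs.writeFile(statsFilename, JSON.stringify(stats, null, 2));
}
```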