Commit graph

505 commits

Author SHA1 Message Date
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Waiting for all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00
Ilya Kreymer
15661eb9c8
More flexible multi value arg parsing + README update for 0.12.0 (#422)
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple value arguments specified as either one string or
multiple strings using array type + coerce function.

This allows for `choice` option to also be used to validate the options,
when needed.

With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!

Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.

Also updated README with new text extraction options and bumped version
to 0.12.0

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-02 11:47:37 -07:00
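The multi-value parsing described in 15661eb9c8 can be illustrated with a minimal sketch of the coerce step alone. The commit combines the yargs array type with a coerce function; the helper name `coerceMultiValue` and the `TEXT_CHOICES` constant below are hypothetical and not taken from the crawler's source.

```typescript
// Hypothetical coerce helper: accepts one comma-separated string or an array
// of such strings, and returns a flat, de-duplicated list of option names.
function coerceMultiValue(
  raw: string | string[],
  choices: readonly string[],
): string[] {
  const values = Array.isArray(raw) ? raw : [raw];
  const result: string[] = [];
  for (const value of values) {
    for (const part of value.split(",")) {
      const name = part.trim();
      if (name === "") {
        continue;
      }
      // Validate each individual name against the allowed choices.
      if (choices.indexOf(name) === -1) {
        throw new Error(`Invalid value: ${name}`);
      }
      if (result.indexOf(name) === -1) {
        result.push(name);
      }
    }
  }
  return result;
}

// Assumed list of choices, based on the --text values named in the message.
const TEXT_CHOICES: readonly string[] = ["to-pages", "to-warc", "final-to-warc"];
```

With this sketch, `"to-pages,to-warc,final-to-warc"`, `["to-pages,to-warc", "final-to-warc"]` and `["to-pages", "to-warc", "final-to-warc"]` all coerce to the same three-element list, mirroring the three equivalent `--text` invocations listed in the commit message.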
Ilya Kreymer
2aeda56d40
improved text extraction: (addresses #403) (#404)
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` to be equivalent to
`--text to-pages`.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-31 23:05:30 -07:00
Ilya Kreymer
064db52272
base image: bump brave to 1.59.120
version: bump to 0.12.0-beta.2
2023-10-26 19:48:49 -07:00
benoit74
bc730a0d37
Return User-Agent on all code path to set headers appropriately (#420)
Fixes #419
2023-10-25 12:32:10 -04:00
Ilya Kreymer
ffc1d3ffa4
quickfix: storage webhook, keep path and bytes!
2023-10-23 18:35:03 -07:00
Ilya Kreymer
8c92901889
load saved state fixes + redis tests (#415)
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
2023-10-23 09:36:10 -07:00
Ilya Kreymer
45139dba0b
Support adding/removing exclusions without restarting the crawler (#408)
Part of work for webrecorder/browsertrix-cloud#1216:
- support adding/removing exclusions dynamically via a Redis message
list
- add processMessage() which checks <uid>:msg list for any messages
- handle addExclusion / removeExclusion messages to add / remove
exclusions for each seed
- also add filterQueue() which filters queue, one URL at a time, async
when a new exclusion is added
2023-10-21 19:11:31 -07:00
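The message handling described in 45139dba0b can be sketched with an in-memory stand-in for the Redis list. This is an illustration under stated assumptions: the JSON message shape (`type` plus `regex`), the `ExclusionSet` class and the synchronous queue filtering are all hypothetical, whereas the commit describes an async, one-URL-at-a-time filter over a Redis-backed queue.

```typescript
// Assumed message shape; the real wire format is not given in the commit.
type CrawlMessage =
  | { type: "addExclusion"; regex: string }
  | { type: "removeExclusion"; regex: string };

// Hypothetical holder for one seed's exclusions, keyed by the original
// pattern string so that removal does not depend on RegExp.source escaping.
class ExclusionSet {
  private byPattern = new Map<string, RegExp>();

  add(pattern: string): void {
    this.byPattern.set(pattern, new RegExp(pattern));
  }

  remove(pattern: string): void {
    this.byPattern.delete(pattern);
  }

  isExcluded(url: string): boolean {
    return Array.from(this.byPattern.values()).some((rx) => rx.test(url));
  }
}

// Drain pending messages, update the exclusions, and drop queued URLs that
// match a newly added exclusion. Returns the filtered queue.
function processMessages(
  rawMessages: string[],
  exclusions: ExclusionSet,
  queue: string[],
): string[] {
  let remaining = queue;
  for (const raw of rawMessages) {
    const msg = JSON.parse(raw) as CrawlMessage;
    if (msg.type === "addExclusion") {
      exclusions.add(msg.regex);
      remaining = remaining.filter((url) => !exclusions.isExcluded(url));
    } else if (msg.type === "removeExclusion") {
      exclusions.remove(msg.regex);
    }
  }
  return remaining;
}
```

Removing an exclusion only affects URLs discovered afterwards in this sketch; URLs already filtered out of the queue are not restored.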
Ilya Kreymer
3a83695524
storage: also compute crc32 as part of storage webhook when uploading… (#414)
… a WACZ file

fixes #412
2023-10-20 16:29:07 -07:00
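As an illustration of the checksum mentioned in 3a83695524, a table-driven CRC-32 (the common reflected polynomial 0xEDB88320) can be computed incrementally over chunks, which suits a file being streamed to storage. This is a generic sketch; the function name `crc32Update` is hypothetical and the commit does not say which implementation the crawler uses.

```typescript
// Precompute the 256-entry lookup table for the reflected CRC-32 polynomial.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) !== 0 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// Fold one chunk into a running CRC. Start with crc = 0 and pass the
// previous return value for each subsequent chunk.
function crc32Update(crc: number, chunk: Uint8Array): number {
  let c = crc ^ 0xffffffff;
  for (let i = 0; i < chunk.length; i++) {
    c = CRC_TABLE[(c ^ chunk[i]) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}
```

For the ASCII bytes of "123456789" this yields 0xcbf43926, the standard CRC-32 check value, whether the bytes are passed in one call or split across several.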
Ilya Kreymer
f6d5a019b1
disable component updates by setting --component-updater to invalid URL (#413)
Currently, Brave will attempt an automatic update of components on
launch. This should prevent that.
2023-10-20 16:28:22 -07:00
Ilya Kreymer
9ae297c000
version: bump to 0.12.0-beta.1
2023-10-09 14:03:31 -07:00
Ilya Kreymer
1a273abc20
remove tracking execution time here (handled in browsertrix cloud app instead) (#406)
- don't set start / end time in redis
- rename setEndTimeAndExit to setStatusAndExit

add 'fast cancel' option:
- add isCrawlCanceled() to state, which checks redis canceled key
- on interrupt, if canceled, immediately exit with status 0
- on fatal, exit with code 0 if restartsOnError is set
- no longer keeping track of start/end time in crawler itself
2023-10-09 12:28:58 -07:00
Ilya Kreymer
14c8221d46
tests: disable ad-block tests: seeing inconsistent ci behavior, though tests pass on local brave (#407)
2023-10-09 09:41:50 -07:00
Ilya Kreymer
8533f6ccf9
additional failure logic: (#402)
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
2023-10-03 20:21:30 -07:00
Tessa Walsh
a23f840318
Store crawler start and end times in Redis lists (#397)
* Store crawler start and end times in Redis lists

* end time tweaks:
- set end time for logger.fatal()
- set missing start time into setEndTime()

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-10-02 17:55:52 -07:00
Ilya Kreymer
f453dbfb56
Switch to Brave Base Image (#400)
* switch to brave:
- switch base browser to brave base image 1.58.135
- tests: add extra delay for blocking tests
- bump to 0.12.0-beta.0
2023-10-02 14:30:44 -07:00
Ilya Kreymer
4c7ebf18d4
version: bump to 0.11.2
2023-09-29 11:18:22 -07:00
Tessa Walsh
7e03dc076f
Set new logic for invalid seeds (#395)
Allow for some seeds to be invalid unless failOnFailedSeed is set

Fail crawl if no valid seeds are provided

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-29 13:02:52 -04:00
gitreich
18dce9534e
Update README.md (#390)
added missing quotes in command to extend an existing profile
2023-09-29 09:23:05 -07:00
Ilya Kreymer
52817c776e
add more timeouts to operations that happen outside of page processing time: (#396)
- await page.close() if not finished within 20s
- await crawler.pageFinished() if not finished within 60s (in case config is being written)
2023-09-27 15:46:36 -07:00
Ilya Kreymer
165a9787af
logging and behaviors improvements (#389)
- run behaviors: check if behaviors object exists before trying to run behaviors to avoid failure message
- skip behaviors if frame no longer attached / has empty URL
2023-09-20 15:02:37 -04:00
Ilya Kreymer
c6cbbc1a17
Update CI Release Action (#386)
* update to latest actions, use docker meta action with semver tags
2023-09-18 22:43:47 -05:00
Ilya Kreymer
c4287c7ed9
Error handling fixes to avoid crawler getting stuck. (#385)
* error handling fixes:
- listen to correct event for page crashes, 'error' instead of 'crash', may fix #371, #351
- more removal of duplicate logging for status-related errors, eg. if page crashed, don't log worker exception
- detect browser 'disconnected' event, interrupt crawl (but allow post-crawl tasks, such as waiting for pending requests to run), set browser to null to avoid trying to use again.

worker
- bump new page timeout to 20
- if loading page from new domain, always use new page

logger:
- log timestamp first for better sorting
2023-09-18 15:24:33 -07:00
Ilya Kreymer
0c88eb78af
favicon: use 127.0.0.1 instead of localhost (#384)
catch exception in fetch
bump to 0.11.1
2023-09-17 12:50:39 -07:00
Ilya Kreymer
debfe8945f
README: add --restartOnError cli opt
2023-09-15 11:22:52 -07:00
Ilya Kreymer
e5b0c4ec1b
optimize link extraction: (fixes #376) (#380)
* optimize link extraction: (fixes #376)
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
2023-09-15 10:12:08 -07:00
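The combined limit and duplicate check from e5b0c4ec1b can be sketched in memory as a single call that returns one of three states. This is an assumption-laden illustration: the class name `CrawlQueue` is hypothetical, the state is modelled as a const object rather than the enum the commit mentions, the limit is checked before the duplicate test by assumption, and the real crawler performs the check atomically in one Redis call rather than on a local `Set`.

```typescript
// Possible outcomes of trying to queue a URL, as named in the commit message.
const QueueState = {
  ADDED: 0,
  LIMIT_HIT: 1,
  DUPE_URL: 2,
} as const;
type QueueState = (typeof QueueState)[keyof typeof QueueState];

// Hypothetical in-memory stand-in for the Redis-backed queue.
class CrawlQueue {
  private seen = new Set<string>();
  private pending: string[] = [];
  private limit: number;

  // A limit of 0 means "no page limit" in this sketch.
  constructor(limit: number) {
    this.limit = limit;
  }

  // Check the page limit and the seen set, then add, in one step.
  add(url: string): QueueState {
    if (this.limit > 0 && this.seen.size >= this.limit) {
      return QueueState.LIMIT_HIT;
    }
    if (this.seen.has(url)) {
      return QueueState.DUPE_URL;
    }
    this.seen.add(url);
    this.pending.push(url);
    return QueueState.ADDED;
  }

  size(): number {
    return this.pending.length;
  }
}
```

Returning a state instead of a boolean lets the caller log why a link was rejected and stop extracting links once the limit is hit.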
benoit74
947d15725b
Enhance file stats test to detect file modification (#382)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-15 12:34:56 -04:00
Vinzenz Sinapius
7b6bb681c7
Update tldextract cache for pywb in build process (#383)
2023-09-15 12:22:17 -04:00
Ilya Kreymer
3c9be514d3
behavior logging tweaks, add netIdle (#381)
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also to run() behaviors
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
2023-09-14 19:48:41 -07:00
benoit74
d72443ced3
Add option to output stats file live, i.e. after each page crawled (#374)
* Add option to output stats file live, i.e. after each page crawled

* Always output stat files after each page crawled (+ test)

* Fix inversion between expected and test value
2023-09-14 15:16:19 -07:00
Ilya Kreymer
afecec01bd
status: fix typo setting status to log message (#379)
status should be set to 'done'!
2023-09-13 22:54:55 -07:00
Ilya Kreymer
a3cfc55c38
various fixes regarding state restart: (#370)
* additional fixes:
- use distinct exit code for subsequent interrupt (13) and fatal interrupt (17)
- if crawl has been stopped, mark for final exit for post crawl tasks
- stopped takes precedence over interrupted: if both, still exit with 0 (and marked for final exit)
- if no warcs found, crawl stopped, but previous pages found, don't consider failed!
- cleanup: remove unused code, rename to gracefulFinishOnInterrupt, separate from graceful finish via crawl stopped
2023-09-13 10:48:21 -07:00
Anish Lakhwara
5bd4fedff9
Add example of mounting custom behaviours (#369)
* feat: add docker mount custom behavior to README

* Add link to behaviors tutorial

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-09-13 10:47:05 -07:00
Ilya Kreymer
6a73d292b4
bump to 0.11.0 for new features
2023-09-13 10:39:59 -07:00
Graham Hukill
1eeee2c215
Surface lastmod option for sitemap parser (#367)
* Surface lastmod option for sitemap parser
- Add --sitemapFromDate to use along with --useSitemap, which filters sitemap URLs to those
modified on or after the specified ISO date.

The library used to parse sitemaps for URLs added an optional
"lastmod" argument in v3.2.5 that allows filtering URLs returned
by a "last_modified" element present in sitemap XMLs.  This
surfaces that argument to the browsertrix-crawler CLI runtime
parameters.

This can be useful for orienting a crawl around a list of seeds
known to contain sitemaps, when you are only interested in including
URLs that have been modified on or after X date.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-13 10:20:41 -07:00
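The date filter described in 1eeee2c215 amounts to keeping sitemap entries whose last-modified date is on or after a cutoff. A minimal sketch, assuming a hypothetical `SitemapEntry` shape and helper name; dropping entries that carry no last-modified date is also an assumption, since the commit message does not say how those are treated.

```typescript
// Hypothetical representation of one <url> element from a sitemap.
interface SitemapEntry {
  loc: string;
  lastmod?: string;
}

// Return the URLs modified on or after fromDate (an ISO date string).
function filterByLastmod(entries: SitemapEntry[], fromDate: string): string[] {
  const cutoff = new Date(fromDate).getTime();
  if (isNaN(cutoff)) {
    throw new Error(`Invalid date: ${fromDate}`);
  }
  return entries
    .filter((entry) => {
      // Assumption: entries without a lastmod value are excluded.
      if (entry.lastmod === undefined) {
        return false;
      }
      return new Date(entry.lastmod).getTime() >= cutoff;
    })
    .map((entry) => entry.loc);
}
```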
Ilya Kreymer
f8508a85ab
logging fixes: (#377)
- avoid duplicate logging for same error, if logging more specific message and rethrowing exception,
set e.detail to "logged" and worker exception handler will not log same error again
- add option to log timeouts as warnings instead of errors
- remove unneeded async method in browser, get headers directly
- fix logging in screenshots to include page
2023-09-13 10:05:05 -07:00
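The duplicate-logging rule in f8508a85ab (log the specific message, set e.detail to "logged", rethrow, and let the worker handler skip it) can be sketched as follows. The function names and the in-memory `logLines` array are hypothetical stand-ins for the crawler's logger and worker code.

```typescript
// Error type carrying the optional marker described in the commit message.
interface DetailedError extends Error {
  detail?: string;
}

// Hypothetical log sink used only for this illustration.
const logLines: string[] = [];
function logError(message: string): void {
  logLines.push(message);
}

// Logs a specific message, marks the error as already logged, and rethrows.
function loadPage(url: string): void {
  try {
    throw new Error("navigation failed");
  } catch (e) {
    const err = e as DetailedError;
    logError(`Page load failed: ${url}`);
    err.detail = "logged";
    throw err;
  }
}

// Worker-level handler: only logs errors that were not logged closer to
// where they occurred.
function runWorker(url: string): void {
  try {
    loadPage(url);
  } catch (e) {
    const err = e as DetailedError;
    if (err.detail !== "logged") {
      logError(`Worker exception: ${err.message}`);
    }
  }
}
```

Calling `runWorker` once in this sketch leaves exactly one line in `logLines`, the more specific page-load message.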
Ilya Kreymer
283fa00299
logging: resolve confusion with 'crawl done' not being written to log… (#375)
* logging: resolve confusion with 'crawl done' not being written to log, because the log is itself stored in the WACZ: (fixes #365)
- keep log file open until end, even if it's being written to WACZ, close before exit
- add logging of 'crawling done' when crawling is done (writing to WACZ or not)
- add debug logging of 'end of log file' to indicate log file is being added to WACZ and nothing else will be added there in the WACZ.
2023-09-13 10:04:09 -07:00
Anish Lakhwara
1c486ea1f3
Capture Favicon (#362)
- get favicon from CDP debug page, if available, log warning if not
- store in favIconUrl in pages.jsonl
- test: add test for favIcon and additional multi-page crawls
2023-09-10 11:29:35 -07:00
Anish Lakhwara
d42010a598
feat: precommit (#363)
* add .husky/pre-commit
* run lint on precommit
2023-09-07 13:03:22 -07:00
Ilya Kreymer
b95c535821
misc exit features: (#366)
- if interrupted (via signal or due to limits) and not finished, return error code 11 to indicate interruption
- allow stopping single instances with hset '<crawlid>:stopone' uid (similar to status)
- deliberate stop via redis not considered interruption (exit 0)
2023-09-06 11:14:18 -04:00
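The exit behavior described in b95c535821 reduces to a small decision: a deliberate stop via Redis exits with 0, while an interruption that leaves the crawl unfinished exits with 11. A minimal sketch, with a hypothetical `CrawlEndState` shape and function name; only the two codes named in the commit message are used.

```typescript
// Hypothetical summary of how a crawl ended.
interface CrawlEndState {
  interrupted: boolean; // signal received or a limit was reached
  finished: boolean; // all queued pages were processed
  stoppedViaRedis: boolean; // deliberate stop requested through Redis
}

function exitCodeFor(state: CrawlEndState): number {
  // A deliberate stop is not considered an interruption.
  if (state.stoppedViaRedis) {
    return 0;
  }
  // Interrupted before finishing: signal this with exit code 11.
  if (state.interrupted && !state.finished) {
    return 11;
  }
  return 0;
}
```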
Ilya Kreymer
3c2f5f8934
link extraction optimization: for scopeType page, set depth == extraHops to avoid getting links (#364)
if we know no additional links will be used
2023-08-31 13:42:14 -07:00
Ilya Kreymer
cf404efa13
improve crawl stopped check with unified isCrawlRunning() check which checks both interrupted + redis-based state (#356)
- handle browser crash -- if getting new page fails after 5 tries, assume browser crashed and exit
- check if timedRun() returns a non-null value before expanding
- update timedRun() to rethrow any non-timeout exception, instead of just logging 'unknown exception', as it should be handled downstream.
2023-08-22 09:16:00 -07:00
Ilya Kreymer
212bff0a27
mark for upload-and-delete when crawl is interrupted for any limit: total size, total time, or disk limit (#354)
2023-08-15 11:34:39 -07:00
Ilya Kreymer
5ba6c33bff
args parsing: fix parseRx() for inclusions/exclusions to deal with non-string types (fixes #352) (#353)
treat non-regexes as strings and pass to RegExp constructor
tests: add additional scope parsing tests for different types passed in as exclusions
update yargs
bump to 0.10.4
2023-08-13 15:08:36 -07:00
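The fix described in 5ba6c33bff (treat non-regexes as strings and pass them to the RegExp constructor) can be sketched as below. The function body is an illustration and not the crawler's actual `parseRx()`; returning an empty list for missing values is an assumption.

```typescript
// Normalize an inclusion/exclusion value of any type into a list of RegExps.
function parseRx(value: unknown): RegExp[] {
  // Assumption: missing or empty values mean "no rules".
  if (value === null || value === undefined || value === "") {
    return [];
  }
  const items: unknown[] = Array.isArray(value) ? value : [value];
  return items.map((item) =>
    // Existing RegExps are kept; anything else (numbers, booleans, strings)
    // is converted to a string and compiled.
    item instanceof RegExp ? item : new RegExp(String(item)),
  );
}
```

A numeric exclusion such as `123`, which a YAML or CLI parser may hand over as a number rather than a string, then behaves like the pattern `"123"`.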
Ilya Kreymer
16751de147
version: bump to 0.10.3
2023-08-08 08:43:27 -07:00
Ilya Kreymer
6270571b34
seed parsing: return null if invalid url encountered in parseUrl to avoid subsequent exception! (#349)
adjust error labels to differentiate invalid pages vs seeds
fixes webrecorder/browsertrix-cloud#1037
2023-08-08 08:42:44 -07:00
Ilya Kreymer
69fc1819d1
sizeLimit fix: (#347)
- only delete local data if uploading and upload succeeded, not after every sizeLimit interruption
- fixes #344
2023-08-01 00:04:10 -07:00
Amani
442f4486d3
feat: Add custom behavior injection (#285)
* support loading custom behaviors from a specified directory via --customBehaviors
* call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1)
* tests: add tests for multiple custom behaviors

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-06 13:09:48 -07:00