Commit graph

71 commits

Author SHA1 Message Date
Ilya Kreymer
c42d3df889 remove different port, add qa_compare with different user 2024-03-22 09:17:21 -07:00
Ilya Kreymer
f136cdf18c attempt to change port on repeat call 2024-03-22 00:24:21 -07:00
Ilya Kreymer
10e92a4f7b tests: disable retryStrategy for redis, test for better termination behavior 2024-03-22 00:10:26 -07:00
Ilya Kreymer
b18148b715 tests: change ports for different tests that use redis to be unique 2024-03-20 12:14:49 -07:00
Ilya Kreymer
e4d8388ac8 Merge branch 'main' into qa-crawl-work 2024-03-19 11:25:58 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, filtering nested sitemap lists included
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`, don't add URLs past the page limit, interrupt
further parsing when at limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
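
The `fromDate` / `toDate` filtering listed above could look roughly like the following. This is a minimal sketch and not the crawler's actual code: the `SitemapEntry` type and the `inDateRange` helper are hypothetical names, and keeping entries that have no (or an unparseable) `lastmod` value is an assumption.

```typescript
// Hypothetical shape of one parsed sitemap entry (not taken from the crawler source).
interface SitemapEntry {
  loc: string;
  lastmod?: string; // ISO 8601 date string, if the sitemap provides one
}

// Returns true if the entry falls within [fromDate, toDate] (both bounds optional).
// Assumption: entries with a missing or unparseable lastmod are kept.
function inDateRange(entry: SitemapEntry, fromDate?: Date, toDate?: Date): boolean {
  if (!entry.lastmod) {
    return true;
  }
  const time = new Date(entry.lastmod).getTime();
  if (isNaN(time)) {
    return true;
  }
  if (fromDate && time < fromDate.getTime()) {
    return false;
  }
  if (toDate && time > toDate.getTime()) {
    return false;
  }
  return true;
}
```

The same check could be applied to nested sitemap entries in a sitemap index, which is how a date filter can skip whole child sitemaps.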

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
Ilya Kreymer
251e1b3005 Merge branch 'main' into qa-crawl-work
bump to 1.1.0-beta.1
2024-03-16 15:34:57 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.
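
The 'finished' list described above amounts to a set difference. A minimal sketch, assuming the three lists are plain arrays of URLs (`computeFinished` is a hypothetical name, not the crawler's API):

```typescript
// Computes the 'finished' list as: seen URLs minus failed and queued URLs.
// Assumption: any in-progress (pending) URLs are passed in as part of `queued`.
function computeFinished(seen: string[], failed: string[], queued: string[]): string[] {
  const excluded = new Set<string>(failed.concat(queued));
  return seen.filter((url) => !excluded.has(url));
}
```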

Fixes #491
2024-03-15 20:54:43 -04:00
Ilya Kreymer
3a9ffd826c tests: try different port for redis 2024-03-08 13:13:49 -08:00
Ilya Kreymer
0abfaac87d qa test: use redis://127.0.0.1:36379 for ci to match other redis usage 2024-03-08 13:02:00 -08:00
Ilya Kreymer
5a1b2a99bb tests: add qa comparison test:
- run crawl with 3 pages, text/screenshots enabled
- run qa crawl using resulting WACZ
- enable writing pages to redis
- verify comparison data is included in page data added to redis ':pages' key
while crawl is running
2024-03-08 12:47:36 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext
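
The interaction between `--logContext` and `--logExcludeContext` can be sketched as a single predicate. The `shouldLog` helper below is hypothetical, and two behaviors are assumptions rather than facts from the commit: an empty include list means "all contexts", and an exclude entry wins over an include entry.

```typescript
// Default exclude list, as named in the commit message.
const DEFAULT_EXCLUDE_CONTEXTS: string[] = ["screencast", "recorderNetwork", "jsErrors"];

// Hypothetical helper: decides whether a message with the given context is logged.
// Assumptions: empty `include` means all contexts; excludes take precedence.
function shouldLog(
  context: string,
  include: string[] = [],
  exclude: string[] = DEFAULT_EXCLUDE_CONTEXTS,
): boolean {
  if (exclude.indexOf(context) !== -1) {
    return false;
  }
  if (include.length > 0 && include.indexOf(context) === -1) {
    return false;
  }
  return true;
}
```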

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a512e92886
Include resource type + mime type in page resources list (#468)
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
2024-02-19 19:11:48 -08:00
Ilya Kreymer
e8f2073a7e
Update Browser Image (#466)
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
2024-02-17 22:40:12 -08:00
Ilya Kreymer
96f3c407b1
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still
included in the `url:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
include cached resources
2024-02-16 14:36:32 -08:00
Ilya Kreymer
703835a7dd
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
2023-12-13 15:14:53 -05:00
Ilya Kreymer
3323262852
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp
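
A minimal sketch of how such a template might be built and resolved. The `WarcNameOpts` type and both function names are hypothetical, and the timestamp format used for `$ts` (ISO time reduced to digits) is an assumption; the commit message does not specify it.

```typescript
// Hypothetical options object; field names mirror the template in the commit message.
interface WarcNameOpts {
  prefix: string;
  crawlId: string;
  workerid: number;
  gzip: boolean;
}

// Builds the filename template, leaving the literal "$ts" placeholder in place.
function warcFilenameTemplate(opts: WarcNameOpts): string {
  return `${opts.prefix}-${opts.crawlId}-$ts-${opts.workerid}.warc${opts.gzip ? ".gz" : ""}`;
}

// Replaces "$ts" at the time a new file is created.
// Assumption: the timestamp is the ISO 8601 time with all non-digits removed.
function resolveWarcFilename(template: string, now: Date): string {
  const ts = now.toISOString().replace(/[^\d]/g, "");
  return template.replace("$ts", ts);
}
```

Keeping the placeholder in the template until file creation means each rollover file gets its own timestamp without rebuilding the rest of the name.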

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding `WARC-Truncated` header, need to set after stream is
finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
2023-12-07 23:02:55 -08:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Waiting for all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00
Ilya Kreymer
2aeda56d40
improved text extraction: (addresses #403) (#404)
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` to be equivalent to
`--text to-pages`.
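
The `--text` handling above could be sketched as follows. `parseTextOption` is a hypothetical helper, not the crawler's argument parser; treating `false` as "no extraction" and throwing on unknown values are assumptions made for illustration.

```typescript
// The three option names given in the commit message.
const TEXT_OPTIONS: string[] = ["to-pages", "to-warc", "final-to-warc"];

// Hypothetical parser for the value passed to --text.
// A bare `--text` (true) or `--text true` maps to ["to-pages"].
// Assumption: false / "false" disables text extraction entirely.
function parseTextOption(value: string | boolean): string[] {
  if (value === true || value === "true") {
    return ["to-pages"];
  }
  if (value === false || value === "false") {
    return [];
  }
  const opts = value
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  for (const opt of opts) {
    if (TEXT_OPTIONS.indexOf(opt) === -1) {
      // Assumption: unknown values are rejected rather than ignored.
      throw new Error(`Unknown --text option: ${opt}`);
    }
  }
  return opts;
}
```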

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-31 23:05:30 -07:00
Ilya Kreymer
8c92901889
load saved state fixes + redis tests (#415)
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
2023-10-23 09:36:10 -07:00
Ilya Kreymer
14c8221d46
tests: disable ad-block tests: seeing inconsistent ci behavior, though tests pass on local brave (#407) 2023-10-09 09:41:50 -07:00
Ilya Kreymer
f453dbfb56
Switch to Brave Base Image (#400)
* switch to brave:
- switch base browser to brave base image 1.58.135
- tests: add extra delay for blocking tests
- bump to 0.12.0-beta.0
2023-10-02 14:30:44 -07:00
Tessa Walsh
7e03dc076f
Set new logic for invalid seeds (#395)
Allow for some seeds to be invalid unless failOnFailedSeed is set

Fail crawl if no valid seeds are provided

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-29 13:02:52 -04:00
Ilya Kreymer
e5b0c4ec1b
optimize link extraction: (fixes #376) (#380)
* optimize link extraction: (fixes #376)
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
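
The combined "limit hit / already seen / added" check can be illustrated with an in-memory stand-in. This is only a sketch: the real check runs as one Redis call, while the `InMemoryQueue` class, the enum member names, and the order of the two checks are assumptions.

```typescript
// Possible outcomes of adding a URL, following the states named in the commit message.
// The member names are assumptions.
enum QueueState {
  ADDED = 0,
  LIMIT_HIT = 1,
  DUPE_URL = 2,
}

// In-memory stand-in for the Redis-backed crawl state (illustration only).
class InMemoryQueue {
  private seen = new Set<string>();
  private queue: string[] = [];
  private limit: number;

  constructor(limit: number) {
    this.limit = limit; // 0 is treated as "no limit" here (an assumption)
  }

  // Checks the page limit and the seen set in one step, then enqueues.
  addToQueue(url: string): QueueState {
    if (this.limit > 0 && this.seen.size >= this.limit) {
      return QueueState.LIMIT_HIT;
    }
    if (this.seen.has(url)) {
      return QueueState.DUPE_URL;
    }
    this.seen.add(url);
    this.queue.push(url);
    return QueueState.ADDED;
  }
}
```

Returning a state instead of a boolean lets the caller stop extracting links as soon as the limit is reported, rather than continuing to offer URLs that will be rejected.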
2023-09-15 10:12:08 -07:00
benoit74
947d15725b
Enhance file stats test to detect file modification (#382)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-15 12:34:56 -04:00
benoit74
d72443ced3
Add option to output stats file live, i.e. after each page crawled (#374)
* Add option to output stats file live, i.e. after each page crawled

* Always output stat files after each page crawled (+ test)

* Fix inversion between expected and test value
2023-09-14 15:16:19 -07:00
Anish Lakhwara
1c486ea1f3
Capture Favicon (#362)
- get favicon from CDP debug page, if available, log warning if not
- store in favIconUrl in pages.jsonl
- test: add test for favIcon and additional multi-page crawls
2023-09-10 11:29:35 -07:00
Ilya Kreymer
5ba6c33bff
args parsing: fix parseRx() for inclusions/exclusions to deal with non-string types (fixes #352) (#353)
treat non-regexes as strings and pass to RegExp constructor
tests: add additional scope parsing tests for different types passed in as exclusions
update yargs
bump to 0.10.4
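
A minimal sketch of the idea behind the `parseRx()` fix: anything that is not already a RegExp is converted to a string and passed to the RegExp constructor. The body below is a hypothetical reconstruction, not the crawler's code, and returning an empty list for missing or empty input is an assumption.

```typescript
// Accepts a single value or an array of values (strings, numbers, RegExp objects, ...)
// and returns a list of RegExp objects.
// Assumption: null, undefined, and "" yield an empty list.
function parseRx(value: unknown): RegExp[] {
  if (value === null || value === undefined || value === "") {
    return [];
  }
  const items: unknown[] = Array.isArray(value) ? value : [value];
  return items.map((item) => (item instanceof RegExp ? item : new RegExp(String(item))));
}
```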
2023-08-13 15:08:36 -07:00
Amani
442f4486d3
feat: Add custom behavior injection (#285)
* support loading custom behaviors from a specified directory via --customBehaviors
* call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1)
* tests: add tests for multiple custom behaviors

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-06 13:09:48 -07:00
Tessa Walsh
254da95a44
Fix disk utilization computation errors (#338)
* Check size of /crawls by default to fix disk utilization check

* Refactor calculating percentage used and add unit tests

* add tests using df output with disk usage above and below
threshold
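
The percentage calculation and a test built on `df` output might look like the sketch below. Both function names are hypothetical, the sample numbers are made up for illustration, and the parser assumes the usual two-line `df <path>` layout with no spaces in the filesystem name.

```typescript
// Rounds used/total to a whole percentage.
function calculatePercentageUsed(used: number, total: number): number {
  return Math.round((used / total) * 100);
}

// Parses `df <path>` output: a header line followed by one data line with
// columns Filesystem, 1K-blocks, Used, Available, Use%, Mounted on.
// Assumption: the filesystem name contains no whitespace.
function parseDfOutput(output: string): { total: number; used: number } {
  const lines = output.trim().split("\n");
  const cols = lines[1].trim().split(/\s+/);
  return { total: Number(cols[1]), used: Number(cols[2]) };
}
```

A unit test can then feed a canned `df` string through `parseDfOutput` and compare the resulting percentage against the threshold, without touching the real disk.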

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-05 21:58:28 -07:00
Ilya Kreymer
392c8bba0f
allow adding --include with pre-existing --scopeType values (besides custom) (fixes #318) (#319)
remove warning when --scopeType and --include used together
tests: update tests to reflect new semantics of --include + --scopeType
2023-05-23 09:43:11 -07:00
Ilya Kreymer
71b618fe94
Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298) 
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-26 15:41:35 -07:00
Tessa Walsh
b303af02ef
Add --title and --description CLI args to write metadata into datapackage.json (#276)
Multi-word values including spaces must be enclosed in double quotes.

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-04 10:46:03 -04:00
Tessa Walsh
62fe4b4a99
Add options to filter logs by --logLevel and --context (#271)
* Add .DS_Store to gitignore

* Add --logLevel and --context filtering options

* Add log filtering test
2023-04-01 10:07:59 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
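
Using depth + extraHops as the sorted-set score means shallower pages are dequeued first. The sketch below imitates that ordering with a plain array instead of Redis; `QueueEntry`, `queueScore`, and this version of `nextFromQueue` are illustrative assumptions, and ties are broken by insertion order here, which may differ from how a Redis sorted set orders equal scores.

```typescript
// Hypothetical queue entry (illustration only).
interface QueueEntry {
  url: string;
  depth: number;
  extraHops: number;
}

// Score as described in the commit message: depth + extraHops.
function queueScore(entry: QueueEntry): number {
  return entry.depth + entry.extraHops;
}

// Removes and returns the entry with the lowest score, or undefined if empty.
function nextFromQueue(queue: QueueEntry[]): QueueEntry | undefined {
  if (queue.length === 0) {
    return undefined;
  }
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    if (queueScore(queue[i]) < queueScore(queue[best])) {
      best = i;
    }
  }
  return queue.splice(best, 1)[0];
}
```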

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, which each run in their own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
Tessa Walsh
0cf6219d80
Fix --overwrite CLI flag (#220)
* Delete collection if --overwrite before wb-manager init

* Add tests
2023-02-02 21:02:47 -08:00
Tessa Walsh
c0b0d5b87f
Serialize Redis pending pages as JSON objects (#212)
* Add redis:// prefix to test --redisStoreUrl

* Serialize pending pages as JSON objects
2023-01-23 16:44:03 -08:00
Tessa Walsh
1a066dbd7b
Add RedisCrawlState test (#208) 2023-01-23 10:16:22 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
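
A JSON Lines log entry with context, timestamp, and details fields could be produced as in the sketch below. The `formatLogLine` helper, the `logLevel` and `message` field names, and the "general" default context are assumptions; only the context, timestamp, and details fields are named in the commit message.

```typescript
// The five levels correspond to the Logger methods listed in the commit message.
type LogLevel = "info" | "error" | "warn" | "debug" | "fatal";

// Hypothetical formatter: returns one JSON object serialized on a single line.
function formatLogLine(
  level: LogLevel,
  message: string,
  context: string = "general",
  details: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    logLevel: level,
    context,
    message,
    details,
  });
}
```

Because every entry is a complete JSON object on its own line, the output can be filtered line by line with standard tools.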
2023-01-19 14:17:27 -05:00
Tessa Walsh
f35d495103
Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
Tessa Walsh
e02058f001 Add ad blocking via request interception (#173)
* ad blocking via request interception, extending block rules system, adding new AdBlockRules
* Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build
* Enabled via --blockAds and setting a custom message via --adBlockMessage
* new test to check for ad blocking
* Add test-crawls dir to .gitignore and .dockerignore
2022-11-15 18:30:27 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
ffa3174578
Fix for warcio.js (#178)
* dependency fix: set warcio to 1.5.1 until we update to esm support
bump test timeout
fixes #175
bump to 0.7.1
2022-10-24 08:20:01 +02:00
Ilya Kreymer
7ed5586bdb scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity 2022-03-22 17:43:13 -07:00
Ilya Kreymer
0c32d0f223
add 'scopeType: domain' to include all subdomains + http/https include (#117)
- add 'scopeType: domain' to include all subdomains of a given seed url, eg. given `https://example.com/path` as starting seed, will consider `https://*.example.com/` to be in scope.
- include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), eg. given https://example.com/, will also include http://example.com/
- fixes #116
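
One way to express the 'domain' scope as a regular expression is sketched below: any subdomain of the seed host, over either http or https. The `domainScopeRx` and `escapeRx` helpers are hypothetical, and the exact pattern is an assumption rather than the rule the crawler actually generates.

```typescript
// Escapes characters that have special meaning in a regular expression.
function escapeRx(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a scope rule matching the seed's host and all of its subdomains,
// for both http and https. The pattern itself is an illustrative assumption.
function domainScopeRx(seedUrl: string): RegExp {
  const host = new URL(seedUrl).hostname;
  return new RegExp("^https?://([^/]+\\.)?" + escapeRx(host) + "(?:[:/]|$)");
}
```

For the seed `https://example.com/path`, this rule accepts `https://sub.example.com/page` and `http://example.com/`, and rejects hosts such as `notexample.com` that merely end in the same characters.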
2022-03-06 14:46:14 -08:00