Commit graph

505 commits

Author SHA1 Message Date
Tessa Walsh
74831373fd Update README options 2023-07-06 15:21:30 -04:00
wvengen
de2b4512b6
Allow configuration of deduplication policy (#331) (#332) 2023-07-06 14:54:35 -04:00
Tessa Walsh
22dc2e8426
deps: bump browsertrix-behaviors to ^0.5.1 (#341) 2023-07-06 10:15:18 -07:00
Ilya Kreymer
5ce410c275
profiles: use newly provided puppeteer page.setBypassServiceWorker() (#340)
* profiles: use newly provided puppeteer page.setBypassServiceWorker() instead of cdp command
bump puppeteer core to 20.7.4
2023-07-06 10:09:32 -04:00
Tessa Walsh
254da95a44
Fix disk utilization computation errors (#338)
* Check size of /crawls by default to fix disk utilization check

* Refactor calculating percentage used and add unit tests

* Add tests using df output with disk usage above and below threshold

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-05 21:58:28 -07:00
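
The commit above checks usage of /crawls and computes a percentage. A minimal sketch of such a calculation over `df` output; the helper names (parseDfOutput, calculatePercentageUsed) are illustrative, not the crawler's actual functions:

```js
// Hypothetical sketch: derive percentage used from `df -k <path>` output.
import { execSync } from "child_process";

function parseDfOutput(dfText) {
  // second line of df output holds the numbers for the queried path
  const cols = dfText.trim().split("\n")[1].split(/\s+/);
  return { totalKb: Number(cols[1]), usedKb: Number(cols[2]) };
}

function calculatePercentageUsed(usedKb, totalKb) {
  return Math.round((usedKb / totalKb) * 100);
}

const df = execSync("df -k /crawls", { encoding: "utf8" });
const { totalKb, usedKb } = parseDfOutput(df);
console.log(`${calculatePercentageUsed(usedKb, totalKb)}% used`);
```
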
Ilya Kreymer
3049b957bd version: bump to 0.10.2
deps: bump to py-wacz 0.4.9
2023-07-05 21:20:58 -07:00
Ilya Kreymer
c7dc504c75
deps: update puppeteer-core to 20.4.0, fixes #324 (#325) 2023-05-30 19:25:54 -07:00
Ilya Kreymer
7b906f921c
Origin Overrides: Ensure Host header also set (#326)
* origin overrides: ensure 'host' and 'origin' headers are also overridden, set to the *original* host and origin when sent to the destination origin
2023-05-30 19:25:37 -07:00
Ilya Kreymer
7c6c7d57a8 version: bump to 0.10.1 2023-05-30 19:12:28 -07:00
Tessa Walsh
d9b72bb9f5
Ignore spaces in double quotes when splitting process.env.CRAWL_ARGS (#323) 2023-05-30 19:06:44 -07:00
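
A sketch of the quote-aware splitting the fix above describes; the regex approach here is an assumption, not the commit's exact implementation:

```js
// Hypothetical sketch: split a CRAWL_ARGS-style string on spaces,
// but keep double-quoted values (which may contain spaces) together.
function splitCrawlArgs(argString) {
  // match either a double-quoted chunk or a run of non-space characters
  const matches = argString.match(/"[^"]*"|\S+/g) || [];
  // strip the surrounding quotes from quoted chunks
  return matches.map((m) => m.replace(/^"|"$/g, ""));
}

console.log(splitCrawlArgs('--url https://example.com/ --title "My Crawl"'));
// -> ['--url', 'https://example.com/', '--title', 'My Crawl']
```
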
Ilya Kreymer
db46cdf6d5 version: bump to 0.10.0 2023-05-23 12:45:29 -07:00
Ilya Kreymer
392c8bba0f
allow adding --include with pre-existing --scopeType values (besides custom) (fixes #318) (#319)
remove warning when --scopeType and --include used together
tests: update tests to reflect new semantics of --include + --scopeType
2023-05-23 09:43:11 -07:00
Ilya Kreymer
f51154facb
Chrome 112 + new headless mode + consistent viewport tweaks (#316)
* base: update to chrome 112
headless: switch to using new headless mode available in 112 which is more in sync with headful mode
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4

* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
2023-05-22 16:24:39 -07:00
Tessa Walsh
cc606deba9
Improve thumbnails with sharp (#304)
* Resize thumbnails to 640x360 with sharp
2023-05-19 11:30:24 -07:00
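
The sharp call behind such a resize is small; a sketch under the assumption that the screenshot is read from and written to local files:

```js
// Sketch: resize a full screenshot to a 640x360 thumbnail with sharp.
// File paths are placeholders; the crawler may pass buffers instead.
import sharp from "sharp";

await sharp("screenshot-full.png")
  .resize(640, 360)
  .toFile("thumbnail.png");
```
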
Ilya Kreymer
b5df5ad3c1 version: bump to 0.10.0-beta.3 2023-05-19 07:44:29 -07:00
Ilya Kreymer
77f0a935aa
stopping: if crawl is marked as stopping and no WARCs are found, also mark state as failed, to avoid a loop in cloud when the crawler is restarted (#314)
2023-05-19 07:38:16 -07:00
Marc-Andre Lemburg
f0d69ba399
Disable Chrome optimization logic (#312)
These optimizations can often lead to Chrome downloading large ML models in
the background, which then end up in the web crawling archives, even though
they don't have anything to do with the crawl.

Fixes #311.
2023-05-19 07:30:53 -07:00
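
One plausible way to switch these off is via Chrome's --disable-features launch flag; the feature names below are an assumption about what #312 disables, not a copy of its flag list:

```js
// Assumed example: launch flags that turn off Chrome's optimization-guide
// machinery; the precise feature names used in #312 may differ.
const launchArgs = [
  "--disable-features=OptimizationGuideModelDownloading," +
    "OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints",
];
```
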
Ilya Kreymer
4b0dee56c2
state: adjust redis keys to be more consistent (#309)
- use <crawlid>:stopping for crawl stop request
- use <crawlid>:size for setting total crawl size
bump to 0.10.0-beta.2
2023-05-07 13:01:24 -07:00
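
A sketch of this key scheme with ioredis; the crawl id and connection URL are placeholders:

```js
// Sketch of the redis key naming described above.
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");
const crawlId = "mycrawl"; // placeholder

// request a graceful stop
await redis.set(`${crawlId}:stopping`, "1");

// record total crawl size in bytes
await redis.set(`${crawlId}:size`, 1024 * 1024 * 512);
```
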
Tessa Walsh
f3c64b2b07
Consolidate wacz error loglines (#306)
* Print WACZ and reindexing errors/stacktraces on single line

* Log full stderr as single line if debug is enabled
2023-05-07 13:00:56 -07:00
Tessa Walsh
a0cf0ebde7
Log fatal messages to redis errors (#305) 2023-05-07 00:43:19 -07:00
Ilya Kreymer
ba6a3b6d6a version: bump to 0.10.0-beta.1 2023-05-06 00:12:09 -07:00
Ilya Kreymer
f4c4203381
crawl stopping / additional states: (#303)
* crawl stopping / additional states:
- adds check for 'isCrawlStopped()' which checks redis key to see if crawl has been stopped externally, and interrupts work
loop and prevents crawl from starting on load
- additional crawl states: 'generate-wacz', 'generate-cdx', 'generate-warc', 'uploading-wacz', and 'pending-wait' to indicate
when the crawl is no longer running but the crawler is still performing work
- addresses part of webrecorder/browsertrix-cloud#263, webrecorder/browsertrix-cloud#637
2023-05-03 16:25:59 -07:00
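
A sketch of the isCrawlStopped() check as described; the key name follows the `<crawlid>:stopping` convention from #309, while the loop placement is illustrative:

```js
// Sketch: the work loop polls a redis key and bails out if a stop
// was requested externally. `redis` is an ioredis client.
async function isCrawlStopped(redis, crawlId) {
  return (await redis.get(`${crawlId}:stopping`)) === "1";
}

// inside the worker loop (illustrative):
// while (queueNotEmpty) {
//   if (await isCrawlStopped(redis, crawlId)) break;
//   ...crawl next page...
// }
```
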
Tessa Walsh
d4bc9e80b9
Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed (#300)
* Catch 400 pywb errors on page load and mark page failed

* Add --failOnFailedSeed option to fail crawl with exit code 1 if seed doesn't load, resolves #207

* Handle 4xx or 5xx page.goto responses as page load errors
2023-04-26 16:49:32 -07:00
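
The status handling can be sketched in a few lines of puppeteer; the error handling shown is illustrative:

```js
// Sketch: treat a 4xx/5xx response from page.goto() as a failed page load.
// `page` is a puppeteer Page; `url` is the page being crawled.
const resp = await page.goto(url, { waitUntil: "load" });
if (resp && resp.status() >= 400) {
  throw new Error(`Page load failed: HTTP ${resp.status()} for ${url}`);
}
```
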
Ilya Kreymer
71b618fe94
Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298) 
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-26 15:41:35 -07:00
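
A sketch of cooperative request interception with priorities, as puppeteer supports it; the blocking rule itself is a made-up example:

```js
// Sketch: each handler passes a priority so multiple interceptors
// (block rules, ad blocking, origin overrides) can coexist.
await page.setRequestInterception(true);
page.on("request", (request) => {
  if (request.url().includes("/ads/")) {
    request.abort("blockedbyclient", 1); // cooperative abort, priority 1
  } else {
    request.continue(request.continueRequestOverrides(), 0);
  }
});
```
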
Ilya Kreymer
d4e222fab2
merge regression fixes from 0.9.1: full page screenshot + allow service workers if no profile used (#297)
* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288

* Fix full page screenshot (#296)
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 10:26:56 -07:00
Ilya Kreymer
3c7c7bfbc4
optimize shutdown: if the redis connection is gone after an interrupt signal was received, assume the crawler is being terminated and exit quickly, (#292)
don't attempt to reconnect to redis (assume crawler is also being shut down)
2023-04-24 09:50:49 -07:00
Ilya Kreymer
5c497f4fa4 version: bump version to 0.10.0-beta.0 2023-04-19 19:17:58 -07:00
Ilya Kreymer
3d8e21ea59
origin override: add --originOverride source=dest to allow routing where https://src-host:src-port/path/page.html -> http://dest-host:dest-port/path/page.html where source=https://src-host:src-port and dest=http://dest-host:dest-port (#281) 2023-04-19 19:17:15 -07:00
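
A sketch of that mapping using WHATWG URL objects; applyOriginOverride is a hypothetical helper name:

```js
// Sketch: if a request URL starts with the source origin, swap in the
// destination origin but keep path and query intact.
function applyOriginOverride(requestUrl, source, dest) {
  const url = new URL(requestUrl);
  if (url.origin === new URL(source).origin) {
    const destUrl = new URL(dest);
    url.protocol = destUrl.protocol;
    url.host = destUrl.host; // host includes the port
  }
  return url.href;
}

console.log(applyOriginOverride(
  "https://src-host:8443/path/page.html",
  "https://src-host:8443",
  "http://dest-host:8080"
));
// -> http://dest-host:8080/path/page.html
```
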
Tessa Walsh
4143ebbd02
Store archive dir size in Redis (#291) 2023-04-19 18:10:02 -07:00
Ilya Kreymer
52822f9e42
worker: lower wait time in the case where no additional pages remain and other workers will finish quickly; otherwise, this results in a minimum 10-second wait for >1 workers if only one page is encountered (#289) 2023-04-17 18:11:56 -07:00
Tessa Walsh
c23cd66c66
Store done in redis as integer and only save full json in redis for failed pages (#284)
* Store done in redis as integer rather than full json

* Add numFailed to crawler stats

* Cast numDone to int before returning

* Increment done counter for failed URLs

* Fix movefailed to push failed URL to the failed key, not the done key

* Don't add failed to total stats twice
2023-04-13 13:31:33 -07:00
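
A sketch of the resulting storage shape; the key names (`:d`, `:f`) are assumptions:

```js
// Sketch: successful pages bump an integer counter, while failed pages
// keep their full JSON for debugging. `redis` is an ioredis client.
async function markDone(redis, crawlId) {
  await redis.incr(`${crawlId}:d`);
}

async function markFailed(redis, crawlId, pageData) {
  await redis.incr(`${crawlId}:d`); // failed pages still count toward done
  await redis.rpush(`${crawlId}:f`, JSON.stringify(pageData));
}
```
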
Tessa Walsh
3864c76090
Add option to log errors to redis (#279) 2023-04-11 11:32:52 -04:00
Ilya Kreymer
4a27f8c4a0 version: bump to 0.9.1 2023-04-08 16:53:57 -07:00
Ilya Kreymer
ebdf0ac8f8 version: bump to 0.9.0! 2023-04-07 17:42:46 -07:00
Tessa Walsh
e2e80e98ef
Don't set viewport for full page screenshots (#221) 2023-04-07 17:42:06 -07:00
Tessa Walsh
b303af02ef
Add --title and --description CLI args to write metadata into datapackage.json (#276)
Multi-word values including spaces must be enclosed in double quotes.

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-04 10:46:03 -04:00
Ilya Kreymer
d4233582bb ci: bump yarn install timeout for ci, use latest gh action 2023-04-03 12:18:42 -07:00
Ilya Kreymer
24e9c43b29 version: bump to 0.9.0-beta.2 2023-04-03 11:52:24 -07:00
Ilya Kreymer
78faa965c5
Add --maxPageLimit override (#275)
* max page limit:
- rename --limit -> --pageLimit (keep alias for now)
- add new --maxPageLimit flag which overrides --pageLimit to ensure it is not greater than max
- readme: add new --pageLimit, --maxPageLimit to README
2023-04-03 11:10:47 -07:00
Ilya Kreymer
86e930d633
blockrules/logger: use global logger var (#274) 2023-04-03 10:58:13 -07:00
Tessa Walsh
d8c505a076
Update README for 0.9.0 (#272)
* Update README for Playwright/0.9.0

* Add ad blocking to README
2023-04-02 21:55:14 -07:00
Tessa Walsh
62fe4b4a99
Add options to filter logs by --logLevel and --context (#271)
* Add .DS_Store to gitignore

* Add --logLevel and --context filtering options

* Add log filtering test
2023-04-01 10:07:59 -07:00
Tessa Walsh
746d80adc7
Ensure crawler can't run out of space with --diskUtilization param (#264)
* Implement --diskUtilization

* Keep threshold fixed but project usage based on archive dir size
2023-03-31 09:35:18 -07:00
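
A sketch of the projection logic described in the second bullet; the growth model (the archive dir may roughly double, e.g. when a WACZ is written) is an assumption:

```js
// Sketch: keep the threshold fixed, but estimate final usage as current
// usage plus another archive's worth of growth.
function projectedUtilization(usedKb, totalKb, archiveDirKb) {
  // assume the crawl could roughly double the archive dir (e.g. WACZ copy)
  const projectedUsed = usedKb + archiveDirKb;
  return (projectedUsed / totalKb) * 100;
}

const threshold = 90; // --diskUtilization 90
if (projectedUtilization(800_000, 1_000_000, 150_000) >= threshold) {
  console.log("Not enough disk space, interrupting crawl");
}
```
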
Ilya Kreymer
4ba6e949d3
Reset locked pending URLs when crawler restarts. (#267)
* pending lock reset:
- quicker retry of pending URLs after crawler crash by clearing pending page locks
- pending urls are locked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove pending lock, to allow the URL
to be retried again, as it's no longer actively being crawled.
2023-03-30 21:29:41 -07:00
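
A sketch of this locking scheme with ioredis SET NX; function names are illustrative:

```js
// Sketch: lock a URL with the crawler's unique id via SET NX, and on
// restart clear locks that still hold our own id so the URL is retried.
async function lockPending(redis, crawlId, url, uid) {
  // returns "OK" only if no other crawler holds the lock
  return (await redis.set(`${crawlId}:p:${url}`, uid, "NX")) === "OK";
}

async function resetOwnLock(redis, crawlId, url, uid) {
  if ((await redis.get(`${crawlId}:p:${url}`)) === uid) {
    await redis.del(`${crawlId}:p:${url}`); // allow the URL to be retried
  }
}
```
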
Ilya Kreymer
fcd55c690a
worker index: set worker index automatically to work with k8s naming (#266)
- if CRAWL_ID env var set to 'crawl-id-name' while hostname is 'crawl-id-name-N' (automatically set via k8s statefulsets),
then set starting worker index to N * numWorkers
2023-03-29 22:27:17 -07:00
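
A sketch of deriving the starting index from the statefulset hostname; parsing details are assumptions:

```js
// Sketch: with CRAWL_ID "crawl-id-name" and statefulset hostname
// "crawl-id-name-2", pod 2 starts its workers at index 2 * numWorkers
// so worker ids never collide across pods.
import os from "os";

function startingWorkerIndex(numWorkers) {
  const crawlId = process.env.CRAWL_ID || "";
  const hostname = os.hostname();
  const suffix = hostname.startsWith(crawlId + "-")
    ? Number(hostname.slice(crawlId.length + 1))
    : NaN;
  return Number.isInteger(suffix) ? suffix * numWorkers : 0;
}
```
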
Tessa Walsh
b0e93cb06e
Add option for sleep interval after behaviors run + timing cleanup (#257)
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help checking time elapsed
- cleanup comments

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-22 11:50:18 -07:00
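
The secondsElapsed() helper can be sketched in a couple of lines; the signature here is an assumption:

```js
// Sketch: intervals are kept in seconds, converting to ms only at the
// Date.now() boundary.
function secondsElapsed(startMs, nowMs = Date.now()) {
  return (nowMs - startMs) / 1000;
}

const start = Date.now();
// ... later ...
if (secondsElapsed(start) > 30) {
  console.log("page taking longer than 30s");
}
```
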
Ilya Kreymer
02fb137b2c
Catch loading issues (#255)
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadState (0 to 4) as well as other per-page-state properties on a defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set loadState to 2 'full-page-load'.
- if page.goto() times out and domcontentloaded was never reached, fail immediately; if domcontentloaded was reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if a frame is actually not from a frame tag (e.g. an OBJECT tag), and skip it as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- worker ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
2023-03-20 18:31:37 -07:00
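
A sketch of the load-state tracking; values 0-2 follow the commit message, while the names for 3 and 4 and the PageState fields are assumptions:

```js
// Sketch of per-page load-state tracking on a defined object.
const LoadState = {
  FAILED: 0,           // default until proven otherwise
  CONTENT_LOADED: 1,   // 'domcontentloaded' fired
  FULL_PAGE_LOADED: 2, // page.goto() finished
  EXTRACTION_DONE: 3,  // assumed name: links extracted
  BEHAVIORS_DONE: 4,   // assumed name: behaviors completed
};

class PageState {
  constructor(url, depth, extraHops) {
    this.url = url;
    this.depth = depth;
    this.extraHops = extraHops;
    this.loadState = LoadState.FAILED;
  }
}
```
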
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
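
A singleton logger in ES modules can be as simple as exporting one shared instance; the JSON line format here mirrors the crawler's logging style but is illustrative:

```js
// Sketch: one shared Logger instance, so every importer logs through
// the same object.
class Logger {
  info(message, data = {}, context = "general") {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      logLevel: "info", context, message, details: data,
    }));
  }
}

// module-level singleton: every `import { logger }` gets the same instance
export const logger = new Logger();
```
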
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- start workers directly during WorkerPool init - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each of which runs in its own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
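
A sketch of the sorted-set crawl queue described in the state refactor; the score encoding (extraHops pushed past MAX_DEPTH) and the non-atomic pop are illustrative simplifications:

```js
// Sketch: score encodes depth (primary) and extraHops, so shallower,
// in-scope URLs come off the queue first. `redis` is an ioredis client.
const MAX_DEPTH = 1_000_000;

async function queueUrl(redis, crawlId, url, depth, extraHops) {
  // extraHops pages sort after all regular pages at any depth (assumption)
  const score = depth + (extraHops ? MAX_DEPTH : 0);
  await redis.zadd(`${crawlId}:q`, score, JSON.stringify({ url, depth, extraHops }));
}

async function nextFromQueue(redis, crawlId) {
  // a real implementation would pop atomically (e.g. via a lua script)
  const [member] = await redis.zrange(`${crawlId}:q`, 0, 0);
  if (member) await redis.zrem(`${crawlId}:q`, member);
  return member ? JSON.parse(member) : null;
}
```
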
Tessa Walsh
f19f1fcb8d
Minor crawler fixes after puppeteer-cluster removal refactoring (#250)
* Remove screencaster from Worker/WorkerPool

* Don't increment errors in crawlPageInWorker

* Set pageTarget variable early
2023-03-13 15:07:59 -07:00