505 commits

Author SHA1 Message Date
Ilya Kreymer
4b8a414410
Add total timeout + limit redis queue retries (#248)
* time limits: re-add total timeout to runTask() in worker, just in case
refactor working runTask() to either return true/false if task was timed out
if timed out, recreate the page
redis: add limit to retried URLs, currently set to 1
* retry: remove URL if not retrying, log removal of URL from queue
2023-03-13 14:48:04 -07:00
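The retry limit described in this commit can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the crawler's code: `RetryQueue`, `mark_failed` and the in-memory deque are hypothetical stand-ins for the Redis-backed queue, and only the limit of 1 comes from the commit message.

```python
from collections import deque

MAX_RETRIES = 1  # the commit message says the limit is "currently set to 1"


class RetryQueue:
    """Hypothetical in-memory stand-in for the Redis-backed URL queue."""

    def __init__(self, max_retries=MAX_RETRIES):
        self.max_retries = max_retries
        self.pending = deque()   # URLs waiting to be crawled
        self.retries = {}        # URL -> number of retries already used
        self.removed = []        # URLs dropped after exhausting retries

    def add(self, url):
        self.pending.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def mark_failed(self, url):
        """Requeue url if it has retries left; otherwise drop it and record the removal."""
        used = self.retries.get(url, 0)
        if used < self.max_retries:
            self.retries[url] = used + 1
            self.pending.append(url)
            return True
        self.removed.append(url)  # a real crawler would also log the removal here
        return False
```

With the limit at 1, a URL that fails twice is requeued once and then removed, so a page that keeps failing cannot be retried indefinitely.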
Tessa Walsh
aadd9a0483
Add timedRun to prevent async operations from hanging (#243)
* Add timedRun and apply to network requests

* Remove debugging print statement

* minor tweaks:
- move seconds to 2nd param, make param required
- use FETCH_TIMEOUT_SECS for fetch events and PAGE_OP_TIMEOUT_SECS for in-page events respectively
- use timedRun() for check CF action
- remove extra async

* additional logging
ensure queue is cleared when interrupting!

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-10 20:11:24 -08:00
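The idea behind timedRun, wrapping an async operation so that it cannot hang forever, might look like this in Python. It is only a sketch: the name `timed_run`, its return shape and the `demo` helper are assumptions made for illustration. The only detail taken from the commit message is that the timeout in seconds is the required second parameter.

```python
import asyncio


async def timed_run(awaitable, seconds, message="operation timed out"):
    """Await `awaitable`, giving up after `seconds`.

    Returns (True, result) on success and (False, None) on timeout.
    """
    try:
        return True, await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        # A real crawler would send this to its logger instead of printing.
        print(f"{message} after {seconds}s")
        return False, None


def demo():
    """Run one fast and one slow operation through timed_run."""
    async def main():
        fast = await timed_run(asyncio.sleep(0.01, result="ok"), 1)
        slow = await timed_run(asyncio.sleep(1, result="late"), 0.05)
        return fast, slow

    return asyncio.run(main())
```

Here `demo()` returns `((True, "ok"), (False, None))`: the first call finishes in time and the second is cancelled when its timeout expires.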
Ilya Kreymer
99208882c3
size check fix: fix typo in size check, where size wasn't actually being checked correctly for --sizeLimit (#241) 2023-03-09 09:00:14 -08:00
Tessa Walsh
1bee46b321
Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)
* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck

* deps: bump puppeteer-core to 17.1.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-08 18:31:19 -08:00
Ilya Kreymer
ac5a720362
logging: serialize regex as string to avoid empty '{}' when logging scoping rules, fixes #234 (#235) 2023-03-02 11:39:37 -08:00
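The problem fixed here is specific to JavaScript, where `JSON.stringify()` turns a RegExp into `{}`. The closest Python analogue is that `json.dumps` rejects compiled patterns outright, and the same remedy applies: serialize the pattern source as a string. The sketch below is illustrative only, and `serialize_details` is a hypothetical helper name.

```python
import json
import re


def serialize_details(details):
    """Serialize log details, writing compiled regexes as their pattern source."""

    def fallback(value):
        # Called by json.dumps only for values it cannot encode itself.
        if isinstance(value, re.Pattern):
            return value.pattern
        return str(value)

    return json.dumps(details, default=fallback)
```

For example, `serialize_details({"include": [re.compile("^https://example.com/")]})` yields a JSON object whose `include` list holds the pattern text instead of an empty object.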
Ilya Kreymer
3a4d318e90 CHANGES: update changes for 0.8.1 2023-02-24 18:33:29 -08:00
Ilya Kreymer
63717c4b04
Crawl log (#231)
* logging:
- write most of the crawl log to '{coll}/logs/crawl-{iso-timestamp}.log', part of #230
- ensure log filename consists of numeric timestamp only
- close log before wacz file is generated to allow storing log in wacz
- close log after writing stats
- add logs/ directory to wacz with new py-wacz
- deps: bump to py-wacz 0.4.8 to support logs in wacz
2023-02-24 18:31:08 -08:00
Sara Tavares
5b1f224dcb
fix typos (#232) 2023-02-24 11:09:40 -08:00
Ilya Kreymer
5da379cb5f
Logging and Behavior Tweaks (#229)
- Ensure page is included in all logging details
- Update logging messages to be a single string, with variables added in the details
- Always wait for all pending wait requests to finish (unless counter <0)
- Don't set puppeteer-cluster timeout (prep for removing puppeteer-cluster)
- Add behaviorTimeout to running behaviors in crawler, in addition to in behaviors themselves.
- Add logging for behavior start, finish and timeout
- Move writeStats() logging to beginning of each page as well as at the end, to avoid confusion about pending pages.
- For events from frames, use frameUrl along with current page
- deps: bump browsertrix-behaviors to 0.4.2
- version: bump to 0.8.1
2023-02-23 18:50:22 -08:00
Ilya Kreymer
a4358f4622 CHANGES: update with latest PRs for release! 2023-02-04 16:49:17 -08:00
Ilya Kreymer
a59ec05a85
update behaviors to 0.4.1, rename 'Behavior line' -> 'Behavior log' (#223) 2023-02-04 16:02:43 -08:00
Ilya Kreymer
b513246b03
deps: bump pywb to 2.7.3, update CHANGES to current version (#222)
* deps: bump pywb to 2.7.3
bump to 0.8.0 for release

* update CHANGES
2023-02-03 17:56:30 -08:00
Tessa Walsh
0cf6219d80
Fix --overwrite CLI flag (#220)
* Delete collection if --overwrite before wb-manager init

* Add tests
2023-02-02 21:02:47 -08:00
Ilya Kreymer
10e61d4c85
Bump to Chrome 109, Beta 0.8.0-beta.1 Release (#215)
- bump to chrome-109 image
- bump uwsgi to fix intermittent build errors
- remove installs moved to base image
bump to 0.8.0-beta.1
2023-01-30 19:00:33 -08:00
Ilya Kreymer
38a9dbdaae
behaviors: don't run behaviors in iframes that are about:blank or are… (#211)
* behaviors: don't run behaviors in iframes that are about:blank or are from an ad-host (even if ad-blocking is not disabled), fixes #210

* logging: log behavior wait start and success, in addition to error, with url in details
2023-01-23 16:47:33 -08:00
Tessa Walsh
c0b0d5b87f
Serialize Redis pending pages as JSON objects (#212)
* Add redis:// prefix to test --redisStoreUrl

* Serialize pending pages as JSON objects
2023-01-23 16:44:03 -08:00
Ilya Kreymer
a767721f5e
crawl state: add getPendingList() to return pending state from either… (#205)
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failure: log with error, not fatal
2023-01-23 10:43:12 -08:00
Tessa Walsh
1a066dbd7b
Add RedisCrawlState test (#208) 2023-01-23 10:16:22 -08:00
kuechensofa
f9df7a94ce
Add requests[socks] python dependency (#201)
Add requests[socks] python dependency to enable SOCKS proxy support for pywb inside the docker container
2023-01-19 21:55:07 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
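A logger along the lines this commit describes, with one JSON object per line carrying timestamp, context and details fields, could be sketched as below. The field name `logLevel`, the default context and the behaviour of `fatal` (which does not exit in this sketch) are assumptions and are not taken from the crawler.

```python
import io
import json
import sys
from datetime import datetime, timezone


class Logger:
    """Minimal JSON Lines logger: each call writes one JSON object on one line."""

    def __init__(self, stream=None):
        self.stream = sys.stdout if stream is None else stream

    def _log(self, level, message, details=None, context="general"):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "logLevel": level,  # assumed field name
            "context": context,
            "message": message,
            "details": details or {},
        }
        self.stream.write(json.dumps(entry) + "\n")

    def info(self, message, details=None, context="general"):
        self._log("info", message, details, context)

    def error(self, message, details=None, context="general"):
        self._log("error", message, details, context)

    def warn(self, message, details=None, context="general"):
        self._log("warn", message, details, context)

    def debug(self, message, details=None, context="general"):
        self._log("debug", message, details, context)

    def fatal(self, message, details=None, context="general"):
        # Only logs; whether a real logger should also exit is left open here.
        self._log("fatal", message, details, context)


def demo_entries():
    """Write two entries to an in-memory stream and parse them back."""
    buf = io.StringIO()
    log = Logger(buf)
    log.info("Page loaded", {"page": "https://example.com/"}, context="worker")
    log.error("Page load failed", {"page": "https://example.org/"})
    return [json.loads(line) for line in buf.getvalue().splitlines()]
```

Keeping the message a fixed string and putting the variables into `details` makes the output easy to filter line by line with standard JSON tools.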
Ilya Kreymer
2b03e23174
arg parsing fix: (#200)
- check if array of scope includes is actually empty before using it over scope
- check if screenshot arg setting is empty
2023-01-12 19:58:04 -08:00
Ilya Kreymer
5ee05985b1
Use VNC for headful profile creation (#197)
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer

* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected

* x11vnc cleanup: just pass password via cmdline to simplify setup

* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive

* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
2023-01-09 23:56:53 -08:00
Ed Summers
33a153ac54
remove unused parts of config (#198)
remove commented out config options (enable-auto-fetch and auto-index) to avoid confusion
2023-01-04 17:00:22 -08:00
Tessa Walsh
f35d495103
Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
Ilya Kreymer
057cc82897
new setting: add support for specifying language via the --lang flag (#186) 2022-11-21 11:59:37 -08:00
Ilya Kreymer
b268c02823 package: fix license string in package.json 2022-11-21 09:20:15 -08:00
Ilya Kreymer
2a1e0edf3c version: set version correctly to 0.8.0-beta.0 2022-11-15 18:30:27 -08:00
Ilya Kreymer
cacf5da5a1 esm conversion: finish esm conversion for create-login-profile.js 2022-11-15 18:30:27 -08:00
Tessa Walsh
e02058f001 Add ad blocking via request interception (#173)
* ad blocking via request interception, extending block rules system, adding new AdBlockRules
* Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build
* Enabled via --blockAds and setting a custom message via --adBlockMessage
* new test to check for ad blocking
* Add test-crawls dir to .gitignore and .dockerignore
2022-11-15 18:30:27 -08:00
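Host-based ad blocking of the kind described above can be illustrated as follows: parse a hosts-format list into a set, then test each request URL's hostname against it. This is a sketch in Python, not the crawler's AdBlockRules; the function names are hypothetical, and it assumes list entries take the form `0.0.0.0 hostname`.

```python
from urllib.parse import urlsplit


def parse_hosts(text):
    """Collect blocked hostnames from hosts-file text ("0.0.0.0 hostname" lines)."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "0.0.0.0":
            hosts.update(parts[1:])
    hosts.discard("0.0.0.0")  # the list's self-referencing entry is not an ad host
    return hosts


def is_blocked(url, blocked_hosts):
    """Return True if the URL's hostname is in the blocked set."""
    host = urlsplit(url).hostname
    return host is not None and host in blocked_hosts
```

A request interceptor would call `is_blocked()` for each outgoing request and abort, or answer with a custom message, when it returns True.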
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Tim
5b738bd24e
Fix incorrect combineWARCs property in README.md (#180)
This stumped me for a little while. The actual property isn't plural.
2022-11-14 22:17:44 -08:00
Ed Summers
cd17764b77
Check if group/user exists (#176)
Ensure that group and user do not already exist before creating them.

Fixes #174
2022-11-03 17:28:13 -07:00
Ilya Kreymer
ffa3174578
Fix for warcio.js (#178)
* dependency fix: set warcio to 1.5.1 until we update to esm support
bump test timeout
fixes #175
bump to 0.7.1
2022-10-24 08:20:01 +02:00
Ilya Kreymer
1213694dde bump to 0.7.0 for release! 2022-10-11 16:14:53 -07:00
Ilya Kreymer
be3b6b85fa README: update default behaviors in README, fixes #169 2022-10-11 15:33:32 -07:00
Ed Summers
3ba64535a5
Run in Docker as User (#171)
* Run in Docker as User

This follows a similar pattern to pywb to run as the user that owns the
crawls directory.

bump version to 0.7.0-beta.6

Closes #170
2022-09-28 12:49:52 -07:00
Ilya Kreymer
65933c6b12
Interrupt Handling Fixes (#167)
* interrupts: simplify interrupt behavior:
- SIGTERM/SIGINT behave the same way, triggering a graceful shutdown after page load

improvements of remote state / parallel crawlers (for browsertrix-cloud):
- SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl
- SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler

* scope check: check scope of URL retrieved from queue (in case scoping rules changed), urls matching seed automatically in scope!
2022-09-20 17:09:52 -07:00
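The signal semantics in this commit can be modelled as a small state machine. The Python sketch below only models the flags; `InterruptState` and its attribute names are hypothetical, and a real process would register handlers for the actual signals rather than pass names around.

```python
class InterruptState:
    """Tracks what a crawler should do when it is finally interrupted."""

    def __init__(self):
        self.save_state = False   # ensure crawl data is saved before exiting
        self.mark_done = False    # mark this crawler as done in the shared state
        self.interrupted = False  # shut down gracefully after the current page load

    def handle(self, signame):
        if signame == "SIGUSR1":
            # Sent before SIGINT/SIGTERM when gracefully stopping the crawl.
            self.save_state = True
            self.mark_done = True
        elif signame == "SIGUSR2":
            # Sent before SIGINT/SIGTERM when scaling down a single crawler.
            self.save_state = True
            self.mark_done = False
        elif signame in ("SIGINT", "SIGTERM"):
            # Both behave the same way.
            self.interrupted = True
        else:
            raise ValueError(f"unhandled signal: {signame}")
        return self
```

For instance, SIGUSR2 followed by SIGTERM leaves `save_state` and `interrupted` set while `mark_done` stays False, which matches the scale-down case.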
Ilya Kreymer
fd1737962b dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165)
bump to 0.7.0-beta.5
2022-09-15 23:13:31 -07:00
Ilya Kreymer
314ee3f730
Default Wait-Time Improvements (#162)
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting.
- bump to puppeteer-core 17.1.2
- bump to 0.7.0-beta.4
2022-09-08 23:39:26 -07:00
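The netIdleWait default rule from this commit reduces to a few lines. In the sketch below the function name is hypothetical and "not set" is represented by `None`, which is an assumption; the 15-second and 2-second values and the page/page-spa scopes come from the commit message.

```python
def resolve_net_idle_wait(net_idle_wait, scope_type):
    """Return the network-idle wait in seconds, applying defaults when unset."""
    if net_idle_wait is not None:
        return net_idle_wait  # an explicit setting always wins
    # Single-page scopes get a longer wait, since no further pages follow.
    return 15 if scope_type in ("page", "page-spa") else 2
```

For example, `resolve_net_idle_wait(None, "page-spa")` gives 15, `resolve_net_idle_wait(None, "prefix")` gives 2, and `resolve_net_idle_wait(5, "page")` keeps the explicit 5.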
Ilya Kreymer
5c931275ed pending wait: set max pending request wait to 120 seconds 2022-09-02 17:53:04 -07:00
Ilya Kreymer
a52ee5ed1f dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3 2022-09-02 17:45:16 -07:00
Ilya Kreymer
e22d95e2f0
Logging and browser improvements: (#158)
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting via devtools via crawling on port 9222
2022-08-21 00:30:25 -07:00
Ilya Kreymer
6cc38bf511
Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157)
* new window: use cdp instead of window.open

* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page

* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156

* browser repair:
- when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur
- mark pages as failed and don't reuse if page error or cdp errors occur
- screencaster: clear previous targets if screencasting when repairing browser

* bump version to 0.7.0-beta.3
2022-08-19 09:23:40 -07:00
Ilya Kreymer
827c153679 fix for latest puppeteer: page._client -> page._client() 2022-08-17 21:40:10 -07:00
Ilya Kreymer
c5d208024a
Wait Default + Logging Improvements (#153)
improved logging of pywb + redis:
- if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log
- otherwise, just ignore (don't print to stdout as that's too confusing)
- print if wb-manager fails, likely due to existing collection

waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait
dependency: update to latest puppeteer-core (16.1.0)
2022-08-11 18:44:39 -07:00
raffaele messuti
a527cc9b36
Update README.md (#147)
fix link to puppeteer waitUntil
2022-08-11 18:28:54 -07:00
Ilya Kreymer
e3b8b5ba21
Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145)
- add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load
- add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl
- update to browsertrix-behaviors 0.3.2
- update current url for screencasting of page before page load starts
bump to 0.7.0-beta.2
2022-07-08 17:17:46 -07:00
Ilya Kreymer
bd10f1ad8c bump to 0.7.0-beta.1 2022-07-03 11:11:11 -07:00
Ilya Kreymer
82c771f7cd ci: possible fix for ci release build (issues building uwsgi) 2022-07-03 11:09:06 -07:00
Ilya Kreymer
0a309af740
Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144)
* update base image 
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes

* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now

bump to 0.7.0-beta.0
2022-06-30 19:24:26 -07:00