Commit graph

134 commits

Author SHA1 Message Date
Tessa Walsh
034657dbb6 Use brave-1.46.144 base image 2023-01-09 15:02:46 -05:00
Tessa Walsh
b8bed40e14 Add working ad block disabled Brave profile 2023-01-09 15:02:40 -05:00
Tessa Walsh
c078ce7fb9 Modify BROWSER_BIN 2022-12-13 11:43:22 -05:00
Tessa Walsh
59e41b04c2 Set Brave default profile in argparser 2022-12-12 17:22:50 -05:00
Tessa Walsh
9d3af6f80f WIP: Add default Brave profile
Currently requires a locally built Brave base image named:
webrecorder/browsertrix-browser-base:brave-test-latest

brave-ad-blocking-disabled-profile.tar.gz may not be working quite
correctly and may need to be replaced, as it wasn't possible to modify
the selects in brave://settings via create-login-profile's interactive
mode quite yet
2022-12-12 17:21:58 -05:00
Ilya Kreymer
2a1e0edf3c version: set version correctly to 0.8.0-beta.0 2022-11-15 18:30:27 -08:00
Ilya Kreymer
cacf5da5a1 esm conversion: finish esm conversion for create-login-profile.js 2022-11-15 18:30:27 -08:00
Tessa Walsh
e02058f001 Add ad blocking via request interception (#173)
* ad blocking via request interception, extending block rules system, adding new AdBlockRules
* Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build
* Enabled via --blockAds and setting a custom message via --adBlockMessage
* new test to check for ad blocking
* Add test-crawls dir to .gitignore and .dockerignore
2022-11-15 18:30:27 -08:00
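
The host-list approach described in this commit can be sketched as follows. This is a minimal illustration, not the crawler's actual AdBlockRules code: the function names `parseHostsFile` and `shouldBlock` are hypothetical, and the sketch assumes the list uses the usual hosts-file layout of `0.0.0.0 hostname` entries.

```javascript
// Hypothetical helpers; names are illustrative, not from the crawler source.

// Convert hosts-file text ("0.0.0.0 ads.example.com") into a Set of hostnames.
function parseHostsFile(text) {
  const hosts = new Set();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    if (!line) continue;
    const [ip, host] = line.split(/\s+/);
    // Keep only sinkhole entries and skip the "0.0.0.0 0.0.0.0" self-entry.
    if (ip === "0.0.0.0" && host && host !== "0.0.0.0") {
      hosts.add(host.toLowerCase());
    }
  }
  return hosts;
}

// Decide whether an intercepted request URL points at a blocked host.
function shouldBlock(url, blockedHosts) {
  try {
    return blockedHosts.has(new URL(url).hostname);
  } catch {
    return false; // unparseable URLs are left alone
  }
}
```

A request-interception handler could call `shouldBlock(request.url(), hosts)` and abort the request when it returns true; how the crawler wires this into its block rules is not shown here.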
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Tim
5b738bd24e
Fix incorrect combineWARCs property in README.md (#180)
This stumped me for a little while. The actual property isn't plural.
2022-11-14 22:17:44 -08:00
Ed Summers
cd17764b77
Check if group/user exists (#176)
Ensure that group and user do not already exist before creating them.

Fixes #174
2022-11-03 17:28:13 -07:00
Ilya Kreymer
ffa3174578
Fix for warcio.js (#178)
* dependency fix: set warcio to 1.5.1 until we update to esm support
bump test timeout
fixes #175
bump to 0.7.1
2022-10-24 08:20:01 +02:00
Ilya Kreymer
1213694dde bump to 0.7.0 for release! 2022-10-11 16:14:53 -07:00
Ilya Kreymer
be3b6b85fa README: update default behaviors in README, fixes #169 2022-10-11 15:33:32 -07:00
Ed Summers
3ba64535a5
Run in Docker as User (#171)
* Run in Docker as User

This follows a similar pattern to pywb to run as the user that owns the
crawls directory.

bump version to 0.7.0-beta.6

Closes #170
2022-09-28 12:49:52 -07:00
Ilya Kreymer
65933c6b12
Interrupt Handling Fixes (#167)
* interrupts: simplify interrupt behavior:
- SIGTERM/SIGINT behave the same way, triggering a graceful shutdown after page load

improvements of remote state / parallel crawlers (for browsertrix-cloud):
- SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl
- SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler

* scope check: check scope of URL retrieved from queue (in case scoping rules changed), urls matching seed automatically in scope!
2022-09-20 17:09:52 -07:00
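
The signal semantics described in this commit can be modelled as a small state tracker. This is a hedged sketch: `createInterruptState` and the returned field names are hypothetical, and it assumes that a SIGINT/SIGTERM with no preceding SIGUSR1/SIGUSR2 requests neither finalization nor marking the crawler as done.

```javascript
// Hypothetical model of the signal handling; not the crawler's actual code.
function createInterruptState() {
  let finalize = false; // ensure data is saved on shutdown?
  let markDone = false; // mark this crawler as done in shared state?

  return {
    // Returns null for "remember and keep running", or a shutdown plan.
    onSignal(signal) {
      switch (signal) {
        case "SIGUSR1": // graceful stop of the whole crawl
          finalize = true;
          markDone = true;
          return null;
        case "SIGUSR2": // scaling down a single crawler
          finalize = true;
          markDone = false;
          return null;
        case "SIGINT":
        case "SIGTERM": // both behave the same way
          return { gracefulShutdown: true, finalize, markDone };
        default:
          return null;
      }
    },
  };
}
```

In a real process each signal would be registered with `process.on(signal, handler)`, and the returned plan would be acted on once the current page load completes.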
Ilya Kreymer
fd1737962b dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165)
bump to 0.7.0-beta.5
2022-09-15 23:13:31 -07:00
Ilya Kreymer
314ee3f730
Default Wait-Time Improvements (#162)
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting.
- bump to puppeteer-core 17.1.2
- bump to 0.7.0-beta.4
2022-09-08 23:39:26 -07:00
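
The netIdleWait default described above amounts to a small selection rule. A minimal sketch, assuming a hypothetical helper name `defaultNetIdleWait` and values expressed in seconds:

```javascript
// Hypothetical helper; mirrors the rule stated in the commit message only.
function defaultNetIdleWait(netIdleWait, scopeType) {
  // An explicitly configured value always wins.
  if (netIdleWait !== undefined && netIdleWait !== null) {
    return netIdleWait;
  }
  // Single-page scopes get a longer idle wait than multi-page crawls.
  return scopeType === "page" || scopeType === "page-spa" ? 15 : 2;
}
```

For example, `defaultNetIdleWait(undefined, "page-spa")` yields 15, while `defaultNetIdleWait(undefined, "prefix")` yields 2.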
Ilya Kreymer
5c931275ed pending wait: set max pending request wait to 120 seconds 2022-09-02 17:53:04 -07:00
Ilya Kreymer
a52ee5ed1f dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3 2022-09-02 17:45:16 -07:00
Ilya Kreymer
e22d95e2f0
Logging and browser improvements: (#158)
* logging: add 'jserrors' option to --logging to print JS errors
* browser config: use flags from playwright
* browser: use socat to allow connecting to devtools on port 9222 while crawling
2022-08-21 00:30:25 -07:00
Ilya Kreymer
6cc38bf511
Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157)
* new window: use cdp instead of window.open

* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page

* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156

* browser repair:
- when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur
- mark pages as failed and don't reuse if page error or cdp errors occur
- screencaster: clear previous targets if screencasting when repairing browser

* bump version to 0.7.0-beta.3
2022-08-19 09:23:40 -07:00
Ilya Kreymer
827c153679 fix for latest puppeteer: page._client -> page._client() 2022-08-17 21:40:10 -07:00
Ilya Kreymer
c5d208024a
Wait Default + Logging Improvements (#153)
improved logging of pywb + redis:
- if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log
- otherwise, just ignore (don't print to stdout as that's too confusing)
- print if wb-manager fails, likely due to existing collection

waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait
dependency: update to latest puppeteer-core (16.1.0)
2022-08-11 18:44:39 -07:00
raffaele messuti
a527cc9b36
Update README.md (#147)
fix link to puppeteer waitUntil
2022-08-11 18:28:54 -07:00
Ilya Kreymer
e3b8b5ba21
Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145)
- add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load
- add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl
- update to browsertrix-behaviors 0.3.2
- update current url for screencasting of page before page load starts
bump to 0.7.0-beta.2
2022-07-08 17:17:46 -07:00
Ilya Kreymer
bd10f1ad8c bump to 0.7.0-beta.1 2022-07-03 11:11:11 -07:00
Ilya Kreymer
82c771f7cd ci: possible fix for ci release build (issues building uwsgi) 2022-07-03 11:09:06 -07:00
Ilya Kreymer
0a309af740
Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144)
* update base image 
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes

* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now

bump to 0.7.0-beta.0
2022-06-30 19:24:26 -07:00
Ilya Kreymer
cf90304fa7
0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished 
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished.
- set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit being exceeded, but not due to other interruptions
- change sleep() to be in seconds

* misc fixes:
- crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if received abort request

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
2022-06-17 11:58:44 -07:00
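
The `failing` / `failed` transition described in this commit can be sketched as a sliding-window counter. This is an illustration under stated assumptions: `statusAfterFailure` is a hypothetical name, and since the commit message's threshold wording is ambiguous, the sketch treats more than `maxFailures` (default 3) failures within 60 seconds as `failed` and exposes the threshold as a parameter.

```javascript
// Hypothetical sketch; the threshold semantics are an assumption.
function statusAfterFailure(
  failureTimes,
  nowMs,
  { maxFailures = 3, windowMs = 60_000 } = {}
) {
  // Keep only failures still inside the time window, then add this one.
  const recent = failureTimes.filter((t) => nowMs - t < windowMs);
  recent.push(nowMs);
  return {
    status: recent.length > maxFailures ? "failed" : "failing",
    failureTimes: recent,
  };
}
```

The returned `status` is what would be written to the `<crawl id>:status` field; the Redis write itself is omitted.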
Ilya Kreymer
e7eb6a6620 create profile: fix typo in cookie settings, multiply by seconds in day
uwsgi: set number of workers to be 2x cpus by default
2022-06-01 09:11:11 -07:00
Ilya Kreymer
70ba9241ca limit interrupt fix: after self-interrupting, only look at local pending list (for redis state)
logging: don't log CF check errors, do log when errorCount is reset
2022-05-19 06:25:46 +00:00
Ilya Kreymer
6ec47cdd14
profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139)
- save cookies on page load and also before profile creation
- default cookie duration is 7 days, configurable via --cookieDays option
2022-05-18 23:23:32 -07:00
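
Giving cookies a fixed duration can be sketched as below. This is not the profile creator's actual code: `withExpiry` is a hypothetical name, the cookie shape (an `expires` value in Unix seconds, `-1` for session cookies, plus a `session` flag) is assumed to match what the browser reports, and the sketch only rewrites cookies that lack an expiry.

```javascript
// Hypothetical helper; cookie shape and selection rule are assumptions.
const SECONDS_PER_DAY = 24 * 60 * 60;

function withExpiry(
  cookies,
  cookieDays = 7,
  nowSeconds = Math.floor(Date.now() / 1000)
) {
  // Days must be converted to seconds before adding to a Unix timestamp.
  const expires = nowSeconds + cookieDays * SECONDS_PER_DAY;
  return cookies.map((cookie) =>
    cookie.session || cookie.expires === -1 || cookie.expires === undefined
      ? { ...cookie, expires, session: false }
      : cookie
  );
}
```

The rewritten cookies would then be set back on the page before the profile is saved, so that session cookies survive in the stored profile.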
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
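
The health check rule stated in this commit (healthy while consecutive failed page loads stay below twice the number of workers) reduces to a one-line decision. A minimal sketch with a hypothetical function name:

```javascript
// Hypothetical helper implementing only the rule from the commit message.
function healthStatusCode(consecutiveFailedLoads, numWorkers) {
  // 200 while failures are below 2 * workers, otherwise 503.
  return consecutiveFailedLoads < 2 * numWorkers ? 200 : 503;
}
```

A small HTTP server listening on the `--healthCheckPort` port could respond with this status code to each probe; the counter would presumably reset after a successful page load, though that detail is an assumption here.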
Ilya Kreymer
500ed1f9a1
Profile Creation Improvements (#136)
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfileJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.

* bump to 0.6.0-beta.0
2022-05-05 14:27:17 -05:00
Ilya Kreymer
5dfbfbeaf6
update dependencies: (#134)
- update pywb to 2.6.7, fix possible cdx indexing error via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed, with better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
2022-04-15 16:22:47 -07:00
Ilya Kreymer
9b938304ce dependencies: update to pywb>=2.6.6, wacz>=0.4.5 2022-04-11 15:09:59 -07:00
Ilya Kreymer
cc391146c4 package: set minio version to fixed (7.0.26) 2022-04-09 22:07:17 -07:00
Ilya Kreymer
bfd72835d1 update CHANGES for 0.5.0 release 2022-04-09 21:59:44 -07:00
Ilya Kreymer
7ed5586bdb scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity 2022-03-22 17:43:13 -07:00
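
Dropping the leading "www." before building a domain scope can be sketched as below. This is an assumption-laden illustration: `domainScopeRegex` is a hypothetical name, and the regular expression shape is one plausible way to match a domain and its subdomains, not necessarily the one the crawler uses.

```javascript
// Hypothetical helper; the regex shape is an assumption for illustration.
function domainScopeRegex(seedUrl) {
  let host = new URL(seedUrl).hostname;
  // Drop a leading "www." so the bare domain and all subdomains are in scope.
  if (host.startsWith("www.")) {
    host = host.slice(4);
  }
  // Escape regex metacharacters in the hostname (mainly the dots).
  const escaped = host.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^https?://([^/]+\\.)*${escaped}(:\\d+)?/`);
}
```

With a seed of `https://www.example.com/`, the resulting pattern accepts `https://example.com/` and `https://blog.example.com/page` but rejects `https://notexample.com/`.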
Ilya Kreymer
5afd19f43d
Non-HTML Page Load Optimization (#130)
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
2022-03-22 17:41:51 -07:00
Ilya Kreymer
09082e8abb dependencies: set wacz>=0.4.4 2022-03-18 10:38:34 -07:00
Ilya Kreymer
8727ca7f8c redis state error handling: catch and log potential errors with reading json state for next url
bump version to 0.5.0-beta.7
2022-03-18 10:34:17 -07:00
Ilya Kreymer
5e5efda437
Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128)
* cloudflare wait improvements (#110 fix)
- set navigator.webdriver to false to help with cloudflare wait
- add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded

* chrome args refactor:
- move to utils/browser
- add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing
- add userAgent option

* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy

* fix setting custom userAgent (#90)
- fix typo that resulted in error
- ensure userAgent is applied separate from emulatedDevice
- add getDefaultUA() browser util
2022-03-18 10:32:59 -07:00
Ilya Kreymer
dedf1cc0ad typo fix: add await to loadProfile in create-login-profile.js 2022-03-15 02:40:06 +00:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
1fae21b0cf
Better check to see if ERR_ABORTED should be ignored. (#127)
* error abort check: Fix possible regression with req.failure() returning null, also move to separate function and wrap in exception handler
* bump version to 0.5.0-beta.6
2022-03-14 14:41:39 -07:00
Ilya Kreymer
ab096cd5b0
Improvements to URL direct check and fetch (#125)
- direct check fix: only do direct check if HEAD returns 200 status code
- if direct load results in non-200 status code, still load in browser
- error reporting: detect if net::ERR_ABORTED is actually caused by loading of PDF / other binary that is downloaded, and not an actual page load error
- state: tweak error logging message
2022-03-14 11:11:53 -07:00
Ilya Kreymer
81e8fa6da7
Incremental save state (#124)
* save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states
in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states
display save state status on startup
page write fixes: add missing await
fix for #113

* update README
2022-03-14 10:41:56 -07:00
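
The interval and history behaviour described in this commit can be sketched with two small helpers. The names `shouldSaveState` and `recordSave` are hypothetical, and the defaults (300 seconds, 5 entries) are taken from the "every 5 mins" and "last 5 save states" wording above.

```javascript
// Hypothetical helpers; not the crawler's actual save-state code.

// True once at least intervalSeconds have passed since the last save.
function shouldSaveState(lastSavedMs, nowMs, intervalSeconds = 300) {
  return nowMs - lastSavedMs >= intervalSeconds * 1000;
}

// Append a new save-state file and report which old ones to delete.
function recordSave(history, filename, maxHistory = 5) {
  const updated = [...history, filename];
  const excess = updated.length - maxHistory;
  // splice removes the oldest entries from `updated` and returns them.
  const toDelete = excess > 0 ? updated.splice(0, excess) : [];
  return { history: updated, toDelete };
}
```

A periodic timer would call `shouldSaveState`, write the state file, then pass its name to `recordSave` and remove whatever is listed in `toDelete`.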
phiresky
fb297574c7
add documentation of env variables for socks proxy + browser extensions (#120) 2022-03-13 15:00:46 -07:00