Commit graph

364 commits

Author SHA1 Message Date
Ilya Kreymer
e534f49e5e recorder: don't do streaming fetch for unknown or large responses if content-type is
text/html, always need to load in browser to continue
type fixes: remove a few 'any' types in reqresp
2023-11-09 08:57:13 -08:00
emma
5a6bef890f revert adding prettier 2023-11-08 16:40:49 -05:00
Emma Segal-Grossman
24cad2a33e
Merge pull request #427 from webrecorder/recorder-work-ts--typescript-eslint
Add typescript-eslint and prettier to repo
2023-11-08 14:52:22 -05:00
emma
e5b0813d89 add formatting step to precommit hook 2023-11-08 14:41:15 -05:00
emma
87a6002473 add prettier and format everything 2023-11-08 14:37:57 -05:00
emma
7330dd382b fix eslint-reported issues 2023-11-08 14:09:15 -05:00
emma
b62aa5e4b6 add typescript-eslint 2023-11-08 13:45:10 -05:00
Ilya Kreymer
6df1ecfdd4 fix merge 2023-11-07 22:16:16 -08:00
Ilya Kreymer
2538bcc283 Merge branch 'dev-1.0.0' into recorder-work-ts 2023-11-07 22:13:20 -08:00
Ilya Kreymer
df0fe887ce Merge branch 'recorder-work' into recorder-work-ts 2023-11-07 22:01:37 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Waiting for all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
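Commit 877d9f5b44 mentions waiting for all requests to finish before moving to the next page, bounded by the page timeout. A minimal sketch of that pattern follows. The function name `waitForPending`, its parameters, and its return values are hypothetical and are not taken from the crawler's code.

```typescript
// Hypothetical helper: wait until every pending request promise settles,
// or give up once timeoutMs has elapsed, whichever happens first.
async function waitForPending(
  pending: Promise<unknown>[],
  timeoutMs: number,
): Promise<"finished" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves (never rejects) once the timeout elapses.
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });

  // allSettled tolerates failed requests: a rejected fetch still counts as done.
  const all = Promise.allSettled(pending).then(() => "finished" as const);

  try {
    return await Promise.race([all, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Returning a status instead of throwing lets the caller log timed-out requests and still continue to the next page.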
Ilya Kreymer
868cd7ab48 remove pywb dependency
- only keep py-wacz
- use cdxj-indexer for --generateCDX
2023-11-07 20:01:42 -08:00
Ilya Kreymer
468a00939d logging: reenable logging for timed out pending requests for now 2023-11-07 18:24:13 -08:00
Ilya Kreymer
034de9a78d fix warcinfo test after version update 2023-11-07 17:38:15 -08:00
Ilya Kreymer
988bf7a08a remove unused code, remove references to pywb 2023-11-07 17:24:04 -08:00
Ilya Kreymer
e7a850c380
Apply suggestions from code review, remove commented out code
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-07 17:20:08 -08:00
Ilya Kreymer
c43e314d07 merge fixes 2023-11-03 18:37:44 -07:00
Ilya Kreymer
e2c95d6c1d Merge branch 'recorder-work' (0.12.1) into recorder-work-ts 2023-11-03 18:35:46 -07:00
Ilya Kreymer
53cfd39416 Merge branch 'main' (0.12.1 release) into recorder-work 2023-11-03 18:31:18 -07:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00
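Commit dd7b926d87 describes using a glob match when an exclusion is not really a regex, "as determined by string check". One way such a check could look is sketched below. The name `toScanPattern`, the result shape, and the exact set of metacharacters are assumptions for illustration and may differ from the actual implementation.

```typescript
// Characters that carry special meaning in a regular expression.
// Redis glob metacharacters (* ? [ ] \) are a subset of these, so a string
// that passes this check is also safe to embed in a glob pattern.
const REGEX_META = /[\\^$.*+?()[\]{}|]/;

type ScanPattern =
  | { kind: "glob"; pattern: string }
  | { kind: "regex"; regex: RegExp };

// Hypothetical helper: choose the cheaper glob form when the exclusion is a plain string.
function toScanPattern(exclusion: string): ScanPattern {
  if (!REGEX_META.test(exclusion)) {
    // Plain substring: a "contains" glob that a scan MATCH option could use.
    return { kind: "glob", pattern: `*${exclusion}*` };
  }
  // Anything with a metacharacter is treated as a full regex.
  return { kind: "regex", regex: new RegExp(exclusion) };
}
```

This check is deliberately conservative: a literal `.` also counts as a metacharacter, so an exclusion such as `example.com` would still take the regex path in this sketch.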
Ilya Kreymer
15661eb9c8
More flexible multi value arg parsing + README update for 0.12.0 (#422)
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple-value arguments specified as either one string or
multiple strings, using the array type plus a coerce function.

This allows for `choice` option to also be used to validate the options,
when needed.

With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!

Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.

Also updated README with new text extraction options and bumped version
to 0.12.0

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-02 11:47:37 -07:00
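The multi-value parsing described in commit 15661eb9c8 relies on an array-typed option plus a coerce function. The sketch below shows only the coerce step as a standalone function, so it runs without yargs. The name `coerceMultiValue` and the error message are hypothetical; the allowed values are the three named in the commit message.

```typescript
// Allowed values for --text, as listed in the commit message.
const TEXT_CHOICES = ["to-pages", "to-warc", "final-to-warc"] as const;
type TextChoice = (typeof TEXT_CHOICES)[number];

// Hypothetical coerce function: accepts one comma-separated string or an array
// of such strings (which an array-typed option produces when a flag is
// repeated) and flattens everything into one validated list.
function coerceMultiValue(value: string | string[]): TextChoice[] {
  const entries = Array.isArray(value) ? value : [value];
  const result: TextChoice[] = [];

  for (const entry of entries) {
    for (const part of entry.split(",")) {
      const name = part.trim();
      if (name.length === 0) continue; // ignore stray commas
      if ((TEXT_CHOICES as readonly string[]).indexOf(name) === -1) {
        throw new Error(`Invalid value "${name}"`);
      }
      result.push(name as TextChoice);
    }
  }
  return result;
}
```

With this shape, `coerceMultiValue("to-pages,to-warc,final-to-warc")`, `coerceMultiValue(["to-pages,to-warc", "final-to-warc"])` and `coerceMultiValue(["to-pages", "to-warc", "final-to-warc"])` all produce the same list, mirroring the three equivalent command lines in the commit message.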
Ilya Kreymer
bc938f0137 remove ts ignores, update minio to enable typing 2023-11-01 18:58:50 -07:00
Ilya Kreymer
de8f2b8c89 workerid switch back to number, use custom WorkerId type 2023-11-01 18:40:10 -07:00
Ilya Kreymer
9a9267baeb ci: build js 2023-11-01 18:04:11 -07:00
Ilya Kreymer
557a50d9ba tests: fix test imports 2023-11-01 17:54:51 -07:00
Ilya Kreymer
f748c0e211 fix Sitemapper, support custom sitemap paths also 2023-11-01 17:35:40 -07:00
Ilya Kreymer
a317425842 more fixes, initial working version! 2023-11-01 17:24:35 -07:00
Ilya Kreymer
492ff81d67 migrate create-login-profile, fix build, update to nodenext build mode 2023-11-01 17:10:31 -07:00
Ilya Kreymer
a8ca0c683c convert main.ts and crawler.ts! 2023-11-01 12:59:31 -07:00
Ilya Kreymer
036af5f9fb more fixes 2023-11-01 10:22:32 -07:00
Ilya Kreymer
064fc4f472 add new files converted to ts! 2023-11-01 09:59:32 -07:00
Ilya Kreymer
93a6bbbdc2 additional fixes 2023-10-31 23:54:24 -07:00
Ilya Kreymer
26d57a3134 Merge branch 'recorder-work' into recorder-work-ts 2023-10-31 23:53:57 -07:00
Ilya Kreymer
8cc1a8c015 remove util 2023-10-31 23:14:56 -07:00
Ilya Kreymer
ccff712fb6 Merge branch 'main' into recorder-work 2023-10-31 23:11:35 -07:00
Ilya Kreymer
2aeda56d40
improved text extraction: (addresses #403) (#404)
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` to be equivalent to
`--text to-pages`.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-31 23:05:30 -07:00
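Commit 2aeda56d40 stores extracted text under `urn:text:<url>` after page load and, only if it differs, the final text under `urn:textFinal:<url>`. The naming rule can be sketched as follows. The `TextRecord` type and the `buildTextRecords` function are hypothetical and stand in for whatever the WARC resource writer actually does.

```typescript
// Hypothetical in-memory stand-in for a WARC resource record.
interface TextRecord {
  urn: string;
  text: string;
}

// Build the list of text records for one page: always the post-load text,
// plus the final text only when it differs from the post-load text.
function buildTextRecords(
  url: string,
  loadText: string,
  finalText: string,
): TextRecord[] {
  const records: TextRecord[] = [{ urn: `urn:text:${url}`, text: loadText }];

  if (finalText !== loadText) {
    records.push({ urn: `urn:textFinal:${url}`, text: finalText });
  }
  return records;
}
```

Skipping the second record when the text is unchanged avoids writing duplicate payloads for pages whose content does not change after load.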
Ilya Kreymer
064db52272 base image: bump brave to 1.59.120
version: bump to 0.12.0-beta.2
2023-10-26 19:48:49 -07:00
benoit74
bc730a0d37
Return User-Agent on all code path to set headers appropriately (#420)
Fixes #419
2023-10-25 12:32:10 -04:00
Ilya Kreymer
ffc1d3ffa4 quickfix: storage webhook, keep path and bytes! 2023-10-23 18:35:03 -07:00
Ilya Kreymer
8c92901889
load saved state fixes + redis tests (#415)
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
2023-10-23 09:36:10 -07:00
Ilya Kreymer
84f210e0b4 more type fixes 2023-10-22 20:41:57 -07:00
Ilya Kreymer
126f3faacf add worker.ts 2023-10-22 20:00:11 -07:00
Ilya Kreymer
d5341452d7 strict type fixes 2023-10-22 18:09:21 -07:00
Ilya Kreymer
555598e57d add argParser, some strict type checking fixes 2023-10-22 11:20:29 -07:00
Ilya Kreymer
728c8b423f more utils 2023-10-22 09:48:38 -07:00
Ilya Kreymer
52325d2159 convert a few more utils 2023-10-22 09:33:17 -07:00
Ilya Kreymer
dfb0ee6b32 more convos 2023-10-21 22:12:49 -07:00
Ilya Kreymer
e5fa61d4cf move to src/util 2023-10-21 21:39:38 -07:00
Ilya Kreymer
5e5b4de79b more ts work, compile all in src/ dir, add reqresp 2023-10-21 21:05:33 -07:00
Ilya Kreymer
e4baa0422d begin ts conversion! 2023-10-21 20:36:06 -07:00