Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	e534f49e5e	recorder: don't do streaming fetch for unknown or large responses if content-type is text/html, always need to load in browser to continue; type fixes: remove a few 'any' types in reqresp	2023-11-09 08:57:13 -08:00
emma	5a6bef890f	revert adding prettier	2023-11-08 16:40:49 -05:00
Emma Segal-Grossman	24cad2a33e	Merge pull request #427 from webrecorder/recorder-work-ts--typescript-eslint Add typescript-eslint and prettier to repo	2023-11-08 14:52:22 -05:00
emma	e5b0813d89	add formatting step to precommit hook	2023-11-08 14:41:15 -05:00
emma	87a6002473	add prettier and format everything	2023-11-08 14:37:57 -05:00
emma	7330dd382b	fix eslint-reported issues	2023-11-08 14:09:15 -05:00
emma	b62aa5e4b6	add typescript-eslint	2023-11-08 13:45:10 -05:00
Ilya Kreymer	6df1ecfdd4	fix merge	2023-11-07 22:16:16 -08:00
Ilya Kreymer	2538bcc283	Merge branch 'dev-1.0.0' into recorder-work-ts	2023-11-07 22:13:20 -08:00
Ilya Kreymer	df0fe887ce	Merge branch 'recorder-work' into recorder-work-ts	2023-11-07 22:01:37 -08:00
Ilya Kreymer	877d9f5b44	Use new browser-based archiving mechanism instead of pywb proxy (#424) Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 Changes include: - Recorder class for capturing CDP network traffic for each page. - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.) - WARC writing support via TS-based warcio.js library. - Generates single WARC file per worker (still need to add size rollover). - Request interception via Fetch.requestPaused - Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch via fetch() - Direct async fetch() capture of non-HTML URLs - Waiting for all requests to finish before moving on to the next page, up to the page timeout. - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use). - Removed pywb, using cdxj-indexer for --generateCDX option.	2023-11-07 21:38:50 -08:00
Ilya Kreymer	868cd7ab48	remove pywb dependency - only keep py-wacz - use cdxj-indexer for --generateCDX	2023-11-07 20:01:42 -08:00
Ilya Kreymer	468a00939d	logging: reenable logging for timed out pending requests for now	2023-11-07 18:24:13 -08:00
Ilya Kreymer	034de9a78d	fix warcinfo test after version update	2023-11-07 17:38:15 -08:00
Ilya Kreymer	988bf7a08a	remove unused code, remove references to pywb	2023-11-07 17:24:04 -08:00
Ilya Kreymer	e7a850c380	Apply suggestions from code review, remove commented out code Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-07 17:20:08 -08:00
Ilya Kreymer	c43e314d07	merge fixes	2023-11-03 18:37:44 -07:00
Ilya Kreymer	e2c95d6c1d	Merge branch 'recorder-work' (0.12.1) into recorder-work-ts	2023-11-03 18:35:46 -07:00
Ilya Kreymer	53cfd39416	Merge branch 'main' (0.12.1 release) into recorder-work	2023-11-03 18:31:18 -07:00
Ilya Kreymer	dd7b926d87	Exclusion Optimizations: follow-up to (#423) Follow-up to #408 - optimized exclusion filtering: - use zscan with default count instead of ordered scan to remove - use glob match when possible (non-regex as determined by string check) - move isInScope() check to worker to avoid creating a page and then closing for every excluded URL - tests: update saved-state test to be more resilient to delays args: also support '--text false' for backwards compatibility, fixes webrecorder/browsertrix-cloud#1334 bump to 0.12.1	2023-11-03 15:15:09 -07:00
Ilya Kreymer	15661eb9c8	More flexible multi value arg parsing + README update for 0.12.0 (#422) Updated arg parsing thanks to example in https://github.com/yargs/yargs/issues/846#issuecomment-517264899 to support multiple value arguments specified as either one string or multiple strings using array type + coerce function. This allows for `choice` option to also be used to validate the options, when needed. With this setup, `--text to-pages,to-warc,final-to-warc`, `--text to-pages,to-warc --text final-to-warc` and `--text to-pages --text to-warc --text final-to-warc` all result in the same configuration! Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system. Also updated README with new text extraction options and bumped version to 0.12.0 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 11:47:37 -07:00
Ilya Kreymer	bc938f0137	remove ts ignores, update minio to enable typing	2023-11-01 18:58:50 -07:00
Ilya Kreymer	de8f2b8c89	workerid switch back to number, use custom WorkerId type	2023-11-01 18:40:10 -07:00
Ilya Kreymer	9a9267baeb	ci: build js	2023-11-01 18:04:11 -07:00
Ilya Kreymer	557a50d9ba	tests: fix test imports	2023-11-01 17:54:51 -07:00
Ilya Kreymer	f748c0e211	fix Sitemapper, support custom sitemap paths also	2023-11-01 17:35:40 -07:00
Ilya Kreymer	a317425842	more fixes, initial working version!	2023-11-01 17:24:35 -07:00
Ilya Kreymer	492ff81d67	migrate create-login-profile, fix build, update to nodenext build mode	2023-11-01 17:10:31 -07:00
Ilya Kreymer	a8ca0c683c	convert main.ts and crawler.ts!	2023-11-01 12:59:31 -07:00
Ilya Kreymer	036af5f9fb	more fixes	2023-11-01 10:22:32 -07:00
Ilya Kreymer	064fc4f472	add new files converted to ts!	2023-11-01 09:59:32 -07:00
Ilya Kreymer	93a6bbbdc2	additional fixes	2023-10-31 23:54:24 -07:00
Ilya Kreymer	26d57a3134	Merge branch 'recorder-work' into recorder-work-ts	2023-10-31 23:53:57 -07:00
Ilya Kreymer	8cc1a8c015	remove util	2023-10-31 23:14:56 -07:00
Ilya Kreymer	ccff712fb6	Merge branch 'main' into recorder-work	2023-10-31 23:11:35 -07:00
Ilya Kreymer	2aeda56d40	improved text extraction: (addresses #403) (#404) - use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to get the snapshot (consistent with ArchiveWeb.page) - should be slightly more performant - keep option to use DOM.getDocument - refactor warc resource writing to separate class, used by text extraction and screenshots - write extracted text to WARC files as 'urn:text:<url>' after page loads, similar to screenshots - also store final text to WARC as 'urn:textFinal:<url>' if it is different - cli options: update `--text` to take one or more comma-separated string options `--text to-warc,to-pages,final-to-warc`. For backwards compatibility, support `--text` and `--text true` to be equivalent to `--text to-pages`. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-31 23:05:30 -07:00
Ilya Kreymer	064db52272	base image: bump brave to 1.59.120; version: bump to 0.12.0-beta.2	2023-10-26 19:48:49 -07:00
benoit74	bc730a0d37	Return User-Agent on all code paths to set headers appropriately (#420) Fixes #419	2023-10-25 12:32:10 -04:00
Ilya Kreymer	ffc1d3ffa4	quickfix: storage webhook, keep path and bytes!	2023-10-23 18:35:03 -07:00
Ilya Kreymer	8c92901889	load saved state fixes + redis tests (#415) - set done key correctly, just an int now - also check if array for old-style save states (for backwards compatibility) - fixes #411 - tests: includes tests using redis: tests save state + dynamically adding exclusions (follow up to #408) - adds `--debugAccessRedis` flag to allow accessing local redis outside container	2023-10-23 09:36:10 -07:00
Ilya Kreymer	84f210e0b4	more type fixes	2023-10-22 20:41:57 -07:00
Ilya Kreymer	126f3faacf	add worker.ts	2023-10-22 20:00:11 -07:00
Ilya Kreymer	d5341452d7	strict type fixes	2023-10-22 18:09:21 -07:00
Ilya Kreymer	555598e57d	add argParser, some strict type checking fixes	2023-10-22 11:20:29 -07:00
Ilya Kreymer	728c8b423f	more utils	2023-10-22 09:48:38 -07:00
Ilya Kreymer	52325d2159	convert a few more utils	2023-10-22 09:33:17 -07:00
Ilya Kreymer	dfb0ee6b32	more convos	2023-10-21 22:12:49 -07:00
Ilya Kreymer	e5fa61d4cf	move to src/util	2023-10-21 21:39:38 -07:00
Ilya Kreymer	5e5b4de79b	more ts work, compile all in src/ dir, add reqresp	2023-10-21 21:05:33 -07:00
Ilya Kreymer	e4baa0422d	begin ts conversion!	2023-10-21 20:36:06 -07:00
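
The multi-value argument handling described in commit 15661eb9c8 can be sketched roughly as follows. This is a minimal illustration, not code from the repository: the names `coerceMulti` and `TEXT_CHOICES` are hypothetical. It assumes the option value arrives either as a single string or as an array of strings (as it would from an array-typed option), and that a function like this would be supplied as the option's coerce function.

```typescript
// Hypothetical sketch: not taken from the browsertrix-crawler source.
// Allowed values for --text, as listed in the commit message.
const TEXT_CHOICES = ["to-pages", "to-warc", "final-to-warc"] as const;

// Normalize a value given as one comma-separated string, several strings,
// or a mix of both, into a flat, de-duplicated list of validated choices.
function coerceMulti(
  value: string | string[],
  choices: readonly string[],
): string[] {
  const raw = Array.isArray(value) ? value : [value];
  const result: string[] = [];
  for (const item of raw) {
    for (const part of item.split(",")) {
      const name = part.trim();
      if (!name) {
        continue; // ignore empty segments such as a trailing comma
      }
      if (!choices.includes(name)) {
        throw new Error(
          `Invalid value "${name}", expected one of: ${choices.join(", ")}`,
        );
      }
      if (!result.includes(name)) {
        result.push(name);
      }
    }
  }
  return result;
}

// The three spellings from the commit message normalize to the same list.
console.log(coerceMulti("to-pages,to-warc,final-to-warc", TEXT_CHOICES));
console.log(coerceMulti(["to-pages,to-warc", "final-to-warc"], TEXT_CHOICES));
console.log(coerceMulti(["to-pages", "to-warc", "final-to-warc"], TEXT_CHOICES));
```

Validating inside the coerce step is one way to keep choice checking working when values are comma-joined; the actual implementation may split and validate in separate stages.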

364 commits