Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2026-04-18 15:10:21 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	ff6a602dcd	disable profile cache, don't skip direct fetch if no frame	2026-02-02 12:44:48 -08:00
Ilya Kreymer	b7a7474243	Merge branch 'fix-ua' into priv-testing	2026-02-02 12:42:40 -08:00
Ilya Kreymer	a142e9f62e	only get default UA if adding suffix	2026-02-02 12:27:53 -08:00
Ilya Kreymer	afdc37412e	Merge branch 'fix-network-load-param' into priv-testing	2026-02-02 03:29:51 -08:00
Ilya Kreymer	021206950c	correctly skip network fetch if in browser context don't do direct fetch if no page	2026-02-02 03:28:38 -08:00
Ilya Kreymer	9ece619aa8	Merge branch 'fix-network-load-param' into priv-testing	2026-02-02 01:28:22 -08:00
Ilya Kreymer	1a18e7cb9c	Fix setting default user-agent - Only use major version from, set rest to 0.0.0 to match default behavior - Don't pass --user-agent if just using default for simplicity	2026-02-02 01:26:59 -08:00
Ilya Kreymer	9413143350	fix browser network loading: - add missing CDP param that resulted in browser network being skipped! - try browser network for direct fetch too, in case of blocking/different permissions - only default to node fetch when network loading failed Possible fix for #960, could be blocking non-browser fetch of srcset URLs	2026-02-02 01:22:34 -08:00
Ilya Kreymer	6f71e6e25c	don't skip direct fetch	2026-02-01 21:39:45 -08:00
Ilya Kreymer	3a43fe1277	fix UA computation, don't set UA if default	2026-02-01 14:22:22 -08:00
Ilya Kreymer	a2b14e833d	privacy enhancements testing: - disable direct fetch altogether - disable non-proxied WebRTC	2026-02-01 11:10:44 -08:00
Ilya Kreymer	14d866e17a	post-rebase fix	2026-01-30 13:33:15 -08:00
Ilya Kreymer	901ec94fa5	cleanup	2026-01-30 13:32:37 -08:00
Ilya Kreymer	00de3c13a6	compute total size of revisits per crawl: conservedSize += (origSize * num of revisits) - (sum of revisit sizes)	2026-01-30 13:32:37 -08:00
Ilya Kreymer	77f04b97c3	simplify revisit update logic, always incr when revisit encountered	2026-01-30 13:32:37 -08:00
Ilya Kreymer	801333bcfe	update dupe count when aggregating	2026-01-30 13:32:37 -08:00
Ilya Kreymer	a423be06dc	move commit key set to earlier	2026-01-30 13:32:37 -08:00
Ilya Kreymer	bcdd04f8d7	- track uncommitted crawl ids in separate key - cleanup uncommitted crawl id keys when crawl is canceled - simplify final exit checks, for operations on final crawler exit	2026-01-30 13:32:37 -08:00
Ilya Kreymer	4c6bce7db0	commit to merged index only on final exit	2026-01-30 13:32:36 -08:00
Ilya Kreymer	eff9b53930	cleanup, always incr dupeCount even if size is smaller	2026-01-30 13:32:36 -08:00
Ilya Kreymer	fd5308cf4a	fix stats computation by always using done / total stats from redis	2026-01-30 13:32:36 -08:00
Ilya Kreymer	f0db436284	Optimize Indexing + Progress Tracking (#950) Optimize import/purge indexing: - download remote cdx locally, use built-in gzip decompression and line reader for parsing - batch import operations (at 4k operations) to resolve promise - also pipeline redis operations where possible, grouping related functionality together, instead of having individual update methods - track progress via 'updateProgress' stat, update before every file is imported. - for purge, first 50% of progress is import, last 50% is the commit/merge of hashes into alldupes	2026-01-30 13:32:36 -08:00
Ilya Kreymer	1452e20e7a	use getFileOrUrlJson() helper for index fetching	2026-01-30 13:32:36 -08:00
Ilya Kreymer	fdd9958e12	- add 'estimatedRedundantSize' calculation for estimated wasted space - rename to 'conservedSize' calculation for estimated conserved space - add 'dupeUrls' to track duplicate URLs added on each crawl explicitly - incrStat and type checking for stat types - add removedCrawls and removedCrawlSize to track removed crawls - clean up stats, add incrStat() to better keep track of stats in one place	2026-01-30 13:32:36 -08:00
Ilya Kreymer	87b6485a1c	avoid queueing same URLs with queue set	2026-01-30 13:32:36 -08:00
Ilya Kreymer	fecde73081	explicit init to check if wacz is valid	2026-01-30 13:32:36 -08:00
Ilya Kreymer	a735011250	set totalSize in counts	2026-01-30 13:32:36 -08:00
Ilya Kreymer	e1760d3ec4	ensure src wacz list is updated on commit also store totalCrawls and removableCrawls	2026-01-30 13:32:36 -08:00
Ilya Kreymer	bebaf38e0b	error handling: - skip invalid wacz files provided for import - skip invalid multi-wacz json files provided for import - tests: add invalid multi-wacz file for testing	2026-01-30 13:32:36 -08:00
Ilya Kreymer	86ba514275	remove extra sleep	2026-01-30 13:32:36 -08:00
Ilya Kreymer	c62eb6a62a	always commit	2026-01-30 13:32:36 -08:00
Ilya Kreymer	ffb278a956	add logging	2026-01-30 13:32:36 -08:00
Ilya Kreymer	2c94eaa512	fix getHashDupe, use all key	2026-01-30 13:32:36 -08:00
Ilya Kreymer	44a247d43a	indexer: ensure indexer size is number	2026-01-30 13:32:36 -08:00
Ilya Kreymer	b96c409729	include size in hash key data add hash dupe when WARC record actually written store savedSize as diff between original and revisit WARC records indexer: compute savedSize by tracking subtracting revisit records to be added, if revisit added before original	2026-01-30 13:32:36 -08:00
Ilya Kreymer	0872279aa1	add urlNormalize to addHashDupe	2026-01-30 13:32:36 -08:00
Ilya Kreymer	27a19bb64f	fix size count typo, unique == not dupe!	2026-01-30 13:32:36 -08:00
Ilya Kreymer	ff622013de	don't commit to all if will be purged anyway	2026-01-30 13:32:36 -08:00
Ilya Kreymer	8618310a6c	update purging of crawls to readd/recommit from added crawls, instead of removing hashes from removed crawls, as hashes may be present in other crawls remove crawl-specific keys for removed crawls	2026-01-30 13:32:36 -08:00
Ilya Kreymer	8311b61fa1	uniq -> unique add 'removable' count for number of crawls that can be removed from the index	2026-01-30 13:32:36 -08:00
Ilya Kreymer	55fbe43b22	stats: - compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per crawl key - add stats on crawl commit, remove on crawl remove - tests: update tests to check stats	2026-01-30 13:32:36 -08:00
Ilya Kreymer	2e37fa3e54	don't include current crawl as self-reference dependency	2026-01-30 13:32:36 -08:00
Ilya Kreymer	d31530a753	cleanup pass: - support dedupe without requiring wacz, no crawl dependency tracking stored - add dedupe test w/o wacz - cleanup dedupe related naming	2026-01-30 13:32:36 -08:00
Ilya Kreymer	0f857c6572	generate wacz filename if deduping	2026-01-30 13:32:36 -08:00
Ilya Kreymer	6e259ea012	add removing option to also remove unused crawls if doing a full sync, disable by default	2026-01-30 13:32:36 -08:00
Ilya Kreymer	a919400a99	indexer optimize: commit only if added	2026-01-30 13:32:36 -08:00
Ilya Kreymer	4104ba8361	rename 'dedup' -> 'dedupe' for consistency	2026-01-30 13:32:36 -08:00
Ilya Kreymer	8b2ac7ae67	always return wacz, store wacz depends only for current wacz store crawlid depends for entire crawl	2026-01-30 13:32:36 -08:00
Ilya Kreymer	4c84cdf5d3	cleanup, keep compatibility with redis 6 still set to 'post-crawl' state after uploading	2026-01-30 13:32:36 -08:00
Ilya Kreymer	645098a142	update to new data model: - hashes stored in separate crawl specific entries, h:<crawlid> - wacz files stored in crawl specific list, c:<crawlid>:wacz - hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set - store filename, crawlId in related.requires list entries for each wacz	2026-01-30 13:32:33 -08:00


293 commits
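
Commit 00de3c13a6 above states the per-crawl accounting rule as `conservedSize += (origSize * num of revisits) - (sum of revisit sizes)`. The following is a minimal sketch of that arithmetic only, not the crawler's actual code: the `RevisitGroup` type and the `computeConservedSize` function are hypothetical names introduced here for illustration, and sizes are assumed to be byte counts.

```typescript
// Hypothetical shape (not from the repository): one original record's size
// and the sizes of the revisit records written in its place.
interface RevisitGroup {
  origSize: number;
  revisitSizes: number[];
}

// Applies the formula quoted in commit 00de3c13a6 to each group:
//   conservedSize += (origSize * num of revisits) - (sum of revisit sizes)
function computeConservedSize(groups: RevisitGroup[]): number {
  let conservedSize = 0;
  for (const { origSize, revisitSizes } of groups) {
    const sumOfRevisits = revisitSizes.reduce((acc, size) => acc + size, 0);
    conservedSize += origSize * revisitSizes.length - sumOfRevisits;
  }
  return conservedSize;
}

// Example: a 1000-byte original revisited twice, with 100- and 150-byte
// revisit records, conserves 2 * 1000 - 250 = 1750 bytes.
const exampleConserved = computeConservedSize([
  { origSize: 1000, revisitSizes: [100, 150] },
]);
```

Under these assumptions, a group with no revisits contributes nothing, so only payloads that were actually deduplicated count toward the total.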