Commit graph

293 commits

Author SHA1 Message Date
Ilya Kreymer
ff6a602dcd disable profile cache, don't skip direct fetch if no frame 2026-02-02 12:44:48 -08:00
Ilya Kreymer
b7a7474243 Merge branch 'fix-ua' into priv-testing 2026-02-02 12:42:40 -08:00
Ilya Kreymer
a142e9f62e only get default UA if adding suffix 2026-02-02 12:27:53 -08:00
Ilya Kreymer
afdc37412e Merge branch 'fix-network-load-param' into priv-testing 2026-02-02 03:29:51 -08:00
Ilya Kreymer
021206950c correctly skip network fetch if in browser context
don't do direct fetch if no page
2026-02-02 03:28:38 -08:00
Ilya Kreymer
9ece619aa8 Merge branch 'fix-network-load-param' into priv-testing 2026-02-02 01:28:22 -08:00
Ilya Kreymer
1a18e7cb9c Fix setting default user-agent
- Only use the major version, set the rest to 0.0.0 to match default behavior
- Don't pass --user-agent if just using default for simplicity
2026-02-02 01:26:59 -08:00
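
The two points in the commit above can be illustrated with a minimal sketch. The function names and the sample version string are hypothetical and not taken from the crawler's source; the `--user-agent=<value>` form of the flag is also an assumption.

```typescript
// Hypothetical helper: keep only the major version and zero out the rest,
// e.g. "133.0.6943.53" becomes "133.0.0.0".
function reduceBrowserVersion(fullVersion: string): string {
  const major = fullVersion.split(".")[0];
  return `${major}.0.0.0`;
}

// Hypothetical helper: only pass a user-agent flag when a custom value is
// set; with no custom value, no flag is emitted and the browser default applies.
function userAgentArgs(customUserAgent?: string): string[] {
  return customUserAgent ? [`--user-agent=${customUserAgent}`] : [];
}

console.log(reduceBrowserVersion("133.0.6943.53")); // 133.0.0.0
console.log(userAgentArgs().length); // 0
```

Returning an empty argument list for the default case keeps the launch arguments free of a redundant flag.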
Ilya Kreymer
9413143350 fix browser network loading:
- add missing CDP param that resulted in browser network being skipped!
- try browser network for direct fetch too, in case of blocking/different permissions
- only default to node fetch when network loading failed

Possible fix for #960, could be blocking non-browser fetch of srcset URLs
2026-02-02 01:22:34 -08:00
Ilya Kreymer
6f71e6e25c don't skip direct fetch 2026-02-01 21:39:45 -08:00
Ilya Kreymer
3a43fe1277 fix UA computation, don't set UA if default 2026-02-01 14:22:22 -08:00
Ilya Kreymer
a2b14e833d privacy enhancements testing:
- disable direct fetch altogether
- disable non-proxied WebRTC
2026-02-01 11:10:44 -08:00
Ilya Kreymer
14d866e17a post-rebase fix 2026-01-30 13:33:15 -08:00
Ilya Kreymer
901ec94fa5 cleanup 2026-01-30 13:32:37 -08:00
Ilya Kreymer
00de3c13a6 compute total size of revisits per crawl:
conservedSize += (origSize * num of revisits) - (sum of revisit sizes)
2026-01-30 13:32:37 -08:00
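
The formula in the commit above can be read as the following minimal sketch. The function name, parameters, and sample sizes are hypothetical and only illustrate the arithmetic.

```typescript
// Hypothetical helper illustrating the formula from the commit message:
// conservedSize += (origSize * number of revisits) - (sum of revisit sizes)
function conservedSizeDelta(origSize: number, revisitSizes: number[]): number {
  const sumOfRevisits = revisitSizes.reduce((sum, size) => sum + size, 0);
  return origSize * revisitSizes.length - sumOfRevisits;
}

// Example: a 10000-byte response revisited three times, where each
// revisit record takes 400 bytes.
let conservedSize = 0;
conservedSize += conservedSizeDelta(10000, [400, 400, 400]);
console.log(conservedSize); // 28800
```

Each revisit would otherwise have cost a full copy of the original, so the saving is the full size times the number of revisits, minus what the revisit records themselves take up.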
Ilya Kreymer
77f04b97c3 simplify revisit update logic, always incr when revisit encountered 2026-01-30 13:32:37 -08:00
Ilya Kreymer
801333bcfe update dupe count when aggregating 2026-01-30 13:32:37 -08:00
Ilya Kreymer
a423be06dc move commit key set to earlier 2026-01-30 13:32:37 -08:00
Ilya Kreymer
bcdd04f8d7 - track uncommitted crawl ids in separate key
- cleanup uncommitted crawl id keys when crawl is canceled
- simplify final exit checks, for operations on final crawler exit
2026-01-30 13:32:37 -08:00
Ilya Kreymer
4c6bce7db0 commit to merged index only on final exit 2026-01-30 13:32:36 -08:00
Ilya Kreymer
eff9b53930 cleanup, always incr dupeCount even if size is smaller 2026-01-30 13:32:36 -08:00
Ilya Kreymer
fd5308cf4a fix stats computation by always using done / total stats from redis 2026-01-30 13:32:36 -08:00
Ilya Kreymer
f0db436284 Optimize Indexing + Progress Tracking (#950)
Optimize import/purge indexing:
- download remote cdx locally, use built-in gzip decompression and line
reader for parsing
- batch import operations (at 4k operations) to resolve promise
- also pipeline redis operations where possible, grouping related
functionality together, instead of having individual update methods
- track progress via 'updateProgress' stat, update before every file is
imported.
- for purge, first 50% of progress is import, last 50% is the
commit/merge of hashes into alldupes
2026-01-30 13:32:36 -08:00
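
The batching and the two-phase progress split described in the commit above could look roughly like this. It is a sketch under stated assumptions: the names `BATCH_SIZE`, `toBatches`, and `purgeProgress` are hypothetical, and the batch size of 4000 is an approximation of the "4k operations" in the message.

```typescript
// Hypothetical constant approximating the "4k operations" batch size.
const BATCH_SIZE = 4000;

// Split a list of pending operations into batches of at most batchSize,
// so each batch can be awaited before the next one is queued.
function toBatches<T>(ops: T[], batchSize: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < ops.length; i += batchSize) {
    batches.push(ops.slice(i, i + batchSize));
  }
  return batches;
}

// Map a purge phase and its own completion fraction (0..1) onto overall
// progress: import covers the first 50%, commit/merge the last 50%.
function purgeProgress(phase: "import" | "commit", fraction: number): number {
  const clamped = Math.min(1, Math.max(0, fraction));
  return phase === "import" ? clamped * 0.5 : 0.5 + clamped * 0.5;
}

const ops = Array.from({ length: 9000 }, (_, i) => i);
console.log(toBatches(ops).map((batch) => batch.length)); // [ 4000, 4000, 1000 ]
console.log(purgeProgress("commit", 0.5)); // 0.75
```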
Ilya Kreymer
1452e20e7a use getFileOrUrlJson() helper for index fetching 2026-01-30 13:32:36 -08:00
Ilya Kreymer
fdd9958e12 - add 'estimatedRedundantSize' calculation for estimated wasted space
- rename to 'conservedSize' calculation for estimated conserved space
- add 'dupeUrls' to track duplicate URLs added on each crawl explicitly
- incrStat and type checking for stat types
- add removedCrawls and removedCrawlSize to track removed crawls
- clean up stats, add incrStat() to better keep track of stats in one place
2026-01-30 13:32:36 -08:00
Ilya Kreymer
87b6485a1c avoid queueing same URLs with queue set 2026-01-30 13:32:36 -08:00
Ilya Kreymer
fecde73081 explicit init to check if wacz is valid 2026-01-30 13:32:36 -08:00
Ilya Kreymer
a735011250 set totalSize in counts 2026-01-30 13:32:36 -08:00
Ilya Kreymer
e1760d3ec4 ensure src wacz list is updated on commit
also store totalCrawls and removableCrawls
2026-01-30 13:32:36 -08:00
Ilya Kreymer
bebaf38e0b error handling:
- skip invalid wacz files provided for import
- skip invalid multi-wacz json files provided for import
- tests: add invalid multi-wacz file for testing
2026-01-30 13:32:36 -08:00
Ilya Kreymer
86ba514275 remove extra sleep 2026-01-30 13:32:36 -08:00
Ilya Kreymer
c62eb6a62a always commit 2026-01-30 13:32:36 -08:00
Ilya Kreymer
ffb278a956 add logging 2026-01-30 13:32:36 -08:00
Ilya Kreymer
2c94eaa512 fix getHashDupe, use all key 2026-01-30 13:32:36 -08:00
Ilya Kreymer
44a247d43a indexer: ensure indexer size is number 2026-01-30 13:32:36 -08:00
Ilya Kreymer
b96c409729 include size in hash key data
add hash dupe when WARC record actually written
store savedSize as diff between original and revisit WARC records
indexer: compute savedSize by tracking and subtracting revisit records to be added, if revisit added before original
2026-01-30 13:32:36 -08:00
Ilya Kreymer
0872279aa1 add urlNormalize to addHashDupe 2026-01-30 13:32:36 -08:00
Ilya Kreymer
27a19bb64f fix size count typo, unique == not dupe! 2026-01-30 13:32:36 -08:00
Ilya Kreymer
ff622013de don't commit to all if will be purged anyway 2026-01-30 13:32:36 -08:00
Ilya Kreymer
8618310a6c update purging of crawls to readd/recommit from added crawls,
instead of removing hashes from removed crawls, as hashes may be present in other crawls
remove crawl-specific keys for removed crawls
2026-01-30 13:32:36 -08:00
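
The reasoning in the commit above (rebuild from the crawls that remain, rather than deleting the hashes of removed crawls) can be shown with a small in-memory sketch. The type and function names are hypothetical, and plain arrays stand in for the real index.

```typescript
// Hypothetical shape: crawl id -> hashes recorded by that crawl.
type CrawlHashes = Record<string, string[]>;

// Rebuild the merged index from the crawls that are kept. Deleting the
// hashes of a removed crawl directly would be wrong, because the same hash
// may still be present in a crawl that remains.
function rebuildMergedIndex(crawls: CrawlHashes, removed: string[]): string[] {
  const merged: string[] = [];
  Object.keys(crawls).forEach((crawlId) => {
    if (removed.indexOf(crawlId) !== -1) return; // skip removed crawls
    crawls[crawlId].forEach((hash) => {
      if (merged.indexOf(hash) === -1) merged.push(hash);
    });
  });
  return merged;
}

const crawlHashes: CrawlHashes = { a: ["h1", "h2"], b: ["h2", "h3"] };
console.log(rebuildMergedIndex(crawlHashes, ["a"])); // [ 'h2', 'h3' ]
```

Here `h2` survives the removal of crawl `a` because crawl `b` still references it, which is the case a remove-by-hash approach would get wrong.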
Ilya Kreymer
8311b61fa1 uniq -> unique
add 'removable' count for number of crawls that can be removed from the index
2026-01-30 13:32:36 -08:00
Ilya Kreymer
55fbe43b22 stats:
- compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per crawl key
- add stats on crawl commit, remove on crawl remove
- tests: update tests to check stats
2026-01-30 13:32:36 -08:00
Ilya Kreymer
2e37fa3e54 don't include current crawl as self-reference dependency 2026-01-30 13:32:36 -08:00
Ilya Kreymer
d31530a753 cleanup pass:
- support dedupe without requiring wacz, no crawl dependency tracking stored
- add dedupe test w/o wacz
- cleanup dedupe related naming
2026-01-30 13:32:36 -08:00
Ilya Kreymer
0f857c6572 generate wacz filename if deduping 2026-01-30 13:32:36 -08:00
Ilya Kreymer
6e259ea012 add option to also remove unused crawls when doing a full sync, disabled by default 2026-01-30 13:32:36 -08:00
Ilya Kreymer
a919400a99 indexer optimize: commit only if added 2026-01-30 13:32:36 -08:00
Ilya Kreymer
4104ba8361 rename 'dedup' -> 'dedupe' for consistency 2026-01-30 13:32:36 -08:00
Ilya Kreymer
8b2ac7ae67 always return wacz, store wacz depends only for current wacz
store crawlid depends for entire crawl
2026-01-30 13:32:36 -08:00
Ilya Kreymer
4c84cdf5d3 cleanup, still keep compatibility with redis 6
set to 'post-crawl' state after uploading
2026-01-30 13:32:36 -08:00
Ilya Kreymer
645098a142 update to new data model:
- hashes stored in separate crawl specific entries, h:<crawlid>
- wacz files stored in crawl specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set
- store filename, crawlId in related.requires list entries for each wacz
2026-01-30 13:32:33 -08:00
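
The data model in the commit above can be sketched with an in-memory stand-in. The key names (`h:<crawlid>`, `c:<crawlid>:wacz`, `alldupes`, `allcrawls`) come from the commit message; everything else is hypothetical, including the helper names, the stored value format, and the choice to keep an existing `alldupes` entry when a hash is already present. No real Redis client is used.

```typescript
// Key names from the commit message; helper names are hypothetical.
const ALL_DUPES_KEY = "alldupes";
const ALL_CRAWLS_KEY = "allcrawls";
const crawlHashesKey = (crawlId: string): string => `h:${crawlId}`;
const crawlWaczKey = (crawlId: string): string => `c:${crawlId}:wacz`;

// In-memory stand-in for the stored hashes, lists, and sets.
interface StoreSketch {
  hashes: Record<string, Record<string, string>>;
  lists: Record<string, string[]>;
  sets: Record<string, string[]>;
}

// Stage a hash under the crawl-specific key while the crawl is running.
function addHash(store: StoreSketch, crawlId: string, hash: string, value: string): void {
  const key = crawlHashesKey(crawlId);
  const table = store.hashes[key] || (store.hashes[key] = {});
  table[hash] = value;
}

// Record a WACZ file in the crawl-specific list.
function addWacz(store: StoreSketch, crawlId: string, filename: string): void {
  const key = crawlWaczKey(crawlId);
  const list = store.lists[key] || (store.lists[key] = []);
  list.push(filename);
}

// On crawl completion, merge staged hashes into 'alldupes' and add the
// crawl to 'allcrawls'. Assumption: an existing entry is not overwritten.
function commitCrawl(store: StoreSketch, crawlId: string): void {
  const staged = store.hashes[crawlHashesKey(crawlId)] || {};
  const all = store.hashes[ALL_DUPES_KEY] || (store.hashes[ALL_DUPES_KEY] = {});
  Object.keys(staged).forEach((hash) => {
    if (!(hash in all)) all[hash] = staged[hash];
  });
  const crawls = store.sets[ALL_CRAWLS_KEY] || (store.sets[ALL_CRAWLS_KEY] = []);
  if (crawls.indexOf(crawlId) === -1) crawls.push(crawlId);
}

const store: StoreSketch = { hashes: {}, lists: {}, sets: {} };
addHash(store, "crawl1", "sha256:abc", "example-value");
addWacz(store, "crawl1", "crawl1.wacz");
commitCrawl(store, "crawl1");
console.log(Object.keys(store.hashes)); // [ 'h:crawl1', 'alldupes' ]
```

Staging hashes per crawl first means an unfinished or canceled crawl never touches the merged index.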