Ilya Kreymer
ff6a602dcd
disable profile cache, don't skip direct fetch if no frame
2026-02-02 12:44:48 -08:00
Ilya Kreymer
b7a7474243
Merge branch 'fix-ua' into priv-testing
2026-02-02 12:42:40 -08:00
Ilya Kreymer
a142e9f62e
only get default UA if adding suffix
2026-02-02 12:27:53 -08:00
Ilya Kreymer
afdc37412e
Merge branch 'fix-network-load-param' into priv-testing
2026-02-02 03:29:51 -08:00
Ilya Kreymer
021206950c
correctly skip network fetch if in browser context
don't do direct fetch if no page
2026-02-02 03:28:38 -08:00
Ilya Kreymer
9ece619aa8
Merge branch 'fix-network-load-param' into priv-testing
2026-02-02 01:28:22 -08:00
Ilya Kreymer
1a18e7cb9c
Fix setting default user-agent
- Only use the major version, set the rest to 0.0.0 to match default behavior
- Don't pass --user-agent if just using default for simplicity
2026-02-02 01:26:59 -08:00
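
The user-agent handling described in the commit above could look roughly like the sketch below. It is a minimal illustration, not the crawler's code: `reduceVersion`, `userAgentArgs`, and `getDefaultUA` are hypothetical names, and only the `--user-agent` flag comes from the commit message.

```typescript
// Hypothetical helper: keep only the major version and zero out the rest,
// e.g. "120.0.6099.109" becomes "120.0.0.0".
function reduceVersion(fullVersion: string): string {
  const major = fullVersion.split(".")[0] ?? fullVersion;
  return `${major}.0.0.0`;
}

// Hypothetical helper: build the browser launch args for the user agent.
// With no suffix configured, no flag is passed and the browser keeps its
// own default UA; the default UA is only looked up when a suffix is added.
function userAgentArgs(getDefaultUA: () => string, suffix?: string): string[] {
  if (!suffix) {
    return [];
  }
  return [`--user-agent=${getDefaultUA()} ${suffix}`];
}
```

Taking `getDefaultUA` as a callback is a design choice made for this sketch, so that the lookup cost is paid only on the suffix path.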
Ilya Kreymer
9413143350
fix browser network loading:
- add missing CDP param that resulted in browser network being skipped!
- try browser network for direct fetch too, in case of blocking/different permissions
- only default to node fetch when network loading failed
Possible fix for #960, could be blocking non-browser fetch of srcset URLs
2026-02-02 01:22:34 -08:00
Ilya Kreymer
6f71e6e25c
don't skip direct fetch
2026-02-01 21:39:45 -08:00
Ilya Kreymer
3a43fe1277
fix UA computation, don't set UA if default
2026-02-01 14:22:22 -08:00
Ilya Kreymer
a2b14e833d
privacy enhancements testing:
- disable direct fetch altogether
- disable non-proxied WebRTC
2026-02-01 11:10:44 -08:00
Ilya Kreymer
14d866e17a
post-rebase fix
2026-01-30 13:33:15 -08:00
Ilya Kreymer
901ec94fa5
cleanup
2026-01-30 13:32:37 -08:00
Ilya Kreymer
00de3c13a6
compute total size of revisits per crawl:
conservedSize += (origSize * num of revisits) - (sum of revisit sizes)
2026-01-30 13:32:37 -08:00
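
The accounting in the commit above can be sketched as follows. This is a minimal illustration under assumptions: `conservedDelta`, `origSize`, and `revisitSizes` are hypothetical names, and the byte counts are made up for the example.

```typescript
// Hypothetical helper: bytes conserved for one original record, i.e. what
// full copies would have cost minus what the revisit records actually cost.
function conservedDelta(origSize: number, revisitSizes: number[]): number {
  const sumOfRevisits = revisitSizes.reduce((acc, size) => acc + size, 0);
  return origSize * revisitSizes.length - sumOfRevisits;
}

// Example: a 10000-byte original revisited three times.
let conservedSize = 0;
conservedSize += conservedDelta(10000, [400, 420, 380]);
```

With these numbers, three full copies would take 30000 bytes and the revisit records take 1200, so the increment is 28800.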
Ilya Kreymer
77f04b97c3
simplify revisit update logic, always incr when revisit encountered
2026-01-30 13:32:37 -08:00
Ilya Kreymer
801333bcfe
update dupe count when aggregating
2026-01-30 13:32:37 -08:00
Ilya Kreymer
a423be06dc
move commit key set to earlier
2026-01-30 13:32:37 -08:00
Ilya Kreymer
bcdd04f8d7
- track uncommitted crawl ids in separate key
- cleanup uncommitted crawl id keys when crawl is canceled
- simplify final exit checks, for operations on final crawler exit
2026-01-30 13:32:37 -08:00
Ilya Kreymer
4c6bce7db0
commit to merged index only on final exit
2026-01-30 13:32:36 -08:00
Ilya Kreymer
eff9b53930
cleanup, always incr dupeCount even if size is smaller
2026-01-30 13:32:36 -08:00
Ilya Kreymer
fd5308cf4a
fix stats computation by always using done / total stats from redis
2026-01-30 13:32:36 -08:00
Ilya Kreymer
f0db436284
Optimize Indexing + Progress Tracking (#950)
Optimize import/purge indexing:
- download remote cdx locally, use built-in gzip decompression and line
reader for parsing
- batch import operations (at 4k operations) to resolve promise
- also pipeline redis operations where possible, grouping related
functionality together, instead of having individual update methods
- track progress via 'updateProgress' stat, update before every file is
imported.
- for purge, first 50% of progress is import, last 50% is the
commit/merge of hashes into alldupes
2026-01-30 13:32:36 -08:00
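
Two ideas from the commit above, batching and the two-phase purge progress, can be sketched as below. This is a hedged illustration: `toBatches` and `purgeProgress` are hypothetical names, the batch size of 4000 stands in for the "4k operations" in the message, and treating an empty phase as complete is an assumption.

```typescript
// Hypothetical helper: split pending operations into fixed-size batches so
// each batch can be awaited (or pipelined) before the next one starts.
function toBatches<T>(items: T[], batchSize: number = 4000): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

type PurgePhase = "import" | "commit";

// Hypothetical helper: map per-phase progress onto a single 0..1 value.
// The import phase covers the first half, the commit/merge phase the second.
function purgeProgress(phase: PurgePhase, done: number, total: number): number {
  // Assumption: a phase with nothing to do counts as finished.
  const fraction = total > 0 ? Math.min(done / total, 1) : 1;
  return phase === "import" ? fraction * 0.5 : 0.5 + fraction * 0.5;
}
```

For example, `purgeProgress("import", 5, 10)` gives 0.25 and `purgeProgress("commit", 5, 10)` gives 0.75.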
Ilya Kreymer
1452e20e7a
use getFileOrUrlJson() helper for index fetching
2026-01-30 13:32:36 -08:00
Ilya Kreymer
fdd9958e12
- add 'estimatedRedundantSize' calculation for estimated wasted space
- rename to 'conservedSize' calculation for estimated conserved space
- add 'dupeUrls' to track duplicate URLs added on each crawl explicitly
- incrStat and type checking for stat types
- add removedCrawls and removedCrawlSize to track removed crawls
- clean up stats, add incrStat() to better keep track of stats in one place
2026-01-30 13:32:36 -08:00
Ilya Kreymer
87b6485a1c
avoid queueing same URLs with queue set
2026-01-30 13:32:36 -08:00
Ilya Kreymer
fecde73081
explicit init to check if wacz is valid
2026-01-30 13:32:36 -08:00
Ilya Kreymer
a735011250
set totalSize in counts
2026-01-30 13:32:36 -08:00
Ilya Kreymer
e1760d3ec4
ensure src wacz list is updated on commit
also store totalCrawls and removableCrawls
2026-01-30 13:32:36 -08:00
Ilya Kreymer
bebaf38e0b
error handling:
- skip invalid wacz files provided for import
- skip invalid multi-wacz json files provided for import
- tests: add invalid multi-wacz file for testing
2026-01-30 13:32:36 -08:00
Ilya Kreymer
86ba514275
remove extra sleep
2026-01-30 13:32:36 -08:00
Ilya Kreymer
c62eb6a62a
always commit
2026-01-30 13:32:36 -08:00
Ilya Kreymer
ffb278a956
add logging
2026-01-30 13:32:36 -08:00
Ilya Kreymer
2c94eaa512
fix getHashDupe, use all key
2026-01-30 13:32:36 -08:00
Ilya Kreymer
44a247d43a
indexer: ensure indexer size is number
2026-01-30 13:32:36 -08:00
Ilya Kreymer
b96c409729
include size in hash key data
add hash dupe when WARC record actually written
store savedSize as diff between original and revisit WARC records
indexer: compute savedSize by tracking and subtracting revisit records to be added, if revisit added before original
2026-01-30 13:32:36 -08:00
Ilya Kreymer
0872279aa1
add urlNormalize to addHashDupe
2026-01-30 13:32:36 -08:00
Ilya Kreymer
27a19bb64f
fix size count typo, unique == not dupe!
2026-01-30 13:32:36 -08:00
Ilya Kreymer
ff622013de
don't commit to all if will be purged anyway
2026-01-30 13:32:36 -08:00
Ilya Kreymer
8618310a6c
update purging of crawls to readd/recommit from added crawls,
instead of removing hashes from removed crawls, as hashes may be present in other crawls
remove crawl-specific keys for removed crawls
2026-01-30 13:32:36 -08:00
Ilya Kreymer
8311b61fa1
uniq -> unique
add 'removable' count for number of crawls that can be removed from the index
2026-01-30 13:32:36 -08:00
Ilya Kreymer
55fbe43b22
stats:
- compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per crawl key
- add stats on crawl commit, remove on crawl remove
- tests: update tests to check stats
2026-01-30 13:32:36 -08:00
Ilya Kreymer
2e37fa3e54
don't include current crawl as self-reference dependency
2026-01-30 13:32:36 -08:00
Ilya Kreymer
d31530a753
cleanup pass:
- support dedupe without requiring wacz, no crawl dependency tracking stored
- add dedupe test w/o wacz
- cleanup dedupe related naming
2026-01-30 13:32:36 -08:00
Ilya Kreymer
0f857c6572
generate wacz filename if deduping
2026-01-30 13:32:36 -08:00
Ilya Kreymer
6e259ea012
add removal option to also remove unused crawls when doing a full sync, disabled by default
2026-01-30 13:32:36 -08:00
Ilya Kreymer
a919400a99
indexer optimize: commit only if added
2026-01-30 13:32:36 -08:00
Ilya Kreymer
4104ba8361
rename 'dedup' -> 'dedupe' for consistency
2026-01-30 13:32:36 -08:00
Ilya Kreymer
8b2ac7ae67
always return wacz, store wacz depends only for current wacz
store crawlid depends for entire crawl
2026-01-30 13:32:36 -08:00
Ilya Kreymer
4c84cdf5d3
cleanup, keep compatibility with redis 6 still
set to 'post-crawl' state after uploading
2026-01-30 13:32:36 -08:00
Ilya Kreymer
645098a142
update to new data model:
- hashes stored in separate crawl specific entries, h:<crawlid>
- wacz files stored in crawl specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set
- store filename, crawlId in related.requires list entries for each wacz
2026-01-30 13:32:33 -08:00
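
The key layout in the commit above can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions, not the crawler's Redis code: only the key names `h:<crawlid>`, `c:<crawlid>:wacz`, `alldupes`, and `allcrawls` come from the message. The `DedupeIndex` shape, the function names, the opaque string values, and the choice to keep an existing `alldupes` entry on commit are all hypothetical.

```typescript
// Key builders following the names in the commit message.
const hashKey = (crawlId: string): string => `h:${crawlId}`;
const waczKey = (crawlId: string): string => `c:${crawlId}:wacz`;

// In-memory stand-in for the store (assumption: plain objects, no Redis client).
interface DedupeIndex {
  crawlHashes: Record<string, Record<string, string>>; // h:<crawlid> entries
  waczLists: Record<string, string[]>; // c:<crawlid>:wacz lists
  allDupes: Record<string, string>; // the 'alldupes' hash
  allCrawls: string[]; // the 'allcrawls' set (unique members)
}

function newIndex(): DedupeIndex {
  return { crawlHashes: {}, waczLists: {}, allDupes: {}, allCrawls: [] };
}

// During a crawl: record a content hash under the crawl-specific key.
function addHash(index: DedupeIndex, crawlId: string, hash: string, value: string): void {
  const key = hashKey(crawlId);
  const entries = index.crawlHashes[key] ?? {};
  entries[hash] = value;
  index.crawlHashes[key] = entries;
}

// During a crawl: append a WACZ filename to the crawl-specific list.
function addWacz(index: DedupeIndex, crawlId: string, filename: string): void {
  const key = waczKey(crawlId);
  const list = index.waczLists[key] ?? [];
  list.push(filename);
  index.waczLists[key] = list;
}

// On completion: merge the crawl's hashes into 'alldupes' and register the
// crawl in 'allcrawls'. Assumption: an existing entry is never overwritten.
function commitCrawl(index: DedupeIndex, crawlId: string): void {
  const entries = index.crawlHashes[hashKey(crawlId)] ?? {};
  Object.keys(entries).forEach((hash) => {
    const value = entries[hash];
    if (value !== undefined && !(hash in index.allDupes)) {
      index.allDupes[hash] = value;
    }
  });
  if (index.allCrawls.indexOf(crawlId) === -1) {
    index.allCrawls.push(crawlId);
  }
}
```

Keeping per-crawl hashes separate until commit means an unfinished or canceled crawl never contributes entries to the shared index.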