browsertrix-crawler

Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2026-02-06 09:50:15 +00:00

Commit graph

e559c312e0 Merge 6d0ec995d2 into c57481f9e1 Tessa Walsh 2026-02-05 15:45:31 -05:00
6d0ec995d2 Rename issue-965-urls-not-queued-list Tessa Walsh 2026-02-05 15:45:15 -05:00
41ece00e51 Update cli-options.md Tessa Walsh 2026-02-05 15:14:49 -05:00
d97c591de9 Add tests Tessa Walsh 2026-02-05 15:07:08 -05:00
c0f87e0cdf Merge 14d866e17a into c57481f9e1 Ilya Kreymer 2026-02-05 14:42:45 -05:00
afc3b78625 Only write pagesNotQueued.jsonl if option passed Tessa Walsh 2026-02-05 13:06:02 -05:00
bae536dd5c Write pages file with unqueued urls Tessa Walsh 2026-02-05 12:46:02 -05:00
2531dc183c Add method to write pagesNotQueued.jsonl Tessa Walsh 2026-02-05 11:59:25 -05:00
b14c684fe4 Log when page URL not queued bc of limit hit Tessa Walsh 2026-02-05 11:20:35 -05:00
e8f7866156 rate limit by text match test rate-limit-work Ilya Kreymer 2026-02-04 18:02:57 -08:00
0e5002164a misc fixes: - net idle wait post page only default to 10 latest-testing Ilya Kreymer 2026-02-04 17:38:28 -08:00
33c53c2854 Merge branch 'rate-limit-work' into latest-testing Ilya Kreymer 2026-02-04 17:36:20 -08:00
8bc0e3ddbb use redis to store rate limit to ensure expiry, remove unused Ilya Kreymer 2026-02-04 17:25:41 -08:00
5b5d50b6d2 rate limit detect and restart Ilya Kreymer 2026-02-03 10:24:55 -08:00
d812f1c2dc rate limit work: - detect 403, 429, 503 as possible rate limit, attempt to restart and not record Ilya Kreymer 2026-02-02 23:02:35 -08:00
54e97a1bef Merge branch 'policy-tweaks' into latest-testing Ilya Kreymer 2026-02-04 16:39:31 -08:00
c5e3639e4d Merge branch 'fix-iframe-load' into latest-testing Ilya Kreymer 2026-02-04 16:39:17 -08:00
556f396afd Merge branch 'fix-network-load-param' into latest-testing Ilya Kreymer 2026-02-04 16:38:40 -08:00
21069e07f4 Merge branch 'main' into latest-testing Ilya Kreymer 2026-02-04 16:38:09 -08:00
7e38f78806 Merge a0fcf9d6ad into c57481f9e1 Ilya Kreymer 2026-02-05 00:31:50 +00:00
a0fcf9d6ad frame behaviors: use frame.evaluate() instead of custom evaluteWithCLI() custom function not always working, only needed to inject getEventListeners() which is used minimally (only in autoscroll to potentially skip scrolling) fix-iframe-load Ilya Kreymer 2026-02-04 16:27:11 -08:00
5d385fd617 Merge 9ab13e3b04 into c57481f9e1 Ilya Kreymer 2026-02-04 19:22:11 -05:00
c57481f9e1 Fix default user-agent to not include minor version + set sec-ua-ch-* headers (#962) main Ilya Kreymer 2026-02-04 16:06:28 -08:00
18acaeff87 lint fix Ilya Kreymer 2026-02-04 16:03:45 -08:00
b1c3d4d966 Update src/create-login-profile.ts Ilya Kreymer 2026-02-04 11:36:47 -08:00
74a303ae9b Update src/util/browser.ts Ilya Kreymer 2026-02-04 11:35:20 -08:00
beec19ead6 use emulation.setUserAgentOverride Ilya Kreymer 2026-02-02 19:53:34 -08:00
21751127e5 don't disable cache in profiles to match crawler headers better Ilya Kreymer 2026-02-02 16:43:50 -08:00
8fe9378dae addl policy tweaks Ilya Kreymer 2026-02-02 16:40:50 -08:00
9ab13e3b04 default to direct fetch still if no page available fix-network-load-param Ilya Kreymer 2026-02-02 16:27:26 -08:00
e0afc70131 store browser major version, also override sec-ua-* headers don't use default header as it adds Headless in headless mode Ilya Kreymer 2026-02-02 16:23:14 -08:00
c7a518ea42 Update src/crawler.ts Ilya Kreymer 2026-02-02 13:20:06 -08:00
ff6a602dcd disable profile cache, don't skip direct fetch if no frame priv-testing Ilya Kreymer 2026-02-02 12:44:48 -08:00
b7a7474243 Merge branch 'fix-ua' into priv-testing Ilya Kreymer 2026-02-02 12:42:40 -08:00
a142e9f62e only get default UA if adding suffix Ilya Kreymer 2026-02-02 12:27:53 -08:00
afdc37412e Merge branch 'fix-network-load-param' into priv-testing Ilya Kreymer 2026-02-02 03:29:51 -08:00
021206950c correctly skip network fetch if in browser context don't do direct fetch if no page Ilya Kreymer 2026-02-02 03:28:38 -08:00
9ece619aa8 Merge branch 'fix-network-load-param' into priv-testing Ilya Kreymer 2026-02-02 01:28:22 -08:00
1a18e7cb9c Fix setting default user-agent - Only use major version from, set rest to 0.0.0 to match default behavior - Don't pass --user-agent if just using default for simplicity Ilya Kreymer 2026-02-01 14:22:22 -08:00
9413143350 fix browser network loading: - add missing CDP param that resulted in browser network being skipped! - try browser network for direct fetch too, in case of blocking/different permissions - only default to node fetch when network loading failed Ilya Kreymer 2026-02-02 01:17:48 -08:00
6f71e6e25c don't skip direct fetch Ilya Kreymer 2026-02-01 21:39:45 -08:00
3a43fe1277 fix UA computation, don't set UA if default Ilya Kreymer 2026-02-01 14:22:22 -08:00
12caf9890b try rebrowser patches again Ilya Kreymer 2026-02-01 13:15:09 -08:00
a2b14e833d privacy enhancements testing: - disable direct fetch altogether - disable non-proxied WebRTC Ilya Kreymer 2026-02-01 11:10:44 -08:00
f4e20100c4 bump version to 1.12.0-beta.0 for beta release v1.12.0-beta.0 dedupe-beta-release Ilya Kreymer 2026-01-30 16:34:55 -08:00
14d866e17a post-rebase fix hash-based-dedup Ilya Kreymer 2026-01-30 13:33:15 -08:00
901ec94fa5 cleanup Ilya Kreymer 2026-01-23 12:58:05 -08:00
00de3c13a6 compute total size of revisits per crawl: conservedSize += (origSize * num of revisits) - (sum of revisit sizes) Ilya Kreymer 2026-01-23 03:38:37 -08:00
77f04b97c3 simplify revisit update logic, always incr when revisit encountered Ilya Kreymer 2026-01-23 02:54:37 -08:00
801333bcfe update dupe count when aggregating Ilya Kreymer 2026-01-23 01:26:27 -08:00
a423be06dc move commit key set to earlier Ilya Kreymer 2026-01-22 22:06:25 -08:00
bcdd04f8d7 - track uncommitted crawl ids in separate key - cleanup uncommitted crawl id keys when crawl is canceled - simplify final exit checks, for operations on final crawler exit Ilya Kreymer 2026-01-22 17:47:06 -08:00
4c6bce7db0 commit to merged index only on final exit Ilya Kreymer 2026-01-22 17:07:11 -08:00
eff9b53930 cleanup, always incr dupeCount even if size is smaller Ilya Kreymer 2026-01-21 23:06:53 -08:00
fd5308cf4a fix stats computation by always using done / total stats from redis Ilya Kreymer 2026-01-21 21:11:34 -08:00
f0db436284 Optimize Indexing + Progress Tracking (#950) Ilya Kreymer 2026-01-13 13:23:56 -08:00
1452e20e7a use getFileOrUrlJson() helper for index fetching Ilya Kreymer 2026-01-07 21:49:01 -08:00
fdd9958e12 - add 'estimatedRedundantSize' calculation for estimated wasted space - rename to 'conservedSize' calculation for estimated conserved space - add 'dupeUrls' to track duplicate URLs added on each crawl explicitly - incrStat and type checking for stat types - add removedCrawls and removedCrawlSize to track removed crawls - clean up stats, add incrStat() to better keep track of stats in one place Ilya Kreymer 2026-01-07 10:06:08 -08:00
87b6485a1c avoid queueing same URLs with queue set Ilya Kreymer 2026-01-05 18:40:45 -08:00
fecde73081 explicit init to check if wacz is valid Ilya Kreymer 2026-01-04 16:35:36 -08:00
a735011250 set totalSize in counts Ilya Kreymer 2026-01-04 16:34:51 -08:00
e1760d3ec4 ensure src wacz list is updated on commit also store totalCrawls and removableCrawls Ilya Kreymer 2026-01-03 00:25:00 -08:00
bebaf38e0b error handling: - skip invalid wacz files provided for import - skip invalid multi-wacz json files provided for import - tests: add invalid multi-wacz file for testing Ilya Kreymer 2025-12-20 12:14:33 -08:00
09388ff9dc tests: add test for import from json Ilya Kreymer 2025-12-20 10:04:43 -08:00
86ba514275 remove extra sleep Ilya Kreymer 2025-12-19 22:18:01 -08:00
c62eb6a62a always commit Ilya Kreymer 2025-12-19 22:01:25 -08:00
ffb278a956 add logging Ilya Kreymer 2025-12-19 21:40:15 -08:00
2c94eaa512 fix getHashDupe, use all key Ilya Kreymer 2025-12-19 21:28:52 -08:00
44a247d43a indexer: ensure indexer size is number Ilya Kreymer 2025-12-19 21:13:26 -08:00
b96c409729 include size in hash key data add hash dupe when WARC record actually written store savedSize as diff between original and revisit WARC records indexer: compute savedSize by tracking subtracting revisit records to be added, if revisit added before original Ilya Kreymer 2025-12-19 16:12:30 -08:00
0872279aa1 add urlNormalize to addHashDupe Ilya Kreymer 2025-12-11 10:46:23 -08:00
27a19bb64f fix size count typo, unique == not dupe! Ilya Kreymer 2025-12-11 10:37:53 -08:00
ff622013de don't commit to all if will be purged anyway Ilya Kreymer 2025-12-10 23:50:56 -08:00
8618310a6c update purging of crawls to readd/recommit from added crawls, instead of removing hashes from removed crawls, as hashes may be present in other crawls remove crawl-specific keys for removed crawls Ilya Kreymer 2025-12-10 19:01:37 -08:00
8311b61fa1 uniq -> unique add 'removable' count for number of crawls that can be removed from the index Ilya Kreymer 2025-12-10 15:18:59 -08:00
55fbe43b22 stats: - compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per crawl key - add stats on crawl commit, remove on crawl remove - tests: update tests to check stats Ilya Kreymer 2025-12-10 12:40:44 -08:00
2e37fa3e54 don't include current crawl as self-reference dependency Ilya Kreymer 2025-12-09 16:20:19 -08:00
d31530a753 cleanup pass: - support dedupe without requiring wacz, no crawl dependency tracking stored - add dedupe test w/o wacz - cleanup dedupe related naming Ilya Kreymer 2025-11-28 01:16:58 -08:00
0f857c6572 generate wacz filename if deduping Ilya Kreymer 2025-11-27 23:40:02 -08:00
6e259ea012 add removing option to also remove unused crawls if doing a full sync, disable by default Ilya Kreymer 2025-10-25 15:41:31 -07:00
a919400a99 indexer optimize: commit only if added Ilya Kreymer 2025-10-25 13:17:01 -07:00
4104ba8361 rename 'dedup' -> 'dedupe' for consistency Ilya Kreymer 2025-10-25 09:33:37 -07:00
8b2ac7ae67 always return wacz, store wacz depends only for current wacz store crawlid depends for entire crawl Ilya Kreymer 2025-10-24 15:01:00 -07:00
4c84cdf5d3 cleanup, keep compatibility with redis 6 still set to 'post-crawl' state after uploading Ilya Kreymer 2025-10-24 13:24:53 -07:00
645098a142 update to new data model: - hashes stored in separate crawl specific entries, h:<crawlid> - wacz files stored in crawl specific list, c:<crawlid>:wacz - hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set - store filename, crawlId in related.requires list entries for each wacz Ilya Kreymer 2025-10-24 10:38:36 -07:00
90df3a173a - track source index for each hash, so entry becomes '<source index> <date> <url>' - entry for source index can contain the crawl id (or possibly wacz and crawl id) - also store dependent sources in relation.requires in datapackage.json - tests: update tests to check for relation.requires Ilya Kreymer 2025-10-17 18:08:38 -07:00
0e1c95c5c9 dedup post requests and non-404s as well! update timestamp after import Ilya Kreymer 2025-09-25 10:40:57 -07:00
8c94062963 use dedup redis for queue up wacz files that need to be updated use pending queue to support retries in case of failure store both id and actual URL in case URL changes in subsequent retries Ilya Kreymer 2025-09-22 22:30:08 -07:00
912ab1e842 dedup indexing: strip hash prefix from digest, as cdx does not have it tests: add index import + dedup crawl to ensure digests match fully Ilya Kreymer 2025-09-22 17:46:19 -07:00
9b03eb8a5d deps update Ilya Kreymer 2025-09-19 20:54:52 -07:00
fde6145bda tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records Ilya Kreymer 2025-09-18 13:10:53 -07:00
a7b801f5fb update to latest warcio (2.4.7) to fix issue when returning payload only size Ilya Kreymer 2025-09-18 02:04:28 -07:00
24ef0fab1e rename --dedupStoreUrl -> redisDedupUrl bump version to 1.9.0 fix typo Ilya Kreymer 2025-09-17 23:36:25 -07:00
72384f7918 warc writing: - update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits - copy additional custom WARC headers to revisit from response Ilya Kreymer 2025-09-17 20:48:32 -07:00
047e24249e keep skipping dupe URLs as before Ilya Kreymer 2025-09-17 20:02:01 -07:00
1715f262dd add indexer entrypoint: - populate dedup index from remote wacz/multi wacz/multiwacz json Ilya Kreymer 2025-09-17 19:23:32 -07:00
3984a1ec2f args: add separate --dedupIndexUrl to support separate redis for dedup indexing prep: - move WACZLoader to wacz for reuse Ilya Kreymer 2025-09-16 17:48:13 -07:00
d008f979d5 dedup work: - resource dedup via page digest - page dedup via page digest check, blocking of dupe page Ilya Kreymer 2025-08-30 12:41:10 -07:00
689d9f6c6b Apply pageExtraDelay after successful direct fetch (#961) v1.11.2 Ilya Kreymer 2026-01-30 13:18:48 -08:00
5ad603e7da fixes #957 also apply page extra delay when direct fetch succeeded to enforce consistent rate limiting Ilya Kreymer 2026-01-30 10:10:18 -08:00
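
Commit 00de3c13a6 above states the per-crawl revisit accounting as conservedSize += (origSize * num of revisits) - (sum of revisit sizes). The following is a minimal sketch of that arithmetic only, not the crawler's actual code: the DedupeEntry shape and the computeConservedSize name are hypothetical, and sizes are assumed to be in bytes.

```typescript
// Hypothetical record shape, for illustration only (not from the repository).
interface DedupeEntry {
  origSize: number; // size of the original response record, in bytes
  revisitSizes: number[]; // size of each revisit record written in its place
}

// Applies the formula from the commit message to each entry:
// conservedSize += (origSize * num of revisits) - (sum of revisit sizes)
function computeConservedSize(entries: DedupeEntry[]): number {
  let conservedSize = 0;
  for (const { origSize, revisitSizes } of entries) {
    const sumOfRevisits = revisitSizes.reduce((acc, size) => acc + size, 0);
    conservedSize += origSize * revisitSizes.length - sumOfRevisits;
  }
  return conservedSize;
}

const example: DedupeEntry[] = [
  { origSize: 1000, revisitSizes: [200, 200] }, // 1000 * 2 - 400 = 1600
  { origSize: 500, revisitSizes: [] }, // no revisits, nothing conserved
];
console.log(computeConservedSize(example)); // 1600
```

An entry with no revisits contributes zero, so only payloads that were actually deduplicated affect the total.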