Commit graph

  • 26aeb8db4f
    Merge 809dad108b into 62e99d70f2 Ilya Kreymer 2026-06-18 20:24:04 -07:00
  • 809dad108b Always Rate Limit Status Codes: - detect 403, 429, 503 status codes as possible rate limited, skip pages that have these status codes from being archived - add --rateLimitStatusCodes to customize these status codes - can use --no-rateLimitStatusCodes to disable rate limiting altogether rate-limit-work Ilya Kreymer 2026-02-02 23:02:35 -08:00
  • 62e99d70f2
    deps: bump puppeteer to 25.1.0 (#1049) main Ilya Kreymer 2026-06-18 20:19:20 -07:00
  • a615411523 deps: bump puppeteer to 25.1.0 Ilya Kreymer 2026-06-18 19:23:44 -07:00
  • b6b52f279d
    build(deps): bump the npm_and_yarn group across 1 directory with 2 updates (#1048) dependabot[bot] 2026-06-18 19:11:44 -07:00
  • 5ba8fef0f4
    build(deps): bump the npm_and_yarn group across 1 directory with 2 updates dependabot[bot] 2026-06-19 02:05:46 +00:00
  • 28a8484c93
    deps: bump brave to 1.91.175 (#1047) Ilya Kreymer 2026-06-18 19:04:08 -07:00
  • a5373f7f9a bump wabac/warcio dependencies Ilya Kreymer 2026-06-18 17:32:35 -07:00
  • a1c73e84f3 bump to brave 1.91.175, version to 1.13.2 Ilya Kreymer 2026-06-18 17:27:42 -07:00
  • a5dff102dd
    wip: use builtin behaviours from gist misty/behaviour_from_gist Misty De Meo 2026-05-12 13:16:59 -07:00
  • ebf5b46303 build(deps): bump uuid in the npm_and_yarn group across 1 directory dependabot[bot] 2026-06-16 17:54:16 +00:00
  • 17945b78a1
    build(deps): bump uuid in the npm_and_yarn group across 1 directory dependabot[bot] 2026-06-16 17:54:16 +00:00
  • 23a8886cce ci: avoid logging in for dependabot Misty De Meo 2026-06-16 09:59:12 -07:00
  • c4cd0c97d9
    ci: avoid logging in for dependabot Misty De Meo 2026-06-16 09:59:12 -07:00
  • e7bc6e378e Dockerfile: apply code review suggestion Misty De Méo 2026-06-16 09:33:01 -07:00
  • 004d410b27 Dockerfile: avoid persisting cache in layer Misty De Meo 2026-06-16 09:12:30 -07:00
  • 9bb5950cf8
    Dockerfile: apply code review suggestion Misty De Méo 2026-06-16 09:33:01 -07:00
  • 5ccec13dac
    Dockerfile: avoid persisting cache in layer Misty De Meo 2026-06-16 09:12:30 -07:00
  • f6434966c1 version: bump to 1.13.1 v1.13.1 Ilya Kreymer 2026-06-15 20:58:26 -07:00
  • 017a41a223
    deps: browsertrix-behaviors v0.10.1 (#1043) Misty De Méo 2026-06-15 20:57:11 -07:00
  • 2e9ad14454
    fix: resolve dependency CVEs, fix ws v8 screencast regression, prune devDependencies from image (#1042) Roy Teeuwen 2026-06-16 05:56:29 +02:00
  • 3f172d7474
    deps: browsertrix-behaviors v0.10.1 Misty De Meo 2026-06-15 12:34:45 -07:00
  • 5cc1702ea0 chore: upgrade @typescript-eslint to v8 and @novnc/novnc to 1.7.0 Roy Teeuwen 2026-06-10 23:17:13 +02:00
  • 6b1169b668 fix: prune devDependencies from the production image Roy Teeuwen 2026-06-10 22:45:13 +02:00
  • c7c5641129 fix: use WebSocketServer named export for ws v8 + add screencast e2e test Roy Teeuwen 2026-06-10 22:30:11 +02:00
  • 1990a1360e fix: resolve remaining dependency CVEs flagged by audit Roy Teeuwen 2026-06-10 22:12:51 +02:00
  • 6563c31a80
    build(deps): bump the npm_and_yarn group across 1 directory with 8 updates dependabot[bot] 2026-06-10 20:10:26 +00:00
  • 787398adf7
    ci: skip docker login for external PRs (#1041) Ilya Kreymer 2026-06-10 13:06:20 -07:00
  • efb422ce0e fix: upgrade dependencies to resolve CRITICAL and HIGH CVEs Roy Teeuwen 2026-06-10 22:00:46 +02:00
  • 6820046331 ci: skip docker login for external PRs, should still work but will be rate limited by dockerhub Ilya Kreymer 2026-06-10 12:19:16 -07:00
  • 859f4e3268
    Merge 4aee4322df into d8b167747e Misty De Méo 2026-06-09 10:37:54 -07:00
  • d8b167747e ci: update docker buildx actions Misty De Meo 2026-06-09 09:36:15 -07:00
  • dd36784eac
    ci: update docker buildx actions Misty De Meo 2026-06-09 09:36:15 -07:00
  • 696704e042 ci: update docker actions Misty De Meo 2026-06-08 11:03:50 -07:00
  • a6969d8271
    Merge dd2a4ca7b2 into 14bfa1fb61 Emma Segal-Grossman 2026-06-08 17:04:01 -04:00
  • ad3e350235
    Merge ab313e89b1 into 14bfa1fb61 Ilya Kreymer 2026-06-08 17:04:01 -04:00
  • 1e5aeb4f74
    ci: update docker actions Misty De Meo 2026-06-08 11:03:50 -07:00
  • d7830ec12b
    Merge dee2c95ad2 into 14bfa1fb61 aponb 2026-06-05 15:29:22 +02:00
  • 4aee4322df
    crawler: use destructured named params for this.queueUrl misty/addLink_force_in_scope Misty De Meo 2026-06-04 12:47:32 -07:00
  • 9d51092a10
    crawler: add option for behaviour addLink scope Misty De Meo 2026-06-04 12:37:47 -07:00
  • 6f049c55e8
    - add ignoreScope for isIncluded (default to false) - types: add LinkEntry for isIncluded() - tests: add tests with ignoreScope, also ignoring maxDepth Ilya Kreymer 2026-06-03 20:17:55 -07:00
  • 60491d3993
    crawler: treat addLink links as in scope unless excluded Misty De Meo 2026-06-03 16:42:59 -07:00
  • 14bfa1fb61 version: bump to 1.13.0 v1.13.0 Ilya Kreymer 2026-06-03 22:54:06 -07:00
  • 1cc392344e
    browser: remove additional shell commands (#1034) v1.13.0-beta.2 Misty De Méo 2026-06-03 17:28:16 -07:00
  • dc4825516c
    browser: adjust two calls Misty De Meo 2026-05-26 15:12:59 -07:00
  • a6242319b8
    ci: update actions versions (#1029) Misty De Méo 2026-06-03 16:46:16 -07:00
  • e9aaa1b78b
    External command tweaks (#1031) Misty De Méo 2026-06-03 16:45:36 -07:00
  • 3db07c4f29
    deps: bump to base image 1.90.128 (#1036) Ilya Kreymer 2026-06-03 16:44:42 -07:00
  • 4fe124aa46 ci: use node 24 also Ilya Kreymer 2026-06-03 15:57:03 -07:00
  • 544d331023 deps: bump to base image 1.90.128, bump to 1.13.0-beta.2 Ilya Kreymer 2026-06-03 15:30:36 -07:00
  • a067cdd6f3
    Merge 9e461182b5 into 2b83c76710 aponb 2026-06-03 13:08:18 -04:00
  • 69315fdbb2
    file_reader: adjust git call Misty De Meo 2026-05-26 13:47:32 -07:00
  • c5dd45d3a3
    storage: only promisify execFile once Misty De Meo 2026-05-26 13:43:55 -07:00
  • 6b04c47d49
    storage: tweak getDFOutput call Misty De Meo 2026-05-26 13:42:05 -07:00
  • 2b83c76710
    tests: point test-extract at old.webrecorder.net/community (#1035) Misty De Méo 2026-06-02 14:16:08 -07:00
  • f78ba3dd0c
    tests: apply formatting Misty De Meo 2026-06-02 12:36:34 -07:00
  • c12e0c0c29
    Apply suggestions from code review Ilya Kreymer 2026-06-02 12:20:00 -07:00
  • 3d63a83312
    tests: use old.webrecorder.net Misty De Meo 2026-06-02 12:02:09 -07:00
  • 5441b31c23
    tests: point test-extract at webrecorder.net Misty De Meo 2026-06-02 10:40:34 -07:00
  • dede611bce version: bump to 1.13.0-beta.1, deps: browsertrix-behaviors to 0.10.0 v1.13.0-beta.1 Ilya Kreymer 2026-06-01 17:54:12 -07:00
  • 9baa677d42
    Merge cb869f0728 into 3433a4a440 lasztoth 2026-05-15 08:45:03 +00:00
  • cb869f0728 Added persistent QA page count and documentation KGX747 2026-05-15 10:44:59 +02:00
  • 637d4cf041
    ci: update actions versions Misty De Meo 2026-05-13 15:53:13 -07:00
  • 9e0de504e1
    wip: use builtin behaviours from gist misty/behaviour_hardcode Misty De Meo 2026-05-12 12:00:52 -07:00
  • 9e461182b5 Add flag to disable deduplication aponb 2026-05-07 17:27:04 +02:00
  • dee2c95ad2 Format domain stats files after rebase aponb 2026-05-07 10:13:46 +02:00
  • 818ec99d2b Fix leftover merge markers after rebase aponb 2026-05-07 09:55:46 +02:00
  • bcc20ca84d Finalize domain completeness for deep crawls aponb 2026-04-27 17:52:56 +02:00
  • 83a2cc84f4 Persist domain completeness state aponb 2026-04-27 17:46:16 +02:00
  • 0abdb681b3 Add optional completeness signal to domain stats for depth-0 domain crawls aponb 2026-04-25 22:25:01 +02:00
  • 1430fee7ad Add attributed per-domain crawl budgets and stats aponb 2026-04-21 17:58:47 +02:00
  • ac31183b21 Deployed 3433a4a with MkDocs version: 1.6.1 gh-pages 2026-04-30 18:16:58 +00:00
  • 3433a4a440
    Page Level Dedupe support: (#1018) Ilya Kreymer 2026-04-30 20:14:42 +02:00
  • b72e450e44 logging: log when page is skipped to debug log with 'pageStatus' context Ilya Kreymer 2026-04-30 10:42:58 +02:00
  • 620ced6907 fix reversed logic: - queuePageLimit is limit for how big queue can get, set to 0 (unlimited) if page dedupe is enabled as pages may be skipped - pageLimit is actually limit of how many pages - ensure logging / usage correct Ilya Kreymer 2026-04-30 10:38:38 +02:00
  • 2a171e615a
    Apply suggestions from code review Ilya Kreymer 2026-04-28 22:37:40 +02:00
  • 4ae629e8a0
    Fix allowHashUrls option and scope checking for hash URLs (#1025) v1.13.0-beta.0 Ilya Kreymer 2026-04-28 22:32:12 +02:00
  • cb77d32c34 Add relative non-hash url test case back Tessa Walsh 2026-04-28 13:37:06 -04:00
  • 72349204db Return tests to what they were in local branch before rebase Tessa Walsh 2026-04-28 13:33:39 -04:00
  • 6cce31b875 More test cleanup Tessa Walsh 2026-04-28 13:19:08 -04:00
  • 14877df55f Fix test issues from rebase Tessa Walsh 2026-04-28 13:15:10 -04:00
  • 1ce5bb9fb4 scope check URL resolution: - pass pageUrl to seed.isIncluded() to allow resolving scope of relative URLs to current page, if provided - resolves any relative URLs found with custom link extraction Ilya Kreymer 2026-04-24 08:49:23 +02:00
  • 611769df23 ScopeSeedInitOpts -> ScopedSeedInitOpts Ilya Kreymer 2026-04-22 07:57:19 +02:00
  • 1b327e8415 allow #-tag urls to be treated as distinct URLs for scope types other than custom: - fix --allowHashUrls option being ignored - if --allowHashUrls global or --allowHash per seed is set, don't reset to false if scopeType != 'custom' - tests: update scope tests to ensure --allowHash and --allowHashUrls now work as expected Ilya Kreymer 2026-04-22 07:31:53 +02:00
  • ac3b2849ab
    support scope check for relative URLs using current page: (#1026) Ilya Kreymer 2026-04-28 19:11:45 +02:00
  • c637ef2e3f
    docs: fix formatting for behaviour docs (#1022) Misty De Méo 2026-04-27 13:11:20 -07:00
  • 4578dd09c8 update test to not depend on allowHash fix for now Ilya Kreymer 2026-04-24 10:15:04 +02:00
  • 58227331fb support scope check for relative URLs using current page: - pass pageUrl to seed.isIncluded() to allow resolving scope of relative URLs to current page, if provided - resolves any relative URLs found with custom link extraction - fixes part of #1023 Ilya Kreymer 2026-04-24 08:49:23 +02:00
  • 2eadbbd1f1
    docs: fix formatting for behaviour docs Misty De Meo 2026-04-20 16:08:58 -07:00
  • 94ad705c46
    Merge 9689da4451 into 7c10fb108b Ilya Kreymer 2026-04-15 19:58:26 -07:00
  • 9689da4451 support archiving websocket data: - saving WS frames from CDP to a temp file as jsonl data - generate separate 'resource' record with jsonl data - generate request/response data with headers only, set warc-truncated header - should match evolving iipc/warc-specifications#115 spec ws-support Ilya Kreymer 2026-04-05 01:16:08 -07:00
  • 1cc3ea4226 tests: add third crawl test just in case Ilya Kreymer 2026-04-11 15:42:21 -07:00
  • a859352ce7 Page Level Dedupe support: - add --dedupePagesMinDepth to enable page-level dedupe at certain depth or greater - add 'duplicate' as another skip reason, log skip reason when page is skipped due to dedupe - when pageDedupe is enabled, set pageLimit to 0 and allow queueing pages beyond expected limit, in case pages are skipped - add queuePageLimit and check limit on each new page at queue pop time, allows skipping already deduped pages and incrementally crawling new pages - when limit reached, queued pages are drained and marked as excluded / logged to skippedPages list - tests: test page dedupe / incremental crawling: new pages are archived on subsequent crawls, previous pages skipped with 'duplicate' reason - docs: add Page Deduplication on dedupe page - docs: add Reports page, document skipped pages / --reportSkipped report Ilya Kreymer 2026-03-09 17:19:15 -07:00
  • 7c10fb108b
    tests: include tests in TS format and lint operations, reformat existing tests to match style (#1016) Emma Segal-Grossman 2026-04-09 19:06:57 -04:00
  • d1aa7ec054 tests: include tests in format and lint operations, reformat existing tests to match style Ilya Kreymer 2026-04-09 12:52:33 -07:00
  • 1c6e814e15
    Add option to write JSONL file with data on skipped pages (#966) Tessa Walsh 2026-04-09 15:51:41 -04:00
  • f98a545c13 format fix Ilya Kreymer 2026-04-09 11:32:02 -07:00
  • ea40aab2f1 rename to 'redirectToExcluded' to be more precise about when a page is skipped (if the redirect is explicitly excluded); add test for redirectToExcluded Ilya Kreymer 2026-04-09 11:26:58 -07:00
  • 9b1fef6892 writeSkippedPage: just pass seedId, to look up seed in one place; record 'redirectOutOfScope' type to indicate pages that are excluded because they redirect to out-of-scope pages Ilya Kreymer 2026-04-08 22:57:38 -07:00
  • 0d24ec846b remove unused line, probably re-added via rebase Ilya Kreymer 2026-04-08 22:22:10 -07:00