Commit graph

  • 94ad705c46
    Merge 9689da4451 into 7c10fb108b Ilya Kreymer 2026-04-15 19:58:26 -07:00
  • 9689da4451 support archiving websocket data:
    - saving WS frames from CDP to a temp file as jsonl data
    - generate separate 'resource' record with jsonl data
    - generate request/response data with headers only, set warc-truncated header
    - should match evolving iipc/warc-specifications#115 spec
    ws-support Ilya Kreymer 2026-04-05 01:16:08 -07:00
  • 6f5395cec4
    Merge dd2a4ca7b2 into 7c10fb108b Emma Segal-Grossman 2026-04-15 14:34:42 -04:00
  • 3d76ede571
    Merge ab313e89b1 into 7c10fb108b Ilya Kreymer 2026-04-15 14:34:42 -04:00
  • c644d0e191 fix typo rate-limit-work Ilya Kreymer 2026-04-14 15:35:02 -07:00
  • 73e909e65c
    Merge 1cc3ea4226 into 7c10fb108b Ilya Kreymer 2026-04-11 22:42:47 +00:00
  • 1cc3ea4226 tests: add third crawl test just in case page-dedupe Ilya Kreymer 2026-04-11 15:42:21 -07:00
  • a859352ce7 Page Level Dedupe support:
    - add --dedupePagesMinDepth to enable page-level dedupe at certain depth or greater
    - add 'duplicate' as another skip reason, log skip reason when page is skipped due to dedupe
    - when pageDedupe is enabled, set pageLimit to 0 and allow queueing pages beyond expected limit, in case pages are skipped
    - add queuePageLimit and check limit on each new page at queue pop time, allows skipping already deduped pages and incrementally crawling new pages
    - when limit reached, queued pages are drained and marked as excluded / logged to skippedPages list
    - tests: test page dedupe / incremental crawling: new pages are archived on subsequent crawls, previous pages skipped with 'duplicate' reason
    - docs: add Page Deduplication on dedupe page
    - docs: add Reports page, document skipped pages / --reportSkipped report
    Ilya Kreymer 2026-03-09 17:19:15 -07:00
  • c4fb989b9a
    Merge d6e1f01395 into 7c10fb108b lasztoth 2026-04-09 17:19:46 -07:00
  • 7c10fb108b
    tests: include tests in TS format and lint operations, reformat existing tests to match style (#1016) main Emma Segal-Grossman 2026-04-09 19:06:57 -04:00
  • ef109f1ea4 Deployed 1c6e814 with MkDocs version: 1.6.1 gh-pages 2026-04-09 19:53:49 +00:00
  • d1aa7ec054 tests: include tests in format and lint operations, reformat existing tests to match style Ilya Kreymer 2026-04-09 12:52:33 -07:00
  • 1c6e814e15
    Add option to write JSONL file with data on skipped pages (#966) Tessa Walsh 2026-04-09 15:51:41 -04:00
  • f98a545c13 format fix Ilya Kreymer 2026-04-09 11:32:02 -07:00
  • ea40aab2f1 rename to 'redirectToExcluded' to be more precise for when page is skipped, if redirect is explicitly excluded add test for redirectToExcluded Ilya Kreymer 2026-04-09 11:26:58 -07:00
  • 9b1fef6892 writeSkippedPage - just pass seedId, to look up seed in one place record 'redirectOutOfScope' type to indicate pages that are excluded because they redirect to out-of-scope pages Ilya Kreymer 2026-04-08 22:57:38 -07:00
  • 0d24ec846b remove unused line, probably re-added via rebase Ilya Kreymer 2026-04-08 22:22:10 -07:00
  • 43b4b20298 queueUrl: account for empty seed url list, if calling from QA mode Ilya Kreymer 2026-04-08 22:03:31 -07:00
  • ecede3355b
    Merge 344df55724 into 64fdaf0d11 Ilya Kreymer 2026-04-09 04:53:25 +00:00
  • 344df55724 retry after Ilya Kreymer 2026-04-08 21:52:40 -07:00
  • 14aaf9a201 support custom retry after Ilya Kreymer 2026-04-08 13:37:39 -07:00
  • 1c94f04873 Comment out WACZ validation test Tessa Walsh 2026-04-08 13:03:24 -04:00
  • e46b2c3e20 Fix tests Tessa Walsh 2026-04-08 12:23:02 -04:00
  • 6e3e29cf3c Fix ts in skippedPages.jsonl Tessa Walsh 2026-04-08 11:36:38 -04:00
  • 7c0c6c1177 Rename arg to reportSkipped and file to skippedPages.jsonl Tessa Walsh 2026-04-08 11:19:29 -04:00
  • 46684e3235 add 'rateLimitOn200MatchText' to allow configuring text that is considered rate limit page even with 200 status, set previously hard-coded value as default Ilya Kreymer 2026-04-07 23:36:25 -07:00
  • a01566a6b5 improved check for rate limit:
    - set separate rateLimited key to a bool if crawler is rate limited (will exit) for easy checking via browsertrix app
    - check for rate limit after page completes, in case page does in fact return 200
    - clear rateLimited on first successful load, but doesn't clear counter, in case rate limit still in place, will become in effect
    - fixes #758
    Ilya Kreymer 2026-04-07 12:41:18 -07:00
  • 4e585f7d28 add case Ilya Kreymer 2026-03-06 21:03:54 -08:00
  • 4ae23219d3 treat rate limit similar to pause, exit after all pages done rather than immediately Ilya Kreymer 2026-03-06 20:47:13 -08:00
  • b9cb43a327 fix skipping direct fetch for non-200 for non-html status, return 600 to indicate skip Ilya Kreymer 2026-02-16 18:55:34 -08:00
  • 4e2b2328f5 add rate limit to direct fetch track stats for direct fetch Ilya Kreymer 2026-02-14 22:55:27 -08:00
  • 9d5c5505e2 post rebase fixes Ilya Kreymer 2026-02-10 21:31:42 -08:00
  • 48306950d5 treat 429 as instant rate limit, lower limit to 3 for other statuses Ilya Kreymer 2026-02-10 21:16:41 -08:00
  • 94475bdff6 don't fail when rate limited, mark directly in recorder Ilya Kreymer 2026-02-06 21:38:34 -08:00
  • f6944c4b59 fix dynamic page check to use correct pageUrl Ilya Kreymer 2026-02-06 21:23:32 -08:00
  • 0f6e795e1c rate limit by text match test Ilya Kreymer 2026-02-04 18:02:57 -08:00
  • 12033d728a use redis to store rate limit to ensure expiry, remove unused Ilya Kreymer 2026-02-04 17:25:41 -08:00
  • 1c12b74e10 rate limit detect and restart Ilya Kreymer 2026-02-03 10:24:55 -08:00
  • b17a15dda4 rate limit work: - detect 403, 429, 503 as possible rate limit, attempt to restart and not record Ilya Kreymer 2026-02-02 23:02:35 -08:00
  • 6ae4987d4e tests: migrate new test to .ts tests: update eslint to ignore promise check on tests Ilya Kreymer 2026-04-03 14:13:53 -07:00
  • acf31e26e5 Remove todo Tessa Walsh 2026-02-11 17:10:47 -05:00
  • e427a153be Fix test Tessa Walsh 2026-02-11 15:25:13 -05:00
  • 667db4fe25 Add WACZ tests Tessa Walsh 2026-02-11 11:01:35 -05:00
  • a98eccb18e Fix formatting Tessa Walsh 2026-02-11 10:55:18 -05:00
  • 972110c55e Add reports dir to WACZ if listNotQueued arg is passed Tessa Walsh 2026-02-11 10:31:43 -05:00
  • 9ccbbb4231 Fix test Tessa Walsh 2026-02-10 17:30:35 -05:00
  • 2ed8101505 Add missing test import Tessa Walsh 2026-02-10 17:10:36 -05:00
  • 001c07529e Rename test module so new tests actually run Tessa Walsh 2026-02-10 16:41:23 -05:00
  • 10ab15b741 Write notQueued.jsonl file to new reports dir, not pages Tessa Walsh 2026-02-05 15:45:15 -05:00
  • 7e5e692f90 Update cli-options.md Tessa Walsh 2026-02-05 15:14:49 -05:00
  • 2abbfed38b Add tests Tessa Walsh 2026-02-05 15:07:08 -05:00
  • 8d10a38797 Only write pagesNotQueued.jsonl if option passed Tessa Walsh 2026-02-05 13:06:02 -05:00
  • 95c99d2b0e Write pages file with unqueued urls Tessa Walsh 2026-02-05 12:46:02 -05:00
  • 1ff0e36408 Add method to write pagesNotQueued.jsonl Tessa Walsh 2026-02-05 11:59:25 -05:00
  • 377c18c74a Log when page URL not queued bc of limit hit Tessa Walsh 2026-02-05 11:20:35 -05:00
  • 23e81274c2 remove logger from frame.evaluate() qa-csr-clues-and-text-compare Ilya Kreymer 2026-04-03 09:38:17 -07:00
  • 64fdaf0d11
    Convert tests from JS to TS (#1003) Emma Segal-Grossman 2026-04-02 20:05:41 -04:00
  • 14c2431a20 deps: update ts-jest Ilya Kreymer 2026-04-02 16:06:47 -07:00
  • f7c81286f7
    Merge cf1dc0d0d1 into 8bc0f42362 Emma Segal-Grossman 2026-04-02 21:27:49 +00:00
  • cf1dc0d0d1
    add more logging & timeout for loading frame emma 2026-04-02 17:27:42 -04:00
  • dd2a4ca7b2
    update replaywebpage version update-rwp emma 2026-04-02 16:22:31 -04:00
  • 240830a6d7 remove www test-proxy-sitemap Ilya Kreymer 2026-04-02 12:18:34 -07:00
  • 8615a4fce8 sitemap: use www path Ilya Kreymer 2026-04-02 12:09:23 -07:00
  • 653c4a44c0 use www path Ilya Kreymer 2026-04-02 11:44:33 -07:00
  • 77d6200a0f sitemap test: just wait up to 15 seconds, don't check sitemapDone as it may be set before urls are queued Ilya Kreymer 2026-04-02 11:40:53 -07:00
  • 55a0a2e880 default path Ilya Kreymer 2026-04-02 11:37:10 -07:00
  • 233c792cc1 w/o cors proxy Ilya Kreymer 2026-04-02 11:00:42 -07:00
  • e38b556283 test sitemap loading through proxy test waiting after sitemapDone Ilya Kreymer 2026-04-02 00:23:07 -07:00
  • 931f514693 fix stdio config to pipe stdin, ignore others Ilya Kreymer 2026-04-01 22:44:29 -07:00
  • 338017355e
    fix type in scope test emma 2026-04-01 20:13:13 -04:00
  • 8bc0f42362 version: bump to 1.13.0-beta.0 Ilya Kreymer 2026-04-01 16:10:31 -07:00
  • e9dd8f2f2a further clean up tests emma 2026-03-24 23:04:27 -04:00
  • 04cd5a23b2 show logs on errors emma 2026-03-24 18:41:01 -04:00
  • 482777795e fix ci specific tests emma 2026-03-24 18:05:34 -04:00
  • f1d261987b fix tests emma 2026-03-24 17:13:49 -04:00
  • 9e0ef7f787 set up more test infra emma 2026-03-24 14:59:43 -04:00
  • f509817083 set up ts-jest emma 2026-03-24 14:47:00 -04:00
  • fc15103d70 first pass at updating tests to use typescript emma 2026-03-24 14:41:21 -04:00
  • ca9439d64d ensure status exists Ilya Kreymer 2026-04-01 14:08:15 -07:00
  • f8bd4e5b8b
    only do raw text compare for 200 status code pages Ilya Kreymer 2026-03-31 22:22:30 -07:00
  • e6bf4ee9ad
    capture and track document request id for raw text extraction in qa emma 2026-03-31 21:13:56 -04:00
  • f02b7bcba8
    finish passing through refersTo uuid emma 2026-03-31 21:13:32 -04:00
  • 9b81b6fb44
    add WARC-Refers-To records to outputs emma 2026-03-31 20:29:15 -04:00
  • 39b6fb9c03
    ensure text from response & html text gets written to output warcs in qa emma 2026-03-31 18:16:49 -04:00
  • 2f5469b695
    expand CSR detection patterns emma 2026-03-31 16:59:34 -04:00
  • 395dece012
    compare original text against extracted raw text, rather than replayed text emma 2026-03-31 15:14:21 -04:00
  • 842f375674
    fix incorrect replay suffixes emma 2026-03-31 15:02:27 -04:00
  • 36e3ebc689
    include raw text output when running qa emma 2026-03-31 14:57:11 -04:00
  • 3580ffcd6d
    first pass at tests emma 2026-03-24 13:28:56 -04:00
  • f6baffcc9f
    move csr clues into main "comparison" object in comparison data emma 2026-03-24 13:22:01 -04:00
  • 2a7011b648
    fix duplicate csp clue name emma 2026-03-23 14:25:01 -04:00
  • 0e7e23ca93
    refactor text extraction & generate text from raw response when missing in qa emma 2026-03-18 15:20:22 -04:00
  • 05f3d5d99c
    get jsdom-based text extraction on raw response html working emma 2026-03-17 21:16:42 -04:00
  • dee920f792
    wip 1 emma 2026-03-17 16:15:26 -04:00
  • 5fbc89b8d4
    fix test emma 2026-03-11 17:09:29 -04:00
  • b4e4f426cd
    save csr data to page state, and improve test specificity emma 2026-03-11 16:15:20 -04:00
  • d15556882f
    add test & always write csr clues out when checking for them emma 2026-03-11 16:04:38 -04:00
  • 581a18db00
    implement client-side rendering detection in qa emma 2026-03-11 14:36:49 -04:00
  • cee501a20a
    add reference to external WACZ per revisit record (#1009) v1.12.4 Ilya Kreymer 2026-03-31 17:39:06 -07:00
  • 9976c0474f bump to 1.12.4 Ilya Kreymer 2026-03-31 16:25:49 -07:00