browsertrix-crawler/tests/extra_hops_depth.test.js

import fs from "fs";

import util from "util";
import {exec as execCallback } from "child_process";

const exec = util.promisify(execCallback);

const extraHopsTimeout = 180000;


test("check that URLs are crawled 2 extra hops beyond depth", async () => {
  try {
    await exec("docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection extra-hops-beyond --extraHops 2 --url https://webrecorder.net/ --limit 7");
  }
  catch (error) {
    console.log(error);
  }

  const crawledPages = fs.readFileSync("test-crawls/collections/extra-hops-beyond/pages/pages.jsonl", "utf8");
  const crawledPagesArray = crawledPages.trim().split("\n");

  const expectedPages = [
    "https://webrecorder.net/",
    "https://webrecorder.net/blog",
    "https://webrecorder.net/tools",
    "https://webrecorder.net/community",
    "https://webrecorder.net/about",
    "https://webrecorder.net/contact",
    "https://webrecorder.net/faq",
  ];

  // first line is the header, not page, so adding -1
  expect(crawledPagesArray.length - 1).toEqual(expectedPages.length);

  for (const page of crawledPagesArray) {
    const url = JSON.parse(page).url;
    if (!url) {
      continue;
    }
    expect(expectedPages.indexOf(url) >= 0).toBe(true);
  }
}, extraHopsTimeout);
Convert to ESM (#179) * switch base image to chrome/chromium 105 with node 18.x * convert all source to esm for node 18.x, remove unneeded node-fetch dependency * ci: use node 18.x, update to latest actions * tests: convert to esm, run with --experimental-vm-modules * tests: set higher default timeout (90s) for all tests * tests: rename driver test fixture to .mjs for loading in jest * bump to 0.8.0 2022-10-24 15:30:10 +02:00			`import fs from "fs";`

			`import util from "util";`
			`import {exec as execCallback } from "child_process";`

			`const exec = util.promisify(execCallback);`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer neede without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: addition logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> 2023-03-17 12:50:32 -07:00			`const extraHopsTimeout = 180000;`

Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00
Convert to ESM (#179) * switch base image to chrome/chromium 105 with node 18.x * convert all source to esm for node 18.x, remove unneeded node-fetch dependency * ci: use node 18.x, update to latest actions * tests: convert to esm, run with --experimental-vm-modules * tests: set higher default timeout (90s) for all tests * tests: rename driver test fixture to .mjs for loading in jest * bump to 0.8.0 2022-10-24 15:30:10 +02:00			`test("check that URLs are crawled 2 extra hops beyond depth", async () => {`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00			`try {`
Implement improved json-l logging - Add Logger class with methods for info, error, warn, debug, fatal - Add context, timestamp, and details fields to log entries - Log messages as JSON Lines - Replace puppeteer-cluster stats with custom stats implementation - Log behaviors by default - Amend argParser to reflect logging changes - Capture and log stdout/stderr from awaited child_processes - Modify tests to use webrecorder.net to avoid timeouts 2022-12-15 12:38:41 -05:00			`await exec("docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection extra-hops-beyond --extraHops 2 --url https://webrecorder.net/ --limit 7");`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00			`}`
			`catch (error) {`
			`console.log(error);`
			`}`

Switch back to Puppeteer from Playwright (#301) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> 2023-04-26 15:41:35 -07:00			`const crawledPages = fs.readFileSync("test-crawls/collections/extra-hops-beyond/pages/pages.jsonl", "utf8");`
			`const crawledPagesArray = crawledPages.trim().split("\n");`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00
			`const expectedPages = [`
Implement improved json-l logging - Add Logger class with methods for info, error, warn, debug, fatal - Add context, timestamp, and details fields to log entries - Log messages as JSON Lines - Replace puppeteer-cluster stats with custom stats implementation - Log behaviors by default - Amend argParser to reflect logging changes - Capture and log stdout/stderr from awaited child_processes - Modify tests to use webrecorder.net to avoid timeouts 2022-12-15 12:38:41 -05:00			`"https://webrecorder.net/",`
			`"https://webrecorder.net/blog",`
			`"https://webrecorder.net/tools",`
			`"https://webrecorder.net/community",`
			`"https://webrecorder.net/about",`
			`"https://webrecorder.net/contact",`
			`"https://webrecorder.net/faq",`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00			`];`

Switch back to Puppeteer from Playwright (#301) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> 2023-04-26 15:41:35 -07:00			`// first line is the header, not page, so adding -1`
Add option to output stats file live, i.e. after each page crawled (#374) * Add option to output stats file live, i.e. after each page crawled * Always output stat files after each page crawled (+ test) * Fix inversion between expected and test value 2023-09-15 00:16:19 +02:00			`expect(crawledPagesArray.length - 1).toEqual(expectedPages.length);`
Switch back to Puppeteer from Playwright (#301) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> 2023-04-26 15:41:35 -07:00
			`for (const page of crawledPagesArray) {`
Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2 2022-01-15 09:03:09 -08:00			`const url = JSON.parse(page).url;`
			`if (!url) {`
			`continue;`
			`}`
			`expect(expectedPages.indexOf(url) >= 0).toBe(true);`
			`}`
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer neede without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: addition logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> 2023-03-17 12:50:32 -07:00			`}, extraHopsTimeout);`