// MIME types treated as HTML documents
export const HTML_TYPES = [
  "text/html",
  "application/xhtml",
  "application/xhtml+xml",
];

// Valid page-load conditions: Puppeteer's page.goto() lifecycle events
export const WAIT_UNTIL_OPTS = [
  "load",
  "domcontentloaded",
  "networkidle0",
  "networkidle2",
];
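// Usage sketch (not crawler code): each WAIT_UNTIL_OPTS entry is a valid
// `waitUntil` lifecycle event for Puppeteer's page.goto().
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
// wait until there are no more than 2 network connections for 500 ms
await page.goto("https://example.com/", { waitUntil: "networkidle2" });
await browser.close();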
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5); see the sketch after this message
- `fromDate` and `toDate` filter dates, to include only URLs between the
given dates, also filtering nested sitemap lists
- async parsing: after the first 100 URLs, parsing continues in the
background
- timeout of 30 seconds for the initial fetch / first 100 URLs, to avoid
slowing down the crawl
- save/load state integration: mark in redis whether sitemaps have already
been parsed and serialize that to the saved state, to avoid reparsing
(will reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit
- robots.txt `sitemap:` parsing, checking URL extension and MIME type
- automatic detection of sitemaps for a seed URL if no sitemap URL is
provided: first check robots.txt, then /sitemap.xml
- tests for full sitemap autodetect, sitemap with a limit, and sitemap
from a specific URL
Fixes #496
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
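A minimal sketch of the recursive, p-queue-driven approach described above,
using the real `p-queue` and `sax` packages. The names `parseSitemap` and
`DateRange` are illustrative, not the crawler's actual API, and this version
buffers each sitemap in memory rather than streaming it:

import PQueue from "p-queue";
import sax from "sax";

// Parse up to 5 sitemaps concurrently, per the commit message.
const queue = new PQueue({ concurrency: 5 });

interface DateRange {
  fromDate?: Date;
  toDate?: Date;
}

// Recursively parse a sitemap or sitemap index, emitting page URLs whose
// <lastmod> falls within the optional date range; nested sitemaps are
// queued so at most 5 are processed at a time.
async function parseSitemap(
  url: string,
  range: DateRange,
  emit: (pageUrl: string) => void,
): Promise<void> {
  const xml = await (await fetch(url)).text();
  const parser = sax.parser(false, { lowercase: true });

  let tag = "";
  let loc = "";
  let lastmod = "";

  parser.onopentag = (node) => {
    tag = node.name;
  };
  parser.ontext = (text) => {
    if (tag === "loc") loc += text.trim();
    if (tag === "lastmod") lastmod += text.trim();
  };
  parser.onclosetag = (name) => {
    tag = "";
    // act only when a full <url> or <sitemap> entry closes
    if (name !== "url" && name !== "sitemap") {
      return;
    }
    const mod = lastmod ? new Date(lastmod) : null;
    const inRange =
      (!range.fromDate || !mod || mod.getTime() >= range.fromDate.getTime()) &&
      (!range.toDate || !mod || mod.getTime() <= range.toDate.getTime());
    if (inRange && loc) {
      const target = loc; // capture before reset below
      if (name === "sitemap") {
        // entry in a <sitemapindex>: recurse via the queue
        void queue.add(() => parseSitemap(target, range, emit));
      } else {
        emit(target);
      }
    }
    loc = "";
    lastmod = "";
  };

  parser.write(xml).close();
}

// Example: parse a sitemap and print every page URL once the queue drains.
void queue.add(() =>
  parseSitemap("https://example.com/sitemap.xml", {}, (u) => console.log(u)),
);
await queue.onIdle();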
// Service worker handling options
export const SERVICE_WORKER_OPTS = [
  "disabled",
  "disabled-if-profile",
  "enabled",
] as const;

export type ServiceWorkerOpt = (typeof SERVICE_WORKER_OPTS)[number];
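// Hypothetical helper (not in the crawler): because SERVICE_WORKER_OPTS is
// declared `as const`, ServiceWorkerOpt narrows to the union of the three
// literal strings, so a runtime check can safely narrow arbitrary input.
function parseServiceWorkerOpt(value: string): ServiceWorkerOpt {
  if ((SERVICE_WORKER_OPTS as readonly string[]).includes(value)) {
    return value as ServiceWorkerOpt;
  }
  throw new Error(`invalid service worker option: ${value}`);
}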
// Sentinel value: autodetect the sitemap for a seed URL
// (per #497: check robots.txt first, then /sitemap.xml)
export const DETECT_SITEMAP = "<detect>";
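// Sketch of the autodetection the sitemap commit describes: look for a
// `Sitemap:` line in robots.txt, then fall back to /sitemap.xml.
// `detectSitemap` is a hypothetical helper, not the crawler's actual function.
async function detectSitemap(seedUrl: string): Promise<string | null> {
  const origin = new URL(seedUrl).origin;
  try {
    const robots = await (await fetch(origin + "/robots.txt")).text();
    const match = robots.match(/^sitemap:\s*(\S+)/im);
    if (match) {
      return match[1];
    }
  } catch {
    // no robots.txt or fetch failed: fall through to the default path
  }
  const fallback = origin + "/sitemap.xml";
  const resp = await fetch(fallback, { method: "HEAD" });
  return resp.ok ? fallback : null;
}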
// Text extraction output modes
export const EXTRACT_TEXT_TYPES = ["to-pages", "to-warc", "final-to-warc"];
// Names of functions used in the page context: behavior logging and
// queueing newly discovered links
export const BEHAVIOR_LOG_FUNC = "__bx_log";

export const ADD_LINK_FUNC = "__bx_addLink";
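// Sketch of how such names can be bridged into the page so in-page behavior
// scripts can call back into the crawler. This assumes Puppeteer's
// page.exposeFunction(); whether the crawler uses exactly this mechanism is
// an assumption, and `queueUrl` is a hypothetical callback.
import type { Page } from "puppeteer";

async function exposeCrawlerFuncs(
  page: Page,
  queueUrl: (url: string) => void,
) {
  // In-page code can now call window.__bx_log(...) and window.__bx_addLink(...)
  await page.exposeFunction(BEHAVIOR_LOG_FUNC, (data: unknown) =>
    console.log("behavior:", data),
  );
  await page.exposeFunction(ADD_LINK_FUNC, (url: string) => queueUrl(url));
}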
// Effectively unbounded crawl depth
export const MAX_DEPTH = 1000000;

export const FETCH_HEADERS_TIMEOUT_SECS = 30;

export const PAGE_OP_TIMEOUT_SECS = 5;

// Timeout for the initial sitemap fetch / first batch of URLs (see #497)
export const SITEMAP_INITIAL_FETCH_TIMEOUT_SECS = 30;
// Default link-extraction rule: read the `href` property (not the raw
// attribute) of each <a href> element, so relative links come back as
// absolute URLs
export const DEFAULT_SELECTORS = [
  {
    selector: "a[href]",
    extract: "href",
    isAttribute: false,
  },
];
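// Sketch of how a selector spec like DEFAULT_SELECTORS[0] can drive in-page
// link extraction (hypothetical helper, assuming Puppeteer). With
// isAttribute: false the element property is read (e.g. a.href, already
// resolved to an absolute URL) rather than the raw attribute value.
import type { Page } from "puppeteer";

interface SelectorSpec {
  selector: string;
  extract: string;
  isAttribute: boolean;
}

async function extractLinks(page: Page, spec: SelectorSpec): Promise<string[]> {
  return page.evaluate(({ selector, extract, isAttribute }) => {
    return Array.from(document.querySelectorAll(selector), (el) => {
      const value = isAttribute
        ? el.getAttribute(extract)
        : (el as unknown as Record<string, unknown>)[extract];
      return typeof value === "string" ? value : "";
    }).filter((href) => href !== "");
  }, spec);
}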
// X display used when running the browser under a virtual display (e.g. Xvfb)
export const DISPLAY = ":99";