browsertrix-crawler/util/constants.js at 0.7.2 - Stowage/browsertrix-crawler - Remotebranch.eu


mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Ilya Kreymer 0e0b85d7c3

Customizable extract selectors + typo fix (0.4.2) (#72)

* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests use 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2

2021-07-23 18:31:43 -07:00

11 lines

339 B

JavaScript


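// MIME types that are treated as HTML pages.
module.exports.HTML_TYPES = ["text/html", "application/xhtml", "application/xhtml+xml"];

// Accepted values for the page-load "wait until" option (Puppeteer lifecycle events).
module.exports.WAIT_UNTIL_OPTS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];

// Name of the function used for behavior logging.
module.exports.BEHAVIOR_LOG_FUNC = "__bx_log";

// Default link-extraction options: select a[href] elements and extract "href".
module.exports.DEFAULT_SELECTORS = [{
  selector: "a[href]",
  extract: "href",
  isAttribute: false
}];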
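
The commit message describes select opts of the form {selector, extract, isAttribute}. The sketch below illustrates one plausible reading of `isAttribute`: when false, the value is read as an element property (which for `href` is typically the resolved absolute URL); when true, it is read with `getAttribute()` (the raw attribute text). `extractValues` and `fakeAnchor` are hypothetical helpers written for this illustration and are not part of the crawler; the fake elements stand in for DOM nodes so the example runs in plain Node.js.

```javascript
// Copy of the default from constants.js so the sketch runs on its own.
const DEFAULT_SELECTORS = [{
  selector: "a[href]",
  extract: "href",
  isAttribute: false
}];

// Hypothetical helper (assumption, not the crawler's code): reads one value
// per element, either as an attribute or as a property, skipping empty values.
function extractValues(elements, {extract = "href", isAttribute = false} = {}) {
  const values = [];
  for (const elem of elements) {
    const value = isAttribute ? elem.getAttribute(extract) : elem[extract];
    if (value) {
      values.push(value);
    }
  }
  return values;
}

// Minimal stand-in for a DOM anchor element: the href property is resolved
// against a base URL, while getAttribute("href") returns the raw text.
function fakeAnchor(rawHref, baseUrl) {
  return {
    href: new URL(rawHref, baseUrl).href,
    getAttribute: (name) => (name === "href" ? rawHref : null),
  };
}

const anchors = [
  fakeAnchor("/about", "https://example.com/"),
  fakeAnchor("page2.html", "https://example.com/docs/"),
];

// Default opts: property access, so the values are absolute URLs.
const asProperty = extractValues(anchors, DEFAULT_SELECTORS[0]);

// Same selector with isAttribute switched on: raw attribute values.
const asAttribute = extractValues(anchors, {...DEFAULT_SELECTORS[0], isAttribute: true});

console.log(asProperty);
console.log(asAttribute);
```

Under this reading, the default of `isAttribute: false` is convenient for link extraction because the collected values are already absolute URLs, while `isAttribute: true` would suit attributes that have no corresponding element property.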