browsertrix-crawler/util/constants.js
Ilya Kreymer 0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00

11 lines
339 B
JavaScript

module.exports.HTML_TYPES = ["text/html", "application/xhtml", "application/xhtml+xml"];
module.exports.WAIT_UNTIL_OPTS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];
module.exports.BEHAVIOR_LOG_FUNC = "__bx_log";
module.exports.DEFAULT_SELECTORS = [{
selector: "a[href]",
extract: "href",
isAttribute: false
}];