mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 14:33:17 +00:00

* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML URLs to fail
  - wrap directFetchCapture() to retry browser loading in case of failure
* custom link extraction improvements (improvements for #25)
  - extractLinks() returns a list of link URLs to allow for more flexibility in custom drivers
  - rename queueUrls() to queueInScopeUrls() to indicate that scope filtering is performed
  - loadPage() accepts a list of select opts {selector, extract, isAttribute}, defaulting to {"a[href]", "href", false}
  - tests: add test for a custom driver which uses a custom selector
* tests
  - all tests use 'test-crawls' instead of 'crawls'
  - consolidation: combine initial crawl + rollover; combine warc and text tests into basic_crawl.test.js
  - add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2
11 lines
339 B
JavaScript
module.exports.HTML_TYPES = ["text/html", "application/xhtml", "application/xhtml+xml"];

module.exports.WAIT_UNTIL_OPTS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];

module.exports.BEHAVIOR_LOG_FUNC = "__bx_log";


module.exports.DEFAULT_SELECTORS = [{
  selector: "a[href]",
  extract: "href",
  isAttribute: false
}];
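Each selector spec pairs a CSS selector with an extraction rule: `extract` names what to read from each matched element, and `isAttribute` decides whether that is read as a DOM attribute (via `getAttribute`) or as an element property (e.g. `el.href`, which yields the resolved absolute URL). A minimal sketch of that interpretation, using plain element-like stand-in objects — `extractWithSpec` is a hypothetical illustration, not part of browsertrix-crawler:

```javascript
// Hypothetical helper: apply one {selector, extract, isAttribute} spec to a
// list of element-like objects (stand-ins for querySelectorAll results).
function extractWithSpec(elements, { extract, isAttribute }) {
  return elements
    .map((el) => (isAttribute ? el.getAttribute(extract) : el[extract]))
    .filter((value) => value != null);
}

// Stand-ins for the elements matched by the default "a[href]" selector.
const anchors = [
  { href: "https://example.com/page1", getAttribute: () => null },
  { href: "https://example.com/page2", getAttribute: () => null },
];

const links = extractWithSpec(anchors, {
  selector: "a[href]",
  extract: "href",
  isAttribute: false,
});
console.log(links); // → ["https://example.com/page1", "https://example.com/page2"]
```

With `isAttribute: false` the `href` property is read, which in a real browser returns the fully resolved URL; setting `isAttribute: true` would instead read the raw (possibly relative) attribute value.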