This adds Prettier to the repo and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, and dist as needed.
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines (see the sketch after this list)
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child processes
- Modify tests to use webrecorder.net to avoid timeouts
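A minimal sketch of the JSON Lines logging shape described above. This is illustrative only: method names follow the list above, but details such as the default context, field order, and the fatal exit code are assumptions, not the crawler's actual implementation.

```js
// Illustrative sketch, not the crawler's actual Logger implementation.
// Each call writes one JSON object per line (JSON Lines) to stdout.
class Logger {
  logAsJSON(message, details, context, logLevel) {
    const entry = {
      timestamp: new Date().toISOString(),
      logLevel,
      context,
      message,
      details: details || {},
    };
    process.stdout.write(JSON.stringify(entry) + "\n");
  }

  info(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "info");
  }

  warn(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "warn");
  }

  error(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "error");
  }

  debug(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "debug");
  }

  fatal(message, details = {}, context = "general") {
    this.logAsJSON(message, details, context, "fatal");
    process.exit(1); // exit code is an assumption
  }
}

export const logger = new Logger();
```

Because each entry is a single line of JSON, downstream tools can parse the log stream line by line without buffering whole documents.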
* Add screenshot and thumbnail functionality
Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail (sketched below).
In addition, this commit:
- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker
* Separate screenshotTypes into exported const
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
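A rough sketch of the exported screenshot-type constant and of parsing a comma-separated --screenshot value against it. The field names and defaults here are assumptions for illustration; only the three type names and the exported-const refactor come from the entries above.

```js
// Illustrative only: the actual screenshotTypes const in the crawler may differ.
export const screenshotTypes = {
  view: { type: "view", fullPage: false },
  thumbnail: { type: "thumbnail", fullPage: false },
  fullPage: { type: "fullPage", fullPage: true },
};

// e.g. --screenshot view,fullPage,thumbnail
// (parser name and behavior are assumptions for this sketch)
export function parseScreenshotOption(value) {
  return value
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s in screenshotTypes);
}
```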
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* extra hops depth: add support for the --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes the most common use case in #83 (example below)
* update README with info on `extraHops`, add tests for extraHops
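A representative invocation using the new option; the image tag and mount path are placeholders, not prescribed values.

```sh
# crawl example.com, but also follow links one extra hop beyond the scope boundary
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --extraHops 1
```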
* dependency fix: use pywb 2.6.3, warcio 1.5.0
* bump to 0.5.0-beta.2
* fix typo in setting crawler.capturePrefix that caused directFetchCapture() to fail, breaking capture of non-HTML URLs.
- wrap directFetchCapture() to fall back to loading in the browser if the direct fetch fails (see the sketch below)
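A sketch of the fallback pattern referenced above. The wrapper name is hypothetical, and directFetchCapture() is assumed to be provided by the surrounding crawler code; only the try-direct-then-browser behavior comes from the entry itself.

```js
// Illustrative fallback sketch: try direct (non-browser) fetch capture first,
// then fall back to loading the URL in the browser if the direct fetch fails.
async function loadOrDirectFetch(page, url) {
  try {
    // directFetchCapture() is assumed from the surrounding crawler code
    if (await directFetchCapture(url)) {
      // non-HTML resource captured directly, no browser load needed
      return;
    }
  } catch (e) {
    // direct fetch failed -- fall through to a full browser load
  }
  await page.goto(url, { waitUntil: "load" });
}
```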
* custom link extraction improvements (for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom drivers
- rename queueUrls() to queueInScopeUrls() to indicate that scope filtering is performed
- loadPage() accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false} (see the sketch after this list)
- tests: add test for custom driver which uses custom selector
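A sketch of a custom driver using the select opts described above. The {selector, extract, isAttribute} shape follows the note above; the exact loadPage() signature and the driver's export form are assumptions.

```js
// Illustrative custom driver: restrict link extraction to a custom selector.
export default async ({ data, page, crawler }) => {
  await crawler.loadPage(page, data, [
    { selector: "nav a[href]", extract: "href", isAttribute: false },
  ]);
};
```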
* tests
- tests: all tests use 'test-crawls' instead of 'crawls'
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2
* support hashtags for page-scoped crawls:
- allow hashtags for the current page, automatically setting the scope to the current page with different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0
* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well
* graceful shutdown: ensure redis and pywb processes shut down on exit (for use with singularity, outside of docker)
* replace spaMode with the more generic --scopeType, a shortcut for setting the scope via regex.
scopeType options include (examples below):
prefix - scope is the prefix of the current page (default)
page - scope is the current page + hashtags (spa mode)
domain - scope is the domain/origin of the current page
any - scope is any URL (default for urlFile)
- bump version to 0.4.0-beta.1
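Illustrative invocations of the --scopeType shortcut; the image tag and mount path are placeholders.

```sh
# prefix scope (default): stay under https://example.com/docs/
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/docs/ --scopeType prefix

# domain scope: crawl anything on the same domain/origin
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --scopeType domain
```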
* Resolves #12
* Make --url param optional. Only one of --url or --urlFile should be specified.
* Add ignoreScope option to queueUrls() to support adding specific URLs
* add tests for urlFile
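An illustrative --urlFile invocation; the seed-list mount path and image tag are placeholders, and per the scopeType notes above the scope defaults to 'any' when urlFile is used.

```sh
# seed the crawl from a mounted list of URLs instead of a single --url
docker run -v $PWD/crawls:/crawls/ -v $PWD/seeds.txt:/app/seeds.txt \
  webrecorder/browsertrix-crawler crawl --urlFile /app/seeds.txt
```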
* bump version to 0.3.2
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>