Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, filtering nested sitemap lists included
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, interrupt
further parsing when at limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
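
The `fromDate` / `toDate` filtering above could look roughly like the following. This is a minimal sketch, not the parser's actual code: `isWithinDateRange` is a hypothetical helper, and keeping entries with a missing or unparseable `<lastmod>` is an assumption.

```typescript
// Hypothetical helper: decide whether a sitemap entry's <lastmod> value
// falls inside an optional [fromDate, toDate] window.
function isWithinDateRange(
  lastmod: string | undefined,
  fromDate?: Date,
  toDate?: Date,
): boolean {
  // Assumption: entries without a lastmod are kept, as they can't be compared.
  if (!lastmod) {
    return true;
  }
  const time = new Date(lastmod).getTime();
  // Assumption: unparseable dates are also kept rather than dropped.
  if (isNaN(time)) {
    return true;
  }
  if (fromDate && time < fromDate.getTime()) {
    return false;
  }
  if (toDate && time > toDate.getTime()) {
    return false;
  }
  return true;
}

const rangeStart = new Date("2024-01-01T00:00:00Z");
const rangeEnd = new Date("2024-06-30T23:59:59Z");
const inRange = isWithinDateRange("2024-03-15", rangeStart, rangeEnd);
const tooOld = isWithinDateRange("2023-12-31", rangeStart, rangeEnd);
```

The same check can be applied to nested sitemap entries in a sitemap index, so that whole sub-sitemaps outside the window are skipped.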
Fixes #496
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #493
This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.
Initial docs site set to https://crawler.docs.browsertrix.com/
Many thanks to @Shrinks99 for help setting this up!
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.
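
A minimal sketch of how the 'finished' list could be derived from the seen list. `computeFinished` is a hypothetical helper for illustration; the real state lives in Redis rather than in plain arrays.

```typescript
// Hypothetical helper: finished = seen minus failed and queued URLs.
function computeFinished(
  seen: string[],
  failed: string[],
  queued: string[],
): string[] {
  const notFinished = new Set<string>([...failed, ...queued]);
  return seen.filter((url) => !notFinished.has(url));
}

const finished = computeFinished(
  ["https://a/", "https://b/", "https://c/", "https://d/"],
  ["https://b/"],
  ["https://d/"],
);
```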
Fixes #491
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext
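
The interaction between the include and exclude lists might be sketched as below. `shouldLog` is a hypothetical function, not the logger's actual API, and treating an empty include list as "allow all" is an assumption.

```typescript
// Default exclude list, as described above.
const DEFAULT_EXCLUDE_CONTEXTS = ["screencast", "recorderNetwork", "jsErrors"];

// Hypothetical helper: decide whether a message in a given context is logged.
function shouldLog(
  context: string,
  includeContexts: string[] = [],
  excludeContexts: string[] = DEFAULT_EXCLUDE_CONTEXTS,
): boolean {
  // Assumption: an empty include list means all contexts are allowed.
  if (includeContexts.length > 0 && !includeContexts.includes(context)) {
    return false;
  }
  return !excludeContexts.includes(context);
}

const logsScreencast = shouldLog("screencast");
// Passing a custom exclude list re-enables 'recorderNetwork' for debugging.
const logsNetwork = shouldLog("recorderNetwork", [], ["screencast", "jsErrors"]);
```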
recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() so it still works if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)
tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0
bump to 1.0.0-beta.7
Add fail on status code option, --failOnInvalidStatus to treat non-200
responses as failures. Can be useful especially when combined with
--failOnFailedSeed or --failOnFailedLimit
requeue: ensure requeued urls are requeued with same depth/priority, not
0
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.
Simplify adding links by simply adding the links directly, instead of
batching to 500 links. Errors are already being logged if queueing a new
URL fails.
don't treat non-200 pages as errors, still extract text, take
screenshots, and run behaviors
only consider actual page load errors, eg. chrome-error:// page url, as
errors
- if a seed page redirects (page response != seed url), then add the
final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
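
The redirect handling above can be sketched as follows. This is an illustration under assumptions: `SeedSketch` is a simplified, hypothetical stand-in for ScopedSeed, and `extraSeedForRedirect` and `isChromeError` are hypothetical helper names.

```typescript
// Hypothetical, simplified seed shape; the real ScopedSeed has more fields.
interface SeedSketch {
  url: string;
  include: string[];
  exclude: string[];
}

// Duplicate a seed with a different URL, keeping the original includes / excludes.
function newScopeSeed(seed: SeedSketch, url: string): SeedSketch {
  return { ...seed, url };
}

// True if the page ended up on a Chrome error page and should be marked failed.
function isChromeError(pageUrl: string): boolean {
  return pageUrl.startsWith("chrome-error://");
}

// Returns an extra seed if the seed page redirected, otherwise null.
function extraSeedForRedirect(
  seed: SeedSketch,
  finalUrl: string,
): SeedSketch | null {
  if (isChromeError(finalUrl) || finalUrl === seed.url) {
    return null;
  }
  return newScopeSeed(seed, finalUrl);
}

const exampleSeed: SeedSketch = {
  url: "http://example.com/",
  include: ["^https?://(www\\.)?example\\.com/"],
  exclude: [],
};
const extraSeed = extraSeedForRedirect(exampleSeed, "https://www.example.com/");
```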
The `urn:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
- recorder: don't attempt to record response with mime type
`text/event-stream` (will not terminate).
- resources: don't track non http/https resources.
- resources: store page timestamp on first resources URL match, in case
multiple responses for same page encountered.
Ensure cached resources (that are not written to WARC) are still
included in the `urn:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.
Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
include cached resources
Fixes #462
Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add
pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Generate records for each page, containing a list of resources and their
status codes, to aid in future diffing/comparison.
Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle
non-GET requests
- Adds `writeSingleRecord` to WARCWriter
Fixes #457
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp
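
As a standalone illustration of the template, here is a minimal sketch. `warcFilename` is a hypothetical function, and the timestamp format (ISO date reduced to digits) is an assumption, not something stated above.

```typescript
// Hypothetical standalone version of the filename template described above.
function warcFilename(
  prefix: string,
  crawlId: string,
  workerid: number,
  gzip: boolean,
  now: Date = new Date(),
): string {
  // `$ts` has no braces, so it stays literal in the template string.
  const template = `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
  // Assumption: the timestamp is the ISO date with all non-digits removed.
  const ts = now.toISOString().replace(/[^\d]/g, "");
  // `$ts` is filled in only when a new file is actually created.
  return template.replace("$ts", ts);
}

const exampleFilename = warcFilename(
  "rec",
  "mycrawl",
  0,
  true,
  new Date("2024-01-02T03:04:05.678Z"),
);
```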
Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging indefinitely on finish
- fix adding `WARC-Truncated` header, need to set after stream is
finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
Ensure the final pending wait also has a timeout, set to max page
timeout x num workers.
Could also set higher, but needs to have a timeout, eg. in case of
downloading live stream that never terminates.
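
A rough sketch of such a bounded wait, using `Promise.race`. The names `finalWaitTimeoutSecs` and `waitWithTimeout` are hypothetical and only illustrate the "max page timeout x num workers" rule.

```typescript
// Timeout for the final pending wait: max page timeout x number of workers.
function finalWaitTimeoutSecs(maxPageTimeSecs: number, numWorkers: number): number {
  return maxPageTimeSecs * numWorkers;
}

// Hypothetical helper: resolves to true if the task finished in time,
// or false if the timeout fired first (e.g. a never-ending live stream).
async function waitWithTimeout(
  task: Promise<unknown>,
  timeoutSecs: number,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutSecs * 1000);
  });
  try {
    return await Promise.race([task.then(() => true), timedOut]);
  } finally {
    // Clear the timer so a finished task doesn't keep the process alive.
    clearTimeout(timer);
  }
}

const finalTimeout = finalWaitTimeoutSecs(90, 4);
```

A caller would then await `waitWithTimeout(pendingDone, finalTimeout)` and continue shutdown whichever way it resolves.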
Fixes #348 in the 0.12.x line.
Also bumps version to 0.12.3
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
Due to an optimization, numPending() call assumed that queueSize() would
be called to update cached queue size. However, in the current worker
code, this is not the case. Remove caching the queue size and just check
queue size in numPending(), to ensure pending list is always processed.
When calling directFetchCapture, and aborting the response via an
exception, throw `new Error("response-filtered-out");`
so that it can be ignored. This exception is only used for direct
capture, and should not be logged as an error - rethrow and
handle in calling function to indicate direct fetch is skipped
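
The throw / rethrow pattern could be sketched like this. Both functions are simplified, hypothetical stand-ins (the real directFetchCapture works on network responses); only the `"response-filtered-out"` message comes from the description above.

```typescript
const FILTERED_OUT = "response-filtered-out";

// Hypothetical stand-in for the direct capture step: aborts via an
// exception when the response should not be captured directly.
function directFetchCapture(url: string, allowed: (url: string) => boolean): string {
  if (!allowed(url)) {
    throw new Error(FILTERED_OUT);
  }
  return `captured ${url}`;
}

// Calling function: a filtered-out response means "direct fetch skipped",
// not an error, so it is not logged. Anything else is rethrown.
function tryDirectFetch(url: string, allowed: (url: string) => boolean): string | null {
  try {
    return directFetchCapture(url, allowed);
  } catch (e) {
    if (e instanceof Error && e.message === FILTERED_OUT) {
      return null;
    }
    throw e;
  }
}

const onlyPdf = (url: string) => url.endsWith(".pdf");
const capturedPdf = tryDirectFetch("https://example.com/a.pdf", onlyPdf);
const skippedHtml = tryDirectFetch("https://example.com/index.html", onlyPdf);
```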
Previously, responses >2MB were streamed to disk and an empty response returned to browser,
to avoid holding large response in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content loaded.
This PR sets different limits for:
- HTML as well as other JS necessary for page to load to 25MB
- All other content limit is set to 5MB
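
The two limits might be applied along these lines. This is a sketch under assumptions: the constant and function names are hypothetical, the byte values assume decimal megabytes, and treating the "document" and "script" resource types as the content needed for page load is a guess at the classification.

```typescript
// Assumption: limits expressed in decimal megabytes.
const ESSENTIAL_CONTENT_LIMIT = 25_000_000; // HTML and JS needed for page load
const DEFAULT_CONTENT_LIMIT = 5_000_000; // all other content

// Assumption: these resource types count as necessary for the page to load.
const ESSENTIAL_RESOURCE_TYPES = ["document", "script"];

// Hypothetical helper: size limit for keeping a response in memory.
function maxInMemorySize(resourceType: string): number {
  return ESSENTIAL_RESOURCE_TYPES.includes(resourceType.toLowerCase())
    ? ESSENTIAL_CONTENT_LIMIT
    : DEFAULT_CONTENT_LIMIT;
}

// Responses over the limit are streamed to disk instead of held in memory.
function shouldStreamToDisk(resourceType: string, contentLength: number): boolean {
  return contentLength > maxInMemorySize(resourceType);
}

const largeHtmlStreamed = shouldStreamToDisk("Document", 3_000_000);
const largeImageStreamed = shouldStreamToDisk("Image", 6_000_000);
```

With the old 2MB limit, the 3MB HTML page in this example would have been streamed to disk and never loaded in the browser.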
Also includes some more type fixing
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.