Commit graph

359 commits

Author SHA1 Message Date
Ilya Kreymer
c42d3df889 remove different port, add qa_compare with different user 2024-03-22 09:17:21 -07:00
Ilya Kreymer
a7ee58cc26 fix permissions on downloaded files 2024-03-22 09:04:13 -07:00
Ilya Kreymer
d2760f7054 bump jest version 2024-03-22 09:03:26 -07:00
Ilya Kreymer
f136cdf18c attempt to change port on repeat call 2024-03-22 00:24:21 -07:00
Ilya Kreymer
10e92a4f7b tests: disable retryStrategy for redis, test for better termination behavior 2024-03-22 00:10:26 -07:00
Ilya Kreymer
f6a7dab3ba Merge branch 'main' into qa-crawl-work 2024-03-21 14:04:37 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if a page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, e.g. a single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
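The two-stage interrupt behavior described in the commit above can be sketched as follows. This is an illustrative sketch, not the crawler's actual code: the first SIGINT/SIGTERM requests a graceful finish, while a repeated signal triggers the fast-but-valid path (exit the browser, finalize recorders, flush open WARC writers); SIGKILL cannot be handled at all.

```javascript
// Sketch of two-stage interrupt handling. Function and callback names
// are illustrative, not Browsertrix Crawler's real API.
function makeInterruptHandler(onGraceful, onImmediate) {
  let count = 0;
  return (signal) => {
    count += 1;
    if (count === 1) {
      onGraceful(signal); // finish current pages, then shut down
    } else {
      onImmediate(signal); // close WARCs now, fetch nothing else
    }
  };
}
```

In a real process this handler would be registered via `process.on("SIGINT", ...)` and `process.on("SIGTERM", ...)`.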
Ilya Kreymer
ce2ffca78c
Merge branch 'main' into qa-crawl-work 2024-03-21 13:23:13 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running the crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes the initial regression from 0.12.4 reported in #502, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support signal handling /
graceful shutdown equivalent to running as the same user. To make the
different-user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
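The detached-child pattern from the commit above can be sketched like this, assuming the `DETACHED_CHILD_PROC=1` env variable named in the commit. Children spawned with `detached: true` get their own process group, so a SIGINT/SIGTERM delivered to the crawler's group does not kill them automatically; the helper name here is illustrative.

```javascript
// Compute spawn options for child processes (redis-server, socat, wacz, ...).
// Detaching is gated on the DETACHED_CHILD_PROC env variable, as described
// in the commit; the function name is illustrative.
function childSpawnOpts(env = process.env) {
  return {
    detached: env.DETACHED_CHILD_PROC === "1", // own process group if "1"
    stdio: "inherit",
  };
}

// usage (illustrative): spawn("redis-server", [], childSpawnOpts())
```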
Ilya Kreymer
b18148b715 tests: change ports for different tests that use redis to be unique 2024-03-20 12:14:49 -07:00
Ilya Kreymer
aee5af5578 more cleanup 2024-03-20 12:05:55 -07:00
Ilya Kreymer
52f80d0440 cleanup, add more constants, remove commented out code 2024-03-20 12:02:37 -07:00
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
79e39ae2f0
Merge pull request #501 from webrecorder/docs-minor-fixes
Docs: Minor fixes to edit link & clarifications
2024-03-20 13:04:12 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
cb435f6d4f readd parseArgs import 2024-03-19 11:26:20 -07:00
Ilya Kreymer
e4d8388ac8 Merge branch 'main' into qa-crawl-work 2024-03-19 11:25:58 -07:00
Ilya Kreymer
9a2ada3461 version: bump to 1.0.0 2024-03-18 19:15:35 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` date filters, to only include URLs between the given
dates, also applied to nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in redis whether sitemaps have already
been parsed, and serialize that to the saved state to avoid reparsing. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
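The `pageLimit`-aware behavior from the sitemap commit above can be sketched as follows: stop adding sitemap URLs once the limit is reached and signal that further parsing should be interrupted. All names here are illustrative, not the crawler's actual API.

```javascript
// Collect sitemap URLs up to pageLimit (0 = unlimited), reporting whether
// parsing should be interrupted. Illustrative sketch only.
function collectSitemapUrls(urls, pageLimit) {
  const accepted = [];
  for (const url of urls) {
    if (pageLimit && accepted.length >= pageLimit) {
      return { accepted, interrupted: true }; // at limit: stop parsing
    }
    accepted.push(url);
  }
  return { accepted, interrupted: false };
}
```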
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Tessa Walsh
4d64eedcd3
Temporarily disable tmp-cdx creation (#499)
Fixes #498 

To revert after 1.0.0 when we make changes that allow for using the temp
CDX in WACZ creation.
2024-03-18 14:03:34 -07:00
Ilya Kreymer
251e1b3005 Merge branch 'main' into qa-crawl-work
bump to 1.1.0-beta.1
2024-03-16 15:34:57 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
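The 'finished' list computation described in the commit above (seen list, minus failed and queued URLs) can be sketched like this; the function name and set arguments are illustrative.

```javascript
// finished = seen - failed - queued, per the state-serialization fix above.
// Illustrative sketch, not the crawler's actual implementation.
function computeFinished(seen, failed, queued) {
  const exclude = new Set([...failed, ...queued]);
  return [...seen].filter((url) => !exclude.has(url));
}
```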
Ilya Kreymer
ceffad9599 cleanup 2024-03-12 21:49:58 -04:00
Ilya Kreymer
8d0f4117dc disable CORS for replaycrawler (for now) to allow loading any existing WACZ from 'localhost' for replay QA 2024-03-12 17:06:58 -04:00
Ilya Kreymer
aa4ecd5a31 qa crawl init: support loading pages from json file if 'pages' key is specified, otherwise load from 'resources' 2024-03-12 08:08:02 -04:00
Ilya Kreymer
d7d6558741 support loading multi-wacz .json files locally
support parsing out the query string when detecting file type
2024-03-10 18:18:34 -07:00
Ilya Kreymer
3a9ffd826c tests: try different port for redis 2024-03-08 13:13:49 -08:00
Ilya Kreymer
0abfaac87d qa test: use redis://127.0.0.1:36379 for ci to match other redis usage 2024-03-08 13:02:00 -08:00
Ilya Kreymer
0a1018a780 Merge branch 'main' into qa-crawl-work 2024-03-08 12:53:33 -08:00
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
5a1b2a99bb tests: add qa comparison test:
- run crawl with 3 pages, text/screenshots enabled
- run qa crawl using resulting WACZ
- enable writing pages to redis
- verify comparison data is included in page data added to redis ':pages' key
while crawl is running
2024-03-08 12:47:36 -08:00
Ilya Kreymer
4f4f7a1324 qa: consolidate comparison data into pages data added to redis
- add pageEntryForRedis() overridable in replaycrawler to add 'comparison' data
- add separate type for ComparisonData
- add comparison data for processPageInfo, if pagestate is available
- additional type fixes
- remove --qaWriteToRedis, now included with page data
2024-03-08 11:31:11 -08:00
Ilya Kreymer
5c42549228 Merge branch 'main' into qa-crawl-work 2024-03-08 11:07:37 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid always inited for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
c4231e5196 misc qa work:
- ensure original pageid is used for qa'd pages
- use standard ':qa' key to write qa comparison data to with --qaWriteToRedis
- print crawl stats in qa
- include title + favicons in qa
2024-03-07 17:14:46 -08:00
Ilya Kreymer
2d85f2de2b Merge branch 'main' into qa-crawl-work 2024-03-07 14:23:30 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (e.g. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
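The include/exclude log-context filtering from the commit above can be sketched as follows. The default exclude list is the one given in the commit message (`screencast`, `recorderNetwork`, `jsErrors`); the function name is illustrative.

```javascript
// Decide whether a log line's context should be emitted, mirroring the
// --logContext (allowlist) and --logExcludeContext (denylist) flags
// described above. Illustrative sketch only.
const DEFAULT_EXCLUDE_CONTEXTS = ["screencast", "recorderNetwork", "jsErrors"];

function shouldLogContext(context, include = [], exclude = DEFAULT_EXCLUDE_CONTEXTS) {
  if (include.length) {
    return include.includes(context); // explicit allowlist wins
  }
  return !exclude.includes(context);
}
```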
Ilya Kreymer
c98742417f Merge branch 'main' into qa-crawl-work 2024-03-05 11:00:51 -08:00
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
follow-up to #481: check reqresp.resourceType with the lowercase value, and just
set the message based on the resourceType value
2024-03-04 23:58:39 -08:00
Ilya Kreymer
63cedbc91a version: bump to 1.0.0-beta.6 2024-03-04 18:11:28 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add the resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as the `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add a fail-on-status-code option, `--failOnInvalidStatus`, to treat non-200
responses as failures. Can be useful especially when combined with
`--failOnFailedSeed` or `--failOnFailedLimit`

requeue: ensure requeued urls are requeued with the same depth/priority, not
0
2024-03-04 17:21:44 -08:00
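The `--failOnInvalidStatus` behavior above can be sketched like this. Treating "non-200" as anything outside the 2xx range is an assumption here, and status 0 for requests that never completed follows the failed-request tracking commit earlier in this log; the helper name is illustrative.

```javascript
// Decide whether a page load counts as a failure. Status 0 marks a fetch
// that never completed; other non-2xx statuses fail only when the
// --failOnInvalidStatus option is set. Illustrative sketch only.
function isPageFailure(statusCode, failOnInvalidStatus) {
  if (statusCode === 0) {
    return true; // request failed outright
  }
  return Boolean(failOnInvalidStatus) && (statusCode < 200 || statusCode >= 300);
}
```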
Ilya Kreymer
0e0d74e799 fixes for 1.0.0-beta.5 merge 2024-02-28 23:33:35 -08:00
Ilya Kreymer
fb9de39cb3 Merge branch 'dev-1.0.0' into qa-crawl-work 2024-02-28 23:28:50 -08:00