Commit graph

331 commits

Author SHA1 Message Date
Ilya Kreymer
f2fa0f8de0 cleanup 2024-03-22 21:50:54 -07:00
Ilya Kreymer
50a771cc68 Merge branch 'unify-warc-writer' into use-js-wacz 2024-03-22 21:49:15 -07:00
Ilya Kreymer
750d51aede fix screenshots path, disable tempcdx still 2024-03-22 21:44:55 -07:00
Ilya Kreymer
adbcf76502 remove warcresourcewriter
unify warc-writing into a single WARCWriter class to support CDX indexing for all records
create dedicated writers for screenshots and text
2024-03-22 21:08:29 -07:00
Ilya Kreymer
3e76568113 Merge branch 'main' into use-js-wacz 2024-03-22 18:04:28 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that will be replayed/QA'd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
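A hedged sketch of how the `comparison` object above might be assembled from per-resource HTTP statuses. Only the field names come from the commit message; `buildComparison`, its inputs, and the good/bad threshold are illustrative.

```typescript
interface ResourceCounts {
  crawlGood?: number;
  crawlBad?: number;
  replayGood?: number;
  replayBad?: number;
}

interface Comparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: ResourceCounts;
}

// Illustrative helper: counts 2xx/3xx as good, everything else
// (including status 0 for failed fetches) as bad.
function buildComparison(
  crawlStatuses: number[],
  replayStatuses: number[],
  screenshotMatch?: number,
  textMatch?: number,
): Comparison {
  const good = (s: number) => s >= 200 && s < 400;
  return {
    screenshotMatch,
    textMatch,
    resourceCounts: {
      crawlGood: crawlStatuses.filter(good).length,
      crawlBad: crawlStatuses.filter((s) => !good(s)).length,
      replayGood: replayStatuses.filter(good).length,
      replayBad: replayStatuses.filter((s) => !good(s)).length,
    },
  };
}
```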
Tessa Walsh
d5e5976b6f Switch js-wacz dependency to ^0.1.0 2024-03-22 16:49:48 -04:00
Ilya Kreymer
22a7351dc7
service worker capture fix: disable by default for now (#506)
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add --serviceWorker option which can be `disabled`,
`disabled-if-profile` (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing

bump version to 1.0.2
2024-03-22 13:37:14 -07:00
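The three `--serviceWorker` modes above can be sketched as a single bypass decision; `shouldBypassServiceWorkers` is a hypothetical helper for illustration, not the crawler's actual function.

```typescript
type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

// Map each mode to a bypass decision given whether a profile is in use.
const bypassRules: Record<ServiceWorkerOpt, (usingProfile: boolean) => boolean> = {
  disabled: () => true, // new default: always bypass service workers
  "disabled-if-profile": (usingProfile) => usingProfile, // previous default
  enabled: () => false, // never bypass
};

function shouldBypassServiceWorkers(
  opt: ServiceWorkerOpt,
  usingProfile: boolean,
): boolean {
  return bypassRules[opt](usingProfile);
}
```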
Tessa Walsh
a5d36ce1ad Generate CDX with warcio CDXIndexer 2024-03-22 16:35:56 -04:00
Tessa Walsh
82169feffe Fix typo 2024-03-22 16:35:56 -04:00
Tessa Walsh
118ffb0327 Fix extra hops test 2024-03-22 16:35:56 -04:00
Tessa Walsh
84c1ef2098 Fix custom driver test to account for extraPages 2024-03-22 16:35:56 -04:00
Tessa Walsh
97b1069f30 Fix extra hops test to account for extraPages 2024-03-22 16:35:56 -04:00
Tessa Walsh
13b6385a14 Temporarily comment out validation tests using py-wacz 2024-03-22 16:35:56 -04:00
Tessa Walsh
c68d117692 Add WACZLogger class for use with js-wacz 2024-03-22 16:35:56 -04:00
Tessa Walsh
952cd75a66 Wait until after WACZ generation to delete tmp-cdx 2024-03-22 09:21:25 -04:00
Ilya Kreymer
1595b3595d fix tests? 2024-03-21 20:08:16 -07:00
Ilya Kreymer
c6723b007f replace generateCDX with just moving files from tmp-cdx 2024-03-21 19:55:58 -07:00
Ilya Kreymer
a457a5e079 switch to using js-wacz natively for wacz creation!
remove python dependencies
2024-03-21 19:43:54 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, e.g. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
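The escalating interrupt behavior described above can be sketched as follows; `interruptMode` and the mode names are illustrative, not the crawler's actual API.

```typescript
type InterruptMode = "running" | "graceful" | "finalize-warcs";

function interruptMode(signalCount: number): InterruptMode {
  if (signalCount === 0) return "running"; // no interrupt received yet
  if (signalCount === 1) return "graceful"; // finish current pages, then stop
  // second Ctrl+C: exit browser, close workers, finalize and flush open
  // WARC writers, fetch nothing else - records stay valid even mid-page
  return "finalize-warcs";
}

let signalCount = 0;
for (const sig of ["SIGINT", "SIGTERM"] as const) {
  process.on(sig, () => {
    signalCount += 1;
    console.log(`received ${sig}, mode: ${interruptMode(signalCount)}`);
  });
}
```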
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
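The detached-child pattern above, controlled by `DETACHED_CHILD_PROC=1`, might look roughly like this; `detachedFromEnv` and `launchChild` are made-up names for illustration.

```typescript
import { spawn } from "node:child_process";

// When DETACHED_CHILD_PROC=1 (the Dockerfile default), children such as
// redis-server and socat are spawned detached, so a SIGINT/SIGTERM sent
// to the foreground process group does not kill them automatically.
function detachedFromEnv(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.DETACHED_CHILD_PROC === "1";
}

function launchChild(cmd: string, args: string[]) {
  return spawn(cmd, args, { detached: detachedFromEnv(), stdio: "inherit" });
}
```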
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
79e39ae2f0
Merge pull request #501 from webrecorder/docs-minor-fixes
Docs: Minor fixes to edit link & clarifications
2024-03-20 13:04:12 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
9a2ada3461 version: bump to 1.0.0 2024-03-18 19:15:35 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates; the filtering also applies to nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
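A simplified sketch of the `fromDate`/`toDate` filtering described above. The real implementation is a streaming SAX parser; this operates on already-extracted `<url>` entries, and the names are illustrative.

```typescript
interface SitemapEntry {
  url: string;
  lastmod?: string; // ISO date from the sitemap's <lastmod> element
}

// Keep only URLs whose lastmod falls within [fromDate, toDate];
// entries without a lastmod are included by default.
function filterByDate(
  entries: SitemapEntry[],
  fromDate?: Date,
  toDate?: Date,
): string[] {
  return entries
    .filter(({ lastmod }) => {
      if (!lastmod) return true;
      const ts = new Date(lastmod).getTime();
      if (fromDate && ts < fromDate.getTime()) return false;
      if (toDate && ts > toDate.getTime()) return false;
      return true;
    })
    .map(({ url }) => url);
}
```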
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Tessa Walsh
4d64eedcd3
Temporarily disable tmp-cdx creation (#499)
Fixes #498 

To revert after 1.0.0 when we make changes that allow for using the temp
CDX in WACZ creation.
2024-03-18 14:03:34 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
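The 'finished' list computation described above (seen minus failed and still-queued URLs) can be sketched as a small set operation; `computeFinished` is an illustrative name.

```typescript
// Derive the finished list from the seen set by removing URLs that
// failed or are still queued, per the serialization fix above.
function computeFinished(
  seen: Set<string>,
  failed: Set<string>,
  queued: Set<string>,
): string[] {
  return [...seen].filter((url) => !failed.has(url) && !queued.has(url));
}
```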
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid is always initialized for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
follow-up to #481: check reqresp.resourceType against the lowercase value,
and set the message based on the resourceType value
2024-03-04 23:58:39 -08:00
Ilya Kreymer
63cedbc91a version: bump to 1.0.0-beta.6 2024-03-04 18:11:28 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add the resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as a `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add a fail-on-status-code option, `--failOnInvalidStatus`, to treat non-200
responses as failures. Can be especially useful when combined with
`--failOnFailedSeed` or `--failOnFailedLimit`

requeue: ensure requeued urls are requeued with same depth/priority, not
0
2024-03-04 17:21:44 -08:00
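The requeue fix above can be sketched as: a retried URL keeps its original depth and priority rather than being reset to 0. The `QueueEntry` shape and `requeued` helper are hypothetical.

```typescript
interface QueueEntry {
  url: string;
  depth: number;
  priority: number;
  retries: number;
}

// Re-create a queue entry for retry, preserving depth and priority
// (previously these were reset to 0 on requeue).
function requeued(entry: QueueEntry): QueueEntry {
  return { ...entry, retries: entry.retries + 1 };
}
```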
Ilya Kreymer
dd78457b2b version: bump to 1.0.0-beta.5 2024-02-28 22:57:05 -08:00
Ilya Kreymer
184f4a2395
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of
batching into groups of 500. Errors are already logged if queueing a new
URL fails.
2024-02-28 22:56:32 -08:00
Ilya Kreymer
c348de270f
store page statusCode if not 200 (#477)
don't treat non-200 pages as errors; still extract text, take
screenshots, and run behaviors
only consider actual page load errors, e.g. a chrome-error:// page URL, as
errors
2024-02-28 22:56:12 -08:00
Ilya Kreymer
fba4730d88
new seed on redirect + error page check: (#476)
- if a seed page redirects (final page URL != seed URL), then add the
final URL as a new seed with the same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
2024-02-28 11:31:59 -08:00
Ilya Kreymer
dd48251b39
Include WARC prefix for screenshots and text WARCs (#473)
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
`screenshots.warc.gz` and `text.warc.gz`
2024-02-27 23:33:34 -08:00
Ilya Kreymer
cdd047d15e
warcwriter: better filehandle init on first use (#474)
Ensure the warcwriter file is initialized on first use, instead of throwing an error
- it was initialized from writeRecordPair() but not writeSingleRecord()
2024-02-23 21:35:55 -08:00
Ilya Kreymer
d36564e0b0 typo: remove extra console.log 2024-02-22 16:13:50 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a5e939567c
Set warc prefix via WARC_PREFIX env var (#470)
In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
which takes precedence.
Bump to 1.0.0-beta.4
2024-02-21 11:30:28 -08:00
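The precedence rule above (WARC_PREFIX env var wins over the `--warcPrefix` flag when both are set) can be sketched in one helper; `resolveWarcPrefix` is a made-up name for illustration.

```typescript
// Env var takes precedence over the CLI flag, per the commit above.
function resolveWarcPrefix(
  cliPrefix: string | undefined,
  env: Record<string, string | undefined> = process.env,
): string | undefined {
  return env.WARC_PREFIX || cliPrefix;
}
```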