Commit graph

536 commits

Author SHA1 Message Date
Mattia
ea7b2bbefc
allow minio to connect to other regions (#543)
This should address the issue of connecting to buckets stored outside
us-east-1
(https://github.com/webrecorder/browsertrix-crawler/issues/515) while
the switch from Minio client to AWS SDK is being worked on
(https://github.com/webrecorder/browsertrix-crawler/issues/479)

Co-authored-by: Mattia <m@ttia.it>
2024-04-17 08:55:33 -07:00
Tessa Walsh
efebc331ee
Set mime type for html pages (#545)
Fixes #544 

As long as the response has a content-type header, we should use it to
set MIME type for the page.
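
A minimal sketch of that logic, with a hypothetical helper name (not the crawler's actual API):

```
// Sketch: derive a page's MIME type from the content-type header, if present.
function mimeFromContentType(contentType: string | undefined): string | undefined {
  if (!contentType) {
    // No content-type header: leave the mime type unset
    return undefined;
  }
  // Strip parameters such as "; charset=utf-8" and normalize case
  return contentType.split(";")[0].trim().toLowerCase();
}

// e.g. "text/html; charset=UTF-8" -> "text/html"
```
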
2024-04-15 14:04:30 -07:00
Ilya Kreymer
f6edec0b95
Fix for --rolloverSize for individual WARCs in 1.x (#542)
Fixes #533 

Fixes rollover in WARCWriter, separate from combined WARC rollover size:
- check rolloverSize and close previous WARCs when size exceeds
- add timestamp to resource WARC filenames to support rollover, eg.
screenshots-{ts}.warc.gz
- use append mode for all write streams, just in case
- tests: add test for rollover of individual WARCs with 500K size limit
- tests: update screenshot tests to account for WARCs now being named
screenshots-{ts}.warc.gz instead of just screenshots.warc.gz
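
Roughly, the rollover check works as in this sketch (class and method names are hypothetical, and the buffer is assumed to be an already-serialized WARC record):

```
import fs from "node:fs";
import path from "node:path";

// Sketch of per-WARC rollover: close the current file and open a new,
// timestamped one once rolloverSize is exceeded.
class ResourceWarcWriter {
  private stream: fs.WriteStream | null = null;
  private bytesWritten = 0;

  constructor(
    private dir: string,
    private prefix: string,
    private rolloverSize: number,
  ) {}

  private openNewWarc() {
    // Timestamped filename, eg. screenshots-20240415134308.warc.gz
    const ts = new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
    const filename = path.join(this.dir, `${this.prefix}-${ts}.warc.gz`);
    // Append mode, just in case the file already exists
    this.stream = fs.createWriteStream(filename, { flags: "a" });
    this.bytesWritten = 0;
  }

  write(serializedRecord: Buffer) {
    // Check rolloverSize and close the previous WARC when the size is exceeded
    if (!this.stream || this.bytesWritten > this.rolloverSize) {
      this.stream?.end();
      this.openNewWarc();
    }
    this.stream!.write(serializedRecord);
    this.bytesWritten += serializedRecord.length;
  }
}
```
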
2024-04-15 13:43:08 -07:00
Ilya Kreymer
16671cb610
qa: filter out non-html pages (#541)
Fixes #540 

Also ensure the mime type is set on the page for non-html pages when loaded
through the browser; it was already being set for the direct fetch path.
2024-04-12 16:21:50 -07:00
Ilya Kreymer
8d4e9ca2dc
Better logging of all queue WARCWriter operations (#536)
WARCWriter operations result in a write promise being put on a queue
and handled one at a time. This change wraps that promise in an async function that awaits the actual
write and logs any rejections.
- If additional log details are provided, successful writes are also
logged for now, including success logging for resource records (text,
screenshot, pageinfo)
- screenshot / text / pageinfo use the appropriate logcontext for the resource for better log filtering
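
In outline, the wrapping looks like this sketch (function and logging names are hypothetical, not the crawler's actual logger API):

```
// Sketch: wrap each queued write so rejections are always caught and logged.
async function queueWrite(
  queue: { add<T>(fn: () => Promise<T>): Promise<T> },
  doWrite: () => Promise<void>,
  logDetails?: Record<string, unknown>,
  logContext = "writer",
) {
  return queue.add(async () => {
    try {
      await doWrite();
      if (logDetails) {
        // Successful writes are also logged (for now) when details are provided
        console.log(`[${logContext}] write finished`, logDetails);
      }
    } catch (e) {
      console.warn(`[${logContext}] write failed`, { ...logDetails, error: e });
    }
  });
}
```
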
2024-04-12 14:31:07 -07:00
Tessa Walsh
05acad1789
Remove no longer needed invalid Brave update URLs (#539) 2024-04-12 16:13:34 -04:00
Ilya Kreymer
e15f0c95d9
Adblock support (#534)
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA mode
RWP embed, to get more accurate screenshots.
Fetches the adblock.gz directly from RWP (though it could also be fetched
separately from Easylist).
Updates to 1.1.0-beta.5
2024-04-12 09:47:32 -07:00
Ilya Kreymer
b5f3238c29
Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535)
Cherry-picked from the use-js-wacz branch, now implementing separate
writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and
new `--copy-page-files` flag.

Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-11 13:55:52 -07:00
Ilya Kreymer
c247189474
qa/replay crawl loading improvements (#526)
- use frame.load() to load RWP frame directly instead of waiting for
navigation messages
- retry loading RWP if replay frame is missing
- support --postLoadDelay in replay crawl
- support --include / --exclude options in replay crawler, allow
excluding and including pages to QA via regex
- improve --qaDebugImageDiff debug image saving, save images to same
dir, using ${counter}-${workerid}-${pageid}-{crawl,replay,vdiff}.png for
better sorting
- when running QA crawl, check and use QA_ARGS instead of CRAWL_ARGS if
provided
- ensure empty string text from a page is treated differently from an error (undefined)
- ensure info.warc.gz is closed in closeFiles()

misc:
- fix typo in --postLoadDelay check!
- enable 'startEarly' mode for behaviors (autofetch, autoplay)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-04 13:05:24 -07:00
Ilya Kreymer
98f64458d8
ensure all warcwriter write operations go through a queue. (#528)
Currently, only the recorder's WARCWriter writes records through a
queue, resulting in other WARCs potentially suffering from concurrent
write attempts. This fixes that by:
- adding the concurrent queue to WARCWriter itself
- all writeRecord, writeRecordPair, writeNewResourceRecord calls are
first added to the PQueue, which ensures writes happen in order and
one-at-a-time
- flush() also ensures queue is empty/idle
- should avoid any issues with concurrent writes to any WARC
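
A minimal sketch of the queuing using p-queue (which the crawler already uses elsewhere); names here are illustrative:

```
import PQueue from "p-queue";

// Sketch: serialize all record writes through a single-concurrency queue,
// so writes happen in order and one at a time.
class QueuedWriter {
  private queue = new PQueue({ concurrency: 1 });

  async writeRecord(serializedRecord: Uint8Array): Promise<void> {
    // Each write is added to the queue and performed one-at-a-time
    await this.queue.add(() => this.doWrite(serializedRecord));
  }

  async flush(): Promise<void> {
    // flush() also ensures the queue is empty/idle
    await this.queue.onIdle();
  }

  private async doWrite(serializedRecord: Uint8Array): Promise<void> {
    // actual stream write happens here
  }
}
```
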
2024-04-04 09:36:16 -07:00
Ilya Kreymer
db613aa4ff
Revert "Make /app world-readable to better support non-root usage" (#529)
Reverts webrecorder/browsertrix-crawler#523
The chmod operation is a bit slow, and in testing the CI issue doesn't appear to be related to chmod :/
2024-04-03 19:48:37 -07:00
Ilya Kreymer
97b95fdf18
merge V1.0.4 change -> main: (#527)
refactor handling of max size for html/js/css (copy of #525)
- due to a typo (and lack of type-checking!), matchFetchSize was incorrectly
passed in instead of maxFetchSize, resulting in text/css/js resources over
5MB (instead of over 25MB) not being properly streamed back to the browser
- add type checking to AsyncFetcherOptions to avoid this in the future.
- refactor to avoid checking size altogether for 'essential resources',
html(document), js and css, instead always fetch them fully and
continue in the browser. Only apply rewriting if <25MB.
fixes #522
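
A sketch of the resulting decision logic (thresholds and names are illustrative, not the exact code):

```
// Sketch: never size-check "essential" resources (document/js/css) -- always
// fetch them fully and continue in the browser; only rewrite when under ~25MB.
const MAX_REWRITE_SIZE = 25_000_000;
const ESSENTIAL_TYPES = new Set(["document", "script", "stylesheet"]);

function shouldStream(resourceType: string, expectedSize: number, maxFetchSize: number): boolean {
  if (ESSENTIAL_TYPES.has(resourceType)) {
    // essential resources are fetched fully, never streamed
    return false;
  }
  return expectedSize > maxFetchSize;
}

function shouldRewrite(resourceType: string, actualSize: number): boolean {
  // only apply rewriting to essential resources under the limit
  return ESSENTIAL_TYPES.has(resourceType) && actualSize < MAX_REWRITE_SIZE;
}
```
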
2024-04-03 17:38:50 -07:00
Vinzenz Sinapius
23fda685d9
Make /app world-readable to better support non-root usage (#523)
Possible fix for failing tests with non-root deployment.
2024-04-03 15:22:12 -07:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.
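
The resolution amounts to something like this sketch (hypothetical helper name):

```
import path from "node:path";

// Sketch: resolve a non-absolute --filename under /crawls/profiles
function resolveProfileFilename(filename: string): string {
  return path.isAbsolute(filename)
    ? filename
    : path.join("/crawls/profiles", filename);
}

// e.g. "myprofile.tar.gz" -> "/crawls/profiles/myprofile.tar.gz"
```
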

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
5152169916 bump version to 1.1.0-beta.3 2024-03-28 17:19:40 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
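
Conceptually the option is just an extra wait between page load and the post-load steps, as in this sketch (names are illustrative):

```
// Sketch: wait --postLoadDelay seconds after page load, before link extraction,
// text extraction, screenshots and behaviors.
const sleep = (secs: number) => new Promise((resolve) => setTimeout(resolve, secs * 1000));

async function afterPageLoad(postLoadDelay: number) {
  if (postLoadDelay > 0) {
    await sleep(postLoadDelay);
  }
  // ...continue with link extraction, text extraction, screenshots, behaviors
}
```
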
2024-03-28 17:17:29 -07:00
Ilya Kreymer
ea098b6daf
avoid cloudflare detection of puppeteer when using browser profiles: (#518)
- filter out 'other' / no url targets from puppeteer attachment
- disable '--disable-site-isolation-trials' for profiles
- workaround for #446 with profiles
- also fixes `pageExtraDelay` not working for non-200 responses - may be
useful for detecting captcha blocked pages.
- connect VNC right away instead of waiting for page to fully finish
loading, hopefully resulting in faster profile start-up time.
2024-03-28 10:21:31 -07:00
Ilya Kreymer
0d973d67e3
upgrade puppeteer-core to 22.6.1 (#516)
Using latest puppeteer-core to keep up with latest browsers, mostly
minor syntax changes

Due to a change in puppeteer hiding the executionContextId, we need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function
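
One way to keep such a mapping is to track CDP Runtime execution-context events per page, roughly as below (a sketch; the actual implementation differs):

```
import type { CDPSession } from "puppeteer-core";

// Sketch: track frameId -> executionContextId so evaluation can target a
// specific frame even though puppeteer no longer exposes the context id.
const frameIdToContextId = new Map<string, number>();

async function trackExecutionContexts(cdp: CDPSession) {
  cdp.on("Runtime.executionContextCreated", (params) => {
    const { id, auxData } = params.context;
    if (auxData && auxData.frameId) {
      frameIdToContextId.set(auxData.frameId, id);
    }
  });

  cdp.on("Runtime.executionContextDestroyed", (params) => {
    for (const [frameId, ctxId] of frameIdToContextId.entries()) {
      if (ctxId === params.executionContextId) {
        frameIdToContextId.delete(frameId);
      }
    }
  });

  await cdp.send("Runtime.enable");
}

// Evaluate an expression in a specific frame's context
async function evaluateInFrame(cdp: CDPSession, frameId: string, expression: string) {
  const contextId = frameIdToContextId.get(frameId);
  return cdp.send("Runtime.evaluate", { expression, contextId, returnByValue: true });
}
```
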
2024-03-27 09:26:51 -07:00
Ilya Kreymer
0ad10a8dee
Unify WARC writing + CDXJ indexing into single class (#507)
Previously, there was the main WARCWriter as well as utility
WARCResourceWriter that was used for screenshots, text, pageinfo and
only generated resource records. This separate WARC writing path did not
generate CDX, but used appendFile() to append new WARC records to an
existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is done through a single WARCWriter, which uses a writable stream to append records, and can also generate CDX on the fly. This change is a
pre-requisite to the js-wacz conversion (#484) since all WARCs need to
have generated CDX.
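
In outline, resource records now flow through the same writer, which holds one append-mode stream (a sketch with hypothetical names; the real writer serializes actual WARC records and emits CDXJ):

```
import fs from "node:fs";

// Sketch: one writer, one append-mode stream; screenshots, text and pageinfo
// use the same writeRecord() path, so CDX can be generated in one place.
class UnifiedWarcWriter {
  private stream: fs.WriteStream;
  private offset = 0;

  constructor(filename: string) {
    this.stream = fs.createWriteStream(filename, { flags: "a" });
  }

  writeRecord(serializedRecord: Buffer, url: string) {
    this.stream.write(serializedRecord);
    // every record passes through here, so a CDXJ entry
    // (url, offset, length) can be emitted on the fly
    this.offset += serializedRecord.length;
  }

  writeNewResourceRecord(serializedRecord: Buffer, url: string) {
    // resource records (screenshot, text, pageinfo) use the same path
    this.writeRecord(serializedRecord, url);
  }
}
```
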

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-26 14:54:27 -07:00
Ilya Kreymer
01c4139aa7
Fixes from 1.0.3 release -> main (#517)
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to
https://github.com/webrecorder/browsertrix-crawler/issues/496
- support parsing sitemap urls that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add test for both)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)
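
The content-type and .gz handling can be sketched as follows (illustrative only; the actual parser streams and decompresses the body):

```
import { gunzipSync } from "node:zlib";

// Sketch: accept both XML content-types for sitemaps, and gunzip bodies for
// sitemap URLs ending in .gz.
const SITEMAP_CONTENT_TYPES = ["application/xml", "text/xml"];

function isSitemapContentType(contentType: string | null): boolean {
  if (!contentType) return false;
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return SITEMAP_CONTENT_TYPES.includes(mime);
}

async function fetchSitemapText(url: string): Promise<string> {
  const resp = await fetch(url);
  if (url.endsWith(".gz")) {
    // gzipped sitemap: decompress before parsing
    return gunzipSync(Buffer.from(await resp.arrayBuffer())).toString("utf-8");
  }
  return await resp.text();
}
```
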

fixes redirected seed (from #476) being counted against page limit: #509
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is
done

fixes #508
2024-03-26 14:50:36 -07:00
Vinzenz Sinapius
6b6cb4137a
Use RFC2606 invalid domain names (#514)
`invalid.dev` can potentially be registered and used. `.invalid` is
guaranteed to never be valid. See also:
https://www.rfc-editor.org/rfc/rfc2606.html
2024-03-26 14:09:04 -07:00
Ilya Kreymer
ecbc1d8ddd quickfix: fix typo, remove duplicate declaration! 2024-03-22 21:51:50 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ json that will be replayed/QAed
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
22a7351dc7
service worker capture fix: disable by default for now (#506)
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, service workers were only bypassed when
using profiles. (This is also consistent with ArchiveWeb.page behavior.)
Includes:
- add --serviceWorker option, which can be `disabled`,
`disabled-if-profile` (previous default), or `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing
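
Under the hood, bypassing service workers maps to a single CDP call per page; a sketch of how the option could translate (option handling here is illustrative):

```
import type { CDPSession } from "puppeteer-core";

type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

// Sketch: translate the --serviceWorker option into Network.setBypassServiceWorker
async function applyServiceWorkerOpt(cdp: CDPSession, opt: ServiceWorkerOpt, usingProfile: boolean) {
  const bypass = opt === "disabled" || (opt === "disabled-if-profile" && usingProfile);
  if (bypass) {
    await cdp.send("Network.setBypassServiceWorker", { bypass: true });
  }
}
```
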

bump version to 1.0.2
2024-03-22 13:37:14 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, eg. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502, an initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)
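
Spawning detached children based on that env variable looks roughly like this sketch (helper name is hypothetical):

```
import { spawn } from "node:child_process";

// Sketch: spawn child processes (redis-server, socat, wacz, ...) detached when
// DETACHED_CHILD_PROC=1, so they aren't killed automatically by SIGINT/SIGTERM
// delivered to the crawler's process group.
function spawnChild(cmd: string, args: string[]) {
  const detached = process.env.DETACHED_CHILD_PROC === "1";
  return spawn(cmd, args, { detached, stdio: "inherit" });
}

// e.g. spawnChild("redis-server", ["--port", "6379"]);
```
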

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
79e39ae2f0
Merge pull request #501 from webrecorder/docs-minor-fixes
Docs: Minor fixes to edit link & clarifications
2024-03-20 13:04:12 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
9a2ada3461 version: bump to 1.0.0 2024-03-18 19:15:35 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates; nested sitemap lists are filtered as well
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
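
The robots.txt-based autodetection can be sketched as follows (illustrative; the real parser also checks URL extension and mime type):

```
// Sketch: if no sitemap URL is provided, check robots.txt for "sitemap:" lines,
// falling back to /sitemap.xml.
async function detectSitemaps(seedUrl: string): Promise<string[]> {
  const origin = new URL(seedUrl).origin;
  try {
    const resp = await fetch(new URL("/robots.txt", origin));
    if (resp.ok) {
      const sitemaps = (await resp.text())
        .split("\n")
        .filter((line) => line.toLowerCase().startsWith("sitemap:"))
        .map((line) => line.slice("sitemap:".length).trim());
      if (sitemaps.length) {
        return sitemaps;
      }
    }
  } catch {
    // ignore fetch errors and fall through to the default
  }
  return [new URL("/sitemap.xml", origin).href];
}
```
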

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Tessa Walsh
4d64eedcd3
Temporarily disable tmp-cdx creation (#499)
Fixes #498 

To revert after 1.0.0 when we make changes that allow for using the temp
CDX in WACZ creation.
2024-03-18 14:03:34 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.
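
The 'finished' list is essentially a set difference, as in this sketch (names are hypothetical):

```
// Sketch: compute the "finished" list for serialization as the seen URLs
// minus failed and still-queued URLs.
function computeFinished(seen: Set<string>, failed: Set<string>, queued: Set<string>): string[] {
  return Array.from(seen).filter((url) => !failed.has(url) && !queued.has(url));
}
```
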

Fixes #491
2024-03-15 20:54:43 -04:00
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid always inited for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext
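
The filtering rule itself is simple, as in this sketch (names are hypothetical):

```
// Sketch: a context is logged if it is in the include list (when one is given)
// and not in the exclude list; defaults per this change.
const DEFAULT_EXCLUDE = new Set(["screencast", "recorderNetwork", "jsErrors"]);

function shouldLog(context: string, include: Set<string>, exclude: Set<string> = DEFAULT_EXCLUDE): boolean {
  if (include.size && !include.has(context)) {
    return false;
  }
  return !exclude.has(context);
}
```
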

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() to still work if no url is provided (eg. checking only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
Follow-up to #481: check reqresp.resourceType against the lowercase value and
set the message based on the resourceType value.
2024-03-04 23:58:39 -08:00
Ilya Kreymer
63cedbc91a version: bump to 1.0.0-beta.6 2024-03-04 18:11:28 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add fail on status code option, --failOnInvalidStatus to treat non-200
responses as failures. Can be useful especially when combined with
--failOnFailedSeed or --failOnFailedLimit

requeue: ensure requeued urls are requeued with same depth/priority, not
0
2024-03-04 17:21:44 -08:00
Ilya Kreymer
dd78457b2b version: bump to 1.0.0-beta.5 2024-02-28 22:57:05 -08:00
Ilya Kreymer
184f4a2395
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of
batching into groups of 500. Errors are already logged if queueing a new
URL fails.
2024-02-28 22:56:32 -08:00
Ilya Kreymer
c348de270f
store page statusCode if not 200 (#477)
Don't treat non-200 pages as errors: still extract text, take
screenshots, and run behaviors.
Only consider actual page load errors, eg. a chrome-error:// page url, as
errors.
2024-02-28 22:56:12 -08:00
Ilya Kreymer
fba4730d88
new seed on redirect + error page check: (#476)
- if a seed page redirects (page response != seed url), then add the
final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
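
A sketch of the two checks described above (names are hypothetical; ScopedSeed handling is omitted):

```
// Sketch: after load, add the final URL as a new seed (same scope) if the page
// redirected away from the seed URL, and mark the page failed if it landed on
// a chrome error page.
function redirectedSeedUrl(seedUrl: string, finalUrl: string): string | null {
  return finalUrl !== seedUrl ? finalUrl : null;
}

function isPageLoadError(finalUrl: string): boolean {
  return finalUrl.startsWith("chrome-error://");
}
```
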
2024-02-28 11:31:59 -08:00