Commit graph

302 commits

Ilya Kreymer
b57dea50b5 profiles:
- add our own signal handling to create-login-profile to ensure fast exit in k8s
- print crawler version info string on startup
2024-03-16 16:19:42 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid always inited for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
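A minimal sketch of how the include/exclude context filtering described above might work (function and variable names here are illustrative, not the crawler's actual logger code):

```typescript
// Illustrative sketch of --logContext / --logExcludeContext filtering.
// Default exclude list per the commit above; names are assumptions.
const DEFAULT_EXCLUDE_CONTEXTS: string[] = ["screencast", "recorderNetwork", "jsErrors"];

function shouldLog(
  context: string,
  includeContexts: string[],                            // from --logContext; empty = all
  excludeContexts: string[] = DEFAULT_EXCLUDE_CONTEXTS, // from --logExcludeContext
): boolean {
  // if an include list is given, only those contexts are logged
  if (includeContexts.length && !includeContexts.includes(context)) {
    return false;
  }
  // excluded contexts are dropped even when no include list is set
  return !excludeContexts.includes(context);
}
```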
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
follow-up to #481: check reqresp.resourceType against the lowercased value; just set the message based on the resourceType value
2024-03-04 23:58:39 -08:00
Ilya Kreymer
63cedbc91a version: bump to 1.0.0-beta.6 2024-03-04 18:11:28 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourcesType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add fail on status code option, --failOnInvalidStatus to treat non-200
responses as failures. Can be useful especially when combined with
--failOnFailedSeed or --failOnFailedLimit

requeue: ensure requeued urls are requeued with the same depth/priority, not 0
2024-03-04 17:21:44 -08:00
Ilya Kreymer
dd78457b2b version: bump to 1.0.0-beta.5 2024-02-28 22:57:05 -08:00
Ilya Kreymer
184f4a2395
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of batching into groups of 500 links. Errors are already logged if queueing a new URL fails.
2024-02-28 22:56:32 -08:00
Ilya Kreymer
c348de270f
store page statusCode if not 200 (#477)
don't treat non-200 pages as errors; still extract text, take screenshots, and run behaviors.
Only consider actual page load errors, eg. a chrome-error:// page url, as errors
2024-02-28 22:56:12 -08:00
Ilya Kreymer
fba4730d88
new seed on redirect + error page check: (#476)
- if a seed page redirects (page response != seed url), then add the
final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
2024-02-28 11:31:59 -08:00
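The newScopeSeed() duplication described above could look roughly like this (a simplified stand-in; the real ScopedSeed class has more fields than shown):

```typescript
// Simplified sketch: duplicate a seed with a new URL (e.g. the final URL after
// a redirect) while keeping the original include/exclude scope rules.
interface ScopedSeed {
  url: string;
  include: RegExp[];
  exclude: RegExp[];
}

function newScopeSeed(seed: ScopedSeed, newUrl: string): ScopedSeed {
  // same scope (original includes/excludes), different starting URL
  return { url: newUrl, include: seed.include, exclude: seed.exclude };
}
```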
Ilya Kreymer
dd48251b39
Include WARC prefix for screenshots and text WARCs (#473)
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
`screenshots.warc.gz` and `text.warc.gz`
2024-02-27 23:33:34 -08:00
Ilya Kreymer
cdd047d15e
warcwriter: better filehandle init on first use (#474)
Ensure warcwriter file is inited on first use, instead of throwing error
- was initing from writeRecordPair() but not writeSingleRecord()
2024-02-23 21:35:55 -08:00
Ilya Kreymer
d36564e0b0 typo: remove extra console.log 2024-02-22 16:13:50 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a5e939567c
Set warc prefix via WARC_PREFIX env var (#470)
In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
which takes precedence.
Bump to 1.0.0-beta.4
2024-02-21 11:30:28 -08:00
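The precedence described above (env var over flag) amounts to something like the following sketch; the final fallback value here is an assumption, not the crawler's documented default:

```typescript
// Sketch: the WARC_PREFIX env var takes precedence over the --warcPrefix flag.
// The "rec" fallback is an assumption for illustration only.
function resolveWarcPrefix(cliPrefix?: string): string {
  return process.env.WARC_PREFIX || cliPrefix || "rec";
}
```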
Ilya Kreymer
a512e92886
Include resource type + mime type in page resources list (#468)
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
2024-02-19 19:11:48 -08:00
Ilya Kreymer
8d2d79a5df
Misc Page Resource/Recorder Fixes (#467)
- recorder: don't attempt to record response with mime type
`text/event-stream` (will not terminate).
- resources: don't track non http/https resources.
- resources: store page timestamp on first resources URL match, in case
multiple responses for same page encountered.
2024-02-17 23:32:19 -08:00
Ilya Kreymer
e8f2073a7e
Update Browser Image (#466)
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
2024-02-17 22:40:12 -08:00
Ilya Kreymer
46eb02dfcb version: bump to 1.0.0-beta.3 2024-02-16 14:37:58 -08:00
Ilya Kreymer
96f3c407b1
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still
included in the `url:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
include cached resources
2024-02-16 14:36:32 -08:00
Tessa Walsh
bdffa7922c
Add arg to write pages to Redis (#464)
Fixes #462 

Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add
pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-09 16:44:17 -08:00
Ilya Kreymer
298deac59d add fix from 0.12.4 - puppeteer-core to 20.8.2
bump to 1.0.0-beta.2
2024-01-17 14:44:34 -08:00
Ilya Kreymer
f4ecaa8454 Merge branch 'main' into dev-1.0.0 2024-01-17 14:42:13 -08:00
Ilya Kreymer
18ffb3d971
skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/request pairs are not written to WARC (#460)
Allows for skipping network traffic that doesn't need to be stored, as
it is not necessary/will result in incorrect replay (eg. 304 instead of
a 200).
2024-01-17 14:27:51 -08:00
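The skip rule above boils down to a check like this (a simplified sketch, not the crawler's actual recorder logic):

```typescript
// Sketch: HEAD/OPTIONS requests and 206/304 responses are not written to WARC,
// since storing them is unnecessary or would break replay (e.g. a 304 served
// where a 200 is expected).
function shouldSkipRecord(method: string, status: number): boolean {
  if (method === "HEAD" || method === "OPTIONS") {
    return true;
  }
  return status === 206 || status === 304;
}
```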
Ilya Kreymer
2fc0f67f04
Generate urn:pageinfo:<page url> records (#458)
Generate records for each page, containing a list of resources and their
status codes, to aid in future diffing/comparison.

Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle
non-GET requests
- Adds `writeSingleRecord` to WARCWriter

Fixes #457
2024-01-15 16:08:13 -05:00
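The shape of a pageinfo record described above might be sketched as follows; the `urn:pageinfo:` target URI format comes from the commit, while the field names are illustrative assumptions:

```typescript
// Sketch of a urn:pageinfo:<page url> payload: per-page list of resources with
// their status codes, for future diffing/comparison. Field names are assumed.
interface PageInfoRecord {
  url: string;
  urls: Record<string, { status: number }>;
}

function pageInfoTarget(pageUrl: string): string {
  return `urn:pageinfo:${pageUrl}`;
}
```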
Tessa Walsh
cd3a1b0c6c
Bump puppeteer-core to ^20.8.2 to patch vulnerability (#459)
Fixes https://github.com/webrecorder/browsertrix-crawler/issues/456
2024-01-15 12:02:18 -08:00
Ilya Kreymer
db2dbe042f bump to 1.0.0-beta.1
update yarn.lock
2024-01-03 00:21:03 -08:00
Ilya Kreymer
63c884fb1b Merge branch 'main' (0.12.3) into 1.0.0 2024-01-03 00:20:23 -08:00
Ilya Kreymer
703835a7dd
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
2023-12-13 15:14:53 -05:00
Ilya Kreymer
3323262852
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
  - add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding `WARC-Truncated` header, which needs to be set after the stream is
finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
2023-12-07 23:02:55 -08:00
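The filename template above, with `$ts` substituted at file-creation time, could be sketched as follows (the exact timestamp format is an assumption):

```typescript
// Sketch of the per-worker WARC filename template:
// ${prefix}-${crawlId}-$ts-${workerId}.warc[.gz], with $ts replaced by the
// current timestamp when a new file is created. Timestamp format is assumed.
function warcFilename(prefix: string, crawlId: string, workerId: number, gzip: boolean): string {
  const ts = new Date().toISOString().replace(/[^\d]/g, ""); // digits-only timestamp
  return `${prefix}-${crawlId}-${ts}-${workerId}.warc${gzip ? ".gz" : ""}`;
}
```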
Ilya Kreymer
c3b98e5047
Add timeout to final awaitPendingClear() (#442)
Ensure the final pending wait also has a timeout, set to max page
timeout x num workers.
Could also set higher, but needs to have a timeout, eg. in case of
downloading live stream that never terminates.
Fixes #348 in the 0.12.x line.

Also bumps version to 0.12.3
2023-11-16 16:20:09 -05:00
dependabot[bot]
540c355d25
Bump sharp from 0.32.1 to 0.32.6 (#443)
Bumps [sharp](https://github.com/lovell/sharp) from 0.32.1 to 0.32.6 to fix vulnerability
2023-11-16 16:18:00 -05:00
Ilya Kreymer
e9ed7a45df Merge 0.12.2 into dev-1.0.0 2023-11-15 23:00:13 -08:00
Ilya Kreymer
19dac943cc
Add types + validation for log context options (#435)
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
2023-11-14 21:54:40 -08:00
Ilya Kreymer
9ba0b9edc1
Backport pending list never being reprocessed (#438)
Backport of #433 to 0.12.x.

Bump version to 0.12.2
2023-11-13 19:21:48 -08:00
Ilya Kreymer
456155ecf6
more specific types additions (#434)
- add QueueEntry for type of json object stored in Redis
- and PageCallbacks for callback type
- use Crawler type
2023-11-13 09:31:52 -08:00
Ilya Kreymer
0d51e03825
Fix potential for pending list never being processed (#433)
Due to an optimization, numPending() call assumed that queueSize() would
be called to update cached queue size. However, in the current worker
code, this is not the case. Remove caching of the queue size and just check
queue size in numPending(), to ensure the pending list is always processed.
2023-11-13 09:31:21 -08:00
Ilya Kreymer
3972942f5f
logging: don't log filtered out direct fetch attempt as error (#432)
When calling directFetchCapture, and aborting the response via an
exception, throw `new Error("response-filtered-out");`
so that it can be ignored. This exception is only used for direct
capture, and should not be logged as an error - rethrow and
handle in calling function to indicate direct fetch is skipped
2023-11-13 09:16:57 -08:00
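The sentinel-error pattern described above looks roughly like this (tryDirectFetch and the capture callback are hypothetical stand-ins for the crawler's actual functions):

```typescript
// Sketch: an aborted direct fetch throws a sentinel error that the caller
// recognizes and treats as "skip this URL", not as a real failure to log.
function tryDirectFetch(url: string, capture: (u: string) => void): boolean {
  try {
    capture(url);
    return true;
  } catch (e) {
    if (e instanceof Error && e.message === "response-filtered-out") {
      return false; // filtered out: skip direct fetch, don't log as error
    }
    throw e; // real errors still propagate to the caller
  }
}
```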
Ilya Kreymer
ab0f66aa54
Raise size limit for large HTML pages (#430)
Previously, responses >2MB were streamed to disk and an empty response returned to the browser,
to avoid holding large responses in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content being loaded.

This PR sets different limits for:
- HTML as well as other JS necessary for page to load to 25MB
- All other content limit is set to 5MB

Also includes some more type fixing
2023-11-09 18:33:44 -08:00
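The two limits above could be expressed roughly as follows (the constant names and the resource-type check are illustrative assumptions):

```typescript
// Sketch: HTML documents and scripts needed for page load get a 25MB in-memory
// limit; all other content gets 5MB, beyond which responses are streamed to disk.
const MAX_TEXT_FETCH_SIZE = 25_000_000;   // HTML / JS necessary for page load
const MAX_DEFAULT_FETCH_SIZE = 5_000_000; // all other content

function fetchSizeLimit(resourceType: string): number {
  return resourceType === "document" || resourceType === "script"
    ? MAX_TEXT_FETCH_SIZE
    : MAX_DEFAULT_FETCH_SIZE;
}
```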
Ilya Kreymer
783d006d52
follow-up to #428: update ignore files (#431)
- actually update lint/prettier/git ignore files with scratch, crawls, test-crawls, behaviors, as needed
2023-11-09 17:13:53 -08:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00