27 commits

Author SHA1 Message Date
Ilya Kreymer
10f6414f2f
PDF loading status code fix (#571)
when loading a PDF as a page, the browser returns a 'false positive'
net::ERR_ABORTED even though the PDF is loaded.
- this is already handled, but status code was still being cleared,
ensure status code is not reset to 0 on response
- ensure page status and mime are also recorded if this failure is
ignored (in shouldIgnoreAbort)
- tests: add test for PDF capture

fixes #570
2024-05-14 15:26:06 -07:00
Ilya Kreymer
ddc3e104db
improved handling of requests from workers: (#562)
On sites with regular workers, requests from workers were being skipped
as there was no match for the worker frameId.

Add recorder.hasFrame() frameId to match not just service-worker
frameIds but also other frame ids already tracked in the frameIdToExecId
map.
2024-05-06 11:04:31 -04:00
Ilya Kreymer
8d4e9ca2dc
Better logging of all queue WARCWriter operations (#536)
warcwriter operations result in a write promise being put on a queue,
and handled one-at-a-time. This change wraps that promise in an async function that awaits the actual
write and logs any rejections.
- If additional log details are provided, successful writes are also
logged for now, including success logging for resource records (text,
screenshot, pageinfo)
- screenshot / text / pageinfo use the appropriate logcontext for the resource for better log filtering
2024-04-12 14:31:07 -07:00
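The wrapping described in #536 can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `loggedWrite`, `LogFn`, and the log messages are hypothetical names, and the boolean return value is an assumption added to make the outcome easy to check.

```typescript
// Hypothetical logger signature; the real logger API may differ.
type LogFn = (message: string, details: Record<string, unknown>) => void;

// Awaits a pending write and logs a rejection instead of letting it escape.
// If details are provided, a successful write is logged as well.
async function loggedWrite(
  write: Promise<void>,
  logError: LogFn,
  logDebug: LogFn,
  details?: Record<string, unknown>,
): Promise<boolean> {
  try {
    await write;
    if (details) {
      logDebug("WARC write succeeded", details);
    }
    return true;
  } catch (e) {
    logError("WARC write failed", { ...details, error: String(e) });
    return false;
  }
}
```

A caller would put `loggedWrite(actualWritePromise, ...)` on the queue rather than the raw write promise, so a failed write is reported without breaking later writes.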
Ilya Kreymer
98f64458d8
ensure all warcwriter write operations go through a queue. (#528)
Currently, only the recorder's WARCWriter writes records through a
queue, resulting in other WARCs potentially suffering from concurrent
write attempts. This fixes that by:
- adding the concurrent queue to WARCWriter itself
- all writeRecord, writeRecordPair, writeNewResourceRecord calls are
first added to the PQueue, which ensures writes happen in order and
one-at-a-time
- flush() also ensures queue is empty/idle
- should avoid any issues with concurrent writes to any WARC
2024-04-04 09:36:16 -07:00
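The one-at-a-time write queue from #528 can be sketched without any dependencies. The commit mentions PQueue; the `SerialQueue` class below is a hypothetical stand-in with similar behavior at concurrency 1, and `SketchWriter` is an illustrative name, not the real WARCWriter.

```typescript
// Hypothetical stand-in for a concurrency-1 queue (not the library the crawler uses).
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  // Run the task only after every previously added task has settled.
  add<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if this task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }

  // Resolves once every task queued so far has finished.
  onIdle(): Promise<void> {
    return this.tail;
  }
}

class SketchWriter {
  readonly written: string[] = [];
  private queue = new SerialQueue();

  writeRecord(record: string): Promise<void> {
    return this.queue.add(async () => {
      // Simulate an async write whose duration depends on the record size.
      for (let i = 0; i < record.length; i++) {
        await Promise.resolve();
      }
      this.written.push(record);
    });
  }

  // flush() waits until the queue is idle before returning.
  async flush(): Promise<void> {
    await this.queue.onIdle();
  }
}
```

Because each task starts only after the previous one settles, a slow write queued first still lands before a fast write queued second.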
Ilya Kreymer
97b95fdf18
merge V1.0.4 change -> main: (#527)
refactor handling of max size for html/js/css (copy of #525)
- due to a typo (and lack of type-checking!) incorrectly passed in
matchFetchSize instead of maxFetchSize, resulting in text/css/js for
>5MB instead of >25MB not properly streamed back to the browser
- add type checking to AsyncFetcherOptions to avoid this in the future.
- refactor to avoid checking size altogether for 'essential resources',
html(document), js and css, instead always fetch them fully and
continue in the browser. Only apply rewriting if <25MB.
fixes #522
2024-04-03 17:38:50 -07:00
Ilya Kreymer
0d973d67e3
upgrade puppeteer-core to 22.6.1 (#516)
Using latest puppeteer-core to keep up with latest browsers, mostly
minor syntax changes

Due to change in puppeteer hiding the executionContextId, need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function
2024-03-27 09:26:51 -07:00
Ilya Kreymer
0ad10a8dee
Unify WARC writing + CDXJ indexing into single class (#507)
Previously, there was the main WARCWriter as well as utility
WARCResourceWriter that was used for screenshots, text, pageinfo and
only generated resource records. This separate WARC writing path did not
generate CDX, but used appendFile() to append new WARC records to an
existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is done through a single WARCWriter, which uses a writable stream to append records, and can also generate CDX on the fly. This change is a
pre-requisite to the js-wacz conversion (#484) since all WARCs need to
have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-26 14:54:27 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, eg. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Tessa Walsh
4d64eedcd3
Temporarily disable tmp-cdx creation (#499)
Fixes #498 

To revert after 1.0.0 when we make changes that allow for using the temp
CDX in WACZ creation.
2024-03-18 14:03:34 -07:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() so it still works if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
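The interaction between --logContext and --logExcludeContext in #485 might look roughly like this. The function name is hypothetical, and two behaviors are assumptions not stated in the commit: exclusion is checked before inclusion, and an empty include list means no restriction.

```typescript
// Default exclude list as named in the commit message.
const DEFAULT_EXCLUDE_CONTEXTS: string[] = [
  "screencast",
  "recorderNetwork",
  "jsErrors",
];

// Hypothetical filter: exclusion is checked first (an assumption), then inclusion.
function shouldLogContext(
  context: string,
  includeContexts: string[] = [],
  excludeContexts: string[] = DEFAULT_EXCLUDE_CONTEXTS,
): boolean {
  if (excludeContexts.indexOf(context) !== -1) {
    return false;
  }
  // An empty include list is assumed to mean "no restriction".
  return (
    includeContexts.length === 0 || includeContexts.indexOf(context) !== -1
  );
}
```

Passing a custom exclude list without `recorderNetwork` would re-enable that context for debugging CDP network traffic.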
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
follow up to #481: check reqresp.resourceType against the lowercase value, and just
set the message based on the resourceType value
2024-03-04 23:58:39 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
d36564e0b0
typo: remove extra console.log
2024-02-22 16:13:50 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a5e939567c
Set warc prefix via WARC_PREFIX env var (#470)
In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
which takes precedence.
Bump to 1.0.0-beta.4
2024-02-21 11:30:28 -08:00
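The precedence rule from #470 can be sketched as below. The helper name is hypothetical, the environment is passed in as a plain object to keep the example self-contained, and the empty-string fallback is an assumption.

```typescript
// Hypothetical helper: WARC_PREFIX from the environment wins over --warcPrefix.
function resolveWarcPrefix(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): string {
  // Fallback to "" when neither is set is an assumption for this sketch.
  return env["WARC_PREFIX"] || flagValue || "";
}
```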
Ilya Kreymer
a512e92886
Include resource type + mime type in page resources list (#468)
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
2024-02-19 19:11:48 -08:00
Ilya Kreymer
8d2d79a5df
Misc Page Resource/Recorder Fixes (#467)
- recorder: don't attempt to record response with mime type
`text/event-stream` (will not terminate).
- resources: don't track non http/https resources.
- resources: store page timestamp on first resources URL match, in case
multiple responses for same page encountered.
2024-02-17 23:32:19 -08:00
Ilya Kreymer
96f3c407b1
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still
included in the `urn:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo records for webrecorder.net and webrecorder.net/about
include cached resources
2024-02-16 14:36:32 -08:00
Tessa Walsh
bdffa7922c
Add arg to write pages to Redis (#464)
Fixes #462 

Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add
pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-09 16:44:17 -08:00
Ilya Kreymer
18ffb3d971
skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/request pairs are not written to WARC (#460)
Allows for skipping network traffic that doesn't need to be stored, as
it is not necessary/will result in incorrect replay (eg. 304 instead of
a 200).
2024-01-17 14:27:51 -08:00
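The skip rule from #460 amounts to a small predicate. This is a sketch only: `shouldSkipWrite` is a hypothetical name, and the real check may look at more than method and status.

```typescript
// Methods and statuses named in the commit message.
const SKIPPED_METHODS = new Set<string>(["HEAD", "OPTIONS"]);
const SKIPPED_STATUSES = new Set<number>([206, 304]);

// Returns true if the request/response pair should not be written to the WARC.
function shouldSkipWrite(method: string, status: number): boolean {
  return (
    SKIPPED_METHODS.has(method.toUpperCase()) || SKIPPED_STATUSES.has(status)
  );
}
```

For example, a 304 is skipped because replaying it in place of the original 200 would yield an empty body.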
Ilya Kreymer
2fc0f67f04
Generate urn:pageinfo:<page url> records (#458)
Generate records for each page, containing a list of resources and their
status codes, to aid in future diffing/comparison.

Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle
non-GET requests
- Adds `writeSingleRecord` to WARCWriter

Fixes #457
2024-01-15 16:08:13 -05:00
Ilya Kreymer
3323262852
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding `WARC-Truncated` header, need to set after stream is
finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
2023-12-07 23:02:55 -08:00
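The filename template from #440 can be sketched in two steps: build the template once, then substitute `$ts` when a new file is created. The function names are hypothetical, and the timestamp format (17 UTC digits derived from the ISO string) is an assumption, since the commit does not specify it.

```typescript
// Builds the template; "$ts" is left as a literal placeholder.
function warcFilenameTemplate(
  prefix: string,
  crawlId: string,
  workerid: number,
  gzip: boolean,
): string {
  return `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
}

// Fills in "$ts" at file creation time.
// Assumed format: YYYYMMDDhhmmssSSS in UTC (not confirmed by the commit).
function fillTimestamp(template: string, now: Date): string {
  const ts = now.toISOString().replace(/[^\d]/g, "");
  return template.replace("$ts", ts);
}
```

Deferring the timestamp means each rollover file gets its own creation time while sharing the same prefix, crawl id, and worker id.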
Ilya Kreymer
19dac943cc
Add types + validation for log context options (#435)
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
2023-11-14 21:54:40 -08:00
Ilya Kreymer
3972942f5f
logging: don't log filtered out direct fetch attempt as error (#432)
When calling directFetchCapture, and aborting the response via an
exception, throw `new Error("response-filtered-out");`
so that it can be ignored. This exception is only used for direct
capture, and should not be logged as an error - rethrow and
handle in calling function to indicate direct fetch is skipped
2023-11-13 09:16:57 -08:00
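The handling described in #432 can be sketched as a caller that treats the `response-filtered-out` error as a normal skip. Only the error message string comes from the commit; `isFilteredOut`, `tryDirectFetch`, and the callback shapes are hypothetical.

```typescript
const FILTERED_OUT = "response-filtered-out";

// True if the error is the marker thrown when a direct fetch is filtered out.
function isFilteredOut(e: unknown): boolean {
  return e instanceof Error && e.message === FILTERED_OUT;
}

// Hypothetical caller: returns false when the direct fetch did not capture
// the page, and logs only errors that are not the filtered-out marker.
async function tryDirectFetch(
  capture: () => Promise<void>,
  logError: (message: string) => void,
): Promise<boolean> {
  try {
    await capture();
    return true;
  } catch (e) {
    if (!isFilteredOut(e)) {
      logError("Direct fetch failed: " + String(e));
    }
    return false;
  }
}
```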
Ilya Kreymer
ab0f66aa54
Raise size limit for large HTML pages (#430)
Previously, responses >2MB were streamed to disk and an empty response returned to browser,
to avoid holding large responses in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content loaded.

This PR sets different limits:
- 25MB for HTML, as well as JS necessary for the page to load
- 5MB for all other content

Also includes some more type fixing
2023-11-09 18:33:44 -08:00
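The per-type limits from #430 can be sketched as a lookup plus a comparison. The names below are hypothetical, the resource type strings ("document", "script") are assumed to follow the lowercase browser convention, and 1MB is assumed to be 1024 * 1024 bytes.

```typescript
const MB = 1024 * 1024; // assumption: binary megabytes
const MAX_ESSENTIAL_SIZE = 25 * MB; // HTML and JS needed for the page to load
const MAX_OTHER_SIZE = 5 * MB; // all other content

// Largest response size that is still returned to the browser in full.
function maxBrowserSize(resourceType: string): number {
  const t = resourceType.toLowerCase();
  return t === "document" || t === "script"
    ? MAX_ESSENTIAL_SIZE
    : MAX_OTHER_SIZE;
}

// True if the response should be streamed to disk instead of held in memory.
function shouldStreamToDisk(
  resourceType: string,
  contentLength: number,
): boolean {
  return contentLength > maxBrowserSize(resourceType);
}
```

With these limits a 3MB HTML page is returned to the browser in full, where the earlier 2MB limit would have produced an empty response.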
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Renamed from util/recorder.js