Commit graph

12 commits

Author SHA1 Message Date
Ilya Kreymer
fcfb6b7ae3 fix tests 2025-09-12 09:03:19 -07:00
Ilya Kreymer
e72b34318d
Add WARC-Protocol header (#715)
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-19 18:59:52 -07:00
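A rough sketch of the header logic described in #715, not the crawler's actual code: one `WARC-Protocol` header is emitted per protocol in use, and the record's status line stays at HTTP/1.1 unless the response really was HTTP/1.0. The function names and the token spellings (`h2`, `tls/1.3`) are assumptions made for illustration; the proposal in iipc/warc-specifications#42 is the authority on the real values.

```python
from typing import List, Optional, Tuple


def warc_protocol_headers(
    http_version: str, tls_version: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return repeated ("WARC-Protocol", value) pairs for one record.

    Hypothetical helper: the token spellings are assumed, not taken
    from the crawler source.
    """
    headers = [("WARC-Protocol", http_version.lower())]
    if tls_version:
        # The header is repeated, once per protocol layer.
        headers.append(("WARC-Protocol", tls_version.lower()))
    return headers


def status_line_version(http_version: str) -> str:
    """Keep HTTP/1.1 on the record unless the response was HTTP/1.0."""
    return "HTTP/1.0" if http_version.lower() == "http/1.0" else "HTTP/1.1"
```

For example, `warc_protocol_headers("h2", "tls/1.3")` yields two `WARC-Protocol` entries, while `status_line_version("h2")` still reports `HTTP/1.1`.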
Ilya Kreymer
5c00bca2b4
tests: use old.webrecorder.net for testing (#710)
replace webrecorder.net -> old.webrecorder.net to fix tests relying on
old website for now
2024-10-31 13:24:58 -04:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to the 'warc-cdx' directory, then
merged using the Linux 'sort' command, and compressed to ZipNum if >50K (or
always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously, each WARC was written once, then read again (for CDXJ indexing),
read again (for adding to the new WACZ ZIP), written to disk (into the new WACZ
ZIP), and read again (if uploading to a remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ; only the CDXJ is reread, sorted and merged
on disk, and all data is read once to either generate the WACZ on disk or upload
it to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
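The merge step in #673 relies on the Linux `sort` command. As an illustration only, the same idea can be sketched in-process with a k-way merge: sort each per-WARC index, then merge the sorted runs lazily. This swaps `heapq.merge` in for `sort`, assumes each per-WARC index fits in memory, and uses a hypothetical function name; it is not how the crawler does it.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_cdxj(per_warc_indices: Iterable[List[str]]) -> Iterator[str]:
    """Merge per-WARC CDXJ line lists into one sorted stream.

    Each input list is sorted on its own first, then the sorted runs
    are merged lazily, so the combined index is never held in memory
    as a single list.
    """
    sorted_runs = [sorted(lines) for lines in per_warc_indices]
    return heapq.merge(*sorted_runs)
```

Python compares strings by code point, so for ASCII keys the resulting order should match a byte-wise sort; whether that matches the collation used by the crawler's `sort` invocation is not stated in the commit message.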
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath functions like contains(), whereas
Puppeteer expects CSS selectors, hence the syntax change.
2024-07-11 15:55:06 -07:00
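The matching rules from #638 can be written out as CSS selector strings. This is a minimal sketch that assumes exact matches on the `name` attribute and `input` elements only; the helper names are hypothetical and the crawler's real selectors may differ.

```python
USERNAME_NAMES = ("user", "username", "email")
PASSWORD_NAMES = ("pass", "password")


def username_selector() -> str:
    """CSS selector matching an input whose name is a known username field."""
    return ", ".join(f"input[name='{name}']" for name in USERNAME_NAMES)


def password_selector() -> str:
    """CSS selector matching a password-type input with a known name."""
    return ", ".join(
        f"input[type='password'][name='{name}']" for name in PASSWORD_NAMES
    )
```

Strings of this shape are plain CSS, so they could be passed to any API that accepts CSS selectors rather than XPath expressions.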
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() so it still works if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data is
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
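How the include and exclude lists from #485 might interact can be sketched as a small predicate. The default exclude list comes from the commit message; the function name and the precedence (exclude wins over include, and an empty include list means "everything") are assumptions, not confirmed behavior.

```python
from typing import Collection, Optional

# Default exclude list as named in the commit message.
DEFAULT_EXCLUDE = frozenset({"screencast", "recorderNetwork", "jsErrors"})


def should_log(
    context: str,
    include: Optional[Collection[str]] = None,
    exclude: Collection[str] = DEFAULT_EXCLUDE,
) -> bool:
    """Decide whether a message with this log context should be emitted."""
    if context in exclude:
        # Assumed precedence: an excluded context is dropped even if included.
        return False
    if include:
        return context in include
    # No include list given: log every context that is not excluded.
    return True
```

Under these assumptions, enabling 'recorderNetwork' logging for debugging means passing an exclude list that no longer contains it.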
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
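The header mapping in #481 amounts to lowercasing the Chrome DevTools Protocol `ResourceType` value. A minimal sketch, with a hypothetical helper name:

```python
from typing import Tuple


def warc_resource_type_header(cdp_resource_type: str) -> Tuple[str, str]:
    """Build the WARC-Resource-Type header from a CDP ResourceType value."""
    return ("WARC-Resource-Type", cdp_resource_type.lower())
```

With this, CDP values such as `Document` or `XHR` become `document` and `xhr`.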
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a512e92886
Include resource type + mime type in page resources list (#468)
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
2024-02-19 19:11:48 -08:00
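To illustrate the kind of filtering #468 is meant to enable, here is a sketch over a hypothetical pageinfo payload. The field names (`urls`, `status`, `mime`, `type`) and the example URLs are assumptions for illustration and are not taken from the crawler's actual record format.

```python
from typing import Any, Dict, List


def resources_of_type(pageinfo: Dict[str, Any], resource_type: str) -> List[str]:
    """Return the sorted URLs of page resources with the given resource type."""
    return sorted(
        url
        for url, info in pageinfo.get("urls", {}).items()
        if info.get("type") == resource_type
    )


# Hypothetical record shape: one entry per resource loaded by the page.
example_pageinfo = {
    "urls": {
        "https://example.com/": {
            "status": 200, "mime": "text/html", "type": "document",
        },
        "https://example.com/app.js": {
            "status": 200, "mime": "application/javascript", "type": "script",
        },
        "https://example.com/vendor.js": {
            "status": 200, "mime": "application/javascript", "type": "script",
        },
    }
}
```

Calling `resources_of_type(example_pageinfo, "script")` returns the two script URLs, which is the sort of per-type comparison the commit message has in mind.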
Ilya Kreymer
e8f2073a7e
Update Browser Image (#466)
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
2024-02-17 22:40:12 -08:00
Ilya Kreymer
96f3c407b1
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still
included in the `url:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
include cached resources
2024-02-16 14:36:32 -08:00