Commit graph

16 commits

Author SHA1 Message Date
Percival
7dd13a9ec4
fix: Skip proxy for seed file and custom behavior downloads (#907) 2025-11-11 10:51:24 -05:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Tessa Walsh
46a02d12a3
Remove hardcoded /tmp prefix from path (#843)
Fast-follow to #842 to fix a typo
2025-05-28 15:46:19 -07:00
Ilya Kreymer
52235ab21e
tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842)
allows for customizing tmp directory with TMPDIR env var
2025-05-28 12:48:06 -07:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
2961d3b9f2
Write behaviors downloaded from URL to tempdir (#816)
Follow-up to #368 

This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
2025-04-04 11:23:29 -04:00
Tessa Walsh
2b00c1f065
Tweaks for custom behavior loading (#807)
Follow-up to #712 

Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520

- Ignore `.git` directory of git repositories when recursively walking
cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2
was overly restrictive for Git repositories)
- Log name of custom behavior scripts (filename or URLs) added as info messages in
`behavior` context

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:15:57 -07:00
Tessa Walsh
5fedde6eee
Fail crawl with fatal message if custom behavior isn't loaded (#799)
Fixes #797 

The crawler will now exit with a fatal log message and exit code 17 if:

- A Git repository specified with `--customBehavior` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehavior` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehavior`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)

Tests have also been added accordingly.
2025-03-31 17:35:30 -07:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in cases multiple retries
may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
2025-01-25 22:55:49 -08:00
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Ilya Kreymer
703835a7dd
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
2023-12-13 15:14:53 -05:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Renamed from util/file_reader.js (Browse further)