Commit graph

26 commits

Author SHA1 Message Date
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses query args 'branch' and 'path' to specify the git branch and the subpath within the repo, respectively.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
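A minimal YAML config sketch of the git-repo loading described in the commit above. The 'branch' and 'path' query args come from the commit itself; the `git+` URL prefix, repository URL, and the values shown are illustrative assumptions:

```yaml
# Sketch only (per #717): load custom behaviors from a git repo.
# The repo URL, branch, and subpath below are hypothetical.
customBehaviors:
  - "git+https://github.com/example-org/example-behaviors.git?branch=main&path=behaviors"
```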
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support an array of selectors via the `--selectLinks` option, in the
form `[css selector]->[property]` or `[css selector]->@[attribute]`.
2024-11-08 11:04:41 -05:00
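A hedged YAML sketch of the `--selectLinks` syntax from the commit above; the first entry is believed to match the default link extraction, and the second selector/attribute pair is purely hypothetical:

```yaml
# Sketch only (per #689): custom link-extraction selectors.
selectLinks:
  - "a[href]->href"             # [css selector]->[property]
  - "a.next-page->@data-href"   # [css selector]->@[attribute] (hypothetical selector and attribute)
```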
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
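A sketch of the repeatable `--customBehaviors` option expressed as a YAML list, mixing a URL, a single local file, and a directory as described in the commit above; the specific URL and paths are hypothetical:

```yaml
# Sketch only (per #707): custom behaviors loaded from mixed sources.
customBehaviors:
  - "https://example.com/behaviors/site-behavior.js"   # remote URL
  - "/custom-behaviors/my-behavior.js"                  # single local file
  - "/custom-behaviors/more-behaviors/"                 # local directory
```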
Tessa Walsh
e05d50d637
Add documentation for crawl collections (#695)
Fixes #675
2024-10-05 11:51:32 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with the
CLI options plus additions via the `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` and also passing
an `--sshProxyPrivateKeyFile <private key file>` param and an optional
`--sshProxyKnownHostsFile <public host key file>` param. The key files are
expected to be mounted as volumes into the crawler.

- Same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide

- Tests are updated to include crawling through an SSH proxy running locally.
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
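A hedged YAML config sketch of the SSH proxy options named in the commit above; the host, port, and key file paths are hypothetical, and the key files are assumed to be mounted into the crawler as volumes, as the commit notes:

```yaml
# Sketch only (per #671): SOCKS5 proxy over an SSH tunnel.
proxyServer: "ssh://proxy-user@proxy.example.com:2222"
sshProxyPrivateKeyFile: "/proxies/id_ed25519"    # hypothetical path, mounted as a volume
sshProxyKnownHostsFile: "/proxies/known_hosts"   # optional, hypothetical path
```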
Tessa Walsh
39c8f48bb2
Disable behaviors entirely if --behaviors array is empty (#672)
Fixes #651
2024-08-27 13:20:19 -07:00
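As a small sketch, the YAML-config equivalent of an empty `--behaviors` array would presumably be an empty list; the commit describes the CLI flag, so the YAML form is an assumption:

```yaml
# Sketch only (per #672): an empty behaviors array disables behaviors entirely.
behaviors: []
```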
Henry Wilkinson
4c1da90d8f
Adds warning about crawling with basic auth (#669)
Closes https://github.com/webrecorder/browsertrix/issues/1950 over here
too

### Changes
- Adds a warning about using basic auth
- Adds a link to MDN because learning and cross referencing is fun!
2024-08-14 21:14:31 -07:00
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that a browser profile
can be passed as an HTTP(S) URL as well.
2024-07-19 18:53:28 -07:00
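A minimal sketch of passing a profile by URL as described above; the URL and the tar.gz packaging are illustrative assumptions:

```yaml
# Sketch only (per #649): browser profile fetched over HTTP(S).
profile: "https://example.com/profiles/login-profile.tar.gz"
```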
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add a test for crawling with auth via http-server serving the local docs build (docs are now built as part of CI)
- docs: add HTTP Auth to YAML config section

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
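A hedged YAML sketch of per-seed HTTP basic auth as described in the commit above, showing both the inline URL form and the explicit 'auth' field; the hosts, credentials, and the exact 'auth' value format are assumptions:

```yaml
# Sketch only (per #616): per-seed HTTP basic auth.
seeds:
  - "https://user:password@protected.example.com/"   # credentials parsed from the URL
  - url: "https://other.example.com/"
    auth: "user:password"                            # 'auth' field set directly in YAML (format assumed)
```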
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ), while logs and pages are still written and saved state can be
generated (per the --saveState options).
- adds a test to ensure only the 'logs' and 'pages' dirs are generated with --dryRun
- screenshots and text extraction are skipped altogether in dryRun mode, and a
warning is printed that storage- and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
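A minimal sketch of enabling the mode described above via a YAML config; per the commit, only logs and pages are written in this mode:

```yaml
# Sketch only (per #594): run the crawl without storing WARCs/WACZ/CDXJ.
dryRun: true
```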
Ed Summers
2ef116d667
Mention command line options when restarting (#577)
It's probably worth reminding people that the command line options need
to be passed in again since the crawl state doesn't include them.

Refs #568
2024-05-21 10:57:50 -07:00
Ilya Kreymer
c71274d841
add STORE_REGION env var to be able to specify region (#565)
Defaults to us-east-1 for MinIO compatibility.
Fixes #515
2024-05-12 12:42:04 -04:00
Ilya Kreymer
0201fef559 docs: fix typo 2024-04-18 17:19:13 -07:00
Tessa Walsh
75b617dc94
Add crawler QA docs (#551)
Fixes #550

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-18 16:18:22 -04:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards; fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
2024-03-28 17:17:29 -07:00
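A minimal sketch of the option described above in a YAML config; the 20-second value is an arbitrary illustration:

```yaml
# Sketch only (per #520): wait an extra 20 seconds after page load before
# link extraction, text extraction, screenshots, and behaviors.
postLoadDelay: 20
```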
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if a page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on the different interrupt options, e.g. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here), vs. SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates; nested sitemap lists are filtered as well
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in redis whether sitemaps have already been
parsed and serialize this to the saved state, to avoid reparsing. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when the limit is reached.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
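A hedged YAML sketch of sitemap crawling with the date filters the commit above describes; the option names (`sitemap`, `sitemapFromDate`, `sitemapToDate`) and their exact semantics are assumptions based on the commit's `fromDate`/`toDate` description, not confirmed by it:

```yaml
# Sketch only (per #497): sitemap crawling with date filters.
# Option names below are assumed, not taken verbatim from the commit.
seeds:
  - "https://example.com/"
sitemap: true                   # autodetect via robots.txt, then /sitemap.xml
sitemapFromDate: "2024-01-01"   # only include URLs dated on/after this
sitemapToDate: "2024-03-01"     # only include URLs dated on/before this
```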
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00