Merge pull request #471 from openzim/fix_browsertrix_args

Enhance support of Browsertrix Crawler arguments
2025-12-31 04:23:15 +00:00 · 2025-02-14 15:35:32 +01:00 · 2025-02-14 15:35:32 +01:00 · a9efec4797
commit a9efec4797
parent 4f9085b10e 2f7a83e187
6 changed files with 669 additions and 164 deletions
--- a/.github/workflows/DailyTests.yaml
+++ b/.github/workflows/DailyTests.yaml
@ -18,7 +18,7 @@ jobs:
        run: docker build -t local-zimit .

      - name: run crawl of test website
-        run: docker run -v $PWD/output:/output local-zimit zimit --url https://website.test.openzim.org/ --name tests_eng_test-website --zim-file tests_eng_test-website.zim
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds https://website.test.openzim.org/ --name tests_eng_test-website --zim-file tests_eng_test-website.zim

      - name: archive ZIM
        uses: actions/upload-artifact@v4
--- a/.github/workflows/Tests.yaml
+++ b/.github/workflows/Tests.yaml
@ -63,19 +63,19 @@ jobs:
        run: docker run -v $PWD/output:/output local-zimit zimit --help

      - name: run crawl with soft size limit
-        run: docker run -v $PWD/output:/output local-zimit zimit --url http://website.test.openzim.org/ --sizeSoftLimit 8192 --name tests_en_sizesoftlimit --zim-file tests_en_sizesoftlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --statsFilename /output/stats_sizesoftlimit.json
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds http://website.test.openzim.org/ --sizeSoftLimit 8192 --name tests_en_sizesoftlimit --zim-file tests_en_sizesoftlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --zimit-progress-file /output/stats_sizesoftlimit.json

      - name: run crawl with hard size limit
-        run: docker run -v $PWD/output:/output local-zimit zimit --url http://website.test.openzim.org/ --sizeHardLimit 8192 --name tests_en_sizehardlimit --zim-file tests_en_sizehardlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --statsFilename /output/stats_sizehardlimit.json || true
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds http://website.test.openzim.org/ --sizeHardLimit 8192 --name tests_en_sizehardlimit --zim-file tests_en_sizehardlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --zimit-progress-file /output/stats_sizehardlimit.json || true

      - name: run crawl with soft time limit
-        run: docker run -v $PWD/output:/output local-zimit zimit --url http://website.test.openzim.org/ --timeSoftLimit 1 --name tests_en_timesoftlimit --zim-file tests_en_timesoftlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --statsFilename /output/stats_timesoftlimit.json
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds http://website.test.openzim.org/ --timeSoftLimit 1 --name tests_en_timesoftlimit --zim-file tests_en_timesoftlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --zimit-progress-file /output/stats_timesoftlimit.json

      - name: run crawl with hard time limit
-        run: docker run -v $PWD/output:/output local-zimit zimit --url http://website.test.openzim.org/ --timeHardLimit 1 --name tests_en_timehardlimit --zim-file tests_en_timehardlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --statsFilename /output/stats_timehardlimit.json || true
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds http://website.test.openzim.org/ --timeHardLimit 1 --name tests_en_timehardlimit --zim-file tests_en_timehardlimit.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --zimit-progress-file /output/stats_timehardlimit.json || true

      - name: run standard crawl
-        run: docker run -v $PWD/output:/output local-zimit zimit --url http://website.test.openzim.org/http-return-codes.html --name tests_en_onepage --zim-file tests_en_onepage.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --statsFilename /output/stats.json --keep
+        run: docker run -v $PWD/output:/output local-zimit zimit --seeds http://website.test.openzim.org/http-return-codes.html --name tests_en_onepage --zim-file tests_en_onepage.zim --adminEmail test@example.com --mobileDevice "Pixel 5" --zimit-progress-file /output/stats.json --statsFilename /output/crawl.json --warc2zim-progress-file /output/warc2zim.json --keep

      - name: run integration test suite
        run: docker run -v $PWD/tests-integration/integration.py:/app/integration.py -v $PWD/output:/output local-zimit bash -c "/app/zimit/bin/pip install pytest; /app/zimit/bin/pytest -v /app/integration.py"
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -11,6 +11,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 - Change solution to report partial ZIM to the Zimfarm and other clients (#304)
 - Keep temporary folder when crawler or warc2zim fails, even if not asked for (#468)
+- Add many missing Browsertrix Crawler arguments ; drop default overrides by zimit ; drop `--noMobileDevice` setting (not needed anymore) (#433)
+- Document all Browsertrix Crawler default arguments values (#416)
+- Use preferred Browsertrix Crawler arguments names: (part of #471)
+  - `--seeds` instead of `--url`
+  - `--seedFile` instead of `--urlFile`
+  - `--pageLimit` instead of `--limit`
+  - `--pageLoadTimeout` instead of `--timeout`
+  - `--scopeIncludeRx` instead of `--include`
+  - `--scopeExcludeRx` instead of `--exclude`
+  - `--pageExtraDelay` instead of `--delay`
+- Remove confusion between zimit, warc2zim and crawler stats filenames (part of #471)
+  - `--statsFilename` is now the crawler stats file (since it is the same name, just like other arguments)
+  - `--zimit-progress-file` is now the zimit stats location
+  - `--warc2zim-progress-file` is the warc2zim stats location
+  - all are optional values, if not set and needed temporary files are used

 ### Fixed

--- a/README.md
+++ b/README.md
@ -38,16 +38,15 @@ Usage

 `zimit` is intended to be run in Docker. Docker image is published at https://github.com/orgs/openzim/packages/container/package/zimit.

-The image accepts the following parameters, **as well as any of the [warc2zim](https://github.com/openzim/warc2zim) ones**; useful for setting metadata, for instance:
+The image accepts the following parameters, **as well as any of the [Browsertrix crawler](https://crawler.docs.browsertrix.com/user-guide/cli-options/) and [warc2zim](https://github.com/openzim/warc2zim) ones**:

- Required: `--url URL` - the url to be crawled
+- Required: `--seeds URL` - the url to start crawling from ; multiple URLs can be separated by a comma (even if **usually not needed**, these are just the **seeds** of the crawl) ; first seed URL is used as ZIM homepage
 - Required: `--name` - Name of ZIM file
 - `--output` - output directory (defaults to `/output`)
- `--limit U` - Limit capture to at most U URLs
- `--behaviors` - Control which browsertrix behaviors are ran (defaults to `autoplay,autofetch,siteSpecific`, adding `autoscroll` to the list is possible to automatically scroll the pages and fetch resources which are lazy loaded)
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--exclude="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
+- `--pageLimit U` - Limit capture to at most U URLs
+- `--scopeExcludeRx <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--scopeExcludeRx="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
 - `--workers N` - number of crawl workers to be run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
+- `--waitUntil` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
 - `--keep` - in case of failure, WARC files and other temporary files (which are stored as a subfolder of output directory) are always kept, otherwise they are automatically deleted. Use this flag to always keep WARC files, even in case of success.

 Example command:
@ -55,7 +54,7 @@ Example command:
 ```bash
 docker run ghcr.io/openzim/zimit zimit --help
 docker run ghcr.io/openzim/zimit warc2zim --help
-docker run  -v /output:/output ghcr.io/openzim/zimit zimit --url URL --name myzimfile
+docker run  -v /output:/output ghcr.io/openzim/zimit zimit --seeds URL --name myzimfile
 ```

 **Note**: Image automatically filters out a large number of ads by using the 3 blocklists from [anudeepND](https://github.com/anudeepND/blacklist). If you don't want this filtering, disable the image's entrypoint in your container (`docker run --entrypoint="" ghcr.io/openzim/zimit ...`).
--- a/src/zimit/about.py
+++ b/src/zimit/about.py
@ -1 +1 @@
-__version__ = "2.1.9-dev0"
+__version__ = "3.0.0-dev0"
--- a/src/zimit/zimit.py
+++ b/src/zimit/zimit.py