## Changelog

All notable changes to this project are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.2.0).

## [Unreleased]

### Added

- Add `--overwrite` flag to overwrite the existing ZIM file if it exists (#399)

### Changed

- Fix issues preventing interrupted crawls from being resumed. (#499)
  - Ensure build directory is used explicitly instead of a randomized subdirectory when passed, and pre-create it if it does not exist.
  - Use all warc_dirs found instead of just the latest so interrupted crawls use all collected pages across runs when an explicit collections directory is not passed.
  - Don't clean up an explicitly passed build directory.

## [3.0.5] - 2025-04-11

### Changed

- Upgrade to browsertrix crawler 1.6.0 (#493)

## [3.0.4] - 2025-04-04

### Changed

- Upgrade to browsertrix crawler 1.5.10 (#491)

## [3.0.3] - 2025-02-28

### Changed

- Upgrade to browsertrix crawler 1.5.7 (#483)

## [3.0.2] - 2025-02-27

### Changed

- Upgrade to browsertrix crawler 1.5.6 (#482)

## [3.0.1] - 2025-02-24

### Changed

- Upgrade to browsertrix crawler 1.5.4 (#476)

## [3.0.0] - 2025-02-17

### Changed

- Change solution to report partial ZIM to the Zimfarm and other clients (#304)
- Keep temporary folder when crawler or warc2zim fails, even if not asked for (#468)
- Add many missing Browsertrix Crawler arguments; drop default overrides by zimit; drop `--noMobileDevice` setting (not needed anymore) (#433)
- Document all Browsertrix Crawler default argument values (#416)
- Use preferred Browsertrix Crawler argument names: (part of #471)
  - `--seeds` instead of `--url`
  - `--seedFile` instead of `--urlFile`
  - `--pageLimit` instead of `--limit`
  - `--pageLoadTimeout` instead of `--timeout`
  - `--scopeIncludeRx` instead of `--include`
  - `--scopeExcludeRx` instead of `--exclude`
  - `--pageExtraDelay` instead of `--delay`
- Remove confusion
between zimit, warc2zim and crawler stats filenames (part of #471)
  - `--statsFilename` is now the crawler stats file (since it is the same name, just like other arguments)
  - `--zimit-progress-file` is now the zimit stats location
  - `--warc2zim-progress-file` is the warc2zim stats location
  - all are optional values; if not set and needed, temporary files are used

### Fixed

- Do not create the ZIM when crawl is incomplete (#444)

## [2.1.8] - 2025-02-07

### Changed

- Upgrade to browsertrix crawler 1.5.1, Python 3.13 and others (#462 + #464)

## [2.1.7] - 2025-01-10

### Changed

- Upgrade to browsertrix crawler 1.4.2 (#450)
- Upgrade to warc2zim 2.2.0

## [2.1.6] - 2024-11-07

### Changed

- Upgrade to browsertrix crawler 1.3.5 (#426)

## [2.1.5] - 2024-11-01

### Changed

- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)

## [2.1.4] - 2024-10-11

### Changed

- Upgrade to browsertrix crawler 1.3.3 (#411)

## [2.1.3] - 2024-10-08

### Changed

- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)

### Fixed

- Fix help (#393)

## [2.1.2] - 2024-09-09

### Changed

- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g.
PDFs) is failing to proceed" - #380)

## [2.1.1] - 2024-09-05

### Added

- Add support for uncompressed tar archive in `--warcs` (#369)

### Changed

- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)

### Fixed

- Stream file downloads to avoid exhausting memory (#373)
- Fix documentation on `--diskUtilization` setting (#375)

## [2.1.0] - 2024-08-09

### Added

- Add `--custom-behaviors` argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with YouTube player (#330)
- Add `--warcs` option to directly process WARC files (#301)

### Changed

- Make it clear that `--profile` argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)

### Fixed

- Sort WARC directories found by modification time (#366)

## [2.0.6] - 2024-08-02

### Changed

- Upgraded Browsertrix Crawler to 1.2.6

## [2.0.5] - 2024-07-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3

## [2.0.4] - 2024-07-15

### Changed

- Upgraded Browsertrix Crawler to 1.2.4 (fixes retrieve automatically the assets present in a data-xxx tag #316)

## [2.0.3] - 2024-06-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.0 (fixes YouTube videos issue #323)

## [2.0.2] - 2024-06-18

### Changed

- Upgrade dependencies (mainly warc2zim 2.0.2)

## [2.0.1] - 2024-06-13

### Changed

- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)

### Fixed

- Crawler is not correctly checking disk size / usage (#305)

## [2.0.0] - 2024-06-04

### Added

- New `--version` flag to display Zimit version (#234)
- New `--logging` flag to adjust
Browsertrix Crawler logging (#273)
- Use new `--scraper-suffix` flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New `--noMobileDevice` CLI argument
- Publish Docker image for `linux/arm64` (in addition to `linux/amd64`) (#178)

### Changed

- **Use `warc2zim` version 2**, which no longer requires a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit; they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- `--userAgent` CLI argument again overrides the `--userAgentSuffix` and `--adminEmail` values
- `--userAgent` CLI argument is not mandatory anymore

### Fixed

- Fix support for YouTube videos (#291)
- Fix crawler `--waitUntil` values (#289)

## [1.6.3] - 2024-01-18

### Changed

- Adapt to new `warc2zim` code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5

### Added

- New `--build` parameter (optional) to specify the directory holding Browsertrix files; if not set, `--output` directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; subdir is kept only if `--keep` is set.

### Fixed

- `--collection` parameter was not working (#252)

## [1.6.2] - 2023-11-17

### Changed

- Using browsertrix-crawler 0.12.3

### Fixed

- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)

## [1.6.1] - 2023-11-06

### Changed

- Using browsertrix-crawler 0.12.1

## [1.6.0] - 2023-11-02

### Changed

- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0

## [1.5.3] - 2023-10-02

### Changed

- Using browsertrix-crawler 0.11.2

## [1.5.2] - 2023-09-19

### Changed

- Using browsertrix-crawler 0.11.1

## [1.5.1] - 2023-09-18

### Changed

- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4

## [1.5.0] - 2023-08-23

### Added

- `--long-description` param

## [1.4.1] - 2023-08-23

### Changed

- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3

## [1.4.0] - 2023-08-02

### Added

- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)

### Changed

- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil` from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)

### Removed

- `--newContext` from crawler's update

## [1.3.1] - 2023-02-06

### Changed

- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2

## [1.3.0] - 2023-02-02

### Added

- Initial URL check normalizes homepage redirects to
standard ports – 80/443 (#137)

### Changed

- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout (12s to connect, 27s to read) instead of 10s

## [1.2.0] - 2022-06-21

### Added

- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter

### Changed

- using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- default WARC location after crawl changed
  from `collections/capture-*/archive/` to `collections/crawl-*/archive/`

### Removed

- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)

## [1.1.5]

- using crawler 0.3.2 and warc2zim 1.3.6

## [1.1.4]

- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- Warc to zim now written to `{temp_root_dir}/collections/capture-*/archive/` where `capture-*` is dynamic and includes the datetime.
(from browsertrix-crawler)

## [1.1.3]

- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not

## [1.1.2]

- added support for `--custom-css`
- added domains block list (default)

## [1.1.1]

- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets

## [1.0]

- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3