Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.2.0).
[Unreleased]
Changed
- Upgrade to browsertrix crawler 1.5.0, Python 3.13 and others (#462)
[2.1.7] - 2025-01-10
Changed
- Upgrade to browsertrix crawler 1.4.2 (#450)
- Upgrade to warc2zim 2.2.0
[2.1.6] - 2024-11-07
Changed
- Upgrade to browsertrix crawler 1.3.5 (#426)
[2.1.5] - 2024-11-01
Changed
- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)
[2.1.4] - 2024-10-11
Changed
- Upgrade to browsertrix crawler 1.3.3 (#411)
[2.1.3] - 2024-10-08
Changed
- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)
Fixed
- Fix help (#393)
[2.1.2] - 2024-09-09
Changed
- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g. PDFs) is failing to proceed" - #380)
[2.1.1] - 2024-09-05
Added
- Add support for uncompressed tar archive in --warcs (#369)
Changed
- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)
Fixed
- Stream files downloads to not exhaust memory (#373)
- Fix documentation on --diskUtilization setting (#375)
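As an illustration of the streaming fix (#373), the sketch below copies a download in fixed-size chunks so that memory use is bounded by the chunk size rather than the file size. It is not zimit's actual code: the `stream_to_file` helper and the 1 MiB chunk size are assumptions made for this example, and an in-memory buffer stands in for an HTTP response body.

```python
import io
import shutil
import tempfile
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # assumed chunk size (1 MiB), not taken from zimit


def stream_to_file(source, destination: Path, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy a binary file-like object to `destination` chunk by chunk.

    Hypothetical helper: at most `chunk_size` bytes are read per iteration,
    unlike `destination.write_bytes(source.read())`, which loads the whole
    payload into memory first. Returns the number of bytes written.
    """
    with destination.open("wb") as fh:
        shutil.copyfileobj(source, fh, length=chunk_size)
        return fh.tell()


if __name__ == "__main__":
    # An in-memory buffer stands in for a large response body.
    payload = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
    with tempfile.TemporaryDirectory() as tmp:
        written = stream_to_file(payload, Path(tmp) / "download.bin")
        print(f"wrote {written} bytes")
```

The same pattern applies to any file-like source, such as a raw HTTP response stream.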
[2.1.0] - 2024-08-09
Added
- Add --custom-behaviors argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with Youtube player (#330)
- Add --warcs option to directly process WARC files (#301)
Changed
- Make it clear that --profile argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)
Fixed
- Sort WARC directories found by modification time (#366)
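The ordering fix (#366) can be pictured with the short sketch below, which lists candidate WARC directories and sorts them by modification time. This is an assumption-laden illustration, not zimit's implementation: the `warc_dirs_by_mtime` helper is hypothetical, the default glob pattern is borrowed from the 1.2.0 entry of this changelog, and the oldest-first order is a guess.

```python
import os
import tempfile
from pathlib import Path


def warc_dirs_by_mtime(root: Path, pattern: str = "collections/crawl-*/archive") -> list[Path]:
    """Return directories matching `pattern` under `root`, oldest first.

    Hypothetical helper: sorting by st_mtime makes the order deterministic,
    whereas a bare glob returns entries in filesystem-dependent order.
    """
    dirs = [p for p in root.glob(pattern) if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        newer = root / "collections" / "crawl-a" / "archive"
        older = root / "collections" / "crawl-b" / "archive"
        newer.mkdir(parents=True)
        older.mkdir(parents=True)
        # Force distinct modification times so the demo does not depend on timing.
        os.utime(newer, (2000, 2000))
        os.utime(older, (1000, 1000))
        for directory in warc_dirs_by_mtime(root):
            print(directory.relative_to(root))
```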
[2.0.6] - 2024-08-02
Changed
- Upgraded Browsertrix Crawler to 1.2.6
[2.0.5] - 2024-07-24
Changed
- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3
[2.0.4] - 2024-07-15
Changed
- Upgraded Browsertrix Crawler to 1.2.4 (fixes "retrieve automatically the assets present in a data-xxx tag" - #316)
[2.0.3] - 2024-06-24
Changed
- Upgraded Browsertrix Crawler to 1.2.0 (fixes Youtube videos issue #323)
[2.0.2] - 2024-06-18
Changed
- Upgrade dependencies (mainly warc2zim 2.0.2)
[2.0.1] - 2024-06-13
Changed
- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)
Fixed
- Crawler is not correctly checking disk size / usage (#305)
[2.0.0] - 2024-06-04
Added
- New --version flag to display Zimit version (#234)
- New --logging flag to adjust Browsertrix Crawler logging (#273)
- Use new --scraper-suffix flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New --noMobileDevice CLI argument
- Publish Docker image for linux/arm64 (in addition to linux/amd64) (#178)
Changed
- Use warc2zim version 2, which no longer needs a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit, they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- --userAgent CLI argument overrides again the --userAgentSuffix and --adminEmail values
- --userAgent CLI argument is not mandatory anymore
Fixed
- Fix support for Youtube videos (#291)
- Fix crawler --waitUntil values (#289)
[1.6.3] - 2024-01-18
Changed
- Adapt to new warc2zim code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5
Added
- New --build parameter (optional) to specify the directory holding Browsertrix files; if not set, --output directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; subdir is kept only if --keep is set.
Fixed
- --collection parameter was not working (#252)
[1.6.2] - 2023-11-17
Changed
- Using browsertrix-crawler 0.12.3
Fixed
- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)
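The '0' fix (#245) concerns a common pitfall when forwarding options: a truthiness test silently drops valid falsy values. The sketch below shows the general idea only; `build_crawler_args` is a hypothetical helper written for this example, not zimit's function, and the option names are borrowed from elsewhere in this changelog purely for illustration.

```python
def build_crawler_args(options: dict) -> list[str]:
    """Turn an options mapping into a flat argv list (hypothetical helper).

    Unset options (None) and disabled switches (False) are skipped.
    Testing identity instead of truthiness keeps valid falsy values:
    `if not value: continue` would also drop the integer 0.
    """
    args: list[str] = []
    for name, value in options.items():
        if value is None or value is False:
            continue
        args.append(f"--{name}")
        if value is not True:  # boolean switches carry no value
            args.append(str(value))
    return args


if __name__ == "__main__":
    options = {"depth": 0, "delay": None, "overwrite": True, "keep": False, "sizeLimit": 5}
    print(build_crawler_args(options))
```

Note that `0 is False` evaluates to false in Python, so an integer 0 passes the identity check and is forwarded as the string "0".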
[1.6.1] - 2023-11-06
Changed
- Using browsertrix-crawler 0.12.1
[1.6.0] - 2023-11-02
Changed
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
[1.5.3] - 2023-10-02
Changed
- Using browsertrix-crawler 0.11.2
[1.5.2] - 2023-09-19
Changed
- Using browsertrix-crawler 0.11.1
[1.5.1] - 2023-09-18
Changed
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
[1.5.0] - 2023-08-23
Added
- --long-description param
[1.4.1] - 2023-08-23
Changed
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
[1.4.0] - 2023-08-02
Added
- --title to set ZIM title
- --description to set ZIM description
- New crawler options: --maxPageLimit, --delay, --diskUtilization
- --zim-lang param to set warc2zim's --lang (ISO-639-3)
Changed
- Using browsertrix-crawler 0.10.2
- Default and accepted values for --waitUntil from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- --failOnFailedSeed used unconditionally
- --lang now passed to crawler (ISO-639-1)
Removed
- --newContext from crawler's update
[1.3.1] - 2023-02-06
Changed
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
[1.3.0] - 2023-02-02
Added
- Initial URL check normalizes homepage redirects to standard ports – 80/443 (#137)
Changed
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed --allowHashUrls being a boolean param
- Increased check_url timeout (12s to connect, 27s to read) instead of 10s
[1.2.0] - 2022-06-21
Added
- --urlFile browsertrix crawler parameter
- --depth browsertrix crawler parameter
- --extraHops parameter
- --collection browsertrix crawler parameter
- --allowHashUrls browsertrix crawler parameter
- --userAgentSuffix browsertrix crawler parameter
- --behaviors parameter
- --behaviorTimeout browsertrix crawler parameter
- --profile browsertrix crawler parameter
- --sizeLimit browsertrix crawler parameter
- --timeLimit browsertrix crawler parameter
- --healthCheckPort parameter
- --overwrite parameter
Changed
- using browsertrix-crawler 0.6.0 and warc2zim 1.4.2
- default WARC location after crawl changed from collections/capture-*/archive/ to collections/crawl-*/archive/
Removed
- --scroll browsertrix crawler parameter (see --behaviors)
- --scope browsertrix crawler parameter (see --scopeType, --include and --exclude)
[1.1.5]
- using crawler 0.3.2 and warc2zim 1.3.6
[1.1.4]
- Defaults to load,networkidle0 for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- Warc to zim now written to {temp_root_dir}/collections/capture-*/archive/ where capture-* is dynamic and includes the datetime. (from browsertrix-crawler)
[1.1.3]
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- statsFilename now informs whether limit was hit or not
[1.1.2]
- added support for --custom-css
- added domains block list (default)
[1.1.1]
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
[1.0]
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3