## Changelog

All notable changes to this project are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.2.0).

## [Unreleased]

### Added

- Add `--custom-behaviors` argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with YouTube player (#330)

### Changed

- Make it clear that `--profile` argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions and restore warc2zim availability in the Docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)

## [2.0.6] - 2024-08-02

### Changed

- Upgraded Browsertrix Crawler to 1.2.6

## [2.0.5] - 2024-07-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3

## [2.0.4] - 2024-07-15

### Changed

- Upgraded Browsertrix Crawler to 1.2.4 (fixes automatic retrieval of assets present in a data-xxx tag, #316)

## [2.0.3] - 2024-06-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.0 (fixes YouTube videos issue #323)

## [2.0.2] - 2024-06-18

### Changed

- Upgrade dependencies (mainly warc2zim 2.0.2)

## [2.0.1] - 2024-06-13

### Changed

- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)

### Fixed

- Crawler is not correctly checking disk size / usage (#305)

## [2.0.0] - 2024-06-04

### Added

- New `--version` flag to display Zimit version (#234)
- New `--logging` flag to adjust Browsertrix Crawler logging (#273)
- Use new `--scraper-suffix` flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New `--noMobileDevice` CLI argument
- Publish Docker image for `linux/arm64` (in addition to `linux/amd64`) (#178)

### Changed

- **Use `warc2zim` version 2**, which no longer requires a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 and upgrade dependencies
- Removed handling of redirects by zimit; they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- `--userAgent` CLI argument again overrides the `--userAgentSuffix` and `--adminEmail` values
- `--userAgent` CLI argument is no longer mandatory

### Fixed

- Fix support for YouTube videos (#291)
- Fix crawler `--waitUntil` values (#289)

## [1.6.3] - 2024-01-18

### Changed

- Adapt to new `warc2zim` code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5

### Added

- New `--build` parameter (optional) to specify the directory holding Browsertrix files; if not set, the `--output`
  directory is used. Zimit creates one subdirectory of this folder per invocation to isolate datasets; the subdirectory
  is kept only if `--keep` is set.

### Fixed

- `--collection` parameter was not working (#252)

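The `--build` / `--keep` behavior described in 1.6.3 can be sketched roughly as follows. This is a minimal illustration, not Zimit's actual code: the function name `run_in_build_dir`, the `.tmp` prefix, the `crawl.marker` placeholder file, and the use of `tempfile.mkdtemp` are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def run_in_build_dir(output: Path, build: Optional[Path] = None, keep: bool = False) -> Path:
    """Create a per-invocation working directory; remove it unless keep is set."""
    # Fall back to the output directory when no build directory is given.
    parent = build if build is not None else output
    parent.mkdir(parents=True, exist_ok=True)
    # mkdtemp yields a unique subdirectory per call (the prefix is illustrative).
    workdir = Path(tempfile.mkdtemp(dir=parent, prefix=".tmp"))
    try:
        # Placeholder for the crawl: a real run would write Browsertrix files here.
        (workdir / "crawl.marker").write_text("done")
    finally:
        if not keep:
            shutil.rmtree(workdir, ignore_errors=True)
    return workdir
```

Creating a fresh subdirectory on every call is what keeps datasets from separate invocations isolated, even when they share the same parent folder.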
## [1.6.2] - 2023-11-17

### Changed

- Using browsertrix-crawler 0.12.3

### Fixed

- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)

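The value '0' fix in 1.6.2 concerns a common class of bug: a plain truthiness test silently drops `0` when forwarding options. The sketch below illustrates the idea only. `build_crawler_args` is a hypothetical helper, not Zimit's actual function, and the option names are borrowed from elsewhere in this changelog purely as examples.

```python
def build_crawler_args(options: dict) -> list[str]:
    """Turn parsed options into a flat argv list for the crawler."""
    args: list[str] = []
    for name, value in options.items():
        # Skip only options that were not provided or are disabled flags.
        # A plain truthiness test (`if not value`) would also drop 0.
        if value is None or value is False:
            continue
        args.append(f"--{name}")
        # Boolean flags carry no value; everything else is stringified.
        if value is not True:
            args.append(str(value))
    return args
```

With this approach, `{"depth": 0}` is forwarded as `--depth 0` instead of being omitted.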
## [1.6.1] - 2023-11-06

### Changed

- Using browsertrix-crawler 0.12.1

## [1.6.0] - 2023-11-02

### Changed

- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0

## [1.5.3] - 2023-10-02

### Changed

- Using browsertrix-crawler 0.11.2

## [1.5.2] - 2023-09-19

### Changed

- Using browsertrix-crawler 0.11.1

## [1.5.1] - 2023-09-18

### Changed

- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4

## [1.5.0] - 2023-08-23

### Added

- `--long-description` param

## [1.4.1] - 2023-08-23

### Changed

- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3

## [1.4.0] - 2023-08-02

### Added

- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)

### Changed

- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil`, following the crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)

### Removed

- `--newContext`, following the crawler's update

## [1.3.1] - 2023-02-06

### Changed

- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2

## [1.3.0] - 2023-02-02

### Added

- Initial URL check normalizes homepage redirects to standard ports, 80/443 (#137)

### Changed

- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout from 10s to 12s to connect and 27s to read

## [1.2.0] - 2022-06-21

### Added

- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter

### Changed

- Using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- Default WARC location after crawl changed
  from `collections/capture-*/archive/` to `collections/crawl-*/archive/`

### Removed

- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)

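For tooling that has to locate crawl output across the 1.2.0 layout change, a lookup along these lines may help. It is a sketch under stated assumptions: `find_warc_dirs` is a hypothetical helper, and the only facts taken from this changelog are the two directory patterns.

```python
from pathlib import Path


def find_warc_dirs(temp_root: Path) -> list[Path]:
    """Return archive directories produced by a crawl, new layout first."""
    found: list[Path] = []
    # `crawl-*` is the layout from 1.2.0 onward; `capture-*` is the earlier one.
    for pattern in ("collections/crawl-*/archive", "collections/capture-*/archive"):
        found.extend(sorted(p for p in temp_root.glob(pattern) if p.is_dir()))
    return found
```

Matching on the wildcard rather than a fixed name matters because the `capture-*` and `crawl-*` components are generated per crawl.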
## [1.1.5]

- using crawler 0.3.2 and warc2zim 1.3.6

## [1.1.4]

- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- WARC files to convert to ZIM are now written to `{temp_root_dir}/collections/capture-*/archive/` where
  `capture-*` is dynamic and includes the datetime (change from browsertrix-crawler)

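Handling a comma-separated waitUntil value such as the `load,networkidle0` default from 1.1.4 could look like the sketch below. The accepted set is an assumption: it lists Puppeteer's lifecycle events, which may not match what a given crawler version accepts (1.4.0 notes that the accepted values changed). `parse_wait_until` is a hypothetical helper, not part of Zimit.

```python
# Assumed accepted values (Puppeteer lifecycle events); later crawler versions differ.
ACCEPTED = ("load", "domcontentloaded", "networkidle0", "networkidle2")
DEFAULT_WAIT_UNTIL = "load,networkidle0"


def parse_wait_until(raw: str = DEFAULT_WAIT_UNTIL) -> list[str]:
    """Split a comma-separated waitUntil value and reject unknown entries."""
    values = [v.strip() for v in raw.split(",") if v.strip()]
    unknown = [v for v in values if v not in ACCEPTED]
    if not values or unknown:
        raise ValueError(f"invalid waitUntil value(s): {unknown or raw!r}")
    return values
```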
## [1.1.3]

- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not

## [1.1.2]

- added support for `--custom-css`
- added domains block list (default)

## [1.1.1]

- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets

## [1.0]

- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3