Commit graph

30 commits

Author SHA1 Message Date
benoit74
d8e6d55f87
Release 2.0.0 2024-06-03 19:59:04 +00:00
benoit74
c0ffb74d8c
Adopt Python bootstrap conventions 2024-01-18 13:31:00 +01:00
benoit74
343fb7e770
Replace warning about service workers by a nota bene about their removal since 2.x 2024-01-18 13:28:11 +01:00
benoit74
60b970f844
Enhance README by removing Chrome and headless reference 2023-11-16 13:14:11 +01:00
yuki
b568848a98
minor spelling mistake
i win
2023-07-13 12:49:34 +00:00
renaud gaudin
b8714d1260 removed references to docker.io 2023-03-22 13:55:07 +00:00
renaud gaudin
af9a3d24d9 removed obsolete ref to cap-add in README 2023-02-02 16:30:15 +00:00
Kelson
859e79c165
"main" is the new default branch 2022-12-21 11:06:50 +01:00
Emmanuel Engelhart
0025901959
Replace Docker Hub build badge with CI badge 2022-06-11 11:56:18 +02:00
Emmanuel Engelhart
3d3f4fb121
Add release tag 2022-06-11 11:52:48 +02:00
JensKorte
1f31d6c1a5
Update README.md
relative link didn't work and replaced by https://github.com/openzim/warc2zim
2022-05-30 21:45:18 +02:00
renaud gaudin
98587045b4 Updated readme: warc2zim params can be passed 2022-05-03 10:31:34 +00:00
lakesidethinks
6da4714cff Update README.md 2021-01-25 12:31:09 -06:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.

- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
rgaudin
f6d44314cd
Fixed #58: updated README with limitations 2020-12-12 13:58:32 +00:00
renaud gaudin
568068ecfc WARC2zim version update
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so container can run on default params even if /output
is not a mounted volume
2020-11-10 08:26:56 +00:00
Ilya Kreymer
989567e05e README: fix typos in example command 2020-11-10 06:10:12 +00:00
Ilya Kreymer
c228c8300c split zimit from core browsertrix-crawler, which has been moved to https://github.com/webrecorder/browsertrix-crawler
use versioned browsertrix-crawler:0.1.0 image
part of #45
2020-11-03 17:21:54 +00:00
Ilya Kreymer
c26fe5d4cd replace run.sh with python runner zimit.py, as suggested in #28
should fix arg parsing issues in #28,#18
warc2zim now called directly from zimit.py, both for arg check and for actual zim creation
crawler renamed to crawler.js, no longer handles zim creation, only crawling
add signal handling to both zimit and crawler.js for smooth shutdown, should fix #25
pywb: update to latest dev version with dedup support, add redis for deduplication
2020-10-16 18:54:04 +00:00
renaud gaudin
901729a069 added link to /dev/shm info on readme 2020-10-07 13:56:21 +00:00
Ilya Kreymer
94f0b7362d Merge README changes from master 2020-10-06 15:53:22 +00:00
Ilya Kreymer
e4128c8183 add help text/validation for all config options, url now must be passed in with --url
add --scroll boolean option, which activates simple autoscroll behavior
use chrome user-agent for manual fetch
reenable pywb option
cleanup Dockerfile: update to warc2zim 1.0.1, install fonts-stix for math science sites
update README
2020-09-29 05:22:33 +00:00
Kelson
bb5b7e48c1
Additional README.md changes (#16) 2020-09-25 12:02:43 +02:00
Kelson
ac650bff05
Update README.md 2020-09-25 11:36:30 +02:00
Ilya Kreymer
f25b390f15 add regex exclusions 2020-09-22 17:48:09 +00:00
Ilya Kreymer
b00c4262a7 add --limit param for max URLs to be captured
add 'html check', only load HTML in browsers, load other content-types directly via pywb, esp for PDFs (work on #8)
improved error handling
2020-09-21 07:16:26 +00:00
Ilya Kreymer
9b23de828b
Update README.md 2020-09-19 15:53:23 -07:00
Ilya Kreymer
4e04645e6b move warc2zim to be launched by node process 2020-09-19 22:47:19 +00:00
Ilya Kreymer
1de577bd78 use puppeteer-cluster for parallel crawling
use yargs to parse command-line args
2020-09-19 22:19:20 +00:00
renaud gaudin
15cf636ff3 reset master branch for 2020 codebase 2020-08-19 09:36:48 +02:00