benoit74
d8e6d55f87
Release 2.0.0
2024-06-03 19:59:04 +00:00
benoit74
c0ffb74d8c
Adopt Python bootstrap conventions
2024-01-18 13:31:00 +01:00
benoit74
343fb7e770
Replace warning about service workers with a nota bene about their removal since 2.x
2024-01-18 13:28:11 +01:00
benoit74
60b970f844
Enhance README by removing Chrome and headless reference
2023-11-16 13:14:11 +01:00
yuki
b568848a98
minor spelling mistake
...
i win
2023-07-13 12:49:34 +00:00
renaud gaudin
b8714d1260
removed references to docker.io
2023-03-22 13:55:07 +00:00
renaud gaudin
af9a3d24d9
removed obsolete ref to cap-add in README
2023-02-02 16:30:15 +00:00
Kelson
859e79c165
"main" is the new default branch
2022-12-21 11:06:50 +01:00
Emmanuel Engelhart
0025901959
Replace Docker Hub build badge with CI badge
2022-06-11 11:56:18 +02:00
Emmanuel Engelhart
3d3f4fb121
Add release tag
2022-06-11 11:52:48 +02:00
JensKorte
1f31d6c1a5
Update README.md
...
relative link didn't work and was replaced by https://github.com/openzim/warc2zim
2022-05-30 21:45:18 +02:00
renaud gaudin
98587045b4
Updated readme: warc2zim params can be passed
2022-05-03 10:31:34 +00:00
lakesidethinks
6da4714cff
Update README.md
2021-01-25 12:31:09 -06:00
rgaudin
e91cd7921e
Added domains blocklist (#77)
...
All domains from the 3 [anudeepND](https://github.com/anudeepND/blacklist) lists
are now blocked at local resolver level by updating /etc/hosts in entrypoint.
- this saves network and CPU resources by failing early.
- this is wanted in almost all cases
- can be bypassed by setting a blank entrypoint
2021-01-12 07:31:16 +01:00
rgaudin
f6d44314cd
Fixed #58: updated README with limitations
2020-12-12 13:58:32 +00:00
renaud gaudin
568068ecfc
WARC2zim version update
...
- updated to latest warc2zim release
- fixed param name typo in README
- added creation of `/output` so the container can run with default params even if `/output`
is not a mounted volume
2020-11-10 08:26:56 +00:00
Ilya Kreymer
989567e05e
README: fix typos in example command
2020-11-10 06:10:12 +00:00
Ilya Kreymer
c228c8300c
split zimit from core browsertrix-crawler, which has been moved to https://github.com/webrecorder/browsertrix-crawler
...
use versioned browsertrix-crawler:0.1.0 image
part of #45
2020-11-03 17:21:54 +00:00
Ilya Kreymer
c26fe5d4cd
replace run.sh with python runner zimit.py, as suggested in #28
...
should fix arg parsing issues in #28, #18
warc2zim now called directly from zimit.py, both for arg check and for actual zim creation
crawler renamed to crawler.js, no longer handles zim creation, only crawling
add signal handling to both zimit and crawler.js for smooth shutdown, should fix #25
pywb: update to latest dev version with dedup support, add redis for deduplication
2020-10-16 18:54:04 +00:00
renaud gaudin
901729a069
added link to /dev/shm info on readme
2020-10-07 13:56:21 +00:00
Ilya Kreymer
94f0b7362d
Merge README changes from master
2020-10-06 15:53:22 +00:00
Ilya Kreymer
e4128c8183
add help text/validation for all config options, url now must be passed in with --url
...
add --scroll boolean option, which activates simple autoscroll behavior
use chrome user-agent for manual fetch
reenable pywb option
cleanup Dockerfile: update to warc2zim 1.0.1, install fonts-stix for math science sites
update README
2020-09-29 05:22:33 +00:00
Kelson
bb5b7e48c1
Additional README.md changes (#16)
2020-09-25 12:02:43 +02:00
Kelson
ac650bff05
Update README.md
2020-09-25 11:36:30 +02:00
Ilya Kreymer
f25b390f15
add regex exclusions
2020-09-22 17:48:09 +00:00
Ilya Kreymer
b00c4262a7
add --limit param for max URLs to be captured
...
add 'html check': only load HTML in browsers, load other content-types directly via pywb, especially for PDFs (work on #8)
improved error handling
2020-09-21 07:16:26 +00:00
Ilya Kreymer
9b23de828b
Update README.md
2020-09-19 15:53:23 -07:00
Ilya Kreymer
4e04645e6b
move warc2zim to be launched by node process
2020-09-19 22:47:19 +00:00
Ilya Kreymer
1de577bd78
use puppeteer-cluster for parallel crawling
...
use yargs to parse command-line args
2020-09-19 22:19:20 +00:00
renaud gaudin
15cf636ff3
reset master branch for 2020 codebase
2020-08-19 09:36:48 +02:00