browsertrix-crawler/Dockerfile

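# Base browser image and version (overridable at build time)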
ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base
ARG BROWSER_VERSION=brave-1.46.144
FROM ${BROWSER_IMAGE_BASE}:${BROWSER_VERSION}
# TODO: Move this into base image
RUN apt-get update && apt-get install -y jq
# needed to add args to main build stage
ARG BROWSER_VERSION
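
# Runtime defaults: pywb proxy settings, Xvfb display geometry, and browser paths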
ENV PROXY_HOST=localhost \
PROXY_PORT=8080 \
PROXY_CA_URL=http://wsgiprox/download/pem \
PROXY_CA_FILE=/tmp/proxy-ca.pem \
DISPLAY=:99 \
GEOMETRY=1360x1020x16 \
BROWSER_VERSION=${BROWSER_VERSION} \
BROWSER_BIN=/usr/bin/chromium-browser \
OPENSSL_CONF=/app/openssl.conf
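
# Install Python dependencies (uwsgi pinned; remaining packages from requirements.txt)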
WORKDIR /app
ADD requirements.txt /app/
RUN pip install 'uwsgi==2.0.20'
RUN pip install -U setuptools; pip install -r requirements.txt
ADD package.json /app/
# to allow forcing rebuilds from this stage
ARG REBUILD
# Download and format ad host blocklist as JSON
RUN mkdir -p /tmp/ads && cd /tmp/ads && \
curl -vs -o ad-hosts.txt https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts && \
grep '^0.0.0.0 ' ad-hosts.txt | awk '{ print $2 }' | grep -v '0.0.0.0' | jq --raw-input --slurp 'split("\n")' > /app/ad-hosts.json && \
rm /tmp/ads/ad-hosts.txt
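
# Install Node dependencies and copy in the crawler sources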
RUN yarn install
ADD *.js /app/
ADD util/*.js /app/util/
ADD config/ /app/
ADD html/ /app/html/
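# Pre-built Brave profiles: a default profile and one with ad blocking disabled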
COPY brave-default-profile.tar.gz /app/
COPY brave-ad-block-disabled-profile.tar.gz /app/
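# Expose the crawler entry points on PATH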
RUN ln -s /app/main.js /usr/bin/crawl; ln -s /app/create-login-profile.js /usr/bin/create-login-profile
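
# Crawl output is written under /crawls; mount a volume here to persist results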
WORKDIR /crawls
ADD docker-entrypoint.sh /docker-entrypoint.sh
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["crawl"]