0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules are set
* page load: set waitUntil to networkidle2 instead of networkidle0, a more reasonable default for most pages
* add --behaviorTimeout to set the max running time for behaviors (defaults to 90 seconds; see the example invocation after these notes)
* refactor profile loadProfile/saveProfile into util/browser.js
  - support augmenting an existing profile when creating a new profile
* screencasting: convert newContext to window instead of page by default, rather than just warning about it
* shared multiplatform image support:
  - determine the browser executable from a list of options; getBrowserExe() returns the current executable
  - supports running with 'google-chrome' under amd64 and 'chromium-browser' under arm64
  - update to the multiplatform oldwebtoday/chrome:91 as the browser image
  - enable multiplatform builds with the latest build-push-action@v2 (a local buildx equivalent is sketched after the Dockerfile below)
* seeds: apply trim() to seed URLs
* logging: reduce initial debug logging, enabling it only if '--logging debug' is set; automatically log whether profile and text extraction are enabled, and log post-processing stages
* profile creation: add --windowSize flag (default 1600x900), load the Application tab by default, tweak UI styles
* extractLinks: support passing in a custom property to get the link, and loading it as an attribute via getAttribute. Fixes #25
* update CHANGES and README with the new features
* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
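For reference, a minimal crawl invocation exercising the new flags; the flag names are as given in the notes above, while the target URL is a placeholder:

    # illustrative invocation (the URL is a placeholder)
    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
        --url https://example.com/ \
        --behaviorTimeout 90 \
        --logging debug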
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
  - support interrupting and saving a crawl
  - support loading crawl state (frontier queue, pending, done) from YAML
  - support a scope check when loading, to apply new scoping rules when restarting a crawl
  - failed URLs are added to done as failed, and can be retried if the crawl is stopped and restarted
  - save state to crawls/crawl-<ts>-<id>.yaml when interrupted
  - the --saveState option controls when crawl state is saved: defaults to partial (save only when interrupted); always and never are also supported (see the example invocation after these notes)
  - support in-memory or redis-based crawl state, using a fork of puppeteer-cluster
  - --redisStore enables the redis-based state
* signals/crawl interruption:
  - crawl state is set to drain, providing no more URLs to crawl
  - graceful stop of the crawl in response to SIGINT/SIGTERM
  - the first SIGINT/SIGTERM waits for a graceful end of the current pages; a second one terminates immediately
  - an initial SIGABRT followed by SIGTERM terminates immediately
  - disable puppeteer's handleSIGTERM, handleSIGHUP, and handleSIGINT
* redis state support:
  - use Lua scripts for atomic moves from queue -> pending and pending -> done
  - pending key expiry is set to the page timeout
  - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
  - drainMax returns numPending() + numSeen() to work with cluster stats
* arg improvements:
  - add --crawlId param, also settable via the CRAWL_ID env var, defaulting to os.hostname() (used for the redis key and the crawl state file)
  - support setting command-line args via the CRAWL_ARGS env var
  - use 'choices' in args where possible
* build update:
  - switch the base browser image to the new webrecorder/browsertrix-browser-base, a simple image containing only the .deb files, for amd64 and arm64 builds
  - use setuptools<58.0
* misc crawl/scoping rule and state fixes:
  - scoping: fix rules when external is used with scopeType
  - limit: ensure no URLs, including initial seeds, are added past the limit
  - signals: fix immediate shutdown on a second signal
  - tests: add a scope test for the default scope + excludes
* py-wacz update:
  - add 'seed': true to pages that are seeds, for optimized WACZ creation, keeping non-seeds separate (supported as of wacz 0.3.2)
  - pywb: use the latest pywb branch for improved Twitter video capture
* update to latest browsertrix-behaviors
* fix setuptools dependency (#88)
* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
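A sketch of the new state options in use; the URL and crawl id are placeholders, while the flag and env var names are as given in the notes above:

    # give the crawl an explicit id and always save state, so an
    # interrupted crawl can be restarted from its saved YAML state
    docker run -e CRAWL_ID=my-crawl -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
        crawl --url https://example.com/ --saveState always

    # the same args can instead be passed via the CRAWL_ARGS env var
    docker run -e CRAWL_ID=my-crawl -e CRAWL_ARGS="--url https://example.com/ --saveState always" \
        -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl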

ARG BROWSER_VERSION=91
ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base
ARG BROWSER_BIN=google-chrome

FROM ${BROWSER_IMAGE_BASE}:${BROWSER_VERSION} AS browser

FROM ubuntu:bionic

RUN apt-get update -y && apt-get install --no-install-recommends -qqy software-properties-common \
    && add-apt-repository -y ppa:deadsnakes \
    && apt-get update -y \
    && apt-get install --no-install-recommends -qqy build-essential fonts-stix locales-all redis-server xvfb gpg-agent curl git socat \
       python3.8 python3.8-distutils python3.8-dev gpg ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - \
    && echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list \
    && curl -sL https://deb.nodesource.com/setup_16.x -o /tmp/nodesource_setup.sh && bash /tmp/nodesource_setup.sh \
    && apt-get update -y && apt-get install -qqy nodejs yarn \
    && curl https://bootstrap.pypa.io/get-pip.py | python3.8 \
    && pip install 'setuptools<58.0'

# needed to add args to main build stage
ARG BROWSER_VERSION
ARG BROWSER_BIN

ENV PROXY_HOST=localhost \
    PROXY_PORT=8080 \
    PROXY_CA_URL=http://wsgiprox/download/pem \
    PROXY_CA_FILE=/tmp/proxy-ca.pem \
    DISPLAY=:99 \
    GEOMETRY=1360x1020x16 \
    BROWSER_VERSION=${BROWSER_VERSION} \
    BROWSER_BIN=${BROWSER_BIN}

COPY --from=browser /deb/*.deb /deb/
RUN dpkg -i /deb/*.deb; apt-get update; apt-mark hold chromium-browser; apt --fix-broken install -qqy; \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

ADD requirements.txt /app/
RUN pip install -r requirements.txt

ADD package.json /app/

# to allow forcing rebuilds from this stage
ARG REBUILD

RUN yarn install

ADD uwsgi.ini /app/
ADD *.js /app/
ADD util/*.js /app/util/
COPY config.yaml /app/
ADD screencast/ /app/screencast/

RUN ln -s /app/main.js /usr/bin/crawl
RUN ln -s /app/create-login-profile.js /usr/bin/create-login-profile

WORKDIR /crawls

CMD ["crawl"]
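The 0.4.1 notes above mention enabling multiplatform builds via build-push-action@v2; a local equivalent using docker buildx might look like the following sketch, where the image tag and platform list are illustrative:

    # illustrative multi-arch build of this Dockerfile
    docker buildx build --platform linux/amd64,linux/arm64 \
        --build-arg BROWSER_VERSION=91 \
        -t webrecorder/browsertrix-crawler:latest --push .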