mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 06:23:16 +00:00
Add an extra --postLoadDelay param to specify how many seconds to wait after page load (#520)
The wait happens after page load but before running link extraction, text extraction, screenshots and behaviors. Useful for sites that load quickly but perform async loading / init afterwards. Fixes #519.
A simple workaround for when it's tricky to detect when a page has actually fully loaded. Useful for sites such as Instagram.
parent
ea098b6daf
commit
2059f2b6ae
3 changed files with 21 additions and 0 deletions
@@ -10,6 +10,13 @@ See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remar

The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.

### Additional Wait

Occasionally, a page may seem to have loaded but still be performing dynamic initialization or additional loading. This can be hard to detect, and the `--postLoadDelay` flag
can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions such as link extraction, screenshotting and text extraction (see below).

(On the other hand, the `--pageExtraDelay`/`--delay` option adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)

## Ad Blocking

Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).
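The ordering described under "Additional Wait" can be sketched as a small function. This is only an illustration and not the crawler's code: `DelayOptions` and `pageSteps` are hypothetical names, and the step labels are informal descriptions of the phases named in the docs text.

```typescript
// Hypothetical option bag holding only the two delay flags discussed above.
interface DelayOptions {
  postLoadDelay: number; // seconds to wait right after the page appears loaded
  pageExtraDelay: number; // seconds to wait after all post-load actions
}

// Returns the order of phases for a single page, following the docs text.
// Delays of zero are skipped entirely.
function pageSteps(opts: DelayOptions): string[] {
  const steps: string[] = ["page load"];
  if (opts.postLoadDelay > 0) {
    steps.push(`post-load delay (${opts.postLoadDelay}s)`);
  }
  steps.push("post-load actions");
  if (opts.pageExtraDelay > 0) {
    steps.push(`extra page delay (${opts.pageExtraDelay}s)`);
  }
  return steps;
}
```

With both flags set, the post-load delay comes before link extraction and the other post-load actions, while the extra page delay comes after them, which is why the latter is the one suited to rate-limiting.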
@@ -1802,6 +1802,13 @@ self.__bx_behaviors.selectMainBehavior();
    await this.netIdle(page, logDetails);

    if (this.params.postLoadDelay) {
      logger.info("Awaiting post load delay", {
        seconds: this.params.postLoadDelay,
      });
      await sleep(this.params.postLoadDelay);
    }

    // skip extraction if at max depth
    if (seed.isAtMaxDepth(depth) || !selectorOptsList) {
      logger.debug("Skipping Link Extraction, At Max Depth");
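The guarded wait in this hunk can be tried in isolation with the sketch below. It makes a few assumptions: the `sleep` helper defined here takes seconds (matching the flag's documented unit), which the diff does not show for the crawler's own `sleep`; `Params` and `awaitPostLoadDelay` are hypothetical names introduced for illustration.

```typescript
// Local stand-in for a seconds-based sleep helper (assumption: the unit is
// seconds, as the flag is documented; the crawler's own helper is not shown).
function sleep(seconds: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000));
}

// Hypothetical stand-in for the parsed CLI params.
interface Params {
  postLoadDelay: number;
}

// Waits only when a positive delay is configured and reports the seconds
// waited. The same property name is used for the check and for the sleep.
async function awaitPostLoadDelay(params: Params): Promise<number> {
  if (params.postLoadDelay > 0) {
    await sleep(params.postLoadDelay);
    return params.postLoadDelay;
  }
  return 0;
}
```

Reading one property for the guard, the log entry and the sleep keeps the three in sync; a mismatched name would evaluate to `undefined` and the wait would be skipped without any error.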
@@ -317,6 +317,13 @@ class ArgParser {
        type: "number",
      },

      postLoadDelay: {
        describe:
          "If >0, amount of time to sleep (in seconds) after page has loaded, before taking screenshots / getting text / running behaviors",
        default: 0,
        type: "number",
      },

      pageExtraDelay: {
        alias: "delay",
        describe: