add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)

but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
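For example, a hypothetical invocation sketch (the image name, mount path, and URL are illustrative placeholders, not taken from this commit):

```shell
# Wait an extra 20 seconds after page load before link extraction,
# text extraction, screenshots, and behaviors run.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --postLoadDelay 20
```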
Ilya Kreymer 2024-03-28 17:17:29 -07:00 committed by GitHub
parent ea098b6daf
commit 2059f2b6ae
3 changed files with 21 additions and 0 deletions


@@ -10,6 +10,13 @@ See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remar
The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.
### Additional Wait
Occasionally, a page may seem to have loaded, but then performs dynamic initialization or additional loading. This can be hard to detect, and the `--postLoadDelay` flag
can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions, such as link extraction, screenshotting, and text extraction (see below).
(On the other hand, the `--pageExtraDelay`/`--delay` option adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
## Ad Blocking
Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).


@@ -1802,6 +1802,13 @@ self.__bx_behaviors.selectMainBehavior();
await this.netIdle(page, logDetails);
if (this.params.postLoadDelay) {
logger.info("Awaiting post load delay", {
seconds: this.params.postLoadDelay,
});
await sleep(this.params.postLoadDelay);
}
// skip extraction if at max depth
if (seed.isAtMaxDepth(depth) || !selectorOptsList) {
logger.debug("Skipping Link Extraction, At Max Depth");

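The hunk above awaits a `sleep()` helper with a seconds argument. A minimal sketch of such a promise-based helper, assumed here for illustration (the crawler's actual utility may differ):

```typescript
// Resolve after the given number of seconds, as assumed by the
// postLoadDelay hunk above (illustrative sketch, not the repo's code).
function sleep(seconds: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000));
}
```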

@@ -317,6 +317,13 @@ class ArgParser {
type: "number",
},
postLoadDelay: {
describe:
"If >0, amount of time to sleep (in seconds) after page has loaded, before taking screenshots / getting text / running behaviors",
default: 0,
type: "number",
},
pageExtraDelay: {
alias: "delay",
describe: