add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)

but before running link extraction, text extraction, screenshots and behaviors. Useful for sites that load quickly but perform async loading / init afterwards, fixes #519 A simple workaround for when it's tricky to detect when a page has actually fully loaded. Useful for sites such as Instagram.
2025-10-19 06:23:16 +00:00 · 2024-03-28 17:17:29 -07:00 · 2024-03-28 17:17:29 -07:00 · 2059f2b6ae
commit 2059f2b6ae
parent ea098b6daf
3 changed files with 21 additions and 0 deletions
--- a/docs/docs/user-guide/common-options.md
+++ b/docs/docs/user-guide/common-options.md
@ -10,6 +10,13 @@ See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remar

 The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.

+### Additional Wait
+
+Occasionally, a page may seem to have loaded, but performs dynamic initialization / additional loading. This is can be hard to detect, and the `--postLoadDelay` flag
+can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions, such as link extraction, screenshotting and text extraction (see below).
+
+(On the other hand, the `--pageExtraDelay`/`--delay` adds an extra after all post-load actions have taken place, and can be useful for rate-limiting.)
+
 ## Ad Blocking

 Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) be disabled or customized using [Browser Profiles](browser-profiles.md).
--- a/src/crawler.ts
+++ b/src/crawler.ts
@ -1802,6 +1802,13 @@ self.__bx_behaviors.selectMainBehavior();

    await this.netIdle(page, logDetails);

+    if (this.params.postLoadDelay) {
+      logger.info("Awaiting post load delay", {
+        seconds: this.params.pagePostLoadDelay,
+      });
+      await sleep(this.params.pagePostLoadDelay);
+    }
+
    // skip extraction if at max depth
    if (seed.isAtMaxDepth(depth) || !selectorOptsList) {
      logger.debug("Skipping Link Extraction, At Max Depth");
--- a/src/util/argParser.ts
+++ b/src/util/argParser.ts
@ -317,6 +317,13 @@ class ArgParser {
        type: "number",
      },

+      postLoadDelay: {
+        describe:
+          "If >0, amount of time to sleep (in seconds) after page has loaded, before taking screenshots / getting text / running behaviors",
+        default: 0,
+        type: "number",
+      },
+
      pageExtraDelay: {
        alias: "delay",
        describe: