SAX-based sitemap parser (#497)

Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5)
- `fromDate` and `toDate` filter dates, to include only URLs between the given dates, with the filter also applied to nested sitemap lists
- async parsing: after the first 100 URLs, parsing continues in the background
- a 30-second timeout for the initial fetch / first 100 URLs, to avoid slowing down the crawl
- save/load state integration: sitemaps that have already been parsed are marked in redis and serialized to the saved state, to avoid reparsing them (they will be reparsed if parsing did not fully finish)
- awareness of `pageLimit`: URLs past the page limit are not added, and further parsing is interrupted once the limit is reached
- robots.txt `sitemap:` parsing, with checks on URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided: robots.txt is checked first, then /sitemap.xml (see the sketch below)
- tests: full sitemap autodetect, sitemap with limit, and sitemap from a specific URL

Fixes #496
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
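As a rough illustration of the autodetection order described above (a minimal sketch only, not the parser's actual code; `detectSitemaps` and its logic are hypothetical), a seed's sitemaps can be discovered by scanning robots.txt for `Sitemap:` entries and falling back to `/sitemap.xml`:

```js
// Illustrative sketch: sitemap autodetection for a seed URL, following the
// order described above (robots.txt first, then /sitemap.xml).
// Not the crawler's implementation; the function name is hypothetical.
async function detectSitemaps(seedUrl) {
  const origin = new URL(seedUrl).origin;
  const found = [];

  try {
    const resp = await fetch(new URL("/robots.txt", origin));
    if (resp.ok) {
      const text = await resp.text();
      for (const line of text.split("\n")) {
        // robots.txt "Sitemap:" directives are case-insensitive
        const m = line.match(/^\s*sitemap:\s*(\S+)/i);
        if (m) {
          found.push(m[1]);
        }
      }
    }
  } catch (e) {
    // no robots.txt (or fetch failed): fall back to the default location
  }

  if (!found.length) {
    found.push(new URL("/sitemap.xml", origin).href);
  }
  return found;
}
```
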
import child_process from "child_process";
import Redis from "ioredis";

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
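
// Stop the crawler container with SIGINT, then poll "docker ps -q" until it has exited.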
async function waitContainer(containerId) {
  try {
    child_process.execSync(`docker kill -s SIGINT ${containerId}`);
  } catch (e) {
    return;
  }

  // containerId is initially the full id, but docker ps
  // only prints the short id (first 12 characters)
  containerId = containerId.slice(0, 12);

  while (true) {
    try {
      const res = child_process.execSync("docker ps -q", { encoding: "utf-8" });
      if (res.indexOf(containerId) < 0) {
        return;
      }
    } catch (e) {
      console.error(e);
    }
    await sleep(500);
  }
}
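
// Run a crawl with the given URL/sitemap/limit, poll the crawler's Redis queue
// until the sitemap is marked done or numExpected URLs have been queued, then
// stop the container and check the final queue size against the expected bounds.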
async function runCrawl(numExpected, url, sitemap="", limit=0, numExpectedLessThan=0, extra="") {
  // start the crawler detached, mapping its internal Redis to host port 36381
  // so the test can poll the crawl queue (--debugAccessRedis)
  const command = `docker run -d -p 36381:6379 -e CRAWL_ID=test webrecorder/browsertrix-crawler crawl --url ${url} --sitemap ${sitemap} --limit ${limit} --context sitemap --logging debug --debugAccessRedis ${extra}`;

  const containerId = child_process.execSync(command, { encoding: "utf-8" });

  // give the container a few seconds to start up and bring up its Redis
  await sleep(3000);

  const redis = new Redis("redis://127.0.0.1:36381/0", { lazyConnect: true, retryStrategy: () => null });

  let finished = 0;

  try {
    await redis.connect({
      maxRetriesPerRequest: 100,
    });

    while (true) {
      // number of URLs queued so far
      finished = await redis.zcard("test:q");

      // stop once the crawler marks sitemap parsing as fully finished...
      if (await redis.get("test:sitemapDone")) {
        break;
      }
      // ...or once the expected number of URLs has been queued
      if (finished >= numExpected) {
        break;
      }
    }
  } catch (e) {
    console.error(e);
  } finally {
    await waitContainer(containerId);
  }

  expect(finished).toBeGreaterThanOrEqual(numExpected);

  if (numExpectedLessThan) {
    expect(finished).toBeLessThanOrEqual(numExpectedLessThan);
  }
}
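
// End-to-end tests: each one starts the crawler in Docker against a live site,
// so Docker and network access are required.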
test("test sitemap fully finish", async () => {
  await runCrawl(3500, "https://www.mozilla.org/", "", 0);
});

test("test sitemap with limit", async () => {
  await runCrawl(1900, "https://www.mozilla.org/", "", 2000);
});

test("test sitemap with limit, specific URL", async () => {
  await runCrawl(1900, "https://www.mozilla.org/", "https://www.mozilla.org/sitemap.xml", 2000);
});

test("test sitemap with application/xml content-type", async () => {
  await runCrawl(10, "https://bitarchivist.net/", "", 0);
});

test("test sitemap with narrow scope, extraHops, to ensure out-of-scope sitemap URLs do not count as extraHops", async () => {
  await runCrawl(0, "https://www.mozilla.org/", "", 2000, 100, "--extraHops 1 --scopeType page");
});