Commit graph

3 commits

Author SHA1 Message Date
Ilya Kreymer
802a416c7e
Additional direct fetch improvements (#678)
- use the existing headersTimeout in undici to limit the wait for response
headers to 30 seconds; reject the direct fetch if the timeout is reached
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
2024-09-05 13:28:49 -07:00
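
The two-stage timeout described in 802a416c7e (a short limit for response headers, then the full page timeout for the payload) might look roughly like the sketch below. It is an illustration only, not the crawler's code: the change itself relies on undici's headersTimeout, whereas this sketch substitutes Python's asyncio.wait_for. The names direct_fetch, get_headers, get_body and DirectFetchRejected are hypothetical.

```python
import asyncio

HEADERS_TIMEOUT = 30.0  # seconds, the limit named in the commit message


class DirectFetchRejected(Exception):
    """Hypothetical error: headers did not arrive within the limit."""


async def direct_fetch(get_headers, get_body, page_timeout,
                       headers_timeout=HEADERS_TIMEOUT):
    # get_headers / get_body are hypothetical async callables standing in
    # for the two phases of an HTTP request.
    try:
        # Stage 1: headers must arrive within the short limit.
        headers = await asyncio.wait_for(get_headers(), timeout=headers_timeout)
    except asyncio.TimeoutError:
        # Reject the direct fetch so the caller can fall back to another path.
        raise DirectFetchRejected("headers timeout reached") from None
    # Stage 2: the payload is allowed the full page timeout.
    body = await asyncio.wait_for(get_body(), timeout=page_timeout)
    return headers, body
```

A caller would catch DirectFetchRejected (and a timeout from the second stage) and treat the direct fetch as failed, which is the kind of error handling the commit message lists.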
Ilya Kreymer
53d437570e
dependency: update RWP to 2.0.1 (#610)
for QA, use ReplayWeb.page 2.0.1 by default
2024-06-13 18:43:58 -07:00
Ilya Kreymer
8f8326eaf5
Fix synching extraSeeds state with multiple crawler instances (#605)
Fixes #604 

Ensures that extra seeds are propagated to all crawler instances.
Adds a new Redis hash key that stores the url -> extraSeeds index mapping,
so that extra seeds are added in the same order on other instances, even if
they are encountered in a different order.
Adds a new Redis Lua primitive, 'addnewseed', which combines several
operations: check whether the extra seed already exists and, if so, return
its existing index; add the new seed to the extraSeed list; and also add it
to the regular URL seed list.
2024-06-13 17:18:06 -07:00
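
The logic that 8f8326eaf5 attributes to 'addnewseed' can be modelled in memory as in the sketch below. This is not the Lua script from the commit: a dict and two lists stand in for the Redis hash and lists, the class and method names are hypothetical, and starting indices at 0 is an assumption. In Redis, a Lua script runs atomically, which is what lets several crawler instances agree on one index per URL; this single-process model does not reproduce that guarantee.

```python
class SeedState:
    """In-memory stand-in for the Redis keys described in the commit."""

    def __init__(self):
        self.extra_seed_index = {}  # url -> index (models the Redis hash)
        self.extra_seeds = []       # models the extraSeed list
        self.seeds = []             # models the regular URL seed list

    def add_new_seed(self, url):
        # 1. If the extra seed is already known, return its existing index.
        existing = self.extra_seed_index.get(url)
        if existing is not None:
            return existing
        # 2. Otherwise append it to the extraSeed list and record its index
        #    (0-based here, which is an assumption of this sketch).
        index = len(self.extra_seeds)
        self.extra_seeds.append(url)
        self.extra_seed_index[url] = index
        # 3. Also add it to the regular URL seed list.
        self.seeds.append(url)
        return index
```

Because the index is assigned once and looked up afterwards, a second caller that encounters the same URL later receives the same index instead of creating a duplicate entry.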