Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Commit graph

Author	SHA1	Message	Date
emma	5a6bef890f	revert adding prettier	2023-11-08 16:40:49 -05:00
emma	87a6002473	add prettier and format everything	2023-11-08 14:37:57 -05:00
Ilya Kreymer	2aeda56d40	improved text extraction: (addresses #403) (#404) - use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to get the snapshot (consistent with ArchiveWeb.page) - should be slightly more performant - keep option to use DOM.getDocument - refactor warc resource writing to separate class, used by text extraction and screenshots - write extracted text to WARC files as 'urn:text:<url>' after page loads, similar to screenshots - also store final text to WARC as 'urn:textFinal:<url>' if it is different - cli options: update `--text` to take one or more comma-separated string options, e.g. `--text to-warc,to-pages,final-to-warc`. For backwards compatibility, support `--text` and `--text true` to be equivalent to `--text to-pages`. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-31 23:05:30 -07:00
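
The `--text` behaviour described in commit 2aeda56d40 can be illustrated with a small sketch. This is not the crawler's actual code: `parseTextOption` and `textRecordUrns` are hypothetical helper names, and treating a bare `--text` as the boolean `true` (or an empty string) is an assumption about how the CLI parser hands the value over. Only the three option names, the `to-pages` fallback, and the two `urn:` prefixes come from the commit message.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the crawler's source.
type TextOption = "to-warc" | "to-pages" | "final-to-warc";

const VALID_TEXT_OPTIONS: readonly TextOption[] = [
  "to-warc",
  "to-pages",
  "final-to-warc",
];

// Normalize the raw `--text` value into a list of options.
// Assumption: an absent flag arrives as undefined/false, a bare `--text` as true or "".
function parseTextOption(raw: string | boolean | undefined): TextOption[] {
  if (raw === undefined || raw === false) {
    return [];
  }
  // Backwards compatibility: `--text` and `--text true` mean `--text to-pages`.
  if (raw === true || raw === "" || raw === "true") {
    return ["to-pages"];
  }
  const result: TextOption[] = [];
  for (const part of raw.split(",")) {
    const name = part.trim() as TextOption;
    if (!VALID_TEXT_OPTIONS.includes(name)) {
      throw new Error(`Unknown --text option: ${name}`);
    }
    if (!result.includes(name)) {
      result.push(name); // drop repeated options, keep first-seen order
    }
  }
  return result;
}

// Record URIs for one page: the final-text record is listed only when the
// final text differs from the text captured after page load.
function textRecordUrns(url: string, text: string, finalText: string): string[] {
  const urns = [`urn:text:${url}`];
  if (finalText !== text) {
    urns.push(`urn:textFinal:${url}`);
  }
  return urns;
}
```

Under these assumptions, `parseTextOption("to-warc,final-to-warc")` yields both WARC-related options, while `parseTextOption(true)` falls back to the legacy `to-pages` behaviour.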

3 commits