clamav

mirror of https://github.com/Cisco-Talos/clamav.git synced 2025-10-19 10:23:17 +00:00

Author	SHA1	Message	Date
Val S.	4d86271053	Fix issue recording OOXML document metadata The ZIP single record search feature is used to find specific files when parsing OOXML documents. I observed that the core properties for a PowerPoint file were missing in a test as compared with the previous release. The error handling check for the unzip search returns CL_VIRUS when there is a match, not CL_SUCCESS! CLAM-2886	2025-10-14 13:10:35 -04:00
Val S.	6e43931a1c	Fix ZIP parser issue recording empty file entries If csize (and usize) are 0, like with a directory or other empty file entry, then the functionionality to record file record information when indexing the central directory and each associated file record will neglect to store the `local_header_offset` or `local_header_size`. That causes problems later after sorting the file records and then checking for overlapping files. CLAM-2884	2025-10-14 13:10:35 -04:00
Val S.	b87b028a30	Fix ZIP parser issue using central directory The function which indexes a ZIP central directory is not advancing to the next central directory record thus exceeding the max-files scan limit for many ZIPs. CLAM-2884	2025-10-14 13:10:34 -04:00
Val S.	aadf25df6a	Fix static analysis code quality issues (#1582 ) `libclamav/libmspack.c`: Initialize variables before first `goto done;` to fix unitialized variable use in an error condition. `libclamav/others.c`: Explicitly ignore return values for calls to add JSON values when subsequent calls don't depend on them. If we were to add error handling here, the only thing we'd do is debug- log it. I don't think it's worth adding the extra lines of code. `libclamav/unarj.c`: Removed dead code. The `status` variable is immediately set afterwards based on whether or not any files may be extracted. `libclamav/unzip.c`: Removed dead code. The `ret` variable is checked immediately after being set, above. This check after the `do`-`while()` loop is dead code. `sigtool/sigtool.c`: Fix potential NULL deref in error handling. This is a fix for the same issue as was fixed in a previous commit. I somehow overlooked this one. Copy/paste bug. `libclamav/pdfdecode.c`: Fix leaked `stream` memory when `filter_lzwdecode()` fails. `clamdtop/clamdtop.c`: Fix possible NULL dereference if `strchr` returns NULL in `read_version()` and `check_stats_available()`. `libclamav/rtf.c`: Fix memory leak in `rtf_object_process()` if `cli_gentemp_with_prefix()` fails. Also change empty for-loop to resolve clang-format weirdness and make it more obvious the for-loop has no body. `libclamav/aspack.c`: Ensure that `endoff - old` is not negative in `build_decrypt_array()` before passing to `CLI_ISCONTAINED()` which expects unsigned values. `libclamav/upx.c`: Fix integer overflow checks in multiple functions. `libclamav/vba_extract.c`: Set `entries` pointer back to NULL after free in `word_read_macro_entry()` error condition. `libclamav/unzip.c`: Remove logic to return `CL_EMAXFILES` from `index_local_file_headers()`. It seems it only overwrote the status when not `CL_SUCCESS` in which case it could be overriding a more serious failure. Further, updates to the how the ZIP parser works has made it so this needs to return `CL_SUCCESS` in order for the caller to at least scan the files found so far. Finally, the calling function has checks of its own to make sure we don't exceeds the max-files limit. `libclamav/unzip.c`: Fix issue where `cli_append_potentially_unwanted()` in `index_local_file_headers()` might overwrite an error in `status` with `CL_CLEAN`. Instead, it now checks the return value and only overwrites the `CL_EFORMAT` status with a different value if not `CL_SUCCESS`. `libclamav/unzip.c`: Fix a potential leak with `combined_catalogue` and `temp_catalogue` in an error condition. We should always free them if not NULL, not just if the function failed. And to make this safe, we must set `combined_catalogue` to NULL when we give ownership to `*catalogue`. `libclamav/scanners.c`: Fix a potential leak in error handling for the `cli_ole2_tempdir_scan_vba()` function. CLAM-2768	2025-10-02 11:46:14 -04:00
Val S.	a77a271fb5	Reduce unnecessary scanning of embedded file FPs (#1571 ) When embedded file type recognition finds a possible embedded file, it is being scanned as a new embedded file even if it turns out it was a false positive and parsing fails. My solution is to pre-parse the file headers as little possible to determine if it is valid. If possible, also determine the file size based on the headers. That will make it so we don't have to scan additional data when the embedded file is not at the very end. This commit adds header checks prior to embedded ZIP, ARJ, and CAB scanning. For these types I was also able to use the header checks to determine the object size so as to prevent excessive pattern matching. TODO: Add the same for RAR, EGG, 7Z, NULSFT, AUTOIT, IShield, and PDF. This commit also removes duplicate matching for embedded MSEXE. The embedded MSEXE detection and scanning logic was accidentally creating an extra duplicate layer in between scanning and detection because of the logic within the `cli_scanembpe()` function. That function was effectively doing the header check which this commit adds for ZIP, ARJ, and CAB but minus the size check. Note: It is unfortunately not possible to get an accurage size from PE file headers. The `cli_scanembpe()` function also used to dump to a temp file for no reason since FMAPs were extended to support windows into other FMAPs. So this commit removes the intermediate layer as well as dropping a temp file for each embedded PE file. Further, this commit adds configuration and DCONF safeguards around all embedded file type scanning. Finally, this commit adds a set of tests to validate proper extraction of embedded ZIP, ARJ, CAB, and MSEXE files. CLAM-2862 Co-authored-by: TheRaynMan <draynor@sourcefire.com>	2025-09-23 15:57:28 -04:00
Val S.	bae86fd272	ZIP: Fix possible leak (#1568 ) Fix a possible memory leak in the overlapping files detecting logic. The issue is because cleanup for the zip catalogue allocated by this function only happens if the status is no CL_SUCCESS. My fix uses a better pattern to ensure we don't override a format error with a "clean" result when adding the heuristic alert. Fixes: https://issues.oss-fuzz.com/issues/376723678 CLAM-2857	2025-09-15 18:01:03 -04:00
Valerie Snyder	31dcec1e42	libclamav: Add engine option to toggle temp directory recursion Temp directory recursion in ClamAV is when each layer of a scan gets its own temp directory in the parent layer's temp directory. In addition to temp directory recursion, ClamAV has been creating a new subdirectory for each file scan as a risk-adverse method to ensure no temporary file leaks fill up the disk. Creating a directory is relatively slow on Windows in particular if scanning a lot of very small files. This commit: 1. Separates the temp directory recursion feature from the leave-temps feature so that libclamav can leave temp files without making subdirectories for each file scanned. 2. Makes it so that when temp directory recursion is off, libclamav will just use the configure temp directory for all files. The new option to enable temp directory recursion is for libclamav-only at this time. It is off by default, and you can enable it like this: ```c cl_engine_set_num(engine, CL_ENGINE_TMPDIR_RECURSION, 1); ``` For the `clamscan` and `clamd` programs, temp directory recursion will be enabled when `--leave-temps` / `LeaveTemporaryFiles` is enabled. The difference is that when disabled, it will return to using the configured temp directory without making a subdirectory for each file scanned, so as to improve scan performance for small files, mostly on Windows. Under the hood, this commit also: 1. Cleans up how we keep track of tmpdirs for each layer. The goal here is to align how we keep track of layer-specific stuff using the scan_layer structure. 2. Cleans up how we record metadata JSON for embedded files. Note: Embedded files being different from Contained files, as they are extracted not with a parser, but by finding them with file type magic signatures. CLAM-1583	2025-08-14 22:38:58 -04:00
Valerie Snyder	f7e60d566f	Record unique object-id for each layer scanned Every time we push a new map onto the scanning recursion context, give it a unique object id number, which counts from zero. Moved the location where we add metadata for each file from the "cli_magic_scan" function over to the "recursion stack push" function. Include a "path" as a parameter for creating a new fmap, and rename some related variables and functions to be more intuitive. CLAM-2796 See also: CLAM-2485, CLAM-2626	2025-08-14 21:23:33 -04:00
Val S.	17d0665580	ZIP: Fix NULL-dereference for OOXML scans (#1552 ) I accidentally introduced a NULL-dereference bug when scanning any OOXML file in https://github.com/Cisco-Talos/clamav/pull/1548 I overlooked the test failure out of haste. 😔 The NULL-dereference happens because the `unzip_search()` feature allowed searching some other file than the one that is currently being scanned, which you would do by setting `ctx` to NULL and setting an `fmap` parameter instead. In practice, the current layer's `fmap` from the `ctx` was always passed in. This fix makes it so the `unzip_search()` and related functions only take the `ctx` parameter and do not have and `fmap` or `fsize` field (Note: the `fsize` was never needed, because `fmap->len` take care of that). CLAM-2837	2025-08-14 21:17:46 -04:00
Valerie Snyder	18854bf4bc	Windows: Fix filepath basename issue On Windows, the cli_basename function should treat both '/' and '\' as path separators. Most Windows APIs also accept both. On Linux/Unix, it makes sense when using a filepath that is more for informational purposes or where it may have come from a Windows system, to treat the '\' as a path separator. But in situations where the the path is needed for some critical action, like moving or deleting a file, we can't treat it as a path separator.	2025-08-11 18:14:19 -04:00
Valerie Snyder	0cc5d75093	ZIP: Fix infinite loop + significant code cleanup An infinite loop may occur when scanning some malformed ZIP files. I introduced this issue in `96c00b6d80` with this line: ```c // decrement coff by 1 to account for the increment at the end of the loop coff -= 1; ``` The problem is that the function may return 0, which should indicate that there are no more files. The result was that `coff` would stay the same and the loop would repeat. This issue is in 1.5 development and affects the 1.5.0 beta but does not affect any production versions. Fixes: https://github.com/Cisco-Talos/clamav/issues/1534 Special thanks to Sophie0x2E for an initial fix, proposed in https://github.com/Cisco-Talos/clamav/pull/1539 In review, I was uncomfortable with other existing code and decided to to a more significant overhaul of the error handling in the ZIP module. In addition to cleanup, this commit has some functional changes: - When parsing a central directory file header inside of `parse_central_directory_file_header()`, it will now fail out if the "extra length" or "comment length" fields would exceced the length of the archive. That doesn't mean the associated local file header won't be parsed later, but it won't use the central directory file header to find it. Instead, the ZIP module will have to find the local file header by searching for extra records not listed in the central directory. This change was mostly to tidy up complex error handling. - Add two FTM new signatures to identify split ZIP archives. This signature identifies the first segment (first file) in a split or spanned ZIP archive. It may also be found on a single-segment "split" archive, depending on the ZIP archiver. ``` 0:0:504b0708504b0304:ZIP (First segment split/spanned):CL_TYPE_ANY:CL_TYPE_ZIP ``` Practically speaking, this new signature makes it so ClamAV identifies the file as a ZIP right away without having to rely on SFX_ZIP detection. Extraction is then handled by the ZIP `cli_unzip` function rather than extracting each with `cli_unzip_single` which handles SFX_ZIP entries. Note: ClamAV isn't capable of finding additional files on disk to support handling the additional segments. So it doesn't make any difference with handling those other files. This signature is for single-segment split/spanned archives, depending on the ZIP archiver. ``` 0:0:504b0303504b0304:ZIP (Single-segment split/spanned):CL_TYPE_ANY:CL_TYPE_ZIP ``` Like the first one, this also means we won't rely on SFX_ZIP detection and will treat this files as regular ZIPs. - Added a test file to verify that ClamAV can extract a single-file "split" ZIP. - Added a clamscan test with test files to verify that scanning a split archive across two segments correctly extracts the properly formed zip file entries. Sadly, we can't join the segments to extract everything.	2025-08-11 18:14:19 -04:00
frsell	1199c89c1d	Zip: Fix additional bugs found in PR review Fix bug in end_offset calculation when searching for additional file entries. Fix size of memset after realloc.	2025-03-28 09:42:12 -04:00
Val Snyder	96c00b6d80	Zip: Minor fixes, code cleanup Fixes: - We need to look at the local headers if no central directory headers are found. Restructured the main `cli_unzip()` function to allocate an empty zip catalogue when we can't use a central directory at all. - In `index_local_file_headers_within_bounds()`, we must decrement the `coff` variable after adding the size of a file entry using `parse_local_file_header()`, to account for the increment when it loops around. If we don't, the next entry won't be at 'PK\x03\x04', it will be at 'K\0x03\x04'. - Attempt to unzip when encrypted if we don't have a valid password. This may enable extraction for files where a header lies about encryption. - The `fmap_need_off()` call to get the `compressed_data` pointer used the wrong size, checking if there was enough data for a header instead of for the compressed data that follows the header. I stumbled across this older bug when testing extraction of a zip where the file entries are tiny and I'd stripped off the central directory. As a result, there wasn't enough data for a whole file header and my test failed. Cleanup: - Initialize status variables as CL_ERROR and only assign to CL_SUCCESS if successful. This is to protect against future changes in case someone accidentally goes-to-done without setting the status. - Remove legacy use of CL_CLEAN. Not a functional change. This mostly a stylistic preference. - Use calloc instead of malloc + memset in a couple places. Make use of the new allocation macros with goto-done error handling. - Some opinionated format changes such as shifting some longer function arguments all to a new line so they're no so far to the right. - Auto-format with clang-format.	2025-03-25 12:48:52 -04:00
frsell	8d8433b5ef	Zip: Extend parsing to rely less on central directory Add logic to search for additional file entries between those that have been found using the central directory.	2025-03-25 12:15:37 -04:00
Val Snyder	7ff29b8c37	Bump copyright dates for 2025	2025-02-14 10:24:30 -05:00
Micah Snyder	47dfe9bd5d	Remove libjson-c dead code As of ClamAV 0.105, libjson-c is required. There is also no option to disable libjson-c support. This commit removes the dead code associated with the old build option.	2024-04-13 12:34:15 -04:00
Micah Snyder	45c6938be8	Remove libbz2 dead code As of ClamAV 0.105, libbz2 is required. There is also no option to disable bz2 support. This commit removes the dead code associated with the old build option.	2024-04-13 12:34:15 -04:00
Micah Snyder	3ae9c1e434	Add LHA/LZH archive support File type magic signatures chosen based on the extensions supported by Rust delharc crate. See: https://docs.rs/delharc/latest/delharc/	2024-04-09 10:35:22 -04:00
Micah Snyder	8e04c25fec	Rename clamav memory allocation functions We have some special functions to wrap malloc, calloc, and realloc to make sure we don't allocate more than some limit, similar to the max-filesize and max-scansize limits. Our wrappers are really only needed when allocating memory for scans based on untrusted user input, where a scan file could have bytes that claim you need to allocate some ridiculous amount of memory. Right now they're named: - cli_malloc - cli_calloc - cli_realloc - cli_realloc2 ... and these names do not convey their purpose This commit renames them to: - cli_max_malloc - cli_max_calloc - cli_max_realloc - cli_max_realloc2 The realloc ones also have an additional feature in that they will not free your pointer if you try to realloc to 0 bytes. Freeing the memory is undefined by the C spec, and only done with some realloc implementations, so this stabilizes on the behavior of not doing that, which should prevent accidental double-free's. So for the case where you may want to realloc and do not need to have a maximum, this commit adds the following functions: - cli_safer_realloc - cli_safer_realloc2 These are used for the MPOOL_REALLOC and MPOOL_REALLOC2 macros when MPOOL is disabled (e.g. because mmap-support is not found), so as to match the behavior in the mpool_realloc/2 functions that do not make use of the allocation-limit.	2024-03-15 13:18:47 -04:00
Micah Snyder	6d6e04ddf8	Optimization: replace limited allocation calls There are a large number of allocations for fix sized buffers using the `cli_malloc` and `cli_calloc` calls that check if the requested size is larger than our allocation threshold for allocations based on untrusted input. These allocations will always be higher than the threshold, so the extra stack frame and check for these calls is a waste of CPU. This commit replaces needless calls with A -> B: - cli_malloc -> malloc - cli_calloc -> calloc - CLI_MALLOC -> MALLOC - CLI_CALLOC -> CALLOC I also noticed that our MPOOL_MALLOC / MPOOL_CALLOC are not limited by the max-allocation threshold, when MMAP is found/enabled. But the alternative was set to cli_malloc / cli_calloc when disabled. I changed those as well. I didn't change the cli_realloc/2 calls because our version of realloc not only implements a threshold but also stabilizes the undefined behavior in realloc to protect against accidental double-free's. It may be worth implementing a cli_realloc that doesn't have the threshold built-in, however, so as to allow reallocaitons for things like buffers for loading signatures, which aren't subject to the same concern as allocations for scanning possible malware. There was one case in mbox.c where I changed MALLOC -> CLI_MALLOC, because it appears to be allocating based on untrusted input.	2024-03-15 13:18:47 -04:00
Micah Snyder	9cb28e51e6	Bump copyright dates for 2024	2024-01-22 11:27:17 -05:00
RainRat	1b17e20571	Fix typos (no functional changes)	2024-01-19 09:08:36 -08:00
Micah Snyder	e27a450bf8	ZIP: Always parse file names Having the filename is useful for certain callbacks, and will likely be more useful in the future if we can start comparing detected filetypes with file extensions. E.g. if filetype is just "binary" or "text" we may be able to do better by trusting a ".js" extension to determine the type. Or else if detected file type is "pe" but the extension is ".png" we may want to say it's suspicious. Also adjusted the example callback program to disable metadata option. The CL_SCAN_GENERAL_COLLECT_METADATA is no longer required for the Zip parser to record filenames for embedded files, and described in the previous commit. This program can be used to demonstrate that it is behaving as desired.	2023-08-23 17:41:40 -07:00
Micah Snyder	d24c6f7569	Coverity-401434: Switch to realloc that does not free after failure Because we free after the done-label.	2023-04-13 00:11:34 -07:00
Micah Snyder	6eebecc303	Bump copyright for 2023	2023-02-12 11:20:22 -08:00
Micah Snyder	df99364c3c	Improvements to adhere to scan time limit Error handling for bytecode sig evaluation, as well as logical and yara sig evaluation only consider CL_VIRUS or CL_SUCCESS and fail to consider terminal errors such as CL_ETIMEOUT. This commit fixes the error handling in those functions so we properly abort instead of continuing to evaluate more sigs. This commit also adds some a scan time limit checks: 1. shortly after a bytecode time limit exceeded. 2. after evaluating each lsig or yara sig. 3. when processing unzip central directory file headers.	2022-10-19 13:13:57 -07:00
Micah Snyder	4e0adaad80	ZIP: Remove allmatch checks And fix issue where magic_scan() critical error may be ignored.	2022-10-19 13:13:57 -07:00
Micah Snyder	66f48d3e05	Strong indicator precedence over PUA / Heuristic detections Signatures that start with "PUA.", "Heuristics.", or "BC.Heuristics." are perceived to be less serious, or more likely to have false positives, than other signatures that we would think of us as "strong indicators". At present, only a subset of "Heuristics." signatures, such as those added by the phishing module, are added as "potentially unwanted". Unless you're using heuristic-precedence mode, these "potentially unwanted" indicators are recorded but not reported unless no other signature alerts. This behavior should apply to all signatures that start with "PUA." and "Heuristics.". We already do a string match comparison on the signature name to apply that behavior to bytecode matches that start with "BC.Heuristics.". I moved that string comparison logic used for "BC.Heuristics." into the main `cl_append_virus()` function and extended it to cover the other two cases. I also replaced all hardcoded calls to append "Heuristics." signatures to append using the `cli_append_potentially_unwanted()` function, so we can skip the string compare in these cases. That function will of course append them as strong indicators if heuristic-precedence mode is enabled.	2022-10-19 13:13:57 -07:00
Micah Snyder	cd3134568a	Code quality: Refactor layer attributes as scan parameter The current implementation sets a "next layer attributes" flag field in the scan context. This may introduce bugs if accidentally not cleared during error handling, causing that attribute to be applied to a different layer than intended. This commit resolves that by adding an attribute flag to the major internal scan functions and removing the "next layer attributes" from the scan context. This attributes flag shares the same flag fields as the attributes flag in the new file inspection callback and the flags are defined in `clamav.h`.	2022-10-13 08:57:44 -07:00
Micah Snyder	9d6ebd6d50	Adds file inspection callback and example code libclamav callbacks can be used to access embedded file content at each layer of extraction during the course of a scan. The existing callbacks only provide access to the file descriptor and a guess at the file type. This patch adds a new callback for the purposes of file/archive inspection that provides additional insight into the embedded file. This includes: - ancestors: an array of parent file names - parent file size: the size of the direct parent layer - file name: current layer's filename, if any. - file buffer (pointer) - file size: size of file buffer - file type: just a guess at the current file's type - file descriptor: may be -1 if the layer is in-memory only. - layer attributes: a flag field. see LAYER_ATTRIBUTE_* defines in clamav.h Two new example apps are added that are automatically built when compiling under CMake: - ex2 demonstrates the prescan callback. - ex3 demonstrates the new file inspection callback. The examples are now installed if enabled, so you can test them in the Docker image, and so that they'll be colocated with the DLLs so you can test them on Windows. The installed examples should also be able to find the UnRAR library at run time, without having to set LD_LIBRARY_PATH. This commit also sets the fmap->name in an fmap-scan using the basname of the provided filename if the caller provided the filename and the provided fmap does not have the name set.	2022-10-13 08:57:44 -07:00
Micah Snyder	c86ce2e295	Zip parser: tolerate 2-byte overlap in file entries The heuristic to alert on overlapping file entries is detecting some non-malicious JAR files observed in critical enterprise software. The goal with overlap detection is to alert on non-recursive zip- bombs, so this tiny overlap isn't a concern. We'll allow a 2-byte overlap so we don't alert on such zips.	2022-07-18 11:38:00 -07:00
ragusaa	1c6746853f	Fixed heap buffer overflow while loading signatures There is a possible overflow read when loading PDB and WDB phishing signatures. This issue is not a vulnerability. Changed const char pointers to uint8_t pointers when they are to be used with data, as well as removing asserts and adding additional error checking. Thank you Michał Dardas for reporting this issue. This fix also resolves: - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43845 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43812 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43866 This commit also fixes a minor leak of pattern matching trans nodes that was observed when testing with the MPOOL module disabled.	2022-05-16 18:29:25 -07:00
micasnyd	140c88aa4e	Bump copyright for 2022 Includes minor format corrections.	2022-01-09 14:23:25 -07:00
Mickey Sola	6cbb648113	Update rust logging inclusion based on feedback Convert cli_dbgmsg to inline function to ensure ctx check for debug flag is always run Add copyright and licensing info Fix valgrind uninitialized buffer issue in cliunzip.c Windows build fix	2021-11-15 12:33:05 -08:00
Micah Snyder	ae9cac8761	Fix memory leak in zip parser error handling The zip parser may leak a string for a zip record filename if the the record in the central directory is the last record. This fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=31760 It also appears that there may be a related record filename leak if an error occured when indexing the files in the central directory header, but I don't have any test file for this but it was an obvious fix.	2021-10-30 17:13:52 -07:00
Micah Snyder	d1141becac	Fix fmap handle_gets() page arithmetic The fixes to the fmap bounds for nested (duplicate) fmaps added recently introduced a subtle arithmetic bug that was detected by OSS-Fuzz: ```c scanat = m->nested_offset + at % m->pgsz; ``` should have been: ```c scanat = (m->nested_offset + at) % m->pgsz; ``` Without the parenthesis, `scanat` could be > `m->pgsz`, which would overflow in the subsequent `memchr()` call. See: - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40452 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40455 This commit also tightens up some of the other bounds checks done with `CLI_ISCONTAINED()` macro so the check limits the bounds to the nested fmap and not the original map. In addition, I've added a `CLI_ISCONTAINED_0_TO()` macro that removes checks when the "bigger" buffer starts at offset 0. This should silence a bunch of (benign) warnings and medium severity Coverity issues. There is also a possible use of an uninitialized variable (`old_hook_lsig_matches`) in `cli_magic_scan()`. Finally, I also removed an unecessary NULL-check on `filebase` in `fmap_dup_to_file()` that Coverity was unhappy with.	2021-10-29 12:35:15 -07:00
Micah Snyder	db013a2bfd	libclamav: Fix scan recursion tracking Scan recursion is the process of identifying files embedded in other files and then scanning them, recursively. Internally this process is more complex than it may sound because a file may have multiple layers of types before finding a new "file". At present we treat the recursion count in the scanning context as an index into both our fmap list AND our container list. These two lists are conceptually a part of the same thing and should be unified. But what's concerning is that the "recursion level" isn't actually incremented or decremented at the same time that we add a layer to the fmap or container lists but instead is more touchy-feely, increasing when we find a new "file". To account for this shadiness, the size of the fmap and container lists has always been a little longer than our "max scan recursion" limit so we don't accidentally overflow the fmap or container arrays (!). I've implemented a single recursion-stack as an array, similar to before, which includes a pointer to each fmap at each layer, along with the size and type. Push and pop functions add and remove layers whenever a new fmap is added. A boolean argument when pushing indicates if the new layer represents a new buffer or new file (descriptor). A new buffer will reset the "nested fmap level" (described below). This commit also provides a solution for an issue where we detect embedded files more than once during scan recursion. For illustration, imagine a tarball named foo.tar.gz with this structure: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| baz.exe \| PE \| 0 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| │ └── hello.txt \| ASCII \| 2 \| 0 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| \| └── world.txt \| ASCII \| 2 \| 0 \| (A) If we scan for embedded files at any layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| ├── foo.tar \| TAR \| 1 \| 0 \| \| │ ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hola.txt \| ASCII \| 3 \| 0 \| \| │ ├── baz.exe \| PE \| 2 \| 1 \| \| │ │ ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ │ │ └── hello.txt \| ASCII \| 4 \| 0 \| \| │ │ └── sfx.7z \| 7Z \| 3 \| 1 \| \| │ │ └── world.txt \| ASCII \| 4 \| 0 \| \| │ ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hello.txt \| ASCII \| 3 \| 0 \| \| │ └── sfx.7z \| 7Z \| 2 \| 1 \| \| │ └── world.txt \| ASCII \| 3 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| (A) is bad because it scans content more than once. Note that for the GZ layer, it may detect the ZIP and 7Z if the signature hits on the compressed data, which it might, though extracting the ZIP and 7Z will likely fail. The reason the above doesn't happen now is that we restrict embedded type scans for a bunch of archive formats to include GZ and TAR. (B) If we scan for embedded files at the foo.tar layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| ├── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ └── hello.txt \| ASCII \| 3 \| 0 \| \| └── sfx.7z \| 7Z \| 2 \| 1 \| \| └── world.txt \| ASCII \| 3 \| 0 \| (B) is almost right. But we can achieve it easily enough only scanning for embedded content in the current fmap when the "nested fmap level" is 0. The upside is that it should safely detect all embedded content, even if it may think the sfz.zip and sfx.7z are in foo.tar instead of in baz.exe. The biggest risk I can think of affects ZIPs. SFXZIP detection is identical to ZIP detection, which is why we don't allow SFXZIP to be detected if insize of a ZIP. If we only allow embedded type scanning at fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP if the bar.exe was not compressed in foo.zip and if non-compressed files extracted from ZIPs aren't extracted as new buffers: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.zip \| ZIP \| 0 \| 0 \| \| └── bar.exe \| PE \| 1 \| 1 \| \| └── sfx.zip \| ZIP \| 2 \| 2 \| Provided that we ensure all files extracted from zips are scanned in new buffers, option (B) should be safe. (C) If we scan for embedded files at the baz.exe layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ └── hello.txt \| ASCII \| 4 \| 0 \| \| └── sfx.7z \| 7Z \| 3 \| 1 \| \| └── world.txt \| ASCII \| 4 \| 0 \| (C) is right. But it's harder to achieve. For this example we can get it by restricting 7ZSFX and ZIPSFX detection only when scanning an executable. But that may mean losing detection of archives embedded elsewhere. And we'd have to identify allowable container types for each possible embedded type, which would be very difficult. So this commit aims to solve the issue the (B)-way. Note that in all situations, we still have to scan with file typing enabled to determine if we need to reassign the current file type, such as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2- compressed. Detection of DMG and a handful of other types rely on finding data partway through or near the ned of a file before reassigning the entire file as the new type. Other fixes and considerations in this commit: - The utf16 HTML parser has weak error handling, particularly with respect to creating a nested fmap for scanning the ascii decoded file. This commit cleans up the error handling and wraps the nested scan with the recursion-stack push()/pop() for correct recursion tracking. Before this commit, each container layer had a flag to indicate if the container layer is valid. We need something similar so that the cli_recursion_stack_get_() functions ignore normalized layers. Details... Imagine an LDB signature for HTML content that specifies a ZIP container. If the signature actually alerts on the normalized HTML and you don't ignore normalized layers for the container check, it will appear as though the alert is in an HTML container rather than a ZIP container. This commit accomplishes this with a boolean you set in the scan context before scanning a new layer. Then when the new fmap is created, it will use that flag to set similar flag for the layer. The context flag is reset those that anything after this doesn't have that flag. The flag allows the new recursion_stack_get() function to ignore normalized layers when iterating the stack to return a layer at a requested index, negative or positive. Scanning normalized extracted/normalized javascript and VBA should also use the 'layer is normalized' flag. - This commit also fixes Heuristic.Broken.Executable alert for ELF files to make sure that: A) these only alert if cli_append_virus() returns CL_VIRUS (aka it respects the FP check). B) all broken-executable alerts for ELF only happen if the SCAN_HEURISTIC_BROKEN option is enabled. - This commit also cleans up the error handling in cli_magic_scan_dir(). This was needed so we could correctly apply the layer-is-normalized-flag to all VBA macros extracted to a directory when scanning the directory. - Also fix an issue where exceeding scan maximums wouldn't cause embedded file detection scans to abort. Granted we don't actually want to abort if max filesize or max recursion depth are exceeded... only if max scansize, max files, and max scantime are exceeded. Add 'abort_scan' flag to scan context, to protect against depending on correct error propagation for fatal conditions. Instead, setting this flag in the scan context should guarantee that a fatal condition deep in scan recursion isn't lost which result in more stuff being scanned instead of aborting. This shouldn't be necessary, but some status codes like CL_ETIMEOUT never used to be fatal and it's easier to do this than to verify every parser only returns CL_ETIMEOUT and other "fatal status codes" in fatal conditions. - Remove duplicate is_tar() prototype from filestypes.c and include is_tar.h instead. - Presently we create the fmap hash when creating the fmap. This wastes a bit of CPU if the hash is never needed. Now that we're creating fmap's for all embedded files discovered with file type recognition scans, this is a much more frequent occurence and really slows things down. This commit fixes the issue by only creating fmap hashes as needed. This should not only resolve the perfomance impact of creating fmap's for all embedded files, but also should improve performance in general. - Add allmatch check to the zip parser after the central-header meta match. That way we don't multiple alerts with the same match except in allmatch mode. Clean up error handling in the zip parser a tiny bit. - Fixes to ensure that the scan limits such as scansize, filesize, recursion depth, # of embedded files, and scantime are always reported if AlertExceedsMax (--alert-exceeds-max) is enabled. - Fixed an issue where non-fatal alerts for exceeding scan maximums may mask signature matches later on. I changed it so these alerts use the "possibly unwanted" alert-type and thus only alert if no other alerts were found or if all-match or heuristic-precedence are enabled. - Added the "Heuristics.Limits.Exceeded." events to the JSON metadata when the --gen-json feature is enabled. These will show up once under "ParseErrors" the first time a limit is exceeded. In the present implementation, only one limits-exceeded events will be added, so as to prevent a malicious or malformed sample from filling the JSON buffer with millions of events and using a tonne of RAM.	2021-10-25 16:02:29 -07:00
Tim Gates	251befbdf3	docs: Fix a few typos docs: Fix a few typos There are small typos in: - libclamav/others_common.c - libclamav/pe.c - libclamav/unzip.c Fixes: - Should read `descriptor` rather than `desriptor`. - Should read `record` rather than `reocrd`. - Should read `overarching` rather than `overaching`.	2021-08-09 15:41:17 -07:00
Micah Snyder (micasnyd)	b9ca6ea103	Update copyright dates for 2021 Also fixes up clang-format.	2021-03-19 15:12:26 -07:00
Jan Smutny	b33e4be3ea	zip: Fix false negative w. HeuristicScanPrecedence	2020-07-25 11:15:21 -07:00
Micah Snyder	e2f59af30a	Clang-format touchup	2020-07-24 16:37:25 -07:00
Micah Snyder	d1f209e879	Fix fmap NULL deref in preclass bytecode hook If using a bytecode signature that makes use of the BC_PRECLASS hook and if it alerts, a NULL dereference may occur. This change fixes that. Also fixed unrelated memory leaks introduced recently when adding file name extraction to the zip parser and rar parser.	2020-06-03 11:00:53 -04:00
Micah Snyder	11ef77007b	Improve tmp sub-directory names At present many parsers create tmp subdirectories to store extracted files. For parsers like the vba parser, this is required as the directory is later scanned. For other parsers, these subdirectories are probably not helpful now that we provide recursive sub-dirs when --leave-temps is enabled. It's not quite as simple as removing the extra subdirectories, however. Certain parsers, like autoit, don't create very unique filenames and would result in file name collisions when --leave-temps is not enabled. The best thing to do would be to make sure each parser uses unique filenames and doesn't rely on cli_magic_scan_dir() to scan extracted content before removing the extra subdirectory. In the meantime, this commit gives the extra subdirectories meaningful names to improve readability. This commit also: - Provides the 'bmp' prefix for extracted PE icons. - Removes empty tmp subdirs when extracting rtf files, to eliminate clutter. - The PDF parser sometimes creates tmp files when decompressing streams before it knows if there is actually any content to decompress. This resulted in a large number of empty files. While it would be best to avoid creating empty files in the first place, that's not quite as as it sounds. This commit does the next best thing and deletes the tmp files if nothing was actually extracted, even if --leave-temps is enabled. - Removes the "scantemp" prefix for unnamed fmaps scanned with cli_magic_scan(). The 5-character hashes given to tmp files with prefixes resulted in occasional file name collisions when extracting certain file types with thousands of embedded files. - The VBA and TAR parsers mistakenly used NAME_MAX instead of PATH_MAX, resulting in truncated file paths and failed extraction when --leave-temps is enabled and a lot of recursion is in play. This commit switches them from NAME_MAX to PATH_MAX.	2020-06-03 11:00:53 -04:00
Micah Snyder	005cbf5a37	Record names of extracted files A way is needed to record scanned file names for two purposes: 1. File names (and extensions) must be stored in the json metadata properties recorded when using the --gen-json clamscan option. Future work may use this to compare file extensions with detected file types. 2. File names are useful when interpretting tmp directory output when using the --leave-temps option. This commit enables file name retention for later use by storing file names in the fmap header structure, if a file name exists. To store the names in fmaps, an optional name argument has been added to any internal scan API's that create fmaps and every call to these APIs has been modified to pass a file name or NULL if a file name is not required. The zip and gpt parsers required some modification to record file names. The NSIS and XAR parsers fail to collect file names at all and will require future work to support file name extraction. Also: - Added recursive extraction to the tmp directory when the --leave-temps option is enabled. When not enabled, the tmp directory structure remains flat so as to prevent the likelihood of exceeding MAX_PATH. The current tmp directory is stored in the scan context. - Made the cli_scanfile() internal API non-static and added it to scanners.h so it would be accessible outside of scanners.c in order to remove code duplication within libmspack.c. - Added function comments to scanners.h and matcher.h - Converted a TDB-type macros and LSIG-type macros to enums for improved type safey. - Converted more return status variables from `int` to `cl_error_t` for improved type safety, and corrected ooxml file typing functions so they use `cli_file_t` exclusively rather than mixing types with `cl_error_t`. - Restructured the magic_scandesc() function to use goto's for error handling and removed the early_ret_from_magicscan() macro and magic_scandesc_cleanup() function. This makes the code easier to read and made it easier to add the recursive tmp directory cleanup to magic_scandesc(). - Corrected zip, egg, rar filename extraction issues. - Removed use of extra sub-directory layer for zip, egg, and rar file extraction. For Zip, this also involved changing the extracted filenames to be randomly generated rather than using the "zip.###" file name scheme.	2020-06-03 10:39:18 -04:00
Micah Snyder	206dbaefe8	Update copyright dates for 2020	2020-01-03 15:44:07 -05:00
Micah Snyder	74bbc91433	Modifies zip scanning behavior so it scans files using zip records from the catalogue which provides deduplication of file records resulting in faster extraction and scan time and reducing the likelihood of alerting on non-malicious duplicate file entries as overlapping files.	2019-10-02 16:08:30 -04:00
Micah Snyder	a04a7627b3	bb12356 - Improvement to overlapping zip files detection logic. Increases zip overlapping file checking to equal the max scan files setting.	2019-10-02 16:08:30 -04:00
Micah Snyder (micasnyd)	6a0abb897a	Adds --max-scantime clamscan option and MaxScanTime clamd config option. --max-scantime replaces the --timelimit clamscan option that had been experimental. Default max-scantime set to 2 minutes (120000 milliseconds).	2019-10-02 16:08:29 -04:00
Micah Snyder	e3d269a933	Adds detection and heuristic alert for zips with overlapping files, preventing extraction of non-recursive zip bombs.	2019-10-02 16:08:26 -04:00
Micah Snyder	ad6e0f70cb	Adds unzip parser code readability improvements; doxygen function comments.	2019-10-02 16:08:26 -04:00

1 2 3

141 commits