clamav

mirror of https://github.com/Cisco-Talos/clamav.git synced 2025-10-19 10:23:17 +00:00

Author	SHA1	Message	Date
Micah Snyder	5fe5f87252	Fix performance issue scanning some Windows executables Scanning CL_TYPE_MSEXE that have embedded file type signature matches for CL_TYPE_MSEXE are incorrectly passing the PE header check for the contained file, resulting in excessive scan times. The problem is that the `peinfo` struct needs to have the `offset` set for the contained `CL_TYPE_MSEXE` match prior to the header check. Without that, the header check was actually validating the PE header of the original file, which would always pass when that's a PE, and would always fail if it's an OLE2 file (the other type which we check for contained PEs). The additional code change in this commit is to make it so the `ctx` parameter must never be NULL, and removing the `map` parameter because, in practice, that is always from `ctx->fmap`. This is to safeguard against future changes to the function that may accidentally use `ctx` without a proper NULL check. CLAM-2882	2025-10-12 16:13:02 -04:00
Val S.	a77a271fb5	Reduce unnecessary scanning of embedded file FPs (#1571) When embedded file type recognition finds a possible embedded file, it is being scanned as a new embedded file even if it turns out it was a false positive and parsing fails. My solution is to pre-parse the file headers as little as possible to determine if it is valid. If possible, also determine the file size based on the headers. That will make it so we don't have to scan additional data when the embedded file is not at the very end. This commit adds header checks prior to embedded ZIP, ARJ, and CAB scanning. For these types I was also able to use the header checks to determine the object size so as to prevent excessive pattern matching. TODO: Add the same for RAR, EGG, 7Z, NULSFT, AUTOIT, IShield, and PDF. This commit also removes duplicate matching for embedded MSEXE. The embedded MSEXE detection and scanning logic was accidentally creating an extra duplicate layer in between scanning and detection because of the logic within the `cli_scanembpe()` function. That function was effectively doing the header check which this commit adds for ZIP, ARJ, and CAB but minus the size check. Note: It is unfortunately not possible to get an accurate size from PE file headers. The `cli_scanembpe()` function also used to dump to a temp file for no reason since FMAPs were extended to support windows into other FMAPs. So this commit removes the intermediate layer as well as dropping a temp file for each embedded PE file. Further, this commit adds configuration and DCONF safeguards around all embedded file type scanning. Finally, this commit adds a set of tests to validate proper extraction of embedded ZIP, ARJ, CAB, and MSEXE files. CLAM-2862 Co-authored-by: TheRaynMan <draynor@sourcefire.com>	2025-09-23 15:57:28 -04:00
Valerie Snyder	3b2313362e	Metadata JSON: Simplify recording alerts and indicators We presently record Alerts as an array of signature names. Instead, it should be an object with properties of its own. We should record alerting indicators and weak indicators in a single "Indicators" array, likely with the same structure as the "Alerts" objects. When an alerting indicator is ignored (e.g. ignored by callback or if the file is trusted by an FP signature), we can remove it from the "Alerts" array, and for the "Indicators" array, add an "Ignored" key with a string value that explains why it was ignored. This eliminates the need to track and propagate the additional "WeakIndicators" and "IgnoredAlerts" arrays.	2025-08-14 22:40:49 -04:00
Valerie Snyder	6d9b57eeeb	libclamav: cl_scan*_ex() functions provide verdict separate from errors It is a shortcoming of existing scan APIs that it is not possible to return an error without masking a verdict. We presently work around this limitation by counting up detections at the end and then overriding the error code with `CL_VIRUS`, if necessary. The `cl_scanfile_ex()`, `cl_scandesc_ex()`, and `cl_scanmap_ex()` functions should provide the scan verdict separately from the error code. This introduces a new enum for recording and reporting a verdict: `cl_verdict_t` with options: - `CL_VERDICT_NOTHING_FOUND` - `CL_VERDICT_TRUSTED` - `CL_VERDICT_STRONG_INDICATOR` - `CL_VERDICT_POTENTIALLY_UNWANTED` Notably, the newer scan APIs may set the verdict to `CL_VERDICT_TRUSTED` if there is a (hash-based) FP signature for a file, or in the case where Authenticode or similar certificate-based verification was performed, or in the case where an application scan callback returned `CL_VERIFIED`. CLAM-763 CLAM-865	2025-08-14 22:40:46 -04:00
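Commit 6d9b57eeeb separates the scan verdict from the error code. The following is a minimal sketch of that idea only, not the libclamav API: every `sketch_`/`SKETCH_` name is hypothetical, and the real `cl_scan*_ex()` signatures are not reproduced here. It shows a scan that times out after a match can still report the match, because the verdict travels through an out-parameter instead of the return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes; not libclamav's cl_error_t. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ETIMEOUT
} sketch_error_t;

/* Hypothetical verdicts mirroring the four options listed in the commit. */
typedef enum {
    SKETCH_VERDICT_NOTHING_FOUND = 0,
    SKETCH_VERDICT_TRUSTED,
    SKETCH_VERDICT_STRONG_INDICATOR,
    SKETCH_VERDICT_POTENTIALLY_UNWANTED
} sketch_verdict_t;

/* The return value reports how the scan went; the verdict is stored
 * separately, so an error never has to overwrite a detection. */
static sketch_error_t sketch_scan(bool matched, bool timed_out,
                                  sketch_verdict_t *verdict_out)
{
    assert(verdict_out != NULL);
    *verdict_out = matched ? SKETCH_VERDICT_STRONG_INDICATOR
                           : SKETCH_VERDICT_NOTHING_FOUND;
    return timed_out ? SKETCH_ETIMEOUT : SKETCH_SUCCESS;
}
```

A caller of such a function would check the returned error first and then inspect the verdict, rather than treating a single status code as both.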
Valerie Snyder	31dcec1e42	libclamav: Add engine option to toggle temp directory recursion Temp directory recursion in ClamAV is when each layer of a scan gets its own temp directory in the parent layer's temp directory. In addition to temp directory recursion, ClamAV has been creating a new subdirectory for each file scan as a risk-averse method to ensure no temporary file leaks fill up the disk. Creating a directory is relatively slow on Windows in particular if scanning a lot of very small files. This commit: 1. Separates the temp directory recursion feature from the leave-temps feature so that libclamav can leave temp files without making subdirectories for each file scanned. 2. Makes it so that when temp directory recursion is off, libclamav will just use the configured temp directory for all files. The new option to enable temp directory recursion is for libclamav-only at this time. It is off by default, and you can enable it like this: `cl_engine_set_num(engine, CL_ENGINE_TMPDIR_RECURSION, 1);` For the `clamscan` and `clamd` programs, temp directory recursion will be enabled when `--leave-temps` / `LeaveTemporaryFiles` is enabled. The difference is that when disabled, it will return to using the configured temp directory without making a subdirectory for each file scanned, so as to improve scan performance for small files, mostly on Windows. Under the hood, this commit also: 1. Cleans up how we keep track of tmpdirs for each layer. The goal here is to align how we keep track of layer-specific stuff using the scan_layer structure. 2. Cleans up how we record metadata JSON for embedded files. Note: Embedded files being different from Contained files, as they are extracted not with a parser, but by finding them with file type magic signatures. CLAM-1583	2025-08-14 22:38:58 -04:00
Valerie Snyder	7f25b928de	Record scan matches (evidence) at each recursion layer Move recording of evidence (aka Strong, PUA, and Weak indicators) to be done in each layer of a scan, and passed up to the parent layer with the top level only connecting the results at the very end of the scan. This is needed to provide access to the last alert for a given layer when we upgrade the scan callbacks. Note that when adding evidence from a child layer that is a normalized layer, we do not want to increase the depth. It should appear as though the match occurred on the parent layer. This is for two reasons: 1. We don't run the scan callbacks on normalized layers. 2. Future matches on Weak Indicators should be able to treat normalized layer matches the same as original file matches. Keep reading for more about Weak Indicators. Recording scan matches at each recursion layer is also needed to support Weak Indicators, a feature where an alerting signature (aka Strong Indicator) may require the match of a non-alerting signature (aka Weak Indicator) on the same layer or on child layers in order to alert. Support for Weak indicators was blocked by not keeping track of where indicators were found. So this commit also enables support for recording Weak indicators. Like PUA, Weak indicators are treated differently based on the signature prefix. That is, any signatures starting with "Weak." won't cause an alert on its own. The next step to completing Weak Indicator support will be adding a logical subsignature feature to depend on a weak indicator match. CLAM-2626 CLAM-2485	2025-08-14 21:23:34 -04:00
Valerie Snyder	f7e60d566f	Record unique object-id for each layer scanned Every time we push a new map onto the scanning recursion context, give it a unique object id number, which counts from zero. Moved the location where we add metadata for each file from the "cli_magic_scan" function over to the "recursion stack push" function. Include a "path" as a parameter for creating a new fmap, and rename some related variables and functions to be more intuitive. CLAM-2796 See also: CLAM-2485, CLAM-2626	2025-08-14 21:23:33 -04:00
Valerie Snyder	aa7b7e9421	Swap clean cache from MD5 to SHA2-256 Change the clean-cache to use SHA2-256 instead of MD5. Note that all references are changed to specify "SHA2-256" now instead of "SHA256", for clarity. But there is no plan to add support for SHA3 algorithms at this time. Significant code cleanup. E.g.: - Implemented goto-done error handling. - Used `uint8_t ` instead of `unsigned char `. - Use `bool` for boolean checks, rather than `int`. - Used `#defines` instead of magic numbers. - Removed duplicate `#defines` for things like hash length. Add new option to calculate and record additional hash types when the "generate metadata JSON" feature is enabled: - libclamav option: `CL_SCAN_GENERAL_STORE_EXTRA_HASHES` - clamscan option: `--json-store-extra-hashes` (default off) - clamd.conf option: `JsonStoreExtraHashes` (default 'no') Renamed the sigtool option `--sha256` to `--sha2-256`. The original option is still functional, but is deprecated. For the "generate metadata JSON" feature, the file hash is now stored as "sha2-256" instead of "FileMD5". If you enable the "extra hashes" option, then it will also record "md5" and "sha1". Deprecate and disable the internal "SHA collect" feature. This option had been hidden behind C #ifdef checks for an option that wasn't exposed through CMake, so it was basically unavailable anyways. Changes to calculate file hashes when they're needed and no sooner. For the FP feature in the matcher module, I have mimicked the optimization in the FMAP scan routine which makes it so that it can calculate multiple hashes in a single pass of the file. The `HandlerType` feature stores a hash of the file in the scan ctx to prevent retyping the exact same data more than once. I removed that hash field and replaced it with an attribute flag that is applied to the new recursion stack layer when retyping a file. This also closes a minor bug that would prevent retyping a file with an all-zero hash. :) The work upgrading cache.c to support SHA2-256 sized hashes is thanks to: https://github.com/m-sola CLAM-255 CLAM-1858 CLAM-1859 CLAM-1860	2025-08-14 21:23:30 -04:00
Val Snyder	7ff29b8c37	Bump copyright dates for 2025	2025-02-14 10:24:30 -05:00
Micah Snyder	47dfe9bd5d	Remove libjson-c dead code As of ClamAV 0.105, libjson-c is required. There is also no option to disable libjson-c support. This commit removes the dead code associated with the old build option.	2024-04-13 12:34:15 -04:00
RainRat	143d23c326	Fix typos and remove duplicate #include	2024-04-10 19:31:46 -04:00
Micah Snyder	405829ee88	Refine max-allocation and safer-allocation function and macro names We add the _OR_GOTO_DONE suffix to the macros that go to done if the allocation fails. This makes it obvious what is different about the macro versus the equivalent function, and that error handling is built-in. Renamed the cli_strdup to safer_strdup to make it obvious that it exists because it is safer than regular strdup. Regular strdup doesn't have the NULL check before trying to dup, and so may result in a NULL-deref crash. Also remove unused STRDUP (_OR_GOTO_DONE) macro, since the one with the NULL-check is preferred.	2024-03-15 13:18:47 -04:00
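The naming convention in commit 405829ee88 can be illustrated with a small sketch. The macro and function below are hypothetical stand-ins written for this example, not the libclamav definitions: the macro name carries the `_OR_GOTO_DONE` suffix to signal that it jumps to a `done` label in the calling function on failure, and the strdup variant checks for NULL before duplicating.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical macro in the spirit of the `_OR_GOTO_DONE` suffix:
 * allocate, and jump to the caller's `done` label if that fails. */
#define SKETCH_MALLOC_OR_GOTO_DONE(var, size) \
    do {                                      \
        (var) = malloc(size);                 \
        if ((var) == NULL) {                  \
            goto done;                        \
        }                                     \
    } while (0)

/* Hypothetical NULL-checking strdup: returns NULL for a NULL input
 * instead of passing NULL to strlen(). */
static char *sketch_safer_strdup(const char *s)
{
    char *copy = NULL;
    size_t len;

    if (s == NULL) {
        return NULL;
    }

    len = strlen(s) + 1; /* include the terminating NUL */
    SKETCH_MALLOC_OR_GOTO_DONE(copy, len);
    memcpy(copy, s, len);

done:
    return copy; /* NULL if the allocation failed */
}
```

The suffix makes the hidden control flow visible at the call site, which is the point the commit message makes about telling the macro apart from the equivalent function.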
Micah Snyder	902623972d	Remove max-allocation limits where not required The cli_max_malloc, cli_max_calloc, and cli_max_realloc functions provide a way to protect against allocating too much memory when the size of the allocation is derived from the untrusted input. Specifically, we worry about values in the file being scanned being manipulated to exhaust the RAM and crash the application. There is no need to check the limits if the size of the allocation is fixed, or if the size of the allocation is necessary for signature loading, or the general operation of the applications. E.g. checking the max-allocation limit for the size of a hash, or for the size of the scan recursion stack, is a complete waste of time. Although we significantly increased the max-allocation limit in a recent release, it is best not to check an allocation if the allocation will be safe. It would be a waste of time. I am also hopeful that if we can reduce the number of allocations that require a limit-check to those that require it for the safe scan of a file, then eventually we can store the limit in the scan context, and make it configurable.	2024-03-15 13:18:47 -04:00
Micah Snyder	8e04c25fec	Rename clamav memory allocation functions We have some special functions to wrap malloc, calloc, and realloc to make sure we don't allocate more than some limit, similar to the max-filesize and max-scansize limits. Our wrappers are really only needed when allocating memory for scans based on untrusted user input, where a scan file could have bytes that claim you need to allocate some ridiculous amount of memory. Right now they're named: - cli_malloc - cli_calloc - cli_realloc - cli_realloc2 ... and these names do not convey their purpose This commit renames them to: - cli_max_malloc - cli_max_calloc - cli_max_realloc - cli_max_realloc2 The realloc ones also have an additional feature in that they will not free your pointer if you try to realloc to 0 bytes. Freeing the memory is undefined by the C spec, and only done with some realloc implementations, so this stabilizes on the behavior of not doing that, which should prevent accidental double-free's. So for the case where you may want to realloc and do not need to have a maximum, this commit adds the following functions: - cli_safer_realloc - cli_safer_realloc2 These are used for the MPOOL_REALLOC and MPOOL_REALLOC2 macros when MPOOL is disabled (e.g. because mmap-support is not found), so as to match the behavior in the mpool_realloc/2 functions that do not make use of the allocation-limit.	2024-03-15 13:18:47 -04:00
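The zero-size behavior that commit 8e04c25fec describes for the realloc wrappers can be sketched as follows. This is an illustration under assumptions, not the libclamav implementation: `sketch_safer_realloc` is a hypothetical name, and the only behavior taken from the commit message is that a request for 0 bytes must not free the caller's pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper, not the libclamav implementation. */
static void *sketch_safer_realloc(void *ptr, size_t size)
{
    if (size == 0) {
        /* Refuse the request and leave `ptr` untouched, so the caller
         * still owns it and a later free(ptr) is not a double-free. */
        return NULL;
    }

    /* On failure realloc() returns NULL and `ptr` remains valid. */
    return realloc(ptr, size);
}
```

With this shape, a NULL return always means "the original pointer is still yours", whichever branch produced it.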
Micah Snyder	6d6e04ddf8	Optimization: replace limited allocation calls There are a large number of allocations for fixed-size buffers using the `cli_malloc` and `cli_calloc` calls that check if the requested size is larger than our allocation threshold for allocations based on untrusted input. These allocations will always be lower than the threshold, so the extra stack frame and check for these calls is a waste of CPU. This commit replaces needless calls with A -> B: - cli_malloc -> malloc - cli_calloc -> calloc - CLI_MALLOC -> MALLOC - CLI_CALLOC -> CALLOC I also noticed that our MPOOL_MALLOC / MPOOL_CALLOC are not limited by the max-allocation threshold, when MMAP is found/enabled. But the alternative was set to cli_malloc / cli_calloc when disabled. I changed those as well. I didn't change the cli_realloc/2 calls because our version of realloc not only implements a threshold but also stabilizes the undefined behavior in realloc to protect against accidental double-free's. It may be worth implementing a cli_realloc that doesn't have the threshold built-in, however, so as to allow reallocations for things like buffers for loading signatures, which aren't subject to the same concern as allocations for scanning possible malware. There was one case in mbox.c where I changed MALLOC -> CLI_MALLOC, because it appears to be allocating based on untrusted input.	2024-03-15 13:18:47 -04:00
Micah Snyder	9cb28e51e6	Bump copyright dates for 2024	2024-01-22 11:27:17 -05:00
RainRat	caf324e544	Fix typos (no functional changes)	2023-11-26 18:01:19 -05:00
Micah Snyder	ba4a561d71	Resolve Coverity assignment of overlapping memory warnings Coverity is unhappy with the use of the EC32, cli_readint32, and cli_writeint32 macros (and the 64bit equivalents) to potentially change the endianness of variables in place. It claims: overlapping_assignment: Assigning ... to ..., which have overlapping memory locations and different types. Using a temporary variable in between reading and writing should resolve these "high impact" complaints. Resolves: Coverity-225232, 225225, 225215, 225212, 225180, 225170, 225165, 225161, 225159.	2023-04-26 10:43:13 -07:00
Micah Snyder	38386349c5	Fix many warnings	2023-04-13 00:11:34 -07:00
Micah Snyder	6bed7580ab	Coverity-405726, 405725: Fix overlapping copy complaint Fix issue introduced during 1.1 dev. Fix coverity-405726 coverity-405725.	2023-04-13 00:11:34 -07:00
Sebastian Andrzej Siewior	d2a0bb8275	libclamav/pe: Convert struct pe_image_data_dir to native endian. A few users of VirtualAddress and Size in cli_exe_info::pe_image_data_dir don't use the endian wrapper while other places do. This leads to testsuite failures on big endian machines. Convert the content of struct pe_image_data_dir to native format so that the EC32() conversion can be removed. Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>	2023-03-17 11:31:46 -07:00
Micah Snyder	6eebecc303	Bump copyright for 2023	2023-02-12 11:20:22 -08:00
Micah Snyder	f7b139a776	PE, ELF, Mach-O: code cleanup The header parsing / executable metadata collecting functions for the PE, ELF, and Mach-O file types were using `int` for the return type. Mostly they were returning 0 for success and -1, -2, -3, or -4 for failure. But in some cases they were returning cl_error_t enum values for failure. Regardless, the function using them was treating 0 as success and non-zero as failure, which it stored as -1 ... every time. This commit switches them all to use cl_error_t. I am continuing to store the final result as 0 / -1 in the `peinfo` struct, but outside of that everything has been made consistent. While I was working on that, I got a tad sidetracked. I noticed that the target type isn't an enum, or even a set of #defines. So I made an enum and then changed the code that uses target types to use the enum. I also removed the `target` parameter from a number of functions that don't actually use it at all. Some recursion was masking the fact that it was an unused parameter which is why there was no warning about it.	2022-10-19 13:13:57 -07:00
Micah Snyder	7d64f7b114	PE: Remove allmatch checks + minor code cleanup And fix issue where call to magic_scan may not propagate critical errors.	2022-10-19 13:13:57 -07:00
Micah Snyder	66f48d3e05	Strong indicator precedence over PUA / Heuristic detections Signatures that start with "PUA.", "Heuristics.", or "BC.Heuristics." are perceived to be less serious, or more likely to have false positives, than other signatures that we would think of as "strong indicators". At present, only a subset of "Heuristics." signatures, such as those added by the phishing module, are added as "potentially unwanted". Unless you're using heuristic-precedence mode, these "potentially unwanted" indicators are recorded but not reported unless no other signature alerts. This behavior should apply to all signatures that start with "PUA." and "Heuristics.". We already do a string match comparison on the signature name to apply that behavior to bytecode matches that start with "BC.Heuristics.". I moved that string comparison logic used for "BC.Heuristics." into the main `cl_append_virus()` function and extended it to cover the other two cases. I also replaced all hardcoded calls to append "Heuristics." signatures to append using the `cli_append_potentially_unwanted()` function, so we can skip the string compare in these cases. That function will of course append them as strong indicators if heuristic-precedence mode is enabled.	2022-10-19 13:13:57 -07:00
Micah Snyder	858b541a51	Matcher: Remove allmatch checks and significantly tidy code Significantly tidy the `cli_scan_fmap()` function, and add comments to explain how it all works. Add SHA1 and SHA256 digest variables to the FMAP structure in addition to the existing MD5. Add a function to set the hash so that when we calculate the hashes for hash matching, we save them for subsequent FP matching. This enabled me to remove the extra "hash-only" FP check from `cli_scan_fmap()`. This will also make it easier to switch the clean cache hash algorithm to SHA256 in the future. Remove extra allmatch checks that are no longer required. Add a new header to prevent #include order issues.	2022-10-19 13:13:57 -07:00
Micah Snyder	621381e0cd	Allmatch-mode overhaul, part 1: append_virus Rework the append_virus mechanism to store evidence (strong indicators, pua indicators, and eventually weak indicators) in vectors. When appending a "virus", we will return CLEAN when in allmatch-mode, and simply add the indicator to the appropriate vector. Later we can check if there were any alerts to return a vector by summing the lengths of the strong and pua indicator vectors. This does away with storing the latest "virname" in the scan context. Instead, we can query for the last indicator in the evidence, giving priority to strong indicators. When heuristic-precedence is enabled, add PUA as Strong instead of as PotentiallyUnwanted. This way, they will be treated equally and reported in order in allmatch mode. Also document reason for disabling cache with metadata JSON enabled	2022-10-19 13:13:57 -07:00
Micah Snyder	cd3134568a	Code quality: Refactor layer attributes as scan parameter The current implementation sets a "next layer attributes" flag field in the scan context. This may introduce bugs if accidentally not cleared during error handling, causing that attribute to be applied to a different layer than intended. This commit resolves that by adding an attribute flag to the major internal scan functions and removing the "next layer attributes" from the scan context. This attributes flag shares the same flag fields as the attributes flag in the new file inspection callback and the flags are defined in `clamav.h`.	2022-10-13 08:57:44 -07:00
Micah Snyder	8011786315	PE parser error handling, type safety, warnings In the PE parser when reading up to 4KB following the entrypoint there is no call to verify if the read failed. Later it is assumed that the read succeeded and that the data in the buffer is valid. I believe the correct response is to bail out if the read failed. I also fixed some warnings: - The max # of PE sections was effectively disabled by setting it to the max size of a uint16_t, so the max-check was pointless. - Some undocumented switch fall-throughs were throwing warnings as well. - Unsigned integer subtraction results in a signed value, which was throwing warnings when compared with another unsigned value. The subtraction for `peinfo->nsections - 1` won't be less than 0 though because we've already verified that nsections != 0 so we can just cast the result of the subtraction back to an unsigned value to silence the warning safely.	2022-05-01 12:24:19 -07:00
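The signed/unsigned warning described in commit 8011786315 comes from integer promotion. The sketch below assumes a 16-bit section count and a platform where `int` is wider than 16 bits; `is_last_section` is a hypothetical helper written for this example, not a function from the PE parser.

```c
#include <assert.h>
#include <stdint.h>

/* `nsections` is promoted to int before the subtraction, so
 * `nsections - 1` has type int. Comparing that with an unsigned index
 * triggers a sign-compare warning; the cast is safe because the caller
 * has already ruled out nsections == 0. */
static int is_last_section(uint16_t nsections, unsigned int i)
{
    assert(nsections != 0);
    return i == (unsigned int)(nsections - 1);
}
```

The cast only documents a conversion that is already known to be non-negative; it does not replace the earlier `nsections != 0` check.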
Micah Snyder	9e14ffab36	PE parser: fix recently introduced NULL dereference crash Commit `f82492aef4` fixed a crash in Windows debug builds but in so doing accidentally introduced a possible crash when scanning PE files that lack import tables. The issue being that the openssl hashing functions try to "finish" a hash that was never started. This commit fixes the issue by returning CL_BREAK instead of CL_SUCCESS when the import table doesn't exist or RVA is invalid so that we can differentiate between successfully calculating the hashes and successfully skipping the hashing process.	2022-03-29 12:48:06 -07:00
Micah Snyder	f82492aef4	Windows: Fix crash in Debug builds when scanning some PE files The endianness conversion when reading the PE image import descriptor is making the change in-place in the fmap. On Windows, the fmap is read-only and so in Debug builds that's causing a crash. This uses a buffer on the stack and copies each image import descriptor before doing the conversions and then processing each.	2022-03-02 20:29:07 -08:00
mko-x	a21cc6dcd7	Add explicit log level parameter to application logging API * Added loglevel parameter to logg() * Fix logg and mprintf internals with new loglevels * Update all logg calls to set loglevel * Update all mprintf calls to set loglevel * Fix hidden logg calls * Executed clam-format	2022-02-15 15:13:55 -08:00
Micah Snyder	c24654d244	Fix all-match mode bug in PE section hash scans The PE section hash scanning code didn't implement the all-match check. While this check isn't the ideal implementation for all-match mode... (see the commit message for the previous commit) ...it's simple enough to add the all-match check here for now.	2022-01-14 12:51:14 -07:00
micasnyd	140c88aa4e	Bump copyright for 2022 Includes minor format corrections.	2022-01-09 14:23:25 -07:00
Micah Snyder	d1141becac	Fix fmap handle_gets() page arithmetic The fixes to the fmap bounds for nested (duplicate) fmaps added recently introduced a subtle arithmetic bug that was detected by OSS-Fuzz: `scanat = m->nested_offset + at % m->pgsz;` should have been: `scanat = (m->nested_offset + at) % m->pgsz;` Without the parentheses, `scanat` could be > `m->pgsz`, which would overflow in the subsequent `memchr()` call. See: - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40452 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40455 This commit also tightens up some of the other bounds checks done with `CLI_ISCONTAINED()` macro so the check limits the bounds to the nested fmap and not the original map. In addition, I've added a `CLI_ISCONTAINED_0_TO()` macro that removes checks when the "bigger" buffer starts at offset 0. This should silence a bunch of (benign) warnings and medium severity Coverity issues. There is also a possible use of an uninitialized variable (`old_hook_lsig_matches`) in `cli_magic_scan()`. Finally, I also removed an unnecessary NULL-check on `filebase` in `fmap_dup_to_file()` that Coverity was unhappy with.	2021-10-29 12:35:15 -07:00
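The precedence bug fixed in commit d1141becac is easy to reproduce in isolation. The sketch below uses a hypothetical `struct sketch_map` with only the two fields named in the commit message; it is not the real fmap structure. In C, `%` binds tighter than `+`, so the buggy form reduces only `at` and then adds the full offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields named in the commit message. */
struct sketch_map {
    size_t nested_offset; /* offset of the nested window in the parent map */
    size_t pgsz;          /* page size in bytes */
};

/* Buggy form: only `at` is reduced modulo the page size, so the result
 * can be larger than the page size. */
static size_t scanat_buggy(const struct sketch_map *m, size_t at)
{
    return m->nested_offset + at % m->pgsz;
}

/* Fixed form: the sum is reduced, so the result is always < pgsz. */
static size_t scanat_fixed(const struct sketch_map *m, size_t at)
{
    assert(m->pgsz != 0);
    return (m->nested_offset + at) % m->pgsz;
}
```

With `nested_offset = 5000`, `pgsz = 4096`, and `at = 100`, the buggy form yields 5100, which lies beyond the page, while the fixed form yields 1004.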
Micah Snyder	db013a2bfd	libclamav: Fix scan recursion tracking Scan recursion is the process of identifying files embedded in other files and then scanning them, recursively. Internally this process is more complex than it may sound because a file may have multiple layers of types before finding a new "file". At present we treat the recursion count in the scanning context as an index into both our fmap list AND our container list. These two lists are conceptually a part of the same thing and should be unified. But what's concerning is that the "recursion level" isn't actually incremented or decremented at the same time that we add a layer to the fmap or container lists but instead is more touchy-feely, increasing when we find a new "file". To account for this shadiness, the size of the fmap and container lists has always been a little longer than our "max scan recursion" limit so we don't accidentally overflow the fmap or container arrays (!). I've implemented a single recursion-stack as an array, similar to before, which includes a pointer to each fmap at each layer, along with the size and type. Push and pop functions add and remove layers whenever a new fmap is added. A boolean argument when pushing indicates if the new layer represents a new buffer or new file (descriptor). A new buffer will reset the "nested fmap level" (described below). This commit also provides a solution for an issue where we detect embedded files more than once during scan recursion. 
For illustration, imagine a tarball named foo.tar.gz with this structure:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once. Note that for the GZ layer, it may detect the ZIP and 7Z if the signature hits on the compressed data, which it might, though extracting the ZIP and 7Z will likely fail. The reason the above doesn't happen now is that we restrict embedded type scans for a bunch of archive formats to include GZ and TAR.
(B) If we scan for embedded files at the foo.tar layer, we may detect:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough by only scanning for embedded content in the current fmap when the "nested fmap level" is 0. The upside is that it should safely detect all embedded content, even if it may think the sfx.zip and sfx.7z are in foo.tar instead of in baz.exe. The biggest risk I can think of affects ZIPs. SFXZIP detection is identical to ZIP detection, which is why we don't allow SFXZIP to be detected if inside of a ZIP. If we only allow embedded type scanning at fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP if the bar.exe was not compressed in foo.zip and if non-compressed files extracted from ZIPs aren't extracted as new buffers:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in new buffers, option (B) should be safe.
(C) If we scan for embedded files at the baz.exe layer, we may detect:

| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by restricting 7ZSFX and ZIPSFX detection only when scanning an executable. But that may mean losing detection of archives embedded elsewhere. And we'd have to identify allowable container types for each possible embedded type, which would be very difficult. So this commit aims to solve the issue the (B)-way. Note that in all situations, we still have to scan with file typing enabled to determine if we need to reassign the current file type, such as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-compressed. Detection of DMG and a handful of other types relies on finding data partway through or near the end of a file before reassigning the entire file as the new type. Other fixes and considerations in this commit:
- The utf16 HTML parser has weak error handling, particularly with respect to creating a nested fmap for scanning the ascii decoded file. This commit cleans up the error handling and wraps the nested scan with the recursion-stack push()/pop() for correct recursion tracking. Before this commit, each container layer had a flag to indicate if the container layer is valid. We need something similar so that the cli_recursion_stack_get_() functions ignore normalized layers. Details... Imagine an LDB signature for HTML content that specifies a ZIP container. 
  If the signature actually alerts on the normalized HTML and you don't ignore normalized layers for the container check, it will appear as though the alert is in an HTML container rather than a ZIP container.

  This commit accomplishes this with a boolean you set in the scan context before scanning a new layer. Then when the new fmap is created, it will use that flag to set a similar flag for the layer. The context flag is then reset so that anything after this doesn't have that flag.

  The flag allows the new recursion_stack_get() function to ignore normalized layers when iterating the stack to return a layer at a requested index, negative or positive.

  Scanning extracted/normalized JavaScript and VBA should also use the 'layer is normalized' flag.

- This commit also fixes the Heuristic.Broken.Executable alert for ELF files to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it respects the FP check).

  B) all broken-executable alerts for ELF only happen if the SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir(). This was needed so we could correctly apply the layer-is-normalized-flag to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded file detection scans to abort. Granted we don't actually want to abort if max filesize or max recursion depth are exceeded... only if max scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on correct error propagation for fatal conditions. Instead, setting this flag in the scan context should guarantee that a fatal condition deep in scan recursion isn't lost, which would result in more stuff being scanned instead of aborting.
  This shouldn't be necessary, but some status codes like CL_ETIMEOUT never used to be fatal and it's easier to do this than to verify every parser only returns CL_ETIMEOUT and other "fatal status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filetypes.c and include is_tar.h instead.

- Presently we create the fmap hash when creating the fmap. This wastes a bit of CPU if the hash is never needed. Now that we're creating fmaps for all embedded files discovered with file type recognition scans, this is a much more frequent occurrence and really slows things down. This commit fixes the issue by only creating fmap hashes as needed. This should not only resolve the performance impact of creating fmaps for all embedded files, but also should improve performance in general.

- Add allmatch check to the zip parser after the central-header meta match. That way we don't get multiple alerts with the same match except in allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize, recursion depth, # of embedded files, and scantime are always reported if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may mask signature matches later on. I changed it so these alerts use the "possibly unwanted" alert-type and thus only alert if no other alerts were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded." events to the JSON metadata when the --gen-json feature is enabled. These will show up once under "ParseErrors" the first time a limit is exceeded. In the present implementation, only one limits-exceeded event will be added, so as to prevent a malicious or malformed sample from filling the JSON buffer with millions of events and using a tonne of RAM.	2021-10-25 16:02:29 -07:00
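The index-based stack lookup that skips normalized layers, as described in the commit message above, can be sketched roughly as follows. This is a minimal illustration, not ClamAV's implementation: `layer_t`, `stack_get()`, and the type constants are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file-type constants, for illustration only. */
enum { TYPE_ZIP = 1, TYPE_HTML = 2 };

/* Hypothetical, simplified layer record (not ClamAV's actual struct). */
typedef struct {
    int type;           /* file type of this layer */
    bool is_normalized; /* true for normalized HTML/JS/VBA layers */
} layer_t;

/*
 * Return the layer at `index`, counting only non-normalized layers.
 *   index >= 0 counts from the outermost layer (0 = top-level file).
 *   index <  0 counts from the innermost layer (-1 = innermost).
 * Returns NULL if there is no such layer.
 */
const layer_t *stack_get(const layer_t *stack, size_t depth, int index)
{
    assert(stack != NULL || depth == 0);

    if (index >= 0) {
        int seen = 0;
        for (size_t i = 0; i < depth; i++) {
            if (stack[i].is_normalized)
                continue; /* normalized layers are invisible to lookups */
            if (seen == index)
                return &stack[i];
            seen++;
        }
    } else {
        int seen = -1;
        for (size_t i = depth; i > 0; i--) {
            if (stack[i - 1].is_normalized)
                continue;
            if (seen == index)
                return &stack[i - 1];
            seen--;
        }
    }
    return NULL;
}
```

With a stack of ZIP, HTML, and normalized HTML, index -1 resolves to the raw HTML layer and index -2 to the ZIP, so a container check made while scanning the normalized layer still sees the ZIP rather than the HTML.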
Tim Gates	251befbdf3	docs: Fix a few typos docs: Fix a few typos There are small typos in: - libclamav/others_common.c - libclamav/pe.c - libclamav/unzip.c Fixes: - Should read `descriptor` rather than `desriptor`. - Should read `record` rather than `reocrd`. - Should read `overarching` rather than `overaching`.	2021-08-09 15:41:17 -07:00
Micah Snyder	971a12ddb9	Clang-format cleanup	2021-07-17 10:39:27 -07:00
Andrew	1306d100ee	Support SHA256-based .cat files and related improvements/bugfixes Trusted SHA256-based Authenticode hashes can now be loaded in from .cat files. In addition: - Files that are covered by Authenticode hashes loaded in from .cat files will now be treated as VERIFIED like executables where the embedded Authenticode sig is deemed to be trusted based on .crb rules. This fixes a regression introduced in 0.102 (I think). - The Authenticode hashes for signed EXEs without .crb coverage will no longer be computed in cli_check_auth_header unless hashes from .cat rules have been loaded. This fixes a slight performance regression introduced in 0.102 (I think).	2021-07-16 14:42:12 -07:00
Andrew Williams	1df4f82f2b	libclamav: Increase max PE section count to 65535 Windows XP had a maximum section count of 96, and this has been the max for ClamAV forever as well. Raising this prevents malicious executables from being able to evade certain ClamAV signatures by having 97 or more sections.	2021-07-12 22:39:36 -07:00
Micah Snyder	0255f29a72	Blacklist & Whitelist verbiage

Improvements to use modern block list and allow list verbiage.

blacklist -> block list
whitelist -> allow list
blacklisted -> blocked
whitelisted -> allowed

In the case of certificate verification, use "trust" or "verify" when something is allowed. Also changed domainlist -> domain list (or DomainList) to match.	2021-05-27 14:16:00 -07:00
Micah Snyder (micasnyd)	b9ca6ea103	Update copyright dates for 2021 Also fixes up clang-format.	2021-03-19 15:12:26 -07:00
ihsinme	5f698f3842	Fix unsigned arithmetic checks Fixes unsigned arithmetic checks in PE and PNG parsers.	2021-02-17 11:43:00 -08:00
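The commit message above does not say which checks were fixed. As a general illustration of the bug class, a bounds check on unsigned values must avoid expressions that can wrap around. The sketch below uses a hypothetical helper, `range_in_bounds()`, that is not taken from the PE or PNG parsers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does [offset, offset + length) fit inside a
 * buffer of `total` bytes?
 *
 * The naive test `offset + length <= total` is wrong for unsigned
 * operands because the sum can wrap around to a small value. Likewise,
 * `total - offset - length >= 0` is always true for unsigned types.
 * Comparing against the remaining space avoids both problems.
 */
bool range_in_bounds(uint32_t offset, uint32_t length, uint32_t total)
{
    if (offset > total) {
        return false; /* start is already past the end */
    }
    /* total - offset cannot wrap here because offset <= total */
    return length <= total - offset;
}
```

For example, `range_in_bounds(0xFFFFFFF0u, 0x20u, 0x100u)` is rejected, whereas the naive sum would wrap to 0x10 and pass.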
Micah Snyder	c110392780	Change permission for new tmp files from RWX to RW	2020-06-03 11:00:53 -04:00
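As a rough POSIX sketch of what creating a temp file with RW rather than RWX permissions looks like (the actual call sites changed by the commit are not shown here, and `create_tmp_rw()` is a hypothetical name):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a new temp file that is readable and
 * writable by the owner only (mode 0600). Requesting S_IRWXU (0700)
 * instead would also set the execute bit, which a scratch file for
 * extracted data does not need.
 *
 * O_EXCL makes the call fail if the path already exists.
 * Returns a file descriptor, or -1 on error.
 */
int create_tmp_rw(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
}
```

The process umask can only clear bits from the requested mode, so the resulting file never gains execute permission.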
Micah Snyder	9b9999d778	Rename core scanning functions

Many of the core scanning functions' names no longer represent their specific purpose or arguments. This commit aims to make the names more intuitive.

Names are now prefixed with "magic" if they involve file-typing and file-type parsing. In addition, each function now includes the type of input being scanned, whether it's "desc", "fmap", or "buff". Some of the APIs also now specify "type" to indicate that a type other than "ANY" may be passed in to select the type rather than use file type magic for type recognition.

| current name              | new name                          |
| ------------------------- | --------------------------------- |
| magic_scandesc()          | cli_magic_scan()                  |
| cli_magic_scandesc_type() | <delete>                          |
| cli_magic_scandesc()      | cli_magic_scan_desc()             |
| cli_base_scandesc()       | cli_magic_scan_desc_type()        |
| cli_partition_scandesc()  | <delete>                          |
| cli_map_scandesc()        | magic_scan_nested_fmap_type()     |
| cli_map_scan()            | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc()        | cli_magic_scan_buff()             |
| cli_scanbuff()            | cli_scan_buff()                   |
| cli_scandesc()            | cli_scan_desc()                   |
| cli_fmap_scandesc()       | cli_scan_fmap()                   |
| cli_scanfile()            | cli_magic_scan_file()             |
| cli_scandir()             | cli_magic_scan_dir()              |
| cli_filetype2()           | cli_determine_fmap_type()         |
| cli_filetype()            | cli_compare_ftm_file()            |
| cli_partitiontype()       | cli_compare_ftm_partition()       |
| cli_scanraw()             | scanraw()                         |
	2020-06-03 11:00:40 -04:00
Micah Snyder	005cbf5a37	Record names of extracted files

A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata properties recorded when using the --gen-json clamscan option. Future work may use this to compare file extensions with detected file types.
2. File names are useful when interpreting tmp directory output when using the --leave-temps option.

This commit enables file name retention for later use by storing file names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to any internal scan APIs that create fmaps, and every call to these APIs has been modified to pass a file name or NULL if a file name is not required.

The zip and gpt parsers required some modification to record file names. The NSIS and XAR parsers fail to collect file names at all and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the --leave-temps option is enabled. When not enabled, the tmp directory structure remains flat so as to reduce the likelihood of exceeding MAX_PATH. The current tmp directory is stored in the scan context.
- Made the cli_scanfile() internal API non-static and added it to scanners.h so it would be accessible outside of scanners.c in order to remove code duplication within libmspack.c.
- Added function comments to scanners.h and matcher.h.
- Converted TDB-type macros and LSIG-type macros to enums for improved type safety.
- Converted more return status variables from `int` to `cl_error_t` for improved type safety, and corrected ooxml file typing functions so they use `cli_file_t` exclusively rather than mixing types with `cl_error_t`.
- Restructured the magic_scandesc() function to use goto's for error handling and removed the early_ret_from_magicscan() macro and magic_scandesc_cleanup() function.
  This makes the code easier to read and made it easier to add the recursive tmp directory cleanup to magic_scandesc().
- Corrected zip, egg, rar filename extraction issues.
- Removed use of extra sub-directory layer for zip, egg, and rar file extraction. For Zip, this also involved changing the extracted filenames to be randomly generated rather than using the "zip.###" file name scheme.	2020-06-03 10:39:18 -04:00
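The goto-based error handling mentioned in the commit message above is a common C idiom: every failure path jumps to a single label that releases resources. The sketch below illustrates the general shape only; `checksum_copy()` and `ex_status_t` are hypothetical and unrelated to the real magic_scandesc() code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical status type standing in for a cl_error_t-style enum. */
typedef enum { EX_SUCCESS = 0, EX_EARG, EX_EMEM } ex_status_t;

/*
 * Copy `data` into a scratch buffer and sum its bytes. All exits go
 * through the `done` label, so the cleanup code exists in one place.
 */
ex_status_t checksum_copy(const unsigned char *data, size_t len,
                          unsigned long *sum_out)
{
    ex_status_t status  = EX_EARG;
    unsigned char *copy = NULL;
    unsigned long sum   = 0;
    size_t i;

    if (data == NULL || sum_out == NULL || len == 0) {
        goto done; /* status is already EX_EARG */
    }

    copy = malloc(len);
    if (copy == NULL) {
        status = EX_EMEM;
        goto done;
    }
    memcpy(copy, data, len);

    for (i = 0; i < len; i++) {
        sum += copy[i];
    }
    *sum_out = sum;
    status   = EX_SUCCESS;

done:
    free(copy); /* free(NULL) is a no-op, so this is safe on every path */
    return status;
}
```

Adding another resource later (for example a temp directory to remove) only requires one more cleanup line under `done`, which is the kind of change the commit message says the restructuring made easier.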
Jonas Zaddach (jzaddach)	d5a733ef90	XLM (Excel 4.0) macro detection and extraction XLM is a macro language in Excel that was used before VBA (before 1996). It is still parsed and executed by modern Excel and is gaining popularity with malware authors. This patch adds rudimentary support for detecting and extracting Excel 4.0 (XLM) macros. The code is based on Didier Stevens' plugin_biff for oletools.py.	2020-04-29 14:19:41 -07:00
Micah Snyder	898c08f08b	Formatting touch-up	2020-01-03 15:53:29 -05:00
Micah Snyder	206dbaefe8	Update copyright dates for 2020	2020-01-03 15:44:07 -05:00
Andrew	07990918f7	Handle case where Authenticode sig directly follows PE header	2019-11-13 14:05:28 -08:00


301 commits