clamav

mirror of https://github.com/Cisco-Talos/clamav.git synced 2025-10-19 10:23:17 +00:00

Author	SHA1	Message	Date
Val S.	0462dae12a	Increase limit for finding PE files embedded in other PE files I am seeing missed detections since we changed to prohibit embedded file type identification when inside an embedded file. In particular, I'm seeing this issue with PE files that contain multiple other MSEXE as well as a variety of false positives for PE file headers. For example, imagine a PE with four concatenated DLL's, like so: ``` [ EXE file \| DLL #1 \| DLL #2 \| DLL #3 \| DLL #4 ] ``` And note that false positives for embedded MSEXE files are fairly common. So there may be a few mixed in there. Before limiting embedded file identification we might interpret the file structure something like this: ``` MSEXE: { embedded MSEXE #1: false positive, embedded MSEXE #2: false positive, embedded MSEXE #3: false positive, embedded MSEXE #4: DLL #1: { embedded MSEXE #1: false positive, embedded MSEXE #2: DLL #2: { embedded MSEXE #1: DLL #3: { embedded MSEXE #1: false positive, embedded MSEXE #2: false positive, embedded MSEXE #3: false positive, embedded MSEXE #4: false positive, embedded MSEXE #5: DLL #4 } embedded MSEXE #2: false positive, embedded MSEXE #3: false positive, embedded MSEXE #4: false positive, embedded MSEXE #5: false positive, embedded MSEXE #6: DLL #4 } embedded MSEXE #3: DLL #3, embedded MSEXE #4: false positive, embedded MSEXE #5: false positive, embedded MSEXE #6: false positive, embedded MSEXE #7: false positive, embedded MSEXE #8: DLL #4 } } ``` This is obviously terrible, which is why we don't allow detecting embedded files within other embedded files. 
So after we enforce that limit, the same file may be interpreted like this instead: ``` MSEXE: { embedded MSEXE #1: false positive, embedded MSEXE #2: false positive, embedded MSEXE #3: false positive, embedded MSEXE #4: DLL #1, embedded MSEXE #5: false positive, embedded MSEXE #6: DLL #2, embedded MSEXE #7: DLL #3, embedded MSEXE #8: false positive, embedded MSEXE #9: false positive, embedded MSEXE #10: false positive, embedded MSEXE #11: false positive, embedded MSEXE #12: DLL #4 } ``` That's great! Except that we now exceed the "MAX_EMBEDDED_OBJ" limit for embedded type matches (limit 10, but 12 found). That means we won't see or extract the 4th DLL anymore. My solution is to lift the limit when adding a matched MSEXE type. We already do this for matched ZIPSFX types. While doing this, I've significantly tidied up the limit checks to make them more readable, and removed duplicate checks from within the `ac_addtype()` function. CLAM-2897	2025-10-14 14:05:12 -04:00
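The limit change described in 0462dae12a can be sketched roughly as follows. This is a minimal illustration, not the actual `ac_addtype()` code: every `sketch_`/`SKETCH_` name is hypothetical, and only the limit value (10) and the two exempt types (MSEXE, ZIPSFX) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "MAX_EMBEDDED_OBJ" limit named in the commit. */
#define SKETCH_MAX_EMBEDDED_OBJ 10

/* Hypothetical type tags; the real code uses ClamAV's own file type enum. */
typedef enum {
    SKETCH_TYPE_MSEXE,
    SKETCH_TYPE_ZIPSFX,
    SKETCH_TYPE_OTHER
} sketch_type_t;

typedef struct {
    size_t count; /* embedded type matches recorded so far */
} sketch_matches_t;

/* Record one embedded type match.
 * Returns true if the match was recorded, false if the limit dropped it.
 * MSEXE and ZIPSFX matches are exempt from the limit. */
bool sketch_add_type(sketch_matches_t *m, sketch_type_t type)
{
    assert(m != NULL);

    bool exempt = (type == SKETCH_TYPE_MSEXE) || (type == SKETCH_TYPE_ZIPSFX);
    if (!exempt && m->count >= SKETCH_MAX_EMBEDDED_OBJ) {
        return false;
    }

    m->count++;
    return true;
}
```

With this shape, the twelve MSEXE matches from the example above would all be recorded, while an eleventh match of a non-exempt type would still be dropped.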
Val Snyder	7ff29b8c37	Bump copyright dates for 2025	2025-02-14 10:24:30 -05:00
Micah Snyder	405829ee88	Refine max-allocation and safer-allocation function and macro names We add the _OR_GOTO_DONE suffix to the macros that go to done if the allocation fails. This makes it obvious what is different about the macro versus the equivalent function, and that error handling is built-in. Renamed the cli_strdup to safer_strdup to make it obvious that it exists because it is safer than regular strdup. Regular strdup doesn't have the NULL check before trying to dup, and so may result in a NULL-deref crash. Also remove unused STRDUP (_OR_GOTO_DONE) macro, since the one with the NULL-check is preferred.	2024-03-15 13:18:47 -04:00
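As a rough illustration of the naming convention in 405829ee88: a NULL-checking strdup, plus a macro whose `_OR_GOTO_DONE` suffix signals the built-in jump to a `done` label. The `sketch_` names and the exact macro shape are assumptions, not the libclamav definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical NULL-safe strdup: returns NULL for NULL input instead of
 * dereferencing it, and NULL if the allocation fails. */
char *sketch_safer_strdup(const char *s)
{
    if (s == NULL) {
        return NULL;
    }
    size_t n    = strlen(s) + 1;
    char *copy  = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Hypothetical macro form: the suffix makes the hidden control flow explicit.
 * The calling function must provide a `done:` label. */
#define SKETCH_STRDUP_OR_GOTO_DONE(dst, src)    \
    do {                                        \
        (dst) = sketch_safer_strdup(src);       \
        if ((dst) == NULL) {                    \
            goto done;                          \
        }                                       \
    } while (0)

/* Example caller: returns 0 on success, -1 if the copy could not be made. */
int sketch_copy_name(const char *name, char **out)
{
    int status = -1;
    char *copy = NULL;

    assert(out != NULL);

    SKETCH_STRDUP_OR_GOTO_DONE(copy, name);
    *out   = copy;
    status = 0;

done:
    return status;
}
```

The caller owns the returned string and releases it with `free()`.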
Micah Snyder	39070d1c76	Remove additional memory allocation limits relating to signature load Variables like the number of signature parts are considered trusted user input and so allocations based on those values need not check the memory allocation limit. Specifically for the allocation of the normalized buffer in cli_scanscript, I determined that the size of SCANBUFF is fixed and so safe, and the maxpatlen comes from the signature load and is therefore also trusted, so we do not need to check the allocation limit.	2024-03-15 13:18:47 -04:00
Micah Snyder	8e04c25fec	Rename clamav memory allocation functions We have some special functions to wrap malloc, calloc, and realloc to make sure we don't allocate more than some limit, similar to the max-filesize and max-scansize limits. Our wrappers are really only needed when allocating memory for scans based on untrusted user input, where a scan file could have bytes that claim you need to allocate some ridiculous amount of memory. Right now they're named: - cli_malloc - cli_calloc - cli_realloc - cli_realloc2 ... and these names do not convey their purpose This commit renames them to: - cli_max_malloc - cli_max_calloc - cli_max_realloc - cli_max_realloc2 The realloc ones also have an additional feature in that they will not free your pointer if you try to realloc to 0 bytes. Freeing the memory is undefined by the C spec, and only done with some realloc implementations, so this stabilizes on the behavior of not doing that, which should prevent accidental double-free's. So for the case where you may want to realloc and do not need to have a maximum, this commit adds the following functions: - cli_safer_realloc - cli_safer_realloc2 These are used for the MPOOL_REALLOC and MPOOL_REALLOC2 macros when MPOOL is disabled (e.g. because mmap-support is not found), so as to match the behavior in the mpool_realloc/2 functions that do not make use of the allocation-limit.	2024-03-15 13:18:47 -04:00
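The two realloc behaviours described in 8e04c25fec might look roughly like this. It is a sketch under stated assumptions: the `sketch_` names, the 1 GiB cap, and the choice to return NULL for a zero-size request are all illustrative, not taken from libclamav.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed cap for illustration only; the real limit is defined by ClamAV. */
#define SKETCH_MAX_ALLOCATION ((size_t)1 << 30)

/* realloc() without the zero-size ambiguity: a request for 0 bytes is
 * refused and the original pointer stays valid, so the caller cannot
 * accidentally free it twice. No allocation limit is applied. */
void *sketch_safer_realloc(void *ptr, size_t size)
{
    if (size == 0) {
        return NULL; /* ptr is untouched and still owned by the caller */
    }
    return realloc(ptr, size);
}

/* Same behaviour, plus an upper bound for sizes derived from untrusted
 * input such as fields read from a scanned file. */
void *sketch_max_realloc(void *ptr, size_t size)
{
    if (size == 0 || size > SKETCH_MAX_ALLOCATION) {
        return NULL; /* ptr is untouched and still owned by the caller */
    }
    return realloc(ptr, size);
}
```

In both functions a NULL return leaves the original block allocated, so the caller keeps using or freeing the old pointer as with a failed standard `realloc()`.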
Micah Snyder	6d6e04ddf8	Optimization: replace limited allocation calls There are a large number of allocations for fixed-size buffers using the `cli_malloc` and `cli_calloc` calls that check if the requested size is larger than our allocation threshold for allocations based on untrusted input. These allocations will always be lower than the threshold, so the extra stack frame and check for these calls is a waste of CPU. This commit replaces needless calls with A -> B: - cli_malloc -> malloc - cli_calloc -> calloc - CLI_MALLOC -> MALLOC - CLI_CALLOC -> CALLOC I also noticed that our MPOOL_MALLOC / MPOOL_CALLOC are not limited by the max-allocation threshold, when MMAP is found/enabled. But the alternative was set to cli_malloc / cli_calloc when disabled. I changed those as well. I didn't change the cli_realloc/2 calls because our version of realloc not only implements a threshold but also stabilizes the undefined behavior in realloc to protect against accidental double-free's. It may be worth implementing a cli_realloc that doesn't have the threshold built-in, however, so as to allow reallocations for things like buffers for loading signatures, which aren't subject to the same concern as allocations for scanning possible malware. There was one case in mbox.c where I changed MALLOC -> CLI_MALLOC, because it appears to be allocating based on untrusted input.	2024-03-15 13:18:47 -04:00
Micah Snyder	9cb28e51e6	Bump copyright dates for 2024	2024-01-22 11:27:17 -05:00
RainRat	1b17e20571	Fix typos (no functional changes)	2024-01-19 09:08:36 -08:00
RainRat	caf324e544	Fix typos (no functional changes)	2023-11-26 18:01:19 -05:00
Micah Snyder	b778a6b12e	Abort signature load for short signature patterns If a signature has a pattern that is too short, the signature fails to load, but this does not cause the entire load process to abort. This is bad for two reasons: 1) It is not immediately apparent that the signature is bad, and so it could be published accidentally. 2) The signature is partially loaded by the time the bad pattern is observed and that may cause a crash later. Because of (1), it is not worth it to try to unload the first part of the signature. Instead, we should just abort the signature load. Fixes: https://github.com/Cisco-Talos/clamav/issues/923 We should also abort loading if the filter pattern for the boyer-moore matcher is shorter than 2 bytes. Also, do not print the final "Loading" progress bar if an error occurred.	2023-06-12 18:03:45 -07:00
Micah Snyder	6eebecc303	Bump copyright for 2023	2023-02-12 11:20:22 -08:00
Micah Snyder	858b541a51	Matcher: Remove allmatch checks and significantly tidy code Significantly tidy the `cli_scan_fmap()` function, and add comments to explain how it all works. Add SHA1 and SHA256 digest variables to the FMAP structure in addition to the existing MD5. Add a function to set the hash so that when we calculate the hashes for hash matching, we save them for subsequent FP matching. This enabled me to remove the extra "hash-only" FP check from `cli_scan_fmap()`. This will also make it easier to switch the clean cache hash algorithm to SHA256 in the future. Remove extra allmatch checks that are no longer required. Add a new header to prevent #include order issues.	2022-10-19 13:13:57 -07:00
Micah Snyder	33555ef696	Hashtable / hashmap / hashset code cleanup I found mixed types and multiple bugs in the hashtable/map/set code, and very little documentation. The most documentation available is the bytecode compiler users manual. Although I also found one discrepancy there with the return value for the BC API map_remove function that calls cli_map_removekey() and so put in an issue with the compiler project for the documentation. Most notably is that this hashtab.c had a lot of functions returning negative enum values instead of returning the enums and then having the caller evaluate the return code to return a negative/0/1 result. This commit fixes all of that, and adds in a bunch of documentation to explain the purpose and behavior of each function and structure provided by hashtab.c/.h. Specific bugs that I know I fixed outside of code quality improvements: - cli_hashset_toarray() was returning CL_ENULLARG / CL_EMEM on failure, when the caller is expecting a ssize_t to indicate how big of an array is allocated. It now returns -1 on failure. I also found that an attempt was made to have the same API that takes a mempool pointer even if mempool is disabled. I preserved that, but made it so the macro is in all-caps so it's more obvious what is going on.	2022-10-19 13:13:57 -07:00
Micah Snyder	2cb83dc540	Tests: All-match mode tests Add tests to verify an alert on the base file in addition to embedded file type recognition (for ZIPSFX extraction) and then subsequent detection of content extracted from the embedded zip.	2022-10-19 13:13:57 -07:00
Andy Ragusa	778a4b1341	Corrected types to remove warnings.	2022-10-18 14:04:36 -07:00
Andy Ragusa	a82d2821c1	Fixed type mismatch Fixed a type mismatch that appears to be causing a warning in Coverity analysis.	2022-10-12 18:49:28 -07:00
Andy Ragusa	a50f6ee50b	Changed type of newCapacity to match trans_capacity to eliminate warning	2022-10-11 15:14:54 -07:00
Andy Ragusa	b3a3b358b0	Speed up freeing of signatures Speed up freeing of signatures by tracking all malloced blocks instead of having to find duplicates in our data structures on signature unload.	2022-10-07 08:30:57 -07:00
Micah Snyder	74887875db	Add code comments to explain AC pattern prefix process When adding a pattern to the AC trie, checks are done to make sure the bytes that go in the AC trie don't have any `?` wildcards and additionally that the first two bytes are not "\x00\x00". If they are, the position of the pattern that goes in the AC trie can be shifted right until a static pattern is identified that can go in the AC trie. Any bytes to the left of the new start of the pattern become a "prefix". During matching, once the AC trie match occurs and the bytes to the right of that pattern are matched, then the bytes from the prefix are matched. The reason that we don't want the bytes that go in the AC trie to start with "\x00\x00" is that it is such a common pattern in files that it would match constantly, and the scan process would spend a lot of time just checking through the list of patterns associated with a "\x00\x00" AC match, and that'd be crazy slow. But it is important to note that when shifting right, if there aren't enough nonzero, non-wildcard bytes to form a good prefix for the AC trie, that it is tolerable to bend the rule and let some patterns start with "\x00\x00". In that way, a small pattern like "0000ab" is still valid, and can be matched.	2022-06-10 09:11:57 -07:00
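The prefix-shifting idea in 74887875db can be illustrated with a deliberately simplified sketch. It assumes a trie depth of exactly two elements and a hypothetical encoding in which any element with bits set above the low byte is a wildcard; the real matcher-ac.c logic considers more bytes and other pattern element kinds, so the exact offsets it picks may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard marker: any bits above the low byte mean
 * "this element is not a static byte". */
#define SKETCH_WILDCARD 0x0100u

static int sketch_is_static(uint16_t element)
{
    return (element & 0xff00u) == 0;
}

/* Choose where the trie portion of a pattern starts.
 * Everything before the returned offset becomes the "prefix".
 *
 * Preferred start: two consecutive static bytes that are not both zero.
 * Fallback: the first static "\x00\x00" pair, bending the rule so that
 * short, zero-heavy patterns remain loadable. Returns 0 if neither exists. */
size_t sketch_choose_trie_start(const uint16_t *pattern, size_t len)
{
    size_t fallback   = 0;
    int have_fallback = 0;

    assert(pattern != NULL);

    for (size_t i = 0; i + 1 < len; i++) {
        if (!sketch_is_static(pattern[i]) || !sketch_is_static(pattern[i + 1])) {
            continue; /* wildcards cannot go in the trie */
        }
        if (pattern[i] != 0 || pattern[i + 1] != 0) {
            return i; /* good start: static and not "\x00\x00" */
        }
        if (!have_fallback) {
            fallback      = i;
            have_fallback = 1;
        }
    }
    return fallback;
}
```

For a pattern such as `{SKETCH_WILDCARD, 0x00, 0x00, 0x41, 0x42}` this sketch skips the wildcard and the zero pair and starts the trie portion at offset 2, leaving two elements as the prefix.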
Micah Snyder	fdf23d500a	Fix possible 2-byte overread when adding sig pattern It is possible to create a signature pattern that tries to add a zero-byte matching pattern to the A-C trie. A missing check at this stage can end up with a 2-byte overread when indexing the (empty) pattern to make sure the bytes added to the A-C trie are static and not both zero. This over read issue is not a vulnerability. This commit fixes the issue by adding a check for the pattern length. Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43832 Also added: - type casts and a "fall-through" comment to silence compile warnings. - a few additional length checks to protect against an additional 1-byte over read.	2022-06-10 09:11:57 -07:00
ragusaa	55b2eafc84	Fix integer overflow/undefined behavior in NSIS parser Fix integer overflow in the NSIS parser Cast int32_t to uint32_t for comparison with uint32_t, to prevent integer overflow, as well as signed/unsigned compare warning. Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=44493 Also address some other undefined behavior warnings: * mpool.c: Fixed pointer overflow errors uncovered by UndefinedBehaviorSanitizer. * matcher-ac.c: Test length to avoid passing NULLs to memcmp.	2022-06-01 13:46:36 -07:00
ragusaa	1c6746853f	Fixed heap buffer overflow while loading signatures There is a possible overflow read when loading PDB and WDB phishing signatures. This issue is not a vulnerability. Changed const char pointers to uint8_t pointers when they are to be used with data, as well as removing asserts and adding additional error checking. Thank you Michał Dardas for reporting this issue. This fix also resolves: - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43845 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43812 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43866 This commit also fixes a minor leak of pattern matching trans nodes that was observed when testing with the MPOOL module disabled.	2022-05-16 18:29:25 -07:00
ragusaa	7b464ab882	Fix small leak when loading invalid FTM signatures Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43844	2022-04-19 15:46:27 -07:00
Andy Ragusa	e51920dfe8	Free correct variable in signature load error handling We don't allocate a copy of the signature name to store in the AC pattern structure for logical signature patterns because it is already stored in the logical signature structure. But oss-fuzz found that we're freeing that virname when an error happens even if it wasn't copied. This fix checks the allocation before MPOOL_FREE. Since virname is passed in, and only cloned under certain conditions, check to see that it has actually been cloned before freeing it in any cleanup code. Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45205	2022-04-14 15:24:35 -07:00
ragusaa	4373e8f234	Fix possible invalid free (#507 ) 'new' is allocated by mpool, so should be freed by the mpool free function. This issue is not a vulnerability Resolves: https://github.com/Cisco-Talos/clamav/issues/430	2022-03-22 17:06:22 -07:00
Micah Snyder	fd587c741c	Image fuzzy hash: new logical sub-signature feature Add a new logical signature subsignature type for matching on images with image fuzzy hashes. Image fuzzy hash subsignatures follow this format: fuzzy_img#<hash>#<dist> In this initial implementation, the hamming distance (dist) is ignored and only exact fuzzy hash matches will alert. Fuzzy hash matching is only performed for supported image types. Also: removed some excessive debug log messages on start-up. Fixed an issue where the signature name (virname) is being allocated and stored for every subsignature or even every sub-pattern in an AC-pattern (i.e. NDB sig or LDB subsig) containing a `{n-m}` or `*` wildcard. This fix is only for LDB subsigs though. NDB signatures are still allocating one virname per sub-pattern. This fix was required because I needed a place to store the virname with fuzzy-hash subsignatures. Storing it in the fuzzy-hash subsig metadata the way AC-pattern, PCRE, and BComp subsigs were doing it wouldn't work because it would cross the C-Rust FFI boundary and giving pointers to Rust allocated stuff is dicey. Not to mention native Rust strings are different than C strings. Anyways, the correct thing to do was to store the virname with the actual logical signature. TODO: Keep track of NDB signatures in the same way and store the virname for NDB sigs there instead of in AC-patterns so that we can get rid of the virname field in the AC-pattern struct.	2022-03-02 13:12:59 -07:00
Micah Snyder	86cff75500	A-C pattern match code cleanup, add comments	2022-02-23 12:28:31 -07:00
micasnyd	140c88aa4e	Bump copyright for 2022 Includes minor format corrections.	2022-01-09 14:23:25 -07:00
Micah Snyder	db013a2bfd	libclamav: Fix scan recursion tracking Scan recursion is the process of identifying files embedded in other files and then scanning them, recursively. Internally this process is more complex than it may sound because a file may have multiple layers of types before finding a new "file". At present we treat the recursion count in the scanning context as an index into both our fmap list AND our container list. These two lists are conceptually a part of the same thing and should be unified. But what's concerning is that the "recursion level" isn't actually incremented or decremented at the same time that we add a layer to the fmap or container lists but instead is more touchy-feely, increasing when we find a new "file". To account for this shadiness, the size of the fmap and container lists has always been a little longer than our "max scan recursion" limit so we don't accidentally overflow the fmap or container arrays (!). I've implemented a single recursion-stack as an array, similar to before, which includes a pointer to each fmap at each layer, along with the size and type. Push and pop functions add and remove layers whenever a new fmap is added. A boolean argument when pushing indicates if the new layer represents a new buffer or new file (descriptor). A new buffer will reset the "nested fmap level" (described below). This commit also provides a solution for an issue where we detect embedded files more than once during scan recursion. 
For illustration, imagine a tarball named foo.tar.gz with this structure: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| baz.exe \| PE \| 0 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| │ └── hello.txt \| ASCII \| 2 \| 0 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| \| └── world.txt \| ASCII \| 2 \| 0 \| (A) If we scan for embedded files at any layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| ├── foo.tar \| TAR \| 1 \| 0 \| \| │ ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hola.txt \| ASCII \| 3 \| 0 \| \| │ ├── baz.exe \| PE \| 2 \| 1 \| \| │ │ ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ │ │ └── hello.txt \| ASCII \| 4 \| 0 \| \| │ │ └── sfx.7z \| 7Z \| 3 \| 1 \| \| │ │ └── world.txt \| ASCII \| 4 \| 0 \| \| │ ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hello.txt \| ASCII \| 3 \| 0 \| \| │ └── sfx.7z \| 7Z \| 2 \| 1 \| \| │ └── world.txt \| ASCII \| 3 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| (A) is bad because it scans content more than once. Note that for the GZ layer, it may detect the ZIP and 7Z if the signature hits on the compressed data, which it might, though extracting the ZIP and 7Z will likely fail. The reason the above doesn't happen now is that we restrict embedded type scans for a bunch of archive formats to include GZ and TAR. 
(B) If we scan for embedded files at the foo.tar layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| ├── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ └── hello.txt \| ASCII \| 3 \| 0 \| \| └── sfx.7z \| 7Z \| 2 \| 1 \| \| └── world.txt \| ASCII \| 3 \| 0 \| (B) is almost right. But we can achieve it easily enough by only scanning for embedded content in the current fmap when the "nested fmap level" is 0. The upside is that it should safely detect all embedded content, even if it may think the sfx.zip and sfx.7z are in foo.tar instead of in baz.exe. The biggest risk I can think of affects ZIPs. SFXZIP detection is identical to ZIP detection, which is why we don't allow SFXZIP to be detected if inside of a ZIP. If we only allow embedded type scanning at fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP if the bar.exe was not compressed in foo.zip and if non-compressed files extracted from ZIPs aren't extracted as new buffers: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.zip \| ZIP \| 0 \| 0 \| \| └── bar.exe \| PE \| 1 \| 1 \| \| └── sfx.zip \| ZIP \| 2 \| 2 \| Provided that we ensure all files extracted from zips are scanned in new buffers, option (B) should be safe. 
(C) If we scan for embedded files at the baz.exe layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ └── hello.txt \| ASCII \| 4 \| 0 \| \| └── sfx.7z \| 7Z \| 3 \| 1 \| \| └── world.txt \| ASCII \| 4 \| 0 \| (C) is right. But it's harder to achieve. For this example we can get it by restricting 7ZSFX and ZIPSFX detection to when we are scanning an executable. But that may mean losing detection of archives embedded elsewhere. And we'd have to identify allowable container types for each possible embedded type, which would be very difficult. So this commit aims to solve the issue the (B)-way. Note that in all situations, we still have to scan with file typing enabled to determine if we need to reassign the current file type, such as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-compressed. Detection of DMG and a handful of other types relies on finding data partway through or near the end of a file before reassigning the entire file as the new type. Other fixes and considerations in this commit: - The utf16 HTML parser has weak error handling, particularly with respect to creating a nested fmap for scanning the ascii decoded file. This commit cleans up the error handling and wraps the nested scan with the recursion-stack push()/pop() for correct recursion tracking. Before this commit, each container layer had a flag to indicate if the container layer is valid. We need something similar so that the cli_recursion_stack_get_() functions ignore normalized layers. Details... Imagine an LDB signature for HTML content that specifies a ZIP container. 
If the signature actually alerts on the normalized HTML and you don't ignore normalized layers for the container check, it will appear as though the alert is in an HTML container rather than a ZIP container. This commit accomplishes this with a boolean you set in the scan context before scanning a new layer. Then when the new fmap is created, it will use that flag to set a similar flag for the layer. The context flag is then reset so that anything after this doesn't have that flag. The flag allows the new recursion_stack_get() function to ignore normalized layers when iterating the stack to return a layer at a requested index, negative or positive. Scanning extracted/normalized javascript and VBA should also use the 'layer is normalized' flag. - This commit also fixes Heuristic.Broken.Executable alert for ELF files to make sure that: A) these only alert if cli_append_virus() returns CL_VIRUS (aka it respects the FP check). B) all broken-executable alerts for ELF only happen if the SCAN_HEURISTIC_BROKEN option is enabled. - This commit also cleans up the error handling in cli_magic_scan_dir(). This was needed so we could correctly apply the layer-is-normalized-flag to all VBA macros extracted to a directory when scanning the directory. - Also fix an issue where exceeding scan maximums wouldn't cause embedded file detection scans to abort. Granted we don't actually want to abort if max filesize or max recursion depth are exceeded... only if max scansize, max files, and max scantime are exceeded. Add 'abort_scan' flag to scan context, to protect against depending on correct error propagation for fatal conditions. Instead, setting this flag in the scan context should guarantee that a fatal condition deep in scan recursion isn't lost, which would result in more stuff being scanned instead of aborting. 
This shouldn't be necessary, but some status codes like CL_ETIMEOUT never used to be fatal and it's easier to do this than to verify every parser only returns CL_ETIMEOUT and other "fatal status codes" in fatal conditions. - Remove duplicate is_tar() prototype from filetypes.c and include is_tar.h instead. - Presently we create the fmap hash when creating the fmap. This wastes a bit of CPU if the hash is never needed. Now that we're creating fmap's for all embedded files discovered with file type recognition scans, this is a much more frequent occurrence and really slows things down. This commit fixes the issue by only creating fmap hashes as needed. This should not only resolve the performance impact of creating fmap's for all embedded files, but also should improve performance in general. - Add allmatch check to the zip parser after the central-header meta match. That way we don't get multiple alerts with the same match except in allmatch mode. Clean up error handling in the zip parser a tiny bit. - Fixes to ensure that the scan limits such as scansize, filesize, recursion depth, # of embedded files, and scantime are always reported if AlertExceedsMax (--alert-exceeds-max) is enabled. - Fixed an issue where non-fatal alerts for exceeding scan maximums may mask signature matches later on. I changed it so these alerts use the "possibly unwanted" alert-type and thus only alert if no other alerts were found or if all-match or heuristic-precedence are enabled. - Added the "Heuristics.Limits.Exceeded." events to the JSON metadata when the --gen-json feature is enabled. These will show up once under "ParseErrors" the first time a limit is exceeded. In the present implementation, only one limits-exceeded event will be added, so as to prevent a malicious or malformed sample from filling the JSON buffer with millions of events and using a tonne of RAM.	2021-10-25 16:02:29 -07:00
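The recursion stack and "nested fmap level" rule from db013a2bfd (option B) can be modelled in a few lines. This is a toy sketch: the struct layout, the fixed depth of 16, and all `sketch_` names are assumptions, and the real stack also holds an fmap pointer per layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DEPTH 16 /* assumed capacity, for illustration only */

typedef struct {
    const char *type;           /* e.g. "GZ", "TAR", "PE" */
    size_t size;                /* size of the layer's data */
    unsigned nested_fmap_level; /* 0 for a new buffer, parent + 1 otherwise */
} sketch_layer_t;

typedef struct {
    sketch_layer_t layers[SKETCH_MAX_DEPTH];
    size_t depth; /* number of layers; recursion level is depth - 1 */
} sketch_stack_t;

/* Push a layer. A new buffer (e.g. decompressed output) resets the nested
 * fmap level; a window into the parent's data increments it. */
bool sketch_push(sketch_stack_t *s, const char *type, size_t size, bool is_new_buffer)
{
    unsigned nested = 0;

    assert(s != NULL);

    if (s->depth >= SKETCH_MAX_DEPTH) {
        return false;
    }
    if (!is_new_buffer && s->depth > 0) {
        nested = s->layers[s->depth - 1].nested_fmap_level + 1;
    }
    s->layers[s->depth].type              = type;
    s->layers[s->depth].size              = size;
    s->layers[s->depth].nested_fmap_level = nested;
    s->depth++;
    return true;
}

/* Pop the top layer; returns false if the stack is already empty. */
bool sketch_pop(sketch_stack_t *s)
{
    assert(s != NULL);

    if (s->depth == 0) {
        return false;
    }
    s->depth--;
    return true;
}

/* Option (B): only look for embedded files when the current layer is
 * the first fmap of its buffer. */
bool sketch_may_scan_embedded(const sketch_stack_t *s)
{
    assert(s != NULL);

    return s->depth > 0 && s->layers[s->depth - 1].nested_fmap_level == 0;
}
```

Pushing GZ and TAR as new buffers and then PE as a nested fmap reproduces the levels in the foo.tar.gz table: embedded scanning is permitted at the TAR layer and suppressed at the PE layer.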
Andrea DePasquale	fb7d05c4d0	Add check for signature pattern bytes < 0x80 When locale is UTF-8, check that signature pattern bytes are < 0x80 before using the isalpha() and toupper() functions since that can lead to segfaults and/or unintended matches. For example take a LDB signature with a case-insensitive subsignature containing byte 0xb5. The uint16_t value of pattern->pattern[i] is 0x10b5 since 0xb5 is OR'd with the CLI_MATCH_NOCASE (0x1000) flag. Locale: C isalpha((unsigned char) (0x10b5 & 0xff)): 0 toupper((unsigned char) (0x10b5 & 0xff)): b5 Locale: en_US.UTF-8 isalpha((unsigned char) (0x10b5 & 0xff)): 1 toupper((unsigned char) (0x10b5 & 0xff)): 39c U+00B5 is the Micro Sign (also known as Mu) U+03BC is the Greek Small Letter Mu U+039C is the Greek Capital Letter Mu	2021-10-07 17:46:01 -07:00
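The guard described in fb7d05c4d0 amounts to checking the byte range before calling the locale-dependent `<ctype.h>` functions. A minimal sketch, with a hypothetical function name and a 16-bit pattern element whose low byte carries the data as in the commit's example:

```c
#include <ctype.h>
#include <stdint.h>

/* Upper-case the low byte of a pattern element only when it is 7-bit ASCII.
 * Bytes >= 0x80 are returned unchanged, so a UTF-8 locale cannot turn
 * 0xb5 into a value that no longer fits in one byte. */
uint8_t sketch_fold_byte(uint16_t element)
{
    uint8_t b = (uint8_t)(element & 0xff);

    if (b < 0x80 && isalpha(b)) {
        return (uint8_t)toupper(b);
    }
    return b;
}
```

For the element `0x10b5` from the commit message this returns `0xb5` regardless of locale, while `0x1000 | 'a'` folds to `'A'`.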
Micah Snyder	090c8990e3	libclamav, clamscan: load/unload callbacks & progress meters Add progress callbacks to libclamav for: - database load - engine compile - engine free Add a progress bar to clamscan for load & compile. These are disabled if you run with --debug or stdout is not a TTY or you are using one of --quiet, --infected, or --no-summary. Added code so you can test the engine-free callback by building with ENABLE_ENGINE_FREE_PROGRESSBAR defined. The compile & free progress callbacks pre-calculate the number of tasks to complete to estimate the progress. Some tasks may take longer than others so the progress speed may appear to vary a little. The callbacks' return type is cl_error_t, but the return value doesn't currently do anything. It is reserved for future use. Minor formatting change in matcher-ac.c to counteract weird clang-format behavior, and to make it easier to read. Added progress callbacks and clamscan progress bars to the news.	2021-07-16 11:47:23 -07:00
Micah Snyder	1ee5c96c59	Correct return status variable type Should use the 'cl_error_t' enum, not 'int'. No functional difference, but is better for type safety and for debugging.	2021-06-19 15:59:55 -07:00
Micah Snyder	d1ccf7747d	clang-format housekeeping	2021-06-18 16:34:59 -07:00
Mickey Sola	c0bad34b09	Fix all-match mode FP checks The `cli_append_virus()` function does an FP check. If it is an FP, it will return `CL_CLEAN` and the match/alert/virus should be discarded. This fix will respect FP verdicts when appending virus name in ac and bm matchers in all match mode.	2021-06-18 16:27:19 -07:00
Micah Snyder (micasnyd)	b9ca6ea103	Update copyright dates for 2021 Also fixes up clang-format.	2021-03-19 15:12:26 -07:00
Micah Snyder	e409920298	Fix assorted warnings Add missing ping_clamd() declaration in client.h Fix check for ping option to first check if ping option is NULL before strdup'ing and checking if the alloc failed. Fix format string for uint64_t print. Correctly assign name pointer to stack buffer in cpio parser. Remove vestigial variables from insert_list() function matcher-ac.c, left over from before the load-time optimizations completely restructured everything. Silence warnings about unused parameters in progress bar callback function.	2020-07-31 16:05:31 -07:00
Micah Snyder	ae77e87880	Add EmbeddedObjects to JSON The metadata properties JSON structure isn't recording file types found embedded within a file such as self-extracting (SFX) types and office document types (DOCX, PPTX, etc). This presents a problem... At present there's no way to know if the current file has ended and a new file is found tacked on to the end of the first file. If there were, we could simply check if the type found by the raw-scan exists within the first file, or after. If within the first, and the type is an archive then it's reasonable to conclude we're either observing zip headers (for SFXZIP detections) or other files that are not compressed. If the type ISN'T found within the first file, then we definitely have a whole new file to parse and we should do so with cli_magic_scan() rather than only using these embedded type scanners. At present we can't ignore SFXZIP detections even if the original file type is a ZIP because we may have found two ZIPs appended together to evade detection (a legitimate trick). As a consequence, we will effectively parse every zip entry twice. The same issue applies to types found within non-compressed archives. This commit adds an EmbeddedObjects list to the metadata JSON object so that the existence of these types is noted. Additionally, this commit removes the two-part int64 cli_jsonint64() implementation as json_object_new_int64() should be available everywhere and the macro to detect such support was never set.	2020-06-03 10:39:18 -04:00
Micah Snyder	206dbaefe8	Update copyright dates for 2020	2020-01-03 15:44:07 -05:00
Micah Snyder (micasnyd)	3dd506a7ee	bb12389 - fast AC sig load - courtesy of Alberto Wu This commit addresses the signature load time issue in the following steps: 1. Loaded list items are allocated but left unattached; only a node reference is set on them for further processing. This is done with no increase of memory usage. See changes in insert_list and matcher-ac.h 2. Before the tries are built, the whole list of entries is sorted by node, then by pattern, then by partno. This requires O(N log(N)) time. 3. The list is processed linearly, one node at a time and the `next_same` chain is built. Each next_same chain head is also extracted. This requires O(N) time. 4. The list of heads is sorted by partno. This requires O(M log(M)) time on average with M<=N. 5. The list of heads is processed linearly and the `next` chain is built. This has O(M) complexity. And improves scantime performance, by adding checks to: 1. Place longer lists earlier in the trie. 2. Keep close patterns close, rather than scattering them further apart. This reduced memory cache faults to improve load and scan time performance.	2019-11-08 14:05:08 -08:00
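Steps 2 and 3 of the load-time optimization in 3dd506a7ee (sort once, then build the `next_same` chains in a single linear pass) can be sketched as below. The struct, the use of an integer node id, and `strcmp()` as the pattern comparison are simplifying assumptions; steps 4 and 5 (sorting the heads by partno and linking `next`) are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified list entry. */
typedef struct sketch_entry {
    int node;                       /* stand-in for the trie node reference */
    const char *pattern;            /* stand-in for the pattern bytes */
    int partno;
    struct sketch_entry *next_same; /* next entry with same node and pattern */
} sketch_entry_t;

/* Order by node, then pattern, then partno. */
static int sketch_cmp(const void *a, const void *b)
{
    const sketch_entry_t *x = *(sketch_entry_t *const *)a;
    const sketch_entry_t *y = *(sketch_entry_t *const *)b;
    int c;

    if (x->node != y->node) {
        return (x->node > y->node) - (x->node < y->node);
    }
    c = strcmp(x->pattern, y->pattern);
    if (c != 0) {
        return c;
    }
    return (x->partno > y->partno) - (x->partno < y->partno);
}

/* Sort the list in O(N log N), then link runs of equal (node, pattern)
 * through next_same in O(N). Each chain head is written to heads[], which
 * must have room for n pointers. Returns the number of heads. */
size_t sketch_build_chains(sketch_entry_t **list, size_t n, sketch_entry_t **heads)
{
    size_t nheads = 0;

    assert(list != NULL && heads != NULL);

    qsort(list, n, sizeof(*list), sketch_cmp);

    for (size_t i = 0; i < n; i++) {
        list[i]->next_same = NULL;
        if (i > 0 && list[i - 1]->node == list[i]->node &&
            strcmp(list[i - 1]->pattern, list[i]->pattern) == 0) {
            list[i - 1]->next_same = list[i]; /* extend the current chain */
        } else {
            heads[nheads++] = list[i];        /* start a new chain */
        }
    }
    return nheads;
}
```

Because equal entries are adjacent after the sort, no per-insert search for duplicates is needed, which is where the load-time saving described in the commit comes from.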
Micah Snyder	bcb4505e60	bb12370 - cli_strndup and other str* replacements must be built and exported for every OS to be used outside of libclamav on systems that don't have the original functions (e.g. strndup). This commit renames the macros to be uppercase, renames the replacement functions to be preceded by two underscores (e.g. __cli_strndup), and removes the ifdef's so that they are built regardless, because there are no ifdefs in libclamav.map.	2019-10-02 16:08:30 -04:00
Micah Snyder	ee40795fe2	Converted mpool calls to macros when USE_MPOOL is defined to clearly differentiate between function and macro behavior.	2019-10-02 16:08:25 -04:00
Micah Snyder	5f4f69102d	Correcting types from int to cl_error_t where appropriate. Eliminating unused variables and referencing unused parameters to remove warnings.	2019-10-02 16:08:25 -04:00
Micah Snyder	52cddcbcfd	Updating and cleaning up copyright notices.	2019-10-02 16:08:18 -04:00
Micah Snyder	b3e82e5e61	Replacing libclamav/cltypes.h with clamav-types.h.in, which generates a header clamav-types.h that we install alongside clamav.h.	2019-10-02 16:08:17 -04:00
Micah Snyder	72fd33c8b2	clang-format'd using new .clang-format rules.	2019-10-02 16:08:16 -04:00
Micah Snyder	38fe8b69a0	Added .clang-format style rules, clam-format script to automate formatting of ClamAV code, and preparing select files so that clang-format does not alter carefully formatted sections.	2019-10-02 16:08:16 -04:00
Micah Snyder (micasnyd)	cc12e21dd2	bb12221: Fix for subtle type-mismatch that could result in an infinite loop with a large number of sigs.	2018-12-02 23:07:08 -05:00
Micah Snyder	d7979d4ff7	Restructured scan options flags from a single bitflag field to a structure containing multiple bitflag fields. This also required adding a new function to the bytecode API to get scan options a la carte, and modifying the existing function to hand back scan options in the old/deprecated uint32_t bitflag format. Re-generated bytecode iface header files. Updated libclamav documentation detailing new scan options structure. Renamed references to 'algorithmic' detection to 'heuristic' detection. Renaming references to 'properties' to 'collect metadata'. Renamed references to 'scan all' to 'scan all match'. Renamed a couple of 'Hueristic.' signature names as 'Heuristics.' signatures (plural) to match majority of other heuristics.	2018-12-02 23:06:59 -05:00
Micah Snyder	927b2bab17	bb11992: cleaning up some variable initialization.	2018-02-08 16:00:14 -05:00
Micah Snyder	d0cba11ea7	adding back changes to eliminate warnings from mspack, matcher, others, and readdb.	2017-09-21 13:10:01 -04:00


232 commits