Commit graph

232 commits

Author SHA1 Message Date
Val S.
0462dae12a
Increase limit for finding PE files embedded in other PE files
I am seeing missed detections since we changed to prohibit embedded
file type identification when inside an embedded file.
In particular, I'm seeing this issue with PE files that contain multiple
other MSEXE as well as a variety of false positives for PE file headers.

For example, imagine a PE with four concatenated DLLs, like so:
```
  [ EXE file   | DLL #1  | DLL #2  | DLL #3  | DLL #4 ]
```

And note that false positives for embedded MSEXE files are fairly common.
So there may be a few mixed in there.

Before limiting embedded file identification we might interpret the file
structure something like this:
```
MSEXE: {
  embedded MSEXE #1: false positive,
  embedded MSEXE #2: false positive,
  embedded MSEXE #3: false positive,
  embedded MSEXE #4: DLL #1: {
    embedded MSEXE #1: false positive,
    embedded MSEXE #2: DLL #2: {
      embedded MSEXE #1: DLL #3: {
        embedded MSEXE #1: false positive,
        embedded MSEXE #2: false positive,
        embedded MSEXE #3: false positive,
        embedded MSEXE #4: false positive,
        embedded MSEXE #5: DLL #4
      }
      embedded MSEXE #2: false positive,
      embedded MSEXE #3: false positive,
      embedded MSEXE #4: false positive,
      embedded MSEXE #5: false positive,
      embedded MSEXE #6: DLL #4
    }
    embedded MSEXE #3: DLL #3,
    embedded MSEXE #4: false positive,
    embedded MSEXE #5: false positive,
    embedded MSEXE #6: false positive,
    embedded MSEXE #7: false positive,
    embedded MSEXE #8: DLL #4
  }
}
```

This is obviously terrible, which is why we don't allow detecting
embedded files within other embedded files.
So after we enforce that limit, the same file may be interpreted like
this instead:
```
MSEXE: {
  embedded MSEXE #1:  false positive,
  embedded MSEXE #2:  false positive,
  embedded MSEXE #3:  false positive,
  embedded MSEXE #4:  DLL #1,
  embedded MSEXE #5:  false positive,
  embedded MSEXE #6:  DLL #2,
  embedded MSEXE #7:  DLL #3,
  embedded MSEXE #8:  false positive,
  embedded MSEXE #9:  false positive,
  embedded MSEXE #10: false positive,
  embedded MSEXE #11: false positive,
  embedded MSEXE #12: DLL #4
}
```

That's great! Except that we now exceed the "MAX_EMBEDDED_OBJ" limit
for embedded type matches (limit 10, but 12 found). That means we won't
see or extract the 4th DLL anymore.

My solution is to lift the limit when adding a matched MSEXE type.
We already do this for matched ZIPSFX types.
While doing this, I've significantly tidied up the limits checks to
make them more readable, and removed duplicate checks from within the
`ac_addtype()` function.
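
As a rough sketch of the limit check described above (not the actual code): the type tags and the helper name are hypothetical, and only the `MAX_EMBEDDED_OBJ` limit of 10 and the MSEXE/ZIPSFX exemption come from this message.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EMBEDDED_OBJ 10 /* limit quoted in the message above */

/* Hypothetical type tags, for illustration only. */
typedef enum {
    SKETCH_TYPE_OTHER,
    SKETCH_TYPE_ZIPSFX,
    SKETCH_TYPE_MSEXE
} sketch_type_t;

/* Returns true if a new embedded-type match may be recorded, given how
 * many matches have already been recorded for the current file. */
bool sketch_may_add_embedded(sketch_type_t type, size_t already_recorded)
{
    /* MSEXE and ZIPSFX matches are exempt from the limit. */
    if (type == SKETCH_TYPE_MSEXE || type == SKETCH_TYPE_ZIPSFX) {
        return true;
    }
    return already_recorded < MAX_EMBEDDED_OBJ;
}
```

With this shape, the 12th MSEXE match in the example above would still be recorded, while any other type would stop at 10.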

CLAM-2897
2025-10-14 14:05:12 -04:00
Val Snyder
7ff29b8c37
Bump copyright dates for 2025 2025-02-14 10:24:30 -05:00
Micah Snyder
405829ee88 Refine max-allocation and safer-allocation function and macro names
We add the _OR_GOTO_DONE suffix to the macros that go to done if the
allocation fails. This makes it obvious what is different about the
macro versus the equivalent function, and that error handling is
built-in.

Renamed the cli_strdup to safer_strdup to make it obvious that it exists
because it is safer than regular strdup. Regular strdup doesn't have the
NULL check before trying to dup, and so may result in a NULL-deref
crash.

Also remove unused STRDUP (_OR_GOTO_DONE) macro, since the one with the
NULL-check is preferred.
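
To illustrate the naming idea, here is a minimal sketch, assuming hypothetical names throughout (`sketch_safer_strdup`, `SKETCH_STRDUP_OR_GOTO_DONE`, `sketch_copy_name`); it is not the code from this commit.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a NULL-checking strdup: returns NULL when
 * given NULL instead of dereferencing it. */
char *sketch_safer_strdup(const char *s)
{
    if (s == NULL) {
        return NULL;
    }
    size_t n   = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Hypothetical macro in the _OR_GOTO_DONE style: error handling is
 * built in, so a failed duplication jumps to the caller's `done` label. */
#define SKETCH_STRDUP_OR_GOTO_DONE(dst, src) \
    do {                                     \
        (dst) = sketch_safer_strdup(src);    \
        if ((dst) == NULL) {                 \
            goto done;                       \
        }                                    \
    } while (0)

/* Example caller: returns 0 on success, -1 if the duplication failed. */
int sketch_copy_name(const char *name, char **out)
{
    int status = -1;
    char *tmp  = NULL;

    SKETCH_STRDUP_OR_GOTO_DONE(tmp, name);
    *out   = tmp;
    status = 0;

done:
    return status;
}
```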
2024-03-15 13:18:47 -04:00
Micah Snyder
39070d1c76 Remove additional memory allocation limits relating to signature load
Variables like the number of signature parts are considered trusted user input
and so allocations based on those values need not check the memory allocation
limit.

Specifically for the allocation of the normalized buffer in cli_scanscript,
I determined that the size of SCANBUFF is fixed and so safe, and the maxpatlen
comes from the signature load and is therefore also trusted, so we do not
need to check the allocation limit.
2024-03-15 13:18:47 -04:00
Micah Snyder
8e04c25fec Rename clamav memory allocation functions
We have some special functions to wrap malloc, calloc, and realloc to
make sure we don't allocate more than some limit, similar to the
max-filesize and max-scansize limits. Our wrappers are really only
needed when allocating memory for scans based on untrusted user input,
where a scan file could have bytes that claim you need to allocate
some ridiculous amount of memory. Right now they're named:
- cli_malloc
- cli_calloc
- cli_realloc
- cli_realloc2

... and these names do not convey their purpose

This commit renames them to:
- cli_max_malloc
- cli_max_calloc
- cli_max_realloc
- cli_max_realloc2

The realloc ones also have an additional feature in that they will not
free your pointer if you try to realloc to 0 bytes. Freeing the memory
is undefined by the C spec, and only done with some realloc
implementations, so this stabilizes on the behavior of not doing that,
which should prevent accidental double-free's.

So for the case where you may want to realloc and do not need to have a
maximum, this commit adds the following functions:
- cli_safer_realloc
- cli_safer_realloc2

These are used for the MPOOL_REALLOC and MPOOL_REALLOC2 macros when
MPOOL is disabled (e.g. because mmap-support is not found), so as to
match the behavior in the mpool_realloc/2 functions that do not make use
of the allocation-limit.
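
A minimal sketch of the two behaviors described above, assuming a made-up limit value and hypothetical function names; the real threshold and implementations may differ.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limit for illustration; the real threshold is not
 * stated in this message. */
#define SKETCH_MAX_ALLOCATION ((size_t)1024 * 1024 * 1024)

/* Limited allocation for sizes derived from untrusted input.
 * Rejecting 0 bytes here is an assumption of this sketch. */
void *sketch_max_malloc(size_t size)
{
    if (size == 0 || size > SKETCH_MAX_ALLOCATION) {
        return NULL;
    }
    return malloc(size);
}

/* No limit, but never frees `ptr` when asked for 0 bytes, so the
 * caller still owns `ptr` whenever NULL is returned. */
void *sketch_safer_realloc(void *ptr, size_t size)
{
    if (size == 0) {
        return NULL;
    }
    return realloc(ptr, size);
}
```

Because the original pointer always stays valid on a NULL return, the caller's cleanup path can free it once without risking a double-free.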
2024-03-15 13:18:47 -04:00
Micah Snyder
6d6e04ddf8 Optimization: replace limited allocation calls
There are a large number of allocations for fixed-size buffers using the
`cli_malloc` and `cli_calloc` calls that check if the requested size is
larger than our allocation threshold for allocations based on untrusted
input. These allocations will *always* be below the threshold, so
the extra stack frame and check for these calls are a waste of CPU.

This commit replaces needless calls with A -> B:
- cli_malloc -> malloc
- cli_calloc -> calloc
- CLI_MALLOC -> MALLOC
- CLI_CALLOC -> CALLOC

I also noticed that our MPOOL_MALLOC / MPOOL_CALLOC are not limited by
the max-allocation threshold, when MMAP is found/enabled. But the
alternative was set to cli_malloc / cli_calloc when disabled. I changed
those as well.

I didn't change the cli_realloc/2 calls because our version of realloc
not only implements a threshold but also stabilizes the undefined
behavior in realloc to protect against accidental double-free's.
It may be worth implementing a cli_realloc that doesn't have the
threshold built-in, however, so as to allow reallocations for things
like buffers for loading signatures, which aren't subject to the same
concern as allocations for scanning possible malware.

There was one case in mbox.c where I changed MALLOC -> CLI_MALLOC,
because it appears to be allocating based on untrusted input.
2024-03-15 13:18:47 -04:00
Micah Snyder
9cb28e51e6 Bump copyright dates for 2024 2024-01-22 11:27:17 -05:00
RainRat
1b17e20571
Fix typos (no functional changes) 2024-01-19 09:08:36 -08:00
RainRat
caf324e544
Fix typos (no functional changes) 2023-11-26 18:01:19 -05:00
Micah Snyder
b778a6b12e
Abort signature load for short signature patterns
If a signature has a pattern that is too short, we fail to load the
signature but do not abort the entire load process.
This is bad for two reasons:
1) It is not immediately apparent that the signature is bad, and so it
could be published accidentally.
2) The signature is partially loaded by the time the bad pattern is
observed and that may cause a crash later.

Because of (1), it is not worth it to try to unload the first part of the
signature. Instead, we should just abort the signature load.

Fixes: https://github.com/Cisco-Talos/clamav/issues/923

We should also abort loading if the filter pattern for the boyer-moore
matcher is shorter than 2 bytes.

Also, do not print the final "Loading" progress bar if an error occurred.
2023-06-12 18:03:45 -07:00
Micah Snyder
6eebecc303 Bump copyright for 2023 2023-02-12 11:20:22 -08:00
Micah Snyder
858b541a51 Matcher: Remove allmatch checks and significantly tidy code
Significantly tidy the `cli_scan_fmap()` function, and add comments to
explain how it all works.

Add SHA1 and SHA256 digest variables to the FMAP structure in addition
to the existing MD5. Add a function to set the hash so that when we
calculate the hashes for hash matching, we save them for subsequent FP
matching. This enabled me to remove the extra "hash-only" FP check from
`cli_scan_fmap()`. This will also make it easier to switch the clean
cache hash algorithm to SHA256 in the future.

Remove extra allmatch checks that are no longer required.

Add a new header to prevent #include order issues.
2022-10-19 13:13:57 -07:00
Micah Snyder
33555ef696 Hashtable / hashmap / hashset code cleanup
I found mixed types and multiple bugs in the hashtable/map/set code, and
very little documentation.

The most documentation available is the bytecode compiler users manual.
Although I also found one discrepancy there with the return value for
the BC API map_remove function that calls cli_map_removekey() and so put
in an issue with the compiler project for the documentation.

Most notable is that hashtab.c had a lot of functions returning
negative enum values instead of returning the enums and then having the
caller evaluate the return code to return a negative/0/1 result.
This commit fixes all of that, and adds in a bunch of documentation to
explain the purpose and behavior of each function and structure provided
by hashtab.c/.h.

Specific bugs that I know I fixed outside of code quality improvements:
- cli_hashset_toarray() was returning CL_ENULLARG / CL_EMEM on failure,
when the caller is expecting a ssize_t to indicate how big of an array
is allocated. It now returns -1 on failure.

I also found that an attempt was made to have the same API that takes a
mempool pointer even if mempool is disabled. I preserved that, but made
it so the macro is in all-caps so it's more obvious what is going on.
2022-10-19 13:13:57 -07:00
Micah Snyder
2cb83dc540 Tests: All-match mode tests
Add tests to verify an alert on the base file in addition to embedded
file type recognition (for ZIPSFX extraction) and then subsequent
detection of content extracted from the embedded zip.
2022-10-19 13:13:57 -07:00
Andy Ragusa
778a4b1341 Corrected types to remove warnings. 2022-10-18 14:04:36 -07:00
Andy Ragusa
a82d2821c1 Fixed type mismatch
Fixed a type mismatch that appears to be causing a warning in Coverity
analysis.
2022-10-12 18:49:28 -07:00
Andy Ragusa
a50f6ee50b Changed type of newCapacity to match trans_capacity to eliminate warning 2022-10-11 15:14:54 -07:00
Andy Ragusa
b3a3b358b0 Speed up freeing of signatures
Speed up freeing of signatures by tracking all malloced blocks instead
of having to find duplicates in our data structures on signature unload.
2022-10-07 08:30:57 -07:00
Micah Snyder
74887875db Add code comments to explain AC pattern prefix process
When adding a pattern to the AC trie, checks are done to make sure the
bytes that go in the AC trie don't have any `?` wildcards and
additionally that the first two bytes are not "\x00\x00".
If they are, the position of the pattern that goes in the AC trie can be
shifted right until a static pattern is identified that can go in the
AC trie. Any bytes to the left of the new start of the pattern become a
"prefix".

During matching, once the AC trie match occurs and the bytes to the
right of that pattern are matched, then the bytes from the prefix are
matched.

The reason that we don't want the bytes that go in the AC trie to start
with "\x00\x00" is that it is such a common pattern in files that it
would match constantly, and the scan process would spend a lot of time
just checking through the list of patterns associated with a "\x00\x00"
AC match, and that'd be crazy slow.
But it is important to note that when shifting right, if there aren't
enough nonzero, non-wildcard bytes to form a good prefix for the AC
trie, that it is tolerable to bend the rule and let some patterns start
with "\x00\x00". In that way, a small pattern like "0000ab" is still
valid, and can be matched.
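
A simplified sketch of the shift-right search described above. The function name, the `depth` parameter, and the rule that any nonzero high byte marks a non-static pattern element are assumptions of this sketch, not the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a nonzero high byte means the element is not a plain
 * static byte (e.g. it carries a wildcard flag). */
#define SKETCH_NONSTATIC_MASK 0xff00u

/* Returns the offset where the AC-trie portion of the pattern should
 * start; everything before that offset becomes the "prefix".
 * Returns -1 if no window of `depth` static bytes exists. */
long sketch_ac_start(const uint16_t *pat, size_t len, size_t depth)
{
    long fallback = -1;

    if (depth < 2 || len < depth) {
        return -1;
    }
    for (size_t i = 0; i + depth <= len; i++) {
        int is_static = 1;
        for (size_t j = 0; j < depth; j++) {
            if (pat[i + j] & SKETCH_NONSTATIC_MASK) {
                is_static = 0;
                break;
            }
        }
        if (!is_static) {
            continue;
        }
        if (pat[i] == 0 && pat[i + 1] == 0) {
            /* Usable only if nothing better turns up further right. */
            if (fallback < 0) {
                fallback = (long)i;
            }
            continue;
        }
        return (long)i;
    }
    /* Bend the rule: accept a window that starts with "\x00\x00". */
    return fallback;
}
```

With a depth of 3, the pattern "0000ab" has only one candidate window, so the fallback accepts it at offset 0 rather than rejecting the pattern.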
2022-06-10 09:11:57 -07:00
Micah Snyder
fdf23d500a Fix possible 2-byte overread when adding sig pattern
It is possible to create a signature pattern that tries to add a
zero-byte matching pattern to the A-C trie. A missing check at this
stage can end up with a 2-byte overread when indexing the (empty)
pattern to make sure the bytes added to the A-C trie are static and
not both zero.

This over read issue is not a vulnerability.

This commit fixes the issue by adding a check for the pattern length.

Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43832

Also added:
- type casts and a "fall-through" comment to silence compile warnings.
- a few additional length checks to protect against an additional 1-byte
over read.
2022-06-10 09:11:57 -07:00
ragusaa
55b2eafc84
Fix integer overflow/undefined behavior in NSIS parser
Fix integer overflow in the NSIS parser

Cast int32_t to uint32_t for comparison with uint32_t, to prevent
integer overflow, as well as signed/unsigned compare warning.

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=44493

Also address some other undefined behavior warnings:
* mpool.c: Fixed pointer overflow errors uncovered by UndefinedBehaviorSanitizer.
* matcher-ac.c: Test length to avoid passing NULLs to memcmp.
2022-06-01 13:46:36 -07:00
ragusaa
1c6746853f
Fixed heap buffer overflow while loading signatures
There is a possible overflow read when loading PDB and WDB phishing
signatures.

This issue is not a vulnerability.

Changed const char pointers to uint8_t pointers when they are to be used
with data, as well as removing asserts and adding additional error
checking.

Thank you Michał Dardas for reporting this issue.

This fix also resolves:
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43845
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43812
- https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43866

This commit also fixes a minor leak of pattern matching trans nodes
that was observed when testing with the MPOOL module disabled.
2022-05-16 18:29:25 -07:00
ragusaa
7b464ab882
Fix small leak when loading invalid FTM signatures
Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43844
2022-04-19 15:46:27 -07:00
Andy Ragusa
e51920dfe8 Free correct variable in signature load error handling
We don't allocate a copy of the signature name to store in the AC
pattern structure for logical signature patterns because it is already
stored in the logical signature structure. But oss-fuzz found that we're
freeing that virname when an error happens even if it wasn't copied.

This fix checks the allocation before MPOOL_FREE.

Since virname is passed in, and only cloned under certain conditions,
check to see that it has actually been cloned before freeing it in any
cleanup code.

Resolves: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45205
2022-04-14 15:24:35 -07:00
ragusaa
4373e8f234
Fix possible invalid free (#507)
'new' is allocated by mpool, so should be freed by the mpool free
function.

This issue is not a vulnerability.

Resolves: https://github.com/Cisco-Talos/clamav/issues/430
2022-03-22 17:06:22 -07:00
Micah Snyder
fd587c741c Image fuzzy hash: new logical sub-signature feature
Add a new logical signature subsignature type for matching on images
with image fuzzy hashes.

Image fuzzy hash subsignatures follow this format:

    fuzzy_img#<hash>#<dist>

In this initial implementation, the hamming distance (dist) is ignored
and only exact fuzzy hash matches will alert.
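
For illustration, a minimal sketch of splitting that format into its fields. The function name is hypothetical, and treating both fields as required with a decimal distance is an assumption of this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Splits "fuzzy_img#<hash>#<dist>" into its hash and distance fields.
 * Returns 0 on success, -1 if the text does not follow that layout. */
int sketch_parse_fuzzy_img(const char *subsig, char *hash, size_t hash_cap,
                           unsigned long *dist)
{
    static const char prefix[] = "fuzzy_img#";
    const size_t prefix_len    = sizeof(prefix) - 1;

    if (strncmp(subsig, prefix, prefix_len) != 0) {
        return -1;
    }
    const char *hash_start = subsig + prefix_len;
    const char *sep        = strchr(hash_start, '#');
    if (sep == NULL) {
        return -1;
    }
    size_t hash_len = (size_t)(sep - hash_start);
    if (hash_len == 0 || hash_len >= hash_cap) {
        return -1;
    }
    /* Require the distance to start with a digit (no sign, no spaces). */
    if (sep[1] < '0' || sep[1] > '9') {
        return -1;
    }
    char *end       = NULL;
    unsigned long d = strtoul(sep + 1, &end, 10);
    if (*end != '\0') {
        return -1;
    }
    memcpy(hash, hash_start, hash_len);
    hash[hash_len] = '\0';
    *dist          = d;
    return 0;
}
```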

Fuzzy hash matching is only performed for supported image types.

Also: removed some excessive debug log messages on start-up.

Fixed an issue where the signature name (virname) is being allocated and
stored for every subsignature or even every sub-pattern in an AC-pattern
(i.e. NDB sig or LDB subsig) containing a `{n-m}` or `*` wildcard.
This fix is only for LDB subsigs though. NDB signatures are still
allocating one virname per sub-pattern.

This fix was required because I needed a place to store the virname with
fuzzy-hash subsignatures. Storing it in the fuzzy-hash subsig
metadata the way AC-pattern, PCRE, and BComp subsigs were doing it
wouldn't work because it would cross the C-Rust FFI boundary, and giving
pointers to Rust-allocated stuff is dicey. Not to mention native Rust
strings are different than C strings. Anyways, the correct thing to do
was to store the virname with the actual logical signature.

TODO: Keep track of NDB signatures in the same way and store the virname
for NDB sigs there instead of in AC-patterns so that we can get rid of
the virname field in the AC-pattern struct.
2022-03-02 13:12:59 -07:00
Micah Snyder
86cff75500 A-C pattern match code cleanup, add comments 2022-02-23 12:28:31 -07:00
micasnyd
140c88aa4e Bump copyright for 2022
Includes minor format corrections.
2022-01-09 14:23:25 -07:00
Micah Snyder
db013a2bfd libclamav: Fix scan recursion tracking
Scan recursion is the process of identifying files embedded in other
files and then scanning them, recursively.

Internally this process is more complex than it may sound because a file
may have multiple layers of types before finding a new "file".

At present we treat the recursion count in the scanning context as an
index into both our fmap list AND our container list. These two lists
are conceptually a part of the same thing and should be unified.

But what's concerning is that the "recursion level" isn't actually
incremented or decremented at the same time that we add a layer to the
fmap or container lists but instead is more touchy-feely, increasing
when we find a new "file".

To account for this shadiness, the size of the fmap and container lists
has always been a little longer than our "max scan recursion" limit so
we don't accidentally overflow the fmap or container arrays (!).

I've implemented a single recursion-stack as an array, similar to before,
which includes a pointer to each fmap at each layer, along with the size
and type. Push and pop functions add and remove layers whenever a new
fmap is added. A boolean argument when pushing indicates if the new layer
represents a new buffer or new file (descriptor). A new buffer will reset
the "nested fmap level" (described below).

This commit also provides a solution for an issue where we detect
embedded files more than once during scan recursion.

For illustration, imagine a tarball named foo.tar.gz with this structure:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once.

Note that for the GZ layer, it may detect the ZIP and 7Z if the
signature hits on the compressed data, which it might, though
extracting the ZIP and 7Z will likely fail.

The reason the above doesn't happen now is that we restrict embedded
type scans for a bunch of archive formats to include GZ and TAR.

(B) If we scan for embedded files at the foo.tar layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough by only scanning for
embedded content in the current fmap when the "nested fmap level" is 0.
The upside is that it should safely detect all embedded content, even if
it may think the sfx.zip and sfx.7z are in foo.tar instead of in baz.exe.

The biggest risk I can think of affects ZIPs. SFXZIP detection
is identical to ZIP detection, which is why we don't allow SFXZIP to be
detected if inside of a ZIP. If we only allow embedded type scanning at
fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP
if the bar.exe was not compressed in foo.zip and if non-compressed files
extracted from ZIPs aren't extracted as new buffers:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in
new buffers, option (B) should be safe.

(C) If we scan for embedded files at the baz.exe layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by
restricting 7ZSFX and ZIPSFX detection to when we are scanning an executable.
But that may mean losing detection of archives embedded elsewhere.
And we'd have to identify allowable container types for each possible
embedded type, which would be very difficult.

So this commit aims to solve the issue the (B)-way.
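
The recursion stack and the (B) rule can be sketched roughly like this. All names and the stack capacity are hypothetical; the sketch only mirrors the push/pop behavior and the "nested fmap level" bookkeeping described in this message.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_LAYERS 17 /* hypothetical capacity */

typedef struct {
    const void *fmap;           /* stand-in for the layer's fmap pointer */
    size_t size;                /* size of the layer's data */
    int type;                   /* stand-in for the file type */
    unsigned nested_fmap_level; /* 0 for a new buffer or file */
} sketch_layer_t;

typedef struct {
    sketch_layer_t layers[SKETCH_MAX_LAYERS];
    size_t depth; /* number of layers currently on the stack */
} sketch_stack_t;

/* Pushes a layer. A new buffer resets the nested fmap level to 0;
 * otherwise the level is one more than the parent's.
 * Returns false when the stack is full. */
bool sketch_push(sketch_stack_t *s, const void *fmap, size_t size, int type,
                 bool is_new_buffer)
{
    if (s->depth >= SKETCH_MAX_LAYERS) {
        return false;
    }
    unsigned level = 0;
    if (!is_new_buffer && s->depth > 0) {
        level = s->layers[s->depth - 1].nested_fmap_level + 1;
    }
    sketch_layer_t *layer    = &s->layers[s->depth];
    layer->fmap              = fmap;
    layer->size              = size;
    layer->type              = type;
    layer->nested_fmap_level = level;
    s->depth++;
    return true;
}

/* Pops the top layer. Returns false if the stack is already empty. */
bool sketch_pop(sketch_stack_t *s)
{
    if (s->depth == 0) {
        return false;
    }
    s->depth--;
    return true;
}

/* Option (B): only look for embedded files when the current layer's
 * nested fmap level is 0. */
bool sketch_embedded_scan_allowed(const sketch_stack_t *s)
{
    return s->depth > 0 && s->layers[s->depth - 1].nested_fmap_level == 0;
}
```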

Note that in all situations, we still have to scan with file typing
enabled to determine if we need to reassign the current file type, such
as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-
compressed. Detection of DMG and a handful of other types rely on
finding data partway through or near the end of a file before
reassigning the entire file as the new type.

Other fixes and considerations in this commit:

- The utf16 HTML parser has weak error handling, particularly with respect
  to creating a nested fmap for scanning the ascii decoded file.
  This commit cleans up the error handling and wraps the nested scan with
  the recursion-stack push()/pop() for correct recursion tracking.

  Before this commit, each container layer had a flag to indicate if the
  container layer is valid.
  We need something similar so that the cli_recursion_stack_get_*()
  functions ignore normalized layers. Details...

  Imagine an LDB signature for HTML content that specifies a ZIP
  container. If the signature actually alerts on the normalized HTML and
  you don't ignore normalized layers for the container check, it will
  appear as though the alert is in an HTML container rather than a ZIP
  container.

  This commit accomplishes this with a boolean you set in the scan context
  before scanning a new layer. Then when the new fmap is created, it will
  use that flag to set a similar flag for the layer. The context flag is
  reset so that anything after this doesn't have that flag.
  The flag allows the new recursion_stack_get() function to ignore
  normalized layers when iterating the stack to return a layer at a
  requested index, negative or positive.

  Scanning extracted/normalized javascript and VBA should also
  use the 'layer is normalized' flag.

- This commit also fixes Heuristic.Broken.Executable alert for ELF files
  to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it
  respects the FP check).

  B) all broken-executable alerts for ELF only happen if the
  SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir().
  This was needed so we could correctly apply the layer-is-normalized-flag
  to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded
  file detection scans to abort. Granted we don't actually want to abort
  if max filesize or max recursion depth are exceeded... only if max
  scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on
  correct error propagation for fatal conditions. Instead, setting this
  flag in the scan context should guarantee that a fatal condition deep in
  scan recursion isn't lost, which would result in more stuff being scanned
  instead of aborting. This shouldn't be necessary, but some status codes
  like CL_ETIMEOUT never used to be fatal and it's easier to do this than
  to verify every parser only returns CL_ETIMEOUT and other "fatal
  status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filetypes.c and include
  is_tar.h instead.

- Presently we create the fmap hash when creating the fmap.
  This wastes a bit of CPU if the hash is never needed.
  Now that we're creating fmap's for all embedded files discovered with
  file type recognition scans, this is a much more frequent occurrence and
  really slows things down.

  This commit fixes the issue by only creating fmap hashes as needed.
  This should not only resolve the performance impact of creating fmap's
  for all embedded files, but also should improve performance in general.

- Add allmatch check to the zip parser after the central-header meta
  match. That way we don't get multiple alerts with the same match except in
  allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize,
  recursion depth, # of embedded files, and scantime are always reported
  if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may
  mask signature matches later on. I changed it so these alerts use the
  "possibly unwanted" alert-type and thus only alert if no other alerts
  were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded.*" events to the JSON metadata
  when the --gen-json feature is enabled. These will show up once under
  "ParseErrors" the first time a limit is exceeded. In the present
  implementation, only one limits-exceeded event will be added, so as to
  prevent a malicious or malformed sample from filling the JSON buffer
  with millions of events and using a tonne of RAM.
2021-10-25 16:02:29 -07:00
Andrea DePasquale
fb7d05c4d0 Add check for signature pattern bytes < 0x80
When locale is UTF-8, check that signature pattern bytes are < 0x80
before using the isalpha() and toupper() functions since that can lead
to segfaults and/or unintended matches.

For example take a LDB signature with a case-insensitive subsignature
containing byte 0xb5. The uint16_t value of pattern->pattern[i] is
0x10b5 since 0xb5 is OR'd with the CLI_MATCH_NOCASE (0x1000) flag.

Locale: C
isalpha((unsigned char) (0x10b5 & 0xff)): 0
toupper((unsigned char) (0x10b5 & 0xff)): b5

Locale: en_US.UTF-8
isalpha((unsigned char) (0x10b5 & 0xff)): 1
toupper((unsigned char) (0x10b5 & 0xff)): 39c

U+00B5 is the Micro Sign (also known as Mu)
U+03BC is the Greek Small Letter Mu
U+039C is the Greek Capital Letter Mu
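
A minimal sketch of the guard, assuming a hypothetical helper name; only the `CLI_MATCH_NOCASE` value (0x1000) and the `< 0x80` check come from this message.

```c
#include <ctype.h>
#include <stdint.h>

/* Uppercases the low byte of a pattern element only when it is an
 * ASCII letter. Bytes >= 0x80 are never passed to isalpha()/toupper(),
 * so the result does not depend on the locale for those bytes. */
uint16_t sketch_upper_if_ascii_alpha(uint16_t element)
{
    unsigned char byte = (unsigned char)(element & 0xff);

    if (byte < 0x80 && isalpha(byte)) {
        /* Keep the flag bits in the high byte untouched. */
        return (uint16_t)((element & 0xff00) | (unsigned char)toupper(byte));
    }
    return element;
}
```

For the example above, 0x10b5 is returned unchanged, while 0x1061 ('a' with the no-case flag) becomes 0x1041.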
2021-10-07 17:46:01 -07:00
Micah Snyder
090c8990e3
libclamav, clamscan: load/unload callbacks & progress meters
Add progress callbacks to libclamav for:
- database load
- engine compile
- engine free

Add a progress bar to clamscan for load & compile.
These are disabled if you run with --debug or stdout is not a TTY or you
are using one of --quiet, --infected, or --no-summary.

Added code so you can test the engine-free callback by building with
ENABLE_ENGINE_FREE_PROGRESSBAR defined.

The compile & free progress callbacks pre-calculate the number of
tasks to complete to estimate the progress. Some tasks may take longer
than others so the progress speed may appear to vary a little.

The callbacks' return type is cl_error_t, but the returned value isn't
currently used. It is reserved for future use.

Minor formatting change in matcher-ac.c to counteract weird
clang-format behavior, and to make it easier to read.

Added progress callbacks and clamscan progress bars to the news.
2021-07-16 11:47:23 -07:00
Micah Snyder
1ee5c96c59 Correct return status variable type
Should use the 'cl_error_t' enum, not 'int'. No functional difference,
but is better for type safety and for debugging.
2021-06-19 15:59:55 -07:00
Micah Snyder
d1ccf7747d clang-format housekeeping 2021-06-18 16:34:59 -07:00
Mickey Sola
c0bad34b09 Fix all-match mode FP checks
The `cli_append_virus()` function does an FP check. If it is an FP, it
will return `CL_CLEAN` and the match/alert/virus should be discarded.

This fix will respect FP verdicts when appending virus name in ac and
bm matchers in all match mode.
2021-06-18 16:27:19 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
e409920298 Fix assorted warnings
Add missing ping_clamd() declaration in client.h

Fix check for ping option to first check if ping option is NULL before
strdup'ing and checking if the alloc failed.

Fix format string for uint64_t print.
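
For reference, the portable way to print a uint64_t in C is the `PRIu64` macro from `<inttypes.h>`. The helper name `format_u64` below is hypothetical and only illustrates the format string.

```c
#include <inttypes.h>
#include <stdio.h>

/* Formats a uint64_t portably; returns whatever snprintf returns. */
int format_u64(char *buf, size_t len, uint64_t value)
{
    return snprintf(buf, len, "%" PRIu64, value);
}
```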

Correctly assign name pointer to stack buffer in cpio parser.

Remove vestigial variables from insert_list() function in matcher-ac.c,
left over from before the load-time optimizations completely
restructured everything.

Silence warnings about unused parameters in progress bar callback
function.
2020-07-31 16:05:31 -07:00
Micah Snyder
ae77e87880 Add EmbeddedObjects to JSON
The metadata properties JSON structure isn't recording file types found
embedded within a file such as self-extracting (SFX) types and office
document types (DOCX, PPTX, etc).  This presents a problem...

At present there's no way to know if the current file has ended and a
new file is found tacked on to the end of the first file.  If there
were, we could simply check if the type found by the raw-scan exists
within the first file, or after.

If within the first, and the type is an archive then it's reasonable to
conclude we're either observing zip headers (for SFXZIP detections) or
other files that are not compressed.

If the type ISN'T found within the first file, then we definitely have
a whole new file to parse and we should do so with cli_magic_scan()
rather than only using these embedded type scanners.

At present we can't ignore SFXZIP detections even if the original file
type is a ZIP because we may have found two ZIPs appended together to
evade detection (a legitimate trick).  As a consequence, we will
effectively parse every zip entry twice.  The same issue applies to
types found within non-compressed archives.

This commit adds an EmbeddedObjects list to the metadata JSON object so
that the existence of these types is noted.

Additionally, this commit removes the two-part int64 cli_jsonint64()
implementation as json_object_new_int64() should be available
everywhere and the macro to detect such support was never set.
2020-06-03 10:39:18 -04:00
Micah Snyder
206dbaefe8 Update copyright dates for 2020 2020-01-03 15:44:07 -05:00
Micah Snyder (micasnyd)
3dd506a7ee bb12389 - fast AC sig load - courtesy of Alberto Wu
This commit addresses the signature load time issue in the following steps:
1. Loaded list items are allocated but left unattached; only a node reference is set on them for further processing. This is done with no increase of memory usage. See changes in insert_list and matcher-ac.h
2. Before the tries are built, the whole list of entries is sorted by node, then by pattern, then by partno. This requires O(N log(N)) time.
3. The list is processed linearly, one node at a time and the `next_same` chain is built. Each next_same chain head is also extracted. This requires O(N) time.
4. The list of heads is sorted by partno. This requires O(M log(M)) time on average with M<=N.
5. The list of heads is processed linearly and the `next` chain is built. This has O(M) complexity.
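
Steps 2 and 3 can be sketched roughly as follows. This is a simplified illustration, assuming each entry carries an integer node id, a string pattern, and a part number. The struct and function names (`sk_entry`, `sk_cmp`, `sk_build_chains`) are hypothetical, and the real structures in matcher-ac.h differ. Steps 4 and 5 (sorting the heads and building the `next` chain) are not shown.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified list entry. */
struct sk_entry {
    int node;                   /* id of the trie node the entry belongs to */
    const char *pattern;
    int partno;
    struct sk_entry *next_same; /* next entry with the same node and pattern */
};

/* Step 2: order by node, then by pattern, then by partno. */
int sk_cmp(const void *a, const void *b)
{
    const struct sk_entry *x = *(const struct sk_entry *const *)a;
    const struct sk_entry *y = *(const struct sk_entry *const *)b;
    int c;
    if (x->node != y->node) return (x->node < y->node) ? -1 : 1;
    c = strcmp(x->pattern, y->pattern);
    if (c != 0) return c;
    return (x->partno > y->partno) - (x->partno < y->partno);
}

/* Step 3: one linear pass links runs of equal (node, pattern) entries via
 * next_same and collects each run's head. Returns the number of heads. */
size_t sk_build_chains(struct sk_entry **list, size_t n, struct sk_entry **heads)
{
    size_t i, nheads = 0;
    qsort(list, n, sizeof(*list), sk_cmp);
    for (i = 0; i < n; i++) {
        list[i]->next_same = NULL;
        if (i > 0 && list[i - 1]->node == list[i]->node &&
            strcmp(list[i - 1]->pattern, list[i]->pattern) == 0) {
            list[i - 1]->next_same = list[i]; /* extend the current run */
        } else {
            heads[nheads++] = list[i];        /* a new run starts here */
        }
    }
    return nheads;
}
```

Sorting first is what makes the chain-building pass linear: equal entries are already adjacent, so no per-insert list walk is needed.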

It also improves scan-time performance by adding checks to:
1. Place longer lists earlier in the trie.
2. Keep close patterns close, rather than scattering them further apart.

This reduces memory cache faults, which improves load and scan time performance.
2019-11-08 14:05:08 -08:00
Micah Snyder
bcb4505e60 bb12370 - cli_strndup and other str* replacements must be built and exported for every OS to be used outside of libclamav on systems that don't have the original functions (e.g. strndup). This commit renames the macros to be uppercase, renames the replacement functions to be preceded by two underscores (e.g. __cli_strndup), and removes the ifdef's so that they are built regardless, because there are no ifdefs in libclamav.map. 2019-10-02 16:08:30 -04:00
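
A minimal sketch of what an always-built strndup replacement with an uppercase wrapper macro might look like. The names `sk_strndup` and `SK_STRNDUP` are hypothetical, and this is not the libclamav implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement, built unconditionally on every platform. */
char *sk_strndup(const char *s, size_t n)
{
    size_t len = 0;
    char *copy;

    if (s == NULL) return NULL;
    while (len < n && s[len] != '\0') len++; /* bounded length, like strnlen */
    copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, s, len);
    copy[len] = '\0';
    return copy;
}

/* Uppercase macro makes it obvious at call sites that a macro is in use. */
#define SK_STRNDUP(s, n) sk_strndup((s), (n))
```

Building the function unconditionally keeps the exported symbol list identical on every platform, which matters when the export map itself cannot contain conditionals.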
Micah Snyder
ee40795fe2 Converted mpool calls to macros when USE_MPOOL is defined to clearly differentiate between function and macro behavior. 2019-10-02 16:08:25 -04:00
Micah Snyder
5f4f69102d Correcting types from int to cl_error_t where appropriate. Eliminating unused variables and referencing unused parameters to remove warnings. 2019-10-02 16:08:25 -04:00
Micah Snyder
52cddcbcfd Updating and cleaning up copyright notices. 2019-10-02 16:08:18 -04:00
Micah Snyder
b3e82e5e61 Replacing libclamav/cltypes.h with clamav-types.h.in, which generates a header clamav-types.h that we install alongside clamav.h. 2019-10-02 16:08:17 -04:00
Micah Snyder
72fd33c8b2 clang-format'd using new .clang-format rules. 2019-10-02 16:08:16 -04:00
Micah Snyder
38fe8b69a0 Added .clang-format style rules, clam-format script to automate formatting of ClamAV code, and preparing select files so that clang-format does not alter carefully formatted sections. 2019-10-02 16:08:16 -04:00
Micah Snyder (micasnyd)
cc12e21dd2 bb12221: Fix for subtle type-mismatch that could result in an infinite loop with a large number of sigs. 2018-12-02 23:07:08 -05:00
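
The commit message does not say which types were involved. As a generic illustration of this class of bug, the sketch below assumes a 16-bit loop counter compared against a 32-bit total; the function names are hypothetical, and the buggy loop is capped so the demonstration terminates.

```c
#include <stdint.h>

/* Buggy pattern: a 16-bit counter compared against a 32-bit total.
 * Returns 1 if the loop finished, 0 if it hit the max_steps cap. */
int sk_narrow_loop_finishes(uint32_t total, uint32_t max_steps)
{
    uint16_t i; /* too narrow when total > UINT16_MAX: wraps back to 0 */
    uint32_t steps = 0;
    for (i = 0; i < total; i++) {
        if (++steps > max_steps) return 0; /* would have spun forever */
    }
    return 1;
}

/* Fixed pattern: the counter has the same type as the bound. */
uint32_t sk_wide_loop_steps(uint32_t total)
{
    uint32_t i, steps = 0;
    for (i = 0; i < total; i++) steps++;
    return steps;
}
```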
Micah Snyder
d7979d4ff7 Restructured scan options flags from a single bitflag field to a structure containing multiple bitflag fields. This also required adding a new function to the bytecode API to get scan options a la carte, and modifying the existing function to hand back scan options in the old/deprecated uint32_t bitflag format. Re-generated bytecode iface header files.
Updated libclamav documentation detailing new scan options structure.
Renamed references to 'algorithmic' detection to 'heuristic' detection. Renamed references to 'properties' to 'collect metadata'.
Renamed references to 'scan all' to 'scan all match'.
Renamed a couple of 'Hueristic.*' signature names as 'Heuristics.*' signatures (plural) to match majority of other heuristics.
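
The restructuring can be pictured with a small sketch: one bitflag field per category, plus a helper that folds the fields back into a single legacy uint32_t. The struct, flag names, and flag values below are hypothetical and chosen only for illustration; they are not the libclamav definitions.

```c
#include <stdint.h>

/* Hypothetical per-category flags. Values can repeat across categories. */
#define SK_GENERAL_ALLMATCHES 0x1u
#define SK_GENERAL_HEURISTICS 0x2u
#define SK_PARSE_ARCHIVE      0x1u

/* Hypothetical legacy flags, all packed into one uint32_t. */
#define SK_LEGACY_ARCHIVE     0x1u
#define SK_LEGACY_ALLMATCHES  0x2u
#define SK_LEGACY_HEURISTICS  0x4u

/* New layout: one bitflag field per category of option. */
struct sk_scan_options {
    uint32_t general;
    uint32_t parse;
};

/* Hands back the options in the old single-bitfield format. */
uint32_t sk_to_legacy(const struct sk_scan_options *opts)
{
    uint32_t legacy = 0;
    if (opts->general & SK_GENERAL_ALLMATCHES) legacy |= SK_LEGACY_ALLMATCHES;
    if (opts->general & SK_GENERAL_HEURISTICS) legacy |= SK_LEGACY_HEURISTICS;
    if (opts->parse & SK_PARSE_ARCHIVE) legacy |= SK_LEGACY_ARCHIVE;
    return legacy;
}
```

Splitting the flags by category gives each category its own 32 bits, so new options can be added without exhausting a single shared field.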
2018-12-02 23:06:59 -05:00
Micah Snyder
927b2bab17 bb11992: cleaning up some variable initialization. 2018-02-08 16:00:14 -05:00
Micah Snyder
d0cba11ea7 adding back changes to eliminate warnings from mspack, matcher, others, and readdb. 2017-09-21 13:10:01 -04:00