In regression testing against a large sample set, I found that strictly
disallowing any embedded file identification if any previous layer was
an embedded file resulted in missed detections.
Specifically, I found an MSEXE file which has an embedded RAR, which in
turn had another MSEXE that itself had an embedded 7ZIP containing... malware.
sha256: c3cf573fd3d1568348506bf6dd6152d99368a7dc19037d135d5903bc1958ea85
To make it so ClamAV can extract all that, we must loosen the
restriction and allow prior layers to be embedded, just not the current
layer.
I've also added some logic to prevent attempting to extract an object at
the same offset twice. The `fpt->offset`s appear in order, but if you
have multiple file type magic signatures match on the same address, like
maybe you accidentally load two different .ftm files, then you'd get
duplicates and double-extraction.
As a bonus, I found I could also skip over offsets within a valid ZIP,
ARJ, or CAB, since we now have their lengths from the header check and
we don't want to extract embedded contents that way.
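For reference, a minimal sketch of the duplicate-offset skip (not the actual
ClamAV code; the `fpt`/`offset` names mirror the description above):
```c
#include <stddef.h>

struct match {
    size_t offset;
    struct match *next;
};

static void process_matches(struct match *ftoffset)
{
    size_t last_offset = (size_t)-1; /* sentinel: nothing processed yet */
    struct match *fpt;

    for (fpt = ftoffset; fpt != NULL; fpt = fpt->next) {
        if (fpt->offset == last_offset) {
            /* two magic signatures matched the same address: skip the duplicate */
            continue;
        }
        last_offset = fpt->offset;

        /* ... attempt extraction of the embedded object at fpt->offset ... */
    }
}
```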
Previously for documents containing VBA projects, the VBA was treated
as an object within the document and not as a normalized version of
the document. I apparently switched it to say that the VBA is a normalized
version of the document. This kind of makes sense in that presently
Javascript extracted from HTML is treated as a normalized version of the
HTML. But it probably shouldn't.
Normalized layers are treated as the same file as the parent.
So now those older signatures that match on VBA projects using
"Container:CL_TYPE_MSOLE2" are failing to match.
So this commit switches it back. VBA project bits written out to a temp
file for scanning will be treated as being contained within the document.
CLAM-2896
Uncompressed ZIP-based TNEF message attachments, like OOXML office
document attachments, get double-extracted because of embedded file type
recognition.
To prevent excessive scan times, disable embedded file type recognition
for TNEF files and rely on TNEF parsing to extract attachments.
CLAM-2885
Scans of CL_TYPE_MSEXE files that have embedded file type signature matches
for CL_TYPE_MSEXE are incorrectly passing the PE header check for the
contained file, resulting in excessive scan times.
The problem is that the `peinfo` struct needs to have the `offset` set
for the contained `CL_TYPE_MSEXE` match prior to the header check.
Without that, the header check was actually validating the PE header of
the original file, which would always pass when that's a PE, and would
always fail if it's an OLE2 file (the other type in which we check for
contained PEs).
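A hedged illustration of the fix; `cli_exe_info_init()` is the existing
initializer, while `pe_header_check()` stands in for the real header-check
call, whose name and signature differ:
```c
struct cli_exe_info peinfo;

/* The offset of the embedded CL_TYPE_MSEXE match must be set *before* the
 * header check; otherwise the check parses the PE header at offset 0, i.e.
 * the original file. */
cli_exe_info_init(&peinfo, fpt->offset);

if (CL_SUCCESS != pe_header_check(ctx, &peinfo)) {
    /* not a valid contained PE: skip it rather than scanning it */
    continue;
}
```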
The additional code change in this commit is to require that the `ctx`
parameter never be NULL, and to remove the `map` parameter because, in
practice, it always comes from `ctx->fmap`. This is to safeguard
against future changes to the function that may accidentally use `ctx`
without a proper NULL check.
CLAM-2882
`libclamav/libmspack.c`: Initialize variables before first `goto done;`
to fix uninitialized variable use in an error condition.
`libclamav/others.c`: Explicitly ignore return values for calls to add
JSON values when subsequent calls don't depend on them.
If we were to add error handling here, the only thing we'd do is debug-
log it. I don't think it's worth adding the extra lines of code.
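For example (the key name and value here are made up; `cli_jsonstr()` is one
of the JSON helpers in question):
```c
/* Explicitly discard the return value: nothing after this call depends on
 * it, and the only error handling we would add is a debug log. */
(void)cli_jsonstr(ctx->wrkproperty, "ExampleKey", example_value);
```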
`libclamav/unarj.c`: Removed dead code.
The `status` variable is immediately set afterwards based on whether or
not any files may be extracted.
`libclamav/unzip.c`: Removed dead code.
The `ret` variable is checked immediately after being set, above. This
check after the `do`-`while()` loop is dead code.
`sigtool/sigtool.c`: Fix potential NULL deref in error handling.
This is a fix for the same issue as was fixed in a previous commit.
I somehow overlooked this one. Copy/paste bug.
`libclamav/pdfdecode.c`: Fix leaked `stream` memory when
`filter_lzwdecode()` fails.
`clamdtop/clamdtop.c`: Fix possible NULL dereference if `strchr` returns
NULL in `read_version()` and `check_stats_available()`.
`libclamav/rtf.c`: Fix memory leak in `rtf_object_process()` if
`cli_gentemp_with_prefix()` fails.
Also change empty for-loop to resolve clang-format weirdness and make it
more obvious the for-loop has no body.
`libclamav/aspack.c`: Ensure that `endoff - old` is not negative in
`build_decrypt_array()` before passing it to `CLI_ISCONTAINED()`, which
expects unsigned values.
`libclamav/upx.c`: Fix integer overflow checks in multiple functions.
`libclamav/vba_extract.c`: Set `entries` pointer back to NULL after free in
`word_read_macro_entry()` error condition.
`libclamav/unzip.c`: Remove logic to return `CL_EMAXFILES` from
`index_local_file_headers()`. It seems it only overwrote `status` when it was
not `CL_SUCCESS`, in which case it could be overriding a more serious failure.
Further, updates to how the ZIP parser works have made it so this needs
to return `CL_SUCCESS` in order for the caller to at least scan the files
found so far.
Finally, the calling function has checks of its own to make sure we don't
exceed the max-files limit.
`libclamav/unzip.c`: Fix issue where `cli_append_potentially_unwanted()` in
`index_local_file_headers()` might overwrite an error in `status` with
`CL_CLEAN`. Instead, it now checks the return value and only overwrites the
`CL_EFORMAT` status if the return value is not `CL_SUCCESS`.
`libclamav/unzip.c`: Fix a potential leak with `combined_catalogue` and
`temp_catalogue` in an error condition. We should always free them if not NULL,
not just if the function failed. And to make this safe, we must set
`combined_catalogue` to NULL when we give ownership to `*catalogue`.
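A hedged sketch of that cleanup pattern, simplified from the actual function
(`free()` stands in for whatever the real destructor is):
```c
    /* Hand ownership of combined_catalogue to the caller, then clear the
     * local pointer so the unconditional cleanup below cannot free it. */
    *catalogue = combined_catalogue;
    combined_catalogue = NULL;

    status = CL_SUCCESS;

done:
    /* Always free whatever we still own, whether or not the function failed. */
    if (NULL != combined_catalogue) {
        free(combined_catalogue);
    }
    if (NULL != temp_catalogue) {
        free(temp_catalogue);
    }

    return status;
```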
`libclamav/scanners.c`: Fix a potential leak in error handling for the
`cli_ole2_tempdir_scan_vba()` function.
CLAM-2768
`clamscan/manager.c`: Fix double-free in an error condition in `scanfile()`.
`common/optparser.c`: Fix uninitialized use of the `numarg` variable when
`arg` is `NULL`.
`libclamav/cache.c`: Don't check if `ctx->fmap` is `NULL` when we've
already dereferenced it.
`libclamav/crypto.c`: The `win_exception` variable and associated logic
are Windows-specific and so need preprocessor platform checks. Otherwise
they generate unused-variable warnings.
`libclamav/crypto.c`: Check for `size_t` overflow of the `byte_read`
variable in the `cl_hash_file_fd_ex()` function.
`libclamav/crypto.c`: Fix a memory leak in the `cl_hash_file_fd_ex()`
function.
`libclamav/fmap.c`: Correctly free the `name` and `path` pointers if
`fmap_duplicate()` fails. Also need to clear those variables when
duplicating the parent `map` so that on error it does not free the wrong
`name` or `path`.
`libclamav/fmap.c`: Refine error handling for `hash_string` cleanup in
`cl_fmap_get_hash()`. Coverity's complaint was that `hash_string` could
never be non-NULL if `status` is not `CL_SUCCESS`. I.e., the cleanup is
dead code. I don't think my cleanup actually "fixes" that though it is
definitely a better way to do the error handling.
The `if (NULL != hash_string) {` check is still technically dead code.
It safeguards against future changes that may `goto done` between the
allocation and transferring ownership from `hash_string` to `hash_out`.
`libclamav/others.c`: Fix possible memory leak in `cli_recursion_stack_push()`.
`libclamav/others.c`: Refactor an if/else + switch statement inside
`cli_dispatch_scan_callback()` so that the `CL_SCAN_CALLBACK_ALERT` case
is not dead-code. It's also easier to read now.
`libclamav/pdfdecode.c`: For logging, use `%zu` to format `size_t`
instead of casting to `long long` and using `%llu`. Similarly, use the
`STDu32` format string macro for `uint32_t`.
`libclamav/pdfdecode.c`: Fix a possible double-free for the `decoded`
pointer in `filter_lzwdecode()`.
`libclamav/pdfdecode.c`: Remove the `if (capacity > UINT_MAX) {`
overflow check inside `filter_lzwdecode()`, which didn't do anything.
The `capacity` variable at this point is a fixed value and so I also changed
the `avail_out` to be that fixed `INFLATE_CHUNK_SIZE` value rather than
using `capacity`. It is more straightforward and replicates how similar
logic works later in the file.
I also removed the `(Bytef *)` cast, which didn't really do anything and was
copy-pasted from a different algorithm. The LZW implementation interface
doesn't use `Bytef`.
`libclamav/readdb.c`: Fix a possible NULL-deref on the `matcher` variable
in the error handling/cleanup code if the function fails.
`libclamav/scanners.c`: Fix an issue where the return value from some of
the parsers may be lost/overridden by the call to
`cli_dispatch_scan_callback()` just after the `done:` label in
`cli_magic_scan()`.
`libclamav/scanners.c`: Silence an unused-return-value warning when
calling `cli_basename()`.
`sigtool/sigtool.c` and `unit_tests/check_regex.c`:
Fix possible NULL-derefs of the `ctx.recursion_stack` pointer in the error
handling for several functions.
Also, and this isn't a Coverity thing:
`libclamav/json_api.c` and `libclamav/others.c`:
Fix support for libjson-c version 0.13 and older.
I don't think we *should* be using the old version, but some environments
such as the current OSS-Fuzz base image are older and still use it.
The issue is that `json_object_new_uint64()` was introduced in a later
libjson-c version, so we have to fall back to `json_object_new_int64()`
with older libjson-c, provided the integer we're storing isn't too big.
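A hedged sketch of the fallback, using json-c's version macro to detect
whether `json_object_new_uint64()` is available (the actual check in
`json_api.c` may differ):
```c
#include <json-c/json.h> /* may be <json.h> depending on include paths */
#include <stdint.h>

static struct json_object *new_uint64_compat(uint64_t value)
{
#if defined(JSON_C_VERSION_NUM) && (JSON_C_VERSION_NUM >= ((0 << 16) | (14 << 8) | 0))
    /* json-c 0.14+ provides an unsigned 64-bit constructor */
    return json_object_new_uint64(value);
#else
    /* older json-c: fall back to the signed constructor when the value fits */
    if (value > (uint64_t)INT64_MAX) {
        return NULL;
    }
    return json_object_new_int64((int64_t)value);
#endif
}
```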
CLAM-2768
This commit removes the XZ.DicSizeLimit heuristic. The heuristic was
triggered whenever CLI_MAX_ALLOCATION was reached or an OOM event
happened during one of the allocation wrappers. Because the heuristic is not
indicative of the actual event(s) that led to its triggering, and because we
have not had any valid hits for this heuristic, we are opting to remove it.
CLAM-2834
When embedded file type recognition finds a possible embedded file, it
is being scanned as a new embedded file even if it turns out it was a
false positive and parsing fails. My solution is to pre-parse the file
headers as little as possible to determine whether the file is valid. If possible,
also determine the file size based on the headers. That will make it so
we don't have to scan additional data when the embedded file is not at
the very end.
This commit adds header checks prior to embedded ZIP, ARJ, and CAB
scanning. For these types I was also able to use the header checks to
determine the object size so as to prevent excessive pattern matching.
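The flow looks roughly like this (both helper names below are hypothetical
stand-ins for the new code):
```c
size_t obj_size = 0;

/* Validate the ZIP/ARJ/CAB header at the match offset and, if valid,
 * compute the object's size from its own headers. */
if (CL_SUCCESS != embedded_header_check(ctx, fpt->offset, &obj_size)) {
    continue; /* false-positive magic match: don't create an embedded layer */
}

/* Scan only [offset, offset + obj_size) instead of offset..EOF, which also
 * lets us skip later magic matches that fall inside this window. */
ret = scan_embedded_object(ctx, fpt->offset, obj_size);
```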
TODO: Add the same for RAR, EGG, 7Z, NULSFT, AUTOIT, IShield, and PDF.
This commit also removes duplicate matching for embedded MSEXE.
The embedded MSEXE detection and scanning logic was accidentally
creating an extra duplicate layer in between scanning and detection
because of the logic within the `cli_scanembpe()` function.
That function was effectively doing the header check that this commit
adds for ZIP, ARJ, and CAB, minus the size check.
Note: It is unfortunately not possible to get an accurate size from PE
file headers.
The `cli_scanembpe()` function also used to dump to a temp file for no
reason since FMAPs were extended to support windows into other FMAPs.
So this commit removes the intermediate layer as well as dropping a temp
file for each embedded PE file.
Further, this commit adds configuration and DCONF safeguards around all
embedded file type scanning.
Finally, this commit adds a set of tests to validate proper extraction
of embedded ZIP, ARJ, CAB, and MSEXE files.
CLAM-2862
Co-authored-by: TheRaynMan <draynor@sourcefire.com>
If the application never calls `cl_load()`, then the clean-cache is
never initialized. That is a legitimate mode to run in when perhaps we
just want to extract stuff, record metadata, or fuzz.
This commit adds a check for whether `ctx->engine->cache` is NULL. If it
is, then we treat it as though caching is disabled.
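Roughly (a sketch, not the exact code):
```c
if (NULL == ctx->engine->cache) {
    /* cl_load() was never called, so the clean-cache was never initialized.
     * Treat this exactly like "caching disabled": skip the lookup/insert. */
    goto done;
}
```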
CLAM-2856
We presently record Alerts as an array of signature names.
Instead, it should be an object with properties of its own.
We should record alerting indicators and weak indicators in a single
"Indicators" array, likely with the same structure as the "Alerts" objects.
When an alerting indicator is ignored (e.g. ignored by callback or if
the file is trusted by an FP signature), we can remove it from the
"Alerts" array, and for the "Indicators" array, add an "Ignored" key with
a string value that explains why it was ignored.
This eliminates the need to track and propagate the additional
"WeakIndicators" and "IgnoredAlerts" arrays.
If the outermost layer is trusted (e.g. using an FP signature), the verdict passed
back by the `cl_scan*_ex()` functions should be CL_VERDICT_TRUSTED.
To ensure this, and other correct verdicts, I moved the logic setting the verdict
to occur when adding indicators, or trusting a given layer. Then at the end of a
scan, it will set the output verdict parameter to the top level verdict.
This commit also:
* Fixes a bug in the `ex_scan_callbacks` program where a crash would happen when
a hash was retrieved for an inner layer but not for the container.
* Added debug logs whenever a hash is calculated or set, printing the hash type
and hash string.
* When a layer is trusted, in addition to removing evidence for that layer, it
will also index the metadata JSON (if that feature is enabled) and will rename
any "Viruses" to "IgnoredAlerts", and rename "ContainedIndicators" to
"IgnoredContainedIndicators".
* Fixed an issue where setting the hash algorithm with extra trailing characters,
like "sha256789", would ignore the extra characters and report the hash type as
though it matched. It will now fail if the string length differs from that of the
known hash algorithm name.
Both enum variants are 0, so it's a no-op.
`cl_error_t` should be used to determine if we keep going or stop, whether
that's because there was a detection and we're not in allmatch mode, or if
because of an error.
`cl_verdict_t` should be used to determine or report the verdict (clean is a
verdict, though I feel 'nothing found' is more accurate).
Improvements to the ex_scan_callbacks.c program:
- Print the verdict enum variant names to be more explicit.
- Add the file_props callback (aka metadata JSON) with --gen-json option.
- Add a --debug option.
- Use '-' in option names instead of '_' to be consistent with other programs.
- Add option to disable allmatch, which I named --one-match. :)
Tests: Add ex_scan_callbacks test where --allmatch is disabled.
Verify that CL_VIRUS is returned when a match occurs.
I found a few bugs and inconsistencies from this test and went and fixed
them, and improved the clamav.h function comments as well.
Largely this resulted in cleanup in `cli_magic_scan()` to make sure we
don't accidentally overwrite the return code.
But it also meant making sure that callback functions which are supposed
to trust a file actually clear the evidence/verdict and don't return
CL_VIRUS.
It is a shortcoming of existing scan APIs that it is not possible
to return an error without masking a verdict.
We presently work around this limitation by counting up detections at
the end and then overriding the error code with `CL_VIRUS`, if necessary.
The `cl_scanfile_ex()`, `cl_scandesc_ex()`, and `cl_scanmap_ex()` functions
should provide the scan verdict separately from the error code.
This introduces a new enum for recording and reporting a verdict:
`cl_verdict_t` with options:
- `CL_VERDICT_NOTHING_FOUND`
- `CL_VERDICT_TRUSTED`
- `CL_VERDICT_STRONG_INDICATOR`
- `CL_VERDICT_POTENTIALLY_UNWANTED`
Notably, the newer scan APIs may set the verdict to `CL_VERDICT_TRUSTED`
if there is a (hash-based) FP signature for a file, or in the case where
Authenticode or similar certificate-based verification was performed, or
in the case where an application scan callback returned `CL_VERIFIED`.
CLAM-763
CLAM-865
The ClamScan scan summary prints bytes scanned and bytes read in
multiples of 4096 (aka `CL_COUNT_PRECISION`), as is provided by the
`cl_scanfile()`, `cl_scandesc()`, `cl_scanfile_callback()`, and
`cl_scandesc_callback()` functions.
I believe this imprecision was the result of using an `unsigned long int`
which may be 64bit or 32bit, depending on platform. I believe the
intention was to be able to support scanning more than 4 GiB of data.
Since the new `cl_scan*_ex()` functions use a `uint64_t`, which
guarantees a 64bit integer and supports ~16,777,216 terabytes, I find no
reason not to report an accurate count.
For the legacy scan functions (above) I've kept the `CL_COUNT_PRECISION`
behavior to maintain backwards compatibility.
I have also improved the bytes scanned/read output to report GiB, MiB,
KiB, or B as appropriate. Previously, it always reported "MB".
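A minimal sketch of the unit selection (illustrative, not the exact clamscan
code):
```c
#include <inttypes.h>
#include <stdio.h>

static void print_data_size(const char *label, uint64_t bytes)
{
    if (bytes >= 1024ULL * 1024 * 1024) {
        printf("%s: %.2f GiB\n", label, (double)bytes / (1024.0 * 1024.0 * 1024.0));
    } else if (bytes >= 1024ULL * 1024) {
        printf("%s: %.2f MiB\n", label, (double)bytes / (1024.0 * 1024.0));
    } else if (bytes >= 1024ULL) {
        printf("%s: %.2f KiB\n", label, (double)bytes / 1024.0);
    } else {
        printf("%s: %" PRIu64 " B\n", label, bytes);
    }
}
```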
CLAM-1433
Add the following scan callbacks:
```c
cl_engine_set_scan_callback(engine, &pre_hash_callback, CL_SCAN_CALLBACK_PRE_HASH);
cl_engine_set_scan_callback(engine, &pre_scan_callback, CL_SCAN_CALLBACK_PRE_SCAN);
cl_engine_set_scan_callback(engine, &post_scan_callback, CL_SCAN_CALLBACK_POST_SCAN);
cl_engine_set_scan_callback(engine, &alert_callback, CL_SCAN_CALLBACK_ALERT);
cl_engine_set_scan_callback(engine, &file_type_callback, CL_SCAN_CALLBACK_FILE_TYPE);
```
Each callback may alter scan behavior using the following return codes:
* CL_BREAK
Scan aborted by callback (the rest of the scan is skipped).
This does not mark the file as clean or infected; it just skips the rest of the scan.
* CL_SUCCESS / CL_CLEAN
File scan will continue.
This is different than CL_VERIFIED because it does not affect prior or future alerts.
Return CL_VERIFIED instead if you want to remove prior alerts for this layer and skip
the rest of the scan for this layer.
* CL_VIRUS
This means you don't trust the file. A new alert will be added.
For CL_SCAN_CALLBACK_ALERT: Means you agree with the alert (no extra alert needed).
* CL_VERIFIED
Layer explicitly trusted by the callback and previous alerts removed FOR THIS layer.
You might want to do this if you trust the hash or verified a digital signature.
The rest of the scan will be skipped FOR THIS layer.
For contained files, this does NOT mean that the parent or adjacent layers are trusted.
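A hedged example of these return codes in use; the callback prototype here is
a simplification (see `examples/ex_scan_callbacks.c` for the real one), and
`hash_is_on_allow_list()` is a hypothetical application-side check:
```c
#include "clamav.h"

static cl_error_t pre_scan_callback(cl_scan_layer_t *layer, void *context)
{
    cl_fmap_t *map   = NULL;
    const char *hash = NULL;

    (void)context;

    /* If this layer's hash is known-good, trust it: prior alerts for THIS
     * layer are removed and the rest of its scan is skipped. */
    if (CL_SUCCESS == cl_scan_layer_get_fmap(layer, &map) &&
        CL_SUCCESS == cl_fmap_get_hash(map, "sha2-256", &hash) &&
        hash_is_on_allow_list(hash)) {
        return CL_VERIFIED;
    }

    return CL_SUCCESS; /* continue scanning this layer normally */
}
```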
Each callback is given a pointer to the current scan layer from which
they can get previous layers, can get the layer's fmap, and then
various attributes of the layer and of the fmap such as:
- layer recursion level
- layer object id
- layer file type
- layer attributes (was decrypted, normalized, embedded, or re-typed)
- layer last alert
- fmap name
- fmap hash (md5, sha1, or sha2-256)
- fmap data (pointer and size)
- fmap file descriptor, if any (fd, offset, size)
- fmap filepath, if any (filepath, offset, size)
To make this possible, this commit introduces a handful of new APIs to
query scan-layer details and fmap details:
- `cl_error_t cl_fmap_set_name(cl_fmap_t *map, const char *name);`
- `cl_error_t cl_fmap_get_name(cl_fmap_t *map, const char **name_out);`
- `cl_error_t cl_fmap_set_path(cl_fmap_t *map, const char *path);`
- `cl_error_t cl_fmap_get_path(cl_fmap_t *map, const char **path_out, size_t *offset_out, size_t *len_out);`
- `cl_error_t cl_fmap_get_fd(const cl_fmap_t *map, int *fd_out, size_t *offset_out, size_t *len_out);`
- `cl_error_t cl_fmap_get_size(const cl_fmap_t *map, size_t *size_out);`
- `cl_error_t cl_fmap_set_hash(const cl_fmap_t *map, const char *hash_alg, const char *hash);`
- `cl_error_t cl_fmap_have_hash(const cl_fmap_t *map, const char *hash_alg, bool *have_hash_out);`
- `cl_error_t cl_fmap_will_need_hash_later(const cl_fmap_t *map, const char *hash_alg);`
- `cl_error_t cl_fmap_get_hash(const cl_fmap_t *map, const char *hash_alg, const char **hash_out);`
- `cl_error_t cl_fmap_get_data(const cl_fmap_t *map, size_t offset, size_t len, const uint8_t **data_out, size_t *data_len_out);`
- `cl_error_t cl_scan_layer_get_fmap(cl_scan_layer_t *layer, cl_fmap_t **fmap_out);`
- `cl_error_t cl_scan_layer_get_parent_layer(cl_scan_layer_t *layer, cl_scan_layer_t **parent_layer_out);`
- `cl_error_t cl_scan_layer_get_type(cl_scan_layer_t *layer, const char **type_out);`
- `cl_error_t cl_scan_layer_get_recursion_level(cl_scan_layer_t *layer, uint32_t *recursion_level_out);`
- `cl_error_t cl_scan_layer_get_object_id(cl_scan_layer_t *layer, uint64_t *object_id_out);`
- `cl_error_t cl_scan_layer_get_last_alert(cl_scan_layer_t *layer, const char **alert_name_out);`
- `cl_error_t cl_scan_layer_get_attributes(cl_scan_layer_t *layer, uint32_t *attributes_out);`
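A sketch of how these fit together inside a callback, using only the getters
listed above (error handling trimmed for brevity):
```c
#include <inttypes.h>
#include <stdio.h>
#include "clamav.h"

static void print_layer_details(cl_scan_layer_t *layer)
{
    const char *type    = NULL;
    uint32_t recursion  = 0;
    uint64_t object_id  = 0;
    uint32_t attributes = 0;
    size_t size         = 0;
    cl_fmap_t *map      = NULL;

    (void)cl_scan_layer_get_type(layer, &type);
    (void)cl_scan_layer_get_recursion_level(layer, &recursion);
    (void)cl_scan_layer_get_object_id(layer, &object_id);
    (void)cl_scan_layer_get_attributes(layer, &attributes);

    if (CL_SUCCESS == cl_scan_layer_get_fmap(layer, &map)) {
        (void)cl_fmap_get_size(map, &size);
    }

    printf("object %" PRIu64 ": type=%s, depth=%" PRIu32 ", size=%zu, attributes=0x%" PRIx32 "\n",
           object_id, type ? type : "?", recursion, size, attributes);
}
```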
This commit deprecates but does not remove the existing scan callbacks:
- `void cl_engine_set_clcb_pre_cache(struct cl_engine *engine, clcb_pre_cache callback);`
- `void cl_engine_set_clcb_file_inspection(struct cl_engine *engine, clcb_file_inspection callback);`
- `void cl_engine_set_clcb_pre_scan(struct cl_engine *engine, clcb_pre_scan callback);`
- `void cl_engine_set_clcb_post_scan(struct cl_engine *engine, clcb_post_scan callback);`
- `void cl_engine_set_clcb_virus_found(struct cl_engine *engine, clcb_virus_found callback);`
- `void cl_engine_set_clcb_hash(struct cl_engine *engine, clcb_hash callback);`
This commit also adds an interactive test program to demonstrate the callbacks.
See: `examples/ex_scan_callbacks.c`
CLAM-255
CLAM-2485
CLAM-2626
Scans containing embedded files (e.g. appended ZIP archives) are not
running the callback functions (pre-scan, pre-cache, inspection, etc.)
on the embedded files.
Run the cli_magic_scan() function for every file we extract using
embedded file type recognition.
CLAM-2796
File-type scanning for OOXML-based office documents results in
double-extraction, because we also extract OOXML document components
in the appropriate parser modules.
CLAM-1412
Add `cl_scanfile_ex()`, `cl_scanmap_ex()`, and `cl_scandesc_ex()`
functions that provide the following additional parameters:
- `hash_hint` (Optional) A NULL-terminated string of the file hash so that
  libclamav does not need to calculate it.
- `hash_out` [out] (Optional) A NULL-terminated string of the file hash.
  The caller is responsible for freeing the string.
- `hash_alg` The hashing algorithm used for either `hash_hint` or `hash_out`.
  Supported algorithms are "md5", "sha1", and "sha2-256".
  If not specified, the default is "sha2-256".
- `file_type_hint` (Optional) A NULL-terminated string of the file type hint.
  E.g. "pe", "elf", "zip", etc.
  You may also use ClamAV type names such as "CL_TYPE_PE".
  ClamAV will ignore the hint if it is not familiar with the specified type.
  See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
- `file_type_out` (Optional) A NULL-terminated string of the file type of the
  top layer as determined by ClamAV.
  Will take the form of the standard ClamAV file type format, e.g. "CL_TYPE_PE".
  See also: https://docs.clamav.net/appendix/FileTypes.html#file-types
CLAM-2626
Temp directory recursion in ClamAV is when each layer of a scan gets its
own temp directory in the parent layer's temp directory.
In addition to temp directory recursion, ClamAV has been creating a new
subdirectory for each file scan as a risk-averse method to ensure
no temporary file leaks fill up the disk.
Creating a directory is relatively slow on Windows in particular if
scanning a lot of very small files.
This commit:
1. Separates the temp directory recursion feature from the leave-temps
feature so that libclamav can leave temp files without making
subdirectories for each file scanned.
2. Makes it so that when temp directory recursion is off, libclamav
will just use the configured temp directory for all files.
The new option to enable temp directory recursion is for libclamav-only
at this time. It is off by default, and you can enable it like this:
```c
cl_engine_set_num(engine, CL_ENGINE_TMPDIR_RECURSION, 1);
```
For the `clamscan` and `clamd` programs, temp directory recursion will
be enabled when `--leave-temps` / `LeaveTemporaryFiles` is enabled.
The difference is that when disabled, it will return to using the
configured temp directory without making a subdirectory for each file
scanned, so as to improve scan performance for small files, mostly on
Windows.
Under the hood, this commit also:
1. Cleans up how we keep track of tmpdirs for each layer.
The goal here is to align how we keep track of layer-specific stuff
using the scan_layer structure.
2. Cleans up how we record metadata JSON for embedded files.
Note: Embedded files being different from Contained files, as they
are extracted not with a parser, but by finding them with
file type magic signatures.
CLAM-1583
Move recording of evidence (aka Strong, PUA, and Weak indicators) to be
done in each layer of a scan, and passed up to the parent layer, with the
top level only collecting the results at the very end of the scan.
This is needed to provide access to the last alert for a given layer when
we upgrade the scan callbacks.
Note that when adding evidence from a child layer that is a normalized
layer, we do not want to increase the depth. It should appear as though
the match occurred on the parent layer.
This is for two reasons:
1. We don't run the scan callbacks on normalized layers.
2. Future matches on Weak Indicators should be able to treat normalized
layer matches the same as original file matches. Keep reading for
more about Weak Indicators.
Recording scan matches at each recursion layer is also needed to support
Weak Indicators, a feature where an alerting signature (aka Strong
Indicator) may require the match of a non-alerting signature (aka
Weak Indicator) on the same layer or on child layers in order to alert.
Support for Weak indicators was blocked by not keeping track of where
indicators were found. So this commit also enables support for recording
Weak indicators.
Like PUA, Weak indicators are treated differently based on the signature
prefix. That is, any signature starting with "Weak." won't cause an
alert on its own.
The next step to completing Weak Indicator support will be adding a
logical subsignature feature to depend on a weak indicator match.
CLAM-2626
CLAM-2485
There is a check which makes it so this feature only records URIs when
the HTML file is at the top level. E.g. if you zip an HTML file, then
a scan of the ZIP will no longer record URIs for the HTML within the
ZIP.
I think this is a mistake and I have removed that check.
Every time we push a new map onto the scanning recursion context, give
it a unique object id number, which counts from zero.
Moved the location where we add metadata for each file from the
"cli_magic_scan" function over to the "recursion stack push" function.
Include a "path" as a parameter for creating a new fmap, and rename some
related variables and functions to be more intuitive.
CLAM-2796
See also: CLAM-2485, CLAM-2626
Change the clean-cache to use SHA2-256 instead of MD5.
Note that all references are changed to specify "SHA2-256" now instead
of "SHA256", for clarity. But there is no plan to add support for SHA3
algorithms at this time.
Significant code cleanup. E.g.:
- Implemented goto-done error handling.
- Used `uint8_t *` instead of `unsigned char *`.
- Use `bool` for boolean checks, rather than `int`.
- Used `#defines` instead of magic numbers.
- Removed duplicate `#defines` for things like hash length.
Add new option to calculate and record additional hash types when the
"generate metadata JSON" feature is enabled:
- libclamav option: `CL_SCAN_GENERAL_STORE_EXTRA_HASHES`
- clamscan option: `--json-store-extra-hashes` (default off)
- clamd.conf option: `JsonStoreExtraHashes` (default 'no')
Renamed the sigtool option `--sha256` to `--sha2-256`.
The original option is still functional, but is deprecated.
For the "generate metadata JSON" feature, the file hash is now stored as
"sha2-256" instead of "FileMD5". If you enable the "extra hashes" option,
then it will also record "md5" and "sha1".
Deprecate and disable the internal "SHA collect" feature.
This option had been hidden behind C #ifdef checks for an option that
wasn't exposed through CMake, so it was basically unavailable anyways.
Changes to calculate file hashes when they're needed and no sooner.
For the FP feature in the matcher module, I have mimicked the
optimization in the FMAP scan routine which makes it so that it can
calculate multiple hashes in a single pass of the file.
The `HandlerType` feature stores a hash of the file in the scan ctx to
prevent retyping the exact same data more than once.
I removed that hash field and replaced it with an attribute flag that is
applied to the new recursion stack layer when retyping a file.
This also closes a minor bug that would prevent retyping a file with an
all-zero hash. :)
The work upgrading cache.c to support SHA2-256 sized hashes thanks to:
https://github.com/m-sola
CLAM-255
CLAM-1858
CLAM-1859
CLAM-1860
I accidentally introduced a NULL-dereference bug when scanning any OOXML
file in https://github.com/Cisco-Talos/clamav/pull/1548
I overlooked the test failure out of haste. 😔
The NULL-dereference happens because the `unzip_search()` feature
allowed searching some other file than the one that is currently being
scanned, which you would do by setting `ctx` to NULL and setting an
`fmap` parameter instead.
In practice, the current layer's `fmap` from the `ctx` was always passed in.
This fix makes it so the `unzip_search()` and related functions only
take the `ctx` parameter and do not have an `fmap` or `fsize` parameter
(Note: the `fsize` was never needed, because `fmap->len` takes care of that).
CLAM-2837
On Windows, the cli_basename function should treat both '/' and '\' as path
separators. Most Windows APIs also accept both.
On Linux/Unix, when a filepath is more for informational purposes or may
have come from a Windows system, it makes sense to treat '\' as a path
separator as well.
But in situations where the path is needed for some critical action,
like moving or deleting a file, we can't treat it as a path separator.
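A minimal sketch of that separator rule (not the actual `cli_basename`
implementation):
```c
#include <stdbool.h>

/* Return a pointer to the basename portion of `path`. When
 * `accept_backslash` is true (Windows, or informational paths on
 * Linux/Unix), both '/' and '\' count as separators; otherwise only '/'. */
static const char *basename_start(const char *path, bool accept_backslash)
{
    const char *last = path;
    const char *p;

    for (p = path; *p != '\0'; p++) {
        if (*p == '/' || (accept_backslash && *p == '\\')) {
            last = p + 1;
        }
    }
    return last;
}
```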
Threat Research requests scanning URIs in PDF files and adding them to
the json report file.
This change adds URI scanning support to the PDF parser, including
support for object references to URIs in PDF files.
Jira: CLAM-2588
Fix out-of-order references and other minor improvements.
CLAM-2588, CLAM-2757
This is just preliminary support for identifying an assortment of
different AI model files.
So far, this detects the following types:
- GGML GGUF (.gguf)
- ONNX AI (.onnx)
- TensorFlow Lite (.tflite)
Additional types to consider:
- SafeTensors (.safetensors)
- TensorFlow (.pb, .ckpt, .tfrecords)
- Keras (.keras)
- pickle (.pkl)
- numpy (.npy, .npz)
- coreml (.coreml)
- PyTorch (.pt, .pth, .bin, .mar, .pte, .pt2, .ptl)
Outside of being able to differentiate by file type, the scanner
will treat CL_TYPE_AI_MODEL the same as CL_TYPE_BINARY_DATA.
We're not adding parsers to further process these files, for now.
Store URLs found in HTML `<a>` and `<form>` tags during scan of HTML files
when recording scan metadata.
HTML URL recording will be ON by default, but is a part of the
generate-metadata-json feature.
The generate-metadata-json feature is OFF by default.
This introduces a new general scan option:
- libclamav: `CL_SCAN_GENERAL_STORE_HTML_URLS`.
- ClamD: `JsonStoreHTMLUrls`.
- ClamScan: `--json-store-html-urls`
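Enabling it through libclamav looks like this (HTML URL recording rides on the
metadata-JSON feature, so both general flags are set; a sketch, not copied
from the examples):
```c
struct cl_scan_options options = {0};

options.general |= CL_SCAN_GENERAL_COLLECT_METADATA; /* generate-metadata-json */
options.general |= CL_SCAN_GENERAL_STORE_HTML_URLS;  /* record <a> and <form> URLs */
```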
Thank you Matt Jolly for the helpful comment on the pull request.
If SCAN_COLLECT_METADATA is enabled, and caching is disabled, we zero-out
the hash after recording it.
This results in a non-NULL but invalid hash that may be passed to
`cli_scan_fmap()` for the "raw mode" scan.
It's an uncommon code path, but would result in comparing hash-sigs with
a zeroed hash rather than the valid hash.
This bug could result in missed hash-based sig matches.
There is no reason to invalidate or zero-out the hash if we happen to
calculate it. We avoid the cache-lookup by checking the engine setting,
not by checking if we have a hash.
As of ClamAV 0.105, libjson-c is required.
There is also no option to disable libjson-c support.
This commit removes the dead code associated with the old build
option.
As of ClamAV 0.105, libbz2 is required.
There is also no option to disable bz2 support.
This commit removes the dead code associated with the old build
option.
As of ClamAV 0.105, PCRE2 is required. PCRE (1) is not an option, and
there is also no option to disable PCRE support.
This commit removes the dead code associated with those old build
options.
Primarily this commit fixes an issue with the size of the parameters
passed to cli_checklimits(). The parameters were "unsigned long", which
varies in size depending on platform.
I've switched them to uint64_t / u64.
While working on this, I observed some concerning warnings on Windows,
and some less serious ones, primarily regarding inconsistencies with
`const` parameters.
Finally, in `scanmem.c`, there is a warning regarding use of `wchar_t *`
with `GetModuleFileNameEx()` instead of `GetModuleFileNameExW()`.
This made me realize this code assumes we're not defining `UNICODE`,
which would have such macros use the 'A' variant.
I have fixed it the best I can, although I'm still a little
uncomfortable with some of this code that uses `char` or `wchar_t`
instead of TCHAR.
I also remove the `if (GetModuleFileNameEx) {` conditional, because this
macro/function will always be defined. The original code was checking a
function pointer, and so this was a bug when integrating into ClamAV.
Regarding the changes to `rijndael.c`, I found that this module assumes
`unsigned long` == 32bits. It does not.
I have corrected it to use `uint32_t`.
We add the _OR_GOTO_DONE suffix to the macros that go to done if the
allocation fails. This makes it obvious what is different about the
macro versus the equivalent function, and that error handling is
built-in.
Renamed the cli_strdup to safer_strdup to make it obvious that it exists
because it is safer than regular strdup. Regular strdup doesn't have the
NULL check before trying to dup, and so may result in a NULL-deref
crash.
Also remove unused STRDUP (_OR_GOTO_DONE) macro, since the one with the
NULL-check is preferred.
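A hedged sketch of what one of these macros looks like (the real macro names,
arguments, and bodies in ClamAV may differ); it assumes the caller has a local
`status` variable and a `done:` label:
```c
#define CLI_MALLOC_OR_GOTO_DONE(ptr, size)   \
    do {                                     \
        (ptr) = malloc(size);                \
        if (NULL == (ptr)) {                 \
            status = CL_EMEM;                \
            goto done;                       \
        }                                    \
    } while (0)
```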
Variables like the number of signature parts are considered trusted user input
and so allocations based on those values need not check the memory allocation
limit.
Specifically for the allocation of the normalized buffer in cli_scanscript,
I determined that the size of SCANBUFF is fixed and so safe, and the maxpatlen
comes from the signature load and is therefore also trusted, so we do not
need to check the allocation limit.
We have some special functions to wrap malloc, calloc, and realloc to
make sure we don't allocate more than some limit, similar to the
max-filesize and max-scansize limits. Our wrappers are really only
needed when allocating memory for scans based on untrusted user input,
where a scan file could have bytes that claim you need to allocate
some ridiculous amount of memory. Right now they're named:
- cli_malloc
- cli_calloc
- cli_realloc
- cli_realloc2
... and these names do not convey their purpose
This commit renames them to:
- cli_max_malloc
- cli_max_calloc
- cli_max_realloc
- cli_max_realloc2
The realloc ones also have an additional feature in that they will not
free your pointer if you try to realloc to 0 bytes. Freeing the memory
is undefined by the C spec, and only done with some realloc
implementations, so this stabilizes on the behavior of not doing that,
which should prevent accidental double-free's.
So for the case where you may want to realloc and do not need to have a
maximum, this commit adds the following functions:
- cli_safer_realloc
- cli_safer_realloc2
These are used for the MPOOL_REALLOC and MPOOL_REALLOC2 macros when
MPOOL is disabled (e.g. because mmap-support is not found), so as to
match the behavior in the mpool_realloc/2 functions that do not make use
of the allocation-limit.
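A hedged sketch of that "safer" realloc behavior (not the actual ClamAV
implementation):
```c
#include <stdlib.h>

static void *safer_realloc_sketch(void *ptr, size_t size)
{
    void *new_ptr = NULL;

    if (0 == size) {
        /* Do NOT free ptr: realloc(ptr, 0) may or may not free it depending
         * on the implementation, so we stabilize on "never free". */
        return ptr;
    }

    new_ptr = realloc(ptr, size);
    if (NULL == new_ptr) {
        /* Allocation failed: the original pointer is still valid and still
         * owned by the caller, so a later free() is not a double-free. */
        return NULL;
    }

    return new_ptr;
}
```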
There are a large number of allocations for fixed-size buffers using the
`cli_malloc` and `cli_calloc` calls that check if the requested size is
larger than our allocation threshold for allocations based on untrusted
input. These allocations will *never* exceed the threshold, so the extra
stack frame and check for these calls is a waste of CPU.
This commit replaces needless calls with A -> B:
- cli_malloc -> malloc
- cli_calloc -> calloc
- CLI_MALLOC -> MALLOC
- CLI_CALLOC -> CALLOC
I also noticed that our MPOOL_MALLOC / MPOOL_CALLOC are not limited by
the max-allocation threshold, when MMAP is found/enabled. But the
alternative was set to cli_malloc / cli_calloc when disabled. I changed
those as well.
I didn't change the cli_realloc/2 calls because our version of realloc
not only implements a threshold but also stabilizes the undefined
behavior in realloc to protect against accidental double-free's.
It may be worth implementing a cli_realloc that doesn't have the
threshold built-in, however, so as to allow reallocations for things
like buffers for loading signatures, which aren't subject to the same
concern as allocations for scanning possible malware.
There was one case in mbox.c where I changed MALLOC -> CLI_MALLOC,
because it appears to be allocating based on untrusted input.
Image fuzzy hashing is enabled by default. The following options have
been added to allow users to disable it, if desired.
New clamscan options:
--scan-image[=yes(*)/no]
--scan-image-fuzzy-hash[=yes(*)/no]
New clamd config options:
ScanImage yes(*)/no
ScanImageFuzzyHash yes(*)/no
New libclamav scan options:
options.parse &= ~CL_SCAN_PARSE_IMAGE;
options.parse &= ~CL_SCAN_PARSE_IMAGE_FUZZY_HASH;
This commit also changes scan behavior to disable image fuzzy hashing
for specific types when the DCONF (.cfg) signatures disable those types.
That is, if DCONF disables the PNG parser, it should not only disable
the CVE/format checker for PNG files, but also disable image fuzzy
hashing for PNG files.
Also adds a DCONF option to disable image fuzzy hashing:
OTHER_CONF_IMAGE_FUZZY_HASH
DCONF allows scanning features to be disabled using a configuration
"signature".