Compare commits

..

9 commits

Author SHA1 Message Date
Val S.
e98b0075e6
Fix ZIP parser issue recording empty file entries
If csize (and usize) are 0, like with a directory or other empty file
entry, then the functionionality to record file record information when
indexing the central directory and each associated file record will
neglect to store the `local_header_offset` or `local_header_size`.
That causes problems later after sorting the file records and then
checking for overlapping files.

CLAM-2884
2025-10-12 19:12:06 -04:00
Val S.
87e389d1b2
Fix compiler warning
Mismatched declaration and definition.
2025-10-12 16:15:49 -04:00
Val S.
fa830aabdc
Increase limit for finding PE files embedded in other PE files
I am seeing missed detections since we changed to prohibit embedded
file type identification when inside an embedded file.
In particular, I'm seeing this issue with PE files that contain multiple
other MSEXE as well as a variety of false positives for PE file headers.

For example, imagine a PE with four concatenated DLL's, like so:
```
  [ EXE file   | DLL #1  | DLL #2  | DLL #3  | DLL #4 ]
```

And note that false positives for embedded MSEXE files are fairly common.
So there may be a few mixed in there.

Before limiting embedded file identification we might interpret the file
structure something like this:
```
MSEXE: {
  embedded MSEXE #1: false positive,
  embedded MSEXE #2: false positive,
  embedded MSEXE #3: false positive,
  embedded MSEXE #4: DLL #1: {
    embedded MSEXE #1: false positive,
    embedded MSEXE #2: DLL #2: {
      embedded MSEXE #1: DLL #3: {
        embedded MSEXE #1: false positive,
        embedded MSEXE #2: false positive,
        embedded MSEXE #3: false positive,
        embedded MSEXE #4: false positive,
        embedded MSEXE #5: DLL #4
      }
      embedded MSEXE #2: false positive,
      embedded MSEXE #3: false positive,
      embedded MSEXE #4: false positive,
      embedded MSEXE #5: false positive,
      embedded MSEXE #6: DLL #4
    }
    embedded MSEXE #3: DLL #3,
    embedded MSEXE #4: false positive,
    embedded MSEXE #5: false positive,
    embedded MSEXE #6: false positive,
    embedded MSEXE #7: false positive,
    embedded MSEXE #8: DLL #4
  }
}
```

This is obviously terrible, which is why why we don't allow detecting
embedded files within other embedded files.
So after we enforce that limit, the same file may be interpreted like
this instead:
```
MSEXE: {
  embedded MSEXE #1:  false positive,
  embedded MSEXE #2:  false positive,
  embedded MSEXE #3:  false positive,
  embedded MSEXE #4:  DLL #1,
  embedded MSEXE #5:  false positive,
  embedded MSEXE #6:  DLL #2,
  embedded MSEXE #7:  DLL #3,
  embedded MSEXE #8:  false positive,
  embedded MSEXE #9:  false positive,
  embedded MSEXE #10: false positive,
  embedded MSEXE #11: false positive,
  embedded MSEXE #12: DLL #4
}
```

That's great! Except that we now exceed the "MAX_EMBEDDED_OBJ" limit
for embedded type matches (limit 10, but 12 found). That means we won't
see or extract the 4th DLL anymore.

My solution is to lift the limit when adding an matched MSEXE type.
We already do this for matched ZIPSFX types.
While doing this, I've significantly tidied up the limits checks to
make it more readble, and removed duplicate checks from within the
`ac_addtype()` function.

CLAM-2897
2025-10-12 16:13:11 -04:00
Val S.
85e67f6001
Loosen restrictions on embedded file identification
In regression testing against a large sample set, I found that strictly
disallowing any embedded file identification if any previous layer was
an embedded file resulted in missed detections.

Specifically, I found an MSEXE file which has an embedded RAR, which in
turn had another MSEXE that itself had an embedded 7ZIP containing... malware.
sha256: c3cf573fd3d1568348506bf6dd6152d99368a7dc19037d135d5903bc1958ea85

To make it so ClamAV can extract all that, we must loosen the
restriction and allow prior layers to be embedded, just not the current
layer.

I've also added some logic to prevent attempting to extract an object at
the same offset twice. The `fpt->offset`s appear in order, but if you
have multiple file type magic signatures match on the same address, like
maybe you accidentally load two different .ftm files, then you'd get
duplicates and double-extraction.
As a bonus, I found I could also skip over offsets within a valid ZIP,
ARJ, and CAB since we now have the length of those from the header check
and as I know we don't want to extract embedded contents in that way.
2025-10-12 16:13:11 -04:00
Val S.
3f417e3d72
Fix issue detecting VBA projects
Previously for documents containing VBA projects, the VBA was treated
as an object within the document and not as a normalized version of
the document. I apparently switched it say that the VBA is a normalized
version of the document. This kind of makes sense in that presently
Javascript extracted from HTML is treated as a normalized version of the
HTML. But it probably shouldn't.

Normalized layers are treated as the same file as the parent.
So now those older signatures that match on VBA projects using
"Container:CL_TYPE_MSOLE2" are failing to match.

So this commit switches it back. VBA project bits written out to a temp
file for scanning will be treated as being contained within the document.

CLAM-2896
2025-10-12 16:13:11 -04:00
Val S.
92af9bfffd
Fix issue recording OOXML document metadata
The ZIP single record search feature is used to find specific files when
parsing OOXML documents. I observed that the core properties for a
PowerPoint file were missing in a test as compared with the previous
release.

The error handling check for the unzip search returns CL_VIRUS when
there is a match, not CL_SUCCESS!

CLAM-2886
2025-10-12 16:13:10 -04:00
Val S.
b8dc7be7e1
Scan performance optimization for TNEF message scans
Uncompressed ZIP-based TNEF message attachments, like OOXML office
document attachments, get double-extracted because of embedded file type
recognition.

To prevent excessive scan times, disable embedded file type recognition
for TNEF files and relay on TNEF parsing to extract attachments.

CLAM-2885
2025-10-12 16:13:10 -04:00
Val S.
e51a6a8cac
Fix ZIP parser issue
The function which indexes a ZIP central directory is not advancing
to the next central directory record thus exceeding the max-files scan
limit for many ZIPs.

CLAM-2884
2025-10-12 16:13:10 -04:00
Micah Snyder
5fe5f87252
Fix performance issue scanning some Windows executables
Scanning CL_TYPE_MSEXE that have embedded file type signature matches
for CL_TYPE_MSEXE are incorrectly passing the PE header check for the
contained file, resulting in excessive scan times.

The problem is that the `peinfo` struct needs to have the `offset` set
for the contained `CL_TYPE_MSEXE` match prior to the header check.
Without that, the header check was actually validating the PE header of
the original file, which would always pass when that's a PE, and would
always fail if it's an OLE2 file (the other type which we check for
contained PEs).

The additional code change in this commit is to make it so the `ctx`
parameter must never be NULL, and removing the `map` parameter because,
in practice, that is always from `ctx->fmap`. This is to safeguard
against future changes to the function that may accidentally use `ctx`
without a proper NULL check.

CLAM-2882
2025-10-12 16:13:02 -04:00
4 changed files with 178 additions and 62 deletions

View file

@ -33,7 +33,7 @@
#define MAGIC_BUFFER_SIZE 1028
#define CL_TYPENO 500
#define MAX_EMBEDDED_OBJ 16
#define MAX_EMBEDDED_OBJ 10
typedef enum cli_file {
CL_TYPE_ANY = 0,

View file

@ -833,8 +833,9 @@ int cli_ac_chklsig(const char *expr, const char *end, uint32_t *lsigcnt, unsigne
return -1;
}
for (i += 2; i + 1 < len && (isdigit(expr[i + 1]) || expr[i + 1] == ','); i++)
;
for (i += 2; i + 1 < len && (isdigit(expr[i + 1]) || expr[i + 1] == ','); i++) {
continue;
}
}
if (&expr[i + 1] == rend)
@ -1625,19 +1626,23 @@ void cli_ac_freedata(struct cli_ac_data *data)
}
}
/* returns only CL_SUCCESS or CL_EMEM */
inline static int ac_addtype(struct cli_matched_type **list, cli_file_t type, off_t offset, const cli_ctx *ctx)
/**
* @brief Add a match for an object type to the list of matched types.
*
* Important: The caller is responsible for checking limits!
*
* @param list Pointer to the list of matched types. *list may be NULL if no types have been added yet.
* @param type The type of the embedded object.
* @param offset The offset of the embedded object.
* @param ctx The context information. May be NULL.
* @return cl_error_t CL_SUCCESS regardless if added, or CL_EMEM if memory allocation failed.
*/
inline static cl_error_t ac_addtype(struct cli_matched_type **list, cli_file_t type, off_t offset, const cli_ctx *ctx)
{
struct cli_matched_type *tnode, *tnode_last;
struct cli_matched_type *tnode;
if (type == CL_TYPE_ZIPSFX) {
if (*list && ctx && ctx->engine->maxfiles && (*list)->cnt > ctx->engine->maxfiles)
return CL_SUCCESS;
} else if (*list && (*list)->cnt >= MAX_EMBEDDED_OBJ) {
return CL_SUCCESS;
}
if (!(tnode = calloc(1, sizeof(struct cli_matched_type)))) {
tnode = calloc(1, sizeof(struct cli_matched_type));
if (NULL == tnode) {
cli_errmsg("cli_ac_addtype: Can't allocate memory for new type node\n");
return CL_EMEM;
}
@ -1645,16 +1650,25 @@ inline static int ac_addtype(struct cli_matched_type **list, cli_file_t type, of
tnode->type = type;
tnode->offset = offset;
tnode_last = *list;
while (tnode_last && tnode_last->next)
tnode_last = tnode_last->next;
if (*list) {
// Add to end of existing list.
struct cli_matched_type *tnode_last = *list;
if (tnode_last)
while (tnode_last && tnode_last->next) {
tnode_last = tnode_last->next;
}
tnode_last->next = tnode;
else
} else {
// First type in the list.
*list = tnode;
}
(*list)->cnt++;
if (UNLIKELY(cli_get_debug_flag())) {
cli_dbgmsg("ac_addtype: added %s embedded object at offset " STDi64 ". Embedded object count: %d\n", cli_ftname(type), (uint64_t)offset, (*list)->cnt);
}
return CL_SUCCESS;
}
@ -1999,14 +2013,65 @@ cl_error_t cli_ac_scanbuff(
cli_dbgmsg("Matched signature for file type %s\n", pt->virname);
type = pt->type;
if ((ftoffset != NULL) &&
((*ftoffset == NULL) || (*ftoffset)->cnt < MAX_EMBEDDED_OBJ || type == CL_TYPE_ZIPSFX) && (type >= CL_TYPE_SFX || ((ftype == CL_TYPE_MSEXE || ftype == CL_TYPE_ZIP || ftype == CL_TYPE_MSOLE2) && type == CL_TYPE_MSEXE))) {
/* FIXME: the first offset in the array is most likely the correct one but
* it may happen it is not
*/
for (j = 1; j <= CLI_DEFAULT_AC_TRACKLEN + 1 && offmatrix[0][j] != (uint32_t)-1; j++)
if (ac_addtype(ftoffset, type, offmatrix[pt->parts - 1][j], ctx))
return CL_EMEM;
if (ftoffset != NULL) {
// Caller provided a pointer to record matched types.
bool too_many_types = false;
bool supported_type = false;
if (*ftoffset != NULL) {
// Have some type matches already. Check limits.
if (ctx && ((type == CL_TYPE_ZIPSFX) ||
(type == CL_TYPE_MSEXE && ftype == CL_TYPE_MSEXE))) {
// When ctx present, limit the number of type matches using ctx->engine->maxfiles for specific types.
// Reasoning:
// ZIP local file header entries likely to be numerous if a single ZIP appended to the scanned file.
// MSEXE can contain many embedded MSEXE entries and MSEXE type false positives matches.
if (ctx->engine->maxfiles == 0) {
// Max-files limit is disabled.
} else if ((*ftoffset)->cnt >= ctx->engine->maxfiles) {
if (UNLIKELY(cli_get_debug_flag())) {
cli_dbgmsg("ac_addtype: Can't add %s type at offset " STDu64 " to list of embedded type matches. Reached maxfiles limit of %u\n", cli_ftname(type), (*ftoffset)->offset, ctx->engine->maxfiles);
}
too_many_types = true;
}
} else {
// Limit the number of type matches using MAX_EMBEDDED_OBJ.
if ((*ftoffset)->cnt >= MAX_EMBEDDED_OBJ) {
if (UNLIKELY(cli_get_debug_flag())) {
cli_dbgmsg("ac_addtype: Can't add %s type at offset " STDu64 " to list of embedded type matches. Reached MAX_EMBEDDED_OBJ limit of %u\n", cli_ftname(type), (*ftoffset)->offset, MAX_EMBEDDED_OBJ);
}
too_many_types = true;
}
}
}
// Filter to supported types.
if (
// Found type is MBR.
type == CL_TYPE_MBR ||
// Found type is any SFX type (i.e., ZIPSFX, RARSFX, 7ZSSFX, etc.).
type >= CL_TYPE_SFX ||
// Found type is an MSEXE, but only if host file type is one of MSEXE, ZIP, or MSOLE2.
(type == CL_TYPE_MSEXE && (ftype == CL_TYPE_MSEXE || ftype == CL_TYPE_ZIP || ftype == CL_TYPE_MSOLE2))) {
supported_type = true;
}
if (supported_type && !too_many_types) {
/* FIXME: the first offset in the array is most likely the correct one but
* it may happen it is not
* Until we're certain and can fix this, we add all offsets in the list.
*/
for (j = 1; j <= CLI_DEFAULT_AC_TRACKLEN + 1 && offmatrix[0][j] != (uint32_t)-1; j++) {
ret = ac_addtype(ftoffset, type, offmatrix[pt->parts - 1][j], ctx);
if (CL_SUCCESS != ret) {
return ret;
}
}
}
}
memset(offmatrix[0], (uint32_t)-1, pt->parts * (CLI_DEFAULT_AC_TRACKLEN + 2) * sizeof(uint32_t));
@ -2066,11 +2131,59 @@ cl_error_t cli_ac_scanbuff(
cli_dbgmsg("Matched signature for file type %s at %u\n", pt->virname, realoff);
type = pt->type;
if ((ftoffset != NULL) &&
((*ftoffset == NULL) || (*ftoffset)->cnt < MAX_EMBEDDED_OBJ || type == CL_TYPE_ZIPSFX) && (type == CL_TYPE_MBR || type >= CL_TYPE_SFX || ((ftype == CL_TYPE_MSEXE || ftype == CL_TYPE_ZIP || ftype == CL_TYPE_MSOLE2) && type == CL_TYPE_MSEXE))) {
if (ac_addtype(ftoffset, type, realoff, ctx))
return CL_EMEM;
if (ftoffset != NULL) {
// Caller provided a pointer to record matched types.
bool too_many_types = false;
bool supported_type = false;
if (*ftoffset != NULL) {
// Have some type matches already. Check limits.
if (ctx && ((type == CL_TYPE_ZIPSFX) ||
(type == CL_TYPE_MSEXE && ftype == CL_TYPE_MSEXE))) {
// When ctx present, limit the number of type matches using ctx->engine->maxfiles for specific types.
// Reasoning:
// ZIP local file header entries likely to be numerous if a single ZIP appended to the scanned file.
// MSEXE can contain many embedded MSEXE entries and MSEXE type false positives matches.
if (ctx->engine->maxfiles == 0) {
// Max-files limit is disabled.
} else if ((*ftoffset)->cnt >= ctx->engine->maxfiles) {
if (UNLIKELY(cli_get_debug_flag())) {
cli_dbgmsg("ac_addtype: Can't add %s type at offset " STDu64 " to list of embedded type matches. Reached maxfiles limit of %u\n", cli_ftname(type), (*ftoffset)->offset, ctx->engine->maxfiles);
}
too_many_types = true;
}
} else {
// Limit the number of type matches using MAX_EMBEDDED_OBJ.
if ((*ftoffset)->cnt >= MAX_EMBEDDED_OBJ) {
if (UNLIKELY(cli_get_debug_flag())) {
cli_dbgmsg("ac_addtype: Can't add %s type at offset " STDu64 " to list of embedded type matches. Reached MAX_EMBEDDED_OBJ limit of %u\n", cli_ftname(type), (*ftoffset)->offset, MAX_EMBEDDED_OBJ);
}
too_many_types = true;
}
}
}
// Filter to supported types.
if (
// Found type is MBR.
type == CL_TYPE_MBR ||
// Found type is any SFX type (i.e., ZIPSFX, RARSFX, 7ZSSFX, etc.).
type >= CL_TYPE_SFX ||
// Found type is an MSEXE, but only if host file type is one of MSEXE, ZIP, or MSOLE2.
(type == CL_TYPE_MSEXE && (ftype == CL_TYPE_MSEXE || ftype == CL_TYPE_ZIP || ftype == CL_TYPE_MSOLE2))) {
supported_type = true;
}
if (supported_type && !too_many_types) {
ret = ac_addtype(ftoffset, type, realoff, ctx);
if (CL_SUCCESS != ret) {
return ret;
}
}
}
}
} else {

View file

@ -390,6 +390,6 @@ cl_error_t cli_matchmeta(cli_ctx *ctx, const char *fname, size_t fsizec, size_t
* - 9 - MachO
* @param ctx The current scan context
*/
void cli_targetinfo(struct cli_target_info *info, unsigned int target, cli_ctx *ctx);
void cli_targetinfo(struct cli_target_info *info, cli_target_t target, cli_ctx *ctx);
#endif

View file

@ -717,17 +717,35 @@ static cl_error_t parse_local_file_header(
zip += LOCAL_HEADER_elen;
bytes_remaining -= LOCAL_HEADER_elen;
if (!csize) { /* FIXME: what's used for method0 files? csize or usize? Nothing in the specs, needs testing */
cli_dbgmsg("cli_unzip: local header - skipping empty file\n");
} else {
if (bytes_remaining < csize) {
cli_dbgmsg("cli_unzip: local header - stream out of file\n");
status = CL_EPARSE;
goto done;
}
if (bytes_remaining < csize) {
cli_dbgmsg("cli_unzip: local header - stream out of file\n");
status = CL_EPARSE;
goto done;
}
if (NULL != record) {
/* Don't actually unzip if we're just collecting the file record information (offset, sizes) */
if (NULL == record) {
if (NULL == original_filename) {
record->original_filename = NULL;
} else {
record->original_filename = CLI_STRNDUP(original_filename, strlen(original_filename));
}
record->local_header_offset = loff;
record->local_header_size = zip - local_header;
record->compressed_size = csize;
record->uncompressed_size = usize;
record->method = LOCAL_HEADER_method;
record->flags = LOCAL_HEADER_flags;
record->encrypted = (LOCAL_HEADER_flags & F_ENCR) ? 1 : 0;
status = CL_SUCCESS;
} else {
/*
* Unzip or decompress & then unzip.
*/
if (!csize) { /* FIXME: what's used for method0 files? csize or usize? Nothing in the specs, needs testing */
cli_dbgmsg("cli_unzip: local header - skipping empty file\n");
} else {
zip = fmap_need_ptr_once(ctx->fmap, zip, csize);
if (NULL == zip) {
cli_dbgmsg("cli_unzip: local header - data out of file\n");
@ -751,27 +769,12 @@ static cl_error_t parse_local_file_header(
goto done;
}
}
} else {
if (NULL == original_filename) {
record->original_filename = NULL;
} else {
record->original_filename = CLI_STRNDUP(original_filename, strlen(original_filename));
}
record->local_header_offset = loff;
record->local_header_size = zip - local_header;
record->compressed_size = csize;
record->uncompressed_size = usize;
record->method = LOCAL_HEADER_method;
record->flags = LOCAL_HEADER_flags;
record->encrypted = (LOCAL_HEADER_flags & F_ENCR) ? 1 : 0;
status = CL_SUCCESS;
}
zip += csize;
bytes_remaining -= csize;
}
zip += csize;
bytes_remaining -= csize;
if (LOCAL_HEADER_flags & F_USEDD) {
if (bytes_remaining < 12) {
cli_dbgmsg("cli_unzip: local header - data desc out of file\n");
@ -910,8 +913,8 @@ static cl_error_t parse_central_directory_file_header(
central_header = fmap_need_off(ctx->fmap, central_file_header_offset, SIZEOF_CENTRAL_HEADER);
if (NULL == central_header) {
cli_dbgmsg("cli_unzip: central header - file header offset out of file\n");
status = CL_EPARSE;
cli_dbgmsg("cli_unzip: central header - reached end of central directory.\n");
status = CL_BREAK;
goto done;
}
@ -1311,7 +1314,7 @@ cl_error_t index_local_file_headers_within_bounds(
if (start_offset > fsize || end_offset > fsize || start_offset > end_offset) {
cli_errmsg("index_local_file_headers_within_bounds: Invalid offset arguments: start_offset=%u, end_offset=%u, fsize=%u\n",
start_offset, end_offset, fsize);
start_offset, end_offset, fsize);
status = CL_EPARSE;
goto done;
}