Commit graph

58 commits

Author SHA1 Message Date
Valerie Snyder
31dcec1e42
libclamav: Add engine option to toggle temp directory recursion
Temp directory recursion in ClamAV is when each layer of a scan gets its
own temp directory in the parent layer's temp directory.

In addition to temp directory recursion, ClamAV has been creating a new
subdirectory for each file scan as a risk-adverse method to ensure
no temporary file leaks fill up the disk.
Creating a directory is relatively slow on Windows in particular if
scanning a lot of very small files.

This commit:

1. Separates the temp directory recursion feature from the leave-temps
   feature so that libclamav can leave temp files without making
   subdirectories for each file scanned.

2. Makes it so that when temp directory recursion is off, libclamav
   will just use the configure temp directory for all files.

The new option to enable temp directory recursion is for libclamav-only
at this time. It is off by default, and you can enable it like this:

```c
cl_engine_set_num(engine, CL_ENGINE_TMPDIR_RECURSION, 1);
```

For the `clamscan` and `clamd` programs, temp directory recursion will
be enabled when `--leave-temps` / `LeaveTemporaryFiles` is enabled.

The difference is that when disabled, it will return to using the
configured temp directory without making a subdirectory for each file
scanned, so as to improve scan performance for small files, mostly on
Windows.

Under the hood, this commit also:

1. Cleans up how we keep track of tmpdirs for each layer.
   The goal here is to align how we keep track of layer-specific stuff
   using the scan_layer structure.

2. Cleans up how we record metadata JSON for embedded files.
   Note: Embedded files being different from Contained files, as they
         are extracted not with a parser, but by finding them with
         file type magic signatures.

CLAM-1583
2025-08-14 22:38:58 -04:00
Valerie Snyder
f7e60d566f
Record unique object-id for each layer scanned
Every time we push a new map onto the scanning recursion context, give
it a unique object id number, which counts from zero.

Moved the location where we add metadata for each file from the
"cli_magic_scan" function over to the "recursion stack push" function.

Include a "path" as a parameter for creating a new fmap, and rename some
related variables and functions to be more intuitive.

CLAM-2796
See also: CLAM-2485, CLAM-2626
2025-08-14 21:23:33 -04:00
Val Snyder
7ff29b8c37
Bump copyright dates for 2025 2025-02-14 10:24:30 -05:00
Micah Snyder
47dfe9bd5d Remove libjson-c dead code
As of ClamAV 0.105, libjson-c is required.
There is also no option to disable libjson-c support.

This commit removes the dead code associated with the old build
option.
2024-04-13 12:34:15 -04:00
Micah Snyder
71ff5c579c Remove libxml2 dead code
As of ClamAV 0.105, libxml2 is required.
There is also no option to disable PCRE support.

This commit removes the dead code associated with the old build
option.
2024-04-13 12:34:15 -04:00
Micah Snyder
9cb28e51e6 Bump copyright dates for 2024 2024-01-22 11:27:17 -05:00
Micah Snyder
6c0ef38f38 Coverity-396088: Fix possible use of uninitialized variable in msxml parser. 2023-04-26 10:43:13 -07:00
Micah Snyder
6eebecc303 Bump copyright for 2023 2023-02-12 11:20:22 -08:00
Micah Snyder
cc9d578fd0 MSXML: Remove allmatch checks + minor code cleanup 2022-10-19 13:13:57 -07:00
Micah Snyder
cd3134568a Code quality: Refactor layer attributes as scan parameter
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.

This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
2022-10-13 08:57:44 -07:00
mko-x
a21cc6dcd7
Add explicit log level parameter to application logging API
* Added loglevel parameter to logg()

* Fix logg and mprintf internals with new loglevels

* Update all logg calls to set loglevel

* Update all mprintf calls to set loglevel

* Fix hidden logg calls

* Executed clam-format
2022-02-15 15:13:55 -08:00
micasnyd
140c88aa4e Bump copyright for 2022
Includes minor format corrections.
2022-01-09 14:23:25 -07:00
Micah Snyder
db013a2bfd libclamav: Fix scan recursion tracking
Scan recursion is the process of identifying files embedded in other
files and then scanning them, recursively.

Internally this process is more complex than it may sound because a file
may have multiple layers of types before finding a new "file".

At present we treat the recursion count in the scanning context as an
index into both our fmap list AND our container list. These two lists
are conceptually a part of the same thing and should be unified.

But what's concerning is that the "recursion level" isn't actually
incremented or decremented at the same time that we add a layer to the
fmap or container lists but instead is more touchy-feely, increasing
when we find a new "file".

To account for this shadiness, the size of the fmap and container lists
has always been a little longer than our "max scan recursion" limit so
we don't accidentally overflow the fmap or container arrays (!).

I've implemented a single recursion-stack as an array, similar to before,
which includes a pointer to each fmap at each layer, along with the size
and type. Push and pop functions add and remove layers whenever a new
fmap is added. A boolean argument when pushing indicates if the new layer
represents a new buffer or new file (descriptor). A new buffer will reset
the "nested fmap level" (described below).

This commit also provides a solution for an issue where we detect
embedded files more than once during scan recursion.

For illustration, imagine a tarball named foo.tar.gz with this structure:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once.

Note that for the GZ layer, it may detect the ZIP and 7Z if the
signature hits on the compressed data, which it might, though
extracting the ZIP and 7Z will likely fail.

The reason the above doesn't happen now is that we restrict embedded
type scans for a bunch of archive formats to include GZ and TAR.

(B) If we scan for embedded files at the foo.tar layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough only scanning for
embedded content in the current fmap when the "nested fmap level" is 0.
The upside is that it should safely detect all embedded content, even if
it may think the sfz.zip and sfx.7z are in foo.tar instead of in baz.exe.

The biggest risk I can think of affects ZIPs. SFXZIP detection
is identical to ZIP detection, which is why we don't allow SFXZIP to be
detected if insize of a ZIP. If we only allow embedded type scanning at
fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP
if the bar.exe was not compressed in foo.zip and if non-compressed files
extracted from ZIPs aren't extracted as new buffers:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in
new buffers, option (B) should be safe.

(C) If we scan for embedded files at the baz.exe layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by
restricting 7ZSFX and ZIPSFX detection only when scanning an executable.
But that may mean losing detection of archives embedded elsewhere.
And we'd have to identify allowable container types for each possible
embedded type, which would be very difficult.

So this commit aims to solve the issue the (B)-way.

Note that in all situations, we still have to scan with file typing
enabled to determine if we need to reassign the current file type, such
as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-
compressed. Detection of DMG and a handful of other types rely on
finding data partway through or near the ned of a file before
reassigning the entire file as the new type.

Other fixes and considerations in this commit:

- The utf16 HTML parser has weak error handling, particularly with respect
  to creating a nested fmap for scanning the ascii decoded file.
  This commit cleans up the error handling and wraps the nested scan with
  the recursion-stack push()/pop() for correct recursion tracking.

  Before this commit, each container layer had a flag to indicate if the
  container layer is valid.
  We need something similar so that the cli_recursion_stack_get_*()
  functions ignore normalized layers. Details...

  Imagine an LDB signature for HTML content that specifies a ZIP
  container. If the signature actually alerts on the normalized HTML and
  you don't ignore normalized layers for the container check, it will
  appear as though the alert is in an HTML container rather than a ZIP
  container.

  This commit accomplishes this with a boolean you set in the scan context
  before scanning a new layer. Then when the new fmap is created, it will
  use that flag to set similar flag for the layer. The context flag is
  reset those that anything after this doesn't have that flag.
  The flag allows the new recursion_stack_get() function to ignore
  normalized layers when iterating the stack to return a layer at a
  requested index, negative or positive.

  Scanning normalized extracted/normalized javascript and VBA should also
  use the 'layer is normalized' flag.

- This commit also fixes Heuristic.Broken.Executable alert for ELF files
  to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it
  respects the FP check).

  B) all broken-executable alerts for ELF only happen if the
  SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir().
  This was needed so we could correctly apply the layer-is-normalized-flag
  to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded
  file detection scans to abort. Granted we don't actually want to abort
  if max filesize or max recursion depth are exceeded... only if max
  scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on
  correct error propagation for fatal conditions. Instead, setting this
  flag in the scan context should guarantee that a fatal condition deep in
  scan recursion isn't lost which result in more stuff being scanned
  instead of aborting. This shouldn't be necessary, but some status codes
  like CL_ETIMEOUT never used to be fatal and it's easier to do this than
  to verify every parser only returns CL_ETIMEOUT and other "fatal
  status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filestypes.c and include
  is_tar.h instead.

- Presently we create the fmap hash when creating the fmap.
  This wastes a bit of CPU if the hash is never needed.
  Now that we're creating fmap's for all embedded files discovered with
  file type recognition scans, this is a much more frequent occurence and
  really slows things down.

  This commit fixes the issue by only creating fmap hashes as needed.
  This should not only resolve the perfomance impact of creating fmap's
  for all embedded files, but also should improve performance in general.

- Add allmatch check to the zip parser after the central-header meta
  match. That way we don't multiple alerts with the same match except in
  allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize,
  recursion depth, # of embedded files, and scantime are always reported
  if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may
  mask signature matches later on. I changed it so these alerts use the
  "possibly unwanted" alert-type and thus only alert if no other alerts
  were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded.*" events to the JSON metadata
  when the --gen-json feature is enabled. These will show up once under
  "ParseErrors" the first time a limit is exceeded. In the present
  implementation, only one limits-exceeded events will be added, so as to
  prevent a malicious or malformed sample from filling the JSON buffer
  with millions of events and using a tonne of RAM.
2021-10-25 16:02:29 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
9b9999d778 Rename core scanning functions
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned whether its "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.

| current name              | new name                          |
| ------------------------- | --------------------------------- |
| magic_scandesc()          | cli_magic_scan()                  |
| cli_magic_scandesc_type() | <delete>                          |
| cli_magic_scandesc()      | cli_magic_scan_desc()             |
| cli_base_scandesc()       | cli_magic_scan_desc_type()        |
| cli_partition_scandesc()  | <delete>                          |
| cli_map_scandesc()        | magic_scan_nested_fmap_type()     |
| cli_map_scan()            | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc()        | cli_magic_scan_buff()             |
| cli_scanbuff()            | cli_scan_buff()                   |
| cli_scandesc()            | cli_scan_desc()                   |
| cli_fmap_scandesc()       | cli_scan_fmap()                   |
| cli_scanfile()            | cli_magic_scan_file()             |
| cli_scandir()             | cli_magic_scan_dir()              |
| cli_filetype2()           | cli_determine_fmap_type()         |
| cli_filetype()            | cli_compare_ftm_file()            |
| cli_partitiontype()       | cli_compare_ftm_partition()       |
| cli_scanraw()             | scanraw()                         |
2020-06-03 11:00:40 -04:00
Micah Snyder
005cbf5a37 Record names of extracted files
A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.

2. File names are useful when interpretting tmp directory output when
using the --leave-temps option.

This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to
any internal scan API's that create fmaps and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required.  The zip and gpt parsers required some modification to record
file names.  The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the
  --leave-temps option is enabled.  When not enabled, the tmp directory
  structure remains flat so as to prevent the likelihood of exceeding
  MAX_PATH.  The current tmp directory is stored in the scan context.

- Made the cli_scanfile() internal API non-static and added it to
  scanners.h so it would be accessible outside of scanners.c in order to
  remove code duplication within libmspack.c.

- Added function comments to scanners.h and matcher.h

- Converted a TDB-type macros and LSIG-type macros to enums for improved
  type safey.

- Converted more return status variables from `int` to `cl_error_t` for
  improved type safety, and corrected ooxml file typing functions so
  they use `cli_file_t` exclusively rather than mixing types with
  `cl_error_t`.

- Restructured the magic_scandesc() function to use goto's for error
  handling and removed the early_ret_from_magicscan() macro and
  magic_scandesc_cleanup() function.  This makes the code easier to
  read and made it easier to add the recursive tmp directory cleanup to
  magic_scandesc().

- Corrected zip, egg, rar filename extraction issues.

- Removed use of extra sub-directory layer for zip, egg, and rar file
  extraction.  For Zip, this also involved changing the extracted
  filenames to be randomly generated rather than using the "zip.###"
  file name scheme.
2020-06-03 10:39:18 -04:00
Micah Snyder
206dbaefe8 Update copyright dates for 2020 2020-01-03 15:44:07 -05:00
Micah Snyder
4524c398f3 Argument and return types for fmap_readn(), cli_writen(), cli_readn() converted to use size_t instead of int. 2019-10-02 16:08:25 -04:00
Micah Snyder
ca8b4c466e Assortment of warning fixes. 2019-10-02 16:08:25 -04:00
Micah Snyder
52cddcbcfd Updating and cleaning up copyright notices. 2019-10-02 16:08:18 -04:00
Micah Snyder
72fd33c8b2 clang-format'd using new .clang-format rules. 2019-10-02 16:08:16 -04:00
Micah Snyder
8cf9b527b0 Updated win32 3rdparty libxml2 to version 2.9.8. 2018-12-02 23:07:01 -05:00
Micah Snyder
d39cb6581f Updating libclamunrar from legacy C implementation to modern unrar 5.6.5. API changes and supporting changes included to pass the filepath of the scanned file into libclamav through the cli_ctx structure, required by the unrar library to open archives. The filename argument may be optional for the scandesc scanning variant, but libclamav will make a best effort to identify the filename from the file descriptor if it was not provided. In addition, included the ability to prefix temp file and directory names with file basenames. 2018-12-02 23:06:59 -05:00
Micah Snyder
d7979d4ff7 Restructured scan options flags from a single bitflag field to a structure containing multiple bitflag fields. This also required adding a new function to the bytecode API to get scan options a la carte, and modifying the existing function to hand back scan options in the old/deprecated uint32_t bitflag format. Re-generated bytecode iface header files.
Updated libclamav documentation detailing new scan options structure.
Renamed references to 'algorithmic' detection to 'heuristic' detection. Renaming references to 'properties' to 'collect metadata'.
Renamed references to 'scan all' to 'scan all match'.
Renamed a couple of 'Hueristic.*' signature names as 'Heuristics.*' signatures (plural) to match majority of other heuristics.
2018-12-02 23:06:59 -05:00
Micah Snyder
78a06dae12 elminating warnings that crop up when using --with-libjson 2017-08-28 17:49:25 -04:00
Steven Morgan
928864b818 fix file descriptor leak for msxml documents - patch from Chris Miserva. 2017-01-17 12:27:07 -05:00
Steven Morgan
22cb38ed24 pull request #53(2/4): Spelling fix by klemens(ka7). 2016-10-19 15:57:45 -04:00
Steven Morgan
10bcbe7a46 clean up file/memory in error case. 2016-07-08 12:15:12 -04:00
Anthony Chan
46b9e8c2ca Fix bug in msxml_parse_element which may leave behind empty temp file and leak a little memory 2016-07-08 12:04:17 -04:00
Kevin Lin
4016c0f994 msxml_parser: suppress xml2 parser error and warnings to clamav debug 2016-05-26 17:05:36 -04:00
Kevin Lin
cd70b7cad8 msxml_parser: add custom callback data slot 2016-05-26 17:05:36 -04:00
Kevin Lin
cb7403214b msxml_parser: change method of setting callback system; add comment_cb 2016-05-26 17:05:36 -04:00
Kevin Lin
6732844acb msxml_parser: flags for modifying reader usage (json, walk) 2016-05-26 17:05:36 -04:00
Kevin Lin
c2df9f79d3 mhtml: wrapper for xml parsing using libxml2 htmlparser 2016-05-26 17:05:35 -04:00
Kevin Lin
80ba6f2c9a clang compiler warning corrections 2016-02-18 11:59:08 -05:00
Kevin Lin
99967b5790 compile bugfix for non-json builds 2016-01-15 16:49:19 -05:00
Kevin Lin
d45186458b squash hwp amd msxml parser memory leaks 2016-01-15 15:34:01 -05:00
Kevin Lin
523e4264e0 msxml_parser: add MSXML_JSON_MULTI option for tracking multiple entries for same key 2015-12-17 16:18:17 -05:00
Kevin Lin
416456da73 msxml_parser: add callback-based scanning mechanism 2015-12-16 16:16:01 -05:00
Kevin Lin
66e314847c msxml_parser: add MSXML_SCAN_B64_TRIM4 key field (for HWPML) 2015-12-16 16:16:01 -05:00
Mickey Sola
46a35abe56 mass update of copyright headers 2015-09-17 13:41:26 -04:00
Kevin Lin
f8f9d7e1f5 cid 12151/12150 - correct condition for checking allmatch in parsing msxml documents (revised) 2015-08-21 14:41:37 -04:00
Kevin Lin
f8f2ff9480 cid 12144 - simplify null ctx check in msxml parsing 2015-08-19 11:14:52 -04:00
Kevin Lin
4ba7e7fe90 cid 12151/12150 - correct condition for checking allmatch in parsing msxml documents 2015-08-19 11:14:52 -04:00
Kevin Lin
3d374eec31 msxml: virus detection and allmatch fixes 2015-04-29 17:17:31 -04:00
Kevin Lin
48d1a07597 msxml: memory issues with tempfiles 2015-04-17 11:25:44 -04:00
Kevin Lin
25a8a0f91a msxml: memory fixes 2015-04-16 12:31:09 -04:00
Kevin Lin
f773990c28 msxml: final suppression of parsing errors (for release) 2015-04-01 13:21:15 -04:00
Kevin Lin
f17cd8d16f added clamav-specific xmlerror handler for msxml 2015-04-01 12:14:44 -04:00
Kevin Lin
d2efc60c01 win32: fixed build in regards to msxml 2015-03-16 12:07:03 -04:00