Commit graph

37 commits

Author SHA1 Message Date
Val Snyder
7ff29b8c37
Bump copyright dates for 2025 2025-02-14 10:24:30 -05:00
Micah Snyder
3ae9c1e434 Add LHA/LZH archive support
File type magic signatures chosen based on the extensions supported
by Rust delharc crate.

See: https://docs.rs/delharc/latest/delharc/
2024-04-09 10:35:22 -04:00
Micah Snyder
9cb28e51e6 Bump copyright dates for 2024 2024-01-22 11:27:17 -05:00
Micah Snyder
6eebecc303 Bump copyright for 2023 2023-02-12 11:20:22 -08:00
Micah Snyder
19a4fa03b3 CPIO: Remove all-match checks + code cleanup
And fix issue where magic_scan() critical error may be ignored.
2022-10-19 13:13:57 -07:00
Micah Snyder
cd3134568a Code quality: Refactor layer attributes as scan parameter
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.

This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
2022-10-13 08:57:44 -07:00
micasnyd
140c88aa4e Bump copyright for 2022
Includes minor format corrections.
2022-01-09 14:23:25 -07:00
Micah Snyder
db013a2bfd libclamav: Fix scan recursion tracking
Scan recursion is the process of identifying files embedded in other
files and then scanning them, recursively.

Internally this process is more complex than it may sound because a file
may have multiple layers of types before finding a new "file".

At present we treat the recursion count in the scanning context as an
index into both our fmap list AND our container list. These two lists
are conceptually a part of the same thing and should be unified.

But what's concerning is that the "recursion level" isn't actually
incremented or decremented at the same time that we add a layer to the
fmap or container lists but instead is more touchy-feely, increasing
when we find a new "file".

To account for this shadiness, the size of the fmap and container lists
has always been a little longer than our "max scan recursion" limit so
we don't accidentally overflow the fmap or container arrays (!).

I've implemented a single recursion-stack as an array, similar to before,
which includes a pointer to each fmap at each layer, along with the size
and type. Push and pop functions add and remove layers whenever a new
fmap is added. A boolean argument when pushing indicates if the new layer
represents a new buffer or new file (descriptor). A new buffer will reset
the "nested fmap level" (described below).

This commit also provides a solution for an issue where we detect
embedded files more than once during scan recursion.

For illustration, imagine a tarball named foo.tar.gz with this structure:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once.

Note that for the GZ layer, it may detect the ZIP and 7Z if the
signature hits on the compressed data, which it might, though
extracting the ZIP and 7Z will likely fail.

The reason the above doesn't happen now is that we restrict embedded
type scans for a bunch of archive formats to include GZ and TAR.

(B) If we scan for embedded files at the foo.tar layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough only scanning for
embedded content in the current fmap when the "nested fmap level" is 0.
The upside is that it should safely detect all embedded content, even if
it may think the sfz.zip and sfx.7z are in foo.tar instead of in baz.exe.

The biggest risk I can think of affects ZIPs. SFXZIP detection
is identical to ZIP detection, which is why we don't allow SFXZIP to be
detected if insize of a ZIP. If we only allow embedded type scanning at
fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP
if the bar.exe was not compressed in foo.zip and if non-compressed files
extracted from ZIPs aren't extracted as new buffers:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in
new buffers, option (B) should be safe.

(C) If we scan for embedded files at the baz.exe layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by
restricting 7ZSFX and ZIPSFX detection only when scanning an executable.
But that may mean losing detection of archives embedded elsewhere.
And we'd have to identify allowable container types for each possible
embedded type, which would be very difficult.

So this commit aims to solve the issue the (B)-way.

Note that in all situations, we still have to scan with file typing
enabled to determine if we need to reassign the current file type, such
as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-
compressed. Detection of DMG and a handful of other types rely on
finding data partway through or near the ned of a file before
reassigning the entire file as the new type.

Other fixes and considerations in this commit:

- The utf16 HTML parser has weak error handling, particularly with respect
  to creating a nested fmap for scanning the ascii decoded file.
  This commit cleans up the error handling and wraps the nested scan with
  the recursion-stack push()/pop() for correct recursion tracking.

  Before this commit, each container layer had a flag to indicate if the
  container layer is valid.
  We need something similar so that the cli_recursion_stack_get_*()
  functions ignore normalized layers. Details...

  Imagine an LDB signature for HTML content that specifies a ZIP
  container. If the signature actually alerts on the normalized HTML and
  you don't ignore normalized layers for the container check, it will
  appear as though the alert is in an HTML container rather than a ZIP
  container.

  This commit accomplishes this with a boolean you set in the scan context
  before scanning a new layer. Then when the new fmap is created, it will
  use that flag to set similar flag for the layer. The context flag is
  reset those that anything after this doesn't have that flag.
  The flag allows the new recursion_stack_get() function to ignore
  normalized layers when iterating the stack to return a layer at a
  requested index, negative or positive.

  Scanning normalized extracted/normalized javascript and VBA should also
  use the 'layer is normalized' flag.

- This commit also fixes Heuristic.Broken.Executable alert for ELF files
  to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it
  respects the FP check).

  B) all broken-executable alerts for ELF only happen if the
  SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir().
  This was needed so we could correctly apply the layer-is-normalized-flag
  to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded
  file detection scans to abort. Granted we don't actually want to abort
  if max filesize or max recursion depth are exceeded... only if max
  scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on
  correct error propagation for fatal conditions. Instead, setting this
  flag in the scan context should guarantee that a fatal condition deep in
  scan recursion isn't lost which result in more stuff being scanned
  instead of aborting. This shouldn't be necessary, but some status codes
  like CL_ETIMEOUT never used to be fatal and it's easier to do this than
  to verify every parser only returns CL_ETIMEOUT and other "fatal
  status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filestypes.c and include
  is_tar.h instead.

- Presently we create the fmap hash when creating the fmap.
  This wastes a bit of CPU if the hash is never needed.
  Now that we're creating fmap's for all embedded files discovered with
  file type recognition scans, this is a much more frequent occurence and
  really slows things down.

  This commit fixes the issue by only creating fmap hashes as needed.
  This should not only resolve the perfomance impact of creating fmap's
  for all embedded files, but also should improve performance in general.

- Add allmatch check to the zip parser after the central-header meta
  match. That way we don't multiple alerts with the same match except in
  allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize,
  recursion depth, # of embedded files, and scantime are always reported
  if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may
  mask signature matches later on. I changed it so these alerts use the
  "possibly unwanted" alert-type and thus only alert if no other alerts
  were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded.*" events to the JSON metadata
  when the --gen-json feature is enabled. These will show up once under
  "ParseErrors" the first time a limit is exceeded. In the present
  implementation, only one limits-exceeded events will be added, so as to
  prevent a malicious or malformed sample from filling the JSON buffer
  with millions of events and using a tonne of RAM.
2021-10-25 16:02:29 -07:00
Andy Ragusa
ee15dd7e0d Fix clamd_valgrind tests when built with clang-8
Since strcpy only writes a null-byte to the first terminating byte,
valgrind is throwing errors about uninitialized reads for strncpy's
that could potentially read beyond the null-byte.  Initializing
the whole array to 0 resolves this.
2021-07-30 12:46:25 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
e4e3149368 Fix fmap-duplicate performance issue
The fmap_duplicate function is used create a new fmap with a view into
an existing fmap. When the new view is a different size than the old
fmap, a new hash must be calculated for the duplicate fmap. However,
when the duplicated fmap is the same size as the original fmap, the hash
will be the same and there's no point recalculating.

The issue is apparent when scanning large EXE files because the hash was
being calculated at the beginning and end of the scan.

Digging into this issue revealed that hash calculations for fmaps were
also being performed at the wrong place. For scans of maps we use
fmap_duplicate() early in the process to apply the name API argument to
the duplicate fmap. Fixing the logic so we doing recalculate the hash
revealed that we never calculated hashes for fmap's created from buffers
in the first place, so that also had to be fixed be relocating where the
hash is calculated.

I also found that fmap_duplicate()'s offset argument used an off_t,
though it and all caller offsets are not allowed to be negative. This
was a bit of tangent to fix a bunch of off_t variables and paramters
that should've been size_t.

Added a couple unit tests to verify that making duplicate fmaps, and
duplicate-duplicate fmaps works as expected after the change.

Changed CLI_ISCONTAINED() and CLI_ISCONTAINED2() macros to cast to
size_t, because pointers and buffer sizes may not be negative, and these
two macros do not rely on substraction.
2021-01-28 12:54:50 -08:00
Micah Snyder
e409920298 Fix assorted warnings
Add missing ping_clamd() declaration in client.h

Fix check for ping option to first check if ping option is NULL before
strdup'ing and checking if the alloc failed.

Fix format string for uint64_t print.

Correctly assign name pointer to stack buffer in cpio parser.

Remove vestigial variables from insert_list() function matcher-ac.c,
left over from before the load-time optimizations completely
restructured everything.

Silence warnings about unused parameters in progress bar callback
function.
2020-07-31 16:05:31 -07:00
Micah Snyder
e2f59af30a Clang-format touchup 2020-07-24 16:37:25 -07:00
Micah Snyder (micasnyd)
e830b45ca7 Fix unitialized name buffer in CPIO parser
Fixes a possible stack buffer overflow introduced in 0.103 development
when we added optional names to file maps (fmaps). The CPIO parser uses
a stack buffer to store the name (if present).  If no name present, then
the stack buffer was passed unitialized to the fmap scanning function
which could cause an overflow.

This fix both initializes the buffer and uses a pointer so the scan
function gets NULL instead of a buffer in the event that a name isn't
present as that's the intended way to use the API, rather than passing
an empty string name buffer.
2020-07-14 17:04:04 -07:00
Micah Snyder
9b9999d778 Rename core scanning functions
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned whether its "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.

| current name              | new name                          |
| ------------------------- | --------------------------------- |
| magic_scandesc()          | cli_magic_scan()                  |
| cli_magic_scandesc_type() | <delete>                          |
| cli_magic_scandesc()      | cli_magic_scan_desc()             |
| cli_base_scandesc()       | cli_magic_scan_desc_type()        |
| cli_partition_scandesc()  | <delete>                          |
| cli_map_scandesc()        | magic_scan_nested_fmap_type()     |
| cli_map_scan()            | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc()        | cli_magic_scan_buff()             |
| cli_scanbuff()            | cli_scan_buff()                   |
| cli_scandesc()            | cli_scan_desc()                   |
| cli_fmap_scandesc()       | cli_scan_fmap()                   |
| cli_scanfile()            | cli_magic_scan_file()             |
| cli_scandir()             | cli_magic_scan_dir()              |
| cli_filetype2()           | cli_determine_fmap_type()         |
| cli_filetype()            | cli_compare_ftm_file()            |
| cli_partitiontype()       | cli_compare_ftm_partition()       |
| cli_scanraw()             | scanraw()                         |
2020-06-03 11:00:40 -04:00
Micah Snyder
005cbf5a37 Record names of extracted files
A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.

2. File names are useful when interpretting tmp directory output when
using the --leave-temps option.

This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to
any internal scan API's that create fmaps and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required.  The zip and gpt parsers required some modification to record
file names.  The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the
  --leave-temps option is enabled.  When not enabled, the tmp directory
  structure remains flat so as to prevent the likelihood of exceeding
  MAX_PATH.  The current tmp directory is stored in the scan context.

- Made the cli_scanfile() internal API non-static and added it to
  scanners.h so it would be accessible outside of scanners.c in order to
  remove code duplication within libmspack.c.

- Added function comments to scanners.h and matcher.h

- Converted a TDB-type macros and LSIG-type macros to enums for improved
  type safey.

- Converted more return status variables from `int` to `cl_error_t` for
  improved type safety, and corrected ooxml file typing functions so
  they use `cli_file_t` exclusively rather than mixing types with
  `cl_error_t`.

- Restructured the magic_scandesc() function to use goto's for error
  handling and removed the early_ret_from_magicscan() macro and
  magic_scandesc_cleanup() function.  This makes the code easier to
  read and made it easier to add the recursive tmp directory cleanup to
  magic_scandesc().

- Corrected zip, egg, rar filename extraction issues.

- Removed use of extra sub-directory layer for zip, egg, and rar file
  extraction.  For Zip, this also involved changing the extracted
  filenames to be randomly generated rather than using the "zip.###"
  file name scheme.
2020-06-03 10:39:18 -04:00
Micah Snyder
206dbaefe8 Update copyright dates for 2020 2020-01-03 15:44:07 -05:00
Micah Snyder
4524c398f3 Argument and return types for fmap_readn(), cli_writen(), cli_readn() converted to use size_t instead of int. 2019-10-02 16:08:25 -04:00
Micah Snyder
8a4669be47 fuzz - 12156 - Explicit cast to unsigned 32bit int because a signed int could not hold 0xffff << 16 bits. 2019-10-02 16:08:19 -04:00
Micah Snyder
52cddcbcfd Updating and cleaning up copyright notices. 2019-10-02 16:08:18 -04:00
Micah Snyder
b3e82e5e61 Replacing libclamav/cltypes.h with clamav-types.h.in, which generates a header clamav-types.h that we install alongside clamav.h. 2019-10-02 16:08:17 -04:00
Micah Snyder
72fd33c8b2 clang-format'd using new .clang-format rules. 2019-10-02 16:08:16 -04:00
Micah Snyder
d7979d4ff7 Restructured scan options flags from a single bitflag field to a structure containing multiple bitflag fields. This also required adding a new function to the bytecode API to get scan options a la carte, and modifying the existing function to hand back scan options in the old/deprecated uint32_t bitflag format. Re-generated bytecode iface header files.
Updated libclamav documentation detailing new scan options structure.
Renamed references to 'algorithmic' detection to 'heuristic' detection. Renaming references to 'properties' to 'collect metadata'.
Renamed references to 'scan all' to 'scan all match'.
Renamed a couple of 'Hueristic.*' signature names as 'Heuristics.*' signatures (plural) to match majority of other heuristics.
2018-12-02 23:06:59 -05:00
Steven Morgan
7a307529d8 bb11580 - make cli_matchmeta() respect allmatch. 2016-06-08 16:25:34 -04:00
Kevin Lin
c76e630507 bb#11514 - initialize cpio name buffer 2016-03-10 16:08:05 -05:00
Mickey Sola
46a35abe56 mass update of copyright headers 2015-09-17 13:41:26 -04:00
Shawn Webb
cd94be7a52 Silence a bunch of compiler warnings in libclamav 2014-07-10 18:11:49 -04:00
Shawn Webb
60d8d2c352 Move all the crypto API to clamav.h 2014-07-01 19:38:01 -04:00
Shawn Webb
b2e7c931d0 Use OpenSSL for hashing. 2014-02-08 00:31:12 -05:00
Kevin Lin
328a33258a modified cli_map_scan and cli_map_scandesc to take a cli_file_t
modified all respective calls to the above change
2014-01-23 17:06:30 -05:00
David Raynor
3cab931d78 Add ForceToDisk option for clamd and force-to-disk arg for clamscan 2013-11-08 17:10:43 -05:00
Török Edvin
87f763991b Introduce cli_map_scandesc to scan a portion of the existing file
And switch CPIO, MACHO, and SWF to use it.

Now they no longer need to dump a tempfile and remap.
To investigate if it is possible to do this with TAR.
2011-06-17 23:08:47 +03:00
Török Edvin
87af1d57b3 fmapify Mach-O and CPIO 2011-06-13 11:28:32 +03:00
Tomasz Kojm
570b1d0050 libclamav: cdb: drop FileType; cover ARJ, CAB, TAR, CPIO and 7Z 2010-01-14 23:32:35 +01:00
Tomasz Kojm
f1e02ed252 libclamav/cpio.c: wrap unistd.h, reported by Nigel Horne 2009-08-18 23:53:04 +02:00
Tomasz Kojm
3222a09656 libclamav: add support for Universal Binaries
(archives with Mach-O files for different architectures, bb#1592)
2009-07-14 18:19:54 +02:00
Tomasz Kojm
e6417b01b8 libclamav: add support for cpio archives (bb#1649) 2009-07-02 17:10:14 +02:00