Commit graph

31 commits

Author SHA1 Message Date
Val Snyder
7ff29b8c37
Bump copyright dates for 2025 2025-02-14 10:24:30 -05:00
Micah Snyder
1e5ddefcee Clang-format touchup 2024-03-15 13:18:47 -04:00
Micah Snyder
9cb28e51e6 Bump copyright dates for 2024 2024-01-22 11:27:17 -05:00
RainRat
caf324e544
Fix typos (no functional changes) 2023-11-26 18:01:19 -05:00
Micah Snyder
38386349c5 Fix many warnings 2023-04-13 00:11:34 -07:00
Micah Snyder
6eebecc303 Bump copyright for 2023 2023-02-12 11:20:22 -08:00
Micah Snyder
66f48d3e05 Strong indicator precedence over PUA / Heuristic detections
Signatures that start with "PUA.", "Heuristics.", or "BC.Heuristics."
are perceived to be less serious, or more likely to have false
positives, than other signatures that we would think of us as "strong
indicators".

At present, only a subset of "Heuristics." signatures, such as those
added by the phishing module, are added as "potentially unwanted".
Unless you're using heuristic-precedence mode, these "potentially
unwanted" indicators are recorded but not reported unless no other
signature alerts. This behavior should apply to all signatures that
start with "PUA." and "Heuristics.". We already do a string match
comparison on the signature name to apply that behavior to bytecode
matches that start with "BC.Heuristics.".

I moved that string comparison logic used for "BC.Heuristics." into the
main `cl_append_virus()` function and extended it to cover the other two
cases.

I also replaced all hardcoded calls to append "Heuristics." signatures
to append using the `cli_append_potentially_unwanted()` function, so we
can skip the string compare in these cases. That function will of course
append them as strong indicators if heuristic-precedence mode is
enabled.
2022-10-19 13:13:57 -07:00
Micah Snyder
ae5d1ef692 Fix image parser match reporting issue
When using --heuristic-precendence, the heuristic alerts in the image
parsers much be reported as regular matches, which means that when not
in all-match mode, the scan should abort. Right now, those matches
aren't being recorded so scanning continues in archives containing more
than just the malformed image file.

This fix returns the "CL_VIRUS" status code when heuristic-precendence
is enabled and allmatch is not enabled.
2022-10-19 13:13:57 -07:00
Micah Snyder
621381e0cd Allmatch-mode overhaul, part 1: append_virus
Rework the append_virus mechanism to store evidence (strong indicators,
pua indicators, and eventually weak indicators) in vectors. When
appending a "virus", we will return CLEAN when in allmatch-mode, and
simply add the indicator to the appropriate vector.
Later we can check if there were any alerts to return a vector by
summing the lengths of the strong and pua indicator vectors.

This does away with storing the latest "virname" in the scan context.
Instead, we can query for the last indicator in the evidence, giving
priority to strong indicators.

When heuristic-precendence is enabled, add PUA as Strong instead of
as PotentiallyUnwanted. This way, they will be treated equally and
reported in order in allmatch mode.

Also document reason for disabling cache with metadata JSON enabled
2022-10-19 13:13:57 -07:00
Micah Snyder
cd3134568a Code quality: Refactor layer attributes as scan parameter
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.

This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
2022-10-13 08:57:44 -07:00
micasnyd
140c88aa4e Bump copyright for 2022
Includes minor format corrections.
2022-01-09 14:23:25 -07:00
Micah Snyder
db013a2bfd libclamav: Fix scan recursion tracking
Scan recursion is the process of identifying files embedded in other
files and then scanning them, recursively.

Internally this process is more complex than it may sound because a file
may have multiple layers of types before finding a new "file".

At present we treat the recursion count in the scanning context as an
index into both our fmap list AND our container list. These two lists
are conceptually a part of the same thing and should be unified.

But what's concerning is that the "recursion level" isn't actually
incremented or decremented at the same time that we add a layer to the
fmap or container lists but instead is more touchy-feely, increasing
when we find a new "file".

To account for this shadiness, the size of the fmap and container lists
has always been a little longer than our "max scan recursion" limit so
we don't accidentally overflow the fmap or container arrays (!).

I've implemented a single recursion-stack as an array, similar to before,
which includes a pointer to each fmap at each layer, along with the size
and type. Push and pop functions add and remove layers whenever a new
fmap is added. A boolean argument when pushing indicates if the new layer
represents a new buffer or new file (descriptor). A new buffer will reset
the "nested fmap level" (described below).

This commit also provides a solution for an issue where we detect
embedded files more than once during scan recursion.

For illustration, imagine a tarball named foo.tar.gz with this structure:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once.

Note that for the GZ layer, it may detect the ZIP and 7Z if the
signature hits on the compressed data, which it might, though
extracting the ZIP and 7Z will likely fail.

The reason the above doesn't happen now is that we restrict embedded
type scans for a bunch of archive formats to include GZ and TAR.

(B) If we scan for embedded files at the foo.tar layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough only scanning for
embedded content in the current fmap when the "nested fmap level" is 0.
The upside is that it should safely detect all embedded content, even if
it may think the sfz.zip and sfx.7z are in foo.tar instead of in baz.exe.

The biggest risk I can think of affects ZIPs. SFXZIP detection
is identical to ZIP detection, which is why we don't allow SFXZIP to be
detected if insize of a ZIP. If we only allow embedded type scanning at
fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP
if the bar.exe was not compressed in foo.zip and if non-compressed files
extracted from ZIPs aren't extracted as new buffers:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in
new buffers, option (B) should be safe.

(C) If we scan for embedded files at the baz.exe layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by
restricting 7ZSFX and ZIPSFX detection only when scanning an executable.
But that may mean losing detection of archives embedded elsewhere.
And we'd have to identify allowable container types for each possible
embedded type, which would be very difficult.

So this commit aims to solve the issue the (B)-way.

Note that in all situations, we still have to scan with file typing
enabled to determine if we need to reassign the current file type, such
as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-
compressed. Detection of DMG and a handful of other types rely on
finding data partway through or near the ned of a file before
reassigning the entire file as the new type.

Other fixes and considerations in this commit:

- The utf16 HTML parser has weak error handling, particularly with respect
  to creating a nested fmap for scanning the ascii decoded file.
  This commit cleans up the error handling and wraps the nested scan with
  the recursion-stack push()/pop() for correct recursion tracking.

  Before this commit, each container layer had a flag to indicate if the
  container layer is valid.
  We need something similar so that the cli_recursion_stack_get_*()
  functions ignore normalized layers. Details...

  Imagine an LDB signature for HTML content that specifies a ZIP
  container. If the signature actually alerts on the normalized HTML and
  you don't ignore normalized layers for the container check, it will
  appear as though the alert is in an HTML container rather than a ZIP
  container.

  This commit accomplishes this with a boolean you set in the scan context
  before scanning a new layer. Then when the new fmap is created, it will
  use that flag to set similar flag for the layer. The context flag is
  reset those that anything after this doesn't have that flag.
  The flag allows the new recursion_stack_get() function to ignore
  normalized layers when iterating the stack to return a layer at a
  requested index, negative or positive.

  Scanning normalized extracted/normalized javascript and VBA should also
  use the 'layer is normalized' flag.

- This commit also fixes Heuristic.Broken.Executable alert for ELF files
  to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it
  respects the FP check).

  B) all broken-executable alerts for ELF only happen if the
  SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir().
  This was needed so we could correctly apply the layer-is-normalized-flag
  to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded
  file detection scans to abort. Granted we don't actually want to abort
  if max filesize or max recursion depth are exceeded... only if max
  scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on
  correct error propagation for fatal conditions. Instead, setting this
  flag in the scan context should guarantee that a fatal condition deep in
  scan recursion isn't lost which result in more stuff being scanned
  instead of aborting. This shouldn't be necessary, but some status codes
  like CL_ETIMEOUT never used to be fatal and it's easier to do this than
  to verify every parser only returns CL_ETIMEOUT and other "fatal
  status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filestypes.c and include
  is_tar.h instead.

- Presently we create the fmap hash when creating the fmap.
  This wastes a bit of CPU if the hash is never needed.
  Now that we're creating fmap's for all embedded files discovered with
  file type recognition scans, this is a much more frequent occurence and
  really slows things down.

  This commit fixes the issue by only creating fmap hashes as needed.
  This should not only resolve the perfomance impact of creating fmap's
  for all embedded files, but also should improve performance in general.

- Add allmatch check to the zip parser after the central-header meta
  match. That way we don't multiple alerts with the same match except in
  allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize,
  recursion depth, # of embedded files, and scantime are always reported
  if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may
  mask signature matches later on. I changed it so these alerts use the
  "possibly unwanted" alert-type and thus only alert if no other alerts
  were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded.*" events to the JSON metadata
  when the --gen-json feature is enabled. These will show up once under
  "ParseErrors" the first time a limit is exceeded. In the present
  implementation, only one limits-exceeded events will be added, so as to
  prevent a malicious or malformed sample from filling the JSON buffer
  with millions of events and using a tonne of RAM.
2021-10-25 16:02:29 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
e4e3149368 Fix fmap-duplicate performance issue
The fmap_duplicate function is used create a new fmap with a view into
an existing fmap. When the new view is a different size than the old
fmap, a new hash must be calculated for the duplicate fmap. However,
when the duplicated fmap is the same size as the original fmap, the hash
will be the same and there's no point recalculating.

The issue is apparent when scanning large EXE files because the hash was
being calculated at the beginning and end of the scan.

Digging into this issue revealed that hash calculations for fmaps were
also being performed at the wrong place. For scans of maps we use
fmap_duplicate() early in the process to apply the name API argument to
the duplicate fmap. Fixing the logic so we doing recalculate the hash
revealed that we never calculated hashes for fmap's created from buffers
in the first place, so that also had to be fixed be relocating where the
hash is calculated.

I also found that fmap_duplicate()'s offset argument used an off_t,
though it and all caller offsets are not allowed to be negative. This
was a bit of tangent to fix a bunch of off_t variables and paramters
that should've been size_t.

Added a couple unit tests to verify that making duplicate fmaps, and
duplicate-duplicate fmaps works as expected after the change.

Changed CLI_ISCONTAINED() and CLI_ISCONTAINED2() macros to cast to
size_t, because pointers and buffer sizes may not be negative, and these
two macros do not rely on substraction.
2021-01-28 12:54:50 -08:00
Micah Snyder
4d2b100ade Code review fixes
The new GIF, PNG, TIFF, and JPEG types should be enabled for 0.103.1+
(aka FLEVEL 122+)
2021-01-28 12:54:50 -08:00
Micah Snyder
205614b403 Integrate JPEG exploit check into JPEG parser
Integrated the JPEG exploit check into the JPEG parser and removed it
from special.c.

As a happy consequence of this, the photoshop file detection and
embedded JPEG thumbnail exploit check was merged in as well, which means
that the embedded thumbnails can also be scanned as embedded JPEG files.
2021-01-28 12:54:50 -08:00
Micah Snyder
1ae678c945 JPEG format validator improvements
Adds debug output to the JPEG format validator to help resolve issues
with unusually formatted JPEGs and to validate that the JPEG parser is
working correctly.

Relaxes the rules around duplicate application markers or application
markers that appear later than expected, due to prior XMP metadata, etc.

Removed the requirement for an application marker to exist, as some
older JPEGs don't appear to use JFIF, Exif, or SPIFF application
extensions.

I tested against a relatively large data set of JPEGs from Mac & Windows
stock photos, personal photos, and assorted downloaded photos. FP rates
when alerting on broken media should be very low.
2021-01-28 12:54:50 -08:00
Micah Snyder
4cce1fcd20 GIF, PNG bugfixes; Add AlertBrokenMedia option
Added a new scan option to alert on broken media (graphics) file
formats. This feature mitigates the risk of malformed media files
intended to exploit vulnerabilities in other software. At present
media validation exists for JPEG, TIFF, PNG, and GIF files.

To enable this feature, set `AlertBrokenMedia yes` in clamd.conf, or
use the `--alert-broken-media` option when using `clamscan`.
These options are disabled by default for now.

Application developers may enable this scan option by enabling
`CL_SCAN_HEURISTIC_BROKEN_MEDIA` for the `heuristic` scan option bit
field.

Fixed PNG parser logic bugs that caused an excess of parsing errors
and fixed a stack exhaustion issue affecting some systems when
scanning PNG files. PNG file type detection was disabled via
signature database update for 0.103.0 to mitigate effects from these
bugs.

Fixed an issue where PNG and GIF files no longer work with Target:5
(graphics) signatures if detected as CL_TYPE_PNG/GIF rather than as
CL_TYPE_GRAPHICS. Target types now support up to 10 possible file
types to make way for additional graphics types in future releases.

Scanning JPEG, TIFF, PNG, and GIF files will no longer return "parse"
errors when file format validation fails. Instead, the scan will alert
with the "Heuristics.Broken.Media" signature prefix and a descriptive
suffix to indicate the issue, provided that the "alert broken media"
feature is enabled.

GIF format validation will no longer fail if the GIF image is missing
the trailer byte, as this appears to be a relatively common issue in
otherwise functional GIF files.

Added a TIFF dynamic configuration (DCONF) option, which was missing.
This will allow us to disable TIFF format validation via signature
database update in the event that it proves to be problematic.
This feature already exists for many other file types.

Added CL_TYPE_JPEG and CL_TYPE_TIFF types.
2021-01-28 12:54:47 -08:00
Micah Snyder
005cbf5a37 Record names of extracted files
A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.

2. File names are useful when interpretting tmp directory output when
using the --leave-temps option.

This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to
any internal scan API's that create fmaps and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required.  The zip and gpt parsers required some modification to record
file names.  The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the
  --leave-temps option is enabled.  When not enabled, the tmp directory
  structure remains flat so as to prevent the likelihood of exceeding
  MAX_PATH.  The current tmp directory is stored in the scan context.

- Made the cli_scanfile() internal API non-static and added it to
  scanners.h so it would be accessible outside of scanners.c in order to
  remove code duplication within libmspack.c.

- Added function comments to scanners.h and matcher.h

- Converted a TDB-type macros and LSIG-type macros to enums for improved
  type safey.

- Converted more return status variables from `int` to `cl_error_t` for
  improved type safety, and corrected ooxml file typing functions so
  they use `cli_file_t` exclusively rather than mixing types with
  `cl_error_t`.

- Restructured the magic_scandesc() function to use goto's for error
  handling and removed the early_ret_from_magicscan() macro and
  magic_scandesc_cleanup() function.  This makes the code easier to
  read and made it easier to add the recursive tmp directory cleanup to
  magic_scandesc().

- Corrected zip, egg, rar filename extraction issues.

- Removed use of extra sub-directory layer for zip, egg, and rar file
  extraction.  For Zip, this also involved changing the extracted
  filenames to be randomly generated rather than using the "zip.###"
  file name scheme.
2020-06-03 10:39:18 -04:00
Mickey Sola
a25d48d7fa gif - clang formatted; copyright dates fixed 2020-04-23 10:48:08 -07:00
Aldo Mazzeo
153a87a74b Making the GIF parser more tolerant and supporting GIF overlays 2020-04-23 10:48:07 -07:00
Micah Snyder
206dbaefe8 Update copyright dates for 2020 2020-01-03 15:44:07 -05:00
Micah Snyder
52cddcbcfd Updating and cleaning up copyright notices. 2019-10-02 16:08:18 -04:00
Micah Snyder
b3e82e5e61 Replacing libclamav/cltypes.h with clamav-types.h.in, which generates a header clamav-types.h that we install alongside clamav.h. 2019-10-02 16:08:17 -04:00
Micah Snyder
72fd33c8b2 clang-format'd using new .clang-format rules. 2019-10-02 16:08:16 -04:00
Mickey Sola
46a35abe56 mass update of copyright headers 2015-09-17 13:41:26 -04:00
Shawn Webb
60d8d2c352 Move all the crypto API to clamav.h 2014-07-01 19:38:01 -04:00
Shawn Webb
b2e7c931d0 Use OpenSSL for hashing. 2014-02-08 00:31:12 -05:00
Tomasz Kojm
ac47dac79d libclamav: add basic GIF validator 2011-04-11 17:23:14 +02:00
Tomasz Kojm
1fb9e80cfe do more checks 2011-04-06 15:53:28 +02:00
Tomasz Kojm
b44fb658ba libclamav: add basic JPEG validator 2011-04-05 16:33:38 +02:00