Commit graph

138 commits

Author SHA1 Message Date
Val S.
1a2515eea9
Fix compiler warning
Mismatched declaration and definition.
2025-10-14 14:05:12 -04:00
Val S.
a77a271fb5
Reduce unnecessary scanning of embedded file FPs (#1571)
When embedded file type recognition finds a possible embedded file, it
is scanned as a new embedded file even if it turns out to be a false
positive and parsing fails. My solution is to pre-parse the file headers
as little as possible to determine whether the file is valid. If
possible, also determine the file size based on the headers. That way we
don't have to scan additional data when the embedded file is not at the
very end.

This commit adds header checks prior to embedded ZIP, ARJ, and CAB
scanning. For these types I was also able to use the header checks to
determine the object size so as to prevent excessive pattern matching.
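
For illustration, a rough sketch of the kind of header pre-check this adds,
using ZIP as the example (simplified; the real ZIP/ARJ/CAB checks and the
actual helper names differ):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical result of the pre-parse step. */
    typedef struct {
        bool valid;  /* header looks plausible */
        size_t size; /* object size if it could be determined, else 0 */
    } embedded_check_t;

    static embedded_check_t precheck_embedded_zip(const uint8_t *buf, size_t len)
    {
        embedded_check_t res = {false, 0};
        static const uint8_t zip_magic[4] = {0x50, 0x4b, 0x03, 0x04}; /* "PK\3\4" */

        if (len < 30) /* minimum ZIP local file header length */
            return res;
        if (memcmp(buf, zip_magic, sizeof(zip_magic)) != 0)
            return res; /* false positive: don't bother scanning this offset */

        /* Little-endian fields of the local file header. */
        uint32_t csize = (uint32_t)buf[18] | ((uint32_t)buf[19] << 8) |
                         ((uint32_t)buf[20] << 16) | ((uint32_t)buf[21] << 24);
        uint16_t namelen  = (uint16_t)buf[26] | ((uint16_t)buf[27] << 8);
        uint16_t extralen = (uint16_t)buf[28] | ((uint16_t)buf[29] << 8);

        res.valid = true;
        if (csize != 0 && csize != 0xFFFFFFFF) /* streamed/zip64 entries can't be bounded here */
            res.size = 30 + (size_t)namelen + (size_t)extralen + csize;
        return res;
    }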

TODO: Add the same for RAR, EGG, 7Z, NULSFT, AUTOIT, IShield, and PDF.

This commit also removes duplicate matching for embedded MSEXE.
The embedded MSEXE detection and scanning logic was accidentally
creating an extra duplicate layer in between scanning and detection
because of the logic within the `cli_scanembpe()` function.
That function was effectively doing the header check which this commit
adds for ZIP, ARJ, and CAB but minus the size check.
Note: It is unfortunately not possible to get an accurate size from PE
file headers.
The `cli_scanembpe()` function also used to dump to a temp file, which
has been unnecessary ever since FMAPs were extended to support windows
into other FMAPs. So this commit removes the intermediate layer and no
longer drops a temp file for each embedded PE file.

Further, this commit adds configuration and DCONF safeguards around all
embedded file type scanning.

Finally, this commit adds a set of tests to validate proper extraction
of embedded ZIP, ARJ, CAB, and MSEXE files.

CLAM-2862

Co-authored-by: TheRaynMan <draynor@sourcefire.com>
2025-09-23 15:57:28 -04:00
Valerie Snyder
f7e60d566f
Record unique object-id for each layer scanned
Every time we push a new map onto the scanning recursion context, give
it a unique object id number, which counts from zero.

Moved the location where we add metadata for each file from the
"cli_magic_scan" function over to the "recursion stack push" function.

Include a "path" as a parameter for creating a new fmap, and rename some
related variables and functions to be more intuitive.
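
For illustration, a simplified sketch of the push bookkeeping (struct and
function names here are hypothetical, not the real internal API):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct layer {
        uint64_t object_id; /* unique per scanned layer, counts from 0 */
        const char *path;   /* path passed in when the fmap was created, may be NULL */
    } layer_t;

    typedef struct scan_ctx {
        uint64_t next_object_id;
        layer_t stack[32];  /* fixed-size for the sketch */
        size_t depth;
    } scan_ctx_t;

    static int recursion_stack_push(scan_ctx_t *ctx, const char *path)
    {
        if (ctx->depth >= sizeof(ctx->stack) / sizeof(ctx->stack[0]))
            return -1; /* recursion limit reached */
        ctx->stack[ctx->depth].object_id = ctx->next_object_id++; /* hand out the id here */
        ctx->stack[ctx->depth].path      = path;
        ctx->depth++;
        return 0;
    }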

CLAM-2796
See also: CLAM-2485, CLAM-2626
2025-08-14 21:23:33 -04:00
Valerie Snyder
aa7b7e9421
Swap clean cache from MD5 to SHA2-256
Change the clean-cache to use SHA2-256 instead of MD5.
Note that all references are changed to specify "SHA2-256" now instead
of "SHA256", for clarity. But there is no plan to add support for SHA3
algorithms at this time.

Significant code cleanup. E.g.:
- Implemented goto-done error handling.
- Used `uint8_t *` instead of `unsigned char *`.
- Used `bool` for boolean checks, rather than `int`.
- Used `#defines` instead of magic numbers.
- Removed duplicate `#defines` for things like hash length.

Add new option to calculate and record additional hash types when the
"generate metadata JSON" feature is enabled:
- libclamav option: `CL_SCAN_GENERAL_STORE_EXTRA_HASHES`
- clamscan option: `--json-store-extra-hashes` (default off)
- clamd.conf option: `JsonStoreExtraHashes` (default 'no')

Renamed the sigtool option `--sha256` to `--sha2-256`.
The original option is still functional, but is deprecated.

For the "generate metadata JSON" feature, the file hash is now stored as
"sha2-256" instead of "FileMD5". If you enable the "extra hashes" option,
then it will also record "md5" and "sha1".

Deprecate and disable the internal "SHA collect" feature.
This option had been hidden behind C #ifdef checks for an option that
wasn't exposed through CMake, so it was basically unavailable anyway.

Changed the code to calculate file hashes when they're needed and no sooner.

For the FP feature in the matcher module, I have mimicked the
optimization in the FMAP scan routine so that it can calculate multiple
hashes in a single pass of the file.
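
For illustration, a rough sketch of the single-pass idea using OpenSSL's EVP
API (not the actual matcher/fmap code):

    #include <openssl/evp.h>
    #include <stdio.h>

    static int hash_one_pass(FILE *fp,
                             unsigned char md5[EVP_MAX_MD_SIZE],
                             unsigned char sha1[EVP_MAX_MD_SIZE],
                             unsigned char sha256[EVP_MAX_MD_SIZE])
    {
        EVP_MD_CTX *c_md5    = EVP_MD_CTX_new();
        EVP_MD_CTX *c_sha1   = EVP_MD_CTX_new();
        EVP_MD_CTX *c_sha256 = EVP_MD_CTX_new();
        unsigned char buf[8192];
        size_t n;
        int ret = -1;

        if (!c_md5 || !c_sha1 || !c_sha256)
            goto done;
        if (!EVP_DigestInit_ex(c_md5, EVP_md5(), NULL) ||
            !EVP_DigestInit_ex(c_sha1, EVP_sha1(), NULL) ||
            !EVP_DigestInit_ex(c_sha256, EVP_sha256(), NULL))
            goto done;

        /* One pass over the data feeds all three digest contexts. */
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
            EVP_DigestUpdate(c_md5, buf, n);
            EVP_DigestUpdate(c_sha1, buf, n);
            EVP_DigestUpdate(c_sha256, buf, n);
        }

        if (EVP_DigestFinal_ex(c_md5, md5, NULL) &&
            EVP_DigestFinal_ex(c_sha1, sha1, NULL) &&
            EVP_DigestFinal_ex(c_sha256, sha256, NULL))
            ret = 0;
    done:
        EVP_MD_CTX_free(c_md5);
        EVP_MD_CTX_free(c_sha1);
        EVP_MD_CTX_free(c_sha256);
        return ret;
    }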

The `HandlerType` feature stores a hash of the file in the scan ctx to
prevent retyping the exact same data more than once.
I removed that hash field and replaced it with an attribute flag that is
applied to the new recursion stack layer when retyping a file.
This also closes a minor bug that would prevent retyping a file with an
all-zero hash. :)

The work upgrading cache.c to support SHA2-256 sized hashes is thanks to:
https://github.com/m-sola

CLAM-255
CLAM-1858
CLAM-1859
CLAM-1860
2025-08-14 21:23:30 -04:00
Val Snyder
7ff29b8c37
Bump copyright dates for 2025 2025-02-14 10:24:30 -05:00
Micah Snyder
a729aafc38 Remove PCRE dead code
As of ClamAV 0.105, PCRE2 is required. PCRE (1) is not an option, and
there is also no option to disable PCRE support.

This commit removes the dead code associated with those old build
options.
2024-04-13 12:34:15 -04:00
Micah Snyder
3ae9c1e434 Add LHA/LZH archive support
File type magic signatures chosen based on the extensions supported
by the Rust delharc crate.

See: https://docs.rs/delharc/latest/delharc/
2024-04-09 10:35:22 -04:00
Micah Snyder
9cb28e51e6 Bump copyright dates for 2024 2024-01-22 11:27:17 -05:00
RainRat
caf324e544
Fix typos (no functional changes) 2023-11-26 18:01:19 -05:00
Micah Snyder
6eebecc303 Bump copyright for 2023 2023-02-12 11:20:22 -08:00
Micah Snyder
f7b139a776 PE, ELF, Mach-O: code cleanup
The header parsing / executable metadata collecting functions for the
PE, ELF, and Mach-O file types were using `int` for the return type.
Mostly they were returning 0 for success and -1, -2, -3, or -4 for
failure. But in some cases they were returning cl_error_t enum values
for failure. Regardless, the function using them was treating 0 as
success and non-zero as failure, which it stored as -1 ... every time.

This commit switches them all to use cl_error_t.  I am continuing to
store the final result as 0 / -1 in the `peinfo` struct, but outside of
that everything has been made consistent.

While I was working on that, I got a tad sidetracked.  I noticed that
the target type isn't an enum, or even a set of #defines. So I made an
enum and changed the code that uses target types to use the enum.
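
For illustration, roughly the shape of such an enum (the actual names in the
code may differ); the values track the signature Target numbers documented
for signature writing:

    typedef enum {
        TARGET_GENERIC  = 0,  /* any file */
        TARGET_PE       = 1,
        TARGET_OLE2     = 2,
        TARGET_HTML     = 3,
        TARGET_MAIL     = 4,
        TARGET_GRAPHICS = 5,
        TARGET_ELF      = 6,
        TARGET_ASCII    = 7,  /* normalized ASCII text */
        TARGET_MACHO    = 9,
        TARGET_PDF      = 10,
        TARGET_FLASH    = 11,
        TARGET_JAVA     = 12,
        TARGET_OTHER    = 14  /* binary (unidentified) files */
    } target_type_t;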

I also removed the `target` parameter from a number of functions that
don't actually use it at all. Some recursion was masking the fact that
it was an unused parameter which is why there was no warning about it.
2022-10-19 13:13:57 -07:00
Micah Snyder
73088d261b Fix issue detecting embedded zips attached to small files
If initial file type recognition comes back as an SFX type, which may
happen for small files that do not get recognized as any other file type
and contain a zip entry somewhere in the middle, then the type will be
set to that SFX type. This is a problem because later on when we go to
do embedded file type recognition, we explicitly skip SFX types, in
addition to TARs and other types that are parsed elsewhere and have a
high embedded file type recognition FP-rate because they aren't
compressed.

This commit prohibits that initial FTM check from selecting an SFX type.
The SFX type will be rediscovered in `scanraw()` where the type is
handled/parsed.
2022-10-19 13:13:57 -07:00
Micah Snyder
29a761219a Matcher: code cleanup, fix possible leaks
Added inline documentation and did some general cleanup of
`cli_scan_buff()`, and then updated the function comment now that I
understand the function a little better.

While doing this, I found that the calls to cli_ac_initdata were being
done regardless of whether or not logically initialized matcher data was
required or used.  But the call to free that matcher data was only being
done when AC-data was not provided by the caller.  This would be a leak.
I fixed this by only initializing the AC data when AC data is not
provided by the caller.
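
For illustration, a minimal sketch of the ownership rule (names hypothetical):
only initialize local AC data when the caller didn't supply any, and only
free what this function itself initialized.

    #include <stddef.h>

    typedef struct ac_data {
        int initialized; /* stand-in for the real matcher data */
    } ac_data_t;

    /* stubs for illustration */
    static int ac_initdata(ac_data_t *d) { d->initialized = 1; return 0; }
    static void ac_freedata(ac_data_t *d) { d->initialized = 0; }

    static int scan_buff_sketch(const unsigned char *buf, size_t len, ac_data_t *caller_data)
    {
        ac_data_t local   = {0};
        ac_data_t *mdata  = caller_data;

        if (!mdata) {
            /* caller gave us nothing: initialize (and later free) our own */
            if (ac_initdata(&local) != 0)
                return -1;
            mdata = &local;
        }

        /* ... pattern matching against buf/len using mdata ... */
        (void)buf;
        (void)len;

        if (mdata == &local)
            ac_freedata(&local); /* free only what we initialized ourselves */
        return 0;
    }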
2022-10-19 13:13:57 -07:00
Micah Snyder
858b541a51 Matcher: Remove allmatch checks and significantly tidy code
Significantly tidy the `cli_scan_fmap()` function, and add comments to
explain how it all works.

Add SHA1 and SHA256 digest variables to the FMAP structure in addition
to the existing MD5. Add a function to set the hash so that when we
calculate the hashes for hash matching, we save them for subsequent FP
matching. This enabled me to remove the extra "hash-only" FP check from
`cli_scan_fmap()`. This will also make it easier to switch the clean
cache hash algorithm to SHA256 in the future.
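
For illustration, a sketch of the idea (field names are illustrative, not the
exact fmap layout): keep each digest alongside a "have it" flag so a hash
computed once for hash-signature matching can be reused for the FP check.

    #include <stdbool.h>
    #include <string.h>

    typedef struct fmap_hashes {
        unsigned char md5[16];
        unsigned char sha1[20];
        unsigned char sha256[32];
        bool have_md5, have_sha1, have_sha256;
    } fmap_hashes_t;

    static void fmap_set_sha256(fmap_hashes_t *f, const unsigned char digest[32])
    {
        memcpy(f->sha256, digest, 32);
        f->have_sha256 = true; /* later callers can skip recomputing it */
    }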

Remove extra allmatch checks that are no longer required.

Add a new header to prevent #include order issues.
2022-10-19 13:13:57 -07:00
Andy Ragusa
778a4b1341 Corrected types to remove warnings. 2022-10-18 14:04:36 -07:00
Micah Snyder
cd3134568a Code quality: Refactor layer attributes as scan parameter
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.

This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
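
For illustration, a simplified sketch of the change in shape (function and
flag names here are hypothetical, not the real internal API):

    #include <stdint.h>

    #define LAYER_ATTR_NONE       0x0
    #define LAYER_ATTR_NORMALIZED 0x1 /* example attribute bit */

    /* Old shape: the attribute was stashed in the context and could leak onto
     * the wrong layer if an error path forgot to clear it:
     *     ctx->next_layer_attributes |= LAYER_ATTR_NORMALIZED;
     *     scan_nested_fmap(ctx, map);
     *
     * New shape: the attributes for exactly one layer travel as an argument. */
    static int scan_nested_fmap(void *ctx, void *map, uint32_t attributes)
    {
        (void)ctx;
        (void)map;
        (void)attributes; /* applied only to the layer being pushed; nothing persists in ctx */
        return 0;
    }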
2022-10-13 08:57:44 -07:00
Andy Ragusa
b3a3b358b0 Speed up freeing of signatures
Speed up freeing of signatures by tracking all malloced blocks instead
of having to find duplicates in our data structures on signature unload.
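
For illustration, a simplified sketch of the tracking approach (not the actual
loader code): every block allocated while loading is appended to a flat list,
so unloading frees the list in one walk.

    #include <stdlib.h>

    typedef struct block_list {
        void **blocks;
        size_t count, capacity;
    } block_list_t;

    static void *tracked_malloc(block_list_t *pool, size_t size)
    {
        void *p = malloc(size);
        if (!p)
            return NULL;
        if (pool->count == pool->capacity) {
            size_t cap = pool->capacity ? pool->capacity * 2 : 64;
            void **nb  = realloc(pool->blocks, cap * sizeof(void *));
            if (!nb) {
                free(p);
                return NULL;
            }
            pool->blocks   = nb;
            pool->capacity = cap;
        }
        pool->blocks[pool->count++] = p;
        return p;
    }

    static void tracked_free_all(block_list_t *pool)
    {
        for (size_t i = 0; i < pool->count; i++)
            free(pool->blocks[i]); /* no need to hunt for duplicates elsewhere */
        free(pool->blocks);
        pool->blocks = NULL;
        pool->count = pool->capacity = 0;
    }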
2022-10-07 08:30:57 -07:00
Scott Hutton
21d1f7defc Various Rust-related code cleanup
* Broke out the variants of error/result handling in `frs_error.rs`.
  Made syntax slightly cleaner for `frs_call!`, explicitly moving
  the output variables *out* of the function call so as not to make
  the parameter order confusing.

* Wrapped the FuzzyHash map into a container rather than exposing
  the HashMap directly.  Simplifies casting, and allows it to feel
  more like a class with methods.

* Fixed various clippy complaints regarding unsafe, etc.

* Rename `frs_error.rs` to `ffi_utils.rs` and migrated ffi-specific
  features like the `validate_str_param!()` macro to this new module.
2022-03-02 13:12:59 -07:00
Micah Snyder
fd587c741c Image fuzzy hash: new logical sub-signature feature
Add a new logical signature subsignature type for matching on images
with image fuzzy hashes.

Image fuzzy hash subsignatures follow this format:

    fuzzy_img#<hash>#<dist>

In this initial implementation, the hamming distance (dist) is ignored
and only exact fuzzy hash matches will alert.

Fuzzy hash matching is only performed for supported image types.

Also: removed some excessive debug log messages on start-up.

Fixed an issue where the signature name (virname) was being allocated and
stored for every subsignature, or even every sub-pattern in an AC-pattern
(i.e. NDB sig or LDB subsig) containing a `{n-m}` or `*` wildcard.
This fix is only for LDB subsigs though. NDB signatures are still
allocating one virname per sub-pattern.

This fix was required because I needed a place to store the virname with
fuzzy-hash subsignatures. Storing it in the fuzzy-hash subsig metadata
the way AC-pattern, PCRE, and BComp subsigs were doing it wouldn't work,
because it would cross the C-Rust FFI boundary and giving pointers to
Rust-allocated stuff is dicey. Not to mention native Rust strings are
different than C strings. Anyway, the correct thing to do was to store
the virname with the actual logical signature.

TODO: Keep track of NDB signatures in the same way and store the virname
for NDB sigs there instead of in AC-patterns so that we can get rid of
the virname field in the AC-pattern struct.
2022-03-02 13:12:59 -07:00
micasnyd
140c88aa4e Bump copyright for 2022
Includes minor format corrections.
2022-01-09 14:23:25 -07:00
Micah Snyder
db013a2bfd libclamav: Fix scan recursion tracking
Scan recursion is the process of identifying files embedded in other
files and then scanning them, recursively.

Internally this process is more complex than it may sound because a file
may have multiple layers of types before finding a new "file".

At present we treat the recursion count in the scanning context as an
index into both our fmap list AND our container list. These two lists
are conceptually a part of the same thing and should be unified.

But what's concerning is that the "recursion level" isn't actually
incremented or decremented at the same time that we add a layer to the
fmap or container lists but instead is more touchy-feely, increasing
when we find a new "file".

To account for this shadiness, the size of the fmap and container lists
has always been a little longer than our "max scan recursion" limit so
we don't accidentally overflow the fmap or container arrays (!).

I've implemented a single recursion-stack as an array, similar to before,
which includes a pointer to each fmap at each layer, along with the size
and type. Push and pop functions add and remove layers whenever a new
fmap is added. A boolean argument when pushing indicates if the new layer
represents a new buffer or new file (descriptor). A new buffer will reset
the "nested fmap level" (described below).

This commit also provides a solution for an issue where we detect
embedded files more than once during scan recursion.

For illustration, imagine a tarball named foo.tar.gz with this structure:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |

But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| baz.exe                   | PE    | 0         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| │   └── hello.txt         | ASCII | 2         | 0                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |
|     └── world.txt         | ASCII | 2         | 0                 |

(A) If we scan for embedded files at any layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| ├── foo.tar               | TAR   | 1         | 0                 |
| │   ├── bar.zip           | ZIP   | 2         | 1                 |
| │   │   └── hola.txt      | ASCII | 3         | 0                 |
| │   ├── baz.exe           | PE    | 2         | 1                 |
| │   │   ├── sfx.zip       | ZIP   | 3         | 1                 |
| │   │   │   └── hello.txt | ASCII | 4         | 0                 |
| │   │   └── sfx.7z        | 7Z    | 3         | 1                 |
| │   │       └── world.txt | ASCII | 4         | 0                 |
| │   ├── sfx.zip           | ZIP   | 2         | 1                 |
| │   │   └── hello.txt     | ASCII | 3         | 0                 |
| │   └── sfx.7z            | 7Z    | 2         | 1                 |
| │       └── world.txt     | ASCII | 3         | 0                 |
| ├── sfx.zip               | ZIP   | 1         | 1                 |
| └── sfx.7z                | 7Z    | 1         | 1                 |

(A) is bad because it scans content more than once.

Note that for the GZ layer, it may detect the ZIP and 7Z if the
signature hits on the compressed data, which it might, though
extracting the ZIP and 7Z will likely fail.

The reason the above doesn't happen now is that we restrict embedded
type scans for a bunch of archive formats to include GZ and TAR.

(B) If we scan for embedded files at the foo.tar layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     ├── baz.exe           | PE    | 2         | 1                 |
|     ├── sfx.zip           | ZIP   | 2         | 1                 |
|     │   └── hello.txt     | ASCII | 3         | 0                 |
|     └── sfx.7z            | 7Z    | 2         | 1                 |
|         └── world.txt     | ASCII | 3         | 0                 |

(B) is almost right. But we can achieve it easily enough by only scanning
for embedded content in the current fmap when the "nested fmap level" is 0.
The upside is that it should safely detect all embedded content, even if
it may think the sfx.zip and sfx.7z are in foo.tar instead of in baz.exe.

The biggest risk I can think of affects ZIPs. SFXZIP detection
is identical to ZIP detection, which is why we don't allow SFXZIP to be
detected if inside of a ZIP. If we only allow embedded type scanning at
fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP
if the bar.exe was not compressed in foo.zip and if non-compressed files
extracted from ZIPs aren't extracted as new buffers:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.zip                   | ZIP   | 0         | 0                 |
| └── bar.exe               | PE    | 1         | 1                 |
|     └── sfx.zip           | ZIP   | 2         | 2                 |

Provided that we ensure all files extracted from zips are scanned in
new buffers, option (B) should be safe.

(C) If we scan for embedded files at the baz.exe layer, we may detect:
| description               | type  | rec level | nested fmap level |
| ------------------------- | ----- | --------- | ----------------- |
| foo.tar.gz                | GZ    | 0         | 0                 |
| └── foo.tar               | TAR   | 1         | 0                 |
|     ├── bar.zip           | ZIP   | 2         | 1                 |
|     │   └── hola.txt      | ASCII | 3         | 0                 |
|     └── baz.exe           | PE    | 2         | 1                 |
|         ├── sfx.zip       | ZIP   | 3         | 1                 |
|         │   └── hello.txt | ASCII | 4         | 0                 |
|         └── sfx.7z        | 7Z    | 3         | 1                 |
|             └── world.txt | ASCII | 4         | 0                 |

(C) is right. But it's harder to achieve. For this example we can get it by
restricting 7ZSFX and ZIPSFX detection only when scanning an executable.
But that may mean losing detection of archives embedded elsewhere.
And we'd have to identify allowable container types for each possible
embedded type, which would be very difficult.

So this commit aims to solve the issue the (B)-way.

Note that in all situations, we still have to scan with file typing
enabled to determine if we need to reassign the current file type, such
as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2-
compressed. Detection of DMG and a handful of other types relies on
finding data partway through or near the end of a file before
reassigning the entire file as the new type.

Other fixes and considerations in this commit:

- The utf16 HTML parser has weak error handling, particularly with respect
  to creating a nested fmap for scanning the ascii decoded file.
  This commit cleans up the error handling and wraps the nested scan with
  the recursion-stack push()/pop() for correct recursion tracking.

  Before this commit, each container layer had a flag to indicate if the
  container layer is valid.
  We need something similar so that the cli_recursion_stack_get_*()
  functions ignore normalized layers. Details...

  Imagine an LDB signature for HTML content that specifies a ZIP
  container. If the signature actually alerts on the normalized HTML and
  you don't ignore normalized layers for the container check, it will
  appear as though the alert is in an HTML container rather than a ZIP
  container.

  This commit accomplishes this with a boolean you set in the scan context
  before scanning a new layer. Then when the new fmap is created, it will
  use that flag to set a similar flag for the layer. The context flag is
  then reset so that subsequent layers don't inherit it.
  The flag allows the new recursion_stack_get() function to ignore
  normalized layers when iterating the stack to return a layer at a
  requested index, negative or positive.

  Scanning extracted/normalized JavaScript and VBA should also use the
  'layer is normalized' flag.

- This commit also fixes Heuristic.Broken.Executable alert for ELF files
  to make sure that:

  A) these only alert if cli_append_virus() returns CL_VIRUS (aka it
  respects the FP check).

  B) all broken-executable alerts for ELF only happen if the
  SCAN_HEURISTIC_BROKEN option is enabled.

- This commit also cleans up the error handling in cli_magic_scan_dir().
  This was needed so we could correctly apply the layer-is-normalized-flag
  to all VBA macros extracted to a directory when scanning the directory.

- Also fix an issue where exceeding scan maximums wouldn't cause embedded
  file detection scans to abort. Granted we don't actually want to abort
  if max filesize or max recursion depth are exceeded... only if max
  scansize, max files, and max scantime are exceeded.

  Add 'abort_scan' flag to scan context, to protect against depending on
  correct error propagation for fatal conditions. Instead, setting this
  flag in the scan context should guarantee that a fatal condition deep in
  scan recursion isn't lost, which would result in more stuff being scanned
  instead of aborting. This shouldn't be necessary, but some status codes
  like CL_ETIMEOUT never used to be fatal and it's easier to do this than
  to verify every parser only returns CL_ETIMEOUT and other "fatal
  status codes" in fatal conditions.

- Remove duplicate is_tar() prototype from filetypes.c and include
  is_tar.h instead.

- Presently we create the fmap hash when creating the fmap.
  This wastes a bit of CPU if the hash is never needed.
  Now that we're creating fmaps for all embedded files discovered with
  file type recognition scans, this is a much more frequent occurrence and
  really slows things down.

  This commit fixes the issue by only creating fmap hashes as needed.
  This should not only resolve the performance impact of creating fmaps
  for all embedded files, but should also improve performance in general.

- Add allmatch check to the zip parser after the central-header meta
  match. That way we don't get multiple alerts with the same match except
  in allmatch mode. Clean up error handling in the zip parser a tiny bit.

- Fixes to ensure that the scan limits such as scansize, filesize,
  recursion depth, # of embedded files, and scantime are always reported
  if AlertExceedsMax (--alert-exceeds-max) is enabled.

- Fixed an issue where non-fatal alerts for exceeding scan maximums may
  mask signature matches later on. I changed it so these alerts use the
  "possibly unwanted" alert-type and thus only alert if no other alerts
  were found or if all-match or heuristic-precedence are enabled.

- Added the "Heuristics.Limits.Exceeded.*" events to the JSON metadata
  when the --gen-json feature is enabled. These will show up once under
  "ParseErrors" the first time a limit is exceeded. In the present
  implementation, only one limits-exceeded event will be added, so as to
  prevent a malicious or malformed sample from filling the JSON buffer
  with millions of events and using a tonne of RAM.
2021-10-25 16:02:29 -07:00
Micah Snyder
81402e1abb Inline doxygen documentation fixup
Fixup input/output params to be annotated with [in,out], not [in/out].
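
For example (hypothetical function, shown only to illustrate the corrected
annotation style):

    #include <stddef.h>

    /**
     * @brief Copy and normalize a buffer.
     *
     * @param[in]     src  Source buffer.
     * @param[in,out] dst  Destination buffer; existing contents may be reused.
     * @param[in]     len  Number of bytes to process.
     */
    void normalize_copy(const char *src, char *dst, size_t len);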

Note: skipped some other incorrectly annotated [out] params that are
already staged to be fixed in a different PR.
2021-07-17 10:39:27 -07:00
Micah Snyder
090c8990e3
libclamav, clamscan: load/unload callbacks & progress meters
Add progress callbacks to libclamav for:
- database load
- engine compile
- engine free

Add a progress bar to clamscan for load & compile.
These are disabled if you run with --debug, if stdout is not a TTY, or if
you are using one of --quiet, --infected, or --no-summary.

Added code so you can test the engine-free callback by building with
ENABLE_ENGINE_FREE_PROGRESSBAR defined.

The compile & free progress callbacks pre-calculate the number of
tasks to complete to estimate the progress. Some tasks may take longer
than others, so the progress speed may appear to vary a little.

The callbacks return a cl_error_t, but the return value doesn't currently
do anything. It is reserved for future use.
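
For illustration, a rough usage sketch. The setter and callback names below
are what I believe landed in clamav.h (cl_engine_set_clcb_sigload_progress
and friends); check the header for the exact prototypes before relying on
this.

    #include <stdio.h>
    #include <clamav.h>

    static cl_error_t load_progress(size_t total_items, size_t now_completed, void *context)
    {
        (void)context;
        fprintf(stderr, "\rLoading: %zu/%zu", now_completed, total_items);
        return CL_SUCCESS; /* return value is reserved for future use */
    }

    static void register_progress_sketch(struct cl_engine *engine)
    {
        /* assumed setter name, see note above */
        cl_engine_set_clcb_sigload_progress(engine, load_progress, NULL);
    }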

Minor formatting change in matcher-ac.c to counteract weird
clang-format behavior, and to make it easier to read.

Added progress callbacks and clamscan progress bars to the news.
2021-07-16 11:47:23 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
4cce1fcd20 GIF, PNG bugfixes; Add AlertBrokenMedia option
Added a new scan option to alert on broken media (graphics) file
formats. This feature mitigates the risk of malformed media files
intended to exploit vulnerabilities in other software. At present
media validation exists for JPEG, TIFF, PNG, and GIF files.

To enable this feature, set `AlertBrokenMedia yes` in clamd.conf, or
use the `--alert-broken-media` option when using `clamscan`.
These options are disabled by default for now.

Application developers may enable this scan option by enabling
`CL_SCAN_HEURISTIC_BROKEN_MEDIA` for the `heuristic` scan option bit
field.
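
For example, an application might set the bit roughly like this (using the
cl_scan_options struct from clamav.h; the surrounding setup is omitted):

    #include <clamav.h>
    #include <string.h>

    static void enable_broken_media(struct cl_scan_options *options)
    {
        memset(options, 0, sizeof(struct cl_scan_options));
        options->parse     |= ~0;                             /* enable all parsers */
        options->heuristic |= CL_SCAN_HEURISTIC_BROKEN_MEDIA; /* alert on broken media */
    }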

Fixed PNG parser logic bugs that caused an excess of parsing errors
and fixed a stack exhaustion issue affecting some systems when
scanning PNG files. PNG file type detection was disabled via
signature database update for 0.103.0 to mitigate effects from these
bugs.

Fixed an issue where PNG and GIF files no longer work with Target:5
(graphics) signatures if detected as CL_TYPE_PNG/GIF rather than as
CL_TYPE_GRAPHICS. Target types now support up to 10 possible file
types to make way for additional graphics types in future releases.

Scanning JPEG, TIFF, PNG, and GIF files will no longer return "parse"
errors when file format validation fails. Instead, the scan will alert
with the "Heuristics.Broken.Media" signature prefix and a descriptive
suffix to indicate the issue, provided that the "alert broken media"
feature is enabled.

GIF format validation will no longer fail if the GIF image is missing
the trailer byte, as this appears to be a relatively common issue in
otherwise functional GIF files.

Added a TIFF dynamic configuration (DCONF) option, which was missing.
This will allow us to disable TIFF format validation via signature
database update in the event that it proves to be problematic.
This feature already exists for many other file types.

Added CL_TYPE_JPEG and CL_TYPE_TIFF types.
2021-01-28 12:54:47 -08:00
Mickey Sola
9ea3b93018 Recurse all fpmaps when doing fpchecks
Changes cli_checkfp_virus to a recursive function which checks all
parent fmaps in the context for false positives

Simplifies params needed for cli_checkfp_virus to use the current digest
and fmap length that reside within the fmap struct itself
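
For illustration, a conceptual sketch of the recursive walk (types and the
allowlist helper are hypothetical stand-ins, not the real code):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct fmap_sketch {
        struct fmap_sketch *parent;  /* containing layer, NULL at the top */
        const unsigned char *digest; /* digest stored in the fmap itself */
        size_t len;
    } fmap_sketch_t;

    /* stub standing in for the allowlist lookup */
    static bool digest_is_allowlisted(const unsigned char *digest, size_t len)
    {
        (void)digest;
        (void)len;
        return false;
    }

    static bool checkfp_recursive(const fmap_sketch_t *map)
    {
        if (!map)
            return false;
        if (digest_is_allowlisted(map->digest, map->len))
            return true;                       /* this layer is a known-clean file */
        return checkfp_recursive(map->parent); /* otherwise check the parent layers */
    }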
2020-08-03 12:11:56 -07:00
Micah Snyder
9b9999d778 Rename core scanning functions
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned, whether it's "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.

| current name              | new name                          |
| ------------------------- | --------------------------------- |
| magic_scandesc()          | cli_magic_scan()                  |
| cli_magic_scandesc_type() | <delete>                          |
| cli_magic_scandesc()      | cli_magic_scan_desc()             |
| cli_base_scandesc()       | cli_magic_scan_desc_type()        |
| cli_partition_scandesc()  | <delete>                          |
| cli_map_scandesc()        | magic_scan_nested_fmap_type()     |
| cli_map_scan()            | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc()        | cli_magic_scan_buff()             |
| cli_scanbuff()            | cli_scan_buff()                   |
| cli_scandesc()            | cli_scan_desc()                   |
| cli_fmap_scandesc()       | cli_scan_fmap()                   |
| cli_scanfile()            | cli_magic_scan_file()             |
| cli_scandir()             | cli_magic_scan_dir()              |
| cli_filetype2()           | cli_determine_fmap_type()         |
| cli_filetype()            | cli_compare_ftm_file()            |
| cli_partitiontype()       | cli_compare_ftm_partition()       |
| cli_scanraw()             | scanraw()                         |
2020-06-03 11:00:40 -04:00
Micah Snyder
005cbf5a37 Record names of extracted files
A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.

2. File names are useful when interpreting tmp directory output when
using the --leave-temps option.

This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to
any internal scan APIs that create fmaps, and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required.  The zip and gpt parsers required some modification to record
file names.  The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the
  --leave-temps option is enabled.  When not enabled, the tmp directory
  structure remains flat so as to prevent the likelihood of exceeding
  MAX_PATH.  The current tmp directory is stored in the scan context.

- Made the cli_scanfile() internal API non-static and added it to
  scanners.h so it would be accessible outside of scanners.c in order to
  remove code duplication within libmspack.c.

- Added function comments to scanners.h and matcher.h

- Converted TDB-type macros and LSIG-type macros to enums for improved
  type safety.

- Converted more return status variables from `int` to `cl_error_t` for
  improved type safety, and corrected ooxml file typing functions so
  they use `cli_file_t` exclusively rather than mixing types with
  `cl_error_t`.

- Restructured the magic_scandesc() function to use goto's for error
  handling and removed the early_ret_from_magicscan() macro and
  magic_scandesc_cleanup() function.  This makes the code easier to
  read and made it easier to add the recursive tmp directory cleanup to
  magic_scandesc().

- Corrected zip, egg, rar filename extraction issues.

- Removed use of extra sub-directory layer for zip, egg, and rar file
  extraction.  For Zip, this also involved changing the extracted
  filenames to be randomly generated rather than using the "zip.###"
  file name scheme.
2020-06-03 10:39:18 -04:00
Micah Snyder
206dbaefe8 Update copyright dates for 2020 2020-01-03 15:44:07 -05:00
Micah Snyder
5f4f69102d Correcting types from int to cl_error_t where appropriate. Eliminating unused variables and referencing unused parameters to remove warnings. 2019-10-02 16:08:25 -04:00
Andrew
7ba310e605 PE parsing code improvements, db loading bug fixes
Consolidate the PE parsing code into one function.  I tried to preserve all existing functionality from the previous, distinct implementations to a large extent (with the exceptions mentioned below).  If I noticed potential bugs/improvements, I added a TODO statement about those so that they can be fixed in a smaller commit later.  Also, there are more TODOs in places where I'm not entirely sure why certain actions are performed - more research is needed for these.

I'm submitting a pull request now so that regression testing can be done, and because merging what I have thus far will likely have fewer conflicts than if I try to merge later.

PE parsing code improvements:
- PEs without all 16 data directories are parsed more appropriately now
- Added lots more debug statements

Also:
 - Allow MAX_BC and MAX_TRACKED_PCRE to be specified via CFLAGS

    When doing performance testing with the latest CVD, MAX_BC and
    MAX_TRACKED_PCRE need to be raised to track all the events.
    Allow these to be specified via CFLAGS by not redefining them
    if they are already defined

- Fix an issue preventing wildcard sizes in .MDB/.MSB rules

    I'm not sure what the original intent of the check I removed was,
    but it prevents using wildcard sizes in .MDB/.MSB rules.  AFAICT
    these wildcard sizes should be handled appropriately by the MD5
    section hash computation code, so I don't think a check on that
    is needed.

- Fix several issues related to db loading
     - .imp files will now get loaded if they exist in a directory passed
       via clamscan's '-d' flag
     - .pwdb files will now get loaded if they exist in a directory passed
       via clamscan's '-d' flag even when compiling without yara support
     - Changes to .imp, .ign, and .ign2 files will now be reflected in calls
       to cl_statinidir and cl_statchkdir (and also .pwdb files, even when
       compiling without yara support)
     - The contents of .sfp files won't be included in some of the signature
       counts, and the contents of .cud files will be
     - Any local.gdb files will no longer be loaded twice

- For .imp files, you are no longer required to specify a minimum flevel for wildcard rules, since this isn't needed
2019-10-02 16:08:20 -04:00
Micah Snyder
52cddcbcfd Updating and cleaning up copyright notices. 2019-10-02 16:08:18 -04:00
Micah Snyder
b3e82e5e61 Replacing libclamav/cltypes.h with clamav-types.h.in, which generates a header clamav-types.h that we install alongside clamav.h. 2019-10-02 16:08:17 -04:00
Micah Snyder
72fd33c8b2 clang-format'd using new .clang-format rules. 2019-10-02 16:08:16 -04:00
Micah Snyder
38fe8b69a0 Added .clang-format style rules, clam-format script to automate formatting of ClamAV code, and preparing select files so that clang-format does not alter carefully formatted sections. 2019-10-02 16:08:16 -04:00
Mickey Sola
2b6c456a1b bcomp - updates and fixes following code review 2018-12-02 23:07:03 -05:00
Mickey Sola
18ff502920 refactoring byte compare functionality as a subsig; adding loader and matchers for bytecompare subsig 2018-12-02 23:07:03 -05:00
Micah Snyder
d0cba11ea7 adding back changes to eliminate warnings from mspack, matcher, others, and readdb. 2017-09-21 13:10:01 -04:00
Micah Snyder
169af0fc67 Revert "eliminating warnings. mostly correcting variable types. also correcting struct initialization in a couple instances (var = {0} does not zero the memory on all platforms). Also some minor formatting corrections in areas I was already working. eliminated some unused variables."
This reverts commit 84a7f40288.
2017-09-20 12:37:07 -04:00
Micah Snyder
84a7f40288 eliminating warnings. mostly correcting variable types. also correcting struct initialization in a couple instances (var = {0} does not zero the memory on all platforms). Also some minor formatting corrections in areas I was already working. eliminated some unused variables. 2017-08-15 14:00:07 -04:00
Steven Morgan
cbf5017a7d bb11805 fix multiple results. Refactor false positive and heuristic precedence logic. 2017-04-18 12:07:06 -04:00
Kevin Lin
87b2a1a9e3 add 'Intermediates' field to target description block
(allows specification of any number of intermediate containers)
2017-02-01 17:33:02 -05:00
Kevin Lin
984f90ca4f bb#11587 - track linked bcs on matchers for target 7 normalization 2016-06-28 15:19:50 -04:00
Mickey Sola
46a35abe56 mass update of copyright headers 2015-09-17 13:41:26 -04:00
Kevin Lin
e7b3198df2 bb#9858 - added target 14 for binary (unidentified) files 2015-07-23 16:37:15 -04:00
Steven Morgan
7665e02d5b Add support for YARA private rules and referencing other rules in a YARA condition. 2015-06-19 16:33:59 -04:00
Steven Morgan
b7999b89c9 YARA: capture offsets in matcher and use for processing YARA condition 'at' clauses. 2015-03-30 17:12:01 -04:00
Steven Morgan
f51f42e95c Capture YARA compiled condition string and anchor in struct cli_ac_lsig. 2015-03-06 17:10:47 -05:00
Steven Morgan
9de400559d refactor and simplify cli_lsig_eval, add new function cli_exp_eval to loop thru the lsig table and call either lsig_eval or yara_eval. 2015-03-03 19:25:13 -05:00
Kevin Lin
b5b3fecd6c unioned lsig logic and future yara conditional 2015-02-11 10:36:43 -08:00