Commit graph

632 commits

Author SHA1 Message Date
Micah Snyder (micasnyd)
a71eb34999 Fix invalid zip & macho scan recursion
If zip content is detected within a file by way of the embedded file
type recognition scan (in `scanraw()`), a raw scan of that "ZIPSFX" will
detect all subsequent zip entries as new ZIPSFX's. Though they aren't
actually scanned later, it shows up in the metadata JSON. This commit
prevents embedded file type detection for ZIPSFX like we already have
for ZIP.

Semi-related, the mach-o unibin parser presently allows scanning of FAT
partitions anywhere in the fmap, to include the very beginning of the
fmap. This would be an infinite loop, scanning the same file over and
over again, were it not for the scan recursion limit. With the recursion
limit, it's ok, but still bad behavior. This commit prevents scanning
FAT files from the mach-o unibin parser where the offset is less than
the end of the headers.
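
A minimal sketch of the kind of offset guard described above. The function name, its parameters, and the `headers_end` value are illustrative assumptions, not the actual libclamav code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a FAT (universal binary) entry is
 * safe to scan as a nested file.
 *
 * headers_end: offset just past the FAT header and arch table (assumed).
 * map_len:     total length of the map being scanned.
 */
bool fat_entry_is_scannable(uint64_t offset, uint64_t size,
                            uint64_t headers_end, uint64_t map_len)
{
    /* An entry that starts inside the headers would rescan the same data. */
    if (offset < headers_end)
        return false;

    /* Reject entries that do not fit in the map, written without overflow. */
    if (offset > map_len || size > map_len - offset)
        return false;

    return size > 0;
}
```

With this shape, an entry at offset 0 is rejected outright instead of relying on the recursion limit to end the loop.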

Also fixed an unsigned integer comparison in the OLE2 parser that
might overflow.
2021-06-17 11:30:23 -07:00
Micah Snyder
0255f29a72 Blacklist & Whitelist verbiage
Improvements to use modern block list and allow list verbiage.

blacklist -> block list
whitelist -> allow list
blacklisted -> blocked
whitelisted -> allowed

In the case of certificate verification, use "trust" or "verify" when
something is allowed.

Also changed domainlist -> domain list (or DomainList) to match.
2021-05-27 14:16:00 -07:00
Micah Snyder (micasnyd)
1919141768 Fix ENGINE_OPTIONS_FORCE_TO_DISK scan performance
There is a scan logic issue where the main libclamav scanning functions
create an extra "nested" fmap for each file being scanned. This is
slightly inefficient for a normal scan, but causes a major performance
issue when using ENGINE_OPTIONS_FORCE_TO_DISK. It causes every scanned
file to be duplicated in the temp directory before the scan.

We fix this by using `cli_magic_scan()` in `scan_common()` instead
of `cli_magic_scan_nested_fmap_type()`. We can do this now that the
`cl_scandesc_callback()` API creates an fmap for the caller, instead of
the old logic where `scan_common()` called different API's depending on
whether or not we have an fmap or a file descriptor.
2021-05-17 17:22:22 -07:00
Micah Snyder
bae444a25b clang-format housekeeping 2021-04-09 19:08:14 -07:00
Micah Snyder (micasnyd)
861153a656 Fix errors when scanning files > 4G
This commit resolves https://bugzilla.clamav.net/show_bug.cgi?id=12673

Changes in 0.103 to order of operations for creating fmaps and
performing hashes of fmaps resulted in errors when scanning files that
are 4096M and a different (but related) error when scanning files > 4096M.
This is despite the fact that scanning is supposed to be limited to
--max-scansize (MaxScanSize) and was also apparently limited to
INT_MAX - 2 (aka ~1.999999G) back in 2014 to alleviate reported crashes
for a few large file formats.
(see https://bugzilla.clamav.net/show_bug.cgi?id=10960)
This last limitation was not documented, so I added it to the sample
clamd.conf.

Anyways, the main issue is that the fmap module was using "unsigned int"
and was then enforcing a limitation (verbose error messages) when a map
length exceeded the capacity of an unsigned int. This commit
switches the associated variables over to uint64_t, and while fmaps are
still limited to size_t in other places, the fmap module will at least
work with files > 4G on 64bit systems.
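
To illustrate the width problem, here is a small sketch of what happens when a 64-bit length is squeezed into 32 bits. The helper names are made up for this example, and `uint32_t` stands in for a 32-bit `unsigned int`.

```c
#include <stdint.h>

/*
 * Illustration only: what happens to a 64-bit file length when it is stored
 * in a 32-bit unsigned variable (the usual width of "unsigned int").
 */
uint32_t length_as_u32(uint64_t file_len)
{
    return (uint32_t)file_len; /* keeps only the low 32 bits */
}

/* Returns 1 if the length survives the round trip through 32 bits. */
int length_fits_u32(uint64_t file_len)
{
    return (uint64_t)length_as_u32(file_len) == file_len;
}
```

A length of exactly 4096M (2^32 bytes) wraps to zero, while a larger length wraps to a small non-zero value, which is consistent with the two different errors mentioned above.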

In testing this, I found that the time to hash a file, particularly when
hashing a file on an NTFS partition from Linux was really slow because
we were hashing in FILEBUFF chunks (about 8K) at a time.  Increasing
this to 10MB chunks speeds up scanning of large files.
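
A rough sketch of a chunked hashing loop, assuming a caller-chosen chunk size. FNV-1a is only a stand-in for the real hash, and `digest_stream()` is a hypothetical name; the point is that the chunk size changes the number of reads, not the result.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: digest a stream in caller-sized chunks. FNV-1a (64-bit) stands in
 * for the real hash; only the chunked read loop is the point here.
 * Returns 0 on success, -1 on bad arguments, allocation or read failure.
 */
int digest_stream(FILE *fp, size_t chunk_size, uint64_t *digest_out)
{
    unsigned char *buf;
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    size_t n, i;

    if (fp == NULL || chunk_size == 0 || digest_out == NULL)
        return -1;

    buf = malloc(chunk_size);
    if (buf == NULL)
        return -1;

    /* Larger chunks mean fewer read calls for the same amount of data. */
    while ((n = fread(buf, 1, chunk_size, fp)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
        }
    }

    free(buf);
    if (ferror(fp))
        return -1;

    *digest_out = h;
    return 0;
}
```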

Finally, now that hashing is performed immediately when an fmap is
created for a file, hashing of files larger than max-scansize was
occurring. This commit adds checks to bail out early if the file size
exceeds the maximum before creating an fmap. It will alert with the
Heuristics.Limits.Exceeded name if the heuristic is enabled.

Also fixed CheckFmapFeatures.cmake module that detects if
sysconf(_SC_PAGESIZE) is available.
2021-03-31 12:16:41 -07:00
Micah Snyder (micasnyd)
b9ca6ea103 Update copyright dates for 2021
Also fixes up clang-format.
2021-03-19 15:12:26 -07:00
Micah Snyder
625e506b07 clamd: PR review fix, update acknowledgements
The clamd TOCTOU access check fix introduced an expectation that the
scanfile API will set errno if access was denied. We should instead use
the cl_error_t error code enum.

Also added Duane Waddle to the 0.104 contributors acknowledgements.
2021-02-14 18:45:32 -08:00
Micah Snyder
e4e3149368 Fix fmap-duplicate performance issue
The fmap_duplicate function is used to create a new fmap with a view into
an existing fmap. When the new view is a different size than the old
fmap, a new hash must be calculated for the duplicate fmap. However,
when the duplicated fmap is the same size as the original fmap, the hash
will be the same and there's no point recalculating.
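
The idea can be sketched with a simplified stand-in structure. `struct view` and `view_duplicate()` are hypothetical and much simpler than the real fmap code; they only show when a cached hash can be carried over.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view onto a buffer; not the real fmap struct. */
struct view {
    const uint8_t *data;
    size_t len;
    bool have_hash;
    uint8_t hash[16];
};

/*
 * Make a sub-view of `src`. If the sub-view covers the whole source, the
 * cached hash is still valid and is copied; otherwise it is marked stale so
 * that it gets recalculated on demand.
 * Returns false if the requested range does not fit in `src`.
 */
bool view_duplicate(const struct view *src, size_t offset, size_t len,
                    struct view *dst)
{
    if (src == NULL || dst == NULL)
        return false;
    if (offset > src->len || len > src->len - offset)
        return false;

    dst->data = src->data + offset;
    dst->len  = len;

    if (offset == 0 && len == src->len && src->have_hash) {
        memcpy(dst->hash, src->hash, sizeof(dst->hash));
        dst->have_hash = true;
    } else {
        memset(dst->hash, 0, sizeof(dst->hash));
        dst->have_hash = false;
    }
    return true;
}
```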

The issue is apparent when scanning large EXE files because the hash was
being calculated at the beginning and end of the scan.

Digging into this issue revealed that hash calculations for fmaps were
also being performed at the wrong place. For scans of maps we use
fmap_duplicate() early in the process to apply the name API argument to
the duplicate fmap. Fixing the logic so we don't recalculate the hash
revealed that we never calculated hashes for fmaps created from buffers
in the first place, so that also had to be fixed by relocating where the
hash is calculated.

I also found that fmap_duplicate()'s offset argument used an off_t,
though it and all caller offsets are not allowed to be negative. This
was a bit of a tangent to fix a bunch of off_t variables and parameters
that should've been size_t.

Added a couple unit tests to verify that making duplicate fmaps, and
duplicate-duplicate fmaps works as expected after the change.

Changed CLI_ISCONTAINED() and CLI_ISCONTAINED2() macros to cast to
size_t, because pointers and buffer sizes may not be negative, and these
two macros do not rely on subtraction.
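
A sketch of a containment test in that spirit, written as a function rather than a macro. It is not the actual CLI_ISCONTAINED() definition; the name and the wrap-around assumption are mine.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch: is the small range [sb, sb + sb_size) entirely inside the big
 * range [bb, bb + bb_size)? All values are unsigned, and only additions and
 * comparisons are used.
 * Assumes bb + bb_size and sb + sb_size do not wrap around SIZE_MAX.
 */
bool range_is_contained(size_t bb, size_t bb_size, size_t sb, size_t sb_size)
{
    return bb_size > 0 && sb_size > 0 &&
           sb_size <= bb_size &&
           sb >= bb &&
           sb + sb_size <= bb + bb_size;
}
```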
2021-01-28 12:54:50 -08:00
Micah Snyder
c40f03ade6 GPT parser verbosity, DMG/BZ2 detection fix
Reduced the verbosity of a GPT parser warning that occurs frequently
when parsing DMG files prior to DMG file type recognition.

DMG files support a handful of compression formats. File type
recognition for DMG presently works by doing "embedded" file type
recognition during the raw scan after having already identified the file
type by traditional file type magic checks. I found that when DMG uses
bzip2 for compression, we identify an MBR type containing a BZ type, at
which point the raw scan detects it as DMG. The previous commits broke
this by disabling embedded file type recognition for BZ and other
compression & archive types. Ideally the fix would be to do DMG file
type detection by checking the end of the file; perhaps adding negative
offset support for FTM sigs could fix it. Until we can implement that or
another/better solution for DMG file type detection, we'll have to allow
embedded file type recognition for BZ files.

Also added some comments to narrate the scan process.
2021-01-28 12:54:50 -08:00
Micah Snyder
5566f0c76a Scan performance improvements
ClamAV's embedded file type recognition detects some files found in
non-archive formats but for archive formats and compressed data streams
like bzip2 and gzip, it will often detect file type magic bytes of
compressed files and then attempt to parse the compressed data as if
they were whole files, resulting in wasted CPU cycles and confusing
warnings.

This patch prevents embedded file type recognition for CL_TYPE_GZ and
CL_TYPE_BZ.

Also revert the UTF8 Byte Order Mark (BOM) detection and associated
scanning of all text types as HTML files that had been added in 0.103.
Scanning a file as HTML is not performant because it creates temp files
and normalizes the original files 3 ways.

Better text type detection, transcoding, and HTML detection is probably
still needed, but will have to wait. Scanning any embedded content that
looked like text with the HTML parser impacts performance too much.
2021-01-28 12:54:50 -08:00
Micah Snyder
205614b403 Integrate JPEG exploit check into JPEG parser
Integrated the JPEG exploit check into the JPEG parser and removed it
from special.c.

As a happy consequence of this, the photoshop file detection and
embedded JPEG thumbnail exploit check was merged in as well, which means
that the embedded thumbnails can also be scanned as embedded JPEG files.
2021-01-28 12:54:50 -08:00
Micah Snyder
1ae678c945 JPEG format validator improvements
Adds debug output to the JPEG format validator to help resolve issues
with unusually formatted JPEGs and to validate that the JPEG parser is
working correctly.

Relaxes the rules around duplicate application markers or application
markers that appear later than expected, due to prior XMP metadata, etc.

Removed the requirement for an application marker to exist, as some
older JPEGs don't appear to use JFIF, Exif, or SPIFF application
extensions.

I tested against a relatively large data set of JPEGs from Mac & Windows
stock photos, personal photos, and assorted downloaded photos. FP rates
when alerting on broken media should be very low.
2021-01-28 12:54:50 -08:00
Micah Snyder
4cce1fcd20 GIF, PNG bugfixes; Add AlertBrokenMedia option
Added a new scan option to alert on broken media (graphics) file
formats. This feature mitigates the risk of malformed media files
intended to exploit vulnerabilities in other software. At present
media validation exists for JPEG, TIFF, PNG, and GIF files.

To enable this feature, set `AlertBrokenMedia yes` in clamd.conf, or
use the `--alert-broken-media` option when using `clamscan`.
These options are disabled by default for now.

Application developers may enable this scan option by enabling
`CL_SCAN_HEURISTIC_BROKEN_MEDIA` for the `heuristic` scan option bit
field.
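
As a rough illustration of setting a bit in the `heuristic` field: real code would include clamav.h and use the library's scan options struct together with `CL_SCAN_HEURISTIC_BROKEN_MEDIA`. The struct layout, the `demo_` names, and the flag value below are placeholders so the sketch compiles on its own.

```c
#include <stdint.h>

/*
 * Stand-ins for illustration. The struct layout and the flag value are
 * placeholders, not the library's definitions.
 */
struct demo_scan_options {
    uint32_t general;
    uint32_t parse;
    uint32_t heuristic;
};

#define DEMO_HEURISTIC_BROKEN_MEDIA 0x1u /* placeholder bit */

/* Turn the option on without disturbing other heuristic bits. */
void demo_enable_broken_media(struct demo_scan_options *opts)
{
    opts->heuristic |= DEMO_HEURISTIC_BROKEN_MEDIA;
}

/* Returns non-zero if the option is currently enabled. */
int demo_broken_media_enabled(const struct demo_scan_options *opts)
{
    return (opts->heuristic & DEMO_HEURISTIC_BROKEN_MEDIA) != 0;
}
```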

Fixed PNG parser logic bugs that caused an excess of parsing errors
and fixed a stack exhaustion issue affecting some systems when
scanning PNG files. PNG file type detection was disabled via
signature database update for 0.103.0 to mitigate effects from these
bugs.

Fixed an issue where PNG and GIF files no longer work with Target:5
(graphics) signatures if detected as CL_TYPE_PNG/GIF rather than as
CL_TYPE_GRAPHICS. Target types now support up to 10 possible file
types to make way for additional graphics types in future releases.

Scanning JPEG, TIFF, PNG, and GIF files will no longer return "parse"
errors when file format validation fails. Instead, the scan will alert
with the "Heuristics.Broken.Media" signature prefix and a descriptive
suffix to indicate the issue, provided that the "alert broken media"
feature is enabled.

GIF format validation will no longer fail if the GIF image is missing
the trailer byte, as this appears to be a relatively common issue in
otherwise functional GIF files.
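
A simplified sketch of a lenient end-of-stream check, assuming the buffer holds a whole GIF and nothing follows the image data. The GIF trailer byte is 0x3B; the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define GIF_TRAILER 0x3B /* ';' marks the end of a GIF data stream */

enum gif_end_status {
    GIF_END_OK,              /* trailer byte present */
    GIF_END_MISSING_TRAILER, /* tolerated: stream ends without 0x3B */
    GIF_END_EMPTY            /* nothing to inspect */
};

/* Hypothetical helper: classify how a GIF buffer ends. */
enum gif_end_status gif_check_end(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return GIF_END_EMPTY;
    return (buf[len - 1] == GIF_TRAILER) ? GIF_END_OK
                                         : GIF_END_MISSING_TRAILER;
}

/* A lenient validator treats a missing trailer as acceptable. */
int gif_end_is_acceptable(enum gif_end_status s)
{
    return s == GIF_END_OK || s == GIF_END_MISSING_TRAILER;
}
```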

Added a TIFF dynamic configuration (DCONF) option, which was missing.
This will allow us to disable TIFF format validation via signature
database update in the event that it proves to be problematic.
This feature already exists for many other file types.

Added CL_TYPE_JPEG and CL_TYPE_TIFF types.
2021-01-28 12:54:47 -08:00
Micah Snyder (micasnyd)
9e20cdf6ea Add CMake build tooling
This patch adds experimental-quality CMake build tooling.

The libmspack build required a modification to use "" instead of <> for
header #includes. This will hopefully be included in the libmspack
upstream project when adding CMake build tooling to libmspack.

Removed use of libltdl when using CMake.

Flex & Bison are now required to build.

If -DMAINTAINER_MODE, then GPERF is also required, though it currently
doesn't actually do anything.  TODO!

I found that the autotools build system was generating the lexer output
but not actually compiling it, instead using previously generated (and
manually renamed) lexer c source. As a consequence, changes to the .l
and .y files weren't making it into the build. To resolve this, I
removed generated flex/bison files and fixed the tooling to use the
freshly generated files. Flex and bison are now required build tools.
On Windows, this adds a dependency on the winflexbison package,
which can be obtained using Chocolatey or may be manually installed.

CMake tooling only has partial support for building with external LLVM
library, and no support for the internal LLVM (to be removed in the
future). I.e. The CMake build currently only supports the bytecode
interpreter.

Many files used include paths relative to the top source directory or
relative to the current project, rather than relative to each build
target. Modern CMake support requires including internal dependency
headers the same way you would external dependency headers (albeit
with "" instead of <>). This meant correcting all header includes to
be relative to the build targets and not relative to the workspace.

For example, ...

```c
#include "../libclamav/clamav.h"
#include "clamd/clamd_others.h"
```

... becomes:

```c
// libclamav
#include "clamav.h"

// clamd
#include "clamd_others.h"
```

Fixes header name conflicts by renaming a few of the files.

Converted the "shared" code into a static library, which depends on
libclamav. The ironically named "shared" static library provides
features common to the ClamAV apps which are not required in
libclamav itself and are not intended for use by downstream projects.
This change was required for correct modern CMake practices but was
also required to use the automake "subdir-objects" option.
This eliminates warnings when running autoreconf which, in the next
version of autoconf & automake are likely to break the build.

libclamav used to build in multiple stages where an earlier stage is
a static library containing utils required by the "shared" code.
Linking clamdscan and clamdtop with this libclamav utils static lib
allowed these two apps to function without libclamav. While this is
nice in theory, the practical gains are minimal and it complicates
the build system. As such, the autotools and CMake tooling was
simplified for improved maintainability and this feature was thrown
out. clamdtop and clamdscan now require libclamav to function.

Removed the nopthreads version of the autotools
libclamav_internal_utils static library and added pthread linking to
a couple apps that may have issues building on some platforms without
it, with the intention of removing needless complexity from the
source. Kept the regular version of libclamav_internal_utils.la
though it is no longer used anywhere but in libclamav.

Added an experimental doxygen build option which attempts to build
clamav.h and libfreshclam doxygen html docs.

The CMake build tooling also may build the example program(s), which
isn't a feature in the Autotools build system.

Changed C standard to C90+ due to inline linking issues with socket.h
when linking libfreshclam.so on Linux.

Generate common.rc for win32.

Fix tabs/spaces in shared Makefile.am, and remove vestigial ifndef
from misc.c.

Add CMake files to the automake dist, so users can try the new
CMake tooling w/out having to build from a git clone.

clamonacc changes:
- Renamed FANOTIFY macro to HAVE_SYS_FANOTIFY_H to better match other
  similar macros.
- Added a new clamav-clamonacc.service systemd unit file, based on
  the work of ChadDevOps & Aaron Brighton.
- Added missing clamonacc man page.

Updates to clamdscan man page, add missing options.

Remove vestigial CL_NOLIBCLAMAV definitions (all apps now use
libclamav).

Rename Windows mspack.dll to libmspack.dll so all ClamAV-built
libraries have the lib-prefix with Visual Studio as with CMake.
2020-08-13 00:25:34 -07:00
Micah Snyder (micasnyd)
1a8b164b4f Fix new issues identified by Coverity
298485: Fix possible fd leaks.

298486: Fix possible use-after-free.
2020-08-12 18:14:39 -07:00
Micah Snyder (micasnyd)
c637de532b Disable embedded type recognition for disk images
Using file type recognition scan mode for disk images and other raw
archive formats is problematic. One simple reason is that the contained
files will be detected and parsed and scanned twice, first when detected
by the type recog scan, and later when the archive is extracted and the
files are properly scanned. Another reason is an increased likelihood
for incorrect type recognition, as seen with supposed MHTML files (they
weren't) found in GPT disk images.

Though a previous patch disabled embedded type recognition for GPT
files, this one extends this to the following:
- CL_TYPE_CPIO_OLD
- CL_TYPE_ZIP
- CL_TYPE_OLD_TAR
- CL_TYPE_POSIX_TAR

ZIP is included because file entries in a ZIP are incorrectly detected
as ZIPSFX's and though we also ensure not to scan ZIPSFX's found in
ZIP's, it's more efficient not to do the type recognition in the first
place and it prevents us from adding those bogus ZIPSFX entries into the
scan properties JSON.
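
The exclusion can be pictured as a simple predicate over the container type. The enum below is a stand-in whose names mirror the file types mentioned in this message, and the function name is hypothetical.

```c
#include <stdbool.h>

/* Stand-in enum; not the real cli_file_t. */
enum demo_file_type {
    DEMO_TYPE_ANY,
    DEMO_TYPE_GPT,
    DEMO_TYPE_CPIO_OLD,
    DEMO_TYPE_ZIP,
    DEMO_TYPE_OLD_TAR,
    DEMO_TYPE_POSIX_TAR,
    DEMO_TYPE_MSEXE
};

/*
 * Hypothetical predicate: should the raw scan look for embedded file types
 * inside a file of the given type?
 */
bool demo_allow_embedded_type_recognition(enum demo_file_type type)
{
    switch (type) {
        case DEMO_TYPE_GPT:
        case DEMO_TYPE_CPIO_OLD:
        case DEMO_TYPE_ZIP:
        case DEMO_TYPE_OLD_TAR:
        case DEMO_TYPE_POSIX_TAR:
            /* Contents are extracted and scanned properly later. */
            return false;
        default:
            return true;
    }
}
```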

This patch also fixes what appears to be a copy-paste typo, where
CL_TYPE_ISHIELD_MSI types were accidentally having their container value
set to CL_TYPE_AUTOIT.
2020-08-12 00:18:53 -07:00
Micah Snyder
5d7e54c0bf Code review fixes
Exit early from VBA scanning loop if virus found.

Add VBA/XLM suffix to ContainsMacros heuristics.

Fix setting status code for error and virus conditions.

Increment/decrement recursion counter when scanning vba dir.
2020-08-11 11:45:06 -07:00
Micah Snyder
07a66adc75 Fix bug added in previous patch, fixup unit tests to use newly added sanitized_basename parameter. 2020-08-11 11:45:06 -07:00
Micah Snyder
860764eb16 Heuristic macro detection for imp VBA extraction
Notably the commit adds a heuristic alert when VBA is extracted using
the new VBA extraction code and similarly adds "HasMacros":true to the
JSON scan properties.

In addition, a change was added to the cli_sanitize_filepath() function
so it converts posix pathseps to Windows pathseps on Windows and also
outputs a sanitized basename pointer (optional) which is used when
generating a temporary filename so that using a prefix with pathseps in
it won't cause file creation failures (observed with --leave-temps where
original filenames are incorporated into temporary filenames).

Included some error handling improvements for cli_vba_scandir() to
better track alert and macro detections.

Downgraded utf8 conversion error messages to debug messages because they
are too verbose in files with invalid filenames (observed in some
malware).

Changed the xlm macro and vba project temp filenames to include
"xlm_macros" and "vba_project" prefix, to make it easier to find them.

Relocated XLM and VBA temp files from the top-level tmp directory to the
current sub_tmpdir, so tempfiles for a given scan are more organized.
2020-08-11 11:45:06 -07:00
Micah Snyder
b1dbf93f0b Fix newly introduced VBA/XLM OLE2 bugs
Fix an infinite loop in the new XLM macro parser.

Fix error handling, resource cleanup in OLE2 parser.

Fix issues tracking detected "viruses" in VBA & OLE2 parsers affecting
non-allmatch (regular) scan mode, wherein multiple viruses may be found
but each record lost and the overall detection comes up clean.

Also silence switch() fall-through warning for WORD/PPT/XL/HWP (OOXML)
file type fall-throughs to the ZIP parser (because they are zips).

Also silence switch() fall-through warning when handling the limits-
exceeded error types, checking for the limits-exceeded heuristic, and
continuing on to bail out with a clean verdict.
2020-08-11 11:45:06 -07:00
Mickey Sola
9ea3b93018 Recurse all fpmaps when doing fpchecks
Changes cli_checkfp_virus to a recursive function which checks all
parent fmaps in the context for false positives

Simplifies params needed for cli_checkfp_virus to use the current digest
and fmap length that resides within the fmap struct itself
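
A minimal sketch of the recursive walk, assuming each map layer keeps its own digest and length plus a link to its parent. The structs and the allow list layout are hypothetical simplifications, not the real libclamav types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_DIGEST_LEN 16

/* Hypothetical, simplified map layer with a link to its container. */
struct demo_map {
    unsigned char digest[DEMO_DIGEST_LEN];
    size_t len;
    const struct demo_map *parent; /* NULL for the top-level file */
};

/* Hypothetical allow list entry: a digest plus the expected length. */
struct demo_fp_entry {
    unsigned char digest[DEMO_DIGEST_LEN];
    size_t len;
};

/*
 * Returns true if this layer, or any parent layer, matches an entry in the
 * allow list, which would mark the detection as a false positive.
 */
bool demo_is_false_positive(const struct demo_map *map,
                            const struct demo_fp_entry *list, size_t count)
{
    size_t i;

    if (map == NULL)
        return false; /* walked past the top-level file */

    for (i = 0; i < count; i++) {
        if (map->len == list[i].len &&
            memcmp(map->digest, list[i].digest, DEMO_DIGEST_LEN) == 0)
            return true;
    }
    return demo_is_false_positive(map->parent, list, count);
}
```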
2020-08-03 12:11:56 -07:00
Micah Snyder
e2f59af30a Clang-format touchup 2020-07-24 16:37:25 -07:00
Micah Snyder (micasnyd)
6198778903 Additional XLM parser error handling fixes
Improve error handling for functions that read the XLM BIFF temp-files.

Improve resource cleanup to alleviate Coverity false positive issue.
2020-07-16 13:39:47 -07:00
Andrew
319bfb51a5 Fix several coverity warnings
290424 Missing break in switch - In hash_match: Missing break
statement between cases in switch statement

290414 Resource leak - In cli_scanishield_msi: Leak of memory or
pointers to system resources. Memory leak in a fail case

288197 Resource leak - In decrypt_any: Leak of memory or pointers
to system resources. Memory leak in a fail case

290426 Resource leak - In cli_magic_scan: Leak of memory or pointers
to system resources. Leaked a file prefix when running with
--save-temps

192923 Resource leak - In cli_scanrar: Leak of memory or pointers to
system resources. Leaked a file descriptor if a virus was found in
a RAR file comment

225146 Resource leak - In cli_scanegg: Leak of memory or pointers
to system resources. Leaked a file descriptor if unable to write
a comment file to disk

290425 Resource leak - In scan_common: Leak of memory or pointers
to system resources. Memory leaks in various fail cases.

Also changes cli_scanrar to write out the file comment only if
--leave-temps is specified and scan the buffer (like what is done
in cli_scanegg) instead of writing the file out, scanning that,
and then deleting the file if --leave-temps is not specified.

The unit tests stopped working when correcting an issue with a
switch statement that determined what type of signature had matched
on a Google SafeBrowsing GDB rule. Looking into the unit tests, it
looks like the code had always assumed that the test cases would be
detected by a malware test rule in unit_tests/input/daily.gdb, but
now some of the tests get matched on the phishing test rule.
I updated the test logic to be more clear, and added tests for both
cases now.

Fix some memory leaks in libclamav/scanners.c
2020-07-15 08:39:32 -07:00
Micah Snyder
e01ba94e36 bb12506: Fix phishing/heuristic alert verbosity
Some detections, like phishing, are considered heuristic alerts because
they match based on behavior more than on content.  A subset of these
are considered "potentially unwanted" (low-severity).  These
low-severity alerts include:
- phishing
- PDFs with obfuscated object names
- bytecode signature alerts that start with "BC.Heuristics"

The concept is that unless you enable "heuristic precedence" (a method
of lowering the threshold to immediately alert on low-severity
detections), the scan should continue after a match in case a higher
severity match is found.  Only at the end will it print the low-severity
match if nothing else was found.
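
The intended behavior can be sketched as a small state machine. The struct, field names, and functions here are illustrative only and do not reflect the real scan context.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_verdict { DEMO_CLEAN, DEMO_VIRUS };

/* Hypothetical scan state; field names are illustrative only. */
struct demo_scan_state {
    bool heuristic_precedence;  /* alert on low-severity matches at once */
    const char *deferred_alert; /* first low-severity match, if any */
    const char *reported_alert; /* what the scan finally reports */
};

/* Record a match. Returns true if scanning should stop immediately. */
bool demo_on_match(struct demo_scan_state *st, const char *name,
                   bool low_severity)
{
    if (!low_severity || st->heuristic_precedence) {
        st->reported_alert = name;
        return true;
    }
    if (st->deferred_alert == NULL)
        st->deferred_alert = name; /* remember it and keep scanning */
    return false;
}

/* Called once, after the whole target file has been scanned. */
enum demo_verdict demo_finish(struct demo_scan_state *st)
{
    if (st->reported_alert == NULL && st->deferred_alert != NULL)
        st->reported_alert = st->deferred_alert;
    return st->reported_alert != NULL ? DEMO_VIRUS : DEMO_CLEAN;
}
```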

The current implementation is buggy though. Scanning of archives does
not correctly bail out for the entire archive if one email contains a
phishing link.  Instead, it sets the "heuristic found" flag and then
alerts for every subsequent file in the archive because it doesn't know
if the heuristic was found in an embedded file or the target file.
Because it's just a heuristic and the status is "clean", it keeps
scanning.

This patch corrects the behavior by checking if any low-severity alerts
were found at the end of scanning the target file, instead of at the end
of each embedded file.

Additionally, this patch fixes an issue with phishing alerts wherein
heuristic precedence mode did not cause a scan to stop after the first
alert.

The above changes required restructuring to create an fmap inside of
cl_scandesc_callback() so that scan_common() could be modified to
require an fmap and set up so that the current *ctx->fmap pointer is
never NULL when scan_common() evaluates match results.

Also fixed a couple minor bugs in the phishing unit tests and cleaned up
the test code for improved legibility and type safety.
2020-06-03 17:20:35 -04:00
Micah Snyder
d1f209e879 Fix fmap NULL deref in preclass bytecode hook
If using a bytecode signature that makes use of the BC_PRECLASS hook and
if it alerts, a NULL dereference may occur.  This change fixes that.

Also fixed unrelated memory leaks introduced recently when adding file
name extraction to the zip parser and rar parser.
2020-06-03 11:00:53 -04:00
Andrew
ead920501b Fix fmap leak in scan_common when map parameter is NULL
scan_common must either be passed an fmap (map) or a file
descriptor (desc) corresponding to the file being scanned.
In the case where map is NULL, scan_common will create an
fmap in order to execute the BC_PRECLASS bytecode hook, and
this fmap wasn't being unmapped afterward
2020-06-03 11:00:53 -04:00
Micah Snyder
e0dae24fcc Fix dupl. fmap name bug, fix fd init in HTML norm
Fixed copypaste bug with duplicated fmap names being assigned to the
parent instead of the dup/child fmap.

Fixed file descriptor initialization issue in the HTML normalizer.
2020-06-03 11:00:53 -04:00
Micah Snyder
c110392780 Change permission for new tmp files from RWX to RW 2020-06-03 11:00:53 -04:00
Micah Snyder
11ef77007b Improve tmp sub-directory names
At present many parsers create tmp subdirectories to store extracted
files.  For parsers like the vba parser, this is required as the
directory is later scanned.  For other parsers, these subdirectories are
probably not helpful now that we provide recursive sub-dirs when
--leave-temps is enabled.  It's not quite as simple as removing the extra
subdirectories, however.  Certain parsers, like autoit, don't create very
unique filenames and would result in file name collisions when
--leave-temps is not enabled.

The best thing to do would be to make sure each parser uses unique
filenames and doesn't rely on cli_magic_scan_dir() to scan extracted
content before removing the extra subdirectory.  In the meantime, this
commit gives the extra subdirectories meaningful names to improve
readability.

This commit also:

- Provides the 'bmp' prefix for extracted PE icons.

- Removes empty tmp subdirs when extracting rtf files, to eliminate
  clutter.

- The PDF parser sometimes creates tmp files when decompressing streams
  before it knows if there is actually any content to decompress.  This
  resulted in a large number of empty files.  While it would be best to
  avoid creating empty files in the first place, that's not quite as easy
  as it sounds.  This commit does the next best thing and deletes the
  tmp files if nothing was actually extracted, even if --leave-temps is
  enabled.

- Removes the "scantemp" prefix for unnamed fmaps scanned with
  cli_magic_scan().  The 5-character hashes given to tmp files with
  prefixes resulted in occasional file name collisions when extracting
  certain file types with thousands of embedded files.

- The VBA and TAR parsers mistakenly used NAME_MAX instead of PATH_MAX,
  resulting in truncated file paths and failed extraction when
  --leave-temps is enabled and a lot of recursion is in play.  This commit
  switches them from NAME_MAX to PATH_MAX.
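
A sketch of building a path with explicit truncation detection, assuming a buffer sized with PATH_MAX. `demo_join_path()` is a hypothetical helper, not a libclamav function.

```c
#include <limits.h>
#include <stdio.h>

#ifndef PATH_MAX
#define PATH_MAX 4096 /* fallback for platforms that do not define it */
#endif

/*
 * Sketch: join a directory and a file name into `out`, reporting truncation
 * instead of silently producing a shortened path.
 * Returns 0 on success, -1 if the result would not fit in out_size bytes.
 */
int demo_join_path(char *out, size_t out_size, const char *dir,
                   const char *name)
{
    int n = snprintf(out, out_size, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= out_size)
        return -1; /* encoding error or truncated output */
    return 0;
}
```

A caller would declare the destination as `char path[PATH_MAX]` rather than `char path[NAME_MAX]`, since NAME_MAX only bounds a single path component.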
2020-06-03 11:00:53 -04:00
Micah Snyder
c545cad161 Only create rfc2397 tmp directory when needed
HTML normalization creates a tmp directory for storing rfc2397 style
links.  The vast majority of html does not make use of rfc2397 and thus
an excess of empty tmp directories are generated.  This commit alters
behavior to only create the rfc2397 directory when required if it does
not already exist.
2020-06-03 11:00:53 -04:00
Micah Snyder
9b9999d778 Rename core scanning functions
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned whether it's "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.

| current name              | new name                          |
| ------------------------- | --------------------------------- |
| magic_scandesc()          | cli_magic_scan()                  |
| cli_magic_scandesc_type() | <delete>                          |
| cli_magic_scandesc()      | cli_magic_scan_desc()             |
| cli_base_scandesc()       | cli_magic_scan_desc_type()        |
| cli_partition_scandesc()  | <delete>                          |
| cli_map_scandesc()        | magic_scan_nested_fmap_type()     |
| cli_map_scan()            | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc()        | cli_magic_scan_buff()             |
| cli_scanbuff()            | cli_scan_buff()                   |
| cli_scandesc()            | cli_scan_desc()                   |
| cli_fmap_scandesc()       | cli_scan_fmap()                   |
| cli_scanfile()            | cli_magic_scan_file()             |
| cli_scandir()             | cli_magic_scan_dir()              |
| cli_filetype2()           | cli_determine_fmap_type()         |
| cli_filetype()            | cli_compare_ftm_file()            |
| cli_partitiontype()       | cli_compare_ftm_partition()       |
| cli_scanraw()             | scanraw()                         |
2020-06-03 11:00:40 -04:00
Micah Snyder
ae77e87880 Add EmbeddedObjects to JSON
The metadata properties JSON structure isn't recording file types found
embedded within a file such as self-extracting (SFX) types and office
document types (DOCX, PPTX, etc).  This presents a problem...

At present there's no way to know if the current file has ended and a
new file is found tacked on to the end of the first file.  If there
were, we could simply check if the type found by the raw-scan exists
within the first file, or after.

If within the first, and the type is an archive then it's reasonable to
conclude we're either observing zip headers (for SFXZIP detections) or
other files that are not compressed.

If the type ISN'T found within the first file, then we definitely have
a whole new file to parse and we should do so with cli_magic_scan()
rather than only using these embedded type scanners.

At present we can't ignore SFXZIP detections even if the original file
type is a ZIP because we may have found two ZIPs appended together to
evade detection (a legitimate trick).  As a consequence, we will
effectively parse every zip entry twice.  The same issue applies to
types found within non-compressed archives.

This commit adds an EmbeddedObjects list to the metadata JSON object so
that the existence of these types is noted.

Additionally, this commit removes the two-part int64 cli_jsonint64()
implementation as json_object_new_int64() should be available
everywhere and the macro to detect such support was never set.
2020-06-03 10:39:18 -04:00
Micah Snyder
005cbf5a37 Record names of extracted files
A way is needed to record scanned file names for two purposes:

1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.

2. File names are useful when interpreting tmp directory output when
using the --leave-temps option.

This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.

To store the names in fmaps, an optional name argument has been added to
any internal scan API's that create fmaps and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required.  The zip and gpt parsers required some modification to record
file names.  The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.

Also:

- Added recursive extraction to the tmp directory when the
  --leave-temps option is enabled.  When not enabled, the tmp directory
  structure remains flat so as to prevent the likelihood of exceeding
  MAX_PATH.  The current tmp directory is stored in the scan context.

- Made the cli_scanfile() internal API non-static and added it to
  scanners.h so it would be accessible outside of scanners.c in order to
  remove code duplication within libmspack.c.

- Added function comments to scanners.h and matcher.h

- Converted TDB-type macros and LSIG-type macros to enums for improved
  type safety.

- Converted more return status variables from `int` to `cl_error_t` for
  improved type safety, and corrected ooxml file typing functions so
  they use `cli_file_t` exclusively rather than mixing types with
  `cl_error_t`.

- Restructured the magic_scandesc() function to use goto's for error
  handling and removed the early_ret_from_magicscan() macro and
  magic_scandesc_cleanup() function.  This makes the code easier to
  read and made it easier to add the recursive tmp directory cleanup to
  magic_scandesc().

- Corrected zip, egg, rar filename extraction issues.

- Removed use of extra sub-directory layer for zip, egg, and rar file
  extraction.  For Zip, this also involved changing the extracted
  filenames to be randomly generated rather than using the "zip.###"
  file name scheme.
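
The goto-based error handling mentioned above can be sketched as follows.
This is an illustration of the general single-exit cleanup pattern, not
the actual magic_scandesc() code; the `sketch_` names and status values
are hypothetical (libclamav itself uses `cl_error_t`).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes for illustration only. */
typedef enum {
    SKETCH_OK = 0,
    SKETCH_EOPEN,
    SKETCH_EMEM,
    SKETCH_EREAD
} sketch_status_t;

/* Reads up to 16 bytes from path. Every failure jumps to the single
 * "done" label, so each resource is released in exactly one place. */
sketch_status_t sketch_scan_file(const char *path)
{
    sketch_status_t status = SKETCH_EOPEN;
    FILE *fp               = NULL;
    unsigned char *buf     = NULL;
    size_t nread;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        goto done;

    buf = malloc(16);
    if (buf == NULL) {
        status = SKETCH_EMEM;
        goto done;
    }

    nread = fread(buf, 1, 16, fp);
    if (nread == 0 && ferror(fp)) {
        status = SKETCH_EREAD;
        goto done;
    }

    status = SKETCH_OK;

done:
    free(buf); /* free(NULL) is a no-op */
    if (fp != NULL)
        fclose(fp);
    return status;
}
```

Adding another cleanup step, such as removing a tmp directory, then only
requires one more line after the `done` label.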
2020-06-03 10:39:18 -04:00
Micah Snyder
9f2de39e04 New tmp sub-dir per scan; JSON meta improvements
This commit improves the layout of the tmp file output and the JSON
metadata output when using the --leave-temps and --gen-json options.

For all scans, each scan target will get a unique tmp sub-directory. If
using --leave-temps, that subdir will include the basename of the
original file to make it easier to identify. Additionally, when using
the --leave-temps option, all extracted objects are written to
recursive subdirectories, with filename prefixes where available. When
not using the --leave-temps option, the layout of the tmp sub-directory
remains flat to reduce the likelihood of exceeding PATH_MAX.

The JSON metadata generated by the --gen-json option is now generated
for all file types, not just a select few. The format is also
pretty-printed for readability and now includes filenames and file paths
when available.

Also:

- Added missing ALLMATCH check when determining if bytecode hooks should
be run.

- Added cl_engine_get_str API to windows libclamav symbol export file.
2020-06-03 10:38:17 -04:00
Mickey Sola
ffd0f1357f png - fixup PR based on feedback 2020-05-08 13:29:52 -04:00
Aldo Mazzeo
f366b7c703 Transforming the PNG checker into a PNG exploit seeker 2020-05-08 13:24:25 -04:00
Jonas Zaddach (jzaddach)
d5a733ef90 XLM (Excel 4.0) macro detection and extraction
XLM is a macro language in Excel that was used before VBA (before
1996). It is still parsed and executed by modern Excel and is gaining
popularity with malware authors.

This patch adds rudimentary support for detecting and extracting
Excel 4.0 (XLM) macros.

The code is based on Didier Stevens' plugin_biff for oledump.py.
2020-04-29 14:19:41 -07:00
Mickey Sola
c6f6b9e67b dlp - clang-format'd 2020-04-29 13:55:25 -07:00
John Schember
a6a355629d Add DLP feature to detect credit cards only
Add Data-Loss-Prevention option to detect credit cards only, excluding
debit and private label cards where possible.

You can select the credit card-only DLP mode for clamscan with the
`--structured-cc-mode` command-line option.

You can select the credit card-only DLP mode for clamd with the
`StructuredCCOnly` clamd.conf config option.

This patch also adds credit card matching for additional vendors:
- Mastercard 2016
- China Union Pay
- Discover 2009
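
For context, card-number matching of this kind typically pairs an issuer
prefix check with the Luhn checksum. The sketch below is not the code
from this patch. It assumes that "Mastercard 2016" refers to the
Mastercard 2-series BIN range (222100 to 272099), and the function names
are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the digit string passes the Luhn checksum, 0 otherwise. */
int luhn_valid(const char *digits)
{
    size_t len;
    int sum = 0;
    int dbl = 0; /* double every second digit, counting from the right */

    assert(digits != NULL);
    len = strlen(digits);
    if (len == 0)
        return 0;

    for (size_t i = len; i > 0; i--) {
        unsigned char c = (unsigned char)digits[i - 1];
        int d;

        if (!isdigit(c))
            return 0;
        d = c - '0';
        if (dbl) {
            d *= 2;
            if (d > 9)
                d -= 9;
        }
        sum += d;
        dbl = !dbl;
    }
    return sum % 10 == 0;
}

/* Assumption: 16 digits whose first six fall in 222100..272099. */
int is_mastercard_2series(const char *digits)
{
    int bin = 0;

    assert(digits != NULL);
    if (strlen(digits) != 16)
        return 0;

    for (int i = 0; i < 6; i++) {
        if (!isdigit((unsigned char)digits[i]))
            return 0;
        bin = bin * 10 + (digits[i] - '0');
    }
    return bin >= 222100 && bin <= 272099;
}
```

A candidate number would be reported only when both the prefix check and
`luhn_valid()` succeed.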
2020-04-29 13:55:25 -07:00
Jonas Zaddach (jzaddach)
b7f8440965 Modernize VBA code extraction from Microsoft Office files
- Existing VBA extraction code uses undocumented cache structures.
  This code uses the documented way of accessing VBA projects.
- Adds additional detail to the dumped information:
  Project name, Project doc string, ...
  All VBA projects are dumped into a single file.
- Malware authors are currently evading detection by spreading
  malicious code over several projects. It is hard to write
  signatures if only part of the malicious code is visible.
2020-04-28 13:32:07 -07:00
Mickey Sola
f55acb436e gif - update PR based on feedback; add dconf option for gif scanning 2020-04-23 10:48:08 -07:00
Mickey Sola
a25d48d7fa gif - clang formatted; copyright dates fixed 2020-04-23 10:48:08 -07:00
Aldo Mazzeo
153a87a74b Making the GIF parser more tolerant and supporting GIF overlays 2020-04-23 10:48:07 -07:00
Micah Snyder
f5a2584609 libclamav: Fixes scanning of embedded fmaps
Specifically this fixes use of cli_map_scandesc().

The cli_map_scandesc() function used to override the current fmap
settings with a new size and offset, performing a scan of the embedded
content.  This broke the ability to iterate backwards through the fmap
recursion array when an alert occurs to check each map's hash for
whitelist matches.

In order to fix this issue, it needed to be possible to duplicate an
fmap header for the scan of the embedded file without duplicating the
actual map/data.  This wasn't feasible with the posix fmap handle
implementation where the fmap header, bitmap array, and memory map
were all contiguous.  This commit makes it possible by extracting the
fmap header and bitmap array from the mmap region, using instead a
pointer for both the bitmap array and mmap/data.  As a result, the
posix fmap handle implementation now works more like the existing
Windows implementation.

In addition to the above changes, this commit fixes:
- fmap recursion tracking for cli_scandesc()
- a recursion tracking issue in cli_scanembpe() error handling
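
The header/data split described above can be sketched as follows. This
is a simplified, hypothetical stand-in and not the real fmap structure:
once the header holds a pointer to the data instead of being contiguous
with it, a view of embedded content can copy the header and adjust
offset and length while the parent header stays untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an fmap header. */
typedef struct sketch_map {
    const unsigned char *data; /* shared with the parent, never copied */
    size_t offset;             /* start of this view within data */
    size_t len;                /* length of this view */
} sketch_map_t;

/* Duplicates only the header for a sub-range of parent.
 * Returns NULL if the range is out of bounds or allocation fails.
 * The caller frees the returned header with free(). */
sketch_map_t *sketch_map_view(const sketch_map_t *parent, size_t offset,
                              size_t len)
{
    sketch_map_t *view;

    if (parent == NULL || offset > parent->len || len > parent->len - offset)
        return NULL;

    view = malloc(sizeof(*view));
    if (view == NULL)
        return NULL;

    *view        = *parent; /* header copy; data pointer is shared */
    view->offset = parent->offset + offset;
    view->len    = len;
    return view;
}

/* Pointer to the first byte of the view. */
const unsigned char *sketch_map_ptr(const sketch_map_t *m)
{
    assert(m != NULL);
    return m->data + m->offset;
}
```

Because the parent header is never modified, a walk back through the
recursion array still sees each map's original size and offset.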
2020-04-20 11:26:43 -07:00
Micah Snyder
cbe2cba4d1 libclamav: Generate hash for each new fmap
Signature alerts on content extracted into a new fmap such as normalized
HTML resulted in checking FP signatures against the fmap's hash value
that was initialized to all zeroes, and never computed.

This patch will enable FP signatures of normalized HTML files or
other content that is extracted to a new fmap to work.  This patch
doesn't resolve the issue that normal people will write FP signatures
targeting the original file, not the normalized file and thus won't
really see benefit from this bug-fix.

Additional work is needed to traverse the fmap recursion lists and
FP-check all parent fmaps when an alert occurs.  In addition, the HTML
normalization method of temporarily overriding the ctx->fmap instead of
increasing the recursion depth and doing ctx->fmap++/-- will need to be
corrected for fmap reverse recursion traversal to work.
2020-04-20 11:26:43 -07:00
Jonas Zaddach
d9db7cd2e2 libclamav: Support for HFS+ compressed files
ClamAV doesn't handle the compressed attribute for HFS+ file catalog
entries.

This patch adds support for FLATE compressed files.

To accomplish this, we had to find and parse the root/header node
of the attributes file, if one exists. Then, parse the attribute map
to check if the compressed attribute exists. If compressed, parse the
compression header to determine how to decompress it. Support is
included for both inline compressed files as well as compressed
resource forks. 

Inflating inline compressed files is straightforward.

Inflating a compressed resource fork requires more work: 
- Find location and size of the resource.
- Parse the resource block table.
- Inflate and write each block to a temporary file to be scanned.

Additional changes needed for this work:
- Make hfsplus_fetch_node work for both catalog and attributes.
- Figure out node size.
- Handle nodes that span several blocks.
- If the attributes are missing, or invalid, extraction continues.
  This behavior is to support malformed files which would also
  extract on macOS and perhaps other systems.

This patch also:
- Adds filename extraction for the hfs+ parser.
- Skips embedded file type detection for GPT image file types. This
  prevents double extraction of embedded files, or misclassification
  of GPT images as MHTML, for example. This resolves bb12335.
2020-04-13 08:33:18 -07:00
Mickey Sola
9565c92b87 scanners - uncomment utf8 BOM check to improve file type identification and add html scan check to account for false negatives caused by the change 2020-02-03 09:20:23 -08:00
Mickey Sola
5d411c68fb bb12461 - error out properly when pdf parser fails to allocate a map; normalize/sanitize user supplied filename and comment info when parsing arj headers; add better bound checking and error handling to arj header parsers 2020-01-31 11:52:00 -08:00
Mickey Sola
3ac3cf17f2 bb12332 - fix segfault when scanning bz2 compressed iso9660s by limiting total page prefaulting to the 4GB max 2020-01-30 09:15:44 -08:00