dependency-track/docs/_docs/getting-started/monitoring.md

362 lines
19 KiB
Markdown
Raw Normal View History

---
title: Monitoring
category: Getting Started
chapter: 1
order: 13
---
### Health
Starting with v4.8.0, Dependency-Track exposes health information according to the [MicroProfile Health] specification.
Refer to the specification for details on how the exposed endpoints behave (i.e. [MicroProfile Health REST interfaces specifications]).
Currently, only a single [readiness check] is included. The *database* check verifies that database connections can be
acquired and used successfully. The check spans both connection pools (see [Connection Pooling]).
```json
{
"status": "UP",
"checks": [
{
"name": "database",
"status": "UP",
"data": {
"nontx_connection_pool": "UP",
"tx_connection_pool": "UP"
}
}
]
}
```
### Logging
Logging of the API server is configured via [Logback]. All distributions of the API server ship with
a [default Logback configuration]. It defines the following behavior:
1. Log messages from the embedded Jetty server to:
* `$HOME/.dependency-track/server.<NUMBER>.log`
2. Log messages from Dependency-Track and the underlying Alpine framework to:
* `$HOME/.dependency-track/dependency-track.<NUMBER>.log`
* Standard Output
3. Log security-related messages to:
* `$HOME/.dependency-track/dependency-track-audit.<NUMBER>.log`
* Standard Output
4. For log files:
* Create a new log file once the current one exceeds 10MB in size
* Retain a history of up to 9 files per log before overwriting them
5. Output logs in a human-friendly format
> For containerized deployments, `$HOME` will refer to the `/data` directory.
#### Custom Logging Configuration
When operating Dependency-Track in container-centric environments, where logs are typically forwarded
from containers' standard output to a centralized log aggregator (e.g. ElasticSearch, OpenSearch, Splunk),
it is desirable to disable logging to disk, and even change the output to a more machine-readable format.
Starting with Dependency-Track v4.9.0, it is possible to provide a custom Logback configuration,
and configure JSON as output format (powered by [logstash-logback-encoder]).
An example configuration file for JSON logging to standard output ([`logback-json.xml`]) is included
in the API server container image, and can be enabled using the `LOGGING_CONFIG_PATH` environment variable:
```shell
# (Other configuration options omitted for brevity)
docker run -it --rm \
-e "LOGGING_CONFIG_PATH=logback-json.xml" \
dependencytrack/apiserver:latest
```
Refer to the [logstash-logback-encoder documentation] for advanced customization details.
In order to use a truly custom configuration file, it has to be mounted into the container, e.g.:
```shell
# (Other configuration options omitted for brevity)
docker run -it --rm \
-v "./path/to/logback-custom.xml:/etc/dtrack/logback-custom.xml:ro" \
-e "LOGGING_CONFIG_PATH=/etc/dtrack/logback-custom.xml" \
dependencytrack/apiserver:latest
```
For non-containerized distributions of the API server, a custom configuration file may be provided
via the `logback.configurationFile` JVM property:
```shell
# (Other configuration options omitted for brevity)
java -Dlogback.configurationFile=/path/to/logback-custom.xml \
-jar dependency-track-apiserver.jar
```
### Metrics
The API server can be configured to expose system metrics via the Prometheus [text-based exposition format].
They can then be collected and visualized using tools like [Prometheus] and [Grafana]. Especially for containerized
deployments where directly attaching to the underlying Java Virtual Machine (JVM) is not possible, monitoring
system metrics via Prometheus is crucial for observability.
> System metrics are not the same as portfolio metrics exposed via `/api/v1/metrics` REST API or
> the web UI. The metrics described here are of technical nature and meant for monitoring
> the application itself, not the data managed by it. If exposition of portfolio statistics via Prometheus is desired,
> refer to [community integrations] like Jetstack's [dependency-track-exporter].
To enable metrics exposition, set the `alpine.metrics.enabled` property to `true` (see [Configuration]).
Metrics will be exposed in the `/metrics` endpoint, and can optionally be protected using
basic authentication via `alpine.metrics.auth.username` and `alpine.metrics.auth.password`.
#### Exposed Metrics
Exposed metrics include various general purpose system and JVM statistics (CPU and Heap usage, thread states,
garbage collector activity etc.), but also some related to Dependency-Track's internal event and notification system.
More metrics covering other areas of Dependency-Track will be added in future versions.
##### Database
Metrics of the ORM used by the Dependency-Track API server are exposed under the `datanucleus` namespace.
They provide a high-level overview of how many, and which kind of persistence operations are performed:
```yaml
# HELP datanucleus_transactions_rolledback_total Total number of rolled-back transactions
# TYPE datanucleus_transactions_rolledback_total counter
datanucleus_transactions_rolledback_total 0.0
# HELP datanucleus_queries_failed_total Total number of queries that completed with an error
# TYPE datanucleus_queries_failed_total counter
datanucleus_queries_failed_total 0.0
# HELP datanucleus_query_execution_time_ms_avg Average query execution time in milliseconds
# TYPE datanucleus_query_execution_time_ms_avg gauge
datanucleus_query_execution_time_ms_avg 0.0
# HELP datanucleus_transaction_execution_time_ms_avg Average transaction execution time in milliseconds
# TYPE datanucleus_transaction_execution_time_ms_avg gauge
datanucleus_transaction_execution_time_ms_avg 77.0
# HELP datanucleus_datastore_reads_total Total number of read operations from the datastore
# TYPE datanucleus_datastore_reads_total counter
datanucleus_datastore_reads_total 5650.0
# HELP datanucleus_datastore_writes_total Total number of write operations to the datastore
# TYPE datanucleus_datastore_writes_total counter
datanucleus_datastore_writes_total 1045.0
# HELP datanucleus_object_deletes_total Total number of objects deleted from the datastore
# TYPE datanucleus_object_deletes_total counter
datanucleus_object_deletes_total 0.0
# HELP datanucleus_transactions_total Total number of transactions
# TYPE datanucleus_transactions_total counter
datanucleus_transactions_total 1107.0
# HELP datanucleus_queries_active Number of currently active queries
# TYPE datanucleus_queries_active gauge
datanucleus_queries_active 0.0
# HELP datanucleus_queries_executed_total Total number of executed queries
# TYPE datanucleus_queries_executed_total counter
datanucleus_queries_executed_total 4095.0
# HELP datanucleus_connections_active Number of currently active managed datastore connections
# TYPE datanucleus_connections_active gauge
datanucleus_connections_active 0.0
# HELP datanucleus_object_inserts_total Total number of objects inserted into the datastore
# TYPE datanucleus_object_inserts_total counter
datanucleus_object_inserts_total 6.0
# HELP datanucleus_object_fetches_total Total number of objects fetched from the datastore
# TYPE datanucleus_object_fetches_total counter
datanucleus_object_fetches_total 981.0
# HELP datanucleus_transactions_active_total Number of currently active transactions
# TYPE datanucleus_transactions_active_total counter
datanucleus_transactions_active_total 0.0
# HELP datanucleus_object_updates_total Total number of objects updated in the datastore
# TYPE datanucleus_object_updates_total counter
datanucleus_object_updates_total 1039.0
# HELP datanucleus_transactions_committed_total Total number of committed transactions
# TYPE datanucleus_transactions_committed_total counter
datanucleus_transactions_committed_total 1107.0
```
Additionally, metrics about the database connection pools are exposed under the `hikaricp` namespace.
Monitoring these metrics is essential for tweaking the connection pool configuration (see [Connection Pooling]):
```yaml
# HELP hikaricp_connections Total connections
# TYPE hikaricp_connections gauge
hikaricp_connections{pool="non-transactional",} 13.0
hikaricp_connections{pool="transactional",} 12.0
# HELP hikaricp_connections_usage_seconds Connection usage time
# TYPE hikaricp_connections_usage_seconds summary
hikaricp_connections_usage_seconds_count{pool="non-transactional",} 5888.0
hikaricp_connections_usage_seconds_sum{pool="non-transactional",} 60.928
hikaricp_connections_usage_seconds_count{pool="transactional",} 138.0
hikaricp_connections_usage_seconds_sum{pool="transactional",} 0.036
# HELP hikaricp_connections_usage_seconds_max Connection usage time
# TYPE hikaricp_connections_usage_seconds_max gauge
hikaricp_connections_usage_seconds_max{pool="non-transactional",} 1.319
hikaricp_connections_usage_seconds_max{pool="transactional",} 0.007
# HELP hikaricp_connections_min Min connections
# TYPE hikaricp_connections_min gauge
hikaricp_connections_min{pool="non-transactional",} 10.0
hikaricp_connections_min{pool="transactional",} 10.0
# HELP hikaricp_connections_pending Pending threads
# TYPE hikaricp_connections_pending gauge
hikaricp_connections_pending{pool="non-transactional",} 0.0
hikaricp_connections_pending{pool="transactional",} 0.0
# HELP hikaricp_connections_idle Idle connections
# TYPE hikaricp_connections_idle gauge
hikaricp_connections_idle{pool="non-transactional",} 13.0
hikaricp_connections_idle{pool="transactional",} 12.0
# HELP hikaricp_connections_timeout_total Connection timeout total count
# TYPE hikaricp_connections_timeout_total counter
hikaricp_connections_timeout_total{pool="non-transactional",} 0.0
hikaricp_connections_timeout_total{pool="transactional",} 0.0
# HELP hikaricp_connections_creation_seconds_max Connection creation time
# TYPE hikaricp_connections_creation_seconds_max gauge
hikaricp_connections_creation_seconds_max{pool="non-transactional",} 0.0
hikaricp_connections_creation_seconds_max{pool="transactional",} 0.0
# HELP hikaricp_connections_creation_seconds Connection creation time
# TYPE hikaricp_connections_creation_seconds summary
hikaricp_connections_creation_seconds_count{pool="non-transactional",} 12.0
hikaricp_connections_creation_seconds_sum{pool="non-transactional",} 0.0
hikaricp_connections_creation_seconds_count{pool="transactional",} 11.0
hikaricp_connections_creation_seconds_sum{pool="transactional",} 0.0
# HELP hikaricp_connections_active Active connections
# TYPE hikaricp_connections_active gauge
hikaricp_connections_active{pool="non-transactional",} 0.0
hikaricp_connections_active{pool="transactional",} 0.0
# HELP hikaricp_connections_max Max connections
# TYPE hikaricp_connections_max gauge
hikaricp_connections_max{pool="non-transactional",} 20.0
hikaricp_connections_max{pool="transactional",} 20.0
# HELP hikaricp_connections_acquire_seconds Connection acquire time
# TYPE hikaricp_connections_acquire_seconds summary
hikaricp_connections_acquire_seconds_count{pool="non-transactional",} 5888.0
hikaricp_connections_acquire_seconds_sum{pool="non-transactional",} 0.009996981
hikaricp_connections_acquire_seconds_count{pool="transactional",} 138.0
hikaricp_connections_acquire_seconds_sum{pool="transactional",} 4.68092E-4
# HELP hikaricp_connections_acquire_seconds_max Connection acquire time
# TYPE hikaricp_connections_acquire_seconds_max gauge
hikaricp_connections_acquire_seconds_max{pool="non-transactional",} 1.41889E-4
hikaricp_connections_acquire_seconds_max{pool="transactional",} 1.77837E-4
```
##### Event and Notification System
Various improvements for Snyk analyzer (#2246) * Add parsing logic for Snyk API errors Also move tests for SnykParser into their own class instead of keeping them in SnykAnalysisTaskTest. Signed-off-by: nscuro <nscuro@protonmail.com> * Use the actually useful error fields in Snyk responses Signed-off-by: nscuro <nscuro@protonmail.com> * Improve Snyk analyzer; Add tests; Fix various bugs Signed-off-by: nscuro <nscuro@protonmail.com> * Reword Snyk rate limiting config keys Signed-off-by: nscuro <nscuro@protonmail.com> * Fix SnykParserTest Signed-off-by: nscuro <nscuro@protonmail.com> * Use retries instead of client-side rate limiting when rate limited by the Snyk API Addresses #2248 Signed-off-by: nscuro <nscuro@protonmail.com> * Disable implicit retry behavior on all exceptions Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk config keys documentation Signed-off-by: nscuro <nscuro@protonmail.com> * Report sunset API version only once per analysis Also send a notification instead of just logging it Signed-off-by: nscuro <nscuro@protonmail.com> * Add ability to use multiple Snyk tokens in round-robin Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk docs Signed-off-by: nscuro <nscuro@protonmail.com> * Update default Snyk API version to 2022-11-14 Signed-off-by: nscuro <nscuro@protonmail.com> * Fix visibility of index field Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk configuration screenshot Signed-off-by: nscuro <nscuro@protonmail.com> Signed-off-by: nscuro <nscuro@protonmail.com>
2022-12-09 10:09:15 +01:00
Event and notification metrics include the following:
```yaml
# HELP alpine_events_published_total Total number of published events
# TYPE alpine_events_published_total counter
alpine_events_published_total{event="<EVENT_CLASS_NAME>",publisher="<PUBLISHER_CLASS_NAME>",} 1.0
# HELP alpine_notifications_published_total Total number of published notifications
# TYPE alpine_notifications_published_total counter
alpine_notifications_published_total{group="<NOTIFICATION_GROUP>",level="<NOTIFICATION_LEVEL>",scope="<NOTIFICATION_SCOPE>",} 1.0
# HELP alpine_event_processing_seconds
# TYPE alpine_event_processing_seconds summary
alpine_event_processing_seconds_count{event="<EVENT_NAME>",subscriber="<SUBSCRIBER_NAME>",} 1.0
alpine_event_processing_seconds_sum{event="<EVENT_NAME>",subscriber="<SUBSCRIBER_NAME>",} 0.047599797
# HELP alpine_event_processing_seconds_max
# TYPE alpine_event_processing_seconds_max gauge
alpine_event_processing_seconds_max{event="<EVENT_NAME>",subscriber="<SUBSCRIBER_NAME>",} 0.047599797
```
> `alpine_notifications_published_total` will report all notifications, not only those for which an alert has been configured.
Events and notifications are processed by [executors]. The executor *Alpine-EventService* corresponds to what is
typically referred to as *worker pool*, and is responsible for executing the majority of events in Dependency-Track.
*Alpine-SingleThreadedEventService* is a dedicated executor for events that can't safely be executed in parallel.
The *SnykAnalysisTask* executor is used to perform API requests to [Snyk] (if enabled) in parallel, in order to work
around the missing batch functionality in Snyk's REST API. The following executor metrics are available:
```yaml
# HELP executor_pool_max_threads The maximum allowed number of threads in the pool
# TYPE executor_pool_max_threads gauge
executor_pool_max_threads{name="Alpine-NotificationService",} 4.0
executor_pool_max_threads{name="Alpine-SingleThreadedEventService",} 1.0
executor_pool_max_threads{name="Alpine-EventService",} 40.0
executor_pool_max_threads{name="SnykAnalysisTask",} 10.0
# HELP executor_pool_core_threads The core number of threads for the pool
# TYPE executor_pool_core_threads gauge
executor_pool_core_threads{name="Alpine-NotificationService",} 4.0
executor_pool_core_threads{name="Alpine-SingleThreadedEventService",} 1.0
executor_pool_core_threads{name="Alpine-EventService",} 40.0
executor_pool_core_threads{name="SnykAnalysisTask",} 10.0
# HELP executor_pool_size_threads The current number of threads in the pool
# TYPE executor_pool_size_threads gauge
executor_pool_size_threads{name="Alpine-NotificationService",} 0.0
executor_pool_size_threads{name="Alpine-SingleThreadedEventService",} 1.0
executor_pool_size_threads{name="Alpine-EventService",} 7.0
executor_pool_size_threads{name="SnykAnalysisTask",} 10.0
# HELP executor_active_threads The approximate number of threads that are actively executing tasks
# TYPE executor_active_threads gauge
executor_active_threads{name="Alpine-NotificationService",} 0.0
executor_active_threads{name="Alpine-SingleThreadedEventService",} 1.0
executor_active_threads{name="Alpine-EventService",} 2.0
executor_active_threads{name="SnykAnalysisTask",} 0.0
# HELP executor_completed_tasks_total The approximate total number of tasks that have completed execution
# TYPE executor_completed_tasks_total counter
executor_completed_tasks_total{name="Alpine-NotificationService",} 0.0
executor_completed_tasks_total{name="Alpine-SingleThreadedEventService",} 0.0
executor_completed_tasks_total{name="Alpine-EventService",} 5.0
executor_completed_tasks_total{name="SnykAnalysisTask",} 132.0
# HELP executor_queued_tasks The approximate number of tasks that are queued for execution
# TYPE executor_queued_tasks gauge
executor_queued_tasks{name="Alpine-NotificationService",} 0.0
executor_queued_tasks{name="Alpine-SingleThreadedEventService",} 160269.0
executor_queued_tasks{name="Alpine-EventService",} 0.0
executor_queued_tasks{name="SnykAnalysisTask",} 0.0
# HELP executor_queue_remaining_tasks The number of additional elements that this queue can ideally accept without blocking
# TYPE executor_queue_remaining_tasks gauge
executor_queue_remaining_tasks{name="Alpine-NotificationService",} 2.147483647E9
executor_queue_remaining_tasks{name="Alpine-SingleThreadedEventService",} 2.147323378E9
executor_queue_remaining_tasks{name="Alpine-EventService",} 2.147483647E9
executor_queue_remaining_tasks{name="SnykAnalysisTask",} 2.147483647E9
```
Executor metrics are a good way to monitor how busy an API server instance is, and how good of a job it's
doing keeping up with the work it's being exposed to. For example, a constantly maxed-out `executor_active_threads`
value combined with a high number of `executor_queued_tasks` may indicate that the configured `alpine.worker.pool.size`
is too small for the workload at hand.
##### Retries
Various improvements for Snyk analyzer (#2246) * Add parsing logic for Snyk API errors Also move tests for SnykParser into their own class instead of keeping them in SnykAnalysisTaskTest. Signed-off-by: nscuro <nscuro@protonmail.com> * Use the actually useful error fields in Snyk responses Signed-off-by: nscuro <nscuro@protonmail.com> * Improve Snyk analyzer; Add tests; Fix various bugs Signed-off-by: nscuro <nscuro@protonmail.com> * Reword Snyk rate limiting config keys Signed-off-by: nscuro <nscuro@protonmail.com> * Fix SnykParserTest Signed-off-by: nscuro <nscuro@protonmail.com> * Use retries instead of client-side rate limiting when rate limited by the Snyk API Addresses #2248 Signed-off-by: nscuro <nscuro@protonmail.com> * Disable implicit retry behavior on all exceptions Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk config keys documentation Signed-off-by: nscuro <nscuro@protonmail.com> * Report sunset API version only once per analysis Also send a notification instead of just logging it Signed-off-by: nscuro <nscuro@protonmail.com> * Add ability to use multiple Snyk tokens in round-robin Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk docs Signed-off-by: nscuro <nscuro@protonmail.com> * Update default Snyk API version to 2022-11-14 Signed-off-by: nscuro <nscuro@protonmail.com> * Fix visibility of index field Signed-off-by: nscuro <nscuro@protonmail.com> * Update Snyk configuration screenshot Signed-off-by: nscuro <nscuro@protonmail.com> Signed-off-by: nscuro <nscuro@protonmail.com>
2022-12-09 10:09:15 +01:00
Dependency-Track will occasionally retry requests to external services. Metrics about this behavior are
exposed in the following format:
```
resilience4j_retry_calls_total{kind="successful_with_retry",name="snyk-api",} 42.0
resilience4j_retry_calls_total{kind="failed_without_retry",name="snyk-api",} 0.0
resilience4j_retry_calls_total{kind="failed_with_retry",name="snyk-api",} 0.0
resilience4j_retry_calls_total{kind="successful_without_retry",name="snyk-api",} 9014.0
```
Where `name` describes the remote endpoint that Dependency-Track uses retries for.
#### Grafana Dashboard
Because [Micrometer](https://micrometer.io/) is used to collect and expose metrics, common Grafana dashboards for
Micrometer should just work.
An [example dashboard] is provided as a quickstart. Refer to the [Grafana documentation] for instructions on how to import it.
> The example dashboard is meant to be a starting point. Users are strongly encouraged to explore the available metrics
> and build their own dashboards, tailored to their needs. The sample dashboard is not actively maintained by the project
> team, however community contributions are more than welcome.
![System Metrics in Grafana]({{ site.baseurl }}/images/screenshots/monitoring-metrics-system.png)
![Event Metrics in Grafana]({{ site.baseurl }}/images/screenshots/monitoring-metrics-events.png)
[community integrations]: {{ site.baseurl }}{% link _docs/integrations/community-integrations.md %}
[Configuration]: {{ site.baseurl }}{% link _docs/getting-started/configuration.md %}
[Connection Pooling]: {{ site.baseurl }}{% link _docs/getting-started/database-support.md %}#connection-pooling
[default Logback configuration]: https://github.com/DependencyTrack/dependency-track/blob/master/src/main/docker/logback.xml
[dependency-track-exporter]: https://github.com/jetstack/dependency-track-exporter
[example dashboard]: {{ site.baseurl }}/files/grafana-dashboard.json
[executors]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ThreadPoolExecutor.html
[Grafana]: https://grafana.com/
[Grafana documentation]: https://grafana.com/docs/grafana/latest/dashboards/export-import/#import-dashboard
[Logback]: https://logback.qos.ch/
[`logback-json.xml`]: https://github.com/DependencyTrack/dependency-track/blob/master/src/main/docker/logback-json.xml
[logstash-logback-encoder]: https://github.com/logfellow/logstash-logback-encoder
[logstash-logback-encoder documentation]: https://github.com/logfellow/logstash-logback-encoder/tree/logstash-logback-encoder-7.3#loggingevent-fields
[MicroProfile Health]: https://download.eclipse.org/microprofile/microprofile-health-3.1/microprofile-health-spec-3.1.html
[MicroProfile Health REST interfaces specifications]: https://download.eclipse.org/microprofile/microprofile-health-3.1/microprofile-health-spec-3.1.html#_appendix_a_rest_interfaces_specifications
[Prometheus]: https://prometheus.io/
[readiness check]: https://download.eclipse.org/microprofile/microprofile-health-3.1/microprofile-health-spec-3.1.html#_readiness_check
[Snyk]: {{ site.baseurl }}{% link _docs/datasources/snyk.md %}
[text-based exposition format]: https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format
[thread states]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Thread.State.html