
Monitoring Spark with Prometheus, reloaded

Author Janos Matyas


Monitoring series:
Monitoring Apache Spark with Prometheus
Monitoring multiple federated clusters with Prometheus - the secure way
Application monitoring with Prometheus and Pipeline
Building a cloud cost management system on top of Prometheus
Monitoring Spark with Prometheus, reloaded

At Banzai Cloud we deploy large distributed applications to Kubernetes clusters that we also operate. We don’t enjoy waking up to PagerDuty notifications in the middle of the night, so we try to get ahead of problems by operating these clusters as efficiently as possible. We have out-of-the-box centralized log collection, end-to-end tracing, monitoring and alerts for all the spotguides we support, including applications, Kubernetes clusters and all the infrastructure we deploy.

For monitoring, we chose Prometheus - the de facto monitoring tool for cloud native applications. We’ve already explored some of our more advanced use cases in a series of earlier articles.

However, when monitoring with Prometheus there’s a catch: Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model, but enabling this feature requires a special component called pushgateway.

tl;dr: 🔗︎

There’s no out-of-the-box solution for monitoring Spark with Prometheus. We created one, but we struggled to push the Prometheus sink upstream into the Spark codebase, so we closed the PR and made the sink’s source available as a standalone, independent package.

Apache Spark and Prometheus 🔗︎

Spark supports a few sinks out-of-the-box (Graphite, CSV, Ganglia), but Prometheus is not one of them, so we introduced a new Prometheus sink (here’s the PR and related apache ticket, SPARK-22343).

Long story short: we’ve blogged about the original proposal before, but the community proved unreceptive to adding a new industry-standard monitoring solution. We closed our PR and are now making the Apache Spark Prometheus Sink available the alternative way. Ideally, this would be part of Spark itself (as the other sinks are), but it works fine from the classpath too.

Monitor Apache Spark with Prometheus the alternative way 🔗︎

We have externalized the sink into a separate library, which you can use by either building it yourself, or by taking it from our Maven repository.
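If you build it yourself, note that the library is a Scala project; assuming a standard sbt build (the repository’s README is authoritative for the exact commands), something like the following produces the jar:

git clone https://github.com/banzaicloud/spark-metrics.git
cd spark-metrics
sbt package   # assumption: standard sbt packaging; check the project's README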

Prometheus sink 🔗︎

PrometheusSink is a Spark metrics sink that publishes Spark metrics to Prometheus.

Prerequisites 🔗︎

As previously mentioned, Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model. We need to use this model, since Spark pushes metrics to sinks. In order to enable this feature for Prometheus, a special component called pushgateway must be running.
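For reference, here is one way to run a pushgateway locally and point Prometheus at it. The prom/pushgateway image and port 9091 are the upstream defaults (matching the sink’s default address below); the target address is an assumption you should adjust to your environment:

# run a pushgateway on the default port
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

# prometheus.yml - scrape the pushgateway instead of the Spark application;
# honor_labels keeps the job/instance labels pushed by the sink
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['127.0.0.1:9091']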

How to enable PrometheusSink in Spark 🔗︎

Spark publishes metrics to Sinks listed in the metrics configuration file. The location of the metrics configuration file can be specified for spark-submit as follows:

--conf spark.metrics.conf=<path_to_the_file>/metrics.properties

Add the following lines to the metrics configuration file:

# Enable Prometheus for all instances by class name
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=<prometheus pushgateway protocol> - defaults to http
*.sink.prometheus.pushgateway-address=<prometheus pushgateway address> - defaults to 127.0.0.1:9091
*.sink.prometheus.period=<period> - defaults to 10
*.sink.prometheus.unit=<unit> - defaults to seconds (TimeUnit.SECONDS)
*.sink.prometheus.pushgateway-enable-timestamp=<enable/disable metrics timestamp> - defaults to false
  • pushgateway-address-protocol - the scheme of the URL where the pushgateway service is available
  • pushgateway-address - the host and port where the pushgateway service is available
  • period - how often metrics are pushed to the pushgateway
  • unit - the time unit of that period
  • pushgateway-enable-timestamp - enables sending a timestamp with the metrics pushed to the pushgateway service. This is disabled by default, as not all versions of the pushgateway support timestamps for metrics.
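For illustration, here is a filled-in metrics configuration; the pushgateway address is a hypothetical placeholder, while everything else uses the keys and defaults documented above:

# metrics.properties - example values; adjust the address to your environment
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
*.sink.prometheus.period=30
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false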

spark-submit needs to know the repository from which it downloads the jar containing PrometheusSink:

--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases

Note: this is a Maven repo hosted on GitHub

We also have to specify the spark-metrics package that includes PrometheusSink, and its dependent packages, for spark-submit:

--packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2
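Putting it all together, a complete spark-submit invocation might look like the following; the main class, application jar, and metrics.properties path are hypothetical placeholders:

# the --class and jar values below are placeholders for your own application
spark-submit \
  --conf spark.metrics.conf=/etc/spark/metrics.properties \
  --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases \
  --packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2 \
  --class com.example.MySparkApp \
  my-spark-app.jar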

Package version 🔗︎

The version number of the package is formatted as: com.banzaicloud:spark-metrics_<scala version>:<spark version>-<version>
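For example, the artifact used above, com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0, is built for Scala 2.11 and Spark 2.2.1, and is release 1.0.0 of the sink itself.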

Conclusion 🔗︎

This is not the ideal scenario, but it does the job, and it’s independent of the Spark codebase. At Banzai Cloud we’re still hoping to contribute this sink once the community decides it actually needs it. Meanwhile, you can open new feature requests, use the existing PR, or ask for native Prometheus support through one of our social media channels. As usual, we’re happy to help. All the software we use and create is open source, so we’re always eager to help make open source projects better.

If you’d like to contribute to making this available as an official Spark package, let us know.