Deploying Zeppelin and Spark on Kubernetes using Helm charts

The content of this page hasn't been updated for years and might refer to discontinued products and projects.

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

In the earlier posts of the Zeppelin series listed above, we went into detail about running Zeppelin Spark notebooks on K8s. If you’ve read them, you should already be familiar with the Spark and Zeppelin components, and you’ll know that setting up all the pieces of a running environment is no easy task. Today we want to show you the quickest and easiest way to set up a Zeppelin Server and Spark on K8s, with all the necessary components, using Helm charts.

The Zeppelin Spark chart is an aggregated chart containing the following sub charts:

The flow diagram below shows how these components are deployed via this aggregated Helm chart.

[Figure: Zeppelin flow]

The Zeppelin Server chart also contains all the config files Zeppelin needs (notably all files from the conf directory), wrapped in config maps as you can see below:

configMap.yaml: log4j.properties, log4j_k8s_cluster.properties, shiro.ini.template
interpreter-settings-config.yaml: interpreter.json
zeppelin-site-config.yaml: zeppelin-site.xml

Some configuration properties are replaced with template variables and exposed as parameters in values.yaml, so that they are configurable via the Helm client. Should you need to change another property at deployment time, you only need to replace its value in the corresponding configMap with {{ .Values.configurableProperty }}; you can then refer to it as zeppelin.configurableProperty in the Helm client.
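As a rough illustration of that mechanism, the sketch below shows a config map entry with one value templated out, followed by the matching override in the aggregated chart's values file. The property zeppelin.server.port is a standard Zeppelin setting, but the config map name and the serverPort value name are hypothetical and not taken from the chart.

```yaml
# Hypothetical excerpt of a config map template in the Zeppelin Server subchart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zeppelin-site-config          # hypothetical name
data:
  zeppelin-site.xml: |
    <property>
      <name>zeppelin.server.port</name>
      <!-- serverPort is a hypothetical value name, resolved from values.yaml -->
      <value>{{ .Values.serverPort }}</value>
    </property>
---
# Override passed to the aggregated chart; subchart values live under the
# subchart's name, hence the zeppelin prefix.
zeppelin:
  serverPort: 8080
```

Because Zeppelin is a subchart of the aggregated chart, the same value would be addressed as zeppelin.serverPort from the Helm client.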

In short, the Zeppelin Spark chart lets you configure the following:

basic security with Shiro
centralised logging to Syslog
the storing of notebooks on different Cloud providers: Google Storage, Amazon S3, Azure Blob Storage
the deployment of Spark History Server and the logging of Spark events to Google Storage, Amazon S3, Azure Blob Storage
an ingress to reach Zeppelin UI from outside
preconfigured spark-submit options

Let’s walk through each of these, highlighting related Helm chart parameters.

Basic security with Shiro

Shiro is the default authorization framework for Zeppelin, so you can set up several different Realms for LDAP, Active Directory, etc. As you’ve already seen above, we have a configMap containing shiro.ini.template, with a default configuration using IniRealm, meaning that users and their roles are configurable in conf/shiro.ini under the [users] and [roles] sections. As of now, you can configure an admin username and password.

  zeppelin:
    username: "admin"
    password: "zeppelin"

These are the default values for username and password. The password is stored as a secret (zeppelin-secret.yaml) and picked up at deployment time by an initContainer, which runs a Shiro hasher tool and replaces ADMIN_PASSWORD in the shiro.ini.template file with the hashed password. Pipeline is already integrated with Bank-Vaults, so you’ll be able to use Vault to store your Zeppelin username & password.
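To make the substitution step more concrete, here is a hypothetical sketch of what a config map entry holding shiro.ini.template could look like. Only the file name, the use of IniRealm and the ADMIN_PASSWORD placeholder come from the description above; the config map name, the role name and the [main] settings are assumptions based on how Shiro's IniRealm is usually configured for hashed passwords, not the chart's actual content.

```yaml
# Hypothetical sketch, not the chart's real template.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zeppelin-shiro-example        # hypothetical name
data:
  shiro.ini.template: |
    [main]
    # Let the implicit IniRealm compare logins against hashed passwords.
    passwordMatcher = org.apache.shiro.authc.credential.PasswordMatcher
    iniRealm.credentialsMatcher = $passwordMatcher

    [users]
    # ADMIN_PASSWORD is the placeholder the initContainer replaces
    # with the hash produced by the Shiro hasher tool.
    admin = ADMIN_PASSWORD, admin

    [roles]
    # Hypothetical role granting all permissions.
    admin = *
```

The point of the placeholder approach is that the plain-text password only ever lives in the secret, while the rendered shiro.ini contains its hash.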

Centralised logging to Syslog

There are two log4j files configured: one for Zeppelin Server (log4j.properties) and one to be used by the Spark Driver and Executors (log4j_k8s_cluster.properties). Both are embedded in a config map and, by default, log at INFO level to the console. If you pass the required parameters below, a SyslogAppender will also be created for Zeppelin Server and for the Spark Driver & Executors.

  zeppelin:
    logService:
      host: 10.44.0.12
      zeppelinLogPort: 512
      sparkLogPort: 512
      applicationLogPort: 512

There is also a separate logger for your application-level logs, named application by default. You can change its name, as well as all the other optional parameters:

  zeppelin:
    logService:
      zeppelinLogLevel: INFO
      zeppelinFacility: LOCAL4
      zeppelinLogPattern: "%5p [%d] ({%t} %F[%M]:%L) - %m%n"
      sparkLogLevel: INFO
      sparkFacility: LOCAL4
      sparkLogPattern: "[%p] XXX %c:%L - %m%n"
      applicationLoggerName: application
      applicationLogLevel: INFO
      applicationFacility: LOCAL4
      applicationLogPattern: "[%p] XXX %c:%L - %m%n"
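For orientation, the following is a minimal sketch of how such parameters could be wired into a log4j 1.x configuration inside a config map. The appender classes and property names are standard log4j 1.x; the config map name and the exact .Values paths (including logService.host) are assumptions about the chart's internals and may differ from the real templates.

```yaml
# Hypothetical sketch of a templated log4j.properties entry.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zeppelin-log4j-example        # hypothetical name
data:
  log4j.properties: |
    # Console logging is always on.
    log4j.rootLogger={{ .Values.logService.zeppelinLogLevel }}, stdout{{ if .Values.logService.host }}, syslog{{ end }}
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern={{ .Values.logService.zeppelinLogPattern }}
    {{- if .Values.logService.host }}
    # The Syslog appender is only rendered when a log service host is given.
    log4j.appender.syslog=org.apache.log4j.net.SyslogAppender
    log4j.appender.syslog.SyslogHost={{ .Values.logService.host }}:{{ .Values.logService.zeppelinLogPort }}
    log4j.appender.syslog.Facility={{ .Values.logService.zeppelinFacility }}
    log4j.appender.syslog.layout=org.apache.log4j.PatternLayout
    log4j.appender.syslog.layout.ConversionPattern={{ .Values.logService.zeppelinLogPattern }}
    {{- end }}
```

Guarding the appender with a condition on the host matches the behaviour described above: without the required parameters, only console logging is configured.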

Set up notebook storage on different Cloud providers

Zeppelin can store notebooks on several Cloud storage providers. You can easily configure this by setting storage type and path as illustrated below:

zeppelin.notebookStorage.type - ‘s3’ | ‘azure’ | ‘gs’
zeppelin.notebookStorage.path - bucket name in case of S3 / GS, file share name for Azure.

On Google and Amazon we’re using IAM roles and policies so that you can reach buckets belonging to the same account / project automatically, while on Azure you also have to specify zeppelin.azureStorageAccountName & zeppelin.azureStorageAccessKey.
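As an example, the Azure case could be expressed in a values file roughly as follows. The parameter names are the ones listed above; the file share name, account name and key are placeholders you would replace with your own.

```yaml
# Sketch of notebook storage settings for Azure; all values are placeholders.
zeppelin:
  notebookStorage:
    type: "azure"
    path: "your-file-share-name"              # file share name on Azure
  azureStorageAccountName: "yourStorageAccountName"
  azureStorageAccessKey: "yourStorageAccessKey"
```

For S3 or GS only the type and path entries are needed, since access is granted through IAM roles and policies.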

Deploy Spark History Server

By default, Spark History Server is not enabled. You can enable it by setting historyServer.enabled to true. You must also specify where to log Spark events, both for Spark History Server and for the Spark Driver (the latter is passed via sparkSubmitOptions):

  historyServer:
    enabled: true
  spark:
    spark-hs:
      app:
        logDirectory: "gs://spark-k8-logs/"
  zeppelin:
    sparkSubmitOptions:
      eventLogDirectory: "gs://spark-k8-logs/"

Both log directories, logDirectory and eventLogDirectory, have to reference an existing bucket, using the following format for each Cloud provider:

s3a://yourBucketName/
wasb://your_blob_container_name@your_storage_account_name.blob.core.windows.net/
gs://yourBucketName/

For a more in-depth examination of this subject, read the Spark History Server on Kubernetes post from the series above.

Ingress to reach Zeppelin UI from outside

By default, a Traefik-based Ingress will be created for Zeppelin, so that the Zeppelin UI is available on an external address under the specified baseURL.

Related chart parameters:

  zeppelin:
    pipelineIngress:
      enabled: true
      ingressURL:
      baseURL: /zeppelin
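To give an idea of what these parameters translate to, here is a hedged sketch of an Ingress resource that a Traefik ingress controller could pick up. It is written against the current networking.k8s.io/v1 API rather than whatever API version the chart actually renders, and the resource name, service name and port are assumptions (8080 is Zeppelin's default server port).

```yaml
# Hypothetical rendering; names and port are illustrative only.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zeppelin                          # hypothetical name
  annotations:
    kubernetes.io/ingress.class: traefik  # route through the Traefik controller
spec:
  rules:
    - http:
        paths:
          - path: /zeppelin               # corresponds to baseURL
            pathType: Prefix
            backend:
              service:
                name: zeppelin            # hypothetical service name
                port:
                  number: 8080
```

With a rule like this, the UI would be reachable at the ingress controller's external address followed by /zeppelin.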

Preconfigured spark-submit options

You can configure the properties below in order to change default values:

  zeppelin:
    sparkSubmitOptions:
      k8sNameSpace: default
      sparkDriverCores: 1
      sparkDriverLimitCores: 2
      sparkExecutorCores: 1
      sparkDriverMemory: 4G
      sparkExcutorMemory: 2G
      sparkMetricsConf: /opt/spark/conf/metrics.properties
      dynamicAllocation: true
      shuffleService: true
      shuffleNameSpace: default
      shuffleLabels: app=spark-shuffle-service,spark-version=2.2.0
      DriverImage: banzaicloud/spark-driver-py:v2.2.1-k8s-1.0.30
      ExecutorImage: banzaicloud/spark-executor-py:v2.2.1-k8s-1.0.30
      initContainerImage: banzaicloud/spark-init:v2.2.1-k8s-1.0.30
      resourceStagingServerInt: http://spark-rss:10000
      resourceStagingServerExt: http://spark-rss:10000
      sparkLocalDir: /tmp/spark-local
      eventLogDirectory: ""
      driverServiceAccountName: "spark"
      log4jConfigPath: "file:///var/spark-data/spark-files/log4j_k8s_cluster.properties"
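You only need to list the options you want to change. The sketch below overrides a handful of them in a values file; the key names are copied from the defaults above, while the namespace and the resource sizes are arbitrary example values.

```yaml
# Example overrides; anything not listed keeps its default.
zeppelin:
  sparkSubmitOptions:
    k8sNameSpace: spark-jobs        # hypothetical namespace
    sparkDriverMemory: 8G
    sparkExecutorCores: 2
    dynamicAllocation: false        # run with a fixed number of executors
    shuffleService: false           # external shuffle service not needed then
```

Disabling dynamic allocation and the shuffle service together is a deliberate pairing here, since the external shuffle service exists to support dynamic allocation.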

Minimal config example necessary for launch

If you’re fine with most of the defaults, here is a minimal config example for an up-and-running Zeppelin Server with History Server and everything else you might need on Google Cloud:

cat > zeppelin-params.yaml <<EOF
historyServer:
  enabled: true
spark:
  spark-hs:
    app:
      logDirectory: "gs://spark-k8-logs/"
zeppelin:
  sparkSubmitOptions:
    eventLogDirectory: "gs://spark-k8-logs/"
  notebookStorage:
    type: "gs"
    path: "zeppelin-nb"
EOF

helm install -f zeppelin-params.yaml banzaicloud-stable/zeppelin-spark

Note: the spark-k8-logs and zeppelin-nb buckets have to be created beforehand and must be accessible by project owners.
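The same minimal setup can be adapted to Amazon S3 by switching the storage type and using the s3a:// scheme described earlier. This is a sketch assembled from the parameters in this post; the bucket names are placeholders and the buckets are assumed to exist already.

```yaml
# Minimal values for Amazon S3; bucket names are placeholders.
historyServer:
  enabled: true
spark:
  spark-hs:
    app:
      logDirectory: "s3a://yourBucketName/"
zeppelin:
  sparkSubmitOptions:
    eventLogDirectory: "s3a://yourBucketName/"
  notebookStorage:
    type: "s3"
    path: "yourNotebookBucketName"
```

As in the Google Cloud example, logDirectory and eventLogDirectory point to the same location so that History Server can read the events the Spark Driver writes.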

These Helm charts are the basis of our Zeppelin Spark spotguide, which is meant to further ease the deployment of Spark workloads using Zeppelin. As you have seen, the Zeppelin Spark chart makes it easy to launch Zeppelin, but it is still necessary to manage and monitor the logging infrastructure and the different storage buckets used. We have automated all of this with Pipeline, using our open source operators like the Cloud agnostic PVC operator and the Fluent based Logging operator.

Related resources

CI/CD flow for Zeppelin notebooks
Running Zeppelin Spark notebooks on Kubernetes - deep dive
Running Zeppelin Spark notebooks on Kubernetes