Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes
Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd
In our previous posts in the Zeppelin series above, we went into detail about Zeppelin Spark notebooks on K8s. If you’ve read those, you should already be familiar with the Spark and Zeppelin components, and be aware that it’s no easy task to set up all the pieces of a running environment. Today we want to show you the quickest and easiest way to set up a Zeppelin Server and Spark on K8s, with all the necessary components, using Helm charts.
The Zeppelin Spark chart is an aggregated chart containing the following sub charts:
The flow diagram below shows how these components are deployed via this aggregated Helm chart.
The Zeppelin Server chart also contains all the config files Zeppelin needs (notably all the files from the conf directory), wrapped in config maps as listed below:
- configMap.yaml: log4j.properties, log4j_k8s_cluster.properties, shiro.ini.template
- interpreter-settings-config.yaml: interpreter.json
- zeppelin-site-config.yaml: zeppelin-site.xml
Some configuration properties are replaced with template variables and exposed as parameters in values.yaml, so that they are configurable via the Helm client. Should you need to make another property configurable at deployment time, you only need to replace its value in the corresponding configMap with {{ .Values.configurableProperty }}; you can then refer to it as zeppelin.configurableProperty in the Helm client.
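To make the mechanics concrete, here is a minimal sketch of that wiring. The property name some.zeppelin.property, the value someValue and both file names are hypothetical; only the {{ .Values.configurableProperty }} placeholder and the zeppelin. prefix come from the description above.

```shell
# Hypothetical excerpt of a config map template in the Zeppelin Server chart.
# The quoted 'EOF' keeps the shell from touching the {{ ... }} placeholder.
cat > configmap-excerpt.yaml <<'EOF'
data:
  zeppelin-site.xml: |
    <property>
      <name>some.zeppelin.property</name>
      <value>{{ .Values.configurableProperty }}</value>
    </property>
EOF

# Matching override file for the aggregated chart: values for a sub chart
# live under the sub chart's name, hence the "zeppelin" key.
cat > override.yaml <<'EOF'
zeppelin:
  configurableProperty: "someValue"
EOF

# Then deploy as usual, for example:
#   helm install -f override.yaml banzaicloud-stable/zeppelin-spark
# or pass the value inline:
#   helm install --set zeppelin.configurableProperty=someValue banzaicloud-stable/zeppelin-spark
```

Keeping overrides in a file rather than in --set flags tends to be easier to review once more than a couple of properties are involved.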
In short, the Zeppelin Spark chart lets you configure the following:
- basic security with Shiro
- centralised logging to Syslog
- the storing of notebooks on different Cloud providers: Google Storage, Amazon S3, Azure Blob Storage
- the deployment of Spark History Server and the logging of Spark events to Google Storage, Amazon S3, Azure Blob Storage
- an ingress to reach Zeppelin UI from outside
- preconfigured spark-submit options
Let’s walk through each of these, highlighting related Helm chart parameters.
Basic security with Shiro
Shiro is the default authorization framework for Zeppelin, so you can set up several different Realms for LDAP, Active Directory etc.
As you’ve already seen above, we have a configMap containing shiro.ini.template, with a default configuration using IniRealm, meaning users and groups are configurable in conf/shiro.ini under the [users] and [roles] sections. As of now, you can configure an admin username and password:
zeppelin:
  username: "admin"
  password: "zeppelin"
These are the default values for username and password. The password is stored as a secret (zeppelin-secret.yaml) and picked up at deployment time by an initContainer, which runs a Shiro hasher tool and replaces ADMIN_PASSWORD in the shiro.ini.template file with the encoded password. Pipeline is already integrated with Bank-Vaults, so you’ll be able to use Vault to store your Zeppelin username & password.
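The substitution step can be pictured roughly as follows. This is only a sketch of the idea, not the initContainer's actual script: the template content, the use of sed and the hash string are all assumptions made for illustration, and the hash is a made-up placeholder rather than real hasher output.

```shell
# Stand-in for the mounted template; the real file ships in the config map.
cat > shiro.ini.template <<'EOF'
[users]
admin = ADMIN_PASSWORD, admin
EOF

# Placeholder: in the chart this value would come from the Shiro hasher tool,
# fed with the password stored in the secret. The string below is made up.
HASHED_PASSWORD='$shiro1$SHA-256$500000$c2FsdA==$aGFzaA=='

# Escape characters that are special on the right-hand side of sed's s|||,
# then replace the placeholder and write the final shiro.ini.
ESCAPED=$(printf '%s' "$HASHED_PASSWORD" | sed 's/[&|\\]/\\&/g')
sed "s|ADMIN_PASSWORD|$ESCAPED|" shiro.ini.template > shiro.ini

cat shiro.ini
```

Using | as the sed delimiter avoids clashes with the / characters that can appear in Base64-encoded hashes.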
Centralised logging to Syslog
There are two log4j files configured: one for Zeppelin Server (log4j.properties) and one used by the Spark Driver and Executors (log4j_k8s_cluster.properties). Both are embedded in a config map and configured by default for INFO level logging to the console. If you pass the required parameters below, a SyslogAppender will be created for Zeppelin Server and for the Spark Driver & Executors.
zeppelin:
  logService:
    host: 10.44.0.12
    zeppelinLogPort: 512
    sparkLogPort: 512
    applicationLogPort: 512
There is also a separate logger for your application level logs, named application by default. You can change it, as well as all the other optional parameters:
zeppelin:
  logService:
    zeppelinLogLevel: INFO
    zeppelinFacility: LOCAL4
    zeppelinLogPattern: "%5p [%d] ({%t} %F[%M]:%L) - %m%n"
    sparkLogLevel: INFO
    sparkFacility: LOCAL4
    sparkLogPattern: "[%p] XXX %c:%L - %m%n"
    applicationLoggerName: application
    applicationLogLevel: INFO
    applicationFacility: LOCAL4
    applicationLogPattern: "[%p] XXX %c:%L - %m%n"
Set up notebook storage on different Cloud providers
Zeppelin can store notebooks on several Cloud storage providers. You can easily configure this by setting storage type and path as illustrated below:
- zeppelin.notebookStorage.type - ‘s3’ | ‘azure’ | ‘gs’
- zeppelin.notebookStorage.path - bucket name in case of S3 / GS, file share name for Azure.
On Google and Amazon we’re using IAM roles and policies so that you can reach buckets belonging to the same account / project automatically, while on Azure you also have to specify zeppelin.azureStorageAccountName & zeppelin.azureStorageAccessKey.
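As an illustration of the Azure case, a values file might look like the sketch below. The nesting follows the parameter names given above; the file name and all three values are placeholders you would replace with your own.

```shell
# Placeholder values throughout: substitute your own file share name,
# storage account name and access key.
cat > zeppelin-azure-notebooks.yaml <<'EOF'
zeppelin:
  notebookStorage:
    type: "azure"
    path: "your-file-share-name"
  azureStorageAccountName: "yourStorageAccountName"
  azureStorageAccessKey: "yourStorageAccessKey"
EOF

# Then deploy with:
#   helm install -f zeppelin-azure-notebooks.yaml banzaicloud-stable/zeppelin-spark
```

Since the access key is sensitive, you may prefer to keep this file out of version control.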
Deploy Spark History Server
By default, Spark History Server is not enabled. You can enable it by setting historyServer.enabled to true. You must also specify where to log Spark events, both for Spark History Server and for the Spark Driver, the latter via sparkSubmitOptions:
historyServer:
  enabled: true
spark:
  spark-hs:
    app:
      logDirectory: "gs://spark-k8-logs/"
zeppelin:
  sparkSubmitOptions:
    eventLogDirectory: "gs://spark-k8-logs/"
The log directory settings, both logDirectory and eventLogDirectory, have to reference an existing bucket on the given Cloud provider, as follows:
- s3a://yourBucketName/
- wasb://your_blob_container_name@your_storage_account_name.blob.core.windows.net/
- gs://yourBucketName/
For a more in-depth examination of this subject, read our Spark History Server on Kubernetes post.
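For example, the same History Server settings on Amazon might be generated like this. It is a sketch under the assumption that both properties should point at the same bucket, as in the Google example above; the file name and bucket name are placeholders.

```shell
# Placeholder bucket name; the bucket has to exist already.
BUCKET="yourBucketName"

# The unquoted EOF lets the shell expand ${BUCKET}, so the History Server
# and the Spark Driver are guaranteed to use the same location.
cat > history-server-s3.yaml <<EOF
historyServer:
  enabled: true
spark:
  spark-hs:
    app:
      logDirectory: "s3a://${BUCKET}/"
zeppelin:
  sparkSubmitOptions:
    eventLogDirectory: "s3a://${BUCKET}/"
EOF

# Then deploy with:
#   helm install -f history-server-s3.yaml banzaicloud-stable/zeppelin-spark
```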
Ingress to reach Zeppelin UI from outside
By default, a Traefik-based Ingress service will be created for Zeppelin, so that the Zeppelin UI is available on an external address under the specified baseURL.
Related chart parameters:
zeppelin:
  pipelineIngress:
    enabled: true
  ingressURL:
    baseURL: /zeppelin
Preconfigured spark-submit options
You can configure the properties below in order to change default values:
zeppelin:
  sparkSubmitOptions:
    k8sNameSpace: default
    sparkDriverCores: 1
    sparkDriverLimitCores: 2
    sparkExecutorCores: 1
    sparkDriverMemory: 4G
    sparkExcutorMemory: 2G
    sparkMetricsConf: /opt/spark/conf/metrics.properties
    dynamicAllocation: true
    shuffleService: true
    shuffleNameSpace: default
    shuffleLabels: app=spark-shuffle-service,spark-version=2.2.0
    DriverImage: banzaicloud/spark-driver-py:v2.2.1-k8s-1.0.30
    ExecutorImage: banzaicloud/spark-executor-py:v2.2.1-k8s-1.0.30
    initContainerImage: banzaicloud/spark-init:v2.2.1-k8s-1.0.30
    resourceStagingServerInt: http://spark-rss:10000
    resourceStagingServerExt: http://spark-rss:10000
    sparkLocalDir: /tmp/spark-local
    eventLogDirectory: ""
    driverServiceAccountName: "spark"
    log4jConfigPath: "file:///var/spark-data/spark-files/log4j_k8s_cluster.properties"
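You only need to list the options you want to change; everything else keeps its default. A small override file could look like the sketch below, where the file name, the spark-jobs namespace and the resource figures are purely illustrative.

```shell
# Illustrative overrides only: a different namespace, a smaller driver
# and more cores per executor. All other options keep their defaults.
cat > spark-submit-overrides.yaml <<'EOF'
zeppelin:
  sparkSubmitOptions:
    k8sNameSpace: spark-jobs
    sparkDriverMemory: 2G
    sparkExecutorCores: 2
EOF

# Then deploy with:
#   helm install -f spark-submit-overrides.yaml banzaicloud-stable/zeppelin-spark
```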
Minimal config example necessary for launch
If you’re fine with most of the defaults, here is a minimal config example for an up and running Zeppelin Server, with History Server and everything else you might need, on Google Cloud:
cat > zeppelin-params.yaml <<EOF
historyServer:
  enabled: true
spark:
  spark-hs:
    app:
      logDirectory: "gs://spark-k8-logs/"
zeppelin:
  sparkSubmitOptions:
    eventLogDirectory: "gs://spark-k8-logs/"
  notebookStorage:
    type: "gs"
    path: "zeppelin-nb"
EOF
helm install -f zeppelin-params.yaml banzaicloud-stable/zeppelin-spark
Note: the spark-k8-logs and zeppelin-nb buckets have to be created beforehand and must be accessible by project owners.
These Helm charts are the basis of our Zeppelin Spark spotguide, which is meant to further ease the deployment of running Spark workloads using Zeppelin. As you have seen, the Zeppelin Spark chart makes it easy to launch Zeppelin, but it is still necessary to manage the logging infrastructure, and to monitor it and the different storage buckets used. We have automated all of this with Pipeline, using our open source operators such as the Cloud agnostic PVC operator and the Fluent based Logging operator.