Running Zeppelin Spark notebooks on Kubernetes

The content of this page hasn't been updated for years and might refer to discontinued products and projects.

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

In our first blog post about Zeppelin on Kubernetes, we explored a few of the problems we had encountered so far. Let’s briefly recap them:

communication between Zeppelin Server and RemoteInterpreterServer
dependency handling and logger setup
proper handling of pod lifecycles started by spark-submit

To address these problems, we opened pull request PR-2637 against Apache Zeppelin. The most important part of that PR extends RemoteInterpreterManagedProcess, the class responsible for managing and connecting to remotely running interpreter processes. The new SparkK8RemoteInterpreterManagedProcess simplifies connecting to RemoteInterpreterServer and communicates directly with the Kubernetes cluster. Our solution still uses spark-submit and interpreter.sh. However, after starting spark-submit, it watches for events related to the Spark Driver pod that spark-submit creates. Once the Driver's state is ‘Running’, it creates a separate Kubernetes Service bound to the RemoteInterpreterServer inside the Spark Driver. SparkK8RemoteInterpreterManagedProcess then tries to connect to RemoteInterpreterServer through this service. Should the connection fail, both the Spark Driver and the service are deleted.
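The lifecycle handling described above can be approximated with plain kubectl calls. The sketch below is only an illustration, not the code from the pull request, which implements this logic in Java inside Zeppelin. The function names, the `-svc` service suffix, the `POLL_INTERVAL` variable, and the port you pass in are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real logic is implemented in Java inside Zeppelin.

# wait_for_driver POD [TRIES]
# Polls the pod phase until it reports Running; fails after TRIES attempts.
wait_for_driver() {
  local pod="$1" tries="${2:-30}" phase i=0
  while [ "$i" -lt "$tries" ]; do
    phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# expose_or_cleanup POD PORT
# Creates a Service in front of the driver once it runs; otherwise deletes the pod.
# "kubectl expose pod" reuses the pod's labels as the Service selector.
expose_or_cleanup() {
  local pod="$1" port="$2"
  if wait_for_driver "$pod"; then
    kubectl expose pod "$pod" --name "${pod}-svc" --port "$port"
  else
    kubectl delete pod "$pod" --ignore-not-found
    return 1
  fi
}
```

Calling `expose_or_cleanup <driver-pod-name> <port>` would then either leave a Service in front of the driver or remove the pod, which mirrors the "connect or clean up" behaviour described above in a much simplified form.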

As you can see in the sequence diagram below, the flow of actions upon starting a Zeppelin notebook is the same as the flow upon starting any Spark application on K8s.

Zeppelin Spark K8s flow

Now let’s start a Zeppelin notebook on Minikube, using our pre-built Docker images:

First, start the ResourceStagingServer, which spark-submit uses to distribute resources (in our case, the Zeppelin Spark interpreter JAR) to the Spark driver and executors:

wget https://raw.githubusercontent.com/apache-spark-on-k8s/spark/branch-2.2-kubernetes/conf/kubernetes-resource-staging-server.yaml
kubectl create -f kubernetes-resource-staging-server.yaml

Next, create a Kubernetes service to reach the Zeppelin server from outside the cluster:

wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-service.yaml
kubectl create -f zeppelin-service.yaml

Get the address of ResourceStagingServer, either from the k8s dashboard or by running:

kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}'

Then download the Zeppelin server’s pod definition:

wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-pod.yaml

Then edit zeppelin-pod.yaml, setting the address of ResourceStagingServer to the one you just retrieved:

- name: SPARK_SUBMIT_OPTIONS
  value: >-
    --kubernetes-namespace default
    --conf spark.executor.instances=2
    --conf spark.kubernetes.resourceStagingServer.uri=http://10.0.0.121:10000
    --conf spark.kubernetes.resourceStagingServer.internal.uri=http://10.0.0.121:10000
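If you would rather not edit the file by hand, the substitution can be scripted. The following is a minimal sketch that assumes the downloaded zeppelin-pod.yaml already contains both `resourceStagingServer` URIs in the `http://<address>:<port>` form shown above; the function name `set_staging_uri` is hypothetical, and the default port 10000 is taken from the snippet above.

```shell
#!/usr/bin/env bash
# set_staging_uri FILE IP [PORT]
# Rewrites both resourceStagingServer URIs in FILE to point at IP:PORT.
set_staging_uri() {
  local file="$1" ip="$2" port="${3:-10000}" tmp
  tmp="$(mktemp)"
  # Match ".uri=" and ".internal.uri=" and replace only the URL that follows.
  sed -E "s#(resourceStagingServer(\.internal)?\.uri=)http://[^[:space:]]+#\1http://${ip}:${port}#" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (needs a running cluster):
#   ip="$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}')"
#   set_staging_uri zeppelin-pod.yaml "$ip"
```

Writing through a temporary file avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.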

Start the Zeppelin server:

kubectl create -f zeppelin-pod.yaml

Check the list of pods and wait until the status of the zeppelin-server pod is Running:

$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          2m
zeppelin-server                                  1/1       Running   0          1m
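Instead of re-running `kubectl get po` by hand, the wait can be scripted. This sketch assumes a kubectl version that ships the `wait` subcommand, which may be newer than the one used when this post was written; the function name `wait_for_zeppelin` and the default timeout are arbitrary choices for the example.

```shell
#!/usr/bin/env bash
# wait_for_zeppelin [TIMEOUT]
# Blocks until the zeppelin-server pod reports the Ready condition,
# or fails once TIMEOUT (default: 180s) has elapsed.
wait_for_zeppelin() {
  local timeout="${1:-180s}"
  kubectl wait --for=condition=Ready pod/zeppelin-server --timeout="$timeout"
}
```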

The Zeppelin UI should be reachable on the same IP as the Minikube dashboard (the address of the node), while the port can be retrieved either from the k8s dashboard or by running:

kubectl get svc zeppelin-k8s-service -o jsonpath='{.spec.ports[0].nodePort}'
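Putting the node address and the port together gives the full URL of the UI. The helper below is a small sketch that assumes you are on Minikube, so that `minikube ip` returns the node address; the function name `zeppelin_url` is hypothetical.

```shell
#!/usr/bin/env bash
# zeppelin_url
# Prints the URL of the Zeppelin UI: Minikube node IP plus the service's NodePort.
zeppelin_url() {
  local ip port
  ip="$(minikube ip)" || return 1
  port="$(kubectl get svc zeppelin-k8s-service -o jsonpath='{.spec.ports[0].nodePort}')" || return 1
  printf 'http://%s:%s\n' "$ip" "$port"
}
```

Running `zeppelin_url` on a machine with access to the cluster prints an address you can open in a browser.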

Finally, start a notebook and check the list of pods again:

$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          33m
zeppelin-server                                  1/1       Running   0          24m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-driver    1/1       Running   0          4m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-1    1/1       Running   0          3m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-2    1/1       Running   0          3m
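To see only the pods started for the notebook's interpreter, you can filter the list by name. The sketch below assumes the `zri-` name prefix visible in the listing above; other Zeppelin builds may name these pods differently, and the function name `interpreter_pods` is hypothetical.

```shell
#!/usr/bin/env bash
# interpreter_pods
# Lists the names of pods whose name starts with "zri-" (driver and executors).
interpreter_pods() {
  kubectl get po --no-headers -o custom-columns=NAME:.metadata.name | grep '^zri-'
}
```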

You can also build Zeppelin from scratch and create your own Docker image via our GitHub repository.

We hope you enjoy running Zeppelin notebooks on Kubernetes!

About Banzai Cloud Pipeline

Banzai Cloud’s Pipeline provides a platform for enterprises to develop, deploy, and scale container-based applications. It leverages best-of-breed cloud components, such as Kubernetes, to create a highly productive, yet flexible environment for developers and operations teams alike. Strong security measures — multiple authentication backends, fine-grained authorization, dynamic secret management, automated secure communications between components using TLS, vulnerability scans, static code analysis, CI/CD, and so on — are default features of the Pipeline platform.
