Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep dive into monitoring Spark and Zeppelin with Prometheus
- Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
- Running Zeppelin Spark notebooks on Kubernetes
- Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Kafka on Kubernetes series:
- Kafka on Kubernetes - using etcd
Today we’re starting a Spark on Kubernetes series to explain the motivation behind, the technical details of, and the overall advantages of a cloud native, microservice-oriented deployment. The series will orient readers on what Spark on Kubernetes is and what options are available, then dive deep into the technology to show how to operate, deploy and run workloads in a Spark on k8s cluster - culminating in our Pipeline Apache Spark Spotguide, which we hope will demonstrate how easily and elegantly these problems can be tackled while running diverse workloads.
Spark is rapidly becoming the industry standard for data processing and streaming but, unfortunately, it was designed mostly with static, on-prem environments in mind. Significant progress has been made toward making Spark work better in the cloud, but it still treats virtual machines as if they were physical ones. Resource managers such as YARN have been integrated with Spark, but they were never really designed for dynamic, fast-moving cloud infrastructure. At the same time, Kubernetes evolved quickly to fill these gaps and became the enterprise-standard orchestration framework, designed primarily for use in the cloud. Some of its features - and this list is far from complete - include:
- Orchestration of multiple container runtimes
- Cloud independent
- Support for dynamic clusters (networking, volumes, compute, etc.)
- Service upgrades and rollbacks
- Well defined storage and network management
- Effective resource isolation and management
- Portability across multiple container runtimes and cloud providers (including federated)
- Multiple and diverse workloads, from batch to streaming and beyond
There are several approaches that allow you to wrap Spark in containers and benefit from their portability, repeatable deployments and reduced dev-ops load. However, none of them make Spark a first class citizen in the container space. They all share one common problem - they bloat the containers and diverge from container best practices (e.g. one process per container).
Given these benefits of a microservice-oriented approach, it’s not surprising that the community set out to solve these problems by opening this Spark issue and collaborating on this project. We have joined this community as well, and are already contributing to help make Spark on Kubernetes a first class k8s citizen and a supported Spotguide on our PaaS, Pipeline.
Among the direct benefits of this approach are a leaner deployment unit, lower resource consumption and faster reaction to changes in the underlying infrastructure. Currently, merely wrapping Spark in containers and scheduling it with one of the available resource managers causes subtle issues and bugs: the resource managers know nothing about each other, they don’t work together, and each tries to schedule based on different rules. In a multi-tenant, multi-application cluster (a very typical Pipeline or Kubernetes setup), the problem is even larger, since both YARN and a standalone resource manager will create workload islands. Spark on k8s remedies all of these issues by using one generic Kubernetes resource manager for all applications - one that is aware of the overall cluster state and infrastructure - while Spark requests the resources it needs through the k8s API, via a plugin developed by the Spark-on-k8s project.
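To give a rough idea of what this looks like in practice, here is a minimal sketch of submitting a Spark application directly to a Kubernetes cluster. It assumes a Spark build that includes the Kubernetes scheduler backend and a Spark container image already pushed to a registry reachable from the cluster; the API server address, registry and tag below are placeholders, and the exact configuration property names vary between the Spark-on-k8s fork and the support later merged upstream.

```bash
# Sketch: submit a Spark job with Kubernetes as the resource manager.
# <k8s-api-server>, <registry> and <tag> are placeholders for your environment.
bin/spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<registry>/spark:<tag> \
  local:///opt/spark/examples/jars/spark-examples.jar
```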
One possible interaction between the Spark control flow and the resources requested by Spark and provided by Kubernetes looks like the following: the driver and the executors each run in their own containers and pods, and are scheduled by Kubernetes.
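Because the driver and executors are ordinary pods, they can be observed like any other workload. As an illustration - assuming the scheduler backend applies the spark-role labels to the pods it creates, which may differ between versions - something like the following shows the pieces of a running application:

```bash
# Sketch: list the pods that make up a running Spark application.
# Assumes the driver and executor pods carry spark-role labels.
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor
```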
The progress of this project and its community has been outstanding - new features and bug fixes are added at an extraordinary pace. At Banzai Cloud we try to add our own share of contributions, to help make Spark on k8s your best option when it comes to running workloads in the cloud. Help us and the community by contributing to any of the pull requests below.
https://github.com/apache/spark/pull/19775
https://github.com/apache/zeppelin/pull/2637
https://github.com/apache-spark-on-k8s/spark/pull/532
https://github.com/apache-spark-on-k8s/spark/pull/531
The next post in the series will be a technical walkthrough of using and (auto)scaling Spark on Kubernetes through the native k8s support developed by the Spark-on-k8s project.