Janos Matyas

Mon, Nov 27, 2017


Introduction to Spark on Kubernetes

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Apache Spark CI/CD workflow howto
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
CI/CD flow for Zeppelin notebooks

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

Today we are starting a Spark on Kubernetes series to explain the motivation, benefits and technical details of a cloud native, microservice oriented deployment. The series puts Spark on Kubernetes in context, surveys the available options, dives deep into how to operate, deploy and run workloads on a Spark on K8S cluster, and finishes with our Pipeline Apache Spark Spotguide to show how easily and elegantly we tackle these problems while running diverse workloads.

Pipeline is a microservice platform with a built-in CI/CD pipeline that runs different workloads defined by spotguides - one of which is Spark. The platform itself is not tied to big data workloads only.

Spark is becoming the enterprise standard for data processing and streaming, but unfortunately it was designed mostly for static, on-prem environments. Significant progress has been made to make Spark work better in the cloud, but it still treats virtual machines as if they were physical ones. Resource managers (like YARN) were integrated with Spark, but they were never really designed for dynamic, fast-moving cloud infrastructure. At the same time, Kubernetes has evolved quickly to fill these gaps and become the enterprise standard orchestration framework, designed primarily for the cloud. Here are some features of K8S - the list is far from complete:

  • Orchestration of multiple container runtimes
  • Cloud independent
  • Support for dynamic clustering (network, volumes, compute, etc.)
  • Service upgrades and rollbacks
  • Well defined storage and network management
  • Effective resource isolation and management
  • Portability across multiple container runtimes and cloud providers (federated as well)
  • Multiple and diverse workloads, from batch to streaming to any other application

There are several approaches to wrapping Spark in containers to benefit from portability, repeatable deployments and a reduced devops load, but none of them make Spark a first class citizen in the container space. They all share one common problem: they bloat containers and diverge from container best practices (e.g. one process per container).

Considering the benefits above and a microservice oriented approach, the community set out to fix these problems, starting with this Spark issue and collaborating on this project. We have joined this community as well and are contributing to make Spark on Kubernetes a first class K8S citizen and a supported Spotguide on our PaaS, Pipeline.

Among the direct benefits are a leaner deployment unit, lower resource consumption and faster reaction to changes in the underlying infrastructure. Currently, just wrapping Spark in containers and scheduling with one of the available resource managers causes subtle issues and bugs - the resource managers know nothing about each other, don't work together, and all try to schedule based on different rules. In a multi-tenant, multi-application cluster (a very typical Pipeline or Kubernetes setup), the problem is even bigger, as YARN or the standalone resource manager creates workload islands. The Spark on K8S project fixes all of this by using one generic K8S resource manager for all applications (aware of the overall cluster state and infrastructure), while Spark addresses its resource needs through the K8S API, via a plugin developed by the Spark on K8S project.

A possible flow of the resources requested by Spark and provided by Kubernetes looks like the following: the driver and the executors each run in their own container and pod, all scheduled by Kubernetes.

Spark Flow
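
To make the flow above concrete, a job submission against a Kubernetes master with the apache-spark-on-k8s fork looks roughly like the sketch below. This is an illustration only: the API server address, namespace and image tags are placeholders, and the exact flags and configuration keys may differ between releases of the fork, so check its documentation for your version.

```bash
# Sketch only - <k8s-apiserver-host>, the namespace and the image tags
# are placeholders, not real values.
bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --kubernetes-namespace default \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.app.name=spark-pi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0.jar
# The driver pod is created first; it then requests executor pods through
# the Kubernetes API, and Kubernetes schedules each pod onto cluster nodes.
```

Because each driver and executor is a single-process pod, Kubernetes - rather than a second, Spark-specific resource manager - decides placement alongside every other workload in the cluster.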

The progress of the project and the community is outstanding - new features and bug fixes are added at a fast pace. At Banzai Cloud we try to contribute our own share to make Spark on K8S your best option for running workloads in the cloud. Help us and the community by contributing to any of the issues below.

https://github.com/apache/spark/pull/19775
https://github.com/apache/zeppelin/pull/2637
https://github.com/apache-spark-on-k8s/spark/pull/532
https://github.com/apache-spark-on-k8s/spark/pull/531

The next post in the series will be a technical walkthrough of using and (auto)scaling Spark on Kubernetes with the native K8S API support developed by the Spark on K8S project.

If you are interested in our technology and open source projects, follow us on GitHub, LinkedIn or Twitter.
