For some time we’ve been evangelizing the idea that the runtime fabric
of big data workloads should be Kubernetes. In this post I'd like to walk through the thought process behind that shift and discuss its benefits. Obviously, this is a large topic, and this post makes no attempt to cover it completely; it reflects the views and opinions that we at Banzai Cloud believe in and encourage others to adopt.
tl;dr: 🔗︎
We have made, or taken advantage of, several cloud native
changes to the big data frameworks we deploy, run and monitor with our Pipeline Platform. To take a deep dive into these technologies and the detailed list of changes, please follow the links below.
Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes
Collecting Spark History Server event logs in the cloud
Deploy production-grade Spark to Kubernetes in minutes
Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
Tensorflow on Kubernetes series:
Tensorflow on K8s
Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd
For us, the changes we made (highlighted above) were merely side-effects of building our open source Pipeline PaaS while running frameworks on Kubernetes. All the other applications we run and provision (web servers, JEE application servers, databases, Java, Spring, Serverless frameworks, etc) follow the same cloud native charter, and those principles
define our approach to how we deploy these big data frameworks.
The open source community 🔗︎
While the number of contributors, GitHub stars, commits or forks is not always the best indication of a project's success or maturity, it should be noted that Kubernetes is the fastest-growing project in the history of open source, and exceeds the rest of the Hadoop ecosystem in every respect. The sheer velocity of the project, its aggressive release cycle, and the number of vendors and companies adopting and deploying Kubernetes clusters all considerably outpace the growth of the big data ecosystem. One of the main reasons is that big data companies run all kinds of workloads, and Kubernetes offers a common runtime fabric for all of them.
Ease of operation - everywhere 🔗︎
Operating a Hadoop/YARN/Zookeeper/etc cluster and deploying it in the cloud is not for the faint of heart. Just a short while ago, that was true of Kubernetes as well. However, it's become significantly easier to install and operate Kubernetes clusters, and all major cloud providers have launched Kubernetes as a Service, including Google, Microsoft Azure, AWS, Alibaba, Huawei, Oracle, Tencent, and many others. Vendor support is tremendous when compared with the single-digit number of pure-play Hadoop vendors. As of this writing, there are more than 50 vendors shipping CNCF-certified
Kubernetes distributions: how many ODPi
certified Hadoop distros are there?
Kubernetes is a game changer for the cloud; in the past there was no standard virtual machine image format, and applications built for one cloud provider could not easily be deployed to another. By contrast, a containerized application built for Kubernetes can be deployed to any Kubernetes service, regardless of its underlying infrastructure - whether on-prem, cloud or federated.
In the past, it took us several months to port and integrate a big data service with a new cloud provider, but recently we managed to add AKS and GKE support (alongside AWS) in just two weeks each. That's pretty fast. And the same code works on every kind of cluster; Kubernetes knows whether it's running in the cloud or on-premise, and automatically registers and uses its provider's services, so Kubernetes handles the provider-specific integration itself.
The wide range of out-of-the-box services 🔗︎
Big data relies on several supporting projects and services, like YARN for scheduling and Zookeeper for consistency and discovery. While these were good enough for on-prem environments, they have failed to progress at the pace of technologies used in dynamic cloud environments. Scheduling, consistency, service discovery and infrastructure management are all considerably better and more efficient on Kubernetes - and these features were designed as a core part of the platform from day one. The most popular big data projects, like Spark, Zeppelin, Jupyter, Kafka and Heron, as well as AI frameworks like Tensorflow, are all now benefitting from, or being built on, core Kubernetes building blocks - its scheduler, service discovery, internal Raft-based consistency model and many others. Kubernetes now supports device plugins,
externalizing this logic from the core to facilitate innovation, and there are already implementations for GPUs, NICs, FPGAs, InfiniBand and many others, ready to be scheduled against. Kubernetes also supports multiple schedulers: besides the already feature-rich default scheduler (individual and collective resource requirements, quality-of-service requirements, hardware constraints, affinity and anti-affinity specifications, data locality, inter-workload interference), developers can build application-specific custom schedulers (for Spark executors, Heron topologies, etc).
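To make this concrete, here is a minimal sketch, using the official Kubernetes Python client, of a pod that requests a GPU exposed by a device plugin and asks to be placed by a custom scheduler. The scheduler name, image and node label are hypothetical placeholders, not part of any of the frameworks above.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a cluster
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-executor", labels={"app": "spark"}),
    spec=client.V1PodSpec(
        # An application-specific custom scheduler (hypothetical name);
        # omit scheduler_name to fall back to the default scheduler.
        scheduler_name="spark-scheduler",
        # Hypothetical node label, e.g. set on a GPU node pool.
        node_selector={"accelerator": "nvidia-gpu"},
        containers=[
            client.V1Container(
                name="executor",
                image="example/spark-executor:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    # Extended resource advertised by the GPU device plugin.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        restart_policy="Never",
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```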
Additionally, exposing these services (with load balancing, replication, ingress, DNS support and cloud provider-specific registration, like ELB) is a core part of k8s. Unlike big data projects, whose services still need to
talk to each other directly through predefined, hardcoded IP addresses or via callbacks, Kubernetes offers - and enforces - a framework for rethinking service discovery and resiliency the cloud native
way.
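As a quick illustration, the same Python client can expose, say, a Zeppelin UI through a cloud load balancer; Kubernetes translates the LoadBalancer type into a provider-specific resource (an ELB on AWS, for instance). The service name, selector and ports below are assumptions for the example.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="zeppelin"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",           # provisioned as an ELB, Azure LB, etc. by the provider
        selector={"app": "zeppelin"},  # pods are found by label, not by hardcoded IPs
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
core.create_namespaced_service(namespace="default", body=service)
# Pods inside the cluster can now reach it via DNS as zeppelin.default.svc,
# with no callbacks or predefined addresses involved.
```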
Storage is of paramount importance to big data. At the end of the day, people can change compute frameworks and engines with relative ease, but data has the most gravitational drag. Ozone was introduced with Hadoop 3.0 as an object store, usable alongside HDFS, that exposes an S3-compatible API. Setting aside the fact that Kubernetes-ecosystem projects have offered these capabilities for a long time (Minio and Rook support AWS, GCP, Azure, NAS, DC/OS, drives - SSD, JBOD, SAN - and Cloud Foundry through a unified, S3-like API/interface), and that Reed-Solomon (a.k.a. erasure coding for HDFS in Hadoop 3.0) has long since shipped, there's a lot that can be achieved with the new Container Storage Interface - a single, cluster-level volume plugin API shared by all orchestrators. Imagine a Spark or MapReduce shuffle stage, or Spark Streaming checkpointing, where data has to be accessed rapidly from many nodes. Kubernetes supports Amazon Elastic File System (EFS), Azure Files and GCE persistent disks, so you can dynamically mount an EFS, Azure Files or PD volume into the pods of that particular stage or checkpointing step, while still using S3 as your main input/output data store. All of these are claim based, triggered and translated into native
cloud provider calls on demand. There is no need to pre-create them during cluster installation, nor to write cloud provider-specific code: cloud, infrastructure and application vendors have already written their plugins, and Kubernetes translates the claims into provider-specific calls.
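Here is a minimal sketch of such a claim with the Kubernetes Python client: a ReadWriteMany volume that a dynamic provisioner (EFS, Azure Files, etc.) satisfies on demand. The claim name and storage class are placeholders - the class name depends on how your cluster's provisioner is configured.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

claim = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="spark-checkpoints"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],   # shared across many executor pods
        storage_class_name="efs-sc",      # placeholder; maps to the installed provisioner
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=claim)
# Pods then mount the claim by name; the cloud-specific volume provisioning
# happens behind the claim, with no provider code in the application.
```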
The open source data governance work done by vendors, together with the increasing popularity and the cheap, redundant storage options of cloud-based object stores, makes cloud storage the de facto data lake for enterprises. HDFS is an obsolete technology that failed to evolve at the pace of newer technologies, and at this point is more of a liability than an asset.
The massive number of Kubernetes-ready applications 🔗︎
A few years ago, one of the keynotes at the Hadoop Summit was about YARN apps. With Hadoop 3.0, the announcement was made that the way had been opened for apps in 3.x (beginning with 3.1 and 3.2). Currently, there aren't many applications out there running on YARN. However, these days it's hard to find any new or legacy application which has not been containerized
(aka dockerized
) and made to run on Kubernetes. Now even giant databases (like Oracle) are first class citizens of Kubernetes. Kubernetes also has a common package manager and application definition called Helm, which helps define, install, and upgrade even the most complex Kubernetes applications using charts; it's an easy way to define, create, version, share, publish and update applications and deployments.
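For instance, installing and later upgrading a chart boils down to a couple of Helm commands, wrapped in Python here only to keep the examples in one language. The repository URL and chart name are placeholders, and the --name flag reflects the Helm 2 CLI of the time (Helm 3 takes the release name as a positional argument instead).

```python
import subprocess

# Register a chart repository (placeholder URL).
subprocess.run(
    ["helm", "repo", "add", "example-charts", "https://charts.example.com"],
    check=True,
)

# Install a versioned, shareable application definition as a named release.
subprocess.run(
    ["helm", "install", "--name", "my-spark", "example-charts/spark"],
    check=True,
)

# Upgrade the release in place when a new chart version is published.
subprocess.run(
    ["helm", "upgrade", "my-spark", "example-charts/spark"],
    check=True,
)
```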
This list could go on and on - but let’s end things here, so we can reflect on what’s been discussed. The big data paradigm shift is happening right in front of us; it’s being driven by the Kubernetes community.
If you've made it this far, you're probably interested in running a big data workload on Kubernetes. You can read about the changes Banzai Cloud and the Kubernetes community have made to turn these frameworks into first class citizens on k8s.
Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes
Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
Tensorflow on Kubernetes series:
Tensorflow on K8s
Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd