Background
For data processing tasks, there are several ways to go about it:
- using SQL to leverage a database engine to perform data transformation
- dataframe-based frameworks such as pandas, ray, dask, polars
- big data processing frameworks such as spark
Check out this article for a polars vs spark benchmark.
The problem
At larger data scales, the non-spark options can still work, but only with a lot of vertical scaling, and that gets very expensive. For comparison, our team had to scale a database up to 4 vCPU / 16 GB and the job still took the whole night, whereas spark on a single node can process the same data in 2 minutes flat.
Using spark would solve a lot of these scaling problems; the catch is that spark in cluster mode is very hard to set up. That's why most, if not all, cloud providers offer a managed spark runtime (AWS EMR, GCP Dataproc, Azure Databricks). And since these are managed services, you can't test them locally (and you'll have to wait at least 5 minutes for the cluster to start and finish bootstrapping).
The solution
As of the Apache Spark 3.1 release in March 2021, spark on kubernetes is generally available. This is great in many ways, including but not limited to:
- you have full control of your infra
- takes less than 10 seconds to start a spark cluster
- can store logs in a central location, to be viewed later via spark history server
- can use minio as a local storage backend (better throughput compared to calling S3 over your home/work internet); see the configuration sketch after this list
- cheaper than all managed solutions, even serverless variants (more on this later)
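As a rough sketch of the log-storage and minio points above, these are the kind of spark properties involved. Assumptions: the bucket, endpoint, and credentials are placeholders, and the container image needs the hadoop-aws connector jars on its classpath for the s3a:// scheme to work.
# extra flags to append to a spark-submit invocation so event logs land in a
# MinIO (S3-compatible) bucket and can be replayed later by a spark history server.
# bucket name, endpoint, and credentials below are placeholders.
EVENT_LOG_CONFS=(
  --conf spark.eventLog.enabled=true
  --conf spark.eventLog.dir=s3a://spark-logs/events
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.minio.svc.cluster.local:9000
  --conf spark.hadoop.fs.s3a.access.key=minio
  --conf spark.hadoop.fs.s3a.secret.key=minio123
  --conf spark.hadoop.fs.s3a.path.style.access=true
)
# usage: spark-submit "${EVENT_LOG_CONFS[@]}" <the rest of your spark-submit arguments>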
Demo
Start a local kubernetes cluster
Minikube, Kind or K3d should work.
k3d cluster create spark-cluster --api-port 0.0.0.0:6443
# verify that a cluster is created
kubectl get nodes
Create a namespace for spark jobs
kubectl create namespace spark
Create a service account for spark
kubectl create serviceaccount spark --namespace spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:spark --namespace=spark
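Optionally, verify that the service account is allowed to create pods (which is what the spark driver needs in order to spawn executors):
# should print "yes"
kubectl auth can-i create pods \
  --as=system:serviceaccount:spark:spark \
  --namespace=spark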
Install local spark (so you can run spark-submit)
brew install temurin
brew install apache-spark
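To confirm the install, check that spark-submit is on your PATH:
spark-submit --version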
Run a demo spark job
If you encounter the "To use support for EC Keys" error, see: https://stackoverflow.com/questions/75796747/spark-submit-error-to-use-support-for-ec-keys-you-must-explicitly-add-this-depe
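One workaround discussed in that thread is to put the Bouncy Castle jars on Spark's classpath. This is a sketch under assumptions: the jar versions below are examples only (check Maven Central for current ones), and the Homebrew apache-spark formula keeps SPARK_HOME under its libexec directory.
# assumption: the kubernetes client wants the Bouncy Castle PKIX provider for EC keys,
# so we drop bcprov/bcpkix into Spark's jars directory. versions are examples only.
cd "$(brew --prefix apache-spark)/libexec/jars"
curl -LO https://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk18on/1.78.1/bcprov-jdk18on-1.78.1.jar
curl -LO https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-jdk18on/1.78.1/bcpkix-jdk18on-1.78.1.jar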
# get kubernetes control plane url via `kubectl cluster-info`
spark-submit \
  --master k8s://https://0.0.0.0:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=spark:3.4.1 \
  local:///opt/spark/examples/src/main/python/pi.py
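In cluster mode the driver runs as a pod in the spark namespace, so you can watch the job with kubectl. The driver pod name below is a placeholder; copy the real one from the pod list.
# list the driver and executor pods spawned by the job
kubectl get pods -n spark
# tail the driver logs (replace the pod name with the one from the previous command)
kubectl logs -f -n spark spark-pi-xxxxxxxxxxxx-driver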
Tearing down k3d
k3d cluster delete spark-cluster
Full Demo
Regarding pricing
Since spark-submit on kubernetes is akin to serverless spark, we'll compare costs against serverless spark runtimes.
Setup:
- CPU: 8
- Memory: 16 GB
- Runtime: 15 Hours
- Ephemeral storage: 20 GB
- Region: Singapore
Remarks: the spark-on-kubernetes figures in this article assume you're using GKE Autopilot, which has a management fee of $60/month (unless it's your only cluster). Compute is priced at spot rates.
| Serverless Runtime | Reference | Total Cost | Total cost if you run 33x workloads (the break-even point) |
|---|---|---|---|
| AWS EMR Serverless | https://aws.amazon.com/emr/pricing/ | 9.61 USD | 317.22 USD |
| GCP Dataproc Serverless | https://cloud.google.com/products/calculator | 4.26 USD | 140.58 USD |
| Azure Databricks Serverless | https://www.databricks.com/product/pricing/product-pricing/instance-types | 5.61 USD | 185.13 USD |
| (Bonus) spark on kubernetes | https://cloud.google.com/kubernetes-engine/pricing | 2.417 USD | 139.76 USD (including cluster management fee) |
The script I used to find the break-even point:
aws = 9.61272
gcp = 4.26
databricks = 5.61
k8s = 2.417

found_break_even_point = False
multiplier = 0
while not found_break_even_point:
    multiplier += 1
    print(multiplier)

    aws_multiplied = aws * multiplier
    gcp_multiplied = gcp * multiplier
    databricks_multiplied = databricks * multiplier
    k8s_reference = (k8s * multiplier) + 60  # add gke autopilot management fee

    if k8s_reference < min([aws_multiplied, gcp_multiplied, databricks_multiplied]):
        found_break_even_point = True
Why GKE Autopilot
Traditionally, when you create a kubernetes cluster, you have to attach nodes yourself. This means that if you run spark jobs whose compute requirements exceed the available capacity, your jobs won't be able to start. With GKE Autopilot, GCP automatically provisions nodes and attaches them to your GKE cluster, so you won't ever face "insufficient cpu / memory" errors.
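As a sketch of what that looks like from the spark side: you declare per-pod resource requests on spark-submit and Autopilot provisions nodes to match. The numbers are illustrative, <GKE_ENDPOINT> is a placeholder for your cluster's control plane address, and the cloud.google.com/gke-spot node selector is what puts the pods on spot capacity.
spark-submit \
  --master k8s://https://<GKE_ENDPOINT> \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark:3.4.1 \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.driver.request.cores=1 \
  --conf spark.kubernetes.executor.request.cores=2 \
  --conf spark.executor.memory=4g \
  --conf spark.kubernetes.node.selector.cloud.google.com/gke-spot=true \
  local:///opt/spark/examples/src/main/python/pi.py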
Closing
If you’re already using kubernetes and also use spark for data processing, migrating those workloads to k8s can cut costs significantly. In our case, existing spark jobs run on a 4 vCPU / 16 GB VM (109.79 USD / month), which puts a serious dent in our bills. If we migrated to spark on k8s, it would only cost us around 16.41 USD / month 😱. That’s a whopping 85% price reduction 🤯.
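A quick sanity check of that number:
# (109.79 - 16.41) / 109.79 ≈ 0.85, i.e. roughly an 85% reduction
echo "scale=4; (109.79 - 16.41) / 109.79" | bc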
For more advanced use cases, check out https://github.com/kahnwong/spark-on-k8s.