Spark: do you optimize the infrastructure?
If you work at or run a company that has a Data-Science (DS) team, then you know how critical experimentation is for making any meaningful progress.
Depending on your company's scale, your DS team will incur different costs. Here is a shitty graph I made to help you locate where your DS team is:
If your company is currently scaling up, then you are probably well aware of how costly these experiments are. You likely sit in the pink-colored valley. This is where you need large GPUs, Spark, BigTable, etc. to make sense of your data. But at this stage, you don't have the luxury of leaving a 200-node Spark cluster running over the weekend. You are on a strict budget and trying to survive on it for as long as you can.
Some data scientists just want to watch the cloud burn
Okay, it's a bit unfair to expect a data scientist to run their experiments on the most optimal infrastructure. That's why you need an MLOps team. Our MLOps team has spent way too much time and effort solving for infrastructure costs, and we have realized one thing: everything is over-provisioned.
No matter how fancy your data-science stack is, under the hood, you are being charged only for the basic stuff:
- Virtual Machine (VM) Cores
- VM RAM
- Misc licensing fees (irreducible)
- Man-made horrors
You gotta know your cloud really well. Your infra cost savings will come from here.
Things you should know:
VM Cores & RAM
Any computation you run doesn't necessarily use 100% of the CPU or 100% of the RAM. Each cloud provider offers a family of VMs with different configurations of RAM and CPU. Your DS team should be exploiting this to the fullest.
Roughly, you will find three families of VMs:
- high-cpu: Bare minimum RAM. Good for microservices. Usually the cheapest.
- standard: A good balance of CPU and RAM.
- high-memory: Bare minimum number of cores. Good for Spark, databases, caches, etc. Usually the costliest.
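To make the family choice concrete, here is a rough sketch. The RAM-per-core ratios are assumptions loosely modeled on GCP's E2 shapes (high-cpu ≈ 1 GB per vCPU, standard ≈ 4, high-memory ≈ 8); check your own provider's catalog for real numbers.

```python
# Assumed GB-of-RAM-per-vCPU ratios, ordered cheapest family first.
RAM_PER_CORE = {"high-cpu": 1, "standard": 4, "high-memory": 8}

def suggest_family(cores_needed: int, ram_gb_needed: float) -> str:
    """Pick the cheapest family whose RAM-per-core ratio covers the job."""
    ratio = ram_gb_needed / cores_needed
    for family, ram_per_core in RAM_PER_CORE.items():  # cheapest first
        if ram_per_core >= ratio:
            return family
    return "high-memory"  # RAM-bound beyond all ratios: take the most RAM per core

# A Spark executor wanting 8 cores and 52 GB of RAM is RAM-heavy:
print(suggest_family(8, 52))   # high-memory
# A microservice wanting 16 cores and 12 GB of RAM is CPU-heavy:
print(suggest_family(16, 12))  # high-cpu
```

The point is simply to make the choice a deliberate calculation instead of a default.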
For example, here is the before and after of a Spark Job:
The CPU utilization for a 22-node Spark cluster that uses a 32-core VM:
It's obvious that you don't need this many cores for this particular job. You are wasting money on cores you aren't using. You can simply switch to a high-memory VM with fewer cores:
With a high-memory VM with 8 cores, the CPU utilization is immediately much better.
In the first case we were being charged for 22 × 32 = 704 cores. Even though we switched to a high-memory machine, we are charged far less due to the lower core count (22 × 8 = 176). This simple provisioning change reduces the cost of this job by more than 50%!
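The billed-core arithmetic above is easy to sanity-check:

```python
# Billed cores before and after the provisioning change described above.
nodes = 22
cores_before = nodes * 32  # 32-core VMs  -> 704 billed cores
cores_after = nodes * 8    # 8-core high-memory VMs -> 176 billed cores

print(cores_before, cores_after)                            # 704 176
print(f"{1 - cores_after / cores_before:.0%} fewer cores")  # 75% fewer cores
```

The bill likely falls by "more than 50%" rather than the full 75% because high-memory families charge more per core and per GB of RAM.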
People with in-depth Apache Spark knowledge will catch this easily. I highly urge every data engineer and data scientist to read this book by the authors of Spark: Spark: The Definitive Guide
Even if your work requires multiple VMs and you have selected the perfect VM type, I can guarantee there is still room for optimization!
See, no computation task makes full use of your compute resources for the entire duration of a run. Let's again take the example of a Spark cluster:
Continuing from the previous job, we can see that for the entire duration of the job, we are using 22 VMs.
What if you come up with a heuristic to scale down the nodes when 80% of the RAM is unused?
Boom! Turns out that for most of our job, we can get away with only 2 VMs!
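A minimal sketch of such a scale-down heuristic, assuming the 80% utilization target and the 2-node floor from the example above (the RAM-usage number itself would come from your cluster's monitoring API):

```python
import math

def target_node_count(current_nodes: int, ram_used_fraction: float,
                      target_utilization: float = 0.8, min_nodes: int = 2) -> int:
    """How many nodes keep cluster RAM near the target utilization."""
    # RAM actually in use, expressed in "node units", padded to the target.
    needed = current_nodes * ram_used_fraction / target_utilization
    return max(min_nodes, math.ceil(needed))

# 22 nodes with only ~7% of cluster RAM in use can shrink to the floor:
print(target_node_count(22, 0.07))  # 2
```

The same formula also scales back up when utilization climbs, which is what you want from a real autoscaling policy.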
Autoscaling policies have a lot of perks! One unexpected advantage: one particular job failed due to a code error and kept retrying for an hour, but autoscaling never scaled the nodes up. Without autoscaling, we would have kept that long-running failed job on 30 nodes.
Autoscaling will ensure that you don't waste money on idle resources!
Did you know that there are special types of VMs that offer the same CPU and RAM configuration at a much, much cheaper rate!?
These VMs are suited to fault-tolerant workloads. Your cloud provider can take these resources away if the compute is requested elsewhere, but here is a quote from GCP's page:
Spot VMs are highly affordable compute instances suitable for batch jobs and fault-tolerant workloads. Spot VMs offer the same machine types, options, and performance as regular compute instances. If your applications are fault tolerant and can withstand possible instance preemptions, then Spot instances can reduce your Compute Engine costs by up to 91%!
Go check with your cloud provider if they support Spot VMs!
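The catch is that your job must survive a preemption mid-run. One common pattern is to checkpoint progress so a replacement VM resumes instead of restarting; here is a minimal file-based sketch (the file name and the loop body are stand-ins for illustration; in practice the checkpoint should live on durable storage, not the VM's local disk):

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical path; use a bucket/DB in practice

def load_next_index() -> int:
    """Resume point: 0 on first run, else wherever the last VM got to."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next"]
    return 0

def run_batch(items: list) -> None:
    for i in range(load_next_index(), len(items)):
        # ... do the real work on items[i] here ...
        with open(CHECKPOINT, "w") as f:  # record progress after each item
            json.dump({"next": i + 1}, f)

run_batch(["a", "b", "c"])
# If the Spot VM is preempted mid-run, the replacement VM re-reads the
# checkpoint and continues from the first unprocessed item.
```

Batch ETL, model training with periodic checkpoints, and embarrassingly parallel jobs all fit this shape well.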
Misc VM Stuff
- VM Family
Some VM families are cheaper than others. For example, E2 VM families on GCP are much cheaper than the N1 or N2D families. Cheaper VMs are generally slower due to differences in chipsets. (This is not always the case, however: AMD VMs are cheaper than Intel VMs but significantly faster.)
- VM Location
Depending on what location you provision a VM, you will be charged different rates. (For example, VMs in Central US can be significantly cheaper than VMs in the Southeast Asian region)
- VM Storage
Say you are running a microservice on 2 VMs with some 50 GB of storage attached. You might want to ask yourself whether you really need a high-performance SSD attached to it.
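Rough arithmetic with illustrative per-GB monthly rates (the numbers below are placeholders, loosely in the range of standard vs. SSD persistent disk pricing; check your provider's page for real figures):

```python
# Hypothetical monthly per-GB rates, for illustration only.
STANDARD_PD = 0.04  # $/GB/month, spinning-disk-backed
SSD_PD = 0.17       # $/GB/month, SSD-backed

vms, gb_per_vm = 2, 50
print(f"standard: ${vms * gb_per_vm * STANDARD_PD:.2f}/mo")  # standard: $4.00/mo
print(f"ssd:      ${vms * gb_per_vm * SSD_PD:.2f}/mo")       # ssd:      $17.00/mo
```

The absolute numbers are small here, but multiply them across a fleet of services and the difference stops being a rounding error.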
Lastly, a huge portion of infrastructure cost can be optimized if you simply ask, "Should we be spending this much money (or any money at all) on this task?" Your cloud provider will NOT ask you this.
Engineers are people. And they will make mistakes. Always.
You have to set up your tech stack to give you rapid feedback when shit hits the fan. Invest in alerts.
Alerts can save you from:
- Services/Jobs that are failing silently
- Resources staying idle for long periods
- Resources provisioned with abnormally high core/RAM counts
- Stale services/jobs that you forgot to deprecate
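As one concrete example, an idle-resource alert can be as simple as a scheduled check over recent utilization averages (the data shape and the 5% threshold here are assumptions; the samples would come from your cloud's monitoring API):

```python
def find_idle_vms(avg_cpu_by_vm: dict, threshold: float = 0.05) -> list:
    """Flag VMs whose average CPU over the lookback window is below threshold."""
    return [vm for vm, cpu in avg_cpu_by_vm.items() if cpu < threshold]

# e.g. 24-hour CPU averages pulled from monitoring (hypothetical names):
samples = {"spark-worker-1": 0.62, "old-staging-vm": 0.01, "etl-runner": 0.31}
print(find_idle_vms(samples))  # ['old-staging-vm'] -> page someone or auto-stop it
```

Wire a check like this into a daily cron plus your chat/paging tool and the "VM nobody remembers" stops costing you money silently.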
Get an MLOps team, please.
- When I say "scaling", I mean that your company is running on limited investor funds or on a limited profit margin.
- A cool page on MLOps: https://www.databricks.com/glossary/mlops
Find me on Twitter for more tomfoolery: https://twitter.com/shvbsle