[NOTES] Kubernetes vs SLURM: Scheduler Architecture Notes for 2026
Introduction
When running large-scale distributed workloads, the default Kubernetes scheduler often falls short of specialized requirements. High-performance computing (HPC) environments frequently demand sophisticated scheduling strategies that Kubernetes’ built-in components cannot provide. This is where custom schedulers like those inspired by SLURM (Simple Linux Utility for Resource Management) come into play, offering fine-grained control over job placement, resource allocation, and workload management.
In this post, we’ll explore how to implement custom schedulers in Kubernetes, the lessons learned from SLURM’s approach, and practical patterns for integrating advanced scheduling logic into your Kubernetes clusters.
Understanding Default Kubernetes Scheduling
Before diving into custom schedulers, it’s important to understand Kubernetes’ default scheduling model. The built-in kube-scheduler uses a series of filter and score functions to place pods on nodes. It considers factors like:
- Resource requests and limits
- Node affinity and pod affinity rules
- Taints and tolerations
- Priority and preemption policies
While powerful, this approach treats scheduling as a relatively simple problem: fit the pod on a suitable node. It doesn't handle complex scenarios that HPC environments require, such as:
- Gang scheduling (scheduling multiple pods together atomically)
- Queue-based fair-share resource allocation
- Priority-based job preemption with backfilling
- Advanced topology awareness
- Heterogeneous resource types
What SLURM Brings to the Table
SLURM has been the de facto workload manager in HPC environments for decades. Its scheduler implements several features that make it attractive for complex workload management:
Gang Scheduling: SLURM can guarantee that all tasks in a job start together. This is critical for parallel applications that need synchronized execution.
Fair-Share Scheduling: Resources can be allocated fairly across different users and groups, with the ability to prioritize certain workloads.
Advanced Job Prioritization: SLURM uses a sophisticated priority system that considers job age, fair-share allocation, and explicit priority levels to determine scheduling order.
Backfilling: The scheduler can opportunistically run smaller jobs in gaps left by larger jobs, improving overall cluster utilization.
Topology-Aware Scheduling: SLURM is aware of node topology and can schedule tasks to optimize communication patterns.
Custom Schedulers in Kubernetes
Kubernetes provides several mechanisms for implementing custom scheduling logic:
1. Extending the Default Scheduler
The simplest approach is to use scheduler framework plugins (the scheduling framework has been stable since Kubernetes 1.19). These plugins allow you to:
- Add custom filter functions to remove unsuitable nodes
- Implement custom scoring functions to rank nodes
- Control reservation and binding behavior
Example plugin hooks include (a minimal configuration wiring plugins into these hooks is sketched after this list):
- PreFilter: Early filtering based on pod requirements
- Filter: Eliminate nodes that cannot accommodate the pod
- Score: Rank remaining nodes by suitability
- Reserve: Temporarily allocate resources
- Bind: Assign the pod to the selected node
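For illustration, here is a minimal KubeSchedulerConfiguration that wires plugins into a few of these extension points. The two custom plugin names are hypothetical out-of-tree plugins compiled into a scheduler binary built on the framework; only the overall shape of the config is the point here.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: custom-scheduler
  plugins:
    preFilter:
      enabled:
      - name: GangPreFilter                      # hypothetical out-of-tree plugin
    score:
      enabled:
      - name: TopologyScore                      # hypothetical out-of-tree plugin
        weight: 5
      disabled:
      - name: NodeResourcesBalancedAllocation    # a built-in plugin being switched off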
2. Multiple Schedulers
For more complex requirements, you can run multiple schedulers alongside the default one. Pods can specify which scheduler should handle their placement using the schedulerName field:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  schedulerName: custom-scheduler
  containers:
  - name: app
    image: myapp:latest
This allows different schedulers to manage different workload types within the same cluster.
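As a rough sketch of how the second scheduler itself might run (assuming a ConfigMap named custom-scheduler-config holds a KubeSchedulerConfiguration like the one above, and a service account with the usual scheduler RBAC exists), it can simply be deployed as a Deployment in the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler            # assumed to have scheduler RBAC
      containers:
      - name: kube-scheduler
        image: registry.k8s.io/kube-scheduler:v1.30.0 # pick a version matching your cluster
        command:
        - kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/kubernetes/custom-scheduler
      volumes:
      - name: config
        configMap:
          name: custom-scheduler-config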
3. Custom Scheduler Frameworks
Several projects have created complete custom schedulers for Kubernetes:
- Volcano: A batch scheduling system designed for HPC and big data workloads, featuring gang scheduling, fair-share allocation, and queue management (a minimal gang-scheduled Volcano job is sketched after this list)
- Kube-Batch: An earlier batch scheduler supporting job-level scheduling and reclaim strategies; it is the predecessor of Volcano and is no longer actively developed
- Karmada: A multi-cluster scheduler for distributing workloads across clusters
- Apache YuniKorn: A universal resource orchestrator that brings enterprise scheduling capabilities to Kubernetes
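As referenced above, here is roughly what a gang-scheduled Volcano job looks like. Names are illustrative; the key fields are schedulerName and minAvailable, which tells Volcano not to start any pod of the job until all four can be placed.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo                  # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 4                  # gang constraint: all 4 pods or none
  queue: default
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: myapp:latest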
Building a SLURM-Inspired Scheduler
Here’s a conceptual architecture for a SLURM-inspired Kubernetes scheduler:
Key Components
Job Queue Manager: Manages incoming jobs in queues with different priorities, similar to SLURM's job queue system (a sketch of a weighted queue object follows this list).
Priority Calculator: Determines job priority based on:
- Job age (longer waiting jobs get higher priority)
- Fair-share allocation (ensuring equitable resource access)
- Explicit priority levels
- QoS (Quality of Service) tier
Scheduling Algorithm:
- Select the highest-priority job from the queue
- Attempt to schedule all replicas/pods of the job together (gang scheduling)
- If feasible, allocate resources and create the pods
- If not feasible, check for backfill opportunities
- Schedule smaller jobs in the available slots
Resource Monitor: Continuously tracks:
- Node capacity and utilization
- Pod resource requests and actual usage
- Allocation fairness across users/groups
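To make the queue manager and fair-share pieces concrete, existing batch schedulers expose queues as first-class objects. The sketch below uses Volcano's Queue resource as a reference point (field values are illustrative); a home-grown scheduler could store an equivalent CRD of its own.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a                     # illustrative queue per team/tenant
spec:
  weight: 4                        # relative fair-share weight vs. other queues
  reclaimable: true                # resources may be reclaimed when other queues are starved
  capability:                      # hard cap on what this queue may consume
    cpu: "512"
    memory: 2048Gi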
Implementation Considerations
State Management: Use Kubernetes ConfigMaps or CRDs to store scheduler state (job queues, fair-share data, reservations). Consider using etcd directly for high-performance scenarios.
Communication: Implement webhooks or use the scheduler framework API to receive pod scheduling requests and communicate decisions.
Resource Guarantees: Use resource quotas and priority classes to enforce resource allocations across users and job priorities.
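As a small example of the quota side, a ResourceQuota can be scoped to a PriorityClass, so each priority tier gets its own budget per namespace (the names and numbers below are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-high-quota           # illustrative
  namespace: team-a                # illustrative tenant namespace
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["batch-high"]       # only pods in this priority class count against the quota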
Monitoring and Observability: Instrument the scheduler to emit metrics on:
- Scheduling latency
- Queue depth and wait times
- Backfill efficiency
- Fair-share allocation accuracy
Integration Patterns
Pattern 1: Coexistence with Default Scheduler
Run a custom scheduler for batch workloads while the default scheduler handles interactive and best-effort workloads. Use selectors and namespace-level defaults to route pods to the appropriate scheduler.
Pattern 2: Hierarchical Scheduling
Implement multiple tiers of schedulers:
- Global scheduler for cluster-wide resource allocation
- Local schedulers for namespace or tenant-specific logic
- Specialized schedulers for GPU, memory-optimized, or network-intensive workloads
Pattern 3: Preemption and Backfilling
Implement fair-share allocation with dynamic preemption (a sketch of matching priority classes follows this list):
- High-priority jobs can preempt lower-priority ones
- Preempted jobs return to the queue for rescheduling
- Backfill algorithm fills gaps with smaller jobs
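As mentioned above, one concrete way to encode the two tiers in Kubernetes is a pair of PriorityClasses: a preempting class for high-priority jobs and a non-preempting class for backfill jobs (names and values are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-high                 # illustrative
value: 100000
preemptionPolicy: PreemptLowerPriority
description: High-priority batch jobs; may evict lower tiers to claim resources.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-backfill             # illustrative
value: 1000
preemptionPolicy: Never            # backfill jobs wait for gaps instead of evicting others
description: Opportunistic backfill jobs.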
Practical Challenges and Solutions
Challenge 1: Gang Scheduling Complexity
Gang scheduling in Kubernetes is non-trivial because pod creation is asynchronous. Solution: Use the JobSet API (a kubernetes-sigs project installed as CRDs and a controller, not a core scheduler feature) or implement your own batch job CRD that coordinates multiple pods, ensuring atomic scheduling or rollback.
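A minimal JobSet sketch, assuming the v1alpha2 API and the JobSet controller installed in the cluster. JobSet handles the all-or-nothing lifecycle of the Jobs; pairing it with a gang-aware scheduler (Volcano or the coscheduling plugin) covers the placement side.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: mpi-training               # illustrative
spec:
  replicatedJobs:
  - name: workers
    replicas: 1                    # one Job, whose pods form the gang
    template:
      spec:
        parallelism: 8
        completions: 8
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: myapp:latest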
Challenge 2: Topology Awareness
Kubernetes doesn’t have built-in NUMA or advanced topology awareness. Solution: Use Pod Topology Spread Constraints combined with node labels to encode topology information.
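A hedged sketch of that approach: label nodes with whatever topology you care about (zone, rack, or a NUMA-island label you define yourself), then let a spread constraint keyed on that label influence placement. The rack label below is an assumption, not a standard Kubernetes label.

apiVersion: v1
kind: Pod
metadata:
  name: ranked-worker              # illustrative
  labels:
    job: mpi-training
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: example.com/rack  # custom node label encoding rack/NUMA topology (assumed)
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        job: mpi-training
  containers:
  - name: worker
    image: myapp:latest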
Challenge 3: Fair-Share Implementation
Implementing accurate fair-share requires fine-grained accounting. Solution: Track resource consumption per user/group, adjust priority scores dynamically based on cumulative consumption, and use preemption policies to enforce allocations.
Challenge 4: Overcommitment and Overbooking
HPC systems often overbook resources. Solution: Use request/limit separation carefully, implement admission webhooks to enforce overcommitment policies, and use PodDisruptionBudgets to manage graceful eviction.
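A small sketch of the request/limit separation plus an eviction guardrail: the gap between requests and limits is what allows overcommitment, and the PodDisruptionBudget keeps voluntary evictions graceful (names and numbers are illustrative).

apiVersion: v1
kind: Pod
metadata:
  name: burst-worker               # illustrative
  labels:
    app: burst-worker
spec:
  containers:
  - name: worker
    image: myapp:latest
    resources:
      requests:                    # what the scheduler reserves
        cpu: "2"
        memory: 4Gi
      limits:                      # what the pod may burst to (overcommit headroom)
        cpu: "4"
        memory: 8Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: burst-worker-pdb           # illustrative
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: burst-worker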
Emerging Standards
The Kubernetes community is investing in standardized scheduling components:
- Karpenter: Node autoscaling and provisioning that bin-packs pending pods onto right-sized nodes
- Gatekeeper: Policy enforcement that can integrate with scheduling decisions
- ResourceSlices: In-progress APIs from the Dynamic Resource Allocation (DRA) effort for richer representation of specialized resources
Conclusion
Custom schedulers in Kubernetes bring HPC-style workload management to containerized environments. While the default scheduler is excellent for general-purpose workloads, specialized applications benefit from SLURM-inspired features like gang scheduling, fair-share allocation, and sophisticated priority management.
The path forward depends on your specific requirements: if you need basic extensions, scheduler plugins may suffice. For complex batch workloads, consider Volcano or similar frameworks. For organizations building entirely new platforms, implementing a custom scheduler provides maximum flexibility at the cost of additional complexity.
As Kubernetes continues to evolve, we expect deeper integration of batch scheduling concepts into the core platform, potentially reducing the need for external solutions in the future.
Flashcards!
I ask Claude to make flashcards for my tech notes! You should try it too!
What is gang scheduling and why is it critical for HPC workloads?
Gang scheduling ensures that all tasks/pods in a job start together atomically. This is critical for parallel applications that need synchronized execution, as partial scheduling could lead to deadlocks or resource waste where some pods wait indefinitely for their siblings.
What are the main limitations of Kubernetes' default scheduler for HPC workloads?
The default scheduler treats scheduling as a simple "fit pod on suitable node" problem. It lacks: gang scheduling, queue-based fair-share allocation, priority-based preemption with backfilling, advanced topology awareness, and proper handling of heterogeneous resource types.
What is backfilling in the context of job scheduling?
Backfilling is when the scheduler opportunistically runs smaller jobs in gaps left by larger jobs waiting for resources. This improves overall cluster utilization without delaying higher-priority large jobs.
How do you specify a custom scheduler for a pod in Kubernetes?
Use the schedulerName field in the pod spec. For example: spec.schedulerName: custom-scheduler. This tells Kubernetes to use your custom scheduler instead of the default kube-scheduler.
What are the five main scheduler plugin hooks mentioned in the blog?
1. PreFilter: Early filtering based on pod requirements
2. Filter: Eliminate unsuitable nodes
3. Score: Rank remaining nodes by suitability
4. Reserve: Temporarily allocate resources
5. Bind: Assign the pod to the selected node
What is fair-share scheduling and how does SLURM implement it?
Fair-share scheduling allocates resources equitably across different users and groups, with the ability to prioritize certain workloads. SLURM tracks cumulative resource consumption per user/group and adjusts priority dynamically to ensure everyone gets their fair allocation over time.
Name three popular custom scheduler frameworks for Kubernetes
1. Volcano: Batch scheduling with gang scheduling and queue management
2. Yunikorn: Apache project bringing enterprise scheduling capabilities
3. Kube-Batch: Job-level scheduling with reclaim strategies
What four factors does the Priority Calculator consider in a SLURM-inspired scheduler?
1. Job age (longer waiting jobs get higher priority)
2. Fair-share allocation (ensuring equitable resource access)
3. Explicit priority levels
4. QoS (Quality of Service) tier
Why is gang scheduling particularly challenging in Kubernetes?
Pod creation in Kubernetes is asynchronous, making atomic scheduling difficult. You must coordinate multiple pods to either all schedule successfully or roll back completely. Solutions include the JobSet API (a kubernetes-sigs project installed via CRDs) or implementing custom batch job CRDs.
What is the "Coexistence with Default Scheduler" integration pattern?
Run a custom scheduler for batch workloads while the default scheduler handles interactive and best-effort workloads. Use selectors and namespace-level defaults to route pods to the appropriate scheduler based on workload type.
What storage options can be used for managing custom scheduler state?
Kubernetes ConfigMaps or Custom Resource Definitions (CRDs) for job queues, fair-share data, and reservations. For high-performance scenarios, consider using etcd directly for faster access and updates.
What is topology-aware scheduling and why does it matter?
Topology-aware scheduling places tasks to optimize communication patterns based on physical node layout, NUMA architecture, or network topology. This reduces latency and improves performance for distributed applications that communicate heavily.
How can you implement topology awareness in Kubernetes despite limited built-in support?
Use Pod Topology Spread Constraints combined with node labels to encode topology information. Label nodes with rack, zone, NUMA node, or other topology data, then use spread constraints to influence placement.
What is the five-step scheduling algorithm for a SLURM-inspired scheduler?
1. Select highest-priority job from queue
2. Attempt to schedule all replicas/pods together (gang scheduling)
3. If feasible, allocate resources and create pods
4. If not feasible, check for backfill opportunities
5. Schedule smaller jobs in available slots
What metrics should a custom scheduler emit for monitoring?
1. Scheduling latency
2. Queue depth and wait times
3. Backfill efficiency
4. Fair-share allocation accuracy
These help identify bottlenecks and ensure the scheduler is performing as expected.
What is the Hierarchical Scheduling pattern?
Implement multiple tiers of schedulers: a global scheduler for cluster-wide resource allocation, local schedulers for namespace/tenant-specific logic, and specialized schedulers for specific workload types (GPU, memory-optimized, network-intensive).
How does preemption work in the Preemption and Backfilling pattern?
High-priority jobs can preempt (evict) lower-priority ones to claim resources. Preempted jobs return to the queue for rescheduling. Meanwhile, a backfill algorithm fills resource gaps with smaller jobs that won't delay higher-priority work.
What is JobSet and how does it relate to core Kubernetes?
JobSet is an API from the kubernetes-sigs organization, installed as CRDs and a controller rather than shipped in core Kubernetes. It coordinates multiple Jobs/pods for batch workloads, enabling all-or-nothing creation and cleanup for parallel workloads.
How can you handle resource overcommitment in Kubernetes for HPC workloads?
1. Use request/limit separation carefully (requests < limits allows overcommitment)
2. Implement admission webhooks to enforce overcommitment policies
3. Use PodDisruptionBudgets to manage graceful eviction when actual usage exceeds capacity
What are the three approaches to implementing custom scheduling logic in Kubernetes?
1. Extending the Default Scheduler: Use scheduler plugins to add custom filter/score functions
2. Multiple Schedulers: Run separate schedulers for different workload types
3. Custom Scheduler Frameworks: Use complete custom schedulers like Volcano or Yunikorn
What are the four key components of a SLURM-inspired Kubernetes scheduler architecture?
1. Job Queue Manager: Manages incoming jobs in priority queues
2. Priority Calculator: Determines job priority based on age, fair-share, and QoS
3. Scheduling Algorithm: Handles gang scheduling and backfilling
4. Resource Monitor: Tracks capacity, utilization, and fair allocation
What does SLURM stand for?
Simple Linux Utility for Resource Management. It's been the de facto workload manager in HPC environments for decades.
What emerging standards in the Kubernetes community are relevant to scheduling?
1. Karpenter: Autoscaling and bin-packing optimized for containers
2. Gatekeeper: Policy enforcement that can integrate with scheduling decisions
3. ResourceSlices: In-progress APIs from the Dynamic Resource Allocation (DRA) effort for richer resource representation
Critique
The blog is open for critique. Drop me an email at shivbhosale97@gmail.com with subject: "blog critique" and I will add your points in this section. Also please point out any inaccuracies / mistakes in the essay/post. Your critique will appear in-line like this:
"quote in the blog you want to critique"