Shiv After Dark

[NOTES] Kubernetes vs SLURM: Scheduler Architecture Notes for 2026

Introduction

When running large-scale distributed workloads, the default Kubernetes scheduler often falls short of specialized requirements. High-performance computing (HPC) environments frequently demand sophisticated scheduling strategies that Kubernetes’ built-in components cannot provide. This is where custom schedulers like those inspired by SLURM (Simple Linux Utility for Resource Management) come into play, offering fine-grained control over job placement, resource allocation, and workload management.

In this post, we’ll explore how to implement custom schedulers in Kubernetes, the lessons learned from SLURM’s approach, and practical patterns for integrating advanced scheduling logic into your Kubernetes clusters.

Understanding Default Kubernetes Scheduling

Before diving into custom schedulers, it’s important to understand Kubernetes’ default scheduling model. The built-in kube-scheduler uses a series of filter and score functions to place pods on nodes. It considers factors like:

Resource requests and limits

Node affinity and anti-affinity rules

Taints and tolerations
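For orientation, all of these inputs live directly on the pod spec. A minimal example (the image name, instance type, and taint key are placeholders, not values from this post):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: myapp:latest          # placeholder image
      resources:
        requests:                  # what the scheduler reserves on a node
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
  affinity:
    nodeAffinity:                  # hard constraint on node selection
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m5.4xlarge"]   # illustrative instance type
  tolerations:
    - key: dedicated               # illustrative taint key
      value: batch
      effect: NoSchedule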

At its core, though, the default scheduler treats scheduling as a "fit this pod onto a suitable node" problem. For HPC-style workloads it lacks:

Gang scheduling (scheduling multiple pods together atomically)

Queue-based fair-share allocation

Priority-based preemption with backfilling

Advanced topology awareness

Proper handling of heterogeneous resource types

What SLURM Brings to the Table

SLURM has been the de facto workload manager in HPC environments for decades. Its scheduler implements several features that make it attractive for complex workload management:

Gang Scheduling: SLURM can guarantee that all tasks in a job start together. This is critical for parallel applications that need synchronized execution.

Fair-Share Scheduling: Resources can be allocated fairly across different users and groups, with the ability to prioritize certain workloads.

Advanced Job Prioritization: SLURM uses a sophisticated priority system that considers job age, fair-share allocation, and explicit priority levels to determine scheduling order.

Backfilling: The scheduler can opportunistically run smaller jobs in gaps left by larger jobs, improving overall cluster utilization.

Topology-Aware Scheduling: SLURM is aware of node topology and can schedule tasks to optimize communication patterns.

Custom Schedulers in Kubernetes

Kubernetes provides several mechanisms for implementing custom scheduling logic:

1. Extending the Default Scheduler

The simplest approach is to use scheduler plugins (introduced in Kubernetes 1.16). Plugins let you add custom filter and score logic at well-defined extension points in the scheduling framework:

PreFilter: Early filtering based on pod requirements

Filter: Eliminate unsuitable nodes

Score: Rank remaining nodes by suitability

Reserve: Temporarily allocate resources

Bind: Assign the pod to the selected node
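Profiles for these plugins are declared in a KubeSchedulerConfiguration. As a sketch, a profile might disable a default scoring plugin and bias toward bin-packing (the plugin names and MostAllocated strategy reflect recent kube-scheduler options; verify them against the version you actually run):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: batch-scheduler               # illustrative profile name
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation   # drop spread-oriented scoring
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated                  # prefer packing nodes tightly
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1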

2. Multiple Schedulers

For more complex requirements, you can run multiple schedulers alongside the default one. Pods can specify which scheduler should handle their placement using the schedulerName field:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  schedulerName: custom-scheduler
  containers:
    - name: app
      image: myapp:latest

This allows different schedulers to manage different workload types within the same cluster.
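One common way to run that second scheduler is as a Deployment of the stock kube-scheduler binary pointed at its own configuration. A rough sketch, where the image tag, ConfigMap name, and service account (which needs the usual scheduler RBAC) are all assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-scheduler
  template:
    metadata:
      labels:
        app: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler              # assumes scheduler RBAC is bound
      containers:
        - name: kube-scheduler
          image: registry.k8s.io/kube-scheduler:v1.30.0   # illustrative tag; match your cluster version
          command:
            - kube-scheduler
            - --config=/etc/kubernetes/custom-scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes
      volumes:
        - name: config
          configMap:
            name: custom-scheduler-config               # holds the scheduler profile configuration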

3. Custom Scheduler Frameworks

Several projects have created complete custom schedulers for Kubernetes:

Volcano: A batch scheduling system designed for HPC and big data workloads, featuring gang scheduling, fair-share allocation, and queue management (a gang-scheduling sketch follows this list)

YuniKorn: An Apache project bringing enterprise scheduling capabilities such as hierarchical queues and fair-share policies to Kubernetes

Kube-Batch: Job-level scheduling with reclaim strategies
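To make the gang-scheduling idea concrete, here is an illustrative Volcano job. The field names follow Volcano's batch.volcano.sh API as I recall it, and the queue name, resource sizes, and image are placeholders; treat the exact API version and fields as assumptions to check against the Volcano docs:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-training
spec:
  schedulerName: volcano
  minAvailable: 4              # gang scheduling: start only when all 4 pods can be placed
  queue: research              # assumes a "research" queue has been created
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: myapp:latest      # placeholder image from the earlier pod example
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi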

Building a SLURM-Inspired Scheduler

Here’s a conceptual architecture for a SLURM-inspired Kubernetes scheduler:

Key Components

Job Queue Manager: Manages incoming jobs in queues with different priorities, similar to SLURM’s job queue system.

Priority Calculator: Determines job priority based on:

  1. Job age (longer waiting jobs get higher priority)
  2. Fair-share allocation (ensuring equitable resource access)
  3. Explicit priority levels
  4. QoS (Quality of Service) tier

Scheduling Algorithm: For each scheduling cycle:

  1. Select the highest-priority job from the queue
  2. Attempt to schedule all replicas/pods of the job together (gang scheduling)
  3. If feasible, allocate resources and create pods
  4. If not feasible, check for backfill opportunities
  5. Schedule smaller jobs in available slots

Resource Monitor: Continuously tracks node capacity and utilization, along with per-user and per-group consumption for fair-share accounting.

Implementation Considerations

State Management: Use Kubernetes ConfigMaps or CRDs to store scheduler state (job queues, fair-share data, reservations). Consider using etcd directly for high-performance scenarios.
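For example, queue and fair-share state could live in a custom resource. The BatchQueue object below is purely hypothetical (the API group, kind, and fields are invented for illustration, not an existing project):

apiVersion: scheduling.example.com/v1alpha1   # hypothetical API group
kind: BatchQueue
metadata:
  name: research
spec:
  weight: 40                   # fair-share weight relative to other queues
  maxRunningJobs: 50
status:
  pendingJobs: 12
  runningJobs: 8
  usage:                       # cumulative consumption feeding fair-share decay
    cpuHours: 1840
    gpuHours: 96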

Communication: Implement webhooks or use the scheduler framework API to receive pod scheduling requests and communicate decisions.

Resource Guarantees: Use resource quotas and priority classes to enforce resource allocations across users and job priorities.
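Both of these are standard Kubernetes objects. A minimal sketch (the names, priority value, and quota numbers are placeholders):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-high
value: 100000
preemptionPolicy: PreemptLowerPriority    # pods in this class may evict lower-priority pods
globalDefault: false
description: "High-priority batch jobs"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a                        # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "16"          # quota on an extended resource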

Monitoring and Observability: Instrument the scheduler to emit metrics on:

Scheduling latency

Queue depth and wait times

Backfill efficiency

Fair-share allocation accuracy

These metrics help identify bottlenecks and confirm the scheduler is performing as expected.

Integration Patterns

Pattern 1: Coexistence with Default Scheduler

Run a custom scheduler for batch workloads while the default scheduler handles interactive and best-effort workloads. Use selectors and namespace-level defaults to route pods to the appropriate scheduler.

Pattern 2: Hierarchical Scheduling

Implement multiple tiers of schedulers:

A global scheduler for cluster-wide resource allocation

Local schedulers for namespace- or tenant-specific logic

Specialized schedulers for specific workload types (GPU, memory-optimized, network-intensive)

Pattern 3: Preemption and Backfilling

Implement fair-share allocation with dynamic preemption:

High-priority jobs can preempt (evict) lower-priority ones to claim resources

Preempted jobs return to the queue for rescheduling

A backfill algorithm fills resource gaps with smaller jobs that won't delay higher-priority work

Practical Challenges and Solutions

Challenge 1: Gang Scheduling Complexity

Gang scheduling in Kubernetes is non-trivial because pod creation is asynchronous. Solution: Use the JobSet API (a Kubernetes SIGs project installed as a CRD) or implement your own batch job CRD that coordinates multiple pods, ensuring atomic scheduling or rollback.
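A sketch of a JobSet manifest; the jobset.x-k8s.io API group and field layout are from the SIG project as I recall them, so verify the version and schema against the JobSet release you install:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-train
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:                      # a batch/v1 Job template
        spec:
          parallelism: 8             # all 8 worker pods belong to one logical job
          completions: 8
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: worker
                  image: myapp:latest    # placeholder image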

Challenge 2: Topology Awareness

Kubernetes doesn’t have built-in NUMA or advanced topology awareness. Solution: Use Pod Topology Spread Constraints combined with node labels to encode topology information.
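A minimal sketch of the spread-constraint approach, assuming nodes have been labeled with a rack identifier (topology.example.com/rack is a made-up label key; any consistent node label works):

apiVersion: v1
kind: Pod
metadata:
  name: ring-worker
  labels:
    job: ring-allreduce
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.example.com/rack    # hypothetical label encoding rack placement
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          job: ring-allreduce
  containers:
    - name: worker
      image: myapp:latest                        # placeholder image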

Challenge 3: Fair-Share Implementation

Implementing accurate fair-share requires fine-grained accounting. Solution: Track resource consumption per user/group, adjust priority scores dynamically based on cumulative consumption, and use preemption policies to enforce allocations.

Challenge 4: Overcommitment and Overbooking

HPC systems often overbook resources. Solution: Use request/limit separation carefully, implement admission webhooks to enforce overcommitment policies, and use PodDisruptionBudgets to manage graceful eviction.
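As a sketch, requests below limits express the overcommitment at the pod level, and a PodDisruptionBudget bounds how aggressively pods may be evicted (names and numbers are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: analysis-task
  labels:
    app: analysis
spec:
  containers:
    - name: main
      image: myapp:latest           # placeholder image
      resources:
        requests:                   # what the scheduler guarantees
          cpu: "2"
          memory: 4Gi
        limits:                     # burst headroom enabling overcommitment
          cpu: "8"
          memory: 16Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: analysis-pdb
spec:
  minAvailable: "80%"               # keep most pods running during voluntary evictions
  selector:
    matchLabels:
      app: analysis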

Emerging Standards

The Kubernetes community is investing in standardized scheduling components:

Karpenter: Autoscaling and bin-packing optimized for containerized workloads

Gatekeeper: Policy enforcement that can integrate with scheduling decisions

Resource Slices: Work-in-progress APIs for better resource model representation

Conclusion

Custom schedulers in Kubernetes bring HPC-style workload management to containerized environments. While the default scheduler is excellent for general-purpose workloads, specialized applications benefit from SLURM-inspired features like gang scheduling, fair-share allocation, and sophisticated priority management.

The path forward depends on your specific requirements: if you need basic extensions, scheduler plugins may suffice. For complex batch workloads, consider Volcano or similar frameworks. For organizations building entirely new platforms, implementing a custom scheduler provides maximum flexibility at the cost of additional complexity.

As Kubernetes continues to evolve, we expect deeper integration of batch scheduling concepts into the core platform, potentially reducing the need for external solutions in the future.


Flashcards!

I ask Claude to make flashcards for my tech notes! You should try it too!

What is gang scheduling and why is it critical for HPC workloads?
Gang scheduling ensures that all tasks/pods in a job start together atomically. This is critical for parallel applications that need synchronized execution, as partial scheduling could lead to deadlocks or resource waste where some pods wait indefinitely for their siblings.
What are the main limitations of Kubernetes' default scheduler for HPC workloads?
The default scheduler treats scheduling as a simple "fit pod on suitable node" problem. It lacks: gang scheduling, queue-based fair-share allocation, priority-based preemption with backfilling, advanced topology awareness, and proper handling of heterogeneous resource types.
What is backfilling in the context of job scheduling?
Backfilling is when the scheduler opportunistically runs smaller jobs in gaps left by larger jobs waiting for resources. This improves overall cluster utilization without delaying higher-priority large jobs.
How do you specify a custom scheduler for a pod in Kubernetes?
Use the schedulerName field in the pod spec. For example: spec.schedulerName: custom-scheduler. This tells Kubernetes to use your custom scheduler instead of the default kube-scheduler.
What are the five main scheduler plugin hooks mentioned in the blog?
1. PreFilter: Early filtering based on pod requirements
2. Filter: Eliminate unsuitable nodes
3. Score: Rank remaining nodes by suitability
4. Reserve: Temporarily allocate resources
5. Bind: Assign the pod to the selected node
What is fair-share scheduling and how does SLURM implement it?
Fair-share scheduling allocates resources equitably across different users and groups, with the ability to prioritize certain workloads. SLURM tracks cumulative resource consumption per user/group and adjusts priority dynamically to ensure everyone gets their fair allocation over time.
Name three popular custom scheduler frameworks for Kubernetes
1. Volcano: Batch scheduling with gang scheduling and queue management
2. Yunikorn: Apache project bringing enterprise scheduling capabilities
3. Kube-Batch: Job-level scheduling with reclaim strategies
What four factors does the Priority Calculator consider in a SLURM-inspired scheduler?
1. Job age (longer waiting jobs get higher priority)
2. Fair-share allocation (ensuring equitable resource access)
3. Explicit priority levels
4. QoS (Quality of Service) tier
Why is gang scheduling particularly challenging in Kubernetes?
Pod creation in Kubernetes is asynchronous, making atomic scheduling difficult. You must coordinate multiple pods to either all schedule successfully or roll back completely. Solutions include using the JobSet API (a Kubernetes SIGs project) or implementing custom batch job CRDs.
What is the "Coexistence with Default Scheduler" integration pattern?
Run a custom scheduler for batch workloads while the default scheduler handles interactive and best-effort workloads. Use selectors and namespace-level defaults to route pods to the appropriate scheduler based on workload type.
What storage options can be used for managing custom scheduler state?
Kubernetes ConfigMaps or Custom Resource Definitions (CRDs) for job queues, fair-share data, and reservations. For high-performance scenarios, consider using etcd directly for faster access and updates.
What is topology-aware scheduling and why does it matter?
Topology-aware scheduling places tasks to optimize communication patterns based on physical node layout, NUMA architecture, or network topology. This reduces latency and improves performance for distributed applications that communicate heavily.
How can you implement topology awareness in Kubernetes despite limited built-in support?
Use Pod Topology Spread Constraints combined with node labels to encode topology information. Label nodes with rack, zone, NUMA node, or other topology data, then use spread constraints to influence placement.
What is the five-step scheduling algorithm for a SLURM-inspired scheduler?
1. Select highest-priority job from queue
2. Attempt to schedule all replicas/pods together (gang scheduling)
3. If feasible, allocate resources and create pods
4. If not feasible, check for backfill opportunities
5. Schedule smaller jobs in available slots
What metrics should a custom scheduler emit for monitoring?
1. Scheduling latency
2. Queue depth and wait times
3. Backfill efficiency
4. Fair-share allocation accuracy
These help identify bottlenecks and ensure the scheduler is performing as expected.
What is the Hierarchical Scheduling pattern?
Implement multiple tiers of schedulers: a global scheduler for cluster-wide resource allocation, local schedulers for namespace/tenant-specific logic, and specialized schedulers for specific workload types (GPU, memory-optimized, network-intensive).
How does preemption work in the Preemption and Backfilling pattern?
High-priority jobs can preempt (evict) lower-priority ones to claim resources. Preempted jobs return to the queue for rescheduling. Meanwhile, a backfill algorithm fills resource gaps with smaller jobs that won't delay higher-priority work.
What is JobSet and where does it come from?
JobSet is an API from the Kubernetes SIGs project (kubernetes-sigs/jobset), installed as a CRD rather than shipped in core Kubernetes. It coordinates multiple pods/Jobs for batch workloads, enabling all-or-nothing pod creation or rollback for parallel workloads.
How can you handle resource overcommitment in Kubernetes for HPC workloads?
1. Use request/limit separation carefully (requests < limits allows overcommitment)
2. Implement admission webhooks to enforce overcommitment policies
3. Use PodDisruptionBudgets to manage graceful eviction when actual usage exceeds capacity
What are the three approaches to implementing custom scheduling logic in Kubernetes?
1. Extending the Default Scheduler: Use scheduler plugins to add custom filter/score functions
2. Multiple Schedulers: Run separate schedulers for different workload types
3. Custom Scheduler Frameworks: Use complete custom schedulers like Volcano or Yunikorn
What are the four key components of a SLURM-inspired Kubernetes scheduler architecture?
1. Job Queue Manager: Manages incoming jobs in priority queues
2. Priority Calculator: Determines job priority based on age, fair-share, and QoS
3. Scheduling Algorithm: Handles gang scheduling and backfilling
4. Resource Monitor: Tracks capacity, utilization, and fair allocation
What does SLURM stand for?
Simple Linux Utility for Resource Management. It's been the de facto workload manager in HPC environments for decades.
What emerging standards in the Kubernetes community are relevant to scheduling?
1. Karpenter: Autoscaling and bin-packing optimized for containers
2. Gatekeeper: Policy enforcement that can integrate with scheduling decisions
3. Resource Slices: Work-in-progress APIs for better resource model representation

Critique

The blog is open for critique. Drop me an email at shivbhosale97@gmail.com with subject: "blog critique" and I will add your points in this section. Also please point out any inaccuracies / mistakes in the essay/post. Your critique will appear in-line like this:

"quote in the blog you want to critique"
Expand to find the critique here

#distributed-systems #gang-scheduling #hpc #infrastructure #kubernetes #kubernetes-at-scale #notes #slurm