Spring Boot + Kubernetes Operator — Building a Custom Operator in Java

Contents
The Operator Pattern
Dependency & Project Setup
Custom Resource Definition (CRD)
The Reconciler Loop
Status Conditions & Events
Error Handling & Retry
Dependent Resources
Packaging & Deployment

The Operator pattern follows a simple control loop: observe → diff → act. The operator watches for changes to its custom resources (desired state), compares them with the actual state of the cluster, and takes actions to converge the two. This loop runs continuously — it is not a one-shot job.

Custom Resource (CR) — a YAML/JSON object with your schema (e.g., a DatabaseBackup specifying schedule, target database, retention).
Custom Resource Definition (CRD) — the schema registered with Kubernetes that makes the API server accept your CR type.
Reconciler — the Java class that receives CR change events and drives actual state toward desired state (creates Jobs, updates Secrets, monitors progress).
Controller Manager — the operator process that watches the Kubernetes API, queues events, and invokes the Reconciler.

Operators encode operational knowledge in code: the same steps a human would perform to back up, restore, scale, or upgrade an application. Once deployed, the operator performs these tasks autonomously and consistently.
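As a rough illustration of the observe → diff → act loop described above, the following plain-Java sketch models desired and actual state as maps from resource name to schedule. The class and method names are hypothetical and no Kubernetes API is involved; in a real operator JOSDK supplies the watching and queueing around this kind of logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of observe -> diff -> act; all names are illustrative. */
final class ControlLoopSketch {

    /** Diff: lists the actions needed to make actual state match desired state. */
    static List<String> diff(Map<String, String> desired, Map<String, String> actual) {
        List<String> actions = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order.
        for (Map.Entry<String, String> e : new TreeMap<>(desired).entrySet()) {
            String current = actual.get(e.getKey());
            if (current == null) {
                actions.add("CREATE " + e.getKey());
            } else if (!current.equals(e.getValue())) {
                actions.add("UPDATE " + e.getKey());
            }
        }
        for (String name : new TreeMap<>(actual).keySet()) {
            if (!desired.containsKey(name)) {
                actions.add("DELETE " + name);
            }
        }
        return actions;
    }

    /** Act: applies the diff to a copy of the actual state and returns the result. */
    static Map<String, String> act(Map<String, String> desired, Map<String, String> actual) {
        Map<String, String> next = new TreeMap<>(actual);
        for (String action : diff(desired, actual)) {
            String[] parts = action.split(" ", 2);
            if (parts[0].equals("DELETE")) {
                next.remove(parts[1]);
            } else {
                next.put(parts[1], desired.get(parts[1]));
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, String> desired = Map.of("nightly-backup", "0 2 * * *");
        Map<String, String> actual =
                Map.of("nightly-backup", "0 3 * * *", "stale-backup", "0 5 * * *");
        System.out.println(diff(desired, actual));
        // After acting, a second pass finds nothing left to do.
        System.out.println(diff(desired, act(desired, actual)));
    }
}
```

Running the diff again after acting yields an empty action list, which is the convergence property the loop relies on.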

The Java Operator SDK (JOSDK) provides the controller-manager runtime, event watching, caching, and Spring Boot integration. The operator-framework-spring-boot-starter auto-configures the operator; when it runs inside a pod, the underlying Fabric8 client picks up the cluster credentials from the pod's ServiceAccount.

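    <dependencies>
      <dependency>
        <groupId>io.javaoperatorsdk</groupId>
        <artifactId>operator-framework-spring-boot-starter</artifactId>
        <version>4.9.2</version>
      </dependency>
      <!-- No version here: it must come from dependencyManagement (for example a BOM).
           The starter also brings in the Fabric8 client transitively. -->
      <dependency>
        <groupId>io.fabric8</groupId>
        <artifactId>kubernetes-client</artifactId>
      </dependency>
    </dependencies>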

    # application.yml
    javaoperatorsdk:
      operator-name: database-backup-operator
      # Namespace to watch; an empty string means all namespaces
      namespaces: ""
      # Leader election prevents multiple operator pods from reconciling simultaneously
      leader-election:
        enabled: true
        lease-name: database-backup-operator-leader

A CRD exists in two places: as a Java class hierarchy, from which the CRD generator used in JOSDK projects (provided by Fabric8) produces the YAML, and in the Kubernetes cluster, where that YAML must be applied before the API server accepts resources of that type.

    // Each public class below lives in its own .java file.
    import io.fabric8.kubernetes.api.model.Condition;
    import io.fabric8.kubernetes.api.model.Namespaced;
    import io.fabric8.kubernetes.client.CustomResource;
    import io.fabric8.kubernetes.model.annotation.Group;
    import io.fabric8.kubernetes.model.annotation.ShortNames;
    import io.fabric8.kubernetes.model.annotation.Version;
    import java.util.ArrayList;
    import java.util.List;

    // The spec: desired state declared by the user
    public class DatabaseBackupSpec {
        private String schedule;      // cron expression: "0 2 * * *"
        private String databaseRef;   // name of a Secret containing DB credentials
        private String s3Bucket;      // target S3 bucket for backup files
        private int retentionDays;    // how long to keep backup files
        // getters + setters
    }

    // The status: actual state written by the operator
    public class DatabaseBackupStatus {
        private String phase;         // Pending, Running, Succeeded, Failed
        private String lastBackupTime;
        private String lastBackupFile;
        private String message;
        private List<Condition> conditions = new ArrayList<>();
        // getters + setters
    }

    // The Custom Resource: combines spec + status with Kubernetes metadata
    @Group("backup.cscode.io")
    @Version("v1")
    @ShortNames("dbb")
    public class DatabaseBackup
            extends CustomResource<DatabaseBackupSpec, DatabaseBackupStatus>
            implements Namespaced {
        // Kubernetes metadata (name, namespace, labels, etc.) is inherited from CustomResource
    }

    # Generate CRD YAML from the Java class. This assumes the Fabric8 CRD generator is
    # set up as an annotation processor (crd-generator-apt), so it runs during compilation.
    mvn compile   # writes target/classes/META-INF/fabric8/*.yml

    # Apply the CRD to the cluster
    kubectl apply -f target/classes/META-INF/fabric8/databasebackups.backup.cscode.io-v1.yml

    # Verify the CRD is registered
    kubectl get crd databasebackups.backup.cscode.io
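Because the spec is written by users, an operator usually checks it before acting on it. The sketch below shows one way this could look in plain Java. The record is a stand-in for DatabaseBackupSpec, and the validation rules are assumptions made for illustration, not rules taken from the operator described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical spec validation; rules and names are illustrative assumptions. */
final class SpecValidationSketch {

    /** Stand-in for DatabaseBackupSpec, reduced to a record for brevity. */
    record BackupSpec(String schedule, String databaseRef, String s3Bucket, int retentionDays) {}

    /** Returns a list of human-readable problems; an empty list means the spec looks usable. */
    static List<String> validate(BackupSpec spec) {
        List<String> errors = new ArrayList<>();
        // Only the field count of a five-field cron expression is checked here.
        // Macros such as "@daily" and per-field ranges are not handled in this sketch.
        if (spec.schedule() == null || spec.schedule().trim().split("\\s+").length != 5) {
            errors.add("schedule must be a five-field cron expression");
        }
        if (spec.databaseRef() == null || spec.databaseRef().isBlank()) {
            errors.add("databaseRef must name a Secret");
        }
        if (spec.s3Bucket() == null || spec.s3Bucket().isBlank()) {
            errors.add("s3Bucket must not be empty");
        }
        if (spec.retentionDays() < 1) {
            errors.add("retentionDays must be at least 1");
        }
        return errors;
    }

    public static void main(String[] args) {
        BackupSpec ok = new BackupSpec("0 2 * * *", "prod-db-secret", "my-backups-bucket", 30);
        BackupSpec bad = new BackupSpec("0 2 * *", "", "my-backups-bucket", 0);
        System.out.println(validate(ok));
        System.out.println(validate(bad));
    }
}
```

A reconciler could surface these messages in the status instead of throwing, so that users see why a resource is not being acted on.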

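The Reconciler<T> interface centres on one method, reconcile(), which JOSDK calls whenever a relevant event occurs: the CR is created or updated, or a timer re-queues it. Deletion is handled separately: if the reconciler also implements Cleaner<T>, JOSDK adds a finalizer to the CR and calls cleanup() when the CR is deleted. reconcile() must be idempotent: it may be called multiple times for the same state, and should produce the same result each time.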

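    import static io.javaoperatorsdk.operator.api.reconciler.Constants.WATCH_ALL_NAMESPACES;

    import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder;
    import io.fabric8.kubernetes.api.model.batch.v1.CronJob;
    import io.fabric8.kubernetes.api.model.batch.v1.Job;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.javaoperatorsdk.operator.api.reconciler.*;  // Reconciler, Cleaner, Context, UpdateControl, ...
    import java.time.Duration;
    import java.util.List;
    import java.util.Optional;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @ControllerConfiguration(
        namespaces = WATCH_ALL_NAMESPACES,
        name = "database-backup-reconciler"
    )
    @Component
    public class DatabaseBackupReconciler
            implements Reconciler<DatabaseBackup>, Cleaner<DatabaseBackup> {

        private static final Logger log = LoggerFactory.getLogger(DatabaseBackupReconciler.class);

        private final KubernetesClient kubernetesClient;
        private final BackupJobService backupJobService;

        public DatabaseBackupReconciler(KubernetesClient kubernetesClient,
                                        BackupJobService backupJobService) {
            this.kubernetesClient = kubernetesClient;
            this.backupJobService = backupJobService;
        }

        // Helpers not shown (findLatestBackupJob, isJobCompleted, isJobFailed,
        // updateStatusSucceeded, updateStatusFailed, deleteOwnedCronJob) are application code.

        @Override
        public UpdateControl<DatabaseBackup> reconcile(
                DatabaseBackup backup, Context<DatabaseBackup> context) {

            log.info("Reconciling DatabaseBackup {}/{}",
                backup.getMetadata().getNamespace(),
                backup.getMetadata().getName());

            DatabaseBackupSpec spec = backup.getSpec();

            // 1. Ensure the backup CronJob exists (this version creates it if missing
            //    but does not update an existing CronJob when the schedule changes)
            ensureCronJobExists(backup, spec);

            // 2. Inspect the most recent backup Job, if there is one
            Optional<Job> latestJob = findLatestBackupJob(backup);
            if (latestJob.isPresent()) {
                Job job = latestJob.get();
                if (isJobCompleted(job)) {
                    updateStatusSucceeded(backup, job);
                    return UpdateControl.patchStatus(backup);
                } else if (isJobFailed(job)) {
                    updateStatusFailed(backup, job);
                    // Re-schedule check after 1 minute
                    return UpdateControl.patchStatus(backup)
                        .rescheduleAfter(Duration.ofMinutes(1));
                }
                // Still running: check again in 30 seconds. The explicit type argument is
                // needed because the chained call prevents inference from the return type.
                return UpdateControl.<DatabaseBackup>noUpdate()
                    .rescheduleAfter(Duration.ofSeconds(30));
            }

            // 3. No backup Job yet: update status to Pending
            backup.getStatus().setPhase("Pending");
            return UpdateControl.patchStatus(backup);
        }

        @Override
        public DeleteControl cleanup(DatabaseBackup backup, Context<DatabaseBackup> context) {
            // CR is being deleted: clean up owned resources. The owner reference set below
            // would also let Kubernetes garbage-collect the CronJob.
            deleteOwnedCronJob(backup);
            log.info("Cleaned up resources for DatabaseBackup {}", backup.getMetadata().getName());
            return DeleteControl.defaultDelete();
        }

        private void ensureCronJobExists(DatabaseBackup backup, DatabaseBackupSpec spec) {
            String cronJobName = backup.getMetadata().getName() + "-backup";
            String namespace = backup.getMetadata().getNamespace();

            boolean exists = kubernetesClient.batch().v1().cronjobs()
                .inNamespace(namespace).withName(cronJobName).get() != null;

            if (!exists) {
                CronJob cronJob = backupJobService.buildCronJob(backup, cronJobName);
                // Set owner reference so the CronJob is deleted with the CR
                cronJob.getMetadata().setOwnerReferences(List.of(
                    new OwnerReferenceBuilder()
                        .withName(backup.getMetadata().getName())
                        .withApiVersion(backup.getApiVersion())
                        .withKind(backup.getKind())
                        .withUid(backup.getMetadata().getUid())
                        .withController(true)
                        .withBlockOwnerDeletion(true)
                        .build()
                ));
                kubernetesClient.batch().v1().cronjobs()
                    .inNamespace(namespace).resource(cronJob).create();
                log.info("Created CronJob {}/{}", namespace, cronJobName);
            }
        }
    }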
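The branching in reconcile() above can be read as a pure function from the observed Job state to a status update and an optional re-queue delay. The sketch below isolates that mapping in plain Java so the idempotency requirement is easy to see. The enum, record, and method names are hypothetical; the 30-second and 1-minute delays are the ones used in the reconciler.

```java
import java.time.Duration;
import java.util.Optional;

/** Pure-function view of the reconciler's branching; all names are illustrative. */
final class ReconcileDecisionSketch {

    enum JobState { NONE, RUNNING, SUCCEEDED, FAILED }

    /** The phase to write to status (if any) and the delay before the next check (if any). */
    record Decision(Optional<String> phaseUpdate, Optional<Duration> requeueAfter) {}

    static Decision decide(JobState job) {
        return switch (job) {
            // Finished successfully: record it, no further polling needed.
            case SUCCEEDED -> new Decision(Optional.of("Succeeded"), Optional.empty());
            // Failed: record it and look again in one minute.
            case FAILED -> new Decision(Optional.of("Failed"), Optional.of(Duration.ofMinutes(1)));
            // Still running: leave status untouched and poll again in 30 seconds.
            case RUNNING -> new Decision(Optional.empty(), Optional.of(Duration.ofSeconds(30)));
            // No Job yet: mark the resource as pending.
            case NONE -> new Decision(Optional.of("Pending"), Optional.empty());
        };
    }

    public static void main(String[] args) {
        for (JobState state : JobState.values()) {
            System.out.println(state + " -> " + decide(state));
        }
    }
}
```

Because the decision depends only on the observed state, calling it twice with the same input gives the same answer, which is what makes repeated reconcile() calls safe.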

Status conditions are the standard way to communicate operator state to users and other tools (e.g., kubectl wait --for=condition=Ready). Each condition has a type, status, reason, and message — following Kubernetes conventions.

    // Uses ConditionBuilder, EventBuilder and ObjectReferenceBuilder from
    // io.fabric8.kubernetes.api.model, plus java.time.ZonedDateTime and DateTimeFormatter.

    // Condition types for DatabaseBackup
    public enum BackupConditionType {
        SCHEDULE_VALID,     // CronJob expression is valid and created
        BACKUP_RUNNING,     // a backup Job is currently active
        BACKUP_SUCCEEDED,   // last backup completed successfully
        BACKUP_FAILED       // last backup failed
    }

    // Helper (inside the reconciler) to set a condition, replacing any existing one of the same type
    private void setCondition(DatabaseBackup backup, BackupConditionType type,
                              boolean isTrue, String reason, String message) {
        Condition condition = new ConditionBuilder()
            .withType(type.name())
            .withStatus(isTrue ? "True" : "False")
            .withReason(reason)
            .withMessage(message)
            .withLastTransitionTime(ZonedDateTime.now().format(DateTimeFormatter.ISO_OFFSET_DATE_TIME))
            .build();

        List<Condition> conditions = backup.getStatus().getConditions();
        conditions.removeIf(c -> c.getType().equals(type.name()));
        conditions.add(condition);
    }

    // Emit a Kubernetes Event (visible in `kubectl describe databasebackup`)
    private void emitEvent(DatabaseBackup backup, String reason, String message) {
        String namespace = backup.getMetadata().getNamespace();
        kubernetesClient.v1().events().inNamespace(namespace)
            .resource(new EventBuilder()
                .withNewMetadata()
                    .withGenerateName(backup.getMetadata().getName() + "-")
                    .withNamespace(namespace)
                .endMetadata()
                .withReason(reason)
                .withMessage(message)
                .withType("Normal")
                // apiVersion and uid identify the exact object the event belongs to
                .withInvolvedObject(new ObjectReferenceBuilder()
                    .withApiVersion(backup.getApiVersion())
                    .withKind(backup.getKind())
                    .withName(backup.getMetadata().getName())
                    .withNamespace(namespace)
                    .withUid(backup.getMetadata().getUid())
                    .build())
                .build())
            .create();
    }
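The helper above stamps a new lastTransitionTime on every call. Kubernetes API conventions describe that field as the time the status last changed, so a helper would usually keep the old timestamp when the status is the same as before. A minimal plain-Java sketch of that refinement follows; the Condition record is a stand-in for the Fabric8 class and the method name is hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a condition upsert that keeps lastTransitionTime when status is unchanged. */
final class ConditionUpsertSketch {

    /** Stand-in for io.fabric8.kubernetes.api.model.Condition. */
    record Condition(String type, String status, String reason, String message,
                     Instant lastTransitionTime) {}

    /** Returns a new list in which the condition of the given type is replaced or added. */
    static List<Condition> upsert(List<Condition> conditions, String type, boolean isTrue,
                                  String reason, String message, Instant now) {
        String status = isTrue ? "True" : "False";
        Instant transition = now;
        List<Condition> result = new ArrayList<>();
        for (Condition c : conditions) {
            if (c.type().equals(type)) {
                // Same status as before: this is not a transition, keep the old timestamp.
                if (c.status().equals(status)) {
                    transition = c.lastTransitionTime();
                }
            } else {
                result.add(c);
            }
        }
        result.add(new Condition(type, status, reason, message, transition));
        return result;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T02:00:00Z");
        Instant t1 = Instant.parse("2024-01-02T02:00:00Z");
        List<Condition> first = upsert(List.of(), "BACKUP_SUCCEEDED", true, "JobCompleted", "ok", t0);
        List<Condition> second = upsert(first, "BACKUP_SUCCEEDED", true, "JobCompleted", "ok", t1);
        System.out.println(second.get(0).lastTransitionTime()); // still t0: status did not change
    }
}
```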

When a reconciler throws an exception, JOSDK retries with exponential backoff by default. You can tune that strategy with the @GradualRetry annotation (maximum attempts, initial interval, multiplier, maximum interval) or supply your own Retry implementation.

    // @GradualRetry comes from io.javaoperatorsdk.operator.processing.retry; intervals are in milliseconds.
    // TransientException and PermanentException are application-defined unchecked exceptions.
    @ControllerConfiguration
    @GradualRetry(maxAttempts = 5, initialInterval = 5000, intervalMultiplier = 2.0, maxInterval = 60000)
    public class DatabaseBackupReconciler implements Reconciler<DatabaseBackup> {

        @Override
        public UpdateControl<DatabaseBackup> reconcile(DatabaseBackup backup,
                                                       Context<DatabaseBackup> context) {
            try {
                return doReconcile(backup, context);
            } catch (TransientException e) {
                // Rethrow: JOSDK will retry with backoff
                log.warn("Transient error reconciling {}, will retry: {}",
                    backup.getMetadata().getName(), e.getMessage());
                throw e;
            } catch (PermanentException e) {
                // Don't rethrow: update status to Failed and stop retrying
                log.error("Permanent error for {}: {}",
                    backup.getMetadata().getName(), e.getMessage());
                backup.getStatus().setPhase("Failed");
                backup.getStatus().setMessage(e.getMessage());
                return UpdateControl.patchStatus(backup);
            }
        }
    }
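To make the retry parameters concrete, the sketch below computes the delay sequence that an exponential backoff with a cap would produce for the values used above (5 attempts, 5000 ms initial interval, multiplier 2.0, 60000 ms maximum). It is plain arithmetic under the assumption that the first retry waits the initial interval and each later one multiplies the previous delay; it does not call JOSDK, whose exact scheduling may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes a capped exponential backoff schedule; a sketch, not JOSDK's implementation. */
final class BackoffSketch {

    static List<Long> delaysMillis(int maxAttempts, long initialInterval,
                                   double multiplier, long maxInterval) {
        List<Long> delays = new ArrayList<>();
        double next = initialInterval;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Never wait longer than the configured maximum.
            delays.add(Math.min((long) next, maxInterval));
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        // prints [5000, 10000, 20000, 40000, 60000]
        System.out.println(delaysMillis(5, 5000, 2.0, 60000));
    }
}
```

The last delay is capped at 60000 ms rather than growing to 80000 ms, which is the effect of the maximum interval.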

JOSDK's Dependent Resource abstraction manages owned Kubernetes resources (CronJobs, ConfigMaps, Secrets) declaratively. Instead of manually checking whether a resource exists and creating/updating it, you declare it as a KubernetesDependentResource and JOSDK handles the desired-state reconciliation for you.

    import io.fabric8.kubernetes.api.model.batch.v1.CronJob;
    import io.fabric8.kubernetes.api.model.batch.v1.CronJobBuilder;
    import io.javaoperatorsdk.operator.api.reconciler.Context;
    import io.javaoperatorsdk.operator.api.reconciler.ControllerConfiguration;
    import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
    import io.javaoperatorsdk.operator.api.reconciler.dependent.Dependent;
    import io.javaoperatorsdk.operator.processing.dependent.kubernetes.CRUDKubernetesDependentResource;
    import io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependent;

    // Declare the CronJob as a dependent resource; JOSDK creates/updates/deletes it automatically
    @KubernetesDependent(labelSelector = "app.kubernetes.io/managed-by=database-backup-operator")
    public class BackupCronJobDependentResource
            extends CRUDKubernetesDependentResource<CronJob, DatabaseBackup> {

        public BackupCronJobDependentResource() {
            super(CronJob.class);
        }

        @Override
        protected CronJob desired(DatabaseBackup backup, Context<DatabaseBackup> context) {
            // Return the desired CronJob state; JOSDK compares it with the actual one
            // and patches if they differ
            return new CronJobBuilder()
                .withNewMetadata()
                    .withName(backup.getMetadata().getName() + "-backup")
                    .withNamespace(backup.getMetadata().getNamespace())
                    .addToLabels("app.kubernetes.io/managed-by", "database-backup-operator")
                .endMetadata()
                .withNewSpec()
                    .withSchedule(backup.getSpec().getSchedule())
                    .withNewJobTemplate()
                        // ... Job template with backup container spec
                    .endJobTemplate()
                .endSpec()
                .build();
        }
    }

    // Reference the dependent resource in the reconciler
    @ControllerConfiguration(dependents = @Dependent(type = BackupCronJobDependentResource.class))
    public class DatabaseBackupReconciler implements Reconciler<DatabaseBackup> {
        // CronJob creation/update/deletion is now fully managed by JOSDK
    }

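The operator runs as a normal Kubernetes Deployment in a dedicated namespace. It needs a ServiceAccount with RBAC permissions to watch and manage the resources it operates on; the ServiceAccount itself and the ClusterRoleBinding that ties it to the ClusterRole are not shown here.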

    FROM eclipse-temurin:21-jre
    WORKDIR /app
    COPY target/database-backup-operator-1.0.jar app.jar
    ENTRYPOINT ["java", "-jar", "app.jar"]

    # RBAC: the operator needs permissions to manage its resources
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: database-backup-operator-role
    rules:
      - apiGroups: ["backup.cscode.io"]
        resources: ["databasebackups", "databasebackups/status", "databasebackups/finalizers"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["batch"]
        resources: ["cronjobs", "jobs"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: [""]
        resources: ["events", "secrets"]
        verbs: ["get", "list", "watch", "create", "patch"]
      - apiGroups: ["coordination.k8s.io"]
        resources: ["leases"]
        verbs: ["get", "create", "update"]  # required for leader election
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: database-backup-operator
      namespace: operators
    spec:
      replicas: 2  # HA: leader election prevents dual reconciliation
      selector:
        matchLabels:
          app: database-backup-operator
      template:
        metadata:
          labels:
            app: database-backup-operator  # must match spec.selector.matchLabels
        spec:
          serviceAccountName: database-backup-operator
          containers:
            - name: operator
              image: myregistry/database-backup-operator:1.0
              env:
                - name: JAVA_TOOL_OPTIONS
                  value: "-XX:+UseG1GC -Xms256m -Xmx512m"

    # Example CR: a user creates this to schedule a backup
    apiVersion: backup.cscode.io/v1
    kind: DatabaseBackup
    metadata:
      name: production-db-backup
      namespace: default
    spec:
      schedule: "0 2 * * *"  # 02:00 UTC daily
      databaseRef: "prod-db-secret"
      s3Bucket: "my-backups-bucket"
      retentionDays: 30
