Contents
- The Operator Pattern
- Dependency & Project Setup
- Custom Resource Definition (CRD)
- The Reconciler Loop
- Status Conditions & Events
- Error Handling & Retry
- Dependent Resources
- Packaging & Deployment
The Operator pattern follows a simple control loop: observe → diff → act. The operator watches for changes to its custom resources (desired state), compares them with the actual state of the cluster, and takes actions to converge the two. This loop runs continuously — it is not a one-shot job.
- Custom Resource (CR) — a YAML/JSON object with your schema (e.g., a DatabaseBackup specifying schedule, target database, retention).
- Custom Resource Definition (CRD) — the schema registered with Kubernetes that makes the API server accept your CR type.
- Reconciler — the Java class that receives CR change events and drives actual state toward desired state (creates Jobs, updates Secrets, monitors progress).
- Controller Manager — the operator process that watches the Kubernetes API, queues events, and invokes the Reconciler.
Operators encode operational knowledge in code: the same steps a human would perform to back up, restore, scale, or upgrade an application. Once deployed, the operator performs these tasks autonomously and consistently.
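The observe → diff → act loop can be illustrated without any Kubernetes machinery. The following is a minimal sketch in plain Java, assuming desired and actual state are simple name-to-schedule maps; the class and method names (ControlLoopSketch, diff, reconcile) are hypothetical and not part of JOSDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of observe -> diff -> act over in-memory maps (no Kubernetes involved). */
public class ControlLoopSketch {

    /** Compares desired and actual state and lists the actions needed to converge. */
    static List<String> diff(Map<String, String> desired, Map<String, String> actual) {
        List<String> actions = new ArrayList<>();
        for (Map.Entry<String, String> e : desired.entrySet()) {
            String current = actual.get(e.getKey());
            if (current == null) {
                actions.add("create " + e.getKey());
            } else if (!current.equals(e.getValue())) {
                actions.add("update " + e.getKey());
            }
        }
        for (String name : actual.keySet()) {
            if (!desired.containsKey(name)) {
                actions.add("delete " + name);
            }
        }
        return actions;
    }

    /** Performs one pass: computes the diff, then makes actual equal to desired. */
    static List<String> reconcile(Map<String, String> desired, Map<String, String> actual) {
        List<String> actions = diff(desired, actual);
        actual.keySet().retainAll(desired.keySet()); // act: remove what is no longer wanted
        actual.putAll(desired);                      // act: create or update the rest
        return actions;
    }

    public static void main(String[] args) {
        Map<String, String> desired =
                new TreeMap<>(Map.of("nightly", "0 2 * * *", "weekly", "0 3 * * 0"));
        Map<String, String> actual =
                new TreeMap<>(Map.of("nightly", "0 1 * * *", "stale", "0 0 * * *"));

        System.out.println(reconcile(desired, actual)); // [update nightly, create weekly, delete stale]
        System.out.println(reconcile(desired, actual)); // [] (already converged)
    }
}
```

The second call returns an empty list because the first pass already converged the two maps. A real operator relies on the same property: running the loop again on an unchanged cluster should do nothing.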
Java Operator SDK (JOSDK) provides the controller-manager runtime, event watching, caching, and Spring Boot integration. The operator-framework-spring-boot-starter auto-configures the operator and, when running in a cluster, picks up credentials from the pod's ServiceAccount.
<dependencies>
  <dependency>
    <groupId>io.javaoperatorsdk</groupId>
    <artifactId>operator-framework-spring-boot-starter</artifactId>
    <version>4.9.2</version>
  </dependency>
  <dependency>
    <groupId>io.fabric8</groupId>
    <artifactId>kubernetes-client</artifactId>
  </dependency>
</dependencies>
# application.yml
javaoperatorsdk:
  operator-name: database-backup-operator
  # Namespace to watch; an empty string means all namespaces
  namespaces: ""
  # Leader election prevents multiple operator pods from reconciling simultaneously
  leader-election:
    enabled: true
    lease-name: database-backup-operator-leader
A CRD lives in two places: a Java class hierarchy, from which the Fabric8 CRD generator (commonly used in JOSDK projects) produces the YAML, and the Kubernetes cluster, where that YAML must be applied before the API server accepts resources of that type and the operator can watch them.
// Imports shared by the three classes below (each public class lives in its own file)
import io.fabric8.kubernetes.api.model.Condition;
import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.ShortNames;
import io.fabric8.kubernetes.model.annotation.Version;
import java.util.ArrayList;
import java.util.List;

// The spec: desired state declared by the user
public class DatabaseBackupSpec {
    private String schedule;      // cron expression: "0 2 * * *"
    private String databaseRef;   // name of a Secret containing DB credentials
    private String s3Bucket;      // target S3 bucket for backup files
    private int retentionDays;    // how long to keep backup files
    // getters + setters
}

// The status: actual state written by the operator
public class DatabaseBackupStatus {
    private String phase;         // Pending, Running, Succeeded, Failed
    private String lastBackupTime;
    private String lastBackupFile;
    private String message;
    private List<Condition> conditions = new ArrayList<>();
    // getters + setters
}

// The Custom Resource: combines spec + status with Kubernetes metadata
@Group("backup.cscode.io")
@Version("v1")
@ShortNames("dbb")
public class DatabaseBackup extends CustomResource<DatabaseBackupSpec, DatabaseBackupStatus>
        implements Namespaced {
    // Kubernetes metadata (name, namespace, labels, etc.) is inherited from CustomResource
}
# Generate CRD YAML from the Java classes. The Fabric8 CRD generator works on
# compiled classes, so the build must run at least through process-classes.
mvn process-classes # writes target/classes/META-INF/fabric8/*.yml
# Apply the CRD to the cluster
kubectl apply -f target/classes/META-INF/fabric8/databasebackups.backup.cscode.io-v1.yml
# Verify the CRD is registered
kubectl get crd databasebackups.backup.cscode.io
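The fields of DatabaseBackupSpec carry implicit constraints that the operator may want to check before acting on a resource. The sketch below shows one hypothetical set of checks in plain Java. The rules themselves are assumptions for illustration: it accepts only five-field cron expressions (so macros such as @daily would be rejected) and requires a positive retention period. SpecValidationSketch and validate are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight checks for the fields of DatabaseBackupSpec. */
public class SpecValidationSketch {

    /** Returns human-readable problems; an empty list means the spec passed all checks. */
    static List<String> validate(String schedule, String databaseRef,
                                 String s3Bucket, int retentionDays) {
        List<String> problems = new ArrayList<>();
        // Assumption: only the classic five-field cron format is accepted here.
        if (schedule == null || schedule.trim().split("\\s+").length != 5) {
            problems.add("schedule must have five cron fields");
        }
        if (databaseRef == null || databaseRef.isBlank()) {
            problems.add("databaseRef must name a Secret");
        }
        if (s3Bucket == null || s3Bucket.isBlank()) {
            problems.add("s3Bucket must not be empty");
        }
        if (retentionDays < 1) {
            problems.add("retentionDays must be at least 1");
        }
        return problems;
    }

    public static void main(String[] args) {
        // A well-formed spec yields no problems.
        System.out.println(validate("0 2 * * *", "prod-db-secret", "my-backups-bucket", 30));
        // Four cron fields, blank Secret name, and zero retention yield three problems.
        System.out.println(validate("0 2 * *", "", "my-backups-bucket", 0));
    }
}
```

A reconciler could surface the returned problems through the status message rather than throwing, since a malformed spec will not fix itself on retry.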
The Reconciler<T> interface centres on one method, reconcile(), which JOSDK calls whenever a relevant event occurs (CR created, updated, or re-queued by a timer). Deletion is routed to cleanup() instead when the class also implements Cleaner<T>, as in the reconciler below. reconcile() must be idempotent: it may be called multiple times for the same state, and should produce the same result each time.
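import static io.javaoperatorsdk.operator.api.reconciler.Constants.WATCH_ALL_NAMESPACES;

import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder;
import io.fabric8.kubernetes.api.model.batch.v1.CronJob;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.javaoperatorsdk.operator.api.reconciler.*;
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@ControllerConfiguration(
    namespaces = WATCH_ALL_NAMESPACES,
    name = "database-backup-reconciler"
)
@Component
public class DatabaseBackupReconciler
        implements Reconciler<DatabaseBackup>, Cleaner<DatabaseBackup> {

    private static final Logger log = LoggerFactory.getLogger(DatabaseBackupReconciler.class);

    private final KubernetesClient kubernetesClient;
    private final BackupJobService backupJobService;

    public DatabaseBackupReconciler(KubernetesClient kubernetesClient,
                                    BackupJobService backupJobService) {
        this.kubernetesClient = kubernetesClient;
        this.backupJobService = backupJobService;
    }

    // findActiveBackupJob, isJobCompleted, isJobFailed, updateStatusSucceeded,
    // updateStatusFailed and deleteOwnedCronJob are application helpers not shown here.

    @Override
    public UpdateControl<DatabaseBackup> reconcile(
            DatabaseBackup backup,
            Context<DatabaseBackup> context) {
        log.info("Reconciling DatabaseBackup {}/{}", backup.getMetadata().getNamespace(),
            backup.getMetadata().getName());
        DatabaseBackupSpec spec = backup.getSpec();
        // A newly created CR may have no status yet; initialise it before writing to it
        if (backup.getStatus() == null) {
            backup.setStatus(new DatabaseBackupStatus());
        }
        // 1. Ensure the backup CronJob exists
        ensureCronJobExists(backup, spec);
        // 2. Look up the backup Job for this CR and inspect its state
        Optional<Job> runningJob = findActiveBackupJob(backup);
        if (runningJob.isPresent()) {
            Job job = runningJob.get();
            if (isJobCompleted(job)) {
                updateStatusSucceeded(backup, job);
                return UpdateControl.patchStatus(backup);
            } else if (isJobFailed(job)) {
                updateStatusFailed(backup, job);
                // Re-schedule a check after 1 minute
                return UpdateControl.patchStatus(backup)
                    .rescheduleAfter(Duration.ofMinutes(1));
            }
            // Still running: check again in 30 seconds. The explicit type argument is
            // needed because Java cannot infer it through the chained call.
            return UpdateControl.<DatabaseBackup>noUpdate()
                .rescheduleAfter(Duration.ofSeconds(30));
        }
        // 3. No active job: update status to Pending
        backup.getStatus().setPhase("Pending");
        return UpdateControl.patchStatus(backup);
    }

    @Override
    public DeleteControl cleanup(DatabaseBackup backup, Context<DatabaseBackup> context) {
        // CR is being deleted: clean up owned resources. The owner reference set below
        // would also let Kubernetes garbage-collect the CronJob on its own.
        deleteOwnedCronJob(backup);
        log.info("Cleaned up resources for DatabaseBackup {}", backup.getMetadata().getName());
        return DeleteControl.defaultDelete();
    }

    // Note: this only creates a missing CronJob. A schedule change on an existing
    // CronJob is not propagated by this method.
    private void ensureCronJobExists(DatabaseBackup backup, DatabaseBackupSpec spec) {
        String cronJobName = backup.getMetadata().getName() + "-backup";
        String namespace = backup.getMetadata().getNamespace();
        boolean exists = kubernetesClient.batch().v1().cronjobs()
            .inNamespace(namespace).withName(cronJobName).get() != null;
        if (!exists) {
            CronJob cronJob = backupJobService.buildCronJob(backup, cronJobName);
            // Set owner reference so the CronJob is deleted with the CR
            cronJob.getMetadata().setOwnerReferences(List.of(
                new OwnerReferenceBuilder()
                    .withName(backup.getMetadata().getName())
                    .withApiVersion(backup.getApiVersion())
                    .withKind(backup.getKind())
                    .withUid(backup.getMetadata().getUid())
                    .withController(true)
                    .withBlockOwnerDeletion(true)
                    .build()
            ));
            kubernetesClient.batch().v1().cronjobs().inNamespace(namespace).create(cronJob);
            log.info("Created CronJob {}/{}", namespace, cronJobName);
        }
    }
}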
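The branching inside reconcile() is easier to test when it is separated from the Kubernetes calls. The following is a sketch of that idea as a pure function in plain Java: it maps the observed Job state to a phase and an optional re-check delay, using the same 30-second and 1-minute intervals as the listing above. The names (ReconcileDecisionSketch, JobState, Decision, decide) are hypothetical, and unlike the listing above the sketch also reports a Running phase.

```java
import java.time.Duration;
import java.util.Optional;

/** Pure-function model of the reconciler's branching; all names are illustrative. */
public class ReconcileDecisionSketch {

    enum JobState { RUNNING, SUCCEEDED, FAILED }

    /** Outcome of one reconcile pass: the phase to record and an optional re-check delay. */
    record Decision(String phase, Optional<Duration> requeueAfter) {}

    /** Maps the observed backup Job state (if any Job exists) to a decision. */
    static Decision decide(Optional<JobState> job) {
        if (job.isEmpty()) {
            // No Job found: nothing to poll, wait for the next event.
            return new Decision("Pending", Optional.empty());
        }
        return switch (job.get()) {
            case SUCCEEDED -> new Decision("Succeeded", Optional.empty());
            case FAILED -> new Decision("Failed", Optional.of(Duration.ofMinutes(1)));
            case RUNNING -> new Decision("Running", Optional.of(Duration.ofSeconds(30)));
        };
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.empty()));
        System.out.println(decide(Optional.of(JobState.RUNNING)));
        System.out.println(decide(Optional.of(JobState.FAILED)));
    }
}
```

Because decide() depends only on its argument, calling it repeatedly with the same observation always yields the same decision, which is the idempotency property the reconciler needs.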
Status conditions are the standard way to communicate operator state to users and other tools (e.g., kubectl wait --for=condition=Ready). Each condition has a type, status, reason, and message — following Kubernetes conventions.
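// Condition types for DatabaseBackup. Kubernetes conventions use PascalCase type names,
// so each constant carries the string that is written to the status.
public enum BackupConditionType {
    SCHEDULE_VALID("ScheduleValid"),      // cron expression is valid and the CronJob exists
    BACKUP_RUNNING("BackupRunning"),      // a backup Job is currently active
    BACKUP_SUCCEEDED("BackupSucceeded"),  // last backup completed successfully
    BACKUP_FAILED("BackupFailed");        // last backup failed

    private final String typeName;
    BackupConditionType(String typeName) { this.typeName = typeName; }
    public String typeName() { return typeName; }
}

// Helper to set a condition (replaces any existing condition of the same type)
private void setCondition(DatabaseBackup backup, BackupConditionType type,
        boolean isTrue, String reason, String message) {
    String status = isTrue ? "True" : "False";
    List<Condition> conditions = backup.getStatus().getConditions();
    // lastTransitionTime records when the status last changed, so the previous
    // timestamp is kept when the status value is the same as before
    String transitionTime = conditions.stream()
        .filter(c -> type.typeName().equals(c.getType()) && status.equals(c.getStatus()))
        .map(Condition::getLastTransitionTime)
        .filter(Objects::nonNull)
        .findFirst()
        .orElseGet(() -> ZonedDateTime.now(ZoneOffset.UTC)
            .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME));
    Condition condition = new ConditionBuilder()
        .withType(type.typeName())
        .withStatus(status)
        .withReason(reason)
        .withMessage(message)
        .withLastTransitionTime(transitionTime)
        .build();
    conditions.removeIf(c -> type.typeName().equals(c.getType()));
    conditions.add(condition);
}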
// Emit a Kubernetes Event (visible in `kubectl describe databasebackup`)
private void emitEvent(DatabaseBackup backup, String reason, String message) {
    kubernetesClient.v1().events().inNamespace(backup.getMetadata().getNamespace())
        .create(new EventBuilder()
            .withNewMetadata()
                .withGenerateName(backup.getMetadata().getName() + "-")
                .withNamespace(backup.getMetadata().getNamespace())
            .endMetadata()
            .withReason(reason)
            .withMessage(message)
            .withType("Normal")
            // apiVersion and uid identify the exact object; kubectl describe
            // matches events against the object's uid
            .withInvolvedObject(new ObjectReferenceBuilder()
                .withApiVersion(backup.getApiVersion())
                .withKind(backup.getKind())
                .withName(backup.getMetadata().getName())
                .withNamespace(backup.getMetadata().getNamespace())
                .withUid(backup.getMetadata().getUid())
                .build())
            .build());
}
When a reconciler throws an exception, JOSDK retries with exponential backoff by default. In JOSDK 4.x the default strategy is tuned with the @GradualRetry annotation on the reconciler class, which limits the number of attempts and caps the backoff interval. Once the attempts are exhausted, the resource is not retried again until a new event arrives for it.
// import io.javaoperatorsdk.operator.processing.retry.GradualRetry;
@ControllerConfiguration
@GradualRetry(maxAttempts = 5, initialInterval = 5000,
        intervalMultiplier = 2.0, maxInterval = 60000)
public class DatabaseBackupReconciler implements Reconciler<DatabaseBackup> {
    // TransientException and PermanentException are application-defined exception types
    @Override
    public UpdateControl<DatabaseBackup> reconcile(DatabaseBackup backup,
            Context<DatabaseBackup> context) {
        try {
            return doReconcile(backup, context);
        } catch (TransientException e) {
            // Rethrow: JOSDK will retry with backoff
            log.warn("Transient error reconciling {}, will retry: {}",
                backup.getMetadata().getName(), e.getMessage());
            throw e;
        } catch (PermanentException e) {
            // Don't rethrow: update status to Failed and stop retrying
            log.error("Permanent error for {}: {}", backup.getMetadata().getName(), e.getMessage());
            if (backup.getStatus() == null) {
                backup.setStatus(new DatabaseBackupStatus());
            }
            backup.getStatus().setPhase("Failed");
            backup.getStatus().setMessage(e.getMessage());
            return UpdateControl.patchStatus(backup);
        }
    }
}
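To see what the retry parameters in the configuration mean in practice, the arithmetic of a capped exponential backoff can be worked through in a few lines. This is a sketch of the general formula only, not JOSDK's internal implementation; BackoffSketch and delays are hypothetical names, and whether JOSDK applies the initial interval before the first retry in exactly this way is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes a capped exponential backoff schedule (illustrative, not JOSDK code). */
public class BackoffSketch {

    /**
     * Returns the delay in milliseconds before each retry attempt:
     * initialInterval * multiplier^(attempt - 1), never exceeding maxInterval.
     */
    static List<Long> delays(int maxAttempts, long initialInterval,
                             double multiplier, long maxInterval) {
        List<Long> result = new ArrayList<>();
        double next = initialInterval;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            result.add(Math.min((long) next, maxInterval)); // apply the cap
            next *= multiplier;                             // grow for the next attempt
        }
        return result;
    }

    public static void main(String[] args) {
        // Same values as the retry configuration above
        System.out.println(delays(5, 5000, 2.0, 60000)); // [5000, 10000, 20000, 40000, 60000]
    }
}
```

With these values the fifth delay would be 80 seconds uncapped, so the 60-second maximum takes effect on the last attempt, and the five attempts span a little over two minutes in total.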
JOSDK's Dependent Resource abstraction manages owned Kubernetes resources (CronJobs, ConfigMaps, Secrets) declaratively. Instead of manually checking whether a resource exists and creating/updating it, you declare it as a KubernetesDependentResource and JOSDK handles the desired-state reconciliation for you.
// import io.javaoperatorsdk.operator.api.reconciler.dependent.Dependent;
// import io.javaoperatorsdk.operator.processing.dependent.kubernetes.CRUDKubernetesDependentResource;
// import io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependent;

// Declare the CronJob as a dependent resource: JOSDK creates/updates/deletes it automatically
@KubernetesDependent(labelSelector = "app.kubernetes.io/managed-by=database-backup-operator")
public class BackupCronJobDependentResource
        extends CRUDKubernetesDependentResource<CronJob, DatabaseBackup> {

    public BackupCronJobDependentResource() {
        super(CronJob.class);
    }

    @Override
    protected CronJob desired(DatabaseBackup backup, Context<DatabaseBackup> context) {
        // Return the desired CronJob state: JOSDK compares with actual and patches if different
        return new CronJobBuilder()
            .withNewMetadata()
                .withName(backup.getMetadata().getName() + "-backup")
                .withNamespace(backup.getMetadata().getNamespace())
                .addToLabels("app.kubernetes.io/managed-by", "database-backup-operator")
            .endMetadata()
            .withNewSpec()
                .withSchedule(backup.getSpec().getSchedule())
                .withNewJobTemplate()
                    // ... Job template with backup container spec
                .endJobTemplate()
            .endSpec()
            .build();
    }
}

// Reference the dependent resource in the reconciler
@ControllerConfiguration(dependents = @Dependent(type = BackupCronJobDependentResource.class))
public class DatabaseBackupReconciler implements Reconciler<DatabaseBackup> {
    // CronJob creation/update/deletion is now fully managed by JOSDK.
    // reconcile() must still be implemented; it runs after the dependents are reconciled.
    @Override
    public UpdateControl<DatabaseBackup> reconcile(DatabaseBackup backup,
            Context<DatabaseBackup> context) {
        return UpdateControl.noUpdate();
    }
}
The operator runs as a normal Kubernetes Deployment in a dedicated namespace. It needs a ServiceAccount with RBAC permissions to watch and manage the resources it operates on.
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY target/database-backup-operator-1.0.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
# RBAC: the operator needs permissions to manage its resources
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: database-backup-operator-role
rules:
  - apiGroups: ["backup.cscode.io"]
    resources: ["databasebackups", "databasebackups/status", "databasebackups/finalizers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events", "secrets"]
    verbs: ["get", "list", "watch", "create", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]  # required for leader election
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-backup-operator
  namespace: operators
spec:
  replicas: 2  # HA: leader election prevents dual reconciliation
  selector:
    matchLabels:
      app: database-backup-operator
  template:
    metadata:
      labels:
        app: database-backup-operator  # must match spec.selector.matchLabels
    spec:
      # The ServiceAccount and the ClusterRoleBinding that ties it to the role above are not shown
      serviceAccountName: database-backup-operator
      containers:
        - name: operator
          image: myregistry/database-backup-operator:1.0
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-XX:+UseG1GC -Xms256m -Xmx512m"
# Example CR: a user creates this to schedule a backup
apiVersion: backup.cscode.io/v1
kind: DatabaseBackup
metadata:
  name: production-db-backup
  namespace: default
spec:
  schedule: "0 2 * * *"  # 02:00 UTC daily
  databaseRef: "prod-db-secret"
  s3Bucket: "my-backups-bucket"
  retentionDays: 30