We have a GKE container cluster running in Autopilot mode.
We are currently seeing errors in a short window when performing a blue/green deployment from Jenkins.
When we switch the service over to the new deployment, there is a window of under 100 ms during which requests fail with the following error:
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>
I assume this is because one of the pods is not ready yet, but traffic is already being routed to the new deployment.
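One way to check that assumption (a diagnostic sketch, assuming the service name app-service and namespace app from the manifests below; this is not part of our current pipeline) is to watch the service's endpoints while the selector is switched:

# Watch which pod IPs sit behind the service during the cutover;
# pods that are not Ready should never appear in this list.
kubectl get endpoints app-service --namespace app --watch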
After the deployment is created, we check that it has rolled out.
With the Jenkins plugin (https://github.com/jenkinsci/google-kubernetes-engine-plugin) we have the verifyDeployments attribute set to true:
step([
  $class: 'KubernetesEngineBuilder',
  projectId: env.PROJECT_ID,
  clusterName: env.CLUSTER_NAME,
  namespace: env.NAMESPACE,
  location: env.CLUSTER_LOCATION,
  manifestPattern: './apps/app/deployments/green.yaml',
  credentialsId: env.APP_CREDENTIALS_ID,
  verifyDeployments: true
])
We also added a second check to really verify that the deployment has rolled out, since the Jenkins plugin doesn't seem to do this very reliably:
kubectl rollout status deployment app-deployment --namespace app-namespace --watch --timeout=5m
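On top of the rollout status check, we could also verify directly that every pod of the new deployment reports Ready before switching traffic. A minimal sketch, assuming the green pods carry the label app: app-green in namespace app (names taken from the manifests below); this is not something we currently run:

# Block until every pod matching the green label reports the Ready condition,
# or fail after five minutes.
kubectl wait --for=condition=Ready pod -l app=app-green --namespace app --timeout=5m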
We also noticed that the deployment can sometimes fail and yet the service is still applied in a subsequent step, which crashes the application. That is a separate case we still need to figure out how to solve, probably related to the Jenkins plugin; one possible guard is sketched below.
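A minimal sketch of such a guard, assuming the green deployment is named app-green in namespace app and using a hypothetical path for the service manifest: chain the rollout check and the service apply so the switch only happens on success.

# Hypothetical guard: only apply the service manifest if the green rollout
# completed successfully, so a failed deployment never receives traffic.
kubectl rollout status deployment app-green --namespace app --timeout=5m \
  && kubectl apply -f ./apps/app/services/app-service.yaml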
Our deployment YAML looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  namespace: app
  labels: {app.kubernetes.io/managed-by: graphite-jenkins-gke}
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  selector:
    matchLabels: {app: app-blue}
  template:
    metadata:
      labels: {app: app-blue}
    spec:
      automountServiceAccountToken: true
      containers:
        - name: app
          image: eu.gcr.io/container-registry-project/app:latest
          imagePullPolicy: Always
          ports:
            - {containerPort: 8080, name: http, protocol: TCP}
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          resources:
            limits: {cpu: 500m, ephemeral-storage: 1Gi, memory: 512Mi}
            requests: {cpu: 500m, ephemeral-storage: 1Gi, memory: 512Mi}
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [NET_RAW]
            privileged: false
            readOnlyRootFilesystem: false
            runAsNonRoot: false
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: app
      serviceAccountName: app
Our service YAML looks like this:
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: app
spec:
  selector:
    app: app-blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
To switch to the new deployment, we simply change the service selector (app:) from app-blue to app-green (or back); the cutover is sketched below. We always get a small window of errors when doing this. Does anyone have any idea what we're doing wrong?
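For reference, the cutover amounts to a single selector change on the service (a sketch, assuming kubectl and the names from the manifests above; not necessarily how the Jenkins job performs the switch):

# Repoint the service from the blue pods to the green pods in one step.
kubectl patch service app-service --namespace app --type merge \
  -p '{"spec":{"selector":{"app":"app-green"}}}'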