I'm trying to migrate a 3-node ZooKeeper ensemble from VMs to a Kubernetes cluster without downtime.
I know there are plenty of blog posts and articles on migrating ZooKeeper without downtime (VMs to VMs, to bare metal, back to VMs, etc.), but I couldn't find one that migrates to Kubernetes without downtime.
This is the config on all zk nodes (zoo.cfg):
autopurge.purgeInterval=1
initLimit=10
syncLimit=5
autopurge.snapRetainCount=5
snapCount=5000
4lw.commands.whitelist=*
tickTime=2000
dataDir=/var/opt/zookeeper/data/data
admin.serverPort=8080
reconfigEnabled=true
admin.enableServer=True
standaloneEnabled=false
dynamicConfigFile=/opt/zookeeper/apache-zookeeper-3.7.1-bin/conf/zoo.cfg.dynamic
And this is /opt/zookeeper/current/conf/zoo.cfg.dynamic:
server.1=inzzk01:2888:3888;2181
server.2=inzzk02:2888:3888;2181
server.3=inzzk03:2888:3888;2181
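For reference, each server line in the dynamic file follows ZooKeeper's server.<id>=<address>:<quorum port>:<election port>[:<role>];[<client address>:]<client port> format. The Python sketch below splits such a line into its parts, which makes it easy to see which port plays which role in the VM entries and in the Kubernetes entries added later. The helper name parse_server_line and the ServerSpec class are hypothetical, for inspection only, and IPv6 literals are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerSpec:
    sid: int
    host: str
    quorum_port: int      # port followers use to talk to the leader
    election_port: int    # port used for leader election
    role: str             # "participant" unless a role is given explicitly
    client_addr: Optional[str]
    client_port: Optional[int]


def parse_server_line(line: str) -> ServerSpec:
    """Parse 'server.<id>=<host>:<quorum>:<election>[:<role>][;[<addr>:]<port>]'.

    Hypothetical helper for inspection only; IPv6 literals are not handled.
    """
    key, value = line.strip().split("=", 1)
    sid = int(key.split(".", 1)[1])
    server_part, _, client_part = value.partition(";")
    fields = server_part.split(":")
    host, quorum, election = fields[0], int(fields[1]), int(fields[2])
    role = fields[3] if len(fields) > 3 else "participant"
    client_addr, client_port = None, None
    if client_part:
        if ":" in client_part:
            addr, port = client_part.rsplit(":", 1)
            client_addr, client_port = addr, int(port)
        else:
            client_port = int(client_part)
    return ServerSpec(sid, host, quorum, election, role, client_addr, client_port)


if __name__ == "__main__":
    for spec_line in [
        "server.1=inzzk01:2888:3888;2181",
        "server.4=10.100.102.106:30888:31888;30181",
        "server.1=inzzk01:2888:3888:participant;0.0.0.0:2181",
    ]:
        print(parse_server_line(spec_line))
```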
Up to this point everything works and the ensemble forms.
I run ZooKeeper in Kubernetes as a StatefulSet based on this answer (by itself, a 3-pod ensemble created from it works as expected). I scrapped everything on the Kubernetes side to start from a clean cluster, then added the following to the config on the VMs and restarted each node:
server.4=10.100.102.106:30888:31888;30181
server.5=10.100.102.232:30889:31889;30182
Both IP addresses above are valid Kubernetes node IPs, and the ports are correct.
The logs on the VMs look as expected at this stage:
2023-02-17 13:45:30,107 [myid:1] - WARN [QuorumConnectionThread-[myid=1]-3:QuorumCnxManager@401] - Cannot open channel to 4 at election address /10.100.102.106:31888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
2023-02-17 13:45:30,113 [myid:1] - WARN [QuorumConnectionThread-[myid=1]-4:QuorumCnxManager@401] - Cannot open channel to 5 at election address /10.100.102.232:31889
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
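"Connection refused" at the election address simply means that nothing was accepting TCP connections on that host and port when the VM tried to connect. A quick way to check the quorum and election ports from a VM, independently of ZooKeeper, is a plain TCP connect. This is a minimal sketch; port_is_open is a hypothetical helper, and the targets in the comment are just the ports from the server.4 and server.5 lines above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Both a refused connection (as in the log above) and a timeout return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage from a VM (quorum and election ports of server.4 and server.5):
# for host, port in [("10.100.102.106", 30888), ("10.100.102.106", 31888),
#                    ("10.100.102.232", 30889), ("10.100.102.232", 31889)]:
#     print(host, port, "open" if port_is_open(host, port) else "closed/unreachable")
```

Running it before and after the pods start helps separate network reachability problems from ZooKeeper-level problems.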
I tried every type of Service (ClusterIP, headless and non-headless, LoadBalancer and NodePort). In the end I figured the simplest approach was to use no Service at all and add hostNetwork: true to the StatefulSet. That way the pod's ports are bound directly on the Kubernetes node, with no proxying or SNAT/DNAT in between, so I can target them directly. Again, this isn't recommended, but it keeps the example simple.
kubectl -n infraservices get all
NAME READY STATUS RESTARTS AGE
pod/zk-0 0/1 Running 1 (65s ago) 2m45s
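The pod stays at 0/1 ready. Since 4lw.commands.whitelist=* is set, one way to see what the ZooKeeper process in the pod reports about itself is to send it the srvr four-letter-word command on its client port; a server that has not joined a quorum answers with "This ZooKeeper instance is not currently serving requests". The sketch below is a minimal raw-socket client, assuming the pod (myid 4) is reachable on 10.100.102.106:30181 as in the server.4 line; four_letter_word is a hypothetical helper name.

```python
import socket


def four_letter_word(host: str, port: int, cmd: str = "srvr", timeout: float = 3.0) -> str:
    """Send a ZooKeeper four-letter-word command and return the raw reply.

    The server writes its reply and then closes the connection, so read until EOF.
    Only works for commands allowed by 4lw.commands.whitelist.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # peer closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example usage (assumed address of the pod with myid 4):
# print(four_letter_word("10.100.102.106", 30181, "srvr"))
```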
In the pod's logs:
2023-02-17 14:00:48,308 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:4, n.state:LOOKING, n.leader:4, n.round:0x1, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2023-02-17 14:00:48,315 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@308] - 4 Received version: 1600000000 my version: 0
2023-02-17 14:00:48,315 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@316] - restarting leader election
2023-02-17 14:00:48,315 [myid:4] - WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 2. myId: 4
2023-02-17 14:00:48,393 [myid:4] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 3. myId: 4
2023-02-17 14:00:48,394 [myid:4] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@985] - Leaving listener
2023-02-17 14:00:48,395 [myid:4] - WARN [SendWorker:2:QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
2023-02-17 14:00:48,395 [myid:4] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1402] - Connection broken for id 1, my id = 4
java.net.SocketException: Socket closed
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at java.base/java.io.BufferedInputStream.fill(Unknown Source)
at java.base/java.io.BufferedInputStream.read(Unknown Source)
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1390)
2023-02-17 14:00:48,395 [myid:4] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
2023-02-17 14:00:48,395 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@472] - WorkerReceiver is down
2023-02-17 14:00:48,395 [myid:4] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
java.lang.InterruptedException
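The key lines in the pod log are "4 Received version: 1600000000 my version: 0" followed by "restarting leader election". The config version appears to be zxid-shaped, so, assuming it is printed in hexadecimal like the other zxid values in these logs (and like version=1600000000 in /zookeeper/config), it can be split into an epoch in the high 32 bits and a counter in the low 32 bits. This minimal sketch (split_zxid is a hypothetical helper) shows that the VMs' config comes from epoch 0x16 while the pod's own config has version 0, i.e. the pod is being offered a newer configuration than the one it started with.

```python
from typing import Tuple


def split_zxid(zxid_hex: str) -> Tuple[int, int]:
    """Split a hex zxid-style value into (epoch, counter).

    ZooKeeper keeps the epoch in the high 32 bits and a counter in the low 32 bits.
    """
    value = int(zxid_hex, 16)
    return value >> 32, value & 0xFFFFFFFF


if __name__ == "__main__":
    print(split_zxid("1600000000"))  # version received from the VMs -> (22, 0)
    print(split_zxid("0"))           # the pod's own version -> (0, 0)
```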
On the VM side (myid 1), the logs look like this:
2023-02-17 14:03:42,165 [myid:1] - INFO [ListenerHandler-inzzk01/10.100.100.128:3888:QuorumCnxManager$Listener$ListenerHandler@1076] - Received connection request from /10.100.102.106:51674
2023-02-17 14:03:42,167 [myid:1] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@1402] - Connection broken for id 4, my id = 1
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1390)
2023-02-17 14:03:42,167 [myid:1] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 4. myId: 1
2023-02-17 14:03:42,167 [myid:1] - WARN [SendWorker:4:QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
2023-02-17 14:03:42,167 [myid:1] - WARN [SendWorker:4:QuorumCnxManager$SendWorker@1300] - Send worker leaving thread id 4 my id = 1
I can connect from the pod to the VM ensemble with zkCli.sh:
root@inccl02az12-23rpb-mvnnw:/apache-zookeeper-3.7.1-bin# bin/zkCli.sh -timeout 3000 -server inzzk01:2181
[zk: inzzk01:2181(CONNECTED) 6] get /zookeeper/config
server.1=inzzk01:2888:3888:participant;0.0.0.0:2181
server.2=inzzk02:2888:3888:participant;0.0.0.0:2181
server.3=inzzk03:2888:3888:participant;0.0.0.0:2181
version=1600000000
[zk: inzzk01:2181(CONNECTED) 7]
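Note that the committed config returned above lists only servers 1 to 3, even though server.4 and server.5 were added to the files on the VMs. With reconfigEnabled=true, the membership the ensemble actually uses appears to be the committed one that /zookeeper/config returns, so a quick comparison makes the gap explicit. In this sketch, config_members is a hypothetical helper, and the expected set {1, 2, 3, 4, 5} is simply the ids from the edited files.

```python
from typing import Set


def config_members(config_text: str) -> Set[int]:
    """Return the server ids listed in the output of 'get /zookeeper/config'."""
    ids = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("server.") and "=" in line:
            ids.add(int(line.split("=", 1)[0].split(".", 1)[1]))
    return ids


# Output of 'get /zookeeper/config' from the session above.
VM_CONFIG = """\
server.1=inzzk01:2888:3888:participant;0.0.0.0:2181
server.2=inzzk02:2888:3888:participant;0.0.0.0:2181
server.3=inzzk03:2888:3888:participant;0.0.0.0:2181
version=1600000000
"""

if __name__ == "__main__":
    expected = {1, 2, 3, 4, 5}  # ids written into the files on the VMs
    members = config_members(VM_CONFIG)
    print("members:", sorted(members))             # members: [1, 2, 3]
    print("missing:", sorted(expected - members))  # missing: [4, 5]
```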
So how can I join at least one ZooKeeper node running as a pod in Kubernetes to an existing ensemble outside Kubernetes?