Score:1

Timeouts on Cloud SQL and other external services when using NAT + IP Masquerade on GKE

I need to configure a static egress IP for one of my pods because a remote service (outside my cluster) requires whitelisting of trusted IPs.

I followed the documentation provided by Google:

https://cloud.google.com/nat/docs/overview?hl=es-419

https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent

But when I try to configure egress traffic through the Google Cloud NAT service in my GKE cluster, combined with masquerading via the ip-masq-agent, I start getting timeouts and other problems when accessing remote services outside the cluster.

My cluster is running version 1.19.10-gke.1600.

I have tried these config files with the following results:

resyncInterval: 60s

Result:

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */

The services keep using the wrong IP.


resyncInterval: 60s
masqLinkLocal: true

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             169.254.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in the chain) */

Same result: the external services still see the wrong IP.


nonMasqueradeCIDRs:
  - 0.0.0.0/0
resyncInterval: 60s
masqLinkLocal: true

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere             /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in the chain) */

This seems to work better, because the external services receive the correct IP, but I get connection problems and timeouts.


This is my NAT configuration:

NAT mapping
- High availability: Yes
- Source subnets & IP ranges: All subnets' primary and secondary IP ranges
- NAT IP addresses: static-egress-ip XXX.XXX.XXX.XXX

I'm out of ideas. Can anyone give me some advice?


After receiving the answer here, I updated my config file to add the IP ranges from the Google Cloud documentation. The file now looks like this:

nonMasqueradeCIDRs:
  - 10.0.0.0/8
  - 172.16.0.0/12
  - 192.168.0.0/16
  - 100.64.0.0/10
  - 192.0.0.0/24
  - 192.0.2.0/24
  - 192.88.99.0/24
  - 198.18.0.0/15
  - 198.51.100.0/24
  - 203.0.113.0/24
  - 240.0.0.0/4
resyncInterval: 60s
masqLinkLocal: true

The resulting iptables chain is:

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             100.64.0.0/10        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.0.0.0/24         /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.0.2.0/24         /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.88.99.0/24       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             198.18.0.0/15        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             198.51.100.0/24      /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             203.0.113.0/24       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             240.0.0.0/4          /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */

But if I run curl checkip.amazonaws.com to see which IP the node is using, I get a different IP from the one defined in my Cloud NAT configuration, and the external services reject requests from my cluster as untrusted.
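The manual curl check can be scripted so it is easy to repeat from a node or from inside a pod. The following is only a minimal sketch, not part of the original setup: the helper names `fetch_public_ip` and `is_expected_egress_ip` are hypothetical, and it assumes the check service answers with a bare IP address as plain text, as it does for the curl call above.

```python
import ipaddress
import urllib.request

# Same service as the curl check above; assumed to return the caller's public IP as text.
CHECK_URL = "https://checkip.amazonaws.com"


def fetch_public_ip(url: str = CHECK_URL, timeout: float = 5.0) -> str:
    """Ask the external service which source IP it sees for this host."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def is_expected_egress_ip(observed: str, expected: str) -> bool:
    """Return True when the observed public IP equals the expected NAT IP.

    Both values are parsed with ipaddress, so surrounding whitespace or a
    trailing newline in the service response does not cause a false mismatch.
    """
    return ipaddress.ip_address(observed.strip()) == ipaddress.ip_address(expected.strip())
```

Calling `is_expected_egress_ip(fetch_public_ip(), "<your reserved NAT IP>")` from the workload that needs the trusted address shows whether traffic is actually leaving through the static IP; a False result corresponds to the mismatch described above.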

Score:1

It seems you have set nonMasqueradeCIDRs to 0.0.0.0/0, which prevents masquerading for all traffic. To fix this, update the nonMasqueradeCIDRs key in the config file with the ranges listed in the "Default non-masquerade destination" paragraph [1], as given below.

nonMasqueradeCIDRs:
  - 172.16.0.0/12
  - 192.168.0.0/16
  - 100.64.0.0/10
  - 192.0.0.0/24
  - 192.0.2.0/24
  - 192.88.99.0/24
  - 198.18.0.0/15
  - 198.51.100.0/24
  - 203.0.113.0/24
  - 240.0.0.0/4
  - 10.0.0.0/8

Also, please note that the IPs shown in the screenshot are not wrong IPs. The ranges 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 are reserved by RFC 1918, and 169.254.0.0/16 is reserved for link-local addresses. Traffic to these ranges is not masqueraded, which is why they are displayed with the description 'ip-masq-agent: local traffic is not subject to MASQUERADE' [2].
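To make the effect of nonMasqueradeCIDRs concrete, here is a minimal sketch that mimics how the IP-MASQ chain is evaluated: the first matching RETURN rule leaves the packet untouched, and anything that falls through reaches the final MASQUERADE rule. The function name `is_masqueraded` is hypothetical, and the sketch deliberately ignores the masqLinkLocal option; it only illustrates the matching logic, not the agent's actual implementation.

```python
import ipaddress
from typing import Iterable

# The ranges listed in this answer.
DEFAULT_NON_MASQ_CIDRS = [
    "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10", "192.0.0.0/24",
    "192.0.2.0/24", "192.88.99.0/24", "198.18.0.0/15", "198.51.100.0/24",
    "203.0.113.0/24", "240.0.0.0/4", "10.0.0.0/8",
]


def is_masqueraded(destination: str, non_masq_cidrs: Iterable[str]) -> bool:
    """Return True if traffic to `destination` would hit the MASQUERADE rule.

    Each CIDR stands for one RETURN rule in the chain; a match means the
    packet keeps its original (pod) source address.
    """
    dest = ipaddress.ip_address(destination)
    for cidr in non_masq_cidrs:
        if dest in ipaddress.ip_network(cidr):
            return False  # RETURN: not subject to MASQUERADE
    return True  # fell through to the final MASQUERADE rule
```

With the list above, `is_masqueraded("8.8.8.8", DEFAULT_NON_MASQ_CIDRS)` is True, while with `["0.0.0.0/0"]` every destination matches the single RETURN rule and nothing is masqueraded.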

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent#default-non-masq-dests

[2] https://kubernetes.io/docs/tasks/administer-cluster/ip-masq-agent/#ip-masquerade-agent-user-guide

Regards, Anbu.

Hi @Anbu, thanks a lot for the help. I followed your instructions and, instead of using 0.0.0.0/0 as the non-masquerade range, I used the ranges defined in the documentation. You can see the output of the commands in the edits to my question, but I still have the problem that my nodes and pods call the external services using an IP that I don't control. The fixed NAT IP I configured in Google Cloud NAT is ignored. I'm testing this with a curl directly from the node over SSH, and by calling my external services, which respond that the IP is not allowed. Any ideas?
Score:0

We were finally able to diagnose the problem. Our cluster was created some time ago, when GCP didn't support private clusters, so our cluster is public.

Each node has a public ephemeral IP, so the NAT rules are ignored.

The solution was to set up a node with a static IP instead of an ephemeral one, and to configure the workload that requires the trusted IP to always be deployed on that specific node. This is not a perfect solution, but it is what we could do quickly to solve the problem.

The real solution would be to migrate to a private cluster and configure the NAT, but sadly GCP does not support migration from a public cluster to a private one. The only option is to create a new cluster and migrate the workloads to it, a process we will need to carry out in the short term.

Maybe this is a good moment to try Autopilot, which does not support automated migration either.
