AWS EKS - EIA attached on node not reachable by Pod

I'm using a standard AWS EKS cluster, all cloud based (K8S 1.21) with multiple node groups, one of which uses a Launch Template that defines an Elastic Inference Accelerator attached to the instances (eia2.medium) to serve some kind of Tensorflow model.

I've been struggling a lot to make our Deep Learning model work at all once deployed. Specifically, I have a Pod in a Deployment, with a Service Account and an EKS IRSA policy attached, based on the AWS Deep Learning Container for inference serving with Tensorflow 1.15.0.

The image used is 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu, and when the model is deployed in the cluster, with a node affinity to the proper EIA-enabled node, it simply doesn't work when called via the /invocations endpoint:

Using Amazon Elastic Inference Client Library Version: 1.6.3
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-<id>
Elastic Inference Accelerator Type: eia2.medium
Elastic Inference Accelerator Ordinal: 0

2022-05-11 13:47:17.799145: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1221] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator whitelist from server.
WARNING:__main__:unexpected tensorflow serving exit (status: 134). restarting.
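For reference, the Deployment is shaped roughly like this (a sketch, not my actual manifest; the name, labels, service account, and node-group label key/value are placeholders):

```shell
# Hypothetical sketch: service account wired up for IRSA plus a node
# affinity pinning the Pod to the EIA-enabled node group.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-inference-eia                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tf-inference-eia
  template:
    metadata:
      labels:
        app: tf-inference-eia
    spec:
      serviceAccountName: tf-inference-sa    # SA annotated with the IRSA role ARN
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/nodegroup   # placeholder label
                    operator: In
                    values:
                      - eia-enabled-nodes              # placeholder value
      containers:
        - name: serving
          image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu
          ports:
            - containerPort: 8500   # TF Serving gRPC
            - containerPort: 8501   # TF Serving REST (/invocations)
EOF
```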

Just as a reference, when using the CPU-only image available at 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.15.0-cpu, the model serves perfectly in any environment (locally too).

Each EKS node and the Pod itself (via IRSA) has the following policy attached:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elastic-inference:Connect",
                "iam:List*",
                "iam:Get*",
                "ec2:Describe*",
                "ec2:Get*",
                "ec2:ModifyInstanceAttribute"
            ],
            "Resource": "*"
        }
    ]
}

as per AWS's own documentation. I have also created a VPC Endpoint for Elastic Inference as described by AWS and bound it to the private subnets used by the EKS nodes, along with a properly configured Security Group that allows SSH, HTTPS, and TCP ports 8500/8501 from any worker node in the VPC CIDR.
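The endpoint setup was along these lines (VPC, subnet, and security-group IDs below are placeholders; the service name is the one AWS documents for the Elastic Inference runtime in eu-west-1):

```shell
# Interface VPC endpoint for the Elastic Inference runtime
# (all resource IDs are placeholders, not from my actual account).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.eu-west-1.elastic-inference.runtime \
  --subnet-ids subnet-0aaa1111 subnet-0bbb2222 \
  --security-group-ids sg-0ccc3333 \
  --private-dns-enabled
```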

Both the AWS Reachability Analyzer and the IAM Policy Simulator report nothing wrong, so networking and permissions seem fine, and the EISetupValidator.py script provided by AWS says the same.
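I can also check reachability of the Elastic Inference API from inside the cluster itself with a throwaway debug pod (the hostname below is my assumption of the regional endpoint that the EI client library talks to):

```shell
# Throwaway debug pod to test DNS resolution and TLS reachability of the
# (assumed) regional Elastic Inference endpoint from the node's subnet.
kubectl run eia-netcheck --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sv --max-time 10 https://api.elastic-inference.eu-west-1.amazonaws.com
```

and this resolves and connects without errors, which makes the "Failed to get the initial operator whitelist from server" failure even more puzzling.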

Any clue on what's actually happening here? Am I missing some kind of permissions or networking setup?
