I'm using a standard, fully cloud-based AWS EKS cluster (Kubernetes 1.21) with multiple node groups, one of which uses a Launch Template that attaches an Elastic Inference Accelerator (eia2.medium) to its instances in order to serve a TensorFlow model.
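For reference, the relevant part of that Launch Template's data looks roughly like this (a minimal sketch; the instance type is a placeholder, and the ElasticInferenceAccelerators block is the part that attaches the accelerator):

{
  "InstanceType": "m5.xlarge",
  "ElasticInferenceAccelerators": [
    {
      "Type": "eia2.medium",
      "Count": 1
    }
  ]
}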
I've been struggling to get our deep learning model to work at all once deployed: the Pod lives in a Deployment with a Service Account and an EKS IRSA policy attached, and it is based on the AWS Deep Learning Container for TensorFlow 1.15.0 inference serving.
The image used is 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu, and when the model is deployed in the cluster, with a node affinity targeting the proper EIA-enabled node, it simply doesn't work when called on the /invocations endpoint:
Using Amazon Elastic Inference Client Library Version: 1.6.3
Number of Elastic Inference Accelerators Available: 1
Elastic Inference Accelerator ID: eia-<id>
Elastic Inference Accelerator Type: eia2.medium
Elastic Inference Accelerator Ordinal: 0
2022-05-11 13:47:17.799145: F external/org_tensorflow/tensorflow/contrib/ei/session/eia_session.cc:1221] Non-OK-status: SwapExStateWithEI(tmp_inputs, tmp_outputs, tmp_freeze) status: Internal: Failed to get the initial operator whitelist from server.
WARNING:__main__:unexpected tensorflow serving exit (status: 134). restarting.
Just as a reference, when using the CPU-only image available at 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.15.0-cpu, the model serves perfectly in any environment (locally too).
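For context, the Deployment pins the Pod to the EIA-enabled node group and wires in the Service Account roughly like this (a minimal sketch; the names, the eks.amazonaws.com/nodegroup label value and the container port are placeholders on my side):

{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": { "name": "tf-inference-eia" },
  "spec": {
    "replicas": 1,
    "selector": { "matchLabels": { "app": "tf-inference-eia" } },
    "template": {
      "metadata": { "labels": { "app": "tf-inference-eia" } },
      "spec": {
        "serviceAccountName": "tf-inference-sa",
        "affinity": {
          "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
              "nodeSelectorTerms": [
                {
                  "matchExpressions": [
                    { "key": "eks.amazonaws.com/nodegroup", "operator": "In", "values": ["eia-nodegroup"] }
                  ]
                }
              ]
            }
          }
        },
        "containers": [
          {
            "name": "tf-serving",
            "image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu",
            "ports": [ { "containerPort": 8080 } ]
          }
        ]
      }
    }
  }
}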
As per AWS's own documentation, each EKS node, as well as the Pod itself (via IRSA), has the following policy attached:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elastic-inference:Connect",
        "iam:List*",
        "iam:Get*",
        "ec2:Describe*",
        "ec2:Get*",
        "ec2:ModifyInstanceAttribute"
      ],
      "Resource": "*"
    }
  ]
}
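The policy reaches the Pod through IRSA via the Service Account annotation, roughly like this (a minimal sketch; the account ID, role name, namespace and Service Account name are placeholders):

{
  "apiVersion": "v1",
  "kind": "ServiceAccount",
  "metadata": {
    "name": "tf-inference-sa",
    "namespace": "default",
    "annotations": {
      "eks.amazonaws.com/role-arn": "arn:aws:iam::<account-id>:role/<eia-inference-role>"
    }
  }
}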
I have also created a VPC Endpoint for Elastic Inference as described by AWS and bound it to the private subnets used by the EKS nodes, along with a properly configured Security Group that allows SSH, HTTPS, and TCP ports 8500/8501 from any worker node in the VPC CIDR.
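For what it's worth, the Security Group ingress rules are shaped roughly like this (a minimal sketch in the IpPermissions JSON format accepted by the EC2 API; the 10.0.0.0/16 VPC CIDR is a placeholder):

[
  { "IpProtocol": "tcp", "FromPort": 22,   "ToPort": 22,   "IpRanges": [ { "CidrIp": "10.0.0.0/16" } ] },
  { "IpProtocol": "tcp", "FromPort": 443,  "ToPort": 443,  "IpRanges": [ { "CidrIp": "10.0.0.0/16" } ] },
  { "IpProtocol": "tcp", "FromPort": 8500, "ToPort": 8501, "IpRanges": [ { "CidrIp": "10.0.0.0/16" } ] }
]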
According to both the AWS Reachability Analyzer and the IAM Policy Simulator, nothing seems wrong: networking and permissions look fine, and the EISetupValidator.py script provided by AWS reports the same.
Any clue as to what's actually happening here? Am I missing some kind of permission or networking setup?