I have an AWS ECS service with 2 REPLICA tasks in it.
When I manually stop one of the tasks, the replacement is created almost exactly 5 minutes later.
A similar 5-minute delay happens when one or more tasks are stopped due to a failure.
Here is the CloudFormation definition of my service:
ServiceFrontend:
  Type: AWS::ECS::Service
  DependsOn:
    - LoadBalancerRule
  Properties:
    ServiceName: "my-service-frontend"
    Cluster:
      Fn::ImportValue: !Sub "${ProjectName}:${EnvType}:ClusterName"
    DeploymentConfiguration:
      MaximumPercent: 100
      MinimumHealthyPercent: 0
    DesiredCount: 2
    TaskDefinition: !GetAtt FrontendTaskStack.Outputs.TaskDefinition
    HealthCheckGracePeriodSeconds: 600
    ServiceRegistries:
      - RegistryArn: !GetAtt 'DiscoveryService.Arn'
        ContainerName: !Sub "${ServiceName}-frontend"
        ContainerPort: !Ref 'FrontendContainerPort'
    LoadBalancers:
      - ContainerName: !Sub "${ServiceName}-frontend"
        ContainerPort: !Ref 'FrontendContainerPort'
        TargetGroupArn: !Ref 'TargetGroup'
    PlacementStrategies:
      - Field: 'memory'
        Type: 'binpack'
      - Field: 'cpu'
        Type: 'binpack'
My question is: what defines this timeout? Can I control it?
Or where can I get more insight into what is happening during those 5 minutes? The ECS service events only show that the old task is deregistered and that a new one is registered 5 minutes later, with nothing in between.
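For anyone who wants to reproduce the measurement, here is a minimal sketch of how the gap between service events could be computed. It assumes boto3 is installed and AWS credentials are configured; the helper names `event_gaps` and `fetch_service_events` are made up for illustration, and this is not necessarily how the timing above was obtained.

```python
from datetime import datetime
from typing import TypedDict


class ServiceEvent(TypedDict):
    """The two fields of an ECS service event that this sketch relies on."""
    createdAt: datetime
    message: str


def event_gaps(events: list[ServiceEvent]) -> list[tuple[float, str]]:
    """Return (seconds since the previous event, message) pairs, oldest first."""
    ordered = sorted(events, key=lambda e: e["createdAt"])
    gaps = []
    previous = None
    for event in ordered:
        # The oldest event has no predecessor, so its gap is reported as 0.
        delta = 0.0 if previous is None else (event["createdAt"] - previous).total_seconds()
        gaps.append((delta, event["message"]))
        previous = event["createdAt"]
    return gaps


def fetch_service_events(cluster: str, service: str) -> list[ServiceEvent]:
    """Fetch the recent events of one service (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so event_gaps can be used without boto3

    ecs = boto3.client("ecs")
    response = ecs.describe_services(cluster=cluster, services=[service])
    return response["services"][0]["events"]
```

Calling `event_gaps(fetch_service_events("my-cluster", "my-service-frontend"))` (the cluster name is a placeholder) would list each event with the time elapsed since the one before it, which makes the silent 5-minute window easy to spot.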
If I update the same service and increase the desired task count, it starts provisioning new tasks in less than 30 seconds. How can I get the same recovery time when one of the tasks has stopped for some reason?
Googling and going through the ECS docs hasn't turned up an answer.
For context: this service has no autoscaling at the service level; the cluster's capacity provider has auto-scaling configured. I don't think that is relevant here, though, because the capacity provider doesn't change the capacity during this window.