Hey serverfault people,
Imagine a generic "Software as a Service" company offering a service running on AWS (hey, that's us). There is no rocket science involved: a standard web application doing its thing as usual, plus an end-user smartphone app. As our customers are from Europe, the AWS eu-central-1 region naturally contains everything for multiple tenants.
Now Sales manages to win a customer from Australia. All good so far, as the web application can already handle different timezones, currencies and locales. But: Australia is about as far away from Europe as you can get (at least on Earth), so quite some round trip time is now involved. Per request we see roughly 300ms - 400ms extra per direction (EDIT: this is wrong when speaking about RTT, as pointed out in the comments; we see 2x400ms = 800ms extra for the first HTTPS request).
For the mentioned web application, which the customer uses for management purposes, it's totally fine. The rendered HTML arrives a bit later, but thanks to the CDN (CloudFront), assets are not an issue.
But the end-user smartphone application, which makes smaller but more frequent JSON requests, is affected. There it feels borderline "OK-ish" but definitely not snappy.
Now the question is: how to improve the timings from an end-user perspective?
We already thought about a few options here:
1. Clone the complete software and host it in AWS ap-southeast-2 as well
Benefit: awesome performance, easy to set up, CI/CD would allow deploying the same code simultaneously in EU and AU.
Drawbacks: we have to maintain and pay for two identical infrastructure sets, data cannot be shared easily, lots of duplication in every respect.
2. Move only computation instances to AWS ap-southeast-2
Nope, will not work, as database or Redis queries would be affected by the round trip time even more.
3. Have a read-only replica in AWS ap-southeast-2 and do writes in eu-central-1
Better than option 2, but adds a lot of complexity in the code, plus the number of writes is usually not that small.
4. Spin up a load balancer in AWS ap-southeast-2 and peer the VPCs
Idea: users connect to the AU endpoint and traffic goes via a beefy connection to the EU instances. However, this would obviously not reduce the distance and we are unsure about the potential improvement (if any).
Has anybody experienced a similar issue and is willing to share some insights?
Update: it seems only the first HTTPS request is very slow.
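A rough way to reason about why only the first request hurts: a cold HTTPS request pays for the TCP handshake and the TLS handshake before the actual request goes out, while a request on an already open connection pays one round trip only. The sketch below is a back-of-the-envelope model, not a measurement. The function name, the RTT of 0.29s and the two round trips for the TLS handshake are assumptions for illustration (two fits a full TLS 1.2 handshake; TLS 1.3 needs one), and DNS and server processing time are ignored.

```python
def first_byte_time(rtt_s: float, tls_round_trips: int = 2,
                    reuse_connection: bool = False) -> float:
    """Rough time to first byte in seconds, ignoring DNS and server time."""
    if reuse_connection:
        # Connection is already open: only the request/response round trip remains.
        return rtt_s
    tcp_handshake = rtt_s                    # SYN / SYN-ACK before data can flow
    tls_handshake = tls_round_trips * rtt_s  # assumed number of handshake round trips
    request = rtt_s                          # request out, first response byte back
    return tcp_handshake + tls_handshake + request


rtt = 0.29  # assumed AU <-> EU round trip time in seconds
print(f"cold connection:   {first_byte_time(rtt):.2f}s")
print(f"reused connection: {first_byte_time(rtt, reuse_connection=True):.2f}s")
```

Under these assumptions a cold request costs about four round trips and a reused connection about one, so keeping connections alive in the smartphone app should matter more than anything else on this list.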
While digging into AWS Load Balancer options, I also noticed that AWS Global Accelerator might help, so we did some tests.
From local system (in EU):
curl -w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}\n" -o /dev/null -s "https://saas.example.com/ping" "https://saas.example.com/ping"
dns_resolution: 0,019074, tcp_established: 0,041330, ssl_handshake_done: 0,081763, TTFB: 0,103270
dns_resolution: 0,000071, tcp_established: 0,000075, ssl_handshake_done: 0,000075, TTFB: 0,017285
From AU (EC2 instance):
curl -w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}\n" -o /dev/null -s "https://saas.example.com/ping" "https://saas.example.com/ping"
dns_resolution: 0,004180, tcp_established: 0,288959, ssl_handshake_done: 0,867298, TTFB: 1,161823
dns_resolution: 0,000030, tcp_established: 0,000032, ssl_handshake_done: 0,000033, TTFB: 0,296621
From AU to AWS Global Accelerator (EC2 instance):
curl -w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}\n" -o /dev/null -s "https://saas-with-global-accelerator.example.com/ping" "https://saas-with-global-accelerator.example.com/ping"
dns_resolution: 0,004176, tcp_established: 0,004913, ssl_handshake_done: 0,869347, TTFB: 1,163484
dns_resolution: 0,000025, tcp_established: 0,000027, ssl_handshake_done: 0,000028, TTFB: 0,294524
In a nutshell: the TLS handshake seems to cause the biggest initial latency. If the connection can be reused, however, the extra time from AU to EU seems to be really "just" ~277ms (0,294524s - 0,017285s) for Time To First Byte.
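Since the curl timers are cumulative (each value is measured from the start of the transfer), the time spent in each phase is the difference between neighbouring values. Here is a small sketch that does this for one of the output lines above. The helper names `parse_timings` and `phase_durations` are made up for this example, and the parser assumes the exact `-w` format used in this post, including the decimal commas from our locale.

```python
def parse_timings(line: str) -> dict[str, float]:
    """Parse one 'name: 0,123, name: 0,456' line of the curl -w output above."""
    timings = {}
    # Fields are separated by ", " while the decimal comma has no space after it.
    for field in line.strip().split(", "):
        name, value = field.split(": ")
        timings[name] = float(value.replace(",", "."))
    return timings


def phase_durations(t: dict[str, float]) -> dict[str, float]:
    """Turn curl's cumulative timers into per-phase durations (seconds)."""
    return {
        "dns": t["dns_resolution"],
        "tcp_handshake": t["tcp_established"] - t["dns_resolution"],
        "tls_handshake": t["ssl_handshake_done"] - t["tcp_established"],
        "request": t["TTFB"] - t["ssl_handshake_done"],
    }


# First (cold) request from the AU EC2 instance, copied from the output above.
au_first = ("dns_resolution: 0,004180, tcp_established: 0,288959, "
            "ssl_handshake_done: 0,867298, TTFB: 1,161823")

for phase, seconds in phase_durations(parse_timings(au_first)).items():
    print(f"{phase}: {seconds * 1000:.0f} ms")
```

For that line this gives roughly 285 ms for the TCP handshake, 578 ms for the TLS handshake and 295 ms for the request itself, which would fit one, two and one round trips of just under 300 ms.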
Greetings!