
How do I find the network bandwidth bottleneck between my local machine and an Azure VM, and how can I improve it?


I created an Azure VM with SKU Standard_D1_v2 in Southeast Asia. According to this doc, the expected bandwidth is 750 Mbps.

However, when I test the connection bandwidth from my local machine to the VM, the result is around 3 Mbps. The test tool is iPerf3, with the VM as the iPerf3 server and my local machine as the iPerf3 client. There was no other heavy network workload on the VM during the test. I also ran a test with NTTTCP; this was the result.
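One way to make the iPerf3 test more informative is to compare a single stream against several parallel streams (`-P`), and to test both directions (`-R`). The sketch below assumes iperf3 is installed locally, `iperf3 -s` is running on the VM, and the VM's host name is your own; `looks_per_flow_limited` and its 50% threshold are a hypothetical rule of thumb, not part of iperf3. It reads the receiver-side total from iperf3's JSON report (`-J`).

```python
import json
import subprocess


def iperf3_cmd(server: str, streams: int = 1, seconds: int = 10, reverse: bool = False) -> list[str]:
    """Build an iperf3 client command line that prints a JSON report (-J)."""
    cmd = ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams), "-J"]
    if reverse:
        cmd.append("-R")  # reverse mode: the server sends, the client receives
    return cmd


def received_mbps(report_json: str) -> float:
    """Receiver-side throughput summed over all streams, in Mbit/s."""
    report = json.loads(report_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def measure(server: str, streams: int) -> float:
    """Run one iperf3 test (needs iperf3 installed locally and `iperf3 -s` on the VM)."""
    result = subprocess.run(iperf3_cmd(server, streams), capture_output=True, text=True, check=True)
    return received_mbps(result.stdout)


def looks_per_flow_limited(single_mbps: float, multi_mbps: float, streams: int) -> bool:
    """Hypothetical heuristic: if N parallel streams reach at least half of N times the
    single-stream rate, the limit is probably per connection (window/latency),
    not the capacity of the path itself."""
    return multi_mbps >= 0.5 * streams * single_mbps
```

For example, `measure("myvm.example.com", 1)` and `measure("myvm.example.com", 8)` (hypothetical host name) give the two numbers to feed into `looks_per_flow_limited`. If eight streams scale up nearly eightfold, per-connection tuning is worth a look; if they stay near 3 Mbps in total, the path itself is the more likely limit.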

I understand that real bandwidth will be less than expected for several reasons such as:

  • The connection crosses regions, with many hops.
  • VMs on shared infrastructure share a total bandwidth limit (link).
  • The VM is already deployed in the lowest-latency region (per Azure Speed Test).

But the measured bandwidth (3 Mbps) is far below the expected 750 Mbps. So how can I:

  1. troubleshoot the root cause? Is it a misconfiguration in the VM, too many hops in the cross-region connection, or a throttle somewhere in the infrastructure?
  2. improve the bandwidth between my local machine and the remote VM?
Why would you report the throughput of 3 Mbps and not the latency, which is presumably awful?

Welcome to the forum.

The problem here is that you do not control the points in between. Finding the bottleneck at this granularity is part science, part art: there is no single good answer, more a report on findings across many factors.

Locating a bottleneck properly would mean being able to test every point in between, in both directions. You will never have access to all of them, but you can infer some things and look from many angles to get a better picture.

One thing you can eliminate is the question "Do you have adequate throughput on your side?" You can answer that if you have a known site with dedicated bandwidth to test against, and a known good route in between that can sustain your throughput. Online speed tests can be deceiving: you may have an adequate connection, and THEY may have adequate bandwidth to run tests reliably, but neither of you controls what happens between you and them. IF you consistently get good results against multiple test sites, you can relatively safely rule out a problem on your side, as far as you control it. The flip side, of course, is that poor results against all of them may still not be your fault: it could be your ISP's ISP, someone in a core network somewhere, one route going down and congesting another that stays up, etc. In that case you can ask your ISP to get involved, but on a consumer line they will throw "speeds up to" at you; on a dedicated business line you have an argument, but the burden of proof is still on you.

Though it is not impossible, the choke point is unlikely to be on the Azure side, unless your bandwidth is limited contractually.

You can use tools like MTR (https://en.wikipedia.org/wiki/MTR_(software)) to get a rough overview, but bear in mind that dropped ICMP can be a result of normal operation: most networking equipment will discard ICMP under stress, and some devices are simply configured not to respond by default. So although this may give you more to go on, it is not a smoking gun; you have to understand how to read and interpret it.
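The usual way to read MTR output with that caveat in mind is to check whether loss at a hop carries forward to the later hops. If later hops show less loss, packets were forwarded fine and that router is most likely just deprioritising its own ICMP replies. The minimal sketch below assumes you copy (host, Loss%) pairs by hand from an `mtr --report` run; the host names and the 1% tolerance are hypothetical.

```python
def classify_hop_loss(hops: list[tuple[str, float]], tolerance: float = 1.0) -> list[str]:
    """Annotate hops that show packet loss.

    hops: (host, loss_percent) pairs in path order, e.g. copied by hand
    from the Loss% column of an `mtr --report` run.
    """
    notes = []
    for i, (host, loss) in enumerate(hops):
        if loss <= tolerance:
            continue  # treat small loss as noise
        downstream = [later for _, later in hops[i + 1:]]
        if downstream and min(downstream) + tolerance < loss:
            # A later hop loses less, so packets were forwarded fine;
            # this router is most likely just rate-limiting its own ICMP replies.
            notes.append(f"{host}: {loss:.0f}% loss not carried forward (likely ICMP rate limiting)")
        else:
            notes.append(f"{host}: {loss:.0f}% loss persists to later hops (possible real loss)")
    return notes
```

With `[("192.168.1.1", 0.0), ("isp-gw", 0.0), ("core-1", 40.0), ("core-2", 0.0), ("vm", 0.0)]`, only `core-1` is flagged, and as likely ICMP rate limiting, because the hops after it show no loss.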

You can look at sites like https://www.thousandeyes.com/outages/, where major outages are recorded by sensor networks worldwide. That can sometimes give you a clue, especially if something in the AS network (in simple terms, the core nodes of the internet) is having issues that directly affect your route.

Thousand Eyes

To interpret that and determine what "your" route is, you can start with Hurricane Electric's BGP tools at https://bgp.he.net/, which show effectively where you are on the internet in a routing sense. IF you click the IP shown (the address you go through to reach the internet, commonly known as your public IP), from there you will see something like this:

HE BGP

This shows who you are to the internet (Visiting from), what network you are coming from (announced as), and how that enters the "internet" (Your ISP).

That can be traced from ASN (Autonomous System Number) to ASN to see the actual route you take (at that point in time, since routes can shift without notice to keep traffic flowing), or you can even view it as a graph.
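Tracing "from ASN to ASN" amounts to collapsing a hop-by-hop traceroute into the sequence of networks it crosses. The sketch below assumes you have already looked up the origin ASN of each hop's IP (for example on bgp.he.net) and marked private or unresolved hops as `None`. In the example, 64500 and 64501 are documentation-range ASNs (RFC 5398) standing in for your ISP and a transit provider, and 8075 is Microsoft's ASN; the hop data is made up.

```python
from typing import Optional


def as_path(hop_asns: list[Optional[int]]) -> list[int]:
    """Collapse per-hop origin ASNs (in traceroute order) into the AS-level path.

    None marks hops with no public ASN (private addresses, timeouts).
    Consecutive hops inside the same network collapse into one entry.
    """
    path: list[int] = []
    for asn in hop_asns:
        if asn is None:
            continue  # skip hops we could not attribute to a network
        if not path or path[-1] != asn:
            path.append(asn)
    return path
```

For instance, `as_path([None, None, 64500, 64500, None, 64501, 8075, 8075])` yields `[64500, 64501, 8075]`: your ISP, one transit network, then Microsoft. Each entry in that list is a network you can look up on the outage and BGP reporting sites mentioned here.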

ASN Graph

Then you can compare against the Thousand Eyes chart, or any of a multitude of other BGP reporting tools online, to see whether those routes have known congestion or are flapping (going up and down), etc.

That will generally give you enough information to notify whoever is in charge of which system (be advised that most will not care beyond your ISP). Though that may not get you where you want to be, it explains why you are not there.

All in all, think of it like this: speed is not consistent across the internet. Some sections are connected via mind-bogglingly fast links, some not so much. And when one of the big ones goes down, many smaller ones suffer greatly.

So what can you do about it? Sometimes you can route around it if you are making peer-to-peer connections, but routing around it yourself will not ensure that your customers, who all travel the same routes, get the same experience. For instance, you could have a web server over there, and you may be able to get a solid, high-throughput connection to something like a VPN provider that exits closer to your remote end, forcing your traffic to take a better route. Users reaching your server without doing the same thing will not have the same experience. Some services can do this for everyone for a fee, effectively making your server appear somewhere other than where it really is and routing it over a premium path. Cloudflare may offer services like this; I have never used them, so I will let a Cloudflare expert chime in on that.

Hopefully that gives you enough to go on to understand how this can get tricky, and also that your experience and that of someone a state away could be entirely different.


How about doing some utterly obvious tests first, the kind of thing that makes me wonder whether you should be asking on Super User rather than on a site for professionals.

  • Test the bandwidth at both endpoints and across the path. Use a test from the internet; many of them run in the browser. Speedtest.net lets you select the counterpoint, so you can start with a local server, then use one on the other side (or at least close to it), then work your way through the whole path.

Essentially: unless you or the other side has a local issue (which is unlikely at speeds as comically low as yours), this is a routing issue and there is NOTHING you can do, except two things:

  • Talk to your ISP Support
  • Change your internet provider.

Routing is their domain. They may have buggered up a particular route, but if not, well, there is nothing you can do except look for another internet provider.

Your test as it stands is quite useless, because it only tests A to B, not whether there is a local problem on either side. With my approach you can vary the counterpoint and see, e.g., whether your local machine has problems reaching particular destinations. It may be that you have good local bandwidth, but international links are overloaded.
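The reasoning of varying the counterpoint can be written down as a small decision rule. This is a rough sketch, not a standard tool: it assumes you have typed in the download speed from several speed-test servers by hand, and the server names, the contracted line rate, and the "at least half the line rate counts as fine" threshold are all hypothetical.

```python
def localize(results_mbps: dict[str, float], line_rate_mbps: float, ok_fraction: float = 0.5) -> str:
    """Rough verdict from speed tests against several counterpoints.

    results_mbps: measured download speed per test server (hypothetical names).
    line_rate_mbps: the speed your contract says you should get.
    """
    if not results_mbps:
        raise ValueError("need at least one speed-test result")
    threshold = ok_fraction * line_rate_mbps
    slow = sorted(name for name, mbps in results_mbps.items() if mbps < threshold)
    if len(slow) == len(results_mbps):
        # Even nearby servers are slow: the problem is most likely on your side.
        return "every counterpoint is slow: suspect your local line or machine first"
    if not slow:
        return "every counterpoint is fine: retest the original path"
    # Some destinations are fine, so your line works; specific routes are the issue.
    return "local line looks fine; slow paths: " + ", ".join(slow)
```

For example, `localize({"local-city": 95.0, "singapore": 4.0, "tokyo": 80.0}, 100.0)` points at the Singapore route rather than your own connection.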

If that checks out, it is time to check the OS configuration. TCP window scaling is there for a reason: it raises the amount of data that can be "in flight" unacknowledged, which may be needed on paths with long ping times.
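The "in flight" point is the bandwidth-delay product: a single TCP connection can deliver at most one window of data per round trip. The sketch below works through that arithmetic. The 170 ms round-trip time is a hypothetical figure (the question does not give one; measure yours with ping), and 65,535 bytes is the largest window TCP can advertise without window scaling.

```python
import math


def max_tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for a single TCP connection: at most one window of data per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def window_needed_bytes(target_mbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: bytes that must be in flight to sustain target_mbps."""
    return math.ceil(target_mbps * 1e6 / 8 * rtt_ms / 1000)
```

With these assumptions, `max_tcp_throughput_mbps(65535, 170)` is about 3.08 Mbps, while `window_needed_bytes(750, 170)` is 15,937,500 bytes (roughly 16 MB) to fill 750 Mbps. That does not prove the window is the cause here, but it shows how a connection stuck at an unscaled window on a long path can land in the low single-digit Mbps range.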

Besides that there really is nothing you CAN do.

stanleyerror
You are right, Super User is a better place to ask. But thanks anyway for your answer, it helps.