As far as I know, for a L3 router to join a network at all, the joining entity has to talk with the DHCP server to get an IP address. DHCP discover messages are broadcast with MAC FF:FF:FF:FF:FF:FF, and to the whole subnet.
these messages are broadcasted in a single layer 2 domain. broadcast does not cross layer 2 boundaries.
also, DHCP is not a common case for routers. Many times, routers need to be pre-configured with IP addresses, and automatic address assignment is not happening at all. I would assume that in ISP WANs this is actually the case, not the DHCP part.
If you are talking about a home router that connects to ISP network, then DHCP is processed on the other side of the link, and is not broadcasted further.
And so, if the router is newly connected to a WAN with thousands,
WAN with that number of routers is not a single layer 2 domain. Also router is layer 3 device. The broadcast cannot go past the layer 2 boundary, that is it stops at the next router.
I would assume that this router shold know where to route DHCP messages or is capable of answering them himself. I however do not know.
And, facing the same direction, I can apply the same argument to ARPs. ARP messages are broadcast as well around the network, just like the DHCP discover, and so the same set of problems would arise.
ARP is also broadcasted in the same layer 2 domain. ARP actually does not make sence across layer 2 boundary, since layer 2 communication is only used between two neighbor devices attached to the same layer 2 domain.
How to scale layer 2 is also a valid question, but it is a separate question. For your scenario it is safe to assume that the number of devices in a single layer 2 domain is small.
I can probably apply the same argument to messages used by the Network layer to coordinate its routers with distance-vector algorithm, unless the routers are somehow organized in a tree or graph-like manner, but I digress.
It kinda does. Routing protocols have mechanisms that are used to keep the number of network layer messages small enough. These mechanisms have their limitations, which renders routing protocols not applicable in some scenarios. Since the functionality and mechanisms are different, this requires discussing each protocol separately.
Organizing routers in a tree is pretty bad for the redundancy standpoint. If a link fails, there is no place in topology to reroute. (Here, tree refers to the actual physical topology. (R)STP reduces topology to a tree, but if links fail, new tree can be calculated.)
And every network can be modelled as a graph, since graphs can have any number of nodes, and any number of links, and be arbitrary connected.