Score:0

Oracle 19c database session unstable with Azure Load Balancer

jp flag

We are trying to implement a HA setup with an Oracle database in Azure, somewhat as described in DBAKevlar's white paper and Oracle HA in Azure - Options. Our setup includes an Oracle Data Guard pair (primary and standby database) in 2 availability zones, an observer in a 3rd zone, and an Azure Load Balancer. The Oracle database pair is in FSFO mode, so that it will automatically fail over if the primary database should fail. The Azure LB monitors the oracle listener to determine where the primary database is. SQL Net clients connect to the Azure LB and then through to the primary database.

With either an Oracle 19c database and Oracle 11g clients or an Oracle 11g database and Oracle 19c clients this works perfectly, connections are able to be made and stay open as long as the client wants.

The problem starts when both the client and the database are 19c: the connection can still be made, and an initial query is executed normally, but after a few minutes the database session logs off even though the client undertakes no such action and the connection to the Azure LB stays up. As the client does not know that the connection has logged off, the next query causes sqlnet to wait for answer until a timeout occurs (15 minutes) after which it errors.

If we take the load balancer out, the connection of 19c client to 19c database is stable and does not spontaneously log off. This shows that the issue is with the Azure Load Balancer.

To make sure that DataGuard/FSFO was not the issue, we tested the same scenarios with just one stand-alone database and an Azure Load Balancer. No Data Guard, no FSFO and the database session still disappeared while the client connection to the LB stayed open, as seen from the client.

We do not have the ability to monitor sessions on the Azure LB itself, so we cannot know if it actively closes connections.

I am unable to determine why the logoff happens.. Why does it not occur if one side is 11g and the other is 19C? Does the 19c Oracle Net driver interact with the Azure Load Balancer? Does the Azure LB detect the Oracle 19c combo and act differently? Any hints would be appreciated.

Score:1
cn flag

The Azure LB monitors the oracle listener to determine where the primary database is. SQL Net clients connect to the Azure LB and then through to the primary database.

Why not use Oracle's Transparent Network Substrate (TNS) to tell the Clients about both Primary and Standby databases and let them deal with the hassle of figuring out which one they need to talk to? That way, you're using all Oracle "stuff", all the way through.

This is the sort of thing you'd use on a client machine.

(DESCRIPTION =
  (ADDRESS_LIST =
    (FAILOVER=YES)
    (LOAD_BALANCE=NO)
    (ADDRESS = (PROTOCOL=TCP)(PORT=????)(HOST=Primary-Server))
    (ADDRESS = (PROTOCOL=TCP)(PORT=????)(HOST=Standby-Server))
  )
  (CONNECT_DATA =
    (SERVICE_NAME=Service-Name)
    (SERVER=DEDICATED)
    (FAILOVER_MODE = (TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=3))
  )
)

Both servers are shown and, because it's set up to "FAILOVER", all connections are made to the Primary Server ... unless that dies, in which case the client will try the Standby instead. If you know the Primary will be down for some time, you can reverse the entries.

The "trick" is the Service-Name, which is only offered by the Primary database at any time. (Make sure you don't make this Service Name the same as either database name!)

Tony avatar
jp flag
Yes, that would be an alternative, as is TAF. However, both require the use of Oracle TNS, and not all clients have that. The solution as described was proposed by external consultants as a catch-all solution. It has the added benefit that the configuration is in 1 place (the LB), rather than on every client.
Phill  W. avatar
cn flag
TNS Naming files don't have to be /local/ to each machine; a "stub" file can be used to "include" the "real" file from a file share (permissions .. er .. permitting). Also, TNS settings can, supposedly, be "published" via your Active Directory.
Tony avatar
jp flag
True, the stub file is not even needed - you can set TNS_ADMIN to point to the central file. The biggest issues are 1) political and 2) apps that use jdbc or .Net drivers which do not use tnsnames.ora.
Score:1
tr flag

Yes, that would be an alternative, as is TAF. However, both require the use of Oracle TNS, and not all clients have that. The solution as described was proposed by external consultants as a catch-all solution. It has the added benefit that the configuration is in 1 place (the LB), rather than on every client.

If you have Oracle Enterprise Edition you could use Oracle Connection Manager. Connection Manager is a Oracle Client installation with the Connection Manager service wich acts as a proxy for database connections. Databases register their services at the connection manager (remote_listener parameter) and the connection manager is the target for your external clients. No need for complex TNS connect strings. Connection Manager is included with Enterprise Edition.

Score:0
jp flag

Just to bring this to a close, we never found a solution to the problem. In the end we decided that the Azure LB solution was not viable and abandoned the idea.

During testing we also found that FSFO, whilst reliable in itself, was too fragile for our needs. Specifically, when the failed over database itself fails during reinstating the former primary, that database is lost and an outage occurs. By chance we also found that a failover in a multi-standby environment can invalidate all remaining standbys, leaving the new primary database as the only instance. This is too great a risk for our business and we abandoned the idea of using FSFO as well.

Our current direction is to take our cluster out of Azure and reinstate it as a RAC database either on Exadata in the Cloud or on owned hardware in a DC close to the Azure DC.

Score:0
cn flag

Wondering if you have managed to solve your issue? We just in the process of deploying an active/passive cluster in Azure using LB as the frontend and would like to avoid such issues when we go into production.

Tony avatar
jp flag
No, we never found a solution. At this moment we've decided for this and other reasons to move our database out of Azure and back to a RAC setup.
I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.