This is not really a Win32 Perl blog entry; it describes a problem my team recently ran into. I am posting it here in the hope that someone else hitting the same problem finds it useful and (hopefully) saves some troubleshooting time.

Recently I ran into quite a system administration emergency: our Active Directory network started acting flaky. Oddly, our servers were unable to talk to each other, and client machines could not get onto the network when they booted. Every attempt to access a resource blocked for an ungodly long time and then eventually failed.

Applications and services that normally execute flawlessly were suddenly cast into a bottomless pit of error messages such as:

There are currently no logon servers available to service the logon request

and

The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you.

The big question, of course, was why no logon servers were available and why the system suspected an attempt to compromise security. Needless to say, it took my team several hours to locate the root cause. Finding the cause is one thing, though; fixing it and understanding how the problem occurred is another.

The cause was that the built-in firewall on our domain controller was blocking incoming DNS requests. Every domain machine needs to locate the domain controller (aka the Active Directory server) to know where to send authentication requests, and it finds the domain controller by querying DNS, a service that in our case runs on the domain controller itself. Since this server is where all domain secrets are kept (account information, credentials, certificate keys, machine associations, etc.), all domain machines need to be able to chat with it from time to time.

For some reason (we are still tracking that down), the domain controller began blocking incoming traffic on port 53, the DNS service port. Once that happened, every machine that asked the DNS service to look up the domain controller's SRV record got no answer. With that lookup failing, a machine has no choice but to conclude that there are no domain controllers available. This is just the sad truth.
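If you suspect the same failure, a quick way to see it from a client's point of view is to send the kind of lookup a domain member makes. The following is a minimal Python sketch, not the tool my team used: it builds a raw DNS query for `_ldap._tcp.dc._msdcs.<domain>` (one of the SRV names domain controllers register) and sends it to a DNS server on UDP port 53. The function names, the 3-second default timeout, and the result strings are my own illustrative choices. Real Windows clients go through the system resolver and can also retry over TCP, which this sketch does not attempt.

```python
import random
import socket
import struct

SRV_QTYPE = 33  # DNS record type code for SRV (RFC 2782)
IN_CLASS = 1    # Internet class


def dc_srv_name(domain: str) -> str:
    """SRV owner name under which domain controllers advertise LDAP."""
    return f"_ldap._tcp.dc._msdcs.{domain}"


def build_srv_query(name: str, query_id: int) -> bytes:
    """Encode a single-question DNS query for an SRV record."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", SRV_QTYPE, IN_CLASS)


def probe_dns(server: str, domain: str, timeout: float = 3.0) -> str:
    """Ask `server` for the domain controller SRV record over UDP port 53."""
    query_id = random.randint(0, 0xFFFF)
    packet = build_srv_query(dc_srv_name(domain), query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            # The symptom we saw: the request simply goes unanswered.
            return "timeout: no reply on UDP port 53 (blocked or DNS down)"
        except OSError as exc:
            return f"error: {exc}"
    if len(reply) < 12:
        return "malformed reply"
    reply_id, flags, _qdcount, ancount = struct.unpack(">HHHH", reply[:8])
    if reply_id != query_id:
        return "reply ID mismatch"
    rcode = flags & 0x000F  # low 4 bits of the flags field
    if rcode != 0:
        return f"DNS error, RCODE={rcode}"
    return f"ok: {ancount} SRV record(s) returned"
```

Usage would look like `print(probe_dns("10.0.0.10", "corp.example.com"))`, where the server address and domain name are placeholders for your own DNS server and Active Directory domain. A timeout here while other hosts answer normally points at something (such as a firewall rule) dropping traffic to port 53, whereas an RCODE error means the DNS service answered but could not resolve the name.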

To fix this, my team unblocked the firewall port and rebooted our server farm, and from that point everything was good. Now we just have to determine how this happened so suddenly. A post mortem will hopefully tell us, along with how to prevent it in the future.