Wednesday, April 07, 2010

Thinking too hard for the solution

At work we've been rolling out Nagios agents on our servers. It's simple enough on Linux and Windows, but we ran into trouble with a few servers running in the DMZ, an isolated network that requires greater security. The Nagios ports were not working there; the only port that worked was 22, the standard SSH port.

Trying to figure out the problem, I looked through the system, thinking it might be a local firewall rule. I stopped iptables, which changed nothing. It wasn't until I used Nmap to scan the servers that I realized all of the ports were closed except 22.

At this point I asked the network engineer, and he confirmed that yes, additional ports were being blocked on the DMZ network.
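The kind of check that finally exposed the problem can be sketched in a few lines of Python. This is only a rough stand-in for an Nmap scan, not what I actually ran: the function name `check_ports` is made up for illustration, and which ports you test depends on your own Nagios setup (5666 is the usual NRPE default, but that is an assumption here).

```python
import socket

def check_ports(host, ports, timeout=2.0):
    """Try a TCP connection to each port and report what happened.

    Hypothetical helper for illustration; a real scanner such as Nmap
    distinguishes more states than this.
    """
    results = {}
    for port in ports:
        try:
            # A completed TCP handshake means something is listening.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = "open"
        except ConnectionRefusedError:
            # The host answered, but nothing is listening on this port.
            results[port] = "closed"
        except OSError:
            # Timeouts, unreachable hosts, and DNS failures land here.
            # A silent timeout often points to a firewall dropping packets.
            results[port] = "filtered or unreachable"
    return results
```

Calling something like `check_ports("dmz-server.example.com", [22, 5666])` (a placeholder hostname) from a machine outside the DMZ would show quickly whether only SSH is getting through, before you spend time on the server's own firewall.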

This brings up a very important point: how to troubleshoot and gather information.

First, when dealing with any new problem, always understand your environment. I can't stress this enough. I once helped a friend over the phone with a network problem at a large grade school where she was deploying a network device. The problem was that she couldn't get terminal access to the device she had installed on the network. So we ran through the steps: the IP address she was given to use, ping commands, etc.

The problem turned out to be that the school administrator had given her a duplicate IP address, and while she could ping the address (which was not her device), she could not terminal into the machine. She assumed that since it was the IP address given to her, it must be working, but in the end it was not. It's important to always double-check and know the environment.

This means not only questioning the information you are given but also knowing what to check when something is wrong. Another example is a problem I experienced once while changing a server's IP address. I followed the company standards for moving a Windows server's IP address from one subnet to another, and I worked closely with our network engineer, who had access to the switch I was connecting to.

After making the IP address change, I could not get access to the network. I saw that I had link lights on the server, but the server could not access the network or reply to a ping. I checked all of the cables and even plugged into another port on the patch panel. Nothing. I asked the network engineer three times if the network ports had been changed to the right subnet, and each time he checked and confirmed. Finally I asked my manager, who still had switch access, if there was some problem I couldn't find. He logged into the switch and found it was set to the wrong subnet.

The problem in this case was that I assumed the network engineer, who said he had checked three times, had actually checked his work. It was much harder to verify his work since I did not have switch access, but because I had checked every possible connection and problem on the server, I could confidently say it was not an OS issue. This is important for knowing, when a problem appears, whose side should fix it. Often it's going to be a battle back and forth over where the problem lies.

In larger companies resolving issues becomes difficult; sometimes the department you work with on a project may be halfway across the world. In my work, many of my co-workers are not local and I have limited access to remote servers. It's difficult, but I still use the same skills to tell whether a problem is on our side or theirs: I inspect the environment and then apply the same knowledge to figure out where the problem is.

In any job you need to know how to fix something, whether it's a broken computer or an issue on a project. They are very different but require similar skills: enough knowledge of the environment to make a decision.
